Econometrics - Applied Robust Statistics to Regression Analysis
David J. Olive
Southern Illinois University
Department of Mathematics
Mailcode 4408
Carbondale, IL 62901-4408
[email protected]
July 6, 2005
Contents
Preface v
1 Introduction 1
1.1 Outlier....s . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.6 The Double Exponential Distribution . . . . . . . . . . . 77
3.7 The Exponential Distribution . . . . . . . . . . . . . . . . 78
3.8 The Two Parameter Exponential Distribution . . . . . 79
3.9 The Extreme Value Distribution . . . . . . . . . . . . . . 80
3.10 The Gamma Distribution . . . . . . . . . . . . . . . . . . 81
3.11 The Half Normal Distribution . . . . . . . . . . . . . . . 83
3.12 The Logistic Distribution . . . . . . . . . . . . . . . . . . 84
3.13 The Lognormal Distribution . . . . . . . . . . . . . . . . . 84
3.14 The Normal Distribution . . . . . . . . . . . . . . . . . . . 85
3.15 The Pareto Distribution . . . . . . . . . . . . . . . . . . . 87
3.16 The Poisson Distribution . . . . . . . . . . . . . . . . . . . 88
3.17 The Power Distribution . . . . . . . . . . . . . . . . . . . . 88
3.18 The Rayleigh Distribution . . . . . . . . . . . . . . . . . . 89
3.19 The Student’s t Distribution . . . . . . . . . . . . . . . . 89
3.20 The Truncated Extreme Value Distribution . . . . . . . 90
3.21 The Uniform Distribution . . . . . . . . . . . . . . . . . . 91
3.22 The Weibull Distribution . . . . . . . . . . . . . . . . . . . 91
3.23 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.24 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6 Regression Diagnostics 185
6.1 Numerical Diagnostics . . . . . . . . . . . . . . . . . . . . 185
6.2 Graphical Diagnostics . . . . . . . . . . . . . . . . . . . . . 188
6.3 Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . 192
6.4 A Simple Plot for Model Assessment . . . . . . . . . . . 195
6.5 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . 202
6.6 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
10.5 Breakdown . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
10.6 Algorithms for the MCD Estimator . . . . . . . . . . . . 296
10.7 Theory for CMCD Estimators . . . . . . . . . . . . . . . 298
10.8 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . 310
10.9 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
12 1D Regression 337
12.1 Estimating the Sufficient Predictor . . . . . . . . . . . . 340
12.2 Visualizing 1D Regression . . . . . . . . . . . . . . . . . . 346
12.3 Predictor Transformations . . . . . . . . . . . . . . . . . . 358
12.4 Variable Selection . . . . . . . . . . . . . . . . . . . . . . . 359
12.5 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
12.6 Complements . . . . . . . . . . . . . . . . . . . . . . . . . . 372
12.7 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
Preface
estimator are derived, but the examples and software use an inconsistent
“perfect classification” procedure. In this text, some practical estimators
that have good statistical properties are developed (see Theorems 8.7, 10.14
and 10.15), and some effort has been made to state whether the “perfect
classification” or “asymptotic” paradigm is being used.
The majority of the statistical procedures described in Hampel, Ronchetti,
Rousseeuw and Stahel (1986), Huber (1981), and Rousseeuw and Leroy
(1987) assume that outliers are present or that the true underlying error
distribution has heavier tails than the assumed model. However, these three
references and some of the papers in Stahel and Weisberg (1991a,b) and
Maddela and Rao (1997) do discuss other departures from the assumed
model. Other texts on distributional robustness include Atkinson and Riani
(2000), Atkinson, Riani and Cerioli (2004), Dell’Aquila and Ronchetti (2005),
Hettmansperger and McKean (1998), Hoaglin, Mosteller and Tukey (1983),
Insightful (2002), Jureckova and Sen (1996), Marazzi (1993), Maronna (2006),
Morgenthaler, Ronchetti, and Stahel (1993), Morgenthaler and Tukey (1991),
Müller (1997), Rey (1978), Rieder (1996), Shevlyakov and Vilchevski (2002),
Staudte and Sheather (1990) and Wilcox (2005). Diagnostics and outliers
are discussed in Atkinson (1985), Barnett and Lewis (1994), Belsley, Kuh,
and Welsch (1980), Chatterjee and Hadi (1988), Cook and Weisberg (1982),
Fox (1991), Hawkins (1980) and Iglewicz and Hoaglin (1993).
Several textbooks on statistical analysis and theory also discuss robust
methods. For example, see Dodge and Jureckova (2000), Gentle (2002),
Gnanadesikan (1997), Hamilton (1992), Seber and Lee (2003), Thode (2002)
and Wilcox (2001, 2003).
Besides distributional robustness, this book also considers regression graph-
ics procedures that are useful even when the 1D regression model is unknown
or misspecified. 1D regression and regression graphics procedures are de-
scribed in Cook and Weisberg (1999a), Cook (1998a) and Li (2000).
A unique feature of this text is the discussion of the interrelationships be-
tween distributionally robust procedures and regression graphics with focus
on 1D regression. A key assumption for regression graphics is that the predic-
tor distribution is approximately elliptically contoured. Ellipsoidal trimming
(based on robust estimators of multivariate location and dispersion) can be
used to induce this condition. An important regression graphics technique is
dimension reduction: assume that there are p predictors collected in a p × 1
vector x. Then attempt to reduce the dimension of the predictors from p to 1.
Background
This course assumes that the student has had considerable exposure to
statistics, but is at a much lower level than most texts on distributionally
robust statistics. Calculus and a course in linear algebra are essential. Fa-
miliarity with least squares regression is also assumed and could come from
econometrics or numerical linear algebra, eg Weisberg (2005), Datta (1995),
Golub and Van Loan (1989) or Judge, Griffiths, Hill, Lütkepohl and Lee
(1985). The matrix representation of the multiple linear regression model
should be familiar. An advanced course in statistical inference, especially
one that covered convergence in probability and distribution, is needed for
several sections of the text. Casella and Berger (2002), Poor (1988) and
White (1984) easily meet this requirement.
There are other courses that would be useful but are not required. An
advanced course in least squares theory or linear models can be met by Seber
and Lee (2003) in statistics, White (1984) in economics, and Porat (1993) in
electrical engineering. Knowledge of the multivariate normal distribution at
the level of Johnson and Wichern (1988) would be useful. A course in pattern
recognition, eg Duda, Hart and Stork (2000), also covers the multivariate
normal distribution.
If the students have had only one calculus based course in statistics (eg
DeGroot 1975 or Wackerly, Mendenhall and Scheaffer 2002), then cover Ch.
1, 2.1–2.5, 4.6, Ch. 5, Ch. 6, 7.6, part of 8.2, 9.2, 10.1, 10.2, 10.3, 10.6,
10.7, 11.1, 11.3, Ch. 12 and Ch. 13. (This will cover the most important
material in the text. Many of the remaining sections are for Ph.D. students
and experts in robust statistics.) Many of the Chapter 5 homework problems
were used in the author’s multiple linear regression course, and many Chapter
13 problems were used in the author’s categorical data analysis course.
Some of the applications in this text include
• Robust parameter estimation using the sample median and the sample
median absolute deviation is described on p. 35–37 and in Chapter 3.
• Section 6.3 shows how to use the forward response plot to detect outliers
and to assess the adequacy of the multiple linear regression model.
• Section 6.4 shows how to use the FY plot to detect outliers and to
assess the adequacy of very general regression models of the form y =
m(x) + e.
• Section 7.6 provides the resistant mbareg estimator for multiple linear
regression which is useful for teaching purposes.
• Section 8.2 shows how to modify the inconsistent zero breakdown estimators for LMS and LTS (such as lmsreg) so that the resulting modification is an easily computed √n consistent high breakdown estimator.
• Sections 10.6 and 10.7 provide the easily computed robust √n consistent HB covmba estimator for multivariate location and dispersion. It is also shown how to modify the inconsistent zero breakdown cov.mcd estimator so that the resulting modification is an easily computed √n consistent high breakdown estimator. Applications are numerous.
• Section 11.1 shows that the DD plot can be used to detect multivariate
outliers and as a diagnostic for whether the data is multivariate nor-
mal or from some other elliptically contoured distribution with second
moments.
• Section 11.3 suggests the resistant tvreg estimator for multiple linear
regression that can be modified to create a resistant weighted MLR
estimator if the weights wi are known.
> source("A:/rpack.txt")
will enter the functions into Splus. Creating a special workspace for the
functions may be useful.
Type ls(). About 40 R/Splus functions from rpack.txt should appear. In
R, enter the command q(). A window asking “Save workspace image?” will
appear. Click on No to remove the functions from the computer (clicking on
Yes saves the functions on R, but you have the functions on your disk).
Similarly, to download the text’s R/Splus data sets, save robdata.txt on a
disk and use the following command.
> source("A:/robdata.txt")
MCD, MVE, projection, repeated median and S estimators. Two stage esti-
mators that need an initial high breakdown estimator from the above list are
even less practical to compute. These estimators include the cross checking,
MM, one step GM, one step GR, REWLS, tau and t type estimators. Also,
although two stage estimators tend to inherit the breakdown value of the
initial estimator, their outlier resistance as measured by maximal bias tends
to decrease sharply. Typically the implementations for these estimators are
not given, impractical to compute, or result in a zero breakdown estimator
that is often inconsistent. The inconsistent zero breakdown implementations
and ad hoc procedures should usually only be used as diagnostics for outliers
and other model misspecifications, not for inference.
Many of the ideas in the HB literature are good, but the ideas were
premature for applications without a computational and theoretical break-
through. This text, Olive (2004a) and Olive and Hawkins (2006) provide this breakthrough and show that simple modifications to elemental basic resampling or concentration algorithms result in the easily computed HB √n
consistent CMCD estimator for multivariate location and dispersion (MLD)
and CLTS estimator for multiple linear regression (MLR). The Olive (2004a)
MBA estimator is a special case of the CMCD estimator and is much faster
than the inconsistent zero breakdown Rousseeuw and Van Driessen (1999)
FMCD estimator. The Olive (2005) resistant MLR estimators also have good
statistical properties. See Sections 7.6, 8.2, 10.7, 11.4, Olive (2004a, 2005),
Hawkins and Olive (2002) and Olive and Hawkins (2006).
As an illustration for how the CMCD estimator improves the ideas from
the HB literature, consider the He and Wang (1996) cross checking estima-
tor that uses the classical estimator if it is close to the robust estimator,
and uses the robust estimator otherwise. The resulting estimator is an HB
asymptotically efficient estimator if a consistent HB robust estimator is used.
He and Wang (1997) show that the all elemental subset approximation to S
estimators is a consistent HB MLD estimator that could be used in the cross
checking estimator, but then the resulting cross checking estimator is im-
practical to compute. If the (basic resampling MVE or) FMCD estimator
is used, then the cross checking estimator is practical to compute but has
zero breakdown since the FMCD and classical estimators both have zero
breakdown. Since the FMCD estimator is inconsistent and highly variable,
the probability that the FMCD estimator and classical estimator are close
does not go to one as n → ∞. Hence the cross checking estimator is also inconsistent. Using the HB √n consistent CMCD estimator results in an HB
Chapter 1
Introduction
Y ⫫ x | β^T x.    (1.1)
Yi = g(x_i^T β, ei )    (1.2)
Y = x^T β + e    (1.4)
t(y) = β^T x + e.    (1.6)
ρ(β^T x) = exp(β^T x)/(1 + exp(β^T x)).
ri = Yi − m(x_i^T β̂)
Z = t(Y ) = x^T β + e    (1.9)
1.1 Outlier....s
The main message of this book is that robust regression is extremely useful
in identifying outliers ....
Rousseeuw and Leroy (1987, p. vii)
Hampel, Ronchetti, Rousseeuw and Stahel (1986, p. 36) state that the first
and most important step in robustification is the rejection of distant outliers.
In the literature there are two important paradigms for robust procedures.
The perfect classification paradigm considers a fixed data set of n cases of
which 0 ≤ d < n/2 are outliers. The key assumption for this paradigm is
that the robust procedure perfectly classifies the cases into outlying and non-
outlying (or “clean”) cases. The outliers should never be blindly discarded.
Often the clean data and the outliers are analyzed separately.
The asymptotic paradigm uses an asymptotic distribution to approximate
the distribution of the estimator when the sample size n is large. An impor-
tant example is the central limit theorem (CLT): let Y1 , ..., Yn be iid with
mean µ and standard deviation σ; ie, the Yi ’s follow the location model
Y = µ + e.
Then
√n ( (1/n) Σ_{i=1}^n Yi − µ ) →D N(0, σ²).
1.2 Applications
One of the key ideas of this book is that the data should be examined with
several estimators. Often there are many procedures that will perform well
when the model assumptions hold, but no single method can dominate every
other method for every type of model violation. For example, OLS is best
for multiple linear regression when the iid errors are normal (Gaussian) while
L1 is best if the errors are double exponential. Resistant estimators may
outperform classical estimators when outliers are present but be far worse if
no outliers are present.
Portnoy and Mizera (1999) note that different multiple linear regression
estimators tend to estimate β in the iid constant variance symmetric error
model, but otherwise each estimator estimates a different parameter. Hence
a plot of the residuals or fits from different estimators should be useful for
detecting departures from this very important model. The “RR plot” is a
scatterplot matrix of the residuals from several regression fits. Tukey (1991)
notes that such a plot will be linear with slope one if the model assumptions
hold. Let the ith residual from the jth fit β̂ j be ri,j = Yi − xTi β̂ j where
the superscript T denotes the transpose of the vector and (Yi , xTi ) is the ith
observation. Then
ri,1 − ri,2 = x_i^T (β̂ 1 − β̂ 2 ) ≤ ‖xi ‖ (‖β̂ 1 − β‖ + ‖β̂ 2 − β‖).
The RR plot is simple to use since if β̂ 1 and β̂ 2 have good convergence
rates and if the predictors xi are bounded, then the residuals will cluster
tightly about the identity line (the unit slope line through the origin) as n
increases to ∞. For example, plot the least squares residuals versus the L1
residuals. Since OLS and L1 are consistent, the plot should be linear with
slope one when the regression assumptions hold, but the plot should not have
slope one if there are Y –outliers since L1 resists these outliers while OLS does
not. Making a scatterplot matrix of the residuals from OLS, L1 , and several
other estimators can be very informative.
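A minimal R sketch of an RR plot on simulated data is given below; it is not one of the rpack.txt functions, and the use of the MASS and quantreg packages for the resistant and L1 fits is an assumption made for illustration.
library(MASS)       # lqs() gives resistant LMS and LTS type fits
library(quantreg)   # rq() gives the L1 (least absolute deviations) fit
set.seed(4)
x <- matrix(rnorm(200), ncol = 2)
y <- drop(x %*% c(1, 2)) + rnorm(100)
res <- cbind(OLS  = residuals(lm(y ~ x)),
             L1   = residuals(rq(y ~ x)),
             ALMS = residuals(lqs(y ~ x, method = "lms")),
             ALTS = residuals(lqs(y ~ x, method = "lts")))
pairs(res)   # RR plot: panels should cluster about the identity line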
Example 1.1. Gladstone (1905–1906) attempts to estimate the weight
of the human brain (measured in grams after the death of the subject) using
simple linear regression with a variety of predictors including age in years,
height in inches, head height in mm, head length in mm, head breadth in mm,
head circumference in mm, and cephalic index (divide the breadth of the head
by its length and multiply by 100). The sex (coded as 0 for females and 1
for males) of each subject was also included. The variable cause was coded
as 1 if the cause of death was acute, as 3 if the cause of death was chronic,
and coded as 2 otherwise. A variable ageclass was coded as 0 if the age was
under 20, as 1 if the age was between 20 and 45, and as 3 if the age was
over 45. Head size is the product of the head length, head breadth, and head
height.
The data set contains 276 cases, and we decided to use multiple linear
regression to predict brain weight using the six head measurements height,
length, breadth, size, cephalic index and circumference as predictors. Cases
188 and 239 were deleted because of missing values. There are five infants
(cases 238, 263-266) of age less than 7 months that are x-outliers. Nine
toddlers were between 7 months and 3.5 years of age, four of whom appear
to be x-outliers (cases 241, 243, 267, and 269). Figure 1.1 shows an RR
plot comparing the OLS, L1 , ALMS and ALTS fits. ALMS is the default
version of the R/Splus function lmsreg while ALTS is the default version of
ltsreg, and these two resistant estimators are described further in Chapter
7. Attempts to reproduce Figure 1.1 (made in 1997 with an old version of
Splus) will fail due to changes in the R/Splus code. Note that Figure 1.1
suggests that three of the methods are producing approximately the same
fits while the ALMS estimator is fitting 9 of the 274 points in a different
manner. These 9 points correspond to five infants and four toddlers that are
x-outliers.
An obvious application of outlier resistant methods is the detection of
outliers. Generally robust and resistant methods can only detect certain
configurations of outliers, and the ability to detect outliers rapidly decreases
as the sample size n and the number of predictors p increase. When the
Gladstone data was first entered into the computer, the variable head length
was inadvertently entered as 109 instead of 199 for case 119. Residual plots
for six Splus regression estimators (described further in Section 7.2, KLMS
and KLTS used options that should generally detect more outliers than the
default versions of lmsreg and ltsreg) are shown in Figure 1.2. In 1997,
ALMS and the classical OLS and L1 estimators failed to identify observation
119 as unusual. Eventually this coding error was detected and corrected.
Example 1.2. Buxton (1920, p. 232-5) gives 20 measurements of 88
men. Height was the response variable while an intercept, head length, nasal
height, bigonal breadth, and cephalic index were used as predictors in the
multiple linear regression model. Observation 9 was deleted since it had
missing values. Five individuals, numbers 62–66, were reported to be about
0.75 inches tall with head lengths well over five feet! Figure 1.3 shows that
the outliers were accommodated by all of the Splus estimators, except KLMS.
The Buxton data is also used to illustrate robust multivariate location and dispersion estimators.
Figure 1.3: Buxton data, the outliers do not have large residuals.
should be used. The cleaned data will be used to show that the bulk of the
data is well approximated by the statistical model, but the full data set will
be used along with the cleaned data for prediction and for description of the
entire population.
To illustrate the above discussion, consider the multiple linear regression
model
Y = Xβ + e (1.10)
where Y is an n × 1 vector of dependent variables, X is an n × p matrix
of predictors, β is a p × 1 vector of unknown coefficients, and e is an n × 1
vector of errors. The ith case (Yi , xTi ) corresponds to the ith row xTi of X
and the ith element Yi of Y . Assume that the errors ei are iid zero mean
normal random variables with variance σ 2 .
Finding prediction intervals for future observations is a standard problem
in regression. Let β̂ denote the ordinary least squares (OLS) estimator of β
and let
MSE = Σ_{i=1}^n ri² / (n − p)
where ri = Yi − xTi β̂ is the ith residual. Following Neter, Wasserman, Nacht-
sheim and Kutner (1996, p. 235), a (1 − α)100% prediction interval (PI) for
a new observation Yh corresponding to a vector of predictors xh is given by
Ŷh ± t1−α/2,n−p se(pred), where se(pred) = √( MSE (1 + x_h^T (X^T X)^{−1} x_h ) ).
[Figure 1.4: forward response plots (FIT versus Y) and residual plots (FIT versus R) for the bodyfat data. Panels a) and b) use the data with the outliers deleted; panels c) and d) use all cases, with outlying cases such as 6, 48, 76, 96, 169, 182 and 200 visible.]
Assume that the data have been perfectly classified into nc clean cases and
no outlying cases where nc + no = n. Also assume that no outlying cases will
fall within the PI. Then the PI is valid if Yh is clean, and a nominal 100(1 − α)% PI for a future case from this population is
Ŷh ± t1−α∗/2,nc −p se(pred)    (1.13)
with 1 − α∗ = (1 − α)/(1 − γ) and γ = no /n,
where Ŷh and se(pred) are obtained after performing OLS on the nc clean
cases. For example, if α = 0.1 and γ = 0.08, then 1 − α∗ ≈ 0.98. Since γ will
be estimated from the data, the coverage will only be approximately valid.
The following example illustrates the procedure.
Example 1.4. STATLIB provides a data set (see Johnson 1996) that is
available from the website (http://lib.stat.cmu.edu/datasets/bodyfat). The
data set includes 252 cases, 14 predictor variables, and a response variable
Y = bodyfat. The correlation between Y and the first predictor x1 = density
is extremely high, and the plot of x1 versus Y looks like a straight line except
for four points. If simple linear regression is used, the residual plot of the
fitted values versus the residuals is curved and five outliers are apparent.
The curvature suggests that x1² should be added to the model, but the least
squares fit does not resist outliers well. If the five outlying cases are deleted,
four more outliers show up in the plot. The residual plot for the quadratic fit
looks reasonable after deleting cases 6, 48, 71, 76, 96, 139, 169, 182 and 200.
Cases 71 and 139 were much less discrepant than the other seven outliers.
These nine cases appear to be outlying at random: if the purpose of the
analysis was description, we could say that a quadratic fits 96% of the cases
well, but 4% of the cases are not fit especially well. If the purpose of the
analysis was prediction, deleting the outliers and then using the clean data to
find a 99% prediction interval (PI) would not make sense if 4% of future cases
are outliers. To create a nominal 90% PI for future cases from this population,
make a classical 100(1−α∗ ) PI from the clean cases where 1−α∗ = 0.9/(1−γ).
For the bodyfat data, we can take 1−γ ≈ 1−9/252 ≈ 0.964 and 1−α∗ ≈ 0.94.
Notice that (0.94)(0.96) ≈ 0.9.
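A minimal R sketch of this adjusted PI idea on simulated data (not the bodyfat data; the variable names and the fraction of outliers are illustrative) is:
# adjusted PI: a classical 100(1 - alpha*)% PI on the clean cases,
# with 1 - alpha* = (1 - alpha)/(1 - gamma)
set.seed(3)
x <- runif(100); y <- 1 + 2 * x + rnorm(100, sd = 0.1)
y[1:4] <- y[1:4] + 10                        # four outliers, so gamma is about 0.04
clean <- rep(TRUE, 100); clean[1:4] <- FALSE # flag the clean cases
alpha <- 0.1; gamma <- 4/100
alphastar <- 1 - (1 - alpha)/(1 - gamma)
fit <- lm(y ~ x, subset = clean)
predict(fit, newdata = data.frame(x = 0.5),
        interval = "prediction", level = 1 - alphastar)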
Figure 1.4 is useful for presenting the analysis. The top two plots have
the nine outliers deleted. Figure 1.4a is a forward response plot of the fitted
values Ŷi versus the response Yi while Figure 1.4b is a residual plot of the
fitted values Ŷi versus the residuals ri . These two plots suggest that the
multiple linear regression model fits the bulk of the data well. Next consider
using weighted least squares where cases 6, 48, 71, 76, 96, 139, 169, 182 and
200 are given weight zero and the remaining cases weight one. Figure 1.4c
and 1.4d give the forward response plot and residual plot for the entire data
set. Notice that seven of the nine outlying cases can be seen in these plots.
The classical 90% PI using x = (1, 1, 1)T and all 252 cases was Ŷh ±
t0.95,249se(pred) = 46.3152 ± 1.651(1.3295) = (44.12, 48.51). When the 9 out-
liers are deleted, nc = 243 cases remain. Hence the 90% PI using Equation
(1.13) with 9 cases deleted was Ŷh ±t0.97,240se(pred) = 44.961±1.88972(0.0371)
= (44.89, 45.03). The classical PI is about 31 times longer than the new PI.
For the next application, consider a response transformation model
y = tλo^{−1}(x^T β + e)
where the transformed response
tλo (y) = x^T β + e
follows a multiple linear regression (MLR) model where the response variable
yi > 0 and the power transformation family
tλ(y) ≡ y^(λ) = (y^λ − 1)/λ    (1.14)
for λ ≠ 0 and y^(0) = log(y).
The following simple graphical method for selecting response transforma-
tions can be used with any good classical, robust or Bayesian MLR estimator.
Let zi = tλ(yi ) for λ ≠ 1, and let zi = yi if λ = 1. Next, perform the multiple linear regression of zi on xi and make the forward response plot of ẑi versus
zi . If the plotted points follow the identity line, then take λo = λ. One plot
is made for each of the eleven values of λ ∈ Λ, and if more than one value of
λ works, take the simpler transformation or the transformation that makes
the most sense to subject matter experts. (Note that this procedure can be
modified to create a graphical diagnostic for a numerical estimator λ̂ of λo
by adding λ̂ to Λ.) The following example illustrates the procedure.
Example 1.5. Box and Cox (1964) present a textile data set where
samples of worsted yarn with different levels of the three factors were given
a cyclic load until the sample failed. The goal was to understand how y =
the number of cycles to failure was related to the predictor variables. Figure
1.5 shows the forward response plots for two MLR estimators: OLS and
the R/Splus function lmsreg. Figures 1.5a and 1.5b show that a response
transformation is needed while 1.5c and 1.5d both suggest that log(y) is the
appropriate response transformation. Using OLS and a resistant estimator
as in Figure 1.5 may be very useful if outliers are present.
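A minimal R sketch of this graphical procedure is given below; the grid of λ values shown is the usual ladder of powers and is only an assumption, since Λ is defined elsewhere in the text, and the function name is illustrative.
# forward response plots of zhat versus z for a grid of power transformations
tplot <- function(x, y, lams = c(-1, -2/3, -1/2, -1/3, 0, 1/3, 1/2, 2/3, 1)) {
  op <- par(mfrow = c(3, 3)); on.exit(par(op))
  for (lam in lams) {
    z <- if (lam == 0) log(y) else (y^lam - 1)/lam   # t_lambda(y)
    zhat <- fitted(lm(z ~ x))
    plot(zhat, z, main = paste("lambda =", round(lam, 2)))
    abline(0, 1)                                     # identity line
  }
}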
The textile data set is used to illustrate another graphical method for
selecting the response transformation tλ in Section 5.1.
Another important application is variable selection: the search for a sub-
set of predictor variables that can be deleted from the model without impor-
tant loss of information. Section 5.2 gives a graphical method for assessing
variable selection for multiple linear regression models while Section 12.4
gives a similar method for 1D regression models.
Figure 1.5: OLS and LMSREG Suggest Using log(y) for the Textile Data
The basic idea is to obtain fitted values from the full model and the
candidate submodel. If the candidate model is good, then the plotted points
in a plot of the submodel fitted values versus the full model fitted values
should follow the identity line. In addition, a similar plot should be made
using the residuals.
A problem with this idea is how to select the candidate submodel from
the nearly 2^p potential submodels. One possibility would be to try to order
the predictors in importance, say x1, ..., xp. Then let the kth model contain
the predictors x1 , x2, ..., xk for k = 1, ..., p. If the predicted values from the
submodel are highly correlated with the predicted values from the full model,
then the submodel is “good.” This idea is useful even for extremely compli-
cated models. Section 12.4 will show that the all subsets, forward selection
and backward elimination techniques of variable selection for multiple lin-
ear regression will often work for the 1D regression model provided that the
Mallows’ Cp criterion is used.
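A minimal R sketch of checking one candidate submodel against the full model on simulated data (names are illustrative, and OLS is used for both fits) is:
set.seed(5)
x <- matrix(rnorm(100 * 4), ncol = 4)
y <- drop(x[, 1:2] %*% c(3, -2)) + rnorm(100)
full <- lm(y ~ x)
sub  <- lm(y ~ x[, 1:2])          # candidate submodel with the first two predictors
cor(fitted(sub), fitted(full))    # near 1 suggests the submodel is "good"
plot(fitted(sub), fitted(full)); abline(0, 1)   # points should follow the identity line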
Example 1.6. The Boston housing data of Harrison and Rubinfeld
(1978) contains 14 variables and 506 cases. Suppose that the interest is
in predicting the per capita crime rate from the other variables. Variable
selection for this data set is discussed in much more detail in Section 12.4.
Another important topic is fitting 1D regression models given by Equation
(1.2) where g and β are both unknown. Many types of plots will be used
in this text and a plot of x versus y will have x on the horizontal axis and
y on the vertical axis. This notation is also used by the software packages
Splus (MathSoft 1999ab) and R, the free version of Splus available from
(http://www.r-project.org/). The R/Splus commands
X <- matrix(rnorm(300),nrow=100,ncol=3)
Y <- (X %*% 1:3)^3 + rnorm(100)
were used to generate 100 trivariate Gaussian predictors x and the response
Y = (β^T x)^3 + e where e ∼ N(0, 1). This is a model of form (1.3) where m is
the cubic function.
An amazing result is that the unknown function m can often be visualized
by the “OLS view,” a plot of the OLS fit (possibly ignoring the constant)
versus Y generated by the following commands.
bols <- lsfit(X,Y)$coef[-1]
plot(X %*% bols, Y)
The OLS view, shown in Figure 1.6, can be used to visualize m and
for prediction. Note that Y appears to be a cubic function of the OLS fit
and that if the OLS fit = 0, then the graph suggests using Ŷ = 0 as the
predicted value for Y . This plot and modifications will be discussed in detail
in Chapters 12 and 13.
This section has given a brief overview of the book. Also look at the preface
and table of contents, and then thumb through the remaining chapters to
examine the procedures and graphs that will be developed.
1.3 Complements
Many texts simply present statistical models without discussing the process
of model building. An excellent paper on statistical models is Box (1979).
The concept of outliers is rather vague although Barnett and Lewis (1994),
Davies and Gather (1993) and Gather and Becker (1997) give outlier models.
Also see Beckman and Cook (1983) for history.
Outlier rejection is a subjective or objective method for deleting or chang-
ing observations which lie far away from the bulk of the data. The modified
data is often called the “cleaned data.” See Rousseeuw and Leroy (1987,
p. 106, 161, 254, and 270), Huber (1981, p. 4-5, and 19), and Hampel,
Ronchetti, Rousseeuw and Stahel (1986, p. 24, 26, and 31). Data editing,
screening, truncation, censoring, Winsorizing, and trimming are all methods
for data cleaning. David (1981, ch. 8) surveys outlier rules before 1974, and
Hampel, Ronchetti, Rousseeuw and Stahel (1986, Section 1.4) surveys some
robust outlier rejection rules. Outlier rejection rules are also discussed in
Hampel (1985), Simonoff (1987a,b), and Stigler (1973b).
Robust estimators can be obtained by applying classical methods to the
cleaned data. Huber (1981, p. 4-5, 19) suggests that the performance of such
methods may be more difficult to work out than that of robust estimators
such as the M-estimators, but gives a procedure for cleaning regression data.
Staudte and Sheather (1990, p. 29, 136) state that rejection rules are the least
understood and point out that for subjective rules where the cleaned data is
assumed to be iid, one can not find an unconditional standard error estimate.
Even if the data consists of observations which are iid plus outliers, some
“good” observations will usually be deleted while some “bad” observations
will be kept. In other words, the assumption of perfect classification is often
unreasonable.
The graphical method for response transformations illustrated in Example
1.5 was suggested by Olive (2004b).
Seven important papers that influenced this book are Hampel (1975),
Siegel (1982), Devlin, Gnanadesikan and Kettenring (1981), Rousseeuw (1984),
Li and Duan (1989), Cook and Nachtsheim (1994) and Rousseeuw and Van
Driessen (1999). The importance of these papers will become clearer later in
the text.
An excellent text on regression (using 1D regression models such as (1.1))
is Cook and Weisberg (1999a). A more advanced text is Cook (1998a). Also
see Cook (2003), Horowitz (1998), Li (2000) and Weisberg (2005).
This text will use the software packages Splus (MathSoft (now Insightful)
1999ab) and R, a free version of Splus available from the website (http://www.
r-project.org/), and Arc (Cook and Weisberg 1999a), a free package available
from the website (http://www.stat.umn.edu/arc).
Section 14.2 of this text, Becker, Chambers, and Wilks (1988), and Ven-
ables and Ripley (1997) are useful for R/Splus users. The websites (http://
www.burns-stat.com/), (http://lib.stat.cmu.edu/S/splusnotes) and (http://
www.isds.duke.edu/computing/S/Snotes/Splus.html) also have useful infor-
mation.
The Gladstone, Buxton, bodyfat and Boston housing data sets are avail-
able from the text’s website under the file names gladstone.lsp, buxton.lsp,
bodfat.lsp and boston2.lsp.
1.4 Problems
PROBLEMS WITH AN ASTERISK * ARE ESPECIALLY USE-
FUL.
1.1∗. Using the notation on p. 6, let Ŷi,j = xTi β̂j and show that
ri,1 − ri,2 = Ŷi,2 − Ŷi,1 .
R/Splus Problems
1.2∗. a) Using the R/Splus commands on p. 17, reproduce a plot like
Figure 1.6. Once you have the plot you can print it out directly, but it will
d) Include the plot in Word using commands similar to those given in b).
e) Do the two plots look similar? Can you see the cubic function?
1.3∗. a) Enter the following R/Splus function that is used to illustrate
the central limit theorem when the data Y1 , ..., Yn are iid from an exponential
distribution. The function generates a data set of size n and computes Ȳ1 from the data set. This step is repeated nruns = 100 times. The output is a vector (Ȳ1 , Ȳ2 , ..., Ȳ100 ). A histogram of these means should resemble a
symmetric normal density once n is large enough.
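The function itself is not reproduced in this excerpt; a minimal R sketch of such a simulation (the function name and defaults are illustrative, not the rpack.txt code) is:
cltsim <- function(n = 100, nruns = 100) {
  ybar <- numeric(nruns)
  for (i in 1:nruns) {
    y <- rexp(n)          # iid exponential data set of size n
    ybar[i] <- mean(y)    # sample mean of the ith data set
  }
  ybar
}
hist(cltsim(n = 100))     # should look roughly normal for large n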
of.” A window will appear. Double click on L1: Fit–Values and then double
click on L1:Residuals. Then L1: Fit–Values should appear in the H box and
L1:Residuals should appear in the V box. Click on OK to obtain the plot.
d) The graph can be printed with the menu commands “File>Print,” but
it will generally save paper by placing the plots in the Word editor. Activate
Word (often by double clicking on a Word icon). Click on the screen and
type “Problem 1.4.” In Arc, use the menu command “Edit>Copy.” In Word,
use the menu commands “Edit>Paste.”
e) In your Word document, write “1.4e)” and state whether the points
cluster about the horizontal axis with no pattern. If curvature is present,
then the multiple linear regression model is not appropriate.
f) After editing your Word document, get a printout by clicking on the
printer icon or by using the menu commands “File>Print.” To save your
output on your diskette, use the Word menu commands “File > Save as.” In
the Save in box select “3 1/2 Floppy(A:)” and in the File name box enter
HW1d4.doc. To exit from Word and Arc, click on the “X” in the upper right
corner of the screen. In Word a screen will appear and ask whether you want
to save changes made in your document. Click on No. In Arc, click on OK.
Warning: The following problem uses data from the book’s web-
page. Save the data files on a disk. Next, get in Arc and use the menu
commands “File > Load” and a window with a Look in box will appear.
Click on the black triangle and then on 3 1/2 Floppy(A:). Then click twice
on the data set name, eg, bodfat.lsp. These menu commands will be de-
noted by “File > Load > 3 1/2 Floppy(A:) > bodfat.lsp” where the data file
(bodfat.lsp) will depend on the problem.
If the free statistics package Arc is on your personal computer (PC),
there will be a folder Arc with a subfolder Data that contains a subfolder
Arcg. Your instructor may have added a new folder mdata in the subfolder
Data and added bodfat.lsp to the folder mdata. In this case the Arc menu
commands “File > Load > Data > mdata > bodfat.lsp” can be used.
1.5∗. This text’s webpage has several files that can be used by Arc.
Chapter 14 explains how to create such files.
a) Use the Arc menu commands “File > Load > 3 1/2 Floppy(A:) >
bodfat.lsp” to activate the file bodfat.lsp.
dow, and click on the Full quad. circle. Then click on OK. These commands
will fit the quadratic model y = x1 + x1² + e without using the deleted cases.
Make a residual plot of L4:Fit-Values versus L4:Residuals and a forward re-
sponse plot of L4:Fit-Values versus y. For both plots place the fitted values
in the H box and the other variable in the V box. Include these two plots in
Word.
h) If the forward response plot is linear and if the residual plot is rectangu-
lar about the horizontal axis, then the quadratic model may be appropriate.
Comment on the two plots.
Chapter 2
The Location Model
Figure 2.1: Dot plot, histogram, density estimate, and box plot for heights
from Buxton (1920).
heights around 0.75 inches! It appears that their heights were recorded under
the variable “head length,” so these height outliers can be corrected. Note
that the presence of outliers is easily detected in all four plots.
Point estimation is one of the oldest problems in statistics and four of
the most important statistics for the location model are the sample mean,
median, variance, and the median absolute deviation (mad). Let Y1 , . . . , Yn
be the random sample; ie, assume that Y1 , ..., Yn are iid.
Definition 2.1. The sample mean
Ȳ = (Σ_{i=1}^n Yi )/n.    (2.2)
population              sample
E(Y ), µ, θ             Ȳn , E(n), µ̂, θ̂
MED(Y ), M              MED(n), M̂
VAR(Y ), σ²             VAR(n), S², σ̂²
SD(Y ), σ               SD(n), S, σ̂
MAD(Y )                 MAD(n)
IQR(Y )                 IQR(n)
They are also quite old. Rey (1978, p. 2) quotes Thucydides on a technique
used by Greek besiegers in the winter of 428 B.C. Cities were often surrounded
by walls made of layers of bricks, and besiegers made ladders to scale these
walls. The length of the ladders was determined by counting the layers of
bricks. Many soldiers counted the number of bricks, and the mode of the
counts was used to estimate the number of layers. The reasoning was that
some of the counters would make mistakes, but the majority were likely to
hit the true count. If the majority did hit the true count, then the sample
median would equal the mode. In a lecture, Professor Portnoy stated that in
215 A.D., an “eggs bulk” of impurity was allowed in the ritual preparation of
food, and two Rabbis desired to know what is an “average sized egg” given
a collection of eggs. One said use the middle sized egg while the other said
average the largest and smallest eggs of the collection. Hampel, Ronchetti,
Rousseeuw and Stahel (1986, p. 65) attribute MAD(n) to Gauss in 1816.
Since MAD(Y ) is the median of |Y − MED(Y )|,
P (Y ∈ [MED(Y ) − MAD(Y ), MED(Y ) + MAD(Y )]) ≥ 0.5
and
P (Y ∈ (MED(Y ) − MAD(Y ), MED(Y ) + MAD(Y ))) ≤ 0.5.
MAD(Y ) and MED(Y ) are often simple to find for location, scale, and
location–scale families. Assume that the cdf F of Y has a probability density
function (pdf) or probability mass function (pmf) f. The following definitions
are taken from Casella and Berger (2002, p. 116-119) and Lehmann (1983,
p. 20).
Definition 2.7. Let fY (y) be the pdf of Y. Then the family of pdf’s
fW (w) = fY (w − µ) indexed by the location parameter µ, −∞ < µ < ∞, is
the location family for the random variable W = µ + Y with standard pdf
fY (y).
Definition 2.8. Let fY (y) be the pdf of Y. Then the family of pdf’s
fW (w) = (1/σ)fY (w/σ) indexed by the scale parameter σ > 0, is the scale
family for the random variable W = σY with standard pdf fY (y).
Definition 2.9. Let fY (y) be the pdf of Y. Then the family of pdf’s
fW (w) = (1/σ)fY ((w − µ)/σ) indexed by the location and scale parameters
µ, −∞ < µ < ∞, and σ > 0, is the location–scale family for the random
variable W = µ + σY with standard pdf fY (y).
Table 2.2 gives the population mads and medians for some “brand name”
distributions. The distributions are location–scale families except for the
exponential and tp distributions. The notation tp denotes a t distribution
with p degrees of freedom while tp,α is the α percentile of the tp distribution,
ie P (tp ≤ tp,α) = α. Hence tp,0.5 = 0 is the population median. The second
column of Table 2.2 gives the section of Chapter 3 where the random variable
is described further. For example, the exponential (λ) random variable is
described in Section 3.7. Table 2.3 presents approximations for the binomial,
chi-square, and gamma distributions.
Finding MED(Y ) and MAD(Y ) for symmetric distributions and location–
scale families is made easier by the following lemma and Table 2.2. Let
F (yα) = P (Y ≤ yα) = α for 0 < α < 1 where the cdf F (y) = P (Y ≤ y). Let
D = MAD(Y ), M = MED(Y ) = y0.5 and U = y0.75.
Lemma 2.1. a) If W = a + bY, then MED(W ) = a + bMED(Y ) and
MAD(W ) = |b|MAD(Y ).
b) If Y has a pdf that is continuous and positive on its support and
symmetric about µ, then MED(Y ) = µ and MAD(Y ) = y0.75 − MED(Y ).
Find M = MED(Y ) by solving the equation F (M) = 0.5 for M, and find U
by solving F (U) = 0.75 for U. Then D = MAD(Y ) = U − M.
c) Suppose that W is from a location–scale family with standard pdf
fY (y) that is continuous and positive on its support. Then W = µ + σY
where σ > 0. First find M by solving FY (M) = 0.5. After finding M, find
D by solving FY (M + D) − FY (M − D) = 0.5. Then MED(W ) = µ + σM
and MAD(W ) = σD.
Proof sketch. a) Assume the probability density function of Y is con-
Table 2.2: MED(Y ) and MAD(Y ) for some useful random variables.
Let D = MAD(Z) and let P (Z ≤ z) = Φ(z) be the cdf of Z. Now Φ(z) does
not have a closed form but is tabled extensively. Lemma 2.1b) implies that
D = z0.75 − 0 = z0.75 where P (Z ≤ z0.75) = 0.75. From a standard normal
table, 0.67 < D < 0.68 or D ≈ 0.674. A more accurate value can be found
with the following R/Splus command.
> qnorm(0.75)
[1] 0.6744898
Hence MAD(W ) ≈ 0.6745σ.
Example 2.4. If W is exponential (λ), then the cdf of W is FW (w) =
1 − exp(−w/λ) for w > 0 and FW (w) = 0 otherwise. Since exp(log(1/2)) =
exp(− log(2)) = 0.5, MED(W ) = log(2)λ. Since the exponential distribution
is a scale family with scale parameter λ, MAD(W ) = Dλ for some D > 0.
Hence
0.5 = FW (log(2)λ + Dλ) − FW (log(2)λ − Dλ),
or 0.5 = exp(−(log(2) − D)) − exp(−(log(2) + D)) = 0.5[exp(D) − exp(−D)]. Hence D satisfies exp(D) − exp(−D) = 1, which can be solved numerically:
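The function tem used in the output below is not defined in this excerpt; a minimal sketch consistent with the printed values, assuming tem(d) evaluates exp(d) − exp(−d), is:
tem <- function(d) exp(d) - exp(-d)                # assumed form; matches the output below
uniroot(function(d) tem(d) - 1, c(0.4, 0.6))$root  # about 0.4812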
> tem(0.48)
[1] 0.997291
> tem(0.49)
[1] 1.01969
> tem(0.484)
[1] 1.006238
> tem(0.483)
[1] 1.004
> tem(0.481)
[1] 0.9995264
> tem(0.482)
[1] 1.001763
> tem(0.4813)
[1] 1.000197
> tem(0.4811)
[1] 0.99975
> tem(0.4812)
[1] 0.9999736
Hence D ≈ 0.4812 and MAD(W ) ≈ 0.4812λ ≈ λ/2.0781. If X is a
two parameter exponential (θ, λ) random variable, then X = θ + W. Hence
MED(X) = θ + log(2)λ and MAD(X) ≈ λ/2.0781.
Example 2.5. This example shows how to approximate the population
median and mad under severe contamination when the “clean” observations
are from a symmetric location–scale family. Let Φ be the cdf of the standard
normal, and let Φ(zα ) = α. Note that zα = Φ−1 (α). Suppose Y ∼ (1−γ)FW +
γFC where W ∼ N(µ, σ 2 ) and C is a random variable far to the right of µ.
Show a)
MED(Y ) ≈ µ + σ z[1/(2(1 − γ))]
and b)
MAD(Y ) ≈ 2σ z[1/(2(1 − γ))].
a) Since the mass of C is far to the right of µ, 0.5 = P (Y ≤ MED(Y )) ≈ (1 − γ)Φ((MED(Y ) − µ)/σ), and
Φ((MED(Y ) − µ)/σ) ≈ 1/(2(1 − γ)).
b) Since the mass of C is far to the right of µ,
0.5 ≈ (1 − γ)[1 − Φ((MED(Y ) − MAD(Y ) − µ)/σ)].
Writing z[α] for zα gives
(MED(Y ) − MAD(Y ) − µ)/σ ≈ z[(1 − 2γ)/(2(1 − γ))].
Thus
MAD(Y ) ≈ MED(Y ) − µ − σ z[(1 − 2γ)/(2(1 − γ))].
Since z[α] = −z[1 − α],
−z[(1 − 2γ)/(2(1 − γ))] = z[1/(2(1 − γ))]
and
MAD(Y ) ≈ µ + σ z[1/(2(1 − γ))] − µ + σ z[1/(2(1 − γ))] = 2σ z[1/(2(1 − γ))].
Y ≈ N(νλ, νλ²)
for large ν. If X is N(µ, σ²) then MAD(X) ≈ σ/1.483. Hence MAD(Y ) ≈ λ√ν/1.483. Assuming that ν is large, solve MED(n) = λν and MAD(n) = λ√ν/1.483 for ν and λ, obtaining
ν̂ ≈ [MED(n)/(1.483 MAD(n))]²   and   λ̂ ≈ (1.483 MAD(n))²/MED(n).
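A minimal R sketch of these estimators, with a quick check on simulated gamma data (the function name is illustrative):
gamrob <- function(y) {
  med  <- median(y)
  madn <- mad(y, constant = 1)     # MAD(n) without R's default 1.4826 rescaling
  c(nu = (med/(1.483 * madn))^2, lambda = (1.483 * madn)^2/med)
}
set.seed(1)
gamrob(rgamma(10000, shape = 50, scale = 2))   # roughly nu = 50, lambda = 2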
c) Suppose that Y1 , ..., Yn are iid from an extreme value distribution, then
the cdf of Y is
F (y) = exp[− exp(−(y − θ)/σ)].
This family is an asymmetric location-scale family. Since 0.5 = F (MED(Y )),
MED(Y ) = θ − σ log(log(2)) ≈ θ + 0.36651σ. Let D = MAD(Y ) if θ = 0
and σ = 1. Then 0.5 = F [MED(Y ) + MAD(Y )] − F [MED(Y ) − MAD(Y )].
Solving 0.5 = exp[− exp(−(0.36651 + D))] − exp[− exp(−(0.36651 − D))] for
D numerically yields D = 0.767049. Hence MAD(Y ) = 0.767049σ.
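The value D can be checked with a short R computation, eg:
F <- function(y) exp(-exp(-y))     # standard extreme value cdf (theta = 0, sigma = 1)
M <- -log(log(2))                  # MED, about 0.36651
uniroot(function(D) F(M + D) - F(M - D) - 0.5, c(0.5, 1))$root   # about 0.767049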
d) Sometimes MED(n) and MAD(n) can also be used to estimate the
parameters of two parameter families that are not location–scale families.
Suppose that Y1 , ..., Yn are iid from a Weibull(φ, λ) distribution where λ, y,
and φ are all positive. The cdf of Y is F (y) = 1 − exp(−y^φ /λ) for y > 0. Taking φ = 1 gives the exponential(λ) distribution while φ = 2 gives the Rayleigh(µ = 0, σ = √(λ/2)) distribution. Since F (MED(Y )) = 1/2, MED(Y ) = (λ log(2))^{1/φ} . These results suggest that if φ is known, then
λ̂ = (MED(n))^φ / log(2).
Falk (1997) shows that under regularity conditions, the joint distribution
of the sample median and mad is asymptotically normal. See Section 2.9.
Table 2.4: Robust point estimators for some useful random variables.
BIN(k, ρ)     ρ̂ ≈ MED(n)/k
C(µ, σ)       µ̂ = MED(n)                       σ̂ = MAD(n)
χ²_p          p̂ ≈ MED(n) + 2/3, rounded
DE(θ, λ)      θ̂ = MED(n)                       λ̂ = 1.443 MAD(n)
EXP(λ)        λ̂1 = 1.443 MED(n)                λ̂2 = 2.0781 MAD(n)
EXP(θ, λ)     θ̂ = MED(n) − 1.440 MAD(n)        λ̂ = 2.0781 MAD(n)
EV(θ, σ)      θ̂ = MED(n) − 0.4778 MAD(n)       σ̂ = 1.3037 MAD(n)
G(ν, λ)       ν̂ ≈ [MED(n)/(1.483 MAD(n))]²     λ̂ ≈ (1.483 MAD(n))²/MED(n)
HN(µ, σ)      µ̂ = MED(n) − 1.6901 MAD(n)       σ̂ = 2.5057 MAD(n)
L(µ, σ)       µ̂ = MED(n)                       σ̂ = 0.9102 MAD(n)
N(µ, σ²)      µ̂ = MED(n)                       σ̂ = 1.483 MAD(n)
R(µ, σ)       µ̂ = MED(n) − 2.6255 MAD(n)       σ̂ = 2.230 MAD(n)
U(θ1 , θ2)    θ̂1 = MED(n) − 2 MAD(n)           θ̂2 = MED(n) + 2 MAD(n)
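For example, a minimal R sketch of the N(µ, σ²) and EXP(λ) rows of the table (again using MAD(n) without R's default rescaling) is:
set.seed(2)
y <- rnorm(1000, mean = 10, sd = 3)
c(muhat = median(y), sigmahat = 1.483 * mad(y, constant = 1))   # roughly 10 and 3
w <- rexp(1000, rate = 1/5)                                     # lambda = 5
c(l1 = 1.443 * median(w), l2 = 2.0781 * mad(w, constant = 1))   # both roughly 5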
where
σ²M = 1/(4[f(MED(Y ))]²),
and
σ²D = (1/64) [ 3/[f(ξ3/4 )]² − 2/(f(ξ3/4 )f(ξ1/4 )) + 3/[f(ξ1/4 )]² ] = 1/(16[f(ξ3/4 )]²).
Three common approximate level α tests of hypotheses all use the null
hypothesis Ho : µW = µo . A right tailed test uses the alternative hypothesis
HA : µW > µo , a left tailed test uses HA : µW < µo , and a two tail test uses
HA : µW ≠ µo . The test statistic is
to = (Wn − µo )/SE(Wn ),
and the (approximate) p-values are P (Z > to) for a right tail test, P (Z < to )
for a left tail test, and 2P (Z > |to|) = 2P (Z < −|to|) for a two tail test. The
null hypothesis Ho is rejected if the p-value < α.
Remark 2.1. Frequently the large sample CIs and tests can be improved
for smaller samples by substituting a t distribution with p degrees of freedom
for the standard normal distribution Z where p ≡ pn is some increasing
function of the sample size n. Then the 100(1 − α)% CI for µW is given by
Wn ± tp,1−α/2 SE(Wn ).
The test statistic rarely has an exact tp distribution, but the approximation
tends to make the CI’s and tests more conservative; ie, the CI’s are longer
and Ho is less likely to be rejected. This book will typically use very simple
rules for p and not investigate the exact distribution of the test statistic.
Paired and two sample procedures can be obtained directly from the one
sample procedures. Suppose there are two samples Y1 , ..., Yn and X1 , ..., Xm.
If n = m and it is known that (Yi , Xi ) match up in correlated pairs, then
paired CIs and tests apply the one sample procedures to the differences Di =
Yi − Xi . Otherwise, assume the two samples are independent, that n and m
are large, and that
( √n(Wn (Y ) − µW (Y )), √m(Wm (X) − µW (X)) )^T →D N( (0, 0)^T , diag(σ²W (Y ), σ²W (X)) ).
Then
( Wn (Y ) − µW (Y ), Wm (X) − µW (X) )^T ≈ N( (0, 0)^T , diag(σ²W (Y )/n, σ²W (X)/m) ),
and
Wn (Y ) − Wm (X) − (µW (Y ) − µW (X)) ≈ N(0, σ²W (Y )/n + σ²W (X)/m).
Hence
SE(Wn (Y ) − Wm (X)) = √( S²W (Y )/n + S²W (X)/m ),
and the large sample 100(1 − α)% CI for µW (Y ) − µW (X) is given by
(Wn (Y ) − Wm (X)) ± z1−α/2SE(Wn (Y ) − Wm (X)).
Often approximate level α tests of hypotheses use the null hypothesis
Ho : µW (Y ) = µW (X). A right tailed test uses the alternative hypothesis
HA : µW (Y ) > µW (X), a left tailed test uses HA : µW (Y ) < µW (X), and a
two tail test uses HA : µW (Y ) ≠ µW (X). The test statistic is
to = (Wn (Y ) − Wm (X))/SE(Wn (Y ) − Wm (X)),
and the (approximate) p-values are P (Z > to) for a right tail test, P (Z < to )
for a left tail test, and 2P (Z > |to|) = 2P (Z < −|to|) for a two tail test. The
null hypothesis Ho is rejected if the p-value < α.
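A minimal R sketch of these large sample two sample formulas, written as a generic helper that takes the estimates and their standard errors (the function name is illustrative):
twosamp <- function(wy, wx, se_y, se_x, alpha = 0.05) {
  se <- sqrt(se_y^2 + se_x^2)            # SE(Wn(Y) - Wm(X))
  to <- (wy - wx)/se                     # test statistic
  list(statistic = to,
       ci = (wy - wx) + c(-1, 1) * qnorm(1 - alpha/2) * se,
       p.value = 2 * pnorm(-abs(to)))    # two tail p-value
}
y <- rnorm(50, 1); x <- rnorm(60)
twosamp(mean(y), mean(x), sd(y)/sqrt(50), sd(x)/sqrt(60))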
to = (Ȳn − X̄m ) / √( Sn²(Y )/n + Sm²(X)/m ).
The right tailed p-value is given by P (tp > to ). For sample means, values of
the degrees of freedom that are more accurate than p = min(n − 1, m − 1)
can be computed. See Moore (2004, p. 452).
Wn = Wn (Ln , Un ) = (1/n) [ Ln Y(Ln +1) + Σ_{i=Ln +1}^{Un} Y(i) + (n − Un )Y(Un ) ].    (2.13)
Rn = Rn (Ln , Un ) = (1/(Un − Ln )) Σ_{i=Ln +1}^{Un} Y(i) .    (2.14)
[θ̂n − k1 Dn , θ̂n + k2 Dn ]
points estimate lower and upper population percentiles L(F ) and U(F ) and
change with the distribution F.
Two stage estimators are frequently used in robust statistics. Often the
initial estimator used in the first stage has good resistance properties but
has a low asymptotic relative efficiency or no convenient formula for the SE.
Ideally, the estimator in the second stage will have resistance similar to the
initial estimator but will be efficient and easy to use. The metrically trimmed
mean Mn with tuning parameter k1 = k2 ≡ k = 6 will often be the initial
estimator for the two stage trimmed means. That is, retain the cases that
fall in the interval [MED(n) − k1 MAD(n), MED(n) + k2 MAD(n)].
Let L(Mn ) be the number of observations that fall to the left of MED(n) −
k1 MAD(n) and let n − U(Mn ) be the number of observations that fall to
the right of MED(n) + k2 MAD(n). When k1 = k2 ≡ k ≥ 1, at least half of
the cases will be covered. Consider the set of 51 trimming proportions in the
set C = {0, 0.01, 0.02, ..., 0.49, 0.50}. Alternatively, the coarser set of 6 trim-
ming proportions C = {0, 0.01, 0.1, 0.25, 0.40, 0.49} may be of interest. The
greatest integer function (eg ⌊7.7⌋ = 7) is used in the following definitions.
Definition 2.14. Consider the smallest proportion αo,n ∈ C such that
αo,n ≥ L(Mn )/n and the smallest proportion 1 − βo,n ∈ C such that 1 −
βo,n ≥ 1 − (U(Mn )/n). Let αM,n = max(αo,n , 1 − βo,n ). Then the two stage
symmetrically trimmed mean TS,n is the αM,n trimmed mean. Hence TS,n
is a randomly trimmed mean with Ln = ⌊n αM,n ⌋ and Un = n − Ln . If
αM,n = 0.50, then use TS,n = MED(n).
Definition 2.15. As in the previous definition, consider the smallest
proportion αo,n ∈ C such that αo,n ≥ L(Mn )/n and the smallest proportion
1 − βo,n ∈ C such that 1 − βo,n ≥ 1 − (U(Mn )/n). Then the two stage asym-
metrically trimmed mean TA,n is the (αo,n , 1 − βo,n ) trimmed mean. Hence
TA,n is a randomly trimmed mean with Ln = ⌊n αo,n ⌋ and Un = ⌊n βo,n ⌋.
If αo,n = 1 − βo,n = 0.5, then use TA,n = MED(n).
Example 2.11. These two stage trimmed means are almost as easy to
compute as the classical trimmed mean, and no knowledge of the unknown
parameters is needed to do inference. First, order the data and find the
number of cases L(Mn ) less than MED(n) − k1 MAD(n) and the number of
cases n − U(Mn ) greater than MED(n) + k2 MAD(n). (These are the cases
Remark 2.3. A simple method for computing VSW (Ln , Un ) has the
following steps. First, find d1 , ..., dn where
di = Y(Ln +1) if i ≤ Ln , di = Y(i) if Ln + 1 ≤ i ≤ Un , and di = Y(Un ) if i ≥ Un + 1.
Then the Winsorized variance is the sample variance Sn²(d1 , ..., dn) of d1 , ..., dn, and the scaled Winsorized variance
VSW (Ln , Un ) = Sn²(d1 , ..., dn) / ([Un − Ln ]/n)².    (2.16)
Notice that the SE given in Definition 2.16 is the SE for the δ trimmed mean
where Ln and Un are fixed constants rather than random.
Application 2.4. Let Tn be the two stage (symmetrically or) asymmet-
rically trimmed mean that trims the Ln smallest cases and the n − Un largest
cases. Then for the one and two sample procedures described in Section 2.5,
use the one sample standard error SERM (Ln , Un ) given in Definition 2.16 and
the tp distribution where the degrees of freedom p = Un − Ln − 1.
The CI’s and tests for the α trimmed mean and two stage trimmed means
given by Applications 2.3 and 2.4 are very similar once Ln has been computed.
For example, a large sample 100 (1 − γ)% confidence interval (CI) for µT is
(Tn − tUn −Ln −1, 1−γ/2 SERM (Ln , Un ), Tn + tUn −Ln −1, 1−γ/2 SERM (Ln , Un ))    (2.17)
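A minimal R sketch of this CI, following Remark 2.3 and assuming SERM (Ln , Un ) = √(VSW (Ln , Un )/n) (Definition 2.16 is not reproduced in this excerpt), is:
trimci <- function(y, Ln, Un, gam = 0.05) {
  n <- length(y); ys <- sort(y)
  Tn <- mean(ys[(Ln + 1):Un])                    # trimmed mean
  d  <- c(rep(ys[Ln + 1], Ln), ys[(Ln + 1):Un], rep(ys[Un], n - Un))
  VSW <- var(d)/((Un - Ln)/n)^2                  # scaled Winsorized variance (2.16)
  se  <- sqrt(VSW/n)
  tcut <- qt(1 - gam/2, df = Un - Ln - 1)
  c(lower = Tn - tcut * se, upper = Tn + tcut * se)
}
y <- rnorm(100); Ln <- floor(0.1 * 100); Un <- 100 - Ln
trimci(y, Ln, Un)    # 95% CI with 10% trimming on each side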
for every ε > 0, and Xn converges almost everywhere (or almost surely, or
> 0, and Xn converges almost everywhere (or almost surely, or
with probability 1) to X if
P ( lim_{n→∞} Xn = X) = 1.
for a ≤ y ≤ b. Also G is 0 for y < a and G is 1 for y > b. The mean and
variance of YT are
µT = µT (a, b) = ∫ y dG(y) = ( ∫_a^b y dF (y) ) / (β − α)    (2.19)
and
σ²T = σ²T (a, b) = ∫ (y − µT )² dG(y) = ( ∫_a^b y² dF (y) ) / (β − α) − µ²T .
See Cramér (1946, p. 247).
Definition 2.18. The Winsorized random variable
YW = YW (a, b) = a if Y ≤ a, YW = Y if a ≤ Y ≤ b, and YW = b if Y ≥ b.
and
σ²W = σ²W (a, b) = αa² + (1 − β)b² + ∫_a^b y² dF (y) − µ²W .
P (Y < a) and let αo ∈ C = {0, 0.01, 0.02, ..., 0.49, 0.50} be the smallest value
in C such that αo ≥ α. Similarly, let β = F (b) and let 1−βo ∈ C be the small-
est value in the index set C such that 1 − βo ≥ 1 − β. Let αo = F (ao−), and
let βo = F (bo). Recall that L(Mn ) is the number of cases trimmed to the left
and that n − U(Mn ) is the number of cases trimmed to the right by the met-
rically trimmed mean Mn . Let αo,n ≡ α̂o be the smallest value in C such that
αo,n ≥ L(Mn )/n, and let 1−βo,n ≡ 1− β̂o be the smallest value in C such that
1−βo,n ≥ 1−(U(Mn )/n). Then the robust estimator TA,n is the (αo,n , 1−βo,n )
trimmed mean while TS,n is the max(αo,n , 1 − βo,n )100% trimmed mean. The
following lemma is useful for showing that TA,n is asymptotically equivalent
to the (αo , 1 − βo ) trimmed mean and that TS,n is asymptotically equivalent
to the max(αo , 1 − βo) trimmed mean.
Lemma 2.4: Shorack and Wellner (1986, p. 682-683). Let F
have a strictly positive and continuous derivative in some neighborhood of
MED(Y) ± kMAD(Y). Assume that
$$\sqrt{n}(\mathrm{MED}(n) - \mathrm{MED}(Y)) = O_P(1) \quad (2.23)$$
and
$$\sqrt{n}(\mathrm{MAD}(n) - \mathrm{MAD}(Y)) = O_P(1). \quad (2.24)$$
Then
$$\sqrt{n}\left(\frac{L(M_n)}{n} - \alpha\right) = O_P(1) \quad (2.25)$$
and
$$\sqrt{n}\left(\frac{U(M_n)}{n} - \beta\right) = O_P(1). \quad (2.26)$$
Corollary 2.5. Let Y1 , ..., Yn be iid from a distribution with cdf F that
has a strictly positive and continuous pdf f on its support. Let αM =
max(αo , 1 − βo ) ≤ 0.49, βM = 1 − αM , aM = F −1(αM ), and bM = F −1(βM ).
Assume that α and 1 − β are not elements of C = {0, 0.01, 0.02, ..., 0.50}.
Then
$$\sqrt{n}\,[T_{A,n} - \mu_T(a_o, b_o)] \xrightarrow{D} N\!\left(0, \frac{\sigma_W^2(a_o, b_o)}{(\beta_o - \alpha_o)^2}\right),$$
and
$$\sqrt{n}\,[T_{S,n} - \mu_T(a_M, b_M)] \xrightarrow{D} N\!\left(0, \frac{\sigma_W^2(a_M, b_M)}{(\beta_M - \alpha_M)^2}\right).$$
Proof. The first result follows from Theorem 2.2 if the probability that
TA,n is the (αo , 1 − βo) trimmed mean goes to one as n tends to infinity. This
condition holds if $L(M_n)/n \xrightarrow{D} \alpha$ and $U(M_n)/n \xrightarrow{D} \beta$. But these conditions
follow from Lemma 2.4. The proof for TS,n is similar. QED
Now
$$\psi_k'\!\left(\frac{Y - T}{S}\right) = 1$$
if T − kS ≤ Y ≤ T + kS and is zero otherwise (technically the derivative is undefined at y = ±k, but assume that Y is a continuous random variable so that the probability of a value occurring on a “corner” of the ψ function is zero). Let Ln count the number of observations Yi < MED(n) − kMAD(n), and let n − Un count the number of observations Yi > MED(n) + kMAD(n). Set T^(0) = MED(n) and S = MAD(n). Then
$$\sum_{i=1}^n \psi_k'\!\left(\frac{Y_i - T^{(0)}}{S}\right) = U_n - L_n.$$
Since
$$\psi_k\!\left(\frac{Y_i - \mathrm{MED}(n)}{\mathrm{MAD}(n)}\right) = \begin{cases} -k, & Y_i < \mathrm{MED}(n) - k\mathrm{MAD}(n) \\ \tilde{Y}_i, & \mathrm{MED}(n) - k\mathrm{MAD}(n) \le Y_i \le \mathrm{MED}(n) + k\mathrm{MAD}(n) \\ k, & Y_i > \mathrm{MED}(n) + k\mathrm{MAD}(n), \end{cases}$$
where $\tilde{Y}_i = (Y_i - \mathrm{MED}(n))/\mathrm{MAD}(n)$,
$$\sum_{i=1}^n \psi_k\!\left(\frac{Y_{(i)} - T^{(0)}}{S}\right) = -kL_n + k(n - U_n) + \sum_{i=L_n+1}^{U_n} \frac{Y_{(i)} - T^{(0)}}{S}.$$
Hence
$$\mathrm{MED}(n) + S\, \frac{\sum_{i=1}^n \psi_k\!\left(\frac{Y_i - \mathrm{MED}(n)}{\mathrm{MAD}(n)}\right)}{\sum_{i=1}^n \psi_k'\!\left(\frac{Y_i - \mathrm{MED}(n)}{\mathrm{MAD}(n)}\right)} = \mathrm{MED}(n) + \frac{k\mathrm{MAD}(n)(n - U_n - L_n) + \sum_{i=L_n+1}^{U_n} [Y_{(i)} - \mathrm{MED}(n)]}{U_n - L_n},$$
and Huber’s one step M-estimator
$$H_{1,n} = \frac{k\mathrm{MAD}(n)(n - U_n - L_n) + \sum_{i=L_n+1}^{U_n} Y_{(i)}}{U_n - L_n}.$$
In particular,
$$|\mathrm{MAD}(n) - \mathrm{MD}(n)| \le |\mathrm{MED}(n) - \mathrm{MED}(Y)|. \quad (2.27)$$
Adding and subtracting MAD(Y) on the left hand side then bounds |MAD(n) − MAD(Y)| in terms of |MED(n) − MED(Y)| and |MD(n) − MAD(Y)|.
The main point of the following theorem is that the joint distribution of
MED(n) and MAD(n) is asymptotically normal. Hence the limiting distribu-
tion of MED(n) + kMAD(n) is also asymptotically normal for any constant
k. The parameters of the covariance matrix are quite complex and hard to es-
timate. The assumptions on f used in Theorem 2.8 guarantee that MED(Y)
and MAD(Y ) are unique.
Theorem 2.8: Falk (1997). Let the cdf F of Y be continuous near and differentiable at MED(Y) = F^{-1}(1/2) and MED(Y) ± MAD(Y). Assume that f = F′, f(F^{-1}(1/2)) > 0, and A ≡ f(F^{-1}(1/2) − MAD(Y)) + f(F^{-1}(1/2) + MAD(Y)) > 0. Let C ≡ f(F^{-1}(1/2) − MAD(Y)) − f(F^{-1}(1/2) + MAD(Y)), and let B ≡ C² + 4Cf(F^{-1}(1/2))[1 − F(F^{-1}(1/2) − MAD(Y)) − F(F^{-1}(1/2) + MAD(Y))]. Then
$$\sqrt{n}\left[\begin{pmatrix} \mathrm{MED}(n) \\ \mathrm{MAD}(n) \end{pmatrix} - \begin{pmatrix} \mathrm{MED}(Y) \\ \mathrm{MAD}(Y) \end{pmatrix}\right] \xrightarrow{D} N\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_M^2 & \sigma_{M,D} \\ \sigma_{M,D} & \sigma_D^2 \end{pmatrix}\right) \quad (2.29)$$
where
$$\sigma_M^2 = \frac{1}{4f^2(F^{-1}(\frac{1}{2}))}, \qquad \sigma_D^2 = \frac{1}{4A^2}\left(1 + \frac{B}{f^2(F^{-1}(\frac{1}{2}))}\right),$$
and
$$\sigma_{M,D} = \frac{1}{4Af(F^{-1}(\frac{1}{2}))}\left(1 - 4F(F^{-1}(\tfrac{1}{2}) + \mathrm{MAD}(Y)) + \frac{C}{f(F^{-1}(\frac{1}{2}))}\right).$$
Determining whether the population median and mad are unique can be
useful. Recall that F (y) = P (Y ≤ y) and F (y−) = P (Y < y). The median
is unique unless there is a flat spot at F −1(0.5), that is, unless there exist a
and b with a < b such that F (a) = F (b) = 0.5. MAD(Y ) may be unique even
if MED(Y ) is not, see Problem 2.7. If MED(Y ) is unique, then MAD(Y )
is unique unless F has flat spots at both F −1 (MED(Y ) − MAD(Y )) and
F −1(MED(Y ) + MAD(Y )). Moreover, MAD(Y ) is unique unless there exist
a1 < a2 and b1 < b2 such that F (a1) = F (a2), F (b1) = F (b2),
P (ai ≤ Y ≤ bi) = F (bi) − F (ai−) ≥ 0.5,
and
P (Y ≤ ai ) + P (Y ≥ bi ) = F (ai) + 1 − F (bi−) ≥ 0.5
CHAPTER 2. THE LOCATION MODEL 58
for i = 1, 2. The following lemma gives some simple bounds for MAD(Y ).
Lemma 2.9. Assume MED(Y ) and MAD(Y ) are unique. a) Then
min{MED(Y ) − F −1(0.25), F −1 (0.75) − MED(Y )} ≤ MAD(Y ) ≤
max{MED(Y ) − F −1(0.25), F −1 (0.75) − MED(Y )}. (2.30)
b) If Y is symmetric about µ = F −1(0.5), then the three terms in a) are
equal.
c) If the distribution is symmetric about zero, then MAD(Y ) = F −1 (0.75).
d) If Y is symmetric and continuous with a finite second moment, then MAD(Y) ≤ √(2VAR(Y)).
e) Suppose Y ∈ [a, b]. Then
0 ≤ MAD(Y ) ≤ m = min{MED(Y ) − a, b − MED(Y )} ≤ (b − a)/2,
and the inequalities are sharp.
Proof. a) This result follows since half the mass is between the upper
and lower quartiles and the median is between the two quartiles.
b) and c) are corollaries of a).
d) This inequality holds by Chebyshev’s inequality, since
$$P(|Y - E(Y)| \ge \mathrm{MAD}(Y)) = 0.5 \ge P(|Y - E(Y)| \ge \sqrt{2\mathrm{VAR}(Y)}),$$
and E(Y ) = MED(Y ) for symmetric distributions with finite second mo-
ments.
e) Note that if MAD(Y ) > m, then either MED(Y ) − MAD(Y ) < a
or MED(Y ) + MAD(Y ) > b. Since at least half of the mass is between a
and MED(Y ) and between MED(Y ) and b, this contradicts the definition of
MAD(Y ). To see that the inequalities are sharp, note that if at least half of
the mass is at some point c ∈ [a, b], then MED(Y) = c and MAD(Y) = 0.
If each of the points a, b, and c has 1/3 of the mass where a < c < b, then
MED(Y ) = c and MAD(Y ) = m. QED
Many other results for MAD(Y) and MAD(n) are possible. For example, note that Lemma 2.9 b) implies that when Y is symmetric, MAD(Y) = F^{-1}(3/4) − µ and F(µ + MAD(Y)) = 3/4. Also note that when Y is symmetric, MAD(Y) and the interquartile range IQR(Y) are related by
$$2\mathrm{MAD}(Y) = \mathrm{IQR}(Y) \equiv F^{-1}(0.75) - F^{-1}(0.25)$$
2.10 Summary
1) Given a small data set, recall that
$$\overline{Y} = \frac{\sum_{i=1}^n Y_i}{n},$$
the sample variance
$$\mathrm{VAR}(n) = S^2 = S_n^2 = \frac{\sum_{i=1}^n (Y_i - \overline{Y})^2}{n - 1} = \frac{\sum_{i=1}^n Y_i^2 - n(\overline{Y})^2}{n - 1},$$
and the sample standard deviation (SD)
$$S = S_n = \sqrt{S_n^2}.$$
The sample median
$$\mathrm{MED}(n) = Y_{((n+1)/2)} \text{ if } n \text{ is odd}, \qquad \mathrm{MED}(n) = \frac{Y_{(n/2)} + Y_{((n/2)+1)}}{2} \text{ if } n \text{ is even}.$$
The notation MED(n) = MED(Y1 , ..., Yn) will also be used. To find the
sample median, sort the data from smallest to largest and find the middle
value or values.
The sample median absolute deviation is MAD(n) = MED(|Yi − MED(n)|, i = 1, ..., n).
To find MAD(n), find Di = |Yi − MED(n)|, then find the sample median
of the Di by ordering them from smallest to largest and finding the middle
value or values.
2) Find the population median M = MED(Y ) by solving the equation
F (M) = 0.5 for M where the cdf F (y) = P (Y ≤ y). If Y has a pdf f(y)
that is symmetric about µ, then M = µ. If W = a + bY, then MED(W ) =
a + bMED(Y ). Often a = µ and b = σ.
3) To find the population median absolute deviation D = MAD(Y ), first
find M = MED(Y ) as in 2) above.
a) Then solve F (M + D) − F (M − D) = 0.5 for D.
b) If Y has a pdf that is symmetric about µ, then let U = y0.75 where
P (Y ≤ yα) = α, and yα is the 100αth percentile of Y for 0 < α < 1.
Hence M = y0.5 is the 50th percentile and U is the 75th percentile. Solve
F (U) = 0.75 for U. Then D = U − M.
c) If W = a + bY, then MAD(W ) = |b|MAD(Y ).
MED(Y ) and MAD(Y ) need not be unique, but for “brand name” con-
tinuous random variables, they are unique.
4) A large sample 100(1 − α)% confidence interval (CI) for θ is
$$\hat{\theta} \pm t_{p,\, 1-\frac{\alpha}{2}}\, SE(\hat{\theta})$$
$$T_n = T_n(L_n, U_n) = \frac{1}{U_n - L_n}\sum_{i=L_n+1}^{U_n} Y_{(i)}$$
where Ln = ⌊n/4⌋ and Un = n − Ln. That is, order the data, delete the Ln smallest cases and the Ln largest cases and take the sample mean of the remaining Un − Ln cases. The 25% trimmed mean is estimating the population truncated mean
$$\mu_T = \int_{y_{0.25}}^{y_{0.75}} 2y f_Y(y)\, dy.$$
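A minimal R/Splus sketch of the 25% trimmed mean just described (the function name tmn25 is not from the text; the text's own function is called tmn):

tmn25 <- function(y)
{ # 25% trimmed mean: delete the Ln smallest and Ln largest cases
  n <- length(y)
  Ln <- floor(n/4)
  Un <- n - Ln
  mean(sort(y)[(Ln + 1):Un])
}
tmn25(rexp(10000))    # compare with mean(rexp(10000))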
2.11 Complements
Chambers, Cleveland, Kleiner and Tukey (1983) is an excellent source for
graphical procedures such as quantile plots, QQ-plots, and box plots.
The confidence intervals and tests for the sample median and 25% trimmed
mean can be modified for censored data as can the robust point estimators
based on MED(n) and MAD(n). Suppose that Y(R+1) , ..., Y(n) have been right
censored (similar results hold for left censored data). Then create a pseudo
sample Z(i) = Y(R) for i > R and Z(i) = Y(i) for i ≤ R. Then compute the
robust estimators based on Z1, ..., Zn. These estimators will be identical to the
estimators based on Y1 , ..., Yn (no censoring) if the amount of right censoring
is moderate. For a one parameter family, nearly half of the data can be right
censored if the estimator is based on the median. If the sample median and
MAD are used for a two parameter family, the proportion of right censored
data depends on the skewness of the distribution. Symmetric data can tol-
erate nearly 25% right censoring, right skewed data a larger percentage, and
left skewed data a smaller percentage. See Olive (2006). He and Fung (1999)
present an alternative robust method that also works well for censored data.
Huber (1981, p. 74-75) and Chen (1998) show that the sample median
minimizes the asymptotic bias for estimating MED(Y ) for the family of sym-
metric contaminated distributions, and Huber (1981) concludes that since
the asymptotic variance is going to zero for reasonable estimators, MED(n)
is the estimator of choice for large n. Hampel, Ronchetti, Rousseeuw, and
Stahel (1986, p. 133-134, 142-143) contains some other optimality properties
of MED(n) and MAD(n). Price and Bonnett (2001), McKean and Schrader
(1984) and Bloch and Gastwirth (1968) are useful references for estimating
the SE of the sample median.
Several other approximations for the standard error of the sample median
SE(MED(n)) could be used.
a) McKean and Schrader (1984) proposed
$$SE(\mathrm{MED}(n)) = \frac{Y_{(n-c+1)} - Y_{(c)}}{2z_{1-\frac{\alpha}{2}}}$$
where $c = (n+1)/2 - z_{1-\alpha/2}\sqrt{n/4}$ is rounded up to the nearest integer. This estimator was based on the half length of a distribution free 100(1 − α)% CI (Y(c), Y(n−c+1)) for MED(Y). Use the tp approximation with $p = 2\sqrt{n} - 1$.
b) This proposal is also due to Bloch and Gastwirth (1968). Let Un = n − Ln where Ln = n/2 − 0.5n^{0.8}.
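For example, the McKean and Schrader SE in a) is easy to compute. The sketch below (the function name msci is not from the text) uses α = 0.05 and the t approximation with the degrees of freedom described above, which is an assumption about how the approximation is applied:

msci <- function(y, alpha = 0.05)
{ # CI for MED(Y) using the McKean and Schrader SE of the sample median
  n <- length(y)
  ys <- sort(y)
  cc <- ceiling((n + 1)/2 - qnorm(1 - alpha/2)*sqrt(n/4))
  se <- (ys[n - cc + 1] - ys[cc])/(2*qnorm(1 - alpha/2))
  p <- 2*sqrt(n) - 1                       # df used with the t approximation
  median(y) + c(-1, 1)*qt(1 - alpha/2, df = p)*se
}
msci(rnorm(100))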
2.12 Problems
PROBLEMS WITH AN ASTERISK * ARE ESPECIALLY USE-
FUL.
2.1. Write the location model in matrix form.
2.2. Let fY (y) be the pdf of Y. If W = µ + Y where −∞ < µ < ∞, show
that the pdf of W is fW (w) = fY (w − µ).
2.3. Let fY (y) be the pdf of Y. If W = σY where σ > 0, show that the
pdf of W is fW (w) = (1/σ)fY (w/σ).
2.4. Let fY (y) be the pdf of Y. If W = µ + σY where −∞ < µ < ∞ and
σ > 0, show that the pdf of W is fW (w) = (1/σ)fY ((w − µ)/σ).
2.5. Use Theorem 2.8 to find the limiting distribution of √n(MED(n) − MED(Y)).
2.6. The interquartile range IQR(n) = ξ̂n,0.75 − ξ̂n,0.25 and is a popular estimator of scale. Use Theorem 2.6 to show that
$$\sqrt{n}\,\frac{1}{2}\,(\mathrm{IQR}(n) - \mathrm{IQR}(Y)) \xrightarrow{D} N(0, \sigma_A^2)$$
where
$$\sigma_A^2 = \frac{1}{64}\left[\frac{3}{[f(\xi_{3/4})]^2} - \frac{2}{f(\xi_{3/4})f(\xi_{1/4})} + \frac{3}{[f(\xi_{1/4})]^2}\right].$$
2.7. Let the pdf of Y be f(y) = 1 if 0 < y < 0.5 or if 1 < y < 1.5. Assume
that f(y) = 0, otherwise. Then Y is a mixture of two uniforms, one U(0, 0.5)
and the other U(1, 1.5). Show that the population median MED(Y ) is not
unique but the population mad MAD(Y ) is unique.
2.8. a) Let Ln = 0 and Un = n. Prove that SERM(0, n) = S/√n. In other words, the SE given by Definition 2.16 reduces to the SE for the sample mean if there is no trimming.
b) Prove Remark 2.3:
$$V_{SW}(L_n, U_n) = \frac{S_n^2(d_1, ..., d_n)}{[(U_n - L_n)/n]^2}.$$
2.9. Find a 95% CI for µT based on the 25% trimmed mean for the
following data sets. Follow Examples 2.12 and 2.13 closely with Ln = 0.25n
and Un = n − Ln .
a) 6, 9, 9, 7, 8, 9, 9, 7
b) 66, 99, 9, 7, 8, 9, 9, 7
2.10. Consider the data set 6, 3, 8, 5, and 2. Show work.
a) Find the sample mean Y .
b) Find the standard deviation S
c) Find the sample median MED(n).
d) Find the sample median absolute deviation MAD(n).
2.11∗. The Cushny and Peebles data set (see Staudte and Sheather 1990,
p. 97) is listed below.
1.2 2.4 1.3 1.3 0.0 1.0 1.8 0.8 4.6 1.4
2.12∗. Consider the following data set on Spring 2004 Math 580 home-
work scores.
66.7 76.0 89.7 90.0 94.0 94.0 95.0 95.3 97.0 97.7
2.13∗. Consider the following data set on Spring 2004 Math 580 home-
work scores.
66.7 76.0 89.7 90.0 94.0 94.0 95.0 95.3 97.0 97.7
to simulate data similar to the Buxton heights. Make a plot similar to Figure
2.1 using the following R/Splus commands.
> par(mfrow=c(2,2))
> plot(height)
> title("a) Dot plot of heights")
> hist(height)
> title("b) Histogram of heights")
> length(height)
[1] 87
b) Compute the mean and 25% trimmed mean of 10000 simulated EXP(1)
random variables with the following commands.
y <- rexp(10000)
mean(y)
tmn(y)
metmn <-
function(x, k = 6)
{ # metrically trimmed mean: average the cases within k MAD of the median
madd <- mad(x, constant = 1)
med <- median(x)
mean(x[(x >= med - k * madd) & (x <= med + k * madd)])
}
y <- rnorm(10000)
metmn(y)
y <- rnorm(10000)
ratmn(y)
2.20. Download the R/Splus function rstmn that computes the two stage
symmetrically trimmed mean TS,n . Compute the TS,n for 10000 simulated
N(0, 1) random variables with the following commands.
y <- rnorm(10000)
rstmn(y)
2.21∗. a) Download the cci function which produces a classical CI. The
default is a 95% CI.
b) Compute a 95% CI for the artificial height data set created in Problem
2.15. Use the command cci(height).
2.22∗. a) Download the R/Splus function medci that produces a CI using
the median and the Bloch and Gastwirth SE.
b) Compute a 95% CI for the artificial height data set created in Problem
2.15. Use the command medci(height).
2.23∗. a) Download the R/Splus function tmci that produces a CI using
the 25% trimmed mean as a default.
b) Compute a 95% CI for the artificial height data set created in Problem
2.15. Use the command tmci(height).
2.24. a) Download the R/Splus function atmci that produces a CI using
TA,n .
b) Compute a 95% CI for the artificial height data set created in Problem
2.15. Use the command atmci(height).
2.25. a) Download the R/Splus function stmci that produces a CI using
TS,n .
b) Compute a 95% CI for the artificial height data set created in Problem
2.15. Use the command stmci(height).
2.26. a) Download the R/Splus function med2ci that produces a CI using
the median and SERM (Ln , Un ).
b) Compute a 95% CI for the artificial height data set created in Problem
2.15. Use the command med2ci(height).
2.27. a) Download the R/Splus function cgci that produces a CI using
TS,n and the coarse grid C = {0, 0.01, 0.1, 0.25, 0.40, 0.49}.
b) Compute a 95% CI for the artificial height data set created in Problem
2.15. Use the command cgci(height).
2.28. a) Bloch and Gastwirth (1968) suggest using
$$SE(\mathrm{MED}(n)) = \frac{\sqrt{n}}{4m}\left[Y_{([n/2]+m)} - Y_{([n/2]-m)}\right]$$
qplot<-
function(y)
{ # quantile plot: sorted data versus ppoints(y)
  plot(sort(y), ppoints(y))
  title("QPLOT")}
b) Make a Q plot of the height data from Problem 2.15 with the following
command.
qplot(height)
Y <- rnorm(1000)
qplot(Y)
Chapter 3
Some Useful Distributions
MED(Y) ≈ p − 2/3. See Pratt (1968, p. 1470) for more terms in the expansion of MED(Y).
Empirically,
$$\mathrm{MAD}(Y) \approx \frac{\sqrt{2p}}{1.483}\left(1 - \frac{2}{9p}\right)^2 \approx 0.9536\sqrt{p}.$$
Note that p ≈ MED(Y) + 2/3, and VAR(Y) ≈ 2MED(Y) + 4/3. Let i be an integer such that i ≤ w < i + 1. Then define rnd(w) = i if i ≤ w ≤ i + 0.5 and rnd(w) = i + 1 if i + 0.5 < w < i + 1. Then p ≈ rnd(MED(Y) + 2/3), and the approximation can be replaced by equality for p = 1, . . . , 100.
There are several normal approximations for this distribution. For p large, Y ≈ N(p, 2p), and
$$\sqrt{2Y} \approx N(\sqrt{2p}, 1).$$
Let
$$\alpha = P(Y \le \chi^2_{p,\alpha}) = \Phi(z_\alpha)$$
where Φ is the standard normal cdf. Then
$$\chi^2_{p,\alpha} \approx \frac{1}{2}\left(z_\alpha + \sqrt{2p}\right)^2.$$
The Wilson–Hilferty approximation is
$$\left(\frac{Y}{p}\right)^{1/3} \approx N\!\left(1 - \frac{2}{9p}, \frac{2}{9p}\right)$$
where χ2p,α and zα are the α percentiles of the χ2p and standard normal dis-
tributions, respectively. See Patel, Kapadia and Owen (1976, p. 194).
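These approximations are easy to check numerically. The R/Splus sketch below compares the two percentile approximations above with qchisq; the object names are illustrative only.

p <- 10; alpha <- c(0.05, 0.5, 0.95)
za <- qnorm(alpha)
approx1 <- 0.5*(za + sqrt(2*p))^2                 # (z_alpha + sqrt(2p))^2 / 2
wh <- p*(1 - 2/(9*p) + za*sqrt(2/(9*p)))^3        # Wilson-Hilferty percentile
exact <- qchisq(alpha, df = p)
cbind(alpha, approx1, wh, exact)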
A trimming rule is keep yi if
$$y_i \in \left[\mathrm{med}(n) \pm 10.0\left(1 + \frac{2.0}{n}\right)\mathrm{mad}(n)\right].$$
Note that F(θ + λ log(1000)) = 0.9995 ≈ F(MED(Y) + 10.0MAD(Y)).
where $P(Y \le \chi^2_{2n,\,\alpha/2}) = \alpha/2$ if Y is $\chi^2_{2n}$. See Patel, Kapadia and Owen (1976, p. 188).
If all the yi ≥ 0, then the trimming rule is keep yi if
$$0.0 \le y_i \le 9.0\left(1 + \frac{c_2}{n}\right)\mathrm{med}(n)$$
where c2 = 2.0 seems to work well. Note that P(Y ≤ 9.0MED(Y)) ≈ 0.998.
MED(Y ) = θ + λ log(2)
and
MAD(Y ) ≈ λ/2.0781.
Hence θ ≈ MED(Y ) − 2.0781 log(2)MAD(Y ). See Rousseeuw and Croux
(1993) for similar results. Note that 2.0781 log(2) ≈ 1.44.
A trimming rule is keep yi if
$$\mathrm{med}(n) - 1.44\left(1.0 + \frac{c_4}{n}\right)\mathrm{mad}(n) \le y_i \le \mathrm{med}(n) - 1.44\,\mathrm{mad}(n) + 9.0\left(1 + \frac{c_2}{n}\right)\mathrm{med}(n)$$
$$= 0.5\left[-e^{-\mathrm{MAD}/\lambda} + e^{\mathrm{MAD}/\lambda}\right]$$
assuming λ log(2) > MAD. Plug in MAD = λ/2.0781 to get the result.
$$f(y) = \frac{1}{\sigma}\exp\left(-\left(\frac{y-\theta}{\sigma}\right)\right)\exp\left[-\exp\left(-\left(\frac{y-\theta}{\sigma}\right)\right)\right]$$
where y and θ are real and σ > 0. (Then −Y has an extreme value distribution for the min or the log–Weibull distribution, see Problem 3.10.)
The cdf of Y is
$$F(y) = \exp\left[-\exp\left(-\left(\frac{y-\theta}{\sigma}\right)\right)\right].$$
This family is an asymmetric location–scale family with a mode at θ.
The mgf m(t) = exp(tθ)Γ(1 − σt) for |t| < 1/σ.
E(Y) ≈ θ + 0.57721σ, and VAR(Y) = σ²π²/6 ≈ 1.64493σ².
MAD(Y) ≈ 0.767049σ.
W = exp(−(Y − θ)/σ) ∼ EXP(1).
A trimming rule is keep yi if
$$f(y) = \frac{y^{\nu-1}e^{-y/\lambda}}{\lambda^{\nu}\Gamma(\nu)}$$
where ν, λ, and y are positive.
The mgf of Y is
$$m(t) = \left(\frac{1/\lambda}{\frac{1}{\lambda} - t}\right)^{\nu} = \left(\frac{1}{1 - \lambda t}\right)^{\nu}$$
for t < 1/λ. The chf
$$c(t) = \left(\frac{1}{1 - i\lambda t}\right)^{\nu}.$$
E(Y) = νλ.
VAR(Y) = νλ².
$$E(Y^r) = \frac{\lambda^r\,\Gamma(r + \nu)}{\Gamma(\nu)} \text{ if } r > -\nu.$$
Chen and Rubin (1986) show that λ(ν − 1/3) < MED(Y ) < λν = E(Y ).
Empirically, for ν > 3/2,
$$\mathrm{MED}(Y) \approx \lambda(\nu - 1/3),$$
and
$$\mathrm{MAD}(Y) \approx \frac{\lambda\sqrt{\nu}}{1.483}.$$
This family is a scale family for fixed ν, so if Y is G(ν, λ) then cY is G(ν, cλ)
for c > 0. If W is EXP(λ) then W is G(1, λ). If W is χ2p , then W is G(p/2, 2).
If Y and W are independent and Y is G(ν, λ) and W is G(φ, λ), then Y + W
is G(ν + φ, λ). Some classical estimates are given next. Let
$$w = \log\left[\frac{\overline{y}_n}{\text{geometric mean}(n)}\right].$$
Also
$$\hat{\nu}_{MLE} \approx \frac{0.5000876 + 0.1648852w - 0.0544274w^2}{w}$$
for 0 < w ≤ 0.5772, and
for 0.5772 < w ≤ 17. If w > 17 then estimation is much more difficult, but a rough approximation is ν̂ ≈ 1/w for w > 17. See Bowman and Shenton (1988, p. 46) and Greenwood and Durand (1960). Finally, λ̂ = ȳn/ν̂. Notice that λ̂
may not be very good if ν̂ < 1/17. For some M–estimators, see Marazzi and
Ruffieux (1996).
Several normal approximations are available. For large ν, Y ≈ N(νλ, νλ²). The Wilson–Hilferty approximation says that for ν ≥ 0.5,
$$Y^{1/3} \approx N\!\left((\nu\lambda)^{1/3}\left(1 - \frac{1}{9\nu}\right),\; (\nu\lambda)^{2/3}\frac{1}{9\nu}\right).$$
If
$$\alpha = P[Y \le G_\alpha],$$
then
$$G_\alpha \approx \nu\lambda\left[z_\alpha\sqrt{\frac{1}{9\nu}} + 1 - \frac{1}{9\nu}\right]^3$$
where zα is the standard normal percentile, α = Φ(zα ). Bowman and Shenton
(1988, p. 101) include higher order terms.
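A quick numerical check of the Gα approximation against qgamma (base R parameterizes the gamma distribution by shape ν and scale λ); the object names are illustrative only.

nu <- 4; lambda <- 2; alpha <- c(0.05, 0.5, 0.95)
za <- qnorm(alpha)
Ga <- nu*lambda*(za*sqrt(1/(9*nu)) + 1 - 1/(9*nu))^3   # Wilson-Hilferty percentile
exact <- qgamma(alpha, shape = nu, scale = lambda)
cbind(alpha, Ga, exact)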
Next we give some trimming rules. Assume each yi > 0. Assume ν ≥ 0.5.
Rule 1. Assume λ is known. Let ν̂ = (med(n)/λ) + (1/3). Keep yi if yi ∈ [lo, hi] where
$$lo = \max\left(0,\; \hat{\nu}\lambda\left[-(3.5 + 2/n)\sqrt{\frac{1}{9\hat{\nu}}} + 1 - \frac{1}{9\hat{\nu}}\right]^3\right),$$
and
$$hi = \hat{\nu}\lambda\left[(3.5 + 2/n)\sqrt{\frac{1}{9\hat{\nu}}} + 1 - \frac{1}{9\hat{\nu}}\right]^3.$$
where
c ∈ [9, 15].
$$f(y) = \frac{2}{\sqrt{2\pi}\,\sigma}\exp\left(\frac{-(y - \mu)^2}{2\sigma^2}\right)$$
where σ > 0 and y ≥ µ and µ is real. Let Φ(y) denote the standard normal cdf. Then the cdf of Y is
$$F(y) = 2\Phi\left(\frac{y-\mu}{\sigma}\right) - 1$$
for y > µ and F(y) = 0, otherwise. This is an asymmetric location–scale family that has the same distribution as µ + σ|Z| where Z ∼ N(0, 1).
$$E(Y) = \mu + \sigma\sqrt{2/\pi} \approx \mu + 0.797885\sigma.$$
$$\mathrm{VAR}(Y) = \frac{\sigma^2(\pi - 2)}{\pi} \approx 0.363380\sigma^2.$$
Note that Z² ∼ χ²₁. Hence the formula for the rth moment of the χ²₁ distribution can be used to find the moments of Y. Also,
$$\Phi(y) \approx 0.5\left(1 + \sqrt{1 - \exp(-2y^2/\pi)}\right).$$
$$\overline{Y}_n = \hat{\mu} = \frac{1}{n}\sum_{i=1}^n Y_i \quad \text{and} \quad S^2 = S_Y^2 = \hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2.$$
The classical 100(1 − α)% CI for µ is
$$\left(\overline{Y}_n - t_{n-1,\,1-\frac{\alpha}{2}}\frac{S_Y}{\sqrt{n}},\;\; \overline{Y}_n + t_{n-1,\,1-\frac{\alpha}{2}}\frac{S_Y}{\sqrt{n}}\right)$$
$$z_\alpha \approx m - \frac{c_0 + c_1 m + c_2 m^2}{1 + d_1 m + d_2 m^2 + d_3 m^3}$$
where
$$m = [-2\log(1 - \alpha)]^{1/2},$$
c0 = 2.515517, c1 = 0.802853, c2 = 0.010328, d1 = 1.432788, d2 = 0.189269, d3 = 0.001308, and 0.5 ≤ α. For 0 < α < 0.5,
$$z_\alpha = -z_{1-\alpha}.$$
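The rational approximation above is simple to code. A minimal R/Splus sketch comparing it with qnorm (the function name zapprox is illustrative only):

zapprox <- function(alpha)
{ # normal percentile approximation for 0.5 <= alpha < 1
  m <- sqrt(-2*log(1 - alpha))
  m - (2.515517 + 0.802853*m + 0.010328*m^2)/
      (1 + 1.432788*m + 0.189269*m^2 + 0.001308*m^3)
}
alpha <- c(0.5, 0.9, 0.95, 0.975, 0.995)
cbind(alpha, zapprox(alpha), qnorm(alpha))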
$$E(Y^r) = \frac{\sigma^r}{1 - \lambda r} \text{ for } r < 1/\lambda.$$
$$\mathrm{MED}(Y) = \sigma\, 2^{\lambda}.$$
X = log(Y/σ) is EXP(λ) and W = log(Y ) is EXP(θ = log(σ), λ). Let
θ̂ = MED(W1, ..., Wn) − 1.440MAD(W1, ..., Wn). Then robust estimators are
σ̂ = eθ̂ and λ̂ = 2.0781MAD(W1 , ..., Wn).
A trimming rule is keep yi if
med(n) − 1.44mad(n) ≤ wi ≤ 10med(n) − 1.44mad(n)
where med(n) and mad(n) are applied to w1 , . . . , wn with wi = log(yi).
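A sketch of the robust estimators just described, based on W = log(Y); the function name parest and the simulation at the end are illustrative, not from the text.

parest <- function(y)
{ # robust Pareto estimators: sigma-hat = exp(theta-hat), lambda-hat = 2.0781 MAD(W)
  w <- log(y)
  theta <- median(w) - 1.440*mad(w, constant = 1)
  list(sigma = exp(theta), lambda = 2.0781*mad(w, constant = 1))
}
# example: Pareto(sigma = 1, lambda = 0.5) data via Y = sigma*exp(W), W ~ EXP(lambda)
y <- exp(rexp(1000, rate = 2))     # rate = 1/lambda
parest(y)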
$$P(Y = y) = \frac{e^{-\theta}\theta^{y}}{y!}$$
for y = 0, 1, . . . , where θ > 0.
The mgf of Y is m(t) = exp(θ(e^t − 1)), and the chf of Y is c(t) = exp(θ(e^{it} − 1)).
E(Y) = θ, and Chen and Rubin (1986) and Adell and Jodrá (2005) show that
−1 < MED(Y) − E(Y) < 1/3.
VAR(Y) = θ.
The classical estimator of θ is θ̂ = Ȳn.
The approximations Y ≈ N(θ, θ) and $2\sqrt{Y} \approx N(2\sqrt{\theta}, 1)$ are sometimes used.
Suppose each yi is a nonnegative integer. Then a trimming rule is keep yi if $w_i = 2\sqrt{y_i}$ is kept when a normal trimming rule is applied to the wi's. (This rule can be very bad if the normal approximation is not good.)
where σ > 0, µ is real, and y ≥ µ. See Cohen and Whitten (1988, Ch. 10).
This is an asymmetric location–scale family.
The cdf of Y is
$$F(y) = 1 - \exp\left[-\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^2\right]$$
for y ≥ µ, and F(y) = 0, otherwise.
$$E(Y) = \mu + \sigma\sqrt{\pi/2} \approx \mu + 1.253314\sigma.$$
$$\mathrm{VAR}(Y) = \sigma^2(4 - \pi)/2 \approx 0.429204\sigma^2.$$
$$\mathrm{MED}(Y) = \mu + \sigma\sqrt{\log(4)} \approx \mu + 1.17741\sigma.$$
Hence µ ≈ MED(Y) − 2.6255MAD(Y) and σ ≈ 2.230MAD(Y).
Let σD = MAD(Y). If µ = 0 and σ = 1, then
$$0.5 = \exp\left[-0.5\left(\sqrt{\log(4)} - D\right)^2\right] - \exp\left[-0.5\left(\sqrt{\log(4)} + D\right)^2\right].$$
$$f(y) = \frac{\Gamma(\frac{p+1}{2})}{(p\pi)^{1/2}\Gamma(p/2)}\left(1 + \frac{y^2}{p}\right)^{-(\frac{p+1}{2})}$$
where p is a positive integer and y is real. This family is symmetric about
0. The t1 distribution is the Cauchy(0, 1) distribution. If Z is N(0, 1) and is
independent of W ∼ χ²p, then
$$\frac{Z}{(W/p)^{1/2}}$$
is tp.
E(Y ) = 0 for p ≥ 2.
MED(Y ) = 0.
VAR(Y ) = p/(p − 2) for p ≥ 3, and
MAD(Y ) = tp,0.75 where P (tp ≤ tp,0.75) = 0.75.
If α = P(tp ≤ tp,α), then Cooke, Craven, and Clarke (1982, p. 84) suggest the approximation
$$t_{p,\alpha} \approx \sqrt{p\left[\exp\left(\frac{w_\alpha^2}{p}\right) - 1\right]}$$
where
$$w_\alpha = \frac{z_\alpha(8p + 3)}{8p + 1},$$
zα is the standard normal cutoff: α = Φ(zα ), and 0.5 ≤ α. If 0 < α < 0.5,
then
tp,α = −tp,1−α.
This approximation seems to get better as the degrees of freedom increase.
A trimming rule for p ≥ 3 is keep yi if yi ∈ [±5.2(1 + 10/n)mad(n)].
where med(n) is applied to w1, . . . , wn with wi = e^{yi} − 1. See Problem 3.8 for robust estimators.
where λ, y, and φ are all positive. For fixed φ, this is a scale family in σ = λ^{1/φ}.
The cdf of Y is F(y) = 1 − exp(−y^φ/λ) for y > 0.
E(Y) = λ^{1/φ}Γ(1 + 1/φ).
VAR(Y) = λ^{2/φ}Γ(1 + 2/φ) − (E(Y))².
$$E(Y^r) = \lambda^{r/\phi}\,\Gamma\left(1 + \frac{r}{\phi}\right) \text{ for } r > -\phi.$$
MED(Y) = (λ log(2))^{1/φ}.
Note that
$$\lambda = \frac{(\mathrm{MED}(Y))^{\phi}}{\log(2)}.$$
Since W = Y^φ is EXP(λ), if all the yi > 0 and if φ is known, then a cleaning rule is keep yi if
$$0.0 \le w_i \le 9.0\left(1 + \frac{2}{n}\right)\mathrm{med}(n)$$
where med(n) is applied to w1, . . . , wn with wi = yi^φ. See Olive (2006) and Problem 3.10c for robust estimators of φ and λ.
3.23 Complements
Many of the distribution results used in this chapter came from Johnson and
Kotz (1970a,b) and Patel, Kapadia and Owen (1976). Cohen and Whitten
(1988), Ferguson (1967), Castillo (1988), Cramér (1946), Kennedy and Gentle
(1980), Lehmann (1983), Meeker and Escobar (1998), Bickel and Doksum
(1977), DeGroot (1975), Hastings and Peacock (1975) and Leemis (1986) also
have useful results on distributions. Also see articles in Kotz and Johnson
(1982ab, 1983ab, 1985ab, 1986, 1988ab) and Armitage and Colton (1998a-
f). Often an entire book is devoted to a single distribution, see for example,
Bowman and Shenton (1988).
Many of the robust point estimators in this chapter are due to Olive
(2006). These robust estimators are usually inefficient, but can be used as
starting values for iterative procedures such as maximum likelihood and as a
quick check for outliers. These estimators can also be used to create a robust
fully efficient cross checking estimator. See He and Fung (1999).
If no outliers are present and the sample size is large, then the robust and classical methods should give similar estimates. If the estimates differ, then the data should be checked for outliers.
3.24 Problems
PROBLEMS WITH AN ASTERISK * ARE ESPECIALLY USE-
FUL.
3.1. Verify the formula for the cdf F for the following distributions.
a) Cauchy (µ, σ).
b) Double exponential (θ, λ).
c) Exponential (λ).
d) Logistic (µ, σ).
e) Pareto (σ, λ).
f) Power (λ).
g) Uniform (θ1, θ2 ).
h) Weibull W (φ, λ).
3.2∗. Verify the formula for MED(Y ) for the following distributions.
a) Exponential (λ).
b) Lognormal (µ, σ 2). (Hint: Φ(0) = 0.5.)
c) Pareto (σ, λ).
d) Power (λ).
e) Uniform (θ1 , θ2).
f) Weibull (φ, λ).
3.3∗. Verify the formula for MAD(Y ) for the following distributions.
(Hint: Some of the formulas may need to be verified numerically. Find the
cdf in the appropriate section of Chapter 3. Then find the population median
MED(Y ) = M. The following trick can be used except for part c). If the
distribution is symmetric, find U = y0.75. Then D = MAD(Y ) = U − M.)
a) Cauchy (µ, σ).
b) Double exponential (θ, λ).
c) Exponential (λ).
d) Logistic (µ, σ).
e) Normal (µ, σ 2).
f) Uniform (θ1, θ2 ).
3.4. Verify the formula for the expected value E(Y ) for the following
distributions.
a) Binomial (k, ρ).
b) Double exponential (θ, λ).
c) Exponential (λ).
d) gamma (ν, λ).
e) Logistic (µ, σ). (Hint from deCani and Stine (1986): Let Y = [µ + σW] so E(Y) = µ + σE(W) where W ∼ L(0, 1). Hence
$$E(W) = \int_{-\infty}^{\infty} y\,\frac{e^y}{[1 + e^y]^2}\,dy.$$
Now
$$uw\big|_0^1 = [v\log(v) + (1 - v)\log(1 - v)]\big|_0^1 = 0$$
since
$$\lim_{v \to 0} v\log(v) = 0.$$
Now
$$-\int_0^1 u\,dw = -\int_0^1 \frac{\log(v)}{1 - v}\,dv - \int_0^1 \frac{\log(1 - v)}{v}\,dv = 2\pi^2/6 = \pi^2/3$$
using
$$\int_0^1 \frac{\log(v)}{1 - v}\,dv = \int_0^1 \frac{\log(1 - v)}{v}\,dv = -\pi^2/6.)$$
$$\alpha = P[Y \le G_\alpha].$$
Using
$$Y^{1/3} \approx N\!\left((\nu\lambda)^{1/3}\left(1 - \frac{1}{9\nu}\right),\; (\nu\lambda)^{2/3}\frac{1}{9\nu}\right),$$
show that
$$G_\alpha \approx \nu\lambda\left[z_\alpha\sqrt{\frac{1}{9\nu}} + 1 - \frac{1}{9\nu}\right]^3$$
where zα is the standard normal percentile, α = Φ(zα).
3.7. Suppose that Y1 , ..., Yn are iid from a power (λ) distribution. Suggest
a robust estimator for λ
a) based on Yi and
b) based on Wi = − log(Yi ).
3.8. Suppose that Y1 , ..., Yn are iid from a truncated extreme value
TEV(λ) distribution. Find a robust estimator for λ
a) based on Yi and
b) based on Wi = eYi − 1.
3.9. Other parameterizations for the Rayleigh distribution are possible.
For example, take µ = 0 and λ = 2σ 2. Then W is Rayleigh RAY(λ), if the
pdf of W is
2w
f(w) = exp(−w2 /λ)
λ
where λ and w are both positive.
The cdf of W is F (w) = 1 − exp(−w2/λ) for w > 0.
E(W ) = λ1/2 Γ(1 + 1/2).
VAR(W ) = λΓ(2) − (E(W ))2.
r
E(W r ) = λr/2 Γ(1 + ) for r > −2.
2
MED(W ) = λ log(2).
W is RAY(λ) if W is Weibull W (λ, 2). Thus W 2 ∼ EXP(λ). If all wi > 0,
then a trimming rule is keep wi if 0 ≤ wi ≤ 3.0(1 + 2/n)MED(n).
a) Find the median MED(W ).
$$F(y) = \frac{\exp[(y - \mu)/\sigma] - 1}{1 + \exp[(y - \mu)/\sigma]}$$
Chapter 4
Truncated Distributions
d) If c = d then
$$\sigma_W^2(a, b) = (\beta - \alpha)\sigma_T^2(a, b) + [\alpha - \alpha^2 + 1 - \beta - (1 - \beta)^2 + 2\alpha(1 - \beta)]d^2.$$
Proof. We will prove b) since its proof contains the most algebra. Now
$$\sigma_W^2 = \alpha(\mu_T - c)^2 + (\beta - \alpha)(\sigma_T^2 + \mu_T^2) + (1 - \beta)(\mu_T + d)^2 - \mu_W^2.$$
and
b)
$$E(Y^2) = 2\lambda^2\,\frac{1 - \frac{1}{2}(k^2 + 2k + 2)e^{-k}}{1 - e^{-k}}.$$
See Problem 4.9 for a related result.
Proof. a) Note that
$$c_k E(Y) = \int_0^{k\lambda} \frac{y}{\lambda}e^{-y/\lambda}\,dy = -ye^{-y/\lambda}\Big|_0^{k\lambda} + \int_0^{k\lambda} e^{-y/\lambda}\,dy$$
(use integration by parts). So
$$c_k E(Y) = -k\lambda e^{-k} + \lambda(1 - e^{-k}) = \lambda\left[1 - (k + 1)e^{-k}\right].$$
b) Note that
$$c_k E(Y^2) = \int_0^{k\lambda} \frac{y^2}{\lambda}e^{-y/\lambda}\,dy.$$
Since
$$\frac{d}{dy}\left[-(y^2 + 2\lambda y + 2\lambda^2)e^{-y/\lambda}\right] = \frac{1}{\lambda}e^{-y/\lambda}(y^2 + 2\lambda y + 2\lambda^2) - e^{-y/\lambda}(2y + 2\lambda) = \frac{1}{\lambda}y^2 e^{-y/\lambda},$$
we have
$$c_k E(Y^2) = \left[-(y^2 + 2\lambda y + 2\lambda^2)e^{-y/\lambda}\right]_0^{k\lambda} = 2\lambda^2\left[1 - \tfrac{1}{2}(k^2 + 2k + 2)e^{-k}\right].$$
for a ≤ y ≤ b.
Lemma 4.4. a) E(Y) = µ.
b)
$$\mathrm{VAR}(Y) = 2\lambda^2\,\frac{1 - \frac{1}{2}(c^2 + 2c + 2)e^{-c}}{1 - e^{-c}}.$$
where φ is the standard normal pdf and Φ is the standard normal cdf. The
indicator function
I[a,b](y) = 1 if a ≤ y ≤ b
and is zero otherwise.
Lemma 4.5.
$$E(Y) = \mu + \left[\frac{\phi(\frac{a-\mu}{\sigma}) - \phi(\frac{b-\mu}{\sigma})}{\Phi(\frac{b-\mu}{\sigma}) - \Phi(\frac{a-\mu}{\sigma})}\right]\sigma,$$
and
$$\mathrm{VAR}(Y) = \sigma^2\left[1 + \frac{(\frac{a-\mu}{\sigma})\phi(\frac{a-\mu}{\sigma}) - (\frac{b-\mu}{\sigma})\phi(\frac{b-\mu}{\sigma})}{\Phi(\frac{b-\mu}{\sigma}) - \Phi(\frac{a-\mu}{\sigma})}\right] - \sigma^2\left[\frac{\phi(\frac{a-\mu}{\sigma}) - \phi(\frac{b-\mu}{\sigma})}{\Phi(\frac{b-\mu}{\sigma}) - \Phi(\frac{a-\mu}{\sigma})}\right]^2.$$
$$E(Y^2) = 2\mu E(Y) - \mu^2 + \sigma^2\left[\frac{(\frac{a-\mu}{\sigma})\phi(\frac{a-\mu}{\sigma}) - (\frac{b-\mu}{\sigma})\phi(\frac{b-\mu}{\sigma})}{c} + 1\right]$$
where c = Φ((b − µ)/σ) − Φ((a − µ)/σ). Using
$$\mathrm{VAR}(Y) = E(Y^2) - (E(Y))^2$$
gives the result. QED
Corollary 4.6. Let Y be TN(µ, σ², a = µ − kσ, b = µ + kσ). Then E(Y) = µ and
$$\mathrm{VAR}(Y) = \sigma^2\left[1 - \frac{2k\phi(k)}{2\Phi(k) - 1}\right].$$
Table 4.1: VAR(Y) for the truncated normal TN(µ, σ², µ − kσ, µ + kσ)
k     VAR(Y)
2.0   0.774σ²
2.5   0.911σ²
3.0   0.973σ²
3.5   0.994σ²
4.0   0.999σ²
Proof. Use the symmetry of φ, the fact that Φ(−x) = 1 − Φ(x), and the
above lemma to get the result. QED
Examining VAR(Y ) for several values of k shows that the T N(µ, σ 2, a =
µ − kσ, b = µ + kσ) distribution does not change much for k > 3.0. See Table
4.1.
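Corollary 4.6 is easy to check numerically. The R/Splus sketch below reproduces the entries of Table 4.1; the function name vtn is illustrative only.

vtn <- function(k) 1 - 2*k*dnorm(k)/(2*pnorm(k) - 1)  # VAR(Y)/sigma^2 for TN(mu, sigma^2, mu - k sigma, mu + k sigma)
k <- c(2, 2.5, 3, 3.5, 4)
round(vtn(k), 3)
# 0.774 0.911 0.973 0.994 0.999, matching Table 4.1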
See, for example, Bickel (1965). This formula is useful since the variance of
the truncated distribution σT2 (a, b) has been computed for several distribu-
tions in the previous sections.
Definition 4.4. An estimator Dn is a location and scale equivariant estimator if Dn(α + βY1, ..., α + βYn) = α + βDn(Y1, ..., Yn).
If Xi = α + βYi and √n[Dn(Y) − µD(FY)] →D N(0, σ²D), then
$$\sqrt{n}\,[D_n(\boldsymbol{X}) - \mu_D(F_X)] = \sqrt{n}\,[\alpha + \beta D_n(\boldsymbol{Y}) - (\alpha + \beta\mu_D(F_Y))] \xrightarrow{D} N(0, \beta^2\sigma_D^2).$$
where α = Φ(−z), and z = kΦ−1 (0.75). For the two stage estimators, round
100α up to the nearest integer J. Then use αJ = J/100 and zJ = −Φ−1 (αJ )
in Equation (4.4).
Proof. If Y follows the normal N(µ, σ 2) distribution, then a = µ −
kMAD(Y ) and b = µ+kMAD(Y ) where MAD(Y ) = Φ−1 (0.75)σ. It is enough
4.6 Simulation
In statistics, simulation uses computer generated pseudo-random variables in
place of real data. This artificial data can be used just like real data to pro-
duce histograms and confidence intervals and to compare estimators. Since
the artificial data is under the investigator’s control, often the theoretical
behavior of the statistic is known. This knowledge can be used to estimate
population quantities (such as MAD(Y )) that are otherwise hard to compute
and to check whether software is running correctly.
Example 4.3. The R/Splus software is especially useful for generating
random variables. The command
Y <- rnorm(100)
creates a vector Y that contains 100 pseudo iid N(0,1) variables. More gen-
erally, the command
Y <- rnorm(100,10,sd=4)
creates a vector Y that contains 100 pseudo iid N(10, 16) variables since
42 = 16. To study the sampling distribution of Y n , we could generate K
N(0, 1) samples of size n, and compute Y n,1 , ..., Y n,K where the notation
Y n,j denotes the sample mean of the n pseudo-variates from the jth sample.
The command
M <- matrix(rnorm(1000),nrow=100,ncol=10)
creates a 100 × 10 matrix containing 100 samples of size 10. (Note that
100(10) = 1000.) The command
M100 <- apply(M,1,mean)
creates the vector M100 of length 100 which contains Ȳn,1, ..., Ȳn,K where
K = 100 and n = 10. A histogram from this vector should resemble the pdf
of a N(0, 0.1) random variable. The sample mean and variance of the 100
vector entries should be close to 0 and 0.1, respectively.
Example 4.4. Similarly the commands
gets the sample mean for each (row) sample of 10 observations. The command
M <- matrix(rexp(10000),nrow=100,ncol=100)
creates a 100 × 100 matrix containing 100 samples of size 100 exponential(1)
(pseudo) variates. (Note that 100(100) = 10000.) The command
gets the sample mean for each (row) sample of 100 observations. The com-
mands
will make histograms of the 100 sample means. The first histogram should
be more skewed than the second, illustrating the central limit theorem.
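Since the individual commands for Example 4.4 are not reproduced above, the sketch below shows one way to carry out the simulation and make the two histograms; the object names M10 and M100 are illustrative, not from the text.

M <- matrix(rexp(1000), nrow = 100, ncol = 10)
M10 <- apply(M, 1, mean)        # 100 sample means of exponential(1) samples, n = 10
M <- matrix(rexp(10000), nrow = 100, ncol = 100)
M100 <- apply(M, 1, mean)       # 100 sample means of exponential(1) samples, n = 100
par(mfrow = c(2, 1))
hist(M10)                       # more skewed
hist(M100)                      # closer to normal, illustrating the CLT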
Example 4.5. As a slightly more complicated example, suppose that
it is desired to approximate the value of MAD(Y ) when Y is the mixture
distribution with cdf F (y) = 0.95Φ(y) + 0.05Φ(y/3). That is, roughly 95% of
the variates come from a N(0, 1) distribution and 5% from a N(0, 9) distribu-
tion. Since MAD(n) is a good estimator of MAD(Y ), the following R/Splus
commands can be used to approximate MAD(Y ).
0.95*pnorm(.7) + 0.05*pnorm(.7/3)
which gives the value 0.749747. Hence the approximation was quite good.
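The simulation commands for Example 4.5 are not reproduced above; a minimal sketch of one way to do the approximation is given below. The object names are illustrative only, and the pnorm line repeats the check that F(0.7) ≈ 0.75 for this mixture.

n <- 100000
w <- rbinom(n, 1, 0.05)                       # 1 with probability 0.05
y <- rnorm(n, sd = ifelse(w == 1, 3, 1))      # mixture 0.95 N(0,1) + 0.05 N(0,9)
mad(y, constant = 1)                          # approximates MAD(Y), roughly 0.7
0.95*pnorm(0.7) + 0.05*pnorm(0.7/3)           # F(0.7) for the mixture, about 0.7497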
Definition 4.5. Let T1,n and T2,n be two estimators of a parameter τ such that
$$n^{\delta}(T_{1,n} - \tau) \xrightarrow{D} N(0, \sigma_1^2(F))$$
Table 4.2: Simulated scaled variances n VAR(Dn)
F        n     Ȳ      MED(n)  1% TM   TS,n
N(0,1)   10    1.116  1.454   1.116   1.166
N(0,1)   50    0.973  1.556   0.973   0.974
N(0,1)   100   1.040  1.625   1.048   1.044
N(0,1)   1000  1.006  1.558   1.008   1.010
N(0,1)   ∞     1.000  1.571   1.004   1.004
DE(0,1)  10    1.919  1.403   1.919   1.646
DE(0,1)  50    2.003  1.400   2.003   1.777
DE(0,1)  100   1.894  0.979   1.766   1.595
DE(0,1)  1000  2.080  1.056   1.977   1.886
DE(0,1)  ∞     2.000  1.000   1.878   1.804
and
$$n^{\delta}(T_{2,n} - \tau) \xrightarrow{D} N(0, \sigma_2^2(F)),$$
then the asymptotic relative efficiency of T1,n with respect to T2,n is
$$AE(T_{1,n}, T_{2,n}) = \frac{\sigma_2^2(F)}{\sigma_1^2(F)} = \frac{AV(T_{2,n})}{AV(T_{1,n})}.$$
This definition brings up several issues. First, both estimators must have
the same convergence rate nδ . Usually δ = 0.5. If Ti,n has convergence rate
nδi , then estimator T1,n is judged to be better than T2,n if δ1 > δ2. Secondly,
the two estimators need to estimate the same parameter τ. This condition
will often not hold unless the distribution is symmetric about µ. Then τ = µ
is a natural choice. Thirdly, robust estimators are often judged by their
Gaussian efficiency with respect to the sample mean (thus F is the normal
distribution). Since the normal distribution is a location–scale family, it is
often enough to compute the AE for the standard normal distribution. If the
data come from a distribution F and the AE can be computed, then T1,n is
judged to be a better estimator at the data than T2,n if the AE > 1.
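For example, the AE of the sample median with respect to the sample mean at N(0,1) data (Problem 4.7 shows AE ≈ 0.64) can be approximated by simulation. The sketch below is an illustration, not a function from the text.

K <- 5000; n <- 100
means <- medians <- numeric(K)
for(i in 1:K) {
  y <- rnorm(n)
  means[i] <- mean(y)
  medians[i] <- median(y)
}
var(means)/var(medians)   # approximates AE(MED(n), Ybar) = AV(Ybar)/AV(MED(n)), about 0.64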
In simulation studies, typically the underlying distribution F belongs to a
symmetric location–scale family. There are at least two reasons for using such
distributions. First, if the distribution is symmetric, then the population
in the rows n = ∞. The simulations were performed for normal and double
exponential data, and the simulated values are close to the theoretical values.
In order for a location estimator to be used for inference, there must exist
a useful SE and a useful cutoff value td where the degrees of freedom d is
a function of n. Two criteria will be used to evaluate the CI’s. First, the
observed coverage is the proportion of the K = 500 runs for which the CI
contained the parameter estimated by Dn . This proportion should be near
the nominal coverage 0.95. Notice that if W is the proportion of runs where
the CI contains the parameter,
then KW is a binomial random variable. Hence the SE of W is $\sqrt{\hat{p}(1 - \hat{p})/K} \approx 0.013$ for the observed proportion p̂ ∈ [0.9, 0.95], and an observed coverage between 0.92 and 0.98 suggests that the observed coverage is close to the nominal coverage of 0.95.
The second criterion is the scaled length of the CI = √n · CI length =
$$\sqrt{n}\,(2)(t_{d,0.975})(SE(D_n)) \approx 2(1.96)(\sigma_D)$$
where the approximation holds if d > 30, if $\sqrt{n}(D_n - \mu_D) \xrightarrow{D} N(0, \sigma_D^2)$, and if SE(Dn) is a good estimator of $\sigma_D/\sqrt{n}$ for the given value of n.
Tables 4.3 and 4.4 can be used to examine the six different interval esti-
mators. A good estimator should have an observed coverage p̂ ∈ [.92, .98],
and a small scaled length. In Table 4.3, coverages were good for N(0, 1)
data, except the interval (v) where SERM (Ln , Un ) is slightly too small for
n ≤ 100. The coverages for the C(0,1) and DE(0,1) data were all good even
for n = 10.
For the mixture 0.75N(0, 1) + 0.25N(100, 1), the “coverage” counted the
number of times 0 was contained in the interval and divided the result by 500.
These rows do not give a genuine coverage since the parameter µD estimated
by Dn is not 0 for any of these estimators. For example Y estimates µ = 25.
Since the median, 25% trimmed mean, and TS,n trim the same proportion of
cases to the left as to the right, MED(n) is estimating MED(Y ) ≈ Φ−1 (2/3) ≈
0.43 while the parameter estimated by TS,n is approximately the mean of a
truncated standard normal random variable where the truncation points are
Φ−1 (.25) and ∞. The 25% trimmed mean also has trouble since the number
of outliers is a binomial(n, 0.25) random variable. Hence approximately half
of the samples have more than 25% outliers and approximately half of the
samples have less than 25% outliers. This fact causes the 25% trimmed mean
to have great variability. The parameter estimated by TA,n is zero to several
decimal places. Hence the coverage of the TA,n interval is quite high.
The exponential(1) distribution is skewed, so the central limit theorem
is not a good approximation for n = 10. The estimators Y , TA,n , TS,n , MED(n)
and the 25% trimmed mean are estimating the parameters 1, 0.89155, 0.83071,
log(2) and 0.73838 respectively. Now the coverages of TA,n and TS,n are
slightly too small. For example, TS,n is asymptotically equivalent to the 10%
trimmed mean since the metrically trimmed mean truncates the largest 9.3%
of the cases, asymptotically. For small n, the trimming proportion will be
quite variable and the mean of a truncated exponential distribution with
the largest γ percent of cases trimmed varies with γ. This variability of the
truncated mean does not occur for symmetric distributions if the trimming
is symmetric since then the truncated mean µT is the point of symmetry
regardless of the amount of truncation.
Examining Table 4.4 for N(0,1) data shows that the scaled lengths of the first 3 intervals are about the same. The rows labeled ∞ give the scaled length 2(1.96)(σD) expected if √n SE is a good estimator of σD. The median interval and 25% trimmed mean interval are noticeably larger than the classical interval. Since the degrees of freedom d ≈ √n for the median intervals, td,0.975 is considerably larger than 1.96 = z0.975 for n ≤ 100.
The intervals for the C(0,1) and DE(0,1) data behave about as expected.
The classical interval is very long at C(0,1) data since the first moment of
C(0,1) data does not exist. Notice that for n ≥ 50, all of the resistant
intervals are shorter on average than the classical intervals for DE(0,1) data.
For the mixture distribution, examining the length of the interval should
be fairer than examining the “coverage.” The length of the 25% trimmed
mean is long since about half of the time the trimmed data contains no
outliers while half of the time the trimmed data does contain outliers. When
n = 100, the length of the TS,n interval is quite long. This occurs because
the TS,n will usually trim all outliers, but the actual proportion of outliers
is binomial(100, 0.25). Hence TS,n is sometimes the 20% trimmed mean and
sometimes the 30% trimmed mean. But the parameter µT estimated by the
γ % trimmed mean varies quite a bit with γ. When n = 1000, the trimming
proportion is much less variable, and the CI length is shorter.
For exponential(1) data, 2(1.96)(σD ) = 3.9199 for Y and MED(n). The
25% trimmed mean appears to be the best of the six intervals since the scaled
length is the smallest while the coverage is good.
4.7 Complements
Several points about resistant location estimators need to be made. First,
by far the most important step in analyzing location data is to
check whether outliers are present with a plot of the data. Secondly,
no single procedure will dominate all other procedures. In particular, it is
unlikely that the sample mean will be replaced by a robust estimator. The
sample mean works very well for distributions with second moments if the
second moment is small. In particular, the sample mean works well for many
skewed and discrete distributions. Thirdly, the mean and the median should
usually both be computed. If a CI is needed and the data is thought to be
symmetric, several resistant CI’s should be computed and compared with the
classical interval. Fourthly, in order to perform hypothesis testing, plausible
values for the unknown parameter must be given. The mean and median of
the population are fairly simple parameters even if the population is skewed
while the truncated population mean is considerably more complex.
With some robust estimators, it is very difficult to determine what the es-
timator is estimating if the population is not symmetric. In particular, the
difficulty in finding plausible values of the population quantities estimated
by M, L, and R estimators may be one reason why these estimators are not
widely used. For testing hypotheses, the following population quantities are
listed in order of increasing complexity.
1. The population median MED(Y ).
Bickel (1965), Prescott (1978), and Olive (2001) give formulas similar to
Equations (4.4) and (4.5). Gross (1976), Guenther (1969) and Lax (1985) are
useful references for confidence intervals. Andrews, Bickel, Hampel, Huber,
Rogers and Tukey (1972) is a very well known simulation study for robust
location estimators.
In Section 4.6, only intervals that are simple to compute by hand for
sample sizes of ten or so were considered. The interval based on MED(n)
(see Application 2.2 and the column “MED” in Tables 4.3 and 4.4) is even
easier to compute than the classical interval, kept its coverage pretty well,
and was frequently shorter than the classical interval.
Stigler (1973a) showed that the trimmed mean has a limiting normal
distribution even if the population is discrete provided that the asymptotic
truncation points a and b have zero probability; however, in finite samples
the trimmed mean can perform poorly if there are gaps in the distribution
near the trimming proportions.
The estimators TS,n and TA,n depend on a parameter k. Smaller values of
k should have smaller CI lengths if the data has heavy tails while larger values
of k should perform better for light tailed distributions. In simulations, TS,n
performed well for k > 1, but the variability of TA,n was too large for n ≤ 100
for Gaussian data if 1 < k < 5. These estimators also depend on the grid
C of trimming proportions. Using C = {0, 0.01, 0.02, ..., 0.49, 0.5} makes the
estimators easy to compute, but TS,n will perform better if the much coarser
grid Cc = {0, 0.01, 0.10, 0.25, 0.40, 0.49, 0.5} is used. The performance does
not change much for symmetric data, but can improve considerably if the
data is skewed. The estimator can still perform rather poorly if the data is
asymmetric and the trimming proportion of the metrically trimmed mean is
near one of these allowed trimming proportions. For example if k = 3.5 and
the data is exponential(1), the metrically trimmed mean trims approximately
9.3% of the cases. Hence the TS,n is often the 25% and the 10% trimmed
mean for small n. When k = 4.5, TS,n with grid Cc is usually the 10%
trimmed mean and hence performs well on exponential(1) data.
TA,n is the estimator most like high breakdown M–estimators proposed
in the literature. These estimators basically use a random amount of trim-
ming and work well on symmetric data. Estimators that give zero weight to
distant outliers (“hard rejection”) can work well on “contaminated normal”
populations such as (1 − ε)N(0, 1) + εN(µs, 1). Of course ε ∈ (0, 0.5) and µs can always be chosen so that these estimators perform poorly. Stigler (1977)
argues that complicated robust estimators are not needed.
4.8 Problems
PROBLEMS WITH AN ASTERISK * ARE ESPECIALLY USE-
FUL.
4.1∗. Suppose the random variable X has cdf FX (x) = 0.9 Φ(x − 10) +
0.1 FW (x) where Φ(x − 10) is the cdf of a normal N(10, 1) random variable
with mean 10 and variance 1 and FW (x) is the cdf of the random variable
W that satisfies P (W = 200) = 1.
a) Find E(W ).
b) Find E(X).
4.2. Suppose the random variable X has cdf FX (x) = 0.9 FZ (x) +
0.1 FW (x) where FZ is the cdf of a gamma(α = 10, β = 1) random variable
with mean 10 and variance 10 and FW (x) is the cdf of the random variable
W that satisfies P (W = 400) = 1.
a) Find E(W ).
b) Find E(X).
4.3. a) Prove Lemma 4.2 a).
b) Prove Lemma 4.2 c).
c) Prove Lemma 4.2 d).
d) Prove Lemma 4.2 e).
4.4. Suppose that F is the cdf from a distribution that is symmetric
about 0. Suppose a = −b and α = F (a) = 1 − β = 1 − F (b). Show that
$$\frac{\sigma_W^2(a, b)}{(\beta - \alpha)^2} = \frac{\sigma_T^2(a, b)}{1 - 2\alpha} + \frac{2\alpha(F^{-1}(\alpha))^2}{(1 - 2\alpha)^2}.$$
4.5. Recall that $L(M_n) = \sum_{i=1}^n I[Y_i < \mathrm{MED}(n) - k\,\mathrm{MAD}(n)]$ and $n - U(M_n) = \sum_{i=1}^n I[Y_i > \mathrm{MED}(n) + k\,\mathrm{MAD}(n)]$ where the indicator variable I(A) = 1 if event A occurs and is zero otherwise. Show that TS,n is a randomly trimmed mean. (Hint: round 100 max(L(Mn), n − U(Mn))/n up to the nearest integer, say Jn. Then TS,n is the Jn% trimmed mean with Ln = ⌊(Jn/100) n⌋ and Un = n − Ln.)
4.6. Show that TA,n is a randomly trimmed mean. (Hint: To get Ln, round 100L(Mn)/n up to the nearest integer Jn. Then Ln = ⌊(Jn/100) n⌋. Round 100[n − U(Mn)]/n up to the nearest integer Kn. Then Un = ⌊(100 − Kn)n/100⌋.)
4.7∗. Let F be the N(0, 1) cdf. Show that the efficiency of the sample
median MED(n) with respect to the sample mean Y n is AE ≈ 0.64.
4.8∗. Let F be the DE(0, 1) cdf. Show that the efficiency of the sample
median MED(n) with respect to the sample mean Y n is AE ≈ 2.0.
4.9. If Y is TEXP(λ, b = kλ) for k > 0, show that
a)
$$E(Y) = \lambda\left[1 - \frac{k}{e^{k} - 1}\right].$$
b)
$$E(Y^2) = 2\lambda^2\left[1 - \frac{0.5k^2 + k}{e^{k} - 1}\right].$$
R/Splus problems
Warning: Use the command source(“A:/rpack.txt”) to download
the programs. See Preface or Section 14.2. Typing the name of the
rpack function, eg rcisim, will display the code for the function. Use the
args command, eg args(rcisim), to display the needed arguments for the
function.
4.10. a) Download the R/Splus function nav that computes Equation
(4.4) from Lemma 4.8.
b) Find the asymptotic variance of the α trimmed mean for α = 0.01, 0.1,
0.25 and 0.49.
c) Find the asymptotic variance of TA,n for k = 2, 3, 4, 5 and 6.
4.11. a) Download the R/Splus function deav that computes Equation
(4.5) from Lemma 4.9.
b) Find the asymptotic variance of the α trimmed mean for α = 0.01, 0.1,
0.25 and 0.49.
c) Find the asymptotic variance of TA,n for k = 2, 3, 4, 5 and 6.
4.12. a) Download the R/Splus function cav that finds n AV for the
Cauchy(0,1) distribution.
b) Find the asymptotic variance of the α trimmed mean for α = 0.01, 0.1,
0.25 and 0.49.
c) Find the asymptotic variance of TA,n for k = 2, 3, 4, 5 and 6.
4.13. a) Download the R/Splus function rcisim to reproduce Tables
4.3 and 4.4. Two lines need to be changed with each CI. One line is the
output line that calls the CI and the other line is the parameter estimated
for exponential(1) data. The program below is for the classical interval.
Thus the program calls the function cci used in Problem 2.16. The functions
medci, tmci, atmci, stmci, med2ci, cgci and bg2ci given in Problems 2.22
– 2.28 are also interesting.
b) Enter the following commands, obtain the output and explain what
the output shows.
i) rcisim(n,type=1) for n = 10, 50, 100
ii) rcisim(n,type=2) for n = 10, 50, 100
iii) rcisim(n,type=3) for n = 10, 50, 100
iv) rcisim(n,type=4) for n = 10, 50, 100
v) rcisim(n,type=5) for n = 10, 50, 100
Chapter 5
Multiple Linear Regression
$$\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{e}, \quad (5.2)$$
$r_i(\boldsymbol{b}) = r_i = Y_i - \boldsymbol{x}_i^T\boldsymbol{b} = Y_i - \hat{Y}_i$. The order statistics for the absolute residuals
are denoted by
|r|(1) ≤ |r|(2) ≤ · · · ≤ |r|(n) .
Two of the most used classical regression methods are ordinary least squares
(OLS) and least absolute deviations (L1 ).
Definition 5.2. The ordinary least squares estimator β̂OLS minimizes
$$Q_{OLS}(\boldsymbol{b}) = \sum_{i=1}^n r_i^2(\boldsymbol{b}), \quad (5.4)$$
while the least absolute deviations (L1) estimator β̂L1 minimizes
$$Q_{L_1}(\boldsymbol{b}) = \sum_{i=1}^n |r_i(\boldsymbol{b})|. \quad (5.5)$$
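To make the two criteria concrete, the sketch below fits a small simulated data set by OLS with lm and by minimizing QL1(b) numerically with optim. The use of optim here is only an illustration of Equation (5.5), not the L1 method used elsewhere in the text.

set.seed(1)
x <- runif(50, 0, 10)
y <- 1 + 2*x + rnorm(50)
X <- cbind(1, x)
qL1 <- function(b) sum(abs(y - X %*% b))   # Q_L1(b) of Equation (5.5)
bols <- coef(lm(y ~ x))                    # minimizes Q_OLS(b) of Equation (5.4)
bl1 <- optim(bols, qL1)$par                # numerical least absolute deviations fit
rbind(OLS = bols, L1 = bl1)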
[Figure: FFλ plot — scatterplot matrix of YHAT^(λ) for λ = 1, 0.5, 0, −0.5, −1.]
[Figure: transformation plots of Y, Y^(0.5), LOG(Y), and Y^(−2/3) versus YHAT.]
the logarithm log S of the shell mass S and a constant. With this starting
point, we might expect a log transformation of M to be needed because M
and S are both mass measurements and log S is being used as a predictor.
Using log M would essentially reduce all measurements to the scale of length.
The Box–Cox likelihood method gave λ̂0 = 0.28 with approximate 95 percent
confidence interval 0.15 to 0.4. The log transformation is excluded under this
inference leading to the possibility of using different transformations of the
two mass measurements.
The FFλ plot (not shown, but very similar to Figure 5.1) exhibits strong
linear relations, the correlations ranging from 0.9716 to 0.9999. Shown in
Figure 5.3 are transformation plots of Y (λ) versus Ŷ for four values of λ. A
striking feature of these plots is the two points that stand out in three of the
four plots (cases 8 and 48). The Box–Cox estimate λ̂ = 0.28 is evidently in-
fluenced by the two outlying points and, judging deviations from the OLS line
in Figure 5.3c, the mean function for the remaining points is curved. In other
words, the Box–Cox estimate is allowing some visually evident curvature in
the bulk of the data so it can accommodate the two outlying points. Recom-
puting the estimate of λo without the highlighted points gives λ̂o = −0.02,
which is in good agreement with the log transformation anticipated at the
outset. Reconstruction of the plots of Ŷ versus Y (λ) indicated that now the
information for the transformation is consistent throughout the data on the
horizontal axis of the plot.
The essential point of this example is that observations that influence the
choice of power transformation are often easily identified in a transformation
plot of Ŷ versus Y (λ) when the FFλ subplots are strongly linear.
The easily verified assumption that there is strong linearity in the FFλ
plot is needed since if λo ∈ [−1, 1], then
for all λ ∈ [−1, 1]. Consequently, for any value of λ ∈ [−1, 1], Ŷ^(λ) is essentially a linear function of the fitted values Ŷ^(λo) for the true λo, although we do not know λo itself. However, to estimate λo graphically, we could select any fixed value λ* ∈ [−1, 1] and then plot Ŷ^(λ*) versus Y^(λ) for several values of λ and find the λ ∈ Λc for which the plot is linear with constant variance. This simple graphical procedure will then work because a plot of Ŷ^(λ*) versus Y^(λ) is equivalent to a plot of c_{λ*} + d_{λ*}Ŷ^(λo) versus Y^(λ) by Equation (5.8). Since the plot of Ŷ^(1) versus Y^(λ) differs from a plot of Ŷ versus Y^(λ) by a
[Figure 5.3: transformation plots of LOG(Y), Y^(−0.25), Y^(0.28), and Y versus YHAT for the mussel data, with cases 8 and 48 highlighted.]
so that any one set of population fitted values is an exact linear function
of any other set provided the τλ ’s are nonzero. See Cook and Olive (2001).
This result indicates that sample FFλ plots will be linear when E(w|wT η) is
linear, although Equation (5.9) does not by itself guarantee high correlations.
However, the strength of the relationship (5.8) can be checked easily by
inspecting the FFλ plot.
Secondly, if the FFλ subplots are not strongly linear, and if there is non-
linearity present in the scatterplot matrix of the nontrivial predictors, then
transforming the predictors to remove the nonlinearity will often
be a useful procedure. The linearizing of the predictor relationships could
be done by using marginal power transformations or by transforming the
joint distribution of the predictors towards an elliptically contoured distri-
bution. The linearization might also be done by using simultaneous power
transformations λ = (λ2, . . . , λp)^T of the predictors so that the vector $\boldsymbol{w}_\lambda = (x_2^{(\lambda_2)}, ..., x_p^{(\lambda_p)})^T$ of transformed predictors is approximately multivariate
normal. A method for doing this was developed by Velilla (1993). (The basic
idea is the same as that underlying the likelihood approach of Box and Cox
[Figure 5.4: FFλ Plot for Mussel Data with Original Predictors — scatterplot matrix of YHAT^(λ) for λ = 1, 0.5, 0, −0.5, −1.]
[Figure: scatterplot matrix of the mussel data predictors length, width, height, and shell.]
[Figure: scatterplot matrix of the predictors length, Log W, height, and Log S after transforming width and shell.]
where xI/S denotes the predictors in I that are not in S. Since this is true
regardless of the values of the predictors, β O = 0 and the sample correlation
corr(βT xi , βTI xI,i) = 1.0 for the population model if S ⊆ I.
This section proposes a graphical method for evaluating candidate sub-
models. Let β̂ be the estimate of β obtained from the regression of Y on all
of the terms x. Denote the residuals and fitted values from the full model by
$r_i = Y_i - \hat{\boldsymbol{\beta}}^T\boldsymbol{x}_i = Y_i - \hat{Y}_i$ and $\hat{Y}_i = \hat{\boldsymbol{\beta}}^T\boldsymbol{x}_i$ respectively. Similarly, let $\hat{\boldsymbol{\beta}}_I$ be the
visual aids. The subset I is good if the plotted points cluster tightly about
the identity line in both plots. In particular, the OLS line and the identity
line should nearly coincide near the origin in the RR plot.
To verify that the six plots are useful for assessing variable selection,
the following notation will be useful. Suppose that all submodels include
a constant and that X is the full rank n × p design matrix for the full
model. Let the corresponding vectors of OLS fitted values and residuals be
Ŷ = X(X T X)−1 X T Y = HY and r = (I − H)Y , respectively. Sup-
pose that X I is the n × k design matrix for the candidate submodel and
that the corresponding vectors of OLS fitted values and residuals are Ŷ I =
X I (X TI X I )−1 X TI Y = H I Y and r I = (I − H I )Y , respectively. For mul-
tiple linear regression, recall that if the candidate model of xI has k terms
(including the constant), then the FI statistic for testing whether the p − k
predictor variables in xO can be deleted is
$$F_I = \frac{SSE(I) - SSE}{(n - k) - (n - p)} \bigg/ \frac{SSE}{n - p} = \frac{n - p}{p - k}\left[\frac{SSE(I)}{SSE} - 1\right]$$
where SSE is the error sum of squares from the full model and SSE(I) is the error sum of squares from the candidate submodel. Also recall that
$$C_p(I) = \frac{SSE(I)}{MSE} + 2k - n = (p - k)(F_I - 1) + k$$
where MSE is the error mean square for the full model. Notice that Cp (I) ≤
2k if and only if FI ≤ p/(p − k). Remark 5.3 below suggests that for subsets
I with k terms, submodels with Cp (I) ≤ 2k are especially interesting.
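In R, FI and Cp(I) can be computed directly from the full and candidate fits. The sketch below uses a simulated data set; the variable names are illustrative only.

set.seed(2)
n <- 100; p <- 4                      # p terms in the full model, including the constant
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + x1 + rnorm(n)                # true model uses only x1
full <- lm(y ~ x1 + x2 + x3)
sub <- lm(y ~ x1)                     # candidate submodel I with k = 2 terms
k <- 2
SSE <- sum(resid(full)^2); SSEI <- sum(resid(sub)^2)
FI <- (n - p)/(p - k)*(SSEI/SSE - 1)
Cp <- SSEI/(SSE/(n - p)) + 2*k - n
c(FI = FI, Cp = Cp, screen = Cp <= 2*k)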
A plot can be very useful if the OLS line can be compared to a reference
line and if the OLS slope is related to some quantity of interest. Suppose
that a plot of w versus z places w on the horizontal axis and z on the vertical
axis. Then denote the OLS line by ẑ = a + bw. The following proposition
shows that the FF, RR and forward response plots will cluster about the
identity line. Notice that the proposition is a property of OLS and holds
even if the data does not follow an MLR model. Let corr(x, y) denote the
correlation between x and y.
Proposition 5.1. Suppose that every submodel contains a constant and
that X is a full rank matrix.
Forward Response Plot: i) If w = ŶI and z = Y then the OLS line is the identity line.
ii) If w = Y and z = ŶI then the OLS line has slope b = [corr(Y, ŶI)]² = R²I and intercept a = Ȳ(1 − R²I) where Ȳ = Σⁿᵢ₌₁ Yi/n and R²I is the coefficient of multiple determination from the candidate model.
FF Plot: iii) If w = ŶI and z = Ŷ then the OLS line is the identity line.
Note that ESP (I) = ŶI and ESP = Ŷ .
iv) If w = Ŷ and z = ŶI then the OLS line has slope b = [corr(Ŷ , ŶI )]2 =
SSR(I)/SSR and intercept a = Ȳ[1 − (SSR(I)/SSR)] where SSR is the
regression sum of squares.
v) If w = r and z = rI then the OLS line is the identity line.
RR Plot: vi) If w = rI and z = r then a = 0 and the OLS slope b = [corr(r, rI)]² and
$$\mathrm{corr}(r, r_I) = \sqrt{\frac{SSE}{SSE(I)}} = \sqrt{\frac{n - p}{C_p(I) + n - 2k}} = \sqrt{\frac{n - p}{(p - k)F_I + n - p}}.$$
Also recall that the OLS line passes through the means of the two variables (w, z).
(*) Notice that the OLS slope from regressing z on w is equal to one if and only if the OLS slope from regressing w on z is equal to [corr(z, w)]².
i) The slope b = 1 if $\sum \hat{Y}_{I,i}Y_i = \sum \hat{Y}_{I,i}^2$. This equality holds since $\hat{\boldsymbol{Y}}_I^T\boldsymbol{Y} = \boldsymbol{Y}^T\boldsymbol{H}_I\boldsymbol{Y} = \boldsymbol{Y}^T\boldsymbol{H}_I\boldsymbol{H}_I\boldsymbol{Y} = \hat{\boldsymbol{Y}}_I^T\hat{\boldsymbol{Y}}_I$. Since b = 1, $a = \overline{Y} - \overline{Y} = 0$.
ii) By (*), the slope
$$b = [\mathrm{corr}(Y, \hat{Y}_I)]^2 = R_I^2 = \frac{\sum (\hat{Y}_{I,i} - \overline{Y})^2}{\sum (Y_i - \overline{Y})^2} = SSR(I)/SST.$$
iii) The slope b = 1 if $\sum \hat{Y}_{I,i}\hat{Y}_i = \sum \hat{Y}_{I,i}^2$. This equality holds since $\hat{\boldsymbol{Y}}^T\hat{\boldsymbol{Y}}_I = \boldsymbol{Y}^T\boldsymbol{H}\boldsymbol{H}_I\boldsymbol{Y} = \boldsymbol{Y}^T\boldsymbol{H}_I\boldsymbol{Y} = \hat{\boldsymbol{Y}}_I^T\hat{\boldsymbol{Y}}_I$. Since b = 1, $a = \overline{Y} - \overline{Y} = 0$.
iv) From iii),
$$1 = \frac{SD(\hat{Y})}{SD(\hat{Y}_I)}[\mathrm{corr}(\hat{Y}, \hat{Y}_I)].$$
Hence
$$\mathrm{corr}(\hat{Y}, \hat{Y}_I) = \frac{SD(\hat{Y}_I)}{SD(\hat{Y})}$$
and the slope
$$b = \frac{SD(\hat{Y}_I)}{SD(\hat{Y})}\,\mathrm{corr}(\hat{Y}, \hat{Y}_I) = [\mathrm{corr}(\hat{Y}, \hat{Y}_I)]^2.$$
Also the slope
$$b = \frac{\sum (\hat{Y}_{I,i} - \overline{Y})^2}{\sum (\hat{Y}_i - \overline{Y})^2} = SSR(I)/SSR.$$
The result follows since $a = \overline{Y} - b\overline{Y}$.
v) The OLS line passes through the origin. Hence a = 0. The slope $b = \boldsymbol{r}^T\boldsymbol{r}_I/\boldsymbol{r}^T\boldsymbol{r}$. Since $\boldsymbol{r}^T\boldsymbol{r}_I = \boldsymbol{Y}^T(\boldsymbol{I} - \boldsymbol{H})(\boldsymbol{I} - \boldsymbol{H}_I)\boldsymbol{Y}$ and $(\boldsymbol{I} - \boldsymbol{H})(\boldsymbol{I} - \boldsymbol{H}_I) = \boldsymbol{I} - \boldsymbol{H}$, the numerator $\boldsymbol{r}^T\boldsymbol{r}_I = \boldsymbol{r}^T\boldsymbol{r}$ and b = 1.
vi) Again a = 0 since the OLS line passes through the origin. From v),
$$1 = \sqrt{\frac{SSE(I)}{SSE}}\,[\mathrm{corr}(r, r_I)].$$
Hence
$$\mathrm{corr}(r, r_I) = \sqrt{\frac{SSE}{SSE(I)}}$$
and the slope
$$b = \sqrt{\frac{SSE}{SSE(I)}}\,[\mathrm{corr}(r, r_I)] = [\mathrm{corr}(r, r_I)]^2.$$
Algebra shows that
$$\mathrm{corr}(r, r_I) = \sqrt{\frac{n - p}{C_p(I) + n - 2k}} = \sqrt{\frac{n - p}{(p - k)F_I + n - p}}. \quad \text{QED}$$
Figure 5.7: Gladstone data: comparison of the full model and the submodel.
Remark 5.2. Note that for large n, Cp (I) < k or FI < 1 will force
corr(ESP,ESP(I)) to be high. If the estimators β̂ and β̂ I are not the OLS
estimators, the plots will be similar to the OLS plots if the correlation of the
fitted values from OLS and the alternative estimators is high (≥ 0.95).
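The FF and RR plots themselves are easy to make from OLS output. The following R sketch (illustrative only; the simulated data and the axis labels SFIT, FFIT, SRES and FRES are not from the text) plots the submodel fitted values and residuals against those of the full model and adds the identity line and the OLS line as visual aids.

# Illustrative R sketch: FF and RR plots for a full model and a submodel.
set.seed(2)
n <- 100
x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + x2 + rnorm(n)
full <- lm(y ~ x2 + x3); sub <- lm(y ~ x2)
par(mfrow = c(1, 2))
plot(fitted(sub), fitted(full), xlab = "SFIT", ylab = "FFIT", main = "FF Plot")
abline(0, 1)                                     # identity line
abline(lm(fitted(full) ~ fitted(sub)), lty = 2)  # OLS line
plot(resid(sub), resid(full), xlab = "SRES", ylab = "FRES", main = "RR Plot")
abline(0, 1)
abline(lm(resid(full) ~ resid(sub)), lty = 2)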
A standard model selection procedure will often be needed to suggest
models. For example, forward selection or backward elimination could be
used. If p < 30, Furnival and Wilson (1974) provide a technique for selecting
a few candidate subsets after examining all possible subsets.
Remark 5.3. Daniel and Wood (1980, p. 85) suggest using Mallows’
graphical method for screening subsets by plotting k versus Cp (I) for models
close to or under the Cp = k line. Proposition 5.1 vi) implies that if Cp (I) ≤ k
then corr(r, rI ) and corr(ESP, ESP (I)) both go to 1.0 as n → ∞. Hence
models I that satisfy the Cp(I) ≤ k screen will contain the true model S
with high probability when n is large. This result does not guarantee that
Figure 5.8: Gladstone data: submodels added (size)1/3, sex, age and finally
breadth.
Figure 5.9: Gladstone data with Predictors (size)1/3, sex, and age
the true model S will satisfy the screen, hence overfit is likely (see Shao
1993). Let d be a lower bound on corr(r, rI ). Proposition 5.1 vi) implies that
if
Cp(I) ≤ 2k + n(1/d² − 1) − p/d²,
then corr(r, rI) ≥ d. The simple screen Cp(I) ≤ 2k corresponds to
dn ≡ √(1 − p/n).
To reduce the chance of overfitting, use the Cp = k line for large values of k,
but also consider models close to or under the Cp = 2k line when k ≤ p/2.
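These screens are simple to evaluate. The sketch below is illustrative only; the helper names cp.bound and dn are not from the text, and the values n = 267 and p = 12 merely mimic the size of the Gladstone data of Example 5.4 with k = 4 terms in the submodel.

# Illustrative sketch; cp.bound and dn are hypothetical helper names.
cp.bound <- function(n, p, k, d) 2*k + n*(1/d^2 - 1) - p/d^2
dn <- function(n, p) sqrt(1 - p/n)
cp.bound(n = 267, p = 12, k = 4, d = 0.95)  # Cp(I) screen ensuring corr(r, rI) >= 0.95
dn(n = 267, p = 12)                         # corr(r, rI) bound implied by Cp(I) <= 2k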
Example 5.4. The FF and RR plots can be used as a diagnostic for
whether a given numerical method is including too many variables. Glad-
stone (1905-1906) attempts to estimate the weight of the human brain (mea-
sured in grams after the death of the subject) using simple linear regression
with a variety of predictors including age in years, height in inches, head
height in mm, head length in mm, head breadth in mm, head circumference
in mm, and cephalic index. The sex (coded as 0 for females and 1 for males)
of each subject was also included. The variable cause was coded as 1 if the
cause of death was acute, 3 if the cause of death was chronic, and coded as 2
otherwise. A variable ageclass was coded as 0 if the age was under 20, 1 if the
age was between 20 and 45, and as 3 if the age was over 45. Head size, the
product of the head length, head breadth, and head height, is a volume mea-
surement, hence (size)1/3 was also used as a predictor with the same physical
dimensions as the other lengths. Thus there are 11 nontrivial predictors and
one response, and all models will also contain a constant. Nine cases were
deleted because of missing values, leaving 267 cases.
Figure 5.7 shows the forward response plots and residual plots for the full
model and the final submodel that used a constant, size1/3, age and sex.
The five cases separated from the bulk of the data in each of the four plots
correspond to five infants. These may be outliers, but the visual separation
reflects the small number of infants and toddlers in the data. A purely
numerical variable selection procedure would miss this interesting feature of
the data. We will first perform variable selection with the entire data set,
and then examine the effect of deleting the five cases. Using forward selection
and the Cp statistic on the Gladstone data suggests the subset I5 containing
a constant, (size)1/3, age, sex, breadth, and cause with Cp(I5) = 3.199. The
p–values for breadth and cause were 0.03 and 0.04, respectively. The subset
I4 that deletes cause has Cp (I4) = 5.374 and the p–value for breadth was 0.05.
Figure 5.8d shows the RR plot for the subset I4. Note that the correlation
of the plotted points is very high and that the OLS and identity lines nearly
coincide.
A scatterplot matrix of the predictors and response suggests that (size)1/3
might be the best single predictor. First we regressed y = brain weight on the
eleven predictors described above (plus a constant) and obtained the residuals
ri and fitted values ŷi. Next, we regressed y on the subset I containing
(size)1/3 and a constant and obtained the residuals rI,i and the fitted values
ŷI,i. Then the RR plot of rI,i versus ri , and the FF plot of ŷI,i versus ŷi were
constructed.
For this model, the correlation in the FF plot (Figure 5.8b) was very high,
but in the RR plot the OLS line did not coincide with the identity line (Figure
5.8a). Next sex was added to I, but again the OLS and identity lines did not
coincide in the RR plot (Figure 5.8c). Hence age was added to I. Figure 5.9a
shows the RR plot with the OLS and identity lines added. These two lines
now nearly coincide, suggesting that a constant plus (size)1/3, sex, and age
contains the relevant predictor information. This subset has Cp(I) = 7.372,
R2I = 0.80, and σ̂I = 74.05. The full model which used 11 predictors and a
constant has R2 = 0.81 and σ̂ = 73.58. Since the Cp criterion suggests adding
breadth and cause, the Cp criterion may be leading to an overfit.
Figure 5.9b shows the FF plot. The five cases in the southwest corner
correspond to five infants. Deleting them leads to almost the same conclu-
sions, although the full model now has R2 = 0.66 and σ̂ = 73.48 while the
submodel has R2I = 0.64 and σ̂I = 73.89.
Example 5.5. Cook and Weisberg (1999a, p. 261, 371) describe a data
set where rats were injected with a dose of a drug approximately proportional
to body weight. The data set is included as the file rat.lsp in the Arc software
and can be obtained from the website (http://www.stat.umn.edu/arc/). The
response Y is the fraction of the drug recovered from the rat’s liver. The three
predictors are the body weight of the rat, the dose of the drug, and the liver
weight. The experimenter expected the response to be independent of the
predictors, and 19 cases were used. However, the Cp criterion suggests using
the model with a constant, dose and body weight, both of whose coefficients
were statistically significant. The FF and RR plots are shown in Figure 5.10.
The identity line and OLS lines were added to the plots as visual aids. The
FF plot shows one outlier, the third case, that is clearly separated from the
rest of the data.
We deleted this case and again searched for submodels. The Cp statistic
is less than one for all three simple linear regression models, and the RR and
FF plots look the same for all submodels containing a constant. Figure 5.11
shows the RR plot where the residuals from the full model are plotted against
Y − Ȳ, the residuals from the model using no nontrivial predictors. This plot
suggests that the response Y is independent of the nontrivial predictors.
The point of this example is that a subset of outlying cases can cause
numeric second-moment criteria such as Cp to find structure that does not
exist. The FF and RR plots can sometimes detect these outlying cases,
allowing the experimenter to run the analysis without the influential cases.
The example also illustrates that global numeric criteria can suggest a model
with one or more nontrivial terms when in fact the response is independent
of the predictors.
Numerical variable selection methods for MLR are very sensitive to “influ-
ential cases” such as outliers. For the MLR model, standard case diagnostics
can be used to flag influential cases; if the variable selection results after
deleting the flagged cases are very different from those using the full data
set, this is a situation that should cause
concern. Warning: deleting influential cases and outliers will often
lead to better plots and summary statistics, but the cleaned data
may no longer represent the actual population. In particular, the
resulting model may be very poor for prediction.
A thorough subset selection analysis will use the RC plots in conjunction
with the more standard numeric-based algorithms. This suggests running
the numerical variable selection procedure on the entire data set and on the
“cleaned data” set with the influential cases deleted, keeping track of inter-
esting models from both data sets. For a candidate submodel I, let Cp (I, c)
denote the value of the Cp statistic for the cleaned data. The following two
examples help illustrate the procedure.
Example 5.6. Ashworth (1842) presents a data set of 99 communities
in Great Britain. The response variable y = log(population in 1841) and the
predictors are x1, x2 , x3 and a constant where x1 is log(property value in
pounds in 1692), x2 is log(property value in pounds in 1841), and x3 is the
log(percent rate of increase in value). The initial RC plot, shown in Figure
5.12a, is far from the ideal of an evenly-populated parabolic band. Cases
14 and 55 have extremely large Cook’s distances, along with the largest
residuals. After deleting these cases and refitting OLS, Figure 5.12b shows
that the RC plot is much closer to the ideal parabolic shape. If case 16 had a
residual closer to zero, then it would be a very high leverage case and would
also be deleted.
Table 5.1 shows the summary statistics of the fits of all subsets using all
cases, and following the removal of cases 14 and 55. The two sets of results
are substantially different. On the cleaned data the submodel using just x2
is the unique clear choice, with Cp (I, c) = 0.7. On the full data set however,
none of the subsets is adequate. Thus cases 14 and 55 are responsible for all
indications that predictors x1 and x3 have any useful information about y.
This is somewhat remarkable in that these two cases have perfectly ordinary
values for all three variables.
Example 5.4 (continued). Now we will apply the RC plot to the Glad-
stone data using y = weight, x1 = age, x2 = height, x3 = head height, x4 =
head length, x5 = head breadth, x6 = head circumference, x7 = cephalic index,
x8 = sex, and x9 = (size)1/3. All submodels contain a constant.
Table 5.2 shows the summary statistics of the more interesting subset
regressions. The smallest Cp value came from the subset x1, x5, x8 , x9, and
in this regression x5 has a t value of −2.0. Deleting a single predictor from
an adequate regression changes the Cp by approximately t2 − 2, where t
stands for that predictor’s Student’s t in the regression – as illustrated by the
increase in Cp from 4.4 to 6.3 following deletion of x5 . Analysts must choose
between the larger regression with its smaller Cp but a predictor that does
not pass the conventional screens for statistical significance, and the smaller,
more parsimonious, regression using only apparently statistically significant
predictors, but (as assessed by Cp ) possibly less accurate predictive ability.
Figure 5.13 shows a sequence of RC plots used to identify cases 118, 234,
248 and 258 as atypical, ending up with an RC plot that is a reasonably
evenly-populated parabolic band. Using the Cp criterion on the cleaned data
suggests the same final submodel I found earlier – that using a constant,
x1 = age, x8 = sex and x9 = size1/3.
The five cases (230, 254, 255, 256 and 257) corresponding to the five
infants were well separated from the bulk of the data and have higher leverage
than average, and so good exploratory practice would be to remove them also
to see the effect on the model fitting. The right columns of Table 5.2 reflect
making these 9 deletions. As in the full data set, the subset x1, x5 , x8, x9 gives
the smallest Cp , but x5 is of only modest statistical significance and might
reasonably be deleted to get a more parsimonious regression. What is striking
after comparing the left and right columns of Table 5.2 is that, as was the
case with the Ashworth data set, the adequate Cp values for the cleaned data
set seem substantially smaller than their full-sample counterparts: 1.2 versus
4.4, and 2.3 versus 6.3. Since these Cp for the same p are dimensionless and
comparable, this suggests that the 9 cases removed are primarily responsible
for any additional explanatory ability in the 6 unused predictors.
Multiple linear regression data sets with cases that influence numerical
variable selection methods are common. Table 5.3 shows results for seven
interesting data sets. The first two rows correspond to the Ashworth data in
Example 5.6, the next 2 rows correspond to the Gladstone Data in Example
5.4, and the next 2 rows correspond to the Gladstone data with the 5 infants
deleted. Rows 7 and 8 are for the Buxton (1920) data while rows 9 and 10
are for the Tremearne (1911) data. These data sets are available from the
book’s website. Results from the final two data sets are given in the last 4
rows. The last 2 rows correspond to the rat data described in Example 5.5.
Rows 11 and 12 correspond to the Ais data that comes with Arc (Cook and
Weisberg, 1999a).
The full model used p predictors, including a constant. The final sub-
model I also included a constant, and the nontrivial predictors are listed in
the second column of Table 5.3. The third column lists p, Cp(I) and Cp (I, c)
while the first column gives the set of influential cases. Two rows are pre-
sented for each data set. The second row gives the response variable and any
predictor transformations. For example, for the Gladstone data p = 10 since
there were 9 nontrivial predictors plus a constant. Only the predictor size
was transformed, and the final submodel is the one given in Example 5.4.
For the rat data, the final submodel is the one given in Example 5.5: none
of the 3 nontrivial predictors was used.
Table 5.3 and simulations suggest that if the subset I has k predictors,
then using the Cp (I) ≤ 2k screen is better than using the conventional
Cp (I) ≤ k screen. The major and ais data sets show that deleting the
influential cases may increase the Cp statistic. Thus interesting models from
the entire data set and from the clean data set should be examined.
The response variable Y is the variable that you want to predict while
the predictor (or independent or explanatory) variable X is the variable used
to predict the response.
A scatterplot is a plot of W versus Z with W on the horizontal axis
and Z on the vertical axis and is used to display the conditional dis-
tribution of Z given W . For SLR the scatterplot of X versus Y is often
used.
For SLR, E(Yi ) = β1 +β2Xi and the line E(Y ) = β1 +β2X is the regression
function. VAR(Yi) = σ 2 .
For SLR, the least squares estimators b1 and b2 minimize the least squares
criterion Q(η1, η2) = Σ_{i=1}^n (Yi − η1 − η2Xi)². For a fixed η1 and η2, Q
is the sum of the squared vertical deviations from the line Y = η1 + η2X.
The least squares (OLS) line is Ŷ = b1 + b2X where
β̂2 ≡ b2 = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^n (Xi − X̄)²
and β̂1 ≡ b1 = Ȳ − b2X̄.
By the chain rule,
∂Q/∂η1 = −2 Σ_{i=1}^n (Yi − η1 − η2Xi)
and
∂²Q/∂η1² = 2n.
Similarly,
∂Q/∂η2 = −2 Σ_{i=1}^n Xi(Yi − η1 − η2Xi)
and
∂²Q/∂η2² = 2 Σ_{i=1}^n Xi².
Setting the partial derivatives equal to zero gives the normal equations; in
particular,
Σ_{i=1}^n Xi Yi = b1 Σ_{i=1}^n Xi + b2 Σ_{i=1}^n Xi².
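A short R check (illustrative data, not from the text) of the SLR formulas above against lm():

# Illustrative R check of the SLR least squares formulas.
set.seed(3)
X <- rnorm(50); Y <- 3 + 2*X + rnorm(50)
b2 <- sum((X - mean(X))*(Y - mean(Y)))/sum((X - mean(X))^2)
b1 <- mean(Y) - b2*mean(X)
c(b1 = b1, b2 = b2)
coef(lm(Y ~ X))    # agrees with (b1, b2)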
Response = Y
Coefficient Estimates
Label Estimate Std. Error t-value p-value
Constant b1 se(b1) t for b1 p-value for beta_1
x b2 se(b2) to = b2/se(b2) p-value for beta_2
R Squared: r^2
Sigma hat: sqrt{MSE}
Number of cases: n
Degrees of freedom: n-2
R Squared: 0.74058
Sigma hat: 83.9447
Number of cases: 267
Degrees of freedom: 265
For SLR, Ŷi = b1 + b2 Xi is called the ith fitted value (or predicted value)
for observation Yi while the ith residual is ri = Yi − Ŷi .
The error (residual) sum of squares SSE = Σ_{i=1}^n (Yi − Ŷi)² = Σ_{i=1}^n ri².
For SLR, the mean square error MSE = SSE/(n − 2) is an unbiased
estimator of the error variance σ 2.
Properties of the OLS line:
i) the residuals sum to zero: Σ_{i=1}^n ri = 0.
ii) Σ_{i=1}^n Yi = Σ_{i=1}^n Ŷi.
iii) The independent variable and residuals are uncorrelated: Σ_{i=1}^n Xi ri = 0.
iv) The fitted values and residuals are uncorrelated: Σ_{i=1}^n Ŷi ri = 0.
v) The least squares line passes through the point (X̄, Ȳ).
Let the p × 1 vector β = (β1 , ..., βp)T and let the p × 1 vector xi =
(1, Xi,2 , ..., Xi,p)T . Notice that Xi,1 ≡ 1 for i = 1, ..., n. Then the multiple
linear regression (MLR) model is
Yi = β1 + β2Xi,2 + · · · + βpXi,p + ei = xi^T β + ei
for i = 1, ..., n where the ei are iid with E(ei) = 0 and VAR(ei) = σ² for
i = 1, ..., n. The Yi and ei are random variables while the Xi are treated
as known constants. The parameters β1 , β2, ..., βp and σ 2 are unknown
constants that need to be estimated.
In matrix notation, these n equations become
Y = Xβ + e,
where Y is an n × 1 vector of responses, X is an n × p design matrix, β is a
p × 1 vector of unknown coefficients, and e is an n × 1 vector of errors.
The first column of X is 1, the n × 1 vector of ones. The ith case (xi^T, Yi)
corresponds to the ith row xTi of X and the ith element of Y . If the ei
are iid with zero mean and variance σ 2, then regression is used to estimate
the unknown parameters β and σ 2 . (If the Xi are random variables, then
the model is conditional on the Xi ’s. Hence the Xi ’s are still treated as
constants.)
The normal MLR model adds the assumption that the ei are iid N(0, σ 2).
That is, the error distribution is normal with zero mean and constant vari-
ance σ 2. Simple linear regression is a special case with p = 2.
The response variable Y is the variable that you want to predict while
the predictor (or independent or explanatory) variables X1 , X2 , ..., Xp are the
variables used to predict the response. Since X1 ≡ 1, sometimes X2 , ..., Xp
are called the predictor variables.
For MLR, E(Yi ) = β1 + β2Xi,2 + · · · + βpXi,p = xTi β and the hyperplane
E(Y ) = β1 + β2X2 + · · · + βpXp = xT β is the regression function. VAR(Yi) =
σ 2.
The least squares estimators b1, b2, ..., bp minimize the least squares
criterion Q(η) = Σ_{i=1}^n (Yi − η1 − η2Xi,2 − · · · − ηpXi,p)²; at the minimizer
η = b this criterion equals Σ_{i=1}^n ri². For a fixed η, Q is the sum of the
squared vertical deviations from the hyperplane H = η1 + η2X2 + · · · + ηpXp.
The least squares estimator β̂ = b satisfies the MLR normal equations
X^T Xb = X^T Y,
and when X has full rank p,
β̂ = b = (X^T X)^{-1}X^T Y.
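A minimal R sketch (illustrative data, not from the text) of the normal equations: solving X^T Xb = X^T Y reproduces the coefficients reported by lm().

# Illustrative R sketch of the MLR normal equations.
set.seed(4)
n <- 60
X <- cbind(1, matrix(rnorm(n*2), n, 2))    # constant plus two nontrivial predictors
Y <- X %*% c(1, 2, -1) + rnorm(n)
b <- solve(crossprod(X), crossprod(X, Y))  # solves X^T X b = X^T Y
cbind(b, coef(lm(Y ~ X[, 2] + X[, 3])))    # same estimates as lm()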
Response = Y
Coefficient Estimates
Source df SS MS F p-value
Regression p-1 SSR MSR Fo=MSR/MSE for Ho:
Residual n-p SSE MSE β2 = · · · = βp = 0
Response = brnweight
Coefficient Estimates
Label Estimate Std. Error t-value p-value
Constant 99.8495 171.619 0.582 0.5612
size 0.220942 0.0357902 6.173 0.0000
sex 22.5491 11.2372 2.007 0.0458
breadth -1.24638 1.51386 -0.823 0.4111
circum 1.02552 0.471868 2.173 0.0307
R Squared: 0.749755
Sigma hat: 82.9175
Number of cases: 267
Degrees of freedom: 262
Know the meaning of the least squares multiple linear regression output.
Shown on the previous page is an actual Arc output and an output only using
symbols.
The 100 (1 − α) % CI for βk is bk ± t1−α/2,n−p se(bk ). If ν = n − p > 30,
use the N(0,1) cutoff z1−α/2. The corresponding 4 step t–test of hypotheses
has the following steps:
i) State the hypotheses Ho: βk = 0 Ha: βk ≠ 0.
ii) Find the test statistic to,k = bk /se(bk ) or obtain it from output.
iii) Find the p–value from output or use the t–table: p–value =
2P (tn−p < −|to,k |).
Use the normal table or ν = ∞ in the t–table if the degrees of freedom
ν = n − p > 30.
iv) State whether you reject Ho or fail to reject Ho and give a nontechnical
sentence restating your conclusion in terms of the story problem.
Recall that Ho is rejected if the p–value < α. As a benchmark for this
textbook, use α = 0.05 if α is not given. If Ho is rejected, then conclude that
Xk is needed in the MLR model for Y given that the other p − 2 nontrivial
predictors are in the model. If you fail to reject Ho, then conclude that Xk
is not needed in the MLR model for Y given that the other p − 2 nontrivial
predictors are in the model. Note that Xk could be a very useful individual
predictor, but may not be needed if other predictors are added to the model.
It is better to use the output to get the test statistic and p–value than to use
formulas and the t–table, but exams may not give the relevant output.
Be able to perform the 4 step ANOVA F test of hypotheses:
i) State the hypotheses Ho: β2 = · · · = βp = 0 Ha: not Ho
ii) Find the test statistic F o = MSR/MSE or obtain it from output.
iii) Find the p–value from output or use the F–table: p–value =
P (Fp−1,n−p > Fo).
iv) State whether you reject Ho or fail to reject Ho. If Ho is rejected, conclude
that there is a MLR relationship between Y and the predictors X2 , ..., Xp . If
you fail to reject Ho, conclude that there is not a MLR relationship between
Y and the predictors X2 , ..., Xp .
Be able to find i) the point estimator Ŷh = xh^T b and
ii) the 100 (1 − α)% CI for E(Yh) = xh^T β = E(Ŷh). This interval is
Ŷh ± t1−α/2,n−p se(Ŷh). Generally se(Ŷh) will come from output.
Suppose you want to predict a new observation Yh where Yh is indepen-
dent of Y1 , ..., Yn. Be able to find
i) the point estimator Ŷh = xh^T b and
ii) the 100 (1 − α)% prediction interval (PI) for Yh. This interval is
Ŷh ± t1−α/2,n−p se(pred). Generally se(pred) will come from output. Note that
Yh is a random variable not a parameter.
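In R, predict() computes both intervals; the sketch below (illustrative data and a hypothetical new point xh) is one way to obtain them.

# Illustrative R sketch: CI for E(Yh) and PI for a new Yh at a hypothetical xh.
set.seed(5)
x2 <- rnorm(40); x3 <- rnorm(40); y <- 1 + x2 + x3 + rnorm(40)
fit <- lm(y ~ x2 + x3)
newx <- data.frame(x2 = 0.5, x3 = -1)
predict(fit, newdata = newx, interval = "confidence", level = 0.95)  # CI for E(Yh)
predict(fit, newdata = newx, interval = "prediction", level = 0.95)  # PI for Yh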
Be able to perform the 4 step change in F test for comparing a full model
with p terms to a reduced model with q terms:
i) Ho: the reduced model is good Ha: use the full model
ii) Fo ≡ FR = [(SSE(R) − SSE(F))/(dfR − dfF)] / MSE(F)
iii) p–value = P(F_{dfR−dfF, dfF} > Fo). (Here dfR − dfF = p − q = number of
parameters set to 0, and dfF = n − p.)
iv) Reject Ho if the p–value < α and conclude that the full model should be
used. Otherwise, fail to reject Ho and conclude that the reduced model is
good.
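In R, the change in F test can be carried out by passing the reduced and full fits to anova(); a minimal sketch with simulated data (not from the text):

# Illustrative R sketch of the change in F test via anova().
set.seed(6)
x2 <- rnorm(80); x3 <- rnorm(80); x4 <- rnorm(80)
y <- 1 + 2*x2 + rnorm(80)
full <- lm(y ~ x2 + x3 + x4)
red <- lm(y ~ x2)
anova(red, full)   # F = [(SSE(R) - SSE(F))/(dfR - dfF)]/MSE(F) and its p-value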
Given two of SSTO = Σ_{i=1}^n (Yi − Ȳ)², SSE = Σ_{i=1}^n (Yi − Ŷi)² = Σ_{i=1}^n ri²,
and SSR = Σ_{i=1}^n (Ŷi − Ȳ)², find the other sum of squares using the formula
SSTO = SSE + SSR.
Be able to find R² = SSR/SSTO = (sample correlation of Yi and Ŷi)².
Know i) that the covariance matrix of a random vector Y is Cov(Y) =
E[(Y − E(Y))(Y − E(Y))^T].
ii) E(AY) = AE(Y).
iii) Cov(AY) = ACov(Y)A^T.
Given the least squares model Y = Xβ + e, be able to show that
i) E(b) = β and
ii) Cov(b) = σ 2(X T X)−1 .
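A minimal derivation (a standard argument, included here for reference, not quoted from the text): since b = (X^T X)^{-1}X^T Y and Y = Xβ + e,
b = (X^T X)^{-1}X^T(Xβ + e) = β + (X^T X)^{-1}X^T e.
Hence E(b) = β because E(e) = 0, and by iii) above,
Cov(b) = (X^T X)^{-1}X^T Cov(e) X(X^T X)^{-1} = σ²(X^T X)^{-1}
since Cov(e) = σ²I and X^T X is symmetric.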
A matrix A is idempotent if AA = A.
An added variable plot (also called a partial regression plot) is used to
give information about the test Ho : βi = 0. The points in the plot cluster
about a line with slope = bi. If there is a strong trend then Xi is needed in
the MLR for Y given that the other predictors X2 , ..., Xi−1, Xi+1 , ..., Xp are
in the model. If there is almost no trend, then Xi may not be needed in the
MLR for Y given that the other predictors X2 , ..., Xi−1, Xi+1 , ..., Xp are in
the model.
The forward response plot of ŷi versus y is used to check whether
the MLR model is appropriate. If the MLR model is appropriate, then the
plotted points should cluster about the identity line. The squared correlation
[corr(yi, ŷi )]2 = R2 . Hence the clustering is tight if R2 ≈ 1. If outliers are
present or if the plot is not linear, then the current model or data need to
be changed or corrected. Know how to decide whether the MLR model is
appropriate by examining the forward response plot.
To find a full model, start with a scatterplot matrix of the predictors Wi and
the response Z. (If there are many predictors, several matrices may need to
be made. Each one should include Z.) Remove
or correct any gross outliers. It is often a good idea to transform the Wi
to remove any strong nonlinearities from the predictors. Eventually
you will find a response variable Y = tZ (Z) and nontrivial predictor variables
X2 , ..., Xp for the full model. Interactions such as Xk = Wi Wj and powers
such as Xk = Wi² may be of interest. Indicator variables are often used
in interactions, but do not transform an indicator variable. The forward
response plot for the full model should be linear and the residual plot should
be ellipsoidal with zero trend. Find the OLS output. The statistic R2 gives
the proportion of the variance of Y explained by the predictors and is of
some importance.
Variable selection is closely related to the change in F test. You are
seeking a subset I of the variables to keep in the model. The submodel I
will always contain a constant and will have k − 1 nontrivial predictors where
1 ≤ k ≤ p. Know how to find candidate submodels from output.
Forward selection starts with a constant = W1 . Step 1) k = 2: compute
Cp for all models containing the constant and a single predictor Xi . Keep
the predictor W2 = Xj , say, that corresponds to the model with the smallest
value of Cp .
Step 2) k = 3: Fit all models with k = 3 that contain W1 and W2 . Keep the
predictor W3 that minimizes Cp . ...
Step j) k = j + 1: Fit all models with k = j + 1 that contain W1, W2, ..., Wj.
Keep the predictor Wj+1 that minimizes Cp . ...
Step p − 1): Fit the full model.
Backward elimination: All models contain a constant = U1 . Step 1)
k = p: Start with the full model that contains X1 , ..., Xp. We will also say
that the full model contains U1 , ..., Up where U1 = X1 but Ui need not equal
Xi for i > 1.
Step 2) k = p − 1: fit each model with p − 1 predictors including a constant.
Delete the predictor Up , say, that corresponds to the model with the smallest
Cp . Keep U1 , ..., Up−1.
Step 3) k = p− 2: fit each model with p− 2 predictors and a constant. Delete
the predictor Up−1 that corresponds to the smallest Cp . Keep U1 , ..., Up−2. ...
Step j) k = p − j + 1: fit each model with p − j + 1 predictors and a
constant. Delete the predictor Up−j+2 that corresponds to the smallest Cp.
Keep U1, ..., Up−j+1. ...
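The following base R sketch (not the Arc procedure; the data and object names are illustrative) implements forward selection by Cp as described above, adding at each step the predictor that minimizes Cp; in practice the Cp values of the candidate submodels would be examined at each step rather than simply running to the full model.

# Illustrative base R sketch of forward selection by Cp (not the Arc procedure).
set.seed(7)
n <- 100
dat <- data.frame(x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n), x5 = rnorm(n))
dat$y <- 1 + 2*dat$x2 - dat$x3 + rnorm(n)
full <- lm(y ~ ., data = dat)
MSE <- sum(resid(full)^2)/(n - length(coef(full)))
cp <- function(fit) sum(resid(fit)^2)/MSE + 2*length(coef(fit)) - n
preds <- setdiff(names(dat), "y"); keep <- character(0)
while (length(preds) > 0) {
  cps <- sapply(preds, function(v)
    cp(lm(reformulate(c(keep, v), response = "y"), data = dat)))
  best <- names(which.min(cps))
  cat("step", length(keep) + 1, ": add", best, "with Cp =", round(min(cps), 2), "\n")
  keep <- c(keep, best); preds <- setdiff(preds, best)
}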
5.4 Complements
Algorithms for OLS are described in Datta (1995), Dongarra, Moler, Bunch
and Stewart (1979), and Golub and Van Loan (1989). Algorithms for L1
are described in Adcock and Meade (1997), Barrodale and Roberts (1974),
Bloomfield and Steiger (1980), Dodge (1997), Koenker (1997), Koenker and
d’Orey (1987), Portnoy (1997), and Portnoy and Koenker (1997). See Harter
(1974a,b, 1975a,b,c, 1976) for a historical account of linear regression. Draper
(2000) provides a bibliography of more recent references.
Early papers on transformations include Bartlett (1947) and Tukey (1957).
In a classic paper, Box and Cox (1964) developed numerical methods for es-
timating λo in the family of power transformations. It is well known that the
Box–Cox normal likelihood method for estimating λo can be sensitive to re-
mote or outlying observations. Cook and Wang (1983) suggested diagnostics
for detecting cases that influence the estimator, as did Tsai and Wu (1992),
Atkinson (1986), and Hinkley and Wang (1988). Yeo and Johnson (2000)
provide a family of transformations that does not require the variables to be
positive.
According to Tierney (1990, p. 297), one of the earliest uses of dynamic
graphics was to examine the effect of power transformations. In particular,
a method suggested by Fowlkes (1969) varies λ until the normal probability
plot is straight. McCulloch (1993) also gave a graphical method for finding
response transformations. A similar method would plot Y^(λ) versus α̂0 + β̂λ^T x
for λ ∈ Λ. See Example 1.5. Cook and Weisberg (1982, section 2.4) surveys
several transformation methods, and Cook and Weisberg (1994) described
how to use an inverse response plot of fitted values versus Y to visualize the
needed transformation.
The literature on numerical methods for variable selection in the OLS
multiple linear regression model is enormous. Three important papers are
Jones (1946), Mallows (1973), and Furnival and Wilson (1974). Chatterjee
and Hadi (1988, p. 43-47) give a nice account of the effects of overfitting
on the least squares estimates. Also see Claeskens and Hjort (2003), Hjort
and Claeskens (2003) and Efron, Hastie, Johnstone and Tibshirani (2004).
Some useful ideas for variable selection when outliers are present are given
by Burman and Nolan (1995), Ronchetti and Staudte (1994), and Sommer
and Huggins (1996).
In the variable selection problem, the FF and RR plots can be highly
useful.
5.5 Problems
Problems with an asterisk * are especially important.
5.1. Suppose that the regression model is Yi = 7 + βXi + ei for i = 1, ..., n
where the ei are iid N(0, σ²) random variables. The least squares criterion
is Q(η) = Σ_{i=1}^n (Yi − 7 − ηXi)².
a) What is E(Yi)?
b) Find the least squares estimator b of β by setting the first derivative
dQ(η)/dη equal to zero.
c) Show that your b is the global minimizer of the least squares criterion
Q by showing that the second derivative d²Q(η)/dη² > 0 for all values of η.
5.2. The location model is Yi = µ + ei for i = 1, ..., n where the ei are iid
with mean E(ei) = 0 and constant variance VAR(ei) = σ². The least squares
estimator µ̂ of µ minimizes the least squares criterion Q(η) = Σ_{i=1}^n (Yi − η)².
b) Find E(b).
c) Find VAR(b).
(Hint: Note that b = Σ_{i=1}^n ki Yi where the ki depend on the Xi, which are
treated as constants.)
5.4. Assume that the response variable Y is height, and the explanatory
variables are X2 = sternal height, X3 = cephalic index, X4 = finger to ground,
X5 = head length, X6 = nasal height, X7 = bigonal breadth. Suppose that
the full model uses all 6 predictors plus a constant (= X1 ) while the reduced
model uses the constant and sternal height. Test whether the reduced model
can be used instead of the full model using the output on the previous page.
The data set had 74 cases.
5.5. The above output comes from the Johnson (1996) STATLIB data
set bodyfat after several outliers are deleted. It is believed that Y = β1 +
β2X2 + β3X22 + e where Y is the person’s bodyfat and X2 is the person’s
density. Measurements on 245 people were taken and are represented by
the output above. In addition to X2 and X22 , 7 additional measurements
X4 , ..., X10 were taken. Both the full and reduced models contain a constant
X1 ≡ 1.
a) Predict Y if X2 = 1.04. (Use the reduced model Y = β1 + β2X2 +
β3X22 + e.)
b) Test whether the reduced model can be used instead of the full model.
5.9. The output above was produced from the file mussels.lsp in Arc.
Let Y = log(M) where M is the muscle mass of a mussel. Let X1 ≡ 1, X2 =
log(H) where H is the height of the shell, and let X3 = log(S) where S is
the shell mass. Suppose that it is desired to predict Yh,new if log(H) = 4 and
log(S) = 5, so that xh = (1, 4, 5). Assume that se(Ŷh ) = 0.410715 and that
se(pred) = 0.467664.
a) If xh = (1, 4, 5) find a 99% confidence interval for E(Yh ).
b) If xh = (1, 4, 5) find a 99% prediction interval for Yh,new .
5.11. The output above is from the multiple linear regression of the
response y = height on the two nontrivial predictors sternal height = height
at shoulder and finger to ground = distance from the tip of a person’s middle
finger to the ground.
a) Consider the plot with yi on the vertical axis and the least squares
fitted values ŷi on the horizontal axis. Sketch how this plot should look if
the multiple linear regression model is appropriate.
b) Sketch how the residual plot should look if the residuals ri are on the
vertical axis and the fitted values ŷi are on the horizontal axis.
c) From the output, are sternal height and finger to ground useful for
predicting height? (Perform the ANOVA F test.)
5.12. Suppose that it is desired to predict the weight of the brain (in
grams) from the cephalic index measurement. The output below uses data
from 267 people.
5.17. The above table gives summary statistics for 4 MLR models con-
sidered as final submodels after performing variable selection. The forward
response plot and residual plot for the full model L1 was good. Model L3
was the minimum Cp model found. Which model should be used as the final
submodel? Explain briefly why each of the other 3 submodels should not be
used.
5.18. The above table gives summary statistics for 4 MLR models con-
sidered as final submodels after performing variable selection. The forward
response plot and residual plot for the full model L1 was good. Model L2
was the minimum Cp model found. Which model should be used as the final
submodel? Explain briefly why each of the other 3 submodels should not be
used.
5.19. The output above is from software that does all subsets variable
selection. The data is from Ashworth (1842). The predictors were A =
log(1692 property value), B = log(1841 property value) and C = log(percent
increase in value) while the response variable is Y = log(1841 population).
a) The top output corresponds to data with 2 small outliers. From this
output, what is the best model? Explain briefly.
b) The bottom output corresponds to the data with the 2 outliers re-
moved. From this output, what is the best model? Explain briefly.
nx <- matrix(rnorm(300),nrow=100,ncol=3)
ffL(nx,y)
Include the FFλ plot in Word by pressing the Ctrl and c keys simulta-
neously. This will copy the graph. Then in Word use the menu commands
“File>Paste”.
d) To make the transformation plots type the following command.
Tplt(nx,y)
The first plot will be for λ = −1. Move the cursor to the plot and hold
the rightmost mouse key down. Highlight stop to go to the next plot.
Repeat these mouse operations to look at all of the plots. When you get a
plot that clusters about the OLS line which is included in each plot, include
this transformation plot in Word by pressing the Ctrl and c keys simulta-
neously. This will copy the graph. Then in Word use the menu commands
“File>Paste”. You should get the log transformation.
e) Type the following commands.
Use the mouse to highlight the created output and include the output in
Word.
f) Write down the least squares equation for log(Y) using the output in e).
Problems using ARC
To quit Arc, move the cursor to the x in the northeast corner and click.
Problems 5.21–5.26 use data sets that come with Arc (Cook and Weisberg
1999a).
5.21∗. a) In Arc enter the menu commands “File>Load>Data>ARCG”
and open the file big-mac.lsp. Next use the menu commands “Graph&Fit>
Plot of” to obtain a dialog window. Double click on TeachSal and then
double click on BigMac. Then click on OK. These commands make a plot
of x = TeachSal = primary teacher salary in thousands of dollars versus y =
BigMac = minutes of labor needed to buy a Big Mac and fries. Include the
plot in Word.
Consider transforming y with a (modified) power transformation
y^(λ) = (y^λ − 1)/λ for λ ≠ 0, and y^(λ) = log(y) for λ = 0.
The response variable Y is the mussel muscle mass M, and the explanatory
variables are X2 = S = shell mass, X3 = H = shell height, X4 = L = shell
length and X5 = W = shell width.
Enter the menu commands “Graph&Fit>Fit linear LS” and fit the model:
enter S, H, L, W in the “Terms/Predictors” box, M in the “Response” box
and click on OK.
a) To get a forward response plot, enter the menu commands
“Graph&Fit>Plot of” and place L1:Fit-Values in the H–box and M in the
V–box. Copy the plot into Word.
b) Based on the forward response plot, does a linear model seem reason-
able?
c) To get a residual plot, enter the menu commands “Graph&Fit>Plot
of” and place L1:Fit-Values in the H–box and L1:Residuals in the V–box.
Copy the plot into Word.
d) Based on the residual plot, what MLR assumption seems to be vio-
lated?
e) Include the regression output in Word.
f) Ignoring the fact that an important MLR assumption seems to have
been violated, do any of the predictors seem to be needed given that the other
predictors are in the model?
g) Ignoring the fact that an important MLR assumption seems to have
been violated, perform the ANOVA F test.
5.23∗. In Arc enter the menu commands “File>Load>Data>ARCG”
and open the file mussels.lsp. Use the commands “Graph&Fit>Scatterplot
Matrix of.” In the dialog window select H, L, W, S and M (so select M last).
Click on “OK” and include the scatterplot matrix in Word. The response M
is the edible part of the mussel while the 4 predictors are shell measurements.
Are any of the marginal predictor relationships nonlinear? Is E(M|H) linear
or nonlinear?
5.24∗. The file wool.lsp has data from a 3³ experiment on the behavior of
worsted yarn under cycles of repeated loadings. The response y is the number
of cycles to failure and the three predictors are the length, amplitude and
load.
d) Similarly make an FF plot using the fitted values from the two models.
Add the two lines. Include the plot in Word.
e) Next put the residuals from the submodel on the V axis and log(Ht)
on the H axis. Include this residual plot in Word.
f) Next put the residuals from the submodel on the V axis and the fitted
values from the submodel on the H axis. Include this residual plot in Word.
g) Next put log(Vol) on the V axis and the fitted values from the submodel
on the H axis. Include this forward response plot in Word.
h) Does log(Ht) seem to be an important term? If the only goal is to
predict volume, will much information be lost if log(Ht) is omitted? Remark
on the information given by each of the 6 plots. (Some of the plots
will suggest that log(Ht) is needed while others will suggest that log(Ht) is
not needed.)
5.26∗. a) In this problem we want to build a MLR model to predict
Y = g(BigMac) for some power transformation g. In Arc enter the menu
commands “File>Load>Data>Arcg” and open the file big-mac.lsp. Make
a scatterplot matrix of the variate valued variables and include the plot in
Word.
b) The log rule makes sense for the BigMac data. From the scatterplot,
use the “Transformations” menu and select “Transform to logs”. Include the
resulting scatterplot in Word.
c) From the “Mac” menu, select “Transform”. Then select all 10 vari-
ables and click on the “Log transformations” button. Then click on “OK”.
From the “Graph&Fit” menu, select “Fit linear LS.” Use log[BigMac] as the
response and the other 9 “log variables” as the Terms. This model is the full
model. Include the output in Word.
d) Make a forward response plot (L1:Fit-Values in H and log(BigMac) in
V) and residual plot (L1:Fit-Values in H and L1:Residuals in V) and include
both plots in Word.
e) Using the “L1” menu, select “Examine submodels” and try forward
selection and backward elimination. Using the Cp ≤ 2k rule suggests that the
submodel using log[service], log[TeachSal] and log[TeachTax] may be good.
From the “Graph&Fit” menu, select “Fit linear LS”, fit the submodel and
5.29∗. The following data set has 5 babies that are “good leverage
points:” they look like outliers but should not be deleted because they follow
the same model as the bulk of the data.
a) In Arc enter the menu commands “File>Load>3 1/2 Floppy(A:)” and
open the file cbrain.lsp. Select transform from the cbrain menu, and add
size1/3 using the power transformation option (p = 1/3). From
Graph&Fit, select Fit linear LS. Let the response be brnweight and as terms
include everything but size and Obs. Hence your model will include size1/3.
This regression will add L1 to the menu bar. From this menu, select Examine
submodels. Choose forward selection. You should get models including k =
2 to 12 terms including the constant. Find the model with the smallest
Cp (I) = CI statistic and include all models with the same k as that model
in Word. That is, if k = 2 produced the smallest CI , then put the block
with k = 2 into Word. Next go to the L1 menu, choose Examine submodels
and choose Backward Elimination. Find the model with the smallest CI and
include all of the models with the same value of k in Word.
b) What model was chosen by forward selection?
c) What model was chosen by backward elimination?
d) Which model do you prefer?
e) Give an explanation for why the two models are different.
f) Pick a submodel and include the regression output in Word.
g) For your submodel in f), make an RR plot with the residuals from the
CHAPTER 5. MULTIPLE LINEAR REGRESSION 182
full model on the V axis and the residuals from the submodel on the H axis.
Add the OLS line and the identity line y=x as visual aids. Include the RR
plot in Word.
h) Similarly make an FF plot using the fitted values from the two models.
Add the two lines. Include the FF plot in Word.
i) Using the submodel, include the forward response plot (of Ŷ versus Y )
and residual plot (of Ŷ versus the residuals) in Word.
j) Using results from f)-i), explain why your submodel is a good model.
5.30. a) In Arc enter the menu commands “File>Load>3 1/2 Floppy(A:)”
and open the file cyp.lsp. This data set consists of various measurements
taken on men from Cyprus around 1920. Let the response Y = height and
X = cephalic index = 100(head breadth)/(head length). Use Arc to get the
least squares output and include the relevant output in Word.
b) Intuitively, the cephalic index should not be a good predictor for a
person’s height. Perform a 4 step test of hypotheses with Ho: β2 = 0.
5.31. a) In Arc enter the menu commands “File>Load>3 1/2 Floppy(A:)”
and open the file cyp.lsp.
The response variable Y is height, and the explanatory variables are a
constant, X2 = sternal height (probably height at shoulder) and X3 = finger
to ground.
Enter the menu commands “Graph&Fit>Fit linear LS” and fit the model:
enter sternal height and finger to ground in the “Terms/Predictors” box,
height in the “Response” box and click on OK.
Include the output in Word. Your output should certainly include the
lines from “Response = height” to the ANOVA table.
b) Predict Y if X2 = 1400 and X3 = 650.
c) Perform a 4 step ANOVA F test of the hypotheses with
Ho: β2 = β3 = 0.
d) Find a 99% CI for β2.
e) Find a 99% CI for β3.
f) Perform a 4 step test for β2 = 0.
Chapter 6
Regression Diagnostics
jth predictor Xj = (x1,j, ..., xn,j)^T where j = k + 1. The sample mean and
covariance matrix of the nontrivial predictors are
u = (1/n) Σ_{i=1}^n ui    (6.1)
and
C = Cov(U) = (1/(n − 1)) Σ_{i=1}^n (ui − u)(ui − u)^T,    (6.2)
respectively.
Some important numerical quantities that are used as diagnostics measure
the distance of ui from u and the influence of case i on the OLS fit β̂ ≡ β̂OLS.
Recall that the vector of fitted values is
Ŷ = Xβ̂ = X(X^T X)^{-1}X^T Y = HY
where H is the hat matrix. Recall that the ith residual ri = Yi − Ŷi. Case (or
leave one out or deletion) diagnostics are computed by omitting the ith case
from the OLS regression. Following Cook and Weisberg (1999a, p. 357), let
Ŷ(i) = Xβ̂(i)    (6.3)
denote the n × 1 vector of fitted values for estimating β with OLS without
the ith case. Denote the jth element of Ŷ(i) by Ŷ(i),j. It can be shown that
the variance of the ith residual VAR(ri) = σ²(1 − hi). The usual estimator
of the error variance is
σ̂² = Σ_{i=1}^n ri² / (n − p).
The (internally) studentized residual
ei = ri / (σ̂ √(1 − hi))
has zero mean and unit variance.
Definition 6.1. The ith leverage hi = Hii is the ith diagonal element of
the hat matrix H. The ith squared (classical) Mahalanobis distance is
MD²i = (ui − u)^T C^{-1}(ui − u). The ith Cook's distance is
CDi = (β̂(i) − β̂)^T X^T X(β̂(i) − β̂) / (p σ̂²) = (Ŷ(i) − Ŷ)^T(Ŷ(i) − Ŷ) / (p σ̂²)    (6.4)
    = (1/(p σ̂²)) Σ_{j=1}^n (Ŷ(i),j − Ŷj)².
It can be shown (Proposition 6.1c) that
CDi = ri² hi / (p σ̂² (1 − hi)²) = (e²i / p) hi/(1 − hi).
When the statistics CDi , hi and MDi are large, case i may be an outlier or
influential case. Examining a stem plot or dot plot of these three statistics for
unusually large values can be useful for flagging influential cases. Cook and
Weisberg (1999a, p. 358) suggest examining cases with CDi > 0.5 and that
cases with CDi > 1 should always be studied. Since H = H^T and H = HH,
the hat matrix is symmetric and idempotent. Hence the eigenvalues of H
are zero or one and trace(H) = Σ_{i=1}^n hi = p. Rousseeuw and Leroy (1987, p.
220 and p. 224) suggest using hi > 2p/n and MD²i > χ²_{p−1,0.95} as benchmarks
for leverages and Mahalanobis distances, where χ²_{p−1,0.95} is the 95th percentile
of a chi–square distribution with p − 1 degrees of freedom.
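In R, these quantities are available directly; the sketch below (illustrative data, not from the text) computes the leverages, internally studentized residuals, Cook's distances and classical Mahalanobis distances, and applies the benchmarks just described.

# Illustrative R sketch of the numerical diagnostics and benchmarks above.
set.seed(8)
n <- 100
u <- matrix(rnorm(n*3), n, 3)               # nontrivial predictors
y <- 1 + u %*% c(1, 0.5, -1) + rnorm(n)
fit <- lm(y ~ u)
p <- length(coef(fit))
h <- hatvalues(fit)                         # leverages h_i; sum(h) = p
e <- rstandard(fit)                         # internally studentized residuals
CD <- cooks.distance(fit)                   # Cook's distances
MD2 <- mahalanobis(u, colMeans(u), cov(u))  # squared classical Mahalanobis distances
which(h > 2*p/n)                            # leverage benchmark
which(MD2 > qchisq(0.95, p - 1))            # Mahalanobis benchmark
which(CD > 0.5)                             # Cook's distance benchmark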
Note that Proposition 6.1c) implies that Cook’s distance is the product
of the squared residual and a quantity that becomes larger the farther ui is
from u. Hence influence is roughly the product of leverage and distance of
Ŷi from Yi (see Fox 1991, p. 21). Mahalanobis distances and leverages both
define ellipsoids based on a metric closely related to the sample covariance
matrix of the nontrivial predictors. All points ui on the same ellipsoidal
contour are the same distance from u and have the same leverage (or the
same Mahalanobis distance).
Cook’s distances, leverages, and Mahalanobis distances can be effective
for finding influential cases when there is a single outlier, but can fail if there
are two or more outliers. Nevertheless, these numerical diagnostics combined
with plots such as residuals versus fitted values and fitted values versus the
response are probably the most effective techniques for detecting cases that
affect the fitted values when the multiple linear regression model is a good
approximation for the bulk of the data. In fact, these diagnostics may be
useful for perhaps up to 90% of such data sets while residuals from robust
regression and Mahalanobis distances from robust estimators of multivariate
location and dispersion may be helpful for perhaps another 3% of such data
sets.
263-266) of age less than 7 months that are x-outliers. Nine toddlers were
between 7 months and 3.5 years of age, four of whom appear to be x-outliers
(cases 241, 243, 267, and 269). (The points are not labeled on the plot, but
the five infants and these four toddlers are easy to recognize when discrepant.)
Figure 1.1 (on p. 7) shows the RR plot. We dispose of the OLS and
L1 fits by noting that the very close agreement in their residuals implies
an operational equivalence in the two fits. ALMS fits the nine x-outliers
quite differently than OLS, L1 , and ALTS. All fits are highly correlated for
the remaining 265 points, showing that all fits agree on these cases, thus
focusing attention on the infants and toddlers.
All of the Splus fits except ALMS accommodated the infants. The funda-
mental reason that ALMS is the “outlier” among the fits is that the infants
and toddlers, while well separated from the rest of data, turn out to fit the
overall linear model quite well. A strength of the LMS criterion – that it
does not pay much attention to the leverage of cases – is perhaps a weakness
here since it leads to the impression that these cases are bad, whereas they
are no more than atypical.
Turning to optimization issues, ALMS had an objective function of 52.7
while KLMS had a much higher objective function of 114.7 even though
KLMS used ten times as many subsamples. The current version of lmsreg
will no longer give a smaller objective function to the algorithm that uses a
smaller number of subsamples.
Figure 1.2 (on p. 8) shows the residual plots for the Gladstone data when
one observation, 119, had head length entered incorrectly as 109 instead of
199. Unfortunately, the ALMS estimator did not detect this outlier.
Example 6.2. Buxton (1920, p. 232-5) gives 20 measurements of 88
men. We chose to predict stature using an intercept, head length, nasal
height, bigonal breadth, and cephalic index. Observation 9 was deleted since
it had missing values. Five individuals, numbers 62-66, were reported to be
about 0.75 inches tall with head lengths well over five feet! This appears to
be a clerical error; these individuals’ stature was recorded as head length and
the integer 18 or 19 given for stature, making the cases massive outliers with
enormous leverage. These absurdly bad observations turned out to confound
the standard high breakdown (HB) estimators. The residual plots in Figure
1.3 (on p. 10) show that five of the six Splus estimators accommodated them.
This is a warning that even using the objective of high breakdown will not
necessarily protect one from extremely aberrant data. Nor should we take
much comfort in the fact that KLMS clearly identified them; the criterion of
this fit was worse than that of the ALMS fit, and so should be regarded as
inferior.
This plot is no longer reproducible because of changes in the Splus code.
Figure 7.1 (on p. 229) shows the RR plot for more current (as of 2000) Splus
implementations of lmsreg and ltsreg. Problem 6.1 shows how to create
RR and FF plots.
Example 6.3. Figure 1.6 (on p. 17) is nearly identical to a forward
response plot. Since the plotted points do not scatter about the identity
line, the multiple linear regression model is not appropriate. Nevertheless,
Yi ∝ (xTi β̂ OLS )3 .
In Chapter 12 it will be shown that the forward response plot is useful for
visualizing the conditional distribution Y | β^T x in 1D regression models where
Y is independent of x given β^T x.
Figure 6.1: Residual and Forward Response Plots for the Tremearne Data
The following techniques are useful for detecting outliers when the mul-
tiple linear regression model is appropriate.
1. Find the OLS residuals and fitted values and make a forward response
plot and a residual plot. Look for clusters of points that are separated
from the bulk of the data and look for residuals that have large absolute
values. Beginners frequently label too many points as outliers. Try
to estimate the standard deviation of the residuals in both plots. In
the residual plot, look for residuals that are more than 5 standard
deviations away from the r = 0 line.
2. Make an RR plot. See Figures 1.1 and 7.1 on p. 7 and p. 229, respec-
tively.
4. Display the residual plots from several different estimators. See Figures
1.2 and 1.3 on p. 8 and p. 10, respectively.
y = m(β^T x) + e.    (6.7)
and thus
yi = t(zi) = β^T xi + ei.    (6.9)
There are several important regression models that do not have additive
errors, including generalized linear models. If
y = g(β^T x, e),    (6.10)
then a useful model checking plot is a plot of
β̂^T x versus y if model (6.10) holds. Residual plots are also used for model
assessment, but residual plots emphasize lack of fit.
The following notation is useful. Let m̂ be an estimator of m. Let the
ith predicted or fitted value ŷi = m̂i = m̂(xi^T), and let the ith residual
ri = yi − ŷi .
Definition 6.6. Then a fit–response or FY plot is a plot of ŷ versus y.
Application 6.1. Use the FY plot to check the model for goodness of
fit, outliers and influential cases.
To understand the information contained in the FY plot, first consider a
plot of mi versus yi . Ignoring the error in the model yi = mi + ei gives y = m
which is the equation of the identity line with unit slope and zero intercept.
The vertical deviations from the identity line are yi − mi = ei . The reasoning
for the FY plot is very similar. The line y = ŷ is the identity line and the
vertical deviations from the line are the residuals yi − m̂i = yi − ŷi = ri .
Suppose that the model yi = mi + ei is a good approximation to the data
and that m̂ is a good estimator of m. If the identity line is added to the plot
as a visual aid, then the plotted points will scatter about the line and the
variability of the residuals can be examined.
For a given data set, it will often be useful to generate the FY plot,
residual plots, and model checking plots. An advantage of the FY plot is
that if the model is not a good approximation to the data or if the estimator
m̂ is poor, then detecting deviations from the identity line is simple. Also,
residual variability is easier to judge against a line than a curve. On the
other hand, model checking plots may provide information about the form
of the conditional mean function E(y|x) = m(x^T β). See Chapter 12.
Many numerical diagnostics for detecting outliers and influential cases on
the fit have been suggested, and often this research generalizes results from
Cook (1977, 1986) to various models of form (6.6). Information from these
diagnostics can be incorporated into the FY plot by highlighting cases that
have large absolute values of the diagnostic.
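As a concrete illustration, the following R sketch (simulated data, not from the text) makes an FY plot and highlights cases whose Cook's distance exceeds min(0.5, 2p/n), the threshold used later in this chapter.

# Illustrative R sketch of an FY plot with large Cook's distances highlighted.
set.seed(9)
n <- 100
x <- matrix(rnorm(n*2), n, 2)
y <- 1 + x %*% c(2, -1) + rnorm(n)
y[1:3] <- y[1:3] + 10                       # plant a few outliers
fit <- lm(y ~ x)
p <- length(coef(fit)); CD <- cooks.distance(fit)
big <- CD > min(0.5, 2*p/n)
plot(fitted(fit), y, xlab = "FIT", ylab = "Y", main = "FY Plot",
     pch = ifelse(big, 16, 1))              # filled points have large CD_i
abline(0, 1)                                # identity line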
The most important example is the multiple linear regression (MLR)
model. For this model, the FY plot is the forward response plot. If the MLR
model holds and the errors ei are iid with zero mean and constant variance
σ 2, then the plotted points should scatter about the identity line with no
other pattern.
When the bulk of the data follows the MLR model, the following rules
of thumb are useful for finding influential cases and outliers. Look for points
with large absolute residuals and for points far away from y. Also look for
gaps separating the data into clusters. To determine whether small clusters
are outliers or good leverage points, give zero weight to the clusters, and fit
a MLR estimator to the bulk of the data. Denote the weighted estimator by
β̂w . Then plot ŷw versus y using the entire data set. If the identity line passes
through the bulk of the data but not the cluster, then the cluster points may
be outliers.
To see why gaps are important, suppose that OLS was used to obtain
ŷ = m̂. Then the squared correlation (corr(y, ŷ))2 is equal to the coefficient
of determination R2 . Even if an alternative MLR estimator is used, R2 over
emphasizes the strength of the MLR relationship when there are two clusters
of data since much of the variability of y is due to the smaller cluster.
Now suppose that the MLR model is incorrect. If OLS is used in the FY
plot, and if y = g(βT x, e), then the plot can be used to visualize g for many
data sets (see Ch. 12). Hence the plotted points may be very far from linear.
The plotted points in FY plots created from other MLR estimators may not
be useful for visualizing g, but will also often be far from linear.
A commonly used diagnostic is Cook’s distance CDi . Assume that OLS
is used to fit the model and to make the FY plot ŷ versus y. Then CDi
tends to be large if ŷ is far from the sample mean y and if the corresponding
absolute residual |ri | is not small. If ŷ is close to y then CDi tends to be
small unless |ri | is large. An exception to these rules of thumb occurs if a
group of cases form a cluster and the OLS fit passes through the cluster.
Then the CDi ’s corresponding to these cases tend to be small even if the
cluster is far from y.
An advantage of the FY plot over numerical diagnostics is that while it
depends strongly on the model m, defining diagnostics for different fitting
methods can be difficult while the FY plot is simply a plot of ŷ versus y. For
the MLR model, the FY plot can be made from any good MLR estimator,
including OLS, least absolute deviations and the R/Splus estimator lmsreg.
Example 6.2 (continued): Figure 6.2 shows the forward response plot
and residual plot for the Buxton data. Although an index plot of Cook’s
distance CDi may be useful for flagging influential cases, the index plot
provides no direct way of judging the model against the data. As a remedy,
cases in the FY plot with CDi > min(0.5, 2p/n) were highlighted. Notice
that the OLS fit passes through the outliers, but the FY plot is resistant to y–
outliers since y is on the vertical axis. Also notice that although the outlying
cluster is far from y only two of the outliers had large Cook’s distance. Hence
masking occurred for both Cook’s distances and for OLS residuals, but not
for OLS fitted values. FY plots using other MLR estimators such as lmsreg
were similar.
High leverage outliers are a particular challenge to conventional numer-
ical MLR diagnostics such as Cook’s distance, but can often be visualized
using the forward response and residual plots. (Using the trimmed views of
Section 11.3 and Chapter 12 is also effective for detecting outliers and other
departures from the MLR model.)
Example 6.5. Hawkins, Bradu, and Kass (1984) present a well known
artificial data set where the first 10 cases are outliers while cases 11-14 are
good leverage points. Figure 6.3 shows the residual and forward response
plots based on the OLS estimator. The highlighted cases have Cook’s dis-
tance > min(0.5, 2p/n), and the identity line is shown in the FY plot. Since
the good cases 11-14 have the largest Cook’s distances and absolute OLS
residuals, swamping has occurred.
Figure 6.3: Forward response plot (Y versus FIT) and residual plot (RES
versus FIT) for the Hawkins, Bradu and Kass data, with cases 1-14 labeled.
Figure 6.4: FY plot for the artificial data of Example 6.6.
Figure 6.5: FY plot for the lynx data of Example 6.7.
(Masking has also occurred since the out-
liers have small Cook’s distances, and some of the outliers have smaller OLS
residuals than clean cases.) To determine whether both clusters are outliers
or if one cluster consists of good leverage points, cases in both clusters could
be given weight zero and the resulting forward response plot created. (Al-
ternatively, forward response plots based on the tvreg estimator of Section
11.3 could be made with the untrimmed cases highlighted. For high levels of
trimming, the identity line often passes through the good leverage points.)
The above example is typical of many “benchmark” outlier data sets for
MLR. In these data sets traditional OLS diagnostics such as Cook’s distance
and the residuals often fail to detect the outliers, but the combination of the
FY plot and residual plot is usually able to detect the outliers.
Example 6.6. MathSoft (1999a, p. 245-246) gives an FY plot for sim-
ulated data. In this example the simulated data set is modified by planting
10 outliers. Let x1 and x2 be iid uniform U(−1, 1) random variables, and let
y = x1 x2 + e where the errors e are iid N(0, 0.04) random variables. The ar-
tificial data set uses 400 cases, but the first 10 cases used y ∼ N(−1.5, 0.04),
x1 ∼ N(0.2, 0.04) and x2 ∼ N(0.2, 0.04) where y, x1, and x2 were indepen-
dent. The model y = m(x1, x2 ) + e was fitted nonparametrically without
using knowledge of the true regression relationship. The fitted values m̂
were obtained from the Splus function ppreg for projection pursuit regres-
sion (Friedman and Stuetzle, 1981). The outliers are easily detected with
the FY plot shown in Figure 6.4.
Example 6.7. The lynx data is a well known time series concerning the
number wi of lynx trapped in a section of Northwest Canada from 1821 to
1934. There were n = 114 cases and MathSoft (1999b, p. 166-169) fits an
AR(11) model yi = β0 + β1yi−1 + β2yi−2 + · · · + β11yi−11 + ei to the data
where yi = log(wi ) and i = 12, 13, ..., 114. The FY plot shown in Figure 6.5
suggests that the AR(11) model fits the data reasonably well. To compare
different models or to find a better model, use an FF plot of Y and the fitted
values from several competing time series models. See Problem 6.4.
6.5 Complements
Excellent introductions to OLS diagnostics include Fox (1991) and Cook and
Weisberg (1999a, p. 161-163, 183-184, section 10.5, section 10.6, ch. 14, ch.
15, ch. 17, ch. 18, and section 19.3). More advanced works include Belsley,
Kuh, and Welsch (1980), Cook and Weisberg (1982), Atkinson (1985) and
Chatterjee and Hadi (1988). Hoaglin and Welsh (1978) examines the hat
matrix while Cook (1977) introduces Cook’s distance.
Some other papers of interest include Barrett and Gray (1992), Gray
(1985), Hadi and Simonoff (1993), Hettmansperger and Sheather (1992),
Velilla (1998), and Velleman and Welsch (1981).
Hawkins and Olive (2002, p. 141, 158) suggest using the RR and FF
plots. Typically RR and FF plots are used if there are several estimators for
one fixed model, eg OLS versus L1 or frequentist versus Bayesian for multiple
linear regression, or if there are several competing models. An advantage of
the FF plot is that the response Y can be added to the plot. The FFλ
plot is an FF plot where the fitted values were obtained from competing
power transformation models indexed by the power transformation parameter
λ ∈ Λc . Variable selection uses both FF and RR plots.
Rousseeuw and van Zomeren (1990) suggest that Mahalanobis distances
6.6 Problems
R/Splus Problems
Warning: Use the command source(“A:/rpack.txt”) to download
the programs and the command source(“A:/robdata.txt”) to download
the data. See Preface or Section 14.2. Typing the name of the rpack
function, eg MLRplot, will display the code for the function. Use the args
command, eg args(MLRplot), to display the needed arguments for the func-
tion.
6.1∗. a) After entering the two source commands above, enter the follow-
ing command.
> MLRplot(buxx,buxy)
Click the rightmost mouse button (and in R click on Stop). The forward
response plot should appear. Again, click the rightmost mouse button (and
in R click on Stop). The residual plot should appear. Hold down the Ctrl
and c keys to make a copy of the two plots. Then paste the plots in Word.
b) The response variable is height, but 5 cases were recorded with heights
of about 0.75 inches. The highlighted squares in the two plots correspond
to cases with large Cook’s distances. With respect to the Cook’s distances,
what is happening, swamping or masking?
c) RR plots: One feature of the MBA estimator (see Chapter 7) is that it
depends on the sample of 7 centers drawn and changes each time the function
is called. In ten runs, about nine plots will look like Figure 7.1, but in about
one plot the MBA estimator will also pass through the outliers.
If you use R, type the following command and include the plot in Word.
> library(lqs)
> rrplot2(buxx,buxy)
If you use Splus, type the following command and include the plot in
Word.
> rrplot(buxx,buxy)
d) FF plots: ideally, the plots in the top row will cluster about the identity
line.
If you use R, type the following command and include the plot in Word.
> library(lqs)
> ffplot2(buxx,buxy)
If you use Splus, type the following command and include the plot in
Word.
> ffplot(buxx,buxy)
> diagplot(buxx,buxy)
6.3. This problem makes the FY plot for the lynx data in Example 6.7.
a) Check that the lynx data is in Splus by typing the command help(lynx).
A window will appear if the data is available.
b) For Splus, enter the following Splus commands to produce the FY plot.
Include the plot in Word. The command abline(0,1) adds the identity line.
> library(ts)
> data(lynx)
> Y <- log(lynx)
> out <- ar.yw(Y)
> Yts <- Y[12:114]
> FIT <- Yts - out$resid[12:114]
> plot(FIT,Yts)
> abline(0,1)
6.4∗. Following Lin and Pourahmadi (1998), consider the lynx time se-
ries data and let the response Yt = log(lynx). Moran (1953) suggested the
autoregressive AR(2) model Ŷt = 1.05 + 1.41Yt−1 − 0.77Yt−2 . Tong (1977)
suggested the AR(11) model Ŷt = 1.13Yt−1 − .51Yt−2 + .23Yt−3 − 0.29Yt−4 +
.14Yt−5 − 0.14Yt−6 + .08Yt−7 − .04Yt−8 + .13Yt−9 + 0.19Yt−10 − .31Yt−11 . Brock-
well and Davis (1991, p. 550) suggested the AR(12) model Ŷt = 1.123 +
1.084Yt−1 − .477Yt−2 + .265Yt−3 − 0.218Yt−4 + .180Yt−9 − 0.224Yt−12 . Tong
(1983) suggested the following two self–exciting autoregressive models. The
SETAR(2,7,2) model uses Ŷt = .546 + 1.032Yt−1 − .173Yt−2 + .171Yt−3 −
0.431Yt−4 + .332Yt−5 − 0.284Yt−6 + .210Yt−7 if Yt−2 ≤ 3.116 and Ŷt = 2.632 +
1.492Yt−1 − 1.324Yt−2 , otherwise. The SETAR(2,5,2) model uses Ŷt = .768 +
1.064Yt−1 − .200Yt−2 + .164Yt−3 − 0.428Yt−4 + .181Yt−5 if Yt−2 ≤ 3.05 and
Ŷt = 2.54 + 1.474Yt−1 − 1.202Yt−2 , otherwise. The FF plot of the fitted val-
ues and the response can be used to compare the models. Type the rpack
command fflynx() (in R, 1st type library(ts) and data(lynx)).
a) Include the resulting plot and correlation matrix in Word.
b) Which model seems to be best? Explain briefly.
c) Which two pairs of models gave very similar fitted values?
6.5. This problem may not work in R. Type help(ppreg) to make
sure that Splus has the function ppreg. Then make the FY plot for Example
6.6 with the following commands. Include the plot in Word.
> set.seed(14)
> x1 <- runif(400,-1,1)
> x2 <- runif(400,-1,1)
> eps <- rnorm(400,0,.2)
> Y <- x1*x2 + eps
> x <- cbind(x1,x2)
> x[1:10,] <- rnorm(20,.2,.2)
> Y[1:10] <- rnorm(10,-1.5,.2)
> out <- ppreg(x,Y,2,3)
> FIT <- out$ypred
> plot(FIT,Y)
> abline(0,1)
Arc problems
Warning: The following problem uses data from the book’s web-
page. Save the data files on a disk. Get in Arc and use the menu com-
mands “File > Load” and a window with a Look in box will appear. Click
on the black triangle and then on 3 1/2 Floppy(A:). Then click twice on the
data set name.
Using material learned in Chapters 5–6, analyze the data sets described
in Problems 6.5–6.12. Assume that the response variable Y = t(Z) and
that the predictor variables X2 , ..., Xp are functions of the remaining variables
W2 , ..., Wr. Unless told otherwise, the full model Y, X1 , X2 , ..., Xp (where
X1 ≡ 1) should use functions of every variable W2 , ..., Wr (and often p = r +
1). (In practice, often some of the variables and some of the cases are deleted,
but we will use all variables and cases, unless told otherwise, primarily so
that the instructor has some hope of grading the problems in a reasonable
amount of time.) See pages 163–165 for useful tips for building a full model.
Read the description of the data provided by Arc. Once you have a
good full model, perform forward selection and backward elimination. Find
the model that minimizes Cp (I) and find the smallest value of k such that
Cp (I) ≤ 2k. The minimum Cp model often has too many terms while the
2nd model sometimes has too few terms.
a) Give the output for your full model, including Y = t(Z) and R2 . If it
is not obvious from the output what your full model is, then write down the
full model. Include a forward response plot for the full model. (This plot
should be linear).
b) Give the output for your final submodel. If it is not obvious from the
output what your submodel is, then write down the final submodel.
c) Give between 3 and 5 plots that justify that your multiple linear re-
gression submodel is reasonable. Below or beside each plot, give a brief
explanation for how the plot gives support for your model.
6.6. For the file bodfat.lsp, described in Example 1.4, use Z = Y but do
not use X1 as a predictor in the full model. Do parts a), b) and c) above.
6.7∗. For the file boston2.lsp, described in Examples 1.6, 12.7 and 12.8
use Z = (y =)CRIM. Do parts a), b) and c) above Problem 6.6.
Note: Y = log(CRIM), X4 , X8 , is an interesting submodel, but more
predictors are probably needed.
6.8∗. For the file major.lsp, described in Example 6.4, use Z = Y . Do
parts a), b) and c) above Problem 6.6.
Note: there are one or more outliers that affect numerical methods of vari-
able selection.
6.9. For the file marry.lsp, described below, use Z = Y . This data set
comes from Hebbler (1847). The census takers were not always willing to
count a woman’s husband if he was not at home. Do not use the predictor
X2 in the full model. Do parts a), b) and c) above Problem 6.6.
6.10∗. For the file museum.lsp, described below, use Z = Y . Do parts
a), b) and c) above Problem 6.6.
This data set consists of measurements taken on skulls at a museum and
was extracted from tables in Schaaffhausen (1878). There are at least three
groups of data: humans, chimpanzees and gorillas. The OLS fit obtained
from the humans passes right through the chimpanzees. Since Arc numbers
cases starting at 0, cases 47–59 are apes. These cases can be deleted by
highlighting the cases with small values of Y in the scatterplot matrix and
using the case deletions menu. (You may need to maximize the window
containing the scatterplot matrix in order to see this menu.)
i) Try variable selection using all of the data.
ii) Try variable selection without the apes.
If all of the cases are used, perhaps only X1 , X2 and X3 should be used
in the full model. Note that √Y and X2 have high correlation.
6.11∗. For the file pop.lsp, described below, use Z = Y . Do parts a), b)
and c) above Problem 6.6.
This data set comes from Ashworth (1842). Try transforming all variables
to logs. Then the added variable plots show two outliers. Delete these
two cases. Notice the effect of these two outliers on the p–values for the
coefficients and on numerical methods for variable selection.
Note: then log(Y ) and log(X2 ) make a good submodel.
6.12∗. For the file pov.lsp, described below, use i) Z = flife and ii)
Z = gnp2 = gnp + 2. This dataset comes from Rouncefield (1995). Making
loc into a factor may be a good idea. Use the commands poverty>Make
factors and select the variable loc. For ii), try transforming to logs and
deleting the 6 cases with gnp2 = 0. (These cases had missing values for gnp.
The file povc.lsp has these cases deleted.) Try your final submodel on the
data that includes the 6 cases with gnp2 = 0. Do parts a), b) and c) above
Problem 6.6.
6.13∗. For the file skeleton.lsp, described below, use Z = y. Do parts a),
b) and c) above Problem 6.6.
This data set is also from Schaaffhausen (1878). At one time I heard
or read a conversation between a criminal forensics expert and his date. It
went roughly like “If you wound up dead and I found your femur, I could tell
what your height was to within an inch.” Two things immediately occurred
to me. The first was “no way” and the second was that the man must not
get many dates! The files cyp.lsp and major.lsp have measurements including
height, but their R2 ≈ 0.9. The skeleton data set has at least four groups:
stillborn babies, newborns and children, older humans and apes.
a) Take logs of each variable and fit the regression of log(Y) on log(X1 ),
..., log(X13 ). Make a residual plot and highlight the case with the
smallest residual. From the Case deletions menu, select Delete selection from
file response Y
a) allomet.lsp BRAIN
b) casuarin.lsp W
c) evaporat.lsp Evap
d) hald.lsp Y
e) haystack.lsp Vol
f) highway.lsp rate
(from the menu Highway, select "Add a variate" and type
sigsp1 = sigs + 1. Then you can transform sigsp1.)
g) landrent.lsp Y
h) ozone.lsp ozone
i) paddle.lsp Weight
j) sniffer.lsp Y
k) water.lsp Y
i) Write down the full model that you use and include the full model
residual plot and forward response plot in Word. Give R2 for the full model.
Chapter 7
Robust and Resistant Regression
Suppose that the multiple linear regression (MLR) model
Y = Xβ + e
is appropriate for all or for the bulk of the data. For a high breakdown (HB)
regression estimator b of β, the median absolute residual
MED(|r1 (b)|, ..., |rn(b)|)
stays bounded even if close to half of the data set cases are replaced by
arbitrarily bad outlying cases; ie, the breakdown value of the regression esti-
mator is close to 0.5. The concept of breakdown will be made more precise
in Section 9.4.
Perhaps the first HB regression estimator proposed was the least median
of squares (LMS) estimator. Let |r(b)|(i) denote the ith ordered absolute
residual from the estimate b sorted from smallest to largest, and let r^2_{(i)}(b)
denote the ith ordered squared residual. Three of the most important robust
estimators are defined below.
Definition 7.1. The least quantile of squares (LQS(cn )) estimator mini-
mizes the criterion
Q_{LQS}(b) = r^2_{(c_n)}(b).   (7.1)
When cn /n → 1/2, the LQS(cn ) estimator is also known as the least median
of squares estimator (Hampel 1975, p. 380).
Definition 7.2. The least trimmed sum of squares (LTS(cn )) estimator
(Rousseeuw 1984) minimizes the criterion
Q_{LTS}(b) = \sum_{i=1}^{c_n} r^2_{(i)}(b).   (7.2)
Definition 7.3. The least trimmed sum of absolute deviations (LTA(cn ))
estimator minimizes the criterion
Q_{LTA}(b) = \sum_{i=1}^{c_n} |r(b)|_{(i)}.   (7.3)
These three estimators all find a set of fixed size cn = cn (p) ≥ n/2 cases
to cover, and then fit a classical estimator to the covered cases. LQS uses
the Chebyshev fit, LTA uses L1 , and LTS uses OLS.
Definition 7.4. The integer valued parameter cn is the coverage of the
estimator. The remaining n−cn cases are given weight zero. In the literature
and software,
c_n = \lfloor n/2 \rfloor + \lfloor (p + 1)/2 \rfloor   (7.4)
is often used as the default. Here \lfloor x \rfloor is the greatest integer less than or
equal to x. For example, \lfloor 7.7 \rfloor = 7.
Remark 7.1. Warning: In the literature, HB regression estimators
seem to come in two categories. The first category consists of estimators
that have no rigorous asymptotic theory but can be computed for very small
data sets. The second category consists of estimators that have rigorous
asymptotic theory but are impractical to compute. Due to the high compu-
tational complexity of these estimators, they are rarely used; however, the
criteria are widely used for fast approximate algorithm estimators that can
detect certain configurations of outliers. These approximations are typically
inconsistent estimators with low breakdown. One of the most disappointing
aspects of the robust literature is that frequently no distinction is made between
the impractical HB estimators and the inconsistent algorithm estimators used
to detect outliers. Chapter 8 shows how to fix some of these algorithms so
that the resulting estimator is √n consistent and high breakdown.
where MED(n) is the sample median and MAD(n) is the sample median
absolute deviation. A convenient value for the trimming constant is k = 6.
Next, find the percentage of cases trimmed to the left and to the right by
Mn , and round both percentages up to the nearest integer and compute the
corresponding trimmed mean. Let TA,n denote the resulting estimator. For
example, if Mn trimmed the 7.3% smallest cases and the 9.76% largest cases,
then the final estimator TA,n is the (8%, 10%) trimmed mean. TA,n is asymp-
totically equivalent to a sequence of trimmed means with an asymptotic
C_n(b) = \sum_{i=1}^{n} I[\, |r|_{(i)}(b) \le k\, |r|_{(c_n)}(b) \,] = \sum_{i=1}^{n} I[\, r^2_{(i)}(b) \le k^2\, r^2_{(c_n)}(b) \,].   (7.5)
The two stage trimmed mean inherits the breakdown value of the median
and the stability of a trimmed mean with a low trimming proportion. The
RLTx estimator can be regarded as an extension of the two stage mean to
regression. The RLTx estimator inherits the high breakdown value of the
ALTx(0.5) estimator, and the stability of the ALTx(τR ) where τR is typically
close to one.
The tuning parameter k ≥ 1 controls the amount of trimming. The in-
equality k ≥ 1 implies that Cn ≥ cn , so the RLTx(k) estimator generally has
higher coverage and therefore higher statistical efficiency than ALTx(0.5).
Notice that although L estimators ALTx(cn,j ) were defined, only two are
needed: ALTx(0.5) to get a resistant scale and define the coverage needed,
and the final estimator ALTX(τR ). The computational load is typically less
than twice that of computing the ALTx(0.5) estimator since the computa-
tional complexity of the ALTx(τ ) estimators decreases as τ increases from
0.5 to 1.
The behavior of the RLTx estimator is easy to understand. Compute
the most resistant ALTx estimator β̂ ALT x(cn ) and obtain the corresponding
residuals. Count the number Cn of absolute residuals that are no larger than
k |r|(cn ) ≈ kMED(|r|i ). Then find τR ∈ G and compute the RLTx estimator.
(The RLTx estimator uses Cn in a manner analogous to the way that the two
stage mean uses kMAD(n).) If k = 6, and the regression model holds, the
RLTx estimator will be the classical estimator or the ALTx estimator with
99% coverage for a wide variety of data sets. On the other hand, if β̂ ALT x(cn )
fits cn cases exactly, then |r|(cn ) = 0 and RLTx = ALTx(cn ).
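The recipe is easy to sketch in R. The code below is only an illustration of the two stage idea, not the RLTx estimator studied in Sections 7.4 and 7.5: it uses ltsreg from the lqs library as a stand-in for the resistant initial estimator with roughly 50% coverage, counts the cases within the k screen, and refits OLS to the covered cases (the predictors are assumed to be in a matrix x).

library(lqs)      # in recent versions of R, ltsreg is in the MASS package
rltx <- function(x, y, k = 6){
  init <- ltsreg(x, y)                    # resistant initial fit, roughly 50% coverage
  absr <- abs(init$residuals)
  Cn <- sum(absr <= k * median(absr))     # cases within the k screen
  keep <- order(absr)[1:Cn]
  lsfit(x[keep, , drop = FALSE], y[keep]) # refit to the covered cases
}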
The RLTx estimator has the same breakdown point as the ALTx(0.5)
estimator. Theoretical results and a simulation study, based on Olive and
Hawkins (2003) and presented in Sections 7.4 and 7.5, suggest that the RLTx
estimator is simultaneously more stable and more resistant than the ALTx(
0.75 n) estimator for x = A and S. Increasing the coverage for the LQS
criterion is not suggested since the Chebyshev fit tends to have less efficiency
than the LMS fit.
The least adaptively trimmed sum of squares (LATS(k)) estimator is the OLS
fit that minimizes
Q_{LATS}(b) = \sum_{i=1}^{C_n(b)} r^2_{(i)}(b).
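For a candidate fit with residual vector res, the count Cn(b) of Equation (7.5) and the LATS criterion can be evaluated as in the following R sketch (the function name cnlats is illustrative and not from rpack):

cnlats <- function(res, cn, k = 6){
  absr <- sort(abs(res))             # ordered absolute residuals |r|_(1) <= ... <= |r|_(n)
  Cn <- sum(absr <= k * absr[cn])    # count in Equation (7.5)
  QLATS <- sum(absr[1:Cn]^2)         # sum of the Cn smallest squared residuals
  list(Cn = Cn, QLATS = QLATS)
}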
Note that the adaptive estimators reduce to the highest breakdown ver-
sions of the fixed coverage estimators if k = 1 and (provided there is no exact
fit to at least cn of the cases) to the classical estimators if k = ∞.
These three adaptive coverage estimators simultaneously achieve a high
breakdown value with high coverage, as do the RLTx estimators, but there
are important outlier configurations where the resistance of the two estima-
tors differs. The notation LATx will sometimes be used.
when
\frac{X^T X}{n} \to W^{-1},
and when the errors ei are iid with a cdf F and a unimodal pdf f that is
symmetric with a unique maximum at 0. When the variance V (ei ) exists,
V(OLS, F) = V(e_i) = \sigma^2 \quad \text{while} \quad V(L_1, F) = \frac{1}{4[f(0)]^2}.
See Koenker and Bassett (1978) and Bassett and Koenker (1978). Broffitt
(1974) compares OLS, L1 , and L∞ in the location model and shows that the
rate of convergence of the Chebyshev estimator is often very poor.
Remark 7.2. Obtaining asymptotic theory for LTA and LTS is a very
challenging problem. Mašíček (2004) shows that LTS is consistent, but there
may be no other results outside of the location model. See Davies (1993),
García-Escudero, Gordaliza and Matrán (1999), Hawkins and Olive (2002),
Hössjer (1994), Stromberg, Hawkins and Hössjer (2000), and Rousseeuw
(1984) for further discussion and conjectures. For the location model, Yohai
and Maronna (1976) and Butler (1982) derived asymptotic theory for LTS
while Tableman (1994ab) derived asymptotic theory for LTA. Shorack (1974)
and Shorack and Wellner (1986, section 19.3) derived the asymptotic theory
for a large class of location estimators that use random coverage (as do many
\frac{C_n(\hat{\beta}_n)}{n} \stackrel{P}{\to} \tau_F \equiv \tau_F(k) = F(k F^{-1}(0.75)) - F(-k F^{-1}(0.75)).   (7.7)
Proof. First assume that the predictors are bounded. Hence \|x\| \le M
for some constant M. Let 0 < γ < 1, and let 0 < ε < 1. Since β̂n is
consistent, there exists an N such that
P(A) = P\left( \hat{\beta}_{j,n} \in \left[ \beta_j - \frac{\varepsilon}{4pM},\ \beta_j + \frac{\varepsilon}{4pM} \right],\ j = 1, ..., p \right) \ge 1 - \gamma
for all n ≥ N. If n ≥ N, then on set A,
\sup_{i=1,...,n} |r_i - e_i| = \sup_{i=1,...,n} \left| \sum_{j=1}^{p} x_{i,j}(\beta_j - \hat{\beta}_{j,n}) \right| \le \frac{\varepsilon}{2},
and
\frac{1}{n} \sum_{i=1}^{n} I[-k\,\mathrm{MED}(|e_1|) + \varepsilon \le e_i \le k\,\mathrm{MED}(|e_1|) - \varepsilon] \le \frac{C_n(\hat{\beta}_n)}{n} \le \frac{1}{n} \sum_{i=1}^{n} I[-k\,\mathrm{MED}(|e_1|) - \varepsilon \le e_i \le k\,\mathrm{MED}(|e_1|) + \varepsilon],
and the result follows since γ and ε are arbitrary and the three terms above
converge to τF almost surely as ε goes to zero.
When x is bounded in probability, fix M and suppose Mn of the cases
have predictors xi such that xi ≤ M. By the argument above, the propor-
tion of absolute residuals of these cases that are below |r|(cMn ) converges in
probability to τF . But the proportion of such cases can be made arbitrarily
close to one as n increases to ∞ by increasing M. QED
Under the same conditions of Lemma 7.1,
\|\hat{\beta}_{RLTx} - \beta\| = O_P(n^{-\delta}).
Proof. Since G is finite, this result follows from Pratt (1959). QED
Theorem 7.3 shows that the RLTx estimator is asymptotically equivalent
to an ALTx estimator that typically has high coverage.
Theorem 7.3. Assume that τj , τj+1 ∈ G. If
P[C_n(\hat{\beta}_{ALTx(0.5)})/n \in (\tau_j, \tau_{j+1})] \to 1,
increase k. For example, similar statements hold for distributions with lighter
tails than the double exponential distribution if k ≥ 10.0 and n < 200.
Proposition 7.5: Breakdown of LTx, RLTx, and LATx Estima-
tors. LMS(τ ), LTS(τ ), and LTA(τ ) have breakdown value
min(1 − τ, τ ).
The breakdown value for the LATx estimators is 0.5, and the breakdown
value for the RLTx estimators is equal to the breakdown value of the ALTx(cn )
estimator.
The breakdown results for the LTx estimators are well known. See Hössjer
(1994, p. 151). Breakdown proofs in Rousseeuw and Bassett (1991) and
Niinimaa, Oja, and Tableman (1990) could also be modified to give the result.
See Section 9.4 for the definition of breakdown.
Theorem 7.6. The LMS(τ ) estimator converges at a cube root rate to a non-
Gaussian limit (under regularity conditions on the error distribution that
exclude the uniform distribution).
The proof of Theorem 7.6 is given in Davies (1990) and Kim and Pollard
(1990). Also see Davies (1993, p. 1897).
Conjecture 7.1. Let the iid errors ei have a cdf F that is continuous
and strictly increasing on its interval support with a symmetric, unimodal,
differentiable density f that strictly decreases as |x| increases on the support.
a) The estimator β̂ LT S satisfies Equation (7.6) and the asymptotic vari-
ance of LTS(τ ) is
V(LTS(\tau), F) = \frac{\int_{F^{-1}(1/2 - \tau/2)}^{F^{-1}(1/2 + \tau/2)} w^2\, dF(w)}{[\tau - 2 F^{-1}(1/2 + \tau/2)\, f(F^{-1}(1/2 + \tau/2))]^2}.   (7.8)
See Rousseeuw and Leroy (1987, p. 180, p. 191), and Tableman (1994a, p.
337).
b) The estimator β̂ LT A satisfies Equation (7.6) and the asymptotic vari-
ance of LTA(τ ) is
V(LTA(\tau), F) = \frac{\tau}{4[f(0) - f(F^{-1}(1/2 + \tau/2))]^2}.   (7.9)
See Tableman (1994b, p. 392) and Hössjer (1994).
V(LTS(\tau), F) = \frac{\tau\, \sigma^2_{TF}(-k, k)}{[\tau - 2kf(k)]^2}   (7.10)
and
V(LTA(\tau), F) = \frac{\tau}{4[f(0) - f(k)]^2}   (7.11)
where
k = F^{-1}(0.5 + \tau/2).   (7.12)
\sigma^2 \left[ 1 - \frac{2k\phi(k)}{2\Phi(k) - 1} \right].
while
V(LTA(\tau), \Phi) = \frac{\tau}{4[\phi(0) - \phi(k)]^2} = \frac{2\pi\tau}{4[1 - \exp(-k^2/2)]^2}   (7.14)
Thus for τ ≥ 1/2, LTS(τ ) has breakdown value of 1 − τ and Gaussian effi-
ciency
\frac{1}{V(LTS(\tau), \Phi)} = \tau - 2k\phi(k).   (7.15)
The 50% breakdown estimator LTS(0.5) has a Gaussian efficiency of 7.1%.
If it is appropriate to reduce the amount of trimming, we can use the 25%
breakdown estimator LTS(0.75) which has a much higher Gaussian efficiency
of 27.6% as reported in Ruppert (1992, p. 255). Also see the column labeled
“Normal” in table 1 of Hössjer (1994).
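These efficiencies are easy to check numerically from Equation (7.15); a minimal R sketch:

> tau <- c(0.5, 0.75, 0.9)
> k <- qnorm(0.5 + tau/2)
> tau - 2 * k * dnorm(k)   # Gaussian efficiencies of LTS(tau): about 0.071, 0.276, 0.561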
Example 7.2: Double Exponential Errors. The double exponential
(Laplace) distribution is interesting since the L1 estimator corresponds to
maximum likelihood and so L1 beats OLS, reversing the comparison of the
normal case. For a double exponential DE(0, 1) random variable,
V(LTS(\tau), DE(0,1)) = \frac{2 - (2 + 2k + k^2)\exp(-k)}{[\tau - k\exp(-k)]^2}
while
V(LTA(\tau), DE(0,1)) = \frac{\tau}{4[0.5 - 0.5\exp(-k)]^2} = \frac{1}{\tau}
where k = − log(1 − τ ). Note that LTA(0.5) and OLS have the same asymp-
totic efficiency at the double exponential distribution. Also see Tableman
(1994a,b).
Example 7.3: Cauchy Errors. Although the L1 estimator and the
trimmed estimators have finite variance when the errors are Cauchy, the
OLS estimator has infinite variance (because the Cauchy distribution has
infinite variance). If XT is a Cauchy C(0, 1) random variable symmetrically
truncated at −k and k, then
k − tan−1 (k)
VAR(XT ) = .
tan−1 (k)
Hence
2k − πτ
V (LT S(τ ), C(0, 1)) = 2k
π[τ − π(1+k 2) ]
2
and
τ
V (LT A(τ ), C(0, 1)) =
4[ π1 − 1
π(1+k2 )
]2
where k = tan(πτ /2). The LTA sampling variance converges to a finite value
as τ → 1 while that of LTS increases without bound. LTS(0.5) is slightly
more efficient than LTA(0.5), but LTA pulls ahead of LTS if the amount of
trimming is very small.
\sum_{i=1}^{c} r^2_{(i)}(\hat{\beta}_{LTS}) \le \sum_{i=1}^{c} r^2_{(i)}(b)
where b is any p × 1 vector. Without loss of generality, assume that the cases
have been reordered so that the first c cases correspond to the cases with the
c smallest residuals. Let β̂ OLS (c) denote the OLS fit to these c cases. By the
definition of the OLS estimator,
\sum_{i=1}^{c} r^2_i(\hat{\beta}_{OLS}(c)) \le \sum_{i=1}^{c} r^2_i(b)
where b is any p × 1 vector. Hence β̂OLS (c) also minimizes the LTS criterion
and thus β̂ OLS (c) is an LTS estimator. The proofs of b) and c) are similar.
QED
Definition 7.7. In regression, an elemental set is a set of p cases.
One way to compute these estimators exactly is to generate all C(n, c)
subsets of size c, compute the classical estimator b on each subset, and find
the criterion Q(b). The robust estimator is equal to the bo that minimizes
the criterion. Since c ≈ n/2, this algorithm is impractical for all but the
smallest data sets. Since the L1 fit is an elemental fit, the LTA estimator can
be found by evaluating all C(n, p) elemental sets. See Hawkins and Olive
(1999b). Since any Chebyshev fit is also a Chebyshev fit to a set of p + 1
cases, the LQS estimator can be found by evaluating all C(n, p + 1) subsets of p + 1 cases. See
Stromberg (1993ab) and Appa and Land (1993). The LMS, LTA, and LTS
estimators can also be evaluated exactly using branch and bound algorithms
if the data set size is small enough. See Agulló (1997, 2001).
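For very small data sets the exact search is easy to code directly. The following R sketch (not an rpack function) computes the exact LTS(c) fit by evaluating the OLS fit to every subset of size c; it is feasible only when C(n, c) is tiny, and the predictors are assumed to be in a matrix x.

exactLTS <- function(x, y, c){
  n <- length(y); subs <- combn(n, c)
  best <- NULL; bestQ <- Inf
  for(j in 1:ncol(subs)){
    J <- subs[, j]
    b <- lsfit(x[J, , drop = FALSE], y[J])$coef   # OLS fit to the subset
    r2 <- sort((y - cbind(1, x) %*% b)^2)
    Q <- sum(r2[1:c])                             # LTS(c) criterion of this fit
    if(!is.na(Q) && Q < bestQ){ bestQ <- Q; best <- b }
  }
  list(coef = best, crit = bestQ)
}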
Typically HB algorithm estimators should not be used unless the criterion
complexity is O(n). The complexity of the estimator depends on how many
fits are computed and on the complexity of the criterion evaluation. For
example the LMS and LTA criteria have O(n) complexity while the depth
criterion complexity is O(n^{p-1} log n). The LTA and depth estimators evaluate
O(n^p) elemental sets while LMS evaluates the O(n^{p+1}) subsets of size
p + 1. The LQD criterion complexity is O(n^2) and evaluates O(n^{2(p+1)}) subsets
of case distances.
Consider the algorithm that takes a subsample of n^δ cases and then
computes the exact algorithm to this subsample. Then the complexities
of the LTA, LMS, depth and LQD algorithms are O(n^{δ(p+1)}), O(n^{δ(p+2)}),
O(n^{δ(2p-1)} log n^δ) and O(n^{δ(2p+4)}), respectively. The convergence rates of the
estimators are n^{δ/3} for LMS and n^{δ/2} for the remaining three estimators (if
the LTA estimator does indeed have the conjectured √n convergence rate).
These algorithms rapidly become impractical as n and p increase. For ex-
ample, if n = 100 and δ = 0.5, use p < 7, 6, 4, 2 for these LTA, LMS, depth,
and LQD algorithms respectively. If n = 10000, this LTA algorithm may not
be practical even for p = 3. These results suggest that the LTA and LMS
approximations will be more interesting than depth or LQD approximations
unless a computational breakthrough is made for the latter two estimators.
We simulated LTA and LTS for the location model using normal, dou-
ble exponential, and Cauchy error models. For the location model, these
estimators can be computed exactly: find the order statistics of the data,
and for each of the n − c + 1 “c-samples” Y(i) , . . . , Y(i+c−1) , i = 1, . . . , n −
c + 1, compute the sample mean (for LTS) or the sample median, or the
low or high median, (for LTA) and evaluate the corresponding LTS or LTA
criterion. The minimum across these samples then defines the LTA and LTS
estimates.
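A minimal R sketch of this exact location algorithm for LTA is given below; the LTS version replaces the sample median by the sample mean and the absolute residuals by squared residuals.

loclta <- function(y, c){
  ys <- sort(y); n <- length(y)
  bestT <- NA; bestQ <- Inf
  for(i in 1:(n - c + 1)){
    Tc <- median(ys[i:(i + c - 1)])     # location fit to the ith "c-sample"
    Q <- sum(sort(abs(y - Tc))[1:c])    # LTA(c) criterion
    if(Q < bestQ){ bestQ <- Q; bestT <- Tc }
  }
  bestT
}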
We computed the sample standard deviations of the resulting location es-
timate from 1000 runs of each sample size studied. The results are shown in
Table 7.1. For Gaussian errors, the observed standard deviations are smaller
than the asymptotic standard deviations but for the double exponential er-
rors, the sample size needs to be quite large before the observed standard
deviations agree with the asymptotic theory.
Table 7.2 presents the results of a small simulation study. We compared
ALTS(τ ) for τ = 0.5, 0.75, and 0.9 with RLTS(6) for 6 different error distribu-
tions – the normal(0,1), Laplace, uniform(−1, 1) and three 60% N(0,1), 40%
contaminated normals. The three contamination scenarios were N(0,100) for
(A better choice than the inconsistent estimators is to use the easily computed √n
consistent HB CLTx estimators given in Theorem 8.7.) The concentration
algorithm used 300 starts for the location contamination distributions, and
50 starts for all others, preliminary experimentation having indicated that
this many starts were sufficient. Comparing the ‘conc’ mean squared errors
with the corresponding ‘elem’ confirms the recommendations in Hawkins and
Olive (2002) that far more than 3000 elemental starts are necessary to achieve
good results. The ‘elem’ runs also verify that second-stage refinement, as sup-
plied by the RLTS approach, is not sufficient to overcome the deficiencies in
the poor initial estimates provided by the raw elemental approach.
The RLTS estimator was, with one exception, either the best of the 4
estimators or barely distinguishable from the best. The single exception
was the concentration algorithm with the contaminated normal distribution
F (x) = 0.6Φ(x) + 0.4Φ(x − 5.5), where most of the time it covered all cases.
We already noted that location contamination with this mean and this choice
of k is about the worst possible for the RLTS estimator, so the fact that this
worst-case performance is still about what is given by the more recent recom-
mendations for ALTx coverage – 75% or 90% – is a positive result. This is reinforced by
Figure: RR plot with panels of OLS, L1, ALMS, ALTS and MBA residuals.
For a fixed xj consider the ordered distances D(1)(xj ), ..., D(n)(xj ). Next,
let β̂ j (α) denote the OLS fit to the min(p + 3 + \lfloor αn/100 \rfloor, n) cases with
the smallest distances where the approximate percentage of cases used is
α ∈ {1, 2.5, 5, 10, 20, 33, 50}. (Here \lfloor x \rfloor is the greatest integer function so
\lfloor 7.7 \rfloor = 7. The extra p + 3 cases are added so that OLS can be computed for
small n and α.) This yields seven OLS fits corresponding to the cases with
small n and α.) This yields seven OLS fits corresponding to the cases with
predictors closest to xj . A fixed number K of cases are selected at random
without replacement to use as the xj . Hence 7K OLS fits are generated. We
use K = 7 as the default. A robust criterion Q is used to evaluate the 7K
fits and the OLS fit to all of the data. Hence 7K + 1 OLS fits are generated
and the MBA estimator is the fit that minimizes the criterion. The median
squared residual, the LTA criterion, and the LATA criterion are good choices
for Q. Replacing the 7K + 1 OLS fits by L1 fits increases the resistance of
the MBA estimator.
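A bare-bones R sketch of the MBA idea is given below; it is not the rpack implementation used in the text. It uses Euclidean distances from each randomly chosen center, the median squared residual as the criterion Q, and assumes the predictors are in a matrix x.

mba <- function(x, y, K = 7, alphas = c(1, 2.5, 5, 10, 20, 33, 50)){
  x <- as.matrix(x); n <- nrow(x); p <- ncol(x) + 1
  best <- lsfit(x, y)$coef                        # OLS fit to all of the data
  bestQ <- median((y - cbind(1, x) %*% best)^2)   # criterion Q: median squared residual
  for(j in sample(n, K)){                         # K randomly chosen centers
    d <- sqrt(rowSums(sweep(x, 2, x[j, ])^2))     # distances to the jth center
    for(a in alphas){
      m <- min(p + 3 + floor(a * n / 100), n)
      J <- order(d)[1:m]                          # cases closest to the center
      b <- lsfit(x[J, , drop = FALSE], y[J])$coef
      Q <- median((y - cbind(1, x) %*% b)^2)
      if(!is.na(Q) && Q < bestQ){ bestQ <- Q; best <- b }
    }
  }
  best
}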
Three ideas motivate this estimator. First, x-outliers, which are outliers
in the predictor space, tend to be much more destructive than Y -outliers
which are outliers in the response variable. Suppose that the proportion of
outliers is γ and that γ < 0.5. We would like the algorithm to have at least
one “center” xj that is not an outlier. The probability of drawing a center
that is not an outlier is approximately 1 − γ K > 0.99 for K ≥ 7 and this
result is free of p. Secondly, by using the different percentages of coverages,
for many data sets there will be a center and a coverage that contains no
outliers.
Thirdly, the MBA estimator is a √n consistent estimator. To see this,
assume that n is increasing to ∞. For each center xj,n there are 7 spheres
centered at xj,n . Let rj,h,n be the radius of the hth sphere with center xj,n .
Fix an extremely large N such that for n ≥ N these 7K regions in the
predictor space are fixed. Hence for n ≥ N the centers are xj,N and the
radii are rj,h,N for j = 1, ..., K and h = 1, ..., 7. Since only a fixed number
(7K + 1) of √n consistent fits are computed, the final estimator is also a √n
consistent estimator of β, regardless of how the final estimator is chosen (by
Pratt 1959).
Section 11.3 will compare the MBA estimator with other resistant es-
timators including the R/Splus estimator lmsreg and the trimmed views
CHAPTER 7. ROBUST AND RESISTANT REGRESSION 231
7.7 Complements
The LTx and LATx estimators discussed in this chapter are not useful for
applications because they are impractical to compute; however, the criteria
are useful for making resistant or robust algorithm estimators. In particular
the robust criteria are used in the MBA algorithm (see Problem 7.5) and in
the easily computed √n consistent HB CLTx estimators described in
Theorem 8.7 and in Olive and Hawkins (2006).
Section 7.3 is based on Olive and Hawkins (1999) while Sections 7.2, 7.4,
7.5 and 7.6 follow Hawkins and Olive (1999b), Olive and Hawkins (2003) and
Olive (2005).
Several HB regression estimators are well known, and perhaps the first
proposed was the least median of squares (LMS) estimator. See Hampel
(1975, p. 380). For the location model, Yohai and Maronna (1976) and Butler
(1982) derived asymptotic theory for LTS. Rousseeuw (1984) generalized the
location LTS estimator to the LTS regression estimator and the minimum
covariance determinant estimator for multivariate location and dispersion
(see Chapter 10). Bassett (1991) suggested the LTA estimator for location
and Hössjer (1991) suggested the LTA regression estimator.
Two stage regression estimators compute a high breakdown regression
(or multivariate location and dispersion) estimator in the first stage. The
initial estimator is used to weight cases or as the initial estimator in a one
step Newton’s method procedure. The goal is for the two stage estimator
to inherit the outlier resistance properties of the initial estimator while hav-
ing high asymptotic efficiency when the errors follow a zero mean Gaussian
distribution. The theory for many of these estimators is often rigorous, but
the estimators are even less practical to compute than the initial estima-
tors. There are dozens of references including Jureckova and Portnoy (1987),
Simpson, Ruppert and Carroll (1992), Coakley and Hettmansperger (1993),
Chang, McKean, Naranjo and Sheather (1999), and He, Simpson and Wang
(2000).
The “cross checking estimator,” see He and Portnoy (1992, p. 2163) and
Davies (1993, p. 1981), computes a high breakdown estimator and OLS and
uses OLS if the two estimators are sufficiently close. The easily computed
HB estimators from Theorem 8.7 (and Olive and Hawkins 2006) make two
stage estimators such as the cross checking estimator practical for the first
time.
The theory of the RLTx estimator is very simple, but it can be used to
understand other results. For example, Theorem 7.3 will hold as long as
the initial estimator b used to compute Cn is consistent. In other words,
CLMS(0.5) (from Theorem 8.7) could be used as the initial estimator for the
RLTS estimator. Suppose that the easily computed √n consistent HB CLTS
estimator b (from Theorem 8.7) is used. If the CLTS(0.99) estimator does
indeed have high Gaussian efficiency, then the RLTS estimator that uses b as
the initial estimator will also have high Gaussian efficiency. Similar results
have appeared in the literature, but their proofs are very technical, often
requiring the theory of empirical processes.
The major drawback of high breakdown estimators that have nice the-
oretical results such as high efficiency is that they tend to be impractical
to compute. If an inconsistent zero breakdown initial estimator is used, as
in most of the literature and in the simulation study in Section 7.5, then
the final estimator (including even the simplest two stage estimators such
as the cross checking and RLTx estimators) also has zero breakdown and
is often inconsistent. Hence √n consistent resistant estimators such as the
MBA estimator often have higher outlier resistance than zero breakdown
implementations of HB estimators such as ltsreg.
Another drawback of high breakdown estimators that have high efficiency
is that they tend to have considerably more bias than estimators such as
LTS(0.5) for many outlier configurations. For example the fifth row of Ta-
ble 7.2 shows that the RLTS estimator can perform much worse than the
ALTS(0.5) estimator if the outliers are within the k = 6 screen.
7.8 Problems
R/Splus Problems
Warning: Use the command source(“A:/rpack.txt”) to download
the programs. See Preface or Section 14.2. Typing the name of the
rpack function, eg mbamv, will display the code for the function. Use the
args command, eg args(mbamv), to display the needed arguments for the
function.
7.1. a) Download the R/Splus function nltv that computes the asymp-
totic variance of the LTS and LTA estimators if the errors are N(0,1).
b) Enter the commands nltv(0.5), nltv(0.75), nltv(0.9) and nltv(0.9999).
Write a table to compare the asymptotic variance of LTS and LTA at these
coverages. Does one estimator always have a smaller asymptotic variance?
7.2. a) Download the R/Splus function deltv that computes the asymp-
totic variance of the LTS and LTA estimators if the errors are double expo-
nential DE(0,1).
b) Enter the commands deltv(0.5), deltv(0.75), deltv(0.9) and deltv(0.9999).
Write a table to compare the asymptotic variance of LTS and LTA at these
coverages. Does one estimator always have a smaller asymptotic variance?
7.3. a) Download the R/Splus function cltv that computes the asymp-
totic variance of the LTS and LTA estimators if the errors are Cauchy C(0,1).
humans but a few are based on apes. The MBA LATA estimator will often
give the cases corresponding to apes larger absolute residuals than the MBA
estimator based on MED(ri2 ).
e) Use the command mlrplot2(buxx,buxy) until the outliers are clustered
about the identity line in one of the two forward response plots. (This will
usually happen within 10 or fewer runs. Pressing the “up arrow” will bring
the previous command to the screen and save typing.) Then include the
resulting plot in Word. Which estimator went through the outliers and which
one gave zero weight to the outliers?
f) Use the command mlrplot2(hx,hy) several times. Usually both MBA es-
timators fail to find the outliers for this artificial Hawkins data set that is also
analyzed by Atkinson and Riani (2000, section 3.1). The lmsreg estimator
can be used to find the outliers. In Splus, use the command ffplot(hx,hy) and
in R use the commands library(lqs) and ffplot2(hx,hy). Include the resulting
plot in Word.
Chapter 8
Y = Xβ + e
denote the set of indices for the ith elemental set. Since there are n cases,
h1 , ..., hp are p distinct integers between 1 and n. For example, if n = 7 and
p = 3, the first elemental set may use cases J1 = {1, 7, 4}, and the second
elemental set may use cases J2 = {5, 3, 6}. The data for the ith elemental set
is (Y Jh , X Jh ) where Y Jh = (Yh1 , ..., Yhp)T is a p × 1 vector, and the p × p
matrix
X_{J_h} = \begin{bmatrix} x_{h_1}^T \\ x_{h_2}^T \\ \vdots \\ x_{h_p}^T \end{bmatrix} = \begin{bmatrix} x_{h_1,1} & x_{h_1,2} & \cdots & x_{h_1,p} \\ x_{h_2,1} & x_{h_2,2} & \cdots & x_{h_2,p} \\ \vdots & \vdots & & \vdots \\ x_{h_p,1} & x_{h_p,2} & \cdots & x_{h_p,p} \end{bmatrix}.
b_{J_h} = X_{J_h}^{-1} Y_{J_h}
b_A = \arg\min_{h=1,...,K_n} Q(b_{J_h}).
From now on, unless otherwise stated, we will use the spectral norm as
the matrix norm and the Euclidean norm as the vector norm.
\frac{1}{p \max_{i,j} |x_{h_i,j}|} \le \frac{1}{\|X_J\|} \le \|X_J^{-1}\|.   (8.4)
for some real number M > 0 that does not depend on n. Then for any
elemental set X_J,
\|X_J^{-1}\| \ge \frac{1}{pM}.   (8.5)
QED
In proving consistency results, there is an infinite sequence of estimators
that depend on the sample size n. Hence the subscript n will be added to
the estimators. Refer to Remark 2.4 for the definition of convergence in
probability.
Definition 8.6. Lehmann (1999, p. 53-54): a) A sequence of random
variables Wn is tight or bounded in probability, written Wn = OP (1), if for
every ε > 0 there exist positive constants D_ε and N_ε such that
P(|W_n| \le D_\varepsilon) \ge 1 - \varepsilon
for all n ≥ N_ε.
d) Similar notation is used for a k × r matrix A = [ai,j ] if each element
ai,j has the desired property. For example, A = OP (n−1/2 ) if each ai,j =
OP (n−1/2).
for some nondegenerate random variable X, then both Wn and β̂n have
convergence rate nδ .
If Wn has convergence rate nδ , then Wn has tightness rate nδ , and the term
“tightness” will often be omitted. Notice that if Wn ≍_P Xn , then Xn ≍_P Wn ,
Wn = OP (Xn ) and Xn = OP (Wn ). Notice that if Wn = OP (n^{-δ}), then n^δ is a
lower bound on the rate of Wn . As an example, if LMS, OLS or L1 are used
for β̂, then Wn = OP (n^{-1/3}), but Wn ≍_P n^{-1/3} for LMS while Wn ≍_P n^{-1/2}
for OLS and L1 . Hence the rate for OLS and L1 is n^{1/2}.
To examine the lack of consistency of the basic resampling algorithm
estimator bA,n meant to approximate the theoretical estimator β̂ Q,n , recall
that the key parameter of the basic resampling algorithm is the number of
elemental sets Kn ≡ K(n, p). Typically Kn is a fixed number, eg Kn ≡ K =
3000, that does not depend on n.
Example 8.2. This example illustrates the basic resampling algorithm
with Kn = 2. Let the data consist of the five (xi, yi ) pairs (0,1), (1,2), (2,3),
(3,4), and (1,11). Then p = 2 and n = 5. Suppose the criterion Q is the
median of the n squared residuals and that J1 = {1, 5}. Then observations
(0, 1) and (1, 11) were selected. Since bJ1 = (1, 10)T , the estimated line
is y = 1 + 10x, and the corresponding residuals are 0, −9, −18, −27, and
0. The criterion Q(bJ1 ) = 92 = 81 since the ordered squared residuals are
0, 0, 81, 182 , and 272 . If observations (0, 1) and (3, 4) are selected next, then
J2 = {1, 4}, bJ2 = (1, 1)T , and 4 of the residuals are zero. Thus Q(bJ2 ) = 0
and bA = bJ2 = (1, 1)T . Hence the algorithm produces the fit y = 1 + x.
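The two elemental fits of this example are easy to reproduce in R (a sketch):

> x <- c(0, 1, 2, 3, 1); y <- c(1, 2, 3, 4, 11)
> J1 <- c(1, 5); b1 <- lsfit(x[J1], y[J1])$coef   # fit y = 1 + 10x
> median((y - b1[1] - b1[2] * x)^2)               # criterion Q(bJ1) = 81
> J2 <- c(1, 4); b2 <- lsfit(x[J2], y[J2])$coef   # fit y = 1 + x
> median((y - b2[1] - b2[2] * x)^2)               # criterion Q(bJ2) = 0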
Example 8.3. In the previous example the algorithm fit was reasonable,
but in general using a fixed Kn ≡ K in the algorithm produces inconsistent
estimators. To illustrate this claim, consider the location model Yi = β + ei
where the ei are iid and β is a scalar (since p = 1 in the location model). If β
was known, the natural criterion for an estimator bn of β would be Q(bn ) =
|bn − β|. For each sample size n, K elemental sets Jh,n = {hn }, h = 1, ..., K
of size p = 1 are drawn with replacement from the integers 1, ..., n. Denote
the resulting elemental fits by
bJh,n = Yhn
for h = 1, ..., K. Then the “best fit” Yo,n minimizes |Yhn − β|. If α > 0, then
P(|Y_{o,n} - \beta| > \alpha) \ge [P(|e_1| > \alpha)]^K > 0
provided that the errors have mass outside of [−α, α], and thus Yo,n is not
a consistent estimator. The inequality is needed since the Yhn may not be
distinct: the inequality could be replaced with equality if the Y1n , ..., YKn were
an iid sample of size K. Since α > 0 was arbitrary in the above example,
the inconsistency result holds unless the iid errors are degenerate at zero.
The basic idea is from sampling theory. A fixed finite sample can be used
to produce an estimator that contains useful information about a population
parameter, eg the population mean, but unless the sample size n increases to
∞, the confidence interval for the population parameter will have a length
bounded away from zero. In particular, if Ȳn (K) is a sequence of sample
means based on samples of size K = 100, then Ȳn (K) is not a consistent
estimator for the population mean.
The following notation is useful for the general regression setting and
will also be used for some algorithms that modify the basic resampling algo-
rithm. Let bsi,n be the ith elemental fit where i = 1, ..., Kn and let bA,n be the
algorithm estimator; that is, bA,n is equal to the bsi,n that minimized the cri-
terion Q. Let β̂ Q,n denote the estimator that the algorithm is approximating,
eg β̂ LT A,n . Let bos,n be the “best” of the K elemental fits in that
\|b_{os,n} - \beta\| = \min_{i=1,...,K_n} \|b_{si,n} - \beta\|
where the Euclidean norm is used. Since the algorithm estimator is an ele-
mental fit bsi,n ,
\|b_{A,n} - \beta\| \ge \|b_{os,n} - \beta\|.
Thus an upper bound on the rate of bos,n is an upper bound on the rate of
bA,n .
Theorem 8.2. Let the number of randomly selected elemental sets
Kn → ∞ as n → ∞. Assume that the error distribution possesses a density
elemental sets generated. As an example where the elemental sets are not
chosen randomly, consider the L1 criterion. Since there is always an elemental
L1 fit, this fit has n1/2 convergence rate and is a consistent estimator of β.
Here we can take Kn ≡ 1, but the elemental set was not drawn randomly.
Using brain power to pick elemental sets is frequently a good idea.
It is also crucial to note that the K_n^{1/p} rate is only an upper bound on the
rate of the algorithm estimator bA,n . It is possible that the best elemental set
has a good convergence rate while the basic resampling algorithm estimator
is inconsistent.
Corollary 8.3. If the number Kn ≡ K of randomly selected elemental
sets is fixed and free of the sample size n, eg K = 3000, then the algorithm
estimator bA,n is an inconsistent estimator of β.
Conjecture 8.1. Suppose that the errors possess a density that is posi-
tive and continuous on the real line, that \|\hat{\beta}_{Q,n} - \beta\| = O_P(n^{-1/2}) and that
Kn ≤ C(n, p) randomly selected elemental sets are used in the algorithm.
Then the algorithm estimator satisfies \|b_{A,n} - \beta\| = O_P(K_n^{-1/(2p)}).
Remark 8.2. This rate can be achieved if the algorithm minimizing Q
over all elemental subsets is √n consistent (eg regression depth, see Bai and
He 1999). Randomly select g(n) cases and let Kn = C(g(n), p). Then apply
the all elemental subset algorithm to the g(n) cases. Notice that an upper
bound on the rate of bos,n is g(n) while
\|b_{A,n} - \beta\| = O_P((g(n))^{-1/2}).
Sometimes the notation bsi,n = b0i,n for the ith start and bai,n for the
ith attractor will be used. Using k = 10 concentration steps often works
well, and iterating until convergence is usually fast (in this case k = ki
depends on i). The “h–set” basic resampling algorithm uses starts that are
fits to randomly selected sets of h ≥ p cases, and is a special case of the
concentration algorithm with k = 0.
The notation CLTS, CLMS and CLTA will be used to denote concentra-
tion algorithms for LTA, LTS and LMS, respectively. Consider the LTS(cn )
criterion. Suppose the ordered squared residuals from the start b0k are ob-
tained. Then b1k is simply the OLS fit to the cases corresponding to the cn
smallest squared residuals. Denote these cases by i1 , ..., icn . Then
\sum_{i=1}^{c_n} r^2_{(i)}(b_{1k}) \le \sum_{j=1}^{c_n} r^2_{i_j}(b_{1k}) \le \sum_{j=1}^{c_n} r^2_{i_j}(b_{0k}) = \sum_{i=1}^{c_n} r^2_{(i)}(b_{0k})
where the second inequality follows from the definition of the OLS estimator.
Convergence to the attractor tends to occur in a few steps.
A simplified version of the CLTS(c) algorithms of Ruppert (1992), Víšek
(1996), Hawkins and Olive (1999a) and Rousseeuw and Van Driessen (2000,
2002) uses Kn elemental starts. The LTS(c) criterion is
Q_{LTS}(b) = \sum_{i=1}^{c} r^2_{(i)}(b)   (8.8)
where r^2_{(i)}(b) is the ith smallest squared residual. For each elemental start
find the exact-fit bsj to the p cases in the elemental start and then get the
c smallest squared residuals. Find the OLS fit to these c cases and find
the resulting c smallest squared residuals, and iterate for k steps (or until
convergence). Doing this for Kn elemental starts leads to Kn (not necessarily
distinct) attractors baj . The algorithm estimator β̂ALT S is the attractor that
minimizes Q. Substituting the L1 or Chebyshev fits and LTA or LMS criteria
for OLS in the concentration step leads to the CLTA or CLMS algorithm.
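A minimal R sketch of this concentration algorithm (elemental starts, OLS concentration steps, LTS(c) criterion) follows; the rpack and lqs implementations used elsewhere in the text are more careful about singular starts and ties, so this is only an illustration (the predictors are assumed to be in a matrix x).

clts <- function(x, y, c, K = 50, steps = 10){
  x <- as.matrix(x); n <- nrow(x); p <- ncol(x) + 1
  best <- NULL; bestQ <- Inf
  for(h in 1:K){
    J <- sample(n, p)                               # elemental start
    b <- lsfit(x[J, , drop = FALSE], y[J])$coef
    if(any(is.na(b))) next                          # skip singular starts
    for(s in 1:steps){                              # concentration steps
      r2 <- (y - cbind(1, x) %*% b)^2
      J <- order(r2)[1:c]                           # cover the c smallest squared residuals
      b <- lsfit(x[J, , drop = FALSE], y[J])$coef
    }
    Q <- sum(sort((y - cbind(1, x) %*% b)^2)[1:c])  # LTS(c) criterion of the attractor
    if(Q < bestQ){ bestQ <- Q; best <- b }
  }
  best
}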
Figure 8.1: The Highlighted Points are More Concentrated about the Attractor.
a) A Start for the Animal Data; b) The Attractor for the Start (Y versus X).
Figure 8.1a shows the scatterplot of x and y. The start is also shown and
the 14 cases corresponding to the smallest absolute residuals are highlighted.
The L1 fit to these c highlighted cases is b_{2,1} = (2.076, 0.979)^T and
\sum_{i=1}^{14} |r|_{(i)}(b_{2,1}) = 6.990.
Figure 8.2: a) Five Randomly Selected Starts; b) The Corresponding Attractors
(Y versus X).
Figure 8.1b shows the attractor and that the c highlighted cases correspond-
ing to the smallest absolute residuals are much more concentrated than those
in Figure 8.1a. Figure 8.2a shows 5 randomly selected starts while Figure
8.2b shows the corresponding attractors. Notice that the elemental starts
have more variability than the attractors, but if the start passes through an
outlier, so does the attractor.
Notation for the attractor needs to be added to the notation used for the
basic resampling algorithm. Let bsi,n be the ith start, and let bai,n be the
ith attractor. Let bA,n be the algorithm estimator, that is, the attractor that
minimized the criterion Q. Let β̂Q,n denote the estimator that the algorithm
Since β̂ and Y have the same characteristic functions, they have the same
distribution. Thus β̂2 has the same distribution as
QED
This result shows the inadequacy of elemental sets in high dimensions.
For a trial fit to provide a useful preliminary classification of cases into inliers
and outliers requires that it give a reasonably precise slope. However if p is
large, this is most unlikely; the density of (b−β)T (b−β) varies near zero like
p
[(b − β)T (b − β)]( 2 −1) . For moderate to large p, this implies that good trial
slopes will be extremely uncommon and so enormous numbers of random
elemental sets will have to be generated to have some chance of finding one
that gives a usefully precise slope estimate. The only way to mitigate this
effect of basic resampling is to use larger values of h, but this negates the
main virtue elemental sets have, which is that when outliers are present, the
smaller the h the greater the chance that the random subset will be clean.
The following two propositions examine increasing the start size. The
first result (compare Remark 8.3) proves that increasing the start size from
elemental to h ≥ p results in a zero breakdown inconsistent estimator. Let
the k–step CLTS estimator be the concentration algorithm estimator for LTS
that uses k concentration steps. Assume that the number of concentration
steps k and the number of starts Kn ≡ K do not depend on n (eg k = 10
and K = 3000, breakdown is defined in Section 9.4).
Proposition 8.4. Suppose that each start uses h randomly selected cases
and that Kn ≡ K starts are used. Then
i) the (“h-set”) basic resampling estimator is inconsistent.
ii) The k–step CLTS estimator is inconsistent.
iii) The breakdown value is bounded above by K/n.
Proof. To prove i) and ii), notice that each start is inconsistent. Hence
each attractor is inconsistent by He and Portnoy (1992). Choosing from K
inconsistent estimators still results in an inconsistent estimator. To prove iii)
replace one observation in each start by a high leverage case (with y tending
to ∞). QED
The next result shows that the situation changes dramatically if K is
fixed but the start size h = hn = g(n) where g(n) → ∞. In particular,
if several starts with rate n1/2 are used, the final estimator also has rate
n1/2. The drawback to these algorithms is that they may not have enough
outlier resistance. Notice that the basic resampling result below is free of the
criterion. Suppose that β̂ 1, ..., β̂K are consistent estimators of β each with
the same rate g(n). Pratt (1959) shows that if β̂ A is an estimator obtained
by choosing one of the K estimators, then β̂ A is a consistent estimator of β
with rate g(n).
Proposition 8.5. Suppose Kn ≡ K starts are used and that all starts
have subset size hn = g(n) ↑ ∞ as n → ∞. Assume that the estimator
applied to the subset has rate nδ .
i) For the hn -set basic resampling algorithm, the algorithm estimator has
rate [g(n)]δ .
ii) Under mild regularity conditions (eg given by He and Portnoy 1992), the
k–step CLTS estimator has rate [g(n)]δ .
Proof. i) The hn = g(n) cases are randomly sampled without replace-
ment. Hence the classical estimator applied to these g(n) cases has rate
[g(n)]δ . Thus all K starts have rate [g(n)]δ , and the result follows by Pratt
(1959). ii) By He and Portnoy (1992), all K attractors have [g(n)]δ rate, and
the result follows by Pratt (1959). QED
These results show that fixed Kn ≡ K elemental methods are inconsis-
tent. Several simulation studies have shown that the versions of the resam-
pling algorithm that use a fixed number of elemental starts provide fits with
behavior that conforms with the asymptotic behavior of the √n consistent
target estimator. These paradoxical studies can be explained by the following
proposition (a recasting of a coupon collection problem).
Proposition 8.6. Suppose that Kn ≡ K random starts of size h are
selected and let Q(1) ≤ Q(2) ≤ · · · ≤ Q(B) correspond to the order statistics
of the criterion values of the B = C(n, h) possible starts of size h. Let R be
the rank of the smallest criterion value from the K starts. If P (R ≤ Rα ) = α,
then
R_α ≈ B[1 − (1 − α)^{1/K}].
Proof. If Wi is the rank of the ith start, then W1 , ..., WK are iid discrete
uniform on {1, ..., B} and R = min(W1 , ..., WK ). If r is an integer in [1, B],
then
P(R \le r) = 1 - \left( \frac{B - r}{B} \right)^K.
E(R) \approx 1 + \frac{B}{K + 1}, \quad \text{and} \quad VAR(R) \approx \frac{K B^2}{(K+1)^2 (K+2)} \approx \frac{B^2}{K^2}.
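For example (a quick numerical illustration, assuming elemental starts with h = p), with n = 100, p = 4 and K = 3000:

> B <- choose(100, 4)            # 3921225 possible elemental starts
> K <- 3000
> B * (1 - (1 - 0.5)^(1/K))      # approximate median rank of the best start found
> 1 + B/(K + 1)                  # approximate E(R)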
Notice that bo,n is not an estimator since β is unknown, but since the algo-
rithm estimator is an elemental fit, \|b_{A,n} - \beta\| \ge \|b_{o,n} - \beta\|, and an upper
bound on the rate of bo,n is an upper bound on the rate of bA,n . Theorem 8.2
showed that the rate of bo,n is at most K_n^{1/p}, regardless of the criterion Q. This
result is one of the most powerful tools for examining the behavior of robust
estimators actually used in practice. For example, many basic resampling
CHAPTER 8. ROBUST REGRESSION ALGORITHMS 254
algorithms use Kn = O(n) elemental sets drawn with replacement from all
C(n, p) elemental sets. Hence the algorithm estimator bA,n has a rate ≤ n1/p.
1/p
This section will show that the rate of bo,n is Kn and suggests that the
number of elemental sets bi,n that satisfy bi,n − β ≤ Mnδ (where M > 0
is some constant and 0 < δ ≤ 1) is proportional to np(1−δ) .
Two assumptions are used.
(A1) The errors are iid, independent of the predictors, and have a density f
that is positive and continuous in a neighborhood of zero.
(A2) Let τ be the proportion of elemental sets J that satisfy ‖XJ^{−1}‖ ≤ B for
some constant B > 0. Assume τ > 0.
These assumptions are reasonable, but results that do not use (A2) are
given later. If the errors can be arbitrarily placed, then they could cause the
estimator to oscillate about β. Hence no estimator would be consistent for
β. Note that if ε > 0 is small enough, then P(|ei| ≤ ε) ≈ 2εf(0). Equations
(8.2) and (8.3) suggest that (A2) will hold unless the data is such that nearly
all of the elemental trial designs XJ have badly behaved singular values.
Theorem 8.8. Assume that all C(n, p) elemental subsets are searched
and that (A1) and (A2) hold. Then ‖bo,n − β‖ = OP(n^{−1}).
Proof. Let the random variable Wn,ε count the number of errors ei that
satisfy |ei| ≤ Mε/n for i = 1, ..., n. For fixed n, Wn,ε is a binomial random
variable with parameters n and Pn where nPn → 2f(0)Mε as n → ∞. Hence
Wn,ε converges in distribution to a Poisson(2f(0)Mε) random variable, and
for any fixed integer k > p, P(Wn,ε > k) → 1 as M → ∞ and n → ∞. Hence
if n is large enough, then with arbitrarily high probability there exists an M
such that at least C(k, p) elemental sets Jhn have all |ehn,i| ≤ Mε/n where
the subscript hn indicates that the sets depend on n. By condition (A2),
the proportion of these C(k, p) fits that satisfy ‖bJhn − β‖ ≤ B√p Mε/n is
greater than τ. If k is chosen sufficiently large, and if n is sufficiently large,
then with arbitrarily high probability, ‖bo,n − β‖ ≤ B√p Mε/n and the result
follows. QED
Corollary 8.9. Assume that Hn ≤ n but Hn ↑ ∞ as n → ∞. If (A1)
and (A2) hold, and if Kn = Hn^p randomly chosen elemental sets are used,
then ‖bo,n − β‖ = OP(Hn^{−1}) = OP(Kn^{−1/p}).
Proof. Suppose Hn cases are drawn without replacement and all C(Hn, p)
∝ Hn^p elemental sets are examined. Then by Theorem 8.8, the best elemental
set selected by this procedure has rate Hn. Hence if Kn = Hn^p randomly chosen
elemental sets are used and if n is sufficiently large, then the probability
of drawing an elemental set Jhn such that ‖bJhn − β‖ ≤ M Hn^{−1} goes to one
as M → ∞ and the result follows. QED
Suppose that an elemental set J is “good” if ‖bJ − β‖ ≤ M Hn^{−1} for some
constant M > 0. The preceding proof used the fact that with high probability,
good elemental sets can be found by a specific algorithm that searches
Kn ∝ Hn^p distinct elemental sets. Since the total number of elemental sets
is proportional to n^p, an algorithm that randomly chooses Hn^p elemental sets
will find good elemental sets with arbitrarily high probability. For example,
the elemental sets could be drawn with or without replacement from all of
the elemental sets. As another example, draw a random permutation of the
n cases. Let the first p cases be the 1st elemental set, the next p cases the
2nd elemental set, etc. Then about n/p elemental sets are generated, and
the rate of the best elemental set is n^{1/p}.
Also note that the number of good sets is proportional to n^p Hn^{−p}. In
particular, if Hn = n^δ where 0 < δ ≤ 1, then the number of “good” sets
is proportional to n^{p(1−δ)}. If the number of randomly drawn elemental sets
Kn = o(Hn^p), then ‖bA,n − β‖ is not OP(Hn^{−1}) since P(‖bo,n − β‖ ≤ Hn^{−1}M) → 0
for any M > 0.
A key assumption to Corollary 8.9 is that the elemental sets are randomly
drawn. If this assumption is violated, then the rate of the best elemental set
could be much higher. For example, the single elemental fit corresponding
to the L1 estimator could be used, and this fit has an n^{1/2} rate.
The following argument shows that similar results hold if the predictors
are iid with a multivariate density that is everywhere positive. For now,
assume that the regression model contains a constant: x = (1, x2 , ..., xp)T .
Construct a (hyper) pyramid and place the “corners” of the pyramid into
a p × p matrix W . The pyramid defines p “corner regions” R1 , ..., Rp. The
p points that form W are not actual observations, but the fit bJ can be
evaluated on W . Define the p × 1 vector z = W β. Then β = W −1 z, and
ẑ = W bJ is the fitted hyperplane evaluated at the corners of the pyramid.
If an elemental set has one observation in each corner region and if all p
absolute errors are less than ε, then the absolute deviation |δi| = |zi − ẑi| < ε,
i = 1, ..., p.
To fix ideas and notation, we will present three examples. The first two
examples consider the simple linear regression model with one predictor and
an intercept while the third example considers the multiple regression model
with two predictors and an intercept.
Example 8.5. Suppose the design has exactly two distinct predictor
values, (1, x1,2) and (1, x2,2), where x1,2 < x2,2. Notice that

β = X^{−1} z

where

z = (z1, z2)^T = (β1 + β2 x1,2, β1 + β2 x2,2)^T

and

X = \begin{pmatrix} 1 & x_{1,2} \\ 1 & x_{2,2} \end{pmatrix}.
If we assume that the errors are iid N(0, 1), then P(Yi = zj) = 0 for j = 1, 2
and n ≥ 1. Suppose that the elemental set J = {i1, i2} is such that xij = xj and
|yij − zj| < ε for j = 1, 2. Then bJ = X^{−1} YJ and

‖bJ − β‖ ≤ ‖X^{−1}‖ ‖YJ − z‖ ≤ ‖X^{−1}‖ √2 ε.
ẑ = W bJ

is the fitted line evaluated at w1 and w2. Let the deviation vector

δJ = (δJ,1, δJ,2)^T

where

δJ,i = zi − ẑi.

Hence

bJ = W^{−1}(z − δJ)

and |δJ,i| ≤ ε by construction. Thus

‖bJ − β‖ = ‖W^{−1}z − W^{−1}δJ − W^{−1}z‖ ≤ ‖W^{−1}‖ ‖δJ‖ ≤ ‖W^{−1}‖ √2 ε.
The basic idea is that if a fit is determined by one point from each region
and if the fit is good, then the fit has small deviation at points w1 and w2
because lines can’t bend. See Figure 8.3. Note that the bound is true for
every fit such that one point is in each region and both absolute errors are
less than ε. The number of such fits can be enormous. For example, if ε is a
constant, then the number of observations in region Ri with errors less than ε
is proportional to n for i = 1, 2. Hence the number of “good” fits from the
two regions is proportional to n^2.
Example 8.7. Now assume that p = 3 and Yi = β1 + β2 xi,2 + β3 xi,3 + ei
where the predictors (xi,2, xi,3) are scattered about the origin, eg iid N(0, I 2).
Now we need a matrix W and three regions with many observations that
have small errors. Let

W = \begin{pmatrix} 1 & a & -a/2 \\ 1 & -a & -a/2 \\ 1 & 0 & a/2 \end{pmatrix}
for some a > 0 (eg a = 1). Note that the three points (a, −a/2)T , (−a, −a/2)T ,
and (0, a/2)T determine a triangle. Use this triangle as the pyramid. Then
the corner regions are formed by extending the three lines that form the
triangle and using points that fall opposite of a corner of the triangle. Hence
R1 = {(x2, x3)T : x3 < −a/2 and x2 > a/2 − x3},
‖bJ − β‖ = ‖W^{−1}z − W^{−1}δJ − W^{−1}z‖ ≤ ‖W^{−1}‖ ‖δJ‖ ≤ ‖W^{−1}‖ √3 ε.
For Example 8.7, there is a prism shaped region centered at the triangle
determined by W. Any elemental subset J with one point in each corner
region and with each absolute error less than ε produces a plane that cuts
the prism. Hence each absolute deviation at the corners of the triangle is less
than ε.
The geometry in higher dimensions uses hyperpyramids and hyperprisms.
When p = 4, the p = 4 rows that form W determine a 3–dimensional
pyramid. Again we have 4 corner regions and only consider elemental subsets
consisting of one point from each region with absolute errors less than ε.
The resulting hyperplane will cut the hyperprism formed by extending the
Figure 8.4: The Corner Regions for Two Predictors and a Constant.
Hence the probability that a randomly selected elemental set bJ satisfies
‖bJ − β‖ ≤ ‖W^{−1}‖ √p Mε/Hn is bounded below by a probability that is
proportional to (Mε/Hn)^p. If the number of randomly selected elemental sets
Kn = Hn^p, then

P(‖bo,n − β‖ ≤ ‖W^{−1}‖ √p Mε/Hn) → 1
as M → ∞. Notice that one way to choose Kn is to draw Hn ≤ n cases
without replacement and then examine all Kn = C(Hn , p) elemental sets.
These remarks prove the following corollary.
Corollary 8.11. Assume that (A1) and (A3) hold. Let Hn ≤ n and
assume that Hn ↑ ∞ as n → ∞. If Kn = Hn^p elemental sets are randomly
chosen, then

‖bo,n − β‖ = OP(Hn^{−1}) = OP(Kn^{−1/p}).
In particular, if all C(n, p) elemental sets are examined, then ‖bo,n − β‖ =
OP(n^{−1}). Note that Corollary 8.11 holds as long as the bulk of the data
satisfies (A1) and (A3). Hence if a fixed percentage of outliers are added to
clean cases, rather than replacing clean cases, then Corollary 8.11 still holds.
The following result shows that elemental fits can be used to approximate
any p × 1 vector c. Of course this result is asymptotic, and some vectors will
not be well approximated for reasonable sample sizes.
Theorem 8.12. Assume that (A1) and (A3) hold and that the error
density f is positive and continuous everywhere. Then the closest elemental
fit bc,n to any p × 1 vector c satisfies ‖bc,n − c‖ = OP(n^{−1}).
Proof sketch. The proof is essentially the same. Sandwich the plane
determined by c by only considering points such that |gi| ≡ |yi − xi^T c| < α.
Since the ei’s have positive density, P(|gi| < α) ∝ α (at least for xi in
some ball of possibly huge radius R about the origin). Also the pyramid needs
to lie on the c-plane and the corner regions will have smaller probabilities.
8.4 Complements
The widely used basic resampling and concentration algorithms that use
a fixed number K of randomly drawn elemental sets are inconsistent, but
Theorem 8.7 shows that it is easy to modify some of these algorithms so that
the easily computed modified estimator is a √n consistent high breakdown
(HB) estimator. The basic idea is to evaluate the criterion on K elemental
sets as well as on a √n consistent estimator such as OLS and on an easily
computed HB but biased estimator such as β̂k,B. Similar ideas will be used to
create easily computed √n consistent HB estimators of multivariate location
and dispersion. See Section 10.7.
The first two sections of this chapter followed Hawkins and Olive (2002)
and Olive and Hawkins (2006) closely. The “basic resampling”, or “ele-
mental set” method was used for finding outliers in the regression setting by
Siegel (1982), Rousseeuw (1984), and Hawkins, Bradu and Kass (1984). Fare-
brother (1997) sketches the history of elemental set methods. Also see Mayo
and Gray (1997). Hinich and Talwar (1975) used nonoverlapping elemental
sets as an alternative to least squares. Rubin (1980) used elemental sets for
diagnostic purposes. The “concentration” technique may have been intro-
duced by Devlin, Gnanadesikan and Kettenring (1975), although a similar
idea appears in Gnanadesikan and Kettenring (1972, p. 94). The concentration
technique for regression was used by Ruppert (1992) and Hawkins and Olive
(1999a).
A different generalization of the elemental set method uses for its starts
subsets of size greater than p (Atkinson and Weisberg 1991). Another possi-
ble refinement is a preliminary partitioning of the cases (Woodruff and Rocke,
1994, Rocke, 1998, Rousseeuw and Van Driessen, 1999, 2002).
If an exact algorithm exists but an approximate algorithm is also used,
the two estimators should be distinguished in some manner. For example
β̂LM S could denote the estimator from the exact algorithm while β̂ALM S could
denote the estimator from the approximate algorithm. In the literature this
distinction is too seldom made, but there are a few exceptions. Portnoy (1987)
makes a distinction between LMS and PROGRESS LMS while Cook and
Hawkins (1990, p. 640) point out that the AMVE is not the minimum
volume ellipsoid (MVE) estimator (which is a high breakdown estimator of
multivariate location and dispersion that is sometimes used to define weights
in regression algorithms). Rousseeuw and Bassett (1991) find the breakdown
point and equivariance properties of the LMS algorithm that searches all
C(n, p) elemental sets. Woodruff and Rocke (1994, p. 889) point out that in
practice the algorithm is the estimator. Hawkins (1993a) has some results
when the fits are computed from disjoint elemental sets, and Rousseeuw
(1993, p. 126) states that the all subsets version of PROGRESS is a high
breakdown algorithm, but the random sampling versions of PROGRESS are
not high breakdown algorithms.
Algorithms which use one interchange on elemental sets may be compet-
itive. Heuristically, only p − 1 of the observations in the elemental set need
small absolute errors since the best interchange would be with the observa-
tion in the set with a large error and an observation outside of the set with a
very small absolute error. Hence K ∝ n^{δ(p−1)} starts are needed. Since finding
the best interchange requires p(n − p) comparisons, the run time should be
competitive with the concentration algorithm. Another idea is to repeat the
interchange step until convergence. We do not know how many starts are
needed for this algorithm to produce good results.
Theorems 8.2 and 8.8 are an extension of Hawkins (1993a, p. 582) which
states that if the algorithm uses O(n) elemental sets, then at least one ele-
mental set b is likely to have its jth component bj close to the jth component
βj of β.
Note that one-step estimators can improve the rate of the initial estima-
tor. See for example Chang, McKean, Naranjo, and Sheather (1999) and
Simpson, Ruppert, and Carroll (1992). Although the theory for the estima-
tors in these two papers requires an initial high breakdown estimator with
at least an n1/4 rate of convergence, implementations often use an initial
inconsistent, low breakdown algorithm estimator. Instead of using lmsreg
or ltsreg as the initial estimator, use the CLTS estimator of Theorem 8.7
(or the MBA or trimmed views estimators of Sections 7.6 and 10.5). The
CLTS estimator can also be used to create an asymptotically efficient high
breakdown cross checking estimator.
The Rousseeuw and Leroy (1987) data sets are available from the following
websites:
(http://www.uni-koeln.de/themen/Statistik/data/rousseeuw/),
(http://www.agoras.ua.ac.be/) and
(http://www.stat.umn.edu/ARCHIVES/archives.html).
8.5 Problems
8.1. Since an elemental fit b passes through the p cases, a necessary condition
for b to approximate β well is that all p errors be small. Hence no “good”
approximations will be lost when we consider only the cases with |ei| < ε. If
the errors are iid, then for small ε > 0, case i has P(|ei| < ε) ≈ 2εf(0).
Hence if ε = 1/n^{(1−δ)}, where 0 ≤ δ < 1, find how many cases have small
errors.
8.2 Suppose that e1, ..., e100 are iid and that α > 0. Show that
Splus Problems
For problems 8.3 and 8.4, if the animal or Belgian telephone data sets
(Rousseeuw and Leroy 1987) are not available, use the following commands.
8.3. a) Download the Splus function conc2. This function does not work
in R.
b) Include the output from the following command in Word.
conc2(zx,zy)
8.4. a) Download the Splus function attract that was used to produce
Figure 8.2. This function does not work in R.
b) Repeat the following command five times.
> attract(zx,zy)
Resistance and Equivariance
the paradigm gives a useful initial model for the data. The assumption is
very widely used in the literature for diagnostics and robust statistics.
Remark 9.1. Suppose that the data set contains n cases with d outliers
and n − d clean cases. Suppose that h ≥ p cases are selected at random
without replacement. Let W count the number of the h cases that were
outliers. Then W is a hypergeometric(d, n − d, h) random variable and
P(W = j) = C(d, j) C(n − d, h − j) / C(n, h) ≈ C(h, j) γ^j (1 − γ)^{h−j}

where the contamination proportion γ = d/n and the binomial(h, ρ ≡ γ =
d/n) approximation to the hypergeometric(d, n − d, h) distribution is used.
In particular, the probability that the subset of h cases is clean is P(W =
0) ≈ (1 − γ)^h, which is maximized by h = p. Hence using elemental sets
maximizes the probability of getting a clean subset. Moreover, computing
the elemental fit is faster than computing the fit from h > p cases.
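The binomial approximation in Remark 9.1 is easy to check numerically. The R lines below are a sketch with arbitrary illustrative values of n, d and h; they compare the exact hypergeometric probability of a clean subset with (1 − γ)^h.

# sketch: P(clean subset of size h), exactly and via the approximation
n <- 100; d <- 40; h <- 4            # illustrative values
gam <- d/n
dhyper(0, m = d, n = n - d, k = h)   # exact P(W = 0)
(1 - gam)^h                          # binomial approximation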
Remark 9.2. Now suppose that K elemental sets are chosen with re-
placement. If Wi is the number of outliers in the ith elemental set, then
the Wi are iid hypergeometric(d, n − d, p) random variables. Suppose that
it is desired to find K such that the probability P(that at least one of the
elemental sets is clean) ≡ P1 ≈ 1 − α where α = 0.05 is a common choice.
Then P1 = 1 − P(none of the K elemental sets is clean)

≈ 1 − [1 − (1 − γ)^p]^K

by independence. Hence

α ≈ [1 − (1 − γ)^p]^K

or

K ≈ log(α) / log([1 − (1 − γ)^p]) ≈ log(α) / [−(1 − γ)^p]     (9.1)

using the approximation log(1 − x) ≈ −x for small x. Since log(.05) ≈ −3,
if α = 0.05, then

K ≈ 3 / (1 − γ)^p.
Frequently a clean subset is wanted even if the contamination proportion
γ ≈ 0.5. Then for a 95% chance of obtaining at least one clean elemental set,
K ≈ 3(2^p) elemental sets need to be drawn.
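In R, the number of starts required by Equation (9.1) is easy to compute; the sketch below is our own (the function name nstarts is not from the text), and the pifclean function of Problem 9.1 inverts the same relation.

# sketch: number of elemental starts K needed so that P(at least
# one clean start) is about 1 - alpha for contamination proportion gam
nstarts <- function(gam, p, alpha = 0.05) {
        ceiling(log(alpha)/log(1 - (1 - gam)^p))
}
nstarts(0.5, 5)    # about 95, close to 3 * 2^5 = 96
nstarts(0.5, 10)   # about 3067, close to 3 * 2^10 = 3072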
Table 9.1. Largest p so that there is a 95% chance that at least one of K
subsamples is clean.

  γ      K = 500   3000  10000   10^5   10^6   10^7   10^8   10^9
 0.01        509    687    807   1036   1265   1494   1723   1952
 0.05         99    134    158    203    247    292    337    382
 0.10         48     65     76     98    120    142    164    186
 0.15         31     42     49     64     78     92    106    120
 0.20         22     30     36     46     56     67     77     87
 0.25         17     24     28     36     44     52     60     68
 0.30         14     19     22     29     35     42     48     55
 0.35         11     16     18     24     29     34     40     45
 0.40         10     13     15     20     24     29     33     38
 0.45          8     11     13     17     21     25     28     32
 0.50          7      9     11     15     18     21     24     28
Table 9.1 shows the largest value of p such that there is a 95% chance
that at least one of K subsamples is clean using the approximation given
by Equation (9.2). Hence if p = 28, even with one billion subsamples, there
is a 5% chance that none of the subsamples will be clean if the contamina-
tion proportion γ = 0.5. Since clean elemental fits have great variability, an
algorithm needs to produce many clean fits in order for the best fit to be
good. When contamination is present, all K elemental sets could contain
outliers. Hence basic resampling and concentration algorithms that only use
K elemental starts are doomed to fail if γ and p are large.
Remark 9.3: Breakdown. The breakdown value of concentration al-
gorithms that use K elemental starts is bounded above by K/n. (See Section
9.4 for more information about breakdown.) For example if 500 starts are
used and n = 50000, then the breakdown value is at most 1%. To cause a
regression algorithm to break down, simply contaminate one observation in
each starting elemental set so as to displace the fitted coefficient vector by a
large amount. Since K elemental starts are used, at most K points need to
be contaminated.
This is a worst-case model, but sobering results on the outlier resistance of
such algorithms for a fixed data set with d gross outliers can also be derived.
Assume that the LTA(c), LTS(c), or LMS(c) algorithm is applied to a fixed
data set of size n where n − d of the cases follow a well behaved model and
d < n/2 of the cases are gross outliers. If d > n − c, then every criterion
evaluation will use outliers, and every attractor will produce a bad fit even
if some of the starts are good. If d < n − c and if the outliers are far enough
from the remaining cases, then clean starts of size h ≥ p may result in clean
attractors that could detect certain types of outliers (that may need to be
hugely discrepant on the response scale).
Proposition 9.1. Let γo be the highest percentage of massive outliers
that a resampling algorithm can detect reliably. Then
γo ≈ min( (n − c)/n , 1 − [1 − (0.2)^{1/K}]^{1/h} ) 100%     (9.3)
if n is large.
Proof. In Remark 9.2, change p to h to show that if the contamination
proportion γ is fixed, then the probability of obtaining at least one clean
subset of size h with high probability (say 1 − α = 0.8) is given by 0.8 =
1 − [1 − (1 − γ)^h]^K. Fix the number of starts K and solve this equation for
γ. QED
The value of γo depends on c > n/2 and h. To maximize γo , take c ≈ n/2
and h = p. For example, with K = 500 starts, n > 100, and h = p ≤ 20 the
resampling algorithm should be able to detect up to 24% outliers provided
every clean start is able to at least partially separate inliers from outliers.
However if h = p = 50, this proportion drops to 11%.
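A small R function (the name gamo is ours, not from the text) evaluates Equation (9.3) directly and essentially reproduces the percentages quoted above.

# sketch: approximate largest detectable outlier percentage, Equation (9.3)
gamo <- function(n, c, K, h) {
        100 * min((n - c)/n, 1 - (1 - 0.2^(1/K))^(1/h))
}
gamo(n = 1000, c = 500, K = 500, h = 20)   # about 24.9
gamo(n = 1000, c = 500, K = 500, h = 50)   # about 10.8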
Remark 9.4: Hybrid Algorithms. More sophisticated algorithms use
both concentration and partitioning. Partitioning evaluates the start on a
subset of the data, and poor starts are discarded. This technique speeds
up the algorithm, but the consistency and outlier resistance still depends on
the number of starts. For example, Equation (9.3) agrees very well with the
Rousseeuw and Van Driessen (1999) simulation performed on a hybrid MCD
algorithm. (See Section 10.6.)
λbb + (1 − λ)b
where b is the fit from the current subsample and λ is between 0 and 1. Using
λ ≈ 0.9 may make sense. If the algorithm produces a good fit at some stage,
then many good fits will be examined with this technique.
Ŷ = Ŷ(X, Y) = Xβ̂(X, Y),     (9.5)

r = r(X, Y) = Y − Ŷ.     (9.6)

Regression Equivariance: Let u be any p × 1 vector. Then β̂ is
regression equivariant if

β̂(X, Y + Xu) = T(X, Y + Xu) = T(X, Y) + u = β̂(X, Y) + u     (9.7)

and

r(W, Z) = Z − Ẑ = r(X, Y).

Note that the residuals are invariant under this type of transformation, and
note that if u = −β̂, then regression equivariance implies that we should not
find any linear structure if we regress the residuals on X.
Scale Equivariance: Let c be any scalar. Then β̂ is scale equivariant if

β̂(X, cY) = T(X, cY) = cT(X, Y) = cβ̂(X, Y)     (9.8)

and

r(X, cY) = c r(X, Y).

Scale equivariance implies that if the Yi’s are stretched, then the fits and the
residuals should be stretched by the same factor.

Affine Equivariance: Let A be any p × p nonsingular matrix. Then β̂
is affine equivariant if

β̂(XA, Y) = T(XA, Y) = A^{−1}T(X, Y) = A^{−1}β̂(X, Y).     (9.9)

Then

Ẑ = W β̂(XA, Y) = XAA^{−1}β̂(X, Y) = Ŷ,

and

r(XA, Y) = Z − Ẑ = Y − Ŷ = r(X, Y).

Note that both the predicted values and the residuals are invariant under an
affine transformation of the independent variables.
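These properties are easy to verify numerically for OLS. The following R sketch uses arbitrary simulated data and our own helper bhat; each printed value should be zero up to rounding error.

# sketch: numerical check of (9.7), (9.8) and (9.9) for OLS
set.seed(1)
n <- 50; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
Y <- X %*% c(1, 2, 3) + rnorm(n)
bhat <- function(X, Y) solve(t(X) %*% X, t(X) %*% Y)
u <- c(-1, 0.5, 2); cc <- 7
A <- matrix(c(2, 0, 1, 0, 1, 0, 0, 1, 3), 3, 3)       # nonsingular
max(abs(bhat(X, Y + X %*% u) - (bhat(X, Y) + u)))     # regression equivariance
max(abs(bhat(X, cc * Y) - cc * bhat(X, Y)))           # scale equivariance
max(abs(bhat(X %*% A, Y) - solve(A) %*% bhat(X, Y)))  # affine equivariance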
Permutation Invariance: Let P be an n × n permutation matrix. Then
P^T P = P P^T = I n where I n is an n × n identity matrix and the superscript
T denotes the transpose of a matrix.

‖bJ‖ ≤ ‖XJ^{−1}‖ ‖YJ‖,

and since x-outliers make ‖XJ‖ large, x-outliers tend to drive ‖bJ‖ towards
zero, not towards ∞. The x-outliers may make ‖bJ‖ large if they can make
the trial design XJ nearly singular. Notice that the Euclidean norm ‖bJ‖ can
easily be made large if one or more of the elemental response variables is
driven far away from zero.
Example 9.2. Without loss of generality, assume that the good y’s are
contained in an interval [a, f] for some a and f. Assume that the regression
model contains an intercept β1. Then there exists an estimator bo of β such
that ‖bo‖ ≤ max(|a|, |f|) if d < n/2.
Proof. Let med(n) = med(y1, ..., yn) and mad(n) = mad(y1, ..., yn). Take
bo = (med(n), 0, ..., 0)^T. Then ‖bo‖ = |med(n)| ≤ max(|a|, |f|). Note that
the median absolute residual for the fit bo is equal to the median absolute
deviation mad(n) = med(|yi − med(n)|, i = 1, ..., n) ≤ f − a if d < (n + 1)/2.
QED
A high breakdown regression estimator is an estimator which has a bounded
median absolute residual even when close to half of the observations are
arbitrary. Rousseeuw and Leroy (1987, p. 29, 206) conjecture that high
breakdown regression estimators can not be computed cheaply, and they
conjecture that if the algorithm is also affine equivariant, then the complexity
of the algorithm must be at least O(np ). The following counterexample shows
that these two conjectures are false.
Example 9.3. If the model has an intercept, then an affine equivariant
high breakdown estimator β̂WLS(k) can be found by applying OLS to the
set of cases that have yi ∈ [med(y1, ..., yn) ± k mad(y1, ..., yn)] where k ≥ 1
(so at least half of the cases are used). When k = 1, this estimator uses the
“half set” of cases closest to med(y1, ..., yn). Hence 0.99β̂ W LS (1) could be
used as the biased HB estimator needed in Theorem 8.7.
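A minimal R sketch of this estimator is given below. It is our own illustration of the idea in Example 9.3, not the author's code; x is assumed to be the matrix of nontrivial predictors, and mad is used with constant = 1 so that it matches mad(n) = med(|yi − med(n)|).

# sketch: OLS applied to the cases with yi in [med(y) +/- k mad(y)], k >= 1
bwls <- function(x, y, k = 1) {
        # x: matrix of nontrivial predictors; lsfit adds the intercept
        keep <- abs(y - median(y)) <= k * mad(y, constant = 1)
        lsfit(x[keep, , drop = FALSE], y[keep])$coefficients
}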
which yields the predicted values Ŷi ≡ MED(n). The squared residual
Thus

MED(|r1(β̂WLS)|, ..., |rn(β̂WLS)|) ≤ √nj k MAD(n) < ∞.
Thus

Q(|cr1,1|, ..., |crn,1|) = |c|^d Q(|r1,1|, ..., |rn,1|) < |c|^d Q(|r1,2|, ..., |rn,2|) = Q(|cr1,2|, ..., |crn,2|),

and TQ is scale equivariant. QED
Since least squares is regression, scale, and affine equivariant, the fit from
an elemental or subset refinement algorithm that uses OLS also has these
properties provided that the criterion Q satisfies the condition in Lemma
9.2. If Q = med(ri^2), then d = 2. If

Q = Σ_{i=1}^{h} (|r|_{(i)})^τ    or    Q = Σ_{i=1}^{n} wi |ri|^τ,

then d = τ.
Corollary 9.3. Any low breakdown affine equivariant estimator can be
approximated by a high breakdown affine equivariant estimator.
Proof. Let β̂W LS (k2 ) be the estimator in Example 9.3 with k = k2 ≥ 1.
Let β̂ be the low breakdown estimator, and let
β̂approx = β̂ W LS (k2 ),
otherwise. If k1 > 1 is large, the approximation will be good. QED
Remark 9.5. Similar breakdown results hold for multivariate location
and dispersion estimators. See Section 10.5.
Remark 9.6. There are several important points about breakdown that
do not seem to be well known. First, a breakdown result is weaker than even
and that γ is the proportion of outliers. Then the mean vectors of the clusters
can be chosen to make the outliers bad leverage points. (This type of data set
is frequently used in simulations where the affine and regression equivariance
of the estimators is used to justify these choices.) It is well known that the
LMS(cn), LTA(cn) and LTS(cn) are defined by the “narrowest strip” cover-
ing cn of the cases where the width of the strip is measured in the vertical
direction with the L∞ , L1 , and L2 criterion, respectively. We will assume
that cn ≈ n/2 and focus on the LMS estimator since the narrowness of the
strip is simply the vertical width of the strip.
Figure 9.1 will be useful for examining the resistance of the LMS estima-
tor. The data set consists of 300 N2(0, I2) clean cases and 200 N2((9, 9)^T, 0.25 I2)
cases. Then the narrowest strip that covers only clean cases covers 1/[2(1 − γ)]
of the clean cases. For the artificial data, γ = 0.4, and 5/6 of the clean cases
are covered and the width of the strip is approximately 2.76. The strip
shown in Figure 9.1 consists of two parallel lines with y-intercepts of ±1.38
and covers approximately 250 cases. As this strip is rotated counterclockwise
about the origin until it is parallel to the y-axis, the vertical width of the
strip increases to ∞. Hence LMS will correctly produce a slope near zero
if no outliers are present. Next, stop the rotation when the center of the
strip passes through the center of both clusters, covering nearly 450 cases.
The vertical width of the strip can be decreased to a value less than 2.76
while still covering 250 cases. Hence the LMS fit will accommodate the
outliers, and with 40% contamination, an outlying cluster can tilt the LMS
fit considerably. As c → 0, the cluster of outliers tends to a point mass and
even greater tilting is possible; nevertheless, for the Figure 9.1 data, a 40%
point mass can not drive the LMS slope to ∞.
Next suppose that the 300 distinct clean cases lie exactly on the line
through the origin with zero slope. Then an “exact fit” to at least half of the
data occurs and any rotation from this line can cover at most 1 of the clean
cases. Hence a point mass will not be able to rotate LMS unless it consists
of at least 299 cases (creating 300 additional exact fits). Similar results hold
for the LTA and LTS estimators.
These remarks suggest that the narrowest band interpretation of the LTx
estimators gives a much fuller description of their resistance than their break-
down value. Also, setting β = 0 may lead to misleading simulation studies.
Figure 9.1: 300 N(0, I2) cases and 200 N((9, 9)^T, 0.25 I2) cases
9.5 Complements
Feller (1957) is a great source for examining subsampling behavior when the
data set is fixed.
Hampel, Ronchetti, Rousseeuw and Stahel (1986, p. 96-98) and Donoho
and Huber (1983) provide some history for breakdown. Maguluri and Singh
(1997) have interesting examples on breakdown. Morgenthaler (1989) and
Stefanski (1991) conjectured that high breakdown estimators with high effi-
ciency are not possible. These conjectures have been shown to be false.
9.6 Problems
9.1 a) Enter or download the following R/Splus function
pifclean <- function(k, gam)
{
# largest p giving a 95% chance that at least one of the k randomly
# drawn elemental sets is clean when the contamination proportion is
# gam; this inverts the approximation k = 3/(1 - gam)^p of Remark 9.2
        p <- floor(log(3/k)/log(1 - gam))
        list(p = p)
}
b) Include the output from the commands below that are used to produce
the second column of Table 9.1.
Multivariate Models
There are some differences in the notation used in multiple linear regres-
sion and multivariate location dispersion models. Notice that W could be
used as the design matrix in multiple linear regression although usually the
first column of the regression design matrix is a vector of ones. The n × p de-
sign matrix in the multiple linear regression model was denoted by X and Xi
denoted the ith column of X. In the multivariate location dispersion model,
X and X i will be used to denote a p × 1 random vector with observed value
xi , and xTi is the ith row of the data matrix W . Johnson and Wichern (1988,
p. 7, 53) use X to denote the n × p data matrix and an n × 1 random vector,
relying on the context to indicate whether X is a random vector or data
matrix. Software tends to use different notation. For example, R/Splus will
use commands such as
var(x)
and
E(AX) = AE(X) and E(AXB) = AE(X)B. (10.3)
Thus
Cov(a + AX) = Cov(AX) = A Cov(X) A^T.     (10.4)
Cov(X) = Σ.
Proposition 10.2. a) All subsets of a MVN are MVN: (Xk1 , ..., Xkq )T
∼ Nq (µ̃, Σ̃) where µ̃i = E(Xki ) and Σ̃ij = Cov(Xki , Xkj ). In particular,
X 1 ∼ Nq (µ1 , Σ11 ) and X 2 ∼ Np−q (µ2, Σ22).
b) If X 1 and X 2 are independent, then Cov(X 1 , X 2 ) = Σ12 =
E[(X 1 − E(X 1 ))(X 2 − E(X 2 ))T ] = 0, a q × (p − q) matrix of zeroes.
c) If X ∼ Np (µ, Σ), then X 1 and X 2 are independent iff Σ12 = 0.
d) If X 1 ∼ Nq(µ1, Σ11) and X 2 ∼ Np−q(µ2, Σ22) are independent, then

\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N_p\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix} \right).
Example 10.1. Let p = 2 and let (Y, X)^T have a bivariate normal
distribution. That is,

\begin{pmatrix} Y \\ X \end{pmatrix} \sim N_2\left( \begin{pmatrix} \mu_Y \\ \mu_X \end{pmatrix}, \begin{pmatrix} \sigma_Y^2 & Cov(Y, X) \\ Cov(X, Y) & \sigma_X^2 \end{pmatrix} \right).

Also recall that the population correlation between X and Y is given by

ρ(X, Y) = Cov(X, Y) / √(VAR(X) VAR(Y)) = σ_{X,Y} / (σX σY)

if σX > 0 and σY > 0. Then Y | X = x ∼ N(E(Y |X = x), VAR(Y |X = x))
where the conditional mean

E(Y | X = x) = µY + Cov(Y, X) (1/σX²)(x − µX) = µY + ρ(X, Y) √(σY²/σX²) (x − µX)
f(x, y) = (1/2) (1/(2π√(1 − ρ²))) exp( −(x² − 2ρxy + y²)/(2(1 − ρ²)) )
+ (1/2) (1/(2π√(1 − ρ²))) exp( −(x² + 2ρxy + y²)/(2(1 − ρ²)) ) ≡ (1/2) f1(x, y) + (1/2) f2(x, y)
where x and y are real and 0 < ρ < 1. Since both marginal distributions of
fi (x, y) are N(0,1) for i = 1 and 2 by Proposition
10.2 a), the marginal dis-
tributions of X and Y are N(0,1). Since ∫∫ xy fi(x, y) dx dy = ρ for i = 1 and
−ρ for i = 2, X and Y are uncorrelated, but X and Y are not independent
since f(x, y) ≠ fX(x)fY(y).
Remark 10.2. In Proposition 10.3, suppose that X = (Y, X2 , ..., Xp)T .
Let X1 = Y and X 2 = (X2 , ..., Xp)T . Then E[Y |X 2 ] = β1 +β2X2 +· · ·+βp Xp
and VAR[Y |X 2] is a constant that does not depend on X 2 . Hence Y =
β1 + β2X2 + · · · + βpXp + e follows the multiple linear regression model.
E(X) = µ (10.7)
and
Cov(X) = cX Σ (10.8)
where
cX = −2ψ′(0).
has density

h(u) = (π^{p/2} / Γ(p/2)) kp u^{p/2−1} g(u).     (10.10)
E(X|B^T X) = µ + MB B^T(X − µ) = aB + MB B^T X     (10.11)

where

aB = µ − MB B^T µ = (I p − MB B^T)µ

and

MB = ΣB(B^T ΣB)^{−1}.
Notice that in the formula for M B , Σ can be replaced by cΣ where c > 0 is
a constant. In particular, if the EC distribution has 2nd moments, Cov(X)
can be used instead of Σ.
Proposition 10.5. Let X ∼ ECp (µ, Σ, g) and assume that E(X) exists.
b) Even if the first moment does not exist, the conditional median
MED(Y |X) = α + βT X
Now

E\left[ \begin{pmatrix} Y \\ X \end{pmatrix} \Big| X \right] = E\left[ \begin{pmatrix} Y \\ X \end{pmatrix} \Big| B^T \begin{pmatrix} Y \\ X \end{pmatrix} \right]
= \mu + \Sigma B (B^T \Sigma B)^{-1} B^T \begin{pmatrix} Y - \mu_Y \\ X - \mu_X \end{pmatrix}

by Lemma 10.4. The right hand side of the last equation is equal to

\mu + \begin{pmatrix} \Sigma_{YX} \\ \Sigma_{XX} \end{pmatrix} \Sigma_{XX}^{-1} (X - \mu_X) = \begin{pmatrix} \mu_Y - \Sigma_{YX}\Sigma_{XX}^{-1}\mu_X + \Sigma_{YX}\Sigma_{XX}^{-1} X \\ X \end{pmatrix},

so that

β^T = Σ_{YX} Σ_{XX}^{−1}.
b) See Croux, Dehon, Rousseeuw and Van Aelst (2001) for references.
Example 10.2. This example illustrates another application of Lemma
10.4. Suppose that X comes from a mixture of two multivariate normals
with the same mean and proportional covariance matrices. That is, let
where c > 0 and 0 < γ < 1. Since the multivariate normal distribution is
elliptically contoured (and see Proposition 4.1c),
and
C(W) = S = (1/(n − 1)) Σ_{i=1}^{n} (xi − T(W))(xi − T(W))^T
and will be denoted by MDi2 . When T (W ) and C(W ) are estimators other
than the sample mean and covariance, Di2 will sometimes be denoted by
RDi2 .
The following proposition shows that the Mahalanobis distances are in-
variant under affine transformations. See Rousseeuw and Leroy (1987, p.
252-262) for similar results.
Proposition 10.7. If (T, C) is affine equivariant, then
Di2 (W ) ≡ Di2 (T (W ), C(W )) =
Di2 (T (Z), C(Z)) ≡ Di2 (Z). (10.16)
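For the classical estimator this invariance can be checked directly in R; the sketch below uses simulated data, and the transformation A and the shift are arbitrary choices.

# sketch: classical Mahalanobis distances are unchanged under z = A x + b
set.seed(2)
x <- matrix(rnorm(100 * 3), 100, 3)
A <- matrix(c(2, 1, 0, 0, 1, 0, 1, 0, 3), 3, 3)        # nonsingular
z <- x %*% t(A) + matrix(c(1, 2, 3), 100, 3, byrow = TRUE)
d2x <- mahalanobis(x, apply(x, 2, mean), var(x))
d2z <- mahalanobis(z, apply(z, 2, mean), var(z))
max(abs(d2x - d2z))                                    # essentially zero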
10.5 Breakdown
This section gives a standard definition of breakdown (see Zuo 2001 for refer-
ences) for estimators of multivariate location and dispersion. The following
notation will be useful. Let W denote the data matrix where the ith row
corresponds to the ith case. For multivariate location and dispersion, W is
the n × p matrix with ith row xTi . Let W nd denote the data matrix with
ith row wTi where any d of the cases have been replaced by arbitrarily bad
contaminated cases. Then the contamination fraction is γ = d/n.
Let (T (W ), C(W )) denote an estimator of multivariate location and dis-
persion where the p× 1 vector T (W ) is an estimator of location and the p× p
symmetric positive semidefinite matrix C(W ) is an estimator of dispersion.
The breakdown value of the multivariate location estimator T at W is

B(T, W) = min{ d/n : sup_{W^n_d} ‖T(W^n_d)‖ = ∞ }.
ci,j = (1/(cn − 1)) Σ_{k=1}^{cn} (zi,k − z̄i)(zj,k − z̄j).
Hence the maximum eigenvalue λ1 can not get arbitrarily large if the zi are
all contained in some ball of radius R about the origin, eg, if none of the
cn cases is an outlier. If all of the zi are bounded, then λ1 is bounded,
and λp can only be driven to zero if the determinant of C can be driven to
zero. The determinant |S| of S is known as the generalized sample variance.
Consider the hyperellipsoid
{z : (z − T)^T C^{−1}(z − T) ≤ D²_{(cn)}}     (10.17)

where D²_{(cn)} is the cn th smallest squared Mahalanobis distance based on
(T, C). This ellipsoid contains the cn cases with the smallest Di2 . The vol-
ume of this ellipsoid is proportional to the square root of the determinant,
|C|^{1/2}, and this volume will be positive unless extreme degeneracy is present
among the cn cases. See Johnson and Wichern (1988, p. 103-104).
a consistent estimator for (µ, aΣ) where a is some positive constant when the
data X i are elliptically contoured ECp (µ, Σ, g), and TM CD has a Gaussian
limit. See Butler, Davies, and Jhun (1993).
Computing robust covariance estimators can be very expensive. For ex-
ample, to compute the exact MCD(cn ) estimator (TM CD , CM CD ), we need to
consider the C(n, cn ) subsets of size cn . Woodruff and Rocke (1994, p. 893)
note that if 1 billion subsets of size 101 could be evaluated per second, it
would require 10^33 millennia to search through all C(200, 101) subsets if the
sample size n = 200.
Hence high breakdown (HB) algorithms will again be used to approximate
the robust estimators. Many of the properties and techniques used for HB
regression algorithm estimators carry over for HB algorithm estimators of
multivariate location and dispersion. Elemental sets are the key ingredient
for both basic resampling and concentration algorithms.
Definition 10.9. Suppose that x1 , ..., xn are p × 1 vectors of observed
data. An elemental set J is a set of p + 1 cases in the multivariate location
and dispersion model. An elemental start is the sample mean and sample co-
variance matrix of the data corresponding to J. In a concentration algorithm,
let (T0,j , C 0,j ) be the jth start (not necessarily elemental) and compute all n
Mahalanobis distances Di (T0,j , C 0,j ). At the next iteration, the classical es-
timator (T1,j , C 1,j ) = (x1,j , S 1,j ) is computed from the cn ≈ n/2 cases corre-
sponding to the smallest distances. This iteration can be continued for k steps
resulting in the sequence of estimators (T0,j , C 0,j ), (T1,j , C 1,j ), ..., (Tk,j , C k,j ).
The result of the iteration (Tk,j , C k,j ) is called the jth attractor. Kn starts
are used, and the concentration estimator, called the CMCD estimator, is the
attractor that has the smallest determinant det(C k,j ). The basic resampling
algorithm estimator is a special case where k = 0 so that the attractor is the
start: (xk,j , S k,j ) = (x0,j , S 0,j ).
This concentration algorithm is a simplified version of the algorithms
given by Rousseeuw and Van Driessen (1999) and Hawkins and Olive (1999a).
Using k = 10 concentration steps often works well.
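For concreteness, the concentration iteration of Definition 10.9 can be sketched in a few lines of R. This is only an illustration (the function name concstep is ours, not the author's software): it starts from a given (T0, C0) and performs k concentration steps using the cn ≈ n/2 cases with the smallest Mahalanobis distances.

# sketch: k concentration steps for multivariate location and dispersion
concstep <- function(x, T0, C0, k = 10) {
        n <- nrow(x); cn <- ceiling(n/2)
        Tj <- T0; Cj <- C0
        for (i in 1:k) {
                d2 <- mahalanobis(x, Tj, Cj)
                keep <- order(d2)[1:cn]      # cn cases with smallest distances
                Tj <- apply(x[keep, , drop = FALSE], 2, mean)
                Cj <- var(x[keep, , drop = FALSE])
        }
        list(center = Tj, cov = Cj)
}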
Proposition 10.8: Rousseeuw and Van Driessen (1999, p. 214).
Suppose that the classical estimator (xi,j , S i,j ) is computed from cn cases and
that the n Mahalanobis distances RDk ≡ RDk (xi,j , S i,j ) are computed. If
(xi+1,j , S i+1,j ) is the classical estimator computed from the cn cases with the
smallest Mahalanobis distances RDk, then the MCD criterion det(S i+1,j) ≤
det(S i,j) with equality iff (xi+1,j, S i+1,j) = (xi,j, S i,j).
ci,j = (1/(cn − 1)) Σ_{k=1}^{cn} (zi,k − z̄i)(zj,k − z̄j).
Hence the maximum eigenvalue λ1 can not get arbitrarily large if the zi are
all contained in some ball of radius R about the origin, eg, if none of the
cn cases is an outlier. If all of the z i are bounded, then all of the λi are
bounded, and λp can only be driven to zero if the determinant of C can
be driven to zero. The determinant |S J | of S J is known as the generalized
sample variance.
Consider the hyperellipsoid

{z : (z − xJ)^T SJ^{−1}(z − xJ) ≤ d²}.     (10.19)

The volume of this hyperellipsoid is

(2π^{p/2} / (p Γ(p/2))) d^p √det(SJ),     (10.20)
and this volume will be positive unless extreme degeneracy is present among
the cn cases. See Johnson and Wichern (1988, p. 103-104). If d² = D²_{(cn)},
the cn th smallest squared Mahalanobis distance based on (xJ, SJ), then the
hyperellipsoid contains the cn cases with the smallest Di². Using the above
ideas suggests the following robust estimator.
Definition 10.11. Let the Mth start (T0,M , C 0,M ) = (x0,M , S 0,M ) be
the classical estimator applied after trimming the M% of cases furthest in
Euclidean distance from the coordinatewise median MED(W ) where M ∈
{0, 50} (or use, eg, M ∈ {0, 50, 60, 70, 80, 90, 95, 98}). Then concentra-
tion steps are performed resulting in the Mth attractor (Tk,M , C k,M ) =
(xk,M , S k,M ), and the M = 0 attractor is the DGK estimator. Let (TA , C A )
correspond to the attractor that has the smallest determinant. The median
ball algorithm (MBA) estimator (TM BA, C M BA) takes TM BA = TA and
C MBA = [ MED(Di²(TA, C A)) / χ²_{p,0.5} ] C A     (10.21)
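Combining the two starts of Definition 10.11 with concentration and the scaling in (10.21) gives the following minimal sketch. It is our own illustration that reuses the concstep sketch above; the covmba function referenced in the problems is the author's implementation.

# sketch: MBA idea with the M = 0 and M = 50 starts, then scaling (10.21)
mba <- function(x, k = 5) {
        med <- apply(x, 2, median)                      # coordinatewise median
        ed <- sqrt(mahalanobis(x, med, diag(ncol(x))))  # Euclidean distances
        half <- ed <= median(ed)                        # cases in the median ball
        a0 <- concstep(x, apply(x, 2, mean), var(x), k)                   # M = 0
        a50 <- concstep(x, apply(x[half, ], 2, mean), var(x[half, ]), k)  # M = 50
        att <- if (det(a50$cov) < det(a0$cov)) a50 else a0
        scale <- median(mahalanobis(x, att$center, att$cov))/qchisq(0.5, ncol(x))
        list(center = att$center, cov = scale * att$cov)
}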
has density

h(u) = (π^{p/2} / Γ(p/2)) kp u^{p/2−1} g(u),     (10.23)

and the 50% highest density region has the form of the hyperellipsoid

{z : (z − µ)^T Σ^{−1}(z − µ) ≤ U0.5}

where U0.5 is the median of the distribution of U. For example, if the x are
MVN, then U has the χ²_p distribution.
Remark 10.4.
a) Butler, Davies and Jhun (1993): The MCD(cn) estimator is a √n consistent
HB estimator for (µ, aMCD Σ) where the constant aMCD > 0 depends
on the EC distribution.
b) Lopuhaä (1999): If (T, C) is a consistent estimator for (µ, aΣ) with
rate nδ where the constants a > 0 and δ > 0, then the classical estimator
(xM , S M ) computed after trimming the M% (where 0 < M < 100) of cases
with the largest distances Di (T, C) is a consistent estimator for (µ, aM Σ)
with the same rate nδ where aM > 0 is some constant. Notice that applying
the classical estimator to the cn ≈ n/2 cases with the smallest distances
corresponds to M = 50.
c) Rousseeuw and Van Driessen (1999): Assume that the classical esti-
mator (xm,j , S m,j ) is computed from cn cases and that the n Mahalanobis
distances Di ≡ Di (xm,j , S m,j ) are computed. If (xm+1,j , S m+1,j ) is the clas-
sical estimator computed from the cn cases with the smallest Mahalanobis
distances Di , then the MCD criterion det(S m+1,j ) ≤ det(S m,j ) with equality
iff (xm+1,j , S m+1,j ) = (xm,j , S m,j ).
d) Pratt (1959): Let K be a fixed positive integer and let the constant
a > 0. Suppose that (T1, C 1 ), ..., (TK , C K ) are K consistent estimators of
(µ, aΣ) each with the same rate nδ . If (TA , C A ) is an estimator obtained by
choosing one of the K estimators, then (TA , C A ) is a consistent estimator of
(µ, aΣ) with rate nδ .
e) Olive (2002): Suppose that (Ti, C i ) are consistent estimators for (µ, aiΣ)
where ai > 0 for i = 1, 2. Let Di,1 and Di,2 be the corresponding distances
and let R be the set of cases with distances Di (T1 , C 1) ≤ MED(Di (T1 , C 1)).
Let rn be the correlation between Di,1 and Di,2 for the cases in R. Then
rn → 1 in probability as n → ∞.
f) Olive (2004a): (x0,50, S 0,50) is a high breakdown estimator. If the data
distribution is EC but not spherically symmetric, then for m ≥ 0, S m,50
underestimates the major axis and overestimates the minor axis of the highest
density region. Concentration reduces but fails to eliminate this bias. Hence
the estimated highest density region based on the attractor is “shorter” in
the direction of the major axis and “fatter” in the direction of the minor axis
than estimated regions based on consistent estimators. Arcones (1995) and
Kim (2000) showed that x0,50 is a HB √n consistent estimator of µ.
The following remarks help explain why the MBA estimator is robust.
Using k = 5 concentration steps often works well. The scaling makes C M BA
a better estimate of Σ if the data is multivariate normal MVN. See Equa-
tions (11.2) and (11.4). The attractor (Tk,0, C k,0 ) that uses the classical
estimator (0% trimming) as a start is the DGK estimator and has good
statistical properties. By Remark 10.4f, the start (T0,50, C 0,50) that uses
50% trimming is a high breakdown estimator. Since only cases xi such that
‖xi − MED(x)‖ ≤ MED(‖xi − MED(x)‖) are used, the largest eigenvalue of
C 0,50 is bounded if fewer than half of the cases are outliers.
The geometric behavior of the start (T0,M , C 0,M ) with M ≥ 0 is simple.
If the data xi are MVN (or EC) then the highest density regions of the
data are hyperellipsoids. The set of x closest to the coordinatewise median
in Euclidean distance is a hypersphere. For EC data the highest density
ellipsoid and the hypersphere will have approximately the same center, and
the hypersphere will be drawn towards the longest axis of
the hyperellipsoid. Hence too much data will be trimmed in that direction.
For example, if the data are MVN with Σ = diag(1, 2, ..., p) then C 0,M will
underestimate the largest variances and overestimate the smallest variances.
Taking k concentration steps can greatly reduce but not eliminate the bias
of C k,M if the data is EC, and the determinant |C k,M | < |C 0,M | unless the
attractor is equal to the start by Remark 10.4c. The attractor (Tk,50, C k,50 )
is not affine equivariant but is resistant to gross outliers in that they will
initially be given weight zero if they are further than the median Euclidean
distance from the coordinatewise median. Gnanadesikan and Kettenring
(1972, p. 94) suggest an estimator similar to the attractor (Tk,50, C k,50 ), also
see Croux and Van Aelst (2002).
Next, we will compare several concentration algorithms with theory and
simulation. Let the CMCD algorithm use k > 1 concentration steps where
the final estimator is the attractor that has the smallest determinant (the
MCD criterion). We recommend k = 10 for the DGK estimator and k = 5
(see Rousseeuw and Leroy 1987, p. 259) which is proportional to the volume
of the hyperellipsoid
{z : (z − xk,M)^T S_{k,M}^{−1}(z − xk,M) ≤ d²}     (10.26)

where d² = MED(Di²(xk,M, S k,M)). The following two theorems show how
to produce √n consistent robust estimators from starts that use O(n) cases.
The following theorem shows that the MBA estimator has good statistical
properties.
Theorem 10.14. Suppose (E1) holds.
a) If (TA, C A) is the attractor that minimizes the volume criterion (10.25),
then (TA, C A) is a HB √n consistent estimator of (µ, aMCD Σ).
b) If (TA, C A) is the attractor that minimizes det(Sk,M), then (TA, C A) is
a HB √n consistent estimator of (µ, aMCD Σ). Hence the MBA estimator is
a √n consistent HB estimator.
one of the MBA attractors will minimize the criterion value and the result
follows. If (E1) holds and the distribution is not spherically symmetric, then
the probability that the DGK attractor minimizes the determinant goes to
one as n → ∞, and (TCM CD , C CM CD ) is asymptotically equivalent to the
DGK estimator (Tk,0, C k,0 ). QED
To compare (TM BA, C M BA ) and (TF M CD , C F M CD ), we used simulated
data with n = 100 cases and computed the FMCD estimator with the
R/Splus function cov.mcd. Initially the data sets had no outliers, and all
100 cases were MVN with zero mean vector and Σ = diag(1,2, ..., p). We
generated 500 runs of this data with p = 4. The averaged diagonal elements
of C M BA were 1.202, 2.260, 3.237 and 4.204. (In the simulations, the scale
factor in Equation (10.21) appeared to be slightly too large for small n but
slowly converged to the correct factor as n increased.) The averaged diagonal
elements of C F M CD were 0.838, 1.697, 2.531, and 3.373. The approximation
1.2C F M CD ≈ Σ was good. For both matrices, all off diagonal elements had
average values less than 0.034 in magnitude.
Next data sets with 40% outliers were generated. The last 60 cases were
MVN with zero mean vector and Σ = diag(1, 2, ..., p). The first 40 cases were
MVN with the same Σ, but the p × 1 mean vector µ = (10, 10√2, ..., 10√p)^T.
We generated 500 runs of this data using p = 4. Shown below are the averages
of C M BA and C F M CD . Notice that C F M CD performed extremely well while
the C M BA entries were over inflated by a factor of about 2 since the outliers
inflate the scale factor MED(Di2 (TA , C A ))/χ2p,0.5.
                 MBA                                       FMCD
  2.120  −0.031  −0.069   0.004          0.980   0.002  −0.004   0.011
 −0.031   4.144  −0.111  −0.146          0.002   1.977  −0.008  −0.014
 −0.069  −0.111   6.211  −0.419         −0.004  −0.008   2.991   0.013
 −0.138   0.008  −0.419   7.933          0.011  −0.014   0.013   3.862
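One run of a simulation of this kind can be set up along the following lines. This is a sketch only: cov.mcd is loaded with library(lqs) as in the problems below, and mba stands for a robust estimator such as the MBA sketch given earlier, not for the author's covmba.

# sketch: one run with p = 4, n = 100 and 40% mean-shifted outliers
library(lqs)                            # provides cov.mcd (FMCD)
p <- 4; n <- 100
x <- matrix(rnorm(n * p), n, p) %*% diag(sqrt(1:p))       # Sigma = diag(1,...,p)
x[1:40, ] <- x[1:40, ] + matrix(10 * sqrt(1:p), 40, p, byrow = TRUE)
diag(cov.mcd(x)$cov)     # FMCD estimates of the variances
diag(mba(x)$cov)         # MBA-type estimates of the variances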
The DD plot of MDi versus RDi is useful for detecting outliers. The
resistant estimator will be useful if (T, C) ≈ (µ, cΣ) where c > 0 since
scaling by c affects the vertical labels of the RDi but not the shape of the
DD plot. For the outlier data, the MBA estimator is biased, but the outliers
in the MBA DD plot will have large RDi since C M BA ≈ 2C F M CD ≈ 2Σ.
When p is increased to 8, the cov.mcd estimator was usually not useful
for detecting the outliers for this type of contamination. Figure 10.1 shows
[Figure 10.1: FMCD DD Plot, RD versus MD.]

[Figure 10.2: Resistant DD Plot, RD versus MD.]
that now the FMCD RDi are highly correlated with the MDi . The DD plot
based on the MBA estimator detects the outliers. See Figure 10.2.
10.8 Complements
The theory for concentration algorithms is due to Hawkins and Olive (2002)
and Olive and Hawkins (2006). The MBA estimator is due to Olive (2004a).
The computational and theoretical simplicity of the MBA estimator makes it
one of the most useful robust estimators ever proposed. An important appli-
cation of the robust algorithm estimators and of case diagnostics is to detect
outliers. Sometimes it can be assumed that the analysis for influential cases
and outliers was completely successful in classifying the cases into outliers
and good or “clean” cases. Then classical procedures can be performed on
the good cases. This assumption of perfect classification is often unreason-
able, and it is useful to have robust procedures, such as the MBA estimator,
that have rigorous asymptotic theory and are practical to compute. Since the
MBA estimator is about an order of magnitude faster than alternative robust
estimators, the MBA estimator may be useful for data mining applications.
In addition to concentration and randomly selecting elemental sets, three
other algorithm techniques are important. He and Wang (1996) suggest
computing the classical estimator and a robust estimator. The final cross
checking estimator is the classical estimator if both estimators are “close,”
otherwise the final estimator is the robust estimator. The second technique
was proposed by Gnanadesikan and Kettenring (1972, p. 90). They suggest
using the dispersion matrix C = [ci,j ] where ci,j is a robust estimator of the
covariance of Xi and Xj . Computing the classical estimator on a subset of
the data results in an estimator of this form. The identity
cases. Poor starts are discarded, and L of the best starts are evaluated on
the entire data set. This idea is also used by Rocke and Woodruff (1996) and
by Rousseeuw and Van Driessen (1999).
There certainly exist types of outlier configurations where the FMCD es-
timator outperforms the robust MBA estimator. The MBA estimator is most
vulnerable to outliers that lie inside the hypersphere based on the median
Euclidean distance from the coordinatewise median (see Problem 10.17 for a
remedy). Although the MBA estimator should not be viewed as a replace-
ment for the FMCD estimator, the FMCD estimator should be modified as
in Theorem 10.15. Until this modification appears in the software, both es-
timators can be used for outlier detection by making a scatterplot matrix of
the Mahalanobis distances from the FMCD, MBA and classical estimators.
The simplest version of the MBA estimator only has two starts. A simple
modification would be to add additional starts as in Problem 10.17.
Johnson and Wichern (1988) and Mardia, Kent and Bibby (1979) are
good references for multivariate statistical analysis based on the multivariate
normal distribution. The elliptically contoured distributions generalize the
multivariate normal distribution and are discussed (in increasing order of
difficulty) in Johnson (1987), Fang, Kotz, and Ng (1990), Fang and Anderson
(1990), and Gupta and Varga (1993). Fang, Kotz, and Ng (1990) sketch the
history of elliptically contoured distributions while Gupta and Varga (1993)
discuss matrix valued elliptically contoured distributions. Cambanis, Huang,
and Simons (1981), Chmielewski (1981) and Eaton (1986) are also important
references. Also see Muirhead (1982, p. 30–42).
Rousseeuw (1984) introduced the MCD and the minimum volume ellip-
soid MVE(cn) estimator. For the MVE estimator, T (W ) is the center of
the minimum volume ellipsoid covering cn of the observations and C(W )
is determined from the same ellipsoid. TM V E has a cube root rate and the
limiting distribution is not Gaussian. See Davies (1992). Bernholdt and
Fisher (2004) show that the MCD estimator can be computed with O(nv )
complexity where v = 1 + p(p + 3)/2 if x is a p × 1 vector.
Rocke and Woodruff (1996, p. 1050) claim that any affine equivariant
location and shape estimation method gives an unbiased location estimator
and a shape estimator that has an expectation that is a multiple of the true
shape for elliptically contoured distributions. Hence there are many can-
didate robust estimators of multivariate location and dispersion. See Cook,
Hawkins, and Weisberg (1993) for an exact algorithm for the MVE. Other pa-
pers on robust algorithms include Hawkins (1993b, 1994), Hawkins and Olive
(1999a), Hawkins and Simonoff (1993), He and Wang (1996), Rousseeuw and
Van Driessen (1999), Rousseeuw and van Zomeren (1990), Ruppert (1992),
and Woodruff and Rocke (1993). Rousseeuw and Leroy (1987, 7.1) also
describes many methods. Papers by Fung (1993), Ma and Genton (2001),
Olive (2004a), Poston, Wegman, Priebe, and Solka (1997) and Ruiz-Gazen
(1996) may also be of interest.
The discussion by Rocke and Woodruff (2001) and by Hubert (2001) of
Peña and Prieto (2001) stresses the fact that no one estimator can domi-
nate all others for every outlier configuration. These papers and Wisnowski,
Simpson, and Montgomery (2002) give outlier configurations that can cause
problems for the FMCD estimator.
10.9 Problems
10.1∗. Suppose that
\begin{pmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{pmatrix} \sim N_4\left( \begin{pmatrix} 49 \\ 100 \\ 17 \\ 7 \end{pmatrix}, \begin{pmatrix} 3 & 1 & -1 & 0 \\ 1 & 6 & 1 & -1 \\ -1 & 1 & 4 & 0 \\ 0 & -1 & 0 & 2 \end{pmatrix} \right).
Σ_{XX}^{−1} Σ_{XY} = [Cov(X)]^{−1} Cov(X, Y).
10.8. Using the notation under Lemma 10.4, show that if X is elliptically
contoured, then the conditional distribution of X 1 given that X 2 = x2 is
also elliptically contoured.
10.9∗. Suppose Y ∼ Nn(Xβ, σ²I). Find the distribution of
(X T X)−1 X T Y if X is an n × p full rank constant matrix.
10.10. Recall that Cov(X, Y ) = E[(X − E(X))(Y − E(Y ))T ]. Using
the notation of Proposition 10.6 on p. 291, let (Y, X T )T be ECp+1 (µ, Σ, g)
where Y is a random variable. Let the covariance matrix of (Y, X T ) be
Cov((Y, X^T)^T) = c \begin{pmatrix} \Sigma_{YY} & \Sigma_{YX} \\ \Sigma_{XY} & \Sigma_{XX} \end{pmatrix} = \begin{pmatrix} VAR(Y) & Cov(Y, X) \\ Cov(X, Y) & Cov(X) \end{pmatrix}.

Show that α = µY − β^T µX and β = [Cov(X)]^{−1} Cov(X, Y).
10.11. (Due to R.D. Cook.) Let X be a p × 1 random vector with
E(X) = 0 and Cov(X) = Σ. Let B be any constant full rank p × r matrix
where 1 ≤ r ≤ p. Suppose that for all such conforming matrices B,
E(X|B T X) = M B B T X
covmba, will display the code for the function. Use the args command, eg
args(covmba), to display the needed arguments for the function.
10.12. a) Download the maha function that creates the classical Maha-
lanobis distances.
b) Enter the following commands and check whether observations 201–210
look like outliers.
10.13. a) Download the rmaha function that creates the robust Maha-
lanobis distances.
b) Obtain outx2 as in Problem 10.12 b). R users need to enter the
command library(lqs). Enter the command rmaha(outx2) and check whether
observations 201–210 look like outliers.
10.14. a) Download the covmba function.
b) Download the program rcovsim.
c) Enter the command rcovsim(100) three times and include the output
in Word.
d) Explain what the output is showing.
10.15∗. a) Assuming that you have done the two source commands above
Problem 10.12 (and in R the library(lqs) command), type the command
ddcomp(buxx). This will make 4 DD plots based on the DGK, MBA, FMCD
and median ball estimators. The DGK and median ball estimators are the
two attractors used by the MBA estimator. With the leftmost mouse button,
move the cursor to each outlier and click. This data is the Buxton (1920)
data and cases with numbers 61, 62, 63, 64, and 65 were the outliers with
head lengths near 5 feet. After identifying the outliers in each plot, hold the
rightmost mouse button down (and in R click on Stop) to advance to the
next plot. When done, hold down the Ctrl and c keys to make a copy of the
plot. Then paste the plot in Word.
CMCD Applications
11.1 DD Plots
A basic way of designing a graphical display is to arrange for reference
situations to correspond to straight lines in the plot.
Chambers, Cleveland, Kleiner, and Tukey (1983, p. 322)
med(Di (A))/med(MDi )
which is generally not one. By taking τ = med(MDi )/med(Di (A)), the plot
will follow the identity line if (x, S) is a consistent estimator of (µ, cxΣ) and
if (TA , C A ) is a consistent estimator of (µ, aA Σ). (Using the notation from
Proposition 11.1, let (a1, a2) = (cx, aA ).) The classical estimator is consistent
if the population has second moments, and the algorithm estimator (TA , C A )
tends to be consistent on the class of EC distributions and biased otherwise.
By replacing the observed median med(MDi ) of the classical Mahalanobis
distances with the target population analog, say MED, τ can be chosen so
that the DD plot is simultaneously a diagnostic for elliptical symmetry and a
diagnostic for the target EC distribution. That is, the plotted points follow
the identity line if the data arise from a target EC distribution such as the
multivariate normal distribution, but the points follow a line with non-unit
slope if the data arise from an alternative EC distribution. In addition the
DD plot can often detect departures from elliptical symmetry such as outliers,
the presence of two groups, or the presence of a mixture distribution. These
facts make the DD plot a useful alternative to other graphical diagnostics for
target distributions. See Easton and McCulloch (1990), Li, Fang, and Zhu
(1997), and Liu, Parelius, and Singh (1999) for references.
Example 11.1. Rousseeuw and Van Driessen (1999) choose the multi-
variate normal Np (µ, Σ) distribution as the target. If the data are indeed iid
MVN vectors, then the (MD_i)² are asymptotically χ²_p random variables, and MED = √(χ²_{p,0.5}) where χ²_{p,0.5} is the median of the χ²_p distribution. Since the target distribution is Gaussian, let

RD_i = [√(χ²_{p,0.5}) / med(D_i(A))] D_i(A)   so that   τ = √(χ²_{p,0.5}) / med(D_i(A)).     (11.2)
Note that the DD plot can be tailored to follow the identity line if the
data are iid observations from any target elliptically contoured distribution
that has 2nd moments. If it is known that med(MDi ) ≈ MED where MED is
the target population analog (obtained, for example, via simulation, or from
the actual target distribution as in Equations (10.8), (10.9) and (10.10) on
p. 289), then use
RD_i = τ D_i(A) = [MED / med(D_i(A))] D_i(A).     (11.3)
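For concreteness, the rescaling in Equations (11.2) and (11.3) is a one-line computation. The following hedged R sketch is not the rpack code used elsewhere in this text; it assumes that cov.mcd from the MASS library plays the role of the algorithm estimator (T_A, C_A).

library(MASS)                                     # cov.mcd (an assumption; rpack may use its own code)
p <- 3
x <- matrix(rnorm(100 * p), ncol = p)
md <- sqrt(mahalanobis(x, colMeans(x), cov(x)))   # classical distances MD_i
out <- cov.mcd(x)
di <- sqrt(mahalanobis(x, out$center, out$cov))   # robust distances D_i(A)
tau <- sqrt(qchisq(0.5, p)) / median(di)          # tau from Equation (11.2), MED = sqrt(chi^2_{p,0.5})
rd <- tau * di                                    # RD_i as in Equation (11.3)
plot(md, rd)                                      # DD plot
abline(0, 1)                                      # identity line added as a visual aid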
the MBA starts as shown in Theorem 10.15. There exist data sets with out-
liers or two groups such that both the classical and robust estimators produce
ellipsoids that are nearly concentric. We suspect that the situation worsens
as p increases.
In a simulation study, Np (0, I p ) data were generated and cov.mcd was
used to compute first the Di (A), and then the RDi using Equation (11.2).
The results are shown in Table 11.1. Each choice of n and p used 100 runs,
and the 100 correlations between the RDi and the MDi were computed. The
mean and minimum of these correlations are reported along with the percent-
age of correlations that were less than 0.95 and 0.80. The simulation shows
that small data sets (of roughly size n < 8p + 20) yield plotted points that
may not cluster tightly about the identity line even if the data distribution
is Gaussian.
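A hedged sketch of this kind of simulation (in the spirit of the corrsim function used in Problem 11.3, not the code actually used for Table 11.1) follows; cov.mcd from MASS again stands in for the algorithm estimator.

library(MASS)
ddcorr <- function(n = 100, p = 4, nruns = 100) {
  corrs <- numeric(nruns)
  for (i in 1:nruns) {
    x <- matrix(rnorm(n * p), ncol = p)
    md <- sqrt(mahalanobis(x, colMeans(x), cov(x)))    # classical MD_i
    out <- cov.mcd(x)
    rd <- sqrt(mahalanobis(x, out$center, out$cov))    # robust D_i(A); rescaling does not change the correlation
    corrs[i] <- cor(md, rd)
  }
  c(mean = mean(corrs), min = min(corrs),
    prop.below.95 = mean(corrs < 0.95), prop.below.80 = mean(corrs < 0.80))
}
ddcorr(n = 50, p = 4)    # small n (roughly n < 8p + 20) tends to give weaker correlations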
Since every estimator of location and dispersion defines an ellipsoid, the
DD plot can be used to examine which points are in the robust ellipsoid
{x : (x − T_R)^T C_R^{-1} (x − T_R) ≤ RD²_(h)}     (11.5)

where RD²_(h) is the hth smallest squared robust Mahalanobis distance, and which points are in a classical ellipsoid

{x : (x − x̄)^T S^{-1} (x − x̄) ≤ MD²_(h)}.     (11.6)
In the DD plot, points below RD(h) correspond to cases that are in the
ellipsoid given by Equation (11.5) while points to the left of MD(h) are in an
ellipsoid determined by Equation (11.6).
The DD plot will follow a line through the origin closely only if the two
ellipsoids are nearly concentric, eg if the data is EC. The DD plot will follow
the identity line closely if med(MD_i) ≈ MED, and

RD²_i = (x_i − T_A)^T [ (MED/med(D_i(A)))² C_A^{-1} ] (x_i − T_A) ≈ (x_i − x̄)^T S^{-1} (x_i − x̄) = MD²_i
and (x, S) will often produce ellipsoids that are far from concentric.
[Figure 11.1: DD plots of MD versus RD: a) MVN data, b) non-MVN EC data, c) data with lognormal marginals, d) weighted DD plot of the lognormal data.]
A weighted DD plot is made by omitting the cases with RD_i ≥ √(χ²_{p,0.975}). This technique can magnify features that are obscured when large RD_i's are present. If the distribution of x is EC,
Proposition 11.1 implies that the correlation of the points in the weighted
DD plot will tend to one and that the points will cluster about a line passing
through the origin. For example, the plotted points in the weighted DD plot
(not shown) for the non-MVN EC data of Figure 11.1b are highly correlated
and still follow a line through the origin with a slope close to 2.0.
Figures 11.1c and 11.1d illustrate how to use the weighted DD plot. The
ith case in Figure 11.1c is (exp(xi,1), exp(xi,2), exp(xi,3))T where xi is the
ith case in Figure 11.1a; ie, the marginals follow a lognormal distribution. The plot does not resemble the identity line, correctly suggesting that the distribution of the data is not MVN; however, the correlation of the plotted points is rather high. Figure 11.1d is the weighted DD plot where cases with RD_i ≥ √(χ²_{3,0.975}) ≈ 3.06 have been removed. Notice that the correlation of the
plotted points is not close to one and that the best fitting line in Figure 11.1d
may not pass through the origin. These results suggest that the distribution
of x is not EC.
It is easier to use the DD plot as a diagnostic for a target distribution
such as the MVN distribution than as a diagnostic for elliptical symmetry.
If the data arise from the target distribution, then the DD plot will tend
to be a useful diagnostic when the sample size n is such that the sample
correlation coefficient in the DD plot is at least 0.80 with high probability.
As a diagnostic for elliptical symmetry, it may be useful to add the OLS line
to the DD plot and weighted DD plot as a visual aid, along with numerical
quantities such as the OLS slope and the correlation of the plotted points.
Numerical methods for transforming data towards a target EC distribu-
tion have been developed. Generalizations of the Box–Cox transformation
towards a multivariate normal distribution are described in Velilla (1993).
Alternatively, Cook and Nachtsheim (1994) offer a two-step numerical pro-
cedure for transforming data towards a target EC distribution. The first step
simply gives zero weight to a fixed percentage of cases that have the largest
robust Mahalanobis distances, and the second step uses Monte Carlo case
reweighting with Voronoi weights.
Example 11.2. Buxton (1920, p. 232-5) gives 20 measurements of 88
[Figure: DD plot of MD versus RD for the Buxton data; the outlying cases 61–65 have very large robust distances.]
distribution and 40 iid cases from a bivariate normal distribution with mean
(0, −3)T and covariance I 2 . Figure 11.3 shows the classical covering ellipsoid
that uses (T, C) = (x, S). The symbol “1” denotes the data while the symbol
“2” is on the border of the covering ellipse. Notice that the classical ellipsoid
covers almost all of the data. Figure 11.4 displays the resistant covering
ellipse. The resistant covering ellipse contains most of the 100 “clean” cases
and excludes the 40 outliers. Problem 11.5 recreates similar figures with the
classical and the resistant R/Splus cov.mcd estimators.
Example 11.4. Buxton (1920) gives various measurements on 88 men
including height and nasal height. Five heights were recorded to be about
19mm and are massive outliers. Figure 11.5 shows that the classical covering
ellipsoid is quite large but does not include any of the outliers. Figure 11.6
shows that the resistant covering ellipsoid is not inflated by the outliers.
[Figure 11.3: The classical covering ellipsoid; the symbol “1” denotes the data while the symbol “2” is on the border of the covering ellipse.]
[Figure 11.4: The resistant covering ellipsoid.]
[Figure 11.5: The classical covering ellipsoid for the Buxton data.]
[Figure 11.6: The resistant covering ellipsoid for the Buxton data.]
{x : (x − T)^T C^{-1} (x − T) ≤ D²_(j)}.     (11.8)
The ith case (yi, xTi )T is trimmed if Di > D(j) . Then an estimator of β is
computed from the untrimmed cases. For example, if j ≈ 0.9n, then about
10% of the cases are trimmed, and OLS or L1 could be used on the untrimmed
cases.
Recall that a forward response plot is a plot of the fitted values Ŷi versus
the response Yi and is very useful for detecting outliers. If the MLR model
holds and the MLR estimator is good, then the plotted points will scatter
about the identity line that has unit slope and zero intercept. The identity
line is added to the plot as a visual aid, and the vertical deviations from the
identity line are equal to the residuals since Yi − Ŷi = ri .
The resistant trimmed views estimator combines ellipsoidal trimming and
the forward response plot. First compute (T, C), perhaps using the MBA
estimator or the R/Splus function cov.mcd. Trim the M% of the cases with
the largest Mahalanobis distances, and then compute the MLR estimator β̂M
from the untrimmed cases. Use M = 0, 10, 20, 30, 40, 50, 60, 70, 80, and
T
90 to generate ten forward response plots of the fitted values β̂ M xi versus yi
using all n cases. (Fewer plots are used for small data sets if β̂ M can not be
computed for large M.) These plots are called “trimmed views.”
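A hedged R sketch of this procedure is given below. It uses cov.mcd and plain OLS via lm in place of the rpack functions tvreg and tvreg2, which may differ in their details.

library(MASS)
tvsketch <- function(x, y, Mvals = seq(0, 90, by = 10)) {
  out <- cov.mcd(x)
  d <- sqrt(mahalanobis(x, out$center, out$cov))
  for (M in Mvals) {
    keep <- d <= quantile(d, 1 - M/100)            # trim the M% of cases with the largest distances
    fit <- lm(y[keep] ~ x[keep, , drop = FALSE])   # MLR estimator from the untrimmed cases
    fitall <- cbind(1, x) %*% coef(fit)            # fitted values for all n cases
    plot(fitall, y, main = paste0(M, "% trimming"))
    abline(0, 1)                                   # identity line; vertical deviations are the residuals
  }
}
# tvsketch(buxx, buxy)   # eg the Buxton data, if loaded as in Problem 11.8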
Definition 11.2. The trimmed views (TV) estimator β̂ T ,n corresponds
to the trimmed view where the bulk of the plotted points follow the identity
line with smallest variance function, ignoring any outliers.
Example 11.4 (continued). For the Buxton (1920) data, height was
the response variable while an intercept, head length, nasal height, bigonal
breadth, and cephalic index were used as predictors in the multiple linear
regression model. Observation 9 was deleted since it had missing values.
Five individuals, cases 61–65, were reported to be about 0.75 inches tall
with head lengths well over five feet! OLS was used on the untrimmed cases
and Figure 11.7 shows four trimmed views corresponding to 90%, 70%, 40%
and 0% trimming. The OLS TV estimator used 70% trimming since this
trimmed view was best. Since the vertical distance from a plotted point
to the identity line is equal to the case’s residual, the outliers had massive
residuals for 90%, 70% and 40% trimming. Notice that the OLS trimmed
view with 0% trimming “passed through the outliers” since the cluster of
outliers is scattered about the identity line.
The TV estimator β̂ T ,n has good statistical properties if the estimator
applied to the untrimmed cases (X M,n , Y M,n ) has good statistical properties.
Candidates include OLS, L1 , Huber’s M–estimator, Mallows’ GM–estimator
or the Wilcoxon rank estimator. See Rousseeuw and Leroy (1987, p. 12-13,
150). The basic idea is that if an estimator with OP (n−1/2 ) convergence rate
is applied to a set of nM ∝ n cases, then the resulting estimator β̂ M,n also
has OP (n−1/2) rate provided that the response y was not used to select the
nM cases in the set. If β̂M,n − β = OP (n−1/2 ) for M = 0, ..., 90 then
β̂T ,n − β = OP (n−1/2) by Pratt (1959).
Let X n = X 0,n denote the full design matrix. Often when proving asymp-
totic normality of an MLR estimator β̂0,n , it is assumed that
X_n^T X_n / n → W^{-1}.

If β̂_{0,n} has O_P(n^{-1/2}) rate and if for big enough n all of the diagonal elements of (X_{M,n}^T X_{M,n} / n)^{-1} are contained in an interval [0, B) for some B > 0, then β̂_{M,n} − β = O_P(n^{-1/2}).
The distribution of the estimator β̂M,n is especially simple when OLS is
used and the errors are iid N(0, σ 2 ). Then
β̂_{M,n} = (X_{M,n}^T X_{M,n})^{-1} X_{M,n}^T Y_{M,n} ∼ N_p(β, σ²(X_{M,n}^T X_{M,n})^{-1})

and √n(β̂_{M,n} − β) ∼ N_p(0, σ²(X_{M,n}^T X_{M,n}/n)^{-1}). Notice that this result does not imply that the distribution of β̂_{T,n} is normal.
[Figure 11.7: Trimmed views with 90%, 70%, 40% and 0% trimming; fitted values (fit) versus the response y.]
Table 11.2 compares the TV, MBA (for MLR), lmsreg, ltsreg, L1 and
OLS estimators on 7 data sets available from the text’s website. The column
headers give the file name while the remaining rows of the table give the
sample size n, the number of predictors p, the amount of trimming M used by
the TV estimator, the correlation of the residuals from the TV estimator with
the corresponding alternative estimator, and the cases that were outliers.
If the correlation was greater than 0.9, then the method was effective in detecting the outliers; otherwise the method failed. Sometimes the
trimming percentage M for the TV estimator was picked after fitting the
bulk of the data in order to find the good leverage points and outliers.
Notice that the TV, MBA and OLS estimators were the same for the
Gladstone data and for the major data (Tremearne 1911) which had two
small y–outliers. For the Gladstone data, there is a cluster of infants that are
good leverage points, and we attempt to predict brain weight with the head
measurements height, length, breadth, size and cephalic index. Originally, the
variable length was incorrectly entered as 109 instead of 199 for case 119, and
the glado data contains this outlier. In 1997, lmsreg was not able to detect
the outlier while ltsreg did. Due to changes in the Splus 2000 code, lmsreg
Table 11.2: Summaries for Seven Data Sets, the Correlations of the Residuals
from TV(M) and the Alternative Method are Given in the 1st 5 Rows
Definition 11.3. For a multiple linear regression model with weights W_i, a weighted forward response plot is a plot of √W_i x_i^T β̃ versus √W_i Y_i. The weighted residual plot is a plot of √W_i x_i^T β̃ versus the WMLR residuals r_{W,i} = √W_i Y_i − √W_i x_i^T β̃.
Application 11.3. For resistant weighted MLR, use the WTV estimator
which is selected from ten weighted forward response plots.
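As a brief hedged illustration of Definition 11.3 (not code from the text), lm with a weights argument supplies a WMLR fit for simulated heteroscedastic data; the data-generating choices below are purely illustrative.

set.seed(1)
n <- 100
x <- rnorm(n)
w <- runif(n, 0.5, 2)
y <- 1 + 2 * x + rnorm(n)/sqrt(w)                       # variance proportional to 1/w
fit <- lm(y ~ x, weights = w)
plot(sqrt(w) * fitted(fit), sqrt(w) * y)                # weighted forward response plot
abline(0, 1)
plot(sqrt(w) * fitted(fit), sqrt(w) * residuals(fit))   # weighted residual plot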
11.5 Complements
The first section of this chapter followed Olive (2002) closely. The DD plot
can be used to diagnose elliptical symmetry, to detect outliers, and to assess
the success of numerical methods for transforming data towards an ellipti-
cally contoured distribution. Since many statistical methods assume that
the underlying data distribution is Gaussian or EC, there is an enormous
literature on numerical tests for elliptical symmetry. Bogdan (1999), Czörgö
(1986) and Thode (2002) provide references for tests for multivariate normal-
ity while Koltchinskii and Li (1998) and Manzotti, Pérez and Quiroz (2002)
have references for tests for elliptically contoured distributions.
The TV estimator was proposed by Olive (2002, 2005) and is similar to
an estimator proposed by Rousseeuw and van Zomeren (1992). Although
both the TV and MBA estimators have the good OP (n−1/2 ) convergence
rate, their efficiency under normality may be very low. (We could argue that
the TV and OLS estimators are asymptotically equivalent on clean data if
0% trimming is always picked when all 10 plots look good.) Using the TV
and MBA estimators as the initial estimator in the cross checking estimator
results in a resistant (easily computed but zero breakdown) asymptotically
efficient final estimator. High breakdown estimators that have high efficiency
tend to be impractical to compute, but an exception is the cross checking
estimator that uses the CLTS estimator from Theorem 8.7 as the initial
robust estimator.
The ideas used in Section 11.3 have the potential for making many meth-
ods resistant. First, suppose that the MLR model holds but Cov(e) = σ 2Σ
and Σ = V V where V is known and nonsingular. Then V −1 Y = V −1 Xβ+
V −1 e, and the TV and MBA MLR estimators can be applied to Ỹ = V −1 Y
and X̃ = V −1 X provided that OLS is fit without an intercept.
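A minimal hedged sketch of this transformation with a known V follows, with plain OLS standing in for the TV or MBA MLR estimator; the choice of V is illustrative only.

set.seed(1)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))                 # the intercept is carried as a column of X
V <- diag(sqrt(seq(1, 3, length = n)))            # a known nonsingular V with Sigma = V V (illustrative)
e <- V %*% rnorm(n)                               # Cov(e) proportional to V V = Sigma
Y <- X %*% c(2, 1, -1) + e
Vinv <- solve(V)
Ytil <- drop(Vinv %*% Y)
Xtil <- Vinv %*% X
coef(lm(Ytil ~ Xtil - 1))                         # fit without an intercept on the transformed data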
Secondly, many 1D regression models (where yi is independent of xi given
the sufficient predictor xTi β) can be made resistant by making EY plots
of the estimated sufficient predictor x_i^T β̂_M versus y_i for the 10 trimming proportions M = 0, 10, ..., 90.
11.6 Problems
PROBLEMS WITH AN ASTERISK * ARE ESPECIALLY USE-
FUL.
11.1∗. If X and Y are random variables, show that
Cov(X, Y) = [Var(X + Y) − Var(X − Y)]/4.
R/Splus Problems
Warning: Use the command source(“A:/rpack.txt”) to download
the programs. See Preface or Section 14.2. Typing the name of the
rpack function, eg ddplot, will display the code for the function. Use the
args command, eg args(ddplot), to display the needed arguments for the
function.
11.2. a) Download the program ddsim. (In R, type the command li-
brary(lqs).)
b) Using the function ddsim for p = 2, 3, 4, determine how large the
sample size n should be in order for the DD plot of n N_p(0, I_p) cases to cluster tightly about the identity line with high probability. Table your
results. (Hint: type the command ddsim(n=20,p=2) and increase n by 10
until most of the 20 plots look linear. Then repeat for p = 3 with the n that
worked for p = 2. Then repeat for p = 4 with the n that worked for p = 3.)
11.3. a) Download the program corrsim. (In R, type the command
library(lqs).)
b) A numerical quantity of interest is the correlation between the MDi
and RDi in a DD plot that uses n Np (0, I p ) cases. Using the function corrsim
for p = 2, 3, 4, determine how large the sample size n should be in order for
9 out of 10 correlations to be greater than 0.9. (Try to make n small.) Table
your results. (Hint: type the command corrsim(n=20,p=2,nruns=10) and
increase n by 10 until 9 or 10 of the correlations are greater than 0.9. Then
repeat for p = 3 with the n that worked for p = 2. Then repeat for p = 4
with the n that worked for p = 3.)
11.4∗. a) Download the ddplot function. (In R, type the command
library(lqs).)
b) Use the following commands to generate data from the EC distribution (1 − ε)N_p(0, I_p) + ε N_p(0, 25 I_p) where p = 3 and ε = 0.4.
n <- 400
p <- 3
eps <- 0.4
x <- matrix(rnorm(n * p), ncol = p, nrow = n)
zu <- runif(n)
x[zu < eps,] <- x[zu < eps,]*5
c) Use the command ddplot(x) to make a DD plot and include the plot
in Word. What is the slope of the line followed by the plotted points?
11.5. a) Download the ellipse function.
b) Use the following commands to create a bivariate data set with outliers
and to obtain a classical and robust covering ellipsoid. Include the two plots
in Word. (In R, type the command library(lqs).)
c) Use the command mplot(x) and place the resulting plot in Word.
d) Do you prefer the DD plot or the mplot? Explain.
11.7 a) Download the function wddplot.
b) Enter the commands in Problem 11.4b to obtain a data set x.
c) Use the command wddplot(x) and place the resulting plot in Word.
11.8. a) In addition to the source(“A:/rpack.txt”) command, also use
the source(“A:/robdata.txt”) command (and in R, type the library(lqs) com-
mand).
b) Type the command tvreg(buxx,buxy). Click the rightmost mouse but-
ton (and in R, highlight Stop). The forward response plot should appear.
Repeat 10 times and remember which plot percentage M (say M = 0) had
the best forward response plot. Then type the command tvreg2(buxx,buxy, M
= 0) (except use your value of M, not 0). Again, click the rightmost mouse
button (and in R, highlight Stop). The forward response plot should appear.
Hold down the Ctrl and c keys to make a copy of the plot. Then paste the
plot in Word.
c) The estimated coefficients β̂ T V from the best plot should have appeared
on the screen. Copy and paste these coefficients into Word.
Chapter 12
1D Regression
... estimates of the linear regression coefficients are relevant to the linear
parameters of a broader class of models than might have been suspected.
Brillinger (1977, p. 509)
After computing β̂, one may go on to prepare a scatter plot of the points
(β̂xj , yj ), j = 1, ..., n and look for a functional form for g(·).
Brillinger (1983, p. 98)
y ⊥ x | β^T x.     (12.1)
y = m(α + β T x) + e. (12.3)
t(y) = α + β^T x + e.     (12.4)
Koenker and Geling (2001) note that if yi is an observed survival time, then
many survival models have the form of Equation (12.4). They provide three
illustrations including the Cox (1972) proportional hazards model. Li and
Duan (1989, p. 1014) note that the class of 1D regression models also includes
binary regression models, censored regression models, and certain projection
pursuit models. Applications are also given by Stoker (1986), Horowitz (1996,
1998) and Cook and Weisberg (1999a).
Definition 12.2. Regression is the study of the conditional distribution
of y|x. Focus is often on the mean function E(y|x) and/or the variance
function VAR(y|x). There is a distribution for each value of x = xo such
that y|x = x_o is defined. For a 1D regression,

E(y|x) = M(β^T x)

and

VAR(y|x) = V(β^T x)

where M is the kernel mean function and V is the kernel variance function.
Notice that the mean and variance functions depend on the same linear
combination if the 1D regression model is valid. This dependence is typical of
GLM’s where M and V are known kernel mean and variance functions that
depend on the family of GLM’s. See Cook and Weisberg (1999a, section
23.1). A heteroscedastic regression model
y = M(β_1^T x) + V(β_2^T x) e     (12.5)
βT x with “no loss of information.” Cook and Weisberg (1999a, p. 411) de-
fine a sufficient summary plot (SSP) to be a plot that contains all the sample
regression information about the conditional distribution y|x of the response
given the predictors.
Definition 12.4: If the 1D regression model holds, then y ⊥ x | a + cβ^T x for any constants a and c ≠ 0. The quantity a + cβ^T x is called a sufficient predictor (SP), and a sufficient summary plot is a plot of any SP versus y. An estimated sufficient predictor (ESP) is α̃ + β̃^T x where β̃ is an estimator of cβ for some nonzero constant c. An estimated sufficient summary plot (ESSP) or EY plot is a plot of any ESP versus y.
If there is only one predictor x, then the plot of x versus y is both a
sufficient summary plot and an EY plot, but generally only an EY plot can
be made. Since a can be any constant, a = 0 is often used. The following
section shows how to use the OLS regression of y on x to obtain an ESP.
If the predictors x satisfy this condition, then for any given predictor xj ,
E[xj |βT x] = aj + tj β T x.
Notice that β is a fixed (p − 1) × 1 vector. If x is elliptically contoured (EC)
with 1st moments, then the assumption of linearly related predictors holds
since
E[x|bT x] = ab + tb bT x
for any nonzero (p − 1) × 1 vector b (see Lemma 10.4 on p. 290). The
condition of linearly related predictors is impossible to check since β is un-
known, but the condition is far weaker than the assumption that x is EC.
The stronger EC condition is often used since there are checks for whether
this condition is plausible, eg use the DD plot. The following proposition
gives an equivalent definition of linearly related predictors. Both definitions
are frequently used in the regression graphics literature.
Proposition 12.1. The predictors x are linearly related iff
E(b^T x | β^T x) = a_b + t_b β^T x for any constant (p − 1) × 1 vector b.
QED
Following Cook (1998a, p. 143-144), assume that there is an objective
function
L_n(a, b) = (1/n) Σ_{i=1}^n L(a + b^T x_i, y_i)     (12.8)
where L(u, v) is a bivariate function that is a convex function of the first
argument u. Assume that the estimate (â, b̂) of (a, b) satisfies
L(a + bT x, y) = (y − a − bT x)2 .
[Figure 12.1 and Figure 12.2: sufficient summary plots of SP versus Y and of −SP versus Y.]
usual multiple linear regression of yi on xi , but we are not assuming that the
multiple linear regression model holds; however, we are hoping that the 1D
regression model y ⊥ x | β^T x is a useful approximation to the data and that
b̂OLS ≈ cβ for some nonzero constant c. In addition to Theorem 12.2, nice
results exist if the single index model is appropriate. Recall that
Cov(x, y) = E[(x − E(x))(y − E(y))^T].
Definition 12.7. Suppose that (yi , xTi )T are iid observations and that
the positive definite (p − 1) × (p − 1) matrix Cov(x) = ΣX and the (p − 1) × 1
vector Cov(x, y) = ΣX,Y . Let the OLS estimator (â, b̂) be computed from
the multiple linear regression of y on x plus a constant. Then (â, b̂) estimates
the population quantity (αOLS , β OLS ) where
β_OLS = Σ_X^{-1} Σ_{X,Y}.     (12.10)
The following notation will be useful for studying the OLS estimator.
Let the sufficient predictor z = β T x and let w = x − E(x). Let r =
w − (ΣX β)β T w.
Theorem 12.3. In addition to the conditions of Definition 12.7, also
assume that yi = m(βT xi ) + ei where the zero mean constant variance iid
errors ei are independent of the predictors xi . Then
β_OLS = Σ_X^{-1} Σ_{X,Y} = c_{m,X} β + u_{m,X}     (12.11)

where the scalar

c_{m,X} = E[β^T (x − E(x)) m(β^T x)]     (12.12)

and the bias vector

u_{m,X} = Σ_X^{-1} E[m(β^T x) r].     (12.13)
Moreover, u_{m,X} = 0 if x is from an elliptically contoured distribution with 2nd moments, and c_{m,X} ≠ 0 unless Cov(x, y) = 0. If the multiple linear regression model holds, then c_{m,X} = 1, and u_{m,X} = 0.
The proof of the above result is outlined in Problem 12.2 using an argu-
ment due to Aldrin, Bølviken, and Schweder (1993). See related results in Stoker (1986) and Cook, Hawkins, and Weisberg (1992). If the 1D regression model is appropriate, then typically Cov(x, y) ≠ 0 unless β^T x follows a symmetric distribution and m is symmetric about the median of β^T x.
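A small hedged simulation (not from the text) illustrating Theorem 12.3: with multivariate normal, hence elliptically contoured, predictors the bias vector u_{m,X} vanishes and the OLS coefficient vector should be roughly proportional to β. The model m(t) = t³ is an illustrative assumption.

set.seed(1)
n <- 10000
beta <- c(1, 2, 3)
x <- matrix(rnorm(n * 3), ncol = 3)               # EC predictors, so u_{m,X} = 0
y <- drop(x %*% beta)^3 + rnorm(n)                # single index model with m(t) = t^3
bhat <- coef(lm(y ~ x))[-1]                       # drop the intercept
bhat / beta                                       # approximately constant, estimating c_{m,X}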
Definition 12.8. Let (â, b̂) denote the OLS estimate obtained from the OLS multiple linear regression of y on x. The OLS view is a plot of b̂^T x versus y.
Remark 12.4. All of this awkward notation and theory leads to one of
the most remarkable results in statistics, perhaps first noted by Brillinger
(1977, 1983) and called the 1D Estimation Result by Cook and Weisberg
(1999a, p. 432). The result is that if the 1D regression model is appropriate,
then the OLS view will frequently be a useful estimated sufficient summary plot (ESSP). Hence the OLS predictor b̂^T x is a useful estimated sufficient predictor (ESP).
Although the OLS view is frequently a good ESSP if no strong nonlinear-
ities are present in the predictors and if c_{m,X} ≠ 0 (eg the sufficient summary plot of β^T x versus y is not approximately symmetric), even better estimated
sufficient summary plots can be obtained by using ellipsoidal trimming. This
topic is discussed in the following section and follows Olive (2002) closely.
for each vector of observed predictors xi . If the ordered distances D(j) are
The two ideas of using ellipsoidal trimming to reduce the bias and choos-
ing a view with a smooth mean function and smallest variance function can
be combined into a graphical method for finding the estimated sufficient sum-
mary plot and the estimated sufficient predictor. Trim the M% of the cases
with the largest Mahalanobis distances, and then compute the OLS estima-
tor (α̂_M, β̂_M) from the untrimmed cases. Use M = 0, 10, 20, 30, 40, 50, 60, 70, 80, and 90 to generate ten plots of β̂_M^T x versus y using all n cases. In
analogy with the Cook and Weisberg procedure for visualizing 1D structure
with two predictors, the plots will be called “trimmed views.” Notice that
M = 0 corresponds to the OLS view.
Definition 12.9. The best trimmed view is the trimmed view with a
smooth mean function and the smallest variance function and is the estimated
sufficient summary plot. If M* = E is the percentage of cases trimmed that corresponds to the best trimmed view, then β̂_E^T x is the estimated sufficient predictor.
The following examples illustrate the R/Splus function trviews that is
used to produce the ESSP. If R is used instead of Splus, the command
library(lqs)
trviews(X, Y)
Intercept X1 X2 X3
0.6701255 3.133926 4.031048 7.593501
Intercept X1 X2 X3
1.101398 8.873677 12.99655 18.29054
Intercept X1 X2 X3
0.9702788 10.71646 15.40126 23.35055
Intercept X1 X2 X3
0.5937255 13.44889 23.47785 32.74164
Intercept X1 X2 X3
1.086138 12.60514 25.06613 37.25504
Intercept X1 X2 X3
4.621724 19.54774 34.87627 48.79709
Intercept X1 X2 X3
3.165427 22.85721 36.09381 53.15153
Intercept X1 X2 X3
5.829141 31.63738 56.56191 82.94031
Intercept X1 X2 X3
4.241797 36.24316 70.94507 105.3816
Intercept X1 X2 X3
6.485165 41.67623 87.39663 120.8251
The function generates 10 trimmed views. The first plot trims 90% of the
cases while the last plot does not trim any of the cases and is the OLS view.
To advance a plot, press the right button on the mouse (in R, highlight
stop rather than continue). After all of the trimmed views have been
generated, the output is presented. For example, the 5th line of numbers in the output corresponds to α̂_50 = 1.086138 and β̂_50 where 50% trimming was
used. The second line of numbers corresponds to 80% trimming while the
T
last line corresponds to 0% trimming and gives the OLS estimate (α̂0 , β̂ 0 ) =
(â, b̂). The trimmed views with 50% and 90% trimming were very good.
We decided that the view with 50% trimming was the best. Hence β̂E =
(12.60514, 25.06613, 37.25504)T ≈ 12.5β. The best view is shown in Figure
12.4 and is nearly identical to the sufficient summary plot shown in Figure
12.1. Notice that the OLS estimate b̂ = (41.68, 87.40, 120.83)^T ≈ 42β. The
OLS view is Figure 1.6 in Chapter 1 (on p. 17) and is again very similar
to the sufficient summary plot, but it is not quite as smooth as the best
trimmed view.
[Figure 12.4: The best trimmed view (ESP versus Y).]
[Figure 12.5: The angle between the SP and the ESP is nearly zero.]
The plot of the estimated sufficient predictor versus the sufficient predictor is also informative. Of course this plot can usually only be generated for
simulated data since β is generally unknown. If the plotted points are highly
correlated (with |corr(ESP,SP)| > 0.95) and follow a line through the origin,
then the estimated sufficient summary plot is nearly as good as the sufficient
summary plot. The simulated data used β = (1, 2, 3)^T.
[Figure 12.6 and Figure 12.7: EY plots of Y versus the ESP for several methods and their trimmed versions.]
Table 12.1: Estimated sufficient predictor coefficients b when the sufficient predictor coefficients are c(1, 2, 3)^T

method                      b1       b2       b3
OLS View                   0.0032   0.0011   0.0047
90% Trimmed OLS View       0.086    0.182    0.338
SIR View                  −0.394   −0.361   −0.845
10% Trimmed SIR View      −0.284   −0.473   −0.834
SAVE View                 −1.09     0.870   −0.480
40% Trimmed SAVE View      0.256    0.591    0.765
PHD View                  −0.072   −0.029   −0.0097
90% Trimmed PHD View      −0.558   −0.499   −0.664
LMSREG View               −0.003   −0.005   −0.059
70% Trimmed LMSREG View    0.143    0.287    0.428
not linear, and Figure 12.7 suggests that the mean function is neither linear
nor monotone.
Application 12.3. Assume that a known 1D regression model is as-
sumed for the data. Then the best trimmed view is a model checking plot
and can be used as a diagnostic for whether the assumed model is appropri-
ate.
The trimmed views are sometimes useful even when the assumption of
linearly related predictors fails. Cook and Li (2002) summarize when compet-
ing methods such as the OLS view, sliced inverse regression (SIR), principal
Hessian directions (PHD), and sliced average variance estimation (SAVE)
can fail. All four methods frequently perform well if there are no strong
nonlinearities present in the predictors.
Example 12.4 (continued). Figure 12.6 shows that the EY plots for
SIR, PHD, SAVE, and OLS are not very good while Figure 12.7 shows that
trimming improved the SIR, SAVE and OLS methods.
One goal for future research is to develop better methods for visualizing
1D regression. Trimmed views seem to become less effective as the number
of predictors k = p − 1 increases. Consider the sufficient predictor SP =
x1 + · · · + xk . With the sin(SP)/SP data, several trimming proportions gave
[Figure 12.8: Scatterplot matrix of Y, ESP and SP.]
good views with k = 3, but only one of the ten trimming proportions gave
a good view with k = 10. In addition to problems with dimension, it is
not clear which covariance estimator and which regression estimator should
be used. Preliminary investigations suggest that the classical covariance
estimator gives better estimates than cov.mcd, but among the many Splus
regression estimators, lmsreg often worked well. Theorem 12.2 suggests that
strictly convex regression estimators such as OLS will often work well, but
there is no theory for the robust regression estimators.
Example 12.4 continued. Replacing the OLS trimmed views by alter-
native MLR estimators often produced good EY plots, and for single index
models, the lmsreg estimator often worked the best. Figure 12.8 shows a
scatterplot matrix of y, ESP and SP where the sufficient predictor SP =
βT x. The ESP used ellipsoidal trimming with lmsreg instead of OLS. The
top row of Figure 12.8 shows that the estimated sufficient summary plot and
the sufficient summary plot are nearly identical. Also the correlation of the
[Figure 12.9: Forward response plot of the weighted lmsreg fitted values (FIT) versus Y; the view uses 70% trimming.]
ESP and the SP is nearly one. Table 12.1 shows the estimated sufficient pre-
dictor coefficients b when the sufficient predictor coefficients are c(1, 2, 3)T .
Only the SIR, SAVE, OLS and lmsreg trimmed views produce estimated
sufficient predictors that are highly correlated with the sufficient predictor.
Figure 12.9 helps illustrate why ellipsoidal trimming works. This view
used 70% trimming and the open circles denote cases that were trimmed.
The highlighted squares are the untrimmed cases. Note that the highlighted
cases are far more linear than the data set as a whole. Also lmsreg will give
half of the highlighted cases zero weight, further linearizing the function.
In Figure 12.9, the lmsreg constant α̂70 is included, and the plot is simply
the forward response plot of the weighted lmsreg fitted values versus y.
The vertical deviations from the line through the origin are the “residuals” y_i − α̂_70 − β̂_70^T x_i, and at least half of the highlighted cases have small residuals.
power transformations of Yeo and Johnson 2000.) Let λ be the power of the
transformation. Then the following four rules are often used.
The log rule states that positive predictors that have the ratio between
their largest and smallest values greater than ten should be transformed to
logs. See Cook and Weisberg (1999a, p. 87).
Secondly, if it is known that X2 ≈ X1^λ and the ranges of X1 and X2 are such that this relationship is one to one, then

X1^λ ≈ X2 and X2^{1/λ} ≈ X1.

Hence either the transformation X1^λ or X2^{1/λ} will linearize the plot. This
relationship frequently occurs if there is a volume present. For example let
X2 be the volume of a sphere and let X1 be the circumference of a sphere.
Thirdly, the bulging rule states that changes to the power of X2 and the
power of X1 can be determined by the direction that the bulging side of the
curve points. If the curve is hollow up (the bulge points down), decrease the
power of X2 . If the curve is hollow down (the bulge points up), increase the
power of X2. If the curve bulges towards large values of X1, increase the power
of X1 . If the curve bulges towards small values of X1 decrease the power of
X1 . See Tukey (1977, p. 173–176).
Finally, Cook and Weisberg (1999a, p. 86) give the following rule.
To spread small values of a variable, make λ smaller.
To spread large values of a variable, make λ larger.
For example, in Figure 12.14c, small values of Y and large values of FFIT
need spreading, and using log(Y ) would make the plot more linear.
SP = α + β^T x = α + β_S^T x_S + β_E^T x_E = α + β_S^T x_S.     (12.16)
The extraneous terms that can be eliminated given that the subset S is in
the model have zero coefficients.
Now suppose that I is a candidate subset of predictors, that S ⊆ I and
that O is the set of predictors not in I. Then
SP = α + β^T x = α + β_S^T x_S = α + β_S^T x_S + β_{(I/S)}^T x_{I/S} + 0^T x_O = α + β_I^T x_I

(if I includes predictors from E, these will have zero coefficients). For any subset I that includes all relevant predictors, the correlation

corr(α + β^T x_i, α + β_I^T x_{I,i}) = 1.     (12.17)
The difficulty with this approach is that fitting all of the possible sub-
models involves substantial computation. An exception to this difficulty is
multiple linear regression where there are efficient “leaps and bounds” algo-
rithms for searching all subsets when OLS is used (see Furnival and Wilson
1974). Since OLS often gives a useful ESP, the following all subsets procedure
can be used for 1D models.
• If the 1D ESP and the OLS ESP have ‘a strong linear relationship’ (for
example |corr(ESP, OLS ESP)| ≥ 0.95), then infer that the 1D problem
is one in which OLS may serve as an adequate surrogate for the correct
1D model fitting procedure.
• Perform a final check on the subsets that satisfy the Cp screen by using
them to fit the 1D model.
• The key to understanding which plots are the most useful is the obser-
vation that a wz plot is used to visualize the conditional distribution
of z given w. Since a 1D regression is the study of the conditional
distribution of y given α + β T x, the EY plot is used to visualize this
conditional distribution and should always be made. A major problem
with variable selection is that deleting important predictors can change
the functional form m of the model. In particular, if a multiple linear
To see why the plots contain useful information, review Proposition 5.1
(on p. 140 - 142), and Remarks 5.2 and 5.3 with corr(r, rI ) replaced by
corr(V, V (I)). In many settings (not all of which meet the Li–Duan sufficient
conditions), the full model OLS ESP is a good estimator of the sufficient
predictor. If the fitted full 1D model y ⊥ x | α + β^T x is a useful approximation to the data and if β̂_OLS is a good estimator of cβ where c ≠ 0, then a
subset I will produce an EY plot similar to the EY plot of the full model if
corr(OLS ESP, OLS ESP(I)) ≥ 0.95. Hence the EY plots based on the full
and submodel ESP can both be used to visualize the conditional distribution
of y.
To see that models with Cp(I) ≤ 2k for small k are interesting, assume
that subset I uses k predictors including the intercept, that Cp (I) ≤ 2k
and n ≥ 10p. Then 0.9 ≤ corr(V, V (I)), and both corr(V, V (I)) → 1.0 and
corr(OLS ESP, OLS ESP(I)) → 1.0 as n → ∞. Hence the plotted points in
both the VV plot and the EE plot will cluster about the identity line (see
Proposition 5.1 vi). Notice that for a fixed value of k, the model I with the
smallest value of Cp (I) maximizes corr(V, V (I)).
The Cp (I) ≤ k screen tends to overfit. We simulated multiple linear
regression and single index model data sets with p = 8 and n = 50, 100, 1000
and 10000. The true model S satisfied Cp (S) ≤ k for about 60% of the
simulated data sets, but S satisfied Cp (S) ≤ 2k for about 97% of the data
sets.
Assuming that a 1D model holds, a common assumption made for variable
selection is that the fitted full model ESP is a good estimator of the sufficient
predictor, and the usual numerical and graphical checks on this assumption
should be made. To see that this assumption is weaker than the assumption
that the OLS ESP is good, notice that if a 1D model holds but β̂OLS estimates
cβ where c = 0, then the Cp(I) criterion could wrongly suggest that all
subsets I have C_p(I) ≤ 2k. Hence we also need to check that c ≠ 0.
There are several methods for checking the OLS ESP, including: a) if
an ESP from an alternative fitting method is believed to be useful, check that
the ESP and the OLS ESP have a strong linear relationship – for example that
|corr(ESP, OLS ESP)| ≥ 0.95. b) Often examining the EY plot shows that a
1D model is reasonable. For example, if the data are tightly clustered about a
smooth curve, then a single index model may be appropriate. c) Verify that
a 1D model is appropriate using graphical techniques given by Cook and
Weisberg (1999a, p. 434-441). d) Verify that x has an elliptically contoured
distribution with 2nd moments and that the mean function m(α + β T x) is
not symmetric about the median of the distribution of α+β T x. Then results
from Li and Duan (1989) suggest that c ≠ 0.
Condition a) is both the most useful (being a direct performance check)
and the easiest to check. A standard fitting method should be used when
available (eg, for parametric 1D models or the proportional hazards model).
[Figure: a) FF Plot; b) RR Plot; below, plots of Y versus FFIT and Y versus SFIT.]
Figure 12.12: Forward Response and Residual Plots for Boston Housing Data
[Figure 12.13: a) a plot of Y versus SFIT2; b) a plot of NOX versus RAD.]
[Figure 12.14: a) VV Plot; b) EE Plot; below, plots of Y versus FESP and Y versus SESP.]
three clusters of points in the plot of NOX versus RAD shown in Figure
12.13b (the single isolated point in the southeast corner of the plot actually
corresponds to several cases). The two clusters of high NOX and high RAD
points correspond to the cases with high per capita crime rate.
The tiny filled-in triangles in Figure 12.13a represent the fitted values for
a quadratic. We added NOX 2 , RAD2 and NOX ∗ RAD to the full model
and again tried variable selection. Although the full quadratic in NOX and
RAD had a linear forward response plot, the submodel with NOX, RAD
and log(x2) was very similar. For this data set, NOX and RAD seem to be
the most important predictors, but other predictors are needed to make the
model linear and to reduce residual variation.
Example 12.8. In the Boston housing data, now let y = CRIM. Since
log(y) has a linear relationship with the predictors, y should follow a nonlin-
ear 1D regression model. Consider the full model with predictors log(x2), x3,
x4, x5 , log(x7 ), x8 , log(x9) and log(x12). Regardless of whether y or log(y)
is used as the response, the minimum Cp model from backward elimination
used a constant, log(x2 ), x4, log(x7 ), x8 and log(x12) as predictors. If y is
the response, then the model is nonlinear and Cp = 5.699. Proposition 5.1
vi) (on p. 141) suggests that if Cp ≤ 2k, then the points in the VV plot
should tightly cluster about the identity line even if a multiple linear regres-
sion model fails to hold. Figure 12.14 shows the VV and EE plots for the
minimum Cp submodel. The EY plots for the full model and submodel are
also shown. Note that the clustering in the VV plot is indeed higher than
the clustering in the EE plot. Note that the EY plots are highly nonlinear
but are nearly identical.
12.5 Inference
Inference for tests of the form Ho: Aβ = 0 can be performed using χ² tests where A is a k × (p − 1) constant matrix of rank k. Let the 1D model Y ⊥ x | α + β^T x be written as Y ⊥ x | α + β_I^T x_I + β_O^T x_O where the reduced model is Y ⊥ x | α + β_I^T x_I and x_O denotes the terms outside of the reduced model. Notice that the test Ho: β = 0 and the test Ho: β_O = 0 for a submodel I have the correct form. The test for Ho: β_i = 0 uses A = (0, ..., 0, 1, 0, ..., 0) where the 1 is in the ith position. In the following theorem, it is crucial that Ho: Aβ = 0. Tests for Ho: Aβ = 1, say, may not be valid even if the sample size n is large. Let β̂ be the OLS estimator of β and let the χ² test statistic be T = (Aβ̂)^T [A(Ĉov(x))^{-1} A^T]^{-1} Aβ̂/σ̂² where Ĉov(x) is a consistent estimator of Cov(x) and σ̂² is a consistent estimator of σ². By Chen and Li (1998), the asymptotic distribution of √n(β̂ − cβ) is approximately N_{p−1}(0, σ²[Cov(x)]^{-1}).
Theorem 12.4: Li and Duan (1989, p. 1012, 1034-1035). Assume that the 1D regression model Y ⊥ x | α + β^T x is appropriate. Then under regularity conditions, the χ² test of the form Ho: Aβ = 0 that rejects Ho if T > χ²_{k,1−α} is asymptotically valid.
If ellipsoidal trimming is used and the trimming proportion was not cho-
sen using the response y, then Theorem 12.4 still applies. Using EY plots
to pick M uses y. Make a DD plot using the predictors x. The majority of
the plotted points will often follow some line through the origin. If the DD
plot is used to trim M% of the cases that do not follow the line, as in Figure
11.2, then the following corollary can be used.
Corollary 12.5. Suppose that the trimming proportion M is chosen without using the response y. Let (y_{i,M}, x_{i,M}) denote the data that was not trimmed where i = 1, ..., n_M. Then the χ² test of the form Ho: Aβ = 0 is asymptotically valid when based on (y_{i,M}, x_{i,M}).
There is a close relationship between this inference result and the result in
the previous section that showed that variable selection procedures, originally
meant for MLR, can be used for 1D data. Li and Duan (1989) suggest that if a 1D model Y ⊥ x | α + β^T x is appropriate, then β_OLS = cβ for some constant c. Assume that x is a (p − 1) × 1 vector of nontrivial predictors and that all models also include a constant. Hence the full model uses p predictors. Under additional conditions, Li and Duan show that c ≠ 0 and β̂_OLS is asymptotically normal. For many 1D data sets, the change in SS F statistic F_I can be used to test whether the p − k predictor variables not in I can be
deleted. When the Gaussian multiple linear regression model holds, the p–
value (for Ho: the p − k predictors can be deleted) is approximately p–value
= P (Fp−k,n−p > FI ). That is, if model I is selected before collecting data,
then the FI statistic has an Fp−k,n−p distribution when Ho is true. Theorem
12.4 shows that for many 1D data sets, the FI statistic has an asymptotic
Fp−k,n−p distribution when Ho is true. Hence the p–value ≈ P (Fp−k,n−p >
FI ). The key assumptions for this result are that n is large and no strong
linearities are present in the predictors.
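As a hedged illustration (not from the text), the F_I statistic and its p-value can be obtained in R by comparing the submodel and full model fits with anova; the data frame below is purely hypothetical.

set.seed(1)
dat <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100),
                  x3 = rnorm(100), x4 = rnorm(100))    # illustrative data
full <- lm(y ~ x1 + x2 + x3 + x4, data = dat)          # full model with p = 5 (including the constant)
sub  <- lm(y ~ x1 + x2, data = dat)                    # candidate submodel I with k = 3
anova(sub, full)                                       # gives F_I and the p-value P(F_{p-k, n-p} > F_I)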
Variable selection with the Cp criterion is closely related to the change in
SS F test. The following results are properties of OLS and hold even if the
data does not follow a 1D model. If the candidate model of xI has k terms
(including the constant), then let

F_I = [ (SSE(I) − SSE)/(p − k) ] / [ SSE/(n − p) ]     (12.18)

where SSE is the residual sum of squares from the full model and SSE(I) is the residual sum of squares from the candidate submodel. Then

C_p(I) = SSE(I)/MSE + 2k − n = (p − k)(F_I − 1) + k     (12.19)
where MSE is the residual mean square for the full model. Let ESP(I) = α̂_I + β̂_I^T x be the ESP for the submodel and let V_I = Y − ESP(I) so that V_{I,i} = Y_i − (α̂_I + β̂_I^T x_i). Let ESP and V denote the corresponding quantities for the full model. It can be shown that corr(V_I, V) → 1 forces corr(OLS
t_i² = F_{I_i}.

Using the screen C_p(I) ≤ min(2k, p) suggests that the predictor X_i should not be deleted if |t_i| > √2 ≈ 1.414. More generally, it can be shown that C_p(I) ≤ 2k iff

F_I ≤ p/(p − k).
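The last claim is a one-line consequence of Equation (12.19):

C_p(I) \le 2k \iff (p-k)(F_I - 1) + k \le 2k \iff F_I - 1 \le \frac{k}{p-k} \iff F_I \le \frac{p-k+k}{p-k} = \frac{p}{p-k}.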
12.6 Complements
An excellent introduction to 1D regression and regression graphics is Cook
and Weisberg (1999a, ch. 18, 19, and 20) and Cook and Weisberg (1999b).
More advanced treatments are Cook (1998a) and Li (2000). Important pa-
pers include Brillinger (1977, 1983), Li and Duan (1989) and Stoker (1986).
Xia, Tong, Li and Zhu (2002) provides a method for single index models
(and multi–index models) that does not need the linearity condition. Formal
testing procedures for the single index model are given by Simonoff and Tsai
(2002).
There are many ways to estimate 1D models, including maximum likeli-
hood for parametric models. The literature for estimating cβ when model
(12.1) holds is growing, and Cook and Li (2002) summarize when compet-
ing methods such as ordinary least squares (OLS), sliced inverse regression
(SIR), principal Hessian directions (PHD), and sliced average variance esti-
mation (SAVE) can fail. All four methods frequently perform well if there
are no strong nonlinearities present in the predictors. Cook and Ni (2005)
provides theory for inverse regression methods such as SAVE. Further in-
formation about these and related methods can be found, for example, in
Brillinger (1977, 1983), Bura and Cook (2001), Chen and Li (1998), Cook
(1998ab, 2000, 2003, 2004), Cook and Critchley (2000), Cook and Li (2004),
Cook and Weisberg (1991, 1999ab), Fung, He, Liu and Shi (2002), Li (1991,
1992, 2000), Satoh and Ohtaki (2004) and Yin and Cook (2002, 2003).
In addition to OLS, specialized methods for 1D models with an unknown
inverse link function (eg models (12.2) and (12.3)) have been developed, and
often the focus is on developing asymptotically efficient methods. See the
references in Cavanagh and Sherman (1998), Delecroix, Härdle and Hris-
tache (2003), Härdle, Hall and Ichimura (1993), Horowitz (1998), Hristache,
Juditsky, Polzehl, and Spokoiny (2001), Stoker (1986), Weisberg and Welsh
(1994) and Xia, Tong, Li and Zhu (2002).
Corollary 12.5 holds for much more general 1D regression methods. If the
trimming is done with a DD plot and the dimension reduction method such
as SIR is performed on the untrimmed data (yi,M , xi,M ), then the inference
that is valid for M = 0 tends to be valid for M > 0.
Several papers have suggested that outliers and strong nonlinearities need
to be removed from the predictors. See Brillinger (1991), Cook (1998a, p.
152), Cook and Nachtsheim (1994), Heng-Hui (2001), Li and Duan (1989,
p. 1011, 1041, 1042) and Li (1991, p. 319). Outlier resistant methods for
general methods such as SIR are less common, but see Gather, Hilker and
Becker (2001, 2002). Trimmed views were introduced by Olive (2002, 2004b).
Li, Cook and Nachtsheim (2004) find clusters, fit OLS to each cluster and
then pool the OLS estimators into a final estimator. This method uses all n
cases while trimmed views gives M% of the cases weight zero. The trimmed
views estimator will often work well when outliers and influential cases are
present.
Section 12.4 follows Olive and Hawkins (2005) closely. The literature on
numerical methods for variable selection in the OLS multiple linear regression
model is enormous, and the literature for other given 1D regression models
is also growing. Li, Cook and Nachtsheim (2005) give an alternative method
for variable selection that can work without specifying the model. Also see,
for example, Claeskins and Hjort (2003), Efron, Hastie, Johnstone and Tib-
shirani (2004), Fan and Li (2001, 2002), Hastie (1987), Lawless and Singhai
(1978), Naik and Tsai (2001), Nordberg (1982), Nott and Leonte (2004), and
Tibshirani (1996). For generalized linear models, forward selection and back-
ward elimination based on the AIC criterion are often used. See Chapter 13,
Agresti (2002, p. 211-217) or Cook and Weisberg (1999a, p. 485, 536-538).
Again, if the variable selection techniques in these papers are successful, then
the estimated sufficient predictors from the full and candidate model should
be highly correlated, and the EE, VV and EY plots will be useful.
The variable selection model with x = (xTS , xTE )T and SP = α + βT x =
α + βTS xS is not the only variable selection model. Burnham and Anderson
(2004) note that for many data sets, the variables can be ordered in decreasing
importance from x1 to xp−1 . The “tapering effects” are such that if n >> p,
then all of the predictors should be used, but for moderate n it is better to
also written as

y ⊥ x | B^T x

where B is the (p − 1) × k matrix B = [β_1, ..., β_k].

E(x | B^T x) = a + C B^T x

E(x_j | B^T x) = a_j + c_j^T B^T x
12.7 Problems
12.1. Refer to Definition 12.3 for the Cox and Snell (1968) definition for
residuals, but replace η by β.
a) Find êi if yi = µ + ei and T (Y ) is used to estimate µ.
b) Find êi if yi = xTi β + ei .
c) Find êi if yi = β1 exp[β2(xi − x̄)]ei where the ei are iid exponential(1)
random variables and x̄ is the sample mean of the xi s.
d) Find ê_i if y_i = x_i^T β + e_i/√w_i.
ADJUSTED
k CP R SQUARE R SQUARE RESID SS MODEL VARIABLES
-- ----- -------- -------- --------- ---------------
1 379.8 0.0000 0.0000 37363.2 INTERCEPT ONLY
2 36.0 0.3900 0.3913 22744.6 F
2 113.2 0.3025 0.3039 26007.8 G
2 191.3 0.2140 0.2155 29310.8 E
3 21.3 0.4078 0.4101 22039.9 E F
3 25.0 0.4036 0.4059 22196.7 F H
3 30.8 0.3970 0.3994 22442.0 D F
4 17.5 0.4132 0.4167 21794.9 C E F
4 18.1 0.4125 0.4160 21821.0 E F H
4 18.8 0.4117 0.4152 21850.4 A E F
5 10.2 0.4226 0.4272 21402.3 A E F H
5 10.8 0.4219 0.4265 21427.7 C E F H
5 12.0 0.4206 0.4252 21476.6 A D E F
6 5.7 0.4289 0.4346 21125.8 A C E F H
6 9.3 0.4248 0.4305 21279.1 A C D E F
6 10.3 0.4237 0.4294 21319.9 A B E F H
7 6.3 0.4294 0.4362 21065.0 A B C E F H
7 6.3 0.4294 0.4362 21066.3 A C D E F H
7 7.7 0.4278 0.4346 21124.3 A C E F G H
8 7.0 0.4297 0.4376 21011.8 A B C D E F H
8 8.3 0.4283 0.4362 21064.9 A B C E F G H
8 8.3 0.4283 0.4362 21065.8 A C D E F G H
9 9.0 0.4286 0.4376 21011.8 A B C D E F G H
12.4. The output above is for the Boston housing data from software
that does all subsets variable selection. The full model is a 1D transformation
model with response variable y = CRIM and uses a constant and variables
A, B, C, D, E, F, G and H. (Using log(CRIM) as the response would give an
MLR model.) From this output, what is the best submodel? Explain briefly.
R/Splus Problems
Warning: Use the command source(“A:/rpack.txt”) to download
the programs. See Preface or Section 14.2. Typing the name of the
rpack function, eg trviews, will display the code for the function. Use the
args command, eg args(trviews), to display the needed arguments for the
function.
12.6. Use the following R/Splus commands to make 100 N3 (0, I3 ) cases
and 100 trivariate non-EC cases.
d) After all 10 plots have been looked at the output will show 10 estimated
predictors. The last estimate is the OLS (least squares) view and might look
like
Intercept X1 X2 X3
4.417988 22.468779 61.242178 75.284664
If the OLS view is a good estimated sufficient summary plot, then the
plot created from the command (leave out the intercept)
plot(n3x%*%c(22.469,61.242,75.285),n3x%*%1:3)
should cluster tightly about some line. Your linear combination will be dif-
ferent than the one used above. Using your OLS view, include the plot using
the command above (but with your linear combination) in Word. Was this
plot linear? Did some of the other trimmed views seem to be better than the
OLS view, that is did one of the trimmed views seem to have a smooth mean
function with a smaller variance function than the OLS view?
e) Now type the R/Splus command
lncy <- (ln3x%*%1:3)^3 + 0.1*rnorm(100).
Use the command trviews(ln3x,lncy) to find the best view with a smooth
mean function and the smallest variance function. This view should not be
the OLS view. Include your best view in Word.
f) Get the linear combination from your view, say (94.848, 216.719, 328.444)T ,
and obtain a plot with the command
plot(ln3x%*%c(94.848,216.719,328.444),ln3x%*%1:3).
Include the plot in Word. If the plot is linear with high correlation, then
your EY plot in e) should be good.
12.7. (At the beginning of your R/Splus session, use source(“A:/rpack.txt”)
command (and library(lqs) in R.))
a) Perform the commands
> nx <- matrix(rnorm(300),nrow=100,ncol=3)
> lnx <- exp(nx)
> SP <- lnx%*%1:3
> lnsincy <- sin(SP)/SP + 0.01*rnorm(100)
For parts b), c) and d) below, to make the best trimmed view with
trviews, ctrviews or lmsviews, you may need to use the function twice.
The first view trims 90% of the data, the next view trims 80%, etc. The last
view trims 0% and is the OLS view (or lmsreg view). Remember to advance
the view with the rightmost mouse button (and in R, highlight “stop”). Then
click on the plot and next simultaneously hit Ctrl and c. This makes a copy
of the plot. Then in Word, use the menu commands “Copy>paste.”
b) Find the best trimmed view with OLS and cov.mcd with the following
commands and include the view in Word.
> trviews(lnx,lnsincy)
(With trviews, suppose that 40% trimming gave the best view. Then
instead of using the procedure above b), you can use the command
> essp(lnx,lnsincy,M=40)
to make the best trimmed view. Then click on the plot and next simultane-
ously hit Ctrl and c. This makes a copy of the plot. Then in Word, use the
menu commands “Copy>paste”. Click the rightmost mouse button (and in
R, highlight “stop”) to return the command prompt.)
c) Find the best trimmed view with OLS and (x, S) using the following
commands and include the view in Word. See the paragraph above b).
> ctrviews(lnx,lnsincy)
d) Find the best trimmed view with lmsreg and cov.mcd using the fol-
lowing commands and include the view in Word. See the paragraph above
b).
> lmsviews(lnx,lnsincy)
b) Make sufficient summary plots similar to Figures 12.1 and 12.2 with
the following commands and include both plots in Word.
> plot(SP,ncuby)
> plot(-SP,ncuby)
c) Find the best trimmed view with the following commands (first type
library(lqs) if you are using R). Include the view in Word.
> trviews(nx,ncuby)
You may need to use the function twice. The first view trims 90% of the
data, the next view trims 80%, etc. The last view trims 0% and is the OLS
view. Remember to advance the view with the rightmost mouse button (and
in R, highlight “stop”). Suppose that 40% trimming gave the best view.
Then use the command
> essp(nx,ncuby,M=40)
to make the best trimmed view. Then click on the plot and next simultane-
ously hit Ctrl and c. This makes a copy of the plot. Then in Word, use the
menu commands “Copy>paste”.
d) To make a plot like Figure 12.5, use the following commands. Let
tem = β̂ obtained from the trviews output. In Example 12.3, tem can be
obtained with the following command.
13.1 Introduction
Generalized linear models are an important class of parametric 1D regres-
sion models that include multiple linear regression, logistic regression and
loglinear regression. Assume that there is a response variable Y and a k × 1
vector of nontrivial predictors x. Before defining a generalized linear model,
the definition of a one parameter exponential family is needed. Let q(y) be
a probability density function (pdf) if Y is a continuous random variable
and let q(y) be a probability mass function (pmf) if Y is a discrete random
variable. Assume that the support of the distribution of Y is Y and that the
parameter space of θ is Θ.
Definition 13.1. A family of pdf’s or pmf’s {q(y|θ) : θ ∈ Θ} is a
1-parameter exponential family if
q(y|θ) = k(θ)h(y) exp[w(θ)t(y)] (13.1)
where k(θ) ≥ 0 and h(y) ≥ 0. The functions h, k, t, and w are real valued
functions.
In the definition, it is crucial that k and w do not depend on y and that
h and t do not depend on θ. The parameterization is not unique since, for
example, w could be multiplied by a nonzero constant m if t is divided by
m. Many other parameterizations are possible. If h(y) = g(y)IY (y), then
usually k(θ) and g(y) are positive, so another parameterization is
q(y|θ) = exp[w(θ)t(y) + d(θ) + S(y)]IY (y) (13.2)
where S(y) = log(g(y)), d(θ) = log(k(θ)), and the support Y does not depend
on θ. Here the indicator function IY (y) = 1 if y ∈ Y and IY (y) = 0, otherwise.
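For example, the Poisson(θ) pmf with support Y = {0, 1, 2, ...} satisfies (13.1) since
q(y|θ) = exp(−θ) (1/y!) exp[log(θ) y]
with k(θ) = exp(−θ) ≥ 0, h(y) = 1/y! ≥ 0, w(θ) = log(θ) and t(y) = y.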
µ(xi ) = g −1 (α + βT xi ). (13.4)
and notice that the value of the parameter θ(xi ) = η(α + βT xi ) depends
on the value of xi . Since the model depends on x only through the linear
predictor α+β T x, a GLM is a 1D regression model. Thus the linear predictor
is also a sufficient predictor.
The following three sections illustrate three of the most important gener-
alized linear models. After selecting a GLM, the investigator will often want
to check whether the model is useful and to perform inference. Several things
to consider are listed below.
i) Show that the GLM provides a simple, useful approximation for the
relationship between the response variable Y and the predictors x.
ii) Estimate α and β using maximum likelihood estimators.
iii) Estimate µ(xi ) = di τ (xi ) or estimate τ (xi ) where the di are known
constants.
iv) Check for goodness of fit of the GLM with an estimated sufficient
summary plot.
v) Check for lack of fit of the GLM (eg with a residual plot).
vii) Check whether Y is independent of x; ie, check whether β = 0.
viii) Check whether a reduced model can be used instead of the full model.
ix) Use variable selection to find a good submodel.
x) Predict Yi given xi .
Yi ≡ Yi |xi = α + β T xi + ei
where ei ∼ N(0, σ 2 ).
When the predictor variables are continuous, the above model is called a
multiple linear regression (MLR) model. When the predictors are categorical,
the above model is called an analysis of variance (ANOVA) model, and when
the predictors are both continuous and categorical, the model is called an
MLR or analysis of covariance model. The MLR model is discussed in detail
in Chapter 5, where the normality assumption and the assumption that σ is
known can be relaxed.
[Figures: plots of SP versus Y, ESP versus Y, ESP versus the residuals, and ESP versus Y for the MLR model.]
P (success|xi ) = ρ(xi ) = exp(α + β^T xi )/[1 + exp(α + β^T xi )].    (13.5)
For the remainder of this section, assume that the binary re-
gression model is of interest. To see that the binary logistic regression
model is a GLM, assume that Y is a binomial(1, ρ) random variable. For a
one parameter family, take a(φ) ≡ 1. Then the pmf of Y is
p(y) = C(1, y) ρ^y (1 − ρ)^{1−y} = C(1, y) (1 − ρ) exp[log(ρ/(1 − ρ)) y]
where the binomial coefficient C(1, y) plays the role of h(y) ≥ 0, the factor (1 − ρ) plays the role of k(ρ) ≥ 0, and log(ρ/(1 − ρ)) is c(ρ).
g^{-1}(α + β^T x) = exp(α + β^T x)/[1 + exp(α + β^T x)] = ρ(x) = µ(x).
P (Y = 1|x) = ρ(x) = exp(α + β^T x)/[1 + exp(α + β^T x)]
where β = Σ^{-1}(µ1 − µ0 ) and
α = log(π1 /π2 ) − 0.5(µ1 − µ0 )^T Σ^{-1}(µ1 + µ0 ).
The discriminant function estimators α̂D and β̂D are found by replacing
the population quantities π1 , π2, µ1 , µ0 and Σ by sample quantities. The
logistic regression (maximum likelihood) estimator also tends to perform well
for this type of data. An exception is when the Y = 0 cases and Y = 1 cases
can be perfectly or nearly perfectly classified by the ESP. Let the logistic
regression ESP = α̂ + β̂^T x. Consider the ESS plot of the ESP versus Y . If
the Y = 0 values can be separated from the Y = 1 values by a vertical line (eg
ESP = 0), then there is perfect classification. (If only a few cases need to be
deleted in order for the data set to have perfect classification, then the amount
of “overlap” is small and there is nearly “perfect classification.”) In this
case the maximum likelihood estimator for the logistic regression parameters
(α, β) does not exist because the logistic curve can not approximate a step
function perfectly.
Using Definition 13.4 makes simulation of logistic regression data straight-
forward. Set π0 = π1 = 0.5, Σ = I, and µ0 = 0. Then α ≈ −0.5µT1 µ1
and β = µ1 . The artificial data set used in the following discussion used
β = (1, 1, 1, 0, 0)T and hence α = −1.5. Let Ni be the number of cases where
Y = i for i = 0, 1. For the artificial data, N0 = N1 = 100, and hence the
total sample size n = N1 + N0 = 200.
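A minimal R sketch of this simulation (illustrative only, not the code used to build the text's artificial data set):
set.seed(1)
x0 <- matrix(rnorm(500),nrow=100,ncol=5)   # Y = 0 cases: x ~ N5(0, I)
x1 <- matrix(rnorm(500),nrow=100,ncol=5)   # Y = 1 cases: x ~ N5(mu1, I)
x1[,1:3] <- x1[,1:3] + 1                   # mu1 = (1,1,1,0,0)^T
x <- rbind(x0,x1)
y <- rep(0:1,each=100)
out <- glm(y~x,family=binomial)   # MLE should be near alpha = -1.5, beta = (1,1,1,0,0)
coef(out)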
Again a sufficient summary plot of the sufficient predictor SP = α+βT xi
versus the response variable Yi with the mean function added as a visual aid
can be useful for describing the logistic regression (LR) model. The artificial
data described above was used because the plot can not be used for real data
since α and β are unknown.
Unlike the SSP for multiple linear regression where the mean function
is always the identity line, the mean function in the SSP for LR can take
a variety of shapes depending on the range of the SP. For the LR SSP, the
mean function is
ρ(SP ) = exp(SP )/[1 + exp(SP )].
If the SP = 0 then Y |SP ∼ binomial(1,0.5). If the SP = −5, then Y |SP ∼
binomial(1,ρ ≈ 0.007) while if the SP = 5, then Y |SP ∼ binomial(1,ρ ≈
0.993). Hence if the range of the SP is in the interval (−∞, −5) then the
mean function is flat and ρ(SP ) ≈ 0. If the range of the SP is in the interval
(5, ∞) then the mean function is again flat but ρ(SP ) ≈ 1. If −5 < SP < 0
then the mean function looks like a slide. If −1 < SP < 1 then the mean
function looks linear. If 0 < SP < 5 then the mean function first increases
rapidly and then less and less rapidly. Finally, if −5 < SP < 5 then the
mean function has the characteristic “ESS” shape shown in Figure 13.5.
[Figure 13.5: plot of SP versus Y for the LR artificial data. Figure 13.6: ESS plot (ESP versus Y) with the estimated mean function and slice proportions added.]
The estimated sufficient summary plot (ESSP or ESS plot) is a plot of
ESP = α̂ + β̂^T xi versus Yi with the estimated mean function
ρ̂(ESP ) = exp(ESP )/[1 + exp(ESP )]
added as a visual aid. The interpretation of the ESS plot is almost the same
as that of the SSP, but now the SP is estimated by the estimated sufficient
predictor (ESP).
This plot is very useful as a goodness of fit diagnostic. Divide the ESP into
J “slices” each containing approximately n/J cases. Compute the sample
mean = sample proportion of the Y ’s in each slice and add the resulting
step function to the ESS plot. This is done in Figure 13.6 with J = 10
slices. This step function is a simple nonparametric estimator of the mean
function ρ(SP ). If the step function follows the estimated LR mean function
(the logistic curve) closely, then the LR model fits the data well. The plot
of these two curves is a graphical approximation of the goodness of fit tests
described in Hosmer and Lemeshow (2000, p. 147–156).
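A minimal R sketch of this check, using a hypothetical helper (not an rpack function) that takes the ESP and the binary response as inputs:
essgof <- function(ESP, y, J = 10){
  # plot Y versus the ESP and add the estimated LR mean function
  plot(ESP, y)
  curve(exp(x)/(1 + exp(x)), add = T)
  # divide the ESP into J slices with about n/J cases each
  slice <- cut(rank(ESP), breaks = J, labels = F)
  prop <- tapply(y, slice, mean)     # sample proportion of Y's in each slice
  mid <- tapply(ESP, slice, median)  # slice centers
  lines(mid, prop, type = "s")       # nonparametric step function estimator
}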
The deviance test described in Section 13.5 is used to test whether β = 0,
and is the analog of the ANOVA F test for multiple linear regression. If
the LR model is a good approximation to the data but β = 0, then the
predictors x are not needed in the model and ρ̂(xi ) ≡ ρ̂ = Ȳ (the usual
univariate estimator of the success proportion) should be used instead of the
LR estimator
ρ̂(xi ) = exp(α̂ + β̂^T xi )/[1 + exp(α̂ + β̂^T xi )].
If the logistic curve clearly fits the step function better than the line Y = Ȳ ,
then Ho will be rejected, but if the line Y = Ȳ fits the step function about
as well as the logistic curve (which should only happen if the logistic curve
is linear with a small slope), then Y may be independent of the predictors.
Figure 13.7 shows the ESS plot when only X4 and X5 are used as predic-
tors for the artificial data, and Y is independent of these two predictors by
construction. It is possible to find data sets that look like Figure 13.7 where
the p–value for the deviance test is very small. Then the LR relationship
is statistically significant, but the investigator needs to decide whether the
relationship is practically significant.
For binary data the Yi only take two values, 0 and 1, and the residuals do
not behave very well. Instead of using residual plots, we suggest using the
binary response plot given in the following definition.
[Figure 13.7: plot of ESP versus Y when only X4 and X5 are used as predictors.]
Definition 13.5. (Cook 1996): Suppose that the binary response vari-
able Y is conditionally independent of x given the sufficient predictor SP =
α + β T x. Let V be a linear combination of the predictors that is (approx-
imately) uncorrelated with the estimated sufficient predictor ESP. Then a
binary response plot is a plot of the ESP versus V where different plot-
ting symbols are used for Y = 0 and Y = 1.
To make a binary response plot for logistic regression, sliced inverse re-
gression (SIR) can be used to find V . SIR is a regression graphics method
and the first SIR predictor β̂_SIR1^T x is used as the ESP while the second SIR
predictor β̂_SIR2^T x is used as V . (Other regression graphics methods, eg SAVE
or PHD, may provide a better plot, but the first SIR predictor is often highly
correlated with the LR ESP α̂ + β̂^T x.) After fitting SIR and LR, check that
|corr(SIRESP, LRESP)| ≥ 0.95.
If the LR model holds, then Y is independent of x given the SP, written
Y ⊥ x|SP.
If the absolute correlation is high, then this conditional independence is ap-
proximately true if the SP is replaced by either the SIR or LR ESP.
Figure 13.8: This Binary Response Plot Suggests That The Model Is OK
[Figure 13.9: binary response plot of SIRESP versus V when only X2 and X5 are in the model.]
Figure 13.8 shows the binary response plot for the artificial data. The
correlation between the SIR and LR ESP’s was near −1. Hence the slice
symbol density of +’s decreases from nearly 100% in the left of the plot to
0% in the right of the plot. The symbol density is mixed in most of the slices,
suggesting that the LR model is good. For contrast, Figure 13.9 shows the
binary response plot when only X2 and X5 are in the model. Consider the
slice where the ESP is between −2.4 and −1.7. At the bottom and top of
the slice the proportion of +’s is near 1 but in the middle of the slice there
are several 0’s. In the slice where the ESP is between −1.7 and −0.8, the
proportion of +’s increases as one moves from the bottom of the slice to the
top of the slice. Hence there is a large slice from about −2.4 to −0.8 where
the plot does not look good. Although this model is poor, the binary response
plot is inconclusive since only about 20% of the slices are bad. If the bad
slice went from −2.4 to 0.5, the LR model would be bad because more than
25% of the slices would be bad.
Yi ∼ Poisson(µ(xi )).
c(µ) = log(µ).
g^{-1}(α + β^T x) = exp(α + β^T x) = µ(x).
Yi ∼ Poisson(exp(α + β T xi ))
µ̂(ESP ) = exp(ESP )
[Figures: plots of SP versus Y and of ESP versus Y (Figure 13.11) for the artificial loglinear regression data.]
added as a visual aid. The interpretation of the EY plot is almost the same
as that of the SSP, but now the SP is estimated by the estimated sufficient
predictor (ESP).
This plot is very useful as a goodness of fit diagnostic. The lowess
curve is a nonparametric estimator of the mean function called a “scatterplot
smoother.” The lowess curve is represented as a jagged curve to distinguish
it from the estimated LLR mean function (the exponential curve) in Figure
13.11. If the lowess curve follows the exponential curve closely (except possi-
bly for the largest values of the ESP), then the LLR model may fit the data
well. A useful lack of fit plot is a plot of the ESP versus the deviance
residuals that are often available from the software.
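A minimal R sketch (with the response y and an n × k predictor matrix xmat assumed) of fitting the LLR model by maximum likelihood and making the EY plot and the lack of fit plot:
out <- glm(y ~ xmat, family = poisson)
ESP <- predict(out)                           # estimated sufficient predictor
plot(ESP, y)
curve(exp(x), add = T)                        # estimated LLR mean function
lines(lowess(ESP, y), lty = 2)                # lowess scatterplot smoother
plot(ESP, residuals(out, type = "deviance"))  # lack of fit plot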
Warning: For the majority of count data sets where the LLR mean
function is correct, the LLR model is not appropriate but the LLR MLE
is still a consistent estimator of β. The problem is that if Y ∼ P (µ), then
E(Y ) = VAR(Y ) = µ, but for the majority of data sets where E(Y |x) =
µ(x) = exp(SP ), it turns out that VAR(Y |x) > exp(SP ). This phenomenon
is called overdispersion. Adding parametric and nonparametric estimators
of the standard deviation function to the EY plot can be useful. See Cook and
Weisberg (1999a, p. 401-403). Alternatively, if the EY plot looks good and
G²/(n − k − 1) ≈ 1, then the LLR model is likely useful. If G²/(n − k − 1) >
1 + 3/√(n − k + 1), then a more complicated count model may be needed.
Here the deviance G2 is described in Section 13.5.
The deviance test described in Section 13.5 is used to test whether β = 0,
and is the analog of the ANOVA F test for multiple linear regression. If
the LLR model is a good approximation to the data but β = 0, then the
predictors x are not needed in the model and µ̂(xi ) ≡ µ̂ = Ȳ (the sample
mean) should be used instead of the LLR estimator
µ̂(xi ) = exp(α̂ + β̂^T xi ).
If the exponential curve clearly fits the lowess curve better than the line
Y = Ȳ , then Ho should be rejected, but if the line Y = Ȳ fits the lowess
curve about as well as the exponential curve (which should only happen if the
exponential curve is approximately linear with a small slope), then Y may be
independent of the predictors. Figure 13.12 shows the ESSP when only X4
and X5 are used as predictors for the artificial data, and Y is independent of
[Figure 13.12: plot of ESP versus Y when only X4 and X5 are used as predictors.]
these two predictors by construction. It is possible to find data sets that look
like Figure 13.12 where the p–value for the deviance test is very small. Then
the LLR relationship is statistically significant, but the investigator needs to
decide whether the relationship is practically significant.
Simple diagnostic plots for the loglinear regression model can be made
using weighted least squares (WLS). To see this, assume that all n of the
counts Yi are large. Then
log(Yi ) = log(µ(xi )) + log(Yi /µ(xi ))
or
log(Yi ) = α + β^T xi + ei
where
ei = log(Yi /µ(xi )).
The error ei does not have zero mean or constant variance, but if µ(xi ) is
large
(Yi − µ(xi ))/√µ(xi ) ≈ N(0, 1)
by the central limit theorem. Recall that log(1 + x) ≈ x for |x| < 0.1. Then,
heuristically,
ei = log([µ(xi ) + Yi − µ(xi )]/µ(xi )) ≈ (Yi − µ(xi ))/µ(xi ) ≈ [1/√µ(xi )] [(Yi − µ(xi ))/√µ(xi )] ≈ N(0, 1/µ(xi )).
This suggests that for large µ(xi ), the errors ei are approximately 0 mean
with variance 1/µ(xi ). If the µ(xi ) were known, and all of the Yi were large,
then a weighted least squares of log(Yi ) on xi with weights wi = µ(xi ) should
produce good estimates of (α, β). Since the µ(xi ) are unknown, the estimated
weights wi = Yi could be used. Since P (Yi = 0) > 0, the estimators given in
the following definition are used. Let Zi = Yi if Yi > 0, and let Zi = 0.5 if
Yi = 0.
Definition 13.7. The minimum chi–square estimator of the parameters (α, β) in a loglinear regression model is (α̂M , β̂M ), found from the weighted least squares regression of log(Zi ) on xi with weights wi = Zi . Equivalently, use the ordinary least squares (OLS) regression (without intercept) of √Zi log(Zi ) on √Zi (1, xi^T )^T .
The minimum chi–square estimator tends to be consistent if n is fixed
and all n counts Yi increase to ∞ while the loglinear regression maximum
likelihood estimator tends to be consistent if the sample size n → ∞. See
Agresti (2002, p. 611-612). However, the two estimators are often close for
many data sets. This result and the equivalence of the minimum chi–square
estimator to an OLS estimator suggest the following diagnostic plots. Let
(α̃, β̃) be an estimator of (α, β).
Definition 13.8. For a loglinear regression model, a weighted forward response plot is a plot of √Zi ESP = √Zi (α̃ + β̃^T xi ) versus √Zi log(Zi ). The weighted residual plot is a plot of √Zi (α̃ + β̃^T xi ) versus the WMLR residuals rW i = √Zi log(Zi ) − √Zi (α̃ + β̃^T xi ).
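A minimal R sketch (with the counts in y and the predictors in an n × k matrix xmat assumed) of the minimum chi–square estimator and the two weighted plots:
z <- ifelse(y > 0, y, 0.5)                    # Z_i = Y_i, or 0.5 if Y_i = 0
w <- sqrt(z)
u <- cbind(1, xmat)                           # (1, x_i^T)^T
out <- lsfit(w*u, w*log(z), intercept = F)    # OLS regression without intercept
wfit <- w*(u %*% out$coef)                    # sqrt(Z_i) times the ESP
plot(wfit, w*log(z))                          # weighted forward response plot
abline(0, 1)
plot(wfit, w*log(z) - wfit)                   # weighted residual plot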
If the loglinear regression model is appropriate and if the minimum chi–
square estimators are reasonable, then the plotted points in the weighted
forward response plot should follow the identity line. Cases with large WMLR
residuals may not be fit very well by the model. When the counts Yi are
[Figures: weighted forward response plots (WFIT and MWFIT versus √Z log(Z)) and weighted residual plots (WFIT versus WRES and MWFIT versus MWRES).]
13.5 Inference
This section gives a very brief discussion of inference for the logistic regression
(LR) and loglinear regression (LLR) models. Inference for these two models
is very similar to inference for the multiple linear regression (MLR) model.
For all three of these models, Y is independent of the k×1 vector of predictors
x = (x1 , ..., xk)T given the sufficient predictor α + β T x:
Y ⊥ x|α + β^T x.
Response = Y
Coefficient Estimates
Number of cases: n
Degrees of freedom: n - k - 1
Pearson X2:
Deviance: D = G^2
-------------------------------------
Binomial Regression
Kernel mean function = Logistic
Response = Status
Terms = (Bottom Left)
Trials = Ones
Coefficient Estimates
Label Estimate Std. Error Est/SE p-value
Constant -389.806 104.224 -3.740 0.0002
Bottom 2.26423 0.333233 6.795 0.0000
Left 2.83356 0.795601 3.562 0.0004
Scale factor: 1.
Number of cases: 200
Degrees of freedom: 197
Pearson X2: 179.809
Deviance: 99.169
that Ho should be rejected, a p–value between 0.01 and 0.07 provides moder-
ate evidence and a p–value less than 0.01 provides strong statistical evidence
that Ho should be rejected. Statistical evidence is not necessarily practical
evidence, and reporting the p–value along with a statement of the strength
of the evidence is more informative than stating that the p–value is less
than some chosen value such as δ = 0.05. Nevertheless, as a homework
convention, use δ = 0.05 if δ is not given.
Investigators also sometimes test whether a predictor Xj is needed in the
model given that the other k − 1 nontrivial predictors are in the model with
a 4 step Wald test of hypotheses:
i) State the hypotheses Ho: βj = 0 Ha: βj ≠ 0.
ii) Find the test statistic zo,j = β̂j /se(β̂j ) or obtain it from output.
iii) The p–value = 2P (Z < −|zoj |) = 2P (Z > |zoj |). Find the p–value from
output or use the standard normal table.
iv) State whether you reject Ho or fail to reject Ho and give a nontechnical
sentence restating your conclusion in terms of the story problem.
If Ho is rejected, then conclude that Xj is needed in the GLM model for
Y given that the other k − 1 predictors are in the model. If you fail to reject
Ho, then conclude that Xj is not needed in the GLM model for Y given that
the other k − 1 predictors are in the model. Note that Xj could be a very
useful GLM predictor, but may not be needed if other predictors are added
to the model.
The Wald confidence interval (CI) for βj can also be obtained from the
output: the large sample 100 (1 − δ) % CI for βj is β̂j ± z1−δ/2 se(β̂j ).
The Wald test and CI tend to give good results if the sample size n
is large. Here 1 − δ refers to the coverage of the CI. Recall that a 90%
CI uses z1−δ/2 = 1.645, a 95% CI uses z1−δ/2 = 1.96, and a 99% CI uses
z1−δ/2 = 2.576.
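A minimal R sketch (assuming a GLM fit out from glm) of the Wald test and CI for a single coefficient, computed from the usual coefficient table:
est <- coef(summary(out))      # columns: estimate, standard error, z value, p-value
j <- 2                         # row 1 is the constant, so row 2 is the first slope
zo <- est[j,1]/est[j,2]        # test statistic zo,j
pval <- 2*pnorm(-abs(zo))      # Wald p-value
ci95 <- est[j,1] + c(-1,1)*qnorm(0.975)*est[j,2]   # 95% Wald CI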
For a GLM, often 3 models are of interest: the full model that uses all k
of the predictors x^T = (x_R^T , x_O^T ), the reduced model that uses the r predic-
tors xR , and the saturated model that uses n parameters θ1, ..., θn where
n is the sample size. For the full model the k + 1 parameters α, β1, ..., βk are
estimated while the reduced model has r + 1 parameters. Let lSAT (θ1, ..., θn)
be the likelihood function for the saturated model and let lFULL(α, β) be the likelihood function for the full model. Let LSAT = log lSAT(θ̂1 , ..., θ̂n ) be the log likelihood function for the saturated model evaluated at the maximum likelihood estimator (MLE) (θ̂1 , ..., θ̂n ) and let LFULL = log lFULL(α̂, β̂) be the log likelihood function for the full model evaluated at the MLE (α̂, β̂). Then the deviance
D = G² = −2(LFULL − LSAT ).
the estimated sufficient summary plot has been made and that the logistic or
loglinear regression model fits the data well in that the nonparametric step or
lowess estimated mean function follows the estimated model mean function
closely. The deviance test is used to test whether β = 0. If this is the case,
then the predictors are not needed in the GLM model. If Ho : β = 0 is not
rejected, then for loglinear regression the estimator µ̂ = Y should be used
while for logistic regression
ρ̂ = (Σ_{i=1}^n Yi )/(Σ_{i=1}^n mi )
                         Total                      Change
Predictor    df                       Deviance      df    Deviance
Ones         n − 1 = dfo              G²o
X1           n − 2                                  1
X2           n − 3                                  1
...          ...                      ...           ...   ...
Xk           n − k − 1 = dfFULL       G²FULL        1
-----------------------------------------
Data set = cbrain, Name of Fit = B1
Response = sex
Terms = (cephalic size log[size])
Sequential Analysis of Deviance
Total Change
Predictor df Deviance | df Deviance
Ones 266 363.820 |
cephalic 265 363.605 | 1 0.214643
size 264 315.793 | 1 47.8121
log[size] 263 305.045 | 1 10.7484
SP = α + β1 x1 + · · · + βk xk = α + β^T x = α + β_R^T xR + β_O^T xO
where the reduced model uses r of the predictors used by the full model and
xO denotes the vector of k − r predictors that are in the full model but not
the reduced model. For logistic regression, the reduced model is Yi |xRi ∼
independent Binomial(mi, ρ(xRi )) while for loglinear regression the reduced
model is Yi |xRi ∼ independent Poisson(µ(xRi )) for i = 1, ..., n.
Assume that the ESS plot looks good. Then we want to test Ho : the
reduced model is good (can be used instead of the full model) versus HA :
use the full model (the full model is significantly better than the reduced
model). Fit the full model and the reduced model to get the deviances
G2F U LL and G2RED .
The 4 step change in deviance test is
i) Ho : the reduced model is good HA : use the full model
ii) test statistic G2 (R|F ) = G2RED − G2F U LL
iii) The p–value = P (χ² > G²(R|F )) where χ² ∼ χ²_{k−r} has a chi–square
distribution with k − r degrees of freedom. Note that k is the number of non-
trivial predictors in the full model while r is the number of nontrivial pre-
dictors in the reduced model. Also notice that k − r = (k + 1) − (r + 1) =
dfRED − dfF U LL = n − r − 1 − (n − k − 1).
iv) Reject Ho if the p–value < δ and conclude that the full model should
be used. If p–value ≥ δ, then fail to reject Ho and conclude that the reduced
model is good.
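A minimal R sketch (with illustrative predictors x1,...,x4 and a binary response y assumed) of the change in deviance test:
full <- glm(y ~ x1 + x2 + x3 + x4, family = binomial)
red <- glm(y ~ x1 + x2, family = binomial)
G2RF <- deviance(red) - deviance(full)                    # G^2(R|F)
pval <- 1 - pchisq(G2RF, df.residual(red) - df.residual(full))
# or equivalently: anova(red, full, test = "Chisq")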
Interpretation of coefficients: if x1, ..., xi−1, xi+1, ..., xk can be held fixed,
then increasing xi by 1 unit increases the sufficient predictor SP by βi units.
As a special case, consider logistic regression. Let ρ(x) = P (success|x) =
1 − P(failure|x) where a “success” is what is counted and a “failure” is what
is not counted (so if the Yi are binary, ρ(x) = P (Yi = 1|x)). Then the
estimated odds of success is
Ω̂(x) = ρ̂(x)/[1 − ρ̂(x)] = exp(α̂ + β̂^T x).
The full model will often contain factors and interactions. If w is a nominal
variable with J levels, make w into a factor by using J − 1 (indicator or)
dummy variables x1,w , ..., xJ−1,w in the full model. For example, let xi,w = 1 if
w is at its ith level, and let xi,w = 0, otherwise. An interaction is a product
of two or more predictor variables. Interactions are difficult to interpret.
Often interactions are included in the full model, and then the reduced model
without any interactions is tested. The investigator is often hoping that the
interactions are not needed.
A scatterplot of x versus Y is used to visualize the conditional distri-
bution of Y |x. A scatterplot matrix is an array of scatterplots and is used
to examine the marginal relationships of the predictors and response. Place
Y on the top or bottom of the scatterplot matrix. Variables with outliers,
missing values or strong nonlinearities may be so bad that they should not be
included in the full model. Suppose that all values of the variable x are posi-
tive. The log rule says add log(x) to the full model if max(xi )/ min(xi ) > 10.
For the binary logistic regression model, mark the plotted points by a 0 if
Y = 0 and by a + if Y = 1.
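A minimal R sketch (assuming a predictor matrix xmat with all positive entries and a binary response y) of the log rule check and a marked scatterplot matrix:
ratios <- apply(xmat, 2, function(u) max(u)/min(u))
which(ratios > 10)                    # predictors for which log(x) should be added
pairs(cbind(xmat, y), pch = ifelse(y == 1, "+", "0"))   # mark the binary response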
To make a full model, use the above discussion and then make an EY
plot to check that the full model is good. The number of predictors in the
full model should be much smaller than the number of data cases n. Suppose
that the Yi are binary for i = 1, ..., n. Let N1 = Σ Yi = the number of 1’s and
N0 = n−N1 = the number of 0’s. A rough rule of thumb is that the full model
should use no more than min(N0, N1 )/5 predictors and the final submodel
should have r predictor variables where r is small with r ≤ min(N0 , N1)/10.
For loglinear regression, a rough rule of thumb is that the full model should
use no more than n/5 predictors and the final submodel should use no more
than n/10 predictors.
Variable selection, also called subset or model selection, is the search for
a subset of predictor variables that can be deleted without important loss of
information. A model for variable selection for a GLM can be described by
where xI/S denotes the predictors in I that are not in S. Since this is true
regardless of the values of the predictors, βO = 0 if the set of predictors S is
a subset of I. Let (α̂, β̂) and (α̂I , β̂ I ) be the estimates of (α, β) obtained from
fitting the full model and the submodel, respectively. Denote the ESP from
the full model by ESP = α̂ + β̂^T xi and denote the ESP from the submodel
by ESP (I) = α̂I + β̂_I^T xIi .
Definition 13.10. An EE plot is a plot of ESP (I) versus ESP .
Variable selection is closely related to the change in deviance test for
a reduced model. You are seeking a subset I of the variables to keep in
the model. The AIC(I) statistic is used as an aid in backward elimination
and forward selection. The full model and the model Imin found with the
smallest AIC are always of interest. Also look for the model Il where the
AIC is the local minimum with the smallest number of nontrivial predictors
(say rl , so that deleting predictors from Il for forward selection or backward
elimination causes AIC to increase). Burnham and Anderson (2004) suggest
that if ∆(I) = AIC(I) − AIC(Imin), then models with ∆(I) ≤ 2 are good,
models with 4 ≤ ∆(I) ≤ 7 are borderline, and models with ∆(I) > 10 should
not be used as the final submodel. Create a full model. The full model has
a deviance at least as small as that of any submodel. The final submodel
should have an EE plot that clusters tightly about the identity line. As a
rough rule of thumb, a good submodel I has corr(ESP (I), ESP ) ≥ 0.95.
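A minimal R sketch (assuming glm fits full and sub for the full model and a candidate submodel I) of the EE plot and the correlation check:
ESP <- predict(full)           # ESP from the full model
ESPI <- predict(sub)           # ESP(I) from the submodel
cor(ESPI, ESP)                 # want at least 0.95
plot(ESPI, ESP)                # EE plot
abline(0, 1)                   # points should cluster about the identity line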
Backward elimination starts with the full model with k nontrivial vari-
ables, and the predictor that optimizes some criterion is deleted. Then there
are k − 1 variables left, and the predictor that optimizes some criterion is
deleted. This process continues for models with k − 2, k − 3, ..., 3 and 2
predictors.
Forward selection starts with the model with 0 variables, and the pre-
dictor that optimizes some criterion is added. Then there is 1 variable in
the model, and the predictor that optimizes some criterion is added. This
process continues for models with 2, 3, ..., k − 2 and k − 1 predictors. Both
forward selection and backward elimination result in an (often different) se-
quence of k models {x∗1}, {x∗1, x∗2}, ..., {x∗1, x∗2, ..., x∗k−1}, {x∗1, x∗2, ..., x∗k} = full
model.
All subsets variable selection can be performed with the following
procedure. Compute the ESP of the GLM and compute the OLS ESP found
by the OLS regression of Y on x. Check that |corr(ESP, OLS ESP)| ≥ 0.95.
This high correlation will exist for many data sets. Then perform multiple
linear regression and the corresponding all subsets OLS variable selection
with the Cp(I) criterion. If the sample size n is large and Cp (I) ≤ 2(r + 1)
the deviance the most. A decrease in deviance less than 4 (if the predictor has
1 degree of freedom) may be troubling in that a bad predictor may have been
added. In practice, the forward selection program may add the variable such
that the submodel I with j nontrivial predictors has a) the smallest AIC(I),
b) the smallest deviance G2 (I) or c) the smallest p–value (preferably from a
change in deviance test but possibly from a Wald test) in the test Ho: βi = 0
versus HA: βi ≠ 0 where the current model with j terms plus the predictor
xi is treated as the full model (for all variables xi not yet in the model).
Suppose that the full model is good and is stored in M1. Let M2, M3,
M4 and M5 be candidate submodels found after forward selection, backward
elimination, etc. Make a scatterplot matrix of the ESPs for M2, M3, M4,
M5 and M1. Good candidates should have estimated sufficient predictors
that are highly correlated with the full model estimated sufficient predictor
(the correlation should be at least 0.9 and preferably greater than 0.95). For
binary logistic regression, mark the symbols (0 and +) using the response
variable Y .
The final submodel should have few predictors, few variables with large
Wald p–values (0.01 to 0.05 is borderline), a good EY plot and an EE plot
that clusters tightly about the identity line. If a factor has I − 1 dummy
variables, either keep all I − 1 dummy variables or delete all I − 1 dummy
variables; do not delete just some of the dummy variables.
13.7 Complements
GLMs were introduced by Nelder and Wedderburn (1972). Books on gen-
eralized linear models (in roughly decreasing order of difficulty) include Mc-
Cullagh and Nelder (1989), Fahrmeir and Tutz (2001), Myers, Montgomery
and Vining (2002) and Dobson (2001). Cook and Weisberg (1999, ch.’s 21-
23) also has an excellent discussion. Texts on categorical data analysis that
have useful discussions of GLM’s include Agresti (2002), Le (1998), Lindsey
(2004), Simonoff (2003) and Powers and Xie (2000) who give econometric
applications. Collett (1999) and Hosmer and Lemeshow (2000) are excellent
texts on logistic regression. See Christensen (1997) for a Bayesian approach
and see Cramer (2003) for econometric applications. Cameron and Trivedi
(1998) and Winkelmann (2000) cover Poisson regression.
The EY plots are also called model checking plots. See Cook and Weisberg
(1997, 1999a, p. 397, 514, and 541). Cook (1996) introduced the binary
response plot. Also see Cook (1998a, Ch. 5) and Cook and Weisberg (1999a,
section 22.2). Olive and Hawkins (2005) discuss variable selection.
Barndorff-Nielsen (1982) is a very readable discussion of exponential fami-
lies. Also see the webpage (http://www.math.siu.edu/olive/infer.htm). Many
of the distributions in Chapter 3 belong to a 1-parameter exponential family.
A possible method for resistant binary regression is to use trimmed views
but make the ESS plot. This method would work best if x came from an
elliptically contoured distribution. Another possibility is to substitute robust
estimators for the classical estimators in the discrimination estimator.
13.8 Problems
PROBLEMS WITH AN ASTERISK * ARE USEFUL.
13.2∗. Now the data is as in Problem 13.1, but try to estimate the pro-
portion of males by measuring the circumference and the length of the head.
Use the above logistic regression output to answer the following problems.
a) Predict ρ̂(x) if circumference = x1 = 550.0 and length = x2 = 200.0.
b) Perform the 4 step Wald test for Ho : β1 = 0.
c) Perform the 4 step Wald test for Ho : β2 = 0.
13.3∗. A museum has 60 skulls of apes and humans. Lengths of the lower
jaw, upper jaw and face are the explanatory variables. The response variable
is ape (= 1 if ape, 0 if human). Using the output on the previous page,
perform the four step deviance test for whether there is a LR relationship
between the response variable and the predictors.
Number of cases: 60
Degrees of freedom: 56
Pearson X2: 16.782
Deviance: 13.532
Reduced Model
Response = ape
Coefficient Estimates
Label Estimate Std. Error Est/SE p-value
Constant 8.71977 4.09466 2.130 0.0332
lower jaw -0.376256 0.115757 -3.250 0.0012
upper jaw 0.295507 0.0950855 3.108 0.0019
Number of cases: 60
Degrees of freedom: 57
Pearson X2: 28.049
Deviance: 17.185
13.4∗. Suppose the full model is as in Problem 13.3, but the reduced
model omits the predictor face length. Perform the 4 step change in deviance
test to examine whether the reduced model can be used.
The following three problems use the possums data from Cook and Weis-
berg (1999a).
13.6∗. Perform the 4 step deviance test for the same model as in Problem
13.5 using the output above.
13.7∗. Let the reduced model be as in Problem 13.5 and use the output
for the full model be shown above. Perform a 4 step change in deviance test.
B1 B2 B3 B4
df 945 956 968 974
# of predictors 54 43 31 25
# with 0.01 ≤ Wald p-value ≤ 0.05 5 3 2 1
# with Wald p-value > 0.05 8 4 1 0
G²                                    892.96  902.14  929.81  956.92
AIC 1002.96 990.14 993.81 1008.912
corr(B1:ETA’U,Bi:ETA’U) 1.0 0.99 0.95 0.90
p-value for change in deviance test 1.0 0.605 0.034 0.0002
13.8∗. The above table gives summary statistics for 4 models considered
as final submodels after performing variable selection. (Several of the predic-
tors were factors, and a factor was considered to have a bad Wald p-value >
0.05 if all of the dummy variables corresponding to the factor had p-values >
0.05. Similarly the factor was considered to have a borderline p-value with
0.01 ≤ p-value ≤ 0.05 if none of the dummy variables corresponding to the
factor had a p-value < 0.01 but at least one dummy variable had a p-value
between 0.01 and 0.05.) The response was binary and logistic regression was
used. The ESS plot for the full model B1 was good. Model B2 was the
minimum AIC model found. There were 1000 cases: for the response, 300
were 0’s and 700 were 1’s.
a) For the change in deviance test, if the p-value ≥ 0.07, there is little
evidence that Ho should be rejected. If 0.01 ≤ p-value < 0.07 then there is
moderate evidence that Ho should be rejected. If p-value < 0.01 then there
is strong evidence that Ho should be rejected. For which models, if any, is
there strong evidence that “Ho: reduced model is good” should be rejected?
b) For which plot is “corr(B1:ETA’U,Bi:ETA’U)” (using notation from
Arc) relevant?
c) Which model should be used as the final submodel? Explain briefly
why each of the other 3 submodels should not be used.
Arc Problems
The following four problems use data sets from Cook and Weisberg (1999a).
move the slider bar to 1. Move the lowess slider bar until the lowess curve
tracks the exponential curve well. Include the EY plot in Word.
e) Deviance test. From the P2 menu, select Examine submodels and click
on OK. Include the output in Word and perform the 4 step deviance test.
f) Perform the 4 step change of deviance test.
g) EE plot. From Graph&Fit select Plot of. Select P2:Eta’U for the H
box and P1:Eta’U for the V box. Move the OLS slider bar to 1. Click on
the Options popup menu and type “y=x”. Include the plot in Word. Is the
plot linear?
13.12∗. In this problem you will find a good submodel for the possums
data.
a) Activate possums.lsp in Arc with the menu commands
“File > Load > Data > Arcg> possums.lsp.” Scroll up the screen to read
the data description.
b) From Graph&Fit select Fit Poisson response. Select y as the response
and select Acacia, bark, habitat, shrubs, stags and stumps as the predictors.
In Problem 13.11, you showed that this was a good full model.
c) Using what you have learned in class find a good submodel and include
the relevant output in Word.
(Hints: Use forward selection and backward elimination and find a model
that discards a lot of predictors but still has a deviance close to that of the full
model. Also look at the model with the smallest AIC. Either of these models
could be your initial candidate model. Fit this candidate model and look
at the Wald test p–values. Try to eliminate predictors with large p–values
but make sure that the deviance does not increase too much. You may have
several models, say P2, P3, P4 and P5 to look at. Make a scatterplot matrix
of the Pi:ETA’U from these models and from the full model P1. Make the
EE and ESS plots for each model. The correlation in the EE plot should
be at least 0.9 and preferably greater than 0.95. As a very rough guide for
Poisson regression, the number of predictors in the full model should be less
than n/5 and the number of predictors in the final submodel should be less
than n/10.) CONTINUED
d) Make an EY plot for your final submodel, say P2. From Graph&Fit
select Plot of. Select P2:Eta’U for the H box and y for the V box. From
the OLS popup menu select Poisson and move the slider bar to 1. Move
the lowess slider bar until the lowess curve tracks the exponential curve well.
Include the EY plot in Word.
e) Suppose that P1 contains your full model and P2 contains your final
submodel. Make an EE plot for your final submodel: from Graph&Fit select
Plot of. Select P1:Eta’U for the V box and P2:Eta’U, for the H box. After
the plot appears, click on the options popup menu. A window will appear.
Type y = x and click on OK. This action adds the identity line to the plot.
Also move the OLS slider bar to 1. Include the plot in Word.
f) Using c), d), e) and any additional output that you desire (eg AIC(full),
AIC(min) and AIC(final submodel)), explain why your final submodel is good.
Warning: The following problems use data from the book’s web-
page. Save the data files on a disk. Get in Arc and use the menu com-
mands “File > Load” and a window with a Look in box will appear. Click
on the black triangle and then on 3 1/2 Floppy(A:). Then click twice on the
data set name.
13.13∗. (ESS Plot): Activate cbrain.lsp in Arc with the menu commands
“File > Load > 3 1/2 Floppy(A:) > cbrain.lsp.” Scroll up the screen to read
the data description. From Graph&Fit select Fit binomial response. Select
brnweight, cephalic, breadth, cause, size, and headht as predictors, sex as the
response and ones as the number of trials. Perform the logistic regression
and from Graph&Fit select Plot of. Place sex on V and B1:Eta’U on H. From
the OLS popup menu, select Logistic and move the slider bar to 1. From the
lowess popup menu select SliceSmooth and move the slider bar until the fit is
good. Include your plot in Word. Are the slice means (observed proportions)
tracking the logistic curve (fitted proportions) very well?
13.14∗. Suppose that you are given a data set, told the response, and
asked to build a logistic regression model with no further help. In this prob-
lem, we use the cbrain data to illustrate the process.
a) Activate cbrain.lsp in Arc with the menu commands
“File > Load > 3 1/2 Floppy(A:) > cbrain.lsp.” Scroll up the screen to read
c) From Graph&Fit select Fit binomial response. Select STA as the re-
sponse and ones as the number of trials. The full model will use every
predictor except ID, LOC and RACE (the latter 2 are replaced by their fac-
tors): select AGE, Bic, CAN, CPR, CRE, CRN, FRA, HRA, INF, {F}LOC ,
PCO, PH, PO2 , PRE , {F}RACE, SER, SEX, SYS and TYP as predictors.
Perform the logistic regression and include the relevant output for testing in
Word. CONTINUED
d) Make the ESS plot for the full model: from Graph&Fit select Plot of.
Place STA on V and B1:Eta’U on H. From the OLS popup menu, select
Logistic and move the slider bar to 1. From the lowess popup menu select
SliceSmooth and move the slider bar until the fit is good. Include your plot
in Word. Is the full model good?
e) Using what you have learned in class find a good submodel and include
the relevant output in Word. (Hints: Use forward selection and backward
elimination and find a model that discards a lot of predictors but still has
a deviance close to that of the full model. Also look at the model with
the smallest AIC. Either of these models could be your initial candidate
model. Fit this candidate model and look at the Wald test p–values. Try
to eliminate predictors with large p–values but make sure that the deviance
does not increase too much. WARNING: do not delete part of a factor.
Either keep all 2 factor dummy variables or delete all I-1=2 factor dummy
variables. You may have several models, say B2, B3, B4 and B5 to look at.
Make the EE and ESS plots for each model. WARNING: if a factor is in
the full model but not the reduced model, then the EE plot may have I = 3
lines. See part h) below.
f) Make an ESS plot for your final submodel.
g) Suppose that B1 contains your full model and B5 contains your final
submodel. Make an EE plot for your final submodel: from Graph&Fit select
Plot of. Select B1:Eta’U for the V box and B5:Eta’U, for the H box. After
the plot appears, click on the options popup menu. A window will appear.
Type y = x and click on OK. This action adds the identity line to the plot.
Include the plot in Word.
If the EE plot is good and there are one or more factors in the full model
that are not in the final submodel, then the bulk of the data will cluster
tightly about the identity line, but some points may be far away from the
identity line (often lying on some other line) due to the deleted factors.
h) Using e), f), g) and any additional output that you desire (eg AIC(full),
AIC(min) and AIC(final submodel)), explain why your final submodel is good.
13.16. In this problem you will examine the museum skull data.
a) Activate museum.lsp in Arc with the menu commands
“File > Load > 3 1/2 Floppy(A:) > museum.lsp.” Scroll up the screen to
read the data description.
b) From Graph&Fit select Fit binomial response. Select ape as the re-
sponse and ones as the number of trials. Select x5 as the predictor. Perform
the logistic regression and include the relevant output for testing in Word.
c) Make the ESS plot and place it in Word (the response variable is ape
not y). Is the LR model good?
Now you will examine logistic regression when there is perfect classifica-
tion of the sample response variables. Assume that the model used in d)–h)
is in menu B2.
d) From Graph&Fit select Fit binomial response. Select ape as the re-
sponse and ones as the number of trials. Select x3 as the predictor. Perform
the logistic regression and include the relevant output for testing in Word.
e) Make the ESS plot and place it in Word (the response variable is ape
not y). Is the LR model good?
f) Perform the Wald test for Ho : β = 0.
g) From B2 select Examine submodels and include the output in Word.
Then use the output to perform a 4 step deviance test on the submodel used
in part d).
h) The tests in f) and g) are both testing Ho : β = 0 but give different
results. Why are the results different and which test is correct?
13.17. In this problem you will find a good submodel for the credit data
from Fahrmeir and Tutz (2001).
a) Activate credit.lsp in Arc with the menu commands
“File > Load > Floppy(A:) > credit.lsp.” Scroll up the screen to read the
data description. This is a big data set and computations may take several
minutes.
b) Use the menu commands “credit>Make factors” and select x1, x3 , x4, x6,
x7, x8 , x9, x10, x11, x12, x14, x15, x16, and x17. Then click on OK.
c) From Graph&Fit select Fit binomial response. Select y as the response
and ones as the number of trials. Select {F}x1, x2 , {F}x3, {F}x4, x5, {F}x6,
{F}x7, {F}x8, {F}x9, {F}x10, {F}x11, {F}x12, x13, {F}x14, {F}x15, {F}x16,
{F}x17, x18, x19 and x20 as predictors. Perform the logistic regression and
include the relevant output for testing in Word. You should get 1000 cases,
df = 945, and a deviance of 892.957.
d) Make the ESS plot for the full model: from Graph&Fit select Plot
of. Place y on V and B1:Eta’U on H. From the OLS popup menu, select
Logistic and move the slider bar to 1. From the lowess popup menu select
SliceSmooth and move the slider bar until the fit is good. Include your plot
in Word. Is the full model good?
e) Using what you have learned in class find a good submodel and include
the relevant output in Word. (Hints: Use forward selection and backward
elimination and find a model that discards a lot of predictors but still has
a deviance close to that of the full model. Also look at the model with
the smallest AIC. Either of these models could be your initial candidate
model. Fit this candidate model and look at the Wald test p–values. Try
to eliminate predictors with large p–values but make sure that the deviance
does not increase too much. WARNING: do not delete part of a factor.
Either keep all 2 factor dummy variables or delete all I-1=2 factor dummy
variables. You may have several models, say B2, B3, B4 and B5 to look at.
Make the EE and ESS plots for each model. WARNING: if a factor is in
the full model but not the reduced model, then the EE plot may have I = 3
lines. See part h) below.
f) Make an ESS plot for your final submodel.
g) Suppose that B1 contains your full model and B5 contains your final
submodel. Make an EE plot for your final submodel: from Graph&Fit select
Plot of. Select B1:Eta’U for the V box and B5:Eta’U, for the H box. Place
y in the Mark by box. After the plot appears, click on the options popup
menu. A window will appear. Type y = x and click on OK. This action adds
the identity line to the plot. Also move the OLS slider bar to 1. Include the
plot in Word.
h) Using e), f), g) and any additional output that you desire (eg AIC(full),
AIC(min) and AIC(final submodel)), explain why your final submodel is good.
13.18∗. a) This problem uses a data set from Myers, Montgomery and
Vining (2002). Activate popcorn.lsp in Arc with the menu commands
“File > Load > Floppy(A:) > popcorn.lsp.” Scroll up the screen to read the
data description. From Graph&Fit select Fit Poisson response. Use oil, temp
and time as the predictors and y as the response. From Graph&Fit select
Plot of. Select P1:Eta’U for the H box and y for the V box. From the OLS
popup menu select Poisson and move the slider bar to 1. Move the lowess
slider bar until the lowess curve tracks the exponential curve. Include the
EY plot in Word.
b) From the P1 menu select Examine submodels, click on OK and include
the output in Word.
c) Test whether β1 = β2 = β3 = 0.
d) From the popcorn menu, select Transform and select y. Put 1/2 in the
p box and click on OK. From the popcorn menu, select Add a variate and type
yt = sqrt(y)*log(y) in the resulting window. Repeat three times adding the
variates oilt = sqrt(y)*oil, tempt = sqrt(y)*temp and timet = sqrt(y)*time.
From Graph&Fit select Fit linear LS and choose y 1/2, oilt, tempt and timet
as the predictors, yt as the response and click on the Fit intercept box to
remove the check. Then click on OK. From Graph&Fit select Plot of. Select
L2:Fit-Values for the H box and yt for the V box. A plot should appear.
Click on the Options menu and type y = x to add the identity line. Include
the weighted forward response plot in Word.
e) From Graph&Fit select Plot of. Select L2:Fit-Values for the H box and
L2:Residuals for the V box. Include the weighted residual response plot in
Word.
f) For the plot in e), highlight the case in the upper right corner of the
plot by using the mouse to move the arrow just above and to the left of the
case. Then hold the rightmost mouse button down and move the mouse to
the right and down. From the Case deletions menu select Delete selection
from data set, then from Graph&Fit select Fit Poisson response. Use
oil, temp and time as the predictors and y as the response. From Graph&Fit
select Plot of. Select P3:Eta’U for the H box and y for the V box. From
the OLS popup menu select Poisson and move the slider bar to 1. Move the
lowess slider bar until the lowess curve tracks the exponential curve. Include
the EY plot in Word. CONTINUED
d) Get into SAS, and from the top menu, use the “File> Open” com-
mand. A window will open. Use the arrow in the NE corner of the window
to navigate to “3 1/2 Floppy(A:)”. (As you click on the arrow, you should see
My Documents, C: etc, then 3 1/2 Floppy(A:).) Double click on h10d2.sas.
(Alternatively cut and paste the program into the SAS editor window.) To
execute the program, use the top menu commands “Run>Submit”. An out-
put window will appear if successful. Warning: if you do not have the
two files on A drive, then you need to change the infile command in
h10d2.sas to the drive that you are using, eg change infile “a:cbrain.dat”;
to infile “f:cbrain.dat”; if you are using F drive.
e) To copy and paste relevant output into Word, click on the output
window and use the top menu commands “Edit>Select All” and then the
menu commands “Edit>Copy”.
The model should be good if C(p) ≤ 2k where k = “number in model.”
The only SAS output for this problem that should be included
in Word are two header lines (Number in model, R-square, C(p), Variables
in Model) and the first line with Number in Model = 6 and C(p) = 7.0947.
You may want to copy all of the SAS output into Notepad, and then cut and
paste the relevant two lines of output into Word.
f) Activate cbrain.lsp in Arc with the menu commands
“File > Load > Data > mdata > cbrain.lsp.” From Graph&Fit select Fit
binomial response. Select age = X2, breadth = X6, cephalic = X10, circum
= X9, headht = X4, height = X3, length = X5 and size = X7 as predictors,
sex as the response and ones as the number of trials. This is the full logistic
regression model. Include the relevant output in Word. (A better full model
was used in Problem 13.14.)
g) ESS plot. From Graph&Fit select Plot of. Place sex on V and B1:Eta’U
on H. From the OLS popup menu, select Logistic and move the slider bar
to 1. From the lowess popup menu select SliceSmooth and move the slider
bar until the fit is good. Include your plot in Word. Are the slice means
(observed proportions) tracking the logistic curve (fitted proportions) fairly
well? OVER
h) From Graph&Fit select Fit binomial response. Select breadth = X6,
cephalic = X10, circum = X9, headht = X4, height = X3, and size = X7 as
predictors, sex as the response and ones as the number of trials. This is the
(http://www.stats.gla.ac.uk/cti/links_stats/journals.html)
(http://www.uni-koeln.de/themen/Statistik/journals.html)
(http://www.stat.uiuc.edu/~he/jourlist.html)
Websites for researchers or research groups can be very useful. Below are
websites for Dr. Rousseeuw’s group, Dr. He, Dr. Rocke, Dr. Croux, Dr.
Hubert’s group and for the University of Minnesota.
(http://www.agoras.ua.ac.be/)
(http://www.stat.uiuc.edu/~he/index.html)
(http://handel.cipic.ucdavis.edu/~dmrocke/preprints.html)
(http://www.econ.kuleuven.ac.be/public/NDBAE06/)
(http://www.wis.kuleuven.ac.be/stat/robust.html)
(http://www.stat.umn.edu)
The latter website has useful links to software. Arc and R can be down-
loaded from these links. Familiarity with a high level programming
language such as FORTRAN or R/Splus is essential. A very useful R link
is (http://www.r-project.org/#doc).
Finally, a Ph.D. student needs an advisor or mentor and most researchers
will find collaboration valuable. Attending conferences and making your
research available over the internet can lead to contacts.
Some references on research, including technical writing and presenta-
tions, include Becker and Keller-McNulty (1996), Ehrenberg (1982), Hamada
and Sitter (2004), Rubin (2004) and Smith (1997).
> source("A:/rpack.txt")
The first term in the plot command is always the horizontal axis while the
second is on the vertical axis.
To put a graph in Word, hold down the Ctrl and c keys simultane-
ously. Then select “paste” from the Word Edit menu.
To enter data, open a data set in Notepad or Word. You need to know
the number of rows and the number of columns. Assume that each case is
entered in a row. For example, assuming that the file cyp.lsp has been saved
on your disk from the webpage for this book, open cyp.lsp in Word. It has
76 rows and 8 columns. In R or Splus, write the following command.
cyp <- matrix(scan(),nrow=76,ncol=8,byrow=T)
Then copy the data lines from Word and paste them in R/Splus. If a cursor
does not appear, hit enter. The command dim(cyp) will show if you have
entered the data correctly.
Enter the following commands
cypy <- cyp[,2]
cypx<- cyp[,-c(1,2)]
lsfit(cypx,cypy)$coef
to produce the output below.
Intercept X1 X2 X3 X4
205.40825985 0.94653718 0.17514405 0.23415181 0.75927197
X5 X6
-0.05318671 -0.30944144
To check that the data is entered correctly, fit LS in Arc with the re-
sponse variable height and the predictors sternal height, finger to ground,
head length, nasal length, bigonal breadth, and cephalic index (entered in
that order). You should get the same coefficients given by R or Splus.
Making functions in R and Splus is easy.
For example, type the following commands.
mysquare <- function(x){
# this function squares x
r <- x^2
r }
The second line in the function shows how to put comments into functions.
The following commands are useful for a scatterplot created by the com-
mand plot(x,y).
lines(x,y), lines(lowess(x,y,f=.2))
identify(x,y)
abline(out$coef ), abline(0,1)
The usual arithmetic operators are 2 + 4, 3 − 7, 8 ∗ 4, 8/4, and
2^{10}.
> source("A:/robdata.txt")
> lsfit(belx,bely)
will perform the least squares regression for the Belgian telephone data.
Transferring Data to and from Arc and R or Splus.
For example, suppose that the Belgian telephone data (Rousseeuw and Leroy
1987, p. 26) has the predictor year stored in x and the response number of
calls stored in y in R or Splus. Combine the data into a matrix z and then
use the write.table command to display the data set as shown below. The
sep=" " option separates the entries in each row by a space.
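A minimal sketch of these commands in R is given below. The quote and
col.names options are assumptions chosen so that the display matches the
rows shown next; the exact call used for the text may differ.
z <- cbind(x,y)                               # combine year and number of calls
write.table(z, quote=F, sep=" ", col.names=F) # the row names give the case numbers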
17 66 14.2
18 67 15.9
19 68 18.2
20 69 21.2
21 70 4.3
22 71 2.4
23 72 2.7073
24 73 2.9
To enter a data set into Arc, use the following template new.lsp.
dataset=new
begin description
Artificial data.
Contributed by David Olive.
end description
begin variables
col 0 = x1
col 1 = x2
col 2 = x3
col 3 = y
end variables
begin data
Next open new.lsp in Notepad. (Or use the vi editor in Unix. Sophisti-
cated editors like Word will often work, but they sometimes add things like
page breaks that do not allow the statistics software to use the file.) Then
copy the data lines from R/Splus and paste them below new.lsp. Then mod-
ify the file new.lsp and save it on a disk as the file belg.lsp. (Or save it in
mdata where mdata is a data folder added within the Arc data folder.) The
header of the new file belg.lsp is shown below.
dataset=belgium
begin description
Belgium telephone data from
Rousseeuw and Leroy (1987, p. 26)
end description
begin variables
col 0 = case
col 1 = x = year
col 2 = y = number of calls in tens of millions
end variables
begin data
1 50 0.44
. . .
. . .
. . .
24 73 2.9
The file above also shows the first and last lines of data. The header file
needs a data set name, description, variable list and a begin data command.
Often the description can be copied and pasted from the source of the data,
eg from the STATLIB website. Note that the first variable starts with col 0.
To transfer a data set from Arc to R or Splus, select the item
“Display data” from the dataset’s menu. Select the variables you want to
save, and then push the button for “Save in R/Splus format.” You will be
prompted to give a file name. If you select bodfat, then two files bodfat.txt and
bodfat.Rd will be created. The file bodfat.txt can be read into either R or Splus
using the read.table command. The file bodfat.Rd saves the documentation
about the data set in a standard format for R.
As an example, commands like those sketched below were used to enter the
body fat data into Splus. (The mdata folder does not come with Arc. The folder
needs to be created and filled with files from the book’s website. Then the
file bodfat.txt can be stored in the mdata folder.)
The last column of the body fat data consists of the case numbers which
start with 0 in Arc. The second line adds one to each case number.
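A sketch of what the two commands might look like is given below; the path
and the assumption that the case numbers are in the last column should be
adjusted to your installation.
bodfat <- read.table("C:\\ARC\\DATA\\mdata\\bodfat.txt",header=T)
bodfat[,ncol(bodfat)] <- bodfat[,ncol(bodfat)] + 1  # change case numbers 0,1,... to 1,2,...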
As another example, use the menu commands
“File>Load>Data>Arcg>forbes.lsp” to activate the forbes data set. From
the Forbes menu, select Display Data. A window will appear. Double click
on Temp and Pressure. Click on Save Data in R/Splus Format and save as
forbes.txt in the folder mdata.
Enter Splus and type the following command.
forbes<-read.table("C:\\ARC\\DATA\\ARCG\\FORBES.TXT",header=T)
After downloading the zip file for the dr package, unzip it into the folder
C:\unzipped.
The file
C:\unzipped\dr
contains a folder dr which is the R library. Cut this folder and paste it into
the R library folder. (On my computer, I store the folder rw1011 in the folder
C:\Temp.
The folder
C:\Temp\rw1011\library
contains the library packages that came with R.) Open R and type the fol-
lowing command.
library(dr)
Next type help(dr) to make sure that the library is available for use.
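Once the library is available, a call might look like the minimal sketch
below, where the data frame mydat, the variable names and the choice
method="sir" are illustrative; see help(dr) for the actual arguments.
library(dr)
out <- dr(y~x1+x2+x3, data=mydat, method="sir")  # sliced inverse regression
summary(out)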
14.3 Projects
Straightforward Projects
• Application 2.2 suggests using Un = n − Ln where Ln = ⌊n/2⌋ − ⌈√(n/4)⌉
and SE(MED(n)) = (Y(Un) − Y(Ln+1))/2.
• Read Stigler (1977). This paper suggests a method for comparing new
estimators. Use this method with the two stage estimators TS,n and
TA,n described in Section 2.6.
• Read Anscombe (1961) and Anscombe and Tukey (1963). These papers
suggest graphical methods for checking multiple linear regression and
experimental design methods that were the “state of the art” at the
time. What graphical procedures did they use and what are the most
important procedures that were not suggested?
• Read Bentler and Yuan (1998) and Cattell (1966). These papers use
scree plots to determine how many eigenvalues of the covariance ma-
trix are nonzero. This topic is very important for dimension reduction
methods such as principal components.
• The simulation study in Section 4.6 suggests that TS,n does not work
well on exponential data. Find a coarse grid so that TS,n works well
on normal and exponential data. Illustrate with a simulation study.
• Examine via simulation how the graphical method for assessing variable
selection complements numerical methods. Find at least two data sets
where deleting one case changes the model selected by a numerical
variable selection method such as Cp .
• Find some benchmark multiple linear regression outlier data sets. Fit
OLS, L1 and M-estimators from R/Splus. Are any of the M-estimators
as good as L1 ? (Note: l1fit is in Splus but not in R.)
• Compare lmsreg and the MBA estimator on real and simulated mul-
tiple linear regression data.
• Find some benchmark multiple linear regression outlier data sets. Fit
robust estimators such as ltsreg from R/Splus, but do not use lmsreg.
Are any of the robust estimators as good as the MBA estimator?
• There are several papers that give tests or diagnostics for linearity.
Find a data set from such a paper and find the fitted values from some
nonparametric method. Plot these fitted values versus the fitted values
from a multiple linear regression such as OLS. What should this plot
look like? How can the forward response plot and trimmed views be
used as a diagnostic for linearity? See Hawkins and Olive (2002, p.
158).
• Using ESP to Search for the Missing Link: Compare trimmed views
which uses OLS and cov.mcd with another regression–MLD combo.
There are 8 possible projects: i) OLS–MBA, ii) OLS–Classical (use
ctrviews), iii) SIR–cov.mcd (sirviews), iv) SIR–MBA, v) SIR–class-
ical, vi) lmsreg–cov.mcd (lmsviews), vii) lmsreg–MBA, and viii) lmsreg
–classical. Do Problem 12.7ac (but just copy and paste the best view
instead of using the essp(nx,ncuby,M=40) command) with both your
estimator and trimmed views. Try to see what types of functions
work for both estimators, when trimmed views is better and when the
procedure i)–viii) is better. If you can invent interesting 1D functions,
do so.
Harder Projects
• The Super Duper Outlier Scooper for MLR: Write R/Splus functions
to compute the two estimators given by Theorem 8.7. Compare these
estimators with lmsreg and ltsreg on real and simulated data.
• The Super Duper Outlier Scooper for Multivariate Location and Disper-
sion: Consider the modified MBA estimator for multivariate location
and dispersion given in Problem 10.17. This MBA estimator uses 8
starts using 0%, 50%, 60%, 70%, 80%, 90%, 95% and 98% trimming of
the cases closest to the coordinatewise median in Euclidean distance.
The estimator is √n consistent on elliptically contoured distributions
with 2nd moments. For small data sets the cmba2 function can fail
because the covariance estimator applied to the closest 2% cases to
the coordinatewise median is singular. Modify the function so that it
works well on small data sets. Then consider the following proposal
that may make the estimator asymptotically equivalent to the classi-
cal estimator when the data are from a multivariate normal (MVN)
distribution. The attractor corresponding to 0% trimming is the DGK
estimator (µ̂0, Σ̂0). Let (µ̂T , Σ̂T ) = (µ̂0 , Σ̂0 ) if det(Σ̂0 ) ≤ det(Σ̂M ) and
(µ̂T , Σ̂T ) = (µ̂M , Σ̂M ) otherwise where (µ̂M , Σ̂M ) is the attractor cor-
responding to M% trimming. Then make the DD plot of the classical
Mahalanobis distances versus the distances corresponding to (µ̂T , Σ̂T )
for M = 50, 60, 70, 80, 90, 95 and 98. If all seven DD plots “look good”
then use the classical estimator. The resulting estimator will be asymp-
totically equivalent to the classical estimator if P(all seven DD plots
“look good”) goes to one as n → ∞. We conjecture that all seven plots
will look good because if n is large and the trimmed attractor “beats”
the DGK estimator, then the plot will look good. Also if the data is
MVN but not spherical, then the DGK estimator will almost always
“beat” the trimmed estimator, so all 7 plots will be identical.
• The TV estimator for MLR has a good combination of resistance and
theory. Consider the following modification to make the method asymp-
totically equivalent to OLS when the Gaussian model holds: if each
trimmed view “looks good,” use OLS. The method is asymptotically
equivalent to OLS if the probability P(all 10 trimmed views look good)
goes to one as n → ∞. Rousseeuw and Leroy (1987, p. 128) show that if
the predictors are bounded, then the ith residual ri converges in probability
to the ith error ei.
• For nonlinear regression models of the form yi = m(xi , β)+ei , the fitted
values are ŷi = m(xi , β̂) and the residuals are ri = yi − ŷi . The points
in the FY plot of the fitted values versus the response should follow
the identity line. The TV estimator would make FY and residual plots
for each of the trimming proportions. The MBA estimator with the
median squared residual criterion can also be used for many of these
models.
• Econometrics project: Suppose that the MLR model holds but Var(e) =
σ²Σ where Σ = U U^T and U is known and nonsingular. Show that
U^{-1}Y = U^{-1}Xβ + U^{-1}e, and that the TV and MBA estimators can
be applied to Ỹ = U^{-1}Y and X̃ = U^{-1}X provided that OLS is fit
without an intercept. (An R sketch of this transformation is given after
this list of projects.)
• Suppose that the data set contains missing values. Code the missing
value as ±99999 + rnorm(1). Run a robust procedure on the data. The
idea is that the case with the missing value will be given weight zero if
the variable is important, and the variable will be given weight zero if
the case is important. See Hawkins and Olive (1999b).
• Read Stefanski and Boos (2002). One of the most promising uses of
M-estimators is as generalized estimating equations.
• Robust sequential procedures do not seem to work very well. Try using
analogs of the two stage trimmed means. An ad hoc procedure that
has worked very well is to clean the data using the median and mad
at each sample size. Then apply the classical sequential method and
stopping rule to the cleaned data. This procedure is rather expensive
since the median and mad need to be recomputed with each additional
observation until the stopping rule ends data collection. Another idea
is to examine similar methods in the quality control literature.
• Apply the Cook and Olive (2001) graphical procedure for response
transformations described in Section 5.1 with the power family replaced
by the Yeo and Johnson (2000) family of transformations.
• Prove that the LTA estimator is consistent, and prove that the LTA
and LTS estimators are OP(n^{-1/2}). Prove Conjecture 7.1.
These results are in the folklore but have not been shown outside of
the location model. Mašíček (2004) proved that LTS is consistent.
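For the econometrics project above, a minimal R sketch of the transformation,
assuming that the matrices U, Y and X (with X containing a column of ones
for the intercept) are already stored under those names:
Uinv <- solve(U)                          # U is assumed known and nonsingular
Ytilde <- Uinv %*% Y                      # transformed response
Xtilde <- Uinv %*% X                      # transformed predictors
lsfit(Xtilde, Ytilde, intercept=F)$coef   # OLS fit without an intercept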
e) VSW = Sn²(d1, ..., dn)/([Un − Ln]/n)² = 7.0156/((8 − 2)/10)² = 19.4877,
so SE(Tn) = √(VSW/n) = √(19.4877/10) = 1.3960.
2.14 a) Ln = ⌊n/2⌋ − ⌈√(n/4)⌉ = ⌊5/2⌋ − ⌈√(5/4)⌉ = 2 − 2 = 0.
Un = n − Ln = 5 − 0 = 5.
p = Un − Ln − 1 = 5 − 0 − 1 = 4.
SE(MED(n)) = (Y(Un) − Y(Ln+1))/2 = (8 − 2)/2 = 3.
b) Ln = ⌊n/4⌋ = ⌊5/4⌋ = 1.
Un = n − Ln = 5 − 1 = 4.
p = Un − Ln − 1 = 4 − 1 − 1 = 2.
Tn = (3 + 5 + 6)/3 = 4.6667.
The di's are 3, 3, 5, 6, 6.
(Σ di)/n = 4.6.
(Σ di² − n(d̄)²)/(n − 1) = (115 − 5(4.6)²)/4 = 9.2/4 = 2.3.
The R/Splus functions for Problems 4.10–4.14 are available from the
text's website file rpack.txt and should have been entered into the computer
using the source("A:/rpack.txt") command as described on p. 438.
4.13b i) Coverages should be near 0.95. The lengths should be about 4.3
for n = 10, 4.0 for n = 50 and 3.96 for n = 100.
ii) Coverage should be near 0.78 for n = 10 and 0 for n = 50, 100. The
lengths should be about 187 for n = 10, 173 for n = 50 and 171 for n = 100.
(It can be shown that the expected length for large n is 169.786.)
Chapter 5
5.1 a) 7 + βXi.
b) b = Σ(Yi − 7)Xi / ΣXi².
c) The second derivative = 2ΣXi² > 0.
5.4 Fo = 0.904, p-value > 0.1, fail to reject Ho, so the reduced model is
good.
5.5 a) 25.970
b) Fo = 0.600, p-value > 0.5, fail to reject Ho, so the reduced model is
good.
5.6 a) b3 = ΣX3i(Yi − 10 − 2X2i)/ΣX3i². The second partial derivative
= 2ΣX3i² > 0.
5.9 a) (1.229, 3.345)
b) (1.0825, 3.4919)
5.11 c) Fo = 265.96, p-value = 0.0, reject Ho, there is a MLR relationship
between the response variable height and the predictors sternal height and
finger to ground.
5.13 No, the relationship should be linear.
5.14 No, since 0 is in the CI. X could be a very useful predictor for Y ,
eg if Y = X².
5.16 The model uses constant, finger to ground and sternal height. (You
can tell what the variables are by looking at which variables are deleted.)
5.17 Use L3. L1 and L2 have more predictors and higher Cp than L3.
5.24 This problem has the student reproduce Example 5.1. Hence log(Y )
is the appropriate response transformation.
5.25 Plots b), c) and e) suggest that log(ht) is needed while plots d), f)
and g) suggest that log(ht) is not needed. Plots c) and d) show that the
residuals from both models are quite small compared to the fitted values.
Plot d) suggests that the two models produce approximately the same fitted
values. Hence if the goal is prediction, the expensive log(ht) measurement
does not seem to be needed.
5.26 h) The submodel is ok, but the forward response and residual plots
found in f) for the submodel do not look as good as those for the full model
found in d). Since the submodel residuals do not look good, more terms are
probably needed in the model.
5.29 b) Forward selection gives constant, (size)^(1/3), age, sex, breadth and
cause.
c) Backward elimination gives constant, age, cause, cephalic, headht,
length and sex.
d) Forward selection is better because it has fewer terms and a smaller
Cp .
e) The variables are highly correlated. Hence backward elimination quickly
eliminates the single best predictor (size)^(1/3) and cannot get a good model
that only has a few terms.
f) Although the model in c) could be used, a better model uses constant,
age, sex and (size)^(1/3).
j) The FF and RR plots are good and so are the forward response and
residual plots if you ignore the good leverage points corresponding to the 5
babies.
Chapter 6
6.1 b) Masking since 3 outliers are good cases with respect to Cook’s
distances.
c) and d) Usually the MBA residuals will be large in magnitude, but for
some students MBA, ALMS and ALTS will be highly correlated.
6.4. a) The AR(2) model has the highest correlation with the response
and is the simplest model. The top row of the scatterplot matrix gives the
FY plots for the 5 different estimators.
b) The AR(11) and AR(12) fits are highly correlated as are the SE-
TAR(2,7,2) and SETAR(2,5,2) fits.
6.6 The response Y with a constant and X3 , X7 , X13 and X14 as predictors
is a good submodel. (A competitor would delete X13 but then the residual
plot is not as good.)
6.8 The response Y with a constant, X2 and X5 as predictors is a good
submodel. One outlier is visible in the residual plot. (A competitor would
also use X3 .)
6.9 The submodel using a constant and X1 is ok although the residual
plot does not look very good.
6.13 The model using log(X3 ), log(X4 ), log(X6 ), log(X11 ), log(X13 ) and
log(X14 ) plus a constant has a good FF plot but more variables may be
needed to get a good RR plot.
6.14 There are many good models including the submodel that uses
Y = log(BigMac) and a constant, log(BusFare) log(EngSal), log(Service),
log(TeachSal) and log(TeachTax) as predictors.
6.16 e) R2 went from 0.978 with outliers to R2 = 0.165 without the
outliers. (The low value of R2 suggests that the MLR relationship is weak,
not that the MLR model is bad.)
Chapter 7
7.4 b) The line should go through the left and right cluster but not
through the middle cluster of outliers.
c) The identity line should NOT PASS through the cluster of outliers
with Y near 0 and the residuals corresponding to these outliers should be
large in magnitude.
7.5 e) Usually the MBA estimator based on the median squared residual
will pass through the outliers while the MBA LATA estimator gives zero
weight to the outliers (so that the residuals of the outliers are large in
magnitude).
Chapter 8
8.1. Approximately 2 nδ f(0) cases have small errors.
Chapter 10
10.1 a) X2 ∼ N(100, 6).
b) (X1, X3)^T ∼ N2((49, 17)^T, Σ) where the covariance matrix Σ has rows
(3, −1) and (−1, 4).
c) X1 and X4 are independent, and X3 and X4 are independent.
d) ρ(X1, X3) = Cov(X1, X3)/√(VAR(X1)VAR(X3)) = −1/(√3 √4) = −0.2887.
10.2 a) Y|X ∼ N(49, 16) since Y and X are independent. (Or use E(Y|X) =
µY + Σ12 Σ22^{-1}(X − µx) = 49 + 0(1/25)(X − 100) = 49 and VAR(Y|X) =
Σ11 − Σ12 Σ22^{-1} Σ21 = 16 − 0(1/25)0 = 16.)
10.4 The proof is identical to that given in Example 10.2. (In addition,
it is fairly simple to show that M1 = M2 ≡ M. That is, M depends on Σ
but not on c or g.)
10.6 a) Sort each column, then find the median of each column. Then
MED(W ) = (1430, 180, 120)T .
b) The sample mean of (X1 , X2 , X3 )T is found by finding the sample mean
of each column. Hence x = (1232.8571, 168.00, 112.00)T .
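In R or Splus, assuming that the data matrix of 10.6 is stored in w, the
coordinatewise median and the sample mean can be computed with the sketch
below.
apply(w, 2, median)   # median of each column
apply(w, 2, mean)     # sample mean of each column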
10.11 ΣB = E[E(X|B^T X) X^T B] = E(MB B^T X X^T B) = MB B^T ΣB.
Hence MB = ΣB(B^T ΣB)^{-1}.
10.15 The 4 plots should look nearly identical with the five cases 61–65
appearing as outliers.
10.16 Not only should none of the outliers be highlighted, but the high-
lighted cases should be ellipsoidal.
10.17 Answers will vary since this is simulated data, but should get gam
near 0.4, 0.3, 0.2 and 0.1 as p increases from 2 to 20.
Chapter 11
11.2 b) Ideally the answer to this problem and Problem 11.3b would be
nearly the same, but students seem to want correlations to be very high and
use n too high. Values of n around 50, 60 and 80 for p = 2, 3 and 4 should be
enough.
11.3 b) Values of n should be near 50, 60 and 80 for p = 2, 3 and 4.
11.4 This is simulated data, but for most plots the slope is near 2.
11.8 The identity line should NOT PASS through the cluster of out-
liers with Y near 0. The amount of trimming seems to vary some with the
computer (which should not happen unless there is a bug in the tvreg2 func-
tion or if the computers are using different versions of cov.mcd), but most
students liked 70% or 80% trimming.
Chapter 12
12.1.
a) êi = yi − T(Y).
b) êi = yi − xi^T β̂.
c) êi = yi / (β̂1 exp[β̂2(xi − x̄)]).
d) êi = √wi (yi − xi^T β̂).
12.2.
a) Since y is a (random) scalar and E(w) = 0, Σx,y = E[(x − E(x))(y −
E(y))T ] = E[w(y − E(y))] = E(wy) − E(w)E(y) = E(wy).
b) Using the definition of z and r, note that y = m(z) + e and
w = r + (Σx β)βT w. Hence E(wy) = E[(r + (Σx β)βT w)(m(z) + e)] =
E[(r + (Σx β)βT w)m(z)] + E[r + (Σx β)β T w]E(e) since e is independent of
x. Since E(e) = 0, the latter term drops out. Since m(z) and β T wm(z) are
(random) scalars, E(wy) = E[m(z)r] + E[β T w m(z)]Σxβ.
c) Using result b), Σx^{-1} Σx,y = Σx^{-1} E[m(z)r] + Σx^{-1} E[β^T w m(z)] Σx β =
E[β^T w m(z)] β + Σx^{-1} E[m(z)r].
g) The slice means follow the logistic curve fairly well with 8 slices.
i) The EE plot is linear.
j) The slice means follow the logistic curve fairly well with 8 slices.
BIBLIOGRAPHY
12. Appa, G.M., and Land, A.H. (1993), “Comment on ‘A Cautionary Note
on the Method of Least Median of Squares’ by Hettmansperger, T.P.
and Sheather, S.J.,” The American Statistician, 47, 160-162.
18. Atkinson, A., and Riani, R. (2000), Robust Diagnostic Regression Anal-
ysis, Springer-Verlag, NY.
19. Atkinson, A., Riani, R., and Cerioli, A. (2004), Exploring Multivariate
Data with the Forward Search, Springer-Verlag, NY.
20. Atkinson, A.C., and Weisberg, S. (1991), “Simulated Annealing for the
Detection of Multiple Outliers Using Least Squares and Least Median
of Squares Fitting,” in Directions in Robust Statistics and Diagnostics,
Part 1, eds. Stahel, W., and Weisberg, S., Springer-Verlag, NY, 7-20.
21. Bai, Z.D., and He, X. (1999), “Asymptotic Distributions of the Maxi-
mal Depth Estimators for Regression and Multivariate Location,” The
Annals of Statistics, 27, 1616-1637.
23. Barnett, V., and Lewis, T. (1994), Outliers in Statistical Data, 3rd ed.,
John Wiley and Sons, NY.
24. Barrett, B.E., and Gray, J.B. (1992), “Diagnosing Joint Influence in
Regression Analysis,” in the American Statistical 1992 Proceedings of
the Computing Section, 40-45.
25. Barrodale, I., and Roberts, F.D.K. (1974), “Algorithm 478 Solution of
an Overdetermined System of Equations in the l1 Norm [F 4],” Com-
munications of the ACM, 17, 319-320.
29. Becker, R.A., Chambers, J.M., and Wilks, A.R. (1988), The New S
Language A Programming Environment for Data Analysis and Graph-
ics, Wadsworth and Brooks/Cole, Pacific Grove, CA.
32. Belsley, D.A., Kuh, E., and Welsch, R.E. (1980), Regression Diagnos-
tics: Identifying Influential Data and Sources of Collinearity, John Wi-
ley and Sons, NY.
33. Bentler, P.M., and Yuan K.H. (1998), “Tests for Linear Trend in the
Smallest Eigenvalues of the Correlation Matrix,” Psychometrika, 63,
131-144.
35. Bickel, P.J. (1965), “On Some Robust Estimates of Location,” The
Annals of Mathematical Statistics, 36, 847-858.
36. Bickel, P.J. (1975), “One-Step Huber Estimates in the Linear Model,”
Journal of the American Statistical Association, 70, 428-434.
37. Bickel, P.J., and Doksum, K.A. (1977), Mathematical Statistics: Basic
Ideas and Selected Topics, Holden-Day, San Francisco, CA.
38. Bloch, D.A., and Gastwirth, J.L. (1968), “On a Simple Estimate of
the Reciprocal of the Density Function,” The Annals of Mathematical
Statistics, 39, 1083-1085.
40. Bogdan, M. (1999), “Data Driven Smooth Tests for Bivariate Normal-
ity,” Journal of Multivariate Analysis, 68, 26-53.
41. Bowman, K.O., and Shenton, L.R. (1988), Properties of Estimators for
the Gamma Distribution, Marcel Dekker, NY.
44. Box, G.E.P., and Cox, D.R. (1964), “An Analysis of Transformations,”
Journal of the Royal Statistical Society, B, 26, 211-246.
45. Branco, J.A., Croux, C., Filzmoser, P., and Oliviera, M.R. (2005), “Ro-
bust Canonical Correlations: a Comparative Study,” Computational
Statistics, To Appear.
49. Brockwell, P.J., and Davis, R.A. (1991), Time Series: Theory and
Methods, Springer–Verlag, NY.
50. Broffitt, J.D. (1974), “An Example of the Large Sample Behavior of
the Midrange,” The American Statistician, 28, 69-70.
52. Buja, A., Hastie, T., and Tibshirani, R. (1989), “Linear Smoothers and
Additive Models,” The Annals of Statistics, 17, 453-555.
53. Bura, E., and Cook, R.D. (2001), “Estimating the Structural Dimen-
sion of Regressions Via Parametric Inverse Regression,” Journal of the
Royal Statistical Society, B, 63, 393-410.
57. Butler, R.W., Davies, P.L., and Jhun, M. (1993), “Asymptotics for the
Minimum Covariance Determinant Estimator,” The Annals of Statis-
tics, 21, 1385-1400.
59. Cambanis, S., Huang, S., and Simons, G. (1981) “On the Theory of El-
liptically Contoured Distributions,” Journal of Multivariate Analysis,
11, 368-385.
60. Cameron, A.C., and Trivedi, P.K. (1998), Regression Analysis of Count
Data, Cambridge University Press, Cambridge, UK.
61. Carroll, R.J., and Welsh, A.H. (1988), “A Note on Asymmetry and
Robustness in Linear Regression,” The American Statistician, 42, 285-
287.
62. Casella, G., and Berger, R.L. (2002), Statistical Inference, 2nd ed.,
Duxbury, Belmont, CA.
64. Cattell, R.B. (1966), “The Scree Test for the Number of Factors,” Mul-
tivariate Behavioral Research, 1, 245-276.
65. Cavanagh, C., and Sherman, R.P. (1998), “Rank Estimators for Mono-
tonic Index Models,” Journal of Econometrics, 84, 351-381.
67. Chambers, J.M., Cleveland, W.S., Kleiner, B., and Tukey, P. (1983),
Graphical Methods for Data Analysis, Duxbury Press, Boston.
68. Chang, W.H., McKean, J.W., Naranjo, J.D., and Sheather, S.J. (1999),
“High-Breakdown Rank Regression,” Journal of the American Statis-
tical Association, 94, 205-219.
69. Chatterjee, S., and Hadi, A.S. (1988), Sensitivity Analysis in Linear
Regression, John Wiley and Sons, NY.
70. Chen, C.H., and Li, K.C. (1998), “Can SIR be as Popular as Multiple
Linear Regression?,” Statistica Sinica, 8, 289-316.
71. Chen, J., and Rubin, H. (1986), “Bounds for the Difference Between
Median and Mean of Gamma and Poisson Distributions,” Statistics and
Probability Letters, 4, 281-283.
75. Claeskens, G., and Hjort, N.L. (2003), “The Focused Information Cri-
terion,” (with discussion), Journal of the American Statistical Associ-
ation, 98, 900-916.
78. Cohen, A. C., and Whitten, B.J. (1988), Parameter Estimation in Re-
liability and Life Span Models, Marcel Dekker, NY.
83. Cook, R.D. (1996), “Graphics for Regressions with Binary Response,”
Journal of the American Statistical Association, 91, 983-992.
84. Cook, R.D. (1998a), Regression Graphics: Ideas for Studying Regres-
sion Through Graphics, John Wiley and Sons, NY.
86. Cook, R.D. (2000), “SAVE: A Method for Dimension Reduction and
Graphics in Regression,” Communications in Statistics Theory and
Methods, 29, 2109-2121.
89. Cook, R.D., and Critchley, F. (2000), “Identifying Outliers and Regres-
sion Mixtures Graphically,” Journal of the American Statistical Asso-
ciation, 95, 781-794.
93. Cook, R.D., Hawkins, D.M., and Weisberg, S. (1993), “Exact Iterative
Computation of the Robust Multivariate Minimum Volume Ellipsoid
Estimator,” Statistics and Probability Letters, 16, 213-218.
94. Cook, R.D., and Lee, H. (1999), “Dimension Reduction in Binary Re-
sponse Regression,” Journal of the American Statistical Association,
94, 1187-1200.
95. Cook, R.D., and Li, B. (2002), “Dimension Reduction for Conditional
Mean in Regression,” The Annals of Statistics, 30, 455-474.
96. Cook, R.D., and Li, B. (2004), “Determining the Dimension of Iterative
Hessian Transformation,” The Annals of Statistics, 32, 2501-2531.
98. Cook, R.D., and Ni, L. (2005), “Sufficient Dimension Reduction via
Inverse Regression: A Minimum Discrepancy Approach,” Journal of
the American Statistical Association, 100, 410-428.
99. Cook, R.D., and Olive, D.J. (2001), “A Note on Visualizing Response
Transformations in Regression,” Technometrics, 43, 443-449.
100. Cook, R.D., and Wang, P.C. (1983), “Transformations and Influential
Cases in Regression,” Technometrics, 25, 337-343.
101. Cook, R.D., and Weisberg, S. (1982), Residuals and Influence in Re-
gression, Chapman & Hall, London.
102. Cook, R.D., and Weisberg, S. (1991), “Comment on ‘Sliced Inverse Re-
gression for Dimension Reduction’ by K.C. Li,” Journal of the Ameri-
can Statistical Association, 86, 328-332.
104. Cook, R.D., and Weisberg, S. (1997), “Graphs for Assessing the Ade-
quacy of Regression Models,” Journal of the American Statistical As-
sociation, 92, 490-499.
107. Cooke, D., Craven, A.H., and Clarke, G.M. (1982), Basic Statistical
Computing, Edward Arnold Publishers, London.
109. Cox, D.R. (1972), “Regression Models and Life-Tables,” Journal of the
Royal Statistical Society, B, 34, 187-220.
111. Cramér, J.S. (2003), Logit Models from Economics and Other Fields,
Cambridge University Press, Cambridge, UK.
112. Croux, C., Dehon, C., Rousseeuw, P.J., and Van Aelst, S. (2001), “Ro-
bust Estimation of the Conditional Median Function at Elliptical Mod-
els”, Statistics and Probability Letters, 51, 361-368.
113. Croux, C., and Haesbroeck, G. (2003), “Implementing the Bianco and
Yohai Estimator for Logistic Regression,” Computational Statistics and
Data Analysis, 44, 273-295.
117. Daniel, C., and Wood, F.S. (1980), Fitting Equations to Data, 2nd ed.,
John Wiley and Sons, NY.
119. David, H.A. (1981), Order Statistics, 2nd ed., John Wiley and Sons,
NY.
120. David, H.A. (1995), “First (?) Occurrences of Common Terms in Math-
ematical Statistics,” The American Statistician, 49, 121-133.
122. Davies, L., and Gather, U. (1993), “The Identification of Multiple Out-
liers,” Journal of the American Statistical Association, 88, 782-792.
125. Davies, P.L. (1993), “Aspects of Robust Linear Regression,” The An-
nals of Statistics, 21, 1843-1899.
126. deCani, J.S., and Stine, R.A. (1986), “A Note on Deriving the Infor-
mation Matrix for a Logistic Distribution,” The American Statistician,
40, 220-222.
128. Delecroix, M., Härdle, W., and Hristache, M. (2003), “Efficient Estima-
tion in Conditional Single-Index Regression,” Journal of Multivariate
Analysis, 86, 213-226.
130. Devlin, S.J., Gnanadesikan, R., and Kettenring, J.R. (1975), “Ro-
bust Estimation and Outlier Detection with Correlation Coefficients,”
Biometrika, 62, 531-545.
131. Devlin, S.J., Gnanadesikan, R., and Kettenring, J.R. (1981), “Robust
Estimation of Dispersion Matrices and Principal Components,” Journal
of the American Statistical Association, 76, 354-362.
132. Dixon, W.J., and Tukey, J.W. (1968), “Approximate Behavior of Win-
sorized t (trimming/Winsorization 2),” Technometrics, 10, 83-98.
137. Dollinger, M.B., and Staudte, R.G. (1991), “Influence Functions of It-
eratively Reweighted Least Squares Estimators,” Journal of the Amer-
ican Statistical Association, 86, 709-716.
138. Dongarra, J.J., Moler, C.B., Bunch, J.R., and Stewart, G.W. (1979),
LINPACK Users' Guide, SIAM, Philadelphia, PA.
139. Donoho, D.L., and Huber, P.J. (1983), “The Notion of Breakdown
Point,” in A Festschrift for Erich L. Lehmann, eds. Bickel, P.J., Dok-
sum, K.A., and Hodges, J.L., Wadsworth, Pacific Grove, CA, 157-184.
141. Draper, N.R., and Smith, H. (1981), Applied Regression Analysis, 2nd
ed., John Wiley and Sons, NY.
142. Duda, R.O., Hart, P.E., and Stork, D.G. (2000), Pattern Classification,
2nd ed., John Wiley and Sons, NY.
145. Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004), “Least
Angle Regression,” (with discussion), The Annals of Statistics, 32, 407-
451.
149. Fan, J., and Li, R. (2001), “Variable Selection via Nonconcave Penal-
ized Likelihood and its Oracle Properties,” Journal of the American
Statistical Association, 96, 1348-1360.
150. Fan, J., and Li, R. (2002), “Variable Selection for Cox’s Proportional
Hazard Model and Frailty Model,” The Annals of Statistics, 30, 74-99.
151. Fang, K.T., and Anderson, T.W. (editors) (1990), Statistical Inference
in Elliptically Contoured and Related Distributions, Allerton Press, NY.
152. Fang, K.T., Kotz, S., and Ng, K.W. (1990), Symmetric Multivariate
and Related Distributions, Chapman & Hall, NY.
153. Farebrother, R.W. (1997), “Notes on the Early History of Elemental Set
Methods,” in L1 -Statistical Procedures and Related Topics, ed. Dodge,
Y., Institute of Mathematical Statistics, Hayward, CA, 161-170.
157. Fowlkes, E.B. (1969), “User’s Manual for a System for Interactive Prob-
ability Plotting on Graphic-2,” Technical Memorandum, AT&T Bell
Laboratories, Murray Hill, NJ.
163. Fung, W.K., He, X., Liu, L., and Shi, P.D. (2002), “Dimension Reduc-
tion Based on Canonical Correlation,” Statistica Sinica, 12, 1093-1114.
164. Furnival, G., and Wilson, R. (1974), “Regression by Leaps and Bounds,”
Technometrics, 16, 499-511.
166. Gather, U., and Becker, C. (1997), “Outlier Identification and Robust
Methods,” in Robust Inference, eds. Maddala, G.S., and Rao, C.R.,
Elsevier Science B.V., Amsterdam, 123-144.
167. Gather, U., Hilker, T., and Becker, C. (2001), “A Robustified Version
of Sliced Inverse Regression,” in Statistics in Genetics and in the Envi-
ronmental Sciences, eds. Fernholtz, T.L., Morgenthaler, S., and Stahel,
W., Birkhäuser, Basel, Switzerland, 145-157.
168. Gather, U., Hilker, T., and Becker, C. (2002), “A Note on Outlier
Sensitivity of Sliced Inverse Regression,” Statistics, 36, 271-281.
173. Golub, G.H., and Van Loan, C.F. (1989), Matrix Computations, 2nd
ed., John Hopkins University Press, Baltimore, MD.
174. Gray, J.B. (1985), “Graphics for Regression Diagnostics,” in the Amer-
ican Statistical Association 1985 Proceedings of the Statistical Comput-
ing Section, 102-108.
175. Greenwood, J.A., and Durand, D. (1960), “Aids for Fitting the Gamma
Distribution by Maximum Likelihood,” Technometrics, 2, 55-56.
179. Hadi, A.S., and Simonoff, J.S. (1993), “Procedures for the Identifica-
tion of Multiple Outliers in Linear Models,” Journal of the American
Statistical Association, 88, 1264-1272.
180. Hahn, G.H., Mason, D.M., and Weiner, D.C. (editors) (1991), Sums,
Trimmed Sums, and Extremes, Birkhäuser, Boston.
181. Hall, P. and Li, K.C. (1993), “On Almost Linearity of Low Dimensional
Projections from High Dimensional Data,” The Annals of Statistics, 21,
867-889.
182. Hall, P., and Welsh, A.H. (1985), “Limit Theorems for the Median
Deviation,” Annals of the Institute of Statistical Mathematics, Part A,
37, 27-36.
183. Hamada, M., and Sitter, R. (2004), “Statistical Research: Some Advice
for Beginners,” The American Statistician, 58, 93-101.
186. Hampel, F.R. (1985), “The Breakdown Points of the Mean Combined
with Some Rejection Rules,” Technometrics, 27, 95-107.
187. Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., and Stahel, W.A.
(1986), Robust Statistics, John Wiley and Sons, NY.
188. Hamza, K. (1995), “The Smallest Uniform Upper Bound on the Dis-
tance Between the Mean and the Median of the Binomial and Poisson
Distributions,” Statistics and Probability Letters, 23, 21-25.
189. Härdle, W., Hall, P., and Ichimura, H. (1993), “Optimal Smoothing in
Single Index Models,” The Annals of Statistics, 21, 157-178.
190. Harrison, D. and Rubinfeld, D.L. (1978), “Hedonic Prices and the De-
mand for Clean Air,” Journal of Environmental Economics and Man-
agement, 5, 81-102.
191. Harter, H.L. (1974a), “The Method of Least Squares and Some Alter-
natives, Part I,” International Statistical Review, 42, 147-174.
192. Harter, H.L. (1974b), “The Method of Least Squares and Some Alter-
natives, Part II,” International Statistical Review, 42, 235-165.
193. Harter, H.L. (1975a), “The Method of Least Squares and Some Alter-
natives, Part III,” International Statistical Review, 43, 1-44.
194. Harter, H.L. (1975b), “The Method of Least Squares and Some Alterna-
tives, Part IV,” International Statistical Review, 43, 125-190, 273-278.
195. Harter, H.L. (1975c), “The Method of Least Squares and Some Alter-
natives, Part V,” International Statistical Review, 43, 269-272.
196. Harter, H.L. (1976), “The Method of Least Squares and Some Alter-
natives, Part VI,” International Statistical Review, 44, 113-159.
201. Hawkins, D.M. (1993b), “A Feasible Solution Algorithm for the Min-
imum Volume Ellipsoid Estimator in Multivariate Data,” Computa-
tional Statistics, 9, 95-107.
202. Hawkins, D.M. (1994), “The Feasible Solution Algorithm for the Min-
imum Covariance Determinant Estimator in Multivariate Data, Com-
putational Statistics and Data Analysis, 17, 197-210.
203. Hawkins, D.M., Bradu, D., and Kass, G.V. (1984), “Location of Several
Outliers in Multiple Regression Data Using Elemental Sets,” Techno-
metrics, 26, 197-208.
204. Hawkins, D.M., and Olive, D.J. (1999a), “Improved Feasible Solution
Algorithms for High Breakdown Estimation,” Computational Statistics
and Data Analysis, 30, 1-11.
207. Hawkins, D.M., and Simonoff, J.S. (1993), “High Breakdown Regres-
sion and Multivariate Estimation,” Applied Statistics, 42, 423-432.
209. He, X., Cui, H., and Simpson, D.G. (2004), “Longitudinal Data Anal-
ysis Using t-type Regression,” Journal of Statistical Planning and In-
ference, 122, 253-269.
210. He, X., and Fung, W.K. (1999), “Method of Medians for Lifetime Data
with Weibull Models,” Statistics in Medicine, 18, 1993-2009.
211. He, X., and Fung, W.K. (2000), “High Breakdown Estimation for Mul-
tiple Populations with Applications to Discriminant Analysis,” Journal
of Multivariate Analysis, 72, 151-162.
213. He, X., Simpson, D.G., and Wang, G.Y. (2000), “Breakdown Points of
t-type Regression Estimators,” Biometrika, 87, 675-687.
214. He, X., and Wang, G. (1996), “Cross-Checking Using the Minimum
Volume Ellipsoid Estimator,” Statistica Sinica, 6, 367-374.
215. He, X., and Wang, G. (1997), “A Qualitative Robustness of S*- Estima-
tors of Multivariate Location and Dispersion,” Statistica Neerlandica,
51, 257-268.
221. Hinich, M.J., and Talwar, P.P. (1975), “A Simple Method for Robust
Regression,” Journal of the American Statistical Association, 70, 113-
119.
223. Hoaglin, D.C., Mosteller, F., and Tukey, J.W. (1983), Understanding
Robust and Exploratory Data Analysis, John Wiley and Sons, NY.
224. Hoaglin, D.C., and Welsh, R. (1978), “The Hat Matrix in Regression
and ANOVA,” The American Statistician, 32, 17-22.
225. Horn, P.S. (1983), “Some Easy t-Statistics,” Journal of the American
Statistical Association, 78, 930-936.
231. Hristache, M., Juditsky, A., Polzehl, J., and Spokoiny V. (2001), “Struc-
ture Adaptive Approach for Dimension Reduction,” The Annals of
Statistics, 29, 1537-1566.
232. Huber, P.J. (1981), Robust Statistics, John Wiley and Sons, NY.
235. Iglewicz, B., and Hoaglin, D.C. (1993), How to Detect and Handle Out-
liers, Quality Press, American Society for Quality, Milwaukee, Wiscon-
sin.
236. Insightful (2002), S-Plus 6 Robust Library User’s Guide, Insightful Cor-
poration, Seattle, WA. Available from
(http://math.carleton.ca/~help/Splus/robust.pdf).
238. Jaeckel, L.A. (1971b), “Some Flexible Estimates of Location,” The An-
nals of Mathematical Statistics, 42, 1540-1552.
242. Johnson, N.L., Kotz, S., and Kemp, A.K. (1992), Univariate Discrete
Distributions, 2nd ed., John Wiley and Sons, NY.
243. Johnson, R.A., and Wichern, D.W. (1988), Applied Multivariate Sta-
tistical Analysis, 2nd ed., Prentice Hall, Englewood Cliffs, NJ.
244. Johnson, R.W. (1996), “Fitting Percentage of Body Fat to Simple Body
Measurements,” Journal of Statistics Education, 4 (1). Available from
(http://www.amstat.org/publications/jse/).
245. Joiner, B.L., and Hall, D.L. (1983), “The Ubiquitous Role of f’/f in
Efficient Estimation of Location,” The American Statistician, 37, 128-
133.
246. Jones, H.L. (1946), “Linear Regression Functions with Neglected Vari-
ables,” Journal of the American Statistical Association, 41, 356-369.
247. Judge, G.G., Griffiths, W.E., Hill, R.C., Lütkepohl, H., and Lee, T.C.
(1985), The Theory and Practice of Econometrics, 2nd ed., John Wiley
and Sons, NY.
248. Jureckova, J., Koenker, R.W., and Welsh, A.H. (1994), “Adaptive
Choice of Trimming Proportions,” Annals of the Institute of Statistical
Mathematics, 46, 737-755.
249. Jureckova, J., and Portnoy, S. (1987), “Asymptotics for One-step M-
estimators in Regression with Application to Combining Efficiency and
High Breakdown Point,” Communications in Statistics Theory and
Methods, 16, 2187-2199.
250. Jureckova, J., and Sen, P.K. (1996), Robust Statistical Procedures:
Asymptotics and Interrelations, John Wiley and Sons, NY.
251. Kafadar, K. (1982), “A Biweight Approach to the One-Sample Prob-
lem,” Journal of the American Statistical Association, 77, 416-424.
252. Kalbfleisch, J.D., and Prentice, R.L. (1980), The Statistical Analysis of
Failure Time Data, John Wiley and Sons, NY.
253. Kay, R., and Little, S. (1987), “Transformations of the Explanatory
Variables in the Logistic Regression Model for Binary Data,” Biometrika,
74, 495-501.
255. Kennedy, W.J., and Gentle, J.E. (1980), Statistical Computing, Marcel
Dekker, NY.
257. Kim, J., and Pollard, D. (1990), “Cube Root Asymptotics,” The Annals
of Statistics, 18, 191-219.
264. Koltchinskii, V.I., and Li, L. (1998), “Testing for Spherical Symmetry
of a Multivariate Distribution,” Journal of Multivariate Analysis, 65,
228-244.
265. Kotz, S., and Johnson, N.L. (editors) (1982ab), Encyclopedia of Statis-
tical Sciences, Vol. 1-2, John Wiley and Sons, NY.
266. Kotz, S., and Johnson, N.L. (editors) (1983ab), Encyclopedia of Statis-
tical Sciences, Vol. 3-4 , John Wiley and Sons, NY.
267. Kotz, S., and Johnson, N.L. (editors) (1985ab), Encyclopedia of Statis-
tical Sciences, Vol. 5-6, John Wiley and Sons, NY.
268. Kotz, S., and Johnson, N.L. (editors) (1986), Encyclopedia of Statistical
Sciences, Vol. 7, John Wiley and Sons, NY.
269. Kotz, S., and Johnson, N.L. (editors) (1988ab), Encyclopedia of Statis-
tical Sciences, Vol. 8-9, John Wiley and Sons, NY.
272. Lax, D.A. (1985), “Robust Estimators of Scale: Finite Sample Perfor-
mance in Long-Tailed Symmetric Distributions,” Journal of the Amer-
ican Statistical Association, 80, 736-741.
273. Le, C.T. (1998), Applied Categorical Data Analysis, John Wiley and
Sons, NY.
275. Lehmann, E.L. (1983), Theory of Point Estimation, John Wiley and
Sons, NY.
277. Li, K.C. (1991), “Sliced Inverse Regression for Dimension Reduction,”
Journal of the American Statistical Association, 86, 316-342.
278. Li, K.C. (1992), “On Principal Hessian Directions for Data Visualiza-
tion and Dimension Reduction: Another Application of Stein’s Lemma,”
Journal of the American Statistical Association, 87, 1025-1040.
280. Li, K.C. (2000), High Dimensional Data Analysis via the SIR/PHD
Approach, Unpublished Manuscript Available from
(http://www.stat.ucla.edu/~kcli/).
281. Li, K.C., and Duan, N. (1989), “Regression Analysis Under Link Vio-
lation,” The Annals of Statistics, 17, 1009-1052.
282. Li, L., Cook, R.D., and Nachtsheim, C.J. (2004), “Cluster-based Esti-
mation for Sufficient Dimension Reduction,” Computational Statistics
and Data Analysis, 47, 175-193.
283. Li, L., Cook, R.D., and Nachtsheim, C.J. (2005), “Model-Free Variable
Selection,” Journal of the Royal Statistical Society, B, 67, 285-300.
284. Li, R., Fang, K., and Zhu, L. (1997), “Some Q-Q Probability Plots
to Test Spherical and Elliptical Symmetry,” Journal of Computational
and Graphical Statistics, 6, 435-450.
285. Lin, T.C., and Pourahmadi, M. (1998), “Nonparametric and Nonlin-
ear Models and Data Mining in Time Series: A Case-Study on the
Canadian Lynx Data,” Journal of the Royal Statistical Society, C, 47,
187-201.
286. Lindsey, J.K. (2004), Introduction to Applied Statistics: a Modelling
Approach, 2nd ed., Oxford University Press, Oxford, UK.
287. Little, R.J.A., and Rubin, D.B. (2002), Statistical Analysis with Miss-
ing Data, 2nd ed., John Wiley and Sons, NY.
288. Liu, R.Y., Parelius, J.M., and Singh, K. (1999), “Multivariate Analysis
by Data Depth: Descriptive Statistics, Graphics, and Inference,” The
Annals of Statistics, 27, 783-858.
289. Lopuhaä, H.P. (1999), “Asymptotics of Reweighted Estimators of Mul-
tivariate Location and Scatter,” The Annals of Statistics, 27, 1638-
1665.
290. Luo, Z. (1998), “Backfitting in Smoothing Spline Anova,” The Annals
of Statistics, 26, 1733-1759.
291. Ma, Y., and Genton, M.G. (2001), “Highly Robust Estimation of Dis-
persion Matrices,” Journal of Multivariate Analysis, 78, 11-36.
292. Maddala, G.S., and Rao, C.R. (editors) (1997), Robust Inference, Hand-
book of Statistics 15, Elsevier Science B.V., Amsterdam.
293. Maguluri, G., and Singh, K. (1997), “On the Fundamentals of Data
Analysis,” in Robust Inference, eds. Maddala, G.S., and Rao, C.R.,
Elsevier Science B.V., Amsterdam, 537-549.
295. Manzotti, A., Pérez, F.J., and Quiroz, A.J. (2002), “A Statistic for
Testing the Null Hypothesis of Elliptical Symmetry,” Journal of Mul-
tivariate Analysis, 81, 274-285.
298. Mardia, K.V., Kent, J.T., and Bibby, J.M. (1979), Multivariate Anal-
ysis, Academic Press, London.
299. Maronna, R.A., and Zamar, R.H. (2002), “Robust Estimates of Loca-
tion and Dispersion for High-Dimensional Datasets,” Technometrics,
44, 307-317.
300. Maronna, R.A. (2006), Robust Statistics, John Wiley and Sons, NY.
302. MathSoft (1999a), S-Plus 2000 User’s Guide, Data Analysis Products
Division, MathSoft, Seattle, WA. (Mathsoft is now Insightful.)
304. Mayo, M.S., and Gray, J.B. (1997), “Elemental Subsets: the Building
Blocks of Regression,” The American Statistician, 51, 122-129.
305. McCullagh, P., and Nelder, J.A. (1989), Generalized Linear Models,
2nd ed., Chapman & Hall, London.
308. Meeker, W.Q., and Escobar, L.A. (1998), Statistical Methods for Reli-
ability Data, John Wiley and Sons, NY.
310. Moore, D.S. (2004), The Basic Practice of Statistics, 3rd ed., W.H.
Freeman, NY.
311. Moran, P.A.P. (1953), “The Statistical Analysis of the Sunspot and
Lynx Cycles,” Journal of Animal Ecology, 18, 115-116.
313. Morgenthaler, S., Ronchetti, E., and Stahel, W.A. (editors) (1993),
New Directions in Statistical Data Analysis and Robustness, Birkhäuser,
Boston.
315. Mosteller, F. (1946), “On Some Useful Inefficient Statistics,” The An-
nals of Mathematical Statistics, 17, 377-408.
316. Mosteller, F., and Tukey, J.W. (1977), Data Analysis and Regression,
Addison-Wesley, Reading, MA.
319. Myers, R.H., Montgomery, D.C., and Vining, G.G. (2002), Generalized
Linear Models with Applications in Engineering and the Sciences, John
Wiley and Sons, NY.
322. Neter, J., Kutner, M.H., Nachtsheim, C.J., and Wasserman, W. (1996),
Applied Linear Statistical Models, 4th ed., McGraw-Hill, Boston, MA.
323. Niinimaa, A., Oja, H., and Tableman, M. (1990), “The Finite-Sample
Breakdown Point of the Oja Bivariate Median and of the Corresponding
Half-Samples Version,” Statistics and Probability Letters, 10, 325-328.
325. Nott, D.J., and Leonte, D. (2004), “Sampling Schemes for Bayesian
Variable Selection in Generalized Linear Models,” Journal of Compu-
tational and Graphical Statistics, 13, 362-382.
326. Olive, D.J. (2001), “High Breakdown Analogs of the Trimmed Mean,”
Statistics and Probability Letters, 51, 87-92.
333. Olive, D.J., and Hawkins, D.M. (2003), “Robust Regression with High
Coverage,” Statistics and Probability Letters, 63, 259-266.
334. Olive, D.J., and Hawkins, D.M. (2005), “Variable Selection for 1D Re-
gression Models,” Technometrics, 47, 43-50.
335. Olive, D.J., and Hawkins, D.M. (2006), “Robustifying Robust Estima-
tors,” Preprint, see (http://www.math.siu.edu/olive/preprints.htm).
338. Patel, J.K., Kapadia C.H., and Owen, D.B. (1976), Handbook of Sta-
tistical Distributions, Marcel Dekker, NY.
339. Peña, D., and Prieto, F.J. (2001), “Multivariate Outlier Detection and
Robust Covariance Matrix Estimation,” Technometrics, 286-299.
341. Pison, G., Rousseeuw, P.J., Filzmoser, P., and Croux, C. (2003), “Ro-
bust Factor Analysis,” Journal of Multivariate Analysis, 84, 145-172.
346. Portnoy, S., and Koenker, R. (1997), “The Gaussian Hare and the
Laplacian Tortoise: Computability of Squared Error Versus Absolute-
Error Estimators,” Statistical Science, 12, 279-300.
348. Poston, W.L., Wegman, E.J., Priebe, C.E., and Solka, J.L. (1997), “A
Deterministic Method for Robust Estimation of Multivariate Location
and Shape,” Journal of Computational and Graphical Statistics, 6, 300-
313.
349. Powers, D.A., and Xie, Y. (2000), Statistical Methods for Categorical
Data Analysis, Academic Press, San Diego.
350. Pratt, J.W. (1959), “On a General Concept of ‘in Probability’,” The
Annals of Mathematical Statistics, 30, 549-558.
353. Price, R.M., and Bonett, D.G. (2001), “Estimating the Variance of the
Sample Median,” Journal of Statistical Computation and Simulation,
68, 295-305.
354. Rao, C.R. (1965), Linear Statistical Inference and Its Applications,
John Wiley and Sons, NY.
356. Rieder, H. (1996), Robust Statistics, Data Analysis, and Computer In-
tensive Methods, Springer-Verlag, NY.
361. Ronchetti, E., and Staudte, R.G. (1994), “A Robust Version of Mal-
lows’s Cp ,” Journal of the American Statistical Association, 89, 550-
559.
365. Rousseeuw, P.J., and Bassett, G.W. (1990), “The Remedian: A Ro-
bust Averaging Method for Large Data Sets,” Journal of the American
Statistical Association, 85, 97-104.
367. Rousseeuw, P.J., and Croux, C. (1992), “Explicit Scale Estimators with
High Breakdown Point,” in L1-Statistical Analysis and Related Meth-
ods, ed. Dodge, Y., Elsevier Science Publishers, Amsterdam, Holland,
77-92.
370. Rousseeuw, P.J., and Leroy, A.M. (1987), Robust Regression and Out-
lier Detection, John Wiley and Sons, NY.
371. Rousseeuw, P.J., Van Aelst, S., Van Driessen, K., and Agulló, J. (2004),
“Robust Multivariate Regression,” Technometrics, 46, 293-305.
372. Rousseeuw, P.J., and Van Driessen, K. (1999), “A Fast Algorithm for
the Minimum Covariance Determinant Estimator,” Technometrics, 41,
212-223.
373. Rousseeuw, P.J., and Van Driessen, K. (2000), “An Algorithm for
Positive-Breakdown Regression Based on Concentration Steps,” in Data
Analysis: Modeling and Practical Application, eds. W. Gaul, O. Opitz,
and M. Schader, Springer-Verlag, NY.
374. Rousseeuw, P.J., and Van Driessen, K. (2002), “Computing LTS Re-
gression for Large Data Sets,” Estadistica, 54, 163-190.
375. Rousseeuw, P.J., and van Zomeren, B.C. (1990), “Unmasking Multi-
variate Outliers and Leverage Points,” Journal of the American Statis-
tical Association, 85, 633-651.
377. Rubin, D.B. (1980), “Composite Points in Weighted Least Squares Re-
gressions,” Technometrics, 22, 343-348.
378. Rubin, D.B. (2004), “On Advice for Beginners in Statistical Research,”
The American Statistician, 58, 196-197.
381. Ruppert, D., and Carroll, R. J. (1980), “Trimmed Least Squares Es-
timation in the Linear Model,” Journal of the American Statistical
Association, 75, 828-838.
382. Satoh, K., and Ohtaki, M. (2004), “A Note on Multiple Regression for
Single Index Model,” Communications in Statistics Theory and Meth-
ods, 33, 2409-2422.
384. Seber, G.A.F., and Lee, A.J. (2003), Linear Regression Analysis, 2nd
ed., John Wiley and Sons, NY.
390. Shorack, G.R., and Wellner, J.A. (1986), Empirical Processes With
Applications to Statistics, John Wiley and Sons, NY.
395. Simonoff, J.S., and Tsai, C. (2002), “Score Tests for the Single Index
Model,” Technometrics, 44, 142-151.
396. Simpson, D.G., Ruppert, D., and Carroll, R.J. (1992), “On One-Step
GM Estimates and Stability of Inferences in Linear Regression,” Jour-
nal of the American Statistical Association, 87, 439-450.
BIBLIOGRAPHY 503
398. Sommer, S., and Huggins, R.M. (1996), “Variables Selection Using the
Wald Test and a Robust Cp ,” Applied Statistics, 45, 15-29.
402. Staudte, R.G., and Sheather, S.J. (1990), Robust Estimation and Test-
ing, John Wiley and Sons, NY.
404. Stefanski, L.A., and Boos, D.D. (2002), “The Calculus of M–estimators,”
The American Statistician, 56, 29-38.
406. Stigler, S.M. (1973b), “Simon Newcomb, Percy Daniell, and the History
of Robust Estimation 1885-1920,” Journal of the American Statistical
Association, 68, 872-878.
407. Stigler, S.M (1977), “Do Robust Estimators Work with Real Data?”
The Annals of Statistics, 5, 1055-1098.
409. Street, J.O., Carroll, R.J., and Ruppert, D. (1988), “A Note on Com-
puting Regression Estimates Via Iteratively Reweighted Least Squares,”
The American Statistician, 42, 152-154.
412. Stromberg, A.J., Hawkins, D.M., and Hössjer, O. (2000), “The Least
Trimmed Differences Regression Estimator and Alternatives,” Journal
of the American Statistical Association, 95, 853-864.
413. Tableman, M. (1994a), “The Influence Functions for the Least Trimmed
Squares and the Least Trimmed Absolute Deviations Estimators,” Statis-
tics and Probability Letters, 19, 329-337.
415. Thode, H.C. (2002), Testing for Normality, Marcel Dekker, NY.
418. Tong, H. (1977), “Some Comments on the Canadian Lynx Data,” Jour-
nal of the Royal Statistical Society, A, 140, 432-468.
425. Tukey, J.W., and McLaughlin, D.H. (1963), “Less Vulnerable Con-
fidence and Significance Procedures for Location Based on a Single
Sample: Trimming/Winsorization 1,” Sankhya, A, 25, 331-352.
428. Velleman, P.F., and Welsch, R.E. (1981), “Efficient Computing of Re-
gression Diagnostics,” The American Statistician, 35, 234-242.
429. Venables, W.N., and Ripley, B.D. (1997), Modern Applied Statistics
with S-PLUS, 2nd ed., Springer-Verlag, NY.
430. Víšek, J.Á. (1996), “On High Breakdown Point Estimation,” Compu-
tational Statistics, 11, 137-146.
431. Wackerly, D.D., Mendenhall, W., and Scheaffer, R.L. (2002), Mathe-
matical Statistics with Applications, 6th ed., Duxbury, Pacific Grove,
CA.
432. Wand, M.P. (1999), “A Central Limit Theorem for Local Polynomial
Backfitting Estimators,” Journal of Multivariate Analysis, 70, 57-65.
434. Weisberg, S. (2005), Applied Linear Regression, 3rd ed., John Wiley
and Sons, NY.
435. Weisberg, S., and Welsh, A.H. (1994), “Adapting for the Missing Link,”
The Annals of Statistics, 22, 1674-1700.
436. Welch, B.L. (1937), “The Significance of the Difference Between Two
Means When the Population Variances are Unequal,” Biometrika, 29,
350-362.
437. Welsh, A.H. (1986), “Bahadur Representations for Robust Scale Esti-
mators Based on Regression Residuals,” The Annals of Statistics, 14,
1246-1251.
443. Willems, G., Pison, G., Rousseeuw, P.J., and Van Aelst, S. (2002), “A
Robust Hotelling Test,” Metrika, 55, 125-138.
446. Woodruff, D.L., and Rocke, D.M. (1993), “Heuristic Search Algorithms
for the Minimum Volume Ellipsoid,” Journal of Computational and
Graphical Statistics, 2, 69-95.
447. Woodruff, D.L., and Rocke, D.M. (1994), “Computable Robust Esti-
mation of Multivariate Location and Shape in High Dimension Using
Compound Estimators,” Journal of the American Statistical Associa-
tion, 89, 888-896.
448. Xia, Y., Tong, H., Li, W.K., and Zhu, L.-X. (2002), “An Adaptive
Estimation of Dimension Reduction Space,” (with discussion), Journal
of the Royal Statistical Society, B, 64, 363-410.
449. Yeo, I.K., and Johnson, R. (2000), “A New Family of Power Transfor-
mations to Improve Normality or Symmetry,” Biometrika, 87, 954-959.
450. Yin, X.R., and Cook, R.D. (2002), “Dimension Reduction for the Con-
ditional kth Moment in Regression,” Journal of the Royal Statistical
Society, B, 64, 159-175.
451. Yin, X., and Cook, R.D. (2003), “Estimating Central Subspaces Via
Inverse Third Moments,” Biometrika, 90, 113-125.
453. Yuen, K.K. (1974), “The Two-Sample Trimmed t for Unequal Popula-
tion Variances,” Biometrika, 61, 165-170.
INDEX
Welch, 43
Welch intervals, 43
Wellner, 51, 52, 64, 216
Welsch, vii, 202
Welsh, 63, 64, 202, 203, 217, 228,
248, 373
White, ix, 303
Whitten, 89, 92
Wichern, ix, 286, 296, 300, 311
Wilcox, vii, 64
Wilcoxon rank estimator, 228, 342
Wilks, 19, 441
Willems, 332
Wilson, 143, 166, 360
Wilson–Hilferty approximation, 76,
82
Winkelmann, 417
Winsor’s principle, 347
Winsorized mean, 44, 63
Winsorized random variable, 50, 102
Wishart distribution, 249
Wisnowski, 312
Wood, 143, 362
Woodruff, 264, 272, 297, 310, 311
Wu, 166
Zamar, 310
Zhu, 319, 332, 372, 438
Zuo, 275, 295