Handbook of Regression Analysis
With Applications in R
WILEY SERIES IN PROBABILITY AND STATISTICS
Established by WALTER A. SHEWHART and SAMUEL S. WILKS
Editors
David J. Balding, Noel A.C. Cressie, Garrett M. Fitzmaurice, Harvey
Goldstein, Geert Molenberghs, David W. Scott, Adrian F.M. Smith, and
Ruey S. Tsay
Editors Emeriti
Vic Barnett, Ralph A. Bradley, J. Stuart Hunter, J.B. Kadane, David G.
Kendall, and Jozef L. Teugels
A complete list of the titles in this series appears at the end of this volume.
Handbook of Regression
Analysis With Applications
in R

Second Edition

Samprit Chatterjee
New York University, New York, USA

Jeffrey S. Simonoff
New York University, New York, USA
This second edition first published 2020
© 2020 John Wiley & Sons, Inc

Edition History
Wiley-Blackwell (1e, 2013)

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by
law. Advice on how to obtain permission to reuse material from this title is available at
http://www.wiley.com/go/permissions.

The right of Samprit Chatterjee and Jeffrey S. Simonoff to be identified as the authors of this work has been
asserted in accordance with law.

Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office
111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us
at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that
appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty


While the publisher and authors have used their best efforts in preparing this work, they make no representations
or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all
warranties, including without limitation any implied warranties of merchantability or fitness for a particular
purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional
statements for this work. The fact that an organization, website, or product is referred to in this work as a citation
and/or potential source of further information does not mean that the publisher and authors endorse the
information or services the organization, website, or product may provide or recommendations it may make. This
work is sold with the understanding that the publisher is not engaged in rendering professional services. The
advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist
where appropriate. Further, readers should be aware that websites listed in this work may have changed or
disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be
liable for any loss of profit or any other commercial damages, including but not limited to special, incidental,
consequential, or other damages.

Library of Congress Cataloging-in-Publication Data

Names: Chatterjee, Samprit, 1938- author. | Simonoff, Jeffrey S., author.


Title: Handbook of regression analysis with applications in R / Professor
Samprit Chatterjee, New York University, Professor Jeffrey S. Simonoff,
New York University.
Other titles: Handbook of regression analysis
Description: Second edition. | Hoboken, NJ : Wiley, 2020. | Series: Wiley
series in probability and statistics | Revised edition of: Handbook of
regression analysis. 2013. | Includes bibliographical references and
index.
Identifiers: LCCN 2020006580 (print) | LCCN 2020006581 (ebook) | ISBN
9781119392378 (hardback) | ISBN 9781119392477 (adobe pdf) | ISBN
9781119392484 (epub)
Subjects: LCSH: Regression analysis--Handbooks, manuals, etc. | R (Computer
program language)
Classification: LCC QA278.2 .C498 2020 (print) | LCC QA278.2 (ebook) |
DDC 519.5/36--dc23
LC record available at https://lccn.loc.gov/2020006580
LC ebook record available at https://lccn.loc.gov/2020006581

Cover Design: Wiley


Cover Image: © Dmitriy Rybin/Shutterstock

Set in 10.82/12pt AGaramondPro by SPi Global, Chennai, India

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1
Dedicated to everyone who labors in the field
of statistics, whether they are students,
teachers, researchers, or data analysts.
Contents

Preface to the Second Edition xv


Preface to the First Edition xix

Part I
The Multiple Linear Regression Model

1 Multiple Linear Regression 3


1.1 Introduction 3
1.2 Concepts and Background Material 4
1.2.1 The Linear Regression Model 4
1.2.2 Estimation Using Least Squares 5
1.2.3 Assumptions 8
1.3 Methodology 9
1.3.1 Interpreting Regression Coefficients 9
1.3.2 Measuring the Strength of the Regression
Relationship 10
1.3.3 Hypothesis Tests and Confidence Intervals
for β 12
1.3.4 Fitted Values and Predictions 13
1.3.5 Checking Assumptions Using Residual Plots 14
1.4 Example — Estimating Home Prices 15
1.5 Summary 19

2 Model Building 23
2.1 Introduction 23
2.2 Concepts and Background Material 24
2.2.1 Using Hypothesis Tests to Compare Models 24
2.2.2 Collinearity 26
2.3 Methodology 29
2.3.1 Model Selection 29
2.3.2 Example — Estimating Home Prices
(continued) 31
2.4 Indicator Variables and Modeling Interactions 38
2.4.1 Example — Electronic Voting and the 2004
Presidential Election 40
2.5 Summary 46


Part II
Addressing Violations of Assumptions

3 Diagnostics for Unusual Observations 53


3.1 Introduction 53
3.2 Concepts and Background Material 54
3.3 Methodology 56
3.3.1 Residuals and Outliers 56
3.3.2 Leverage Points 57
3.3.3 Influential Points and Cook’s Distance 58
3.4 Example — Estimating Home Prices (continued) 60
3.5 Summary 63

4 Transformations and Linearizable Models 67


4.1 Introduction 67
4.2 Concepts and Background Material: The Log-Log Model 69
4.3 Concepts and Background Material: Semilog Models 69
4.3.1 Logged Response Variable 70
4.3.2 Logged Predictor Variable 70
4.4 Example — Predicting Movie Grosses After One Week 71
4.5 Summary 77

5 Time Series Data and Autocorrelation 79


5.1 Introduction 79
5.2 Concepts and Background Material 81
5.3 Methodology: Identifying Autocorrelation 83
5.3.1 The Durbin-Watson Statistic 83
5.3.2 The Autocorrelation Function (ACF) 84
5.3.3 Residual Plots and the Runs Test 85
5.4 Methodology: Addressing Autocorrelation 86
5.4.1 Detrending and Deseasonalizing 86
5.4.2 Example — e-Commerce Retail Sales 87
5.4.3 Lagging and Differencing 93
5.4.4 Example — Stock Indexes 94
5.4.5 Generalized Least Squares (GLS):
The Cochrane-Orcutt Procedure 99
5.4.6 Example — Time Intervals Between Old Faithful
Geyser Eruptions 100
5.5 Summary 104

Part III
Categorical Predictors

6 Analysis of Variance 109


6.1 Introduction 109
6.2 Concepts and Background Material 110
6.2.1 One-Way ANOVA 110
6.2.2 Two-Way ANOVA 111
6.3 Methodology 113
6.3.1 Codings for Categorical Predictors 113
6.3.2 Multiple Comparisons 118
6.3.3 Levene’s Test and Weighted Least Squares 120
6.3.4 Membership in Multiple Groups 123
6.4 Example — DVD Sales of Movies 125
6.5 Higher-Way ANOVA 130
6.6 Summary 132

7 Analysis of Covariance 135


7.1 Introduction 135
7.2 Methodology 136
7.2.1 Constant Shift Models 136
7.2.2 Varying Slope Models 137
7.3 Example — International Grosses of Movies 137
7.4 Summary 142

Part IV
Non-Gaussian Regression Models

8 Logistic Regression 145


8.1 Introduction 145
8.2 Concepts and Background Material 147
8.2.1 The Logit Response Function 148
8.2.2 Bernoulli and Binomial Random Variables 149
8.2.3 Prospective and Retrospective Designs 149
8.3 Methodology 152
8.3.1 Maximum Likelihood Estimation 152
8.3.2 Inference, Model Comparison, and Model
Selection 153

8.3.3 Goodness-of-Fit 155


8.3.4 Measures of Association and Classification
Accuracy 157
8.3.5 Diagnostics 159
8.4 Example — Smoking and Mortality 159
8.5 Example — Modeling Bankruptcy 163
8.6 Summary 168

9 Multinomial Regression 173


9.1 Introduction 173
9.2 Concepts and Background Material 174
9.2.1 Nominal Response Variable 174
9.2.2 Ordinal Response Variable 176
9.3 Methodology 178
9.3.1 Estimation 178
9.3.2 Inference, Model Comparisons, and Strength of
Fit 178
9.3.3 Lack of Fit and Violations of
Assumptions 180
9.4 Example — City Bond Ratings 180
9.5 Summary 184

10 Count Regression 187


10.1 Introduction 187
10.2 Concepts and Background Material 188
10.2.1 The Poisson Random Variable 188
10.2.2 Generalized Linear Models 189
10.3 Methodology 190
10.3.1 Estimation and Inference 190
10.3.2 Offsets 191
10.4 Overdispersion and Negative Binomial Regression 192
10.4.1 Quasi-likelihood 192
10.4.2 Negative Binomial Regression 193
10.5 Example — Unprovoked Shark Attacks in Florida 194
10.6 Other Count Regression Models 201
10.7 Poisson Regression and Weighted Least Squares 203
10.7.1 Example — International Grosses of Movies
(continued) 204
10.8 Summary 206

11 Models for Time-to-Event (Survival) Data 209


11.1 Introduction 210
11.2 Concepts and Background Material 211
11.2.1 The Nature of Survival Data 211
11.2.2 Accelerated Failure Time Models 212
11.2.3 The Proportional Hazards Model 214

11.3 Methodology 214


11.3.1 The Kaplan-Meier Estimator and the Log-Rank
Test 214
11.3.2 Parametric (Likelihood) Estimation 219
11.3.3 Semiparametric (Partial Likelihood)
Estimation 221
11.3.4 The Buckley-James Estimator 223
11.4 Example — The Survival of Broadway Shows
(continued) 223
11.5 Left-Truncated/Right-Censored Data and Time-Varying
Covariates 230
11.5.1 Left-Truncated/Right-Censored Data 230
11.5.2 Example — The Survival of Broadway Shows
(continued) 233
11.5.3 Time-Varying Covariates 233
11.5.4 Example — Female Heads of Government 235
11.6 Summary 238

Part V
Other Regression Models

12 Nonlinear Regression 243


12.1 Introduction 243
12.2 Concepts and Background Material 244
12.3 Methodology 246
12.3.1 Nonlinear Least Squares Estimation 246
12.3.2 Inference for Nonlinear Regression Models 247
12.4 Example — Michaelis-Menten Enzyme Kinetics 248
12.5 Summary 252

13 Models for Longitudinal and Nested Data 255


13.1 Introduction 255
13.2 Concepts and Background Material 257
13.2.1 Nested Data and ANOVA 257
13.2.2 Longitudinal Data and Time Series 258
13.2.3 Fixed Effects Versus Random Effects 259
13.3 Methodology 260
13.3.1 The Linear Mixed Effects Model 260
13.3.2 The Generalized Linear Mixed Effects Model 262
13.3.3 Generalized Estimating Equations 262
13.3.4 Nonlinear Mixed Effects Models 263
13.4 Example — Tumor Growth in a Cancer Study 264
13.5 Example — Unprovoked Shark Attacks in the United
States 269
13.6 Summary 275

14 Regularization Methods and Sparse Models 277


14.1 Introduction 277
14.2 Concepts and Background Material 278
14.2.1 The Bias–Variance Tradeoff 278
14.2.2 Large Numbers of Predictors and Sparsity 279
14.3 Methodology 280
14.3.1 Forward Stepwise Regression 280
14.3.2 Ridge Regression 281
14.3.3 The Lasso 281
14.3.4 Other Regularization Methods 283
14.3.5 Choosing the Regularization Parameter(s) 284
14.3.6 More Structured Regression Problems 285
14.3.7 Cautions About Regularization Methods 286
14.4 Example — Human Development Index 287
14.5 Summary 289

Part VI
Nonparametric and Semiparametric
Models

15 Smoothing and Additive Models 295


15.1 Introduction 296
15.2 Concepts and Background Material 296
15.2.1 The Bias–Variance Tradeoff 296
15.2.2 Smoothing and Local Regression 297
15.3 Methodology 298
15.3.1 Local Polynomial Regression 298
15.3.2 Choosing the Bandwidth 298
15.3.3 Smoothing Splines 299
15.3.4 Multiple Predictors, the Curse of Dimensionality, and
Additive Models 300
15.4 Example — Prices of German Used Automobiles 301
15.5 Local and Penalized Likelihood Regression 304
15.5.1 Example — The Bechdel Rule and Hollywood
Movies 305
15.6 Using Smoothing to Identify Interactions 307
15.6.1 Example — Estimating Home Prices
(continued) 308
15.7 Summary 310

16 Tree-Based Models 313


16.1 Introduction 314
16.2 Concepts and Background Material 314
16.2.1 Recursive Partitioning 314
16.2.2 Types of Trees 317

16.3 Methodology 318


16.3.1 CART 318
16.3.2 Conditional Inference Trees 319
16.3.3 Ensemble Methods 320
16.4 Examples 321
16.4.1 Estimating Home Prices (continued) 321
16.4.2 Example — Courtesy in Airplane Travel 322
16.5 Trees for Other Types of Data 327
16.5.1 Trees for Nested and Longitudinal Data 327
16.5.2 Survival Trees 328
16.6 Summary 332

Bibliography 337
Index 343
Preface to the Second Edition

The years since the first edition of this book appeared have been fast-moving
in the world of data analysis and statistics. Algorithmically-based methods
operating under the banner of machine learning, artificial intelligence, or
data science have come to the forefront of public perceptions about how to
analyze data, and more than a few pundits have predicted the demise of classic
statistical modeling.
To paraphrase Mark Twain, we believe that reports of the (impending)
death of statistical modeling in general, and regression modeling in particular,
are exaggerated. The great advantage that statistical models have over “black
box” algorithms is that in addition to effective prediction, their transparency
also provides guidance about the actual underlying process (which is crucial
for decision making), and affords the possibilities of making inferences and
distinguishing real effects from random variation based on those models.
There have been laudable attempts to encourage making machine learning
algorithms interpretable in the ways regression models are (Rudin, 2019), but
we believe that models based on statistical considerations and principles will
have a place in the analyst’s toolkit for a long time to come.
Of course, part of that usefulness comes from the ability to generalize
regression models to more complex situations, and that is the thrust of the
changes in this new edition. One thing that hasn’t changed is the philosophy
behind the book, and our recommendations on how it can be best used, and
we encourage the reader to refer to the preface to the first edition for guidance
on those points. There have been small changes to the original chapters, and
broad descriptions of those chapters can also be found in the preface to the
first edition. The five new chapters (Chapters 11, 13, 14, 15, and 16, with
the former chapter 11 on nonlinear regression moving to Chapter 12) expand
greatly on the power and applicability of regression models beyond what
was discussed in the first edition. For this reason many more references are
provided in these chapters than in the earlier ones, since some of the material
in those chapters is less established and less well-known, with much of it still
the subject of active research. In keeping with that, we do not spend much
(or any) time on issues for which there still isn’t necessarily a consensus in the
statistical community, but point to books and monographs that can help the
analyst get some perspective on that kind of material.
Chapter 11 discusses the modeling of time-to-event data, often referred
to as survival data. The response variable measures the length of time until an
event occurs, and a common complicator is that sometimes it is only known
that a response value is greater than some number; that is, it is right-censored.
This can naturally occur, for example, in a clinical trial in which subjects
enter the study at varying times, and the event of interest has not occurred at
the end of the trial. Analysis focuses on the survival function (the probability
of surviving past a given time) and the hazard function (the instantaneous
probability of the event occurring at a given time given survival to that
time). Parametric models based on appropriate distributions like the Weibull
or log-logistic can be fit that take censoring into account. Semiparametric
models like the Cox proportional hazards model (the most commonly-used
model) and the Buckley-James estimator are also available, which weaken
distributional assumptions. Modeling can be adapted to situations where
event times are truncated, and also when there are covariates that change over
the life of the subject.
Chapter 13 extends applications to data with multiple observations for
each subject consistent with some structure from the underlying process. Such
data can take the form of nested or clustered data (such as students all in
one classroom) or longitudinal data (where a variable is measured at multiple
times for each subject). In this situation ignoring that structure results in an
induced correlation that reflects unmodeled differences between classrooms
and subjects, respectively. Mixed effects models generalize analysis of variance
(ANOVA) models and time series models to this more complicated situation.
Models with linear effects based on Gaussian distributions can be generalized
to nonlinear models, and also can be generalized to non-Gaussian distributions
through the use of generalized linear mixed effects models.
Modern data applications can involve very large (even massive) numbers of
predictors, which can cause major problems for standard regression methods.
Best subsets regression (discussed in Chapter 2) does not scale well to very
large numbers of predictors, and Chapter 14 discusses approaches that can
accomplish that. Forward stepwise regression, in which potential predictors
are stepped in one at a time, is an alternative to best subsets that scales
to massive data sets. A systematic approach to reducing the dimensionality
of a chosen regression model is through the use of regularization, in which
the usual estimation criterion is augmented with a penalty that encourages
sparsity; the most commonly-used version of this is the lasso estimator, and it
and its generalizations are discussed further.
Chapters 15 and 16 discuss methods that move away from specified
relationships between the response and the predictor to nonparametric and
semiparametric methods, in which the data are used to choose the form of
the underlying relationship. In Chapter 15 linear or (specifically specified)
nonlinear relationships are replaced with the notion of relationships taking the
form of smooth curves and surfaces. Estimation at a particular location is based
on local information; that is, the values of the response in a local neighborhood
of that location. This can be done through local versions of weighted least
squares (local polynomial estimation) or local regularization (smoothing
splines). Such methods can also be used to help identify interactions between
numerical predictors in linear regression modeling. Single predictor smoothing
estimators can be generalized to multiple predictors through the use of additive
functions of smooth curves. Chapter 16 focuses on an extremely flexible class of
nonparametric regression estimators, tree-based methods. Trees are based on
the notion of binary recursive partitioning. At each step a set of observations (a
node) is either split into two parts (children nodes) on the basis of the values of
a chosen variable, or is not split at all, based on encouraging homogeneity in the
children nodes. This approach provides nonparametric alternatives to linear
regression (regression trees), logistic and multinomial regression (classification
trees), accelerated failure time and proportional hazards regression (survival
trees) and mixed effects regression (longitudinal trees).
A final small change from the first edition to the second edition is in the
title, as it now includes the phrase With Applications in R. This is not really
a change, of course, as all of the analyses in the first edition were performed
using the statistics package R. Code for the output and figures in the book
can (still) be found at its associated web site at
http://people.stern.nyu.edu/jsimonof/RegressionHandbook/. As was the case in the
first edition, even though analyses are performed in R, we still refer to general
issues relevant to a data analyst in the use of statistical software even if those
issues don’t specifically apply to R.
We would like to once again thank our students and colleagues for their
encouragement and support, and in particular students for the tough questions
that have definitely affected our views on statistical modeling and by extension
this book. We would like to thank Jon Gurstelle, and later Kathleen Santoloci
and Mindy Okura-Marszycki, for approaching us with encouragement to
undertake a second edition. We would like to thank Sarah Keegan for her
patient support in bringing the book to fruition in her role as Project Editor.
We would like to thank Roni Chambers for computing assistance, and Glenn
Heller and Marc Scott for looking at earlier drafts of chapters. Finally, we
would like to thank our families for their continuing love and support.

SAMPRIT CHATTERJEE
Brooksville, Maine

JEFFREY S. SIMONOFF
New York, New York

October, 2019
Preface to the First Edition

How to Use This Book


This book is designed to be a practical guide to regression modeling. There is
little theory here, and methodology appears in the service of the ultimate goal
of analyzing real data using appropriate regression tools. As such, the target
audience of the book includes anyone who is faced with regression data [that
is, data where there is a response variable that is being modeled as a function
of other variable(s)], and whose goal is to learn as much as possible from
that data.
The book can be used as a text for an applied regression course (indeed,
much of it is based on handouts that have been given to students in such a
course), but that is not its primary purpose; rather, it is aimed much more
broadly as a source of practical advice on how to address the problems that
come up when dealing with regression data. While a text is usually organized
in a way that makes the chapters interdependent, successively building on
each other, that is not the case here. Indeed, we encourage readers to dip into
different chapters for practical advice on specific topics as needed. The pace
of the book is faster than might typically be the case for a text. The coverage,
while at an applied level, does not shy away from sophisticated concepts. It is
distinct from, for example, Chatterjee and Hadi (2012), while also having less
theoretical focus than texts such as Greene (2011), Montgomery et al. (2012),
or Sen and Srivastava (1990).
This, however, is not a cookbook that presents a mechanical approach to
doing regression analysis. Data analysis is perhaps an art, and certainly a craft;
we believe that the goal of any data analysis book should be to help analysts
develop the skills and experience necessary to adjust to the inevitable twists
and turns that come up when analyzing real data.
We assume that the reader possesses a nodding acquaintance with regres-
sion analysis. The reader should be familiar with the basic terminology and
should have been exposed to basic regression techniques and concepts, at least
at the level of simple (one-predictor) linear regression. We also assume that
the user has access to a computer with an adequate regression package. The
material presented here is not tied to any particular software. Almost all of the
analyses described here can be performed by most standard packages, although
the ease of doing this could vary. All of the analyses presented here were
done using the free package R (R Development Core Team, 2017), which is
available for many different operating system platforms (see
http://www.R-project.org/ for more information). Code for the output and figures
in the book can be found at its associated web site at
http://people.stern.nyu.edu/jsimonof/RegressionHandbook/.
Each chapter of the book is laid out in a similar way, with most having at
least four sections of specific types. First is an introduction, where the general
issues that will be discussed in that chapter are presented. A section on concepts
and background material follows, where a discussion of the relationship of
the chapter’s material to the broader study of regression data is the focus.
This section also provides any theoretical background for the material that is
necessary. Sections on methodology follow, where the specific tools used in
the chapter are discussed. This is where relevant algorithmic details are likely
to appear. Finally, each chapter includes at least one analysis of real data using
the methods discussed in the chapter (as well as appropriate material from
earlier chapters), including both methodological and graphical analyses.
The book begins with discussion of the multiple regression model. Many
regression textbooks start with discussion of simple regression before moving
on to multiple regression. This is quite reasonable from a pedagogical point
of view, since simple regression has the great advantage of being easy to
understand graphically, but from a practical point of view simple regression
is rarely the primary tool in analysis of real data. For that reason, we start
with multiple regression, and note the simplifications that come from the
special case of a single predictor. Chapter 1 describes the basics of the multiple
regression model, including the assumptions being made, and both estimation
and inference tools, while also giving an introduction to the use of residual
plots to check assumptions.
Since it is unlikely that the first model examined will ultimately be the
final preferred model, Chapter 2 focuses on the very important areas of model
building and model selection. This includes addressing the issue of collinearity,
as well as the use of both hypothesis tests and information measures to help
choose among candidate models.
Chapters 3 through 5 study common violations of regression assumptions,
and methods available to address those model violations. Chapter 3 focuses on
unusual observations (outliers and leverage points), while Chapter 4 describes
how transformations (especially the log transformation) can often address both
nonlinearity and nonconstant variance violations. Chapter 5 is an introduction
to time series regression, and the problems caused by autocorrelation. Time
series analysis is a vast area of statistical methodology, so our goal in this
chapter is only to provide a good practical introduction to that area in the
context of regression analysis.
Chapters 6 and 7 focus on the situation where there are categorical variables
among the predictors. Chapter 6 treats analysis of variance (ANOVA) models,
which include only categorical predictors, while Chapter 7 looks at analysis of
covariance (ANCOVA) models, which include both numerical and categorical
predictors. The examination of interaction effects is a fundamental aspect of
these models, as are questions related to simultaneous comparison of many
groups to each other. Data of this type often exhibit nonconstant variance
related to the different subgroups in the population, and the appropriate tool
to address this issue, weighted least squares, is also a focus here.
Chapters 8 though 10 examine the situation where the nature of the
response variable is such that Gaussian-based least squares regression is no
longer appropriate. Chapter 8 focuses on logistic regression, designed for
binary response data and based on the binomial random variable. While
there are many parallels between logistic regression analysis and least squares
regression analysis, there are also issues that come up in logistic regression
that require special care. Chapter 9 uses the multinomial random variable to
generalize the models of Chapter 8 to allow for multiple categories in the
response variable, outlining models designed for response variables that either
do or do not have ordered categories. Chapter 10 focuses on response data in
the form of counts, where distributions like the Poisson and negative binomial
play a central role. The connection between all these models through the
generalized linear model framework is also exploited in this chapter.
The final chapter focuses on situations where linearity does not hold,
and a nonlinear relationship is necessary. Although these models are based on
least squares, from both an algorithmic and inferential point of view there
are strong connections with the models of Chapters 8 through 10, which we
highlight.
This Handbook can be used in several different ways. First, a reader may
use the book to find information on a specific topic. An analyst might want
additional information on, for example, logistic regression or autocorrelation.
The chapters on these (and other) topics provide the reader with this subject
matter information. As noted above, the chapters also include at least one
analysis of a data set, a clarification of computer output, and reference to
sources where additional material can be found. The chapters in the book are
to a large extent self-contained and can be consulted independently of other
chapters.
The book can also be used as a template for what we view as a reasonable
approach to data analysis in general. This is based on the cyclical paradigm
of model formulation, model fitting, model evaluation, and model updating
leading back to model (re)formulation. Statistical significance of test statistics
does not necessarily mean that an adequate model has been obtained. Further
analysis needs to be performed before the fitted model can be regarded as
an acceptable description of the data, and this book concentrates on this
important aspect of regression methodology. Detection of deficiencies of fit
is based on both testing and graphical methods, and both approaches are
highlighted here.
This preface is intended to indicate ways in which the Handbook can
be used. Our hope is that it will be a useful guide for data analysts, and will
help contribute to effective analyses. We would like to thank our students and
colleagues for their encouragement and support. We hope we have provided
them with a book of which they would approve. We would like to thank Steve
Quigley, Jackie Palmieri, and Amy Hendrickson for their help in bringing this
manuscript to print. We would also like to thank our families for their love
and support.

SAMPRIT CHATTERJEE
Brooksville, Maine

JEFFREY S. SIMONOFF
New York, New York

August, 2012
Part One

The Multiple Linear Regression Model
Chapter One

Multiple Linear Regression


1.1 Introduction 3
1.2 Concepts and Background Material 4
1.2.1 The Linear Regression Model 4
1.2.2 Estimation Using Least Squares 5
1.2.3 Assumptions 8
1.3 Methodology 9
1.3.1 Interpreting Regression Coefficients 9
1.3.2 Measuring the Strength of the Regression
Relationship 10
1.3.3 Hypothesis Tests and Confidence Intervals for β 12
1.3.4 Fitted Values and Predictions 13
1.3.5 Checking Assumptions Using Residual Plots 14
1.4 Example — Estimating Home Prices 15
1.5 Summary 19

1.1 Introduction
This is a book about regression modeling, but when we refer to regression
models, what do we mean? The regression framework can be characterized in
the following way:
1. We have one particular variable that we are interested in understanding
or modeling, such as sales of a particular product, sale price of a home, or
voting preference of a particular voter. This variable is called the target,
response, or dependent variable, and is usually represented by y.
2. We have a set of p other variables that we think might be useful in
predicting or modeling the target variable (the price of the product, the
competitor’s price, and so on; or the lot size, number of bedrooms, number
of bathrooms of the home, and so on; or the gender, age, income, party
membership of the voter, and so on). These are called the predicting, or
independent variables, and are usually represented by x1 , x2 , etc.
Typically, a regression analysis is used for one (or more) of three purposes:
1. modeling the relationship between x and y ;
2. prediction of the target variable (forecasting);
3. testing of hypotheses.
In this chapter, we introduce the basic multiple linear regression model,
and discuss how this model can be used for these three purposes. Specifically, we
discuss the interpretations of the estimates of different regression parameters,
the assumptions underlying the model, measures of the strength of the
relationship between the target and predictor variables, the construction of
tests of hypotheses and intervals related to regression parameters, and the
checking of assumptions using diagnostic plots.

1.2 Concepts and Background Material


1.2.1 THE LINEAR REGRESSION MODEL
The data consist of n observations, which are sets of observed values {x1i , x2i ,
. . . , xpi , yi } that represent a random sample from a larger population. It is
assumed that these observations satisfy a linear relationship,
yi = β0 + β1 x1i + · · · + βp xpi + εi , (1.1)
where the β coefficients are unknown parameters, and the εi are random error
terms. By a linear model, it is meant that the model is linear in the parameters;
a quadratic model,
yi = β0 + β1 xi + β2 xi² + εi ,
paradoxically enough, is a linear model, since x and x² are just versions of x1
and x2 .
It is important to recognize that this, or any statistical model, is not
viewed as a true representation of reality; rather, the goal is that the model
be a useful representation of reality. A model can be used to explore the
relationships between variables and make accurate forecasts based on those
relationships even if it is not the “truth.” Further, any statistical model is
only temporary, representing a provisional version of views about the random
process being studied. Models can, and should, change, based on analysis using
the current model, selection among several candidate models, the acquisition
of new data, new understanding of the underlying random process, and so
on. Further, it is often the case that there are several different models that
are reasonable representations of reality. Having said this, we will sometimes
refer to the “true” model, but this should be understood as referring to the
underlying form of the currently hypothesized representation of the regression
relationship.

The special case of (1.1) with p = 1 corresponds to the simple regression
model, and is consistent with the representation in Figure 1.1. The solid line
is the true regression line, the expected value of y given the value of x. The
dotted lines are the random errors εi that account for the lack of a perfect
association between the predictor and the target variables.

[FIGURE 1.1: The simple linear regression model. The solid line corresponds to the true regression line, and the dotted lines correspond to the random errors εi.]

1.2.2 ESTIMATION USING LEAST SQUARES


The true regression function represents the expected relationship between the
target and the predictor variables, which is unknown. A primary goal of a
regression analysis is to estimate this relationship, or equivalently, to estimate
the unknown parameters β. This requires a data-based rule, or criterion,
that will give a reasonable estimate. The standard approach is least squares
regression, where the estimates are chosen to minimize

\sum_{i=1}^{n} [yi − (β0 + β1 x1i + · · · + βp xpi)]² . (1.2)

Figure 1.2 gives a graphical representation of least squares that is based
on Figure 1.1. Now the true regression line is represented by the gray line,
and the solid black line is the estimated regression line, designed to estimate
the (unknown) gray line as closely as possible.

[FIGURE 1.2: Least squares estimation for the simple linear regression model, using the same data as in Figure 1.1. The gray line corresponds to the true regression line, the solid black line corresponds to the fitted least squares line (designed to estimate the gray line), and the lengths of the dotted lines correspond to the residuals. The sum of squared values of the lengths of the dotted lines is minimized by the solid black line.]

For any choice of estimated
parameters β̂, the estimated expected response value given the observed
predictor values equals
ŷi = β̂0 + β̂1 x1i + · · · + β̂p xpi ,
and is called the fitted value. The difference between the observed value yi
and the fitted value ŷi is called the residual, the set of which is represented by
the signed lengths of the dotted lines in Figure 1.2. The least squares regression
line minimizes the sum of squares of the lengths of the dotted lines; that is,
the ordinary least squares (OLS) estimates minimize the sum of squares of the
residuals.
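As a minimal R sketch (invented data and parameter values, not one of the book's examples), the following simulates observations from model (1.1) with p = 2 and fits it by least squares with lm(); later snippets in this chapter reuse these objects.

set.seed(123)
n   <- 100
x1  <- runif(n, 0, 10)
x2  <- runif(n, 0, 10)
eps <- rnorm(n, mean = 0, sd = 2)
y   <- 3 + 1.5 * x1 - 0.8 * x2 + eps   # true (β0, β1, β2) = (3, 1.5, -0.8)

fit <- lm(y ~ x1 + x2)   # least squares fit
coef(fit)                # estimated coefficients β̂
head(fitted(fit))        # fitted values ŷi
head(resid(fit))         # residuals ei = yi - ŷi
sum(resid(fit)^2)        # the minimized criterion (1.2)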
In higher dimensions (p > 1), the true and estimated regression relation-
ships correspond to planes (p = 2) or hyperplanes (p ≥ 3), but otherwise the
principles are the same. Figure 1.3 illustrates the case with two predictors.
The length of each vertical line corresponds to a residual (solid lines refer to
positive residuals, while dashed lines refer to negative residuals), and the (least
squares) plane that goes through the observations is chosen to minimize the
sum of squares of the residuals.
[FIGURE 1.3: Least squares estimation for the multiple linear regression model with two predictors. The plane corresponds to the fitted least squares relationship, and the lengths of the vertical lines correspond to the residuals. The sum of squared values of the lengths of the vertical lines is minimized by the plane.]

The linear regression model can be written compactly using matrix
notation. Define the following matrix and vectors:

X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{p1} \\ \vdots & \vdots & & \vdots \\ 1 & x_{1n} & \cdots & x_{pn} \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad β = \begin{pmatrix} β_0 \\ β_1 \\ \vdots \\ β_p \end{pmatrix}, \quad ε = \begin{pmatrix} ε_1 \\ \vdots \\ ε_n \end{pmatrix}.

The regression model (1.1) is then
y = Xβ + ε. (1.3)
The normal equations [which determine the minimizer of (1.2)] can be
shown (using multivariate calculus) to be
(X′X)β̂ = X′y,
which implies that the least squares estimates satisfy
β̂ = (X′X)⁻¹X′y. (1.4)
The fitted values are then
ŷ = Xβ̂ = X(X′X)⁻¹X′y ≡ Hy, (1.5)
where H = X(X′X)⁻¹X′ is the so-called “hat” matrix (since it takes y to ŷ).
The residuals e = y − ŷ thus satisfy
e = y − ŷ = y − X(X′X)⁻¹X′y = (I − X(X′X)⁻¹X′)y, (1.6)
or
e = (I − H)y.
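The matrix formulas (1.4)–(1.6) can be verified directly; this sketch reuses the simulated x1, x2, and y from above. (solve() applied to X′X is fine for illustration, though lm() itself uses a more numerically stable QR decomposition.)

X <- cbind(1, x1, x2)                      # design matrix with a column of ones
beta.hat <- solve(t(X) %*% X, t(X) %*% y)  # solves the normal equations (1.4)
H <- X %*% solve(t(X) %*% X) %*% t(X)      # the "hat" matrix
y.hat <- H %*% y                           # fitted values (1.5)
e <- (diag(n) - H) %*% y                   # residuals (1.6)
all.equal(as.vector(beta.hat), unname(coef(fit)))  # agrees with lm()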

1.2.3 ASSUMPTIONS
The least squares criterion will not necessarily yield sensible results unless
certain assumptions hold. One is given in (1.1) — the linear model should
be appropriate. In addition, the following assumptions are needed to justify
using least squares regression.
1. The expected value of the errors is zero (E(εi ) = 0 for all i). That is, it
cannot be true that for certain observations the model is systematically
too low, while for others it is systematically too high. A violation of this
assumption will lead to difficulties in estimating β0 . More importantly,
this reflects that the model does not include a necessary systematic
component, which has instead been absorbed into the error terms.
2. The variance of the errors is constant (V(εi) = σ² for all i). That is,
it cannot be true that the strength of the model is greater for some
parts of the population (smaller σ ) and less for other parts (larger σ ).
This assumption of constant variance is called homoscedasticity, and its
violation (nonconstant variance) is called heteroscedasticity. A violation
of this assumption means that the least squares estimates are not as efficient
as they could be in estimating the true parameters, and better estimates are
available. More importantly, it also results in poorly calibrated confidence
and (especially) prediction intervals.
3. The errors are uncorrelated with each other. That is, it cannot be true
that knowing that the model underpredicts y (for example) for one
particular observation says anything at all about what it does for any
other observation. This violation most often occurs in data that are
ordered in time (time series data), where errors that are near each other
in time are often similar to each other (such time-related correlation
is called autocorrelation). Violation of this assumption means that the
least squares estimates are not as efficient as they could be in estimating
the true parameters, and more importantly, its presence can lead to very
misleading assessments of the strength of the regression.
4. The errors are normally distributed. This is needed if we want to construct
any confidence or prediction intervals, or hypothesis tests, which we
usually do. If this assumption is violated, hypothesis tests and confidence
and prediction intervals can be very misleading.
Since violation of these assumptions can potentially lead to completely
misleading results, a fundamental part of any regression analysis is to check
them using various plots, tests, and diagnostics.

1.3 Methodology
1.3.1 INTERPRETING REGRESSION COEFFICIENTS
The least squares regression coefficients have very specific meanings. They are
often misinterpreted, so it is important to be clear on what they mean (and do
not mean). Consider first the intercept, β̂0 .
β̂0 : The estimated expected value of the target variable when the predictors
are all equal to zero.
Note that this might not have any physical interpretation, since a zero value for
the predictor(s) might be impossible, or might never come close to occurring
in the observed data. In that situation, it is pointless to try to interpret
this value. If all of the predictors are centered to have zero mean, then β̂0
necessarily equals Y , the sample mean of the target values. Note that if there
is any particular value for each predictor that is meaningful in some sense, if
each variable is centered around its particular value, then the intercept is an
estimate of E(y) when the predictors all have those meaningful values.
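A quick numerical check of this centering fact, continuing the simulated example above (a sketch, not book code):

fit.c <- lm(y ~ I(x1 - mean(x1)) + I(x2 - mean(x2)))  # centered predictors
coef(fit.c)[1]   # the intercept estimate ...
mean(y)          # ... equals the sample mean of y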
The estimated coefficient for the j th predictor (j = 1, . . . , p) is interpreted
in the following way:
β̂j : The estimated expected change in the target variable associated with a one
unit change in the j th predicting variable, holding all else in the model
fixed.
There are several noteworthy aspects to this interpretation. First, note the
word associated — we cannot say that a change in the target variable is caused
by a change in the predictor, only that they are associated with each other.
That is, correlation does not imply causation.
Another key point is the phrase “holding all else in the model fixed,” the
implications of which are often ignored. Consider the following hypothetical
example. A random sample of college students at a particular university is
taken in order to understand the relationship between college grade point
average (GPA) and other variables. A model is built with college GPA as a
function of high school GPA and the standardized Scholastic Aptitude Test
(SAT), with resultant least squares fit
College GPA = 1.3 + .7 × High School GPA − .0001 × SAT.
It is tempting to say (and many people would say) that the coefficient for
SAT score has the “wrong sign,” because it says that higher values of SAT
10 CHAPTER 1 Multiple Linear Regression

are associated with lower values of college GPA. This is not correct. The
problem is that it is likely in this context that what an analyst would find
intuitive is the marginal relationship between college GPA and SAT score alone
(ignoring all else), one that we would indeed expect to be a direct (positive)
one. The regression coefficient does not say anything about that marginal
relationship. Rather, it refers to the conditional (sometimes called partial)
relationship that takes the high school GPA as fixed, which is apparently
that higher values of SAT are associated with lower values of college GPA,
holding high school GPA fixed. High school GPA and SAT are no doubt
related to each other, and it is quite likely that this relationship between
the predictors would complicate any understanding of, or intuition about,
the conditional relationship between college GPA and SAT score. Multiple
regression coefficients should not be interpreted marginally; if you really are
interested in the relationship between the target and a single predictor alone,
you should simply do a regression of the target on only that variable. This
does not mean that multiple regression coefficients are uninterpretable, only
that care is necessary when interpreting them.
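The marginal-versus-conditional distinction can be seen in a small simulation in the spirit of the GPA example. All numbers here are invented, and the SAT coefficient is made ten times larger in magnitude than in the illustrative fit above so that the sign flip is clearly estimable in a modest sample.

set.seed(1)
hs.gpa  <- rnorm(200, mean = 3, sd = 0.4)
sat     <- 800 + 150 * hs.gpa + rnorm(200, sd = 50)  # SAT strongly tied to HS GPA
col.gpa <- 1.3 + 0.7 * hs.gpa - 0.001 * sat + rnorm(200, sd = 0.1)

coef(lm(col.gpa ~ sat))           # marginal slope for SAT: positive
coef(lm(col.gpa ~ hs.gpa + sat))  # conditional slope for SAT: negative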
Another common use of multiple regression that depends on this con-
ditional interpretation of the coefficients is to explicitly include “control”
variables in a model in order to try to account for their effect statistically. This
is particularly important in observational data (data that are not the result of a
designed experiment), since in that case, the effects of other variables cannot be
ignored as a result of random assignment in the experiment. For observational
data it is not possible to physically intervene in the experiment to “hold other
variables fixed,” but the multiple regression framework effectively allows this
to be done statistically.
Having said this, we must recognize that in many situations, it is impossible
from a practical point of view to change one predictor while holding all else
fixed. Thus, while we would like to interpret a coefficient as accounting for the
presence of other predictors in a physical sense, it is important (when dealing
with observational data in particular) to remember that linear regression is at
best only an approximation to the actual underlying random process.

1.3.2 MEASURING THE STRENGTH OF THE REGRESSION


RELATIONSHIP
The least squares estimates possess an important property:

\sum_{i=1}^{n} (yi − Ȳ)² = \sum_{i=1}^{n} (yi − ŷi)² + \sum_{i=1}^{n} (ŷi − Ȳ)².

This formula says that the variability in the target variable (the left side of
the equation, termed the corrected total sum of squares) can be split into two
mutually exclusive parts — the variability left over after doing the regression
(the first term on the right side, the residual sum of squares), and the variability
accounted for by doing the regression (the second term, the regression sum of
squares). This immediately suggests the usefulness of R² as a measure of the
strength of the regression relationship, where

R² = \frac{\sum_i (ŷi − Ȳ)²}{\sum_i (yi − Ȳ)²} ≡ \frac{Regression SS}{Corrected total SS} = 1 − \frac{Residual SS}{Corrected total SS}.
The R² value (also called the coefficient of determination) estimates the
population proportion of variability in y accounted for by the best linear
combination of the predictors. Values closer to 1 indicate a good deal of
predictive power of the predictors for the target variable, while values closer
to 0 indicate little predictive power. An equivalent representation of R² is

R² = corr(yi, ŷi)²,

where

corr(yi, ŷi) = \frac{\sum_i (yi − Ȳ)(ŷi − \bar{ŷ})}{\sqrt{\sum_i (yi − Ȳ)² \sum_i (ŷi − \bar{ŷ})²}}

is the sample correlation coefficient between y and ŷ (this correlation is called
the multiple correlation coefficient). That is, R² is a direct measure of how
similar the observed and fitted target values are.
It can be shown that R² is biased upwards as an estimate of the population
proportion of variability accounted for by the regression. The adjusted R²
corrects this bias, and equals

Ra² = R² − \frac{p}{n − p − 1}(1 − R²). (1.7)
It is apparent from (1.7) that unless p is large relative to n − p − 1 (that is,
unless the number of predictors is large relative to the sample size), R² and
Ra² will be close to each other, and the choice of which to use is a minor
concern. What is perhaps more interesting is the nature of Ra² as providing an
explicit tradeoff between the strength of the fit (the first term, with larger R²
corresponding to stronger fit and larger Ra²) and the complexity of the model
(the second term, with larger p corresponding to more complexity and smaller
Ra²). This tradeoff of fidelity to the data versus simplicity will be important in
the discussion of model selection in Section 2.3.1.
The only parameter left unaccounted for in the estimation scheme is the
variance of the errors σ². An unbiased estimate is provided by the residual
mean square,

σ̂² = \frac{\sum_{i=1}^{n} (yi − ŷi)²}{n − p − 1}. (1.8)
This estimate has a direct, but often underappreciated, use in assessing
the practical importance of the model. Does knowing x1 , . . . , xp really
say anything of value about y ? This isn’t a question that can be answered
completely statistically; it requires knowledge and understanding of the data
and the underlying random process (that is, it requires context). Recall that
the model assumes that the errors are normally distributed with standard
deviation σ . This means that, roughly speaking, 95% of the time an observed
y value falls within ±2σ of the expected response
E(y) = β0 + β1 x1 + · · · + βp xp .
E(y) can be estimated for any given set of x values using
ŷ = β̂0 + β̂1 x1 + · · · + β̂p xp ,
while the square root of the residual mean square (1.8), termed the standard
error of the estimate, provides an estimate of σ that can be used in constructing
this rough prediction interval ±2σ̂ .
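In R these strength-of-fit quantities are all available from summary(); a sketch continuing the simulated fit from above:

s <- summary(fit)
s$r.squared       # R²
s$adj.r.squared   # adjusted R², equation (1.7)
s$sigma           # standard error of the estimate, the square root of (1.8)
cor(y, fitted(fit))^2                 # the multiple correlation squared equals R²
fitted(fit)[1] + c(-2, 2) * s$sigma   # rough 95% interval for one response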

1.3.3 HYPOTHESIS TESTS AND CONFIDENCE INTERVALS


FOR β
There are two types of hypothesis tests of immediate interest related to the
regression coefficients.
1. Do any of the predictors provide predictive power for the target variable?
This is a test of the overall significance of the regression,
H0 : β1 = · · · = βp = 0
versus
Ha : at least one βj ≠ 0, j = 1, . . . , p.
The test of these hypotheses is the F -test,
F = \frac{Regression MS}{Residual MS} ≡ \frac{Regression SS/p}{Residual SS/(n − p − 1)}.
This is referenced against a null F -distribution on (p, n − p − 1) degrees
of freedom.
2. Given the other variables in the model, does a particular predictor provide
additional predictive power? This corresponds to a test of the significance
of an individual coefficient,
H0 : βj = 0, j = 1, . . . , p
versus
Ha : βj ≠ 0.
This is tested using a t-test,
tj = \frac{β̂j}{s.e.(β̂j)},
which is compared to a t-distribution on n − p − 1 degrees of freedom.
Other values of βj can be specified in the null hypothesis (say βj0 ), with
the t-statistic becoming
tj = \frac{β̂j − βj0}{s.e.(β̂j)}. (1.9)

The values of s.e.(β̂j) are obtained as the square roots of the diagonal
elements of V̂(β̂) = (X′X)⁻¹σ̂², where σ̂² is the residual mean square (1.8).
Note that for simple regression (p = 1), the hypotheses corresponding to
the overall significance of the model and the significance of the predictor
are identical,
H0 : β1 = 0
versus
Ha : β1 ≠ 0.
Given the equivalence of the sets of hypotheses, it is not surprising that
the associated tests are also equivalent; in fact, F = t₁², and the associated
tail probabilities of the two tests are identical.

A t-test for the intercept also can be constructed as in (1.9), although this
does not refer to a hypothesis about a predictor, but rather about whether
the expected target is equal to a specified value β00 if all of the predictors
equal zero. As was noted in Section 1.3.1, this is often not physically
meaningful (and therefore of little interest), because the condition that all
predictors equal zero cannot occur, or does not come close to occurring
in the observed data.
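Continuing the simulated example, summary() reports exactly these tests (a t-statistic for each coefficient and the overall F-test at the bottom), and the F = t₁² identity is easy to check for a simple regression (sketch):

summary(fit)   # coefficient t-tests and the overall F-test

fit1 <- lm(y ~ x1)                          # simple regression (p = 1)
t1 <- coef(summary(fit1))["x1", "t value"]
F1 <- summary(fit1)$fstatistic["value"]
c(F = unname(F1), t.squared = t1^2)         # identical up to rounding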
As is always the case, a confidence interval provides an alternative way of
summarizing the degree of precision in the estimate of a regression parameter.
A 100 × (1 − α)% confidence interval for βj has the form
β̂j ± t^{n−p−1}_{α/2} s.e.(β̂j),

where t^{n−p−1}_{α/2} is the appropriate critical value at two-sided level α for a
t-distribution on n − p − 1 degrees of freedom.
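In R these intervals come directly from confint() on the fitted model (simulated example again):

confint(fit, level = 0.95)   # 95% confidence intervals for β0, β1, β2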

1.3.4 FITTED VALUES AND PREDICTIONS


The rough prediction interval ŷ ± 2σ̂ discussed in Section 1.3.2 is an approx-
imate 95% interval because it ignores the variability caused by the need to
estimate σ and uses only an approximate normal-based critical value. A more
accurate assessment of predictive power is provided by a prediction interval
given a particular value of x. This interval provides guidance as to how precise
ŷ0 is as a prediction of y for some particular specified value x0 , where ŷ0
is determined by substituting the values x0 into the estimated regression
equation. Its width depends on both σ̂ and the position of x0 relative to the
centroid of the predictors (the point located at the means of all predictors),
since values farther from the centroid are harder to predict as precisely. Specif-
ically, for a simple regression, the estimated standard error of a predicted value
based on a value x0 of the predicting variable is
s.e.(ŷ0P) = σ̂ \sqrt{1 + \frac{1}{n} + \frac{(x0 − X̄)²}{\sum_i (xi − X̄)²}}.
More generally, the variance of a predicted value is


V̂(ŷ0P) = [1 + x0′(X′X)⁻¹x0]σ̂². (1.10)
Here x0 is taken to include a 1 in the first entry (corresponding to the intercept
in the regression model). The prediction interval is then
ŷ0 ± tα/2,n−p−1 s.e.(ŷ0P),

where s.e.(ŷ0P) = √V̂(ŷ0P).


This prediction interval should not be confused with a confidence
interval for a fitted value. The prediction interval is used to provide an
interval estimate for a prediction of y for one member of the population with a
particular value of x0 ; the confidence interval is used to provide an interval
estimate for the true expected value of y for all members of the population with a
particular value of x0 . The corresponding standard error, termed the standard
error for a fitted value, is the square root of
V̂(ŷ0F) = x0′(X′X)⁻¹x0 σ̂², (1.11)
with corresponding confidence interval

    ŷ0 ± tα/2,n−p−1 s.e.(ŷ0F).
A comparison of the two estimated variances (1.10) and (1.11) shows that the
variance of the predicted value has an extra σ 2 term, which corresponds to
the inherent variability in the population. Thus, the confidence interval for a
fitted value will always be narrower than the prediction interval, and is often
much narrower (especially for large samples), since increasing the sample size
will always improve estimation of the expected response value, but cannot
lessen the inherent variability in the population associated with the prediction
of the target for a single observation.
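
In R, both intervals are produced by predict() applied to the fitted model;
a brief sketch (object names are illustrative):

fit <- lm(y ~ x, data = mydata)   # hypothetical simple regression
new <- data.frame(x = 10)         # the value x0 at which to predict
predict(fit, newdata = new, interval = "prediction", level = 0.95)
predict(fit, newdata = new, interval = "confidence", level = 0.95)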

1.3.5 CHECKING ASSUMPTIONS USING RESIDUAL PLOTS


All of these tests, intervals, predictions, and so on, are based on the belief that
the assumptions of the regression model hold. Thus, it is crucially important
that these assumptions be checked. Remarkably enough, a few very simple
plots can provide much of the evidence needed to check the assumptions.
1. A plot of the residuals versus the fitted values. This plot should have no
pattern to it; that is, no structure should be apparent. Certain kinds of
structure indicate potential problems:
(a) A point (or a few points) isolated at the top or bottom, or left or
right. In addition, often the rest of the points have a noticeable “tilt”
to them. These isolated points are unusual observations and can have
a strong effect on the regression. They need to be examined carefully
and possibly removed from the data set.
(b) An impression of different heights of the point cloud as the plot is
examined from left to right. This indicates potential heteroscedasticity
(nonconstant variance).
2. Plots of the residuals versus each of the predictors. Again, a plot with no
apparent structure is desired.
3. If the data set has a time structure to it, residuals should be plotted in time
order. Again, there should be no apparent pattern. If there is a cyclical
structure, this indicates that the errors are not uncorrelated, as they are
supposed to be (that is, there is potentially autocorrelation in the errors).
4. A normal plot of the residuals. This plot assesses the apparent normality
of the residuals, by plotting the observed ordered residuals on one axis
and the expected positions (under normality) of those ordered residuals
on the other. The plot should look like a straight line (roughly). Isolated
points once again represent unusual observations, while a curved line
indicates that the errors are probably not normally distributed, and tests
and intervals might not be trustworthy.
Note that all of these plots should be routinely examined in any regression
analysis, although in order to save space not all will necessarily be presented in
all of the analyses in the book.
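All four kinds of plots are easily constructed in R from a fitted lm object;
a minimal sketch (model and data names are illustrative):

fit <- lm(y ~ x1 + x2, data = mydata)   # hypothetical model
res <- residuals(fit)
plot(fitted(fit), res)     # 1. residuals versus fitted values
plot(mydata$x1, res)       # 2. residuals versus each predictor, in turn
plot(res, type = "b")      # 3. residuals in time order, if relevant
qqnorm(res); qqline(res)   # 4. normal plot of the residuals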
An implicit assumption in any model that is being used for prediction
is that the future “looks like” the past; that is, it is not sufficient that these
assumptions appear to hold for the available data, as they also must continue
to hold for new data on which the estimated model is applied. Indeed, the
assumption is stronger than that, since it must be the case that the future
is exactly the same as the past, in the sense that all of the properties of the
model, including the precise values of all of the regression parameters, are the
same. This is unlikely to be exactly true, so a more realistic point of view is
that the future should be similar enough to the past so that predictions based
on the past are useful. A related point is that predictions should not be based
on extrapolation, where the predictor values are far from the values used to
build the model. Similarly, if the observations form a time series, predictions
far into the future are unlikely to be very useful.
In general, the more complex a model is, the less likely it is that all
of its characteristics will remain stable going forward, which implies that a
reasonable goal is to try to find a model that is as simple as it can be while
still accounting for the important effects in the data. This leads to questions
of model building, which is the subject of Chapter 2.

1.4 Example — Estimating Home Prices


Determining the appropriate sale price for a home is clearly of great interest
to both buyers and sellers. While this can be done in principle by examining
the prices at which other similar homes have recently sold, the well-known
existence of strong effects related to location means that there are likely to
be relatively few homes with the same important characteristics to make the
comparison. A solution to this problem is the use of hedonic regression models,
where the sale prices of a set of homes in a particular area are regressed on
important characteristics of the home such as the number of bedrooms, the
living area, the lot size, and so on. Academic research on this topic is plentiful,
going back to at least Wabe (1971).
This analysis is based on a sample from public data on sales of one-family
homes in the Levittown, NY area from June 2010 through May 2011.
Levittown is famous as the first planned suburban community built using
mass production methods, being aimed at former members of the military
after World War II. Most of the homes in this community were built in the
late 1940s to early 1950s, without basements and designed to make expansion
on the second floor relatively easy.
For each of the 85 houses in the sample, the number of bedrooms, number
of bathrooms, living area (in square feet), lot size (in square feet), the year
the house was built, and the property taxes are used as potential predictors
of the sale price. In any analysis the first step is to look at the data, and
Figure 1.4 gives scatter plots of sale price versus each predictor. It is apparent
that there is a positive association between sale price and each variable, other
than number of bedrooms and lot size. We also note that there are two houses
with unusually large living areas for this sample, two with unusually large
property taxes (these are not the same two houses), and three that were built
six or seven years later than all of the other houses in the sample.

FIGURE 1.4: Scatter plots of sale price versus each predictor for the home
price data.
The output below summarizes the results of a multiple regression fit.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.149e+06 3.820e+06 -1.871 0.065043 .
Bedrooms -1.229e+04 9.347e+03 -1.315 0.192361
Bathrooms 5.170e+04 1.309e+04 3.948 0.000171 ***
Living.area 6.590e+01 1.598e+01 4.124 9.22e-05 ***
Lot.size -8.971e-01 4.194e+00 -0.214 0.831197
Year.built 3.761e+03 1.963e+03 1.916 0.058981 .
Property.tax 1.476e+00 2.832e+00 0.521 0.603734
---

Signif. codes:
0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 47380 on 78 degrees of freedom


Multiple R-squared: 0.5065, Adjusted R-squared: 0.4685
F-statistic: 13.34 on 6 and 78 DF, p-value: 2.416e-10
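
Output of this form is produced by R's lm() and summary() functions. A
sketch of the call follows; the predictor names match the output above, while
the data frame name, file name, and response name Sale.price are assumptions:

homes <- read.csv("levittown.csv")   # hypothetical file name
fit <- lm(Sale.price ~ Bedrooms + Bathrooms + Living.area + Lot.size +
            Year.built + Property.tax, data = homes)
summary(fit)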

The overall regression is strongly statistically significant, with the tail
probability of the F -test roughly 10⁻¹⁰. The predictors account for roughly
50% of the variability in sale prices (R² ≈ 0.5). Two of the predictors (number
of bathrooms and living area) are highly statistically significant, with tail
probabilities less than .0002, and the coefficient of the year built variable is
marginally statistically significant. The coefficients imply that given all else
in the model is held fixed, one additional bathroom in a house is associated
with an estimated expected price that is $51,700 higher; one additional square
foot of living area is associated with an estimated expected price that is $65.90
higher (given the typical value of the living area variable, a more meaningful
statement would probably be that an additional 100 square feet of living area
is associated with an estimated expected price that is $659 higher); and a house
being built one year later is associated with an estimated expected price that is
$3761 higher.
This is a situation where the distinction between a confidence interval
for a fitted value and a prediction interval (and which is of more interest to
a particular person) is clear. Consider a house with 3 bedrooms, 1 bathroom,
1050 square feet of living area, 6000 square foot lot size, built in 1948, with
$6306 in property taxes. Substituting those values into the above equation
gives an estimated expected sale price of a house with these characteristics equal
to $265,360. A buyer or a seller is interested in the sale price of one particular
house, so a prediction interval for the sale price would provide a range for
what the buyer can expect to pay and the seller expect to get. The standard
error of the estimate σ̂ = $47,380 can be used to construct a rough prediction
interval, in that roughly 95% of the time a house with these characteristics can
be expected to sell for within ±(2)(47380) = ±$94,760 of that estimated sale
price, but a more exact interval might be required. On the other hand, a home
appraiser or tax assessor is more interested in the typical (average) sale price
for all homes of that type in the area, so that they can give a justifiable
interval estimate reflecting the precision of the estimate of the true expected
value of the house; here a confidence interval for the fitted value is desired.
Exact 95% intervals for a house with these characteristics can be obtained
from statistical software, and turn out to be ($167277, $363444) for the
prediction interval and ($238482, $292239) for the confidence interval. As
expected, the prediction interval is much wider than the confidence interval,
since it reflects the inherent variability in sale prices in the population of
houses; indeed, it is probably too wide to be of any practical value in this case,
but an interval with smaller coverage (that is expected to include the actual
price only 50% of the time, say) might be useful (a 50% interval in this case
would be ($231974, $298746), so a seller could be told that there is a 50/50
chance that their house will sell for a value in this range).
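
Exact intervals like these come directly from predict() in R; a sketch using
the fit above (names as assumed earlier):

new <- data.frame(Bedrooms = 3, Bathrooms = 1, Living.area = 1050,
                  Lot.size = 6000, Year.built = 1948, Property.tax = 6306)
predict(fit, newdata = new, interval = "prediction", level = 0.95)
predict(fit, newdata = new, interval = "confidence", level = 0.95)
predict(fit, newdata = new, interval = "prediction", level = 0.50)  # 50% interval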
The validity of all of these results depends on whether the assumptions
hold. Figure 1.5 gives a scatter plot of the residuals versus the fitted values
and a normal plot of the residuals for this model fit. There is no apparent
pattern in the plot of residuals versus fitted values, and the ordered residuals
form a roughly straight line in the normal plot, so there are no apparent
violations of assumptions here. The plot of residuals versus each of the
predictors (Figure 1.6) also does not show any apparent patterns, other than
the houses with unusual living area and year being built, respectively. It would
be reasonable to omit these observations to see if they have had an effect on
the regression, but we will postpone discussion of that to Chapter 3, where
diagnostics for unusual observations are discussed in greater detail.
An obvious consideration at this point is that the models discussed here
appear to be overspecified; that is, they include variables that do not apparently
add to the predictive power of the model. As was noted earlier, this suggests
the consideration of model building, where a more appropriate (simplified)
model can be chosen, which will be discussed in Chapter 2.

FIGURE 1.5: Residual plots for the home price data. (a) Plot of residuals
versus fitted values. (b) Normal plot of the residuals.
FIGURE 1.6: Scatter plots of residuals versus each predictor for the home
price data.

1.5 Summary
In this chapter we have laid out the basic structure of the linear regression
model, including the assumptions that justify the use of least squares estima-
tion. The three main goals of regression noted at the beginning of the chapter
provide a framework for an organization of the topics covered.
1. Modeling the relationship between x and y :
• the least squares estimates β̂ summarize the expected change in y for a
given change in an x, accounting for all of the variables in the model;
• the standard error of the estimate σ̂ estimates the standard deviation of
the errors;
• R² and Rₐ² estimate the proportion of variability in y accounted for by
x;
• and the confidence interval for a fitted value provides a measure of the
precision in estimating the expected target for a given set of predictor
values.
2. Prediction of the target variable:


• substituting specified values of x into the fitted regression model gives
an estimate of the value of the target for a new observation;
• the rough prediction interval ±2σ̂ provides a quick measure of the
limits of the ability to predict a new observation;
• and the exact prediction interval provides a more precise measure of
those limits.
3. Testing of hypotheses:
• the F -test provides a test of the statistical significance of the overall
relationship;
• the t-test for each slope coefficient testing whether the true value is zero
provides a test of whether the variable provides additional predictive
power given the other variables;
• and the t-tests can be generalized to test other hypotheses of interest
about the coefficients as well.
Since all of these methods depend on the assumptions holding, a fun-
damental part of any regression analysis is to check those assumptions. The
residual plots discussed in this chapter are a key part of that process, and
other diagnostics and tests will be discussed in future chapters that provide
additional support for that task.

KEY TERMS
Autocorrelation: Correlation between adjacent observations in a (time) series.
In the regression context it is autocorrelation of the errors that is a violation of
assumptions.
Coefficient of determination (R²): The square of the multiple correlation
coefficient, estimates the proportion of variability in the target variable that is
explained by the predictors in the linear model.
Confidence interval for a fitted value: A measure of precision of the estimate
of the expected target value for a given x.
Dependent variable: Characteristic of each member of the sample that is
being modeled. This is also known as the target or response variable.
Fitted value: The least squares estimate of the expected target value for a
particular observation obtained from the fitted regression model.
Heteroscedasticity: Unequal variance; this can refer to observed unequal
variance of the residuals or theoretical unequal variance of the errors.
Homoscedasticity: Equal variance; this can refer to observed equal variance
of the residuals or the assumed equal variance of the errors.
Independent variable(s): Characteristic(s) of each member of the sample that
could be used to model the dependent variable. These are also known as the
predicting variables.
Least squares: A method of estimation that minimizes the sum of squared
deviations of the observed target values from their estimated expected values.
Prediction interval: The interval estimate for the value of the target variable
for an individual member of the population using the fitted regression model.
Residual: The difference between the observed target value and the corre-
sponding fitted value.
Residual mean square: An unbiased estimate of the variance of the errors.
It is obtained by dividing the sum of squares of the residuals by (n − p − 1),
where n is the number of observations and p is the number of predicting
variables.
Standard error of the estimate (σ̂): An estimate of σ, the standard deviation
of the errors, equaling the square root of the residual mean square.
Chapter Two

Model Building
2.1 Introduction 23
2.2 Concepts and Background Material 24
2.2.1 Using Hypothesis Tests to Compare Models 24
2.2.2 Collinearity 26
2.3 Methodology 29
2.3.1 Model Selection 29
2.3.2 Example — Estimating Home Prices (continued) 31
2.4 Indicator Variables and Modeling Interactions 38
2.4.1 Example — Electronic Voting and the 2004
Presidential Election 40
2.5 Summary 46

2.1 Introduction
All of the discussion in Chapter 1 is based on the premise that the only
model being considered is the one currently being fit. This is not a good data
analysis strategy, for several reasons.
1. Including unnecessary predictors in the model (what is sometimes called
overfitting) complicates descriptions of the process. Using such models
tends to lead to poorer predictions because of the additional unnecessary
noise. Further, a more complex representation of the true regression
relationship is less likely to remain stable enough to be useful for future
prediction than is a simpler one.

2. Omitting important effects (underfitting) reduces predictive power,
biases estimates of effects for included predictors, and results in less
understanding of the process being studied.
3. Violations of assumptions should be addressed, so that least squares
estimation is justified.
The last of these reasons is the subject of later chapters, while the first
two are discussed in this chapter. This operation of choosing among different
candidate models so as to avoid overfitting and underfitting is called model
selection.
First, we discuss the uses of hypothesis testing for model selection.
Various hypothesis tests address relevant model selection questions, but there
are also reasons why they are not sufficient for these purposes. Part of these
difficulties is the effect of correlations among the predictors, and the situation
of high correlation among the predictors (collinearity) is a particularly
challenging one.
A useful way of thinking about the tradeoffs of overfitting versus under-
fitting is as a contrast between strength of fit and simplicity. The principle
of parsimony states that a model should be as simple as possible while still
accounting for the important relationships in the data. Thus, a sensible way of
comparing models is using measures that explicitly reflect this tradeoff; such
measures are discussed in Section 2.3.1.
The chapter concludes with a discussion of techniques designed to address
the existence of well-defined subgroups in the data. In this situation, it is
often the case that the effect of a predictor on the target variable is different
in the two groups, and ways of building models to handle this are discussed in
Section 2.4.

2.2 Concepts and Background Material


2.2.1 USING HYPOTHESIS TESTS TO COMPARE MODELS
Determining whether individual regression coefficients are statistically sig-
nificant (as discussed in Section 1.3.3) is an obvious first step in deciding
whether a model is overspecified. A predictor that does not add significantly
to model fit should have an estimated slope coefficient that is not significantly
different from 0, and is thus identified by a small t-statistic. So, for example,
in the analysis of home prices in Section 1.4, the regression output on page 17
suggests removing number of bedrooms, lot size, and property taxes from the
model, as all three have insignificant t-values.
Recall that t-tests can only assess the contribution of a predictor given all of
the others in the model. When predictors are correlated with each other, t-tests
can give misleading indications of the importance of a predictor. Consider a
two-predictor situation where the predictors are each highly correlated with
the target variable, and are also highly correlated with each other. In this
situation, it is likely that the t-statistic for each predictor will be relatively
small. This is not an inappropriate result, since given one predictor the other
adds little (being highly correlated with each other, one is redundant in the
presence of the other). This means that the t-statistics are not effective in
identifying important predictors when the two variables are highly correlated.
The t-tests and F -test of Section 1.3.3 are special cases of a general
formulation that is useful for comparing certain classes of models. It might be
the case that a simpler version of a candidate model (a subset model) might
be adequate to fit the data. For example, consider taking a sample of college
students and determining their college grade point average (GPA), Scholastic
Aptitude Test (SAT) evidence-based reading and writing score (Reading),
and SAT math score (Math). The full regression model to fit to these data is
GPAi = β0 + β1 Readingi + β2 Mathi + εi .
Instead of considering reading and math scores separately, we could consider
whether GPA can be predicted by one variable: total SAT score, which is the
sum of Reading and Math. This subset model is
GPAi = γ0 + γ1 (Reading + Math)i + εi ,
with β1 = β2 ≡ γ1 . This equality condition is called a linear restriction,
because it defines a linear condition on the parameters of the regression model
(that is, it only involves additions, subtractions, and equalities of coefficients
and constants).
The question about whether the total SAT score is sufficient to predict
grade point average can be stated using a hypothesis test about this linear
restriction. As always, the null hypothesis gets the benefit of the doubt; in this
case, that is the simpler restricted (subset) model that the sum of Reading
and Math is adequate, since it says that only one predictor is needed, rather
than two. The alternative hypothesis is the unrestricted full model (with no
conditions on β). That is,
H0 : β1 = β2
versus
Ha : β1 ≠ β2.
These hypotheses are tested using a partial F -test. The F -statistic has the
form
F = [(Residual SSsubset − Residual SSfull)/d] / [Residual SSfull/(n − p − 1)], (2.1)
where n is the sample size, p is the number of predictors in the full model, and
d is the difference between the number of parameters in the full model and
the number of parameters in the subset model. This statistic is compared to
an F distribution on (d, n − p − 1) degrees of freedom. So, for example, for
this GPA/SAT example, p = 2 and d = 3 − 2 = 1, so the observed F -statistic
would be compared to an F distribution on (1, n − 3) degrees of freedom.
Some statistical packages allow specification of the full and subset models and
26 CHAPTER 2 Model Building

will calculate the F -test, but others do not, and the statistic has to be calculated
manually based on the fits of the two models.
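
In R, the partial F -test is carried out by anova() applied to the two nested
fits; a minimal sketch (the data frame name is illustrative):

fit_full <- lm(GPA ~ Reading + Math, data = sat)      # unrestricted model
fit_sub  <- lm(GPA ~ I(Reading + Math), data = sat)   # restricted (subset) model
anova(fit_sub, fit_full)   # partial F-test of the restriction beta1 = beta2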
An alternative form for the F -test above might make clearer what is going
on here:
F = [(R²full − R²subset)/d] / [(1 − R²full)/(n − p − 1)].
That is, if the strength of the fit of the full model (measured by R²) isn't
much larger than that of the subset model, the F -statistic is small, and we do
not reject the subset model; if, on the other hand, the difference in R² values
is large (implying that the fit of the full model is noticeably stronger), we do
reject the subset model in favor of the full model.
The F -statistic to test the overall significance of the regression is a special
case of this construction (with restriction β1 = · · · = βp = 0), as is each of the
individual t-statistics that test the significance of any variable (with restriction
βj = 0). In the latter case Fj = tj².

2.2.2 COLLINEARITY
Recall that the importance of a predictor can be difficult to assess using t-tests
when predictors are correlated with each other. A related issue is that of
collinearity (sometimes somewhat redundantly referred to as multicollinearity),
which refers to the situation when (some of) the predictors are highly
correlated with each other. The presence of predicting variables that are highly
correlated with each other can lead to instability in the regression coefficients,
increasing their standard errors, and as a result the t-statistics for the variables
can be deflated. This can be seen in Figure 2.1. The two plots refer to identical
data sets, other than the one data point that is lightly colored. Dropping
the data points down to the (x1 , x2 ) plane makes clear the high correlation
between the predictors. The estimated regression plane changes from
ŷ = 9.906 − 2.514x1 + 6.615x2
in the top plot to
ŷ = 9.748 + 9.315x1 − 5.204x2
in the bottom plot; a small change in only one data point causes a major
change in the estimated regression function.
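
A small simulation in R illustrates this instability; a sketch (the seed,
sample size, and coefficient values are all illustrative, not taken from the
figure):

set.seed(1)
x1 <- rnorm(20)
x2 <- x1 + rnorm(20, sd = 0.05)   # x2 nearly collinear with x1
y  <- 10 + 4 * x1 + rnorm(20)
coef(lm(y ~ x1 + x2))             # fit to the original data
x2[1] <- x2[1] + 0.5              # perturb a single observation
coef(lm(y ~ x1 + x2))             # the coefficients can change dramatically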
Thus, from a practical point of view, collinearity leads to two problems.
First, it can happen that the overall F -statistic is significant, yet each of the
individual t-statistics is not significant (more generally, the tail probability for
the F -test is considerably smaller than those of any of the individual coefficient
t-tests). Second, if the data are changed only slightly, the fitted regression
coefficients can change dramatically. Note that while collinearity can have a
large effect on regression coefficients and associated t-statistics, it does not
have a large effect on overall measures of fit like the overall F -test or R², since
adding unneeded variables (whether or not they are collinear with predictors
FIGURE 2.1: Least squares estimation under collinearity. The only change
in the data sets is the lightly colored data point. The planes are the estimated
least squares fits.

already in the model) cannot increase the residual sum of squares (it can only
decrease it or leave it roughly the same).
Another problem with collinearity comes from attempting to use a fitted
regression model for prediction. As was noted in Chapter 1, simple models tend
to forecast better than more complex ones, since they make fewer assumptions
about what the future will look like. If a model exhibiting collinearity is used
for future prediction, the implicit assumption is that the relationships among
the predicting variables, as well as their relationship with the target variable,
remain the same in the future. This is less likely to be true if the predicting
variables are collinear.
How can collinearity be diagnosed? The two-predictor model
yi = β0 + β1 x1i + β2 x2i + εi
provides some guidance. It can be shown that in this case

    var(β̂1) = σ² [Σi x1i² (1 − r12²)]⁻¹

and

    var(β̂2) = σ² [Σi x2i² (1 − r12²)]⁻¹,
where r12 is the correlation between x1 and x2. Note that as collinearity
increases (r12 → ±1), both variances tend to ∞. This effect is quantified in
Table 2.1.

Table 2.1: Variance inflation caused by correlation of predictors in a
two-predictor model.

    r12      Variance inflation
    0.00        1.00
    0.50        1.33
    0.70        1.96
    0.80        2.78
    0.90        5.26
    0.95       10.26
    0.97       16.92
    0.99       50.25
    0.995     100.00
    0.999     500.00
This ratio describes by how much the variances of the estimated slope
coefficients are inflated due to observed collinearity relative to when the
predictors are uncorrelated. It is clear that when the correlation is high, the
variability (and hence the instability) of the estimated slopes can increase
dramatically.
A diagnostic to determine this in general is the variance inflation factor
(V IF ) for each predicting variable, which is defined as

    V IFj = 1/(1 − Rj²),

where Rj² is the R² of the regression of the variable xj on the other predicting
variables. V IFj gives the proportional increase in the variance of β̂j compared
to what it would have been if the predicting variables had been uncorrelated.
There are no formal cutoffs as to what constitutes a large V IF, but collinearity
is generally not a problem if the observed V IF satisfies

    V IF < max(10, 1/(1 − Rmodel²)),

where Rmodel² is the usual R² for the regression fit. This means that either the
predictors are more related to the target variable than they are to each other, or
they are not related to each other very much. In either case coefficient estimates
are unlikely to be seriously unstable, so collinearity is not a problem.
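
VIFs are easy to obtain in R, either directly from the definition or with the
vif() function in the car package; a minimal sketch (data frame and variable
names are illustrative):

# Directly from the definition, for predictor x1 in a three-predictor model
Rj2 <- summary(lm(x1 ~ x2 + x3, data = mydata))$r.squared
1 / (1 - Rj2)

# Or, assuming the car package is installed:
library(car)
vif(lm(y ~ x1 + x2 + x3, data = mydata))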