Handbook of Regression Analysis
With Applications in R
WILEY SERIES IN PROBABILITY AND STATISTICS
Established by WALTER A. SHEWHART and SAMUEL S. WILKS
Editors
David J. Balding, Noel A.C. Cressie, Garrett M. Fitzmaurice, Harvey
Goldstein, Geert Molenberghs, David W. Scott, Adrian F.M. Smith, and
Ruey S. Tsay
Editors Emeriti
Vic Barnett, Ralph A. Bradley, J. Stuart Hunter, J.B. Kadane, David G.
Kendall, and Jozef L. Teugels
A complete list of the titles in this series appears at the end of this volume.
Handbook of Regression
Analysis With Applications
in R
Second Edition
Samprit Chatterjee
New York University, New York, USA
Jeffrey S. Simonoff
New York University, New York, USA
This second edition first published 2020
© 2020 John Wiley & Sons, Inc
Edition History
Wiley-Blackwell (1e, 2013)
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by
law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/
permissions.
The right of Samprit Chatterjee and Jeffrey S. Simonoff to be identified as the authors of this work has been
asserted in accordance with law.
Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us
at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that
appears in standard print versions of this book may not be available in other formats.
10 9 8 7 6 5 4 3 2 1
Dedicated to everyone who labors in the field
of statistics, whether they are students,
teachers, researchers, or data analysts.
Contents
Part I
The Multiple Linear Regression Model
2 Model Building 23
2.1 Introduction 23
2.2 Concepts and Background Material 24
2.2.1 Using Hypothesis Tests to Compare Models 24
2.2.2 Collinearity 26
2.3 Methodology 29
2.3.1 Model Selection 29
2.3.2 Example — Estimating Home Prices
(continued) 31
2.4 Indicator Variables and Modeling Interactions 38
2.4.1 Example — Electronic Voting and the 2004
Presidential Election 40
2.5 Summary 46
Part II
Addressing Violations of Assumptions
Part III
Categorical Predictors
Part IV
Non-Gaussian Regression Models
Part V
Other Regression Models
Part VI
Nonparametric and Semiparametric
Models
Bibliography 337
Index 343
Preface to the
Second Edition
The years since the first edition of this book appeared have been fast-moving
in the world of data analysis and statistics. Algorithmically-based methods
operating under the banner of machine learning, artificial intelligence, or
data science have come to the forefront of public perceptions about how to
analyze data, and more than a few pundits have predicted the demise of classic
statistical modeling.
To paraphrase Mark Twain, we believe that reports of the (impending)
death of statistical modeling in general, and regression modeling in particular,
are exaggerated. The great advantage that statistical models have over “black
box” algorithms is that in addition to effective prediction, their transparency
also provides guidance about the actual underlying process (which is crucial
for decision making), and affords the possibilities of making inferences and
distinguishing real effects from random variation based on those models.
There have been laudable attempts to encourage making machine learning
algorithms interpretable in the ways regression models are (Rudin, 2019), but
we believe that models based on statistical considerations and principles will
have a place in the analyst’s toolkit for a long time to come.
Of course, part of that usefulness comes from the ability to generalize
regression models to more complex situations, and that is the thrust of the
changes in this new edition. One thing that hasn’t changed is the philosophy
behind the book, and our recommendations on how it can be best used, and
we encourage the reader to refer to the preface to the first edition for guidance
on those points. There have been small changes to the original chapters, and
broad descriptions of those chapters can also be found in the preface to the
first edition. The five new chapters (Chapters 11, 13, 14, 15, and 16, with
the former chapter 11 on nonlinear regression moving to Chapter 12) expand
greatly on the power and applicability of regression models beyond what
was discussed in the first edition. For this reason many more references are
provided in these chapters than in the earlier ones, since some of the material
in those chapters is less established and less well-known, with much of it still
the subject of active research. In keeping with that, we do not spend much
(or any) time on issues for which there still isn’t necessarily a consensus in the
statistical community, but point to books and monographs that can help the
analyst get some perspective on that kind of material.
Chapter 11 discusses the modeling of time-to-event data, often referred
to as survival data. The response variable measures the length of time until an
event occurs, and a common complicator is that sometimes it is only known
that a response value is greater than some number; that is, it is right-censored.
This can naturally occur, for example, in a clinical trial in which subjects
enter the study at varying times, and the event of interest has not occurred at
the end of the trial. Analysis focuses on the survival function (the probability
of surviving past a given time) and the hazard function (the instantaneous
probability of the event occurring at a given time given survival to that
time). Parametric models based on appropriate distributions like the Weibull
or log-logistic can be fit that take censoring into account. Semiparametric
models like the Cox proportional hazards model (the most commonly-used
model) and the Buckley-James estimator are also available, which weaken
distributional assumptions. Modeling can be adapted to situations where
event times are truncated, and also when there are covariates that change over
the life of the subject.
Chapter 13 extends applications to data with multiple observations for
each subject consistent with some structure from the underlying process. Such
data can take the form of nested or clustered data (such as students all in
one classroom) or longitudinal data (where a variable is measured at multiple
times for each subject). In this situation ignoring that structure results in an
induced correlation that reflects unmodeled differences between classrooms
and subjects, respectively. Mixed effects models generalize analysis of variance
(ANOVA) models and time series models to this more complicated situation.
Models with linear effects based on Gaussian distributions can be generalized
to nonlinear models, and also can be generalized to non-Gaussian distributions
through the use of generalized linear mixed effects models.
Modern data applications can involve very large (even massive) numbers of
predictors, which can cause major problems for standard regression methods.
Best subsets regression (discussed in Chapter 2) does not scale well to very
large numbers of predictors, and Chapter 14 discusses approaches that can
accomplish that. Forward stepwise regression, in which potential predictors
are stepped in one at a time, is an alternative to best subsets that scales
to massive data sets. A systematic approach to reducing the dimensionality
of a chosen regression model is through the use of regularization, in which
the usual estimation criterion is augmented with a penalty that encourages
sparsity; the most commonly-used version of this is the lasso estimator, and it
and its generalizations are discussed further.
Chapters 15 and 16 discuss methods that move away from specified
relationships between the response and the predictor to nonparametric and
semiparametric methods, in which the data are used to choose the form of
the underlying relationship. In Chapter 15 linear or (specifically specified)
nonlinear relationships are replaced with the notion of relationships taking the
form of smooth curves and surfaces. Estimation at a particular location is based
on local information; that is, the values of the response in a local neighborhood
of that location. This can be done through local versions of weighted least
squares (local polynomial estimation) or local regularization (smoothing
splines). Such methods can also be used to help identify interactions between
numerical predictors in linear regression modeling. Single predictor smoothing
SAMPRIT CHATTERJEE
Brooksville, Maine
JEFFREY S. SIMONOFF
New York, New York
October, 2019
Preface to the
First Edition
groups to each other. Data of this type often exhibit nonconstant variance
related to the different subgroups in the population, and the appropriate tool
to address this issue, weighted least squares, is also a focus here.
Chapters 8 through 10 examine the situation where the nature of the
response variable is such that Gaussian-based least squares regression is no
longer appropriate. Chapter 8 focuses on logistic regression, designed for
binary response data and based on the binomial random variable. While
there are many parallels between logistic regression analysis and least squares
regression analysis, there are also issues that come up in logistic regression
that require special care. Chapter 9 uses the multinomial random variable to
generalize the models of Chapter 8 to allow for multiple categories in the
response variable, outlining models designed for response variables that either
do or do not have ordered categories. Chapter 10 focuses on response data in
the form of counts, where distributions like the Poisson and negative binomial
play a central role. The connection between all these models through the
generalized linear model framework is also exploited in this chapter.
The final chapter focuses on situations where linearity does not hold,
and a nonlinear relationship is necessary. Although these models are based on
least squares, from both an algorithmic and inferential point of view there
are strong connections with the models of Chapters 8 through 10, which we
highlight.
This Handbook can be used in several different ways. First, a reader may
use the book to find information on a specific topic. An analyst might want
additional information on, for example, logistic regression or autocorrelation.
The chapters on these (and other) topics provide the reader with this subject
matter information. As noted above, the chapters also include at least one
analysis of a data set, a clarification of computer output, and reference to
sources where additional material can be found. The chapters in the book are
to a large extent self-contained and can be consulted independently of other
chapters.
The book can also be used as a template for what we view as a reasonable
approach to data analysis in general. This is based on the cyclical paradigm
of model formulation, model fitting, model evaluation, and model updating
leading back to model (re)formulation. Statistical significance of test statistics
does not necessarily mean that an adequate model has been obtained. Further
analysis needs to be performed before the fitted model can be regarded as
an acceptable description of the data, and this book concentrates on this
important aspect of regression methodology. Detection of deficiencies of fit
is based on both testing and graphical methods, and both approaches are
highlighted here.
This preface is intended to indicate ways in which the Handbook can
be used. Our hope is that it will be a useful guide for data analysts, and will
help contribute to effective analyses. We would like to thank our students and
colleagues for their encouragement and support. We hope we have provided
them with a book of which they would approve. We would like to thank Steve
Quigley, Jackie Palmieri, and Amy Hendrickson for their help in bringing this
manuscript to print. We would also like to thank our families for their love
and support.
SAMPRIT CHATTERJEE
Brooksville, Maine
JEFFREY S. SIMONOFF
New York, New York
August, 2012
Part One
1.1 Introduction
This is a book about regression modeling, but when we refer to regression
models, what do we mean? The regression framework can be characterized in
the following way:
1. We have one particular variable that we are interested in understanding
or modeling, such as sales of a particular product, sale price of a home, or
FIGURE 1.1: The simple linear regression model. The solid line corresponds
to the true regression line E(y) = β0 + β1x, and the dotted lines correspond
to the random errors εi.
the current model, selection among several candidate models, the acquisition
of new data, new understanding of the underlying random process, and so
on. Further, it is often the case that there are several different models that
are reasonable representations of reality. Having said this, we will sometimes
refer to the “true” model, but this should be understood as referring to the
underlying form of the currently hypothesized representation of the regression
relationship.
The special case of (1.1) with p = 1 corresponds to the simple regression
model, and is consistent with the representation in Figure 1.1. The solid line
is the true regression line, the expected value of y given the value of x. The
dotted lines are the random errors εi that account for the lack of a perfect
association between the predictor and the target variables.
FIGURE 1.2: Least squares estimation for the simple linear regression model,
using the same data as in Figure 1.1. The gray line corresponds to the true
regression line, the solid black line corresponds to the fitted least squares
line (designed to estimate the gray line), and the lengths of the dotted lines
correspond to the residuals. The sum of squared values of the lengths of the
dotted lines is minimized by the solid black line.
and the solid black line is the estimated regression line, designed to estimate
the (unknown) gray line as closely as possible. For any choice of estimated
parameters β̂, the estimated expected response value given the observed
predictor values equals
ŷi = β̂0 + β̂1 x1i + · · · + β̂p xpi ,
and is called the fitted value. The difference between the observed value yi
and the fitted value ŷi is called the residual, the set of which is represented by
the signed lengths of the dotted lines in Figure 1.2. The least squares regression
line minimizes the sum of squares of the lengths of the dotted lines; that is,
the ordinary least squares (OLS) estimates minimize the sum of squares of the
residuals.
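As a quick sketch of this in R (on simulated data, since no dataset has been introduced yet), lm() computes the OLS estimates, and any other candidate line yields a larger residual sum of squares:

```r
# Simulated data for illustration only (not one of the book's examples)
set.seed(1)
x <- runif(30, 2, 8)
y <- 10 + 2.5 * x + rnorm(30, sd = 2)

fit <- lm(y ~ x)                    # ordinary least squares
rss_ols <- sum(residuals(fit)^2)    # the minimized criterion

# Any other line (here an arbitrary nearby one) has a larger
# sum of squared residuals than the OLS fit
rss_other <- sum((y - (9 + 2.6 * x))^2)
rss_other > rss_ols
```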
In higher dimensions (p > 1), the true and estimated regression relation-
ships correspond to planes (p = 2) or hyperplanes (p ≥ 3), but otherwise the
principles are the same. Figure 1.3 illustrates the case with two predictors.
The length of each vertical line corresponds to a residual (solid lines refer to
positive residuals, while dashed lines refer to negative residuals), and the (least
squares) plane that goes through the observations is chosen to minimize the
sum of squares of the residuals.
FIGURE 1.3: Least squares estimation for the multiple linear regression
model with two predictors. The plane corresponds to the fitted least squares
relationship, and the lengths of the vertical lines correspond to the residuals.
The sum of squared values of the lengths of the vertical lines is minimized by
the plane.
where H = X(XᵀX)⁻¹Xᵀ is the so-called "hat" matrix (since it takes y to ŷ).
The residuals e = y − ŷ thus satisfy
e = y − ŷ = y − X(XᵀX)⁻¹Xᵀy = (I − X(XᵀX)⁻¹Xᵀ)y, (1.6)
or
e = (I − H)y.
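As a small numerical check of these identities in R (on simulated data, with variable names of our choosing; this is not one of the book's examples), the hat matrix computed directly reproduces the fitted values and residuals that lm() returns:

```r
set.seed(1)
n <- 20
X <- cbind(1, runif(n), runif(n))             # design matrix, intercept column first
y <- as.vector(X %*% c(1, 2, -1)) + rnorm(n)  # response from a known linear model

H <- X %*% solve(t(X) %*% X) %*% t(X)         # the "hat" matrix: H takes y to y-hat
yhat <- as.vector(H %*% y)                    # fitted values
e <- as.vector((diag(n) - H) %*% y)           # residuals via (1.6)

fit <- lm(y ~ X - 1)                          # same model fit via lm()
all.equal(e, unname(residuals(fit)))          # agree up to rounding error
```

Forming H explicitly is only sensible for illustration; in practice lm() uses a numerically stabler QR decomposition.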
1.2.3 ASSUMPTIONS
The least squares criterion will not necessarily yield sensible results unless
certain assumptions hold. One is given in (1.1) — the linear model should
be appropriate. In addition, the following assumptions are needed to justify
using least squares regression.
1. The expected value of the errors is zero (E(εi ) = 0 for all i). That is, it
cannot be true that for certain observations the model is systematically
too low, while for others it is systematically too high. A violation of this
assumption will lead to difficulties in estimating β0 . More importantly,
this reflects that the model does not include a necessary systematic
component, which has instead been absorbed into the error terms.
2. The variance of the errors is constant (V (εi ) = σ 2 for all i). That is,
it cannot be true that the strength of the model is greater for some
parts of the population (smaller σ ) and less for other parts (larger σ ).
This assumption of constant variance is called homoscedasticity, and its
violation (nonconstant variance) is called heteroscedasticity. A violation
of this assumption means that the least squares estimates are not as efficient
as they could be in estimating the true parameters, and better estimates are
available. More importantly, it also results in poorly calibrated confidence
and (especially) prediction intervals.
3. The errors are uncorrelated with each other. That is, it cannot be true
that knowing that the model underpredicts y (for example) for one
particular observation says anything at all about what it does for any
other observation. This violation most often occurs in data that are
ordered in time (time series data), where errors that are near each other
in time are often similar to each other (such time-related correlation
is called autocorrelation). Violation of this assumption means that the
least squares estimates are not as efficient as they could be in estimating
the true parameters, and more importantly, its presence can lead to very
misleading assessments of the strength of the regression.
4. The errors are normally distributed. This is needed if we want to construct
any confidence or prediction intervals, or hypothesis tests, which we
usually do. If this assumption is violated, hypothesis tests and confidence
and prediction intervals can be very misleading.
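These assumptions are usually examined graphically through the residuals; a minimal sketch in R (simulated data, illustrative only) of the two standard plots:

```r
set.seed(7)
x <- runif(60)
y <- 1 + 2 * x + rnorm(60)
fit <- lm(y ~ x)

# Residuals versus fitted values: look for systematic pattern (assumption 1)
# or nonconstant spread (assumption 2); for time-ordered data, also plot
# residuals in time order to check for autocorrelation (assumption 3)
plot(fitted(fit), residuals(fit))
abline(h = 0, lty = 2)

# Normal quantile plot of the residuals (assumption 4): roughly a
# straight line if the errors are close to normally distributed
qqnorm(residuals(fit))
qqline(residuals(fit))
```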
1.3 Methodology
1.3.1 INTERPRETING REGRESSION COEFFICIENTS
The least squares regression coefficients have very specific meanings. They are
often misinterpreted, so it is important to be clear on what they mean (and do
not mean). Consider first the intercept, β̂0 .
β̂0 : The estimated expected value of the target variable when the predictors
are all equal to zero.
Note that this might not have any physical interpretation, since a zero value for
the predictor(s) might be impossible, or might never come close to occurring
in the observed data. In that situation, it is pointless to try to interpret
this value. If all of the predictors are centered to have zero mean, then β̂0
necessarily equals ȳ, the sample mean of the target values. Similarly, if there
is a particular value of each predictor that is meaningful in some sense, and
each variable is centered around that value, then the intercept is an
estimate of E(y) when the predictors all take those meaningful values.
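A quick numerical illustration of the centering fact (simulated data; the variable names are ours):

```r
set.seed(2)
x1 <- rnorm(50, mean = 10)
x2 <- rnorm(50, mean = 5)
y <- 3 + 2 * x1 - x2 + rnorm(50)

# With the predictors centered to have mean zero, the OLS intercept
# equals the sample mean of the target values
fit <- lm(y ~ I(x1 - mean(x1)) + I(x2 - mean(x2)))
c(intercept = unname(coef(fit)[1]), ybar = mean(y))
```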
The estimated coefficient for the j th predictor (j = 1, . . . , p) is interpreted
in the following way:
β̂j : The estimated expected change in the target variable associated with a one
unit change in the j th predicting variable, holding all else in the model
fixed.
There are several noteworthy aspects to this interpretation. First, note the
word associated — we cannot say that a change in the target variable is caused
by a change in the predictor, only that they are associated with each other.
That is, correlation does not imply causation.
Another key point is the phrase “holding all else in the model fixed,” the
implications of which are often ignored. Consider the following hypothetical
example. A random sample of college students at a particular university is
taken in order to understand the relationship between college grade point
average (GPA) and other variables. A model is built with college GPA as a
function of high school GPA and the standardized Scholastic Aptitude Test
(SAT), with resultant least squares fit
College GPA = 1.3 + .7 × High School GPA − .0001 × SAT.
It is tempting to say (and many people would say) that the coefficient for
SAT score has the “wrong sign,” because it says that higher values of SAT
are associated with lower values of college GPA. This is not correct. The
problem is that it is likely in this context that what an analyst would find
intuitive is the marginal relationship between college GPA and SAT score alone
(ignoring all else), one that we would indeed expect to be a direct (positive)
one. The regression coefficient does not say anything about that marginal
relationship. Rather, it refers to the conditional (sometimes called partial)
relationship that takes the high school GPA as fixed, which is apparently
that higher values of SAT are associated with lower values of college GPA,
holding high school GPA fixed. High school GPA and SAT are no doubt
related to each other, and it is quite likely that this relationship between
the predictors would complicate any understanding of, or intuition about,
the conditional relationship between college GPA and SAT score. Multiple
regression coefficients should not be interpreted marginally; if you really are
interested in the relationship between the target and a single predictor alone,
you should simply do a regression of the target on only that variable. This
does not mean that multiple regression coefficients are uninterpretable, only
that care is necessary when interpreting them.
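The sign flip in the hypothetical GPA example is easy to reproduce by simulation (all numbers below are invented for illustration; the conditional SAT effect is made more strongly negative than in the fitted equation above so that the flip is clearly visible in a modest sample):

```r
set.seed(3)
n <- 200
hs_gpa <- rnorm(n, mean = 3, sd = 0.4)
sat <- 400 + 150 * hs_gpa + rnorm(n, sd = 30)   # SAT tracks high school GPA
college_gpa <- 1.3 + 0.7 * hs_gpa - 0.002 * sat + rnorm(n, sd = 0.2)

# Marginal relationship (ignoring high school GPA): positive slope,
# because SAT is a proxy for high school GPA here
coef(lm(college_gpa ~ sat))["sat"]

# Conditional relationship (holding high school GPA fixed): negative slope
coef(lm(college_gpa ~ hs_gpa + sat))["sat"]
```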
Another common use of multiple regression that depends on this con-
ditional interpretation of the coefficients is to explicitly include “control”
variables in a model in order to try to account for their effect statistically. This
is particularly important in observational data (data that are not the result of a
designed experiment), since in that case, the effects of other variables cannot be
ignored as a result of random assignment in the experiment. For observational
data it is not possible to physically intervene in the experiment to “hold other
variables fixed,” but the multiple regression framework effectively allows this
to be done statistically.
Having said this, we must recognize that in many situations, it is impossible
from a practical point of view to change one predictor while holding all else
fixed. Thus, while we would like to interpret a coefficient as accounting for the
presence of other predictors in a physical sense, it is important (when dealing
with observational data in particular) to remember that linear regression is at
best only an approximation to the actual underlying random process.
This formula says that the variability in the target variable (the left side of
the equation, termed the corrected total sum of squares) can be split into two
mutually exclusive parts — the variability left over after doing the regression
(the first term on the right side, the residual sum of squares), and the variability
accounted for by doing the regression (the second term, the regression sum of
deviation σ . This means that, roughly speaking, 95% of the time an observed
y value falls within ±2σ of the expected response
E(y) = β0 + β1 x1 + · · · + βp xp .
E(y) can be estimated for any given set of x values using
ŷ = β̂0 + β̂1 x1 + · · · + β̂p xp ,
while the square root of the residual mean square (1.8), termed the standard
error of the estimate, provides an estimate of σ that can be used in constructing
this rough prediction interval ±2σ̂ .
The values of s.e.(β̂j) are obtained as the square roots of the diagonal ele-
ments of V̂(β̂) = (XᵀX)⁻¹σ̂², where σ̂² is the residual mean square (1.8).
Note that for simple regression (p = 1), the hypotheses corresponding to
the overall significance of the model and the significance of the predictor
are identical,
H0 : β1 = 0
versus
Ha : β1 ≠ 0.
Given the equivalence of the sets of hypotheses, it is not surprising that
the associated tests are also equivalent; in fact, F = t₁², and the associated
tail probabilities of the two tests are identical.
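A quick check of the F = t₁² equivalence in R (simulated simple regression data):

```r
set.seed(4)
x <- rnorm(40)
y <- 1 + 0.5 * x + rnorm(40)
s <- summary(lm(y ~ x))

tval <- s$coefficients["x", "t value"]      # t statistic for the slope
Fval <- unname(s$fstatistic["value"])       # overall F statistic
all.equal(Fval, tval^2)                     # the two tests are equivalent
```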
A t-test for the intercept also can be constructed as in (1.9), although this
does not refer to a hypothesis about a predictor, but rather about whether
the expected target is equal to a specified value β00 if all of the predictors
equal zero. As was noted in Section 1.3.1, this is often not physically
meaningful (and therefore of little interest), because the condition that all
predictors equal zero cannot occur, or does not come close to occurring
in the observed data.
As is always the case, a confidence interval provides an alternative way of
summarizing the degree of precision in the estimate of a regression parameter.
A 100 × (1 − α)% confidence interval for βj has the form
β̂j ± t_{α/2}^{n−p−1} s.e.(β̂j),
where t_{α/2}^{n−p−1} is the appropriate critical value at two-sided level α for a
t-distribution on n − p − 1 degrees of freedom.
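This interval can be computed by hand and compared with R's confint() (simulated data, illustrative only):

```r
set.seed(5)
x <- rnorm(40)
y <- 2 + 3 * x + rnorm(40)
fit <- lm(y ~ x)
s <- summary(fit)

bj <- s$coefficients["x", "Estimate"]
se <- s$coefficients["x", "Std. Error"]
tcrit <- qt(1 - 0.05 / 2, df = fit$df.residual)  # n - p - 1 degrees of freedom

by_hand <- c(bj - tcrit * se, bj + tcrit * se)   # 95% interval by the formula
by_hand
confint(fit, "x", level = 0.95)                  # matches the built-in result
```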
existence of strong effects related to location means that there are likely to
be relatively few homes with the same important characteristics to make the
comparison. A solution to this problem is the use of hedonic regression models,
where the sale prices of a set of homes in a particular area are regressed on
important characteristics of the home such as the number of bedrooms, the
living area, the lot size, and so on. Academic research on this topic is plentiful,
going back to at least Wabe (1971).
This analysis is based on a sample from public data on sales of one-family
homes in the Levittown, NY area from June 2010 through May 2011.
Levittown is famous as the first planned suburban community built using
mass production methods, being aimed at former members of the military
after World War II. Most of the homes in this community were built in the
late 1940s to early 1950s, without basements and designed to make expansion
on the second floor relatively easy.
For each of the 85 houses in the sample, the number of bedrooms, number
of bathrooms, living area (in square feet), lot size (in square feet), the year
the house was built, and the property taxes are used as potential predictors
of the sale price. In any analysis the first step is to look at the data, and
Figure 1.4 gives scatter plots of sale price versus each predictor. It is apparent
that there is a positive association between sale price and each variable, other
than number of bedrooms and lot size. We also note that there are two houses
with unusually large living areas for this sample, two with unusually large
FIGURE 1.4: Scatter plots of sale price versus each predictor for the home
price data.
property taxes (these are not the same two houses), and three that were built
six or seven years later than all of the other houses in the sample.
The output below summarizes the results of a multiple regression fit.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.149e+06 3.820e+06 -1.871 0.065043 .
Bedrooms -1.229e+04 9.347e+03 -1.315 0.192361
Bathrooms 5.170e+04 1.309e+04 3.948 0.000171 ***
Living.area 6.590e+01 1.598e+01 4.124 9.22e-05 ***
Lot.size -8.971e-01 4.194e+00 -0.214 0.831197
Year.built 3.761e+03 1.963e+03 1.916 0.058981 .
Property.tax 1.476e+00 2.832e+00 0.521 0.603734
---
Signif. codes:
0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
for all homes of that type in the area, so they can give a justifiable interval
estimate of the precision of the estimate of the true expected value of the
house; that is, a confidence interval for the fitted value is desired.
Exact 95% intervals for a house with these characteristics can be obtained
from statistical software, and turn out to be ($167277, $363444) for the
prediction interval and ($238482, $292239) for the confidence interval. As
expected, the prediction interval is much wider than the confidence interval,
since it reflects the inherent variability in sale prices in the population of
houses; indeed, it is probably too wide to be of any practical value in this case,
but an interval with smaller coverage (that is expected to include the actual
price only 50% of the time, say) might be useful (a 50% interval in this case
would be ($231974, $298746), so a seller could be told that there is a 50/50
chance that their house will sell for a value in this range).
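Intervals like these come from predict() in R; a sketch with simulated data standing in for the Levittown sample (which is not reproduced here, so the numbers will differ from those quoted above):

```r
set.seed(6)
d <- data.frame(living_area = runif(85, 1000, 3000))
d$price <- 50000 + 100 * d$living_area + rnorm(85, sd = 30000)
fit <- lm(price ~ living_area, data = d)

new_house <- data.frame(living_area = 1800)
ci <- predict(fit, new_house, interval = "confidence", level = 0.95)
pi <- predict(fit, new_house, interval = "prediction", level = 0.95)

# The prediction interval is always wider than the confidence interval:
# it adds the variability of an individual sale around the expected value
ci
pi
```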
The validity of all of these results depends on whether the assumptions
hold. Figure 1.5 gives a scatter plot of the residuals versus the fitted values
and a normal plot of the residuals for this model fit. There is no apparent
pattern in the plot of residuals versus fitted values, and the ordered residuals
form a roughly straight line in the normal plot, so there are no apparent
violations of assumptions here. The plot of residuals versus each of the
predictors (Figure 1.6) also does not show any apparent patterns, other than
the houses with unusual living area and year being built, respectively. It would
be reasonable to omit these observations to see if they have had an effect on
the regression, but we will postpone discussion of that to Chapter 3, where
diagnostics for unusual observations are discussed in greater detail.
An obvious consideration at this point is that the models discussed here
appear to be overspecified; that is, they include variables that do not apparently
add to the predictive power of the model. As was noted earlier, this suggests
the consideration of model building, where a more appropriate (simplified)
model can be chosen, which will be discussed in Chapter 2.
FIGURE 1.5: Residual plots for the home price data. (a) Plot of residuals
versus fitted values. (b) Normal plot of the residuals.
FIGURE 1.6: Scatter plots of residuals versus each predictor for the home
price data. (Panels: number of bedrooms, number of bathrooms, living area,
lot size, year built, and property tax.)
1.5 Summary
In this chapter we have laid out the basic structure of the linear regression
model, including the assumptions that justify the use of least squares estima-
tion. The three main goals of regression noted at the beginning of the chapter
provide a framework for an organization of the topics covered.
1. Modeling the relationship between x and y :
• the least squares estimates β̂ summarize the expected change in y for a
given change in an x, accounting for all of the variables in the model;
• the standard error of the estimate σ̂ estimates the standard deviation of
the errors;
• R2 and Ra2 estimate the proportion of variability in y accounted for by
x;
• and the confidence interval for a fitted value provides a measure of the
precision in estimating the expected target for a given set of predictor
values.
20 CHAPTER 1 Multiple Linear Regression
KEY TERMS
Autocorrelation: Correlation between adjacent observations in a (time) series.
In the regression context it is autocorrelation of the errors that is a violation of
assumptions.
Coefficient of determination (R2 ): The square of the multiple correlation
coefficient, estimates the proportion of variability in the target variable that is
explained by the predictors in the linear model.
Confidence interval for a fitted value: A measure of precision of the estimate
of the expected target value for a given x.
Dependent variable: Characteristic of each member of the sample that is
being modeled. This is also known as the target or response variable.
Fitted value: The least squares estimate of the expected target value for a
particular observation obtained from the fitted regression model.
Heteroscedasticity: Unequal variance; this can refer to observed unequal
variance of the residuals or theoretical unequal variance of the errors.
Homoscedasticity: Equal variance; this can refer to observed equal variance
of the residuals or the assumed equal variance of the errors.
Independent variable(s): Characteristic(s) of each member of the sample that
could be used to model the dependent variable. These are also known as the
predicting variables.
Model Building
2.1 Introduction 23
2.2 Concepts and Background Material 24
2.2.1 Using Hypothesis Tests to Compare Models 24
2.2.2 Collinearity 26
2.3 Methodology 29
2.3.1 Model Selection 29
2.3.2 Example — Estimating Home Prices (continued) 31
2.4 Indicator Variables and Modeling Interactions 38
2.4.1 Example — Electronic Voting and the 2004
Presidential Election 40
2.5 Summary 46
2.1 Introduction
All of the discussion in Chapter 1 is based on the premise that the only
model being considered is the one currently being fit. This is not a good data
analysis strategy, for several reasons.
1. Including unnecessary predictors in the model (what is sometimes called
overfitting) complicates descriptions of the process. Using such models
tends to lead to poorer predictions because of the additional unnecessary
noise. Further, a more complex representation of the true regression
relationship is less likely to remain stable enough to be useful for future
prediction than is a simpler one.
situation, it is likely that the t-statistic for each predictor will be relatively
small. This is not an inappropriate result, since given one predictor the other
adds little (being highly correlated with each other, one is redundant in the
presence of the other). This means that the t-statistics are not effective in
identifying important predictors when the two variables are highly correlated.
The t-tests and F -test of Section 1.3.3 are special cases of a general
formulation that is useful for comparing certain classes of models. It might be
the case that a simpler version of a candidate model (a subset model) might
be adequate to fit the data. For example, consider taking a sample of college
students and determining their college grade point average (GPA), Scholastic
Aptitude Test (SAT) evidence-based reading and writing score (Reading),
and SAT math score (Math). The full regression model to fit to these data is
GPAi = β0 + β1 Readingi + β2 Mathi + εi .
Instead of considering reading and math scores separately, we could consider
whether GPA can be predicted by one variable: total SAT score, which is the
sum of Reading and Math. This subset model is
GPAi = γ0 + γ1 (Reading + Math)i + εi ,
with β1 = β2 ≡ γ1 . This equality condition is called a linear restriction,
because it defines a linear condition on the parameters of the regression model
(that is, it only involves additions, subtractions, and equalities of coefficients
and constants).
The question about whether the total SAT score is sufficient to predict
grade point average can be stated using a hypothesis test about this linear
restriction. As always, the null hypothesis gets the benefit of the doubt; in this
case, that is the simpler restricted (subset) model that the sum of Reading
and Math is adequate, since it says that only one predictor is needed, rather
than two. The alternative hypothesis is the unrestricted full model (with no
conditions on β). That is,
H0 : β1 = β2
versus
Ha : β1 ≠ β2 .
These hypotheses are tested using a partial F -test. The F -statistic has the
form
F = [(Residual SS_subset − Residual SS_full)/d] / [Residual SS_full/(n − p − 1)],     (2.1)
where n is the sample size, p is the number of predictors in the full model, and
d is the difference between the number of parameters in the full model and
the number of parameters in the subset model. This statistic is compared to
an F distribution on (d, n − p − 1) degrees of freedom. So, for example, for
this GPA/SAT example, p = 2 and d = 3 − 2 = 1, so the observed F -statistic
would be compared to an F distribution on (1, n − 3) degrees of freedom.
Some statistical packages allow specification of the full and subset models and
will calculate the F -test, but others do not, and the statistic has to be calculated
manually based on the fits of the two models.
An alternative form for the F -test above might make clearer what is going
on here:
F = [(R²_full − R²_subset)/d] / [(1 − R²_full)/(n − p − 1)].
That is, if the strength of the fit of the full model (measured by R2 ) isn’t
much larger than that of the subset model, the F -statistic is small, and we do
not reject the subset model; if, on the other hand, the difference in R2 values
is large (implying that the fit of the full model is noticeably stronger), we do
reject the subset model in favor of the full model.
The F -statistic to test the overall significance of the regression is a special
case of this construction (with restriction β1 = · · · = βp = 0), as is each of the
individual t-statistics that test the significance of any variable (with restriction
βj = 0). In the latter case Fj = tj².
2.2.2 COLLINEARITY
Recall that the importance of a predictor can be difficult to assess using t-tests
when predictors are correlated with each other. A related issue is that of
collinearity (sometimes somewhat redundantly referred to as multicollinear-
ity), which refers to the situation when (some of) the predictors are highly
correlated with each other. The presence of predicting variables that are highly
correlated with each other can lead to instability in the regression coefficients,
increasing their standard errors, and as a result the t-statistics for the variables
can be deflated. This can be seen in Figure 2.1. The two plots refer to identical
data sets, other than the one data point that is lightly colored. Dropping
the data points down to the (x1 , x2 ) plane makes clear the high correlation
between the predictors. The estimated regression plane changes from
ŷ = 9.906 − 2.514x1 + 6.615x2
in the top plot to
ŷ = 9.748 + 9.315x1 − 5.204x2
in the bottom plot; a small change in only one data point causes a major
change in the estimated regression function.
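This instability is easy to reproduce. In the sketch below (made-up data, not the data behind Figure 2.1), x2 equals x1 plus a tiny wiggle, so the two predictors are almost perfectly correlated; nudging a single response value swings the individual coefficients wildly while their sum barely moves:

```python
import numpy as np

# Toy version of the Figure 2.1 situation (made-up data, not the
# data behind the figure): x2 is x1 plus a tiny alternating wiggle.
x1 = np.arange(10.0)
x2 = x1 + 0.01 * np.array([1.0, -1, 1, -1, 1, -1, 1, -1, 1, -1])
y = 10.0 + 2.0 * x1 + 3.0 * x2        # exact linear relationship

def coefs(y):
    X = np.column_stack([np.ones(10), x1, x2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta                        # (intercept, b1, b2)

b_before = coefs(y)

y_perturbed = y.copy()
y_perturbed[5] += 5.0                  # move a single response value
b_after = coefs(y_perturbed)

# The individual slopes swing by tens of units, but their sum --
# the only direction the near-collinear data can identify -- shifts
# by only a few hundredths.
swing = abs(b_after[2] - b_before[2])
sum_shift = abs((b_after[1] + b_after[2]) - (b_before[1] + b_before[2]))
```

This mirrors the sign flips between the two fitted planes in the figure: the data pin down b1 + b2 well, but split that total between the two coefficients almost arbitrarily.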
Thus, from a practical point of view, collinearity leads to two problems.
First, it can happen that the overall F -statistic is significant, yet each of the
individual t-statistics is not significant (more generally, the tail probability for
the F -test is considerably smaller than those of any of the individual coefficient
t-tests). Second, if the data are changed only slightly, the fitted regression
coefficients can change dramatically. Note that while collinearity can have a
large effect on regression coefficients and associated t-statistics, it does not
have a large effect on overall measures of fit like the overall F -test or R2 , since
adding unneeded variables (whether or not they are collinear with predictors
FIGURE 2.1: Least squares estimation under collinearity. The only change
in the data sets is the lightly colored data point. The planes are the estimated
least squares fits.
already in the model) cannot increase the residual sum of squares (it can only
decrease it or leave it roughly the same).
Another problem with collinearity comes from attempting to use a fitted
regression model for prediction. As was noted in Chapter 1, simple models tend
to forecast better than more complex ones, since they make fewer assumptions
about what the future will look like. If a model exhibiting collinearity is used
for future prediction, the implicit assumption is that the relationships among
the predicting variables, as well as their relationship with the target variable,
remain the same in the future. This is less likely to be true if the predicting
variables are collinear.
How can collinearity be diagnosed? The two-predictor model
yi = β0 + β1 x1i + β2 x2i + εi
provides some guidance. It can be shown that in this case
var(β̂1) = σ² [ Σ_{i=1}^n x1i² (1 − r12²) ]⁻¹

and

var(β̂2) = σ² [ Σ_{i=1}^n x2i² (1 − r12²) ]⁻¹,
r12     Variance inflation
0.00 1.00
0.50 1.33
0.70 1.96
0.80 2.78
0.90 5.26
0.95 10.26
0.97 16.92
0.99 50.25
0.995 100.00
0.999 500.00
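Each entry in the variance inflation column is just 1/(1 − r12²), the factor by which the coefficient variance grows relative to uncorrelated predictors. A quick check reproduces the first eight rows (the 0.995 and 0.999 entries appear to be rounded in the table; the exact values are 100.25 and 500.25):

```python
# The variance inflation column is 1 / (1 - r12**2): the factor by
# which var(beta_hat) is inflated relative to the r12 = 0 case.
for r12 in (0.00, 0.50, 0.70, 0.80, 0.90, 0.95, 0.97, 0.99):
    print(f"{r12:5.2f}  {1.0 / (1.0 - r12 ** 2):7.2f}")
```

Note how slowly the inflation grows at first (a correlation of 0.5 inflates the variance by only a third) and how explosively it grows as r12 approaches 1.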
where R²_model is the usual R² for the regression fit. This means that either the
predictors are more related to the target variable than they are to each other, or
they are not related to each other very much. In either case coefficient estimates