Regression Modeling Strategies: With Applications To Linear Models, Logistic and Ordinal Regression, and Survival Analysis (Springer Series in Statistics) - ISBN 3319194240, 978-3319194240
Frank E. Harrell, Jr.
Regression Modeling
Strategies
With Applications to Linear Models,
Logistic and Ordinal Regression,
and Survival Analysis
Second Edition
Frank E. Harrell, Jr.
Department of Biostatistics
School of Medicine
Vanderbilt University
Nashville, TN, USA
There are many books that are excellent sources of knowledge about
individual statistical tools (survival models, general linear models, etc.), but
the art of data analysis is about choosing and using multiple tools. In the
words of Chatfield [100, p. 420] “. . . students typically know the technical de-
tails of regression for example, but not necessarily when and how to apply it.
This argues the need for a better balance in the literature and in statistical
teaching between techniques and problem solving strategies.” Whether ana-
lyzing risk factors, adjusting for biases in observational studies, or developing
predictive models, there are common problems that few regression texts ad-
dress. For example, there are missing data in the majority of datasets one is
likely to encounter (other than those used in textbooks!) but most regression
texts do not include methods for dealing with such data effectively, and most
texts on missing data do not cover regression modeling.
This book links standard regression modeling approaches with
• methods for relaxing linearity assumptions that still allow one to easily
obtain predictions and confidence limits for future observations, and to do
formal hypothesis tests,
• non-additive modeling approaches not requiring the assumption that
interactions are always linear × linear,
• methods for imputing missing data and for penalizing variances for incom-
plete data,
• methods for handling large numbers of predictors without resorting to
problematic stepwise variable selection techniques,
• data reduction methods (unsupervised learning methods, some of which
are based on multivariate psychometric techniques too seldom used in
statistics) that help with the problem of “too many variables to analyze and
not enough observations” as well as making the model more interpretable
when there are predictor variables containing overlapping information,
• methods for quantifying predictive accuracy of a fitted model,
• powerful model validation techniques based on the bootstrap that allow the
analyst to estimate predictive accuracy nearly unbiasedly without holding
back data from the model development process, and
• graphical methods for understanding complex models.
On the last point, this text has special emphasis on what could be called
“presentation graphics for fitted models” to help make regression analyses
more palatable to non-statisticians. For example, nomograms have long been
used to make equations portable, but they are not drawn routinely because
doing so is very labor-intensive. An R function called nomogram in the package
described below draws nomograms from a regression fit, and these diagrams
can be used to communicate modeling results as well as to obtain predicted
values manually even in the presence of complex variable transformations.
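As a rough illustration of the nomogram function mentioned above, the following sketch fits a binary logistic model with rms and draws its nomogram. The simulated data and the variable names (age, sex, y) are hypothetical and are not taken from the text; the spline term is included only to show that a nonlinear transformation does not prevent the diagram from being drawn.

```r
# Minimal sketch: draw a nomogram from a logistic regression fit with rms.
# The data are simulated; age, sex, and y are hypothetical variables.
library(rms)

set.seed(1)
n   <- 500
age <- rnorm(n, mean = 50, sd = 10)
sex <- factor(sample(c("female", "male"), n, replace = TRUE))
# Simulate a binary outcome whose log odds depend on age and sex
lp  <- -5 + 0.08 * age + 0.5 * (sex == "male")
y   <- rbinom(n, size = 1, prob = plogis(lp))
d   <- data.frame(age, sex, y)

# rms uses a summary of the predictor distributions to choose axis ranges
dd <- datadist(d)
options(datadist = "dd")

# Age enters through a restricted cubic spline with 4 knots
fit <- lrm(y ~ rcs(age, 4) + sex, data = d)

# fun = plogis adds an axis that converts the linear predictor to a probability
nom <- nomogram(fit, fun = plogis, funlabel = "Predicted probability")
plot(nom)
```

Reading the plot by hand means locating each predictor value on its axis, summing the corresponding points, and reading the predicted probability from the bottom axis.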
Most of the methods in this text apply to all regression models, but special
emphasis is given to some of the most popular ones: multiple regression using
least squares and its generalized least squares extension for serial (repeated
measurement) data, the binary logistic model, models for ordinal responses,
parametric survival regression models, and the Cox semiparametric survival
model. There is also a chapter on nonparametric transform-both-sides regres-
sion. Emphasis is given to detailed case studies for these methods as well as
for data reduction, imputation, model simplification, and other tasks. Ex-
cept for the case study on survival of Titanic passengers, all examples are
from biomedical research. However, the methods presented here have broad
application to other areas including economics, epidemiology, sociology, psy-
chology, engineering, and predicting consumer behavior and other business
outcomes.
This text is intended for Masters or PhD level graduate students who
have had a general introductory probability and statistics course and who
are well versed in ordinary multiple regression and intermediate algebra. The
book is also intended to serve as a reference for data analysts and statistical
methodologists. Readers without a strong background in applied statistics
may wish to first study one of the many introductory applied statistics and
regression texts that are available. The author’s course notes Biostatistics
for Biomedical Research on the text’s web site cover basic regression and
many other topics. The paper by Nick and Hardin [476] also provides a good
introduction to multivariable modeling and interpretation. There are many
excellent intermediate level texts on regression analysis. One of them is by
Fox, which also has a companion software-based text [200, 201]. For readers
interested in medical or epidemiologic research, Steyerberg’s excellent text
Clinical Prediction Models [586] is an ideal companion for Regression Modeling
Strategies. Steyerberg’s book provides further explanations, examples, and
simulations of many of the methods presented here. And no text on regression
modeling should fail to mention the seminal work of John Nelder [450].
The overall philosophy of this book is summarized by the following state-
ments.
concrete. At the very least, the code demonstrates that all of the methods
presented in the text are feasible.
This text does not teach analysts how to use R. For that, the reader may
wish to see reading recommendations on www.r-project.org as well as Venables
and Ripley [635] (which is also an excellent companion to this text) and the
many other excellent texts on R. See the Appendix for more information.
In addition to powerful features that are built into R, this text uses a
package of freely available R functions called rms written by the author. rms
tracks modeling details related to the expanded X or design matrix. It is a
series of over 200 functions for model fitting, testing, estimation, validation,
graphics, prediction, and typesetting that work by storing enhanced model
design attributes in the fit. rms includes functions for least squares and penalized least
squares multiple regression modeling in addition to functions for binary and
ordinal regression, generalized least squares for analyzing serial data, quan-
tile regression, and survival analysis that are emphasized in this text. Other
freely available miscellaneous R functions used in the text are found in the
Hmisc package also written by the author. Functions in Hmisc include facilities
for data reduction, imputation, power and sample size calculation, advanced
table making, recoding variables, importing and inspecting data, and general
graphics. Consult the Appendix for information on obtaining Hmisc and rms.
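To make the description of rms a little more tangible, here is a minimal sketch of a fit followed by testing and bootstrap validation. The simulated predictors x1 and x2 and the choices of 4 knots and 200 bootstrap repetitions are assumptions made for illustration only, not recommendations from the text.

```r
# Minimal sketch of an rms workflow: fit, test, and validate a model.
# The data are simulated; x1, x2, and y are hypothetical variables.
library(rms)

set.seed(2)
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
# True log odds are linear in x1 and quadratic in x2
y  <- rbinom(n, size = 1, prob = plogis(0.3 + 0.8 * x1 - 0.5 * x2^2))
d  <- data.frame(x1, x2, y)

dd <- datadist(d)
options(datadist = "dd")

# x = TRUE, y = TRUE store the design matrix and response in the fit,
# which the resampling functions need
fit <- lrm(y ~ rcs(x1, 4) + rcs(x2, 4), data = d, x = TRUE, y = TRUE)

# Wald tests for each predictor, including tests of nonlinearity
print(anova(fit))

# Bootstrap estimates of optimism for indexes such as Dxy and the
# calibration slope, using 200 resamples
val <- validate(fit, B = 200)
print(val)
```

The validate output reports each index as originally computed, its estimated optimism, and the optimism-corrected value, all obtained without setting aside a separate test sample.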
The author and his colleagues have written SAS macros for fitting re-
stricted cubic splines and for other basic operations. See the Appendix for
more information. It is unfair not to mention some excellent capabilities of
other statistical packages such as Stata (which has also been extended to
provide regression splines and other modeling tools), but the extensibility
and graphics of R make it especially attractive for all aspects of the
comprehensive modeling strategy presented in this book.
Portions of Chapters 4 and 20 were published as reference [269]. Some of
Chapter 13 was published as reference [272].
The author may be contacted by electronic mail at f.harrell@vanderbilt.edu
and would appreciate being informed of unclear points, errors, and omissions
in this book. Suggestions for improvements and for future
topics are also welcome. As described on the web site, instructors may contact
the author to obtain copies of quizzes and extra assignments (both with
answers) related to much of the material in the earlier chapters, and to obtain
full solutions (with graphical output) to the majority of assignments in the
text.
Major changes since the first edition include the following:
1. Creation of a now mature R package, rms, that replaces and greatly ex-
tends the Design library used in the first edition
2. Conversion of all of the book’s code to R
3. Conversion of the book source into knitr [677] reproducible documents
4. All code from the text is executable and is on the web site
5. Use of color graphics and use of the ggplot2 graphics package [667]
6. Scanned images were re-drawn
Acknowledgments
A good deal of the writing of the first edition of this book was done during
my 17 years on the faculty of Duke University. I wish to thank my close col-
league Kerry Lee for providing many valuable ideas, fruitful collaborations,
and well-organized lecture notes from which I have greatly benefited over the
past years. Terry Therneau of Mayo Clinic has given me many of his wonderful
ideas for many years, and has written state-of-the-art R software for survival
analysis that forms the core of survival analysis software in my rms package.
Michael Symons of the Department of Biostatistics of the University of North
Carolina at Chapel Hill and Timothy Morgan of the Division of Public Health
Sciences at Wake Forest University School of Medicine also provided course
materials, some of which motivated portions of this text. My former clini-
cal colleagues in the Cardiology Division at Duke University, Robert Califf,
Phillip Harris, Mark Hlatky, Dan Mark, David Pryor, and Robert Rosati,
for many years provided valuable motivation, feedback, and ideas through
our interaction on clinical problems. Besides Kerry Lee, statistical colleagues
L. Richard Smith, Lawrence Muhlbaier, and Elizabeth DeLong clarified my
thinking and gave me new ideas on numerous occasions. Charlotte Nelson
and Carlos Alzola frequently helped me debug S routines when they thought
they were just analyzing data.
Former students Bercedis Peterson, James Herndon, Robert McMahon,
and Yuan-Li Shen have provided many insights into logistic and survival mod-
eling. Associations with Doug Wagner and William Knaus of the University
of Virginia, Ken Offord of Mayo Clinic, David Naftel of the University of Al-
abama in Birmingham, Phil Miller of Washington University, and Phil Good-
man of the University of Nevada Reno have provided many valuable ideas and
motivations for this work, as have Michael Schemper of Vienna University,
Janez Stare of Ljubljana University, Slovenia, Ewout Steyerberg of Erasmus
University, Rotterdam, Karel Moons of Utrecht University, and Drew Levy of
Genentech. Richard Goldstein, along with several anonymous reviewers, pro-
vided many helpful criticisms of a previous version of this manuscript that
resulted in significant improvements, and critical reading by Bob Edson (VA
Cooperative Studies Program, Palo Alto) resulted in many error corrections.
Thanks to Brian Ripley of the University of Oxford for providing many help-
ful software tools and statistical insights that greatly aided in the production
of this book, and to Bill Venables of CSIRO Australia for wisdom, both sta-
tistical and otherwise. This work would also not have been possible without
the S environment developed by Rick Becker, John Chambers, and Allan Wilks,
and the R language developed by Ross Ihaka and Robert Gentleman.
Work for the second edition was done in the excellent academic environ-
ment of Vanderbilt University, where biostatistical and biomedical colleagues
and graduate students provided new insights and stimulating discussions.
Thanks to Nick Cox, Durham University, UK, who provided from his careful
reading of the first edition a very large number of improvements and correc-
tions that were incorporated into the second. Four anonymous reviewers of
the second edition also made numerous suggestions that improved the text.
1 Introduction
1.1 Hypothesis Testing, Estimation, and Prediction
1.2 Examples of Uses of Predictive Multivariable Modeling
1.3 Prediction vs. Classification
1.4 Planning for Modeling
1.4.1 Emphasizing Continuous Variables
1.5 Choice of the Model
1.6 Further Reading
3 Missing Data
3.1 Types of Missing Data
3.2 Prelude to Modeling
3.3 Missing Values for Different Types of Response Variables
3.4 Problems with Simple Alternatives to Imputation
3.5 Strategies for Developing an Imputation Model
3.6 Single Conditional Mean Imputation
3.7 Predictive Mean Matching
3.8 Multiple Imputation
3.8.1 The aregImpute and Other Chained Equations Approaches
3.9 Diagnostics
3.10 Summary and Rough Guidelines
3.11 Further Reading
3.12 Problems
6 R Software
6.1 The R Modeling Language
6.2 User-Contributed Functions
6.3 The rms Package
6.4 Other Functions
6.5 Further Reading
References
Index