
ORIGINAL ARTICLE

Assessing the Performance of Prediction Models


A Framework for Traditional and Novel Measures

Ewout W. Steyerberg,a Andrew J. Vickers,b Nancy R. Cook,c Thomas Gerds,d Mithat Gonen,b
Nancy Obuchowski,e Michael J. Pencina,f and Michael W. Kattane

Abstract: The performance of prediction models can be assessed using a variety of methods and metrics. Traditional measures for binary and survival outcomes include the Brier score to indicate overall model performance, the concordance (or c) statistic for discriminative ability (or area under the receiver operating characteristic [ROC] curve), and goodness-of-fit statistics for calibration. Several new measures have recently been proposed that can be seen as refinements of discrimination measures, including variants of the c statistic for survival, reclassification tables, net reclassification improvement (NRI), and integrated discrimination improvement (IDI). Moreover, decision-analytic measures have been proposed, including decision curves to plot the net benefit achieved by making decisions based on model predictions.
We aimed to define the role of these relatively novel approaches in the evaluation of the performance of prediction models. For illustration, we present a case study of predicting the presence of residual tumor versus benign tissue in patients with testicular cancer (n = 544 for model development, n = 273 for external validation).
We suggest that reporting discrimination and calibration will always be important for a prediction model. Decision-analytic measures should be reported if the predictive model is to be used for clinical decisions. Other measures of performance may be warranted in specific applications, such as reclassification metrics to gain insight into the value of adding a novel predictor to an established model.

(Epidemiology 2010;21:128-138)

Submitted 9 February 2009; accepted 24 June 2009.
From the aDepartment of Public Health, Erasmus MC, Rotterdam, The Netherlands; bDepartment of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY; cBrigham and Women's Hospital, Harvard Medical School, Boston, MA; dInstitute of Public Health, University of Copenhagen, Copenhagen, Denmark; eDepartment of Quantitative Health Sciences, Cleveland Clinic, Cleveland, OH; and fDepartment of Mathematics and Statistics, Boston University, Boston, MA.
This paper was based on discussions at an international symposium "Measuring the accuracy of prediction models" (Cleveland, OH, Sept 29, 2008, http://www.bio.ri.ccf.org/html/symposium.html), which was supported by the Cleveland Clinic Department of Quantitative Health Sciences and the Page Foundation.
Supplemental digital content is available through direct URL citations in the HTML and PDF versions of this article (www.epidem.com).
Editors' note: Related articles appear on pages 139 and 142.
Correspondence: Ewout W. Steyerberg, Department of Public Health, Erasmus MC, PO Box 2040, 3000 CA Rotterdam, The Netherlands. E-mail: [email protected].
Copyright © 2009 by Lippincott Williams & Wilkins
ISSN: 1044-3983/10/2101-0128
DOI: 10.1097/EDE.0b013e3181c30fb2

From a research perspective, diagnosis and prognosis constitute a similar challenge: the clinician has some information and wants to know how this relates to the true patient state, whether this can be known currently (diagnosis) or only at some point in the future (prognosis). This information can take various forms, including a diagnostic test, a marker value, or a statistical model including several predictor variables. For most medical applications, the outcome of interest is binary and the information can be expressed as probabilistic predictions.1 Predictions are hence absolute risks, which go beyond assessments of relative risks, such as regression coefficients, odds ratios, or hazard ratios.2
There are various ways to assess the performance of a statistical prediction model. The customary statistical approach is to quantify how close predictions are to the actual outcome, using measures such as explained variation (eg, R2 statistics) and the Brier score.3 Performance can further be quantified in terms of calibration (do close to x of 100 patients with a risk prediction of x% have the outcome?), using, for example, the Hosmer-Lemeshow "goodness-of-fit" test.4 Furthermore, discrimination is essential (do patients with the outcome have higher risk predictions than those without?). Discrimination can be quantified with measures such as sensitivity, specificity, and the area under the receiver operating characteristic curve (or concordance statistic, c).1,5
Recently, several new measures have been proposed to assess the performance of a prediction model. These include variants of the c statistic for survival,6,7 reclassification tables,8 net reclassification improvement (NRI), and integrated discrimination improvement (IDI),9 which are refinements of discrimination measures. The concept of risk reclassification has caused substantial discussion in the methodologic and clinical literature.10-14 Moreover, decision-analytic measures have been proposed, including "decision curves" to plot the net benefit achieved by making decisions based on model predictions.15 These measures have not yet been widely used in practice, which may partly be due to their novelty among applied researchers.16 In this article, we aim to clarify the role of these relatively novel approaches in the evaluation of the performance of prediction models.
We first briefly discuss prediction models in medicine. Next, we review the properties of a number of traditional and relatively novel measures for the assessment of the performance of an existing prediction model, or extensions to a model. For illustration, we present a case study of predicting the presence of residual tumor versus benign tissue in patients with testicular cancer.

PREDICTION MODELS IN MEDICINE

Developing Valid Prediction Models
We consider prediction models that provide predictions for a dichotomous outcome, because these are most relevant in medical applications. The outcome can be either an underlying diagnosis (eg, presence of benign or malignant histology in a residual mass after cancer treatment), an outcome occurring within a relatively short time after making the prediction (eg, 30-day mortality), or a long-term outcome (eg, 10-year incidence of coronary artery disease, with censored follow-up of some patients).
At model development, we aim for at least internally valid predictions, ie, predictions that are valid for subjects from the underlying population.17 Preferably, the predictions are also generalizable to "plausibly related" populations.18 Various epidemiologic and statistical issues need to be considered in a modeling strategy for empirical data.1,19,20 When a model is developed, we want some quantification of its performance, such that we can judge whether the model is adequate for its purpose, or better than an existing model.

Model Extension With a Marker
We recognize that a key interest in contemporary medical research is whether a marker (eg, molecular, genetic, imaging) adds to the performance of an existing model. Often, new markers are selected from a large set based on strength of association in a particular study. This poses a high risk of overoptimistic expectations of the marker's performance.21,22 Moreover, we are interested only in the incremental value of a marker, on top of predictors that are readily accessible. Validation in fully independent, external data is the best way to compare the performance of a model with and without a new marker.21,23

Usefulness of Prediction Models
Prediction models can be useful for several purposes, such as to decide inclusion criteria or covariate adjustment in a randomized controlled trial.24-26 In observational studies, a prediction model may be used for confounder adjustment or case-mix adjustment in comparing an outcome between centers.27 We concentrate here on the usefulness of a prediction model for medical practice, including public health (eg, screening for disease) and patient care (diagnosing patients, giving prognostic estimates, decision support).
An important role of prediction models is to inform patients about their prognosis, for example, after a cancer diagnosis has been made.28 A natural requirement for a model in this situation is that predictions are well calibrated (or "reliable").29,30
A specific situation may be that limited resources need to be targeted to those with the highest expected benefit, such as those at highest risk. This situation calls for a model that accurately distinguishes those at high risk from those at low risk.
Decision support is another important area, including decisions on the need for further diagnostic testing (tests may be burdensome or costly to a patient), or therapy (eg, surgery with risks of morbidity and mortality).31 Such decisions are typically binary and require decision thresholds that are clinically relevant.

TRADITIONAL PERFORMANCE MEASURES
We briefly consider some of the more commonly used performance measures in medicine, without intending to be comprehensive (Table 1).

Overall Performance Measures
From a statistical modeler's perspective, the distance between the predicted outcome and actual outcome is central to quantifying overall model performance.32 The distance is Y - Ŷ for continuous outcomes. For binary outcomes, with Y defined 0-1, Ŷ is equal to the predicted probability p, and for survival outcomes, it is the predicted event probability at a given time (or as a function of time). These distances between observed and predicted outcomes are related to the concept of "goodness-of-fit" of a model, with better models having smaller distances between predicted and observed outcomes. The main difference between goodness-of-fit and predictive performance is that the former is usually evaluated in the same data while the latter requires either new data or cross-validation.
Explained variation (R2) is the most common performance measure for continuous outcomes. For generalized linear models, Nagelkerke's R2 is often used.1,33 This is based on a logarithmic scoring rule: for binary outcomes Y, we score a model with the logarithm of predictions p: Y × log(p) + (1 - Y) × log(1 - p). Nagelkerke's R2 can also be calculated for survival outcomes, based on the difference in -2 log likelihood of a model without and a model with one or more predictors.
The Brier score is a quadratic scoring rule, where the squared differences between actual binary outcomes Y and predictions p are calculated: (Y - p)^2.3 We can also write this in a way similar to the logarithmic score: Y × (1 - p)^2 + (1 - Y) × p^2. The Brier score for a model can range from 0 for a perfect model to 0.25 for a noninformative model with a 50% incidence of the outcome. When the outcome incidence is lower, the maximum score for a noninformative model is lower, eg, for 10%: 0.1 × (1 - 0.1)^2 + (1 - 0.1) × 0.1^2 = 0.090. Similar to Nagelkerke's approach to the LR statistic, we could scale the Brier score by its maximum under a noninformative model: Brier_scaled = 1 - Brier/Brier_max, where Brier_max = mean(p) × (1 - mean(p)), to let it range between 0% and 100%. This scaled Brier score happens to be very similar to Pearson's R2 statistic.35
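To make these overall performance formulas concrete, a minimal sketch in base R is given below (this is illustrative only, not the code from the article's eAppendix; the simulated data and object names are assumptions):

## Illustrative sketch: overall performance measures for a logistic model
## fitted to simulated data (names and data are assumptions, not the
## article's analysis).
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + x))                 # simulated binary outcome
fit  <- glm(y ~ x, family = binomial)
null <- glm(y ~ 1, family = binomial)
p <- predict(fit, type = "response")              # predicted probabilities

# Logarithmic score and Nagelkerke's R2
logscore <- mean(y * log(p) + (1 - y) * log(1 - p))
LR <- as.numeric(2 * (logLik(fit) - logLik(null)))
r2_coxsnell   <- 1 - exp(-LR / n)
r2_nagelkerke <- r2_coxsnell / (1 - exp(2 * as.numeric(logLik(null)) / n))

# Brier score and scaled Brier score
brier        <- mean((y - p)^2)
brier_max    <- mean(p) * (1 - mean(p))           # noninformative benchmark
brier_scaled <- 1 - brier / brier_max             # 0 = noninformative, 1 = perfect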



TABLE 1. Characteristics of Some Traditional and Novel Performance Measures

Aspect | Measure | Visualization | Characteristics
Overall performance | R2, Brier | Validation graph | Better with lower distance between Y and Ŷ; captures calibration and discrimination aspects
Discrimination | c statistic | ROC curve | Rank order statistic; interpretation for a pair of subjects with and without the outcome
Discrimination | Discrimination slope | Box plot | Difference in mean of predictions between outcomes; easy visualization
Calibration | Calibration-in-the-large | Calibration or validation graph | Compare mean(y) versus mean(ŷ); essential aspect for external validation
Calibration | Calibration slope | Calibration or validation graph | Regression slope of linear predictor; essential aspect for internal and external validation; related to "shrinkage" of regression coefficients
Calibration | Hosmer-Lemeshow test | | Compares observed to predicted by decile of predicted probability
Reclassification | Reclassification table | Cross-table or scatter plot | Compare classifications from 2 models (one with, one without a marker) for changes
Reclassification | Reclassification statistic | | Compare observed outcomes to predicted risks within cross-classified categories
Reclassification | Net reclassification index (NRI) | | Compare classifications from 2 models for changes by outcome for a net calculation of changes in the right direction
Reclassification | Integrated discrimination index (IDI) | Box plots for 2 models (one with, one without a marker) | Integrates the NRI over all possible cut-offs; equivalent to difference in discrimination slopes
Clinical usefulness | Net benefit (NB); decision curve analysis (DCA) | Cross-table (NB); decision curve (DCA) | Net number of true positives gained by using a model compared to no model, at a single threshold (NB) or over a range of thresholds (DCA)

Calculation of the Brier score for survival outcomes is possible with a weight function, which considers the conditional probability of being uncensored during time.3,36,37 We can then calculate the Brier score at fixed time points, and create a time-dependent curve. It is useful to use a benchmark curve, based on the Brier score for the overall Kaplan-Meier estimator, that does not consider any predictive information.3 It turns out that overall performance measures comprise 2 important characteristics of a prediction model, discrimination and calibration, each of which can be assessed separately.

Discrimination
Accurate predictions discriminate between those with and those without the outcome. Several measures can be used to indicate how well we classify patients in a binary prediction problem. The concordance (c) statistic is the most commonly used performance measure to indicate the discriminative ability of generalized linear regression models. For a binary outcome, c is identical to the area under the receiver operating characteristic (ROC) curve, which plots the sensitivity (true positive rate) against 1 - specificity (false positive rate) for consecutive cut-offs for the probability of an outcome.
The c statistic is a rank-order statistic for predictions against true outcomes, related to Somers' D statistic.1 As a rank-order statistic, it is insensitive to systematic errors in calibration such as differences in average outcome. A popular extension of the c statistic with censored data can be obtained by ignoring the pairs that cannot be ordered.1 It turns out that this results in a statistic that depends on the censoring pattern. Gonen and Heller have proposed a method to estimate a variant of the c statistic that is independent of censoring, but holds only in the context of a Cox proportional hazards model.7 Furthermore, time-dependent c statistics have been proposed.6,38
In addition to the c statistic, the discrimination slope can be used as a simple measure for how well subjects with and without the outcome are separated.39 The discrimination slope is calculated as the absolute difference in average predictions for those with and without the outcome. Visualization is readily possible with a box plot or a histogram; a better discriminating model will show less overlap between those with and those without the outcome. Extensions of the discrimination slope have not yet been made to the survival context.
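As an illustration (a sketch that assumes the predicted probabilities p and outcomes y from the simulated example above, not output from the article's case study), both measures can be obtained in a few lines of base R:

## Illustrative sketch: discrimination measures for predictions p and outcome y.
## The c statistic equals the Mann-Whitney U statistic scaled to 0-1: the
## probability that a random subject with the outcome has a higher predicted
## risk than a random subject without the outcome (ties count one half).
w <- wilcox.test(p[y == 1], p[y == 0])$statistic
c_stat <- as.numeric(w) / (sum(y == 1) * sum(y == 0))

## Discrimination slope: difference in mean predictions by outcome,
## easily visualized with a box plot.
disc_slope <- mean(p[y == 1]) - mean(p[y == 0])
boxplot(p ~ y, names = c("without outcome", "with outcome"),
        ylab = "Predicted probability")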



Calibration
Calibration refers to the agreement between observed outcomes and predictions.29 For example, if we predict a 20% risk of residual tumor for a testicular cancer patient, the observed frequency of tumor should be approximately 20 of 100 patients with such a prediction. A graphical assessment of calibration is possible, with predictions on the x-axis and the outcome on the y-axis. Perfect predictions should be on the 45-degree line. For linear regression, the calibration plot is a simple scatter plot. For binary outcomes, the plot contains only 0 and 1 values for the y-axis. Smoothing techniques can be used to estimate the observed probabilities of the outcome (p(y = 1)) in relation to the predicted probabilities, eg, using the loess algorithm.1 We may, however, expect that the specific type of smoothing may affect the graphical impression, especially in smaller data sets. We can also plot results for subjects with similar probabilities, and thus compare the mean predicted probability with the mean observed outcome. For example, we can plot observed outcome by decile of predictions, which makes the plot a graphical illustration of the Hosmer-Lemeshow goodness-of-fit test. A better discriminating model has more spread between such deciles than a poorly discriminating model. Note that such grouping, though common, is arbitrary and imprecise.
The calibration plot can be characterized by an intercept a, which indicates the extent that predictions are systematically too low or too high ("calibration-in-the-large"), and a calibration slope b, which should be 1.40 Such a recalibration framework was previously proposed by Cox.41 At model development, a = 0 and b = 1 for regression models. At validation, calibration-in-the-large problems are common, as well as b smaller than 1, reflecting overfitting of a model.1 A value of b smaller than 1 can also be interpreted as reflecting a need for shrinkage of regression coefficients in a prediction model.42,43
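A minimal sketch of how these recalibration parameters can be estimated at validation, assuming a numeric 0/1 outcome vector y_val and the linear predictor lp_val of the previously developed model evaluated in the validation data (illustrative names; this is not the article's eAppendix code):

## Illustrative sketch: Cox recalibration at (external) validation.
## Calibration slope b: coefficient for the linear predictor; ideally 1,
## values below 1 suggest overfitting of the development model.
cal_slope <- coef(glm(y_val ~ lp_val, family = binomial))["lp_val"]

## Calibration-in-the-large a: intercept with the slope fixed at 1 via an
## offset; ideally 0.
cal_large <- coef(glm(y_val ~ offset(lp_val), family = binomial))["(Intercept)"]

## Smoothed calibration plot: loess estimate of the observed outcome versus
## the predicted probability, with the 45-degree line as reference.
p_val <- plogis(lp_val)
sm <- loess(y_val ~ p_val)
plot(p_val, predict(sm), xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Predicted probability", ylab = "Observed proportion (loess)")
abline(0, 1, lty = 2)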
NOVEL PERFORMANCE MEASURES
We now discuss some relatively novel performance measures, again without attempting to be comprehensive.

Novel Measures Related to Reclassification
Cook8 has proposed to make a "reclassification table" by adding a marker to a model to show how many subjects are reclassified. For example, a model with traditional risk factors for cardiovascular disease was extended with the predictors "parental history of myocardial infarction" and "CRP." The increase in c statistic was minimal (from 0.805 to 0.808). However, when Cook classified the predicted risks into 4 categories (0-5, 5-10, 10-20, >20% 10-year cardiovascular disease risk), about 30% of individuals changed category when comparing the extended model with the traditional one. Change in risk categories, however, is insufficient to evaluate improvement in risk stratification; the changes must be appropriate. One way to evaluate this is to compare the observed incidence of events in the cells of the reclassification table with the predicted probability from the original model. Cook proposed a reclassification test as a variant of the Hosmer-Lemeshow statistic within the reclassified categories, leading to a χ2 statistic.44
Pencina et al9 have extended the reclassification idea by conditioning on the outcome: reclassification of subjects with and without the outcome should be considered separately. Any upward movement in categories for subjects with the outcome implies improved classification, and any downward movement indicates worse reclassification. The interpretation is opposite for subjects without the outcome. The improvement in reclassification was quantified as the sum of two differences: the proportion of individuals moving up minus the proportion moving down among those with the outcome, and the proportion moving down minus the proportion moving up among those without the outcome. This sum was labeled the Net Reclassification Improvement (NRI). Also, a measure that integrates net reclassification over all possible cut-offs for the probability of the outcome was proposed (integrated discrimination improvement [IDI]).9 The IDI is equivalent to the difference in discrimination slopes of 2 models, and to the difference in Pearson R2 measures,45 or the difference in scaled Brier scores.

Novel Measures Related to Clinical Usefulness
Some performance measures imply that false-negative and false-positive classifications are equally harmful. For example, the calculation of error rates is usually made by classifying subjects as positive when their predicted probability of the outcome exceeds 50%, and as negative otherwise. This implies an equal weighting of false-positive and false-negative classifications.
In the calculation of the NRI, the improvement in sensitivity and the improvement in specificity are summed. This implies relatively more weight for positive outcomes if a positive outcome was less common, and less weight if a positive outcome was more common than a negative outcome. The weight is equal to the nonevents odds: (1 - mean(p))/mean(p), where mean(p) is the average probability of a positive outcome. Accordingly, although weighting is not equal, it is not explicitly based on clinical consequences. Defining the best diagnostic test as the one closest to the top left-hand corner of the ROC curve, that is, the test with the highest sum of sensitivity and specificity (the Youden46 index: Se + Sp - 1), similarly implies weighting by the nonevents odds.
Vickers and Elkin15 proposed decision-curve analysis as a simple approach to quantify the clinical usefulness of a prediction model (or an extension to a model). For a formal decision analysis, harms and benefits need to be quantified, leading to an optimal decision threshold.47 It can be difficult to define this threshold.15 Difficulties may lie at the population level, ie, there may be insufficient data on harms and benefits. Moreover, the relative weight of harms and benefits may differ from patient to patient, necessitating individual thresholds. Hence, we may consider a range of thresholds for the probability of the outcome, similar to ROC curves that consider the full range of cut-offs rather than a single cut-off for a sensitivity/specificity pair.
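The reclassification measures can be illustrated with the short sketch below, which assumes two vectors of predicted probabilities, p1 (model without the marker) and p2 (model with the marker), for the same outcome y, and uses a single illustrative cut-off of 20% (these objects are assumptions, not the article's data):

## Illustrative sketch: reclassification table, NRI, and IDI at a 20% cut-off.
cutoff <- 0.20
cls1 <- cut(p1, c(0, cutoff, 1), labels = c("low", "high"))  # without marker
cls2 <- cut(p2, c(0, cutoff, 1), labels = c("low", "high"))  # with marker

## Reclassification table by outcome (compare Table 4 of the case study)
table(without = cls1, with = cls2, outcome = y)

## NRI: net proportion of events moving up plus net proportion of
## non-events moving down.
up   <- cls1 == "low"  & cls2 == "high"
down <- cls1 == "high" & cls2 == "low"
nri_events    <- (sum(up[y == 1]) - sum(down[y == 1])) / sum(y == 1)
nri_nonevents <- (sum(down[y == 0]) - sum(up[y == 0])) / sum(y == 0)
nri <- nri_events + nri_nonevents

## IDI: difference in discrimination slopes between the two models.
idi <- (mean(p2[y == 1]) - mean(p2[y == 0])) -
       (mean(p1[y == 1]) - mean(p1[y == 0]))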



The key aspect of decision-curve analysis is that a single probability threshold can be used both to categorize patients as positive or negative and to weight false-positive and false-negative classifications.48 If we assume that the harm of unnecessary treatment (a false-positive decision) is relatively limited, such as antibiotics for infection, the cut-off should be low. In contrast, if overtreatment is quite harmful, such as extensive surgery, we should use a higher cut-off before a treatment decision is made. The harm-to-benefit ratio hence defines the relative weight w of false-positive decisions to true-positive decisions. For example, a cut-off of 10% implies that FP decisions are valued at 1/9th of a TP decision, and w = 0.11. The performance of a prediction model can then be summarized as a Net Benefit: NB = (TP - w × FP)/N, where TP is the number of true-positive decisions, FP the number of false-positive decisions, N is the total number of patients, and w is a weight equal to the odds of the cut-off (pt/(1 - pt)), or the ratio of harm to benefit.48 Documentation and software for decision-curve analysis is publicly available (www.decisioncurveanalysis.org).
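A sketch of the net benefit calculation over a range of thresholds (ie, a decision curve) is given below; it assumes predicted probabilities p and outcomes y as in the earlier sketches and only illustrates the arithmetic, whereas the website above provides documented implementations:

## Illustrative sketch: net benefit NB = (TP - w * FP)/N at threshold pt,
## with w = pt/(1 - pt), compared with the "treat all" strategy.
net_benefit <- function(p, y, pt) {
  pos <- p >= pt                                  # classified as high risk
  tp  <- sum(pos & y == 1)
  fp  <- sum(pos & y == 0)
  (tp - fp * pt / (1 - pt)) / length(y)
}

thresholds <- seq(0.05, 0.50, by = 0.01)
nb_model <- sapply(thresholds, function(pt) net_benefit(p, y, pt))
nb_all   <- sapply(thresholds,                    # everyone treated as high risk
                   function(pt) (sum(y == 1) - sum(y == 0) * pt / (1 - pt)) / length(y))

## Decision curve: model versus treat-all versus treat-none (NB = 0)
plot(thresholds, nb_model, type = "l",
     xlab = "Threshold probability", ylab = "Net benefit")
lines(thresholds, nb_all, lty = 2)
abline(h = 0, lty = 3)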
Validation Graphs as Summary Tools
We can extend the calibration graph to a validation graph.20 The distribution of predictions in those with and without the outcome is plotted at the bottom of the graph, capturing information on discrimination, similar to what is shown in a box plot. Moreover, it is important to have 95% confidence intervals (CIs) around deciles (or other quantiles) of predicted risk to indicate uncertainty in the assessment of validity. From the validation graph we can learn the discriminative ability of a model (eg, study the spread in observed outcomes by deciles of predicted risks), the calibration (closeness of observed outcomes to the 45-degree line), and the clinical usefulness (how many predictions are above or below clinically relevant decision thresholds).

APPLICATION TO TESTICULAR CANCER CASE STUDY

Patients
Men with metastatic nonseminomatous testicular cancer can often be cured by cisplatin-based chemotherapy. After chemotherapy, surgical resection is generally carried out to remove remnants of the initial metastases that may still be present. In the absence of tumor, resection has no therapeutic benefits, while it is associated with hospital admission and with risks of permanent morbidity and mortality. Logistic regression models were developed to predict the presence of residual tumor, combining well-known predictors such as the histology of the primary tumor, prechemotherapy levels of tumor markers, and (reduction in) residual mass size.49

TABLE 2. Logistic Regression Models in Testicular Cancer Dataset (n = 544), Without and With the Tumor Marker LDH (a)

Characteristic | Without LDH, OR (95% CI) | With LDH, OR (95% CI)
Primary tumor teratoma-positive | 2.7 (1.8-4.0) | 2.5 (1.6-3.8)
Prechemotherapy AFP elevated | 2.4 (1.5-3.7) | 2.5 (1.6-3.9)
Prechemotherapy HCG elevated | 1.7 (1.1-2.7) | 2.2 (1.4-3.4)
Square root of postchemotherapy mass size (mm) | 1.08 (0.95-1.23) | 1.34 (1.14-1.57)
Reduction in mass size per 10% | 0.77 (0.70-0.85) | 0.85 (0.77-0.95)
Prechemotherapy LDH (log(LDH/upper limit of local normal value)) | not included | 0.37 (0.25-0.56)

Continuous predictors were first studied with restricted cubic spline functions, and then simplified to simple parametric forms.
(a) The outcome was residual tumor at postchemotherapy resection (299/544, 55%).

TABLE 3. Performance of Testicular Cancer Models With or Without the Tumor Marker LDH in the Development Dataset (n = 544) and the Validation Dataset (n = 273)

Performance Measure | Development, Without LDH | Development, With LDH | External Validation, Without LDH
Overall
  Brier | 0.174 | 0.163 | 0.161
  Brier_scaled | 29.8% | 34.0% | 20.0%
  R2 (Nagelkerke) | 38.9% | 43.1% | 25.0%
Discrimination
  c statistic | 0.818 (0.78-0.85) | 0.839 (0.81-0.87) | 0.785 (0.73-0.84)
  Discrimination slope | 0.301 | 0.340 | 0.237
Calibration
  Calibration-in-the-large | 0 | 0 | -0.03
  Calibration slope | 1 | 1 | 0.74
  Hosmer-Lemeshow test | χ2 = 6.2, P = 0.63 | χ2 = 12.0, P = 0.15 | χ2 = 15.9, P = 0.07
Clinical usefulness
  Net benefit at threshold 20% (a) | 0.2% | 1.2% | 0.1%

(a) Compared to resecting all.




We first consider a dataset with 544 patients to develop a prediction model that includes 5 predictors (Table 2). We then extend this model with the prechemotherapy level of the tumor marker lactate dehydrogenase (LDH). This illustrates ways to assess the incremental value of a marker. LDH values were log-transformed, after examination of nonlinearity with restricted cubic spline functions and standardizing by dividing by the local upper levels of normal values.50 In a later study, we externally validated the 5-predictor model in 273 patients from a tertiary referral center, where LDH was not recorded.51 This comparison illustrates ways to assess the usefulness of a model in a new setting.
A clinically relevant cut-off point for the risk of tumor was based on a decision analysis, in which estimates from the literature and from experts in the field were used to formally weigh the harms of missing tumor against the benefits of resection in those with tumor.52 This analysis indicated that a risk threshold of 20% would be clinically reasonable.

Incremental Value of a Marker
Adding the LDH value to the 5-predictor model increased the model χ2 from 187 to 212 (LR statistic 25, P < 0.001) in the development data set. LDH hence had additional predictive value. Overall performance also improved: Nagelkerke's R2 increased from 39% to 43%, and the Brier score decreased from 0.17 to 0.16 (Table 3). The discriminative ability showed a small increase (c rose from 0.82 to 0.84, Fig. 1). Similarly, the discrimination slope increased from 0.30 to 0.34 (Fig. 2), producing an IDI of 4%.

FIGURE 1. Receiver operating characteristic (ROC) curves (A) for the predicted probabilities without (solid line) and with (dashed line) the tumor marker LDH in the development data set (n = 544) and (B) for the predicted probabilities without the tumor marker LDH from the development data set in the validation data set (n = 273). Threshold probabilities are indicated.

Using a cut-off of 20% for the risk of tumor led to the classification of 465 patients as being at high risk for residual tumor with the original model, and 469 at high risk with the extended model (Table 4). The extended model reclassified 19 of the 465 high-risk patients as low risk (4%), and 23 of the 79 low-risk patients as high risk (29%). The total reclassification was hence 7.7% (42/544). Based on the observed proportions, those who were reclassified were placed into more appropriate categories. The P value for Cook's reclassification test was 0.030, comparing predictions from the original model with observed outcomes in the 4 cells of Table 4. A more detailed assessment of the reclassification is obtained by a scatter plot with symbols by outcome (tumor or necrosis, Fig. 3). Note that some patients with necrosis have higher predicted risks in the model without LDH values than in the model with LDH (circles in the right lower corner of the graph). The improvement in reclassification for those with tumor was 1.7% ((8 - 3)/299), and for those with necrosis 0.4% ((16 - 15)/245). Thus, the NRI was 2.1% (95% CI = -2.9% to +7.0%), which is a much lower percentage than the 7.7% for all reclassified patients. The IDI was already estimated from Figure 2 as 4%.
A cut-off of 20% implies a relative weight of 1:4 for false-positive decisions against true-positive decisions. For the model without LDH, the net benefit was (TP - w × FP)/N = (284 - 0.25 × (465 - 284))/544 = 0.439. If we were to do resection in all, the net benefit would, however, be similar: (299 - 0.25 × (544 - 299))/544 = 0.437. The model with LDH has a better NB: (289 - 0.25 × (469 - 289))/544 = 0.449. Hence, at this particular cut-off, the model with LDH would be expected to lead to one more mass with tumor being resected per 100 patients at the same number of unnecessary resections of necrosis. The decision curve shows that the net benefit would be much larger for higher threshold values (Fig. 4), ie, patients accepting higher risks of residual tumor.

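These net benefit figures can be verified directly from the counts reported above and in Table 4; the following lines are a simple arithmetic check (not part of the original analysis):

## Arithmetic check of the reported net benefit values at the 20% threshold,
## using the counts from Table 4 (N = 544; w = 0.20/0.80 = 0.25).
w <- 0.20 / (1 - 0.20)
nb_without_ldh <- (284 - w * (465 - 284)) / 544   # = 0.439
nb_resect_all  <- (299 - w * (544 - 299)) / 544   # = 0.437
nb_with_ldh    <- (289 - w * (469 - 289)) / 544   # = 0.449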



TABLE 4. Reclassification for the Predicted Probabilities Without and With the Tumor Marker LDH in the Development Dataset

Without LDH \ With LDH | Risk ≤20% | Risk >20% | Total
Risk ≤20% | n = 56; 7 tumor (12%) | n = 23; 8 tumor (35%) | n = 79; 15 tumor (19%)
Risk >20% | n = 19; 3 tumor (16%) | n = 446; 281 tumor (63%) | n = 465; 284 tumor (61%)
Total | n = 75; 10 tumor (13%) | n = 469; 289 tumor (62%) | n = 544; 299 tumor (55%)

FIGURE 2. Box plots of predicted probabilities for patients without (left box of each pair) and with (right box of each pair) residual tumor. A, Development data, model without LDH; B, Development data, model with LDH; C, Validation data, model without LDH. The discrimination slope is calculated as the difference between the mean predicted probability with and without residual tumor (solid dots indicate means). The difference between discrimination slopes is equivalent to the integrated discrimination index (B vs A: IDI = 0.04).

FIGURE 3. Scatter plot of predicted probabilities without and with the tumor marker LDH (+, tumor; o, necrosis). Some patients with necrosis have higher predicted risks of tumor according to the model without LDH than according to the model with LDH (circles in the right lower corner of the graph). For example, we note a patient with necrosis and an original prediction of nearly 60%, who is reclassified as less than 20% risk.

External Validation
Overall model performance in the new cohort of 273 patients (197 with residual tumor) was less than at development, according to R2 (25% instead of 39%) and scaled Brier scores (20% instead of 30%). Also, the c statistic and discrimination slope were poorer. Calibration was on average correct (calibration-in-the-large coefficient close to zero), but the effects of predictors were on average smaller in the new setting (calibration slope 0.74). The Hosmer-Lemeshow test was of borderline statistical significance. The net benefit was close to zero, which was explained by the fact that very few patients had predicted risks below 20% and that calibration was imperfect around this threshold (Figs. 2, 5).


FIGURE 4. Decision curves (A) for the predicted probabilities without (solid line) and with (dashed line) the tumor marker LDH in the development data set (n = 544) and (B) for the predicted probabilities without the tumor marker LDH from the development data set in the validation data set (n = 273).

FIGURE 5. Validation plots of prediction models for residual masses in patients with testicular cancer. A, Development data, model without LDH; B, Development data, model with LDH; C, Validation data, model without LDH. The arrow indicates the decision threshold of 20% risk of residual tumor.

Software
All analyses were done in R version 2.8.1 (R Foundation for Statistical Computing, Vienna, Austria), using the Design library. The syntax is provided in the eAppendix (http://links.lww.com/EDE/A355).
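For readers who wish to reproduce this type of analysis with current software, the Design library has since been superseded by the rms package; a rough sketch of corresponding calls (an assumption about present-day tooling, not the syntax used for this article) is:

## Illustrative sketch with the rms package (successor of Design); the model
## formula and validation objects are assumed, not the article's variables.
library(rms)
fit <- lrm(y ~ x1 + x2, x = TRUE, y = TRUE)   # logistic model, keep design matrix
validate(fit, B = 200)                        # optimism-corrected performance via bootstrap
val.prob(p_val, y_val)                        # calibration plot, c statistic, Brier score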

DISCUSSION
This article provides a framework for a number of traditional and relatively novel measures to assess the performance of an existing prediction model, or extensions to a model. Some measures relate to the evaluation of the quality of predictions, including overall performance measures such as explained variation and the Brier score, and measures for discrimination and calibration. Other measures quantify the quality of decisions, including decision-analytic measures such as the net benefit and decision curves, and measures related to reclassification tables (NRI, IDI).



Having a model that discriminates well will commonly be most relevant for research purposes, such as covariate adjustment in a randomized clinical trial. But a model with good discrimination (eg, c = 0.8) may be useless if the decision threshold for clinical decisions is outside the range of predictions provided by the model. Furthermore, a poorly discriminating model (eg, c = 0.6) may be clinically useful if the clinical decision is close to a "toss-up."53 This implies that the threshold is in the middle of the distribution of predicted risks, as is the case for models in fertility medicine, for example.54 For clinical practice, providing insight beyond the c statistic has been a motivation for some recent measures, especially in the context of extension of a prediction model with additional predictive information from a biomarker or other sources.8,9,45 Many measures provide numerical summaries that may be difficult to interpret (see, eg, Table 3).
Evaluation of calibration is important if model predictions are used to inform patients or physicians in making decisions. The widely used Hosmer-Lemeshow test has a number of drawbacks, including limited power and poor interpretability.1,55 The recalibration parameters as proposed by Cox (intercept and calibration slope) are more informative.41 Validation plots with the distribution of risks for those with and without the outcome provide a useful graphical depiction, in line with previous proposals.45
The net benefit, with visualization in a decision curve, is a simple summary measure to quantify clinical usefulness when decisions are supported by a prediction model.15 We recognize, however, that a single summary measure cannot give full insight into all relevant aspects of model performance. If a threshold is clinically well accepted, such as the 10% and 20% 10-year risk thresholds for cardiovascular events, reclassification tables and their associated measures may be particularly useful. For example, Table 4 clearly illustrates that a model incorporating lactate dehydrogenase puts a few more subjects with tumor in the high-risk category (289/299 = 97% instead of 284/299 = 95%) and one fewer subject without tumor in the high-risk category (180/245 = 73% instead of 181/245 = 74%). This illustrates the principle that key information for comparing the performances of 2 models is contained in the margins of the reclassification tables.12
A key issue in the evaluation of the quality of decisions is that false-positive and false-negative decisions will usually have quite different weights in medicine. Using equal weights for false-positive and false-negative decisions is often not appropriate in medical applications, and has even been labeled "absurd."56 Several previously proposed measures of clinical usefulness are consistent with decision-analytic considerations.31,48,57-60
We recognize that binary decisions can fully be evaluated in an ROC plot. The plot is, however, of limited value unless the predicted probabilities at the operating points are indicated. Optimal thresholds can be defined by the tangent line to the curve, defined by the incidence of the outcome and the relative weight of false-positive and false-negative decisions.58 If a prediction model is perfectly calibrated, the optimal threshold in the curve corresponds to the threshold probability in the net benefit analysis. The tangent is a 45-degree line if the outcome incidence is 50% and false-positive and false-negative decisions are weighted equally. We consider the net benefit and related decision curves preferable to graphical ROC curve assessment in the context of prediction models, although these approaches are obviously related.59
Most performance measures can also be calculated for survival outcomes, which pose the challenge of dealing with censored observations. Naive calculation of ROC curves for censored observations can be misleading, since some of the censored observations would have had events if follow-up were longer. Also, the weight of false-positive and false-negative decisions may change with the follow-up time considered. Another issue is the need to consider competing risks in survival analyses of nonfatal outcomes, such as failure of heart valves,61 or mortality due to various causes.62 Disregarding competing risks often leads to overestimation of absolute risk.63
Any performance measure should be estimated with correction for optimism, as can be achieved with cross-validation or bootstrap resampling, for example. Determining generalizability to other plausibly related settings requires an external validation data set of sufficient size.18 Some statistical updating may then be necessary for parameters in the model.64 After repeated validation under various circumstances, an analysis of the impact of using a model for decision support should follow. This requires formulation of the model as a simple decision rule.65
In sum, we suggest that reporting discrimination and calibration will always be important for a prediction model. Decision-analytic measures should be reported if the model is to be used for making clinical decisions. Many more measures are available than discussed in this article, and those other measures may have value in specific circumstances. The novel measures for reclassification and clinical usefulness can provide valuable additional insight regarding the value of prediction models and extensions to models, which goes beyond traditional measures of calibration and discrimination.




ACKNOWLEDGMENTS
We thank Margaret Pepe and Jessie Gu (University of Washington, Seattle, WA) for their critical review and helpful comments, as well as 2 anonymous reviewers.

REFERENCES
1. Harrell FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York: Springer; 2001.
2. Pepe MS, Janes H, Longton G, Leisenring W, Newcomb P. Limitations of the odds ratio in gauging the performance of a diagnostic, prognostic, or screening marker. Am J Epidemiol. 2004;159:882-890.
3. Gerds TA, Cai T, Schumacher M. The performance of risk prediction models. Biom J. 2008;50:457-479.
4. Hosmer DW, Hosmer T, Le Cessie S, Lemeshow S. A comparison of goodness-of-fit tests for the logistic regression model. Stat Med. 1997;16:965-980.
5. Obuchowski NA. Receiver operating characteristic curves and their use in radiology. Radiology. 2003;229:3-8.
6. Heagerty PJ, Zheng Y. Survival model predictive accuracy and ROC curves. Biometrics. 2005;61:92-105.
7. Gonen M, Heller G. Concordance probability and discriminatory power in proportional hazards regression. Biometrika. 2005;92:965-970.
8. Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation. 2007;115:928-935.
9. Pencina MJ, D'Agostino RB Sr, D'Agostino RB Jr, Vasan RS. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med. 2008;27:157-172.
10. Pepe MS, Janes H, Gu JW. Letter by Pepe et al regarding article, "Use and misuse of the receiver operating characteristic curve in risk prediction." Circulation. 2007;116:e132; author reply e134.
11. Pencina MJ, D'Agostino RB Sr, D'Agostino RB Jr, Vasan RS. Comments on 'Integrated discrimination and net reclassification improvements-Practical advice.' Stat Med. 2008;27:207-212.
12. Janes H, Pepe MS, Gu W. Assessing the value of risk predictions by using risk stratification tables. Ann Intern Med. 2008;149:751-760.
13. McGeechan K, Macaskill P, Irwig L, Liew G, Wong TY. Assessing new biomarkers and predictive models for use in clinical practice: a clinician's guide. Arch Intern Med. 2008;168:2304-2310.
14. Cook NR, Ridker PM. Advances in measuring the effect of individual predictors of cardiovascular risk: the role of reclassification measures. Ann Intern Med. 2009;150:795-802.
15. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006;26:565-574.
16. Steyerberg EW, Vickers AJ. Decision curve analysis: a discussion. Med Decis Making. 2008;28:146-149.
17. Altman DG, Royston P. What do we mean by validating a prognostic model? Stat Med. 2000;19:453-473.
18. Justice AC, Covinsky KE, Berlin JA. Assessing the generalizability of prognostic information. Ann Intern Med. 1999;130:515-524.
19. Steyerberg EW, Harrell FE Jr, Borsboom GJ, Eijkemans MJ, Vergouwe Y, Habbema JD. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol. 2001;54:774-781.
20. Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. New York: Springer; 2009.
21. Simon R. A checklist for evaluating reports of expression profiling for treatment selection. Clin Adv Hematol Oncol. 2006;4:219-224.
22. Ioannidis JP. Why most discovered true associations are inflated. Epidemiology. 2008;19:640-648.
23. Schumacher M, Binder H, Gerds T. Assessment of survival prediction models based on microarray data. Bioinformatics. 2007;23:1768-1774.
24. Vickers AJ, Kramer BS, Baker SG. Selecting patients for randomized trials: a systematic approach based on risk group. Trials. 2006;7:30.
25. Hernandez AV, Steyerberg EW, Habbema JD. Covariate adjustment in randomized controlled trials with dichotomous outcomes increases statistical power and reduces sample size requirements. J Clin Epidemiol. 2004;57:454-460.
26. Hernandez AV, Eijkemans MJ, Steyerberg EW. Randomized controlled trials with time-to-event outcomes: how much does prespecified covariate adjustment increase power? Ann Epidemiol. 2006;16:41-48.
27. Iezzoni LI. Risk Adjustment for Measuring Health Care Outcomes. 3rd ed. Chicago: Health Administration Press; 2003.
28. Kattan MW. Judging new markers by their ability to improve predictive accuracy. J Natl Cancer Inst. 2003;95:634-635.
29. Hilden J, Habbema JD, Bjerregaard B. The measurement of performance in probabilistic diagnosis. Part II: Trustworthiness of the exact values of the diagnostic probabilities. Methods Inf Med. 1978;17:227-237.
30. Hand DJ. Statistical methods in diagnosis. Stat Methods Med Res. 1992;1:49-67.
31. Habbema JD, Hilden J. The measurement of performance in probabilistic diagnosis: Part IV. Utility considerations in therapeutics and prognostics. Methods Inf Med. 1981;20:80-96.
32. Vittinghoff E. Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models (Statistics for Biology and Health). New York: Springer; 2005.
33. Nagelkerke NJ. A note on a general definition of the coefficient of determination. Biometrika. 1991;78:691-692.
34. Brier GW. Verification of forecasts expressed in terms of probability. Mon Wea Rev. 1950;78:1-3.
35. Hu B, Palta M, Shao J. Properties of R2 statistics for logistic regression. Stat Med. 2006;25:1383-1395.
36. Schumacher M, Graf E, Gerds T. How to assess prognostic models for survival data: a case study in oncology. Methods Inf Med. 2003;42:564-571.
37. Gerds TA, Schumacher M. Consistent estimation of the expected Brier score in general survival models with right-censored event times. Biom J. 2006;48:1029-1040.
38. Chambless LE, Diao G. Estimation of time-dependent area under the ROC curve for long-term risk prediction. Stat Med. 2006;25:3474-3486.
39. Yates JF. External correspondence: decomposition of the mean probability score. Organ Behav Hum Perform. 1982;30:132-156.
40. Miller ME, Langefeld CD, Tierney WM, Hui SL, McDonald CJ. Validation of probabilistic predictions. Med Decis Making. 1993;13:49-58.
41. Cox DR. Two further applications of a model for binary regression. Biometrika. 1958;45:562-565.
42. Copas JB. Regression, prediction and shrinkage. J R Stat Soc Ser B. 1983;45:311-354.
43. van Houwelingen JC, Le Cessie S. Predictive value of statistical models. Stat Med. 1990;9:1303-1325.
44. Cook NR. Statistical evaluation of prognostic versus diagnostic models: beyond the ROC curve. Clin Chem. 2008;54:17-23.
45. Pepe MS, Feng Z, Huang Y, et al. Integrating the predictiveness of a marker with its performance as a classifier. Am J Epidemiol. 2008;167:362-368.
46. Youden WJ. Index for rating diagnostic tests. Cancer. 1950;3:32-35.
47. Pauker SG, Kassirer JP. The threshold approach to clinical decision making. N Engl J Med. 1980;302:1109-1117.
48. Peirce CS. The numerical measure of success of predictions. Science. 1884;4:453-454.
49. Steyerberg EW, Keizer HJ, Fossa SD, et al. Prediction of residual retroperitoneal mass histology after chemotherapy for metastatic nonseminomatous germ cell tumor: multivariate analysis of individual patient data from six study groups. J Clin Oncol. 1995;13:1177-1187.
50. Steyerberg EW, Vergouwe Y, Keizer HJ, Habbema JD. Residual mass histology in testicular cancer: development and validation of a clinical prediction rule. Stat Med. 2001;20:3847-3859.
51. Vergouwe Y, Steyerberg EW, Foster RS, Habbema JD, Donohue JP. Validation of a prediction model and its predictors for the histology of residual masses in nonseminomatous testicular cancer. J Urol. 2001;165:84-88.
52. Steyerberg EW, Marshall PB, Keizer HJ, Habbema JD. Resection of small, residual retroperitoneal masses after chemotherapy for nonseminomatous testicular cancer: a decision analysis. Cancer. 1999;85:1331-1341.
53. Pauker SG, Kassirer JP. The toss-up. N Engl J Med. 1981;305:1467-1469.



54. Hunault CC, Habbema JD, Eijkemans MJ, Collins JA, Evers JL, te Velde ER. Two new prediction rules for spontaneous pregnancy leading to live birth among subfertile couples, based on the synthesis of three previous models. Hum Reprod. 2004;19:2019-2026.
55. Peek N, Arts DG, Bosman RJ, van der Voort PH, de Keizer NF. External validation of prognostic models for critically ill patients required substantial sample sizes. J Clin Epidemiol. 2007;60:491-501.
56. Greenland S. The need for reorientation toward cost-effective prediction: comments on 'Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond' by M. J. Pencina et al., Statistics in Medicine (DOI: 10.1002/sim.2929). Stat Med. 2008;27:199-206.
57. Vergouwe Y, Steyerberg EW, Eijkemans MJ, Habbema JD. Validity of prognostic models: when is a model clinically useful? Semin Urol Oncol. 2002;20:96-107.
58. McNeil BJ, Keller E, Adelstein SJ. Primer on certain elements of medical decision making. N Engl J Med. 1975;293:211-215.
59. Hilden J. The area under the ROC curve and its competitors. Med Decis Making. 1991;11:95-101.
60. Gail MH, Pfeiffer RM. On criteria for evaluating models of absolute risk. Biostatistics. 2005;6:227-239.
61. Grunkemeier GL, Jin R, Eijkemans MJ, Takkenberg JJ. Actual and actuarial probabilities of competing risks: apples and lemons. Ann Thorac Surg. 2007;83:1586-1592.
62. Fine JP, Gray RJ. A proportional hazards model for the subdistribution of a competing risk. J Am Stat Assoc. 1999;94:496-509.
63. Gail M. A review and critique of some models used in competing risk analysis. Biometrics. 1975;31:209-222.
64. Steyerberg EW, Borsboom GJ, van Houwelingen HC, Eijkemans MJ, Habbema JD. Validation and updating of predictive logistic regression models: a study on sample size and shrinkage. Stat Med. 2004;23:2567-2586.
65. Reilly BM, Evans AT. Translating clinical research into clinical practice: impact of using prediction rules to make decisions. Ann Intern Med. 2006;144:201-209.
