Assessing The Performance of Prediction Models
Ewout W. Steyerberg,a Andrew J. Vickers,b Nancy R. Cook,c Thomas Gerds,d Mithat Gonen,b
Nancy Obuchowski,e Michael J. Pencina,f and Michael W. Kattane
We discuss a number of traditional and relatively novel measures for the assessment of the performance of an existing prediction model, or extensions to a model. For illustration, we present a case study of predicting the presence of residual tumor versus benign tissue in patients with testicular cancer.

We focus on prediction models for a dichotomous outcome, because these are most relevant in medical applications. The outcome can be either an underlying diagnosis (eg, presence of benign or malignant histology in a residual mass after cancer treatment), an outcome occurring within a relatively short time after making the prediction (eg, 30-day mortality), or a long-term outcome (eg, 10-year incidence of coronary artery disease, with censored follow-up of some patients).
At model development, we aim for at least internally valid predictions, ie, predictions that are valid for subjects from the underlying population.17 Preferably, the predictions are also generalizable to "plausibly related" populations.18 Various epidemiologic and statistical issues need to be considered in a modeling strategy for empirical data.1,19,20 When a model is developed, we want some quantification of its performance, so that we can judge whether the model is adequate for its purpose, or better than an existing model.

Model Extension With a Marker
A key interest in contemporary medical research is whether a marker (eg, molecular, genetic, imaging) adds to the performance of an existing model. Often, new markers are selected from a large set based on strength of association in a particular study. This poses a high risk of overoptimistic expectations of the marker's performance.21,22 Moreover, we are interested only in the incremental value of a marker, on top of predictors that are readily accessible. Validation in fully independent, external data is the best way to compare the performance of a model with and without a new marker.21,23

Usefulness of Prediction Models
Prediction models can be useful for several purposes, such as to decide on inclusion criteria or covariate adjustment in a randomized controlled trial.24-26 In observational studies, a prediction model may be used for confounder adjustment or case-mix adjustment in comparing an outcome between centers.27 We concentrate here on the usefulness of a prediction model for medical practice, including public health (eg, screening for disease) and patient care (diagnosing patients, giving prognostic estimates, decision support).

An important role of prediction models is to inform patients about their prognosis, for example, after a cancer diagnosis has been made.28 A natural requirement for a model in this situation is that predictions are well calibrated (or "reliable").29,30

A specific situation may be that limited resources need to be targeted to those with the highest expected benefit, such as those at highest risk. This situation calls for a model that accurately distinguishes those at high risk from those at low risk. The resulting decisions are typically binary and require decision thresholds that are clinically relevant.

TRADITIONAL PERFORMANCE MEASURES
We briefly consider some of the more commonly used performance measures in medicine, without intending to be comprehensive (Table 1).

TABLE 1 (fragment). Overall performance: R2 and Brier score; visualization: validation graph; better with lower distance between Y and Ŷ; captures calibration and discrimination aspects. Discrimination: c statistic; visualization: ROC curve; a rank-order statistic with an interpretation for a pair of subjects with and without the outcome.

Overall Performance Measures
From a statistical modeler's perspective, the distance between the predicted outcome and the actual outcome is central to quantifying overall model performance.32 The distance is Y − Ŷ for continuous outcomes. For binary outcomes, with Y defined as 0 or 1, Ŷ is the predicted probability p; for survival outcomes, it is the predicted event probability at a given time (or as a function of time). These distances between observed and predicted outcomes are related to the concept of "goodness-of-fit" of a model, with better models having smaller distances between predicted and observed outcomes. The main difference between goodness-of-fit and predictive performance is that the former is usually evaluated in the same data, while the latter requires either new data or cross-validation.

Explained variation (R2) is the most common performance measure for continuous outcomes. For generalized linear models, Nagelkerke's R2 is often used.1,33 It is based on a logarithmic scoring rule: for binary outcomes Y, we score a model with the logarithm of the predicted probabilities p: Y × log(p) + (1 − Y) × log(1 − p). Nagelkerke's R2 can also be calculated for survival outcomes, based on the difference in −2 log likelihood between a model without and a model with one or more predictors.
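To make the logarithmic scoring rule and Nagelkerke's R2 concrete, the following R sketch uses simulated data and base R only; it is an illustration under assumed variable names and a simulated model, not the authors' eAppendix code.

# Logarithmic score and Nagelkerke's R2 for a logistic regression model (illustrative sketch)
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))      # simulated binary outcome

fit0 <- glm(y ~ 1, family = binomial)           # null model
fit1 <- glm(y ~ x, family = binomial)           # model with one predictor
p    <- predict(fit1, type = "response")        # predicted probabilities

# average logarithmic score: Y * log(p) + (1 - Y) * log(1 - p)
log_score <- mean(y * log(p) + (1 - y) * log(1 - p))

# Nagelkerke's R2 from the -2 log likelihoods of the null and fitted models
LR     <- fit0$deviance - fit1$deviance         # difference in -2 log likelihood
r2_cs  <- 1 - exp(-LR / n)                      # Cox-Snell R2
r2_max <- 1 - exp(-fit0$deviance / n)           # maximum attainable Cox-Snell R2
c(log_score = log_score, R2_Nagelkerke = r2_cs / r2_max)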
The Brier score3 is a quadratic scoring rule, in which the squared differences between actual binary outcomes Y and predictions p are calculated: (Y − p)^2. We can also write this in a way similar to the logarithmic score: Y × (1 − p)^2 + (1 − Y) × p^2. The Brier score for a model can range from 0 for a perfect model to 0.25 for a noninformative model with a 50% incidence of the outcome. When the outcome incidence is lower, the maximum score for a noninformative model is lower; eg, for a 10% incidence: 0.1 × (1 − 0.1)^2 + (1 − 0.1) × 0.1^2 = 0.090. Similar to Nagelkerke's approach to the LR statistic, we could scale the Brier score by its maximum under a noninformative model: Brier_scaled = 1 − Brier/Brier_max, where Brier_max = mean(p) × (1 − mean(p)), to let it range between 0% and 100%. This scaled Brier score happens to be very similar to Pearson's R2 statistic.35
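As a hedged illustration with simulated data (base R only, not the authors' code), the Brier score and its scaled version can be computed as follows; the 0.090 benchmark for a 10% incidence is reproduced at the end.

# Brier score and scaled Brier score (illustrative sketch with simulated data)
set.seed(2)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2 + x))               # low-incidence binary outcome
p <- predict(glm(y ~ x, family = binomial), type = "response")

brier        <- mean((y - p)^2)                 # quadratic scoring rule
brier_max    <- mean(p) * (1 - mean(p))         # noninformative model at the average risk
brier_scaled <- 1 - brier / brier_max           # scaled to range from 0% to 100%
c(Brier = brier, Brier_max = brier_max, Brier_scaled = brier_scaled)

# benchmark from the text: constant prediction of 0.10 with a 10% incidence
0.1 * (1 - 0.1)^2 + (1 - 0.1) * 0.1^2           # = 0.090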
Calculation of the Brier score for survival outcomes is possible with a weight function, which considers the conditional probability of being uncensored over time.3,36,37 We can then calculate the Brier score at fixed time points and create a time-dependent curve. It is useful to add a benchmark curve based on the Brier score for the overall Kaplan-Meier estimator, which does not consider any predictive information.3 Overall performance measures comprise 2 important characteristics of a prediction model, discrimination and calibration, each of which can be assessed separately.

Discrimination
Accurate predictions discriminate between those with and those without the outcome. Several measures can be used to indicate how well we classify patients in a binary prediction problem. The concordance (c) statistic is the most commonly used performance measure to indicate the discriminative ability of generalized linear regression models. For a binary outcome, c is identical to the area under the receiver operating characteristic (ROC) curve, which plots the sensitivity (true-positive rate) against 1 − specificity (false-positive rate) for consecutive cut-offs for the probability of an outcome.

The c statistic is a rank-order statistic for predictions against true outcomes, related to Somers' D statistic.1 As a rank-order statistic, it is insensitive to systematic errors in calibration, such as differences in average outcome. A popular extension of the c statistic to censored data can be obtained by ignoring the pairs that cannot be ordered1; it turns out that this results in a statistic that depends on the censoring pattern. Gonen and Heller have proposed a method to estimate a variant of the c statistic that is independent of censoring, but it holds only in the context of a Cox proportional hazards model.7 Furthermore, time-dependent c statistics have been proposed.6,38

In addition to the c statistic, the discrimination slope can be used as a simple measure of how well subjects with and without the outcome are separated.39 The discrimination slope is calculated as the absolute difference in average predictions for those with and without the outcome. Visualization is readily possible with a box plot or a histogram; a better discriminating model will show less overlap between those with and those without the outcome. Extensions of the discrimination slope have not yet been made to the survival context.
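A minimal base-R sketch of both measures with simulated data (an illustration only; in practice functions such as somers2 in the Hmisc package give the c statistic directly): the c statistic as the proportion of correctly ordered event/non-event pairs, and the discrimination slope as the difference in mean predictions.

# c statistic (pairwise rank comparison) and discrimination slope (illustrative sketch)
set.seed(3)
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 1.5 * x))
p <- predict(glm(y ~ x, family = binomial), type = "response")

# c statistic: proportion of event/non-event pairs ordered correctly (ties count 1/2);
# for a binary outcome this equals the area under the ROC curve
d <- outer(p[y == 1], p[y == 0], FUN = "-")
c_stat <- mean((d > 0) + 0.5 * (d == 0))

# discrimination slope: difference in mean predicted risk between outcome groups
disc_slope <- mean(p[y == 1]) - mean(p[y == 0])

c(c_statistic = c_stat, discrimination_slope = disc_slope)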
Calibration
Calibration refers to the agreement between observed outcomes and predictions.29 For example, if we predict a 20% risk of residual tumor for a testicular cancer patient, the observed frequency of tumor should be approximately 20 of 100 patients with such a prediction. A graphical assessment of calibration is possible, with predictions on the x-axis and the outcome on the y-axis. Perfect predictions should lie on the 45-degree line. For linear regression, the calibration plot is a simple scatter plot. For binary outcomes, the plot contains only 0 and 1 values on the y-axis. Smoothing techniques can be used to estimate the observed probabilities of the outcome (p(y = 1)) in relation to the predicted probabilities, eg, using the loess algorithm.1 We may, however, expect that the specific type of smoothing will affect the graphical impression, especially in smaller data sets. We can also group subjects with similar predicted probabilities, and thus compare the mean predicted probability with the mean observed outcome. For example, we can plot the observed outcome by decile of predictions, which makes the plot a graphical illustration of the Hosmer-Lemeshow goodness-of-fit test. A better discriminating model has more spread between such deciles than a poorly discriminating model. Note that such grouping, though common, is arbitrary and imprecise.

The calibration plot can be characterized by an intercept a, which indicates the extent to which predictions are systematically too low or too high ("calibration-in-the-large"), and a calibration slope b, which should be 1.40 Such a recalibration framework was previously proposed by Cox.41 At model development, a = 0 and b = 1 for regression models. At validation, calibration-in-the-large problems are common, as is a slope b smaller than 1, reflecting overfitting of a model.1 A value of b smaller than 1 can also be interpreted as reflecting a need for shrinkage of the regression coefficients in a prediction model.42,43
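One common way to estimate these recalibration parameters (sketched here under the assumption of a logistic regression model, with simulated development and validation samples rather than the authors' data) is to regress the outcome in the validation data on the linear predictor of the development model: with the linear predictor as an offset, the intercept estimates calibration-in-the-large; with the linear predictor as a covariate, its coefficient estimates the calibration slope.

# Calibration-in-the-large and calibration slope at external validation (illustrative sketch)
set.seed(4)
n <- 800
x_dev <- rnorm(n); y_dev <- rbinom(n, 1, plogis(-1 + x_dev))          # development sample
x_val <- rnorm(n); y_val <- rbinom(n, 1, plogis(-0.7 + 0.8 * x_val))  # somewhat different population

fit <- glm(y_dev ~ x_dev, family = binomial)              # development model
lp  <- predict(fit, newdata = data.frame(x_dev = x_val))  # linear predictor in validation data

cal_in_the_large <- coef(glm(y_val ~ offset(lp), family = binomial))[1]  # intercept a
cal_slope        <- coef(glm(y_val ~ lp, family = binomial))[2]          # slope b

c(a = unname(cal_in_the_large), b = unname(cal_slope))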
NOVEL PERFORMANCE MEASURES
We now discuss some relatively novel performance measures, again without attempting to be comprehensive.

Novel Measures Related to Reclassification
Cook8 has proposed making a "reclassification table" when a marker is added to a model, to show how many subjects are reclassified. For example, a model with traditional risk factors for cardiovascular disease was extended with the predictors "parental history of myocardial infarction" and "CRP." The increase in c statistic was minimal (from 0.805 to 0.808). However, when Cook classified the predicted risks into 4 categories (0–5, 5–10, 10–20, and >20% 10-year cardiovascular disease risk), about 30% of individuals changed category when the extended model was compared with the traditional one. Change in risk categories, however, is insufficient to evaluate improvement in risk stratification; the changes must be appropriate. One way to evaluate this is to compare the observed incidence of events in the cells of the reclassification table with the predicted probability from the original model. Cook proposed a reclassification test as a variant of the Hosmer-Lemeshow statistic within the reclassified categories, leading to a χ2 statistic.44

Pencina et al9 have extended the reclassification idea by conditioning on the outcome: reclassification of subjects with and without the outcome should be considered separately. Any upward movement in categories for subjects with the outcome implies improved classification, and any downward movement indicates worse reclassification. The interpretation is opposite for subjects without the outcome. The improvement in reclassification is quantified as the sum of the proportion of individuals moving up minus the proportion moving down among those with the outcome, and the proportion moving down minus the proportion moving up among those without the outcome. This sum was labeled the Net Reclassification Improvement (NRI). In addition, a measure that integrates net reclassification over all possible cut-offs for the probability of the outcome was proposed, the integrated discrimination improvement (IDI).9 The IDI is equivalent to the difference in discrimination slopes of 2 models, to the difference in Pearson R2 measures,45 and to the difference in scaled Brier scores.
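A hedged sketch of the category-based NRI and the IDI for two nested logistic models, using simulated data and illustrative risk categories (the cut-points and variable names are assumptions for the example, not those of any particular study):

# Net Reclassification Improvement (NRI) and Integrated Discrimination Improvement (IDI)
# for a model without and with an added marker (illustrative sketch)
set.seed(5)
n  <- 2000
x1 <- rnorm(n); x2 <- rnorm(n)                       # x2 plays the role of the new marker
y  <- rbinom(n, 1, plogis(-2 + 0.8 * x1 + 0.6 * x2))

p_old <- predict(glm(y ~ x1,      family = binomial), type = "response")
p_new <- predict(glm(y ~ x1 + x2, family = binomial), type = "response")

cats  <- c(0, 0.05, 0.10, 0.20, 1)                   # illustrative risk categories
c_old <- cut(p_old, cats, include.lowest = TRUE)
c_new <- cut(p_new, cats, include.lowest = TRUE)
table(c_old, c_new, y)                               # reclassification tables by outcome

up   <- as.numeric(c_new) > as.numeric(c_old)
down <- as.numeric(c_new) < as.numeric(c_old)
nri  <- (mean(up[y == 1]) - mean(down[y == 1])) +    # events: moving up is an improvement
        (mean(down[y == 0]) - mean(up[y == 0]))      # non-events: moving down is an improvement

disc_slope <- function(p, y) mean(p[y == 1]) - mean(p[y == 0])
idi <- disc_slope(p_new, y) - disc_slope(p_old, y)   # IDI = difference in discrimination slopes

c(NRI = nri, IDI = idi)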
Novel Measures Related to Clinical Usefulness
Some performance measures imply that false-negative and false-positive classifications are equally harmful. For example, error rates are usually calculated by classifying subjects as positive when their predicted probability of the outcome exceeds 50%, and as negative otherwise. This implies an equal weighting of false-positive and false-negative classifications.

In the calculation of the NRI, the improvement in sensitivity and the improvement in specificity are summed. This implies relatively more weight for positive outcomes if a positive outcome is less common, and less weight if a positive outcome is more common than a negative outcome. The weight is equal to the nonevents odds: (1 − mean(p))/mean(p), where mean(p) is the average probability of a positive outcome. Accordingly, although the weighting is not equal, it is not explicitly based on clinical consequences. Defining the best diagnostic test as the one closest to the top left-hand corner of the ROC curve (that is, the test with the highest sum of sensitivity and specificity, the Youden46 index: Se + Sp − 1) similarly implies weighting by the nonevents odds.

Vickers and Elkin15 proposed decision-curve analysis as a simple approach to quantify the clinical usefulness of a prediction model (or an extension to a model). For a formal decision analysis, harms and benefits need to be quantified, leading to an optimal decision threshold.47 It can be difficult to define this threshold.15 Difficulties may lie at the population level, ie, there are insufficient data on harms and benefits. Moreover, the relative weight of harms and benefits may differ from patient to patient, necessitating individual thresholds. Hence, we may consider a range of thresholds for the probability of the outcome, similar to ROC curves, which consider the full range of cut-offs rather than a single cut-off for a sensitivity/specificity pair.

The key aspect of decision-curve analysis is that a single probability threshold can be used both to categorize patients as positive or negative and to weight false-positive and false-negative classifications.48 If we assume that the harm of unnecessary treatment (a false-positive decision) is relatively limited, such as antibiotics for an infection, the cut-off should be low. In contrast, if overtreatment is quite harmful, such as extensive surgery, we should use a higher cut-off before a treatment decision is made. The harm-to-benefit ratio hence defines the relative weight w of false-positive decisions against true-positive decisions. For example, a cut-off of 10% implies that false-positive decisions are valued at 1/9th of a true-positive decision, and w = 0.11. The performance of a prediction model can then be summarized as a Net Benefit: NB = (TP − w × FP)/N, where TP is the number of true-positive decisions, FP is the number of false-positive decisions, N is the total number of patients, and w is a weight equal to the odds of the cut-off (pt/(1 − pt)), or the harm-to-benefit ratio.48 Documentation and software for decision-curve analysis are publicly available (www.decisioncurveanalysis.org).
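A minimal base-R sketch of the net benefit and a simple decision curve with simulated data (an illustration only; fuller implementations accompany the documentation at www.decisioncurveanalysis.org). The model, threshold range, and comparison strategies here are assumptions for the example.

# Net benefit and a simple decision curve (illustrative sketch)
set.seed(6)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))
p <- predict(glm(y ~ x, family = binomial), type = "response")

net_benefit <- function(p, y, pt) {
  pos <- p >= pt                          # classify as positive above the threshold
  tp  <- sum(pos & y == 1)
  fp  <- sum(pos & y == 0)
  w   <- pt / (1 - pt)                    # odds of the threshold = harm-to-benefit ratio
  (tp - w * fp) / length(y)
}

thresholds   <- seq(0.05, 0.60, by = 0.01)
nb_model     <- sapply(thresholds, function(pt) net_benefit(p, y, pt))
nb_treat_all <- sapply(thresholds, function(pt) net_benefit(rep(1, n), y, pt))

plot(thresholds, nb_model, type = "l",
     ylim = range(c(nb_model, nb_treat_all, 0)),
     xlab = "Threshold probability", ylab = "Net benefit")
lines(thresholds, nb_treat_all, lty = 2)  # treat-all strategy
abline(h = 0, lty = 3)                    # treat-none strategy
legend("topright", c("Prediction model", "Treat all", "Treat none"),
       lty = 1:3, bty = "n")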
Validation Graphs as Summary Tools
We can extend the calibration graph to a validation graph.20 The distribution of predictions in those with and without the outcome is plotted at the bottom of the graph, capturing information on discrimination, similar to what is shown in a box plot. Moreover, it is important to have 95% confidence intervals (CIs) around deciles (or other quantiles) of predicted risk to indicate uncertainty in the assessment of validity. From the validation graph we can learn the discriminative ability of a model (eg, study the spread in observed outcomes by deciles of predicted risks), the calibration (closeness of observed outcomes to the 45-degree line), and the clinical usefulness (how many predictions are above or below clinically relevant decision thresholds).
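A rough base-R sketch of such a validation graph with simulated data (an assumption-laden illustration, not the figure-generating code of the article): a smoothed calibration curve, grouped observed proportions by decile of predicted risk, and the distributions of predictions added as rugs for those with and without the outcome.

# Simple validation graph: calibration curve plus distribution of predictions (sketch)
set.seed(8)
n <- 600
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.8 + x))
p <- predict(glm(y ~ x, family = binomial), type = "response")

plot(0:1, 0:1, type = "n", xlab = "Predicted probability", ylab = "Observed proportion")
abline(0, 1, lty = 2)                              # 45-degree line = perfect calibration
lines(lowess(p, y), lwd = 2)                       # smoothed observed vs predicted

dec <- cut(p, quantile(p, 0:10 / 10), include.lowest = TRUE)
points(tapply(p, dec, mean), tapply(y, dec, mean), pch = 19)  # observed proportion by decile

rug(p[y == 1], side = 3)                           # predictions for those with the outcome (top)
rug(p[y == 0], side = 1)                           # predictions for those without the outcome (bottom)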
APPLICATION TO TESTICULAR CANCER CASE STUDY

Patients
Men with metastatic nonseminomatous testicular cancer can often be cured by cisplatin-based chemotherapy. After chemotherapy, surgical resection is generally carried out to remove remnants of the initial metastases that may still be present. In the absence of tumor, resection has no therapeutic benefit, while it is associated with hospital admission and with risks of permanent morbidity and mortality. Logistic regression models were developed to predict the presence of residual tumor, combining well-known predictors such as the histology of the primary tumor, prechemotherapy tumor marker levels, and postchemotherapy mass size and its reduction (Table 2).

TABLE 2. Logistic Regression Models in the Testicular Cancer Dataset (n = 544), Without and With the Tumor Marker LDH(a)

Characteristic | Without LDH, OR (95% CI) | With LDH, OR (95% CI)
Primary tumor teratoma-positive | 2.7 (1.8–4.0) | 2.5 (1.6–3.8)
Prechemotherapy AFP elevated | 2.4 (1.5–3.7) | 2.5 (1.6–3.9)
Prechemotherapy HCG elevated | 1.7 (1.1–2.7) | 2.2 (1.4–3.4)
Square root of postchemotherapy mass size (mm) | 1.08 (0.95–1.23) | 1.34 (1.14–1.57)
Reduction in mass size per 10% | 0.77 (0.70–0.85) | 0.85 (0.77–0.95)
Prechemotherapy LDH (log(LDH/upper limit of local normal value)) | (not included) | 0.37 (0.25–0.56)

Continuous predictors were first studied with restricted cubic spline functions, and then simplified to simple parametric forms.
(a) The outcome was residual tumor at postchemotherapy resection (299/544, 55%).
TABLE 3. Performance of Testicular Cancer Models With or Without the Tumor Marker LDH in the Development Dataset (n = 544) and the Validation Dataset (n = 273)

Performance Measure | Development, Without LDH | Development, With LDH | External Validation, Without LDH
Overall
  Brier | 0.174 | 0.163 | 0.161
  Brier_scaled | 29.8% | 34.0% | 20.0%
  R2 (Nagelkerke) | 38.9% | 43.1% | 25.0%
Discrimination
  C statistic (95% CI) | 0.818 (0.78–0.85) | 0.839 (0.81–0.87) | 0.785 (0.73–0.84)
  Discrimination slope | 0.301 | 0.340 | 0.237
Calibration
  Calibration-in-the-large | 0 | 0 | −0.03
  Calibration slope | 1 | 1 | 0.74
  Hosmer-Lemeshow test | χ2 = 6.2, P = 0.63 | χ2 = 12.0, P = 0.15 | χ2 = 15.9, P = 0.07
Clinical usefulness
  Net benefit at threshold 20%(a) | 0.2% | 1.2% | 0.1%

(a) Compared with a strategy of resecting all residual masses.
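As a quick arithmetic check of Table 3 (assuming, as holds for a logistic model with an intercept, that the mean predicted risk in the development data equals the observed incidence of 299/544), the scaled Brier scores follow directly from the reported Brier scores:

# Scaled Brier scores implied by Table 3 for the development data
# (assumes mean predicted risk equals the observed incidence of 299/544)
incidence <- 299 / 544
brier_max <- incidence * (1 - incidence)                  # about 0.248
1 - c(without_LDH = 0.174, with_LDH = 0.163) / brier_max  # about 0.30 and 0.34,
                                                          # close to the reported 29.8% and 34.0%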
The net benefit of using the model at the 20% threshold, compared with resecting all masses, was close to zero, which was explained by the fact that very few patients had predicted risks below 20% and that calibration was imperfect around this threshold (Figs. 2, 5).

FIGURE 5. Validation plots of prediction models for residual masses in patients with testicular cancer. A, Development data, model without LDH; B, Development data, model with LDH; C, Validation data, model without LDH. The arrow indicates the decision threshold of 20% risk of residual tumor.

Software
All analyses were done in R version 2.8.1 (R Foundation for Statistical Computing, Vienna, Austria), using the Design library. The syntax is provided in the eAppendix (http://links.lww.com/EDE/A355).
DISCUSSION
This article provides a framework for a number of traditional and relatively novel measures to assess the performance of an existing prediction model, or extensions to a model. Some measures relate to the evaluation of the quality of predictions, including overall performance measures such as explained variation and the Brier score, and measures for discrimination and calibration. Other measures quantify the quality of decisions, including decision-analytic measures such as the net benefit and decision curves, and measures related to reclassification tables (NRI, IDI).

Having a model that discriminates well will commonly be most relevant for research purposes, such as covariate adjustment in a randomized clinical trial. But a model with good discrimination (eg, c = 0.8) may be useless if the decision threshold for clinical decisions is outside the range of predictions provided by the model. Furthermore, a poorly discriminating model (eg, c = 0.6) may be clinically useful if the clinical decision is close to a "toss-up,"53 which implies that the threshold is in the middle of the distribution of predicted risks, as is the case for models in fertility medicine, for example.54 For clinical practice, providing insight beyond the c statistic has been a motivation for some recent measures, especially in the context of extending a prediction model with additional predictive information from a biomarker or other sources.8,9,45 Many measures provide numerical summaries that may be difficult to interpret (see, eg, Table 3).

Evaluation of calibration is important if model predictions are used to inform patients or physicians in making decisions. The widely used Hosmer-Lemeshow test has a number of drawbacks, including limited power and poor interpretability.1,55 The recalibration parameters proposed by Cox (intercept and calibration slope) are more informative.41 Validation plots with the distribution of risks for those with and without the outcome provide a useful graphical depiction, in line with previous proposals.45

The net benefit, with visualization in a decision curve, is a simple summary measure to quantify clinical usefulness when decisions are supported by a prediction model.15 We recognize, however, that a single summary measure cannot give full insight into all relevant aspects of model performance. If a threshold is clinically well accepted, such as the 10% and 20% 10-year risk thresholds for cardiovascular events, reclassification tables and their associated measures may be particularly useful. For example, Table 4 clearly illustrates that a model incorporating lactate dehydrogenase puts a few more subjects with tumor in the high-risk category (289/299 = 97% instead of 284/299 = 95%) and one fewer subject without tumor in the high-risk category (180/245 = 73% instead of 181/245 = 74%). This illustrates the principle that the key information for comparing the performance of 2 models is contained in the margins of the reclassification tables.12

A key issue in the evaluation of the quality of decisions is that false-positive and false-negative decisions will usually have quite different weights in medicine. Using equal weights for false-positive and false-negative decisions has even been labeled "absurd" for many medical applications.56 Several previously proposed measures of clinical usefulness are consistent with decision-analytic considerations.31,48,57-60

We recognize that binary decisions can be fully evaluated in an ROC plot. The plot may, however, be of limited value unless the predicted probabilities at the operating points are indicated. Optimal thresholds can be defined by the tangent line to the curve, determined by the incidence of the outcome and the relative weight of false-positive and false-negative decisions.58 If a prediction model is perfectly calibrated, the optimal threshold in the curve corresponds to the threshold probability in the net benefit analysis. The tangent is a 45-degree line if the outcome incidence is 50% and false-positive and false-negative decisions are weighted equally. We consider the net benefit and related decision curves preferable to graphical ROC curve assessment in the context of prediction models, although these approaches are obviously related.59

Most performance measures can also be calculated for survival outcomes, which pose the challenge of dealing with censored observations. Naive calculation of ROC curves for censored observations can be misleading, since some of the censored observations would have had events if follow-up were longer. Also, the weight of false-positive and false-negative decisions may change with the follow-up time considered. Another issue is the need to consider competing risks in survival analyses of nonfatal outcomes, such as failure of heart valves,61 or of mortality due to various causes.62 Disregarding competing risks often leads to overestimation of absolute risk.63

Any performance measure should be estimated with correction for optimism, as can be achieved with cross-validation or bootstrap resampling, for example. Determining generalizability to other, plausibly related settings requires an external validation data set of sufficient size.18 Some statistical updating may then be necessary for parameters in the model.64 After repeated validation under various circumstances, an analysis of the impact of using the model for decision support should follow. This requires formulation of the model as a simple decision rule.65
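A hedged base-R outline of bootstrap optimism correction for the c statistic with simulated data (the validate function of the Design/rms libraries automates this kind of procedure; the model, sample size, and number of bootstrap replicates here are assumptions for the example):

# Bootstrap optimism correction for the c statistic (sketch of the usual procedure)
set.seed(7)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + x))
dat <- data.frame(x = x, y = y)

cstat <- function(p, y) {                     # pairwise c statistic, ties count 1/2
  d <- outer(p[y == 1], p[y == 0], FUN = "-")
  mean((d > 0) + 0.5 * (d == 0))
}

fit <- glm(y ~ x, family = binomial, data = dat)
c_apparent <- cstat(predict(fit, type = "response"), dat$y)

B <- 200
optimism <- replicate(B, {
  boot   <- dat[sample(n, replace = TRUE), ]
  fit_b  <- glm(y ~ x, family = binomial, data = boot)
  c_boot <- cstat(predict(fit_b, type = "response"), boot$y)                 # bootstrap sample
  c_orig <- cstat(predict(fit_b, newdata = dat, type = "response"), dat$y)   # original sample
  c_boot - c_orig
})

c(apparent = c_apparent, optimism_corrected = c_apparent - mean(optimism))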
In sum, we suggest that reporting discrimination and calibration will always be important for a prediction model. Decision-analytic measures should be reported if the model is to be used for making clinical decisions. Many more measures are available than discussed in this article, and those other measures may have value in specific circumstances. The novel measures for reclassification and clinical usefulness can provide valuable additional insight regarding the value of prediction models and extensions to models, going beyond the traditional measures of calibration and discrimination.
ACKNOWLEDGMENTS
We thank Margaret Pepe and Jessie Gu (University of Washington, Seattle, WA) for their critical review and helpful comments, as well as 2 anonymous reviewers.

REFERENCES
25. Hernandez AV, Steyerberg EW, Habbema JD. Covariate adjustment in randomized controlled trials with dichotomous outcomes increases statistical power and reduces sample size requirements. J Clin Epidemiol. 2004;57:454–460.
26. Hernandez AV, Eijkemans MJ, Steyerberg EW. Randomized controlled trials with time-to-event outcomes: how much does prespecified covariate adjustment increase power? Ann Epidemiol. 2006;16:41–48.
27. Iezzoni LI. Risk Adjustment for Measuring Health Care Outcomes. 3rd ed. Chicago: Health Administration Press; 2003.
28. Kattan MW. Judging new markers by their ability to improve predictive accuracy. J Natl Cancer Inst. 2003;95:634–635.
29. Hilden J, Habbema JD, Bjerregaard B. The measurement of performance ...
54. Hunault CC, Habbema JD, Eijkemans MJ, Collins JA, Evers JL, te Velde ER. Two new prediction rules for spontaneous pregnancy leading to live birth among subfertile couples, based on the synthesis of three previous models. Hum Reprod. 2004;19:2019–2026.
55. Peek N, Arts DG, Bosman RJ, van der Voort PH, de Keizer NF. External validation of prognostic models for critically ill patients required substantial sample sizes. J Clin Epidemiol. 2007;60:491–501.
56. Greenland S. The need for reorientation toward cost-effective prediction: comments on 'Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond' by M. J. Pencina et al. (DOI: 10.1002/sim.2929). Stat Med. 2008;27:199–206.
57. Vergouwe Y, Steyerberg EW, Eijkemans MJ, Habbema JD. Validity of prognostic models: when is a model clinically useful? Semin Urol Oncol. 2002;20:96–107.
58. McNeil BJ, Keller E, Adelstein SJ. Primer on certain elements of medical decision making. N Engl J Med. 1975;293:211–215.
59. Hilden J. The area under the ROC curve and its competitors. Med Decis Making. 1991;11:95–101.
60. Gail MH, Pfeiffer RM. On criteria for evaluating models of absolute risk. Biostatistics. 2005;6:227–239.
61. Grunkemeier GL, Jin R, Eijkemans MJ, Takkenberg JJ. Actual and actuarial probabilities of competing risks: apples and lemons. Ann Thorac Surg. 2007;83:1586–1592.
62. Fine JP, Gray RJ. A proportional hazards model for the subdistribution of a competing risk. J Am Stat Assoc. 1999;94:496–509.
63. Gail M. A review and critique of some models used in competing risk analysis. Biometrics. 1975;31:209–222.
64. Steyerberg EW, Borsboom GJ, van Houwelingen HC, Eijkemans MJ, Habbema JD. Validation and updating of predictive logistic regression models: a study on sample size and shrinkage. Stat Med. 2004;23:2567–2586.
65. Reilly BM, Evans AT. Translating clinical research into clinical practice: impact of using prediction rules to make decisions. Ann Intern Med. 2006;144:201–209.