Developing clinical prediction models (Source: BMJ, 2024)
bmj-2023-078276
Accepted: 12 June 2024

SUMMARY POINTS
• Many prediction models are published each year, but they often have methodological shortcomings that limit their internal validity and applicability
• A 13 step guide has been developed to help healthcare professionals and researchers develop and validate prediction models, avoiding common pitfalls
• In the first step, the objective of the prediction model should be defined, including the target population, the outcome to be predicted, the healthcare setting where the model will be used, the intended users, and the decisions the model will inform
• Prediction modelling requires a collaborative and interdisciplinary effort within a team that ideally includes clinicians with content expertise, methodologists, users, and people with lived experiences
• Common pitfalls include inappropriate categorising of continuous outcomes or predictors, data driven cut-off points, univariable selection methods, overfitting, and lack of attention to missing data and a sound assessment of performance and clinical benefit

… each year. For example, a review of prediction models identified 263 prediction models in obstetrics alone2; another review found 606 models related to covid-19.3 Interest in predicting health outcomes has been heightened by the increasing availability of big data,4 which has also led to the uptake of machine learning methods for prognostic research in medicine.5 6

Several resources are available to support prognostic research. The PROGRESS (prognosis research strategy) framework provides detailed guidance on different types of prognostic research.7-9 The TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) statement gives recommendations for reporting and has recently been extended to address prediction model research in clustered datasets.10-14 PROBAST (prediction model risk-of-bias assessment tool) provides a structured … overestimating the model's performance.15

This article provides a step-by-step guide for researchers interested in clinical prediction modelling. Based on a scoping review of the literature and discussions in our group, we identified 13 steps. We aim to provide an overview to help numerically minded clinicians, clinical epidemiologists, and statisticians navigate the field. We introduce key concepts and provide references to further reading for each step. We discuss issues related to model inception, provide practical recommendations about selecting predictors, outline sample size considerations, cover aspects of model development, such as handling missing data and assessing performance, and discuss methods for evaluating the model's clinical usefulness. The concepts we describe and the steps we propose largely apply to statistical and machine learning models. An appendix with code in R accompanies the paper.

Although several issues discussed here are also relevant to diagnostic research21 (which is related but has subtle differences with prediction modelling) and models on predicting treatment effects,22 23 our focus is primarily on methods for predicting a future health outcome. We illustrate the proposed procedure using an example of a prediction model for relapse in relapsing-remitting multiple sclerosis. The glossary in table 1 summarises the essential concepts and terms used.

Table 1 | Glossary of terms
Bias-variance trade-off: … error in new data is a function of bias and variance. In this context, bias relates to the error of the model owing to simplifying assumptions used. Variance refers to the variability of predictions made. Simple models have high bias, low variance; the opposite holds for complex models. Increasing model complexity decreases bias and increases variance. The aim is to develop a model with minimum prediction error in new patients; such a model sits on the sweet spot of the bias-variance trade-off curve where the model is not too simple, or too complex
Internal validation: Methods for obtaining an honest (ie, not optimistic) assessment of the performance of prediction models using the data it was developed with
Optimism corrected performance: Predictive performance of model after correcting for optimism using internal validation
Split sample internal validation: Internal validation method where the sample is randomly split into two parts; one part is used for developing the model, the other for assessing its performance
k-fold cross validation: Data are split in k folds; the model is developed in k−1 folds and tested in the left out fold; procedure cycles through all folds. The method can be used for internal validation; it is also sometimes used for model development, for example, to determine the value of tuning parameters in penalisation (shrinkage) models
Temporal validation: Internal validation method where data are split according to the time of patient enrolment. Model is developed in patients enrolled earlier and is tested in patients enrolled later in time
Bootstrapping: Process of creating samples mimicking the original sample. Bootstrap samples are drawn with replacement from the original sample. Bootstrapping can be used for internal validation
Internal-external validation: Method for validating a prediction model using a clustering variable in the dataset. All clusters but one are used to develop the model; the model is subsequently tested in the left out cluster. The procedure cycles through all clusters and performance measures are summarised at the end
External validation: Evaluation of model's performance in new data—ie, data not used for training the model. External validation should ideally be performed by independent researchers who are not involved in model development. The more diverse the setting and population of the external validation, the more we learn about model generalisability and transportability
Penalisation (also called regularisation, related to shrinkage): General method for reducing model complexity to obtain a model with better predictions. In regression models, coefficients are shrunk, leading to less complex models. Penalisation is controlled by one or more penalty parameters embedded in the model. The amount of penalisation ideally needed is one that brings the model to the sweet spot of the bias-variance curve—ie, where the model is as complex as it should be, but no more than that
LASSO, ridge regression, elastic net: Penalised estimation methods for regression models. LASSO (least absolute shrinkage and selection operator) and elastic net perform variable selection
Reproducibility: Estimated model performance can be reproduced in new sample from the same population or setting as the one used to develop the model
Transportability: Transportability refers to the ability of the model to produce accurate predictions in new patients drawn from a different but related population or setting
Generalisability: Generalisability encompasses model's reproducibility and transportability

Step 1: Define aims, create a team, review literature, start writing a protocol

Defining aims
We should start by clearly defining the purpose of the envisaged prediction model. In particular, it is important to clearly determine the following:
• The target population—for whom should the model predict? For example, people with HIV in South Africa; people with a history of diabetes; postmenopausal women in western Europe.
• The health outcome of interest—what is the endpoint that needs to be predicted? For example, AIDS, overall survival, progression free survival, a particular adverse event.
• The healthcare setting—how will the model be used? For example, the model might be used in primary care or be implemented in a clinical decision support system in tertiary care.
• The user—who is going to use the model? For example, primary care physicians, secondary care physicians, patients, researchers.
• The clinical decisions that the model will inform—how will model predictions be used in the clinical decision making process? For example, a model might be used to identify patients for further diagnostic investigation, to decide on treatment strategies, or to inform a range of personal decisions.24

Answers to these questions should guide the subsequent steps; they will inform various issues, such as what predictors to include in the model, what data to use for developing and validating the model, and how to assess its clinical usefulness.

Creating a team
When developing a prediction model for clinical use, assembling a group with expertise in the specific medical field, the statistical methodology, and the source data is highly advisable. Including users—that is, clinicians who might use the model and people with lived experiences—is also beneficial. Depending on the model's complexity, it might be necessary to involve software developers at later stages of the project; that is, developing a web application for users to make predictions.

Reviewing the literature
Identifying relevant published prediction models and studies on important risk factors is crucial and can be achieved through a scoping review. Discussing the review's findings with clinicians will help us to understand established predictors and the limitations of existing models. The literature review might also provide information on interactions between predictors, nonlinear associations between predictors and outcomes, reasons for missing data, and the expected distribution of predictors in the target population. In some situations, performing a systematic review might be helpful. Specific guidance on systematic reviews of prediction models has been published.25-27

Protocol
A study protocol should guide subsequent steps. The protocol can be made publicly available in an open access journal or as a preprint in an online repository (eg, www.medrxiv.org or https://osf.io/). In addition to the steps discussed here, the TRIPOD statement10 14 and the PROBAST tool15 might be helpful resources when writing the protocol.

Step 2: Choose between developing a new model or updating an existing one
Depending on the specific field, the literature review might show that relevant prediction models already exist. Suppose an existing model has a low risk of bias (according to PROBAST15) and applies to the research question. In that case, assessing its validity for the intended setting might be more appropriate than developing a new model. This approach is known as external validation (table 1). Depending on the validation results, we might decide to update and adapt the model to the population and setting of intended use. Common strategies for updating a prediction model include recalibration (eg, adjustment of the intercept term in a regression model), revision (ie, re-estimation of some model parameters), and extension (ie, addition of new predictors).28 29 Although updating strategies have mainly been described for regression models, they can also be applied to machine learning. For example, a random forest model was used to predict whether patients with stroke would experience full recovery within 90 days of the event. When tested on an external dataset, the model needed recalibration, which was performed by fitting logistic regression models to the predictions from the random forest.30 Prediction models for imaging data are often developed by fine tuning previously trained neural networks using a process known as transfer learning.31 Further guidance on external validation and model updating is available elsewhere,32-36 including sample size considerations for external validation.37 In the following steps, we focus on developing a new model; we briefly revisit external validation in step 9.

Step 3: Define the outcome measure
An outcome can be defined and measured in many ways. For example, postoperative mortality can be measured as a binary outcome at 30 days, at 60 days, or using survival time. Using time-to-event instead of binary variables is good practice; a prediction model for time-to-event can better handle people who were followed up for a limited time and did not experience the outcome of interest. Moreover, time-to-event data provide richer information (eg, the survival probability at any time point) than a binary outcome at one time point only. Similarly, we can analyse a continuous health outcome using a continuous scale or after dichotomising or categorising. For example, a continuous depression score at week 8 after starting drug treatment could be dichotomised as remission or non-remission. Categorising a continuous outcome leads to loss of information.38-40 Moreover, the selection of thresholds for categorisation is often arbitrary, lacking biological justification. In some cases, thresholds are chosen after exploring various cut-off points and opting for those that fit the data best or yield statistically significant results. This data driven approach could lead to reduced performance in new data.38

Step 4: Identify candidate predictors and specify measurement methods

Candidate predictors
We should identify potential predictors based on the literature review and expert knowledge (step 1). Like the outcomes of interest, they should ideally be objectively defined and measured using an established, reliable method. Understanding the biological pathways that might underpin associations between predictors and the outcome is key. Predictors with proven or suspected causal relationships with the outcome should be prioritised for inclusion; this approach might increase the model's generalisability. On the other hand, the absence of a causal relationship should not a priori exclude potential predictors. Predictors not causally related to the outcome but strongly associated with it might still contribute to model performance, although they might generalise less well to different settings than causal factors. Further, we must include only baseline predictors; that is, information available when making a prognosis. Dichotomising or categorising continuous predictors reduces information and diminishes statistical power and should be avoided.41 42 Similarly to categorising outcomes, we advise against making data driven, post hoc decisions after testing several categorisation thresholds for predictors. In other words, we should not choose the categories of a continuous predictor based solely on the associated model performance.

Thinking about the user of the prediction model
It is crucial to consider the model's intended use (defined in step 1) and the availability of data. What variables are routinely measured in clinical practice and are available in the database? What are the costs and practical issues related to their measurement, including the degree of invasiveness?43 For example, the veterans ageing cohort study index (VACS index 2.0) predicts all cause mortality in people with HIV.44 However, some of its predictors, such as the liver fibrosis index (FIB-4), will not be available in routine practice in many settings with a high prevalence of HIV infection. Similarly, a systematic review of prognostic models for multiple sclerosis found that 44 of 75 models (59%) included predictors unlikely to be measured in primary care or standard hospital settings.45

Step 5: Collect and examine data

Data collection
Ideally, prediction models are developed using individual participant data from prospective cohort studies designed for this purpose.1 In practice, developing prediction models using existing data from cohort studies or other data not collected explicitly for this purpose is much more common. Data from randomised clinical trials can also be used. The quality of trial data will generally be high, but models could have limited generalisability because trial participants might not represent the patients seen in clinical practice. For example, a study found only about 20% of people who have schizophrenia spectrum disorders would be eligible for inclusion in a typical randomised clinical trial. Patients who are ineligible had a higher risk of hospital admission with psychosis than those who are eligible.46 Therefore, a prediction model based on trial data might underestimate the real world risk of hospital admissions. Registry data offer a simple, low cost alternative; their main advantage is the relatively large sample size and representativeness. However, drawbacks relate to data limitations such as inadequate data on relevant predictors or outcomes, and variability in the timing of measurements.47

Data errors
Before fitting the model, addressing potential misclassification or measurement errors in predictors and outcomes is crucial. This involves considering the nature of the variables collected and the methods used for measurement or classification. For example, predictors such as physical activity or dietary intake are prone to various sources of measurement error.48 The extent of these errors can vary across settings, for example, because of differences in the measurement method used. This means that the model's predictive performance and potential usefulness could be reduced.49 If the risk of measurement error is considered high, we might consider alternative outcome measures or exclude less important, imprecisely measured predictors from the list created in step 4. In particular, if systematic errors in the dataset do not mirror those encountered in clinical practice, the model's calibration might be poor. While methods for correcting measurement errors have been proposed, they typically require additional data and assumptions.49

Variable distributions and missing data
After examining their distribution in the dataset, excluding predictors with limited variation is advisable because they will contribute little. For example, if the ages range from 25 to 45 years and the outcomes are not expected to change much within this range, we should remove age from the list of predictors. Similarly, a binary predictor might be present in only a few people. In such cases, we might consider removing it from the model unless there is previous evidence that this is a strong predictor.47 More complications arise when a variable with low prevalence is known to have meaningful predictive value. For example, a rare genetic mutation could be strongly associated with the outcome. The mutation could be omitted from the model because its effect is difficult to estimate accurately. Alternatively, the few people with the mutation could be excluded, making the model applicable only to people without it.47 Another issue is incomplete data on predictors and outcomes for some participants. Depending on the prevalence of missing data, we might want to modify the outcome or exclude certain candidate predictors. For example, we might omit a predictor with many missing values, especially if there is little evidence of its predictive power and imputing the missing data is challenging (step 7); that is, when the missing values cannot be reliably predicted using the observed data. Conversely, if the missing information can be imputed, we might decide to retain the variable, particularly when there is existing evidence that the predictor is important.
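As a minimal illustration of these checks (not part of the original paper or its appendix), the following R sketch summarises variable distributions, the proportion of missing values, and near-constant predictors in a hypothetical development dataset called dat with a binary outcome y; all object names are assumptions.

# Sketch: examine distributions and missingness before modelling (hypothetical data frame 'dat')
summary(dat)                                    # ranges, quartiles, and counts for each variable
sort(colMeans(is.na(dat)), decreasing = TRUE)   # proportion of missing values per variable
low_var <- sapply(dat, function(x) length(unique(na.omit(x))) < 2)
names(dat)[low_var]                             # predictors with essentially no variation
table(dat$y, useNA = "ifany")                   # number of events, non-events, and missing outcomes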
Step 6: Consider sample size

General considerations about sample size
A very simple model or a model based on covariates that are not associated with the outcome will perform poorly in the data used to develop it and in new data; this scenario is called underfitting. Conversely, a model with too many predictors developed in a small dataset (overfitting) could perform well in this particular dataset but fail to predict accurately in new data. In practice, overfitting is more common than underfitting because datasets are often small and have few events, and there is the temptation to create models with the best (apparent) performance. Therefore, we must ensure the data are sufficient to develop a robust model that includes the relevant predictors.

Calculating sample size requirements for a specific model
Riley and colleagues50 provide helpful guidance and code51 52 on sample size calculations. Users need to specify the overall risk (for binary outcomes) or mean outcome value (for continuous outcomes) in the target population, the number of model parameters, and a measure of expected model performance (eg, the coefficient of determination, R2). Note that the number of parameters can be larger than the number of predictors. For example, we need two parameters when using a restricted cubic spline with three knots to model a nonlinear association of age with the outcome. The sample size calculated this way is the minimum for a standard statistical model. The sample size must be several times larger if we want to use machine learning models.53 Sample size calculations for such models are considerably more complex and might require simulations.54

Calculating number of model parameters for fixed sample size
Suppose the sample size is fixed or based on an existing study, as is often the case. Then, we should perform sample size calculations to identify the maximum number of parameters we can include in the model. A structured way to guide model development can be summarised as follows:
• Calculate the maximum number of parameters that can be included in the model given the available sample size.
• Use the available parameters sequentially by including predictors from the list, starting from the ones that are perceived to be more important.55
• Note that additional parameters will be needed for including nonlinear terms or interactions among the predictors in the list.

Step 7: Deal with missing data

General considerations on missing data
After removing predictors or outcomes with many missing values, as outlined in step 5, we might still need to address missing values in the retained data. Relying only on complete cases for model development—that is, participants with data for all variables—can dramatically reduce the sample size. To mitigate the loss of valuable information during model development and evaluation, researchers should consider imputing missing data.

Imputation of missing data
Multiple imputation is the approach usually recommended to handle missing data during model development, and appropriately accounts for missing data uncertainty.56 Several versions of the original dataset are created, each with missing values imputed using an imputation model. The imputation model should be the same (in terms of predictors included, their transformations and interactions) as the final model we will use to make predictions. Additionally, the imputation model might involve auxiliary variables associated with missing data, which can enhance the effectiveness of the imputations. Once we have created the imputed datasets, we must decide whether to include participants with imputed outcomes in the model development. If no auxiliary variables were used in the imputations, people with imputed outcomes can be removed, and the model can be developed based only on people with observed outcomes.57 However, if imputation incorporates auxiliary variables, including those with imputed outcomes in the model development is advisable.58 A simpler alternative to multiple imputation is single imputation, where each missing value is imputed only once using a regression model. Sisk and colleagues showed that single imputation can perform well, although multiple imputation tends to be more consistent and stable.59

In step 4, we made the point that a model should include predictors that will be available in practice. However, we might want to make the model available even when some predictors are missing, for example, when using the model in a lower level of care. For example, the QRisk3 tool for predicting cardiovascular disease can be used even if the general practitioner does not enter information on blood pressure variability (the standard deviation of repeated readings).60 When anticipating missing data during use in clinical practice, we can impute data during the development and implementation phases. In this case, single imputation can be used during model development and model use.59

Imputation methods are not a panacea and might fail, typically when the tendency of the outcome to be missing correlates with the outcome itself. For example, patients receiving a new treatment might be more likely to miss follow-up visits if the treatment was successful, leading to missing data. Developing a prediction model in such cases requires additional modelling efforts61 that are beyond the scope of this tutorial.
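As an illustration of the multiple imputation workflow described above (a sketch, not the code from the paper's appendix), the mice package can be used to create several imputed datasets, fit the same model in each, and pool the coefficient estimates with Rubin's rules; the data frame dat, outcome y, and predictors age and sex are hypothetical.

library(mice)
# Sketch: multiple imputation, assuming a data frame 'dat' with outcome y and missing predictor values
imp  <- mice(dat, m = 20, seed = 2024, printFlag = FALSE)   # 20 imputed versions of the dataset
fits <- with(imp, glm(y ~ age + sex, family = binomial))    # fit the same model in every imputed dataset
summary(pool(fits))                                         # pool coefficients across imputations (Rubin's rules)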
Step 8: Fit the prediction models

Modelling strategies
The strategies for model development should be specified in the protocol (step 5). Linear regression for continuous outcomes, logistic regression for binary outcomes, and Cox or simple parametric models for survival outcomes are the usual starting points in modelling. If the sample size is large enough (see step 6), models can include nonlinear terms for continuous predictors or interactions between predictors. More advanced modelling strategies, such as machine learning models (eg, random forests, support vector machines, boosting methods, neural networks, etc), can also be used.62 63 These strategies might add value if there are strong nonlinearities and interactions between predictors, although they are not immune to biases.64 As discussed under step 10, a final strategy needs to be selected if several modelling strategies are explored.

Dealing with competing events
When predicting binary or time-to-event outcomes, we should consider whether there are relevant competing events. This situation occurs when several possible outcomes exist, but a person can only experience one event. For example, when predicting death from breast cancer, death from another cause is a competing event. In this case, and especially whenever competing events are common, we should use a competing risks model for the analysis, such as a cause specific Cox regression model.65 A simpler approach would be to analyse a composite outcome.

Data driven variable selection methods
We advise against univariable selection methods—that is, methods that test each predictor separately and retain only statistically significant predictors. These methods do not consider the association between predictors and could lead to loss of valuable information.55 66 Stepwise methods for variable selection (eg, forward, backwards, or bidirectional variable selection) are commonly used. Again, they are not recommended because they might lead to bias in estimation and worse predictive performance.55 67 68 If variable selection is desirable—for instance, to simplify the implementation of the model by further reducing the number of predetermined predictors—more suitable methods can be used as described below.

Model estimation
Adding penalty terms to the model (a procedure called penalisation, regularisation, or shrinkage; see table 1) is recommended to control the complexity of the model and prevent overfitting.69-71 Penalisation methods such as ridge, LASSO (least absolute shrinkage and selection operator), and elastic net generally lead to smaller absolute values of the coefficients—that is, they shrink coefficients towards zero—compared with maximum likelihood estimation.72 LASSO and …

Fig 1 | Prediction error and the bias-variance trade-off: an underfitting model has high bias and low variance; an overfitting model has low bias and high variance
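To make the estimation step concrete, here is a minimal sketch (not the authors' code) of penalised logistic regression with the glmnet package, where alpha = 1 gives the LASSO, alpha = 0 gives ridge regression, and intermediate values give the elastic net; the penalty parameter is chosen by cross validation, and the dataset and predictor names are hypothetical.

library(glmnet)
# Sketch: penalised logistic regression, assuming a data frame 'dat' with binary outcome y
X      <- model.matrix(~ age + sex + sbp + smoking, data = dat)[, -1]    # predictor matrix without intercept
cv_fit <- cv.glmnet(X, dat$y, family = "binomial", alpha = 1)            # alpha = 1: LASSO penalty
coef(cv_fit, s = "lambda.min")                                           # shrunken coefficients at the chosen penalty
risk   <- predict(cv_fit, newx = X, s = "lambda.min", type = "response") # predicted risks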
… to predict the outcome for a new person. However, this method is not straightforward for model selection strategies (eg, LASSO) because the m fitted models might have selected different sets of parameters. As a result, combining them becomes more complex.77 78 Rubin's rule might not apply to machine learning methods because the m models could have different architectures. Another method for combining the m models is to use them to make predictions for the new person and then average these m predictions,79 a procedure conceptually similar to stacking in machine learning.

Step 9: Assess the performance of prediction models

General concepts in assessing model performance
We assess the predictive performance of the modelling strategies explored in step 8. Specifically, we contrast predictions with observed outcomes for people in a dataset to calculate performance measures. For continuous outcomes like blood pressure this is straightforward: observed outcomes can be directly compared with predictions because they are on the same scale. When dealing with binary or survival outcomes, the situation becomes more complex. In these cases, prediction models might give the probability of an event occurring for each individual, while observed outcomes are binary (event or no event) or involve time-to-event data with censoring. Consequently, more advanced methods are required.

Dimensions of prediction performance
Prediction performance has two dimensions, and it is essential to assess them both, particularly for binary and survival outcomes (see glossary in table 1).
• Discrimination—for continuous outcomes, discrimination refers to the model's ability to distinguish between patients with different outcomes: good discrimination means that patients with higher predicted values also had higher observed outcome values. For binary outcomes, good discrimination means that the model separates people at high risk from those at low risk. For time-to-event outcomes, discrimination refers to the ability of the model to rank patients according to their survival; that is, patients predicted to survive longer survived longer.
• Calibration relates to the agreement between observed and predicted outcome values.80 81 For continuous outcomes, good calibration means that predicted values do not systematically overestimate or underestimate observed values. For binary and survival outcomes, good calibration means the model does not overestimate or underestimate risks.

Box 1 | Measures for assessing model performance
Continuous outcomes
• Predicted and observed outcomes can be compared through mean bias, mean squared error, and the coefficient of determination, R2, to measure overall performance—ie, combining calibration and discrimination. For discrimination alone, rank correlation statistics between predictions and observations can be used, although this seldom occurs in practice. For calibration, results can be visualised in a scatterplot and an observed versus predicted line fitted. For a perfectly calibrated model, this line is on the diagonal; for an overfit (underfit) model, the calibration line is above (below) the diagonal. A smooth calibration line can assess calibration locally—ie, it can indicate areas where the model underestimates or overestimates the outcome. Smooth calibration lines can be obtained by fitting, for example, restricted cubic splines or a locally estimated scatterplot smoothing line (LOESS) of the predicted versus the observed outcomes.
Binary outcomes
• Discrimination can be assessed using the area under the receiver operating characteristic curve (AUC). Mean calibration (calibration in the large, see table 1) can be determined by comparing mean observed versus mean predicted event rates. A logistic regression model can be fit to the observed outcome using the log odds of the event from the prediction model as the sole independent variable and then the intercept and slope can be evaluated. Additionally, a calibration curve can be created; for this, participants are grouped according to their predicted probabilities. Calculate the mean predicted probability and the proportion of events for each group; then compare the two in a scatterplot and draw a smooth calibration curve (eg, using splines) to assess calibration locally. The Brier score measures overall performance—it is simply calculated as the mean squared difference between predicted probabilities and actual outcomes. Many additional measures can be used to measure performance, for example, F score, sensitivity-specificity, etc.
Survival outcomes
• If focus is on a specific time point, discrimination can be assessed as for binary outcomes (fixed time point discrimination).18 However, censoring of follow-up times complicates this assessment. Uno and colleagues' inverse probability of censoring weights method can account for censoring.82 Also, discrimination can be assessed across all time points using Harrell's c statistic.83 Uno's c statistic can be expanded to a global measure, across all time points.84 Calibration can be assessed for a fixed time point by comparing the average predicted survival from the model with the observed survival—ie, estimated while accounting for censorship; this can be obtained from a Kaplan-Meier curve by looking at the specific time point (calibration in the large at a fixed time). The Kaplan-Meier curve can be compared with the mean predicted survival across all times. More details can be found elsewhere.18 Smooth calibration curves can also be used to assess performance of the model across the full range of predicted risks, while additional calibration metrics have also been proposed.85 86 Similar measures can be used for competing events, with some adjustments.16
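For the binary outcome case described in box 1, these measures can be computed with a few lines of base R (an illustrative sketch, not the paper's appendix code), given a vector of observed outcomes y (coded 0/1) and a vector of predicted probabilities p from some model.

# Sketch: discrimination, calibration, and overall performance for a binary outcome
lp <- qlogis(p)                                                # predicted log odds (linear predictor)
n1 <- sum(y == 1); n0 <- sum(y == 0)
auc <- (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)  # AUC via the rank (Mann-Whitney) formulation
cal_int   <- coef(glm(y ~ offset(lp), family = binomial))[1]   # calibration in the large (0 is ideal)
cal_slope <- coef(glm(y ~ lp, family = binomial))[2]           # calibration slope (1 is ideal)
brier     <- mean((p - y)^2)                                   # Brier score (overall performance)
round(c(auc = auc, intercept = unname(cal_int), slope = unname(cal_slope), brier = brier), 3)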
Discrimination and calibration are essential when evaluating prediction models. A model can have good discrimination by accurately distinguishing between risk levels, but still have poor calibration owing to a mismatch between predicted and observed probabilities. Moreover, a well calibrated model might have poor discrimination. Thus, a robust prediction model should have good discrimination and calibration. Box 1 outlines measures for assessing model performance.

Model validation
What data should we use to assess the performance of a prediction model? The simplest approach is to use the same dataset as for model development; this approach will return the so-called apparent model performance (apparent validation). However, this strategy might overestimate the model's performance (fig 1); that is, it might lead to erroneous (optimistic) assessments. Optimism is an important issue in prediction modelling and is particularly relevant when sample sizes are small and models complex. Therefore, assessing model performance using a more adequate validation procedure is crucial. Proper validation is essential in determining a prediction model's generalisability—that is, its reproducibility and transportability.33 47 89 Reproducibility refers to the model's ability to produce accurate predictions in new patients from the same population. Transportability is the ability to produce accurate predictions in new patients drawn from a different but related population. Below, we describe different approaches to model validation.

Internal validation
Internal validation focuses on reproducibility and specifically aims to ensure that assessments of model performance using the development dataset are honest, meaning optimism does not influence them. In an internal validation procedure, we use data on the same patient population as the one used to develop the model and try to assess model performance while avoiding optimism. Validation must follow all steps of model development, including variable selection.

The simplest method is the split sample approach, where the dataset is randomly split into two parts (eg, 70% training and 30% testing). However, this method is problematic because it wastes data and decreases statistical power.55 87 When applied to a small dataset, it might create two datasets that are inadequate for both model development and evaluation. Conversely, for large datasets it offers little benefit because the risk of overfitting is low. Further, it might encourage researchers to repeat the procedure until they obtain satisfactory results.88 Another approach is to split the data according to the calendar time of patient enrolment. For example, we might develop the model using data from an earlier period and test it in patients enrolled later. This procedure (temporal validation)35 might inform us about possible time trends in model performance. However, the time point used for splitting the data will generally be arbitrary and older data might not reflect current patient characteristics or health care. Therefore, this approach is not recommended for the development phase.88

A better method is k-fold cross validation. In this approach, we divide the data randomly in k (usually 10) subsets (folds). The model is built using k−1 of these folds and evaluated on the remaining one fold. This process is repeated, cycling through all the folds so that each can be the testing set. The model's performance is measured in each cycle, and the k estimates are then combined and summarised to get a final performance measure. Bootstrapping is another method,90 which can be used to calculate optimism and optimism corrected performance measures for any model. Box 2 outlines the procedure.47 Bootstrapping generally leads to more stable and less biased results,93 and is therefore recommended for internal validation.47 However, implementation of k-fold cross validation and bootstrapping can be computationally demanding when multiple imputation of missing data is needed.88

Another method of assessing whether a model's predictions are likely to be reliable or not is by checking the model's stability. Model instability means that small changes in the development dataset lead to large changes in the resulting model structure (important differences in estimates of model parameters, included predictors, etc), leading to important changes in predictions and model performance. Riley and Collins described how to assess the stability of clinical prediction models during the model development phase using a bootstrap approach.94 The model building procedure is repeated in several bootstrap samples to create numerous models. Predictions from these models are then compared with the original model predictions to investigate possible instability.

Box 2 | Calculating optimism corrected measures of performance through bootstrapping
Use bootstrapping to correct apparent performance and obtain optimism corrected measures for any model M and any performance measure as follows.
Select a measure X (eg, R2, mean squared error, AUC (area under the receiver operating characteristic curve)) and calculate the apparent performance (X0) of model M in the original sample.
1. Create many (at least NB=100) bootstrap samples with the same size as the original dataset by drawing patients from the study population with replacement. Replacement means that some individuals might be included several times in a bootstrap sample, while others might not appear at all.
2. In each bootstrap sample i (i=1, 2 … NB) construct model Mi by exactly reiterating all steps of developing M, ie, including variable selection methods (if any were used). Determine the apparent performance Xi of model Mi in sample i.
3. Apply Mi to the original sample and calculate performance, Xi*. This performance will generally be worse than Xi owing to optimism. Calculate the optimism for measure X in sample i as OiX = Xi − Xi*.
4. Average the NB different values of OiX to estimate the optimism, OX.
5. Calculate the optimism corrected value of X as Xcorrected = X0 − OX.
More advanced versions of bootstrapping (eg, the 0.632+ bootstrap91) require slightly different procedures.92 In practice, we often need to combine bootstrapping with multiple imputation. Ideally, we should first bootstrap and then impute.92 However, this strategy might be computationally difficult. Instead, we can first impute, then bootstrap, obtain optimism corrected performance measures from each imputed dataset, and finally pool these.
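A bare-bones R version of the box 2 procedure might look as follows (a sketch under simplifying assumptions, not the authors' implementation): the performance measure is the AUC, the model is a fixed logistic regression, and any variable selection or tuning used in practice would have to be repeated inside the loop; the data frame dat and its variables are hypothetical.

# Sketch of box 2: bootstrap optimism correction of the AUC (hypothetical data frame 'dat', outcome y)
auc_fun <- function(y, p) {
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
form    <- y ~ age + sex + sbp                       # pre-specified model (assumed)
fit0    <- glm(form, family = binomial, data = dat)
app_auc <- auc_fun(dat$y, fitted(fit0))              # apparent performance, X0
set.seed(2024)
NB <- 200                                            # number of bootstrap samples
optimism <- replicate(NB, {
  boot  <- dat[sample(nrow(dat), replace = TRUE), ]  # draw a bootstrap sample with replacement
  fit_b <- glm(form, family = binomial, data = boot) # repeat all modelling steps here
  auc_fun(boot$y, fitted(fit_b)) -                   # Xi: performance in the bootstrap sample
    auc_fun(dat$y, predict(fit_b, newdata = dat, type = "response"))  # Xi*: performance in the original sample
})
app_auc - mean(optimism)                             # optimism corrected AUC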
External validation
External validation requires testing the model on a new set of patients—that is, those not used for model development.36 Assuming that the model has shown good internal validity, external validation studies are the next step in determining a model's transportability before considering its implementation in clinical practice. The more numerous and diverse the settings in which the model is externally validated, the more likely it will generalise to a new setting. An external validation study could indicate that a model requires updating before being used in a new setting. A common scenario is when a model's discrimination is adequate in new settings and fairly stable over time, but calibration is suboptimal across settings or deteriorates over time (calibration drift).98 For example, EuroSCORE is a model developed in 1999 for predicting mortality in hospital for patients undergoing cardiac surgery.99 Using data from 2001 to 2011, EuroSCORE was shown to consistently overestimate mortality and its calibration deteriorated over time.100 In such situations, model updating (step 2) might be required.

The inclusion of external validation in model development is a topic of debate, with certain journals mandating it for publication.88 100 One successful external validation, however, does not establish transportability to many other settings, while such a requirement might lead to the selective reporting of validation data.100 Therefore, our view (echoing recent recommendations88) is that external validation studies should be separated from model development at the moment of model development. External validation studies are ideally performed by independent investigators who were not involved in the original model development.101 For guidance on methods for external validation, see references cited in step 2.

Step 11: Perform a decision curve analysis
… the National Institute for Health and Care Excellence (NICE) in the UK recommends cholesterol lowering treatment if the predicted 10 year risk of myocardial infarction or stroke is 10% or higher (the cut-off threshold probability) based on the QRISK3 risk calculator.60 105 The assumption is that the benefit of treating one patient who would experience a cardiovascular event over 10 years outweighs the harms and costs incurred by treating another nine people who will not benefit. In other words, the harm associated with not treating the one patient who would develop the event is assumed to be nine times greater than the consequences of treating a patient who does not need it.

Net benefit brings the benefits and harms of a decision strategy (eg, to decide for or against treatment based on a prediction model) on the same scale so they can be compared.104 We can compute the net benefit of using the model at a particular cut-off threshold (eg, 10% risk for the case of the QRISK3 risk calculator). The net benefit is calculated as the expected percentage of true positives minus the expected percentage of false positives, multiplied by a weight determined by the chosen cut-off threshold. We obtain the decision curve by plotting the model's net benefit across a range of cut-off thresholds deemed clinically relevant.106 107
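To illustrate the calculation (this sketch is not the authors' code; ready-made implementations are available from the resources cited above), the net benefit of the model, of treating everyone, and of treating no one can be computed over a range of thresholds from observed outcomes y (0/1) and predicted risks p; plotting the three against the threshold gives the decision curve.

# Sketch: net benefit and a simple decision curve (hypothetical vectors y and p)
net_benefit <- function(y, p, pt) {
  tp <- mean(p >= pt & y == 1)                  # proportion treated who would have the event
  fp <- mean(p >= pt & y == 0)                  # proportion treated who would not have the event
  tp - fp * pt / (1 - pt)                       # false positives weighted by the odds of the threshold
}
thresholds <- seq(0.05, 0.50, by = 0.01)
nb_model   <- sapply(thresholds, net_benefit, y = y, p = p)
nb_all     <- mean(y) - (1 - mean(y)) * thresholds / (1 - thresholds)   # treat everyone
plot(thresholds, nb_model, type = "l", ylim = range(c(nb_model, nb_all, 0)),
     xlab = "Threshold probability", ylab = "Net benefit")
lines(thresholds, nb_all, lty = 2)              # treat all
abline(h = 0, lty = 3)                          # treat none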
We can compare the benefit of making decisions based on the model with alternative strategies, such as treating everyone or no one. We can also compare different models. The choice of decision threshold can be subjective, and the range of sensible thresholds will depend on the settings, conditions, available diagnostic tests or treatments, and patient preferences. The lower the threshold, the more unnecessary tests or interventions we are willing to accept. Of note, a decision curve analysis might indicate that a model is not useful in practice despite its excellent predictive ability.

There are several pitfalls in the interpretation of decision curves.24 Most importantly, the decision curve cannot determine at what threshold probability the model should be used. Moreover, because the model's predictive performance influences the decision curve, the decision curve can be affected by optimism. Therefore, a model's good predictive performance (in internal validation and after correction for optimism) should be established before evaluating its clinical usefulness through a decision curve. Additionally, the curve can be obtained using a cross validation approach.108 Vickers and colleagues provide a helpful step-by-step guide to interpreting decision curve analysis, and a website with a software tutorial and other resources.107 The multiple sclerosis example below includes a decision curve analysis.

Step 12: Assess the predictive ability of individual predictors (optional step)
In prediction modelling, the primary focus is typically not on evaluating the importance of individual predictors; rather, the goal is to optimise the model's overall predictive performance. Nevertheless, identifying influential predictors might be of interest, for example, when evaluating the potential inclusion of a new biomarker as a routine measurement. Also, some predictors might be modifiable, raising the possibility that they could play a part in prevention if their association with the outcome is causal. Therefore, as an additional, optional step, researchers might want to assess the predictive capacity of the included predictors.

Looking at estimated coefficients in (generalised) linear regression models is a simple way to assess the importance of different predictors. However, when the assumptions of linear regression are not met, for example, when there is collinearity, these estimates might be unreliable. Note, however, that multicollinearity does not threaten a model's predictive performance, just the interpretation of the coefficients. Another method to assess the importance of a predictor, also applicable to machine learning models, is to fit the model with and without this predictor and note the reduction in model performance; omitting more important predictors will lead to a larger reduction in performance. More advanced methods include the permutation importance algorithm109 and SHAP (Shapley additive explanations)110; we do not discuss these here.

Regardless of the method we choose to assess predictor importance, we should be careful in our interpretations; associations seen in data might not reflect causal relationships (eg, see the "Table 2 fallacy"111). A thorough causal inference analysis is needed to establish causal associations between predictors and outcomes.112

Step 13: Write up and publish
Congratulations to us! We have developed a clinical prediction model! Now, it is time to write the paper and describe the process and results in detail. The TRIPOD reporting guideline and checklist10 14 (or, for clustered datasets, TRIPOD cluster13) should be used to ensure all important aspects are covered in the paper. If possible, the article should report the full model equation to allow reproducibility and independent external validation studies. Software code and, ideally, data should be made freely available. Further, we must ensure the model is accessible to the users we defined in step 1. Although this should be self-evident, in practice, there is often no way to use published models to make an actual prediction; for example, Reeve and colleagues found that 52% of published models for multiple sclerosis could not be used in practice because no model coefficients, tools, or instructions were provided.45

The advantages and disadvantages of different approaches for making the model available to users, including score systems, graphical score charts, nomograms, and websites and smartphone applications, have been reviewed elsewhere.113 Simpler approaches are easier to use, for example, on ward rounds, but might require model simplification by removing some predictors or categorising continuous variables. Online calculators where users input predictor values (eg, a web application using Shiny in R)114 can be based on the whole model without information loss. However, if publicly accessible, calculators might be misused by people for whom they are not intended, or used even if the model fails to show any clinical value (eg, in a subsequent external validation). Generally, the presentation and implementation should always be discussed with the users to match their needs (defined in step 1).

Example: relapsing-remitting multiple sclerosis

Background
Multiple sclerosis is a chronic inflammatory disorder of the central nervous system with a highly variable clinical course.115 Relapsing-remitting multiple sclerosis (RRMS), the most common form, is characterised by attacks of worsening neurological function (relapses) followed by periods of partial or complete recovery (remissions).116-118 These fluctuations pose a major challenge in managing the disease. A predictive tool could inform treatment decisions. Below, we describe the development of a prediction model for RRMS.119 We briefly outline the procedures followed in the context of our step-by-step guide. Details of the original analysis and results are provided elsewhere.119

Step-by-step model development
The aim was to predict relapse within two years in patients with RRMS. Such a prediction can help treatment decisions; if the risk of relapsing is high, patients might consider intensifying treatment, for example, by taking more active disease modifying drugs, which might however have a higher risk of serious adverse events, or considering stem cell transplantation. A multidisciplinary team comprising
clinicians, patients, epidemiologists, and statisticians was formed. A literature review identified several potential predictors for relapse in RRMS. Additionally, the review showed limitations of existing prediction models, including lack of internal validation, inadequate handling of missing data, and lack of assessment of clinical utility (step 1). These deficiencies compromised the reliability and applicability of existing models in clinical settings. Based on the review, it was decided to pursue the development of a new model, instead of updating an existing one (step 2). The authors chose the (binary) occurrence of at least one relapse within a two year period for people with RRMS (step 3) as the outcome measure.

The following predictors were used based on the literature review and expert opinion: age, expanded disability status scale score, previous treatment for multiple sclerosis, months since last relapse, sex, disease duration, number of previous relapses, and number of gadolinium enhanced lesions. The selection aimed to include relevant predictors while excluding those that are difficult to measure in clinical practice (step 4). The model was developed using data from the Swiss Multiple Sclerosis Cohort,120 a prospective cohort study that closely monitors patients with RRMS. Data included a total of 1752 observations from 935 patients followed up every two years, with 302 events observed (step 5). Sample size calculations50 indicated a minimum sample of 2082 patients, which is larger than the available sample, raising concerns about possible overfitting issues (step 6). Multiple imputations were used to impute missing covariate data. The authors expected no missing data when using the model in practice (step 7).

A Bayesian logistic mixed effects prediction model was developed, which accounted for several observations within patients. Regression coefficients were penalised through a Laplace prior distribution to address possible overfitting (step 8). Model calibration was examined in a calibration plot (fig 2, upper panel), and discrimination was assessed using the AUC (area under the receiver operating characteristic curve). Both assessments were corrected for optimism through a bootstrap validation procedure (described in box 2), with 500 bootstrap samples created for each imputed dataset. The optimism corrected calibration slope was 0.91, and the optimism corrected AUC was 0.65—this value corresponds to low to moderate discriminatory ability, comparable to or exceeding previous RRMS models (steps 9 and 10). A decision curve analysis was performed to assess the clinical utility of the model (fig 2, lower panel). The analysis indicated that deciding to intensify or not intensify the treatment using information from the model is preferable to simpler strategies—do not intensify treatment, and intensify treatment for all—for thresholds between 15% and 30%. Therefore, the model is useful to guide decisions in practice only if we value the avoidance of relapse …

Fig 2 | Upper panel: calibration plot of observed proportion against predicted probability of relapse. Lower panel: decision curve analysis showing net benefit across threshold probabilities for three strategies (intensify treatment for all, do not intensify treatment, and intensify treatment according to predictions from the model)
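For readers who want to reproduce a plot in the spirit of the upper panel of fig 2, a smooth calibration curve can be sketched in base R as follows (illustrative only, not the code used for the published model), given observed outcomes y (0/1) and predicted probabilities p.

# Sketch: smooth (loess) calibration curve, hypothetical vectors y and p
cal <- loess(y ~ p)                               # locally estimated scatterplot smoothing
ord <- order(p)
plot(p[ord], predict(cal)[ord], type = "l", xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Predicted probability", ylab = "Observed proportion")
abline(0, 1, lty = 2)                             # diagonal line: perfect calibration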
Fig 3 | Graphical overview of 13 proposed steps for developing a clinical prediction model. TRIPOD=transparent reporting of a multivariable prediction model for individual prognosis or diagnosis
modelling, where we provide R code covering many machine learning approaches might require additional
aspects of the development of prediction models. The development steps to ensure calibration.94 121 122
code uses simulated datasets and describes the case We trust that our presentation of the key concepts
of continuous, binary, time-to-event, and competing and discussion of topics relevant to the development
risk outcomes. The code covers the following aspects: of clinical prediction models will help researchers
sample size calculations, multiple imputation, to choose the most sensible approach for the
modelling nonlinear associations, assessing apparent problem at hand. Moreover, the paper will hopefully
model performance, performing internal validation increase awareness among researchers of the need
using bootstrap, internal-external validation, and to work in diverse teams, including clinical experts,
decision curve analysis. Readers should note that methodologists, and future model users. Similar
the appendix does not cover all possible modelling to guidance on transparent reporting of research,
methods, models, and performance measures that adopting methodological guidance to improve
can be used. Moreover, parts of the code are based the quality and relevance of clinical research is a
on previous publications.16 18 Additional code is responsibility shared by investigators, reviewers,
provided elsewhere, for example, by Zhou and journals, and funders.123
colleagues.17 Contributors: OE conceived the idea of the project and wrote the
first draft of the manuscript. KC performed the analysis of the real
example in relapsing-remitting multiple sclerosis. MS and OE prepared
Conclusions
This tutorial provides a step-by-step guide to developing and validating clinical prediction models. We stress that this is not a complete and exhaustive guide, and it does not aim to replace existing resources. Our intention is to introduce essential aspects of clinical prediction modelling. Figure 3 provides an overview of the proposed steps.
In principle, most steps we have described apply to traditional statistical and machine learning approaches,14 with some exceptions. For example, the structure of a machine learning model is often defined during model development and so will not be known a priori. Consequently, using the final model for multiple imputations, as we discussed in step 7, might not be possible. Further, bootstrapping, which we recommended as the method of choice for internal validation, might not be computationally feasible for some machine learning approaches. Moreover, some machine learning approaches might require additional development steps to ensure calibration.94 121 122
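As a minimal sketch of such an additional calibration step, raw scores from a hypothetical machine learning model can be passed through a simple logistic (Platt type) recalibration model; the simulated data and the factor of 2.5 used to mimic overconfident predictions are illustrative assumptions and do not come from our example.

set.seed(1)
n     <- 1000
lp    <- rnorm(n)                              # "true" linear predictor
y     <- rbinom(n, 1, plogis(lp))              # simulated binary outcome
p_raw <- plogis(2.5 * lp)                      # overconfident raw scores from a hypothetical model

recal <- glm(y ~ qlogis(p_raw), family = binomial)   # logistic recalibration on the logit scale
coef(recal)                                    # slope well below 1 flags overconfident predictions
p_cal <- fitted(recal)                         # recalibrated predicted probabilities

# In practice the recalibration model should be estimated on separate data or via resampling;
# on the data used to fit it, the calibration slope of p_cal equals 1 by construction.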
We trust that our presentation of the key concepts and discussion of topics relevant to the development of clinical prediction models will help researchers to choose the most sensible approach for the problem at hand. Moreover, the paper will hopefully increase awareness among researchers of the need to work in diverse teams, including clinical experts, methodologists, and future model users. Similar to guidance on transparent reporting of research, adopting methodological guidance to improve the quality and relevance of clinical research is a responsibility shared by investigators, reviewers, journals, and funders.123
Contributors: OE conceived the idea of the project and wrote the first draft of the manuscript. KC performed the analysis of the real example in relapsing-remitting multiple sclerosis. MS and OE prepared the online supplement. ME and GS contributed concepts and revised the manuscript. All authors contributed to the final manuscript. OE is the guarantor of the article. ME and GS contributed equally to the manuscript as last authors. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.
Funding: OE and MS were supported by the Swiss National Science Foundation (SNSF Ambizione grant 180083). ME was supported by special project funding from the SNSF (grant 32FP30-189498) and funding from the National Institutes of Health (5U01-AI069924-05, R01 AI152772-01). KC and GS were supported by the HTx project, funded by the European Union's Horizon 2020 research and innovation programme, 825162. The funders had no role in considering the study design or in the collection, analysis, interpretation of data, writing of the report, or decision to submit the article for publication.
Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/disclosure-of-interest/ and declare support from the Swiss National Science Foundation, National Institutes of Health, and European Union's Horizon 2020 research and innovation programme for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.
Provenance and peer review: Not commissioned; externally peer reviewed.
This is an Open Access article distributed in accordance with the terms of the Creative Commons Attribution (CC BY 4.0) license, which permits others to distribute, remix, adapt and build upon this work, for commercial use, provided the original work is properly cited. See: http://creativecommons.org/licenses/by/4.0/.
1 Moons KGM, Royston P, Vergouwe Y, Grobbee DE, Altman DG. Prognosis and prognostic research: what, why, and how? BMJ 2009;338:b375. doi:10.1136/bmj.b375
2 Kleinrouweler CE, Cheong-See FM, Collins GS, et al. Prognostic models in obstetrics: available, but far from applicable. Am J Obstet Gynecol 2016;214:79-90.e36. doi:10.1016/j.ajog.2015.06.013
3 Wynants L, Van Calster B, Collins GS, et al. Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal. BMJ 2020;369:m1328. doi:10.1136/bmj.m1328
4 Riley RD, Ensor J, Snell KIE, et al. External validation of clinical prediction models using big datasets from e-health records or IPD meta-analysis: opportunities and challenges. BMJ 2016;353:i3140. doi:10.1136/bmj.i3140
5 Beam AL, Kohane IS. Big data and machine learning in health care. JAMA 2018;319:1317-8. doi:10.1001/jama.2017.18391
6 Chen JH, Asch SM. Machine learning and prediction in medicine—beyond the peak of inflated expectations. N Engl J Med 2017;376:2507-9. doi:10.1056/NEJMp1702071
7 Hemingway H, Croft P, Perel P, et al. PROGRESS Group. Prognosis research strategy (PROGRESS) 1: a framework for researching clinical outcomes. BMJ 2013;346:e5595. doi:10.1136/bmj.e5595
8 Steyerberg EW, Moons KGM, van der Windt DA, et al. PROGRESS Group. Prognosis Research Strategy (PROGRESS) 3: prognostic model research. PLoS Med 2013;10:e1001381. doi:10.1371/journal.pmed.1001381
9 Hingorani AD, Windt DA, Riley RD, et al. PROGRESS Group. Prognosis research strategy (PROGRESS) 4: stratified medicine research. BMJ 2013;346:e5793. doi:10.1136/bmj.e5793
10 Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMC Med 2015;13:1. doi:10.1186/s12916-014-0241-z
11 Moons KGM, Altman DG, Reitsma JB, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med 2015;162:W1-73. doi:10.7326/M14-0698
12 Debray TPA, Collins GS, Riley RD, et al. Transparent reporting of multivariable prediction models developed or validated using clustered data: TRIPOD-Cluster checklist. BMJ 2023;380:e071018. doi:10.1136/bmj-2022-071018
13 Debray TPA, Collins GS, Riley RD, et al. Transparent reporting of multivariable prediction models developed or validated using clustered data (TRIPOD-Cluster): explanation and elaboration. BMJ 2023;380:e071058. doi:10.1136/bmj-2022-071058
14 Collins GS, Moons KGM, Dhiman P, et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 2024;385:e078378. doi:10.1136/bmj-2023-078378
15 Wolff RF, Moons KGM, Riley RD, et al. PROBAST Group. PROBAST: a tool to assess the risk of bias and applicability of prediction model studies. Ann Intern Med 2019;170:51-8. doi:10.7326/M18-1376
16 van Geloven N, Giardiello D, Bonneville EF, et al. STRATOS initiative. Validation of prediction models in the presence of competing risks: a guide through modern methods. BMJ 2022;377:e069249. doi:10.1136/bmj-2021-069249
17 Zhou Z-R, Wang W-W, Li Y, et al. In-depth mining of clinical data: the construction of clinical prediction model with R. Ann Transl Med 2019;7:796. doi:10.21037/atm.2019.08.63
18 McLernon DJ, Giardiello D, Van Calster B, et al. topic groups 6 and 8 of the STRATOS Initiative. Assessing performance and clinical usefulness in prediction models with survival outcomes: practical guidance for Cox proportional hazards models. Ann Intern Med 2023;176:105-14. doi:10.7326/M22-0844
19 Damen JAAG, Hooft L, Schuit E, et al. Prediction models for cardiovascular disease risk in the general population: systematic review. BMJ 2016;353:i2416. doi:10.1136/bmj.i2416
20 Meehan AJ, Lewis SJ, Fazel S, et al. Clinical prediction models in psychiatry: a systematic review of two decades of progress and challenges. Mol Psychiatry 2022;27:2700-8. doi:10.1038/s41380-022-01528-4
21 van Smeden M, Reitsma JB, Riley RD, Collins GS, Moons KG. Clinical prediction models: diagnosis versus prognosis. J Clin Epidemiol 2021;132:142-5. doi:10.1016/j.jclinepi.2021.01.009
22 Kent DM, Paulus JK, van Klaveren D, et al. The Predictive Approaches to Treatment effect Heterogeneity (PATH) statement. Ann Intern Med 2020;172:35-45. doi:10.7326/M18-3667
23 Rekkas A, Paulus JK, Raman G, et al. Predictive approaches to heterogeneous treatment effects: a scoping review. BMC Med Res Methodol 2020;20:264. doi:10.1186/s12874-020-01145-1
24 Vickers A. Statistical thinking—seven common errors in decision curve analysis. https://www.fharrell.com/post/edca/index.html (2023, accessed 24 August 2023).
25 Collins GS, Moons KGM, Debray TPA, et al. Systematic reviews of prediction models. In: Systematic Reviews in Health Research. John Wiley & Sons, Ltd: 347-76.
26 Debray TPA, Damen JAAG, Snell KIE, et al. A guide to systematic review and meta-analysis of prediction model performance. BMJ 2017;356:i6460. doi:10.1136/bmj.i6460
27 Damen JAA, Moons KGM, van Smeden M, Hooft L. How to conduct a systematic review and meta-analysis of prognostic model studies. Clin Microbiol Infect 2023;29:434-40. doi:10.1016/j.cmi.2022.07.019
28 Binuya MAE, Engelhardt EG, Schats W, Schmidt MK, Steyerberg EW. Methodological guidance for the evaluation and updating of clinical prediction models: a systematic review. BMC Med Res Methodol 2022;22:316. doi:10.1186/s12874-022-01801-8
29 Su T-L, Jaki T, Hickey GL, Buchan I, Sperrin M. A review of statistical updating methods for clinical prediction models. Stat Methods Med Res 2018;27:185-97. doi:10.1177/0962280215626466
30 Dankowski T, Ziegler A. Calibrating random forests for probability estimation. Stat Med 2016;35:3949-60. doi:10.1002/sim.6959
31 Niu S, Liu Y, Wang J, et al. A decade survey of transfer learning (2010-2020). IEEE Trans Artif Intell 2020;1:151-66. doi:10.1109/TAI.2021.3054609
32 Moons KGM, Kengne AP, Grobbee DE, et al. Risk prediction models: II. External validation, model updating, and impact assessment. Heart 2012;98:691-8. doi:10.1136/heartjnl-2011-301247
33 Ramspek CL, Jager KJ, Dekker FW, Zoccali C, van Diepen M. External validation of prognostic models: what, why, how, when and where? Clin Kidney J 2020;14:49-58. doi:10.1093/ckj/sfaa188
34 Steyerberg EW, Borsboom GJJM, van Houwelingen HC, Eijkemans MJ, Habbema JD. Validation and updating of predictive logistic regression models: a study on sample size and shrinkage. Stat Med 2004;23:2567-86. doi:10.1002/sim.1844
35 Collins GS, de Groot JA, Dutton S, et al. External validation of multivariable prediction models: a systematic review of methodological conduct and reporting. BMC Med Res Methodol 2014;14:40. doi:10.1186/1471-2288-14-40
36 Riley RD, Archer L, Snell KIE, et al. Evaluation of clinical prediction models (part 2): how to undertake an external validation study. BMJ 2024;384:e074820. doi:10.1136/bmj-2023-074820
37 Riley RD, Snell KIE, Archer L, et al. Evaluation of clinical prediction models (part 3): calculating the sample size required for an external validation study. BMJ 2024;384:e074821. doi:10.1136/bmj-2023-074821
38 Altman DG, Royston P. The cost of dichotomising continuous variables. BMJ 2006;332:1080. doi:10.1136/bmj.332.7549.1080
39 Fedorov V, Mannino F, Zhang R. Consequences of dichotomization. Pharm Stat 2009;8:50-61. doi:10.1002/pst.331
40 Purgato M, Barbui C. Dichotomizing rating scale scores in psychiatry: a bad idea? Epidemiol Psychiatr Sci 2013;22:17-9. doi:10.1017/S2045796012000613
41 Rhon DI, Teyhen DS, Collins GS, Bullock GS. Predictive models for musculoskeletal injury risk: why statistical approach makes all the difference. BMJ Open Sport Exerc Med 2022;8:e001388. doi:10.1136/bmjsem-2022-001388
42 Collins GS, Ogundimu EO, Cook JA, Manach YL, Altman DG. Quantifying the impact of different approaches for handling continuous predictors on the performance of a prognostic model. Stat Med 2016;35:4124-35. doi:10.1002/sim.6986
43 Leisman DE, Harhay MO, Lederer DJ, et al. Development and reporting of prediction models: guidance for authors from editors of respiratory, sleep, and critical care journals. Crit Care Med 2020;48:623-33. doi:10.1097/CCM.0000000000004246
44 Hernández-Favela CG, Hernández-Ruiz VA, Bello-Chavolla OY, et al. Higher Veterans Aging Cohort Study 2.0 Index score predicts functional decline among older adults living with HIV. AIDS Res Hum Retroviruses 2021;37:878-83. doi:10.1089/aid.2020.0295
45 Reeve K, On BI, Havla J, et al. Prognostic models for predicting clinical disease progression, worsening and activity in people with multiple sclerosis. Cochrane Database Syst Rev 2023;9:CD013606.
46 Taipale H, Schneider-Thoma J, Pinzón-Espinosa J, et al. Representation and outcomes of individuals with schizophrenia seen in everyday practice who are ineligible for randomized clinical trials. JAMA Psychiatry 2022;79:210-8. doi:10.1001/jamapsychiatry.2021.3990
47 Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. Springer International Publishing, 2019. doi:10.1007/978-3-030-16399-0
48 Bennett DA, Landry D, Little J, Minelli C. Systematic review of statistical approaches to quantify, or correct for, measurement error in a continuous exposure in nutritional epidemiology. BMC Med Res Methodol 2017;17:146. doi:10.1186/s12874-017-0421-6
49 van Smeden M, Lash TL, Groenwold RHH. Reflection on modern methods: five myths about measurement error in epidemiological research. Int J Epidemiol 2020;49:338-47. doi:10.1093/ije/dyz251
50 Riley RD, Ensor J, Snell KIE, et al. Calculating the sample size required for developing a clinical prediction model. BMJ 2020;368:m441. doi:10.1136/bmj.m441
51 Ensor J, Martin EC, Riley RD. pmsampsize: calculates the minimum sample size required for developing a multivariable prediction model. https://cran.r-project.org/package=pmsampsize 2021 (accessed 1 Feb 2022).
52 Pate A, Riley RD, Collins GS, et al. Minimum sample size for developing a multivariable prediction model using multinomial logistic regression. Stat Methods Med Res 2023;32:555-71. doi:10.1177/09622802231151220
53 van der Ploeg T, Austin PC, Steyerberg EW. Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints. BMC Med Res Methodol 2014;14:137. doi:10.1186/1471-2288-14-137
54 Infante G, Miceli R, Ambrogi F. Sample size and predictive performance of machine learning methods with survival data: a simulation study. Stat Med 2023;42:5657-75. doi:10.1002/sim.9931
55 Harrell F. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. 2nd ed. Springer International Publishing, 2015. doi:10.1007/978-3-319-19425-7
56 Sterne JAC, White IR, Carlin JB, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 2009;338:b2393. doi:10.1136/bmj.b2393
57 Kontopantelis E, White IR, Sperrin M, Buchan I. Outcome-sensitive multiple imputation: a simulation study. BMC Med Res Methodol 2017;17:2. doi:10.1186/s12874-016-0281-5
58 Sullivan TR, Salter AB, Ryan P, Lee KJ. Bias and precision of the “multiple imputation, then deletion” method for dealing with missing outcome data. Am J Epidemiol 2015;182:528-34. doi:10.1093/aje/kwv100
59 Sisk R, Sperrin M, Peek N, van Smeden M, Martin GP. Imputation and missing indicators for handling missing data in the development and deployment of clinical prediction models: a simulation study. Stat Methods Med Res 2023;32:1461-77. doi:10.1177/09622802231165001
60 Hippisley-Cox J, Coupland C, Brindle P. Development and validation of QRISK3 risk prediction algorithms to estimate future risk of cardiovascular disease: prospective cohort study. BMJ 2017;357:j2099. doi:10.1136/bmj.j2099
61 Muñoz J, Hufstedler H, Gustafson P, Bärnighausen T, De Jong VMT, Debray TPA. Dealing with missing data using the Heckman selection model: methods primer for epidemiologists. Int J Epidemiol 2023;52:5-13. doi:10.1093/ije/dyac237
62 Deo RC. Machine learning in medicine. Circulation 2015;132:1920-30. doi:10.1161/CIRCULATIONAHA.115.001593
63 Lo Vercio L, Amador K, Bannister JJ, et al. Supervised machine learning tools: a tutorial for clinicians. J Neural Eng 2020;17:062001. doi:10.1088/1741-2552/abbff2
64 Andaur Navarro CL, Damen JAA, Takada T, et al. Risk of bias in studies on prediction models developed using supervised machine learning techniques: systematic review. BMJ 2021;375:n2281. doi:10.1136/bmj.n2281
65 Putter H, Fiocco M, Geskus RB. Tutorial in biostatistics: competing risks and multi-state models. Stat Med 2007;26:2389-430. doi:10.1002/sim.2712
66 Sauerbrei W, Perperoglou A, Schmid M, et al. for TG2 of the STRATOS initiative. State of the art in selection of variables and functional forms in multivariable analysis-outstanding issues. Diagn Progn Res 2020;4:3. doi:10.1186/s41512-020-00074-3
67 Smith G. Step away from stepwise. J Big Data 2018;5:32. doi:10.1186/s40537-018-0143-6
68 Steyerberg EW, Eijkemans MJC, Habbema JDF. Stepwise selection in small data sets: a simulation study of bias in logistic regression analysis. J Clin Epidemiol 1999;52:935-42. doi:10.1016/S0895-4356(99)00103-1
69 Steyerberg EW, Eijkemans MJ, Harrell FE Jr, Habbema JD. Prognostic modeling with logistic regression analysis: in search of a sensible strategy in small data sets. Med Decis Making 2001;21:45-56. doi:10.1177/0272989X0102100106
70 Van Houwelingen JC. Shrinkage and penalized likelihood as methods to improve predictive accuracy. Stat Neerl 2001;55:17-34. doi:10.1111/1467-9574.00154
71 Pavlou M, Ambler G, Seaman SR, et al. How to develop a more accurate risk prediction model when there are few events. BMJ 2015;351:h3868. doi:10.1136/bmj.h3868
72 Lipkovich I, Dmitrienko A, D'Agostino RB. Tutorial in biostatistics: data-driven subgroup identification and analysis in clinical trials. Stat Med 2017;36:136-96. doi:10.1002/sim.7064
73 Nakkiran P, Kaplun G, Bansal Y, et al. Deep double descent: where bigger models and more data hurt. J Stat Mech 2021;2021:124003. doi:10.1088/1742-5468/ac3a74
74 Van Calster B, van Smeden M, De Cock B, Steyerberg EW. Regression shrinkage methods for clinical prediction models do not guarantee improved performance: simulation study. Stat Methods Med Res 2020;29:3166-78. doi:10.1177/0962280220921415
75 Riley RD, Snell KIE, Martin GP, et al. Penalization and shrinkage methods produced unreliable clinical prediction models especially when sample size was small. J Clin Epidemiol 2021;132:88-96. doi:10.1016/j.jclinepi.2020.12.005
76 Rubin DB. Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, 1987. doi:10.1002/9780470316696
77 Chen Q, Wang S. Variable selection for multiply-imputed data with application to dioxin exposure study. Stat Med 2013;32:3646-59. doi:10.1002/sim.5783
78 Marino M, Buxton OM, Li Y. Covariate selection for multilevel models with missing data. Stat (Int Stat Inst) 2017;6:31-46. doi:10.1002/sta4.133
79 Wood AM, Royston P, White IR. The estimation and use of predictions for the assessment of model performance using large samples with multiply imputed data. Biom J 2015;57:614-32. doi:10.1002/bimj.201400004
80 Steyerberg EW, Vickers AJ, Cook NR, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology 2010;21:128-38. doi:10.1097/EDE.0b013e3181c30fb2
81 Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW, Topic Group ‘Evaluating diagnostic tests and prediction models’ of the STRATOS initiative. Calibration: the Achilles heel of predictive analytics. BMC Med 2019;17:230. doi:10.1186/s12916-019-1466-7
82 Uno H, Cai T, Tian L, et al. Evaluating prediction rules for t-year survivors with censored regression models. J Am Stat Assoc 2012. doi:10.1198/016214507000000149
83 Harrell FE Jr, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. JAMA 1982;247:2543-6. doi:10.1001/jama.1982.03320430047030
84 Uno H, Cai T, Pencina MJ, D’Agostino RB, Wei LJ. On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Stat Med 2011;30:1105-17. doi:10.1002/sim.4154
85 Crowson CS, Atkinson EJ, Therneau TM. Assessing calibration of prognostic risk scores. Stat Methods Med Res 2016;25:1692-706. doi:10.1177/0962280213497434
86 Austin PC, Harrell FE Jr, van Klaveren D. Graphical calibration curves and the integrated calibration index (ICI) for survival models. Stat Med 2020;39:2714-42. doi:10.1002/sim.8570
87 Steyerberg EW. Validation in prediction research: the waste by data splitting. J Clin Epidemiol 2018;103:131-3. doi:10.1016/j.jclinepi.2018.07.010
88 Collins GS, Dhiman P, Ma J, et al. Evaluation of clinical prediction models (part 1): from development to external validation. BMJ 2024;384:e074819. doi:10.1136/bmj-2023-074819
89 Debray TPA, Vergouwe Y, Koffijberg H, Nieboer D, Steyerberg EW, Moons KG. A new framework to enhance the interpretation of external validation studies of clinical prediction models. J Clin Epidemiol 2015;68:279-89. doi:10.1016/j.jclinepi.2014.06.018
90 Efron B, Tibshirani R. An Introduction to the Bootstrap. Chapman and Hall/CRC, 1994. doi:10.1201/9780429246593
91 Efron B, Tibshirani R. Improvements on cross-validation: the 632+ Bootstrap Method. J Am Stat Assoc 1997;92:548-60.
92 Wahl S, Boulesteix A-L, Zierer A, Thorand B, van de Wiel MA. Assessment of predictive performance in incomplete data by combining internal validation and multiple imputation. BMC Med Res Methodol 2016;16:144. doi:10.1186/s12874-016-0239-7
93 Steyerberg EW, Harrell FE Jr, Borsboom GJ, Eijkemans MJ, Vergouwe Y, Habbema JD. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol 2001;54:774-81. doi:10.1016/S0895-4356(01)00341-9
94 Riley RD, Collins GS. Stability of clinical prediction models developed using statistical or machine learning methods. Biom J 2023;65:2200302. doi:10.1002/bimj.202200302
95 Steyerberg EW, Harrell FE Jr. Prediction models need appropriate internal, internal-external, and external validation. J Clin Epidemiol 2016;69:245-7. doi:10.1016/j.jclinepi.2015.04.005
96 Takada T, Nijman S, Denaxas S, et al. Internal-external cross-validation helped to evaluate the generalizability of prediction models in large clustered datasets. J Clin Epidemiol 2021;137:83-91. doi:10.1016/j.jclinepi.2021.03.025
97 May M, Boulle A, Phiri S, et al. IeDEA Southern Africa and West Africa. Prognosis of patients with HIV-1 infection starting antiretroviral therapy in sub-Saharan Africa: a collaborative analysis of scale-up programmes. Lancet 2010;376:449-57. doi:10.1016/S0140-6736(10)60666-6
98 Jenkins DA, Martin GP, Sperrin M, et al. Continual updating and monitoring of clinical prediction models: time for dynamic prediction systems? Diagn Progn Res 2021;5:1. doi:10.1186/s41512-020-00090-3
99 Nashef SA, Roques F, Michel P, Gauducheau E, Lemeshow S, Salamon R. European system for cardiac operative risk evaluation (EuroSCORE). Eur J Cardiothorac Surg 1999;16:9-13. doi:10.1016/S1010-7940(99)00134-7
100 Van Calster B, Steyerberg EW, Wynants L, van Smeden M. There is no such thing as a validated prediction model. BMC Med 2023;21:70. doi:10.1186/s12916-023-02779-w
101 Siontis GCM, Tzoulaki I, Castaldi PJ, Ioannidis JP. External validation of new risk prediction models is infrequent and reveals worse prognostic discrimination. J Clin Epidemiol 2015;68:25-34. doi:10.1016/j.jclinepi.2014.09.007
102 Domingos P. The role of Occam’s razor in knowledge discovery. Data Min Knowl Discov 1999;3:409-25. doi:10.1023/A:1009868929893
103 Lynam AL, Dennis JM, Owen KR, et al. Logistic regression has similar performance to optimised machine learning algorithms in a clinical setting: application to the discrimination between type 1 and type 2 diabetes in young adults. Diagn Progn Res 2020;4:6. doi:10.1186/s41512-020-00075-2
104 Vickers AJ, Van Calster B, Steyerberg EW. Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ 2016;352:i6. doi:10.1136/bmj.i6
105 Cardiovascular disease: risk assessment and reduction, including lipid modification. London: National Institute for Health and Care Excellence (NICE). https://www.ncbi.nlm.nih.gov/books/NBK554923/ 2023 (accessed 1 September 2023).
106 Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making 2006;26:565-74. doi:10.1177/0272989X06295361
107 Vickers AJ, van Calster B, Steyerberg EW. A simple, step-by-step guide to interpreting decision curve analysis. Diagn Progn Res 2019;3:18. doi:10.1186/s41512-019-0064-7
108 Vickers AJ, Cronin AM, Elkin EB, Gonen M. Extensions to decision curve analysis, a novel method for evaluating diagnostic tests, prediction models and molecular markers. BMC Med Inform Decis Mak 2008;8:53. doi:10.1186/1472-6947-8-53
109 Breiman L. Random Forests. Mach Learn 2001;45:5-32. doi:10.1023/A:1010933404324
110 Lundberg S, Lee S-I. A unified approach to interpreting model predictions. 2017. doi:10.48550/arXiv.1705.07874
111 Westreich D, Greenland S. The table 2 fallacy: presenting and interpreting confounder and modifier coefficients. Am J Epidemiol 2013;177:292-8. doi:10.1093/aje/kws412
112 Pearl J. An introduction to causal inference. Int J Biostat 2010;6:7. doi:10.2202/1557-4679.1203
113 Bonnett LJ, Snell KIE, Collins GS, Riley RD. Guide to presenting clinical prediction models for use in clinical settings. BMJ 2019;365:l737. doi:10.1136/bmj.l737
114 Chang W, Cheng J, Allaire JJ, et al. shiny: Web Application Framework for R. R package version 1.8.0. https://cran.r-project.org/package=shiny 2021 (accessed 9 December 2021).
115 McGinley MP, Goldschmidt CH, Rae-Grant AD. Diagnosis and treatment of multiple sclerosis: a review. JAMA 2021;325:765-79. doi:10.1001/jama.2020.26858
116 Ghasemi N, Razavi S, Nikzad E. Multiple sclerosis: pathogenesis, symptoms, diagnoses and cell-based therapy. Cell J 2017;19:1-10.
117 Goldenberg MM. Multiple sclerosis review. P T 2012;37:175-84.
118 Crayton HJ, Rossman HS. Managing the symptoms of multiple sclerosis: a multimodal approach. Clin Ther 2006;28:445-60. doi:10.1016/j.clinthera.2006.04.005
119 Chalkou K, Steyerberg E, Bossuyt P, et al. Development, validation and clinical usefulness of a prognostic model for relapse in relapsing-remitting multiple sclerosis. Diagn Progn Res 2021;5:17. doi:10.1186/s41512-021-00106-6
120 Disanto G, Benkert P, Lorscheider J, et al. SMSC Scientific Board. The Swiss Multiple Sclerosis Cohort-Study (SMSC): a prospective Swiss wide investigation of key phases in disease evolution and new treatment options. PLoS One 2016;11:e0152347. doi:10.1371/journal.pone.0152347
121 Park Y, Ho JC. CaliForest: Calibrated Random Forest for Health Data. Proc ACM Conf Health Inference Learn 2020:40-50.
122 Platt J. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Smola A, Bartlett P, Schölkopf B, Schuurmans D, eds. Advances in large margin classifiers. MIT Press, 2000.
123 Moher D. Reporting guidelines: doing better for readers. BMC Med 2018;16:233. doi:10.1186/s12916-018-1226-0
Web appendix 1: Appendix