Missing Data
Review
Missing Data in Clinical Research: A Tutorial on Multiple Imputation
Peter C. Austin, PhD (a,b,c), Ian R. White, PhD (d), Douglas S. Lee, MD PhD (a,b,e,f), and Stef van Buuren, PhD (g,h)
(a) Institute for Clinical Evaluative Sciences, Toronto, Ontario, Canada
(b) Institute of Health Policy, Management and Evaluation, University of Toronto, Ontario, Canada
(c) Sunnybrook Research Institute, Toronto, Ontario, Canada
(d) Medical Research Council Clinical Trials Unit, University College London, London, United Kingdom
(e) Department of Medicine, University of Toronto, Toronto, Ontario, Canada
(f) Peter Munk Cardiac Centre and University Health Network, Toronto, Ontario, Canada
(g) University of Utrecht, Utrecht, The Netherlands
(h) Netherlands Organisation for Applied Scientific Research, Leiden, The Netherlands
ABSTRACT
Missing data is a common occurrence in clinical research. Missing data occurs when the values of the variables of interest are not measured or recorded for all subjects in the sample. Common approaches to addressing the presence of missing data include complete-case analyses, where subjects with missing data are excluded, and mean-value imputation, where missing values are replaced with the mean value of that variable in those subjects for whom it is not missing. However, in many settings, these approaches can lead to biased estimates of statistics (eg, of regression coefficients) and/or confidence intervals that are artificially narrow. Multiple imputation (MI) is a popular approach for addressing the presence of missing data. With MI, multiple plausible values of a given variable are imputed or filled in for each subject who has missing data for that variable. This results in the creation of multiple completed data sets. Identical statistical analyses are conducted in each of these complete data sets and the results are pooled across complete data sets. We provide an introduction to MI and discuss issues in its implementation, including developing the imputation model, how many imputed data sets to create, and addressing derived variables. We illustrate the application of MI through an analysis of data on patients hospitalised with heart failure. We focus on developing a model to estimate the probability of 1-year mortality in the presence of missing data. Statistical software code for conducting MI in R, SAS, and Stata is provided.

Received for publication October 22, 2020. Accepted November 24, 2020.
Corresponding author: Dr Peter C. Austin, Institute for Clinical Evaluative Sciences, G106, 2075 Bayview Ave, Toronto, Ontario M4N 3M5, Canada. Tel.: +1-416-480-6131; fax: +1-416-480-6048.
E-mail: [email protected]
See page 1330 for disclosure information.
https://doi.org/10.1016/j.cjca.2020.11.010
0828-282X/© 2020 The Authors. Published by Elsevier Inc. on behalf of the Canadian Cardiovascular Society. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Austin et al. 1323
Multiple Imputation in Clinical Research

Missing data is a common occurrence in clinical research. Missing data occurs when the values of the variables of interest are not measured or recorded for all subjects in the sample. Data can be missing for several reasons, including: (i) patient refusal to respond to specific questions (eg, patient does not report data on income); (ii) loss of patient to follow-up; (iii) investigator or mechanical error (eg, sphygmomanometer failure); and (iv) physicians not ordering certain investigations for some patients (eg, cholesterol test not ordered for some patients).

Before discussing different ways of addressing the presence of missing data, it is important to understand the conditions under which data are subject to being missing. Rubin developed a framework for addressing missing data and described 3 different missing-data mechanisms.1,2 Data are said to be “missing completely at random” (MCAR) if the probability of a variable being missing for a given subject is independent from both observed and unobserved variables for that subject.3 If data are MCAR, then the subsample consisting of subjects with complete (or nonmissing) data is a representative subsample of the overall sample. An example of MCAR is a laboratory value that is missing because the sample was lost or damaged in the laboratory. The occurrence of such events in the laboratory is unlikely to be related to characteristics of the subject. Data are said to be “missing at random” (MAR) if, after accounting for all the observed variables, the probability of a variable being missing is independent from the unobserved data. If physicians were less likely to order laboratory tests for older patients and that was the only factor influencing whether or not a test was ordered and recorded, then missing laboratory data would be MAR (assuming that age was recorded for all patients). Finally, data are said to be “missing
not at random” (MNAR) if they are neither MAR nor MCAR. Thus, data are MNAR if the probability of a variable being missing, even after accounting for all the observed variables, is dependent on the value of the missing variable. An example of data that are MNAR is income, in which more affluent subjects, even after accounting for other characteristics, are less likely to report their income in surveys than are less affluent subjects. Unfortunately, one cannot test whether the data are MAR vs MNAR, so one must judge what is plausible using clinical knowledge.4,5

Historically, a popular approach when faced with missing data was to exclude all subjects with missing data on any necessary variables and to conduct subsequent statistical analyses using only those subjects who have complete data (accordingly, this approach is often referred to as “complete case” analysis). When only the outcome variable is incomplete, this approach is valid under MAR and often appropriate.6 With incomplete covariates, there are disadvantages to this approach.2,4,7 First, unless data are MCAR, the estimated statistics and regression coefficients may be biased.4 Second, even if data are MCAR, with the reduction in sample size there is a corresponding reduction in precision with which statistics and regression coefficients are estimated. Accordingly, estimated confidence intervals will be wider when using complete case analysis than if all the data were used. Moreover, different analyses may use different subsets of the overall sample, so that it is difficult to compare results even within the same paper.

An approach to circumvent the limitations of a complete case analysis is to replace the missing values of variables with plausible values. Such an approach is called “imputation,” because one is imputing a value of the variable for those subjects with missing data on that variable. Historically, a common approach to imputation was “mean-value imputation,” in which subjects for whom a given variable is missing have the missing value replaced with the mean value of that variable among all subjects for whom the variable is present. Thus, subjects who are missing blood pressure have the missing value replaced with the average value of blood pressure among those subjects for whom blood pressure was measured and recorded. A limitation of mean-value imputation is that it artificially reduces the variation in the data set. For example, mean imputation will artificially lower the estimated standard deviation of the variable that includes imputed values.2 Furthermore, mean-value imputation ignores multivariate relations between different variables in the sample. For instance, older subjects may have, on average, higher blood pressure than younger subjects. This correlation between age and blood pressure is not taken into account by mean-value imputation.

An alternative to mean-value imputation is “conditional-mean imputation,” in which a regression model is used to impute a single value for each missing value.2 From the fitted regression model, the mean or expected value, conditional on the observed covariates, is imputed for those subjects with missing data. Thus, assuming that the imputation model regressed blood pressure on age and sex, the same value of blood pressure would be imputed for all subjects of the same age and sex. A modification of conditional-mean imputation draws the imputed value from a conditional distribution whose parameters are determined from the fitted regression model. However, both of these approaches artificially amplify the multivariate relationships in the data. Another limitation is that the imputed values are treated as known with certainty and treated on an equal footing with the values for the same variable for other subjects for whom the variable was observed and recorded and not imputed. Mean-value imputation and conditional-mean imputation are recommended for handling missing values of baseline covariates in randomised trials only.6,8,9

A popular approach for addressing the issue of missing data is multiple imputation (MI).1,10 MI imputes multiple values for each missing value. This results in the creation of multiple complete data sets in which the missing values have been filled
1324 Canadian Journal of Cardiology
Volume 37 2021
in with plausible values. The analysis of scientific interest is then conducted separately in each of these complete data sets and the results are pooled across the imputed data sets. In this way, MI allows the user to explicitly incorporate the uncertainty about the true value of imputed variables.

The present paper provides an introduction to MI and illustrates its application with the use of a cardiovascular example. The paper is structured as follows. In the next section we introduce MI and discuss several issues related to its implementation. Then (Case Study) we illustrate its application with an example of logistic regression to model mortality in patients with heart failure. Finally (Discussion), we summarise our brief tutorial and direct the interested reader to more detailed and comprehensive discussions of MI.

Multiple Imputation for Missing Data
In this section we provide an introduction to MI and discuss issues related to its use.

Multiple imputation using multivariate imputation by chained equations
Fully conditional specification is a strategy for specifying multivariate models through conditional distributions. A specific implementation of this strategy in which every variable is imputed conditional on all other variables is now known as the multivariate imputation by chained equations (MICE)10-13 algorithm. In our description of the algorithm we assume that there are p variables, of which k are subject to missing data and p − k are complete. The algorithm is summarised in Table 1. The process described in steps 3 and 4 is repeated for several cycles to create 1 imputed data set. Standard software uses 5 to 20 cycles by default, and it is rarely necessary to increase these values.10,11 The imputed values obtained after the last cycle are used as the imputed values for the first imputed data set. The entire process is then repeated M times to produce M imputed data sets.

Table 1. Multivariate imputation by chained equations (MICE) algorithm for multiple imputation
1. Specify an imputation model for each of the k variables that are subject to missing data.
2. For each of the k variables that are subject to missing data, fill in the missing values with random draws from those subjects with observed values for the variable in question. Note that these initial imputed values do not respect the multivariate relations in the data and will be overwritten by better imputed values in later stages of the algorithm.
3. For the first variable that is subject to missing data:
   a. Regress this first variable on all the other variables using those subjects with complete data on the first variable and observed or currently imputed values of the other variables.
   b. The estimated regression coefficients and their variance-covariance matrix (and the estimated variance of the residual distribution if a linear regression model was fit for a continuous variable) are extracted from the regression model estimated in (a).
   c. Using the quantities obtained in (b), randomly perturb the estimated regression coefficients in a way that reflects the degree of uncertainty arising from the data.
   d. Using the set of perturbed regression coefficients obtained in (c), the conditional distribution of the first variable is determined for each subject with missing data on that variable.
   e. A value of the variable is drawn from this conditional distribution for each subject with missing data on the first variable.
4. Repeat step 3 for each of the variables that is subject to missing data. Steps 3 and 4 form 1 cycle of the imputation process for creating 1 imputed data set.
5. Repeat steps 3 and 4 the desired number of times (suggested 5 to 20 cycles). The final imputed values are used as the imputed values in the first imputed data set.
6. Repeat steps 2-5 M times to produce M imputed data sets (the choice of M, the number of imputed data sets, is discussed in the section How Many Imputations: How Large Should M Be?).

Multiple imputation for continuous variables with the use of predictive-mean matching
The imputation process described above uses linear regression and takes the imputed values as random draws from a normal distribution. This has problems if the residuals from the regressions are not normally distributed (eg, if data are skewed) or if relationships are nonlinear (eg, height and age). For example, a variable that can have only positive values (eg, counts) may have imputed values that are negative. One option to address such problems is to transform the variable before imputation so that the transformed variable is approximately normally distributed. For example, the logarithmic transformation, when applied to a positively skewed distribution, can result in a distribution that is more normally distributed. As a last step, one may wish to back-transform imputations into the original scale. A second option is to draw imputations from the observed values by a technique called predictive-mean matching (PMM).11 For a given subject with missing data on the variable in question, PMM identifies those subjects with no missing data on the variable in question whose linear predictors (created using the regression coefficients from the fitted imputation model) are close to the linear predictor of the given subject (created using the regression coefficients sampled from the appropriate posterior distribution, as described above). Of those subjects who are close, one subject is selected at random and the observed value of the given variable for that randomly selected subject is used as the imputed value of the variable for the subject with missing data. Morris et al. suggest that identifying the 10 closest subjects without missing data performs well.14 Using the terminology of Morris et al., we refer to the regression-based method described in the preceding section as parametric imputation, because the imputed variables are drawn from a parametric distribution.14 This is in contrast to PMM, where the imputed variables are drawn from an observed empirical distribution.

Analyses in the M imputed data sets
Once M complete data sets have been constructed using MI, the statistical analysis of scientific interest is conducted in each of the M complete data sets. That analysis would be the exact analysis that would be conducted in the absence of missing data. Thus, if the analysis model is a logistic regression model in which a binary outcome variable is regressed on a set of predictor variables, that model is fitted in each of the M imputed data sets. The statistics of interest (eg, estimated regression coefficients and their standard errors) are extracted from the analysis conducted in each of the M imputed data sets.

Rubin’s rules for combining estimates and standard errors across imputed data sets
Once the statistics of interest have been estimated in the M imputed data sets, they are combined using Rubin’s rules.1 Let
$q^{(i)}$ denote the estimated statistic of interest (eg, a regression coefficient) obtained from the analysis in the ith imputed data set ($i = 1, \ldots, M$). The pooled estimate of the statistic of interest is

$$\bar{q} = \frac{1}{M} \sum_{i=1}^{M} q^{(i)}.$$

The MI estimate of the statistic is simply the average value of the estimated statistic across the M imputed data sets.

Computing the variance of the estimated statistic is more complex, as it requires accounting for the within-imputation uncertainty in the estimated statistic and the between-imputation variation in the estimated statistic. Let $W^{(i)}$ denote the estimated variance (eg, the square of the estimated standard error) of $q^{(i)}$. The average within-imputation variance is defined as

$$\bar{W} = \frac{1}{M} \sum_{i=1}^{M} W^{(i)}.$$

This is simply the mean estimated variance of the estimated statistic across the M imputed data sets. The between-imputation variance of the estimated statistic is

$$B = \frac{1}{M-1} \sum_{i=1}^{M} \left(q^{(i)} - \bar{q}\right)^{2}.$$

This quantity reflects the degree to which the estimated statistic varies across the M imputed data sets. The MI estimate of the variance of $\bar{q}$ obtained with the use of Rubin’s rules is

$$\mathrm{var}(\bar{q}) = \bar{W} + \left(1 + \frac{1}{M}\right) B.$$

This quantity reflects both the average within-imputation variation and the average between-imputation variation in $\bar{q}$. Note that when using single imputation, there is no estimate of $B$, so we are unable to estimate the true variation in the statistic.

How many imputations: How large should M be?
An important question is how many imputed data sets should be created. Early recommendations were that 3 to 5 imputed data sets were sufficient as long as the amount of missing information was not very high,1,3 while others suggested that often 5 to 10 imputations were required to be sufficient.7 These early recommendations were based on the accuracy with which the regression coefficient was estimated compared with its accuracy had it been estimated with an infinite number of imputed data sets. However, analysts are interested not only in estimated regression coefficients (eg, log odds ratios or log hazard ratios), but also in their associated standard errors (which are used in deriving confidence intervals and significance tests). Thus, one wants to estimate not only regression coefficients accurately, but also standard errors.

Ideally, one would select M such that the pooled estimated regression coefficients and standard errors would not vary meaningfully across repeated applications of MI (ie, if the entire process was repeated with M new imputed data sets, one would obtain estimates similar to those obtained using the initial M imputed data sets). The term Monte Carlo error in a given statistic (eg, a regression coefficient or a standard error) refers to the standard deviation of that statistic across repeated applications of MI. When focusing on a single statistic, the Monte Carlo error can be computed as $\sqrt{B/M}$.11 White et al. suggested that, as a rule of thumb, the number of imputed data sets should be at least as large as the percentage of subjects with any missing data.11 They suggest that this will result in estimates of regression coefficients, test statistics (regression coefficients divided by the standard error), and P values with minor variability across repeated MI analyses (ie, the Monte Carlo error will be low). A more advanced method for determining the number of imputations was developed by von Hippel.15 Nowadays, computation is cheap and the use of 20 to 100 imputed data sets is common.

Which variables to include in the imputation model?
Investigators need to distinguish between 2 different statistical models: the imputation model and the analysis model. The imputation model is used for imputing missing data. It is not of direct interest and is only used to provide reasonable imputations. The analysis model holds the quantities that are ultimately of scientific interest and is the focus of the research question. The rules for building imputation and analysis models are very different. It is important to include in the imputation model all the variables that will be included in the analysis model. Failure to include these variables in the imputation model usually results in estimates in the analysis model being biased. The variables must also be included in the imputation model in the right way: for example, Schafer noted that if interactions are omitted from the imputation model, then the estimated interactions in the analysis model would be biased toward the null.7

It is especially important to include in the imputation model the outcome variable for the analysis model.5,11 Failure to do so usually results in estimated regression coefficients for the analysis model also being biased toward the null. When the outcome in the analysis model is a survival or time-to-event outcome (eg, the outcome model is a Cox proportional hazards model) then there are 2 components to the outcome: a time-to-event variable denoting the time to the occurrence of the event or the time to censoring, and a binary indicator variable denoting whether the subject experienced the event or was censored. The recommended approach is to include both in the imputation model, with the time-to-event variable transformed using the cumulative hazard function.16 In addition, the imputation model is improved by including variables that are related to the missingness and variables that are correlated with variables of interest. In longitudinal data, when imputing a variable for a specific measurement occasion (eg, on the second clinic visit), one also needs to include in the imputation model future values of that variable (eg, the value of that variable at the third clinic visit).

Imputing derived variables
The analysis model may include variables that are derived from other variables. Examples include body mass index (BMI, which is derived from height and weight), quadratic terms for continuous variables (eg, age²), and interactions between variables (ie, products of variables). When the component variables required to create the derived variable are missing (and therefore the derived variable is also missing), there are 2 main options for imputing the derived variables. The first option imputes the missing component variables and creates the derived variable after all variables have been imputed. Thus, for example, if height were missing, height would first be imputed and then combined with weight to create BMI. Von Hippel refers to this approach as “impute, then transform.” This approach is appealing, as it leads to derived variables that are consistent with the derivation rule. The obvious problem with the approach is that the derived variable is not part of the imputation model, so it may lead to bias, as explained in the preceding section (Which Variables to
Include in the Imputation Model?). The second option is to treat the derived variable as simply another variable and to impute this variable directly. Thus, if height were missing (and thus BMI were also missing), height and BMI would be imputed for those subjects for whom they were missing. This approach is known as “transform, then impute”17 or “just another variable.”11 Note that the “just another variable” approach incorporates the components as well as the derived variable in the imputation model. This approach is appealing as it incorporates all necessary variables into the imputation model. However, it can lead to quadratic variables with negative values or BMI values that are inconsistent with the height and weight of the subject. It has been shown that in some settings the approach leads to accurate estimates of regression coefficients in the analysis model, though it can fail in others.18,19 Van Buuren describes some alternate strategies for specific types of dependencies.10 Because no strategy performs uniformly better, we may need some tailoring to the type of derived variable.

Missing outcome variables
Multiple imputation is blind to which variables are outcomes and which variables are predictors in the final analysis model. When developing the imputation models, the important issue is to include in the imputation models all of the variables from the analysis model. This suggests that one can impute values of the outcome variable (for the analysis model) for those subjects for whom it is missing. However, von Hippel provided evidence that excluding subjects who are missing the outcome variable (for the analysis model) when fitting the outcome model will tend to be a better strategy.20 He proposed a strategy that he referred to as “multiple imputation, then deletion” (MID). Under MID, all subjects are used in the imputation process. Values are imputed for all missing data, including for those subjects who are missing the outcome variable. However, subjects for whom the outcome variable was imputed are then excluded when the analysis model is fitted in each imputed data set. The MID approach will tend to result in estimated regression coefficients for the analysis model that are more efficient (have smaller variability) than those obtained when fitting the analysis model in all subjects. In addition, the method is robust against bad imputation in the outcome. The MID procedure should not be used if there are auxiliary variables that are strongly related to the outcome (and not included in the analysis model) or if the scientific interest extends to parameters other than regression coefficients.11

Case Study
We use data on patients hospitalised with heart failure in the province of Ontario to provide a case study illustrating the application of MI. The analysis model of interest is a logistic regression model in which death within 1 year of hospital admission is regressed on 10 patient characteristics.

Data sources
We used data from the EFFECT (Enhanced Feedback for Effective Cardiac Treatment) study, which was an initiative to improve the quality of care for patients with cardiovascular disease in Ontario.21 We used data on 8,338 patients hospitalised with congestive heart failure from April 1, 2004, to March 31, 2005, at 81 Ontario hospital corporations. Data on patient demographics, vital signs and physical examination at presentation, medical history, and results of laboratory tests were collected on these patients by retrospective chart review. Subjects were linked to administrative health care data to determine vital status.

For the purposes of this case study, we considered 10 baseline covariates: age, respiratory rate at admission, glucose level, urea level, low-density lipoprotein (LDL) cholesterol level, sex, S3 (third heart sound) on admission, S4 (fourth heart sound) on admission, neck vein distension on admission, and cardiomegaly on chest X-ray. The first 5 are continuous and the last 5 binary. The outcome was a binary outcome denoting whether the patient died within 365 days of hospital admission. Logistic regression models for 30-day and 1-year mortality are often used in cardiovascular research.22-24 Our purpose here in using these data is to illustrate the application of statistical methods and not to draw clinical conclusions. Accurate estimation of the association of variables with cardiovascular outcomes in current patients may require the use of more recent data and a more comprehensive set of predictor variables. Furthermore, depending on the objective of the intended study, a different regression model may be more appropriate.

Descriptive statistics
Means and percentages are reported for the continuous and binary variables, respectively, in Table 2. We also report the percentage of subjects with missing data for each of the variables. The percentage of missing data ranged from a low of 0% (age and sex) to a high of 73% (LDL cholesterol). Overall, 78% of subjects had missing data on at least 1 variable.

Comparison of subjects with and without missing data
We conducted univariate comparisons of those with and without missing data. There are at least 2 reasons for these comparisons. First, as noted above, the imputation model is improved by including variables that are related to the missingness. These comparisons help to identify variables that should be included in the imputation model. Second, these analyses provide evidence as to the plausibility of the MAR assumption. If those with and without missing data differ on many observed variables, then it is plausible that they may also differ on unobserved variables. Note that a lack of significant univariate associations does not provide proof that the data are MCAR or MAR.

There were meaningful differences in age, sex, and mortality (the 3 variables that were not subject to missingness) between those with complete data and those with missing data. The average age of those with complete data was 73.7 years, and it was 77.5 years for those with missing data. Of those with complete data, 43.4% were female, while of those with missing data, 53.0% were female. Of those with complete data, 23.7% died within 1 year of admission, while of those with missing data, 33.9% died within 1 year of admission. Patients with missing data tended to be older, were more likely to be female, and more likely to die than those with complete data.
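
The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cohort is synthetic (it is not the EFFECT data), the variable names and the helper functions `has_missing`, `percent_missing`, and `compare_by_missingness` are hypothetical, and missingness in LDL cholesterol is simulated to be more frequent in older subjects so that a difference in age between the two groups is visible.

```python
import random
from statistics import mean


def has_missing(record):
    """Return True if any variable in the record is missing (None)."""
    return any(value is None for value in record.values())


def percent_missing(records, variable):
    """Percentage of subjects whose value of `variable` is missing."""
    return 100.0 * sum(r[variable] is None for r in records) / len(records)


def compare_by_missingness(records, variables):
    """Mean of each fully observed variable in complete vs incomplete cases."""
    complete = [r for r in records if not has_missing(r)]
    incomplete = [r for r in records if has_missing(r)]
    return {
        v: {
            "complete": mean(r[v] for r in complete),
            "incomplete": mean(r[v] for r in incomplete),
        }
        for v in variables
    }


# Synthetic cohort (assumption, not the study data): LDL is missing more
# often in older subjects, so missingness depends on an observed variable.
rng = random.Random(2021)
records = []
for _ in range(2000):
    age = rng.gauss(76, 11)
    female = int(rng.random() < 0.5)
    ldl_missing = rng.random() < (0.8 if age > 76 else 0.4)
    ldl = None if ldl_missing else rng.gauss(2.4, 0.9)
    records.append({"age": age, "female": female, "ldl": ldl})

comparison = compare_by_missingness(records, ["age", "female"])
print(f"LDL missing: {percent_missing(records, 'ldl'):.1f}%")
for name, means in comparison.items():
    print(f"{name}: complete={means['complete']:.2f}, "
          f"incomplete={means['incomplete']:.2f}")
```

In this sketch the subjects with missing data are older on average, which would argue for including age in the imputation model. As in the text, the absence of such differences would not prove that data are MCAR or MAR.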
Female
S3
S4
Cardiomegaly
Figure 1. Estimated log-odds ratios and 95% confidence intervals for variables in the logistic regression model fit in the case study. There are 3
estimates/confidence intervals for each of the 10 variables: analyses using complete cases (grey); multiple imputation analyses when using
parametric imputation (blue); and multiple imputation analyses when using predictive-mean matching (PMM) (red). LDL, low-density lipoprotein; S3,
third heart sound; S4, fourth heart sound.
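The MI estimates and confidence intervals in Figure 1 summarise results across several imputed data sets. As a hedged sketch, assuming the usual pooling rules attributed to Rubin (the pooled estimate is the average of the per-data-set estimates, and the total variance adds an inflated between-imputation variance to the average within-imputation variance), the combination step might look as follows in Python. The numbers and the function name `pool_rubin` are illustrative assumptions, not results from the case study.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, squared_ses):
    """Pool one coefficient across m imputed data sets.

    estimates   : per-data-set point estimates (eg, log-odds ratios)
    squared_ses : per-data-set squared standard errors
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(squared_ses)        # average within-imputation variance
    between = variance(estimates)     # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical log-odds ratios and standard errors from m = 5 imputed data sets.
log_odds = [0.40, 0.44, 0.38, 0.42, 0.41]
std_errs = [0.10, 0.11, 0.10, 0.10, 0.11]

pooled_estimate, pooled_se = pool_rubin(log_odds, [se ** 2 for se in std_errs])
print(pooled_estimate, pooled_se)
```

The between-imputation term is what carries the uncertainty due to the missing values. A single imputed data set has no such term, which is why single imputation tends to understate standard errors.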
PMM imputation are also shown in Figure 1. The estimated odds ratios and associated confidence intervals obtained using PMM imputation were essentially identical to those obtained using parametric imputation. In comparing the results of the 3 regression analyses, one observes that the confidence intervals obtained from the imputation-based analyses were narrower than those obtained in the complete case analysis. For some variables (eg, age, S3, and S4), the confidence intervals obtained using the complete case analysis were substantially wider than those obtained using MI.

Discussion
Missing data occurs frequently in clinical research. MI is a statistical tool that allows the researcher to replace missing values with multiple plausible values of the variable in question. The use of MI allows the researcher to analyse complete data sets while incorporating the uncertainty in the imputed values of the variable. We have provided a brief introduction to MI and guidance regarding its implementation. We illustrated the application of MI through the analysis of data on patients hospitalised with heart failure.
When applying MI, researchers should explore differences between the observed and imputed distributions and between the complete case analyses and the MI analyses. We refer readers to previously published guidelines for reporting analyses affected by missing data.5,25
This introduction to MI was not intended to be exhaustive. We refer the interested reader to several excellent texts on MI1-3,10 as well as to more detailed overview articles.7,11 We have focused our attention on MI in observational studies in which clustering of subjects or a multilevel structure is absent. Other works describe methods for using MI with multilevel data.10,26-29 Similarly, we have focused on the use of parametric models (eg, logistic regression models or linear regression models) for the imputation models. An area of current research is on the use of machine-learning methods for MI.30 We have focused on the use of MI when data are either MCAR or MAR. The described methods must be modified if it is thought that the data are MNAR. Van Buuren summarises different methods to address data that are MNAR.10 The simplest approach is to assume that the distribution of a variable in those with missing data is shifted compared with the distribution in those with complete data. Sensitivity analyses can be conducted in which the magnitude of the shift parameter is allowed to vary.
We have focused on the MICE algorithm for MI, along with a modification, PMM. This is not the only method to
[Figure 2 image: density plots, one panel per continuous variable (extracted panel labels: Urea, LDL); vertical axis: density function.]
Figure 2. Distribution of continuous variables in complete cases and in those with imputed data when using parametric imputation. The solid black
line represents the distribution of the given continuous variable in those subjects for whom that variable was not missing. The dashed red lines
denote the distribution of the imputed value for that variable in those subjects for whom the variable was missing. There is 1 red line for each of the
imputed data sets. LDL, low-density lipoprotein.
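The shift-based sensitivity analysis for MNAR data mentioned in the Discussion can be illustrated with a small Python sketch. It rests on strong simplifying assumptions that are ours, not the article's: a single variable, unconditional normal draws standing in for a real imputation model, and hypothetical names (`impute_with_shift`, `sensitivity_analysis`) and values. Imputed values are drawn from the observed distribution and then moved by a shift `delta`, and the analysis is repeated over a range of shifts.

```python
import random
from statistics import mean, stdev


def impute_with_shift(values, delta, rng):
    """Replace missing entries (None) with normal draws based on the
    observed mean and SD, shifted by delta (delta = 0 means no shift)."""
    observed = [v for v in values if v is not None]
    mu, sigma = mean(observed), stdev(observed)
    return [v if v is not None else rng.gauss(mu, sigma) + delta for v in values]


def sensitivity_analysis(values, deltas, n_imputations=20, seed=2021):
    """For each shift, average the completed-data mean over several imputations."""
    results = {}
    for delta in deltas:
        # Re-seed so every delta uses the same random draws and only the shift differs.
        rng = random.Random(seed)
        completed_means = [
            mean(impute_with_shift(values, delta, rng)) for _ in range(n_imputations)
        ]
        results[delta] = mean(completed_means)
    return results


# Hypothetical urea measurements with 3 of 10 values missing.
urea = [6.2, 7.9, None, 8.4, None, 5.8, 9.1, None, 7.0, 6.6]
results = sensitivity_analysis(urea, deltas=[0.0, 1.0, 2.0])
for delta, estimate in results.items():
    print(delta, round(estimate, 3))
```

If the substantive conclusion is stable across the plausible range of shifts, the analysis is less sensitive to departures from MAR. In a real application the shift would be applied within a full imputation model rather than to unconditional draws.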
impute missing data. An earlier method has been described as “joint modeling,”10 of which MI under a normal model is a specific implementation.4 This approach assumes that the set of variables follows a joint multivariate distribution. The multivariate normal distribution is widely used in applications.10 Under this implementation, the variables are assumed to follow a multivariate normal distribution. Once the parameters of this distribution have been estimated, missing values can be imputed by random draws from this multivariate distribution. In theory, this approach requires that all of the variables be continuous. In practice, binary or categorical variables frequently occur (eg, presence or absence of diabetes). Schafer and Graham suggest that despite this theoretical limitation, they have found the multivariate normal distribution to be useful in a wide range of settings.4 Furthermore, they provide suggestions for incorporating binary and categorical variables as well as nonnormally distributed continuous variables. However, others have suggested that these methods of incorporating noncontinuous variables may not perform as desired.10 Given the flexibility of the MICE algorithm and its ability to explicitly incorporate different types of variables, its use may be attractive to researchers in biomedical research.
In our case study, we obtained similar parameter estimates when using parametric imputation as when using PMM imputation. This is to be expected for estimates that depend on the middle of the distribution, such as means or regression coefficients. In practice, it may be difficult to provide examples where PMM imputation beats a well-crafted parametric imputation model. However, analysts often prefer PMM imputation because it preserves typical features in the raw data. For example, it accounts for discreteness of data, avoids impossible values, preserves location of quantiles, and is highly robust to imputation model misspecification. All this costs no additional work on the part of the analyst. If the complete-data model depends on such features, then the inference will also be better when using PMM imputation.
In this tutorial article we have focused on the use of MI in observational studies. In randomised controlled trials (RCTs), MI is not always the optimal approach.6 When a univariate outcome is MAR, a complete case analysis using an adjusted analysis is unbiased and efficient.6 With a multivariate outcome (eg, an outcome measured at multiple occasions over the course of follow-up), the use of a linear mixed model with missing data in the outcome only will tend to result in estimates with smaller standard errors compared with the use of
1330 Canadian Journal of Cardiology
Volume 37 2021
[Figure 3 image: density plots, one panel per continuous variable (extracted panel labels: Urea, LDL); vertical axis: density function.]
Figure 3. Distribution of continuous variables in complete cases and in those with imputed data when using predictive mean matching (PMM). The
solid black line denotes the distribution of the given continuous variable in those subjects for whom that variable was not missing. The dashed red
lines denote the distribution of the imputed value for that variable in those subjects for whom the variable was missing. There is 1 red line for each
of the imputed data sets. LDL, low-density lipoprotein.
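The property that PMM imputes only values that were actually observed can be seen in a simplified Python sketch. It is an illustration under stated assumptions, not the algorithm as implemented in any package: it uses one predictor, a least-squares line, and `k` nearest donors, and it omits the step in which the regression parameters are themselves drawn at random. The function names and the data are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - b * mx, b


def pmm_impute(xs, ys, k=3, seed=1):
    """Fill each missing y (None) with the observed y of a donor chosen at
    random from the k subjects whose predicted values are closest."""
    rng = random.Random(seed)
    donors = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b = fit_line([x for x, _ in donors], [y for _, y in donors])
    completed = []
    for x, y in zip(xs, ys):
        if y is None:
            target = a + b * x  # predicted value for the subject with missing y
            nearest = sorted(donors, key=lambda d: abs((a + b * d[0]) - target))[:k]
            y = rng.choice(nearest)[1]  # borrow an observed value
        completed.append(y)
    return completed


# Hypothetical data: age is complete, a discrete count has 2 missing values.
age = [60, 65, 70, 72, 75, 80, 85, 90]
count = [1, 2, None, 2, 3, None, 4, 5]
completed = pmm_impute(age, count)
print(completed)
```

Because every imputed value is borrowed from a donor, the completed variable stays on the original discrete scale and within the observed range, which a draw from a normal model would not guarantee.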
MI.6 If MI is used, it is suggested that imputation be conducted separately in the different arms of the trial.6
In summary, MI replaces missing values with plausible values. By creating multiple imputed data sets, the analyst can explicitly account for the uncertainty inherent in the imputed values. Historical approaches such as complete case analysis, mean imputation, and single imputation potentially result in bias, incorrect estimates of standard errors, and consequently incorrect tests of statistical significance. Researchers are encouraged to consider MI as an important tool to address the problems associated with missing data in clinical research.

Funding Sources
This study was supported by the ICES, which is funded by an annual grant from the Ontario Ministry of Health and Long-Term Care (MOHLTC). The opinions, results and conclusions reported in this paper are those of the authors and are independent from the funding sources. No endorsement by ICES or the Ontario MOHLTC is intended or should be inferred. The data sets used for this study were held securely in a linked deidentified form and analysed at ICES. Although data-sharing agreements prohibit ICES from making the data set publicly available, access may be granted to those who meet prespecified criteria for confidential access, as described at https://www.ices.on.ca/DAS. This research was supported by an operating grant from the Canadian Institutes of Health Research (CIHR) (grant number MOP 86508). The EFFECT data used in the study was funded by a CIHR Team Grant in Cardiovascular Outcomes Research (grant numbers CTP79847 and CRT43823). P.C.A. and D.S.L. are supported in part by Mid-Career Investigator awards from the Heart and Stroke Foundation. D.S.L. is supported by the Ted Rogers Chair in Heart Function Outcomes. I.R.W. was supported by the Medical Research Council Programme MC_UU_12023/21.

Disclosures
The authors have no conflicts of interest to disclose.

References
1. Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons, 1987.
2. Little RJA, Rubin DB. Statistical Analysis with Missing Data. Hoboken: John Wiley & Sons, 2002.
3. Carpenter JR, Kenward MG. Multiple Imputation and Its Application. Chichester: John Wiley & Sons, 2013.
4. Schafer JL, Graham JW. Missing data: our view of the state of the art. Psychol Methods 2002;7:147-77.
5. Sterne JA, White IR, Carlin JB, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 2009;338:b2393.
6. Sullivan TR, White IR, Salter AB, Ryan P, Lee KJ. Should multiple imputation be the method of choice for handling missing data in randomized trials? Stat Methods Med Res 2018;27:2610-26.
7. Schafer JL. Multiple imputation: a primer. Stat Methods Med Res 1999;8:3-15.
8. White IR, Thompson SG. Adjusting for partially missing baseline measurements in randomized trials. Stat Med 2005;24:993-1007.
9. Groenwold RH, White IR, Donders AR, et al. Missing covariate data in clinical research: when and when not to use the missing-indicator method for analysis. CMAJ 2012;184:1265-9.
10. van Buuren S. Flexible Imputation of Missing Data. Second Edition. Boca Raton: CRC Press, 2018.
11. White IR, Royston P, Wood AM. Multiple imputation using chained equations: issues and guidance for practice. Stat Med 2011;30:377-99.
12. van Buuren S. Multiple imputation of discrete and continuous data by fully conditional specification. Stat Methods Med Res 2007;16:219-42.
13. van Buuren S, Groothuis-Oudshoorn K. mice: multivariate imputation by chained equations in R. J Stat Softw 2011;45(3).
14. Morris TP, White IR, Royston P. Tuning multiple imputation by predictive mean matching and local residual draws. BMC Med Res Methodol 2014;14:75.
15. von Hippel PT. How many imputations do you need? A two-stage calculation using a quadratic rule. Sociol Methods Res 2020;49:699-718.
16. White IR, Royston P. Imputing missing covariate values for the Cox model. Stat Med 2009;28:1982-98.
17. von Hippel PT. How to impute interactions, squares, and other transformed variables. Sociol Methodol 2009;39:265-91.
18. Seaman SR, Bartlett JW, White IR. Multiple imputation of missing covariates with nonlinear effects and interactions: an evaluation of statistical methods. BMC Med Res Methodol 2012;12:46.
19. Vink G, van Buuren S. Multiple imputation of squared terms. Sociol Methods Res 2013;42:598-607.
20. von Hippel PT. Regression with missing Ys: an improved strategy for analyzing multiply imputed data. Sociol Methodol 2007;37:83-117.
21. Tu JV, Donovan LR, Lee DS, et al. Effectiveness of public report cards for improving the quality of cardiac care: the EFFECT study: a randomized trial. JAMA 2009;302:2330-7.
22. Lee DS, Austin PC, Rouleau JL, et al. Predicting mortality among patients hospitalized for heart failure: derivation and validation of a clinical model. JAMA 2003;290:2581-7.
23. Tu JV, Austin PC, Walld R, et al. Development and validation of the Ontario acute myocardial infarction mortality prediction rules. J Am Coll Cardiol 2001;37:992-7.
24. Lee DS, Lee JS, Schull MJ, et al. Prospective validation of the emergency heart failure mortality risk grade for acute heart failure. Circulation 2019;139:1146-56.
25. Akl EA, Shawwa K, Kahale LA, et al. Reporting missing participant data in randomised trials: systematic survey of the methodological literature and a proposed guide. BMJ Open 2015;5:e008431.
26. Longford NT. Missing data. In: de Leeuw J, Meijer E, eds. Handbook of Multilevel Analysis. New York: Springer, 2008:377-99.
27. van Buuren S. Multiple imputation of multilevel data. In: Hox JJ, Roberts JK, eds. Handbook of Advanced Multilevel Analysis. New York: Routledge, 2011:173-96.
28. Molenberghs G, Verbeke G. Missing data. In: Scott MA, Simonoff JS, Marx BD, eds. The SAGE Handbook of Multilevel Modeling. London: SAGE, 2013:403-24.
29. Audigier V, White IR, Jolani S, et al. Multiple imputation for multilevel data with continuous and binary variables. Stat Sci 2018;33:160-83.
30. Richman M, Trafalis T, Adrianto I. Multiple imputation through machine learning algorithms. Paper presented at: American Meteorological Society 87th Annual Meeting. January 13-18, 2007; San Antonio, TX.

Supplementary Material
To access the supplementary material accompanying this article, visit the online version of the Canadian Journal of Cardiology at www.onlinecjc.ca and at https://doi.org/10.1016/j.cjca.2020.11.010.