Imputability
Abstract
Background: In modern biomedical research of complex diseases, a large number of demographic and clinical variables, herein called phenomic data, are often collected, and missing values (MVs) are inevitable in the data collection process. Since many downstream statistical and bioinformatics methods require a complete data matrix, imputation is a common and practical solution. In high-throughput experiments such as microarray experiments, continuous intensities are measured, and many mature missing value imputation methods have been developed and widely applied. Large phenomic data, however, contain continuous, nominal, binary and ordinal data types, which precludes the application of most of these methods. Although several methods have been developed in the past few years, no complete guideline has been proposed for phenomic missing data imputation.
Results: In this paper, we investigated existing imputation methods for phenomic data, proposed a self-training selection (STS) scheme to select the best imputation method, and provide a practical guideline for general applications. We introduced a novel concept, the "imputability measure" (IM), to identify missing values that are fundamentally inadequate to impute. In addition, we developed four variations of the K-nearest-neighbor (KNN) method and compared them with two existing methods, multivariate imputation by chained equations (MICE) and missForest. The four variations are imputation by variables (KNN-V), by subjects (KNN-S), their weighted hybrid (KNN-H) and an adaptively weighted hybrid (KNN-A). We performed simulations and applied the different imputation methods and the STS scheme to three lung disease phenomic datasets to evaluate the methods. An R package "phenomeImpute" is made publicly available.
Conclusions: Simulations and applications to real datasets showed that MICE often did not perform well; KNN-A,
KNN-H and random forest were among the top performers although no method universally performed the best.
Imputation of missing values with low imputability measures increased imputation errors greatly and could
potentially deteriorate downstream analyses. The STS scheme was accurate in selecting the optimal method by
evaluating methods in a second layer of missingness simulation. All source files for the simulation and the real data
analyses are available on the author’s publication website.
Keywords: Missing data, K-nearest-neighbor, Phenomic data, Self-training selection
* Correspondence: [email protected]
† Equal contributors
1 Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA
2 Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
Full list of author information is available at the end of the article
© 2014 Liao et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain
Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article,
unless otherwise stated.
Liao et al. BMC Bioinformatics 2014, 15:346 Page 2 of 12
http://www.biomedcentral.com/1471-2105/15/346
Background
In many studies of complex diseases, a large number of demographic, environmental and clinical variables are collected and missing values (MVs) are inevitable in the data collection process. Major categories of variables include, but are not limited to: (1) demographic measures, such as gender, race, education and marital status; (2) environmental exposures, such as pollen, feather pillows and pollution; (3) living habits, such as exercise, sleep, diet, vitamin supplements and smoking; (4) measures of general health status or organ function, such as body mass index (BMI), blood pressure, walking speed and forced vital capacity (FVC); (5) summary measures from medical images, such as fMRI and PET scans; (6) drug history; and (7) family disease history. The dimension of the data can easily go beyond several hundred to nearly a thousand, and we refer to such data as "phenomic data" hereafter. It has been shown recently that systematic analysis of phenomic data and integration with other genomic information provide further understanding of diseases [1-5] and enhance disease subtype discovery towards precision medicine [6,7]. The presence of missing values in clinical research not only reduces the statistical power of the study but also impedes the implementation of many statistical and bioinformatic methods that require a complete dataset (e.g. principal component analysis, clustering analysis, machine learning and graphical models). Many have pointed out that "missing value has the potential to undermine the validity of epidemiologic and clinical research and lead the conclusion to bias" [8].

Standard statistical methods for the analysis of data with missing values include list-wise deletion or complete-case analysis (i.e. discarding any subject with a missing value), likelihood-based methods, data augmentation and imputation [9,10]. List-wise deletion in general leads to loss of statistical power and biased results when data are not missing completely at random. Likelihood-based methods and data augmentation are popular for low-dimensional data with parametric models for the missing-data process [10,11]. However, their application to high-dimensional data is problematic, especially when the missing-data pattern is complicated and the required intensive computing is most likely insurmountable. On the contrary, imputation provides an intuitive and powerful tool for the analysis of data with complex missing-data patterns [12-16]. Explicit imputation methods such as mean imputation or stochastic imputation either understate the variability of the data or require parametric assumptions on the data, and subsequently face challenges similar to those of likelihood-based methods and data augmentation [12-14,16]. Implicit imputation methods such as nearest-neighbour imputation, hot-deck and fractional imputation provide flexible and powerful approaches for the analysis of data with complex missing-data patterns, even though the implicit imputation model is not coherent with the assumed model for the underlying complete data [13,17,18]. Multiple imputations are usually considered to account for the variability due to imputation [13,14,16,19].

Except for some implicit imputation methods, the other above-mentioned methods rely on correct modelling of the missing-data process and work well in traditional situations with a large number of subjects and a small number of variables (large n, small p). With the trend of an increasing number of variables (large p) in phenomic data, model fitting, diagnostic checks and sensitivity analysis become difficult, endangering the success of multiple imputation or maximum likelihood imputation. The complexity of phenomic data with mixed data types (binary, multi-class categorical, ordinal and continuous) further aggravates the difficulty of modeling the joint distribution of all variables. Although a few algorithms are designed to handle datasets with both continuous and categorical variables [14,20-22], the implementation of most of these complicated methods in high-dimensional phenomic data is not straightforward. Imputation methods based on exact statistical modeling often suffer from the "curse of dimensionality". Jerez and colleagues compared machine learning methods, such as multi-layer perceptron (MLP), self-organizing maps (SOM) and k-nearest neighbor (KNN), to traditional statistical imputation methods in a large breast cancer dataset and concluded that machine learning imputation methods seemed to perform better in this large clinical dataset [23].

In the past decade, missing value imputation for high-throughput experimental data (e.g. microarray data) has drawn great attention, and many methods have been developed and widely used (see [24,25] for reviews and comparative studies). Imputation of phenomic data differs from that of microarray data and brings new challenges for two major reasons. Firstly, microarray data contain entirely continuous intensity measurements, while phenomic data have mixed data types. This voids the majority of established microarray imputation methods for phenomic data. Secondly, microarray data monitor the expression of thousands of genes, and the majority of genes are believed to be co-regulated with others in a systemic sense, which leads to a highly correlated data structure and makes imputation intrinsically easier. Phenomic data, on the other hand, are more likely to contain isolated variables (or samples) that are "not imputable" from other observed variables (samples).

There are at least three aspects of novelty in this paper. Firstly, to our knowledge, this is the first systematic comparative study of missing value imputation methods for large-scale phenomic data. We will compare two existing methods (missForest [26] and multivariate imputation by chained equations, MICE [16]) and extend four variants of the KNN imputation method that was popularly used in
microarray analysis [27]. Secondly, to characterize and identify missing values that are "not imputable" from other observed values in phenomic data, we propose an "imputability measure" (IM) to quantify the imputability of a missing value. When a variable or subject has an overall small IM in its missing values, it is recommended to remove the variable or subject from further analysis (or to impute with caution). Thirdly, we propose a self-training selection (STS) scheme [24] to select the best missing value imputation method for each data type in a given dataset. The result provides a practical guideline for applications. The IM and the STS selection tool will remain useful when more powerful methods for phenomic data imputation are developed in the future.

Methods
Real data
The current work is motivated by three high-dimensional phenomic datasets, all of which have a mixture of continuous, ordinal, binary and nominal covariates. The Chronic Obstructive Pulmonary Disease (COPD) dataset was generated from a COPD study conducted in the Division of Pulmonary, Department of Medicine at the University of Pittsburgh. The second dataset is the phenotypic data set of the Lung Tissue Research Consortium (LTRC, http://www.nhlbi.nih.gov/resources/ltrc.htm). The third dataset was obtained from the Severe Asthma Research Program (SARP) study (http://www.severeasthma.org/). These datasets represent different variable/subject ratios and different proportions of data types among the variables. In Table 1, Raw Data (RD) refers to the original raw data with missing values as initially obtained. Complete Data (CD) represents a complete dataset without any missing values, obtained after iteratively removing variables and subjects with large missing value percentages. CDs contain no missing values and are ideal for simulations evaluating the different methods (see section Simulated datasets).

Table 1 Descriptions of three real data sets

                                         COPD      LTRC      SARP
Subjects (RD/CD)                         699/491   1428/709  1671/640
Variables (RD/CD)                        528/257   1568/129  1761/135
Continuous variables (Con)               113       11        27
Multi-class categorical variables (Cat)  12        27        6
Binary variables (Bin)                   78        0         86
Ordinal variables (Ord)                  54        91        16
Total variables in CD                    257       129       135

Imputation methods
We will compare four newly developed KNN methods with the MICE and missForest methods in this paper. The methods and detailed implementations are described below.

Two existing methods: MICE and missForest
Multivariate imputation by chained equations (MICE) is a popular method for imputing multivariate missing data. It factorizes the joint conditional density as a sequence of conditional probabilities and imputes missing values by multiple regression sequentially, based on the different types of missing covariates. Gibbs sampling is used to estimate the parameters. It then draws an imputation for each variable conditional on all the other variables. We used the R package "mice" to implement this method.

MissForest is a random-forest-based method for imputing phenomic data [26]. The method treats the variable with the missing value as the response variable and borrows information from other variables via resampling-based classification and regression trees to grow a random forest for the final prediction. The procedure is iterated until the imputed values converge. The method is implemented in the "missForest" R package.

KNN imputation methods
The KNN method is popular due to its simplicity and proven effectiveness in many missing value imputation problems. For a missing value, the method seeks its K nearest variables or subjects and imputes by a weighted average of the observed values of the identified neighbours. We adopted the weight choice from the LSimpute method used for microarray missing value imputation [28]. LSimpute is an extension of KNN which utilizes correlations between both genes and arrays; the missing values are imputed by a weighted average of the gene-based and array-based estimates. Specifically, the weight for the kth neighbor of a missing variable or subject is given by w_k = r_k^2 / (1 − r_k^2 + ε), where r_k is the correlation between the kth neighbor and the missing variable or subject, and ε = 10^(−6). As a result, this algorithm gives more weight to closer neighbors. Here, we extended the two KNN methods of LSimpute, imputation by the nearest variables (KNN-V) and imputation by the nearest subjects (KNN-S), so that they can be used to impute phenomic data with mixed types of variables. Furthermore, we developed a hybrid of these two methods using global variable/subject weights (KNN-H) and adaptive variable/subject weights (KNN-A).

Impute by nearest variables (KNN-V)
To extend the KNN imputation method to data with mixed types of variables, we used established statistical correlation measures between different data types to measure the distance among different types of variables.
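The LSimpute-style neighbour weight described above is simple enough to state in a few lines of code. The following is our own minimal Python sketch (the paper's implementation is in R); the function name is ours, not from the "phenomeImpute" package:

```python
import numpy as np

def neighbor_weight(r, eps=1e-6):
    """LSimpute-style neighbour weight w_k = r_k^2 / (1 - r_k^2 + eps).

    r is the correlation between the k-th nearest neighbour and the
    variable (or subject) being imputed; eps = 1e-6 guards against
    division by zero when |r| = 1.
    """
    r = np.asarray(r, dtype=float)
    return r ** 2 / (1.0 - r ** 2 + eps)
```

For example, neighbor_weight([0.9, 0.3]) gives roughly [4.26, 0.10], so the more strongly correlated neighbour dominates the weighted average.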
As described in Table 1, phenomic data usually contain four types of variables – continuous (Con), binary (Bin), multi-class categorical (Cat) and ordinal (Ord). Table 2 lists the correlation measures used across the different data types to construct the correlation matrix for KNN-V (Additional file 1 contains a more detailed description):

Spearman's rank correlation (Con vs Con): We use Spearman's rank correlation to measure the correlation between two continuous variables. It is equivalent to computing the Pearson correlation based on ranks: r = 1 − 6 Σ_{i=1}^{N} d_i^2 / (N(N^2 − 1)), where d_i is the rank difference of each corresponding observation and N is the number of subjects.

Point biserial correlation (Con vs Bin) and its extension (Con vs Cat): The point biserial correlation between a continuous variable X and a dichotomous variable Y (Y = 0 or 1) is defined as r = ((X̄_1 − X̄_0) / S_X) · sqrt(p_Y (1 − p_Y)), where X̄_1 and X̄_0 represent the means of X given Y = 1 and Y = 0 respectively, S_X is the standard deviation of X, and p_Y is the proportion of subjects with Y = 1. Note that the point biserial correlation is mathematically equivalent to the Pearson correlation and there is no underlying assumption on Y. When Y is a multi-level categorical variable with more than two possible values, the point biserial correlation can be generalized, assuming Y follows a multinomial distribution and the conditional distribution of X given Y is normal [29]. It is implemented by the "biserial.cor" function in the "ltm" R package.

Rank biserial correlation (Ord vs Bin) and its extension (Ord vs Cat): The rank biserial correlation replaces the continuous variable X in the point biserial correlation with ranks. To calculate the correlation between an ordinal and a nominal variable (binary or multi-class), we transform the ordinal variable into ranks and then apply the rank biserial correlation or its extension for the calculation [30].

Polyserial correlation (Con vs Ord): The polyserial correlation measures the correlation between a continuous variable X and an ordinal variable Y. Y is assumed to be defined from a latent continuous variable η, discretized with equally spaced, strictly monotonic cut points. The joint distribution of the observed continuous variable X and η is assumed to be bivariate normal. The polyserial correlation is the estimated correlation between X and η and is estimated by maximum likelihood [31]. It is implemented by the "polyserial" function in the "polycor" R package.

Polychoric correlation (Ord vs Ord): The polychoric correlation measures the correlation between two ordinal variables. Similar to the polyserial correlation described above, the polychoric correlation estimates the correlation of two underlying latent continuous variables, which are assumed to follow a bivariate normal distribution [32]. It is implemented by the "polychor" function in the "polycor" R package.

Phi (Bin vs Bin): The phi coefficient measures the correlation between two dichotomous variables. It is the linear correlation of an underlying bivariate discrete distribution [33-35]. The phi correlation is calculated as r = sqrt(X^2 / N), where N is the number of subjects and X^2 is the chi-square statistic for the 2 × 2 contingency table of the two binary variables.

Cramer's V (Bin vs Cat and Cat vs Cat): Cramer's V measures the correlation between two nominal variables with two or more levels. It is based on Pearson's chi-square statistic [36]. The formula is given by r = sqrt(X^2 / (N(H − 1))), where N is the number of subjects, X^2 is the chi-square statistic for the contingency table and H is the number of rows or columns, whichever is smaller.

Table 2 Correlation measures between different types of variables

Variables  Con                       Ord                      Bin         Cat
Con        Spearman                  --                       --          --
Ord        Polyserial                Polychoric               --          --
Bin        Point biserial            Rank biserial            Phi         --
Cat        Point biserial extension  Rank biserial extension  Cramer's V  Cramer's V

We note that all correlation measures in Table 2 are based on the classical Pearson correlation (some with additional Gaussian assumptions on the data) and, as a result, the correlations from different data types are comparable when selecting the K nearest neighbors. A corresponding distance measure can be computed as d = |1 − r|, where r is the correlation measure between pairwise variables. Given a missing value in the data matrix for variable x (missing on subject i), only the K nearest neighbors of x (denoted y_1, …, y_K) are included in the prediction model. In addition, none of y_1, …, y_K is allowed to have a missing value for the same subject as the missing value to be predicted. For each neighbour, a generalized linear regression model with a single predictor is constructed using available cases: g(μ) = α + β y_k, where μ = E(x) and g(·) is the link function. The regression methods used for the imputation of the different types of variables are listed in Table 3. Missing values can then be imputed by x̂_i(k) = g^(−1)(α̂ + β̂ y_ik). Finally, for continuous variables, the weighted average of the estimated imputed values from the K nearest neighbors is used to impute the missing value. For nominal variables (binary or multi-class categorical), a weighted majority vote from the K nearest neighbors is used. For ordinal variables, we treat the levels as positive integers (i.e. 1, 2, 3, …, q) and the imputed value is given by the rounded value of the weighted average.
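As a concrete illustration of the three aggregation rules just described, the following Python sketch implements the weighted average, the truncated-and-rounded ordinal average, and the weighted majority vote. This is our own schematic (the helper names are not from the paper's R package):

```python
import numpy as np
from collections import defaultdict

def impute_continuous(preds, w):
    """Weighted average of the K per-neighbour predictions."""
    preds, w = np.asarray(preds, float), np.asarray(w, float)
    return float(np.sum(w * preds) / np.sum(w))

def impute_ordinal(preds, w, q):
    """Weighted average truncated to the level range [1, q], then rounded."""
    avg = impute_continuous(preds, w)
    return int(round(min(max(avg, 1.0), float(q))))

def impute_nominal(preds, w):
    """Weighted majority vote over the predicted classes."""
    votes = defaultdict(float)
    for cls, wk in zip(preds, w):
        votes[cls] += wk
    return max(votes, key=votes.get)
```

For instance, with per-neighbour predictions [4, 5, 1] and weights [1.0, 1.0, 0.1], impute_ordinal(..., q=5) returns 4: the low-weight outlying neighbour barely shifts the weighted average.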
Table 3 Methods for aggregating imputation information of different data types from the K nearest neighbors

Variables  Regression method                Final imputed value
Con        Linear regression                Σ w_k ŷ_k / Σ w_k
Ord        Ordinal logistic regression      min{max{1, Σ w_k ŷ_k / Σ w_k}, q} (rounded)
Bin        Logistic regression              Weighted majority vote
Cat        Multinomial logistic regression  Weighted majority vote

(q: number of levels of the ordinal variable)

Impute by nearest subjects (KNN-S)
The procedure for KNN-S is generally the same as that for KNN-V. Here, we borrow information from the nearest subjects instead of the nearest variables. Thus, we have mixed types of values within each vector (subject). We define the similarity of a pair of subjects by Gower's distance [37]. For each pair of subjects, it is the average of the distances between each variable for the pair of subjects considered: d_ij = Σ_{v=1}^{V} δ_ijv d_ijv / Σ_{v=1}^{V} δ_ijv, where d_ijv is the dissimilarity score between subjects i and j for the vth variable, and δ_ijv indicates whether the vth variable is available for both subjects i and j (taking the value 0 or 1). Depending on the type of variable, d_ijv is defined differently: (1) for dichotomous and multi-level categorical variables, d_ijv = 0 if the two subjects agree on the vth variable, otherwise d_ijv = 1; (2) the contribution of the other variables (continuous and ordinal) is the absolute difference of the two values divided by the total range of that variable [37]. The calculation of Gower's distance is implemented by the "daisy" function in the "cluster" R package.

Hybrid imputation by nearest subjects and variables (KNN-H)
Since the nearest variables and the nearest subjects often both contain information to improve imputation, we propose to combine the imputed values from KNN-S and KNN-V by:

KNN-H = p · KNN-S + (1 − p) · KNN-V.

The weight p is estimated from a second layer of simulated missing values, from which squared imputation errors are obtained using KNN-S and KNN-V (e_S^2 and e_V^2). p is chosen to minimize

e_H^2 = p^2 Σ e_S^2 + 2p(1 − p) Σ e_S e_V + (1 − p)^2 Σ e_V^2.

Thus, p̂ = min{max{(Σ e_V^2 − Σ e_S e_V) / (Σ e_S^2 − 2 Σ e_S e_V + Σ e_V^2), 0}, 1}. We simulated the second layer of missing values 20 times, estimated p̂_i each time and took the average Σ_{i=1}^{20} p̂_i / 20 as the estimate of p. As in KNN-V imputation, KNN-H imputed values are rounded to the closest integer for ordinal variables, and the weighted majority vote is used for nominal variables.

Hybrid imputation using adaptive weights (KNN-A)
Bø et al. [28] observed that the log-ratio of the squared errors, log(e_V^2 / e_S^2), was a decreasing function of r_max in microarray missing value imputation, where r_max is the correlation between the variable with the missing value and its closest neighbour. Such a trend suggests that when r_max is larger, more weight should be given to KNN-V. Thus, p should vary with r_max. We adopted the same procedure to estimate the adaptive weight p: we estimated p based on e_S and e_V within each sliding window (r_max − 0.1, r_max + 0.1) of r_max, and required that at least 10 observations be available for the computation of p.

Evaluation method
We compared the different missing value imputation methods in both simulated data and real datasets. We evaluated imputation performance by calculating the root mean squared error (RMSE) for continuous and ordinal variables and the proportion of false classification (PFC) for nominal variables. The purely simulated data are discussed in Simulated datasets below. For the real datasets, we first generated the complete dataset (CD) from the original raw dataset (RD) with missing values. We then simulated missing values (e.g. randomly at a 5% missing rate) to obtain the dataset with missing values (MD), performed imputation on the MD and assessed the performance by calculating the RMSE, sqrt(Σ (ŷ − y)^2 / n), between the imputed values ŷ and the real values y.
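The two evaluation metrics are straightforward; a minimal Python sketch (our own helper names, not from the paper's R package) is:

```python
import numpy as np

def rmse(imputed, truth):
    """Root mean squared error (continuous and ordinal variables)."""
    imputed, truth = np.asarray(imputed, float), np.asarray(truth, float)
    return float(np.sqrt(np.mean((imputed - truth) ** 2)))

def pfc(imputed, truth):
    """Proportion of false classification (nominal variables)."""
    imputed, truth = np.asarray(imputed), np.asarray(truth)
    return float(np.mean(imputed != truth))
```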
We estimated the RMSE and the PFC over 20 randomly generated MDs.

Simulated datasets
Simulation of complete datasets (CD): To examine the performance of the various methods under different correlation structures, we considered three scenarios, each simulating N = 600 subjects and P = 300 variables.

Simulation I (six variable clusters + six subject clusters): We first generated the number of subjects in each cluster from Pois(80) and the number of variables in each cluster from Pois(40). To create the correlation structure among variables, we first generated a common basis δ_i (i = 1…6) of length N for the variables in cluster i from N(μ, 4), where μ is randomly sampled from UNIF(−2, 2). We then generated a set of slopes and intercepts (α_ip, β_ip), p = 1…v_i, so that each variable is a linear transformation of the common basis and the correlation structure is therefore preserved. The remaining variables, independent of the grouped variables, were random samples from N(0, 4). The subject correlation structure was generated following a similar strategy: we first generated a common basis γ_j (j = 1…6) of length P from N(1, 2). For all subjects in cluster j, γ_j was added to each of them to create correlation within subjects, and the remaining subjects were generated from N(0, 4 × I_{P×P}). To create data of mixed types, we randomly converted 100 variables into nominal variables and 60 variables into ordinal variables by randomly generating 3 to 6 ordinal/nominal levels. The proportions of the different variable types were similar to those of the COPD data set. Heatmaps of the subject and variable distance matrices of the simulated data are shown in Figure 1.

Figure 1 Heatmap of distance matrix in Simulation I. (a) Variable and (b) subject distance matrices of Simulation I (black: small distance/high correlation; white: large distance/low correlation).

Simulation II (twenty variable groups + twenty subject groups): The number of clusters was increased to 20. The numbers of subjects in each cluster were generated from Pois(25) and the numbers of variables in each cluster from Pois(15) (Additional file 1: Figure S1).

Simulation III (no variable groups + forty subject groups): In this simulation, we generated data with sparse between-variable correlation but strong between-subject correlation, a setting similar to the nominal variables in the SARP data set (Additional file 1: Figure S6(c)). The number of subjects in each cluster followed Pois(14). Within each subject cluster, a common base γ_c (c = 1…40) of length P was shared, to which a random error from N(0, 0.01) was added. We created sparse categorical variables by cutting continuous variables at the extreme quantiles (≤5% or ≥95%) and generated the other cutting points randomly from UNIF(0.01, 0.99), which created up to 30 levels (Additional file 1: Figure S2).

Generating datasets with missing values (MD) from complete data (CD): MDs were generated by randomly removing m% of the values from the simulated CDs described above or from the CDs of the real data described in Section Real data. We considered m% = 5%, 20% and 40% in our simulation studies. All three settings were repeated 20 times.

Imputability measure
Current practice in the field is to impute all missing data after filtering out variables or subjects with more than a fixed percentage (e.g. 20%) of missing values. This practice implicitly assumes that all missing values are imputable by borrowing information from other variables or subjects. This assumption is usually true in microarray or other high-throughput marker data, since genes usually interact with each other and are co-regulated at the systemic level. For high-dimensional phenomic data, however, we have observed that many variables do not associate or interact with other variables and are difficult to impute. Therefore, to identify these missing values, we introduce a novel concept of "imputability" and develop a quantitative "imputability measure" (IM). Specifically, given a dataset
with missing values, we generate “second layer” of missing identify the best method for the data set. To achieve
values as described above. We then perform the KNN-V that, we randomly generated a second layer of missing
and the KNN-S method on a “secondary simulated layer” values within each MDb (1 ≤ b ≤ 20) for 20 times and de-
of missing values. The procedure is repeated for t times noted the data sets with two layers of missing values as
(t =10 is usually sufficient) and Ei and Ej could be calcu- MDb,i (1 ≤ i ≤ 20). The method that performs the best in
lated as the average of the RMSEs for the second layer the second layer missing values imputation, i.e., generate
missing values of subject i (i = 1,…,N) and variable j (j = the smallest average RMSE, was identified as the method
1,…,P) of the t times of imputations. Let IMsi = exp(−Ei) selected by the STS scheme for missing value imputation
and IMvj = exp(−Ej). The IM for a missing value Dij is of MDb (denoted as Mb, STS). Consider the optimal
defined as max(IMsi, IMvj). IM provides quantitative method identified by the first layer STS as the “true” opti-
evidence of how well each missing value can be imputed mal imputation method, denoted as Mb*, we counted how
by borrowing information from other variables or sub- many times of the 20 simulations that Mb, STS = Mb* (i.e.
X20
jects. IM ranges between 0 and 1 and small IM values I Mb;STS ¼ Mb /20, where I(⋅) is the indicator
b¼1
represent large imputation errors that should raise con-
function) as the accuracy of STS scheme.
cerns of using imputation. Detailed Procedure of gener-
ating IM is described in Additional file 2 algorithm 1. In
the application guideline to be proposed in the Result Results
section, we will recommend users to avoid imputation or impute with caution for missing values with an IM less than a pre-specified threshold.

The self-training selection (STS) scheme
In our analyses, no imputation method performed universally better than all other methods. Thus, the best choice of imputation method depends on the particular structure of a given dataset. Previously, we proposed a Self-Training Selection (STS) scheme for microarray missing value imputation [24]. Here we applied the STS scheme and evaluated its performance in the complete real datasets. Figure 2 shows a diagram of the STS scheme and how we evaluated it. From a CD, we simulated 20 MDs (MD1, MD2, …, MD20). Our goal was to examine whether the STS-selected method Mb,STS agreed with the true best method Mb* (Figure 2).

Figure 2 Diagram of evaluating the performance of the STS scheme in a real complete data set (CD). Missing data sets are randomly generated 20 times (MD1, ⋯, MD20). The STS scheme is applied to learn the best method from the STS simulation (denoted as Mb,STS for the b-th missing data set MDb). The true best (in terms of RMSE) method for MDb is denoted as Mb* and the STS best (in terms of RMSE across MDb,1, …, MDb,20) method is denoted as Mb,STS. When Mb,STS = Mb*, the STS scheme successfully selects the optimal method.

Simulation results
We compared the performance of seven methods – mean imputation (MeanImp), KNN-V, KNN-S, KNN-H, KNN-A, missForest and MICE – on the three simulation scenarios described above. When implementing MICE, the R package returned errors when the nominal or ordinal variables contained a large number of levels and any level contained a small number of observations. As a result, MICE was not applied to the Simulation III evaluation. We first performed a simulation to determine the effect of the choice of K on the imputation. We tested K = 5, 10 and 15 for missing percentages of 5%, 10% and 20% on different types of data. The imputation results with different K values are similar (see Additional file 1: Figure S3). We thus chose K = 5 for both the simulations and the real data applications, as it generated good performance in most situations.

Figure 3 shows the boxplots of the RMSEs of the three types of variables from 20 simulations for the three simulation scenarios. For Simulations I and II, we observed that missForest performed the best in all three data types. MICE performed better than the KNN methods in nominal missing value imputation, but worse in the imputation of continuous and ordinal variables. The two hybrid KNN methods (KNN-A and KNN-H) consistently performed better than KNN-V and KNN-S, showing the effectiveness of combining information from variables and subjects. KNN-A performed slightly better than KNN-H, especially in the first two simulation scenarios, indicating the advantage of adaptive weights in combining KNN-V and KNN-S information. For Simulation III, KNN-S performed overall the best while KNN-V failed. This is expected due to the lack of correlation between variables. missForest was also not as good as KNN-S in the continuous and nominal variable imputations. In this case, the performance of KNN-S, KNN-H and KNN-A was not affected much by the missing percentage, due to the strong correlation among subjects.
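As a rough illustration of the selection logic (not the phenomeImpute implementation, which is in R; the candidate imputers, the RMSE criterion on hidden entries, and all function names below are simplified stand-ins), the STS idea of hiding known entries of a complete data matrix and scoring each candidate method can be sketched as:

```python
import numpy as np

def sts_select(cd, imputers, n_mask=20, frac=0.05, seed=0):
    """Self-training selection sketch: hide known entries of a complete
    data matrix (CD), impute them with each candidate method, and return
    the method with the lowest average RMSE over the simulated masks."""
    rng = np.random.default_rng(seed)
    scores = {name: [] for name in imputers}
    for _ in range(n_mask):
        mask = rng.random(cd.shape) < frac       # simulated missing entries
        for name, impute in imputers.items():
            md = np.where(mask, np.nan, cd)      # missing-data copy of the CD
            est = impute(md)                     # candidate method fills the NaNs
            rmse = np.sqrt(np.mean((est[mask] - cd[mask]) ** 2))
            scores[name].append(rmse)
    return min(scores, key=lambda k: np.mean(scores[k]))

# toy candidates: column-mean imputation vs naive zero-filling
def mean_impute(x):
    return np.where(np.isnan(x), np.nanmean(x, axis=0), x)

def zero_impute(x):
    return np.where(np.isnan(x), 0.0, x)

rng = np.random.default_rng(1)
cd = rng.normal(loc=5.0, size=(100, 10))         # data centered at 5
best = sts_select(cd, {"MeanImp": mean_impute, "ZeroImp": zero_impute})
print(best)  # MeanImp: zero-filling is badly biased for data centered at 5
```

In practice the hidden-entry error would be computed per data type (RMSE for continuous, PFC for categorical), as in the paper's evaluation.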
Liao et al. BMC Bioinformatics 2014, 15:346 Page 8 of 12
http://www.biomedcentral.com/1471-2105/15/346
Figure 3 Boxplots of RMSE/PFC for (a) Simulation I, (b) Simulation II and (c) Simulation III. KNN-based methods: KNN-V, KNN-S, KNN-H and KNN-A; RF: missForest algorithm; MICE: multivariate imputation by chained equations; MeanImp: mean imputation.
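The full definitions of KNN-V, KNN-S and the hybrid weights appear in the Methods section. As a hedged sketch for continuous variables only (a Python illustration with a fixed weight w = 0.5 rather than KNN-A's adaptive weights; all names here are illustrative), a hybrid estimate combines the subject-based and variable-based imputations:

```python
import numpy as np

def knn_impute_axis(x, k=5, axis=0):
    """Fill NaNs by averaging the k nearest rows (axis=0, 'subjects') or
    columns (axis=1, 'variables'); nearness is Euclidean distance over
    commonly observed entries."""
    m = x if axis == 0 else x.T
    out = m.copy()
    n = m.shape[0]
    for i in range(n):
        miss = np.isnan(m[i])
        if not miss.any():
            continue
        d = np.full(n, np.inf)                   # self keeps infinite distance
        for j in range(n):
            if j == i:
                continue
            common = ~np.isnan(m[i]) & ~np.isnan(m[j])
            if common.any():
                d[j] = np.sqrt(np.mean((m[i, common] - m[j, common]) ** 2))
        nbrs = np.argsort(d)[:k]                 # k nearest neighbours
        for c in np.where(miss)[0]:
            vals = m[nbrs, c]
            vals = vals[~np.isnan(vals)]         # neighbours may be missing too
            if vals.size:
                out[i, c] = vals.mean()
    return out if axis == 0 else out.T

def knn_hybrid(x, k=5, w=0.5):
    """KNN-H-style estimate: weighted average of the subject-based (KNN-S)
    and variable-based (KNN-V) imputations."""
    return w * knn_impute_axis(x, k, axis=0) + (1 - w) * knn_impute_axis(x, k, axis=1)

x = np.add.outer(np.arange(6.0), np.arange(5.0))  # strongly structured toy matrix
x[2, 3] = np.nan                                  # hide one entry (true value 5.0)
filled = knn_hybrid(x, k=5)                       # estimate lies between the KNN-S and KNN-V guesses
```

KNN-A would replace the fixed w with a weight chosen adaptively from the relative reliability of the two sources, which is where its advantage in Simulations I and II comes from.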
Real data applications
Next we compared the different methods in three real datasets. Similar to the above simulation study, we first investigated the choice of K for the simulations of the real datasets and reached the same conclusion (Additional file 1: Figure S4). In order to implement MICE in our comparative analysis, we had to remove categorical variables with any sparse level (i.e. having <10% of the total observations) and those with more than 10 levels. The numbers of variables after such filtering are shown in Additional file 1: Table S1. Since only 26% (38/144), 14% (16/118) and 45% (49/108) of the nominal and ordinal variables were retained after the filtering, we decided to remove MICE from the comparison and report the comparative results of the remaining methods on the unfiltered data in Figure 4. The comparative results for all methods including MICE on the filtered data are available in Additional file 1: Figure S5. As expected, mean imputation almost always performed the worst (Figure 4). KNN-V usually performed better than KNN-S (except for the nominal variables in SARP), indicating that better information is borrowed from neighboring variables than from subjects. The hybrid methods KNN-H and KNN-A performed better than either KNN-S or KNN-V alone. KNN-A seemed to slightly outperform KNN-H. missForest was usually the best performer, with the exception of the nominal variables in the SARP data set. This is probably because of the low mutual correlation of the nominal variables with other variables in this data set, as demonstrated in Additional file 1: Figure S6 (note that missForest only borrows information from variables). Overall, no method universally outperformed the other methods. In Additional file 1: Figure S5, after filtering, the comparative results are similar to Figure 4 for the KNN methods and missForest. The MICE method had unstable performance: sometimes it performed among the best and sometimes much worse than all the others.

Imputability measure
The motivation for the imputability concept is that some variables or subjects have no near neighbour to borrow information from, and hence cannot be imputed accurately. The distributions of the imputability measure (IM; defined in Section Imputability measure) of the variables (IMV) and subjects (IMS) of the COPD, LTRC and SARP data are shown in Additional file 1: Figure S7. We observed a heavy left tail, which indicates the existence of many un-imputable subjects and variables. By including these poorly imputed values, we risk reducing the accuracy and power of downstream analyses. To demonstrate the usefulness of IM, we compared the RMSE/PFC before and after removing un-imputable values. Figure 5 shows a significant reduction of RMSE and PFC after removing the missing values with the lowest 25% of IMs.
Figure 4 Boxplots of RMSE/PFC for (a) COPD, (b) SARP and (c) LTRC. KNN-based methods: KNN-V, KNN-S, KNN-H and KNN-A; RF: missForest algorithm; MeanImp: mean imputation.
In Additional file 1: Figure S8, heatmaps of the IMs for the three real datasets are presented. Values colored in green have low IMs and should be imputed with caution.

Figure 5 Boxplots of RMSE/PFC evaluated using (1) all imputed values and (2) only imputable values in the LTRC dataset with m = 5% missingness. Color: grey (evaluation using all imputed values); white (evaluation using only imputable values).

The self-training selection scheme (STS) and an application guideline
Finally, we applied the STS scheme to the real datasets, and its performance is reported in Table 4. Methods with an RMSE difference within a 5% range are considered comparable. Thus, if a method generates an RMSE within 5% of the minimum RMSE of all methods, we considered the method indistinguishable from the optimal method, and the method is also an optimal choice. We found that the STS scheme almost always selected the true optimal missing value imputation method (with only several exceptions where accuracy dropped to 75%–95%). Figure 6 describes an application guideline for phenomic missing value imputation. Firstly, the STS scheme is applied to the MD of each data type separately to identify the best imputation method. The IMs are then calculated based on the selected optimal method. Finally, imputation is performed with the optimal method selected by the STS scheme, and users have two options for moving on to downstream analyses. In Option A, all missing values are imputed, accompanied by IMs that can be incorporated into downstream analyses. In Option B, only the missing values with IMs higher than a pre-specified threshold are imputed and reported.

Discussion
In our comparative study of the imputation methods available for phenomic data, MICE encountered difficulty with nominal and ordinal data types when any level in a variable has few observations. This limited its application to some real data. It also had unstable performance: in some situations it was among the top performers, while in others it performed much worse than the KNN methods and missForest. For the KNN methods, the hybrid methods (KNN-H and KNN-A) that combine information from neighboring subjects and variables usually performed better than borrowing information from either subjects (KNN-S) or variables (KNN-V) alone. missForest was usually among the top performers, although it could fail when correlations among variables are sparse. In the proposed KNN-based methods, when there are many nominal variables with sparse levels, ordinary logistic regression will also fail to work. When this happens, a contingency table is used to impute the missing values. This partly explains why the accuracy remained mostly unchanged across different missing percentages (5% to 40%). It is also due to the lack of similar variables with nominal missing values. Overall, no method universally performed the best.
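The sparse-level failure mode can be made concrete with a small screening helper (a Python illustration using the paper's filtering thresholds of <10% of observations per level and at most 10 levels; `mice_safe` is a hypothetical name, not part of any package — the actual analysis simply removed such variables before running mice in R):

```python
from collections import Counter

def mice_safe(values, sparse_frac=0.10, max_levels=10):
    """Screen a categorical variable before chained-equation imputation:
    reject it if any level holds fewer than sparse_frac of the non-missing
    observations, or if it has more than max_levels levels — the two
    conditions under which MICE returned errors in our comparison."""
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    if len(counts) > max_levels:
        return False
    return min(counts.values()) >= sparse_frac * len(observed)

# a balanced three-level variable passes; one with a rare level does not
print(mice_safe(["a"] * 40 + ["b"] * 35 + ["c"] * 25))   # True
print(mice_safe(["a"] * 95 + ["b"] * 5))                 # False
```

A rejected variable can either be excluded from the MICE model, as done here, or handled by a fallback such as the contingency-table imputation used in the proposed KNN methods.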
Figure 6 An application guideline to apply the STS scheme for a real dataset with missing values.
Author details
1Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA. 2Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA. 3Department of Human Genetics, University of Pittsburgh, Pittsburgh, PA, USA. 4Pulmonary, Critical Care and Sleep Medicine, Yale School of Medicine, New Haven, CT, USA. 5Department of Medicine, University of Pittsburgh, Pittsburgh, PA, USA.

Received: 6 March 2014 Accepted: 6 October 2014

References
1. Denny JC, Ritchie MD, Basford MA, Pulley JM, Bastarache L, Brown-Gentry K, Wang D, Masys DR, Roden DM, Crawford DC: PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics 2010, 26(9):1205–1210.
2. Hanauer DA, Ramakrishnan N: Modeling temporal relationships in large scale clinical associations. J Am Med Inform Assoc 2013, 20(2):332–341.
3. Lyalina S, Percha B, Lependu P, Iyer SV, Altman RB, Shah NH: Identifying phenotypic signatures of neuropsychiatric disorders from electronic medical records. J Am Med Inform Assoc 2013, 20(e2):e297–e305.
4. Ritchie MD, Denny JC, Zuvich RL, Crawford DC, Schildcrout JS, Bastarache L, Ramirez AH, Mosley JD, Pulley JM, Basford MA, Bradford Y, Rasmussen LV, Pathak J, Chute CG, Kullo IJ, McCarty CA, Chisholm RL, Kho AN, Carlson CS, Larson EB, Jarvik GP, Sotoodehnia N, Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) QRS Group, Manolio TA, Li R, Masys DR, Haines JL, Roden DM: Genome- and phenome-wide analyses of cardiac conduction identifies markers of arrhythmia risk. Circulation 2013, 127(13):1377–1385.
5. Warner JL, Alterovitz G, Bodio K, Joyce RM: External phenome analysis enables a rational federated query strategy to detect changing rates of treatment-related complications associated with multiple myeloma. J Am Med Inform Assoc 2013, 20(4):696–699.
6. Fernald GH, Capriotti E, Daneshjou R, Karczewski KJ, Altman RB: Bioinformatics challenges for personalized medicine. Bioinformatics 2011, 27(13):1741–1748.
7. Singer E: "Phenome" project set to pin down subgroups of autism. Nat Med 2005, 11(6):583.
8. Sterne JA, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR: Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 2009, 338:b2393.
9. Little RJ, D'Agostino R, Cohen ML, Dickersin K, Emerson SS, Farrar JT, Frangakis C, Hogan JW, Molenberghs G, Murphy SA, Neaton JD, Rotnitzky A, Scharfstein D, Shih WJ, Siegel JP, Stern H: The prevention and treatment of missing data in clinical trials. N Engl J Med 2012, 367:1355–1360.
10. Tanner MA, Wong WH: The calculation of posterior distributions by data augmentation. J Am Stat Assoc 1987, 82:528–550.
11. Tanner MA: Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions. New York: Springer-Verlag; 1996.
12. Liu C: Missing data imputation using the multivariate t distribution. J Multivar Anal 1995, 53(1):139–158.
13. Little RJA, Rubin DB: Statistical Analysis with Missing Data. 2nd edition. New York: John Wiley; 2002.
14. Raghunathan TE, Lepkowski JM, Hoewyk JV, Solenberger P: A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology 2001, 27(1):85–95.
15. Rubin DB, Schafer JL: Efficiently creating multiple imputations for incomplete multivariate normal data. In Proceedings of the Statistical Computing Section of the American Statistical Association; 1990:83–88.
16. van Buuren S, Groothuis-Oudshoorn K: mice: multivariate imputation by chained equations in R. J Stat Softw 2011, 45(3):1–67.
17. Andridge RR, Little RJ: A review of hot deck imputation for survey non-response. Int Stat Rev 2010, 78(1):40–64.
18. Little RJ, Yosef M, Cain KC, Nan B, Harlow SD: A hot-deck multiple imputation procedure for gaps in longitudinal data on recurrent events. Stat Med 2008, 27(1):103–120.
19. Rubin DB: Multiple Imputation for Nonresponse in Surveys. New York: Wiley; 1987.
20. Raghunathan TE, Grizzle JE: A split questionnaire survey design. J Am Stat Assoc 1995, 90:54–63.
21. Raghunathan TE, Siscovick DS: A multiple imputation analysis of a case–control study of the risk of primary cardiac arrest among pharmacologically treated hypertensives. Appl Stat 1996, 45:335–352.
22. Schafer JL: Analysis of Incomplete Multivariate Data by Simulation. New York: Chapman and Hall; 1997.
23. Jerez JM, Molina I, Garcia-Laencina PJ, Alba E, Ribelles N, Martin M, Franco L: Missing data imputation using statistical and machine learning methods in a real breast cancer problem. Artif Intell Med 2010, 50(2):105–115.
24. Brock GN, Shaffer JR, Blakesley RE, Lotz MJ, Tseng GC: Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes. BMC Bioinformatics 2008, 9:12.
25. Oh S, Kang DD, Brock GN, Tseng GC: Biological impact of missing-value imputation on downstream analyses of gene expression profiles. Bioinformatics 2011, 27(1):78–86.
26. Stekhoven DJ, Bühlmann P: MissForest - nonparametric missing value imputation for mixed-type data. Bioinformatics 2012, 28(1):112–118.
27. Acuna E, Rodriguez C: The treatment of missing values and its effect in the classifier accuracy. In Clustering and Data Mining Applications. 2004:639–648.
28. Bø TH, Dysvik B, Jonassen I: LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Res 2004, 32(3):e34.
29. Olkin I, Tate RF: Multivariate correlation models with mixed discrete and continuous variables. Ann Math Stat 1961, 32(2):448–465.
30. Agresti A: Measures of nominal-ordinal association. J Am Stat Assoc 1981, 76(375):524–529.
31. Olsson U, Drasgow F, Dorans NJ: The polyserial correlation coefficient. Psychometrika 1982, 47(3):337–347.
32. Olsson U: Maximum likelihood estimation of the polychoric correlation coefficient. Psychometrika 1979, 44(4):443–460.
33. Boas F: Determination of the coefficient of correlation. Science 1909, 29:823–824.
34. Pearson K: Mathematical contributions to the theory of evolution. VII. On the correlation of characters not quantitatively measurable. Philos Trans R Soc Lond Ser A Math Phys Eng Sci 1900, 195:1–47.
35. Yule GU: On the methods of measuring the association between two attributes. J Roy Statist Soc 1912, 75:579–652.
36. Cramér H: Mathematical Methods of Statistics. Princeton: Princeton University Press; 1946.
37. Gower JC: A general coefficient of similarity and some of its properties. Biometrics 1971, 27(4):857–871.

doi:10.1186/s12859-014-0346-6
Cite this article as: Liao et al.: Missing value imputation in high-dimensional phenomic data: imputable or not, and how? BMC Bioinformatics 2014, 15:346.