An AUC-based Permutation Variable Importance Measure for Random Forests;Janitza, University of Munich;BMC Bioinformatics
Abstract
Background: The random forest (RF) method is a commonly used tool for classification with high dimensional data
as well as for ranking candidate predictors based on the so-called random forest variable importance measures
(VIMs). However, the classification performance of RF is known to be suboptimal in case of strongly unbalanced
data, i.e. data where response class sizes differ considerably. Suggestions were made to obtain better classification
performance based either on sampling procedures or on cost sensitivity analyses. However, to our knowledge, the
performance of the VIMs has not yet been examined in the case of unbalanced response classes. In this paper we
explore the performance of the permutation VIM for unbalanced data settings and introduce an alternative
permutation VIM based on the area under the curve (AUC) that is expected to be more robust towards class
imbalance.
Results: We investigated the performance of the standard permutation VIM and of our novel AUC-based
permutation VIM for different class imbalance levels using simulated data and real data. The results suggest that the
new AUC-based permutation VIM outperforms the standard permutation VIM for unbalanced data settings while
both permutation VIMs have equal performance for balanced data settings.
Conclusions: The standard permutation VIM loses its ability to discriminate between associated predictors and
predictors not associated with the response for increasing class imbalance. It is outperformed by our new
AUC-based permutation VIM for unbalanced data settings, while the performance of both VIMs is very similar in the
case of balanced classes. The new AUC-based VIM is implemented in the R package party for the unbiased RF
variant based on conditional inference trees. The codes implementing our study are available from the companion
website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html.
Keywords: Random forest, Conditional inference trees, Variable importance measure, Feature selection, Unbalanced
data, Class imbalance, Area under the curve.
© 2013 Janitza et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.
Janitza et al. BMC Bioinformatics 2013, 14:119 Page 2 of 11
http://www.biomedcentral.com/1471-2105/14/119
persons are negative, as well as in subclass analyses, e.g., if one wants to differentiate between different subtypes of cancer. Usually some subclasses are more common than other subclasses, leading to an imbalance in class sizes. Studies on rare diseases are a further example of unbalanced data settings in medicine. Data can be obtained only from few persons having the specific rare disease, while samples from healthy control persons are much easier to obtain. Of course unbalanced data are also relevant in various other areas of application beyond the biomedical field, e.g., the prediction of creditworthiness of a bank's customers [11], the detection of fraudulent telephone calls [12] or the detection of oil spills in satellite radar images [13], just to name a few examples. Unbalanced data may arise whenever the class memberships are observed after data collection.

Like many other classification methods, RF produces classification rules that do not accurately predict the minority class if data are unbalanced. The RF classifier allocates new observations more often to the majority class unless the difference between the classes is large and the classes are well separable. For extreme class imbalances, e.g. if the minority class includes only 5% of the observations, it might happen that the RF classifier allocates every observation to the majority class independently of the predictors, yielding a minimal error rate of 5%. Although this error rate of 5% is very small, such a trivial classification is of no practical use.

Some suggestions have been made to yield a useful classification, based either on sampling procedures [14-17] or on cost sensitivity analyses [14]. Sampling procedures create an artificial balance between two or more classes by oversampling the minority class and/or downsampling the majority class. Cost sensitivity analyses attribute a higher cost to the misclassification of an observation from the minority class to impede the trivial systematic classification into the larger class. Both aspects have been widely discussed in the literature with respect to RF's classification performance [14,15,18-21]. Recent simulation studies [9] have shown that the performance of RF classification for unbalanced data depends on (i) the imbalance ratio, (ii) the class overlap and (iii) the sample size.

The impact of class imbalance on the RF VIM, however, has to our knowledge not yet been examined in the literature. In this article we focus on the permutation VIM, which is known to be almost unbiased and more reliable than the Gini VIM. The latter has been shown to have a preference for certain types of predictors [22-25] and therefore its rankings have to be treated with caution. We concentrate on the class imbalance problem for two response classes with respect to the permutation VIM. We investigate the mechanisms of changes in performance for unbalanced data settings and motivate the use of a new permutation VIM which is not based on the error rate but on the area under the curve (AUC). The AUC can be seen as an accuracy measure putting the same weight on both classes – in contrast to the error rate, which essentially gives more weight to the majority class. As such, the AUC is a particularly appropriate prediction accuracy measure in unbalanced data settings [26]. A permutation VIM in which the error rate is replaced by the AUC is therefore a promising alternative to the standard permutation VIM. We performed extensive simulation studies to explore and compare the behaviour of both permutation VIMs for different class imbalance levels, effect sizes and sample sizes.

Methods
The RF algorithm is a classification and regression method often used for high-dimensional data settings where the number of predictors exceeds the number of observations. Note that throughout this article we use the term predictors, which is equivalent to features or descriptors, denoting variables that are used to discriminate the response classes. In the RF algorithm several individual decision trees are combined to make a final prediction. The final prediction is then the average (for regression) or the majority vote (for classification) of the predictions of all trees in the forest. Each tree is fitted to a random sample of observations (with or without replacement) from the original sample. Observations not used to construct a tree are termed out-of-bag (OOB) observations for that tree. For each split in each tree, a randomly drawn subset of predictors is assessed as candidates for splitting, and the predictor yielding the best split is finally chosen for the split. In the original version of RF developed by Leo Breiman [1], the selected split is the split with the largest decrease in Gini impurity. In a later version of RF, conditional inference tests are used for selecting the best split in an unbiased way [27]. For each split in a tree, each candidate predictor from the randomly drawn subset is globally tested for its association with the response, yielding a global p-value. The predictor with the smallest p-value is selected, and within this globally selected predictor the best split is finally chosen for the split.

Both forest versions implement so-called variable importance measures, which can be used to get a ranking of the predictors according to their association with the response. In the following, we briefly introduce the standard permutation VIM as well as our novel permutation VIM, which is based on the area under the curve.
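The forest mechanics described in this section (trees fitted to random subsamples, out-of-bag observations, majority vote) can be sketched in a few lines. The following is an illustrative Python toy, not the authors' R/party implementation: trees are replaced by decision stumps, and all names (fit_stump, votes, counts) are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: one informative predictor; y = 1 when X[:, 0] is large.
n = 200
X = rng.normal(size=(n, 5))
y = (X[:, 0] > 0).astype(int)

def fit_stump(X, y):
    """Tiny stand-in for a tree: among a random mtry-subset of
    predictors, pick the predictor/threshold pair with the smallest
    misclassification rate (allowing the inverted rule)."""
    best = None
    for j in rng.choice(X.shape[1], size=3, replace=False):  # mtry = 3
        for thr in np.quantile(X[:, j], [0.25, 0.5, 0.75]):
            pred = (X[:, j] > thr).astype(int)
            err = np.mean(pred != y)
            if best is None or min(err, 1 - err) < best[0]:
                flip = err > 0.5
                best = (min(err, 1 - err), j, thr, flip)
    return best[1:]

ntree = 25
votes = np.zeros(n)   # sum of class-1 votes per observation
counts = np.zeros(n)  # number of trees for which the observation was OOB
for _ in range(ntree):
    # Each "tree" is fitted to a random subsample (without replacement);
    # the left-out observations are its OOB observations.
    sub = rng.choice(n, size=int(0.632 * n), replace=False)
    oob = np.setdiff1d(np.arange(n), sub)
    j, thr, flip = fit_stump(X[sub], y[sub])
    pred = (X[oob, j] > thr).astype(int)
    votes[oob] += (1 - pred) if flip else pred
    counts[oob] += 1

# Final classification: majority vote over the trees for which each
# observation was out-of-bag, giving an internal accuracy estimate.
oob_pred = (votes / np.maximum(counts, 1) > 0.5).astype(int)
oob_acc = np.mean(oob_pred == y)
print(round(oob_acc, 2))
```

Even with stumps instead of full trees, the subsample/OOB bookkeeping mirrors the forest construction used throughout the paper.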
Random forest variable importance measures
RF's variable importance measures are often used for feature selection in high-dimensional data settings, which makes them especially attractive for bioinformatics and related fields, where identifying a subset of relevant predictors from a large set of candidate predictors is a major challenge (known as the "small n large p" problem). The two standard VIMs for feature selection with RF are the Gini VIM and the permutation VIM. Roughly speaking, the Gini VIM of a predictor of interest is the sum over the forest of the decreases in Gini impurity generated by this predictor whenever it was selected for splitting, scaled by the number of trees. This measure has been shown to prefer certain types of predictors [22-25]. The resulting predictor ranking should therefore be treated with caution. That is why in this paper we focus on the permutation VIM, which gives essentially unbiased error rate rankings of the predictors.

Error-rate-based permutation VIM
From now on, we denote the standard permutation VIM as "error-rate-based permutation VIM", since it is based on the OOB error rate, as outlined below. More precisely, it measures the difference between the OOB error rate after and before permuting the values of the predictor of interest. The error-rate-based permutation variable importance (VI) for predictor j is defined by:

VI_j^(ER) = (1/ntree) * sum_{t=1}^{ntree} ( ER_{t,j~} - ER_{t,j} )    (1)

where

ntree denotes the number of trees in the forest,
ER_{t,j} denotes the mean error rate over all OOB observations in tree t before permuting predictor j,
ER_{t,j~} denotes the mean error rate over all OOB observations in tree t after randomly permuting predictor j.

The idea underlying this VIM is the following: If the predictor is not associated with the response, the permutation of its values has no influence on the classification, and thus also no influence on the error rate. The error rate of the forest is not substantially affected by the permutation, and the VI of the predictor takes a value close to zero, indicating no association between the predictor and the response. In contrast, if response and predictor are associated, the permutation of the predictor values destroys this association. "Knocking out" this predictor by permuting its values results in a worse classification, leading to an increased error rate. The difference in error rates after and before randomly permuting the predictor thus takes a positive value, reflecting the high importance of this predictor.

A novel AUC-based permutation VIM
Our new AUC-based permutation VIM is closely related to the error-rate-based permutation VIM. They differ only with respect to the prediction accuracy measure: in a nutshell, the error rate of a tree involved in (1) is replaced by the area under the curve (AUC) [28]. We define the AUC-based permutation VI for predictor j as:

VI_j^(AUC) = (1/ntree*) * sum_{t=1}^{ntree*} ( AUC_{t,j} - AUC_{t,j~} )    (2)

where

ntree* denotes the number of trees in the forest whose OOB observations include observations from both classes,
AUC_{t,j} denotes the area under the curve computed from the OOB observations in tree t before permuting predictor j,
AUC_{t,j~} denotes the area under the curve computed from the OOB observations in tree t after randomly permuting predictor j.

Instead of computing the error rate for each tree after and before permuting a predictor, the AUC is computed. The AUC for a tree is based on the so-called class probabilities, i.e. the estimated probability of each observation to belong to the class Y = 0 or Y = 1, respectively. The class probabilities of an observation are determined by the relative amount of training observations belonging to the corresponding class in the terminal node into which the observation falls. If one considers an OOB observation with Y = 0 and an OOB observation with Y = 1, a "good tree" is expected to assign a larger class probability for class Y = 1 to the observation truly belonging to class Y = 1 than to the observation belonging to class Y = 0. The AUC for a tree corresponds to the proportion of pairs for which this is the case. It can be seen as an estimator of the probability that a randomly chosen observation from class Y = 1 receives a higher class probability for class Y = 1 than a randomly chosen observation from class Y = 0. Note that with the use of the AUC, the information contained in the class probabilities returned by a tree is adequately exploited. This is not the case for the error rate, which requires a dichotomization of the class probabilities. From a practical point of view, the AUC is computed by making use of its equivalence with the Mann–Whitney U statistic. The Mann–Whitney U statistic is solely based on the rankings of two independent samples. AUC values of 1 correspond to a perfect tree classifier, since a perfect classifier would attribute to each observation from one class a higher probability of belonging to this class than to any observation from the other class. AUC values of 0.5 correspond to a useless tree classifier that randomly allocates class probabilities to the observations. In this case, in about half of the cases a randomly drawn observation from one class receives a higher probability of belonging to that class than a randomly drawn observation from the other class.
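The per-tree quantity in (2) can be sketched in Python via the Mann–Whitney equivalence described above. This is an illustrative sketch, not the party implementation; the helper names (auc, tree_auc_vi) and the predict_proba argument are ours.

```python
import numpy as np

def auc(scores_class0, scores_class1):
    """AUC via its Mann-Whitney U equivalence: the proportion of
    (class-0, class-1) pairs in which the class-1 observation gets
    the higher score, counting ties as 1/2."""
    s0 = np.asarray(scores_class0, dtype=float)
    s1 = np.asarray(scores_class1, dtype=float)
    wins = (s1[:, None] > s0[None, :]).sum()
    ties = (s1[:, None] == s0[None, :]).sum()
    return (wins + 0.5 * ties) / (len(s0) * len(s1))

def tree_auc_vi(predict_proba, X_oob, y_oob, j, rng):
    """One tree's contribution to the AUC-based permutation VI:
    AUC on the OOB sample before permuting predictor j, minus the
    AUC after permuting it. `predict_proba` stands for any tree
    returning class-1 probabilities."""
    p_before = predict_proba(X_oob)
    X_perm = X_oob.copy()
    rng.shuffle(X_perm[:, j])           # randomly permute predictor j
    p_after = predict_proba(X_perm)
    before = auc(p_before[y_oob == 0], p_before[y_oob == 1])
    after = auc(p_after[y_oob == 0], p_after[y_oob == 1])
    return before - after

# Sanity check of the AUC helper: perfect separation gives 1.0.
print(auc([0.1, 0.2], [0.8, 0.9]))  # 1.0
```

Averaging tree_auc_vi over the ntree* trees whose OOB sample contains both classes yields the quantity defined in (2).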
The novel AUC-based permutation VIM is implemented in the package party for the unbiased RF variant based on conditional inference trees. Note that the discrepancy in performance between the standard permutation VIM and the AUC-based permutation VIM is transferable to the original version of RF, since the VI ranking mechanism is completely independent of the construction of the trees.

Comparison studies
The behavior of the two introduced permutation VIMs is expected to be different in the presence of unbalanced data. The AUC is a prediction accuracy measure which puts the same weight on both classes independently of their sizes [26]. The error rate, in contrast, gives essentially more weight to the majority class because it does not take class affiliations into account and regards all misclassifications as equally important. In the results section we try to explain the consequences for the performance of the permutation VIMs for unbalanced data settings and provide evidence for our supposition. We performed studies on simulated and on real data to explore and contrast the performance of both permutation VIMs. Using simulated data we aim to see whether total sample size and effect size play a role for the class imbalance problem. We explored this by varying the total number of observations and by simulating predictors with different effect sizes. Furthermore we conducted analyses based on real data to provide additional evidence based on realistic data structures, which usually incorporate complex interdependencies. Our comparison studies on simulated and on real data were conducted using the unbiased RF variant based on conditional inference trees. The implementation of this unbiased RF variant is available in the R system for statistical computing via the package party [29].

Simulated data
The considered simulation design represents a scenario where the predictors associated with the binary response variable Y are to be identified from a set of continuous predictors. We performed simulations for varying imbalance levels: 50% corresponding to a completely balanced sample, and 40%, 30%, 20%, 10%, 5% and 1% corresponding to different imbalance levels from slight to very extreme class imbalances. The simulation setting comprises both predictors not associated with the response and associated predictors with three different levels of effect sizes. Table 1 presents the data setting used throughout this simulation.

Table 1 Distribution of predictors in class 1 and class 2

Predictors       Distribution in class 1   Distribution in class 2   Effect size
X1, ..., X5      N(1.00, 1)                N(0, 1)                   strong effect
X6, ..., X10     N(0.75, 1)                N(0, 1)                   moderate effect
X11, ..., X15    N(0.50, 1)                N(0, 1)                   weak effect
X16, ..., X65    N(0, 1)                   N(0, 1)                   no effect

The first five predictors X1, ..., X5 differ strongly between the classes, with mean μ1 = 1 in one class and mean μ2 = 0 in the other class. The predictors X6, ..., X10 have a moderate mean difference between the two classes, with μ1 = 0.75 and μ2 = 0. For X11, ..., X15 there is only a small difference between the classes, with μ1 = 0.5 and μ2 = 0. We simulated 50 additional predictors following a standard normal distribution with no association to the response variable (termed noise predictors).

We performed analyses with varying sample sizes and report the results for total sample sizes of n = 100, n = 500 and n = 1000. For each parameter combination, i.e. imbalance level and sample size, we simulated 100 datasets and computed AUC-based and error-rate-based permutation VIs for each dataset. Note that for a sample size of n = 100 an imbalance of 1% is not meaningful since there is only one observation in the minority class.

Forest and tree parameters were held fixed. The parameter ntree denoting the number of trees in a forest was set to 1000, and the parameter for the number of candidate splits mtry was set to the default value of 5. We used subsampling instead of bootstrap sampling for constructing the trees, i.e. setting the parameter replace to FALSE [22]. Conditional inference trees were grown to maximal possible depth, i.e. setting the parameters minsplit, minbucket and mincriterion in the cforest function to zero.
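The Table 1 design can be sketched in Python as follows. This is an illustrative sketch (the study itself was run in R with the party package); the function name simulate_dataset is ours.

```python
import numpy as np

def simulate_dataset(n, minority_fraction, rng):
    """Draw one dataset following the Table 1 design: 15 associated
    predictors with class-1 means 1.0, 0.75 and 0.5 (five each) and
    50 noise predictors, all with unit variance."""
    n1 = max(1, int(round(minority_fraction * n)))  # minority class size
    y = np.array([1] * n1 + [0] * (n - n1))
    means = np.array([1.0] * 5 + [0.75] * 5 + [0.5] * 5 + [0.0] * 50)
    X = rng.normal(size=(n, 65))
    X[y == 1] += means      # shift the associated predictors in class 1
    return X, y

rng = np.random.default_rng(1)
X, y = simulate_dataset(500, 0.05, rng)   # 5% imbalance level
print(X.shape, int(y.sum()))              # (500, 65) 25
```

Varying minority_fraction over 0.5, 0.4, ..., 0.01 and n over 100, 500 and 1000 reproduces the grid of settings described above.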
Real data
We also investigated the performance of the error-rate-based and the AUC-based permutation VIM on real data including complex dependencies (e.g. correlations) and predictors of different scales. The dataset is about RNA editing in land plants [30]. RNA editing is the modification of the RNA sequence from the corresponding DNA template. It occurs e.g. in plant mitochondria, where some cytidines are converted to uridines before translation (abbreviated as C-to-U conversion in the following). The dataset comprises a total of 43 predictors: 41 categorical predictors (40 nucleotides at positions −20 to 20 relative to the edited site and one predictor describing the codon position) and two continuous predictors (one for the estimated folding energy and one predictor describing the difference in estimated folding energy between pre-edited and edited sequences). It includes 2694 observations, where exactly one half has an edited site and the other half has a non-edited site. The data are publicly available from the journal's homepage. After excluding observations with missing values, a total of 2613 observations were left, of which 1307 had a non-edited site and 1306 had an edited site. We used this balanced dataset to explore the performance of the error-rate-based and AUC-based permutation VIMs for varying class imbalances – but now with realistic dependencies and predictors of different scales. For this purpose, we artificially created different imbalance levels by drawing random subsets from the class with edited sites.

Application of the standard permutation VIM to the data using the 2613 observations without missing values gave VIs greater than zero for all 43 predictors for different random seeds (i.e. different starting values for the random permutation), indicating that all predictors seem to have at least a small predictive power (data not shown). We generated and added additional predictors without any effect (termed noise predictors in the following) in order to evaluate the performance of the error-rate-based and AUC-based permutation VIMs. Provided that there is a higher association between the response and any of the original predictors than between the response and any of the simulated noise predictors, a well performing VIM would attribute a higher VI to original predictors than to simulated noise predictors. The noise predictors were generated by randomly permuting the values of the original predictors. Each original predictor was permuted once, resulting in a total of 43 noise predictors. The whole process, consisting of (1) creating 43 noise predictors, (2) merging them with the original dataset, (3) randomly subsampling to create an unbalanced dataset and (4) computing the error-rate-based and AUC-based permutation VIs, was repeated 100 times for each imbalance level to get stable results for the VIM performance. To check the assumption that there is a higher association between the response and any of the original predictors than between the response and any of the simulated predictors, we computed the mean VI over 100 completely balanced datasets that had been extended by noise predictors. Figure 1 shows that all mean VIs of the original predictors are higher than any mean VI of a simulated noise predictor and hence confirms our first impression.

Performance evaluation criteria
VIMs give a ranking of the predictors according to their association with the response. To evaluate the quality of the rankings by the permutation VIMs, the AUC was used as performance measure. The AUC was computed to assess the ability of a VIM to differentiate between associated predictors and predictors not associated with the response. AUC values of 1 mean that each associated predictor receives a higher VI than any noise predictor, thus indicating a perfect discrimination. AUC values of 0.5 mean that a randomly drawn associated predictor receives a higher VI than a randomly drawn noise predictor in only half of the cases, indicating no discriminative ability.

For our comparison studies we defined the two classes which are to be differentiated by a VIM in the following way. In the first instance of our studies on simulated data, all predictors which are associated with the response formed one class and noise predictors built the other class.
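The evaluation criterion above is itself a Mann–Whitney-type AUC over the two groups of VIs. An illustrative Python sketch (the function name vim_quality_auc is ours, and the numbers in the example are made up):

```python
import numpy as np

def vim_quality_auc(vi_associated, vi_noise):
    """Probability that a randomly drawn associated predictor receives
    a higher VI than a randomly drawn noise predictor (ties counted as
    1/2): 1.0 means perfect separation, 0.5 means no discrimination."""
    a = np.asarray(vi_associated, dtype=float)
    b = np.asarray(vi_noise, dtype=float)
    wins = (a[:, None] > b[None, :]).sum()
    ties = (a[:, None] == b[None, :]).sum()
    return (wins + 0.5 * ties) / (a.size * b.size)

# Every associated predictor outranks every noise predictor -> 1.0
print(vim_quality_auc([0.04, 0.03, 0.02], [0.001, -0.002]))  # 1.0
```

Applied to the VIs of one simulated dataset, this yields one AUC value; the boxplots discussed later summarize 100 such values per setting.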
[Figure 1: bar plot of mean VIs on a y-axis from 0.00 to 0.03. The x-axis lists the 43 original predictors (X.20, ..., X.1, X1, ..., X20, cp, fe, dfe) followed by the 43 noise predictors (noise1, ..., noise43).]
Figure 1 Mean VIs for the 43 original predictors and 43 noise predictors from the balanced modified C-to-U conversion dataset. Mean VIs were obtained by averaging the VIs (computed with the commonly used error-rate-based permutation VIM) over 100 extended versions of the C-to-U conversion dataset.
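The construction behind Figure 1 – one randomly permuted copy of each original predictor as a noise predictor, followed by subsampling one class to create imbalance – can be sketched as follows. This is an illustrative Python sketch with our own function names; the study itself used R and the party package.

```python
import numpy as np

def add_noise_predictors(X, rng):
    """Append one randomly permuted copy of each original predictor;
    permuting breaks any association with the response while keeping
    each predictor's marginal distribution intact."""
    noise = X.copy()
    for j in range(noise.shape[1]):
        rng.shuffle(noise[:, j])      # permute each column independently
    return np.hstack([X, noise])

def subsample_minority(X, y, minority_fraction, rng):
    """Keep all class-0 observations and draw a random subset of class 1
    so that class 1 makes up `minority_fraction` of the new dataset."""
    keep0 = np.flatnonzero(y == 0)
    idx1 = np.flatnonzero(y == 1)
    n1 = int(round(len(keep0) * minority_fraction / (1 - minority_fraction)))
    keep1 = rng.choice(idx1, size=n1, replace=False)
    keep = np.concatenate([keep0, keep1])
    return X[keep], y[keep]

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.repeat([0, 1], 20)
Xn = add_noise_predictors(X, rng)
Xs, ys = subsample_minority(Xn, y, 0.2, rng)
print(Xn.shape, float(ys.mean()))  # (40, 6) 0.2
```

Repeating these two steps 100 times per imbalance level, and recomputing both permutation VIMs each time, corresponds to steps (1)-(4) described above.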
In more detailed subsequent analyses we then explored the ability of the VIMs to discriminate between predictors with the same effect size and predictors without an effect. For this analysis one class comprised the noise predictors while the other class comprised only predictors with the same effect. For the studies on real data it was not possible to conduct such detailed analyses because the true ordering of the predictors according to their association with the response is not known. Hence in the analysis on real data we restricted our analysis to the discrimination between original predictors forming one class and simulated noise predictors forming the other class.

Results and discussion
Why may the error-rate-based permutation VIM fail in case of class imbalance?
The prioritisation of the majority class in unbalanced data settings is well known in the context of RF classification and can easily be seen from trees constructed on unbalanced data. Trees trained on unbalanced data more often predict the majority class, which leads to the minimization of the overall error rate. But how does this affect the performance of the permutation VIMs? And why is the AUC-based permutation VIM expected to be more robust towards class imbalance than the commonly used error-rate-based permutation VIM?

To answer these questions we consider an extremely unbalanced data setting and illustrate what happens in a tree when permuting the values of an associated predictor. We will first have a look at observations from the majority class. For this class nearly all observations are correctly classified by a tree which has been trained on extremely unbalanced data. If we now permute the values of an associated predictor, this does generally not result in a classification into the minority class, since a classification into the minority class is an unlikely event – even for an observation from this class. A very specific data pattern is required for an observation to be classified into the minority class. It is unlikely that a random permutation of an associated predictor results in such a specific data pattern just by chance. Thus, for the majority class we expect hardly any observation to be incorrectly classified into the minority class after the permutation of an associated predictor. The error rate therefore does not considerably increase after the permutation of an associated predictor, finally leading to a rather low contribution to the VI.

Now let us consider the classifications by a tree for observations from the minority class. For an extreme class imbalance most of the observations from the minority class are falsely classified into the majority class due to the above described focus on the majority class. It might be the case that some observations from the minority class are correctly classified by the tree because these observations have that specific pattern of predictor values which is required for an observation to be classified into the minority class. It is likely that a permutation of the values of an associated predictor might then destroy that specific pattern, so that after the permutation these observations are no longer identified as belonging to the minority class. Thus a misclassification due to the elimination of an associated predictor is much more likely to appear in observations from the minority class than in observations from the majority class. Note that only a small number of observations from the minority class are affected, since most of the observations from the minority class are classified into the majority class anyway (before as well as after the permutation). The change in error rates is thus expected to be rather small – albeit more pronounced than the change in error rates in the majority class.

Note that the error-rate-based permutation VIM does not take class affiliations into account. Thus the change in error rates is actually not computed separately for each class. Yet, in order to better understand the behavior of the VIM, it may help to point out that if the class proportions were the same in all OOB samples, the VI of a predictor could be directly derived as the weighted average of the class-specific differences in the error rates. The weights would correspond to the proportion of observations from the respective class. In practice the class frequencies will not be equal in all OOB samples, but the concept of a weighted average of the class-specific error rates illustrates the fact that for unbalanced data settings the VI is mainly driven by the change in error rates derived from observations from the majority class. Since the change in error rates in the majority class is expected to be much smaller compared to the change in error rates in the minority class, the computed VIs are rather low. This results in low VIs even for associated predictors and in a poor differentiation between associated predictors and predictors not associated with the response.

Class specific VIs
This theory is supported by computing class-specific VIs (corresponding to mean changes in error rates computed only from observations belonging to the same class). Computing class-specific VIs was done using the R package randomForest implementing the standard RF algorithm. The importance function of this package provides permutation VIs computed separately for each class (besides the VIs by the standard permutation VIM and by the Gini VIM). The class-specific VIs for a total sample size of n = 500 and an imbalance level of 5% are shown in Figure 2, where predictors X1 to X15 have an effect while the remaining 50 predictors do not have an effect, corresponding to the simulation setting previously described in Table 1 in the context of the comparison study (for simplicity, we use the same setting as in the comparison study, although the problem addressed here is a different one).
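The weighted-average argument above can be made concrete with a toy calculation. The per-class numbers below are hypothetical (ours, not results from the paper); they only illustrate how the majority class dominates the overall change in error rates.

```python
# Toy illustration: with a 95%/5% class split, the overall change in
# the OOB error rate is (approximately) the weighted average of the
# class-specific changes, so the majority class dominates the VI.
n_major, n_minor = 950, 50

# Hypothetical per-class increases in the error rate after permuting
# an associated predictor: almost no change in the majority class,
# a clear change in the minority class.
delta_major, delta_minor = 0.002, 0.200

overall = (n_major * delta_major + n_minor * delta_minor) / (n_major + n_minor)
print(round(overall, 4))  # 0.0119
```

Even though the minority-class error rate rises by 0.2, the overall change (and hence the error-rate-based VI) stays close to the tiny majority-class change.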
Different sample sizes and imbalance levels give similar results (thus not shown). They confirm our argumentation that the change in the error rates computed from OOB observations from the majority class is smaller than the change in error rates computed from OOB observations from the minority class. This results in an underestimation of the actual permutation VI due to a much higher weighting of the majority class in the computation of the VI (see the concordance of VIs in the middle and lower panels of Figure 2). The discrepancy between the VIs computed from observations of the minority class and the VIs computed from observations of the majority class depends on the class imbalance and is more pronounced for more extreme class imbalances.

This motivates the use of an alternative accuracy measure which better incorporates the minority class. While the error rate gives the same weight to all observations, therefore focusing more on the majority class, the AUC is a measure which does not prefer one class over the other but instead puts exactly the same weight on both classes. Therefore the AUC-based permutation VIM is expected to detect changes in tree predictions for observations from the minority class, which might not be grasped by the error-rate-based permutation VIM due to a much higher weighting of the majority class. The VIs for associated predictors obtained by the AUC-based permutation VIM are thus expected to be comparatively higher than the VIs obtained by the error-rate-based permutation VIM. This would result in a better differentiation of associated and noise predictors by the AUC-based permutation VIM. These conjectures are assessed in the comparison study presented in the next section. (An additional performance comparison between the AUC-based permutation VIM and the error-rate-based permutation VIM based only on observations from the minority class is documented in Additional file 1.)

Comparison study with simulated data
The performance of the error-rate-based and AUC-based VIMs, as measured by the AUC, is shown in Figure 3 for the three different total sample sizes of n = 100 (left panel), n = 500 (middle panel) and n = 1000 observations (right panel) and for different class imbalance levels. Filled boxes correspond to the AUC-based permutation VIM and unfilled boxes correspond to the error-rate-based permutation VIM.
Janitza et al. BMC Bioinformatics 2013, 14:119 Page 8 of 11
http://www.biomedcentral.com/1471-2105/14/119
Figure 3 Distribution of AUC-values for 100 simulated datasets for AUC-based (filled) and error-rate-based (unfilled) permutation VIMs for different class imbalances (x-axis: class imbalance level, from 50% down to 1%; y-axis: AUC). The AUC is used to assess the ability of a VIM to discriminate between predictors with an effect and predictors without an effect. Distributions are shown for total sample sizes of n = 100 (left panel), n = 500 (middle panel) and n = 1000 (right panel).
Figure 3 shows that the performance of both VIMs decreases with an increasing class imbalance for all sample sizes. Note that the decrease in performance for both VIMs is not solely attributable to the imbalance ratio per se but also to the reduced number of observations in the minority class with an increasing class imbalance. This is induced by the simulation setting, since we held the total number of observations fixed and varied the number of observations in both classes to create different class imbalances. If there are only few observations in one class, then the tree predictions are less accurate. However, the performance of the AUC-based permutation VIM decreases less dramatically than the performance of the error-rate-based permutation VIM. The discrepancy in performance between the VIMs increases with increasing imbalance level and is maximal for the most extreme class imbalance. While for a sample size of n = 500 the error-rate-based permutation VIM is no longer able to discriminate between associated and noise predictors for the most extreme class imbalance of 1% (AUC values randomly vary around 0.5), the AUC-based permutation VIM still is, showing that it can be used to identify associated predictors even if the minority class comprises only few observations. It can be ruled out that the better performance of the AUC-based permutation VIM is due to chance, since the distributions of AUC values differ significantly. Furthermore, this difference in performance between both VIMs becomes even larger for larger sample sizes.

In a nutshell, in this first simulation the AUC-based permutation VIM performed better in case of class imbalance. The following subsections focus on the influence of sample size and effect size on the respective performance of both permutation VIMs in unbalanced data settings.

Influence of sample size
In Figure 3, the performance of both VIMs improves with an increased total sample size for a fixed imbalance level, since an increase in the sample size results in more accurate tree predictions. The right panel of Figure 3 shows that both permutation VIMs are hardly affected by class imbalances up to 10% when the sample size is rather large (n = 1000). If the sample size is smaller (n = 100), however, the performance of the VIMs is considerably decreased for a 10% imbalance level. A decrease in performance for a 10% imbalance level is also observed for a sample size of n = 500, especially for the error-rate-based permutation VIM. In a nutshell, class imbalance seems to be more problematic for the permutation VIMs if the total sample size is small.

Influence of effect size
We now explore the ability of the permutation VIMs to identify predictors with different effect sizes in the presence of unbalanced data. The AUC was again used as an evaluation criterion to compare the ability of the AUC-based and error-rate-based permutation VIMs to discriminate between associated and non-associated predictors. Here the evaluation was done for each effect size separately, meaning that one class comprised all the noise predictors while the other class comprised only predictors with the considered effect size (either strong, moderate or weak). Figure 4 shows the results for the setting with n = 100; the results for the other sample sizes are shown in Additional file 2. The left panel of Figure 4 shows the performance of both permutation VIMs according to their ability to discriminate between predictors with weak effects and predictors without an effect. The middle panel corresponds to the AUC values for predictors with a moderate effect versus
Figure 4 Distribution of AUC-values for 100 simulated datasets for AUC-based (filled) and error-rate-based (unfilled) permutation VIMs for different class imbalances (x-axis: class imbalance level, from 50% down to 1%; y-axis: AUC). The AUC is used to assess the ability of a VIM to discriminate between noise predictors and predictors with a weak (left panel), moderate (middle panel) and strong (right panel) effect. Distributions are shown for a total sample size of n = 100.
noise predictors, and the right panel corresponds to the AUC values for predictors with a strong effect versus noise predictors.

Unsurprisingly, for both permutation VIMs predictors having only a weak effect are less discriminable from noise predictors than predictors with stronger effects. For imbalances up to 20% both VIMs identify nearly all predictors with a strong effect. Obviously there are unbalanced data settings where the standard permutation VIM still perfectly separates noise predictors from predictors with pronounced effects. We conclude that class imbalance is more problematic if predictors with weak effects are to be identified, while it plays a minor role if the classes are well separable.
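The evaluation criterion used throughout this comparison — an AUC quantifying how well a VIM ranks associated predictors above noise predictors — can be sketched as follows. The snippet is plain Python, and the VI values are invented purely for illustration:

```python
def ranking_auc(vi_associated, vi_noise):
    """Mann-Whitney AUC of a VIM's ranking: the probability that a randomly
    chosen associated predictor receives a higher VI than a randomly chosen
    noise predictor (1.0 = perfect separation, 0.5 = random ranking)."""
    pairs = len(vi_associated) * len(vi_noise)
    wins = sum((a > n) + 0.5 * (a == n) for a in vi_associated for n in vi_noise)
    return wins / pairs

# Invented VI values for one simulated dataset (illustration only).
vi_strong = [0.040, 0.035, 0.050]          # predictors with strong effects
vi_weak = [0.012, 0.004, 0.009]            # predictors with weak effects
vi_noise = [0.001, -0.002, 0.006, 0.000]   # noise predictors

auc_strong = ranking_auc(vi_strong, vi_noise)  # 1.0: strong effects fully separated
auc_weak = ranking_auc(vi_weak, vi_noise)      # 11/12: weak effects overlap with noise
```

This is the statistic summarised by the boxplots in Figures 3 and 4: one such AUC value per simulated dataset and per VIM.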
Comparison study with real data
Figure 5 shows the distribution of AUC values for 100 modified C-to-U conversion datasets for varying imbalance levels. For the balanced dataset and for slight class imbalances up to 40%, both VIMs have perfect discriminative ability, since all associated predictors receive a higher VI than any noise predictor. Overall, the performance of both VIMs decreases with an increasing class imbalance. Note that the decreasing performance for increasing class imbalances might be partly attributable to the reduced total sample size, as the class imbalance was created by randomly subsampling observations from the class with the edited sites. When comparing both VIMs, the AUC-based permutation VIM significantly outperformed the standard permutation VIM. For an imbalance of 30% the AUC-based permutation VIM clearly identified more associated predictors than the error-rate-based permutation VIM. The superiority of the AUC-based permutation VIM over the standard permutation VIM increased with an increasing class imbalance, and for imbalances between 15% and 5% the discrepancy between the performance of the AUC-based and the standard permutation VIM was maximal.

Overall, this study on real data impressively shows that the AUC-based permutation VIM also works for complex real data and outperforms the standard permutation VIM in almost all class imbalance settings.

Conclusions
The problem of unbalanced data has been widely discussed in the literature for diverse classifiers, including random forests. Many approaches have been developed to improve the predictive ability of RF classifiers for unbalanced data settings. However, less attention has been paid to the behaviour of random forests' variable importance measures for unbalanced data. In this paper we explored the performance of the permutation VIM for different class imbalances and proposed an alternative permutation VIM which is based on the AUC.

Our studies on simulated as well as on real data show that the commonly used error-rate-based permutation VIM loses its ability to discriminate between associated predictors and predictors not associated with the response for increasing class imbalances. This is particularly crucial for small sample sizes and if predictors with weak effects are to be detected. The decreasing performance of the standard permutation VIM results from two sources: the class imbalance at the training data level, leading to trees more often predicting the majority class, and the class imbalance at the OOB data level, leading to blurred VIs due to a much higher weighting of error rate differences in the majority class. A higher weighting of the majority class in the VI calculation is problematic because the difference in error rates is shown to be less pronounced in the majority class than in the minority class. Note that in some cases it might be interesting to assess the increase in error rate obtained when a certain predictor is removed; in this case the error-rate-based permutation VIM can be considered. If the goal is to rank the predictors according to their discrimination power, however, the AUC-based permutation VIM should be preferred.

The problem of imbalance at the OOB data level is directly addressed by the novel AUC-based permutation VIM. This VIM puts the same weight on both classes by measuring the difference in AUCs instead of the difference in error rates. It is thus able to detect changes in tree predictions when permuting associated predictors which might not be grasped by the standard permutation VIM. In contrast, the imbalance at the training data level is not addressed by the AUC-based permutation VIM, meaning that the structure of a tree remains untouched. On the one hand this is a drawback, since class predictions before and after permuting a predictor are similar even if the respective predictor is associated with the response, resulting in a reduced change in the AUCs. On the other hand, preserving the tree structure can be regarded as an advantage, since a change in tree structure might open space for new unexpected behaviours. It is a major advantage of our novel AUC-based permutation VIM that it is based on exactly the same principle as the standard permutation VIM and differs from it only with respect to the accuracy measurement. It is thus expected to share the advantages of the standard permutation VIM and the properties and behaviours discovered in recent years (e.g. its behaviour in the presence of correlated predictors [31], of predictors with different scales [22] and of different category sizes in the predictors [24,25]).

Our studies on simulated as well as on real data show that in unbalanced data settings the AUC-based permutation VIM outperforms the commonly used error-rate-based permutation VIM, as well as the error-rate-based permutation VIM computed only from observations of the minority class (see Additional file 1 for the comparison to the class-specific VIM). The difference in performance between our novel AUC-based permutation VIM and the standard permutation VIM can be substantial, especially for extremely unbalanced data settings. But even for slight class imbalances the AUC-based permutation VIM has been shown to be superior to the standard permutation VIM. We conclude from our studies that the AUC-based permutation VIM should be preferred to the standard permutation VIM whenever the two response classes have different sizes and the aim is to identify relevant predictors.

Availability and requirements
The AUC-based permutation VIM is implemented in the new version of the party package for the freely available statistical software R (http://www.r-project.org