An AUC-based Permutation Variable Importance Measure for Random Forests; Janitza, University of Munich; BMC Bioinformatics

This article introduces a novel AUC-based permutation variable importance measure (VIM) for random forests, aimed at improving classification performance in unbalanced data settings. The study demonstrates that the AUC-based VIM outperforms the standard permutation VIM when class sizes are imbalanced, while both measures perform similarly under balanced conditions. The new method is implemented in the R package party, enhancing the analysis of high-dimensional data in bioinformatics and related fields.


Janitza et al. BMC Bioinformatics 2013, 14:119
http://www.biomedcentral.com/1471-2105/14/119

METHODOLOGY ARTICLE  Open Access

An AUC-based permutation variable importance measure for random forests

Silke Janitza1*, Carolin Strobl2 and Anne-Laure Boulesteix1

Abstract
Background: The random forest (RF) method is a commonly used tool for classification with high-dimensional data as well as for ranking candidate predictors based on the so-called random forest variable importance measures (VIMs). However, the classification performance of RF is known to be suboptimal in case of strongly unbalanced data, i.e. data where response class sizes differ considerably. Suggestions have been made to obtain better classification performance based either on sampling procedures or on cost-sensitivity analyses. However, to our knowledge, the performance of the VIMs has not yet been examined in the case of unbalanced response classes. In this paper we
explore the performance of the permutation VIM for unbalanced data settings and introduce an alternative
permutation VIM based on the area under the curve (AUC) that is expected to be more robust towards class
imbalance.
Results: We investigated the performance of the standard permutation VIM and of our novel AUC-based
permutation VIM for different class imbalance levels using simulated data and real data. The results suggest that the
new AUC-based permutation VIM outperforms the standard permutation VIM for unbalanced data settings while
both permutation VIMs have equal performance for balanced data settings.
Conclusions: The standard permutation VIM loses its ability to discriminate between associated predictors and
predictors not associated with the response for increasing class imbalance. It is outperformed by our new
AUC-based permutation VIM for unbalanced data settings, while the performance of both VIMs is very similar in the
case of balanced classes. The new AUC-based VIM is implemented in the R package party for the unbiased RF
variant based on conditional inference trees. The code implementing our study is available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html.
Keywords: Random forest, Conditional inference trees, Variable importance measure, Feature selection, Unbalanced
data, Class imbalance, Area under the curve.
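The contrast that motivates the AUC-based VIM, namely that the error rate rewards trivial majority-class prediction while the AUC does not, can be reproduced with a short self-contained calculation. This is an illustrative sketch, not code from the paper; the trivial classifier and the 95/5 split below are hypothetical:

```python
# Hypothetical illustration (not from the paper): error rate vs. AUC
# for a trivial majority-class classifier on unbalanced data.

def auc_from_scores(scores_class1, scores_class0):
    """AUC via its Mann-Whitney U equivalence: the proportion of
    (class 1, class 0) pairs in which the class-1 observation gets
    the higher score, counting ties as 1/2."""
    wins = sum((s1 > s0) + 0.5 * (s1 == s0)
               for s1 in scores_class1 for s0 in scores_class0)
    return wins / (len(scores_class1) * len(scores_class0))

# 95 majority-class (Y=0) and 5 minority-class (Y=1) observations.
y = [0] * 95 + [1] * 5

# A trivial classifier always predicts Y=0 with a constant class probability.
predictions = [0] * 100
error_rate = sum(p != t for p, t in zip(predictions, y)) / len(y)

scores = [0.0] * 100  # identical class probabilities for every observation
auc = auc_from_scores([s for s, t in zip(scores, y) if t == 1],
                      [s for s, t in zip(scores, y) if t == 0])

print(error_rate)  # 0.05 -- looks good, but the classifier is useless
print(auc)         # 0.5  -- the AUC exposes the lack of discrimination
```

Replacing the error rate by such a pairwise AUC computation is exactly the substitution the article proposes for the permutation VIM.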

Background

In bioinformatics and related fields, such as statistical genomics and genetic epidemiology, data are often highly correlated, heterogeneous and high-dimensional, with the number of predictors, also known as features or descriptors, exceeding the number of observations. The random forest (RF) approach developed by Leo Breiman in 2001 [1] is particularly appropriate to handle such complex data [2]. In bioinformatics, RF is a commonly used tool for classification or regression purposes as well as for ranking candidate predictors through its inbuilt variable importance measures (VIMs). It has been used in many applications involving high-dimensional data. As a nonparametric method RF can deal with nonlinearity, interactions, correlated predictors and heterogeneity, which makes it attractive in genetic epidemiology [3-7]. However, in the context of classification, i.e. when the response to be predicted is a class membership, the classification performance of RF has been shown to be suboptimal in case of strongly unbalanced data [8-10], i.e. when class sizes differ considerably.

In epidemiology, unbalanced data are observed, e.g., in population-based studies where only a small number of subjects develop a certain disease over time, while most subjects remain healthy. Unbalanced data are also common in screening studies, where most of the screened

* Correspondence: [email protected]
1 Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, D-81377, Munich, Germany
Full list of author information is available at the end of the article

© 2013 Janitza et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

persons are negative, as well as in subclass analyses, e.g., if one wants to differentiate between different subtypes of cancer. Usually some subclasses are more common than other subclasses, leading to an imbalance in class sizes. Studies on rare diseases are a further example of unbalanced data settings in medicine. Data can be obtained only from few persons having the specific rare disease, while samples from healthy control persons are much easier to obtain. Of course unbalanced data are also relevant in various other areas of application beyond the biomedical field, e.g., the prediction of creditworthiness of a bank's customers [11], the detection of fraudulent telephone calls [12] or the detection of oil spills in satellite radar images [13], just to name a few examples. Unbalanced data may arise whenever the class memberships are observed after data collection.

Like many other classification methods, RF produces classification rules that do not accurately predict the minority class if data are unbalanced. The RF classifier allocates new observations more often to the majority class unless the difference between the classes is large and the classes are well separable. For extreme class imbalances, e.g. if the minority class includes only 5% of the observations, it might happen that the RF classifier allocates every observation to the majority class independently of the predictors, yielding a minimal error rate of 5%. Although this error rate of 5% is very small, such a trivial classification is of no practical use.

Some suggestions have been made to yield a useful classification based either on sampling procedures [14-17] or on cost-sensitivity analyses [14]. Sampling procedures create an artificial balance between two or more classes by oversampling the minority class and/or downsampling the majority class. Cost-sensitivity analyses attribute a higher cost to the misclassification of an observation from the minority class to impede the trivial systematic classification to the larger class. Both aspects have been widely discussed in the literature with respect to RF's classification performance [14,15,18-21]. Recent simulation studies [9] have shown that the performance of RF classification for unbalanced data depends on (i) the imbalance ratio, (ii) the class overlap and (iii) the sample size.

The impact of class imbalance on the RF VIM, however, has to our knowledge not yet been examined in the literature. In this article we focus on the permutation VIM, which is known to be almost unbiased and more reliable than the Gini VIM. The latter has been shown to have a preference for certain types of predictors [22-25] and therefore its rankings have to be treated with caution. We concentrate on the class imbalance problem for two response classes with respect to the permutation VIM. We investigate the mechanisms of changes in performance for unbalanced data settings and motivate the use of a new permutation VIM which is not based on the error rate but on the area under the curve (AUC). The AUC can be seen as an accuracy measure putting the same weight on both classes – in contrast to the error rate, which essentially gives more weight to the majority class. As such, the AUC is a particularly appropriate prediction accuracy measure in unbalanced data settings [26]. A permutation VIM in which the error rate is replaced by the AUC is therefore a promising alternative to the standard permutation VIM. We performed extensive simulation studies to explore and compare the behaviour of both permutation VIMs for different class imbalance levels, effect sizes and sample sizes.

Methods

The RF algorithm is a classification and regression method often used for high-dimensional data settings where the number of predictors exceeds the number of observations. Note that throughout this article we use the term predictors, which is equivalent to features or descriptors, denoting variables that are used to discriminate the response classes. In the RF algorithm several individual decision trees are combined to make a final prediction. The final prediction is then the average (for regression) or the majority vote (for classification) of the predictions of all trees in the forest. Each tree is fitted to a random sample of observations (with or without replacement) from the original sample. Observations not used to construct a tree are termed out-of-bag (OOB) observations for that tree. For each split in each tree a randomly drawn subset of predictors is assessed as candidates for splitting, and the predictor yielding the best split is finally chosen for the split. In the original version of RF developed by Leo Breiman [1], the selected split is the split with the largest decrease in Gini impurity. In a later version of RF, conditional inference tests are used for selecting the best split in an unbiased way [27]. For each split in a tree, each candidate predictor from the randomly drawn subset is globally tested for its association with the response, yielding a global p-value. The predictor with the smallest p-value is selected, and within this globally selected predictor the best split is finally chosen for the split.

Both forest versions implement so-called variable importance measures, which can be used to get a ranking of the predictors according to their association with the response. In the following, we briefly introduce the standard permutation VIM as well as our novel permutation VIM, which is based on the area under the curve.

Random forest variable importance measures

RF's variable importance measures are often used for feature selection for high-dimensional data settings

which makes it especially attractive for bioinformatics and related fields, where identifying a subset of relevant predictors from a large set of candidate predictors is a major challenge (known as the "small n large p" problem). The two standard VIMs for feature selection with RF are the Gini VIM and the permutation VIM. Roughly speaking, the Gini VIM of a predictor of interest is the sum over the forest of the decreases of Gini impurity generated by this predictor whenever it was selected for splitting, scaled by the number of trees. This measure has been shown to prefer certain types of predictors [22-25]. The resulting predictor ranking should therefore be treated with caution. That is why in this paper we focus on the permutation VIM, which gives essentially unbiased rankings of the predictors.

Error-rate-based permutation VIM

From now on, we denote the standard permutation VIM as "error-rate-based permutation VIM", since it is based on the OOB error rate, as outlined below. More precisely, it measures the difference between the OOB error rate after and before permuting the values of the predictor of interest. The error-rate-based permutation variable importance (VI) for predictor j is defined by:

    VI_j^{(ER)} = \frac{1}{ntree} \sum_{t=1}^{ntree} \left( ER_{t\tilde{j}} - ER_{tj} \right)    (1)

where

- ntree denotes the number of trees in the forest,
- ER_{tj} denotes the mean error rate over all OOB observations in tree t before permuting predictor j,
- ER_{t\tilde{j}} denotes the mean error rate over all OOB observations in tree t after randomly permuting predictor j.

The idea underlying this VIM is the following: If the predictor is not associated with the response, the permutation of its values has no influence on the classification, and thus also no influence on the error rate. The error rate of the forest is not substantially affected by the permutation, and the VI of the predictor takes a value close to zero, indicating no association between the predictor and the response. In contrast, if response and predictor are associated, the permutation of the predictor values destroys this association. "Knocking out" this predictor by permuting its values results in a worse classification, leading to an increased error rate. The difference in error rates before and after randomly permuting the predictor thus takes a positive value, reflecting the high importance of this predictor.

A novel AUC-based permutation VIM

Our new AUC-based permutation VIM is closely related to the error-rate-based permutation VIM. They only differ with respect to the prediction accuracy measure: in a nutshell, the error rate of a tree involved in (1) is replaced by the area under the curve (AUC) [28]. We define the AUC-based permutation VI for predictor j as:

    VI_j^{(AUC)} = \frac{1}{ntree^*} \sum_{t=1}^{ntree^*} \left( AUC_{tj} - AUC_{t\tilde{j}} \right)    (2)

where

- ntree* denotes the number of trees in the forest whose OOB observations include observations from both classes,
- AUC_{tj} denotes the area under the curve computed from the OOB observations in tree t before permuting predictor j,
- AUC_{t\tilde{j}} denotes the area under the curve computed from the OOB observations in tree t after randomly permuting predictor j.

Instead of computing the error rate for each tree after and before permuting a predictor, the AUC is computed. The AUC for a tree is based on the so-called class probabilities, i.e. the estimated probability of each observation to belong to the class Y = 0 or Y = 1, respectively. The class probabilities of an observation are determined by the relative amount of training observations belonging to the corresponding class in the terminal node into which the observation falls. If one considers an OOB observation with Y = 0 and an OOB observation with Y = 1, a "good tree" is expected to assign a larger class probability for class Y = 1 to the observation truly belonging to class Y = 1 than to the observation belonging to class Y = 0. The AUC for a tree corresponds to the proportion of pairs for which this is the case. It can be seen as an estimator of the probability that a randomly chosen observation from class Y = 1 receives a higher class probability for class Y = 1 than a randomly chosen observation from class Y = 0. Note that with the use of the AUC, the information contained in the class probabilities returned by a tree is adequately exploited. This is not the case for the error rate, which requires a dichotomization of class probabilities. From a practical point of view, the AUC is computed by making use of its equivalence with the Mann–Whitney U statistic. The Mann–Whitney U statistic is solely based on the rankings of two independent samples. AUC values of 1 correspond to a perfect tree classifier, since a perfect classifier would attribute to each observation from one class a higher probability of belonging to this class than to any observation from the other class. AUC values of 0.5 correspond to a useless tree classifier that randomly allocates class probabilities to

the observations. In this case, in about half the cases a randomly drawn observation from one class receives a higher probability of belonging to that class than a randomly drawn observation from the other class.

The novel AUC-based permutation VIM is implemented in the package party for the unbiased RF variant based on conditional inference trees. Note that the discrepancy in performance between the standard permutation VIM and the AUC-based permutation VIM is transferable to the original version of RF, since the VI ranking mechanism is completely independent from the construction of the trees.

Comparison studies

The behavior of the two introduced permutation VIMs is expected to be different in the presence of unbalanced data. The AUC is a prediction accuracy measure which puts the same weight on both classes independently of their sizes [26]. The error rate, in contrast, gives essentially more weight to the majority class because it does not take class affiliations into account and regards all misclassifications as equally important. In the results section we try to explain the consequences for the performance of the permutation VIMs for unbalanced data settings and provide evidence for our supposition. We performed studies on simulated and on real data to explore and contrast the performance of both permutation VIMs. Using simulated data we aim to see whether total sample size and effect size play a role for the class imbalance problem. We explored this by varying the total number of observations and by simulating predictors with different effect sizes. Furthermore, we conducted analyses based on real data to provide additional evidence based on realistic data structures, which usually incorporate complex interdependencies. Our comparison studies on simulated and on real data were conducted using the unbiased RF variant based on conditional inference trees. The implementation of this unbiased RF variant is available in the R system for statistical computing via the package party [29].

Simulated data

The considered simulation design represents a scenario where the predictors associated with the response variable Y (binary) are to be identified from a set of continuous predictors. We performed simulations for varying imbalance levels: 50% corresponding to a completely balanced sample, and 40%, 30%, 20%, 10%, 5% and 1% corresponding to different imbalance levels from slight to very extreme class imbalances. The simulation setting comprises both predictors not associated with the response and associated predictors with three different levels of effect sizes. Table 1 presents the data setting used throughout this simulation.

Table 1 Distribution of predictors in class 1 and class 2

Predictors       Distribution in class 1   Distribution in class 2   Effect size
X1, ..., X5      N(1.00, 1)                N(0, 1)                   strong effect
X6, ..., X10     N(0.75, 1)                N(0, 1)                   moderate effect
X11, ..., X15    N(0.50, 1)                N(0, 1)                   weak effect
X16, ..., X65    N(0, 1)                   N(0, 1)                   no effect

The first five predictors X1, ..., X5 differ strongly between classes, with mean μ1 = 1 in one class and mean μ2 = 0 in the other class. The predictors X6, ..., X10 have a moderate mean difference between the two classes, with μ1 = 0.75 and μ2 = 0. For X11, ..., X15 there is only a small difference between the classes, with μ1 = 0.5 and μ2 = 0. We simulated 50 additional predictors following a standard normal distribution with no association to the response variable (termed noise predictors).

We performed analyses with varying sample sizes and report the results for total sample sizes of n = 100, n = 500 and n = 1000. For each parameter combination, i.e. imbalance level and sample size, we simulated 100 datasets and computed AUC-based and error-rate-based permutation VIs for each dataset. Note that for a sample size of n = 100 an imbalance of 1% is not meaningful, since there is only one observation in the minority class.

Forest and tree parameters were held fixed. The parameter ntree, denoting the number of trees in a forest, was set to 1000; the parameter for the number of candidate splits, mtry, was set to the default value of 5. We used subsampling instead of bootstrap sampling for constructing the trees, i.e. setting the parameter replace to FALSE [22]. Conditional inference trees were grown to maximal possible depth, i.e. setting the parameters minsplit, minbucket and mincriterion in the cforest function to zero.

Real data

We also investigated the performance of the error-rate-based and the AUC-based permutation VIM on real data including complex dependencies (e.g. correlations) and predictors of different scales. The dataset is about RNA editing in land plants [30]. RNA editing is the modification of the RNA sequence from the corresponding DNA template. It occurs e.g. in plant mitochondria, where some cytidines are converted to uridines before translation (abbreviated as C-to-U conversion in the following). The dataset comprises a total of 43 predictors: 41 categorical predictors (40 nucleotides at positions −20 to 20 relative to the edited site and one predictor describing the codon position) and two continuous predictors (one for the estimated folding energy and one predictor describing the difference

in estimated folding energy between pre-edited and edited sequences). It includes 2694 observations, where exactly one half has an edited site and the other half has a non-edited site. The data are publicly available from the journal's homepage. After excluding observations with missing values, a total of 2613 observations were left, where 1307 had a non-edited site and 1306 observations had an edited site. We used this balanced dataset to explore the performance of the error-rate-based and AUC-based permutation VIMs for varying class imbalances – but now with realistic dependencies and predictors of different scales. For this purpose, we artificially created different imbalance levels by drawing random subsets from the class with edited sites.

Application of the standard permutation VIM to the data using the 2613 observations without missing values gave VIs greater than zero for all 43 predictors for different random seeds (i.e. different starting values for the random permutation), indicating that all predictors seem to have at least a small predictive power (data not shown). We generated and added additional predictors without any effect (termed noise predictors in the following) in order to evaluate the performance of the error-rate-based and AUC-based permutation VIMs. Provided that there is a higher association between the response and any of the original predictors than between the response and any of the simulated noise predictors, a well performing VIM would attribute a higher VI to original predictors than to simulated noise predictors. The noise predictors were generated by randomly permuting the values of the original predictors. Each original predictor was permuted once, resulting in a total of 43 noise predictors. The whole process, consisting of (1) creating 43 noise predictors, (2) merging them with the original dataset, (3) randomly subsampling to create an unbalanced dataset and (4) computing the error-rate-based and AUC-based permutation VIs, was repeated 100 times for each imbalance level to get stable results for the VIM performance. To check the assumption that there is a higher association between the response and any of the original predictors than between the response and any of the simulated predictors, we computed the mean VI over 100 completely balanced datasets that had been extended by noise predictors. Figure 1 shows that all mean VIs of the original predictors are higher than any mean VI of a simulated noise predictor and hence confirms our first impression.

Performance evaluation criteria

VIMs give a ranking of the predictors according to their association with the response. To evaluate the quality of the rankings by the permutation VIMs, the AUC was used as performance measure. The AUC was computed to assess the ability of a VIM to differentiate between associated predictors and predictors not associated with the response. AUC values of 1 mean that each associated predictor receives a higher VI than any noise predictor, thus indicating a perfect discrimination. AUC values of 0.5 mean that a randomly drawn associated predictor receives a higher VI than a randomly drawn noise predictor in only half of the cases, indicating no discriminative ability.

For our comparison studies we defined the two classes which are to be differentiated by a VIM in the following way. In the first instance of our studies on simulated data, all predictors which are associated with the response

[Figure 1: bar plot "Mean VIs for Extended C-to-U Conversion Dataset", showing the mean VIs of the 43 original predictors (X.20, ..., X.1, X1, ..., X20, cp, fe, dfe) and of the 43 noise predictors (noise1, ..., noise43); y-axis ranges from 0.00 to 0.04.]

Figure 1 Mean VIs for the 43 original predictors and 43 noise predictors from the balanced modified C-to-U conversion dataset. Mean VIs were obtained by averaging the VIs (by the commonly used error-rate-based permutation VIM) over 100 extended versions of the C-to-U conversion dataset.

formed one class and the noise predictors constituted the other class. In more detailed subsequent analyses we then explored the ability of the VIMs to discriminate between predictors with the same effect size and predictors without an effect. For this analysis one class comprised the noise predictors while the other class comprised only predictors with the same effect. For the studies on real data it was not possible to conduct such detailed analyses, because the true ordering of the predictors according to their association with the response is not known. Hence in the analysis on real data we restricted our analysis to the discrimination between original predictors, forming one class, and simulated noise predictors, forming the other class.

Results and discussion

Why may the error-rate-based permutation VIM fail in case of class imbalance?

The prioritisation of the majority class in unbalanced data settings is well known in the context of RF classification and can easily be seen from trees constructed on unbalanced data. Trees trained on unbalanced data more often predict the majority class, which leads to the minimization of the overall error rate. But how does this affect the performance of the permutation VIMs? And why is the AUC-based permutation VIM expected to be more robust towards class imbalance than the commonly used error-rate-based permutation VIM?

To answer these questions we consider an extremely unbalanced data setting and illustrate what happens in a tree when permuting the values of an associated predictor. We will first have a look at observations from the majority class. For this class nearly all observations are correctly classified by a tree which has been trained on extremely unbalanced data. If we now permute the values of an associated predictor, this does generally not result in a classification into the minority class, since a classification into the minority class is an unlikely event – even for an observation from this class. A very specific data pattern is required for an observation to be classified into the minority class. It is unlikely that a random permutation of an associated predictor results in such a specific data pattern just by chance. Thus, for the majority class we expect hardly any observation to be incorrectly classified to the minority class after the permutation of an associated predictor. The error rate therefore does not considerably increase after the permutation of an associated predictor, finally leading to a rather low contribution to the VI.

Now let us consider the classifications by a tree for observations from the minority class. For an extreme class imbalance most of the observations from the minority class are falsely classified to the majority class, due to the above described focus on the majority class. It might be the case that some observations from the minority class are correctly classified by the tree because these observations have that specific pattern of predictor values which is required for an observation to be classified into the minority class. It is likely that a permutation of the values of an associated predictor might then destroy that specific pattern, so that after the permutation these observations are no longer identified as belonging to the minority class. Thus a misclassification due to the elimination of an associated predictor is much more likely to appear in observations from the minority class than in observations from the majority class. Note that only a small number of observations from the minority class are affected, since most of the observations from the minority class are classified into the majority class anyway (before as well as after the permutation). The change in error rates is thus expected to be rather small – albeit more pronounced than the change in error rates in the majority class.

Note that the error-rate-based permutation VIM does not take class affiliations into account. Thus the change in error rates is actually not computed separately for each class. Yet, in order to better understand the behavior of the VIM, it may help to point out that if the class proportions were the same in all OOB samples, the VI of a predictor could be directly derived as the weighted average of the class specific differences in the error rates. The weights would correspond to the proportion of observations from the respective class. In practice the class frequencies will not be equal in all OOB samples, but the concept of a weighted average of the class specific error rates illustrates the fact that for unbalanced data settings the VI is mainly driven by the change in error rates derived from observations from the majority class. Since the change in error rates in the majority class is expected to be much smaller compared to the change in error rates in the minority class, the computed VIs are rather low. This results in low VIs even for associated predictors and in a poor differentiation between associated predictors and predictors not associated with the response.

Class specific VIs

This theory is supported by computing class specific VIs (corresponding to mean changes in error rates computed only from observations belonging to the same class). Computing class specific VIs was done using the R package randomForest, which implements the standard RF algorithm. The importance function of this package provides permutation VIs computed separately for each class (besides the VIs by the standard permutation VIM and by the Gini VIM). The class specific VIs for a total sample size of n = 500 and an imbalance level of 5% are shown in Figure 2, where predictors X1 to X15 have an effect while the remaining 50 predictors do not have an effect, corresponding to the simulation setting previously described in Table 1 in the context of the comparison
Janitza et al. BMC Bioinformatics 2013, 14:119 Page 7 of 11
http://www.biomedcentral.com/1471-2105/14/119

Figure 2 VIs computed only from OOB observations of the minority class (top), from OOB observations of the majority class (middle) and from all OOB observations (bottom). The first 15 predictors are associated with the response while the remaining predictors are noise predictors. VIs are shown for a total sample size of n = 500 and an imbalance level of 5%.
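The class-specific computation discussed around Figure 2 can be sketched outside of R as well. The following Python snippet is an illustrative stand-in for the `importance` function of the R package randomForest used in the paper: it uses scikit-learn, a held-out set in place of OOB samples, and invented data (5% minority class, one strongly associated predictor), so all names and parameters here are assumptions for the demonstration only.

```python
# Sketch: class-specific permutation variable importance on unbalanced data.
# Hypothetical setup: column 0 ("X1") is associated with the response, the
# rest are noise; the minority class makes up 5% of the observations.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 4000, 10
y = (rng.random(n) < 0.05).astype(int)      # 5% minority class
X = rng.normal(size=(n, p))
X[:, 0] += 3.0 * y                          # strong effect of X1

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

def class_specific_vi(j, cls, n_rep=20):
    """Mean increase in error rate on observations of class `cls` after
    permuting column j (a held-out analogue of a class-specific VI)."""
    mask = y_te == cls
    base_err = np.mean(rf.predict(X_te[mask]) != y_te[mask])
    increases = []
    for _ in range(n_rep):
        Xp = X_te.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        increases.append(np.mean(rf.predict(Xp[mask]) != y_te[mask]) - base_err)
    return float(np.mean(increases))

vi_minority = class_specific_vi(j=0, cls=1)
vi_majority = class_specific_vi(j=0, cls=0)
print(vi_minority, vi_majority)
```

With a strong predictor and a 5% minority class, the error-rate change concentrates in the minority class (vi_minority well above vi_majority), mirroring the contrast between the top and middle panels of Figure 2.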

study (for simplicity, we use the same setting as in the comparison study, although the addressed problem is here a different one). Different sample sizes and imbalance levels give similar results (thus not shown). They confirm our argumentation that the change in the error rates computed from OOB observations from the majority class is smaller than the change in error rates computed from OOB observations from the minority class. This results in an underestimation of the actual permutation VI due to a much higher weighting of the majority class in the computation of the VI (see the concordance of VIs in the middle and lower panels of Figure 2). The discrepancy between the VIs computed from observations of the minority class and VIs computed from observations of the majority class depends on the class imbalance and is more pronounced for more extreme class imbalances.

This motivates the use of an alternative accuracy measure which better incorporates the minority class. While the error rate gives the same weight to all observations, therefore focusing more on the majority class, the AUC is a measure which does not prefer one class over the other but instead puts exactly the same weight on both classes. Therefore the AUC-based permutation VIM is expected to detect changes in tree predictions for observations from the minority class, which might not be grasped by the error-rate-based permutation VIM due to a much higher weighting of the majority class. The VIs for associated predictors obtained by the AUC-based permutation VIM are thus expected to be comparatively higher than the VIs obtained by the error-rate-based permutation VIM. This would result in a better differentiation of associated and noise predictors by the AUC-based permutation VIM. These conjectures are assessed in the comparison study presented in the next section. (An additional performance comparison between the AUC-based permutation VIM and the error-rate-based permutation VIM based only on observations from the minority class is documented in Additional file 1.)

Comparison study with simulated data
The performance of the error-rate-based and AUC-based VIMs as measured by the AUC is shown in Figure 3 for the three different total sample sizes of n = 100 (left panel), n = 500 (middle panel) and n = 1000 observations (right panel) and different class imbalance levels. Filled boxes correspond to the AUC-based permutation VIM and unfilled boxes correspond to the error-rate-based permutation VIM. Figure 3 shows that

Figure 3 Distribution of AUC-values for 100 simulated datasets for AUC-based (filled) and error-rate-based (unfilled) permutation VIMs for different class imbalances (50% to 1%). The AUC is used to assess the ability of a VIM to discriminate between predictors with an effect and predictors without an effect. Distributions are shown for total sample sizes of n = 100 (left panel), n = 500 (middle panel) and n = 1000 (right panel).
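The AUC plotted on the y-axis of Figure 3 is an evaluation criterion for the VIM itself, not part of the VIM: the VI values of all predictors are treated as scores for separating truly associated predictors from noise predictors, whose status is known from the simulation design. A minimal sketch, with purely illustrative numbers standing in for a VIM's output:

```python
# Sketch: scoring a VIM's discrimination ability, as on the y-axis of Figure 3.
# The VI vector plays the role of a "diagnostic score" over predictors whose
# true status (associated vs noise) is known from the simulation design.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
truth = np.array([1] * 15 + [0] * 50)      # 15 associated, 50 noise predictors
vi = rng.normal(0.0, 0.005, size=65)       # noise predictors: VIs around zero
vi[:15] += 0.03                            # associated predictors: shifted VIs

vim_auc = roc_auc_score(truth, vi)
print(vim_auc)   # 1.0 would mean every associated predictor outranks every noise predictor
```

A value of 0.5 corresponds to a VIM whose ranking is no better than random, which is what the error-rate-based VIM approaches at the most extreme imbalance levels.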

the performance of both VIMs decreases with an increasing class imbalance for all sample sizes. Note that the decrease in performance for both VIMs is not solely attributable to the imbalance ratio per se but also to the reduced number of observations in the minority class with an increasing class imbalance. This is induced by the simulation setting, since we held the total number of observations fixed and varied the number of observations in both classes to create different class imbalances. If there are only few observations in one class, then the tree predictions are less accurate. However, the performance of the AUC-based permutation VIM decreases less dramatically than the performance of the error-rate-based permutation VIM. The discrepancy in performance between the VIMs increases with increasing imbalance level and is maximal for the most extreme class imbalance. While for a sample size of n = 500 the error-rate-based permutation VIM is no longer able to discriminate between associated and noise predictors for the most extreme class imbalance of 1% (AUC values randomly vary around 0.5), the AUC-based permutation VIM still is, showing that it can be used to identify associated predictors even if the minority class comprises only few observations. It can be ruled out that the better performance of the AUC-based permutation VIM is due to chance, since the distributions of AUC values differ significantly. Furthermore, this difference in performance between both VIMs becomes even larger for larger sample sizes.

In a nutshell, in this first simulation the AUC-based permutation VIM performed better in case of class imbalance. The following subsections focus on the influence of sample size and effect size on the respective performance of both permutation VIMs in unbalanced data settings.

Influence of sample size
In Figure 3, the performance of both VIMs improves with an increased total sample size for a fixed imbalance level, since an increase in the sample size results in more accurate tree predictions. The right panel of Figure 3 shows that both permutation VIMs are hardly affected by class imbalances up to 10% when the sample size is rather large (n = 1000). If the sample size is smaller (n = 100), however, the performance of the VIMs is considerably decreased for a 10% imbalance level. A decrease in performance for a 10% imbalance level is also observed for a sample size of n = 500, especially for the error-rate-based permutation VIM. In a nutshell, class imbalance seems to be more problematic for the permutation VIMs if the total sample size is small.

Influence of effect size
We now explore the ability of the permutation VIMs to identify predictors with different effect sizes in the presence of unbalanced data. The AUC was again used as an evaluation criterion to compare the ability of the AUC-based and error-rate-based permutation VIMs to discriminate between associated and non-associated predictors. Here the evaluation was done for each effect size separately, meaning that one class comprised all the noise predictors while the other class comprised only predictors with the considered effect size (either strong, moderate or weak). Figure 4 shows the results for the setting with n = 100; the results for other sample sizes are shown in Additional file 2. The left panel of Figure 4 shows the performance of both permutation VIMs according to their ability to discriminate between predictors with weak effects and predictors without an effect. The middle panel corresponds to the AUC values for predictors with a moderate effect versus

Figure 4 Distribution of AUC-values for 100 simulated datasets for AUC-based (filled) and error-rate-based (unfilled) permutation VIMs for different class imbalances (50% to 1%). The AUC is used to assess the ability of a VIM to discriminate between noise predictors and predictors with a weak (left panel), moderate (middle panel) and strong (right panel) effect. Distributions are shown for a total sample size of n = 100.
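The two importance measures compared in these boxplots differ only in the accuracy measure used before and after permutation. The mechanics can be sketched compactly in Python with scikit-learn; this is a hedged illustration with invented data, using a held-out set rather than the per-tree OOB computation of the actual party implementation (note that an AUC is only computable from a sample containing both classes, which is why per-tree OOB samples lacking a class cannot contribute).

```python
# Sketch: permutation VI with two accuracy measures — AUC vs error rate.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, p = 4000, 6
y = (rng.random(n) < 0.05).astype(int)     # 5% minority class
X = rng.normal(size=(n, p))
X[:, 0] += 3.0 * y                         # X1 associated, X2..X6 noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)
rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)

def permutation_vi(score, j, n_rep=10):
    """VI of predictor j: mean drop in `score` after permuting column j."""
    base = score(X_te)
    drops = []
    for _ in range(n_rep):
        Xp = X_te.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        drops.append(base - score(Xp))
    return float(np.mean(drops))

auc_score = lambda Z: roc_auc_score(y_te, rf.predict_proba(Z)[:, 1])
acc_score = lambda Z: np.mean(rf.predict(Z) == y_te)

vi_auc = [permutation_vi(auc_score, j) for j in range(p)]
vi_err = [permutation_vi(acc_score, j) for j in range(p)]
print(vi_auc)
print(vi_err)
```

Because the AUC weights both classes equally, the AUC-based VI of the associated predictor stands out from the noise predictors even at this 5% imbalance, whereas the accuracy-based drop is bounded by the small share of observations whose predicted class can actually change.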

noise predictors and the right panel corresponds to the AUC values for predictors with a strong effect versus noise predictors.

Unsurprisingly, for both permutation VIMs predictors having only a weak effect are less discriminable from noise predictors than predictors with stronger effects. For imbalances up to 20% both VIMs identify nearly all predictors with a strong effect. Obviously there are unbalanced data settings where the standard permutation VIM still perfectly separates between noise predictors and predictors with pronounced effects. We conclude that class imbalance is more problematic if predictors with weak effects are to be identified, while it plays a minor role if the classes are well separable.

Comparison study with real data
Figure 5 shows the distribution of AUC values for 100 modified C-to-U conversion datasets for varying
Figure 5 Distribution of AUC-values for AUC-based (filled) and error-rate-based (unfilled) permutation VIMs for different class imbalances (50% to 1%) derived from 100 modified datasets from C-to-U conversion data. The AUC is used to assess the ability of a VIM to discriminate between associated predictors and predictors not associated with the response.

imbalance levels. For the balanced dataset and for slight class imbalances up to 40% both VIMs have a perfect discriminative ability, since all associated predictors receive a higher VI than any noise predictor. Overall the performance of both VIMs decreases with an increasing class imbalance. Note that the decreasing performance for increasing class imbalances might be partly attributable to the reduced total sample size, as the class imbalance was created by randomly subsampling observations from the class with the edited sites. When comparing both VIMs, the AUC-based permutation VIM significantly outperformed the standard permutation VIM. For an imbalance of 30% the AUC-based permutation VIM clearly identified more associated predictors than the error-rate-based permutation VIM. The superiority of the AUC-based permutation VIM over the standard permutation VIM increased with an increasing class imbalance; for imbalances between 15% and 5% the discrepancy between the performance of the AUC-based and the standard permutation VIM was maximal.

Overall, this study on real data impressively shows that the AUC-based permutation VIM also works for complex real data and outperforms the standard permutation VIM in almost all class imbalance settings.

Conclusions
The problem of unbalanced data has been widely discussed in the literature for diverse classifiers, including random forests. Many approaches have been developed to improve the predictive ability of RF classifiers for unbalanced data settings. However, less attention has been paid to the behaviour of random forests' variable importance measures for unbalanced data. In this paper we explored the performance of the permutation VIM for different class imbalances and proposed an alternative permutation VIM which is based on the AUC.

Our studies on simulated as well as on real data show that the commonly used error-rate-based permutation VIM loses its ability to discriminate between associated predictors and predictors not associated with the response for increasing class imbalances. This is particularly crucial for small sample sizes and if predictors with weak effects are to be detected. The decreasing performance of the standard permutation VIM results from two sources: the class imbalance on the training data level, leading to trees more often predicting the majority class, and the class imbalance at the OOB data level, leading to blurred VIs due to a much higher weighting of error rate differences in the majority class. A higher weighting of the majority class in the VI calculation is problematic because the difference in error rates is shown to be less pronounced in the majority class than in the minority class. Note that in some cases it might be interesting to assess the increase in error rate obtained when a certain predictor is removed; in this case the error-rate-based permutation VIM can be considered. If the goal is to rank the predictors according to their discrimination power, however, the AUC-based permutation VIM should be preferred.

The problem of imbalance at the OOB data level is directly addressed with the use of a novel AUC-based permutation VIM. This VIM puts the same weight on both classes by measuring the difference in AUCs instead of the difference in error rates. It is thus able to detect changes in tree predictions when permuting associated predictors which might not be grasped by the standard permutation VIM. In contrast, the imbalance on the training data level is not addressed by the AUC-based permutation VIM, meaning that the structure of a tree remains untouched. On the one hand this is a drawback, since class predictions before and after permuting a predictor are similar even if the respective predictor is associated with the response, resulting in a reduced change in the AUCs. On the other hand, preserving the tree structure can be regarded as an advantage, since a change in tree structure might open space for new unexpected behaviours. It is a major advantage of our novel AUC-based permutation VIM that it is based on exactly the same principle and differs from the standard permutation VIM only with respect to the accuracy measurement. It is thus expected to share the advantages of the standard permutation VIM and its properties and behaviours discovered in recent years (e.g. its behaviour in the presence of correlated predictors [31] and in the presence of predictors with different scales [22] and category sizes in the predictors [24,25]).

Our studies on simulated as well as on real data show that the AUC-based permutation VIM outperforms the commonly used error-rate-based permutation VIM, as well as the error-rate-based permutation VIM computed only using observations from the minority class, in case of unbalanced data settings (see Additional file 1 for the comparison to the class-specific VIM). The difference in performance between our novel AUC-based permutation VIM and the standard permutation VIM can be substantial, especially for extremely unbalanced data settings. But even for slight class imbalances the AUC-based permutation VIM has been shown to be superior to the standard permutation VIM. We conclude from our studies that the AUC-based permutation VIM should be preferred to the standard permutation VIM whenever two response classes have different class sizes and the aim is to identify relevant predictors.

Availability and requirements
The AUC-based permutation VIM is implemented in the new version of the party package for the freely available statistical software R (http://www.r-project.org

and http://cran.r-project.org/web/packages/party/index.html). It can be applied via the function varimpAUC. All codes implementing our studies on simulated and on real data are available under http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html for reproducibility purposes.

Additional files
Additional file 1: This file shows the results of the performance comparison between the AUC-based permutation VIM and the error-rate-based permutation VIM computed using only observations from the minority class.
Additional file 2: This file shows the distribution of AUC-values (analogous to Figure 4) for sample sizes n = 500 and n = 1000.

Abbreviations
AUC: Area under curve; OOB: Out-of-bag; RF: Random forest; VIM: Variable importance measure; VI: Variable importance.

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
SJ wrote the paper and conducted all analyses. SJ and ALB developed and implemented the new VIM. All authors contributed to the design of the analyses and substantially edited the manuscript.

Acknowledgements
SJ was supported by the German Science Foundation (DFG-Einzelförderung BO3139/2-2). The authors thank Torsten Hothorn for integrating the implementation of the AUC-based permutation VIM into the new version of the party package.

Author details
1Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, D-81377, Munich, Germany. 2Department of Psychology, University of Zurich, Binzmühlestr. 14, CH-8050, Zurich, Switzerland.

Received: 23 November 2012 Accepted: 21 March 2013
Published: 5 April 2013

References
1. Breiman L: Random forests. Machine Learning 2001, 45:5–32.
2. Boulesteix AL, Janitza S, Kruppa J, König I: Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2012, 2(6):493–507.
3. Briggs F, Goldstein B, McCauley J, Zuvich R, De Jager P, Rioux J, Ivinson A, Compston A, Hafler D, Hauser S, et al: Variation within DNA repair pathway genes and risk of multiple sclerosis. Am J Epidemiol 2010, 172(2):217.
4. Chang J, Yeh R, Wiencke J, Wiemels J, Smirnov I, Pico A, Tihan T, Patoka J, Miike R, Sison J, et al: Pathway analysis of single-nucleotide polymorphisms potentially associated with glioblastoma multiforme susceptibility using random forests. Cancer Epidemiol Biomarkers Prev 2008, 17(6):1368–1373.
5. Liu C, Ackerman H, Carulli J: A genome-wide screen of gene–gene interactions for rheumatoid arthritis susceptibility. Hum Genet 2011, 129(5):473–485.
6. Nicodemus K, Callicott J, Higier R, Luna A, Nixon D, Lipska B, Vakkalanka R, Giegling I, Rujescu D, Clair D, et al: Evidence of statistical epistasis between DISC1, CIT and NDEL1 impacting risk for schizophrenia: biological validation with functional neuroimaging. Hum Genet 2010, 127(4):441–452.
7. Sun Y, Cai Z, Desai K, Lawrance R, Leff R, Jawaid A, Kardia S, Yang H: Classification of rheumatoid arthritis status with candidate gene and genome-wide single-nucleotide polymorphisms using random forests. BMC Proceedings 2007, 1(Suppl 1):S62.
8. Blagus R, Lusa L: Class prediction for high-dimensional class-imbalanced data. BMC Bioinformatics 2010, 11:523.
9. Lin WJ, Chen J: Class-imbalanced classifiers for high-dimensional data. Brief Bioinform 2012.
10. Khoshgoftaar T, Golawala M, Van Hulse J: An empirical study of learning from imbalanced data using random forest. In Tools with Artificial Intelligence, 2007. ICTAI 2007: 19th IEEE International Conference on, Volume 2. IEEE; 2007:310–317.
11. Huang Y, Hung C, Jiau H: Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem. Nonlinear Analysis: Real World Applications 2006, 7(4):720–747.
12. Fawcett T, Provost F: Adaptive fraud detection. Data Mining and Knowledge Discovery 1997, 1(3):291–316.
13. Kubat M, Holte R, Matwin S: Machine learning for the detection of oil spills in satellite radar images. Machine Learning 1998, 30(2):195–215.
14. Chen C, Liaw A, Breiman L: Using random forest to learn imbalanced data. University of California, Berkeley: Tech. rep; 2004 [http://statistics.berkeley.edu/tech-reports/666].
15. Xie Y, Li X, Ngai E, Ying W: Customer churn prediction using improved balanced random forests. Expert Systems with Applications 2009, 36(3):5445–5449.
16. Batista G, Prati R, Monard M: A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter 2004, 6:20–29.
17. Estabrooks A, Jo T, Japkowicz N: A multiple resampling method for learning from imbalanced data sets. Computational Intelligence 2004, 20:18–36.
18. Van Hulse J, Khoshgoftaar T, Napolitano A: Experimental perspectives on learning from imbalanced data. In Proceedings of the 24th International Conference on Machine Learning. ACM; 2007:935–942.
19. Van Hulse J, Khoshgoftaar T: Knowledge discovery from imbalanced and noisy data. Data & Knowledge Engineering 2009, 68(12):1513–1542.
20. Japkowicz N, Stephen S: The class imbalance problem: A systematic study. Intelligent Data Analysis 2002, 6(5):429–449.
21. Khalilia M, Chakraborty S, Popescu M: Predicting disease risks from highly imbalanced data using random forest. BMC Med Inform Decis Mak 2011, 11:51.
22. Strobl C, Boulesteix AL, Zeileis A, Hothorn T: Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 2007, 8:25.
23. Nicodemus KK, Malley JD: Predictor correlation impacts machine learning algorithms: implications for genomic studies. Bioinformatics 2009, 25(15):1884–1890.
24. Nicodemus KK: Letter to the editor: On the stability and ranking of predictors from random forest variable importance measures. Brief Bioinform 2011, 12(4):369–373.
25. Boulesteix AL, Bender A, Bermejo JL, Strobl C: Random forest Gini importance favours SNPs with large minor allele frequency: assessment, sources and recommendations. Brief Bioinform 2012, 13:292–304.
26. Calle M, Urrea V, Boulesteix AL, Malats N: AUC-RF: A new strategy for genomic profiling with random forest. Hum Hered 2011, 72(2):121–132.
27. Hothorn T, Hornik K, Zeileis A: Unbiased recursive partitioning: A conditional inference framework. J Comput Graph Stat 2006, 15(3):651–674.
28. Pepe M: The statistical evaluation of medical tests for classification and prediction. USA: Oxford University Press; 2004.
29. Hothorn T, Hornik K, Zeileis A: party: A laboratory for recursive partytioning. R package version 1.0-3; 2012. URL http://cran.r-project.org/package=party.
30. Cummings M, Myers D: Simple statistical models predict C-to-U edited sites in plant mitochondrial RNA. BMC Bioinformatics 2004, 5:132.
31. Nicodemus KK, Malley J, Strobl C, Ziegler A: The behavior of random forest permutation-based variable importance measures under predictor correlation. BMC Bioinformatics 2010, 11:110.

doi:10.1186/1471-2105-14-119
Cite this article as: Janitza et al.: An AUC-based permutation variable importance measure for random forests. BMC Bioinformatics 2013, 14:119.
