Accepted Manuscript: Information Fusion
PII: S1566-2535(18)30257-4
DOI: https://doi.org/10.1016/j.inffus.2018.11.013
Reference: INFFUS 1051
Please cite this article as: Jose M. Moyano, Eva L. Gibaja, Krzysztof J. Cios, Sebastián Ventura, An
evolutionary approach to build ensembles of multi-label classifiers, Information Fusion (2018), doi:
https://doi.org/10.1016/j.inffus.2018.11.013
Highlights
• New evolutionary ensemble model for multi-label classification.
• It considers the imbalance, dimensionality and relationships among the labels.
• Achieved better and more consistent performance than state-of-the-art methods.
An evolutionary approach to build ensembles of multi-label classifiers

Jose M. Moyano, Eva L. Gibaja, Krzysztof J. Cios, Sebastián Ventura^{a,d,e,∗}

a Department of Computer Science and Numerical Analysis, University of Córdoba, Córdoba, Spain
b Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
c Polish Academy of Sciences, Institute of Theoretical and Applied Informatics, Gliwice, Poland
d Faculty of Computing and Information Technology, King Abdulaziz University, Saudi Arabia
e Knowledge Discovery and Intelligent Systems in Biomedicine Laboratory, Maimonides Biomedical Research Institute of Córdoba, Spain

∗ Corresponding author. Email address: [email protected] (Sebastián Ventura)
Abstract
In recent years, the multi-label classification task has gained the attention
of the scientific community given its ability to solve problems where each of
the instances of the dataset may be associated with several class labels at
the same time instead of just one. The main problems to deal with in multi-
label classification are the imbalance, the relationships among the labels,
and the high complexity of the output space. A large number of methods for
multi-label classification has been proposed, but although they aimed to deal
with one or many of these problems, most of them did not take into account
these characteristics of the data in their building phase. In this paper we
present an evolutionary algorithm for the automatic generation of ensembles of multi-label classifiers, called EME. Each member of the ensemble is focused on a small projection of the label space, so that the ensemble considers the relationships among the labels but avoids the high complexity of the output space. Further,
the algorithm automatically designs the ensemble, evaluating both its predictive performance and the number of times that each label appears in the
ensemble, so that in imbalanced datasets infrequent labels are not ignored.
For this purpose, we also propose a novel mutation operator that consid-
ers the relationship among labels, looking for individuals where the labels
are more related. EME was compared to other state-of-the-art algorithms
for multi-label classification over a set of fourteen multi-label datasets and
using five evaluation measures. The experimental study was carried out in
two parts, first comparing EME to classic multi-label classification methods,
and second comparing EME to other ensemble-based methods in multi-label
classification. EME performed significantly better than the rest of classic
methods in three out of five evaluation measures. On the other hand, EME
performed the best in one measure in the second experiment and it was the
only one that did not perform significantly worse than the control algorithm
in any measure. These results showed that EME achieved a better and more
consistent performance than the rest of the state-of-the-art methods in MLC.
Keywords: Multi-label classification, Ensemble, Evolutionary algorithm
1. Introduction
In recent years, the Multi-Label Classification (MLC) task has gained the
attention of the scientific community given its ability to solve problems where
each of the instances may be associated to several class labels at the same
time, instead of just one. Let L = {λ1, λ2, ..., λq} be the set of q different binary labels (with q > 2), and X the set of m instances, each composed of d input features; the multi-label classification task is then defined as learning a function h : X → 2^L that assigns a set of relevant labels to each instance. There are many fields of application of MLC, such as social networks mining, where each user could
be subscribed to several groups of interest [1]; multimedia annotation, where
each image or multimedia item could be associated to several class labels [2];
In these problems, the labels are usually not evenly distributed, with some labels appearing in most of the instances and others that are barely present,
appearing in a few instances. This might lead to an imbalanced dataset where
the frequent labels could be much better predicted than the infrequent ones,
as there is very little information about the infrequent labels. Besides, labels
are not usually independent but tend to be related to each other, where a
label may appear more frequently with some labels than with others. The
fact of modeling the compound dependencies among labels, or the lack of it, has a decisive effect not only on the predictive performance of the model but also
on its complexity. The complexity of the model is also usually related to the
size of the output space. The greater the number of labels, the higher the
complexity of the model, which can make the problem intractable.
In order to try to overcome these problems, several methodologies have
been proposed in the literature. For example, Pruned Sets (PS) [7] was pro-
posed in order to reduce the imbalance in the final problem. Besides, to
overcome the problem of modelling the compound dependencies among la-
bels, Classifier Chains (CC) [5] considers the relationships among the labels, which the original binary methods did not take into account.
mensionality problem, RAndom k -labELsets (RAk EL) [8] divided the label
space into smaller subsets, resulting in less complex output spaces. Further, ensemble methods, which combine the outputs of several classifiers, have also been applied to MLC. However, in MLC only those methods that combine several classi-
fiers which are able to deal with multi-label data are considered as Ensembles
of Multi-Label Classifiers (EMLCs) [11]. On the other hand, besides tackling
the aforementioned problems, ensembles usually perform better than single classifiers. In this paper we propose EME, an evolutionary algorithm that automatically builds an ensemble of multi-label classifiers in which each member is focused on a small subset of the labels, thereby not only considering the relationships among the labels but also reducing the computational cost
in cases where the output space is complex. These subsets of labels are not
only randomly selected but also they evolve with the generations of the evolu-
tionary algorithm, looking for the combinations that perform the best. Also,
a novel mutation operator is proposed, so that it considers the relationship
among labels favouring more related combinations of labels. Further, EME
takes into account all the labels approximately the same number of times
in the ensemble, regardless of their frequency or their ease of being predicted; so
that the imbalance of the data is considered and the infrequent labels are not
ignored. For that, the fitness function takes into account both the predictive
performance of the model and the number of times that each label is consid-
ered in the ensemble. Finally, the diversity of the ensemble is not taken into account explicitly.
In the experimental study, EME achieved the best performance in only one measure among the ensemble-based methods, but it was the only algorithm that did not perform significantly worse than the control algorithm for any of the evaluation measures. These results showed that EME achieved a better and more consistent performance than the rest of the state-of-the-art methods in MLC.
2. Related work
The traditional single-label classification task aims to predict the class or category to which each instance belongs, chosen from a set of mutually exclusive classes. In MLC, on the other hand, the output for each instance is represented as a binary vector of q elements, where each
element is 1 if the label is relevant and 0 otherwise. In this way, the goal of
MLC is to predict, for an unseen instance, a bipartition including its sets of
relevant (Ŷ) and irrelevant labels.
Several methods for MLC have been proposed in the literature, aiming
to handle the three main problems in MLC, namely the imbalance of
the output space, the relationship among labels and the high dimensionality
of the output space. These methods are categorized into three main groups:
problem transformation, algorithm adaptation, and EMLCs [14, 15].
Problem transformation methods transform the multi-label problem into
one or more single-label problems. These problems are then solved by us-
ing traditional single-label classification methods. For ease of understanding,
schemes of the main transformations are presented in Figure 1. Binary Rel-
evance (BR) [16] decomposes the multi-label problem into q independent
binary single-label problems, then building q independent binary classifiers,
one for each label. BR is simple and intuitive, but the fact of considering
the labels independently makes it unable to model the compound dependen-
cies among the labels. BR does not deal with any of the previously described
problems in MLC. In order to overcome the label independence assumption
of BR, Classifier Chain (CC) [5] generates q binary classifiers but linked in
such a way that each binary classifier also includes the label predictions of
previous classifiers in the chain as additional input features. In this way and
unlike BR, CC is able to model the relationships among the labels without
a large increase in complexity. On the other hand, Label Powerset (LP) transforms the multi-label problem into a multi-class single-label problem, where each class corresponds to a distinct combination of labels, called labelset, that appears in the dataset. This method is able to strongly
model the relationships among the labels, but its complexity grows exponen-
tially with the number of labels; it is also not able to predict a labelset that
does not appear in the training set. Therefore, although it is able to handle the relationship among labels, LP greatly increases the dimensionality
of the output space, as well as its imbalance. Pruned Sets (PS) [7] tries to
reduce the complexity of LP, focusing on most important combinations of la-
bels by pruning instances with less frequent labelsets. To compensate for this
loss of information, it reintroduces the pruned instances with their more frequent labelset subsets. ChiDep groups the labels according to a chi-square test of dependence between pairs of labels, building an LP classifier for each group of dependent labels; in this way, it tries to reduce the disadvantages of the independence assumption of the binary methods and allows for simpler LP models. Besides, ChiDep considers the relationship among groups of labels and the dimensionality of the output space in the building phase, therefore being able to reduce the imbalance in each
model if the groups are small.
The methods in the algorithm adaptation group adapt or extend existing
machine learning methods to directly handle multi-label data. Predictive
Clustering Trees (PCTs) [21] are decision trees where the data is partitioned
in each node using a clustering algorithm. In order to adapt them to MLC,
the distance between two instances for the clustering algorithm is calculated
as the sum of the Gini Indices [22] of all labels, so it considers the relationship
among labels when building the model. Instance-based algorithms have been
also adapted for MLC, such as Multi-Label k-Nearest Neighbors (ML-kNN)
[23]. For each unknown instance, first the k nearest neighbors are found, then
the number of neighbors belonging to each label are counted, and finally the
maximum a posteriori principle is used to identify the labels for the given
instance. As ML-kNN considers the label assignments of the k-nearest neighbors independently for each label, it does not model the relationships among them. In addition, BP-MLL [24], an adaptation of feed-forward neural networks to multi-label scenarios, was proposed, which takes into account the predicted ranking of labels. The ranking of labels also reflects the relationships among them, so BP-MLL considers them in the building phase. A wider description of algorithm adaptation methods can be found in [14, 15]. Finally, considered as EMLCs are those methods that combine several classifiers which are
able to deal with multi-label data [11]. Thus, although BR combines several
classifiers it is not an EMLC since it combines single-label but not multi-label
classifiers. Ensemble of BR classifiers (EBR) [5] builds an ensemble of n BR
classifiers where each is trained with a sample from the training dataset,
being n the number of desired multi-label classifiers in the ensemble.

Figure 1: Main problem transformations in MLC. For PS, labelsets appearing less than 2 times are pruned and reintroduced with most frequent subsets.

Ensemble of Classifier Chains (ECC) [5] is built in a similar way, using CC as the base multi-label classifier, each member of the ensemble with a randomly selected order of the chain. Although both EBR and ECC combine several multi-label classifiers,
they are all created randomly and not based on any of the characteristics
of the data. Multi-Label Stacking (MLS) [25] is composed of two phases.
In the first phase, q BR classifiers are learned, one for each label; while in
the second phase, the input feature set is augmented with the predictions of
each binary classifier from the first phase, training q new binary classifiers
using the desired outputs as targets. MLS is able to model the relationship
among labels thanks to the use of the predictions in the first phase to predict
the labels in the second phase. Ensemble of Pruned Sets (EPS) [7] makes
an ensemble of n PSs where each classifier is trained with a sample of the
training set without replacement. The use of many PSs with different data
subsets avoids overfitting effects of pruning instances, but as PS, in datasets
with a high number of labels the complexity can be still very high.
Hierarchy Of Multi-label classifiERs (HOMER) [6] generates a tree of
multi-label classifiers, where the root contains all labels and each leaf repre-
sents one label. At each node, the labels are split with a clustering algorithm,
grouping similar labels into a meta-label. HOMER considers the relation-
ship among labels to build the model, making it able to handle smaller
subsets of labels in each node, so that the dimensionality of the output space
in each of them is reduced, also reducing the imbalance depending on the labels grouped at each node. Random Forest of Predictive Clustering Trees (RF-PCT) builds an ensemble of PCTs, where each tree is built over a bootstrap sample of the training data, selecting at each node of the tree the best feature from a random subset of the original ones.
As PCT, it considers the relationship among labels in the building phase.
Finally, RAndom k -labELsets (RAk EL) [8] builds an ensemble of LP classi-
fiers, where each is built over a random projection of the output space. In
this way, RAk EL deals with the relationship among labels as LP does but
in a much simpler way. RAkEL handles the three main problems of MLC: it is able to detect the compound dependencies among labels, it reduces the imbalance, and it reduces the dimensionality of the output space by using small subsets of labels. However, RAkEL selects the k-labelsets randomly, without considering any of the characteristics of the data, which could lead to a poor performance.
A summary of the previously described methods is available in Table 1. This
table indicates if each method deals with (D) and/or considers each of the
characteristics of the data at building phase (B). Note that there are methods
that are able to deal with some of the problems but do not consider the characteristics of the data at the building phase, unlike methods such as HOMER, which split the labelsets into smaller ones considering the relationship among the labels.
Table 1: Summary of state-of-the-art MLC methods. It is indicated with a 'D' if the method is able to deal with the corresponding problem (imbalance, relationships among labels, and high dimensionality of the output space), and with a 'B' if it considers this characteristic at the building phase.

Method   Imbalance   Relationships   Dimensionality
EBR
ECC                  D
MLS                  D
EPS      D, B        D               D, B
HOMER    D           D, B            D, B
RF-PCT               D, B
RAkEL    D           D               D, B
3. EME

Figure 2 shows the flowchart of the proposed evolutionary algorithm. First, the initial population is randomly generated (Section 3.2). Then, the initial population is evaluated using the multi-label
dataset (Section 3.3). In each generation, popSize individuals are selected by
tournament selection and stored in s. Each individual in s is considered for
crossover and mutation based on their respective probabilities (Section 3.4).
Once the genetic operators are applied, these new individuals are evaluated.
To maintain elitism, the population in each generation keeps all new indi-
viduals, unless the best parent is better than all the children; in this case,
US
the parent replaces the worst child. The best(set) and worst(set) methods
return the best and the worst individual of a set respectively. At the end of
the generations, the best individual in the last generation is returned as the
best ensemble.
3.2. Individuals
The individuals, which are codified as binary arrays of n × q elements, represent the structure of the whole ensemble of n multi-label classifiers. The value of k is the same for all classifiers, so the number of labels in each classifier is fixed. Each
fragment of size q in the individual represents the k-labelset of each classifier
in the ensemble.
EME is implemented with the ability to use any multi-label classifier as member of the ensemble; in this work, LP is used to learn each k-labelset, since using a method that does not model the relationships among the labels, such as BR, makes no sense. Further, the use of LP has many advantages over other
methods that also consider the relationship among labels, as for example CC.
If k is small, as proposed, LP builds a unique model for each k-labelset able
to model the dependencies among all labels at a time with a low complexity
due to the reduced output space. On the other hand, CC needs to build
k different binary models, and not all dependencies are considered in each
model; for example the first label in the chain is modeled without considering
the dependencies with the rest, the second is modeled considering the depen-
dency with only the previous one, and so on. Besides, the use of CC would also require selecting the order of the labels in each chain.
Figure 2: Flowchart of the evolutionary algorithm. Inputs: mlData (multi-label dataset), G (number of generations), popSize (size of the population), k (size of the k-labelsets), n (number of classifiers in the ensemble), tourSize (size of the tournament selection), pCross (crossover probability) and pMut (mutation probability). After generating and evaluating the random initial population p, each generation selects parents by tournament, applies the crossover and the phi-based mutator, evaluates the new individuals and applies the elitist replacement; at the end, the best individual of the last generation is returned.
LP also needs a single-label classifier to solve each transformed problem; C4.5 provided the best results, so the C4.5 decision tree (Weka's J48 [30]) is used as the single-label
classifier. For the parameters of C4.5, we used a minimum number of objects
per leaf of 2, and a pruning confidence of 0.25. Although we used these pa-
rameters, optimizing them for each specific problem would lead to a better
performance, both in EME and in any other method that used C4.5.
The individuals in the initial population are generated by randomly choos-
ing k bits to a value of 1 for each fragment representing a multi-label classi-
fier. Then, with the evolutionary algorithm the individuals are crossed and
mutated, evolving towards a more promising combination of multi-label clas-
sifiers, instead of being mere random selections. Figure 3 shows the genotype
(represented as a one-dimensional array and as a matrix) and the phenotype
of an individual. For example, the first, represented by [0, 1, 1, 0, 1, 0] in-
dicates that labels λ2 , λ3 and λ5 are included in the first classifier of the
ensemble.
3.3. Evaluation of the individuals

In order to evaluate an individual, its ensemble is built and used to classify the instances; each multi-label classifier provides predictions for the labels in its own k-labelset, as shown in Figure 4.
In the example in Figure 3, the first classifier included labels λ2 , λ3 and λ5 , so
in Figure 4 the first classifier gives a prediction for only those labels. Finally,
the ratio of positive predictions for each label is calculated. If this ratio is
greater than or equal to a given threshold (in the example, threshold = 0.5),
the final prediction is 1 (relevant label) and 0 (irrelevant) otherwise. As
seen in Figure 4 for a certain example, label λ5 obtains one of four possible
votes, so the final prediction is 0, while for label λ6 , which obtains four of
five positive votes, the final prediction is 1.
Figure 4: Example of the voting process of the ensemble (prediction threshold = 0.5).
The fitness function measures both the performance of the classifier and
the number of times that each label appears in the ensemble, thus leading
the evolution towards high-performing individuals that also consider all labels
the same number of times regardless of their frequency.
Many evaluation measures for MLC have been proposed in the litera-
ture, some of them identified as non-decomposable measures [31]. Among them, the Example-based FMeasure (ExF) was selected to assess the predictive performance of each individual; it is the example-based version of the FMeasure for MLC, and it is defined in Equation 1. The ExF is calculated
for each instance, and then, the value is averaged among all the instances.
ExF is defined in the range [0, 1]; the higher the value the better the perfor-
mance of the algorithm. In the following, ↓ and ↑ indicate if the measures
are minimized or maximized respectively.
↑ ExF = \frac{1}{m} \sum_{i=1}^{m} \frac{2\,|\hat{Y}_i \cap Y_i|}{|\hat{Y}_i| + |Y_i|}   (1)
Further, a coverage ratio measure (cr ), which evaluates the number of
times that each label appears in the ensemble, has been defined. This mea-
sure is shown in Equation 2, being v the vector of votes, i.e. a vector storing
the number of times that each label appears in the ensemble, vw a vector of
votes in the worst case, and stdv(v) the standard deviation of the vector v.
The worst case is the one where the vector of votes is as imbalanced as possi-
ble, i.e., some labels appearing in all classifiers and the rest not being present
at all. In the case where all labels appear the same number of times in the ensemble, the vector of votes is homogeneous and its standard deviation, and therefore cr, is 0.

↓ cr = \frac{stdv(v)}{stdv(v_w)}   (2)
As an example, cr for the case in Figure 4 is shown in Equation 3:
cr = \frac{stdv(4, 4, 4, 3, 4, 5)}{stdv(8, 8, 8, 0, 0, 0)} = 0.1443   (3)
Since both measures are in the range [0, 1], but ExF is maximized and
cr is minimized, the fitness function is defined as the linear combination of
them, as shown in Equation 4.
↑ fitness = \frac{ExF + (1 − cr)}{2}   (4)
As all the multi-label ensembles of the population must be generated to
calculate their fitness, evaluation is the process that consumes the most time.
In order to reduce the runtime of the algorithm, two structures are created: one storing the fitness of each evaluated individual and another storing each multi-label classifier that was built. Thus, if an individual appears more than once, regardless of the order of its multi-label classifiers, the fitness is directly obtained from this structure, avoiding the evaluation of a full ensemble. Further, if an individual which is going to be built contains a classifier that was previously built for another individual, this multi-label classifier does not have to be built again, but is retrieved from the corresponding structure.

3.4. Genetic operators
In this section the crossover and mutation operators used in the evolu-
tionary algorithm are described. Tournament selection is used to determine
the individuals that form the set of parents. Then, each of these individ-
uals is crossed or mutated based on crossover and mutation probabilities.
The crossover and mutation operators are not mutually exclusive, i.e., an
individual could be crossed and mutated in the same generation.
The crossover is a uniform crossover at the classifier level: for each position of the ensemble, it is randomly decided (with probability 0.5) if the fragments in the same position in both parents are swapped.
Figure 5 shows an example of the crossover operator, where the first and
third classifiers are swapped between the parents. This operator always produces valid individuals, since each fragment is exchanged as a whole and its k active bits are not modified.
Figure 5: Uniform crossover operator.
The mutation operator swaps an active and an inactive bit within one of the fragments, making the individual cease to classify one label in order to classify another. The bit swapping
is performed considering the relationships among the labels, favoring the
combinations of labels that are more related. For that, the phi correlation coefficient [32] between each pair of labels is computed; it is defined in the range [−1, 1], with 1 meaning total direct correlation, −1 total inverse correlation, and 0 no correlation.
Figure 6 shows an example of the phi-based mutation operator for a given fragment of the individual. First, one of the active bits is selected to be deactivated (Figure 6b). Then, a weight w_b is computed for each of the inactive bits b as in Equation 5, where A is the set of labels that remain active in the fragment, φ_{b,l} is the phi coefficient between labels b and l, and ε is a small constant that gives every bit a non-zero weight; the bit to activate is selected with a probability proportional to its weight (Figure 6c).

w_b = \varepsilon + \sum_{l \in A} |\phi_{b,l}|   (5)

Finally, the two selected positions are swapped (Figure 6d). Thereby, sub-
sets of more related labels are more likely to be selected, but also keeping a
small probability of searching for less related combinations of labels. The mutated individuals are always valid, because the number of active bits remains
constant.
The computational complexity of EME is mainly determined by the evaluation of the individuals, which requires building each of the multi-label classifiers. The individuals are based on the use of the
C4.5 classifier, whose complexity is O(m × d²) [33], where m is the number of examples and d is the number of features of the dataset. The complexity of
EME is upper bounded by the total number of C4.5 classifiers that it has
to evaluate. In each of the G generations, a total of popSize individuals,
each composed by n C4.5 classifiers are evaluated. However, each base clas-
sifier that has ever been built is stored, so EME does not have to build it again. Therefore, the maximum number of classifiers that EME builds, nT, is defined as nT = min(n × popSize × G, \binom{q}{k}), where \binom{q}{k} is the number of possible distinct k-labelsets. Nevertheless, this asymptotic
time complexity is reduced in practice, since each individual that appears repeated in the population in any generation is not evaluated again; its fitness is directly obtained, avoiding the building of an entire individual. Also note that the
complexity of EME is directly related to the complexity of the base classifier
used; using a different single-label classifier its complexity would vary.
4. Experimental studies
The purpose of the experimental studies is to compare EME to other
state-of-the-art algorithms in multi-label classification over a wide range of
datasets and evaluation measures. In this section the multi-label datasets
and the evaluation measures used in the experiments are first presented, and
then, the experimental settings are explained.
4.1. Datasets
The experiments were performed over a wide set of 14 reference datasets1
from different domains, such as text categorization, multimedia, chemistry
and biology. Table 2 lists the datasets along with their main characteristics,
such as domain, number of instances (m), number of labels (q), number
of features (d), cardinality (card, mean number of labels per instance) and
density (dens, cardinality divided by the number of labels). The datasets
are ordered by number of labels. The MLDA tool [34] was used for the characterization and partitioning of the datasets.
1 All the datasets and their descriptions are available at the repository at http://www.uco.es/kdis/mllresources/
Table 2: Multi-label datasets used in the experiments and their main characteristics.

Dataset        Domain     m     q   d     card   dens   Ref.
Guardian1000   Text       302   6   1000  1.126  0.188  [35]
Bbc1000        Text       352   6   1000  1.125  0.188  [35]
3s-inter3000   Text       169   6   3000  1.142  0.190  [35]
Gnegative      Biology    1392  8   440   1.046  0.131  [36]
Plant          Biology    948   12  440   1.080  0.089  [36]
Water-quality  Chemistry  1060  14  16    5.072  0.362  [37]
Yeast          Biology    2417  14  103   4.237  0.303  [38]
Human          Biology    3108  14  440   1.190  0.084  [36]
Birds          Audio      645   19  260   1.014  0.053  [39]
Slashdot       Text       3782  22  1079  1.180  0.053  [40]
Genbase        Biology    662   27  1186  1.252  0.046  [41]
Medical        Text       978   45  1449  1.245  0.028  [42]
4.2. Evaluation measures

Five evaluation measures were used to assess the predictive performance of the algorithms: Hamming loss (HL), subset accuracy (SA), macro-averaged precision (MaP), macro-averaged recall (MaR), and macro-averaged specificity (MaS), defined in Equations 6 to 10, where the operator ⟦π⟧ returns 1 if the predicate π is true and 0 otherwise. On the other hand, usually the labels that are most important, interesting or difficult to predict are the minority labels. Therefore, the macro approach, which gives the same importance to all labels regardless of their frequency, is used for precision, recall and specificity, where tp_i, fp_i, tn_i and fn_i stand for the true positives, false positives, true negatives and false negatives for the i-th label. Precision and recall are both based on measuring the prediction of relevant labels. The study in [44] did not consider the specificity measure; however, we consider that measuring the ratio of correctly predicted irrelevant labels is also of interest.

↓ HL = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{q} |Y_i \,\Delta\, \hat{Y}_i|   (6)

↑ SA = \frac{1}{m} \sum_{i=1}^{m} ⟦Y_i = \hat{Y}_i⟧   (7)

↑ MaP = \frac{1}{q} \sum_{i=1}^{q} \frac{tp_i}{tp_i + fp_i}   (8)

↑ MaR = \frac{1}{q} \sum_{i=1}^{q} \frac{tp_i}{tp_i + fn_i}   (9)

↑ MaS = \frac{1}{q} \sum_{i=1}^{q} \frac{tn_i}{tn_i + fp_i}   (10)
4.3. Experimental settings
The experimental study carried out was divided in two parts. First, EME
was compared to other classic MLC methods such as BR, LP, CC, PS, and
ChiDep. Then, in order to perform a more complete experimental study,
EME was compared to other state-of-the-art EMLCs such as EBR, ECC,
MLS, EPS, RAk EL, HOMER, and RF-PCT.
To compare the performance of the algorithms, the Friedman’s test [45]
was used for each evaluation measure. In cases where the Friedman’s test
indicated that there were significant differences in the performance of the
algorithms with a 95% confidence, the Holm’s post-hoc test [46] for compar-
isons of multiple classifiers involving a control method was performed. The
adjusted p-values were used in the analysis, since they take into account that multiple comparisons are performed.
The default parameters, as originally recommended by their authors, were used for all the algorithms employed, such as CC, EBR, ECC, EPS, RAkEL, and RF-PCT. All methods use C4.5 as a single-label classifier. PS prunes the
instances with labelsets occurring less than 3 times, and keeps the top two
best ranked subsets when reintroducing the pruned instances. EPS is com-
posed of 10 classifiers and sampling is done without replacement, keeping the rest of its parameters as in PS.
For the C4.5 decision tree we used a minimum number of objects per leaf
of 2, and a pruning confidence of 0.25. It should be noted that if the pa-
rameters of C4.5 were tuned for each specific case, the performance of EME would improve. However, the improvement would be similar for the rest of the state-of-the-art methods, so tuning the parameters of C4.5 is not the
objective of this paper. In this way, we carry out a fair comparison among
methods that use the same parameters of C4.5.
EME was implemented using JCLEC [48] and Mulan [49] frameworks,
and the code is publicly available in a GitHub repository2 . A brief study
was carried out first in order to select the parameters of the evolutionary
algorithm, such as the population size, number of generations, and crossover
and mutation probabilities. The parameters of the multi-label classifiers in EME are similar to those proposed for RAkEL: each member of the ensemble has a subset of k = 3 labels, the ensemble is composed of 2q classifiers, and the prediction threshold is 0.5.
5. Results and discussion
In this section we present the experimental results. First the experimental
study to select the parameters of EME is introduced, and then, the analysis
and discussion of the two experiments carried out are presented, including
the statistical tests performed for each of them. The supplementary material available at the KDIS Research Group webpage3 includes tables with the complete results.
The study to select the parameters of EME was carried out first. For these experiments, four datasets of different
size were selected: Scene, Flags, PlantGO, and EukaryotePseAAC4 .
For the study of the size of the population and the number of genera-
tions, we analyzed the fitness value of the best individual in each generation,
2 https://github.com/kdis-lab/EME
3 http://www.uco.es/kdis/eme/
4 All these datasets are different from those used in the rest of the experimental study, and are available at the repository at http://www.uco.es/kdis/mllresources/
as well as the average value of fitness of the whole population. For the
smaller datasets (i.e., Scene and Flags, with 6 and 7 labels respectively), 50
individuals and a total of 200 generations were used. Figures 7a and 7b show
the fitness value of the best individual and the average value of the whole
population for Scene and Flags datasets respectively. For Scene dataset, we
could see that the algorithm converged quickly, obtaining the best fitness value in early generations (fewer than 50); however, for the Flags dataset the algorithm converged at around iteration 110, the moment at which the average fitness value of the population also stabilized.
Figure 7: Variation in fitness of the best individual and the average value of fitness of the
population for Scene and Flags datasets.
For the two larger datasets, PlantGO and EukaryotePseAAC, two configurations with larger populations and more generations were tested, giving enough time for the algorithm to stabilize. Figures 8b and 8a show the fitness
value of the best individual and average population with both configurations
for PlantGO and EukaryotePseAAC datasets respectively. For PlantGO, the algorithm converged within the number of generations considered with both configurations.
(a) PlantGO  (b) EukaryotePseAAC
Figure 8: Variation in fitness of the best individual and the average value of fitness of the population for PlantGO and EukaryotePseAAC datasets.
5 http://www.uco.es/kdis/eme/
The results of EME and the classic MLC algorithms over all datasets are shown in Tables 3, 4, 5, 6 and 7 for HL, SA,
MaP, MaR, and MaS evaluation measures, respectively.
Table 3: Results of classic MLC algorithms for HL ↓ measure and standard deviations.
Values in bold indicate the best results for each dataset.
EME BR LP CC PS ChiDep
Emotions 0.220±0.016 0.254±0.022 0.263±0.016 0.262±0.019 0.273±0.016 0.252±0.017
Reuters1000 0.229±0.013 0.257±0.004 0.268±0.019 0.284±0.022 0.276±0.025 0.257±0.004
Guardian1000 0.228±0.015 0.265±0.030 0.279±0.021 0.287±0.018 0.274±0.014 0.265±0.030
Bbc1000 0.216±0.016 0.263±0.015 0.264±0.013 0.284±0.016 0.270±0.010 0.267±0.017
3s-inter3000 0.265±0.014 0.308±0.026 0.312±0.009 0.311±0.030 0.314±0.030 0.308±0.026
Gnegative 0.091±0.007 0.120±0.011 0.119±0.007 0.122±0.009 0.118±0.011 0.123±0.011
Plant 0.102±0.004 0.139±0.008 0.141±0.006 0.141±0.005 0.144±0.005 0.139±0.006
Water-quality 0.299±0.007 0.310±0.007 0.375±0.008 0.334±0.009 0.337±0.011 0.315±0.012
Yeast 0.210±0.007 0.249±0.007 0.283±0.006 0.268±0.008 0.279±0.007 0.274±0.007
Human 0.090±0.002 0.121±0.002 0.126±0.002 0.122±0.003 0.123±0.002 0.121±0.001
Birds 0.047±0.004 0.052±0.010 0.063±0.004 0.052±0.008 0.054±0.006 0.052±0.010
Slashdot 0.041±0.001 0.043±0.001 0.054±0.001 0.052±0.007 0.053±0.001 0.042±0.001
Genbase 0.001±0.001 0.001±0.001 0.002±0.001 0.001±0.000 0.004±0.001 0.001±0.001
Medical 0.010±0.001 0.011±0.001 0.013±0.002 0.010±0.001 0.013±0.002 0.010±0.001
Table 4: Results of classic MLC algorithms for SA ↑ measure and standard deviations.
Values in bold indicate the best results for each dataset.
EME BR LP CC PS ChiDep
Emotions 0.248±0.038 0.170±0.047 0.226±0.034 0.218±0.040 0.209±0.056 0.191±0.036
Reuters1000 0.111±0.026 0.092±0.047 0.207±0.055 0.163±0.051 0.204±0.074 0.092±0.047
Guardian1000 0.086±0.035 0.069±0.040 0.166±0.050 0.147±0.049 0.192±0.045 0.069±0.040
For HL, EME performed the best in all cases, including a tie with BR, CC
and ChiDep for Genbase dataset. For SA, the best results are more spread,
where EME, LP, CC, and PS achieved the best result in many datasets. In
the case of MaP, EME again showed the best performance in 11 out of 14
datasets. MaP and MaR are opposite measures, so good results in one of
them usually lead to bad results in the other; for MaR LP achieved the best
performance in seven datasets, while EME was the best in four, BR and
Table 5: Results of classic MLC algorithms for MaP ↑ measure and standard deviations.
Values in bold indicate the best results for each dataset.
EME BR LP CC PS ChiDep
Emotions 0.657±0.039 0.596±0.041 0.569±0.029 0.578±0.027 0.560±0.032 0.593±0.022
Reuters1000 0.235±0.068 0.215±0.051 0.287±0.070 0.211±0.073 0.256±0.106 0.215±0.051
Guardian1000 0.284±0.088 0.205±0.053 0.221±0.053 0.230±0.074 0.248±0.048 0.205±0.053
Bbc1000 0.362±0.061 0.262±0.045 0.293±0.058 0.244±0.079 0.291±0.043 0.256±0.050
3s-inter3000 0.107±0.048 0.167±0.086 0.174±0.038 0.177±0.064 0.154±0.055 0.167±0.086
Gnegative 0.509±0.104 0.317±0.029 0.366±0.125 0.316±0.040 0.335±0.056 0.300±0.027
Plant 0.183±0.046 0.142±0.021 0.144±0.014 0.142±0.028 0.123±0.036 0.143±0.016
Water-quality 0.558±0.023 0.521±0.027 0.446±0.014 0.500±0.024 0.354±0.048 0.510±0.024
Yeast 0.510±0.029 0.403±0.009 0.377±0.021 0.394±0.014 0.377±0.012 0.381±0.010
Human
Birds
Slashdot
Genbase
Medical
0.220±0.038
0.651±0.054
0.163±0.014
0.396±0.071 0.398±0.082
0.529±0.045 0.518±0.055
0.929±0.050 0.929±0.056
0.644±0.053
0.129±0.007
0.318±0.086
0.430±0.029
0.143±0.011 0.132±0.019
0.500±0.053 0.434±0.043
0.158±0.014
0.386±0.090 0.318±0.048 0.398±0.082
0.514±0.071
0.915±0.061 0.929±0.050 0.760±0.084 0.929±0.056
0.615±0.059 0.646±0.049 0.621±0.068 0.643±0.056
Table 6: Results of classic MLC algorithms for MaR ↑ measure and standard deviations.
Values in bold indicate the best results for each dataset.
EME BR LP CC PS ChiDep
Emotions 0.592±0.025 0.547±0.028 0.561±0.022 0.568±0.037 0.553±0.023 0.571±0.032
Table 7: Results of classic MLC algorithms for MaS ↑ measure and standard deviations.
Values in bold indicate the best results for each dataset.
EME BR LP CC PS ChiDep
Emotions 0.858±0.014 0.829±0.022 0.811±0.013 0.808±0.014 0.800±0.020 0.820±0.011
Reuters1000 0.914±0.015 0.865±0.018 0.833±0.016 0.822±0.020 0.834±0.019 0.865±0.018
Guardian1000 0.918±0.017 0.864±0.037 0.826±0.011 0.823±0.016 0.833±0.007 0.864±0.037
Bbc1000 0.922±0.018 0.864±0.022 0.831±0.009 0.824±0.012 0.833±0.011 0.862±0.024
3s-inter3000 0.883±0.023 0.801±0.026 0.799±0.009 0.795±0.032 0.801±0.018 0.801±0.026
Gnegative 0.961±0.006 0.922±0.013 0.922±0.005 0.919±0.005 0.923±0.007 0.917±0.009
Plant 0.968±0.004 0.916±0.009 0.916±0.003 0.914±0.004 0.915±0.001 0.916±0.008
Water-quality 0.786±0.016 0.782±0.016 0.687±0.008 0.750±0.023 0.899±0.009 0.770±0.020
Yeast 0.803±0.006 0.745±0.006 0.735±0.010 0.743±0.013 0.743±0.009 0.730±0.010
Human 0.968±0.002 0.924±0.002 0.923±0.001 0.924±0.002 0.925±0.002 0.925±0.003
Birds 0.989±0.003 0.982±0.005 0.968±0.004 0.982±0.003 0.982±0.003 0.982±0.005
Slashdot 0.992±0.001 0.991±0.001 0.972±0.001 0.978±0.009 0.973±0.001 0.991±0.002
Genbase 1.000±0.000 1.000±0.000 0.999±0.001 1.000±0.000 0.999±0.001 1.000±0.000
Medical 0.995±0.001 0.995±0.001 0.993±0.001 0.995±0.001 0.994±0.001 0.995±0.001
ChiDep in three each, and finally CC was the best in two datasets. MaP
and MaR measures are both focused on relevant labels; on the other hand
MaS measures the ratio of correctly predicted irrelevant labels. For MaS,
EME performed the best in 13 out of 14 datasets, being the best method so
far. Despite the opposition between evaluation measures, EME was able to achieve a good performance in all of them.
Table 8: Friedman’s test results for the comparison with classic MLC algorithms. Values
in bold indicate that there exist significant differences in the performance of the algorithms
at 95% confidence.
Statistic p-value
HL 45.42 0.0000
SA 27.61 0.0000
For the four evaluation measures where the Friedman’s test indicated
that there were significant differences in the performance of the algorithms,
Table 9: Adjusted p-values of the Holm’s test for the comparison with classic MLC algo-
rithms. Algorithms marked with “-” are the control algorithm in each measure and values
in bold indicate that there are significant differences with the control algorithm at 95%
confidence.
EME BR LP CC PS ChiDep
HL - 0.0299 0.0000 0.0001 0.0000 0.0267
SA ≥ 0.2 0.0003 ≥ 0.2 ≥ 0.2 - 0.0160
MaP - 0.0617 0.0041 0.0120 0.0002 0.0128
MaS - 0.0339 0.0000 0.0000 0.0008 0.0080
EME had the better performance in three of them. For HL and MaS, EME
performed significantly better than the rest of methods, while for MaP it
performed statistically better than all except BR. Further, for SA, where
EME was not the control algorithm, it performed statistically equivalently to the control algorithm. These results showed that our algorithm has a statistically
better performance than the rest of classic MLC algorithms for all evaluation
measures.
Table 10: Results of EMLCs for HL ↓ measure and standard deviations. Values in bold indicate the best results for each dataset.
For HL, EBR performed the best in seven datasets, while ECC in five,
RF-PCT in four, EME in three and both RAk EL, HOMER and MLS in
Table 11: Results of EMLCs for SA ↑ measure and standard deviations. Values in bold
indicate the best results for each dataset.
EME ECC EBR RAk EL EPS HOMER MLS RF-PCT
Emotions 0.248±0.038 0.297±0.036 0.274±0.037 0.250±0.032 0.292±0.031 0.182±0.041 0.186±0.039 0.284±0.037
Reuters1000 0.111±0.026 0.064±0.031 0.040±0.026 0.129±0.029 0.115±0.035 0.078±0.031 0.112±0.010 0.045±0.022
Guardian1000 0.086±0.035 0.063±0.030 0.037±0.020 0.092±0.036 0.130±0.040 0.069±0.036 0.076±0.055 0.037±0.026
Bbc1000 0.120±0.034 0.086±0.034 0.057±0.024 0.134±0.028 0.142±0.042 0.102±0.051 0.088±0.024 0.045±0.022
3s-inter3000 0.040±0.023 0.050±0.029 0.025±0.026 0.037±0.022 0.044±0.035 0.042±0.027 0.077±0.044 0.033±0.032
Gnegative 0.487±0.030 0.548±0.032 0.497±0.031 0.493±0.030 0.513±0.027 0.421±0.023 0.397±0.037 0.470±0.030
Plant 0.113±0.017 0.140±0.024 0.089±0.020 0.127±0.029 0.095±0.018 0.094±0.015 0.109±0.028 0.101±0.018
Water-quality 0.014±0.008 0.017±0.010 0.016±0.009 0.013±0.008 0.015±0.009 0.004±0.004 0.008±0.004 0.012±0.008
Yeast 0.137±0.013 0.171±0.016 0.131±0.014 0.112±0.015 0.168±0.015 0.076±0.011 0.051±0.008 0.145±0.014
Human 0.159±0.014 0.174±0.011 0.141±0.013 0.167±0.018 0.140±0.013 0.105±0.004 0.122±0.007 0.127±0.012
Birds 0.496±0.048 0.522±0.054 0.516±0.055 0.490±0.045 0.515±0.055 0.457±0.049 0.491±0.053 0.503±0.057
Slashdot 0.323±0.015 0.330±0.021 0.303±0.016 0.314±0.024 0.399±0.008 0.309±0.021 0.310±0.020 0.252±0.013
Genbase 0.966±0.015 0.968±0.013 0.967±0.013 0.965±0.014 0.937±0.018 0.970±0.009 0.967±0.016 0.000±0.000
Medical 0.649±0.037 0.671±0.030 0.650±0.025 0.641±0.040 0.674±0.024 0.654±0.052 0.637±0.044 0.085±0.038
Table 12: Results of EMLCs for MaP ↑ measure and standard deviations. Values in bold
indicate the best results for each dataset.
EME ECC EBR RAk EL EPS HOMER MLS RF-PCT
Emotions 0.657±0.039 0.685±0.033 0.704±0.034 0.640±0.032 0.673±0.029 0.588±0.017 0.588±0.026 0.647±0.033
Reuters1000 0.235±0.068 0.170±0.087 0.134±0.090 0.243±0.048 0.244±0.089 0.165±0.030 0.208±0.061 0.137±0.099
Guardian1000 0.284±0.088 0.166±0.070 0.133±0.083 0.250±0.076 0.272±0.114 0.242±0.049 0.202±0.070 0.120±0.111
Bbc1000 0.362±0.061 0.216±0.102 0.214±0.123 0.353±0.077 0.359±0.098 0.248±0.052 0.267±0.069 0.179±0.117
3s-inter3000 0.107±0.048 0.139±0.092 0.094±0.090 0.144±0.061 0.090±0.063 0.165±0.074 0.157±0.080 0.117±0.107
Table 13: Results of EMLCs for MaR ↑ measure and standard deviations. Values in bold
indicate the best results for each dataset.
Table 14: Results of EMLCs for MaS ↑ measure and standard deviations. Values in bold
indicate the best results for each dataset.
EME ECC EBR RAk EL EPS HOMER MLS RF-PCT
Emotions 0.858±0.014 0.861±0.014 0.881±0.012 0.834±0.016 0.856±0.012 0.805±0.013 0.810±0.012 0.828±0.014
Reuters1000 0.914±0.015 0.924±0.017 0.952±0.011 0.896±0.014 0.910±0.017 0.793±0.042 0.867±0.014 0.964±0.014
Guardian1000 0.918±0.017 0.929±0.016 0.958±0.013 0.897±0.021 0.910±0.017 0.822±0.019 0.863±0.016 0.969±0.011
Bbc1000 0.922±0.018 0.936±0.015 0.964±0.010 0.911±0.018 0.912±0.014 0.817±0.020 0.865±0.010 0.974±0.011
3s-inter3000 0.883±0.023 0.907±0.023 0.952±0.016 0.863±0.024 0.918±0.015 0.811±0.026 0.794±0.023 0.967±0.014
Gnegative 0.961±0.006 0.964±0.004 0.973±0.004 0.949±0.008 0.968±0.004 0.922±0.007 0.922±0.009 0.964±0.004
Plant 0.968±0.004 0.974±0.004 0.985±0.002 0.959±0.012 0.982±0.003 0.915±0.007 0.917±0.003 0.979±0.004
Water-quality 0.786±0.016 0.759±0.018 0.800±0.017 0.735±0.021 0.940±0.007 0.662±0.046 0.741±0.029 0.685±0.018
Yeast 0.803±0.006 0.774±0.006 0.804±0.005 0.761±0.013 0.786±0.005 0.727±0.017 0.743±0.010 0.746±0.005
Human 0.968±0.002 0.969±0.002 0.981±0.002 0.954±0.008 0.975±0.002 0.927±0.003 0.927±0.003 0.972±0.003
Birds 0.989±0.003 0.993±0.002 0.995±0.001 0.986±0.003 0.992±0.003 0.970±0.006 0.985±0.005 0.991±0.002
Slashdot 0.992±0.001 0.989±0.002 0.992±0.001 0.992±0.001 0.985±0.001 0.984±0.002 0.990±0.001 0.997±0.001
Genbase 1.000±0.000 1.000±0.000 1.000±0.000 1.000±0.000 0.999±0.001 1.000±0.000 1.000±0.000 1.000±0.000
Medical 0.995±0.001 0.994±0.001 0.995±0.001 0.995±0.001 0.994±0.001 0.995±0.001 0.995±0.001 0.999±0.001
datasets where the number of labels is greater. That means that when the
label space is wider, EME tends to predict correctly, in average, a greater
number of labels than the rest of methods. This is given by the fact that for
cases where a greater number of different possible combinations of k-labelsets
are available, EME is able to obtain a good combination of subsets of labels
with a great performance. It can also be seen that EME obtained a better performance than RAkEL in all cases, enhancing the need for optimizing the combination of k-labelsets rather than selecting them randomly. On the other hand, SA measures the ratio of multi-label predictions where both the relevant and irrelevant labels
were exactly predicted. Although it is an interesting evaluation measure
in some cases, it must be interpreted cautiously since it does not consider partially correct predictions, which are treated in the same way as the rest of the instances. For MaP, EME was the best in five datasets, followed by
EBR being the best in four. ECC, which achieved better results in other
measures, was not the best in any case for MaP. Further, although MaP
and MaR are opposite measures, ECC did not achieve great results in MaR
either, being the best in only one case. On the other hand, EME was the
best in only one case in MaR but achieved better results for MaP, which is
an expected behavior. Both ECC and EBR were not the best in any case for
this measure. Finally, for MaS the better results were spread between EBR
and RF-PCT, being the best in seven datasets each, while the rest in only
one.
As in the previous experiment, first Friedman’s test was performed in
order to know if there were significant differences on the performance of
the algorithms. The results of Friedman’s test are shown in Table 15, indi-
cating that significant differences exist for all measures at 95% confidence.
Therefore, the post-hoc Holm’s test was performed for all the measures. The
results, including the adjusted p-values are shown in Table 16.
Table 15: Friedman’s test results for the comparison with state-of-the-art EMLCs. Values
in bold indicate that there exist significant differences in the performance of the algorithms
at 95% confidence.
     Statistic  p-value
HL   52.99      0.0000
SA   31.65      0.0000
MaP  26.74      0.0004
MaR  42.49      0.0000
MaS  50.84      0.0000
Table 16: Adjusted p-values of the Holm’s test for the comparison among state-of-the-art
EMLCs. Algorithms marked with “-” are the control algorithm in each measure and values
in bold indicate that there are significant differences with the control algorithm at 95%
confidence.
Although EME was the best performing algorithm for only one evalua-
tion measure, it was the only one that did not have significant differences
with the control algorithm in any measure. ECC and EBR, which achieved
great results in some evaluation measures, being the control algorithm in one
and two cases respectively, also had a significantly poorer performance than
the control algorithm in some cases, such as for SA and MaR. These re-
sults showed that EME is more consistent in overall performance than other
state-of-the-art EMLCs over all measures, and did not perform significantly
worse than the rest in any case. EME achieved high predictive performance
compared not only with classic MLC algorithms, but also when compared
with other EMLCs.
Further, EME had a better overall performance than RAk EL in four of the
five measures, including HL and MaS, where RAk EL performed significantly
worse than the control algorithm. This indicates that the fact of not only
selecting the k-labelsets randomly as RAk EL does, but also evolving towards
a more promising combination of k-labelsets in the ensemble makes the model
to achieve a better predictive performance.
6. Conclusions
In this paper we presented an evolutionary algorithm for the automatic
generation of ensembles of multi-label classifiers based on projections of la-
bels, taking into account the relationships among the labels but avoiding a
high complexity. Each individual in the evolutionary algorithm encodes an
ensemble of multi-label classifiers, which are evaluated taking into account
both the predictive performance of the individual and the number of times
that each label appears in the ensemble. The evolutionary algorithm helps to
obtain a promising and high-performing combination of multi-label classifiers
into an ensemble.
The experiments over a wide set of fourteen datasets and five evaluation
measures showed that our algorithm performed statistically better than clas-
sic MLC methods and also had a more consistent performance than other
state-of-the-art EMLCs. EME obtained the best results in several
cases, and although not being always the first algorithm in the ranking,
EME was the only algorithm that did not perform significantly worse than
the rest in any case. Further, the experimental results also show that the
fact of evolving the individuals toward more promising combinations of multi-
label classifiers achieves better results than just selecting them randomly, as
RAk EL does.
As future work, we aim to extend EME to use a variable number of labels
(k) in each of the classifiers of the ensemble and to explore other ways to com-
bine the predictions of the classifiers to create the final ensemble prediction.
Further, we aim to perform an optimization and tuning of the parameters of the evolutionary algorithm.
Acknowledgements
This research was supported by the Spanish Ministry of Economy and
Competitiveness and the European Regional Development Fund, project
TIN2017-83445-P. This research was also supported by the Spanish Ministry
of Education under FPU Grant FPU15/02948.
References
[1] L. Tang, H. Liu, Scalable learning of collective behavior based on sparse
social dimensions, in: Proceedings of the 18th ACM Conference on Infor-
mation and Knowledge Management (CIKM 09), 2009, pp. 1107–1116.
[2] G. Nasierding, A. Kouzani, Image to text translation by multi-label
classification, in: Advanced Intelligent Computing Theories and Ap-
plications with Aspects of Artificial Intelligence, Vol. 6216, 2010, pp.
247–254.
[3] E. Loza, J. Fürnkranz, Efficient multilabel classification algorithms for
large-scale problems in the legal domain, in: Semantic Processing of
Legal Texts, Vol. 6036, 2010, pp. 192–215.
streams with adaptive model rules and random rules, Progress in Arti-
ficial Intelligence 7 (3) (2018) 177–187.
[11] J. M. Moyano, E. L. Gibaja, K. J. Cios, S. Ventura, Review of ensembles
of multi-label classifiers: Models, experimental study and prospects,
Information Fusion 44 (2018) 33 – 45.
survey and categorisation, Information Fusion 6 (1) (2005) 5 – 20.
classification via probabilistic classifier chains, in: ICML, Vol. 10, 2010,
pp. 279–286.
[21] H. Blockeel, L. D. Raedt, J. Ramon, Top-down induction of cluster-
ing trees, in: Proceedings of the Fifteenth International Conference on
Machine Learning, ICML ’98, Morgan Kaufmann Publishers Inc., San
Francisco, CA, USA, 1998, pp. 55–63.
[22] L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Re-
gression Trees, Wadsworth and Brooks, 1984.
[23] M.-L. Zhang, Z.-H. Zhou, A k-Nearest Neighbor Based Algorithm for
Multi-label Classification, in: Proceedings of the IEEE International
Conference on Granular Computing (GrC), Vol. 2, The IEEE Compu-
tational Intelligence Society, Beijing, China, 2005, pp. 718–721.
[24] M.-L. Zhang, Z.-H. Zhou, Multi-label neural networks with applications
to functional genomics and text categorization, IEEE Transactions on
Knowledge and Data Engineering 18 (2006) 1338–1351.
dependence and loss minimization in multi-label classification, Machine
Learning 88 (1) (2012) 5–45.
[32] J. Cohen, P. Cohen, S. G. West, L. S. Aiken, Applied Multiple Re-
gression / Correlation Analysis for the Behavioral Sciences, Psychology
Press, 2002.
[33] J. Su, H. Zhang, A fast decision tree learning algorithm, in: Proceedings
2013, Southampton, United Kingdom, September 22-25, 2013, 2013, pp.
1–8.
Waikato.
Informatics (PCI 2005), 2005, pp. 448–456.