
Front. Comput. Sci., 2018, 12(2): 191–202
https://doi.org/10.1007/s11704-017-7031-7

Binary relevance for multi-label learning: an overview

Min-Ling ZHANG 1,2,3, Yu-Kun LI 1,2,3, Xu-Ying LIU 1,2,3, Xin GENG 1,2,3

1 School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
2 Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China
3 Collaborative Innovation Center for Wireless Communications Technology, Nanjing 211100, China

© Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature 2018

Abstract  Multi-label learning deals with problems where each example is represented by a single instance while being associated with multiple class labels simultaneously. Binary relevance is arguably the most intuitive solution for learning from multi-label examples. It works by decomposing the multi-label learning task into a number of independent binary learning tasks (one per class label). In view of its potential weakness in ignoring correlations between labels, many correlation-enabling extensions to binary relevance have been proposed in the past decade. In this paper, we aim to review the state of the art of binary relevance from three perspectives. First, basic settings for multi-label learning and binary relevance solutions are briefly summarized. Second, representative strategies to provide binary relevance with label correlation exploitation abilities are discussed. Third, some of our recent studies on binary relevance aimed at issues other than label correlation exploitation are introduced. As a conclusion, we provide suggestions on future research directions.

Keywords  machine learning, multi-label learning, binary relevance, label correlation, class-imbalance, relative labeling-importance

Received January 25, 2017; accepted July 4, 2017
E-mail: [email protected]

1 Introduction

Multi-label learning is a popular learning framework for modeling real-world objects with multiple semantic meanings [1,2]. For instance, in text categorization, a news document on government reform can cover multiple topics, such as politics, economics, and society [3]; in image classification, a natural scene image can depict multiple types of scenery, such as sky, sand, sea, and yacht [4]. Multi-label objects exist in many real-world applications, including information retrieval [5], bioinformatics [6], multimedia content annotation [7], and Web mining [8].

The goal of multi-label learning is to induce a multi-label predictor that can assign a set of relevant labels for the unseen instance. In order to achieve this, the most intuitive solution is to learn one binary classifier for each class label, where the relevance of each class label for the unseen instance is determined by the prediction yielded by the corresponding binary classifier [9]. Specifically, the binary relevance procedure works in an independent manner, where the binary classifier for each class label is learned by ignoring the existence of other class labels. Due to its conceptual simplicity, binary relevance has attracted considerable attention in multi-label learning research.1)

1) According to Google Scholar (June 2017), the seminal work on binary relevance [9] has received more than 1,100 citations.

However, a consensus assumption for multi-label learning is that the correlations between labels should be exploited in order to build multi-label prediction models with strong generalization ability [1,2,10,11]. The decomposition nature of binary relevance leads to its inability to exploit label correlations. Therefore, many correlation-enabling extensions to binary relevance have been proposed in the past decade [12–29]. Generally, representative strategies to provide binary relevance with label correlation exploitation abilities include the chaining structure assuming random label correlations, the stacking structure assuming full-order label correlations, and the controlling structure assuming pruned label correlations.
Although label correlation plays an essential role in inducing effective multi-label learning models, recent studies have shown that some inherent properties of multi-label learning should also be investigated in order to achieve good generalization performance. On one hand, class labels in the label space typically have imbalanced distributions, meaning the number of positive instances w.r.t. each class label is far less than its negative counterpart [30–39]. On the other hand, class labels in the label space typically have different labeling-importance, meaning the importance degrees of each class label for characterizing the semantics of a multi-label example are relative to each other [40–45]. Therefore, in order to enhance the generalization performance of binary relevance models, it is beneficial to consider these inherent properties in addition to label correlation exploitation during the learning procedure.

In this paper, we aim to provide an overview of the state of the art of binary relevance for multi-label learning. In Section 2, formal definitions for multi-label learning, as well as the canonical binary relevance solution, are briefly summarized. In Section 3, representative strategies to provide label correlation exploitation abilities to binary relevance are discussed. In Section 4, some of our recent studies on related issues regarding binary relevance are introduced. Finally, Section 5 provides suggestions for several future research directions regarding binary relevance.

2 Binary relevance

Let X = R^d denote the d-dimensional instance space and let Y = {λ_1, λ_2, ..., λ_q} denote the label space consisting of q class labels. The goal of multi-label learning is to induce a multi-label predictor f: X → 2^Y from the multi-label training set D = {(x_i, y_i) | 1 ≤ i ≤ m}. Here, for each multi-label training example (x_i, y_i), x_i ∈ X is a d-dimensional feature vector [x_{i1}, x_{i2}, ..., x_{id}]^⊤ and y_i ∈ {−1, +1}^q is a q-bit binary vector [y_{i1}, y_{i2}, ..., y_{iq}]^⊤, with y_{ij} = +1 (−1) indicating that λ_j is a relevant (irrelevant) label for x_i.2) Equivalently, the set of relevant labels Y_i ⊆ Y for x_i corresponds to Y_i = {λ_j | y_{ij} = +1, 1 ≤ j ≤ q}. Given an unseen instance x* ∈ X, its relevant label set Y* is predicted as Y* = f(x*) ⊆ Y.

2) Without loss of generality, the binary assignment of each class label is represented by +1 and −1 (rather than 1 and 0) in this paper.

Binary relevance is arguably the most intuitive solution for learning from multi-label training examples [1,2]. It decomposes the multi-label learning problem into q independent binary learning problems, where each binary classification problem corresponds to one class label in the label space Y [9]. Specifically, for each class label λ_j, binary relevance derives a binary training set D_j from the original multi-label training set D in the following manner:

D_j = {(x_i, y_{ij}) | 1 ≤ i ≤ m}.   (1)

In other words, each multi-label training example (x_i, y_i) is transformed into a binary training example based on its relevancy to λ_j.

Next, a binary classifier g_j: X → R can be induced from D_j by applying a binary learning algorithm B, i.e., g_j ← B(D_j). Therefore, the multi-label training example (x_i, y_i) will contribute to the learning process for all binary classifiers g_j (1 ≤ j ≤ q), where x_i is utilized as a positive (or negative) training example for inducing g_j based on its relevancy (or irrelevancy) to λ_j.3)

3) In the seminal literature on binary relevance [9], this training procedure is also referred to as cross-training.

Given an unseen instance x*, its relevant label set Y* is determined by querying the outputs of each binary classifier:

Y* = {λ_j | g_j(x*) > 0, 1 ≤ j ≤ q}.   (2)

As shown in Eq. (2), the predicted label set Y* will be empty when all binary classifiers yield negative outputs for x*. In this case, one might choose the so-called T-Criterion method [9] to predict the class label with the greatest (least negative) output. Other criteria for aggregating the outputs of binary classifiers can be found in [9].

Algorithm 1 summarizes the pseudo-code for binary relevance.

Algorithm 1  Pseudo-code for binary relevance [9]
Inputs:
  D: Multi-label training set {(x_i, y_i) | 1 ≤ i ≤ m}  (x_i ∈ X, y_i ∈ {−1, +1}^q, X = R^d, Y = {λ_1, λ_2, ..., λ_q})
  B: Binary learning algorithm
  x*: Unseen instance (x* ∈ X)
Outputs:
  Y*: Predicted label set for x* (Y* ⊆ Y)
Process:
  1: for j = 1 to q do
  2:   Derive the binary training set D_j according to Eq. (1);
  3:   Induce the binary classifier g_j ← B(D_j);
  4: end for
  5: return Y* = {λ_j | g_j(x*) > 0, 1 ≤ j ≤ q}

As shown in Algorithm 1, there are several properties that are noteworthy for binary relevance:

• First, the most prominent property of binary relevance lies in its conceptual simplicity. Specifically, binary relevance is a first-order approach that builds a classification model in a label-by-label manner and ignores the existence of other class labels. The modeling complexity of binary relevance is linear in the number of class labels q in the label space;
• Second, binary relevance falls into the category of problem transformation approaches, which solve multi-label learning problems by transforming them into other well-established learning scenarios (binary classification in this case) [1,2]. Therefore, binary relevance is not restricted to a particular learning technique and can be instantiated with various binary learning algorithms with diverse characteristics;
• Third, binary relevance optimizes macro-averaged label-based multi-label evaluation metrics, which evaluate the learning system's performance on each class label separately and then return the mean value across all class labels. Therefore, the actual multi-label metric being optimized depends on the binary loss which is minimized by the binary learning algorithm B [46,47];
• Finally, binary relevance can be easily adapted to learn from multi-label examples with missing labels, where the labeling information for training examples is incomplete due to various factors, such as high labeling cost, carelessness of human labelers, etc. [48–50]. In order to accommodate this situation, binary relevance can derive the binary training set in Eq. (1) by simply excluding examples whose labeling information y_{ij} is not available.
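To make the procedure concrete, the following is a minimal Python sketch of Algorithm 1. It is not from the paper: it assumes scikit-learn's LogisticRegression as the binary learner B, a label matrix Y over {−1, +1}, and illustrative function names (br_train, br_predict).

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def br_train(X, Y, base_learner=None):
    """Induce one binary classifier per class label (Algorithm 1, Eq. (1))."""
    base = base_learner or LogisticRegression(max_iter=1000)
    # Y has shape (m, q) with entries in {-1, +1}; column j is the binary
    # training signal of D_j for label lambda_j.
    return [clone(base).fit(X, Y[:, j]) for j in range(Y.shape[1])]

def br_predict(classifiers, X):
    """Predict relevant labels by thresholding each real-valued output at 0 (Eq. (2))."""
    scores = np.column_stack([g.decision_function(X) for g in classifiers])
    Y_pred = np.where(scores > 0, 1, -1)
    # T-Criterion fallback: if no label fires for an instance, predict the label
    # with the greatest (least negative) output.
    empty = (Y_pred == -1).all(axis=1)
    Y_pred[empty, scores[empty].argmax(axis=1)] = 1
    return Y_pred
```

Any other scorer exposing a real-valued output could be substituted for the base learner, which is exactly the flexibility noted in the second property above.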
3 Correlation-enabling extensions

As discussed in Section 2, binary relevance has been used widely for multi-label modeling due to its simplicity and other attractive properties. However, one potential weakness of binary relevance lies in its inability to exploit label correlations to improve the learning system's generalization ability [1,2]. Therefore, a natural consideration is to attempt to provide binary relevance with label correlation exploitation abilities while retaining its linear modeling complexity w.r.t. the number of class labels.

In light of the above consideration, many correlation-enabling extensions have been proposed following the seminal work on binary relevance. In the following sections, three representative extension strategies are discussed: the chaining structure assuming random label correlations [12–18], the stacking structure assuming full-order label correlations [19–23], and the controlling structure assuming pruned label correlations [24–29].

3.1 Binary relevance with the chaining structure

In the chaining structure, a total of q binary classifiers are induced based on a chaining order specified over the class labels. Specifically, one binary classifier is built for each class label based on the predictions of the preceding classifiers in the chain [12,14].

Given the label space Y = {λ_1, λ_2, ..., λ_q}, let π: {1, 2, ..., q} → {1, 2, ..., q} be the permutation used to specify a chaining order over all class labels, i.e., λ_{π(1)} ≻ λ_{π(2)} ≻ ··· ≻ λ_{π(q)}. Thereafter, for the jth class label λ_{π(j)} in the ordered list, the classifier chain approach [12,14] works by deriving a corresponding binary training set D_{π(j)} from D in the following manner:

D_{π(j)} = {(([x_i, y^i_{π(1)}, ..., y^i_{π(j−1)}]), y^i_{π(j)}) | 1 ≤ i ≤ m}.   (3)

Here, the binary assignments of preceding class labels in the chain, i.e., (y^i_{π(1)}, ..., y^i_{π(j−1)}), are treated as additional features to append to the original instance x_i.

Next, a binary classifier g_{π(j)}: X × {−1, +1}^{j−1} → R can be induced from D_{π(j)} by applying a binary learning algorithm B, i.e., g_{π(j)} ← B(D_{π(j)}). In other words, g_{π(j)} determines the relevancy of λ_{π(j)} by exploiting its correlations with the preceding labels λ_{π(1)}, ..., λ_{π(j−1)} in the chain.

Given an unseen instance x*, its relevant label set Y* is determined by iteratively querying the outputs of each binary classifier along the chaining order. Let η^{x*}_{π(j)} ∈ {−1, +1} denote the predicted binary assignment of λ_{π(j)} on x*, which is recursively determined as follows:

η^{x*}_{π(1)} = sign[g_{π(1)}(x*)],
η^{x*}_{π(j)} = sign[g_{π(j)}([x*, η^{x*}_{π(1)}, ..., η^{x*}_{π(j−1)}])].   (4)

Here, sign[·] represents the signed function. Therefore, the relevant label set Y* is derived as:

Y* = {λ_{π(j)} | η^{x*}_{π(j)} = +1, 1 ≤ j ≤ q}.   (5)

Algorithm 2 presents the pseudo-code of the classifier chain. As shown in Algorithm 2, the classifier chain is a high-order approach that considers correlations between labels in a random manner specified by the permutation π. In order to account for the randomness introduced by the permutation ordering, one effective choice is to build an ensemble of classifier chains with n random permutations {π_r | 1 ≤ r ≤ n}. One classifier chain can be learned based on each random permutation, and then the outputs from all classifier chains are aggregated to yield the final prediction [12,14,16].

Algorithm 2  Pseudo-code of the classifier chain [12,14]
Inputs:
  D: Multi-label training set {(x_i, y_i) | 1 ≤ i ≤ m}  (x_i ∈ X, y_i ∈ {−1, +1}^q, X = R^d, Y = {λ_1, λ_2, ..., λ_q})
  π: Permutation used to specify the chaining order
  B: Binary learning algorithm
  x*: Unseen instance (x* ∈ X)
Outputs:
  Y*: Predicted label set for x* (Y* ⊆ Y)
Process:
  1: for j = 1 to q do
  2:   Derive the binary training set D_{π(j)} using Eq. (3);
  3:   Induce the binary classifier g_{π(j)} ← B(D_{π(j)});
  4: end for
  5: Determine the binary assignments η^{x*}_{π(j)} (1 ≤ j ≤ q) using Eq. (4);
  6: return Y* = {λ_{π(j)} | η^{x*}_{π(j)} = +1, 1 ≤ j ≤ q} w.r.t. Eq. (5)

It is also worth noting that predictive errors incurred in preceding classifiers will be propagated to subsequent classifiers along the chain. These undesirable influences become more pronounced if error-prone class labels happen to be placed at early chaining positions [12,14,28,51]. Furthermore, during the training phase, the additional features appended to the input space X correspond to the ground-truth labeling assignments (Eq. (3)). However, during the testing phase, the additional features appended to X correspond to predicted labeling assignments (Eq. (4)). One way to rectify this discrepancy is to replace the extra features (y^i_{π(1)}, ..., y^i_{π(j−1)}) in Eq. (3) with (η^{x_i}_{π(1)}, ..., η^{x_i}_{π(j−1)}) such that the predicted labeling assignments are appended to X in both the training and testing phases [17,51].

From a statistical point of view, the task of multi-label learning is equivalent to learning the conditional distribution p(y | x) with x ∈ X and y ∈ {−1, +1}^q. Therefore, p(y | x) can be factorized w.r.t. the chaining order specified by π as follows:

p(y | x) = ∏_{j=1}^{q} p(y_{π(j)} | x, y_{π(1)}, ..., y_{π(j−1)}).   (6)

Here, each term on the RHS of Eq. (6) represents the conditional probability of observing y_{π(j)} given x and its preceding labels in the chain. Specifically, this term can be estimated by utilizing a binary learning algorithm B which is capable of yielding probabilistic outputs (e.g., Naive Bayes). Thereafter, the relevant label set for the unseen instance is predicted by performing exact inference [13] or approximate inference (when q is large) over the probabilistic classifier chain [15,18].
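As a rough illustration of Eqs. (3)–(5), the sketch below chains scikit-learn classifiers along a fixed permutation. The class name and the choice of LogisticRegression as the binary learner B are assumptions made for the example, not part of the original method description.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

class ClassifierChainSketch:
    """Classifier chain over a fixed permutation pi (Eqs. (3)-(5))."""

    def __init__(self, base_learner=None, pi=None):
        self.base = base_learner or LogisticRegression(max_iter=1000)
        self.pi = pi  # permutation of label indices; None means identity order

    def fit(self, X, Y):
        # X: (m, d) features, Y: (m, q) labels in {-1, +1}
        m, q = Y.shape
        self.pi_ = np.arange(q) if self.pi is None else np.asarray(self.pi)
        self.chain_ = []
        Z = X
        for j in self.pi_:
            g = clone(self.base).fit(Z, Y[:, j])     # Eq. (3): preceding labels as extra features
            self.chain_.append(g)
            Z = np.hstack([Z, Y[:, [j]]])            # append ground-truth assignment of lambda_pi(j)
        return self

    def predict(self, X):
        q = len(self.pi_)
        eta = np.zeros((X.shape[0], q))
        Z = X
        for pos, g in enumerate(self.chain_):        # Eq. (4): iterate along the chaining order
            pred = np.sign(g.decision_function(Z))
            pred[pred == 0] = 1.0
            eta[:, self.pi_[pos]] = pred
            Z = np.hstack([Z, pred.reshape(-1, 1)])  # predicted assignments feed later classifiers
        return eta                                   # Eq. (5): relevant labels are those predicted +1
```

An ensemble of such chains, each built on a different random permutation, can then be averaged to reduce the sensitivity to the chaining order, as discussed above.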

 
3.2 Binary relevance with the stacking structure

In the stacking structure, a total of 2q binary classifiers are induced by stacking a set of q meta-level binary relevance models over another set of q base-level binary relevance models. Specifically, each meta-level binary classifier is built upon the predictions of all base-level binary classifiers [19].

Following the notations in Section 2, let g_j (1 ≤ j ≤ q) denote the set of base-level classifiers learned by invoking the standard binary relevance procedure on the multi-label training set, i.e., g_j ← B(D_j). Thereafter, for each class label λ_j, the stacking aggregation approach [1,19] derives a meta-level binary training set D^M_j in the following manner:

D^M_j = {(([x_i, sign[g_1(x_i)], ..., sign[g_q(x_i)]]), y_{ij}) | 1 ≤ i ≤ m}.   (7)

Here, the signed predictions of the base-level classifiers, i.e., (sign[g_1(x_i)], ..., sign[g_q(x_i)]), are treated as additional features to append to the original instance x_i in the meta-level.

Next, a meta-level classifier g^M_j: X × {−1, +1}^q → R can be induced from D^M_j by applying a binary learning algorithm B, i.e., g^M_j ← B(D^M_j). In other words, g^M_j determines the relevancy of λ_j by exploiting its correlations with all the class labels.

Given an unseen instance x*, its relevant label set Y* is determined by using the outputs of the base-level classifiers as extra inputs for the meta-level classifiers:

Y* = {λ_j | g^M_j(τ^{x*}) > 0, 1 ≤ j ≤ q},  where τ^{x*} = [x*, sign[g_1(x*)], ..., sign[g_q(x*)]].   (8)

Algorithm 3 presents the pseudo-code for stacking aggregation. As shown in Algorithm 3, stacking aggregation is a full-order approach that assumes that each class label has correlations with all other class labels. It is worth noting that stacking aggregation employs ensemble learning [52] to combine two sets of binary relevance models with deterministic label correlation exploitation. Ensemble learning can also be applied to the classifier chain to compensate for its randomness of label correlation exploitation.

Algorithm 3  Pseudo-code of stacking aggregation [19]
Inputs:
  D: Multi-label training set {(x_i, y_i) | 1 ≤ i ≤ m}  (x_i ∈ X, y_i ∈ {−1, +1}^q, X = R^d, Y = {λ_1, λ_2, ..., λ_q})
  B: Binary learning algorithm
  x*: Unseen instance (x* ∈ X)
Outputs:
  Y*: Predicted label set for x* (Y* ⊆ Y)
Process:
  1: for j = 1 to q do
  2:   Derive the binary training set D_j using Eq. (1);
  3:   Induce the base-level binary classifier g_j ← B(D_j);
  4: end for
  5: for j = 1 to q do
  6:   Derive the meta-level binary training set D^M_j using Eq. (7);
  7:   Induce the meta-level binary classifier g^M_j ← B(D^M_j);
  8: end for
  9: return Y* = {λ_j | g^M_j(τ^{x*}) > 0, 1 ≤ j ≤ q} w.r.t. Eq. (8)

Rather than using the outputs of the base-level classifiers (sign[g_1(x_i)], ..., sign[g_q(x_i)]) by appending them to the inputs of the meta-level classifiers, it is also feasible to use the ground-truth labeling assignments (y_{i1}, ..., y_{iq}) to instantiate the meta-level binary training set (i.e., Eq. (7)) [21]. However, similar to the standard classifier chain approach, this practice would also lead to the discrepancy issue regarding the additional features appended to the input space X in the training and testing phases.

There are other ways to make use of the stacking strategy to induce a multi-label prediction model. Given the base-level classifiers g_j (1 ≤ j ≤ q) and the meta-level classifiers g^M_j (1 ≤ j ≤ q), rather than relying only on the meta-level classifiers to yield final predictions (i.e., Eq. (8)), one can also aggregate the outputs of both the base-level and meta-level classifiers to accomplish this task [20]. Furthermore, rather than using the binary labeling assignments as additional features for stacking, one can also adapt specific techniques to generate tailored features for stacking, such as discriminant analysis [22] or rule learning [23].
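A hedged sketch of stacking aggregation (Eqs. (7) and (8)) under the same assumptions as before (scikit-learn's LogisticRegression as B, labels in {−1, +1}, illustrative function names) might look as follows.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def stacking_train(X, Y, base_learner=None):
    """Stacking aggregation: q base-level plus q meta-level classifiers (Eq. (7))."""
    base = base_learner or LogisticRegression(max_iter=1000)
    q = Y.shape[1]
    base_models = [clone(base).fit(X, Y[:, j]) for j in range(q)]
    # Signed base-level predictions become extra meta-level features.
    S = np.column_stack([np.sign(g.decision_function(X)) for g in base_models])
    S[S == 0] = 1
    X_meta = np.hstack([X, S])
    meta_models = [clone(base).fit(X_meta, Y[:, j]) for j in range(q)]
    return base_models, meta_models

def stacking_predict(base_models, meta_models, X):
    """Meta-level prediction on tau = [x, sign g_1(x), ..., sign g_q(x)] (Eq. (8))."""
    S = np.column_stack([np.sign(g.decision_function(X)) for g in base_models])
    S[S == 0] = 1
    X_meta = np.hstack([X, S])
    scores = np.column_stack([g.decision_function(X_meta) for g in meta_models])
    return np.where(scores > 0, 1, -1)
```

In practice, the meta-level features could also be produced from cross-validated base-level predictions to reduce overfitting of the meta level; the sketch follows Eq. (7) literally and reuses the trained base classifiers directly.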

3.3 Binary relevance with the controlling structure

In the controlling structure, a total of 2q binary classifiers are induced based on a dependency structure specified over the class labels. Specifically, one binary classifier is built for each class label by exploiting the pruned predictions of q binary relevance models [25].

A Bayesian network (or directed acyclic graph, DAG) is a convenient tool to explicitly characterize correlations between class labels in a compact manner [25–27]. As mentioned in Subsection 3.1, a statistical equivalence for multi-label learning corresponds to modeling the conditional distribution p(y | x) with x ∈ X and y ∈ {−1, +1}^q. Given the Bayesian network structure G specified over (x, y), the conditional distribution p(y | x) can be factorized based on G as follows:

p(y | x) = ∏_{j=1}^{q} p(y_j | pa_j, x).   (9)

Here, x serves as the common parent for each y_j (1 ≤ j ≤ q) because all class labels inherently depend on the feature space X. Additionally, pa_j represents the set of parent class labels of y_j implied by G. Figure 1 illustrates two examples of how the conditional distribution p(y | x) can be factorized based on the given Bayesian network structure.

Fig. 1  Examples of two Bayesian network (DAG) structures with x serving as the common parent. The conditional distribution p(y | x) factorizes based on each structure as: (a) p(y | x) = p(y_1 | x) · p(y_2 | y_1, x) · p(y_3 | y_2, x) · p(y_4 | y_3, x); (b) p(y | x) = p(y_1 | x) · p(y_2 | y_1, x) · p(y_3 | x) · p(y_4 | y_2, y_3, x)

Learning a Bayesian network structure G from a multi-label training set D = {(x_i, y_i) | 1 ≤ i ≤ m} is difficult. Existing Bayesian network learning techniques [53] are not directly applicable for two major reasons. First, variables in the Bayesian network have mixed types, with y (class labels) being discrete and x (feature vector) being continuous. Second, computational complexity is prohibitively high when the input dimensionality (i.e., number of features) is too large.

These two issues are brought about by the involvement of the feature vector x when learning the Bayesian network structure. In light of this information, the LEAD approach [25] chooses to eliminate the effects of features in order to simplify the Bayesian network generation procedure. Following the notations in Section 2, let g_j (1 ≤ j ≤ q) denote the binary classifiers induced by the standard binary relevance procedure, i.e., g_j ← B(D_j). Accordingly, a set of error random variables are derived to decouple the influences of x from all class labels:

e_j = y_j − sign(g_j(x))   (1 ≤ j ≤ q).   (10)

Thereafter, the Bayesian network structure G for all class labels (conditioned on x) can be learned from e_j (1 ≤ j ≤ q) using off-the-shelf packages [54–56].

Based on the DAG structure implied by G, for each class label λ_j, the LEAD approach derives a binary training set D^G_j from D in the following manner:

D^G_j = {(([x_i, pa^i_j]), y_{ij}) | 1 ≤ i ≤ m}.   (11)

Here, the binary assignments of the parent class labels, i.e., pa^i_j, are treated as additional features to append to the original instance x_i.

Next, a binary classifier g^G_j: X × {−1, +1}^{|pa_j|} → R can be induced from D^G_j by applying a binary learning algorithm B, i.e., g^G_j ← B(D^G_j). In other words, g^G_j determines the relevancy of λ_j by exploiting its correlations with the parent class labels pa_j implied by G.

Given an unseen instance x*, its relevant label set Y* is determined by iteratively querying the outputs of each binary classifier w.r.t. the Bayesian network structure. Let π_G: {1, 2, ..., q} → {1, 2, ..., q} be the causal order implied by G over all class labels, i.e., λ_{π_G(1)} ≻ λ_{π_G(2)} ≻ ··· ≻ λ_{π_G(q)}. Furthermore, let η^{x*}_{π_G(j)} ∈ {−1, +1} denote the predicted binary assignment of λ_{π_G(j)} on x*, which is recursively determined as follows:

η^{x*}_{π_G(1)} = sign[g^G_{π_G(1)}(x*)],
η^{x*}_{π_G(j)} = sign[g^G_{π_G(j)}([x*, {η^{x*}_a | y_a ∈ pa_{π_G(j)}}])].   (12)

Accordingly, the relevant label set Y* becomes:

Y* = {λ_{π_G(j)} | η^{x*}_{π_G(j)} = +1, 1 ≤ j ≤ q}.   (13)

Algorithm 4 presents the pseudo-code of LEAD. As shown in Algorithm 4, LEAD is a high-order approach that controls the order of correlations using the number of parents of each class label implied by G. Similar to stacking aggregation, LEAD also employs ensemble learning to combine two sets of binary classifiers g_j (1 ≤ j ≤ q) and g^G_j (1 ≤ j ≤ q) to yield the multi-label prediction model. Specifically, predictions of the q binary classifiers g_j are pruned w.r.t. the parents for label correlation exploitation.

Algorithm 4  Pseudo-code of LEAD [25]
Inputs:
  D: Multi-label training set {(x_i, y_i) | 1 ≤ i ≤ m}  (x_i ∈ X, y_i ∈ {−1, +1}^q, X = R^d, Y = {λ_1, λ_2, ..., λ_q})
  B: Binary learning algorithm
  L: Bayesian network structure learning algorithm
  x*: Unseen instance (x* ∈ X)
Outputs:
  Y*: Predicted label set for x* (Y* ⊆ Y)
Process:
  1: for j = 1 to q do
  2:   Derive the binary training set D_j using Eq. (1);
  3:   Induce the binary classifier g_j ← B(D_j);
  4: end for
  5: Derive the error random variables e_j (1 ≤ j ≤ q) using Eq. (10);
  6: Learn the Bayesian network structure G ← L(e_1, e_2, ..., e_q);
  7: for j = 1 to q do
  8:   Derive the binary training set D^G_j using Eq. (11);
  9:   Induce the binary classifier g^G_j ← B(D^G_j);
  10: end for
  11: Specify the causal order π_G over all class labels w.r.t. G;
  12: return Y* = {λ_{π_G(j)} | η^{x*}_{π_G(j)} = +1, 1 ≤ j ≤ q} w.r.t. Eq. (12)

There are also other ways to consider pruned label correlations with specific controlling structures. First, a tree-based Bayesian network can be utilized as a simplified DAG structure where second-order label correlations are considered by pruning each class label with (up to) one parent [26,27]. Second, the stacking structure can be adapted to fulfill controlled label correlation exploitation by pruning the uncorrelated outputs of base-level classifiers for stacking meta-level classifiers [24,29]. Third, class labels with error-prone predictions can be filtered out of the pool of class labels for correlation exploitation [28].
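The decoupling step of Eq. (10) is straightforward to reproduce. The sketch below is an assumption-laden illustration (scikit-learn's LogisticRegression stands in for B, the function name is invented); it only computes the error variables that would then be handed to an external Bayesian network structure learner, which is not shown here.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def lead_error_variables(X, Y, base_learner=None):
    """Derive the error random variables e_j = y_j - sign(g_j(x)) of Eq. (10).

    The resulting (m, q) matrix E (entries in {-2, 0, +2}) decouples the
    feature influence and can be passed to an off-the-shelf Bayesian network
    structure learner to obtain G, as in steps 5-6 of Algorithm 4.
    """
    base = base_learner or LogisticRegression(max_iter=1000)
    q = Y.shape[1]
    models = [clone(base).fit(X, Y[:, j]) for j in range(q)]
    S = np.column_stack([np.sign(g.decision_function(X)) for g in models])
    S[S == 0] = 1
    E = Y - S                      # Eq. (10), evaluated on the training instances
    return models, E
```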

4 Related issues

As discussed in Section 3, in order to enhance binary relevance, it is necessary to enable label correlation exploitation during the learning process. However, it is also noteworthy that some inherent properties of multi-label learning should be investigated in order to further enhance the generalization ability of binary relevance. Specifically, recent studies on the issue of class-imbalance, i.e., the numbers of positive instances and negative instances w.r.t. each class label are imbalanced [30–39], and the issue of relative labeling-importance, i.e., each class label has different labeling-importance [40–45], are introduced.

4.1 Class-imbalance

The issue of class-imbalance exists in many multi-label learning tasks, especially those where the label space consists of a significant number of class labels. For each class label λ_j ∈ Y, let D^+_j = {(x_i, +1) | y_{ij} = +1, 1 ≤ i ≤ m} and D^−_j = {(x_i, −1) | y_{ij} = −1, 1 ≤ i ≤ m} denote the sets of positive and negative training examples w.r.t. λ_j. The level of class-imbalance can then be characterized by the imbalance ratio:

ImR_j = max(|D^+_j|, |D^−_j|) / min(|D^+_j|, |D^−_j|).   (14)

Here, | · | returns the cardinality of a set and, in most cases, |D^+_j| < |D^−_j| holds. Generally, the imbalance ratio is high for most benchmark multi-label data sets [1,57]. For instance, among the 42 class labels of the rcv1 benchmark data set, the average imbalance ratio (i.e., (1/q) ∑_{j=1}^{q} ImR_j) is greater than 15 and the maximum imbalance ratio (i.e., max_{1≤j≤q} ImR_j) is greater than 50 [38].

In order to handle the issue of class-imbalance in multi-label learning, existing approaches employ binary relevance as an intermediate step in the learning procedure. Specifically, by decomposing the multi-label learning task into q independent binary learning tasks, each of them can be addressed using prevalent binary imbalance learning techniques, such as over-/under-sampling [32,36,37], thresholding the decision boundary [31,33,34], or optimizing imbalance-specific metrics [30,35,39]. Because standard binary relevance is applied prior to subsequent modeling, existing approaches handle class-imbalance in multi-label learning at the expense of ignoring the exploitation of label correlations.

Therefore, a favorable solution to class-imbalance in multi-label learning is to consider the exploitation of label correlations and the exploration of class-imbalance simultaneously. In light of this information, the COCOA approach was proposed based on a specific strategy called cross-coupling aggregation [38]. For each class label λ_j, a binary classifier g^I_j is induced from D_j (i.e., Eq. (1)) by applying a binary imbalance learning algorithm B_I [58], i.e., g^I_j ← B_I(D_j). Additionally, a random subset of K class labels J_K ⊂ Y \ {λ_j} is extracted for pairwise cross-coupling with λ_j. For each coupling label λ_k ∈ J_K, COCOA derives a tri-class training set D^{tri}_{jk} for the label pair (λ_j, λ_k) from D in the following manner:

D^{tri}_{jk} = {(x_i, ψ^{tri}(y_i, λ_j, λ_k)) | 1 ≤ i ≤ m},
where ψ^{tri}(y_i, λ_j, λ_k) = 0 if y_{ij} = −1 and y_{ik} = −1;  +1 if y_{ij} = −1 and y_{ik} = +1;  +2 if y_{ij} = +1.   (15)

Among the three derived class labels, the first two labels (i.e., 0 and +1) exploit label correlations by considering the joint labeling assignments of λ_j and λ_k w.r.t. y_i, and the third class label (i.e., +2) corresponds to the case of λ_j being a relevant label.

Next, a tri-class classifier g^I_{jk}: X × {0, +1, +2} → R can be induced from D^{tri}_{jk} by applying a multi-class imbalance learning algorithm M_I [59–61], i.e., g^I_{jk} ← M_I(D^{tri}_{jk}). In other words, a total of K + 1 classifiers, including g^I_j and g^I_{jk} (λ_k ∈ J_K), are induced for the class label λ_j.

Given an unseen instance x*, its relevant label set Y* is determined by aggregating the predictions of the classifiers induced by the binary and multi-class imbalance learning algorithms:

Y* = {λ_j | f_j(x*) > t_j, 1 ≤ j ≤ q},
where f_j(x*) = g^I_j(x*) + ∑_{λ_k ∈ J_K} g^I_{jk}(x*, +2).   (16)

Here, t_j is a bipartition threshold, which is set by optimizing an empirical metric (e.g., F-measure) over D_j.

Algorithm 5 presents the pseudo-code of COCOA. As shown in Algorithm 5, COCOA is a high-order approach that considers correlations between labels in a random manner via the K coupling class labels in J_K. Specifically, during the training phase, label correlation exploitation is enabled by an ensemble of pairwise cross-couplings between class labels. During the testing phase, class-imbalance exploration is enabled by aggregating the classifiers induced from the class-imbalance learning algorithms.

Algorithm 5  Pseudo-code of COCOA [38]
Inputs:
  D: Multi-label training set {(x_i, y_i) | 1 ≤ i ≤ m}  (x_i ∈ X, y_i ∈ {−1, +1}^q, X = R^d, Y = {λ_1, λ_2, ..., λ_q})
  B_I: Binary imbalance learning algorithm
  M_I: Multi-class imbalance learning algorithm
  K: Number of coupling class labels
  x*: Unseen instance (x* ∈ X)
Outputs:
  Y*: Predicted label set for x* (Y* ⊆ Y)
Process:
  1: for j = 1 to q do
  2:   Derive the binary training set D_j using Eq. (1);
  3:   Induce the binary classifier g^I_j ← B_I(D_j);
  4:   Extract a random subset J_K ⊂ Y \ {λ_j} with K coupling class labels;
  5:   for each λ_k ∈ J_K do
  6:     Derive the tri-class training set D^{tri}_{jk} using Eq. (15);
  7:     Induce the tri-class classifier g^I_{jk} ← M_I(D^{tri}_{jk});
  8:   end for
  9: end for
  10: return Y* = {λ_j | f_j(x*) > t_j, 1 ≤ j ≤ q} w.r.t. Eq. (16)
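The following small sketch illustrates the two quantities defined in this subsection: the imbalance ratio of Eq. (14) and the tri-class targets of Eq. (15). It is a plain NumPy illustration with hypothetical function names, not COCOA's reference implementation; the guard against empty classes is an added assumption.

```python
import numpy as np

def imbalance_ratio(Y):
    """Per-label imbalance ratio ImR_j of Eq. (14); Y is (m, q) over {-1, +1}."""
    pos = (Y == +1).sum(axis=0)
    neg = (Y == -1).sum(axis=0)
    # np.maximum(..., 1) guards against labels with no positive (or negative) examples.
    return np.maximum(pos, neg) / np.maximum(np.minimum(pos, neg), 1)

def cocoa_triclass_targets(Y, j, k):
    """Tri-class targets psi_tri(y_i, lambda_j, lambda_k) of Eq. (15):
    0 if both labels are irrelevant, 1 if only lambda_k is relevant,
    2 whenever lambda_j is relevant."""
    t = np.zeros(Y.shape[0], dtype=int)
    t[(Y[:, j] == -1) & (Y[:, k] == +1)] = 1
    t[Y[:, j] == +1] = 2
    return t
```

The resulting tri-class data sets would then be fed to whichever multi-class imbalance learner plays the role of M_I, and the confidences for class +2 aggregated as in Eq. (16).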

4.2 Relative labeling-importance

Existing approaches to multi-label learning, including binary relevance, make the common assumption of equal labeling-importance. Here, class labels associated with the training example are regarded to be relevant, while their relative importance in characterizing the example's semantics is not differentiated [1,2]. Nevertheless, the degree of labeling-importance for each associated class label is generally different and not directly accessible from multi-label training examples. Figure 2 shows an example multi-label natural scene image with descending relative labeling-importance: sky ≻ water ≻ cloud ≻ building ≻ pedestrian. Similar situations hold for other types of multi-label objects, such as multi-category documents with different topical importance and multi-functionality genes with different expression levels.

Fig. 2  An example natural scene image annotated with multiple class labels (the relative labeling-importance of each annotation is also illustrated in this figure, although it is not explicitly provided by the annotator [42])

It is worth noting that there have been studies on multi-label learning that have aimed to make use of auxiliary labeling-importance information. Different forms of auxiliary information exist, including ordinal scale over each class label [40], full ranking over relevant class labels [41], importance distribution over all class labels [43,44], and oracle feedbacks over queried labels of unlabeled examples [45]. However, in standard multi-label learning, this auxiliary information is not assumed to be available and the only accessible labeling information is the relevancy/irrelevancy of each class label.

By leveraging implicit relative labeling-importance information, further improvement in the generalization performance of the multi-label learning system can be expected. In light of this information, the RELIAB approach is proposed to incorporate relative labeling-importance information in the learning process [42]. Formally, for each instance x and class label λ_j, the relative labeling-importance of λ_j in characterizing x is denoted μ^{λ_j}_x. Specifically, the terms μ^{λ_j}_x (1 ≤ j ≤ q) satisfy the non-negativity constraint μ^{λ_j}_x ≥ 0 and the normalization constraint ∑_{j=1}^{q} μ^{λ_j}_x = 1.

In the first stage, RELIAB estimates the implicit relative labeling-importance information U = {μ^{λ_j}_{x_i} | 1 ≤ i ≤ m, 1 ≤ j ≤ q} through iterative label propagation. Let G = (V, E) be a fully-connected graph constructed over all the training examples with V = {x_i | 1 ≤ i ≤ m}. Additionally, a similarity matrix W = [w_{ik}]_{m×m} is specified for G as follows:

∀ i, k = 1, ..., m:  w_{ik} = exp(−||x_i − x_k||^2_2 / (2σ^2)) if i ≠ k;  w_{ik} = 0 if i = k.   (17)

Here, σ > 0 is the width constant for similarity calculation. The corresponding label propagation matrix P is set as follows:

P = D^{−1/2} W D^{−1/2},  where D = diag[d_1, d_2, ..., d_m] with d_i = ∑_{k=1}^{m} w_{ik}.   (18)

Additionally, the labeling-importance matrix R = [r_{ij}]_{m×q} is initialized with R^{(0)} = Φ = [φ_{ij}]_{m×q} as follows:

∀ 1 ≤ i ≤ m, ∀ 1 ≤ j ≤ q:  φ_{ij} = 1 if y_{ij} = +1;  φ_{ij} = 0 if y_{ij} = −1.   (19)

Next, the label propagation procedure works by iteratively updating R as R^{(t)} = αPR^{(t−1)} + (1 − α)Φ. In practice, R^{(t)} will converge to R* as t grows to infinity [42,62,63]:

R* = (1 − α)(I − αP)^{−1} Φ.   (20)

Here, α ∈ (0, 1) is the trade-off parameter that balances the information flow from label propagation and initial labeling. Next, the implicit relative labeling-importance information U is obtained by normalizing each row of R* as follows:

∀ 1 ≤ i ≤ m, ∀ 1 ≤ j ≤ q:  μ^{λ_j}_{x_i} = r*_{ij} / ∑_{j=1}^{q} r*_{ij}.   (21)
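The first stage of RELIAB reduces to standard graph-based label propagation. A minimal NumPy sketch of Eqs. (17)–(21) is given below; the function name, the default values of σ and α, and the guard against all-zero rows are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def reliab_importance(X, Y, sigma=1.0, alpha=0.5):
    """First stage of RELIAB: estimate relative labeling-importance (Eqs. (17)-(21))."""
    m = X.shape[0]
    # Eq. (17): fully-connected similarity graph with Gaussian weights, zero diagonal.
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Eq. (18): symmetrically normalized propagation matrix.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    P = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # Eq. (19): initial importance matrix Phi from the {-1, +1} label matrix.
    Phi = (Y == +1).astype(float)
    # Eq. (20): closed-form fixed point of R(t) = alpha*P*R(t-1) + (1-alpha)*Phi.
    R_star = (1.0 - alpha) * np.linalg.solve(np.eye(m) - alpha * P, Phi)
    # Eq. (21): row-wise normalization yields the importance degrees mu.
    row_sum = R_star.sum(axis=1, keepdims=True)
    row_sum[row_sum == 0] = 1.0
    return R_star / row_sum
```

Note that the dense pairwise distance computation is O(m^2 d) in memory, so for large training sets a sparse neighborhood graph would be a natural substitution.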

In the second stage, in order to make use of the information conveyed by U, RELIAB chooses the maximum entropy model [64] to parametrize the multi-label predictor as follows:

f_j(x) = (1/Z(x)) exp(θ_j^⊤ x)  (1 ≤ j ≤ q),  where Z(x) = ∑_{j=1}^{q} exp(θ_j^⊤ x).   (22)

In order to induce the prediction model Θ = [θ_1, θ_2, ..., θ_q], RELIAB chooses to minimize the following objective function:

V(Θ, U, D) = V_dis(Θ, U) + β · V_emp(Θ, D).   (23)

Here, the first term V_dis(Θ, U) evaluates how well the prediction model Θ fits the estimated relative labeling-importance information U (e.g., by Kullback-Leibler divergence), and the second term evaluates how well the prediction model Θ classifies the training examples in D (e.g., by empirical ranking loss). Furthermore, β is the regularization parameter that balances the two terms of the objective function.

Given an unseen instance x*, its relevant label set Y* is determined by thresholding the parametrized prediction model as follows:

Y* = {λ_j | f_j(x*) > t(x*), 1 ≤ j ≤ q}.   (24)

Here, t(x*) is a thresholding function, which can be learned from the training examples [1,34,42].

Algorithm 6 presents the pseudo-code of RELIAB. As shown in Algorithm 6, RELIAB employs a two-stage procedure to learn from multi-label examples, where the relative labeling-importance information estimated in the first stage contributes to the model induction in the second stage. Furthermore, the order of label correlations considered by RELIAB depends on the empirical loss chosen to instantiate V_emp(Θ, D).

Algorithm 6  Pseudo-code of RELIAB [42]
Inputs:
  D: Multi-label training set {(x_i, y_i) | 1 ≤ i ≤ m}  (x_i ∈ X, y_i ∈ {−1, +1}^q, X = R^d, Y = {λ_1, λ_2, ..., λ_q})
  α: Trade-off parameter in (0, 1)
  β: Regularization parameter
  x*: Unseen instance (x* ∈ X)
Outputs:
  Y*: Predicted label set for x* (Y* ⊆ Y)
Process:
  1: Construct the fully-connected graph G = (V, E) with V = {x_i | 1 ≤ i ≤ m};
  2: Specify the weight matrix W using Eq. (17);
  3: Set the label propagation matrix P using Eq. (18);
  4: Initialize the labeling-importance matrix R using Eq. (19), and then derive the converged solution R* using Eq. (20);
  5: Obtain the relative labeling-importance information U using Eq. (21);
  6: Learn the parametrized prediction model Θ by minimizing the objective function specified in Eq. (23);
  7: return Y* = {λ_j | f_j(x*) > t(x*), 1 ≤ j ≤ q} w.r.t. Eq. (24)
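For completeness, a small sketch of the second-stage predictor (Eqs. (22) and (24)) is given below. It assumes an already-learned parameter matrix Θ and replaces the learned thresholding function t(x*) with a constant stand-in, so it only illustrates how the parametrization is evaluated, not how Θ or the threshold are trained.

```python
import numpy as np

def reliab_predict(Theta, X, t=None):
    """Evaluate the maximum entropy parametrization and threshold it.

    Theta has shape (q, d); Eq. (22) is a softmax over the q linear scores,
    and Eq. (24) keeps every label whose normalized score exceeds the
    threshold. The constant threshold (default 1/q) is a placeholder for
    the thresholding function learned from training data.
    """
    thr = 1.0 / Theta.shape[0] if t is None else t
    scores = X @ Theta.T                            # theta_j^T x for every label
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    F = np.exp(scores)
    F /= F.sum(axis=1, keepdims=True)               # Eq. (22): f_j(x) sums to 1 over labels
    return F > thr                                  # Eq. (24): boolean relevance mask
```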

5 Conclusion

In this paper, the state of the art of binary relevance, which is one of the most important solutions for multi-label learning, was reviewed. Specifically, the basic settings for binary relevance, a few representative correlation-enabling extensions, and related issues on class-imbalance and relative labeling-importance have been discussed. Code packages for the learning algorithms introduced in this paper are publicly available at the MULAN toolbox [57] (binary relevance [9], classifier chain [12,14], stacking aggregation [19]) and the first author's homepage (LEAD [25], COCOA [38], RELIAB [42]).

For binary relevance, there are several research issues that require further investigation. First, the performance evaluation of multi-label learning is more complicated than single-label learning. A number of popular multi-label evaluation metrics have been proposed [1,2,10,11]. It is desirable to design correlation-enabling extensions for binary relevance that are tailored to optimize designated multi-label metrics, suitable for the multi-label learning task at hand. Second, in binary relevance, the same set of features is used to induce the classification models for all class labels. It is appropriate to develop binary relevance style learning algorithms that are capable of utilizing label-specific features to characterize distinct properties of each class label [65–67]. Third, the modeling complexities of binary relevance, as well as its extensions, are linear in the number of class labels in the label space. It is necessary to adapt binary relevance to accommodate extreme multi-label learning scenarios with huge (e.g., millions) numbers of class labels [68–72].

Acknowledgements  The authors would like to thank the associate editor and anonymous reviewers for their helpful comments and suggestions. This work was supported by the National Natural Science Foundation of China (Grant Nos. 61573104, 61622203), the Natural Science Foundation of Jiangsu Province (BK20141340), the Fundamental Research Funds for the Central Universities (2242017K40140), and partially supported by the Collaborative Innovation Center of Novel Software Technology and Industrialization.

References

1. Zhang M-L, Zhou Z-H. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(8): 1819–1837
2. Zhou Z-H, Zhang M-L. Multi-label learning. In: Sammut C, Webb G I, eds. Encyclopedia of Machine Learning and Data Mining. Berlin: Springer, 2016, 1–8
3. Schapire R E, Singer Y. Boostexter: a boosting-based system for text categorization. Machine Learning, 2000, 39(2–3): 135–168
4. Cabral R S, De la Torre F, Costeira J P, Bernardino A. Matrix completion for multi-label image classification. In: Proceedings of Advances in Neural Information Processing Systems. 2011, 190–198
5. Sanden C, Zhang J Z. Enhancing multi-label music genre classification through ensemble techniques. In: Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2011, 705–714
6. Barutcuoglu Z, Schapire R E, Troyanskaya O G. Hierarchical multilabel prediction of gene function. Bioinformatics, 2006, 22(7): 830–836
7. Qi G-J, Hua X-S, Rui Y, Tang J, Mei T, Zhang H-J. Correlative multi-label video annotation. In: Proceedings of the 15th ACM International Conference on Multimedia. 2007, 17–26
8. Tang L, Rajan S, Narayanan V K. Large scale multi-label classification via metalabeler. In: Proceedings of the 19th International Conference on World Wide Web. 2009, 211–220
9. Boutell M R, Luo J, Shen X, Brown C M. Learning multi-label scene classification. Pattern Recognition, 2004, 37(9): 1757–1771
10. Tsoumakas G, Katakis I, Vlahavas I. Mining multi-label data. In: Maimon O, Rokach L, eds. Data Mining and Knowledge Discovery Handbook. Berlin: Springer, 2010, 667–686
11. Gibaja E, Ventura S. A tutorial on multilabel learning. ACM Computing Surveys, 2015, 47(3): 52
12. Read J, Pfahringer B, Holmes G, Frank E. Classifier chains for multilabel classification. In: Proceedings of Joint European Conference on Machine Learning and Knowledge Discovery in Databases. 2009, 254–269
13. Dembczyński K, Cheng W, Hüllermeier E. Bayes optimal multilabel classification via probabilistic classifier chains. In: Proceedings of the 27th International Conference on Machine Learning. 2010, 279–286
14. Read J, Pfahringer B, Holmes G, Frank E. Classifier chains for multilabel classification. Machine Learning, 2011, 85(3): 333–359
15. Kumar A, Vembu S, Menon A K, Elkan C. Learning and inference in probabilistic classifier chains with beam search. In: Proceedings of Joint European Conference on Machine Learning and Knowledge Discovery in Databases. 2012, 665–680
16. Li N, Zhou Z-H. Selective ensemble of classifier chains. In: Proceedings of International Workshop on Multiple Classifier Systems. 2013, 146–156
17. Senge R, del Coz J J, Hüllermeier E. Rectifying classifier chains for multi-label classification. In: Proceedings of the 15th German Workshop on Learning, Knowledge, and Adaptation. 2013, 162–169
18. Mena D, Montañés E, Quevedo J R, del Coz J J. A family of admissible heuristics for A* to perform inference in probabilistic classifier chains. Machine Learning, 2017, 106(1): 143–169
19. Godbole S, Sarawagi S. Discriminative methods for multi-labeled classification. In: Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining. 2004, 22–30
20. Montañés E, Quevedo J R, del Coz J J. Aggregating independent and dependent models to learn multi-label classifiers. In: Proceedings of Joint European Conference on Machine Learning and Knowledge Discovery in Databases. 2011, 484–500
21. Montañés E, Senge R, Barranquero J, Quevedo J R, del Coz J J, Hüllermeier E. Dependent binary relevance models for multi-label classification. Pattern Recognition, 2014, 47(3): 1494–1508
22. Tahir M A, Kittler J, Bouridane A. Multi-label classification using stacked spectral kernel discriminant analysis. Neurocomputing, 2016, 171: 127–137
23. Loza Mencía E, Janssen F. Learning rules for multi-label classification: a stacking and a separate-and-conquer approach. Machine Learning, 2016, 105(1): 77–126
24. Tsoumakas G, Dimou A, Spyromitros E, Mezaris V, Kompatsiaris I, Vlahavas I. Correlation-based pruning of stacked binary relevance models for multi-label learning. In: Proceedings of the 1st International Workshop on Learning from Multi-Label Data. 2009, 101–116
25. Zhang M-L, Zhang K. Multi-label learning by exploiting label dependency. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2010, 999–1007
26. Alessandro A, Corani G, Mauá D, Gabaglio S. An ensemble of Bayesian networks for multilabel classification. In: Proceedings of the 23rd International Joint Conference on Artificial Intelligence. 2013, 1220–1225
27. Sucar L E, Bielza C, Morales E F, Hernandez-Leal P, Zaragoza J H, Larrañaga P. Multi-label classification with Bayesian network-based chain classifiers. Pattern Recognition Letters, 2014, 41: 14–22
28. Li Y-K, Zhang M-L. Enhancing binary relevance for multi-label learning with controlled label correlations exploitation. In: Proceedings of Pacific Rim International Conference on Artificial Intelligence. 2014, 91–103
29. Alali A, Kubat M. Prudent: a pruned and confident stacking approach for multi-label classification. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(9): 2480–2493
30. Petterson J, Caetano T. Reverse multi-label learning. In: Proceedings of the Neural Information Processing Systems Conference. 2010, 1912–1920
31. Spyromitros-Xioufis E, Spiliopoulou M, Tsoumakas G, Vlahavas I. Dealing with concept drift and class imbalance in multi-label stream classification. In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence. 2011, 1583–1588
32. Tahir M A, Kittler J, Yan F. Inverse random under sampling for class imbalance problem and its application to multi-label classification. Pattern Recognition, 2012, 45(10): 3738–3750
33. Quevedo J R, Luaces O, Bahamonde A. Multilabel classifiers with a probabilistic thresholding strategy. Pattern Recognition, 2012, 45(2): 876–883
34. Pillai I, Fumera G, Roli F. Threshold optimisation for multi-label classifiers. Pattern Recognition, 2013, 46(7): 2055–2065
35. Dembczynski K, Jachnik A, Kotłowski W, Waegeman W, Hüllermeier E. Optimizing the F-measure in multi-label classification: plug-in rule approach versus structured loss minimization. In: Proceedings of the 30th International Conference on Machine Learning. 2013, 1130–1138
36. Charte F, Rivera A J, del Jesus M J, Herrera F. Addressing imbalance in multilabel classification: measures and random resampling algorithms. Neurocomputing, 2015, 163: 3–16
37. Charte F, Rivera A J, del Jesus M J, Herrera F. MLSMOTE: approaching imbalanced multilabel learning through synthetic instance generation. Knowledge-Based Systems, 2015, 89: 385–397
38. Zhang M-L, Li Y-K, Liu X-Y. Towards class-imbalance aware multi-label learning. In: Proceedings of the 24th International Joint Conference on Artificial Intelligence. 2015, 4041–4047
39. Wu B, Lyu S, Ghanem B. Constrained submodular minimization for missing labels and class imbalance in multi-label learning. In: Proceedings of the 30th AAAI Conference on Artificial Intelligence. 2016, 2229–2236
40. Cheng W, Dembczynski K J, Hüllermeier E. Graded multilabel classification: the ordinal case. In: Proceedings of the 27th International Conference on Machine Learning. 2010, 223–230
41. Xu M, Li Y-F, Zhou Z-H. Multi-label learning with PRO loss. In: Proceedings of the 27th AAAI Conference on Artificial Intelligence. 2013, 998–1004
42. Li Y-K, Zhang M-L, Geng X. Leveraging implicit relative labeling-importance information for effective multi-label learning. In: Proceedings of the 15th IEEE International Conference on Data Mining. 2015, 251–260
43. Geng X, Yin C, Zhou Z-H. Facial age estimation by learning from label distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(10): 2401–2412
44. Geng X. Label distribution learning. IEEE Transactions on Knowledge and Data Engineering, 2016, 28(7): 1734–1748
45. Gao N, Huang S-J, Chen S. Multi-label active learning by model guided distribution matching. Frontiers of Computer Science, 2016, 10(5): 845–855
46. Dembczyński K, Waegeman W, Cheng W, Hüllermeier E. On label dependence and loss minimization in multi-label classification. Machine Learning, 2012, 88(1–2): 5–45
47. Gao W, Zhou Z-H. On the consistency of multi-label learning. In: Proceedings of the 24th Annual Conference on Learning Theory. 2011, 341–358
48. Sun Y-Y, Zhang Y, Zhou Z-H. Multi-label learning with weak label. In: Proceedings of the 24th AAAI Conference on Artificial Intelligence. 2010, 593–598
49. Xu M, Jin R, Zhou Z-H. Speedup matrix completion with side information: application to multi-label learning. In: Proceedings of the Neural Information Processing Systems Conference. 2013, 2301–2309
50. Cabral R, De la Torre F, Costeira J P, Bernardino A. Matrix completion for weakly-supervised multi-label image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(1): 121–135
51. Senge R, del Coz J J, Hüllermeier E. On the problem of error propagation in classifier chains for multi-label classification. In: Spiliopoulou M, Schmidt-Thieme L, Janning R, eds. Data Analysis, Machine Learning and Knowledge Discovery. Berlin: Springer, 2014, 163–170
52. Zhou Z-H. Ensemble Methods: Foundations and Algorithms. Boca Raton, FL: Chapman & Hall/CRC, 2012
53. Koller D, Friedman N. Probabilistic Graphical Models: Principles and Techniques. Cambridge, MA: MIT Press, 2009
54. Koivisto M. Advances in exact Bayesian structure discovery in Bayesian networks. In: Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence. 2006, 241–248
55. Smith V, Yu J, Smulders T, Hartemink A, Jarvis E. Computational inference of neural information flow networks. PLoS Computational Biology, 2006, 2: 1436–1449
56. Murphy K. Software for graphical models: a review. ISBA Bulletin, 2007, 14(4): 13–15
57. Tsoumakas G, Spyromitros-Xioufis E, Vilcek J, Vlahavas I. MULAN: a Java library for multi-label learning. Journal of Machine Learning Research, 2011, 12: 2411–2414
58. He H, Garcia E A. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 2009, 21(9): 1263–1284
59. Wang S, Yao X. Multiclass imbalance problems: analysis and potential solutions. IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics, 2012, 42(4): 1119–1130
60. Liu X-Y, Li Q-Q, Zhou Z-H. Learning imbalanced multi-class data with optimal dichotomy weights. In: Proceedings of the 13th IEEE International Conference on Data Mining. 2013, 478–487
61. Abdi L, Hashemi S. To combat multi-class imbalanced problems by means of over-sampling techniques. IEEE Transactions on Knowledge and Data Engineering, 2016, 28(1): 238–251
62. Zhou D, Bousquet O, Lal T N, Weston J, Schölkopf B. Learning with local and global consistency. In: Proceedings of the Neural Information Processing Systems Conference. 2004, 284–291
63. Zhu X, Goldberg A B. Introduction to semi-supervised learning. In: Brachman R, Stone P, eds. Synthesis Lectures on Artificial Intelligence and Machine Learning. San Francisco, CA: Morgan & Claypool Publishers, 2009, 1–130
64. Della Pietra S, Della Pietra V, Lafferty J. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997, 19(4): 380–393
65. Zhang M-L, Wu L. LIFT: multi-label learning with label-specific features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(1): 107–120
66. Xu X, Yang X, Yu H, Yu D-J, Yang J, Tsang E C C. Multi-label learning with label-specific feature reduction. Knowledge-Based Systems, 2016, 104: 52–61
67. Huang J, Li G, Huang Q, Wu X. Learning label-specific features and class-dependent labels for multi-label classification. IEEE Transactions on Knowledge and Data Engineering, 2016, 28(12): 3309–3323
68. Weston J, Bengio S, Usunier N. WSABIE: scaling up to large vocabulary image annotation. In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence. 2011, 2764–2770
69. Agrawal R, Gupta A, Prabhu Y, Varma M. Multi-label learning with millions of labels: recommending advertiser bid phrases for Web pages. In: Proceedings of the 22nd International Conference on World Wide Web. 2013, 13–24
70. Xu C, Tao D, Xu C. Robust extreme multi-label learning. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016, 1275–1284
71. Jain H, Prabhu Y, Varma M. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016, 935–944
72. Zhou W J, Yu Y, Zhang M-L. Binary linear compression for multi-label classification. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence. 2017
Min-Ling Zhang received the BS, MS, and PhD degrees in computer science from Nanjing University, China in 2001, 2004 and 2007, respectively. Currently, he is a professor at the School of Computer Science and Engineering, Southeast University, China. In recent years, he has served as the Program Co-Chairs of ACML'17, CCFAI'17, PRICAI'16, Senior PC member or Area Chair of AAAI'18/'17, IJCAI'17/'15, ICDM'17/'16, PAKDD'16/'15, etc. He is also on the editorial board of Frontiers of Computer Science, ACM Transactions on Intelligent Systems and Technology, Neural Networks. He is the secretary-general of the CAAI (Chinese Association of Artificial Intelligence) Machine Learning Society, standing committee member of the CCF (China Computer Federation) Artificial Intelligence & Pattern Recognition Society. He is an awardee of the NSFC Excellent Young Scholars Program in 2012.

Yu-Kun Li received the BS and MS degrees in computer science from Southeast University, China in 2012 and 2015 respectively. Currently, he is an R&D engineer at the Baidu Inc. His main research interests include machine learning and data mining, especially in learning from multi-label data.

Xu-Ying Liu received the BS degree at Nanjing University of Aeronautics and Astronautics, China, and the MS and PhD degrees at Nanjing University, China in 2006 and 2010 respectively. Now she is an assistant professor at the PALM Group, School of Computer Science and Engineering, Southeast University, China. Her research interests mainly include machine learning and data mining, especially cost-sensitive learning and class imbalance learning.

Xin Geng is currently a professor and the director of the PALM lab of Southeast University, China. He received the BS (2001) and MS (2004) degrees in computer science from Nanjing University, China, and the PhD (2008) degree in computer science from Deakin University, Australia. His research interests include pattern recognition, machine learning, and computer vision. He has published more than 50 refereed papers in these areas, including those published in prestigious journals and top international conferences.