Advances in Knowledge Discovery and Data Mining
Series Editors
Volume Editors
Pang-Ning Tan
Michigan State University, Department of Computer Science and Engineering
428 S. Shaw Lane, 48824-1226 East Lansing, MI, USA
E-mail: [email protected]
Sanjay Chawla
University of Sydney, School of Information Technologies
1 Cleveland St., 2006 Sydney, NSW, Australia
E-mail: [email protected]
Chin Kuan Ho
Multimedia University, Faculty of Computing and Informatics
Jalan Multimedia, 63100 Cyberjaya, Selangor, Malaysia
E-mail: [email protected]
James Bailey
The University of Melbourne, Department of Computing and Information Systems
111 Barry Street, 3053 Melbourne, VIC, Australia
E-mail: [email protected]
PAKDD 2012 was the 16th conference of the Pacific Asia Conference series on
Knowledge Discovery and Data Mining. For the first time, the conference was
held in Malaysia, which has a vibrant economy and an aspiration to transform
itself into a knowledge-based society. Malaysians are also known to be very ac-
tive in social media such as Facebook and Twitter. Many private companies
and government agencies in Malaysia are already adopting database and data
warehousing systems, which over time will accumulate massive amounts of data
waiting to be mined. Having PAKDD 2012 organized in Malaysia was therefore
very timely as it created a good opportunity for the local data professionals to ac-
quire cutting-edge knowledge in the field through the conference talks, tutorials
and workshops.
The PAKDD conference series is a meeting place for both university re-
searchers and data professionals to share the latest research results. The PAKDD
2012 call for papers attracted a total of 241 submissions from 32 countries in all
six continents (Asia, Europe, Africa, North America, South America, and Aus-
tralasia), of which 20 (8.3%) were accepted for full presentation and 66 (27.4%)
were accepted for short presentation. Each submitted paper underwent a rigorous
double-blind review process and was assigned to at least four Program Commit-
tee (PC) members. Every paper was reviewed by at least three PC members,
with nearly two-thirds of them receiving four reviews or more. One of the changes
in the review process this year was the adoption of a two-tier approach, in which
a senior PC member was appointed to oversee the reviews for each paper. In
the case where there was significant divergence in the review ratings, the senior
PC members also initiated a discussion phase before providing the Program Co-
chairs with their final recommendation. The Program Co-chairs went through
each of the senior PC members’ recommendations, as well as the submitted pa-
pers and reviews, to come up with the final selection. We thank all reviewers
(Senior PC, PC and external invitees) for their efforts in reviewing the papers
in a timely fashion (altogether, more than 94% of the reviews were completed
by the time the notification was sent). Without their hard work, we would not
have been able to see such a high-quality program.
The three-day conference program included three keynote talks by world-
renowned data mining experts, namely, Chandrakant D. Patel from HP Labs
(Joules of Available Energy as the Global Currency: The Role of Knowledge Dis-
covery and Data Mining); Charles Elkan from the University of California at
San Diego (Learning to Make Predictions in Networks); and Ian Witten from
the University of Waikato (Semantic Document Representation: Do It with Wik-
ification). The program also included four workshops, three tutorials, a doctoral
symposium, and several paper sessions. Other than these intellectually inspiring
events, participants of PAKDD 2012 were able to enjoy several social events
Philip Yu
Ee-Peng Lim
Hong-Tat Ewe
Pang-Ning Tan
Sanjay Chawla
Chin-Kuan Ho
Organization
Organizing Committee
Conference Co-chairs
Philip Yu University of Illinois at Chicago, USA
Hong-Tat Ewe Universiti Tunku Abdul Rahman, Malaysia
Ee-Peng Lim Singapore Management University, Singapore
Program Co-chairs
Pang-Ning Tan Michigan State University, USA
Sanjay Chawla The University of Sydney, Australia
Chin-Kuan Ho Multimedia University, Malaysia
Workshop Co-chairs
Takashi Washio Osaka University, Japan
Jun Luo Shenzhen Institute of Advanced Technology,
China
Tutorial Co-chair
Hui Xiong Rutgers University, USA
Publicity Co-chairs
Rui Kuang University of Minnesota, USA
Ming Li Nanjing University, China
Myra Spiliopoulou University of Magdeburg, Germany
Publication Chair
James Bailey University of Melbourne, Australia
Steering Committee
Co-chairs
Graham Williams Australian National University, Australia
Tu Bao Ho Japan Advanced Institute of Science and
Technology, Japan
Life Members
Hiroshi Motoda AFOSR/AOARD and Osaka University, Japan
Rao Kotagiri University of Melbourne, Australia
Ning Zhong Maebashi Institute of Technology, Japan
Masaru Kitsuregawa Tokyo University, Japan
David Cheung University of Hong Kong, China
Graham Williams Australian National University, Australia
Ming-Syan Chen National Taiwan University, Taiwan, ROC
Members
Huan Liu Arizona State University, USA
Kyu-Young Whang Korea Advanced Institute of Science and
Technology, Korea
Chengqi Zhang University of Technology Sydney, Australia
Tu Bao Ho Japan Advanced Institute of Science and
Technology, Japan
Ee-Peng Lim Singapore Management University, Singapore
Jaideep Srivastava University of Minnesota, USA
Zhi-Hua Zhou Nanjing University, China
Takashi Washio Institute of Scientific and Industrial Research,
Osaka University
Thanaruk Theeramunkong Thammasat University, Thailand
Program Committee
Aditya Krishna Menon University of California, USA
Aixin Sun Nanyang Technological University, Singapore
Akihiro Inokuchi Osaka University, Japan
Albrecht Zimmerman Katholieke Universiteit Leuven, Belgium
Alexandre Termier Université Joseph Fourier, France
Alfredo Cuzzocrea ICAR-CNR and University of Calabria, Italy
Amol Ghoting IBM T.J. Watson Research Center, USA
Andreas Hotho University of Kassel, Germany
Andrzej Skowron University of Warsaw, Poland
Annalisa Appice Università degli Studi di Bari, Italy
Sponsors
Table of Contents – Part I
Time-Evolving Relational Classification and Ensemble Methods

R. Rossi and J. Neville

Purdue University, West Lafayette, IN 47906, USA
{rrossi,neville}@purdue.edu
1 Introduction
Temporal-relational information is present in many domains such as the Internet,
citation and collaboration networks, communication and email networks, social
networks, biological networks, among many others. These domains all have at-
tributes, links, and/or nodes changing over time which are important to model.
We conjecture that discovering an accurate temporal-relational representation
will disambiguate the true nature and strength of links, attributes, and nodes.
However, the majority of research in relational learning has focused on mod-
eling static snapshots [2, 6] and has largely ignored the utility of learning and
incorporating temporal dynamics into relational representations.
Temporal relational data has three main components (attributes, nodes, links)
that vary in time. First, the attribute values (on nodes or links) may change over
time (e.g., research area of an author). Next, links might be created and deleted
throughout time (e.g., host connections are opened and closed). Finally, nodes
might appear and disappear over time (e.g., through activity in an online social
network).
Within the context of evolving relational data, there are two types of predic-
tion tasks. In a temporal prediction task, the attribute to predict is changing
over time (e.g., student GPA), whereas in a static prediction task, the predic-
tive attribute is constant (e.g., paper topic). For these prediction tasks, the
space of temporal-relational representations is defined by the set of relational
elements that change over time (attributes, links, and nodes). To incorporate
temporal information in a representation that is appropriate for relational mod-
els, we consider two transformations based on temporal weighting and temporal
granularity. Temporal weighting aims to represent the temporal influence of the
links, attributes and nodes by decaying the weights of each with respect to time,
whereas the choice of temporal granularity restricts attention to links, attributes,
and nodes within a particular window of time. The optimal temporal-relational
representation and the corresponding temporal classifier depends on the partic-
ular temporal dynamics of the links, attributes, and nodes present in the data,
as well as the network domain (e.g., social vs. biological networks).
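To make the weighting transformation concrete, the sketch below assigns each link a weight that decays exponentially with its age; the exponential kernel, the decay parameter, and the tuple layout are illustrative assumptions, since the framework admits other kernels and the best choice is data-dependent.

```python
import math

def decay_link_weights(links, t_now, lam=0.5):
    """Assign each link a weight that decays exponentially with its age.

    `links` is a list of (source, target, timestamp) tuples; `lam` controls
    how quickly old links lose influence.  This is only one possible kernel;
    linear or inverse-time decay could be substituted in the same place.
    """
    weighted = []
    for u, v, t in links:
        age = t_now - t
        weighted.append((u, v, math.exp(-lam * age)))
    return weighted

# Example: links observed at timesteps 1, 3, and 5, evaluated at t_now = 5.
print(decay_link_weights([("a", "b", 1), ("a", "c", 3), ("b", "c", 5)], t_now=5))
```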
In this work, we address the problem of selecting the most optimal temporal-
relational representation to increase the accuracy of predictive models. We con-
sider the full space of temporal-relational representations and propose (1) a
temporal-relational classification framework, and (2) a set of temporal ensemble
methods, to leverage time-varying links, attributes, and nodes in relational net-
works. We illustrate the different types of models on a variety of classification
tasks and evaluate each under various conditions. The results demonstrate the
flexibility and effectiveness of the temporal-relational framework for classifica-
tion in time-evolving relational domains. Furthermore, the framework provides a
foundation for automatically searching over temporal-relational representations
to increase the accuracy of predictive models.
2 Related Work
Recent work has started to model network dynamics in order to better pre-
dict link and structure formation over time [7, 10], but this work focuses on
unattributed graphs. Previous work in relational learning on attributed graphs
either uses static network snapshots or significantly limits the amount of tem-
poral information incorporated into the models. Sharan et al. [18] assumes a
strict representation that only uses kernel estimation for link weights, while GA-
TVRC [9] uses a genetic algorithm to learn the link weights. SRPTs [11] incor-
porate temporal and spatial information in the relational attributes. However,
the above approaches focus only on one specific temporal pattern and do not
consider different temporal granularities. In contrast, we explore a larger space
of temporal-relational representations in a flexible framework that can capture
temporal dependencies over links, attributes, and nodes.
To the best of our knowledge, we are the first to propose and investigate
temporal-relational ensemble methods for time-varying relational classification.
However, there has been recent work on relational ensemble methods [8, 14,
15] and non-relational ensemble methods for evolving streams [1]. Preisach et
al. [14] use voting and stacking methods to combine relational data with multiple
relations. In contrast, Eldardiry and Neville [8] incorporates prediction averaging
in the collective inference process to reduce both learning and inference variance.
[Framework figure: timestep, window, and union models.]

... framework where the links, attributes, and nodes are unioned and the links are weighted. Below we provide more detail on steps 2-4.
Traditionally, relational classifiers have attempted to use all the data available in
a network [18]. However, since the relevance of data may change over time (e.g.,
links become stale), learning the appropriate temporal granularity (i.e., range of
timesteps) can improve classification accuracy. We briefly define three general
classes for varying the temporal granularity of the links, attributes, and nodes.
1. Timestep. The timestep models only use a single timestep ti for learning.
2. Window. The window models use a sliding window of (multiple) timesteps
{tj, tj+1, ..., ti} for learning. When the size of the window is varied, the space of
possible models in this category is by far the largest.
3. Union. The union model uses all previous temporal information for learning
at time ti , i.e., T = {0, ..., ti }.
The timestep and union models are separated into distinct classes for clarity in
evaluation and for understandability in pattern mining.
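A minimal sketch of the three granularity classes above, assuming links are stored as (source, target, timestamp) tuples; the same filtering applies analogously to attributes and nodes.

```python
def select_by_granularity(links, mode, t_i, window=3):
    """Restrict timestamped links to a chosen temporal granularity.

    links : list of (source, target, timestamp) tuples
    mode  : "timestep" keeps only links at t_i,
            "window"   keeps links in {t_i - window + 1, ..., t_i},
            "union"    keeps all links up to and including t_i.
    """
    if mode == "timestep":
        keep = lambda t: t == t_i
    elif mode == "window":
        keep = lambda t: t_i - window < t <= t_i
    elif mode == "union":
        keep = lambda t: t <= t_i
    else:
        raise ValueError("unknown granularity: %s" % mode)
    return [(u, v, t) for u, v, t in links if keep(t)]
```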
Fig. 1. (a) Temporally weighting the attributes and links. (b) The feature calculation that includes only the temporal link weights. (c) The feature calculation that incorporates both the temporal attribute weights and the temporal link weights.
Once the temporal granularity and temporal weighting are selected for each relational component, a temporal-relational classifier can be learned. In this work, we use modified versions of the RBC [13] and RPT [12] to model the transformed
temporal-relational representation. However, we note that any relational model
that can be modified to incorporate node, link, and attribute weights is suitable
for this phase. We extended RBCs and RPTs since they are interpretable, di-
verse, simple, and efficient. We use k-fold cross-validation to learn the “best” model.
Both classifiers are extended for learning and prediction over time.
Weighted Relational Bayes Classifier. RBCs extend naive Bayes classi-
fiers [5] to relational settings by treating heterogeneous relational subgraphs
as a homogeneous set of attribute multisets. The weighted RBC uses standard
maximum likelihood learning. More specifically, the sufficient statistics for each
conditional probability distribution are computed as weighted sums of counts
based on the link and attribute weights. More formally, for a class label C, at-
tributes X, and related items R, the RBC calculates the probability of C for an
item i of type G(i) as follows:
$$P(C^{i} \mid X, R) \;\propto\; \prod_{X_m \in \mathcal{X}_{G(i)}} P(X_m^{i} \mid C) \prod_{j \in R} \prod_{X_k \in \mathcal{X}_{G(j)}} P(X_k^{j} \mid C)\, P(C)$$
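The sketch below illustrates how such weighted sufficient statistics can be accumulated, assuming each related observation arrives as a (class, attribute value, weight) triple whose weight is the product of the temporal link and attribute weights; the function name and the Laplace-style smoothing constant are illustrative choices, not part of the original formulation.

```python
from collections import defaultdict

def weighted_conditional_estimates(samples, smoothing=1.0):
    """Estimate P(attribute value | class) from weighted attribute observations.

    `samples` is a list of (class_label, attribute_value, weight) triples, where
    the weight reflects the temporal link and attribute weights attached to the
    related item that contributed the observation.
    """
    counts = defaultdict(float)        # (class, value) -> weighted count
    class_totals = defaultdict(float)  # class -> total weighted count
    values = set()
    for c, x, w in samples:
        counts[(c, x)] += w
        class_totals[c] += w
        values.add(x)

    def prob(x, c):
        # Smoothing so unseen attribute values get non-zero probability.
        return (counts[(c, x)] + smoothing) / (class_totals[c] + smoothing * len(values))

    return prob

# Example: observations of a neighbour attribute, down-weighted by link age.
prob = weighted_conditional_estimates([("ML", "neural", 1.0), ("ML", "genetic", 0.4),
                                        ("DB", "query", 0.9)])
print(prob("neural", "ML"))
```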
5 Methodology
For evaluating the framework, we use both static (Y is constant over time) and
temporal prediction tasks (Yt changes over time).
5.1 Datasets
Cora Citation Network. The Cora dataset contains authorship and citation
information about CS research papers extracted automatically from the web.
The prediction tasks are to predict the topic of a machine learning paper (one of seven topics) and to predict whether a paper is an AI paper, given the topics of its references. In addition, these techniques are evaluated using the most prevalent topics its authors are working on through collaborations with other authors.
6 Empirical Results
In this section, we demonstrate the effectiveness of the temporal-relational frame-
work and temporal ensemble methods on two real-world datasets. The main
findings are summarized below:
We evaluate the temporal-relational framework using single-models and show that in all cases the performance of classification improves when the temporal dynamics are appropriately modeled.

Fig. 2. Comparing a primitive temporal model (TVRC) to competing relational (RPT) and non-relational (DT) models (AUC; Intrinsic, Int+time, Int+graph, and Int+topics feature sets)

Temporal, Relational, and Non-Relational Information. The utility of the temporal (TVRC), relational (RPT), and non-relational information (decision tree; DT) is assessed using the most primitive models. Figure 2 compares TVRC with the RPT and DT models that use more features but ignore the temporal dynamics of the data. We find the TVRC to be the simplest temporal-relational classifier that still
outperforms the others. Interestingly, the discovered topic features are the only
additional features that improve performance of the DT model. This is signifi-
cant as these attributes are discovered by dynamically modeling the topics, but
are included in the DT model as simple non-relational features (i.e., no temporal
weighting or granularity).
Exploring Temporal-Relational Models. ... framework. To more appropriately evaluate ... (i.e., TENC). Similar results were found using Cora and other base classifiers such as RBC. Models based strictly on varying the temporal granularity were also explored. More details can be found in [17].

[Figure: comparing the window model and the union model (AUC).]
6.2 Temporal-Ensemble Models

Instead of directly learning the optimal temporal-relational representation to increase the accuracy of classification, we use temporal ensembles by varying the relational representation with respect to the temporal information. These ensemble models reduce error due to variance and allow us to assess which features are most relevant to the domain with respect to the relational or temporal information.

Fig. 4. Comparing temporal (TVRC), relational (RPT), and traditional (DT) ensembles (AUC at timesteps T=1 through T=4 and their average)

Temporal, Relational, and Traditional Ensembles. We first resampled the instances (nodes, links, features) repeatedly and then learned TVRC, RPT, and DT
models. Across almost all the timesteps, we find the temporal-ensemble that uses
various temporal-relational representations outperforms the relational-ensemble
and the traditional ensemble (see Figure 4). The temporal-ensemble outperforms
the others even when the minimum amount of temporal information is used (e.g.,
time-varying links). More sophisticated temporal-ensembles can be constructed
to further increase accuracy. We have investigated ensembles that use signifi-
cantly different temporal-relational representations (i.e., from a wide range of
model classes) and ensembles that use various temporal weighting parameters.
In all cases, these ensembles are more robust and increase the accuracy over
more traditional ensemble techniques (and single classifiers). Further, the average
improvement of the temporal-ensembles is significant at p < 0.05 with a 16%
reduction in error, justifying the proposed temporal ensemble methodologies.
In the next experiment, we construct ensembles using the feature classes. We use the primitive models (with the transformed fea- ... -poral patterns. First, the team features are localized in time and are not changing frequently. For instance, it is unlikely that a developer changes their assigned teams and

Fig. 5. Comparing attribute classes (communication, team, centrality, topics) w.r.t. temporal, relational, and traditional ensembles
Fig. 6. Randomization. The significant attributes used in the temporal ensemble are
compared to the relational and traditional ensembles. The change in AUC is measured.
Fig. 7. Evaluation of temporal-relational classifiers using only the latent topics of the communications to predict effectiveness. LDA is used to automatically discover the latent topics and to annotate the communication links and individuals with their appropriate topic in the temporal networks.
7 Conclusion
We proposed and validated a framework for temporal-relational classifiers, en-
sembles, and more generally, representations for temporal-relational data. We
evaluated an illustrative set of temporal-relational models from the proposed
framework. Empirical results show that the models significantly outperform
competing classification models that use either no temporal information or a
very limited amount. The proposed temporal ensemble methods (i.e., tempo-
rally sampling, randomizing, and transforming features) were shown to sig-
nificantly outperform traditional and relational ensembles. Furthermore, the
temporal-ensemble methods were shown to increase the accuracy over traditional
models while providing an efficient alternative to exploring the space of temporal-
models. The results demonstrated the effectiveness, scalability, and flexibility of
the temporal-relational representations for classification and ensembles in time-
evolving domains. In future work, we will theoretically analyze the framework
and the proposed ensemble methods.
References
1. Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R., Gavaldà, R.: New ensemble meth-
ods for evolving data streams. In: SIGKDD, pp. 139–148 (2009)
2. Chakrabarti, S., Dom, B., Indyk, P.: Enhanced hypertext categorization using hy-
perlinks. In: SIGMOD, pp. 307–318 (1998)
3. Cortes, C., Pregibon, D., Volinsky, C.: Communities of Interest. In: Hoffmann,
F., Adams, N., Fisher, D., Guimarães, G., Hand, D.J. (eds.) IDA 2001. LNCS,
vol. 2189, pp. 105–114. Springer, Heidelberg (2001)
4. Dietterich, T.G.: Ensemble Methods in Machine Learning. In: Kittler, J., Roli, F.
(eds.) MCS 2000. LNCS, vol. 1857, pp. 1–15. Springer, Heidelberg (2000)
5. Domingos, P., Pazzani, M.: On the optimality of the simple bayesian classifier
under zero-one loss. Machine Learning 29, 103–130 (1997)
6. Domingos, P., Richardson, M.: Mining the network value of customers. In:
SIGKDD, pp. 57–66 (2001)
7. Dunlavy, D., Kolda, T., Acar, E.: Temporal link prediction using matrix and tensor
factorizations. TKDD 5(2), 10 (2011)
8. Eldardiry, H., Neville, J.: Across-model collective ensemble classification. AAAI
(2011)
9. Güneş, İ., Çataltepe, Z., Öğüdücü, Ş.G.: GA-TVRC: A Novel Relational Time
Varying Classifier to Extract Temporal Information Using Genetic Algorithms. In:
Perner, P. (ed.) MLDM 2011. LNCS, vol. 6871, pp. 568–583. Springer, Heidelberg
(2011)
10. Lahiri, M., Berger-Wolf, T.: Structure prediction in temporal networks using fre-
quent subgraphs. In: CIDM, pp. 35–42 (2007)
11. McGovern, A., Collier, N., Matthew Gagne, I., Brown, D., Rodger, A.: Spatiotem-
poral Relational Probability Trees: An Introduction. In: ICDM, pp. 935–940 (2008)
12. Neville, J., Jensen, D., Friedland, L., Hay, M.: Learning relational probability trees.
In: SIGKDD, pp. 625–630 (2003)
13. Neville, J., Jensen, D., Gallagher, B.: Simple estimators for relational Bayesian
classifiers. In: ICML, pp. 609–612 (2003)
14. Preisach, C., Schmidt-Thieme, L.: Relational ensemble classification. In: ICDM,
pp. 499–509. IEEE (2006)
15. Preisach, C., Schmidt-Thieme, L.: Ensembles of relational classifiers. KIS 14(3),
249–272 (2008)
16. Rossi, R., Neville, J.: Modeling the evolution of discussion topics and communica-
tion to improve relational classification. In: SOMA-KDD, pp. 89–97 (2010)
17. Rossi, R.A., Neville, J.: Representations and ensemble methods for dynamic rela-
tional classification. CoRR abs/1111.5312 (2011)
18. Sharan, U., Neville, J.: Temporal-relational classifiers for prediction in evolving
domains. In: ICML (2008)
Active Learning for Hierarchical Text Classification

X. Li, D. Kuang, and C.X. Ling
1 Introduction
Hierarchical text classification plays an important role in many real-world ap-
plications, such as webpage topic classification, product categorization and user
feedback classification. Due to the rapid increase of published documents (e.g.,
articles, patents and product descriptions) online, most of the websites (from
Wikipedia and Yahoo! to the small enterprise websites) classify their documents
into a predefined hierarchy (or taxonomy) for easy browsing. As more documents
are published, more human efforts are needed to give the hierarchical labels of
the new documents. It dramatically increases the maintenance cost for those
organizations or companies. To tackle this problem, machine learning techniques
such as hierarchical text classification can be utilized to automatically categorize
new documents into the predefined hierarchy.
Many approaches have been proposed to improve the performance of hierar-
chical text classification. Different approaches have been proposed in terms of
how to build the classifiers [2,19], how to construct the training sets [1,6] and
how to choose the decision thresholds [15,1] and so on. As a hierarchy may con-
tain hundreds or even tens of thousands of categories, those approaches often
There are two key steps (steps two and three) in the algorithm. In step two, we
introduce the local unlabeled pool to avoid selecting out-of-scope (we will define
it later) examples. In step three, we tackle how to leverage the oracle answers in
the hierarchy. We will discuss them in the following subsections.
2 See http://www.dmoz.org/erz/ for DMOZ editing guidelines.
3 At the root of the hierarchy tree, every example is positive.
Fig. 1. The hierarchical active learning framework. The typical active learning steps
are numbered 1, 2, 3 in the figure.
4 Experimental Configuration
4.1 Datasets
We utilize four real-world hierarchical text datasets (20 Newsgroups, OHSUMED,
RCV1 and DMOZ) in our experiments. They are common benchmark datasets
for evaluation of text classification methods. We give a brief introduction of the
datasets. The statistics of the four datasets are shown in Table 1.
Table 1. Statistics of the four datasets. Cardinality is the average
number of categories per example (i.e., multi-label datasets).
where hP and hR are the hierarchical precision and the hierarchical recall, P̂i is
the set consisting of the most specific categories predicted for test example i and
all its (their) ancestor categories and T̂i is the set consisting of the true most
specific categories of test example i and all its (their) ancestor categories [14].
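A sketch of the hierarchical measures described above, assuming the micro-averaged form of hierarchical precision and recall from [14]: predicted and true category sets are augmented with all ancestors before the set overlaps are accumulated. The dict-based data layout is an illustrative assumption.

```python
def hierarchical_f_measure(predicted, true, ancestors):
    """Hierarchical precision/recall/F-measure in the style of [14].

    predicted : dict mapping example id -> set of predicted most-specific categories
    true      : dict mapping example id -> set of true most-specific categories
    ancestors : dict mapping category -> set of its ancestor categories
    """
    def augment(cats):
        full = set(cats)
        for c in cats:
            full |= ancestors.get(c, set())
        return full

    inter = pred_total = true_total = 0
    for i in predicted:
        p_hat = augment(predicted[i])
        t_hat = augment(true[i])
        inter += len(p_hat & t_hat)
        pred_total += len(p_hat)
        true_total += len(t_hat)

    hp = inter / pred_total if pred_total else 0.0   # hierarchical precision
    hr = inter / true_total if true_total else 0.0   # hierarchical recall
    return 2 * hp * hr / (hp + hr) if hp + hr else 0.0
```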
5 Empirical Study
In this section, we will first experimentally study the standard version of our
active learning framework for hierarchical text classification, then propose several
improved versions and compare them with the previous version.
8 Uncertainty sampling in active learning selects the unlabeled example that is closest to the decision boundary of the classifier.
9 We heuristically use the logarithm of the unlabeled pool size to calculate the number of selected examples for each category.
Empirical Comparison: We set the query limit as 50 × |C| where |C| is the
total number of categories in the hierarchy. Thus, in our experiments the query
limits for the four datasets are 1,350, 4,300, 4,800 and 4,850 respectively. We
denote the standard hierarchical active learner as AC and the baseline learner
as RD. Figure 2 plots the average learning curves for AC and RD on the four
datasets. As we can see, on all the datasets AC performs significantly better
than RD. This result is reasonable since the unlabeled examples selected by AC
are more informative than RD on all the categories in the hierarchy. From the
curves, it is apparent that to achieve the best performance of RD, AC needs
significantly fewer queries (approximately 43% to 82% queries can be saved)10 .
As mentioned in Section 3.2, when the oracle on a category answers “Yes” for
an example, we can directly include the example into the training set on that
category as a positive example. Furthermore, according to the category relation
in a hierarchy, if an example belongs to a category, it will definitely belong to
all the ancestor categories. Thus, we can propagate the example (as a positive
example) to all its ancestor categories. In such cases, the ancestor classifiers can
obtain free positive examples for training without any query. It coincides with
the goal of active learning: reducing the human labeling cost!
Based on the intuition, we propose a new strategy Propagate to propagate the
examples to the ancestor classifiers when the answer from oracle is “Yes”. The
basic idea is as follows. In each iteration of the active learning process, after we
query an oracle for each selected example, if the answer from the oracle is “Yes”,
we propagate this example to the training sets of all the ancestor categories as
positive. At the end of the iteration, each category combines all the propagated
positive examples and the examples selected by itself to update its classifier.
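A minimal sketch of the Propagate strategy, assuming the hierarchy is stored as a parent map and each category keeps its own training set; the helper name and data structures are illustrative.

```python
def propagate_positive(example, category, parent, training_sets):
    """When the oracle answers "Yes" for `category`, add the example as a
    positive instance to that category and to every ancestor category.

    parent        : dict mapping category -> its parent (None at the top)
    training_sets : dict mapping category -> list of (example, label) pairs
    """
    node = category
    while node is not None:
        training_sets.setdefault(node, []).append((example, +1))
        node = parent.get(node)   # free positives for the ancestors, no extra query
```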
Fig. 3. Comparison between AC+ and AC in terms of the hierarchical F-measure (first
row), recall (second row) and precision (third row)
Querying Negative Examples: For deep categories, when the oracle answers
“No”, we actually discard the selected example in AC+ (as well as in AC, see
Section 5.1). However, in this case, the training set may miss a negative example
and also possibly an informative example. Furthermore, if we keep throwing away
those examples whenever the oracle says “No”, the classifiers may not have a chance
to learn negative examples. On the other hand, if we include this example, we
may introduce noise to the training set, since the example may not belong to
the parent category, thus an out-of-scope example (see Section 3.1).
How can we deal with the two cases? We introduce a complementary strategy
called Query. In fact, the parent oracle can help us decide between the two
cases. We only need to issue another query to the parent oracle on whether this
example belongs to it. If the answer from the parent oracle is “Yes”, we can
safely include this example as a negative example to the current category. If the
answer is “No”, we can directly discard it. Here, we do not need to further query
all the ancestor oracles, since the example is already out of scope of the current
category and thus can not be included into its training set. There is a trade-off.
As one more query is asked, we may obtain an informative negative example,
but we may also waste a query. Therefore, it is not obvious whether this strategy works
or not.
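The Query strategy might be sketched as follows, where ask_oracle is a hypothetical callback standing in for the human oracle of a category; it spends one extra query on the parent to decide whether the rejected example is a usable negative.

```python
def handle_no_answer(example, category, parent, training_sets, ask_oracle):
    """Complementary Query strategy: when the oracle of `category` answers "No",
    ask the parent oracle whether the example is in scope.  If it is, the example
    becomes an informative negative for `category`; otherwise it is discarded.

    ask_oracle(category, example) is assumed to return True for "Yes" and
    False for "No"; one extra query is spent either way.
    """
    parent_cat = parent.get(category)
    # With no parent in the map the example is trivially in scope
    # (every example is positive at the root).
    if parent_cat is None or ask_oracle(parent_cat, example):
        training_sets.setdefault(category, []).append((example, -1))
        return True    # negative example gained
    return False       # out-of-scope example discarded, query spent
```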
we can safely do so. It is because for the categories under the same parent, the
example can only belong to at most one category. However, in most of the hi-
erarchical datasets, the example belongs to multiple paths. In this case, it may
be positive on some sibling categories. If we include this example as negative to
the sibling categories, we may introduce noise.
To decide which sibling categories an example can be included as negative, we
adopt a conservative heuristic strategy called Predict. Basically, when a positive
example is included into a category, we add this example as negative to those
sibling categories that the example is least likely to belong to. Specifically, if
we know a queried example x is positive on a category c, we choose m sibling categories with the minimum probabilities (estimated by Platt's calibration [11]). We set
$$m = n - \max_{x \in D_L} \Psi_{\uparrow c}(x), \qquad (2)$$
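A sketch of the Predict strategy, assuming prob(sibling, example) returns a Platt-calibrated probability from the sibling's classifier and m is computed as in Eq. (2); the function and argument names are illustrative.

```python
def predict_sibling_negatives(example, category, siblings, prob, m, training_sets):
    """Conservative Predict strategy: a positive example for `category` is added
    as a negative example to the m sibling categories it is least likely to
    belong to, ranked by calibrated probabilities."""
    others = [s for s in siblings if s != category]
    ranked = sorted(others, key=lambda s: prob(s, example))   # least likely first
    for s in ranked[:max(0, m)]:
        training_sets.setdefault(s, []).append((example, -1))
```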
Fig. 4. Comparison between AC+P, AC+Q and AC+ in terms of the hierarchical
F-measure (upper row) and precision (bottom row)
We plot their learning curves for the hierarchical F-measure and the hierarchi-
cal precision on the four datasets in Figure 4. As we can see in the figure, both
AC+Q and AC+P achieve better performance of the hierarchical F-measure
than AC+. By introducing more negative examples, both methods maintain or
even increase the hierarchical precision (see the bottom row of Figure 4). As
we mentioned before, AC+Q may waste queries when the parent oracle answers
“No”. However, we discover that the average number of informative examples
obtained per query for AC+Q is much larger than AC+ (at least 0.2 higher per
query). It means that it is actually worthwhile to issue another query in AC+Q.
Another question is whether AC+P introduces noise to the training sets. Ac-
cording to our calculation, the noise rate is at most 5% on all the four datasets.
Hence, it is reasonable that AC+Q and AC+P can further improve AC+.
However, between AC+Q and AC+P, there is no consistent winner on all
the four datasets. On 20 Newsgroup and DMOZ, AC+P achieves higher per-
formance, while on OHSUMED and RCV1, AC+Q is more promising. We also
try to make a simple combination of Query and Predict with AC+ (we call
it AC+QP ), but the performance is not significantly better than AC+Q and
AC+P. We will explore a smarter way to combine them in our future work.
Finally, we compare the improved versions AC+Q and AC+P with the non-
active version RD. We find that AC+Q and AC+P can save approximately 74%
to 90% of the total queries. The savings for the four datasets are 74.1%, 88.4%,
83.3% and 90% respectively (these numbers are derived from Figures 2 and 4).
To summarize, we propose several improved versions (AC+, AC+Q and
AC+P ) in addition to the standard version (AC ) of our hierarchical active learn-
ing framework. According to our empirical studies, we discover that in terms
of the hierarchical F-measure, AC+Q and AC+P are significantly better than
AC+, which in turn is slightly better than AC, which in turn outperforms RD
significantly. In terms of query savings, our best versions AC+Q and AC+P
need significantly fewer queries than the baseline learner RD.
6 Conclusion
We propose a new multi-oracle setting for active learning in hierarchical text
classification as well as an effective active learning framework for this setting.
We explore different solutions which attempt to utilize the hierarchical relation
between categories to improve active learning. We also discover that propagating
positive examples to the ancestor categories can improve the overall performance
of hierarchical active learning. However, it also decreases the precision. To handle
this problem, we propose two additional strategies to leverage negative examples
in the hierarchy. Our empirical study shows both of them can further boost the
performance. Our best strategy proposed can save a considerable number of
queries (74% to 90%) compared to the baseline learner. In our future work,
we will extend our hierarchical active learning algorithms with more advanced
strategies to reduce queries further.
References
1. Ceci, M., Malerba, D.: Classifying web documents in a hierarchy of categories: a
comprehensive study. J. Intell. Inf. Syst. 28, 37–78 (2007)
2. D’Alessio, S., Murray, K., Schiaffino, R., Kershenbaum, A.: The effect of using
hierarchical classifiers in text categorization. In: RIAO 2000, pp. 302–313 (2000)
3. Daraselia, N., Yuryev, A., Egorov, S., Mazo, I., Ispolatov, I.: Automatic extraction
of gene ontology annotation and its correlation with clusters in protein networks.
BMC Bioinformatics 8(1), 243 (2007)
4. Donmez, P., Carbonell, J.G.: Proactive learning: cost-sensitive active learning with
multiple imperfect oracles. In: CIKM 2008, pp. 619–628 (2008)
5. Esuli, A., Sebastiani, F.: Active Learning Strategies for Multi-Label Text Classi-
fication. In: Boughanem, M., Berrut, C., Mothe, J., Soule-Dupuy, C. (eds.) ECIR
2009. LNCS, vol. 5478, pp. 102–113. Springer, Heidelberg (2009)
6. Fagni, T., Sebastiani, F.: Selecting negative examples for hierarchical text classifi-
cation: An experimental comparison. J. Am. Soc. Inf. Sci. Technol. 61, 2256–2265
(2010)
7. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: Liblinear: A library for
large linear classification. The Journal of Machine Learning Research 9, 1871–1874
(2008)
8. Lam, W., Ho, C.Y.: Using a generalized instance set for automatic text categoriza-
tion. In: SIGIR 1998, pp. 81–89 (1998)
9. Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: Rcv1: A new benchmark collection for
text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004)
10. Nowak, S., Rüger, S.: How reliable are annotations via crowdsourcing: a study
about inter-annotator agreement for multi-label image annotation. In: MIR 2010,
pp. 557–566 (2010)
11. Platt, J.C.: Probabilistic outputs for support vector machines. In: Advances in
Large Margin Classifiers, pp. 61–74 (1999)
12. Roy, N., McCallum, A.: Toward optimal active learning through sampling estima-
tion of error reduction. In: ICML 2001, pp. 441–448 (2001)
13. Ruiz, M.E., Srinivasan, P.: Hierarchical neural networks for text categorization
(poster abstract). In: SIGIR 1999, pp. 281–282 (1999)
14. Silla Jr., C.N., Freitas, A.A.: A survey of hierarchical classification across different
application domains. Data Min. Knowl. Discov. 22, 31–72 (2011)
15. Sun, A., Lim, E.-P.: Hierarchical text classification and evaluation. In: ICDM 2001,
pp. 521–528 (2001)
16. Tong, S., Koller, D.: Support vector machine active learning with applications to
text classification. J. Mach. Learn. Res. 2, 45–66 (2002)
17. Verspoor, K., Cohn, J., Mniszewski, S., Joslyn, C.: Categorization approach to au-
tomated ontological function annotation. In: Protein Science, pp. 1544–1549 (2006)
18. Xu, Z., Yu, K., Tresp, V., Xu, X., Wang, J.: Representative Sampling for Text
Classification Using Support Vector Machines. In: Sebastiani, F. (ed.) ECIR 2003.
LNCS, vol. 2633, pp. 393–407. Springer, Heidelberg (2003)
19. Xue, G.R., Xing, D., Yang, Q., Yu, Y.: Deep classification in large-scale text hier-
archies. In: SIGIR 2008, pp. 619–626 (2008)
20. Yang, B., Sun, J.T., Wang, T., Chen, Z.: Effective multi-label active learning for
text classification. In: KDD 2009, pp. 917–926 (2009)
TeamSkill Evolved: Mixed Classification Schemes for Team-Based Multi-player Games

C. DeLong and J. Srivastava
1 Introduction
daunting. With such large player populations, batch learning methods become
impractical, necessitating an online skill assessment process in which adjust-
ments to a player’s skill rating happen one game at a time, depending only on
their existing rating and the outcome of the game. This task is made more diffi-
cult in titles centered around team-based competition, where interaction effects
between teammates can be difficult to model and integrate into the assessment
process.
Our work is concerned with this particular variant of the skill estimation prob-
lem. Although many approaches exist for skill estimation, such as the well-known
Elo rating system [1] and the Glicko rating system [2], [3], they were primarily
designed for one versus one competition settings (in games such as Chess or
tennis) instead of team-based play. They can be altered to accommodate competi-
tions involving teams, but, problematically, assume the performances of players
in teams are independent from one another, thereby excluding potentially useful
information regarding a team’s collective “chemistry”. More recent approaches
[4] have explicitly modeled teams, but still assume player independence within
teams, summing individual player ratings to produce an overall team rating.
“Team chemistry” is a widely-held notion in team sports [5] and is often cited
as a key differentiating factor, particularly at the highest levels of competition.
In the context of skill assessment in an online setting, however, less attention
has been given to situations in which team chemistry would be expected to
play a significant role, such as the case where the player population is highly-
skilled individually, instead using data from a general population of players for
evaluation [4].
Our previous work in this area [6] described several methods for capturing
elements of “team chemistry” in the assessment process by maintaining skill
ratings for subsets of teams as well as individuals, aggregating these ratings
together for an overall team skill rating. One of the methods, TeamSkill-AllK-EV
(hereafter referred to as EV), performed especially well in our evaluation. One
drawback of EV, however, was that it weighted each aggregate n-sized subgroup
skill rating uniformly in the final summation, leaving open the possibility that
further improvements might be made through an adaptive weighting process.
In this paper, we build on our previous work by introducing five algorithms
which address this drawback in various ways, TeamSkill-AllK-Ev-OL1 (OL1),
TeamSkill-AllK-Ev-OL2 (OL2), TeamSkill-AllK-Ev-OL3 (OL3), TeamSkill-AllK-
EVGen (EVGen), and TeamSkill-AllK-EVMixed (EVMixed). The first three -
OL1, OL2, and OL3 - employ adaptive weighting frameworks to adjust the
summation weights for each n-sized group skill rating and limit their feature
set to data common across all team games: the players, team assignments, and
the outcome of the game. For EVGen and EVMixed, however, we explore the
use of EV’s final prediction, the label of the winning team, as a feature to be
included along with a set of game-specific performance metrics in a variety of on-
line classification settings [7], [8], [9]. For EVMixed, a threshold based on EV’s
prior probability of one team defeating another is used to determine whether
or not to include the metrics as features and, if not, the algorithm defers to
EV’s predicted label. EVGen, in contrast, always includes the metrics during
classification.
Evaluation is carried out on a carefully-compiled dataset consisting of tourna-
ment and scrimmage games between professional Halo 3 teams over the course
of two years. Halo 3 is a first-person shooter (FPS) game which was played
competitively in Major League Gaming (MLG), the largest professional video
game league in the world, from 2008 through 2010. With MLG tournaments
regularly featuring 250+ Halo teams vying for top placings, heavy emphasis is
placed on teamwork, making this dataset ideal for the evaluation of interaction
effects among teammates.
We find that EVMixed outperforms all other approaches in most cases, often
by a significant margin. It performs particularly well in cases of limited game
history and in “close” games where teams are almost evenly-matched. These re-
sults suggest that while game-specific features can play a role in skill assessment,
their utility is limited to contexts in which the skill ratings of teams are similar.
When they are not, the inclusion of game-specific information effectively adds
noise to the dataset since their values aren’t conditioned on the strength of their
opponents.
The outline of this paper follows. Section 2 briefly describes some of the work
related to the problem of skill assessment. In Section 3, we introduce our pro-
posed approaches - OL1, OL2, OL3, EVGen, and EVMixed. In Section 4, we
describe some of the key features of the dataset, our evaluation testbed, and
share the results of our evaluation in terms of game outcome prediction accu-
racy. We then conclude with Section 5, discussing the results and future work.
2 Related Work
The foundations of modern skill assessment approaches date back to the work
of Louis Leon Thurstone [10] who, in 1927, proposed the “law of comparative
judgement”, a method by which the mean distance between two physical stimuli,
such as perceived loudness, can be computed in terms of the standard deviation
when the stimuli processes are normally-distributed. In 1952, Bradley-Terry-
Luce (BTL) models [11] introduced a logistic variant of Thurstone’s model,
using taste preference measurements for evaluation. This work in turn led to
the creation of the Elo rating system, introduced by Arpad Elo in 1959 [1], a
professor and master chess player who sought to replace the US Chess Feder-
ation’s Harkness rating system with one more theoretically sound. Similar to
Thurstone, the Elo rating system assumes the process underlying each player’s
skill is normally-distributed with a constant skill variance parameter β 2 across
all players, simplifying skill updates after each game.
However, this simplification was also Elo’s biggest drawback since the “relia-
bility” of a rating was unknown from player to player. To address this, the Glicko
rating system [3], a Bayesian approach introduced in 1993 by Mark Glickman,
allowed for player-specific skill variance, making it possible to determine the
confidence in a player’s rating over time and produce more conservative skill
estimates.
With the release of the online gaming service Xbox Live in 2002, whose player
population quickly grew into the millions, there was a need for a more generalized
rating system incorporating the notion of teams as well as individual players.
TrueSkill [4], published in 2006 by Ralf Herbrich and Thore Graepel, used a
factor graph-based approach to meet this need. In TrueSkill, skill variance is
also maintained for each player, but in contrast to Glicko, TrueSkill samples an
expected performance given a player’s skill rating which is then summed over all
members of a team to produce an estimate of the collective skill of a team. Be-
cause the summation is over individual players, player performances are assumed
to be independent from one another, leaving out potentially useful group-level
interaction information. For team-based games in which highly-skilled players
may coordinate their strategies, this lost interaction information can make the
estimation of a team’s advantage over another difficult, especially as players
change teams.
Several other variants of the aforementioned approaches have also been in-
troduced, including BTL models [12], [13], [14] and expectation propagation
techniques for the analysis of paired comparison data [15].
3 Proposed Approaches
$$s_i^{*} \;=\; \frac{\sum_{k=1}^{K} E[h_i(k)]}{\sum_{k=1}^{K} \mathbb{1}\!\left(|h_i(k)| > 0\right)} \qquad (3.1)$$
Fig. 1. The group history problem. This figure illustrates the group history available
for a team of four players at three different time instances, proceeding chronologically
from left to right. Black font indicates that history is available for a given group while
red font indicates that history is not available.
In this notation, s∗i is the estimated skill of team i and hi (k) is a function
returning the set of skill ratings for player groups of size k in team i, including
the empty set ∅ if none exist. When hi (k) → ∅, we let E[hi (k)] = 0.
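Under the reconstruction of Eq. 3.1 above, the EV aggregation can be sketched as follows: each group-size level with any history contributes its mean rating, and the non-empty levels are averaged uniformly. The dictionary layout and the example ratings are illustrative assumptions.

```python
def all_k_ev_rating(group_ratings):
    """TeamSkill-AllK-EV style aggregation (a sketch of Eq. 3.1).

    group_ratings[k] is the list of skill ratings available for groups of
    size k within the team (possibly empty).  Each non-empty level contributes
    its mean rating, and the result is averaged over the levels that have any
    history at all.
    """
    level_means = []
    for k, ratings in group_ratings.items():
        if ratings:                                          # |h_i(k)| > 0
            level_means.append(sum(ratings) / len(ratings))  # E[h_i(k)]
    if not level_means:
        return 0.0
    return sum(level_means) / len(level_means)

# Example: history exists for singles and pairs but not for the full team.
print(all_k_ev_rating({1: [24.1, 26.3, 22.8, 25.0], 2: [49.5, 51.2], 4: []}))
```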
Despite its excellent results, EV is a “naive” approach, lacking a means of
updating the summation weights, potentially leading to suboptimal performance.
To that end, we introduce three adaptive frameworks which allow the summation
weights to vary over time - TeamSkill-AllK-Ev-OL1 (OL1), TeamSkill-AllK-Ev-
OL2 (OL2), and TeamSkill-AllK-Ev-OL3 (OL3).
3.1 TeamSkill-AllK-Ev-OL1
When attempting to construct an overall team skill rating, one key challenge
to overcome is the fact that the amount of group history can vary over time.
Consider figure 1: after the first game is played, history is available for all possible
groups of players. Later, player 4 leaves the team and is replaced by player 5,
who has never played with players 1, 2, or 3, leaving only a subset of history
available and none for the team as a whole. Then in the final step, player 2
leaves and is replaced by player 6, who has played with player 3 and 5 before,
but never both on the same team, resulting in yet another variant of the team’s
collective group-level history. The feature space is constantly expanding and
contracting over time, making it difficult to know how best to combine the
group-level ratings together. In OL1, we address this issue by maintaining a
weight wk for each aggregate group skill rating of size k, contracting w during
summation by uniformly redistributing the weights from indices in the weight
vector not present in the available aggregate group skill rating history. Given
the winning team i, wk is updated by computing to what extent each of the
aggregate rating’s prior probability of team i defeating some team j according
to TeamSkill-K [6], Pk (i > j), is better than random, increasing the weight of
wk for a correctly-predicted outcome.
$$1 \le \beta \le \infty, \qquad w_k^{0} = \frac{1}{K}, \qquad K = \min\!\left(\max_{k \le K}\big(|h_i(k)| > 0\big),\; \max_{k \le K}\big(|h_j(k)| > 0\big)\right) \qquad (3.2)$$

$$u = \frac{1}{K} \sum_{k > K} w_k^{t} \qquad (3.3)$$

$$w_{(k \le K)}^{t} = w_{(k \le K)}^{t} + u \qquad (3.4)$$

$$s_i^{*} = \sum_{k=1}^{K} w_k^{t}\, E[h_i(k)] \qquad (3.5)$$

$$w_{(k \le K)}^{t+1} = w_{(k \le K)}^{t}\, \beta^{\frac{1}{2} + P_k(i > j)} \qquad (3.6)$$

$$w_k^{t+1} = \frac{w_k^{t+1}}{\sum_{l=1}^{K} w_l^{t+1}} \qquad (3.7)$$
The main drawback of this approach is that the weight for k = 1 eventually
dominates the weight vector as it is the element of group history present in
every game and, therefore, the weight most frequently increased relative to the
weights of k > 1. Given enough game history, this classifier will converge to
exactly k = 1 - the classifier corresponding the an unmodified version of the
general learner (Elo, Glicko, or TrueSkill) it employs.
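A rough sketch of one OL1 step, following the reconstructed Eqs. 3.3-3.7: unavailable levels donate their weight mass uniformly to the available ones, the team rating is a weighted sum of level means, and each available level's weight is boosted multiplicatively according to how strongly it favoured the eventual winner, before renormalisation. Variable names and the handling of unavailable levels between games are assumptions made for illustration.

```python
def ol1_update(weights, level_means_i, p_k, beta=1.1):
    """One OL1-style step (a sketch of the reconstructed Eqs. 3.3-3.7).

    weights       : list of K weights, index k-1 holding w_k
    level_means_i : E[h_i(k)] for the winning team i, with None where no
                    k-sized group history is available for both teams
    p_k           : prior probabilities P_k(i > j) from each k-sized rating
    Returns the aggregated rating s_i* and the updated, renormalised weights.
    It is assumed that at least the k = 1 level always has history.
    """
    K = len(weights)
    avail = [k for k in range(K) if level_means_i[k] is not None]
    # Redistribute the mass of unavailable levels uniformly (Eqs. 3.3-3.4).
    missing_mass = sum(weights[k] for k in range(K) if k not in avail)
    u = missing_mass / len(avail)
    w = [weights[k] + u if k in avail else 0.0 for k in range(K)]
    # Aggregate team rating as a weighted sum of level means (Eq. 3.5).
    s_i = sum(w[k] * level_means_i[k] for k in avail)
    # Boost levels in proportion to how strongly they favoured the winner (Eq. 3.6).
    new_w = [w[k] * beta ** (0.5 + p_k[k]) if k in avail else weights[k]
             for k in range(K)]
    # Renormalise so the weights sum to one (Eq. 3.7).
    total = sum(new_w)
    return s_i, [x / total for x in new_w]
```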
3.2 TeamSkill-AllK-Ev-OL2
3.3 TeamSkill-AllK-Ev-OL3
OL3 works similarly to OL1 in most respects, but instead uses a predefined
window of the d most recent games in which k-sized group history was available
to compute its updates. In this way, the weights “follow” the most confidently-
correct aggregate skill ratings for each window d. In the following, let Ld,k be the
3.5 TeamSkill-AllK-EVGen
In EVGen, we create a feature set xt from a combination of EV’s predicted label
{+1, −1} of the winning team, $\hat{EV}_t$, and a set of n game-specific metrics m.
For Halo 3, several logical metrics are available, such as kill/death ratio and
assist/death ratio (an assist is given to a player when they do more than half of
the damage to a player who is eventually killed by another player), and act as
rough measures of a team’s in-game efficiency since players respawn after each
death throughout the duration of a game. After compiling these metrics for each
team, we take the difference between them for use in xt, adding in $\hat{EV}_t$ as the
final feature. EV was chosen because of its superior performance in previous
evaluations [6] as well as results from preliminary testing for this work, drawing
from the pool of all previous approaches (including OL1, OL2, and OL3).
$$x_t = \left(\hat{EV}_t,\, m_1, m_2, \ldots, m_n\right) \qquad (3.14)$$
Having constructed the feature set xt , we use a more traditional online clas-
sification framework to predict the label of the winning team ŷt , such as the
perceptron [7], online Passive-Aggressive algorithms [8], or Confidence-Weighted
learning [9] (Note: substitute µt for wt in the latter):
ŷt = sign(wt · xt ) (3.15)
After classification, the weight vector over the feature set is then updated
according to the chosen learning framework.
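As an illustration of this step, the sketch below assembles x_t from EV's predicted label and the metric differences and applies a Passive-Aggressive II update [8]; the feature-assembly helper, the plain-list representation, and the default C value are assumptions for the example, not the authors' exact implementation.

```python
def pa2_step(w, x, y, C=0.001):
    """One Passive-Aggressive II update [8] on feature vector x with true label y in {-1, +1}.

    w and x are plain lists of floats; returns the predicted label and the
    updated weight vector.
    """
    score = sum(wi * xi for wi, xi in zip(w, x))
    y_hat = 1 if score >= 0 else -1
    loss = max(0.0, 1.0 - y * score)              # hinge loss
    if loss > 0:
        norm_sq = sum(xi * xi for xi in x)
        tau = loss / (norm_sq + 1.0 / (2.0 * C))  # PA-II step size
        w = [wi + tau * y * xi for wi, xi in zip(w, x)]
    return y_hat, w


def evgen_features(ev_label, metric_diffs):
    """Assemble x_t = (EV_t, m_1, ..., m_n): EV's predicted label followed by the
    between-team differences in game-specific metrics (e.g., kill/death and
    assist/death ratios)."""
    return [float(ev_label)] + list(metric_diffs)
```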
3.6 TeamSkill-AllK-EVMixed
EVMixed introduces a slight variant to EVGen’s overall strategy by selecting
a classification approach based on whether or not both teams are considered
relatively evenly-matched (that is, if a team's prior probability of winning according to EV, $P_{EV}^{t}(i > j)$, is close to 0.5). Here, if the prior probability of one team winning is within some ε of 0.5, we use the EVGen model for prediction.
Otherwise we simply use EV’s label. The approach is simple, as is the intuition
behind it: if EV is sufficiently confident in its predicted label, then there is no
need for additional feature information.
$$\hat{y}_t = \begin{cases} \operatorname{sign}(w_t \cdot x_t) & \text{if } \left|P_{EV}^{t}(i > j) - 0.5\right| < \epsilon \\ \hat{EV}_t & \text{otherwise} \end{cases} \qquad (3.16)$$
4 Evaluation
4.1 Dataset
We evaluate our proposed approaches using a dataset of 7,568 Halo 3 multiplayer
games between professional teams. Each was played over the Internet on Mi-
crosoft’s Xbox Live service in custom games (known as scrimmages) or on a local
area network at an MLG tournament and includes information such as the play-
ers and teams competing, the date of the game, the map and game type, the result
(win/loss) and score, and per-player statistics such as kills, deaths, and assists.
Characteristics unique to this dataset make it ideal for our evaluation pur-
poses. First, it is common for players to change teams between tournaments,
each of which is held roughly every 1-2 months, thereby allowing us to study the
effects of “team chemistry” on performance without the assumption of degraded
individual skill. Second, because every player is competing at such a high level,
their individual skill isn’t considered as important a factor in winning or losing
a game as their ability to work together as a team.
Table 1. Overall prediction accuracy for all test cases. Bold cells = highest accuracy;
bolded/italicized = 2nd-highest accuracy.
Learner Data Close? k=1 k=2 k=3 k=4 AllK AllK-EV AllK-LS OL1 OL2 OL3 EVGen EVMixed
Elo Both N 0.645 0.642 0.636 0.631 0.642 0.645 0.633 0.645 0.645 0.646 0.574 0.647
Y 0.512 0.494 0.497 0.485 0.493 0.5 0.489 0.495 0.495 0.502 0.523 0.521
Tourn. N 0.639 0.626 0.607 0.571 0.628 0.635 0.592 0.639 0.639 0.633 0.572 0.643
Y 0.518 0.497 0.482 0.464 0.5 0.51 0.474 0.531 0.536 0.51 0.549 0.544
Scrim. N 0.643 0.639 0.639 0.631 0.642 0.64 0.633 0.643 0.643 0.64 0.583 0.644
Y 0.503 0.487 0.492 0.476 0.496 0.488 0.476 0.499 0.498 0.487 0.529 0.512
Glicko Both N 0.636 0.63 0.632 0.635 0.64 0.64 0.633 0.637 0.637 0.64 0.581 0.641
Y 0.522 0.564 0.562 0.547 0.569 0.57 0.548 0.524 0.552 0.571 0.528 0.573
Tourn. N 0.638 0.637 0.616 0.588 0.644 0.647 0.613 0.637 0.637 0.647 0.566 0.657
Y 0.484 0.529 0.531 0.523 0.576 0.57 0.557 0.526 0.56 0.57 0.518 0.62
Scrim. N 0.631 0.635 0.637 0.637 0.643 0.637 0.634 0.635 0.636 0.637 0.582 0.638
Y 0.496 0.559 0.565 0.522 0.562 0.551 0.524 0.531 0.551 0.551 0.525 0.554
TrueSkill Both N 0.635 0.641 0.636 0.63 0.638 0.642 0.632 0.635 0.636 0.643 0.572 0.643
Y 0.516 0.555 0.542 0.542 0.552 0.56 0.548 0.536 0.544 0.562 0.522 0.561
Tourn. N 0.64 0.626 0.601 0.576 0.626 0.636 0.601 0.641 0.644 0.634 0.569 0.653
Y 0.5 0.497 0.479 0.474 0.508 0.51 0.495 0.531 0.547 0.508 0.542 0.573
Scrim. N 0.636 0.642 0.639 0.632 0.636 0.638 0.634 0.636 0.637 0.637 0.581 0.64
Y 0.504 0.55 0.542 0.53 0.541 0.54 0.533 0.548 0.55 0.543 0.522 0.542
20% closest games for one rating system are identified and presented to the other.
Because we are interested in performance beyond that of unmodified general
learners (i.e., k = 1), the closest games from k = 1 were presented to the other
TeamSkill approaches while EV’s closest games were presented to k = 1 (due to
its evaluated performance in [6]). The following defaults were used for Elo (α = 0.07, β = 193.4364, μ_0 = 1500, σ_0^2 = β^2), Glicko (q = log(10)/400, μ_0 = 1500, σ_0^2 = 100^2), and TrueSkill (ε = 0.5, μ_0 = 25, σ_0^2 = (μ_0/3)^2, β = σ_0^2/2) according to [4] and [3]. For OL1/OL2, β = 1.1; for OL3, d = 20. For EVGen/EVMixed (ε = 0.03), the Passive-Aggressive II algorithm [8] was used for classification (α = 0.1, C = 0.001, η = 0.9). The final feature set was comprised of cumulative and
windowed (10 games of history) versions of team differences in average team and
player-level kill/death ratio, assist/death ratio, kills/game, and assists/game.
From the results in table 1, it is clear that EVMixed performs the best overall,
and in the widest array of evaluation conditions. It has the best performance in
10 of the 18 test cases and 16 of 18 in which it was at least second best, a
testament to its consistency. EVGen’s overall performance, however, is roughly
7-10% lower on average over all games, exceeding EVMixed’s results only in 3
of the “close” game test cases.
Next we explore how these approaches perform over time by predicting the
outcomes of games occurring prior to 10 tournaments which took place during
2008 and 2009, using tournament data only in order to isolate conditions in
which we expect teamwork to be strongest. From figures 2 and 3, EVMixed’s
superior performance is readily apparent. Of particular note, however, is how well
EVMixed does when little history is available, having a roughly 64% accuracy
just prior to the first tournament for all three learner cases. For close games,
Fig. 3. Prediction accuracy over time for tournament games, close games only
both EVGen and EVMixed show strong results, eventually tapering off and
approaching the other competing methods as more game history is observed.
Data Close? EVGen (Perceptron PA-II CW-diag) EVMixed (Perceptron PA-II CW-diag)
Both N 0.575 0.581 0.584 0.641 0.641 0.641
Y 0.514 0.528 0.528 0.573 0.573 0.573
Tourn. N 0.543 0.566 0.564 0.655 0.657 0.657
Y 0.474 0.518 0.51 0.609 0.62 0.617
Scrim. N 0.575 0.582 0.586 0.638 0.638 0.637
Y 0.515 0.525 0.512 0.556 0.554 0.551
5 Discussion
6 Conclusions
In this paper, we extended our previous work by introducing three methods
in which various strategies are used to maintain a set of weights over aggre-
gate group-level skill rating information. Additionally, we explored the utility
of incorporating game-specific data as features during the prediction process,
describing two such approaches: EVGen and EVMixed. EVMixed outperformed
all previous efforts in the vast majority of cases, leading to the conclusion that
game-specific data is best included when teams are relatively evenly-matched,
and disregarded otherwise.
References
1. Elo, A.: The Rating of Chess Players, Past and Present. Arco Publishing, New
York (1978)
2. Glickman, M.: Paired Comparison Model with Time-Varying Parameters. PhD
thesis. Harvard University, Cambridge, Massachusetts (1993)
3. Glickman, M.: Parameter estimation in large dynamic paired comparison experi-
ments. Applied Statistics 48, 377–394 (1999)
4. Herbrich, R., Graepel, T.: Trueskill: A bayesian skill rating system. Microsoft Re-
search, Tech. Rep. MSR-TR-2006-80 (2006)
5. Yukelson, D.: Principles of effective team building interventions in sport: A di-
rect services approach at penn state university. Journal of Applied Sport Psychol-
ogy 9(1), 73–96 (1997)
6. DeLong, C., Pathak, N., Erickson, K., Perrino, E., Shim, K., Srivastava, J.: Team-
Skill: Modeling Team Chemistry in Online Multi-player Games. In: Huang, J.Z.,
Cao, L., Srivastava, J. (eds.) PAKDD 2011, Part II. LNCS, vol. 6635, pp. 519–531.
Springer, Heidelberg (2011)
7. Rosenblatt, F.: The perceptron: a probabilistic model for information storage and
organization in the brain. Psychological Review 65(6), 386–408 (1958)
8. Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., Singer, Y.: Online passive-
aggressive algorithms. Journal of Machine Learning Research 7, 551–585 (2006)
9. Crammer, K., Dredze, M., Pereira, F.: Exact convex confidence-weighted learning.
In: Advances in Neural Information Processing Systems, vol. 21, pp. 345–352 (2009)
10. Thurstone, L.: Psychophysical analysis. American Journal of Psychology 38, 368–
389 (1927)
11. Bradley, R.A., Terry, M.: Rank analysis of incomplete block designs: I. the method
of paired comparisons. Biometrika 39(3/4), 324–345 (1952)
12. Coulom, R.: Whole-History Rating: A Bayesian Rating System for Players of Time-
Varying Strength. In: van den Herik, H.J., Xu, X., Ma, Z., Winands, M.H.M. (eds.)
CG 2008. LNCS, vol. 5131, pp. 113–124. Springer, Heidelberg (2008)
13. Huang, T., Lin, C., Weng, R.: Ranking individuals by group comparisons. Journal
of Machine Learning Research 9, 2187–2216 (2008)
14. Menke, J.E., Reese, C.S., Martinez, T.R.: Hierarchical models for estimating indi-
vidual ratings from group competitions. American Statistical Association (2007)
(in preparation)
15. Birlutiu, A., Heskes, T.: Expectation Propagation for Rating Players in Sports
Competitions. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S.,
Mladenič, D., Skowron, A. (eds.) PKDD 2007. LNCS (LNAI), vol. 4702, pp. 374–
381. Springer, Heidelberg (2007)
A Novel Weighted Ensemble Technique
for Time Series Forecasting
1 Introduction
where w_i is the weight assigned to the ith forecasting method. To ensure unbiasedness, it is sometimes assumed that the weights add up to unity. Different combination techniques based on different weight assignment schemes have been developed in the literature; some important ones are discussed here:
• In the simple average, all models are assigned equal weights, i.e. wi =
1/n (i = 1, 2, . . . , n) [9,10].
• In the trimmed average, individual forecasts are combined by a simple arith-
metic mean, excluding the worst performing k% of the models. Usually, the
value of k is selected from the range of 10 to 30. This method is sensible only
when n ≥ 3 [9,10].
• In the Winsorized average, the i smallest and i largest forecasts are set to the (i + 1)th smallest and (i + 1)th largest forecasts, respectively [9].
• In the error-based combining, an individual weight is chosen to be inversely
proportional to the past forecast error of the corresponding model [3].
• In the outperformance method, the weight assignments are based on the
number of times a method performed best in the past [11].
• In the variance-based method, the optimal weights are determined by mini-
mizing the total Sum of Squared Error (SSE) [7,10].
All the combination techniques discussed above are linear in nature. The literature on nonlinear forecast combination methods is very limited and further research work is required in this area [10].
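As a concrete illustration of the linear schemes listed above, the following sketch combines component forecasts by the simple average, the trimmed average, and error-based weighting. The helper name, its signature, and the 20% default trim are illustrative assumptions, not part of the paper.

```python
import numpy as np

def combine_forecasts(forecasts, past_errors=None, scheme="simple", trim_pct=20):
    """Linearly combine component forecasts (rows = models, columns = time steps).

    forecasts   : (n_models, N) array of individual forecasts.
    past_errors : (n_models,) past forecast errors (e.g., validation MSE),
                  required for the trimmed and error-based schemes.
    """
    forecasts = np.asarray(forecasts, dtype=float)
    n = forecasts.shape[0]

    if scheme == "simple":                    # equal weights w_i = 1/n
        weights = np.full(n, 1.0 / n)
    elif scheme == "trimmed":                 # drop the worst-performing k% of models
        k = max(1, int(round(n * trim_pct / 100.0)))
        keep = np.argsort(past_errors)[: n - k]
        weights = np.zeros(n)
        weights[keep] = 1.0 / len(keep)
    elif scheme == "error":                   # w_i inversely proportional to past error
        inv = 1.0 / np.asarray(past_errors, dtype=float)
        weights = inv / inv.sum()
    else:
        raise ValueError("unknown scheme")

    return weights @ forecasts                # combined forecast, shape (N,)
```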
mean and standard deviation of Ŷ^(i), respectively. Then the combined forecast of Y is defined as Ŷ^(c) = [ŷ_1^(c), ŷ_2^(c), ..., ŷ_N^(c)]^T, where

ŷ_k^(c) = w_0 + Σ_{i=1}^{3} w_i ŷ_k^(i) + θ_1 v_k^(1) v_k^(2) + θ_2 v_k^(2) v_k^(3) + θ_3 v_k^(3) v_k^(1),    (2)

v_k^(i) = (ŷ_k^(i) − μ^(i)) / σ^(i),   ∀ i = 1, 2, 3;  k = 1, 2, ..., N.
In (2), the nonlinear terms are included in calculating ŷ_k^(c) to take into account the correlation effects between two forecasts. It should be noted that for combining n methods, there will be C(n, 2) nonlinear terms in (2).
In matrix notation, the combined forecast vector can be written as

Ŷ^(c) = F w + G θ,    (3)

where

w = [w_0, w_1, w_2, w_3]^T,   θ = [θ_1, θ_2, θ_3]^T,
F = [ 1 | Ŷ^(1) | Ŷ^(2) | Ŷ^(3) ]_{N×4},   1 = [1, 1, ..., 1]^T,
G = [ v_k^(1) v_k^(2)   v_k^(2) v_k^(3)   v_k^(3) v_k^(1) ]_{k=1,...,N},  the N×3 matrix whose kth row contains the pairwise products of the standardized forecasts.
The weights are to be optimized by minimizing the forecast SSE, given by:

SSE = Σ_{k=1}^{N} (y_k − ŷ_k^(c))²
    = (Y − Fw − Gθ)^T (Y − Fw − Gθ)    (4)
    = Y^T Y − 2 w^T b + w^T V w + 2 w^T Z θ − 2 θ^T d + θ^T U θ,

where V = (F^T F)_{4×4}, b = (F^T Y)_{4×1}, Z = (F^T G)_{4×3}, d = (G^T Y)_{3×1}, U = (G^T G)_{3×3}.
Now from (∂/∂w) (SSE) = 0 and (∂/∂θ) (SSE) = 0, we get the following system
of linear equations:
Vw + Zθ = b,
Z^T w + Uθ = d.    (5)

By solving (5), the optimal combination weights can be obtained as:

θ_opt = (U − Z^T V^{−1} Z)^{−1} (d − Z^T V^{−1} b),
w_opt = V^{−1} (b − Z θ_opt).    (6)

These optimal weights are determinable if and only if all the matrix inverses involved in (6) are well-defined.
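A minimal NumPy sketch of obtaining these weights by solving the block system (5) directly, which is equivalent to the closed form (6); the function name and the assumption that F, G, and Y have already been constructed are illustrative.

```python
import numpy as np

def optimal_combination_weights(F, G, Y):
    """Solve the system (5) for the linear weights w and nonlinear weights theta.

    F : (N, 4) matrix [1 | Yhat1 | Yhat2 | Yhat3]
    G : (N, 3) matrix of pairwise products of standardized forecasts
    Y : (N,)  vector of actual observations
    """
    V = F.T @ F            # 4 x 4
    b = F.T @ Y            # 4
    Z = F.T @ G            # 4 x 3
    d = G.T @ Y            # 3
    U = G.T @ G            # 3 x 3

    # Solve the block system  [V Z; Z^T U] [w; theta] = [b; d]
    A = np.block([[V, Z], [Z.T, U]])
    rhs = np.concatenate([b, d])
    sol = np.linalg.solve(A, rhs)   # requires the system to be nonsingular, cf. (6)
    return sol[:4], sol[4:]

# combined forecast, cf. (3):  yhat_c = F @ w + G @ theta
```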
2. Set j ← 1.
3. W ← empty, Θ ← empty   // initially set the final weight matrices W and Θ as empty matrices of orders n × 1 and C(n, 2) × 1, respectively
4. while j ≤ k do
5.    Define:
12. Use w_comb and θ_comb to calculate the combined forecast vector according to (3).
where

φ(L) = 1 − Σ_{i=1}^{p} φ_i L^i,   θ(L) = 1 + Σ_{j=1}^{q} θ_j L^j,   and   L y_t = y_{t−1}.
The terms p, d, q are the model orders, which respectively refer to the autoregressive, degree of differencing, and moving average processes; y_t is the actual time series and ε_t is a white noise process. In this model, a nonstationary time series is transformed to a stationary one by successively (d times) differencing it [2,4]. A single differencing is often sufficient for practical applications. The ARIMA(0, 1, 0) model, i.e. y_t − y_{t−1} = ε_t, is the popular Random Walk (RW) model, which is frequently used in forecasting financial and stock-market data [4].
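As an illustration of how such models might be fitted in practice, the sketch below uses the statsmodels library; the paper does not specify an ARIMA implementation, and the data file name and forecast horizon here are placeholders.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

series = np.loadtxt("series.txt")           # placeholder: 1-D array of observations

# Random Walk model, ARIMA(0, 1, 0):  y_t - y_{t-1} = eps_t
rw_fit = ARIMA(series, order=(0, 1, 0)).fit()
rw_forecast = rw_fit.forecast(steps=20)     # 20-step-ahead forecast

# An AR(9) model, ARIMA(9, 0, 0), as fitted later to the sunspots series
ar_fit = ARIMA(series, order=(9, 0, 0)).fit()
ar_forecast = ar_fit.forecast(steps=20)
```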
Resilient propagation (RP) [15,16] is applied as the network training algorithm and the logistic and
identity functions are used as the hidden and output layer activation functions,
respectively.
[Figure: a feedforward ANN with input, hidden, and output layers, bias nodes, the inputs to the network, and the output]
Elman networks belong to the class of recurrent neural networks in which one extra layer, known as the context layer, is introduced to recognize the spatial and temporal patterns in the input data [17]. Elman networks contain two types of connections: feedforward and feedback. At every step, the outputs of the hidden layer are fed back to the context layer, as shown in Fig. 2. This recurrence makes the network dynamic, so that it can perform nonlinear time-varying mappings of the associated nodes [16,17]. Unlike for MLPs, there seem to be no general model selection guidelines in the literature for Elman ANNs [10]. However, it is well known that EANNs require many more hidden nodes than simple feedforward ANNs in order to adequately model temporal relationships [10,16]. In this paper, we use 24 hidden nodes and the training algorithm traingdx [16] for fitting EANNs.
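For illustration, the sketch below builds a comparable feedforward forecaster with logistic hidden units and an identity output using scikit-learn. The authors train their networks with resilient propagation (RP) and traingdx in the MATLAB toolbox, which scikit-learn does not provide, so the lbfgs solver stands in here; the lag-matrix helper and data file are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def make_lag_matrix(series, n_lags):
    """Build (samples, n_lags) inputs and one-step-ahead targets from a 1-D series."""
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = series[n_lags:]
    return X, y

# A (7, 5, 1) network: 7 lagged inputs, 5 logistic hidden nodes, 1 identity output
series = np.loadtxt("series.txt")                       # placeholder data source
X, y = make_lag_matrix(series, n_lags=7)
ann = MLPRegressor(hidden_layer_sizes=(5,), activation="logistic",
                   solver="lbfgs", max_iter=2000)       # lbfgs stands in for RP
ann.fit(X, y)
next_value = ann.predict(X[-1:])                        # one-step-ahead forecast
```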
[Figure: an Elman network with input, hidden, output, and context nodes, showing feedforward and feedback connections]
MAE = (1/n) Σ_{t=1}^{n} |y_t − ŷ_t|,   MSE = (1/n) Σ_{t=1}^{n} (y_t − ŷ_t)²,

ARV = Σ_{t=1}^{n} (y_t − ŷ_t)² / Σ_{t=1}^{n} (μ − ŷ_t)²,

where y_t and ŷ_t are the actual and forecasted observations, respectively; n is the size and μ is the mean of the test set. For an efficient forecasting model, the values of these error measures should be as small as possible.
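A small sketch computing these three measures exactly as defined above (the ARV denominator is taken as written there); the function name is illustrative.

```python
import numpy as np

def forecast_errors(y_true, y_pred):
    """Compute MAE, MSE, and ARV for a test set and its forecasts."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))
    mse = np.mean((y_true - y_pred) ** 2)
    mu = y_true.mean()                      # mean of the test set
    arv = np.sum((y_true - y_pred) ** 2) / np.sum((mu - y_pred) ** 2)
    return mae, mse, arv
```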
The sunspots series is stationary with an approximate cycle of 11 years, as
can be seen from Fig. 3(a). Following Zhang [4], the ARIMA(9, 0, 0) (i.e. AR(9))
and the (7, 5, 1) ANN models are fitted to this time series. The EANN model is
fitted with the same numbers of input and output nodes as the ANN, but with 24
hidden nodes. For combining, we take base size = 41, validation window = 20
and the number of iterations k = 9.
The S & P and exchange rate are nonstationary financial series and both ex-
hibit quite irregular patterns which can be observed from their respective time
plots in Fig. 4(a) and Fig. 5(a). The RW-ARIMA model is most suitable for
these type of time series1 . For ANN modeling, the (8, 6, 1) and (6, 6, 1) net-
work structures are used for S & P and exchange rate, respectively. As usual,
the fitted EANN models have the same numbers of input and output nodes as
the corresponding ANN models, but 24 hidden nodes. For combining, we take
base size = 200, validation window = 50, k = 12 for the S & P data and
base size = 165, validation window = 40, k = 10 for the exchange rate data.
In Table 2, we present the forecasting performances of ARIMA, ANN, EANN,
simple average and the proposed ensemble scheme for all three time series.
From Table 2, it can be seen that our ensemble technique has provided low-
est forecast errors among all methods. Moreover, the proposed technique has
also achieved considerably better forecasting accuracies than the simple average
combination method, for all three time series. However, we have empirically ob-
served that like the simple average, the performance of our ensemble method is
also quite sensitive to the extreme errors of the component models.
In this paper, we use the term Forecast Diagram to refer to the graph that shows the actual and forecasted observations of a time series. In each forecast diagram, the solid and dotted lines respectively represent the test and forecasted time series. The forecast diagrams obtained through our proposed ensemble method for the sunspots, S & P, and exchange rate series are depicted in Fig. 3(b), Fig. 4(b), and Fig. 5(b), respectively.
¹ In RW-ARIMA, the preceding observation is the best guide for the next prediction.
² Original MAE = obtained MAE × 10⁻³; original MSE = obtained MSE × 10⁻⁵.
Fig. 3. (a) The sunspots series, (b) Ensemble forecast diagram for the sunspot series
Fig. 4. (a) The S & P series, (b) Ensemble forecast diagram for the S & P series
Fig. 5. (a) Exchange rate series, (b) Ensemble forecast diagram for exchange rate series
6 Conclusions
Improving the accuracy of time series forecasting is a major concern in many practical applications. Although numerous forecasting methods have been developed during the past few decades, it is often quite difficult to select the best among them. It has been observed by many researchers that combining multiple forecasts effectively reduces the prediction errors and hence provides considerably increased accuracy.
In this paper, we propose a novel nonlinear weighted ensemble technique for forecast combination. It extends the common linear combination scheme to include possible correlation effects between the participating forecasts. An efficient successive validation mechanism is suggested for determining the appropriate combination weights. The empirical results with three real-world time series and three forecasting methods demonstrate that our proposed technique significantly outperforms each individual method in terms of forecast accuracy. Moreover, it also provides considerably better results than the classic simple average combining technique. In future work, our ensemble mechanism can be further explored with other diverse forecasting models as well as other varieties of time series data.
References
1. Gooijer, J.G., Hyndman, R.J.: 25 Years of time series forecasting. J. Forecast-
ing 22(3), 443–473 (2006)
2. Box, G.E.P., Jenkins, G.M.: Time Series Analysis: Forecasting and Control, 3rd
edn. Holden-Day, California (1970)
3. Armstrong, J.S.: Combining Forecasts. In: Armstrong, J.S. (ed.) Principles of Fore-
casting: A Handbook for Researchers and Practitioners. Kluwer Academic Publish-
ers, Norwell (2001)
4. Zhang, G.P.: Time series forecasting using a hybrid ARIMA and neural network
model. Neurocomputing 50, 159–175 (2003)
5. Bates, J.M., Granger, C.W.J.: Combination of forecasts. Operational Research
Quarterly 20(4), 451–468 (1969)
6. Clemen, R.T.: Combining forecasts: A review and annotated bibliography. J. Fore-
casting 5(4), 559–583 (1989)
7. Aksu, C., Gunter, S.: An empirical analysis of the accuracy of SA, OLS, ERLS and
NRLS combination forecasts. J. Forecasting 8(1), 27–43 (1992)
8. Zou, H., Yang, Y.: Combining time series models for forecasting. J. Forecast-
ing 20(1), 69–84 (2004)
9. Jose, V.R.R., Winkler, R.L.: Simple robust averages of forecasts: Some empirical
results. International Journal of Forecasting 24(1), 163–169 (2008)
10. Lemke, C., Gabrys, B.: Meta-learning for time series forecasting and forecast com-
bination. Neurocomputing 73, 2006–2016 (2010)
11. Bunn, D.: A Bayesian approach to the linear combination of forecasts. Operational
Research Quarterly 26(2), 325–329 (1975)
12. Frietas, P.S., Rodrigues, A.J.: Model combination in neural-based forecasting. Eu-
ropean Journal of Operational Research 173(3), 801–814 (2006)
13. Zhang, G., Patuwo, B.E., Hu, M.Y.: Forecasting with artificial neural networks: The
state of the art. J. Forecasting 14, 35–62 (1998)
14. Faraway, J., Chatfield, C.: Time series forecasting with neural networks: a compar-
ative study using the airline data. J. Applied Statistics 47(2), 231–250 (1998)
15. Riedmiller, M., Braun, H.: A direct adaptive method for faster backpropagation
learning: The rprop algorithm. In: Proceedings of the IEEE Int. Conference on
Neural Networks (ICNN), San Francisco, pp. 586–591 (1993)
16. Demuth, M., Beale, M., Hagan, M.: Neural Network Toolbox User’s Guide. The
MathWorks, Natick (2010)
17. Lim, C.P., Goh, W.Y.: The application of an ensemble of boosted Elman networks
to time series prediction: A benchmark study. J. of Computational Intelligence 3,
119–126 (2005)
18. Time Series Data Library, http://robjhyndman.com/TSDL
19. Yahoo! Finance, http://finance.yahoo.com
20. Pacific FX database, http://fx.sauder.ubc.ca/data.html
Techniques for Efficient Learning without Search
1 Introduction
The classical classification learning paradigm performs search through a hypoth-
esis space to identify a hypothesis that optimizes some objective function with
respect to training data. Averaged n-Dependence Estimators (AnDE) [10] is an
approach to learning without search or hypothesis selection, which represents a
fundamental alternative to the classical learning paradigm.
The new paradigm gives rise to a family of algorithms whose different members, Webb et al. [10] hypothesize, are suited to differing quantities of data. The algorithms range from low variance with high bias through to high variance with low bias. Webb et al. suggest that members with low variance are suited to small datasets whereas members with low bias are suitable for large datasets.
They claim that the asymptotic error of the lowest bias variant is Bayes optimal.
The algorithms in the family possess a unique set of features that are suitable
for many applications. In particular, they have a training time that is linear
with respect to the number of examples and can learn in a single pass through
the training data without any need to maintain the training data in memory.
Thus, they show great potential for very accurate classification from large data.
Further, they have direct capacity for incremental and anytime [6] learning,
are robust in the face of noise and directly handle missing values. Importantly,
evaluations have shown that their classification accuracy is competitive with the
state-of-the-art in machine learning [10].
AnDE extends the underlying strategy of Averaged One-Dependence Estima-
tors (AODE) [9], which relaxes the Naive Bayes (NB) independence assumption
while retaining many of Naive Bayes’s desirable computational and theoretical
properties. The third member of the AnDE family, A2DE, has been shown to
produce strong predictive accuracy over a wide range of data sets [10].
Although evaluations to date support the hypothesis that the predictive accu-
racy of AnDE increases for larger datasets with higher orders of n, the expected
increase in accuracy comes at the cost of increased computational requirements.
The current implementations further complicate the matter due to their inef-
ficiencies. Thus, efficient implementation is critical. Except in cases of lower
dimensional data, the computational requirements defeat a straightforward ex-
tension of Weka’s AODE [11] to handle A3DE.
This paper presents data structures and algorithms that reduce both memory
and time required for both training and classification. These improvements have
enabled us to evaluate the effectiveness of A3DE on large datasets. The results
provide further evidence that members of the AnDE family with increasing n
are increasingly effective at classifying datasets of increasing size.
The remainder of the paper starts by introducing the AnDE family of al-
gorithms. Section 3 outlines the memory representation developed to reduce
memory usage. The enhancements made to reduce testing times are outlined in
Section 4. Section 5 presents the results of evaluating the effectiveness of the
enhancements. It also compares the effectiveness of A3DE with AnDE members
with lower n. Finally, conclusions are outlined.
We assume herein that NB and the other AnDE family members are imple-
mented by compiling at training time a table of observed low-dimensional prob-
abilities. Under this strategy, the complexity of building this model is O(ta),
where t is the number of training examples and a the number of attributes. As
the model simply stores the frequency of each attribute value for each class after
scanning the training examples, the space complexity is O(kav), where k is the
number of classes and v is the average number of attribute values. As the classi-
fier only needs to estimate the probability of each class for the attribute values
of the test case, the resulting complexity at classification time is O(ka).
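A minimal sketch of this frequency-table strategy for NB (illustrative only: attribute values are assumed to be encoded as small integer indices, and the Laplace smoothing shown is an added detail, not something specified here):

```python
import numpy as np

def train_nb_counts(X, y, n_classes, n_values):
    """Single-pass frequency tables: counts[c, a, v] = #examples of class c
    with value v for attribute a.  O(t*a) time, O(k*a*v) space."""
    counts = np.zeros((n_classes, X.shape[1], n_values), dtype=np.int64)
    class_counts = np.zeros(n_classes, dtype=np.int64)
    for xi, yi in zip(X, y):
        class_counts[yi] += 1
        for att, val in enumerate(xi):
            counts[yi, att, val] += 1
    return counts, class_counts

def nb_predict(x, counts, class_counts, alpha=1.0):
    """Classify one example with Laplace-smoothed probabilities: O(k*a) time."""
    k, a, v = counts.shape
    log_scores = np.log((class_counts + alpha) / (class_counts.sum() + alpha * k))
    for att, val in enumerate(x):
        log_scores += np.log((counts[:, att, val] + alpha)
                             / (class_counts + alpha * v))
    return int(np.argmax(log_scores))
```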
Despite the attribute independence assumption, NB delivers relatively ac-
curate results. However, greater accuracy can be achieved if the attribute-
independence assumption is relaxed. New algorithms based on NB have been
developed, referred to as semi-Naive Bayesian techniques, that achieve greater
accuracy by doing this, as real-world problems generally do have relationships
among attributes [12].
Of numerous semi-naive Bayesian techniques, SP-TAN [7], Lazy Bayesian
Rules (LBR) [13] and AODE [9] are among the most accurate. However, SP-
TAN has very high computational complexity at training time and LBR has
high computational complexity for classification. In contrast, AODE, a more efficient algorithm, avoids some of the undesirable properties of those algorithms while achieving comparable results.
2.1 AODE
In practice, AODE only uses estimates of probabilities for which relevant examples occur in the data. Hence,

P̂_AODE(y, x) = [ Σ_{α=1}^{a} δ(x_α) P̂(y, x_α) P̂(x | y, x_α) ] / [ Σ_{α=1}^{a} δ(x_α) ]   if Σ_{α=1}^{a} δ(x_α) > 0,
P̂_AODE(y, x) = P̂_NB(y, x)   otherwise.    (4)
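A schematic sketch of the aggregation in (4), with δ(x_α) taken here as an indicator that the parent value occurs in the training data (the precise criterion is not spelled out in this excerpt). The naive five-dimensional count array used below is exactly the kind of layout whose memory cost the following sections set out to reduce; all names and the smoothing constants are illustrative.

```python
import numpy as np

def aode_scores(x, pair_counts, class_counts, n_total, alpha=1.0):
    """Schematic AODE class scores for one example x (vector of attribute-value
    indices).  pair_counts[c, p, vp, a, va] holds joint counts of class c,
    parent attribute p with value vp, and child attribute a with value va."""
    k, n_att, n_val = pair_counts.shape[0], pair_counts.shape[1], pair_counts.shape[2]
    scores = np.zeros(k)
    n_parents_used = 0
    for p in range(n_att):
        vp = x[p]
        parent_count = pair_counts[:, p, vp, p, vp]        # counts of (class, x_p)
        if parent_count.sum() == 0:                        # delta(x_p) = 0: skip parent
            continue
        n_parents_used += 1
        est = (parent_count + alpha) / (n_total + alpha * k * n_val)   # ~P(y, x_p)
        for a in range(n_att):
            if a == p:
                continue
            est = est * ((pair_counts[:, p, vp, a, x[a]] + alpha)
                         / (parent_count + alpha * n_val))             # ~P(x_a | y, x_p)
        scores += est
    if n_parents_used == 0:
        return None            # fall back to the NB estimate, cf. (4)
    return scores / n_parents_used
```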
2.2 AnDE
AnDE [10] generalises AODE’s strategy of search free extrapolation from low-
dimensional probabilities to high-dimensional probabilities. The first member
of the AnDE family (where n = 0) is NB, the second member is AODE and
the third is A2DE. Investigation into the accuracy of higher dimensional models
with different training set sizes shows that a higher degree model might be
susceptible to variance in a small training sample, and consequently that a lower
degree model is likely to be more accurate for small data. On the other hand,
higher degrees of AnDE may work better for larger training sets as minimizing
bias will be of increasing importance as the size of the data increases [3].
For notational convenience we define
Given sufficient training data, A2DE has lower error than AODE, but at the
cost of significantly more computational resources.
The parent and child dimensions of the frequency matrix are illustrated in Fig. 1a.
It contains cells for each parent-child combination and the (n,n) locations are
reserved for frequencies of parents. The 2-D structure is replicated for each class
to form the 3-D frequency matrix for AODE.
The representation for A2DE is a 4-D matrix that is a collection of tetrahe-
dral structures for each class. Each cell contains the frequencies of (class, par-
ent1, parent2, child) combinations. The matrix reserves (class, parent1, parent1,
parent1) cells for storing frequencies of class-parent1 combinations and (class,
parent1, parent2, parent2) cells for storing class-parent1-parent2 combinations.
AnDE requires a matrix of n + 2 dimensions to store the frequencies of all attribute-value combinations. The outer dimension has k elements, one for each class. The n middle dimensions represent the n parent attribute values and the final dimension represents the child attribute values. The inner dimensions have av elements, where a is the number of attributes and v is the average number of attribute values (including missing values). Consequently, the size of the frequency matrix is determined by figurate numbers, P_{n+1}(av) = C(av + n, n + 1), resulting in a memory complexity of O(k · C(av + n, n + 1)).
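For concreteness, this bound is easy to compute directly; the helper below is illustrative, with example figures loosely based on the Adult dataset of Table 1 (2 classes, 14 attributes, roughly 8 values per attribute).

```python
from math import comb

def ande_cells(k, a, v, n):
    """Upper bound on frequency-matrix cells for AnDE: k * C(a*v + n, n + 1)."""
    return k * comb(a * v + n, n + 1)

for n in (1, 2, 3):                      # AODE, A2DE, A3DE
    print(n, ande_cells(k=2, a=14, v=8, n=n))
```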
Although this representation allows straightforward access to the frequency of a class-parent-child combination, the matrix has to be implemented as a collection of arrays. This incurs overhead and does not guarantee that a contiguous block of memory is allocated for the matrix, reducing the possibility that the required parts of the matrix are available in the system's cache.
The frequency matrix can be stored compactly with the elements of each row
stored in consecutive positions. This representation minimises the overheads that
can occur with multi-dimensional arrays. Taking AODE as an example, the rows
in the 2-D matrix, which are all combinations involving the corresponding parent,
can be stored sequentially in a 1-D array as shown in Fig. 1b.
Allocating slots for all combinations of attribute values in the frequency ma-
trix simplifies access. However, this produces a sparse matrix containing unused
slots allocated for impossible combinations. As training and testing cases have
only single valued attributes, combinations of attribute values of the same at-
tribute are impossible. In the case of the AODE example, the frequency matrix
contains slots to record frequencies of a1 a2 , b1 b2 , b1 b3 and b2 b3 , which are im-
possible combinations (shaded in black in Fig. 2a). The size of the frequency
matrix can be reduced by avoiding the allocation of memory for such impossible
combinations. In the AODE example, the size of the 2-combinations matrix can be reduced from 10 to 6. The size of the n-combinations matrix is C(a, n + 1) · v^(n+1).
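One possible way to realize such a compact layout is sketched below: attribute values receive global ids and only cross-attribute pairs are assigned a slot in the 1-D array. This is an illustrative scheme consistent with the description, not necessarily the exact indexing used in the authors' implementation.

```python
from itertools import combinations

def build_pair_index(attribute_cardinalities):
    """Map each valid pair of attribute values from *different* attributes
    to a slot in a compact 1-D array (one possible layout)."""
    value_attr = []                          # global value id -> owning attribute
    for att, card in enumerate(attribute_cardinalities):
        value_attr.extend([att] * card)

    index, slot = {}, 0
    for v1, v2 in combinations(range(len(value_attr)), 2):
        if value_attr[v1] == value_attr[v2]:
            continue                         # impossible combination: no slot allocated
        index[(v1, v2)] = slot
        slot += 1
    return index, slot                       # slot == number of allocated cells

# AODE example from the text: attribute "a" with 2 values, "b" with 3 values
index, n_cells = build_pair_index([2, 3])
print(n_cells)                               # 6 valid pairs instead of 10
```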
5 Evaluation
We selected nine datasets, described in Table 1, from the UCI machine learning
repository for the comparisons. The chosen collection includes small, medium
and large datasets with small, medium and high dimensionality. The datasets
were split into two sets, with 90% of the data used for training and the remaining
10% used for testing. The experiments were conducted on a single CPU single
core virtual Linux machine running on a Dell PowerEdge 1950 with dual quad
core Intel Xeon E5410 processor running at 2.33GHz with 8 GB of RAM.
Dataset          Cases   Att  Values  Classes
Abalone           4177     8      24        3
Adult            48842    14     117        2
Connect-4        67557    42     126        3
Covertype       581012    54     118        7
Dermatology        366    34     132        6
House Votes 84     435    16      48        2
Sonar              208    60     180        2
SPAM E-mail       4601    57     171        2
Waveform-5000     5000    40     120        3
The implementations of the three algorithms of the AnDE family are lim-
ited to categorical data. Consequently, all numerical attributes are discretized.
When MDL discretization [5], a common discretization method for NB, was
used within each cross-validation fold, we identified that many attributes have
only one value. So, we discretized numerical attributes using three-bin equal-
frequency discretization prior to classification for these experiments.
The memory usage of the classifier was measured by the ‘Classmexer’ tool [4],
which uses Java’s instrumentation framework to query the Java virtual machine
(JVM). It follows relations of objects, so that the size of the arrays inside arrays
are measured, including their object overhead and padding.
Accurately measuring execution time for the Java platform is difficult. There
can be interferences due to a number of JVM processes such as garbage col-
lection and code profiling. Consequently, to make accurate execution time mea-
surements, we use a Java benchmarking framework [2] that aims to minimize
the noise during measurements. The framework executes the code for a fixed
time period (more than 10 seconds) to allow the JVM to complete all dynamic
optimizations and forces the JVM to perform garbage collection before mea-
surements. All tests are repeated in cases where the code is modified by the
JVM. The code is also executed a number of times with the goal of ensuring the
cumulative execution time to be large enough for small errors to be insignificant.
The memory usage for AnDE was reduced by the introduction of a new data
structure that avoids the allocation of space for impossible combinations. The
reductions in memory usage for the enhanced AnDE implementations were com-
pared against the respective versions of AnDE that store the frequency matrix in a single array. We do not present the memory reductions of compacting the multi-dimensional array into one dimension as they are specific to Java.
The reductions in memory usage are summarised in Fig. 4a. Results show that
the memory reduction for AODE ranged from 1% to 14%. The highest percentage
in reduction was observed for the adult dataset, which had a reduction of 9.67KB.
The main reason for the large reduction is the high average number of attribute
values of 8.36 for the adult dataset. In contrast, the other datasets have average
number of attribute values of around 3.
We compared the effectiveness of the AnDE members using the enhanced ver-
sions implemented in the Weka workbench on the 62 datasets that were used
to evaluate the performance of A2DE [10]. Each algorithm was tested on each
dataset using the repeated cross-validation bias-variance estimation method [8].
We used two-fold cross validation to maximise variation in the training data be-
tween trials. In order to minimise the variance in our measurements, we report
the mean values over 50 cross-validation trials.
The experiments were conducted on the same virtual machine used to evaluate
the effectiveness of the improvements. Due to technical issues, including memory
leaks in the Weka implementation, increasing amounts of memory is required
when multiple trials are conducted. Consequently, we were unable to get bias-
variance results for four datasets (Audiology, Census-Income, Covertype and
Cylinder bands), that were of high dimensionality. We compared the relative
performances of AODE, A2DE and A3DE on the remaining 58 datasets. The
lower, equal or higher outcomes when the algorithms are compared to each other
are summarised as win/draw/loss records in Table 2.
The results show that the bias decreases as n increases at the expense of
increased variance. The bias of A3DE is lower significantly more often than not
in comparison to A2DE and AODE. The bias of A2DE is lower significantly more
often relative to AODE. In contrast, the variance of AODE is lower significantly
more often than A2DE and A3DE. The variance of A2DE is lower significantly
more often relative to A3DE.
None of the three algorithms have a significantly lower zero-one loss or RMSE
on the evaluated datasets. We believe that this is due to the wide range of dataset sizes used in the evaluation. We hypothesize that members of the AnDE
family with lower n, that have a low variance, are best suited for small datasets.
In contrast, members with higher degrees of n are best suited for larger datasets.
The sizes of the chosen datasets ranged from just over 10,000 cases (Pen Digits) to over
60,000 cases (Connect-4 Opening).
The evaluation results are summarised as win/draw/loss records in Table 3.
As expected, the results show A3DE has a lower bias and higher variance than
A2DE and AODE. The zero-one loss and the RMSE of A3DE are lower for all
the evaluated datasets in comparison to A2DE and AODE (p=0.008). These
results confirm that A3DE performs better than its lower-dimensional variants
at classifying larger datasets.
7 Conclusions
The AnDE family of algorithms perform search-free learning. The parameter n
controls the bias-variance trade-off such that n = a provides a classifier whose
asymptotic error is the Bayes optimum. We presented techniques for reducing the
memory usage and the testing times of the AnDE implementations that make
A3DE feasible to employ for higher-dimensional data. As A3DE is superior to
AnDE with lower values of n when applied to large data, and as the linear
complexity and single pass learning of AnDE make it particularly attractive
for learning from large data, we believe these optimizations have potential for
considerable impact.
We developed a new compact memory representation for storing the frequen-
cies of attribute-value combinations that stores all frequencies in a 1-D array
avoiding the allocation of space for impossible attribute-value combinations.
The evaluation results showed that the enhancements substantially reduced the
memory requirements. The enhancements reduced the overall A3DE memory
requirements ranging from 13% to 64%, including reductions of over 100MB for
the high-dimensional datasets.
The classification times of the AnDE algorithms were improved by reorganis-
ing the memory representation to maximise locality of reference and minimising
memory accesses. These enhancements resulted in substantial reductions to the
total testing times for the AnDE family of algorithms. In the case of A3DE, the
maximum reduction in total testing time was 8.89ks, which was a reduction of
60%, for the Covertype dataset.
The enhancements to the AnDE algorithms opened the door for evaluating
the performance of A3DE. As expected, the results showed that A3DE has lower
bias in comparison to A2DE and AODE. The results for zero-one error between
A3DE, A2DE and AODE did not produce a clear winner. However, A3DE pro-
duced the lowest error for large datasets (with over 10,000 cases).
The computational complexity of AnDE algorithms is linear with respect to
the number of training examples. Their memory requirements are dictated by
the number of attribute values in the data. Consequently, the most accurate and
feasible member of the AnDE algorithm for a particular dataset will have to be
decided based on the dataset’s size and its dimensionality.
References
1. Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases,
http://www.ics.uci.edu/~ mlearn/MLRepository.html
2. Boyer, B.: Robust Java benchmarking (2008),
http://www.ibm.com/developerworks/java/library/j-benchmark1.html
3. Brain, D., Webb, G.I.: The Need for Low Bias Algorithms in Classification Learning
From Large Data Sets. In: Elomaa, T., Mannila, H., Toivonen, H. (eds.) PKDD
2002. LNCS (LNAI), vol. 2431, pp. 62–73. Springer, Heidelberg (2002)
4. Coffey, N.: Classmexer agent, http://www.javamex.com/classmexer/
5. Fayyad, U., Irani, K.: Multi-interval discretization of continuous-valued attributes
for classification learning. In: Proc. of the 13th Int. Joint Conference on Artificial
Intelligence, pp. 1022–1029. Morgan Kaufmann (1993)
6. Hui, B., Yang, Y., Webb, G.I.: Anytime classification for a pool of instances. Ma-
chine Learning 77(1), 61–102 (2009)
7. Keogh, E., Pazzani, M.: Learning augmented Bayesian classifiers: A comparison
of distribution-based and classification-based approaches. In: Proc. of the Interna-
tional Workshop on Artificial Intelligence and Statistics, pp. 225–230 (1999)
8. Webb, G.I.: Multiboosting: A technique for combining boosting and wagging. Ma-
chine Learning 40(2), 159–196 (2000)
9. Webb, G.I., Boughton, J., Wang, Z.: Not so naive Bayes: Aggregating one-
dependence estimators. Machine Learning 58(1), 5–24 (2005)
10. Webb, G.I., Boughton, J., Zheng, F., Ting, K.M., Salem, H.: Learning by ex-
trapolation from marginal to full-multivariate probability distributions: Decreas-
ingly naive Bayesian classification. Machine Learning 86(2), 233–272 (2012),
doi:10.1007/s10994-011-5263-6
11. Witten, I., Frank, E.: Data Mining: Practical Machine Learning Tools and Tech-
niques. Morgan Kaufmann (2005)
12. Zheng, F., Webb, G.I.: A comparative study of semi-naive Bayes methods in clas-
sification learning. In: Simoff, S.J., Williams, G.J., Galloway, J., Kolyshakina, I.
(eds.) Proc. of the 4th Australasian Data Mining Conference (AusDM 2005), pp.
141–156 (2005)
13. Zheng, Z., Webb, G.I.: Lazy learning of Bayesian rules. Machine Learning 41(1),
53–84 (2000)
An Aggressive Margin-Based Algorithm
for Incremental Learning
1 Introduction
The need to analyze periodically collected data has emerged in recent practical applications, including network traffic analysis [1], anomaly detection [2], and intrusion detection [3]. Generally, those applications adjust their classifiers/detectors periodically. Most incremental learning approaches are based on decision trees [4], neural networks [5,6], and Support Vector Machines (SVM) [3,7,8,9,10]. Typically they are designed to build a statistical classification model from previously seen samples and to correct its prediction mistakes on new labeled samples. Focusing on the sample space, SVM generalizes the separating hyperplane (classifier) from the whole sample distribution and maximizes the margins of the labeled samples (support vectors). The margin of a sample is the distance between the sample and the separating hyperplane, and SVM is theoretically guaranteed to produce a hyperplane that separates samples with different labels well.
(This work is supported by NSC, Taiwan, under grant no. NSC 99-2218-E-492-006.)
In [10],
an incremental batch SVM approach was designed to update the classifier by solving a constrained optimization problem based on each set of collected samples. An example is illustrated in Fig. 1(a), where the classifier w^i is adjusted to w^{i+1} depending on the set of samples {x^i_1, x^i_2, x^i_3}. This approach has to solve a complicated constrained optimization problem since the collected samples are used simultaneously. Other approaches [8,9] adjust SVM classifiers incrementally by identifying whether each new sample is a support vector. Differently from [10], in Fig. 1(b) the classifier w^i is adjusted to w^i_1 using the first sample x^i_1 in the set, and then w^i_1 is updated to w^i_2 using x^i_2; thus w^i is incrementally adjusted to w^{i+1} using each sample in the set. The advantage of [8,9] is that useful samples previously identified as support vectors are maintained and that efficient update steps are obtained without solving a constrained optimization problem. In those SVM approaches, however, the hyperplanes might not be quickly adjusted when encountering a diverse sample distribution. In other words, diverse samples have a small chance of becoming support vectors because their distribution differs significantly from the distribution of the samples already in the set. Thus, in this paper, our approach is to simplify the constrained optimization problem for the update steps and to adapt the classifiers to the diverse sample distribution.
Fig. 1. Concepts of solving the problem of adjusting classifiers. w^i and w^{i+1} are the current classifier and the next one. x^i_1, x^i_2, and x^i_3 are samples used for adjusting w^i.
Rather than training the SVM classifier based on each sample or each set of collected samples, our approach adjusts the current classifier incrementally according to one sample in each collected set. Thus, for each potential update, we formulate an optimization problem with a single constraint. Additionally, our updated classifier shall correct as many prediction mistakes of the previous classifier as possible.
Typically, whenever the loss is zero, PA is passive and w^{t+1} = w^t, meaning no classifier adjustment. While the loss is positive (less than 1), w^t is aggressively updated by adjusting more than the margin, y^t(w^t · x^t), so that the constraint l(w^{t+1}, (x^t, y^t)) = 0 is satisfied. The Lagrangian of the optimization problem in Eq. (1) is defined as Eq. (3):

L(w, τ) = (1/2) ‖w − w^t‖² + τ (1 − y^t (w · x^t)).    (3)

Setting the partial derivative of L with respect to w to zero, and then the derivative with respect to τ to zero, we have

w = w^t + τ y^t x^t,   τ = (1 − y^t (w^t · x^t)) / ‖x^t‖².

Ultimately, the PA update is performed by solving the constrained optimization problem in Eq. (1), and it is theoretically shown that the aggressive update strategy of PA modifies the weight vector as little as possible. The effectiveness of PA in solving classification and regression problems is formally analyzed in [11]. Based on this well-defined learning model of PA, several online algorithms [18,19] have been proposed for adding confidence information and handling non-separable data.
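For reference, a minimal sketch of the PA update derived above; the function name is illustrative, and the optional cap mentioned in the comment corresponds to the PA-I variant rather than the hard-margin rule shown here.

```python
import numpy as np

def pa_update(w, x, y):
    """One Passive-Aggressive step for a binary example (x, y), y in {-1, +1}.

    Passive when the hinge loss is zero; otherwise the step size tau is chosen so
    that the new weight vector satisfies the margin constraint (cf. Eq. (3)).
    PA-I would additionally cap tau at a constant C.
    """
    loss = max(0.0, 1.0 - y * float(np.dot(w, x)))
    if loss == 0.0:
        return w                       # passive: no adjustment
    tau = loss / float(np.dot(x, x))   # aggressive step size
    return w + tau * y * x
```

IPA departs from this single-sample rule by additionally penalizing, in Eq. (4), the mistakes a candidate update makes on the rest of the collected set, and by selecting the final classifier via Eq. (5).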
in each collected set. For each potential sample, there are two update steps in
IPA: 1) to correct prediction mistakes of the current classifier, and 2) to ag-
gressively update the current classifier by adjusting more than the margin. At
last, the error minimization classifier on the collected dataset is selected as the
next classifier. Before formulating the model of the proposed approach, we de-
fine some notations. Given the labeled dataset K t collected at the round t, there
are |K t | sample-label pairs, {(x1 , y1 ), ..., (x|K t | , y|K t | )}. wt is the classifier at the
round t, i.e., the vector of weights. When using each labeled sample x_k ∈ K^t, the updated classifier w^{t+1} shall correct as many mistakes of the previous classifier w^t as possible, while w^t shall be adjusted as little as possible. Aggressively, if x_k receives an incorrect predicted sign from w^t, then the adjustment to w^t should exceed x_k's margin. Thus these update steps to w^t are
formulated as the constrained optimization problem,
f(w^t, (x_k, y_k), K^t) = argmin_{w ∈ R^n} { (1/2) ‖w − w^t‖² + C_0 Σ_{x_i ∈ K^t, x_i ≠ x_k} l(w, (x_i, y_i)) }
s.t. l(w, (x_k, y_k)) = 0,    (4)
where C0 is a constant to control the tradeoff between the classifier deviation and
the corrected prediction mistakes, and l(w, (xi , yi )) is the hinge loss function.
Furthermore, after wt is updated using every sample xk ∈ K t according to Eq.
(4), those updated classifiers, {f (wt , (xk , yk ), K t ) : 1 ≤ k ≤ |K t |}, are the candi-
dates for the new classifier. In order to avoid the new classifier being extremely
specific to the current classifier, the selection strategy is to find the proper clas-
sifier which has the most accurate classification performance on K t . When more
than one updated classifier has the highest classification accuracy, we select the updated classifier that has the smallest difference from w^t. Hence the new
classifier wt+1 , selected among the candidate set of the updated classifiers, is the
solution to the optimization problem,
wt+1 = argminw∈{f (wt ,(xk ,yk ),K t ) : 1≤k≤|K t |} C l(w, (xi , yi )) + ||w − wt ||,
xi ∈K t
(5)
where C is a large constant in order to select w strongly depending on the errors.
To solve the problem in Eq. (4), let C_0 = 1 and let κ^t, a subset of K^t, be the set of samples whose predicted labels are incorrectly decided by w^t. While the loss of each sample in κ^t is positive (less than 1), the Lagrangian of the constrained optimization problem is defined as Eq. (6):
L(w, τ) = (1/2) ‖w − w^t‖² + Σ_{x_i ∈ κ^t, x_i ≠ x_k} (1 − y_i (w · x_i)) + τ (1 − y_k (w · x_k)).    (6)

Substituting the stationarity condition of (6) with respect to w back into the Lagrangian yields

L(τ) = (1/2) ‖ Σ_{x_i ∈ κ^t} y_i x_i + τ y_k x_k ‖²
     + Σ_{x_i ∈ κ^t, x_i ≠ x_k} (1 − y_i ((w^t + Σ_{x_j ∈ κ^t, x_j ≠ x_k} y_j x_j + τ y_k x_k) · x_i))
     + τ (1 − y_k ((w^t + Σ_{x_i ∈ κ^t, x_i ≠ x_k} y_i x_i + τ y_k x_k) · x_k)).    (8)
4 Experiments
In this section, our experiments are designed to present the classification accuracy of our approach while the classifier is incrementally updated with several small training sets. To show the effectiveness of the classifier updates in our approach, we also implement the online PA and an incremental batch SVM [9]. Additionally, in order to show the effectiveness of correcting mistakes of the previous classifier in Eq. (4), the performance of our approach with C0 = 0 is also compared in the following experiments. When evaluating the classification accuracy of a classifier, we would like to clearly reflect the classification results of samples in the two different classes. We therefore use micro-average accuracy, which averages the classification accuracies calculated on the two classes respectively. For consistency, the summations of loss errors in Eq. (4) and (5) are also replaced by (1 − micro-average accuracy).
Table 1 presents 13 real-world data collections from 4 different sources used in
our experiments. The multi-domain sentiment dataset 1 contains product reviews
downloaded from Amazon.com from four product types (domains): Kitchen,
Books, DVDs, and Electronics. Each domain has several thousand reviews, but
the exact number varies by domain. In this experiment, only Books and DVDs are used for evaluating the performance of the learning approaches. From the second data source, the dataset of the ECML/PKDD-2006 discovery challenge² is used to decide whether received emails are spam or non-spam. Notably, there are over 10,000 features in three of the datasets (Books, DVDs, and Emails), and it is difficult to analyze the performance of the SVM classifiers implemented in Matlab [9] because their execution is time consuming on such high-dimensional datasets. Thus we randomly select a subset of the documents, as presented in Table 1, in the following experiments. From the third data source, Spamming Bots [20] is the
set of response codes of the sent emails, collected in National Chung Cheng
¹ Sentiment: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
² ECML/PKDD-2006: http://www.ecmlpkdd2006.org/challenge.html
Table 1. 10 real-world datasets: sizes of the classes and the size of feature dimensions
University (CCU). It is used to analyze the behavior of each email sender and then to detect spamming bots. Finally, the other datasets are benchmarks from the UCI repository.³ To evaluate the classification performance of the learning approaches, we randomly divide each dataset into 10 subsets, one of which is received at each round. In other words, one subset is used for initially training the classifier and for deciding the value of C0 in Eq. (4) by obtaining the highest classification accuracy on this first subset. The others are then received at each of the 9 subsequent rounds. The classification accuracy at each round is measured by the classification results of the classifier updated at the previous rounds. To reduce variability in the experimental results, we arrange 10 subset-round permutations on each dataset and average those 10 classification accuracies at each round.
First, these experiments demonstrate that, except on Diabetes (Fig. 2), the proposed IPA performs better than IPA with C0 = 0. That is, in addition to minimizing the classifier deviation, the correction of mistakes in Eq. (4) is effective when updating the previous classifier. On Diabetes, correcting the classifier's mistakes could not improve the classification accuracy on later samples; it seems that, on Diabetes, previous learning knowledge is not useful for later label prediction. Second, on Australian, Ionosphere, Bots, and 10+4 (Figs. 3-4), the online PA method cannot obtain remarkable classification performance because its update strategy is specific to each labeled sample; that is, the online PA method tends to be updated by inconsistent samples. Furthermore, except for the experimental results on Australian and Ionosphere in Fig. 3, our approach obtains the best (or similar) classification accuracy in comparison with the other approaches. We update the classifier by carefully analyzing the classifier adjustment induced by the labeled dataset. Remarkable classification accuracy is then obtained at each round
after the classifier is incrementally updated on most of the datasets.
³ UCI Repository: http://archive.ics.uci.edu/ml/
[Fig. 2-4: classification accuracy over rounds 1-9 for Online PA, IPA with C0 = 0, IPA, and the incremental batch SVM on the Diabetes, Australian, Ionosphere, and remaining datasets]
It is also shown
that our approach has the ability to adapt the diverse sample distribution for
classifiers because we obtain better performance in accuracy than the SVM ap-
proach of which support vectors are maintained as informative samples. Mention
to the performance on Australian and Ionosphere, it seems ambiguous or noise
samples exist so that the approaches (PA and IPA) to incrementally update the
classifier by one sample do not have impressive results. In this case, collected
samples in the set might be simultaneously used for updating classifiers, like the
incremental batch SVM, to filter out misleading or noise samples.
[Figures: classification accuracy over rounds 1-9 for Online PA, IPA with C0 = 0, IPA, and the incremental batch SVM on the Heart, Connectionist, BOOK, DVD, CYT+MEI, cp+im, German, and EMAIL datasets]
5 Conclusion
In this paper, we propose an efficient incremental learning approach to deal with the practical requirement of frequently updating classifiers. Our approach adjusts the classifier incrementally using one sample in each collected set. That is, the classifier is aggressively updated by adjusting more than the margin of a sample, and its prediction mistakes are corrected as much as possible. For each potential update step, we obtain a closed-form solution for the updated classifier by solving a simple constrained optimization problem. Finally, the selected classifier is the one with the fewest prediction errors on the collected dataset. Our experimental results show that, when updating a classifier, it is effective to correct its prediction mistakes in addition to minimizing the classifier deviation. They also show that our approach has the ability to adapt to a diverse sample distribution. Except for several datasets that contain misleading or noisy samples, the classifier incrementally adjusted by our approach achieves remarkable classification accuracy. Therefore, the proposed approach is suitable for effectively adjusting existing classifiers using periodically collected datasets.
References
1. Sena, G.G., Belzarena, P.: Early traffic classification using support vector machines.
In: 5th International Latin American Networking Conference, pp. 60–66. ACM,
New York (2009)
2. Robertson, W.K., Maggi, F., Kruegel, C., Vigna, G.: Effective Anomaly Detection
with Scarce Training Data. In: The Network and Distributed System Security
Symposium. ISOC (2010)
3. Du, H., Teng, S., Yang, M., Zhu, Q.: Intrusion Detection System Based on Improved
SVM Incremental Learning. In: International Conference on Artificial Intelligence
and Computational intelligence, pp. 23–28. IEEE Press (2009)
4. Utgoff, P.E.: Incremental Induction of Decision Trees. J. Machine Learning 4, 161–
186 (1989)
5. Mohamed, S., Rubin, D., Marwala, T.: Incremental Learning for Classification of
Protein Sequences. In: International Joint Conference on Neural Networks, pp.
19–24. IEEE Press (2007)
6. Chen, Z., Huang, L., Murphey, Y.L.: Incremental Learning for Text Document
Classification. In: International Joint Conference on Neural NetWorks, pp. 2592–
2597. IEEE Press (2007)
7. Ruping, S.: Incremental Learning with Support Vector Machines. In: International
Conference on Data Mining, pp. 641–642. IEEE Press (2001)
8. Xiao, R., Wang, J., Zhang, F.: An Approach to Incremental SVM Learning Algo-
rithm. In: International Conference on Tools with Artificial Intelligence, pp. 268–
273. IEEE Press (2000)
9. Cauwenberghs, G., Poggio, T.: Incremental and Decremental Support Vector Ma-
chine Learning. In: Neural Information Processing Systems, vol. 13. MIT Press,
Cambridge (2001)
10. Liu, Y., He, Q., Chen, Q.: Incremental Batch Learning with Support Vector Ma-
chines. In: 5th World Congress on Intelligent Control and Automation, pp. 1857–
1861. IEEE Press (2004)
11. Crammer, K., Dekel, O., Keshet, J., Shwartz, S.S., Singer, Y.: Online Passive-
Aggressive Algorithms. J. Machine Learning Research 7, 551–585 (2006)
12. Zhu, X.: Lazy Bagging for Classifying Imbalanced Data. In: 7th IEEE International
Conference on Data Mining, pp. 763–768 (2007)
13. Freund, Y., Schapire, R.E.: Large Margin Classification Using the Perceptron Al-
gorithm. J. Machine Learning 37, 277–296 (1999)
14. Ng, H.T., Goh, W.B., Low, K.L.: Feature selection, perceptron learning, and a us-
ability case study for text categorization. In: International Conference on Research
and Development in Information Retrieval, pp. 67–73. ACM, New York (1997)
15. Cesa-Bianchi, N., Conconi, A., Gentile, C.: A Second-Order Perceptron Algorithm.
J. Computing 34(3), 640–668 (2005)
16. Wang, S., San, Y., Wang, S.: An Online Modeling Method Based on Support
Vector Machine. In: International Conference on Computer Science and Software
Engineering, pp. 98–101. IEEE Press (2008)
17. Sculley, D., Wachman, G.M.: Relaxed Online SVMs for spam filtering. In: Inter-
national Conference on Research and Development in Information Retrieval, pp.
415–422. ACM, New York (2007)
18. Dredze, M., Crammer, K., Pereira, F.: Confidence-Weighted Linear Classification.
In: International Conference on Machine Learning, pp. 264–271. ACM, New York
(2008)
19. Crammer, K., Kulesza, A., Dredze, M.: Adaptive Regularization of Weight Vectors.
In: Neural Information Processing Systems. MIT Press, Cambridge (2009)
20. Lin, P., Yen, T., Fu, J., Yu, C.: Analyzing Anomalous Spamming Activities in a
Campus Network. In: TANET (2011)
Two-View Online Learning
1 Introduction
In applications where a large amount of data arrives in sequence, e.g., stock market prediction and email filtering, simple online learning algorithms such as the Perceptron [1], the second-order Perceptron [2], and the Passive Aggressive (PA) [4] algorithm can be easily deployed with reasonable performance and low computational cost.
For some domains, data may originate from several different sources, also
known as views. For example, a web page may have a content view comprising
text contained within it, a link view expressing its relationships to other web
pages, and a revision view that tracks the different changes that it has undergone.
When the various data sources are independent, running several instances of
the same algorithm on it and combining the output via an ensemble learning
framework works well. A simple concatenation of the two sources in a vector
space model could unnecessarily favor sources with larger number of dimensions.
On the other hand, training a separate model on each source fails to make good
use of the relationship among the sources, even for a baseline ensemble classifier.
To take advantage of data with multiple views, various methods such as SVM-2K [7] and alternatives [9] have been proposed. However, the two-view methods proposed so far utilize the support vector machine (SVM) [3], which is fundamentally a batch learning algorithm that cannot be easily tailored to work well on large-scale online streaming data.
One simple approach to extending the online learning model to handle two-view data is to train one model for each view independently and combine the classifier outputs, just as in classical ensemble learning. However, this approach ignores the relationship between the two views. Instead of following SVM-2K, where data in one view is used to improve the SVM performance [3] on another (single) view, we take advantage of the relationship between the two views to improve the combined performance. Specifically, we propose a novel online learning algorithm based on the PA algorithm, called Two-view Passive Aggressive (Two-view PA) learning. Our approach minimizes the difference between the two classifier outputs, but allows the outputs to differ as long as the weighted sum of the outputs leads to the correct result. In classical ensemble learning, the more diverse the classifiers, the better the combined performance. In a way, Two-view PA can be viewed as an ensemble of two online classifiers, except that the two views are jointly optimized.
2 Related Work
Online learning has been researched for more than 50 years. Back in 1962, Block
proposed the seminal Perceptron [1] algorithm, while Novikoff [11] later provided
theoretical findings, which started the first wave of Artificial Intelligence research
in the mid twentieth century. The Perceptron is known to be one of the fastest
online learning algorithms. However, its performance is still far from satisfactory
in practice. Recently in 2005, Cesa-Bianchi et al. [2] proposed the Second-order
Perceptron (SOP) algorithm, which takes advantage of second-order data to
improve the accuracy of the original Perceptron. Compared with Perceptron,
SOP works better in terms of accuracy but requires more time to train.
In 2006, Crammer et al. [4] proposed another Perceptron-based algorithm,
namely the Passive-Aggressive (PA) algorithm, which incorporates the margin-
maximizing criterion of modern machine learning algorithms. It not only performs
better than the SOP algorithm but also runs significantly faster. Algorithms that
improve upon the PA algorithm include Passive-Aggressive Mahalanobis [10],
the Confidence-Weighted (CW) linear classifier [6], and its latest version, multi-class
CW [5]. The CW algorithm updates its weights by minimizing the Kullback-Leibler
divergence between the new and old weights. However, similar to the SOP algorithm,
these algorithms are time consuming compared to the first-order PA.
The PA algorithm works better than the SOP in terms of both speed and
accuracy. However, it can only process one data stream at one time. On the
other hand, in batch learning, Farquhar et al. [7] proposed a large margin two-
view Support Vector Machine (SVM) [3] algorithm called the SVM-2K, which is
an extension of the well-known SVM algorithm. The two-view learning algorithm
was shown to give better performance compared to the original SVM on different
image datasets [7]. Thus, SVM-2K provides the inspiration for our current work.
76 T.T. Nguyen, K. Chang, and S.C. Hui
$$\tau_t = \frac{1 - y_t(\mathbf{w}_t \cdot \mathbf{x}_t)}{\|\mathbf{x}_t\|^2} \qquad \text{(PA)}$$
$$\tau_t = \min\left\{C,\ \frac{1 - y_t(\mathbf{w}_t \cdot \mathbf{x}_t)}{\|\mathbf{x}_t\|^2}\right\} \qquad \text{(PA-I)}$$
$$\tau_t = \frac{1 - y_t(\mathbf{w}_t \cdot \mathbf{x}_t)}{\|\mathbf{x}_t\|^2 + \frac{1}{2C}} \qquad \text{(PA-II)}$$
The performance of the soft-margin-based PA-I and PA-II algorithms is almost
identical, and both perform better than the hard-margin-based PA algorithm [4].
Therefore, in this work, our proposed algorithm is developed based on the PA-I
algorithm.
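As a minimal sketch (our own illustration, not code from the paper), the PA-I update for a single example could look as follows; the small constant guarding against a zero-norm input is our own addition:

```python
import numpy as np

def pa1_update(w, x, y, C=0.1):
    """One PA-I update: suffer the hinge loss, then take the clipped step tau_t."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))        # hinge loss on (x, y)
    tau = min(C, loss / (np.dot(x, x) + 1e-12))    # tau_t = min{C, loss / ||x||^2}
    return w + tau * y * x                         # passive if loss == 0 (tau == 0)
```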
For the two-view online learning setting, training data are triplets
$(\mathbf{x}_t^A, \mathbf{x}_t^B, y_t) \in \mathbb{R}^n \times \mathbb{R}^m \times \{-1, +1\}$ which arrive in sequence, where $\mathbf{x}_t^A \in \mathbb{R}^n$ is
the first view vector, $\mathbf{x}_t^B \in \mathbb{R}^m$ is the second view vector, and $y_t$ is their com-
mon label. The goal is to learn the coupled weights $(\mathbf{w}_t^A, \mathbf{w}_t^B)$ of a hybrid model
defined as follows:
$$f(\mathbf{x}_t^A, \mathbf{x}_t^B) = \operatorname{sign}\left(\tfrac{1}{2}\left(\mathbf{w}_t^A \cdot \mathbf{x}_t^A + \mathbf{w}_t^B \cdot \mathbf{x}_t^B\right)\right)$$
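A minimal sketch of this hybrid prediction (our own illustration with hypothetical variable names):

```python
import numpy as np

def predict_two_view(wA, wB, xA, xB):
    """Hybrid two-view prediction: sign of the averaged per-view scores."""
    score = 0.5 * (np.dot(wA, xA) + np.dot(wB, xB))
    return 1 if score >= 0 else -1
```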
where $|\cdot|$ denotes the absolute value. Here we use the L1-norm instead of the L2-norm
because it is harder to find a closed-form solution for the latter. In the next
section, we define an optimization problem based on the L1-norm relatedness
measure.
$$\left|y_t \mathbf{w}^A \cdot \mathbf{x}_t^A - y_t \mathbf{w}^B \cdot \mathbf{x}_t^B\right| = \max\left(y_t \mathbf{w}^A \cdot \mathbf{x}_t^A - y_t \mathbf{w}^B \cdot \mathbf{x}_t^B,\ y_t \mathbf{w}^B \cdot \mathbf{x}_t^B - y_t \mathbf{w}^A \cdot \mathbf{x}_t^A\right)$$
Letting $z = \left|y_t \mathbf{w}^A \cdot \mathbf{x}_t^A - y_t \mathbf{w}^B \cdot \mathbf{x}_t^B\right|$, the above optimization problem can be
expressed as follows.
$$(\mathbf{w}_{t+1}^A, \mathbf{w}_{t+1}^B) = \operatorname*{argmin}_{(\mathbf{w}^A, \mathbf{w}^B) \in \mathbb{R}^n \times \mathbb{R}^m} \ \tfrac{1}{2}\|\mathbf{w}^A - \mathbf{w}_t^A\|^2 + \tfrac{1}{2}\|\mathbf{w}^B - \mathbf{w}_t^B\|^2 + \gamma z + C\xi$$
$$\text{s.t.}\quad 1 - \tfrac{1}{2}\left(y_t \mathbf{w}^A \cdot \mathbf{x}_t^A + y_t \mathbf{w}^B \cdot \mathbf{x}_t^B\right) \le \xi; \qquad \xi \ge 0;$$
$$z \ge y_t \mathbf{w}^A \cdot \mathbf{x}_t^A - y_t \mathbf{w}^B \cdot \mathbf{x}_t^B; \qquad z \ge y_t \mathbf{w}^B \cdot \mathbf{x}_t^B - y_t \mathbf{w}^A \cdot \mathbf{x}_t^A.$$
$$\begin{aligned}
L &= \tfrac{1}{2}\|\mathbf{w}^A - \mathbf{w}_t^A\|^2 + \tfrac{1}{2}\|\mathbf{w}^B - \mathbf{w}_t^B\|^2 + \gamma z + C\xi \\
  &\quad + \tau\left(1 - \xi - \tfrac{1}{2}\left(y_t \mathbf{w}^A \cdot \mathbf{x}_t^A + y_t \mathbf{w}^B \cdot \mathbf{x}_t^B\right)\right) - \lambda\xi \\
  &\quad + \alpha\left(y_t \mathbf{w}^A \cdot \mathbf{x}_t^A - y_t \mathbf{w}^B \cdot \mathbf{x}_t^B - z\right) + \beta\left(y_t \mathbf{w}^B \cdot \mathbf{x}_t^B - y_t \mathbf{w}^A \cdot \mathbf{x}_t^A - z\right) \qquad (4) \\
  &= \tfrac{1}{2}\|\mathbf{w}^A - \mathbf{w}_t^A\|^2 + \tfrac{1}{2}\|\mathbf{w}^B - \mathbf{w}_t^B\|^2 + (\gamma - \alpha - \beta)z + (C - \lambda - \tau)\xi \\
  &\quad + \left(\alpha - \beta - \tfrac{1}{2}\tau\right) y_t \mathbf{w}^A \cdot \mathbf{x}_t^A + \left(\beta - \alpha - \tfrac{1}{2}\tau\right) y_t \mathbf{w}^B \cdot \mathbf{x}_t^B + \tau
\end{aligned}$$
where $\alpha$, $\beta$, $\tau$, and $\lambda$ are positive Lagrange multipliers.
Setting the partial derivative of $L$ with respect to the weight $\mathbf{w}^A$ to zero, we
have
$$0 = \frac{\partial L}{\partial \mathbf{w}^A} = \mathbf{w}^A - \mathbf{w}_t^A + \left(\alpha - \beta - \tfrac{1}{2}\tau\right) y_t \mathbf{x}_t^A \;\Rightarrow\; \mathbf{w}^A = \mathbf{w}_t^A - \left(\alpha - \beta - \tfrac{1}{2}\tau\right) y_t \mathbf{x}_t^A \qquad (5)$$
Similarly, for the other view we have
$$\mathbf{w}^B = \mathbf{w}_t^B - \left(\beta - \alpha - \tfrac{1}{2}\tau\right) y_t \mathbf{x}_t^B \qquad (6)$$
Setting the partial derivative of $L$ with respect to $z$ to zero, we have
$$0 = \frac{\partial L}{\partial z} = \gamma - \alpha - \beta \;\Rightarrow\; \alpha + \beta = \gamma \qquad (7)$$
Setting the partial derivative of $L$ with respect to $\xi$ to zero, we have
$$0 = \frac{\partial L}{\partial \xi} = C - \lambda - \tau \;\Rightarrow\; \lambda + \tau = C \qquad (8)$$
where the loss $\ell_t = 1 - \tfrac{1}{2}\left(y_t \mathbf{w}_t^A \cdot \mathbf{x}_t^A + y_t \mathbf{w}_t^B \cdot \mathbf{x}_t^B\right)$. For the sake of simplicity, we
denote
$$a = \frac{2}{\|\mathbf{x}_t^A\|^2 + \|\mathbf{x}_t^B\|^2} \quad\text{and}\quad b = \|\mathbf{x}_t^A\|^2\,\|\mathbf{x}_t^B\|^2 \qquad (10)$$
which gives
$$\tau = \frac{a}{2}\left[(\alpha - \beta)\left(\|\mathbf{x}_t^A\|^2 - \|\mathbf{x}_t^B\|^2\right) + 2\ell_t\right] \qquad (12)$$
Setting the partial derivative of $L$ with respect to $\alpha$ to zero yields a condition
involving $\alpha - \beta$, $a$, $b$, $\|\mathbf{x}_t^A\|^2$, $\|\mathbf{x}_t^B\|^2$ and the per-view losses,
where $\ell_t^A = 1 - y_t \mathbf{w}_t^A \cdot \mathbf{x}_t^A$ and $\ell_t^B = 1 - y_t \mathbf{w}_t^B \cdot \mathbf{x}_t^B$. We also have $\alpha + \beta = \gamma$.
Therefore, we can conclude that
$$\alpha = \frac{\gamma}{2} + \frac{1}{2}\,\frac{\ell_t^B - \ell_t^A}{\|\mathbf{x}_t^A\|^2 + \|\mathbf{x}_t^B\|^2} \qquad (13)$$
Similarly, we have
$$\beta = \frac{\gamma}{2} - \frac{1}{2}\,\frac{\ell_t^B - \ell_t^A}{\|\mathbf{x}_t^A\|^2 + \|\mathbf{x}_t^B\|^2} \qquad (14)$$
α ← max(0, min{γ, the value given by Eq. (13)})
β ← max(0, min{γ, the value given by Eq. (14)})
$$\tau_t \leftarrow \min\left\{C,\ \frac{(\alpha - \beta)\left(\|\mathbf{x}_t^A\|^2 - \|\mathbf{x}_t^B\|^2\right) + 2\ell_t}{\|\mathbf{x}_t^A\|^2 + \|\mathbf{x}_t^B\|^2}\right\}$$
Update
$$\mathbf{w}_{t+1}^A \leftarrow \mathbf{w}_t^A - \left(\alpha - \beta - \tfrac{1}{2}\tau_t\right) y_t \mathbf{x}_t^A$$
$$\mathbf{w}_{t+1}^B \leftarrow \mathbf{w}_t^B - \left(\beta - \alpha - \tfrac{1}{2}\tau_t\right) y_t \mathbf{x}_t^B$$
end
end
4 Performance Evaluation
In this section, we evaluate the online classification performance of our proposed
Two-view PA on three benchmark datasets: Ads [8], Product Review [9], and
WebKB [12]. The single-view PA algorithm serves as the baseline. We use a
different PA model for each view, naming them PA View 1 and PA View 2. We
also concatenate the input feature vectors from the two views to form a larger
feature set, and report the results; we denote this alternative approach as PA Cat.
The dataset summary statistics are shown in Table 1. We note that the Ads and
WebKB datasets are very imbalanced, which led us to use the F-measure instead
of accuracy to evaluate classification performance. For a fair comparison, we set
C = 0.1 and γ = 0.5 for all PA algorithms. All experiments were conducted
using 5-fold cross-validation.
The Ads dataset was first used by Kushmerick [8] to automatically filter ad-
vertisement images from web pages. The Ads dataset comprises more than two
views. In this experiment, we only use four views including image URL view,
destination URL view, base URL view, and alt view. The first and second orig-
inal views were concatenated as View 1 and the remaining two original views
were concatenated as View 2. This dataset has 3279 examples, including 459
positive examples (ads), with the remaining as negative examples (non-ads).
[Figure: log(view diff) versus the number of examples for PA Cat and Two-view PA.]
The experimental results on the Ads dataset are shown in Table 2, where the
F-measure of the proposed algorithm is the best. The Two-view PA performed up
to 2% better than the runner-up, PA View 1. As previously discussed, PA View
1 is better than PA Cat since the two views have quite different classification
outputs.
The Product Review dataset was crawled from popular online Chinese cell-phone
forums [9]. The dataset has 1000 true reviews and 1000 spam reviews. It consists
of two sets of features: one based on review content (lexical view) and the other
based on extracted characteristics of the review sentences (formal view).
The experimental results on this dataset are shown in Table 2. Again, Two-
view PA performs better than the other algorithms. The improvement is more
than 2% compared with the runner-up. In this dataset, PA Cat performed better
than either view alone. This is expected since the view difference between the
two views is quite small, as shown in Figure 2.
Moreover, PA Cat is only 0.41% better than the best individual PA View 1.
This is because PA Cat does not take into account the view relatedness informa-
tion. The best performer here is the Two-view PA, which beats the runner-up
by almost 2%.
The WebKB course dataset has been frequently used in the empirical study of
multi-view learning. It comprises 1051 web pages collected from the computer
science departments of four universities. Each page has a class label, course or
non-course. The two views of each page are the textual content of a web page
(page view ) and the words that occur in the hyperlinks of other web pages point-
ing to it (link view ), respectively. We used a processed version of the WebKB
course dataset [12] in our experiment.
[Figure: log(view diff) versus the number of examples for PA Cat and Two-view PA.]
The performance of PA Cat here is also better than the best single view PA.
However, the view difference of Two-view PA is much smaller than that of the
PA algorithm as shown in Figure 3. Hence, Two-view PA performed more than
3% better than PA Cat, and 5% better than the best individual view PA.
Compared to the Ads and Product Review datasets, the view difference on
the WebKB dataset is the smallest. This means that the two views can reasonably
be combined into a single view. Accordingly, PA Cat improves by more than 2%
over the individual-view PA on the WebKB dataset.
[Figure: log(view diff) versus the number of examples for PA Cat and Two-view PA.]
In this paper, we proposed a hybrid two-view Passive-Aggressive algorithm,
which is able to take advantage of multiple views of the data to improve overall
classification performance. We formulated the learning framework as an
optimization problem and derived a closed-form solution.
There remain some interesting open problems that warrant further investiga-
tion. For one, at each round we could adjust the weight of each view so that the
better view dominates. In the worst case where the two views are completely
related or co-linear, e.g., view 1 is equal to view 2, our Two-view PA degener-
ates nicely into a single view PA. We would also like to extend Two-view PA to
handle multiple views and multiple classes. Formulating a multi-view PA is non-
trivial, as it involves defining multi-view relatedness and minimizing (V choose
2) view agreements, for a V-view problem. Formulating a multi-class Two-view
PA should be more feasible.
References
1. Block, H.: The perceptron: A model for brain functioning. Rev. Modern Phys. 34,
123–135 (1962)
2. Cesa-Bianchi, N., Conconi, A., Gentile, C.: A second-order perceptron algorithm.
SIAM Journal on Computing 34 (2005)
3. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20, 273–297
(1995)
4. Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., Singer, Y.: Online passive-
aggressive algorithms. Journal of Machine Learning Research, 551–585 (2006)
5. Crammer, K., Dredze, M., Kulesza, A.: Multi-class confidence weighted algorithms.
In: Proceedings of the 2009 Conference on Empirical Methods in Natural Lan-
guage Processing, pp. 496–504. Association for Computational Linguistics, Singa-
pore (2009)
6. Dredze, M., Crammer, K., Pereira, F.: Confidence-weighted linear classification. In:
ICML 2008: Proceedings of the 25th International Conference on Machine Learn-
ing, pp. 264–271. ACM, New York (2008)
7. Farquhar, J.D.R., Hardoon, D.R., Meng, H., Shawe-Taylor, J., Szedmák, S.: Two
view learning: Svm-2k, theory and practice. In: Proceedings of NIPS 2005 (2005)
8. Kushmerick, N.: Learning to remove internet advertisements. In: Proceedings of the
Third Annual Conference on Autonomous Agents, AGENTS 1999, pp. 175–181.
ACM, New York (1999)
9. Li, G., Hoi, S.C.H., Chang, K.: Two-view transductive support vector machines.
In: Proceedings of SDM 2010, pp. 235–244 (2010)
10. Nguyen, T.T., Chang, K., Hui, S.C.: Distribution-aware online classifiers. In:
Walsh, T. (ed.) IJCAI, pp. 1427–1432. IJCAI/AAAI (2011)
11. Novikoff, A.: On convergence proofs of perceptrons. In: Proceedings of the Sympo-
sium on the Mathematical Theory of Automata, vol. 7, pp. 615–622 (1962)
12. Sindhwani, V., Niyogi, P., Belkin, M.: Beyond the point cloud: from transductive
to semi-supervised learning. In: Proceedings of the 22nd International Conference
on Machine Learning, ICML 2005, pp. 824–831. ACM, New York (2005)
A Generic Classifier-Ensemble Approach for Biomedical
Named Entity Recognition
1 Introduction
With the wide application of information technology in the biomedical field, biomedical
technology has developed very rapidly. This in turn produces large amounts of biomed-
ical data, such as human gene banks. Consequently, the biomedical literature available from
the Web has experienced unprecedented growth over the past few years; the amount
of literature in MEDLINE grows by nearly 400,000 citations each year. To mine infor-
mation from biomedical databases, a helpful pre-processing step is to
extract the valuable biomedical named entities. In other words, this step needs to identify
names in scientific text, which is not structured like traditional databases, and to clas-
sify these different names. As a result, biomedical named entity recognition (BioNER)
has become one of the most important issues in automatic text extraction systems. Many
popular classification algorithms have been applied to this bioNER problem. These
algorithms include Support Vector Machine (SVM) [1,18,19], Conditional Random
Fields (CRFs) [3], the Hidden Markov Model (HMM) [5], the Maximum Entropy (ME)
[15], decision trees [16], and so on. While successful, each classifier has its own short-
comings, and none of them consistently performs well over all datasets.
To overcome the shortcomings of individual methods, ensemble methods have been sug-
gested as a promising alternative.
Ensemble methods are more attractive than individual classification algorithms because
they are an effective way to improve the prediction accuracy of a single classification
algorithm. An ensemble of classifiers is a set of classifiers whose individual decisions
are combined in some way (typically by weighted or unweighted voting) to classify new
examples [8,11]. One of the most active areas of research in supervised learning has
been the study of methods for constructing good ensembles of classifiers. The most impor-
tant property of a successful ensemble is that its individual classifiers have error
rates below 0.5 on the sample data and that their errors are, at least to some extent,
uncorrelated. That is, a necessary and sufficient condition for an ensemble of classi-
fiers to outperform its individual members is that the classifiers are accurate and diverse. Several
recent studies indicate that ensemble learning can improve the performance of a
single classifier in many real-world text classification tasks [6,7,9,10,12,13,14,23,24].
In this paper, we propose a generic genetic classifier-ensemble approach, which em-
ploys a multi-objective genetic algorithm and SVM-based classifiers to construct an en-
semble classifier. Each SVM-based classifier is trained on a different feature subset,
and the resulting classifiers form the classification committee. The rest of the paper is organized as follows:
Section 2 discusses the generic genetic classifier-ensemble approach in detail. Experi-
mental results and analysis are provided in Section 3. Conclusions and future work are
presented in Section 4.
Feature          Value
words            all words in the training data
orthographic     capital, symbol, etc. (see Table 2)
prefix           1-, 2-, and 3-grams of the starting letters of the word
suffix           1-, 2-, and 3-grams of the ending letters of the word
lexical          POS tags, base phrase classes, and base noun phrase chunks
preceding class  -4, -3, -2, -1
surface word     simple surface word lists, name aliases and trigger words
1
http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
Parameter             Value
kernel                polynomial
degree of kernel      1, 2, 3
direction of parsing  forward, backward
window position       9 words (positions -4, -3, -2, -1, 0, +1, +2, +3, +4)
multi-class           pair-wise
Next, since support vector machines (SVMs) are powerful methods for learning a
classifier and have been applied successfully to many NLP tasks, SVMs are used to
construct the base classifiers for BioNER. The general-purpose text chunker named Yet
Another Multipurpose Chunk Annotator (Yamcha2) uses TinySVM3 for learning the clas-
sifiers. Yamcha is utilized to transform the input data into feature vectors usable by
TinySVM [18,19]. Table 3 shows the Yamcha parameters. Accordingly, each classifier
is unique in at least one of the following properties: window size, degree of the poly-
nomial kernel, parsing direction, and feature set. Consequently, this yields a committee of 46
individual SVM classifiers [17,20,21].
problem with weights. We define the fitness of a chromosome as the full-object F-score
provided by the weighted-majority-voting decision combination rule [12,17,22].
In this rule, the class receiving the maximum combined score is selected as the joint
decision. The combined score of a particular class is defined as
$$f(c_i) = \sum_{m=1}^{M} F_m \cdot w(m, i)$$
where $M$ denotes the total number of classifiers and $F_m$ denotes the full-object F-score
of the $m$th classifier. $w(m, i)$ is the weight value in the gene of the $i$th class of the $m$th
classifier in the chromosome.
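A minimal sketch of this rule (our own illustration, not the authors' code; restricting the sum to the classifiers that vote for class i follows the usual weighted-majority-voting reading, and the array layout is an assumption):

```python
import numpy as np

def combined_decision(F, W, votes):
    """Pick the class with the maximum combined score f(c_i) = sum_m F[m] * W[m, i].

    F     : F-scores of the M classifiers, shape (M,)
    W     : chromosome weights, shape (M, n_classes)
    votes : class index predicted by each classifier, shape (M,)"""
    n_classes = W.shape[1]
    scores = np.zeros(n_classes)
    for m, c in enumerate(votes):
        scores[c] += F[m] * W[m, c]   # each classifier contributes to the class it votes for
    return int(np.argmax(scores))
```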
The third step is to choose the genetic operators: crossover and mutation. A crossover
operator takes two parent chromosomes and creates two children with genetic material
from both parents. In the proposed approach, either the uniform or the two-point crossover
method is randomly selected with equal probability, and the selected operator is applied
with probability pcross to generate two offspring. A mutation operator randomly se-
lects a gene in the offspring chromosomes with probability pmut and adds a small ran-
dom number within the range [0,1] to each weight in the gene. In addition, we still need
to specify the tournament size, elitism, population size and the number of generations.
Tournament size is used in tournament selection during reproduction. Elitism is ap-
plied at the end of each iteration, where the best elit_size% of the original population
replace the lowest-fitness offspring.
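The operators could be sketched as follows (our own illustration; the chromosome is assumed to be encoded as a flat vector of weights, and the function names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def crossover(p1, p2, pcross=0.9):
    """Uniform or two-point crossover, chosen with equal probability."""
    c1, c2 = p1.copy(), p2.copy()
    if rng.random() < pcross:
        if rng.random() < 0.5:                       # uniform crossover
            mask = rng.random(p1.size) < 0.5
            c1[mask], c2[mask] = p2[mask], p1[mask]
        else:                                        # two-point crossover
            i, j = sorted(rng.choice(p1.size, size=2, replace=False))
            c1[i:j], c2[i:j] = p2[i:j].copy(), p1[i:j].copy()
    return c1, c2

def mutate(chrom, pmut=0.05):
    """Add a small random number in [0, 1] to randomly selected genes."""
    child = chrom.copy()
    for g in range(child.size):
        if rng.random() < pmut:
            child[g] += rng.random()
    return child
```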
value of a gene is wm, this means that the contributing degree of the mth classifier in
this ensemble is wm. Accordingly, the combined score of a given class can be redefined
as:
$$f(c_i) = \sum_{m=1}^{M} F_m \cdot w_m$$
At the same time, all parameters of the algorithm described above, including popu-
lation size, number of generations, and crossover and mutation rates, are kept the
same.
through three-fold cross-validation on the training data. In our proposed algorithm, the
training data is initially partitioned into three parts. Each classifier is trained using two
parts and then tested with the remaining part. This procedure is repeated three times and
the whole set of training data is used for computing the best-fitting solution. Multi-class
SVM is used for all individual classifiers. The major differences among the individual
classifiers are in their modeling parameter values and feature sets. Each classifier is
different from the rest in at least one modeling parameter or the feature set. During
testing, the outputs of the individual classifiers are combined by using the computed
best-fitting solution of weight classifier-ensemble.
of different combination within feature set and Yamcha parameter. The experimental
performance is evaluated by the standard measures, namely precision, recall and F-
score which is the harmonic mean of precision and recall.
Table 5. The performances of different biomedical named entities on three genetic classifier-
ensemble schemes
Table 6. The comparison with individual best SVM classifier and Vote-based SVM-classifier
selection for bioNER task
Table 7. The comparison with other different individual classifier algorithms on bioNER task
pre-processing and post-processing. For instance, Zhou and Su [1] used name alias
resolution, cascaded entity name resolution, abbreviation resolution and an open dic-
tionary (around 700,000 entries). Finkel et al. used gazetteers and web-querying [2].
Settles used 17 lexicons that include Greek letters, amino acids, and so forth [3]. In
contrast, our system does not include any such processing.
effective features and more classifiers using different machine learning algorithms in
our ensemble approach, and include some post-processing techniques and comparison
of computational cost.
References
1. Zhou, G., Su, J.: Exploring Deep Knowledge Resources in Biomedical Name Recognition.
In: Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and
its Applications (JNLPBA 2004), pp. 70–75 (2004)
2. Finkel, J., Dingare, S., Nguyen, H., Nissim, M., Sinclair, G., Manning, C.: Exploiting Context
for Biomedical Entity Recognition: From Syntax to the Web. In: Proceedings of the Joint
Workshop on Natural Language Processing in Biomedicine and its Applications, JNLPBA
2004 (2004)
3. Settles, B.: Biomedical Named Entity Recognition Using Conditional Random Fields and
Novel Feature Sets. In: Proceedings of the Joint Workshop on Natural Language Processing
in Biomedicine and its Applications (JNLPBA 2004), pp. 104–107 (2004)
4. Song, Y., Kim, E., Lee, G.-G., Yi, B.-K.: POSBIOTM-NER in the shared task of
BioNLP/NLPBA 2004. In: Proceedings of the Joint Workshop on Natural Language Pro-
cessing in Biomedicine and its Applications, JNLPBA 2004 (2004)
5. Zhao, S.: Name Entity Recognition in Biomedical Text using a HMM model. In: Proceedings
of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications
(JNLPBA 2004), pp. 84–87 (2004)
6. Zhang, Z., Yang, P.: An ensemble of classifiers with genetic algorithm based feature selec-
tion. IEEE Intelligent Informatics Bulletin 9, 18–24 (2008)
7. Yang, P., Zhang, Z., Zhou, B.B., Zomaya, A.Y.: Sample Subset Optimization for Classifying
Imbalanced Biological Data. In: Huang, J.Z., Cao, L., Srivastava, J. (eds.) PAKDD 2011, Part
II. LNCS(LNAI), vol. 6635, pp. 333–344. Springer, Heidelberg (2011)
8. Yang, P., Yang, Y.-H., Zhou, B.B., Zomaya, A.Y.: A review of ensemble methods in bioin-
formatics. Current Bioinformatics 5, 296–308 (2010)
9. Yang, P., Ho, J.W.K., Zomaya, A.Y., Zhou, B.B.: A genetic ensemble approach for gene-gene
interaction identification. BMC Bioinformatics 11, 524 (2010)
10. Kuncheva, L.I., Jain, L.C.: Designing classifier fusion systems by genetic algorithms. IEEE
Transaction on Evolutionary Computation 4(4) (September 2000)
11. Dietterich, T.G.: Ensemble Methods in Machine Learning. In: Kittler, J., Roli, F. (eds.) MCS
2000. LNCS, vol. 1857, pp. 1–5. Springer, Heidelberg (2000)
12. Ruta, D., Gabrys, B.: Application of the Evolutionary Algorithms for Classifier Selection in
Multiple Classifier Systems with Majority Voting. In: Kittler, J., Roli, F. (eds.) MCS 2001.
LNCS, vol. 2096, pp. 399–408. Springer, Heidelberg (2001)
13. Larkey, L.S., Croft, W.B.: Combining classifier in text categorization. In: SIGIR 1996, pp.
289–297 (1996)
14. Patrick, J., Wang, Y.: Biomedical Named Entity Recognition System. In: Proceedings of the
10th Australasian Document Computing Symposium (2005)
15. Tsai, T.-H., Wu, C.-W., Hsu, W.-L.: Using Maximum Entropy to Extract Biomedical Named
Entities without Dictionaries. In: JNLPBA 2006, pp. 268–273 (2006)
16. Chan, S.-K., Lam, W., Yu, X.: A Cascaded Approach to Biomedical Named Entity Recogni-
tion Using a Unified Model. In: The 7th IEEE International Conference on Data Mining, pp.
93–102
17. Dimililer, N., Varoğlu, E.: Recognizing Biomedical Named Entities Using SVMs: Improving
Recognition Performance with a Minimal Set of Features. In: Bremer, E.G., Hakenberg, J.,
Han, E.-H(S.), Berrar, D., Dubitzky, W. (eds.) KDLL 2006. LNCS (LNBI), vol. 3886, pp.
53–67. Springer, Heidelberg (2006)
18. Kazamay, J.-I., Makinoz, T., Ohta, Y., Tsujiiy, J.-I.: Tuning Support Vector Machines for
Biomedical Named Entity Recognition. In: ACL NLP, pp. 1–8 (2002)
19. Mitsumori, T., Fation, S., Murata, M., Doi, K., Doi, H.: Gene/protein name recognition based
on support vector machine using dictionary as features. BMC Bioinformatics 6(suppl. 1)
(2005)
20. Dimililer, N., Varoğlu, E., Altınçay, H.: Vote-Based Classifier Selection for Biomedical NER
Using Genetic Algorithms. In: Martı́, J., Benedı́, J.M., Mendonça, A.M., Serrat, J. (eds.)
IbPRIA 2007, Part II. LNCS, vol. 4478, pp. 202–209. Springer, Heidelberg (2007)
21. Dimililer, N., Varoğlu, E., Altınçay, H.: Classifier subset selection for biomedical named
entity recognition. Appl. Intell., 267–282 (2009)
22. Ruta, D., Gabrys, B.: Classifier selection for majority voting. Inf. Fusion 1, 63–81 (2005)
23. Yang, T., Kecman, V., Cao, L., Zhang, C., Huang, J.Z.: Margin-based ensemble classifier for
protein fold recognition. Expert Syst. Appl. 38(10), 12348–12355 (2011)
24. Zhang, P., Zhu, X., Shi, Y., Wu, X.: An Aggregate Ensemble for Mining Concept Drifting
Data Streams with Noise. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.)
PAKDD 2009. LNCS, vol. 5476, pp. 1021–1029. Springer, Heidelberg (2009)
25. Holland, J.H.: Adaptation in Natural and Artificial Systems: An Introductory Analysis with
Applications to Biology, Control, and Artificial Intelligence. MIT Press (1998) ISBN 0-262-58111-6
26. Kim, J.-D., Ohta, T., Tsuruoka, Y., Tateisi, Y., Collier, N.: Introduction to the Bio-Entity
Recognition Task at JNLPBA. In: Proceedings of the International Workshop on Natural
Language Processing in Biomedicine and its Applications (JNLPBA 2004), pp. 70–75 (2004)
Neighborhood Random Classification
1 Introduction
Ensemble methods (EMs) have proved their efficiency in data mining, especially
in supervised machine learning (ML). An EM generates a set of classifiers using
one or several machine learning algorithms (MLA) and aggregates them into a
single classifier (meta-classifier, MC) using, for example, a majority rule vote.
Many papers [3,18,2,14] have shown that a set of classifiers produces a better
prediction than the best among them, regardless of the MLA used. Theoretical
and experimental results have encouraged the implementation of EM techniques
in many fields of application such as physics [6], face recognition [17], ecology [12],
recommender systems [9] and many others too numerous to mention here. The
efficiency of EMs lies in the fact that aggregating different and independent
classifiers reduces the bias and the variance of the MC [8,1,5,3], which are two
key concepts for effective classifiers.
Instance-based (IB) MLAs such as k-Nearest Neighbors (kNN) are very popu-
lar because of their straightforwardness. To implement them, it is simply neces-
sary to define a dissimilarity measure on the set of observations and fix the value
of k. Thus, using the kNN principle as an EM algorithm is immediate. How-
ever, handling the parameter k can be difficult for some users. To simplify this
problem, we can use approaches based on neighborhood graphs as alternatives.
For example, Relative Neighborhood Graphs (RNG) or Gabriel Graphs (GG)
are “good” candidates. Like kNN, for an unlabeled observation, the classifier
based on neighborhood graphs assigns a label according to the labels in the
neighborhood. As an example, we can simply use the majority rule vote in the
neighborhood of the unlabeled observation. While there have been many studies
using kNN in the context of EM, we did not find any study that assesses the ad-
vantages of such neighborhood graphs, based more particularly on RNGs, in EM
approaches. In this paper, we propose an EM approach based on neighborhood
graphs. We provide comparisons with many EM approaches based on kSVM,
Decision Tree (Random Forest), kNN, etc. We carried out our experiments on
the R platform.
This paper is organized as follows. In Section 2, we introduce and recall certain
notations and definitions. In Section 3, we introduce the EMs based on neigh-
borhoods; besides the classic kNN neighborhood, we present the RNG and GG
neighborhoods. Section 4 is devoted to evaluations and comparisons. Section 5
provides the main conclusions of this study.
2 Basic Concepts
2.1 Notations
$X : \Omega \to \mathbb{R}^p,\ \omega \mapsto \left(X^j(\omega)\right)_{j=1,\dots,p}$
$Y : \Omega \to K,\ \omega \mapsto y$
[Table and figure: an example learning set $E_l$ with two features $X^1$, $X^2$ and a class label $Y$; the individuals $\omega_i$ are plotted in the plane.]
By thresholding at the maximum value of this vector, the membership class
can be represented by a vector of zeros except for a value of 1 at the rank of the
most likely class: P̂ = (0, . . . , 0, 1, 0, . . . , 0). If the classifier φ is
considered reasonably reliable, then the predicted membership class
for an individual ω is φ(X(ω)) ∈ {y1 , . . . , yk , . . .}.
There are many types of neighborhood that can be used to build a classifier.
Among the most well known are:
– The neighbors in random spaces. For example, we can cite the weak models
approach [7] where neighbors are obtained after a random projection along
axes.
– The neighbors in the sense of a specific property. For example, Gabriel
Graph (GG) neighbors are given by the subset of individuals of the learning
sample that fulfill a certain condition. Likewise, we can define the relative
neighbors (RN), the minimum spanning tree’s (MST) neighbors or the De-
launay’s polyhedron neighbors and so forth [13];
1. Neighborhood set P: the set of all subsets of $E_l$. This is the set of all
possible neighbors to which each individual will be connected.
2. The neighborhood function V: this defines the way in which an individual
is linked to an element of the neighborhood set:
$$\mathcal{V} : \mathbb{R}^p \to P,\quad X \mapsto v = \mathcal{V}(X)$$
$$\mathcal{C} : \mathbb{R}^p \times P \to S_K,\quad (X, v) \mapsto \Pi_v(X) = (p_1, p_2, \dots, p_K)$$
where $S_K = \left\{(p_1, \dots, p_K) \in [0,1]^K \ \text{s.t.}\ \textstyle\sum_k p_k = 1\right\}$
All these geometric structures induce a related neighborhood graph with a sym-
metric neighborhood relationship. Figures 1 and 2 show the neighborhood structures
of the relative neighborhood graph and the Gabriel graph, using the dataset intro-
duced above (cf. Section 2.1).
[Figures 1 and 2: the relative neighborhood graph (with the lunula defining the neighborhood relation highlighted) and the Gabriel graph built on the example learning set.]
Following these steps, the RNC then aggregates the M predicted values related
to an unclassified individual to determine its final membership class. The two
key points in this procedure are the sampling procedure used to generate the M
classifiers and the procedure used to combine the M predictions. Below, we provide
some details of these two key points, followed by a short sketch of the simplest
combination rules:
– Vote of classifiers, which aggregate the responses of each classifier and nor-
malize them. The majority rule vote is a particular example of this.
– Average vector where the score for each class is the mean of the answers for
all the classifiers.
– Weighted version (majority or mean)
– Maximum Likelihood calculated as the product of the answers for all the
classifiers, for each class. The winning class is the one that has the highest
value.
– Naive Bayes [15].
– Decision Templates [15]. This method is based on the concept of a decision
template, which is the average vector over the individuals of a test sample
belonging to each class, and a decision profile, which is the set of responses of
all classifiers. The membership class is determined according to the Euclidean
distance between the decision profile and the decision template. The winning
class is the one that minimizes this distance.
– Linear regression. In this method, we assume that the probability of a class
is the linear combination of the probabilities of class for each classifier.
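As an illustration only (not code from the paper), the first two rules above can be sketched as follows, taking the matrix P of the M classifiers' probability vectors as input:

```python
import numpy as np

def combine_average(P):
    """Average-vector rule: P has shape (M, K); the score of each class is the mean answer."""
    return int(np.argmax(P.mean(axis=0)))

def combine_majority(P):
    """Majority vote: each classifier votes for its most likely class."""
    votes = np.argmax(P, axis=1)
    return int(np.bincount(votes).argmax())
```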
4 Evaluation
To assess the performance of RNC, we carried out many experiments on differ-
ent data sets taken from the UCI (Irvine) repository. For this, we made a number
of distinctions depending on the type of neighborhood used. As our work was
motivated by the absence of studies on EMs based on geometrical graphs such as
RNGs, we designed two separate experiments for RNC: one based on RNGs
and the other on kNN with k = 1, 2, 3. The comparison was also extended
to random forests (RFs), K support vector machines (KSVMs), Adaboost, dis-
criminant analysis (DA), logistic regression (RegLog) and C4.5. All experiments
were carried out using the R software.
$$\phi(\omega) = (p_1, \dots, p_K) \quad\text{such that}\quad p_i = \frac{\#\left(\{\omega' \in \mathcal{V}(\omega)\ \text{s.t.}\ Y(\omega') = i\}\right)}{\#\left(\mathcal{V}(\omega)\right)}$$
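A minimal sketch of this neighborhood vote (our own illustration; the neighborhood is assumed to be given as the array of its members' labels):

```python
import numpy as np

def neighborhood_probabilities(neighbor_labels, n_classes):
    """p_i = fraction of neighbors whose label is class i."""
    counts = np.bincount(neighbor_labels, minlength=n_classes)
    return counts / max(len(neighbor_labels), 1)
```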
C-svc. For DA and RegLog, we used the lda and glm functions of the R MASS
library and for C4.5, the J48 function of the RWeka library using the control of
the Weka learner.
For kNN, we simply replaced the neighborhood graph with the k-nearest
neighbors in the RNC algorithm using the same distances and the same number
of classifiers. The aggregation method is majority voting. Three values of k were
tested : k = 1, 2 or 3.
All these issues are currently being studied and should produce significant im-
provements for RNCs based on geometrical graphs.
This means computing the distance d(ω1, ω2) for each possible pair (O(p²)
each for the Mahalanobis distance) and comparing the distance d(ω1, ω2) with the
distances d(ω1, ω) and d(ω2, ω) for every other individual ω.
However, several optimizations can be carried out (a sketch follows the list):
– Distances can be computed using a matrix representation and a powerful linear
algebra library (BLAS).
– Using the RNG condition, it is only necessary to test condition (1) for those ω such that
d(ω1, ω) ≤ d(ω1, ω2).
– For a given individual ω1, we sort all distances d(ω1, ω) in increasing order
and test condition (1) following this order. If the distance d(ω1, ω2)
is large compared to the others, the edge between ω1 and ω2 is rejected faster.
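A sketch of the RNG edge test with these optimizations (our own illustration; the pairwise distance matrix D is assumed to be precomputed):

```python
import numpy as np

def rng_edges(D):
    """Build the Relative Neighborhood Graph from a distance matrix D.

    An edge (i, j) is kept iff no third point k satisfies
    max(D[i, k], D[j, k]) < D[i, j], i.e., no k lies in the lunula of (i, j)."""
    n = D.shape[0]
    edges = []
    for i in range(n):
        order = np.argsort(D[i])          # sort once; allows early rejection (see text)
        for j in range(i + 1, n):
            dij = D[i, j]
            blocked = False
            for k in order:
                if k == i or k == j:
                    continue
                if D[i, k] >= dij:
                    break                 # remaining k are even farther from i
                if D[j, k] < dij:         # k is inside the lunula of (i, j)
                    blocked = True
                    break
            if not blocked:
                edges.append((i, j))
    return edges
```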
In practice, computing the results using RNG graphs is several times slower than
using k-NN. For example, with k = 3 and the Twonorm data set (almost 2000
individuals), it takes 40 s for the k-NN method and 1 min 44 s for the RNG method
to compute N = 100 classifiers. These tests used the ATLAS BLAS library on an
Intel Core i5 2.60 GHz computer with 4 GB of memory.
References
1. Breiman, L.: Bias, variance, and arcing classifiers. Statistics (1996)
2. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
3. Brown, G., Wyatt, J., Harris, R., Yao, X.: Diversity creation methods: a survey
and categorisation. Information Fusion 6(1), 5–20 (2005)
4. Demsar, J.: Statistical comparisons of classifiers over multiple data sets (2006)
5. Domingos, P.: A unified bias-variance decomposition and its applications. In:
ICML, pp. 231–238. Citeseer (2000)
6. Ham, J.S., Chen, Y., Crawford, M.M., Ghosh, J.: Investigation of the random
forest framework for classification of hyperspectral data. IEEE Transactions on
Geoscience and Remote Sensing 43(3) (2005)
7. Ho, T., Kleinberg, E.: Building projectable classifiers of arbitrary complexity. In:
International Conference on Pattern Recognition, vol. 13, pp. 880–885 (1996)
8. Kohavi, R., Wolpert, D.: Bias plus variance decomposition for zero-one loss func-
tions. In: Machine Learning-International Workshop, pp. 275–283. Citeseer (1996)
9. O’Mahony, M.P., Cunningham, P., Smyth, B.: An Assessment of Machine Learning
Techniques for Review Recommendation. In: Coyle, L., Freyne, J. (eds.) AICS 2009.
LNCS, vol. 6206, pp. 241–250. Springer, Heidelberg (2010),
http://portal.acm.org/citation.cfm?id=1939047.1939075
10. Park, J.C., Shin, H., Choi, B.K.: Elliptic Gabriel graph for finding neighbors in
a point set and its application to normal vector estimation. Computer-Aided De-
sign 38(6), 619–626 (2006)
11. Planeta, D.S.: Linear time algorithms based on multilevel prefix tree for finding
shortest path with positive weights and minimum spanning tree in a networks.
CoRR abs/0708.3408 (2007)
12. Prasad, A.M., Iverson, L.R., Liaw, A.: Newer classification and regression tree
techniques: bagging and random forests for ecological prediction. Ecosystems 9(2),
181–199 (2006)
13. Preparata, F.P., Shamos, M.I.: Computational geometry: an introduction. Springer
(1985)
14. Schapire, R.: The boosting approach to machine learning: An overview. Lecture
Note in Statistics, pp. 149–172. Springer (2003)
15. Shipp, C.A., Kuncheva, L.I.: Relationships between combination methods and mea-
sures of diversity in combining classifiers. Information Fusion 3(2), 135–148 (2002)
16. Toussaint, G.T.: The relative neighbourhood graph of a finite planar set. Pattern
Recognition 12(4), 261–268 (1980)
17. Wang, X., Tang, X.: Random sampling lda for face recognition, pp. 259–267 (2004),
http://portal.acm.org/citation.cfm?id=1896300.1896337
18. Zhou, Z.H., Wu, J., Tang, W.: Ensembling neural networks: Many could be better
than all. Artificial Intelligence 137(1-2), 239–263 (2002)
SRF: A Framework for the Study of Classifier
Behavior under Training Set Mislabeling Noise
1 Introduction
In this work, we study the effect of training set mislabeling noise1 on a clas-
sification task. This type of noise is common in cases of concept drift, where a
target concept shifts over time, rendering previous training instances obsolete.
Essentially, in the case of concept drift, feature noise causes the labels of previ-
ous training instances to be obsolete and, thus, equivalent to mislabeling noise.
Drifting concepts appear in a variety of settings in the real world, such as the
state of a free market or the traits of the most viewed movie. Giannakopou-
los and Palpanas [10] have shown that the performance2 of a classifier in the
presence of noise can be effectively approximated by a sigmoid function, which
relates the signal-to-noise ratio in the training set to the expected performance
of the classifier. We term this approach the “Sigmoid Rule”.
In our work, we examine how much added benefit we can get out of the
sigmoid rule model, by studying and analyzing the parameters of the sigmoid in
order to detect the influence of each parameter on the learner’s behavior. Based
on the most prominent parameters, we define the dimensions characterizing the
algorithm behavior, which can be used to construct criteria for the comparison
of different learning algorithms. We term this set of dimensions the “Sigmoid
Rule” Framework (SRF). We also study, using SRF, how dataset attributes (i.e.,
the number of classes, features and instances and the fractal dimensionality [6])
correlate to the expected performance of classifiers in varying noise settings.
In summary, we make the following contributions. We define a set of intu-
itive criteria based on the SRF that can be used to compare the behavior of
learning algorithms in the presence of noise. This set of criteria provides both
quantitative and qualitative support for learner selection in different settings. We
demonstrate that there exists a connection between the SRF dimensions and the
characteristics of the underlying dataset, using both a correlation study and re-
gression modeling. In both cases we discovered statistically significant relations
between SRF dimensions and dataset characteristics. Our results are based on
an extensive experimental evaluation, using 10 synthetic and 14 real datasets
originating from diverse domains. The heterogeneity of the dataset collection
validates the general applicability of the SRF.
Given the variety of existing learning algorithms, researchers are often inter-
ested in obtaining the best algorithm for their particular tasks. This algorithm-
selection is considered part of the meta-learning domain [11]. According to the
No-Free-Lunch theorems (NFL) described in [22] and proven in [23], [21], there
is no overall best classification algorithm. Nevertheless, NFL theorems, which
compare the learning algorithms over diverse datasets, do not limit us when we
focus on a particular dataset. As mentioned in [1], the results of NFL theorems
1 For the rest of this paper we will use the term noise to refer to this type of noise, unless otherwise indicated.
2 In this paper, by performance of an algorithm, we mean classification accuracy.
$$f(Z) = m + (M - m)\,\frac{1}{1 + b \cdot \exp(-c(Z - d))}$$
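As a minimal sketch (our own illustration with toy data, not the authors' code), the sigmoid can be fitted to measured (signal-to-noise ratio, accuracy) pairs, e.g., with non-linear least squares:

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_rule(Z, m, M, b, c, d):
    """f(Z) = m + (M - m) / (1 + b * exp(-c * (Z - d)))."""
    return m + (M - m) / (1.0 + b * np.exp(-c * (Z - d)))

# Toy data: signal-to-noise ratios and measured accuracies (illustrative values only).
rng = np.random.default_rng(0)
Z = np.linspace(0.05, 20, 20)
acc = sigmoid_rule(Z, 0.1, 0.9, 2.0, 0.6, 5.0) + rng.normal(0, 0.01, Z.size)

# Fit the five sigmoid parameters to the measurements.
params, _ = curve_fit(sigmoid_rule, Z, acc,
                      p0=[0.1, 0.9, 1.0, 1.0, float(np.median(Z))], maxfev=10000)
m, M, b, c, d = params
print("estimated m, M, b, c, d:", params)
```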
[Figure: the sigmoid f(Z), annotated with d_alg, r_alg, Z_inf, Z* and the points Z1, Z2.]
variance of noise is known. In this case, one may choose the algorithm with the
lowest value of ralg/dalg, in order to limit the corresponding variance in performance.
Based on the above discussion, we consider algorithms with higher maximal
performance M, larger width of the performance range ralg, higher slope indicator
ralg/dalg and shorter width of the active area of the algorithm dalg to behave better:
we expect to get high performance from an algorithm if the level of noise in the
dataset is very low, and low performance if the level of noise in the dataset is
very high. Decision makers can easily formulate different criteria, based on the
proposed dimensions and particular settings.
4 Experimental Evaluation
In the following paragraphs, we describe the experimental setup, the datasets
and the results of our experiments.
In our study, we used the following machine learning algorithms, implemented
in Weka 3.6.3 [12]: (a) IBk — K-nearest neighbor classifier; (b) Naive Bayes clas-
sifier; (c) SMO — support vector classifier (cf. [15]); (d) NbTree — a decision tree
with naive Bayes classifiers at the leaves; (e) JRip — a RIPPER [5] rule learner
implementation. We have chosen representative algorithms from different fami-
lies of classification approaches, covering very popular classification schemes [24].
We used a total of 24 datasets for our experiments.4 Fourteen of them are real,
and ten are synthetic. All the datasets were divided into groups according to the
number of classes, attributes (features) and instances in the dataset, as shown
in Figure 2. There are 12 possible groups that include all combinations of the
parameters. Two datasets from each group were employed for the experiments.
[Figure 2: grouping of the datasets by number of classes, features and instances.]
We created artificial datasets in the cases where real datasets with a certain
set of characteristics were not available. We produced datasets with known in-
trinsic dimensionality. The distribution of dataset characteristics is illustrated
in Figure 3. The traits of the datasets illustrated are the number of classes,
the number of attributes, the number of instances and the estimated intrinsic
(fractal) dimension.
The ten artificial datasets we used were built using the following procedure.
Having randomly sampled the number of classes, features and instances, we
4 Most of the real datasets come from the UCI Machine Learning repository [7], and one from [10]. For a detailed list with references check the following anonymous online resource: http://tinyurl.com/3g4fmsf.
[Figure 3: distribution of dataset characteristics; number of features versus fractal dimension (left) and number of classes versus the logarithm of the instance count (right).]
sample the parameters of each feature distribution. We assume that the fea-
tures follow a Gaussian distribution with mean value (μ) drawn from the interval
[−100, 100] and standard deviation (σ) drawn from the interval [0.1, 30]. The μ and σ
intervals allow overlapping features across classes.
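As an illustration of the generation procedure described above (our own sketch; the uniform sampling of the per-class means and standard deviations, and the function name, are assumptions):

```python
import numpy as np

def make_synthetic_dataset(n_classes, n_features, n_instances, seed=0):
    """Gaussian features with class-specific means in [-100, 100] and std in [0.1, 30]."""
    rng = np.random.default_rng(seed)
    mu = rng.uniform(-100, 100, size=(n_classes, n_features))
    sigma = rng.uniform(0.1, 30, size=(n_classes, n_features))
    y = rng.integers(0, n_classes, size=n_instances)
    X = rng.normal(mu[y], sigma[y])          # one Gaussian per (class, feature)
    return X, y
```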
Noise was induced as follows. We created stratified training sets, equally sized
to the stratified test sets. To induce noise, we created noisy versions of the
training sets by mislabeling instances. Using different levels ln of noise, ln =
0, 0.05, . . . , 0.95, a training set with ln noise is a set where there is a probability ln
that a training instance is assigned a different label than its true one.
Hence, we obtained 20 dataset versions with varying noise levels.
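A minimal sketch of this mislabeling procedure (our own illustration):

```python
import numpy as np

def mislabel(y, ln, n_classes, seed=0):
    """Return a copy of the labels where each instance is relabeled with probability ln."""
    rng = np.random.default_rng(seed)
    noisy = y.copy()
    flip = rng.random(y.size) < ln
    for i in np.where(flip)[0]:
        # draw a label different from the true one
        choices = [c for c in range(n_classes) if c != y[i]]
        noisy[i] = rng.choice(choices)
    return noisy
```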
Fig. 4. Sigmoid CTF of SMO (left) and IBk (right) for “Wine” dataset. Green solid
line: True Measurements, Dashed red line: estimated sigmoid.
indicate that (for the studied range of datasets) SMO is expected to improve
its performance faster than all other algorithms when the signal-to-noise ratio
increases. This conclusion is based on the slope indicator (ralg/dalg) values. Also,
IBk has a smaller potential for improvement of performance (but also a smaller
potential for loss) than SMO when noise levels change, given that the width of
the performance range ralg is higher for SMO. This difference can also be seen
in Figure 4, where the distance between the minimum and maximum performance
values is bigger for the SMO case (see Figure 4 (left)).
[Boxplots of the SRF parameters m, ralg, dalg and ralg/dalg per algorithm.]
Fig. 5. SRF parameters per algorithm. X axis labels (left-to-right): IBk, JRip, NB,
NBTree, SMO.
            Error measures
Parameters  MSE    MAE   RMSE    RMAE   average(R2a)
m           0.11   0.09  148.92  29.17  0.54
M           0.35   0.30  0.51    0.41   0.88
ralg        0.32   0.27  0.71    0.46   0.85
dalg        1.98   1.41  0.97    0.68   0.67
ralg/dalg   0.37   0.27  4.83    1.46   0.55
[Figure 6 panels: real and estimated values of m, M, ralg, dalg and ralg/dalg for datasets 1–24.]
Fig. 6. Real and estimated values of the sigmoid parameters. Real values: Black rect-
angles, Estimated values: circles, Gray zone: 95% prediction conf. interval.
Table 2. Correlation between dataset parameters and SRF parameters. Colored cells:
statistically significant correlation (p-value < 0.05: underlined bold; p-value < 0.1: italics-bold).
Green (dark): medium correlation; gray (light): low correlation.
performs badly. Thus, the number of classes significantly influences the behavior
of an algorithm, regardless of the family of the algorithm. Second, the number
of features (x2) provides a minor reduction of the sensitivity to noise variation (re-
sulting from a low correlation to dalg). This conclusion is also supported by the
negative influence on ralg/dalg and ralg. We also note that the number of features affects
the maximal performance M, which shows (rather contrary to intuition) that
more features may negatively affect performance in a noise-free scenario. This is
most probably related to features that are not essentially related to the labeling
process, thus inducing feature noise. Third, there is a correlation between the
number of instances (x3) and ralg/dalg. This shows that larger datasets (providing
more instances) reduce the sensitivity to noise variation. Last, the fractal dimensional-
ity (x4) of a dataset has a low, but statistically significant, negative influence on
M and on ralg. Fractal dimensionality is indicative of the “complexity” of the
dataset. Thus, if the dataset is complex (high x4), machine learning is difficult
even at low noise levels. We note that a low ralg may be preferable in cases where
the algorithm should be stable even for low signal-to-noise ratios.
The correlation analysis demonstrates the connection between dataset char-
acteristics and SRF dimensions. Consequently, the SRF can be used to reveal
a-priori the properties of an algorithm with respect to a dataset of certain char-
acteristics. This allows an expert to select a good algorithm for a given setting,
based on the requirements of that setting. Such requirements may, e.g., relate to
the stability of an algorithm in varying levels of noise and the expected maximum
performance in non-noisy datasets.
5 Conclusions
over time, and to any concept drift problems for data series mining. We showed
that the parameters related to the behavior of learners correlate with dataset
characteristics, and the range of their variation may be predicted using regres-
sion models. Therefore, SRF is a useful meta-learning framework, applicable to
a wide range of settings that include noise. However, using these SRF mod-
els for parameter prediction does not provide enough precision to be used for
performance estimation.
As part of our ongoing work, we examine whether the “Sigmoid Rule” also
stands in the case of sequential classification. Preliminary experimental results
on the “Climate” UCI dataset (taking into account its temporal aspect) indicate
that, indeed, the “Sigmoid Rule” and therefore SRF are directly applicable, and
can be used as a means to represent the behavior of an HMM-based classifier
in the presence of noise. This finding may open the way to a broader use of the
SRF, including sequential learners.
References
1. Ali, S., Smith, K.A.: On learning algorithm selection for classification. Applied Soft
Computing 6(2), 119–138 (2006)
2. Bradley, A.P.: The use of the area under the ROC curve in the evaluation of
machine learning algorithms. Pattern Recognition 30(7), 1145–1159 (1997)
3. Camastra, F., Vinciarelli, A.: Estimating the intrinsic dimension of data with a
fractal-based method. IEEE Transactions on Pattern Analysis and Machine Intel-
ligence 24(10), 1404–1407 (2002)
4. Chevaleyre, Y., Zucker, J.-D.: Noise-tolerant rule induction from multi-instance
data. In: De Raedt, L. (ed.) Proceedings of the ICML 2000 Workshop on Attribute-
Value and Relational Learning: Crossing the Boundaries (2000)
5. Cohen, W.W.: Fast effective rule induction. In: ICML (1995)
6. de Sousa, E., Traina, A., Traina Jr., C., Faloutsos, C.: Evaluating the intrinsic
dimension of evolving data streams. In: Proceedings of the 2006 ACM Symposium
on Applied Computing, pp. 643–648. ACM (2006)
7. Frank, A., Asuncion, A.: UCI machine learning repository (2010)
8. Giannakopoulos, G., Palpanas, T.: Adaptivity in entity subscription services. In:
ADAPTIVE (2009)
9. Giannakopoulos, G., Palpanas, T.: Content and type as orthogonal modeling fea-
tures: a study on user interest awareness in entity subscription services. Interna-
tional Journal of Advances on Networks and Services 3(2) (2010)
10. Giannakopoulos, G., Palpanas, T.: The effect of history on modeling systems’ per-
formance: The problem of the demanding lord. In: ICDM (2010)
11. Giraud-Carrier, C., Vilalta, R., Brazdil, P.: Introduction to the special issue on
meta-learning. Machine Learning 54(3), 187–193 (2004)
12. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.:
The weka data mining software: an update. ACM SIGKDD Explorations Newslet-
ter 11(1), 10–18 (2009)
13. Han, J., Kamber, M.: Data mining: concepts and techniques. Morgan Kaufmann
(2006)
14. Kalapanidas, E., Avouris, N., Craciun, M., Neagu, D.: Machine learning algorithms:
a study on noise sensitivity. In: Proc. 1st Balcan Conference in Informatics, pp.
356–365 (2003)
15. Keerthi, S.S., Shevade, S.K., Bhattacharyya, C., Murthy, K.R.K.: Improvements to
platt’s smo algorithm for svm classifier design. Neural Computation 13(3), 637–649
(2001)
16. Kuh, A., Petsche, T., Rivest, R.L.: Learning time-varying concepts. In: NIPS, pp.
183–189 (1990)
17. Li, Q., Li, T., Zhu, S., Kambhamettu, C.: Improving medical/biological data classi-
fication performance by wavelet preprocessing. In: Proceedings ICDM Conference
(2002)
18. Pendrith, M., Sammut, C.: On reinforcement learning of control actions in noisy
and non-markovian domains. Technical report, School of Computer Science and
Engineering, The University of New South Wales, Sydney, Australia (1994)
19. Teytaud, O.: Learning with noise. Extension to regression. In: Proceedings of Inter-
national Joint Conference on Neural Networks, IJCNN 2001, vol. 3, pp. 1787–1792.
IEEE (2002)
20. Theodoridis, S., Koutroumbas, K.: Pattern Recognition. Academic Press (2003)
21. Wolpert, D.: The existence of a priori distinctions between learning algorithms.
Neural Computation 8, 1391–1421 (1996)
22. Wolpert, D.: The supervised learning no-free-lunch theorems. In: Proc. 6th Online
World Conference on Soft Computing in Industrial Applications. Citeseer (2001)
23. Wolpert, D.H.: The lack of a priori distinctions between learning algorithms. Neural
Computation 8, 1341–1390 (1996)
24. Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLach-
lan, G.J., Ng, A.F.M., Liu, B., Yu, P.S., Zhou, Z.-H., Steinbach, M., Hand, D.J.,
Steinberg, D.: Top 10 algorithms in data mining. Knowl. Inf. Syst. 14(1), 1–37
(2008)
Building Decision Trees for the Multi-class
Imbalance Problem
1 Introduction
2 Methods
We apply a variety of methods to better understand the performance of decision
trees in the class imbalance problem. Due to space restrictions, we limit the
study to two popular decomposition techniques (OVA and ECOC), as well as
building single trees.
While, given the computing power available today, this is of reasonable size, the
problem quickly becomes intractable as the number of classes grows. Since having
so many classes is rare in practice, and does not in fact occur for any dataset
in this paper, we build codewords of maximum size for all datasets.
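As an illustration only (our own sketch, not the paper's implementation), codewords of maximum size, i.e., an exhaustive code, for c classes can be generated as follows; each of the 2^(c-1) - 1 columns defines one binary problem:

```python
import numpy as np

def exhaustive_ecoc(n_classes):
    """Code matrix of shape (n_classes, 2**(n_classes - 1) - 1); entries are 0 or 1."""
    n_cols = 2 ** (n_classes - 1) - 1
    code = np.zeros((n_classes, n_cols), dtype=int)
    for col, b in enumerate(range(1, n_cols + 1)):
        for cls in range(n_classes):
            code[cls, col] = (b >> cls) & 1   # last class always gets 0, so no complementary columns
    return code
```

For instance, exhaustive_ecoc(4) yields a 4 x 7 code matrix; a binary classifier would be trained per column, and a test example assigned to the class whose codeword is closest (e.g., in Hamming distance) to the vector of classifier outputs.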
Decision trees are one of the fundamental learning algorithms in the data mining
community. The most popular decision tree learning algorithm is C4.5 [14].
Recently, Hellinger distance decision trees (HDDTs) [4] have been proposed as
an alternative method for building decision trees for binary-class datasets which
exhibit class imbalance.
Provost and Domingos [13] recommend a modification to C4.5 known as C4.4.
In C4.4 decision trees are constructed by building unpruned and uncollapsed
C4.5 decision trees which use Laplace smoothing at the leaves. These choices
are due to empirical results [13] demonstrating that a fully built unpruned,
uncollapsed tree with Laplace smoothing outperforms all other configurations,
and thus are used in all experiments in this paper.
The important function to consider when building a decision tree is known as
the splitting criterion. This function defines how data should be split in order to
maximize performance. In C4.4 this function is gain ratio, which is a measure of
purity based on entropy [14], while in HDDT this function is Hellinger distance.
In the next section we motivate Hellinger distance as a splitting criterion, and
then subsequently devise a strategy for improving its performance on multi-class
datasets.
Hellinger Distance Splitting Criterion. Hellinger distance is a distance met-
ric between probability distributions used by Cieslak and Chawla [4] to create
Hellinger distance decision trees (HDDTs). It was chosen as a splitting criterion
for the binary class imbalance problem due to its property of skew insensitivity.
Hellinger distance is defined as a splitting criterion as follows [4]:
$$d_H(X_+, X_-) = \sqrt{\sum_{j=1}^{p}\left(\sqrt{\frac{|X_{+j}|}{|X_+|}} - \sqrt{\frac{|X_{-j}|}{|X_-|}}\right)^{2}} \qquad (1)$$
where X+ is the set of all positive examples, X− is the set of all negative examples
and X+j (X−j ) is the set of positive (negative) examples with the jth value (of
p distinct values) of the relevant feature.
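A minimal sketch of Eq. (1) for a single nominal feature (our own illustration, assuming the feature and label arrays as inputs):

```python
import numpy as np

def hellinger_distance(feature, labels, positive):
    """Hellinger distance of one nominal feature, following Eq. (1).

    feature : array of feature values (the p distinct values)
    labels  : array of class labels
    positive: the label treated as the positive class"""
    pos = feature[labels == positive]
    neg = feature[labels != positive]
    total = 0.0
    for v in np.unique(feature):
        f_pos = np.mean(pos == v) if pos.size else 0.0   # |X+j| / |X+|
        f_neg = np.mean(neg == v) if neg.size else 0.0   # |X-j| / |X-|
        total += (np.sqrt(f_pos) - np.sqrt(f_neg)) ** 2
    return np.sqrt(total)
```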
Since Hellinger distance defines the distance between probability distributions,
it does not naturally extend to the multi-class problem. This is in contrast to gain
ratio — which is based on entropy — which is easily extensible to any number
of classes. Specifically, since Hellinger distance is a distance metric, any natural
extension would be attempting to determine the distance between c probability
distributions, where c is the number of classes. Since this is not a well defined
problem, we propose an extension to the HDDT algorithm for the multi-class
problem.
One of the major research questions in this paper is why the performance of
HDDT suffers in the multi-class case when compared to C4.4, especially in light
1 Note that two pairs of subsets (e.g., (C1, C2) and (D1, D2)) are considered equal if C1 = D1 or C1 = D2.
Fig. 1. Comparison of the effects of various class distributions on the ability of information gain (top) and Hellinger distance (bottom) to correctly determine the class boundary which optimizes AUC. Panels (a) and (d): effects of a 100:1 imbalance ratio; (b) and (e): effects of 100:1:100:1 imbalance; (c) and (f): effects of OVA on 100:1:100:1 imbalance.
In the multi-class case (Figures 1(b) and 1(e)) Hellinger distance once again
is very aggressive in attempting to capture as much of the minority class as
possible, while C4.4 is much more conservative. Due to the nature of this prob-
lem, however, the more conservative approach is better able to capture the
multi-distributional aspect of the problem. This is demonstrated by the fact
that, based on WAUROCs, C4.4 wins 82 of the 100 runs. Thus, for multi-class,
Hellinger distance is not able to adequately separate the two classes, instead
being overwhelmed by the spurious information from the extra classes.
In order to better understand this phenomenon, consider the right-most horizontal split Hellinger distance makes in the multi-class case. For this split, Hellinger
distance considers the “top” points to be the positive class and the “bottom”
points to be the negative class. As evidenced by the inaccuracy of the top left
points, Hellinger distance is not able to accurately partition the space. Gain ratio, on the other hand, is able to arrive at a better split point which more
accurately represents the boundary for this problem.
Finally we consider the case of OVA decomposition on the dataset. Figure
1(f) shows Hellinger distance is very good at capturing the minority class. This
favorable splitting is exactly what would be expected from such a binary class
imbalanced dataset, and thus explains the performance increase HDDT sees over
C4.4 when used in conjunction with OVA. This hypothesis is further confirmed
when we note that HDDT obtains a higher AUROC in 80 of the 100 runs, thus
confirming that it is the preferred classifier to use.
Given these results, we now better understand the dynamics of Hellinger dis-
tance in the binary class problem which result in inferior performance in the
multi-class domain. Further research into overcoming these challenges might
prove useful in developing a single decision tree approach which, without sam-
pling, is able to outperform the others in the case of multi-class imbalance.
4 Experiments
4.1 Configuration
In order to ensure a fair comparison of the methods, we ran 50 iterations [15] of
2-fold cross-validation. We chose 2-fold cross-validation due to the small number
of instances of some classes in the datasets. Due to space restrictions, we only
consider weighted area under the receiver operating characteristic (WAUROC)
[19]. We chose this metric as it is a commonly used criterion when comparing classifiers in the multi-class imbalance case.
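For concreteness, the sketch below computes a class-prevalence-weighted one-vs-rest AUROC with scikit-learn; this is only one plausible weighting and may differ in detail from the WAUROC definition of [19].

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def weighted_auroc(y_true, y_score, classes):
    """Prevalence-weighted one-vs-rest AUROC (an approximation of WAUROC).

    y_true:  true class labels, shape (n,)
    y_score: predicted class probabilities, shape (n, len(classes))
    classes: class labels in the column order of y_score
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    total = 0.0
    for j, c in enumerate(classes):
        binary = (y_true == c).astype(int)
        if binary.sum() in (0, len(binary)):      # AUC undefined with one class only
            continue
        total += binary.mean() * roc_auc_score(binary, y_score[:, j])
    return total
```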
Table 1. Statistics for the datasets used in this paper. C.V. is the coefficient of varia-
tion, # Ftrs is the number of features, and # Insts is the number of instances.
4.3 Results
As stated previously we break the experiment into three different categories.
Each of the categories corresponds to a different level of computational effort
required to construct the classifier, with single trees requiring the least amount of work and ECOC requiring the most. For the sake of space, however, the WAUROC values for each of the methods are presented only in Table 2.

Table 2. WAUROC values for the various methods over each of the datasets. Bold numbers indicate overall best performance. The number in parentheses indicates the rank in the category. A check mark indicates that the method performs statistically significantly worse than the best method in its category at the relevant confidence level.
Table 2 also contains the results of the statistical test described in Section
4.2. A classifier receives a check mark if it is considered statistically significantly
worse than the best classifier (i.e., the classifier with the lowest average rank) in
its category (e.g., single tree, OVA, ECOC) at the noted confidence level.
Single Tree Performance. When considering the single tree performances,
C4.4 and MC-HDDT perform equivalently. This is an interesting result for multi-
class imbalanced data sets, and further corroborates the intuition established
with the illustrations in Section 3. As discussed, this is mainly due to the aggres-
sive nature of the splits which Hellinger distance tries to create. The consequence
of this analysis is further evidenced in the OVA performance.
Hellinger distance, as a criterion, is limited in capturing the multi-class diver-
gences. Nevertheless, we recommend MC-HDDT as a decision tree classifier, as
it reduces to HDDT for binary class datasets (achieving statistically significantly
superior performance over C4.4 [4]), and is a competitive alternative to C4.4 for
multi-class datasets (no statistically significant variation in performance).
OVA Performance. When considering OVA performance, HDDT significantly
outperforms C4.4. This result confirms our understanding of the binary class
performances of each of the classifiers. That is, when decomposing the multi-
class problem into multiple binary problems, the binary class problems obtained
are (often) extremely imbalanced. This is further exacerbated when the multi-class dataset itself is highly imbalanced.
Thus in the OVA approach, each binary classifier in the decomposition ensem-
ble must deal with the class imbalance problem. Since HDDT has been shown
to perform statistically significantly better than C4.4 in this scenario, we expect
to see HDDT outperforming C4.4 when using the OVA approach. Based on the
observations obtained, we can conclude that our intuition is correct and, further-
more, that when using OVA decomposition for multi-class imbalance, HDDT is the appropriate decision tree learner to choose.
ECOC Performance. When comparing the relative performance of the clas-
sifiers, we see that HDDT outperforms C4.4 almost as well as in the OVA ap-
proach. While the difference is statistically significant only at α = 0.10, it misses significance at the α = 0.05 threshold by a single dataset. As Table 2
shows, some of the performance differences were quite small. Thus it seems rea-
sonable to believe that with more datasets we might see the same statistical
significance with this method as was shown in OVA, as we would expect the
same performance gains of using HDDT over C4.4 in this case as well.
This expectation of better performance of HDDT over C4.4 follows from similar reasoning as in the OVA case. That is, by decomposing the problem into multiple binary problems, class imbalance will still be a major concern. However, the ECOC approach will result in 2^{m−1} − 1 binary datasets. Some of these will
be highly imbalanced, while others may be balanced depending on the respec-
tive class distributions. Nevertheless, HDDT is able to capitalize with ECOC.
It is able to achieve stronger separability on highly imbalanced combinations,
and achieves comparable performance to C4.4 on the relatively balanced class
combinations, and thus, as a collective, it is able to outperform C4.4.
Overall Performance. When considering the overall performance of each
method as given in Table 2, we see that, in general, the more computational
power used, the better the performance. That is, the ECOC methods outper-
form the OVA methods which outperform the single tree methods.
This is an unsurprising result, as a wealth of data mining literature demon-
strates that combining a large number of classifiers into an ensemble is a powerful
technique for increasing performance. The decomposition ensemble techniques
employed in this paper are also of particular interest, as the diversity of the clas-
sifiers created in the decomposition ensembles is quite high. That is, since the
class values under consideration are changing between datasets, the classifiers
are not merely learning on different permutations of the underlying instances; instead, the decision boundaries themselves change. It is well known that
diversity is important to creating good ensembles [11].
5 Related Work
A number of methods have been proposed to counter the class imbalance issue; however, a large portion of them has focused on the binary class problem. Sampling
methods have emerged as a de facto standard, but present numerous challenges
when being extended to multiple classes. This is due to the complexity aris-
ing from the combination of multiple class imbalance types, different amounts
algorithms give real gains in performance. We can therefore revise our recom-
mendation, this time recommending the use of HDDT in an ECOC decomposi-
tion ensemble if the user has enough computational power. Otherwise, the user
should consider an OVA decomposition ensemble with HDDT, and, finally, if
not enough computational power exists for such a decomposition, building MC-
HDDTs. We recommend MC-HDDTs over C4.4, as even though the difference
between them is not statistically significant for multi-class datasets, MC-HDDT
reduces to HDDT for binary class datasets, where it has been demonstrated to
be strongly skew insensitive and statistically significantly better than C4.4. As a result,
MC-HDDT may be considered the recommended decision tree algorithm.
Finally, Section 3 illustrated the challenges Hellinger distance faces in the
multi-class domain. With this understanding further research can now explore
the problem of multi-class Hellinger distance and attempt to overcome the
demonstrated difficulties to provide robust classifiers for multi-class problems.
References
1. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
2. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: Synthetic mi-
nority over-sampling technique. JAIR 16, 321–357 (2002)
3. Chawla, N.V.: Data mining for imbalanced datasets: An overview. In: Data Mining
and Knowledge Discovery Handbook, pp. 875–886 (2010)
4. Cieslak, D.A., Chawla, N.V.: Learning Decision Trees for Unbalanced Data. In:
Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008, Part I. LNCS
(LNAI), vol. 5211, pp. 241–256. Springer, Heidelberg (2008)
5. Cieslak, D.A., Chawla, N.V.: Start globally, optimize locally, predict globally: Im-
proving performance on imbalanced data. In: ICDM, pp. 143–152 (2008)
6. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. JMLR 7,
1–30 (2006)
7. Dietterich, T., Bakiri, G.: Error-correcting output codes: A general method for
improving multiclass inductive learning programs. In: AAAI, pp. 395–395 (1994)
8. Domingos, P.: Metacost: A general method for making classifiers cost-sensitive. In:
KDD, pp. 155–164 (1999)
9. Freund, Y., Schapire, R.: A Decision-Theoretic Generalization of On-Line Learning
and an Application to Boosting. In: Vitányi, P.M.B. (ed.) EuroCOLT 1995. LNCS,
vol. 904, pp. 23–37. Springer, Heidelberg (1995)
10. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The
weka data mining software: an update. SIGKDD Exp. News. 11(1), 10–18 (2009)
11. Krogh, A., Vedelsby, J.: Neural network ensembles, cross validation, and active
learning. In: NIPS, pp. 231–238 (1995)
12. Maloof, M.A.: Learning when data sets are imbalanced and when costs are unequal
and unknown. In: ICML WLIDS (2003)
13. Provost, F., Domingos, P.: Tree induction for probability-based ranking. Machine
Learning 52(3), 199–215 (2003)
14. Quinlan, J.R.: Induction of decision trees. Machine Learning 1(1), 81–106 (1986)
15. Raeder, T., Hoens, T., Chawla, N.: Consequences of Variability in Classifier Per-
formance Estimates. In: ICDM, pp. 421–430 (2010)
16. Rifkin, R., Klautau, A.: In defense of one-vs-all classification. JMLR 5, 101–141
(2004)
17. Ting, K.M.: An instance-weighting method to induce cost-sensitive trees.
TKDE 14(3), 659–665 (2002)
18. Turney, P.D.: Types of cost in inductive concept learning. In: ICML, pp. 15–21
(2000)
19. Van Calster, B., Van Belle, V., Condous, G., Bourne, T., Timmerman, D., Van
Huffel, S.: Multi-class auc metrics and weighted alternatives. In: IJCNN, pp. 1390–
1396 (2008)
20. Zhou, Z.-H., Liu, X.-Y.: On multi-class cost-sensitive learning. In: AAAI, pp. 567–
572 (2006)
21. Zhou, Z.-H., Liu, X.-Y.: Training cost-sensitive neural networks with methods ad-
dressing the class imbalance problem. TKDE 18(1), 63–77 (2006)
Scalable Random Forests for Massive Data
Bingguo Li, Xiaojun Chen, Mark Junjie Li, Joshua Zhexue Huang,
and Shengzhong Feng
1 Introduction
Data with millions of records and thousands of features present a big challenge
to current data mining algorithms. On one hand, it is difficult to build classifi-
cation models from such massive data with serial algorithms running on single
machines. On the other hand, most classification algorithms are not capable
of building accurate models from extremely high dimensional data with thou-
sands of features. However, such high dimensional massive data exist in many
application domains, such as text mining, bio-informatics and e-commerce.
Random forests [1] is an effective ensemble model for classifying high dimen-
sional data [2]. A random forest consists of K decision trees, each grown from a
data set randomly sampled from the training data with replacement. At each node
of a decision tree, a subset of m features is randomly selected and the node is split
according to the m features. Breiman [1] suggested m = log2(M) + 1, where M is the total number of features in the data. For very high dimensional data, M is very large and m is much smaller than M. Therefore, decision trees in a random forest
are grown from subspaces of features [3][4][5]. The random forest classifies data ac-
cording to the majority votes of individual decision trees. Due to the computation
added noise to the OCR data set. On this noisy data, the accuracy of random forests in Mahout dropped to 51.9%, while the accuracy of SRF was 78.6%. On a separate sparse data set, Real-sim (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html), Mahout random forests was not able to produce a model and generated a stack overflow error due to memory leakage, but SRF worked fine. We used large synthetic data sets with more than one thousand attributes and millions of records to test the scalability of SRF. With 30 computing nodes, SRF was able to build a random forest with 100 trees in a little more than 1 hour from a massive data set of 110 Gigabytes with 1000 features and 10 million records. This was indeed a significant result. SRF also demonstrated linear scaling with respect to the number of trees and the number of examples.
The rest of the paper is organized as follows. Section 2 gives a brief review
of the SPDT algorithm. In Section 3, we present the SRF algorithm in details.
We present experiment results on real-life data and scalability tests in Section
4. The paper is concluded in Section 5.
2 Related Work
Building decision trees is the main work in building a random forest. Traditional decision tree algorithms [10][11] use a recursive process to create a decision
tree from a training data set. These algorithms will have problems if the train-
ing data or the tree is too big to fit in the main memory. Scalable decision tree
algorithms have been proposed to handle large data. Some take an approach to
pre-sort the training data before building the decision tree, such as SLIQ [12],
SPRINT [13] and ScalParC [14]. Others compute the histograms of attributes
and split the training data according to the histograms, such as BOAT [15],
CLOUDS [16], SPIES [17] and SPDT [9]. The later are more scalable as the
tree growing process is no longer relevant to the size of training data after all
histograms are created. The creation of histograms can be easily parallelized.
Google also proposed PLANET [18] for regression trees based on MapReduce
programming model. PLANET only supports sampling without replacement.
In this work, we use a breadth-first method to construct decision trees for a
random forest. We select the streaming parallel decision tree algorithm SPDT
recently developed at IBM as a framework to develop the breadth-first tree
growing process. Figure 1 sketches the process of the SPDT algorithm. It runs
in a distributed environment with one master node and several workers. Each
worker stores 1/W of the data, where W is the number of workers.
To grow a decision tree, the master node instructs workers to compute the local
histograms of features from their local data blocks. After local histograms are
complete, the workers send them to the master which merges them into the global
histograms. The global histograms are then used to compute the conditions to
split the nodes and grow the decision tree. After the nodes in the same level are
split, the master node instructs the workers again to compute histograms for
the newly generated children nodes. This process continues until no node needs
further split.
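To make the histogram exchange concrete, here is a heavily simplified sketch in which each worker builds fixed-width histograms and the master merges them by bin-wise addition; SPDT itself uses streaming, mergeable histogram summaries [9], and the data layout and function names below are our own assumptions.

```python
from collections import defaultdict

def local_histograms(block, level_nodes, num_bins=32):
    """Worker side: per-(node, feature, class) histograms of one data block.

    block rows are (node_id, feature_id, value, label) with value scaled to [0, 1).
    """
    hists = defaultdict(lambda: [0] * num_bins)
    for node_id, feature_id, value, label in block:
        if node_id not in level_nodes:
            continue
        b = min(max(int(value * num_bins), 0), num_bins - 1)   # toy equal-width binning
        hists[(node_id, feature_id, label)][b] += 1
    return hists

def merge_histograms(worker_hists):
    """Master side: merge local histograms into global histograms by bin-wise addition."""
    merged = {}
    for hists in worker_hists:
        for key, counts in hists.items():
            if key in merged:
                merged[key] = [a + b for a, b in zip(merged[key], counts)]
            else:
                merged[key] = list(counts)
    return merged
```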
The loop of the controller checks whether N contains non-zero elements and
terminates if all elements in N are zeros. Inside the loop, the controller configures
a MapReduce job first and then dispatches it to the computing nodes to perform
a pair of map and reduce functions on one level of nodes of all decision trees. The
controller also passes the information of the nodes in the current level to each
mapper for computing local histograms. The controller calculates the number
of reducers for the MapReduce job according to the number of nodes in N to
balance the load across reducers. After all reducers complete, the parseOutput function processes the results from the reducers and generates the new node information in tree-tup. The tree-tup data is used to update the random forest model M and to compute a new N recording the number of new nodes generated in each tree.
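The controller loop described above can be sketched as follows; every name here (run_mapreduce_job, parse_output, the model methods) is a placeholder standing in for the corresponding SRF component, not an actual Hadoop API.

```python
def srf_controller(model, N, run_mapreduce_job, parse_output, max_reducers=16):
    """Level-by-level controller loop for growing all trees of the forest.

    model, run_mapreduce_job and parse_output are placeholders: the model must
    expose current_level_nodes(), update() and count_new_nodes_per_tree(), and
    run_mapreduce_job stands in for submitting one map/reduce pass.
    """
    while any(n > 0 for n in N):
        num_reducers = min(max_reducers, max(1, sum(N)))   # balance reducer load
        level_info = model.current_level_nodes()           # passed to every mapper
        raw_output = run_mapreduce_job(level_info, num_reducers)
        tree_tup = parse_output(raw_output)                # new node information
        model.update(tree_tup)                             # attach the new children
        N = model.count_new_nodes_per_tree()               # pending nodes per tree
    return model
```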
4 Experiments
In this section we show the classification results of SRF on large complex data
sets and comparisons of them with those by SPDT and Mahout. We also demon-
strate the scalability of SRF to very large data sets. The results show the capa-
bility of SRF in building random forest models from extremely large data with
10 million records and 1000 features in less than 2 hours. Such models would be
very difficult, if not impossible, to build via traditional random forest algorithms.
In the first set of experiments, we compared the classification results of SRF with those by SPDT and Mahout. Four real-life data sets were selected, and these
data sets were used in evaluating SPDT in [9]. The characteristics of these data
sets are summarized in Table 1. They can be downloaded from the UCI repository and the Pascal Large Scale Learning Challenge.
The second set contained four synthetic data sets that were generated for
scalability test of SRF. The four synthetic sets are listed in Table 2. The points
in the same class in these data sets have Gaussian distributions. To generate
separable classes in the synthetic data, we specified several central points with
labels first and calculated the distance between a generated record and the cen-
tral points. We set the class label of the record as the label of the central point
that had the minimum distance to the record.
Table 3 lists the classification results, in terms of accuracy, of SPDT, Mahout and SRF on the four real-life data sets described in Table 1. We can see that the accuracies of SRF were higher than those of SPDT on all four data sets. The most prominent result was from the OCR data set, which had 1156 features and one million records; the increase in accuracy over SPDT was 19%. This result demonstrates that SRF achieves better performance than SPDT in handling massive and high dimensional data: since SRF is a random forest algorithm, it can be expected to outperform a single decision tree algorithm (SPDT) on massive data.
Table 3. Accuracies of SPDT, Mahout and SRF on four real-life data sets
Comparing Mahout and SRF, the accuracy of SRF was very close to that of Mahout random forests on the Isolet and Multiple Features training data sets. The reason is that the size of these two training data sets was no more than 64 MB, so Mahout and SRF constructed all decision trees from a single data block; thus the two algorithms degenerated into traditional random forest algorithms. On the other hand, SRF obtained higher accuracy than Mahout random forests on massive data, such as the Face Detection training set.
To illustrate the fact that random forests in Mahout builds all decision trees on the first data block when the number of blocks is larger than the number of decision trees, we added a block of noise records, generated randomly for two labels, in front of the OCR training set. The accuracy of random forests in Mahout decreased rapidly from 78.9% to 51.9%, while the accuracy of SRF decreased only from 79.5% to 78.6% on the noisy OCR training
set. The reason is that, for Mahout, all decision trees were built on the first
data block, which contained the added noise data. As a result, SRF outperforms
Mahout in handling massive data.
In addition, Mahout may suffer from the memory leakage problem when handling massive data. For example, Mahout random forests generated a stack overflow error when dealing with the Real-sim training data set, which has 35,000 examples and 20,958 features. This is because random forests in Mahout build decision trees in depth-first mode, which may cause the memory leakage problem.
4.4 Scalability
Figure 3 shows the scalability on four synthetic data sets with respect to the
number of trees in random forests, the number of examples in data, the size of
data block and the number of machines used. Figure 3(a) shows that the time
used to build a random forest model increased linearly as the number of trees
increased in the model. The run times for building one tree model for data sets
D1-D4 are 1174s, 1377s, 1654s and 1866s respectively. The run time increases
slowly as more trees are added to the model. For instance in data set D4, only
less than 4 extra seconds were added to the total run time when each additional
tree was added to the model. The larger the data set, more time it takes when
more trees are added. However, the speed of time increase is very slow. This
result demonstrates that SRF is scalable to the number of trees in the model.
Figure 3(b) shows the scalability of run time on the four synthetic sets with respect to the number of examples. We can see a linear increase in time as more examples were used in building the random forest models. When the number of examples was small, the differences in run time among the four data sets were very small. As more examples were involved, the run time increased, but only slowly: the larger the data size, the larger the increase in run time, but the rate of increase was not fast. This demonstrates that SRF is also scalable with respect to the data size.
Figure 3(c) shows the change of run time with the size of the data block. The size of the data block had an impact on the run time of SRF. On the one hand, a smaller data block generates more mappers; if the number of mappers exceeds the mapper capacity of the system, the run time increases rapidly. On the other hand, a larger data block places a heavier load on each mapper. From the chart, we can see that a block size of 32 MB or 64 MB is appropriate for these data sets.
Fig. 3. Scalability of SRF: (a) run time w.r.t. number of trees; (b) run time w.r.t. number of examples; (c) run time w.r.t. data block size; (d) run time w.r.t. number of computing nodes.
Figure 3(d) shows the scalability of SRF with respect to the number of ma-
chines involved. We can see that the run time dropped rapidly as more machines
were added. This demonstrates that SRF is able to handle very large data by adding
more machines.
5 Conclusions
We have presented the scalable random forest algorithm SRF and its implemen-
tation in MapReduce. In the algorithm, we adopted the breadth-first approach to
build decision trees in a sequence of pairs of map and reduce functions to avoid
the memory leakage in the depth-first recursive approach and make the algorithm
more scalable. We have demonstrated the scalability of SRF with very large syn-
thetic data sets and the results have shown SRF’s ability in building random forest
models from data with millions of records. Our future work is to further optimize
SRF in the area of load balancing to make it more efficient and scalable.
References
1. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
2. Banfield, R., Hall, L., Bowyer, K., Kegelmeyer, W.: A comparison of decision tree
ensemble creation techniques. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 173–180 (2007)
3. Ho, T.: Random decision forests. In: Proceedings of the Third International Con-
ference on Document Analysis and Recognition, vol. 1, pp. 278–282. IEEE (1995)
4. Ho, T.: C4.5 decision forests. In: Proceedings of Fourteenth International Confer-
ence on Pattern Recognition, vol. 1, pp. 545–549. IEEE (1998)
5. Ho, T.: The random subspace method for constructing decision forests. IEEE
Transactions on Pattern Analysis and Machine Intelligence 20(8), 832–844 (1998)
6. White, T.: Hadoop: The definitive guide. Yahoo Press (2010)
7. Venner, J.: Pro Hadoop. Springer (2009)
8. Lam, C., Warren, J.: Hadoop in action (2010)
9. Ben-Haim, Y., Tom-Tov, E.: A streaming parallel decision tree algorithm. The
Journal of Machine Learning Research 11, 849–872 (2010)
10. Breiman, L.: Classification and regression trees. Chapman & Hall/CRC (1984)
11. Quinlan, J.: C4.5: Programs for machine learning. Morgan Kaufmann (1993)
12. Mehta, M., Agrawal, R., Rissanen, J.: Sliq: A Fast Scalable Classifier for Data
Mining. In: Apers, P.M.G., Bouzeghoub, M., Gardarin, G. (eds.) EDBT 1996.
LNCS, vol. 1057, pp. 18–32. Springer, Heidelberg (1996)
13. Shafer, J., Agrawal, R., Mehta, M.: Sprint: A scalable parallel classifier for data
mining. In: Proceedings of the International Conference on Very Large Data Bases,
pp. 544–555. Citeseer (1996)
14. Joshi, M., Karypis, G., Kumar, V.: Scalparc: A new scalable and efficient paral-
lel classification algorithm for mining large datasets. In: Proceedings of the First
Merged International and Symposium on Parallel and Distributed Processing, Par-
allel Processing Symposium, IPPS/SPDP 1998, pp. 573–579. IEEE (1998)
15. Gehrke, J., Ganti, V., Ramakrishnan, R., Loh, W.: Boat: optimistic decision tree
construction. In: Proceedings of the 1999 ACM SIGMOD International Conference
on Management of Data, pp. 169–180. ACM (1999)
16. AlSabti, K., Ranka, S., Singh, V.: Clouds: Classification for large or out-of-core
datasets. In: Conference on Knowledge Discovery and Data Mining (1998)
17. Jin, R., Agrawal, G.: Communication and memory efficient parallel decision tree
construction. In: 3rd SIAM International Conference on Data Mining, San Fran-
cisco, CA (2003)
18. Panda, B., Herbach, J., Basu, S., Bayardo, R.: Planet: massively parallel learning
of tree ensembles with mapreduce. Proceedings of the VLDB Endowment 2(2),
1426–1437 (2009)
Hybrid Random Forests: Advantages of Mixed
Trees in Classifying Text Data
Baoxun Xu1 , Joshua Zhexue Huang2 , Graham Williams2 , Mark Junjie Li2 ,
and Yunming Ye1
1 Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, China
2 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
[email protected], [email protected], [email protected], [email protected], [email protected]
1 Introduction
Random forests [1,2] are a popular classification method which builds an ensem-
ble of a single type of decision tree. The decision trees are often either built using
C4.5 [3] or CART [4], but only one type was exploited within a single random
forest. In recent years, random forests have attracted increasing attention due
to (1) their competitive performance compared with other classification methods, especially for high-dimensional data, (2) their algorithmic intuitiveness and simplicity, and (3) their most important capability: "ensemble" learning using bagging [5] and stochastic discrimination [2].
The most popular forest construction procedure was proposed by Breiman [1],
novelly using bagging to generate training data subsets for building individual
trees. A subspace of features is then randomly selected at each node to grow
branches of a decision tree. The trees are then combined as an ensemble into a
random forest [1].
In the literature, different types of decision tree algorithms have been proposed, including C4.5, CART and CHAID [6]. Each type of decision tree algorithm employs a different tree building process and captures different discriminative information.
Random forests gain some of their performance advantage through the diver-
sity of the trees in the resulting ensemble. We can add another kind of diversity
to the random forest framework by removing any potential bias from using a
single type of decision tree. We propose to use several different types of decision
trees for each training data subset, and select the best tree as the individual tree
classifier in the random forest model.
Our method is motivated by the experiences of foresters in dealing with the
development and care of hybrid forests. An important concept in Forestry is
that of a “hybrid forest.” Such a forest uses multiple tree species as a mixed
planting in accordance with soil structure (moisture, nutrients, acidity). This
method has demonstrated highly economic, ecological and practical value in
forestry research. Mimicing this idea, we have developed a hybrid random forest
method to explore whether we can further enhance the classification performance
of a random forest ensemble classifier. Specifically, we build three different types
of tree classifiers (C4.5, CART and CHAID) for each training data subset. We
then evaluate the performance of the three classifiers and select the best tree.
In this way, we build a hybrid random forest which may include different types
of decision trees in the ensemble. The added diversity of the decision trees can
effectively improve the accuracy of each tree in the forest, and hence the accuracy
of the ensemble.
To demonstrate the effectiveness of our proposed method, we apply it to the
popular application of text classification. With the ever-increasing volume of
text data from the Internet, databases, and archives, text categorization has
become a key technique for handling and organizing text data. It has received
growing attention in recent years. A set of popular and mature machine learn-
ing approaches have been deployed for categorizing text documents, including
random forests [8], support vector machines (SVM) [9], naive Bayes (NB) [10],
k-nearest neighbors (KNN) [11], and decision trees. Due to algorithmic simplic-
ity and prominent classification performance for high dimensional data, random
forests have become a preferred method.
In this paper, we compare the performance of our random forest with that of three other random forest methods, i.e., C4.5 random forest, CART random forest and CHAID random forest, and three other mainstream text categorization methods, i.e., support vector machines, naive Bayes and k-nearest neighbors, on six datasets. The experimental results show that our hybrid random forest achieves improved classification performance over these six compared methods.
The rest of this paper is organized as follows. Section 2 introduces the frame-
work for building a hybrid Random Forest, and gives a brief analysis of the
method. The evaluation methods are presented in Section 3, and experimental results in Section 4. Our conclusions and future work are presented in
Section 5.
Building a hybrid random forest model in this way will increase the diversity
among the trees. The classification performance of each individual tree classifier
is also maximized.
Classification and Regression Tree (CART). CART builds binary trees, considering all predictor variables for splitting. The best predictor is chosen
at each node using a variety of impurity or diversity measures. The goal is to
produce subsets of the data which are homogeneous with respect to the target
variable [4]. The main difference between C4.5 and CART is the test selection
and evaluation process.
Chi-squared Automatic Interaction Detector (CHAID). The CHAID method is based on the chi-square test of association. A CHAID decision tree is constructed by
repeatedly splitting subsets of the space into two or more nodes. To determine
the best split at any node, any allowable pair of categories of the predictor
variables is merged until there is no statistically significant difference within the
pair with respect to the target variable [6,7].
From these decision tree algorithms, we can see that the differences lie in the way a node is split, such as the split functions and the use of binary or multi-way branches. In this work we use these different decision tree algorithms to build a hybrid random forest.
2.3 Algorithm
In this subsection, we present our hybrid random forest algorithm which inte-
grates the three types of tree classifiers. The detailed steps are introduced in
Algorithm 1.
In Algorithm 1, lines 10-17 loop to build K decision trees. In the loop, Line
11 samples the training data D by sampling with replacement to generate an
in-of-bag data subset IOBi for building a decision tree. Lines 12-15 build three
types of tree classifiers (C4.5, CART and CHAID). In this procedure, Line 13
calls the function createTree_j() to build a tree classifier. Line 14 calculates the
out-of-bag accuracy of the tree classifier. After this procedure, Line 16 selects
the tree classifier with the maximum out-of-bag accuracy. K decision trees are
thus generated to form a hybrid random forest model M .
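The loop can be sketched as follows, with generic stand-ins for the C4.5, CART and CHAID learners; it mirrors the structure described above but is not the authors' implementation.

```python
import random

def build_hybrid_forest(data, labels, tree_learners, K=100):
    """Hybrid forest training loop in the spirit of Algorithm 1.

    tree_learners: list of (name, train_fn) pairs; each train_fn(X, y) returns a
    classifier with a .predict(X) method (stand-ins for C4.5, CART and CHAID).
    """
    n = len(data)
    forest = []
    for _ in range(K):
        idx = [random.randrange(n) for _ in range(n)]      # in-of-bag sample IOB_i
        chosen = set(idx)
        oob = [i for i in range(n) if i not in chosen]     # out-of-bag examples

        best, best_acc = None, -1.0
        for name, train_fn in tree_learners:
            clf = train_fn([data[i] for i in idx], [labels[i] for i in idx])
            if oob:
                preds = clf.predict([data[i] for i in oob])
                acc = sum(p == labels[i] for p, i in zip(preds, oob)) / len(oob)
            else:
                acc = 0.0
            if acc > best_acc:                             # keep the best tree type
                best, best_acc = (name, clf), acc
        forest.append(best)
    return forest
```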
Generically, the function createTree_j() first creates a new node. Then, it tests the stopping criteria to decide whether to return to the upper node or to split this node. If we choose to split this node, then we randomly select m features as a subspace for node splitting. These features are used as candidates to generate the best split to partition the node. For each subset of the partition, createTree_j() is called again to create a new node under the current node. If a leaf node is created, it returns to the parent node. This recursive process continues until a full tree is generated.
3 Evaluation Methods
We use two measures to evaluate the classification performance of the hybrid
random forest, the test accuracy and the F1 metric. The test accuracy measures
the performance of a random forest on a separate test dataset. The F1 metric is
a commonly used measure of classification performance.
Test Accuracy. Let Dt be a test dataset and Yt be the class labels. Given
di ∈ Dt, the number of votes for di on class j is

N(d_i, j) = \sum_{k=1}^{K} I(h_k(d_i) = j)    (2)
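As a tiny illustration, assuming each tree exposes a predict method, the vote counts of Eq. (2) and the resulting majority prediction could be computed as:

```python
from collections import Counter

def forest_vote(trees, d):
    """Vote counts N(d, j) of Eq. (2) and the resulting majority prediction."""
    votes = Counter(tree.predict([d])[0] for tree in trees)
    return votes, max(votes, key=votes.get)
```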
4 Experiments
In this section, we conduct experiments to demonstrate the effectiveness of the
hybrid random forest algorithm for classifying text data. Text datasets with var-
ious sizes and characteristics are used in the experiments. The experimental
results show that the hybrid random forest algorithm not only outper-
forms single-tree type random forest algorithms, i.e., C4.5 RF, CART RF
and CHAID RF, in classification accuracy, but also outperforms three other mainstream text categorization methods, i.e., SVM, NB and KNN.
4.1 Datasets
In the experiments, we used six real-world text datasets. These text datasets are
selected due to their diversities in the number of terms or features, the number
of documents, and the number of classes. Their dimensionalities vary from 2000
to 11,465, numbers of instances vary from 918 to 11,162 and the minority class
rate varies from 0.32% to 6.43%. In each text dataset, we randomly select 70% of
documents as the training dataset, and the remaining data as the test dataset.
Detailed information of the six text datasets is listed in Table 1.
The purpose of this experiment is to evaluate the effect of the hybrid random
forest method on accuracy. The six text datasets were analyzed and results
were compared with the other three random forest methods (C4.5 RF, CART RF and CHAID RF). For each text dataset, we ran each random forest algorithm against different sizes of feature subspaces. Since the number of features in these datasets was very large, we started with a subspace of 15 features and increased the subspace by 5 more features each time. For a given subspace size, we built
100 trees for each random forest model. In order to obtain a stable result, we
built 80 random forest models for each subspace size, each dataset and each
algorithm, and computed the averages of the test accuracy as the final result for
comparison.
Fig. 2 shows the plots of the average test accuracy of the random forest models generated with the four methods for different subspace sizes on the six text datasets. For the same number of features, the higher the accuracy, the better
the result.

Fig. 2. Test accuracy changes against the number of features in the subspace on the 6 text datasets

From these figures, we can observe that the hybrid random forest
algorithm consistently performs better than the other three random forest algo-
rithms. The advantages are more obvious in the smaller subspaces. The hybrid
random forest algorithm quickly achieves high accuracy as the subspace size
increases. The other three random forest algorithms require larger subspaces
to achieve a similar accuracy. These results illustrate that the hybrid random
forest algorithm outperforms the other three random forest algorithms in the
classification accuracy results on all the six text datasets.
To further investigate the performance of the hybrid random forest, we com-
puted the average accuracy of the trees in each single-type random forest. This
is compared to the average accuracy of the trees of the same type within the one hybrid random forest. In all comparisons, the subspace size of √M features
was used, where M is the total number of features in the dataset. The results
are shown in Table 2. For example, for tree type C4.5 and dataset Fbis, the
average accuracy of all trees from the random forest built using C4.5 (named
as C4.5 RF) is 0.6379. The average accuracy of all C4.5 trees from the hybrid
random forest (named as Hybrid RF) is 0.6489. It is clearly seen in Table 2 that
tree classifiers of any given type in our hybrid random forest always have higher
average classification accuracy than those using only trees of the same one type.
References
1. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
2. Ho, T.K.: The random subspace method for constructing decision forests. IEEE
Transactions on Pattern Analysis and Machine Intelligence 20(8), 832–844 (1998)
3. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Ma-
teo (1993)
4. Breiman, L., Friedman, J.H., Olshen R.A., Stone, C.J.: Classification and Regres-
sion Trees. Wadsworth and Brooks/Cole Advanced Books and Software, Monterey,
CA (1984)
5. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
6. Biggs, D., Suen, E.: A method of choosing multiway partitions for classification
and decision trees. Journal of Applied Statistics 18(1), 49–62 (1991)
7. Ture, M., Kurt, I., Turhan Kurum, A., Ozdamar, K.: Comparing classification
techniques for predicting essential hypertension. Expert Systems with Applica-
tions 29(3), 583–588 (2005)
8. Klema, J., Almonayyes, A.: Automatic categorization of fanatic texts using random
forests. Kuwait Journal of Science and Engineering 33(2), 1–18 (2006)
9. Begum, N., Fattah, M.A., Ren, F.J.: Automatic text summarization using support
vector machine. International Journal of Innovative Computing Information and
Control 5(7), 1987–1996 (2009)
10. Chen, J.N., Huang, H.K., Tian, S.F., Qu, Y.L.: Feature selection for text clas-
sification with naive bayes. Expert Systems with Applications 36(3), 5432–5435
(2009)
11. Tan, S.: Neighbor-weighted K-nearest neighbor for unbalance text corpus. Expert
Systems with Applications 28(4), 667–671 (2005)
12. Dietterich, T.G.: Machine learning research: Four current directions. AI Maga-
zine 18(4), 97–136 (1997)
13. Yang, Y., Liu, X.: A re-examination of text categorization methods. In: ACM
SIGIR 1999, pp. 42–49 (1999)
14. Han, E.-H(S.), Karypis, G.: Centroid-based Document Classification: Analysis and
Experimental Results. In: Zighed, D.A., Komorowski, J., Żytkow, J.M. (eds.)
PKDD 2000. LNCS (LNAI), vol. 1910, pp. 424–431. Springer, Heidelberg (2000)
15. TREC. Text retrieval conference, http://trec.nist.gov
16. Lewis, D.D.: Reuters-21578 text categorization test collection distribution 1.0
(2011), http://www.research.att.com/~lewis
17. Hersh, W., Buckley, C., Leone, T.J., Hickam, D.: OHSUMED: An interactive re-
trieval evaluation and new large test collection for research. In: SIGIR 1994, pp.
192–201 (1994)
18. Moore, J., Han, E., Boley, D., Gini, M., Gross, R., Hastings, K., Karypis, G.,
Kumar, V., Mobasher, B.: Web page categorization and feature selection using
association rule and principal component clustering. In: WITS 1997 (1997)
19. McCallum, A., Nigam, K.: A comparison of event models for naive bayes text
classification. In: AAAI Workshop 1998, pp. 41–48 (1998)
20. Witten, I.H., Frank, E., Hall, M.A.: Data Mining: Practical Machine Learning
Tools and Techniques, 3rd edn. Morgan Kaufmann, Burlington (2011)
Learning Tree Structure of Label Dependency
for Multi-label Learning
Bin Fu1 , Zhihai Wang1 , Rong Pan2 , Guandong Xu3 , and Peter Dolog2
1 School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
{09112072,zhhwang}@bjtu.edu.cn
2 Department of Computer Science, Aalborg University, Denmark
{rpan,dolog}@cs.aau.dk
3 School of Engineering & Science, Victoria University, Australia
[email protected]
This research has been partially supported by the Fundamental Research Funds for the Central Universities, China (2011YJS223), the National Natural Science Fund of China (60673089, 60973011), and the EU FP7 ICT project M-Eco: Medical Ecosystem Personalized Event-based Surveillance (No. 247829).

1 Introduction

Classification is the task of predicting the possible labels of an unlabeled instance given a set of labeled training instances. Traditionally, it is assumed that each instance is associated with only one label. In practice, however, an instance often has multiple labels simultaneously [1,2]. For example, a report about religion could also be viewed as a politics report. Classification for this kind of instance is called multi-label learning. Nowadays, multi-label learning is receiving more and more attention and has become an important topic.

Various methods have been developed for multi-label learning, and these methods mainly fall into two categories [2]: (1) algorithm adaptation, which extends traditional single-label models so that they can deal with multi-label instances directly. Several adapted traditional models include the Bayesian method, AdaBoost, decision trees, associative rules, k-NN, etc. [3,4,5,6,7]. (2) problem
transformation, which converts a multi-label problem into one or several single-
label problems. Thus traditional single-label classifiers can be used directly with-
out modification. Recently, many methods have been proposed to learn label
dependency as a way of increasing learning performance [1,2,8,9,10,11,12]. How-
ever, most of them do not give an explicit description of label dependency. For
example, classifier chain, a model proposed recently [9], links the labels into a
chain randomly and assumes that each label is dependent on all its preceding
labels in the chain. However, since the labels are linked randomly, a label may in fact be independent of its preceding labels while being dependent on its following labels. Moreover, more complex dependencies, such as tree or DAG-like hierarchical structures of labels, often exist in practice, and more appropriate models are needed to describe them.
Hence, we propose a novel method to address the aforementioned issues. We first quantify the dependencies of pairwise labels, building a complete undirected graph that takes the labels as the set of vertices and the dependency values as edge weights. A tree is then derived to depict the dependency explicitly, so the unrelated labels can be removed for each label and the dependency model is generalized into a tree model. Furthermore, we also use an ensemble technique to build multiple trees to capture the dependency more accurately. The experimental results show that our proposed method is competitive and can further enhance learning performance on most of the datasets.
The remainder of this paper is organized as follows: We review the related
works in section 2. A formal definition of multi-label learning is given in section 3.
In section 4, we describe and analyze our proposed methods in detail. Section 5 is
devoted to the experiment design and result analysis. The last section concludes
this paper and gives some potential issues with further research.
2 Related Work
Many methods have been proposed to cope with multi-label learning by exploiting label dependencies. According to the order of dependency being learned, these methods mainly fall into the following categories.
(1) No label dependency is learned. Basic BR (Binary Relevance) method
decomposed one multi-label problem into multiple independent binary classifi-
cation problems, one for each label [2]. Boutell et al. used BR for scene classi-
fication [13]. Zhang et al. proposed ML-KNN, a lazy method based on BR [7].
Tsoumakas et al. proposed HOMER to deal with a large number of labels [14].
(2) Learning the dependencies of pairwise labels. Hullermeier et al. proposed
the RPC method that learned the pairwise preferences and then ranked the
labels [15]. Furnkranz et al. extended the RPC by introducing a virtual label [16].
Madjarov et al. proposed a two stage architecture to reduce the computational
complexity of the pair-wise methods [17].
(3) Learning the dependencies within multiple labels. Basic LP (Label Power-
set) method treated the whole set of labels as a new single label and learned
dependencies within all them [2]. Tsoumakas et al. proposed the RAkELd
method that divided the label set into disjoint subsets of size k randomly [18].
A stacking-based method was proposed to aggregate binary predictions to form meta-instances [19]. Read proposed PS, which decomposed the instance's labels until a threshold was met [8]. Read et al. proposed the CC (Classifier Chain)
algorithm to link the labels into a chain randomly [9]. Dembcynski et al. pro-
posed PCC, a probabilistic framework that solved the multi-label classification
in terms of risk minimization [10].
A number of models have also been used to depict label dependencies explicitly, including multi-dimensional Bayesian networks, conditional dependency networks, and conditional random fields [11,20,21,22,23]. Dembczynski et al. formally explained the difference between conditional dependency and unconditional dependency [24]. Similar to these methods, our proposed method uses a tree to learn the label dependency explicitly. The difference is that we simply ignore the feature set in the process of constructing the tree, whereas others [11] build their models conditioned on the feature set.
where yk denotes the kth label. Hence we can get the label vector’s posterior
probability by calculating each label’s posterior probability respectively, so the
transformation of Eq.(1) is a kind of problem transformation. A key issue is how
to exactly find the set of dependent labels for each label in order to calculate
the posterior probability more accurately.
LDTS firstly measures the dependency for each pairwise labels li and lj , no-
tated as dependency(li , lj ), thus an undirected complete graph G(L, E) is con-
structed, where the label set L denotes the vertices, and E = {dependency(li, lj ) :
li ∈ L, lj ∈ L} denotes the edges. To determine the dependent labels for each
label, a maximum spanning tree is then derived using Prim algorithm, and each
label is assumed to be dependent on its ancestor labels. A dataset is then created for each label, and its dependent labels are added into the feature set, so these dependencies can be utilized since the classifier is trained on the new feature set. The whole training process is outlined in Algorithm 1.
The dependency between pairwise labels is measured by mutual information,

I(X; Y) = \sum_{x}\sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}

where X and Y are two variables, and x and y are all their possible values.
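A minimal sketch of these two steps, mutual information between label columns and Prim's maximum spanning tree over the resulting weight matrix, is given below; it follows the description above but is not the authors' LDTS code, and the helper names are ours.

```python
import math
from collections import Counter

def mutual_information(col_a, col_b):
    """Mutual information between two label columns (one value per instance)."""
    n = len(col_a)
    joint = Counter(zip(col_a, col_b))
    pa, pb = Counter(col_a), Counter(col_b)
    mi = 0.0
    for (a, b), c in joint.items():
        mi += (c / n) * math.log(c * n / (pa[a] * pb[b]))
    return mi

def maximum_spanning_tree(weights, num_labels, root=0):
    """Prim's algorithm on the complete label graph; returns each label's parent."""
    in_tree = {root}
    parent = {root: None}
    while len(in_tree) < num_labels:
        i, j = max(
            ((u, v) for u in in_tree for v in range(num_labels) if v not in in_tree),
            key=lambda e: weights[e[0]][e[1]],
        )
        parent[j] = i
        in_tree.add(j)
    return parent
```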
The labels are organized using a tree for two purposes. Firstly, the properties of the maximum spanning tree ensure that each label is more dependent on its ancestor labels than on other labels, since they have greater mutual information values. Hence weak dependencies can be further eliminated by assuming each label is only dependent on its ancestor labels. When generating the graph and tree of labels, we simply assume that label dependency is independent of the feature set and only consider the mutual influence among the labels; this is one kind of the unconditional dependency described in [24]. Secondly, various kinds of dependencies, including tree hierarchies and DAGs of dependency, may exist within the label set.
When generating the directed tree in LDTS, the root node is selected ran-
domly. However, selecting a different label will result in a different tree and thus
generating different dependent labels for each label. Another issue is the label
dependency could not be utilized fully, since the dependency of pairwise labels
li , lj calculated here is mutual and useful equally to each other. One possibility
is that a label may also depend on its children labels, but the directed tree does
not allow for this situation. To address such issues, the ensemble learning is used
to generate multiple LDTS classifiers iteratively. In each iteration, the classifier
is trained on a sampling of the original dataset, and the root label is reselected
randomly. Hence each iteration will get a different label tree and combining them
will reduce the influence of the root’s randomness and take full advantage of the
label dependency. We call this extended method ELDTS(Ensemble of LDTS).
This completes the description and analysis of our proposed algorithms. Comparisons with other state-of-the-art algorithms and further analysis are given in the following section.
We take several datasets from multiple domains for the experiments; Table 1 describes them in detail.
(3) Distinct label sets: DLS(D) = |{C|∃(x, C) ∈ D}|. It counts the number of
distinct label sets that appear in the dataset.
As seen from Table 1, these datasets cover many domains, including text categorization, scene classification, emotion analysis, biology, etc. It should be noted that there are no label hierarchies in these datasets, and we use them to examine whether our methods can find more strong label dependencies and gain better performance. More detailed descriptions can be found on the official website of Mulan (http://mulan.sourceforge.net/).
(3) One-error: It calculates how many times the top-ranked label is not a true label of the instance.
\text{One-Error} = \frac{1}{n}\sum_{i=1}^{n}\delta\Big(\arg\min_{l \in L} rank(x_i, l)\Big)    (5)
\text{R-Loss} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{|C_i|\,|\bar{C}_i|}\Big|\{(l_a, l_b) : rank(x_i, l_a) > rank(x_i, l_b),\ (l_a, l_b) \in C_i \times \bar{C}_i\}\Big|    (6)
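For concreteness, a small sketch of the one-error and ranking-loss computations of Eqs. (5) and (6); the rank representation (a dict mapping each label to its rank, 1 being the most confident) is our assumption.

```python
def one_error(rankings, true_sets):
    """Eq. (5): fraction of instances whose top-ranked label is not a true label.

    rankings[i]: dict mapping each label to its rank (1 = most confident)
    true_sets[i]: the set of true labels C_i of instance i
    """
    n = len(rankings)
    wrong = sum(
        1 for ranks, C in zip(rankings, true_sets)
        if min(ranks, key=ranks.get) not in C
    )
    return wrong / n

def ranking_loss(rankings, true_sets):
    """Eq. (6): average fraction of (true, non-true) label pairs ordered wrongly."""
    n = len(rankings)
    total = 0.0
    for ranks, C in zip(rankings, true_sets):
        others = [l for l in ranks if l not in C]
        if not C or not others:
            continue
        bad = sum(1 for a in C for b in others if ranks[a] > ranks[b])
        total += bad / (len(C) * len(others))
    return total / n
```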
These criteria evaluate different aspects of the methods. While Hamming loss and accuracy do not consider the relation between different predictions, the other three criteria take such relations into consideration, since they are based on the ranking of the probabilities predicted for all labels. Because our methods are intended to obtain a more accurate probability for each label by finding stronger label dependencies, each label should be predicted more accurately and the true labels should be given greater probabilities. Therefore, we expect that our method gains better performance under Hamming loss and ranking loss, since Hamming loss examines the predictions of all labels independently and ranking loss focuses on whether the true labels are given greater probabilities than other labels. For the other three criteria, our method may be effective, but they are not what our method optimizes for.
The algorithms used for comparison are listed in Table 2 with their abbreviations. To examine the effect of label dependency, the BR algorithm is used as a baseline since it does not consider label dependency; we then compare our proposed LDTS and ELDTS with the CC and ECC methods to see their effectiveness after eliminating weak dependencies. RAkELd and RAkEL are also used for comparison as other ways of learning label dependency.
The experiments are divided into two parts, according to the two purposes
mentioned in section 4. One part is on the five aforementioned datasets without
label hierarchy to see whether our method can find more strong dependency and
thus gain better performance, the other part is on the dataset rcv1v2, a dataset
in which there exists a tree hierarchy of labels, to see its performance when a
tree structure is learned. Since only one tree exists in rcv1v2, we do not use the
ensemble method on it.
Table 2. The algorithms used for comparison
All algorithms are implemented on the Mulan framework [26], an open platform for multi-label learning. The parameter values are chosen as those used in the paper [9]. For RAkEL, we set k = m/2. For RAkELd, we set
k = 3. For the ensemble algorithms, the number of iterations is 10, and for each
iteration, 67% of the original dataset is sampled with replacement to form the
training dataset. SMO, a support vector machine classifier implemented in Weka
[27], is used as the base classifier. All algorithms are executed 5 times using 10-fold cross validation on all datasets except rcv1v2, with different random seeds 1, 3, 5, 7, 11, respectively, and the final results are the averaged values. For rcv1v2, only 100 attributes are kept, and 10-fold cross validation is used only once since it has a huge number of instances and attributes.
As shown in Tables 3 to 7, our proposed LDTS method performs better on the majority of datasets under the evaluation criteria. It is superior to CC on 3 datasets under Hamming loss, accuracy, one-error, and ranking loss, but inferior under the other metrics. The LDTS algorithm does not always improve performance, or the improvement is not significant. The possible reason is that although the LDTS algorithm can learn the dependency further, it still ignores much useful dependency since it only considers unidirectional dependencies of pairwise labels, especially when the labels are mutually dependent. We expect that ensemble learning that combines different trees can further utilize the label dependency, since the dependency direction between pairwise labels changes in a different tree when a different root is chosen.
To validate the above assumption, we also apply ensemble learning to these algorithms and compare them with each other. As also shown in Tables 3 to 7, our
6 Conclusion
In this paper, a novel kind of approach is proposed to exploit label dependency.
Specifically, the dependency degree of each pair of labels is calculated first,
and then a tree is built to represent the dependency structure of the labels. The methods
assume that dependencies only exist between each label and its ancestor
labels, which reduces the influence of weak dependencies. At the same time,
they also generalize the label dependency into a tree model. Furthermore, we utilize
ensemble learning to learn and aggregate multiple label trees so as to reflect the
label dependencies more fully. The experimental results show that the proposed algorithms
perform better, especially after being boosted by ensemble learning.
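As a concrete illustration of the first two steps, the following is a minimal Python sketch under our own assumptions, not the authors' Mulan-based implementation: pairwise mutual information is computed from a binary label matrix Y, a maximum spanning tree over these weights gives the undirected dependency structure, and choosing a root turns it into a directed label tree. Choosing different roots yields the different trees aggregated by the ensemble version.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, breadth_first_order
from sklearn.metrics import mutual_info_score

def label_dependency_tree(Y, root=0):
    # Y: (n_samples, n_labels) binary label matrix.
    # Returns parent[j] for each label (-1 for the root), i.e. a directed label tree.
    n = Y.shape[1]
    mi = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mi[i, j] = mi[j, i] = mutual_info_score(Y[:, i], Y[:, j])
    # Turn maximum spanning into minimum spanning with strictly positive weights,
    # so that a mutual information of exactly zero is not read as "no edge".
    w = mi.max() + 1.0 - mi
    np.fill_diagonal(w, 0.0)
    tree = minimum_spanning_tree(w).toarray()
    adj = ((tree > 0) | (tree.T > 0)).astype(float)
    order, pred = breadth_first_order(adj, i_start=root, directed=False)
    parent = np.full(n, -1)
    for node in order:
        if node != root:
            parent[node] = pred[node]
    return parent

Under the paper's assumption that each label depends only on its ancestor labels, a per-label classifier could then be conditioned on the labels along the path to the root.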
One potential problem is that using mutual information to measure the dependency
assigns the same value to both labels of a pair, which assumes that the
dependency between two labels is mutual and equal. However, label dependency
may well be directed, so this assumption is often violated in practice. Hence,
how to measure directed label dependency is one of the next directions. Additionally,
how to generalize the tree structure of labels further to a graph or forest structure
is another issue for future work.
References
1. Cheng, W., Hullermeier, E.: Combining Instance-Based Learning and Logistic Re-
gression for Multilabel Classification. Machine Learning 76(2-3), 211–225 (2009)
2. Tsoumakas, G., Katakis, I., Vlahavas, I.: Mining Multi-label Data. In: Oded, M.,
Lior, R. (eds.) Data Mining and Knowledge Discovery Handbook, pp. 667–685.
Springer, New York (2010)
3. McCallum, A.K.: Multi-label Text Classification with a Mixture Model Trained by
EM. In: Proceedings of AAAI 1999 Workshop on Text Learning (1999)
4. Schapire, R.E., Singer, Y.: Boostexter: a Boosting-Based System for Text Catego-
rization. Machine Learning 39(2-3), 135–168 (2000)
5. Clare, A.J., King, R.D.: Knowledge Discovery in Multi-label Phenotype Data. In:
Siebes, A., De Raedt, L. (eds.) PKDD 2001. LNCS (LNAI), vol. 2168, pp. 42–53.
Springer, Heidelberg (2001)
6. Thabtah, F.A., Cowling, P., Peng, Y.: MMAC: a New Multi-class, Multi-label
Associative Classification Approach. In: Proceedings of the 4th International Con-
ference on Data Mining, pp. 217–224 (2004)
7. Zhang, M., Zhou, Z.: ML-KNN: A Lazy Learning Approach to Multi-label Learn-
ing. Pattern Recognition 7(40), 2038–2048 (2007)
8. Read, J.: Multi-label Classification using Ensembles of Pruned Sets. In: Proceed-
ings of the IEEE International Conference on Data Mining, pp. 995–1000. IEEE
Computer Society, Washington, DC (2008)
9. Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier Chains for Multi-label Classi-
fication. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) ECML
PKDD 2009, Part II. LNCS, vol. 5782, pp. 254–269. Springer, Heidelberg (2009)
10. Dembczynski, K., Cheng, W., Hullermeier, E.: Bayes Optimal Multilabel Classifi-
cation via Probabilistic Classifier Chains. In: Proceedings of the 27th International
Conference on Machine Learning, pp. 279–286. Omnipress (2010)
11. Zhang, M., Zhang, K.: Multi-label Learning by Exploiting Label Dependency. In:
Proceedings of the 16th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 999–1000. ACM Press, Washington, DC (2010)
12. Zhang, Y., Zhou, Z.: Multi-label Dimensionality Reduction via Dependence Maxi-
mization. ACM Transactions on Knowledge Discovery from Data 4(3), 1–21 (2010)
13. Boutell, M.R., Luo, J., Shen, X.: Learning Multi-label Scene Classification. Pattern
Recognition 37(9), 1757–1771 (2004)
14. Tsoumakas, G., Katakis, I., Vlahavas, I.: Effective and Efficient Multilabel Classifi-
cation in Domains with Large Number of Labels. In: Proceedings of ECML/PKDD
2008 Workshop on Mining Multidimensional Data, pp. 30–44 (2008)
15. Hullermeier, E., Furnkranz, J., Cheng, W.: Label Ranking by Learning Pairwise
Preferences. Artificial Intelligence 172(16-17), 1897–1916 (2008)
16. Furnkranz, J., Hullermeier, E., Mencia, E.L.: Multilabel Classification via Cali-
brated Label Ranking. Machine Learning 2(73), 133–153 (2008)
17. Madjarov, G., Gjorgjevikj, D., Dzeroski, S.: Two Stage Architecture for Multi-label
learning. Pattern Recognition 45(3), 1019–1034 (2011)
18. Tsoumakas, G., Katakis, I., Vlahavas, I.: Random k-labelsets for Multi-label Classi-
fication. IEEE Transactions On Knowledge and Data Engineering 23(7), 1079–1089
(2011)
19. Tsoumakas, G., Dimou, A., Spyromitros, E.: Correlation-Based Pruning of
Stacked Binary Relevance Models for Multi-Label Learning. In: Proceeding of
ECML/PKDD 2009 Workshop on Learning from Multi-Label Data, Bled, Slovenia,
pp. 101–116 (2009)
20. Gaag, L., Waal, P.: Multi-dimensional Bayesian Network Classifiers. In: Third Eu-
ropean Workshop on Probabilistic Graphical Models, pp. 107–114 (2006)
21. Bielza, C., Li, G., Larrañaga, P.: Multi-dimensional Classification with Bayesian
Networks. International Journal of Approximate Reasoning 52(6), 705–727 (2011)
22. Guo, Y., Gu, S.: Multi-label Classification using Conditional Dependency Net-
works. In: Proceedings of the 22nd International Joint Conference on Artificial
Intelligence, pp. 1300–1305 (2011)
23. Ghamrawi, N., McCallum, A.K.: Collective Multi-label Classification. In: Proceed-
ings of the 2005 ACM Conference on Information and Knowledge Management,
pp. 195–200 (2005)
24. Dembczynski, K., Waegeman, W., Cheng, W.: On Label Dependence in Multi-label
Classification. In: Proceedings of the 2nd International Workshop on Learning From
Multi-label Data, pp. 5–12 (2010)
25. Chow, C.K., Liu, C.N.: Approximating Discrete Probability Distributions with De-
pendency Trees. IEEE Transactions on Information Theory 14(3), 462–467 (1968)
26. Tsoumakas, G., Spyromitros-Xioufis, E., Vilcek, J., Vlahavas, I.: Mulan: A Java
Library for Multi-Label Learning. Journal of Machine Learning Research 12, 2411–
2414 (2011)
27. Witten, I., Frank, E.: Data Mining: Practical Machine Learning Tools and Tech-
niques with Java Implementations. Morgan Kaufmann, San Francisco (2000)
Multiple Instance Learning
for Group Record Linkage
1 Introduction
Within many organisations, data are collected from various sources and through
different channels, and they are stored in databases with different structures
and formats. As organisations collaborate, data often need to be exchanged and
integrated. The objective of such data integration is to identify and match all
records that correspond to the same real-world entity, such as the same customer,
patient, or taxpayer [10]. Record linkage (also known as data matching or entity
resolution) is a key step to effectively mine rich information that is not available
in a single database. This technology has been used in many areas, such as
electronic health record systems, the retail industry, business analytics, fraud
detection, demographic tracking, and government administration [10].
Fig. 1. An example of group (household) record linkage, and the corresponding MIL
setting. Links between individual records correspond to instances while a bag is made
of all links between the records in two groups.
As one application of record linkage, linking historical census records across
time can greatly enhance their value by, for example, enabling the tracking of
households and providing new insights into the dynamic character of social,
economic and demographic change. In recent years, researchers have tried to
link records between census datasets using automatic or semi-automatic methods
[3,15,19,21]. Unfortunately, these attempts have not been very successful
in linking records that correspond to individuals in a household [20].
Several reasons make the linking of historical census records a challenging
undertaking. First, the quality of historical census data is poor, because many
errors and inaccuracies were introduced during the census collection and
digitisation processes [19]. Second, a large portion of records
contain the same or similar values. It is not uncommon to find different people
with the same name, the same age, and living in the same street in one dataset.
Third, the structure of households and their members can change significantly
between two censuses (which were normally collected every five or ten years).
Therefore, simply comparing individual records does not lead to reliable linkage
outcomes. Considering household information in the linkage process can help
overcome this challenge.
In this research, we tackle the problem of linking individual records and house-
holds in historical census data. A household link will likely contain several links
between individual record pairs for its household members. If two households
match, at least one of their record links has to be a match. Conversely,
if two households do not match, none of their record links should be a match.
This is a typical multiple instance learning (MIL) setting. MIL is a supervised
learning method proposed by Dietterich et al. [9]. In MIL, data are represented
as bags, each of which contains a number of instances. In a binary classification setting,
a positive bag contains at least one positive instance (and possibly negative instances),
while a negative bag consists of negative instances only. In the training stage, the class labels are
only available at the bag-level but not at the instance-level. The goal of MIL is
to learn a classifier which can predict the label of an unseen bag. When applying
MIL to the group record linkage problem, group links are treated as bags, and
record links become the instances in these bags. A model can then be learned
to classify a group link as a match or non-match. Figure 1 shows an example of
group linking and its relationship to the MIL setting.
Because an individual record in one census dataset generally has a high similarity
with several records in different households in another dataset, a household
in one census dataset is often linked to several households in the other dataset.
Although such results can be helpful, e.g. in generating family trees, social
scientists are often interested in tracking the majority of household members as
a whole entity over time [20]. This suggests that one-to-one household matches are
needed. To reduce the number of multiple household matches, we can employ
a group linking method [17,18], which generates a household match score for
each household pair. Then the household pairs with the highest match score are
selected as the final match results. Such an approach requires the detection of all
matched record pairs in a household, which is equivalent to classifying instances
within a bag as matches or non-matches. This is a problem that has not been ad-
equately addressed in MIL research [16]. In traditional MIL methods [4,14], when
instance selection is concerned, only the optimal positive instances are explored,
whilst no explicit instance classification solution has been given. Therefore, there
is a gap between MIL and its application to group record linkage.
We extend the above mentioned MIL methods to instance level classification
by grouping negative instances from the training set with an instance to be
classified. This transforms the instance into a bag. We can then employ the bag-
level classification model for explicit instance classification. We show that this
method can effectively classify both household and record links.
This paper makes two contributions. First, we extend the MIL method to
instance classification via bag reconstruction. Second, we propose a practical so-
lution to linking households between historical census datasets by group linkage
using MIL. Our method is general in nature and it can be applied to other record
linkage applications that require groups of records rather than individual records
to be linked.
2 Related Work
In recent years, many methods have been developed for record linkage in the
fields of machine learning, data mining and database systems [10]. Among them,
supervised learning has been intensively investigated. It uses labelled record
pairs with known match status (match or non-match) to learn a classification
model. Bilenko et al. [2] proposed a solution based on support vector machines
(SVM) [23] to compute the similarity between strings. Alternatively, Christen [6]
has constructed inputs for an SVM using a pre-selection step which retrieves
record pairs that correspond with high confidence to matches or non-matches.
These pairs then become the positive and negative training samples for an SVM
classifier. This method can be considered a combination of supervised and
unsupervised techniques.
Group record linkage methods have been developed to process groups rather
than individual records [18]. On et al. [17] defined group similarity from two
aspects, the similarity between matched record pairs and the fraction of matched
record pairs between two groups of records. A group similarity can then be
calculated using a maximum weight bipartite matching.
Multiple instance learning is a paradigm of machine learning that deals with a
collection of data called bags. The original work by Dietterich et al. [9] attempted
to recover an optimal axis-parallel hyper-rectangle in the instance feature space
to separate instances in positive bags from those in negative bags. Departing
from this model, several researchers have extended the framework, such as MI-
SVM [1], DD-SVM [5], SMILE [24], MILES [4] and MILIS [14].
Among these works, we are particularly interested in the Multiple Instance
Learning with Instance Selection (MILIS) method because it allows efficient and
effective instance prototype selection for target concept representation [14]. This
is an important property for (historical) census record linkage, which works on
potentially large numbers of households and their records, and contains signifi-
cant amounts of uncertainty because of low data quality.
MILIS is an extension of MIL using an embedded instance selection (MILES)
method [4]. The general idea of these two methods is to map each bag into a
feature space defined by selected instances, which is based on bag-to-instance
similarity. It generates a feature vector for each bag, whose dimension is the
number of selected instances. In this manner, the MIL problem is converted into
a supervised learning problem, for which a SVM can be used for classification.
The major difference between the MILES and MILIS methods lies in the instance
selection step. In MILES, all instances in the training set are used for feature
mapping, then important features are selected by a 1-norm SVM. Because the
total number of instances in a training set may be very large, MILES can be
very time consuming. MILIS, however, only selects one instance prototype (IP)
from each bag for the embedding. It generates a feature space with much smaller
dimension than MILES. The selection of IPs is done through a two-step optimi-
sation framework, which updates IPs and a SVM classifier iteratively.
The feature mapping is based on a bag-to-instance similarity of the form

s(x_i^*, B) = max_{x_j ∈ B} exp(−γ ||x_j − x_i^*||^2),    (1)

where γ is a feature mapping parameter that controls the similarity. A bag B can then
be represented as the n-dimensional vector

z_B = [ s(x_1^*, B), s(x_2^*, B), . . . , s(x_n^*, B) ]^T,    (2)

where x_i^* are the prototype instances selected from the training set.
As proposed in [14], instance prototypes can be generated by selecting the
least negative instance from each positive bag and the most negative instance
from the negative bag. This requires modelling of the distribution of negative
instances, and computing the probability that an instance has been generated
from the negative population. Given an instance x and its k nearest negative
instances X_k^− from the negative bags, the likelihood of x being negative is

p(x | X^−) = (1/Z) Σ_{j=1}^{k} exp( −β ||x − x_j^−|| ),    (3)

where x_j^− ∈ X^− is the j-th nearest negative neighbour of x, Z is a normalisation
factor, and β is a parameter to control the contribution from training samples.
We then select the instance with the lowest likelihood value from each positive
bag as the positive instance prototypes (PIPs), and the instance with the highest
likelihood value from each negative bag as negative instance prototypes (NIPs).
These PIPs and NIPs form the set of instance prototypes (IPs) used in the
feature mapping. Using Equations 2 and 3, we can represent bags in the training
set in vector form, and then train a SVM classifier by solving the following
unconstrained optimisation problem:

min_w  ||w||^2 / 2  +  C Σ_i max( 1 − y_i (w^T z_i), 0 ),    (4)

where y_i ∈ {1, −1} is the label of bag i, w is the set of parameters that define a
separating hyperplane, and C is the regularisation parameter [23].
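To make the embedding and prototype selection concrete, the following is a minimal Python/NumPy sketch under our reading of Equations 1-3; function and variable names are ours and do not reflect the MILIS implementation of [14].

import numpy as np

def neg_likelihood(x, neg_instances, k=10, beta=1.0):
    # Likelihood of instance x under the negative population (cf. Equation 3),
    # up to the normalisation factor Z.
    d = np.sort(np.linalg.norm(neg_instances - x, axis=1))[:k]
    return np.exp(-beta * d).sum()

def select_prototypes(pos_bags, neg_bags, k=10, beta=1.0):
    # MILIS-style selection: least negative instance of each positive bag (PIP)
    # and most negative instance of each negative bag (NIP).
    neg_pool = np.vstack(neg_bags)
    pips = [b[np.argmin([neg_likelihood(x, neg_pool, k, beta) for x in b])] for b in pos_bags]
    nips = [b[np.argmax([neg_likelihood(x, neg_pool, k, beta) for x in b])] for b in neg_bags]
    return np.array(pips + nips)

def embed_bag(bag, prototypes, gamma=1.0):
    # Map a bag (m x dim) to one feature per prototype via the similarity of
    # Equations 1 and 2: the closest instance in the bag determines each coordinate.
    d = np.linalg.norm(bag[:, None, :] - prototypes[None, :, :], axis=2)
    return np.exp(-gamma * d ** 2).max(axis=0)

The resulting bag vectors z can then be passed to any linear SVM solver to minimise Equation 4.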
Both MILES and MILIS can find the most positive instance in a positive bag.
This is achieved by selecting an instance in the bag that has the lowest likeli-
hood value using Equation 3, because a positive bag should contain at least one
positive instance. However, when it comes to the situation where a bag contains
more than one positive instance, neither method provides an explicit solution to
finding all the positive instances. Although a threshold may be set for the decision,
with instances whose likelihood is lower than the threshold classified as positive
and vice versa, it is practically difficult to find an appropriate threshold.
Here we propose a method for instance classification by bag reconstruction.
We treat each instance in a positive bag as a seed, and group the instance
with negative instances to create new bags. Then we apply the trained bag-
level classifier to these new bags. If a new bag is classified as positive, then
the seed instance is classified as positive. Otherwise, it is classified as negative.
This method is based on the fact that if a seed is negative, the reconstructed
bag consists of negative instances only, and thus will be classified as negative.
Otherwise, the new bag contains one positive instance, therefore, is very likely
to be classified as positive.
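A minimal Python sketch of this test, reusing embed_bag from the sketch above and assuming a trained bag-level classifier with a scikit-learn-style predict method (both assumptions of ours); the negative instances are drawn at random here, which corresponds to the simpler of the two reconstruction strategies described next.

import numpy as np

def classify_seed_instance(seed, neg_pool, bag_clf, prototypes, n_neg=5, rng=None):
    # Group the seed instance with negative training instances to form a new bag,
    # embed it and reuse the bag-level classifier: the seed is labelled positive
    # iff the reconstructed bag is predicted positive.
    rng = np.random.default_rng(rng)
    sampled = neg_pool[rng.choice(len(neg_pool), size=n_neg, replace=False)]
    new_bag = np.vstack([seed[None, :], sampled])
    z = embed_bag(new_bag, prototypes)
    return int(bag_clf.predict(z[None, :])[0])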
We have adopted two strategies for the bag reconstruction, Random and
Greedy, to cope with multiple positive instances in a candidate bag. The first
strategy randomly selects negative instances from the training set and groups
them with the seed. Therefore, both the random negative instances from the
training set and the seed instance contribute to the embedding step in MIL. The
second strategy is built on top of the random option. With randomly selected
negative instances, a greedy algorithm is adopted which reconstructs new bags
and predicts the label of the newly added instance simultaneously. This guarantees
that not only the seed but also the negative instances in the candidate bag
contribute to the embedding step. For each instance x in the candidate bag, we
compute its Hausdorff distance to a bag G that contains only the NIPs x_i^{*−}:

d(G, x) = min_{x_i^{*−} ∈ G} ||x − x_i^{*−}||^2.    (5)
Using this distance measure, we can get the similarity between an instance and
the negative instances in G. By ranking the distances, we can construct a new bag
by sequentially adding the instance with the lowest distance among the remaining
instances of the candidate bag. Evaluating the new bag using the
bag-level SVM classifier, we can get the label of the newly added instance. For a
candidate bag that contains both positive and negative instances, initially, the
added instances are negative. Therefore, the bag is predicted as negative. When
the prediction becomes positive after a new instance is added, the new instance
is classified as positive. We then replace the positive instance with an instance
that has a larger distance, and re-evaluate the new bag. This process continues
until all instances in the candidate bag have been traversed. We summarise this
strategy in Algorithm 1.
Input:
- A set B− containing all negative bags in the training set
- A bag G containing all NIPs
- A candidate bag Bi that contains mi instances xi,j for j = 1, . . . , mi
- Trained bag-level SVM model Φ
- An empty bag B̃
Output:
- Labels yi,j ∈ {1, −1} for instances xi,j ∈ Bi , for j = 1, . . . , mi
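The body of Algorithm 1 is not reproduced above; as an illustration, the following Python sketch implements the greedy strategy as described in the preceding paragraph, reusing embed_bag and the bag-level classifier from the earlier sketches (all names are ours).

import numpy as np

def classify_bag_instances_greedy(candidate_bag, nips, bag_clf, prototypes):
    # Instances are added to a growing bag in increasing order of their distance
    # to the NIP bag G (Equation 5). When the growing bag flips to a positive
    # prediction, the instance just added is labelled positive and replaced;
    # otherwise it is labelled negative and kept in the bag.
    dist_to_G = [np.min(np.linalg.norm(nips - x, axis=1) ** 2) for x in candidate_bag]
    order = np.argsort(dist_to_G)      # most negative-looking instances first
    labels = {}
    current = []                        # the reconstructed bag under construction
    for idx in order:
        trial = current + [candidate_bag[idx]]
        z = embed_bag(np.vstack(trial), prototypes)
        pred = int(bag_clf.predict(z[None, :])[0])
        labels[idx] = 1 if pred == 1 else -1
        if pred != 1:                   # still negative: keep this instance in the bag
            current = trial
    return [labels[j] for j in range(len(candidate_bag))]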
are considered as matches. Thus, the final output may still contain multiple links
per group, but a much smaller number of them.
We performed experiments on one synthetic dataset and six real census datasets
using both the MILES and MILIS methods for the multiple instance learning
step. For the implementation of MILES, we have used the MOSEK¹ system
to solve the linear programming formulation in the one-norm SVMs. To train
the MILIS algorithm, we have used LIBLINEAR [11]. The SVM regularisation
parameter C was set using grid search on the training data. For Equation 3, we
set K = 10 which is the same as in [14]. The feature mapping parameter γ in
Equation 1 and the scale parameter β for the likelihood estimation in Equation 3
are both set to 1. For bag reconstruction in instance classification for the census
data experiments, we have grouped a seed with 5 random negative instances.
This is based on the fact that, on average, a bag in the census datasets contains
5.65 instances, as can be calculated from Table 2.
For comparison purposes, we have implemented an alternative solution for
bag and instance classification based on the group linkage method proposed
by On et al. [18]. This method computes the sum of the similarity scores for
each record pair, and then separates pairs into matches and non-matches by
comparing the similarity sum with a threshold parameter ρ. The decision on the
optimal ρ can be made based on the trade-off between the number of household
pairs with multiple matches or unique matches. The matched households are
then generated by grouping all matched record pairs that belong to the same
matched household.
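A minimal Python sketch of this comparison baseline as described in the text; the data layout, a dict keyed by household and record identifiers, is our own assumption and not the structure used by On et al. [18].

def household_match_by_threshold(pair_similarities, rho):
    # pair_similarities maps (household_a, household_b, record_a, record_b) to a
    # list of similarity scores. A record pair is a match if its similarity sum
    # reaches rho; matched record pairs are then grouped by household pair.
    matched_records = {k: v for k, v in pair_similarities.items() if sum(v) >= rho}
    households = {}
    for (ha, hb, ra, rb) in matched_records:
        households.setdefault((ha, hb), []).append((ra, rb))
    return households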
Table 2. Number of bags and instances extracted from the historical census datasets
On the synthetic dataset, the random bag reconstruction strategy achieved
an accuracy of 92.03 ± 2.21% using the MILES model, and 92.32 ± 2.43% using
the MILIS model, while the greedy extension achieved 92.89 ± 2.89% and
95.50 ± 2.47% on MILES and MILIS, respectively.
We used six census datasets from the district of Rawtenstall in the United King-
dom that were collected in ten-year intervals from 1851 to 1901. These census
data contain twelve attributes per record, including the address, first and family
name, age, gender, relationship to head, industry (occupation), and place of birth
of each individual². Because these data are of low quality, we have cleaned and
standardised them using the Febrl data cleaning and record linkage system [7].
Details of this step can be found in Fu et al. [12]. Table 1 shows the number of
records and households in each dataset.
The record level linkage was also conducted using Febrl. Instead of compar-
ing all possible record pairs between two datasets, we used a traditional block-
ing technique combined with a Double-Metaphone encoding technique to index
(block) the datasets [8]. We used a variety of approximate string comparison
functions to calculate the similarity between individual record pairs following
the approach given by Fu et al. [13]. The similarity scores calculated for a record
pair were concatenated into a vector and then used in the MIL classification
step.
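As an illustration of this step, the following sketch builds a similarity feature vector for one candidate record pair. It is a simplified stand-in for the Febrl comparison functions used in [13]: the field names and the single generic string comparator are assumptions of ours.

from difflib import SequenceMatcher

def record_pair_features(rec_a, rec_b,
                         fields=("first_name", "surname", "age", "birth_place")):
    # One approximate string similarity per attribute; the concatenated scores
    # form the instance vector used in the MIL classification step.
    feats = []
    for f in fields:
        a, b = str(rec_a.get(f, "")), str(rec_b.get(f, ""))
        feats.append(SequenceMatcher(None, a, b).ratio())
    return feats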
We have manually labelled 1,000 household links from the 1871 and 1881
datasets, consisting of 500 matched and 500 non-matched households. To show
the performance of the MILES and MILIS methods on household link classifica-
tion, we performed 100-fold cross validation on the randomly split labelled data,
with half used for training and half for testing.
Both the MILES and MILIS methods show similar performance, achieving
84.54 ± 1.33% and 83.75 ± 1.34% accuracy on household link classification, re-
spectively. In terms of efficiency, MILIS shows superior performance
to MILES. The MILES method took 29.22 ± 6.37 seconds for training, and
² www.uk1851census.com
Table 3. Number of positive bags and instances classified in the different pairs of
historical census datasets using the different methods described in this paper
0.88 ± 0.03 seconds for testing, while MILIS only took 2.17 ± 0.10 and 0.25 ± 0.04
seconds for each task. We did not evaluate the instance classification performance
because the true record pair labels were not available to us.
In the next experiment, we re-trained the MILES and MILIS models using
all the labelled data, and then classified all household and record links from any
pair of consecutive census datasets, e.g. 1851 with 1861, 1861 with 1871, and
so on. Because we were mainly interested in finding record matches in matched
households, the instance classification was only performed on positively classi-
fied bags. As shown in Table 3, MILES and MILIS showed mixed performance
on the bag-level classification, each having generated more positive bags than
the counterpart on some datasets. By comparing the number of matched house-
holds with the total number of households in each census dataset (see Table 1),
one can observe that the results contain multiple matches. This is expected
for two reasons. First, a household may split into several households, for
example, due to the move-out of grown-up children, or two households might
merge when widowed individuals form a new household. Second, there are many
similar record pairs among different households, which may have generated false
positive results. On the instance-level classification, the MILES-based models
have consistently generated more positive instances than the MILIS-based mod-
els. The random bag reconstruction method, on the other hand, has achieved
performance close to that of the greedy bag reconstruction method.
From Table 3, it can be observed that the group linkage method developed
by On et al. [18] has generated many more household and record matches, i.e.
more positive bags and instances, than the proposed MIL-based methods. Statistics
show that the MILES- and MILIS-based methods reduce the number of
matched bags on average by 79.98% and 79.40% respectively, when compared
against the group linkage method described by On et al. [18]. Please note that,
due to the lack of ground truth on household and record pair links, we have not
used traditional measures such as accuracy and F-score for evaluation purposes.
We next applied the group linkage method introduced in Section 3.3 to reduce
the number of multiple household matches, i.e. where one household is matched
with multiple households. Figure 2 shows the performance of the proposed methods
and the thresholding method of [18].
Fig. 2. Number of household matches (0–7000) for each pair of matched years
(1851−1861, 1861−1871, 1871−1881, 1881−1891, 1891−1901), for the proposed methods
and the thresholding method
The results indicate that the thresholding
method generates the highest number of matches, followed by the MILES-based
methods. The MILIS and greedy bag reconstruction combination has generated
the smallest number of matches for all dataset pairs, which makes it the most re-
liable option in finding household matches between census datasets.
Finally, we performed results fusion so as to let the proposed methods vote
for the most consistent household matches. This was performed by selecting
household matches on which all four options, i.e. MILES-random, MILES-greedy,
MILIS-random, and MILIS-greedy, agreed in their decision. These
are the most reliable household matches that can be presented to researchers for
further analysis. The last line in Table 3 shows the number of household matches
after this fusion process.
5 Conclusion
We have introduced a group record linkage method based on multiple instance
learning (MIL), and evaluated this method on real historical census data. In
this method, group links are considered as bags and associated record links are
treated as instances, with only the bag-level labels provided. The multiple in-
stance learning paradigm has provided the group linkage problem with a suitable
supervised learning tool to classify groups, even if the labels of record links are
not available. We have shown the effectiveness of the proposed method on both
synthetic and real historical census data from the UK.
In the future, we plan to extend the instance classification work so that
instances selected for bag reconstruction better characterise the data distribu-
tion, and we will investigate approaches that allow linking records and households
across several census datasets in an iterative fashion. We will also apply our method
to other applications with a similar setting, such as bibliographic databases.
References
1. Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for
multiple-instance learning. In: NIPS, Vancouver, Canada, pp. 561–568 (2003)
2. Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string
similarity measures. In: ACM KDD, Washington, DC, pp. 39–48 (2003)
3. Bloothooft, G.: Multi-source family reconstruction. History and Computing 7(2),
90–103 (1995)
4. Chen, Y., Bi, J., Wang, J.: MILES: Multiple-instance learning via embedded in-
stance selection. IEEE TPAMI 28(12), 1931–1947 (2006)
5. Chen, Y., Wang, J.Z.: Image categorization by learning and reasoning with regions.
Journal of Machine Learning Research 5 (2004)
6. Christen, P.: Automatic Training Example Selection for Scalable Unsupervised
Record Linkage. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds.)
PAKDD 2008. LNCS (LNAI), vol. 5012, pp. 511–518. Springer, Heidelberg (2008)
7. Christen, P.: Development and user experiences of an open source data cleaning,
deduplication and record linkage system. ACM SIGKDD Explorations 11(1), 39–48
(2009)
8. Christen, P.: A survey of indexing techniques for scalable record linkage and dedu-
plication. IEEE TKDE (2011)
9. Dietterich, T.G., Lathrop, R.H., Lozano-Perez, T.: Solving the multiple-instance
problem with axis-parallel rectangles. Artificial Intelligence 89, 31–71 (1997)
10. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: A
survey. IEEE Transactions on Knowledge and Data Engineering 19(1), 1–16 (2007)
11. Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., Lin, C.-J.: LIBLINEAR: A
library for large linear classification. JMLR 9, 1871–1874 (2008)
12. Fu, Z., Christen, P., Boot, M.: Automatic cleaning and linking of historical census
data using household information. In: IEEE ICDM Workshop on DDDM (2011)
13. Fu, Z., Christen, P., Boot, M.: A supervised learning and group linking method for
historical census household linkage. In: AusDM (2011)
14. Fu, Z., Robles-Kelly, A., Zhou, J.: MILIS: Multiple instance learning with instance
selection. IEEE TPAMI 33(5), 958–977 (2011)
15. Fure, E.: Interactive record linkage: The cumulative construction of life courses.
Demographic Research 3, 11 (2000)
16. Li, F., Sminchisescu, C.: Convex multiple instance learning by estimating likelihood
ratio. In: NIPS (2010)
17. On, B.-W., Elmaciogl, E., Lee, D., Kang, J., Pei, J.: Improving grouped-entity
resolution using quasi-cliques. In: IEEE ICDM, Hong Kong, pp. 1008–1015 (2006)
18. On, B.-W., Koudas, N., Lee, D., Srivastava, D.: Group linkage. In: IEEE ICDE,
Istanbul, Turkey, pp. 496–505 (2007)
19. Quass, D., Starkey, P.: Record linkage for genealogical databases. In: ACM KDD
Workshop, Washington, DC, pp. 40–42 (2003)
20. Reid, A., Davies, R., Garrett, E.: Nineteenth century Scottish demography from
linked censuses and civil registers: a ‘sets of related individuals’ approach. History
and Computing 14(1+2), 61–86 (2006)
21. Ruggles, S.: Linking historical censuses: a new approach. History and Comput-
ing 14(1+2), 213–224 (2006)
22. Tan, P., Steinbach, M., Kumar, V.: Introduction to Data Mining. Pearson Addison-
Wesley (2005)
23. Vapnik, V.: The Nature of Statistical Learning Theory. Springer (1995)
24. Xiao, Y., Liu, B., Cao, L., Yin, J., Wu, X.: SMILE: A similarity-based approach
for multiple instance learning. In: IEEE ICDM, Sydney, pp. 309–313 (2010)
Incremental Set Recommendation Based
on Class Differences
1 Introduction
2 Definition
We provide some definitions and notation as follows:
Definition 1 (Item). An item is an atomic entity that represents a character-
istic or feature and is denoted by a lower-case character, a, b, c, . . .. A set of all
items to be considered is denoted by Σ.
In the exercise history example, each item could be the name of an exercise.
Definition 2 (Data Record and Class). A data record is a collection of
items that represent the attributes or characteristics of the target object (we use
D to represent a data record). A class is a name for a set of data records, and
is denoted by α, β, γ, ω or φ. Each data record belongs to only one class.
“Positive” or “Negative” is an example of a class.
Definition 3 (Pattern Set/Class Membership). A pattern set is a set of
pairs, each of which consists of an item set and its weight (natural number). If the
weight values are all the same, they can be omitted. A pattern set is denoted by
Cω, where ω is the class identifier (Cω = {p : w_p | p ∈ 2^Σ, w_p ∈ N}). If q : w_q ∈ Cω
(we simply write q ∈ Cω), then q is called a pattern of class ω.
3.1 Example
Suppose we have pattern sets for classes α and β as follows:
Binary decision diagrams[2,4] (BDDs) are well-known and widely used for ef-
ficiently manipulating large-scale Boolean function data. A BDD is a directed
graph representation of the Boolean function. The reduction rules in BDD con-
sist of “node deletion rule” (delete all redundant nodes with two edges that point
to the same node) and “node sharing rule” (share all equivalent sub-graphs).
ZDDs (Zero-suppressed BDDs) [6,4] are a special type of BDD suitable
for implicitly handling large-scale combinatorial item set data. The reduction
rules of ZDDs are slightly different from those of BDDs; they are illustrated
in Fig. 1 (a).
ZDDs are especially more effective than BDDs for representing “sparse” combinations
such as purchase history data. For instance, sets of combinations
selecting 10 out of 1000 items can be represented by ZDDs up to 100 times more
compactly than by ordinary BDDs.
VSOP (Valued-Sum-Of-Products Calculator)[7] is a program developed for
calculating a combinatorial item set where each product term has a value, speci-
fied by symbolic expressions based on ZDD techniques. The value of each product
can also be considered as a coefficient or a weight for each term. For example, the
formula (5abc + 3ab + 2bc + c) represents a VSOP with four terms abc, ab, bc and
c, each of which is valued as 5, 3, 2, and 1, respectively. VSOP supports numer-
ical arithmetic operations based on Valued-Sum-Of-Products algebra, such as
addition, subtraction, multiplication, division, and numerical comparison. The
details of the algebra and arithmetic operations of a VSOP calculator are de-
scribed in [6,7].
When dealing with integer values in binary coding, we have to consider the
representation of negative numbers. VSOP adopts another binary coding [8] based
on (−2): each bit represents 1 (= (−2)^0), −2 (= (−2)^1), 4 (= (−2)^2), −8 (= (−2)^3),
16 (= (−2)^4), . . .. For example, −12 can be decomposed into (−2)^5 + (−2)^4 + (−2)^2.
In this encoding, each integer coefficient can be uniquely represented.
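A small Python sketch of this base (−2) decomposition (a helper of our own, shown only to illustrate the encoding and verified against the example above):

def to_negabinary(n):
    # Decompose an integer into base (-2) digits, least significant first,
    # so that n == sum(d * (-2)**i for i, d in enumerate(digits)).
    digits = []
    while n != 0:
        n, r = divmod(n, -2)
        if r < 0:            # force the remainder into {0, 1}
            n, r = n + 1, r + 2
        digits.append(r)
    return digits or [0]

# Example from the text: -12 = (-2)^2 + (-2)^4 + (-2)^5
assert to_negabinary(-12) == [0, 0, 1, 0, 1, 1]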
Fig. 1 (b) shows an example of the VSOP representation for abc − ac + 2bc +
c + 2de + 3e − abd. Since ac satisfies the top nodes labeled +1 and −2, the
coefficient of item ac can be calculated by +1 − 2 = −1.
For a given D, we have to output the set of terms from Cα − Cβ that are
modifications of D under the constraints Nadd and Ndelete, and whose coefficients
are equal to or larger than a given integer M.
Fig. 2 shows an example search process on the ZDD structure for Cα − Cβ ,
where Nadd = Ndelete = 1 and D = ac. The search process starts from each of
the top nodes +1, −2 and +4 in turn, and the item sets satisfying the
constraints are extracted for each of them. The pair of numbers
for item addition and deletion is attached to each edge, as shown in Fig. 2. If
the pair does not satisfy condition Nadd or Ndelete , searching along that path is
terminated. For example, since the pair on the edge from c (left side in Fig. 2)
is (0, 2), which does not satisfy the Ndelete condition, searching along the path
below that node is terminated.
Item sets that need to be found under the condition of M = 2 must satisfy
one of (0, 0, 1), (1, 0, 1), (0, 1, 1), or (1, 1, 1) for the top nodes (+1, −2, +4) in
Fig. 2. For example, suppose D = ac, Nadd = 1 and Ndelete = 1. Since the
numbers of added and deleted items w.r.t. bc are 1 and 1 respectively, and
since bc satisfies (0, 1, 1) for the top nodes (+1, −2, +4), bc is a recommendation
candidate under the condition M = 2. As another example, since the numbers
of added and deleted items w.r.t. abc are 1 and 0 respectively, and since
abc satisfies (1, 0, 0) for the top nodes (+1, −2, +4), abc is a recommendation
candidate under the condition M = 1, as is bc described above. In the same way,
c is also a recommendation candidate under the condition M = 1.
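Before looking at the ZDD-based traversal, the same search can be stated as a flat enumeration over the weighted terms. The Python sketch below reproduces the worked example above; it is a reference formulation of ours, not Algorithm 1, and does not exploit the ZDD structure.

def recommend_modifications(weighted_terms, D, n_add, n_delete, M):
    # weighted_terms: dict mapping a frozenset of items to its integer coefficient
    # (a VSOP-like weighted sum of products). Return the terms reachable from D
    # with at most n_add additions and n_delete deletions and coefficient >= M.
    D = frozenset(D)
    results = {}
    for term, coeff in weighted_terms.items():
        added, deleted = term - D, D - term
        if coeff >= M and len(added) <= n_add and len(deleted) <= n_delete:
            results[term] = coeff
    return results

# Worked example from the text: C_alpha - C_beta = abc - ac + 2bc + c + 2de + 3e - abd
poly = {frozenset("abc"): 1, frozenset("ac"): -1, frozenset("bc"): 2,
        frozenset("c"): 1, frozenset("de"): 2, frozenset("e"): 3, frozenset("abd"): -1}
print(recommend_modifications(poly, "ac", n_add=1, n_delete=1, M=2))  # only bc
print(recommend_modifications(poly, "ac", n_add=1, n_delete=1, M=1))  # bc, abc and c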
The naive search algorithm on a ZDD structure is shown in Algorithm 1.
4 Experiments
We first evaluate the efficiency of our approach based on ZDD, using artificial
data, and then we show the examples using actual Internet application data.
for i = 1 to N do
    initialize (path_i = {}, AddList = {}, DeleteList = {})
    L = L + get_candidate(path_i, i, AddList, DeleteList)
end for
merge_output(L, M)   (merge the results for each node (1, . . . , n) and output the
results whose coefficients ≥ M)
in both classes. In this experiment, we used three data set sizes for each class:
1, 5, and 10 million records.
The system was implemented in Java, and the experiments were run on SUSE
Linux Enterprise Server 10 with a quad-core AMD Opteron 3 GHz CPU, and
512 GB RAM. The execution times are shown in Table 1. The times shown are
an average for ten trials. In the tables, “sequential search” in which all items in
each record were ordered and stored in memory was done for comparison.
With the random data sets, there were no substantial differences between
the ZDD-based search and the sequential search. This is because, for such data,
a ZDD data structure is not much more compact or efficient than a flat data structure.
The numbers of ZDD nodes for the random data sets were respectively 2,696,527,
11,789,288, and 20,738,481, growing almost linearly with the data size. In contrast, with the
fixed pattern data sets, there were marked differences between the two searches.
The ZDD-based search was more efficient due to the compact representation by
ZDD. The numbers of ZDD nodes for the fixed pattern data sets were respectively
313,594, 363,112, and 377,979. These ZDDs were relatively small and did not
increase linearly in size, which was reflected in the total execution times.
In actual applications, there are usually fixed patterns in item occurrences for
each class. Although actual application data are not as strongly biased as in our
experiment, we can nevertheless conclude that the ZDD-based search approach
is well suited for actual applications.
These results show that our recommendation framework can suggest possible
candidates for query modifications, in order to obtain more appropriate search
results for users.
In this paper, we have described a new approach to the set recommendation prob-
lem: changes (item addition and deletion) to a set of items are recommended on
the basis of class differences. Since recommendation services are becoming more
and more popular, our framework should be effective for actual applications
rather than simply being used for collaborative filtering. The use of our algo-
rithms, which use the ZDD data structure, results in efficient calculation for
huge data sets, especially when the data is biased, as it generally is in actual
applications. Although we only considered the case of two classes for simplicity,
we can easily consider a case in which there are three or more classes or there
are multiple classification criteria for the input data.
In related work, Dong et al.[3] proposed using an emerging patterns approach
to detecting differences in classes and using a classification framework based on
the emerging patterns. While frequent pattern mining generally cannot detect
the characteristic item pattern for each class, their approaches focus on detecting
item sets that are meaningful for classification. Although their motivation is very
similar to ours, they have not yet reported a recommendation procedure based
on emerging patterns.
Other researchers have developed set recommendation procedures based on
certain constraints such as recommendation costs, orders and other conditions
[12,10]. These procedures are practically applicable to trip advice and univer-
sity course selection, for example. Although we do not assume any constraint
between items as pre-defined knowledge, incorporating such constraints into our
recommendation framework should improve its effectiveness.
Searching under the constraints described in this paper is closely related to
searching based on the Levenshtein distance (edit distance). Efficient algorithms,
such as the dynamic programming approach, have been developed to calculate the
distance, and many implementations, including approximation approaches, have
been introduced [9]. The problem we focused on in this paper is slightly different
from those for the edit distance. Our problem is to find similar items from given
polynomials under the constraints of a limited number of item additions and
deletions with a weight constraint. A comparison of the problems remains for
future work.
Future work also includes extending our results in several directions:
– The search procedure based on the ZDD structure described in this paper
still contains redundant processes. Efficient search strategies such as using a
cache of pre-searched results need to be investigated.
– We assume in this framework that items occur only positively in patterns.
In actual applications, however, “don’t care” plays an important role in
recommendation. We need to investigate ways to incorporate such items.
References
1. Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender sys-
tems: A survey of the state-of-the-art and possible extensions. IEEE Transactions
on Knowledge and Data Engineering 17(6), 734–749 (2005)
2. Bryant, R.E.: Graph-based algorithms for Boolean function manipulation. IEEE
Transactions on Computers 35(8) (August 1986)
3. Dong, G., Zhang, X., Wong, L., Li, J.: CAEP: Classification by Aggregating Emerg-
ing Patterns. In: Arikawa, S., Nakata, I. (eds.) DS 1999. LNCS (LNAI), vol. 1721,
pp. 30–42. Springer, Heidelberg (1999)
4. Knuth, D.E.: The Art of Computer Programming. Bitwise Tricks & Techniques,
vol. 4(1), pp. 117–126. Addison-Wesley (2009)
5. Melville, P., Sindhwani, V.: Recommender Systems. In: Encyclopedia of Machine
Learning, pp. 829–838. Springer (2010)
6. Minato, S.: Zero-Suppressed BDDs for Set Manipulation in Combinatorial Prob-
lems. In: Proc. of 30th ACM/IEEE Design Automation Conference, DAC 1993
(1993)
7. Minato, S.: VSOP (Valued-Sum-of-Products) Calculator for Knowledge Process-
ing Based on Zero-Suppressed BDDs. In: Jantke, K.P., Lunzer, A., Spyratos, N.,
Tanaka, Y. (eds.) Federation over the Web. LNCS (LNAI), vol. 3847, pp. 40–58.
Springer, Heidelberg (2006)
8. Minato, S.: Implicit Manipulation of Polynomials Using Zero-Suppressed BDDs.
In: Proc. of IEEE The European Design and Test Conference (1995)
9. Navarro, G.: A Guided Tour to Approximate String Matching. ACM Computing
Surveys 33(1) (2001)
10. Parameswaran, A., Venetis, P., Garcia-Molina, H.: Recommendation Systems with
Complex Constraints: A Course Recommendation Perspective. Transactions on
Information Systems 29(4) (2011)
11. Su, X., Khoshgoftaar, T.M.: A survey of collaborative filtering techniques. In: Ad-
vances in Artificial Intelligence (2009)
12. Xie, M., Lakshmanan, L.V.S., Wood, P.T.: Breaking out of the box of recommen-
dations: From Items to Packages. In: Proc. of the 4th ACM Conf. on Recommender
Systems (2010)
13. (Rakuten data disclosure), http://rit.rakuten.co.jp/rdr/index_en.html
14. (AOL search logs), http://www.gregsadetsky.com/aol-data
Active Learning for Cross Language Text
Categorization
1 Introduction
Due to the explosive growth of electronic documents in different languages,
there’s an urgent need for effective multilingual text organizing techniques. Cross
Language Text Categorization (CLTC) is the task of assigning class labels to doc-
uments written in a target language (e.g. Chinese) while the system is trained
using labeled examples in a source language (e.g. English). With the technique
of CLTC, we can build classifiers for multiple languages employing the exist-
ing training data in only one language, thereby avoiding the cost of preparing
training data for each individual language.
The basic idea underlying CLTC is that documents in different languages may share
the same semantic information [14], although they are in different representations.
Previous works on CLTC have tried several methods to erase the language barrier
and have shown promising results. However, besides the language barrier, there is another
problem for CLTC: the differences between cultures, which may cause
topic drift between languages. For example, news of the category sports from
China (in Chinese) and US (in English) may concern different topics. The former
may talk more about table tennis and Liu Xiang, while the latter may prefer the NBA
and NFL. As a result, even if the language barrier is perfectly erased, some
knowledge of the target language still cannot be learned from the training data in
the source language. This will inevitably affect the performance of categorization.
To solve this problem, making use of unlabeled data in the target language
is helpful, because such data are often easy to obtain and contain knowledge
of the target language. If we can provide techniques to learn from them, the resulting
classifier is expected to fit the target language better, thereby giving better
categorization performance.
In this paper, we propose an active learning algorithm for cross language text
categorization. Our algorithm makes use of both labeled data in the source lan-
guage and unlabeled data in the target language. The classifier first learns the
classification knowledge from the source language, and then learns the cultural
dependent knowledge from the target language. In addition, we extend our algorithm
to a double viewed form by considering the source and target languages
as two views of the classification problem. Experiments show that our algorithm
can effectively improve cross language classification performance. To the best
of our knowledge, this is the first study to apply active learning to CLTC.
The rest of the paper is organized as follows. First, related works are re-
viewed in Section 2. Then, our active learning approach for CLTC is presented
in Section 3 and its extension to double viewed form is introduced in Section 4.
Section 5 presents the experimental results and analysis. Finally, Section 6 gives
conclusions and future work.
2 Related Work
Several previous works have addressed the task of CLTC. [2] proposes practical
approaches based on machine translation. In their work, two translation strate-
gies are considered. The first strategy translates the training documents into
the target language and the second strategy translates the unlabeled documents
into the source language. After translation, monolingual text categorization is
performed. [12] introduces a model translation method, which transfers classifi-
cation knowledge across languages by translating the model features and takes
into account the ambiguity associated with each word. Besides translation, in
some other studies multilingual models are learned and used for CLTC, such as
the multilingual domain kernels learned from comparable corpora [5] and the
multilingual topic models mined from Wikipedia [8]. Moreover, there are also
some studies of using lexical databases (e.g. WordNet) for CLTC [1].
All the previous methods have somehow solved the language barrier between
training documents and unlabeled documents. But only a few have considered
the culture differences between languages. In these works, authors try to solve
this problem by employing some semi-supervised learning techniques. [12] em-
ploys a self-training process after the model translation. This process applies
classification knowledge that can’t be learnt from the translated training data.
From this observation, we derive the active learning algorithm to improve Ce-c .
Active learning [11] is a form of learning algorithm for situations in which un-
labeled data is abundant but labeling data is expensive. In such a scenario, the
learner actively selects examples from a pool of unlabeled data and asks the
teacher to label them.
In the context of CLTC, we can assume an additional collection Uc of unla-
beled documents in target language (Chinese in this paper) is available, since
the unlabeled data is usually easy to obtain. Our algorithm consists of two steps.
In the first step, we train a classifier using the translated training set TRe-c; this
classifier can be considered an initial learner which has learnt the classification
knowledge transferred from the source language. In the second step, we apply
this classifier to the documents in Uc and select the documents with the lowest
classification certainty. Such documents are expected to contain the most culture
dependent classification knowledge. We label them and add them to the training
set, and the classifier is then re-trained. The second step is repeated
for several iterations, in order to let the classifier learn the culture dependent
knowledge of the target language. Figure 1 illustrates the whole process.
For an instance x with label y ∈ {−1, 1}, the membership probability P(y = 1|x)
can be approximated using a sigmoid function,
P (y = 1|x) = 1/(1 + exp(Af (x) + B)), (1)
where f (x) is the decision function of SVM, A and B are parameters to be
estimated. Maximum likelihood estimation is used to solve for the parameters,
min_{(A,B)}  − Σ_{i=1}^{l} ( t_i log p_i + (1 − t_i) log(1 − p_i) ),

where

p_i = 1 / (1 + exp(A f(x_i) + B)),    (2)

t_i = (N_+ + 1)/(N_+ + 2)  if y_i = 1,    t_i = 1/(N_− + 2)  if y_i = −1.
N+ and N− are the number of positive and negative examples in the training
set. Newton’s method with backtracking line search can be used to solve this
optimization problem [7]. For multi-class SVM, we can obtain the probabilities
through pair coupling [18]. Suppose that r_ij is the binary probability estimate
of P(y = i | y = i or j, x), and p_i is the probability P(y = i|x); the problem can
be formulated as

min_p  (1/2) Σ_{i=1}^{k} Σ_{j : j ≠ i} ( r_{ji} p_i − r_{ij} p_j )^2,    (3)

subject to  Σ_{i=1}^{k} p_i = 1  and  p_i ≥ 0, ∀i,
where k denotes the number of classes. This optimization problem can be solved
using a direct method such as Gaussian elimination, or a simple iterative algo-
rithm [18].
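For illustration, the sigmoid fit of Equations (1)-(2) can be written as the following Python sketch. It uses SciPy's general-purpose minimizer instead of the Newton method with backtracking line search mentioned above, and all function names are ours.

import numpy as np
from scipy.optimize import minimize

def fit_platt(decision_values, y):
    # Fit A, B so that P(y=1|x) = 1 / (1 + exp(A*f(x) + B)), using the smoothed
    # targets t_i of Equation (2). decision_values are SVM outputs f(x_i),
    # y are labels in {-1, +1}.
    f = np.asarray(decision_values, dtype=float)
    y = np.asarray(y)
    n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
    t = np.where(y == 1, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))

    def nll(params):
        A, B = params
        z = A * f + B
        log_p = -np.logaddexp(0.0, z)        # log P(y=1|x)
        log_1mp = z - np.logaddexp(0.0, z)   # log P(y=-1|x)
        return -np.sum(t * log_p + (1.0 - t) * log_1mp)

    A, B = minimize(nll, x0=np.array([0.0, 0.0])).x
    return A, B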
In practice, we employ the toolbox LibSVM [4], which is widely used in data
mining tasks [13]. It implements the above methods for multi-class probability
estimation. After obtaining the class membership probabilities of a document,
we use the best against second best (BVSB) approach [6] to estimate the classification
certainty. This approach has been demonstrated to be effective for multi-class
active learning tasks [6]. It measures the certainty by the difference between
the probability values of the two classes having the highest estimated proba-
bilities. The larger the difference, the higher the certainty is. Suppose c is the
classifier, d is the document to be classified, i and j are the two classes with
highest probabilities, then we calculate the certainty score using
Certainty(d, c) = P (y = i|d, c) − P (y = j|d, c). (4)
Based on the discussions above, we describe the proposed algorithm in Algo-
rithm 1.
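The certainty-based selection loop described above can be sketched in Python as a single selection round. scikit-learn's SVC, which wraps LibSVM and provides Platt-scaled probabilities, stands in for the -b option; the names and the pool/selection interface are our own assumptions, not the paper's implementation.

import numpy as np
from sklearn.svm import SVC

def bvsb_certainty(proba):
    # Best-versus-second-best certainty (Equation 4) for each row of class
    # membership probabilities.
    top2 = np.sort(proba, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def active_learning_iteration(X_train, y_train, X_pool, n_select=10):
    # Train on the current (translated) training set, score the unlabeled pool in
    # the target language, and return the indices of the least certain documents,
    # which would then be labelled by the teacher and added to the training set.
    clf = SVC(kernel="linear", C=1.0, probability=True)
    clf.fit(X_train, y_train)
    certainty = bvsb_certainty(clf.predict_proba(X_pool))
    return np.argsort(certainty)[:n_select], clf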
Fig. 2. Two directions of translation
Fig. 3. Certainty distribution over the unlabeled documents
[3] approach, which labels documents according to the confident classifier and
generates new training examples for the unconfident one. In other words, the two
learners can sometimes teach each other and need not always ask the teacher.
Based on this idea, we present the double viewed active learning algorithm in
the next section.
To measure whether one classifier is more certain than the other, we refer to the
difference between their certainties.
5 Evaluation
5.1 Experimental Setup
We choose English-Chinese as our experimental language pair. English is re-
garded as the source language while Chinese is regarded as the target language.
Since there is no standard evaluation benchmark available for cross language
text categorization, we build a data set from the Internet. This data set contains
42610 Chinese and English news pages during the year 2008 and 2009, which fall
into eight categories: Sports, Military, Tourism, Economy, Information Technol-
ogy, Health, Autos and Education. The main content of each page is extracted
and saved in plain text.
In our experiments, we select 1000 English documents and 2000 Chinese doc-
uments from each class. The set of English documents is treated as the training
set TRe. For the Chinese documents, we first randomly select 1000 documents
from each class to form the test set TSc, and leave the remaining documents as
the additional unlabeled set Uc.
As we use the two views of each document in our algorithm, we employ
Google Translate¹ to translate all Chinese documents into English and
all English documents into Chinese. Then, for all Chinese or Chinese-translated
documents, we segment the text with the tool ICTCLAS² and afterwards
remove the common words. For all English or English-translated documents, the
European Language Lemmatizer³ is applied to restore each word in the text to
its base form, and a stop word list is then used to eliminate common words.
¹ http://translate.google.com
² http://ictclas.org/
³ http://lemmatizer.org/
Each document is transformed into an English feature vector and a Chinese
feature vector in TF-IDF format. The LibSVM package is employed as the
basic classifier. We choose a linear kernel due to its good performance in text
classification tasks. Since we need probabilistic outputs, the -b option of LibSVM
is selected for both training and classification. The cost parameter c is set to 1.0
by default. We use the Micro-Average F1 score as the evaluation measure, as it is a
standard measure used in most previous categorization research [10,17].
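A minimal Python sketch of this monolingual pipeline, under our own naming and with scikit-learn standing in for the direct use of LibSVM described above: TF-IDF features, a linear-kernel SVM with probability outputs (C = 1.0), and Micro-averaged F1 for evaluation.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def train_and_evaluate(train_texts, train_labels, test_texts, test_labels):
    # Vectorise the documents, fit a linear SVM with Platt-scaled probabilities,
    # and report Micro-Average F1 on the test split.
    vec = TfidfVectorizer()
    X_train = vec.fit_transform(train_texts)
    X_test = vec.transform(test_texts)
    clf = SVC(kernel="linear", C=1.0, probability=True)
    clf.fit(X_train, train_labels)
    return f1_score(test_labels, clf.predict(X_test), average="micro")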
We can observe that the initial classifier does not perform well on the Chinese
test set. As the number of iterations increases, the performance is significantly
improved. The certainty-based strategy shows an obvious advantage over the
random strategy. This verifies our assumption that documents with low prediction
certainty usually contain culture dependent classification knowledge and
are therefore most informative for the learner. After 20 iterations, the Micro-Average
F1 measure on the 8000 test documents is increased by about 11 percent,
while the additional cost is to label 200 selected examples.
We set n to 10, which means that in each iteration the 10 examples with the
lowest average certainty are selected and labeled by the teacher, and we set m to 5,
which means each classifier labels 5 examples for the other. The certainty threshold
h is set to 0.8, in order to reduce the error introduced by automatically labeled examples.
In each iteration, the two classifiers are retrained and applied to the test set. We
combine their predictions based on each view to obtain the overall prediction.
Figure 5 shows the micro-average F1 curves of the Chinese, English and overall
classifiers. The curve of the single viewed algorithm is plotted as well for comparison.
We can observe that the English classifier generally has better performance
than the Chinese one; a possible reason is that more noise is introduced in the
Chinese view due to the text segmentation process. The overall classifier has the
highest accuracy, as it combines the information from both views. All three
classifiers generated by the double viewed algorithm outperform the one of the single
viewed algorithm, because in each iteration they get 10 more labeled examples
(each classifier automatically labels 5 examples for the other).
In our double-viewed algorithm, the classifiers learn from each other and from the teacher. We would like to investigate the effect of the two approaches individually. This can be done by setting the parameters n and m in Algorithm 2. We first set n to 10 and m to 0, and then set n to 0 and m to 5. The corresponding curves are shown in Figure 6.
Fig. 5. Micro-F1 curves of the double-viewed algorithm
Fig. 6. Comparison of the effects of the two learning approaches
In this paper, we proposed an active learning algorithm for cross-language text categorization. The proposed method can effectively improve cross-language classification performance by learning from unlabeled data in the target language. In future work, we will incorporate more metrics into the selection strategy of active learning. For instance, can we detect the scenario in which the classifier is highly certain but actually wrong? If such examples can be detected and labeled for retraining, the classifier can be further adapted to the target language.
References
1. Amine, B.M., Mimoun, M.: Wordnet based cross-language text categorization. In:
2007 IEEE/ACS International Conference on Computer Systems and Applications,
pp. 848–855. IEEE (2007)
2. Bel, N., Koster, C.H.A., Villegas, M.: Cross-Lingual Text Categorization. In: Koch,
T., Sølvberg, I.T. (eds.) ECDL 2003. LNCS, vol. 2769, pp. 126–139. Springer,
Heidelberg (2003)
3. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training.
In: Proceedings of the Eleventh Annual Conference on Computational Learning
Theory, pp. 92–100. ACM (1998)
4. Chang, C.-C., Lin, C.-J.: LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 27:1–27:27 (2011), software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
5. Gliozzo, A., Strapparava, C.: Cross language text categorization by acquiring mul-
tilingual domain models from comparable corpora. In: Proceedings of the ACL
Workshop on Building and Using Parallel Texts, pp. 9–16. Association for Com-
putational Linguistics (2005)
6. Joshi, A.J., Porikli, F., Papanikolopoulos, N.: Multi-class active learning for image
classification. In: IEEE Conference on Computer Vision and Pattern Recognition,
CVPR 2009, pp. 2372–2379. IEEE (2009)
7. Lin, H.T., Lin, C.J., Weng, R.C.: A note on Platt's probabilistic outputs for support vector machines. Machine Learning 68(3), 267–276 (2007)
8. Ni, X., Sun, J.T., Hu, J., Chen, Z.: Mining multilingual topics from wikipedia.
In: Proceedings of the 18th International Conference on World Wide Web, pp.
1155–1156. ACM (2009)
9. Rigutini, L., Maggini, M., Liu, B.: An EM based training algorithm for cross-
language text categorization. In: Proceedings of the 2005 IEEE/WIC/ACM Inter-
national Conference on Web Intelligence, pp. 529–535. IEEE (2005)
10. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput-
ing Surveys (CSUR) 34(1), 1–47 (2002)
11. Settles, B.: Active learning literature survey. Computer Sciences Technical Report
1648, University of Wisconsin–Madison (2009)
12. Shi, L., Mihalcea, R., Tian, M.: Cross language text classification by model trans-
lation and semi-supervised learning. In: Proc. EMNLP, pp. 1057–1067. Association
for Computational Linguistics, Cambridge (2010)
13. Tang, J., Liu, H.: Feature selection with linked data in social media. In: SIAM
International Conference on Data Mining (2012)
14. Tang, J., Wang, X., Gao, H., Hu, X., Liu, H.: Enriching short texts representation
in microblog for clustering. Frontiers of Computer Science (2012)
15. Tong, S., Koller, D.: Support vector machine active learning with applications to
text classification. The Journal of Machine Learning Research 2, 45–66 (2002)
16. Wan, X.: Co-training for cross-lingual sentiment classification. In: Proceedings of
the Joint Conference of the 47th Annual Meeting of the ACL and the 4th Inter-
national Joint Conference on Natural Language Processing of the AFNLP, vol. 1,
pp. 235–243. Association for Computational Linguistics (2009)
17. Wang, X., Tang, J., Liu, H.: Document clustering via matrix representation. In:
The 11th IEEE International Conference on Data Mining, ICDM 2011 (2011)
18. Wu, T.F., Lin, C.J., Weng, R.C.: Probability estimates for multi-class classifica-
tion by pairwise coupling. The Journal of Machine Learning Research 5, 975–1005
(2004)
Evasion Attack of Multi-class Linear Classifiers
1 Introduction
Researchers and engineers of information security have successfully deployed systems
using machine learning and data mining for detecting suspicious activities, filtering
spam, recognizing threats, etc. [2,12]. These systems typically contain a classifier that
flags certain instances as malicious based on a set of features. Unfortunately, evaded
malicious instances that fail to be detected are inevitable for any known classifier. To
make matters worse, there is evidence showing that adversaries have investigated sev-
eral approaches to evade the classifier by disguising malicious instances as normal instances. For example, spammers can add unrelated words, sentences or even paragraphs
to the junk mail for avoiding detection of the spam filter [11]. Furthermore, spammers
can embed the text message in an image. By adding varied background and distorting
the image, the generated junk message can be difficult for OCR systems to identify but
easy for humans to interpret [7]. As a reaction to adversarial attempts, authors of [5]
employed a cost-sensitive game theoretic approach to preemptively adapt the decision
boundary of a classifier by computing the adversary’s optimal strategy. Moreover, sev-
eral improved spam filters that are more effective in adversarial environments have been
proposed [7,3].
The ongoing war between adversaries and classifiers pressures machine learning researchers to reconsider the vulnerability of classifiers in adversarial environments. The problem of evasion attack was posed, and a query algorithm for evading linear classifiers presented, in [10]. Given a malicious instance, the goal of the adversary is to find a disguised instance with minimal cost that deceives the classifier. Recently, the evasion problem has been extended to binary convex-inducing classifiers [13].
We continue to investigate the vulnerability of classifiers to the evasion attack and generalize this problem to the family of multi-class linear classifiers, e.g. linear support vector machines [4,6,9]. Multi-class linear classifiers have become one of the most promising learning techniques for large sparse data with a huge number of instances and features. We propose an adversarial query algorithm for searching for minimal-cost
disguised instances. We believe that revealing such weaknesses of multi-class classifiers is the only way to fix them in the future. The contributions of this paper are:
1. We generalize the problem of evasion attack to the multi-class linear classifier, where the instance space is divided into multiple convex sets.
2. We prove that an effective evasion attack based on linear probing is feasible under certain assumptions on the adversarial cost. A description of the vulnerability of multi-class linear classifiers is presented.
3. We propose a query algorithm for disguising an adversarial instance as any other class with minimal cost. Experiments on two real-world data sets show the effectiveness of our algorithm.
2 Problem Setup
Let X = {(x_1, . . . , x_D) ∈ R^D | L ≤ x_d ≤ U for all d} be the feature space. Each component of an instance x ∈ X is a feature bounded by L and U, which we denote as x_d. A basis vector of the form (0, . . . , 0, 1, 0, . . . , 0), with a 1 only at the dth feature, is termed δ_d. We assume that the feature space representation is known to the adversary; thus the adversary can query any point in X.
where 0 < e_d < ∞ represents the cost coefficient the adversary associates with the dth feature, allowing that some features may be more important than others. In particular, given the adversarial instance x^A, the function a(x, x^A) measures the different costs of using some instances as compared to others. Moreover, we use B(y, C) = {x ∈ X | a(x, y) ≤ C} to denote the cost ball centered at y with cost no more than C.
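Since Equation (2) itself is not reproduced above, the following sketch assumes a feature-weighted L1 cost, a common choice in this line of work, purely to illustrate how the cost coefficients e_d and the cost ball B(y, C) are used; the exact form used by the authors may differ.

import numpy as np

def adversarial_cost(x, y, e):
    # Assumed weighted L1 adversarial cost; e[d] is the cost coefficient of
    # feature d (this particular form is an assumption, not taken from (2)).
    return float(np.sum(e * np.abs(np.asarray(x) - np.asarray(y))))

def in_cost_ball(x, y, e, C):
    # Membership test for the cost ball B(y, C) = {x : a(x, y) <= C}.
    return adversarial_cost(x, y, e) <= C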
In generalizing the work [10], we alter the definition of minimal adversarial cost (MAC). Given a fixed classifier f and an adversarial cost function a, we define the MAC of class k with respect to an instance y to be the value

MAC(k, y) = min_{x ∈ X_k} a(x, y),   k ≠ f(y).

Note that in practice the exact decision boundary is unknown to the adversary; thus finding the exact IMAC becomes an infeasible task. Nonetheless, it is still tractable to approximate IMAC by finding ε-IMAC, which is defined as follows:

ε-IMAC(k, y) = {x ∈ X_k | a(x, y) ≤ (1 + ε) · MAC(k, y), k ≠ f(y), ε > 0}.

That is, every instance in ε-IMAC(k, y) has adversarial cost no more than a factor of (1 + ε) of MAC(k, y). The goal of the adversary now becomes finding ε-IMAC(k, x^A) for all classes k ≠ f(x^A) while keeping ε as small as possible.
Lemma 1 indicates that the classifier f decomposes R^D into K convex polytopes. Following the notations and formulations introduced in [8], we represent a hyperplane H_i as the boundary of a half-space ∂H_i^+; i.e. H_i = ∂H_i^+ = {x ∈ X | w_i^T x = b_i}. Let X_k = ∩_{p=1}^{P_k} H_p^+, where {H_1^+, . . . , H_{P_k}^+} is irredundant to X_k.¹ Let H_k = {H_1^+, . . . , H_{P_k}^+} be an irredundant set that defines X_k; then X_k ⊂ int X provided that no half-space in H_k is defined by (3). Moreover, we define the pth facet of X_k as F_{kp} = H_p ∩ X_k, and the convex surface of X_k as ∂X_k = ∪_{p=1}^{P_k} F_{kp}.
Theorem 1. Let y be an instance in X and k ∈ K \ f (y). Let x be an instance in
IMAC(k, y) as defined in Section 2.3. Then x must be attained on the convex surface
∂Xk .
Proof. We first show the existence of IMAC(k, y). By Lemma 1, Xk defines a feasible
region. Thus minimizing a(x, y) on Xk is a solvable problem. Secondly, Xk is bounded
in each direction of the gradient of a(x, y), which implies that IMAC(k, y) exists.
We now prove that x must lie on ∂X_k by contrapositive. Assume that x is not on ∂X_k and is thus an interior point; i.e. x ∈ int X_k. Let B(y, C) denote the ball centered at y with cost no more than a(x, y). Due to the convexity of X_k and B(y, C), we have int X_k ∩ int B(y, C) ≠ ∅. Therefore, there exists at least one instance in X_k with cost less than a(x, y), which implies that x is not in IMAC(k, y).
¹ Let C be a convex polytope such that C = ∩_{i=1}^n H_i^+. The family {H_1^+, . . . , H_n^+} is called irredundant to C provided that ∩_{1≤j≤n, j≠i} H_j^+ ≠ C for each i = 1, . . . , n.
Theorem 1 restricts the search for IMAC to the convex surface. In particular, when the cost coefficients are equal, i.e. e_1 = · · · = e_D, we can show that searching in all axis-aligned directions gives at least one IMAC.
Theorem 2. Let y be an instance in X such that X_{f(y)} ⊂ int X. Let P be the number of facets of X_{f(y)} and F_p be the pth facet, where p ∈ {1, . . . , P}. Let G_d = {y + θδ_d | θ ∈ R}, where d ∈ {1, . . . , D}. Let Q = {G_d ∩ F_p | d = 1, . . . , D, p = 1, . . . , P}, in which each element differs from y on only one dimension. If the adversarial cost function defined in (2) has equal cost coefficients, then there exists at least one x ∈ Q such that x is in IMAC(f(x), y).
Proof. Let H_p be the hyperplane defining the pth facet F_p. Consider all the points of intersection of the lines G_d with the hyperplanes H_p; i.e. I = {G_d ∩ H_p | d = 1, . . . , D, p = 1, . . . , P}. Let x = argmin_{x ∈ I} a(x, y). Then x is our desired instance.
We prove that x ∈ Q by contrapositive. Suppose x ∉ Q; due to the convexity of X_{f(y)}, the line segment [x, y] intersects ∂X_{f(y)} at a point on another facet. Denote this point as z; then z differs from y on only one dimension and a(z, y) < a(x, y).
Next, we prove that x is in IMAC(f(x), y) by contrapositive. Let B(y, C) denote the regular cost ball centered at y with cost no more than a(x, y). That is, each vertex of the cost ball has the same distance C from y. Suppose x is not in IMAC(f(x), y); then there exists z ∈ X_{f(x)} ∩ int B(y, C). By Theorem 1, z and x must lie on the same facet, which is defined by a hyperplane H*. Let Q* be the set of intersection points of H* with the lines G_1, . . . , G_D; i.e. Q* = {G_d ∩ H* | d = 1, . . . , D}. Then there exists at least one point v ∈ Q* such that v ∈ int B(y, C). Due to the regularity of B(y, C), we have a(v, y) < a(x, y).
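The next sketch illustrates the consequence of Theorem 2 under equal cost coefficients: probing only along axis-aligned lines through y already yields a minimal-cost boundary crossing. It is a naive fixed-step probe for illustration only; the names f and step, and the stopping rule, are assumptions, and the algorithm in Section 4 uses binary search instead.

import numpy as np

def axis_probe(f, y, L, U, step):
    # For every dimension d, walk away from y in both axis directions until the
    # class label returned by the black-box classifier f changes, and keep the
    # crossing with the smallest displacement (equal coefficients: cost = |theta|).
    y = np.asarray(y, dtype=float)
    base = f(y)
    best, best_cost = None, np.inf
    for d in range(len(y)):
        for sign in (+1.0, -1.0):
            theta = step
            while L <= y[d] + sign * theta <= U:
                x = y.copy()
                x[d] += sign * theta
                if f(x) != base:            # crossed into another class
                    if theta < best_cost:
                        best, best_cost = x, theta
                    break
                theta += step
    return best, best_cost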
We now define special convex sets for approximating ε-IMAC near the convex surface. Given ε > 0, the interior parallel body of X_k is P_ε^-(k) = {x ∈ X_k | B(x, ε) ⊆ X_k}, and the corresponding exterior parallel body is defined as P_ε^+(k) = ∪_{x ∈ X_k} B(x, ε). Moreover, the interior margin of X_k is M_ε^-(k) = X_k \ P_ε^-(k) and the corresponding exterior margin is M_ε^+(k) = P_ε^+(k) \ X_k. By relaxing the search scope from the convex surface to a margin of width ε, Theorem 1 and Theorem 2 immediately imply the following results.
Corollary 1. Let y be an instance in X and k ∈ K \ f(y). For all ε > 0 such that M_ε^-(k) ≠ ∅, ε-IMAC(k, y) ⊆ M_ε^-(k).
Corollary 2. Let y be an instance in X and ε be a positive number such that P_ε^+(f(y)) ⊂ int X. Let P be the number of facets of P_ε^+(f(y)) and F_p be the pth facet, where p ∈ {1, . . . , P}. Let G_d = {y + θδ_d | θ ∈ R}, where d ∈ {1, . . . , D}. Let Q = {G_d ∩ F_p | d = 1, . . . , D, p = 1, . . . , P}, in which each element differs from y on only one dimension. If the adversarial cost function defined in (2) has equal cost coefficients, then there exists at least one x ∈ Q such that x is in ε-IMAC(f(x), y).
Corollary 1 and Corollary 2 point out an efficient way of approximating ε-IMAC with linear probing, which forms the backbone of our proposed algorithm in Section 4.
Finally, we consider the vulnerability of a multi-class linear classifier to linear prob-
ing. The problem arises of detecting convex polytopes in X with a random line. As one
can easily scale any hypercube to a unit hypercube with edge length 1, our proof is
restricted to the unit hypercube in RD .
When E Z is small, a random line intersects only a small number of decision regions and not much information is leaked to the adversary. Thus, a robust multi-class classifier that resists linear probing should have a small value of E Z.
Theorem 3. Let f be the multi-class linear classifier defined in (1); then the expectation of Z is bounded by 1 < E Z < 1 + √2(K − 1)/(2D).
Z = |F ∩ G|,
Next, we compute E Σ_{n=1}^N |F ∩ G_n|. Let M = |F|. Due to the convexity of X_k, any given line can hit a facet no more than once. Therefore, we have

E Σ_{n=1}^N |F ∩ G_n| = E Σ_{n=1}^N Σ_{m=1}^M |F_m ∩ G_n|
                      = Σ_{m=1}^M E |{n ∈ {1, . . . , N} | F_m ∩ G_n ≠ ∅}|
                      = Σ_{m=1}^M μ(G; G ∩ F_m ≠ ∅).   (7)
We remark that Theorem 3 implies a way to construct a robust classifier that resists evasion algorithms based on linear probing, e.g. by jointly minimizing (9) and the error function in the training procedure.
Based on these theoretical results, we present an algorithm for deceiving the multi-class linear classifier by disguising the adversarial instance x^A as other classes with approximately minimal cost, while issuing polynomially many queries in the number of features, the range of features, the number of classes and the number of iterations.
An outline of our searching approach is presented in Algorithms 1 to 3. We use a
K × D matrix Ψ for storing ISMAC of K classes and an array C of length K for
² The surface area in R^D is the (D − 1)-dimensional Lebesgue measure.
the corresponding adversarial cost of these instances. The scalar value W represents the maximal cost of all optimum instances. Additionally, we need a K × I matrix T for storing the searching path of optimum instances in each iteration. The kth row of matrix Ψ is denoted as Ψ[k, :]. We consider Ψ, T, C, W as global variables so they are accessible in every scope. After initializing the variables, our main routine MLCEvading (Algorithm 1, line 4) first invokes MDSearch (Algorithm 2) to search for instances that are close to the starting point x^A in all classes and saves them to Ψ. Then it repetitively selects instances from Ψ as new starting points and searches for instances with lower adversarial cost (Algorithm 3, lines 6–7). The whole procedure iterates I times. Finally, we obtain Ψ[k, :] as the approximation of ε-IMAC(k, x^A).
We begin by describing RBSearch in Algorithm 3, a subroutine for searching for instances near decision boundaries along dimension d. Essentially, given an instance x, an upper bound u and a lower bound l, we perform a recursive binary search on the line segment {x + θδ_d | l ≤ θ ≤ u} through x. The effectiveness of this recursive algorithm relies on the fact that it is impossible to have x^u and x^l in the same class while x^m is in another class. In particular, if the line segment meets an exterior margin M_ε^+(k) and ε-IMAC(k, x) is in the intersection, then RBSearch finds an ε-IMAC. Otherwise, when the found instance y yields a lower adversarial cost than the instance in Ψ does, Algorithm 4 is invoked to update Ψ. The time complexity of RBSearch is O((u − l)/ε).
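A minimal sketch of such a recursive boundary search is given below, assuming f is the black-box classifier that returns a label for any queried point; the argument order and bookkeeping differ from Algorithm 3, and the names are illustrative.

def rb_search(f, x, d, l, u, eps, found):
    # Recursive binary search along dimension d of x on the segment [l, u],
    # recording one instance for each class label observed at a probed
    # midpoint. `found` maps class label -> discovered instance.
    def at(theta):
        z = list(x)
        z[d] = theta
        return z

    if u - l <= eps:
        return
    m = (l + u) / 2.0
    xl, xm, xu = at(l), at(m), at(u)
    cl, cm, cu = f(xl), f(xm), f(xu)
    found.setdefault(cm, xm)
    if cm == cl:                      # lower half is one class: boundary above m
        rb_search(f, x, d, m, u, eps, found)
    elif cm == cu:                    # upper half is one class: boundary below m
        rb_search(f, x, d, l, m, eps, found)
    else:                             # midpoint differs from both ends
        rb_search(f, x, d, l, m, eps, found)
        rb_search(f, x, d, m, u, eps, found)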
We next describe Algorithm 2. Given x, which is known as ISMAC(k, x^A), and the current maximum cost W, the algorithm iterates (D − 1) times on P_ε^+(X_{f(x)}) to find instances with cost lower than W. Additionally, we introduce two heuristics to prune unnecessary queries. First, the dimension searched in the previous iteration of x is omitted. Second, we restrict the upper and lower bounds of the searching scope on each dimension. Specifically, knowing W and a(x, x^A) = c, we only allow RBSearch to find instances in [x_d − (W − c)/e_d, x_d + (W − c)/e_d], since any instance lying outside this scope gives adversarial cost higher than W. This pruning is significant when we have obtained an ISMAC for every class. Special attention must be paid to the searched dimensions of x (see Algorithm 2, lines 5–7). Namely, if d is a dimension searched before the (i − 1)th iteration, then we relax the searching scope to [x^A_d − (W − c)/e_d, x^A_d + (W − c)/e_d] so that no low-cost instances will be missed.
(Fragment of Algorithm 2, MDSearch)
7   else u = min{U, x^A_d + δ}
8   x^u ← x, x^l ← x
9   x^u_d ← u, x^l_d ← l
10  if f(x^u) ≠ k then RBSearch(x_d, u, x, d, i, ε)
11  if f(x^l) ≠ k then RBSearch(l, x_d, x, d, i, ε)

(Fragment of Algorithm 3, RBSearch)
8   if f(x^m) = f(x^l) then
9       RBSearch(m, u, x, d, i, ε)
10  else if f(x^m) = f(x^u) then
11      RBSearch(l, m, x, d, i, ε)
12  else
13      RBSearch(l, m, x, d, i, ε)
14      RBSearch(m, u, x, d, i, ε)
Proof. Follows from the correctness of the algorithm and the fact that the time complexity of RBSearch is O((u − l)/ε).
5 Experiments
We demonstrate the algorithm³ on two real-world data sets, the 20-newsgroups⁴ and the 10-Japanese-female-faces⁵ data sets. On the newsgroups data set, the task of the adversary is to evade a text classifier by disguising a commercial spam message as a message on another topic. On the face data set, the task of the adversary is to deceive the classifier by disguising a suspect's face as that of an innocent person. We employ the LIBLINEAR [6] package to build the target multi-class linear classifiers, which return the labels of queried instances. The cost coefficients are set to e_1 = · · · = e_D = 1 for both tasks. For the ground-truth solution, we
directly solve the optimization problem with linear constraints (3) and (4) by using the
models' parameters. We then measure the average empirical ε over the (K − 1) classes, which is defined as

ε = (1/(K − 1)) Σ_{k ≠ f(x^A)} ( C[k]/MAC(k, x^A) − 1 ),

where C[k] is the adversarial cost of the disguised instance of class k. Evidently, a small ε indicates a better approximation of IMAC.
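For clarity, the empirical measure defined above can be computed as follows; costs, mac and adversarial_class are hypothetical names for the per-class costs C[k], the ground-truth MAC values and the adversarial class label.

def average_empirical_epsilon(costs, mac, adversarial_class):
    # Mean relative gap between the cost found by the algorithm and the
    # ground-truth MAC, over all classes other than the adversarial class.
    gaps = [costs[k] / mac[k] - 1.0 for k in costs if k != adversarial_class]
    return sum(gaps) / len(gaps)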
The training data used to build the newsgroup classifier consists of 7,505 documents, which are partitioned evenly across 20 different newsgroups. Each document is represented as a 61,188-dimensional vector, where each component is the number of occurrences of a word. The accuracy of the classifier on the training data is 100% for every class. We set the category "misc.forsale" as the adversarial class. That is, given a random document in "misc.forsale", the adversary attempts to disguise this document as coming from another category, e.g. "rec.sport.baseball". The parameters of the algorithm are K = 20, L = 0, U = 100, I = 10, ε = 1. The adversary is restricted to query at most 10,000 times. The adversarial cost of each class is depicted in Fig. 1 (left).
The training data contains 210 gray-scale images of 7 facial expressions (each with 3 images) posed by 10 Japanese female subjects. Each image is represented by a 100-dimensional vector using principal components. The accuracy of the classifier on the training data is 100% for every class. We randomly pick a subject as an imaginary suspect. Given a face image of the suspect, the adversary camouflages this face to make it be classified as other subjects. The parameters of the algorithm are K = 10, L = −10^5, U = 10^5, I = 10, ε = 1.
³ A Matlab implementation is available at http://home.in.tum.de/~xiaoh/pakdd2012-code.zip
⁴ http://people.csail.mit.edu/jrennie/20Newsgroups/
⁵ http://www.kasrl.org/jaffe.html
[Figure 1 plots: box plots of the adversarial cost per target class; the left panel's x-axis lists the 19 target newsgroups (spo.hockey, comp.mswin, comp.graph, spo.bball, comp.ibmpc, comp.mac, comp.xwin, alt.atheism, sci.med, rec.autos, rec.motor, sci.elec, sci.space, soc.relg, pol.guns, pol.meast, pol.misc, relg.misc, sci.crypt); the right panel's y-axis is scaled by 10^3.]
Fig. 1. Box plots of the adversarial cost of the disguised instance of each class. (Left) On the 20-newsgroups data set, we consider "misc.forsale" as the adversarial class. Note that the feature values of the instances are non-negative integers, as they represent the numbers of words in the document. Therefore, the adversarial cost can be interpreted as the number of modified words in the disguised document compared to the original document from "misc.forsale". The value of ε for the 19 classes is 0.79. (Right) On the 10-Japanese-female-faces data set, we randomly select a subject as the suspect. The box plot shows the adversarial cost of camouflaging suspicious faces as other subjects. The value of ε for the 9 classes is 0.51. A more illustrative result is depicted in Fig. 2.
Fig. 2. Disguised faces given by our algorithm to defeat a multi-class face recognition system. The original faces (with neutral expression) of the 10 females are depicted in the first row, where the leftmost one is the imaginary suspect and the remaining 9 people are innocents. From the second row to the sixth row, faces of the suspect with different facial expressions are fed to the algorithm (see the first column). The output disguised faces from the algorithm are visualized in the right-hand image matrix. Each row corresponds to the disguised faces of the input suspicious face on the left. Each column corresponds to an innocent.
The adversary is restricted to query at most 10,000 times. The adversarial cost of each class is depicted in Fig. 1 (right). Moreover, we visualize the disguised faces in Fig. 2. Observe that many disguised faces are similar to the suspect's face under human interpretation, yet they are deceptive for the classifier. This visualization directly demonstrates the effectiveness of our algorithm.
It has not escaped our notice that an experienced adversary with certain domain knowledge can reduce the number of queries by carefully selecting the cost function and employing heuristics. Nonetheless, the goal of this paper is not to design real attacks but rather to examine the correctness and effectiveness of our algorithm so as to understand the vulnerabilities of classifiers.
6 Conclusions
Adversary and classifier are the Yin and Yang of information security. We believe that understanding the vulnerability of classifiers is the only way to develop resistant classifiers in the future. In this paper, we showed that multi-class linear classifiers are vulnerable to the evasion attack and presented an algorithm for disguising the adversarial instance. Future work includes generalizing the evasion attack problem to the family of general multi-class classifiers with nonlinear decision boundaries.
References
1. Ball, K.: Cube slicing in Rn . Proc. American Mathematical Society 97(3), 465–473 (1986)
2. Barbara, D., Jajodia, S.: Applications of data mining in computer security. Springer (2002)
3. Bratko, A., Filipič, B., Cormack, G., Lynam, T., Zupan, B.: Spam filtering using statistical
data compression models. JMLR 7, 2673–2698 (2006)
4. Crammer, K., Singer, Y.: On the learnability and design of output codes for multiclass prob-
lems. Machine Learning 47(2), 201–233 (2002)
5. Dalvi, N., Domingos, P., et al.: Adversarial classification. In: Proc. 10th SIGKDD, pp. 99–
108. ACM (2004)
6. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A library for large
linear classification. JMLR 9, 1871–1874 (2008)
7. Fumera, G., Pillai, I., Roli, F.: Spam filtering based on the analysis of text information em-
bedded into images. JMLR 7, 2699–2720 (2006)
8. Grünbaum, B.: Convex polytopes, vol. 221. Springer (2003)
9. Keerthi, S., Sundararajan, S., Chang, K., Hsieh, C., Lin, C.: A sequential dual method for
large scale multi-class linear svms. In: Proc. 14th SIGKDD, pp. 408–416. ACM (2008)
10. Lowd, D., Meek, C.: Adversarial learning. In: Proc. 11th SIGKDD, pp. 641–647. ACM
(2005)
11. Lowd, D., Meek, C.: Good word attacks on statistical spam filters. In: Proc. 2nd Conference
on Email and Anti-Spam, pp. 125–132 (2005)
12. Maloof, M.: Machine learning and data mining for computer security: methods and applica-
tions. Springer (2006)
13. Nelson, B., Rubinstein, B.I.P., Huang, L., Joseph, A.D., Hon Lau, S., Lee, S., Rao, S., Tran,
A., Tygar, J.D.: Near-optimal evasion of convex-inducing classifiers. In: Proc. 13th AISTATS
(2010)
14. Rockafellar, R.: Convex analysis, vol. 28. Princeton Univ. Pr. (1997)
15. Santaló, L.: Integral geometry and geometric probability. Cambridge Univ. Pr. (2004)
Foundation of Mining Class-Imbalanced Data
1 Introduction
In data mining, datasets are often imbalanced (or class-imbalanced); that is, the number of examples of one class (the rare class) is much smaller than that of the other class (the majority class).¹
This problem happens often in real-world applications of data mining. For example, in medical diagnosis of a certain type of cancer, usually only a small number of the people being diagnosed actually have the cancer; the rest do not. If the cancer is regarded as the positive class, and non-cancer (healthy) as negative, then the positive examples may make up only 5% of the whole dataset collected. Similarly, the number of fraudulent actions is much smaller than that of normal transactions in credit-card usage data. When a classifier is trained on such an imbalanced dataset, it often shows a strong bias toward the majority class, since the goal of many standard learning algorithms is to minimize the overall prediction error rate. Thus, by simply predicting every example as the majority class, the classifier can still achieve a very low error rate on a class-imbalanced dataset with, for example, 2% rare class.
When mining the class-imbalanced data, do we always get poor performance (e.g.,
100% error) on the rare class? Can the error of the rare class (as well as the majority
class) be bounded? If so, is the bound sensitive to the class imbalance ratio? Although
¹ In this paper, we only study binary classification.
the issue of class imbalance has received extensive study [9,3,2,7,4,5], as far as we know, no previous work has been done to answer those questions.
In fact, PAC learning (Probably Approximately Correct learning) [8,6] is an appropriate model for studying bounds on classification performance. The traditional PAC learning model studies the learnability of a general concept for certain kinds of learners (such as the consistent learner and the agnostic learner), and answers the question of how many examples are sufficient to guarantee a low total error rate. However, previous works [9] point out that accuracy or total error rate is inappropriate for evaluating classification performance when the class is imbalanced, since such metrics overly emphasize the majority class and neglect the rare class, which is usually more important in real-world applications. Thus, when the class is imbalanced, better measures are desired.
In our paper, we will use error rate on the rare (and majority) class and cost-weighted
error2 to evaluate the classification performance on class-imbalanced data. The error
rate on the rare (and majority) class can reflect how well the rare (and majority) class
is learned. If the misclassification cost of the class is known, we can adopt another
common measure (cost-weighted error) to deal with imbalanced data. By weighting the
error rate on each class by its associated cost, we will get higher penalty for the error
on the rare class (usually the more important class).
In our paper, we attempt to use the PAC-learning model to study, when the class is imbalanced, how many sampled examples are needed to guarantee a low error on a particular class (the rare class or majority class) and a low cost-weighted error, respectively.
A bound on cost-weighted error is necessary since it would naturally “suppress” er-
rors on the rare class. We theoretically derive several upper bounds for both consistent
learner and agnostic learner. Similar to the upper bounds in traditional PAC learning,
the bounds we derive are generally quite loose, but they do provide a theoretical guar-
antee on the classification performance when class-imbalanced data is learned. Due to
the loose bounds, to make our work more practical, we also empirically study how
class imbalance affects the performance by using a specific learner. From our experi-
mental results, some interesting implications can be found. The results in this paper can
provide some theoretical foundations for mining the class-imbalanced data in the real
world.
The rest of the paper is organized as follows. We theoretically derive several up-
per bounds on the sample complexity for both consistent learner and agnostic learner.
Then we empirically explore how class imbalance affects the classification performance
by using a specific learner. Finally, we draw the conclusions and address our future
work.
2 Upper Bounds
In this section, we take advantage of PAC-learning theory to study the sample complex-
ity when learning from the class-imbalanced data. Instead of bounding the total error
rate, we focus on the error rate on a particular class (rare class or majority class) and
the cost-weighted error.
2 We will define it in the next section.
First of all, we introduce some notations for readers’ convenience. We assume that the
examples in training set T are drawn randomly and independently from a fixed but un-
known class-imbalanced distribution D. We denote p (0 < p < 0.5) as the proportion of
the rare class (the positive class) in D. For the class-imbalanced training set, p can be
very small (such as 0.001). The number of total training examples is denoted as m and
the number of positive and negative training examples are denoted as m+ and m− re-
spectively. For any hypothesis h from the hypothesis space H, we denote eD (h), eD+ (h),
and eD− (h) as the total, the positive, and the negative generalization error, respectively,
of h, and we also denote eT (h), eT + (h), and eT − (h) as the total, the positive, and the
negative training error, respectively, of h.
Given ε (0 < ε < 1) and δ (0 < δ < 1), the traditional PAC learning provides up-
per bounds on the total number of training examples needed to guarantee eD (h) < ε
with probability at least 1 − δ . However, it guarantees nothing about the positive er-
ror eD+ (h) for the imbalanced datasets. As we discussed before, the majority classifier
would predict every example as negative, resulting in a 100% error rate on the positive
(rare) examples. To have a lower positive error, the learner should observe more posi-
tive examples. Thus, in this subsection, we study the upper bounds of the examples on
a particular class (say positive class here) needed to guarantee, with probability at least
1 − δ , eD+ (h) < ε+ , given any ε+ (0 < ε+ < 1).
We first present a simple relation between the total error and the positive error as
well as the negative error, and will use it to derive some upper bounds.
Theorem 1. Given any ε+ (0 < ε+ < 1) and the positive class proportion p (0 < p <
0.5) according to distribution D and target function C, for any hypothesis h, if eD (h) <
ε+ × p, then eD+ (h) < ε+ .
Proof. Since e_D(h) = p × e_{D+}(h) + (1 − p) × e_{D−}(h) ≥ p × e_{D+}(h), we have

e_{D+}(h) ≤ e_D(h) / p.

Therefore, if e_D(h) < ε_+ × p, then e_{D+}(h) < ε_+.
Following the same direction, we can also derive a similar result for the error on the negative class, e_{D−}(h). That is, given ε_− (0 < ε_− < 1), if e_D(h) < ε_− × (1 − p), then e_{D−}(h) < ε_−.
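A small numeric illustration of Theorem 1 (the numbers are arbitrary):

# With a rare-class proportion p = 0.01 and a desired positive error
# eps_plus = 0.1, the total error must be driven below eps_plus * p = 0.001
# before the bound on the positive error applies.
p, eps_plus = 0.01, 0.1
required_total_error = eps_plus * p
print(required_total_error)  # 0.001: e_D(h) < 0.001 implies e_D+(h) < 0.1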
Theorem 1 simply tells us, as long as the total error is small enough, a desired positive
error (as well as negative error) can always be guaranteed. Based on Theorem 1, we can
”reuse” the upper bounds in the traditional PAC learning model and adapt them to be
the upper bounds of a particular class in the class-imbalanced datasets. We first consider
consistent learner in the next subsection.
Consistent Learner. We first derive the bound for the consistent learner. Let UB(ε, δ) denote an upper bound on the sample complexity of the consistent learner.
Theorem 2. Given any ε_+ (0 < ε_+ < 1) and δ (0 < δ < 1), if the number of positive examples observed satisfies

m_+ ≥ UB(ε_+ × p, δ) × p,

then with probability at least 1 − δ, the consistent learner will output a hypothesis h having e_{D+}(h) ≤ ε_+.
Proof. By the definition of the upper bound on the sample complexity, given 0 < ε < 1 and 0 < δ < 1, if m ≥ UB(ε, δ), then with probability at least 1 − δ any consistent learner will output a hypothesis h having e_D(h) ≤ ε.
Here, we simply substitute ε in UB(ε, δ) with ε_+ × p, which is still within (0, 1). Consequently, we obtain that if m ≥ UB(ε_+ × p, δ), then with probability at least 1 − δ any consistent learner will output a hypothesis h having e_D(h) ≤ ε_+ × p. According to Theorem 1, we get e_{D+}(h) < ε_+.
Also, m = m_+/p; thus m ≥ UB(ε_+ × p, δ) is equivalent to m_+ ≥ UB(ε_+ × p, δ) × p.
By using a proof similar to that of Theorem 2, we can also derive the upper bound for the negative class. Given 0 < ε_− < 1, if the number of negative examples m_− ≥ UB(ε_− × (1 − p), δ) × (1 − p), then, with probability at least 1 − δ, the consistent learner will output a hypothesis h having e_{D−}(h) ≤ ε_−.
The two upper bounds above can be adapted to any traditional upper bound for consistent learners. For instance, it is well known that any consistent learner using a finite hypothesis space H has an upper bound (1/ε) × (ln|H| + ln(1/δ)) [6]. Thus, by applying our new upper bounds, we obtain the following corollary.
Corollary 1. For any consistent learner using a finite hypothesis space H, the upper bound on the number of positive examples for e_{D+}(h) ≤ ε_+ is

m_+ ≥ (1/ε_+)(ln|H| + ln(1/δ)),
and the upper bound on the number of negative examples for e_{D−}(h) ≤ ε_− is

m_− ≥ (1/ε_−)(ln|H| + ln(1/δ)).
From Corollary 1, we can see that when the consistent learner uses a finite hypothesis space, the upper bound on the sample size of a particular class is directly related to the desired error rate (ε_+ or ε_−) on that class, and the class imbalance ratio p does not affect the upper bound. This indicates that, for the consistent learner, no matter how class-imbalanced the data is (how small p is), as long as we sample sufficient examples of a class, we can always achieve the desired error rate on that class.
Agnostic Learner. In this subsection, we consider an agnostic learner L using a finite hypothesis space H, which makes no assumption about whether or not the target concept c is representable by H. The agnostic learner simply finds the hypothesis with the minimum (possibly non-zero) training error. Given an arbitrarily small ε_+, we cannot ensure e_{D+}(h) ≤ ε_+, since very likely e_{T+}(h) > ε_+. Hence, we guarantee that e_{D+}(h) ≤ e_{T+}(h) + ε_+ holds with probability at least 1 − δ, for the hypothesis h with the minimum training error. To prove the upper bound for the agnostic learner, we adapt the original proof for the agnostic learner in [6]. The original proof regards drawing m examples from the distribution D as m independent Bernoulli trials, but in our proof, we only treat drawing m_+ examples from the positive class as m_+ Bernoulli trials.
Theorem 3. Given any ε_+ (0 < ε_+ < 1) and any δ (0 < δ < 1), if the number of positive examples observed

m_+ > (1/(2ε_+²))(ln|H| + ln(1/δ)),

then with probability at least 1 − δ, the agnostic learner will output a hypothesis h such that e_{D+}(h) ≤ e_{T+}(h) + ε_+.
Proof. For any h, we consider e_{D+}(h) as the true probability that h will misclassify a randomly drawn positive example. e_{T+}(h) is the observed frequency of misclassification over the given m_+ positive training examples. Since the entire set of training examples is drawn identically and independently, drawing and predicting positive training examples are also identical and independent. Thus, we can treat drawing and predicting m_+ positive training examples as m_+ independent Bernoulli trials.
Therefore, according to the Hoeffding bounds, we have

Pr[ ∃h ∈ H : e_{D+}(h) > e_{T+}(h) + ε ] ≤ |H| e^{−2 m_+ ε²}.

This formula tells us that the probability that there exists one bad hypothesis h making e_{D+}(h) > e_{T+}(h) + ε is bounded by |H| e^{−2 m_+ ε²}. If we let |H| e^{−2 m_+ ε²} be less than δ,
then for any hypothesis, including the outputted hypothesis h in H, e_{D+}(h) − e_{T+}(h) ≤ ε will hold true with probability at least 1 − δ. So, solving for m_+ in the inequality |H| e^{−2 m_+ ε²} < δ, we obtain

m_+ > (1/(2ε_+²))(ln|H| + ln(1/δ)).
In fact, by using a similar procedure, we can also prove the upper bound on the number of negative examples m_− when using the agnostic learner: (1/(2ε_−²))(ln|H| + ln(1/δ)).
We can observe a similar pattern here. The upper bounds for the agnostic learner are
also not affected by the class imbalance ratio p.
From the upper bounds we derived for either the consistent learner or the agnostic learner, we learn that when the number of examples of a class is sufficient, class imbalance does not have any effect. This discovery actually refutes a common misconception that we need more examples just because of a more imbalanced class ratio. We can see that class imbalance is in fact a data insufficiency problem, which was also observed empirically in [4]. Here, we further confirm it with our theoretical analysis.
In this subsection, we derived a new relation (Theorem 1) between the positive error and the total error, and used it to derive a general upper bound (Theorem 2) which can be applied to any traditional PAC upper bound for the consistent learner. We also extended the existing proof for the agnostic learner to derive an upper bound on a particular class for the agnostic learner. Although the proofs of the theorems above may seem straightforward, no previous work explicitly states the same conclusions from a theoretical perspective.
It should be noted that although the agnostic learner outputs the hypothesis with the minimum (total) training error, it is possible that the outputted hypothesis has a 100% error rate on the positive class in the training set. In this case, the guaranteed small difference ε_+ between the true positive error and the training positive error can still result in a 100% true error rate on the positive class. If positive errors are more costly than negative errors, it is more reasonable to assign a higher cost to misclassifying positive examples, and let the agnostic learner minimize the cost-weighted training error instead of the flat training error. In the following part, we introduce misclassification cost into our error bounds.
Definition 1 (Cost-Weighted Error). Given the cost ratio r, the class ratio p, e_{D+} as the positive error on D, and e_{D−} as the negative error on D, the cost-weighted error on D is defined as

c_D(h) = ( r p e_{D+} + (1 − p) e_{D−} ) / ( r p + (1 − p) ).

By the same definition, we can also define the cost-weighted error on the training set T as c_T(h) = ( r p e_{T+} + (1 − p) e_{T−} ) / ( r p + (1 − p) ). The weight of the error on a class is determined by its class
ratio and misclassification cost. The rp is the weight for the positive class and 1 − p is
the weight for the negative class. In our definition for the cost-weighted error, we use
the normalized weight.
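A direct transcription of Definition 1, with a small check that different (e_{D+}, e_{D−}) pairs can yield the same cost-weighted error (here r = 1 and p = 0.4, so that rp = 0.4):

def cost_weighted_error(e_pos, e_neg, p, r):
    # Cost-weighted error of Definition 1: per-class error rates weighted by
    # r*p and (1 - p) respectively, then normalized.
    return (r * p * e_pos + (1.0 - p) * e_neg) / (r * p + (1.0 - p))

print(cost_weighted_error(0.10, 0.2, 0.4, 1.0))  # 0.16
print(cost_weighted_error(0.25, 0.1, 0.4, 1.0))  # 0.16 as well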
In the following part, we study the upper bounds for the examples needed to guaran-
tee a low cost-weighted error on D. We give a non-trivial proof for the upper bounds of
consistent learner, and the proof for the upper bound of agnostic learner is omitted due
to its similarity to that of the consistent learner (but only with finite hypothesis space).
Consistent Learner. To derive a relatively tight upper bound on the sample size for the cost-weighted error, we first introduce a property: there exist many combinations of positive error e_{D+} and negative error e_{D−} that produce the same cost-weighted error value. For example, given rp = 0.4, e_{D+} = 0.1 and e_{D−} = 0.2 give c_D = 0.16, while e_{D+} = 0.25 and e_{D−} = 0.1 produce the same cost-weighted error. We can let the upper bound be the least required sample size among all the combinations of positive error and negative error that achieve the desired cost-weighted error.
Theorem 4. Given any ε (0 < ε < 1), any δ (0 < δ < 1), the cost ratio r (r ≥ 1) and the positive proportion p (0 < p < 0.5) according to the distribution D, if the total number of examples observed

m ≥ ( (1 + r) / (ε (r p + (1 − p))) ) (ln|H| + ln(1/δ)),

then, with probability at least 1 − δ, the consistent learner will output a hypothesis h such that the cost-weighted error c_D(h) ≤ ε.
Proof. In order to make c_D(h) ≤ ε, we should ensure

( r p e_{D+} + (1 − p) e_{D−} ) / ( r p + (1 − p) ) ≤ ε.   (1)

Here, we let X = rp / (rp + (1 − p)); thus 1 − X = (1 − p) / (rp + (1 − p)). Accordingly, Formula (1) can be transformed into X e_{D+} + (1 − X) e_{D−} ≤ ε. To guarantee it, we should make sure that

e_{D−} ≤ (ε − X e_{D+}) / (1 − X).

According to Corollary 1, if we observe

m_− ≥ ( 1 / ((ε − X e_{D+}) / (1 − X)) ) (ln|H| + ln(1/δ)),   (2)
we can ensure e_{D−}(h) ≤ (ε − X e_{D+}) / (1 − X) with probability at least 1 − δ. Besides, in order to achieve e_{D+} on the positive class, we also need to observe

m_+ ≥ ( 1 / e_{D+} ) (ln|H| + ln(1/δ)).   (3)

To guarantee Formulas (2) and (3), we need to sample at least m examples such that m = MAX(m_+ / p, m_− / (1 − p)). Thus,

m ≥ MAX( 1 / (e_{D+} × p), 1 / (((ε − X e_{D+}) / (1 − X)) × (1 − p)) ) (ln|H| + ln(1/δ)).

However, since e_{D+} is a variable, different e_{D+} will lead to different e_{D−}, and thus affect m. In order to have a tight upper bound for m, we only need

m ≥ MIN_{0 ≤ e_{D+} ≤ ε/X} ( MAX( 1 / (e_{D+} × p), 1 / (((ε − X e_{D+}) / (1 − X)) × (1 − p)) ) (ln|H| + ln(1/δ)) ).
When 1 / (e_{D+} × p) > 1 / (((ε − X e_{D+}) / (1 − X)) × (1 − p)), the MAX equals 1 / (e_{D+} × p), which is a decreasing function of e_{D+}; but when 1 / (e_{D+} × p) < 1 / (((ε − X e_{D+}) / (1 − X)) × (1 − p)), it becomes an increasing function of e_{D+}. Thus, the minimum value of the function is achieved when 1 / (e_{D+} × p) = 1 / (((ε − X e_{D+}) / (1 − X)) × (1 − p)). By solving the equation, we obtain the minimum value of the function,

( (p + X − 2Xp) / (ε (1 − p) p) ) (ln|H| + ln(1/δ)).

If we recover X with rp / (rp + (1 − p)), then it can be transformed into ( (1 + r) / (ε (rp + (1 − p))) ) (ln|H| + ln(1/δ)). Therefore, as long as

m ≥ ( (1 + r) / (ε (rp + (1 − p))) ) (ln|H| + ln(1/δ)),
then with probability at least 1 − δ , the consistent learner will output a hypothesis h
such that cD (h) ≤ ε .
We can see that the upper bound on the cost-weighted error for the consistent learner is related to p and r. By performing a simple transformation, we can transform the above upper bound into ( (r + 1) / (ε ((r − 1)p + 1)) ) (ln|H| + ln(1/δ)). It is known that r ≥ 1, thus r − 1 ≥ 0. Therefore, as p decreases within (0, 0.5), the upper bound increases. It means that the more imbalanced the class is, the more examples we need to achieve a desired cost-weighted error. In this case, class imbalance actually affects the classification performance in terms of the cost-weighted error. If we make another transformation to the upper bound, we obtain ( 1/(pε) + (2p − 1) / (ε (rp² + (1 − p)p)) ) (ln|H| + ln(1/δ)). Since 0 < p < 0.5, 2p − 1 < 0. Thus, as r increases, the upper bound also increases. It shows that a higher cost ratio C_FN / C_FP would require
more examples for training. Intuitively speaking, when the class is imbalanced, the cost-weighted error largely depends on the error on the rare class. As we have proved before, to achieve the same error on the rare class, we need the same number of examples of the rare class; thus more class-imbalanced data requires more examples in total. Besides, a higher cost on the rare class leads to a higher cost-weighted error; thus to achieve the same cost-weighted error, we will also need more examples in total.
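The dependence on p and r discussed above can be seen by evaluating the Theorem 4 bound directly; |H|, δ and ε below are arbitrary illustrative values.

import math

def cost_weighted_bound(eps, delta, p, r, hypothesis_space_size):
    # Theorem 4's bound (1 + r) / (eps * (r*p + (1 - p))) * (ln|H| + ln(1/delta))
    # for the consistent learner.
    log_term = math.log(hypothesis_space_size) + math.log(1.0 / delta)
    return (1.0 + r) / (eps * (r * p + (1.0 - p))) * log_term

# The bound grows as p shrinks or as the cost ratio r grows, matching the text.
for p in (0.3, 0.1, 0.01):
    print(p, round(cost_weighted_bound(0.1, 0.05, p, 10.0, 2 ** 20)))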
Agnostic Learner. As mentioned before, the hypothesis with the minimum training error produced by the agnostic learner may still lead to a 100% error rate on the rare class. Hence, instead of outputting the hypothesis with the minimum training error, we redefine the agnostic learner as the learner that outputs the hypothesis with the minimum cost-weighted error on the training set. Generally, with a higher cost on positive errors, the agnostic learner is less likely to produce a hypothesis that misclassifies all the positive training examples. The following theorem demonstrates, for the agnostic learner, how many examples are needed to guarantee a small difference between the cost-weighted errors on the distribution D and on the training set T.
Theorem 5. Given any ε (0 < ε < 1), any δ (0 < δ < 1), the cost ratio r (r ≥ 1) and the positive proportion p (0 < p < 0.5) according to the distribution D, if the total number of examples observed

m ≥ ( (r√p + √(1 − p))² / (2ε² (rp + (1 − p))²) ) (ln|H| + ln(1/δ)),

then, with probability at least 1 − δ, the agnostic learner will output a hypothesis h such that c_D(h) ≤ c_T(h) + ε.
The proof of Theorem 5 is very similar to that of Theorem 4, so we omit the details here. Furthermore, we can extract the same patterns from this upper bound as we found for the upper bound in Theorem 4: more examples are required when the cost ratio increases or the class becomes more imbalanced.
To summarize, in this section we derived several upper bounds that guarantee the error rate on a particular class (the rare class or the majority class) as well as the cost-weighted error, for both the consistent learner and the agnostic learner. We found some interesting and useful patterns in those theoretical results: the upper bound for the error rate on a particular class is not affected by the class imbalance, while the upper bound for the cost-weighted error is sensitive to both the class imbalance and the cost ratio. Although those patterns may not be so surprising, as far as we know, no previous work has proved them theoretically. Such theoretical results are more reliable than results based only on empirical observation.
Since the upper bounds we derive are closely related to the hypothesis space, which is often huge for many learning algorithms, they are generally very loose (it should be noted that the upper bounds in traditional PAC learning are also very loose). In fact, when we practically use specific learners, the number of examples needed to achieve a desired error rate on a class or a desired cost-weighted error is usually much smaller than the theoretical upper bounds. Therefore, in the next section, we will empirically study the performance of a specific learner, to see how the class imbalance and the cost ratio influence the classification performance.
[Figure: the target concept of the artificial dataset, a small decision tree over attributes A2, A3, A4 and A5 with 0/1 branches and leaves labeled + and −.]
[Figure 2 plots: positive error (y-axis, 0–0.7) versus the proportion of the positive class (0.1%–50%) on the three datasets, with one curve per number of positive training examples (e.g. 25 or 500).]
[Figure 3 plots: cost-weighted error versus the proportion of the positive class (0.1%–50%, left) and versus the cost ratio (1:1–1000:1, right), with curves for 1000, 2000 and 5000 total examples.]
example to be 1/9 of the probability of drawing a negative example, and the probability of drawing examples within a class is uniform. According to the data distribution, we sample a training set until it contains a certain number of positive examples (we set three different numbers for each dataset), and train an unpruned decision tree on it. Then, we evaluate its performance (positive error and cost-weighted error) on another sampled test set from the same data distribution. Finally, we compare the performance under different data distributions (0.1%, 0.5%, 1%, 5%, 10%, 25%, 50%) to see how the class imbalance ratio affects the performance of the unpruned decision tree. All the results are averaged over 10 independent runs.
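A sketch of the sampling protocol described above, under the assumption that each training example is drawn independently, choosing the positive class with a fixed probability p; X and y are assumed to be NumPy arrays and all names are illustrative.

import numpy as np

def sample_until_n_positives(X, y, p, n_pos, seed=0):
    # Draw examples one at a time, choosing the positive class with probability
    # p and the negative class otherwise (uniformly within a class), until the
    # sample contains n_pos positive examples.
    rng = np.random.default_rng(seed)
    pos_idx, neg_idx = np.where(y == 1)[0], np.where(y == 0)[0]
    chosen, positives = [], 0
    while positives < n_pos:
        if rng.random() < p:
            chosen.append(rng.choice(pos_idx))
            positives += 1
        else:
            chosen.append(rng.choice(neg_idx))
    return X[chosen], y[chosen]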
Figure 2 presents the positive error on the three datasets. The three curves in each subgraph represent three different numbers of positive examples in the training set. For the artificial dataset, since the concept is easy to learn, the numbers of positive examples chosen are smaller than those for the UCI datasets. We can see that, generally, the more positive examples there are for training, the flatter the curve and the lower the positive error. It means that, as we have more positive examples, class imbalance has less negative effect on the positive error in practice. This observation is actually consistent with Corollary 1.
To see how class imbalance influences the cost-weighted error, we compare the cost-
weighted error under different class ratios with fixed cost ratio. To explore how cost
ratio affects the cost-weighted error, we compare the cost-weighted error over different
cost ratios with fixed class ratio. For this part, we only use the artificial dataset to show
the results (see Figure 3). We can see, generally, as the class becomes more imbalanced
or the cost ratio increases, the cost-weighted error goes higher. It is also consistent with
our theory (Theorem 4).
We have to point out that our experiment is not a verification of our derived theories. The actual number of examples used in our experiment is much smaller than the theoretical bounds. Despite that, we still find that the empirical observations show patterns similar to our theoretical results. Thus, our theorems not only offer a theoretical guarantee, but also have some useful implications for real-world applications.
4 Conclusions
In this paper, we study the class imbalance issue from a PAC-learning perspective. An important contribution of our work is that we theoretically prove that the upper bound on the error rate of a particular class is not affected by the (imbalanced) class ratio. This actually refutes a common misconception that we need more examples just because of a more imbalanced class ratio. Besides the theoretical theorems, we also empirically explore the issue of class imbalance. The empirical observations reflect the patterns we found in our theoretical upper bounds, which means our theories are still helpful for the practical study of class-imbalanced data.
Although intuitively our results might seem straightforward, few previous works have explicitly addressed these fundamental issues with PAC bounds for class-imbalanced data. Our work actually confirms the practical intuition with a theoretical proof and fills a gap in the established PAC learning theory. For the imbalanced data issue, we do need such a theoretical guideline for practical study.
In our future work, we will study bounds for AUC, since it is another useful measure for imbalanced data. Other common heuristic methods to deal with imbalanced data are over-sampling and under-sampling. We will also study their bounds in the future.
References
1. Asuncion, A., Newman, D.: UCI machine learning repository (2007),
http://www.ics.uci.edu/˜mlearn/mlrepository.html
2. Carvalho, R., Freitas, A.: A hybrid decision tree/genetic algorithm method for data mining.
Inf. Sci. 163(1-3), 13–35 (2004)
3. Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, P.: Smote: Synthetic minority over-sampling
technique. Journal of Artificial Intelligence Research 16, 321–357 (2002)
4. Japkowicz, N.: Class imbalances: Are we focusing on the right issue? In: ICML-KDD 2003
Workshop: Learning from Imbalanced Data Sets (2003)
5. Klement, W., Wilk, S., Michalowski, W., Matwin, S.: Classifying Severely Imbalanced Data.
In: Butz, C., Lingras, P. (eds.) Canadian AI 2011. LNCS, vol. 6657, pp. 258–264. Springer,
Heidelberg (2011)
6. Mitchell, T.: Machine Learning. McGraw-Hill, New York (1997)
7. Ting, K.M.: The problem of small disjuncts: its remedy in decision trees. In: Proceeding of
the Tenth Canadian Conference on Artificial Intelligence, pp. 91–97 (1994)
8. Valiant, L.G.: A theory of the learnable. Commun. ACM 27(11), 1134–1142 (1984)
9. Weiss, G.: Mining with rarity: a unifying framework. SIGKDD Explor. Newsl. 6(1), 7–19
(2004)
10. WEKA Machine Learning Project: Weka,
http://www.cs.waikato.ac.nz/˜ml/weka
Active Learning with c-Certainty
1 Introduction
It is well known that noise in the labels deteriorates learning performance, especially for active learning, as most active learning strategies tend to select noisy examples on many natural learning problems [1]. To rule out the negative effects of noisy labels, querying multiple oracles has been proposed in active learning [2,3,4]. This multiple-oracle strategy is reasonable and useful for improving label quality. For example, in paper reviewing, multiple reviewers (i.e., oracles or labelers) are requested to label a paper (as accept, weak accept, weak reject or reject), so that the final decision (i.e., label) can be more accurate.
However, there is still no way to guarantee the label quality in spite of the improvements obtained in previous works [3,4,5]. Furthermore, strong assumptions, such as an even distribution of noise [3] and an example-independent (fixed) noise level [4], have been made. In the paper reviewing example mentioned above, these assumptions imply that all the reviewers are at the same level of expertise and have the same probability of making mistakes.
Obviously, these assumptions may be too strong and unrealistic, as it is ubiquitous that label quality (or noise level) is example-dependent in real-world data. In the paper reviewing example, the quality of a label given by a reviewer should depend heavily on how close the reviewer's research is to the topic of the paper. The closer it is, the higher quality the label has. Thus, it is necessary to study this learning problem further.
In this paper, we propose a novel active learning paradigm, under which or-
acles are assumed to return both labels and confidences. This assumption is
reasonable in real-life applications. Taking paper reviewing as an example again,
usually a reviewer is required to give not only a label (accept, weak accept, weak
reject or reject) for a paper, but also his confidence (high, medium or low) for
the labeling.
Under the paradigm, we propose a new active learning strategy, called c-
certainty learning. C-certainty learning guarantees the label quality to be equal
to or higher than a threshold c (c is the probability of correct labeling; see
later) by querying oracles multiple times. In the paper reviewing example, with
the labels and confidences given by reviewers (oracles), we can estimate the
certainty of the label. If the certainty is too low (e.g., lower than a given c),
another reviewer has to be sought to review the paper to improve the label
quality.
Furthermore, instead of assuming the noise level to be example-independent as in previous works, we allow it to be example-dependent. We design an algorithm that is able to select the Best Multiple Oracles to query (called BMO) for each given example. With BMO, fewer queries are required on average for a label to meet the threshold c, compared to random selection of oracles. Thus, for a given query budget, BMO is expected to obtain more examples with labels of high quality due to its selection of the best oracles. As a result, more accurate models can be built.
We conduct extensive experiments on the UCI datasets by generating various
types of oracles. The results show that our new algorithm BMO is robust, and
performs well with the different types of oracles. The reason is that BMO can
guarantee the label quality by querying oracles repeatedly and ensure the best
oracles can be queried. As far as we know, this is the first work that proposes
this new active learning paradigm.
The rest of this paper is organized as follows. We review related works in
Section 2. Section 3 introduces the learning paradigm and the calculation of
certainty after querying multiple oracles. We present our learning algorithm,
BMO, in Section 4 and the experiment results in Section 5. We conclude our
work in Section 6.
2 Previous Works
Labeling each example with multiple oracles has been studied when labeling is
not perfect in supervised learning [5,6,7]. Some principled probabilistic solutions
have been proposed on how to learn and evaluate the multiple-oracle problem.
However, as far as we know, few of them can guarantee the labeling quality to
be equal to or greater than a given threshold c, which can be guaranteed in our
work.
Other recent works related to multiple oracles make assumptions which may be too strong and unrealistic. One assumption is that the noise of oracles is equally distributed [3]. The other type of assumption is that the noise levels of different oracles may differ but do not change over time [4,8]. These works estimate the noise level of each oracle during the learning process and prefer querying the oracles with low noise levels. However, in practice the quality of an oracle is often example-dependent. In this paper, we remove these assumptions and allow the noise level of oracles to vary among different examples.
Active learning on the data with example-dependent noise level was studied
in [9]. However, it focuses on how to choose examples considering the tradeoff
between more informative examples and examples with lower noise level.
3 c-Certainty Labeling
C-Certainty labeling is based on the assumption that oracles can return both
labels and their confidences in the labelings. We first define confidence formally. The confidence for labeling an example x is the probability that the label given by an oracle is the same as the true label of x. We assume that the confidences of oracles on any example are greater than 0.5.¹
By using the labels and confidences given by oracles, we guarantee that the label certainty of each example can meet the threshold c (c ∈ (0.5, 1]) by querying oracles repeatedly (called c-certainty labeling). That is, a label is valid only if its certainty is greater than or equal to c. Otherwise, more queries are issued to different oracles to improve the certainty.
How can we update the label certainty of an example x after obtaining a new answer from an oracle? Let the set of the previous n − 1 answers be A^{n−1}, and the new answer be A_n in the form of (P, f_n), where P indicates positive and f_n is the confidence. The label certainty of x, C(T_P | A^n), can be updated with Formula 1 (see the Appendix for the details of its derivation).
$$
C(T_P \mid A^n) =
\begin{cases}
\dfrac{p(T_P) \times f_n}{p(T_P) \times f_n + p(T_N) \times (1 - f_n)}, & \text{if } n = 1 \text{ and } A_n = \{P, f_n\} \\[2ex]
\dfrac{C(T_P \mid A^{n-1}) \times f_n}{C(T_P \mid A^{n-1}) \times f_n + \big(1 - C(T_P \mid A^{n-1})\big) \times (1 - f_n)}, & \text{if } n > 1 \text{ and } A_n = \{P, f_n\},
\end{cases}
\tag{1}
$$
where T_P and T_N are the true positive and negative labels respectively. Formula 1 can be applied directly when the new answer is positive (i.e., A_n = {P, f_n}); for a negative answer, we can transform it as A_n = {N, f_n} = {P, (1 − f_n)} so that Formula 1 can still be applied.
¹ This assumption is reasonable, as usually oracles can label examples more correctly than random.
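To make the update concrete, the following Python sketch implements Formula 1 under the convention above that a negative answer (N, f_n) is rewritten as (P, 1 − f_n); the function name, the use of a class prior p(T_P) as a parameter, and the worked example are illustrative assumptions rather than the authors' implementation.

```python
def update_certainty(prev_certainty, label, confidence, prior_pos=0.5):
    """Update the label certainty C(T_P | A^n) after one more oracle answer (Formula 1).

    A negative answer (N, f) is first rewritten as (P, 1 - f); prev_certainty is
    None for the first answer (n = 1), in which case the class prior is used.
    """
    f = confidence if label == 'P' else 1.0 - confidence
    if prev_certainty is None:
        p_pos, p_neg = prior_pos, 1.0 - prior_pos           # p(T_P), p(T_N)
        return (p_pos * f) / (p_pos * f + p_neg * (1.0 - f))
    c = prev_certainty                                      # n > 1: Bayesian update
    return (c * f) / (c * f + (1.0 - c) * (1.0 - f))


# Example: two positive answers with confidence 0.7, then a negative one with 0.6.
c = update_certainty(None, 'P', 0.7)
c = update_certainty(c, 'P', 0.7)
c = update_certainty(c, 'N', 0.6)
```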
To improve the querying efficiency, the key issue is to select the best oracle for
every given example. This is very different from the case when the noise level is
example-independent [4,8], as in our case the performance of each oracle varies
on labeling different examples.
How do we select the best oracle given that the noise levels are example-dependent? The basic idea is that an oracle can probably label an example x with high confidence if it has labeled x_j confidently and x_j is close to x. This idea is reasonable as the confidence distribution (expertise level) of an oracle is usually continuous and does not change abruptly. More specifically, we assume that each of the m oracle candidates (O_1, · · · , O_m) has labeled a set of examples E^i (1 ≤ i ≤ m). E_k^i is the set of k (k = 3 in our experiment) nearest neighbors of x in E^i. BMO chooses the oracle O_i such that the examples in E_k^i are of high confidence and close to the example x. The potential confidence for each oracle in labeling x can be calculated with Formula 2:
$$
Pc_i = \frac{\frac{1}{k} \times \sum_{j=1}^{k} f_{x_j}^{o_i}}{1 + \frac{1}{k} \times \sum_{j=1}^{k} |x - x_j|}, \tag{2}
$$
where f_{x_j}^{o_i} is the confidence of oracle O_i in labeling x_j and |x − x_j| is the distance between x and its neighbor x_j.
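A minimal sketch of this oracle-selection step is given below; it assumes each candidate oracle keeps a history E^i of (example, confidence) pairs over numeric feature vectors, and the helper names are hypothetical.

```python
import numpy as np

def potential_confidence(x, history, k=3):
    """Score one oracle for example x from its k nearest labeled neighbors (Formula 2)."""
    examples = np.array([e for e, _ in history], dtype=float)
    confs = np.array([f for _, f in history], dtype=float)
    dists = np.linalg.norm(examples - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    return confs[nearest].mean() / (1.0 + dists[nearest].mean())

def best_oracle(x, oracle_histories, k=3):
    """BMO step: return the index of the oracle with the highest potential confidence for x."""
    scores = [potential_confidence(x, h, k) for h in oracle_histories]
    return int(np.argmax(scores))
```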
By selecting the best oracles, BMO can improve the label certainty of a given
example to meet the threshold c with only a few queries (See Section 5). That is,
more labeled examples can be obtained for a predefined query budget compared
to random selection of oracles. Thus, the model built is expected to have better
performance.
5 Experiments
In our experiment, to compare with BMO, we implement two other learning
strategies. One is Random selection of Multiple Oracles (RMO). Rather than
selecting the best oracle in BMO, RMO selects oracles randomly to query for a
given example and repeats until the label certainty is greater than or equal to
c. The other strategy is Random selection of Single Oracle (RSO). RSO queries
for each example only once without considering c, which is similar to traditional
active learning algorithms.
Since RSO only queries one oracle for each example, it will have the most
labeled examples for a predefined query budget but with the highest noise level.
To reduce the negative effect of noisy labels, we weight all labeled examples ac-
cording to their label certainty when building final models. To make all the three
strategies comparable, we also use weighting in BMO and RMO. In addition, all
the three algorithms take uncertainty sampling as the example-selecting strategy
and decision tree (J48 in WEKA [13]) as their base learners. The implementation
is based on the WEKA source code.
The experiment is conducted on UCI datasets [14], including abalone, anneal, cmc new, credit, mushroom, spambase and splice, which are commonly used in supervised learning research. As the number of oracles cannot be infinite in the real world, we generate only 10 oracles for each dataset. If an example has been presented to all 10 oracles, the label and the certainty obtained are accepted directly. The threshold c is predefined to be 0.8 and 0.9 respectively.
The experiment results presented are the average of 10 runs, and t-test results
are of 95% confidence.
In our previous discussion, we take the confidence given by an oracle as the
true confidence. However, in real life, oracles may overestimate or underestimate
themselves intentionally or unintentionally. If the confidence given by an oracle
O does not equal the true confidence, we call O an unfaithful oracle; otherwise, it
is faithful. To observe the robustness of our algorithm, we conduct our empirical
studies with both faithful and unfaithful oracles² in the following.
As no oracle is provided for the UCI data, we generate a faithful oracle as follows. Firstly, we select one example x randomly as an “expertise center” and label it with the highest confidence. Then, to make the oracle faithful, we calculate the Euclidean distance from each of the remaining examples to x, and assign them confidences based on the distances. The further the distance, the lower the confidence the oracle has in labeling the example. Noise is added to the labels according to the confidence level. Thus the oracle is faithful.
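As an illustration, a faithful oracle of this kind could be generated as in the sketch below; binary 0/1 labels, the linear confidence decay, and the confidence bounds are assumptions made only for this sketch (the paper itself uses three different decay distributions, described next).

```python
import numpy as np

def make_faithful_oracle(X, y, rng, min_conf=0.55, max_conf=0.95):
    """Build one faithful oracle whose confidence decays linearly with the
    distance to a randomly chosen "expertise center" (binary 0/1 labels assumed)."""
    center = X[rng.integers(len(X))]
    dists = np.linalg.norm(X - center, axis=1)
    conf = max_conf - (max_conf - min_conf) * dists / (dists.max() + 1e-12)

    def oracle(i):
        """Return a possibly flipped label for example i together with the true confidence."""
        label = y[i] if rng.random() < conf[i] else 1 - y[i]
        return label, conf[i]

    return oracle
```

Flipping the label with probability 1 − conf[i] is what makes the reported confidence equal the true labeling accuracy, i.e., what makes the oracle faithful.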
The confidence is supposed to follow a certain distribution. We choose three common distributions: linear, normal and dual normal. The linear distribution assumes the confidence reduces linearly as the distance increases. For the normal distribution, the reduction of confidence follows the probability density function $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{x^2}{2\sigma^2}\right) - 0.55$. The dual normal distribution indicates that
² Actually, it is difficult to model the behaviors of unfaithful oracles with a large confidence deviation. In our experiment, we show that our algorithm works well given unfaithful oracles slightly deviating from the true confidence.
[Figures 2 and 3: error-rate curves of BMO, RMO and RSO over query budgets 50–500 for c = 0.8 (left) and c = 0.9 (right), and bar charts of the number of labeled examples with certainty above/below the threshold for each strategy.]
the oracle has two “expertise centers” (see Figure 1). As mentioned earlier, we
generate 10 oracles for each dataset in this experiment. Among the 10 oracles,
three of them follow the linear distribution, three the normal distribution and
four the dual normal distribution.
Due to the similar results of different datasets, we only show the details of
one dataset (anneal) in Figure 2 and a summary of the comparison afterwards.
Figure 2 shows the testing error rates of BMO, RMO and RSO for the threshold
0.8 (left) and 0.9 (right) respectively. The x axis indicates the query budgets while
the y axis represents the error rate on test data. On the one hand, as expected, for both thresholds 0.8 and 0.9, the error rate of BMO is much lower than
that of RMO and RSO for all different budgets, and the performances of the
latter two are similar. On the other hand, the curve of RMO when c = 0.8 is not
as smooth as the other ones.
The different performances of the three learning strategies can be explained
by two factors, the noise level and the number of examples. Due to limited space,
we only show how the two factors affect the performances through one dataset
(anneal) when the query budget is 500 in Figure 3.
Figure 3 shows that on average BMO only queries about 1.4 (c = 0.8) and
1.7 (c = 0.9) oracles for each example; while RMO queries more oracles (1.7 and
2.0). That is, BMO obtains more labeled examples than RMO for a given bud-
get. Moreover, the examples labeled by BMO have much higher label certainty
than those by RMO.³ On the other hand, the examples labeled by RSO are much noisier than those by BMO (i.e., the low-certainty portion in Figure 3 is much larger). It is the noise that
deteriorates the performance of RSO. Thus, BMO outperforms the other two
strategies because of its guaranteed label quality and the selection of the best
oracles to query.
By looking closely into the curves in Figure 2, we find that the curve of RMO
when c = 0.8 is not as smooth as the other ones. The reason is that RMO of
c = 0.8 has fewer labeled examples when compared to BMO and RSO of c = 0.8
and has more noise when compared to RMO of c = 0.9. Fewer examples make the learnt model more sensitive to the quality of each label, while the label quality at c = 0.8 is not high enough. Thus, the stability of RMO when c = 0.8 is weakened.
In addition, we also show the t-test results in terms of the error rate on all the
seven UCI datasets in Table 1. As 10 different query budgets are considered for each of the seven datasets, the total number of t-tests for each group is 70. Table 1 shows that
BMO wins RMO 94 times out of 140 (c = 0.8 and c = 0.9) and wins RSO 86 out
of 140 without losing once. It is clear that BMO outperforms RMO and RSO
significantly.
In summary, with faithful oracles, the experiment results show that BMO does
work better by guaranteeing the label quality and selecting the best oracles to
query. On the other hand, even though RMO also can guarantee the label qual-
ity, its strategy of randomly selecting oracles reduces the learning performance.
Furthermore, the results of RSO illustrate that weighting with the label quality
may reduce the negative influence of noise, but its effect is still limited.
³ Some of the examples still have certainty lower than c due to the limited oracles in our experiment.
Unfaithful oracles are generated for each dataset by building models over 20%
of the examples. More specifically, to generate an oracle, we randomly select
one example x as an “expertise center”, and sample examples around it. The
closer an example x_i is to x, the higher the probability with which it is sampled.
Thus, the oracle built on the sampled examples can label the examples closer
to x with higher confidences. The sampling probability follows exactly the same
distribution in Figure 1. For each data set, 10 oracles are generated and three
follow the linear distribution, three the normal distribution and four the dual
normal distribution.
[Figure 4: error-rate curves of BMO, RMO and RSO over query budgets 50–500, and bar charts of the number of labeled examples with certainty above/below the threshold, for c = 0.8 and c = 0.9 (anneal dataset, unfaithful oracles).]
As the sampling rate declines with increasing distance, the oracle built may fail to give the true confidence for examples that are far from the “center”.
As a result, the oracle is unfaithful. That is, the oracles are unfaithful due to
“insufficient knowledge” rather than “lying” deliberately.
We run BMO, RMO and RSO on the seven UCI datasets and show the testing
error rates and the number of labeled examples on one data set (anneal) in
Figure 4 and a summary on all the datasets afterwards. It is surprising that the
performances of BMO on unfaithful oracles are similar to that on faithful oracles.
That is, the error rate of BMO is much less than that of RMO and RSO, and
the latter two are similar. BMO labels more examples than RMO, and its label quality is higher than that of both RMO and RSO; these observations are also similar to those on faithful oracles.
The comparison shows clearly that BMO is robust even for unfaithful oracles.
The reason is that BMO selects the best multiple oracles to query, and it is
unlikely that all the best oracles are unfaithful at the same time as our unfaithful
oracles do not “lie” deliberately as mentioned. Thus, BMO still performs well.
Table 2 shows the t-test results on 10 different query budgets for all the
seven UCI datasets. We can see that BMO wins RMO 95 times out of 140 and
wins RSO 98 out of 140, which indicates that BMO works significantly better
than RMO and RSO under most of the circumstances. However, BMO loses
to RMO 19 times and RSO 10 times, which are different from the results on
faithful oracles. Thus, even though BMO is robust, still it works slightly worse
on unfaithful oracles than on faithful ones.
Table 2. T-test results for all datasets and budgets on unfaithful oracles
In summary, BMO is robust for working with unfaithful oracles, even though
its good performance may be reduced slightly. This property is crucial for BMO
to be applied successfully in real applications.
6 Conclusion
References
1. Balcan, M., Beygelzimer, A., Langford, J.: Agnostic active learning. In: Proceedings
of the 23rd International Conference on Machine Learning, pp. 65–72. ACM (2006)
2. Settles, B.: Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison (2009)
3. Sheng, V., Provost, F., Ipeirotis, P.: Get another label? improving data quality
and data mining using multiple, noisy labelers. In: Proceeding of the 14th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.
614–622. ACM (2008)
4. Donmez, P., Carbonell, J., Schneider, J.: Efficiently learning the accuracy of la-
beling sources for selective sampling. In: Proceedings of the 15th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pp. 259–268.
ACM (2009)
5. Raykar, V., Yu, S., Zhao, L., Jerebko, A., Florin, C., Valadez, G., Bogoni, L., Moy,
L.: Supervised Learning from Multiple Experts: Whom to trust when everyone lies
a bit. In: Proceedings of the 26th Annual International Conference on Machine
Learning, pp. 889–896. ACM (2009)
6. Snow, R., O’Connor, B., Jurafsky, D., Ng, A.: Cheap and fast—but is it good?:
evaluating non-expert annotations for natural language tasks. In: Proceedings of
the Conference on Empirical Methods in Natural Language Processing, pp. 254–
263. Association for Computational Linguistics (2008)
7. Sorokin, A., Forsyth, D.: Utility data annotation with amazon mechanical turk. In:
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Workshops, CVPRW 2008, pp. 1–8. IEEE (2008)
8. Zheng, Y., Scott, S., Deng, K.: Active learning from multiple noisy labelers with
varied costs. In: 2010 IEEE International Conference on Data Mining, pp. 639–648.
IEEE (2010)
9. Du, J., Ling, C.: Active learning with human-like noisy oracle. In: 2010 IEEE
International Conference on Data Mining, pp. 797–802. IEEE (2010)
10. Lewis, D., Gale, W.: A sequential algorithm for training text classifiers. In: Proceed-
ings of the 17th Annual International ACM SIGIR Conference, pp. 3–12. Springer-
Verlag New York, Inc. (1994)
11. Roy, N., McCallum, A.: Toward optimal active learning through sampling estima-
tion of error reduction. In: Machine Learning-International Workshop then Con-
ference, pp. 441–448. Citeseer (2001)
12. Settles, B., Craven, M., Ray, S.: Multiple-instance active learning. In: Advances in
Neural Information Processing Systems (NIPS). Citeseer (2008)
13. WEKA Machine Learning Project, “Weka”, http://www.cs.waikato.ac.nz/~ml/weka
14. Asuncion, A., Newman, D.: UCI machine learning repository (2007), http://www.ics.uci.edu/~mlearn/mlrepository.html
Appendix: Derivation of Formula 1

$$
\begin{aligned}
C(T_P \mid A^n) &= \frac{P(A^n \mid T_P) \times P(T_P)}{P(A^n)}
= \frac{P(A^{n-1}, A_n \mid T_P) \times P(T_P)}{P(A^n)} \\
&= \frac{P(A^{n-1} \mid T_P) \times P(T_P) \times P(A_n \mid T_P) \times P(A^{n-1})}{P(A^{n-1}) \times P(A^n)} \\
&= C(T_P \mid A^{n-1}) \times P(A_n \mid T_P) \times \frac{P(A^{n-1})}{P(A^n)}
\end{aligned}
\tag{3}
$$

and

$$
\begin{aligned}
\frac{P(A^{n-1})}{P(A^n)} &= \frac{P(A^{n-1})}{P(A^n \mid T_P) \times P(T_P) + P(A^n \mid T_N) \times P(T_N)} \\
&= \frac{P(A^{n-1})}{P(A^{n-1} \mid T_P) \times P(T_P) \times P(A_n \mid T_P) + P(A^{n-1} \mid T_N) \times P(T_N) \times P(A_n \mid T_N)} \\
&= \frac{1}{C(T_P \mid A^{n-1}) \times P(A_n \mid T_P) + C(T_N \mid A^{n-1}) \times P(A_n \mid T_N)}.
\end{aligned}
$$

As $A_n = (P, f_n)$,

$$
\begin{aligned}
C(T_P \mid A^n) &= \frac{C(T_P \mid A^{n-1}) \times P(A_n \mid T_P)}{C(T_P \mid A^{n-1}) \times P(A_n \mid T_P) + \big(1 - C(T_P \mid A^{n-1})\big) \times P(A_n \mid T_N)} \\
&= \frac{C(T_P \mid A^{n-1}) \times f_n}{C(T_P \mid A^{n-1}) \times f_n + \big(1 - C(T_P \mid A^{n-1})\big) \times (1 - f_n)}.
\end{aligned}
$$
A Term Association Translation Model for Naive
Bayes Text Classification
1 Introduction
Text classification (TC) is the task of classifying documents into a set of pre-
defined categories. It has long been an important research topic in information
retrieval (IR). Many statistical classification methods and machine learning (ML)
techniques have been applied to TC, such as the naive Bayes classifier [12],
the support vector machines [10], the k-nearest neighbor method [20], and the
boosting method [16]. In addition, text classification based on term associations
[1] is also a promising approach. The performance of text classification highly
depends on the document representation. Most of the existing methods repre-
sent a document using a vector space model (VSM) or a language model (LM).
Generally, the bag-of-words (BoW) method is a widely used data representation
in IR and TC. Under this scheme, each document is modeled as a vector with
a dimension equal to the size of the dictionary, and each element of the vector
denotes the frequency that a word appears in the document. Basically, all the
words are treated independently.
An important restriction of most existing TC methods is that individual terms are usually too general and that these methods do not consider the associations between words in the documents. In some cases of TC, individual words are not sufficient to represent the accurate information of the document. For example, a document with “shuttle launch” may be assumed to belong
to the “ball game” class. However, if the word “NASA” is an association term, it is
very likely that the document should be assigned to the “aeronautics” class.
It is well-known that the relationships between words are very important for
statistical language modeling. Using LM for TC has been studied recently [2,14].
Although N-gram LMs can exploit the relationships between words, they only
consider the dependencies of neighboring words [5]. For example, the trigram
LM is unable to characterize word dependence beyond the span of three succes-
sive words. In [22], the trigram LM was improved by integrating with the trigger
pairs, which extract the word relationships from the sequence of historical words.
Nevertheless, a trigger pair is word order dependent. In other words, a word can
only be triggered by the previous context. Recent studies have revealed that
modeling term associations could provide richer semantics of documents for LM
and IR [4,18,19]. Cao et al. [4] integrated the word co-occurrence information and
the WordNet information into language models. Wei and Croft [18] investigated
the use of term associations to improve the performance of LM-based IR. In [19],
the word associations were integrated into the topic modeling paradigm. Adding
word associations to represent a document inevitably increases the model’s com-
plexity, but the new information reduces the ambiguity mentioned above. Generally, any set of words that co-occur in similar contexts can be considered to have a strong association and can be collected as associative words, e.g., “uneven bars” and “balance” in the class of gymnastics, and “aerofoil” and “jet engine” in the class of airplane transportation. However, associative words do not necessarily co-occur in a document. We believe that a language model considering term associations would be more useful in TC.
In this paper, we propose a novel model for text classification by incorporating
the strengths of term associations into the translation LM framework. Different
from the traditional TC techniques and algorithms in the literature, we model
the associations between words existing in the documents of a class. To discover
the associative terms in the documents, we learn the translation language model
based on the joint probability (JP) of the associative terms through the Bayes
rule and based on the mutual information (MI) of the associative terms.
The remainder of this paper is organized as follows. In Section 2, we briefly
review the framework of the naive Bayes classifier and language models. The
proposed models for text classification are presented in Section 3. Experimental
setup and results are discussed in Section 4. Finally, we give the conclusions in
Section 5.
2 Related Work
2.1 Terminology
We begin by defining the notation and terminology in this paper. A word or
term is a linguistic building block for text. A word is denoted by w ∈ V =
{1, 2, . . . , |V |}, where |V | is the number of distinct words/terms. A document, rep-
resented by d = {w1 , · · · , wnd }, is an ordered list of nd words. A query, denoted
by q = {q1 , · · · , qT }, is a string of T words. A collection of documents is de-
noted by D = {d1 , · · · , d|D| }, where |D| is the number of documents in collection
D. A background model, denoted by MB , is the language model estimated in
collection D. A set of class labels is denoted by C = {c1 , · · · , c|C| }, where |C|
is the number of distinct classes. An LM M is a probability function defined on
a set of word strings. This includes the important special case of the probability
P (w|M) of a word w. A class LM, denoted by MC , is the language model
estimated based on class c.
The word probability P (w|c) and the class prior probability P (c) are estimated
from the training documents with Laplace smoothing as follows
$$P(w \mid c) = \frac{1 + n(w, c)}{|V| + N(c)}, \tag{3}$$
$$P(c) = \frac{1 + n(d, c)}{|C| + |D|}, \tag{4}$$
where n(w, c) is the number of times word w occurs in the training documents
that belong to class c; N (c) is the total number of words in the training doc-
uments that belong to class c; n(d, c) denotes the number of documents that
belong to class c; and |D| is the total number of training documents.
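A small Python sketch of these Laplace-smoothed estimates is shown below, assuming documents are given as token lists; the function and variable names are illustrative rather than the authors' implementation.

```python
from collections import Counter

def estimate_nb_params(docs, labels, vocab):
    """Laplace-smoothed estimates of P(w|c) and P(c) as in Eqs. (3) and (4).

    docs: list of token lists; labels: class label per document; vocab: iterable of words.
    """
    vocab = list(vocab)
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}      # n(w, c)
    doc_counts = Counter(labels)                       # n(d, c)
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
    p_w_given_c = {}
    for c in classes:
        total = sum(word_counts[c].values())           # N(c)
        p_w_given_c[c] = {w: (1 + word_counts[c][w]) / (len(vocab) + total) for w in vocab}
    p_c = {c: (1 + doc_counts[c]) / (len(classes) + len(docs)) for c in classes}
    return p_w_given_c, p_c
```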
Several extensions of the naive Bayes classifier have been proposed. For exam-
ple, Nigam et al. [13] combined the Expectation-Maximization (EM) algorithm
and the naive Bayes classifier to learn from both labeled and unlabeled doc-
uments in a semi-supervised manner. More recently, Dai et al. [7] proposed a
transfer learning algorithm to learn the naive Bayes classifier for text classifica-
tion, which allowed the distributions of the training and test data to be different.
However, these methods all assume that the words in a document are indepen-
dent of each other; hence, they cannot cope well with the term dependence and
association.
$$\hat{d} = \arg\max_{d_m} f(q, d_m) = \arg\max_{d_m} P(d_m \mid q) = \arg\max_{d_m} P(q \mid d_m) P(d_m). \tag{5}$$
Assuming that the documents {d1 , · · · , d|D| } have an equal prior probability of
relevance, the ranking can be done according to the likelihood of the N -gram
language model
$$P(q \mid d_m) = P(q_1, \cdots, q_T \mid d_m) = \prod_{t=1}^{T} P(q_t \mid q_{t-n+1}^{t-1}, d_m), \tag{6}$$
where each word $q_t$ only depends on its $n-1$ historical words $q_{t-n+1}^{t-1} = \{q_{t-n+1}, \cdots, q_{t-1}\}$. $P(q_t \mid q_{t-n+1}^{t-1}, d_m)$ can be estimated according to the maximum likelihood estimate,
$$P(q_t \mid q_{t-n+1}^{t-1}, d_m) = \frac{c(q_{t-n+1}^{t}, d_m)}{c(q_{t-n+1}^{t-1}, d_m)}, \tag{7}$$
where $c(q_{t-n+1}^{t}, d_m)$ denotes the number of times that word $q_t$ follows the historical words $q_{t-n+1}^{t-1}$ in document $d_m$ and $c(q_{t-n+1}^{t-1}, d_m)$ denotes the number of times that the historical words $q_{t-n+1}^{t-1}$ occur in document $d_m$. The unigram
document model is generally adopted in the IR community [15]. However, the
document terms are often too few to train a reliable ML-based model because
the unseen words lead to zero unigram probabilities. Zhai and Lafferty [21] have
used several smoothing methods to deal with the data sparseness problem in
LM-based IR.
Since previous research [4,18,19] has shown that some relationships exist
between words, we utilize them in the document model rather than using the
traditional unigram document model for text classification.
Assuming that P (c) is uniformly distributed and applying the unigram class LM
in the task, the decision can be rewritten as
The traditional naive Bayes classifier usually uses Laplace smoothing to deal
with the zero probability problem. However, some previous research has shown
that it is not as effective as the smoothing methods for language modeling [2,14].
Therefore, we can interpolate a unigram class LM with the unigram collection
background model by using the Jelinek-Mercer smoothing method as follows,
$$P(w_i \mid \mathcal{M}_C) = \lambda P(w_i \mid c) + (1 - \lambda) P(w_i \mid \mathcal{M}_B), \tag{10}$$
where λ can be tuned empirically. In this paper, the method based on (10) is
denoted as NBC-UN, and λ is set to 0.5.
In order to discover the association between two terms wi and w, we are
interested in Pt (wi |w), the probability that word wi will occur given that w
occurs. The term translation probability Pt (wi |w) is different from the bigram
probability P (wi |w) in that the words wi and w are not limited to occur in
order and adjacently in the former. Then, the term association information can
be integrated into the unigram class model as follows,
$$P(w_i \mid c) = \sum_{w \in c} P(w_i \mid w) P(w \mid c), \tag{11}$$
where P (w|c) reflects the distribution of words in the training documents of class
c, which can be computed via the maximum likelihood estimate. By replacing
P (wi |c) in (10) with the one computed by (11), we have
$$P(w_i \mid \mathcal{M}_C) = \lambda \Big[ \sum_{w \in c} P(w_i \mid w) P(w \mid c) \Big] + (1 - \lambda) P(w_i \mid \mathcal{M}_B). \tag{12}$$
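For illustration, scoring a document under the smoothed, association-augmented class model of (12) might look as follows; the dense translation matrix and the array layout are assumptions of the sketch, not the authors' implementation.

```python
import numpy as np

def class_log_likelihood(doc_word_ids, trans, p_w_given_c, p_w_background, lam=0.5):
    """Log-likelihood of a document under the association-augmented class LM of Eq. (12).

    trans[i, j] holds the translation probability P(w_i | w_j); p_w_given_c and
    p_w_background are vocabulary-sized vectors of P(w|c) and P(w|M_B).
    """
    assoc = trans @ p_w_given_c                        # sum_w P(w_i | w) P(w | c) for each w_i
    probs = lam * assoc + (1.0 - lam) * p_w_background
    return float(np.sum(np.log(probs[doc_word_ids] + 1e-300)))
```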
The model in (12) is obviously more computationally intensive than the model
in (9). Therefore, we need to build a global term translation model for all classes
and the word probability distribution for each class beforehand. To discover the
associative terms in the training documents, we learn the translation LM based
on the joint probability of the associative terms through the Bayes rule and
based on the mutual information (MI) of the associative terms.
$$P_t(w_i \mid w) = \frac{P(w_i, w)}{P(w)}, \tag{13}$$
where the joint probability of $w_i$ and $w$ can be expressed as
$$P(w_i, w) = \sum_{c} P(w_i \mid c) P(w \mid c) P(c), \tag{14}$$
if $w_i$ and $w$ are assumed sampled independently and identically from the unigram class model $c$, and the probability of $w$ can be expressed as
$$P(w) = \sum_{c} P(w \mid c) P(c). \tag{15}$$
After re-normalizing $P(w_i, w)$ in (14) and $P(w)$ in (15), and considering a uniform prior $P(c)$, we obtain
$$P_t(w_i \mid w) = \frac{\sum_c P(w_i \mid c) P(w \mid c)}{\sum_c P(w \mid c)}. \tag{16}$$
The method based on (12) with Pt (wi |w) computed by (16) is denoted as TATM-
JP (the term association translation model estimated by the joint probability of
terms).
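A compact sketch of computing the TATM-JP translation probabilities of (16) under a uniform class prior is given below; the matrix layout is an assumption of the sketch.

```python
import numpy as np

def translation_probs_jp(p_w_given_c):
    """Translation probabilities Pt(wi|w) from Eq. (16), assuming a uniform class prior.

    p_w_given_c: array of shape (num_classes, vocab_size) whose rows are P(w|c).
    Returns T with T[i, j] = Pt(w_i | w_j).
    """
    numer = p_w_given_c.T @ p_w_given_c       # numer[i, j] = sum_c P(w_i|c) P(w_j|c)
    denom = p_w_given_c.sum(axis=0)           # denom[j]    = sum_c P(w_j|c)
    return numer / np.maximum(denom, 1e-12)   # divide column j by denom[j]
```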
For TATM-MI, P(w_i, w) is estimated as the ratio of the number of documents that contain both w_i and w, i.e., c_d(w_i, w), to the total number of documents |D|, as follows:
$$P(w_i, w) = \frac{c_d(w_i, w)}{|D|}; \tag{18}$$
$$P(w_i, \bar{w}) = \frac{c_d(w_i) - c_d(w_i, w)}{|D|}, \tag{19}$$
If the two words wi and w tend to associate with each other, the probability
would be higher. The method based on (12) with Pt (wi |w) computed by (20) is
denoted as TATM-MI (the term association translation model estimated based
on the mutual information of terms).
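Since the mutual-information formula itself (Eqs. 17 and 20) is not reproduced in this excerpt, the sketch below only computes the document-level estimates of (18) and (19) that feed it; the helper is hypothetical, and the pairwise counting pass can become expensive for large vocabularies.

```python
from collections import Counter
from itertools import combinations

def doc_cooccurrence_probs(docs):
    """Document-level estimates used by TATM-MI: P(wi, w) of Eq. (18), P(wi, w_bar) of Eq. (19)."""
    n_docs = len(docs)
    doc_freq = Counter()     # c_d(w): number of documents containing w
    pair_freq = Counter()    # c_d(wi, w): number of documents containing both wi and w
    for doc in docs:
        words = sorted(set(doc))
        doc_freq.update(words)
        pair_freq.update(combinations(words, 2))

    def p_joint(wi, w):
        return pair_freq[tuple(sorted((wi, w)))] / n_docs                     # Eq. (18)

    def p_without(wi, w):
        return (doc_freq[wi] - pair_freq[tuple(sorted((wi, w)))]) / n_docs    # Eq. (19)

    return p_joint, p_without
```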
4 Experiments
4.1 Corpora
We evaluate the proposed TC methods on two standard document collec-
tions: Reuters-21578 (Reuters)¹ and 20 Newsgroups (20NG)². According to the
ModApte split, the Reuters corpus is separated into 7,194 documents for training
and 2,788 documents for testing. 135 categories have been defined, but only 118
¹ http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
² http://people.csail.mit.edu/jrennie/20Newsgroups/
            Micro-averaged           Macro-averaged
            P      R      F          P      R      F
NBC         0.802  0.800  0.801      0.817  0.795  0.806
NBC-UN      0.809  0.807  0.808      0.822  0.802  0.812
TATM-JP     0.818  0.815  0.817      0.827  0.810  0.818
TATM-MI     0.821  0.819  0.820      0.829  0.814  0.821
The use of term associations for TC has attracted great interest. This paper has presented a new term association translation model for TC. The proposed model can be learned based on the joint
probability of the associative terms through the Bayes rule or based on the
mutual information of the associative terms. The experimental results show that
the new model learned in either way outperforms the traditional TC methods.
For future work, we plan to investigate the effect of the feature selection method
[17] for the selection of associative terms. In addition, we will integrate our model
into the topic models such as probability latent semantic analysis (PLSA) [9] or
latent Dirichlet allocation (LDA) [3] for text classification. Another interesting
direction is to combine the term association document model with the relevance-
based document model, and apply the combined model in TC.
References
A Double-Ensemble Approach for Classifying Skewed Data Streams
1 Introduction
These days many applications deal with large amounts of transaction data, e.g., network traffic data, sensor network data and web usage data [3]. Such data,
also referred to as data streams in the rest of the paper, often present skewed
distributions, i.e. some classes are not sufficiently represented while instances of
other classes are over-represented.
Class imbalance exists in a large number of real-world domains and, hence,
learning on the static imbalanced data has received great focus [4,6]. Existing
solutions can be divided into the following four categories: (i) under-sampling the
majority class, so that its size matches that of the minority class(es); (ii) over-
sampling the minority class so as to match the size of the other class(es); (iii)
internally biasing the learning process so as to compensate for class imbalance;
(iv) multi-expert systems. Despite such efforts, most of these methods, while increasing the accuracy on the minority class, decrease the global accuracy in comparison with traditional learning algorithms.
Turning our attention to data streams, recent research has been directed towards the topic of data stream classification [7,15,8,16]. Few methods, however, have been designed to classify skewed data streams [9].
Therefore, skewed data streams classification deserves more attention. In this
respect, we propose here a classification method for skewed data streams, pre-
senting the following contributions: (i) we discuss the pros and cons of metrics
for performance evaluation under class skew; (ii) we present a review of the lit-
erature concerning classification methods for both static and streaming skewed
datasets; (iii) we propose a new approach for skewed data streams classification.
Compared with existing methods, our proposed method improves not only the
accuracy on each class but also the global recognition accuracy, as confirmed by
experiments carried out on three public datasets.
The rest of the paper is organized as follows: we present background and
motivations in section 2, where we also review related work. In section 3, we
introduce our approach in detail. In section 4, we report the experimental results.
Finally, we conclude the paper in section 5.
belonging to class i. Hence, $n_{ii}/n_{+i}$ represents the accuracy for each class. It is clear that gacc ranges in [0, 1]. For two-class skewed data classification tasks, we further introduce the following two metrics, which specialize in measuring the performance of a classifier on the two different classes:
– True Positive Rate or Recall, which is defined as $TP_{rate} = acc^{+} = \frac{TP}{TP + FN}$;
– True Negative Rate, which is defined as $TN_{rate} = acc^{-} = \frac{TN}{TN + FP}$.
From the above definitions, for a two-class recognition problem, we obtain $gacc = \sqrt{acc^{+} \cdot acc^{-}}$. On the one hand, to get a large value of gacc, both accuracies should be large. On the other hand, gacc will be low if either accuracy value is low. Hence, gacc is a balance of acc+ and acc−. Nevertheless, if we only use the gacc
value to evaluate a classifier’s performance, we cannot distinguish its separate performance on the two different classes. As an example, consider the classifier for the Credit Card data mentioned above. Its acc− value is 100% but, since its acc+ is
0%, the gacc value for this classifier is 0%. This example confirms that neither acc
nor gacc on its own is enough to reflect the overall performance of the classifier
on skewed data, motivating the use of acc+ and acc− .
As a short summary, the metrics of acc, gacc, acc+ (or acc− ) should be used
together as a joint measure to evaluate classification performance on skewed data
streams. Indeed, on the one hand, acc measures the global recognition rate and,
on the other hand, gacc reflects how much classifier performance is balanced. In
addition, acc+ (or acc− ) reports separate classification performance on the two
different classes.
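For concreteness, the joint measure could be computed from binary predictions as in the following sketch (a hypothetical helper, with 1 taken as the minority/positive class).

```python
import numpy as np

def skew_metrics(y_true, y_pred):
    """Return acc, acc+, acc-, and gacc for binary labels (1 = minority/positive class)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / len(y_true)
    acc_pos = tp / (tp + fn) if (tp + fn) else 0.0     # true positive rate
    acc_neg = tn / (tn + fp) if (tn + fp) else 0.0     # true negative rate
    return acc, acc_pos, acc_neg, (acc_pos * acc_neg) ** 0.5
```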
on the amount of needed samples, member samples from the k nearest neigh-
bors are randomly chosen.
3. Internally biasing the discrimination-based process to compensate for class imbalance without altering the class distributions. This can be done by assigning different weights to prototypes of different classes [13], or by using a weighted distance function in the classification phase that compensates for the training-set (TS) imbalance without altering the class distribution [1].
4. Multi-expert systems (MES). In an MES, each composing classifier Ci is trained on a TS composed of a sample subset Ni of the majority class N and all instances from the minority class P. So after sampling a subset Ni from N, Ci is trained on Ni ∪ P. Later, for the test data, the outputs of the Ci on the test samples are combined to make the final prediction [11]. The main motivation of this approach lies in the observation that an MES generally produces better results than those provided by any of its composing classifiers.
2.4 Motivations
The review of the literature reported so far shows that recent research has
focused, on the one hand, on class-imbalance learning on static data and, on
the other hand, on classifying non-skewed data streams. However, conventional
methods for non-skewed data streams usually do not give enough attention to
skewed streams, whereas static class-imbalance learning methods often harm the
accuracy on majority class, although they increase the recognition accuracy on
the minority class.
3 Proposed Method
As reported in section 2.4, we have noticed that balancing the accuracies for
each class has the side effect of decreasing the global recognition accuracy (acc).
Therefore, we present in the following a method aiming at achieving larger acc
while still improving acc+ or gacc.
[Fig. 1. Framework of Proposed Method (left) and Example of Acc-Gacc curves (right): acc (%) is plotted against gacc (%) as the threshold t varies from 0 to 1, with t* marking the selected operating point.]
Given an instance x belonging to the test set, the final label O(x) is determined as follows:
$$O(x) = \begin{cases} O_{NEC}(x) & \text{if } \phi(x) \geq t^{*} \\ O_{SEC}(x) & \text{otherwise} \end{cases} \tag{1}$$
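A minimal sketch of this decision rule is shown below; it assumes NEC exposes a method returning both a label and a reliability score, and SEC a plain predict method — illustrative interfaces rather than the paper's code.

```python
def combined_predict(x, nec, sec, threshold):
    """Double-ensemble decision rule (Eq. 1): trust NEC when its reliability is
    at least t*, otherwise fall back to the skew-oriented ensemble SEC."""
    label, reliability = nec.predict_with_reliability(x)
    if reliability >= threshold:
        return label
    return sec.predict(x)
```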
When the reliability φ(x) provided by NEC is larger than the threshold t∗ , the
final label corresponds to the label returned by NEC because it is reasonable to
assume that NEC is likely to provide a correct classification. But, when φ(x) is
below t∗ , O(x) is equal to the label assigned by SEC, i.e. a classification method
tailored specially for skewed data. Indeed, in this case, the value of the reliability
suggests that the decision returned by NEC may not be safe. We will explain
the rationale of reliability estimation in subsection 3.3.
where x is a sample, ONEC (x) and OSEC (x) are the outputs provided by NEC
and SEC, φ(x) is the reliability of NEC, and O(x) is the final label.
the difference of predictions on the two different classes. A low value of φ(x)
will suggest that the classification decision made on instance x is not safe since,
for instance, it may be a borderline instance or it can be affected by noise in
the feature space; while a large value of φ(x) would suggest that the classifier is
more likely to have provided a correct classification [10].
In order to explain the rationale of using the reliability for skewed data clas-
sification, let us consider Fig. 2, where we report the experimentally measured
distributions of reliability values for test samples labeled by a classifier (i.e. NEC)
trained on a skewed distribution. On the one hand, when we apply NEC, the
minority class samples are more likely to receive low reliability values (see the
left part of Fig. 2). On the other hand, although low reliability values can also be
found for true negative instances, instances with high reliability values are more
likely to belong to the majority (negative) class (see the right part of Fig. 2).
In short, there are two main reasons for using reliability estimations on skewed
data stream classifiers: (1) applying NEC, samples with high reliability values
are more likely to belong to negative (majority) class. Hence, we can use reli-
ability values to distinguish between positive and negative instances; (2) SEC
is trained on artificially balanced training sets, so it should recognize not only
positive instances, but also negative ones. Therefore, although instances with
low reliability values can contain negative (majority) instances, SEC should be
able to correctly classify most of them.
4 Experimental Evaluation
In this section, we first describe the datasets used for the experiments. Second,
we introduce the experimental protocol and, third, we report the experimental
results.
Fig. 2. Examples of reliability distributions for majority and minority class samples
4.1 Datasets
We use the three datasets shown in Table 3. These datasets vary in both num-
ber of features and skewness. The prediction task for the Adult dataset is to
determine whether a person makes over 50K income a year. We only use two
classes of the Forest Cover dataset: Ponderosa and Lodgepole Pine. The task of
the Forest Cover dataset is to predict the forest cover type. The Credit Card
dataset was provided by the 2009 UCSD/FICO data mining contest¹ and used
for predicting whether a transaction is an anomaly or not.
We test our approach on the above mentioned datasets. For each dataset, we
divide the data into batches, and the last two batches are left for testing only.
We vary the size of the batches in our tests: each batch in the Credit Card data
contains 10,000 transactions, the size of a batch in the Forest Cover dataset is
20,000, and it is 5,000 for the Adult data.
¹ http://mill.ucsd.edu
As there are few methods for skewed data stream classification, we implement
SDM07 [9] for SEC and we apply SEA [15] for NEC. These two methods were
chosen because both of them are well recognized methods for classifying skewed
or non-skewed data streams. Since both NEC and SEC are classifier ensembles,
we use C4.5, Naïve Bayes and Logistic Regression as the base learners in our experiments. Performance is estimated by measuring acc, gacc, acc+, and acc−.
4.3 Results
Tables 2, 4 and 5 report the results of the tests we performed on the Credit Card
dataset. Tables 6 and 7 show a portion of the test results on both Adult and
Forest Cover datasets. It is worth noting that SEA usually achieves the largest
acc value but has the smallest gacc value. The case is reversed for SDM07, with
the largest value for gacc but the smallest value for acc. Our proposed method,
however, achieves a balanced performance between the two above methods. As
discussed in section 2, this occurs because SEA is a learning method that usually
ignores the minority class in skewed data. SDM07, on the other hand, is biased
toward the minority class but harms the recognition accuracy on majority class.
Unlike the other two methods, our proposed approach balances acc and gacc
simultaneously.
We now provide a deeper analysis of the results achieved on the Credit Card
dataset (Tables 2, 4 and 5). We notice that: (i) SEA usually achieves the best
values of acc, while SDM07 often has the best values of both acc+ and gacc;
(ii) Sometimes the gacc values of our method are as large as or even larger than
those of SDM07; (iii) Our method increases the acc+ values of SEA by up to 70%.
Our method also outperforms SEA in terms of gacc by 25%; (iv) Our method
does not outperform SEA in terms of acc. With respect to SEA, our method
decreases acc by approximately 4%, but the decrease in acc is usually 13% in
the case of SDM07.
In summary, the above observations show that the proposed method takes into
account minority class instances without harming the global accuracy as much as
existing methods. We owe this fact to both the double-ensemble framework and
the multi-objective optimization technique embedded in the learning algorithm,
which dynamically adapts its threshold to variation in data distribution.
Similar results were also found in the experiments with the other two datasets.
The results are shown in Table 6 and Table 7.
Finally, we report the elapsed time during training and test phases of each
method. The running time increases with the batch size. Using C4.5 as the base
learner on Credit Card data, the proposed method takes 353 seconds, whereas
SDM07 spends 280 seconds. In the case of the Adult data, the proposed method
and SDM07 use 274 and 234 seconds, respectively. These results are reasonable,
because the proposed method trains two ensembles of classifiers. Hence, the
training time is slightly longer than that of SDM07.
5 Conclusions
In this paper, we have presented a classification method for skewed data streams.
This method is based on two classifier ensembles suited for learning with and
without class skew. While still improving the accuracy on each class, the pro-
posed method does not decrease the global recognition accuracy as much as
existing methods. Future work will be directed towards extending our study to
multi-class data streams.
References
1. Barandela, R., Valdovinos, R.M., Sánchez, J.S.: New applications of ensembles of
classifiers. Pattern Analysis & Applications 6(3), 245–256 (2003)
2. Batista, G.E., Carvalho, A.C., Monard, M.C.: Applying One-sided Selection to Un-
balanced Datasets. In: Cairó, O., Cantú, F.J. (eds.) MICAI 2000. LNCS, vol. 1793,
pp. 315–325. Springer, Heidelberg (2000)
3. Bay, S.D., Kibler, D., Pazzani, M.J., Smyth, P.: The uci kdd archive of large data
sets for data mining research and experimentation. SIGKDD Explorations, 81–85
(2000)
4. Chan, P.K., Fan, W., Prodromidis, A.: Distributed data mining in credit card fraud
detection. IEEE Intelligent Systems 14, 67–74 (1999)
5. Chawla, N.V., Bowyer, K.W., Hall, L.O.: Smote: Synthetic minority over-sampling
technique. Journal of Artificial Intelligence Research 16, 321–357 (2002)
6. Chawla, N.V., Japkowicz, N.: Editorial: Special issue on learning from imbalanced
data sets. SIGKDD Explorations 6 (2004)
7. Domingos, P., Hulten, G.: Mining high-speed data streams. In: Proc. SIGKDD,
pp. 71–80 (2000)
8. Gama, J., Rocha, R., Medas, P.: Accurate decision trees for mining high-speed
data streams. In: Proc. SIGKDD, pp. 523–528 (2003)
9. Gao, J., Fan, W., Han, J., Yu, P.S.: A general framework for mining concept-
drifting data streams with skewed distributions. In: Proc. SIAM SDM 2007, pp.
3–14 (2007)
10. Kittler, J., Hatef, M., Duin, R.P.W., Matas, J.: On combining classifiers. IEEE
Transactions On Pattern Analysis and Machine Intelligence 20(3), 226–239 (1998)
11. Kotsiantis, S., Pintelas, P.: Mixture of expert agents for handling imbalanced data
sets. Ann. of Mathematics, Computing and Teleinformatics 1(1), 46–55 (2003)
12. Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: One-sided
selection. In: Proc. ICML 1997, pp. 179–186 (1997)
13. Pazzani, M., Merz, C., Murphy, P., Ali, K.: Reducing misclassification costs. In:
Proc. ICML 1994, pp. 217–225 (1994)
14. Soda, P.: A multi-objective optimisation approach for class imbalance learning.
Pattern Recognition 44(8), 1801–1810 (2011)
15. Street, W.N., Kim, Y.: A streaming ensemble algorithm (SEA) for large-scale clas-
sification. In: Proc. SIGKDD, pp. 377–382 (2001)
16. Wang, H., Fan, W., Yu, P.S., Han, J.: Mining concept-drifting data streams using
ensemble classifiers. In: SIGKDD, pp. 226–235 (2003)
Generating Balanced Classifier-Independent
Training Samples from Unlabeled Data
1 Introduction
Supervised learning algorithms can provide promising solutions to many real-
world problems such as text classification, anomaly detection and information
security. A major limitation of supervised learning is the difficulty in obtaining
labeled data to train predictive models. Ideally, one would like to train classifiers
on diverse labeled data representative of all classes. In many domains, such as
text classification or security, there is an abundant amount of unlabeled data, but
obtaining a representative subset is challenging: data is typically highly skewed
and sparse.
There are two widely used approaches for selecting data to label—random
sampling and active learning. Random sampling, a low-cost approach, produces
a subset of the data with a distribution similar to the original data set, yielding skewed training data when the data is unbalanced. Training with unbalanced labeled data
yields poor results as reported in recent work on the effect of class distribution on
learning and performance degradation [1–3]. Active learning produces training
data incrementally by identifying the most informative data to label at each
phase [4–6]. However, active learning requires knowing the classifier in advance,
which is not feasible in many real applications, and requires costly re-training
at each step.
In this paper, we present new strategies to generate training samples from
unlabeled data to overcome limitations in random and existing active sampling
methods. Our core algorithm is an iterative method, in which we generate a
small fraction (e.g., 10%) of the desired training set each iteration, indepen-
dently of both the original data distribution as well as the target classifier. More
specifically, we first label a small number of randomly selected samples and
subsequently apply semi-supervised clustering to embed prior knowledge (i.e.,
labeled samples) to produce clusters approximating the true classes [7–9]. We
then estimate the class distribution of the clusters, and increase the balancedness
of the training sample via biased sampling.
A simplistic strategy for biased sampling would be to assume that the class
distribution of a cluster is the same as the distribution of labeled samples in
the cluster, and to draw samples proportionally to the estimated class distri-
butions. However, this assumption does not hold in early iterations when the
number of labeled samples is small, and there is high uncertainty about the
class distributions. We present two hybrid approaches to address this issue that
perform well in practice. The first approach is to combine the estimated class
distribution-based sampling and random sampling. As the number of labeled
samples increases, we decrease the influence of random sampling favoring the es-
timation based on previously labeled samples. The second approach is for cases
where additional domain knowledge is available. We use the domain knowledge
to estimate class distributions. Domain knowledge may come in many forms,
such as conditional probabilities and correlation, e.g., there is a heavy skew in
the geographical location of servers hosting malware[10]. We perform a similar
transition between the domain knowledge-based density estimation and previ-
ously labeled sample-based estimation.
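The paper does not give the exact weighting used for this transition, but one simple way it could be implemented is sketched below: a uniform (random-sampling) component whose weight decays as labeled samples accumulate; the 1/(1 + αn) schedule and the helper name are purely assumptions for illustration.

```python
def blended_class_distribution(labeled_counts, n_classes, alpha=1.0):
    """Blend the labeled-sample class frequencies in a cluster with a uniform component.

    The uniform (random-sampling) component shrinks as labels accumulate, so early
    iterations behave like random sampling and later ones trust the estimates.
    """
    n_labeled = sum(labeled_counts.values())
    w = 1.0 / (1.0 + alpha * n_labeled)      # weight of the uniform component
    uniform = 1.0 / n_classes
    return [w * uniform + (1.0 - w) * labeled_counts.get(c, 0) / max(n_labeled, 1)
            for c in range(n_classes)]
```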
We have validated these strategies on 14 data sets from the UCI data reposi-
tory [11] as well as a private data set on authorizing users to systems (i.e., labeled
grant and deny). These data sets reflect a range of parameters: some are bal-
anced and others highly skewed; and some have binary classes while others have
multiple classes. We compare our strategies to random sampling as well as un-
certainty based active sampling based on three classifiers: Naive Bayes, Logistic
Regression, and SVM. The experiments show that, for highly skewed data sets,
our sampling algorithm produces substantially more balanced samples than ran-
dom sampling. For mildly skewed data sets, our method results in about 25%
more minority samples. Similarly, our algorithm performs better than uncer-
tainty sampling based methods for highly skewed samples, producing more than
20% more minority samples on average. For mildly skewed data sets, our algo-
rithm’s results are not statistically different from uncertainty sampling based
on logistic regression. Given that uncertainty sampling requires one to fix the
classifier to be trained and is much slower, we conclude that our algorithm is
always preferable to both random and uncertainty based sampling. We test the
domain knowledge based strategy on the access control permission datasets. Our
results show that, in most cases, the addition of domain knowledge significantly improves the convergence of the sampling, so that we can produce balanced sample
sets more quickly.
The quality of training data can best be evaluated by the performance of
classifiers trained on this data. We have compared various sampling strategies by
training and testing a range of classifiers. Our tests show that the classifiers built
with our training data outperform other classifiers in most of the experimental
scenarios and produce more consistent performance. Further, our classifiers often
outperform uncertainty sampling on AUC and F1 measures, even when sampling
and classification used the same classifier. The experimental results confirm that
our sampling methods are very generic and can produce highly balanced training
data irrespective of the underlying data distribution and the target classifier.
2 Related Work
There is an extensive body of work on generating “good” training data sets. A
common approach is active learning, which iteratively selects informative sam-
ples, e.g., near the classification border, for human labeling [6, 12–14]. The sam-
pling schemes most widely used in active learning are uncertainty sampling and
Query-By-Committee sampling [13, 15, 16]. Uncertainty sampling selects the
most informative sample determined by one classification model, while QBC
sampling determines informative samples by a majority vote. A major problem
with active learning is that the update process is very expensive as it requires
classification of all data samples and retraining of the model at each iteration.
This cost is prohibitive for large scale problems. Techniques such as batch mode
active learning [17, 18] have been proposed to improve the efficiency of uncer-
tainty learning. However, as the batch size grows, the effectiveness of active
learning decreases [18–20].
Another approach is re-sampling, i.e., over- and under-sampling classes [21,
22]; however, this requires labeled data. Recent work combines active learning
and re-sampling to address class imbalance in unlabeled data. Tomanek and
Hahn [23] propose incorporating a class-specific cost in the framework of QBC-
based active learning for named entity recognition. By setting a higher cost for
the minority class, this method boosts the committee’s disagreement value on
the minority class resulting in more minority samples in the training set. Zhu
and Hovy [24] incorporate over- and under-sampling in active learning for word
sense disambiguation. Their algorithm uses active learning to select samples for
human experts to label, and then re-samples this subset. In their experiments,
under-sampling caused negative effects but over-sampling helped increase bal-
ancedness. However, both [24] and [23] are primarily designed and applied to
binary classification problems for text and are hard to generalize to multi-class
problems and non-text domains.
Our approach is iterative like active learning, but it differs crucially in that
it relies on semi-supervised clustering instead of classification and selects target
samples based on estimated class distribution in each cluster. This makes it
more general where the best classifier is not known in advance. Ours is the
first attempt at using active learning with semi-supervised clustering instead of
classification and thus does not suffer from over-fitting. Furthermore, since most
classification methods require the presence of at least two different classes in
the training set, there is a challenge in providing the initial labeling sample for
active learning; an arbitrary insertion of instances from at least two classes is
required. Our method does not have this limitation, and, although not shown
in the experiments, performs as well with a random initial sample. Our work
provides a general framework which is domain independent and can be easily
customized to specific domains.
3.1 Overview
Given an unlabeled dataset with unknown class distribution, potentially skewed,
our goal is to produce balanced labeled samples for training predictive models.
Formally, we can define the balanced training set problem as follows:
Definition 1. Let D be an unlabeled data set containing ℓ classes from which we wish to select T, a subset of D of size N. Let L(T) be the labels of the training data set T; then the balancedness of T is defined as the distance between the label distribution of L(T) and the discrete uniform distribution over the ℓ classes, i.e., D(Uniform(ℓ) ∥ Multi(L(T))). The balanced training set problem is the problem of finding a training data set that minimizes this distance.
If we know the class labels in a given set, then we can use over- and under-
sampling to draw a balanced sample set [21, 22, 25]. However, the class labels are
not known, so instead we must use a series of approximations to approach the
results of this ideal solution. We apply an iterative semi-supervised clustering
algorithm to estimate the class distribution in the unlabeled data set and guide
the sample selection to produce a balanced set. In each iteration, the algorithm
draws a batch of samples (B), and domain experts provide the labels of the
selected samples. The labeled samples are used in subsequent iterations.
Algorithm 1 is a high level description of our strategy. It takes three inputs: D,
an unlabeled data set; ℓ, the number of target classes in D; and N, the number
of training samples to generate.
TrainingSetGeneration(D, ℓ, N, [B])
  if B is undefined then
      B ← min(|D|/10, N/10);
  end
  maxcluster ← |D|/10;
  T ← L(B randomly selected samples);
  while |T| < N do
      {C_1, . . . , C_k} ← SemiSupervisedClustering(D, T, maxcluster);
      T′ ← ∅;
      foreach j = 1 to k do
          num_j ← DetermineOptimalNumberToSample(C_j);
          T′_j ← MaximumEntropySampling(C_j, num_j);
          T′ ← T′ ∪ T′_j;
      end
      T ← T ∪ L(T′);
  end
Algorithm 1. High level steps of the proposed algorithm
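The control flow of Algorithm 1 can be sketched in Python as follows; the helper callables (semi_supervised_clustering, determine_optimal_number_to_sample, max_entropy_sampling) and the labeling oracle label are assumed stand-ins for the components described in the rest of this section, not part of the original pseudocode, and n_classes mirrors the input ℓ consumed inside those helpers.

    import random

    def training_set_generation(D, n_classes, N, label, semi_supervised_clustering,
                                determine_optimal_number_to_sample,
                                max_entropy_sampling, B=None):
        # Sketch of Algorithm 1: iteratively grow a balanced labeled training set T.
        # `label(i)` is the human-labeling oracle; the other callables are assumed
        # interfaces for the sub-procedures described in the text.
        if B is None:
            B = min(len(D) // 10, N // 10)
        max_cluster = len(D) // 10
        # Seed T with the labels of B randomly selected samples.
        T = {i: label(i) for i in random.sample(range(len(D)), B)}
        while len(T) < N:
            clusters = semi_supervised_clustering(D, T, max_cluster)
            selected = []
            for C in clusters:
                num_j = determine_optimal_number_to_sample(C, T)
                selected.extend(max_entropy_sampling(T, C, num_j))
            # Domain experts provide labels for the newly selected samples.
            T.update({i: label(i) for i in selected})
        return T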
the most relevant features in the side information. It learns a global distance met-
ric parameterized by a transformation matrix Ĉ to capture relevant features in
the labeled sample set. It maximizes the similarity between the original data set
X and the new representation Y constrained by the mutual information I(X, Y).
By projecting X into the new space through the transformation Y = Ĉ^{-1/2} X, two
projected data objects, Y_i, Y_j, in the same connected component have a smaller
distance.
Here, we sketch the steps to compute the “within-chunklet” covariance matrix
(transformation matrix) Ĉ. Given a data set X = {x_i}_{i=1}^N and a labeled sample
set L ⊂ X, suppose u connected components (i.e., chunklets) M = {M_j}_{j=1}^u
are obtained based on L, which satisfy X = ∪_{j=1}^u M_j. Let the data points in
a component M_j be denoted as {x_{ji}}_{i=1}^{|M_j|} for 1 ≤ j ≤ u. Then, the covariance
matrix Ĉ is defined by Equation 1, where m_j is the centroid of M_j:

    Ĉ = (1/N) Σ_{j=1}^{u} Σ_{i=1}^{|M_j|} (x_{ji} − m_j)(x_{ji} − m_j)^T    (1)
After projecting the data set into a new space using RCA, the data set is re-
cursively partitioned until all the clusters are smaller than a predetermined
threshold, maxcluster . Algorithm 2 summarizes our semi-supervised clustering
algorithm using RCA.
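Assuming the chunklets are given as lists of row indices into an N × d data matrix, Equation (1) and the RCA projection Y = Ĉ^{-1/2}X can be sketched with NumPy as follows; the small ridge eps added before inverting is a numerical-stability choice not specified in the text.

    import numpy as np

    def rca_projection(X, chunklets, eps=1e-8):
        # X: (N, d) data matrix; chunklets: list of index arrays (connected components).
        # Returns the data whitened by C_hat^{-1/2} (Equation (1) plus the projection).
        N, d = X.shape
        C_hat = np.zeros((d, d))
        for idx in chunklets:
            Mj = X[idx]
            mj = Mj.mean(axis=0)              # chunklet centroid m_j
            diff = Mj - mj
            C_hat += diff.T @ diff            # sum of (x_ji - m_j)(x_ji - m_j)^T
        C_hat /= N                            # within-chunklet covariance (Equation 1)
        # C_hat^{-1/2} via eigendecomposition; eps keeps the matrix invertible.
        w, V = np.linalg.eigh(C_hat + eps * np.eye(d))
        C_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
        return X @ C_inv_sqrt                 # row-wise version of Y = C_hat^{-1/2} X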
prior, the fraction of labeled samples with that class label. As before, we use the
recursive binary clustering technique described previously to cluster the data.
We find that this simple heuristic produces good clusters and yields balanced
samples more quickly for categorical data.
Finally, given the set of clusters {C_i}_{i=1}^k and the number of samples to select
from each cluster, we sample to maximize the entropy of the sample L(T). We
assume that the data in each cluster follows a Gaussian distribution. For a
continuous variable x ∈ C_i with mean μ and standard deviation σ, the normal
distribution N(μ, σ²) has maximum entropy among all real-valued distributions.
The entropy of a multivariate Gaussian distribution [28] is defined as:

    H(X) = (d/2)(1 + log(2π)) + (1/2) log(|Σ|)    (2)

where d is the dimension, Σ the covariance matrix, and |Σ| the determinant of
Σ. Thus, the more variation the covariance matrix has along the principal directions,
the more information it embeds.
Note that the number of possible subsets of r elements from a cluster C can grow
very large (i.e., (|C| choose r)), so finding a subset with the global maximum
entropy can be computationally very intensive. We use a greedy method that
selects the next sample which adds the most entropy to the existing labeled
set. Our algorithm performs the covariance calculation O(rn) times, while the
exhaustive search approach requires O(n^r). If there are no previously labeled
samples, we start the selection with the two samples that have the longest dis-
tance in the cluster. The maximum entropy-based sampling method is presented
in Algorithm 3.
MaximumEntropySampling(T, C, num)
  CU ← unlabeled samples in C;
  TC ← ∅;
  while |TC| < num do
      u ← arg max_{u_i ∈ CU} H(T ∪ {u_i});
      TC ← TC ∪ {u};
      CU ← CU \ {u};
  end
  Return T ∪ TC
Algorithm 3. Maximum entropy sampling strategy
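A small NumPy sketch of this greedy strategy is given below. It assumes T already contains at least two labeled samples (as ensured by the initialization described above) and, following the greedy description in the text, includes the samples already chosen in the current batch when evaluating the entropy of Equation (2); the covariance regularizer eps is an added numerical safeguard.

    import numpy as np

    def gaussian_entropy(S, eps=1e-10):
        # Entropy of a multivariate Gaussian fitted to the rows of S (Equation (2)).
        d = S.shape[1]
        cov = np.cov(S, rowvar=False) + eps * np.eye(d)
        _, logdet = np.linalg.slogdet(cov)
        return 0.5 * d * (1.0 + np.log(2.0 * np.pi)) + 0.5 * logdet

    def maximum_entropy_sampling(T, C_unlabeled, num):
        # T: (m, d) feature vectors of already-labeled samples (m >= 2).
        # C_unlabeled: (n, d) unlabeled members of the cluster.
        # Returns indices (into C_unlabeled) of the `num` greedily chosen samples.
        T = np.asarray(T, dtype=float)
        C_unlabeled = np.asarray(C_unlabeled, dtype=float)
        remaining = list(range(len(C_unlabeled)))
        chosen = []
        for _ in range(min(num, len(remaining))):
            gains = [gaussian_entropy(np.vstack([T, C_unlabeled[chosen + [i]]]))
                     for i in remaining]
            best = remaining[int(np.argmax(gains))]
            chosen.append(best)
            remaining.remove(best)
        return chosen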
summarized in Table 1: some are highly skewed while others are balanced, some
are multi-class while others are binary.
All UCI data sets are used unmodified except the KDD Cup’99 set, which
contains a “normal” class and 20 different classes of network attacks. In this
experiment, we selected only the “normal” and “guess password” classes to create
a highly skewed data set. When a data set is provided with a training set and
a test set separately (e.g., ‘Statlog’), we combined the two sets. The features
in the access control data set are typically organizational attributes of a user:
department name, job roles, whether the employee is a manager, etc. These
categorical features are converted to binary features. Since such access control
permissions are assigned based on a combination of attributes, these data sets
are also useful to assess the benefits of domain knowledge.
For each data set, we randomly select 80% of the data to be used as un-
labeled data, from which training samples are generated. The remaining 20%
of the samples is used to test classifiers trained with the training samples. For
uncertainty-based active learning, we use three widely used classification al-
gorithms, Naive Bayes, Logistic Regression, and SVM, and these variants are
labeled Un Naive, Un LR, and Un SVM respectively. We used the C-support
vector classification (C-SVC) SVM with a radial basis function (RBF) kernel,
and Logistic Regression with an RBF kernel. All classification experiments were
conducted using RapidMiner, an open source machine learning tool kit [29]. Lo-
gistic Regression in RapidMiner only supports binary classification, and thus
it was extended to a multi-class classifier using the “one-against-all” strategy for
multi-class data sets [30]. All experimental results reported here are the average
of 10 runs of the experiments.
Table 2. Distance of the sampled class distributions to the uniform distribution. The
best performing algorithm for each data set is highlighted in bold.
method. Then the balancedness of the sample is defined as the Euclidean distance
between the distributions U(X) and P(X), i.e., d = √( Σ_{i=1}^{k} (U_i − P_i)² ).
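For concreteness, this balancedness measure can be computed in a few lines of NumPy; labels is the array of class labels drawn by a sampling method and k the total number of classes (label values are assumed to be encoded as 0, ..., k−1).

    import numpy as np

    def balancedness_distance(labels, k):
        # Euclidean distance between the empirical label distribution P(X)
        # of the sample and the discrete uniform distribution U(X) over k classes.
        labels = np.asarray(labels)
        counts = np.bincount(labels, minlength=k).astype(float)
        P = counts / counts.sum()
        U = np.full(k, 1.0 / k)
        return float(np.sqrt(((U - P) ** 2).sum()))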
Table 3. The recall rates for the binary class data sets. Min. Ratio refers to the ratio
of the minority class in the unlabeled data set. For the access permission data, the
average and the standard deviation over multiple data sets are reported.
Since uncertainty-based sampling methods are targeted at cases where the
classifier to be trained is known, the right comparison with these methods should
include the performance of the resulting classifiers. Further, these algorithms
are not very efficient due to re-training at each step. With these caveats, we
can directly compare the balancedness of the results. For highly skewed data
sets, our method performs better especially when compared to Un SVM and
Un Naive methods. On KDD’99, we produce 20x and 2x more minority samples
compared to Un Naive and Un SVM respectively, while Un LR performs almost
as well as our algorithm. Similarly for Page Blocks, we perform about 20% better
than these methods. We note that our method found all minority samples for all
10 split sets for the Page Blocks set. For the other data sets, our algorithm shows
no statistically significant difference compared to these methods in almost all
cases, and sometimes we do better. Based on these results, we also conclude that
our method is preferable to the uncertainty-based methods given its broader
applicability and efficiency.
Figure 1 pictorially depicts the performance of our sampling algorithm as well
as the uncertainty-based sampling for a few data sets, to highlight cases where our
method performs better. These figures show the distance from uniform against
the percentage of sampled data over iterations. The results show that our sam-
pling technique consistently converges towards balancedness, while there is some
variation with the uncertainty techniques; this remains true for the other data sets
as well. Note that the distance increases in the Page Blocks and Access Permission
data sets after the 20% point because our method has exhausted all minority samples.
Table 4. Performance comparison of the three classifiers trained with five different
sampling techniques. The figures in bold denote the best performing sampling technique
for each classifier. The figures in italics denote the best performing classifier excluding
the uncertainty sampling methods paired with their respective classifier.
Fig. 2. Comparison of our algorithm (‘Sigmoid’), our algorithm with domain knowledge
(‘+Domain’), and random sampling (‘Random’). The y-axis shows the minority class
density in the training data, and x-axis shows the recall of the minority class.
5 Conclusion
In this paper, we considered the problem of generating a training set that can
optimize the classification accuracy and also is robust to classifier change. We
confirmed through experiments that our method produces very balanced train-
ing data for highly skewed data sets and outperforms other methods in correctly
classifying the minority class. For a balanced multi-class problem, our algorithm
outperforms active learning by a large margin and works slightly better than
random sampling. Furthermore, our algorithm is much faster compared to active
learning.
References
1. Jo, T., Japkowicz, N.: Class imbalances versus small disjuncts. SIGKDD Explo-
rations 6(1) (2004)
2. Weiss, G., Provost, F.: The effect of class distribution on classifier learning: An
empirical study. Dept. of Comp. Science, Rutgers University, Tech. Rep. ML-TR-
43 (2001)
3. Zadrozny, B.: Learning and evaluating classifiers under sample selection bias. In:
ICML (2004)
4. Dasgupta, S., Hsu, D.: Hierarchical sampling for active learning. In: ICML (2008)
5. Ertekin, S., Huang, J., Bottou, L., Giles, C.L.: Learning on the border: active
learning in imbalanced data classification. In: CIKM (2007)
6. Settles, B.: Active learning literature survey. University of Wisconsin-Madison,
Tech. Rep. (2009)
7. Bar-Hillel, A., Hertz, T., Shental, N., Weinshall, D.: Learning a mahalanobis metric
from equivalence constraints. Journal of Machine Learning Research 6 (2005)
8. Wagstaff, K., Cardie, C.: Clustering with instance-level constraints. In: ICML
(2000)
9. Xing, E.P., Ng, A.Y., Jordan, M.I., Russell, S.: Distance metric learning, with
application to clustering with side-information. In: Advances in Neural Info. Proc.
Systems, vol. 15. MIT Press (2003)
10. Provos, N., Mavrommatis, P., Rajab, M., Monrose, F.: All your iFRAMEs point
to us. Google, Tech. Rep. (2008)
11. Frank, A., Asuncion, A.: UCI machine learning repository
12. Campbell, C., Cristianini, N., Smola, A.J.: Query learning with large margin clas-
sifiers. In: ICML (2000)
13. Freund, Y., Seung, H.S., Shamir, E., Tishby, N.: Selective sampling using the query
by committee algorithm. Machine Learning 28(2-3) (1997)
14. Tong, S., Koller, D.: Support vector machine active learning with applications to
text classification. In: ICML (2000)
15. Lewis, D.D., Gale, W.A.: A sequential algorithm for training text classifiers. In:
SIGIR (1994)
16. Seung, H.S., Opper, M., Sompolinsky, H.: Query by committee. In: Computational
Learning Theory (1992)
17. Hoi, S.C.H., Jin, R., Zhu, J., Lyu, M.R.: Batch mode active learning and its ap-
plication to medical image classification. In: ICML (2006)
18. Guo, Y., Schuurmans, D.: Discriminative batch mode active learning. In: NIPS
(2007)
19. Schohn, G., Cohn, D.: Less is more: Active learning with support vector machines.
In: ICML (2000)
20. Xu, Z., Hogan, C., Bauer, R.: Greedy is not enough: An efficient batch mode active
learning algorithm. In: ICDM Workshops (2009)
21. Liu, X.Y., Wu, J., Zhou, Z.H.: Exploratory undersampling for class imbalance
learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (2009)
22. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic
minority over-sampling technique. JAIR 16 (2002)
23. Tomanek, K., Hahn, U.: Reducing class imbalance during active learning for named
entity recognition. In: K-CAP (2009)
24. Zhu, J., Hovy, E.: Active learning for word sense disambiguation with methods for
addressing the class imbalance problem. In: EMNLP-CoNLL (2007)
25. Wu, Y., Zhang, R., Rudnicky, E.: Data selection for speech recognition. In: ASRU
(2007)
26. Wagstaff, K., Cardie, C., Rogers, S., Schrödl, S.: Constrained k-means clustering
with background knowledge. In: ICML (2001)
27. Shortliffe, E.H., Buchanan, B.G.: A model of inexact reasoning in medicine. Math-
ematical Biosciences 23(3-4) (1975)
28. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley Interscience
(1991)
29. Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., Euler, T.: Yale: Rapid proto-
typing for complex data mining tasks. In: Proc. KDD (2006)
30. Rifkin, R.M., Klautau, A.: In defense of one-vs-all classification. Journal of Machine
Learning Research 5 (2004)
Nyström Approximate Model Selection for LSSVM
1 Introduction
Support vector machine (SVM) [18] is a learning system for training linear learning
machines in the kernel-induced feature spaces, while controlling the capacity to prevent
overfitting by generalization theory. It can be formulated as a quadratic programming
problem with linear inequality constraints. The least squares support vector machine
(LSSVM) [16] is a least squares version of SVM, which considers equality constraints
instead of inequalities for classical SVM. As a result, the solution of LSSVM follows
directly from solving a system of linear equations, instead of quadratic programming.
Model selection is an important issue in LSSVM research. It involves the selection of
the kernel function and its associated kernel parameters, and the selection of the regularization
parameter. Typically, the form of the kernel function is fixed in advance to one of several standard
types, such as the polynomial kernel or the radial basis function (RBF) kernel. In this situation, the
selection of the kernel function amounts to tuning the kernel parameters. Model selection
can be reduced to the selection of kernel parameters and regularization parameter which
minimize the expectation of test error [4]. We usually refer to these parameters collec-
tively as hyperparameters. Common model selection approaches mainly adopt a nested
two-layer inference [11], where the inner layer trains the classifier for fixed hyperpa-
rameters and the outer layer tunes the hyperparameters to minimize the generalization
error. The generalization error can be estimated either via testing on some unused data
(hold-out testing or cross validation) or via a theoretical bound [17,5].
The k-fold cross validation gives an excellent estimate of the generalization error
[9] and the extreme form of cross validation, leave-one-out (LOO), provides an almost
unbiased estimate of the generalization error [14]. However, the naive model selection
strategy based on cross validation, which adopts a grid search in the hyperparameters
space, unavoidably brings high computational complexity, since it would train LSSVM
for every possible value of the hyperparameters vector. Minimizing estimated bounds
on the generalization error is an alternative approach to model selection, which is usually realized
by gradient descent techniques. The commonly used estimated bounds include the span
bound [17] and the radius margin bound [5]. Generally, the methods using these estimated
bounds reduce the whole hyperparameters space to a search trajectory in the direction of
gradient descent to accelerate the outer layer of model selection, but LSSVM still has to be
trained multiple times in the inner layer to iteratively attain the minimal value of the estimates.
Training LSSVM is equivalent to computing the inverse of a full n × n matrix, so its complexity
is O(n³), where n is the number of training examples. It is therefore prohibitive for large scale
problems to directly train LSSVM for every hyperparameters vector on the search trajectory.
Consequently, efficient model selection approaches that accelerate the inner computation are
imperative.
As pointed out in [5,3], the model selection criterion is not required to be an unbiased
estimate of the generalization error, instead the primary requirement is merely for the
minimum of the model selection criterion to provide a reliable indication of the mini-
mum of the generalization error in hyperparameters space. We argue that it is sufficient
to calculate an approximate criterion that can discriminate the optimal hyperparame-
ters from the candidates. Such considerations motivate our proposal of an approximate
model selection approach for LSSVM.
Since the high computational cost of calculating the inverse of a kernel matrix is a
major problem of LSSVM, we consider approximating the kernel matrix by a “nice” ma-
trix with a lower computational cost when calculating its inverse. The Nyström method
is an effective technique for generating a low rank approximation for the given kernel
matrix [19,13,8]. Using the low rank approximation, we design an efficient algorithm
for solving LSSVM, whose complexity is lower than O(n3 ). We further derive a model
approximation error bound to measure the effect of Nyström approximation on the deci-
sion function of LSSVM. Finally, we present an efficient approximate model selection
scheme. It conforms to the two-layer iterative procedure, but the inner computation has
been realized more efficiently. By rigorous experiments on several benchmark datasets,
we show that approximate model selection can significantly improve the efficiency of
model selection, and meanwhile guarantee low generalization error.
The rest of the paper is organized as follows. In Section 2, we give a brief introduc-
tion of LSSVM and a reformulation of it. In Section 3, we present an efficient algorithm
for solving LSSVM. In Section 4, we analyze the effect of Nyström approximation on
the decision function of LSSVM. In Section 5, we present an approximate model selec-
tion scheme for LSSVM. In Section 6, we report experimental results. The last section
gives the conclusion.
We use X to denote the input space and Y the output domain. Usually we will have
X ⊆ R^d, Y = {−1, 1} for binary classification. The training set is denoted by
{(x_i, y_i)}_{i=1}^n ⊆ X × Y. LSSVM minimizes the regularized loss

    L = (1/2)‖w‖² + (1/(2μ)) Σ_{i=1}^{n} [y_i − w · φ(x_i) − b]²,    (1)
where μ > 0 is called the regularization parameter. The basic training algorithm for LSSVM
[16] views the regularized loss function (1) as a constrained minimization problem

    min  (1/2)‖w‖² + (1/(2μ)) Σ_{i=1}^{n} ε_i²
    s.t. ε_i = y_i − w · φ(x_i) − b,  i = 1, . . . , n,    (2)
where K = [K(x_i, x_j)]_{i,j=1}^n, I_n is the n × n identity matrix, 1 is a column vector of n ones,
α = (α_1, α_2, . . . , α_n)^T ∈ R^n is a vector of Lagrange multipliers, and y ∈ Y^n is the label
vector.
If we let K_{μ,n} = K + μI_n, we can write the first row of Equation (4) as

    K_{μ,n}(α + K_{μ,n}^{-1} 1 b) = y.    (5)

Therefore, α = K_{μ,n}^{-1}(y − 1b). Replacing α with K_{μ,n}^{-1}(y − 1b) in the second row of Equation
(4), we can obtain

    1^T K_{μ,n}^{-1} 1 b = 1^T K_{μ,n}^{-1} y.    (6)

The system of linear equations (4) can then be rewritten as

    [ K_{μ,n}        0              ] [ α + K_{μ,n}^{-1} 1 b ]   [ y                  ]
    [ 0^T    1^T K_{μ,n}^{-1} 1     ] [ b                    ] = [ 1^T K_{μ,n}^{-1} y ].    (7)
    b = (1^T ν)/(1^T ρ)  and  α = ν − ρ b.    (9)

The decision function of LSSVM can be written as f(x) = Σ_{i=1}^{n} α_i K(x_i, x) + b.
If Equation (8) is solved, we can easily obtain the solution of LSSVM. However, the
complexity of calculating the inverse of the matrix K_{μ,n} is O(n³). In the following, we
will demonstrate that the Nyström method can be used to speed up this process.
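As a concrete illustration, the exact solver implied by Equations (5)–(9) can be written in a few lines of NumPy. Equation (8) is not reproduced in this excerpt, but from the surrounding discussion ρ and ν are the solutions of K_{μ,n}ρ = 1 and K_{μ,n}ν = y; this direct O(n³) version is the baseline that the Nyström approximation below is meant to replace.

    import numpy as np

    def lssvm_solve_exact(K, y, mu):
        # K: (n, n) kernel matrix, y: (n,) labels, mu: regularization parameter.
        # Returns (alpha, b) for the decision function f(x) = sum_i alpha_i K(x_i, x) + b.
        n = K.shape[0]
        K_mu = K + mu * np.eye(n)
        ones = np.ones(n)
        rho = np.linalg.solve(K_mu, ones)   # rho solves K_{mu,n} rho = 1
        nu = np.linalg.solve(K_mu, y)       # nu  solves K_{mu,n} nu  = y
        b = (ones @ nu) / (ones @ rho)      # Equation (9)
        alpha = nu - rho * b
        return alpha, b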
    ‖A − A_k‖_ξ = min_{D ∈ R^{m×n}: rank(D) ≤ k} ‖A − D‖_ξ

for ξ = F, 2, where ‖·‖_F and ‖·‖_2 denote the Frobenius norm and the spectral norm. Such an A_k
is called the optimal rank k approximation of the matrix A. It can be computed through
the singular value decomposition (SVD) of A. If A ∈ R^{n×n} is symmetric positive semi-
definite (SPSD), A = UΣU^T, where U is a unitary matrix and Σ = diag(σ_1, . . . , σ_n) is
a real diagonal matrix with σ_1 ≥ · · · ≥ σ_n ≥ 0. For k ≤ rank(A), A_k = Σ_{i=1}^{k} σ_i U^i (U^i)^T,
where U^i is the ith column of U.
We now briefly review the Nyström method [8,19]. Let K ∈ R^{n×n} be an SPSD matrix.
The Nyström method generates a low rank approximation of K using a subset of the
columns of the matrix. Suppose we randomly sample c columns of K uniformly without
replacement. Let C denote the n × c matrix formed by these columns. Let W be the
c × c matrix consisting of the intersection of these c columns with the corresponding
c rows of K. Without loss of generality, we can rearrange the columns and rows of K
based on this sampling such that:

    K = [ W      K_21^T ]        C = [ W    ]
        [ K_21   K_22   ],           [ K_21 ].    (10)

Since K is SPSD, W is also SPSD. The Nyström method uses W and C from Equation
(10) to construct a rank k approximation K̃ of K for k ≤ c, defined by:

    K̃ = C W_k^+ C^T ≈ K,    (11)
For LSSVM, we need to solve the inverse of K + μI_n. To reduce the computational
cost, we intend to use the inverse of K̃ + μI_n as an approximation of the inverse of
K + μI_n. Since VV^T is positive semi-definite, the invertibility of K̃ + μI_n is guaranteed.
To efficiently calculate the inverse of K̃ + μI_n, we further introduce the Woodbury
formula [12]

    (A + XYZ)^{-1} = A^{-1} − A^{-1}X(Y^{-1} + Z A^{-1} X)^{-1} Z A^{-1},    (14)
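A sketch of this construction in NumPy is shown below. The factor V with K̃ = VV^T is not defined in the excerpt above; the standard construction assumed here is V = C U_k Σ_k^{-1/2} from the eigendecomposition of W, so that VV^T = C W_k^+ C^T as in Equation (11). The second function applies the Woodbury identity (14) with A = μI_n, X = V, Y = I_k, Z = V^T, which only requires inverting the k × k matrix μI_k + V^T V mentioned below.

    import numpy as np

    def nystrom_factor(K, c, k, seed=0):
        # Return V (n x k) with V V^T = C W_k^+ C^T, the rank-k Nystrom
        # approximation of K built from c uniformly sampled columns (Equation (11)).
        rng = np.random.default_rng(seed)
        n = K.shape[0]
        idx = rng.choice(n, size=c, replace=False)
        C = K[:, idx]
        W = K[np.ix_(idx, idx)]
        w, U = np.linalg.eigh(W)                      # eigendecomposition of SPSD W
        top = np.argsort(w)[::-1][:k]
        w_k = np.maximum(w[top], 1e-12)               # guard against tiny eigenvalues
        return C @ (U[:, top] / np.sqrt(w_k))         # V = C U_k diag(w_k)^{-1/2}

    def solve_shifted(V, rhs, mu):
        # Solve (V V^T + mu I_n) x = rhs via the Woodbury formula (14):
        # x = (rhs - V (mu I_k + V^T V)^{-1} V^T rhs) / mu.
        k = V.shape[1]
        inner = mu * np.eye(k) + V.T @ V
        return (rhs - V @ np.linalg.solve(inner, V.T @ rhs)) / mu

Replacing np.linalg.solve(K_mu, ·) in the exact solver sketched earlier with solve_shifted(V, ·, mu) yields the approximate LSSVM solution at the O(c³ + nck) cost discussed next.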
Proof. The computational complexity of step 1 is O(c³), since the main computational
part of this step is the SVD of W. In step 2, matrix multiplications are required, so
its complexity is O(kcn). In step 3, the inverse of μI_k + V^T V is solved by computing
4 Error Analysis
In this section, we analyze the effect of Nyström approximation on the decision function
of LSSVM.
We assume that approximation is only used in training. At testing time the true kernel
function is used. This scenario has been considered by [6]. The decision function f
derived with the exact kernel matrix K is defined by

    f(x) = Σ_{i=1}^{n} α_i K(x, x_i) + b = [α; b]^T [k_x; 1],

where k_x = (K(x, x_1), . . . , K(x, x_n))^T. We define κ > 0 such that K(x, x) ≤ κ and
K̃(x, x) ≤ κ.
Let ρ̃ denote the solution of (K̃ + μI_n)ρ̃ = 1. We can write

    ρ̃ − ρ = (K̃ + μI_n)^{-1} 1 − (K + μI_n)^{-1} 1
           = −(K̃ + μI_n)^{-1} (K̃ − K)(K + μI_n)^{-1} 1.    (18)

For the last equality, we used the identity A^{-1} − B^{-1} = −A^{-1}(A − B)B^{-1} for any two
invertible matrices A, B. Thus, ‖ρ̃ − ρ‖_2 can be bounded as follows:

    ‖ρ̃ − ρ‖_2 ≤ ‖(K̃ + μI_n)^{-1}‖_2 ‖K̃ − K‖_2 ‖(K + μI_n)^{-1}‖_2 ‖1‖_2
              ≤ (‖1‖_2/μ²) ‖K̃ − K‖_2 = (√n/μ²) ‖K̃ − K‖_2.    (19)

Since K̃ and K are positive semi-definite matrices, the eigenvalues of K̃ + μI_n and K + μI_n
are larger than or equal to μ. Therefore the eigenvalues of (K̃ + μI_n)^{-1} and (K + μI_n)^{-1}
are less than or equal to 1/μ.
We further consider ν of Equation (8). Replacing 1 with y, we can obtain the similar
bound

    ‖ν̃ − ν‖_2 ≤ (‖y‖_2/μ²) ‖K̃ − K‖_2 = (√n/μ²) ‖K̃ − K‖_2.    (20)
As assumed, we use the true kernel function at testing time, so no approxi-
mation affects k_x. For simplicity, we assume the offset b to be a constant ζ. Therefore,
the approximate decision function f̃ is given by f̃(x) = [α̃; ζ]^T [k_x; 1].
We can obtain

    f̃(x) − f(x) = ([α̃; ζ]^T − [α; ζ]^T) [k_x; 1] = [α̃ − α; 0]^T [k_x; 1] = (α̃ − α)^T k_x.    (21)

By the Schwarz inequality,

    |f̃(x) − f(x)| ≤ ‖α̃ − α‖_2 ‖k_x‖_2 ≤ √n κ ‖α̃ − α‖_2.    (22)

From Equation (9), we know that α = ν − ρb = ν − ρζ, so

    ‖α̃ − α‖_2 ≤ ‖ν̃ − ν‖_2 + ζ ‖ρ̃ − ρ‖_2
              ≤ (√n/μ²) ‖K̃ − K‖_2 + ζ (√n/μ²) ‖K̃ − K‖_2
              ≤ (1 + ζ) (√n/μ²) ‖K̃ − K‖_2.    (23)

We let μ_0 = μ/n. Substituting the upper bound of ‖α̃ − α‖_2 into Equation (22), we can
obtain

    |f̃(x) − f(x)| ≤ √n κ (1 + ζ) (√n/(n²μ_0²)) ‖K̃ − K‖_2 = (κ(1 + ζ)/(nμ_0²)) ‖K̃ − K‖_2.    (24)
We further introduce a kernel matrix approximation error bound for the Nyström method
[13] to upper bound ‖K̃ − K‖_2.
Theorem 2. Let K ∈ R^{n×n} be an SPSD matrix. Assume that c columns of K are sampled
uniformly at random without replacement, let K̃ be the rank-k Nyström approximation
to K, and let K_k be the best rank-k approximation to K. For ε > 0, let
η = log(2/δ) g(c, n−c)/c with g(a, s) = ((a + s − 1/2)/(as)) · (1/(1 − 1/(2 max{a, s}))).
If c ≥ 64k/ε⁴, then with probability at least 1 − δ,

    ‖K − K̃‖_F ≤ ‖K − K_k‖_F + ε [ ((n/c) Σ_{i∈D(c)} K_ii) ( (2/√n) Σ_{i=1}^{n} K_ii + η max(nK_ii) ) ]^{1/2},

where Σ_{i∈D(c)} K_ii is the sum of the largest c diagonal entries of K.
Combining the bound (24) with Theorem 2 gives our Theorem 3: with probability at least 1 − δ,

    |f̃(x) − f(x)| ≤ (κ(1 + ζ)/(nμ_0²)) ( ‖K − K_k‖_F + ε [ ((n/c) Σ_{i∈D(c)} K_ii) ( (2/√n) Σ_{i=1}^{n} K_ii + η max(nK_ii) ) ]^{1/2} ),

where Σ_{i∈D(c)} K_ii is the sum of the largest c diagonal entries of K.
Theorem 3 measures the effect of kernel matrix approximation on the decision func-
tion of LSSVM. It enables us to bound the relative performance of LSSVM when the
Nyström method is used to approximate the kernel matrix. We refer to the bound given
in Theorem 3 as a model approximation error bound.
In order to find the hyperparameters that minimize the generalization error of LSSVM,
many model selection approaches have been proposed, such as the cross validation,
span bound [17], radius margin bound [5], PRESS criterion [1] and so on. However,
when optimizing model selection criteria, all these approaches need to solve LSSVM
completely in the inner layer for each iteration.
Here we discuss the problem of approximate model selection. We argue that for
model selection purpose, it is sufficient to calculate an approximate criterion that can
discriminate the optimal hyperparameters from candidates. Theorem 3 shows that when
Nyström approximation is used, the change of learning results of LSSVM is bounded,
which is a theoretical support for approximate model selection. In the following, we
present an approximate model selection scheme, as shown in Algorithm 2.
We use the RBF kernel K(x_i, x_j) = exp(−γ‖x_i − x_j‖²) to describe the scheme, but
this scheme is also suitable for other kernel types.
Let S denote the number of iteration steps when optimizing a model selection criterion. The
complexity of solving LSSVM by calculating the inverse of the exact kernel matrix is O(n³). For
the radius margin bound or span bound [5], a standard LSSVM needs to be solved in the
inner layer for each iteration, so the total complexity of these two methods is O(Sn³).
For the PRESS criterion [1], the inverse of the kernel matrix also needs to be calculated for
each iteration, so its complexity is O(Sn³). From Theorem 1, we know that using Algo-
rithm 1 we can solve LSSVM in O(c³ + nck). Therefore, if we use the above model
selection criteria in the outer layer, the complexity of approximate model selection is
O(S(c³ + nck)). For t-fold cross validation, let S_γ and S_μ denote the grid steps of γ and
μ. If LSSVM is solved directly, the complexity of t-fold cross validation is O(tS_γS_μn³).
However, the complexity of approximate model selection using t-fold cross validation
as the outer layer criterion is O(tS_γS_μ(c³ + nck)).
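The outer layer of this scheme reduces to a grid search whose inner solver is cheap. A simplified sketch is given below, assuming binary labels, misclassification error as the CV criterion, and a user-supplied fit function wrapping the Nyström-approximate solver (for instance the sketches above); the grid matches the setup described in Section 6.2.

    import numpy as np
    from itertools import product

    def rbf_kernel(X1, X2, gamma):
        # K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
        sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq)

    def approximate_grid_search(X, y, fit, n_folds=5, seed=0):
        # fit(K_train, y_train, mu) -> (alpha, b) is assumed to wrap the
        # approximate LSSVM solver.
        gammas = [2.0 ** e for e in range(-15, 10, 2)]   # 13 values: 2^-15, ..., 2^9
        mus = [2.0 ** e for e in range(-15, 6, 2)]       # 11 values: 2^-15, ..., 2^5
        folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), n_folds)
        best, best_err = None, np.inf
        for gamma, mu in product(gammas, mus):
            errs = []
            for f in range(n_folds):
                te = folds[f]
                tr = np.concatenate([folds[g] for g in range(n_folds) if g != f])
                alpha, b = fit(rbf_kernel(X[tr], X[tr], gamma), y[tr], mu)
                pred = rbf_kernel(X[te], X[tr], gamma) @ alpha + b
                errs.append(np.mean(np.sign(pred) != y[te]))
            if np.mean(errs) < best_err:
                best, best_err = (gamma, mu), float(np.mean(errs))
        return best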
6 Experiments
The benchmark datasets in our experiments are introduced in [15], as shown in Table 1.
For each dataset, there are 100 random pre-defined training and test partitions (available at
http://www.fml.tuebingen.mpg.de/Members/raetsch/benchmark), except
20 for the Image and Splice datasets. The use of multiple benchmarks means that the
evaluation is more robust as the selection of data sets that provide a good match to the
inductive bias of a particular classifier becomes less likely. Likewise, the use of multiple
partitions provides robustness against sensitivity to the sampling of data to form training
and test sets.
In Rätsch’s experiment [15], model selection is performed on the first five training
sets of each dataset. The median values of the hyperparameters over these five sets are
then determined and subsequently used to evaluate the error rates throughout all 100
partitions. However, for this experimental scheme, some of the test data is no longer
statistically “pure” since it has been used during model selection. Furthermore, the use
of median of the hyperparameters would introduce an optimistic bias [3]. In our ex-
periments, we perform model selection on the training set of each partition, then train
the classifier with the obtained optimal hyperparameters still on the training set, and
finally evaluate the classifier on the corresponding test set. Therefore, we can obtain
100 test error rates for each dataset (except 20 for the Image and Splice dataset). The
statistical analysis of these test error rates is conducted to assess the performance of
the model selection approach. This experimental scheme is rigorous and can avoid the
major flaws of the previous one [3]. All experiments are performed on a Core2 Quad
PC, with 2.33GHz CPU and 4GB memory.
6.2 Effectiveness
Following the experimental setup in Section 6.1, we perform model selection respec-
tively using 5-fold cross validation (5-fold CV) and approximate 5-fold CV, that is,
approximate model selection by minimizing the 5-fold CV error (as shown in Algorithm
2). The CV is performed on a 13 × 11 grid of (γ, μ), with γ varying in [2^{−15}, 2^{9}]
and μ in [2^{−15}, 2^{5}], both with step 2². We set c = 0.1n and k = 0.5c in Algorithm 1.
We compare effectiveness of two model selection approaches. Effectiveness includes
efficiency and generalization. Efficiency is measured by average computation time for
model selection. Generalization is measured by the mean test error rate (TER) of the
classifiers trained with the optimal hyperparameters produced by different model selec-
tion approaches.
Results are shown in Table 2. We use the z statistic of TER [2] to estimate the sta-
tistical significance of differences in performance. Let x̄ and ȳ represent the means of
TER of two approaches, and e_x and e_y the corresponding standard errors; then the z
statistic is computed as z = (x̄ − ȳ)/√(e_x² + e_y²), and z = 1.64 corresponds to a 95% sig-
nificance level. From Table 2, approximate 5-fold CV is significantly outperformed by
5-fold CV only on the Splice dataset, but the difference is just 2.5%. Besides, according
to the Wilcoxon signed rank test [7], neither 5-fold CV nor approximate 5-fold CV
is statistically superior at the 95% level of significance.
Table 2. Comparison of computation time and test error rate (TER) of 5-fold cross validation
(5-fold CV) and approximate 5-fold CV
However, Table 2 also shows that approximate 5-fold CV is more efficient than 5-fold
CV on all datasets. It is worth noting that the larger the training set size, the more
pronounced the efficiency gain, which is in accord with the results of the complexity analysis.
7 Conclusion
In this paper, Nyström method was first introduced into the model selection problem.
A brand new approximate model selection approach of LSSVM was proposed, which
fully exploits the theoretical and computational virtue of Nyström approximation. We
designed an efficient algorithm for solving LSSVM and bounded the effect of kernel
matrix approximation on the decision function of LSSVM. We derived a model approx-
imation error bound, which is a theoretical support for approximate model selection.
We presented an approximate model selection scheme and analyzed its complexity as
compared with other classic model selection approaches. This complexity shows the
promise of the application of approximate model selection for large scale problems. We
finally verified the effectiveness of our approach by rigorous experiments on several
benchmark datasets.
The application of our theoretical results and approach to practical large scale problems
will be one of our major future concerns. Besides, a new efficient model selection criterion
directly dependent on kernel matrix approximation will be proposed in the near future.
References
1. Cawley, G.C., Talbot, N.L.C.: Fast exact leave-one-out cross-validation of sparse least-
squares support vector machines. Neural Networks 17(10), 1467–1475 (2004)
2. Cawley, G.C., Talbot, N.L.C.: Preventing over-fitting during model selection via Bayesian
regularisation of the hyper-parameters. Journal of Machine Learning Research 8, 841–861
(2007)
3. Cawley, G.C., Talbot, N.L.C.: On over-fitting in model selection and subsequent selec-
tion bias in performance evaluation. Journal of Machine Learning Research 11, 2079–2107
(2010)
4. Chapelle, O., Vapnik, V.: Model selection for support vector machines. In: Advances in Neu-
ral Information Processing Systems, vol. 12, pp. 230–236. MIT Press, Cambridge (2000)
5. Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S.: Choosing multiple parameters for
support vector machines. Machine Learning 46(1), 131–159 (2002)
6. Cortes, C., Mohri, M., Talwalkar, A.: On the impact of kernel approximation on learning
accuracy. In: Proceedings of the 13th International Conference on Artificial Intelligence and
Statistics (AISTATS), Sardinia, Italy, pp. 113–120 (2010)
7. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine
Learning Research 7, 1–30 (2006)
8. Drineas, P., Mahoney, M.: On the Nyström method for approximating a Gram matrix for
improved kernel-based learning. Journal of Machine Learning Research 6, 2153–2175 (2005)
9. Duan, K., Keerthi, S., Poo, A.: Evaluation of simple performance measures for tuning SVM
hyperparameters. Neurocomputing 51, 41–59 (2003)
10. Golub, G., Van Loan, C.: Matrix Computations. Johns Hopkins University Press, Baltimore
(1996)
11. Guyon, I., Saffari, A., Dror, G., Cawley, G.: Model selection: Beyond the Bayesian / frequen-
tist divide. Journal of Machine Learning Research 11, 61–87 (2010)
12. Higham, N.: Accuracy and stability of numerical algorithms. SIAM, Philadelphia (2002)
13. Kumar, S., Mohri, M., Talwalkar, A.: Sampling techniques for the Nyström method. In: Pro-
ceedings of the 12th International Conference on Artificial Intelligence and Statistics (AIS-
TATS), Clearwater, Florida, USA, pp. 304–311 (2009)
14. Luntz, A., Brailovsky, V.: On estimation of characters obtained in statistical procedure of
recognition. Technicheskaya Kibernetica 3 (1969) (in Russian)
15. Rätsch, G., Onoda, T., Müller, K.: Soft margins for AdaBoost. Machine Learning 42(3),
287–320 (2001)
16. Suykens, J., Vandewalle, J.: Least squares support vector machine classifiers. Neural Pro-
cessing Letters 9(3), 293–300 (1999)
17. Vapnik, V., Chapelle, O.: Bounds on error expectation for support vector machines. Neural
Computation 12(9), 2013–2036 (2000)
18. Vapnik, V.: Statistical Learning Theory. John Wiley & Sons, New York (1998)
19. Williams, C., Seeger, M.: Using the Nyström method to speed up kernel machines. In: Ad-
vances in Neural Information Processing Systems 13, pp. 682–688. MIT Press, Cambridge
(2001)
Exploiting Label Dependency
for Hierarchical Multi-label Classification
1 Introduction
Traditional classification tasks deal with assigning instances to a single label. In multi-
label classification, the task is to find the set of labels that an instance can belong to
rather than assigning a single label to a given instance. Hierarchical multi-label classi-
fication is a variant of traditional classification where the task is to assign instances to
a set of labels where the labels are related through a hierarchical classification scheme
[1]. In other words, when an instance is labeled with a certain class, it should also be
labeled with all of its superclasses; this is known as the hierarchy constraint.
Hierarchical multi-label classification is a widely studied problem in many domains
such as functional genomics, text categorization, image annotation and object recogni-
tion [2]. In functional genomics (which is the application that we focus on in this paper)
the problem is the prediction of gene/protein functions. Biologists have a hierarchical
organization of the functions that the genes can be assigned to. An individual gene or
protein may be involved in more than one biological activity, and hence, there is a need
for a prediction algorithm that is able to identify all the possible functions of a particular
gene [2].
Fig. 1. Flat versus Hierarchical classification. (a) Flat representation of the labels. (b) Hierarchical
representation of the same set of labels.
There are two types of class hierarchy structures: a rooted tree structure, such
as the MIPS’s FunCat taxonomy [17], and a directed acyclic graph (DAG) structure,
such as the Gene Ontology (GO) [7]. In this paper, we use the FunCat scheme.
Most of the existing research focuses on a “flat” classification approach, that oper-
ates on non-hierarchical classification schemes, where a binary classifier is constructed
for each label separately as shown in Figure 1(a). This approach ignores the hierarchi-
cal structure of the classes shown in Figure 1(b). Reducing a hierarchical multi-label
classification problem to a conventional classification problem allows the possibility
of applying the existing methods. However, since the prediction of the class labels has
to be performed independently, such transformations are not capable of exploiting the
interdependencies and correlations between the labels [6]. Moreover, the flat classifica-
tion algorithm fails to take advantage of the information inherent in the class hierarchy,
and hence may be suboptimal in terms of efficiency and effectiveness [9].
2 Related Work
Since conventional classification methods, such as binary classification and multi-class
classification, were not designed to directly tackle the hierarchical classification prob-
lems, such algorithms are referred to as flat classification algorithms [18]. It is important
to mention that flat classification and other similar approaches are not considered to be
a hierarchical classification approach, as they create new (meta) classes instead of using
pre-established taxonomies.
Different approaches have been proposed in the literature to tackle the hierarchi-
cal multi-label classification problem [4,21,16]. Generally, these approaches can be
grouped into two categories: the local classifier methods and the global classifier meth-
ods. Moreover, most of the existing methods use a top-down class prediction strategy
in the testing phase [3,16,18]. The local strategy treats each label independently, and
thus ignores any possible correlation or interdependency between the labels. There-
fore, some methods perform an additional step to correct inconsistent predictions. For
example, in [3], a Bayesian framework is developed for correcting class-membership
inconsistency for the separate class-wise models approach. In [1], a hierarchical multi-
label boosting algorithm, named HML-Boosting, was proposed to exploit the hierarchi-
cal dependencies among the labels. HML-Boosting algorithm relies on the hierarchical
information and utilizes the hierarchy to improve the prediction accuracy.
The true path rule (TPR) is a rule that governs the annotation of GO and FunCat tax-
onomies. According to this rule, annotating a gene to a given class is automatically
transferred to all of its ancestors to maintain the hierarchy constraint [12]. In [20], a
true path ensemble method was proposed. In this method, a classifier is built for each
functional class in the training phase. A bottom-up approach is followed in the test-
ing phase to correct the class-membership inconsistency. In a modified version of TPR
(TPR-w), a parent weight is introduced. The weight is used to explicitly modulate
the contribution of the local predictions with the positive predictions coming from the
descendant nodes. In [5], a hierarchical bottom-up Bayesian cost-sensitive ensemble
(HBAYES-CS), is proposed. Basically, a calibrated classifier is trained at each node in
the taxonomy. H-loss is used in the evaluation phase to predict the labels for a given
node. In a recent work [2], we proposed a novel Hierarchical Bayesian iNtegration
algorithm HiBiN, a general framework that uses Bayesian reasoning to integrate het-
erogeneous data sources for accurate gene function prediction. On the other hand, sev-
eral research groups have studied the effective exploitation of correlation information
among different labels in the context of multi-label learning. However, these approaches
do not assume the existence of any pre-defined taxonomical structure of the classes
[6,11,14,23,24].
3 HiBLADE Algorithm
Let X = R^d be the d-dimensional input space and Y = {y_1, y_2, ..., y_L} be the finite set
of L possible labels. The hierarchical relationships among classes in Y are defined as
follows: Given y_1, y_2 ∈ Y, y_1 is the ancestor of y_2, denoted by (↑ y_2) = y_1, if and only
if y_1 is a superclass of y_2.
Let a hierarchical multi-label training set be D = {<x_1, Y_1>, ..., <x_N, Y_N>},
where x_i ∈ X is a feature vector for instance i and Y_i ⊆ Y is the set of labels associated
with x_i, such that y_i ∈ Y_i ⇒ y'_i ∈ Y_i, ∀ (↑ y_i) = y'_i. Having Q as the quality criterion
for evaluating the model based on the prediction accuracy, the objective is to find
a function f : D → 2^Y, where 2^Y is the power set of Y, such that Q is maximized
and y ∈ f(x) ⇒ y' ∈ f(x), ∀ (↑ y) = y'. The function f is represented here
by the HiBLADE algorithm.
Hierarchical multi-label learning aims to model and predict p(child class | parent
class). Our goal is to make use of hierarchical dependencies as well as the extracted
dependencies among the labels yk where 1 ≤ k ≤ L and (↑ yk ) = ym such that
for each example we can better predict its labels. The problem then becomes how to
identify and make use of such dependencies in an efficient way.
Algorithm 1. HiBLADE
1: Input: A pair < Y, L > where Y is a tree-structured set of classes and L is the total number
of classes of Y.
2: Output: For each class yl ∈ Y, the final composite classifier yl = sign[Fl (x)].
3: Algorithm:
4: for i = 1, ..., L do
5: if class i is a leaf class then
6: Do nothing
7: else
8: Let children(i) = y1i , ..., yki be the k children classes of i
9: Form the new feature vectors by adding the labels of the classes at the higher levels to
the current feature vectors.
10: Learn classifiers for children(i) using the shared models Algorithm
11: end if
12: end for
In each boosting iteration t, the entire pool is searched for the best fitted model, other
than the model that was built directly for that label, and its corresponding combination
weights; the best fitted model is called h_l^t. We refer to the best fitted model as the
candidate model. The chosen model h_c^t is then updated based on the following formula:

    γ_ij = (ε_ii / (ε_ii + ε_ji)) ∗ β_ij    (1)

where ε_ji is the error resulting from applying model h_j^t on the examples in class i and ε_ii
is the error resulting from applying the model h_i^t on the examples in class i. β_ij controls
the proportional contribution of Bayesian-based and instance-based similarities. β_ij is
computed as follows:
βij = φ ∗ bij + (1 − φ) ∗ sij (2)
where bij is the Bayesian correlation between class i and class j, and it is estimated
as bij = |i ∩ j|/|j|, where |i ∩ j| is the number of positive examples in class i and
class j and |j| is the number of positive examples in class j. sij is the instance-based
similarity between class i and class j. Each instance from one class is compared to each
other instance from the other class. In HiBLADE, sij is computed using the Euclidean
distance between the positive examples in both classes that has the following formula:
sij = (il − jl )2 (3)
l
where l is the corresponding feature in the two vectors. sij is normalized to be in the
range of [0, 1]. φ is a threshold parameter that has a value in the range [0, 1]. Setting φ
to 0 means that only instance-based similarity is taken into consideration in the learning
process. While setting it to 1 means that only Bayesian-based correlation is taken into
consideration. On the other hand, any value of φ between 0 and 1 combines both types
of correlation. It is important to emphasize that these computations are performed only
for the class that is found to be the most useful class with respect to the current class.
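To make Equations (1)–(3) concrete, the weights can be sketched as below. The exact aggregation and normalization of the instance-based similarity s_ij (how the pairwise Euclidean distances are pooled and mapped into [0, 1]) is not fully specified above, so the version here, mean pairwise distance scaled by the maximum, is an assumption.

    import numpy as np

    def bayesian_correlation(pos_i, pos_j):
        # b_ij = |i ∩ j| / |j| over boolean membership vectors
        # (pos_i[k] is True iff example k is positive for class i).
        return float(np.logical_and(pos_i, pos_j).sum()) / max(int(pos_j.sum()), 1)

    def instance_similarity(Xi_pos, Xj_pos):
        # Pairwise Euclidean distances between positive examples of classes i and j,
        # pooled and normalized into [0, 1]; the pooling scheme is an assumption.
        D = np.linalg.norm(Xi_pos[:, None, :] - Xj_pos[None, :, :], axis=-1)
        return 1.0 - D.mean() / (D.max() + 1e-12)

    def gamma_weight(err_ii, err_ji, b_ij, s_ij, phi=0.5):
        # Equations (2) and (1): beta_ij mixes the two correlations, gamma_ij scales
        # the candidate model's contribution by the relative errors.
        beta_ij = phi * b_ij + (1.0 - phi) * s_ij
        return (err_ii / (err_ii + err_ji + 1e-12)) * beta_ij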
In the general case, both classes, the current class and the candidate class, contribute
to the final prediction. In other words, any value of ε_ji other than 0 indicates the level
of contribution of the candidate class. More specifically, if the error of the candidate
class, ε_ji, is greater than the error of the current class, ε_ii, the value of γ_ij will be small,
indicating that only a limited contribution of the candidate class is considered. In con-
trast, if the error of the current class, ε_ii, is greater than the error of the candidate class,
ε_ji, then γ_ij will be high, and hence the prediction decision will depend more
on the candidate class. Finally, the models for the current class and the used candidate
class are replaced by the new learned models. At the end, the composite classifiers Fc
provide the prediction results.
Algorithm 2 shows the details of the shared models algorithm. The shared models al-
gorithm takes as input the children classes of a particular class together with the feature
vectors for the instances that are positive at the parent class. These instances will form
the positive and negative examples for each one of the children classes. The algorithm
begins by initializing a pool of M models, where M is the number of children classes,
one for each class that is learned using a boosting-type algorithm such as ADABOOST.
The number of base models to be generated is determined by T . In each iteration t
and for each label in the set of the children labels, we look for the best fitted model
h_l^t(x) and the corresponding combination weights α_l^t. The contribution of the selected
base model h_l^t(x) to the overall classifier F_c(x) depends on the current label. In other
words, if the error ε_ji of the candidate classifier is 0, this will be a perfect model for
the current label. Hence, Equation (1) will reduce to γ_ij = β_ij. In this case, the
contribution of that model depends on the level of correlation between the candidate
class and the current one. On the other hand, if the current model is a perfect model,
i.e., the error ε_ii = 0, then Equation (1) will reduce to γ_ij = 0, which means that for
the current iteration, there is no need to look at any other classifier.
4 Experimental Details
We chose to demonstrate the performance of our algorithm for the prediction of gene
functions in yeast using four bio-molecular datasets that were used in [20]. Valentini
[20] pre-processed the datasets so that for each dataset, only genes that are annotated
with FunCat taxonomy are selected. To make this paper self-contained, we briefly ex-
plain the data collection process and the pre-processing steps performed on the data.
Uninformative features that have the same value for all of the examples are removed.
Class “99” in FunCat corresponds to an “unclassified protein”. Therefore, genes that
are annotated only with that class are excluded. Finally, in order to have a sufficient
number of positive training examples for each class, only classes with at least 20
positive examples were selected. Dataset characteristics are summarized in Table 1.
Table 1. The characteristics of the four bio-molecular datasets used in our experiments
The gene expression dataset, Gene-Expr, is obtained by merging the results of two
studies, gene expression measures relative to 77 conditions and transcriptional responses
of yeast to environmental stress measured on 173 conditions [10]. For each gene prod-
uct in the protein-protein interaction dataset, PPI-BG, a binary vector is generated that
implies the presence or absence of protein-protein interaction. Protein-protein interac-
tion data have been downloaded from BioGRID database [19,20]. In Pfam-1 dataset,
a binary vector is generated for every gene product that reflects the presence or ab-
sence of 4950 protein domains obtained from Pfam (Protein families) database [8,20].
For PPI-VM dataset, Von Mering experiments produced protein-protein data from yeast
two-hybrid assay, mass spectrometry of purified complexes, correlated mRNA expres-
sion and genetic interactions [22].
Table 2. Per-level F1 measure for Gene-Expr dataset using Flat, HiBLADEI , HiBLADEC
with φ = 0.5 and HiBLADEB for boosting iterations=50
Classical evaluation measures such as precision, recall and F-measure are designed for un-
structured classification problems and are thus inadequate to address the hierarchi-
cal nature of the classes. Another approach used for hierarchical multi-label
learning is to use extended versions of the single label metrics (precision, recall and F-
measure). To evaluate our algorithm, we adopted both the classical and the hierarchical
evaluation measures. The F1 measure considers the joint contribution of both precision (P)
and recall (R). The F1 measure is defined as follows:

    F1 = (2 × P × R)/(P + R) = 2TP/(2TP + FP + FN)    (4)

where TP stands for True Positive, TN for True Negative, FP for False Positive and
FN for False Negative. When TP = FP = FN = 0, we set the F1 measure to 1, as the
classifier has correctly classified all the examples as negative examples [9]. The hierarchical
measures are defined as follows:
    hP = (1/|l(P(x))|) Σ_{p∈l(P(x))} |C(x) ∩ ↑p| / |↑p|    (5)

    hR = (1/|l(C(x))|) Σ_{c∈l(C(x))} |↑c ∩ P(x)| / |↑c|    (6)

    hF = (2 × hP × hR)/(hP + hR)    (7)

where hP, hR and hF stand for hierarchical precision, hierarchical recall and hierarchi-
cal F-measure, respectively. P (x) is a subgraph formed by the predicted class labels for
the instance x while C(x) is a subgraph formed by the true class labels for the instance
x. p is one of the predicted class labels and c is one of the true labels for instance x.
l(P (x)) and l(C(x)) are the set of leaves in graphs P (x) and C(x), respectively. We
also computed both micro-averaged hierarchical F-measure (hF1μ ) and macro-averaged
hierarchical F-measure hF1M . hF1μ is computed by computing hP and hR for each
path in the hierarchical structure of the tree and then applying equation (7). On the other
hand, hF1M is computed by calculating hF1 for each path in the hierarchical structure
of the classes independently and then averaging them. Having high hierarchical pre-
cision means that the predictor is capable of predicting the most general functions of
the instance, while having high hierarchical recall indicates that the predictor is able to
predict the most specific classes [20]. The hierarchical F-measure takes into account the
partially correct paths in the overall taxonomy.
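A direct transcription of Equations (5)–(7) for a single instance is sketched below. It assumes the predicted and true subgraphs P(x) and C(x) are represented by their sets of leaf labels together with an ancestors function returning the up-set ↑c of a class (the class plus all of its ancestors); the subgraphs themselves are recovered as unions of these up-sets.

    def hierarchical_prf(pred_leaves, true_leaves, ancestors):
        # pred_leaves / true_leaves: sets of leaf labels of P(x) / C(x).
        # ancestors(c): set containing c and all of its ancestors (the up-set of c).
        P_x = set().union(*(ancestors(p) for p in pred_leaves)) if pred_leaves else set()
        C_x = set().union(*(ancestors(c) for c in true_leaves)) if true_leaves else set()
        hP = sum(len(C_x & ancestors(p)) / len(ancestors(p)) for p in pred_leaves) \
            / max(len(pred_leaves), 1)                          # Equation (5)
        hR = sum(len(ancestors(c) & P_x) / len(ancestors(c)) for c in true_leaves) \
            / max(len(true_leaves), 1)                          # Equation (6)
        hF = 2 * hP * hR / (hP + hR) if (hP + hR) > 0 else 0.0  # Equation (7)
        return hP, hR, hF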
We analyzed the performance of the proposed framework at each level of the Fun-
Cat taxonomy, and we also compared the proposed method with four other methods
that follow the local classifier approach, namely, HBAYES-CS, HTD, TPR and TPR-w.
Table 3. Per-level F1 measure for PPI-BG dataset using Flat, HiBLADEI , HiBLADEC with
φ = 0.5 and HiBLADEB for boosting iterations=50
HBAYES-CS, TPR and TPR-w are described in the Related Work Section. HTD (Hier-
archical Top-Down) is the baseline method that belongs to the local classifier strategy
and performs hierarchical classification in a top-down fashion. Since HiBLADE also
belongs to the local classifier strategy, it is fair to have a comparison against a local
classifier approach that does not consider any type of correlation between the labels.
We also analyzed the effect of the proper choice of the threshold φ on the performance
of the algorithm. The setup for the experiments is summarized as follows:
– Flat: This is the baseline method that does not take the hierarchical taxonomy of the
classes into account and does not consider label dependencies. A classifier is built
for each class independently of the others. We used AdaBoost as the base learner
to form a baseline algorithm for the comparison with the other methods.
– HiBLADEI : The proposed algorithm that considers Instance-based similarities
only. Here φ is set to zero.
– HiBLADEB : The proposed algorithm that considers classes correlation based on
Bayesian probabilities only. Here φ is set to one.
– HiBLADEC : The proposed algorithm that considers a combination of both
instance-based similarity and classes correlation. Here φ is set to 0.5.
The proposed algorithm outperforms the flat classification method in most of the cases
with significant differences in the performance measurements. The results in Tables 2,
3, 4 and 5 indicate that the deeper the level the better the performance of the proposed
algorithm compared to the flat classification method. For example, in all of the datasets,
the proposed algorithm outperformed the flat classification method in all the levels that
are higher than level 1. This result is consistent with our understanding of both of the
classification schemes. In other words, the proposed method and the flat classification
method have a similar learning procedure for the classes in the first level. However, the
proposed method achieved better results for the deeper levels in the hierarchy.
Table 5. Per-level F1 measure for PPI-VM dataset using Flat, HiBLADEI , HiBLADEC with
φ = 0.5 and HiBLADEB for boosting iterations=50
To get more insights into the best choice of φ threshold, we compare hierarchical
precision, hierarchical recall, hierarchical F1μ measure and hierarchical F1M measure
for Gene-Expr, PPI-BG, Pfam-1 and PPI-VM datasets for φ = 0.0, 0.5 and 1.0, re-
spectively, for 50 boosting iterations. Table 6 shows the results of the comparisons.
The most significant measures are highlighted. As shown in Table 6, the combination
of Bayesian-based correlation and instance-based similarity achieved the best perfor-
mance results in most of the cases. For example, six of the highest performance values,
in general, in this table are achieved when φ = 0.5.
Table 6. Hierarchical precision, hierarchical recall, hierarchical F1M and hierarchical F1µ mea-
sures of HiBLADE for all the four datasets using boosting iterations = 50

                 Gene-Expr                       PPI-BG
    Measure   φ = 0.0  φ = 0.5  φ = 1.0    φ = 0.0  φ = 0.5  φ = 1.0
    hP         0.820    0.808    0.826      0.878    0.924    0.875
    hR         0.644    0.630    0.627      0.662    0.686    0.701
    hF1M       0.702    0.689    0.692      0.735    0.769    0.756
    hF1µ       0.722    0.708    0.712      0.755    0.787    0.778

                 Pfam-1                          PPI-VM
    Measure   φ = 0.0  φ = 0.5  φ = 1.0    φ = 0.0  φ = 0.5  φ = 1.0
    hP         0.763    0.836    0.875      0.716    0.748    0.719
    hR         0.625    0.663    0.637      0.542    0.551    0.557
    hF1M       0.669    0.720    0.714      0.590    0.605    0.601
    hF1µ       0.687    0.740    0.737      0.617    0.635    0.628
learners, while HTD, TPR and TPR-w use linear SVMs as the base learners.
Figure 2 shows the F-measure of the different methods. By exploiting the label de-
pendencies, the classifiers' performance is affected positively. Our results show that
the proposed algorithm significantly outperforms the local learning algorithms. Al-
though there is no clear winner among the different versions of the HiBLADE algorithm,
HiBLADE always achieved significantly better results than the other methods.
5 Conclusion
In this paper, we proposed a hierarchical multi-label classification framework for in-
corporating information about the hierarchical relationships among the labels as well
as the label correlations. The experimental results showed that the proposed algorithm,
HiBLADE, outperforms the flat classification method and the local classifier method
that builds an independent classifier for each class. For future work, we plan to generalize
the proposed approach to general graph structures and develop more scalable solutions
using some other recent proposed boosting strategies [13,15].
References
1. Alaydie, N., Reddy, C.K., Fotouhi, F.: Hierarchical boosting for gene function prediction. In:
Proceedings of the 9th International Conference on Computational Systems Bioinformatics
(CSB), Stanford, CA, USA, pp. 14–25 (August 2010)
2. Alaydie, N., Reddy, C.K., Fotouhi, F.: A Bayesian Integration Model of Heterogeneous Data
Sources for Improved Gene Functional Inference. In: Proceedings of the ACM Conference on
Bioinformatics, Computational Biology and Biomedicine (ACM-BCB), Chicago, IL, USA,
pp. 376–380 (August 2011)
3. Barutcuoglu, Z., Schapire, R.E., Troyanskaya, O.G.: Hierarchical multi-label prediction of
gene function. Bioinformatics 22(7), 830–836 (2006)
4. Bi, W., Kwok, J.: Multi-Label Classification on Tree- and DAG-Structured Hierarchies. In:
Getoor, L., Scheffer, T. (eds.) Proceedings of the 28th International Conference on Machine
Learning (ICML 2011), pp. 17–24. ACM, New York (2011)
5. Cesa-Bianchi, N., Valentini, G.: Hierarchical cost-sensitive algorithms for genome-wide gene
function prediction. In: Proceedings of the Third International Workshop on Machine Learn-
ing in Systems Biology, Ljubljana, Slovenia, pp. 25–34 (2009)
6. Cheng, W., Hüllermeier, E.: Combining instance-based learning and logistic regression for
multilabel classification. Machine Learning 76(2-3), 211–225 (2009)
7. The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nature
Genetics 25(1), 25–29 (2000)
8. Deng, M., Chen, T., Sun, F.: An integrated probabilistic model for functional prediction of
proteins. In: Proc. 7th Int. Conf. Comp. Mol. Biol., pp. 95–103 (2003)
9. Esuli, A., Fagni, T., Sebastiani, F.: Boosting multi-label hierarchical text categorization. In-
formation Retrieval 11, 287–313 (2008)
10. Gasch, A.P., Spellman, P.T., Kao, C.M., Carmel-Harel, O., Eisen, M.B., Storz, G., Botstein,
D., Brown, P.O.: Genomic expression programs in the response of yeast cells to environmen-
tal changes. Mol. Biol. Cell 11, 4241–4257 (2000)
11. Jun, G., Ghosh, J.: Multi-class Boosting with Class Hierarchies. In: Benediktsson, J.A., Kit-
tler, J., Roli, F. (eds.) MCS 2009. LNCS, vol. 5519, pp. 32–41. Springer, Heidelberg (2009)
12. Mostafavi, S., Morris, Q.: Using the gene ontology hierarchy when predicting gene function.
In: Conference on Uncertainty in Artificial Intelligence (UAI), Montreal, Canada, pp. 22–26
(September 2009)
13. Palit, I., Reddy, C.K.: Scalable and Parallel Boosting with MapReduce. IEEE Transactions
on Knowledge and Data Engineering, TKDE (in press, 2012)
14. Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier Chains for Multi-label Classifica-
tion. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) ECML PKDD
2009. LNCS, vol. 5782, pp. 254–269. Springer, Heidelberg (2009)
15. Reddy, C.K., Park, J.-H.: Multi-resolution Boosting for Classification and Regression Prob-
lems. Knowledge and Information Systems (KAIS) 29(2), 435–456 (2011)
16. Rousu, J., Saunders, C., Szedmak, S., Shawe-Taylor, J.: Kernel-Based Learning of Hierarchi-
cal Multilabel Classification Models. The Journal of Machine Learning Research 7, 1601–
1626 (2006)
17. Ruepp, A., Zollner, A., Maier, D., Albermann, K., Hani, J., Mokrejs, M., Tetko, I., Güldener,
U., Mannhaupt, G., Münsterkötter, M., Mewes, H.W.: The FunCat, a functional annotation
scheme for systematic classification of proteins from whole genomes. Nucleic Acids Re-
search 32(18), 5539–5545 (2004)
18. Silla Jr., C.N., Freitas, A.A.: A survey of hierarchical classification across different applica-
tion domains. Data Mining and Knowledge Discovery 22, 31–72 (2011)
19. Stark, C., Breitkreutz, B., Reguly, T., Boucher, L., Breitkreutz, A., Tyers, M.: BioGRID: a
general repository for interaction datasets. Nucleic Acids Research 34, D535–D539 (2006)
20. Valentini, G.: True path rule hierarchical ensembles for genome-wide gene function predic-
tion. IEEE ACM Transactions on Computational Biology and Bioinformatics 8(3), 832–847
(2011)
21. Vens, C., Struyf, J., Schietgat, L., Džeroski, S., Blockeel, H.: Decision trees for hierarchical
multi-label classification. Machine Learning 73, 185–214 (2008)
22. Von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S., Fields, S., Bork, P.: Compara-
tive assessment of large-scale data sets of protein-protein interactions. Nature 417, 399–403
(2002)
23. Yan, R., Tesic, J., Smith, J.R.: Model-Shared Subspace Boosting for Multi-label Classifica-
tion. In: 13th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (KDD), New York, NY, USA, pp. 834–843 (2007)
24. Zhang, M.-L., Zhang, K.: Multi-label learning by exploiting label dependency. In: Proceed-
ings of the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
(KDD 2010), Washington, D.C., USA, pp. 999–1007 (2010)
Diversity Analysis on Boosting Nominal
Concepts
1 Introduction
Boosting is an adaptive approach that makes it possible to correctly classify an object that would be badly classified by an ordinary classifier. The main idea of Boosting is to build many classifiers that complement each other, in order to obtain a more powerful classifier. AdaBoost (Adaptive Boosting) is the best-known Boosting method for classifier generation and combination.
The AdaBoost algorithm is iterative. At first, it selects a subset of instances from the learning data set (a different subset of the training data set in each iteration). Then, it builds a classifier using the selected instances. Next, it evaluates the classifier on the learning data set, and it repeats this process T times.
It has been found that this ingenious manipulation of the training data can favor diversity, especially for linear classifiers [11]. However, there is no study concerning the role of diversity for Nominal Concepts classifiers [13]. In this paper, we study how diversity changes according to the number of nominal classifiers, and we show when adding new classifiers to the team cannot provide further improvements.
This paper is organized as follows: Section 2 presents the principle of the Classifier of Nominal Concepts (CNC) used in Boosting [7,13]. In Section 3, we discuss the diversity of classifiers and the different measures that can be exploited in the generation of classifier ensembles. Section 4 presents the experimental results, which show when diversity can be useful in Boosting of Nominal Concepts.
2 Boosting of CNC
The other attributes describing all the extracted instances are then determined using the closure operator δ ◦ ϕ(AN∗ = vj).
In [13], a method called BNC (Boosting Nominal Concepts) has been proposed. The advantage of BNC is that it builds the part of the lattice covering the best (pertinent) nominal concepts, which are used as classification rules (the Classifier of Nominal Concepts). BNC also has the particularity of fixing the number of nominal classifiers, in order to control the application time while providing the best decision.
For a K-class problem, let Y = {1, .., K} be the set of class labels, where y_i ∈ Y is the class label associated with each instance o_i (i = 1 to N). To generate T classifiers in AdaBoost, the weight distribution of o_i is initially determined as:

D_t(i) = W_i^t / Σ_{j=1}^{N} W_j^t    (7)
βt = εt /(1 − εt ) (9)
The procedure is repeated T times and the final result of BNC is determined
via the combination of the generated classifier outputs:
h_fin(o_i) = arg max_{y ∈ Y} Σ_{t=1}^{T} log(1/β_t) × p_t(o_i, y)    (10)
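To make the combination step concrete, the short sketch below evaluates Eq. (10) for a single instance; the per-classifier scores p_t(o, y) and the β_t values are placeholder inputs, since the construction of the nominal classifiers themselves is not reproduced here.

import numpy as np

def bnc_combine(p, beta):
    # p:    array of shape (T, K) holding p_t(o, y) for one instance o and each class y
    # beta: array of shape (T,) holding beta_t = eps_t / (1 - eps_t), per Eq. (9)
    # Returns the class maximizing sum_t log(1/beta_t) * p_t(o, y), per Eq. (10).
    weights = np.log(1.0 / np.asarray(beta, dtype=float))  # low-error classifiers weigh more
    scores = weights @ np.asarray(p, dtype=float)          # shape (K,)
    return int(np.argmax(scores))

# Toy usage: three classifiers, two classes
print(bnc_combine([[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]], [0.2, 0.45, 0.3]))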
The first variant of the AdaBoost algorithm is AdaBoost.M1 [5,6], which uses the previous process and stops it when the error rate of a classifier exceeds 0.5. The second variant is AdaBoost.M2 [6], which has the particularity of handling multi-class data and of operating whatever the error rate is. In this study, we use AdaBoost.M2, since AdaBoost.M1 has the limitation of stopping Boosting when the learning error exceeds 0.5. In some experiments, AdaBoost.M1 can be stopped after the generation of the first classifier, in which case we cannot calculate the diversity of the classifier ensemble.
Recent research has demonstrated the importance of classifier diversity in improving the performance of AdaBoost [1,4,8]. We discuss this in the next section.
3 Classifier Diversity
According to [4], linear classifiers should be different from each other; otherwise the decision of the ensemble will be of lower quality than the individual decisions. This difference, also called diversity, can lead to a better or worse overall decision [3].
In [14], the authors found a consistent pattern of diversity showing that, at the beginning, the generated linear classifiers are highly diverse, but as the learning progresses the diversity gradually returns to its starting level. This suggests that it could be beneficial to stop AdaBoost before the diversity drops. The authors confirm that there is a consistent pattern of diversity across many measures using linear classifiers. However, they report that the pattern might change if other classifier models are used. In this paper, we show that this pattern is the same with nominal classifiers.
Many measures can be used to determine the diversity between classifiers [11]. In this section, we present three of them: the Q Statistic, the Correlation Coefficient (CC), and the Pairwise Interrater Agreement (kp). These pairwise measures have the same diversity value (0) when the classifiers are statistically independent. They are called pairwise because they consider the classifier outputs two at a time and then average the calculated pairwise diversity values. These measures are computed based on the agreement and disagreement between each pair of classifiers (see Table 1).
Here N = N^00 + N^01 + N^10 + N^11 is the total number of instances. The Q Statistic for a pair of classifiers (j, k) is defined as:

Q_{j,k} = (N^11 N^00 − N^10 N^01) / (N^11 N^00 + N^10 N^01)    (11)
The Pairwise Interrater Agreement: for this measure, the agreement between each pair of classifiers is calculated as:

kp_{j,k} = 2(N^11 N^00 − N^10 N^01) / [(N^11 + N^10)(N^01 + N^00) + (N^11 + N^01)(N^10 + N^00)]    (13)
For any pair of classifiers, Q and ρ have the same sign. The maximum value of ρ and kp is 1, but the minimum value depends on the individual performance of the classifiers.
In [11], it is reported that there is no unique choice of diversity measure. However, for linear classifiers, the authors recommended the use of the Q Statistic for its simplicity and its significant results. It is therefore interesting to compare these measures for Boosting of Nominal Concepts.
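The three pairwise measures above can be computed directly from the agreement counts used in Eqs. (11)–(13). The NumPy sketch below is a minimal illustration; it assumes that each classifier's output is summarized as a 0/1 correctness vector over the evaluation instances (so N^11 counts instances both classifiers get right, N^00 instances both get wrong, and so on), which is one common convention and may differ in detail from Table 1.

import itertools
import numpy as np

def pair_counts(a, b):
    # N11, N10, N01, N00 from two 0/1 correctness vectors (1 = correctly classified)
    a, b = np.asarray(a), np.asarray(b)
    return (np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0)),
            np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0)))

def q_statistic(a, b):                      # Eq. (11)
    n11, n10, n01, n00 = pair_counts(a, b)
    return (n11 * n00 - n10 * n01) / (n11 * n00 + n10 * n01)

def correlation_coefficient(a, b):          # rho
    n11, n10, n01, n00 = pair_counts(a, b)
    denom = np.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom

def kappa_p(a, b):                          # Eq. (13)
    n11, n10, n01, n00 = pair_counts(a, b)
    denom = (n11 + n10) * (n01 + n00) + (n11 + n01) * (n10 + n00)
    return 2 * (n11 * n00 - n10 * n01) / denom

def ensemble_diversity(correctness, measure):
    # Average a pairwise measure over all classifier pairs (rows = classifiers).
    pairs = itertools.combinations(range(len(correctness)), 2)
    return float(np.mean([measure(correctness[j], correctness[k]) for j, k in pairs]))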
4 Experimental Study
The goal of this section is to study the relationship between the diversity of nominal classifiers and AdaBoost performance for 2-class problems.
The experiments are performed on 5 real data sets extracted from the UCI Machine Learning Repository¹ [2], and the algorithms are implemented in WEKA², a widely used toolkit.
The characteristics of these data sets are reported in Table 2. For each data set, we give the number of instances and the number of attributes. We also report the data diversity rate (DD), which indicates the proportion of samples that are distinct (including the class label) in the data [9].
¹ http://archive.ics.uci.edu/ml/
² http://www.cs.waikato.ac.nz/ml/weka/
Evaluation is based on 10-fold cross-validation: the data sample is divided into 10 subsets; in turn, each subset is used for testing while the remaining subsets are assembled for learning. Finally, the average over these 10 runs is reported.
Figures 1, 2, 3, 4 and 5 present the performance of BNC and the values of the 3 diversity measures on the Credit German, Diabetes, Ionosphere, Tic Tac Toe and Transfusion data sets, respectively.
In Figure 1.1, we remark that the performance of BNC starts to stabilize when using ensembles of 20 classifiers, for high-diversity data (DD = 98.59%). The generated classifiers are negatively dependent (Q ≤ −0.02). The minimum values of the Q Statistic are obtained with classifier numbers varying between 10 and 20. From Figure 1.3 and Figure 1.4, the values of the two measures CC and kp are very diverse and their variation curves are ascending, while the curve of the Q values rises and then falls.
In Figure 2.1, the best performance of BNC is obtained with diverse classifier ensembles (with Q = −0.3 as the minimum average value). In Figure 2.2, the minimum values of the Q Statistic are obtained with 20 classifiers. For the Diabetes data (DD = 22.83%), there is a relation between the Q Statistic and the BNC performance.
With highly diverse data (Figure 3.2), the first generated classifiers are independent but the rest are negatively dependent (Q ≤ −0.15). The minimum values of the Q Statistic are obtained with classifier numbers varying between 15 and 30. With fewer than 20 classifiers, the error rate decreases by about 40% (Figure 3.1).
From Figure 4.1, the difference between the error rate of the first classifier and those of the classifiers generated thereafter is not important. This shows that Boosting can converge to the best performance with few classifiers. For this case, the Q Statistic is informative. In Figures 4.3 and 4.4, the values of kp and CC vary in a very arbitrary way.
For the Transfusion data set (DD = 1.07%), generating more classifiers does not help to increase BNC performance. We conclude that it is not recommended to use AdaBoost for this type of data.
Concerning diversity measures, we note that for 2-class problems, the values of ρ and kp are not correlated with AdaBoost performance using nominal classifiers. The Q Statistic appears to be a good measure of model diversity that has a relationship with the performance of AdaBoost and can therefore be used to stop classifier learning.
5 Conclusions
In this paper, we have studied the diversity of nominal classifiers in AdaBoost.M2. We have compared 3 diversity measures for 2-class problems. We have found that the Q Statistic is significantly correlated with AdaBoost performance, especially for very diverse data sets. It is therefore possible to use this measure as a stopping criterion for ensemble learning. But for very correlated data sets, no measure is useful. These results should be confirmed with more correlated data. The diversity of data sets should then be taken into account in the AdaBoost learning process. It is also interesting to study the Q Statistic to see whether it can be used in AdaBoost for multi-class problems.
References
1. Aksela, M., Laaksonen, J.: Using diversity of errors for selecting members of a
committee classifier. Pattern Recognition 39(4), 608–623 (2006)
2. Asuncion, A., Newman, D.: UCI Machine Learning Repository (2007)
3. Gavin, B., Jeremy, W., Rachel, H., Xin, Y.: Diversity creation methods: A survey
and categorisation. Journal of Information Fusion 6, 5–20 (2005)
4. Brown, G., Kuncheva, L.I.: “Good” and “Bad” Diversity in Majority Vote Ensem-
bles. In: El Gayar, N., Kittler, J., Roli, F. (eds.) MCS 2010. LNCS, vol. 5997, pp.
124–133. Springer, Heidelberg (2010)
5. Freund, Y.: Boosting a weak learning algorithm by majority. Information and Com-
putation 121, 256–285 (1995)
6. Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: 13th
International Conference on Machine Learning, Bari, Italy (1996)
7. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning
and an application to boosting. Journal of Computer and System Sciences 55(1),
119–139 (1997)
8. Gacquer, D., Delcroix, V., Delmotte, F., Piechowiak, S.: On the Effectiveness of
Diversity When Training Multiple Classifier Systems. In: Sossai, C., Chemello, G.
(eds.) ECSQARU 2009. LNCS, vol. 5590, pp. 493–504. Springer, Heidelberg (2009)
9. Ko, A.H.R., Sabourin, R., Soares de Oliveira, L.E., de Souza Britto, A.: The impli-
cation of data diversity for a classifier-free ensemble selection in random subspaces.
In: International Conference on Pattern Recognition, pp. 1–5 (2008)
10. Kohavi, R.: A Study of Cross-Validation and Bootstrap for Accuracy Estimation
and Model Selection. In: Actes d’International Joint Conference on Artificial In-
telligence, pp. 1137–1143 (1995)
11. Kuncheva, L.I., Skurichina, M., Duin, R.P.W.: An experimental study on diversity
for bagging and boosting with linear classifiers. Information Fusion 3(4), 245–258
(2002)
12. Kuncheva, L.I., Rodriguez, J.J.: Classifier Ensembles for fMRI Data Analysis: An
Experiment. Magnetic Resonance Imaging 28(4), 583–593 (2010)
13. Meddouri, N., Maddouri, M.: Adaptive Learning of Nominal Concepts for Super-
vised Classification. In: Setchi, R., Jordanov, I., Howlett, R.J., Jain, L.C. (eds.)
KES 2010. LNCS, vol. 6276, pp. 121–130. Springer, Heidelberg (2010)
14. Shipp, C.A., Kuncheva, L.I.: An investigation into how adaboost affects classifier
diversity. In: Proc. of Information Processing and Management of Uncertainty in
Knowledge-Based Systems, pp. 203–208 (2002)
Extreme Value Prediction for Zero-Inflated Data
1 Introduction
The notion of being able to foretell the occurrence of an extreme event in a time series is very appealing, especially in domains with significant ramifications associated with the occurrence of extreme events. Predicting pandemics in an epidemiological domain or forecasting natural disasters in a geological and climatic environment are examples of applications that place importance on the detection of extreme events. Unfortunately, the accurate prediction of the timing and magnitude of such events is a challenge given their low occurrence rate. Moreover, the prediction accuracy depends on the regression method used as well as on characteristics of the data. On the one hand, standard regression methods such as the generalized linear model (GLM) emphasize estimating the conditional expected value, and thus are not best suited for inferring extremal values. On the other hand, methods such as quantile regression focus on estimating the confidence limits of the prediction, and thus may overestimate the frequency and magnitude of the extreme events. Though methods for inferring extreme value distributions do exist, combining them with other predictor variables for prediction purposes remains a challenging research problem.
Standard regression methods typically assume that the data conform to certain
parametric distributions (e.g., from an exponential family). Such methods are in-
effective if the assumed distribution does not adequately model characteristics of
the real data. For example, a common problem encountered especially in model-
ing climate and ecological data is the excess probability mass at zero. Such zero-
inflated data, as they are commonly known, often lead to poor model fitting using
standard regression methods as they tend to underestimate the frequency of ze-
ros and the magnitude of extreme values in the data. One way to handle this type of data is to identify and remove the excess zeros and then fit a regression model to the non-zero values. Such an approach can be used, for example, to predict future values of a precipitation time series [13], in which the occurrence of wet or dry days is initially predicted using a classification model prior to applying the regression model to estimate the amount of rainfall for the predicted wet days. A potential drawback of this approach is that the classification and regression models are often built independently of each other, preventing the models from gleaning information from each other to potentially improve their predictive accuracy. Furthermore, the regression methods used in modeling the zero-inflated data do not emphasize accurate prediction of extreme values.
This paper presents an integrated framework that simultaneously classifies data points as zero-valued or not, and applies quantile regression to accurately predict extreme values, i.e., the tail end of the non-zero part of the distribution, by focusing on particular quantiles.
We demonstrate the effectiveness of the proposed approach on modeling climate data (precipitation) obtained from the Canadian Climate Change Scenarios Network website [1]. The performance of the approach is compared with four baseline methods. The first baseline is the generalized linear model (GLM) with a Poisson distribution. The second baseline is the generalized linear model using an exponential distribution coupled with a binomial distribution classifier (GLM-C). A zero-inflated Poisson model was used as the third baseline method (ZIP). The fourth baseline was quantile regression. Empirical results showed that our proposed framework outperforms the baselines for the majority of the weather stations investigated in this study.
In summary, the main contributions of this paper are as follows:
– We compare and analyze the performance of models created using variants of GLM, quantile regression, and ZIP approaches to accurately predict values for extreme data points that belong to a zero-inflated distribution.
– We present an approach optimized for modeling zero-inflated data that outperforms the baseline methods in predicting the value of extreme data points.
– We successfully demonstrate the proposed approach on the real-world problem of downscaling precipitation climate data, with application to climate impact assessment studies.
2 Related Work
The motivation behind the presented model is accurately predicting extreme
values in the presence of zero-inflated data. Previous studies have shown that
additional precautions must be taken to ensure that the excess zeros do not
lead to poor fits [2] of the regression models. A typical approach to model a
zero-inflated data set is to use a mixture distribution of the form P (y|x) =
απ0 (x) + (1 − α)π(x), where π0 and π are functions of the predictor variables
x and α is a mixing coefficient that governs the probability an observation is
a zero or non-zero value. This approach assumes that the underlying data are
generated from known parametric distributions, for example, π may be Poisson
or negative binomial distribution (for discrete data) and lognormal or Gamma
(for continuous data).
Generally, simple modeling of zero values may not be sufficient, especially in the case of zero-inflated climate data such as precipitation, where extreme value observations (which could indicate floods, droughts, etc.) need to be accurately modeled. Due to the significance of extreme values in climatology and the increasing trend in extreme precipitation events over the past few decades, a lot of work has been done in analyzing the trends in precipitation, temperature, etc., for regions in the United States, Canada, and elsewhere [3]. Katz [4] introduces the common approaches used in climate change research, especially with regard to extreme values.
The common approaches to modeling extreme events are based on general
extreme value theory [5], Pareto distribution [10], generalized linear modeling
[6], hierarchical Bayesian approaches [9], etc. Gumbel [8] and Weibull [12] are the
more common variants of general extreme value distribution used. There are also
Bayesian models [11] that try to augment the model with spatial information. Watterson and Dix propose a model that also deals with the skewness of the non-zero data and the intermittency of precipitation, using the gamma distribution to interpret changes in precipitation extremes [7]. In contrast, the framework presented in this paper handles the intermittency of the data by coupling a logistic regression classifier to the quantile regression part of the model.
3 Preliminaries
β̂ = (X^T X)^{-1} X^T y
The generalized linear model is one of the most widely used regression methods due to its simplicity. Generally, a GLM consists of three elements:
1. The response variable Y, which has a probability distribution from the exponential family.
2. A linear predictor η = Xβ.
3. A link function g(·) such that E(Y|X) = μ = g^{-1}(η).
where Y ∈ R^{n×1} is the response vector, X ∈ R^{n×d} is the design matrix with all 1s in its last column, and β ∈ R^{d×1} is the parameter vector. Since the link function captures the relationship between the linear predictor and the mean of the distribution, it is very important to understand the details of the data before arbitrarily using the canonical link function. In our case, the precipitation data are always non-negative and the values are recorded on a millimeter scale, so the non-zero data may be treated as count data, allowing us to use a Poisson distribution or an exponential distribution to describe the data. Hence, in our experiments we always choose log(·) as the link function and use the Poisson distribution. We scale the Y used in the regression model to be 10 × Y:
(10 × Y_i) | X_i ∼ Poi(λ_i),
E((10 × Y_i) | X_i) = λ_i = g^{-1}(η_i) = g^{-1}(X_i β).
The histograms in Figure 1 represent the data belonging to station 1. It is clear that the number of zeros is very large. The second histogram, which excludes the zeros, resembles a Poisson or exponential distribution. Considering the large number of zeros, one is motivated to perform classification first to eliminate the zero values before any regression.
[Figure 1: two histograms (Density vs. Y) of the station-1 precipitation data, with the zero values included (left) and excluded (right).]
There are many classification methods available, but for the purpose of our experiments we use logistic regression (which is also a variant of GLM) to perform the classification. The response variable Y* of the logistic regression is a binary variable defined as:

Y* = 1 if Y > 0,   Y* = 0 if Y = 0.

The detail of the model is as follows. The link function is the logit link g(p) = log(p / (1 − p)), such that

Y_i* | X_i ∼ Bin(p_i),
E(Y_i* | X_i) = p_i = g^{-1}(η_i) = g^{-1}(X_i β).

When we derive the fitted values, they are converted to binary values:

f* = 1 if 1 ≥ Ŷ* > 0.5,   f* = 0 if 0.5 ≥ Ŷ* ≥ 0.
The second part is a GLM with an exponential distribution; the response variable Y is restricted to the non-zero data, and the link function is g(·) = log(·).
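As an illustration of this classification-plus-regression (GLM-C) scheme, the sketch below couples a logistic classifier with a log-link GLM fitted only on the non-zero records, assuming scikit-learn is available. GammaRegressor is used here as a stand-in for the exponential-distribution GLM (the exponential is the shape-1 special case of the Gamma), and the variable names and synthetic data are illustrative.

import numpy as np
from sklearn.linear_model import GammaRegressor, LogisticRegression

def fit_glm_c(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X, (y > 0).astype(int))  # wet/dry classifier
    reg = GammaRegressor().fit(X[y > 0], y[y > 0])                       # log-link GLM, non-zero data only
    return clf, reg

def predict_glm_c(clf, reg, X):
    # Final prediction: binary wet/dry decision times the regressed amount.
    return clf.predict(X) * reg.predict(X)

# Toy usage with synthetic zero-inflated data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
wet = rng.random(500) < 0.3
y = np.where(wet, rng.exponential(scale=np.exp(0.5 * X[:, 0])), 0.0)
clf, reg = fit_glm_c(X, y)
print(predict_glm_c(clf, reg, X[:5]))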
Differing from the methods above, zero-inflated Poisson (ZIP) regression treats the zeros as coming from a mixture of two distributions: a Bernoulli component that produces a zero with probability π_i, and a Poisson component with parameter λ_i (let Pr(·; λ) denote its probability mass function). The ZIP regression model is defined as:

P(Y = y_i | x_i) = π_i + (1 − π_i) Pr(Y_i = 0; λ_i)   if y_i = 0,
P(Y = y_i | x_i) = (1 − π_i) Pr(Y = y_i; λ_i)          if y_i > 0.
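The mixture above translates directly into a log-likelihood; the plain NumPy/SciPy sketch below uses fixed illustrative parameters (in a full ZIP regression, π_i and λ_i would be tied to the covariates through logit and log links, respectively).

import numpy as np
from scipy.stats import poisson

def zip_log_likelihood(y, pi, lam):
    # P(Y = 0)     = pi + (1 - pi) * Poisson(0; lam)
    # P(Y = y > 0) = (1 - pi) * Poisson(y; lam)
    y = np.asarray(y)
    p_zero = pi + (1 - pi) * poisson.pmf(0, lam)
    p_pos = (1 - pi) * poisson.pmf(y, lam)
    return float(np.sum(np.where(y == 0, np.log(p_zero), np.log(p_pos))))

# Toy check: a zero-inflated sample is better explained with pi > 0 than with pi = 0.
y = np.array([0, 0, 0, 0, 2, 3, 0, 1, 0, 4])
print(zip_log_likelihood(y, pi=0.5, lam=2.5))
print(zip_log_likelihood(y, pi=0.0, lam=2.5))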
In quantile regression, the check (pinball) loss is defined as:

ρ_τ(u) = τ u if u > 0,   ρ_τ(u) = (τ − 1) u if u ≤ 0.
Let FY (y) = P (Y ≤ y) be the distribution function of a real valued random
variable Y. The τ th quantile of Y is given by:
QY (τ ) = F −1 (τ ) = inf{y : FY (y) ≥ τ }
It can be proved that the ŷ which minimizes Eρτ (y − ŷ) should satisfy that
FY (ŷ) = τ . Thus, quantile regression will find the τ th quantile of a random
variable, for example:
Median(Y|X) = X β̂^qr,   where β̂^qr = arg min_b Σ_i ρ_0.5(y_i − X_i^T b).
For the purpose of the experiments conducted, we always used τ = 0.95 to represent extreme high values. Unlike the least squares methods mentioned above, which can be solved by numerical linear algebra, the solution to quantile regression is relatively non-trivial. Linear programming is used to solve the loss function by converting the problem to the following form:

min_{u,v,b} { τ e_N^T u + (1 − τ) e_N^T v  |  Y − Xb = u − v;  b ∈ R^p;  u, v ∈ R_+^N }
For the same reason as mentioned in Section 3.1, a classification method should be incorporated along with the regression model. We use logistic regression for classification and quantile regression on the non-zero Y. Finally, we report the product of the two fitted values. Quantile regression may return a negative value, which we force to 0, because precipitation is always non-negative.
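A compact sketch of this quantile-regression-with-classification baseline, assuming scikit-learn: a logistic classifier gates the output, a τ = 0.95 quantile regressor is fitted on the non-zero values only, and negative predictions are clipped to zero. Note that ICRE itself couples the two components within a single objective rather than training them independently as is done here.

import numpy as np
from sklearn.linear_model import LogisticRegression, QuantileRegressor

def fit_qr_c(X, y, tau=0.95):
    clf = LogisticRegression(max_iter=1000).fit(X, (y > 0).astype(int))
    qr = QuantileRegressor(quantile=tau, alpha=0.0).fit(X[y > 0], y[y > 0])
    return clf, qr

def predict_qr_c(clf, qr, X):
    pred = clf.predict(X) * qr.predict(X)   # product of the two fitted values
    return np.clip(pred, 0.0, None)         # precipitation is always non-negative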
The rationale for the design of our objective function is as follows. The first term, which corresponds to the regression part of the equation, represents quantile regression performed only on the observed non-zero values in the time series. The regression model is therefore biased towards estimating the non-zero extreme values more accurately and is not adversely influenced by the over-abundance of zeros in the time series. The product f_i × (f_i + 1)/2 in the first term corresponds to the predicted output of our joint classification and regression model. The second term in the objective function, which is the main classification component, is equivalent to the least squares support vector machine. The last two terms in the objective function are equivalent to the L2 norms used in ridge regression models to shrink the coefficients in ω_1 and ω_2.
We consider each data point to be a representative reading at an instance of
time t ∈ {1, 2, · · · , n} in the time series. Each predictor variable is standardized
by subtracting its mean value and then dividing by its corresponding standard
deviation. The standardization of the variables is needed to account for the
varying scales.
The optimization method used in the experiments is L-BFGS-B, described by Byrd et al. (1995). It is a limited-memory version of the BFGS method: it does not store a full Hessian matrix, only a limited number of update steps for it, and it uses derivative information. Since our model includes a quantile regression component, which is not differentiable, this method of optimization is well suited to our objective function.
To solve the objective function, we used the inverse logistic function of x_i^T ω_2 instead of sign((x_i^T ω_2 + 1)/2). This decision was motivated by the fact that the optimizer performs a line search along the steepest descent direction and needs a usable derivative along this line, whereas the hard binary component would result in a nearly flat surface. Hence the binary output was replaced by an inverse logistic function of x_i^T ω_2 to address this issue. During the prediction stage, we use the binary fitted values from the SVM component.
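To illustrate this optimization step, the sketch below minimizes a smoothed surrogate objective with SciPy's L-BFGS-B routine: a pinball (quantile) loss on the non-zero points gated by an inverse-logistic term, a least-squares classification term, and L2 penalties. This is only a schematic stand-in under stated assumptions, not the exact ICRE objective, whose precise form and regularization weights are not reproduced here.

import numpy as np
from scipy.optimize import minimize

def surrogate_objective(params, X, y, tau=0.95, lam=0.1):
    d = X.shape[1]
    w1, w2 = params[:d], params[d:]                       # regression / classification weights
    gate = 1.0 / (1.0 + np.exp(-(X @ w2)))                # inverse-logistic gate instead of a hard sign
    resid = y - gate * (X @ w1)
    pinball = np.where(resid > 0, tau * resid, (tau - 1) * resid)   # quantile (pinball) loss
    nonzero = y > 0                                       # bias the fit toward the non-zero extremes
    cls = np.mean((nonzero.astype(float) - gate) ** 2)    # least-squares classification term
    return np.mean(pinball[nonzero]) + cls + lam * (w1 @ w1 + w2 @ w2)

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = np.maximum(0.0, X[:, 0] + rng.normal(size=300))
res = minimize(surrogate_objective, np.zeros(2 * X.shape[1]),
               args=(X, y), method="L-BFGS-B")
print(res.fun, res.success)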
5 Experimental Evaluation
In this section, we first describe the climate data used to downscale precipitation, followed by the experimental setup. We then analyze the behavior of the baseline models and contrast them with ICRE in terms of the relative performance of the various models when applied to this real-world dataset to forecast future values of precipitation.
5.1 Data
All the algorithms were run on climate data obtained for 29 weather stations in
Canada, from the Canadian Climate Change Scenarios Network website [1]. The
response variable to be regressed (downscaled), corresponds to daily precipita-
tion values measured at each weather station. The predictor variables correspond
to 26 coarse-scale climate variables derived from the NCEP Reanalysis data set
and the H3A2a data set (computer-generated simulations), which include mea-
surements of airflow strength, sea-level pressure, wind direction, vorticity, and
humidity. The predictor variables used for training were obtained from the NCEP
Reanalysis data set while the predictor variables used for the testing were ob-
tained from the H3A2a data set. The data span a 40-year period, 1961 to 2001.
The time series was truncated for each weather station to exclude days for which
temperature or any of the predictor values are missing.
The first step was to standardize the predictor variables, by subtracting the mean value of each variable and dividing by its corresponding standard deviation, to account for their varying scales. The training set consisted of 10 years' worth of data and the test set of 25 years. During the validation process, the selection of the parameter λ was done using the score returned by RMSE-95. Also, to ensure that the experiments replicated the real-world scenario, in which predictions for a future time series must be made using simulated values of the predictor variables, we used simulated values of the corresponding predictor variables obtained from the H3A2a climate scenario as X_U, while X_L contains the values obtained from NCEP. All the experiments were run for 37 stations.
We compare the performance of ICRE with baseline models created using the generalized linear model (GLM), the generalized linear model with classification (GLM-C), quantile regression (QR), quantile regression with classification, and zero-inflated Poisson regression (ZIP). Further details about the baselines are provided below.
General Linear Model (GLM). The baseline GLM refers to the generalized linear model that uses a Poisson distribution with a log link function, resulting in the regression function log(λ) = Xβ, where E(Y|X) = λ.
The motivation behind the selection of the various evaluation metrics was to evaluate the different algorithms in terms of predicting both the magnitude and the timing of the extreme events. The following criteria are used to evaluate the performance of the models:
– Root Mean Square Error (RMSE), which measures the difference between the actual and predicted values of the response variable, i.e.:
  RMSE = sqrt( Σ_{i=1}^{n} (y_i − f_i)² / n ).
– RMSE-95, which we use to measure the difference between the actual and predicted values of the response variable for only the extreme data points (indexed by j). Extreme data points refer to the points whose actual values are at the 95th percentile and above. The equation is defined with respect to the 95th percentile because, throughout this paper, we treat data points at the 95th percentile and above as extreme values, i.e.:
  RMSE-95 = sqrt( Σ_{j=1}^{n/20} (y_j − f_j)² / (n/20) ).
– Confusion matrices are computed to visualize the precision and recall of extreme and non-extreme events. The F-measure, which is the harmonic mean of the recall and precision values, is used as a score that summarizes the precision and recall results (a small computational sketch of these measures follows this list):
  F-measure = (2 × Recall × Precision) / (Recall + Precision).
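The sketch below (referenced in the list above) gives a direct NumPy version of these three measures, under the assumption that the 95th percentile of the observed values defines the extreme class.

import numpy as np

def rmse(y, f):
    return float(np.sqrt(np.mean((y - f) ** 2)))

def rmse_95(y, f):
    thr = np.percentile(y, 95)        # extreme points: actual value at or above the 95th percentile
    ext = y >= thr
    return float(np.sqrt(np.mean((y[ext] - f[ext]) ** 2)))

def f_measure_extremes(y, f):
    thr = np.percentile(y, 95)
    true_ext, pred_ext = y >= thr, f >= thr
    tp = np.sum(true_ext & pred_ext)
    precision = tp / max(np.sum(pred_ext), 1)
    recall = tp / max(np.sum(true_ext), 1)
    return 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0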
The results section consists of two main sets of experiments. The first set of experiments evaluates the impact of zero-inflated data on modeling extreme values. The second set compares the performance of ICRE with the baseline methods.
6 Conclusions
This paper compared and analyzed the performance of models created using variants of GLM, quantile regression, and ZIP approaches to accurately predict values for extreme data points that belong to a zero-inflated distribution. An alternative framework (ICRE) was presented that outperforms the baseline methods, and the effectiveness of the model was demonstrated on climate data by predicting the amount of precipitation at a given station. For future work, we plan to extend the framework to a semi-supervised setting.
References
1. Canadian Climate Change Scenarios Network, Environment Canada,
http://www.ccsn.ca/
2. Ancelet, S., Etienne, M.-P., Benot, H., Parent, E.: Modelling spatial zero-inflated
continuous data with an exponentially compound poisson process. Environmental
and Ecological Statistics (April 2009), doi:10.1007/s10651-009-0111-6
3. Kunkel, E.K., Andsager, K., Easterling, D.: Long-Term Trends in Extreme Pre-
cipitation Events over the Conterminous United States and Canada. J. Climate,
2515–2527 (1999)
4. Katz, R.: Statistics of extremes in climate change. Climatic Change, 71–76 (2010)
5. Gaetan, C., Grigoletto, M.: A hierarchical model for the analysis of spatial rainfall
extremes. Journal of Agricultural, Biological, and Environmental Statistics (2007)
6. Clarke, R.T.: Estimating trends in data from the Weibull and a generalized extreme
value distribution. Water Resources Research (2002)
7. Watterson, I.G., Dix, M.R.: Simulated changes due to global warming in daily
precipitation means and extremes and their interpretation using the gamma dis-
tribution. Journal of Geophysical Research (2003)
8. Booij, M.J.: Extreme daily precipitation in Western Europe with climate change
at appropriate spatial scales. International Journal of Climatology (2002)
9. Ghosh, S., Mallick, B.: A hierarchical Bayesian spatio-temporal model for extreme
precipitation events. Environmetrics (2010)
10. Dorland, C., Tol, R.S.J., Palutikof, J.P.: Vulnerability of the Netherlands and
Northwest Europe to storm damage under climate change. Climatic Change, 513–
535 (1999)
11. Cooley, D., Nychka, D., Naveau, P.: Bayesian spatial modeling of extreme precipitation return levels. Journal of the American Statistical Association, 824–840
(2007)
12. Clarke, R.T.: Estimating trends in data from the Weibull and a generalized extreme
value distribution. Water Resources Research (2002)
13. Wilby, R.L.: Statistical downscaling of daily precipitation using daily airflow and
seasonal teleconnection. Climate Research 10, 163–178 (1998)
Learning to Diversify Expert Finding
with Subtopics
Abstract. Expert finding is concerned with finding persons who are knowledgeable on a given topic. It has many applications in enterprise search, social
networks, and collaborative management. In this paper, we study the problem of
diversification for expert finding. Specifically, employing an academic social net-
work as the basis for our experiments, we aim to answer the following question:
Given a query and an academic social network, how to diversify the ranking list,
so that it captures the whole spectrum of relevant authors’ expertise? We precisely
define the problem and propose a new objective function by incorporating topic-
based diversity into the relevance ranking measurement. A learning-based model
is presented to solve the objective function. Our empirical study in a real system
validates the effectiveness of the proposed method, which can achieve significant
improvements (+15.3%-+94.6% by MAP) over alternative methods.
1 Introduction
Given a coauthor network, how to find the top-k experts for a given query q? How to
diversify the ranking list so that it captures the whole spectrum of relevant authors’ ex-
pertise? Expert finding has long been viewed as a challenging problem in many different domains. Although considerable research has been conducted to address this problem, e.g., [3,17], it remains largely unsolved. Most existing works cast this problem as a web document search problem and employ traditional relevance-based retrieval models to deal with it.
Expert finding is different from web document search. When a user is looking for expert collaborators in a domain such as “data mining”, she/he does not typically mean to find general experts in this field. Her/his intention might be to find experts on different aspects (subtopics) of data mining (e.g., “association rules”, “classification”, “clustering”, or “graph mining”). Recently, diversity has become a key factor in addressing the uncertainty and ambiguity problems in information retrieval [12,21]. However, the diversification problem is still not well addressed for expert finding. In this paper, we try to give an explicit diversity-based objective function for expert finding, and to leverage a learning-based algorithm to improve the ranking performance.
The work is supported by the Natural Science Foundation of China (No. 61073073) and Chi-
nese National Key Foundation Research (No. 60933013), a special fund for Fast Sharing of
Science Paper in Net Era by CSTD.
Fig. 1. Illustrative example of diversified expert finding. The query is “information retrieval”, and
the list on the left side is obtained by language model. All the top five are mainly working on
retrieval models. The right list is obtained by the proposed diversified ranking method with four
subtopics (indicated by different colors).
– We precisely formulate the problem of diversified expert finding and define an ob-
jective function to explicitly incorporate the diversity of subtopics into the relevance
ranking function.
2 Problem Definition
In this section, we formulate the problem in the context of academic social network
to keep things concrete, although adaptation of this framework to expert finding in other
social-network settings is straightforward.
Generally speaking, the input of our problem consists of (1) the results of any topic modeling, such as predefined ontologies or topic clusters based on pLSI [9] or LDA [5], and (2) a social network G = (V, E) together with the topic model on the authors V, where V is a set of authors and E ⊂ V × V is a set of coauthor relationships between authors. More precisely, we can define a topic distribution over each author as follows.
Topic distribution: In social networks, an author usually has interest in multiple topics. Formally, each user v ∈ V is associated with a vector θ_v ∈ R^T of T-dimensional topic distribution (Σ_z θ_vz = 1). Each element θ_vz is the probability (i.e., p(z|v)) of the user on topic z.
In this way, each author can be mapped onto multiple related topics. In the meantime,
for a given query q, we can also find a set of associated topics (which will be depicted
in detail in §3). Based on the above concepts, the goal of our diversified expert finding
is to find a list of experts for a given query such that the list can maximally cover the
associated topics of the query q. Formally, we have:
Problem 1. Diversified Expert Finding. Given (1) a network G = (V, E), (2) a T-dimensional topic distribution θ_v ∈ R^T for each author v in V, and (3) a metric function f(.), the objective of diversified expert finding for each query q is to maximize the following function:

Σ_{z=1}^{T} f(k | z, G, Θ, q) × p(z|q)    (1)

where f(k | z, G, Θ, q) measures the relevance score of the top-k returned authors given topic z; we can apply a parameter τ to control the complexity of the objective function by selecting the topics with larger probabilities (i.e., the minimum number of topics that satisfy Σ_z p(z|q) ≥ τ). In an extreme case (τ = 1), we consider all topics.
Please note that this is a general formulation of the problem. The relevance metric
f (k|z, G, Θ, q) can be instantiated in different ways and the topic distribution can also
be obtained using different algorithms. Our formulation of the diversified expert finding
is very different from existing works on expert finding [3,16,17]. Existing works have
mainly focused on finding relevant experts for a given query, but ignore the diversification over different topics. Our problem is also different from the learning-to-rank work [11,23], where the objective is to combine different factors into a machine learning model to better rank web documents; this differs in nature from our diversified expert finding problem.
3 Model Framework
3.1 Overview
In general, the topic information can be obtained in many different ways. For exam-
ple, in a social network, one can use the predefined categories or user-assigned tags
as the topic information. In addition, we can use statistical topic modeling [9,5,20] to
automatically extract topics from the social networking data. In this paper, we use the
author-conference-topic (ACT) model [20] to initialize the topic distribution of each
user. For completeness, we give a brief introduction of the ACT model. For more de-
tails, please refer to [20].
ACT model simulates the process of writing a scientific paper using a series of prob-
abilistic steps. In essence, the topic model uses a latent topic layer Z = {z1 , z2 , ..., zT }
as the bridge to connect the different types of objects (authors, papers, and publication
venues). More accurately, for each object it estimates a mixture of topic distribution
which represents the probability of the object being associated with every topic. For ex-
ample, for each author, we have a set of probabilities {p(zi |a)} and for each paper d, we
have probabilities {p(zi |d)}. For a given query q, we can use the obtained topic model
to do inference and obtain a set of probabilities {p(zi |q)}. Table 1 gives an example of
the most relevant topics for the query “Database”.
Table 1. Most relevant topics (i.e., with a higher p(z|q)) for query “Database”
Topic p(z|q)
Topic 127: Database systems 0.15
Topic 134: Gene database 0.09
Topic 106: Web database 0.07
Topic 99: XML data 0.05
Topic 192: Query processing 0.04
MAP(Q, k) = (1/|Q|) Σ_{j=1}^{|Q|} [ Σ_{i=1}^{k} Prec(a_{ji}) × rel(a_{ji}) ] / [ Σ_{i=1}^{k} rel(a_{ji}) ]    (2)
where Q is a set of queries in the training data; P rec(aji ) represents the precision
value obtained for the set of top i returned experts for query qj ; rel(aji ) is an indicator
function equaling 1 if the expert aji is relevant to query qj , 0 otherwise. The normalized
inner sum denotes the average precision for the set of top k experts and the normalized
outer sum denotes the average over all queries Q.
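Eq. (2) is the standard MAP@k computed per query over the ranked author lists; a direct implementation is sketched below, with illustrative ranked lists and ground-truth sets.

def average_precision_at_k(ranked, relevant, k):
    hits, score = 0, 0.0
    for i, author in enumerate(ranked[:k], start=1):
        if author in relevant:
            hits += 1
            score += hits / i          # Prec@i, counted only at relevant positions
    return score / hits if hits else 0.0

def map_at_k(rankings, relevant_sets, k):
    return sum(average_precision_at_k(r, rel, k)
               for r, rel in zip(rankings, relevant_sets)) / len(rankings)

# Toy usage
print(map_at_k([["a", "b", "c"], ["d", "e", "f"]], [{"a", "c"}, {"f"}], k=3))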
Now, we redefine the objective function based on a generalized MAP metric called
MAP-Topic, which explicitly incorporates the diversity of subtopics. More specifically,
given a training data set {(q (j) , Aq(j) )}, where q (j) ∈ Q is query and Aq(j) is the set of
related experts for query q (j) , we can define the following objective function:
O = (1/|Q|) Σ_{j=1}^{|Q|} Σ_{z=1}^{T} p(z|q) × [ Σ_{i=1}^{k} rel(a_{ji}) × ( Σ_{m=1}^{i} p(z|a_{jm}) / i ) ] / [ Σ_{i=1}^{k} p(z|a_{ji}) × rel(a_{ji}) ]    (3)
where w_i is the weight of the i-th feature. Given a feature weight vector w, according
to the objective function described above, we can calculate a value, denoted as O(w),
to evaluate the ranking results of that model. Thus our target is to find a configuration
of w to maximize O(w).
For simplicity and effectiveness, in this paper we utilize the hill climbing algorithm, due to its efficiency and ease of implementation. The algorithm is summarized in Algorithm 1.
Different from the original random-restart hill climbing algorithm, which starts from purely random parameters, we add our prior knowledge empiricalVector to the initialization of w, since we know that some features, such as BM25, directly affect the relevance degree and tend to be more important. By doing so, we can reduce the CPU time required for training.
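A minimal random-restart hill-climbing sketch in the spirit of this procedure is given below (Algorithm 1 itself is not reproduced); the objective O(w), the empirical starting vector, the step size, and the iteration counts are all placeholders.

import numpy as np

def hill_climb(objective, w0, n_restarts=5, n_iters=100, step=0.05, seed=0):
    # Maximize objective(w) by perturbing one coordinate at a time.
    rng = np.random.default_rng(seed)
    best_w, best_val = w0.copy(), objective(w0)
    for r in range(n_restarts):
        # The first restart starts from the prior (empirical) vector, later ones from noise around it.
        w = w0.copy() if r == 0 else w0 + rng.normal(scale=0.5, size=w0.shape)
        val = objective(w)
        for _ in range(n_iters):
            cand = w.copy()
            cand[rng.integers(len(w))] += rng.choice([-step, step])
            cand_val = objective(cand)
            if cand_val > val:             # keep the move only if it improves O(w)
                w, val = cand, cand_val
        if val > best_val:
            best_w, best_val = w, val
    return best_w, best_val

# Toy usage with a quadratic surrogate objective standing in for MAP-Topic
print(hill_climb(lambda w: -np.sum((w - 0.3) ** 2), np.zeros(4)))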
4 Experiment
Data Sets. From the system, we obtain a network consisting of 1,003,487 authors,
6,687 conferences, and 2,032,845 papers. A detailed introduction about how the aca-
demic network has been constructed can be found in [19]. As there are no standard data sets available, and it is difficult to create such data sets with ground truth, we construct a data set for a fair evaluation in the following way: First, we select a number of the most frequent queries from the query log of the online system; then we remove the overly specific or lengthy queries (e.g., ‘A Convergent Solution to Subspace Learning’) and normalize similar queries (e.g., ‘Web Service’ and ‘Web Services’ to ‘Web Service’).
¹ http://arnetminer.org
Table 2. Statistics of selected queries. Entropy(q) = −Σ_{i=1}^{T} p(z_i|q) log p(z_i|q) measures the query's uncertainty; #(τ = 0.2) denotes the minimum number of topics that satisfy Σ_z p(z|q) ≥ τ.
Second, for each query, we identify the most relevant (top) conferences. For
example, for ‘Web Service’, we select ICWS and for ‘Information Retrieval’, we select
SIGIR. Then, we collect and merge PC co-chairs, area chairs, and committee members
of the identified top conferences in the past four years. In this way, we obtain a list of
candidates. We rank these candidates according to the number of times they appear, breaking ties using the h-index value [8]. Finally, we use the top-ranked 100 experts as the ground truth for each query.
Topic Model Estimation. For the topic model (ACT), we perform model estimation
by setting the topic number as 200, i.e., T = 200. The topic number is determined by
empirical experiments (more accurately, by minimizing the perplexity [2], a standard
measure for estimating the performance of a probabilistic model, the lower the better).
The topic modeling is carried out on a server running Windows 2003 with Dual-Core
Intel Xeon processors (3.0 GHz) and 4GB memory. For the academic data set, it took
about three hours to estimate the ACT model.
We produce some statistics for the selected queries (as shown in Table 2). Entropy(q) measures the query's uncertainty, and #(τ = 0.2) denotes the minimum number of topics that satisfy Σ_z p(z|q) ≥ τ.
Feature Definition. We define features to capture the observed information for ranking
experts of a given query. We consider two types of features: 1) query-independent fea-
tures (such as h-index, sociability, and longevity) and 2) query-dependent features (such
as BM25 [13] score and language model with recency score). A detailed description of
the feature definition is given in Appendix.
Evaluation Measures and Comparison Methods. To quantitatively evaluate the proposed method, we consider two aspects: relevance and diversity. For the feature-based ranking, we use six-fold cross-validation (i.e., five folds for training and the remaining fold for testing) and evaluate the approaches in terms of Prec@5, Prec@10, Prec@15, Prec@20, and MAP. We conduct the evaluation on the entire data of the online system (including 916,946 authors, 1,558,499 papers, and 4,501 conferences). We refer to the proposed method as DivLearn and compare it with the following methods:
RelLearn: A learning-based method. It uses the same setting (the same feature definition and the same training/test data) as DivLearn, except that it does not consider topic diversity and directly uses MAP as the objective function for learning.
[Fig. 2. Effect of topic threshold analysis. (a) Effect of threshold τ on MAP (%), comparing LM, BM25, ACT, ACT+RW, LDA, pLSI, and DivLearn. (b) Cumulated probabilities Σ_z p(z|q) of the top topics (ordered by p(z|q)) for the example queries Data Mining, Machine Learning, Natural Language Processing, and Semantic Web.]
Convergence Property. We first study the convergence property of the learning al-
gorithm. We trace the execution of 72 random hill climbing runs to evaluate the con-
vergence of the model learning algorithm. On average, the number of iterations to find
the optimal parameter w varies from 16 to 28. The CPU time required to perform each
iteration is around 1 minute. This suggests that the learning algorithm is efficient and
has a good convergence property.
Effect of Topic Threshold. We conduct an experiment to examine the effect of using different thresholds τ to select topics in the objective function (Eq. 3). We select the minimum number of topics with the highest probabilities that satisfy Σ_z p(z|q) ≥ τ, then re-scale this sum to 1 and assign 0 to the other topics. Clearly, when τ = 1, all topics are counted. Figure 2a shows the MAP values of the different methods for various τ. It shows that this metric is consistent to a certain degree: the performance of the different methods is relatively stable under different parameter settings. This can be explained by Figure 2b, which depicts the cumulated p(z|q) of the top n topics. As shown, for a given query, p(z|q) tends to be dominated by several top related topics. The statistics in Table 2 also confirm this observation. All these observations confirm the effectiveness of the proposed method.
Effect of Recency. We evaluate whether expert finding is dynamic over time. In Eq. 7, we define a combination feature of the language model score and the recency score (Func 1). Here, we qualitatively examine how different settings of the recency impact function affect the performance of DivLearn. We also compare with another recency function, Recency(p) = 2^{(d.year − current year)/λ} (Func 2) [14]. Figure 3 shows the MAP performance for different values of the parameter λ. The baseline denotes the performance without considering recency. The results show that recency is an important factor and that both impact functions perform better than the baseline, which does not consider recency.
[Fig. 3. MAP (%) for different recency impact functions (DivLearn + Func 1, DivLearn + Func 2, and the baseline) with different values of the recency parameter.]
We can also see that both impact functions perform best with the setting λ = 5. On average, the first impact function (Func 1, used in our approach) performs slightly better than Func 2.
5 Related Work
Previous work related to learning to diversify expert finding with subtopics falls into three areas: expert finding, learning to rank, and search result diversification. On expert finding, [17] propose a topic-level approach over heterogeneous networks. [3] extended language models to address the expert finding problem. TREC also provides a platform for researchers to evaluate their models [16]. [7] present a learning framework for expert finding, but only relevance is considered. Other topic model based approaches have also been proposed [17].
Learning to rank aims to combine multiple sources of evidence for ranking. Liu [11] gives a survey on this topic. He categorizes the related algorithms into three groups, namely point-wise, pair-wise, and list-wise. To optimize the learning target, in this paper we use a list-wise approach, which is similar to [23].
Recently, a number of works have studied the problem of result diversification by taking inter-document dependencies into consideration [1,25,6,18]. Yue and Joachims [24] present an SVM-based approach for learning a good diversity retrieval function. For evaluation, Agrawal et al. [1] generalize classical information retrieval metrics to explicitly account for the value of diversification. Zhai et al. [25] propose a framework for evaluating the retrieval of different subtopics of a query topic. However, no previous work has addressed learning to diversify expert finding.
6 Conclusion
In this paper, we studied the problem of learning to diversify expert finding results using subtopics. We formally defined the problem in a supervised learning framework. An objective function is defined by explicitly incorporating topic-based diversity into the
relevance-based ranking model. An efficient algorithm is presented to solve the objective function. Experimental results on a real system validate the effectiveness of the proposed approach.
Learning to diversify expert finding represents a new research direction in both information retrieval and data mining. As future work, it is interesting to study how to incorporate the diversity of relationships between experts into the learning process. In addition, it would also be interesting to detect user intention and to learn the weights of subtopics via interactions with users.
References
1. Agrawal, R., Gollapudi, S., Halverson, A., Ieong, S.: Diversifying search results. In: WSDM
2009, pp. 5–14. ACM (2009)
2. Baeza-Yates, R., Ribeiro-Neto, B., et al.: Modern information retrieval, vol. 463. ACM Press,
New York (1999)
3. Balog, K., Azzopardi, L., de Rijke, M.: Formal models for expert finding in enterprise cor-
pora. In: SIGIR 2006, pp. 43–50. ACM (2006)
4. Bertsekas, D.: Nonlinear programming. Athena Scientific, Belmont (1999)
5. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. In: NIPS 2001, pp. 601–608
(2001)
6. Carbonell, J.G., Goldstein, J.: The use of mmr, diversity-based reranking for reordering doc-
uments and producing summaries. In: SIGIR 1998, pp. 335–336 (1998)
7. Fang, Y., Si, L., Mathur, A.: Ranking experts with discriminative probabilistic models. In:
SIGIR 2009 Workshop on LRIR. Citeseer (2009)
8. Hirsch, J.E.: An index to quantify an individual’s scientific research output. Proceedings of
the National Academy of Sciences 102(46), 16569–16572 (2005)
9. Hofmann, T.: Probabilistic latent semantic indexing. In: SIGIR 1999, pp. 50–57. ACM (1999)
10. Koza, J.: On the programming of computers by means of natural selection, vol. 1. MIT Press
(1996)
11. Liu, T.: Learning to rank for information retrieval. Foundations and Trends in Information
Retrieval 3(3), 225–331 (2009)
12. Radlinski, F., Bennett, P.N., Carterette, B., Joachims, T.: Redundancy, diversity and interde-
pendent document relevance. SIGIR Forum 43, 46–52 (2009)
13. Robertson, S., Walker, S., Hancock-Beaulieu, M., Gatford, M., Payne, A.: Okapi at trec-4.
In: Proceedings of TREC, vol. 4 (1995)
14. Roth, M., Ben-David, A., Deutscher, D., Flysher, G., Horn, I., Leichtberg, A., Leiser, N.,
Matias, Y., Merom, R.: Suggesting friends using the implicit social graph. In: KDD 2010
(2010)
15. Russell, S., Norvig, P., Canny, J., Malik, J., Edwards, D.: Artificial intelligence: a modern
approach, vol. 74. Prentice Hall, Englewood Cliffs (1995)
16. Soboroff, I., de Vries, A., Craswell, N.: Overview of the trec 2006 enterprise track. In: Pro-
ceedings of TREC. Citeseer (2006)
17. Tang, J., Jin, R., Zhang, J.: A topic modeling approach and its integration into the random
walk framework for academic search. In: ICDM 2008, pp. 1055–1060 (2008)
18. Tang, J., Wu, S., Gao, B., Wan, Y.: Topic-level social network search. In: KDD 2011, pp.
769–772. ACM (2011)
19. Tang, J., Yao, L., Zhang, D., Zhang, J.: A combination approach to web user profiling. ACM
TKDD 5(1), 1–44 (2010)
20. Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., Su, Z.: Arnetminer: extraction and mining of
academic social networks. In: KDD 2008, pp. 990–998 (2008)
21. Tong, H., He, J., Wen, Z., Konuru, R., Lin, C.-Y.: Diversified ranking on large graphs: an
optimization viewpoint. In: KDD 2011, pp. 1028–1036 (2011)
22. Wei, X., Croft, W.: LDA-based document models for ad-hoc retrieval. In: SIGIR 2006, pp.
178–185. ACM (2006)
23. Yeh, J., Lin, J., Ke, H., Yang, W.: Learning to rank for information retrieval using genetic
programming. In: SIGIR 2007 Workshop on LR4IR. Citeseer (2007)
24. Yue, Y., Joachims, T.: Predicting diverse subsets using structural SVMs. In: ICML 2008, pp.
1224–1231. ACM (2008)
25. Zhai, C., Cohen, W., Lafferty, J.: Beyond independent relevance: methods and evaluation
metrics for subtopic retrieval. In: SIGIR 2003, pp. 10–17. ACM (2003)
– h-index: An h-index equal to h indicates that an author has h of N papers with at least h citations each, while the remaining (N − h) papers have at most h citations each.
– Longevity: Longevity reflects the length of an author's academic life. We consider the year in which an author published his/her first paper as the beginning of his/her academic life, and the year of the last paper as its end.
– Sociability: The score of an author's sociability is defined based on how many coauthors he/she has. This score is defined as:
$\mathrm{Sociability}(A) = 1 + \sum_{c \,\in\, A\text{'s coauthors}} \ln(\#co\text{-}paper_c) \qquad (5)$
where $\#co\text{-}paper_c$ denotes the number of papers coauthored by the author and the coauthor c.
– Language Model with Recency: We consider the effect of recency and of the conference impact factor. Thus, the language model score used for an author is redefined as:
$LM(q|a) = \sum_{d \,\in\, \{a\text{'s publications}\}} p(q|d) \times Impact(d.conference) \times Recency(d) \qquad (6)$
– BM25 with Recency: This defines a relevance score similar to that in Eq. 6, except that p(q|d) is obtained by BM25 (see the sketch after this list).
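The feature definitions above can be made concrete with a small computational sketch. The code below is only an illustration, not the authors' implementation: the record layout (a list of citation counts, a coauthor-to-paper-count map, a list of publication dicts) and the Impact/Recency helpers are hypothetical assumptions.

```python
import math

def h_index(citations):
    # h is the largest value such that at least h papers have >= h citations each
    cites = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(cites, start=1) if c >= rank)

def sociability(coauthor_papers):
    # Eq. (5): 1 plus the sum over coauthors c of ln(#co-paper_c)
    return 1.0 + sum(math.log(n) for n in coauthor_papers.values())

def lm_with_recency(pubs, p_q_given_d, impact, recency):
    # Eq. (6): aggregate p(q|d) * Impact(d.conference) * Recency(d) over the author's papers
    return sum(p_q_given_d(d) * impact(d["conference"]) * recency(d["year"]) for d in pubs)

print(h_index([10, 8, 5, 4, 3]))            # -> 4
print(sociability({"Alice": 3, "Bob": 1}))  # -> 1 + ln(3) + ln(1)
score = lm_with_recency(
    [{"conference": "KDD", "year": 2011}, {"conference": "ICDM", "year": 2008}],
    p_q_given_d=lambda d: 0.1,                            # placeholder per-document LM score
    impact=lambda conf: {"KDD": 2.0, "ICDM": 1.5}[conf],  # hypothetical impact factors
    recency=lambda year: 0.9 ** (2012 - year))            # hypothetical recency decay
```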
An Associative Classifier for Uncertain Datasets
University of Alberta
Edmonton, Alberta, Canada
{hooshsad,zaiane}@cs.ualberta.ca
1 Introduction
Typical relational databases or databases in general hold collections of records
representing facts. These facts are observations with known values stored in the
fields of each tuple of the database. In other words, the observation represented
by a record is assumed to have taken place and the attribute values are assumed
to be true. We call these databases "certain databases" because we are certain
about the recorded data and their values. In contrast to “certain” data there is
also “uncertain data”; data for which we may not be sure about the observation
whether it really took place or not, or data for which the attribute values are
not ascertained with 100% probability.
Querying such data, particularly computing aggregations, ranking or discov-
ering patterns in probabilistic data is a challenging feat. Many researchers have
focused on uncertain databases, also called probabilistic databases, for managing
uncertain data [1], top-k ranking uncertain data [2], querying uncertain data [3],
or mining uncertain data [4,5]. While many approaches use an existential uncer-
tainty attached to a record as a whole, our model targets uncertain databases
with probabilities attached to each attribute value.
This paper addresses the problem of devising an accurate rule-based classi-
fier on uncertain training data. There are many classification paradigms but the
classifiers of interest to our study are rule-based. We opted for associative clas-
sifiers, classifiers using a model based on association rules, as they were shown
to be highly accurate and competitive with other approaches [6].
After briefly reviewing related work for associative classification as well as
published work on classifying in the presence of uncertainty, we present in Section
2 Related Works
probability. Existing uncertain data rule-based classifiers have suggested various
answers to this problem.
uHARMONY suggested a lower bound on the probability by which the in-
stance satisfies the rule antecedent. This approach is simple and fast, but the
difficulty or even impossibility of setting the threshold is a problem. This is ex-
plained in more detail in Section 3.2. uRule suggested to remove the items in the
antecedent of the rule from the instance, to leave only the uncovered part of the
instance every time. In contrast to uHARMONY, this method uses the whole
dataset, but it may cause sensitivity to noise, which is undesirable. UCBA, which is based on CBA, does not include the uncertainty in the rule selection process; it selects as many rules as possible. This method does not filter out enough rules and so may decrease accuracy.
In UAC, we introduce a new solution to the coverage problem. This compu-
tation does not increase the running time complexity and needs no extra passes
over the dataset.
3 UAC Algorithm
In this section, we present our novel algorithm, UAC. Before applying UAC, uncertain numerical attributes in the training sets are first transformed into uncertain categorical attributes using U-CAIM [10], assuming a normal distribution over the intervals. After discretization, the value of the i-th attribute for the j-th instance is a list of value-probability pairs, as shown in Equation 1.
Some studies have criticized expected support and defined another measure
which is called probabilistic support [17] [18]. Probabilistic support is defined as
the probability of an itemset to be frequent with respect to a certain minimum
expected support. However, probabilistic support increases the time complexity
significantly. Therefore to be more efficient, UAC uses the expected support.
uHARMONY defines another measure instead of confidence which is called
expected confidence. The computation of this measure takes O(|T |2 ) time where
|T | is the number of instances. Computing confidence is only O(1), thus we use
confidence for efficiency reasons. Our experimental results in Section 4 empiri-
cally show that our confidence-based method can reach high accuracy.
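As a concrete illustration of expected support (and of the cheaper confidence computation UAC prefers over expected confidence), here is a minimal sketch. It assumes transactions are maps from items to existence probabilities, that items are independent within a transaction, that class labels are stored as certain items, and that confidence is taken as a ratio of expected supports; the data and names are invented for the example.

```python
from math import prod

def expected_support(dataset, itemset):
    # expected support: sum over transactions of the product of the
    # probabilities of the itemset's items (independence assumed)
    return sum(prod(t.get(item, 0.0) for item in itemset) for t in dataset)

def confidence(dataset, antecedent, label):
    # confidence of the ruleitem antecedent -> label, taken here as a
    # ratio of expected supports (a simplifying assumption for illustration)
    sup_a = expected_support(dataset, antecedent)
    return expected_support(dataset, antecedent | {label}) / sup_a if sup_a else 0.0

# toy uncertain transactions: item -> existence probability (class labels certain)
data = [{"m": 0.9, "t": 0.8, "c1": 1.0},
        {"m": 0.4, "n": 0.6, "c2": 1.0},
        {"m": 0.7, "t": 0.5, "c1": 1.0}]
print(expected_support(data, {"m", "t"}))   # 0.9*0.8 + 0 + 0.7*0.5 = 1.07
print(confidence(data, {"m", "t"}, "c1"))   # 1.0 in this toy dataset
```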
Our rule extraction method is based on UApriori [4]. The candidate set is first
initialized by all rules of form a → c where a is a single attribute assignment and
c is a class label. After removing all infrequent ruleitems, the set of candidates
is pruned by the pessimistic error rate method [19]. Each two frequent ruleitems
with the same class label are then joined together to form the next level candi-
date set. The procedure is repeated until the generated candidate set is empty,
meaning all the frequent ruleitems have been found. Those ruleitems that are
strong (their confidence is above the predefined threshold) are the potential clas-
sification rules. In the next section, the potential ruleitems are filtered and the
final set of rules is formed.
The outcome of the rule extraction is a set of rules called rawSet. Usually the number of ruleitems in rawSet is excessive, and excessive rules may have a negative impact on the accuracy of the classification model. To prevent this, UAC uses
the database coverage method to reduce the set of rules while handling the
uncertainty. The initial step of the database coverage method in UAC is to sort
rules based on their absolute precedence to accelerate the algorithm. Absolute
precedence in the context of uncertain data is defined as follows:
Definition: Rule ri has absolute precedence over rule rj , denoted ri ≻ rj , if a) ri has higher confidence than rj ; b) ri and rj have the same confidence but ri has higher expected support than rj ; or c) ri and rj have the same confidence and the same expected support but ri has fewer items in its antecedent than rj .
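The sorting step can be illustrated with a small comparator that follows the absolute-precedence definition above; the dictionary field names (conf, esup, antecedent, label) are hypothetical names chosen for this sketch.

```python
from functools import cmp_to_key

def absolute_precedence(ri, rj):
    # -1 if ri precedes rj: higher confidence first, then higher expected
    # support, then fewer antecedent items; 0 if tied on all three
    if ri["conf"] != rj["conf"]:
        return -1 if ri["conf"] > rj["conf"] else 1
    if ri["esup"] != rj["esup"]:
        return -1 if ri["esup"] > rj["esup"] else 1
    if len(ri["antecedent"]) != len(rj["antecedent"]):
        return -1 if len(ri["antecedent"]) < len(rj["antecedent"]) else 1
    return 0

rules = [
    {"antecedent": ("m", "t"), "label": "c1", "conf": 0.8, "esup": 3.1},
    {"antecedent": ("n",),     "label": "c2", "conf": 0.7, "esup": 4.0},
    {"antecedent": ("m",),     "label": "c1", "conf": 0.8, "esup": 3.1},
]
raw_set = sorted(rules, key=cmp_to_key(absolute_precedence))
# the single-item rule with confidence 0.8 sorts first, the 0.7 rule last
```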
When data is not uncertain, confidence is a good and sufficient measure to
examine whether a rule is the best classifier for an instance. But when uncertainty
is present, there is an additional parameter in effect. To illustrate this issue,
assume rules r1 : [m, t → c1 ] and r2 : [n → c2 ] with confidences of 0.8 and 0.7, respectively. It is evident that r1 ≻ r2 . However, for a test instance like I1 : [(m : 0.4), (n : 0.6), (t : 0.3) → x], where x is to be predicted, which rule should be used?
According to CBA, r1 should be used because its confidence is higher than that
of r2 . However, the probability that I1 satisfies the antecedent of r1 is small,
so r1 is not likely to be the right classifier. We solve this problem by including
another measure called PI. P I or probability of inclusion, denoted by π(ri , Ik ),
is described as the probability by which rule ri can classify instance Ik . P I is
defined in Equation 3. In the example above, π(r1 , I1 ) is only 0.3 × 0.4 = 0.12, while π(r2 , I1 ) is 0.6.
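A minimal sketch of the PI computation, assuming (consistently with the worked example, 0.3 × 0.4 = 0.12) that PI is the product of the matched attribute-value probabilities; the function name and data layout are illustrative, not taken from the paper.

```python
from math import prod

def probability_of_inclusion(rule_antecedent, instance):
    # pi(r, I): probability that instance I satisfies the rule antecedent,
    # taken as the product of the matched attribute-value probabilities
    return prod(instance.get(item, 0.0) for item in rule_antecedent)

I1 = {"m": 0.4, "n": 0.6, "t": 0.3}
print(probability_of_inclusion(("m", "t"), I1))  # 0.4 * 0.3 = 0.12, as in the example
print(probability_of_inclusion(("n",), I1))      # 0.6
```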
Stage 1: Finding ucRules and uwRules. After sorting rawSet based on the
absolute precedence, we make one pass over the dataset to link each instance i in
the dataset to two rules in rawSet: ucRule and uwRule. ucRule is the rule with
the highest relative precedence that correctly classifies i. In contrast, uwRule
is the rule with the highest relative precedence that wrongly classifies i. The
pseudocode for the first stage is presented in Algorithm 1.
In Algorithm 1, three sets are declared. U contains all the rules that clas-
sify at least one training instance correctly. Q is the set of all ucRules which
have relative precedence over their corresponding uwRules with respect to the
associated instances. If i.uwRule has relative and absolute precedence over the
corresponding ucRule, a record of form < i.id, i.class, ucRule, uwRule > is put
in A. Here, i.id is the unique identifier of the instance and i.class represents the
class label.
To find the corresponding ucRule and uwRule for each instance, the procedure
starts at the first rule of the sorted rawSet and descends. For example, if there
is a rule that correctly classifies the target instance and has applicability of α,
we pass this rule and look for the rules with higher applicabilities to assign as
ucRule. Searching continues only until we reach a rule that has a confidence
of less than α. Clearly, this rule and the rules after it (with lower confidence) have
no chance of being ucRule. The same applies to uwRule. Also as shown in
Algorithm 1 lines 4 and 6, the applicability values of ucRule and uwRule are
stored to expedite the process for the next stages.
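The Stage 1 pass can be sketched for a single instance. Two working assumptions are made here that go beyond the description above: the applicability α(r, i) is taken to be confidence(r) × π(r, i), and relative precedence is taken to mean ordering by applicability; the data layout is also invented. Under these assumptions the early stop is safe, since α(r, i) can never exceed the rule's confidence.

```python
from math import prod

def applicability(rule, instance):
    # working assumption: alpha(r, i) = confidence(r) * pi(r, i)
    pi = prod(instance["values"].get(item, 0.0) for item in rule["antecedent"])
    return rule["conf"] * pi

def find_ucrule_uwrule(sorted_rules, instance):
    """Scan the precedence-sorted rules once and return (ucRule, uwRule): the
    best-applicability rules classifying the instance correctly and wrongly."""
    uc = uw = None
    best_uc = best_uw = 0.0
    for r in sorted_rules:
        # once both slots are filled, a rule whose confidence is below both
        # best applicabilities can no longer improve either of them
        if uc and uw and r["conf"] < min(best_uc, best_uw):
            break
        a = applicability(r, instance)
        if r["label"] == instance["label"]:
            if a > best_uc:
                uc, best_uc = r, a
        elif a > best_uw:
            uw, best_uw = r, a
    return uc, uw

rules = [{"antecedent": ("m", "t"), "label": "c1", "conf": 0.8},   # already sorted
         {"antecedent": ("n",),     "label": "c2", "conf": 0.7}]
inst = {"values": {"m": 0.4, "n": 0.6, "t": 0.3}, "label": "c1"}
print(find_ucrule_uwrule(rules, inst))
```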
The purpose of the database coverage in UAC is to find the best classifying
rule (coverage) for each instance in the dataset. The covering rules are then
contained in the final set of rules and others are filtered out. The best rule,
that is the covering rule, in CBA is the highest precedence rule that classifies an
instance. This definition is not sufficient for UAC because the highest precedence
rule may have a small P I.
To solve the aforementioned problem, uHARMONY sets a predefined lower
bound on the P I value of the covering rule, a method with various disadvantages.
Clearly, estimating a suitable lower bound is not only critical but also intricate, and in many cases even impossible. When predicting a label for an instance, rules that have a higher P I than the lower bound are treated alike. To improve upon this, it is necessary to set the lower bound high enough to prevent low-probability rules from covering the instances. However, it is still possible that the only rules classifying some of the instances are not above that lower bound and are removed. Additionally, setting a predefined lower bound filters
out usable information, while the purpose of the uncertain data classifiers is to
use all of the available information. Moreover, having a single bound for all of
the cases is not desirable. Different instances may need different lower bounds.
Given all the above reasons, we need to evaluate the suitable lower bound for
each instance. The definition of the covering rule in UAC is as follows, where we
use the applicability of i.ucRule as our lower bound for covering i.
Definition: Rule r covers instance i if: a) r classifies at least one instance correctly; b) π(r, i) > 0; c) α(r, i) > α(i.ucRule, i) = cApplic; and d) r ≻ i.ucRule.
cApplic represents the maximum rule applicability to classify an instance
correctly. Thus, it is the suitable lower bound for the applicability of the covering
rules. This will ensure that each instance is covered with the best classifying rule
(ucRule) or a rule with higher relative and absolute precedence than ucRule. In
the next two stages, we remove the rules that do not cover any instance from
rawSet.
pointer to each child node via the replace set (line 12). The number of incoming
edges is stored in incom (line 14). Each node represents a rule and each edge
represents a replacement relation.
Each rule has a covered array in UAC where r.covered[c] is used to
store the total number of instances covered by r and labeled by class c. If
r.covered[r.class] = 0, then r does not classify any training instance correctly
and is filtered out. Starting from line 22, we traverse RepDAG in its topologically sorted order to update the covered array of each rule. Rule ri comes before rj in the sorted order if ri ≻ rj and there is no instance Ik for which rj has relative precedence over ri .
If a rule fails to cover any instance correctly (line 26), it does not have any effect
on the covered array of the rules in its replace set. At the end of this Stage,
enough information has been gathered to start the next stage, which finalizes
the set of rules.
of rawSet. The worst case scenario is when at Stage 1, at least one ucRule or
uwRule is the last rule in the sorted rawSet. This case rarely happens because
the rules are sorted based on their absolute precedence. UAC also makes slightly
more than one pass over the dataset in the rule filtering step. Passes are made in
Stage 1 and 2. Note that array A is usually small, given that most of the instances
are usually classified by the highest ranked rules. The number of passes is an
important point, because the dataset may be very large. Especially for datasets that cannot be loaded into memory at once, it is not efficient to make multiple passes. This is an advantage of UAC over UCBA, which passes over the dataset once for each rule in rawSet. The next section explains rule selection, the procedure for classifying test instances based on the set of rules.
5 Conclusion
References
1. Sen, P., Deshpande, A.: Representing and querying correlated tuples in probabilis-
tic databases. In: IEEE ICDE, pp. 596–605 (2007)
2. Wang, C., Yuan, L.-Y., You, J.H., Zaiane, O.R., Pei, J.: On pruning for top-k
ranking in uncertain databases. In: International Conference on Very Large Data
Bases (VLDB), PVLDB, vol. 4(10) (2011)
3. Cheema, M.A., Lin, X., Wang, W., Zhang, W., Pei, J.: Probabilistic reverse nearest
neighbor queries on uncertain data. IEEE Transactions on Knowledge and Data
Engineering (TKDE) 22, 550–564 (2010)
4. Aggarwal, C.C., Li, Y., Wang, J., Wang, J.: Frequent pattern mining with uncertain
data. In: ACM SIGKDD, pp. 29–38 (2009)
5. Jiang, B., Pei, J.: Outlier detection on uncertain data: Objects, instances, and
inference. In: IEEE ICDE (2011)
6. Antonie, M.-L., Zaiane, O.R., Holte, R.: Learning to use a learned model: A two-
stage approach to classification. In: IEEE ICDM, pp. 33–42 (2006)
7. Bi, J., Zhang, T.: Support vector classification with input data uncertainty. In:
Advances in Neural Information Processing Systems (NIPS), pp. 161–168 (2004)
8. Qin, B., Xia, Y., Li, F.: DTU: A Decision Tree for Uncertain Data. In: Theera-
munkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009. LNCS,
vol. 5476, pp. 4–15. Springer, Heidelberg (2009)
9. Ge, J., Xia, Y., Nadungodage, C.: UNN: A Neural Network for Uncertain Data
Classification. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds.) PAKDD
2010. LNCS, vol. 6118, pp. 449–460. Springer, Heidelberg (2010)
10. Qin, B., Xia, Y., Li, F.: A Bayesian classifier for uncertain data. In: ACM Sympo-
sium on Applied Computing, pp. 1010–1014 (2010)
11. Qin, B., Xia, Y., Prabhakar, S., Tu, Y.: A rule-based classification algorithm for
uncertain data. In: IEEE ICDE (2009)
12. Gao, C., Wang, J.: Direct mining of discriminative patterns for classifying uncertain
data. In: ACM SIGKDD, pp. 861–870 (2010)
13. Qin, X., Zhang, Y., Li, X., Wang, Y.: Associative Classifier for Uncertain Data.
In: Chen, L., Tang, C., Yang, J., Gao, Y. (eds.) WAIM 2010. LNCS, vol. 6184, pp.
692–703. Springer, Heidelberg (2010)
14. Liu, B., Hsu, W., Ma, Y.: Integrating Classification and Association Rule Mining.
In: ACM SIGKDD, pp. 80–86 (1998)
15. Zaiane, O., Antonie, M.-L.: Classifying text documents by associating terms with
text categories. In: Australasian Database Conference, pp. 215–222 (January 2002)
16. Li, W., Han, J., Pei, J.: CMAR: Accurate and efficient classification based on
multiple class-association rules. In: IEEE ICDM, pp. 369–376 (2001)
17. Zhang, Q., Li, F., Yi, K.: Finding frequent items in probabilistic data. In: ACM
SIGMOD, pp. 819–832 (2008)
18. Bernecker, T., Kriegel, H.P., Renz, M., Verhein, F., Zuefle, A.: Probabilistic fre-
quent itemset mining in uncertain databases. In: ACM SIGKDD (2009)
19. Quinlan, J.R.: C4.5: programs for machine learning. Morgan Kaufmann Publishers
(1993)
20. Demsar, J.: Statistical comparison of classifiers over multiple data sets. JMLR 7,
1–30 (2010)
21. Hooshsadat, M.: Classification and Sequential Pattern Mining From Uncertain
Datasets. MSc dissertation, University of Alberta, Edmonton, Alberta (Septem-
ber 2011)
Neighborhood-Based Smoothing of External
Cluster Validity Measures
1 Introduction
Clustering is a basic data mining task that discovers similar groups from given multivariate data. Validation of a clustering result is a fundamental but difficult issue, since clustering is unsupervised learning and essentially finds latent clusters in the observed data [3,7,14]. Up until now, various validity measures have been proposed from different aspects, and they are mainly separated into two types according to whether they are based on internal or external criteria [7,10,8]:
– Internal criteria evaluate the compactness and separability [3] of the clusters based only on distances between objects in the data space, that is, from a learning perspective. Well-known examples include the older Dunn index [4] and DB index [2] and the more recent CDbw [5]. Surveys and comparisons of internal cluster validity measures can be found in [3,9].
– External criteria evaluate how accurately the correct/desired clusters are formed, that is, from the user's perspective. External criteria normally use class/category labels together with the cluster assignment. Purity, entropy, F-measure, and mutual information are typical measures [10,12,14].
This paper focuses on external criteria, which are provided by human interpretation of the data. It is more beneficial to use external criteria when class labels are available.
1. Data objects belonging to the same class should be in neighboring clusters. To evaluate this property, we introduce a weighting function based on inter-cluster distance. The inter-cluster distance can be computed based on either a topology-based or a Euclidean distance in the data space.
2. The weighting function is introduced into basic statistics that are commonly used in the conventional measures. Our approach is therefore generic: any conventional cluster validity measure that uses these statistics can be extended in the same way.
2 Preliminaries
Definition 2 (Class). Let a class set be $T = \{T_i\}_{i=1}^{L}$, and let t(i) ∈ T denote the class assignment of $x_i$. Classes are provided independently of the clustering.
1 This work introduces the smoothing function into several cluster validity measures; in actual use, a measure should be selected according to the target application and the aspects the user wants to evaluate.
Fig. 1. Extension of a set-based clustering measure. The basic statistics are weighted by the neighborhood relation based on inter-cluster distance $d_{i,j}$. This example shows topology-based distance.
$\hat{N} = \int_{\Omega} \hat{f}(u)\,du = \int_{\Omega} \sum_{l \in T} \int_{\Omega} h(u, v)\, f(v; l)\, dv\, du. \qquad (3)$
Discretizing eqs. (1) to (3), let Nl,i be the number of objects with class l in the ith
cluster Ci ∈ C; Nl,i = #{xk |t(k) = l, c(k) = Ci }, where # denotes the number of
elements. Ni denotes the number of objects in cluster Ci ; Ni = #{xk |c(k) = Ci }.
Also N denotes the total number of objects; N = #{xk |xk ∈ S}. Eqs. (1) to (3)
can be rewritten as follows:
$\hat{N}_{l,i} = \sum_{C_j \in C} h_{i,j} N_{l,j}, \qquad (4)$
$\hat{N}_i = \sum_{l \in T} \hat{N}_{l,i} = \sum_{l \in T} \sum_{C_j \in C} h_{i,j} N_{l,j}, \qquad (5)$
$\hat{N} = \sum_{C_i \in C} \hat{N}_i = \sum_{C_i \in C} \sum_{l \in T} \sum_{C_j \in C} h_{i,j} N_{l,j}. \qquad (6)$
Here, any monotonically decreasing function can be used for $h_{i,j}$; for example, the often encountered Gaussian function $h_{i,j} = \exp(-d_{i,j}/\sigma^2)$, where $d_{i,j}$ denotes the inter-cluster distance and σ (> 0) is a smoothing (neighborhood) radius.
Thus, weighted cluster purity and entropy, for example, are defined using the
weighted statistics of eqs. (4), (5), and (6) as follows:
weighted Cluster Purity (wCP)
$\mathrm{wCP}(C) = \frac{1}{\hat{N}} \sum_{C_i \in C} \max_{l \in T} \hat{N}_{l,i}. \qquad (7)$
The original purity is the average of the ratio that the majority class occupies in each cluster, whereas in the weighted purity the majority class is determined by the neighboring class distribution $\{\hat{N}_{l,i}\}$.
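A small sketch of the smoothing machinery may help: it builds the weights $h_{i,j}$ from an inter-cluster distance matrix, forms the smoothed counts of eqs. (4)–(6), and evaluates wCP as in eq. (7). The matrix layout (classes × clusters), the function names, and the toy numbers are assumptions made for the example.

```python
import numpy as np

def weighted_counts(N_lc, D, sigma):
    # N_lc: class-by-cluster count matrix; D: inter-cluster distance matrix
    H = np.exp(-D / sigma**2)          # h_ij = exp(-d_ij / sigma^2), as in the text
    N_hat_lc = N_lc @ H.T              # smoothed N_{l,i} = sum_j h_ij N_{l,j}
    N_hat_c = N_hat_lc.sum(axis=0)     # smoothed N_i
    return N_hat_lc, N_hat_c, N_hat_c.sum()

def weighted_cluster_purity(N_lc, D, sigma):
    # wCP, eq. (7): the majority class is taken from the smoothed counts
    N_hat_lc, _, N_hat = weighted_counts(N_lc, D, sigma)
    return N_hat_lc.max(axis=0).sum() / N_hat

# toy example: 2 classes, 3 micro-clusters, chain-topology distances
N_lc = np.array([[10.0, 8.0, 1.0],
                 [ 0.0, 2.0, 9.0]])
D = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
print(weighted_cluster_purity(N_lc, D, sigma=1.0))
```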
$\mathrm{wEP}(C) = \frac{1}{|C|} \sum_{C_i \in C} \mathrm{Entropy}(C_i), \qquad (8)$
$\mathrm{Entropy}(C_i) = -\frac{1}{\log N} \sum_{l \in T} \frac{\hat{N}_{l,i}}{\hat{N}_i} \log \frac{\hat{N}_{l,i}}{\hat{N}_i}, \qquad (9)$
where |C| denotes the number of clusters. The original entropy indicates the degree of unevenness of the class distribution within a cluster, whereas the extended entropy also includes the unevenness of the neighboring clusters.
$\mathrm{wPA}(C) = \frac{a + d}{a + b + c + d}. \qquad (14)$
The original pairwise accuracy is a ratio of the number of pairs in the same
class belonging to the same cluster, or the number of pairs in different classes
belonging to different clusters, against all pairs. The weighted PA is the de-
gree to which pairs in the same class belong to neighboring clusters or to which pairs in different classes belong to distant clusters.
$\mathrm{wPF}(C) = \frac{2 \cdot P \cdot R}{P + R}, \qquad (15)$
where P = a'/(a'+b') is the precision, a measure of same-class pairs within each cluster, and R = a'/(a'+c') is the recall, a measure of same-cluster pairs within each class. The original pairwise F-measure is the harmonic mean of the precision and the recall, whereas the weighted PF is based on the degree to which the data pairs belong to the same cluster.
values, as the radius becomes zero (σ → 0). On the other hand, as the radius
becomes larger (σ → ∞), the data space is smoothed by almost the same weights,
and all micro-clusters are treated as one big cluster. The way to find the optimal
radius is described in section 3.4.
Fig. 3. Example of the effect of smoothing radius. Values over the neighborhood relation of the clusters become smoother as the radius increases.
This section describes the experiment to clarify the properties of the proposed
smoothed validity measures.
1. kmc-knn
Typical k-means clustering was used to produce a clustering, and the mutual k-nearest neighbor relation (kmc-knn) was used to obtain the neighbor relation. With the number of prototypes (micro-clusters) k1 and the number of nearest neighbors k2, the adjacency matrix A = (a_{i,j}) is given by (a small sketch follows this item):
$a_{i,j} = \begin{cases} 1 & \text{if } C_j \in O(C_i) \text{ and } C_i \in O(C_j) \\ 0 & \text{otherwise,} \end{cases} \qquad (17)$
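A minimal sketch of the mutual k-NN construction of eq. (17), assuming Euclidean distances between k-means prototypes; the function and variable names are invented for the example.

```python
import numpy as np

def mutual_knn_adjacency(prototypes, k2):
    # clusters C_i and C_j are connected iff each is among the other's
    # k2 nearest prototypes (mutual k-NN), as in eq. (17)
    P = np.asarray(prototypes, dtype=float)
    D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                 # a prototype is not its own neighbour
    knn = np.argsort(D, axis=1)[:, :k2]         # indices of the k2 nearest prototypes
    in_knn = np.zeros_like(D, dtype=bool)
    rows = np.repeat(np.arange(len(P)), k2)
    in_knn[rows, knn.ravel()] = True
    return (in_knn & in_knn.T).astype(int)      # mutual relation -> symmetric 0/1 matrix

# toy prototypes (e.g., k-means centroids of micro-clusters)
proto = [[0, 0], [1, 0], [2, 0], [10, 10]]
print(mutual_knn_adjacency(proto, k2=2))
```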
4.2 Datasets
1. Synthetic data
In order to evaluate the proposed measure, two classes of two-dimensional
synthetic data were prepared, where 300 data points for each class were
generated from different Gaussian distributions. The data distribution and
examples of graphs are illustrated in Fig. 4.
2. Real-world data
Well-known open datasets2 were used as real-world data: Iris data (150
samples, 4 attributes, 3 classes), Wine data (178 samples, 13 attributes, 3
classes), and Glass Identification data (214 samples, 9 attributes, 6 classes).
Fig. 5 shows the evaluation values of the smoothed validity measures for the synthetic data using kmc-knn. Larger values are better, except for entropy. The values are averages over 100 runs with randomized initial values.
Firstly, the total evaluation value (Eval) always provides a better value than that of the random topology (Eval_rnd), where the neighborhood relation of the prototypes is destroyed. This means that the proposed measures evaluate both cluster validity and the neighborhood relation of the clusters.
2
http://archive.ics.uci.edu/ml/
(a) kmc-knn k1=10, k2=4 (b) kmc-knn k1=25, k2=4 (c) kmc-knn k1=50, k2=4 (d) kmc-knn k1=25, k2=8 (e) kmc-knn k1=25, k2=4, random topology (f) SOM 10x10
Fig. 4. Cluster prototypes (•) with topology-based neighbor relation on two-dimensional synthetic data. The data points were generated from two Gaussian distributions, N(µ1, 1) and N(µ2, 1), where µ1 = (0, 0) and µ2 = (3, 0).
Secondly, as the smoothing radius approaches zero (σ → 0), the extended measure evaluates the individual clusters without the neighborhood relation, whereas as the radius becomes larger (σ → ∞), the extended measure treats the whole data as one big cluster, as mentioned before. Therefore, the solid and the broken lines gradually become equal as the radius approaches zero or becomes much larger.
Thirdly, the neighbor component is unimodal with respect to the radius in all measures, since there exists an appropriate radius for the average class distribution. Since the smoothed measure is a composition of cluster validity and the neighborhood relation, the radius that gives the maximum Eval does not always match that of the neighbor component, for instance for wCP, wEP, and wPF in Fig. 5. Therefore, the neighbor component should be examined to find the appropriate radius. The appropriate radius also depends on the underlying measure, such as purity, F-measure, or entropy. This means that the user should use a different radius for each measure.
These three trends appear also in SOM (omitted due to page limitation).
The effect of prototype number is examined by changing k1 = 10, 25, 50 (Fig. 6).
In wPF, k1 = 25 provides the highest neighbor component (0.116 at σ = 1.4)
among the three (Fig. 6(b)). wPF can suggest an optimal prototype number in terms
of maximizing the neighbor component in the measure, which means neighbor
Fig. 5. The effect of smoothing radius (synthetic data, kmc-knn(k1 = 25, k2 = 4));
total evaluation value (Eval), Eval with random topology (Evalrnd ), neighbor com-
ponent (N C)
(a) wCP (b) wPF
Fig. 6. The effect of prototype number (synthetic data, kmc-knn(k2 = 4)). The maximum neighbor component (N C ∗ ) and total values (wCP and wPF) are listed together in the table.
(a) wCP (b) wPF (evaluation on the iris, wine, glass, and synthetic datasets)
5 Conclusion
This paper proposed novel and generic smoothed cluster validity measures based on the neighborhood relation of clusters with external criteria. The experiments revealed the existence of an optimal neighborhood radius which maximizes the neighbor component. A user should use an optimal radius depending on the measure and the dataset. Our measure can determine the optimal radius independently of class overlap, and can evaluate the volume of class overlap. In addition, feature selection, metric learning [13,15], and a correlation index for multilabels to determine the most relevant class are promising future directions for this work.
References
1. Amigó, E., Gonzalo, J., Artiles, J., Verdejo, F.: A comparison of extrinsic clustering
evaluation metrics based on formal constraints. Information Retrieval 699(12), 461–
486 (2009)
2. Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Transactions on
Pattern Analysis and Machine Intelligence (TPAMI) 1(4), 224–227 (1979)
3. Deborah, L.J., Baskaran, R., Kannan, A.: A survey on internal validity measure
for cluster validation. International Journal of Computer Science & Engineering
Survey (IJCSES) 1(2), 85–102 (2010)
4. Dunn, J.C.: Well separated clusters and optimal fuzzy partitions. Journal of Cy-
bernetics 4, 95–104 (1974)
5. Halkidi, M., Vazirgiannis, M.: Clustering validity assessment using multi represen-
tatives. In: Proc. 2nd Hellenic Conference on Artificial Intelligence, pp. 237–248
(2002)
6. Kohonen, T.: Self-Organizing Maps. Springer (1995)
7. Kovács, F., Legány, C., Babos, A.: Cluster validity measurement techniques. En-
gineering 2006, 388–393 (2006)
8. Kremer, H., Kranen, P., Jansen, T., Seidl, T., Bifet, A., Holmes, G., Pfahringer, B.:
An effective evaluation measure for clustering on evolving data streams. In: Proc.
the 17th SIGKDD Conference on Knowledge Discovery and Data Mining (KDD
2011), pp. 868–876 (2011)
9. Liu, Y., Li, Z., Xiong, H., Gao, X., Wu, J.: Understanding of internal clustering val-
idation measures. In: Proc. IEEE International Conference on Data Mining (ICDM
2010), pp. 911–916 (2010)
10. Rendón, E., Abundez, I., Arizmendi, A., Quiroz, E.M.: Internal versus external
cluster validation indexes. International Journal of Computers and Communica-
tions 5(1), 27–34 (2011)
11. Tasdemir, K., Merényi, E.: A new cluster validity index for prototype based clus-
tering algorithms based on inter- and intra-cluster density. In: Proc. International
Joint Conference on Neural Networks (IJCNN 2007), pp. 2205–2211 (2007)
12. Veenhuis, C., Koppen, M.: Data Swarm Clustering, ch. 10, pp. 221–241. Springer
(2006)
13. Weinberger, K.Q., Blitzer, J., Saul, L.K.: Distance metric learning for large margin
nearest neighbor classification. Journal of Machine Learning Research (JMLR) 10,
207–244 (2009)
14. Xu, R., Wunsch, D.: Cluster Validity. Computational Intelligence, ch. 10, pp. 263–
278. IEEE Press (2008)
15. Zha, Z.J., Mei, T., Wang, M., Wang, Z., Hua, X.S.: Robust distance metric learning
with auxiliary knowledge. In: Proc. International Joint Conference on Artificial
Intelligence (IJCAI 2009), pp. 1327–1332 (2009)
Sequential Entity Group Topic Model for Getting Topic
Flows of Entity Groups within One Document
1 Introduction
Analyzing documents on the Web is difficult due to the fast growing number of
documents. Most documents are not annotated, leading us to prefer unsupervised methods for analyzing documents, and topic mining is one such method. This method
is basically a probabilistic way to capture latent semantics, or topics, among
documents. Since techniques like Probabilistic Latent Semantic Indexing (PLSI) [1]
and Latent Dirichlet Allocation (LDA) [2] were first introduced, many studies have
been derived from them: for example, to get relationships among entities in corpora
[3, 4], to discover topic flows of documents in time dimension [5], or topic flows of
segments in one document [6, 7], and so on. Capturing topic flows in one document
(e.g., a work of fiction or a history) has special characteristics. For instance, adjacent segments
in one document would influence each other because the full set of segments (i.e., the
document) as a whole has some story. Moreover, the readers probably want to see the
story in a perspective of each entity or each relationship. Although existing topic
models tried to get topics of entity groups, no model has been proposed to obtain the
topic flow of each entity or each relationship in one document. The topic flow in one
document should also be useful for the readers to grasp the story easily.
In this paper, we propose two topic models, Entity Group Topic Model (EGTM)
and Sequential Entity Group Topic Model (S-EGTM), claiming two contributions.
First, topic distribution of each entity and of each entity group can be analyzed.
Second, the topic flow of each entity and each relationship through segments in one
document can be captured. To realize our proposal, we adopt collapsed Gibbs
sampling methods [8] to infer the parameters of the models.
The rest of the paper is organized as follows. In the following subsection, we
preview the terminology to set out the basic concepts. Section 2 discusses related
works. Section 3 describes our approach and algorithms in detail. Section 4 presents
experiments and results. Finally, Section 5 concludes.
1.1 Terminology
In this subsection, we summarize the terminology used in this paper to clarify the
basic concepts.
Entity: Something about which the user wants to get information. It can be a name, an object, or even a concept such as love or pain.
Empty group (empty set): A group having no entity.
Entity group: A group having one or more entities.
Entity group size: The number of entities in the entity group.
Entity pair: A pair of two entities.
Topic (word topic): A multinomial word distribution.
Entity topic: A multinomial entity distribution of CorrLDA2.
Segment: A part of a document. It can be a paragraph, or even a sentence.
Topic flow: A sequence of topic distribution through segments of a document.
Relationship of entities: A topic distribution of the entity group.
2 Related Work
In this section, we describe related studies with respect to entity topic mining and
sequential topic mining.
The goal of entity topic mining is to capture the topic of each entity, or of each
relationship of entities. Author Topic Model (ATM) [9] is a model for getting a topic
distribution of each author. Although the model does not consider entities, it can be
used for getting topics of entities by just considering an entity as an author. However,
it does not involve a process of writing entities in the document. There are several
studies about a model involving the process. The recent proposed model, named as
Nubbi [4], tried to capture two kinds of topics, which are the word distributions of
each entity and of each entity pair. However, since it takes two kinds of topics
separately, the topics of entities will be different from those of entity pairs. Several topic models for analyzing entities were introduced in [3]. In particular, CorrLDA2 showed the best prediction performance. The model captures not only topics but also entity topics. An entity topic is basically a list of entities, so each entity topic plays the role of an entity group. This implies that it lacks the capability of capturing the relationship of a specific entity group.
As for sequential topic mining, there are works which tried to get topic flows in
different dimensions. Dynamic Topic Model (DTM) [5] aimed to capture topic flows of
documents in the time dimension. A probabilistic way to capture topic patterns on weblogs, in both the space dimension and the time dimension, was introduced in [10]. Multi-grain LDA (MG-LDA) [11] used the topic distribution of each window in a document to get the ratable aspects. Although it utilizes sequential topic distributions to deal with multi-grained topics, the objective of the model is not to obtain a topic flow of the document. STM and Sequential LDA tried to get a topic flow within a document. Both studies are based on a nested extension of the two-parameter Poisson-Dirichlet Process (PDP). The STM assumes that each segment is influenced by the document, while the Sequential LDA assumes that each segment is influenced by its previous segment, except for the first segment.
(a) (b)
Fig. 1. (a) Graphical model of EGTM. The colored circles represent the variables observable
from the documents. (b) Graphical model of S-EGTM.
The notations are described in Table 1, with a minor exceptional use of notation: $C^{TW}_{kw}$ and $C^{DET}_{dek}$ in this expression exclude the ith word. Three parameters are obtained as follows:
$\Phi_{kw} = \frac{\beta_w + C^{TW}_{kw}}{\sum_v (\beta_v + C^{TW}_{kv})}, \qquad (2)$
$\theta_{dek} = \frac{\alpha_k + C^{DET}_{dek}}{\sum_z (\alpha_z + C^{DET}_{dez})}, \qquad (3)$
$\pi_{de} = \frac{\eta_e + C^{DE}_{de}}{\sum_e (C^{DE}_{de} + \eta_e)}. \qquad (4)$
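The three estimates can be read directly off the Gibbs count matrices; the sketch below assumes symmetric hyperparameters and illustrative tensor shapes (topics × words, docs × entity-groups × topics, docs × entity-groups), which are assumptions made for the example rather than taken from the paper.

```python
import numpy as np

def estimate_parameters(C_TW, C_DET, C_DE, alpha, beta, eta):
    # point estimates following eqs. (2)-(4)
    Phi = (beta + C_TW) / (beta + C_TW).sum(axis=1, keepdims=True)        # eq. (2)
    Theta = (alpha + C_DET) / (alpha + C_DET).sum(axis=2, keepdims=True)  # eq. (3)
    Pi = (eta + C_DE) / (eta + C_DE).sum(axis=1, keepdims=True)           # eq. (4)
    return Phi, Theta, Pi

# toy counts: 2 topics, 3 words, 1 document, 2 entity groups
C_TW = np.array([[5.0, 1.0, 0.0], [0.0, 2.0, 4.0]])
C_DET = np.array([[[3.0, 1.0], [0.0, 4.0]]])
C_DE = np.array([[4.0, 4.0]])
Phi, Theta, Pi = estimate_parameters(C_TW, C_DET, C_DE, alpha=0.1, beta=0.01, eta=1.0)
```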
Table 1. Meaning of the notations. The upper part contains variables for graphical models. The
bottom part contains variables for representing the conditional probabilities.
and the segments are restaurants. The table count t is the number of tables occupied
by customers. The customers sitting around a table share a dish. Especially, in nested
PDP, the number of tables of the next restaurant is a customer of the current restaurant.
When we do collapsed Gibbs sampling for topics, removing the ith topic z_dgi = k affects the table counts and the topic distributions of entity group e in segment g. Therefore, we need to consider three cases of conditional probabilities in terms of u_dek, as follows.
First, when u_dek = 1,
$P(z_{dgn} = k \mid \mathbf{z}', \mathbf{w}, \mathbf{x}, \mathbf{t}) \;\propto\; \frac{S^{\,n'_{dgek}+1+t_{d(g+1)ek}}_{\,t'_{dgek},\,a}}{S^{\,n'_{dgek}+t_{d(g+1)ek}}_{\,t'_{dgek},\,a}} \cdot \frac{\beta_w + C^{TW}_{kw}}{\sum_v (\beta_v + C^{TW}_{kv})}. \qquad (7)$
The notations are described in Table 1, with a minor exceptional use of notation: $C^{TW}_{kw}$ in this expression excludes the ith word. At each step, we also sample a table count, because a table count is affected by the number of words having the table's topic and
vice versa. If we assume that we remove a table count t_dgek, then a new table count is sampled accordingly, and
$\theta_{d0ek} = \frac{\alpha_k + t_{d1ek}}{\sum_z (\alpha_z + t_{d1ez})}. \qquad (9)$
4 Experiments
We used two data sets: the Bible and the fiction ‘Alice’. We removed stop-words and
did stemming with the Porter stemmer. The sentences were recognized by ‘.’, ‘?’, ‘!’, and “newline”. After deleting stop-words, the Bible has 295,884 words and the fiction
‘Alice’ has 11,605 words. As S-EGTM gets a topic flow in a document, it regards the
Bible as a document consisting of 66 segments, and the fiction ‘Alice’ as a document
consisting of 12 segments. In contrast, to compare EGTM with other models, we
divided each document into separated files as segments. For every experiment, we set
α=0.1, β=0.01, η=1, a=0.5, b=10, and the window size was 1.
Fig. 4. The number of unique entity groups. The horizontal axis is the number of entities and
the vertical axis represents the number of unique entity groups.
Table 2. Topics obtained from LDA. The topic names are manually labeled. The listed chapters have a big proportion of the corresponding topic.
– Gospel (Romans~Jude). Top words: Christ, faith, love, sin, law, spirit, gospel, grace, church, truth, hope, power, dead
– Journey of Jesus & disciples (Matthew~John). Top words: disciple, father, son, crowd, reply, ask, heaven, truth, answer, kingdom, Pharisee, teacher, law
– Mission work (Acts). Top words: Jew, Jerusalem, spirit, holy, sail, Antioch, prison, apostle, gentile, ship, Asia, travel, province
– Kingdom of Israel (Kings). Top words: king, Israel, Judah, son, temple, reign, Jerusalem, father, priest, prophet, Samaria, altar, servant
– Field life & Sanctuary (Exodus). Top words: Egypt, gold, curtain, Israelite, cubit, blue, altar, mountain, ring, acacia, pole, ephod, tent
of entity topics. We did 10-fold cross validation for the comparison, and got the
prediction results using the process in Figure 5.
Table 3. Topics obtained from EGTM. The topic names are manually labeled.
– Gospel (Romans~Jude). Top words: Christ, faith, love, law, sin, grace, gospel, world, spirit, hope, church, life, boast
– Journey of Jesus & disciples (Matthew~John). Top words: disciple, father, son, crowd, reply, truth, ask, Pharisee, kingdom, teacher, world, heaven, answer
– Mission work (Acts). Top words: Jew, Jerusalem, holy, spirit, sail, ship, gentile, speak, disciple, believe, Christ, Antioch, prison
– Kingdom of Israel (Kings). Top words: king, Israel, Judah, temple, Jerusalem, son, reign, Samaria, prophet, father, priest, altar, servant
– Field life & Sanctuary (Exodus). Top words: land, Egypt, curtain, Israelite, cubit, gold, mountain, altar, ring, frame, blue, tent, pole
– Entities per topic. Gospel: God, Jesus, Paul, John; Journey of Jesus & disciples: God, Jesus, Mary, Judas, David, Abraham, Joshua, Moses; Mission work: God, Jesus, Paul, Judas, John, David, Abraham, Moses; Kingdom of Israel: God, David, Abraham, Moses, Solomon; Field life & Sanctuary: God, Abraham, Joshua, Aaron, Moses
– Relationships per topic. Gospel: {God,Jesus}, {God,Paul}, {God,John}, {Jesus,Paul}; Journey of Jesus & disciples: {God,Jesus}, {Abraham,Jesus}, {Jesus,David}, {Abraham,Joshua,David}; Mission work: {God,Jesus}, {Paul,Jesus}, {Paul,Judas}, {John,Jesus}, {David,Judas}, {Paul,John}, {God,Abraham}; Kingdom of Israel: {God,David}, {Solomon,David}, {God,Solomon}, {David,God,Solomon}; Field life & Sanctuary: {God,Abraham}, {God,Moses,Abraham}, {Aaron,God}, {Moses,Joshua}, {Moses,Abraham}
"-
51= 5#6 =@ 0 $
-
@- 9
$
$
/$
$$
$
"B
$
$
//0
8
$$4
51= 4 % ( P ) % ( ! P ) % ( P )
!
=@ 4 % ( P ) % ( ! P ) %
( P ) $ > ?
!
5#6 4 % ( P ) %
( !
P ) %
( P )
!
0
! $ 0 $
$
$
$
$
1- 5$
//1 %$:& $
1
$/ 0 %$!:&-
-
1 $
$ 8
1
/
$-
Fig. 5. The process of the entity prediction
The test data consists of sentences which have at least one entity. If a sentence has
multiple entities, then choosing one of them is regarded as a correct choice. As
depicted in Figure 6(a), CorrLDA2 shows fixed performance because the resampling makes P(t|d) fixed. EGTM outperforms the other models because
the topics of Entity-LDA have nothing to do with entities and CorrLDA2 does not get
the topic distribution of each entity. The performance of EGTM grows as the number
of topics grows because the Bible covers various topics. EGTM shows better
performances than CorrLDA2 because of two reasons. First, CorrLDA2 does not
directly get the topic distribution of each entity and it disperses the topic distribution
of each entity into multiple entity topics. Second, CorrLDA2 takes data exclusively.
To be specific, the data already used for entity topics will not be used for word topics.
Fig. 6. (a) The entity prediction performances of three models. The horizontal axis is the
number of topics. The vertical axis means a prediction rate. (b) The entity pair prediction
performances. (c) The entity group prediction performances.
We compared the entity pair prediction performance between EGTM and CorrLDA2.
For fair comparison, we used entity-entity affinity of [3]. The entity-entity affinity,
defined as P(ei|ej)/2+P(ej|ei)/2, is to rank true pairs and false pairs. The true pairs
exist in only unseen document, while the false pairs do not exist. The prediction
performance is the number of true pairs in half of high ranked pairs, divided by the
number of total pairs. We prepared 50 true pairs and 50 false pairs. The models have
different methods to compute P(ei|ej): Entity-LDA simply counts the number of each topic, while CorrLDA2 uses the entity topic distributions, where et denotes each entity topic. Figure 6(b) shows the prediction performance. Because most entities of the Bible Old Testament do not appear in the Bible New Testament, the overall prediction performance is low. EGTM outperforms CorrLDA2 and Entity-LDA, because
EGTM directly takes a topic distribution of each entity.
prediction performance with different entity group sizes. The predictive distribution is computed for each entity group eg and each training document d. Figure 6(c) shows the prediction performance. The accuracy is the number of correct predictions divided by the number of total predictions. The prediction performance for smaller entity groups is better than that for larger entity groups, because it is harder to predict more entities.
We compare the topic flow of S-EGTM with the topic distributions of EGTM. To
show the topic consistency between the two models, we trained S-EGTM boosted
from the trained EGTM with 2,000 iterations. The Bible new testament and the fiction
‘Alice’ are used as data. We analyze the entity Alice with 10 topics, and analyze a
relationship {Jesus, God} with 20 topics. Figure 7 and Figure 8 show the topic flows
of the entity Alice and the relationship {Jesus, God}, respectively.
Fig. 7. (a) The confusion matrix by Hellinger distance, with the fiction ‘Alice’ as a data, where
S-EGTM topics run along the Y-axis. (b) Topic flow of entity Alice by EGTM. (c) Topic flow
of entity Alice by S-EGTM.
Fig. 8. (a) The confusion matrix by Hellinger distance, with the Bible new testament as a data,
where S-EGTM topics run along the Y-axis. (b) Topic flow of relationship {Jesus, Paul} by
EGTM. (c) Topic flow of relationship {Jesus, Paul} by S-EGTM.
Figure 7(a) and Figure 8(a) show the confusion matrices of the topic distributions
generated by EGTM and S-EGTM. The diagonal cells are darker than others,
meaning that the corresponding topics have low Hellinger distance. Thus, the topics
of the two models are consistent. In the panels other than Figure 7(a) and Figure 8(a), the horizontal axis represents each segment, while the vertical axis represents the topic proportion. Clearly, in Figure 7(b), each topic appears in totally different segments, which gives no idea of a topic flow through the segments. In contrast, in Figure
7(c), we can see the pattern that the topic 8(pink color) flows through every segment.
As the topic 8 is about Alice’s tracking the rabbit, its flow through every segment is
coherent with the story. Consider the case of the relationship {Jesus, God} in more
detail. In Figure 8(b), the topic Gospel (topic 14) is dominant in four separated parts, meaning that the relationship {Jesus, God} is associated with the topic Gospel only in those four separated parts. This is because the relationship has a sparse topic distribution, since it reflects only the sentences containing the relationship. The separated appearance of the topic is not coherent with the Bible, because a purpose of the Bible New Testament is associated with the topic Gospel, which is strongly about the news of the relationship {Jesus, God}. In contrast, in Figure 8(c), the topic Gospel appears as a flow from Acts to Revelation. This means the relationship {Jesus, God} is associated with the topic Gospel, without interruption, through the segments, which is more coherent with the Bible. Thus, S-EGTM helps us to grasp the topic flow of an entity or a relationship by smoothing the sparse topic distribution of EGTM.
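For reference, the Hellinger distance used to build the confusion matrices in Fig. 7(a) and Fig. 8(a) can be computed as below, assuming each topic is represented as a probability vector; the toy vectors are invented for the example.

```python
import numpy as np

def hellinger(p, q):
    # Hellinger distance between two probability distributions (0 = identical)
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# one row of a confusion matrix: an EGTM topic against each S-EGTM topic
egtm_topic = [0.7, 0.2, 0.1]
segtm_topics = [[0.65, 0.25, 0.10], [0.1, 0.1, 0.8]]
row = [hellinger(egtm_topic, t) for t in segtm_topics]
print(row)  # the matching topic gives the smaller (darker) value
```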
5 Conclusion
In this paper, we proposed two new generative models, Entity Group Topic Model
(EGTM) and the Sequential Entity Group Topic Model (S-EGTM). S-EGTM reflects
the sequential structure of a document in the hierarchical modeling. We developed
collapsed gibbs sampling algorithms for the models. EGTM employs a power-set
structure to get topics of entities or entity groups. S-EGTM is a sequential version of
the EGTM, and employs nested two-parameter Poisson-Dirichlet process (PDP) to
capture a topic flow over the sequence of segments in one document. We have
analyzed the topics obtained from EGTM, and showed that topic flows generated by
S-EGTM are coherent with the original document. Moreover, the experimental results
show that the prediction performance of EGTM is better than that of CorrLDA2.
Thus, we believe that the intended mechanisms of the EGTM and S-EGTM models
work.
References
1. Hofmann, T.: Probabilistic Latent Semantic Indexing. In: SIGIR, pp. 50–57 (1999)
2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. In: NIPS, pp. 601–608
(2001)
3. Newman, D., Chemudugunta, C., Smyth, P.: Statistical entity-topic models. In: KDD, pp.
680–686 (2006)
4. Chang, J., Boyd-Graber, J.L., Blei, D.M.: Connections between the lines: augmenting
social networks with text. In: KDD, pp. 169–178 (2009)
5. Blei, D.M., Lafferty, J.D.: Dynamic topic models. In: ICML, pp. 113–120 (2006)
6. Du, L., Buntine, W.L., Jin, H.: A segmented topic model based on the two-parameter
Poisson-Dirichlet process. Machine Learning, 5–19 (2010)
7. Du, L., Buntine, W.L., Jin, H.: Sequential Latent Dirichlet Allocation: Discover
Underlying Topic Structures within a Document. In: ICDM, pp. 148–157 (2010)
8. Griffiths, T.L., Steyvers, M.: Finding Scientific Topics. National Academy of Sciences,
5228–5235 (2004)
9. Rosen-Zvi, M., Griffiths, T.L., Steyvers, M., Smyth, P.: The Author-Topic Model for
Authors and Documents. In: UAI, pp. 487–494 (2004)
10. Mei, Q., Liu, C., Su, H., Zhai, C.: A probabilistic approach to spatiotemporal theme pattern
mining on weblogs. In: WWW, pp. 533–542 (2006)
11. Titov, I., McDonald, R.T.: Modeling Online Reviews with Multi-grain Topic Models.
CoRR (2008)
Topological Comparisons of Proximity Measures
1 Introduction
In order to understand and act on situations that are represented by a set of objects,
very often we are required to compare them. Humans perform this comparison subcon-
sciously using the brain. In the context of artificial intelligence, however, we should be
able to describe how the machine might perform this comparison. In this context, one
of the basic elements that must be specified is the proximity measure between objects.
Certainly, application context, prior knowledge, data type and many other factors
can help in identifying of the appropriate measure. For instance, if the objects to be
compared are described by boolean vectors, we can restrict our comparisons to a class
of measures specifically devoted to this data type. However, the number of candidate
measures might still remain quite large. Can we consider that all those remaining are
equivalent and just pick one of them at random? Or are there some that are equivalent
and, if so, to what extent? This information might interest a user when seeking a specific
measure. For instance, in information retrieval, choosing a given proximity measure is
an important issue. We effectively know that the result of a query depends on the mea-
sure used. For this reason, users may wonder which one is more useful. Very often, users
– Axiomatically, as in the works of [1], [2] and [7], where two measures are consid-
ered equivalent if they possess the same mathematical properties.
– Analytically, as in the works of [2], [3] and [7], where two measures are considered
equivalent if one can be expressed as a function of the other.
– Empirically, as in [20], where two proximity measures are considered similar if, for a given set of objects, the proximity matrices induced over the objects are somewhat similar. This can be achieved by means of statistical tests such as the Mantel test [13]. We can also deal with this issue using an approach based on preordonnance [7][8][18], in which the common idea is based on the principle that two proximity measures are closer if the preorder induced on pairs of objects does not change. We will provide details of this approach later on.
Nevertheless, these approaches can be unified depending on the extent to which they
allow the categorization of proximity measures. Thus, the user can identify measures
that are equivalent from those that are less so [3][8].
In this paper, we present a new approach for assessing the similarity between prox-
imity measures. Our approach is based on proximity matrices and hence belongs to
empirical methods. We introduce this approach by using a neighborhood structure of
objects. This neighborhood structure is what we refer to as the topology induced by the
proximity measures. For two proximity measures ui and u j , if the topological graphs
produced by both of them are identical, then this means that they have the same neigh-
borhood graph and consequently, the proximity measures ui and u j are in topological
equivalence. In this paper, we will refer to the degree of equivalence between proxim-
ity measures. In this way, we can calculate a value of topological equivalence between
pairs of proximity measures which would be equal to 1 for perfect equivalence and 0
for total mismatch. According to these values of similarity, we can visualize how close
the proximity measures are to each other. This visualization can be achieved by any
clustering algorithm. We will introduce this new approach more formally and show the
principal links identified between our approach and that based on preordonnance. So
far, we have not found any publication that deals with the problem in the same way as
we do here.
The present paper is organized as follows. In Section 2, we describe more pre-
cisely the theoretical framework and we recall the basic definitions for the approach
based on induced preordonnance. In Section 3, we introduce our approach, topological
equivalence. In section 4, we provide some results of the comparison between the two
approaches, and highlight possible links between them. Further work and new lines
of inquiry provided by our approach are detailed in Section 5, the conclusion. We also
make some remarks on how this work could be extended to all kinds of proximity
measures, regardless of the representation space: binary [2][7][8][26], fuzzy [3][28] or
symbolic, [11][12].
$u : \mathbb{R}^p \times \mathbb{R}^p \longrightarrow \mathbb{R}, \qquad (x, y) \longmapsto u(x, y)$
Two proximity measures, ui and u j generally lead to different proximity matrices. Can
we say that these two proximity measures are different just because the resulting matri-
ces have different numerical values? To answer this question, many authors,[7][8][18],
have proposed approaches based on preordonnance defined as follows:
This definition has since been reproduced in many papers, such as [2], [3], [8] and [28]. This definition leads to an interesting theorem, which is demonstrated in [2].
In order to compare proximity measures ui and u j , we need to define an index that could
be used as a similarity value between them. We denote this by S(ui , u j ). For example,
we can use the following similarity index which is based on preordonnance.
$S(u_i, u_j) = \frac{1}{n^4} \sum_x \sum_y \sum_z \sum_t \delta_{ij}(x, y, z, t)$
where $\delta_{ij}(x, y, z, t) = \begin{cases} 1 & \text{if } [u_i(x, y) - u_i(z, t)] \times [u_j(x, y) - u_j(z, t)] > 0, \\ & \text{or } u_i(x, y) = u_i(z, t) \text{ and } u_j(x, y) = u_j(z, t) \\ 0 & \text{otherwise} \end{cases}$
S varies in the range [0, 1]. Hence, for two proximity measures ui and u j , a value of 1
means that the preorder induced by the two proximity measures is the same and there-
fore the two proximity matrices of ui and u j are equivalent.
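A direct (O(n^4)) sketch of this preordonnance index, fed with two toy proximity matrices; since squaring a nonnegative distance is a strictly monotonic transform, the two matrices in the example induce the same preorder and S evaluates to 1. The data and names are invented for the example.

```python
import numpy as np

def preorder_similarity(Ui, Uj):
    # fraction of quadruples (x, y, z, t) on which the two proximity
    # matrices order the pairs (x, y) and (z, t) the same way
    n = Ui.shape[0]
    agree = 0
    for x in range(n):
        for y in range(n):
            for z in range(n):
                for t in range(n):
                    di, dj = Ui[x, y] - Ui[z, t], Uj[x, y] - Uj[z, t]
                    if di * dj > 0 or (di == 0 and dj == 0):
                        agree += 1
    return agree / n**4

X = np.random.rand(6, 3)
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
print(preorder_similarity(D, D**2))   # monotone transform preserves the preorder -> 1.0
```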
The workflow in Fig 1 summarizes the process that leads to the similarity matrix
between proximity measures.
The comparison between indices of proximity measures has also been studied by
[19], [20] from a statistical perspective. The authors proposed an approach that com-
pares similarity matrices, obtained by each proximity measure, using Mantel’s test [13],
in a pairwise manner.
3 Topological Equivalence
This approach is based on the concept of a topological graph which uses a neighborhood
graph. The basic idea is quite simple: we can associate a neighborhood graph to each
proximity measure (this is our topological graph), from which we can say that two
proximity measures are equivalent if the topological graphs induced are the same. To
evaluate the similarity between proximity measures, we compare neighborhood graphs
and quantify to what extent they are equivalent.
Vu   ...   x   y   z   t   u   ...
x    ...   1   1   0   0   0   ...
y    ...   1   1   1   1   0   ...
z    ...   0   1   1   0   1   ...
t    ...   0   1   0   1   0   ...
u    ...   0   0   1   0   1   ...
In order to use the topological approach, the property of the relationship must lead to a connected graph. Of the various possibilities for defining the binary relationship, we can use the properties of a Gabriel Graph or any other construction that leads to a connected graph, such as the Minimal Spanning Tree (MST). For our work, we use only the Relative Neighborhood Graph (RNG), because of the relationship that exists between those graphs [16].
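As an illustration, the sketch below builds an RNG adjacency matrix from a proximity matrix using the standard lune condition (x and y are neighbours iff no third point z satisfies max(u(x,z), u(y,z)) < u(x,y)), which is the same property used in the proof later in this section; keeping 1s on the diagonal follows the example matrix above. The function and variable names are illustrative.

```python
import numpy as np

def rng_adjacency(D):
    # Relative Neighborhood Graph from a proximity matrix D
    n = D.shape[0]
    V = np.eye(n, dtype=int)          # 1s on the diagonal, as in the example matrix
    for x in range(n):
        for y in range(x + 1, n):
            lune_empty = all(max(D[x, z], D[y, z]) >= D[x, y]
                             for z in range(n) if z not in (x, y))
            if lune_empty:            # no point falls inside the lune of (x, y)
                V[x, y] = V[y, x] = 1
    return V

X = np.random.rand(8, 2)
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
print(rng_adjacency(D))
```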
From the previous material, using topological graphs (represented by an adjacency ma-
trix), we can evaluate the similarity between two proximity measures via the similarity
between the topological graphs each one produces. To do so, we just need the adjacency
matrix associated with each graph. The workflow is represented in Figure 3.
Note that Vui and Vu j are the two adjacency matrices associated with both proximity
measures. To measure the degree of similarity between the two proximity measures, we
just count the number of discordances between the two adjacency matrices. The value
is computed as:
S(V_{u_i}, V_{u_j}) = \frac{1}{n^2} \sum_{x \in \Omega} \sum_{y \in \Omega} \delta_{ij}(x, y) \quad \text{where} \quad \delta_{ij}(x, y) = \begin{cases} 1 & \text{if } V_{u_i}(x, y) = V_{u_j}(x, y) \\ 0 & \text{otherwise} \end{cases}
S is the measure of similarity, which varies in the range [0, 1]. A value of 1 means that
the two adjacency matrices are identical and therefore the topological structure induced
by the two proximity measures is the same, meaning that the proximity measures considered
are equivalent. A value of 0 means that there is full discordance between the
two matrices (V_{u_i}(x, y) \neq V_{u_j}(x, y) for all \omega = (x, y) \in \Omega^2). S is thus the extent of agreement between
the adjacency matrices. The similarity values between the 13 proximity measures in the
topological framework for iris are given in Table 3.
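To make the workflow concrete, the sketch below builds an RNG adjacency matrix from a proximity matrix (using the standard relative-neighbourhood rule [23], with 1s kept on the diagonal as in the matrix shown above) and then computes S(V_{u_i}, V_{u_j}); the helper names are ours.

```python
import numpy as np

def rng_adjacency(D):
    """Relative Neighborhood Graph adjacency from a proximity matrix D (sketch).

    x and y are neighbours iff no third point z satisfies
    max(D[x, z], D[y, z]) < D[x, y]."""
    n = D.shape[0]
    V = np.eye(n, dtype=int)               # keep 1s on the diagonal
    for x in range(n):
        for y in range(x + 1, n):
            if all(max(D[x, z], D[y, z]) >= D[x, y]
                   for z in range(n) if z not in (x, y)):
                V[x, y] = V[y, x] = 1
    return V

def topological_similarity(V_i, V_j):
    """S(V_ui, V_uj): proportion of identical entries in the two adjacency matrices."""
    n = V_i.shape[0]
    return float(np.sum(V_i == V_j)) / (n * n)
```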
We have found some theoretical results that establish a relationship between the topological
and preordonnance approaches. For example, from Theorem 1 of preordonnance
equivalence we can deduce the following property, which states that when f is strictly
monotonic, the preorder is preserved if and only if the topology is preserved. This property
can be formulated as follows:
Proof. Assume that max(u_i(x, z), u_i(y, z)) = u_i(x, z) and that u_i(x, y) \le max(u_i(x, z), u_i(y, z)) = u_i(x, z).
By Theorem 1, u_i(x, y) \le u_i(x, z) \Rightarrow f(u_i(x, y)) \le f(u_i(x, z)); likewise,
u_i(y, z) \le u_i(x, z) \Rightarrow f(u_i(y, z)) \le f(u_i(x, z)), so f(u_i(x, z)) = max(f(u_i(x, z)), f(u_i(y, z))).
Hence f(u_i(x, y)) \le max(f(u_i(x, z)), f(u_i(y, z))),
and the result follows: u_j(x, y) \le max(u_j(x, z), u_j(y, z)).
The reciprocal implication also holds because, if f is continuous and strictly monotonic,
its inverse f^{-1} is continuous with the same direction of variation as f.
The converse is also true, i.e. two proximity measures which are dependent on each
other induce the same topology and are therefore equivalent.
Now the user has two approaches, topological and preordonnance, to assess the
closeness between proximity measures relative to a given dataset. This assessment
might be helpful for choosing suitable proximity measures for a specific problem. Of
course, there are still many questions. For instance, does the clustering of proximity
measures remain identical when the data set changes? What is the sensitivity of the
empirical results when we vary the number of variables or samples within the same
dataset? To answer these questions we carried out a series of experiments. The core
idea of these experiments was to study whether proximity measures are clustered in the
same way regardless of the dataset used. To this end, given a dataset with N individuals
and P variables, we verified the effect of varying the sample size, N, and dimension, P,
within a given dataset and using different datasets. All datasets in our experiments were
taken from the UCI repository, [24], as shown in Table 4.
– Sensitivity to change in sample size: To examine the influence of changing the number
of individuals, we generated five samples from the waveform dataset, varying
the sample size from 1000 to 5000 for the topological approach and from 100 to 400
for the preorder approach, because of the complexity of the algorithm. The number
of variables, 40, was the same for all experiments. The results of HCA clustering
using each approach are shown in Tables 7 and 8, respectively. There was a
slight change in the clustering, but overall it remained relatively stable.
– Sensitivity to varying data sets: To examine the effect of changing the data sets, the
two approaches were tested with various datasets. The results are shown in Tables
9 and 10. In the topological approach, the regularity of the groups {chSqr, SC, JD} and {Euc, EucW,
Min} was observed regardless of the change in individuals and variables within the
same dataset or across different datasets.
5 Conclusion
In this paper, we have proposed a new approach for comparing proximity measures
with complexity O(n^2). This approach produces results that are not totally identical to
those produced by former methods. One might wonder which approach is the best. We
believe that this question is not relevant. The topological approach described here has
some connections with preordonnance, but proposes another point of view for compar-
ison. The topological approach has a lower time complexity. From theoretical analysis,
when a proximity measure is a function of another proximity measure, we have
shown that the two proximity measures are equivalent under both approaches. When this is
not the case, the experimental analysis showed that there is sensitivity to sample size,
dimensionality and the dataset used.
References
1. Batagelj, V., Bren, M.: Comparing resemblance measures. In: Proc. International Meeting on
Distance Analysis, DISTANCIA 1992 (1992)
2. Batagelj, V., Bren, M.: Comparing resemblance measures. Journal of Classification 12, 73–90
(1995)
3. Bouchon-Meunier, B., Rifqi, M., Bothorel, S.: Towards general measures of comparison of
objects. Fuzzy Sets and Systems 84(2), 143–153 (1996)
4. Clarke, K.R., Somerfield, P.J., Chapman, M.G.: On resemblance measures for ecological
studies, including taxonomic dissimilarities and a zero-adjusted Bray-Curtis coefficient for
denuded assemblages. Journal of Experimental Marine Biology & Ecology 330(1), 55–80
(2006)
5. Fagin, R., Kumar, R., Sivakumar, D.: Comparing top k lists. In: Proceedings of the Fourteenth
Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied
Mathematics (2003)
6. Kim, J.H., Lee, S.: Tail bound for the minimal spanning tree of a complete graph. Statistics
& Probability Letters 64(4), 425–430 (2003)
7. Lerman, I.C.: Indice de similarité et préordonnance associée, Ordres. In: Travaux Du
Séminaire Sur Les Ordres Totaux Finis, Aix-en-Provence (1967)
8. Lesot, M.J., Rifqi, M., Benhadda, H.: Similarity measures for binary and numerical data: a
survey. IJKESDP 1(1), 63–84 (2009)
9. Lin, D.: An information-theoretic definition of similarity. In: Proceedings of the 15th Inter-
national Conference on Machine Learning, pp. 296–304 (1998)
10. Liu, H., Song, D., Ruger, S., Hu, R., Uren, V.: Comparing Dissimilarity Measures for
Content-Based Image Retrieval. In: Li, H., Liu, T., Ma, W.-Y., Sakai, T., Wong, K.-F., Zhou,
G. (eds.) AIRS 2008. LNCS, vol. 4993, pp. 44–50. Springer, Heidelberg (2008)
11. Malerba, D., Esposito, F., Gioviale, V., Tamma, V.: Comparing dissimilarity measures for
symbolic data analysis. In: Proceedings of Exchange of Technology and Know-how and
New Techniques and Technologies for Statistics, vol. 1, pp. 473–481 (2001)
12. Malerba, D., Esposito, F., Monopoli, M.: Comparing dissimilarity measures for probabilistic
symbolic objects. In: Data Mining III. Series Management Information Systems, vol. 6, pp.
31–40 (2002)
13. Mantel, N.: A technique of disease clustering and a generalized regression approach. Cancer
Research 27, 209–220 (1967)
14. Noreault, T., McGill, M., Koll, M.B.: A performance evaluation of similarity measures, docu-
ment term weighting schemes and representations in a Boolean environment. In: Proceedings
of the 3rd Annual ACM Conference on Research and Development in Information Retrieval
(1980)
15. Park, J.C., Shin, H., Choi, B.K.: Elliptic Gabriel graph for finding neighbors in a point set and
its application to normal vector estimation. Computer-Aided Design 38(6), 619–626 (2006)
16. Preparata, F.P., Shamos, M.I.: Computational geometry: an introduction. Springer (1985)
17. Richter, M.M.: Classification and learning of similarity measures. In: Proceedings der
Jahrestagung der Gesellschaft fur Klassifikation. Studies in Classification, Data Analysis and
Knowledge Organisation. Springer (1992)
18. Rifqi, M., Detyniecki, M., Bouchon-Meunier, B.: Discrimination power of measures of re-
semblance. In: IFSA 2003. Citeseer (2003)
19. Schneider, J.W., Borlund, P.: Matrix comparison, Part 1: Motivation and important issues for
measuring the resemblance between proximity measures or ordination results. Journal of the
American Society for Information Science and Technology 58(11), 1586–1595 (2007)
20. Schneider, J.W., Borlund, P.: Matrix comparison, Part 2: Measuring the resemblance between
proximity measures or ordination results by use of the Mantel and Procrustes statistics. Jour-
nal of the American Society for Information Science and Technology 58(11), 1596–1609
(2007)
21. Spertus, E., Sahami, M., Buyukkokten, O.: Evaluating similarity measures: a large-scale
study in the orkut social network. In: Proceedings of the Eleventh ACM SIGKDD Inter-
national Conference on Knowledge Discovery in Data Mining. ACM (2005)
22. Strehl, A., Ghosh, J., Mooney, R.: Impact of similarity measures on web-page clustering. In:
Workshop on Artificial Intelligence for Web Search, pp. 58–64. AAAI (2000)
23. Toussaint, G.T.: The relative neighbourhood graph of a finite planar set. Pattern Recogni-
tion 12(4), 261–268 (1980)
24. UCI Machine Learning Repository, http://archive.ics.uci.edu/ml
25. Ward, J.H.: Hierarchical grouping to optimize an objective function. Journal of the American
Statistical Association 58(301), 236–244 (1963)
26. Warrens, M.J.: Bounds of resemblance measures for binary (presence/absence) variables.
Journal of Classification 25(2), 195–208 (2008)
27. Zhang, B., Srihari, S.N.: Properties of binary vector dissimilarity measures. In: Proc. JCIS
Int’l Conf. Computer Vision, Pattern Recognition, and Image Processing, vol. 1 (2003)
28. Zwick, R., Carlstein, E., Budescu, D.V.: Measures of similarity among fuzzy concepts: A
comparative analysis. Int. J. Approx. Reason 2(1), 221–242 (1987)
Quad-tuple PLSA: Incorporating Entity
and Its Rating in Aspect Identification
Abstract. With the explosion of opinions on the Web, there is growing research
interest in opinion mining. In this study we focus on an important
problem in opinion mining, Aspect Identification (AI), which aims to
extract aspect terms from entity reviews. Previous PLSA-based AI methods
exploit 2-tuples (e.g., the co-occurrence of head and modifier), where
each latent topic corresponds to an aspect. Here, we notice that each
review is also accompanied by an entity and its overall rating, resulting
in quad-tuples joined with the previously mentioned 2-tuples. Believing
that the quad-tuples contain more co-occurrence information and
thus provide more ability to differentiate topics, we propose a model
of Quad-tuple PLSA, which incorporates two more items, the entity and
its rating, into topic modeling for more accurate aspect identification.
Experiments on different numbers of hotel and restaurant reviews
show consistent and significant improvements of the proposed model
compared to the 2-tuple PLSA based methods.
1 Introduction
With Web 2.0 technology encouraging more and more people to participate
in online commenting, recent years have witnessed an explosion of opinions on the Web.
As large volumes of user comments accumulate, it becomes challenging for both merchants
and customers to analyze the opinions or make further decisions. As a result,
opinion mining, which aims at determining the sentiments of opinions, has become
a hot research topic.
Additionally, besides the simple overall evaluation and summary, both customers
and merchants are becoming increasingly concerned with certain aspects
of the entities. Take a set of restaurant reviews as an example. Common restaurant
aspects include "food", "service", "value" and so on. Some guests may be
interested in the "food" aspect, while others may think highly of the "value" or
"service" aspect. To meet these personalized demands, we need to decompose
the opinions into different aspects for better understanding or comparison.
On the other hand, it is also perplexing for merchants to digest all
the customer reviews when they want to know in which aspect they
lag behind their competitors. As pointed out in [12], the task of aspect-based
summarization consists of two subtasks: the first is Aspect Identification (AI),
and the second is sentiment classification and summarization. The study in this
paper mainly focuses on the first task, which aims to accurately identify the
aspect terms in the reviews for a certain type of entity.
As shown in Figure 1, there are 3 reviews on different hotels, where the descriptions
of the same aspect are shown in the same color. One recent work
in this area argues that it is more sensible to extract aspects at the phrase
level rather than the sentence level, since a single sentence may cover different
aspects of an entity (as shown in Figure 1, a sentence may contain differently colored
terms) [5]. Thus, Lu et al. decompose reviews into phrases in the form of
(head, modifier) pairs. A head term usually indicates the aspect, while a modifier
term reflects the sentiment towards the aspect. Take the phrase "excellent staff"
for example. The head "staff" belongs to the "staff/front desk" aspect, while the
modifier "excellent" shows a positive attitude towards it. Utilizing the (head, modifier)
pairs, they explore the latent topics embedded in them with aspect priors. In other
words, they take these 2-tuples as input, and output the latent topics as the
identified aspects.
In this study, we observe that besides the (head, modifier) pairs, each review
is often tied to an entity and its overall rating. As shown in Figure 1, a hotel
name and an overall rating are given for each review. Thus, we can construct
quad-tuples of (head, modifier, rating, entity),
which indicate that a phrase of the head and modifier appears in a review
of this entity with this rating. For example, the reviews in Figure 1 include the
following quad-tuples,
With these quad-tuples from the reviews for a certain type of entity, we further
argue that they contain more co-occurrence information than 2-tuples and thus provide
more ability to differentiate terms. For example, reviews with the same
rating tend to share similar modifiers. Additionally, reviews with the same rating
on the same entity often talk about the same aspects of that entity (imagine that
people may always assign the lowest rating to an entity because of its low quality in
a certain aspect). Therefore, incorporating entity and rating into the tuples may
facilitate aspect generation.
Motivated by this observation, we propose a model of Quad-tuple PLSA
(QPLSA for short), which can handle two more items (compared to the previous
2-tuple PLSA [1,5]) in topic modeling. In this way we aim to achieve
higher accuracy in aspect identification. The rest of this paper is organized as
follows: Section 2 presents the problem definition and preliminary knowledge.
Section 3 details our model Quad-tuple PLSA and the EM solution. Section 4
gives the experimental results to validate the superiority of our model. Section 5
discusses the related work and we conclude the paper in Section 6.
In this section, we first introduce the problem, and then briefly review Lu's
solution, the Structured Probabilistic Latent Semantic Analysis (SPLSA) [5].
The frequently used notations are summarized in Table 1.
Symbol Description
t the comment
T the set of comments
h the head term
m the modifier term
e the entity
r the rating of the comment
q the quad-tuple of (h,m,r,e)
z the latent topic or aspect
K the number of latent topics
Λ the parameters to be estimated
n(h, m) the number of co-occurrences of head and modifier
n(h, m, r, e) the number of co-occurrences of head, modifier, rating and entity
X the whole data set
In this section, we give the problem definition and the related concepts. The set of quad-tuples extracted from the reviews is
{(h, m, r, e) | phrase (h, m) appears with rating r in a review of entity e}.
Structured PLSA (SPLSA for short) is a 2-tuple PLSA based method for rated
aspect summarization. It incorporates the structure of phrases into the PLSA
model, using the co-occurrence information of head terms and their modifiers.
Given the whole data X composed of (head, modifier) pairs, SPLSA poses a
mixture model with latent topics z as follows,
p(h, m) = \sum_{z} p(h|z)\, p(z|m)\, p(m). \quad (1)
The parameters p(z|m), p(h|z) and p(m) can be obtained with the EM algorithm
by solving the following maximum log-likelihood problem,
\log p(X|\Lambda) = \sum_{h,m} n(h, m) \log \sum_{z} p(z|m)\, p(h|z)\, p(m), \quad (2)
where Λ denotes all the parameters. The prior knowledge of seed words
indicating a specific aspect is injected as follows:
p(h|z; \Lambda) = \frac{\sum_{m} n(h, m)\, p(z|h, m; \Lambda^{old}) + \sigma\, p(h|z_0)}{\sum_{h'} \sum_{m} n(h', m)\, p(z|h', m; \Lambda^{old}) + \sigma}, \quad (3)
where z_0 denotes the prior corresponding to the latent topic z, and σ is the
confidence parameter for the head term h belonging to aspect z_0. Each h is
grouped into the topic z with the largest probability of generating h, which is the
aspect identification function in SPLSA: A(h) = \arg\max_z p(h|z).
3.1 QPLSA
In SPLSA, aspects are extracted based on the co-occurrences of head and mod-
ifier, namely a set of 2-tuples. Next, we will detail our model, QPLSA, which
takes the quad-tuples as input for more accurate aspect identification.
[Fig. 2. Graphical models: from the 2-tuple PLSA over (z, h, m) to the Quad-tuple PLSA over (e, z, h, m, r).]
Figure 2 illustrates the graphical model of QPLSA. The directed edges among
the nodes reflect our understanding of the dependency relationships
among these variables. Specifically, we assume that given a latent topic z, h and
m are conditionally independent. Also, a reviewer may show different judgements
towards different aspects of the same entity; thus, the rating r is jointly dependent
on the entity e and the latent topic z. From the graphical model in Figure 2, we can write
the joint probability over all variables as follows:
p(h, m, r, e, z) = p(e)\, p(z|e)\, p(h|z)\, p(m|z)\, p(r|z, e). \quad (4)
Let Z denote all the latent variables, and given the whole data X, all the param-
eters can be approximated by maximizing the following log likelihood function,
\log p(X|\Lambda) = \log \sum_{Z} p(X, Z|\Lambda) = \sum_{h,m,r,e} n(h, m, r, e) \log \sum_{z} p(h, m, r, e, z|\Lambda), \quad (5)
where Λ includes the parameters p(m|z), p(h|z), p(r|z, e), p(z|e) and p(e). The
derivation of the EM algorithm is detailed in the next subsection.
We adopt the EM algorithm to maximize the log-likelihood function in Equation (5). Specifically, the lower
bound (by Jensen's inequality) L_0 of (5) is:

L_0 = \sum_{z} q(z) \log \frac{p(h, m, r, e, z|\Lambda)}{q(z)}. \quad (6)

Setting q(z) = p(z|h, m, r, e; \Lambda^{old}) gives

L_0 = \sum_{z} p(z|h, m, r, e; \Lambda^{old}) \log p(z, h, m, r, e|\Lambda) - \sum_{z} p(z|h, m, r, e; \Lambda^{old}) \log p(z|h, m, r, e; \Lambda^{old}) = L + const. \quad (7)
Similarly, we have:

p(e) = \frac{\sum_{z,h,m,r} n(h, m, r, e)\, p(z|e, h, m, r; \Lambda^{old})}{\sum_{h,m,r,e} n(h, m, r, e)}, \quad (13)

p(z|e) = \frac{\sum_{h,m,r} n(h, m, r, e)\, p(z|e, h, m, r; \Lambda^{old})}{\sum_{h,m,r,z'} n(h, m, r, e)\, p(z'|e, h, m, r; \Lambda^{old})}, \quad (14)

p(m|z) = \frac{\sum_{e,h,r} n(h, m, r, e)\, p(z|e, h, m, r; \Lambda^{old})}{\sum_{e,h,r,m'} n(h, m', r, e)\, p(z|e, h, m', r; \Lambda^{old})}, \quad (15)

p(r|z, e) = \frac{\sum_{h,m} n(h, m, r, e)\, p(z|e, h, m, r; \Lambda^{old})}{\sum_{h,m,r'} n(h, m, r', e)\, p(z|e, h, m, r'; \Lambda^{old})}. \quad (16)
where σ = 0 if we have no prior knowledge on z. Note that adding the prior can
be interpreted as increasing the counts for head term h by σ + 1 times when
estimating p(h|z). Therefore, we have:
p(h|z; \Lambda) = \frac{\sum_{m,r,e} n(h, m, r, e)\, p(z|h, m, r, e; \Lambda^{old}) + \sigma\, p(h|z_0)}{\sum_{h',m,r,e} n(h', m, r, e)\, p(z|h', m, r, e; \Lambda^{old}) + \sigma}. \quad (18)
We select the aspect that generates h with the largest probability as the
aspect label for the head term h.
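A minimal dense-array sketch of these EM updates is given below. The function name, array layout and initialization are ours, and a dense 4-way count array is only workable for toy data; real review corpora would require sparse representations.

```python
import numpy as np

def qplsa_em(counts, K, n_iter=50, sigma=0.0, prior_h=None, seed=0):
    """Illustrative EM for QPLSA (Eqs. 13-16 and 18); hypothetical helper.

    counts: array of shape (n_h, n_m, n_r, n_e) holding n(h, m, r, e).
    prior_h: optional (n_h, K) array standing in for p(h|z0) of the seed words."""
    rng = np.random.default_rng(seed)
    n_h, n_m, n_r, n_e = counts.shape

    def norm(a, ax):
        return a / a.sum(axis=ax, keepdims=True)

    p_e = norm(rng.random(n_e), 0)
    p_z_e = norm(rng.random((K, n_e)), 0)            # p(z|e)
    p_h_z = norm(rng.random((n_h, K)), 0)            # p(h|z)
    p_m_z = norm(rng.random((n_m, K)), 0)            # p(m|z)
    p_r_ze = norm(rng.random((n_r, K, n_e)), 0)      # p(r|z,e)

    for _ in range(n_iter):
        # E-step: q[z,h,m,r,e] proportional to p(e)p(z|e)p(h|z)p(m|z)p(r|z,e)
        q = (p_e[None, None, None, None, :]
             * p_z_e[:, None, None, None, :]
             * p_h_z.T[:, :, None, None, None]
             * p_m_z.T[:, None, :, None, None]
             * p_r_ze.transpose(1, 0, 2)[:, None, None, :, :])
        q /= q.sum(axis=0, keepdims=True) + 1e-12
        w = q * counts[None, ...]                    # n(h,m,r,e) * p(z|h,m,r,e)

        # M-step (Eqs. 13-16)
        p_e = norm(w.sum(axis=(0, 1, 2, 3)), 0)
        p_z_e = norm(w.sum(axis=(1, 2, 3)), 0)
        p_m_z = norm(w.sum(axis=(1, 3, 4)).T, 0)
        p_r_ze = norm(w.sum(axis=(1, 2)).transpose(1, 0, 2), 0)
        # Eq. 18: prior-smoothed update of p(h|z)
        num = w.sum(axis=(2, 3, 4)).T                # shape (n_h, K)
        if prior_h is not None:
            num = num + sigma * prior_h
        p_h_z = num / (num.sum(axis=0, keepdims=True)
                       + (sigma if prior_h is not None else 0.0))

    # aspect label for each head term: A(h) = argmax_z p(h|z)
    return p_h_z.argmax(axis=1), (p_e, p_z_e, p_h_z, p_m_z, p_r_ze)
```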
4 Experiments
In this section, we present the experimental results to evaluate our model QPLSA.
Firstly, we introduce the data sets and implementation details, and then give the
experimental results in the following subsections.
We adopt two different datasets for evaluation, which are detailed in Table 2.
The first dataset is a corpus of hotel reviews provided by Wang et al. [14]. This
data set includes 246,399 reviews on 1850 hotels, with each review associated
with an overall rating and 7 detailed ratings for the pre-defined aspects; the
rating values range from 1 star to 5 stars. Table 2 also lists the prior
knowledge of seed words indicating specific aspects.
The other dataset contains restaurant reviews from Snyder et al. [11] and
is much sparser than the previous one. This dataset contains 1609 reviews on
420 restaurants, with each review associated with an overall rating and 4 aspect
ratings. For both datasets, we decompose the reviews into phrases using
a set of NLP toolkits, such as the POS tagging and chunking functions1.
We select a set of head terms and manually label them as the knowledge base. Specifically, for the hotel
reviews we select 408 head terms and categorize them into 7 specific aspects,
while for the restaurant reviews we select 172 head terms and label them with
4 specific aspects. The details of the categorization are summarized in Table 3,
where A1 to A7 correspond to the aspects in Table 2. Here we only evaluate
the results of specific aspect identification and compare our model QPLSA with
SPLSA.
Hotel Reviews
Aspects Prior Words Aspect No.
Value value,price,quality,worth A1
Room room,suite,view,bed A2
Location location,traffic,minute,restaurant A3
Cleanliness clean,dirty,maintain,smell A4
Front Desk/Staff staff,check,help,reservation A5
Service service,food,breakfast,buffet A6
Business business,center,computer,internet A7
Restaurant Reviews
Food food,breakfast,potato,drink A1
Ambience ambience,atmosphere,room,seat A2
Service service,menu,staff,help A3
Value value,price,quality,money A4
1 http://opennlp.sourceforge.net/
[Figure: per-aspect results of QPLSA vs. SPLSA on hotel-review subsets of 300, 600, 900, 1200, 1500 and 1850 hotels; each panel plots the two methods over aspects A1–A7 and the overall A1–7.]
Hotel Reviews
Aspects Representative Terms By QPLSA Representative Terms By SPLSA
hotel location experience value price size vacation walk value price rates side york parking
Value rates choice deal job way surprise atmosphere station tv orleans quality distance standards
quality selections money holiday variety spots screen light money end charge line bus
room bed view pool bathroom suits ocean room quarters area bed view pool transportation
Room shower style space feel window facilities touch bathroom suits towels shower variety lobby
balcony chair bath amenities pillows furnished space window facilities balcony chair bath sand
places restaurants area walk resort beach city time restaurants day night resort trips beach
Location street shopping minutes bus distance quarters doors street way minutes years week hour
building tourist store tour lobby attractions cafe visit weekend block island evening morning
water decor towels fruit tub air appointed sand floor level water flight air noise music class
Cleanliness cleaning smell maintained noise music club worlds cleaning smell maintained condition wall
condition garden republic done design francisco francisco car eggs anniversary notch afternoon
staff reservation guests checking manager house staff desk people guests checking person couples
Front Desk airporter receptions desk help island eggs lady manager fun lounge children member receptions
attitude smiles lounge museum kong man concierge towers guys reservation cart trouble attitude lady
service breakfast food bar drinks buffet tv service breakfast food access bar tub shuttle
Service coffee meals wine bottle items dinner drinks buffet coffee meals fruit wine bottle
juice tea snacks dish screen car shuttle connected weather juice beer tea snacks
floor access internet side parking station shopping problem building complaints ones
Business
standards light end class line sites wall stop internet traveller points bit tourist store cafe
Service business connected center district towers level deal thing attractions issue star sites items city
Total 89 correct terms 64 correct terms
Restaurant Reviews
food potato sauce ribs wine taste drinks fries food potato sauce ribs wine sause taste drinks
Food parking fee dogs toast breakfast bun cajun gravy diversity reduction feast charcoal
pancakes croissants lasagna pies cinnamon plus brats nature tiramisu cauliflower goods
atmosphere style cheese shrimp room seated music atmosphere area style room seated feeling music
Ambience tomatoes decor game dressing tip orders onion manner piano band poster arts cello movie
mushroom garlic cocktail setting piano mousse blues appearance folk medium francisco avenue
service staff menu wait guy guests carte chili help service staff menu attitude guests gras mousse
Service attitude space downtown section become women maple behavior tone lettuce defines future excuse
employees critic poster market waitstaff office smorgasbord sports networkers supper grandmothers
priced value quality done management legs anniversary priced value quality parking rate money ravioli
Value rate money thought cafeteria informed croutons bags fee pupils flaw heron inside winter education aiken
elaine system bomb proportions recipes buy standbys drenched paying year-old-home veteran
Total 47 correct terms 42 correct terms
All 136 correct terms 108 correct terms
In total, for the 7 aspects of the hotel reviews, 105 head terms are accurately
selected by QPLSA compared to 64 by SPLSA. Also, for the 4 aspects of the restaurant
reviews, more correct words are captured by QPLSA than by SPLSA. In all,
QPLSA extracts 136 correct terms compared to 108 for SPLSA. All these results
demonstrate that incorporating the entity and its rating for aspect identification (or
extraction) is effective.
Note that both QPLSA and SPLSA obtain much better results on the hotel
reviews dataset than on the restaurant reviews. The reason is that both methods
are generative models that rely on co-occurrence information. As
we know, the hotel review dataset is much denser, and thus provides enough
co-occurrence information for learning.
5 Related Work
This section reviews some studies relevant to our research. Pang
et al. [8] give a full overview of opinion mining and sentiment analysis; after
describing the requirements and challenges, they outline a series of approaches and
applications for this research domain. They point out that sentiment classification
can be broadly framed as binary categorization, multi-class categorization,
regression or ranking problems on an opinionated document.
Hu and Liu [2] adopt association mining based techniques to find frequent
features and identify the polarity of opinions based on adjective words. However,
their method does not perform aspect clustering for a deeper understanding of
opinions. Similar work carried out by Popescu and Etzioni [10] achieved better
performance on feature extraction and sentiment polarity identification; however,
aspects are still not considered.
Kim et al. [3] developed a system for sentiment classification that combines
sentiments at the word and sentence levels; however, their system does not help
users digest opinions from the aspect perspective. More approaches for sentiment
analysis can be found in [9,13,15,7], although none of these methods attaches
importance to aspects.
Topic models [14,4,6,5] are also utilized to extract aspects from online reviews.
Lu et al. adopt the unstructured and structured PLSA for aspect identification [5];
however, their model does not consider the rating or the entity
in the aspect generation phase. Wang et al. [14] proposed a rating regression
approach for latent aspect rating analysis on reviews; still, their model does
not take the entity into account. Mei et al. [6] defined the problem of topic-sentiment
analysis on Weblogs and proposed the Topic-Sentiment Mixture (TSM) model to
capture sentiments and extract topic life cycles. However, as mentioned before,
none of these topic models extracts aspects from quad-tuples.
A work closely related to our study is Titov and McDonald's
[12] work on aspect generation. They construct a joint statistical model
of text and sentiment ratings, called the Multi-Aspect Sentiment model (MAS),
to generate topics at the sentence level. They build local and global topics
based on the Multi-Grain Latent Dirichlet Allocation model (MG-LDA) for better
aspect generation. One recent work [4] by Lakkaraju et al. also focused on
6 Conclusion
References
1. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd
International Conference on Research and Development in Information Retrieval,
SIGIR 1999 (1999)
2. Hu, M., Liu, B.: Mining and summarizing customer reviews. In: Proceedings of the
International Conference on Knowledge Discovery and Data Mining (KDD 2004),
pp. 168–177 (2004)
3. Kim, S.M., Hovy, E.: Determining the sentiment of opinions. In: Proceedings of the
20th International Conference on Computational Linguistics, p. 1367 (2004)
4. Lakkaraju, H., Bhattacharyya, C., Bhattacharya, I., Merugu, S.: Exploiting coher-
ence for the simultaneous discovery of latent facets and associated sentiments. In:
Proceedings of 2011 SIAM International Conference on Data Mining (SDM 2011),
pp. 498–509 (April 2011)
5. Lu, Y., Zhai, C., Sundaresan, N.: Rated aspect summarization of short comments.
In: Proceedings of the 18th International Conference on World Wide Web (WWW
2009), pp. 131–140 (2009)
6. Mei, Q., Ling, X., Wondra, M., Su, H., Zhai, C.: Topic sentiment mixture: Modeling
facets and opinions in weblogs. In: Proceedings of the 16th International World
Wide Web Conference (WWW 2007), pp. 171–180 (2007)
1 Introduction
Privacy leakage is one of the major concerns when publishing data for statistical
processing or data analysis. In general, organizations need to release data that may
contain sensitive information for the purposes of facilitating useful data analysis
or research. For example, patients' medical records may be released by a hospital
to aid medical studies. The records in Table 1 (called the microdata) are an example
of patients' records published by hospitals. Note that the attribute Disease contains
sensitive information about patients. Hence, data publishers must ensure that no
adversaries can accurately infer the disease of any patient. One straightforward
approach to achieve this goal is excluding unique identifier attributes, such as
Name, from the table; this, however, is not sufficient to prevent privacy
leakage under linking attacks [1, 2]. For example, the combination of Age and
1.1 Motivation
[Fig. 1. The points in Table 1. Fig. 2. Data clustering (Step 1). Fig. 3. Data clustering (Step 2).]
Then, we may wonder: can we significantly improve the utility while preserving
k-anonymity with clustering-based approaches? The answer depends on
whether it is possible to partition the microdata into clusters with less information
loss while still ensuring k-anonymity. Intuitively, data points within a cluster
are more similar to each other than they are to a point belonging to a different
cluster.
The above observation motivates us to devise a new solution to improve the
data utility of clustering-based solutions. As an example, we illustrate the details
of generalizing Table 1 by our approach. Let gen be a generalization function that
takes as input a set of tuples and returns a generalized domain. Firstly, Table
1 is divided into 2 clusters, denoted by red and blue in Figure 2, respectively.
Then, the cluster denoted by blue is further divided into 2 clusters, denoted by
the black and green colors in Figure 3. Finally, tuples with the same color are generalized
as a QI-group; that is, the tuples Andy and Bob constitute the first QI-group,
and we assign gen({Andy, Bob}) = [20 − 20], [25 − 30] to it. Similarly,
{Jane, Alex} and {Mary, Lily, Lucy} make up the second and third QI-groups.
Eventually, Table 2 shows the final result of our approach.
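As a minimal sketch (assuming purely numerical QI-attributes and reusing the name gen from the text), such a generalization function could look as follows.

```python
def gen(tuples):
    """Generalize a set of tuples: each QI-attribute is replaced by the
    [min, max] interval it spans within the group."""
    columns = list(zip(*tuples))
    return [(min(col), max(col)) for col in columns]

# e.g. gen([(20, 25), (20, 30)]) == [(20, 20), (25, 30)],
# matching gen({Andy, Bob}) = [20-20], [25-30] in the running example.
```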
In this paper, we mainly focus on the basic k-anonymity model for the
following reasons: (i) k-anonymity is a fundamental model for privacy protection,
which has received wide attention in the literature; (ii) k-anonymity has been
employed in many real applications such as location-based services [5, 6], where
there are no additional (sensitive) attributes; (iii) there is no single algorithm that
is suitable for the many privacy metrics such as l-diversity [7] and t-closeness [8],
but algorithms for k-anonymity are simple yet effective, and can be further
adapted to other privacy metrics. Apart from the k-anonymity model, we also
consider scenarios with stronger adversaries, extending our approach to l-diversity (Section 4).
The rest of the paper is organized as follows. In Section 2, we give the definitions
of basic concepts and the problem addressed in this paper. In Section
3, we present the details of our generalization algorithm. Section 4 discusses the
extension of our methodology to l-diversity. We review previously related
research in Section 5. In Section 6, we experimentally evaluate the efficiency and
effectiveness of our techniques. Finally, the paper is concluded in Section 7.
2 Fundamental Definitions
global order on all possible values in the domain. If a tuple t in T* has range
[x_i, y_i] on attribute A_i (1 \le i \le d), then the normalized certainty penalty of t on
A_i is NCP_{A_i}(t) = \frac{|y_i - x_i|}{|A_i|}, where |A_i| is the size of the domain of attribute A_i. For
a tuple t, the normalized certainty penalty of t is NCP(t) = \sum_{i=1}^{d} w_i \cdot NCP_{A_i}(t),
where w_i is the weight of attribute A_i. The normalized certainty penalty of T* is
\sum_{t \in T^*} NCP(t).
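As a small illustration of this definition (assuming numerical QI-attributes; the helper name is ours), the NCP contributed by one generalized group can be computed as follows.

```python
import numpy as np

def ncp_of_group(group, domain_sizes, weights=None):
    """Sketch: NCP contributed by one QI-group of numerical tuples.

    group: array (|G|, d) of QI values; after generalization every tuple in the
    group carries the interval [min, max] of each attribute, so NCP(t) is the
    same for all t in the group and the group's contribution is |G| * NCP(t)."""
    d = group.shape[1]
    w = np.ones(d) if weights is None else np.asarray(weights, dtype=float)
    spans = group.max(axis=0) - group.min(axis=0)          # |y_i - x_i|
    ncp_t = float(np.sum(w * spans / np.asarray(domain_sizes, dtype=float)))
    return len(group) * ncp_t

# example (hypothetical domain sizes):
# ncp_of_group(np.array([[20, 25], [20, 30]]), domain_sizes=[60, 50])
```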
Now, we are ready to give the formal definition of the problem addressed in this paper.
Information loss is an unfortunate consequence of data anonymization. We aim
to generate a utility-friendly anonymized version of a microdata table such that
privacy is guaranteed by k-anonymity and the information loss quantified by NCP
is minimized. (Limited by space, all proofs are omitted.)
Definition 2 (Problem Definition). Given a table T and an integer k,
anonymize it by clustering into T* such that T* is k-anonymous and the total
information loss, measured by NCP, is minimized.
Theorem 1 (Complexity). The problem of optimal clustering-based anonymization
is NP-hard under the metric NCP.
The partitioning algorithm terminates when the size of every group is between
k and 2k − 1. If at least one group has cardinality greater than 2k − 1, the
partitioning algorithm continues.
In the above procedure, the way we partition G into two subsets G1 and
G2 strongly influences the information loss of the resulting solution. In the first
round, we randomly choose two tuples t1, t2 as the center points C1, C2, and
insert them into G1 and G2, respectively. Then, we distribute each tuple w ∈ G:
for each tuple w, we compute Δ1 = NCP(C1 ∪ {w}) and Δ2 = NCP(C2 ∪ {w}), and
add w to the group that leads to the lower penalty (line 7). If Δ1 = Δ2, we assign
the tuple to the group with the lower cardinality. After successfully partitioning
G, we remove the tuples t1 and t2, i.e., G1 ← G1 − {t1} and G2 ← G2 − {t2}. In each later
round, the center points C_i are computed as C_i = \frac{\sum_{t \in G_i} t}{|G_i|}, i = 1, 2; that
is, for each attribute A_j (1 \le j \le d), C_i.A_j = \frac{\sum_{t \in G_i} t.A_j}{|G_i|}, i = 1, 2.
After each partition, if the current partition is better than the previous tries, we
record the partition result G1, G2 and the total sum of NCP(G1) and NCP(G2).
That is, among the r partitions we pick the one that minimizes the sum of NCP(G1) and NCP(G2)
as the final partition (line 9). Each round over G can be
accomplished in O(r · (|G| · (6 + λ))) expected time, where λ is the cost of
evaluating the information loss. The computation cost is theoretically bounded in Theorem 2.
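The sketch below illustrates one bisection round under our reading of this description; the helper names, the unit weights and the omission of the center-recomputation rounds and of the [k, 2k−1] size constraints are simplifications, not the authors' exact algorithm.

```python
import numpy as np

def group_ncp(G, domain_ranges):
    """NCP of a QI-group G (array |G| x d): every tuple is generalized to the
    group's per-attribute [min, max] range, with unit weights."""
    spans = G.max(axis=0) - G.min(axis=0)
    return len(G) * float(np.sum(spans / np.asarray(domain_ranges, dtype=float)))

def bisect_group(G, domain_ranges, r=5, seed=0):
    """One simplified bisection round: r random restarts, penalty-driven
    assignment, ties broken by cardinality; size constraints omitted."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(r):
        i, j = rng.choice(len(G), size=2, replace=False)
        G1, G2 = [G[i]], [G[j]]
        for t in rng.permutation([x for x in range(len(G)) if x not in (i, j)]):
            d1 = group_ncp(np.vstack(G1 + [G[t]]), domain_ranges)
            d2 = group_ncp(np.vstack(G2 + [G[t]]), domain_ranges)
            if d1 < d2 or (d1 == d2 and len(G1) <= len(G2)):
                G1.append(G[t])
            else:
                G2.append(G[t])
        loss = (group_ncp(np.vstack(G1), domain_ranges)
                + group_ncp(np.vstack(G2), domain_ranges))
        if best is None or loss < best[0]:
            best = (loss, np.vstack(G1), np.vstack(G2))
    return best[1], best[2]
```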
4 Extension to l-Diversity
In this section, we discuss how we can apply clustering-based anonymization
for other privacy principles. In particular, we focus on l-diversity, described in
Definition 3.
Definition 3 (l-diversity [7]). A generalized table T* is l-diverse if each QI-group
QI_i ∈ T* satisfies the following condition: let v be the most frequent sensitive (SA)
value in QI_i and c_i(v) be the number of tuples in QI_i carrying value v; then
\frac{c_i(v)}{|QI_i|} \le \frac{1}{l}.
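A direct check of this condition (helper name ours) could look as follows.

```python
from collections import Counter

def satisfies_l_diversity(qi_groups, l):
    """Sketch: qi_groups is a list of QI-groups, each given as the list of
    sensitive-attribute values of its tuples; returns True iff every group
    meets the condition c_i(v) / |QI_i| <= 1/l of Definition 3."""
    for values in qi_groups:
        c_v = Counter(values).most_common(1)[0][1]   # count of the most frequent value
        if c_v * l > len(values):                    # equivalent to c_v/|QI_i| > 1/l
            return False
    return True
```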
5 Related Work
In this section, previous related work will be surveyed. Existing generalization
algorithms can be further divided into heuristic-based and theoretical-based ap-
proaches. Generally, appropriate heuristics are general so that they can be used
SAL
Attribute     Number of distinct values   Type
Age           78                          Numerical
Gender        2                           Categorical
Education     17                          Numerical
Marital       6                           Categorical
Race          9                           Numerical
Work-class    10                          Categorical
Country       83                          Numerical
Occupation    50                          Sensitive

INCOME
Attribute     Number of distinct values   Type
Age           78                          Numerical
Occupation    711                         Numerical
Birthplace    983                         Numerical
Gender        2                           Categorical
Education     17                          Categorical
Race          9                           Categorical
Work-class    9                           Categorical
Marital       6                           Categorical
Income        [1k,10k]                    Sensitive
Parameter Values
k 250,200,150,100,50
cardinality n 100k,200k,300k,400k,500k
number of QI-attributes d 3,4,5,6
6 Empirical Evaluation
In this section, we experimentally evaluate the effectiveness and efficiency
of the proposed techniques. Specifically, we will show that our technique
(presented in Section 3) significantly improves the utility of the anonymized
data at quite a small computation cost.
Towards this purpose, two widely used real data sets, SAL and INCOME
(downloadable from http://ipums.org), with 500k and 600k tuples respectively,
are used in the following experiments. Each tuple describes the personal
information of an American. The two data sets are summarized in Table 3.
In the following experiments, we compare our clustering-based anonymization
algorithm (denoted by CB) with the existing state-of-the-art technique: non-homogeneous
generalization [13] (NH for short). (The fast algorithm [10] was
compared with NH in [13]; therefore, we omit its details.)
In order to explore the influence of dimensionality, we create two sets of micro-
data tables from SAL and INCOME. The first set has 4 tables, denoted as SAL-3,
· · · , SAL-6, respectively. Each SAL-d (3 ≤ d ≤ 6) has the first d attributes in
Table 3 as its QI-attributes and Occupation as its sensitive attribute (SA). For
example, SAL-4 is 5-dimensional and contains the QI-attributes Age, Gender,
Education, and Marital. The second set also has 4 tables, INC-3, · · · , INC-6, where
each INC-d (3 ≤ d ≤ 6) has the first d attributes as QI-attributes and Income
as the SA.
[Figure: information loss (GCP) of NH and CB as k varies from 50 to 250 (y-axis: Information Loss (GCP), x-axis: K), one panel per data set.]
In the experiments, we investigate the influence of the following parameters
on the information loss of our approach: (i) the value of k in k-anonymity; (ii) the number of
attributes d in the QI-attributes; (iii) the number of tuples n. Table 4 summarizes the
parameters of our experiments, as well as the values examined. Default values
are in bold font. Data sets with different cardinalities n are also generated by
randomly sampling n tuples from the full SAL-d or INC-d (3 ≤ d ≤ 6). All
experiments are conducted on a PC with a 1.9 GHz AMD Dual Core CPU and 1
gigabyte of memory. All the algorithms are implemented with VC++ 2008.
We measure the information loss of the generalized tables using GCP, which
was first used in [10]. Note that GCP is essentially equivalent to NCP up to
a constant factor d × N. Specifically, under the same partition
P of table T, GCP(T) = NCP(T) / (d × N) (where d is the number of QI-attributes and N the number of tuples), when all the
weights are set to 1.0.
[Figures: information loss (GCP) of NH and CB as the number of QI-attributes d varies from 3 to 6, and as the cardinality n varies from 100k to 500k (in thousands), on the two data sets.]
The results are shown in Figure 6(a)–6(h). From the results, we can clearly see that the information
loss of CB is a big improvement over NH for the tested data, except on
SAL-3. Another advantage of our model over NH is that the utility achieved by
our model is less sensitive to domain size than NH. From the figures, we can see
that the data sets generated by NH have a lower GCP on SAL-d than on INC-d
(4 ≤ d ≤ 6), due to the fact that the domain size of SAL is smaller than that of INC.
This implies that the information loss of NH is positively correlated with
the domain size. In our model, however, the domain size of the different data sets has
less influence on the information loss of the anonymized data.
The results of this experiment also suggest that for almost all tested data sets the
GCP of these algorithms grows linearly with k. This can be reasonably explained,
since a larger k leads to more generalized QI-groups, which inevitably
sacrifices data utility. NH performs well when the dimensionality of the QI-attributes
is low and the domain size is small; see the experimental results in the paper [13].
[Figure 9: running time (s) of NH and CB as n varies from 100k to 500k (in thousands) on the two data sets.]
The information loss of these methods on both data sets decreases with the growth of n. This
observation can be attributed to the fact that when the table size increases, more
tuples share the same or quite similar QI-attribute values. As a result, it is easier
for the partitioning strategies to find very similar tuples to generalize. Similar
to the previous experimental results, our method is the clear winner, since the information
loss of CB is significantly smaller than that of NH, which is consistently
observed for various database sizes.
6.4 Efficiency
Finally, we evaluate the overhead of performing anonymization. Figures 9(a) and
9(b) show the computation cost of these anonymization methods on the two
data sets, respectively. We compare CB with NH when evaluating computational
cost. The running time of the two algorithms increases linearly as n grows from
100k to 500k, which is expected, since more tuples to be anonymized
take longer to anonymize. The NH method
is more efficient. The comparison results show that the advantages of our method
in anonymization quality do not come for free. However, even in the worst case, our
algorithm finishes within 500 seconds, which is acceptable. In most real applications
quality is more important than running time, which justifies sacrificing a certain
degree of time performance to achieve higher data utility.
Summary. The above results clearly show that clustering-based anonymization
achieves less information loss than non-homogeneous anonymization (NH)
in cases where the dimensionality of the QI-attributes d > 3. NH performs
well when the domain size is small and the dimensionality of the QI-attributes
is low, owing to its greedy partitioning algorithm.
7 Conclusion
As privacy becomes a more and more serious concern in applications involving
microdata, good anonymization is of great significance. In this paper, we propose a
clustering-based algorithm to produce a utility-friendly anonymized
version of microdata. Our extensive performance study shows that our methods
outperform the non-homogeneous technique when the number of QI-attributes is
larger than 3.
References
1. Sweeney, L.: k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzzi-
ness Knowl.-Based Syst. 10(5), 557–570 (2002)
2. Samarati, P.: Protecting respondents’ identities in microdata release. TKDE 13(6),
1010–1027 (2001)
3. Samarati, P., Sweeney, L.: Generalizing data to provide anonymity when disclosing
information. In: PODS 1998, p. 188. ACM, New York (1998)
4. MacQueen, J.B.: Some methods for classification and analysis of multivariate ob-
servations, Berkeley, pp. 281–297 (1967)
5. Kalnis, P., Ghinita, G., Mouratidis, K., Papadias, D.: Preventing location-based
identity inference in anonymous spatial queries. TKDE 19(12), 1719–1733 (2007)
6. Mokbel, M.F., Chow, C.-Y., Aref, W.G.: The new casper: query processing for lo-
cation services without compromising privacy. In: VLDB 2006, pp. 763–774 (2006)
7. Machanavajjhala, A., Gehrke, J., Kifer, D., Venkitasubramaniam, M.: l-diversity:
Privacy beyond k-anonymity. In: ICDE 2006, p. 24 (2006)
8. Li, N., Li, T.: t-closeness: Privacy beyond k-anonymity and l-diversity. In: KDD
2007, pp. 106–115 (2007)
9. Xu, J., Wang, W., Pei, J., Wang, X., Shi, B., Fu, A.W.-C.: Utility-based anonymiza-
tion using local recoding. In: KDD 2006, pp. 785–790. ACM (2006)
10. Ghinita, G., Karras, P., Kalnis, P., Mamoulis, N.: Fast data anonymization with
low information loss. In: VLDB 2007, pp. 758–769. VLDB Endowment (2007)
11. Fung, B.C.M., Wang, K., Yu, P.S.: Top-down specialization for information and
privacy preservation, pp. 205–216 (2005)
12. LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Workload-aware anonymization. In:
KDD 2006, pp. 277–286. ACM, New York (2006)
13. Wong, W.K., Mamoulis, N., Cheung, D.W.L.: Non-homogeneous generalization in
privacy preserving data publishing. In: SIGMOD 2010, pp. 747–758. ACM, New
York (2010)
14. LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Incognito: efficient full-domain k-
anonymity. In: SIGMOD 2005, pp. 49–60. ACM, New York (2005)
15. Iwuchukwu, T., Naughton, J.F.: K-anonymization as spatial indexing: toward scal-
able and incremental anonymization. In: VLDB 2007, pp. 746–757 (2007)
16. LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Mondrian multidimensional k-
anonymity. In: ICDE 2006, Washington, DC, USA, p. 25 (2006)
17. Gionis, A., Mazza, A., Tassa, T.: k-anonymization revisited. In: ICDE 2008, pp.
744–753. IEEE Computer Society, Washington, DC (2008)
18. Bayardo, R.J., Agrawal, R.: Data privacy through optimal k-anonymization. In:
ICDE 2005, pp. 217–228. IEEE Computer Society, Washington, DC (2005)
Unsupervised Ensemble Learning
for Mining Top-n Outliers
1 Introduction
Although numerous outlier detection methods have been proposed in the literature,
no single method performs better than the others under all circumstances, and the best
method for a particular dataset may not be known a priori. Each detection method is
built on specific prior knowledge. For example, the nearest-neighbor
based methods assume that the feature space is informative enough to discriminate outliers
from normal data, while the classification based and the statistical methods must
assume the distributions of outliers and normal objects, respectively. Hence, their detection
performance varies with the nature of the data. This setting motivates a fundamental
information retrieval problem: the need for ensemble learning over different detection
methods to overcome their individual drawbacks and increase the generalization ability,
which is similar to meta-search aggregating query results from different search engines
into a more accurate ranking. Like meta-search, ensemble learning in top-n
outlier detection is more valuable than the fusion of binary labels, especially in large
databases. There is prior literature on ensemble learning for outlier detection, such as
[13,14,15]. However, all these efforts address the problem of effectively detecting outliers
in sub-feature spaces. Since the work of Lazarevic and others focuses on the fusion
of sub-feature spaces, these methods require the full spectrum of outlier scores over the
dataset, which prevents them from fusing top-n outlier lists in many real-world applications.
Although the problem of ensemble learning in top-n outlier detection shares a
certain similarity to that of meta-search, they have two fundamental differences. First,
the top-n outlier lists from various individual detection methods include the order information
and outlier scores of the n most outstanding objects. Different detection methods
generate outlier scores on different scales, which requires the ensemble framework to
provide a unified definition of outlier scores to accommodate the heterogeneity of the different
methods. Second, order-based rank aggregation methods, such as the Mallows model
[18], can only combine order lists of the same length, which
prevents their application to the fusion of top-n outlier lists: for a particular dataset,
there are always several top-n outlier lists of various lengths used to measure the
performance and effectiveness of a basic outlier detection method. To address these issues,
we propose a general framework of ensemble learning in top-n outlier detection, shown
in Figure 1, and develop two fusion methods: the score-based aggregation method (SAG)
and the order-based aggregation method (OAG). To the best of our knowledge, this is the
first attempt at ensemble learning in top-n outlier detection. Specifically, the contributions
of this paper are as follows:
The remainder of this paper is organized as follows. Section 2 introduces the framework
of ensemble learning in the top-n outlier detection and the two novel aggregation meth-
ods: the score-based and the order-based methods. Section 3 reports the experimental
results. Finally, Section 4 concludes the paper.
2 Methodologies
We first introduce the general framework and the basic notions of ensemble learning in
the top-n outlier detection, and then introduce the score-based method with a unified
outlier score and the order-based method based on the distance-based Mallows model,
respectively.
The aggregation model combines K orderings \{\sigma_i\}_{i=1}^{K} to obtain the optimal top-
n outlier list. Clearly, the literature with respect to the fusion of sub-feature spaces
[13,14,15] can be included in this framework by using the detection model in a special
sub-feature space as an individual method. In this paper, we only focus on the unsuper-
vised aggregation models based on the order information and outlier scores.
Since a top-n outlier list σ_i contains the order information and the corresponding outlier
scores, it is straightforward that combining these outlier scores from different methods
improves the detection performance. As mentioned in the previous section, the outlier
scores of the existing methods have different scales. For example, outlier scores vary
from zero to infinity for the nearest-neighbor based method [6], while they lie in the interval [−1, 1]
for the classification based method [10]. In this subsection, an effective method is
proposed to transform outlier scores into posterior probability estimates. Compared with
raw outlier scores, the posterior probability based on Bayes' theorem provides a robust
estimate for information fusion and a natural measure of the uncertainty in outlier
prediction. Without loss of generality, we assume that the higher S(i), the more likely
X_i is to be an outlier. Let Y_i be the label of X_i, where Y_i = 1 indicates
that X_i is an outlier and Y_i = 0 if X_i is normal. According to Bayes' theorem,
where μ and std are the mean value and standard deviation of the original outlier scores,
respectively. In large datasets, these statistics
can be computed by sampling the original
data. As a discriminant function, ln ϕ(i) < 0 means (S(i) − μ)/std > τ ; the object
Xi can be assigned as an outlier. In all the experiments, the default value of τ equals
1.5 based on Lemma 1.
Lemma 1: For any distribution of the outlier score S(i), it holds that

P\left( \frac{S(i) - \mu}{std} > \tau \right) \le \frac{1}{\tau^2}
where n_d is the number of orderings that contain object X_i and rel_j(i) is the normalized
outlier score of X_i given by the j-th individual method. When r = 1, the ultimate
outlier score combines the number of orderings n_d with the sum of the normalized
scores. When r = 0, the result is only the sum of the normalized scores. When r = −1,
it is equivalent to the average of the normalized scores over the orderings containing X_i. According
to Eq. 1 and Eq. 2, the posterior probabilities can be used to normalize outlier scores
directly. The detailed steps of SAG are shown in Algorithm 1.
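Since Eq. (1) and Eq. (2) are not reproduced above, the following sketch only approximates SAG: it uses a logistic transform of the standardized score as a stand-in for the posterior-probability normalization, and the rule n_d(i)^r times the sum of normalized scores is our reading of the surrounding text (r = 1, 0, −1 reproduce the three behaviours described).

```python
import numpy as np

def sag_aggregate(score_lists, r=1, tau=1.5):
    """Hypothetical sketch of score-based aggregation (SAG) over top-n lists.

    score_lists: one dict {object_id: raw outlier score} per individual detector."""
    normalized = []
    for scores in score_lists:
        vals = np.array(list(scores.values()), dtype=float)
        mu, std = vals.mean(), vals.std() + 1e-12
        # stand-in for the posterior-probability normalization of Eq. (1)
        normalized.append({o: 1.0 / (1.0 + np.exp(-((s - mu) / std - tau)))
                           for o, s in scores.items()})
    fused = {}
    for o in set().union(*(set(s) for s in normalized)):
        rels = [s[o] for s in normalized if o in s]
        fused[o] = (len(rels) ** r) * sum(rels)        # n_d(i)**r * sum of rel_j(i)
    return sorted(fused, key=fused.get, reverse=True)  # most outlying first
```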
Given a judge ordering σ and its expertise indicator parameter θ, the Mallows model
[16] generates an ordering π given by the judge according to the formula:

P(\pi|\theta, \sigma) = \frac{1}{Z(\sigma, \theta)} \exp(\theta \cdot d(\pi, \sigma)) \quad (4)

where

Z(\sigma, \theta) = \sum_{\pi \in R_n} \exp(\theta \cdot d(\pi, \sigma)) \quad (5)
According to the right invariance of the distance function, the normalizing constant
Z(σ, θ) is independent of σ, which means Z(σ, θ) = Z(θ). The parameter θ is a non-
positive quantity and the smaller the value of θ, the more concentrated at σ the ordering
π. When θ equals 0, the distribution is uniform meaning that the ordering given by the
judge is independent of the truth.
In this extended model, each ordering σi is returned by a judge for a particular set
of objects. θi represents the expertise degree of the i-th judge. Eq. 6 computes the
probability that the true ordering is π, given the orderings σ from K judges and the
degrees of their expertise.
Based on the hypothesis of the distance-based Mallows model, we propose a generative
model for OAG, which can be described as follows:
P(\pi, \sigma|\theta) = P(\sigma|\theta, \pi)\, P(\pi|\theta) = P(\pi) \prod_{i=1}^{K} P(\sigma_i|\theta_i, \pi) \quad (8)
The true list π is sampled from the prior distribution P (π) and σi is drawn from the
Mallows model P (σi |θi , π) independently. For the ensemble learning of top-n outlier
lists, the observed objects are the top-n outlier lists σ from various individual detection
methods, and the unknown object is the true top-n outlier list π. The value of the free
parameter θi depends on the detection performance of the i-th individual method. The
goal is to find the optimal ranking π and the corresponding free parameter θi which
maximize the posteriori probability shown in Eq. 6. In this work, we propose a novel
EM algorithm to solve this problem. For obtaining an accurate estimation of θi by the
EM-based algorithm, we construct the observed objects by applying several queries
with different lengths {Nq }Q q=1 , where N1 = n and Nq/1 > n. Clearly, it is to compute
the parameter θ = (θ1 , · · · , θK ) by considering the information of different scales. In
this paper, the default value of Q is 4 and the lengths meet the following requirement:
Nq = q · n.
where

L(\theta) = \sum_{q=1}^{Q} \log P(\pi_q) - \sum_{q=1}^{Q} \sum_{i=1}^{K} \log Z_q(\theta_i) + \sum_{q=1}^{Q} \sum_{i=1}^{K} \theta_i \cdot d(\pi_q, \sigma_{qi}) \quad (10)

U(\theta') = \prod_{q=1}^{Q} P(\pi_q \mid \theta', \sigma_q) \quad (11)
Lemma 3: The parameter θ maximizing the expected value ζ(θ, θ') satisfies the following
formula:

\sum_{q=1}^{Q} E_{\theta_i}(d(\pi_q, \sigma_{qi})) = \sum_{(\pi_1, \cdots, \pi_Q)} \sum_{q=1}^{Q} d(\pi_q, \sigma_{qi}) \cdot U(\theta') \quad (12)
The proofs of Lemma 2 and Lemma 3 are omitted due to lack of space. As shown
in Lemma 3, the value of the right-hand side of Eq. 12 and the analytical expression
of the left-hand side should be evaluated under an appropriate distance function to obtain
the optimal θ. Before introducing the detailed procedure of our EM-based learning
algorithm, we bring in an effective distance function d(π, σ) between the top-n orderings
π and σ, which was proposed in [18]. To keep this work self-contained, this distance
function is introduced as follows.
Definition 1: Let F_π and F_σ be the sets of elements of π and σ, respectively. Z = F_π ∩ F_σ
with |Z| = z, P = F_π \ Z, and S = F_σ \ Z (note that |P| = |S| = n − z = r).
Define the augmented ranking π̃ as π augmented with the elements of S, all assigned the
same index n + 1. Clearly, π̃^{-1}(n + 1) is the set of elements at position n + 1 (σ̃ is
defined similarly). Then d(π, σ) is the minimum number of adjacent transpositions
needed to turn π̃ into σ̃, computed as follows, where I(x) = 1 if x > 0, and 0 otherwise.
d(π, σ) = Σ_{i=1, π̃^{-1}(i) ∈ Z}^{n} V_i(π̃, σ̃) + Σ_{i=1, π̃^{-1}(i) ∉ Z}^{n} U_i(π̃, σ̃) + r(r + 1)/2    (13)

where

V_i(π̃, σ̃) = Σ_{j=i, π̃^{-1}(j) ∈ Z}^{n} I(σ̃(π̃^{-1}(i)) − σ̃(π̃^{-1}(j))) + Σ_{j ∈ π̃^{-1}(n+1)} I(σ̃(π̃^{-1}(i)) − σ̃(j))

U_i(π̃, σ̃) = Σ_{j=i, π̃^{-1}(j) ∈ Z}^{n} 1
In each iteration of the EM process, θ is updated by solving Eq. 12. Based on Definition
1, Eθi (d(πq , σqi )) is computed as follows:
E_{θ_i}(d(π_q, σ_{qi})) = N_q · e^{θ_i} / (1 − e^{θ_i}) − Σ_{j=r+1}^{N_q} j · e^{jθ_i} / (1 − e^{jθ_i}) + r(r + 1)/2 − r(z + 1) · e^{θ_i(z+1)} / (1 − e^{θ_i(z+1)})
This function is a monotonic function of the parameter θ_i. To estimate the right-hand
side of Eq. 12, we adopt the Metropolis algorithm introduced in [2] to sample from
Eq. 6. Suppose that the current list is π_t. A new list π_{t+1} is obtained by exchanging
two objects i and j chosen at random from π_t. Let r = P(π_{t+1} | θ, σ) / P(π_t | θ, σ). If r ≥ 1, π_{t+1} is
accepted as the new list; otherwise, π_{t+1} is accepted with probability r. Then, θ can be computed by a line search
with the average z of the samples. The steps of OAG are shown in Algorithm 2.
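The Metropolis step described above can be sketched as follows; the `distance` function, the inputs, and the swap proposal are assumptions for illustration, and log probabilities are used to avoid overflow. This is not the authors' Algorithm 2, only the sampling idea.

import math
import random

def metropolis_sample(sigmas, thetas, distance, pi0, iters=1000):
    """Sketch of the Metropolis sampler: propose a new list by swapping two
    random positions and accept it with probability min(1, P(new)/P(old)),
    where log P(pi | theta, sigma) = sum_i theta_i * d(pi, sigma_i) + const."""
    def log_score(pi):
        return sum(t * distance(pi, s) for t, s in zip(thetas, sigmas))

    pi, cur = list(pi0), log_score(pi0)
    samples = []
    for _ in range(iters):
        i, j = random.sample(range(len(pi)), 2)   # pick two distinct positions
        cand = list(pi)
        cand[i], cand[j] = cand[j], cand[i]       # swap them
        new = log_score(cand)
        if new >= cur or random.random() < math.exp(new - cur):
            pi, cur = cand, new
        samples.append(list(pi))
    return samples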
3 Experiments
We evaluate the aggregation performance of the SAG and OAG methods on a number
of real-world datasets. We also measure the robustness of SAG and OAG against
random rankers, which are generated from the Uniform distribution and the
Gaussian distribution, respectively.
In this subsection, we use several state-of-the-art methods, including LOF [6],
K-Distance [3], LOCI [7], Active Learning [10], and Random Forest [11], as the individ-
ual methods to return the original top-n outlier lists. Since the performance of LOF
and K-Distance depends on the parameter K that determines the scale of the neighbor-
hood, we take the default value of K as 2.5% of the size of a real dataset. Both LOF and
LOCI return outlier scores for each dataset based on density estimation. However,
K-Distance [3] only gives objects binary labels. Hence, following the framework of
K-Distance, we compute outlier scores as the distance between an object and its Kth
nearest neighbor. Active Learning and Random Forest both transform outlier detection into
classification based on artificial outliers generated according to the procedure pro-
posed in [10]. These two methods compute outlier scores by majority voting
of weak classifiers or individual decision trees, respectively.
The real datasets used in this section consist of the Mammography dataset, the Ann-
thyroid dataset, the Shuttle dataset, and the Coil 2000 dataset, all of which can be
downloaded from the UCI database except for the Mammography dataset.1 Table 1
1
We thank Professor Nitesh V. Chawla ([email protected]) for providing this dataset.
summarizes the characteristics of these real datasets. All the compared outlier de-
tection methods are evaluated using precision and recall in the top-n outlier list σ.
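For reference, a minimal sketch of these standard measures (not the paper's exact formulas) for a labeled dataset:

def precision_recall_at_n(top_n_list, true_outliers):
    """Precision and recall of a returned top-n outlier list against the
    labeled outliers. Variable names are illustrative."""
    hits = len(set(top_n_list) & set(true_outliers))
    precision = hits / len(top_n_list)
    recall = hits / len(true_outliers)
    return precision, recall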
Fig. 2. The posterior probability curves based on SAG and score histograms of various individual
methods on the Mammography dataset
Table 2. The precisions in the top-n outlier lists for all the individual methods and the aggregation
methods on the real data
Method            Mammography  Ann-thyroid  Shuttle-1  Shuttle-2  Shuttle-3  Coil-2000
                  (Top-260)    (Top-73)     (Top-13)   (Top-39)   (Top-809)  (Top-348)
LOF               19.0%        39.7%        23.1%      53.8%      28.4%      5.5%
K-Distance        13.8%        37.0%        29.8%      48.7%      34.5%      8.0%
LOCI              8.8%         28.8%        7.7%       33.3%      67.0%      8.9%
Active Learning   18.1%        28.8%        15.4%      0%         30.3%      8.9%
Random Forests    15.4%        41.1%        0%         0%         70.6%      8.6%
Average of All    15.0%        35.1%        15.2%      27.2%      46.2%      8.0%
Cumulative Sum    10.0%        31.5%        23.1%      58.9%      40.0%      10.3%
Breadth-first     14.2%        38.4%        0%         28.2%      46.9%      10.6%
Mallows Model     13.1%        38.4%        23.1%      51.3%      44.4%      8.0%
SAG (r=1)         18.5%        34.2%        23.1%      48.7%      61.3%      9.8%
SAG (r=0)         18.5%        34.2%        23.1%      48.7%      62.1%      9.5%
SAG (r=-1)        5.4%         26.0%        7.7%       43.6%      59.5%      10.9%
OAG               19.7%        42.5%        30.8%      53.8%      71.7%      9.1%
the number nd of individual top-n outlier lists contributes little to the final fusion
performance. Compared with the above aggregation methods, the performance of SAG
with r = −1 varies dramatically with the nature of the data. SAG with r = −1 achieves
the best performance on the Coil 2000 dataset. However, it performs more poorly than
SAG with r = {1, 0} and OAG on the other datasets. This demonstrates that the average
of the unified outlier scores is not well suited to the fusion of top-n lists. In general,
since outlier scores are often either meaningless or inaccurate, the order-based ag-
gregation method makes more sense than the score-based method. OAG achieves
Fig. 3. The precisions of OAG and SAG (r = 1) varying with the number of random lists Kr on
the Mammography data and Shuttle-3 data
Table 3. The parameter θ of all the individual methods and five random lists on the Mammogra-
phy and Shuttle-3 datasets
Dataset      Noise           LOF      K-Distance  LOCI     Active    Random   Random lists
                                                           Learning  Forests  (Average)
Mammography  Uniform-Noise   -0.0058  -0.0039     -0.0058  -0.0052   -0.0039  -0.00014
             Gaussian-Noise  -0.0061  -0.0033     -0.0055  -0.0054   -0.0044  -0.00016
Shuttle-3    Uniform-Noise   -0.0014  -0.0016     -0.0014  -0.0018   -0.0037  -0.00001
             Gaussian-Noise  -0.0014  -0.0016     -0.0018  -0.0014   -0.0035  -0.00002
better performance than SAG on the Mammography, Ann-thyroid, and Shuttle-1 and
Shuttle-3 datasets. Both Cumulative Sum and SAG are score-based fusion methods. Table 2
shows that the performance of SAG is more stable and effective, especially SAG with
r = 1. Breadth-first, Mallows Model, and OAG are all order-based fusion methods.
Although Breadth-first can be used for the aggregation of top-n outlier lists, it is sensitive
to the order of the individual methods. The Mallows Model assumes that there is a fixed
expertise indicator parameter θ for an individual method regardless of the nature of
the data. The experimental results indicate that this hypothesis is not appropriate for en-
semble learning in top-n outlier detection. Overall, SAG and OAG both achieve
better performance than Average of All and the aggregation methods Breadth-first, Cu-
mulative Sum and Mallows Model, which means that the proposed approaches deliver a
stable and effective performance across different datasets with good scalability.
The individual method pool consists of the previous five individual detection methods and the Kr random lists of the poor judges,
where Kr varies from 1 to 5.
Due to lack of space, only the results on the Mammography dataset and the Shuttle-
3 dataset are shown in Figure 3. Clearly, OAG is more robust to the random poor
judges than SAG for both Uniform-Noise and Gaussian-Noise. In particular, OAG
maintains better performance as the number Kr of random lists increases. Table 3
gives the value of the parameter θ for the individual method pool on the Mammogra-
phy and Shuttle-3 datasets. The parameter θ of each Uniform-Noise or Gaussian-Noise list
is close to zero. This demonstrates that OAG learns to discount the random top-n lists
without supervision.
4 Conclusions
In this paper, we have proposed a general framework for ensemble learning in top-n outlier
detection. We have proposed the score-based method SAG together with a normalization
method that transforms outlier scores into posterior probabilities, and the order-based
method OAG, which builds on the distance-based Mallows model to combine the order
information of the individual top-n outlier lists. Theoretical analysis and empirical
evaluations on several real datasets demonstrate that both SAG and OAG can effectively
combine state-of-the-art detection methods to deliver a stable and effective performance
across different datasets with good scalability, and that OAG can discount random
top-n outlier lists without supervision.
References
1. Hawkins, D.: Identification of Outliers. Chapman and Hall, London (1980)
2. Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications.
Biometrika 57(1), 97–109 (1970)
3. Knorr, E.M., Ng, R.T., Tucakov, V.: Distance-based outliers: algorithms and applications.
Journal of VLDB 8(3-4), 237–253 (2000)
4. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. Journal of ACM Computing
Surveys (CSUR) 31(3), 264–323 (1999)
5. Barnett, V., Lewis, T.: Outliers in Statistic Data. John Wiley, New York (1994)
6. Breunig, M.M., Kriegel, H.-P., Ng, R.T., Sander, J.: Lof: Identifying density-based local
outliers. In: SIGMOD, pp. 93–104 (2000)
7. Papadimitriou, S., Kitagawa, H., Gibbons, P.: Loci: Fast outlier detection using the local
correlation integral. In: ICDE, pp. 315–326 (2003)
8. Yang, J., Zhong, N., Yao, Y., Wang, J.: Local peculiarity factor and its application in outlier
detection. In: KDD, pp. 776–784 (2008)
9. Gao, J., Hu, W., Zhang, Z(M.), Zhang, X., Wu, O.: RKOF: Robust Kernel-Based Local
Outlier Detection. In: Huang, J.Z., Cao, L., Srivastava, J. (eds.) PAKDD 2011, Part II.
LNCS(LNAI), vol. 6635, pp. 270–283. Springer, Heidelberg (2011)
10. Abe, N., Zadrozny, B., Langford, J.: Outlier detection by active learning. In: KDD, pp. 504–
509 (2006)
11. Breiman, L.: Random Forests. J. Machine Learning 45(1), 5–32 (2001)
12. Fox, E., Shaw, J.: Combination of multiple searches. In: The Second Text REtrieval Confer-
ence (TREC-2), pp. 243–252 (1994)
13. Lazarevic, A., Kumar, V.: Feature bagging for outlier detection. In: KDD, pp. 157–166 (2005)
14. Gao, J., Tan, P.N.: Converting output scores from outlier detection algorithms into probability
estimates. In: ICDM, pp. 212–221 (2006)
15. Nguyen, H., Ang, H., Gopalkrishnan, V.: Mining outliers with ensemble of heterogeneous
detectors on random subspaces. Journal of DASFAA 1, 368–383 (2010)
16. Mallows, C.: Non-null ranking models. I. Biometrika 44(1/2), 114–130 (1957)
17. Lebanon, G., Lafferty, J.: Cranking: Combining rankings using conditional probability mod-
els on permutations. In: ICML, pp. 363–370 (2002)
18. Klementiev, A., Roth, D., Small, K.: Unsupervised rank aggregation with distance-based
models. In: ICML, pp. 472–479 (2008)
Towards Personalized Context-Aware
Recommendation by Mining Context Logs
through Topic Models
1 Introduction
Recent years have witnessed the increasing popularity of smart mobile devices,
such as smart phones and pads. These devices are usually equipped with multi-
ple context sensors, such as GPS sensors, 3D accelerometers and optical sensors,
which enable them to capture rich contextual information about mobile users and
thus support a wide range of context-aware services, including context-aware
tour guides [15], location-based reminders [13] and context-aware recommenda-
tion [2,9,16,10]. Moreover, this contextual information and users' corre-
sponding activities (e.g., browsing web sites, playing games and chatting via so-
cial network services) can be recorded into context logs for mining users' context-aware preferences.
The results clearly demonstrate the effectiveness of the proposed approach and
indicate some inspiring conclusions.
The remainder of this paper is organized as follows. Section 2 provides a
brief overview of related work. Section 3 presents the idea of making
personalized context-aware recommendations by mining users' context-aware
preferences from context logs, and Section 4 presents how to mine common
context-aware preferences through topic models. Section 5 reports our experi-
mental results on a real-world data set. Finally, in Section 6, we conclude this
paper.
2 Related Work
Today, the powerful sensing abilities of smart mobile devices make it possible to capture
rich contextual information about mobile users, such as location, user activ-
ity, audio level, and so on. Consequently, how to leverage such rich contextual
information for personalized context-aware recommendation has become a popular
problem that attracts many researchers' attention.

Many previous works on personalized context-aware recommendation for
mobile users have been reported. For example, Tung et al. [14] proposed
a prototype design for building a personalized recommender system that recom-
mends travel-related information according to users' contextual information. Park
et al. [12] proposed a location-based personalized recommender system, which
can reflect users' personal preferences by modeling user contextual information
through Bayesian networks. Bader et al. [2] proposed a novel context-
aware approach to recommending points-of-interest (POI) for users in an au-
tomotive scenario. Specifically, they studied the scenario of recommending gas
stations to car drivers by leveraging Multi-Criteria Decision Making (MCDM)
based methods to model context and different routes. However, most of these
works only leverage an individual user's historical context data for modeling per-
sonal context-aware preferences, and do not take into account the problem of
insufficient personal training data.
3 Preliminary
should be collected is usually predefined. For example, the GPS coordinate is not
available when the user is indoors. Moreover, interaction records can be empty
(denoted as “Null”) because the user activities which can be captured by devices
do not always happen.
It is worth noting that we transform raw location-based context data such as
GPS coordinates or cell IDs into social locations with explicit meanings,
such as “Home” and “Work place”, using existing location mining approaches
(e.g., [5]). The basic idea of these approaches is to find clusters of user location
data and recognize their social meaning by time pattern analysis. Moreover,
we also manually transform the raw activity records into more general ones by
mapping the activity of using a particular application or playing a particular
game to an activity category. For example, we can transform the two raw activity
records “Play Angry Birds” and “Play Fruit Ninja” into the same activity record
“Play action games”. In this way, the context data and activity records in context
logs are normalized and the data sparseness is somewhat alleviated, which eases
context-aware preference mining.
Given a context C = {p} where p denotes an atomic context, i.e., a contex-
tual feature-value pair, the probability that a user u prefers activity a can be
represented as
P(a | C, u) = P(a, C | u) P(u) / P(C, u) ∝ P(a, C | u) ∝ ∏_{p} P(a, p | u),
where we assume that the atomic contexts are mutually conditionally indepen-
dent given u.
Then the problem becomes how to calculate P (a, p|u). According to our pro-
cedure, we introduce a variable of CCP denoted as z, and thus we have
P(a, p | u) = Σ_z P(a, p, z | u) ∝ Σ_z P(a, p | z, u) P(z, u) ∝ Σ_z P(a, p | z) P(z | u),
where we assume that a user’s preference under a context only relies on the
CCPs and his (her) context-aware preferences in the form of their distribution
on the CCPs, rather than other information of the user. Therefore, the problem
is further converted into learning P (a, p|z) and P (z|u) from many users’ context
logs, which can be solved by widely used topic models. In the next section,
we present how to utilize topic models for mining CCPs, i.e., P (a, p|z), and
accordingly make personalized context-aware recommendation.
We refer to the combination of an activity a and the corresponding contextual feature-value pair p, i.e., (a, p), as an Atomic
Context-aware Preference feature, ACP-feature for short. Intuitively, if we
take ACP-features as words, take context logs as bags of ACP-features cor-
responding to documents, and take CCPs as topics, we can take advantage of topic
models to learn CCPs from many users' context logs.
However, raw context logs are not naturally in the form of bags of ACP-features,
so we need some preprocessing to extract training data. Specifically, we first
remove all context records without any activity record and then extract ACP-
features from the remaining ones. Given a context record <Tid, C, a>, where Tid
denotes the timestamp, C = {p_1, p_2, ..., p_l} denotes the context, and a denotes
the activity, we can extract l ACP-features, namely, (a, p_1), (a, p_2), ..., (a, p_l).
For simplicity, we refer to the bag of ACP-features extracted from user u's context
log as the ACP-feature bag of u.
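A minimal sketch of this preprocessing step, assuming context records are stored as (Tid, C, a) tuples (the record layout and the function name are illustrative):

def extract_acp_features(context_log):
    """Drop context records with empty activity ("Null") and turn each
    remaining record (Tid, C, a) with C = {p1, ..., pl} into l ACP-features
    (a, p), as described above."""
    bag = []
    for tid, context, activity in context_log:
        if activity is None or activity == "Null":
            continue
        for p in context:          # p is a contextual feature-value pair
            bag.append((activity, p))
    return bag

# Example: one record yields two ACP-features
log = [(1, [("Location", "Home"), ("Day", "Weekend")], "Play action games")]
print(extract_acp_features(log))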
Among several existing topic models, in this paper we leverage the widely
used Latent Dirichlet Allocation (LDA) model [4]. According to the LDA model, the
ACP-feature bag of user u_i, denoted as d_i, is generated as follows. First, before
generating any ACP-feature bag, K prior ACP-feature conditional distributions
given context-aware preferences {φ_z} are generated from a prior Dirichlet distri-
bution β. Second, a prior context-aware preference distribution θ_i is generated
from a prior Dirichlet distribution α for each user u_i. Then, for generating the
j-th ACP-feature in d_i, denoted as w_{i,j}, the model first generates a CCP z from
θ_i and then generates w_{i,j} from φ_z. Figure 2 shows the graphical representation
of modeling ACP-feature bags by LDA.
In our approach, the objective of LDA model training is to learn proper es-
timations of the latent variables θ and φ that maximize the posterior distribution
of the observed ACP-feature bags. In this paper, we choose a Markov chain
Monte Carlo method, namely the Gibbs sampling approach introduced in [6], for training LDA
models efficiently. This method begins with a random assignment of CCPs to
ACP-features to initialize the state of the Markov chain. In each of the following
iterations, the method re-estimates the conditional probability of assigning
a CCP to each ACP-feature, conditioned on the assignment of all other
ACP-features. A new assignment of CCPs to ACP-features according to
these latest conditional probabilities is then stored as a new state of
the Markov chain. Finally, after rounds of iterations, the assignment converges,
which means each ACP-feature is assigned a stable and final CCP. Eventually, we
can obtain the estimated values of the two distributions {p̂(a, p | z)} and {p̂(z | u)},
which denote the probability that the ACP-feature (a, p) appears under the
CCP z, and the probability that user u has the context-aware preference z,
respectively.
p̂(a, p | z) = (n_z^{(a,p)} + β) / (n_z^{(·)} + Aβ),    p̂(z | u) = (n_z^{(u)} + α) / (n_·^{(u)} + Zα),
where n_z^{(a,p)} indicates the number of times the ACP-feature (a, p) has been
assigned to CCP z, while n_z^{(u)} indicates the number of times an ACP-feature
from user u's context log has been assigned to CCP z. A indicates the
number of ACP-features, and Z indicates the number of
CCPs.
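A small sketch of these two estimators, assuming the Gibbs sampler maintains count matrices n_zw (CCP-by-ACP-feature) and n_uz (user-by-CCP); the variable names and the interpretation of A as the smoothing dimension are assumptions of the sketch.

import numpy as np

def estimate_distributions(n_zw, n_uz, alpha, beta):
    """n_zw[z, w] counts assignments of ACP-feature w to CCP z;
    n_uz[u, z] counts assignments of user u's ACP-features to CCP z.
    A is taken as the number of distinct ACP-features, Z as the number of CCPs."""
    A = n_zw.shape[1]
    Z = n_uz.shape[1]
    p_w_given_z = (n_zw + beta) / (n_zw.sum(axis=1, keepdims=True) + A * beta)
    p_z_given_u = (n_uz + alpha) / (n_uz.sum(axis=1, keepdims=True) + Z * alpha)
    return p_w_given_z, p_z_given_u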
The LDA model needs a predefined parameter Z to indicate the number of CCPs.
How to select an appropriate Z for LDA is an open question. To guarantee
the performance of recommendation, in this paper we utilize the method
proposed by Bao et al. [3] to estimate Z, and we set ζ to 10% in our experi-
ments accordingly. Please refer to [1] for more information.
After learning the CCPs represented by distributions over ACP-features, we can
predict users' preferences according to their historical context-aware preferences
and current contexts, i.e., P(a, C | u). Then, we recommend to users a ranked list
of different categories of content according to the preference prediction. For
example, if we predict that a user u is more likely to play action games than to
listen to pop music, the recommendation priority of popular action games will be
higher than that of recent hot pop music.
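As an illustration of this prediction step, the following hedged sketch scores candidate activity categories by ∏_{p∈C} Σ_z P(a, p | z) P(z | u) and returns them in ranked order; the data structures are assumptions, not the paper's implementation.

import numpy as np

def rank_activities(user, context, p_ap_given_z, p_z_given_u, activities):
    """p_ap_given_z is assumed to map (a, p) to a vector over CCPs;
    p_z_given_u maps each user to such a vector. Unseen (a, p) pairs get a
    tiny back-off probability."""
    theta_u = p_z_given_u[user]
    scores = {}
    for a in activities:
        score = 1.0
        for p in context:
            phi = p_ap_given_z.get((a, p))
            score *= float(np.dot(phi, theta_u)) if phi is not None else 1e-12
        scores[a] = score
    return sorted(scores, key=scores.get, reverse=True)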
5 Experiments
In this section, we evaluate the performance of our LDA-based personalized
context-aware recommendation approach, namely Personalized Context-aware
Recommendation with LDA (PCR-LDA), against several baseline methods on a
real-world data set.
                                Number
users                              443
unique activities                  665
unique contexts                  4,391
context records              8,852,187
activity-context records†    1,097,189
† activity-context records denote the context records with non-empty user activity records.
content categories, namely Call, Web, Multimedia, Management, Games, Sys-
tem, Navigation, Business, Reference, Social Network Service (SNS), Utility and
Others. Specifically, in our experiments we do not utilize the categories Call and
Others because their activity information is not clear enough for making recommendations.
Therefore, in our experiments we utilize 10 activity categories containing 618
activities, which appear in a total of 408,299 activity-context records.
non-empty activity records and can be used as training data, which highlights the
limitation of learning personal context-aware preferences only from an individual user's
context logs.
recall in the top-K recommendation results on the test cases of user u, and |U| indi-
cates the number of users. AR^{(u)}@K can be computed as (1/N_u) Σ_i Σ_{r=1}^{K} rel_i(r),
where N_u denotes the number of test cases for user u, r denotes a given cut-off
rank, and rel_i(·) is the binary function indicating the relevance of a given rank.
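A minimal sketch of MAR@K under the assumption that each test case has a single relevant item (the data layout is illustrative):

def mean_average_recall_at_k(test_cases, k):
    """`test_cases` is assumed to map each user to a list of
    (ranked_list, relevant_item) pairs; rel(r) is 1 when the item at rank r
    is the relevant one. AR@K is averaged per user, then over users."""
    per_user = []
    for cases in test_cases.values():
        hits = [sum(1 for item in ranked[:k] if item == relevant)
                for ranked, relevant in cases]
        per_user.append(sum(hits) / len(cases))   # AR@K for this user
    return sum(per_user) / len(per_user)          # MAR@K over all users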
truth activities for all contexts. This is because PCR-LDA takes advantage of many
users' context logs. In contrast, PCR-i has a worse MAR@K due to the insufficient
training data in an individual user's context logs for mining context-aware prefer-
ences. Moreover, due to the different context-aware preferences between users, the
popularity-based approach CPR under-performs the other approaches.
PCR-i can only recommend relevant content categories in the top position
for one test case, and the popularity-based approach CPR always recommends the
same content categories for all users and thus sometimes performs poorly.
6 Concluding Remarks
In this paper, we investigated how to exploit users' context logs for personalized
context-aware recommendation by mining CCPs through topic models. To be
specific, we first extract an ACP-feature bag for each user from their historical
context logs. Then, we mine users' CCPs through topic models. Fi-
nally, we make recommendations according to the given context and the CCP
distribution of the given user. The experimental results on a real-world data
set clearly show that our proposed recommendation approach can achieve good
performance for personalized context-aware recommendation.
References
1. Azzopardi, L., Girolami, M., Risjbergen, K.V.: Investigating the relationship be-
tween language model perplexity and ir precision-recall measures. In: SIGIR 2003,
pp. 369–370 (2003)
2. Bader, R., Neufeld, E., Woerndl, W., Prinz, V.: Context-aware poi recommenda-
tions in an automotive scenario using multi-criteria decision making methods. In:
CaRR 2011, pp. 23–30 (2011)
3. Bao, T., Cao, H., Chen, E., Tian, J., Xiong, H.: An unsupervised approach to
modeling personalized contexts of mobile users. In: ICDM 2010, pp. 38–47 (2010)
4. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine
Learning Research, 993–1022 (2003)
5. Eagle, N., Clauset, A., Quinn, J.A.: Location segmentation, inference and pre-
diction for anticipatory computing. In: AAAI Spring Symposium on Technosocial
Predictive Analytics (2009)
6. Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proceedings of National
Academy of Science of the USA, 5228–5235 (2004)
7. Heinrich, G.: Parameter estimation for text analysis. Technical report, University of
Leipzig (2009)
8. Hofmann, T.: Probabilistic latent semantic indexing. In: SIGIR 1999, pp. 50–57
(1999)
9. Jae Kim, K., Ahn, H., Jeong, S.: Context-aware recommender systems using data
mining techniques. Journal of World Academy of Science, Engineering and Tech-
nology 64, 357–362 (2010)
10. Karatzoglou, A., Amatriain, X., Baltrunas, L., Oliver, N.: Multiverse recommen-
dation: n-dimensional tensor factorization for context-aware collaborative filtering.
In: RecSys 2010, pp. 79–86 (2010)
11. Nigam, K., McCallum, A.K., Thrun, S., Mitchell, T.: Text classification from la-
beled and unlabeled documents using EM. Machine Learning 39, 103–134 (2000)
12. Park, M.-H., Hong, J.-H., Cho, S.-B.: Location-Based Recommendation System
Using Bayesian User’s Preference Model in Mobile Devices. In: Indulska, J., Ma,
J., Yang, L.T., Ungerer, T., Cao, J. (eds.) UIC 2007. LNCS, vol. 4611, pp. 1130–
1139. Springer, Heidelberg (2007)
13. Sohn, T., Li, K.A., Lee, G., Smith, I., Scott, J., Griswold, W.G.: Place-Its: A
Study of Location-Based Reminders on Mobile Phones. In: Beigl, M., Intille, S.S.,
Rekimoto, J., Tokuda, H. (eds.) UbiComp 2005. LNCS, vol. 3660, pp. 232–250.
Springer, Heidelberg (2005)
14. Tung, H.-W., Soo, V.-W.: A personalized restaurant recommender agent for mobile
e-service. In: EEE 2004, pp. 259–262 (2004)
15. van Setten, M., Pokraev, S., Koolwaaij, J.: Context-Aware Recommendations in
the Mobile Tourist Application COMPASS. In: De Bra, P.M.E., Nejdl, W. (eds.)
AH 2004. LNCS, vol. 3137, pp. 235–244. Springer, Heidelberg (2004)
16. Woerndl, W., Schueller, C., Wojtech, R.: A hybrid recommender system for
context-aware recommendations of mobile applications. In: ICDE 2007, pp. 871–
878 (2007)
17. Zheng, V.W., Cao, B., Zheng, Y., Xie, X., Yang, Q.: Collaborative filtering meets
mobile recommendation: A user-centered approach. In: AAAI 2010, pp. 236–241
(2010)
Mining of Temporal Coherent Subspace Clusters
in Multivariate Time Series Databases
1 Introduction
Mining patterns from multivariate temporal data is important in many appli-
cations, as for example analysis of human action patterns [12], gene expression
data [8], or chemical reactions [17]. Temporal data in general reflect the possi-
bly changing state of an observed system over time and are obtained by sensor
readings or by complex simulations. Examples include financial ratios, engine
readings in the automotive industry, patient monitoring, gene expression data,
sensors for forest fire detection, and scientific simulation data, as e.g. climate
models. The observed objects in these examples are individual stocks, engines,
patients, genes, spatial locations or grid cells in the simulations. The obtained
data are usually represented by multivariate time series, where each attribute
represents a distinct aspect of observed objects; e.g., in the health care exam-
ple, each patient has a heart rate, a body temperature, and a blood pressure.
The attributes are often correlated; e.g. for the forest fire, the attributes tem-
perature and degree of smoke are both signs of fire. Unknown patterns in such
databases can be mined by clustering approaches, where time series are grouped
together by their similarity. Accordingly, clusters of time series correspond to
groups of objects having a similar evolution over time, and clusters represent
these evolutions.
2 Related Work
Clustering of temporal data can roughly be divided into clustering of incoming
data stream, called stream clustering [9], and clustering of static databases, the
topic of this paper. Our method is related to two static clustering research areas,
namely time series clustering and subspace clustering, and we discuss these areas
in the following. In the experiments, we compare to methods from both areas.
Time Series Clustering. There is much research on clustering univariate time
series data, and we refer to the comprehensive surveys on this topic [4,11]. Clus-
tering of multivariate time series has recently been gaining importance: early work
in [14] uses clustering to identify outliers in multivariate time series databases.
Multivariate time series clustering approaches based on statistical features were
introduced in [2,19,20]. There are no concepts in these approaches to discover
patterns hidden in parts of the dimensional or temporal extents of time series.
Most clustering approaches are based on an underlying similarity measure
between time series. There is work on noise-robust similarity measures based
on partial comparison, called Longest Common Subsequences (LCSS) [18]. We
combined k-Medoid with LCSS as a competing solution.
There is another type of time series clustering methods, designed for ap-
plications in which single, long time series are mined for frequently appearing
subsequences. These methods perform subsequence clustering, with subsequences
being generated by a sliding window. Since we are interested in patterns that
occur in several time series at similar points in time and not in patterns that
occur in a single time series at arbitrary positions, those approaches cannot be
applied in our application scenario.
Subspace Clustering (2D) and TriClustering (3D). Subspace cluster-
ing [1,10,15,21] was introduced for high-dimensional (non-temporal) vector data,
where clusters are hidden in individual dimensional subsets of the data. Since
subspace clustering is achieved by simultaneous clustering of the objects and
dimensions of dataset, it is also known as 2D clustering. When subspace clus-
tering is applied to 3D data (objects, dimensions, time points), the time series
for the individual dimensions are concatenated to obtain a 2D space (objects,
concatenated dimensions). While subspace clustering is good for excluding irrel-
evant points in time, there are problems when it is applied to temporal data:
First, by the transformation described above, the correlation between the dimen-
sions is lost. Second, subspace clustering in general cannot exploit the natural
correlation between subsequent points in time, i.e. temporal coherence is lost.
Accordingly, for 3D data, Triclustering approaches were introduced [7,8,16,22],
which simultaneously cluster objects, dimensions, and points in time. Some specialized
Triclustering approaches cluster two related datasets together [6],
which is a fundamentally different concept from the one in this paper. Gen-
erally, Triclustering approaches can only find block-shaped clusters: A cluster is
defined by a set of objects, dimensions, and points in time [22] or intervals [7,16].
The points in time or intervals hold for all objects and dimensions of a cluster.
In contrast, our approach can mine clusters where each dimension has different,
independent relevant intervals.
To avoid isolated points in time, i.e. where time series are rather similar by
chance, we require that length(Int) > 1. The cluster property defines how sim-
ilarity between subsequences is measured and how similar they need to be in
order to be included in the same interval pattern. This property can be chosen
by specific application needs. Besides the cluster compactness, which is also used
by other subspace clustering methods [13,15], other distance measures applicable
for time series including DTW [3] can be used.
Fig. 2. Three interval patterns of a set of objects O in dimensions 1 and 2, covering the time periods 1–10, 25–40, and 35–40
Based on the introduced interval patterns, clusters are generated: for each
dimension, zero, a single, or even several interval patterns can exist in the cluster.
This allows our method to systematically exclude non-relevant intervals from the
cluster to better reflect the existing patterns in the analyzed data. However, not
all combinations of interval patterns correspond to reasonable temporal clusters.
The temporal coherence of the pattern is crucial. For example, let us consider a
set of objects O forming three interval patterns in the time periods 1-10, 25-40,
and 35-40, as illustrated in Fig. 2. Since for the remaining points in time no
pattern is detected, there is no temporal coherence of the patterns. In this case,
two individual clusters would reflect the data correctly.
To ensure temporal coherence, each point in time t ∈ T that is located between
the beginning a ∈ T and ending b ∈ T of a cluster, i.e. a ≤ t ≤ b, has to
be contained in at least one interval pattern of an arbitrary dimension. Thus,
by considering all dimensions simultaneously, the cluster has to form a single
connected interval, and each point in time can be included in several dimensions.
The higher the redundancy parameters λ_obj and λ_int are set, the more
time series (λ_obj) or intervals (λ_int) of C have to be covered by M for M
to be considered structurally similar to C. In the extreme case of λ_obj = λ_int =
1, C's time series and intervals have to be completely covered by M; in this
setting, only few clusters are categorized as redundant. By choosing smaller
values, redundancy occurs more often.
The final clustering Result must not contain structurally similar clusters in order to be
redundancy-free. Since, however, several clusterings fulfill this property, we intro-
duce a second structural property that allows us to choose the most interesting
redundancy-free clustering. On the one hand, a cluster is interesting if it contains
many objects, i.e. we get a strong generalization. On the other hand, a cluster
can represent a long temporal pattern but with less objects, corresponding to a
high similarity within the cluster. Since simultaneously maximizing both crite-
ria is contradictory, we introduce a combined objective function that realizes a
trade-off between the number of objects and the pattern length:
Interest(C) = |O| · Σ_{i=1}^{m} length(Int_i)
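A one-line illustration of this objective, assuming intervals are stored as inclusive (start, end) pairs:

def interestingness(cluster_objects, interval_patterns):
    """|O| times the summed length of the cluster's interval patterns,
    as defined above; the interval representation is an assumption."""
    total_length = sum(end - start + 1 for start, end in interval_patterns)
    return len(cluster_objects) * total_length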
4 Efficient Computation
In this section we present an efficient algorithm for the proposed model. Due
to space limitations we just present a short overview. Since calculating the op-
timal clustering according to Def. 5 is NP-hard, our algorithm determines an
approximate solution. The general processing scheme is shown in Fig. 3 and
basically consists of two cyclically processed phases to determine the clusters.
Thus, instead of generating the whole set of clusters Clusters and selecting the
subset Result afterwards, we iteratively generate promising clusters, which are
added to the result.
Phase 1: In the first phase of each cycle a set of cluster candidates is generated
based on the following procedure: A time series p acting as a prototype for
these candidates is randomly selected, and this prototype is contained in each
cluster candidate of this cycle. The cluster candidates, i.e. groups of time series
Oi , are obtained by successively adding objects xi to the previous group, i.e.
Oi+1 = Oi ∪ {xi } with O0 = {p}. Since the interestingness of a cluster depends
on its size, which is constant for O_i, and the length of the intervals, the choice
of x_i is completely determined by the latter. Accordingly, the best object
x_0 is the one that would induce the longest interval patterns w.r.t. p
(summed over all dimensions).
An interval pattern for the prototype p and x0 at the beginning can poten-
tially include each point in time. Interval patterns for the subsequent objects
xi , however, have to be restricted to the relevant intervals of Oi . Overall, we
generate a chain of groups Oi containing objects with a high similarity to p.
Based on these candidates we select the set O+ with the highest interestingness,
i.e. according to Def. 4 we combine the size with the interval lengths.
Phase 2: In the second phase, a cluster C for the object set O+ should be
added to the current result Resj . In the first cycle of the algorithm the result is
empty (Res0 = ∅), whereas in later cycles it is not. Thus, adding new clusters
could induce redundancy. Accordingly, for cluster C we determine those clusters
C′ ∈ Res_j with similar relevant intervals (cf. Def. 3). For this set we check
if a subset of clusters M covering similar objects as C exists. If not, we can
directly add C to the result. In case such a set M exists, we test whether the
(summed) interestingness of M is lower than the one of C. In this case, selecting
C and removing M is beneficial. As a further optimization we determine the
union of C’s and M ’s objects, resulting in a larger cluster U with potentially
smaller intervals. If U ’s interestingness exceeds the previous values, we select
this cluster. This procedure is especially useful if clusters of previous iterations
are not completely detected, i.e. some objects of the clusters were missed. This
step improves the quality of these clusters by adding further objects. Overall,
we generate a redundancy-free clustering solution and simultaneously maximize
the interestingness as required in Def. 5.
By completing the second phase we initiate the next cycle. In our algorithm
the number of cycles is not a priori fixed but it is adapted to the number of
detected clusters. We perform c · |Resj | cycles. The more clusters are detected,
the more prototypes should be drawn and the more cycles are performed. Thus,
our algorithm automatically adapts to the given data.
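As a rough illustration of the phase-1 candidate generation (a simplified sketch, not the authors' exact procedure), the following greedy growth assumes a helper `induced_interval_length` that returns the summed length of the interval patterns induced by a group of time series:

def grow_candidates(prototype, objects, induced_interval_length):
    """Starting from a randomly chosen prototype, greedily add the object
    that keeps the summed interval length largest, and keep the candidate
    group with the highest interestingness |O_i| * summed interval length."""
    group, best_group, best_score = [prototype], [prototype], 0
    remaining = [o for o in objects if o != prototype]
    while remaining:
        nxt = max(remaining, key=lambda o: induced_interval_length(group + [o]))
        group = group + [nxt]
        remaining.remove(nxt)
        score = len(group) * induced_interval_length(group)
        if score > best_score:
            best_group, best_score = list(group), score
    return best_group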
In the experimental evaluation, we will demonstrate the efficiency and effec-
tiveness of this algorithm w.r.t. large scale data.
(Figure: clustering quality, F1 measure [%], of TimeSC, MineClus, Proclus, kMeans, kMeans(stat.), MIC, and kMedoid(LCSS) w.r.t. (a) data set length, (b) data set dimensionality, (c) number of clusters, and (d) time series per cluster)
5 Experiments
We evaluate our TimeSC in comparison to six competing solutions, namely
kMeans, a kMeans using statistical features for multivariate time series [19],
Proclus [1], MineClus [21], and MIC [16]. We also included kMedoid, where we
used the Longest Common Subsequences (LCSS) [18] as a distance measure to
allow for partial comparison in the distance computation. We also compared
to TriCluster [22], which was provided by the authors on their webpage, but it
either delivered no result or the obtained accuracy was very low (≤ 3%); there-
fore we no longer included it in the experiments. For TimeSC, we used w = 30
for the compactness parameter. For the redundancy model, we used λobj = 0.5
and λint = 0.9. Some of the competing algorithms are not suitable for large
datasets; thus, in some experiments, we could not obtain all values. If not stated
otherwise, we use the following settings for our synthetic data generator: the
data space has an extent of [-100,+100], the time series length is 200, the cluster length
is 100, the dataset dimensionality is 10, the number of relevant dimensions per
cluster is 5, the number of clusters is 10, the average number of time series per
cluster is 25, and there is 10% noise (outliers) in the data. The experiments were
performed on AMD Opteron servers with 2.2 GHz per core and 256 GB RAM. In
the experiments, the F1 measure is used to assess the accuracy of the obtained
clusterings [5,13], and the reported values are averages over three runs.
Fig. 5. Performance w.r.t. different number of relevant dimensions and points in time
(Figure: runtime [sec] of TimeSC, MineClus, Proclus, kMeans, kMeans(stat.), MIC, and kMedoid(LCSS) w.r.t. (a) data set length, (b) data set dimensionality, (c) number of clusters, and (d) number of time series per cluster)
6 Conclusion
We introduced a novel model for subspace clustering of multivariate time se-
ries data. The clusters in our model are formed by individual sets of relevant
intervals per dimension, which together fulfill temporal coherence. We developed
a redundancy model to avoid structurally similar clusters and introduced an ap-
proximate algorithm for generating clusterings according to our novel model.
In the experimental comparison, we showed that our approach is efficient and
generates clusterings of higher quality than the competing methods.
Acknowledgments. This work has been supported by the UMIC Research
Centre, RWTH Aachen University.
References
1. Aggarwal, C.C., Procopiuc, C.M., Wolf, J.L., Yu, P.S., Park, J.S.: Fast algorithms
for projected clustering. In: ACM SIGMOD, pp. 61–72 (1999)
2. Dasu, T., Swayne, D.F., Poole, D.: Grouping Multivariate Time Series: A Case
Study. In: IEEE ICDMW, pp. 25–32 (2005)
3. Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., Keogh, E.: Querying and
mining of time series data: experimental comparison of representations and distance
measures. PVLDB 1(2), 1542–1552 (2008)
4. Fu, T.: A review on time series data mining. Engineering Applications of Artificial
Intelligence 24(1), 164–181 (2011)
5. Günnemann, S., Färber, I., Müller, E., Assent, I., Seidl, T.: External evaluation
measures for subspace clustering. In: ACM CIKM, pp. 1363–1372 (2011)
6. Hu, Z., Bhatnagar, R.: Algorithm for discovering low-variance 3-clusters from real-
valued datasets. In: IEEE ICDM, pp. 236–245 (2010)
7. Jiang, D., Pei, J., Ramanathan, M., Tang, C., Zhang, A.: Mining coherent gene clus-
ters from gene-sample-time microarray data. In: ACM SIGKDD, pp. 430–439 (2004)
8. Jiang, H., Zhou, S., Guan, J., Zheng, Y.: gTRICLUSTER: A More General and
Effective 3D Clustering Algorithm for Gene-Sample-Time Microarray Data. In: Li,
J., Yang, Q., Tan, A.-H. (eds.) BioDM 2006. LNCS (LNBI), vol. 3916, pp. 48–59.
Springer, Heidelberg (2006)
9. Kremer, H., Kranen, P., Jansen, T., Seidl, T., Bifet, A., Holmes, G., Pfahringer,
B.: An effective evaluation measure for clustering on evolving data streams. In:
ACM SIGKDD, pp. 868–876 (2011)
10. Kriegel, H. P., Kröger, P., Zimek, A.: Clustering high-dimensional data: A survey
on subspace clustering, pattern-based clustering, and correlation clustering. ACM
TKDD 3(1) (2009)
11. Liao, T.W.: Clustering of time series data - a survey. Pattern Recognition 38(11),
1857–1874 (2005)
12. Minnen, D., Starner, T., Essa, I.A., Isbell, C.: Discovering characteristic actions
from on-body sensor data. In: IEEE ISWC, pp. 11–18 (2006)
13. Müller, E., Günnemann, S., Assent, I., Seidl, T.: Evaluating clustering in subspace
projections of high dimensional data. PVLDB 2(1), 1270–1281 (2009)
14. Oates, T.: Identifying distinctive subsequences in multivariate time series by clus-
tering. In: ACM SIGKDD, pp. 322–326 (1999)
15. Procopiuc, C.M., Jones, M., Agarwal, P.K., Murali, T.M.: A monte carlo algorithm
for fast projective clustering. In: ACM SIGMOD, pp. 418–427 (2002)
16. Sim, K., Aung, Z., Gopalkrishnan, V.: Discovering correlated subspace clusters in
3D continuous-valued data. In: IEEE ICDM, pp. 471–480 (2010)
17. Singhal, A., Seborg, D.: Clustering multivariate time-series data. Journal of Chemo-
metrics 19(8), 427–438 (2005)
18. Vlachos, M., Gunopulos, D., Kollios, G.: Discovering similar multidimensional tra-
jectories. In: IEEE ICDE, pp. 673–684 (2002)
19. Wang, X., Wirth, A., Wang, L.: Structure-based statistical features and multivari-
ate time series clustering. In: IEEE ICDM, pp. 351–360 (2007)
20. Wu, E.H.C., Yu, P.L.H.: Independent Component Analysis for Clustering Multi-
variate Time Series Data. In: Li, X., Wang, S., Dong, Z.Y. (eds.) ADMA 2005.
LNCS (LNAI), vol. 3584, pp. 474–482. Springer, Heidelberg (2005)
21. Yiu, M.L., Mamoulis, N.: Frequent-pattern based iterative projected clustering. In:
IEEE ICDM, pp. 689–692 (2003)
22. Zhao, L., Zaki, M.J.: TriCluster: An effective algorithm for mining coherent clusters
in 3D microarray data. In: ACM SIGMOD, pp. 694–705 (2005)
A Vertex Similarity Probability Model for Finding
Network Community Structure
Abstract. Most methods for finding community structure are based on prior
knowledge of the network structure type. These methods can group communities
only when the network is known to be unipartite or bipartite. This paper presents a vertex
similarity probability (VSP) model which can find community structure without
prior knowledge of the network structure type. Vertex similarity assumes
that, for any type of network structure, vertices in the same community have
similar properties. In the VSP model, the common neighbor index is used to
measure the vertex similarity probability, as it has been proved to be an effective
index for vertex similarity. We apply the algorithm to real-world network data.
The results show that the VSP model is uniform for both unipartite networks and
bipartite networks, and that it is able to find the community structure successfully
without using the network structure type.
1 Introduction
As part of the recent surge of research on large, complex networks, attention has been
devoted to the computational analysis of complex networks [1-4]. Complex networks,
such as social networks and biological networks, are all highly dynamic objects which
grow and change quickly over time. These networks have a common feature, namely
“community structure”. Communities, also known as clusters or modules, are groups of
vertices which could share common properties and/or have similar roles within the
graph [5]. Finding community structure and clustering vertices in complex networks
is key to learning a complex network's topology, understanding its
functions, finding hidden patterns, link prediction, and evolution detection.
Through the analysis of community structure, researchers have achieved many results:
for example, in [6, 7], V. Spirin et al. revealed the relationship between protein function and
inherent interactions; in [8, 9], Flake et al. found the internal relations of hyperlinks and
the main page; in [10, 11], Moody et al. identified how social organizations evolve
over time; and so on.
The most popular method for finding community structure is the modularity
matrix method [12, 13] proposed by Newman et al., which is based on spectral
clustering. The modularity model proves that, if the type of the network structure is
known, modularity optimization is able to find community structure in both unipartite
and bipartite networks using the maximum or minimum eigenvalue, respectively.
Some researchers have then sought to detect communities in bipartite networks, such as Michael
J. Barber [14]. BRIM, proposed by Barber and his colleagues, can determine the number
of communities of a bipartite network. Furthermore, in [15], Barber and Clark use the
label-propagation algorithm (LPA) for identifying network communities. However,
the methods of [14, 15] cannot be used without knowing the type of the network.
There are other methods for finding community structure. Hierarchical clustering is
frequently adopted for finding community structure, in which vertices are grouped into
communities that are further subdivided into smaller communities, and so forth, as in
[12]. Clauset, Moore and Newman proposed the HRG model [16], which uses maximum likelihood
estimation to predict the probability of connections between vertices. Hierarchical
methods perform remarkably well on networks with a clear hierarchy, but are less impressive
otherwise. Moreover, hierarchical methods usually have high
computational complexity. In 2009, Roger Guimera and Marta Sales-Pardo proposed a
stochastic block model [17] based on HRG. Different from the traditional principle of
dividing a network so that connections are “dense inside and sparse outside”, in [17] the
probability that two vertices are connected depends on the blocks to which they belong.
However, the assumption that vertices in the same blocks have the same connection
probability is not accurate. Recently, Karrer and Newman [18] also proposed a
stochastic block model which considers the variation in vertex degree. This stochastic
block model solves the heterogeneous vertex degree problem and obtains better results
than previous approaches without degree correction. It can be used on both types of
networks, but different types of networks must nonetheless be dealt with separately.
In some cases, researchers have no prior knowledge of the network structure. For
example, when we know the interactions of vertices in a protein network, we may have
no knowledge of the network structure type. Moreover, when we obtain a network of
people's relationships in schools, the type of the network may be uncertain: if links
only exist between students, the network is unipartite; if links exist between students
and teachers, the network is bipartite. An effective method for finding community
structure in both unipartite and bipartite networks is therefore needed.
As discussed above, most methods deal with unipartite networks and
bipartite networks separately, because the properties of the networks differ between
the two structure types. Unipartite networks assume that connections
between vertices in the same community are dense and connections between communities are
sparse, as in social networks [19], biochemical networks [20] and information networks
[21]. However, some real networks are bipartite, with edges joining only vertices of
different communities, such as shopping networks [22], protein-protein interaction
networks [23], plant-animal mutualistic networks [24], scientific publication networks
[25], etc. Although the properties of “edges” in the two types of networks are different,
vertices in the same communities should be similar because vertices in the same
communities have similar properties. In this paper, we develop a uniform VSP model
based on vertex similarity. The VSP model can therefore be used on any
type of network as long as similar vertices are placed in the same communities. The VSP
model achieves good results in both the theoretical analysis and the experimental evaluation.
The paper is organized as follows. In Section 2, we prove that vertex similarity theory is
suitable for finding community structure. We present the VSP model and the method to
divide a network into two communities in Section 3. In Section 4, we report experiments
on both unipartite and bipartite networks. Compared with Newman's modularity, the
VSP model is an accurate, uniform model which can find community structure without
prior knowledge of the network structure type. Finally, we draw our conclusions.
The concept of community implies that vertices in the same community share
common properties, whether the network is unipartite or bipartite. This means that vertices in
the same community should be similar, although edges in the different network
structure types are connected in different ways. Therefore, we shift our focus from
“edges” to “vertices” for finding communities.
Vertex similarity is widely studied by researchers in complex network. It is
sometimes called structural similarity, to distinguish it from social similarity, textual
similarity, or other similarity types. It is a basic premise of research on networks that
the structure of a network reflects real information about the vertices the network
connects, so it is reasonable that meaningful structural similarity measures might exist
[26]. In general, if two vertices have a number of common neighbors, we believe that
these two vertices are similar. In community detection, we assume that two similar
vertices have similar properties and should be grouped in the same community.
Let Γ_x be the neighborhood of vertex x in a network, i.e., the set of vertices that
are directly connected to x via an edge. Then |Γ_x ∩ Γ_y| is the number of common
neighbors of x and y. The common neighbor index, Salton index, Jaccard index, Sorenson
index, LHN (Leicht-Holme-Newman) index, and Adamic-Adar index [27-31] are
well-known indices for vertex similarity. Many researchers have analyzed and compared
these methods; Liben-Nowell [32] and Zhou Tao [33] showed that the simplest
measure, the common neighbor index, performs surprisingly well. We therefore use the common
neighbor index to measure vertex similarity in our VSP model.
Definition 1. For two vertices x and y, if there is a vertex z to be the neighbor of x
and y at the same time, we call x and y a pair, denoted as pair(x, y). z is called the
common neighbor of pair(x, y).
Since vertices in the same community have similar properties, we assume that
vertices in the same community are similar vertices. The more similar the vertices
inside a community are, the more common neighbors they have. The number of
common neighbors N_ij of vertices i and j is given by

N_ij = |Γ_i ∩ Γ_j|,  and  N_ii = 0.
The sum of common neighbors over pairs of vertices in different communities, N_out, is
given by

N_out = Σ_{i,j ∉ same community} N_ij.

Therefore, the task of maximizing the number of common neighbors within the same
community is to obtain max(N_in), or equivalently min(N_out). The sum of common neighbors in
the network, R, is given by

R = (1/2) Σ_{i,j ∈ n} N_ij.
We define the adjacency matrix A to be the symmetric matrix with elements A_ij,
where A_ij = 1 if there is an edge joining vertices i and j, and A_ij = 0 otherwise. Define a_i as the
i-th column vector of A, so that A can be written as A = [a_1, a_2, ..., a_n]. The
vertex k is a common neighbor of vertices i and j if and only if A_ik A_kj = 1. Therefore, N_ij can be rewritten as

N_ij = Σ_k A_ik A_kj = a_i · a_j,

when i and j are two different vertices. As a_i · a_i = k_i, the matrix N is

N = A^T A − Λ_k,

where Λ_k = diag(k_1, k_2, ..., k_n). This allows us to rewrite R as

R = (1/2) Σ_{i,j ∈ n, i≠j} a_i · a_j
  = (1/2) (Σ_{i,j ∈ n} a_i · a_j − Σ_i k_i)
  = (1/2) (Σ_i a_i · (k_1, k_2, ..., k_n)^T − Σ_i k_i)
  = (1/2) ((k_1, k_2, ..., k_n) · (k_1, k_2, ..., k_n)^T − Σ_i k_i)
  = (1/2) Σ_i k_i (k_i − 1)    (1)
Definition 2. According to Eq. (1), R is a function of the vertex degrees only.
To analyze the relationship between a vertex x and the common neighbor index, we define
this function as the common neighbor degree index, denoted c_x. Let
c_x = k_x(k_x − 1)/2. Therefore, R = Σ_{x ∈ n} c_x.
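For illustration, the quantities N, c_x and R can be computed directly from the adjacency matrix; this sketch assumes a symmetric 0/1 matrix without self-loops.

import numpy as np

def common_neighbor_stats(A):
    """N = A^T A with zeroed diagonal, c_x = k_x (k_x - 1) / 2, and
    R = sum_x c_x, as defined above."""
    A = np.asarray(A)
    k = A.sum(axis=1)                      # vertex degrees
    N = A.T @ A - np.diag(k)               # common-neighbor counts, N_ii = 0
    c = k * (k - 1) / 2.0                  # common neighbor degree index
    R = c.sum()                            # total number of common neighbors
    return N, c, R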
Since the total number of common neighbors in the network equals the number of common
neighbors within the same communities plus the number of common neighbors across different
communities, R can also be written as R = N_in + N_out.
The following proves that using the common neighbor index for finding community
structure is suitable for both unipartite networks and bipartite networks.
For two vertices i and j in different communities, N_out is related to A_ij, k_i and k_j.
If there is an edge between i and j, A_out increases by 1 and N_out increases by k_i + k_j − 2. As
k_i + k_j − 2 ≥ 0, we consider A_out and N_out to have the same growth trend. This means
that obtaining min(N_out) is equivalent to obtaining min(A_out). This conclusion is in line with the
basic principle of unipartite network community detection.
For a bipartite network, the basic community detection principle is “edges inside
communities are sparse, outside are dense”. The task is to maximize Aout , written as
max( Aout ) .
In a bipartite network, almost all adjacent vertices are in different communities.
For a pair of vertices which are in the same community, the common neighbor should
be in a different community, as shown in Fig. 2.
As a result, for any pair of vertices i and j which are in the same community, N ij
have 2 N ij edges between different communities. In the overall network, each edge will
be counted ( ki + k j − 2) times.
Similar as section 2.1, we consider Aout and N in have same growth trend. It
means getting max( N in ) is equivalent to getting max( Aout ) . The conclusion is in line
with the basic principles of the bipartite network community detection.
In summary of Sections 2.1 and 2.2, the common neighbor index of vertex
similarity is suitable for finding community structure in both unipartite and bipartite
networks.
In this section, we propose our VSP model to find community structure. In [13],
Newman proved that a good division of a network into communities is one "in which
the number of edges inside groups is bigger than expected", which gives better results
than measures based purely on the number of edges between communities. Similarly, a
good division of a network into communities should be one in which the number of
common neighbors within communities is bigger than expected. We define such a quality
function Q below (Eq. (10)).
It is a function of how the network is divided into groups, with larger values indicating
stronger community structure. We build a random network in which the vertices have the same
common neighbor degrees as the vertices in the complex network, and take the expected
number of common neighbors to be the number in this random network. In
Section 2, we proved that the common neighbor index can be used to find
communities instead of edges. However, we can also see that N_out in a unipartite
network and N_in in a bipartite network are both affected not only by edges but also by
vertex degree. It is known that the common neighbor degree c_i is a function of k_i, and R
is the sum of the c_i. We use c_i to calculate the common neighbors in the random
network. The probability of a random vertex being a common neighbor of a particular
vertex i depends only on the expected common neighbor degree c_i. The probabilities
of a random vertex being a common neighbor of two different vertices are independent of
each other. This implies that the expected number of common neighbors P_ij between vertices
i and j is the product f(c_i) f(c_j) of separate functions of the two common
neighbor degrees, where the functions must be the same since P_ij is symmetric. Hence
f(c_i) = C c_i for some constant C, and
\sum_{i,j \in n} P_{ij} = \sum_i f(c_i) \sum_j f(c_j) = C^2 R^2.                   (7)
Vertices in the random network have the same common neighbor degrees as in the
complex network, so \sum_{i,j \in n} P_{ij} = \sum_{i,j \in n} N_{ij} = 2R. Hence C = \sqrt{2/R} and
f(c_i) = \sqrt{2/R} \, c_i.                                                        (8)
We get the expected number of common neighbors of pair(i, j) as follows,
P_{ij} = f(c_i) f(c_j) = \frac{2 c_i c_j}{R}.                                      (9)
The VSP model can be written as
Q = \frac{1}{2R} \sum_{i,j \in \text{same community}} \Big[ N_{ij} - \frac{2 c_i c_j}{R} \Big].   (10)
\sum_{i,j \in n} \frac{2 c_i c_j}{R} = \frac{2 \sum_{j \in n} c_j \sum_{i \in n} c_i}{R} = 2R = \sum_{i,j \in n} N_{ij}.   (11)
Thus,
\frac{1}{2} \sum_{i,j \in n} \Big[ N_{ij} - \frac{2 c_i c_j}{R} \Big] = 0.         (12)
Let
B_{ij} = N_{ij} - \frac{2 c_i c_j}{R}.                                             (13)
B is the VSP matrix, and \sum_{i,j \in n} B_{ij} = 0.
We use the VSP matrix instead of the modularity matrix to find the community
structure. In the VSP model, the higher the value of Q, the more similar the vertices in the
same community are. The model can be applied to both unipartite networks and bipartite
networks without knowing the exact type of network structure in advance, which makes it
more flexible than previous methods that handle the grouping separately according to the
type of the network structure.
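To make the construction concrete, the following is a minimal sketch (not the authors' code) of how the common neighbor matrix N, the common neighbor degrees c, the VSP matrix B of Eq. (13), and the quality Q of Eq. (10) could be computed from an adjacency matrix with NumPy; the function names and the toy graph are our own illustrative choices.

```python
import numpy as np

def vsp_matrix(A):
    """Build the VSP matrix B from a symmetric 0/1 adjacency matrix A (Eq. 13)."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)                      # vertex degrees k_i
    N = A.T @ A - np.diag(k)               # common neighbor counts, N_ii = 0
    c = k * (k - 1) / 2.0                  # common neighbor degrees c_i
    R = c.sum()                            # R = sum_i c_i
    B = N - 2.0 * np.outer(c, c) / R       # B_ij = N_ij - 2 c_i c_j / R
    return B, R

def vsp_quality(B, R, labels):
    """Q of Eq. (10): sum of B_ij over pairs in the same community, scaled by 1/(2R)."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    return B[same].sum() / (2.0 * R)

# toy example: two triangles joined by a single edge
A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
B, R = vsp_matrix(A)
print(vsp_quality(B, R, [0, 0, 0, 1, 1, 1]))   # the natural split scores higher
print(vsp_quality(B, R, [0, 1, 0, 1, 0, 1]))
```

Maximizing Q over the possible divisions (e.g., via the leading eigenvector of B, as done for the modularity matrix) then yields the community structure.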
4 Experimental Results
In this section, we apply the VSP model to a unipartite network and two bipartite
networks using Pajek [34]. The unipartite network is the dolphin social network
studied by Lusseau et al. [35]. The bipartite networks are the interactions of women
in the American Deep South at various social events [36] and the Scotland corporate
interlock in the early twentieth century [37].
Since we know the actual communities for the real networks, we measure the
accuracy of the VSP model by directly comparing with the known communities. We
make use of the normalized mutual information I_norm [38] for the comparison. When the
found communities match the real ones, we have I_norm = 1, and when they are
independent of the real ones, we have I_norm = 0.
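As a hedged illustration of this evaluation step (the paper does not specify an implementation, and scikit-learn's normalization may differ in detail from the definition in [38]), normalized mutual information between a found partition and the ground-truth partition can be computed from label vectors as follows; the label vectors are hypothetical:

```python
from sklearn.metrics import normalized_mutual_info_score

# hypothetical label vectors: ground-truth vs. detected community of each vertex
true_labels  = [0, 0, 0, 1, 1, 1]
found_labels = [0, 0, 1, 1, 1, 1]

i_norm = normalized_mutual_info_score(true_labels, found_labels)
print(f"I_norm = {i_norm:.3f}")   # 1.0 means a perfect match, 0.0 independence
```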
We compare the VSP model with the Modularity model on unipartite networks and
bipartite networks by three properties: the number of edges outside communities; Q of the
Modularity model, i.e., the number of edges within communities minus the expected number
of such edges, written as Q-Modularity; and I_norm.
Two red vertices are grouped into the green community in the VSP model, while
three red vertices are grouped into the green community in the Modularity model.
In a unipartite network, edges inside communities are dense, while those outside are
sparse. The number of edges outside communities should be small, Q-Modularity should be
large, and I_norm should be close to 1. Properties of the dolphin social network are shown in
Table 1. It shows that two properties of the VSP model are better than those of the Modularity
model in this unipartite network. The Q-Modularity of the VSP model is 0.381, which is
approximately equal to that of the Modularity model. The VSP model therefore performs well
in unipartite networks.
Fig. 3. Finding community structures of the dolphin social network (Modularity vs. VSP). The
red and green vertices represent the division of the network. The solid curve represents the
division of the VSP model. The dotted curve represents the division of the Modularity model.
Fig. 4. Finding community structures of the Southern women network using the VSP model
The Southern women data set describes the attendance of 18 women at 14 social
events and constitutes a bipartite network; an edge exists between a woman and a social
event if the woman attended the event. We use the VSP model to group this network
into two communities, as shown in Fig. 4. The VSP model groups the network
accurately into the two communities of "women" and "events".
Although other community detection methods can also obtain the same result, they
need to know the type of the network in advance. For example, with modularity one must
take the smallest value of the Modularity model rather than the largest, because the
Southern women network is a bipartite network.
As a second example of a bipartite network, we consider a data set on the Scotland
corporate interlock in the early twentieth century. The data set covers 108
Scottish firms, detailing the corporate sector, capital, and board of directors of each
firm. It includes only those board members who held multiple directorships,
totaling 136 individuals. Unlike the Southern women network, the Scotland corporate
interlock network is not connected. When Q-VSP reaches its maximum value, we obtain a
division into one community with 102 vertices and another with 142 vertices.
We also compare the three properties of the VSP model and the Modularity model. In a
bipartite network, edges inside communities are sparse, while those outside are dense.
The number of edges outside communities should be large, Q-Modularity should be small, and
I_norm should be close to 1. Properties of the Scotland corporate interlock are shown in
Table 2. All three properties of the VSP model are better than those of the Modularity model,
which shows that the VSP model also finds community structure accurately in bipartite networks.
In summary of Sections 4.1 and 4.2, the VSP model is a uniform model for finding
community structure that can be used in both unipartite networks and bipartite
networks. It is flexible and applicable to a wide range of networks. For instance, in a protein
network of which only the tip of the iceberg is known, the VSP model can find the
community structure from the topology of the network alone, even when we have no idea
of the type of the network structure.
5 Conclusion
In this paper, we define a VSP model for finding the community structure in complex
networks. The VSP model is based on vertex similarity measured by the common neighbor
index. As the common neighbor index has proved to be an effective measure of vertex
similarity in complex networks, it is applied in the VSP model to measure vertex
similarity. We prove that maximizing the number of common neighbors inside communities
is equivalent to minimizing the number of edges between communities in a
unipartite network and to maximizing it in a bipartite network.
Therefore, it is suitable for finding community structure in both unipartite and bipartite
networks. We then derive the expected number of common neighbors between any two
vertices and present the VSP model. Finally, we apply our model to the dolphin social
network, the Southern women event network, and the Scotland corporate interlock network.
The results show that the VSP model is effective for finding community structure
without needing to know the type of the network structure.
References
1. Wasserman, S., Faust, K.: Social Network Analysis. Cambridge University Press,
Cambridge (1994)
2. Boccaletti, S., Latora, V., Moreno, Y., Chavez, M., Hwang, D.U.: Complex networks:
Structure and dynamics. Physics Reports-Review Section of Physics Letters 424, 175–308
(2006)
3. Lu, L.Y., Zhou, T.: Link prediction in complex networks: A survey. Physica a-Statistical
Mechanics and Its Applications 390, 1150–1170 (2011)
4. Boguna, M., Krioukov, D., Claffy, K.C.: Navigability of complex networks. Nature
Physics 5, 74–80 (2009)
5. Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender systems: A
survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and
Data Engineering 17, 734–749 (2005)
6. Spirin, V., Mirny, L.A.: Protein complexes and functional modules in molecular networks.
Proceedings of the National Academy of Sciences of the United States of America 100,
12123–12128 (2003)
7. Chen, J.C., Yuan, B.: Detecting functional modules in the yeast protein-protein interaction
network. Bioinformatics 22, 2283–2290 (2006)
8. Flake, G.W., Lawrence, S., Giles, C.L., Coetzee, F.M.: Self-organization and identification
of web communities. Computer 35, 66–71 (2002)
9. Dourisboure, Y., Geraci, F., Pellegrini, M.: Extraction and classification of dense
communities in the web. In: Proceedings of the 16th International Conference on the World
Wide Web, pp. 461–470. ACM, New York (2007)
10. Moody, J., White, D.R.: Structural cohesion and embeddedness: A hierarchical concept of
social groups. American Sociological Review 68, 103–127 (2003)
11. Wellman, B.: The development of social network analysis: A study in the sociology of
science. Contemporary Sociology-a Journal of Reviews 37, 221–222 (2008)
12. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in networks.
Physical Review E 69 (2004)
13. Newman, M.E.J.: Finding community structure in networks using the eigenvectors of
matrices. Physical Review E 74 (2006)
14. Barber, M.J.: Modularity and community detection in bipartite networks. Physical Review
E 76 (2007)
15. Barber, M.J., Clark, J.W.: Detecting network communities by propagating labels under
constraints. Physical Review E 80 (2009)
16. Clauset, A., Moore, C., Newman, M.E.J.: Hierarchical structure and the prediction of
missing links in networks. Nature 453, 98–101 (2008)
17. Guimera, R., Sales-Pardo, M.: Missing and spurious interactions and the reconstruction of
complex networks. Proceedings of the National Academy of Sciences of the United States of
America 106, 22073–22078 (2009)
18. Karrer, B., Newman, M.E.J.: Stochastic blockmodels and community structure in networks.
Physical Review E 83 (2011)
19. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks.
Proceedings of the National Academy of Sciences of the United States of America 99,
7821–7826 (2002)
20. Holme, P., Huss, M., Jeong, H.W.: Subnetwork hierarchies of biochemical pathways.
Bioinformatics 19, 532–538 (2003)
21. Rosvall, M., Bergstrom, C.T.: An information-theoretic framework for resolving
community structure in complex networks. Proceedings of the National Academy of
Sciences of the United States of America 104, 7327–7331 (2007)
22. Chuma, J., Molyneux, C.: Coping with the costs of illness: The role of shops and
shopkeepers as social networks in a low-income community in coastal kenya. Journal of
International Development 21, 252–270 (2009)
23. Li, F., Long, T., Lu, Y., Quyang, Q., Tang, C.: The yeast cell-cycle network is robustly
designed. PNAS 101(14), 4781–4786 (2004)
24. Bascompte, J., Jordano, P., Melian, C.J., Olesen, J.M.: The nested assembly of plant-animal
mutualistic networks. Proceedings of the National Academy of Sciences of the United States
of America 100, 9383–9387 (2003)
25. Guimera, R., Uzzi, B., Spiro, J., Amaral, L.A.N.: Team assembly mechanisms determine
collaboration network structure and team performance. Science 308, 697–702 (2005)
26. Leicht, E.A., Holme, P., Newman, M.E.J.: Vertex similarity in networks. Physical Review
E 73 (2006)
27. Salton, G., McGill, M.J.: Introduction to modern information retrieval. McGraw-Hill,
Auckland (1983)
28. Jaccard, P.: Nouvelles recherches sur la distribution florale. Bulletin de la Societe Vaudoise
des Science Naturelles 44, 223–270 (1908)
29. Sørensen, T.: A method of establishing groups of equal amplitude in plant sociology based
on similarity of species content, and its application to analyses of the vegetation on Danish
commons. Det Kongelige Danske Videnskabernes Selskab. Biologiske Skrifter 5(4), 1–34
(1948)
30. Salton, G.: Automatic text processing: The transformation, analysis, and retrieval of
information by computer. Addison-Wesley, Boston (1989)
31. Adamic, L.A., Adar, E.: Friends and neighbors on the Web. Social Networks 25, 211–230
(2003)
32. Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks. Journal of
the American Society for Information Science and Technology 58, 1019–1031 (2007)
33. Zhou, T., Lu, L.Y., Zhang, Y.C.: Predicting missing links via local information. European
Physical Journal B 71, 623–630 (2009)
34. Pajek: http://vlado.fmf.uni-lj.si/pub/networks/pajek
35. Lusseau, D., Schneider, K., Boisseau, O.J., Haase, P., Slooten, E., Dawson, S.M.: The
bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting
associations - Can geographic isolation explain this unique trait? Behavioral Ecology and
Sociobiology 54, 396–405 (2003)
36. Davis, A., Gardner, B.B., Gardner, M.R.: Deep South. University of Chicago Press (1941)
37. Scott, J., Hughes, M.: The anatomy of Scottish capital: Scottish companies and Scottish
capital. Croom Helm, London (1980)
38. Danon, L., Diaz-Guilera, A., Arenas, A.: The effect of size heterogeneity on community
identification in complex networks. Journal of Statistical Mechanics-Theory and
Experiment, P11010 (2006)
Hybrid-ε-greedy for Mobile Context-Aware
Recommender System
1 Introduction
Mobile technologies have made a huge collection of information accessible anywhere
and anytime. Thereby, information is customized according to users' needs and preferences.
This brings big challenges for the Recommender System field. Indeed, the technical
features of mobile devices lead to navigation practices that are more difficult
than the traditional navigation task.
A considerable amount of research has been done on recommending relevant information
for mobile users. Earlier techniques [8, 10] are based solely on the computational
behavior of the user to model his interests, regardless of his surrounding environment
(location, time, nearby people). The main limitation of such approaches is that they do not
take into account the dynamicity of the user's context. This gives rise to another category
of recommendation techniques that tackle this limitation by building situation-aware user
profiles. However, these techniques still face the problem of how to recommend
information to the user so as to follow the evolution of his interests.
In order to give Mobile Context-aware Recommender Systems (MCRS) the capability
to provide the mobile user with information matching his/her situation and adapted to
the evolution of his/her interests, our contribution consists of mixing bandit algorithms
(BA) and case-based reasoning (CBR) methods in order to tackle these two issues:
The remainder of the paper is organized as follows. Section 2 reviews some related
works. Section 3 presents the proposed recommendation algorithm. The experimental
evaluation is described in Section 4. The last Section concludes the paper and points
out possible directions for future work.
2 Background
We review in the following recent relevant recommendation techniques that tackle
both issues, namely: following the evolution of the user's interests and managing the
user's situation.
the concepts in the user profile is established for every new experience of the user.
The user activity combined with the user profile is then used to filter and
recommend relevant content.
Another work [2] describes an MCRS operating on three dimensions of context that
complement each other to produce highly targeted recommendations. First, the MCRS analyzes
information such as clients' address books to estimate the level of social affinity among users.
Second, it combines social affinity with the spatiotemporal dimensions and the user's
history in order to improve the quality of the recommendations.
Each work cited above tries to recommend interesting information to users based on their
contextual situation; however, they do not consider the evolution of the user's interests.
To summarize, none of the mentioned works tackles both problems. This is precisely
what we intend to do with our approach, exploiting the following new features:
In what follows, we summarize the terminology and notation used in our contribution,
and then we detail our methods for inferring the recommendation.
User Preferences. Preferences are deduced during the user's navigation activities. They
contain the set of documents navigated during a situation. A navigation activity expresses
the following sequence of events: (i) the user logs into the system and navigates
across documents to get the desired information; (ii) the user expresses his/her preferences
on the visited documents. We assume that a visited document is relevant, and
thus belongs to the user's preferences, if there are some observable user behaviors,
expressed through two types of preference:
• The direct preference: the user expresses his interest in the document by giving a
rating, for example by putting stars ("*") at the top of the document.
• The indirect preference: information that we extract from the user-system
interaction, for example the number of clicks on or the time spent reading the visited
documents.
Let UP be the preferences submitted by a specific user to the system in a given situation.
Each document in UP is represented as a single vector d = (c1, ..., cn), where ci (i = 1,
..., n) is the value of a component characterizing the preferences on d. We consider the
following components: the total number of clicks on d, the total time spent reading d,
the number of times d was recommended, and the direct preference rating on d.
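As a minimal illustration (the representation below is our own sketch, not the authors' data model), such a preference vector could be encoded as:

```python
from dataclasses import dataclass

@dataclass
class DocumentPreference:
    """Preference components of one visited document d (order follows the text)."""
    clicks: int          # total number of clicks on d
    reading_time: float  # total time spent reading d, e.g. in seconds
    recommended: int     # number of times d was recommended
    rating: int          # direct preference rating (stars)

    def as_vector(self):
        return [self.clicks, self.reading_time, self.recommended, self.rating]

d = DocumentPreference(clicks=4, reading_time=73.5, recommended=2, rating=3)
print(d.as_vector())
```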
History. All the interactions between the user and the system are stored together with
the corresponding situations in order to exploit this data to improve the recommenda-
tion process.
Calendar. The user’s calendar has information concerning the user’s activities, like
meetings. Time and location information is automatically inferred by the system.
User Situation. A situation S is represented as a triple whose features X are the values
assigned to each dimension: S = (Xl, Xt, Xs), where Xl (resp. Xt and Xs) is the value of
the location (resp. time and social) dimension.
Suppose the user is associated with: the location "48.8925349, 2.2367939" from his
phone's GPS; the time "Mon Oct 3 12:10:00 2011" from his phone's clock; and a
meeting with Paul Gerard from his calendar. To build the situation, we associate with
this kind of low-level data, acquired directly from the mobile device's capabilities, more
abstract concepts obtained by ontology reasoning.
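The ontology lookups below are stubbed with simple rules and hypothetical concept names; this is only a sketch of the kind of low-level-to-abstract mapping the text describes, not the authors' implementation:

```python
# Hypothetical mapping from raw sensor values to abstract situation concepts.
def abstract_situation(gps, timestamp, calendar_entry):
    # location ontology lookup (stub): GPS coordinates -> semantic place
    x_l = "workplace" if gps.startswith("48.89") else "unknown_place"
    # time ontology lookup (stub): "HH:MM:SS" field -> period of the day
    hour = int(timestamp.split()[3].split(":")[0])
    x_t = "lunch_time" if 12 <= hour < 14 else "working_hours"
    # social ontology lookup (stub): calendar entry -> social context
    x_s = "professional_meeting" if "meeting" in calendar_entry.lower() else "alone"
    return (x_l, x_t, x_s)   # the situation triple S = (X_l, X_t, X_s)

print(abstract_situation("48.8925349, 2.2367939",
                         "Mon Oct 3 12:10:00 2011",
                         "Meeting with Paul Gerard"))
```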
• Task 1. It observes the current user u_t and a set A_t of arms together with their feature
vectors x_{t,a} for a ∈ A_t. The vector x_{t,a} summarizes information on both user u_t
and arm a, and is referred to as the context.
• Task 2. Based on observed rewards in previous trials, it chooses an arm a_t ∈ A_t and
receives a reward r_{t,a_t} whose expectation depends on both the user u_t and the arm a_t.
In tasks 1 to 3, the total T-trial reward of A is defined as \sum_{t=1}^{T} r_{t,a_t}, while the optimal
expected T-trial reward is defined as E[\sum_{t=1}^{T} r_{t,a_t^*}], where a_t^* is the arm with
maximum expected reward at trial t. Our goal is to design the bandit algorithm so that the
expected total reward is maximized.
In the field of document recommendation, we may view documents as arms. When
a document is presented to the user and the user selects it by a click, a reward of 1 is
incurred; otherwise, the reward is 0. With this definition of reward, the expected reward
of a document is precisely its Click Through Rate (CTR). The CTR is the average
number of clicks on a recommended document, computed by dividing the total number
of clicks on it by the number of times it was recommended. Consequently, choosing the
document with maximum CTR is equivalent, in our bandit algorithm, to maximizing
the total expected reward.
sim_j(X_j^c, X_j^i) = 2 \cdot \frac{depth(LCS)}{depth(X_j^c) + depth(X_j^i)}       (2)
Here, LCS is the Least Common Subsumer of X_j^c and X_j^i, and depth is the number
of nodes on the path from the node to the ontology root.
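A minimal sketch of this depth-based similarity (Eq. (2)), assuming the ontology is available as a simple child-to-parent mapping; the toy ontology and helper names are ours, not from the paper:

```python
# Eq. (2) on a toy ontology represented as a child -> parent dictionary (root has parent None).
parent = {"thing": None, "place": "thing", "restaurant": "place",
          "workplace": "place", "time": "thing", "lunch_time": "time"}

def path_to_root(node):
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path                      # e.g. ["restaurant", "place", "thing"]

def depth(node):
    return len(path_to_root(node))   # number of nodes from node to the root

def lcs(a, b):
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):     # first shared node is the least common subsumer
        if node in ancestors_a:
            return node

def sim(a, b):
    return 2.0 * depth(lcs(a, b)) / (depth(a) + depth(b))

print(sim("restaurant", "workplace"))   # 2*2/(3+3) ≈ 0.67
print(sim("restaurant", "lunch_time"))  # 2*1/(3+3) ≈ 0.33
```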
RecommendDocuments() (Alg. 3)
In order to ensure better precision of the recommender results, the recommendation
takes place only if the following condition is verified: sim(S^c, S^s) ≥ B (Alg. 3),
where B is a threshold value and
sim(S^c, S^s) = \sum_j sim_j(X_j^c, X_j^s).
cossim(d, d^b) = \frac{d \cdot d^b}{\|d\| \cdot \|d^b\|} = \frac{\sum_k c_k \, c_k^b}{\sqrt{\sum_k c_k^2} \cdot \sqrt{\sum_k (c_k^b)^2}}   (3)
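A short sketch of the cosine similarity of Eq. (3) between two preference vectors (the function name and example vectors are ours):

```python
import math

def cos_sim(d, d_b):
    """Cosine similarity between two preference vectors (Eq. 3)."""
    dot = sum(c * c_b for c, c_b in zip(d, d_b))
    norm = math.sqrt(sum(c * c for c in d)) * math.sqrt(sum(c * c for c in d_b))
    return dot / norm if norm else 0.0

print(cos_sim([4, 73.5, 2, 3], [1, 20.0, 5, 4]))
```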
Output: D
D = Ø
For i = 1 to N do
  q = Random([0, 1])
  j = Random([0, 1])
        | argmax_{d ∈ (UP − D)} getCTR(d)                    if j < q < ε
  d_i = | CBF(UP − D, argmax_{d ∈ (UP − D)} getCTR(d))       if q ≤ j ≤ ε
        | Random(UP)                                         otherwise
  D = D ∪ {d_i}
Endfor
Return D
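The following is a hedged, runnable sketch of this hybrid ε-greedy selection. `get_ctr` and `cbf` are hypothetical stand-ins for the CTR lookup and the content-based filtering step described in the text, and the branching conditions mirror Alg. 3 as printed above; this is not the authors' code.

```python
import random

def hybrid_eps_greedy(up, n, eps, get_ctr, cbf):
    """Pick N documents from the preference set UP, mixing exploitation,
    content-based exploration around the best document, and random exploration."""
    selected = []
    for _ in range(n):
        candidates = [d for d in up if d not in selected]
        if not candidates:
            break
        q, j = random.random(), random.random()
        best = max(candidates, key=get_ctr)
        if j < q < eps:                     # exploit: document with highest CTR
            doc = best
        elif q <= j <= eps:                 # explore near the best via content-based filtering
            doc = cbf(candidates, best)
        else:                               # explore: random document
            doc = random.choice(candidates)
        selected.append(doc)
    return selected

# toy usage with hypothetical CTR values and a trivial CBF stand-in
ctr = {"d1": 0.30, "d2": 0.10, "d3": 0.25, "d4": 0.05}
docs = hybrid_eps_greedy(list(ctr), n=3, eps=0.5,
                         get_ctr=ctr.get,
                         cbf=lambda cands, best: cands[0])
print(docs)
```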
Output: D
D = Ø
(S^s, UP^s) = RetrieveCase(S^c, PS)
if sim(S^c, S^s) ≥ B then
  D = RecommendDocuments(ε, UP^s, N)
  UP^c = UpdatePreferences(UP^s, D)
  if sim(S^c, S^s) ≠ 3 then
    PS = InsertCase(S^c, UP^c)
  else
    PS = UpdateCase(S^p, UP^c)
  end if
else
  PS = InsertCase(S^c, UP^c)
end if
Return D
4 Experimental Evaluation
In order to empirically evaluate the performance of our algorithm, and in the absence
of a standard evaluation framework, we propose an evaluation framework based on a
diary study entries. The main objectives of the experimental evaluation are: (1) to find
the optimal threshold B value of step 2 (Section 3.3) and (2) to evaluate the perfor-
mance of the proposed hybrid ε-greedy algorithm (Alg. 3) w. r. t. the optimal ε value
and the dataset size. In the following, we describe our experimental datasets and then
present and discuss the obtained results.
Each diary situation entry represents the capture, for a certain user, of contextual
information: time, location and social information. For each entry, the captured data are
replaced with more abstract information using the ontologies. For example, situation 1
becomes as shown in Table 2.
From the diary study, we obtained a total of 342,725 entries concerning user navigation,
with an average of 20.04 entries per situation. Table 3 illustrates an example of such
diary navigation entries, for example the number of clicks on a document (Click), the time
spent reading a document (Time), and the user's direct interest expressed by stars (Interest),
where the maximum is five stars.
In order to evaluate the precision of our technique in identifying similar situations, and
particularly to set the similarity threshold value, we propose to use a manual
classification as a baseline and compare it with the results obtained by our technique.
So, we manually group similar situations and compare the manually constructed
groups with the results obtained by our similarity algorithm, using different threshold
values.
Figure 1 shows the effect of varying the situation similarity threshold parameter B
in the interval [0, 3] on the overall precision P. The results show that the best performance
is obtained when B has the value 2.4, achieving a precision of 0.849. Consequently,
we use the identified optimal threshold value (B = 2.4) of the situation similarity
measure for testing the effectiveness of our MCRS presented below.
As seen from these figures, when the parameter ε is too small, there is insufficient
exploration; consequently the algorithms failed to identify relevant documents and
had a smaller number of clicks. Moreover, when the parameter is too large, the algorithms
seemed to over-explore and thus wasted some of the opportunities to increase
the number of clicks. Based on these results, we choose appropriate parameters for
each algorithm and run them once on the evaluation data.
We can conclude from the plots that CBR information is indeed helpful for finding a
better match between user interest and document content. The CBF also helps hybrid-ε-
greedy in the learning subset by selecting more attractive documents to recommend.
5 Conclusion
This paper describes our approach for implementing an MCRS. Our contribution is to
balance exploration and exploitation for learning and maintaining the user's
interests based on his/her navigation history.
We have presented an evaluation protocol based on real mobile navigation data and
evaluated our approach according to this protocol. The study leads to the
conclusion that considering the situation in the exploration/exploitation strategy
significantly increases the performance of the recommender system in following
the user's interests.
In the future, we plan to compute the weights of each context dimension and
consider them in the detection of the user's situation, and then to extend our
situations with more context dimensions. Regarding the bandit algorithms, we plan to
investigate methods that automatically learn the optimal exploitation/exploration
tradeoff.
References
1. Bellotti, V., Begole, B., Chi, E.H., Ducheneaut, N., Fang, J., Isaacs, E., King, T., Newman,
M.W., Walendowski, A.: Scalable architecture for context-aware activity-detecting mobile
recommendation system. In: 9th IEEE International Symposium on a World of Wireless,
WOWMOM 2008 (2008)
2. Lakshmish, R., Deepak, P., Ramana, P., Kutila, G., Dinesh, G., Karthik, V., Shivkumar,
K.: A Mobile context-aware, Social Recommender System for Low-End Mobile Devices.
In: Tenth International Conference on Mobile Data Management: CAESAR 2009, pp.
338–347 (2009)
3. Lihong, L., Wei, C., Langford, J., Schapire, R.E.: A Contextual-Bandit Approach to Per-
sonalized News Article Recommendation. Presented at the Nineteenth International Confe-
rence on World Wide Web, CoRR 2010, Raleigh, vol. abs/1002.4058 (2010)
4. Bouidghaghen, O., Tamine-Lechani, L., Boughanem, M.: Dynamically Personalizing
Search Results for Mobile Users. In: Andreasen, T., Yager, R.R., Bulskov, H., Christian-
sen, H., Larsen, H.L. (eds.) FQAS 2009. LNCS, vol. 5822, pp. 99–110. Springer, Heidel-
berg (2009)
5. Panayiotou, C., Maria, I., Samaras, G.: Using time and activity in personalization for the
mobile user. In: On Fifth ACM International Workshop on Data Engineering for Wireless
and Mobile Access, MobiDE 2006, pp. 87–90 (2006)
6. Peter, S., Linton, F., Joy, D.: OWL: A Recommender System for Organization-Wide
Learning. In: Educational Technology, ET 2000, vol. 3, pp. 313–334 (2000)
7. Bianchi, R.A.C., Ros, R., Lopez de Mantaras, R.: Improving Reinforcement Learning by
Using Case Based Heuristics. In: McGinty, L., Wilson, D.C. (eds.) ICCBR 2009. LNCS,
vol. 5650, pp. 75–89. Springer, Heidelberg (2009)
8. Samaras, G., Panayiotou, C.: Personalized Portals for the Wireless and Mobile User. In:
Proc. 2nd Int’l Workshop on Mobile Commerce, ICDE 2003, p. 792 (2003)
9. Shijun, L., Zhang, Y., Xie, Sun, H.: Belief Reasoning Recommendation Mashing up Web
Information Fusion and FOAF:JC 2010. Journal of Computers, JC 02(12), 1885–1892
(2010)
10. Varma, V., Kushal, S.: Pattern based keyword extraction for contextual advertising. In:
Proceedings of the 19th ACM Conference on Information and Knowledge Management,
CIKM (2010)
11. Wei, L., Wang, X., Zhang, R., Cui, Y., Mao, J., Jin, R.: Exploitation and Exploration in a
Performance based Contextual Advertising System. In: Proceedings of the 16th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD
2010. ACM (2010)
12. Wu, Z., Palmer, M.: Verb Semantics and Lexical Selection. In: Proceedings of the 32nd
Annual Meeting of the Association for Computational Linguistics, ACL 1994, pp. 133-138
(1994)
13. Zhang, T., Iyengar, V.: Recommender systems using linear classifiers. The Journal of Ma-
chine Learning Research, JMLR 2, 313–334 (2002)
Unsupervised Multi-label Text Classification
Using a World Knowledge Ontology
Xiaohui Tao¹, Yuefeng Li², Raymond Y.K. Lau³, and Hua Wang¹
¹ Centre for Systems Biology, University of Southern Queensland, Australia
{xtao,hua.wang}@usq.edu.au
² Science and Engineering Faculty, Queensland University of Technology, Australia
[email protected]
³ Department of Information Systems, City University of Hong Kong, Hong Kong
[email protected]
1 Introduction
The increasing availability of documents in the past decades has greatly promoted
the development of information retrieval and organising systems, such as
search engines and digital libraries. The widespread use of digital documents has
also increased these systems' accessibility to textual information. A fundamental
theory supporting these information retrieval and organising systems is that
information can be associated with semantically meaningful categories. Such a
theory also supports ontology learning, text categorisation, information filtering,
text mining, and text analysis. Text classification aims at associating textual
documents with semantically meaningful categories, and has been studied over
the past decades, along with the development of information retrieval and
organising systems [11].
Text classification is the process of classifying an incoming stream of documents
into predefined categories. It usually employs a supervised learning strategy, with
the classifiers learned from pre-classified sample documents. The classifiers are then
used to classify incoming documents. In terms of supervised text classification, the
performance is determined by the accuracy of the pre-classified training samples and
the quality of the categorisation. The accuracy of classifiers determines their capability
of differentiating the incoming
2 Related Work
Unsupervised text classification aims to classify documents into classes in the
absence of any labelled training documents. On many occasions the target classes
may not have any labelled training documents available. One particular example
is the "cold start" problem in recommender systems and social tagging. Unsupervised
classification can automatically learn an annotation model to make
recommendations or label the tags when the products or tags are rare and do
not have any useful information associated with them. Unsupervised classification has
been studied by many groups and many successful models have been proposed. Without
associated training samples, Yang et al. [16] built a classification model for a
target class by analysing correlating auxiliary classes. Although similar to theirs
in investigating correlating classes, our work differs by exploiting a hierarchical
world knowledge ontology for classification, instead of only auxiliary classes.
Also exploiting a world knowledge base, Yan et al. [14] examined unsupervised
relation extraction from Wikipedia articles and integrated linguistic
analysis with web frequency information to improve unsupervised classification
performance. However, our work has different aims from theirs; ours aims to
exploit a world knowledge ontology to help unsupervised classification, whereas
Yan et al. [14] aim to extract semantic relations for Wikipedia concepts by using
unsupervised classification techniques. Cai et al. [2] and Houle and Grira [6] proposed
unsupervised approaches to evaluate and improve the quality of feature
selection. Given a set of data, their work finds a subset containing the most
informative, discriminative features. Though the work presented in this paper
also relies on features selected from documents, the features are further investigated
together with the ontological concepts they refer to, in order to improve the performance
of classification.
Text classification models were originally designed to handle only single-label
problems, where each document is classified into only one class. However, in
many circumstances single-label text classification cannot satisfy the demand;
for example, in a social network multiple labels may need to be suggested for a
tag [8]. Compared with the work done by Katakis et al. [8], our work relies on
the semantic content of documents, rather than the meta-data of documents used
in [8]. Like the work conducted by Yang et al. [15], our work also targets
multi-label text classification. However, Yang et al.'s [15] work differs in
adopting active learning algorithms for multi-label classification, whereas ours
exploits concepts and their structure in world knowledge ontologies.
Ontologies have been studied and exploited by many works to facilitate text
classification. Gabrilovich and Markovitch [5] enhanced text classification by
generating features using domain-specific and common-sense knowledge in large
ontologies with hundreds of thousands of concepts. Compared with their work,
our work moves beyond feature discovery and investigates the hierarchical ontology
structure for knowledge generalisation to improve text classification. Camous
et al. [3] also introduced a domain-independent method that uses the Medical
Subject Headings (MeSH) ontology. The method observes the inter-concept relationships
and represents documents by MeSH subjects. Similarly, Camous' work
The world knowledge ontology is constructed from the Library of Congress Sub-
ject Headings (LCSH), which is a knowledge system developed for organising
information in large library collections. It has been under continuous develop-
ment for over a hundred years to describe and classify human knowledge. Because
of the endeavours dedicated by the knowledge engineers from generation to gen-
eration, the LCSH has become a de facto standard for concept cataloguing and
indexing, superior to other knowledge bases. Tao et al. [12] once compared the
LCSH with the Library of Congress Classification, the Dewey Decimal Classifica-
tion, and Yahoo! categorisation, and reported that the LCSH has broader topic
coverage, more meaningful structure, and more accurate semantic relations. The
LCSH has been widely used as a means for many knowledge engineering and
management works [4]. In this work, the class set S = {s1 , . . . , sK } is encoded
from the LCSH subject headings.
Definition 1. (SUBJECT) Let S be the set of subjects; an element s ∈ S is a
4-tuple s := ⟨label, neighbour, ancestor, descendant⟩, where
– label is a set of sequential terms describing s; label(s) = {t1, t2, . . . , tn};
– neighbour refers to the set of subjects in the LCSH that directly link to s,
neighbour(s) ⊂ S;
– ancestor refers to the set of subjects directly or indirectly linking to s and
located at a more abstract level than s in the LCSH, ancestor(s) ⊂ S;
– descendant refers to the set of subjects directly or indirectly linking to s and
located at a more specific level than s in the LCSH, descendant(s) ⊂ S.
The semantic relationships of subjects are encoded from the references defined in
the LCSH for subject headings, including Broader Term, Used for, and Related
to. The ancestor(s) in Definition 1 returns the Broader Term subjects of s; the
descendant(s) is the reversed function of ancestor(s), with additional subjects
Used for s; the neighbour(s) returns the subjects Related to s.
With Definition 1, the world knowledge ontology is defined:
Definition 2. (ONTOLOGY) Let O be a world ontology. O contains a set of
subjects linked by their semantic relations in a hierarchical structure. O is a
3-tuple O := ⟨S, R, H_R^S⟩, where
– S is the set of subjects defined in Definition 1;
– R is the set of relations linking any pair of subjects;
– H_R^S is the hierarchical structure of O constructed by S × R.
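A minimal sketch of how the subject and ontology structures of Definitions 1 and 2 could be represented in code; the class and field names are our own illustrative choices, not from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class Subject:
    """A subject s = <label, neighbour, ancestor, descendant> (Definition 1)."""
    label: list[str]                                    # sequential terms describing s
    neighbour: set[str] = field(default_factory=set)    # subjects Related to s
    ancestor: set[str] = field(default_factory=set)     # Broader Term subjects of s
    descendant: set[str] = field(default_factory=set)   # narrower / Used-for subjects

@dataclass
class Ontology:
    """The world ontology O = <S, R, H_R^S> (Definition 2); the hierarchy is kept
    implicitly through each subject's ancestor/descendant sets."""
    subjects: dict[str, Subject] = field(default_factory=dict)

    def add(self, name, subject):
        self.subjects[name] = subject

o = Ontology()
o.add("Data mining", Subject(label=["Data", "mining"],
                             ancestor={"Database searching"},
                             descendant={"Web mining"}))
print(o.subjects["Data mining"].ancestor)
```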
– ∀p ∈ F(d), p ⊆ d.
– ∀p1, p2 ∈ F(d) (p1 ≠ p2), p1 ⊄ p2 ∧ p2 ⊄ p1.
– ∀p ∈ F(d), w(p) ≥ ϑ, a threshold.
Definition 4. (TERM-SUBJECT MATRIX) Let T be the term space of S, T =
{t ∈ ∪_{s∈S} label(s)}; ⟨S, T⟩ is the matrix coordinated by T and S, where a mapping
exists:
μ : T → 2^S, μ(t) = {s ∈ S | t ∈ label(s)}
and its reverse mapping also exists:
μ^{-1} : S → 2^T, μ^{-1}(s) = {t ∈ T | t ∈ label(s)}.
where I(z) is an indicator function that outputs 1 if z is true and zero otherwise;
f(d) = {p | ⟨p, w(p)⟩ ∈ F(d)}; g(ρ) = {t ∈ ∪_{p∈ρ} p}; h(τ) = {s ∈ ∪_{t∈τ} μ(t)}.
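The prediction function itself (Eq. (1)) did not survive extraction here, but the mappings above already suggest the initial classification step: collect the terms of a document's feature patterns and look up every subject whose label contains one of them. A hedged sketch of that lookup (the helper names and toy labels are ours):

```python
# Sketch of the initial feature-to-subject mapping built on mu (Definition 4).
# subject_labels plays the role of label(s); F_d is F(d) as {pattern: weight}.
subject_labels = {
    "Data mining": {"data", "mining"},
    "Machine learning": {"machine", "learning"},
    "Libraries": {"library", "libraries"},
}

def mu(term):
    """mu(t) = set of subjects whose label contains term t."""
    return {s for s, terms in subject_labels.items() if term in terms}

def initial_subjects(F_d):
    """S^I(d): subjects reached from the terms of d's feature patterns."""
    terms = {t for p in F_d for t in p}               # g(f(d))
    return {s for t in terms for s in mu(t)}          # h(g(f(d)))

F_d = {("data", "mining"): 0.8, ("machine", "learning", "data"): 0.5}
print(initial_subjects(F_d))
```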
The initial classification process easily generates noisy subjects because of the direct
feature-subject mapping. To counter this problem, we introduce a method that generalises
the initial subjects to optimise the classification. We observed that in the initial
classification some subjects extracted from the ontology overlap in their
semantic space. Thus, we can optimise the classification result by keeping only
the dominating subjects and pruning away those being dominated. This can be
done by investigating the semantic relations existing between subjects. Let s1
and s2 be two subjects with s1 ∈ ancestor(s2) (s2 ∈ descendant(s1)). s1 refers
to a broader semantic space than s2 and is thus more general. Vice versa, s2
is more specific and focused than s1. Hence, if some subjects are covered by a
common ancestor, they can be replaced by the common ancestor without information
loss. The common ancestor need not be chosen from the initial
classification result, as choosing an external common ancestor also satisfies the
above rule. After generalising the initial classification result, we have a smaller
set of subject classes, with no information lost but some focus. (The handling of
the focus problem is presented in the next section.)
4 Implementation
In this section, we present the technical details for implementing the proposed
approach of unsupervised multi-label text classification.
Algorithm 1 describes the process of extracting features to represent a document.
The output is F(d), a set of closed frequent sequential patterns discovered
from d. Adopting the prediction in Eq. (1) with F(d), the initial set of subjects
S^I(d) can be assigned to classify d. Taking into account the weights of the feature
patterns, we can evaluate each term t ∈ d:
w(t) = \sum_{p \in f(d), \, t \in p} w(p).
All s ∈ S^I(d) can then be re-evaluated for their likelihood of being assigned to
d, with consideration of the term evaluation and the term distribution in s ∈ S^I(d). A
prediction function can then be used to assess the initial classification subjects for
the second run of classification:
\hat{y}_i^{\kappa} = I\Big( \sum_{t \in \mu^{-1}(s^{\kappa})} w(t) \times \log\Big( \frac{|S^I(d_i)|}{sf(t, S^I(d_i))} \Big) \geq \theta \Big), \quad i = 1, . . . , m,   (2)
where θ is a threshold for filtering out noisy subjects. In the experiments, different values
were tested for θ. The results revealed that setting θ as the top fifth z in S^I(d_i), a variable
rather than a static value, gave the best performance. (Refer to Section 6 for detail.)
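A hedged sketch of this second-run scoring (Eq. (2)). Here sf is taken to be the number of initial subjects whose label contains the term, and θ is set as the fifth-highest score, one possible reading of "the top fifth z"; both interpretations and all names are ours:

```python
import math

def score_subject(subject_terms, w, initial_subjects, label_terms):
    """Score one candidate subject s^k against document d_i (cf. Eq. (2))."""
    n_subjects = len(initial_subjects)
    total = 0.0
    for t in subject_terms:                                   # t in mu^{-1}(s^k)
        sf = sum(1 for s in initial_subjects if t in label_terms[s])
        if sf:
            total += w.get(t, 0.0) * math.log(n_subjects / sf)
    return total

def second_run(initial_subjects, label_terms, w, keep=5):
    """Keep the subjects whose score reaches the data-dependent threshold theta."""
    scores = {s: score_subject(label_terms[s], w, initial_subjects, label_terms)
              for s in initial_subjects}
    theta = sorted(scores.values(), reverse=True)[min(keep, len(scores)) - 1]
    return {s for s, z in scores.items() if z >= theta}
```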
In the generalisation phase, descendant subjects are replaced by their common
ancestor subject. However, the common ancestor should not be too far away from
the replaced descendants in the ontology structure; otherwise the focus will be significantly
lost. In the implementation, we use only the lowest common ancestor
(shortened to LCA) to replace its descendant subjects. The LCA is the common
ancestor of a set of subjects with the shortest distance to these subjects in the
ontology structure. The LCA replaces descendant subjects with full information
kept and minimal loss of focus.
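A minimal sketch of the lowest-common-ancestor lookup used in this generalisation step, on an ontology given as a child-to-parents mapping; the three-edge limit mirrors the restriction discussed below, the tie-breaking by total distance is our reading of "shortest distance", and all names are ours:

```python
from collections import deque

def ancestors_within(graph, start, max_edges=3):
    """All ancestors reachable from `start` within `max_edges` Broader-Term links."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_edges:
            continue
        for parent in graph.get(node, []):
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist

def lowest_common_ancestor(graph, subjects, max_edges=3):
    """LCA of a set of subjects: a shared ancestor with minimal total distance."""
    maps = [ancestors_within(graph, s, max_edges) for s in subjects]
    shared = set.intersection(*(set(m) for m in maps))
    if not shared:
        return None
    return min(shared, key=lambda a: sum(m[a] for m in maps))

graph = {"Web mining": ["Data mining"], "Text mining": ["Data mining"],
         "Data mining": ["Database searching"]}
print(lowest_common_ancestor(graph, ["Web mining", "Text mining"]))  # Data mining
```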
Algorithm 2 describes the process of generalising the initial subject classes to
optimise classification. The function str(i, s) describes the likelihood of assigning
s to d_i and returns the value of I(z ≥ θ) in the prediction function \hat{y}_i^{\kappa} in Eq. (2).
The function δ(s1 → s2) returns a positive real number indicating the distance
between two subjects. Such a distance is measured by counting the number of
edges travelled through from s1 to s2 in H_R^S. The function LCA(S(s1) ∪ S(s2))
returns s, the LCA of s1 and s2 . Note that δ(s1 → s2 ) ≤ 3, which restricts
LCAs to three edges in distance. Subjects further than that in distance are too
general; whereas using a highly-general subject for generalisation would severely
jeopardise the focus of the original subjects. (In the experiments, δ(s1 → s2) ≤ 3
and ≤ 5 were tested under the same environment in order to find a valid distance
for tracking the competent LCA. The testing results revealed that a distance of
three performed better.)
5 Evaluation
The experiments were performed using a large corpus collected from the catalogue
of a library that uses the LCSH for information organising. The title and
content of each catalogue item were used to form the content of a document.
The subject headings associated with the catalogue items were manually assigned
by specialist librarians who were trained to specify subjects for documents with-
out bias [4]. The documents and subjects provided an ideal ground truth in the
experiments to evaluate the effectiveness of the proposed classification method.
This objective evaluation methodology assured the solidity and reliability of the
experimental evaluation.
The testing set was crawled from the online library catalogue of the University
of Melbourne1. General text pre-processing techniques, such as stopword
removal and word stemming (the Porter stemming algorithm), were applied in the
preparation of the testing set for the experiments. In the experiments, we used only
documents containing at least 30 terms, resulting in 31,902 documents in the testing
set. Shorter documents could hardly provide substantial frequent
patterns for feature extraction, as revealed in the preliminary experiments.
Given that the LCSH ontology contains 394,070 subjects in our implementation,
the problem actually became a K-class text classification problem where
K = |S| = 394,070, a very large number. Hence, we chose two typical multi-class
classification approaches, Rocchio and kNN, as the baseline models in the
experiments.
The performance of the experimental models was measured by precision and recall,
the standard evaluation methods in information retrieval and organising. In
terms of text classification, precision measures the ability of a method to
assign a document only its focused subjects, and recall the ability to assign a
document all of its relevant subjects.
Taking into account K = |S| = 394,070, with respect to the testing document
set and the ground truth featured by the LCSH, the classification performance
was evaluated by:
the respective accuracy rate. As shown in the figure, the effectiveness rates were
measured by precision, recall, and F1 Measure, where P (x) refers to the preci-
sion results of experimental model x, R(x) the recall results, and F (x) the F1
Measure results. Their overall average performances are shown in Table 1.
F1 Measure equally considers both precision and recall. Thus the F1 Measure
results can be deemed as an overall effectiveness performance. The average F1
Measure result shown in Table 1 reveals that the UTC model has achieved a much
better overall performance (0.125) than other two models (0.020 and 0.016). Such
a performance is also confirmed by the detailed results depicted in Fig. 1 - the
F (U T C) line is located at much higher bound level compared to the F (Rocchio)
and F (kN N ) lines.
Precision measures how accurate the classification is. In this regard, the UTC
model has once again outperformed the baseline models. The average precision
results shown in Table 1 demonstrate this (UTC 0.158 vs. Rocchio
0.020 and kNN 0.021). The precision results depicted in Fig. 1 support the
same conclusion; P(UTC) outperformed the others.
Recall measures the performance of classification by considering all dealing
classes. The recall performance in the experiments shows a slightly different
result, compared with those from F1 Measure and precision performance. The
Rocchio model achieved the best recall performance (0.290 on average), com-
pared to that of the UTC model (0.135) and the kNN model (0.054). The result
is also illustrated in Fig. 1, where R(U T C) lies in the middle of R(Rocchio) and
R(kN N ).
There was a gap between the recall performance of the UTC and the Rocchio
models. From the observation of the recall results, we found that the classes assigned
by the Rocchio model usually formed a large set of subjects (935 on average),
whereas the UTC model assigned documents a reasonable number
of subjects (16 on average) and the kNN results had an average size of 106.
Due to the nature of the recall measurement, more feature terms are covered
as the subject set becomes larger. As a result, the Rocchio classification with
the largest set size achieved the best recall performance. The subject sets assigned
by the kNN model had a larger size than those assigned by the UTC. However,
when expanding the classification by neighbours, a great deal of noisy data was
also brought into the neighbourhood - the average number of neighbours was
336. This was caused by the very large set and short length of the documents
under consideration. As a result, the classification became inaccurate even though only
the documents with the top cosine values were chosen for expansion and only the
subjects with the top similarity values were chosen to classify a document.
Different numbers of levels were tested in a sensitivity study for choosing the right
number of levels for finding the lowest common ancestor when generalising subjects
for optimal classification. Table 2 displays the testing results. In the same experimental
environment, when tracing three levels to find an LCA, the UTC model's overall
performance, including F1 Measure, precision, and recall, was better than when
tracing five levels. In addition, tracing only three levels gives better complexity.
Therefore, we chose three levels to restrict the extent of finding LCAs.
7 Conclusions
Text classification has been widely exploited to improve performance in
information retrieval, information organising, text categorisation, and knowledge
engineering. Traditionally, text classification relies on the quality of the target
categories and the accuracy of classifiers learned from training samples. Sometimes
qualified training samples may be unavailable, or the set of categories used
for classification may have inadequate topic coverage. Sometimes documents
may be classified into noisy classes because of the large dimension of the category set.
Aiming to deal with these problems, in this paper we have introduced an
unsupervised multi-label text classification method. Using a world ontology built
from the LCSH, the method consists of three modules: closed frequent sequential
pattern mining for feature extraction; extracting subjects from the ontology
for initial classification; and generalising subjects for optimal classification. The
method has been promisingly evaluated by comparison with typical text classification
methods, using a large real-world corpus, based on the ground truth
encoded by human experts.
References
1. Bekkerman, R., Gavish, M.: High-precision phrase-based document classification
on a modern scale. In: Proceedings of the 17th ACM SIGKDD International Con-
ference on Knowledge Discovery and Data Mining, KDD 2011, pp. 231–239 (2011)
2. Cai, D., Zhang, C., He, X.: Unsupervised feature selection for multi-cluster data.
In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, KDD 2010, pp. 333–342 (2010)
3. Camous, F., Blott, S., Smeaton, A.: Ontology-Based MEDLINE Document Classi-
fication. In: Hochreiter, S., Wagner, R. (eds.) BIRD 2007. LNCS (LNBI), vol. 4414,
pp. 439–452. Springer, Heidelberg (2007)
4. Chan, L.M.: Library of Congress Subject Headings: Principle and Application.
Libraries Unlimited (2005)
5. Gabrilovich, E., Markovitch, S.: Feature generation for text categorization using
world knowledge. In: Proceedings of The 19th International Joint Conference for
Artificial Intelligence, pp. 1048–1053 (2005)
6. Houle, M.E., Grira, N.: A correlation-based model for unsupervised feature selec-
tion. In: Proceedings of the 16th ACM Conference on Conference on Information
and Knowledge Management, CIKM 2007, pp. 897–900 (2007)
7. Hu, X., Zhang, X., Lu, C., Park, E.K., Zhou, X.: Exploiting wikipedia as external
knowledge for document clustering. In: KDD 2009: Proceedings of the 15th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.
389–396 (2009)
8. Katakis, I., Tsoumakas, G., Vlahavas, I.: Multilabel text classification for auto-
mated tag suggestion. In: Proceedings of the ECML/PKDD 2008 Workshop on
Discovery Challenge (2008)
9. Li, Y., Algarni, A., Zhong, N.: Mining positive and negative patterns for relevance
feature discovery. In: Proceedings of 16th ACM SIGKDD Conference on Knowledge
Discovery and Data Mining, pp. 753–762 (2010)
10. Malik, H.H., Kender, J.R.: Classifying high-dimensional text and web data using
very short patterns. In: Proceedings of the 2008 8th IEEE International Conference
on Data Mining, ICDM 2008, pp. 923–928 (2008)
11. Rocha, L., Mourão, F., Pereira, A., Gonçalves, M.A., Meira Jr., W.: Exploiting
temporal contexts in text classification. In: Proceeding of the 17th ACM Conference
on Information and Knowledge Management, CIKM 2008, pp. 243–252 (2008)
12. Tao, X., Li, Y., Zhong, N.: A personalized ontology model for web information
gathering. IEEE Transactions on Knowledge and Data Engineering, IEEE Com-
puter Society Digital Library 23(4), 496–511 (2011)
13. Wang, P., Domeniconi, C.: Building semantic kernels for text classification using
wikipedia. In: KDD 2008: Proceeding of the 14th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pp. 713–721 (2008)
14. Yan, Y., Okazaki, N., Matsuo, Y., Yang, Z., Ishizuka, M.: Unsupervised relation
extraction by mining wikipedia texts using information from the web. In: Proceed-
ings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th
International Joint Conference on Natural Language Processing of the AFNLP,
ACL 2009, vol. 2, pp. 1021–1029 (2009)
15. Yang, B., Sun, J.-T., Wang, T., Chen, Z.: Effective multi-label active learning for
text classification. In: KDD 2009: Proceedings of the 15th ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data Mining, pp. 917–926 (2009)
16. Yang, T., Jin, R., Jain, A.K., Zhou, Y., Tong, W.: Unsupervised transfer classifica-
tion: application to text categorization. In: Proceedings of the 16th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, KDD 2010,
pp. 1159–1168 (2010)
Semantic Social Network Analysis with Text Corpora
1 Introduction
Topic modeling, such as Latent Dirichlet Allocation [4], has become one of the most
popular technologies for modeling large collections of documents [1-3]. The basic idea
of topic modeling is that latent topics can be used to describe the relationship between
words and documents. In this paper we consider the problem of using latent
topics to connect the words and entities in documents (such as persons, locations, and
organizations). We focus on news articles, which contain many entities in order to
convey information about who, what, when and where. Our main purpose is to model
the entities in terms of latent topics so that we can 1) find interesting entities through
the topics we aim at; 2) recognize groups, by supposing that entities (especially persons)
which concern similar topics can be seen as a group; and 3) rank the many entities in a
document to figure out the valuable ones, by assuming that the more an entity contributes
to a document's topic(s), the more valuable it is in that document. We call these three tasks
Semantic Social Network Analysis, because the interactions are found based on the topics
of the corpus.
There are several related studies that model the relationship between words and
entities (authors) with topic models. The Author-Topic (AT) model [5-6] learns the
topics of a document conditioned on the mixture of interests of its authors. The AT
model assumes that the authors contribute equally to the topics of a document. The
SwitchLDA and GESwitchLDA models [7-8] extend LDA to capture dependencies between
entities and topics, treating entities as additional classes of words.
2 Document-Entity-Topic Model
In this section we introduce the Document-Entity-Topic (DET) model. The DET model
belongs to the family of generative models, in which each word w in a document is
associated with two latent variables: an entity assignment x and a topic assignment z.
The entities in news may carry different weights in how they are described by the words.
For example, "a reporter covers that Mr. A and B meet at a national conference and
have an educational exchange, trying to improve the communication and friendship of
both." We find that the relationship between A and B is closer than that between the
reporter and the two people. In order to discover the different weights of different entities,
we use a Dirichlet distribution as the prior to describe the importance of
each entity in a document, which is similar to the LDA model using a Dirichlet
distribution to describe the relationship between topics and the document.
The reason for choosing the Dirichlet is that, firstly, it can reflect the characteristics of
the document-entity relation: a document has primary and minor entities, and the weights
can be adjusted by the hyperparameter of the Dirichlet. Secondly, the conjugate prior of the
multinomial distribution is the Dirichlet distribution, so it simplifies the computation of
the posterior distribution, which has the same functional form as the prior.
Thus, we propose the Document-Entity-Topic (DET) model for mining the semantic
description of entities and using the topical distribution to carry out the social networking
tasks. The generative process of the DET model for a document can be summarized
as follows: firstly, an entity is chosen randomly from the distribution over
entities for that document; next, a topic label is sampled for each word from the distribution
over topics associated with the entity of that word; finally, the words are sampled
from the distribution over words associated with each topic. The plate
representations [9] of all models are shown in Fig. 1.
Fig. 1. Two related models and the DET model. In all models, each word w is generated from a
topic-specific multinomial word distribution; however, topics are sampled differently in each of
the models. In LDA, a topic is sampled from a document-specific topic distribution which is
sampled from a Dirichlet with a hyperparameter. In the AT model, a topic is sampled from an
author-specific multinomial distribution, and authors are sampled uniformly from the document's
author set. In DET, a Dirichlet prior is introduced for the document-entity distribution; a topic
is sampled from an entity-specific multinomial distribution, and the entity assignment is sampled
from the Dirichlet-distributed entity proportions of that document.
Summing over the latent variables x and z, we can obtain the probability of the words in each document $\mathbf{w}_d$ as equation (2):

$$P(\mathbf{w}_d \mid \Theta, \Phi, \Psi) = \prod_{i=1}^{N_d} P(w_i \mid \Theta, \Phi, \Psi)
= \prod_{i=1}^{N_d} \sum_{x=1}^{X_d} \sum_{t=1}^{T} P(w_i, z_i = t, x_i = x \mid \Theta, \Phi, \Psi)$$
$$= \prod_{i=1}^{N_d} \sum_{x=1}^{X_d} \sum_{t=1}^{T} P(w_i \mid z_i = t, \Phi)\, P(z_i = t \mid x_i = x, \Theta)\, P(x_i = x \mid \Psi)
= \prod_{i=1}^{N_d} \sum_{x \in X_d} \psi_{xd} \sum_{t=1}^{T} \theta_{xt}\, \phi_{w_i t} \qquad (2)$$
The factorization in the third line of equation (2) uses the conditional independence assumptions of the model. The last line expresses the probability of the words w in terms of the parameter matrices $\Psi$, $\Theta$ and $\Phi$: $P(x_i = x \mid \Psi)$ is the entity multinomial distribution $\psi_d$ in $\Psi$ corresponding to document d, $P(z_i = t \mid x_i = x, \Theta)$ is the multinomial distribution $\theta_x$ in $\Theta$ corresponding to entity x, and $P(w_i \mid z_i = t, \Phi)$ is the multinomial distribution $\phi_t$ in $\Phi$ corresponding to topic t.
The DET model contains three continuous random variables, $\Psi$, $\Theta$ and $\Phi$. The inference scheme used in this paper is based upon a Markov chain Monte Carlo (MCMC) algorithm, or more specifically, Gibbs sampling. We estimate the posterior distribution $P(\Psi, \Theta, \Phi \mid D^{train}, \gamma, \alpha, \beta)$. The inference scheme is based upon the observation that
$$P(\Psi, \Theta, \Phi \mid D^{train}, \gamma, \alpha, \beta) = \sum_{\mathbf{z}, \mathbf{x}} P(\Psi, \Theta, \Phi \mid \mathbf{z}, \mathbf{x}, D^{train}, \gamma, \alpha, \beta)\, P(\mathbf{z}, \mathbf{x} \mid D^{train}, \gamma, \alpha, \beta) \qquad (3)$$
where z is the topic variable and x is the entity assignment. The inference process involves two steps. First, we use Gibbs sampling to obtain an empirical sample corresponding to particular x and z, using the conjugacy between the Dirichlet and multinomial distributions.
$$P(x_i = x, z_i = t \mid w_i = w, \mathbf{x}_{-i}, \mathbf{z}_{-i}, \mathbf{w}_{-i}, \gamma, \alpha, \beta) \;\propto\; \frac{C^{WT}_{wt,-i} + \beta}{\sum_{w'} C^{WT}_{w't,-i} + W\beta} \cdot \frac{C^{TX}_{tx,-i} + \alpha}{\sum_{t'} C^{TX}_{t'x,-i} + T\alpha} \cdot \frac{C^{XD}_{xd,-i} + \gamma}{\sum_{x'} C^{XD}_{x'd,-i} + X\gamma} \qquad (4)$$

Here $C^{XD}$ is the document-entity count matrix, where $C^{XD}_{xd,-i}$ is the number of words assigned to entity x in document d, excluding word $w_i$. $C^{TX}$ is the topic-entity count matrix, where $C^{TX}_{tx,-i}$ is the number of words assigned to topic t for entity x, excluding the topic assignment of word $w_i$. $C^{WT}_{wt,-i}$ is the number of words of the w-th vocabulary entry assigned to topic t, excluding the topic assignment of word $w_i$. Finally, $\mathbf{z}_{-i}$, $\mathbf{x}_{-i}$ and $\mathbf{w}_{-i}$ stand for the vectors of topic assignments, entity assignments and word observations in the corpus, excluding the i-th word, respectively.
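To make the collapsed Gibbs step concrete, the following Python sketch implements the update of equation (4) for a single word token, assuming the three count matrices above have already been decremented for the current token. The array layout, the function name, and the use of the number of entities in the document for the last normalizer are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sample_entity_topic(w, d, doc_entities, C_WT, C_TX, C_XD,
                        beta, alpha, gamma, rng):
    """Draw a new (entity, topic) pair for one word token, following Eq. (4).

    w            : vocabulary index of the current word
    d            : document index
    doc_entities : indices of the entities appearing in document d
    C_WT         : V x T word-topic counts   (current token excluded)
    C_TX         : T x X topic-entity counts (current token excluded)
    C_XD         : X x D entity-document counts (current token excluded)
    """
    V, T = C_WT.shape
    X_d = len(doc_entities)                 # entities considered for document d
    pairs, weights = [], []
    for x in doc_entities:
        p_x = (C_XD[x, d] + gamma) / (C_XD[:, d].sum() + X_d * gamma)
        for t in range(T):
            p_w = (C_WT[w, t] + beta) / (C_WT[:, t].sum() + V * beta)
            p_t = (C_TX[t, x] + alpha) / (C_TX[:, x].sum() + T * alpha)
            pairs.append((x, t))
            weights.append(p_w * p_t * p_x)
    weights = np.asarray(weights)
    idx = rng.choice(len(pairs), p=weights / weights.sum())
    return pairs[idx]
```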
4 Experiment Result
We train our DET model on the “Libya Event” dataset, which was collected from the Internet (http://www.ifeng.com). It contains D = 4165 documents, P = 3784 unique entities (most of which are person names), N = 782043 tokens, and a vocabulary of V = 15812 unique words. We preprocess the document set with the trial edition of ICTCLAS (rights reserved by ictclas.org). All documents are written in Chinese, and we translate the results into English.
We run the Markov chain for a fixed number of 2000 iterations. Furthermore, we find that the results are not very sensitive to the hyperparameters, so we use the fixed symmetric Dirichlet distributions $\gamma = 0.5$, $\alpha = 0.1$, and $\beta = 0.01$ in all our experiments. In the comparison experiment with the AT model, the author set a consists of the entities extracted from the documents.
model. First, we initialize the algorithm by randomly assigning topics and entities to the words of the test documents, and then perform a number of loops through the Gibbs sampling update:
$$p(z_i = t, x_i = x \mid w_i = w, \mathbf{z}_{-i}, \mathbf{x}_{-i}, \mathbf{w}_{-i}; \mathrm{M}) \;\propto\; \phi_{wt} \cdot \left(n^{(t)}_{x,-i} + \alpha\right) \cdot \left(n^{(x)}_{d,-i} + \gamma\right) \qquad (8)$$
where $n^{(t)}_{x,-i}$ is the number of times topic t has been assigned to entity x, and $n^{(x)}_{d,-i}$ is the number of times entity x has been assigned to document d; both counts exclude the topic and entity assignment of word $w_i$. We report the perplexities for different numbers of topics on the “Libya Event” test set of 109 documents, about 10% of the whole data set.
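As a hedged illustration of how the held-out perplexity can be computed once point estimates of Φ (topic-word), Θ (entity-topic) and Ψ (document-entity, estimated on the test documents) are available, the following sketch evaluates the mixture of equation (2); all names are ours.

```python
import numpy as np

def det_perplexity(test_docs, phi, theta, psi):
    """Perplexity of held-out documents under the DET mixture of Eq. (2).

    test_docs : list of (word_ids, entity_ids) pairs, one per test document
    phi       : T x V topic-word probabilities
    theta     : X x T entity-topic probabilities
    psi       : D_test x X document-entity probabilities
    """
    log_lik, n_tokens = 0.0, 0
    for d, (words, entities) in enumerate(test_docs):
        for w in words:
            # p(w | d) = sum_x psi[d, x] * sum_t theta[x, t] * phi[t, w]
            p_w = sum(psi[d, x] * theta[x].dot(phi[:, w]) for x in entities)
            log_lik += np.log(p_w)
            n_tokens += 1
    return float(np.exp(-log_lik / n_tokens))
```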
Fig. 2. Perplexity comparison of AT and DET on the “Libya” data set. The DET model has significantly better predictive power than AT on our document set. We can also see that the lowest perplexity obtained by DET is not achievable by AT with any number of topics. This suggests that DET is better suited to the task of Semantic Social Network Analysis (SSNA), which discovers the topic-based relationships and group information of entities in documents.
Topics and Entities. We obtain the latent topics after applying the Gibbs sampling algorithm to the DET model. We use the topic significance ranking method [10] to rank the topics and show the two most important topics in Table 1. For each topic we list the most likely words with their probabilities and, below them, the most likely entities; the topic names were assigned by the authors.
During the experiments we found that many topics share many of the same high-probability words; we believe the reason is that all documents in the “Libya Event” data set discuss a single event (similar topics). We therefore introduce a tf-idf-like reweighting.
The probability of entity x belonging to topic k is decided not only by $\theta_{x,k}$ but also by the number of documents in which entity x appears. If x appears in document d, the count is incremented by 1, and the document frequency is $df_x = |\{d : x \in d\}| / |D|$. The reweighted probability of entity x under topic k is then $\theta'_{x,k} = df_x \cdot \theta_{x,k}$.
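A minimal sketch of this reweighting (the helper name and array layout are assumptions):

```python
import numpy as np

def reweight_by_document_frequency(theta, entity_doc_counts, n_docs):
    """Return theta'[x, k] = df_x * theta[x, k] with df_x = |{d : x in d}| / |D|.

    theta             : X x T entity-topic probabilities
    entity_doc_counts : length-X array of document counts per entity
    n_docs            : total number of documents |D|
    """
    df = np.asarray(entity_doc_counts, dtype=float) / n_docs
    return theta * df[:, None]
```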
Table 1. The two highest-ranked topics from a 100-topic DET run on the “Libya Event” data. For each topic we list the most likely words with their probabilities and, below them, the most likely entities (capitalized).
Topic 89: Conflicts of the government and           Topic 31: The National Transition Committee
the opposition in Libya                             comes into existence

the opposition              0.751071                committee                       0.313636
demos                       0.098968                transition                      0.278701
fremdness                   0.018027                nation                          0.213317
relation                    0.015836                admit                           0.046756
find out                    0.013272                chairman                        0.033091
reason                      0.011508                come into existence             0.024036
in the past                 0.008329                spokesman                       0.020482
hours                       0.007549                intraday                        0.013421
with responsibility for     0.006162                leaguer                         0.013068
encounter                   0.005506                promise                         0.006441

Qaddafi                     0.060839                Abdul-Jelil                     0.041501
Banghazi                    0.043085                Banghazi                        0.026637
Qatar                       0.030726                Italy                           0.025412
Reuters                     0.027212                National Transition Committee   0.025164
Italy                       0.020556                Qatar                           0.023441
Russia                      0.014663                Bani Walid                      0.019576
Abdul-Jelil                 0.014444                Abdul-Jelil                     0.016731
Egypt                       0.013855                Beijing                         0.016373
Associated Press            0.008018                Paris                           0.015959
Muhammad                    0.007668                London                          0.015528
Table 2. KL divergences, probabilities and frequencies of all entities in two documents for
particular information
In most instances, if an entity has a lower KL divergence with the document, the probability that it belongs to that document is higher, and the frequency is not a key factor influencing this probability. In order to compare the entity ranking performance of the AT and DET models on the whole data set, we further adopt the weighted KL divergence, defined in equations (9) and (10):
$$\mathrm{wKL\_AT} = \frac{1}{D} \sum_{d=1}^{D} \frac{1}{a_d} \sum_{a=1}^{a_d} KL\!\left(\theta_{x,t} \,\|\, \eta_{d,t}\right) \qquad (9)$$

$$\mathrm{wKL\_DET} = \frac{1}{D} \sum_{d=1}^{D} \sum_{x=1}^{X_d} \psi_{d,x} \cdot KL\!\left(\theta_{x,t} \,\|\, \eta_{d,t}\right) \qquad (10)$$
The smaller the weighted KL value, the more similar the entities and the documents are. Figure 3 shows the values for different numbers of topics.
Fig. 3. The weighted KL divergences of the AT and DET models for different numbers of topics. The values of the DET model are lower than those of the AT model. This means that the more important entities (those with lower KL divergence to the documents in which they appear) have higher probabilities of belonging to the document and contribute more to topic generation.
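For concreteness, here is a sketch of the weighted KL comparison of equations (9) and (10); `kl` is a plain discrete KL divergence, and the container layouts are our own assumptions.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Discrete KL divergence KL(p || q) with a small smoothing constant."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    return float(np.sum(p * np.log(p / q)))

def weighted_kl_at(doc_entities, theta, eta):
    """Eq. (9): per-document average KL over the a_d entities, averaged over documents."""
    return sum(
        sum(kl(theta[x], eta[d]) for x in ents) / len(ents)
        for d, ents in enumerate(doc_entities)
    ) / len(doc_entities)

def weighted_kl_det(doc_entities, theta, eta, psi):
    """Eq. (10): psi-weighted KL between entity and document topic distributions."""
    return sum(
        sum(psi[d, x] * kl(theta[x], eta[d]) for x in ents)
        for d, ents in enumerate(doc_entities)
    ) / len(doc_entities)
```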
5 Conclusions
We have presented the Document-Entity-Topic model, a probabilistic model for exploring the interactions of words, topics and entities within documents. It applies probabilistic modeling to social network analysis based on latent topics. In order to avoid the side effects of noisy entities and to find the entities that mainly affect the topics, we have introduced a Dirichlet allocation for the document-entity distribution instead of a uniform allocation. The model can be applied to discovering topics conditioned on entities, clustering to find semantic social groups, and ranking the significance of entities in a document.
However, when there is no entity in a document, the topics of that document cannot be modeled. When such entity-free documents reach a certain proportion, the topic modeling of the whole corpus is affected. In future work we plan to improve the model for applications in which many documents lack entities.
References
1. Blei, D.: Introduction to Probabilistic Topic Models. Communications of the ACM (2011)
2. Steyvers, M., Griffiths, T.: Probabilistic Topic Models. Handbook of Latent Semantic
Analysis 427 (2007)
3. Blei, D., Carin, L., Dunson, D.: Topic Models. IEEE Signal Processing Magazine 27(6),
55–65 (2010)
4. Blei, D., Ng, A., Jordan, M.: Latent Dirichlet Allocation. The Journal of Machine Learning
Research 3, 993–1022 (2003)
5. Rosen-Zvi, M., Chemudugunta, C., Griffiths, T., Smyth, P., Steyvers, M.: Learning Au-
thor-Topic Models from Text Corpora. ACM Transactions on Information Systems
(TOIS) 28(1), 1–38 (2010)
6. Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The Author-Topic Model for Au-
thors and Documents, pp. 478–494. AUAI Press (2004)
7. Shiozaki, H., Eguchi, K., Ohkawa, T.: Entity Network Prediction Using Multitype Topic
Models. Springer (2008)
8. Newman, D., Chemudugunta, C., Smyth, P.: Statistical Entity-Topic Models, pp. 680–686
(2006)
9. Bishop, C.: Pattern Recognition and Machine Learning. Springer, New York (2006)
10. AlSumait, L., Barbará, D., Gentle, J., Domeniconi, C.: Topic Significance Ranking of
LDA Generative Models. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J.
(eds.) ECML PKDD 2009. LNCS, vol. 5781, pp. 67–82. Springer, Heidelberg (2009)
Appendix
$$P(\mathbf{x}, \mathbf{z}, \mathbf{w} \mid \gamma, \alpha, \beta, X)
= \int\!\!\int\!\!\int \prod_{d=1}^{D} p(\psi_d \mid \gamma) \prod_{x=1}^{X} p(\theta_x \mid \alpha) \prod_{t=1}^{T} p(\phi_t \mid \beta) \prod_{d=1}^{D} \prod_{i=1}^{N_d} p(x_{di} \mid \psi_d)\, p(z_{di} \mid \theta_{x_{di}})\, P(w_{di} \mid \phi_{z_{di}})\; d\Phi\, d\Theta\, d\Psi$$

$$= \int \prod_{d=1}^{D} \left( \frac{\Gamma\!\left(\sum_{x=1}^{X} \gamma_x\right)}{\prod_{x=1}^{X} \Gamma(\gamma_x)} \prod_{x=1}^{X} \psi_{dx}^{\gamma_x - 1} \right) \prod_{d=1}^{D}\prod_{x=1}^{X} \psi_{dx}^{n_{dx}}\; d\Psi
\;\times\; \int \prod_{x=1}^{X} \left( \frac{\Gamma\!\left(\sum_{t=1}^{T} \alpha_t\right)}{\prod_{t=1}^{T} \Gamma(\alpha_t)} \prod_{t=1}^{T} \theta_{xt}^{\alpha_t - 1} \right) \prod_{x=1}^{X}\prod_{t=1}^{T} \theta_{xt}^{n_{xt}}\; d\Theta
\;\times\; \int \prod_{t=1}^{T} \left( \frac{\Gamma\!\left(\sum_{v=1}^{V} \beta_v\right)}{\prod_{v=1}^{V} \Gamma(\beta_v)} \prod_{v=1}^{V} \phi_{tv}^{\beta_v - 1} \right) \prod_{t=1}^{T}\prod_{v=1}^{V} \phi_{tv}^{n_{tv}}\; d\Phi$$

$$\propto \prod_{d=1}^{D} \int \prod_{x=1}^{X} \psi_{dx}^{\gamma_x + n_{dx} - 1}\, d\psi_d
\;\times\; \prod_{x=1}^{X} \int \prod_{t=1}^{T} \theta_{xt}^{\alpha_t + n_{xt} - 1}\, d\theta_x
\;\times\; \prod_{t=1}^{T} \int \prod_{v=1}^{V} \phi_{tv}^{\beta_v + n_{tv} - 1}\, d\phi_t$$

$$= \prod_{d=1}^{D} \frac{\prod_{x=1}^{X} \Gamma(\gamma_x + n_{dx})}{\Gamma\!\left(\sum_{x=1}^{X} (\gamma_x + n_{dx})\right)}
\;\times\; \prod_{x=1}^{X} \frac{\prod_{t=1}^{T} \Gamma(\alpha_t + n_{xt})}{\Gamma\!\left(\sum_{t=1}^{T} (\alpha_t + n_{xt})\right)}
\;\times\; \prod_{t=1}^{T} \frac{\prod_{v=1}^{V} \Gamma(\beta_v + n_{tv})}{\Gamma\!\left(\sum_{v=1}^{V} (\beta_v + n_{tv})\right)}$$
where $n_{dx}$ is the number of tokens assigned to entity x in document d, $n_{xt}$ is the number of tokens assigned to topic t and entity x, and $n_{tv}$ is the number of tokens of word v assigned to topic t. Using the chain rule, we can conveniently obtain the conditional probability. We define $\mathbf{w}_{-di}$ as all word tokens except the token $w_{di}$:
$$P(x_{di}, z_{di} \mid \mathbf{x}_{-di}, \mathbf{z}_{-di}, \mathbf{w}, \alpha, \beta, \gamma, X)
= \frac{P(x_{di}, z_{di}, w_{di} \mid \mathbf{x}_{-di}, \mathbf{z}_{-di}, \mathbf{w}_{-di}, \alpha, \beta, \gamma, X)}{P(w_{di} \mid \mathbf{x}_{-di}, \mathbf{z}_{-di}, \mathbf{w}_{-di}, \alpha, \beta, \gamma, X)}
\;\propto\; \frac{P(\mathbf{x}, \mathbf{z}, \mathbf{w} \mid \alpha, \beta, \gamma, X)}{P(\mathbf{x}_{-di}, \mathbf{z}_{-di}, \mathbf{w}_{-di} \mid \alpha, \beta, \gamma, X)}
\;\propto\; \left(\gamma_{x_{di}} + n_{d x_{di}} - 1\right) \left(\alpha_{z_{di}} + n_{x_{di} z_{di}} - 1\right) \left(\beta_{w_{di}} + n_{z_{di} w_{di}} - 1\right)$$
Yang Xiang1, David Fuhry2, Ruoming Jin3, Ye Zhao3, and Kun Huang1
1 Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
{yang.xiang,kun.huang}@osumc.edu
2 Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA
[email protected]
3 Department of Computer Science, Kent State University, Kent, OH 44242, USA
{jin,zhao}@cs.kent.edu
1 Introduction
Today the infusion of data from every facet of our society, through documenting, sensing, digitizing and computing, challenges scientists, analysts and users with its typically massive size and high dimensionality. Data mining and visualization
are two important areas in analyzing and understanding the data. The role of
data mining is to discover hidden patterns of the data. In particular, various
clustering techniques (see [19] for a review) have been proposed to reveal the
structures of these data and support exploratory data analysis. By contrast, the
role of visualization is to present the data in a clear and understandable man-
ner for people. Many visualization techniques have been developed to facilitate
exploratory analysis and analytical reasoning through the use of (interactive)
visual interfaces.
However, despite these efforts, current research is far from perfect in integrating these two endeavors in a close and uniform fashion. Given the cluster structures discovered from the data, how can we visualize them and provide users with better insight into the data? How can visualization techniques help reveal and expose the underlying structures of the data? These research questions are clearly critical for meeting the challenges of the “data explosion”. In this paper, we
address those questions by developing a novel visualization model for visualizing
discovered clusters in large and multivariate datasets. Our goal is to efficiently
provide users different views of discovered clusters as well as preserve the de-
tails of these clusters to the maximal extent possible. Among many visualization
techniques for multidimensional data [33], parallel coordinates is one of the most
elegant yet simple tools, and we select it as the visualization platform for our
proposed algorithms.
Fig. 1. (a) Data visualization with parallel coordinates w, x, y, z; (b) Data projection
on wz plane
and made flexible in some systems by allowing user adjustment. For data sets with many dimensions, this imposes an unexpected challenge on end users, who may have inadequate knowledge and experience. Moreover, data clusters obtained from aggregation and abstraction are even harder to illustrate along multiple coordinates together with many polylines.
To address these challenges, in this paper, we visualize clusters in parallel
coordinates for visual knowledge discovery using a novel dimension ordering
approach which is further refined by an energy reduction model.
2 Related Works
Parallel Coordinates for Clusters. Wegman [30] promotes parallel coordinates visualization from the perspectives of geometry, statistics and graphics, and it has been widely applied in information visualization [28,25]. Many approaches for visualizing clustered data sets with parallel coordinates have been proposed [20,23,3]. In particular, instead of visualizing each individual data item as a polyline, each cluster pattern is visualized as a fuzzy stripe [13,20,3]. Fua et al. [9]
line, each cluster pattern is visualized as a fuzzy stripe [13,20,3]. Fua et al. [9]
visualize clusters by variable-width opacity bands, faded from a dense middle to
transparent edges. Such visualization focuses on the global pattern of clusters,
but the general shape of a cluster might be adversely affected by a small number
of outliers inside a cluster. In comparison, instead of displaying a shape profile
of individual clusters, our method seeks to keep the line structures while high-
lighting clusters and their relationships, by seeking good orders of coordinates
as well as shaping them smoothly by a quadratic energy reduction model which
extends the linear system proposed in [36].
Dimension Ordering. The dimension ordering and permutation problem is
naturally associated with parallel coordinate visualization. It is discussed in
the early paper by Wegman [30] and the subsequent work by Hurley and Old-
ford [17,16]. In [30], Wegman points out the problem and gives a basic solution for enumerating the minimum number of permutations such that every pair of coordinates is visualized in at least one of the permutations. However, it is rather inefficient to display parallel coordinates corresponding to all these permutations. The grand tour animates a static display in order to examine the data continuously from different views [29,32,31]. The method is effective in offering temporal exploration for computationally complex tasks such
as manifesting outliers and clusters. Ankerst et al. [1] propose to rearrange di-
mensions such that dimensions showing a similar behavior are positioned next
to each other. Peng et al. [27] try to find a dimension ordering that can mini-
mize the “clutter measure”, which is defined as the ratio of outliers to total data
points. Since permutation related problems are mostly NP-hard, the existing
work [1,27,34] primarily relies on heuristic algorithms to get a quick solution.
Ellis and Dix [8] use line crossings to reduce clutter. Dasgupta and Kosara [7]
recently use number of crossings as an indication of clutter between two adjacent
coordinates, and apply a simple Branch-and-Bound optimization for dimension
508 Y. Xiang et al.
ordering. Hurley [15] uses crossings to study the correlation between two dimensions of a dataset as well as to reduce clutter. Differently from these works, we define the
crossing as an order change between a pair of inter-cluster items (or intra-cluster
items, depending on the visualization focus) on two adjacent coordinates. Our
definitions lead to an effective and efficient solution to study cluster interactions
on parallel coordinates for visual knowledge discovery.
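To illustrate this crossing definition, the sketch below counts, for two adjacent coordinates, the pairs of items from different clusters whose relative order flips; the quadratic pair enumeration and all names are our own simplification (a Kendall-tau-style O(n log n) count is equally possible).

```python
def inter_cluster_crossings(col_a, col_b, labels):
    """Number of inter-cluster item pairs whose order changes between two axes.

    col_a, col_b : values of each data item on two adjacent coordinates
    labels       : cluster label of each data item
    """
    n, crossings = len(col_a), 0
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                continue                                  # intra-cluster pair: skip
            if (col_a[i] - col_a[j]) * (col_b[i] - col_b[j]) < 0:
                crossings += 1                            # order flips => crossing
    return crossings
```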
Exact Solution. An exact solution for the minimum (or maximum) weighted Hamiltonian path problem exhaustively tries all permutations of the vertices. The complexity is O(n!), and the method becomes intractable even for moderately large n. However, in parallel coordinates visualization it is not uncommon to see a dataset with 10 or fewer coordinates. For such applications, the exhaustive search algorithm is still one of the simplest and most effective solutions. Ideas
in various branch and bound approaches for the Traveling Salesman Problem
(TSP for short) can be used to speed up the exhaustive search algorithm for the
minimum (or maximum) weighted Hamiltonian path problem. Interested readers
may refer to the TSP survey paper [24] for details.
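A brute-force version of this exhaustive strategy might look as follows, with `weight[i][j]` holding the crossing count between coordinates i and j; this is a sketch under our own naming, not the authors' code.

```python
from itertools import permutations

def best_order_exhaustive(weight, maximize=False):
    """Minimum (or maximum) weighted Hamiltonian path over all n! coordinate orders.

    weight : n x n symmetric matrix of pairwise crossing counts
    """
    n = len(weight)
    best_order, best_cost = None, None
    for perm in permutations(range(n)):
        cost = sum(weight[perm[i]][perm[i + 1]] for i in range(n - 1))
        if best_cost is None or (cost > best_cost if maximize else cost < best_cost):
            best_order, best_cost = list(perm), cost
    return best_order, best_cost
```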
Metric Space and Approximation Solutions. Since the exact solutions cannot easily handle high-dimensional data, we seek fast approximate solutions when the number of coordinates is large. Good approximation algorithms exist for minimum and maximum metric TSPs (see [5] for minimum metric TSP, [12] for maximum metric TSP, and [4] for minimum metric TSP with a prescribed order of vertices), so we ask whether our problems are metric Hamiltonian path problems. If they are, can we obtain similar approximation algorithms? Fortunately, the answer is positive, as stated in Lemma 1 (proof omitted due to space limits), which extends the well-known fact that the Kendall tau distance (corresponding to intra-cluster crossings) is a metric:
Lemma 1. The graph G, constructed by converting each coordinate to a vertex
and setting the weight of each edge between two vertices to be the number of
inter-cluster crossings between the two corresponding coordinates, either within
two specific clusters or among all clusters, forms a metric space, in which edge
weights follow the triangle inequality.
Thus, it is not difficult to show that, if a graph G with n vertices forms a metric space (regardless of whether there exists a prescribed order of some vertices), a k-approximation solution for the minimum (or maximum) traveling salesman problem implies a 2k-approximation solution for minimizing (or maximizing) inter-cluster (or intra-cluster) crossings.
In some special cases, it is possible to achieve an even better approximation ratio. For example, Hoogeveen [14] shows that Christofides' 1.5-approximation algorithm [5] for the minimum metric TSP can be modified for the minimum metric Hamiltonian path problem with the same approximation ratio, but the time complexity of this algorithm and its modified version, though polynomial, is much larger than linear. To achieve an even faster running speed for minimizing inter-cluster (or intra-cluster) crossings, we implemented a linear 2-approximation minimum metric Hamiltonian path algorithm modified from the well-known linear 2-approximation algorithm for the minimum metric TSP [6].
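The flavor of that 2-approximation can be sketched as a minimum spanning tree followed by a DFS preorder walk; this illustrative version uses a simple Prim MST (quadratic in the number of coordinates, which is tiny here) and can be restarted from every root to mimic the multi-DFS search used later in the experiments. Names are our own assumptions.

```python
def prim_mst(weight):
    """Adjacency lists of a minimum spanning tree of a complete weighted graph."""
    n = len(weight)
    in_tree, parent = [False] * n, [-1] * n
    best = [float("inf")] * n
    best[0] = 0.0
    adj = [[] for _ in range(n)]
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] >= 0:                       # attach u to the tree
            adj[u].append(parent[u])
            adj[parent[u]].append(u)
        for v in range(n):
            if not in_tree[v] and weight[u][v] < best[v]:
                best[v], parent[v] = weight[u][v], u
    return adj

def order_2approx(weight, root=0):
    """2-approximate minimum Hamiltonian path: DFS preorder of the MST."""
    adj, order, seen, stack = prim_mst(weight), [], set(), [root]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        order.append(u)
        stack.extend(v for v in adj[u] if v not in seen)
    return order
```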
source of machine learning and data mining datasets. The basic characteristics of the studied datasets are listed in Table 1. For our experiments, we chose the well-known K-means algorithm [11] to cluster the data items into exclusive clusters. We implemented the visualization program in JavaScript (web-based). For this study, we tested our visualization implementation in Firefox 3.6.12 on a mainstream desktop PC with an Intel Core i5 2.67 GHz CPU and 8 GB of memory.
In our empirical study, we are primarily interested in observing the effects of the proposed inter-cluster and intra-cluster ordering for visual knowledge discovery. Maximizing intra-cluster crossings does not clearly connect to the study of cluster interactions, so we omit it for conciseness.
Table 1 reports the detailed changes of inter-cluster crossings after minimization and maximization, and of intra-cluster crossings after minimization, for the different datasets. Although the crossing changes are substantial, it is more interesting to see how the visualization results change. A set of representative results is shown in Figure 2.
Minimizing and Maximizing Total Inter-cluster Crossings:
Figure 2 (b) and (c) show the visualization results for minimizing and maximizing the total inter-cluster crossings, respectively, for the dataset “wine”. In the original order, i.e., Figure 2 (a), we can observe that clusters are generally negatively related between col 3 and col 4, between col 4 and col 5, between col 5 and col 6, and between col 6 and col 7. Quite impressively, clusters show much more positive relations between adjacent coordinates in Figure 2 (b). In contrast, clusters show even more negative relations in Figure 2 (c). These results generally meet our expectations for the effects of minimizing and maximizing the total inter-cluster crossings. Interestingly, we can observe that the last two columns (col 3 and col 7) in Figure 2 (c) contain a couple of strongly negatively-related cluster pairs which are not revealed by Figure 2 (a) in its original order. By checking the original data, we found that col 3 corresponds to “alkalinity of ash”, while col 7 corresponds to “proline”. This helps explain the negative relations between clusters: alkalinity is the ability of a solution to neutralize acids, while proline is an α-amino acid, and high alkalinity is more likely to result in a low α-amino acid level.
Fig. 2. Visualization results (colors shown in the web version of this paper): (a) wine original, (b) wine inter-cluster min, (c) wine inter-cluster max
Similarly, Figure 2 (e) and (f) show the visualization results for minimizing and maximizing the total inter-cluster crossings, respectively, for the dataset “parkinsons”, in which each col represents a measurement. With our visualization, it is easy for health care providers to spot measurements that are strongly positively related in Figure 2 (e), and measurements that are strongly negatively related in Figure 2 (f).
Minimizing Inter-cluster and Intra-cluster Crossings on a Pair of
Clusters:
We would like to see the difference between minimizing inter-cluster crossings and
intra-cluster crossings. To ease our observation, we focus on only two clusters,
cyan and light green, in Figure 2 (g), which shows the dataset “forestfires” in its
original order. Figure 2 (h) and (i) show the visualization results corresponding
to minimizing inter-cluster crossings and intra-cluster crossings, respectively. In
Figure 2 (h) we can observe that the cyan cluster and the light green cluster are
generally positively-related in all adjacent columns. This is understandable as
the visualization goal is to minimize the inter-crossings between them. However,
Figure 2 (i) shows a strongly negative relation between them on the last two columns (col 2 and col 6). This is because the goal of minimizing intra-cluster crossings does not care about the relations between the cyan cluster and the light green cluster. Rather, it tries to reduce crossings within the two clusters so as to reduce visual clutter and provide a better chance to observe the relations, whether positive or negative, between the two clusters.
By checking the original data, we found that col 2 corresponds to DMC and col 6 corresponds to RH. DMC is an indication (the larger, the more likely) of the depth to which fire will burn in moderate duff layers and medium-size woody material, while RH is relative humidity. This explains the discovered result in Figure 2 (i) that clusters tend to be negatively related between DMC and RH.
Minimizing Inter-cluster Crossings by the 2-Approximation Algorithm:
In all the tested datasets, the exact algorithm finishes in no more than 100 mil-
liseconds except for the datasets “eighthr” and “water-treatment”. It takes about
2 minutes to exactly order “eighthr” (12 columns), and about 15 seconds to exactly
order “water-treatment” (11 columns). This poses a concern on using exact algo-
rithms for ordering datasets with more than 10 columns, and justify the impor-
tance of approximation algorithms for ordering large datasets. In the following we
empirically study the effect of the popular 2-approximation algorithm (discussed
at the end of Section 3.2) on our visualization scheme. In order to get a better or-
dering through the 2-approximation algorithm, we try DFS search from each ver-
tex and find a lowest-cost result among all the 2-approximation results. Even with
multi-DFS search, the ordering time is still lightning fast. For all datasets, includ-
ing “eighthr” and “water-treatment”, the multi-DFS search finishes within a cou-
ple of milliseconds. This makes our visualization schemes work for large datasets.
Figure 2 (l) shows the visualization result of minimizing inter-cluster crossings
by the 2-approximation algorithm. Compared to the visualization result by the
exact algorithm as in Figure 2 (k), it is hard to tell the actual difference between
the two algorithms in revealing the positive relations among clusters. Detailed
data may explain this: the reductions in inter-cluster crossings achieved by the 2-approximation algorithm are -23.0%, -29.6%, -42.0%, -8.4%, -41.6%, -14.3%, and -46.8%, respectively, for the datasets in Table 1 (from top to bottom). Thus we can
see there is very little performance degradation (in some datasets there is no
difference) with the 2-approximation algorithm but very significant speed-up
(linear vs factorial, in terms of complexity).
Fig. 3. Visualization enhancement (colors shown in the web version of this paper): (a) enhancement for Figure 2(c); (b) enhancement for Figure 2(i)
Here each cluster has an attracting center ĉp which may serve as a repelling center
for its adjacent clusters. We developed an efficient energy reduction model by
properly initializing and manipulating ĉp (omitted due to space limit).
The visualization effects are significantly enhanced by our energy reduction model. Figure 3(a) and Figure 3(b) are examples of enhanced visualization results for Figure 2(c) and (i), respectively. Readers can easily observe more clusters and thus better understand their relationships. It is easy to see that the essential details of these clusters are not altered.
More specifically, if two clusters are negatively-related or positively-related, the
relationship not only remains after energy reduction, but gets further enhanced
for human observation. For example, we can observe the blue cluster is negatively
related to the pink cluster between the last two columns in Figure 3(a) while it
is almost impossible to see this in Figure 2(c). As another example, the negative relation between the cyan cluster and the light green cluster is more evident in Figure 3(b) than in Figure 2(i). Finally, rather than hindering the observation of
cluster interactions, outliers of each cluster can be easily identified as those few
lines far away from the majority of lines.
In summary, given an order of coordinates, our energy reduction model effi-
ciently provides better views of clusters for visual knowledge discovery at both
the macro level (i.e., cluster interactions) and the micro level (i.e., individual
lines with outliers clearly exposed).
energy reduction model, such that cluster interactions are much easier to observe
without compromising their essential details. Our empirical study on visualizing
real datasets confirms that our method is effective and efficient. Our visual-
ization techniques can further be combined with other visualization tools for
better results, e.g., applying various visual rendering algorithms to enhance our
visualization effects.
Acknowledgement. This work was supported by the US National Science
Foundation under Grant #1019343 to the Computing Research Association for
the CIFellows Project.
References
1. Ankerst, M., Berchtold, S., Keim, D.A.: Similarity clustering of dimensions for an
enhanced visualization of multidimensional data. In: IEEE Symposium on Infor-
mation Visualization (INFOVIS), p. 52 (1998)
2. Artero, A.O., de Oliveira, M.C.F., Levkowitz, H.: Uncovering clusters in crowded
parallel coordinates visualizations. In: IEEE Symposium on Information Visualiza-
tion (INFOVIS), pp. 81–88 (2004)
3. Berthold, M.R., Hall, L.O.: Visualizing fuzzy points in parallel coordinates. IEEE
Transactions on Fuzzy Systems 11(3), 369–374 (2003)
4. Böckenhauer, H.-J., Hromkovič, J., Kneis, J., Kupke, J.: On the Approximation
Hardness of Some Generalizations of TSP. In: Arge, L., Freivalds, R. (eds.) SWAT
2006. LNCS, vol. 4059, pp. 184–195. Springer, Heidelberg (2006)
5. Christofides, N.: Worst-case analysis of a new heuristic for the travelling salesman
problem. Graduate School of Industrial Administration, CMU, Report 388 (1976)
6. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms.
The MIT Press (2001)
7. Dasgupta, A., Kosara, R.: Pargnostics: Screen-Space Metrics for Parallel Coordi-
nates. IEEE Transactions on Visualization and Computer Graphics 16(6), 1017–
1026 (2010)
8. Ellis, G., Dix, A.: Enabling automatic clutter reduction in parallel coordinate plots.
IEEE Transactions on Visualization and Computer Graphics 12, 717–724 (2006)
9. Fua, Y.-H., Ward, M.O., Rundensteiner, E.A.: Hierarchical parallel coordinates for
exploration of large datasets. IEEE Visualization, 43–50 (1999)
10. Gross, J.L., Yellen, J.: Graph theory and its applications. CRC Press (2006)
11. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann
(2000)
12. Hassin, R., Rubinstein, S.: A 7/8-approximation algorithm for metric max tsp. Inf.
Process. Lett. 81(5), 247–251 (2002)
13. Holten, D., Van Wijk, J.J.: Evaluation of Cluster Identification Performance for
Different PCP Variants. Computer Graphics Forum 29(3), 793–802 (2010)
14. Hoogeveen, J.A.: Analysis of christofides’ heuristic: Some paths are more difficult
than cycles. Operations Research Letters 10(5), 291–295 (1991)
15. Hurley, C.B.: Clustering visualizations of multidimensional data. Journal of Com-
putational and Graphical Statistics 13(4), 788–806 (2004)
16. Hurley, C.B., Oldford, R.W.: Pairwise display of high-dimensional information via
eulerian tours and hamiltonian decompositions. Journal of Computational and
Graphical Statistics 19(4), 861–886 (2010)
17. Hurley, C.B., Oldford, R.W.: Eulerian tour algorithms for data visualization and
the pairviz package. Computational Statistics 26(4), 613–633 (2011)
18. Inselberg, A.: The plane with parallel coordinates. The Visual Computer 1(2),
69–91 (1985)
19. Jain, A.K., Narasimha Murty, M., Flynn, P.J.: Data clustering: A review. ACM
Comput. Surv. 31(3), 264–323 (1999)
20. Johansson, J., Ljung, P., Jern, M., Cooper, M.: Revealing structure within clustered
parallel coordinates displays. In: IEEE Symposium on Information Visualization
(INFOVIS), p. 17 (2005)
21. Kendall, M.G.: A new measure of rank correlation. Biometrika 30(1/2), 81–93
(1938)
22. Knight, W.R.: A computer method for calculating kendall’s tau with ungrouped
data. Journal of the American Statistical Association, 436–439 (1966)
23. Kosara, R., Bendix, F., Hauser, H.: Parallel sets: Interactive exploration and visual
analysis of categorical data. IEEE Trans. Vis. Comput. Graph. 12(4), 558–568
(2006)
24. Laporte, G.: The traveling salesman problem: An overview of exact and approxi-
mate algorithms. European Journal of Operational Research 59(2), 231–247 (1992)
25. Moustafa, R., Wegman, E.: Multivariate Continuous Data - Parallel Coordinates.
Springer, New York (2006)
26. Nelson, R.B.: Kendall tau metric. Encyclopaedia of Mathematics 3, 226–227 (2001)
27. Peng, W., Ward, M.O., Rundensteiner, E.A.: Clutter reduction in multi-
dimensional data visualization using dimension reordering. In: IEEE Symposium
on Information Visualization (INFOVIS), pp. 89–96 (2004)
28. Siirtola, H., Räihä, K.J.: Interacting with parallel coordinates. Interacting with
Computers 18(6), 1278–1309 (2006)
29. Wegman, E.J.: The grand tour in k-dimensions. In: Computing Science and Statis-
tics: Proceedings of the 22nd Symposium on the Interface, pp. 127–136 (1991)
30. Wegman, E.J.: Hyperdimensional data analysis using parallel coordinates. Journal
of the American Statistical Association 85(411), 664–675 (1990)
31. Wegman, E.J.: Visual data mining. Statistics in Medicine 22, 1383–1397 (2003)
32. Wilhelm, A.F.X., Wegman, E.J., Symanzik, J.: Visual clustering and classification:
The oronsay particle size data set revisited. Computational Statistics 14, 109–146
(1999)
33. Wong, P.C., Bergeron, R.D.: 30 years of multidimensional multivariate visualiza-
tion. Scientific Visualization, 3–33 (1994)
34. Yang, J., Peng, W., Ward, M.O., Rundensteiner, E.A.: Interactive hierarchical di-
mension ordering, spacing and filtering for exploration of high dimensional datasets.
In: INFOVIS (2003)
35. Zhao, K., Liu, B., Tirpak, T.M., Schaller, A.: Detecting patterns of change using
enhanced parallel coordinates visualization. In: ICDM, p. 747 (2003)
36. Zhou, H., Yuan, X., Qu, H., Cui, W., Chen, B.: Visual clustering in parallel coor-
dinates. Comput. Graph. Forum 27(3), 1047–1054 (2008)
Feature Enriched Nonparametric Bayesian
Co-clustering
1 Introduction
Co-clustering [11] has emerged as an important approach for mining relational
data. Often, data can be organized in a matrix, where rows and columns present
a symmetrical relation. Co-clustering simultaneously groups the different kinds
of objects involved in a relation; for example, proteins and molecules indexing
a contingency matrix that holds information about their interaction. Molecules
are grouped based on their binding patterns to proteins; similarly, proteins are
clustered based on the molecules they interact with. The two clustering processes
are inter-dependent. Understanding these interactions provides insight into the
underlying biological processes and is useful for designing therapeutic drugs.
Existing co-clustering techniques typically only leverage the entries of the
given contingency matrix to perform the two-way clustering. As a consequence,
they cannot predict the interaction values for new objects. This greatly limits
the applicability of current co-clustering approaches.
In many applications additional features associated to the objects of interest
are available, e.g., sequence information for proteins. Such features can be lever-
aged to perform predictions on new data. The Infinite Hidden Relational Model
(IHRM) [36] has been proposed to leverage features associated to the rows and
columns of the contingency matrix to forecast relationships among previously un-
seen data. Although IHRM was originally introduced from a relational learning
point of view, it is essentially a co-clustering model that overcomes the afore-
mentioned limitations of existing co-clustering techniques. In particular, IHRM
is a nonparametric Bayesian model, which learns the number of row and column
clusters from the given samples. This is achieved by assigning Dirichlet Process priors to the rows and columns of the contingency matrix. As such, IHRM does
not require the a priori specification of the numbers of row and column clusters
in the data.
Existing Bayesian co-clustering models [30,35,19] are related to IHRM, but
none makes use of features associated to the rows and columns of the contin-
gency matrix. As a consequence, these methods can handle missing entries only
for already observed rows and columns (e.g., for a protein and a molecule used
during training, although not necessarily in combination). In particular, IHRM
can be viewed as an extension to the nonparametric Bayesian co-clustering
(NBCC) model [19]. IHRM adds to NBCC the ability to exploit features as-
sociated to rows and columns, thus enabling IHRM to predict entries for unseen
rows and/or columns. The authors in [36] have applied IHRM to collaborative
filtering [27]. Co-clustering techniques have also been applied to collaborative
filtering [33,15,10], but again none of these involve features associated to rows
or columns of the data matrix.
The work on IHRM [36] lacks an evaluation of the improvement that can be
achieved when leveraging features to make predictions for unseen objects. In this
work, we fill this gap and re-interpret IHRM from a co-clustering point of view.
We call the resulting method Feature Enriched Dirichlet Process Co-clustering
(FE-DPCC). We focus on the empirical evaluation of forecasting relationships
between previously unseen objects by leveraging object features.
2 Related Work
Researchers have proposed several discriminative and generative co-clustering
models, e.g. [7,29]. Bayesian Co-clustering (BCC) [30] maintains separate Dirich-
let priors for row- and column-cluster probabilities. To generate an entry in the
data matrix, the model first generates the row and column clusters for the en-
try from their respective Dirichlet-multinomial distributions. The entry is then
generated from a distribution specific to the row- and column-cluster. Like the
original Latent Dirichlet Allocation (LDA) [5] model, BCC assumes symmetric
Dirichlet priors for the data distributions given the row- and column-clusters.
Shan and Banerjee [30] proposed a variational Bayesian algorithm to perform
inference with the BCC model. In [35], the authors proposed a variation of BCC,
and developed a collapsed Gibbs sampling and a collapsed variational algorithm
to perform inference. All the aforementioned co-clustering models are parametric, i.e., they require the numbers of row- and column-clusters to be specified in advance.
A nonparametric Bayesian co-clustering (NBCC) approach has been proposed
in [19]. NBCC assumes two independent Bayesian priors on rows and columns.
As such, NBCC does not require the numbers of row- and column-clusters to be specified a priori. NBCC assumes a Pitman-Yor Process [24] prior, which generalizes the Dirichlet Process. The feature-enriched method we introduce here is an extension of NBCC in which features associated with rows and columns are used. Such features
enable our technique to predict entries for unseen rows/columns.
A related work is Bayesian matrix factorization. In [17], the authors allevi-
ated overfitting in singular value decomposition (SVD) by specifying a prior
distribution over parameters, and performing variational inference. In [26], the
authors proposed a Bayesian probabilistic matrix factorization method, that as-
signs a prior distribution to the Gaussian parameters involved in the model.
These Bayesian approaches to matrix factorization are parametric. Nonparamet-
ric Bayesian matrix factorization models include [8,32,25].
Our work is also related to collaborative filtering (CF) [27]. CF learns the
relationships between users and items using only user preferences to items, and
then recommends items to users based on the learned relationships. Various ap-
proaches have been proposed to discover underlying patterns in user consump-
tion behaviors [6,16,1,18,17,26,31,12,14]. Co-clustering techniques have already
been applied to CF [33,15,10]. None of these techniques involve features asso-
ciated to rows or columns of the data matrix. On the contrary, content-based
(CB) recommendation systems [3] predict user preferences to items using user
and item features. In practice, CB methods are usually combined with CF. The
approach we introduce in this paper is a Bayesian combination of CF and CB.
$$\pi_k = v_k \prod_{j=1}^{k-1} (1 - v_j), \qquad G = \sum_{k=1}^{\infty} \pi_k\, \delta(\theta_k^{*}) \qquad (1)$$
not be observed.
FE-DPCC is a generative model that assumes two independent DPM priors on rows and columns. We follow a stick-breaking representation to describe the FE-DPCC model. Specifically, assuming row and column DP priors $\mathrm{Dir}(\alpha_0^R, G_0^R)$ and $\mathrm{Dir}(\alpha_0^C, G_0^C)$, FE-DPCC draws row-cluster parameters $\theta_k^{*R}$ from $G_0^R$, for $k = \{1, \cdots, \infty\}$, column-cluster parameters $\theta_l^{*C}$ from $G_0^C$, for $l = \{1, \cdots, \infty\}$, and co-cluster parameters $\theta_{kl}^{*E}$ from $G_0^E$, for each combination of k and l; it then draws the row mixture proportion $\pi^R$ and the column mixture proportion $\pi^C$ as defined in Eq. (1). For each row r and each column c, FE-DPCC draws the row-cluster indicator $z_r^R$ and the column-cluster indicator $z_c^C$ according to $\pi^R$ and $\pi^C$, respectively. Further, FE-DPCC assumes the observed features of each row r and each column c are drawn from two parametric distributions $F(\cdot|\theta_k^{*R})$ and $F(\cdot|\theta_l^{*C})$, respectively, and each entry $x_{rc}^E$ of the relational feature matrix is drawn from a parametric distribution $F(\cdot|\theta_{kl}^{*E})$, where $z_r^R = k$ and $z_c^C = l$.

Fig. 1. FE-DPCC model (plate diagram)
The generative process for FE-DPCC is as follows and the FE-DPCC model
is illustrated in Figure 1.
1. Draw $v_k^R \sim \mathrm{Beta}(1, \alpha_0^R)$, for $k = \{1, \cdots, \infty\}$, and calculate $\pi^R$ as in Eq. (1)
2. Draw $\theta_k^{*R} \sim G_0^R$, for $k = \{1, \cdots, \infty\}$
3. Draw $v_l^C \sim \mathrm{Beta}(1, \alpha_0^C)$, for $l = \{1, \cdots, \infty\}$, and calculate $\pi^C$ as in Eq. (1)
4. Draw $\theta_l^{*C} \sim G_0^C$, for $l = \{1, \cdots, \infty\}$
5. For each row r, draw $z_r^R$ according to $\pi^R$ and draw $x_r^R \sim F(\cdot|\theta_{z_r^R}^{*R})$
6. For each column c, draw $z_c^C$ according to $\pi^C$
7. For each column c, draw $x_c^C \sim F(\cdot|\theta_{z_c^C}^{*C})$
8. For each entry $x_{rc}^E$, draw $x_{rc}^E \sim F(\cdot|\theta_{z_r^R z_c^C}^{*E})$
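A truncated stick-breaking sketch of this generative process, with Gaussian distributions standing in for the unspecified conjugate pairs F and G0; the truncation levels K and L and every name here are our own assumptions for illustration.

```python
import numpy as np

def stick_breaking(alpha, K, rng):
    """Truncated stick-breaking weights: pi_k = v_k * prod_{j<k} (1 - v_j)."""
    v = rng.beta(1.0, alpha, size=K)
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return pi / pi.sum()                     # renormalize the truncation

def generate_fe_dpcc(R, C, K, L, alpha_r, alpha_c, seed=0):
    """Sample cluster indicators, row/column features, and relational entries."""
    rng = np.random.default_rng(seed)
    pi_r, pi_c = stick_breaking(alpha_r, K, rng), stick_breaking(alpha_c, L, rng)
    theta_r = rng.normal(0.0, 3.0, size=K)       # row-cluster parameters (toy)
    theta_c = rng.normal(0.0, 3.0, size=L)       # column-cluster parameters (toy)
    theta_e = rng.normal(0.0, 3.0, size=(K, L))  # co-cluster parameters (toy)
    z_r = rng.choice(K, size=R, p=pi_r)          # row-cluster indicators
    z_c = rng.choice(L, size=C, p=pi_c)          # column-cluster indicators
    x_r = rng.normal(theta_r[z_r], 1.0)          # row features
    x_c = rng.normal(theta_c[z_c], 1.0)          # column features
    x_e = rng.normal(theta_e[z_r][:, z_c], 1.0)  # R x C relational matrix
    return z_r, z_c, x_r, x_c, x_e
```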
4.1 Inference
The likelihood of the observed data is given by:

$$p(X \mid Z^R, Z^C, \boldsymbol{\theta}^{*R}, \boldsymbol{\theta}^{*C}, \boldsymbol{\theta}^{*E}) = \Big( \prod_{r=1}^{R} f(x_r^R \mid \theta_{z_r^R}^{*R}) \Big) \Big( \prod_{c=1}^{C} f(x_c^C \mid \theta_{z_c^C}^{*C}) \Big) \Big( \prod_{r=1}^{R} \prod_{c=1}^{C} f(x_{rc}^E \mid \theta_{z_r^R z_c^C}^{*E}) \Big)$$

where $f(\cdot|\theta_k^{*R})$, $f(\cdot|\theta_l^{*C})$ and $f(\cdot|\theta_{kl}^{*E})$ denote the probability density (or mass) functions of $F(\cdot|\theta_k^{*R})$, $F(\cdot|\theta_l^{*C})$ and $F(\cdot|\theta_{kl}^{*E})$, respectively; $Z^R = \{z_r^R \mid r = 1, \cdots, R\}$; $Z^C = \{z_c^C \mid c = 1, \cdots, C\}$; $\boldsymbol{\theta}^{*R} = \{\theta_k^{*R} \mid k = 1, \cdots, \infty\}$; $\boldsymbol{\theta}^{*C} = \{\theta_l^{*C} \mid l = 1, \cdots, \infty\}$; and $\boldsymbol{\theta}^{*E} = \{\theta_{kl}^{*E} \mid k = 1, \cdots, \infty,\ l = 1, \cdots, \infty\}$.
The marginal likelihood obtained by integrating out the model parameters $\boldsymbol{\theta}^{*R}$, $\boldsymbol{\theta}^{*C}$, and $\boldsymbol{\theta}^{*E}$ is:

$$p(X \mid Z^R, Z^C, G_0^R, G_0^C, G_0^E) = \prod_{r=1}^{R} \int f(x_r^R \mid \theta_{z_r^R}^{*R})\, g(\theta_{z_r^R}^{*R} \mid \zeta^R)\, d\theta_{z_r^R}^{*R}
\;\times\; \prod_{c=1}^{C} \int f(x_c^C \mid \theta_{z_c^C}^{*C})\, g(\theta_{z_c^C}^{*C} \mid \zeta^C)\, d\theta_{z_c^C}^{*C}
\;\times\; \prod_{r=1}^{R} \prod_{c=1}^{C} \int f(x_{rc}^E \mid \theta_{z_r^R z_c^C}^{*E})\, g(\theta_{z_r^R z_c^C}^{*E} \mid \zeta^E)\, d\theta_{z_r^R z_c^C}^{*E} \qquad (2)$$
where $g(\cdot|\zeta^R)$, $g(\cdot|\zeta^C)$ and $g(\cdot|\zeta^E)$ denote the probability density functions of $G_0^R$, $G_0^C$ and $G_0^E$, respectively. We assume $F(\cdot|\theta_k^{*R})$ and $G_0^R$, $F(\cdot|\theta_l^{*C})$ and $G_0^C$, and $F(\cdot|\theta_{kl}^{*E})$ and $G_0^E$ are all pairwise conjugate. Thus, there is a closed-form expression for the marginal likelihood (2). The conditional distribution for sampling the row-cluster indicator variable $z_r^R$ for the r-th row $x_r^R$ is as follows. For populated row-clusters $k \in \{z_{r'}^R\}_{r' = \{1, \cdots, r-1, r+1, \cdots, R\}}$,

1 Every co-cluster is indexed by a row-cluster ID and a column-cluster ID. Thus, we denote a co-cluster defined by the k-th row-cluster and the l-th column-cluster as (k, l).
$$p(z_r^R = k \mid x_r^R, \{x_{rc}^E\}_{c \in \{1, \cdots, C\}}, X^{R \neg r}, X^{E \neg r}, Z^{R \neg r}) \;\propto\; \frac{N_k^{\neg r}}{R - 1 + \alpha_0^R} \int f(x_r^R \mid \theta_k^{*R})\, g(\theta_k^{*R} \mid \zeta_k^{*R \neg r})\, d\theta_k^{*R} \;\prod_{c=1}^{C} \int f(x_{rc}^E \mid \theta_{k z_c^C}^{*E})\, g(\theta_{k z_c^C}^{*E} \mid \zeta_{k z_c^C}^{*E \neg r})\, d\theta_{k z_c^C}^{*E} \qquad (3)$$
where $\neg r$ means excluding the r-th row, $N_k^{\neg r}$ is the number of rows assigned to the k-th row-cluster excluding the r-th row, $\zeta_k^{*R \neg r}$ is the hyperparameter of the posterior distribution of the k-th row-cluster parameter $\theta_k^{*R}$ given all rows assigned to the k-th row-cluster excluding the r-th row, and $\zeta_{k z_c^C}^{*E \neg r}$ is the hyperparameter of the posterior distribution of the co-cluster $(k, z_c^C)$ given all entries assigned to it excluding the entries in the r-th row. When $k \notin \{z_{r'}^R\}_{r' = \{1, \cdots, r-1, r+1, \cdots, R\}}$, i.e., $z_r^R$ is being set to its own singleton row-cluster, the conditional distribution becomes:
$$p(z_r^R = k \mid x_r^R, \{x_{rc}^E\}_{c \in \{1, \cdots, C\}}, X^{R \neg r}, X^{E \neg r}, Z^{R \neg r}) \;\propto\; \frac{\alpha_0^R}{R - 1 + \alpha_0^R} \int f(x_r^R \mid \theta_k^{*R})\, g(\theta_k^{*R} \mid \zeta^R)\, d\theta_k^{*R} \;\prod_{c=1}^{C} \int f(x_{rc}^E \mid \theta_{k z_c^C}^{*E})\, g(\theta_{k z_c^C}^{*E} \mid \zeta_{k z_c^C}^{*E \neg r})\, d\theta_{k z_c^C}^{*E} \qquad (4)$$
Similarly, for populated column-clusters l, the conditional distribution for sampling the column-cluster indicator $z_c^C$ of the c-th column is:

$$p(z_c^C = l \mid x_c^C, \{x_{rc}^E\}_{r \in \{1, \cdots, R\}}, X^{C \neg c}, X^{E \neg c}, Z^{C \neg c}) \;\propto\; \frac{N_l^{\neg c}}{C - 1 + \alpha_0^C} \int f(x_c^C \mid \theta_l^{*C})\, g(\theta_l^{*C} \mid \zeta_l^{*C \neg c})\, d\theta_l^{*C} \;\prod_{r=1}^{R} \int f(x_{rc}^E \mid \theta_{z_r^R l}^{*E})\, g(\theta_{z_r^R l}^{*E} \mid \zeta_{z_r^R l}^{*E \neg c})\, d\theta_{z_r^R l}^{*E} \qquad (5)$$
where $\neg c$ means excluding the c-th column, $N_l^{\neg c}$ is the number of columns assigned to the l-th column-cluster excluding the c-th column, $\zeta_l^{*C \neg c}$ is the hyperparameter of the posterior distribution of the l-th column-cluster parameter $\theta_l^{*C}$ given all columns assigned to the l-th column-cluster excluding the c-th column, and $\zeta_{z_r^R l}^{*E \neg c}$ is the hyperparameter of the posterior distribution of the co-cluster $(z_r^R, l)$ given all entries assigned to it excluding the entries in the c-th column. If $z_c^C \notin \{z_{c'}^C\}_{c' = \{1, \cdots, c-1, c+1, \cdots, C\}}$, i.e., $z_c^C$ is being assigned to its own singleton column-cluster, the conditional distribution becomes:
$$p(z_c^C = l \mid x_c^C, \{x_{rc}^E\}_{r \in \{1, \cdots, R\}}, X^{C \neg c}, X^{E \neg c}, Z^{C \neg c}) \;\propto\; \frac{\alpha_0^C}{C - 1 + \alpha_0^C} \int f(x_c^C \mid \theta_l^{*C})\, g(\theta_l^{*C} \mid \zeta^C)\, d\theta_l^{*C} \;\prod_{r=1}^{R} \int f(x_{rc}^E \mid \theta_{z_r^R l}^{*E})\, g(\theta_{z_r^R l}^{*E} \mid \zeta_{z_r^R l}^{*E \neg c})\, d\theta_{z_r^R l}^{*E} \qquad (6)$$
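To show how the row step of equations (3)-(4) can be organized in code, here is a schematic sketch; `marg_row_feat` and `marg_entry` stand for the closed-form marginal likelihoods that come from the assumed conjugate pairs (passing k = -1 requests the prior predictive for a new cluster), and all names are our own assumptions.

```python
import numpy as np

def sample_row_cluster(r, row_feat, row_entries, z_r, z_c, alpha_r,
                       marg_row_feat, marg_entry, rng):
    """Sample z_r^R for row r from the conditionals (3)-(4).

    row_feat      : feature vector of row r
    row_entries   : dict {column index -> observed relational entry of row r}
    z_r, z_c      : current row / column cluster assignments
    marg_row_feat : f(row_feat | cluster k) marginal likelihood, k == -1 for a new cluster
    marg_entry    : f(entry | co-cluster (k, l)) marginal likelihood, row r excluded
    """
    R = len(z_r)
    others = [z for i, z in enumerate(z_r) if i != r]
    candidates = sorted(set(others)) + [-1]            # -1: open a new row-cluster
    log_p = []
    for k in candidates:
        n_k = others.count(k) if k != -1 else 0
        prior = (n_k if k != -1 else alpha_r) / (R - 1 + alpha_r)
        lp = np.log(prior) + np.log(marg_row_feat(row_feat, k))
        for c, value in row_entries.items():
            lp += np.log(marg_entry(value, k, z_c[c]))
        log_p.append(lp)
    log_p = np.asarray(log_p)
    p = np.exp(log_p - np.logaddexp.reduce(log_p))
    return candidates[rng.choice(len(candidates), p=p / p.sum())]
```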
5 Experimental Evaluation
We conducted experiments on two rating datasets and two protein-molecule
interaction datasets. MovieLens2 is a movie recommendation dataset containing
100,000 ratings in a sparse data matrix for 1682 movies rated by 943 users. Jester3
is a joke rating dataset. The original dataset contains 4.1 million continuous
ratings of 140 jokes from 73,421 users. We chose a subset containing 100,000
ratings. Following [30], we uniformly discretized the ratings into 10 bins.
2 http://www.grouplens.org/node/73
3 http://goldberg.berkeley.edu/jester-data/
4 http://pharminfo.pharm.kyoto-u.ac.jp/services/glida/
5 http://pubchem.ncbi.nlm.nih.gov/
$x_r^R = \langle a_r, g_r, o_r \rangle$, where $a_r$, $g_r$, and $o_r$ represent the age, gender and occupation, respectively. The predictive distribution of the k-th row-cluster observing a new user $x_r^R$ is:

$$p(x_r^R \mid \varrho_k^*, \varsigma_k^*, \kappa_k^*, \varpi_k^*, \varphi_k^*, z_r^R = k) = \int \mathrm{Poi}(a_r \mid \lambda_k^*)\, \mathrm{Gamma}(\lambda_k^* \mid \varrho_k^*, \varsigma_k^*)\, d\lambda_k^* \int \mathrm{Ber}(g_r \mid \vartheta_k^*)\, \mathrm{Beta}(\vartheta_k^* \mid \kappa_k^*, \varpi_k^*)\, d\vartheta_k^* \int \mathrm{Cat}(o_r \mid \phi_k^*)\, \mathrm{Dir}(\phi_k^* \mid \varphi_k^*)\, d\phi_k^* \qquad (7)$$

where $\varrho_k^*$, $\varsigma_k^*$, $\kappa_k^*$, $\varpi_k^*$, and $\varphi_k^*$ are the posterior hyperparameters (k indexes the row-clusters). Denote $\zeta_k^{*R} = \langle \varrho_k^*, \varsigma_k^*, \kappa_k^*, \varpi_k^*, \varphi_k^* \rangle$. We assume that the features associated with movies are generated from a Multinomial distribution, $\mathrm{Mul}(\cdot|\psi)$, with Dirichlet prior, $\mathrm{Dir}(\psi|\varphi)$. Accordingly, $\theta_l^{*C} = \psi_l^*$ and $\zeta^C = \varphi$. The predictive distribution of the l-th column-cluster observing a new movie $x_c^C$ is $p(x_c^C \mid \varphi_l^*, z_c^C = l) = \int \mathrm{Mul}(x_c^C \mid \psi_l^*)\, \mathrm{Dir}(\psi_l^* \mid \varphi_l^*)\, d\psi_l^*$, where $\zeta_l^{*C} = \varphi_l^*$ is the posterior hyperparameter of the Dirichlet distribution (l indexes the column-clusters).
In Jester, there are no features associated with the users (rows), thus row-clusters cannot predict an unseen user. We used a bag-of-words representation for the joke features, and assumed each joke feature vector is generated from a Multinomial distribution, $\mathrm{Mul}(\cdot|\psi)$, with a Dirichlet prior, $\mathrm{Dir}(\psi|\varphi)$. The predictive distribution of the l-th column-cluster observing a new joke $x_c^C$ is $p(x_c^C \mid \varphi_l^*, z_c^C = l) = \int \mathrm{Mul}(x_c^C \mid \psi_l^*)\, \mathrm{Dir}(\psi_l^* \mid \varphi_l^*)\, d\psi_l^*$.
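Under the multinomial-Dirichlet conjugacy, this predictive integral has the standard closed form; a hedged sketch follows (names are ours, and the constant multinomial coefficient is dropped since it does not affect cluster comparison).

```python
from math import lgamma

def log_dirichlet_multinomial(counts, phi_star):
    """log of the integral Mul(counts | psi) * Dir(psi | phi_star) d psi,
    up to the multinomial coefficient (identical for every column-cluster).

    counts   : bag-of-words count vector of the new column (e.g. a new joke)
    phi_star : posterior Dirichlet hyperparameters of the column-cluster
    """
    n, a0 = sum(counts), sum(phi_star)
    log_p = lgamma(a0) - lgamma(a0 + n)
    for c, a in zip(counts, phi_star):
        log_p += lgamma(a + c) - lgamma(a)
    return log_p
```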
For MP1 and MP2, rows represent molecules and columns represent proteins.
We extracted k-mer features from protein sequences. For MP1, we also used
hierarchical features for proteins obtained from annotation databases. We used
5.2 Results
We performed a series of experiments to evaluate the performance of FE-DPCC
across the four datasets. All experiments were repeated five times, and we re-
port the average (and standard deviation) perplexity across the five runs. The
experiments were performed on an Intel four core, Linux server with 4GB mem-
ory. The average running time for FE-DPCC was 1, 3, 3.5 and 2.5 hours on the
MovieLens, Jester, MP1 and MP2 datasets, respectively.
[Figures: co-cluster heatmaps of mean entry values (Users × Movies, Users × Jokes) and test perplexity of FE-DPCC vs. DPCC as a function of matrix density percentage]
that incorporating row and column features is beneficial for the prediction of
relationships.
Visualization of Co-clusters. In Figure 2 we illustrate the co-cluster struc-
tures learned by FE-DPCC on MovieLens and Jester. We calculate the mean
entry value for each co-cluster, and plot the resulting mean values.
Data Density. We varied the density of MovieLens and Jester to see how it
affects the perplexity of FE-DPCC and DPCC. We varied the matrix density by
randomly sampling 25%, 50% and 75% of the entries in the training data. The
sampled matrices were then given as input to DPCC and FE-DPCC to train a
model and infer unknown entries on the test data. Figure 3 illustrates the results
averaged across five iterations. As the sparsity of the relational matrix increases, the test perplexity increases for both FE-DPCC and DPCC, but DPCC has far higher perplexity on sparser matrices. As the matrix sparsity increases, the information within the relational matrix is lost and the FE-DPCC algorithm relies more on the row and column features. Thus, for sparser matrices FE-DPCC shows far better results than DPCC. These experiments suggest the reason why
we see a more dramatic difference between the two algorithms for MP1 and MP2,
which are very sparse (see Table 1).
6 Conclusion
In this work, we focus on the empirical evaluation of FE-DPCC to predict rela-
tionships between previously unseen objects by using object features. We con-
ducted experiments on a variety of relational data, including protein-molecule
interaction data. The evaluation demonstrates the effectiveness of the feature-
enriched approach and demonstrates that features are most useful when data
are sparse.
References
1. Agarwal, D., Chen, B.-C.: Regression-based latent factor models. In: Proceedings
of the ACM International Conference on Knowledge Discovery and Data Mining,
pp. 19–28 (2009)
23. Papaspiliopoulos, O., Roberts, G.O.: Retrospective Markov chain Monte Carlo
methods for Dirichlet process hierarchical models. Biometrika 95(1), 169–186
(2008)
24. Pitman, J., Yor, M.: The two-parameter Poisson-Dirichlet distribution derived from
a stable subordinator. Annals of Probability 25(2), 855–900 (1997)
25. Porteous, I., Asuncion, A., Welling, M.: Bayesian matrix factorization with side
information and dirichlet process mixtures. In: AAAI (2010)
26. Salakhyuditnov, R., Mnih, A.: Bayesian Probabilistic Matrix Factorization using
Markov Chain Monte Carlo. In: International Conference on Machine Learning
(2008)
27. Schafer, J.B., Konstan, J., Riedi, J.: Recommender systems in e-commerce. In:
Proceedings of the ACM Conference on Electronic Commerce, pp. 158–166 (1999)
28. Sethuraman, J.: A constructive definition of Dirichlet priors. Statistica Sinica 4,
639–650 (1994)
29. Shafiei, M., Milios, E.: Latent Dirichlet co-clustering. In: IEEE International Con-
ference on Data Mining, pp. 542–551 (2006)
30. Shan, H., Banerjee, A.: Bayesian co-clustering. In: IEEE International Conference
on Data Mining (2008)
31. Shan, H., Banerjee, A.: Generalized probabilistic matrix factorizations for collab-
orative filtering. In: Proceedings of the IEEE International Conference on Data
Mining, pp. 1025–1030 (2010)
32. Sutskever, I., Salakhutdinov, R., Tenenbaum, J.: Modelling relational data using
Bayesian clustered tensor factorization. In: Advances in Neural Information Pro-
cessing Systems, vol. 22, pp. 1821–1828 (2009)
33. Symeonidis, P., Nanopoulos, A., Papadopoulos, A., Manolopoulos, Y.: Nearest-
Biclusters Collaborative Filtering with Constant Values. In: Nasraoui, O.,
Spiliopoulou, M., Srivastava, J., Mobasher, B., Masand, B. (eds.) WebKDD 2006.
LNCS (LNAI), vol. 4811, pp. 36–55. Springer, Heidelberg (2007)
34. Wale, N., Karypis, G.: AFGEN. Technical report, Department of Computer Science & Engineering, University of Minnesota (2007), http://www.cs.umn.edu/~karypis
35. Wang, P., Domeniconi, C., Laskey, K.: Latent Dirichlet Bayesian co-clustering. In:
Proceedings of the European Conference on Machine Learning, pp. 522–537 (2009)
36. Xu, Z., Tresp, V., Yu, K., Kriegel, H.: Infinite hidden relational models. In: Pro-
ceedings of the International Conference on Uncertainity in Artificial Intelligence
(2006)
Shape-Based Clustering for Time Series Data
Abstract. One of the most famous algorithms for time series data
clustering is k-means clustering with Euclidean distance as a similarity
measure. However, many recent works have shown that Dynamic Time
Warping (DTW) distance measure is more suitable for most time series
data mining tasks due to its much improved alignment based on shape.
Unfortunately, k-means clustering with DTW distance is still not prac-
tical since the current averaging functions fail to preserve characteristics
of time series data within the cluster. Recently, Shape-based Template
Matching Framework (STMF) has been proposed to discover a cluster
representative of time series data. However, STMF is very computa-
tionally expensive. In this paper, we propose a Shape-based Clustering
for Time Series (SCTS) using a novel averaging method called Ranking
Shape-based Template Matching Framework (RSTMF), which can average
a group of time series effectively while taking as much as 400 times
less computation time than STMF. In addition, our method
outperforms other well-known clustering techniques in terms of accuracy
and of criteria based on known ground truth.
1 Introduction
Time series data mining is an increasingly active research area, since time series data are ubiquitous, appearing in various domains including medicine [15], geology [13], etc. One of its main mining tasks is clustering, a method for separating unlabeled data into their natural groupings. In many applications related to time series data [14], k-means clustering [2] is generally used with the Euclidean distance function and amplitude averaging (the arithmetic mean) as the averaging method.
Although the Euclidean distance is popular and simple, it is not suitable for time series data because the distance between two sequences is calculated in a one-to-one manner. As a result, k-means with Euclidean distance does not cluster well, because time shifting among data sequences in the same class usually occurs. In time series mining, especially time series classification, the Dynamic Time Warping (DTW) [1] distance has been shown to give more accurate results than the Euclidean distance. Unfortunately, k-means clustering with the DTW
distance still does not work well in practice [8][7] because the current averaging functions do not return a characteristic-preserving averaging result. Traditional k-means clustering fails to return a correct clustering result since its cluster centers do not reflect the characteristics of the data, as shown in Fig. 1. In this work, we will demonstrate that our proposed method resolves this problem.
Fig. 1. a) Sample 3-class CBF data [3] and its cluster centers from b) traditional k-
means clustering and from c) our proposed method
2 Background
This section provides background knowledge on k-means clustering, Dynamic
Time Warping (DTW) distance measure, and global constraint.
dist(p_i, q_j) = (p_i − q_j)^2 + min{ dist(p_{i−1}, q_j), dist(p_i, q_{j−1}), dist(p_{i−1}, q_{j−1}) }     (2)
The DTW distance is computed through dynamic programming to discover the minimum cumulative distance of each element in an n × m matrix. In addition, the warping path between the two sequences can be found by tracing back from the last cell.
In this work, the DTW distance is used to measure the similarity between each time series and the cluster centers to give more accurate results.
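To make the recurrence in Eq. (2) concrete, the following minimal Python sketch computes the DTW distance by dynamic programming; the optional window argument restricts warping to a Sakoe-Chiba band, anticipating the global constraint described next. The function name, the band handling and the toy data are illustrative choices, not code from this work.

import numpy as np

def dtw_distance(p, q, window=None):
    # DTW distance between 1-D sequences p and q, following the recurrence in Eq. (2).
    # window: optional Sakoe-Chiba band width as a fraction of the sequence length
    # (e.g. 0.2 for 20%); None means unconstrained warping.
    n, m = len(p), len(q)
    r = max(n, m) if window is None else max(1, int(window * max(n, m)))
    dist = np.full((n + 1, m + 1), np.inf)                # cumulative distance matrix
    dist[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(m, i + r) + 1): # only cells inside the band
            cost = (p[i - 1] - q[j - 1]) ** 2
            dist[i, j] = cost + min(dist[i - 1, j],       # dist(p_{i-1}, q_j)
                                    dist[i, j - 1],       # dist(p_i, q_{j-1})
                                    dist[i - 1, j - 1])   # dist(p_{i-1}, q_{j-1})
    return dist[n, m]

a = np.sin(np.linspace(0, 3, 60))
b = np.sin(np.linspace(0.3, 3.3, 60))      # a time-shifted copy of a
print(dtw_distance(a, b, window=0.2))      # small despite the shift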
A global constraint is used when we need to limit the amount of warping in the DTW alignment. In some applications such as speech recognition [12], two data sequences are considered to be of the same class only when small time shifting occurs; so,
Fig. 2. The warping window of P and Q is limited by the global constraint of size r
the global constraint is used to align the sequences more precisely. The Sakoe-Chiba band [12], one of the most popular global constraints, was originally proposed for the speech community and has also been used in various time series mining tasks [11]. The size of the warping window is defined by r (as shown in Fig. 2), a percentage of the time series' length, and the window is symmetric above and to the right of the diagonal. In the experiments, we will show that the global constraint plays an important role in improving the accuracy.
3 Related Work
Over the past few decades, many clustering techniques have been proposed for time series data [5], for example, agglomerative hierarchical clustering [13], which repeatedly merges the most similar objects until all objects are in a single cluster. However, this technique is still inaccurate, especially when outliers are present.
Another popular clustering technique is partitional clustering, which tries to minimize an objective function. The well-known algorithms are k-medoids and k-means clustering, which differ in how they find new cluster centers. In one k-medoids clustering application [4], the DTW distance is used as the similarity measure among data sequences, and the sequence with the minimum sum of distances to the rest of the sequences in the cluster is selected as the new cluster center. However, a medoid is not always the centroid of a cluster, so sequences can be assigned to wrong clusters.
In contrast to k-medoids clustering, k-means clustering mostly uses the Euclidean distance as the distance metric, and an arithmetic mean or amplitude averaging is simply used to find a new cluster center [14]. Although the DTW distance is more appropriate for time series data, there is currently no DTW averaging method that provides a satisfactory averaging result.
Consequently, many research works have tried to improve the quality of the averaging result. The Shape-based Template Matching Framework (STMF) [10] was recently introduced to average time series sequences. Table 1 shows the algorithm of this framework; the most similar pair of sequences is averaged by the Cubic-spline Dynamic Time Warping (CDTW) algorithm (in line 6).
Algorithm STMF(D)
1. D is the set of time series data to be averaged
2. initialize weight ω = 1 for every sequence in D
3. while(size(D) > 1)
4. {C1 , C2 } = the most similar pair of sequences in D
5. Z = CDTW(C1 , C2 , ωC1 , ωC2 )
6. ωZ = ωC1 + ωC2
7. add Z to D
8. remove C1 , C2 from D
9. end while
10. return Z
Given C_1 and C_2 as the most similar sequences, we first find the warping path between these two sequences. The variables c_{1i} and c_{2j} are the elements of C_1 and C_2 that are warped together. The averaged sequence Z, which has coordinates z_k^x and z_k^y, can be computed as follows.
Fig. 3. The average sequences between C1 and C2 using DTW alignment a) before
applying cubic spline interpolation and b) after applying cubic spline interpolation
However, with this framework, finding the most similar pair at each averaging step is enormously computationally expensive because the DTW distance between every pair of sequences must be computed. Therefore, our RSTMF mainly focuses on improving the time complexity by estimating an order of the sequences before averaging, while maintaining the accuracy of the averaging results.
dist_approx(Dist_{M_P,·}, Dist_{M_Q,·}) = max_{1≤k≤K} | Dist_{M_P,C_k} − Dist_{M_Q,C_k} |     (5)
Algorithm SCTS(D, K)
1. D is the set of time series data
2. C is the set of cluster centers
3. K is the number of clusters in C
4. M is the set of data in each cluster
5. Dist is the matrix of distances between data sequences and all cluster centers
6. initialize C as cluster centers of K clusters
7. do
8. for i = 1:size(D)
9. for k = 1:K
10. DistDi ,Ck = DTW(Di , Ck )
11. end for
12. if(DistDi ,Ck is minimal)
13. assign Di into Mk
14. end if
15. end for
16. for k = 1:K
17. Ck = RSTMF(Mk , Dist)
18. end for
19. while(the cluster membership changes)
20. return the cluster members and the cluster centers
Algorithm UPDATE(S, a, b, z)
1. S is the matrix of distances between data sequences in M
2. for i = 1:size(S)
3. SMz ,Mi = SMi ,Mz = min(SMa ,Mi , SMb ,Mi )
4. end for
5. remove SMa ,... , S...,Ma , SMb ,... , S...,Mb from S
By using dist_approx and the UPDATE method, our RSTMF achieves a large speedup because we can estimate an order of the sequences before averaging. In contrast, the original STMF needs to calculate the DTW distances to select the most similar pair of sequences at every averaging step.
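The overall SCTS loop can be sketched in Python as follows, under simplifying assumptions: DTW distances are recomputed from scratch at every iteration (no reuse of the Dist matrix and no dist_approx-based ranking), and the RSTMF/CDTW averaging is replaced by a crude pairwise warping-path average with linear resampling. All function names and the resampling step are illustrative stand-ins rather than the actual RSTMF procedure.

import numpy as np

def dtw(p, q):
    # DTW distance and warping path between 1-D sequences p and q.
    n, m = len(p), len(q)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i, j] = (p[i - 1] - q[j - 1]) ** 2 + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:                        # trace the warping path back
        path.append((i - 1, j - 1))
        step = np.argmin([d[i - 1, j - 1], d[i - 1, j], d[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return d[n, m], path[::-1]

def shape_average(seqs, length):
    # Crude stand-in for RSTMF: fold each sequence into the running average
    # along the DTW warping path, then resample back to a fixed length.
    ref = np.asarray(seqs[0], dtype=float).copy()
    for s in seqs[1:]:
        _, path = dtw(ref, s)
        merged = np.array([(ref[i] + s[j]) / 2.0 for i, j in path])
        ref = np.interp(np.linspace(0, len(merged) - 1, length),
                        np.arange(len(merged)), merged)
    return ref

def scts(data, k, max_iter=20, seed=0):
    # Shape-based Clustering for Time Series: k-means with DTW assignment.
    rng = np.random.default_rng(seed)
    centers = [np.asarray(data[i], dtype=float) for i in rng.choice(len(data), k, replace=False)]
    labels = -np.ones(len(data), dtype=int)
    for _ in range(max_iter):
        new_labels = np.array([np.argmin([dtw(x, c)[0] for c in centers]) for x in data])
        if np.array_equal(new_labels, labels):    # stop when memberships stop changing
            break
        labels = new_labels
        for r in range(k):
            members = [data[i] for i in range(len(data)) if labels[i] == r]
            if members:
                centers[r] = shape_average(members, len(members[0]))
    return labels, centers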
Datasets            Number of classes   Length of data   Size of training set   Size of test set
Synthetic Control          6                  60                 300                  300
Trace                      4                 275                 100                  100
Gunpoint                   2                 150                  50                  150
Lightning-2                2                 637                  60                   61
Lightning-7                7                 319                  70                   73
ECG                        2                  96                 100                  100
Olive Oil                  4                 570                  30                   30
Fish                       7                 463                 175                  175
CBF                        3                 128                  30                  900
Face Four                  4                 350                  24                   88
We execute each algorithm 40 times with random initial cluster centers, and the k value is set to the number of classes in each dataset. Since labeled datasets are used in all experiments, accuracy, i.e., the number of correctly assigned data sequences across all clusters, is used for evaluation. Fig. 4 shows the accuracy of our proposed method compared with the other well-known clustering methods mentioned above. According to the results, our method outperforms the others on almost all datasets.
Fig. 4. The accuracy of our RSTMF method on 10 datasets, compared with a) general k-means clustering, b) k-medoids clustering, and hierarchical clustering using c) the Euclidean distance and d) the DTW distance, respectively
Sim(G_i, C_j) = 2 |G_i ∩ C_j| / ( |G_i| + |C_j| )     (7)
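Eq. (7) is the Dice coefficient between a ground-truth class G_i and a discovered cluster C_j. A minimal sketch, assuming both are represented as sets of sequence indices (a representation chosen here for illustration):

def sim(ground_truth_class, cluster):
    g, c = set(ground_truth_class), set(cluster)
    return 2.0 * len(g & c) / (len(g) + len(c))

print(sim({0, 1, 2, 3, 4, 5}, {2, 3, 4, 5, 9}))   # 2*4 / (6 + 5) = 0.727...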
In Fig. 5, we compare our proposed work with the general k-means clustering and the k-medoids clustering using this criterion. The results show that the clusters obtained from our method are more similar to the ground-truth clusters because the RSTMF averaging method gives new cluster centers that represent the overall characteristics of the data within each cluster.
Fig. 5. The criterion based on known ground truth, comparing our proposed method
with a) general k-means clustering and b) k-medoids clustering
Fig. 6. a) The speedup achieved by our proposed work. b) The accuracy of our proposed work compared with that using STMF.
In some cases, SCTS with the DTW distance achieves a lower accuracy than the general k-means clustering. In an attempt to alleviate this drawback, we experiment with the global constraint parameter of DTW, the Sakoe-Chiba band, and can improve the clustering accuracy compared with the original k-means clustering (warping window size of 0%). Fig. 7 shows the accuracy of our proposed RSTMF and of STMF, which are comparable, as the warping window size varies. In almost all datasets, a larger warping window size does not always provide better accuracy; an appropriate warping window size is around 20%. However, in some datasets such as ECG, a wider warping window can lead to pathological warping and make the clustering accuracy decrease.
Fig. 7. The accuracy (vertical axis) of shape-based clustering using STMF and our proposed RSTMF as the warping window size (%) varies, for the a) CBF, b) ECG, c) Trace, and d) Synthetic Control datasets
6 Conclusion
In this paper, we propose a time series data clustering technique called Shape-based Clustering for Time Series (SCTS), which incorporates k-means clustering with a novel averaging method called Ranking Shape-based Template Matching Framework (RSTMF).
Compared with other well-known clustering algorithms, our SCTS yields better clustering results in terms of both accuracy and the criterion based on known ground truth, because our RSTMF averaging function provides cluster centers that preserve the characteristics of the data sequences within each cluster (as shown in Fig. 8). Furthermore, RSTMF gives a comparable sequence averaging result while consuming a few orders of magnitude less computational time than STMF; therefore, RSTMF can be practically applied in clustering algorithms. We also use a global constraint to increase the accuracy of our clusters. The results show that our SCTS provides more accurate clustering when the width of the warping window is about 20% of the time series length.
Fig. 8. The cluster centers obtained from a) our proposed method and b) the original
k-means clustering of c) sample 4-class Trace data
References
1. Berndt, D.J., Clifford, J.: Using dynamic time warping to find patterns in time
series. In: Proceedings of AAAI Workshop on Knowledge Discovery in Databases,
pp. 359–370 (1994)
2. Bradley, P.S., Fayyad, U.M.: Refining initial points for k-means clustering. In:
Proceedings of the International Conference on Machine Learning (ICML 1998),
pp. 91–99 (1998)
3. Keogh, E., Xi, X., Wei, L., Ratanamahatana, C.A. (2011),
http://www.cs.ucr.edu/~eamonn/time_series_data
4. Liao, T.W., Bodt, B., Forester, J., Hansen, C., Heilman, E., Kaste, R.C., O’May, J.:
Understanding and projecting battle states. In: Proceedings of 23rd Army Science
Conference (2002)
5. Liao, T.W.: Clustering of time series data-a survey. Pattern Recognition, 1857–1874
(2005)
6. Meesrikamolkul, W., Niennattrakul, V., Ratanamahatana, C.A.: Multiple shape-
based template matching for time series data. In: Proceedings of the 8th Interna-
tional Conference on Electrical Engineering/Electronics, Computer, Telecommuni-
cations and Information Technology (ECTI-CON 2011), pp. 464–467 (2011)
7. Niennattrakul, V., Ratanamahatana, C.: On clustering multimedia time series data
using k-means and dynamic time warping. In: Proceedings of the International
Conference on Multimedia and Ubiquitous Engineering, pp. 733–738 (2007)
8. Niennattrakul, V., Ratanamahatana, C.A.: Inaccuracies of Shape Averaging
Method Using Dynamic Time Warping for Time Series Data. In: Shi, Y., van
Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2007. LNCS, vol. 4487, pp.
513–520. Springer, Heidelberg (2007)
9. Niennattrakul, V., Ruengronghirunya, P., Ratanamahatana, C.: Exact indexing
for massive time series databases under time warping distance. Data Mining and
Knowledge Discovery 21, 509–541 (2010)
10. Niennattrakul, V., Srisai, D., Ratanamahatana, C.A.: Shape-based template
matching for time series data. Knowledge-Based Systems 26, 1–8 (2011)
11. Ratanamahatana, C.A., Keogh, E.: Making time-series classification more accurate
using learned constraints. In: Proceedings of SIAM International Conference on
Data Mining (SDM 2004), pp. 11–22 (2004)
12. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken
word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing,
43–49 (1978)
13. Shumway, R.H.: Time-frequency clustering and discriminant analysis. Statistics
and Probability Letters, 307–314 (2003)
14. Vlachos, M., Lin, J., Keogh, E., Gunopulos, D.: A wavelet-based anytime algorithm
for k-means clustering of time series. In: Proceedings of Workshop on Clustering
High Dimensionality Data and Its Applications, pp. 23–30 (2003)
15. Wismuller, A., Lange, O., Dersch, D.R., Leinsinger, G.L., Hahn, K., Pütz, B.,
Auer, D.: Cluster analysis of biomedical image time-series. International Journal
of Computer Vision, 103–128 (2002)
Privacy-Preserving EM Algorithm for Clustering on Social Network
B. Yang, I. Sato, and H. Nakagawa
1 Introduction
will occur between these two classes. Even though both assortative and disassor-
tative mixing models have theoretical and practical significance, their mixture
is more meaningful in most practical applications. If, for example (Fig. 1(c)),
researchers in the same field were treated as one group, there would generally
be more connections inside each group. In addition, some cross-disciplinary re-
searchers, such as researchers in computational linguistics, may regularly connect
with researchers in other related fields, such as linguistics, psychology, and com-
puter science, although they may frequently also connect with other members
of the same group. Newman et al. [1] proposed a probabilistic mixture model
that could deal with an assortative and disassortative mixture using an EM al-
gorithm. Such a model is realistic for the clustering problem in social networks.
With increasing concerns about personal information and privacy protection, many privacy-preserving data mining algorithms have been proposed. In this paper, we consider the clustering problem on social networks, in which each member contacts the others via various means of communication, such as social applications (MSN, Yahoo Messenger, etc.) and mobile phones of different service providers. Their records are stored in different organizations, such as Microsoft, Yahoo, and mobile service providers. The collection of such data often has large commercial value. However, it is impossible to make these competitors collaborate to perform data mining algorithms such as clustering. This motivates us to develop a secure clustering algorithm that can be performed without any support from the organizations. Using this algorithm, not only are the vertices in the network clustered, but the privacy of each vertex is also protected.
We summarize the related work in Section 2. Section 3 introduces some background knowledge and Section 4 formulates our problem. We develop two basic secure summation protocols in Section 5, and propose our main EM algorithm for private clustering based on these protocols in Section 6. In Section 7, we analyze the performance of our protocols. The results of experiments conducted to evaluate the performance of our protocols are explained in Section 8, and concluding remarks are given in the last section.
2 Related Work
Newman et al. [1] proposed a probabilistic model for the mixture of assortative and disassortative models, and provided a corresponding EM algorithm, by which all vertices are clustered so that vertices in the same cluster have the same probabilities of being connected to each vertex in the network.
Many kinds of privacy-preserving methods have been proposed to carry out data mining while protecting privacy. In general, privacy-preserving K-means clustering problems can be classified into horizontally partitioned K-means [4], vertically partitioned K-means [5] and arbitrarily partitioned K-means [6]. Methods using privacy-preserving EM clustering have also been proposed [7]. All of these methods deal with distributed databases with large amounts of data.
Secure data analysis in networks has recently attracted increasing attention. Hay et al. [8] proposed an efficient algorithm to compute the degree distribution of social networks. Another method, for computing users' privacy scores in online social networks, was provided by Liu et al. [9]. Sakuma et al. [10] used the power method to solve ranking problems such as PageRank and HITS, where each vertex in a network was treated as one party and only knew about its neighbors. Similar work was done by Kempe et al. [11].
Even though all these works provided valuable studies, clustering in private
peer-to-peer networks has not yet attracted adequate attention. We focus on the
clustering problem based on these kinds of private networks. In addition, we also
concentrate on the mixture of assortative and disassortative models.
3 Preliminaries
Let θ_{ri} denote the probability that a link from a vertex in cluster r is connected to vertex i, and let π_r denote the fraction of vertices in cluster r. The normalization conditions Σ_{r=1}^{C} π_r = 1 and Σ_{i=1}^{n} θ_{ri} = 1 are satisfied. Using the probabilistic mixture model [1], the structural features of a large-scale network can be detected by dividing the vertices of the network into clusters, such that the members of each cluster have similar patterns of connections to other vertices. We illustrate this generative graphical model in Fig. 2.
Fig. 2. The generative graphical model: g_i | π ∼ Discrete(π) and A_{ij} | g_i, θ ∼ Multinomial(θ_{g_i}), with π of dimension C, g of dimension n, θ of dimension C × n, and A of dimension n × n
Paillier encryption [12] is a public-key encryption system. It also satisfies additive homomorphism, i.e., there is an operation "·" such that, for all m_1, m_2 ∈ Z_N,

E_pk(m_1 + m_2, t) ≡ E_pk(m_1, t_1) · E_pk(m_2, t_2)  (mod N^2),     (4)

where t, t_1 and t_2 are random numbers. We will omit these random numbers when they are not necessary. Using this property, we can securely compute the cryptograph of the summation of two numbers, m_1 and m_2, given only their cryptographs. The following condition can be obtained from (4):

E_pk(m · k) ≡ E_pk(m)^k  (mod N^2).     (5)
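As a concrete illustration of properties (4) and (5), the following is a toy Python implementation of Paillier encryption (the simplified variant with g = N + 1, a weak probabilistic primality test, and no constant-time arithmetic); it is meant only to demonstrate the additive homomorphism, not to be a deployable cryptosystem.

import math, random

def keygen(bits=256):
    # Generate a toy Paillier key pair: n = p*q, g = n + 1.
    def prime(b):
        while True:
            p = random.getrandbits(b) | (1 << (b - 1)) | 1
            if all(pow(a, p - 1, p) == 1 for a in (2, 3, 5, 7, 11)):   # weak primality test
                return p
    p, q = prime(bits // 2), prime(bits // 2)
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow((pow(n + 1, lam, n * n) - 1) // n, -1, n)   # L(g^lam mod n^2)^(-1) mod n
    return (n, n + 1), (lam, mu, n)                      # public key, secret key

def encrypt(pk, m):
    n, g = pk
    r = random.randrange(1, n)                           # the random number t in Eq. (4)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(sk, c):
    lam, mu, n = sk
    return ((pow(c, lam, n * n) - 1) // n) * mu % n      # L(c^lam mod n^2) * mu mod n

pk, sk = keygen()
N = pk[0]
c1, c2 = encrypt(pk, 17), encrypt(pk, 25)
assert decrypt(sk, (c1 * c2) % (N * N)) == 42            # Eq. (4): product of ciphertexts
assert decrypt(sk, pow(c1, 3, N * N)) == 51              # Eq. (5): exponentiation by k = 3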
Secure Summation Protocols. Suppose each party has a private input. All the parties collaborate to compute the summation of all their inputs, without any party obtaining any information about the other parties' inputs. Such a protocol is called a secure summation protocol. Many secure summation protocols, such as that of Kantarcoglu et al. [13], have been proposed, but these methods are based on the assumption that any two parties are connected.
4 Problem Statement
We focus on the clustering problem in a social network described in Section 3.
Furthermore, we also protect the privacy of each vertex in the network.
4.1 Assumptions
We treat each vertex in the network as one party. Thus, a network containing
n vertices becomes an n-party system. We also assume that this network is
a connected network, in which there is at least one path between any pair of
vertices. There is no special vertex in the network, i.e., each vertex performs the
same operations. These assumptions are of practical significance. For example,
the relations of sending e-mail can be used to construct a network. Each vertex
(an e-mail user) can be seen as one party. Hence, each e-mail user knows its
neighbors, since it is connected to each neighbor using e-mail. Also, all e-mail
users in this network are equivalent.
Since the vertices, such as e-mail users, never want to reveal private informa-
tion about themselves in practice, we need to consider the privacy of each vertex
in the network. We specifically assume that each vertex only knows about itself
and its neighbors. First, it knows all information about itself. Second, it only
knows about the connections with its neighbors. Third, it knows nothing about
other vertices, not even whether they exist. Moreover, we assume all parties are
semi-honest, which means that they all correctly follow the protocol with the
exception that they keep a record of all their intermediate computations.
The knowledge range of any vertex is outlined in Fig. 3. We take the white vertex as the current vertex. It only knows about its neighbors (gray), since there is an edge between them. However, it knows nothing about the other (black) vertices, and does not even know whether they exist. In addition, it does not even know whether any pair of its neighbors is connected.
Fig. 3. The knowledge range of a vertex: the current vertex, its neighbors, and the other vertices
In our private network, the variables A_{ij}, π_r, θ_{rj}, and q_{ir} are distributed among all parties. A_{ij} denotes whether the pair (i, j) is connected, so we treat it as private information of parties i and j. π_r denotes the fraction of vertices in cluster r; as it contains nothing about individual parties, we publish it to all parties. θ_{rj} expresses the relationship between cluster r and vertex j, and we assume it is only known to party j. Similarly, q_{ir} is only known to party i.
The detail of this protocol is shown in Table 1. Since only Party m can decrypt messages, Party 0 can obtain nothing about x_i from X_i (i ∈ {1, 2, ..., m−1}). From the homomorphism of the encryption, Z = E_pk(r_0 + Σ_{i=1}^{m−1} x_i), so Party m can then compute z = r_0 + Σ_{i=1}^{m−1} x_i. As Party m does not know r_0, it can obtain nothing about Σ_{i=1}^{m−1} x_i from the value of z. In summary, nothing can be inferred from the intermediate information other than the final result, x.
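The message flow of this protocol can be simulated as follows; the Paillier helpers repeat the toy sketch above so that the snippet is self-contained, the party inputs are arbitrary example values, and the routing of the X_i among the parties is abstracted away.

import math, random

def keygen(bits=256):
    # Toy Paillier key generation, as in the sketch above (illustrative only).
    def prime(b):
        while True:
            p = random.getrandbits(b) | (1 << (b - 1)) | 1
            if all(pow(a, p - 1, p) == 1 for a in (2, 3, 5, 7, 11)):
                return p
    p, q = prime(bits // 2), prime(bits // 2)
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow((pow(n + 1, lam, n * n) - 1) // n, -1, n)
    return (n, n + 1), (lam, mu, n)

def encrypt(pk, m):
    n, g = pk
    r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(sk, c):
    lam, mu, n = sk
    return ((pow(c, lam, n * n) - 1) // n) * mu % n

# Protocol 1: Party m holds the key pair, Party 0 holds a random mask r0,
# and parties 1..m-1 hold the private inputs x_i.
pk, sk = keygen()
n_mod = pk[0]
xs = [7, 11, 20, 4]                            # private inputs of parties 1..m-1
r0 = random.randrange(n_mod)                   # Party 0's random number

Z = encrypt(pk, r0)                            # Party 0 folds in E_pk(r0)
for x in xs:
    Z = (Z * encrypt(pk, x)) % (n_mod ** 2)    # X_i = E_pk(x_i), combined homomorphically

z = decrypt(sk, Z)                             # Party m learns only r0 + sum(x_i) mod N
total = (z - r0) % n_mod                       # Party 0 removes the mask and obtains the sum
assert total == sum(xs)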
The detail of this protocol is shown in Table 2. The equation in line 06 implies that each vertex accumulates the cryptographs of the summation over its own sub-tree and sends the result, Y_i, to its parent in T. Hence Y_1 at the root is the cryptograph of the summation over all vertices. Moreover, since only Party n can decrypt messages, nothing is revealed through the execution of the protocol other than the final result.
If party i obtains the values of α_{ir} (r ∈ [C]), the q_{ir} can be directly computed. Consequently, we focus on securely computing the α_{ir} in (6). Although (6) is a product of n items, from the definition of A_{ij} we can eliminate the term θ_{rj} if party j is not a child of party i. In other words, the value of α_{ir} becomes the product of π_r and the θ_{rj} of all children of party i, i.e.,

α_{ir} = π_r · Π_{A_{ij}=1} θ_{rj} = π_r · Π_{j∈ch(i)} θ_{rj}.     (8)

Hence, we have

log α_{ir} = log π_r + Σ_{j∈ch(i)} log θ_{rj}.     (9)
Here, each log θ_{rj} can be seen as a private input of party j. Then, our goal becomes to securely compute the summation of these log θ_{rj} (j ∈ ch(i)). To do this, we only need to perform Protocol 1, treating these log θ_{rj} (j ∈ ch(i)) as the parameters of the protocol (the private inputs of the children of party i). Throughout this execution, the value of each θ_{rj} is kept secret by its party. Similarly, treating the q_{ir} as the private inputs of its parents, party j can securely compute the value of β_{rj} with Protocol 1 without revealing any information about the q_{ir}. In addition, substituting the definition of β_{rj} into (2), θ_{rj} becomes
θ_{rj} = β_{rj} / Σ_{k=1}^{n} β_{rk} = β_{rj} / β_r.     (12)
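For reference, the following is a centralized (non-private) NumPy sketch of the EM iteration that the distributed protocols compute piecewise: the E-step follows Eq. (9) and the M-step accumulates the β values and normalizes them as in Eq. (12). The smoothing constant, the random initialization and the convergence test are choices made here for illustration.

import numpy as np

def em_cluster(A, C, iters=100, eps=1e-9, seed=0):
    # A: n x n adjacency matrix; C: number of clusters.
    # Returns (pi, theta, q): cluster fractions, link probabilities, soft memberships.
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    q = rng.dirichlet(np.ones(C), size=n)                  # q_ir, shape n x C
    for _ in range(iters):
        # M-step: pi_r is the mean membership, beta_rj = sum_i A_ij q_ir.
        pi = q.mean(axis=0)
        beta = q.T @ A                                     # shape C x n
        theta = (beta + eps) / (beta + eps).sum(axis=1, keepdims=True)   # Eq. (12)
        # E-step: log alpha_ir = log pi_r + sum_{j in ch(i)} log theta_rj   (Eq. (9))
        log_alpha = np.log(pi + eps)[None, :] + A @ np.log(theta).T
        log_alpha -= log_alpha.max(axis=1, keepdims=True)
        q_new = np.exp(log_alpha)
        q_new /= q_new.sum(axis=1, keepdims=True)
        if np.allclose(q_new, q, atol=1e-6):
            q = q_new
            break
        q = q_new
    return pi, theta, q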
7 Performance
We discuss the efficiency of our method here. Both computation and communication can be carried out in parallel during the execution of our protocol. As each party performs operations with only one neighbor at a time, evaluating the total running time is equivalent to the edge coloring problem in graph theory. The edge coloring of a graph is the assignment of "colors" to its edges so that no two adjacent edges have the same color. Vizing [15] has shown that the chromatic index of a graph with maximum vertex degree K is either K or K + 1.
We now discuss the running time for one round of computation, which includes one E-step and one M-step. In the E-step, each party i performs Protocol 1 with all its children C times. From Vizing's result [15], the total running time for this stage is O(CK). In the M-step, the β_r and π_r are all accumulated with Protocol 2. Because the running time for one run of Protocol 2 is O(log_K n) and the β_r and π_r include 2C values, the running time for these accumulations is O(C log_K n). The secure computation of β_{rj} involves C executions of Protocol 1 with all its parents. From Vizing's result [15], the total running time for this computation is at most O(CK).
In summary, the running time for one round of the E-step and M-step is O(CK + C log_K n). Nevertheless, this is just an atomic operation of the entire EM algorithm. If we need to perform R rounds of the E-step and M-step until convergence, the entire running time becomes O(RC(K + log_K n)).
8 Experiments
Fig. 4. Matching rate vs. the number of vertices, for cluster numbers 5, 7, 9, 11, 13, and 15
Fig. 5. Number of rounds until convergence vs. the number of vertices, for cluster numbers 5, 7, 9, 11, 13, and 15
We used artificial and real data to evaluate the accuracy and efficiency of our protocol. The artificial data were generated from generative models with different parameters; we evaluated them by comparing the inferred results with the corresponding parameters. Moreover, we selected a network of books about US politics compiled by Valdis Krebs [16] as the real data, in which nodes represent books about US politics sold by Amazon.com and edges represent co-purchasing of books by the same buyers, as indicated by the "customers who bought this book also bought these other books" feature on Amazon. Nodes were given three labels to indicate whether they were liberal, neutral, or conservative. We compared our inferred results with these labels.
8.1 Accuracy
We executed our protocol and counted the number of results that matched the
true values. We used matching rate, the percentage of matched data, to evalu-
ate accuracy. In Fig. 4, each line expresses the relation between the number of vertices and the accuracy with respect to a specific number of clusters. We found that the results could be correctly inferred with our protocol for three clusters. However, increasing the number of clusters leads to a decrease in accuracy. Fortunately, we could increase accuracy by increasing the number of vertices; this can be verified from Fig. 4, in which each line is increasing. We also evaluated the speed of convergence by counting the number of rounds of computation needed until convergence occurred (Fig. 5). We found that convergence became faster when there were far more vertices than clusters. We then evaluated the data set of books about US politics [16]. Although this network contains only 105 vertices and 441 edges, the accuracy was about 86%. An intuitive image of these experimental results on the real data is shown in Fig. 6; the results are quite close to the original data.
8.2 Efficiency
We used two computers in this experiment to simulate distributed computation.
We executed the operators for each pair one-by-one by treating these two com-
puters as two adjacent parties and recording the running time for each step.
Fig. 6. Clustering result of real data
Fig. 7. Number of vertices vs. entire running time
Fig. 8. Number of vertices vs. one-round running time
Fig. 9. Maximum degree vs. one-round running time
We designed a parallel schedule based on Vizing's result [15], and calculated the entire computational time in this parallel environment. All of our experimental results also include the communication time. Fig. 7 plots the relation between the number of vertices and the total running time with respect to different numbers of clusters. Combined with Fig. 5, we also obtained the results in Fig. 8, which illustrates the one-round running time with respect to different numbers of clusters and vertices. An interesting phenomenon is that increasing the number of vertices can decrease the entire running time (Fig. 7), although the one-round running time (Fig. 8) is nearly independent of the number of vertices. This implies that our privacy-preserving clustering schema can be applied to very large-scale networks such as social networks. Fig. 9 compares the running time with the maximum degree: the one-round running time increases with the maximum degree. We also found that the results in Fig. 9 agree with our analysis in Section 7. We also evaluated the real data using the protocol with encryption. It needed only 12 rounds of computation until convergence occurred, and the entire running time was about 11 sec, which implies that the average running time for one round of computation is only about 1 sec.
9 Conclusion
mixing models. Assuming that each vertex is independent, private, and semi-
honest, our algorithm was sufficiently secure to preserve the privacy of every
vertex. The running time for our algorithm only depended on the number of
clusters and the maximum degree. Since our algorithm does not become ineffi-
cient with larger amounts of data, it can be applied to very large-scale networks.
References
1. Newman, M.E.J., Leicht, E.A.: Mixture models and exploratory analysis in networks. Proc. Natl. Acad. Sci. USA 104, 9564–9569 (2007)
2. Bunn, P., Ostrovsky, R.: Secure two-party k-means clustering. In: The 14th ACM
Conference on Computer and Communications Security (2007)
3. Koller, D., Pfeffer, A.: Probabilistic frame-based systems. In: The 15th National
Conference on Artificial Intelligence (1998)
4. Jha, S., Kruger, L., McDaniel, P.: Privacy preserving clustering. In: The 10th European Symposium on Research in Computer Security (2005)
5. Vaidya, J., Clifton, C.: Privacy-Preserving k-means clustering over vertically parti-
tioned data. In: The 9th ACM SIGKDD Conference on Knowledge Discovery and
Data Mining (2003)
6. Jagannathan, G., Wright, R.N.: Privacy-preserving distributed k-means clustering over arbitrarily partitioned data. In: The 11th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2005)
7. Lin, X., Clifton, C., Zhu, M.: Privacy-preserving clustering with distributed EM
mixture. Knowledge and Information Systems, 68–81 (2004)
8. Hay, M., Li, C., Miklau, G., Jensen, D.: Accurate estimation of the degree distri-
bution of private networks. In: The 9th IEEE International Conference on Data
Mining (2009)
9. Liu, K., Terzi, E.: A framework for computing the privacy scores of users in online
social networks. In: The 9th IEEE International Conference on Data Mining (2009)
10. Sakuma, J., Kobayashi, S.: Link analysis for private weighted graphs. In: The 32nd
ACM SIGIR Conference (2009)
11. Kempe, D., McSherry, F.: A decentralized algorithm for spectral analysis. Journal
of Computer and System Sciences 74(1), 70–83 (2008)
12. Paillier, P.: Public-Key Cryptosystems Based on Composite Degree Residuosity
Classes. In: Stern, J. (ed.) EUROCRYPT 1999. LNCS, vol. 1592, pp. 223–238.
Springer, Heidelberg (1999)
13. Kantarcoglu, M., Clifton, C.: Privacy-preserving distributed mining of association
rules on horizontally partitioned data. In: The ACM SIGMOD Workshop on Re-
search Issues on Data Mining and Knowledge Discovery, DMKD 2002 (2002)
14. Goldreich, O.: Foundations of Cryptography. Basic Applications, vol. 2. Cambridge
University Press, Cambridge (2004)
15. Vizing, V.G.: On an estimate of the chromatic class of a p-graph. Diskret. Analiz 3,
25–30 (1964)
16. Krebs, V.: http://www.orgnet.com/
Named Entity Recognition and Identification for Finding the Owner of a Home Page
V. Plachouras, M. Rivière, and M. Vazirgiannis
1 Introduction
Developing named entity-based datasets is central to applications such as expert search engines and scientific digital library portals, where researchers and organizations are the key entities to index and search for. However, developing such datasets is challenging because information must be extracted from unstructured or semi-structured sources. One approach involves the extraction of information from the bibliographic metadata of scientific publications. DBLP1 is an example of a site offering an index of the literature in Computer Science. CiteSeerX2 crawls the Web to collect files that correspond to publications and
1 http://www.informatik.uni-trier.de/~ley/db/
2 http://citeseerx.ist.psu.edu/
entity recognition method and the features we employ to identify the named entities from which we select the Web page owner named entities. Section 4 introduces a framework for the selection of named entities and describes two baseline approaches and one based on the construction of a graph from named entities that are similar, exploiting the redundancy in the named entity occurrences. In Section 5 we describe an approach based on supervised machine learning and, more specifically, a binary SVM classifier, which is trained to select the named entities from a home page. Section 6 describes our dataset and the experimental results we have obtained. Finally, Section 7 closes this work with some concluding remarks.
2 Related Work
The approach of identifying the named entity of the home page owner is primarily related to named entity recognition, metadata extraction from Web pages, and coreference resolution.
Coreference Resolution. The task of selecting the main entity from the set of
entities identified on a home page is related to coreference resolution, which
determines whether two textual expressions refer to the same entity or not.
Typical coreference resolution methods employ supervised learning [12][6] and
rely on the linguistic analysis of text to extract features. The task of identifying
the owner of a home page does not require the full resolution of all references, and hence it is not necessary to apply coreference resolution as a first step.
the begin, inside, outside (BIO) convention for labels. For example the sentence
“Chris Bishop is a Distinguished Scientist at . . .” 4 is tokenized and labeled as
follows:
Chris Bishop is a Distinguished Scientist at . . .
BPERSON IPERSON O O O O O
BFNAME BLNAME O O O O O
where BPERSON denotes the beginning of a full name, BFNAME denotes the be-
ginning of a first name, and BLNAME denotes the beginning of a last name. The
label IPERSON denotes that the corresponding token is inside a person name.
The label O denotes that the corresponding token does not belong to any of the
classes we consider.
Next, we train a CRF model using five types of features. The first type of
features corresponds to the tokens themselves. The second type corresponds to
two features, whose values depend on the form of the examined token. The first feature indicates the form of the token (whether it contains only numerical digits, is a single upper-case letter, a punctuation symbol, a capitalized word, etc.). The second feature indicates whether the token is an alphanumeric string. The
third type of features relies on two gazetteers for first names and geographic
locations, respectively, and comprises two binary features indicating whether
the token is a first name, and whether the token is a geographic location. The
fourth type of features is based on a full-text index of Web pages and comprises
4 features. More specifically, two features correspond to the logarithm of the
number of documents in which the term occurs in the body, and the anchor
text of incoming links respectively. The two next features correspond to flags
indicating whether the term occurs in the title or the anchor text of incoming
links of the currently processed document. The fifth type of features comprises
one feature indicating whether the token occurs in the anchor text of an outgoing
hyperlink in the currently processed document, differentiating between links to
Web pages in the same or different domains. Note that the last two types of
features depend on the distribution of terms in a full text index of Web pages,
and the text associated with the link structure of Web pages. We employ the
implementation of CRF++5.
We learn the CRF model and apply it to unseen Web pages in the fol-
lowing way. First, we train a CRF to recognize full names and assign labels
BPERSON and IPERSON . The assigned labels are then used to learn a second
model where the assigned labels constitute an additional twelfth feature used
in the recognition of first, middle and last names, assigning labels BFNAME,
IFNAME, BMNAME, IMNAME, and BLNAME and ILNAME, respectively. After apply-
ing the CRF models to label tokens, we aggregate consecutive tokens with B
and I labels into an entity e. For each entity e, t(e) is the type of the entity, where
t(e) ∈ {PERSON, FNAME, MNAME, LNAME}, c(e) is the average confidence of the label
assignment over the entity’s tokens, and s(e) is the concatenation of the tokens
to form the string representation of e.
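A minimal sketch of this aggregation step is given below; the tuple layout of the tagger output and the dictionary representation of an entity are assumptions made for illustration, not the data structures used in the actual system.

def aggregate_entities(tagged):
    # tagged: list of (token, label, confidence) triples, with labels such as
    # 'BPERSON', 'IPERSON', 'BFNAME', ..., or 'O'.
    # Returns entities as dicts with type t(e), string s(e) and average confidence c(e).
    def finish(cur):
        return {'t': cur['t'],
                's': ' '.join(cur['tokens']),
                'c': sum(cur['confs']) / len(cur['confs'])}
    entities, current = [], None
    for token, label, conf in tagged:
        if label.startswith('B'):
            if current:
                entities.append(finish(current))
            current = {'t': label[1:], 'tokens': [token], 'confs': [conf]}
        elif label.startswith('I') and current and label[1:] == current['t']:
            current['tokens'].append(token)
            current['confs'].append(conf)
        else:                                   # 'O' or an inconsistent I-label closes the entity
            if current:
                entities.append(finish(current))
            current = None
    if current:
        entities.append(finish(current))
    return entities

tagged = [('Chris', 'BPERSON', 0.98), ('Bishop', 'IPERSON', 0.95), ('is', 'O', 0.99)]
print(aggregate_entities(tagged))   # [{'t': 'PERSON', 's': 'Chris Bishop', 'c': 0.965}]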
4 Quoted from http://research.microsoft.com/en-us/um/people/cmbishop/
5 Available from http://crfpp.sourceforge.net/
The intuition for defining the weight wstr as the sum of the confidences is that
it reflects both the number of times the same string has been identified as a
PERSON entity as well as the confidence in the recognition.
The weighting of S(str) from Eq. 2 is only based on the average confidence
of the label assignment to each token of the entities in S(str). We can improve
the weighting by incorporating more information regarding the position of the
occurrences of entities.
w_str = Σ_{e∈S(str)} ( w_a · anchor(e) + w_t · title(e) + w_c · c(e) )     (3)
The baseline weightings of the set S(str) according to Eq. 2 and 3 only consider the entities of type PERSON with the same string representation. However, they ignore any similarities between the identified entities that could be used to compute an improved weight. Suppose that on a Web page the full name of a researcher appears only twice at the top of the page, and the name of the most frequent co-author appears in abbreviated form once for each publication of the researcher6. In such a setting, the baseline weighting functions may select the abbreviated name of the co-author as the named entity of the URL's owner, instead of the full name of the researcher.
Fig. 1. The graph constructed from the sets of identified named entities in a Web page
– A type 1 edge connects sets of FNAME, MNAME, LNAME entities to the corresponding sets of PERSON entities in which they occur.
– A type 2 edge connects a set S(t, str1) to S(t, str2) when string str1 is an abbreviated form of str2 and t ∈ {FNAME, MNAME, LNAME}.
– A type 3 edge connects a set S(str1) to S(str2) when the name str1 is an abbreviated form of the name str2. Formally, S(str1) ∈ in3(S(str2)) if there exists S(t, str3) ∈ in1(PERSON, str1) ∩ in1(PERSON, str2) and there exist S(t′, str4) ∈ in1(PERSON, str1), S(t′, str5) ∈ in1(PERSON, str2) where S(t′, str4) ∈ in2(t′, str5).
Figure 1 illustrates a graph constructed from a set of identified named entities.
The graph has two vertexes of type PERSON, one vertex of type LNAME for the
last name ’Bishop’ and two vertexes of type FNAME for the first name ’Chris’
and its abbreviated form ’C.’ There are four edges of type 1, linking the ver-
texes of type FNAME and LNAME to the corresponding vertexes of type PERSON.
There is one edge of type 2 which links the vertex S(FNAME, ’C.’) to the ver-
tex S(FNAME, ’Chris’). Finally, there is one edge of type 3 from S(’C. Bishop’)
to S(’Chris Bishop’) because both vertexes have incoming links from the same
vertex S(LNAME, ’Bishop’) and there is a type 2 edge between two of their FNAME
linking vertexes.
The graph G, which is constructed as described above, is a directed acyclic
graph (DAG). From the definition of type 1 edges, we cannot have a cycle in-
volving vertices of type PERSON and any other entity type because type 1 edges
always point to vertices of type PERSON. Hence, a cycle may involve either type
2 edges exclusively or type 3 edges exclusively. Since a type 3 edge exists only if
there is a type 2 edge, and the two edges cannot be in the same path, then there
exists a cycle with type 3 edges only if there exists a cycle with type 2 edges.
However, there cannot be a cycle with type 2 edges, because type 2 edges link
an abbreviated name to its full form. Hence, there cannot be any cycle in the
graph G.
Once we have constructed the graph from the named entities identified in a
Web page, we compute a weight for each vertex S(str), corresponding to the
sum of the Baseline 2 score from Eq. 3 plus the sum of the scores of vertices that
link to S(str).
w_str = Σ_{e∈S(str)} ( w_a · anchor(e) + w_t · title(e) + w_c · c(e) ) + Σ_{S(str′)∈in_i(str)} w_{str′}     (4)
Finally, we select the set S(str) with the highest score wstr according to Eq. 4.
The intuition is that the scores of abbreviated named entities propagate to the
entities corresponding to full names.
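The propagation behind Eq. 4 can be sketched as follows, flattening the three edge types into a single generic edge list and identifying each vertex by its entity string; both simplifications, as well as the toy scores, are illustrative.

from collections import defaultdict, deque

def propagate_scores(base_score, edges):
    # base_score: vertex (entity string) -> Baseline 2 score.
    # edges: (source, target) pairs forming a DAG whose edges point from
    # abbreviated/component names towards fuller forms.
    out_edges = defaultdict(list)
    indeg = {v: 0 for v in base_score}
    for src, dst in edges:
        out_edges[src].append(dst)
        indeg[dst] += 1
    score = dict(base_score)
    queue = deque(v for v, d in indeg.items() if d == 0)
    while queue:                                  # Kahn's algorithm: topological order
        v = queue.popleft()
        for w in out_edges[v]:
            score[w] += score[v]                  # add the finalized score of v to w
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return max(score, key=score.get), score       # highest-scoring vertex and all scores

base = {'C. Bishop': 3.0, 'Chris Bishop': 2.0, 'J. Smith': 1.5}
edges = [('C. Bishop', 'Chris Bishop')]           # a type 3 edge: abbreviation -> full name
print(propagate_scores(base, edges))
# ('Chris Bishop', {'C. Bishop': 3.0, 'Chris Bishop': 5.0, 'J. Smith': 1.5})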
which are more likely to refer to the owner of the Web page. However, all three
functions will always produce a score for the entities, even when the named
entity of the owner of the Web page is not among the identified named entities.
For example, a researcher may have a set of Web pages documenting a software
he has written and released as open-source. The functions introduced earlier
will always select one set of entities as the owner for the considered Web page.
Moreover, extending these functions with arbitrary features is not trivial. In this
section, we investigate the problem of selecting the named entities as a binary
classification problem in a supervised learning setting.
In particular, we formulate the classification problem y(x) ∈ {−1, 1}, where
x ∈ X = {(URL, S(PERSON, str))}. The input x is a pair of a URL and
a set S(PERSON, str). For the output, y(x) = 1 when the named entities in
S(PERSON, str) correspond to the owner of Web page with URL, otherwise,
y(x) = −1. For each input point x, we compute 13 features:
– the graph-based score of S(PERSON, str) from Eq. 4
– the rank of S(PERSON, str) when all sets of PERSON entities are ordered in
ascending order of the Baseline 1, Baseline 2, graph-based scoring functions,
as well as in the order of occurrence (4 features)
– the sum of the cardinalities |S(t, str′)| where S(t, str′) ∈ in_i(PERSON, str), for each type of link (3 features)
– the number of edges of type i pointing to S(PERSON, str) for i = 1, 2, 3 (3
features)
– 1 if str appears in an email address found in the content of URL, otherwise
0
– 1 if str appears to be emphasized in the text of home page identified by
URL, otherwise 0
The feature values are normalized between -1 and +1 on a per-home-page basis. We employ an SVM classifier with a radial-basis kernel from LIBSVM7 [2]. For a given home page identified by URL, if the SVM classifies more than one pair (URL, S(PERSON, str)) as +1, we select the one with the highest estimated probability, as computed by the SVM classifier.
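A sketch of this classification step is shown below using scikit-learn's SVC (a wrapper around LIBSVM); the synthetic feature matrix, the toy labeling rule and the tie-breaking code are stand-ins for the real 13-dimensional features described above.

import numpy as np
from sklearn.svm import SVC

# One row per (URL, candidate entity set) pair: 13 features scaled to [-1, 1];
# label +1 if the candidate names the page owner, -1 otherwise.  Synthetic data here.
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(200, 13))
y_train = np.where(X_train[:, 0] + 0.5 * X_train[:, 11] > 0, 1, -1)   # toy labeling rule

clf = SVC(kernel='rbf', probability=True)        # RBF kernel with probability estimates
clf.fit(X_train, y_train)

# For one home page: keep the candidates classified as +1 and, if several remain,
# select the one with the highest estimated probability.
candidates = rng.uniform(-1, 1, size=(4, 13))
proba = clf.predict_proba(candidates)[:, list(clf.classes_).index(1)]
positive = np.flatnonzero(clf.predict(candidates) == 1)
owner = positive[np.argmax(proba[positive])] if len(positive) else None
print('selected candidate index:', owner)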
6 Experimental Results
In this section, we describe the experimental setting in which we evaluate the
introduced methods. First, we evaluate the CRF-based named entity recognition
(Section 6.1). Next, we describe the dataset we use for entity selection and we
present the obtained results (Section 6.2).
order and split them into three folds. We use each fold once to test the CRF model
we learn on the other two folds. Table 1 reports the micro-averaged precision,
recall and F-measure for each of the labels we assign during the first and the
second passes of the CRF-based NER system, respectively.
The NER system assigns BPERSON and IPERSON labels with high precision and
recall. This is consistent with results reported for NER systems trained on much
larger corpora [7]. First and last names are identified with an accuracy of more
than 0.80. The obtained precision for middle names is significantly lower, mainly
due to the small number of training examples available.
selected as the owner’s name for the corresponding page. The effectiveness of
this heuristic depends on the accuracy of the underlying NER system because
any wrong identification of names will lead to an error in the selection [8]. The
two next approaches, Baseline 1 and Baseline 2, correspond to the selection of
entities using Eq. 2 and 3, respectively. The fourth and fifth rows in Table 2
display the results obtained with the graph-based and the SVM-based entity
selection approaches, respectively.
Table 2. Fraction of Web pages for which there is a perfect or partial match, when
using Order, Baseline 1, Baseline 2, Graph and SVM-based entity selection
7 Conclusions
In this work, we have introduced a novel method to select, among the recognized named entities in a home page, the one that corresponds to the owner. Our method uses the output of a named entity recognition system and exploits the redundancy and the similarities between names to select the correct one. The introduced methods are developed independently of the employed named entity recognition approach. Indeed, they can be used with any NER approach that identifies person names, as well as first, middle and last names. In a dataset of more than 400 home pages, our methods identify the correct name for more than 90% of the home pages in which the NER system identifies the correct name at least once in the processed page. The comparison of our methods with a heuristic based on the order of names shows that our approaches achieve important improvements in effectiveness because they are more robust with respect to the accuracy of the employed NER system.
We have applied the developed methods in the context of researchers' home pages. In the future, we will evaluate them in the context of different applications, such as the automatic creation of online social networks, or people search. We also aim to use the identified name of the owner of a Web page as a feature to improve the classification of Web pages.
References
1. Bikel, D.M., Miller, S., Schwartz, R., Weischedel, R.: Nymble: a high-performance
learning name-finder. In: Procs. of the 5th ANLC, pp. 194–201 (1997)
2. Chang, C.C., Lin, C.J.: Libsvm: A library for support vector machines. ACM Trans.
Intell. Syst. Technol. 2, 27:1–27:27 (2011)
3. Changuel, S., Labroche, N., Bouchon-Meunier, B.: Automatic web pages author
extraction. In: Procs. of the 8th FQAS, pp. 300–311 (2009)
4. Chieu, H.L., Ng, H.T.: Named entity recognition with a maximum entropy ap-
proach. In: Procs. of the 7th Conference on Natural Language Learning at HLT-
NAACL 2003, CONLL 2003, vol. 4, pp. 160–163 (2003)
5. Culotta, A., Bekkerman, R., McCallum, A.: Extracting social networks and contact
information from email and the web. In: CEAS (2004)
6. Culotta, A., Wick, M., Hall, R., McCallum, A.: First-order probabilistic models
for coreference resolution. In: Procs. of HLT/NAACL, pp. 81–88 (2007)
7. Finkel, J.R., Grenager, T., Manning, C.: Incorporating non-local information into
information extraction systems by gibbs sampling. In: Procs. of the 43rd Annual
Meeting on ACL, pp. 363–370 (2005)
8. Gollapalli, S.D., Giles, C.L., Mitra, P., Caragea, C.: On identifying academic home-
pages for digital libraries. In: Procs. of the 11th JCDL, pp. 123–132 (2011)
9. Kato, Y., Kawahara, D., Inui, K., Kurohashi, S., Shibata, T.: Extracting the author
of web pages. In: Procs. of the 2nd ACM WICOW, pp. 35–42 (2008)
10. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: Proba-
bilistic models for segmenting and labeling sequence data. In: Procs. of the 18th
ICML, pp. 282–289 (2001)
11. Minkov, E., Wang, R.C., Cohen, W.W.: Extracting personal names from email:
applying named entity recognition to informal text. In: Procs. of the Conf. on HLT
and EMNLP, HLT 2005, pp. 443–450 (2005)
12. Ng, V., Cardie, C.: Improving machine learning approaches to coreference reso-
lution. In: Procs. of the 40th Annual Meeting on ACL, ACL 2002, pp. 104–111
(2002)
13. Shi, Y., Wang, M.: A dual-layer crfs based joint decoding method for cascaded
segmentation and labeling tasks. In: Procs. of the 20th IJCAI, pp. 1707–1712 (2007)
14. Takeuchi, K., Collier, N.: Use of support vector machines in extended named en-
tity recognition. In: Procs. of the 6th Conference on Natural Language Learning,
COLING 2002, vol. 20, pp. 1–7 (2002)
15. Tang, J., Zhang, D., Yao, L.: Social network extraction of academic researchers.
In: Procs. of the 7th ICDM, pp. 292–301 (2007)
16. Zheng, S., Zhou, D., Li, J., Giles, C.L.: Extracting author meta-data from web
using visual features. In: Procs. of the 7th ICDMW, pp. 33–40 (2007)
17. Zhu, J., Nie, Z., Wen, J.R., Zhang, B., Ma, W.Y.: 2d conditional random fields for
web information extraction. In: Procs. of the 22nd ICML, pp. 1044–1051 (2005)
Clustering and Understanding Documents via Discrimination Information Maximization
M.T. Hassan and A. Karim
Abstract. Text document clustering is a popular task for understanding and sum-
marizing large document collections. Besides the need for efficiency, document
clustering methods should produce clusters that are readily understandable as
collections of documents relating to particular contexts or topics. Existing cluster-
ing methods often ignore term-document semantics while relying upon geomet-
ric similarity measures. In this paper, we present an efficient iterative partitional
clustering method, CDIM, that maximizes the sum of discrimination informa-
tion provided by documents. The discrimination information of a document is
computed from the discrimination information provided by the terms in it, and
term discrimination information is estimated from the currently labeled docu-
ment collection. A key advantage of CDIM is that its clusters are describable by
their highly discriminating terms – terms with high semantic relatedness to their
clusters’ contexts. We evaluate CDIM both qualitatively and quantitatively on
ten text data sets. In clustering quality evaluation, we find that CDIM produces
high-quality clusters superior to those generated by the best methods. We also
demonstrate the understandability provided by CDIM, suggesting its suitability
for practical document clustering.
1 Introduction
Text document clustering discovers groups of related documents in large document col-
lections. It achieves this by optimizing an objective function defined over the entire
data collection. The importance of document clustering has grown significantly over
the years as the world moves toward a paperless environment and the Web continues
to dominate our lives. Efficient and effective document clustering methods can help in
better document organization (e.g. digital libraries, corporate documents, etc) as well
as quicker and improved information retrieval (e.g. online search).
Besides the need for efficiency, document clustering methods should be able to han-
dle the large term space of document collections to produce readily understandable
clusters. These requirements are often not satisfied in popular clustering methods. For
example, in K-means clustering, documents are compared in the term space, which
is typically sparse, using generic similarity measures without considering the term-
document semantics other than their vectorial representation in space. Moreover, it is
not straightforward to interpret and understand the clusters formed by K-means clus-
tering; the similarity of a document to its cluster’s mean provides little understanding
of the document’s context or topic.
the quality of clusterings produced by the K-means algorithm improves over the base-
line (“bag of words”) document representation. However, extracting information from
knowledge bases is computationally expensive. Furthermore, these approaches suffer
from the same shortcomings of K-means regarding cluster understandability.
The challenge of high dimensional data clustering, including that of document clus-
tering, has been tackled by clustering in a lower dimensional space of the original
term space. One way to achieve this is through Non-Negative Matrix Factorization
(NMF). NMF approximates the term-document matrix by the product of term-cluster
and document-cluster matrices [8]. Extensions to this idea, with the goal of improv-
ing the interpretability of the extracted clusters, have also been proposed [9,10]. An-
other way is to combine clustering with dimensionality reduction techniques [11,12].
Nonetheless, these methods are restricted by their focus on approximation rather than
semantically useful clusters, and furthermore, dimensionality reduction based tech-
niques are often computationally expensive.
Recently, it has been demonstrated that the relatedness of a term to a context or
topic in a document collection can be quantified by its discrimination information [2].
Such a notion of relatedness, as opposed to the traditional term-to-term relatedness,
can be effectively used for data mining tasks like classification [13]. Meanwhile, mea-
sures of discrimination information, such as relative risk, odds ratio, risk difference,
and Kullback-Leibler divergence, are gaining popularity in data mining [14,15]. In the
biomedical domain, on the other hand, measures like relative risk have been used for a
long time for cohort studies and factor analysis [16,17].
current work. In addition to the cluster composition, we would also like to find significant describing terms for each cluster. Let T_k be the index set of significant terms for cluster k.
CDIM finds K clusters in the document collection by maximizing the sum of the discrimination scores of documents for their respective clusters. If we denote the discrimination information provided by document i for cluster k by d_{ik}, and the discrimination information provided by document i for all clusters but cluster k by d̄_{ik}, then the discrimination score of document i for cluster k is defined as d̂_{ik} = d_{ik} − d̄_{ik}. CDIM's objective function can then be written as

J = Σ_{k=1}^{K} Σ_{x_i ∈ C_k} r_{ik} ( d_{ik} − d̄_{ik} )     (1)
w̄_{jk} = p(x_j | C̄_k) / p(x_j | C_k)   if p(x_j | C̄_k) − p(x_j | C_k) > t,   and w̄_{jk} = 0 otherwise     (3)
where p(x_j | C_k) is the conditional probability of term j in cluster k and C̄_k denotes all clusters but cluster k. The term discrimination information is either zero (no discrimination information) or greater than one, with a larger value signifying higher discriminative power. The conditional probabilities in Equations 2 and 3 are estimated via smoothed maximum likelihood estimation.
A similar expression can be used to define d̄_{ik}. The document discrimination information d_{ik} can be thought of as the relatedness (discrimination) of document i to cluster k. The document discrimination score is given by d̂_{ik} = d_{ik} − d̄_{ik}; the larger this value is, the more likely it is that document i belongs to cluster k. Note that a term contributes to the discrimination information of document i for cluster k only if it belongs to T_k and it occurs in document i. If such a term occurs multiple times in the document, then each of its occurrences contributes to the discrimination information. Thus, the discrimination information of a document for a particular cluster increases with the number of occurrences of highly discriminating terms for that cluster.
3.6 Algorithm
CDIM can be described more compactly in matrix notation. CDIM’s algorithm, which
is outlined in Algorithm 1, is described next.
Let W (W̄) be the M × K matrix formed from the elements wjk , ∀j, k (w̄jk , ∀j, k),
D̂ be the N ×K matrix formed from the elements dˆik , ∀i, k, and R be the N ×K matrix
formed from the elements rik , ∀i, k. At the start, each document is assigned to one of the
K randomly selected seeds using cosine similarity, thus defining the matrix R. Then,
a loop is executed consisting of two steps. In the first step, the term discrimination
information matrices (W and W̄) are estimated from the term-document matrix X and
the current document assignment matrix R. The second step projects the documents
onto the relatedness or discrimination score space to create the discrimination score
matrix D̂. Mathematically, this transformation is given by
R = maxrow(D̂) (6)
where ‘maxrow’ is an operator that works on each row of D̂ and returns a 1 for the
maximum value and a zero for all other values. The processing of Equations 5 and 6 is repeated until the absolute difference in the objective function becomes less than a
specified small value. The objective function J is computed by summing the maximum
values from each row of matrix D̂.
The algorithm outputs the final document assignment matrix R and the final term
discrimination information matrix W. It is easy to see that the computational time
complexity of CDIM is O(KM N I) where I is the number of iterations required to
reach the final clustering. Thus, the computational time of CDIM depends linearly on
the clustering parameters.
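Reusing the term_weights helper from the sketch above, the alternating loop of Algorithm 1 might look as follows; the cosine-similarity seeding and the form of the projection step (Equation 5 is not reproduced in this excerpt) are assumptions.

```python
import numpy as np

def cdim(X, K, max_iter=100, tol=1e-6, seed=0):
    """Sketch of the CDIM loop: alternate between estimating the term
    discrimination matrices W, W̄ and re-assigning documents (Eqs. 5-6).
    Each iteration costs O(KMN), matching the stated complexity O(KMNI)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Seeding: assign every document to its most similar randomly chosen seed.
    seeds = rng.choice(N, size=K, replace=False)
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    assign = (Xn @ Xn[seeds].T).argmax(axis=1)          # cosine similarity to seeds

    prev_J = -np.inf
    for _ in range(max_iter):
        pairs = [term_weights(X, assign == k) for k in range(K)]
        W = np.column_stack([w for w, _ in pairs])
        W_bar = np.column_stack([w_bar for _, w_bar in pairs])
        D_hat = X @ (W - W_bar)            # projection onto the score space (assumed form of Eq. 5)
        assign = D_hat.argmax(axis=1)      # 'maxrow' operator (Eq. 6)
        J = D_hat.max(axis=1).sum()        # objective of Eq. 1
        if abs(J - prev_J) < tol:
            break
        prev_J = J
    return assign, W
```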
4 Experimental Setup
Our evaluations comprise two sets of experiments. First, we evaluate the clustering
quality of CDIM and compare it with other clustering methods on 10 text data sets.
Second, we illustrate the understanding that is provided by CDIM clustering. The results
of these experiments are given in the next section. Here, we describe our experimental
setup.
Our experiments are conducted on 10 standard text data sets of different sizes, contexts,
and complexities. The key characteristics of these data sets are given in Table 1. Data
set 1 is obtained from the Internet Content Filtering Group’s web site1 , data set 2 is
available from a Cornell University web page2 , and data sets 3 to 10 are obtained from
Karypis Lab, University of Minnesota3 . Data sets 1 (stopword removal) and 3 to 10
(stopword removal and stemming) are available in preprocessed formats, while we per-
form stopword removal and stemming of data set 2. For more details on these standard
data sets, please refer to the links given above.
We compare CDIM with five clustering methods. Four of them are K-means variants
and one of them is based on Non-Negative Matrix Factorization (NMF) [8].
The four K-means variants are selected from the CLUTO Toolkit [18] based on
their strong performances reported in the literature [5,3]. Two of them are direct K-
way clustering methods while the remaining two are repeated bisection methods. For
1 http://labs-repos.iit.demokritos.gr/skel/i-config/downloads/
2 http://www.cs.cornell.edu/People/pabo/movie-review-data/
3 http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download
each of these two types of methods, we consider two different objective functions. One
objective function maximizes the sum of similarities between documents and their clus-
ter mean. The direct and repeated bisection methods that use this objective function are
identified as Direct-I2 and RB-I2, respectively. The second objective function that we
consider maximizes the ratio of I2 and E1, where I2 is the intrinsic (based on cluster
cohesion) objective function defined above and E1 is an extrinsic (based on separa-
tion) function that minimizes the sum of the normalized pairwise similarities of docu-
ments within clusters with the rest of the documents. The direct and repeated bisection
methods that use this hybrid objective function are identified as Direct-H2 and RB-H2,
respectively.
For NMF, we use the implementation provided in the DTU:Toolbox4. Specifically,
we use the multiplicative update rule with Euclidean measure for approximating the
term-document matrix.
In using the four K-means variants, the term-document matrix is defined by term-
frequency-inverse-document-frequency (TF-IDF) values and the cosine similarity mea-
sure is adopted for document comparisons. For NMF, the term-document matrix is
defined by term frequency values.
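For reference, the two document representations described above can be reproduced with standard tooling; the use of scikit-learn below is an assumption, since the paper does not state how the matrices were built.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["a small toy document", "another toy document about clustering"]

# TF-IDF term-document representation with cosine similarities (K-means variants)
X_tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
doc_sims = cosine_similarity(X_tfidf)

# Raw term-frequency representation (input to NMF)
X_tf = CountVectorizer(stop_words="english").fit_transform(docs)
```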
Table 3. Top 10 most discriminating terms (stemmed words) for clusters in ohscal data set
We list the top 10 most discriminating terms (stemmed words) for each cluster of the ohscal
data set in Table 3. The ohscal data set contains publications from 10 different medical
subject areas (antibodies, carcinoma, DNA, in-vitro, molecular sequence data, preg-
nancy, prognosis, receptors, risk factors, and tomography). By looking at the top ten
terms, it is easy to determine the category of most clusters: cluster 2 = carcinoma, clus-
ter 3 = antibodies, cluster 4 = prognosis, cluster 5 = pregnancy, cluster 6 = risk factors,
cluster 7 = DNA, cluster 9 = receptors, cluster 10 = tomography. The categories molec-
ular sequence data and in-vitro do not appear to have a well-defined cluster; molecular
sequence data has some overlap with cluster 7 while in-vitro has some overlap with
clusters 1 and 9. Nonetheless, clusters 1 and 8 still give coherent meaning to the documents they contain.
As another example, in the hitech data set, the top 5 terms for two clusters are: (1)
‘health’, ‘care’, ‘patient’, ‘hospit’, ‘medic’, and (2) ‘citi’, ‘council’, ‘project’, ‘build’,
‘water’. The first cluster can be mapped to the health category while the second clus-
ter does not have an unambiguous mapping to a category but it still gives sufficient
indication that these articles discuss hi-tech related development projects.
Since CDIM finds clusters in a K-dimensional discrimination information space,
the distribution of documents among clusters can be visualized via simple scatter plots.
The 2-dimensional scatter plot of documents in the pu data set is shown in Figure 1 (left
plot). The x- and y-axes in this plot correspond to the document discrimination information for clusters 1 and 2 (d_i1 and d_i2), respectively, and the colored markers give the true categories. It is seen that the two clusters are spread along the two axes and the vast
majority of documents in each cluster belong to the same category. Similar scatter plots
for Direct-I2 and NMF are shown in the middle and right plots, respectively, of Figure
1. However, these methods exhibit poor separation between the two categories in the pu
data set.
Such scatter plots can be viewed for any pair of clusters when K > 2. Since CDIM’s
document assignment decision is based upon document discrimination scores (dˆik , ∀k),
scatter plots of documents in this space are also informative; each axis quantifies how
relevant a document is to a cluster in comparison to the remaining clusters.
Fig. 1. Scatter plot of documents projected onto the 2-D discrimination information space
(CDIM), similarity to cluster mean space (Direct-I2), and weight space (NMF). True labels are
indicated by different color markers.
References
1. Morris, J., Hirst, G.: Non-classical lexical semantic relations. In: Proceedings of the HLT-
NAACL Workshop on Computational Lexical Semantics, pp. 46–51. Association for Com-
putational Linguistics (2004)
2. Cai, D., van Rijsbergen, C.J.: Learning semantic relatedness from term discrimination infor-
mation. Expert Systems with Applications 36, 1860–1875 (2009)
3. Tan, P.N., Steinbach, M., Kumar, V.: Introduction to Data Mining. Addison Wesley, New
York (2006)
4. Zhao, Y., Karypis, G.: Criterion functions for document clustering: experiments and analysis.
Technical Report 01-40, University of Minnesota (2001)
5. Steinbach, M., Karypis, G.: A comparison of document clustering techniques. In: Proceed-
ings of the KDD Workshop on Text Mining (2000)
6. Hu, X., Zhang, X., Lu, C., Park, E., Zhou, X.: Exploiting wikipedia as external knowledge for
document clustering. In: Proceedings of the 15th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pp. 389–396. ACM (2009)
7. Zhang, X., Jing, L., Hu, X., Ng, M., Zhou, X.: A Comparative Study of Ontology Based Term
Similarity Measures on PubMed Document Clustering. In: Kotagiri, R., Radha Krishna, P.,
Mohania, M., Nantajeewarawat, E. (eds.) DASFAA 2007. LNCS, vol. 4443, pp. 115–126.
Springer, Heidelberg (2007)
8. Xu, W., Liu, X., Gong, Y.: Document clustering based on non-negative matrix factorization.
In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, pp. 267–273. ACM (2003)
9. Xu, W., Gong, Y.: Document clustering by concept factorization. In: Proceedings of the 27th
Annual International ACM SIGIR Conference on Research and Development in Information
Retrieval, pp. 202–209. ACM (2004)
10. Cai, D., He, X., Han, J.: Locally consistent concept factorization for document clustering.
IEEE Transactions on Knowledge and Data Engineering (2010)
11. Tang, B., Shepherd, M., Heywood, M.I., Luo, X.: Comparing Dimension Reduction Tech-
niques for Document Clustering. In: Kégl, B., Lee, H.-H. (eds.) Canadian AI 2005. LNCS
(LNAI), vol. 3501, pp. 292–296. Springer, Heidelberg (2005)
12. Ding, C., Li, T.: Adaptive dimension reduction using discriminant analysis and k-means
clustering. In: Proceedings of the 24th International Conference on Machine Learning,
pp. 521–528. ACM (2007)
13. Junejo, K., Karim, A.: A robust discriminative term weighting based linear discriminant
method for text classification. In: Eighth IEEE International Conference on Data Mining,
pp. 323–332 (2008)
14. Li, H., Li, J., Wong, L., Feng, M., Tan, Y.P.: Relative risk and odds ratio: a data mining
perspective. In: PODS 2005: Proceedings of the 24th ACM SIGMOD-SIGACT-SIGART
Symposium on Principles of Database Systems (2005)
15. Li, J., Liu, G., Wong, L.: Mining statistically important equivalence classes and delta-
discriminative emerging patterns. In: KDD 2007: Proceedings of the 13th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining (2007)
16. Hsieh, D.A., Manski, C.F., McFadden, D.: Estimation of response probabilities from aug-
mented retrospective observations. Journal of the American Statistical Association 80(391),
651–662 (1985)
17. LeBlanc, M., Crowley, J.: Relative risk trees for censored survival data. Biometrics 48(2),
411–425 (1992)
18. Karypis, G.: CLUTO-a clustering toolkit. Technical report, Dept. of Computer Science, Uni-
versity of Minnesota, Minneapolis (2002)
19. Bagga, A., Baldwin, B.: Entity-based cross-document coreferencing using the vector space
model. In: Proceedings of the 36th Annual Meeting of the Association for Computa-
tional Linguistics and 17th International Conference on Computational Linguistics, vol. 1,
pp. 79–85. ACL (1998)
20. Amigó, E., Gonzalo, J., Artiles, J., Verdejo, F.: A comparison of extrinsic clustering evalua-
tion metrics based on formal constraints. Information Retrieval 12(4), 461–486 (2009)
21. Demsar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine
Learning Research 7, 1–30 (2006)
A Semi-supervised Incremental Clustering
Algorithm for Streaming Data
1 Introduction
Clustering plays a key role in data analysis, aiming at discovering interesting data distributions and patterns in data. It is also widely recognized as a means of maintaining data summaries and as a useful approach for outlier analysis. Thus clustering is especially important for streaming data
management. Streaming data are generated continuously at high rates and due
to storage constraints we are not able to maintain in memory the entire data
stream. Having a compressed version (synopsis) of data at different time slots,
data analysts are able to keep track of the previously arrived data and thus more
effectively extract useful patterns from the whole stream of data.
Semi-supervised clustering has received much attention in recent years, be-
cause it can enhance clustering quality by exploiting readily available background
knowledge, i.e. knowledge on the group membership of some data items or knowl-
edge on some properties of the clusters to be built. The available knowledge is
2 Related Work
Recently, a number of clustering algorithms have been proposed to deal with
streaming data. In [1], a framework of a stream clustering approach is proposed,
which includes two clustering phases. In the online phase, the proposed approach periodically maintains statistical information about local data in terms of micro-clusters, while in the offline phase, the decision maker uses these statistics to provide a description of clusters in the data stream. The main drawback
of this algorithm is that the number of micro-clusters needs to be predefined.
HPStream [2] incorporates a fading cluster structure and a projection-based
clustering methodology to deal with the problem of high-dimensionality in data
streams. Another stream clustering algorithm is Denstream [6]. It is based on
the idea of DBSCAN [10] and it forms local clusters progressively by detecting
and connecting dense data item neighborhoods.
A version of the AP algorithm that is closer to handling streaming data is
presented in [16]. According to this approach, as data flows, data items are
compared one-by-one to the exemplars and they are assigned to their nearest
one if a distance threshold condition is satisfied. Otherwise, data are considered
outliers and they are put in a reservoir. A cluster redefinition is triggered if the
number of outliers exceeds a heuristic (user-defined) reservoir size or if a change
in data distribution is detected.
Though there is a lot of work on constraint-based clustering methods for static
data [3,8,15], the related work on clustering streaming data based on constraints
is limited. Ruiz et al. [14] have presented a conceptual model for constraints
on streams and extended the constraint-based K-means of [15] for streaming
data. In SemiStream, we refine this model by proposing a constraint stream
of instance-level constraints and we adapt the clusters incrementally instead
of rebuilding them from scratch. Moreover an extension of Denstream so that
constraints are taken into account during the clustering process is proposed in
[13]. C-Denstream ensures that all given constraints are satisfied. However, this is achieved at the cost of creating many small clusters. Moreover, in case of conflicts among constraints, C-Denstream is unable to produce a clustering.
At each time point ti , some items of the snapshot Di are involved in instance-level
constraints, together with items of earlier time points. These new constraints
constitute the new constraint-set CS_i. From them, we derive the set of active constraints at t_i, CS̃_i = ∪_{j=u}^{i} CS_j, where u = max{1, i − w}. In Alg. 1, we depict the DYNAMIC CONSTRAINT UPDATER, which builds CS̃_i. At each t = t_2, t_3, . . ., the DYNAMIC CONSTRAINT UPDATER reads the set of old active constraints CS̃_old and the current constraint-set CS. It combines them to produce the new set of active constraints CS̃ by (a) marking constraints on outdated items as obsolete, (b) recomputing the weights of the non-obsolete, i.e. active, constraints, and (c) taking the union of the resulting set of constraints with CS to form CS̃.
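A minimal sketch of the updater, assuming that constraints are stored as small objects carrying their creation time; the weighting function of Eq. 1 is not reproduced in this excerpt, so it is passed in as a callable.

```python
from dataclasses import dataclass

@dataclass
class Constraint:
    x: int                 # first item involved in the constraint
    y: int                 # second item involved in the constraint
    kind: str              # "ML" (must-link) or "CL" (cannot-link)
    t_created: int         # time point at which the constraint arrived
    weight: float = 1.0
    obsolete: bool = False

def dynamic_constraint_updater(cs_old, cs_new, t_now, w, weight_fn):
    """Combine the old active constraints with the newly arrived ones.

    cs_old    : previously active Constraint objects
    cs_new    : Constraint objects arriving at time point t_now
    w         : window size (constraints created before t_now - w are outdated)
    weight_fn : weighting function of the paper's Eq. 1 (not shown in this
                excerpt), passed in as a callable (constraint, t_now) -> weight
    """
    active = []
    for c in cs_old:
        if c.t_created < max(1, t_now - w):
            c.obsolete = True                  # (a) mark constraints on outdated items obsolete
        else:
            c.weight = weight_fn(c, t_now)     # (b) recompute weights of active constraints
            active.append(c)
    return active + list(cs_new)               # (c) union with the current constraint set CS
```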
that are involved in Must-Link constraints together with x, and CL(x), the
corresponding set of items for Cannot-Link constraints. Then, the constraint-
violation cost for assigning x to c is given by:
cost_CV(x, c, CS̃, t) = Σ_{y ∈ ML(x), y ∉ c} weight(≺x, y≻, t) · f_ML(x, y) + Σ_{y ∈ CL(x), y ∈ c} weight(≺x, y≻, t) · f_CL(x, y)    (2)
In this formula, we consider the items that must be linked with x and are not
members of cluster c, as well as the items in c that should not be linked to x. For
such an item y, the weight of the corresponding constraint is weight (≺ x, y , t),
according to Eq. 1.
The functions fML (·), fCL (·) denote the cost of violating a Must-Link, resp.
Cannot-Link constraint. We consider that the cost of violating a must-link con-
straint between two close points should be higher than the cost of violating
a must-link constraint between two points that are far apart. Thus we imple-
ment the cost function fML (·) as fML (x, y) = dmax − d(x, y), where dmax is the
maximum distance encountered between two items in the dataset. Similarly, the
cost of violating a cannot-link constraint between two distant points should be
higher than the cost of violating a cannot-link constraint between points that
are close. Then we define the fCL (·) function as the distance between the two
items involved in a cannot-link constraint and we set fCL (x, y) = d(x, y).
Then, the cost of assigning an item x to a cluster c for a given set of active constraints CS̃ at time point t consists of the cost of the effected constraint violations and the overhead of placing this item in this cluster. The latter is represented by the distance of the item to the cluster center:
cost(x, c, CS̃, t) = cost_CV(x, c, CS̃, t) + d(x, rep(c))    (3)
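A sketch of the violation costs and of the assignment cost of Eqs. 2 and 3, assuming Euclidean distances, the Constraint objects from the previous sketch, and the cluster centroid as rep(c); weight_fn again stands in for the weighting of Eq. 1.

```python
import numpy as np

def f_ml(xv, yv, d_max):
    """Cost of violating a must-link constraint: close points cost more."""
    return d_max - np.linalg.norm(xv - yv)

def f_cl(xv, yv):
    """Cost of violating a cannot-link constraint: distant points cost more."""
    return np.linalg.norm(xv - yv)

def assignment_cost(x_id, members_c, rep_c, active_cs, vec, d_max, t_now, weight_fn):
    """cost(x, c, CS~, t) = cost_CV(x, c, CS~, t) + d(x, rep(c))  (Eqs. 2 and 3).

    x_id      : id of the item being assigned
    members_c : set of item ids currently in cluster c
    rep_c     : representative (centroid) of cluster c
    active_cs : active Constraint objects that involve x_id
    vec       : dict mapping item id -> feature vector
    """
    cost_cv = 0.0
    for con in active_cs:
        y_id = con.y if con.x == x_id else con.x
        w = weight_fn(con, t_now)
        if con.kind == "ML" and y_id not in members_c:      # must-link violated
            cost_cv += w * f_ml(vec[x_id], vec[y_id], d_max)
        elif con.kind == "CL" and y_id in members_c:        # cannot-link violated
            cost_cv += w * f_cl(vec[x_id], vec[y_id])
    return cost_cv + np.linalg.norm(vec[x_id] - rep_c)      # Eq. 3
```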
items in c across all dimensions of the feature space. The within-cluster distance of c is the average distance of a data item from the cluster representative:
withinClusterDistance(c) = ( Σ_{x ∈ c} d(x, center(c)) ) / |c|    (4)
With some abuse of conventions, we also use the term “radius” for the within-
cluster-distance, i.e. we set radius(c) ≡ withinClusterDistance(c). Then, the
items within the radius of c constitute its “nucleus”:
overlap(c_1, c_2) = (1 / |c_1|) · |{x ∈ nucleus(c_2) : d(x, rep(c_1)) ≤ radius(c_1)}|    (6)
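The quantities of Eqs. 4-6 translate directly into code; since the nucleus definition (Eq. 5) is not reproduced in this excerpt, it is taken here, following the surrounding text, as the set of cluster members lying within the cluster radius.

```python
import numpy as np

def radius(members, center, vec):
    """Within-cluster distance of Eq. 4, also used as the cluster radius."""
    return np.mean([np.linalg.norm(vec[x] - center) for x in members])

def nucleus(members, center, vec):
    """Items of the cluster lying within its radius (assumed form of Eq. 5)."""
    r = radius(members, center, vec)
    return {x for x in members if np.linalg.norm(vec[x] - center) <= r}

def overlap(c1_members, c1_rep, c2_members, c2_rep, vec):
    """Fraction (relative to |c1|) of c2's nucleus falling within c1's radius (Eq. 6)."""
    r1 = radius(c1_members, c1_rep, vec)
    nuc2 = nucleus(c2_members, c2_rep, vec)
    inside = sum(1 for x in nuc2 if np.linalg.norm(vec[x] - c1_rep) <= r1)
    return inside / len(c1_members)
```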
In all cases, the cluster is removed from the m-cluster, possibly causing further
cluster detachments. Similarly to the merging case, the clustering defined after splitting is evaluated based on widely known cluster quality criteria. Specifically, a
cluster validity method is adopted to evaluate the results of the proposed incre-
mental clustering approach at specific time slots in terms of constraint accuracy
as well as clustering compactness and separation [11]. Then, if the quality of the newly emerged clustering ξ is comparable to that of the old clustering, ξ is an accepted clustering. The cluster validity method may trigger the re-clustering of
data when significant changes are observed in the quality of currently defined
clusters.
5 Experimental Evaluation
In this section we present an experimental evaluation of our approach using different datasets. We implemented SemiStream in Java. All experiments were conducted on a 2.53GHz Intel(R) Core 2 Duo PC with 6GB memory.
Data Sets and Constraints Generation. We generate some synthetic
datasets with different numbers of clusters and dimensions. For the sake of vi-
sualization, we chose here to present the performance of our approach on two-
dimensional datasets. Specifically, to evaluate the sensitivity of our algorithm
and its clustering quality in case of arbitrarily shaped clusters, we consider the
synthetic datasets depicted in Figure 1, which have also been used in [10]. We also
generated an evolving data stream, ESD, by choosing one of the three datasets
(denoted SD1, SD2, SD3 in Figure 1) 10 times. Each of the datasets contains 10,000 points and thus the total length of ESD is 100,000.
To generate constraints for our experimental study, we consider a set of labeled data. Our algorithm is fed with constraints each time a new batch of data arrives. Following a strategy similar to that used in the static semi-supervised approaches, the constraints are generated from labeled data so that they correspond to a percentage of data in the considered window.
Fig. 2. Clustering purity vs Time points: a) SD1 dataset, horizon = 2, stream speed
= 250, b) ESD dataset, horizon = 2, stream speed = 1,000
where |Cid | denotes the number of majority class items in cluster i, |Ci | is the
size of cluster i and k is the number of clusters. The results of clustering purity
presented in the following section are an average of results over 5 runs of our
algorithm.
Clustering Quality Evaluation. First, we test the clustering quality of
SemiStream using the SD1 dataset. We consider a stream speed of 250 data items per time point, and the window size is set to 4 time points. Also, a new set of stream constraints is received as data arrives, which corresponds to a percent-
age of data in the window. We can observe that the clusters are arbitrarily shaped
and thus the majority of clustering algorithms are not able to identify them. Our
study shows that the use of constraints can assist with the clustering procedure.
Since points fade out as time passes, we compute the purity of clustering
results in a pre-defined horizon (h) from current time. Figure 2(a) presents the
purity of clustering defined by SemiStream in a small horizon (h = 2) when the
constraints are 1% and 10% of the arrived data. It can be seen that SemiStream
gives very good clustering quality. The clustering purity is always higher than
75%. Figure 2(a) also shows that, as the number of constraints increases, the purity of the identified clusters increases.
Then we evaluate the performance of our algorithm using the evolving data
stream ESD at different time units. We set the stream speed to 1,000 points
per time unit and the horizon equal to 2. Figure 2(b) depicts the clustering purity results of our algorithm when the constraints correspond to 1% and 10% of the arrived data. It can be seen that SemiStream manages to identify the evolution of the clusters as new data arrive, resulting in clusters with purity higher than
80%. Also we can observe the advantage that the use of constraints provides.
Using 10% of the data as constraints, our approach achieves a clustering model with a purity of 99%.
Time Complexity. We evaluate the efficiency of SemiStream by measuring the execution time. The algorithm periodically stores the current clustering results. Thus the execution time refers to the time that our algorithm needs to store clustering results, read data from previous time slots, and redefine clusters. Figure 3 shows the execution time for the synthetic dataset using different numbers of constraints.
We can observe that the execution time grows almost linearly as data stream
proceeds.
6 Conclusions
We present SemiStream, an algorithm that incrementally adapts a clustering
scheme to streaming data which are also accompanied by a set of constraints.
Modeling constraints as a stream and associating constraint violation with a
penalty function allowed us to design a cost-based strategy for cluster adaptation
to snapshots of arriving data and constraints. We introduce the use of i) s-clusters
to describe dense areas in the data set and ii) multiple clusters (m-clusters) to
represent overlapping dense areas in order to capture arbitrarily shaped clusters.
Moreover, we use the structure of outlier clusters to describe a small set of data whose characteristics seem to deviate significantly from the average behavior of the currently processed data. Based on a set of adaptation criteria, SemiStream manages to track changes in the structure of clusters as the data evolve.
References
1. Aggarwal, C., Han, J., Wang, J., Yu, P.: A framework for clustering evolving data
streams. In: Proc. of VLDB (2003)
2. Aggarwal, C., Han, J., Wang, J., Yu, P.: A framework for projected clustering of
high dimensional data streams. In: Proc. of VLDB (2004)
3. Basu, S., Banerjee, A., Mooney, R.J.: Semi-supervised Clustering by Seeding. In:
Proc. of ICML (2002)
4. Bilenko, M., Basu, S., Mooney, R.J.: Integrating constraints and metric learning
in semi-supervised clustering. In: Proc. of ICML (2004)
5. Bilenko, M., Basu, S., Mooney, R.J.: A probabilistic framework for semi-supervised
clustering. In: Proc. of KDD, p. 8 (2004)
6. Cao, F., Ester, M., Qian, W., Zhou, A.: Density-based clustering over an evolving
data stream with noise. In: Proc. of SDM (2006)
7. Davidson, I., Ravi, S.S.: Agglomerative Hierarchical Clustering with Constraints:
Theoretical and Empirical Results. In: Jorge, A.M., Torgo, L., Brazdil, P.B., Ca-
macho, R., Gama, J. (eds.) PKDD 2005. LNCS (LNAI), vol. 3721, pp. 59–70.
Springer, Heidelberg (2005)
8. Davidson, I., Ravi, S.: Clustering with constraints: Feasibility issues and the k-
means algorithm. In: Proc. of SDM, Newport Beach, CA (April 2005)
9. Davidson, I., Wagstaff, K.L., Basu, S.: Measuring Constraint-Set Utility for Par-
titional Clustering Algorithms. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M.
(eds.) PKDD 2006. LNCS (LNAI), vol. 4213, pp. 115–126. Springer, Heidelberg
(2006)
10. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In: Proc. of KDD (1996)
11. Halkidi, M., Gunopulos, D., Kumar, N., Vazirgiannis, M., Domeniconi, C.: A
Framework for Semi-Supervised Learning Based on Subjective and Objective Clus-
tering Criteria. In: Proc. of ICDM (2005)
12. Ruiz, C., Menasalvas, E., Spiliopoulou, M.: Constraint-based Clustering. In: Proc.
of AWIC. SCI. Springer (2007)
13. Ruiz, C., Menasalvas, E., Spiliopoulou, M.: C-DenStream: Using Domain Knowl-
edge on a Data Stream. In: Gama, J., Costa, V.S., Jorge, A.M., Brazdil, P.B. (eds.)
DS 2009. LNCS, vol. 5808, pp. 287–301. Springer, Heidelberg (2009)
14. Ruiz, C., Spiliopoulou, M., Menasalvas, E.: User Constraints Over Data Streams.
In: IWKDDS (2006)
15. Wagstaff, K., Cardie, C., Rogers, S., Schroedl, S.: Constrained K-means Clustering
with Background Knowledge. In: Proc. of ICML (2001)
16. Zhang, X., Furtlehner, C., Sebag, M.: Data Streaming with Affinity Propagation.
In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008, Part II.
LNCS (LNAI), vol. 5212, pp. 628–643. Springer, Heidelberg (2008)
Unsupervised Sparse Matrix Co-clustering
for Marketing and Sales Intelligence
1 Introduction
Existing co-clustering methods can face several practical issues which limit
both their efficiency and the interpretability of the results. For example, the
majority of co-clustering algorithms explicitly require as input the number of
clusters in the data. In most business scenarios, such an assumption is unrealis-
tic. We cannot assume prior knowledge on the data, but we require a technique
that allows data exploration. There exist some methodologies that attempt to
perform automatic co-clustering [3], i.e., determine the number of co-clusters.
They address the problem by evaluating a number of different configurations and retaining the solution that provides the best value of the given objective function (e.g., entropy minimization). Such a trial-based approach can significantly
affect the performance of the algorithm. Therefore, these techniques are bet-
ter suited for off-line data processing, rather than for interactive data analysis,
which constitutes a key requirement for our setting.
Another shortcoming of many spectral-based approaches is that they typically
assume a perfect block-diagonal form for the reordered matrix; “off-diagonal”
clusters are usually not detected. This is something that we accommodate in
2 Related Work
The principle of co-clustering was introduced first by Hartigan with the goal of
‘clustering cases and variables simultaneously’ [11]. Initial applications were for
the analysis of voting data. Hartigan’s method is heuristic in nature and may fail
to find existing dense co-clusters in certain cases. In [4] the authors present an iterative algorithm that converges to a local minimum of the same objective function as in [11]. [1] describes an algorithm which provides constant-factor
approximations to the optimum co-clustering solution using the same objective
function. A spectral co-clustering method based on the Fiedler vector appeared
in [6]. Our approach uses a similar analytical toolbox as [6], but in addition is
automatic (the number of clusters need not be given), and does not assume a perfect
block-diagonal form for the matrix. A different approach views the input matrix
as an empirical joint probability distribution of two discrete random variables
and poses the co-clustering problem as an optimization problem from an infor-
mation-theoretic perspective [7]. Under this view, the optimal co-clustering maximizes the mutual information between the clustered random variables. A method employing a similar metric to that of [7] appeared in [20]; the latter approach also returns the co-clusters in a hierarchical format. Finally, [16] provides a parallel
implementation of the method of [20] using the map-reduce framework. More
detailed reviews on the topic can be found in [14] and [21].
Our approach is as rigorous as the above approaches; more importantly, it lifts important shortcomings of the spectral-based approaches and, in addition, focuses on the recommendation aspect of co-clustering, something that
previous efforts do not consider.
Note that the above objective function typically takes a large value if the bipartition (S, S̄) is not balanced, hence it favors balanced partitions. The goal of
graph partitioning is to solve the optimization problem
min NCut(S, S̄). (1)
S⊂V
This problem is NP-hard. Many approximation algorithms for this problem have
been developed over the years [2,12]; however, they turn out not to perform well in practice. In our approach, we employ a heuristic that is based
on spectral techniques and works well in practice [13,9]. Following the notation
of [13], for any S ⊂ V, define the vector q ∈ R^n as
q_i = +√(η_2/η_1) if i ∈ S,  and  q_i = −√(η_1/η_2) if i ∈ S̄,    (2)
where η_1 = vol(S) and η_2 = vol(S̄). The objective function in (1) can be written (see [6] for details) as follows:
min_q (q^⊤ L q) / (q^⊤ D q),  subject to q as in (2) and q^⊤ D 1 = 0,
where the extra constraint q^⊤ D 1 = 0 excludes the trivial solution where S = V
or S = ∅. Even though the above problem is also NP-hard, we adopt a spectral
2-clustering heuristic: first, we drop the constraint that q is as in Eqn. (2); this relaxation is equivalent to finding the second smallest eigenvalue and eigenvector of the generalized eigensystem Lz = λDz. Then, we round the resulting eigen-
vector z (a.k.a. Fiedler vector ) to obtain a bipartition of G. The rounding is
performed by applying 2-means clustering separately on the coordinates of z,
which can be solved exactly and efficiently.
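A minimal sketch of this 2-clustering heuristic, using SciPy and scikit-learn purely for illustration: solve the generalized eigensystem Lz = λDz, take the Fiedler vector, and round it with 2-means on its coordinates.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_bipartition(A):
    """Bipartition a graph given by a symmetric (weighted) adjacency matrix A.

    Assumes every vertex has positive degree, so that D is positive definite.
    Returns the partition labels and lambda_2, the second smallest generalized
    eigenvalue (which equals that of the normalized Laplacian).
    """
    D = np.diag(A.sum(axis=1))
    L = D - A                             # graph Laplacian
    eigvals, eigvecs = eigh(L, D)         # generalized eigensystem L z = lambda D z
    fiedler = eigvecs[:, 1]               # Fiedler vector
    # 1-D 2-means can be solved exactly by sorting; sklearn's KMeans is used
    # here only for brevity.
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(fiedler.reshape(-1, 1))
    return labels, eigvals[1]
```

In the co-clustering setting, A would be the adjacency matrix of the bipartite graph built from the rows and columns of the input matrix, so the resulting partition splits rows and columns simultaneously.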
4 The Algorithm
The proposed algorithm consists of two steps: in the first step, we compute a per-
mutation of the row set and column set using a recursive spectral bi-partitioning
algorithm; in the second step we use the permuted input matrix to identify any
remaining clusters by means of a Gaussian-based density estimator.
Cheeger's inequality [5] tells us that 2c(G) ≥ λ_2 ≥ c(G)²/2, where λ_2 is the second smallest eigenvalue of the normalized Laplacian of G. The first inequality implies that if λ_2 is large, then G does not have sufficiently small conductance. This implication supports our choice of the termination criterion for the recursion of Algorithm 1 (see Step 11 of the SplitCluster procedure). Roughly speaking, we want to stop the recursion when the matrix cannot be reduced (after permutation of rows and columns) to an approximately block-diagonal matrix, equivalently when the bipartite graph associated with the current matrix does not contain any sparse cut. Using Cheeger's inequality, we can efficiently check whether the bipartite graph has a sufficiently good cut. An illustration of the algorithm's recursion is shown in Fig. 2. Our approach has many similarities to Newman's modularity-based partitioning [15], but it is not identical.
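Under these observations, the termination test reduces to a one-line check on λ_2 (the value returned by the bipartition sketch above); the conductance threshold used here is an assumption.

```python
def should_split(lambda_2, phi=0.25):
    """Cheeger-based recursion test (sketch; the threshold phi is an assumption).

    Since 2 * c(G) >= lambda_2, the conductance satisfies c(G) >= lambda_2 / 2.
    If this lower bound already exceeds phi, the current bipartite graph has no
    sufficiently sparse cut and the recursion stops; otherwise we keep splitting.
    """
    return lambda_2 / 2.0 <= phi
```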
4.4 Recommendations
Algorithm 1 together with the density estimation step output a list of co-clusters.
In the business scenarios that we consider, co-clusters represent strongly correlated customer and product subsets. We illustrate how this information
can be used to drive meaningful product recommendations.
Many discovered co-clusters are expected to contain “white-spots”. These represent customers that exhibit a similar buying pattern to a number of other customers, yet have not bought a particular product within the co-cluster. Such products constitute good recommendations. Essentially, we exploit the
existence of globally-observable patterns for making individual recommendations.
Not all “white-spots” are equally important. We rank them by considering
firmographic and financial characteristics of the customers. The intuition is that
‘wealthy’ customers/companies that have bought many products in the past are
better-suited candidates. They are in a financial position to buy a product and
they have already established a buying relationship. In our formula we consider
three factors:
In our scenario, the weights w1 ,w2 ,w3 are assumed to be equal but in general
they can be tuned appropriately.
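As a sketch only, such a ranking can be expressed as a weighted sum; the three factor names below are hypothetical placeholders, since the excerpt does not list the factors themselves, and only the equal weights w1, w2, w3 come from the text.

```python
def whitespot_score(customer, w1=1/3, w2=1/3, w3=1/3):
    """Score a 'white-spot' (customer, product) candidate by a weighted sum.

    The factor names (revenue_norm, growth_norm, products_owned_norm) are
    hypothetical placeholders for normalized firmographic and financial
    characteristics; the equal weights mirror the text above.
    """
    return (w1 * customer["revenue_norm"]
            + w2 * customer["growth_norm"]
            + w3 * customer["products_owned_norm"])

# Example: rank candidate white-spots by descending score.
candidates = [{"revenue_norm": 0.9, "growth_norm": 0.4, "products_owned_norm": 0.7},
              {"revenue_norm": 0.3, "growth_norm": 0.8, "products_owned_norm": 0.2}]
ranked = sorted(candidates, key=whitespot_score, reverse=True)
```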
5 Experiments
5.1 Comparison with other Techniques
We compare the proposed approach with two other techniques. The first one
(SPECTRAL) is described in [6] and is similar to the proposed approach.
The main difference is that the approach of [6] performs k-partition of the in-
put matrix using the eigenvectors that correspond to the smallest eigenvalues.
Moreover, in order to compute the clustering it utilizes k-means clustering which
makes the approach randomized, compared to our approach which is determinis-
tic. The second one (DOUBLE-KMEANS) is described in [1]. First, it performs
k-means clustering using as input vectors the columns of the input matrix and
then permutes the columns by grouping together columns that belong in the
same cluster. In the second step, it performs the same procedure on the rows
of the input matrix using a possibly different number of clusters, say l. This approach outputs k · l clusters. Typical values for k and l that we use are between 3 and 5. We run all the above algorithms on synthetic data which we
produced by creating several block-diagonal and off-diagonal clusters. We intro-
duced “salt-and-pepper” noise in the produced matrix, in an effort to examine the accuracy of the compared algorithms even when diluting the strength of
the original patterns. The results are summarized in Figure 4. We observe that
our algorithm can detect with high efficiency the original patterns, whereas the
original spectral and k-Means algorithms present results of lower quality.
Fig. 4. The first column contains the ground truth and the remaining columns contain the output of the three algorithms described in Section 5.1. All algorithms take as input
a randomly permuted (independently in rows and columns) version of the ground truth
instance.
The results are depicted in Table 1. For this, we extract matrices that rep-
resent buying patterns within our company for various industries of customers,
because different industries exhibit different patterns. We notice that the pro-
posed recursive algorithm results in compressed matrix sizes significantly smaller
than the competitive approaches, suggesting a more effective co-clustering pro-
cess.
For this example we use real-world data provided by our sales department re-
lating to approximately 30,000 Swiss customers. The dataset contains all firmo-
graphic information pertaining to the customers, such as: industry categorization
(electronic, automotive, etc), expected industry growth, customer’s turnover for
last, past revenue. We perform co-clustering on the customer-product matrix.
We apply our algorithm on each industry separately, because sales people only
have access to their industry of specialization. Figure 5 shows the outcome of
the algorithm and the detected diagonal and off-diagonal clusters. The highest
ranked recommendations are detected within the blue cluster, and they suggest
that ‘white-spot’ customers within this cluster can be approached with an offer
for the product ‘System-I’. These customers were ranked higher based on their
financial characteristics.
6 Conclusion
The focus of this work was to show explicitly how co-clustering techniques can be coupled with recommender systems for business intelligence applications. The contributions of our approach include:
– An unsupervised spectral-based technique for detection of large ‘diagonal’ co-
clusters. We present a robust termination criterion and we depict its accuracy
on a variety of synthetic data where we compare with ground-truth.
References
1. Anagnostopoulos, A., Dasgupta, A., Kumar, R.: Approximation Algorithms for co-
Clustering. In: Proceedings of ACM Symposium on Principles of Database Systems
(PODS), pp. 201–210 (2008)
2. Arora, S., Rao, S., Vazirani, U.: Expander Flows, Geometric Embeddings and
Graph Partitioning. J. ACM 56, 5:1–5:37 (2009)
3. Chakrabarti, D., Papadimitriou, S., Modha, D.S., Faloutsos, C.: Fully Automatic
Cross-associations. In: Proc. of International Conference on Knowledge Discovery
and Data Mining (KDD), pp. 79–88 (2004)
4. Cho, H., Dhillon, I.S., Guan, Y., Sra, S.: Minimum Sum-Squared Residue co-
Clustering of Gene Expression Data. In: Proc. of SIAM Conference on Data Mining,
SDM (2004)
5. Chung, F.R.K.: Spectral Graph Theory. American Mathematical Society (1994)
6. Dhillon, I.S.: Co-Clustering Documents and Words using Bipartite Spectral Graph
Partitioning. In: Proc. of International Conference on Knowledge Discovery and
Data Mining (KDD), pp. 269–274 (2001)
7. Dhillon, I.S., Mallela, S., Modha, D.S.: Information-theoretic co-Clustering. In:
Proc. of International Conference on Knowledge Discovery and Data Mining
(KDD), pp. 89–98 (2003)
8. Fiedler, M.: Algebraic Connectivity of Graphs. Czechoslovak Mathematical Jour-
nal 23(98), 298–305 (1973)
9. Guattery, S., Miller, G.L.: On the Performance of Spectral Graph Partitioning
Methods. In: Proc. of ACM-SIAM Symposium on Discrete Algorithms (SODA),
pp. 233–242 (1995)
10. Hagen, L., Kahng, A.: New Spectral Methods for Ratio Cut Partitioning and Clus-
tering. IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems 11(9), 1074–1085 (1992)
11. Hartigan, J.A.: Direct Clustering of a Data Matrix. Journal of the American Sta-
tistical Association 67(337), 123–129 (1972)
12. Leighton, T., Rao, S.: Multicommodity Max-flow Min-cut Theorems and their Use
in Designing Approximation Algorithms. J. ACM 46, 787–832 (1999)
13. von Luxburg, U.: A Tutorial on Spectral Clustering. Statistics and Computing 17, 395–
416 (2007)
14. Madeira, S., Oliveira, A.L.: Biclustering Algorithms for Biological Data Analysis:
a survey. Trans. on Comp. Biology and Bioinformatics 1(1), 24–45 (2004)
15. Newman, M.E.J.: Fast Algorithm for Detecting Community Structure in Networks.
Phys. Rev. E 69, 066133 (2004)
16. Papadimitriou, S., Sun, J.: DisCo: Distributed Co-clustering with Map-Reduce: A
Case Study towards Petabyte-Scale End-to-End Mining. In: Proc. of International
Conference on Data Mining (ICDM), pp. 512–521 (2008)
17. Shi, J., Malik, J.: Normalized Cuts and Image Segmentation. IEEE Transactions
on Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000)
18. Salomon, D.: Data Compression: The Complete Reference, 2nd edn. Springer-
Verlag New York, Inc. (2000)
19. Shmoys, D.B.: Cut Problems and their Application to Divide-and-conquer, pp.
192–235. PWS Publishing Co. (1997)
20. Sun, J., Faloutsos, C., Papadimitriou, S., Yu, P.S.: GraphScope: Parameter-free
Mining of Large Time-evolving Graphs. In: Proc. of KDD, pp. 687–696 (2007)
21. Tanay, A., Sharan, R., Shamir, R.: Biclustering Algorithms: a survey. Handbook
of Computational Molecular Biology (2004)
Expectation-Maximization Collaborative
Filtering with Explicit and Implicit Feedback
1 Introduction
In the modern digital world, consumers are overwhelmed by the huge amount
of product choices offered by electronic retailers and content providers. Recom-
mender systems, which analyze patterns of user interests in products in order
to provide personalized recommendations satisfying users’ tastes, have recently
attracted a great deal of attention from both academia and industry. Collabora-
tive filtering (CF) [9], which analyzes relationships among users and items (i.e.,
products) in order to identify potential associations between users and items, is
a popular strategy for recommender systems. Compared to content filtering [8],
which is the other recommendation strategy that depends on profiles of users
and/or items, CF has the advantage of being free of domain knowledge. Since CF
relies only on the history of user behavior, it can address the issues in creating
explicit profiles, which are difficult in many recommender system scenarios.
The history of user behavior for CF usually consists of user feedback, which
generally refers to any form of user action on items that may convey the infor-
mation about users’ preferences of items. There are two kinds of user feedback:
explicit feedback and implicit feedback. Explicit feedback is often in the form
of rating actions. For example, Amazon.com asks users to rate their purchased
CDs and books on a scale of 1-5 stars. Since explicit feedback directly represents
users’ judgments on items in a granular way, it has been widely used in many
traditional CF recommender systems [2] [9] [10]. However, explicit feedback re-
quires users to perform extra rating actions, which may lead to inconvenience for
the user. Given the overwhelming amount of items, it is burdensome for users
to rate every item they like or dislike.
Implicit feedback, on the other hand, does not need additional rating actions.
Implicit feedback generally refers to any user behavior that indirectly expresses
user interests. For example, a news website may cache a user’s clicking records
when browsing news articles, in order to predict what kind of news the user
may prefer. Other forms of implicit feedback include keyword searching, mouse
movement, and even eye tracking. Since it is relatively easy to collect such user behaviors, implicit feedback attracts the interest of researchers who attempt to
infer user preferences from the much larger amount of implicit feedback. How-
ever, implicit feedback is less accurate than explicit feedback. For example, a
user may regret buying a product online after receiving the real product. It is
difficult to determine whether a user likes a product only based on the purchase
behavior, even though the user paid for the product.
Explicit feedback and implicit feedback are naturally complementary to each
other. With explicit feedback, the quality is more reliable, but the quantity is
limited. However, the quality of implicit feedback is less accurate, but there
is an abundant quantity. Most of the existing works have solely considered ei-
ther explicit feedback [3] or implicit feedback [4] [7] in recommender systems.
In the few works that have tried to unify explicit and implicit feedback, explicit feedback is just treated as a special kind of implicit feedback, and the
implicit feedback is simply normalized to a set of numeric rating values. Without
carefully studying how to organically combine explicit and implicit feedback, we
will not be able to further improve the performances of recommender systems.
In this paper, we propose a novel recommender method based on matrix fac-
torization [5], called expectation-maximization collaborative filtering (EMCF).
The first contribution of this paper is that we combine explicit and implicit feed-
back together, in which both explicit feedback and implicit feedback are fully
utilized. The second contribution is that we observe different cases of implicit
feedback and develop the corresponding solutions to estimate implicit feedback
for different cases. The third contribution is that we design an expectation-
maximization-styled algorithm in EMCF to update the estimations of implicit
feedback and latent factor vectors.
Instead of treating explicit feedback as special implicit feedback, EMCF ini-
tializes a latent factor model with explicit feedback and then updates the latent
factor vectors based on the explicit feedback ratings and the implicit feedback
estimations. The key challenge in utilizing implicit feedback in matrix factoriza-
tion is that implicit feedback does not have the numeric rating value. Instead of
simply normalizing implicit feedback to a set of numeric rating values, EMCF
estimates implicit feedback rating values based on the explicit feedback and
available latent factor vectors.
We observe that the implicit feedback can be categorized into four cases, in
terms of the distribution of latent factor vectors from the currently trained latent
factor model. For different cases, EMCF has corresponding solutions, which are
not only based on explicit feedback ratings and implicit feedback estimations,
but are also based on the graph-based structure of explicit and implicit feedback.
We also observe that only part of implicit feedback ratings can be esti-
mated based on the current situation of the model. Therefore, an expectation-
maximization-styled algorithm is designed in EMCF to: 1) propagate current
estimations of implicit feedback plus explicit feedback ratings towards the set of
implicit feedback that have not yet been estimated, so that more implicit feed-
back estimations can be added into the model training; 2) Re-train the latent
factor model based on all the available implicit feedback estimations and explicit
feedback ratings and then update the latent factor vectors of users and items for
further estimating. The algorithm not only fully utilizes implicit feedback with
explicit feedback, but also prevents noisy implicit feedback from affecting the
performance of the EMCF model.
Experiments have been conducted to compare the EMCF model with other
popular models. The experimental results show that EMCF outperforms those
models, especially when the percentage of explicit feedback is small.
The rest of the paper is organized as follows: in Section 2, a preliminary is
given including the formalization, the background of CF, and a related method
called co-rating; in Section 3, we present the observations of implicit feedback
and propose the solutions to estimate implicit feedback for different cases; in
Section 4, we describe the EMCF model and introduce the algorithm to train
the model; in Section 5, the experiments make the comparisons between EMCF
and other models; the conclusion and some future works are given in Section 6.
2 Preliminary
2.1 Formalization
In this paper, we use U = {u1 , u2 , . . . , um } to denote a set of m users and use
I = {i1 , i2 , . . . , in } to denote a set of n items. The explicit feedback and implicit
feedback are defined as the observable actions from U to I that may directly
and indirectly reflect users’ preferences of the items.
Explicit feedback is usually in the form of rating actions. A user rates items
by assigning numeric rating values. The observed rating values are represented by a matrix R ∈ ℝ^{m×n}, in which each entry r_ui ∈ ℝ denotes the rating on item i given by user u. We use S^E to denote the set of explicit feedback, which consists of user-item-rating triples (u, i, r_ui).
Implicit feedback typically consists of various types of actions performed by
users on items that can be automatically tracked by systems. In some related
works [4] [6], implicit feedback is represented by a binary variable bui ∈ {0, 1},
in which 1 means user u performed some action on item i and 0 means u never
touched i. In this paper, we assume that a user has no interest in an item if he/she has never touched it. We use S^I to denote the set of implicit feedback, which consists of user-item pairs (u, i) for which u has implicit feedback on i.
2.3 Co-rating
Liu et al. [6] developed a matrix factorization model called co-rating, which tries
to unify explicit and implicit feedback. Co-rating treats explicit feedback as a
special kind of implicit feedback, so that the entire set of explicit and implicit
feedback can be used simultaneously during the model training. For solving
the challenge of implicit feedback ratings having only binary values instead of
numeric values, co-rating normalizes the rating values of explicit feedback and
the binary values of implicit feedback into a range of [0, 1]. With the co-rating
method, latent factor vectors are learned by solving an objective function in
the matrix factorization model trained with explicit and implicit feedback. The
co-rating’s objective function is different from the one normally used in other
matrix factorization methods [1] [5]; an extra weighted term has been added,
which aims at controlling the loss when treating explicit feedback as implicit
feedback.
R ≈ M_U · M_I.    (1)
The matrix factorization method learns the individual latent factor vectors in M_U and M_I by solving
argmin_{q*, p*} Σ_{(u,i,r_ui) ∈ S^T} (r_ui − q_i^⊤ p_u)² + λ(‖q_i‖² + ‖p_u‖²),    (2)
where S^T is the training set including the known rating values, the term (r_ui − q_i^⊤ p_u)² measures the goodness of the rating approximation, and ‖q_i‖² + ‖p_u‖²
q_i ← q_i + γ · (e_ui · p_u − λ · q_i)    (4)
p_u ← p_u + γ · (e_ui · q_i − λ · p_u)    (5)
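A minimal NumPy sketch of learning the latent factor vectors with the SGD updates of Eqs. 4 and 5; the error term e_ui = r_ui − q_i^⊤ p_u is assumed, since Eq. 3 is not reproduced in this excerpt, and the hyperparameter values are placeholders.

```python
import numpy as np

def sgd_mf(S_T, m, n, f=20, gamma=0.01, lam=0.05, epochs=20, seed=0):
    """Learn user/item latent factor vectors from rating triples with SGD.

    S_T : list of (u, i, r_ui) triples with 0 <= u < m and 0 <= i < n
    """
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((m, f))       # user latent factor vectors p_u
    Q = 0.1 * rng.standard_normal((n, f))       # item latent factor vectors q_i
    for _ in range(epochs):
        for u, i, r in S_T:
            qi, pu = Q[i].copy(), P[u].copy()
            e = r - qi @ pu                     # prediction error e_ui (assumed)
            Q[i] += gamma * (e * pu - lam * qi) # Eq. 4
            P[u] += gamma * (e * qi - lam * pu) # Eq. 5
    return P, Q
```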
ALS is a different style of algorithm for learning the latent factor vectors.
In ALS, one of the unknown latent factor vectors is fixed in order to learn the
other vector; and, the latter vector is then fixed to learn the former vector. The
procedure is repeated until convergence is reached. With ALS, the optimization
of Equation 2 becomes quadratic and can be optimally solved. Although ALS is
slower and more complicated than SGD, it is usually favorable when paralleliza-
tion is needed. Due the space limitation, we are not going to give the details of
ALS here.
– Case 1: Both u and i have been assigned latent factor vectors pu and qi ,
respectively, because u and i are also included in S E .
– Case 2: u has been assigned latent factor vector pu because u is also included
in S E , but i has no latent factor vector since i is not included in S E .
– Case 3: i has been assigned latent factor vector qi because i is also included
in S E , but u has no latent factor vector since u is not included in S E .
– Case 4: Both u and i have no latent factor vectors, because u and i are not
included in S E .
Fig. 1. Four cases of implicit feedback. The black circles and black squares are used
to represent users and items, respectively, on which the latent factor vectors have been
assigned, and the white circles and white squares are used to represent the users and
items, on which there is no latent factor vector assigned yet. The solid lines represent
the explicit feedback, and the dash lines represent the implicit feedback. The thick dash
lines are the targets of implicit feedback that we will estimate based on the current
situation.
For Case 2, the target implicit feedback cannot be directly estimated using latent
factor vectors due to the lack of qi . We use an item-based CF method to estimate
it. First, the similarity sim(i, j) between item i and item j is calculated using
Jaccard Similarity Coefficient as:
sim(i, j) = |A_i ∩ A_j| / |A_i ∪ A_j|,    (7)
where Ai and Aj are the set of users who have either explicit or implicit feedback
actions on i and j respectively. Next, we look for the set of neighbor items Ni
of item i. In N_i, each neighbor item j has to satisfy the following conditions: 1) the similarity sim(i, j) is larger than a pre-defined threshold; 2) a latent factor vector has already been assigned to j. Then, the estimation r̂_ui^I for the target implicit feedback can be computed as:
r̂_ui^I = ( Σ_{j ∈ N_i} sim(i, j) · q_j^⊤ p_u ) / ( Σ_{j ∈ N_i} sim(i, j) ).    (8)
Similarly for Case 3, the similarity sim(u, v) between user u and user v is also
calculated by Jaccard Similarity Coefficient as:
sim(u, v) = |A_u ∩ A_v| / |A_u ∪ A_v|,    (9)
where Au and Av are the set of items on which u and v have either explicit
or implicit feedback actions respectively. The set of neighbor users Nu for user
u is found, in which each neighbor user v has a value sim(u, v) larger than a threshold and has an assigned latent factor vector. The estimation r̂_ui^I for the target implicit feedback can be computed using a user-based CF method as:
r̂_ui^I = ( Σ_{v ∈ N_u} sim(u, v) · q_i^⊤ p_v ) / ( Σ_{v ∈ N_u} sim(u, v) ).    (10)
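A sketch of the neighborhood-based estimate for Case 2 (Eqs. 7-8); the threshold value and the dictionary-based bookkeeping are assumptions, and the latent vectors are assumed to be NumPy arrays. The user-based estimate for Case 3 (Eqs. 9-10) is symmetric, with the roles of users and items swapped.

```python
def jaccard(A, B):
    """Jaccard similarity coefficient between two sets (Eqs. 7 and 9)."""
    union = A | B
    return len(A & B) / len(union) if union else 0.0

def estimate_case2(u, i, P, Q, item_users, threshold=0.1):
    """Item-based estimate of r̂_ui for Case 2 (Eq. 8).

    P, Q       : dicts holding the currently assigned user/item latent vectors
    item_users : dict mapping item id -> set of users with any feedback on it
    Returns None when no qualifying neighbor item exists yet.
    """
    num = den = 0.0
    for j, q_j in Q.items():              # only items that already have a latent vector
        if j == i:
            continue
        s = jaccard(item_users[i], item_users[j])
        if s > threshold:
            num += s * float(q_j @ P[u])  # sim(i, j) * q_j^T p_u
            den += s
    return num / den if den > 0 else None
```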
– E Step: Train the collaborative filtering model using all the explicit feedback
ratings and currently available estimations of implicit feedback.
– M Step: Estimate the implicit feedback based on latent factor vectors that
are output from the CF model trained in the previous E Step.
Algorithm EMCF
Input: Explicit feedback set S^E, implicit feedback set S^I, user set U, and item set I.
Output: User latent factor matrix M_U and item latent factor matrix M_I.
Initialization:
– Initialize the training set S^T with S^E, train the latent factor vectors for the users and items in S^T, and assign them back to M_U and M_I.
– Initialize an empty set Ŝ^E, in which the implicit feedback estimation triples (u, i, r̂_ui) will be included.
BEGIN
Repeat:
  For each user-item pair (u, i) in S^I:
    If both u and i have latent factor vectors:
      Estimate rating r̂_ui for (u, i) by Equation 6;
      Put (u, i, r̂_ui) in Ŝ^E;
      Remove (u, i) from S^I;
    Else if u has a latent factor vector but i does not:
      Estimate rating r̂_ui for (u, i) by Equation 8 when item neighbors of i can be found;
      Put (u, i, r̂_ui) in Ŝ^E;
      Remove (u, i) from S^I;
    Else if i has a latent factor vector but u does not:
      Estimate rating r̂_ui for (u, i) by Equation 10 when user neighbors of u can be found;
      Put (u, i, r̂_ui) in Ŝ^E;
      Remove (u, i) from S^I;
  Train the model using S^T ← S^T ∪ Ŝ^E;
  Update the corresponding columns of M_U and M_I with the updated latent factor vectors;
  Evaluate the difference between the rating estimations produced by the previous latent factor vectors and those produced by the current latent factor vectors;
Until no new entry has been added to Ŝ^E and the estimation difference is lower than the threshold.
END
users and items that already have latent factor vectors, their latent factor vectors
are updated, since the EMCF model is re-trained using the updated rating set.
Therefore, the estimations of some non-eligible implicit feedback in the previous
round become possible. Then, the EMCF algorithm returns to the step of estimating implicit feedback, and the above steps are repeated. The algorithm is terminated
when there is no longer eligible implicit feedback to estimate and the rating es-
timation difference between the previous round and current round is lower than
a pre-defined threshold. The formal algorithm procedure description is shown in
Figure 2.
The EMCF algorithm has several advantages in combining explicit feedback with implicit feedback. First, the implicit feedback is categorized into the four disjoint
sets, in terms of the current situations of user and item latent factor vectors.
Therefore, we have a chance to deal with each case of implicit feedback differently. Sec-
ond, the estimation methods for different cases not only depend on the rating
calculation from the matrix factorization, but also consider the neighbor struc-
ture built by both explicit feedback and implicit feedback. Third, the iterative
procedure of EMCF fully utilizes implicit feedback by providing the opportu-
nity to include more estimations of implicit feedback, which are not eligible in
the previous operational round of the algorithm. Finally, the EMCF algorithm
keeps noisy implicit feedback out of the model training procedure, so that the performance of the output model can be improved. Some implicit feedback is not used, since there is not sufficient information for estimation. Usually, such implicit
feedback is suspected of being noise.
5 Experiments
From the results, we can see that the performance of EMCF with implicit
feedback of Case 1 is worse than the baseline. This is because the implicit feed-
back of Case 1 is estimated by the latent factor vectors learned from the model
only based on explicit feedback. Without estimations of other cases of implicit
feedback, EMCF with Case 1 overfits the model. EMCF with implicit feedback
of Case 2 or Case 3 outperforms the baseline, but the improvements in performance are not substantial. EMCF with the implicit feedback combination of Case 2 and Case 3 shows a greater improvement compared to the baseline. However,
the implicit feedback is not fully utilized due to the lack of Case 1. EMCF with
all the implicit feedback has the best performance since the EM-style algorithm
of EMCF fully utilizes all the implicit feedback.
1 http://www.grouplens.org/node/73
2 http://mahout.apache.org/
Fig. 3. The experimental results of matrix factorization (MF) only with explicit feedback, co-rating, and EMCF with different percentages of explicit feedback (RMSE on the y-axis vs. explicit feedback percentage on the x-axis)
References
1. Bell, R.M., Koren, Y.: Scalable Collaborative Filtering with Jointly Derived Neigh-
borhood Interpolation Weights, Los Alamitos, CA, USA, pp. 43–52 (2007)
2. Desrosiers, C., Karypis, G.: A Comprehensive Survey of Neighborhood-based Rec-
ommendation Methods. In: Ricci, F., Rokach, L., Shapira, B., Kantor, P.B. (eds.)
Recommender Systems Handbook, pp. 107–144. Springer, Boston (2011)
3. Hofmann, T.: Latent semantic models for collaborative filtering. ACM Transactions
on Information Systems (TOIS) 22, 89–115 (2004)
4. Hu, Y., Koren, Y., Volinsky, C.: Collaborative Filtering for Implicit Feedback Datasets. In: Proc. of the IEEE International Conference on Data Mining (ICDM), pp. 263–272 (2008)
5. Koren, Y., Bell, R., Volinsky, C.: Matrix Factorization Techniques for Recom-
mender Systems. Computer 42(8), 30–37 (2009)
6. Liu, N.N., Xiang, E.W., Zhao, M., Yang, Q.: Unifying explicit and implicit feedback
for collaborative filtering, New York, NY, USA, pp. 1445–1448 (2010)
7. Pan, R., Scholz, M.: Mind the gaps: weighting the unknown in large-scale one-class
collaborative filtering. In: Proceedings of the 15th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining - KDD, Paris, France, p.
667 (2009)