Web Technologies and Applications
LNCS 7808
15th Asia-Pacific Web Conference, APWeb 2013
Sydney, Australia, April 2013
Proceedings
Lecture Notes in Computer Science 7808
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
Yoshiharu Ishikawa Jianzhong Li
Wei Wang Rui Zhang Wenjie Zhang (Eds.)
Web Technologies
and Applications
15th Asia-Pacific Web Conference, APWeb 2013
Sydney, Australia, April 4-6, 2013
Proceedings
Volume Editors
Yoshiharu Ishikawa
Nagoya University
Graduate School of Information Science
Nagoya 464-8601, Japan
E-mail: [email protected]
Jianzhong Li
Harbin Institute of Technology
Department of Computer Science and Technology
Harbin 150006, China
E-mail: [email protected]
Wei Wang
Wenjie Zhang
University of New South Wales
School of Computer Science and Engineering
Sydney, NSW 2052, Australia
E-mail: {weiw, zhangw}@cse.unsw.edu.au
Rui Zhang
University of Melbourne
Department of Computing and Information Systems
Melbourne, VIC 3052, Australia
E-mail: [email protected]
CR Subject Classification (1998): H.2.8, H.2, H.3, H.5, H.4, J.1, K.4, I.2
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Message from the General Chairs
Welcome to APWeb 2013, the 15th edition of the Asia-Pacific Web Confer-
ence. APWeb is a leading international conference on research, development, and
applications of Web technologies, database systems, information management,
and software engineering, with a focus on the Asia-Pacific region. Previous AP-
Web conferences were held in Kunming (2012), Beijing (2011), Busan (2010),
Suzhou (2009), Shenyang (2008), Huangshan (2007), Harbin (2006), Shanghai
(2005), Hangzhou (2004), Xi'an (2003), Changsha (2001), Xi'an (2000), Hong
Kong (1999), and Beijing (1998).
The APWeb 2013 conference was, for the first time, held in Sydney, Australia,
a city blessed with a temperate climate, a beautiful harbor, and natural at-
tractions surrounding it. These proceedings collect the technical papers selected
for presentation at the conference, held during April 4-6, 2013.
The APWeb 2013 program featured a main conference, a special track, and
four satellite workshops. The main conference had three keynotes by eminent re-
searchers H.V. Jagadish from the University of Michigan, USA, Mark Sanderson
from RMIT University, Australia, and Dan Suciu from the University of Wash-
ington, USA. Three tutorials were offered by Haixun Wang, Microsoft Research
Asia, China, Yuqing Wu, Indiana University, USA, George Fletcher, Eindhoven
University of Technology, The Netherlands, and Lei Chen, Hong Kong University
of Science and Technology, Hong Kong, China. The conference received 165 paper
submissions from North America, South America, Europe, Asia, and Oceania.
Each submitted paper underwent a rigorous review by at least three indepen-
dent referees, with detailed review reports. Finally, 39 full research papers and
22 short research papers were accepted, from Australia, Bangladesh, Canada,
China, India, Ireland, Italy, Japan, New Zealand, Saudi Arabia, Sweden, Nor-
way, UK, and USA. The special track of Distributed Processing of Graph, XML
and RDF Data: Theory and Practice was organized by Alfredo Cuzzocrea. The
conference also had four workshops.
We were extremely excited by our strong Program Committee, comprising out-
standing researchers in the APWeb research areas. We would like to extend our
sincere gratitude to the Program Committee members and external reviewers.
Last but not least, we would like to thank the sponsors for their strong support
of this conference, making it a big success. Special thanks go to the Chinese Uni-
versity of Hong Kong, the University of New South Wales, Macquarie University,
and the University of Sydney.
Finally, we wish to thank the APWeb Steering Committee, led by Xuemin
Lin, for offering us the opportunity to organize APWeb 2013 in Sydney. We also
wish to thank the host organization, the University of New South Wales, and
the Local Arrangements Committee and volunteers for their assistance in organizing
this conference.
Conference Co-chairs
Vijay Varadharajan Macquarie University, Australia
Jeffrey Xu Yu Chinese University of Hong Kong, China
Workshop Co-chairs
James Bailey University of Melbourne, Australia
Xiaochun Yang Northeastern University, China
Tutorial/Panel Co-chairs
Sanjay Chawla University of Sydney, Australia
Xiaofeng Meng Renmin University of China, China
Industrial Co-chairs
Marek Kowalkiewicz SAP Research in Brisbane, Australia
Mukesh Mohania IBM Research, India
Publication Co-chairs
Rui Zhang University of Melbourne, Australia
Wenjie Zhang University of New South Wales, Australia
Publicity Co-chairs
Alfredo Cuzzocrea University of Calabria, Italy
Jiaheng Lu Renmin University of China, China
Demo Co-chairs
Wook-Shin Han Kyungpook National University, Korea
Helen Huang University of Queensland, Australia
Webmasters
Yu Zheng East China Normal University, China
Chen Chen University of New South Wales, Australia
Program Committee
Toshiyuki Amagasa University of Tsukuba
Djamal Benslimane University of Lyon
Jae-Woo Chang Chonbuk National University
Haiming Chen Chinese Academy of Sciences
Jinchuan Chen Renmin University of China
David Cheung The University of Hong Kong
Bin Cui Beijing University
Alfredo Cuzzocrea ICAR-CNR & University of Calabria
Ting Deng Beihang University
Jianlin Feng Sun Yat-Sen University
Yaokai Feng Kyushu University
External Reviewers
Xuefei Li Mahmoud Barhamgi
Hongyun Cai Xian Li
Jingkuan Song Yu Jiang
Yang Yang Saurav Acharya
Xiaofeng Zhu Syed K. Tanbeer
Scott Bourne Hongda Ren
Yasser Salem Wei Shen
Shi Feng Zhenhua Song
Jianwei Zhang Jianhua Yin
Kenta Oku Liu Chen
Sukhwan Jung Wei Song
An Overview of Probabilistic Databases
Dan Suciu
University of Washington
[email protected]
http://homes.cs.washington.edu/~suciu/
References
1. Bacchus, F., Dalmao, S., Pitassi, T.: Algorithms and complexity results for #SAT and Bayesian inference. In: FOCS, pp. 340-351 (2003)
2. Birnbaum, E., Lozinskii, E.L.: The good old Davis-Putnam procedure helps counting models. J. Artif. Int. Res. 10(1), 457-477 (1999)
3. Davis, M., Logemann, G., Loveland, D.: A machine program for theorem-proving. Commun. ACM 5(7), 394-397 (1962)
4. Gomes, C.P., Sabharwal, A., Selman, B.: Model counting. In: Handbook of Satisfiability, pp. 633-654 (2009)
5. Jha, A.K., Suciu, D.: Knowledge compilation meets database theory: compiling queries to decision diagrams. In: ICDT, pp. 162-173 (2011)
6. Suciu, D., Olteanu, D., Re, C., Koch, C.: Probabilistic Databases. Synthesis Lectures on Data Management. Morgan & Claypool Publishers (2011)
7. Vardi, M.Y.: The complexity of relational query languages (extended abstract). In: STOC, pp. 137-146 (1982)
Challenges with Big Data on the Web
H.V. Jagadish
University of Michigan
[email protected]
References
[CCC12] Jagadish, H.V., et al.: Challenges and Opportunities with Big Data,
http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf
[SIGMOD12] Singh, M., Nandi, A., Jagadish, H.V.: Skimmer: rapid scrolling of rela-
tional query results. In: SIGMOD Conference, pp. 181-192 (2012)
Mark Sanderson
School of Computer Science and Information Technology
RMIT University
GPO Box 2476, Melbourne 3001
Victoria, Australia
Abstract. This year (2013) marks the 20th anniversary of JumpStation, the first
public web search engine, launched in late 1993. For those who were around
in those early days, it was becoming clear that an information provision and an
information access revolution was on its way, though very few, if any, would have
predicted the state of the information society we have today. It is perhaps worth
reflecting on what has been achieved in the field of information retrieval since
these systems were first created, and consider what remains to be accomplished.
It is perhaps easy to see the success of systems like Google and ask what else is
there to achieve? However, in some ways, Google has it easy. In this talk, I will
explain why Web search can be viewed as a relatively easy task and why other
forms of search are much harder to perform accurately.
Search engines require a great deal of tuning, currently achieved empirically.
The tuning carried out depends greatly on the types of queries submitted to a
search engine and the types of document collections the queries will search over.
It should be possible to study the population of queries and documents and
predictively configure a search engine. However, there is little understanding
in either the research or practitioner communities on how query and collection
properties map to search engine configurations. I will present some of the
early work we have conducted at RMIT to start charting the problems in this
particular space.
Another crucial challenge for search engine companies is how to ensure that
users are delivered the best quality content. There is a growth in systems that
recommend content based not only on queries, but also on user context. The
problem is that the quality of these systems is highly variable; one way of tackling
this problem is gathering context from a wider range of places. I will present some
of the possible new approaches to providing that context to search engines. Here
diverse social media, and advances in location technologies will be emphasized.
Finally, I will describe what I see as one of the more important challenges
that face the whole of the information community, namely the penetration of
computer systems to virtually every person on the planet and the challenges
that such an expansion presents.
Table of Contents
Tutorials
Understanding Short Texts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Haixun Wang
Distributed Processing:
A Simple XSLT Processor for Distributed XML . . . . . . . . . . . . . . . . . . . . . . 7
Hiroki Mizumoto and Nobutaka Suzuki
Graphs
GPU-Accelerated Bidirected De Bruijn Graph Construction
for Genome Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Mian Lu, Qiong Luo, Bingqiang Wang, Junkai Wu, and Jiuxin Zhao
Social Networks
Identification of Sybil Communities Generating Context-Aware Spam
on Online Social Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
Faraz Ahmed and Muhammad Abulaish
Location-Based Emerging Event Detection in Social Networks . . . . . . . . . 280
Sayan Unankard, Xue Li, and Mohamed A. Sharaf
Measuring Strength of Ties in Social Network . . . . . . . . . . . . . . . . . . . . . . . 292
Dakui Sheng, Tao Sun, Sheng Wang, Ziqi Wang, and Ming Zhang
Finding Diverse Friends in Social Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 301
Syed Khairuzzaman Tanbeer and Carson Kai-Sang Leung
Social Network User Influence Dynamics Prediction . . . . . . . . . . . . . . . . . . 310
Jingxuan Li, Wei Peng, Tao Li, and Tong Sun
Credibility-Based Twitter Social Network Analysis . . . . . . . . . . . . . . . . . . . 323
Jebrin Al-Sharawneh, Suku Sinnappan, and Mary-Anne Williams
Design and Evaluation of Access Control Model Based on Classification
of Users Network Behaviors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
Peipeng Liu, Jinqiao Shi, Fei Xu, Lihong Wang, and Li Guo
Two Phase Extraction Method for Extracting Real Life Tweets Using
LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
Shuhei Yamamoto and Tetsuji Satoh
Probabilistic Queries
A Probabilistic Model for Diversifying Recommendation Lists . . . . . . . . . 348
Yutaka Kabutoya, Tomoharu Iwata, Hiroyuki Toda, and
Hiroyuki Kitagawa
Spatial-Temporal Databases
FIMO: A Novel WiFi Localization Method . . . . . . . . . . . . . . . . . . . . . . . . . . 437
Yao Zhou, Leilei Jin, Cheqing Jin, and Aoying Zhou
Performance
S2MART: Smart Sql to Map-Reduce Translators . . . . . . . . . . . . . . . . . . . . . 571
Narayan Gowraj, Prasanna Venkatesh Ravi, Mouniga V, and
M.R. Sumalatha
Towards a Novel and Timely Search and Discovery System Using the
Real-Time Social Web. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 746
Owen Phelan, Kevin McCarthy, and Barry Smyth
Haixun Wang
1 Speakers
Haixun Wang is a senior researcher at Microsoft Research Asia in Beijing, China,
where he manages the Data Management, Analytics, and Services group. Before
joining Microsoft, he had been a research staff member at IBM T. J. Watson
Research Center for 9 years. He was Technical Assistant to Stuart Feldman (Vice
President of Computer Science of IBM Research) from 2006 to 2007, and Tech-
nical Assistant to Mark Wegman (Head of Computer Science of IBM Research)
from 2007 to 2009. Haixun Wang has published more than 120 research papers
in refereed international journals and conference proceedings. He is on the edi-
torial board of Distributed and Parallel Databases (DAPD), IEEE Transactions
on Knowledge and Data Engineering (TKDE), Knowledge and Information Sys-
tem (KAIS), Journal of Computer Science and Technology (JCST). He is PC
co-Chair of WWW 2013 (P&E), ICDE 2013 (Industry), CIKM 2012, ICMLA
2011, and WAIM 2011. Haixun Wang received the ER 2008 Conference Best Paper Award
(DKE 25-Year Award) and the ICDM 2009 Best Student Paper runner-up award.
Lei Chen
1 Speakers
Lei Chen received the BS degree in Computer Science and Engineering from
Tianjin University, Tianjin, China, in 1994, the MA degree from Asian Institute
of Technology, Bangkok, Thailand, in 1997, and the PhD degree in computer
science from the University of Waterloo, Waterloo, Ontario, Canada, in 2005.
He is currently an Associate Professor in the Department of Computer Science
and Engineering, Hong Kong University of Science and Technology. His research
interests include crowdsourcing on social media, social media analysis, proba-
bilistic and uncertain databases, and privacy-preserving data publishing. So far,
he has published nearly 200 conference and journal papers. He received the Best
Paper Awards at DASFAA 2009 and 2010. He has been PC Track Chair for VLDB
2014, ICDE 2012, CIKM 2012, and SIGMM 2011. He has served as a PC member
for SIGMOD, VLDB, ICDE, SIGMM, and WWW. Currently, he serves as an
Associate Editor for IEEE Transactions on Knowledge and Data Engineering and
Distributed and Parallel Databases. He is a member of the ACM and the chairman
of ACM Hong Kong Chapter.
1 Tutorial Overview
Exploratory keyword-style search has been heavily studied in the past decade,
both in the context of structured [18] and semi-structured [9] data. Given the
ubiquity of massive (loosely structured) graph data in domains such as the web,
social networks, biological networks, and linked open data (to name a few), there
recently has been a surge of interest and advances on the problem of search in
graphs (e.g., [1, 3, 12, 15, 17, 19, 20]). As graph exploration leads to deeper
domain understanding, user queries begin to shift from unstructured searching
to richer structure-based exploration of the graph. Consequently, there has been
a flurry of language proposals specifically targeting this style of structure-aware
querying in graphs (e.g., [2, 3, 5, 13, 16]).
In this tutorial, we survey this growing body of work, with an eye towards
both bringing participants up to speed in this field of rapid progress and delim-
iting the boundaries of the state-of-the-art. A particular focus will be on recent
results in the theory of graph languages on the design and structural charac-
terization of simple yet powerful algebraic languages for graph search, which
bridge structure-oblivious and structure-aware graph exploration [4-6]. At the
heart of these results is the methodology of coupling the expressive power of a
given query language with an appropriate structural notion on data instances.
Here, the idea is to characterize language equivalence of data objects in in-
stances (i.e., the inability of queries in the language to distinguish the objects)
purely in terms of the structure of the instance (i.e., equivalence under no-
tions such as homomorphism or bisimilarity). Recently, first steps towards graph
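As a concrete illustration of the bisimilarity notion used above, the following small sketch (ours, not taken from the tutorial material; node names and labels are made up) computes a bisimulation partition of a node-labelled directed graph by naive partition refinement. Nodes that end up in the same block cannot be distinguished by label tests and edge navigation alone.

# Toy sketch (ours) of naive bisimulation partition refinement.
def bisimulation_blocks(labels, edges):
    succ = {v: [w for (x, w) in edges if x == v] for v in labels}
    block = {v: labels[v] for v in labels}        # initial partition: by label
    while True:
        # a node's refined block = its label plus the set of its successors' blocks
        new_block = {v: (labels[v], frozenset(block[w] for w in succ[v]))
                     for v in labels}
        stable = all((new_block[u] == new_block[v]) == (block[u] == block[v])
                     for u in labels for v in labels)
        if stable:
            return block
        block = new_block

labels = {1: "a", 2: "b", 3: "b", 4: "c", 5: "c"}
edges = [(1, 2), (1, 3), (2, 4), (3, 5)]
blocks = bisimulation_blocks(labels, edges)
print(blocks[2] == blocks[3], blocks[4] == blocks[5])   # True True: 2~3 and 4~5

Practical systems use far more efficient partition-refinement and external-memory algorithms (e.g., [8,11]); this toy version only illustrates the fixpoint idea.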
2 Tutorial Outline
3 Speakers
Yuqing Wu and George Fletcher, together with a group of collaborators in the
USA, the Netherlands, and Belgium, have been conducting research in this area
in recent years and have published several papers in both the theory and engi-
neering branches of database research.
References
1. Delbru, R., Campinas, S., Tummarello, G.: Searching web data: An entity retrieval and high-performance indexing model. J. Web Sem. 10, 33-58 (2012)
2. Fazzinga, B., Gianforme, G., Gottlob, G., Lukasiewicz, T.: Semantic web search based on ontological conjunctive queries. J. Web Sem. 9(4), 453-473 (2011)
3. Fletcher, G.H.L., Van den Bussche, J., Van Gucht, D., Vansummeren, S.: Towards a theory of search queries. ACM Trans. Database Syst. 35(4), 28 (2010)
4. Fletcher, G.H.L., Gyssens, M., Leinders, D., Van den Bussche, J., Van Gucht, D., Vansummeren, S.: Similarity and bisimilarity notions appropriate for characterizing indistinguishability in fragments of the calculus of relations. CoRR, abs/1210.2688 (2012)
5. Fletcher, G.H.L., Gyssens, M., Leinders, D., Van den Bussche, J., Van Gucht, D., Vansummeren, S., Wu, Y.: Relative expressive power of navigational querying on graphs. In: Proc. ICDT, Uppsala, Sweden, pp. 197-207 (2011)
6. Fletcher, G.H.L., Gyssens, M., Leinders, D., Van den Bussche, J., Van Gucht, D., Vansummeren, S., Wu, Y.: The impact of transitive closure on the boolean expressiveness of navigational query languages on graphs. In: Lukasiewicz, T., Sali, A. (eds.) FoIKS 2012. LNCS, vol. 7153, pp. 124-143. Springer, Heidelberg (2012)
7. Fletcher, G.H.L., Hidders, J., Vansummeren, S., Picalausa, F., Luo, Y., De Bra, P.: On guarded simulations and acyclic first-order languages. In: Proc. DBPL, Seattle, WA, USA (2011)
8. Hellings, J., Fletcher, G.H.L., Haverkort, H.: Efficient external-memory bisimulation on DAGs. In: Proc. ACM SIGMOD, Scottsdale, AZ, USA, pp. 553-564 (2012)
9. Liu, Z., Chen, Y.: Processing keyword search on XML: a survey. World Wide Web 14(5-6), 671-707 (2011)
10. Luo, Y., de Lange, Y., Fletcher, G.H.L., De Bra, P., Hidders, J., Wu, Y.: Bisimulation reduction of big graphs on MapReduce (manuscript in preparation, 2013)
11. Luo, Y., Fletcher, G.H.L., Hidders, J., Wu, Y., De Bra, P.: I/O-efficient algorithms for localized bisimulation partition construction and maintenance on massive graphs. CoRR, abs/1210.0748 (2012)
12. Mass, Y., Sagiv, Y.: Language models for keyword search over data graphs. In: Proc. ACM WSDM, Seattle, Washington, USA (2012)
13. Perez, J., Arenas, M., Gutierrez, C.: nSPARQL: A navigational language for RDF. J. Web Sem. 8(4), 255-270 (2010)
14. Picalausa, F., Luo, Y., Fletcher, G.H.L., Hidders, J., Vansummeren, S.: A structural approach to indexing triples. In: Simperl, E., Cimiano, P., Polleres, A., Corcho, O., Presutti, V. (eds.) ESWC 2012. LNCS, vol. 7295, pp. 406-421. Springer, Heidelberg (2012)
15. Tran, T., Herzig, D.M., Ladwig, G.: SemSearchPro - using semantics throughout the search process. J. Web Sem. 9(4), 349-364 (2011)
16. Tran, T., Wang, H., Rudolph, S., Cimiano, P.: Top-k exploration of query candidates for efficient keyword search on graph-shaped (RDF) data. In: Proc. IEEE ICDE, Shanghai, pp. 405-416 (2009)
17. Wu, Y., Van Gucht, D., Gyssens, M., Paredaens, J.: A study of a positive fragment of path queries: Expressiveness, normal form and minimization. Comput. J. 54(7), 1091-1118 (2011)
18. Yu, J.X., Qin, L., Chang, L.: Keyword search in relational databases: A survey. IEEE Data Eng. Bull. 33(1), 67-78 (2010)
19. Zhou, M., Pan, Y., Wu, Y.: Conkar: constraint keyword-based association discovery. In: Proc. ACM CIKM, Glasgow, UK, pp. 2553-2556 (2011)
20. Zhou, M., Pan, Y., Wu, Y.: Efficient association discovery with keyword-based constraints on large graph data. In: Proc. ACM CIKM, Glasgow, UK, pp. 2441-2444 (2011)
A Simple XSLT Processor for Distributed XML
University of Tsukuba
1-2, Kasuga, Tsukuba Ibaraki 305-8550, Japan
{s0911654@u,nsuzuki@slis}.tsukuba.ac.jp
1 Introduction
XML has become a de facto standard format on the Web, and the sizes of XML doc-
uments have been increasing rapidly. Recently, due to geographical and admin-
istrative reasons, an XML document is often partitioned into fragments that are managed
separately at multiple sites. Such a form of XML document is called distributed
XML [5,2,1]. For example, Figures 1 and 2 show a distributed XML document
of an auction site. In this example, one XML document is partitioned into four
fragments F0 , F1 , F2 , and F3 , and fragment F1 is stored in site S1 , fragment F2
is stored in site S2 , and so on. We say that S1 and S3 are the child sites of S0
(S0 is the parent site of S1 and S3 ).
In this paper, we consider XSLT transformation for distributed XML docu-
ments. The usual centralized approach for performing an XSLT transformation
of a distributed XML document is to send all the fragments to a specific site,
merge all the fragments into one XML document, and then perform an XSLT
transformation on the merged document. However, this approach is inefficient
for the following reasons. First, in this approach the XSLT transformation
processing is not load-balanced. Second, an XSLT transformation becomes in-
efficient if the size of the target XML document is large [12]. This implies that
the centralized approach is inefficient even if the size of each XML fragment is
small, whenever the merged document is large.
In this paper, we propose a method for performing XSLT transformation ef-
ficiently for distributed XML documents. In our method, all the sites having an
XML fragment perform an XSLT transformation in parallel. This avoids cen-
tralized XSLT transformation for distributed XML documents and leads to an
Related Work
A distribution design for XML documents was first proposed in [3]. There have
been several studies on evaluations of XPath and other languages for distributed
XML. [4,5] propose efficient XPath evaluation algorithms for distributed
XML. [7] proposes a method for evaluating XQ, a subset of XPath, for vertically
partitioned XML documents. [10] considers regular path query evaluation in
a distributed environment. [11] extensively studies the complexities of regular
path query and structural recursion over distributed semistructured data. Be-
sides query languages, [2,1] study the complexities of schema design problems
for distributed XML. To the best of the authors' knowledge, there is no study
on XSLT evaluation for distributed XML.
2 Definitions
Since our method is based on unranked top-down tree transducers, we first give the
related definitions. Let Σ be a set of labels. By TΣ we mean the set of unranked
Σ-trees. A tree whose root is labeled with a and has n subtrees t1, . . . , tn is
denoted by a(t1 · · · tn). In the following, we always mean a Σ-tree whenever we say
tree. A hedge is a finite sequence of trees. The set of hedges is denoted by HΣ.
In the following, we use t, t1, t2, . . . to denote trees and h, h1, h2, . . . to denote
hedges. For a set Q, by HΣ(Q) (TΣ(Q)) we mean the set of Σ-hedges (resp.,
Σ-trees) such that leaf nodes can be labeled with elements from Q.
A tree transducer is a quadruple (Q, Σ, q0, R), where Q is a finite set of states,
Σ is the set of labels, q0 ∈ Q is the initial state, and R is a finite set of rules of
the form (q, a) → h, where a ∈ Σ, q ∈ Q and h ∈ HΣ(Q). If q = q0, then h is
restricted to TΣ(Q) \ Q. A state corresponds to a mode of XSLT.
The translation defined by a tree transducer Tr = (Q, Σ, q0, R) on a tree t in
state q, denoted by Trq(t), is inductively defined as follows.
For example, consider the tree transducer Tr with
Q = {p, q},
Σ = {a, b, x, y, z},
R = {(p, a) → x(z), (q, a) → z(q), (p, b) → x(p q), (q, b) → y(p)}.
Tr corresponds to the XSLT script shown in Fig. 3. For example, consider the
rule (p, a) → x(z) in R. This corresponds to the first template in Fig. 3. The
mode p in the left-hand side of the rule corresponds to the mode of the template,
and the label a in the left-hand side of the rule corresponds to the match attribute.
The tree t in Fig. 4 is transformed to Tr(t) in Fig. 5.
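To make the translation concrete, the following sketch (ours, not the authors' implementation) encodes the transducer Tr above in Python. Trees are nested (label, children) tuples; a state occurring as a leaf in a rule's right-hand side marks where the translations of the current node's children are spliced in, and a node with no applicable rule translates to the empty hedge.

# Sketch (ours) of the unranked top-down tree transducer Tr defined above.
RULES = {
    ('p', 'a'): ('x', [('z', [])]),
    ('q', 'a'): ('z', ['q']),            # a bare string denotes a state leaf
    ('p', 'b'): ('x', ['p', 'q']),
    ('q', 'b'): ('y', ['p']),
}
STATES = {'p', 'q'}
INITIAL = 'p'

def transform(state, tree):
    """Return the hedge (list of trees) produced for `tree` in `state`."""
    label, children = tree
    rhs = RULES.get((state, label))
    if rhs is None:                      # no applicable rule: empty hedge
        return []
    return expand(rhs, children)

def expand(node, children):
    if node in STATES:                   # state leaf: splice in children's translations
        return [t for c in children for t in transform(node, c)]
    label, subnodes = node
    return [(label, [t for n in subnodes for t in expand(n, children)])]

t = ('b', [('a', []), ('a', [])])        # t = b(a a)
print(transform(INITIAL, t))             # -> x(x(z) x(z) z z), as nested tuples

For example, b(a a) is translated to x(x(z) x(z) z z): the rule (p, b) → x(p q) applies both modes p and q to each child a, mirroring how an XSLT template applies modes to child nodes.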
Fig. 6. Fragment F0
Distributed XML
In this paper, we consider settings in which an XML tree t is partitioned into a
set Ft of disjoint subtrees of t, where each subtree is called a fragment. We assume
that each fragment Fi ∈ Ft is stored in a distinct site. We allow arbitrary
nesting of fragments. Thus, fragments can appear at any level of the tree.
For example, the XML tree t ∈ TΣ in Fig. 1 is partitioned into four fragments,
Ft = {F0, F1, F2, F3}. For a tree t, the fragment containing the root node of t is
called the root fragment. In Fig. 2, the root fragment is F0.
For two fragments Fi and Fj , we say that Fj is a child fragment of Fi if
the root node of Fj corresponds to a leaf node v of Fi . In order to represent a
connection between Fi and Fj , we use an EntityReference node at the position
of v, which refers to the root node of Fj. For example, in Fig. 6 EntityReference
nodes &fragment1 and &fragment3 are inserted into F0 at the positions of
regions and people, respectively.
Each fragment is stored in a site. The site having the root fragment is called
root site and the other sites are called slave sites. For example, in Fig. 2 S0 is
the root site and S1 , S2 , S3 are slave sites. We assume that no two fragments are
stored in the same site. If the fragment in site S has a child fragment stored in
S′, then S′ is a child site of S (S is the parent site of S′). For example, in Fig. 2
S0 has two child sites S1 and S3 .
3 Transformation Method
3.1 Outline
In our transformation method, all the sites S transform the fragment stored in
S in parallel, in order to avoid transformation processes being centralized on a
specific site. To realize this strategy, we have a problem to solve. For a fragment
F , the template that should be applied to F cannot be determined until the
transformation of the parent fragment of F is completed. For example, consider
the right child a of the root in Fig. 4. Since this node is labeled with a, we have
two rules that can be applied to the node, (p, a) → x(z) and (q, a) → z(q),
and we cannot determine which of the rules should be applied to this node
until the transformation of the root node is completed. Thus, for a fragment
F and the rules r1, . . . , rn that can be applied to F, our method proceeds with the
transformation of F as follows.
We now present the details of our method. We use two XSLT processors, Master-
XSLT and Slave-XSLT. First, Master-XSLT is used in the root site and has the
following functions.
Second, Slave-XSLT is used in a slave site and has the following functions.
Let us first present Master-XSLT. This procedure first sends a tree transducer
(XSLT stylesheet) to each slave site (lines 1 to 3), then transforms the root frag-
ment by procedure Transform shown later (line 5), and receives the transformed
fragments from the slave sites (line 6), and finally merges the fragments (line 7).
Master-XSLT
Input: A tree transducer Tr = (Q, Σ, q0, R) and the root fragment F.
Output: A tree T ∈ TΣ.
Procedure Transform
Input: A tree transducer Tr = (Q, Σ, q0, R), a fragment F, the context node v
of F , and a mode q.
Output: A transformed fragment of F in mode q.
1. if v is an EntityReference node then
2.   By referring to the parent node of v, identify the mode q′ that should
     be applied to F(v).
3.   Send the mode q′ to S(v).
4. else if (q, v) → h ∈ R for some h then
5.   Q′ ← {q′ | q′ is a leaf node of h} ∩ Q;
6.   if v has child nodes v1, . . . , vk then
7.     for each q′ ∈ Q′ do
8.       for each child node vi ∈ {v1, . . . , vk} of v do
9.         Fi ← the subtree rooted at vi of F;
10.        Ti ← Transform(Tr, Fi, vi, q′);
11.      end
12.      Replace node q′ in h with the hedge T1 · · · Tk.
13.    end
14.  else
15.    for each q′ ∈ Q′ do
16.      Replace node q′ in h with the empty hedge ε;
17.    end
18.  end
19.  Replace the subtree rooted at v of F with h.
20. else
21.  Replace the subtree rooted at v of F with ε.
22. end
23. Return F;
Finally, we present Slave-XSLT. This procedure runs in each slave site S and
transforms the fragment F stored in S. This procedure starts when it receives
a tree transducer from the root site, and then transforms F by using the rules
14 H. Mizumoto and N. Suzuki
that can be applied to F in parallel, using threads (lines 3 to 8). Thus, more than
one transformation result may be obtained. Then the procedure waits until the
mode(s) p that should be applied to F is received from the parent site (line 9),
and sends the fragment(s) transformed in mode(s) p to the root site (line 11).
Note that if F contains EntityReference nodes v, we have to tell the appropriate
modes of v to child sites S(v). This is done in lines 12 to 15.
Procedure Slave-XSLT
Input: A fragment F stored in its own site.
Output: none
1. Wait until the tree transducer Tr = (Q, Σ, q0, R) is received from the root site.
2. vr ← the root node of F;
3. Modes ← {q | (q, vr) → h ∈ R for some h};
4. for each q ∈ Modes do
5.   Thread start
6.     Tq ← Transform(Tr, F, vr, q);
7.   Thread end
8. end
9. Wait until the mode(s) p is received from the parent site.
10. if Tp ≠ null then
11.   Send Tp to the root site.
12.   for each EntityReference node v in F do
13.     Identify the mode q of v in Tp.
14.     Send q to S(v).
15.   end
16. else
17.   Send ε to the root site.
18. end
For example, let us consider the distributed XML document consisting of two
fragments m and f in Fig. 7. Let Tr be the tree transducer defined in Example 1.
Moreover, let Sr be the root site storing m and Ss be the slave site storing f .
Master-XSLT running in Sr sends Tr to site S(&fragment) = Ss and starts the
transformation of m with the initial mode p (Fig. 8 shows the transfor-
mation result). Slave-XSLT running in Ss receives Tr and starts to transform
f. In this case, two rules (p, a) → x(z) and (q, a) → z(q) can be applied to
the root node a of f. Thus, Slave-XSLT applies the two rules to f in parallel,
and the obtained results are shown in Fig. 9. When Master-XSLT encounters
the EntityReference node &fragment of m, the procedure sends the appro-
priate mode(s) to S(&fragment). In this case, both modes p and q are sent to
S(&fragment). After Slave-XSLT completes the transformation of f, the pro-
cedure sends the transformed fragments. In this case, the two fragments shown
in Fig. 9 are sent to the root site Sr. Finally, Sr merges the three fragments and
we obtain the output tree in Fig. 10.
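The speculative behaviour of Slave-XSLT can be sketched in a few lines (our illustration, reusing transform() and RULES from the transducer sketch above; the channel objects and the toy fragment are made up). Every candidate mode is evaluated in its own thread, and only the result for the mode eventually announced by the parent site is shipped to the root site.

# Sketch (ours) of slave-site speculation: transform under every candidate mode
# in parallel, then ship only the result for the mode the parent announces.
import threading, queue

def slave_site(fragment, rules, mode_channel, result_channel):
    root_label = fragment[0]
    candidates = [q for (q, a) in rules if a == root_label]   # applicable modes
    results = {}

    def run(q):                                  # one speculative thread per mode
        results[q] = transform(q, fragment)

    threads = [threading.Thread(target=run, args=(q,)) for q in candidates]
    for t in threads: t.start()
    for t in threads: t.join()

    p = mode_channel.get()                       # wait for the parent's mode
    result_channel.put(results.get(p, []))       # [] plays the role of ε

f = ('a', [('b', [])])                           # toy fragment f = a(b)
modes, out = queue.Queue(), queue.Queue()
modes.put('q')                                   # the parent announces mode q
slave_site(f, RULES, modes, out)
print(out.get())                                 # -> [('z', [('y', [])])]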
4 Evaluation Experiment
Fragment sizes for each total document size:

Total size   F0      F1      F2      F3
50MB         23MB    25MB    5.5MB   2.6MB
100MB        46MB    50MB    11MB    5.1MB
150MB        69MB    75MB    17MB    7.6MB
200MB        92MB    100MB   22MB    11MB
250MB        115MB   125MB   28MB    13MB
The results are shown in Figs. 11 to 13. As shown in these figures, our method
is about four times faster than the centralized method, regardless of the
stylesheet used. This suggests that our method works well for distributed XML doc-
uments.
5 Conclusion
In this paper, we proposed a method for performing XSLT transformation for
distributed XML documents. The experimental results suggest that our method
works well for distributed XML documents.
However, we have a lot of future work to do. First, in this paper the expressive
power of XSLT is restricted to unranked top-down tree transducer. We have
investigated XSLT elements and functions, and we have found that about half
of the elements/functions can easily be incorporated into our method (Type
A of Table 2). The elements/functions f of this type can be calculated within
the fragment in which f is used, e.g., xslt:text. On the other hand, it seems
difficult to incorporate the remaining elements/functions into our method (Type B of
Table 2). An example element of this type is xslt:for-each, which accesses
several fragments beyond the fragment in which the xslt:for-each element is
used. Thus, we have to handle XSLT elements/functions of Type B carefully in
order to extend the expressive power of our method. Another item of future work
relates to experimentation: in our experiments we used only three synthetic XSLT
stylesheets, so we need to conduct further experiments using real-world XSLT
stylesheets.
          elements   functions
Type A    18         18
Type B    17         16
Total     35         34
References
1. Abiteboul, S., Gottlob, G., Manna, M.: Distributed XML design. J. Comput. Syst. Sci. 77(6), 936-964 (2011)
2. Abiteboul, S., Gottlob, G., Manna, M.: Distributed XML design. In: Proc. PODS 2009, the Twenty-Eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 247-258 (2009)
3. Bremer, J.M., Gertz, M.: On distributing XML repositories. In: Proc. of WebDB, pp. 73-78 (2003)
4. Buneman, P., Cong, G., Fan, W., Kementsietsidis, A.: Using partial evaluation in distributed query evaluation. In: Proc. VLDB 2006 (2006)
5. Cong, G., Fan, W., Kementsietsidis, A.: Distributed Query Evaluation with Performance Guarantees. In: Proc. SIGMOD 2007, the 2007 ACM SIGMOD International Conference on Management of Data, pp. 509-520 (2007)
6. Kepser, S.: A simple proof for the Turing-completeness of XSLT and XQuery. In: Extreme Markup Languages (2004)
7. Kling, P., Özsu, M.T., Daudjee, K.: Generating efficient execution plans for vertically partitioned XML databases. Proc. VLDB Endow. 4(1), 1-11 (2010)
8. Martens, W., Neven, F.: Typechecking Top-Down Uniform Unranked Tree Transducers. In: Calvanese, D., Lenzerini, M., Motwani, R. (eds.) ICDT 2003. LNCS, vol. 2572, pp. 64-78. Springer, Heidelberg (2002)
9. Schmidt, A., Waas, F., Kersten, M., Carey, M.J., Manolescu, I., Busse, R.: XMark: a benchmark for XML data management. In: Proc. VLDB 2002, the 28th International Conference on Very Large Data Bases, pp. 974-985 (2002)
10. Stefanescu, D.C., Thomo, A., Thomo, L.: Distributed evaluation of generalized path queries. In: Proc. of the 2005 ACM Symposium on Applied Computing, SAC 2005, pp. 610-616 (2005)
11. Suciu, D.: Distributed query evaluation on semistructured data. ACM Trans. Database Syst. 27(1), 1-62 (2002)
12. Zavoral, F., Dvorakova, J.: Performance of XSLT processors on large data sets. In: Proc. Applications of Digital Information and Web Technologies, pp. 110-115 (2009)
Ontology Usage Network Analysis Framework
1 Introduction
A decade-long effort by the Semantic Web community regarding knowledge
representation and assimilation has resulted in the development of methodolo-
gies, tools and technologies to develop and manage ontologies [1]. As a result,
numerous domain ontologies have been developed to describe the information
pertaining to different domains such as Healthcare and Life Science (HCLS),
entertainment, financial services and eCommerce. Consequently, we now have
billions of triples [2] on the Web in different domains, annotated by using dif-
ferent domain-specific ontologies [3] to provide structured and semantically rich
data on the Web.
In our previous work [4], the Ontology Usage Analysis Framework (OUSAF) was
presented in order to analyse the use of ontologies on the Web. The OUSAF
framework comprises four phases, namely identification, investigation, rep-
resentation and utilization. The identification phase, which is the focus of this
paper, is responsible for identifying the different ontologies that are being used in a
particular application area or in a given dataset for further analysis. A few of
the common requirements which form the selection criteria for the identification
of ontologies in this scenario are:
1. What are the widely used ontologies in the given application?
2. What ontologies are more interlinked with other ontologies to describe
domain-specific entities?
3. What ontologies are used more frequently and what is their usage percentage
based on the given dataset?
4. Which ontology clusters form cohesive groups?
In order to establish a better understanding of ontology usage and to identify
the ontologies, based on the abovementioned criteria, this paper proposes the
Ontology Usage Network Analysis Framework (OUN-AF) that models the use of
ontologies by different data sources using an Ontology Usage Network (OUN).
The OUN represents the relationships between ontologies and data sources based
on the actual usage data available on the Web in the form of a graph-based
relationship structure. This structure is then analysed using metrics to study
the structural, topological and functional characteristics of the OUN by applying
Social Network Analysis (SNA) techniques.
2 Background
In [5], the authors analyzed the social and structural relationships avail-
able on the Semantic Web, focusing mainly on the FOAF and DC vocabularies. The
study was performed on approximately 1.5 million FOAF documents to analyze
instance data available on the Web and their usefulness in understanding so-
cial structures and networks. They identified the graphical patterns emerging in
social networks and represented these using the FOAF vocabulary and the de-
gree distribution of the network. While this research provides a detailed analysis of
Semantic Web data by focusing on a specific vocabulary, it does not provide a
framework or methodology that makes it applicable to different vocabularies.
Other works in the literature have analyzed different numbers of ontologies
using SNA. For example, [6] covered five ontologies, [7] analyzed 250 ontologies
and [8] used approximately 3,000 vocabularies. In all these works, ontologies were
investigated to measure their structural properties, the distribution of different
measures and the terminological knowledge encoded in ontologies, but none considers
how they are being used on the Web. The use of SNA and its techniques to anal-
yse the use of ontologies and measure the relationships based on usage has only
been applied marginally. In the identification phase of the OUSAF framework,
the ontologies and their use by different data sources are represented in a way
that allows the affiliation between ontologies and different data publishers to
be measured.
Fig. 1. (a) Affiliation matrix of an author-paper affiliation network, and (b) an example
of an author-paper affiliation network
The objective of ontology identification is to identify the use of different on-
tologies by different data publishers to discover hidden usage patterns. Therefore,
in order to enable such analysis, the Ontology Usage Network Analysis Framework
(OUN-AF) is proposed, as shown in Fig. 2. OUN-AF comprises three phases,
namely Input phase, Computation phase and Analysis phase.
The input phase is responsible for managing the dataset. The two key compo-
nents in this phase are a crawler and an RDF triple store. In order to point the
crawler to relevant and interesting data sources, the bootstrapping process first
builds the seed URIs as multiple starting points. A list of seed URIs is obtained
by accessing semantic search engines which return the URIs (URLs) of the data
sources (web sites) with structured data. The crawler collects the Semantic Web
data (RDF data) and after preprocessing it, loads it to the RDF triple store
(database). From a data management point of view, since RDF data comprises
Fig. 2. Ontology Usage Network Analysis Framework (OUN-AF) and its phases
An ontology set is defined as the set O which represents the nodes of the
first mode of the affiliation network. An ontology set O contains the list of
ontologies used on the Web-of-Data such that there is a triple t = <s, p, o>
anywhere in the dataset (specifically, or in general on the Web-of-
Data) where o ∈ O is the URI of either p or o.
A data-source set is defined as the set D which has the list of hostnames
on the Web-of-Data such that there exists a triple t = <s, p, o> in the dataset
(specifically, or in general on the Web-of-Data), where s is the host-
name and either p or o is a member of O.
The Ontology Usage Network (OUN) is a bipartite graph, denoted as
OUN(O, D), that represents the affiliation network, with a set of ontologies
O on one side and a set of data sources D on the other; an edge (o, d)
represents the fact that o is used by d.
In order to infer the connectedness present within one set of nodes based on their
co-participation in the other set of nodes, the OUN is transformed from a two-mode
network to a one-mode network using the process of projection. Projection is
used to generate two one-mode graphs: one for the nodes in the ontology set, known
as the Ontology Co-Usability network, and a second for the data-source set, known
as the Data-Source Collaboration network, as shown in Fig. 2.
The Ontology Co-Usability network is an ontology-to-ontology network,
in which two nodes are connected if both of the ontologies are being used by
the same data source. This means that the Ontology Co-Usability network rep-
resents the connectedness of an ontology with other ontologies, based on their
co-membership in the data source.
The Data-Source Collaboration network is a data-source-to-data-source
network in which two nodes are connected if both of them have used the same on-
tology to describe their data. The Data-Source Collaboration network represents
the similarity of data-sources in terms of their need to semantically describe the
information on the Web.
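A minimal sketch (ours, with made-up ontology and data-source names) shows the two-mode network and its projection onto the ontology side; the dual projection, grouping by ontology instead of data source, yields the Data-Source Collaboration network in the same way.

# Sketch (ours): OUN edges and the Ontology Co-Usability projection.
from itertools import combinations
from collections import defaultdict

usage = [("foaf", "ds1"), ("gr", "ds1"), ("foaf", "ds2"),
         ("vcard", "ds2"), ("gr", "ds3"), ("vcard", "ds3")]   # (ontology, data source)

by_source = defaultdict(set)              # data source -> ontologies it uses
for o, d in usage:
    by_source[d].add(o)

co_usability = defaultdict(int)           # ontology pair -> number of shared sources
for ontologies in by_source.values():
    for o1, o2 in combinations(sorted(ontologies), 2):
        co_usability[(o1, o2)] += 1

print(dict(co_usability))
# {('foaf', 'gr'): 1, ('foaf', 'vcard'): 1, ('gr', 'vcard'): 1}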
4.2 Semanticity
Semanticity distribution identifies the fraction of the data sources in the net-
work which have a degree k. Semanticity measures the participation of different
ontologies in a given data source: the more ontologies a data source uses, the
higher its semanticity value. Semanticity is measured by calculat-
ing the degree centrality and degree distribution on the second set of nodes in
an affiliation network, which is the set representing the data sources present in
the dataset. The degree centrality CD(dsi) of a data source dsi is measured as:
C_D(ds_i) = d(ds_i) = Σ_{j=1}^{n_2} A_{ij}    (3)
where i = 1, . . . , n2, n2 = |D|, d(dsi) is the degree of the i-th data source dsi, and A
is the affiliation matrix representing the OUN.
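For illustration, a small numpy sketch (ours; the matrix is a toy, not the paper's dataset) computes Eq. (3) and the corresponding semanticity distribution:

import numpy as np

# Toy affiliation matrix: rows = data sources, columns = ontologies,
# A[i, j] = 1 if data source i uses ontology j.
A = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 0, 1]])

semanticity = A.sum(axis=1)               # C_D(ds_i) as in Eq. (3)
print(semanticity)                        # [2 2 1]

# fraction of data sources with degree k (the semanticity distribution)
k, counts = np.unique(semanticity, return_counts=True)
print(dict(zip(k.tolist(), (counts / len(semanticity)).round(2).tolist())))
# {1: 0.33, 2: 0.67}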
where d(v, u) denotes the length of a shortest-path between v and u. The close-
ness centrality of v is the inverse of the average (shortest path) distance from
v to any other vertex in the graph. The higher the cv, the shorter the average
distance from v to other vertices, and the more important v is by this measure.
In an n-clique, it is not required that each member of the clique has a direct tie with the others, but
instead that each member is no more than distance n from the others. Formally,
a clique is a maximal set of actors who have all possible ties present
among them.
The details of the dataset used to populate the OUN, and the analysis performed
using the metrics defined above, are discussed as follows.
In order to build a dataset which has a fair representation of the Semantic Web
data described using domain ontologies, semantic search engines such as Sindice
and Swoogle are used to build the seed URLs. These seed URLs are then used
to crawl the structured data published on the Web using ontologies. The dataset
built for the identification phase comprises 22.3 million triples, collected from
211 data sources¹. In this dataset, 44 namespaces are used to describe entities
semantically.
The resulting Ontology Usage Network comprises 1,390 edges linking 44
ontologies to 211 data sources. In terms of generic OUN properties, the density
of the network is 0.149 and the average degree is 10.90. The average degree shows
that the network is neither too sparse nor too dense, which is a common pattern
in information networks. Details on the other properties and metrics are given
in the following subsections.
Through OUD, we would like to determine how the use of an ontology is dis-
tributed over the data sources in the dataset. Using Eq. 2, Fig. 3(a) shows the
percentage of the ontologies being used by a number of different data sources.
The relative frequency of OUD on the dataset shows that there is both extreme
and average ontology usage by data sources. It also shows that 13.6% of on-
tologies are not used by any of the data sources and approximately half of the
ontologies are exclusively used by the data sources. The second row of the Fig.
3(a) shows that 47.7% of ontologies (21 ontologies) are being used by a data
source that has not used any other ontology. This means that there are several
ontologies in the dataset which either conceptualize a very specialized domain,
restricting their reusability, or are of a proprietary nature. From the third row
of Fig. 3(a) onwards, there is an increase in the reusability factor of ontologies.
This is because an increasingly large number of data sources are using them. The
last row shows that 4.5% (2 ontologies) of ontologies are being used equally by
¹ https://docs.google.com/spreadsheet/ccc?key=0AqjAK1TTtaSZdGpIMkVQUTRNenlrTGctR2J1bkl6WEE
208 data sources. Through this analysis, we can see that only a few ontologies
are not used at all and a few have almost optimal utilization.
Fig. 3(b) shows the degree distribution of ontology usage in a number of data
sources. The value of degree is shown on the x-axis and the number of ontologies
with that degree is shown on the y-axis. It can be seen that there are a large
number of ontologies with a small degree value and only a few ontologies have a
larger degree value.
Fig. 3. (a) Distribution of Ontology Usage in data sources, and (b) Degree distribution
of ontology usage (data sources per ontology)
Fig. 4. (a) Distribution of Semanticity (Ontology used per data source), and (b) Degree
distribution of semanticity (Ontologies per Data source)
The bell-shaped curve is concentrated in the centre and decreases on either side.
This signifies that the degree has less of a tendency to produce extreme values compared
to a power-law distribution.
(ii) ontologies closer to each other have more entities which correspond to en-
tities of other ontologies; and (iii) closely related ontologies tend to facilitate
query answering on the Semantic Web.
The betweenness and closeness centrality of ontology co-usage nodes are shown
in Fig. 5(a) and 5(b), respectively, where node size reflects the centrality value.
Fig. 5. (a) Betweenness centrality of Ontology Co-Usage network, and (b) Closeness
centrality of Ontology Co-Usage network
As we can see in Fig. 5(a), the Ontology Co-Usage network has very few nodes
with a high betweenness value, which means that the ontologies represented
by the green nodes (which are not many) are the ones falling on the
geodesic paths between many other nodes and acting as gateways (or hubs) in the
communication between other ontologies in the graph. These are the nodes,
namely rdf, rdfs, gr, vCard and foaf, which, in our interpretation, are acting as
semantic gateways by driving the adoption of other ontologies
on the Web. On the other hand, closeness centrality is distributed approximately
evenly in the network. Thus, it is safe to assume that almost every
node is reachable from the other nodes, except those which are not connected.
URIs) of the data sources included in the dataset (or on the Web, to generalize
it). Note that the size of the cohesive sub-group, in terms of percentage, closely
matches the findings of [15] for the classical Web, which was 91%. Within the
giant connected component, to identify the sub-components based on the equal dis-
tribution of the concentration of links around a set of nodes, the k-core is computed.
The k-core is the maximal sub-graph in which each node has at least degree k within
the sub-graph. Fig. 6 stacks the k-core components, based on k values
from highest to lowest. From Fig. 6, it is easy to see which ontologies are
highly linked, based on the invariance of ontology usage patterns across data sources.
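The k-core extraction used here can be sketched by iterative peeling (our illustration; the toy graph below borrows the ontology names mentioned above but its edges are invented, not the actual Ontology Co-Usability network):

# Sketch (ours): k-core of a small undirected graph via repeated peeling of
# nodes whose degree falls below k.
def k_core(adj, k):
    adj = {v: set(ns) for v, ns in adj.items()}       # work on a copy
    removed = True
    while removed:
        removed = False
        for v in [v for v, ns in adj.items() if len(ns) < k]:
            for u in adj.pop(v):
                if u in adj:
                    adj[u].discard(v)
            removed = True
    return adj

toy = {
    "rdf":   {"rdfs", "foaf", "gr", "vcard"},
    "rdfs":  {"rdf", "foaf", "gr"},
    "foaf":  {"rdf", "rdfs", "gr"},
    "gr":    {"rdf", "rdfs", "foaf"},
    "vcard": {"rdf"},
}
print(sorted(k_core(toy, 3)))             # -> ['foaf', 'gr', 'rdf', 'rdfs']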
References
1. Hitzler, P., van Harmelen, F.: A Reasonable Semantic Web. Semantic Web Journal 1(1-2), 39-44 (2010)
2. Bizer, C., Jentzsch, A., Cyganiak, R.: State of the Linked Open Data (LOD) Cloud. Technical Report, April 5, 2011 (March 2011), http://www4.wiwiss.fu-berlin.de/lodcloud/state/
3. Ashraf, J., Cyganiak, R., O'Riain, S., Hadzic, M.: Open eBusiness Ontology Usage: Investigating Community Implementation of GoodRelations. In: Proceedings of the Linked Data on the Web Workshop (LDOW) at WWW 2011, Hyderabad, India, March 29. CEUR Workshop Proceedings, vol. 813, pp. 1-11. CEUR-WS.org (2011)
4. Ashraf, J.: A Framework for Ontology Usage Analysis. In: Simperl, E., Cimiano, P., Polleres, A., Corcho, O., Presutti, V. (eds.) ESWC 2012. LNCS, vol. 7295, pp. 813-817. Springer, Heidelberg (2012)
5. Ding, L., Zhou, L., Finin, T., Joshi, A.: How the Semantic Web is Being Used: An Analysis of FOAF Documents. In: Proceedings of the 38th Annual Hawaii International Conference on System Sciences, vol. 4, pp. 113-120. IEEE Computer Society, Washington, DC (2005)
6. Zhang, H.: The scale-free nature of semantic web ontology. In: Proceedings of the 17th International Conference on World Wide Web, pp. 1047-1048. ACM (2008)
7. Theoharis, Y., Tzitzikas, Y., Kotzinos, D., Christophides, V.: On Graph Features of Semantic Web Schemas. IEEE Transactions on Knowledge and Data Engineering 20(5), 692-702 (2008)
8. Cheng, G., Qu, Y.: Term Dependence on the Semantic Web. In: Sheth, A.P., Staab, S., Dean, M., Paolucci, M., Maynard, D., Finin, T., Thirunarayan, K. (eds.) ISWC 2008. LNCS, vol. 5318, pp. 665-680. Springer, Heidelberg (2008)
9. Oliveira, M., Gama, J.: An overview of social network analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2(2), 99-115 (2012)
10. Weisstein, E.: Normal distribution (2005). From MathWorld - A Wolfram Web Resource, http://mathworld.wolfram.com/NormalDistribution.html
11. Newman, M.: The mathematics of networks. The New Palgrave Encyclopedia of Economics 2 (2008)
12. Newman, M.E.J.: Coauthorship networks and patterns of scientific collaboration. Proceedings of the National Academy of Sciences of the United States of America 101(suppl. 1), 5200-5205 (2004)
13. Okamoto, K., Chen, W., Li, X.-Y.: Ranking of closeness centrality for large-scale social networks. In: Preparata, F.P., Wu, X., Yin, J. (eds.) FAW 2008. LNCS, vol. 5059, pp. 186-195. Springer, Heidelberg (2008)
14. David, J., Euzenat, J.: Comparison between Ontology Distances (Preliminary Results). In: Sheth, A.P., Staab, S., Dean, M., Paolucci, M., Maynard, D., Finin, T., Thirunarayan, K. (eds.) ISWC 2008. LNCS, vol. 5318, pp. 245-260. Springer, Heidelberg (2008)
15. Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., Wiener, J.: Graph structure in the web. Computer Networks 33(1), 309-320 (2000)
Energy Efficiency in W-Grid Data-Centric
Sensor Networks via Workload Balancing
1 Introduction
Wireless sensor networks consist of a large number of distributed nodes that orga-
nize themselves into a multi-hop (wireless) network. Sensor nodes are generally
powered by batteries with an inherently limited lifetime, hence energy consump-
tion issues play a leading role. In more detail, we focus our attention on so-called
data-centric sensor networks, where nodes are smart enough both to store some
data and to perform basic processing, allowing the network itself to supply higher-
level information closer to the network user's expectations.
Energy saving in wireless sensor networks involves both MAC layer and net-
work layer. In particular, here we state that the routing protocol must be
lightweight, must not require too much computation, such as complex evalu-
ations of possible paths, and must not need a wide knowledge of the network
2 Related Work
Existing routing protocols have been developed according to different approaches.
Basically, routing is necessary whenever data is sensed (we also say gener-
ated) and stored in the system or whenever a query is submitted to the system.
As stated before, we do not consider sensor network systems which store sensor
data externally at a remote base station, but rather we focus on advanced wire-
less sensor networks in which data or events are kept at sensors, by means of
representing them in terms of relations in a virtual distributed database and, for
efficiency purposes, indexed by suitable attributes. For instance, in [12,10,21,20],
data generated at a node is assumed to be stored at the same node, and queries
are flooded throughout the network. In [19], a Geographic Hash Tables
(GHT) approach is proposed, where data are hashed by name to a location
within the network, enabling highly efficient rendezvous. GHTs are built upon
the GPSR protocol [11], thus leveraging some interesting properties of that proto-
col, such as the ability of routing to sensors nearest to a given location, together
with some of its well-known limits, such as the risk of dead ends. Dead-end prob-
lems, especially under low-density environments or scenarios with obstacles and
holes, are caused by the inherent greedy nature of routing algorithms that can
lead to situations in which a packet gets stuck at a locally optimal sensor that
appears closer to the destination than any of its known neighbors. In order to
address this flaw, correction methods such as perimeter routing, which tries to ex-
ploit the right-hand rule, have been implemented. However, packet losses still
remain and, in addition to this, using perimeter routing causes a loss of efficiency
both in terms of average path length and energy consumption. Besides, another
limitation of GHT-based routing is that it needs sensors to know their physical
position, which causes additional localization costs to the whole system. In [9],
Greenstein et al. have designed a spatially-distributed index, called DIFS, to fa-
cilitate range search over attributes. In [12], Li et al. have built a distributed
index, called DIM, for multidimensional range queries over attributes, but they
require nodes to be aware of their physical location and of the network perimeter.
Moreover, they exploit GPSR for routing. Our solution extending W-Grid also
behaves like a distributed index, but its indexing feature is cross-layered with
routing, meaning that neither physical positions nor any external routing protocol is
necessary; routing information is given by the index itself.
Another research area that is directly related to our work is the problem of effectively and efficiently managing multidimensional data over sensor networks, as W-Grid may represent a very efficient indexing layer for techniques and algorithms supporting this critical task. The problem of performing multidimensional and OLAP analysis of data streams, like the ones originated by sensor networks, has received great attention recently (e.g., [2]). Due to the computational overheads introduced by these time-consuming tasks, several solutions have been proposed in the literature, such as data compression (e.g., [4]), usage of high-performance computational Grids (e.g., [6,5]), extensions to uncertain and imprecise domains (e.g., [3]) that occur very frequently in sensor environments, and so forth. This problem also exposes very interesting and challenging correlations with cross-disciplinary scientific areas, such as mobile computing (e.g., [7]), thus opening the door to novel research perspectives still poorly explored, such as data management issues in mobile sensor networks (e.g., [8,14,13]).
3 W-Grid Overview
W-Grid1 can be viewed as a binary tree index cross-layering both routing and data management features over a wireless sensor network. Two main phases are performed in W-Grid: (1) implicitly generate coordinates and relations among nodes that allow efficient message routing; (2) determine a data indexing space partition by means of the so-generated coordinates in order to efficiently support multidimensional data management. Also, each node can have one or more virtual coordinates, on which an order relation is defined and through which routing occurs. At the same time, each virtual coordinate represents a portion of the data indexing space for which a device is assigned the (data) management tasks. By assigning multiple coordinates to nodes, we aim at reducing query path length and latency. This is obtained by bounding the probability that two nodes physically close have very different virtual coordinates, which may happen whenever a multidimensional space is translated into a one-dimensional space. In the next sections, we provide a formal description of the main W-Grid features.
Formally, a sensor network S is modeled as a pair
S = (D, L)    (1)
such that D is the set of participating devices and L is the set of physical connectivity links between couples of devices. Coordinates are binary strings taken from a domain C; the function
length(c) : C → ℕ    (4)
returns the number of bits of a coordinate c.
1 From now on, we will refer to devices with the terms sensors or nodes interchangeably.
Given a coordinate c, we define the left child of c, denoted by lChild(c), as follows:
lChild(c) = c0    (11)
Given a coordinate c, we define the right child of c, denoted by rChild(c), as follows:
rChild(c) = c1    (12)
To give examples, given a coordinate ci = 011, then: father(011) = 01, lChild(011) = 0110, rChild(011) = 0111.
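For illustration, here is a minimal Python sketch of these coordinate operations on bit-string coordinates; the function names mirror the definitions above, while the concrete implementation is our own assumption rather than the authors' code.

def father(c):
    # Drop the last bit: father("011") == "01"
    return c[:-1] if c else None

def lchild(c):
    # Concatenate a 0 bit: lchild("011") == "0110"
    return c + "0"

def rchild(c):
    # Concatenate a 1 bit: rchild("011") == "0111"
    return c + "1"

assert father("011") == "01"
assert lchild("011") == "0110"
assert rchild("011") == "0111"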
Given coordinates in the domain C and devices in the domain D, we introduce the mapping function M that maps each coordinate c to the device d holding it, as follows:
M : C → D    (13)
A W-Grid network W is represented in terms of a graph, as follows:
W = (C, P)    (14)
where P is the set of parent-ships between coordinates; each parent-ship p ∈ P is a pair of coordinates:
p = (ci, cj)    (16)
If d is not active, last(d) returns {}. Let n1 denote the first node that joined the network; then last(n1) corresponds to the root coordinate.
2. ∀ l = (di, dj) ∈ L such that last(di) ≠ {}, two parent-ships are generated: (i) p = (last(di), c′), with M(c′) = dj; (ii) p′, where c′ = lChild(last(di)) | rChild(last(di)). Namely, c′ corresponds to the non-deterministic choice of one of the children of last(di).
According to a pre-defined coordinate selection strategy, nodes progressively get new coordinates from each of their physical neighbors, in order to establish parent-ships with them. Coordinate getting is also called split, and it is actually related to the data management tasks of W-Grid. The actors of the split procedure are the so-called joining node and giving node, respectively. We say that a coordinate ci is split by concatenating a bit to it; afterwards, one of the new coordinates is assigned to the joining node, while the other one is kept by the giving node. Obviously, an already split coordinate ci cannot be split anymore, since this would generate duplicates. Besides, in order to guarantee the uniqueness of coordinates even in case of simultaneous requests, each joining node must be acknowledged by the giving node. Thus, if two nodes ask to split the same coordinate, only one request will succeed, while the other one will be temporarily rejected and postponed. Coordinate discovery is gradually performed by implicitly overhearing neighbor sensor transmissions.
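As an illustration only, the following Python sketch mimics the split just described: the giving node splits one of its not-yet-split coordinates, keeps one child and hands the other to the joining node, and acknowledges at most one concurrent request per coordinate. Class and method names are our own assumptions, not the paper's implementation.

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.coordinates = []      # virtual coordinates held by this node
        self.split_coords = set()  # coordinates already split (cannot be split again)

    def split(self, c, joining_node):
        # Giving-node side of the split: reject if c is unknown or already split.
        if c not in self.coordinates or c in self.split_coords:
            return False           # request temporarily rejected and postponed
        self.split_coords.add(c)   # acknowledgment: c can be split only once
        self.coordinates.append(c + "0")          # giving node keeps one child
        joining_node.coordinates.append(c + "1")  # joining node gets the other child
        return True

giver, joiner = Node(1), Node(2)
giver.coordinates.append("011")
assert giver.split("011", joiner)        # first request succeeds
assert not giver.split("011", Node(3))   # a second request on the same coordinate is rejected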
We consider sensor networks in which data are not simply downloaded at a fixed sink, or in which a sink node is in charge of periodically performing queries, but, contrary to this, networks where each active sensor can be queried at any time.
The effectiveness and efficiency of the W-Grid load balancing solutions have been evaluated in terms of the following experimental parameters: query average path length, data and traffic load at sensors, and energy consumption of the resulting sensor network. The functional parameters chosen are, instead, the following: device density and number of coordinates per node. In the following, we first describe the simulation model of our experimental campaign (Section 4.1). Then, we provide experimental results for the following experimental aspects assessing W-Grid's performance: query efficiency (Section 4.2), network load (Section 4.3) and energy consumption (Section 4.4).
Fig. 1. Query efficiency due to W-Grid without load balancing solutions, in comparison
to GPSR, for the configurations having 11 (left) and 8 (right) neighbors per node
We let the simulator perform the following tasks: (i) uniformly random place-
ment of n sensors in a user-defined area; (ii) gradual generation of W-Grid
coordinates at sensors that exploit implicit overhearing; (iii) random generation
of data sensing and query in a user-defined ratio q : i.
At each simulator run, we observed the following experimental parameters:
(i) average path length of queries (compared with the one due to GPSR [11]);
(ii) loads at sensors, modeled in terms of the relative difference between network traffic and managed data space at sensors; (iii) network energy consumption in each different scenario.
In the first experiment, we tested the ratio of succeeded queries due to W-Grid with respect to the ones accomplished by GPSR. Indeed, for the sake of clarity, here we highlight that this comparison is unfair, as GPSR assumes to know sensor physical positions over the network. This means that the probability of committing a query is always very high, especially in quite dense application scenarios like the one we tested. However, our main experimental goal was to test our solution against the most efficient one available in the literature, even if the competitor was advantaged. Figure 1 and Figure 2 show the average number of hops needed to succeed queries with respect to the number of virtual coordinates for W-Grid and GPSR in the two distinct scenarios of not applying (Figure 1) and applying (Figure 2) the load balancing solutions. As Figure 1 and Figure 2 show, W-Grid's behavior is clearly efficient and not too distant from that of GPSR, which reasonably represents the favorite case.
Fig. 2. Query efficiency due to W-Grid with load balancing solutions, in comparison
to GPSR, for the configurations having 11 (left) and 8 (right) neighbors per node
[Plot data for Fig. 3: number of data (y-axis) vs. number of coordinates (x-axis); series: Average and Standard Deviation, for the 11- and 8-neighbor configurations.]
Fig. 3. Network load due to W-Grid without load balancing solutions for the configurations having 11 (left) and 8 (right) neighbors per node
[Plot data for Fig. 4: number of data (y-axis) vs. number of coordinates (x-axis); series: Average and Standard Deviation, for the 11- and 8-neighbor configurations.]
Fig. 4. Network load due to W-Grid with load balancing solutions for the configurations
having 11 (left) and 8 (right) neighbors per node
[Plot data for Fig. 5: variation of data space (y-axis) vs. number of coordinates (x-axis); series: Avg, StdDev with load balancing OFF, StdDev with load balancing ON.]
Fig. 5. Variation of data space at sensors due to W-Grid with load balancing solutions
for the configurations having 11 (left) and 8 (right) neighbors per node
[Plot data for Fig. 6: energy consumption in Joules (y-axis) vs. coordinates per neighbor (x-axis).]
Fig. 6. Energy consumption due to W-Grid for different configurations with (left) and
without (right) load balancing solutions
References
1. Cerroni, W., Moro, G., Pirini, T., Ramilli, M.: Peer-to-peer data mining classifiers for decentralized detection of network attacks. In: Wang, H., Zhang, R. (eds.) ADC 2013, Adelaide, South Australia. CRPIT, pp. 1–8. ACS (2013)
2. Cuzzocrea, A.: CAMS: OLAPing multidimensional data streams efficiently. In: Pedersen, T.B., Mohania, M.K., Tjoa, A.M. (eds.) DaWaK 2009. LNCS, vol. 5691, pp. 48–62. Springer, Heidelberg (2009)
3. Cuzzocrea, A.: Retrieving accurate estimates to OLAP queries over uncertain and imprecise multidimensional data streams. In: Bayard Cushing, J., French, J., Bowers, S. (eds.) SSDBM 2011. LNCS, vol. 6809, pp. 575–576. Springer, Heidelberg (2011)
4. Cuzzocrea, A., Chakravarthy, S.: Event-based lossy compression for effective and efficient OLAP over data streams. Data Knowl. Eng. 69(7), 678–708 (2010)
5. Cuzzocrea, A., Furfaro, F., Greco, S., Masciari, E., Mazzeo, G.M., Saccà, D.: A distributed system for answering range queries on sensor network data. In: PerCom Workshops, pp. 369–373 (2005)
6. Cuzzocrea, A., Furfaro, F., Mazzeo, G.M., Saccà, D.: A grid framework for approximate aggregate query answering on summarized sensor network readings. In: Meersman, R., Tari, Z., Corsaro, A. (eds.) OTM Workshops 2004. LNCS, vol. 3292, pp. 144–153. Springer, Heidelberg (2004)
7. Cuzzocrea, A., Furfaro, F., Saccà, D.: Enabling OLAP in mobile environments via intelligent data cube compression techniques. J. Intell. Inf. Syst. 33(2), 95–143 (2009)
8. El-Moukaddem, F., Torng, E., Xing, G.: Mobile relay configuration in data-intensive wireless sensor networks. IEEE Trans. Mob. Comput. 12(2), 261–273 (2013)
9. Greenstein, B., Estrin, D., Govindan, R., Ratnasamy, S., Shenker, S.: DIFS: A distributed index for features in sensor networks. In: Proceedings of the First IEEE WSNA, pp. 163–173. IEEE Computer Society (2003)
10. Intanagonwiwat, C., Govindan, R., Estrin, D., Heidemann, J., Silva, F.: Directed diffusion for wireless sensor networking. IEEE/ACM Trans. Netw. 11(1), 2–16 (2003)
11. Karp, B., Kung, H.: GPSR: greedy perimeter stateless routing for wireless networks. In: MobiCom 2000, pp. 243–254. ACM Press (2000)
12. Li, X., Kim, Y., Govindan, R., Hong, W.: Multi-dimensional range queries in sensor networks. In: SenSys 2003, pp. 63–75. ACM Press, New York (2003)
13. Li, Z., Liu, Y., Li, M., Wang, J., Cao, Z.: Exploiting ubiquitous data collection for mobile users in wireless sensor networks. IEEE Trans. Parallel Distrib. Syst. 24(2), 312–326 (2013)
14. Liu, B., Dousse, O., Nain, P., Towsley, D.: Dynamic coverage of mobile sensor networks. IEEE Trans. Parallel Distrib. Syst. 24(2), 301–311 (2013)
15. Monti, G., Moro, G., Sartori, C.: WR-Grid: A scalable cross-layer infrastructure for routing, multi-dimensional data management and replication in wireless sensor networks. In: Min, G., Di Martino, B., Yang, L.T., Guo, M., Rünger, G. (eds.) ISPA 2006 Ws. LNCS, vol. 4331, pp. 377–386. Springer, Heidelberg (2006)
16. Moro, G., Monti, G.: W-Grid: a self-organizing infrastructure for multi-dimensional querying and routing in wireless ad-hoc networks. In: IEEE P2P 2006 (2006)
17. Moro, G., Monti, G.: W-Grid: A scalable and efficient self-organizing infrastructure for multi-dimensional data management, querying and routing in wireless data-centric sensor networks. Journal of Network and Computer Applications 35(4), 1218–1234 (2012), http://dx.doi.org/10.1016/j.jnca.2011.05.002
18. Ouksel, A.M., Moro, G.: G-Grid: A class of scalable and self-organizing data structures for multi-dimensional querying and content routing in P2P networks. In: Moro, G., Sartori, C., Singh, M.P. (eds.) AP2PC 2003. LNCS (LNAI), vol. 2872, pp. 123–137. Springer, Heidelberg (2004)
19. Ratnasamy, S., Karp, B., Shenker, S., Estrin, D., Govindan, R., Yin, L., Yu, F.: Data-centric storage in sensornets with GHT, a geographic hash table. Mob. Netw. Appl. 8(4), 427–442 (2003)
20. Xiao, L., Ouksel, A.: Tolerance of localization imprecision in efficiently managing mobile sensor databases. In: ACM MobiDE 2005, pp. 25–32. ACM Press, New York (2005)
21. Ye, F., Luo, H., Cheng, J., Lu, S., Zhang, L.: A two-tier data dissemination model for large-scale wireless sensor networks. In: MobiCom 2002, pp. 148–159. ACM Press, New York (2002)
Update Semantics for Interoperability
among XML, RDF and RDB
A Case Study of Semantic Presence in CISCO's Unified Presence Systems
Muhammad Intizar Ali1, Nuno Lopes1, Owen Friel2, and Alessandra Mileo1
1 DERI, National University of Ireland, Galway, Ireland
{ali.intizar,nuno.lopes,alessandra.mileo}@deri.org
2 Cisco Systems, Galway, Ireland
[email protected]
1 Introduction
Relational databases (RDB) are the most common way of storing and managing data in the majority of enterprise environments [1]. Particularly in enterprises where integrity and confidentiality of the data are of the utmost importance, the relational data model is still the preferred option. However, with the growing popularity of the Web and other Web-related technologies (e.g., the Semantic Web), various data models have been introduced. The Extensible Markup Language (XML) is a very popular data model to store and specify semi-structured data over the Web [7]. It is also considered a de facto data model for data exchange. XPath and XQuery are designed to access and query data in XML.
This work has been funded by Science Foundation Ireland, Grant No. SFI/08/CE/I1380 (Lion2), and CISCO Systems Galway, Ireland.
Listing 1. XMPP request message for chat room creation
Listing 2. Mapping XMPP Message of Listing 1 into RDF [8]
3.1 Syntax
Since XSPARQL is built on XQuery semantics, the XSPARQL Update Facility is also defined by extending the formal semantics of the XQuery Update Facility. It merges a subset of the SPARQL, XQuery and SQL update clauses. We limit ourselves to three major update operations: INSERT, to insert new records into the data set; DELETE, to delete existing records from the data set; and UPDATE, to replace existing records or their values with new records or values. Figure 2 presents a schematic view of the XSPARQL Update Facility, which allows its users to select data from one data source in any of its data formats and update the results into another data source of a different format within a single XSPARQL query, while preserving the bindings of the variables defined within the query. Contrary to select queries, update queries have no return type. On successful execution of an update query no results are returned; only a response is generated, which is either successful, upon successful execution of the query, or an appropriate error raised if for some reason the XSPARQL query processor is unable to update the records successfully. However, if a valid updating or delete query, with or without a where clause, does not match any relevant tuple in the data source, the XSPARQL query processor will still respond with successful execution, but there will be no effect on the existing data source. The basic building block of XQuery is the expression: an XQuery expression takes zero or more XDM instances and returns an XDM instance. The following new expressions are introduced in the XSPARQL Update Facility:
- A new category of expression called updating expression is introduced to make persistent changes in the existing data.
- A basic updating expression can be any of insert, delete or update.
- A for, let, where or order by clause cannot contain an updating expression.
- A return clause can contain an updating expression, which will be evaluated for each tuple generated by its for clause.
Figure 3 presents an overview of the basic syntax rules for the XSPARQL Update Facility; for the complete grammar of the XSPARQL Update Facility, we refer the reader to the latest version of XSPARQL available at http://sourceforge.net/projects/xsparql.
3.2 Semantics
Similar to the semantics of the XSPARQL language [6], for the semantics of
updates in XSPARQL we rely on the semantics of the original languages: SQL,
XQuery, and SPARQL. Notably, the semantics for SPARQL updates are pre-
sented only in the upcoming W3C recommendation: SPARQL 1.1 [12]. We start
by presenting a brief overview of the semantics of update languages for the dif-
ferent data models: relational databases, XML, and RDF.
RDB: Updates in the SQL language rely on a procedural language that specifies the operations to be performed, respectively: insrdb(r, t), delrdb(r, C), and modrdb(r, C, C′), where r is a relation name, t is a relational tuple, and C, C′ are sets of conditions of the form A = c or A ≠ c (A is an attribute name and c is a constant). Following [1], insrdb(r, t) inserts the relational tuple t into the relation r, delrdb(r, C) deletes the relational tuples from relation r that match the conditions, and modrdb(r, C, C′) updates the tuples in relation r that match conditions C to the values specified by C′.
XML: For XML data, the XQuery language is adapted such that an expression can return a sequence of nodes (as per the XQuery semantics specification [9]) or a pending update list [18], which consists of a collection of update primitives.
for $x in //[name='presence']
if $x/status[@code=201] then
insert data
{Graph <SemanticPresence>
  {
  :DERICollab a sioct:ChatChannel ;
    sioc:has_owner :[email protected] .
  :DERICollabChatSession a siocc:ChatSession ;
    sioc:has_container :DERICollab ;
    siocc:has_publishing_source :intizaraliNotebook ;
    siocc:has_nickname "ali"
  }}
5 Related Work
Several efforts have been made to integrate distributed XML, RDF and RDB data on the fly. These approaches can be divided into two types. (i) Transformation based: data stored in various formats is transformed into one format and can be queried using a single query language [11]; the W3C GRDDL working group addresses the issues of extracting RDF from XML documents. (ii) Query-rewriting based: query languages are used to transform or query data from one format to another format; SPARQL queries are embedded into XQuery/XSLT to be posed against pure XML data [13]. XSPARQL was initially designed to integrate data from XML and RDF and was later extended to RDB as well. DeXIN is another approach to integrate XML, RDF and RDB data on the fly, with more focus on distributed computing over heterogeneous data sources scattered over the Web [3]. Realising the importance of updates, the W3C has recommendations for XQuery and SPARQL updates [18,12]. In [15], SPARQL is used to update relational data. However, to the best of our knowledge there is no work available which provides simultaneous access over distributed heterogeneous data sources with updates.
References
1. Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley (1995)
2. Akhtar, W., Kopecký, J., Krennwallner, T., Polleres, A.: XSPARQL: Traveling between the XML and RDF Worlds and Avoiding the XSLT Pilgrimage. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 432–447. Springer, Heidelberg (2008)
3. Ali, M.I., Pichler, R., Truong, H.-L., Dustdar, S.: DeXIN: An Extensible Framework for Distributed XQuery over Heterogeneous Data Sources. In: Filipe, J., Cordeiro, J. (eds.) ICEIS 2009. LNBIP, vol. 24, pp. 172–183. Springer, Heidelberg (2009)
4. Beckett, D., McBride, B.: RDF/XML Syntax Specification. W3C Proposed Recommendation (February 2004) (revised)
5. Berglund, A., Boag, S., Chamberlin, D., Fernandez, M.F., Kay, M., Robie, J., Simeon, J.: XML Path Language (XPath) 2.0. W3C Recommendation (December 2010)
6. Bischof, S., Decker, S., Krennwallner, T., Lopes, N., Polleres, A.: Mapping between RDF and XML with XSPARQL. Journal on Data Semantics 1, 147–185 (2012)
7. Bray, T., Paoli, J., Maler, E., Yergeau, F., Sperberg-McQueen, C.M.: Extensible Markup Language (XML) 1.0 (5th edn.). W3C Recommendation (November 2008)
8. Dabrowski, M., Scerri, S., Rivera, I., Leggieri, M.: Dx Initial Mappings for the Semantic Presence Based Ontology Definition (November 2012), http://www.deri.ie/publications/technical-reports/
9. Draper, D., Fankhauser, P., Fernández, M., Malhotra, A., Rose, K., Rys, M., Simeon, J., Wadler, P.: XQuery 1.0 and XPath 2.0 Formal Semantics. W3C Recommendation (January 2007)
10. Fernandez, M.F., Florescu, D., Boag, S., Robie, J., Chamberlin, D., Simeon, J.: XQuery 1.0: An XML Query Language. W3C Proposed Recommendation (April 2009)
11. Gandon, F.: GRDDL Use Cases: Scenarios of extracting RDF data from XML documents. W3C Proposed Recommendation (April 2007)
12. Gearon, P., Passant, A., Polleres, A.: SPARQL 1.1 Update. W3C Working Draft (January 2012)
13. Groppe, S., Groppe, J., Linnemann, V., Kukulenz, D., Hoeller, N., Reinke, C.: Embedding SPARQL into XQuery/XSLT. In: Proc. of SAC (2008)
14. Hauswirth, M., Euzenat, J., Friel, O., Griffin, K., Hession, P., Jennings, B., Groza, T., Handschuh, S., Zarko, I.P., Polleres, A., Zimmermann, A.: Towards Consolidated Presence. In: Proc. of CollaborateCom 2010, pp. 1–10 (2010)
15. Hert, M., Reif, G., Gall, H.: Updating relational data via SPARQL/Update. In: Proc. of EDBT/ICDT Workshops (2010)
16. Negri, M., Pelagatti, G., Sbattella, L.: Formal Semantics of SQL Queries. ACM Trans. Database Syst. 16(3), 513–534 (1991)
17. Prud'hommeaux, E., Seaborne, A.: SPARQL Query Language for RDF. W3C Proposed Recommendation (January 2008)
18. Robie, J., Chamberlin, D., Dyck, M., Florescu, D., Melton, J., Simeon, J.: XQuery Update Facility 1.0. W3C Recommendation (March 2011)
GPU-Accelerated Bidirected De Bruijn Graph
Construction for Genome Assembly
Mian Lu1, Qiong Luo2, Bingqiang Wang3, Junkai Wu2, and Jiuxin Zhao2
1 A*STAR Institute of High Performance Computing, Singapore
[email protected]
2 Hong Kong University of Science and Technology
{luo,jwuac,zhaojx}@cse.ust.hk
3 BGI-Shenzhen, China
[email protected]
1 Introduction
I/O and CPU processing. Therefore, we study how to utilize the hierarchy of
disk, main memory, and GPU memory to pipeline processing and to involve the
CPU for co-processing. These issues are essential for the feasibility and overall
performance on practical applications; unfortunately, they are seldom studied in
current GPGPU research.
Specifically, we address the GPU memory limit by developing a staged al-
gorithm for GPU-based graph construction. We divide the reads into chunks
and load the data chunk by chunk from disk through the main memory to the
GPU memory. To estimate the chunk size that can fit into the GPU memory,
we develop a memory cost model for each processing step. We further utilize the
CPU and main memory to perform result merging each time the GPU finishes
processing a chunk. Finally, we pipeline the disk I/O, GPU processing, and CPU
merging (DGC) to improve the overall performance. We expect this pipelined-
DGC processing model to be useful for a wide range of GPGPU applications
that handle large data.
We have implemented the GPU-based bidirected De Bruijn graph construction
and evaluated it on an NVIDIA Tesla S1070 GPU device with 4 GB memory. Our
initial results show that this implementation doubles the performance reported
on a 1024-node IBM Blue Gene/L and is orders of magnitude faster than state-
of-the-art CPU-based sequential implementations.
The remainder of this paper is organized as follows. In Section 2, we briefly
introduce the graph construction algorithm and related work. We present our de-
sign of the staged algorithm in Section 3. We describe the details of the pipelined-
DGC model in Section 4. The experimental results are reported in Section 5. We
conclude in Section 6.
2 Preliminary
2.1 Genome Assembly
The second generation of sequencing produces very short reads at a high through-
put. Popular algorithms to reconstruct the original sequence for such a large
number of short reads are based on the De Bruijn [5] or bidirected De Bruijn
graph [6]. The graph is constructed through generating k-mers from reads as
graph nodes. For example, suppose a short read is ACCTGC and k = 4, then
this read can generate three 4-mers, which are ACCT, CCTG, and CTGC. The
major difference between the two graph models is that for each k-mer, its re-
verse complement is represented by a separate node (De Bruijn) or the same
node (bidirected De Bruijn).
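For concreteness, a small Python sketch (our own illustration, not the authors' code) of k-mer generation and of mapping a k-mer and its reverse complement to one canonical representative, as the bidirected model does:

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def kmers(read, k):
    # An l-length read yields (l - k + 1) k-mers.
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def canonical(kmer):
    # Bidirected De Bruijn graph: a k-mer and its reverse complement are
    # represented by the same node; pick the lexicographically smaller string.
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)

assert kmers("ACCTGC", 4) == ["ACCT", "CCTG", "CTGC"]
assert canonical("ACCT") == canonical("AGGT")  # AGGT is the reverse complement of ACCT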
At the beginning of assembly, each l-length read generates (l − k + 1) k-mers. Then the De Bruijn graph is constructed using information about overlaps between k-mers. Next, the graph is simplified and corrected by some heuristic algorithms. After the simplification and correction, several long contigs are generated. Finally, if reads are generated through paired-end sequencing, these contigs are joined to produce scaffolds using the relevant information. Another alternative for joining contigs is through Eulerian paths. The detailed algorithms of these steps can be found in previous work [2, 4]. In this work, we focus on the
De Bruijn graph construction.
Distinct canonical k-mers are collected as representatives for graph nodes, and
adjacency information is built for the graph representation for further manipu-
lation.
Step 1. Encoding. This step loads short reads from the disk to the main
memory and uses two bits per character to represent the reads.
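A minimal Python sketch of such a two-bit packing; the particular bit assignment (A=00, C=01, G=10, T=11) and the byte layout are our own assumptions for illustration, since the paper does not spell out its exact encoding.

CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def encode_read(read):
    # Pack a read into a bytearray, four characters (2 bits each) per byte.
    packed = bytearray((len(read) + 3) // 4)
    for i, base in enumerate(read):
        packed[i // 4] |= CODE[base] << (2 * (3 - i % 4))
    return packed

assert len(encode_read("ACCTGC")) == 2   # a 6-character read fits in 2 bytes instead of 6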
The n reads are divided into chunks such that the memory required for the chunk-based processing does not exceed the GPU memory size. Specifically, there are three processing steps for one chunk:
1. C-1: I/O. n′ (n′ ≤ n) short reads are loaded from the disk to the main memory as a chunk of data.
2. C-2: GPU processing. We transfer the data chunk to the GPU memory. The GPU performs the encoding and generates n′ · (l − k) edges. Duplicates are eliminated through a sorting-based method for these n′ · (l − k) edges. After the duplicate elimination, there are m distinct edges for this chunk.
3. C-3: CPU merge. These m distinct edges are copied from the GPU memory to the main memory, and merged with the distinct edges generated from the previously processed data chunks. Since both the newly generated edges and the existing edges are ordered, the merge step is efficient. The m distinct edges after merging will be used for the next chunk. The multiplicity information is also updated in this step. Due to the high coverage of input reads, the number of distinct edges is around tens of times smaller than that of all generated edges. Therefore we assume that the main memory is sufficient to hold these distinct edges.
Given the GPU memory size Mg, we estimate the memory consumption of each step to decide the suitable chunk size. Suppose each chunk contains n′ reads, and each read is l-length. Then the memory size of the input reads is n′ · l bytes (one byte per character). After encoding, the memory size for the encoded reads is n′ · l / 4 bytes. Thus the total GPU memory consumption for this encoding step is (n′ · l + n′ · l / 4) bytes. Next we use the encoded reads in memory for generating edges. The number of edges generated for one chunk is n′ · (l − k), and each edge is represented using a 64-bit word type. Thus the memory size required for all generated (k+1)-mers in one chunk is 8 · n′ · (l − k) bytes. Therefore the total GPU memory consumption of this edge generation step for one chunk is (n′ · l / 4 + 8 · n′ · (l − k)) bytes. We perform the sorting-based duplicate elimination algorithm on the n′ · (l − k) edges. The memory consumption of the GPU-based radix sort is an output buffer with the same size as the input array. Thus the total memory consumption for the radix sort can be estimated as (2 · 8 · n′ · (l − k)) bytes. Finally, we remove duplicates from these sorted edges. The memory consumption for this operation is at most (2 · 8 · n′ · (l − k)) bytes. Therefore, the maximum value of n′ can be estimated as follows:
min { 4Mg / (5l),  Mg / (8.25l − 8k),  Mg / (16(l − k)) }
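A direct Python transcription of this estimate (function and parameter names are ours):

def max_chunk_reads(mem_gpu_bytes, read_len, k):
    # Largest number of reads n' per chunk that fits in GPU memory,
    # taking the minimum over the three per-step bounds derived above.
    bound_encoding = 4 * mem_gpu_bytes / (5 * read_len)
    bound_edge_gen = mem_gpu_bytes / (8.25 * read_len - 8 * k)
    bound_sorting = mem_gpu_bytes / (16 * (read_len - k))
    return int(min(bound_encoding, bound_edge_gen, bound_sorting))

print(max_chunk_reads(4 * 1024**3, 36, 21))   # e.g. 4 GB of GPU memory, 36 bp reads, k = 21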
The copy from bi to bg (and all following memory copies) should wait until the GPU processing for the previous chunk is done, to release the use of bg. Similarly, we pipeline the GPU and CPU processing. A memory copy from the GPU to the CPU is performed when the computation on the GPU is finished. The GPU processing is blocked while copying the data to bc. After the memory copy, the CPU can perform the merge step for the data in bc, and the next chunk of data can be copied from bi to bg if the data is ready in bi; otherwise the GPU thread is blocked to wait for the data in bi to become ready. We can see that with pipelining, the I/O, GPU processing, and CPU merge can overlap in time.
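To make the buffering concrete, here is a schematic Python sketch of such a three-stage pipeline using one thread per stage and bounded queues standing in for the buffers; the GPU stage is only simulated and all names are illustrative assumptions.

import threading, queue

def run_pipeline(chunks, gpu_process, cpu_merge):
    to_gpu = queue.Queue(maxsize=1)   # plays the role of the input buffer (bi/bg)
    to_cpu = queue.Queue(maxsize=1)   # plays the role of the result buffer (bc)
    merged = []

    def io_stage():                   # C-1: load chunks from disk
        for chunk in chunks:
            to_gpu.put(chunk)
        to_gpu.put(None)              # end-of-stream marker

    def gpu_stage():                  # C-2: encode, generate and deduplicate edges
        while (chunk := to_gpu.get()) is not None:
            to_cpu.put(gpu_process(chunk))
        to_cpu.put(None)

    def cpu_stage():                  # C-3: merge sorted distinct edges
        while (edges := to_cpu.get()) is not None:
            cpu_merge(merged, edges)

    threads = [threading.Thread(target=f) for f in (io_stage, gpu_stage, cpu_stage)]
    for t in threads: t.start()
    for t in threads: t.join()
    return merged

result = run_pipeline([[3, 1, 2], [2, 4]],
                      gpu_process=lambda c: sorted(set(c)),
                      cpu_merge=lambda acc, e: acc.extend(x for x in e if x not in acc))
print(sorted(result))   # [1, 2, 3, 4]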
5 Experiments
5.1 Experimental Setup
Hardware Configuration. We conduct our experiments on a server machine
with two Intel Xeon E5520 CPUs (16 threads in total) and an NVIDIA Tesla
S1070 GPU. The NVIDIA Tesla S1070 has four GPU devices. In our current
implementation, we only utilize one GPU device, and the CPU merge step also
only uses one core on the CPU. This computation capability is sufficient for the
overall performance as the I/O is the bottleneck. The main memory size of the
server is 16 GB.
Data Sets. We use a small and a medium-sized data set for the evaluation. As in previous research [13], we randomly sample known chromosomes to simulate
short reads. Two data sets are sampled from Arabidopsis chromosome 1 (denoted
as Arab.) and human chromosome 1 (denoted as Human), which can be accessed
from the NCBI genome databases [16]. The details of the two data sets are shown
in Table 1. Additionally, we fix the k-mer length k to 21, which is a common
value for most genome assemblers.
step on the CPU together take around 90% of the execution time. The DGC
model is applied to these four components. Particularly, among these four com-
ponents, I/O takes around 50%, and both the GPU processing (including encod-
ing and generating edges) and CPU processing take around 25% of the elapsed
time. Since the efficiency of the pipeline is limited by the most time-consuming
component, we expect that the overall performance of these four components
can be improved by around two times through pipelining.
[Timeline charts of I/O, GPU, and CPU activity over time (seconds), without and with pipelining.]
Even though the GPU and CPU processing cannot completely overlap, their
utilization is improved considerably. As a result, the overall execution time for
these four steps is reduced from around 250 seconds to 128 seconds.
Comparison to Existing Implementations. There are four CPU-based implementations for comparison: Velvet [4], SOAPdenovo [2], ParBidirected [14], and Jackson's implementation [13]. We only compare the performance of graph construction. We perform the evaluation on the same machine for all of these software packages except Jackson's implementation. Note that SOAPdenovo, ParBidirected, and Jackson's implementation are parallel implementations.
Table 2. Comparison of graph construction running times on the two data sets
Implementation       Arab.    Human
GPU-accelerated      18       177
Velvet [4]           86       -
SOAPdenovo [2]       78       1,245
ParBidirected [14]   1,740    32,400
Jackson2008 [13]     15       327
Table 2 shows the comparison of the running times of the four different implementations. The memory required by preprocessing in Velvet for the Human data set exceeds our main memory limit and takes an excessively long time, thus we do not report the performance number for the Human data set in the table. SOAPdenovo has parallelized the hash table building. In the evaluation, it consumes a similar main memory size (around 8 GB for the Human data set) to our implementation. However, the GPU-accelerated graph construction is around 4-7x faster than SOAPdenovo for similar functionality. Since ParBidirected adopts a more conservative method to handle the out-of-core processing, intermediate results need to be written to disk in most steps, which results in a slow execution time, but the memory consumption is stable and very low. In our experiments, we have already modified the default buffer size used in the external sorting to improve the performance. The published performance results of Jackson's implementation [13] are based on a 1024-node IBM Blue Gene/L. The input and output are the same as for our program. However, they have adopted a node-oriented graph building approach, and the message passing is very expensive. In summary, compared with existing implementations, our GPU-based graph construction is significantly faster. Specifically, compared with the massively parallel implementation on the IBM Blue Gene/L, our implementation is still around two times faster.
6 Conclusion
In this paper, we have presented the design and implementation of our GPU-
accelerated bidirected De Bruijn graph construction, which is a first step to build
References
1. Jackson, B., Regennitter, M., Yang, X., Schnable, P., Aluru, S.: Parallel de novo assembly of large genomes from high-throughput short reads. In: IPDPS 2010: Proceedings of the 2010 IEEE International Symposium on Parallel & Distributed Processing, pp. 1–10 (April 2010)
2. Li, R., Zhu, H., Ruan, J., Qian, W., Fang, X., Shi, Z., Li, Y., Li, S., Shan, G., Kristiansen, K., Li, S., Yang, H., Wang, J., Wang, J.: De novo assembly of human genomes with massively parallel short read sequencing. Genome Research 20(2), 265–272 (2010)
3. Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J., Birol, I.: ABySS: a parallel assembler for short read sequence data. Genome Research 19(6), 1117–1123 (2009)
4. Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18(5), 821–829 (2008)
5. Pevzner, P.A., Tang, H.: Fragment assembly with double-barreled data. Bioinformatics 17(suppl. 1), S225–S233 (2001)
6. Medvedev, P., Georgiou, K., Myers, G., Brudno, M.: Computability of models for sequence assembly. In: Giancarlo, R., Hannenhalli, S. (eds.) WABI 2007. LNCS (LNBI), vol. 4645, pp. 289–301. Springer, Heidelberg (2007)
7. Chaisson, M.J., Pevzner, P.A.: Short read fragment assembly of bacterial genomes. Genome Research 18(2), 324–330 (2008)
8. Hossain, M.S.S., Azimi, N., Skiena, S.: Crystallizing short-read assemblies around seeds. BMC Bioinformatics 10(suppl. 1) (2009)
9. Hernandez, D., François, P., Farinelli, L., Østerås, M., Schrenzel, J.: De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer. Genome Research 18(5), 802–809 (2008)
10. Butler, J., MacCallum, I., Kleber, M., Shlyakhter, I.A., Belmonte, M.K., Lander, E.S., Nusbaum, C., Jaffe, D.B.: ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Research 18(5), 810–820 (2008)
11. Warren, R.L., Sutton, G.G., Jones, S.J., Holt, R.A.: Assembling millions of short DNA sequences using SSAKE. Bioinformatics 23(4), 500–501 (2007)
12. Dohm, J.C., Lottaz, C., Borodina, T., Himmelbauer, H.: SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Research 17(11), 1697–1706 (2007)
13. Jackson, B.G., Aluru, S.: Parallel construction of bidirected string graphs for genome assembly. In: International Conference on Parallel Processing, pp. 346–353 (2008)
14. Kundeti, V., Rajasekaran, S., Dinh, H.: Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs. CoRR abs/1003.1940 (2010)
15. Mahmood, S.F., Rangwala, H.: GPU-Euler: Sequence assembly using GPGPU. In: Proceedings of the 2011 IEEE International Conference on High Performance Computing and Communications, HPCC 2011, pp. 153–160. IEEE Computer Society (2011)
16. National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/
K Hops Frequent Subgraphs Mining
for Large Attribute Graph
1 Introduction
With the rapidly increasing amount of data in Web and related applications, more and more data can be represented by graphs. Most mechanisms and algorithms can only work on labeled graphs, i.e., where a vertex has only one label, while vertices in many real graph data have more than one label; this kind of graph is named a multi-labeled graph [1], such as graphs describing social networks and proteins. An attribute graph is characterized as a kind of multi-labeled
graph. There is a set of attributes associated with each vertex in an attribute graph, and each attribute has an exact name.
Mining frequent subgraphs from graph databases is of great importance in graph data mining. Frequent subgraphs are useful in data analysis and data mining tasks, such as similarity search, graph classification, clustering and indexing [2]. In similarity search and classification, frequent subgraphs are usually used as features, leading to exact results and high scalability. In graph clustering, frequent subgraphs can be used as a solution for traditional algorithms such as k-means. Similarity search is an important issue in many applications of graph data; the task is considered NP-complete and is inefficient for large graphs.
Existing research on frequent subgraph mining is conducted mainly on two types of graph databases. The first one is transaction graph databases [3] that consist of a set of relatively small graphs. Transaction graph databases are prevalently used in scientific domains such as chemistry, bioinformatics, etc. The second one is a single graph with a large number of vertices, such as a social network [4]. This paper focuses on the latter type of graph database, with several attributes on each vertex.
Mining frequent subgraphs is an iterative process consisting of two main steps [5]: candidate generation and subgraph counting. In the first step, candidates are mined by extending. In the second step, subgraph isomorphism is investigated for counting. Subgraph isomorphism is an NP-complete problem [6] and is time-consuming for large graphs.
Considering the increasing importance of frequent subgraph mining, this paper presents a mechanism for mining frequent subgraphs from a large attribute graph. The attribute graph is first transformed into labeled graphs by projection. Then the algorithm FSGen is performed to mine k hops frequent subgraphs by extending, joining and isomorphism testing. Finally, frequent labeled subgraphs are merged into attribute subgraphs by integration according to designated attributes. There is no candidate generation in algorithm FSGen, and isomorphism testing is optional. Thus, the time occupation of our mechanism for mining frequent labeled subgraphs is better than that of existing works.
The rest of this paper is organized as follows. Related works on subgraph mining algorithms for graph databases consisting of multiple small labeled graphs are summarized in Section 2. Section 3 is dedicated to basic concepts and general definitions of frequent subgraph mining. We present the k hops frequent subgraph mining algorithm FSGen, consisting of extending, joining and isomorphism testing procedures, in Section 4. Section 5 shows experimental studies of our technique and Section 6 concludes our work.
2 Related Works
Frequent subgraph mining has been widely studied for graph databases consisting of multiple small labeled graphs. To the best of our knowledge, no mechanisms
3 Preliminaries
[Example figure: an attribute graph whose vertices carry attribute values such as (Ann, AI), (Bill, Bio), (Dan, DB), (Rose, Med).]
4.1 Projection
The attribute graph will be transformed into labeled graphs by projection before frequent subgraph mining. Projection reduces the complexity of the attribute graph, and mechanisms for processing labeled graphs can then be used to work on the attribute graph after projection.
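As an illustration (our own sketch; the paper gives no code for this step), projecting an attribute graph onto one labeled graph per chosen attribute could look as follows:

def project(vertices, edges, attribute):
    # vertices: dict vertex_id -> dict of attribute name/value pairs
    # edges: iterable of (vertex_id, vertex_id) pairs
    # Each vertex keeps only its value for the chosen attribute as its label.
    labels = {v: attrs[attribute] for v, attrs in vertices.items()}
    return labels, list(edges)

vertices = {1: {"category": "Music", "views": "high"},
            2: {"category": "News", "views": "low"},
            3: {"category": "Music", "views": "low"}}
edges = [(1, 2), (2, 3)]
labels, e = project(vertices, edges, "category")
print(labels)   # {1: 'Music', 2: 'News', 3: 'Music'}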
Based on the above properties, our algorithm for mining frequent subgraphs has the following features: (1) vertices and edges in each extending step are frequent in all N(vr, h) (1 ≤ h ≤ k), where vr ∈ VR is a root vertex; (2) the same edge (i.e., the labels of its start and end vertices are the same, respectively) occurs only once in each extending step.
The k hops frequent subgraph mining algorithm FSGen consists of three stages: extending, joining and isomorphism testing. Frequent subgraphs are initially denoted as frequent vertices, and then these vertices are extended by other frequent vertices and edges until k hops in the extending stage. Frequent subgraphs obtained from the extending stage are constructed as a tree model. So, in the joining stage, frequent edges in the labeled graph are added into the extended k hops frequent subgraphs where they do not occur. After the above two stages, k hops frequent subgraphs can be enumerated. If we want to summarize all k hops frequent subgraphs, isomorphism testing is performed; this stage is optional according to actual requirements. In order to complete our algorithm, isomorphism testing is presented along with the other two stages in this subsection. Algorithm 1 shows FSGen performed on labeled graphs obtained by projection. The algorithm consists of three main procedures: Extending, Joining and Isomorphism Testing.
Algorithm 1. FSGen
Input: (1) k, (2) labeled graph Gp by projection, (3) frequent vertices set Vf, (4) frequent edges set Ef
Output: k hops frequent subgraphs of Gp
1: Let Vmf denote the set of frequent vertices used as the root vertices set
2: int h = 0;
3: Let Gf denote the set of k hops frequent subgraphs and Gf = ∅;
4: for all vi in Vmf such that vi does not belong to any extended frequent subgraph do
5:   Let g denote a k hops frequent subgraph in Gf with g.V = {vi} and g.E = ∅;
6:   Extending(vi, g, k, ++h)
7:   Joining(g)
8:   add g to Gf
9: end for
10: Isomorphism Testing(Gf)
11: return Gf
As each Gp is a labeled graph, an inverted table is built as an index for each label and sorted by frequency. Then root vertices can be chosen via the inverted table. In the extending stage, frequent vertices that do not occur in any extended frequent subgraph can be used as new root vertices in the iteration. Extending is a recursive procedure, as Algorithm 2 shows, and its time complexity is O(2^n). We use h to denote the current number of hops. In each step of extending, a frequent vertex and a related frequent edge are added into the frequent subgraph extended in previous steps, as lines 6 to 8 show. In order to avoid overlap, we use a list of vertices for N(v, 1) which includes the non-repeated neighbors of vertex v. When h is equal to k, the procedure returns and g has been extended in its vertices set and edges set.
Algorithm 2. Extending
Input: (1) k, (2) current number of hops h, (3) frequent vertex v, (4) a reference to frequent subgraph g
Output: reference g as a frequent subgraph
1: if h < k then
2:   add v to g.V
3:   Let L be the list of frequent vertices in N(v, 1), where L includes each distinct vertex only once
4:   for all vni in L do
5:     if vni is frequent and edge (v, vni) is frequent then
6:       add vni to g.V
7:       add (v, vni) to g.E
8:       Extending(vni, g, k, ++h)
9:     end if
10:   end for
11: end if
Algorithm 3. Joining
Input: (1) k, (2) k hops frequent subgraph g, (3) frequent edges set Ef
Output: joined k hops frequent subgraph g
1: for all vertices vi in g.V do
2:   for all vertices vj in g.V do
3:     if edge (vi, vj) ∈ Ef and (vi, vj) ∉ g.E then
4:       add (vi, vj) to g.E
5:     end if
6:   end for
7: end for
8: return g
is no more than a given threshold of similarity, these two frequent subgraphs are considered similar subgraphs and one of them is removed from Gf. The total time complexity of isomorphism testing is O(m^3 n^2), where m is the number of vertices in each frequent labeled subgraph and n denotes the number of all frequent labeled subgraphs. By the restriction of k, the number of vertices in a frequent subgraph can be considered a constant, so the time complexity of the Isomorphism Testing procedure is O(n^2).
4.3 Integration
Definition 7. Integration. Given n labeled graphs gp1, gp2, ..., gpn, where gpi = (Vpi, Epi, Lpi), integration produces an aggregated attribute graph GA = (V, E, FA), where V = Vp1 ∩ ... ∩ Vpn, E = Ep1 ∩ ... ∩ Epn and FA = Lp1 ∪ ... ∪ Lpn.
According to the definition of integration, many operations of intersection computing will be carried out. The common solution, as Definition 7 shows, is time-consuming. On the other hand, there may be no frequent subgraphs mined from the labeled subgraphs for attributes whose values are unique; when integrating all frequent labeled subgraphs, no frequent attribute subgraphs will be obtained. Thus, the mining results are not significant. In order to solve the problems presented above, we adopt a flexible method for integration. We add a flag for each attribute of a vertex: if the value of the attribute is frequent, the flag is set to 1, and otherwise it is set to 0. Then we can designate attributes and their related frequent labeled subgraphs for integration. In each frequent labeled subgraph, we scan the set of vertices; if the flags of the values of the designated attributes of a vertex are all equal to 1, the vertex is selected as a frequent vertex and added to the vertices set of the frequent attribute subgraph. This processing is a linear scan and its time complexity is O(m), where m is the number of vertices in the scanned frequent subgraph. For all frequent subgraphs, the total time complexity of integration is O(m · n), where n is the number of frequent labeled subgraphs mined by FSGen.
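A minimal Python sketch of this flag-based selection; the data layout and names are our own assumptions.

def integrate(frequent_subgraphs, designated_attrs):
    # frequent_subgraphs: list of dicts vertex_id -> {attr: (value, flag)}
    # Select vertices whose designated attribute values are all flagged as frequent (flag == 1).
    selected = set()
    for subgraph in frequent_subgraphs:              # n frequent labeled subgraphs
        for v, attrs in subgraph.items():            # linear O(m) scan per subgraph
            if all(attrs[a][1] == 1 for a in designated_attrs):
                selected.add(v)
    return selected

g1 = {1: {"category": ("Music", 1), "length": ("short", 1)},
      2: {"category": ("News", 0), "length": ("short", 1)}}
print(integrate([g1], ["category", "length"]))   # {1}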
5 Experimental Studies
The mechanism presented in this paper is written in C++ using Microsoft Visual Studio .NET 2008. All experiments are carried out on a PC with an Intel Core i7 CPU at 2.67 GHz, 4 GB memory and Microsoft Windows 7.
The experiment consists of two parts to demonstrate the effectiveness of our work. Firstly, we perform our mechanism for mining frequent subgraphs in the attribute graph with different k, different sizes of data and different supports, and then test the time occupation in each case. Secondly, we perform algorithm FSGen for mining frequent subgraphs in labeled graphs together with other traditional algorithms, including gSpan, FFSM and GraphGen, which must be rewritten for a large labeled graph instead of multiple small graphs, to compare the time occupations of these algorithms. We obtain large attribute graph data by crawling the YouTube API and arrange each item as a vertex of the graph, including 9 attributes. We choose three attributes: category, length and views for frequent subgraph mining.
K is the number of hops denoting the distance from the root vertex to any other vertex in each frequent subgraph. In this experiment, we first discuss the impact of k on the time occupation while running the k hops frequent subgraph mining algorithm FSGen on labeled graphs transformed from the attribute graph based on the YouTube dataset, and then test the effectiveness with different sizes of data and different supports.
Fig. 3 shows the time occupation according to different k. We can see from the figure that the time occupation increases along with the growth of k while the number of vertices in the attribute graph is 500K. The same tendency occurs in Fig. 4: when the number of hops is set to 3, the time occupation of our work increases with the growing number of vertices. The time occupation is also related to the support, as Fig. 5 shows. The values of the x axis are denoted by the ratio between the support (given by Definition 3) and the total number of vertices in the labeled graph. More time is occupied when the support is low, because a low support produces more subgraphs.
[Plots for Figs. 3-5: time occupation vs. number of hops k, number of vertices, and support, respectively.]
[Comparison plots: time occupation (s) of FSGen, gSpan, FFSM and GraphGen vs. number of vertices and vs. support.]
6 Conclusion
References
1. Yang, J., Zhang, S., Jin, W.: DELTA: Indexing and Querying Multi-labeled Graphs. In: The 20th ACM International Conference on Information and Knowledge Management (CIKM), pp. 1765–1774 (2011)
2. Keyvanpour, M.R., Azizani, F.: Classification and Analysis of Frequent Subgraphs Mining Algorithms. Journal of Software 7(1), 220–227 (2012)
3. James, C., Yiping, K., Ada, W.C.F., Jeffrey, X.Y.: Fast graph query processing with a low-cost index. The International Journal on Very Large Data Bases 20(4), 521–539 (2011)
4. Koren, Y., North, S.C., Volinsky, C.: Measuring and extracting proximity in networks. In: The 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 245–255 (2006)
5. Meinl, T., Borgelt, C., Berthold, M.R.: Discriminative Closed Fragment Mining and Perfect Extensions in MoFa. In: The 2nd Starting AI Researchers Symposium (STAIRS), pp. 3–14 (2004)
6. Li, X.T., Li, J.Z., Gao, H.: An Efficient Frequent Subgraph Mining Algorithm. Journal of Software 18(10), 2469–2480 (2007)
7. Inokuchi, A., Washio, T., Motoda, H.: Complete Mining of Frequent Patterns From Graphs: Mining graph data. Machine Learning 50(3), 321–354 (2003)
8. Kuramochi, M., Karypis, G.: Frequent Subgraph Discovery. In: The 1st IEEE International Conference on Data Mining (ICDM), pp. 313–320 (2001)
9. Yan, X., Han, J.: gSpan: Graph-Based Substructure Pattern Mining. In: The 2nd IEEE International Conference on Data Mining (ICDM), pp. 721–724 (2002)
10. Huan, J., Wang, W., Prins, J.: Efficient mining of frequent subgraphs in the presence of isomorphism. In: The 3rd IEEE International Conference on Data Mining (ICDM), pp. 549–552 (2003)
11. Riesen, K., Bunke, H.: Approximate graph edit distance computation by means of bipartite graph matching. Image and Vision Computing 27(4), 950–959 (2009)
Privacy Preserving Graph Publication
in a Distributed Environment
Abstract. Recently, many works have studied how to publish privacy preserving social networks for safe data mining or analysis. These works all assume that there exists a single publisher who holds the complete graph. However, in real life, people join different social networks for different purposes. As a result, there is a group of publishers and each of them holds only a subgraph. Since no one has the complete graph, it is a challenging problem to generate the published graph in a distributed environment without releasing any publisher's local content. In this paper, we propose an SMC (Secure Multi-Party Computation) based protocol to publish a privacy preserving graph in a distributed environment. Our scheme can publish a privacy preserving graph without leaking local content information and meanwhile achieve the maximum graph utility. We show the effectiveness of the protocol on a real social network under different distributed storage cases.
1 Introduction
Recently, privacy preserving graph publication has become a hot research topic and many graph protection models have been proposed [9,7,3,12,11,13,14,1]. All these models assume there is a trustable centralized data source, which has a complete original graph G, and can directly generate the privacy preserving graph G′ from G.
However, in reality, people join different social networks due to different interests
or purposes. For example, a person uses Facebook to share his information with his
classmates and coworkers. At the same time, he may also build his blog on a blog
server to share his interests with others. As a result, his connection information is stored
in two social networks. A consequence of this joining preference is that each social
network website only holds partial information (a subgraph) of the complete social
network G. We call such a social network website a data agent. Consider a social graph as shown in Figure 1, where each person has two labels and people are involved in different interactions; the graph can be stored on three data agents, A1, A2 and A3 separately, as shown in Figure 2.
Although people join different networks, it is often necessary to obtain a privacy preserving graph generated from the complete graph for criminal investigation [8] or mining useful patterns/influential persons [4,8,6]. This requires that the distributed agents cooperatively generate a privacy preserving graph G′. There are three potential approaches to do this (Figure 3).
[Figure 3 sketches: (a) a trustable third-party agent holding the whole original graph; (b) per-agent publication; (c) a virtual server over the virtual whole original graph; each producing a privacy preserved graph.]
The third-party approach (Figure 3(a)) is to let all the data agents send their local contents to a trustable third-party agent, where the published graph is generated by integrating all the data. However, since the data of each site is its most valuable asset and no one is willing to share its data with others, finding such a trustable third-party data agent is not feasible in the real world. The naive approach (Figure 3(b)) is to let each data agent generate a privacy preserving graph based on its own content and securely combine them into a large privacy preserving graph1. Secure combination means no local content is released during the process. However, this approach encounters two problems: (1) it results in low quality of the published data, since the graph is constructed only based on local information instead of the complete graph; (2) a privacy preserving graph generated only from local information may violate the privacy requirement when globally more connection information is provided considering the connections between nodes [9,13,14,2,1]. The secure combination needs to delete some connections to ensure the final published graph satisfies the privacy requirement. Therefore, the published graph will have incomplete information. Another approach (Figure 3(c)) is that each data agent participates in a protocol to produce a privacy preserving graph. The privacy preserving graph is generated just as if there virtually existed an agent who has the integrated data. During the computation, except for the content derived from the final published graph, the protocol ensures that no additional local content of a data agent is released. This solution is
1 The overlapping between subgraphs should be considered. [5] showed it is not safe even when each publisher publishes privacy preserved data independently. Here we use Fig. 3(b) to show the basic workflow of the naive approach.
known as the famous Secure Multi-Party Computation (SMC). SMC deals with a problem where a set of parties with private data wish to jointly compute a function of their data. A lot of SMC protocols have been proposed for different computation problems, but not for privacy preserving graph publishing in a distributed environment. SMC has two requirements: 1) correctness requirement: the computation performed in a distributed environment gives the same result as performing it on an agent who holds the integrated data; 2) security requirement: each data agent should not learn the local information of other agents, even with the intermediate results passing through each other.
In this paper, we follow the SMC approach and design a secure protocol SP to allow the data agents to cooperatively generate a graph G′ based on a recently proposed protection model [1], called S-Clustering. Through clustering methods, S-Clustering publishes a graph G′ that only contains super nodes (clusters), where each super node represents multiple nodes in G. We call a published graph which satisfies S-Clustering an S-Clustering graph. We refer to the algorithm which generates the S-Clustering graph in a centralized environment as the centralized algorithm. The distributed version of the centralized algorithm, SP, should work the same as running the centralized algorithm on the complete original graph. For any data agent, SP protects its local information when generating the published graph. We propose novel solutions in SP based on random locks, permutation, and the Millionaire Protocol [10].
2 Problem Description
An online social network with rich interactions can be represented as a bipartite graph G(V, I, E) [1], where each node in V represents an entity in the network and each node in I stands for an interaction between a subset of entities in V. An edge e(u, i) ∈ E (u ∈ V, i ∈ I) means that entity u participates in interaction i. Each entity has an identity and a group of attributes. Without loss of generality, each entity's identifier can be represented as a unique id within [1, |V|]. The commutative encryption scheme [8] can assign each name a unique numeric id without releasing the real name values in a distributed environment. In the rest of this paper, entity u refers to the entity with id u. Each interaction also has an identity and a set of properties, as shown in Figure 1. An interaction can involve more than two entities, such as the interaction game3. Two entities can also be involved in different interactions at the same time. For example, in Figure 1, v1 and v2 participate in blog1 and blog2 simultaneously.
In a distributed environment, the graph G is stored in a distributed manner on l different data agents. That is, each agent Ai holds a portion of the graph Gi(Vi, Ii, Ei) such that V = ∪_{i=1}^{l} Vi, I = ∪_{i=1}^{l} Ii and E = ∪_{i=1}^{l} Ei. The interactions that an entity participates in may be stored on different agents. The Gi held by different data agents may overlap, so that the intersection between the graphs stored on different agents might not be empty.2 That is, ∩_{i=1}^{l} Vi ≠ ∅ and ∩_{i=1}^{l} Ii ≠ ∅.
2 When the Vi's do not overlap (∩_{i=1}^{l} Vi = ∅), G is a disconnected graph, since each data agent only holds a subgraph that is isolated from the other parts.
In the rest of this paper, we use node specifically for an entity in V and interaction for an item in I. When we say that two nodes u and v have a connection, it means that u and v are involved in the same interaction i in I (in the graph, we have the two edges e(u, i) and e(v, i)).
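To make the graph model concrete, the following is a minimal Python sketch (not the authors' code) of the bipartite representation G(V, I, E); the class name, entity ids and interaction labels are illustrative only.

# Minimal sketch of the bipartite interaction graph G(V, I, E).
from collections import defaultdict

class InteractionGraph:
    def __init__(self):
        self.entities = set()                  # V: entity ids
        self.interactions = defaultdict(set)   # I: interaction -> participating entity ids

    def add_edge(self, entity_id, interaction_id):
        """Edge e(u, i): entity u participates in interaction i."""
        self.entities.add(entity_id)
        self.interactions[interaction_id].add(entity_id)

    def connected(self, u, v):
        """Two entities are connected if they share at least one interaction."""
        return any(u in members and v in members
                   for members in self.interactions.values())

g = InteractionGraph()
g.add_edge(1, "blog1"); g.add_edge(2, "blog1")   # v1 and v2 share blog1
g.add_edge(1, "blog2"); g.add_edge(2, "blog2")   # ...and blog2
g.add_edge(3, "game3"); g.add_edge(4, "game3"); g.add_edge(5, "game3")
print(g.connected(1, 2), g.connected(1, 3))      # True False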
We propose a protocol which securely generates an S-Clustering graph [1] for interaction-based graphs in a distributed environment. S-Clustering [1] assumes that an attacker can know the id, the label and any connection information of a node. An attacker uses the information he knows about some users to analyze the published graph in order to learn more about these users. Given a constant k, S-Clustering publishes a clustered graph (the S-Clustering graph) which guarantees the following three Privacy objectives:
Objective 1: For any node u, an attacker has at most 1/k probability of recognizing it.
Objective 2: For any interaction i, an attacker has at most 1/k probability of knowing that a node u is involved in it.
Objective 3: For any two nodes u and v, an attacker has at most 1/k probability of knowing that they have a connection (u and v participate in the same interaction).
We assume that all participants in the computation, including all data agents and any necessary extra servers, are semi-honest. Local information beyond the Privacy objectives of S-Clustering is protected. The problem we solve in this paper is:
1. Input: A graph G stored in a distributed manner on l data agents and a privacy parameter k.
2. Output: An S-Clustering graph G′ of G under k.
3. Constraints: (a) Correctness: G′ is computed exactly as if it were generated on an agent who holds G; (b) Security: the three Privacy objectives of S-Clustering are guaranteed even when a participant obtains the intermediate results during the computation.
To guarantee the three privacy objectives, the S-Clustering graph G′ must satisfy:
Each cluster represents at least k entities. This guarantees Objective 1.
The Clustering Safety Condition (CSC) [1]:
3.1 Overview
if flag then
    Create a new cluster c and add c to CV;
    Insert(c, v);    /* End Stage 2 */
CSC(c, v) without releasing any connection information. Stage 3 and Stage 4 generate the interactions on clusters and the attributes for clusters, respectively, without disclosing any interaction-node mapping or node-attribute mapping. To summarize, we must avoid releasing the following information during the computation: (1) degree information (from Stage 1); (2) specific node-attribute mappings (from Stage 4); (3) specific connection information, including node-interaction mappings and connections between nodes (from Stages 2 & 3); (4) any CSC result and the computation order of nodes (from Stages 1 & 2).
Next, we introduce our design of SP for each stage in detail. Due to space limitations, we omit all proofs. Before presenting SP, we assume each node/interaction has a weight that represents how many duplicated copies of this node/interaction are stored in the system. We call these weights duplicated weights. Details of the method used to generate duplicated weights are given in Appendix A.
Protocol Design. In this stage, we sort the nodes without revealing any degree information or the computation order of nodes (the sorting order of nodes on their real ids). The basic idea is to do the sorting on permuted ids with their corresponding locked degree values. We use a random number to lock the real degree of each node. That is, we make one agent hold a key (a random number) and another agent hold the locked degree value (the real degree plus this random number). Then, we sort all nodes on permuted ids through the cooperation of these two agents. During the sorting, the agent who has the locked degree values cannot learn any key value and the agent who holds the keys cannot learn any locked value. The Secure Sorting Sub-Protocol (SSSP) works as follows (Figure 5):
1. A1 generates a random real-number vector D of size |V|. A1 constructs a variable vector D′ = D. Then, for each node u stored on A1, A1 adds Σ_{e(u,i)} 1/wi to D′[u] (wi is the duplicated weight of the interaction i). Finally, A1 sends D′ to A2;
2. When A2 receives D′, for each node u stored on A2, A2 adds Σ_{e(u,i)} 1/wi to D′[u]. Then, A2 sends D′ to A3;
3. Each Ai (i > 1) does the same operation as A2 and sends D′ to Ai+1 if Ai+1 exists;
4. Al generates a random permutation function π, passes π to A1 and sends π(D′) to S1;
5. A1 sends π(D) to S2;
6. S1 and S2 cooperatively sort all the nodes. During the sorting, each time S1 needs to compare two values π(D′)[i] − π(D)[i] and π(D′)[j] − π(D)[j], S1 and S2 use the Millionaires' Protocol [10]3 to securely get the result by comparing the value (π(D′)[i] − π(D′)[j]) on S1 with the value (π(D)[i] − π(D)[j]) on S2.
Theorem 1. SSSP sorts all nodes on π(id)s under the Security requirement.
3
The Millionaires' Problem is: suppose two people hold two numbers a and b; they want to learn whether a > b or a < b without revealing the actual values of a and b. The Millionaires' Protocol [10] can securely compare a and b based on techniques such as homomorphic encryption. When we want to compare two degree values d1 and d2 in the case where Sx knows r1, r2 and Sy knows (r1 + d1), (r2 + d2), we can use the Millionaires' Protocol to compare the value ((r1 + d1) − (r2 + d2)) on Sy with the value (r1 − r2) on Sx.
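As an illustration of the locking idea behind SSSP, the following Python sketch (not part of the protocol implementation) replaces the Millionaires' Protocol with a plain comparison of the two masked differences, purely to show that the masking preserves the ordering of the real degrees; the node names and values are made up.

import random

# Sketch of the degree-locking idea: comparing d1 and d2 reduces to comparing
# (locked1 - locked2) with (key1 - key2), without either side seeing the real degrees.
degrees = {"u1": 5.0, "u2": 3.0}                       # real weighted degrees, never shared
keys = {u: random.random() * 100 for u in degrees}     # D: random keys held by one agent
locked = {u: degrees[u] + keys[u] for u in degrees}    # D': locked values held by the other agent

lhs = locked["u1"] - locked["u2"]   # known only to the holder of the locked values
rhs = keys["u1"] - keys["u2"]       # known only to the holder of the keys
assert (lhs > rhs) == (degrees["u1"] > degrees["u2"])
print("u1 has the larger degree:", lhs > rhs)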
Fig. 6. The matrix sub-protocol of Stage 2: (1) generate a random |V|×|V| permutation π2; (2) compute L2′ by sorting MR row by row; (4) generate the matrix NMR based on π2⁻¹(LN2′); the vectors L2, L2′, LN2, LN2′ and the matrix NMR are exchanged among A1, Al, S2 and S3
The third step is to do the clustering on π(id)s. Before introducing how S1 conducts the clustering, we prove a property used in the next computation. For a node u, we call Eu (|Eu| = |V|) the connection vector of u. Eu is a 0/1 vector with Eu[u] = 1; if nodes u and i have a connection, Eu[i] = 1, otherwise Eu[i] = 0. For example, node v1's connection vector in Figure 1 is E1 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]. For a cluster of nodes C, we call EC = Σ_{u∈C} Eu the connection vector of C. The connection vector has the following property:
Theorem 2. For any cluster of nodes C, if there exists a t where EC [t] > 1, then C
does not satisfy CSC; otherwise, C satisfies the CSC.
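A minimal sketch of the connection-vector test in Theorem 2; the toy graph, the 0-based indices and the helper names are illustrative, not taken from the paper.

# Connection vectors and the CSC test of Theorem 2 (illustrative sketch).
def connection_vector(u, n, connections):
    """E_u: E_u[u] = 1 and E_u[i] = 1 iff u and i share an interaction."""
    e = [0] * n
    e[u] = 1
    for i in connections.get(u, ()):
        e[i] = 1
    return e

def satisfies_csc(cluster, n, connections):
    """A cluster violates CSC iff some entry of E_C = sum of E_u exceeds 1."""
    ec = [0] * n
    for u in cluster:
        for i, x in enumerate(connection_vector(u, n, connections)):
            ec[i] += x
    return all(x <= 1 for x in ec)

# toy graph on 5 nodes: 0-1 and 2-3 are connected
conns = {0: {1}, 1: {0}, 2: {3}, 3: {2}}
print(satisfies_csc({0, 2}, 5, conns))   # True: the two neighbourhoods are disjoint
print(satisfies_csc({0, 1}, 5, conns))   # False: both members cover nodes 0 and 1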
Fig. 7. Message flow among S1, S2 and S3 for the secure CSC check and the attribute-cluster mapping: a random permutation is generated and later inverted for CVatr; S1 computes ERC′ = π′(ERC) + R; S3 computes (ERC′ − RC′) to determine CSC(c, v), returning T if CSC holds and F otherwise; the attribute-cluster mapping CVatr is computed on permuted ids
For a cluster of nodes C on π(id)s, it is obvious that EC = Σ_{u∈C} (NMR′[u] − NMR[u]). Based on this property of the connection vector, when S1 needs to check whether v can join a cluster c based on CSC, the protocol works as follows (Figure 7):
1. S1 computes a connection vector ERC = Σ_{u∈C} NMR′[u], where C = c ∪ {v}. The size of this vector is |V|;
2. S1 generates a random permutation pattern π′ and sends (C, π′) to S2;
3. S2 computes a vector RC = π′(Σ_{u∈C} NMR[u]). S2 also generates a random vector R of size |V| and sends R to S1. S2 sends RC′ = RC + R to S3;
4. S1 computes ERC′ = π′(ERC) + R and passes ERC′ to S3;
5. S3 computes EC′ = ERC′ − RC′ (a permuted copy of EC). If EC′ contains a number bigger than 1, it returns "cannot cluster" to S1. Otherwise, it returns "can cluster" to S1.
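To illustrate the information flow of this sub-protocol, the following is a simplified single-process simulation (not the authors' implementation): the three servers are plain variables, nmr_locked/nmr_key stand in for the NMR′/NMR shares, and S3 ends up seeing only a permuted, mask-cancelled copy of EC.

import random

# Single-process simulation of the masked CSC check among S1, S2 and S3.
# For every node u, nmr_locked[u][i] - nmr_key[u][i] equals E_u[i].
n = 5
e = {0: [1, 1, 0, 0, 0], 1: [1, 1, 0, 0, 0], 2: [0, 0, 1, 1, 0]}    # toy connection vectors
nmr_key = {u: [random.random() for _ in range(n)] for u in e}         # S2's share (keys)
nmr_locked = {u: [e[u][i] + nmr_key[u][i] for i in range(n)] for u in e}  # S1's share (locked)

cluster = {0, 2}                      # C = c U {v}

erc = [sum(nmr_locked[u][i] for u in cluster) for i in range(n)]      # on S1
perm = list(range(n)); random.shuffle(perm)                           # S1's permutation, shared with S2
rc = [sum(nmr_key[u][perm[i]] for u in cluster) for i in range(n)]    # on S2, already permuted
mask = [random.random() for _ in range(n)]                            # S2's random vector R
rc_masked = [rc[i] + mask[i] for i in range(n)]                       # RC' sent to S3
erc_masked = [erc[perm[i]] + mask[i] for i in range(n)]               # ERC' sent to S3

ec_permuted = [round(erc_masked[i] - rc_masked[i]) for i in range(n)] # S3 recovers only the permuted E_C
print("can cluster" if max(ec_permuted) <= 1 else "cannot cluster")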
S1 does the clustering on L and uses the above method to test CSC. S1 finally gets the cluster set CV′ on π(id)s and passes CV′ to A1. A1 computes the cluster set on real ids as CV = π⁻¹(CV′).
Theorem 3. SCSP clusters all nodes without violating the Security requirement.
4 Experiment
To demonstrate the effectiveness of the SP protocol, we implement a Relaxed Secure Protocol (RSP) based on the naive approach (Fig. 3(b)). A brief introduction to the design of RSP is given in Appendix B. We compare the graphs generated by RSP and SP.
4.1 Criteria
We focus on two aspects to show the benefit of designing an SMC protocol: the information loss and the utilities. We estimate the information loss of RSP compared with SP (SP does not delete any interaction). Assuming del is the number of interactions deleted by RSP and the original complete graph contains |I| interactions, we use the ratio of deleted interactions (del/|I| × 100%) to represent the information loss.
Utility is used to estimate the quality of the published graph. For the clustering-based protection models [7,12,1], utility testing is performed by drawing sample graphs from the published one, measuring the utility of each sample and aggregating the utilities across samples. We test the following measures:
1. Degree distribution: Suppose the sorted degree sequence of the original graph G is DG and that of a sampled graph Gc is DGc. The difference between the degree distributions of G and Gc is represented as EDDD(DG, DGc) = (1/|V|) · sqrt( Σ_{i=1}^{|V|} (DG[i] − DGc[i])² ). We compute the average EDDD over 30 sampled graphs for each protocol. Suppose SP's result is EDDD,SP and RSP's result is EDDD,RSP; we use (EDDD,RSP − EDDD,SP) / EDDD,SP × 100% to compare the two protocols.
2. A group of randomly selected queries: We test the same three aggregate queries as Paper [1]. We compute the average query error between the sampled graphs and the original graph. Suppose SP's result is errorSP and RSP's result is errorRSP; we use (errorRSP − errorSP) / errorSP × 100% to compare them. The three aggregate queries are: (1) Pair Queries: how many nodes with a certain attribute interact with nodes with another attribute; (2) Trio Queries: how many two-hop neighbors; (3) Triangle Queries: how many triangles given three attributes. We select a random set of queries of each type for the testing: for each type, we select 20 queries and compute the average query error between the sampled graphs and the original graph.
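A small sketch of the degree-distribution criterion and the relative comparison used above; the degree sequences below are made up for illustration.

import math

def eddd(dg, dgc):
    """Degree-distribution distance: (1/|V|) * sqrt(sum of squared differences)."""
    assert len(dg) == len(dgc)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(dg, dgc))) / len(dg)

def relative_increase(e_rsp, e_sp):
    """(E_RSP - E_SP) / E_SP * 100%, the comparison used in the experiments."""
    return (e_rsp - e_sp) / e_sp * 100.0

dg = sorted([5, 4, 4, 3, 2, 1], reverse=True)      # original degree sequence (toy)
d_sp = sorted([5, 4, 3, 3, 2, 1], reverse=True)    # sample drawn from SP's graph (toy)
d_rsp = sorted([4, 3, 3, 2, 2, 1], reverse=True)   # sample drawn from RSP's graph (toy)
e_sp, e_rsp = eddd(dg, d_sp), eddd(dg, d_rsp)
print(f"E_SP={e_sp:.3f}  E_RSP={e_rsp:.3f}  increase={relative_increase(e_rsp, e_sp):.1f}%")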
4.3 Result
Result on Real Distributed Case
We compare the performance of SP and RSP under different values of the privacy parameter k. Figure 9(a) shows the information loss: RSP deletes 13% to 17% of the interactions, which is a fairly significant loss, and the graph published by RSP therefore fails to correctly represent the original graph.
Fig. 9 / Fig. 10. RSP vs. SP: information loss, increased degree-sequence distance (%), and query error; Fig. 10 plots these against the node overlapping ratio (%)
Figure 9(b) shows the comparison of the degree sequence distance: RSP performs 45% to 80% worse than SP, so the graph published by RSP is much worse than that of SP when measured by the degree distribution. Figure 9(c) shows the results of the queries: in most cases, RSP performs 10% to 25% worse than SP, and SP obtains a much better graph than RSP.
Result on Simulated Distributed Case
For the Simulated Distributed Case, we set k = 5 and compare the performance of SP and RSP under different node overlapping ratios r. Figure 10(a) shows the number of deleted interactions. A larger r means more overlap between data agents and more chances to delete an interaction. In most cases, RSP deletes 10% to 20% of the interactions. Figure 10(b) shows the comparison of the degree sequence distance: RSP performs 25% to 45% worse than SP. Figure 10(c) shows the results of the queries: in most cases, RSP performs 5% to 35% worse than SP. SP generates a graph with higher utility than RSP.
From the testing results, we find that SP exhibits benefits in both the information completeness and the utilities of the published graphs. First, RSP deletes roughly 13%–19% of the interactions; publishing a graph that loses such a large portion of information is not acceptable, whereas SP guarantees no information loss in the published graph. Second, the utilities of the graph generated by SP are much better than those of RSP. It is therefore necessary to design an SMC protocol such as SP for the privacy preserving graph publication problem. In fact, since SP has the same effect as running the state-of-the-art centralized graph construction algorithm on the complete original graph, the published graph generated by SP can be seen as the best achievable.
5 Related Work
Two types of models have been proposed for publishing a privacy preserving graph: the edge-editing based model [9,13,14,2,11] and the clustering based model [7,12,1]. The edge-editing based model adds/deletes edges to make the published graph satisfy certain properties according to the privacy requirements. For example, Liu [9] defined and implemented the k-degree anonymous model on the network structure, under which, for any node in a published graph, there exist at least k − 1 other nodes with the same degree. The clustering based model clusters similar nodes into super nodes. Since a clustered graph only contains super nodes, the probability of re-identifying any user can be bounded by 1/k by making each cluster's size at least k.
Frikken [4] designed a protocol which allows a group of agents to generate an integrated graph. However, the graph generated by this protocol cannot provide protection against the complex attacks proposed in recent works [9,7,3,12,11,13,1,14]. Kerschbaum [8] designed a protocol to generate an anonymized graph from multiple data agents. Each edge in the anonymized graph has a digital signature which can help trace where the edge comes from. However, simply removing the identifiers of nodes in a graph cannot resist an attack that aims to re-identify the nodes/links [7]. Therefore, it is essential to investigate protocols that support the stronger protection models [9,7,3,12,11,13,1,14] in a distributed environment. Our work generates a published graph that follows the recently proposed S-Clustering protection model. This model protects nodes and links against the strongest attack, which may use any information around nodes. Moreover, our work supports a more flexible graph model, while the protocols of Frikken [4] and Kerschbaum [8] only support homogeneous graphs.
6 Conclusion
In this paper, we target on the secure multi-party privacy preserving social network
publication problem. We design a SMC protocol SP for the latest clustering based
graph protection model. SP can securely generate a published graph with the same
quality as the one generated in the centralized environment. As far as our knowledge,
this is the first work on SMC privacy preserving graph publication against the structure
attack. In the future, one interesting direction is to study how to enhance the current
solution to handle the malicious or accessory agents. How to design SMC protocols for
the editing based graph protection models will be another interesting direction.
References
1. Bhagat, S., Cormode, G., Krishnamurthy, B., Srivastava, D.: Class-based graph anonymization for social network data. Proc. VLDB Endow. 2, 766–777 (2009)
2. Cheng, J., Fu, A.W.-C., Liu, J.: K-isomorphism: privacy preserving network publication against structural attacks. In: SIGMOD 2010, pp. 459–470. ACM, New York (2010)
3. Cormode, G., Srivastava, D., Yu, T., Zhang, Q.: Anonymizing bipartite graph data using safe groupings. Proc. VLDB Endow. 1, 833–844 (2008)
4. Frikken, K.B., Golle, P.: Private social network analysis: how to assemble pieces of a graph privately. In: WPES 2006, pp. 89–98. ACM, New York (2006)
5. Ganta, S.R., Kasiviswanathan, S., Smith, A.: Composition attacks and auxiliary information in data privacy. CoRR (2008)
6. Garg, S., Gupta, T., Carlsson, N., Mahanti, A.: Evolution of an online social aggregation network: an empirical study. In: IMC 2009, pp. 315–321. ACM, New York (2009)
7. Hay, M., Miklau, G., Jensen, D., Towsley, D., Weis, P.: Resisting structural re-identification in anonymized social networks. Proc. VLDB Endow. 1, 102–114 (2008)
8. Kerschbaum, F., Schaad, A.: Privacy-preserving social network analysis for criminal investigations. In: WPES 2008, pp. 9–14. ACM, New York (2008)
9. Liu, K., Terzi, E.: Towards identity anonymization on graphs. In: SIGMOD 2008, pp. 93–106 (2008)
10. Yao, A.C.: Protocols for secure computations. In: SFCS 1982, pp. 160–164. IEEE Computer Society, Washington, DC (1982)
11. Ying, X., Wu, X.: Randomizing social networks: a spectrum preserving approach. In: SDM 2008 (2008)
12. Zheleva, E., Getoor, L.: Preserving the privacy of sensitive relationships in graph data. In: Bonchi, F., Malin, B., Saygın, Y. (eds.) PinKDD 2007. LNCS, vol. 4890, pp. 153–171. Springer, Heidelberg (2008)
13. Zhou, B., Pei, J.: Preserving privacy in social networks against neighborhood attacks. In: ICDE 2008, pp. 506–515 (2008)
14. Zou, L., Chen, L., Özsu, M.T.: k-automorphism: a general framework for privacy preserving network publication. Proc. VLDB Endow. 2, 946–957 (2009)
1 Introduction
Data mining extracts useful implicit patterns within data, along with important correlations/affinities between the patterns, to discover knowledge from databases. Moreover, in data mining, correlation analysis is a special kind of measure that finds the underlying dependencies between objects. Frequent itemsets can be mined, but it is the correlation calculation among items that characterizes the dependencies and mutual correlation among the items within an itemset. The hidden information within databases,
and mainly the interesting association relationships among sets of objects, may disclose useful patterns for decision support, financial forecasting, marketing policies, even medical diagnosis, and many other applications.
Nowadays, data mining techniques are applied to non-traditional domains, e.g., complicated real-life scenarios where objects interact with the other objects surrounding them. To model such scenarios, graphs can be used, where the vertices of the graph correspond to entities and the edges correspond to relations among entities. Because of the combinatorially explosive search for subgraphs, which includes subgraph isomorphism testing, mining graph-structured data is difficult. Moreover, in mining graph data, the emphasis is on frequent labels and common topologies, unlike traditional data. Here, mining can be divided into level-by-level generate-and-test methods and pattern-growth-based approaches. AGM [1] and FSG [2] are of the former type, whereas gSpan [3] and graph pattern [4] are of the latter type, which require no candidate generation. In order to discover correlations, several measures are used [5], and to mine correlation in graph databases, existing works such as [6] mainly focus on structural similarity search. However, graphs that are structurally dissimilar but always appear together within the database may be more interesting, such as the chemical properties of isomers [7].
To capture effective correlations, the authors of [7] proposed the CGS algorithm, which mines graph correlation by adopting Pearson's correlation coefficient, taking into account the occurrence distributions of graphs. However, CGS works by searching the correlation of a specific query graph with the database. Therefore, it has some limitations in describing the inherent correlation among graphs, and domain knowledge is obligatory when using CGS; otherwise lots of queries would be meaningless. These facts motivated us to develop a new measure which can prune a large number of uncorrelated graphs and to design an algorithm to efficiently mine graph correlation. Our contributions are: a new graph correlation measure gConfidence to mine inherent correlation in graph databases, pruning a large number of candidates using the downward closure property of the measure, and an algorithm CGM (Correlated Graph Mining) to efficiently mine correlation by constructing a hierarchical reduced search space.
The rest of the paper is organized as follows: Section 2 contains our proposed scheme and Section 3 focuses on the performance analysis of our proposed algorithm. Finally, we conclude our work in Section 4.
Correlated graph mining is one of the most important graph mining tasks. We have therefore proposed a new measure, gConfidence, and a new method, CGM (Correlated Graph Mining), to search for correlation among graphs within a graph database. We mine correlated graphs by constructing a hierarchical search space using our proposed measure and algorithm.
gConfidence(Gs) = |{G ∈ GD | Gs ⊆ G}| / max_{ei ∈ E(Gs)} |{Gj ∈ GD | ei ∈ E(Gj)}|    (2)
Consider the scenario shown in Figure 1, where two frequent closed graphs are found from a set of graphs representing a group of people. Each graph in the set represents the friend circle of an individual, where nodes represent individuals and edges represent interactions among pairs of individuals. The circles around the closed graphs represent the interaction of a group of people all together. The labels of the edges and circles represent the frequencies of the edges and closed graphs, respectively. However, in mining the most correlated group, the frequency of the closed frequent graphs cannot help due to a tie, i.e., 30 for each. As a consequence, group G2 will be suggested as the most correlated by our proposed measure, since the maximum interaction of any pair in G1 is 100 and in G2 is 60. Therefore, gConfidence(G1) = 30/100 = 0.3 < gConfidence(G2) = 30/60 = 0.5.
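To make the measure concrete, the following sketch evaluates gConfidence on an edge-set abstraction of graphs (subgraph containment is reduced to edge-set containment purely for brevity; the actual CGM works on DFS codes and handles graph isomorphism). The database below is constructed so that it reproduces the 0.3 vs. 0.5 outcome of the friend-circle example.

# Illustrative gConfidence computation on an edge-set abstraction of graphs.
def g_confidence(gs_edges, database):
    support = sum(1 for g in database if gs_edges <= g)            # graphs containing G_s
    max_edge_support = max(sum(1 for g in database if e in g)      # most frequent edge of G_s
                           for e in gs_edges)
    return support / max_edge_support

# G1's most popular edge appears in 100 graphs, G2's in 60, each closed graph in 30.
db = ([frozenset({("a", "b"), ("b", "c")})] * 30        # graphs containing all of G1
      + [frozenset({("a", "b")})] * 70                  # graphs containing only G1's popular edge
      + [frozenset({("x", "y"), ("y", "z")})] * 30      # graphs containing all of G2
      + [frozenset({("x", "y")})] * 30)                 # graphs containing only G2's popular edge
g1 = frozenset({("a", "b"), ("b", "c")})
g2 = frozenset({("x", "y"), ("y", "z")})
print(g_confidence(g1, db), g_confidence(g2, db))       # 0.3 and 0.5, as in the text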
We have created a hierarchical tree-like structure for efficiently searching the correlation among graphs within graph databases. The tree is defined as follows:
Definition 3. (gConfidence Tree) A tree where each node represents a graph or subgraph by storing the corresponding DFS code, and represents correlation by storing the gConfidence value gC. Moreover, the relation between a parent node and a child node complies with the rule that a parent is one edge smaller than its child, a child is one edge larger than its parent, and no child has a gC greater than its parent. The relation between siblings is consistent with the DFS lexicographic order. That is, the pre-order search of the gConfidence tree follows the DFS lexicographic order.
In Figure 2, we show a gConfidence code tree where the (n+1)-th level of the tree has nodes which contain DFS codes of n-edge graphs and a value gC which represents the inherent correlation among the nodes, edges and subgraphs of that particular graph. Any node in the gConfidence tree contains a valid DFS code. Certainly, some of the nodes contain a minimum DFS code while others do not, and there could be some nodes having gC values smaller than the minimum correlation threshold. The value of gC also maintains a parent-child relationship, that is, gC(α) ≥ gC(β) where α = (a0, a1, ..., am) and β = (a0, a1, ..., am, b), i.e., α is β's parent.
Now we describe our algorithm for mining graph correlation within graph databases using our proposed measure gConfidence. Since correlation is searched based on a user-specified minimum correlation threshold, the search space can be pruned based on two values: a minimum support threshold and a minimum confidence threshold. The proposed CGM is an edge-based correlation mining algorithm, and we also use the concept of a Projected Database to reduce the costly searching operations for counting the occurrences of any graph/subgraph.
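The following skeleton (a simplification, not the actual CGM code) shows how the two thresholds drive the depth-first expansion of the gConfidence tree: a candidate is expanded only if it is both frequent and correlated, and the downward closure of gConfidence justifies pruning the whole subtree otherwise. Graphs are again abstracted as edge sets, and children() is a placeholder that grows a candidate by one database edge at a time.

# Simplified CGM-style search over an edge-set abstraction of graphs (illustrative only;
# the real CGM works on DFS codes, a GLOBAL EDGE MATRIX and projected databases).
def support(gs, db):
    return sum(1 for g in db if gs <= g)

def g_confidence(gs, db):
    return support(gs, db) / max(sum(1 for g in db if e in g) for e in gs)

def children(gs, all_edges):
    """Placeholder child generation: grow the candidate by one unused edge."""
    return (gs | {e} for e in all_edges if e not in gs)

def cgm(gs, db, all_edges, min_sup, min_conf, results, seen):
    if gs in seen:
        return
    seen.add(gs)
    if support(gs, db) < min_sup:         # infrequent: its supergraphs cannot be frequent
        return
    if g_confidence(gs, db) < min_conf:   # downward closure: prune the whole subtree
        return
    results.append(gs)
    for child in children(gs, all_edges):
        cgm(child, db, all_edges, min_sup, min_conf, results, seen)

db = [frozenset({("a", "b"), ("b", "c")}),
      frozenset({("a", "b")}),
      frozenset({("b", "c"), ("c", "d")})]
edges = {e for g in db for e in g}
found, seen = [], set()
for e in edges:                            # level 1: single-edge candidates
    cgm(frozenset({e}), db, edges, 1, 0.5, found, seen)
print(found)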
To illustrate the working procedure of our CGM algorithm, consider the graph database for the chemical dataset shown in Figure 3. We assume a support threshold of 1/3 and a correlation threshold of 2/3. According to our algorithm, we calculate the support and gConfidence of each edge and vertex and then select the frequent and correlated vertices and edges. Next, we construct a single-edge graph for each frequent-correlated edge, and the edge set is also used to construct the GLOBAL EDGE MATRIX. This matrix is sorted based on support count and DFS code and can be used in order for constructing potential children (smallest-DFS-code child first); it also helps in counting the maximum elementary edge support count of a graph.
We create a null-rooted gConfidence tree and then start mining from the first 1-edge graph C1. Since it satisfies both threshold values, we add it to the search space and start mining the correlation of its potential children recursively. Its first child C1.1 is not frequent, hence we prune it. The second child C1.2 is frequent and correlated, hence it is added to the search space. However, recursive calculation of its children shows that C1.2.1 and C1.2.4 are frequent but not correlated, and C1.2.2 and C1.2.3 are not frequent.
As a consequence, the fifth child of C1.2, that is C1.2.5, is found correlated and added to the gConfidence tree. The eighth child of C1.2.5 is found correlated and added to the search space, but the first seven children of C1.2.5 are not frequent, hence cannot be correlated, and have been pruned from the search space. Therefore, we further search for the correlation of all possible children of C1.2.5.8. Since no children of C1.2.5.8 are found frequent-correlated, we backtrack to C1.2.5. But we have already checked all possible children of C1.2.5, hence we trace back again to C1.2 to check the correlation of its remaining children.
In this way we calculate the correlation for the graph database of Figure 3 and obtain a complete gConfidence tree, where the nodes of the tree contain correlated graphs along with the amount of correlation. All the steps discussed above are illustrated in Figure 3.
3 Experimental Results
Fig. 4. Processing time wrt graph density on MOLT-4
Fig. 5. Processing time wrt graph density on D200kN30T80v50
Fig. 8. CGM vs. CGS on D200kN20T40v30
Fig. 9. CGM vs. gSpan on D200kN20T40v30
Fig. 10. Performance in filtering (%) on MOLT-4
graph database and a synthetic database characterized by D200kN20T40v30, respectively. We found that 50 to 100 seconds are required to mine the NCI-H23 database and 300 to 1200 seconds are required for the synthetic dataset with any confidence threshold within the range.
To compare the performance of CGS with CGM, we again use the denser synthetic dataset from the scalability assessment above. Here, the support threshold is 5% and the confidence threshold is 50%. The comparison is shown in Figure 8, which illustrates a significant performance gain.
We also evaluate the performance of our proposed algorithm in filtering out less significant graphs. By comparing it with the well-known frequent subgraph mining algorithm gSpan, using the denser synthetic dataset from the scalability assessment, we found our algorithm efficient in comparison with existing algorithms. Here, we consider a support threshold of 5% and a confidence
threshold of 50%. The comparison can be found in Figure 9. Figure 10 contains the percentage of graphs that are uncorrelated and filtered out by CGM with respect to the graphs selected by gSpan. Figure 10 shows that the filtering percentage of CGM ranges from 10% up to 40% on the real-life dataset MOLT-4 for various gConfidence thresholds.
4 Conclusions
Mining frequent patterns or sub-patterns with a large support threshold could miss some interesting patterns. At the same time, a threshold small enough to capture such rare but interesting items could generate lots of spurious patterns. Therefore, association and correlation analysis is used in conjunction with frequent pattern mining to mine frequent, interesting patterns from a collection of itemsets, but correlation searching is a challenging task. In a graph database, correlation searching is even more challenging due to the fact that searching for frequent subgraphs faces the graph isomorphism problem. Therefore, a new measure gConfidence and an algorithm CGM are proposed for correlation mining in graph databases, which can capture more of the interesting inherent correlation among graphs. gConfidence has the downward closure property, which helps in pruning the descendants of non-correlated candidates. The proposed method constructs a tree-like search space named the gConfidence tree to efficiently mine the correlation. We have performed an extensive performance analysis of CGM and found it efficient, outperforming existing works on correlation search in terms of speed. The proposed algorithm can be applied in both traditional and non-traditional domains such as bio-informatics, computer vision, various networks, machine learning, the chemical domain and various other real-life domains.
References
1. Inokuchi, A., Washio, T., Motoda, H.: An apriori-based algorithm for mining frequent substructures from graph data. In: Zighed, D.A., Komorowski, J., Zytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 13–23. Springer, Heidelberg (2000)
2. Kuramochi, M., Karypis, G.: Frequent subgraph discovery. In: ICDM, pp. 313–320. IEEE Computer Society (2001)
3. Yan, X., Han, J.: gSpan: Graph-based substructure pattern mining. In: ICDM, pp. 721–724. IEEE Computer Society (2002)
4. Li, J., Liu, Y., Gao, H.: Efficient algorithms for summarizing graph patterns. IEEE Trans. Knowl. Data Eng. 23(9), 1388–1405 (2011)
5. Tan, P.N., Kumar, V., Srivastava, J.: Selecting the right interestingness measure for association patterns. In: KDD, pp. 32–41. ACM (2002)
6. Yan, X., Zhu, F., Yu, P.S., Han, J.: Feature-based similarity search in graph structures. ACM Trans. Database Syst. 31(4), 1418–1453 (2006)
7. Ke, Y., Cheng, J., Ng, W.: Efficient correlation search from graph databases. IEEE Trans. Knowl. Data Eng. 20(12), 1601–1615 (2008)
8. Pubchem web site for information on biological activities of small molecules (2011), http://pubchem.ncbi.nlm.nih.gov
Improved Parallel Processing
of Massive De Bruijn Graph for Genome Assembly
1 Introduction
sequence has been sequenced. Due to the diversity of species, there are still a large number of species that need to be sequenced. Moreover, modern medical research shows that most diseases have connections with genes. Genome sequencing results can help reveal the mystery of genetic variation, and sequencing is now widely used in genetic diagnosis, gene therapy, drug design, etc. [4].
Whole genome sequencing has two main steps. The first step is to obtain the sequence of short reads. It needs to amplify the DNA molecule first (the multiple of amplification is referred to as coverage); these DNA molecules are then randomly broken into many small fragments (each small fragment is referred to as a read), and the short read is determined by sequencing devices, as illustrated in Figure 1. The second step is to reconstruct the genome sequence from these short reads, which is usually referred to as genome assembly. The genome assembly problem is proved to be NP-hard [1], by reduction from the Shortest Common Superstring (SCS) problem.
Many optimization criteria and heuristics have been presented to deal with the assembly problem. Early on, the length of fragments obtained by the Sanger sequencing method could reach 200 bp (base pairs). The overlap graph model is more effective for such reads, but the cost of sequencing is relatively high. For example, the Human Genome Project, completed with first-generation sequencing technology, spent $3 billion over 3 years. More recently, genome sequencing has entered an era of large-scale applications due to second-generation sequencing technology (also known as next-generation sequencing technology), such as Solexa, 454 and SOLiD [5]. There are three prominent features of second-generation sequencing technology, namely high throughput, short sequences and high coverage. For high throughput, a sequencing machine can simultaneously sequence millions of reads, which greatly reduces the whole cost. For short sequences, the read length is generally between 25 and 80 bp. In this regard, genome assembly has to greatly increase the DNA coverage [5] to ensure the integrity of the information. As the coverage increases, the total number of reads also increases greatly. For an overlap graph, the size of the graph grows significantly as a result. But for a De Bruijn graph, the size of the graph is only linear in the length of the reference genome. Therefore, in the face of second-generation sequencing technology, the De Bruijn graph model has a great advantage and has become much more popular. The proposed algorithms based on De Bruijn graphs include Euler [3], ALLPATHS [6], Velvet [7], IDBA [8], SOAPdenovo [9], ABySS [10] and YAGA [11]. Euler, ALLPATHS, Velvet and IDBA are serial algorithms, only suitable for small data sets. SOAPdenovo is a multi-threaded algorithm designed for large SMP machines, which can process a large data set, but it still spends more than 40 hours on a human genome [9] and the required infrastructure is very expensive.
The main phases of genome assembly based on a De Bruijn graph are as follows. First, De Bruijn graph construction. Second, error correction, which removes the known erroneous structures from the De Bruijn graph. Third, graph simplification, which compresses each single chain in the De Bruijn graph into a single compact node and also generates preliminary contig information. Fourth, scaffold generation, which merges the contigs from the previous step with the pair-end information obtained by sequencing to derive longer contigs (known as scaffolds). Fifth, outputting the obtained scaffolds. Generally, the first phase has the largest memory consumption, and the second and third phases have the largest time cost. Initially, the constructed graph is extremely sparse: the ratio of chain nodes is more than 99% [12].
YAGA [11] is a distributed algorithm with good parallelism and scalability which represents the state of the art. In this paper, we improve the parallel processing of the De Bruijn graph by constructing the De Bruijn graph in a distributed manner in the construction phase and then using a parallel graph traversal to obtain all chains in the simplification phase. Assuming that the errors have already been removed, our work focuses on improving the efficiency of parallel graph simplification. The main contribution of this paper is to formulate the graph simplification problem as a graph traversal problem. In the YAGA algorithm, graph simplification is approached as a list
ranking problem with a total of four rounds of global sorting and a multi-round recursion. Our algorithm directly traverses from end nodes to access all chains and visits each node only once. Our algorithm has a lower computing complexity and a lower communication complexity than the YAGA algorithm; here g is the length of the reference genome and p is the number of processors.
Section 2 gives the definition of the De Bruijn graph and a detailed description of graph construction. Section 3 describes the parallel graph simplification. We show experimental results in Section 4. Section 5 concludes this paper.
We first explain the definition of the De Bruijn graph. Then, in order to achieve efficient parallel De Bruijn graph construction, we discuss a compact representation of the De Bruijn graph. Finally, a parallel De Bruijn graph construction algorithm is given, which distributes the storage of the graph across multiple computational nodes.
Let s be a DNA sequence of length n, i.e., a sequence consisting of the four bases A, C, T and G. A substring of s with length k is called a k-mer. The set of all k-mers derived from s is referred to as the k-spectrum of s. For a k-mer s, its reverse complement is the reverse of the complementary sequence of s; it can be obtained by first complementing each character of s and then reversing the resulting sequence. The complementary rules are A ↔ T and C ↔ G.
A k-molecule represents a pair of k-mers that are reverse complements of each other. Comparing the two k-mers of a k-molecule lexicographically, we designate the larger one as the positive k-mer and the other one as the negative k-mer (as shown in Figure 2). When the context is clear, a k-mer also stands for the k-molecule that contains it. Fig. 2 depicts the relationship between a k-mer and its corresponding k-molecule with an example.
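A short sketch of the k-mer conventions just described (reverse complement and the choice of the lexicographically larger k-mer as the positive representative of a k-molecule); the function names are illustrative.

COMPLEMENT = str.maketrans("ACGT", "TGCA")   # A<->T, C<->G

def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the result."""
    return kmer.translate(COMPLEMENT)[::-1]

def k_molecule(kmer: str):
    """Return (positive, negative) k-mers of the k-molecule containing kmer."""
    rc = reverse_complement(kmer)
    return (kmer, rc) if kmer > rc else (rc, kmer)

def kmers(seq: str, k: int):
    """The k-spectrum of a sequence: all substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(reverse_complement("TAG"))       # CTA
print(k_molecule("TAG"))               # ('TAG', 'CTA')
print(kmers("ACGTAG", 3))              # ['ACG', 'CGT', 'GTA', 'TAG']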
For two adjacent nodes, whose k-mers, positive or negative, differ in only one character, we can simply use this character to represent the corresponding edge. We use an 8-bit char-type variable to store these edges: a bit with value 1 indicates the existence of the corresponding edge, otherwise there is no such edge. The rule is: the top 4 bits represent the edges of the positive k-mer, with positions 0 to 3 corresponding to A, C, G, T; the other 4 bits represent the edges of the negative k-mer, with positions 4 to 7 corresponding to A, C, G, T. For example, the node TAG owns three edges connected to three nodes (TTA, AGT and TAG); as defined above, we use the positive k-mer TAG to represent the node. We set the third bit of the arc field to 1, which means that an edge exists from the positive k-mer (TAG) to the negative k-mer (AGG) of node CCT, since this bit is the third one and lies among the top 4. The edge record of node TAG is shown in Figure 4, where the bits of the shaded part with value 1 represent the existence of these edges. Accordingly, only 9 bytes are used to store one node in our De Bruijn graph.
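A sketch of the 8-bit edge encoding described above, assuming bit positions 0-3 record the edges of the positive k-mer and positions 4-7 those of the negative k-mer, each position standing for the base A, C, G or T labelling the edge; the helper names are not from the paper.

BASES = "ACGT"   # bit offsets 0..3 within each 4-bit half

def set_edge(arc: int, base: str, positive: bool) -> int:
    """Record an edge labelled `base` for the positive (bits 0-3) or negative (bits 4-7) k-mer."""
    bit = BASES.index(base) + (0 if positive else 4)
    return arc | (1 << bit)

def has_edge(arc: int, base: str, positive: bool) -> bool:
    bit = BASES.index(base) + (0 if positive else 4)
    return bool(arc & (1 << bit))

arc = 0
arc = set_edge(arc, "G", positive=True)    # an edge of the positive k-mer labelled G
arc = set_edge(arc, "A", positive=False)   # an edge of the negative k-mer labelled A
print(f"{arc:08b}", has_edge(arc, "G", True), has_edge(arc, "T", True))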
The parallel De Bruijn graph construction algorithm has three main steps. The first
step is to read all short sequences in parallel; the second step is to obtain all k-mers
and the edges between any two consecutive k-mers, and to obtain all k-molecules
based on those k-mers, repeated edges in a k-molecule are merged. The third step to
Overview. Our algorithm directly traverses from one end node of a chain to the other end node to simplify the chain.
We therefore use two threads for each processor, which collectively perform the concurrent graph traversal to identify all chains. As shown in Figure 8, thread A is responsible for the simplification task, i.e., merging all chain nodes identified over a chain; thread B, on the other hand, is dedicated to inter-process communication. The communication resolves local requests and deals with other processors' requests on traversing end nodes.
Data Structure. Each processor has two maps. One is locationMap, which stores nodes before simplification. The other is subGraphMap, which stores nodes after simplification. During initialization, subGraphMap is empty; nodes are inserted into it during simplification by the process of identifying chains.
Algorithm. Based on the above data structures, we outline the algorithm as follows (a simplified sketch is given after the steps):
Step 1: Initialization: the De Bruijn graph is distributed over all processors and each processor has a locationMap and a subGraphMap.
Step 2: Traversal of the graph (as shown in Figure 9): every processor starts traversing from an end node in its local locationMap. Since a chain has two end nodes, the traversal can fall into the following three cases: 1) a processor visits the same chain twice, once from each of the two end nodes; 2) two different processors visit the same chain twice, each one from a different end node; 3) two processors visit the same chain simultaneously by starting from the two end nodes at almost the same time. In order to reduce repeated access to the same chain, we add an endNodeID variable to store the ID of the end node in the direction of traversal. According to this ID, the merging thread can determine whether to continue or exit.
Step 3: Simplification: delete all chain nodes and append their information to the corresponding end node, then insert them into subGraphMap.
Step 4: Select another end node and repeat the processing from Step 2 until no end node can be found.
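The sketch below is a single-process simplification of the steps above: MPI communication, the endNodeID tie-breaking and the two-thread split are all omitted, nodes with exactly two neighbours are treated as chain nodes, and each maximal chain is collapsed onto one of its end nodes.

# Single-process sketch of chain compaction on an adjacency-list graph.
def compact_chains(adj):
    location_map = dict(adj)        # nodes before simplification
    subgraph_map = {}               # end node -> merged chains (nodes after simplification)

    def is_end(v):                  # an end node has degree != 2
        return len(location_map[v]) != 2

    visited = set()
    for start in [v for v in location_map if is_end(v)]:
        for nxt in location_map[start]:
            chain, prev, cur = [], start, nxt
            while not is_end(cur) and cur not in visited:
                visited.add(cur)
                chain.append(cur)
                prev, cur = cur, next(n for n in location_map[cur] if n != prev)
            if chain:               # collapse the whole chain onto its starting end node
                subgraph_map.setdefault(start, []).append(chain + [cur])
    return subgraph_map

# toy graph: a-b-c-d-e is a chain (b, c, d have degree 2), e branches to f and g
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"],
       "e": ["d", "f", "g"], "f": ["e"], "g": ["e"]}
print(compact_chains(adj))          # {'a': [['b', 'c', 'd', 'e']]}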
3.3 Analysis
The YAGA algorithm transforms the problem of simplifying all chains into a list ranking problem, which requires four rounds of global sorting over all graph nodes. In this procedure, a large number of nodes are moved four times, so the time and communication costs are very large; besides the four rounds of global sorting, the cost of the other processing involved is also considerable.
In our algorithm, each processor directly traverses from its local end nodes to get all chains. A chain only holds two end nodes, so each chain node is accessed at most twice, and our algorithm also prevents a chain from being visited twice. So, in the worst case, all nodes are processed twice and each node is requested at most twice; the time and communication costs are therefore linear in the number of graph nodes.
4 Performance Evaluation
Our assembly software is written in C++ and MPI. The experimental platform includes 10 servers interconnected by an InfiniBand network; each server is configured with 16 cores and 32 GB of shared memory. We choose the Yeast and C. elegans genomes, and use Perl scripts to automatically generate Yeast test data with 17,007,362 reads and C. elegans test data with 140,396,108 reads. The length of the reads ranges from 36 bp to 50 bp with an error rate of 0, and the coverage for both datasets is 50X. Our goal is to demonstrate the scalability of our assembler with respect to graph construction and graph compaction on a large distributed De Bruijn graph.
First, we tested the assembler on the Yeast data set. The runtime is displayed in Figure 10 and the time is divided into three phases: first, reading the input files in parallel from a distributed file system; second, graph construction; third, graph compaction. Most of the time in the third phase is network transmission cost, but in our algorithm most nodes are moved only once, which is an effective optimization. Figure 10 indicates that the algorithm has good scalability: when the number of processors is increased from 8 to 128, the total time is reduced from 490s to 63s, a speedup of about 8 times.
The C. elegans dataset is 10 times larger than the Yeast dataset. The running time is shown in Figure 11. When the number of processors is increased from 8 to 128, the total time is reduced from 3844s to 375s, a speedup of about 10 times.
5 Conclusion
We propose to use a single depth-first traversal over the underlying De Bruijn graph to simplify it. The new method is fast, effective and can be executed in parallel. Testing on the two data sets, the experimental results show that the algorithm has good scalability. After the graph simplification is completed, the scale of the graph is sharply reduced, so the graph simplification is of great value. The later stages include resolving branches and repeats using pair-end information on a single machine. We will study these problems in our future work.
Acknowledgments
This work is supported by NSFC of China (Grant No. 61103049) and Shenzhen
Internet Industry Development Fund (Grant No. JC201005270342A). The authors
would like to thank the anonymous reviewers for their helpful comments.
References
[1] Kundeti, V.K., Rajasekaran, S., Dinh, H., et al.: Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs. BMC Bioinformatics 11(560) (2010)
[2] Kececioglu, J.D., Myers, E.W.: Combinatorial algorithms for DNA sequence assembly. Algorithmica 13(1), 7–51 (1995)
[3] Pevzner, P.A., Tang, H., Waterman, M.S.: An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences of the United States of America 98(17), 9748–9753 (2001)
[4] Medvedev, P., Georgiou, K., Myers, G., Brudno, M.: Computability of models for sequence assembly. In: Giancarlo, R., Hannenhalli, S. (eds.) WABI 2007. LNCS (LNBI), vol. 4645, pp. 289–301. Springer, Heidelberg (2007)
[5] Jackson, B.G., Aluru, S.: Parallel construction of bidirected string graphs for genome assembly, 346–353 (2008)
[6] Butler, J., Maccallum, I., Kleber, M., et al.: ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 18(5), 810–820 (2008)
[7] Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5), 821–829 (2008)
[8] Peng, Y., Leung, H.C.M., Yiu, S.M., Chin, F.Y.L.: IDBA - A Practical Iterative de Bruijn Graph De Novo Assembler. In: Berger, B. (ed.) RECOMB 2010. LNCS, vol. 6044, pp. 426–440. Springer, Heidelberg (2010)
[9] Li, R., Zhu, H., Ruan, J., et al.: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20(2), 265–272 (2010)
[10] Simpson, J.T., Wong, K., Jackman, S.D., et al.: ABySS: a parallel assembler for short read sequence data. Genome Res. 19(6), 1117–1123 (2009)
[11] Jackson, B., Regennitter, M., Yang, X., et al.: Parallel de novo assembly of large genomes from high-throughput short reads. IEEE (2010)
B3Clustering: Identifying Protein Complexes
from Protein-Protein Interaction Network
Abstract. Cluster analysis is one of the most important challenges for data mining in modern biology. Advances in experimental technologies have produced large amounts of binary protein-protein interaction data, but it is hard to find protein complexes in vitro. We introduce a new algorithm called B3Clustering which detects densely connected subgraphs from complicated and noisy graphs. B3Clustering finds clusters by adjusting the required density of subgraphs flexibly according to their size, because the more vertices a cluster has, the less dense it becomes. B3Clustering bisects the paths with distance of 3 into two groups and selects vertices from each group. We evaluate B3Clustering and two other clustering methods on three different PPI networks. Then, we compare the resulting clusters from each method with the benchmark complexes called CYC2008. The experimental results support the efficiency and robustness of B3Clustering for protein complex prediction in PPI networks.
1 Introduction
Protein complexes play essential roles in biological process and they are also important
to discover drugs in pharmaceutical process. Recent progress in experimental technolo-
gies has led to an unprecedented growth in binary protein-protein interaction (PPI) data.
Despite this accumulation of PPI data, considerably small amount of protein complexes
are identified in vitro. Thus, computational approaches are used to overcome the tech-
nological limit for detecting protein complexes. It becomes one of the most important
challenges for modern biology to identify protein complexes from binary PPI data. Pro-
teins are represented as vertices and their interactions are represented as edges in the
PPI network. Since protein complexes are shown as densely connected subgraphs in
PPI network, clustering methods are applied to find them from the network.
For example, some used clique finding algorithms to predict protein complexes from
PPI network (Spirin and Mirny [23] and Liu et al.[18]). They devised their own methods
to merge overlapping cliques.
Besides, Van Dongen (2000)[27] introduced MCL (Markov Clustering) as graph par-
titioning method by simulating random walks. It used two operators called expansion
and inflation, which boosts strong connections and demotes weak connections. Brohee
et al. (2006)[6] shows the robustness of MCL with comparison to three other clustering
algorithms (RNSC[12], SPC[5], MCODE[4]) for detecting protein complexes.
One of the recent emerging techniques is to use protein core attachments methods,
which identify protein-complex cores and add attachments into these cores to form pro-
tein complexes (CORE[15] and COACH[28]). Li et al.(2010)[17] shows that COACH
Fig. 1. The graph comes from the Biogrid database using Cytoscape
We evaluated B3Clustering using three PPI datasets and compared the results with MCL and COACH, using CYC2008 [22] as a benchmark of protein complexes. B3Clustering shows the capability to identify protein complexes in tightly interconnected PPI networks as well as in PPI networks with partially missing interactions. We found that it achieved a higher F-measure than MCL and COACH.
All the proteins mentioned in this paper are those of Saccharomyces cerevisiae, that is, baker's yeast. We introduce the basic concepts and explain the methods which B3Clustering uses in the next section. In Section 3, we give full details of the PPI data and the complex benchmark data which we use in this paper. In Section 4, we explain the evaluation methods we employ. We show the results of our experiments in Section 5. In the last section, we summarize our project and suggest future improvements to B3Clustering.
2 Method
Protein-protein interaction(PPI) data come in the form of connections between proteins,
which is easily described as a graph model. Proteins are represented as vertices and their
interactions are represented as edges in the graph. Since protein complexes are shown
as densely connected subgraphs in PPI network, clustering methods are applied to find
them from the PPI network. We introduce B3Clustering to identify clusters in the graph
and we apply it to detect protein complexes in the PPI network.
B3Clustering proceeds clustering in three steps. Firstly, it selects vertices and we call
it vertex selection step. Secondly, it decides if the vertices can form clusters or not and
it is called cluster selection step. Lastly, it merges clusters, which is called trimming
step. We introduce terminology we use before we explain three steps.
2.1 Terminology
For a graph G = (V, E), V(G) and E(G) denote the sets of vertices and edges of G, respectively. We assume that G is an undirected and simple graph. One of the important objects in B3Clustering is a path with distance of 3. A path in a graph is a sequence of vertices in which each vertex is connected to the next vertex in the sequence. We use paths with distinct vertices in B3Clustering.
Definition 2 (3-Path Set). Path3(u, v) is defined as the set of paths of length 3 from u to v:
Path3(u, v) = { {u, i, j, v} | (u, i), (i, j), (j, v) ∈ E(G), u ≠ i ≠ j ≠ v, for i, j ∈ V(G) }
We assume that u, i, j, v are members of V(G) and that they are different from each other. Path3(u, v) could be the empty set if u is not 3-path reachable to v; otherwise, it has paths with distance of 3 as its members.
Definition 3 (Bi-sets). Du(u, v) is defined as the set containing u, v, and the vertices which are linked to u in the paths of Path3(u, v). Dv(u, v) is defined as the set containing u, v, and the vertices which are linked to v in the paths of Path3(u, v). Du(u, v) and Dv(u, v) are called the bi-sets of Path3(u, v).
Du(u, v) = { u, v, i | (u, i) ∈ E(G), {u, v, i} ⊆ z for some z ∈ Path3(u, v) }
We extract two sets of vertices from the 3-path set and call them Du(u, v) and Dv(u, v). Each set includes u and v. Additionally, Du(u, v) includes the vertices which are linked to u in the paths of Path3(u, v), and Dv(u, v) includes the vertices which are linked to v in the paths of Path3(u, v).
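A direct sketch of Path3(u, v) and the bi-sets on a small adjacency-list graph; the vertex names are arbitrary and the helper names are not from the paper.

def path3(u, v, adj):
    """All paths u-i-j-v of length 3 with four distinct vertices."""
    return [(u, i, j, v)
            for i in adj[u] if i not in (u, v)
            for j in adj[i] if j not in (u, v, i) and v in adj[j]]

def bi_sets(u, v, adj):
    paths = path3(u, v, adj)
    du = {u, v} | {p[1] for p in paths}   # vertices adjacent to u on some 3-path
    dv = {u, v} | {p[2] for p in paths}   # vertices adjacent to v on some 3-path
    return du, dv

adj = {"u": {"a", "b"}, "a": {"u", "x"}, "b": {"u", "y"},
       "x": {"a", "v"}, "y": {"b", "v"}, "v": {"x", "y"}}
du, dv = bi_sets("u", "v", adj)
print(sorted(du), sorted(dv))   # ['a', 'b', 'u', 'v'] and ['u', 'v', 'x', 'y']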
Fig. 2. Example subgraph used in the worked example, with proteins YOR110W, YDR362C, YAL001C, YBR123C, YGR047C, YPL007C, YDR397C, YDR087C and YDL097C
Suppose a pair (u, v) is given. We can get Du(u, v) and Dv(u, v) from Definition 3. Then, for each vertex, we calculate the probability that it interacts with the members of the other set. We define the probabilities of interaction pu(i) and pv(j) as follows.
w(i, j) is an edge weight between i and j: if there exists an edge between i and j, then w(i, j) is 1; if i is the same vertex as j, then w(i, j) is also 1; otherwise, w(i, j) is 0.
w(i, j) = 1, if (i, j) ∈ E(G) or i = j; 0, otherwise.
We calculate pu(i) for every vertex i ∈ Du(u, v) and pv(j) for every vertex j ∈ Dv(u, v). If pu(i) is higher than the threshold, then i remains in Du(u, v); otherwise, we remove i from Du(u, v). Likewise, if pv(j) is higher than the threshold, then j remains in Dv(u, v); otherwise, we remove j from Dv(u, v).
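The displayed formulas for pu(i) and pv(j) are not reproduced above. The sketch below therefore assumes each probability is the fraction of the opposite bi-set that the vertex connects to, using the edge weight w(i, j) just defined; this assumed form, and the parameter name beta standing in for the elided threshold symbol, are consistent with the worked example that follows but are not taken verbatim from the paper.

def w(i, j, adj):
    """Edge weight: 1 if (i, j) is an edge or i == j, else 0."""
    return 1 if i == j or j in adj.get(i, ()) else 0

def filter_bi_sets(du, dv, adj, beta):
    """Keep a vertex only if it connects to at least a beta-fraction of the other bi-set.
    (Assumed form of p_u / p_v, not taken verbatim from the paper.)"""
    pu = {i: sum(w(i, j, adj) for j in dv) / len(dv) for i in du}
    pv = {j: sum(w(i, j, adj) for i in du) / len(du) for j in dv}
    return ({i for i in du if pu[i] >= beta}, {j for j in dv if pv[j] >= beta})

adj = {"u": {"a", "b", "v"}, "v": {"x", "y", "u"}, "a": {"u", "x", "y"},
       "b": {"u", "x"}, "x": {"v", "a", "b"}, "y": {"v", "a"}}
du, dv = {"u", "v", "a", "b"}, {"u", "v", "x", "y"}
print(filter_bi_sets(du, dv, adj, beta=0.5))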
The threshold decides the lowest margin for the probability of interaction. As a protein complex consists of more proteins, it shows lower connectivity; therefore, we set a lower threshold when there are more paths with distance of 3, and vice versa. The threshold is computed from |Path3(u, v)| and m = max_{i,j ∈ V(G)} |Path3(i, j)|, together with a given parameter, which we set to 0.7 in this paper.
Let us look at Fig. 2 as an example. If we start from (YOR110W, YDR362C), then we get the bi-sets shown in Table 1. From now on, we call DYOR110W(YOR110W, YDR362C) D1 and DYDR362C(YOR110W, YDR362C) D2 for short.
If we calculate the probability of interaction for each vertex in D1 and D2, each vertex has the value shown in Table 2.
If we set the threshold equal to 0.5, then YDR087C is removed from D1, and YDR397C and YDL097C are removed from D2. Finally, we get the D1 and D2 shown in Table 3.
pYOR110W                 pYDR362C
p1(YOR110W) = 1          p2(YOR110W) = 0.714
p1(YDR362C) = 0.714      p2(YDR362C) = 1
p1(YAL001C) = 0.857      p2(YAL001C) = 0.857
p1(YBR123C) = 0.857      p2(YBR123C) = 0.857
p1(YGR047C) = 0.714      p2(YGR047C) = 1
p1(YPL007C) = 0.571      p2(YDR397C) = 0.281
p1(YDR087C) = 0.281      p2(YDL097C) = 0.281
Let us look at the example in Table 3. The union of D1 and D2 has 6 members. Since D1 and D2 have 5 common vertices, s(YOR110W, YDR362C) = 0.833. Hence, D1 ∪ D2 is considered a cluster because its similarity value is higher than the threshold.
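Consistent with this example (5 common vertices and 6 vertices in the union giving s = 0.833), the similarity appears to be the Jaccard-style ratio of the two bi-sets; the tiny sketch below makes that assumption explicit.

def similarity(d1, d2):
    """Assumed similarity of two bi-sets: |D1 & D2| / |D1 | D2|."""
    return len(d1 & d2) / len(d1 | d2)

d1 = {"YOR110W", "YDR362C", "YAL001C", "YBR123C", "YGR047C", "YPL007C"}
d2 = {"YOR110W", "YDR362C", "YAL001C", "YBR123C", "YGR047C"}
print(round(similarity(d1, d2), 3))   # 0.833: D1 | D2 forms a cluster if this exceeds the threshold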
3 Evaluation
3.1 Data
We use three PPI datasets for Saccharomyces cerevisiae. One is a dataset from Krogan et al. (2006) [25] and another comes from the DIP database [30]. The last dataset is Biogrid 3.1.85, downloaded from the BIOGRID [25] database1.
1 http://www.thebiogrid.org/
Fig. 3. The flowchart of B3Clustering: for the input PPI data, get Path3(u, v) and l(u, v) for each vertex pair, merge clusters, and display the results
The Krogan and DIP datasets are used by Li et al. (2010) to evaluate the performance of
several clustering algorithms [17]. As shown in Table 4, the Krogan and DIP datasets have
a similar average degree, but BioGRID has a much higher average degree than both; it is
about ten times higher than that of the DIP dataset.
PPI data have a high rate of false positives, which has been estimated to be about 50%
[24]. The noise in the data hinders clustering methods from detecting protein complexes in
PPI data.
We use the CYC2008 complexes as a benchmark set. CYC2008 is published by Pu et
al. [22] and can be downloaded from http://wodaklab.org/cyc2008/. CYC2008
provides a comprehensive catalogue of 408 manually curated protein complexes in
Saccharomyces cerevisiae.
The CYC2008 complexes consist of 1,627 proteins, which is about one quarter of the
proteins in BioGRID 3.1.85. Among the 408 protein complexes, 259 are composed of
fewer than four proteins with 97% connectivity. As protein complexes consist of more
proteins, they have lower connectivity.
There are 172 protein complexes consisting of two proteins out of the 408 protein
complexes in CYC2008. Two-member complexes take the form of a single edge in the
graph, so they have to be chosen from among the edges in the PPI networks. For example,
we try to detect 172 protein complexes from the 193,679 edges in BioGRID 3.1.85.
3.2 F-Measure
Precision, recall, and F-measure have been used to evaluate performance in information
retrieval and data mining. We calculate the affinity score used by Li et al. [17] to get
the F-measure. The neighbourhood affinity score NA(p, b) is defined as follows.
NA(p, b) = |Vp ∩ Vb|² / (|Vp| · |Vb|)
Precision = Ncp / |P|
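A short sketch of how these measures can be computed over predicted clusters and benchmark complexes. The recall formula and the affinity cut-off (0.2, as commonly used by Li et al. [17]) are not given in the text above, so they are assumptions here.

```python
def na(p, b):
    """Neighbourhood affinity NA(p, b) = |Vp ∩ Vb|^2 / (|Vp| * |Vb|)."""
    p, b = set(p), set(b)
    inter = len(p & b)
    return inter * inter / (len(p) * len(b))

def f_measure(predicted, benchmark, omega=0.2):
    """Precision = Ncp / |P|; recall and the matching threshold omega are assumed.

    predicted, benchmark: lists of protein sets (clusters / complexes).
    """
    ncp = sum(1 for p in predicted if any(na(p, b) >= omega for b in benchmark))
    ncb = sum(1 for b in benchmark if any(na(p, b) >= omega for p in predicted))
    precision = ncp / len(predicted) if predicted else 0.0
    recall = ncb / len(benchmark) if benchmark else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```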
4 Result
We evaluate B3Clustering on three datasets: BioGRID 3.1.85, DIP, and Krogan, as
described in Section 3. We run MCL and COACH on the same datasets to compare their
performance with that of B3Clustering. We set the three parameters of B3Clustering to
0.7, 0.6, and 0.9, respectively.
From the results, B3Clustering has the highest F-measure on the Krogan and BioGRID
data, but it has a lower F-measure than COACH on the DIP data. COACH and B3Clustering
score an F-measure twice as high as MCL on the Krogan and DIP datasets, as shown in Fig.
4. MCL is expected to have the lowest F-measure because it partitions the PPI graph. On
the contrary, COACH and B3Clustering consider only some of the vertices and allow clusters
to overlap instead of partitioning.
For the BioGRID 3.1.85 data, MCL has an F-measure of 0.0261, producing 52 clusters,
one of which includes 5,786 proteins out of the 6,194 total proteins. COACH predicts as
many as 3,138 clusters, of which only 105 are similar to protein complexes in CYC2008.
Since COACH has low precision for BioGRID 3.1.85, it also shows a low F-measure of
0.055.
B3Clustering has an F-measure similar to COACH on the Krogan and DIP data, but a
six times higher F-measure than COACH on BioGRID 3.1.85.
The experimental results show that B3Clustering has the capability to detect clusters
from noisy graphs as well as from tightly interconnected graphs. They support the efficiency
and robustness of B3Clustering for protein complex prediction in PPI networks.
5 Related Works
Several clustering algorithms have been applied to biological networks in order to find
protein complexes. One of the traditional clustering approaches is the maximal clique finding
algorithm, which induces complete subgraphs that are not subsets of any larger subgraph.
Clique finding algorithms have been used to predict protein complexes from
PPI networks (Spirin and Mirny [23] and Liu et al. [18]). They devised their own methods
to merge overlapping cliques. Liu et al. [18] proposed an algorithm called CMC
(Clustering-based on Maximal Cliques), which used a variant of the Czekanowski-Dice
distance to rank cliques and merged highly overlapping cliques. If the interaction networks
are accurate and complete, then the maximal clique finding algorithm [26] can be ideal for
detecting protein complexes from the networks. However, since biological data are highly
noisy and incomplete, protein complexes do not form complete subgraphs in the biological
networks. To reduce the impact of the noise in the network, Pei et al. [21] proposed using
quasi-cliques instead of maximal cliques. Other approaches employ vertex similarity
measures to improve the confidence of noisy networks [14].
Fig. 4. Results from the Krogan et al., DIP, and BioGRID 3.1.85 datasets
Besides, Van Dongen (2000) [27] introduced MCL (Markov Clustering), a graph-partitioning
method based on simulating random walks. It uses two operators, called expansion
and inflation, which boost strong connections and demote weak connections. Iterative
expansion and inflation separate the graph into many subgraphs. Brohee et al.
(2006) [6] showed the robustness of MCL in comparison with three other clustering
algorithms (RNSC [12], SPC [5], MCODE [4]) for detecting protein complexes. Each
clustering algorithm was applied to binary PPI data to test its ability to extract
complexes from the networks, and the clusters were compared with the annotated MIPS
complexes [20].
MCODE [4] uses a vertex-weighting method based on the clustering coefficient,
which assesses the local neighbourhood density of a vertex. MCODE is an overlapping
clustering approach that allows clusters to overlap.
One of the recently emerging techniques is to use protein core-attachment methods,
which identify protein-complex cores and add attachments to these cores to form protein
complexes (CORE [15] and COACH [29]). Li et al. (2010) [17] showed that COACH
performs better than seven other clustering algorithms (MCODE [4], RNSC [12],
MCL [27], DPClus [3], CFinder [1], DECAFF [16], and CORE [15]) when they are
applied to two PPI datasets (DIP [30] and Krogan et al. [13] data).
6 Conclusion
As advances in technology have produced a wealth of data, data have become complicated
and erroneous. The noise in the data makes it hard to find desirable results in the
overflow of data. We introduce B3Clustering to detect clusters in complicated
and noisy graphs. B3Clustering controls the connectivity of clusters according to the
size of the clusters by two thresholds. Since protein complexes have lower connectivity at
larger sizes, it is reasonable to adjust the threshold according to the size of the protein
complexes. As we can see in the experimental results, B3Clustering performs better than
existing state-of-the-art clustering methods on densely connected graphs as well as on
noisy graphs.
In the future, data will be even bigger and more complicated. To save computation
time, we may let B3Clustering detect clusters from selected vertex pairs instead of all
vertex pairs.
References
1. Adamcsek, B., Palla, G., Farkas, I.J., Derenyi, I., Vicsek, T.: CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics 22(8), 1021-1023 (2006)
2. Aloy, P., et al.: Structure-based assembly of protein complexes in yeast. Science 303(5666), 2026-2029 (2004)
3. Altaf-Ul-Amin, M., Shinbo, Y., Mihara, K., Kurokawa, K., Kanaya, S.: Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics 7, 207 (2006)
4. Bader, G., Hogue, C.: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4, 2 (2003)
5. Blatt, M., Wiseman, S., Domany, E.: Superparamagnetic clustering of data. Phys. Rev. Lett. 76(18), 3251-3254 (1996)
6. Brohee, S., van Helden, J.: Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics 7, 488 (2006)
7. Cho, Y., Hwang, W., Ramanathan, M., Zhang, A.: Semantic integration to identify overlapping functional modules in protein interaction networks. BMC Bioinformatics 8, 265 (2007)
8. Dwight, S.S., et al.: Saccharomyces Genome Database provides secondary gene annotation using the Gene Ontology. Nucleic Acids Research 30(1), 69-72 (2002)
9. Friedel, C.C., Krumsiek, J., Zimmer, R.: Bootstrapping the Interactome: Unsupervised Identification of Protein Complexes in Yeast. In: Vingron, M., Wong, L. (eds.) RECOMB 2008. LNCS (LNBI), vol. 4955, pp. 3-16. Springer, Heidelberg (2008)
10. Gavin, A., Aloy, P., Grandi, P., Krause, R., Boesche, M., Marzioch, M., Rau, C., Jensen, L.J., Bastuck, S., Dumpelfeld, B., et al.: Proteome survey reveals modularity of the yeast cell machinery. Nature 440(7084), 631-636 (2006)
11. Gentleman, R., Huber, W.: Making the most of high-throughput protein-interaction data. Genome Biology 8(10), 112 (2007)
12. King, A., Przulj, N., Jurisica, I.: Protein complexes prediction via cost-based clustering. Bioinformatics 20(17), 3013-3020 (2004)
13. Krogan, N., Cagney, G., Yu, H., Zhong, G., Guo, X., et al.: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440(7082), 637-643 (2006)
14. Leight, E., Holme, P., Newman, J.: Vertex similarity in networks. Physical Review E 73, 026120 (2006)
15. Leung, H.C., Yiu, S.M., Xiang, Q., Chin, F.Y.: Predicting Protein Complexes from PPI Data: A Core-Attachment Approach. Journal of Computational Biology 16(2), 133-144 (2009)
16. Li, X., Foo, C., Ng, S.K.: Discovering protein complexes in dense reliable neighborhoods of protein interaction networks. In: Comput. Syst. Bioinformatics Conf., pp. 157-168 (2007)
17. Li, X., Wu, M., Kwoh, C.K., Ng, S.K.: Computational approaches for detecting protein complexes from protein interaction networks: a survey. BMC Bioinformatics 11(suppl. 1), S3 (2010)
18. Liu, G.M., Chua, H.N., Wong, L.: Complex discovery from weighted PPI networks. Bioinformatics 25(15), 1891-1897 (2009)
19. Mete, M., Tang, F., Xu, X.D., Yuruk, N.: A structural approach for finding functional modules from large biological networks. BMC Bioinformatics 9(suppl. 9), S19 (2008)
20. Mewes, H.W., et al.: MIPS: Analysis and annotation of proteins from whole genomes. Nucleic Acids Res. 32(Database Issue), D41-D44 (2004)
21. Pei, J., Jiang, D., Zhang, A.: On mining cross-graph quasi-cliques. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2005)
22. Pu, S., Wong, J., Turner, B., Cho, E., Wodak, S.J.: Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 37(3), 825-831 (2009)
23. Spirin, V., Mirny, L.: Protein complexes and functional modules in molecular networks. PNAS 100(21), 12123-12128 (2003)
24. Sprinzak, E., Sattah, S., Margalit, H.: How reliable are experimental protein-protein interaction data? Journal of Molecular Biology 327(5), 919-923 (2003)
25. Stark, C., Breitkreutz, B.J., Reguly, T., Boucher, L., Breitkreutz, A., Tyers, M.: BioGRID: A general repository for interaction datasets. Nucleic Acids Res. 34, D535-D539 (2006)
26. Tomita, E., Tanaka, A., Takahashi, H.: The worst-case time complexity for generating all maximal cliques and computational experiments. Theoretical Computer Science 363, 28-42 (2006)
27. Van Dongen, S.: Graph Clustering by Flow Simulation. University of Utrecht (2000)
28. Wu, D.D., Hu, X.: An efficient approach to detect a protein community from a seed. In: 2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB 2005), pp. 135-141. IEEE, La Jolla (2005)
29. Wu, M., Li, X., Kwoh, C.K., Ng, S.K.: A core-attachment based method to detect protein complexes in PPI networks. BMC Bioinformatics 10, 169 (2009)
30. Xenarios, I., Salwinski, L., Duan, X., Higney, P., Kim, S., Eisenberg, D.: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Research 30, 303-305 (2002)
Detecting Event Rumors on Sina Weibo Automatically
Abstract. Sina Weibo has become one of the most popular social networks in
China. In the meantime, it has also become a good place to spread various spams.
Unlike previous studies on detecting spams such as ads, pornographic messages
and phishing, we focus on identifying event rumors (rumors about social
events), which are more harmful than other kinds of spam, especially in China.
To detect event rumors among enormous numbers of posts, we studied the characteristics of
event rumors and extracted features which can distinguish rumors from ordinary posts.
The experiments conducted on a real dataset show that the new features
effectively improve the rumor classifier. Further analysis of the event
rumors reveals that they can be classified into 4 different types. We propose an
approach for detecting one major type, text-picture unmatched event rumors.
The experiment demonstrates that this approach performs well.
1 Introduction
Over the past several years, online social networks such as Facebook, Twitter, Re-
nren, and Weibo have become more and more popular. Among all these sites, Sina
Weibo is one of the leading micro-blogging service providers with eight times more
users than Twitter [1]. It is designed as a micro-blog platform for users to communi-
cate with their friends and keep track of hot trends. It allows users to publish micro-
blogs including short text messages with no more than 140 Chinese characters,
attached videos, audios and pictures to show their opinions and interests. All these
micro-blogs published by users will appear in their followers pages. Also users can
make comments on or retweet others' micro-blogs to express their view.
Weibo is a kind of convenient sites for users to show themselves and communicate
with others. It is put into service in 2009 by Sina Corporation. A recent official report
shows that Sina Weibo has more than 300 million registered users by the end of Feb-
ruary 29, 2012 and posts 100 million micro-blogs per day. Unfortunately, the wealth
of information also attract interests of spammers who attempt to send kinds of spams
* Corresponding authors.
Fig. 1. Instance of event rumor on Sina Weibo
Fig. 2. Instance of Sina Weibo Rumor-Busting
To detect spams, Sina Weibo has tried several ways, such as adding a "report as
spam" button to each micro-blog, as shown in the bottom left of Figure 1. Any user
who considers a micro-blog suspicious can report it to the Sina official Weibo account
(with the translated English user name of Weibo Rumor-Busting [1]), which specializes
in identifying fake micro-blogs manually. Figure 2 shows a micro-blog
published by the Sina official account, which points out that the message shown in
Figure 1 is a rumor and gives detailed and authentic reasons. Though this manual method
is accurate, it costs a lot of human effort and financial resources. What's more, there
is a certain delay with this approach: it cannot identify a spam quickly once it appears.
The rumor may have been spread widely and produced bad effects on society before it
is identified manually. So an automatic approach to detecting event rumors is essential.
To the best of our knowledge, [1] is the first paper that detects rumors automatically
on Sina Weibo, but it aims to detect general rumors rather than event rumors, and its
accuracy leaves considerable room for improvement.
In this paper we study how to detect a special kind of rumor, the event rumor, because
it is more harmful to our society than others. We first consider the problem of rumor
detection as a classification problem. To distinguish rumors from ordinary
posts, we design special features of event rumors on Sina Weibo and build classifiers
based on these features. Then we divide event rumors into 4 types and propose
a method to identify one major type among them. The experimental results indicate
that our approaches are effective.
The rest of the paper is organized as follows. In section 2 we give a review of re-
lated work. In section 3 we describe our two approaches to identifying event rumors
and provide a detailed description of the new features. In section 4 we describe how
to collect data and present the experimental results. Finally, we conclude this paper in
section 5.
2 Related Work
The previous research about spam detection mainly focuses on detecting spams re-
lated to ads, pornography, viruses, phishing and so on. These works can be divided
into two types. One is detecting spammers who publish social spams [5-10], while the
other is detecting spams directly [1-4].
Lee et al. [6] proposed a honeypot-based approach to uncover social spammers in
online social systems including MySpace and Twitter. They first deployed social honeypots
to harvest deceptive spam profiles from social networking communities.
Then they analyzed the properties of these spam profiles and built spam classifiers to
filter out spammers. Webb et al. [11] also used honeypots to collect spam profiles.
Benevenuto et al. [5] extracted attributes related to the content of tweets and
attributes about user behavior, and built an SVM classifier to detect spammers on Twitter.
This approach performs well in filtering out spammers who focus on posting
ads, especially link deception. Stringhini et al. [7] did a detailed analysis of spammers on
Facebook, MySpace and Twitter. They created honey profiles that accepted all friend requests
but sent none, in order to collect data about spamming
activity. Spam bots were divided into four categories: Displayer, Bragger, Poster and
Whisperer. A random forest algorithm was used to identify spammers.
Wang [3] extracted three graph-based features and three content-based features,
and formulated the problem of spam bot detection as a classification problem. He
then applied different classification algorithms such as decision tree, neural network,
support vector machines, and k-nearest neighbor, and found that the Bayesian
classifier performed best. The three graph-based features were extracted from
Twitter's social network graph, which represents the following relationships among
users. The content-based features contained the number of duplicate tweets, the number
of HTTP links, and the number of replies or mentions. Similar to this work [3],
Wang's other work [4] added a new content-based feature named "trending topics",
representing the number of tweets containing a hash tag (#) among a user's 20 most
recent tweets, to the feature set. Besides the features studied in previous
works, Yang et al. [1] extracted two new features, a client-based feature and a location-based
feature. The former refers to the client program that the user has used to post a
micro-blog, while the latter refers to the actual place where the event mentioned in the
micro-blog has happened. They trained a support vector machine classifier to measure
the impact of the proposed features on the classification performance. A noteworthy
point is that the dataset they used is from Sina Weibo. To the best of our knowledge,
this is the first paper that studies rumor detection on Sina Weibo. The difference between
this work and ours lies in that we focus on event rumors. Gao et al. [2] studied
the wall posts in Facebook, which are usually used to spread malicious content by
spammers. They built a wall-post similarity graph, clustered similar wall posts into
groups, and then identified spam clusters. The difference from previous works is the use
of a clustering method.
As mentioned above, the majority of previous works are about detecting spams
such as ads, viruses, phishing, pornography and so on. Unlike previous studies, we
focus on identifying event rumors (rumors about social events), which are more harmful
to national security and social harmony than other kinds of spam, especially in
China.
We use the following notation: the set of users; the set of micro-blogs; the set of a user's followers; the set of a user's friends; the posting time of a micro-blog; the text content of a micro-blog; and the picture attached to a micro-blog.
The number of URLs. Since Sina Weibo only allows users to post a short message
within 140 Chinese characters, URLs are always shown in a shortened format [3].
Shortened URLs can hide the URL's source, and users can't get direct information from
the URL. So the URL is an attribute that is often utilized by rumor publishers.
The number of duplications. To spread rumors widely, users may post the same micro-blog
several times. So the number of duplications is also a noteworthy feature.
Given two micro-blogs mi and mj, we measure their similarity based on the Jaccard
coefficient, as shown in Equation 1,
sim(mi, mj) = |Wi ∩ Wj| / |Wi ∪ Wj|    (1)
where Wi and Wj denote the word sets of mi and mj.
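As an illustration, a sketch of the duplicate count under the assumption that the Jaccard coefficient is taken over the word sets of the two micro-blogs; the tokenisation and the 0.8 duplicate threshold are illustrative choices, not values from the paper.

```python
def jaccard(text_a, text_b):
    """Jaccard coefficient over the word sets of two micro-blogs."""
    wa, wb = set(text_a.split()), set(text_b.split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def num_duplications(blog, history, threshold=0.8):
    """Count the user's earlier micro-blogs that look like duplicates of this one."""
    return sum(1 for other in history if jaccard(blog, other) >= threshold)
```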
Fig. 3. The distribution of the number of event verbs in rumors and non-rumors
Fig. 4. The proportions of rumors and non-rumors
User-Based Features. User-based features are 6 attributes about users, including the
number of a user's followers, the number of a user's friends, the reputation of the user, whether
the user has VIP authentication, the ratio of micro-blogs containing event verbs, and the ratio
of micro-blogs containing strong negative words. The former 4 features have been
studied already, while the latter 2 features are new features proposed by us. Sina Weibo
provides a platform for users to create social connections by following others and
allowing others to follow them. Figure 5 shows a simple social network graph, and the
arrows depict the following relationship between users. We can see that user A and
user D are the followers of user B, while user C and user D are friends of user B. User
B and user D are mutual followers.
The number of followers. The micro-blogs posted by a user will be seen by all his or
her followers. The more followers the user has, the wider the rumors spread.
The number of friends. Anyone can follow a user to become this user's follower without
seeking permission. As Sina Weibo will send a private message to you when someone
actively follows your account, rumor publishers use the following function to attract
users' attention. If you follow other accounts, you can also mention their
accounts by using a "@" tag in your micro-blogs; then they will receive your micro-blogs
even though they are not your followers.
Reputation. If a user has a small number of followers but a lot of friends, the possibility
that he or she posts rumors is higher than for other users. The reputation of a user is
defined by Equation 2.
(2)
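The exact form of Equation 2 is not recoverable here; a sketch under the assumption that reputation is the follower ratio used in earlier spam-detection work (e.g. [3]) would be:

```python
def reputation(num_followers, num_friends):
    # Assumed form: the share of followers among all of the user's connections,
    # so a user with few followers but many friends gets a low reputation.
    total = num_followers + num_friends
    return num_followers / total if total else 0.0
```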
VIP authentication. If a user's identity is verified by Sina Weibo officially, then a VIP
tag will appear after the user name. Verified users include famous actors, experts,
organizations, famous corporations and so on. Usually, VIP users are less likely
to post rumors.
Ratio of micro-blogs containing event verbs. This attribute is the proportion of
micro-blogs that contain event verbs among all micro-blogs posted by the user. The
higher this value, the more likely the user is to post event rumors.
Ratio of micro-blogs containing strong negative words. This attribute is the proportion
of micro-blogs that contain strong negative words among all micro-blogs posted
by the user. The higher this value, the more likely the user is to post event rumors.
Timespan. To make rumors look like normal micro-blogs, users usually attach
corresponding pictures to the micro-blogs. These pictures often come from the Internet
and were published before. We call them outdated pictures. Users utilize the outdated
pictures and new messages to make up a new micro-blog related to event rumors. The
timespan function is defined by Equation 3.
(3)
1 http://shitu.baidu.com/
Fig. 8. The distribution of timespan of micro-blogs on Sina Weibo
Fig. 9. The distribution of timespan of micro-blogs containing pictures on Sina Weibo
Based on these features, classifiers are built and used to predict whether a micro-
blog is an event rumor or not.
Further analysis of sample event rumors reveals that event rumors can be classified
into different types. In the following discussion, we provide our analysis of several
common types of event rumors and propose an effective method for one major type,
text-picture unmatched rumors.
Based on our observation, most event rumors belong to one of the following types.
The difference between the fabricated-details type and the time-sensitive type is that the
latter was once true but is out of date, while the former is fake.
Text-picture unmatched rumors: As shown in Figure 1, to increase an event rumor's
credibility, users often attach a picture to the text, but there is no relationship between
the picture and the text.
From Figure 8 we can see that about 80% of event rumors contain a picture, and according
to our observation a majority of them are text-picture unmatched rumors. Therefore,
we focus on studying how to detect this kind of event rumor. Figure 1 shows an
instance of a text-picture unmatched rumor, whose text describes an officer's investigation
of people's life under the protection of soldiers, while the picture actually shows
the team leader of the Chinese Embassy in the Republic of Iraq visiting the temporary
embassy accompanied by security personnel. The rumor account utilizes the unrelated
picture and new text to make up a new micro-blog, and we call it a text-picture
unmatched rumor. We propose a 5-step method to identify this kind of event rumor.
First, we create a list of 60 news websites consisting of almost all the major
domestic media and foreign media2. Second, we submit the picture attached to a micro-blog
as a query to a search engine to look for similar pictures. If the output is empty,
then we consider that the picture matches the text, and the micro-blog is a non-rumor.
Otherwise, we order the result records according to their websites' reliability
and their posting time in descending order. The websites in the list created in the first step
are credible and share the same credibility, while others are not credible. Third, we crawl
the main content of the top-ranked website. Finally, we use the Jaccard coefficient to
calculate the similarity between the micro-blog's text and the crawled content after
removing stop words. If they are similar, we consider the micro-blog a non-rumor.
Otherwise, it is a text-picture unmatched rumor.
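A sketch of the matching check described above. The reverse-image-search call, the page scraper, the stop-word list, and the similarity threshold are all hypothetical stand-ins rather than the authors' implementation.

```python
STOP_WORDS = {"the", "a", "of"}  # placeholder stop-word list

def remove_stop_words(text):
    return {w for w in text.split() if w not in STOP_WORDS}

def is_text_picture_unmatched(text, picture, credible_sites,
                              image_search, fetch_main_content,
                              sim_threshold=0.3):
    """Return True if the attached picture looks unrelated to the text."""
    hits = image_search(picture)           # pages that contain similar pictures
    if not hits:
        return False                       # nothing found: treat as matching (non-rumor)

    # Prefer credible news sites, then more recent postings.
    hits.sort(key=lambda h: (h["site"] in credible_sites, h["time"]), reverse=True)
    page_words = remove_stop_words(fetch_main_content(hits[0]["url"]))
    blog_words = remove_stop_words(text)

    union = blog_words | page_words
    similarity = len(blog_words & page_words) / len(union) if union else 0.0
    return similarity < sim_threshold      # unrelated picture => likely event rumor
```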
4 Experiments
2 http://news.hao123.com/wangzhi
by the search function provided by Sina Weibo. We select the top micro-blog
from the search results. We collect 104 event rumors matching the keywords of micro-blogs
published by Weibo Rumor-Busting from May 2011 to December 2011.
We also extract the profiles (number of followers, number of followings and VIP
authentication) of the users who published the event rumors, together with all their
micro-blogs. The profiles of their followers and followings and their micro-blogs are
also included in our dataset. Finally we create a high-quality dataset that consists of
1,943 users and 26,972 micro-blogs, which includes 104 event rumors and 26,868
non-rumors.
4.2 Evaluation
To assess the effectiveness of our methods, we use the standard information retrieval
metrics of precision, recall and F1 [12]. The precision is the ratio of the number of
rumors classified correctly to the total number of micro-blogs predicted as rumors.
The recall is the ratio of the number of rumors correctly classified to the total number
of true rumors. The F1 is the harmonic mean of precision and recall, defined as
F1 = 2pr/(p + r). We train 4 different classifiers used in previous works [1, 3-5],
including Naive Bayes, Bayesian Network, Neural Networks and Decision Tree.
A 10-fold cross-validation strategy is used to measure the impact of the features
introduced in Section 3.2 on the classification performance.
The experimental results are shown in Table 1. They indicate that before adding the 5
new features, the precision, recall and F-measure are 0.555, 0.202 and 0.269 respectively.
There are two reasons why the performance of these old features in our experiment
is worse than that in previous works. One is that the datasets are different, and the
other is that the purposes are different. Previous studies focus on identifying spams such
as ads, while we focus on detecting event rumors. After introducing the 5 new features
to the classification process, the precision, recall and F1 are improved to 0.85, 0.654 and
0.739 respectively. As shown in Figure 10, the precision, recall and F1-measure have
increased by 0.475, 0.048 and 0.276 respectively, demonstrating the effectiveness of our
proposed new features.
Fig. 10. The comparison between the effect of the feature set without and with the new features
5 Conclusion
In this paper we focus on detecting event rumors on Sina Weibo, which are more
harmful than other kinds of spam, especially in China. To distinguish event rumors
from normal messages, we propose five new features and use classification techniques
to predict which posts are event rumors. Further, we divide event rumors into four types
and propose a method to identify text-picture unmatched event rumors with the
help of a picture search engine. The experiments show the proposed approaches are
effective. In the future we will focus on studying how to detect the other three types of
event rumors.
References
1. Yang, F., Liu, Y., Yu, X., Yang, M.: Automatic detection of rumor on Sina Weibo. In: Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics, Beijing, China, pp. 1-7. ACM (2012)
2. Gao, H.Y., Hu, J., Wilson, C., Li, Z.C., Chen, Y., Zhao, B.Y.: Detecting and characterizing social spam campaigns. In: Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, Melbourne, Australia, pp. 35-47. ACM (2010)
3. Wang, A.H.: Don't follow me: Spam detection in Twitter. In: Proceedings of the 2010 International Conference on Security and Cryptography (SECRYPT) (2010)
4. Wang, A.H.: Detecting spam bots in online social networking sites: A machine learning approach. In: Foresti, S., Jajodia, S. (eds.) Data and Applications Security and Privacy XXIV. LNCS, vol. 6166, pp. 335-342. Springer, Heidelberg (2010)
5. Benevenuto, F., Magno, G., Rodrigues, T., Almeida, V.: Detecting Spammers on Twitter. In: Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference, Redmond, Washington, US (2010)
6. Lee, K., Caverlee, J., Webb, S.: Uncovering social spammers: social honeypots + machine learning. In: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Geneva, Switzerland, pp. 435-442. ACM (2010)
7. Stringhini, G., Kruegel, C., Vigna, G.: Detecting spammers on social networks. In: Proceedings of the 26th Annual Computer Security Applications Conference, Austin, Texas, pp. 1-9. ACM (2010)
8. Yang, C., Harkreader, R.C., Gu, G.: Die free or live hard? Empirical evaluation and new design for fighting evolving twitter spammers. In: Sommer, R., Balzarotti, D., Maier, G. (eds.) RAID 2011. LNCS, vol. 6961, pp. 318-337. Springer, Heidelberg (2011)
9. Benevenuto, F., Rodrigues, T., Almeida, V., Almeida, J., Goncalves, M.: Detecting spammers and content promoters in online video social networks. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Boston, MA, USA, pp. 620-627. ACM (2009)
10. Benevenuto, F., Rodrigues, T., Almeida, V., Almeida, J., Zhang, C., Ross, K.: Identifying video spammers in online social networks. In: Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web, Beijing, China, pp. 45-52. ACM (2008)
11. Webb, S., Caverlee, J., Pu, C.: Social Honeypots: Making Friends with a Spammer near You. In: Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference, Mountain View, California, US (2008)
12. Yang, Y.: An Evaluation of Statistical Approaches to Text Categorization. Journal of Information Retrieval 1(1/2), 67-88 (1999)
Uncertain Subgraph Query Processing
over Uncertain Graphs
Wenjing Ruan1 , Chaokun Wang1,2,4 , Lu Han1 , Zhuo Peng1 , and Yiyuan Bai1,3
1
School of Software, Tsinghua University, Beijing 100084, China
2
Tsinghua National Laboratory for Information Science and Technology
3
Department of Computer Science and Technology, Tsinghua University
4
Key Laboratory of Intelligent Information Processing, Institute of Computing
Technology, Chinese Academy of Sciences
{ruanwj11,hanl09,pengz09,baiyy10}@mails.tsinghua.edu.cn,
[email protected]
1 Introduction
With the rapid development of web technology, graphs have considerable importance
in modeling data in diverse scenarios, e.g., social networks, mobile networks and
sensor networks. If graph-structured data is uncertain, it is called an uncertain graph.
In this paper, for example, each edge of the graph has a probability of existence. That
is to say, the uncertainty is embodied in the relationships between the nodes of the
graph. A typical example of an uncertain graph is protein-protein interaction networks
(PPI networks for short) [5].
This work was supported by the National Natural Science Foundation of China
(No. 61170064, No. 61133002), the National High Technology Research and Devel-
opment Program of China (No. 2013AA013204) and the ICT Key Laboratory of
Intelligent Information Processing, CAS.
In PPI networks, nodes denote protein molecules while edges denote the interac-
tion relationships between protein molecules. Limited by the measuring methods,
we cannot accurately detect whether there is an interaction between two protein
molecules. Therefore, an existence probability is set for each interaction.
Recently, uncertain graphs have become an increasingly attractive focus in
many fields. A number of algorithms have been proposed to solve mining and
query problems in uncertain graph databases. Representative mining techniques
in this area include finding top-k maximal cliques [11] and mining frequent subgraph
patterns [10]. Various query methods have been proposed to solve essential
problems of uncertain graph databases, e.g., KNN query [4], the shortest path
query [9], reliable subgraph query [3], and so on. None of the prior works has
considered the uncertainty of the query graph, which is used to search for matches
in the uncertain graph database.
In the research field of uncertain graph management, the problem of uncertain
subgraph query emerges when both the dataset and the query are uncertain.
As shown in Figure 1, Q is an uncertain query graph, and D is an uncertain
data graph. The uncertain subgraph query returns all the pairs of isomorphic
existences between Q and D.
There are two challenges in dealing with the above-mentioned problem. Firstly,
subgraph isomorphism verification is an NP-hard problem, and it becomes
more complex when the dataset and the query are both uncertain. Secondly,
the uncertain subgraph query problem has high complexity, especially when the size
of the uncertain dataset is large.
To address the above challenges, a novel algorithm called Mutual-Match is
proposed in this paper. The intuitive idea of Mutual-Match is to use distributed
computing methods to greatly improve the query performance. Thanks to the
MapReduce model [1], Mutual-Match can effectively find all matching pairs,
each of which is composed of two isomorphic subgraphs derived, respectively,
from the query graph and the data graph.
To the best of our knowledge, this paper is the first one addressing the problem
of uncertain subgraph query. The main contributions of the paper are summarized
as follows.
1. The problem of uncertain subgraph query over uncertain graphs is formally
presented.
2. An algorithm called Mutual-Match is proposed to process the uncertain subgraph
query over uncertain graphs with the MapReduce model. It can also execute
on dynamic uncertain graphs.
3. Extensive experiments are conducted to show the correctness and effectiveness
of our proposed method.
The rest of the paper is organized as follows. Section 2 gives the formal definition
of the problem of uncertain subgraph query processing. We present a brute-force
algorithm, and then propose an index structure and the Mutual-Match algorithm
in Section 3. Moreover, we conduct an empirical study on both real-world and
synthetic datasets in Section 4. Finally, we conclude in Section 5.
2 Problem Definition
Definition 1. An uncertain graph is defined as G(V, E, Σ, L, P), where (1) V is
a set of nodes; (2) E ⊆ V × V is a set of edges; (3) Σ is a set of labels; (4)
L: V → Σ is a function that assigns a label to each node of V; and (5) P: E → [0, 1]
is a probability function for edges.
Given an uncertain graph G(V, E, Σ, L, P), we can find a certain graph
G'(V, E, Σ, L, P') which is generated from G when all the edge probabilities
of G are 1. We call G' the underlying graph of G.
For a clear formalization of the problem of uncertain subgraph query over
uncertain graphs, we use the possible world model. Each possible world of G
corresponds to a subgraph of G and has an existence probability that is computed
according to the probability of every edge in G. In this paper, we assume
that the probabilities of all edges are independent.
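A small sketch of the possible-world semantics under this edge-independence assumption; the graph representation and function names are ours, and the exhaustive enumeration is only feasible for very small graphs.

```python
from itertools import combinations

def world_probability(graph_edges, world_edges):
    """Existence probability of one possible world of an uncertain graph.

    graph_edges: dict mapping each edge (u, v) to its existence probability.
    world_edges: the set of edges that are present in this possible world.
    """
    prob = 1.0
    for edge, p in graph_edges.items():
        prob *= p if edge in world_edges else (1.0 - p)
    return prob

def possible_worlds(graph_edges):
    """Enumerate every possible world with its probability (exponential cost)."""
    edges = list(graph_edges)
    for k in range(len(edges) + 1):
        for subset in combinations(edges, k):
            world = set(subset)
            yield world, world_probability(graph_edges, world)
```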
Definition 2. Given an uncertain query graph Q and an uncertain data graph
D, the uncertain subgraph query is to find a set of matches. Each match, denoted
as a pair m(q, d), should satisfy the following two requirements.
1. q and d are certain graphs from the possible worlds of Q and D, respectively.
2. q and d are isomorphic.
As shown in Figure 1, if we require that the number of edges of an isomorphic subgraph
is not less than 3, there are 5 matching pairs when searching for uncertain subgraph
matches of Q in D. One of the pairs is m(q, d).
Definition 3. For a given uncertain query graph Q, an uncertain data graph D
and one of their matches m(q, d), the matching probability of m(q, d) is defined as
Pro(m(q, d)) = Pro(q) · Pro(d),
where Pro(q) and Pro(d) are the existence probabilities of the corresponding
possible worlds of Q and D, respectively.
3 Algorithms
For a given uncertain query graph Q and an uncertain data graph D, the brute-force
algorithm first transforms them into possible worlds WQ = {g1, g2, ..., gs} and
WD = {g1, g2, ..., gt}, respectively. Let CS = {q1, q2, ..., qm} be a subset of WQ
which contains all connected components of WQ. Each element of CS or WD is
a certain graph with an existence probability, so the brute-force algorithm
evaluates the uncertain subgraph query by processing certain subgraph queries
on WD m times. We build an index structure on all elements of WD with gIndex
[7,8], which performs best for most queries on sparse datasets in recent
research about indexes [2]. Then, we search subgraph matches against each
graph in CS using the gIndex.
This brute-force algorithm has to deal with an exponential number of subgraph
queries within an extremely large search space if the query and data graphs are
large. Clearly, it can only be applied to small datasets.
With the Mutual-Match algorithm, we need not know the possible worlds of the
uncertain data and query graphs before query processing. In order to get
a matching pair m(q, d) from the uncertain query graph Q and the uncertain
data graph D, the Mutual-Match algorithm first splits both Q and D into
small components which are denoted as bi-edge graphs. Then, with the help of
the edge join operation (EJoin), which joins components together, q and d are
gradually constructed.
Bi-Edge Index Structure. Before query processing, we build an index structure
of the uncertain graph database based on bi-edge graphs. Given an uncertain
data graph D, we generate Bie(D) by decomposing D into bi-edge graphs. If
a bi-edge graph b = (el(vl, vm), er(vm, vr)) ∈ Bie(D), the label of b is labell/labelm/labelr,
where labelm is from vm and labell < labelr (< is the partial order). The value
of b is saved as (dbi_id, el, er, vm, vl, vr), where dbi_id is the unique identifier of b.
We index bi-edge graphs by their labels. This bi-edge-graph-based index is implemented
by a hash chain H. The ids of the bi-edge graphs which have the same keyword
are saved in a list denoted by blist(label).
We generate Bie(Q) and store each bi-edge graph in Bie(Q) with the same
method as for the data graph. Then, we search the database index for every
label of the bi-edge graphs of Bie(Q) and get the input file for the MapReduce jobs.
The format of the input file is (el, er, qbi_id, vm, posv, pose), where el, er, vm are
from a bi-edge graph in Bie(D) and qbi_id is the unique identifier of a bi-edge graph
in Bie(Q). posv and pose are vectors that save the matched nodes and edges,
respectively. Every record of the input file is an initial matching pair.
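A sketch of the label-keyed bi-edge index described above. The record layout follows the tuple (dbi_id, el, er, vm, vl, vr), but the data structures and helper names are our reading of the description, not released code.

```python
from collections import defaultdict
from itertools import combinations

def bi_edge_label(label_l, label_m, label_r):
    """Key a bi-edge graph by its node labels, with the smaller outer label first."""
    lo, hi = sorted((label_l, label_r))
    return f"{lo}/{label_m}/{hi}"

def build_bi_edge_index(edges, node_label):
    """Decompose a graph into bi-edge graphs and index them by label.

    edges: iterable of (u, v) pairs; node_label: dict vertex -> label.
    Returns a hash chain H: label -> list of records (bi_id, e_l, e_r, v_m, v_l, v_r).
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    index = defaultdict(list)          # H: label -> blist(label)
    bi_id = 0
    for v_m, around in neighbours.items():
        for v_l, v_r in combinations(sorted(around), 2):
            key = bi_edge_label(node_label[v_l], node_label[v_m], node_label[v_r])
            index[key].append((bi_id, (v_l, v_m), (v_m, v_r), v_m, v_l, v_r))
            bi_id += 1
    return index
```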
Fig. 2. Two Different Join Manners. The above is a middle-join, and the below is a
non-middle-join.
3.3 Improvement
Our method can adapt well to dynamic uncertain graphs. That is, the Mutual-Match
algorithm can be improved to address the problem of uncertain subgraph
query over dynamic uncertain graphs.
When new edges are inserted or existing edges are removed and the original graph
D becomes a new graph D', we don't need to rebuild the bi-edge index for D'
by considering all edges of D'. Instead, we only take the new edges and the removed
edges into consideration.
Suppose we have run the Mutual-Match algorithm for the data graph D and
the query graph Q, and obtained the uncertain subgraph query result file RF which
contains all matching pairs of D and Q. When D changes, for each removed edge
e, we find the matching pairs which contain e and remove these pairs from RF. For
each new edge e, we build the new bi-edge index entries related to e. Then we get the
new input file F. We process F in the Mapper to produce key-value pairs P and
choose the records in RF whose keys are the same as the keys in P. Then we
add these records to the file F. Finally, we construct the new matching pairs of D'
and Q by the EJoin operation in the MapReduce job chain (Job2 → Job3 → ... →
Jobn) with F as the input file.
4 Experiments
Our experiments are conducted in two different environments. The index construction,
the MapReduce input file construction and the brute-force algorithm are
performed on a PC with a 2.4 GHz Intel(R) Core(TM) i5 CPU and 2 GB of memory,
running the Ubuntu 10.10 operating system. The other environment is a MapReduce
platform with 5 multithreaded cores. All the algorithms are implemented in Java.
Fig. 3. Time cost of BF and MRMM
Fig. 4. Time cost of MRMM and DMRMM while varying the number of edges in the data graph (×1000): (a) index building; (b) query processing
We use MRMM to stand for the algorithm that processes the new graph
from the beginning and DMRMM to stand for the algorithm that only takes
the changed edges into consideration. Figure 4(a) shows that the time cost of the
index building of the DMRMM method is much smaller than that of the MRMM
method, since DMRMM does not compute all edges of the new graph. Figure 4(b)
shows the efficiency of query processing in the MapReduce jobs. Clearly, the
DMRMM method is faster than the MRMM method.
5 Conclusions
Uncertain subgraph query is an essential operation on uncertain graph databases.
It is complicated due to the NP-hard isomorphism verification and the uncertainty.
MapReduce provides a good solution to this problem. In this paper, we have
explored a Mutual-Match algorithm and an applicable index method to handle
the uncertain subgraph query problem within the MapReduce framework. Our
MapReduce-based Mutual-Match algorithm performs effectively on uncertain
subgraph matching over both static and dynamic uncertain graphs.
References
1. Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. Communications of the ACM 51(1), 107-113 (2008)
2. Han, W., Lee, J., Pham, M., Yu, J.: iGraph: A framework for comparisons of disk-based graph indexing techniques. Proceedings of the VLDB Endowment 3(1-2), 449-459 (2010)
3. Hintsanen, P., Toivonen, H.: Finding reliable subgraphs from large probabilistic graphs. Data Mining and Knowledge Discovery 17(1), 3-23 (2008)
4. Potamias, M., Bonchi, F., Gionis, A., Kollios, G.: K-nearest neighbors in uncertain graphs. Proceedings of the VLDB Endowment 3(1-2), 997-1008 (2010)
5. Saito, R., Suzuki, H., Hayashizaki, Y.: Interaction generality, a measurement to assess the reliability of a protein-protein interaction. Nucleic Acids Research 30(5), 1163 (2002)
6. Viger, F., Latapy, M.: Efficient and simple generation of random simple connected graphs with prescribed degree sequence. In: Wang, L. (ed.) COCOON 2005. LNCS, vol. 3595, pp. 440-449. Springer, Heidelberg (2005)
7. Yan, X., Yu, P., Han, J.: Graph indexing: A frequent structure-based approach. In: SIGMOD, pp. 335-346. ACM (2004)
8. Yan, X., Yu, P., Han, J.: Graph indexing based on discriminative frequent structure analysis. ACM Transactions on Database Systems (TODS) 30(4), 960-993 (2005)
9. Yuan, Y., Chen, L., Wang, G.: Efficiently answering probability threshold-based shortest path queries over uncertain graphs. In: Kitagawa, H., Ishikawa, Y., Li, Q., Watanabe, C. (eds.) DASFAA 2010, Part I. LNCS, vol. 5981, pp. 155-170. Springer, Heidelberg (2010)
10. Zou, Z., Gao, H., Li, J.: Discovering frequent subgraphs over uncertain graph databases under probabilistic semantics. In: SIGKDD, pp. 633-642. ACM (2010)
11. Zou, Z., Li, J., Gao, H., Zhang, S.: Finding top-k maximal cliques in an uncertain graph. In: ICDE, pp. 649-652. IEEE (2010)
Improving Keyphrase Extraction
from Web News by Exploiting Comments Information
1 Introduction
As thousands of web news articles are posted on the Internet every day, it is a challenge to
retrieve, summarize, classify or mine this enormous repository effectively. Keyphrases
can be seen as condensed representations of web news documents which can be used to support
these applications. However, only a small minority of documents has author-assigned
keyphrases, and manually assigning keyphrases to each document is very laborious.
Therefore it is highly desirable to extract keyphrases automatically. In this paper we
consider this task in the web news scenario.
Most existing work on keyphrase extraction from a web news document only considers
the internal information, such as the phrases' Tf-idf, position and other syntactic
information, and neglects external information about the document. Some research
also utilizes background information. For example, Wan and Xiao proposed a method,
which selects similar documents for a given document from a large document corpus,
to improve the performance of keyphrase extraction [14]. Mihalcea and Csomai
used Wikipedia to provide background knowledge for this task [11]. However, both of
these works need to retrieve topic-related documents as external information, which is
a non-trivial problem. Fortunately, nowadays many web news sites provide various social
tools for community interaction, including the possibility to comment on the web
news. These comments are highly related to the web news documents and naturally
bound with them. The motivation of our work is that comments are a valuable resource
providing background information for each web news document; in particular, some useful
comments contain rich information which is the focus of the readers. Hence, in this
paper we consider exploiting comments information to improve keyphrase extraction
from web news documents.
Unfortunately, although the main entry and its comments are certainly related and at
least partially address similar topics, they are markedly different in several ways [16].
First of all, their vocabulary is noticeably different. Comments are more casual, conversational,
and full of jargon. They are less carefully edited and therefore contain more
misspellings and typographical errors. There is more diversity among comments than
within the single-author post, both in style of writing and in what commenters like to
talk about. Simply using all the comments as external information does not improve the
performance of keyphrase extraction (see Section 5.2). Therefore, we propose several
methods to select useful comments, based on criteria such as the textual similarity
and helpfulness of the comment. The experimental results show that our comment selection
approaches can improve keyphrase extraction for web news. In particular, our machine
learning approach, which integrates many factors of comments, can significantly improve
this task.
The rest of this paper is organized as follows. Section 2 introduces the related work.
Section 3 describes the framework of our approach to extract keyphrases from a web
news document. In section 4 we propose the strategies for selecting useful comments.
Section 5 presents and discusses the evaluation results. Lastly we conclude the paper.
2 Related Work
Most existing work only uses the internal information of a document for keyphrase
extraction. The simplest approach is to use a frequency criterion (or the Tf-idf model [12]) to
select keyphrases in a document. This method was generally found to give poor
performance [3]. However, Hasan recently conducted a systematic evaluation and analysed
some algorithms on a variety of standard evaluation datasets; his results indicated that
Tf-idf still offered very robust performance across different datasets [4]. Another
important clue for keyphrase extraction is the location of a phrase in the document. Kea
[2] and GenEx [13] used this clue in their studies. Hulth added more linguistic knowledge,
such as syntactic features, to enrich the term representation, which was effective for
keyphrase extraction [5]. Mihalcea and Tarau first proposed a graph-based ranking
method for keyphrase extraction [10], and currently graph-based ranking methods have
become the most widely used approaches [3,8,9]. However, all the approaches above
never consider external information when extracting keyphrases.
Another type of information which can be used for keyphrase extraction is external
resources. Wikipedia1 has been intensively used to provide background knowledge.
Mihalcea and Csomai used the link structure of Wikipedia to derive a novel word
feature for keyphrase extraction [11]. Grineva et al. utilized article titles and the link
structure of Wikipedia to construct a semantic graph to extract keyphrases [3]. Xu et al.
1 http://www.wikipedia.org/
extracted the in-link, out-link, category, and infobox information of the documents related
to the article in Wikipedia for this task [15]. All approaches using Wikipedia significantly
improve the performance of keyphrase extraction. Another external resource is neighbor
documents. Wan and Xiao proposed the idea that a user would better understand a
topic expressed in a document if the user reads more documents about the same topic
[14]. They used a textual similarity measure (e.g., cosine similarity) to find documents
topically close to the given document in a large document corpus. These neighbor
documents are beneficial for evaluating and extracting keyphrases from the document. However,
using the external resources introduced above requires retrieving topic-related documents,
which is a non-trivial problem. Unlike these resources, we exploit the comments of web
news as external information. This information is topic-related to the original document.
Moreover, being naturally bound with the original news makes the comments easier to obtain.
3 Framework
A web news document has two parts: an article and comments. The article contains
the text of the event written by the author. Comments can be seen as the discussion of
the article, including the attitudes of readers. Most comments are topic-related to the
article. In particular, some comments contain rich information which the readers
focus on. We call the comments which provide additional information for readers to
better understand the content of the article useful comments. Intuitively, these useful
comments are expected to be beneficial for evaluating and extracting the salient keyphrases
of the article. However, traditional keyphrase extraction methods only consider the
information in the article and ignore its comments. In this study we propose
an approach exploiting comments information to improve keyphrase extraction for
web news documents.
Figure 1 gives the framework of our approach for keyphrase extraction. First we
segment a web news document into the article part and the comments part based on
its natural structure. Next, since not all comments are helpful for keyphrase extraction
(see Section 5.2), we use a comments selector, based on several strategies, to select useful
comments from the comments part. Then we integrate the selected comments with the
article part to form a new document. At last we use a keyphrase extractor to extract
phrases from the new document. The extracted phrases that occur in the article are
taken as keyphrases. Our approach can be easily incorporated with any state-of-the-art
keyphrase extractor to extract keyphrases from web news.
Some comments contain noisy information such as advertisements, irrelevant
text and text containing only meaningless words or emoticons. For example, a comment
"Who cares!" just reflects a reader's attitude to the news without any useful information
for other readers to understand the article. Hence, the key point of our approach
is to find the useful comments for keyphrase extraction. We propose three strategies to
select useful comments, which are expected to find the article's keyphrases with higher
accuracy. The first is selecting comments which are similar to the original article, called
the Similar Comments Selector. The idea of this approach is the same as Wan and Xiao
[14].
Fig. 1. The framework of our approach: a comments selector picks useful comments, which are merged with the article and passed to a keyphrase extractor
Many web news sites provide a rating setting for comments (e.g., thumb-up or
thumb-down votes) so that other readers can assess them. This meta rating information
can help filter useful comments more efficiently. We call the approach based on rating
information the Helpfulness Comments Selector. Furthermore, since existing natural
language processing and machine learning technologies provide strong support for
mining potential knowledge, we also use these technologies to select useful comments,
in an approach called the KeyInfo Comments Selector. In the next section we will
introduce the three comments selectors in detail.
Many web news sites also allow readers to vote on the helpfulness of each comment and
provide a function to assess its quality. The helpfulness function h is usually defined as
follows:
h = Numthumbup / (Numthumbup + Numthumbdown)    (1)
where Numthumbup is the number of people who consider the comment helpful and
Numthumbdown is the number of people who consider the comment unhelpful. This
function actually provides a direct and convenient way to estimate the utility value of a
given comment. Therefore, we use the helpfulness of comments based on this function
to select useful comments. The top N comments with the highest scores, voted on by at
least 5 users, are selected for the following keyphrase extraction step.
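A minimal sketch of the Helpfulness Comments Selector; the comment record fields and the value of N are assumptions.

```python
def helpfulness(thumb_up, thumb_down):
    """h = thumb_up / (thumb_up + thumb_down), as in Equation 1."""
    total = thumb_up + thumb_down
    return thumb_up / total if total else 0.0

def select_helpful_comments(comments, n=10, min_votes=5):
    """Keep the top-n comments that were voted on by at least min_votes users."""
    voted = [c for c in comments
             if c["thumb_up"] + c["thumb_down"] >= min_votes]
    voted.sort(key=lambda c: helpfulness(c["thumb_up"], c["thumb_down"]),
               reverse=True)
    return voted[:n]
```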
A simple idea is that, given an article, comments which include more keyphrases are
more likely to be useful than other comments which have fewer or no keyphrases, since
these comments contain more of the information which the readers focus on. Intuitively,
integrating these useful comments into the original article is expected to improve
the performance of keyphrase extraction. However, we cannot obtain the exact
number of keyphrases in each comment, since finding them is the final purpose of the task.
Fortunately, some factors of comments, such as the textual similarity and helpfulness of
comments, are related to the probability of comments containing more keyphrases.
Moreover, existing natural language processing technologies can help us find many factors
related to the probability of comments containing keyphrases, and machine learning
technologies based on these factors can estimate the probability accurately. Therefore,
we propose another comments selector, called the KeyInfo comments selector, using
natural language processing and machine learning technologies.
The principle of the KeyInfo Comments Selector is that the more keyphrases a short
comment contains, the more likely it is to improve keyphrase extraction from the web
news when integrated into the original article. We define the keyInfo function k as
follows:

    k = Num_keyphrase / Num_word                                            (2)

where Num_keyphrase is the number of keyphrases in the comment and Num_word is
the number of words in the comment. We choose the top N comments with the highest
k scores as the useful comments for keyphrase extraction.
The key point of this approach is to predict the value of Num_keyphrase. We treat this
prediction as a regression-learning problem. First, some web news documents and their
comments are collected as training data; these news documents have keyphrases
annotated. Using these gold-standard keyphrases, each comment's Num_keyphrase can
be obtained easily. Some features that can potentially help predict Num_keyphrase are
designed for each comment, and each comment is then represented by a feature vector.
A classical regression tool, simple linear regression, is then used to learn the prediction
model.
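A hedged sketch of this training step follows; the paper's tool chain (the footnote points to
Weka) is not reproduced, so scikit-learn's LinearRegression is used as a stand-in, and the
feature extraction itself (the features of Table 1) is assumed to be supplied by the caller:

```python
from sklearn.linear_model import LinearRegression

def keyinfo_score(num_keyphrases, num_words):
    """KeyInfo score k from Eq. (2)."""
    return num_keyphrases / num_words if num_words > 0 else 0.0

def train_keyphrase_count_regressor(feature_vectors, true_keyphrase_counts):
    """Fit a regressor predicting Num_keyphrase from a comment's features.
    The true counts come from the gold-standard keyphrases of the training news."""
    model = LinearRegression()  # stand-in for the simple linear regression tool
    model.fit(feature_vectors, true_keyphrase_counts)
    return model

def select_keyinfo_comments(comments, feature_vectors, model, n=15):
    """Rank comments by the predicted KeyInfo score and keep the top n."""
    predicted = model.predict(feature_vectors)
    scored = sorted(zip(comments, predicted),
                    key=lambda cp: keyinfo_score(cp[1], len(cp[0].split())),
                    reverse=True)
    return [c for c, _ in scored[:n]]
```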
5 Experiment
We empirically evaluate our approach for keyphrase extraction from web news docu-
ments, especially comparing the performance of various strategies for selecting useful
comments for this task. In addition, we conduct an in-depth analysis of the differences
among the useful comment sets chosen by the different comments selectors.
Evaluation Metric. For the evaluation of keyphrase extraction results, the automatically
extracted keyphrases are compared with the gold-standard keyphrases. The precision
p = count_correct / count_extractor, recall r = count_correct / count_gold, and F1 score
f = 2pr / (p + r) are used as evaluation metrics, where count_correct is the total number
of correct keyphrases extracted by the extractor, count_extractor is the total number of
automatically extracted keyphrases, and count_gold is the total number of gold
keyphrases.

2 http://www.cs.waikato.ac.nz/ml/weka/
3 http://nlp.stanford.edu/software/tagger.shtml
4 http://www.wjh.harvard.edu/~inquirer/
5 http://www.aol.com/
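These counts translate directly into code; a small helper, assuming the extracted and gold
keyphrases are collected as lowercase strings:

```python
def precision_recall_f1(extracted, gold):
    """Precision, recall and F1 over sets of keyphrases."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f
```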
different datasets [4]. Therefore, we use these two keyphrase extractors in our experi-
ment. A publicly available implementation of both extractors can be downloaded6.
Tf-idf Keyphrase Extractor assigns a score to every term in a document based on its
frequency (term frequency, tf) and the number of other documents containing this term
(inverse document frequency, idf). Given a document the extractor first computes the
Tf-idf score of each candidate word, then extracts all the longest n-grams consisting of
candidate words and scores each n-gram by summing the Tf-idf scores of its constituent
unigrams, and finally outputs the top N n-grams as keyphrases.
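A rough sketch of this scoring scheme (the candidate-word test and the background
collection used for idf are assumed to be supplied by the caller, and the smoothing detail is
illustrative only):

```python
import math
from collections import Counter

def tfidf_scores(doc_tokens, corpus_tokens):
    """tf-idf for each word of the document; `corpus_tokens` is a list of
    token lists for the background collection."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus_tokens)
    scores = {}
    for word, freq in tf.items():
        df = sum(1 for d in corpus_tokens if word in d)
        scores[word] = freq * math.log((n_docs + 1) / (df + 1))  # smoothed idf
    return scores

def tfidf_keyphrases(doc_tokens, corpus_tokens, is_candidate, top_n=10):
    """Score each longest run of candidate words by summing the tf-idf of its
    unigrams, and return the top-n runs as keyphrases."""
    scores = tfidf_scores(doc_tokens, corpus_tokens)
    ngram_scores = {}
    run = []
    for token in doc_tokens + [""]:            # empty sentinel flushes the last run
        if token and is_candidate(token):
            run.append(token)
        elif run:
            gram = " ".join(run)
            ngram_scores[gram] = sum(scores.get(w, 0.0) for w in run)
            run = []
    ranked = sorted(ngram_scores, key=ngram_scores.get, reverse=True)
    return ranked[:top_n]
```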
SingleRank Keyphrase Extractor is an extension of the TextRank Keyphrase Ex-
tractor [10,14]. Both extractors represent a text as a graph, in which each vertex
corresponds to a word type and an edge connects two vertices if the two words co-occur
within a window of W words in the associated text. The goal is to compute the score of
each vertex, which reflects its importance. The two extractors use the word types that
correspond to the highest-scoring vertices to form keyphrases for the text. Graph-based
ranking algorithms such as Google's PageRank [1] and HITS [6], which compute the
importance of vertices, can easily be integrated into the TextRank and SingleRank
models. There are some differences between the two extractors. First, in a SingleRank
graph each edge has a weight equal to the number of times the two corresponding word
types co-occur within a window of W words, whereas in a TextRank graph every edge
has the same weight. Second, SingleRank forms keyphrases only from sequences of
high-scoring nouns and adjectives, whereas in TextRank any sequence of high-scoring
words can form a keyphrase.
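A compact illustration of the SingleRank-style graph construction and vertex scoring,
assuming tokens and their POS tags are already available (keyphrase formation from the top
vertices is omitted, and the window handling is a simplification):

```python
import itertools
import networkx as nx

def singlerank_scores(tokens, pos_tags, window=10):
    """Build a word graph whose vertices are noun/adjective word types, weight
    edges by co-occurrence counts within `window` positions, and score vertices
    with PageRank."""
    keep = [i for i, tag in enumerate(pos_tags) if tag.startswith(("NN", "JJ"))]
    graph = nx.Graph()
    graph.add_nodes_from(tokens[i] for i in keep)
    for i, j in itertools.combinations(keep, 2):
        if j - i < window:
            u, v = tokens[i], tokens[j]
            if graph.has_edge(u, v):
                graph[u][v]["weight"] += 1
            else:
                graph.add_edge(u, v, weight=1)
    return nx.pagerank(graph, weight="weight")  # importance score per word type
```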
Method             Precision  Recall  F1
Base Tf-idf        0.139      0.366   0.202
All Tf-idf         0.153      0.259   0.192
Base SingleRank    0.120      0.316   0.174
All SingleRank     0.096      0.163   0.121
6 http://www.utdallas.edu/~ksh071000/code.html
Next we investigate whether some useful comments can help keyphrase extraction. We
rank all the comments by the keyInfo function k, using the gold keyphrases of each
article to calculate the exact Num_keyphrase value for each comment. For each web
news document, we choose the top 15 comments with the highest values of k and
integrate them into the original article (called Oracle Tf-idf and Oracle SingleRank),
which achieves the best keyphrase extraction result in our experiment (see Table 4).
These are oracle tests, and the results are significantly different (at p = 0.05). This
means that, for each web news document, the comments part contains a subset of useful
comments that can help keyphrase extraction (see Figure 1).
Method             Precision  Recall  F1
Base Tf-idf        0.139      0.366   0.202
Oracle Tf-idf      0.165      0.432   0.239
Base SingleRank    0.120      0.316   0.174
Oracle SingleRank  0.146      0.382   0.211
Finally, we investigate the performance of our three comments selectors, each choosing
the top 15 highest-scoring comments, for keyphrase extraction. The Similar Tf-idf
(SingleRank) and Helpfulness Tf-idf (SingleRank) methods use the comment similarity
and helpfulness scores, respectively, to choose comments. For the KeyInfo Tf-idf
(SingleRank) approach, we used the comments of 5 additional AOL web news docu-
ments, not included in the AOL corpus, to train a regression model that predicts the
Num_keyphrase of each new comment. These 5 documents contain 524 comments, and
each document has gold-standard keyphrases. We extract all the features in Table 1 to
represent each comment as a feature vector for training. The trained model can then be
used to predict the KeyInfo score of a new comment. Table 5 gives the comparison of
the various selection methods for keyphrase extraction. We can see that most methods
that integrate comments outperform Base Tf-idf (SingleRank), which shows that
comment information can indeed help keyphrase extraction. We conducted a paired
t-test and found that all methods except Helpfulness Tf-idf significantly improve
keyphrase extraction (at p = 0.05). In particular, KeyInfo Tf-idf (SingleRank) achieves
the best result. This shows that similarity and helpfulness information are helpful for
selecting useful comments; moreover, using more knowledge extracted with natural
language processing and machine learning technologies can collect even more useful
comments. All of this shows that exploiting comment information is effective for
keyphrase extraction.
We are also interested in the details of the comments selected by the three selectors.
We compare the three selected comment sets with the comments selected based on the
exact Num_keyphrase value (the oracle test). All comparisons use the top 15 highest-
scoring comments for each document. Table 6 gives the result. We can see that the
KeyInfo selector collects the largest number of useful comments, which demonstrates
that our machine learning approach is effective at finding more comments with a high
k value for keyphrase extraction.
Method                  Precision  Recall  F1
Base Tf-idf             0.139      0.366   0.202
Similar Tf-idf          0.151      0.393   0.218
Helpfulness Tf-idf      0.144      0.379   0.209
KeyInfo Tf-idf          0.157      0.406   0.226
Base SingleRank         0.120      0.316   0.174
Similar SingleRank      0.134      0.324   0.190
Helpfulness SingleRank  0.133      0.332   0.190
KeyInfo SingleRank      0.144      0.331   0.201
Comment sets compared            Number
Set_Oracle ∩ Set_KeyInfo         437
Set_Oracle ∩ Set_Helpfulness     331
Set_Oracle ∩ Set_Similar         300
6 Conclusion
In this paper we propose a novel approach that uses comment information for keyphrase
extraction from web news documents. Since comments are full of noisy and low-quality
data, simply integrating all comments into the original article does not improve the
performance of this task. Therefore, we propose three strategies to select useful com-
ments. The experimental results show that our comment selection approaches can
improve keyphrase extraction from web news; in particular, our machine learning
approach, which considers many factors of the comments, significantly improves this
task.
References
1. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. In: Proceedings
of the Seventh International Conference on World Wide Web, WWW7, pp. 107–117. Elsevier
Science Publishers B.V., Amsterdam (1998)
2. Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C., Nevill-Manning, C.G.: Domain-specific
keyphrase extraction. In: Proceedings of the Sixteenth International Joint Conference on Artificial
Intelligence, IJCAI 1999, pp. 668–673. Morgan Kaufmann Publishers Inc., San Francisco (1999)
3. Grineva, M., Grinev, M., Lizorkin, D.: Extracting key terms from noisy and multitheme
documents. In: Proceedings of the 18th International Conference on World Wide Web, WWW
2009, pp. 661–670. ACM, New York (2009)
4. Hasan, K.S., Ng, V.: Conundrums in unsupervised keyphrase extraction: making sense of the
state-of-the-art. In: Proceedings of the 23rd International Conference on Computational
Linguistics: Posters, COLING 2010, pp. 365–373. Association for Computational Linguistics,
Stroudsburg (2010)
5. Hulth, A.: Improved automatic keyword extraction given more linguistic knowledge. In:
Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing,
EMNLP 2003, pp. 216–223. Association for Computational Linguistics, Stroudsburg (2003)
6. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. J. ACM 46(5), 604–632
(1999)
7. Liu, J., Cao, Y., Lin, C., Huang, Y., Zhou, M.: Low-quality product review detection in opinion
summarization. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural
Language Processing and Computational Natural Language Learning (EMNLP-CoNLL),
pp. 334–342 (2007)
8. Liu, Z., Huang, W., Zheng, Y., Sun, M.: Automatic keyphrase extraction via topic decomposition.
In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing,
EMNLP 2010, pp. 366–376. Association for Computational Linguistics, Stroudsburg (2010)
9. Liu, Z., Li, P., Zheng, Y., Sun, M.: Clustering to find exemplar terms for keyphrase extraction.
In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing,
EMNLP 2009, vol. 1, pp. 257–266. Association for Computational Linguistics, Stroudsburg (2009)
10. Mihalcea, R., Tarau, P.: TextRank: Bringing order into texts. In: Proceedings of EMNLP,
Barcelona, Spain, vol. 4, pp. 404–411 (2004)
11. Mihalcea, R., Csomai, A.: Wikify!: linking documents to encyclopedic knowledge. In:
Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management,
CIKM 2007, pp. 233–242. ACM, New York (2007)
12. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process.
Manage. 24(5), 513–523 (1988)
13. Turney, P.D.: Learning to extract keyphrases from text. CoRR cs.LG/0212013 (2002)
14. Wan, X., Xiao, J.: Single document keyphrase extraction using neighborhood knowledge. In:
Proceedings of the 23rd National Conference on Artificial Intelligence, AAAI 2008, vol. 2,
pp. 855–860. AAAI Press (2008)
15. Xu, S., Yang, S., Lau, F.: Keyword extraction and headline generation using novel word
features. In: Proc. of the Twenty-Fourth AAAI Conference on Artificial Intelligence (2010)
16. Yano, T., Cohen, W., Smith, N.: Predicting response to political blog posts with topic models.
In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North
American Chapter of the Association for Computational Linguistics, pp. 477–485. Association
for Computational Linguistics (2009)
A Two-Layer Multi-dimensional Trustworthiness
Metric for Web Service Composition
1 Introduction
The Web originated to give users the power of information sharing. As it has evolved,
it has been enriched with more flexible ways to link different sources, one of which is
the use of web services, i.e., sets of related functionalities that can be programmatically
accessed through the web [16], usually via XML-based protocols. Web service compo-
sition, which aims at composing distributed web services to carry out complex business
processes, is considered a promising way to revolutionize the collaboration among
heterogeneous applications [3]. Due to the constantly changing environment and the
large scale of the Web, the research focus in web service composition is moving from
static to dynamic technology, known as Automatic Service Composition (ASC) [15].
The software programs that provide ASC functionalities are termed ASC agents. An
ASC agent has to compose several web services to deliver results, which gives it a few
particular characteristics that differ from those of single-functionality web services.
Researchers have realized the significance of trust in the web service environment.
Methods, models, and frameworks have been proposed to help service consumers
identify trustworthy web services [17]. However, determining the trustworthiness
of ASC agents is left as an open issue. We adopt the major consideration in [6] by
defining the trustworthiness of ASC agents as a two-layer structure, namely the
composition layer and the service layer. This caters to the unique characteristic of ASC
agents that their deliveries depend not only on their own activities, but also on the
single-functionality web services they compose.
The previous work did not completely investigate how to use QoS factors to represent
the trustworthiness of an ASC agent, and the experiment conducted there cannot reflect
the effectiveness of the proposed model. In this paper, we thoroughly explore the
relationships between trustworthiness and QoS factors. A proper set of indicators, based
on QoS factors, is constructed that can better present an ASC agent's trustworthiness.
We also use experiments to show that our proposed metric gives better results than a
forecast that simply uses historical data. The contributions of this paper are:
The rest of the paper is structured as follows. In Section 2, we explore the relationship
between the trustworthiness of an ASC agent and QoS factors. In Section 3, we present
the trustworthiness metric in detail. A method to integrate TSIs for trustworthiness
comparison is introduced in Section 4. A simulation experiment is presented in Section
5. Section 6 reviews recent related work, and Section 7 summarizes the paper and looks
at future work.
are people that irrationally trust others without any reason. However, rational people
form their attitude based on gathered information about trustworthiness. In this paper,
the ASC requesters are the trustors and the ASC agents are the trustees. Instead of
directly modeling trust, we design a metric to represent the trustworthiness of ASC
agents, which in turn influences the trust of requesters.
3 Trustworthiness Indicators
3.1 Composition-Layer TSIs
Functional Effectiveness Deviation (T_fed). This TSI represents the status of func-
tional effectiveness. To define the functional effectiveness deviation, we first need to
define the functional error rate, i.e., the proportion of requests answered with functional
errors over the total number of requests. Given an ASC agent c and a certain period du,
it is calculated as ε(c, du) = N_err / N_total, where N_err is the total number of requests
answered in error and N_total is the total number of composition requests. The func-
tional effectiveness deviation T_fed(c, du) represents the difference between the actual
functional error rate ε_a(c, du) and the claimed one ε_c(c, du) within that duration. It is
calculated as T_fed(c, du) = max{0, ε_a(c, du) − ε_c(c, du)}. In practice, ε_a(c, du) can
be monitored by consumers or a third party, and ε_c(c, du) is claimed by the ASC agent
itself. The lower T_fed(c, du) is, the more trustworthy the ASC agent is. In general, this
TSI reflects how trustworthy the ASC agent is with respect to functional capability.
Execution Efficiency Deviation (T_eed). This TSI indicates the status of the execution
efficiency of an ASC agent. For an ASC agent c within duration du, the execution
efficiency deviation T_eed(c, du) is the difference between the claimed on-time response
rate ρ_c(c, du) and the actual on-time response rate ρ_a(c, du). The on-time response
rate is the proportion of on-time responses over the total number of times the ASC agent
should respond, i.e., ρ = N_ontime / N_total. Therefore, T_eed(c, du) is calculated as
T_eed(c, du) = max{0, ρ_c(c, du) − ρ_a(c, du)}. In general, the execution efficiency
deviation stands for how trustworthy the ASC agent is with respect to on-time response.
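A small sketch of these two deviation TSIs (the symbols ε and ρ above stand for the error
rate and on-time rate; all arguments below are assumed to be observed or claimed for the
same agent c and duration du):

```python
def functional_effectiveness_deviation(n_err, n_total, claimed_error_rate):
    """T_fed(c, du) = max{0, actual error rate - claimed error rate}."""
    actual = n_err / n_total if n_total > 0 else 0.0
    return max(0.0, actual - claimed_error_rate)

def execution_efficiency_deviation(n_ontime, n_total, claimed_ontime_rate):
    """T_eed(c, du) = max{0, claimed on-time rate - actual on-time rate}."""
    actual = n_ontime / n_total if n_total > 0 else 0.0
    return max(0.0, claimed_ontime_rate - actual)
```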
framework, we define the range of the ranking score to be [0,1]. Therefore, the range
of T_pr(c, du) is also [0,1]. Different from the above three TSIs, the higher the public
reputation is, the more trustworthy the ASC agent is.
The above four TSIs are composition-layer TSIs that reflect the trustworthiness of an
ASC agent itself. In the next sub-section, we introduce several service-layer TSIs, which
reflect the status of the candidate elementary services maintained by an ASC agent.
Their properties also impact the final deliverables of ASC agents and therefore impact
the trust consumers give.
Average Service Availability (T_asa). The average service availability T_asa(c, du) is
the average availability of the elementary services maintained by ASC agent c during
the last duration du, i.e., T_asa(c, du) = (Σ_{i=1..n} Avai_i) / n, where Avai_i is the
availability of the i-th elementary service and n is the total number of elementary
services. The definition of the availability of a single elementary service can be found
in [14]. The higher the average service availability is, the better the quality of the
elementary services is. Generally, average service availability reflects the overall quality
of the elementary services in terms of on-line available duration.
Monitoring Strength (T_ms). Monitoring strength stands for the strength with which
elementary services are monitored by ASC agents or a third party. It is reasonable to
assume that if elementary services are monitored regularly, their quality will be consis-
tent. Two factors are considered to impact the monitoring strength: the monitoring
coverage mc and the monitoring frequency mf. Monitoring coverage is the proportion
of monitored services over all the services; its range is [0,1]. Monitoring frequency
stands for how often a monitoring session takes place. To simplify the calculation, we
normalize the monitoring frequency using a satisfactory monitoring frequency smf as a
reference, so that the range of mf also becomes [0,1]. The satisfactory monitoring
frequency is a frequency at which, if elementary services are monitored, consumers will
consider the monitoring satisfactory. This reference value can be decided by conducting
a consumer survey; once decided, it stays constant and seldom changes. For a certain
ASC agent c in duration du, we use the following formula to calculate the monitoring
strength: T_ms(c, du) = ((Σ_{i=1..n} mc_i) / n) · Min{mf / smf, 1}, where mc_i is the
monitoring coverage of the i-th monitoring process. The higher the monitoring strength
is, the better the quality it represents.
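A small sketch of the two service-layer TSIs, following the formulas above (lists of
per-service availabilities and per-monitoring-process coverages are assumed as inputs):

```python
def average_service_availability(availabilities):
    """T_asa: mean availability of the elementary services kept by the agent."""
    if not availabilities:
        return 0.0
    return sum(availabilities) / len(availabilities)

def monitoring_strength(coverages, monitoring_frequency, satisfactory_frequency):
    """T_ms: average monitoring coverage scaled by min(mf / smf, 1)."""
    if not coverages:
        return 0.0
    avg_coverage = sum(coverages) / len(coverages)
    return avg_coverage * min(monitoring_frequency / satisfactory_frequency, 1.0)
```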
In summary, we have proposed several composition-layer and service-layer TSIs,
which form a comprehensive metric reflecting the trustworthiness status of an ASC
agent. As mentioned before, it is impossible to give a complete metric that covers every
aspect of a composer, since the properties are countless and different contexts may be
concerned with different things. Our metric is extensible, so consumers can add or
remove TSIs as they prefer, as long as the newly added ones can be scaled into the range
[0,1]. In the next part, we integrate all the TSIs into a single value through which the
trustworthiness of ASC agents can be comprehensively distinguished.
4 Trustworthiness Comparison
In the above formula, TSI_i^max and TSI_i^min are respectively the highest and lowest
values of the i-th TSI over all the ASC agents, i.e., TSI_i^max = Max(TSI_{i,j}),
1 ≤ j ≤ n, and TSI_i^min = Min(TSI_{i,j}), 1 ≤ j ≤ n. By applying the above formula,
we obtain an n × 8 matrix V^N:

    V^N = \begin{pmatrix}
            v_{1,1} & v_{1,2} & \cdots & v_{1,8} \\
            v_{2,1} & v_{2,2} & \cdots & v_{2,8} \\
            v_{3,1} & v_{3,2} & \cdots & v_{3,8} \\
            \vdots  & \vdots  & \ddots & \vdots  \\
            v_{n,1} & v_{n,2} & \cdots & v_{n,8}
          \end{pmatrix}                                                     (2)

where each column stands for one TSI dimension and each row represents an ASC
agent.
One ASC agent may be better in some TSIs while another is better in other TSIs. This
problem is related to how important each TSI is in the context in which the comparison
occurs. There are no fixed rules saying which TSI is the most important and which
comes second; the significance of each TSI depends on user preference.
In our trustworthiness comparison method, users are given a weight matrix
W = (w_j | 1 ≤ j ≤ 8, 0 ≤ w_j ≤ 1, Σ_{j=1..8} w_j = 1) to allocate a weight value to
each TSI, where element w_j is the weight value of the j-th TSI described before. Given
the n × 8 normalized TSI matrix V^N = (v_{i,j}^N | 1 ≤ i ≤ n, 1 ≤ j ≤ 8), applying the
following formula yields a vector T of length n, in which each value T_j stands for the
trustworthiness of an ASC agent:

    T = V^N W = \begin{pmatrix}
                  v_{1,1} & v_{1,2} & \cdots & v_{1,8} \\
                  v_{2,1} & v_{2,2} & \cdots & v_{2,8} \\
                  v_{3,1} & v_{3,2} & \cdots & v_{3,8} \\
                  \vdots  & \vdots  & \ddots & \vdots  \\
                  v_{n,1} & v_{n,2} & \cdots & v_{n,8}
                \end{pmatrix}
                \begin{pmatrix} w_1 \\ w_2 \\ w_3 \\ \vdots \\ w_8 \end{pmatrix}     (3)
The above formula results in n values T_j, each of which is the final normalized trust-
worthiness of the j-th ASC agent. Each value lies in the range [0,1]: the higher the value,
the more trustworthy the ASC agent. Please note that if users only want to calculate an
absolute trustworthiness value for an ASC agent (no comparison), they can directly
multiply the weight values by the results calculated from the formulas given in the TSI
descriptions; in that case, normalization is not necessary.
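A numpy sketch of the normalization and weighted aggregation; how the original
normalization formula (not shown in this excerpt) handles TSIs where lower values are
better is not visible here, so the direction flip below is an assumption:

```python
import numpy as np

def trustworthiness_ranking(tsi, weights, higher_is_better):
    """tsi: (n_agents x 8) raw TSI matrix; weights: 8 values summing to 1;
    higher_is_better[j]: True for TSIs such as public reputation, False for
    deviation-style TSIs. Returns one score in [0, 1] per ASC agent."""
    tsi = np.asarray(tsi, dtype=float)
    lo, hi = tsi.min(axis=0), tsi.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)          # avoid division by zero
    norm = (tsi - lo) / span                        # scale every column to [0, 1]
    for j, better in enumerate(higher_is_better):
        if not better:
            norm[:, j] = 1.0 - norm[:, j]           # flip so higher = more trustworthy
    return norm @ np.asarray(weights, dtype=float)  # T = V^N * W
```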
As we mentioned, it is very difficult to know which TSI is more important. However,
there are some general rules that can be applied in any context. The first rule is effort-
oriented division, which means that we put more weight where the major computation
jobs are done. If you know that most of your work will be done by elementary services
and the composer merely integrates the results, you should put more weight on service-
layer TSIs, and vice versa. The second rule is difficulty-driven: we put more weight on
the TSIs that are hard to achieve. For example, if you have a massive computation task
in which the computation logic is very simple but the amount of work is quite large,
rather than putting too much weight on the TSIs reflecting ability factors, you should
give more attention to quality factors such as execution efficiency. If the computation
logic is very complex, you should focus on ability factors such as the Functional
Effectiveness Deviation. The third rule is concern-driven. If you are particularly keen
on certain aspects, e.g., in time-critical cases where even one microsecond determines
everything, you should definitely put the highest weight on the corresponding factors.
Again, our metric is open and accepts any new TSIs as members, as long as they can be
scaled to [0,1]. Users are allowed to create their own TSIs based on their contexts.
above table. Apart from the MC and MF data, the remaining data in the table are
average values. The performance of the ASC agents follows a normal distribution with
the mean values given in the table and a variance of 1% for availability, 5 milliseconds
for execution duration, and 0.5% for error rate. 200 ms is the longest duration consumers
are willing to wait for one composition job. Each ASC agent maintains its own elemen-
tary service pool, consisting of 60 web services in total with 20 for each function; these
services are randomly selected from the web. The ideal monitoring strength in our
experiment is supposed to check the service pool every 10 compositions with 100%
coverage.
We simulate the composition processes of each ASC agent for 300 rounds. In each
round, for a certain ASC agent, we first check whether it is available by comparing a
uniformly distributed random number with its availability, which follows the normal
distribution. Then we use the same method to check whether the composition has errors.
After that, the ASC agent has to select three elementary web services with different
functionalities. If a selected web service cannot work properly (not available, not reli-
able, or execution duration too long), the ASC agent has to select another elementary
service with the same functionality. The final execution duration of the ASC agent is
calculated as D_asc = Σ_{i=1..n} T_i^ws + (n − 2) · T_asc, where n is the total number
of times the ASC agent interacts with different web services, T_i^ws is the execution
duration of the i-th web service, and T_asc is the duration consumed by the ASC agent
itself. If the final total execution duration is greater than 200 ms, the composition is
regarded as not efficient. At this point the calculation of the composition-layer TSIs is
finished. The calculation of the first three service-layer TSIs is done based on the
services currently maintained by each ASC agent. Then, based on the monitoring
frequency, we decide whether a monitoring process must be started after the current
round. If a monitoring process has to be conducted, we randomly select several web
services of each function type according to the monitoring coverage and check their
availability, reliability, and execution duration. Web services that give ineligible results
are removed; in those cases, new services whose quality is better than that of the
removed ones are added from the web. A web service that was once selected by a certain
ASC agent is not allowed to be added again. After each round, a trustworthiness value
(with equal weights for the TSIs) and an ERR value for each ASC agent are calculated
based on the previous composition records. The results are shown in Fig. 1.
Fig. 1(a) and (b) give the general evolution curves of the trustworthiness values and the
eligible returning rate (ERR) for each ASC agent. In general, as expected, the better the
actual capability of an ASC agent, the better its ERR and trustworthiness. This shows
that the calculated trustworthiness effectively reflects the actual performance. We note
that in the first several rounds, the trustworthiness and ERR curves fluctuate drastically
and cannot reflect the actual capability status. This is because the performance of the
ASC agents and elementary services is of a random nature (normal distributions in the
settings). As the experiment proceeds, our calculated trustworthiness values become
stable much faster than ERR. This indicates that our selected TSIs and trustworthiness
calculation method are more sensitive in reflecting the actual
6 Related Works
Research on trust in the online environment has been conducted mainly in the
information systems domain and the computer science domain.
In the information systems domain, researchers are concerned with the fundamental
conceptualization and the implications of each concept. The seminal work that clarifies
these subtleties includes [9,4,12,5]. One branch of trust research focuses on analyzing
institutional trust elements and their functions [18,10,13]. Recently, brain imaging
technologies have been used to analyze trust reactions [8,2]. Basically, trust research in
information systems focuses on exploring the nature and roles of trust and uses different
methods to analyze the impact of people's trustworthy and untrustworthy behaviors.
This research is valuable in guiding people and organizations in forming trust
relationships.
Researchers in computer science have paid more attention to trust modeling and trust
framework construction. There are several good survey papers in this field [17,7,1];
details cannot be listed here due to the page limit.
References
1. Artz, D., Gil, Y.: A survey of trust in computer science and the Semantic Web. Journal of Web
Semantics: Science, Services and Agents on the World Wide Web 5(2), 58–71 (2007)
2. Dimoka, A.: What Does the Brain Tell Us about Trust and Distrust? Evidence From a Functional
Neuroimaging Study. MIS Quarterly 34(2), 373–396 (2010)
3. Dustdar, S., Schreiner, W.: A Survey on Web Service Composition. International Journal of Web
and Grid Services 1(1), 1–30 (2005)
4. Gefen, D.: Reflections on the Dimensions of Trust and Trustworthiness among Online
Consumers. SIGMIS Database 33(3), 38–53 (2002)
5. Gefen, D., Benbasat, I., Pavlou, P.: A Research Agenda for Trust in Online Environments.
Journal of Management Information Systems 24(4), 275–286 (2008)
6. Jiao, H., Liu, J., Li, J., Liu, C.: Trust for Web Service Composers. In: The 12th Global
Information Technology Management Association World Conference, GITMA, pp. 180–184 (2011)
7. Jøsang, A., Ismail, R., Boyd, C.A.: A survey of trust and reputation systems for online service
provision. Decision Support Systems 43(2), 618–644 (2007)
8. King-Casas, B., Tomlin, D., Anen, C., Camerer, C.F., Quartz, S.R., Montague, P.R.: Getting to
Know You: Reputation and Trust in a Two-Person Economic Exchange. Science 308(5718),
78–83 (2005)
9. Mayer, R.C., Davis, J.H., Schoorman, F.D.: An Integrative Model of Organizational Trust. The
Academy of Management Review 20(3), 709–734 (1995)
10. McKnight, D.H., Chervany, N.L.: What Trust Means in E-Commerce Customer Relationships:
An Interdisciplinary Conceptual Typology. International Journal on Electronic Commerce 6(2),
35–59 (2001)
11. McKnight, D.H., Choudhury, V., Kacmar, C.: Developing and Validating Trust Measures for
e-Commerce: An Integrative Typology. Info. Sys. Research 13(3), 334–359 (2002)
12. Pavlou, P.A., Dimoka, A.: The Nature and Role of Feedback Text Comments in Online
Marketplaces: Implications for Trust Building, Price Premiums, and Seller Differentiation.
Information Systems Research 17(4), 392–414 (2006)
13. Pavlou, P.A., Tan, Y.-H., Gefen, D.: The Transitional Role of Institutional Trust in Online
Interorganizational Relationships. In: 36th Hawaii International Conference on System Sciences,
HICSS 2003, p. 215a (2003)
14. Ran, S.: A Model for Web Service Discovery with QoS. ACM SIGecom Exchanges 4(1),
1–10 (2003)
15. Rao, J., Su, X.: A Survey of Automated Web Service Composition Methods. In: Cardoso, J.,
Sheth, A.P. (eds.) SWSWPC 2004. LNCS, vol. 3387, pp. 43–54. Springer, Heidelberg (2005)
16. Tsur, S., Abiteboul, S., Agrawal, R., Dayal, U., Klein, J., Weikum, G.: Are Web Services the
Next Revolution in e-Commerce? Paper presented at The 27th International Conference on Very
Large Data Bases, VLDB 2001 (2001)
17. Wang, Y., Vassileva, J.: A Review on Trust and Reputation for Web Service Selection. In: 27th
International Conference on Distributed Computing Systems Workshops, ICDCSW 2007,
pp. 25–33 (2007)
18. Zucker, L.G.: Production of trust: Institutional sources of economic structure, 1840–1920.
Research in Organizational Behavior 8(1), 53–111 (1986)
An Integrated Approach
for Large-Scale Relation Extraction from the Web
1 Introduction
Information available on the Web has the potential to be a great source of structured
knowledge. However this potential is far from being realized. The main benefit of ob-
taining exploitable facts such as relationships between entities from natural language
texts is that machines can automatically interpret them. The automatic processing en-
ables advanced applications such as semantic search, question answering and various
other applications and services. The Linked Open Data (LOD) aims at making the
vision of creating a large structured database a reality. In this domain, the building
of semantic knowledge bases such as DBpedia, MusicBrainz or Geonames is (semi-
)automatically performed by adding new facts which are usually represented by triples.
However, most of these triples express a simple relationship between an entity and one
of its properties, such as the birthplace of a person or the author of a book. By mining
structured and unstructured documents from the Web, one can provide more complex
relationships such as parodies. A different vision known as the web of concepts shares
similar objectives with LOD [12]. As a consequence, knowledge harvesting [9,10,14]
and more generally open-domain information extraction [5] are emerging fields with
the goal of acquiring knowledge from textual content.
In this paper, we propose a relation extraction approach named SPIDER1 . It aims at
addressing the previously mentioned issues by integrating the most relevant techniques
1 Semantic and Provenance-based Integration for Detecting and Extracting Relations.
2 Overview
2.1 Problem Definition
The overall goal of SPIDER is to continuously generate relationships and patterns. Let
us first define a relationship: it is a triplet <e1, r, e2> where e1 and e2 represent entities
and r stands for a type of relationship. An entity might be represented with different
labels in natural language text, and we denote by l_e ∈ L one of the mentions of the
entity e. An example of a relationship is createdBy between the work entity The Lord
of the Rings and the person entity J.R.R. Tolkien. Note that both the entities and the
type of relationship are uniquely identified by a URI. An example denotes a pair of
entities (e1, e2) which satisfies a type of relationship.
The patterns are extracted from a collection of documents D = {d1, d2, ..., dn}.
Although the approach is not limited to them, these documents are webpages in our
context. Each document is composed of sentences, which may contain mentions of the
two entities; in that case, the sentence is extracted as a candidate pattern. We denote by
CP the set of candidate patterns given a collection of documents and a set of initial
examples. Each candidate pattern is defined as a tuple cp = {t_b, e1, t_m, e2, t_a}, with
t_b, t_m and t_a respectively standing for the text before, the text in the middle, and the
text after the entities. The sentence "Bored of the Rings is a parody of Lord of the Rings"
is transformed into the corresponding candidate pattern {"", e1, "is a parody of", e2, ""}.
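A minimal sketch of this candidate-pattern extraction, assuming the two entity labels have
already been located in the sentence and using the 15-word co-occurrence window
mentioned later in the paper:

```python
def candidate_pattern(sentence, label1, label2, max_gap_words=15):
    """Split a sentence into (t_b, e1, t_m, e2, t_a) around the first occurrence
    of the two entity labels; return None if they are absent or too far apart."""
    i, j = sentence.find(label1), sentence.find(label2)
    if i < 0 or j < 0 or i == j:
        return None
    (start, first), (end, second) = sorted([(i, label1), (j, label2)])
    t_b = sentence[:start].strip()
    t_m = sentence[start + len(first):end].strip()
    t_a = sentence[end + len(second):].strip()
    if len(t_m.split()) > max_gap_words:   # entities must co-occur closely
        return None
    return (t_b, first, t_m, second, t_a)

# candidate_pattern("Bored of the Rings is a parody of Lord of the Rings",
#                   "Bored of the Rings", "Lord of the Rings")
# -> ("", "Bored of the Rings", "is a parody of", "Lord of the Rings", "")
```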
From the set of candidate patterns CP, we derive a set P of generic patterns by ap-
plying a strategy s. A strategy is defined as a sequence of operations s = <o1, o2, ...,
ok>, each operation aiming at generalizing the candidate patterns. Namely, this gener-
alization implies the detection of frequent terms and the POS-tagging of the other terms
of the candidate patterns. Thus, a strategy is a function such that s(CP) = P. All
generated patterns are associated with a specific type of relationship r. For instance,
a generic pattern for the parody type is illustrated by {e1} is/VBZ a/DT {parody,
illusion, spoof} of/IN {e2} (VBZ = verb, 3rd person singular present; DT = determiner;
IN = preposition or subordinating conjunction).
Finally, a pattern and an example both have a confidence score, noted conf_p and
conf_e respectively. This score is based on the support, the provenance, the number of
occurrences, the number of strategies, and the iteration. A pattern similarity metric
indicates the proximity between two (candidate) patterns. The notion of confidence
score, as well as those of operation, strategy, pattern similarity, and (candidate) pattern,
are further detailed in the next sections.
2.2 Workflow
Given two labels, SPIDER generates patterns and derives a relationship. This pattern
generation capability is provided by the following two processes:
– Pattern Generation is in charge of detecting candidate patterns by using examples
or provided entities, and of generalizing these candidate patterns to obtain patterns
for a given type of relationship (see Figure 1).
– Example Generation exploits the previously generated patterns in order to discover
new examples which satisfy the type of relationship.
The knowledge base stores all generated examples and patterns. These examples can
be used to keep the system continuously running, but they can also be exploited from a
user perspective.
3 Pattern Generation
The pattern generation process either requires a few examples for a given type of rela-
tionship, so that patterns for this type of relationship can be automatically generated
(supervised mode), or it directly tries to guess the type of relationship for two given
labels (unsupervised mode). The process is similar in both modes and is composed of
three main steps: extension of the entities, extraction of candidate patterns from the
collection of documents, and their refinement into patterns.
In a document, entities are not uniquely identified by a single label; they have alterna-
tive labels or spelling forms. Therefore, extending these entities with their alternative
labels is a crucial step, and it requires the correct identification of the entity. For in-
stance, the entity Lord of the Rings can be labeled "LOTR" or "The Lord of the Rings".
To avoid missing potentially interesting relationships, we search for these alternative
spelling forms in the documents. Given an entity e represented by a label l, the goal is
to discover its set of alternative labels L_e = {l, l1, l2, ..., ln}. The idea is to match the
entity against LOD semantic knowledge bases to obtain this list of alternative labels.
Namely,
we build various queries by decomposing the initial label and we query the alias
attributes of the knowledge bases (i.e., common.topic.alias for Freebase, wikiPage-
Redirects for DBpedia, etc.). In most cases, several candidate entities are returned and
the tool tries to automatically select the correct one.
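The paper does not show how these queries are issued; as one hedged illustration of the
DBpedia branch, alternative labels can be collected from pages that redirect to a resource
(the use of SPARQLWrapper, the dbo:wikiPageRedirects property, and the public endpoint
are assumptions of this sketch, not the authors' implementation):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_alternative_labels(resource="The_Lord_of_the_Rings"):
    """Return English labels of DBpedia pages redirecting to `resource`."""
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery(f"""
        PREFIX dbo:  <http://dbpedia.org/ontology/>
        PREFIX dbr:  <http://dbpedia.org/resource/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?label WHERE {{
            ?alt dbo:wikiPageRedirects dbr:{resource} .
            ?alt rdfs:label ?label .
            FILTER (lang(?label) = "en")
        }}
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["label"]["value"] for b in results["results"]["bindings"]]
```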
The process of automatically selecting the correct entity is achieved as follows. First,
an AND query is constructed with the two labels. Clusters of documents are built,
representing documents belonging to a set of specific entity types. The n words around
the labels are extracted and stemming is performed on them. Our assumption is that
documents mentioning the same entities tend to have similar words; therefore, a graph
of semantically related words is built. The most important documents in the cluster are
then compared against the abstracts of the automatically selected entities. Next, we
extract frequent terms from the most important documents in the result set and use these
frequent words as extensions. Note that if disambiguation is not possible, we discard
the example and do not use it for subsequent pattern generation. The result of the
extension process is a list of alternative labels, as illustrated in Figure 1.
The main issue in this step is the absence of the two labels from any knowledge base,
which means that the entities cannot be extended. In that case, the number of retrieved
documents may not be sufficient to extract good candidate patterns. The first solution
consists of analyzing the retrieved documents to detect potential alternative spellings
by applying metrics such as tf-idf and Named Entity Recognition techniques. Another
possibility is to relax the similarity constraint when searching for a label in a knowledge
base: instead of a strict equality measure between the label and the candidate spelling
forms from a knowledge base, n-gram or Levenshtein similarity metrics with a high
threshold would be a better choice.
their relevance score. The candidate patterns are extracted by parsing these documents
and locating the sentences in which both entities co-occur (defined by a maximum
number of words between them, currently 15 words). Note that we include in the can-
didate patterns the text before and after the entities to obtain full sentences. The final
step aims at refining the candidate patterns to obtain patterns.
Namely, the distance is equal to 0 if the word in the sentence is a frequent term. POS-
tagged terms whose tags are identical (e.g., two verbs) have a similarity obtained by
applying the Resnik distance in WordNet [13], while words with different tags have the
maximal distance. The minimal total distance Δ(p, s) between a pattern p and a sentence
s is computed by the matrix algorithm of the Levenshtein distance4, using this semantic
distance for all pairs of words. The distance is normalized into the range [0, 1] with
Formula 3, which defines the similarity pattsim:

    pattsim(p, s) = 1 / (1 + Δ(p, s))                                        (3)

Our metric is flexible because extra or missing words in a sentence do not significantly
affect the similarity value. Similarly, less important words are mainly compared on their
nature (POS tag). Finally, we can select the sentences which are modeled by a pattern
according to a threshold.
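A hedged sketch of this matching step is shown below. The Levenshtein-style dynamic
program over word sequences follows the description above, but the word-level cost is
simplified: the WordNet Resnik distance is replaced by a fixed same-tag cost, and the unit
insertion/deletion costs are assumptions.

```python
def word_cost(p_word, s_word, frequent_terms, max_dist=1.0):
    """Substitution cost between a pattern token and a sentence token.
    Frequent terms must match exactly; otherwise tokens of the form 'word/TAG'
    are compared on their POS tag (placeholder for the Resnik distance)."""
    if p_word in frequent_terms or s_word in frequent_terms:
        return 0.0 if p_word == s_word else max_dist
    p_tag = p_word.rsplit("/", 1)[-1]
    s_tag = s_word.rsplit("/", 1)[-1]
    return 0.5 if p_tag == s_tag else max_dist

def min_total_distance(pattern, sentence, frequent_terms):
    """Minimal total distance via the Levenshtein matrix algorithm."""
    n, m = len(pattern), len(sentence)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = word_cost(pattern[i - 1], sentence[j - 1], frequent_terms)
            d[i][j] = min(d[i - 1][j] + 1.0,       # delete a pattern word
                          d[i][j - 1] + 1.0,       # insert a sentence word
                          d[i - 1][j - 1] + sub)   # substitute
    return d[n][m]

def pattsim(pattern, sentence, frequent_terms):
    """Formula (3): pattsim(p, s) = 1 / (1 + distance(p, s))."""
    return 1.0 / (1.0 + min_total_distance(pattern, sentence, frequent_terms))
```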
NER. When a sentence in a document corresponds to a pattern, our approach needs to
identify and extract the entities contained in the sentence. Thus, the NER issue is crucial,
as it determines (part of) the output. Indeed, it is necessary to correctly identify the
(labels of) entities in the sentence based on a pattern. To solve this issue, we rely on the
formalism of our patterns: since they have been POS-tagged, the tags serve as delimiters
and constrain the candidate entities.
4 http://j.mp/Levenschtein
5 Scalability
In order to process documents efficiently, we distribute the jobs across several ma-
chines. MapReduce-inspired techniques have become popular for such tasks; therefore,
the collection is indexed with Hadoop, enabling efficient indexing and searching. To
compute statistics, SPIDER makes use of Pig5, a high-level platform for analyzing large
collections of data.
Additionally, we propose a document partitioning scheme to incrementally provide
results on a subset of the collection. The general idea is that highly ranked documents
should be a better source for obtaining patterns than those with lower PageRank,
SpamScore, and relevance score values. The size of a partition depends on the quality
of the obtained patterns and examples, and SPIDER is able to automatically tune the
ideal partition size. Initially, the documents are sorted by their PageRank, SpamScore,
and relevance score as described above. The top k documents are selected for analysis
from the head of the ranked list. For the i-th round, the cursor is moved to the range
[k, i · k] and the documents in that range are picked for analysis. Furthermore, when too
few patterns are discovered for two given labels from this initial set of documents, the
partition size is subsequently adjusted to a higher value.
5 http://pig.apache.org
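A minimal sketch of the partitioning idea; how the three document scores are combined
into a single ranking is not specified in this excerpt, so the combination is left to the caller,
and the adaptive resizing of partitions is omitted:

```python
def ranked_partitions(documents, combined_score, k):
    """Yield successive partitions of `k` documents, best-ranked first.
    `combined_score(doc)` merges PageRank, SpamScore and relevance score."""
    ranked = sorted(documents, key=combined_score, reverse=True)
    for start in range(0, len(ranked), k):
        yield ranked[start:start + k]
```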
The combination of scores as well as the partitioning mechanism makes obtaining the
URLs of the documents and their content for a given query fast, i.e., it only requires a
few seconds. This efficiency is demonstrated in Section 7.2.
6 Related Work
Relationship extraction has been widely studied in the last decade. Supervised systems
for relationship extraction are mainly based on kernel methods, but they suffer from
high processing costs and the difficulty of annotating new training data. One of the
earliest semi-supervised systems, DIPRE, uses a few pairs of entities for a given type of
relationship as initial seeds [3]. It searches the Web for these pairs to extract patterns
representing a relationship, and uses the patterns to discover new pairs of entities. These
new entities are integrated in the loop to generate more patterns, and then to find new
pairs of entities. Snowball [1] enhances DIPRE in two directions. First, a verification
step is performed so that generated pairs of examples are checked with the MITRE
named entity tagger. Second, the patterns are more flexible because they are represented
as vectors of weighted terms, thus enabling the clustering of similar patterns.
Espresso [11] is a weakly-supervised relation extraction system that also makes use
of generic patterns. The system is efficient in terms of reliability and precision. How-
ever, experiments were performed on smaller datasets and it is not known how the
system performs at Web-scale.
TextRunner brought a further perspective to the field of Open Information Extraction,
in which the types of relationships are not predefined [2]. A self-supervised learner is
in charge of labeling the training data as positive or negative, and a classifier is then
trained. Relations are extracted with the classifier, while a redundancy-based probabilis-
tic model assigns confidence scores to each new example. The system was further
developed into the ReVerb framework [6], which improves the precision-recall curve
of TextRunner.
ReadTheWeb / NELL [4] is another project that aims at continuously extracting cat-
egories (e.g., the type of an entity) and relationships from web documents and improv-
ing the extraction step by means of learning techniques. Four components including a
classifier and two learners are in charge of deriving the facts with a confidence score.
According to the content of the online knowledge base, more iterations provide high
confidence scores (almost 100%) for irrelevant relationships. Contrary to NELL, we do
not assume that one document returning a high confidence score for a given relation-
ship is sufficient for approving this relationship. In addition, NELL is mainly dedicated
to the discovery of categories (95% of the discovered facts) rather than relationships
between entities.
A recent work reconciles the three main issues of precision, recall, and performance
[10]. Indeed, Prospera utilizes both pattern analysis with n-gram item sets to ensure
good recall and rule-based reasoning to guarantee acceptable precision. The perfor-
mance aspect is handled by partitioning and parallelizing the tasks in a distributed
architecture. A restriction of this work is that its patterns only cover the middle text
between the two entities. This limitation affects the recall, as shown by the example
"Lord Of The Rings, which Tolkien has written".
Contrary to the older systems, which use rigid representations of patterns, Prospera
and SPIDER include a more flexible definition of patterns, so that similar patterns can
be merged. In addition, our patterns are at the sentence level, which means that the texts
before, after, and between the entities are considered. Our confidence score for a pattern
or an example takes into account crucial criteria such as provenance. Moreover, the
support and occurrence scores are correlated within the same type of relationship, so
the confidence in a pattern or an example may decrease over time. To the best of our
knowledge, none of these approaches deals with the issue of identifying the alternative
labels of an entity; in the future, we plan to demonstrate the impact of these alternative
labels on the recall. Finally, our approach is scalable thanks to document partitioning
based on smart sorting using the SpamScore, PageRank, and relevance score of the
documents.
7 Experimental Results
Our collection of documents is the English portion of the ClueWeb09 dataset (around
500 million documents). For the components, we have used the contextual strategy,
with the Maxent POS-tagger provided with Stanford NLP
(http://nlp.stanford.edu/software/CRF-NER.shtml). The NER component is based on
the OpenNLP toolkit (http://opennlp.apache.org/). Evaluation is performed using well-
known metrics such as precision (the number of correct discovered results divided by
the total number of discovered results). The recall can only be estimated, since we
cannot manually parse the 500 million documents to check whether some results have
been missed. However, we show that the number of correct results increases over time.
In this section, we present our results in terms of the quality of label extension, rela-
tionship discovery, and a comparison with state-of-the-art baseline knowledge extrac-
tion tools. The evaluation of relationship discovery depends on the quality of the
generated patterns, hence we present this evaluation rather than the pattern generation
itself.
Relationship Discovery. We evaluate the quality obtained by SPIDER when running
the first use case (Section 4.2). Given two labels (representing entities), we search for
the correct type of relationship which links them. To fulfill this goal, we have manually
designed a set of 200 relationships, available at http://j.mp/apweb2013. Note that the
type of relationship associated with each example is the most expected one, but several
types of relationship are possible for the same example. Table 1 provides a sample of
examples (e.g., Obama, Hawai) and some candidate types of relationship discovered by
SPIDER (e.g., birthplace); a bolded type of relationship indicates that it is correct for
the example. A second remark about our set of relationships concerns the complexity
of some of them (e.g., <cockatoo, tail, yellow>). The last column shows the initial
confidence score computed for the candidate relationship. The quality is measured
in terms of precision at different top-k values. Indeed, SPIDER outputs a ranked list of
relationship types according to their confidence scores. In addition, our approach is able
to run with or without training data; thus, we have also tested the system when a small
amount of training data is provided. Using 1 training example means that the system
randomly selects 1 correct example for bootstrapping. Experiments with training data
are based on cross-validation, and 5 runs reduce the impact of randomness. The manual
validation of the discovered relationships was performed by 8 people. This manual
validation includes around 3000 invalid relationships and 600 correct ones, and it facil-
itates the automatic computation of precision. In addition, we are able to estimate the
recall, i.e., to evaluate the number of correct types discovered during a run w.r.t. all
validated types. This is an estimate because there may exist more correct types of
relationship than the ones which have been validated. Besides, a discovered type may
have a different spelling from a validated type while both have the same meaning, thus
decreasing the recall.
Figures 2(a) and 2(b) respectively depict the average precision and the average
recall for the 200 relationships by top-k and by number of provided training data. We
first notice that SPIDER achieves low quality without training data (precision from
40% at top-1 to 30% at top-10). The estimated recall values are also quite low at top-1
because there is an average of 3 correct types of relationship for each example. The
top-3 results are interesting with 5 training examples: the precision is acceptable (more
than 80%) while the recall value (32%) indicates that one type of relationship out of
three is correctly identified. Since our dataset contains complex relationships, this
configuration is promising for bootstrapping the system. Precision strongly decreases
at top-5 and top-10, mainly because each example roughly includes 3 types. However,
the top-5 and top-10 recall values indicate that we discover more correct examples.
Finally, we notice that providing a little training data (5 examples) yields at least a 10%
improvement in both precision and recall. This observation is important since our
approach aims at running perpetually by reusing previously discovered examples and
patterns.
The quality results are subject to the complexity of the set of relationships, since we
have selected some complex ones to discover, such as <cockatoo, tail, yellow>. Other
disambiguation problems occurred; for instance, the example Chelsea, London mainly
returns types of relationships about accommodation, because Chelsea is identified as a
district of London and not as the football team.
[Fig. 2. (a) Average precision and (b) average recall at top-1, top-3, top-5, and top-10,
with no training, 1 training example, and 5 training examples]
Baseline Comparison. A final experiment compares our system with two other ap-
proaches, ReadTheWeb (NELL) [4] and Prospera [10], both described in Section 6.
These two approaches were chosen as baselines because their datasets and results are
available online; an evaluation of these tools is described at
http://www.mpi-inf.mpg.de/yago-naga/prospera/. Since the seed examples are avail-
able, we have used them as training data. Table 2 summarizes the comparison between
the three systems in terms of estimated precision, as explained in the experiments
reported in [4,10]. Similarly to Prospera and ReadTheWeb, our precision is an estimate
due to the amount of relationships to validate; namely, 1000 random types have been
validated for each relationship. The average precision of the three systems is the same
(around 0.91). However, the total number of facts discovered by SPIDER (71,921) is
36 times higher than that of ReadTheWeb (2,112) and 1.3 times higher than that of
Prospera (57,070), outperforming both baselines.
Prospera provides slightly better quality results than our approach on the
AthletePlaysForTeam relation. However, several factors influence the precision results
of Prospera, ReadTheWeb, and SPIDER. First, Prospera is able to use seeds and
counter-seeds, while we only rely on positive examples. On the other hand, Prospera
includes a rule-based reasoner combined with the YAGO ontology. Although SPIDER
does not support this feature, the combination of POS-tagged patterns and NER
techniques achieves outstanding precision values.
7.2 Performance
Since knowledge extraction systems deal with large collections of documents, they
need to be scalable. Figure 3(a) depicts the performance of SPIDER for retrieving and
preprocessing (i.e., cleaning up the header and removing HTML tags) the documents.
The total time (the sum of retrieval and preprocessing) is also indicated. Although there
is no caching, the total time for collecting and preprocessing one million documents is
not significant (around 40 seconds). Note that in real cases, a conjunctive query com-
posed of two labels rarely returns more than 20,000 documents. The peak in retrieval
time at 600,000 documents is due to a processing overhead in the thread manager.
Increasing the
[Fig. 3. (a) Retrieval, preprocessing, and total time (in seconds) versus the number of
documents; (b) number of candidate patterns and patterns, and total time, versus the
number of documents]
number of threads above 400 leads to higher thread-switching latency, while decreasing
this number only makes the peak appear earlier in the process. This issue could simply
be solved by dispatching this task to different servers.
ReadTheWeb and Prospera expose their knowledge bases but not their tools. Thus, it
is not possible to evaluate the three systems on the same hardware, and the following
comparison is based on the performance figures reported in the original research papers.
It mainly aims at showing the significant improvement of SPIDER over ReadTheWeb
and Prospera, for a better average quality. To produce the results shown in Table 2,
Prospera needed more than 2 days using 10 servers with 48 GB RAM [10]. In a similar
fashion, ReadTheWeb generated an average of 3618 facts per day over the course of 67
days using the Yahoo! M45 supercomputing cluster [4]. In contrast, our approach per-
formed the same experiment in a few hours: the generation of facts for each type of
relationship took between 20 and 60 minutes with four servers equipped with 24 GB
RAM. Although these values are mainly indicative, SPIDER is more efficient than the
two other systems when dealing with dynamic and large-scale environments.
8 Conclusion
In this paper, we have presented SPIDER, an approach to automatic extraction of
binary relationships from large text corpora. The main advantage of our system is to
References
1. Agichtein, E., Gravano, L.: Snowball: Extracting relations from large plain-text collections. In: Proc. of DL, pp. 85–94. ACM (2000)
2. Banko, M., Cafarella, M.J., Soderland, S., Broadhead, M., Etzioni, O.: Open information extraction from the web. In: Proc. of IJCAI, pp. 2670–2676. Morgan Kaufmann (2007)
3. Brin, S.: Extracting patterns and relations from the world wide web. In: Atzeni, P., Mendelzon, A.O., Mecca, G. (eds.) WebDB 1998. LNCS, vol. 1590, pp. 172–183. Springer, Heidelberg (1999)
4. Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka Jr., E., Mitchell, T.M.: Toward an architecture for never-ending language learning. In: Proc. of AAAI. AAAI Press (2010)
5. Etzioni, O., Banko, M., Soderland, S., Weld, D.S.: Open information extraction from the web. Communications of the ACM 51, 68–74 (2008)
6. Fader, A., Soderland, S., Etzioni, O.: Identifying relations for open information extraction. In: Proc. of EMNLP, pp. 1535–1545. ACL (2011)
7. Levenshtein, V.: Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Journal of Soviet Physics Doklady 10, 707 (1966)
8. Lynam, T.R., Cormack, G.V., Cheriton, D.R.: On-line spam filter fusion. In: Proc. of SIGIR, pp. 123–130. ACM (2006)
9. Mausam, Schmitz, M., Soderland, S., Bart, R., Etzioni, O.: Open language learning for information extraction. In: Proc. of EMNLP, pp. 523–534. ACL (2012)
10. Nakashole, N., Theobald, M., Weikum, G.: Scalable knowledge harvesting with high precision and high recall. In: Proc. of WSDM, pp. 227–236. ACM (2011)
11. Pantel, P., Pennacchiotti, M.: Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In: Proc. of ACL, pp. 113–120. ACL (2006)
12. Parameswaran, A., Garcia-Molina, H., Rajaraman, A.: Towards the web of concepts: Extracting concepts from large datasets. VLDB Endowment 3, 566–577 (2010)
13. Resnik, P.: Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research 11, 95–130 (1999)
14. Takhirov, N., Duchateau, F., Aalberg, T.: An evidence-based verification approach to extract entities and relations for knowledge base population. In: Cudré-Mauroux, P., Heflin, J., Sirin, E., Tudorache, T., Euzenat, J., Hauswirth, M., Parreira, J.X., Hendler, J., Schreiber, G., Bernstein, A., Blomqvist, E. (eds.) ISWC 2012, Part I. LNCS, vol. 7649, pp. 575–590. Springer, Heidelberg (2012)
Multi-QoS Effective Prediction in Web Service Selection
1 State Key Laboratory of Networking and Switching Technology
2 School of Computer Science
Beijing University of Posts and Telecommunications, Beijing, China
{kkliangjun,guo.apple8}@gmail.com, {zouhua,fcyang,rhlin}@bupt.edu.cn
1 Introduction
In recent years, Web services have been designed as computational components to build service-oriented distributed systems [1]. The rising development of service-oriented architecture makes more and more alternative Web services available from different providers with equivalent or similar functions. How to select Web services from a large number of functionally equivalent candidates has therefore become an important issue.
Intuitively, Web services could be selected from all candidates by their QoS values. However, it is hard to obtain all the QoS values observed by a target user, as some Web service invocations may be charged. Hence, QoS prediction [2-5] has become a challenging research topic. For example, Zhu et al. [2] propose a Web service positioning framework for response-time prediction. Zibin Zheng
2 Problem Statement
In this section, we formally describe the Multi-QoS effective prediction problem as follows.
Given a set of users and a set of Web services, and the existing Multi-QoS values observed by different users, predict the missing Multi-QoS values of a Web service when it is invoked by a target user, all at the same time. For example, a bipartite graph G = (U ∪ S, E) represents the user-Web service invocations (shown in Fig. 1), where U denotes the set of users, S denotes the set of Web services, and E denotes the set of invocations between U and S. If user ui has invoked Web service sj, where ui ∈ U, sj ∈ S, and eij ∈ E, the weight Qij = {q_ij^1, q_ij^2, ..., q_ij^m} on edge eij corresponds to the Multi-QoS values (e.g., throughput and response time in this example) of that invocation. Our task is to effectively predict the Multi-QoS values of potential invocations (the broken lines) for a target user.
[Fig. 1. A bipartite user-Web service invocation graph; edges are weighted with observed Multi-QoS values (e.g., response time and throughput), and broken lines denote the missing invocations to be predicted.]
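As a concrete illustration of the bipartite model above, the following sketch (our own representation, not taken from the paper's implementation, and with hypothetical toy data) stores the invocation graph G = (U ∪ S, E) as a dictionary keyed by (user, service) pairs, where each edge carries a vector of QoS values such as response time and throughput:

# Edge weights Q_ij are vectors of QoS values, e.g. (response_time, throughput).
# Hypothetical toy data; real values would come from invocation logs.
ratings = {
    ("u1", "s1"): (0.366, 8.72),
    ("u1", "s2"): (0.375, 7.17),
    ("u2", "s1"): (0.693, 5.98),
}

def services_invoked_by(user):
    """Return the services a given user has invoked (its neighbourhood in G)."""
    return {s for (u, s) in ratings if u == user}

def missing_invocations(user, all_services):
    """Potential invocations (the broken lines in Fig. 1) whose QoS must be predicted."""
    return sorted(all_services - services_invoked_by(user))

all_services = {s for (_, s) in ratings} | {"s3"}
print(missing_invocations("u2", all_services))   # -> ['s2', 's3']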
Third, the prediction step, which aims to predict the missing Multi-QoS values via a Multi-output Support Vector Regression algorithm. This step finally solves our problem based on the extracted features of Web services and the target user's historical QoS experience.
[Fig. 2. Framework overview: normalization of Multi-QoS values, Web service feature extraction from the normalization result, and prediction of the missing Multi-QoS values with MSVR for the target user.]
We employ the Multi-output Support Vector Regression (MSVR for short) algorithm [8] to predict the missing Multi-QoS values at the same time through the following steps:
First, we select the training data according to the invocation experience of the target user. Suppose a target user ue has invoked d Web services; then the training data can be represented by T = {(x1, y1), ..., (xd, yd)}, where xd ∈ W is the extracted feature of Web service sd and yd represents the QoS performance of Web service sd observed by ue.
Second, we employ MSVR to build a function f, which represents the relation between the features of a Web service and the Multi-QoS values observed by the target user. Let R^r be the feature space of Web services and R^m the QoS space. The function f can be described as in formulation (7):

f : R^r → R^m, x_k ↦ W^T φ(x_k) + b    (7)

where W^T φ(x_k) + b reflects a map from the feature space of Web services to the QoS space, which is learned from the training data by MSVR.
Last, we predict the missing Multi-QoS values for the target user based on the features of the unused Web services. Let x_u be the feature of an unused Web service s_u; the predicted Multi-QoS Q_eu for the target user u_e is then obtained by applying the learned function, Q_eu = f(x_u).
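The MSVR algorithm of [8] is not reproduced here. As a rough, hedged sketch of the three steps above, the following code approximates multi-output regression with scikit-learn's MultiOutputRegressor wrapped around an RBF-kernel SVR; this stand-in, the synthetic features, and the parameter values are our assumptions, not the authors' implementation:

import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Step 1: training data T = {(x_1, y_1), ..., (x_d, y_d)} for a target user u_e.
# x_k: r-dimensional feature of an invoked Web service (here r = 4),
# y_k: m-dimensional QoS vector observed by u_e (here response time, throughput).
d, r, m = 30, 4, 2
X_train = rng.random((d, r))
Y_train = np.column_stack([X_train @ rng.random(r) + 0.1 * rng.random(d)
                           for _ in range(m)])

# Step 2: learn f : R^r -> R^m (one SVR per QoS dimension as an approximation of MSVR).
f = MultiOutputRegressor(SVR(kernel="rbf", C=10.0, epsilon=0.02))
f.fit(X_train, Y_train)

# Step 3: predict the missing Multi-QoS values for an unused service s_u.
x_u = rng.random((1, r))
print("predicted Q_eu:", f.predict(x_u)[0])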
4 Experiment
In this section, we conduct several experiments to show the prediction quality of our proposed approach on a real-world Web service dataset [9]. In the experiments, we randomly remove 90% of the entries from the original QoS matrix and use only the remaining 10% to predict the removed entries of a target user (chosen at random). The parameter settings are r = 4 and C = 678, with the remaining MSVR parameters set to 0.02 and 0.6.
Figures 3 and 4 show the predictions of response time and throughput for different users. The experimental results show that our approach performs well in predicting the QoS values of different Web services.
UserMean: predict the missing QoS values based on the mean QoS value of each user.
UPCC: predict the missing QoS values based on similar users [10].
IPCC: predict the missing QoS values based on similar items (an item refers to a Web service in this paper) [11].
[Fig. 3 and Fig. 4. Predicted response time (in seconds) and throughput (in kbps) across different Web services for the evaluated users.]
Table 1 shows the experimental results of the different approaches, with the result of MQEPA highlighted. From Table 1, we can observe that our approach obtains smaller MAE and RMSE values, which shows that MQEPA has better prediction accuracy on both response time and throughput for different users. This indicates that our approach is better suited for Web service selection.
5 Conclusion
In this paper, we propose a novel Multi-QoS Effective Prediction problem, which aims to make effective Multi-QoS predictions based on Multi-QoS attributes and their relationships. To address this problem, we propose a novel prediction framework, the Multi-QoS Effective Prediction Approach. Comprehensive empirical studies demonstrate the utility of the proposed method.
References
1. Zhang, L., Zhang, J., Cai, H.: Services Computing: Core Enabling Technology of the Modern Services Industry. Tsinghua University Press (2007)
2. Zhu, J., Kang, Y., Zheng, Z., Lyu, M.R.: WSP: A Network Coordinate Based Web Service Positioning Framework for Response Time Prediction. In: Proc. of the IEEE ICWS 2012, pp. 90–97 (2012)
3. Zheng, Z., Lyu, M.R.: Collaborative reliability prediction of service-oriented systems. In: Proc. of the 32nd International Conference on Software Engineering, pp. 35–44 (2010)
4. Zhang, Y., Zheng, Z., Lyu, M.R.: Exploring Latent Features for Memory-Based QoS Prediction in Cloud Computing. In: Proc. of the 30th IEEE Symposium on Reliable Distributed Systems, pp. 1–10 (2011)
5. Jiang, Y., Liu, J., Tang, M., Liu, X.F.: An effective web service recommendation method based on personalized collaborative filtering. In: Proc. of the IEEE ICWS 2011, pp. 211–218 (2011)
6. Ortega, M., Rui, Y., Chakrabarti, K., Mehrotra, S., Huang, T.S.: Supporting similarity queries in MARS. In: Proc. of the ACM Multimedia, pp. 403–413 (1997)
7. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788–791 (1999)
8. Yu, S.P., Yu, K., Tresp, V.: Multi-Output Regularized Feature Projection. IEEE Transactions on Knowledge and Data Engineering, 1600–1613 (2006)
9. Zheng, Z., Zhang, Y., Lyu, M.: Distributed QoS Evaluation for Real-World Web Services. In: Proc. of ICWS 2010, pp. 83–90 (2010)
10. Shao, L., Zhang, J., Wei, Y., Zhao, J., Xie, B., Mei, H.: Personalized QoS prediction for web services via collaborative filtering. In: Proc. of ICWS 2007, pp. 439–446 (2007)
11. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: GroupLens: An open architecture for collaborative filtering of netnews. In: Proc. of CSCW 1994, pp. 175–186 (1994)
Accelerating Topic Model Training on a Single Machine
Mian Lu1, Ge Bai2, Qiong Luo2, Jie Tang3, and Jiuxin Zhao2
1 A*STAR Institute of High Performance Computing, Singapore
[email protected]
2 Hong Kong University of Science and Technology
{luo,gbai,zhaojx}@cse.ust.hk
3 Tsinghua University
[email protected]
1 Introduction
Statistical topic models have recently been successfully applied to text mining tasks
such as classification, topic modeling, and recommendation. Latent Dirichlet Alloca-
tion (LDA) [1], one of the major recent developments in statistical topic modeling, immediately attracted considerable interest from both research and industry. However, due to its high computational cost, training an LDA model may take
multiple days [2]. Such a long running time hinders the use of LDA in applications that
require online performance. Therefore, there is a clear need for methods and techniques
to efficiently learn a topic model.
In this paper, we present a solution to the efficiency problem of training the LDA model: acceleration with graphics processing units (GPUs). GPUs have recently become a promising high-performance parallel architecture for a wide range of applications [3]. The current generation of GPUs provides roughly 10x higher computation capability and 10x higher memory bandwidth than a commodity multi-core CPU. Earlier studies on using GPUs to accelerate LDA have shown significant speedups compared with CPU counterparts [4,5]. However, due to the limited size of GPU memory, these methods are not scalable to large data sets. The GPU memory size is relatively small compared with the main memory size, e.g., up to 6 GB for GPUs on the market. This limits the
use of GPU-based LDA for practical applications, since the memory size required by many real-world data sets clearly exceeds the capacity of the GPU memory. More-
over, some parallel implementations of LDA, such as AD-LDA [2], cannot be directly
adapted on the GPU either since the overall memory consumption would easily exceed
the available GPU memory size due to a full copy of word-topic counts maintained for
each processor.
To address this issue, we study how to utilize limited hardware resources for large-
scale GPU-based LDA training. Compared with the existing work on GPU-based LDA,
our implementation is able to scale up to millions of documents. This scalability is
achieved through the following three techniques:
1. On-the-fly generation of document-topic counts. We do not precompute and store
the global document-topic matrix. Instead, the counts for specific documents are
generated only when necessary.
2. Sparse storage scheme. We investigate a more compact storage scheme to store the
document-topic and word-topic count matrices since they are usually sparse.
3. Word token partitioning. Word tokens in the data set are divided into a few disjoint
partitions based on several criteria. Then, for each partition, only a part of the document-topic and word-topic matrices is needed in each iteration.
Based on these techniques, we have successfully trained LDA models on data sets orig-
inally requiring up to 10 GB memory on a GPU with only 1 GB memory. Furthermore,
our GPU-based LDA training achieved significant performance speedups. These results
show that our approach is practical and cost-effective. Also, our techniques can be ex-
tended to solutions that involve multiple GPUs.
The remainder of this paper is organized as follows. In Section 2, we briefly introduce
the LDA algorithm and review GPGPU. We present the design and implementation of
our GLDA in detail in Section 3. In Section 4, we experimentally evaluate our GLDA
on four real-world data sets. We conclude in Section 5.
2 Preliminary
2.1 Gibbs Sampling of Latent Dirichlet Allocation
We develop GLDA based on Gibbs sampling of LDA [6] due to its simplicity and high effectiveness for real-world text processing. In the LDA model, D documents are modeled as a mixture over K latent topics. Given a training data set, we use a set x to represent the words in documents, where xij is the jth word in the ith document. Furthermore, a corresponding topic assignment set z is maintained, where zij is the assigned topic for the word xij. The total number of words in the data set is N. Two count matrices, njk and nwk, are used in LDA. Specifically, njk is the document-topic count, i.e., the number of words in document j assigned to topic k, and nwk is the word-topic count, i.e., the number of occurrences of word w assigned to topic k. The sizes of the matrices njk and nwk are D × K and W × K, respectively, where W is the number of words in the vocabulary of the given data set.
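For concreteness, the sketch below shows one collapsed Gibbs sampling sweep over the count matrices njk and nwk described above. It follows the standard LDA sampling equations rather than the authors' CUDA code; the variable names, the toy corpus, and the hyperparameter values are our own choices:

import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(docs, z, n_jk, n_wk, n_k, alpha, beta, W):
    """One sequential collapsed Gibbs sampling sweep over all word tokens."""
    K = n_jk.shape[1]
    for j, doc in enumerate(docs):
        for pos, w in enumerate(doc):
            k = z[j][pos]
            # Remove the current assignment from the counts.
            n_jk[j, k] -= 1; n_wk[w, k] -= 1; n_k[k] -= 1
            # Full conditional p(z = k | rest), up to a normalizing constant.
            p = (n_jk[j] + alpha) * (n_wk[w] + beta) / (n_k + W * beta)
            k = rng.choice(K, p=p / p.sum())
            # Add the new assignment back.
            z[j][pos] = k
            n_jk[j, k] += 1; n_wk[w, k] += 1; n_k[k] += 1

# Toy setup: D = 3 documents, vocabulary W = 5, K = 2 topics.
docs = [[0, 1, 1, 2], [2, 3, 4], [0, 4, 4, 3]]
D, W, K = len(docs), 5, 2
z = [[rng.integers(K) for _ in doc] for doc in docs]
n_jk = np.zeros((D, K), dtype=int); n_wk = np.zeros((W, K), dtype=int)
n_k = np.zeros(K, dtype=int)
for j, doc in enumerate(docs):
    for pos, w in enumerate(doc):
        n_jk[j, z[j][pos]] += 1; n_wk[w, z[j][pos]] += 1; n_k[z[j][pos]] += 1
gibbs_sweep(docs, z, n_jk, n_wk, n_k, alpha=0.5, beta=0.1, W=W)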
The original Gibbs sampling is inherently sequential, as each iteration in the training
process depends on the results from the previous iteration. To implement it in parallel,
a previous study proposed an approximate parallel LDA algorithm [2], which has been shown to achieve accuracy similar to that of the sequential algorithm. Therefore, in this work, we adopt a similar approximate parallel LDA algorithm for our GPU implementation.
Two parallel LDA algorithms, named AD-LDA and HD-LDA, were proposed by Newman et al. [2]. The speedup of AD-LDA on a 16-processor computer was up to 8 times. Chen et al. [7] implemented AD-LDA using MPI and MapReduce for large-scale LDA applications; their implementations achieved a speedup of around 10 when 32 machines were used. Another asynchronous distributed LDA algorithm was proposed by Asuncion et al. [8], which achieved a speedup of 15-25 with 32 processors.
Among existing studies on GPU-based LDAs, Masada et al. [4] have accelerated
collapsed variational Bayesian inference for LDA using GPUs and achieved a speedup
of up to 7 times compared with the standard LDA on the CPU. Moreover, Yan et al.
[5] have implemented both collapsed Gibbs sampling (CGS) and variational Bayesian
(CVB) methods of LDA on the GPU. Compared with the sequential implementation on
the CPU, the speedups for CGS and CVB are around 26x and 196x, respectively. How-
ever, none of these GPU-based LDAs has effectively addressed the scalability issue. In
contrast, our GLDA can train very large data sets on the GPU efficiently.
A traditional parallel approach [2] for LDA (AD-LDA) distributes both the D documents and the document-topic counts njk evenly over p processors. However, each processor needs to maintain a local copy of all word-topic counts. Each processor then performs LDA training based on its own local documents, document-topic counts, and word-topic counts.
Our parallel algorithm is based on the following fact: with atomic increment and decrement operations, concurrent execution of these operations is guaranteed to produce a correct result. Therefore, in our implementation, we also maintain only one copy of the nwk matrix and adopt two modifications. (1) We use atomic increment and decrement operations, which are supported by recent generation GPUs, in the count updates. (2) We serialize the computation and update on the nwk matrix. With these modifications, Algorithm 1 outlines our parallel LDA algorithm for one iteration. Compared with AD-LDA, which does not perform global updates until the end of each iteration, our algorithm performs global updates after every p words are sampled by the p processors in parallel and guarantees that the updated results are correct. Since the multi-processors on the GPU are tightly coupled, the communication overhead is insignificant in this implementation. Specifically, we map each thread block in CUDA to a processor, and the inherent data parallelism of the sampling algorithm is exploited using multiple threads within each thread block.
The word set x = {xij} and the topic assignment set z = {zij} are both sizeof(int) × N bytes. The document-topic and word-topic count matrices consume sizeof(int) × D × K bytes and sizeof(int) × W × K bytes, respectively. Therefore, the estimated total memory size is at least M bytes:

M = sizeof(int) × (2 × N + (D + W) × K)

where N, D, and W are fixed for a given data set, and the number of topics K is assigned by users.
A large K is usually required for large data sets; for example, a typical value is √D. Figure 1 shows the estimated memory consumption for the NYTimes and PubMed data sets that are used in our evaluations (see Table 1 for the experimental setup). Obviously, current generations of GPUs cannot support such large amounts of required memory space. A straightforward solution is to send the data set xij and topic assignment set zij to the GPU on the fly. However, this method does not solve the problem when either the njk or the nwk matrix cannot be entirely stored in the GPU memory. Therefore, we study more advanced techniques to enable our GLDA to handle large data sets.
Fig. 1. Estimated memory consumptions for NYTimes and PubMed data sets (Table 1) with num-
ber of topics varied
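The estimate M above is easy to evaluate for a given corpus. The snippet below computes it for hypothetical corpus sizes (the N, D, and W values are illustrative placeholders, not the actual NYTimes or PubMed statistics) and shows how quickly K pushes the total past a 1 GB GPU memory budget:

INT = 4  # sizeof(int) in bytes

def estimated_memory_bytes(N, D, W, K):
    """M = sizeof(int) * (2*N + (D + W) * K) from the analysis above."""
    return INT * (2 * N + (D + W) * K)

# Illustrative corpus sizes only (placeholders, not the evaluated data sets).
N, D, W = 100_000_000, 300_000, 100_000
for K in (256, 1024, 2048):
    gb = estimated_memory_bytes(N, D, W, K) / 2**30
    print(f"K={K:5d}: estimated {gb:.1f} GB")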
Moreover, the additional cost introduced is equivalent to one scan on the topic assign-
ment set z for each iteration. Due to the high memory bandwidth of the GPU, the gen-
eration of document-topic counts can be implemented efficiently through the coalesced
access pattern.
We use a thread block to generate the document-topic counts for a given document.
Algorithm 2 demonstrates the GPU code for a thread to produce document-topic counts
based on the coalesced access. We have adopted a temporary array stored in the local
memory to improve the memory access efficiency for counting as well as to facilitate
the coalesced access for both reads and writes on the device memory. Without storing
all njk counts, the memory saving is significant with a low computation overhead.
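The idea of regenerating njk only when needed can be sketched outside CUDA as well. The function below is a simplified CPU analogue of what Algorithm 2 does per thread block (the function and variable names are our own); it rebuilds the counts for one document from its topic assignments instead of keeping the full D × K matrix resident:

import numpy as np

def document_topic_counts(z_doc, K):
    """Rebuild the njk row for one document from its topic assignments z_doc.

    On the GPU this is done per thread block with coalesced reads; here we
    simply histogram the assignments, which costs one pass over the document.
    """
    counts = np.zeros(K, dtype=np.int32)
    for k in z_doc:
        counts[k] += 1
    return counts

# Example: topic assignments of one document, K = 4 topics.
z_doc = [0, 2, 2, 1, 2, 0]
print(document_topic_counts(z_doc, K=4))   # -> [2 1 3 0]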
In contrast, the nwk counts are stored. This design choice is made because
re-calculating nwk is expensive - as words are distributed across documents, gener-
ating nwk for each iteration would require approximately N scans of the original data
set. Fortunately, the number of topics and the size of vocabulary are both much smaller
than the data set size; therefore, it is feasible to store the nwk counts even for large data
sets. In the following, we further discuss how to store these matrices efficiently and how
to partition them if they do not fit into memory.
in the array for specific pairs. Since p word tokens in p different documents are processed in parallel, we need to decompress the corresponding njk and nwk counts. The estimated memory required is Msparse bytes, where nzjk% and nzwk% refer to the percentages of non-zero elements of the njk and nwk counts, respectively. Note that when nzjk% or nzwk% is over 50%, this storage scheme consumes more memory than the original storage format for the njk or nwk matrix, respectively. Therefore, to decide whether to adopt the sparse matrix storage, we should first estimate the memory space saving.
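A quick way to apply the 50% rule above is to estimate the sparse-format cost before choosing a layout. The sketch below uses our own simplification of the scheme, storing one (topic id, count) integer pair per non-zero element, and compares the dense cost of a count matrix with the cost of storing only its non-zero entries:

INT = 4  # sizeof(int) in bytes

def dense_bytes(rows, K):
    return INT * rows * K

def sparse_bytes(rows, K, nonzero_fraction):
    # One (topic_id, count) pair of ints per non-zero element.
    return INT * 2 * int(rows * K * nonzero_fraction)

rows, K = 100_000, 1024           # e.g. a word-topic matrix with W = 100k (placeholder)
for nz in (0.1, 0.4, 0.6):
    use_sparse = sparse_bytes(rows, K, nz) < dense_bytes(rows, K)
    print(f"nz={nz:.0%}: sparse is {'smaller' if use_sparse else 'larger'}")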
Although the introduced computation overhead is inexpensive, there is a critical
problem for the GPU-based implementation. Since the percentage of non-zero elements
is not guaranteed to monotonically decrease with iterations, a new topic-count pair may
be inserted when an original zero element becomes non-zero. Thus an additional check-
ing step is necessary after each word sampling process to examine whether new pairs
should be inserted. Unfortunately, our evaluated GPUs do not support dynamic memory allocation within the GPU kernel code, and the kernel code cannot invoke the CPU
code at runtime. To perform the checking step and also the GPU memory allocation,
some information must be copied from the device memory to the main memory. The
number of such memory copies is around N/p. The overhead of such a large number
of memory copies may be expensive. Our evaluation results confirm this concern. With
the newer generation GPUs, this concern may be partially or fully addressed.
for the same word token subset. Therefore, the overhead is (1 + m) more scans of the original data set. We denote this global partitioning scheme as doc-voc-part. Partitioning schemes with m = 1 and n = 1 are denoted as doc-part and voc-part, respectively. doc-part and voc-part are suitable when the word-topic and the document-topic matrix, respectively, can be entirely stored in the GPU memory. For a uniform data distribution with range partitioning, the memory size requirement is estimated as Mpart bytes:

Mpart = sizeof(int) × (2N / (n × m) + D × K / n + W × K / m)
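Given a choice of n document partitions and m vocabulary partitions, Mpart can be evaluated directly. The helper below (names and the illustrative sizes are ours) reproduces the estimate above and picks the (n, m) with the fewest partitions that fits a given GPU memory budget:

INT = 4  # sizeof(int) in bytes

def m_part_bytes(N, D, W, K, n, m):
    """Mpart = sizeof(int) * (2N/(n*m) + D*K/n + W*K/m)."""
    return INT * (2 * N / (n * m) + D * K / n + W * K / m)

def smallest_fit(N, D, W, K, budget_bytes, max_parts=8):
    """Return (n, m) with the fewest partitions whose estimate fits the budget."""
    candidates = [(n, m) for n in range(1, max_parts + 1)
                         for m in range(1, max_parts + 1)
                         if m_part_bytes(N, D, W, K, n, m) <= budget_bytes]
    return min(candidates, key=lambda nm: nm[0] * nm[1]) if candidates else None

# Illustrative sizes only; budget of a 1 GB card.
print(smallest_fit(N=100_000_000, D=300_000, W=100_000, K=2048,
                   budget_bytes=2**30))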
Note that the workload difference between different partitions will not hurt the overall
performance significantly since these partitions are processed in different computation
phases sequentially rather than in parallel on the GPU. Moreover, if the entire word to-
ken set, the topic assignment set, and all njk and nwk counts can be stored in the main
memory, we can directly copy the required words and counts from the main memory to
the GPU memory and avoid the additional computation for the corresponding njk and
nwk counts generation in each iteration. This can further reduce the overhead but re-
quires large main memory. In our implementation, we do not store njk and nwk counts
in the main memory thus our implementation can also be used on a PC with limited
main memory.
3.6 Discussion
The three techniques we have proposed can also be combined, e.g., either no njk or
sparse storage scheme can be used together with the word token partitioning. To choose
appropriate techniques, we give the following suggestions based on our experimental
evaluation results: doc-part should be first considered if the data set and topic assign-
ments cannot be entirely stored in the GPU memory. Then if all njk counts cannot be
held in the GPU memory, no njk or doc-part should be considered. Finally, the voc-
part should be further adopted if the GPU memory is insufficient to store all nwk counts,
and the sparse matrix storage may be used for nwk .
4 Empirical Evaluation
4.1 Experimental Setup
We conducted the experiments on a Windows XP PC with a 2.4 GHz Intel Core2 Quad processor, 2 GB of main memory, and an NVIDIA GeForce GTX 280 GPU (G280). The G280 has 240 scalar processors running at 1.3 GHz and 1 GB of device memory. Our GLDA is devel-
oped using NVIDIA CUDA. The CPU counterparts (CPU LDA) are based on a widely
used open-source package GibbsLDA++1 . Note that CPU LDA is a sequential program
using only one core. Due to the unavailability of the code from the previous work on
GPU-based LDA [5], we could not compare our GLDA with it. However, the speedup
reported by their paper is similar to ours when the data can be entirely stored on the
GPU.
Four real-world data sets are used in our evaluations 2 , which are illustrated in Table
1. We use a subset of the original PubMed data set since the main memory cannot
hold the original PubMed data set. The two small data sets KOS and NIPS are used
to measure the accuracy of GLDA, and the two large data sets NYTimes and PubMed
are used to evaluate the efficiency. Moreover, we also have a randomly sampled subset
with a size half of the original NY Times dataset (denoted as Sub-NYTimes) to study the
overhead of different techniques when all data can be stored in the GPU memory. For
each data set, we randomly split it to two subsets. 90% of the original documents are
used for training, and the remaining 10% are used for testing. Throughout our evaluations, the hyperparameters α and β are set to 50/K and 0.1, respectively [6].
We use perplexity [9] to evaluate the accuracy of GLDA. Specifically, after obtaining the LDA model from 1000 training iterations, we calculate the perplexity of the test data.
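The perplexity formula itself is not reproduced in this excerpt; as a standard formulation (our assumption, following common LDA practice rather than the exact evaluation of [9]), it is the exponential of the negative average per-word log-likelihood of the held-out data:

import numpy as np

def perplexity(test_docs, phi, theta):
    """exp(-sum_w log p(w) / N) over held-out word tokens.

    phi   : K x W topic-word distributions
    theta : D_test x K document-topic distributions
    """
    log_lik, n_words = 0.0, 0
    for j, doc in enumerate(test_docs):
        for w in doc:
            log_lik += np.log(theta[j] @ phi[:, w])
            n_words += 1
    return float(np.exp(-log_lik / n_words))

# Tiny example with K = 2 topics and a 3-word vocabulary.
phi = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.3, 0.6]])
theta = np.array([[0.5, 0.5]])
print(perplexity([[0, 2, 1]], phi, theta))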
Fig. 2. Perplexity of KOS and NIPS data sets with number of thread blocks varied
We fix the number of threads in each thread block to 32 when varying the number of blocks. Then the number of thread blocks is fixed to 512 and the number of threads in each thread block is varied. Figure 3 shows the elapsed time of one iteration of GLDA with the number of thread blocks and the number of threads per thread block varied. Compared with the CPU LDA's performance of 10 minutes per iteration, this figure shows significant speedups when either the number of thread blocks or the number of threads per block is increased. However, the elapsed time stays nearly constant, or is only slightly reduced, once the GPU resources are fully utilized.
Next, we study the overhead of techniques used to address the scalability issue. We
first measure the performance when only a single method is adopted. Figure 4(a) shows
the elapsed time of GLDA adopting different techniques. opt refers to the implementa-
tion without any additional techniques and is optimized through the thread parallelism.
There are four partitions for both doc-part and voc-part. doc-voc-part divides both the documents and the vocabulary into two subsets, thus also resulting in four partitions. The
figure shows that the sparse storage scheme has a relatively large overhead compared
with the other methods. Through our detailed study, we find the additional overhead is
mainly from the large number of memory copies between the main memory and GPU
memory. Moreover, voc-part is slightly more expensive than doc-part and no njk since
it needs to scan the original data set for each partition. For data sets used in our evalu-
ations, the sparse storage scheme is always more expensive than other techniques, thus
we focus on the other two techniques in the following evaluations. We further study the
performance overhead when the technique no njk and partitioning are used together.
Figure 4(b) shows the elapsed time when these two techniques are combined. The combined approach is slightly more expensive than adopting only one technique, since the overheads of both techniques are retained in the combined method.
To investigate the memory space saving from different techniques, Figure 5 shows
the GPU memory consumptions corresponding to Figure 4. It demonstrates that the
memory requirement is reduced by around 14%-60% through various techniques. Since
the number of topics is not very large and the partitioning scheme can also reduce the
memory size consumed for the word token set and topic assignment set, partitioning
is more effective than the other two techniques. Moreover, Figure 5(b) shows that the
combined technique can further reduce the memory consumption compared with adopting a single technique.
Fig. 4. Elapsed time of one iteration on the GPU for Sub-NYTimes when adopting different
techniques, K=256
Fig. 5. Overall memory consumption on the GPU for Sub-NYTimes of adopting different tech-
niques, K=256
Fig. 6. Performance comparisons for one iteration of NYTimes and PubMed data sets with num-
ber of topics varied
Finally, we show the overall performance comparisons based on the NYTimes and
PubMed data sets with reasonable numbers of topics. The CPU counterparts are im-
plemented using the same set of techniques, since the main memory is also insufficient to handle such data sets for the original CPU implementation. The estimated memory consumptions using a traditional LDA algorithm for these two data sets are presented in Figure 1, and the actual GPU memory consumption is around 700 MB for each data set. The technique selection is based on the discussion in the implementation section. Specifically, the evaluation with NYTimes adopts the technique doc-part, while
doc-voc-part and no njk are both adopted for the PubMed data set. Figure 6 shows that our GLDA implementations are around 10-15x faster than their CPU counterparts. Such a speedup is significant, especially for large data sets. For example, supposing that 1000 iterations are required for PubMed with 2048 topics, GLDA would reduce the computation from originally more than two months to only around 5 days.
5 Conclusion
We implemented GLDA, a GPU-based LDA library that features high speed and scal-
ability. On a single GPU with 1 GB of memory, we have successfully evaluated its performance using data sets originally requiring up to 10 GB of memory. Our exper-
imental studies demonstrate that GLDA can handle such large data sets and provide a
performance speedup of up to 15X on a G280 over a popular open-source LDA library
on a single PC.
Acknowledgement. This work was supported by grant 617509 from the Research
Grants Council of Hong Kong.
References
1. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. Journal of Machine Learning
Research 3, 993–1022 (2003)
2. Newman, D., Asuncion, A., Smyth, P., Welling, M.: Distributed inference for latent dirichlet
allocation. In: NIPS (2007)
3. Owens, J.D., Luebke, D., Govindaraju, N.K., Harris, M., Kruger, J., Lefohn, A.E., Purcell,
T.J.: A survey of general-purpose computation on graphics hardware. In: Eurographics 2005,
State of the Art Reports (2005)
4. Masada, T., Hamada, T., Shibata, Y., Oguri, K.: Accelerating collapsed variational bayesian
inference for latent dirichlet allocation with nvidia CUDA compatible devices. In: Chien, B.-
C., Hong, T.-P., Chen, S.-M., Ali, M. (eds.) IEA/AIE 2009. LNCS, vol. 5579, pp. 491–500.
Springer, Heidelberg (2009)
5. Yan, F., Xu, N., Qi, Y.: Parallel inference for latent dirichlet allocation on graphics processing
units. In: NIPS 2009, pp. 2134–2142 (2009)
6. Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proceedings of the National Academy
of Sciences, PNAS 2004 (2004)
7. Chen, W.Y., Chu, J.C., Luan, J., Bai, H., Wang, Y., Chang, E.Y.: Collaborative filtering for
orkut communities: discovery of user latent behavior. In: WWW 2009 (2009)
8. Asuncion, A., Smyth, P., Welling, M.: Asynchronous distributed learning of topic models. In:
NIPS (2008)
9. Azzopardi, L., Girolami, M., van Risjbergen, K.: Investigating the relationship between lan-
guage model perplexity and IR precision-recall measures. In: SIGIR 2003, pp. 369–370 (2003)
Collusion Detection in Online Rating Systems
Abstract. Online rating systems are subject to unfair evaluations. Users may
try to individually or collaboratively promote or demote a product. Collabora-
tive unfair rating, i.e., collusion, is more damaging than individual unfair rating.
Detecting massive collusive attacks as well as honest looking intelligent attacks
is still a real challenge for collusion detection systems. In this paper, we study
the impact of collusion in online rating systems and assess their susceptibility to collusion attacks. The proposed model uses a frequent itemset mining technique to
detect candidate collusion groups and sub-groups. Then, several indicators are
used for identifying collusion groups and to estimate how damaging such collud-
ing groups might be. The model has been implemented and we present results of
experimental evaluation of our methodology.
1 Introduction
Today, in e-commerce systems, buyers usually rely on the feedback posted by others
via online rating systems to decide on purchasing a product [13]. The higher the num-
ber of positive feedback on a product, the higher the possibility that one buys such a
product. This fact motivates people to promote products of their interest or demote the
products in which they are not interested by posting unfair rating scores [7, 6]. Unfair
rating scores are scores which are cast regardless of the quality of a product and usu-
ally are given based on personal vested interests of the users. For example, providers
may try to submit supporting feedback to increase the rating of their product in order
to increase their income [6]. The providers also may attack their competitors by giv-
ing low scores on their competitors products. Also, sometimes sellers in eBay boost
their reputations unfairly by buying or selling feedback [7]. Unfair rating scores may be
given individually or collaboratively [18]. Collaborative unfair ratings which are also
called collusion [17, 18] by their nature are more sophisticated and harder to detect than
individual unfair ratings [17]. For that reason, in this work we focus on studying and
identifying collusion.
Although collusion is widely studied in online collaborative systems [4, 8, 10, 13,
12, 9, 19], such systems still face some serious challenges. One major challenge arises
when a group of raters tries to completely take control of a product, i.e., when the number of unfair reviewers is significantly higher than the number of honest users; the existing models usually cannot detect such groups. Also, the existing models do not perform
well against intelligent attacks, in which group members try to give the appearance of honest users. For example, they typically will not deviate from the majority's ranking on most of the cast feedback and target only a small fraction of the products. Such attacks are hard to identify using the existing methods [19].
Moreover, there are cases in which a large group of people have rated the same group of products and thus could be considered a potential collusion group. Such a group might subsequently be deemed non-collusive. However, in such cases there may exist some smaller sub-groups inside the large group which are collusive but remain undetected when considered together with the others. Detection of such sub-groups is not addressed in the existing collusion detection models.
In this paper we propose a framework for detecting and representing collusion in online rating systems. Besides indicators used in existing work, we also define two novel indicators. The first, which we call the suspiciousness of a reviewer, is a metric that estimates to what extent the ratings posted by the reviewer correspond to the majority consensus. This indicator is calculated using two distance functions: the Lp norm and the uniform distance (see Section 4.1 for more details). The second, which we call spamicity, estimates the likelihood that a particular rating score given to a product by a reviewer is unfair.
We also propose a graph data model for representing rating activities in online rat-
ing system. This model allows representing products, reviewers and the rating scores
reviewers have cast on products and identifying collusive groups. We propose a new
notion and representation for collusion groups called biclique. A biclique is a group of
users and a group of products such that every reviewer in such a group has rated every
product in the corresponding group of products. We use bicliques to detect collusion.
Moreover, we propose an algorithm which employs two clustering algorithms built
upon frequent itemset mining (FIM) technique [1] for finding bicliques and sub-bicliques.
Then the algorithm uses the proposed collusion indicators to find any possible collusion
in online rating systems. We have implemented and tested our model, and the evaluation results show the efficiency of our model.
The rest of the paper is organized as follows. In Section 2 we present our data model. In Section 3 we describe our example scenario and data preprocessing details. Our collusion detection method is presented in Section 4 and the corresponding algorithm in Section 5. In Section 6 we give implementation details and evaluation results. We discuss related work in Section 7 and conclude in Section 8.
We define a graph data model (i.e. ORM: Online Rating Model) for organizing a set of
entities (e.g. reviewers and products) and relationships among them in an online rating
system. ORM can be used to distinguish between fair and unfair ratings in an online
rating system and helps to calculate a more realistic rating score for every product. In
ORM, we assume that interactions between reviewers and products are represented by
a directed graph G = (V, E) where V is a set of nodes representing entities and E is a
set of directed edges representing relationships between nodes. Each edge is labeled by
a triplet of numbers, as explained in the following.
An entity is an object that exists independently and has a unique identity. ORM
consists of three types of entities: products, reviewers and bicliques. We use the concept
of folder used in our previous work [3] to represent bicliques.
Product. A product is an item which has been rated by system users in terms of quality
or any other possible aspects. Products are described by a set of attributes such as the
unique identifier (i.e., ID), title, and category (e.g., book, cd, track, etc.). We assume that there are Np products P = {pj | 1 ≤ j ≤ Np} in the system to be rated.
Reviewer. A reviewer is a person who has rated at least one product. Reviewers are
described by a set of attributes and are identified by their unique identifier (i.e. ID). We
assume that there are Nu reviewers U = {ui | 1 ≤ i ≤ Nu} rating products.
Rating Relationship. A relationship is a directed link between a pair of entities, which
is associated with a predicate defined on the attributes of entities that characterizes the
relationship. We assume that no reviewer can rate a product more than once. So, in
ORM, there is at most one relation between every pair of products and reviewers.
When a reviewer rates a product, a rating relationship is established between corre-
sponding reviewer and product. We define R = {eij | ui ∈ U, pj ∈ P} as the set of all rating relationships between reviewers and products, i.e., eij ∈ R is the rating score which ui has given to pj. A rating relationship is weighted with the values of the
following three attributes:
1. The value is the evaluation by the reviewer of the quality of the product and is in the range [1, M], M > 1, where M is a system-dependent constant. For example, in Amazon, rating scores lie between 1 and 5. We denote the value field of the rating by eij.v.
2. The time is the time at which the rating was posted to the system. The time field of the rating is an integer number and is denoted by eij.t.
3. As we mentioned earlier, we assume that every reviewer can have at most one relation to each product (i.e., we do not allow multi-graphs). However, in the real world, e.g., in the Amazon rating system, one reviewer can rate a product several times. Some collusion detection models like [12] simply eliminate duplicate rating scores. We instead use them for the purpose of detecting unfair ratings. Spamicity shows
what fraction of all scores cast for this particular product are cast by the reviewer in
the considered relationship with this product. We denote the spamicity of a rating
relation by eij .spam.
2.1 Biclique
A biclique is a sub-graph of ORM containing a set of reviewers R, a set of products P, and their corresponding relationships Rel. All reviewers in R of a biclique have rated all products in the corresponding P, i.e., there is a rating relationship between every r ∈ R and every p ∈ P. A biclique is denoted by BC = {R, P, Rel}. For example, BC = {{r1, r2, r3}, {p1, p2, p3}, {r1→p1, ...}} means that reviewers r1, r2, and r3 have all voted on p1, p2, and p3. We will use the terms biclique and group interchangeably throughout this paper.
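To make the biclique notion concrete, the brute-force sketch below enumerates reviewer subsets and intersects their rated-product sets; frequent itemset mining over the rating data is the scalable alternative used in the paper, whereas this enumeration, with hypothetical data and our own function names, is only meant to show what a biclique is:

from itertools import combinations

# Hypothetical ratings: reviewer -> set of products rated (rating values omitted).
rated = {
    "r1": {"p1", "p2", "p3", "p7"},
    "r2": {"p1", "p2", "p3"},
    "r3": {"p1", "p2", "p3", "p9"},
    "r4": {"p5", "p6"},
}

def bicliques(rated, min_reviewers=2, min_products=2):
    """Enumerate (R, P) groups where every reviewer in R rated every product in P."""
    found = []
    reviewers = sorted(rated)
    for size in range(min_reviewers, len(reviewers) + 1):
        for group in combinations(reviewers, size):
            common = set.intersection(*(rated[r] for r in group))
            if len(common) >= min_products:
                found.append((set(group), common))
    return found

for R, P in bicliques(rated, min_reviewers=3, min_products=3):
    print(sorted(R), "->", sorted(P))   # r1, r2, r3 all rated p1, p2, p3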
3 Example Scenario
Amazon is one of the well-known online markets. Providers or sellers put products on
the Amazon online market. Buyers go to Amazon and buy products if they find them
of an acceptable quality, price, etc. Users also can share their experiences with others
as reviews or rating scores they cast on products. Rating scores are numbers in the range [1, 5]. Amazon generates an overall rating score for every product based on the rating scores cast by users. Evidence such as [6] shows that the Amazon rating system has widely been subject to collusion and unfair ratings.
We use the log of the Amazon online rating system1, which was collected by Leskovec et al. for analyzing the dynamics of viral marketing [11], referred to in the following as AMZLog. This log contains more than 7 million ratings cast on the quality of more than 500 thousand products, collected in the summer of 2006.
We preprocess AMZLog to fit our graph data model. We delete inactive reviewers and unpopular products and keep only reviewers who have reviewed at least 10 products and products on which at least 10 ratings have been cast. We also change the date format from yyyy-mm-dd to an integer number reflecting the number of days between the date on which the first rating score in the system was posted and the day on which this rating score was cast.
We use redundant votes from one rater on the same product to calculate the spamicity of their relation. Let E(j) be the set of all ratings given to product pj and E(i, j) be the set of all ratings given by reviewer ui to product pj. We calculate the spamicity degree of the relationship between reviewer ui and product pj as follows:

eij.spam = 0, if |E(i, j)| ≤ 2
eij.spam = |E(i, j)| / |E(j)|, otherwise

In this definition, allowing the casting of two votes instead of one accommodates the situations where a genuine mind change has taken place.
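The spamicity definition above is straightforward to compute from the raw, pre-deduplication rating log. The helper below follows the two-vote allowance for genuine mind changes; the data and function names are illustrative, not taken from the authors' implementation:

def spamicity(ratings_by_product):
    """e_ij.spam per (reviewer, product): 0 if the reviewer cast at most two
    ratings on the product, otherwise |E(i,j)| / |E(j)|."""
    spam = {}
    for product, events in ratings_by_product.items():   # events: list of reviewer ids
        total = len(events)                               # |E(j)|
        per_reviewer = {}
        for reviewer in events:
            per_reviewer[reviewer] = per_reviewer.get(reviewer, 0) + 1
        for reviewer, count in per_reviewer.items():      # count = |E(i,j)|
            spam[(reviewer, product)] = 0.0 if count <= 2 else count / total
    return spam

# Hypothetical log: reviewer u1 rated p1 five times out of eight total ratings.
log = {"p1": ["u1"] * 5 + ["u2", "u3", "u4"]}
print(spamicity(log)[("u1", "p1")])   # -> 0.625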
We then calculate an overall degree of similarity for every group to show how similar all members are in terms of the values they have posted as rating scores; we call it the Group rating Value Similarity (GVS). The GVS of a group g is the minimum pairwise similarity between the group members,
where MinT = min(eij.t) and MaxT = max(eij.t), for all i ∈ g.R and j ∈ g.P.
Now, we choose the largest TW(j) as the degree of time similarity between the ratings posted by group members on the target products:

GTS(g) = max(TW(j)), for all pj ∈ g.P    (4)

The bigger the GTS, the more similar the group members are in terms of the time at which they posted their rating scores.
Now, we calculate a credibility degree for every eij ∈ E(j). We denote this credibility degree by θij and use it to marginalize outliers. θij is calculated as follows:

θij = 1, if (mj - dj) ≤ rij ≤ (mj + dj)
θij = 0, otherwise    (7)

Equation (7) implies that only the ratings which fall in the range [mj - dj, mj + dj] are considered credible.
Step 2: In this step, we calculate the average of all credible ratings on the product pj and take it as a dependable guess of the real rating of the product. We denote it by gj and calculate it as follows:

gj = Σi∈E(j) (eij.v × θij) / Σi∈E(j) θij    (8)
Step 3: In the third step, using the rating guessed for every product (Equation (8)), we calculate two error rates for every reviewer. The first is the Lp error rate (here L2), which is the L2 norm of the differences between the ratings cast by the reviewer on the quality of each pj and the corresponding gj. We denote the Lp error rate of reviewer ui by LP(i). Suppose that Ji is the set of indices of all products rated by ui. LP(i) is calculated as follows:

LP(i) = ( Σj∈Ji |eij.v - gj|^2 )^(1/2)    (9)
The second error rate we calculate for every reviewer is the uniform norm of the differences between eij.v and gj over all products rated by ui. We call this error rate the uniform error rate of ui, denote it by UN(i), and calculate it as follows:

UN(i) = max j∈Ji |eij.v - gj|    (10)
Step 4: The suspicious reviewers are those who have a large LP(i), or who have a normal LP(i) but a high UN(i) value. To identify suspicious reviewers, we determine the normal range of error rates over all reviewers; based on this range, the outliers are considered suspicious. Suppose that LP~ is the median of all LP(i) and UN~ is the median of all UN(i). Also, let σLP and σUN be the standard distances of all LP(i) and all UN(i), respectively, calculated similarly to the method used in Equation (6). The list of suspicious reviewers is denoted by S and is built using Equation (11):

S = { u ∈ U | (LP(u) < LP~ - σLP) ∨ (LP(u) > LP~ + σLP) ∨ (UN(u) < UN~ - σUN) ∨ (UN(u) > UN~ + σUN) }    (11)
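Putting Steps 1-4 together, the sketch below computes the credibility weights, the guessed ratings gj, the two error rates, and the outlier-based suspicious set. Since Equations (5)-(6) are not reproduced in this excerpt, we use the median and the median absolute deviation as the "standard distance"; that choice, and all names, are our assumptions:

import numpy as np

def suspicious_reviewers(ratings):
    """ratings: dict (reviewer, product) -> value. Returns the set S of Eq. (11)."""
    products = {p for _, p in ratings}
    # Steps 1-2: credibility weights and guessed rating g_j per product.
    g = {}
    for p in products:
        vals = np.array([v for (_, q), v in ratings.items() if q == p], float)
        m, d = np.median(vals), np.median(np.abs(vals - np.median(vals)))
        cred = (vals >= m - d) & (vals <= m + d)
        g[p] = vals[cred].mean() if cred.any() else m
    # Step 3: LP (L2) and UN (uniform) error rates per reviewer.
    reviewers = {r for r, _ in ratings}
    LP, UN = {}, {}
    for r in reviewers:
        diffs = np.array([abs(v - g[p]) for (u, p), v in ratings.items() if u == r])
        LP[r], UN[r] = float(np.sqrt((diffs ** 2).sum())), float(diffs.max())
    # Step 4: outliers with respect to median +/- standard distance (MAD here).
    def outliers(err):
        vals = np.array(list(err.values()))
        med = np.median(vals)
        sd = np.median(np.abs(vals - med))
        return {r for r, e in err.items() if e < med - sd or e > med + sd}
    return outliers(LP) | outliers(UN)

ratings = {("u1", "p1"): 5, ("u2", "p1"): 4, ("u3", "p1"): 4,
           ("u1", "p2"): 5, ("u2", "p2"): 2, ("u3", "p2"): 2}
print(suspicious_reviewers(ratings))   # -> {'u1'}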
have been attacked by the group. The bigger these two parameters are, the more defective such a collusive group is.
Group Size (GS). The size of a collusive group (GS) is proportional to the number of reviewers who have collaborated in the group, i.e., g.R.size, and is calculated as follows:

GS(g) = |g.R| / max(|g.R|), where g is a group    (13)

GS is a parameter in (0, 1] showing how large the number of members of a group is in comparison with the other groups.
Group Target Products Size (GPS). The size of the target of a group (GPS) is proportional to the number of products which have been attacked by members of the collusion group. The bigger the GPS, the more defective the group. GPS is a number in the range (0, 1] and is calculated as follows:

GPS(g) = |g.P| / max(|g.P|), where g is a group    (14)
Finally, we propose Algorithm 1 for collusion detection. This algorithm uses POC, DI, and the ORM graph, along with the threshold and MaxTW constants, to detect collusion in online rating systems.
[Fig. 2. Precision and recall of the proposed algorithm for different threshold values; domain experts judged whether the sampled bicliques provided for this purpose are collusive or not.]
An effective model should achieve both high precision and high recall, but this is not possible in the real world, because these metrics are inversely related [16]: the cost of improving precision is a reduction in recall, and vice versa. We calculate precision and recall with different thresholds to show how the threshold value impacts the quality and accuracy of the results. Figure 2 shows the results of running our algorithm with different threshold values. We do not specify particular target values for precision and recall. We only note that if the user wants to achieve the highest possible values for both precision and recall, Figure 2 shows that the optimal threshold value is 0.4. In this case 71% of the bicliques are retrieved and 71% of the retrieved results are really collusive. One can change the threshold to trade off quality against coverage of the data. Also, for more accuracy, one can run the model with different groups of randomly chosen bicliques.
Our model also calculates a damaging impact (DI) factor for every group to show how damaging the group is. DI helps users to identify groups that have a high potential damaging power; these are mainly groups with a high number of targeted products. Table 1 shows a sample set of bicliques and their DI and POC. Group 2 has high values for POC and DI, so it is a damaging group. On the other hand, group 9 has small values for POC and DI and is not a really defective group. Looking at group 11 reveals that, although the POC of the group is small (0.121), it can still be damaging because its DI is 0.41. DI is also useful when comparing groups, e.g., groups 4 and 6. Although the POC of group 6 is 0.08 higher than the POC of group 4, the DI of group 4 is 0.07 higher than the DI of group 6. Therefore we can classify them as similar rather than deeming group 4 more damaging than group 6. Without DI, the damaging potential of groups like 11 or 4 might be underestimated, which can lead to incorrect calculation of rating scores.
7 Related Work
Collusion detection has been widely studied in P2P systems [4, 8, 14]. For instance,
EigenTrust [8] tries to build a robust reputation score for P2P collaborators, but research [14] shows that it is still prone to collusion. A comprehensive survey on collusion detection in P2P systems can be found in [4]. Models proposed for detecting colluders in P2P systems are not directly applicable to online rating systems, because in P2P systems the models are mostly built on relations and communications between people, but
in online rating systems there is no direct relation between raters. Reputation management systems are also targeted by collusion. Very similarly to rating systems, colluders in reputation management systems try to manipulate reputation scores through collusion. Many efforts have been put into detecting collusion using majority rules, the weight of the voter, and temporal analysis of user behavior [17], but none of these models is strong enough to detect all sorts of collusion [17, 18]. Yang et al. [19] try to identify collusion by employing both the majority rule and temporal behavior analysis. Their model has not yet been tested thoroughly and has only been applied to a specific dataset and a particular type of attack.
The work most similar to ours is [13], in which Mukherjee et al. propose a model for spotting fake review groups in online rating systems. The model analyzes textual feedback cast on products in the Amazon online market to find collusion groups. They use 8 indicators to identify colluders and propose an algorithm for ranking collusion groups based on their degree of spamicity. However, our model differs from this model in terms of the proposed indicators, the analysis of the personal behavior of raters, and the treatment of redundant rating scores. Also, a recent survey [5] shows that buyers rely more on scores and ratings than on reading textual items when they intend to buy something. So, in contrast with this model, we focus on the numerical aspect of the posted feedback. Moreover, the model proposed by Mukherjee et al. is still vulnerable to some attacks: for example, if the number of attackers on a product is much higher than the number of honest raters, the model cannot identify it as a potential case of collusion.
Another major difference between our work and other related work is that we propose a graph data model and also a flexible query language [2] for better understanding, analyzing, and querying collusion. This aspect is missing in almost all previous work.
8 Conclusion
In this paper we proposed a novel framework for collusion detection in online rating systems. We used two algorithms designed with the frequent itemset mining technique for finding candidate collusive groups and sub-groups in our dataset. We proposed several indicators reflecting the possibility of collusion from different aspects. We used these indicators to assign every group a rank showing its probability of collusion and also its damaging impact. We evaluated our model first statically and showed the adequacy of the way we define and calculate the collusion indicators. We then used the precision and recall metrics to show the quality of the output of our method.
As a future direction, we plan to identify more possible collusion indicators. We also plan to extend the implemented query language with more features and an enhanced visual query builder to assist users employing our model. Moreover, we plan to generalize our model and apply it to other possible areas which are subject to collusive activities.
References
1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In:
Proceedings of VLDB 1994, pp. 487–499 (1994)
2. Allahbakhsh, M., Ignjatovic, A., Benatallah, B., Beheshti, S.-M.-R., Foo, N., Bertino, E.:
Detecting, Representing and Querying Collusion in Online Rating Systems. ArXiv e-prints
(November 2012)
A Recommender System Model Combining Trust with Topic Maps
Zukun Yu1, William Wei Song2, Xiaolin Zheng1, and Deren Chen1
1 Computer Science College, Zhejiang University, Hangzhou, China
2 School of Technology and Business Studies, Dalarna University, Borlänge, Sweden
{zukunyu,xlzheng,drc}@zju.edu.cn,
[email protected]
Abstract. Recommender Systems (RS) aim to suggest to users items that they might like based on users' opinions on items. In practice, information about users' opinions on items is usually sparse compared to the vast information about users and items. Therefore it is hard to analyze and justify users' favorites, particularly those of cold-start users. In this paper, we propose a trust model based on the user trust network, which is composed of the trust relationships among users. We also introduce the widely used conceptual model Topic Maps, with which we classify items into topics for recommender analysis. We novelly combine trust relations among users with Topic Maps to resolve the sparsity problem and the cold-start problem. The evaluation shows that our model and method can achieve a good recommendation effect.
1 Introduction
With the rapid development of the Internet, more and more people use online systems to buy products and services (hereafter called items). However, with the overwhelming amount of information about items available on the Internet, it is extremely difficult for users to easily find and decide what they would like to buy. Recommender systems aim to recommend to target users the items which are considered to have a high possibility of meeting their preferences.
Given a huge number of users making commercial transactions online and an even larger number of items available for sale online, recommender systems have to face two major challenges: data sparsity (the average number of ratings given by users is often very small compared to the huge number of items) and cold start (users who review few items and provide little information). This causes the problem that a recommender system cannot decide what should be recommended to the target users, since it can only directly access the users' opinions on a small proportion of items. Therefore, data sparsity is now taken into account by many recommendation methods [9]. The cold-start problem is a challenge to recommender systems due to the lack of sufficient information to justify users' interests. The traditional solutions to the
2 Related Work
Many efforts have been put into studying and developing recommender systems, aiming to support users doing business online. The major ones include content-based methods, collaborative filtering (CF) methods, and hybrid methods, i.e., combinations of content-based methods, collaborative filtering methods, and others. Recently, researchers have been developing methods that use trust to analyze users or items and thereby recommend to users items they might like.
The content-based methods analyze the items that are rated by users and use the contents of the items to infer user profiles, which are used in recommending items of interest to these users [2]. More specifically, the TF-IDF (term frequency-inverse
how free-text claims on the Internet and their sources can be trusted, by using an iterative
algorithm to compute the scores of trust propagation. However, this work relies on weak
supervision at the evidence level, which makes it difficult to use in other domains. Apart
from trust, reputation (a global aggregation of the local trust scores by all the users [11])
is also important because it represents users' trustworthiness from a systemic perspective.
Kamvar et al. [11] describe an algorithm to decrease the number of downloads of
inauthentic files in a peer-to-peer file-sharing network. Their approach assigns each peer a
unique global trust value, based on the peer's history of uploads. Adler et al. [1] propose a
content-driven reputation system for Wikipedia authors, which can be used to flag new
contributions from low-reputation authors, or to allow only high-reputation authors to
edit critical pages. Trust-based methods use trust relationships among users to build a
social network that links users and use it to derive users' interests. However, these
methods have a shortcoming: they fail to analyze the relationships among items.
It is important for recommendation to take into account the relationships between items.
We apply Topic Maps technology to model relationships between items. Topic Maps
related technologies have been used in different research work. Dichev et al. [5] try to
use Topic Maps to organize and retrieve online information in the context of e-learning
courseware. They argue that Topic Maps offer a standards-based approach for
representing experts' knowledge, which allows further reuse, sharing, and interoperability
of knowledge and teaching units between courseware authors and developers. Dong and
Li [6] propose a new set of hyper-graph operations on XTM (XML Topic Map), called
HyO-XTM, to manage distributed knowledge resources. In HyO-XTM, the set of vertices
is the union of the vertex and hyper-edge sets of the hyper-graph; the set of edges is
defined by the relation of incidence between vertices and hyper-edges of the hyper-graph.
The hyper-graph model matches Topic Maps, with hyper-graph vertices mapping to topic
nodes and edges mapping to association nodes. Topic Maps are shown to be a new way to
graphically manage knowledge.
Based on this previous work, we try, for the first time, to use a topic map to represent
relationships among items of recommender systems, so as to make the recommendation
aware of the relationships between items.
to another user node vj. Each e in E is associated with a value in [-1, 1], indicating the
weight of e. A negative trust value between two users means that they distrust each
other, and a positive trust value means that they trust each other. We use e(vi, vj) to
denote the trust value. We define a truster function of a user node v, yielding the set
of all the users who have a trust relationship (an edge) to v, as
truster(v) = {u ∈ V | (u, v) ∈ E}. We also define a trustee function of v, representing
the set of all the users who have a trust relationship from v, as
trustee(v) = {u ∈ V | (v, u) ∈ E}. For each user node v, we define a reputation function,
denoted as ρ(v), taking values in [0, 1].
Fig. 1. Example of the Topic Maps based trust recommender model: users' trust network (left)
and Topic Maps (right)
Now we consider another type of relationship: rating, R, from the user node set V to
the item node set I. Each element (v, i) ∈ R is an edge from a user v who rates the item i
to that item, with an associated rating value. We define the set of items rated by the user v
by the function item(v) = {i ∈ I | (v, i) ∈ R}. Similarly, we use the function
reviewer(i) = {v ∈ V | (v, i) ∈ R} to represent the set of users who rate the item i.
According to the concept of Topic Maps, an item is an occurrence which belongs to
a topic. We define the type of relationship, belonging, B, from the item set I to the topic
set T. Each element (i, t) ∈ B is a directed edge from an item i to a topic t. An item
might belong to a number of topics, so we use the function topic(i) = {t ∈ T | (i, t) ∈ B}
to define the set of all the topics the item i belongs to. We use the function
occurrence(t) = {i ∈ I | (i, t) ∈ B} to define the set of all the items belonging to the
topic t. To model the association between topics in a topic map, we use a function
α(tm, tn) to represent the association degree between two topics tm and tn. Its value is a
real number in [0, 1]; the higher the value, the closer the two topics.
Using Topic Maps, we conceptually describe users, items, and topics, as well as various
relationships between them. In this section we introduce a computation method to
quantitatively calculate the functions defined on these nodes and relationships.
Fig. 2. Example of trust propagation
In the trust model, the trust relationships among users are the basic ones, from which
we derive users' reputations and propagate new trust relationships among users. The
reputation of a user v is computed by averaging all the trust values of the trust
relationships to v from other users, i.e.,

    ρ(v) = ( Σ_{u ∈ truster(v)} e(u, v) ) / |truster(v)|                (1)

If a user has no trust relationship from any truster, its reputation is set to 0. However,
for any two users u, v that do not have a direct edge (i.e., a trust relationship) between
them but are connected by indirect edges (i.e., via one or more intermediate users), we
use trust propagation to determine the trust value between u and v. In other words, we
aim to use the existing values in the trust network to gain more trust values between the
users without direct trust relationships. For simplicity, we only consider propagating
trust relationships, not distrust relationships, in this paper. We use step-by-step trust
transitivity to derive indirect trust, in which a single step derives trust via the
intermediate user nodes which have direct edges to both the start user node and the
target user node in the user trust network; see Fig. 2. We use solid lines to represent
existing trust relationships among users and dashed lines to represent the derived trust
relationships. The trust relationship between v1 and v4 can be derived in the first single
transitivity step of trust propagation, and that between v1 and v5 can be derived in the
second single transitivity step. In order to clearly explain the propagation process of
trust, we use e^(n)(vi, vj) to denote a newly derived trust value in the nth single step of
trust transitivity. The original trust value e(vi, vj) is denoted as e^(0)(vi, vj), which
represents the existing trust edges in the original trust network.
For example, to derive a new trust value from the user node v1 to the user node v4,
there might be a number of paths, where a path means a chain of edges from v1 to v4
via an intermediate user node, e.g., v2. Here we define a function for the set of the
common user nodes as com(vi, vj) = trustee(vi) ∩ truster(vj). We consider all the paths
to derive the trust value:

    e^(n+1)(vi, vj) = ( Σ_{vk ∈ trustee(vi) ∩ truster(vj)} ( e^({0,1,...,n})(vi, vk) · ρ(vk) + e^({0,1,...,n})(vk, vj) · ρ(vj) ) )
                      / ( Σ_{vk ∈ trustee(vi) ∩ truster(vj)} ( |ρ(vk)| + |ρ(vj)| ) )          (2)

Here, e^({0,1,...,n})(vi, vk) is a trust relationship in the set of all the trust relationships,
including the original trust relationships and those generated in the 1st, 2nd, ..., and nth
steps of the single-step transitivity. The formula above gives a new trust value in [-1, 1]
by applying the single-step transitivity once. For a trust network, we can propagate trust
relationships using a number of steps of the single-step transitivity. We use a
parameter s to control the number of steps of trust propagation.
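As an illustration of Eqs. (1) and (2), the following minimal sketch stores the trust network as a dictionary keyed by ordered user pairs. All names (trust, reputation, propagate_once, propagate) are illustrative rather than taken from the paper, and the paper's exclusion of distrust edges from propagation is not modeled here.

```python
# Minimal sketch of Eqs. (1)-(2); names are illustrative, not from the paper.

def reputation(trust, v):
    """Eq. (1): average of the trust values on edges pointing to v (0 if none)."""
    incoming = [w for (u, x), w in trust.items() if x == v]
    return sum(incoming) / len(incoming) if incoming else 0.0

def propagate_once(trust, users):
    """One single-step transitivity (Eq. (2)): derive e(vi, vj) via intermediate
    nodes vk that vi trusts directly and that directly trust vj."""
    rep = {v: reputation(trust, v) for v in users}
    derived = dict(trust)
    for vi in users:
        for vj in users:
            if vi == vj or (vi, vj) in trust:
                continue                      # keep existing direct edges
            common = [vk for vk in users
                      if (vi, vk) in trust and (vk, vj) in trust]
            if not common:
                continue
            num = sum(trust[(vi, vk)] * rep[vk] + trust[(vk, vj)] * rep[vj]
                      for vk in common)
            den = sum(abs(rep[vk]) + abs(rep[vj]) for vk in common)
            if den > 0:
                derived[(vi, vj)] = num / den  # stays within [-1, 1]
    return derived

def propagate(trust, users, s=3):
    """Apply the single-step transitivity s times (s bounds the propagation depth)."""
    for _ in range(s):
        trust = propagate_once(trust, users)
    return trust
```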
In this section, we discuss how to generate a rating value between a user and an item
when none exists. We use the trust values from the user trust network and the rating
values from the existing ratings given by the users to the items to derive users' new
opinions on items, together with new rating values. We denote the set of rating
relationships newly generated through the user trust network as R′. We indirectly
compute a user's opinion on an item through the ratings on that item from the
intermediate users who rated it directly. These intermediate users should have positive
reputation values. Users with negative reputation values are regarded as malicious users
and are not allowed to give advice to others. The newly generated rating relationships
with rating values are computed by the function given below.
In Section 3.4 we generated a new set of rating relationships using trust propagation on
the user trust network. However, if the user trust network contains a lot of isolated
clusters, the trust values and rating values from one cluster cannot be used for the other
clusters. We call this the isolated trust cluster problem and further explain it in
Section 4.1.
Here we consider using Topic Maps to solve this problem. We use the topic map to
derive users' opinions on topics. To do so, we first define a function g(v, t) to be a
user v's opinion on a topic t, derived from the user's opinions on items and the topics
they belong to, as follows:

    g(v, t) = ( Σ_{tm ∈ T, g(v, tm) ≠ 0, α(tm, t) ≠ 0} g(v, tm) · α(tm, t) ) / |{tm ∈ T | g(v, tm) ≠ 0, α(tm, t) ≠ 0}|          (6)
Finally, we consider how to decide a user's opinion on an item through both trust and
the topic map. We use a combination of the user's opinion derived from the topic map
and the user's opinions on items to compute the user's opinion on the item based on the
user trust network.
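A minimal sketch of Eq. (6), assuming a dictionary g0 with the user's already known opinions on topics (for example averaged ratings of the items in each topic) and a dictionary assoc with the association degrees α(tm, tn) in [0, 1]; the names and the example values are illustrative.

```python
# Sketch of Eq. (6); g0 and assoc are assumed inputs, names are illustrative.

def topic_opinion(g0, assoc, t):
    """Derive the opinion on topic t from opinions on associated topics."""
    related = [tm for tm in g0
               if g0[tm] != 0 and assoc.get((tm, t), 0) != 0]
    if not related:
        return 0.0
    return sum(g0[tm] * assoc[(tm, t)] for tm in related) / len(related)

# Example: opinions on two associated topics suggest an opinion on 'sports'.
g0 = {'basketball': 4.5, 'sports gear': 3.0}
assoc = {('basketball', 'sports'): 0.9, ('sports gear', 'sports'): 0.6}
print(topic_opinion(g0, assoc, 'sports'))   # (4.5*0.9 + 3.0*0.6) / 2 = 2.925
```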
4 Experiments
In this section, first we describe the dataset used in our experiments, and then we
discuss the experiments and their results.
4.1 Dataset
We use a collection of real review data from Epinions.com, provided on Massa's
website [16], as the input dataset. We use two datasets: the trust relationships
(and their values) between users, and the users' ratings on items (and their values). In
order to reduce the time and space complexity of the algorithm, we use two subsets
obtained by the method described below:
- Extract all the users in the first dataset and select 100 different users.
- Extract all the trust relationships among the 100 users. We obtained a subset, called
subset 1. It contains 110 trust relationships, 82 of which have trust value 1 and 28 of
which have trust value -1.
- Extract all the ratings given by the 100 users in the second dataset. We first obtain
282418 ratings on 230126 items from the 64 users who rated items. By sampling the
ratings, we obtain subset 2. It contains 10141 ratings, with 8858 items and 34 users.
Subset 2 has 6 different levels of rating values and 27 topics. As shown in Fig. 3, most
items are rated 5 and 4; only a small proportion of them are rated 1, 2, 3 and 6.
From the sample datasets we constructed for the experiments, we can clearly observe the
problems of data sparsity and cold start users, which we discussed in Section 1. There
are 10141 ratings for 100 users and 8858 items in this recommender system. The subset
is sparse because the ratio of the number of ratings to the size of the user-item matrix
(the number of users times the number of items) is 10141/(100*8858) = 1.14%. Only one
out of the 100 users rated some items, so there are more than 98 cold start users. The
maximum number of ratings for one user is 7507; on average, each user gives about
101 ratings.
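The sparsity figures quoted here can be verified with a few lines of arithmetic (the numbers are taken from the text; the snippet is only a sanity check).

```python
# Sanity check of the dataset statistics quoted above.
ratings, users, items = 10141, 100, 8858
print(f"{ratings / (users * items):.2%}")   # density of the rating matrix, about 1.14%
print(ratings / users)                      # about 101 ratings per user on average
```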
The trust relationships in subset 1 form a user trust network, as shown in Fig. 4. We use
the trust propagation method discussed in Section 3 to obtain more trust relationships.
However, this user trust network consists of many isolated trust clusters, which do not
connect to each other and contribute little to the trust propagation computation.

Fig. 3. The distribution of rating values
Fig. 4. The structure of the user trust network
4.2 Results
We use MAE (Mean Absolute Error) and MAUE (Mean Absolute User Error) [15] to
evaluate our recommendation method, because MAE is the most commonly used and
most easily understood metric, and MAUE, which is based on MAE, averages the error
over users.
Based on the observation that most derived trust values are obtained in the first three
steps of trust propagation on the user trust network, we set s = 3. We evaluate the
variation of accuracy as the weight parameter changes, with a confidence parameter of
0.95 in constructing the topic map. As shown in Fig. 5, both the MAE and the MAUE
decline nearly linearly with the weight parameter. We can see that the MAUE is always
less than the MAE, due to the many cold start users in our dataset. We thus find that the
two metrics are consistent in the relationship between accuracy and the weight
parameter.
Fig. 5. The variation of accuracy with the weight parameter
Fig. 6. The coverage changing with the confidence parameter of the Topic Map
5 Conclusion

We have three contributions in this work. First, we propose a trust model based on the
user trust network and a method to compute the trust propagation. We use the reputation
as the weight in trust propagation, which mimics the way people consider friends' advice
before buying products or services in the real world. That is, advice from users with
greater reputation values is trusted by their friends to a higher degree. Second, we
propose a recommendation method, applying Topic Maps in analyzing users' opinions
on topics, which better and further supports deriving users' opinions on items. Third,
we evaluate our method using the two metrics accuracy and coverage.
The results show that a recommender system based on our method provides better
accuracy and coverage in coping with the problems of data sparsity and cold start users.
For the next step of this study, we plan to scale our method up to a reasonably large
number of users in the user trust network, as in reality a recommender system should be
able to deal with millions of users and hundreds of millions of items. We also plan to
include the computation of distrust relationships in the trust propagation method, as we
believe it will greatly contribute to the accuracy and coverage of recommender systems.
Acknowledgments. The authors would like to thank Dalarna University, Sweden for
providing a visiting position and Zhejiang University, China for providing financial
support of this visit.
References
1. Adler, B.T., Alfaro, L.D.: A Content-Driven Reputation System for the Wikipedia. Technical Report ucsc-crl-06-18, School of Engineering, University of California, Santa Cruz (2006)
2. Basu, C., Hirsh, H., Cohen, W.: Recommendation as Classification: Using Social and Content-Based Information in Recommendation, pp. 714-720. AAAI Press (1998)
3. Blanco-Fernández, Y., Pazos-Arias, J.J., Gil-Solla, A., Ramos-Cabrer, M., López-Nores, M., García-Duque, J., Fernández-Vilas, A., Díaz-Redondo, R.P., Bermejo-Muñoz, J.: A flexible semantic inference methodology to reason about user preferences in knowledge-based recommender systems. Knowl.-Based Syst. 21, 305-320 (2008)
4. Breese, J.S., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. In: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pp. 43-52. Morgan Kaufmann Publishers Inc., Madison (1998)
5. Dichev, C., Dicheva, D., Aroyo, L.: Using topic maps for e-learning. In: Proceedings of the International Conference on Computers and Advanced Technology in Education, including the IASTED International Symposium on Web-Based Education, Rhodes, Greece, pp. 26-31 (2003)
6. Dong, Y., Li, M.: HyO-XTM: a set of hyper-graph operations on XML Topic Map toward knowledge management. Future Gener. Comp. Sy. 20, 81-100 (2004)
7. Golbeck, J., Rothstein, M.: Linking social networks on the web with FOAF: a semantic web case study. In: AAAI 2008, pp. 1138-1143. AAAI Press (2008)
8. Guha, R., Kumar, R., Raghavan, P., Tomkins, A.: Propagation of trust and distrust. In: Proceedings of the 13th International Conference on World Wide Web, pp. 403-412. ACM, New York (2004)
9. Huang, Z., Chen, H., Zeng, D.: Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering. ACM Trans. Inf. Syst. 22, 116-142 (2004)
10. Jamali, M., Ester, M.: TrustWalker: a random walk model for combining trust-based and item-based recommendation. In: KDD 2009, pp. 397-406. ACM, New York (2009)
11. Kamvar, S.D., Schlosser, M.T., Garcia-Molina, H.: The EigenTrust algorithm for reputation management in P2P networks. In: Proceedings of the 12th International Conference on World Wide Web, pp. 640-651. ACM, Budapest (2003)
12. Kini, A., Choobineh, J.: Trust in electronic commerce: definition and theoretical considerations. In: Proceedings of the Thirty-First Hawaii International Conference on System Sciences, vol. 4, pp. 51-61 (1998)
13. Knud, S.: Topic Maps - An Enabling Technology for Knowledge Management. In: International Workshop on Database and Expert Systems Applications, p. 472 (2001)
14. Massa, P., Bhattacharjee, B.: Using Trust in Recommender Systems: An Experimental Analysis, pp. 221-235 (2004)
15. Massa, P., Avesani, P.: Trust-aware recommender systems. In: RecSys 2007, pp. 17-24. ACM, New York (2007)
16. Massa, P.: http://www.trustlet.org/wiki/Extended_Epinions_dataset
17. Melville, P., Mooney, R.J., Nagarajan, R.: Content-boosted collaborative filtering for improved recommendations. In: Eighteenth National Conference on Artificial Intelligence (AAAI 2002)/Fourteenth Innovative Applications of Artificial Intelligence Conference, pp. 187-192 (2002)
18. Miranda, T., Claypool, M., Gokhale, A., Murnikov, P., Netes, D., Sartin, M.: Combining Content-Based and Collaborative Filters in an Online Newspaper (1999)
19. Papagelis, M., Rousidis, I., Plexousakis, D., Theoharopoulos, E.: Incremental Collaborative Filtering for Highly-Scalable Recommendation Algorithms. In: Hacid, M.-S., Murray, N.V., Raś, Z.W., Tsumoto, S. (eds.) ISMIS 2005. LNCS (LNAI), vol. 3488, pp. 553-561. Springer, Heidelberg (2005)
20. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York (1986)
21. Schein, A.I., Popescul, A., Popescul, R., Ungar, L.H., Pennock, D.M.: Methods and Metrics for Cold-Start Recommendations, pp. 253-260. ACM Press (2002)
22. TopicMaps.Org: XML Topic Maps (XTM) 1.0 (2001)
23. Tso, K., Schmidt-Thieme, L.: Empirical Analysis of Attribute-Aware Recommendation Algorithms with Variable Synthetic Data. Data Science and Classification, 271-278 (2006)
24. Vercellis, C.: Business Intelligence: Data Mining and Optimization for Decision Making, p. 277. John Wiley & Sons, Ltd., Chichester (2009)
25. Vydiswaran, V.G.V., Zhai, C., Roth, D.: Content-driven trust propagation framework. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 974-982. ACM, San Diego (2011)
26. Zhang, Z.K., Zhou, T., Zhang, Y.C.: Personalized recommendation via integrated diffusion on user-item-tag tripartite graphs. Physica A 389, 179-186 (2010)
27. Ziegler, C., Lausen, G.: Spreading activation models for trust propagation. In: Proceedings of the 2004 IEEE International Conference on e-Technology, e-Commerce and e-Service, pp. 83-97 (2004)
A Novel Approach
to Large-Scale Services Composition
1 Introduction
avoid complex modeling of the real world. Although it has been proved to be effective
for small-scale service compositions, it can be overly computationally expensive when
working on a large number of services.
In this paper, we present a novel mechanism based on multi-agent reinforcement
learning to enable adaptive service composition. The model proposed in this paper
extends the reinforcement learning model that we previously introduced in [11]. In order
to reduce the convergence time, we introduce a sharing strategy to share the policies
among the agents, through which one agent can use the policies explored by the others.
As the learning process continues throughout the life-cycle of a service composition, the
composition can automatically adapt to changes in the environment and the evolution of
its component services. Experimental evaluation on large-scale service compositions
demonstrates that the proposed model provides good results.
[Figure: two example service-composition workflows (Workflow1 and Workflow2) for a travel-planning scenario, with states s0(m) through sr(m) and candidate services such as Query Flight, Book Flight1, Book Flight2, Query Hotel, Book Hotel, Find Restaurant, Find Map, and Weather Forecast.]
Q-learning is the most popular, and arguably the most effective, model-free algorithm
for RL problems. It does not, however, address any of the issues involved in generalizing
over large state and/or action spaces [7]. That is why, in order to speed up the training
process, we extend the proposed approach into a distributed one, in which multiple
cooperative agents learn to coordinate in order to find the optimal policy in their
environment.
Experience sharing can help agents with similar tasks to learn faster and better. For
instance, agents can exchange information using communication [10]. Furthermore, by
design, most multi-agent systems also allow the easy insertion of new agents into the
system, leading to a high degree of scalability [1].
From agent m's standpoint, its control task could be thought of as an ordinary
reinforcement learning problem, except that its action-selection strategy depends on the
other agents' optimal policies at the beginning of the learning. So, at a certain state,
one agent's policy may be useful to other agents and can help them find an optimal
strategy quickly. But if each agent simultaneously sent its current policy to the other
agents, the amount of communicated information would be huge. For the purpose of
reducing the communication, we do not let the agents communicate with each other
directly, but introduce a supervisor agent which supervises the learning process and
synchronizes the computations of the individual agents. In our algorithm, each agent m
uses the global Q-value estimations stored in the blackboard and communicates to the
supervisor agent its intention to update a Q-value estimation.
So we have two types of agents in our architecture: the local WSCA agents, which
perform the learning, and the WSCS supervisor agent, which coordinates the local
agents. It keeps a blackboard [5] which stores the global Q-value estimations. The local
WSCA agents use the global Q-value estimations stored in the blackboard and
communicate to the WSCS agent their intention to update a Q-value estimation. If a
local agent tries to update a certain Q-value, the WSCS agent will update the global
Q-value estimation only if the new estimation received from the local agent is greater
than the Q-value estimation existing in the blackboard.
In each time step, the agent m updates Q(s, a) by recursively discounting future
utilities and weighting them by a positive learning rate α; in the standard Q-learning
form, with discount factor γ,

    Q(s, a) ← Q(s, a) + α ( r + γ max_{a'} Q(s', a') − Q(s, a) ).

The WSCS supervisor agent initializes the Q-values in the blackboard. During the
training episodes, the individual WSCA agents experiment with paths from the initial
state to a final state, using the ε-greedy mechanism and updating the Q-value estimations
according to the algorithm described below (Algorithm 1). We denote in the following
by Q(s, a) the Q-value estimate associated with the state s and action a, as stored in the
blackboard of the WSCS agent.
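The blackboard scheme described above can be sketched in a single process as follows. The environment interface (env with initial_state/step/is_final), the action provider, and the parameter names alpha, gamma and epsilon are assumptions made for illustration; they are not APIs or values given in the paper.

```python
import random
from collections import defaultdict

class Blackboard:
    """WSCS-side store of global Q-value estimations."""
    def __init__(self):
        self.q = defaultdict(float)

    def propose(self, state, action, value):
        # Accept a local agent's estimate only if it improves the stored one.
        if value > self.q[(state, action)]:
            self.q[(state, action)] = value

def wsca_episode(board, env, actions, alpha=0.5, gamma=0.9, epsilon=0.1):
    """One training episode of a local WSCA agent using the shared Q-values."""
    state = env.initial_state()
    while not env.is_final(state):
        if random.random() < epsilon:                       # epsilon-greedy choice
            action = random.choice(actions(state))
        else:
            action = max(actions(state), key=lambda a: board.q[(state, a)])
        next_state, reward = env.step(state, action)
        best_next = max((board.q[(next_state, a)] for a in actions(next_state)),
                        default=0.0)
        estimate = board.q[(state, action)] + alpha * (
            reward + gamma * best_next - board.q[(state, action)])
        board.propose(state, action, estimate)              # report intention to WSCS
        state = next_state
```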
After the training of the multi-agent system has been completed, the solution learned by
the WSCS supervisor agent is constructed by starting from the initial state and following
the greedy mechanism until a solution is reached. The system applies the solution as a
service workflow to execute. At the same time, the execution is also treated as an episode
of the learning process, and the Q-functions are updated afterwards based on the newly
observed rewards.
By combining execution and learning, our framework achieves self-adaptation
automatically. As the environment changes, the service composition changes its policy
accordingly, based on its new observations of reward. It does not require prior knowledge
about the QoS attributes of the component services, but is able to achieve the optimal
execution policy through learning.
4 Experimental Evaluation
In order to evaluate the methods, we conducted simulations to evaluate the properties of
our service composition mechanism based on the methods discussed in this paper. The
PC configuration was an Intel Xeon E7320 at 2.13 GHz with 8 GB RAM, Windows 2003,
and JDK 1.6.0.
We considered two QoS attributes of services: service fee and execution time. We
assigned each service node in a simulated WSC-MDP graph random QoS values. The
values followed a normal distribution. To simulate
the dynamic environment, we periodically varied the QoS values of existing services at a
certain frequency. We applied the algorithms introduced in Section 3 to execute the
simulated service compositions. The reward function used by each learner was based
solely on the two QoS attributes. After an execution of a service s_i, the learner gets a
reward R(s_i), whose value is

    R(s_i) = (fee_i^max − fee_{s_i}) / (fee_i^max − fee_i^min) + (time_i^max − time_{s_i}) / (time_i^max − time_i^min).

The reward was always positive; service consumers always prefer a low execution time
and a low service fee.
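A small sketch of this min-max normalization, assuming fee_min/fee_max and time_min/time_max are the extreme QoS values among the candidate services of the current state; the variable names are illustrative.

```python
def reward(fee, time, fee_min, fee_max, time_min, time_max):
    # Lower fee and lower execution time yield a larger reward.
    return ((fee_max - fee) / (fee_max - fee_min)
            + (time_max - time) / (time_max - time_min))

print(reward(fee=3.0, time=120, fee_min=1.0, fee_max=5.0,
             time_min=50, time_max=200))    # 0.5 + 0.53... ~ 1.03
```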
We will show that such cooperative agents can speed up learning, measured
by the average cumulative values in training, even though they will eventually
reach the same asymptotic performance as independent agents.
Fig. 3. (a) Results of comparison with 20 services in each state; (b) results of comparison with
30 services in each state; (c) results of comparison with 40 services in each state; (d) results of
comparison with 100 services in each state; (e) results of comparison with 150 services in each
state; (f) results of comparison with 200 services in each state (y-axis: mean cumulative reward)
The results in all cases clearly indicate that the distributed RL approach presented in this
paper learns more quickly and reduces the overall computational time compared with
plain Q-learning.
5 Conclusion
This paper studied a novel framework for large-scale service composition. In order to
reduce the convergence time, we introduced a sharing strategy to share the policies
among agents in a team. The experimental results show that the strategy of sharing the
state-action space improves the learning efficiency significantly. A problem that has to
be investigated further is how to reduce the communication cost between the WSCA
agents and the WSCS agent and to explore other local search mechanisms. Next, we will
concentrate on these issues and improve our algorithm further.
References
1. Busoniu, L., Babuska, R., De Schutter, B.: A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 38(2), 156-172 (2008)
2. Carman, M., Serafini, L., Traverso, P.: Web service composition as planning. In: ICAPS 2003 Workshop on Planning for Web Services, pp. 1636-1642 (2003)
3. Doshi, P., Goodwin, R., Akkiraju, R., Verma, K.: Dynamic workflow composition using Markov decision processes. In: IEEE International Conference on Web Services, pp. 576-582. IEEE (2004)
4. Gao, A., Yang, D., Tang, S., Zhang, M.: Web service composition using Markov decision processes. In: Fan, W., Wu, Z., Yang, J. (eds.) WAIM 2005. LNCS, vol. 3739, pp. 308-319. Springer, Heidelberg (2005)
5. Gonzaga, T., Bentes, C., Farias, R., de Castro, M., Garcia, A.: Using distributed-shared memory mechanisms for agents communication in a distributed system. In: Seventh International Conference on Intelligent Systems Design and Applications, ISDA 2007, pp. 39-46. IEEE (2007)
6. Hwang, S.Y., Lim, E.P., Lee, C.H., Chen, C.H.: Dynamic web service selection for reliable web service composition. IEEE Transactions on Services Computing 1(2), 104-116 (2008)
7. Kaelbling, L., Littman, M., Moore, A.: Reinforcement learning: A survey. Arxiv preprint cs/9605103 (1996)
8. Papazoglou, M., Georgakopoulos, D.: Service-oriented computing. Communications of the ACM 46(10), 25-28 (2003)
9. Sirin, E., Parsia, B., Wu, D., Hendler, J., Nau, D.: HTN planning for web service composition using SHOP2. Web Semantics: Science, Services and Agents on the World Wide Web 1(4), 377-396 (2004)
10. Sutton, R., Barto, A.: Reinforcement learning. Journal of Cognitive Neuroscience 11(1), 126-134 (1999)
11. Wang, H., Zhou, X., Zhou, X., Liu, W., Li, W., Bouguettaya, A.: Adaptive service composition based on reinforcement learning. In: Maglio, P.P., Weske, M., Yang, J., Fantinato, M. (eds.) ICSOC 2010. LNCS, vol. 6470, pp. 92-107. Springer, Heidelberg (2010)
The Consistency and Absolute Consistency
Problems of XML Schema Mappings
between Restricted DTDs
1 Introduction
Consistency, that is, the property that some document conforming to the source schema
can be mapped into a document conforming to the target schema according to the given
dependencies, is an essentially necessary property. It is also important for schema
mappings to be absolutely consistent, that is, that every document conforming to the
source schema can be mapped into a document conforming to the target schema
according to the given dependencies.
Example 1. Define the source and target schemas DS and DT as follows:
DS : DT :
root -> student+ root -> student+
student -> name phone student -> name phone
phone -> home mobile other phone -> home | mobile | other
First, consider the following dependency set Σ1:
Σ1 = {student/phone[home][mobile] → student/phone[home][mobile]}.
Σ1 states that if there is a document of DS such that a phone node has both home and
mobile nodes as its children, then the document is mapped into a document of DT with a
phone node which has, again, both home and mobile nodes as its children. However,
Σ1 is not consistent because in any document of DT, a phone node cannot have both
home and mobile nodes as its children. Next, consider the following dependency set Σ2:
Σ2 = {student/phone[home][other] → student/phone[home][other]}.
Σ3 = {student/phone[home][mobile][other]
      → student[phone/home][phone/mobile][phone/other]}.
2 Preliminaries
2.1 XML Documents and DTDs
In this paper, we regard elements of XML documents as labeled nodes, and entire XML
documents as labeled ordered trees. In what follows, we write trees to mean labeled
ordered trees. Let λ(v) denote the label of a node v. For a sequence of nodes
v1 v2 ··· vn, define λ(v1 v2 ··· vn) as λ(v1)λ(v2)···λ(vn).
A regular expression over an alphabet consists of ε (the empty sequence), the symbols
in the alphabet, and the operators · (concatenation, usually omitted in the notation),
| (disjunction), and * (repetition). We exclude ∅ (the empty set) because we are
interested only in nonempty regular expressions. Let L(e) denote the string language
represented by a regular expression e.
3 Deciding Consistency
3.1 Deciding Consistency of Mappings between General DTDs
To begin with, we show that the consistency decision problem can be solved
by deciding validity and satisfiability of combinations of document patterns in
dependencies. The following lemma is essentially equivalent to the statement
given in the proof of the EXPTIME-completeness of deciding consistency under
general DTDs [5].
Lemma 1. Let M = (DS, DT, Σ) and Σ = {φi → ψi | i ∈ [n]}, where [n] =
{1, 2, . . . , n}. Then, M is consistent if and only if there are two disjoint subsets
I = {i1, ..., ik} and J = {j1, ..., jk′} of [n] such that I ∪ J = [n],
↓::rS[φi1 ∨ ··· ∨ φik] is not valid under DS, and ↓::rT[ψj1] ··· [ψjk′] is satisfiable
under DT, where rS and rT are the root labels of DS and DT, respectively.

Proof. Suppose M is consistent. Then, a pair of trees (TS, TT) belongs to ⟦M⟧.
Let J = {j1, ..., jk′} be a subset of [n] such that TT satisfies ψjl for each jl ∈ J.
Then, because of the semantics of the qualifier [ ], TT satisfies ↓::rT[ψj1] ··· [ψjk′].
On the other hand, let I = {i1, ..., ik} = [n] \ J. Then, TT does not satisfy ψi
for any i ∈ I. So, because (TS, TT) ∈ ⟦M⟧, TS must not satisfy φi for any i ∈ I.
Therefore, because of the semantics of ∨, TS does not satisfy ↓::rS[φi1 ∨ ··· ∨ φik].
[Table: summary of known complexity results for XPath satisfiability for fragments defined by combinations of axes and operators (including the qualifier [ ]); each listed fragment is either in PTIME or NP-complete. The column headers were lost in extraction.]
Proof. For the "if" part, assume that φ is valid under D. Then, since t̄ ∈ TL(D), t̄
satisfies φ. For the "only if" part, assume that t̄ satisfies φ. From the definition of D̄,
for each label l, the unique sequence in L(P̄(l)) is a subsequence of any w ∈ L(P(l)).
Let t be an arbitrary tree in TL(D). Then there is an injective mapping θ from the set
of nodes of t̄ to that of t such that
– λ(v) = λ(θ(v)) for each node v of t̄,
– for any two nodes v1 and v2 of t̄, if v1 is a parent of v2, then θ(v1) is a parent
of θ(v2), and
– for any two nodes v1 and v2 of t̄, if v1 is a following sibling of v2, then θ(v1)
is a following sibling of θ(v2).
From this, we can see by induction on the structure of φ that t̄ ⊨ φ(v, v′) implies
t ⊨ φ(θ(v), θ(v′)). Thus, t also satisfies φ. □
Proof. The "only if" part is trivial. For the "if" part, assume that ↓::r[φ1 ∨ φ2] is
valid under D. Consider t̄ ∈ TL(D). Because t̄ ∈ TL(D), t̄ satisfies ↓::r[φ1 ∨ φ2].
By the semantics of ∨, t̄ satisfies at least one of φ1 and φ2. Without loss of
generality, we assume that t̄ satisfies φ1. From Lemma 2, φ1 is valid under D. □
Proof. The "if" part is trivial. For the "only if" part, assume that both φ1 and φ2
are satisfiable under D. Then, there are trees t1 and t2 in TL(D) such that t1 and t2
satisfy φ1 and φ2, respectively. Because the XPath class X does not include the
negation operator or the next-sibling axis, if a tree t satisfies an XPath expression φ
in X, any tree obtained by inserting arbitrary subtrees into t also satisfies φ [3]. Thus,
if we can obtain the same tree t12 from t1 and t2 by inserting subtrees (see Fig. 1), t12
satisfies ↓::r[φ1][φ2]. To show that such a t12 exists, we have to prove that for any
DC regular expression e, if w1, w2 ∈ L(e), there is w ∈ L(e) such that w1 and w2 are
subsequences of w. This property holds because DC regular expressions cannot
specify non-co-occurrence among labels. □
From the above, we can refine Lemma 1 for deciding consistency of mappings between
DC-DTDs as follows.

Lemma 5. Let M be a mapping (DS, DT, Σ) such that the schemas DS and DT are
DC-DTDs. Then, M is consistent if and only if there is no dependency φ → ψ in Σ
such that φ is valid under DS and ψ is not satisfiable under DT.

Proof. Let Σ = {φi → ψi | i ∈ [n]}, and let rS and rT be the root labels of DS and DT,
respectively. From Lemma 1, we prove this lemma by showing the following condition:
there is no φ → ψ ∈ Σ such that φ is valid under DS and ψ is not satisfiable under DT
if and only if there are two disjoint subsets I = {i1, . . . , ik} and J = {j1, . . . , jk′} of [n]
such that I ∪ J = [n], ↓::rS[φi1 ∨ ··· ∨ φik] is not valid under DS, and
↓::rT[ψj1] ··· [ψjk′] is satisfiable under DT.
Assume that there is no φ → ψ ∈ Σ such that φ is valid under DS and ψ is not
satisfiable under DT. Then, for each l ∈ [n], φl is not valid under DS or ψl is satisfiable
under DT. Let I = {i1, . . . , ik} be the set of all the indexes il in [n] such that φil is not
valid under DS. Let J = {j1, . . . , jk′} = [n] \ I. Then ψjl is satisfiable under DT for
each jl ∈ J. From Lemmas 3 and 4, ↓::rS[φi1 ∨ ··· ∨ φik] is not valid under DS and
↓::rT[ψj1] ··· [ψjk′] is satisfiable under DT.
Conversely, assume that there are two disjoint subsets I = {i1, . . . , ik} and
J = {j1, . . . , jk′} of [n] such that I ∪ J = [n], ↓::rS[φi1 ∨ ··· ∨ φik] is not valid under DS,
and ↓::rT[ψj1] ··· [ψjk′] is satisfiable under DT. From Lemmas 3 and 4, φl is not valid
under DS for any l ∈ I, and ψl is satisfiable under DT for any l ∈ J. Since I ∪ J = [n],
there is no φ → ψ ∈ Σ such that φ is valid under DS and ψ is not satisfiable under DT. □
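Lemma 5 immediately yields the following linear scan over the dependencies. The helpers is_valid and is_satisfiable are hypothetical oracles for the two XPath decision problems, which the paper shows to be tractable under DC-DTDs for suitable XPath classes.

```python
def is_consistent(dependencies, D_S, D_T, is_valid, is_satisfiable):
    """dependencies: iterable of (phi, psi) pairs. M = (D_S, D_T, dependencies)
    is consistent iff no dependency has a valid source pattern and an
    unsatisfiable target pattern (Lemma 5)."""
    for phi, psi in dependencies:
        if is_valid(phi, D_S) and not is_satisfiable(psi, D_T):
            return False
    return True
```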
Similarly to the case of consistency, the above lemma tells us that the hardness of
deciding absolute consistency stems from the combinatorial explosion caused by
checking all the subsets of dependencies, as well as from the hardness of deciding
XPath satisfiability.
5 Conclusion
This paper has discussed the consistency and absolute consistency problems of
XML schema mappings between DC-DTDs. First, we have shown that consis-
tency of mappings between DC-DTDs can be solved by deciding validity and sat-
isfiability of XPath expressions linearly many times. Moreover, we have proved
that the XPath validity problem in the presence of DC-DTDs can be solved
in polynomial time. From these facts, we have proved that the consistency of
mappings between DC-DTDs can be solved in polynomial time if the document
patterns are in an XPath class for which satisfiability and validity are tractable
under DC-DTDs. Next, we have shown that absolute consistency of the map-
pings between DC-DTDs can also be solved by deciding satisfiability of XPath
expressions in the presence of DC-DTD linearly many times. So the absolute
consistency of mappings between DC-DTDs can also be solved in polynomial
time if the document patterns are in an XPath class for which satisfiability is
tractable under DC-DTDs.
References
1. Amano, S., Libkin, L., Murlak, F.: XML schema mappings. In: Proceedings of the Twenty-Eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 33-42 (2009)
2. Ishihara, Y., Morimoto, T., Shimizu, S., Hashimoto, K., Fujiwara, T.: A tractable subclass of DTDs for XPath satisfiability with sibling axes. In: Gardner, P., Geerts, F. (eds.) DBPL 2009. LNCS, vol. 5708, pp. 68-83. Springer, Heidelberg (2009)
3. Ishihara, Y., Shimizu, S., Fujiwara, T.: Extending the tractability results on XPath satisfiability with sibling axes. In: Lee, M.L., Yu, J.X., Bellahsène, Z., Unland, R. (eds.) XSym 2010. LNCS, vol. 6309, pp. 33-47. Springer, Heidelberg (2010)
4. Arenas, M., Libkin, L.: XML data exchange: Consistency and query answering. Journal of the ACM 55(2) (2008)
5. Arenas, M., Barceló, P., Libkin, L., Murlak, F.: Relational and XML Data Exchange. Morgan & Claypool (2010)
6. Benedikt, M., Fan, W., Geerts, F.: XPath satisfiability in the presence of DTDs. Journal of the ACM 55(2) (2008)
7. Bojańczyk, M., Parys, P.: XPath evaluation in linear time. Journal of the ACM 58(4), 17 (2011)
Linking Entities in Unstructured Texts
with RDF Knowledge Bases
Abstract. Entity linking (entity annotation) is the task of linking named entity mentions
on Web pages with the entities of a knowledge base (KB). With the continued progress
of information extraction and semantic search techniques, entity linking has received
much attention in both the research and industrial communities. The challenge of the
task lies mainly in entity disambiguation. To the best of our knowledge, the huge
existing RDF KBs have not been fully exploited for entity linking. In this paper, we
study the entity linking problem via the usage of RDF KBs. Besides the accuracy of
entity linking, the scalability of handling a huge Web corpus and large RDF KBs is also
studied. The experimental results show that our solution to entity linking achieves not
only very good accuracy but also good scalability.
1 Introduction
Nowadays, search engines are widely used as the most convenient way of accessing
information within the huge amount of unstructured text on the Web. However, the
techniques adopted by most existing search engines are based on straightforward matches
between terms within queries and those within the unstructured texts. As a result, users
are often frustrated by having to frequently adjust their query terms to retrieve the desired
results. To address this problem, semantic search has been proposed and widely
studied [1]. In semantic search, the contextual meaning of terms in the unstructured texts
is very important for enriching the semantics of terms. In this paper, we study the
problem of enriching the contexts of named entities within unstructured texts by linking
the detected named entities with existing RDF knowledge bases. By doing this, we are
able to extract important concepts and topics out of the plain texts, and to effectively
support semantic search over the unstructured texts.
With the continued progress of Semantic Web and information extraction techniques,
more and more RDF data emerge on the Web. They form many huge RDF knowledge
bases (KBs) such as Yago [2], Freebase [3], and DBpedia [4]. Such RDF KBs contain
billions of RDF triple facts, either extracted from Web pages or contributed manually by
users, describing information about hundreds of millions of entities. The entity
information in these KBs is so abundant and diverse that it is very suitable for enriching
the contexts of named entity mentions within unstructured texts.
The entity linking (also called entity annotation in this paper) task is defined as mapping
a named entity mention m in the free text of a Web page to a corresponding entity e in a
KB. In our work, a KB refers to an RDF KB unless otherwise specified. Some
applications to which such entity linking can be applied are as follows:
- Enhancing the results of semantic search. Through linking entities with RDF KBs, the
search results over unstructured texts can be expanded by utilizing relevant semantic
information derived from the RDF KBs.
- Cleaning the extracted named entities. The named entity mentions extracted by
automatic information extraction tools often have quality issues such as errors or
duplicates. Linking mentions with the proper entities in RDF KBs can help to clean the
extracted named entities.
- Finding relations between entities. Given a pair of entities, their relations can be
learned from the contexts. The accuracy of entity relation discovery highly depends on
the accuracy of entity resolution during the annotation process.
One major challenge of entity linking is the disambiguation problem. A named entity
mention is ambiguous when a number of entities (of different types) in the KB share the
same name mention. When this happens, we need to accurately link the entity mention to
the proper KB entity. This is called the entity resolution (or entity disambiguation)
problem. For example, a mention "Michael Jordan" may map to at least two entities in a
KB: one is a famous retired NBA basketball player, and the other is a famous computer
scientist at U.C. Berkeley. Many existing approaches use a lot of features from the
contexts to address the entity disambiguation problem. However, an RDF KB can be
treated as a huge labeled graph, and by considering the properties of this graph, we
observe that the relevant entities of mentions in the same context are often close pairs in
the KB graph. Based on this observation, we propose an effective approach that
accomplishes entity disambiguation with only two proposed features.
An open-domain RDF KB has a large number of entities. However, many existing
approaches [8,9] are designed for small-scale data sets, or can only process one document
at a time [21], and are thus not scalable. For a huge RDF KB, entity linking has to be
done in a parallel and distributed fashion to guarantee efficiency and scalability.
Fortunately, the MapReduce [6] framework on cloud computing provides an easy-to-use
way of dispatching the expensive annotating and disambiguating tasks over clusters. We
therefore propose an efficient entity disambiguation algorithm that is able to conduct the
entity linking task in parallel based on the MapReduce framework.
The main contributions of the paper are as follows:
- We propose a simple but effective algorithm to accurately link named entity mentions
in unstructured texts with the proper entities in an RDF KB.
- We propose a MapReduce-based approach to linking mentions in unstructured texts
with entities in the RDF KB. The annotation of multiple documents can be conducted in
parallel and in batch.
- We test the performance of our approaches over two real-life RDF KBs. The results
show that our solution can achieve high accuracy on both KBs. Meanwhile, by designing
the algorithms to run over the MapReduce framework, the solution is very efficient and
scalable.
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3
describes the solution of entity disambiguation using RDF KBs. In Section 4, we propose
an efficient framework for conducting the entity linking task on the MapReduce
framework. Section 5 presents the results of the experimental study. Finally, the paper
ends with conclusions and future work in Section 6.
2 Related Work
Entity linking has received much attention recently[7,15,14], especially after the emerg-
ing of large knowledge bases such as Yago[2], Freebase[3], DBPedia[4]. Most of these
works[8,9] link free texts to real world knowledge bases. The most important and chal-
lenging issue in entity linking task is the named entity disambiguation problem.
Named entity disambiguation is also called co-reference resolution or word sense
disambiguation in some other studies. Most studies in this area tackle the challenges
based on rich contexts where named entity mentions occur. Bagga and Baldwin[11]
use vector space model to resolve the ambiguities in persons names. They represent
the context of the entity mentions with the bag of words model and compute similarity
between vectors. An early work [7] identifies the most proper meaning of ambiguous
words by measuring their context overlaps. In [10], the authors derive a set of features
from Wikipedia to compute similarity between context of mentions and the texts of
Wikipedia.
In word sense disambiguation, there are two main streams of solutions [17,18]. Some
researchers use knowledge extracted from dictionaries to identify the correct sense of a
word in a given context [8,16]. Others collect probabilities from large amounts of
sense-annotated data, and then use machine learning approaches to solve the
problem [17,18]. The authors of [12] implement and combine these two approaches.
They choose the right and left words of the ambiguous word as its context. For each
word, they extract a training feature vector from Wikipedia links, and then integrate the
features in a Naive Bayes classifier. These works in word sense disambiguation are
similar to the entity linking task. However, they are more about assigning dictionary
meanings to polysemous words, which is not as difficult as the entity linking task.
Weikum and Theobald [19] propose their named entity disambiguation approach in
their fact harvesting research. In [20], they interconnect RDF data and Web contents via
LOD (Linked Open Data) [5]. They construct an entity mention graph and use a
coherence graph algorithm to solve the entity disambiguation problem. Through
computing the coherence between ambiguous entities, the mentions are connected to at
most one entity in a KB.
The work most similar to ours is Linden [21]. Linden first builds a dictionary based on
four kinds of Wikipedia pages: entity pages, redirect pages, disambiguation pages and
hyperlinks. Using the dictionary, Linden generates a candidate linking list for each
mention. In order to conduct entity disambiguation, it creates feature vectors of four
dimensions to rank the entities in the candidate list. However, Linden can process only
one document at a time. For real-world applications, many more mentions from more
than one document have to be processed simultaneously. Moreover, it generates many
features to rank the candidate linking entities, some of which (e.g., the global coherence
feature) are relatively hard to compute accurately.
To the best of our knowledge, there is no entity linking work based on an RDF
knowledge base over the MapReduce framework. Because RDF data can be represented
as graphs, we can use the properties of the graph to design a simple but effective
approach.
The frequently used notations in this paper are summarized in Table 1.
The entity linking task basically links named entity mentions with entities in a KB. If
this is naively done based on exact string matching, a mention will quite often be mapped
to more than one entity in the KB because of the homonymy problem. In our work, we
use the Jaccard similarity to compute the similarity between the two strings (the entity
mention and the entity name in the KB); an entity mention is mapped to an entity in the
KB if the similarity between them is higher than a threshold. For example, the mention
"Michael Jordan" in unstructured texts will be mapped to more than one entity in the KB,
for instance, "Michael I. Jordan" and "Michael J. Jordan". This requires a solution for
entity disambiguation.
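A minimal sketch of this candidate-generation step; the token-level Jaccard computation and the threshold value are assumptions for illustration, not the paper's exact implementation.

```python
def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def candidate_entities(mention, kb_entity_names, threshold=0.5):
    return [name for name in kb_entity_names if jaccard(mention, name) >= threshold]

print(candidate_entities("Michael Jordan",
                         ["Michael I. Jordan", "Michael J. Jordan", "Scottie Pippen"]))
# ['Michael I. Jordan', 'Michael J. Jordan']
```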
In this study, we take two measures into account to address the entity disambiguation
problem. One is relevance, which measures the relation between a mention and an
entity. The other is homogeneity, that is, finding the proper linking entity whose
hyponyms are closer to those of the mention. By studying the properties of existing RDF
KBs, we introduce two features to measure the relevance and the homogeneity of an
entity in a KB with respect to a given mention in a certain context.
Definition 1 (RDF KB Graph). Given an RDF KB, consider the set of all entities in the
KB and the set R of relations between entities in the KB. An RDF KB graph is
G = (V, E, L), where V is the set of vertexes (a subset of the entities), E ⊆ R is the set of
edges and L is the set of literals. Lv is the set of labels on vertexes, while Le is the set of
labels on edges, and L = Lv ∪ Le.

Definition 2 (Semantic KB Graph). Given an RDF KB graph, a semantic KB graph
(SG) is SG = (V, E, L, C), where V is the set of vertexes, E is the set of edges, L is the set
of literals and C is the set of classes. For every v ∈ V, Cv ⊆ C.
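One possible (hypothetical) in-memory encoding of these two graphs, given only to make the later sketches concrete; the structure and the example entities are illustrative.

```python
# RDF KB graph: adjacency lists over entity vertexes, with edge labels.
kb_graph = {
    'Michael_Jordan_(scientist)': {'UC_Berkeley': 'worksAt',
                                   'David_E._Rumelhart': 'coauthorOf'},
    'David_E._Rumelhart': {'Michael_Jordan_(scientist)': 'coauthorOf'},
    'UC_Berkeley': {},
}

# Semantic KB graph: the same vertexes plus a class set Cv per vertex,
# taken from rdf:type-like predicates and a WordNet-derived taxonomy.
classes = {
    'Michael_Jordan_(scientist)': {'Professor', 'Person'},
    'David_E._Rumelhart': {'Professor', 'Person'},
    'UC_Berkeley': {'University', 'Organization'},
}
```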
To construct the semantic KB graph, we first recognize entities and relations in a given
RDF KB, where an entity is the set of triples with the same subject (s), and a relation
between entities is a predicate (p) which connects two entities. Based on the entities and
relations, we can easily construct an RDF KB graph. Then, we pick predicates such as
rdf:type in the RDF KB graph as the initial set of class nodes. Duplicates and noise in the
set are cleaned, and some new class nodes are added based on WordNet. We finally build
the taxonomy of the RDF KB. Due to space limitations, we do not discuss the details of
the taxonomy construction. An RDF KB graph expanded with the taxonomy is called a
semantic KB graph. Fig. 1 is an example of an RDF KB graph, while Fig. 2 is an example
of the corresponding semantic KB graph. For instance, after the expansion into a semantic
graph, the nodes "Michael Jordan" and "David E. Rumelhart" in Fig. 1 have the class node
"Professor" in Fig. 2.
Mentions appearing in the same context often have a high probability of being about the
same topic. Meanwhile, in an RDF KB graph, the relevant entities (e.g., those under the
same topic) are likely to be close to each other in terms of graph distance. For example,
if "Michael Jordan" refers to the famous basketball player, mentions such as "NBA" and
"Chicago Bulls" are more likely to be found in the context than mentions such as
"machine learning". Moreover, in the KB graph, the shortest-path distance between the
entities "Michael Jordan" (basketball player) and "NBA" is likely to be shorter than the
distance between "Michael Jordan" (computer scientist) and "NBA". Accordingly, we are
able to measure the relevance of entities to a given mention.
We denote the mentions in the same document as m as the contexts of m, written C. An
entity e ∈ E is the linking entity of a named entity mention m ∈ M. Suppose ck ∈ C is the
kth context mention and ek ∈ E is the linking entity of ck (where we choose those context
mentions from C which have exactly one linking entity; mentions with no linking entity
are useless, and mentions with more than one linking entity would lead to recursive
processing). We formally define the context dependency of an entity e, denoted CD(e), as

    CD(e) = (1/N) Σ_{k=1, k≠i}^{n} Dist(ek, e),          (1)
SS(e) captures the type similarity between the context mentions c and the mention m; it
measures the homogeneity of e. For example, let e1 and e2 be two candidate linking
entities of m. If the class nodes of the linking entities of the context mentions c are closer
to e1 than to e2, then e1 is more likely to be the proper linking entity of m. A higher value
of SS(e) means a class structure closer to that of the candidate linking entity.
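A minimal sketch of the two features, assuming the semantic KB graph is stored as an adjacency dictionary (as in the encoding sketched earlier) and each entity has a set of class labels. The BFS shortest-path distance, the hop cutoff, and the class-overlap form of SS are assumptions for illustration rather than the paper's exact definitions.

```python
from collections import deque

def shortest_dist(graph, src, dst, max_hops=6):
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        if d < max_hops:
            for nb in graph.get(node, ()):
                if nb not in seen:
                    seen.add(nb)
                    queue.append((nb, d + 1))
    return max_hops + 1                      # treat unreachable as far away

def context_dependency(graph, candidate, context_entities):
    if not context_entities:
        return float('inf')
    return sum(shortest_dist(graph, ek, candidate)
               for ek in context_entities) / len(context_entities)

def semantic_similarity(classes, candidate, context_entities):
    cand = classes.get(candidate, set())
    if not cand or not context_entities:
        return 0.0
    return sum(len(cand & classes.get(ek, set())) / len(cand)
               for ek in context_entities) / len(context_entities)
```

Candidates can then be ranked by a score that combines the two features, for instance a weighted sum of 1/(1 + CD(e)) and SS(e), with the maximum-scoring entity chosen as the link, as described in the framework below.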
Efficiency and scalability are two important issues for an entity linking solution. In this
section, we propose a framework for efficient entity linking based on the entity
disambiguation algorithm proposed above. We assume that the mentions that need to be
linked to entities have already been extracted and recognized from Web pages. We also
assume that the mentions have potential linking entities in the KB. In order to link a
mention to an RDF KB, we propose a framework including the following modules:
- Generating the candidate entity linking list. For each mention m ∈ M, obtain the list
of entities having the name m (in the KB) as the candidate linking list. For mentions that
do not have any linking entity in the KB, NIL is returned as the result.
- Linking entity disambiguation. If the candidate linking list has more than one
candidate, we need to find the proper linking entity of m. For each mention m, we use a
score function to compute the score of each entity e in the candidate linking list. First, we
compare the context dependency to see which candidate linking entity is more relevant to
the context mentions. Second, we compute the semantic similarity to check the similarity
of candidate entities in terms of their semantic types. Finally, a score is calculated for
each entity in the candidate linking list, and the one with the maximal score is taken as
the proper entity for mention m.
Each mention m in the Web pages is compared with the entities e in the KB; if they
match, e is added to the candidate linking list of m. The length of the linking list, |L|,
falls into one of the following three cases: (1) |L| = 0, which means no matching entity
exists in the knowledge base; (2) |L| = 1, which means there is only one match in the KB
for m; (3) |L| > 1, which means two or more entities in the knowledge base are possible
entity links of m. For case 1, NIL is returned for m; for case 2, the only e is returned to
show there is a link between m and e; only for case 3 do we need to use the
disambiguation algorithm described in Section 3 to find the proper linking entity of m.
The algorithm is executed on the MapReduce framework. In the map phase, mentions and
entities with the same key are dispatched to the same reduce node; in the reduce phase,
mentions (entities) and their neighbors in the semantic KB graph are gathered to compute
CD and SS. After several iterations of map and reduce phases, the algorithm outputs the
proper linking entity of m. With the MapReduce framework, the algorithm runs in a
parallel fashion, which makes the approach more scalable.
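A plain-Python sketch of the map and reduce phases described here; the record layout and the function names are illustrative, and a real deployment would run the same logic as Hadoop MapReduce jobs rather than in a single process.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # records: ('mention', doc_id, name) or ('entity', entity_id, name)
    for kind, ident, name in records:
        yield name.lower(), (kind, ident)

def reduce_phase(pairs):
    # pairs are grouped by key, as the shuffle stage would guarantee
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        values = [v for _, v in group]
        mentions = [i for k, i in values if k == 'mention']
        entities = [i for k, i in values if k == 'entity']
        for m in mentions:
            # 0 candidates -> NIL; 1 -> direct link; >1 -> run disambiguation
            yield m, entities if entities else ['NIL']

records = [('mention', 'doc1#3', 'Michael Jordan'),
           ('entity', 'e42', 'Michael Jordan'),
           ('entity', 'e77', 'Michael Jordan')]
print(list(reduce_phase(map_phase(records))))
```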
5 Experimental Study
5.1 Experimental Setup
All the experiments are run on the Renda-Cloud platform, which is a Hadoop cluster.
Each node in the cluster has 30GB memory, 2TB disk space and 2.40 GHz core 24
processor, running Hadoop 0.20.2 under Ubuntu 10.04 Linux 2.6.32-24.
In order to evaluate our approach, we choose two real-world RDF KBs, YAGO and
Freebase, for the experimental study. YAGO is a large knowledge base developed
at the Max Planck Institute in Saarbrücken. It contains information from Wikipedia
linked to WordNet, with more than 10 million entities and 120 million facts about
the entities. YAGO contains not only entities and relations between entities but also
Accu. = N_l / N_m                                      (6)

where N_l is the number of mentions that are correctly linked to entities and N_m is the
total number of mentions.
Sections 5.2 and 5.3 give the experimental results on accuracy and scalability over
the two RDF KBs respectively.
Accuracy. In order to evaluate the performance, we first extract 1000 mentions from
50 Wiki pages as 50 sets of mentions; the mentions in each set serve as contexts for each other.
We test Accu. using different features in the score function. Table 2 shows the highest,
lowest, and average Accu. with different features over the 50 groups of mentions. We then
compare our solution with Linden, the existing work most similar to ours.
We use the 1000 mentions in one group and test our solution on 2
Hadoop nodes, while Linden is tested on a PC with a 1.8 GHz Core 4 Duo processor and 10 GB
of memory, running a 64-bit Red Hat Linux kernel. Table 3 shows the results.
Table 2. Experimental results on YAGO: Accu.

  feature set    Max      Min      Avg.
  CD             0.9406   0.9198   0.9325
  SS             0.9264   0.8813   0.9054
  CD + SS        0.9487   0.9232   0.9411

Table 3. Experimental results on YAGO: compared with Linden

  our solution              Linden
  feature set    Accu.      feature set    Accu.
  CD             0.9395     LP             0.8665
  SS             0.9240     LP+SA          0.9189
  CD + SS        0.9436     LP+SA+SS       0.9391
From Table 2 we can see that the accuracy of CD is higher than that of SS. By combining
the two features, our solution achieves the best accuracy. This indicates that, in most cases,
relatedness is more useful in dealing with the entity linking problem. Although Linden
performs well, the comparison in Table 3 shows that the accuracy of our
solution is better than that of Linden.
Scalability. In order to test the scalability of the algorithm, we vary the number
of mentions from 1000 to 100,000 and execute the algorithm on 2, 8, and 20 Hadoop nodes,
respectively. The results are shown in Table 4. The reported
time includes generating the candidate linking list and scoring the candidate linking
entities, as these are the main jobs in the entity linking task.
                        # of mentions
  # of cluster nodes    1000     5000     10000    100000
  2                     23.39    26.27    38.61    108.05
  8                     20.45    23.14    29.38    42.06
  20                    16.07    18.27    22.75    35.11
The results in Table 4 show the scalability of our solution. With more nodes in the Hadoop
cluster, the computing capability is effectively enhanced. The minimum execution time
of our solution is above 10 seconds; note that the start-up time of a MapReduce job is at
least 10 seconds.
From Table 5 we can see that the accuracy results are worse than those in Table 2.
We store the data extracted from IMDB without any preprocessing, because it contains
less noise than data from Wikipedia. However, the accuracy is still above 0.93. The
results in Table 6 show that with only the feature SS our solution still outperforms the
baseline.
Table 7. RDF KB preprocessing run-time in seconds over different nodes of Hadoop cluster.
6 Conclusions
In this paper, we propose a simple but effective algorithm to accurately link entity men-
tions with proper entities in an RDF KB. The algorithm is designed on the MapReduce
framework, which can annotate multiple documents in parallel. The experimental study
over huge real-life RDF knowledge bases demonstrates the accuracy and scalability of
the approach.
Based on our solution we can study semantic search techniques using the annotated
unstructured texts. For example, by linking unstructured texts to RDF KB, we are able
to build a huge extended labeled graph, through which we are able to apply and extend
semantic keyword search over text graphs.
Acknowledgements. This work is supported by the National Science Foundation of
China under grant No. 61170010 and No. 61003085, and HGJ PROJECT 2010ZX01042-
002-002-03.
References
1. Guha, R.V., McCool, R., Miller, E.: Semantic search. In: WWW, pp. 700709 (2003)
2. Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: A core of semantic knowledge unifying
wordnet and wikipedia. In: WWW, pp. 697706 (2007)
3. Bollacker, K.D., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a collaboratively cre-
ated graph database for structuring human knowledge. In: SIGMOD, pp. 12471250 (2008)
4. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.G.: DBpedia: A Nucleus
for a Web of Open Data. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon,
L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudre-Mauroux,
P. (eds.) ISWC/ASWC 2007. LNCS, vol. 4825, pp. 722735. Springer, Heidelberg (2007)
5. Linking Open Data, http://www.w3.org/wiki/SweoIG/TaskForces/
CommunityProjects/LinkingOpenData
6. Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters. In:
OSDI 2004 (2004)
7. Lesk, M.: Automatic sense disambiguation using machine readable dictionaries: How to tell
a pine cone from an ice cream cone. In: SIGDOC, Toronto (June 1986)
8. Mihalcea, R.: Large vocabulary unsupervised word sense disambiguation with graph-based
algorithms for sequence data labeling. In: Proceedings of the Human Language Technol-
ogy/Empirical Methods in Natural Language Processing Conference, Vancouver (2005)
9. Navigli, R., Velardi, P.: Structural semantic interconnections: a knowledge-based approach to
word sense disambiguation. IEEE Transactions on Pattern Analysis and Machine Intelligence
(PAMI) 27 (2005)
10. Bunescu, R., Pasca, M.: Using Encyclopedic Knowledge for Named Entity Disambiguation.
In: EACL, pp. 916 (2006)
11. Bagga, A., Baldwin, B.: Entity-based cross-document coreferencing using the vector space
model. In: COLING, pp. 7985 (1998)
12. Dredze, M., McNamee, P., Rao, D., Gerber, A., Finin, T.: Entity disambiguation for knowl-
edge base population. In: COLING, pp. 277285 (2010)
13. Hasegawa, T., Sekine, S., Grishman, R.: Discovering relations among named entities from
large corpora. In: ACL, pp. 415422 (2004)
14. Demartini, G., Difallah, D.E., Cudre-Mauroux, P.: ZenCrowd: leveraging probabilistic rea-
soning and crowdsourcing techniques for large-scale entity linking. In: WWW, pp. 469478
(2012)
15. Stoyanov, V., Mayfield, J., Xu, T., Oard, D.W., Lawrie, D., Oates, T., Finnin, T.: A context
aware approach to entity linking. In: The Joint Workshop on Automatic Knowledge Base
Construction and Web-scale Knowledge Extraction, NAACL-HLT 2012 (2012)
16. Navigli, R., Velardi, P.: Structural semantic interconnections: a knowledge-based approach
to word sense disambiguation. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence 27, 10751086 (2005)
17. Gliozzo, A., Giuliano, C., Strapparava, C.: Domain kernels for word sense disambiguation.
In: ACL (2005)
18. Ng, H., Lee, H.: Integrating multiple knowledge sources to disambiguate word sense: An
examplar-based approach. CoRR, vol. 9606032 (1996)
19. Weikum, G., Theobald, M.: From information to knowledge: harvesting entities and relation-
ships from web sources. In: PODS, pp. 6576 (2010)
20. Hoffart, J., Yosef, M.A., Bordino, I., Furstenau, H., Pinkal, M., Spaniol, M., Taneva, B.,
Thater, S., Weikum, G.: Robust Disambiguation of Named Entities in Text. In: Proceedings
of EMNLP, pp. 782792 (2011)
21. Shen, W., Wang, J.Y., Luo, P., Wang, M.: LINDEN: linking named entities with knowledge
base via semantic knowledge. In: WWW 2012, pp. 449458 (2012)
22. Shen, W., Wang, J.Y., Luo, P., Wang, M.: LIEGE: Link Entities in Web Lists with Knowledge
Base. In: Proceedings of KDD 2012 (2012)
23. Carlos, B.T., Guestrin, C., Koller, D.: Max-margin markov networks. In: NIPS (2003)
24. Nadeau, D., Turney, P.D., Matwin, S.: Unsupervised Named-Entity Recognition: Generating
Gazetteers and Resolving Ambiguity. In: Lamontagne, L., Marchand, M. (eds.) Canadian AI
2006. LNCS (LNAI), vol. 4013, pp. 266277. Springer, Heidelberg (2006)
25. Stanford NER, http://nlp.stanford.edu/ner/index.shtml
26. Hadoop, http://hadoop.apache.org/
An Approach to Retrieving Similar Source Codes
by Control Structure and Method Identifiers
Yoshihisa Udagawa
1 Introduction
Source code retrieval is an important task for maintaining the quality of software
in the development cycle. Duplicated code complicates this activity and leads to higher
maintenance costs because the same bugs will need to be fixed and, consequently,
more code will need to be tested. Many developers put significant effort into finding
software defects. However, due to the vast amount of source code, it is difficult to
efficiently find the code segments that we want.
Various techniques have been proposed to collect similar source codes. These
techniques can be classified into four categories:
Text-based comparison
This approach compares source codes in the same partition. The key idea of this ap-
proach is to identify source-code fragments using similar identifiers [3].
Token comparison
In this approach, before comparison, tokens of identifiers (data type names, variable
names, etc.) are replaced by special tokens, and then similar subsequences of tokens
are identified [2]. Because the encoding of tokens abstracts from their concrete val-
ues, code fragments that are different only in parameter naming can be detected.
Metrics comparison
This approach characterizes code fragments using some metrics, and compares these
metric vectors instead of directly comparing the code [4].
Structure-based comparison
This approach applies pattern matching and complex algorithms on abstract syntax
trees or dependency graphs. Baxter et al. [1] propose a method using abstract syntax
trees for detecting exact and near-miss program source fragments. Ouddan et al. [5]
propose a multi-language source code retrieval system using both the structural and the
semantic content of the source code. Their approach is focused on detecting plagiar-
ism in source code between programs written in different programming languages,
such as C, C++, Java, and C#.
Our approach is a structure-based comparison that takes a sequence of statements
as a retrieval condition. We developed a lexical parser and extracted the structures of
source code in terms of control statements and method identifiers. The extracted structural
information is input to the vector space model and to a proposed source code retrieval
model, named the derived structure retrieval model. Our retrieval model takes a
sequence of statements as a retrieval condition and derives meaningful search conditions
from the given sequence. Because a program is composed of a sequence of
statements, our retrieval model practically improves the performance of source code
retrieval.
The rest of this paper is organized as follows. In Section 2, we present the architecture
of the developed tools. In Section 3, we present an overview of the Struts 2 framework
[7] and extracted structures. In Section 4, we discuss a similarity retrieval approach
based on the vector space model. In Section 5, we discuss the derived structure re-
trieval model and results of code retrieval. Section 6 concludes the paper.
replacing each element of the given sequence with a wildcard. In case the sequence
consists of n elements, 2^n - 2 sequences are derived to retrieve a set of similar
methods. Details are discussed in Section 5.
A framework automates common tasks, thereby providing a user platform that
simplifies web development. The Struts 2 framework [7] implements the model-view-
controller (MVC) design pattern.
We estimated the volume of the source code using file metrics. Struts 2.3.1.1 Core
consists of 46,100 lines of source code, including comment lines and blank lines. In terms
of the number of lines, Struts 2.3.1.1 Core is a middle-scale application in industry.
The typical file metrics used are as follows:
Number of Java Files ---- 368
Number of Classes ---- 414
Number of Methods ---- 2,667
Number of Code Lines ---- 21,543
Number of Comment Lines ---- 17,954
Number of Total Lines ---- 46,100
The number of Java files is 368, which differs from the number of declared classes
(414) because some Java files include definitions of inner classes and anonymous
classes.
The vector space model [6] is an algebraic model for representing text documents as
vectors of identifiers or terms. Given a set of documents D, a document dj in D is
represented as a vector of term weights:
d_j = (w_{1,j}, w_{2,j}, ..., w_{N,j})                                      (1)

where N is the total number of terms in document d_j and w_{i,j} is the weight of the i-th
term. A user query can be similarly converted into a vector q:

q = (w_{1,q}, w_{2,q}, ..., w_{N,q})                                        (2)

The similarity between document d_j and query q can be computed as the cosine of the
angle between the two vectors d_j and q in the N-dimensional space:

sim(d_j, q) = ( Σ_{i=1}^{N} w_{i,j} · w_{i,q} ) / ( sqrt(Σ_{i=1}^{N} w_{i,j}^2) · sqrt(Σ_{i=1}^{N} w_{i,q}^2) )    (3)
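For illustration, a minimal sketch of this retrieval-by-cosine step is given below. The simple raw term-frequency weighting is an assumption made for brevity and is not the weighting scheme of the paper's tool.

# Illustrative sketch: rank documents against a query by the cosine of Eq. 3.

import math
from collections import Counter

def weight_vector(tokens, vocab):
    tf = Counter(tokens)
    return [tf[t] for t in vocab]            # simple tf weights w_{i,j}

def cosine(d, q):
    num = sum(x * y for x, y in zip(d, q))
    den = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    return num / den if den else 0.0

def rank(documents, query):
    """documents: list of token lists; query: token list."""
    vocab = sorted({t for doc in documents for t in doc} | set(query))
    q_vec = weight_vector(query, vocab)
    scores = [(cosine(weight_vector(d, vocab), q_vec), i)
              for i, d in enumerate(documents)]
    return sorted(scores, reverse=True)      # highest cosine first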
A drawback is that the order in which the terms appear in the document is lost in the vector space
model. On the other hand, a program is essentially a sequence of statements. The
vector space model therefore has performance limitations when it is applied to retrieving
program source code.
The derived structure retrieval model, the source code retrieval model proposed in
this paper, takes a sequence of statements as a retrieval condition and derives meaningful
search conditions from the given sequence. Let S1 and S2 be statements extracted
by the structure extraction tool. [S1 S2] denotes the sequence of S1 followed by
S2. In general, for a positive integer n, let Si (where i ranges between 1 and n) be a statement;
[S1 S2 ... Sn] then denotes a sequence of n statements, i.e., S1, S2, ..., Sn.
The essential idea of the derived structure retrieval model is matching on wildcard
characters. Given a sequence [S1 S2 ... Sn] and a method structure m, our model
first tries to find a substructure of m that matches the sequence [S1 S2 ... Sn]. Next, it
derives a sequence, such as [* S2 ... Sn], by replacing one of the elements in
[S1 S2 ... Sn] with a wildcard, and tries to find a substructure of m that matches the derived sequence.
The algorithm continues to evaluate the structure of m until all derivable sequences,
except for the sequence [* * ... *], have been evaluated.
The sequence [S1 S2 ... Sn] is more "restrictive" than [* S2 ... Sn],
[S1 * ... Sn], etc. In general, 2^n - 1 sequences are derived from an n-element
sequence, including the sequence in which all elements are replaced by wildcards. Because
the sequence [* * ... *] matches all substructures, it does not work as a retrieval
condition. A given sequence [S1 S2 ... Sn] and its 2^n - 2 derived sequences define the
domain of discourse for retrieval conditions.
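The following small sketch, an illustration rather than the author's tool, enumerates the 2^n - 2 derived sequences of an n-element statement sequence, i.e. every wildcard pattern except the original sequence and the all-wildcard pattern.

# Enumerate the derived wildcard sequences of a statement sequence.

from itertools import product

def derived_sequences(seq):
    """Yield (pattern, r) pairs, where r is the number of wildcards."""
    n = len(seq)
    for mask in product([False, True], repeat=n):
        r = sum(mask)
        if r == 0 or r == n:          # skip [S1 ... Sn] and [* ... *]
            continue
        pattern = ['*' if w else s for w, s in zip(mask, seq)]
        yield pattern, r

# Example: for ["if", "for", "while"], 2^3 - 2 = 6 patterns are produced.
for p, r in derived_sequences(["if", "for", "while"]):
    print(r, p)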
Let Count(m, [S1 S2 ... Sn], r) be a function that counts the number of substructures
of m that match the derived sequences of [S1 S2 ... Sn] in which r elements
are replaced by the wildcard. Fig. 3 shows an algorithm to compute the similarity of
methods that nearly match a retrieval condition [S1 S2 ... Sn]. In Fig. 3, it is assumed
that getMethodStructure(j) returns the structure of the j-th method extracted by
the structure extraction tool. Note that the similarity is 1.0 when a method includes
the sequence [S1 S2 ... Sn] and does not include any of the sequences derived from
[S1 S2 ... Sn].
Input:  set_of_structure M;
Input:  sequence [S1 S2 ... Sn];
Output: Sim[M.length];

// Declare the array of similarity scores, one per method in M.
double Sim[M.length];
// Nume counts matches of the original sequence (r = 0); Deno counts matches
// of the derived sequences (1 <= r <= n-1), i.e. excluding [* * ... *].
int Nume; int Deno;
for (int j = 0; j < M.length; j++) {
    Nume = Count(getMethodStructure(j), [S1 S2 ... Sn], 0);
    Deno = 0;
    for (int r = 1; r < [S1 S2 ... Sn].length; r++) {
        Deno = Deno + Count(getMethodStructure(j), [S1 S2 ... Sn], r);
    }
    Sim[j] = (double) Nume / (double)(Nume + Deno);
}
Fig. 3. Algorithm to compute the similarity of methods for a sequence [S1 S2 ... Sn]
6 Conclusions
In this paper, we presented an approach that improves source code retrieval using the
structural information of control statements. Our key contribution in this research is
the development of an algorithm that derives meaningful search conditions from a
given sequence, and then performs retrieval using all of the derived conditions. Thus
our source code retrieval model retrieves all source code that partially matches the
given sequence.
The results are promising enough to warrant future research. In future work, we
will improve our methods by incorporating additional information, such as class
inheritance and the implementation of abstract classes, into the structural informa-
tion. We will also conduct experiments on various types of open source programs
available on the Internet.
References
1. Baxter, I.D., Yahin, A., Moura, L., Sant'Anna, M., Bier, L.: Clone detection using abstract
syntax trees. In: Proc. of the 14th International Conference on Software Maintenance, pp.
368377 (November 1998)
2. Kamiya, T., Kusumoto, S., Inoue, K.: CCFinder: A multi-linguistic token-based code clone
detection system for large scale source code. IEEE Transactions on Software Engineer-
ing 28(7), 654670 (2002)
3. Marcus, A., Maletic, J.: Identification of high-level concept clones in source code. In: Proc.
of the 16th International Conference on Automated Software Engineering, pp. 107114
(November 2001)
4. Mayland, J., Leblanc, C., Merlo, E.M.: Experiment on the automatic detection of function
clones in a software system using metrics. In: Proc. of the 12th International Conference on
Software Maintenance, pp. 244253 (November 1996)
5. Ouddan, A.M., Essafi, H.: A Multilanguage Source Code Retrieval System Using Structur-
al-Semantic Fingerprints. International Journal of Computer and Information Science and
Engineering, 96101 (Spring 2007)
6. Salton, G., Buckley, C.: Term-Weighting approaches in automatic text retrieval. Informa-
tion Processing and Management 24(5), 513523 (1988)
7. Struts - The Apache Software Foundation (2012), http://struts.apache.org/
Complementary Information for Wikipedia
by Comparing Multilingual Articles
1 Introduction
Wikipedia has become an extremely popular website, with over 284 language
versions in 2012. Nevertheless, the information in many articles is lacking because
users can create and edit the information freely. Furthermore, the value of Wikipedia's
information differs depending on the language version
of the site because different users create and edit the articles of the respective
language versions. We specifically examine the multilinguality of Wikipedia and
propose a method that complements the information of articles that lack information,
based on comparing different-language articles that have similar
contents [3].
In our proposed method, a user first browses an article in the Wikipedia of the
user's language. Then the system extracts complementary information from user-specified
articles that have been composed in other languages. In
our research, we designate the lacking information as "complementary information",
the article that a user is browsing and which lacks information as a
"browsing article", its Wikipedia as the "browsing Wikipedia", and its language
as the "browsing language". We also designate a different-language article as a
"comparison article", its Wikipedia as the "comparison Wikipedia", and its
language as the "comparison language".
When we compare a browsing article with a comparison article, the information
granularity is shown to differ. The browsing article is only one page,
but the comparison article might have multiple pages because these articles are
written by intercultural authors. Our proposed method extracts the
lacking information from all content of the target articles in the comparison Wikipedia.
2 Related Work
Our proposed approach uses link structure analysis techniques. Many studies
use the Wikipedia link structure. Milne [4] and Strube et al. [7] extract measures
of semantic relatedness of terms using the Wikipedia link structure.
Some studies have been conducted to examine complementary information.
Ching et al. [1] propose a framework to assist Wikipedia editors in enriching
Wikipedia articles. They extract complementary information by comparing multilingual
Wikipedia. However, they are not interested in differences of information
granularity between the languages. In contrast, we consider the coverage
of comparison articles and create a Wikipedia article to extract complementary
information from comparison articles. Eklou et al. [2] proposed a method of
information complementation of Wikipedia by the web. They extract complementary
information from web pages using LDA. Their goal is similar to ours, but
the target complementary information is different. Ma et al. [6] proposed integrating
cross-media news contents such as TV programs and Web pages to provide
users with complementary information. They extract complementary information
from web pages using their proposed topic structure. Their research issue,
complementing information, is similar to ours; however, the target
content is different.
of segments in the basic article. S_ik represents the cosine similarity between
segment k of the basic article and an interactively linked article i. max(R_im)
signifies the maximum value over all R_i.
The system calculates steps 2 to 5 for all interactively linked articles in the link
graph. When the relevance degree R_i is less than the threshold value, we delete
the article from the link graph. When the system extracts complementary information
from the comparison articles, the nodes consisting of the root node and
the remaining nodes become comparison articles.
A partial match article is a comparison article that has no anchor text linking to the basic article
in the lead part, but does have such anchor text in other parts of the article.
That is, the partial match article excludes inclusively related articles from
the comparison articles.
3.2.2 Comparison Area
We extract complementary information from the comparison area, which is based
on articles of three types described above.
Basic Article
A comparison area of the basic article is all contents of a page (article) because
a basic article has the same title as the browsing article. It is strongly related to
the browsing article. (Fig. 1(a)).
Inclusive Relation Article
An inclusive relation article is a subset of a basic article. We regard a comparison
area of the inclusive relation article as all contents of an article. That is, the
comparison area is the same as its basic article. (Fig. 1(a)).
Partial Match Article
We specifically examine the position of the anchor text which links to the basic
article, in a partial match article. We extract a comparison area as follows:
4 Experimental Evaluation
We assessed the availability of extracting complementary information based on
comparing our proposed method which is a calculated comparison area with a
baseline which is not a calculated comparison area. In both of our experiments,
we set an appropriate weight =3.0 and threshold =0.2 of the relevance degree
described in equation (1) in section 3, and threshold =0.2 of cosine similar-
ity in equation (2) in our earlier experiment[3]. Table 1 presents results of the
experiment. The precision is 0.26 higher than the baseline, which means that
our proposed method is useful for extracting complementary information. An
example of a good result is the browsing article is written for the plot and cast
of My Neighbor Totoro. One comparison article is written about Sayama Hill,
the geographic information related to Sayama Hill diers, but it is unrelated to
the movie. This time, in our proposed method, the information is not a com-
parison area. We do not extract the information as complementary information.
However, the exemplary bad results are those for Manzai, which is a type of
Japanese comedy. We can not extract the information written about the associ-
ation of Manzai, Roukyoku Manzai which is a kind of manzai, and so on because
the anchor text in each case is in an itemization area. Our proposed method
1 The total number of sections in the browsing article.
2 The total number of comparison articles.
3 The total number of sections in the comparison articles.
5 Conclusion
References
1. Ching-Man, A.Y., Duh, K., Nagata, M.: Assisting wikipedia users in cross-lingual
editing. In: WebDB Forum 2010, pp. 1112 (2010)
2. Eklou, D., Asano, Y., Yoshikawa, M.: How the web can help wikipedia: a study on
information complementation of wikipedia by the web. In: Proceedings of the 6th
International Conference on Ubiquitous Information Management and Communica-
tion, ICUIMC 2012 (2012)
3. Fujiwara, Y., Suzuki, Y., Konishi, Y., Nadamoto, A.: Extracting Difference Infor-
mation from Multilingual Wikipedia. In: Sheng, Q.Z., Wang, G., Jensen, C.S., Xu,
G. (eds.) APWeb 2012. LNCS, vol. 7235, pp. 496503. Springer, Heidelberg (2012)
4. Milne, D.: Computing semantic relatedness using wikipedia link structure. In: Proc.
of New Zealand Computer Science Research Student Conference, NZCSRSC 2007.
CDROM (2007)
5. Nakayama, K.: Wikipedia mining for triple extraction enhanced by co-reference
resolution. In: Proceedings of the 1st International Workshop on Social Data on the
Web, SDoW 2008 (2008)
6. Qiang, M., Nadamoto, A., Tanaka, K.: Complementary information retrieval for
cross-media news content. Elsevier ARTICLE Information Systems 31(7), 659678
(2006)
7. Strube, M., Ponzetto, S.P.: WikiRelate! computing semantic relatedness using
wikipedia. In: Proceedings of the 21st International Conference on Artificial In-
telligence, AAAI 2006, pp. 14191424 (2006)
Identification of Sybil Communities Generating
Context-Aware Spam on Online Social Networks
1 Introduction
Online social networking sites have attracted a large number of internet users.
Among many existing Online Social Networks (OSNs), Facebook and Twitter
are the most popular social networking sites with over 800 million and 100 mil-
lion active users, respectively. However, due to this popularity and existence of
a rich set of potential users, malicious third parties have also diverted their at-
tention towards exploiting various features of these social networking platforms.
Though the exploitation methodologies vary according to the features provided
by the social networking platforms, malware infections, spam, and phishing are
the most common security concerns for all of these platforms. In addition, a
number of social botnets have emerged that utilize social networking features for
spreading infections or as command and control channels [1], [2], [3]. The root cause
of all these security concerns is social network sybils, or fake accounts created
by malicious users to increase the efficacy of their attacks, which are commonly
known as sybil attacks [4]. Generally, an attacker uses multiple fake identities to
unfairly increase its ranking or influence in a target community. Moreover, sev-
eral underground communities exist, which trade sybil accounts with users and
organizations looking for online publicity [5], [6]. Recent studies have shown that
with the increase in the popularity of social media, sybil attacks are becoming
more widespread [7]. Several sybil communities that forward spam and malware
on the Facebook network [8] have been reported so far. In online social networks,
third-party nodes are most vulnerable to sybil attacks, where third-party
nodes are communities and groups on OSN platforms that bring together users
from different real-world communities on the basis of their interests. In the case of
Facebook, a third-party node can be defined as a Facebook community page
which is used to connect two users from entirely different regions. Sybil accounts
hired for carrying out spam campaigns target such vulnerable nodes. Recently,
the rapid increase in the amount of spam on popular online social networking
sites has attracted the attention of researchers from security and related fields.
Though a significant amount of research work has been reported for the
detection and characterization of spam on Facebook and Twitter networks [9],
[10], [11], [12], [13], the existing techniques do not focus on the detection of
coordinated spam campaigns carried out by the communities of sybil accounts.
Similarly, several techniques have been presented for the identification of sybil
communities [4], [14], [15],[16], [17], but all of them focus on the decentralized
detection of sybil accounts. Moreover, the existing techniques are based on two
common assumptions about the behavior of sybil nodes. Firstly, sybil nodes can
form edges between them in a social graph and secondly, the number of edges
connecting sybil and normal nodes is less as compared to the number of edges
connecting either only normal nodes or only sybil nodes. These assumptions
were based on the intuition that normal users do not readily accept friendship
requests from seemingly unknown users. Although empirical studies from [17]
showed existence of such sybil communities in the Tuenti social network, another
study of Renren social network [7] showed that sybil nodes rarely created edges
between themselves. This implies that the community behavior of sybil nodes in a
social graph is mercurial and the assumption that sybil nodes form communities
cannot be generalized [18].
In this work, the authors utilize the rich corpus of prior research works on
spam detection and sybil community identification as a basis and present a
hybrid approach to identify coordinated spam or malware attacks carried out
using sybil accounts. The proposed approach is independent of the assumptions
discussed above by the previous researchers. Although the proposed approach is
generic in nature, this paper focuses on the sybil accounts present on Facebook
social network for experiment and evaluation purposes. The contributions of this
paper can be summarized as follows:
The rest of the paper is organized as follows. After a brief review of the existing
state-of-the-art techniques for spam identification in online social networks in
Section 2, Section 3 presents a data collection methodology from Facebook social
network. Section 4 presents the profile grouping methodology to generate various
groups of similar profiles in the original social network dataset. This Section also
presents the experimental results obtained from a real dataset and their analyses.
Finally, Section 5 presents conclusions.
2 Related Work
A significant number of research works have been reported in the last few years for
spam detection on online social networks. In [19], the authors proposed a real
time URL-spam detection scheme for Twitter. They proposed a browser moni-
toring approach, which takes into account a number of details including HTTP
redirects, web domains contacted while constructing a page, HTML content be-
ing loaded, HTTP headers, and invocation of JavaScript plug-ins. In [11], the
authors created honey-profiles representing different ages, nationalities, and so on.
Their study is based on a dataset collected from the profiles of several regions,
including USA, Middle-East, and Europe. They logged all types of requests,
wall posts, status updates, and private messages on Facebook. Based on the
users' activities over social networking sites, they distinguished spam and nor-
mal profiles. The authors in [12] utilized the concept of social honeypot to lure
content polluters on Twitter. The harvested users are analyzed to identify a set
of features for classification purpose. The technique is evaluated on a dataset of
Twitter spammers collected using the @spam mention to flag spammers. In [8],
the authors analyzed a large dataset of wall posts on Facebook user profiles to
detect spam accounts. They built wall posts similarity graph for the detection of
malicious wall posts. Similarly, in [13] the authors presented a thorough analysis
3 Dataset
Based on the analyses reported in [21], it is found that a significant amount of
spam posts on Facebook are directed towards those Facebook pages that are
publicly accessible and any user in the network can post on them. Spammers
generally utilize such openly accessible public pages to spread spam in the net-
work. This type of spam spreading mechanism not only relieves the spammers
from their dependence on friendship requests, but also increases the number of
target users. Once a spam post is visible on a pages wall, it can be visible to
every user who likes that page. In addition, users page-like information can help
spammers to spread context-aware spam through Facebook pages in normal user
communities. Recently, there has been an increasing number of evidence about
the existence of underground communities trading groups of accounts that carry
spam campaigns [18]. Therefore, a group of accounts bought by a party could
be used for a single purpose, resulting in a high correlation in their behavior.
This work exploits the intuition that spam targeting a community is most
likely spam generated by a community. A dataset [21] containing Facebook
spam profiles is analyzed to identify Facebook pages that have been mostly
[Figure: graph of the identified Facebook pages, where X = Page ID, Y = No. of Users, and Z = No. of Common Users]
4 Methodology
To detect communities of sybil accounts generating context-aware spam, the rich
amount of textual information contained in each profile is used to generate an
undirected-weighted social graph, in which a node represents a profile and an
edge connecting a pair of nodes represents a link between them. The connections
initiated through a friendship request are independent of the links created in the
actual social graph. A total of three features is used
to determine the weight of an edge in the social graph. For each group of profiles
identified in Section 3, a social graph is generated as G = (V_n, E, W), where n
represents the group id, V_n is the set of profiles in group n, E ⊆ V_n × V_n is the
set of edges, and W is the set of weights assigned to the edges. For each node
v ∈ V_n, a 3-dimensional feature vector comprising profile similarity, page likes,
and URLs shared is generated, which is then used to calculate the weight of
an edge e_ij = (v_i, v_j). Further details about the features and the weight calculation
process are presented in the following subsections.
For the nodes v_i and v_j that fulfil the squareness measure criterion, the similarity
index is calculated as follows. Considering x and y as the numbers of posts of v_i
and v_j on their own walls, respectively, a cosine similarity matrix C of dimensions
x × y is created in such a way that each post of v_i is compared with all the posts
of v_j. For the cosine similarity, each post is converted into a tf-idf feature vector,
where the tf-idf of a term t is calculated using Equations 2 and 3. In these equations,
d is the post under consideration, D is the set of posts present in nodes v_i and
v_j, and tf(t, d) is calculated as the number of times t appears in d.

tf-idf(t, d, D) = tf(t, d) · idf(t, D)                                      (2)

idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )                                  (3)
For any two posts a and b with their corresponding tf -idf feature vectors A and
B, the value of an element cij of the matrix C is calculated as a cosine similarity
using Equation 4, where l is the length of feature vectors.
c_ij = ( Σ_{k=1}^{l} A_k · B_k ) / ( sqrt(Σ_{k=1}^{l} (A_k)^2) · sqrt(Σ_{k=1}^{l} (B_k)^2) )    (4)
Finally, after smoothing the values of the matrix C using Equation 5, the simi-
larity index for the edge eij = (vi , vj ) is calculated as the normalized cardinality
of the set of non-zero elements in C, as shown in Equation 6.
c_ij = { 1 if c_ij > 0.1 ; 0 if c_ij < 0.1 }                                (5)
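The sketch below illustrates Equations 2 to 5 and the similarity index on two sets of posts. The normalisation by the matrix size x·y is an assumption, since Equation 6 is only described in words here; everything else is a simplified illustration, not the authors' code.

# Build the post similarity matrix C for two profiles and derive a similarity index.

import math
from collections import Counter

def tf_idf_vector(post, all_posts, vocab):
    counts = Counter(post)
    n_docs = len(all_posts)
    vec = []
    for t in vocab:
        df = sum(1 for d in all_posts if t in d)          # |{d in D : t in d}|
        idf = math.log(n_docs / df) if df else 0.0
        vec.append(counts[t] * idf)
    return vec

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def similarity_index(posts_i, posts_j, threshold=0.1):
    """posts_i, posts_j: lists of token lists (one per wall post)."""
    if not posts_i or not posts_j:
        return 0.0
    all_posts = posts_i + posts_j
    vocab = sorted({t for d in all_posts for t in d})
    vi = [tf_idf_vector(p, all_posts, vocab) for p in posts_i]
    vj = [tf_idf_vector(p, all_posts, vocab) for p in posts_j]
    # Matrix C (Eq. 4), smoothed to 0/1 by the threshold (Eq. 5).
    c = [[1 if cosine(a, b) > threshold else 0 for b in vj] for a in vi]
    non_zero = sum(sum(row) for row in c)
    return non_zero / (len(posts_i) * len(posts_j))        # assumed normalisation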
URL Sharing: Like the page-likes feature, the value of the URL-sharing feature of
nodes v_i and v_j is calculated as the fraction of the URLs commonly shared by
them, as shown in Equation 8. In this equation, U_i and U_j represent the sets of
URLs shared by nodes v_i and v_j, respectively.

U_ij = |U_i ∩ U_j| / |U_i ∪ U_j|                                            (8)
Based on the values of the features discussed above, the final weight of the edge
e_ij = (v_i, v_j), denoted ω(e_ij), is calculated using Equation 9, where λ_1, λ_2, and λ_3 are
constants such that each λ_i > 0 and λ_1 + λ_2 + λ_3 = 1.

ω(e_ij) = λ_1 · I_S + λ_2 · P_ij + λ_3 · U_ij                               (9)
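A minimal sketch of this weighted combination follows. The assumption that the page-likes feature P_ij is computed analogously to Equation 8, and the example lambda values, are illustrative only and not taken from the paper.

# Combine similarity index, page-likes overlap, and URL-sharing overlap (Eq. 9).

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def edge_weight(similarity_index, pages_i, pages_j, urls_i, urls_j,
                lambdas=(0.4, 0.3, 0.3)):
    l1, l2, l3 = lambdas                      # must be > 0 and sum to 1
    p_ij = jaccard(pages_i, pages_j)          # page-likes feature (assumed form)
    u_ij = jaccard(urls_i, urls_j)            # URL-sharing feature (Eq. 8)
    return l1 * similarity_index + l2 * p_ij + l3 * u_ij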
Figure 2 shows a subgraph of the FEM group present in the dataset. The
graph shows 4 major communities out of a total of 14 communities obtained through
the Louvain algorithm. In the experiment, the default resolution value of the Louvain
implementation in Gephi has been used. In Figure 2, the legend describes the
percentage of nodes in each community. It can be observed in Figure 2 that
nodes with modularity class 2 are dispersed, whereas nodes of classes 1, 3, and 4
are more closely related. In Section 4.3, the analysis has been further extended
5 Conclusions
Along the lines of previous research, this paper has presented a hybrid
approach to detect communities of sybil accounts that are under the control of
spammers and generate context-aware spam towards normal user communities.
The proposed approach is independent of the assumptions made by previous
efforts and identifies six different profile groups in the dataset based on the
users' interests on the Facebook network. The users with the most common page-likes
have been grouped together for further analysis. Three different types of features
have been identified and used to model each group as a social graph in which
profiles are represented as nodes and their links as edges. The weight of a link is
calculated as a function of the degree of similarity of the nodes. The Louvain community
detection algorithm is applied to the social graphs to identify communities
embedded within them. Thereafter, based on the class (malicious or benign)
of the nodes with high closeness-centrality values, the underlying community is
marked as either malicious or benign. The obtained results highlight that, generally,
nodes with high closeness-centrality values are malicious and belong to sybil
communities, whereas nodes with low closeness-centrality values are benign and
constitute normal user communities.
References
1. Boshmaf, Y., Muslukhov, I., Beznosov, K., Ripeanu, M.: Key challenges in defend-
ing against malicious socialbots. In: Proceedings of the 5th USENIX Conference
on Large-scale Exploits and Emergent Threats, LEET, vol. 12 (2012)
2. Nagaraja, S., Houmansadr, A., Piyawongwisal, P., Singh, V., Agarwal, P., Borisov,
N.: Stegobot: A covert social network botnet. In: Filler, T., Pevný, T., Craver, S.,
Ker, A. (eds.) IH 2011. LNCS, vol. 6958, pp. 299313. Springer, Heidelberg (2011)
3. Thomas, K., Nicol, D.: The koobface botnet and the rise of social malware. In:
IEEE 2010 5th International Conference on Malicious and Unwanted Software,
MALWARE, pp. 6370 (2010)
4. Yu, H., Kaminsky, M., Gibbons, P., Flaxman, A.: Sybilguard: defending against
sybil attacks via social networks. ACM SIGCOMM Computer Communication Re-
view 36, 267278 (2006)
5. Boyd, S., Ghosh, A., Prabhakar, B., Shah, D.: Gossip algorithms: Design, analysis
and applications. In: Proceedings IEEE 24th Annual Joint Conference of the IEEE
Computer and Communications Societies, INFOCOM 2005, vol. 3, pp. 16531664.
IEEE (2005)
6. Danezis, G., Lesniewski-Laas, C., Frans Kaashoek, M., Anderson, R.: Sybil-
resistant DHT routing. In: De Capitani di Vimercati, S., Syverson, P.F., Goll-
mann, D. (eds.) ESORICS 2005. LNCS, vol. 3679, pp. 305318. Springer, Heidel-
berg (2005)
7. Yang, Z., Wilson, C., Wang, X., Gao, T., Zhao, B., Dai, Y.: Uncovering social
network sybils in the wild. In: Conference on Internet Measurement (2011)
8. Gao, H., Hu, J., Wilson, C., Li, Z., Chen, Y., Zhao, B.: Detecting and characterizing
social spam campaigns. In: Proceedings of the 10th Annual Conference on Internet
Measurement, pp. 3547. ACM (2010)
9. Lee, K., Caverlee, J., Cheng, Z., Sui, D.: Content-driven detection of campaigns in
social media (2011)
10. Jin, X., Lin, C., Luo, J., Han, J.: A data mining-based spam detection system for
social media networks. Proceedings of the VLDB Endowment 4(12) (2011)
11. Stringhini, G., Kruegel, C., Vigna, G.: Detecting spammers on social networks. In:
Proceedings of the 26th Annual Computer Security Applications Conference, pp.
19. ACM (2010)
12. Lee, K., Eo, B., Caverlee, J.: Seven months with the devils: A long-term study
of content polluters on twitter. In: Intl AAAI Conference on Weblogs and Social
Media, ICWSM (2011)
13. Yang, C., Harkreader, R.C., Gu, G.: Die free or live hard? Empirical evaluation and
new design for fighting evolving twitter spammers. In: Sommer, R., Balzarotti, D.,
Maier, G. (eds.) RAID 2011. LNCS, vol. 6961, pp. 318337. Springer, Heidelberg
(2011)
14. Yu, H., Gibbons, P., Kaminsky, M., Xiao, F.: Sybillimit: A near-optimal social net-
work defense against sybil attacks. In: IEEE Symposium on Security and Privacy,
SP 2008, pp. 317. IEEE (2008)
15. Danezis, G., Mittal, P.: Sybilinfer: Detecting sybil nodes using social networks.
NDSS (2009)
16. Tran, N., Min, B., Li, J., Subramanian, L.: Sybil-resilient online content voting.
In: Proceedings of the 6th USENIX Symposium on Networked Systems Design and
Implementation, pp. 1528. USENIX Association (2009)
17. Cao, Q., Sirivianos, M., Yang, X., Pregueiro, T.: Aiding the detection of fake
accounts in large scale social online services. Technical Report (2011),
http://www.cs.duke.edu/~ qiangcao/publications/sybilrank_tr.pdf
18. Wang, G., Mohanlal, M., Wilson, C., Wang, X., Metzger, M., Zheng, H., Zhao, B.:
Social turing tests: Crowdsourcing sybil detection. Arxiv preprint arXiv:1205.3856
(2012)
19. Thomas, K., Grier, C., Ma, J., Paxson, V., Song, D.: Design and evaluation of a
real-time url spam filtering service. In: IEEE Symposium on Security and Privacy
(2011)
20. McCord, M., Chuah, M.: Spam detection on twitter using traditional classifiers. In:
Calero, J.M.A., Yang, L.T., Mármol, F.G., García Villalba, L.J., Li, A.X., Wang,
Y. (eds.) ATC 2011. LNCS, vol. 6906, pp. 175186. Springer, Heidelberg (2011)
21. Ahmed, F., Abulaish, M.: An mcl-based approach for spam profile detection in on-
line social networks. In: The 11th IEEE International Conference on Trust, Security
and Privacy in Computing and Communications, TrustCom 2012, IEEE (2012)
22. Blondel, V., Guillaume, J., Lambiotte, R., Lefebvre, E.: Fast unfolding of commu-
nities in large networks. Journal of Statistical Mechanics: Theory and Experiment
2008, 10008 (2008)
23. Blondel, V.: The Louvain method for community detection in large networks
(2011), http://perso.uclouvain.be/vincent.blondel/research/louvain.html
(accessed July 11, 2012)
Location-Based Emerging Event Detection
in Social Networks
Abstract. This paper proposes a system for the early detection of emerg-
ing events by grouping micro-blog messages into events and using the
message-mentioned locations to identify the locations of events. In our re-
search we correlate user locations with event locations in order to identify
the strong correlations between locations and events that are emerging.
We have evaluated our approach on a real-world Twitter dataset with
different granularities of location levels. Our experiments show that the proposed
approach can effectively detect the top-k ranked emerging events
with respect to the locations of the users at the different granularities of
location scales.
1 Introduction
The detection of emerging events with respect to the locations and participants
of events provides a better understanding of ongoing events. Emerging events like
natural disasters may need to be reported in real time when they are observed
by many people. Emerging events such as infectious diseases and cyberspace-
initiated/plotted attacks/unrest need to be detected at their early stages.
Currently, User Generated Content (UGC) systems, such as micro-blog services,
provide a wealth of online discussions about real-world events. Micro-blogs,
e.g., Twitter, are considered a powerful means of communication for people
who want to share and exchange information over social media.
The social events range from popular, widely known ones, such as events concerning
well-known people or political affairs, to local events such as accidents,
protests, or natural disasters. Moreover, social events can be malicious, dangerous,
or disastrous. For instance, in August 2011, rioters used instant messaging
and social network services to arrange meetings and agitate people across
England. Our research interest is to understand where, when, and what is happening
(emerging), so as to predict its occurrence via real-time monitoring of
the social networks. More specifically, our research interest is focused on
emerging hotspot events. We define a hotspot event as a tuple (location,
time, topic) that a social network user is associated with through the posting of
a micro-blog.
both the topic and the event location. An emerging event is an event
whose number of messages has significantly increased but that was rarely posted about in
the past. A hotspot event is an event for which there is a strong association between the
event location and the user location. The user location is the location from which the message
is sent. The message-referred location is a location mentioned in the message;
it can be the event location, which is the location where the event occurs, or another
location referred to within the message.
In order to understand where the locations of messages are found in social
media, we conducted statistical studies. Firstly, we try to understand the availability
of locations within the micro-blogs. We used the Twitter API2 to
crawl the micro-blog datasets. Dataset1 was crawled from messages sent by
users around Brisbane, Australia, between 17 March 2012 and 25 March
2012 (219,933 messages), while Dataset2 was crawled from messages sent
by users around the USA between 21 June 2012 and 27 June 2012
(196,834 messages). The statistical information on locations from the two datasets
is shown in Figure 1 and Table 2.
As we can see from Datasets 1 and 2, the user locations are divided into
three types, i.e., geo-tagged locations, locations from the user profiles, and locations implied by IP
addresses (the first three rows in Table 2). The majority of user locations come
from user profiles, approximately 77% of Dataset1 and 67% of Dataset2. Also,
from Datasets 1 and 2 we can see that message-mentioned locations are only
available within a small proportion of the micro-blogs and appear once or twice
on average in the micro-blog messages. Our most important observation on
the locations of micro-blogs is that most micro-blog messages contain only
one geographical location, while messages containing more than one geographical location
constitute approximately 4% of Dataset1 and 7% of Dataset2.
When a location is mentioned in a micro-blog message, it can be either an
event location or a topic location.
In Table 3, we also try to understand what the locations are used for in the
messages' contents. Dataset3 is downloaded from G. R. Boynton, University of
2
https://dev.twitter.com/docs/api/1.1
Iowa, and consists of two events: USA H1N1 and the Indonesia Earthquake3.
Dataset4 is crawled using the hashtag #qldvote for the Queensland Election 2012.
As we can see from Datasets 3 and 4, the message-mentioned locations are
mostly the event locations. A message-mentioned location can appear more
than once in a message. Our most important observation on the message-mentioned
locations is that the more frequently a location is mentioned in a message,
the more likely it is to be the event location. Therefore, the use of message-mentioned
locations to identify event locations is the key goal behind this study.
networks, we do not know the number of events in advance. On the other hand,
our approach requires no prior knowledge of the number of events; therefore,
hierarchical clustering is used. In this stage, we aim to automatically group
messages belonging to the same event. We also need a fast and efficient message clustering
system to overcome the problem of the high arrival rate of messages. In order
to deal with the high incoming rate of messages, we use a sliding window
manager to keep track of messages arriving in the system. The size of the sliding
window can be defined as a number of messages or a time interval. In our case,
we use time intervals such as one hour, two hours, or one day, depending on the
user-given preference. Additionally, the number of previous time slots needs to
be specified, because it is not necessary to consider the complete usage history of
the data to determine whether an event is emerging.
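A minimal sketch of such a time-interval sliding window manager is given below. The slot length and the number of retained slots are user preferences assumed for illustration, not values prescribed by the paper.

# Keep only the most recent time slots of incoming messages.

from collections import deque

class SlidingWindowManager:
    def __init__(self, slot_seconds=3600, history_slots=24):
        self.slot_seconds = slot_seconds        # e.g. one-hour slots
        self.history_slots = history_slots      # how many past slots to keep
        self.slots = deque()                    # (slot_id, [messages])

    def add(self, message, timestamp):
        slot_id = int(timestamp) // self.slot_seconds
        if not self.slots or self.slots[-1][0] != slot_id:
            self.slots.append((slot_id, []))
            while len(self.slots) > self.history_slots:
                self.slots.popleft()            # drop the oldest slot
        self.slots[-1][1].append(message)

    def current_slot(self):
        return self.slots[-1][1] if self.slots else []

    def previous_slots(self):
        return [msgs for _, msgs in list(self.slots)[:-1]]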
In order to find the best representation of term weights for tweet messages, we
compare four different term weighting formulas (i.e., term frequency, Augmented
Normalized Term Frequency [13], TFIDF [13], and Smooth-TFIDF [10]). We also
compare four different similarity functions (i.e., Jaccard index, Euclidean distance,
Manhattan distance, and cosine similarity) to find the best similarity
function. To evaluate an effective clustering method, we manually label 16,144
tweets into 17 topics. We evaluate our algorithm using pair-wise precision,
recall, and F1-score. Due to space limitations, we do not show our preliminary experiments.
The clustering method performs well when using the augmented
normalized term frequency and the cosine similarity function. Therefore, we calculate
the weight w_{i,t} of term t in message i using the augmented normalized term
frequency:
w_{i,t} = 0.5 + 0.5 · ( tf_{i,t} / tf_i^max )                               (1)

where tf_{i,t} is the term frequency value of the term t in message i and tf_i^max is the
highest term frequency value in message i. The cosine similarity function is
used to calculate the similarity between an existing cluster and the new message:
cos(m, c) = ( Σ_i w_{m,t_i} · w_{c,t_i} ) / ( sqrt(Σ_j w_{m,t_j}^2) · sqrt(Σ_j w_{c,t_j}^2) )    (2)
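The following sketch groups messages using the augmented normalized term frequency of Equation 1 and the cosine similarity of Equation 2. The single-pass, threshold-based grouping, the threshold value, and the centroid update are simplifying assumptions and do not reproduce the paper's hierarchical clustering.

# Group messages into candidate event clusters by weighted cosine similarity.

import math
from collections import Counter

def augmented_tf(tokens):
    tf = Counter(tokens)
    if not tf:
        return {}
    tf_max = max(tf.values())
    return {t: 0.5 + 0.5 * f / tf_max for t, f in tf.items()}

def cosine(a, b):
    num = sum(a[t] * b.get(t, 0.0) for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def cluster(messages, threshold=0.5):
    """messages: list of token lists; returns clusters of message indices."""
    clusters = []                       # each: {"centroid": dict, "members": []}
    for idx, tokens in enumerate(messages):
        vec = augmented_tf(tokens)
        best, best_sim = None, 0.0
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(idx)
            for t, v in vec.items():    # naive (unnormalised) centroid update
                best["centroid"][t] = best["centroid"].get(t, 0.0) + v
        else:
            clusters.append({"centroid": dict(vec), "members": [idx]})
    return clusters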
database, the database of geographic locations, to obtain the location. Finally,
if neither of them is available, we set the message location to "World".
For the gazetteer database, we downloaded a list of geographic locations from
GeoNames6 and stored a local copy of the gazetteer in a database.
Secondly, for the event location, we find all terms or phrases referring to geographic
locations (e.g., country, state, and city) in the tweet contents. Since
location extraction from text is one of the challenging problems of this research
area, this task will be a focus of future work. In this research, we simply
extract the message-mentioned locations via Named Entity Recognition (NER).
We use the Stanford Named Entity Recognizer [7] to identify locations within
the messages. We also use the Part-of-Speech tagger for Twitter introduced
in [6] to extract proper nouns. We use the extracted terms to query the
gazetteer database to obtain candidate locations of the event. We find the most
probable location of the event using the frequency of each location in the cluster;
the location with the highest frequency is assigned as the event location.
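A minimal sketch of this majority-frequency selection is shown below; extract_locations() is a placeholder standing in for the NER/POS extraction and gazetteer lookup described above.

# Pick the most frequently mentioned candidate location as the event location.

from collections import Counter

def event_location(cluster_messages, extract_locations):
    """cluster_messages: list of message texts belonging to one event cluster.
    extract_locations(text) -> list of gazetteer-resolved location names."""
    counts = Counter()
    for text in cluster_messages:
        counts.update(extract_locations(text))
    if not counts:
        return None                      # no location mentioned in the cluster
    location, _ = counts.most_common(1)[0]
    return location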
Finally, to find the correlation between the event location and the user location,
a correlation score is computed by comparing the levels of location granularity.
The granularity levels are defined as Country > State > City > PlaceName. We assign
a score for each level if both locations have the same value at that level. The equation is shown
below:
CorrelateScore = α_1 · F(uCountry, eCountry) + α_2 · F(uState, eState)
               + α_3 · F(uCity, eCity) + α_4 · F(uPlace, ePlace)           (3)

where α_1, α_2, α_3, and α_4 are the weights of the granularity levels (α_1 = α_2 = α_3 = α_4 = 0.25),
uCountry, uState, uCity, and uPlace are the user location components, eCountry, eState, eCity,
and ePlace are the event location components, and F(x, y) = 1 if x = y and the higher granularity
levels have the same values; otherwise F(x, y) = 0.
To identify which clusters are hotspot events, the LocScore is computed. The
range of LocScore is 0 to 1. It will be used to compute the emerging score in the
next section, and the top-k ranked emerging hotspot events will be selected.
The LocScore of cluster c is defined as:
LocScore_c = ( Σ_{u ∈ U} CorrelateScore_u ) / N_p                           (4)

where N_p is the number of users who post messages in cluster c and U represents the
users who post messages in cluster c.
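The following sketch implements Equations 3 and 4 under the stated equal weights of 0.25. The dictionary representation of locations and the helper names are assumptions made for illustration.

# Correlate user locations with the event location and average over a cluster.

LEVELS = ["country", "state", "city", "place"]   # coarse -> fine
WEIGHTS = [0.25, 0.25, 0.25, 0.25]

def correlate_score(user_loc, event_loc):
    """A level counts only if it and all coarser levels match (Eq. 3)."""
    score = 0.0
    for level, w in zip(LEVELS, WEIGHTS):
        if user_loc.get(level) and user_loc.get(level) == event_loc.get(level):
            score += w
        else:
            break                                # a coarser level mismatched
    return score

def loc_score(user_locations, event_loc):
    """LocScore of a cluster: average CorrelateScore over its posting users (Eq. 4)."""
    if not user_locations:
        return 0.0
    total = sum(correlate_score(u, event_loc) for u in user_locations)
    return total / len(user_locations)

# Example: a user in the same city as the event scores 0.75 (country+state+city).
print(correlate_score(
    {"country": "AU", "state": "QLD", "city": "Brisbane"},
    {"country": "AU", "state": "QLD", "city": "Brisbane", "place": "CBD"}))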
3.5 Visualization
For usability and ease of understanding of the visualization model, we use a motion
chart7 to represent emerging hotspot events. Google Maps8 is used to represent
7
http://code.google.com/apis/chart/interactive/docs/
gallery/motionchart.html
8
https://developers.google.com/maps/documentation/javascript/examples
the top-k emerging hotspot events for a specific location. Examples of a motion
chart and a geo-map are shown in Figure 3. Figure 3 (left) shows a given emerging
hotspot event within a time period (the x-axis is the number of users who post
messages; the y-axis is the number of messages; colour represents location and bubble
size is the emerging score). Figure 3 (right) shows the top five events in a selected
state.
Table 7. Sample of top 5 events detected by the LEED approach on 25/06/2012. Data
is represented as event topic (event location).
Detected events Description
1) behold, dalla, debbi, editor, storm, tropic A tropical storm named Debbie,
(US>Texas>Dallas) headed for Dallas
2) debbi, florida, gulf, move,slowli, soak, storm, Moving Slowly in Gulf,Tropical Storm
tropic (US>Florida) Debby Soaks Florida (Related to event 1)
3) attack, coast, debbi, flood, florida, Tropical Storm Debby attacks
lead, neighborhood, storm, tornado, touch, tropic, Florida's coast, flooding neighborhoods
weather, latest, listen, newscast (US>Florida) (Related to event 1)
4) concert, denver, oc, park, polic, shot Police officer shot dead at jazz concert
(US>Colorado>Denver) in Denver
5) accid, buse, circl, injuri, involv, metro, Accident Washington Circle 2 Metro buses
report, washington (US>Washington, D.C.) involved.
5 Related Work
5.1 Event Detection in Social Streams
Event detection on micro-blogs as a challenging research topic has been increas-
ingly reported recently [1,3,8,11,15,17]. Sayyadi et al. in [15] developed a new
event detection algorithm by using a keyword graph and community detection
algorithm to discover events from blog posts. Keywords with low document
9
http://keygraph.codeplex.com
frequency are filtered. An edge is removed if it does not satisfy the conditions.
Communities of keywords are detected by removing the edges with a high be-
tweenness centrality score. However, the number of detected events depends on
the threshold parameters and there was no evaluation conducted. Ozdikis et al.
in [11] proposed an event detection method in Twitter base on clustering of hash-
tags, the # symbol is used to mark keywords or topics in Twitter, and applied a
semantic expansion to message vectors. For each hashtag the most similar three
hashtags are extracted by using cosine similarity. A tweet vector with a single
hashtag is expanded with three similar hashtags and then used in the clustering
process. However, using only messages with a single hashtag can lead to ignoring
some important events. Also, they did not implement any credibility filter to
decide whether a tweet is about an event or not.
6 Conclusions
In this paper, we develop a system, namely LEED, to automatically detect
emerging hotspot events over micro-blogs. Our contributions can be summarized
as follows: (1) We proposed a solution to detect emerging hotspot events, which can
help governments or organizations prepare for and respond to unexpected events
such as natural disasters and protests. (2) We correlate user location with event
location to establish a strong correlation between them. (3) We provide an evaluation
of the effectiveness of event detection on a real-world Twitter dataset with
different granularities of location levels. Our experiments show that the proposed
approach can effectively detect emerging events compared with the baselines. However,
some of the clusters that we found do not belong to real-world events; they can be
chat topics. In future work, the differentiation between event locations and topic
locations mentioned in the text will be studied. Moreover, each component
may affect the final detection results, so a break-down staged evaluation
will be conducted.
References
1. Aggarwal, C., Subbian, K.: Event detection in social streams. In: SDM (2012)
2. Alvanaki, F., Michel, S., Ramamritham, K., Weikum, G.: See what's enblogue:
real-time emergent topic identification in social media. In: EDBT (2012)
3. Becker, H., Naaman, M., Gravano, L.: Beyond trending topics: Real-world event
identification on twitter. In: ICWSM (2011)
4. Cataldi, M., Caro, L.D., Schifanella, C.: Emerging topic detection on twitter based
on temporal and social terms evaluation. In: MDMKDD (2010)
5. Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis, 1st edn. John
Wiley & Sons Inc. (February 1973)
6. Gimpel, K., Schneider, N., O'Connor, B., Das, D., Mills, D., Eisenstein, J., Heilman, M.,
Yogatama, D., Flanigan, J., Smith, N.A.: Part-of-speech tagging for twitter:
Annotation, features, and experiments. In: ACL (2011)
7. Jurafsky, D., Martin, J.H.: Speech and Language Processing: An Introduction to
Natural Language Processing, Computational Linguistics, and Speech Recognition,
vol. 163. Prentice Hall (2000)
8. Li, C., Sun, A., Datta, A.: Twevent: segment-based event detection from tweets.
In: CIKM (2012)
9. Mathioudakis, M., Koudas, N.: Twittermonitor: trend detection over the twitter
stream. In: SIGMOD (2010)
10. Ni, X., Quan, X., Lu, Z., Wenyin, L., Hua, B.: Short text clustering by finding core
terms. Knowl. Inf. Syst. 27(3), 345-365 (2011)
11. Ozdikis, O., Senkul, P., Oguztuzun, H.: Semantic expansion of hashtags for en-
hanced event detection in twitter. In: VLDB-WOSS (2012)
12. Ruthven, I., Lalmas, M.: A survey on the use of relevance feedback for information
access systems. Knowl. Eng. Rev. 18(2), 95-145 (2003)
13. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval.
Inf. Process. Manage. 24(5), 513-523 (1988)
14. Sankaranarayanan, J., Samet, H., Teitler, B.E., Lieberman, M.D., Sperling, J.:
Twitterstand: news in tweets. In: GIS (2009)
15. Sayyadi, H., Hurst, M., Maykov, A.: Event detection and tracking in social streams.
In: ICWSM (2009)
16. Watanabe, K., Ochi, M., Okabe, M., Onai, R.: Jasmine: a real-time local-event
detection system based on geolocation information propagated to microblogs. In:
CIKM (2011)
17. Weng, J., Lee, B.-S.: Event detection in twitter. In: ICWSM (2011)
Measuring Strength of Ties in Social Network
Dakui Sheng, Tao Sun, Sheng Wang, Ziqi Wang, and Ming Zhang
1 Introduction
Granovetter [4] introduced the concept of strength of ties in his landmark paper
"The Strength of Weak Ties". Strong ties connect people you really trust, and
those whose social circles tightly overlap with yours; often, they are also the
people who are most like you, while weak ties are merely acquaintances.
Weak ties often provide people with access to information and resources outside
their own circles [3], while strong ties have greater motivation to be of assistance
and are typically more easily available [3]. So the study of social ties can bring
many benefits to daily life [10, 5] and is well worth working on. However, in
most studies of social networks, social connections are often presented as being
friends or not, neglecting the difference between strong and weak ties. And the
problem hasn't been well explored yet.
Previous studies have investigated the strength of ties, e.g., by exploiting
human annotation as training data [2, 6]. The problem is that performance
largely depends on the quality of annotation. At the same time, they treat the
problem as a binary classification task, i.e., to classify a relationship as strong or
weak, and were not able to quantitatively estimate the strength, which limits the
scope of their contribution in the real world. (This work is partially supported by
the National Natural Science Foundation of China (NSFC Grant No. 61272343) and
the Specialized Research Fund for the Doctoral Program of Higher Education of
China (FSSP Grant No. 20120001110112).) In this paper, we harness the homophily
principles and empirical analysis to develop a latent variable model for estimating
tie strength based on user similarities and social interactions. Compared to the
existing methods, the proposed method requires no human annotation, and the
outputs are real numbers (instead of 0 or 1) representing the strength of ties.
To evaluate the proposed method, we conduct experiments on data collected
from Sina Weibo1, and the experimental results demonstrate that our model
outperforms previous methods. Further, with the method to calculate tie strength,
we investigate the phenomenon of social triads. In sociology, a triad is a group of
three people, which is mostly investigated in microsociology. We examine social
triads and our findings are consistent with Granovetter's conclusion [3].
The rest of this paper is organized as follows. We review the related
work in Section 2. In Section 3, we introduce the framework for measuring the
strength of ties in social networks. In Section 4, we show the experimental results
of our work. Finally, in Section 5, we conclude and discuss future work.
2 Related Work
Some recent efforts have been made on the task of measuring the strength of
ties. Most of them focus on supervised models, which involve effort on human
annotation [2, 6]. However, measuring the strength of ties is only treated as a
binary classification task by these works, i.e., classifying a relationship as strong
or weak, and they were unable to quantitatively estimate the strength between
different users. Zhao et al. [13] proposed a novel framework for measuring the
relationship strength between different users on various activity fields in online
social networks. However, it only represented the strength of ties on various
activity fields rather than the overall relationship strength.
The most relevant work was proposed by Xiang et al. [12]: a latent
variable model that measures the strength of ties by outputting continuous values
without human annotation. User similarity and social interactions are exploited
while constructing the model. Our model is similar to it in the aspects
mentioned above. However, there are several differences: (1) we consider both user
similarities and social interactions as decisive factors of tie strength, while they
only considered user similarities. (2) The way in which we utilize social interactions
is rather different. In our work, we believe that social interactions directly impact
the strength of ties, which in turn impacts the interaction probability of
the user pairs in the future. In the previous work, researchers believe that the
strength of ties impacts the interaction probability of user pairs all the time.
Experimental results confirm that our model outperforms the previous methods.
At present, there has been some work concerning triads. However, these works
either investigated the network structure using social triad theory [11, 1] or
studied the triads composed of positive and negative edges (friends and enemies) [8]
in social media. To the best of our knowledge, our work is the first
that investigates triads whose edges are measured by the strength of ties.
1
http://weibo.com. It is one of the most popular microblog systems in China, similar
to Twitter.
3 Model
One key assumption is the theory of homophily [9]. It postulates that people
tend to form ties with other people who have similar characteristics. Moreover,
it is likely that the stronger the tie, the higher the similarity. Therefore, we can
model the strength of ties as a hidden effect of user similarities.
Another assumption is that the frequency of social interactions between user
pairs directly impacts their strength of ties. Since each user has a finite amount
of resources (e.g., time) to form and maintain relationships, it is likely that they
direct these resources (interactions) towards the ties that they deem more important [12].
As a result, user pairs with strong ties are more likely to interact in the future. On
the other hand, it can be inferred that user pairs with more social interactions in
the past are more likely to form strong ties between them. In this way, we model
the strength of ties as the hidden cause of users' future interaction probability
and the hidden effect of the users' historical social interactions. A graphical model
representation of the tie strength model is shown in Fig. 1.
The model is generative: it represents the possible causal relationships among these
variables by modeling the conditional dependencies, so it decomposes as follows:

P(z^{(i,j)}, y^{(i,j)} | s^{(i,j)}) = P(z^{(i,j)} | s^{(i,j)}) ∏_{m=1}^{t} P(y_m^{(i,j)} | z^{(i,j)})    (1)

The general latent variable model of tie strength can be instantiated in different
ways; we adopt the Gaussian distribution to model the conditional probability of the
tie strength given the vector of similarity and historical interaction information, i.e.,
z^{(i,j)} is normally distributed around w^T s^{(i,j)} with variance v. The dependency
between y_m^{(i,j)} and z^{(i,j)} is then modeled as:

P(y_m^{(i,j)} = 1 | z^{(i,j)}) = 1 / (1 + e^{−(θ_m z^{(i,j)} + b)})    (3)
3.2 Inference
Note that both the quadratic terms and the logarithm of the logistic function are
concave. Since the sum of concave functions is concave, the function L is concave.
Therefore, a gradient-based method allows us to optimize over the parameters
w, θ_m (m = 1, 2, ..., t) and the latent variables z^{(i,j)}, (i,j) ∈ D, to find the
maximum of L. We derive a coordinate ascent method for the optimization.
The coordinate-wise gradients are:
"t
L 1 T (i,j) (i,j) (i,j) 1
= (w s z ) + (ym )m (6)
z (i,j) v m=1
1 + e m z(i,j) +b)
(
L " 1
(i,j)
= (ym ( (i,j) +b)
)z (i,j) t (7)
m 1 + e mz
(i,j)D
L 1 "
= (z (i,j) wT s(i,j) )s(i,j) w w (8)
m v
(i,j)D
(i,j) 2
= ( (i,j) +b) 2
(9)
(z ) v m=1 (1 + e m z )
2L = T (i,j)
(z (i,j) )2 e(m z +b)
T
= I (10)
t t
(i,j)D
(1 + e(m z(i,j) +b) )2
For w, the root of Eq. (8) can be found analytically as in the usual ridge regression:

w = (λ_w I + S^T S)^{−1} S^T Z    (11)

where S = (s^{(i_1,j_1)}, s^{(i_2,j_2)}, ..., s^{(i_N,j_N)})^T and
Z = (z^{(i_1,j_1)}, z^{(i_2,j_2)}, ..., z^{(i_N,j_N)})^T. Implementing the model, given
y^{(i,j)} and s^{(i,j)}, we can gradually obtain the parameters w, θ and the continuous
latent values z^{(i,j)}. So the problem can be solved.
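The coordinate ascent above can be illustrated with the following minimal Python sketch.
It assumes a scalar tie strength per user pair, a feature matrix S, binary interaction
indicators Y, and hypothetical regularization constants and learning rate; the
gradient-step updates for z and θ are our own simplifications, not the authors'
implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_tie_strength(S, Y, v=1.0, lam_w=0.1, lam_t=0.1, lr=0.01, iters=200):
    # S: (N, d) similarity/historical-interaction features per user pair.
    # Y: (N, t) binary future-interaction indicators per pair.
    # Returns latent strengths z (N,), weights w (d,), logistic parameters theta (t,).
    N, d = S.shape
    t = Y.shape[1]
    w = np.zeros(d)
    theta = np.ones(t)
    b = 0.0
    z = S @ w
    for _ in range(iters):
        # gradient step on each z (cf. Eq. (6)): Gaussian pull toward w^T s plus logistic terms
        p = sigmoid(np.outer(z, theta) + b)                   # (N, t)
        z += lr * ((S @ w - z) / v + (Y - p) @ theta)
        # gradient step on theta (cf. Eq. (7)) with a ridge penalty
        p = sigmoid(np.outer(z, theta) + b)
        theta += lr * (((Y - p) * z[:, None]).sum(axis=0) - lam_t * theta)
        # closed-form ridge update for w (cf. Eq. (11))
        w = np.linalg.solve(lam_w * np.eye(d) + S.T @ S, S.T @ z)
    return z, w, theta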
4 Experiment
4.1 Model Evaluation
The experiments are conducted on a real-world dataset crawled from Sina Weibo
using its API2. Since two-way communication is mostly observed between
bi-followers3, we focus on ties between bi-followers in the experiments. We recruited
several people to label the category of ties (strong or weak) between them and
their bi-followers. The criterion for human annotation is based on the definition
of strength of ties proposed by Granovetter [4] and discussed in Section 1. Then
we crawl data, extract features, and apply our model to the dataset. Meanwhile,
the human annotation results are not used while calculating the strength of ties.
To evaluate the model, we use the continuous values obtained to identify whether
a tie is strong or weak. The accuracy and AUC of this classification task are used
to evaluate the model, since there are no other suitable criteria for it. Several
baselines are compared with our model. One is the model proposed by Xiang et
al. [12], referred to as the UBL model here. Of the other two, one uses social
interaction features for classification while the other exploits both social interaction
and user similarity features; they are referred to as the UC model and the UIC model,
respectively, while our model is referred to as the UIBL model. The experimental
results are shown in Fig. 2(a).
From the results, it can be seen that our model outperforms the others on both
criteria. Despite the fact that our model and the UBL model are quite similar4
2
http://open.weibo.com/.
3
Your bi-followers follow you and you follow them as well.
4
We have discussed the differences and common points in Section 2.
to some extent, our model improves on its accuracy by 6 percent and achieves a
much better AUC, mainly because the effect of historical social interactions on tie
strength has been taken into account. Although the UIC model exploits all the
features of our model as well as the results of human annotation, it is still beaten
by our model on both criteria. We explain this by the fact that the theory of homophily
and the essence of tie strength are taken into account in our model. In
other words, we exploited the inner relation between the features and tie strength
rather than just training on the features and the training set.
Relations in social media sites often reflect a mixture of friendly and antagonistic
interactions [7], while in social networks they often reflect a mixture of
strong and weak ties. Different from signed triads in social media [8], we classify
the edges of triads based on the strength of ties. The edges (ties) in social triads can
be classified into three categories: strong, weak and atypical5. Triads with atypical
ties are not considered in the experiments. In this way, social triads composed of
strong and weak ties can be classified into the four categories shown in Fig. 3.
Fig. 3. The four categories of social triads: (a) 3 strong; (b) 2 strong & 1 weak; (c) 1 strong & 2 weak; (d) 3 weak
We use snowball sampling to crawl the data from Sina Weibo. We randomly
pick several celebrities as seeds, utilize snowball sampling to crawl all their
bi-followers, and repeat this several times. Since users are more likely to be friends with
people of common interests, the datasets can be seen, to some extent, as communities
composed of celebrities in one field. By choosing celebrities from the Sports,
Technology and Entertainment fields as seeds respectively, we obtain three datasets
of celebrities in different fields, which are referred to as the Sports, Technology
and Entertainment datasets. Statistics for them are shown in Fig. 2(b).
We conduct several experiments to reveal the phenomena behind social triads.
A significance level (α) is defined to distinguish strong and weak ties from
atypical ties. It is denoted by k% (0 ≤ k ≤ 100). We sort ties by their strength
in ascending order, and consider that the first k% of ties are strong ties while the
last k% are weak ties. The others are atypical ties. In this way, the strong
5
Considering that it is rather hard to define the boundary between strong ties and
weak ties, we add another type of tie which is not significant enough to be a strong
tie or a weak tie and refer to it as an atypical tie. Such ties are not considered to be
statistically meaningful here.
Fig. 4. The proportions of the four triads change with the level of significance, in the
Sports, Technology and Entertainment datasets, respectively. The horizontal axis here is
the significance level.
ties and weak ties we defined here are significantly strong or weak ties when α
is small. In the experiments, we gradually take k as 5, 10, 15, 20, 25 and 30 for study.
Fig. 4 shows the changes in the proportions of triads as α varies. They appear to be
similar for the different datasets. When α is less than 0.15, there are only triads with
three strong ties, since few weak ties make up triads then. When α increases from
0.15 to 0.30, triads with three strong ties remain the most common. So we can infer that
triads with three strong ties are more likely to form. Meanwhile, the number of
triads with two strong ties and one weak tie is much smaller, which matches the
sociological theory that your friend's friends are more likely to be your friends.
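The significance-level classification of ties and the triad counting described above can be
sketched as follows (an illustrative Python fragment, assuming larger values mean stronger
ties and that tie strengths and edges use a consistent (u, v) key ordering; it is not the
authors' pipeline):

from itertools import combinations

def classify_ties(strengths, k):
    # strengths: dict {(u, v): z}; returns dict {(u, v): 'strong'|'weak'|'atypical'}.
    # Assumes larger z means a stronger tie: top k% strong, bottom k% weak.
    ordered = sorted(strengths, key=strengths.get, reverse=True)
    n_sig = int(len(ordered) * k / 100.0)
    labels = {e: "atypical" for e in ordered}
    if n_sig:
        labels.update({e: "strong" for e in ordered[:n_sig]})
        labels.update({e: "weak" for e in ordered[-n_sig:]})
    return labels

def triad_counts(labels, users):
    # Count triads by the number of strong ties, skipping triads that contain atypical ties.
    counts = {0: 0, 1: 0, 2: 0, 3: 0}
    for a, b, c in combinations(users, 3):
        edges = [(a, b), (a, c), (b, c)]
        if not all(e in labels for e in edges):
            continue                        # not a closed triad (or inconsistent key order)
        tie_types = [labels[e] for e in edges]
        if "atypical" in tie_types:
            continue
        counts[tie_types.count("strong")] += 1
    return counts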
References
[1] Cartwright, D., Harary, F.: Structural balance: a generalization of Heider's theory.
Psychological Review 63(5), 277 (1956)
[2] Gilbert, E., Karahalios, K.: Predicting tie strength with social media. In: Proceedings
of the 27th International Conference on Human Factors in Computing Systems,
pp. 211-220. ACM (2009)
[3] Granovetter, M.: The strength of weak ties: A network theory revisited. Sociological
Theory 1(1), 201-233 (1983)
[4] Granovetter, M.: The strength of weak ties. American Journal of Sociology,
1360-1380 (1973)
[5] Harrison, F., Sciberras, J., James, R.: Strength of social tie predicts cooperative
investment in a human social network. PloS One 6(3), e18338 (2011)
[6] Kahanda, I., Neville, J.: Using transactional information to predict link strength
in online social networks. In: Proceedings of the Third International Conference
on Weblogs and Social Media, ICWSM (2009)
[7] Leskovec, J., Huttenlocher, D., Kleinberg, J.: Predicting positive and negative links
in online social networks. In: Proceedings of the 19th International Conference on
World Wide Web, pp. 641-650. ACM (2010a)
[8] Leskovec, J., Huttenlocher, D., Kleinberg, J.: Signed networks in social media. In:
Proceedings of the 28th International Conference on Human Factors in Computing
Systems, pp. 1361-1370. ACM (2010b)
[9] McPherson, M., Smith-Lovin, L., Cook, J.: Birds of a feather: Homophily in social
networks. Annual Review of Sociology, 415-444 (2001)
[10] Montgomery, J.: Job search and network composition: Implications of the strength-
of-weak-ties hypothesis. American Sociological Review, 586-596 (1992)
[11] Weber, C., Weber, B.: Exploring the antecedents of social liabilities in CVC triads:
a dynamic social network perspective. Journal of Business Venturing 26(2), 255-272
(2011)
[12] Xiang, R., Neville, J., Rogati, M.: Modeling relationship strength in online social
networks. In: Proceedings of the 19th International Conference on World Wide
Web, pp. 981-990. ACM (2010)
[13] Zhao, X., Li, G., Yuan, J., et al.: Relationship strength estimation for online social
networks with the study on Facebook. Neurocomputing (2012)
Finding Diverse Friends in Social Networks
Example 1. Consider the FSN shown in Table 1(b), which consists of n=10 interest-group
lists L1, ..., L10 for m=7 friends in Table 1(a). Each row in the table represents the list
of an interest group. These 10 interest groups are distributed into d=3 domains D1, D2
and D3. For instance, FD1 = {L1, L2, L3}. For group G = {Cathy, Ed}, its Size(G)=2.
As its projected lists on the 3 domains are F^G_{D1} = ∅, F^G_{D2} = {L6, L7} and
F^G_{D3} = {L8}, its frequencies Freq_{D1,2,3}(G) = ⟨0, 2, 1⟩.
Example 2. Revisit the FSN in Table 1(b). The prominence value of friend group G =
{Cathy, Ed} in D1 is (Prom_{D1}(Cathy) + Prom_{D1}(Ed)) / Size(G) = (0.20 + 0.50)/2 = 0.35.
We apply similar computation and get Prom_{D1,2,3}(G) = ⟨0.35, (0.60+0.40)/2, (0.70+0.45)/2⟩
= ⟨0.35, 0.5, 0.575⟩. Recall from Example 1 that Freq_{D1,2,3}(G) = ⟨0, 2, 1⟩. So, the
overall influence of G in all 3 domains can be calculated as Inf_{D1,2,3}(G) =
⟨0.35×0, 0.5×2, 0.575×1⟩ = ⟨0, 1, 0.575⟩. Thus, the diversity of G in these d=3 domains
in the FSN is Div(G) = (0 + 1 + 0.575)/3 = 0.525.
Example 3. Recall from Example 2 that Div({Cathy, Ed}) = 0.525. Given (i) the FSN in
Table 1(b) and (ii) the user-specified minDiv = 0.5, group G = {Cathy, Ed} is diverse
because Div(G) = 0.525 ≥ 0.5 = minDiv. But, group G' = {Ed} is not diverse because
Div(G') = ((0.5×0) + (0.4×2) + (0.45×1))/3 = (0 + 0.8 + 0.45)/3 = 0.417 < minDiv.
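To make the computation in Examples 1-3 concrete, the diversity measure can be sketched
in Python as below. The interest-group lists for the three domains are hypothetical
placeholders chosen only to reproduce the frequencies of Example 1, and the plain
dictionaries are not the Div-tree structures used by Div-growth.

def prominence(group, prom_d):
    # Average prominence of the friends in `group` within one domain.
    return sum(prom_d[f] for f in group) / len(group)

def frequency(group, domain_lists):
    # Number of interest-group lists in the domain containing every friend in `group`.
    return sum(1 for lst in domain_lists if set(group) <= set(lst))

def diversity(group, domains, prom):
    # Div(G): influence Prom x Freq averaged over all d domains.
    d = len(domains)
    return sum(prominence(group, prom[name]) * frequency(group, lists)
               for name, lists in domains.items()) / d

# hypothetical lists reproducing Freq_{D1,2,3}({Cathy, Ed}) = <0, 2, 1>
domains = {
    "D1": [["Amy", "Bob"], ["Amy"], ["Bob"]],
    "D2": [["Cathy", "Ed"], ["Cathy", "Ed", "Amy"]],
    "D3": [["Cathy", "Ed", "Bob"]],
}
prom = {
    "D1": {"Cathy": 0.20, "Ed": 0.50},
    "D2": {"Cathy": 0.60, "Ed": 0.40},
    "D3": {"Cathy": 0.70, "Ed": 0.45},
}
print(diversity(["Cathy", "Ed"], domains, prom))   # about 0.525, as in Examples 2-3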
Example 5. To construct a Div-tree for the FSN shown in Table 1(b) when minDiv = 0.5,
Div-growth scans the FSN to compute (i) GMProm_{D1,2,3} = ⟨0.9, 0.7, 0.7⟩ for all d=3
domains, (ii) the frequencies of all 7 friends in the d=3 domains (e.g., Freq_{D1,2,3}({Amy}) =
⟨2, 0, 3⟩), and (iii) the upper bound of the diversity values of all 7 friends (e.g.,
Div^U({Amy}) = ((0.9×2) + (0.7×0) + (0.7×3))/3 = 1.3 using Inf^U_{D1,2,3}({Amy})). Based
on Lemma 1, we safely remove Fred and Greg, having Div^U({Fred}) = 0.23 and
Div^U({Greg}) = 0.23 both below minDiv, as their super-groups cannot be diverse. So, the
header table includes only the remaining 5 friends, sorted in some order (e.g., lexicographical
order of friend names), with their Freq_{D1,2,3}({fi}). To facilitate a fast tree traversal,
like the FP-tree, the Div-tree also maintains horizontal node traversal pointers from the
header table to nodes of the same fi.
Div-growth then scans each Lj ∈ FSN, removes any friend fi ∈ Lj having Div^U(fi)
< minDiv, sorts the remaining friends according to the order in the header table, and
inserts the sorted list into the Div-tree. Each tree node captures (i) fi, representing the
group G consisting of all friends from the root to fi, and (ii) its frequencies in each
domain Freq_{D1,2,3}(G). For example, the rightmost node Ed:⟨0,1,0⟩ of the Div-tree in
Fig. 1(b) captures G = {Cathy, Ed} and Freq_{D1,2,3}(G) = ⟨0, 1, 0⟩. Tree paths with a
common prefix (i.e., the same friends) are shared, and their corresponding frequencies are
added. See Figs. 1(a) and 1(b) for the Div-trees after reading all interest-group lists in
domain D1 and the entire FSN, respectively.
With this tree construction process, the size of the Div-tree for the FSN with a given
minDiv is observed to be bounded above by Σ_{Lj ∈ FSN} |Lj|.
Lemma 2. The diversity value of a friend group G computed based on LMProm_D(G)
is a tighter upper bound than Div^U(G) computed based on GMProm_D.
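The upper-bound pruning that Lemmas 1 and 2 rely on, replacing a group's own prominence
with the global (GMProm) or local (LMProm) maximum prominence per domain, can be
sketched roughly as follows; the dictionary inputs are illustrative assumptions rather than
the authors' data structures:

def div_upper_bound(freq, max_prom):
    # Upper bound on Div(G): average over domains of max_prom[d] * freq[d].
    domains = list(freq)
    return sum(max_prom[d] * freq[d] for d in domains) / len(domains)

# Example 5: Div^U({Amy}) with GMProm = <0.9, 0.7, 0.7> and Freq({Amy}) = <2, 0, 3>
gm_prom = {"D1": 0.9, "D2": 0.7, "D3": 0.7}
print(div_upper_bound({"D1": 2, "D2": 0, "D3": 3}, gm_prom))   # about 1.3, so Amy is kept

def prune(friends_freq, max_prom, min_div):
    # Drop friends whose upper bound is below minDiv; their super-groups cannot be diverse.
    return {f: fr for f, fr in friends_freq.items()
            if div_upper_bound(fr, max_prom) >= min_div}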
Example 6. To mine potentially diverse friend groups from the Div-tree in Fig. 1(b)
using minDiv = 0.5, Div-growth first builds the {Ed}-projected tree, as shown in
Fig. 2(a), by extracting the paths ⟨Amy, Cathy, Ed⟩:⟨0,0,1⟩, ⟨Bob, Cathy, Ed⟩:⟨0,1,0⟩ and
⟨Cathy, Ed⟩:⟨0,1,0⟩ from the Div-tree in Fig. 1(b). For F^{Ed}_{D1,2,3} = {Amy, Bob, Cathy, Ed},
Div-growth also uses LMProm_{D1,2,3}(F^{Ed}_{D1,2,3}) = ⟨0.9, 0.7, 0.7⟩ to compute the tightened
Div^U(G), such as the tightened Div^U({Amy, Ed}) = ((0.9×0) + (0.7×0) + (0.7×1))/3 = 0.23 < minDiv.
As Div^U({Amy, Ed}) and Div^U({Bob, Ed}) are both below minDiv, Div-growth
prunes Amy and Bob from the {Ed}-projected tree to get the {Ed}-conditional tree
as shown in Fig. 2(b). Due to pruning, Div-growth recomputes LMProm_{D1,2,3}(F^{Ed}_{D1,2,3})
= ⟨0.5, 0.6, 0.7⟩ and the tightened Div^U({Cathy, Ed}) = ((0.5×0) + (0.6×2) + (0.7×1))/3 = 0.63 for
the updated F^{Ed}_{D1,2,3} = {Cathy, Ed}. This completes the mining for {Ed}.
Next, Div-growth builds {Don}-, {Cathy}- & {Bob}-projected and conditional
trees, from which potentially diverse friend groups can be mined. Finally, Div-growth
computes the true diversity value Div(G) for each of these mined groups to check if it
is truly diverse (i.e., to remove all false positives).
4 Experimental Results
To evaluate the effectiveness of our proposed Div-growth algorithm and its associated
Div-tree structure, we compared them with a closely related weighted
frequent pattern mining algorithm called Weight [18] (which, however, does not use
different weights for individual items). As Weight was designed for frequent pattern
mining (instead of social network mining), we apply those datasets commonly
used in frequent pattern mining for a fair comparison: (i) IBM synthetic datasets
(e.g., T10I4D100K) and (ii) real datasets (e.g., mushroom, kosarak) from the Fre-
quent Itemset Mining Dataset Repository fimi.cs.helsinki.fi/data. See Ta-
ble 2 for more detail. Items in transactions in these datasets are mapped into
friends in interest-group lists. To reflect the concept of domains, we subdivided
the datasets into several batches. Moreover, a random number in the range (0, 1]
is generated as a prominence value for each friend in every domain.
All programs were written in C++ and run on the Windows XP operating
system with a 2.13 GHz CPU and 1 GB main memory. The runtime specified
indicates the total execution time (i.e., CPU and I/Os). The reported results
are based on the average of multiple runs for each case. We obtained consistent
results for all of these datasets.
Runtime. First, we compared the runtime of Div-growth (which includes the
construction of the Div-tree, the mining of potentially diverse friend groups from
the Div-tree, and the removal of false positives) with that of Weight. Fig. 3(a)
shows the results for a dense dataset (mushroom), which were consistent with
those for sparse datasets (e.g., T10I4D100K). Due to page limitation, we omit the
results for sparse datasets. Runtimes of both algorithms increased when mining
larger datasets (social networks), more batches (domains), and/or with lower
minDiv thresholds. Between the two algorithms, our tree-based Div-growth al-
gorithm outperformed the Apriori-based Weight algorithm. Note that, although
FP-growth [4] is also a tree-based algorithm, it was not designed to capture weights.
To avoid distraction, we omit experimental results on FP-growth and only show
those on Weight (which captures weights).
Compactness of the Div-tree. Next, we evaluated the memory consumption.
Fig. 3(b) shows the amount of memory required by our Div-tree for capturing
the content of social networks with the lowest minDiv threshold (i.e., without
removing any friends who were not diverse). Although this simulated the worst-
case scenario for our Div-tree, Div-tree was observed (i) to consume a reasonable
amount of memory and (ii) to require less memory than Weight (because our
Div-tree is compact due to the prefix sharing).
Scalability. Then, we tested the scalability of our Div-growth algorithm by
varying the number of transactions (interest-group lists). We used the kosarak
dataset as it is a huge sparse dataset with a large number of distinct items
(individual users). We divided this dataset into five portions, and each portion is
subdivided into multiple batches (domains). We set minDiv=5% of each portion.
Fig. 3(c) shows that, when the size of the dataset increased, the runtime also
increased proportionally, implying that Div-growth is scalable.
5 Conclusions
In this paper, we (i) introduced a new notion of diverse friends for social net-
works, (ii) proposed a compact tree structure called Div-tree to capture impor-
tant information from social networks, and (iii) designed a tree-based mining
algorithm called Div-growth to find diverse (groups of) friends from social net-
works. Diversity of friends is measured based on their prominence, frequency
and influence in different domains on the networks. Although diversity does not
satisfy the downward closure property, we managed to address this issue by us-
ing the global and local maximum prominence values of users as upper bounds.
Experimental results showed that (i) our Div-tree is compact and space-effective
and (ii) our Div-growth algorithm is fast and scalable for both sparse and dense
datasets. As ongoing work, we are conducting a more extensive experimental evaluation
to measure other aspects (e.g., precision) of our Div-growth algorithm in finding
diverse friends. We also plan to (i) design a more sophisticated way to mea-
sure influence and (ii) incorporate other computational metrics (e.g., popularity,
significance, strength) with prominence into our discovery of useful information
from social networks.
References
1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large
databases. In: VLDB 1994, pp. 487-499 (1994)
2. Anagnostopoulos, A., Kumar, R., Mahdian, M.: Influence and correlation in social
networks. In: ACM KDD 2008, pp. 7-15 (2008)
3. Cameron, J.J., Leung, C.K.-S., Tanbeer, S.K.: Finding strong groups of friends
among friends in social networks. In: IEEE DASC/SCA 2011, pp. 824-831 (2011)
4. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation.
In: ACM SIGMOD 2000, pp. 1-12 (2000)
5. Jiang, F., Leung, C.K.-S., Tanbeer, S.K.: Finding popular friends in social networks.
In: CGC/SCA 2012, pp. 501-508. IEEE (2012)
6. Kamath, K.Y., Caverlee, J., Cheng, Z., Sui, D.Z.: Spatial influence vs. community
influence: modeling the global spread of social media. In: ACM CIKM 2012, pp.
962-971 (2012)
7. Lee, W., Leung, C.K.-S., Song, J.J., Eom, C.S.-H.: A network-flow based influence
propagation model for social networks. In: CGC/SCA 2012, pp. 601-608. IEEE
(2012)
8. Lee, W., Song, J.J., Leung, C.K.-S.: Categorical data skyline using classification
tree. In: Du, X., Fan, W., Wang, J., Peng, Z., Sharaf, M.A. (eds.) APWeb 2011.
LNCS, vol. 6612, pp. 181-187. Springer, Heidelberg (2011)
9. Leung, C.K.-S., Carmichael, C.L.: Exploring social networks: a frequent pattern
visualization approach. In: IEEE SocialCom 2010, pp. 419-424 (2010)
10. Leung, C.K.-S., Tanbeer, S.K.: Mining social networks for significant friend groups.
In: Yu, H., Yu, G., Hsu, W., Moon, Y.-S., Unland, R., Yoo, J. (eds.) DASFAA
Workshops 2012. LNCS, vol. 7240, pp. 180-192. Springer, Heidelberg (2012)
11. Peng, Z., Wang, C., Han, L., Hao, J., Ou, X.: Discovering the most potential stars in
social networks with infra-skyline queries. In: Sheng, Q.Z., Wang, G., Jensen, C.S.,
Xu, G. (eds.) APWeb 2012. LNCS, vol. 7235, pp. 134-145. Springer, Heidelberg
(2012)
12. Sachan, M., Contractor, D., Faruquie, T.A., Subramaniam, L.V.: Using content
and interactions for discovering communities in social networks. In: ACM WWW
2012, pp. 331-340 (2012)
13. Schaal, M., O'Donovan, J., Smyth, B.: An analysis of topical proximity in the
twitter social graph. In: Aberer, K., Flache, A., Jager, W., Liu, L., Tang, J., Gueret,
C. (eds.) SocInfo 2012. LNCS, vol. 7710, pp. 232-245. Springer, Heidelberg (2012)
14. Sun, Y., Barber, R., Gupta, M., Aggarwal, C.C., Han, J.: Co-author relationship
prediction in heterogeneous bibliographic networks. In: ASONAM 2011, pp. 121-128.
IEEE (2011)
15. Tanbeer, S.K., Leung, C.K.-S., Cameron, J.J.: DIFSoN: discovering influential
friends from social networks. In: CASoN 2012, pp. 120-125. IEEE (2012)
16. Yang, X., Ghoting, A., Ruan, Y., Parthasarathy, S.: A framework for summarizing
and analyzing twitter feeds. In: ACM KDD 2012, pp. 370-378 (2012)
17. Zhang, C., Shou, L., Chen, K., Chen, G., Bei, Y.: Evaluating geo-social influence
in location-based social networks. In: ACM CIKM 2012, pp. 1442-1451 (2012)
18. Zhang, S., Zhang, C., Yan, X.: Post-mining: maintenance of association rules by
weighting. Information Systems 28(7), 691-707 (2003)
Social Network User Influence Dynamics
Prediction
1 Introduction
1.1 Identifying Influential Users
One of the most popular topics in social network analysis is identifying influential
users and their network impact. More interestingly, knowing the influence of
users and being able to predict it can be leveraged for many applications. The
application most famous to researchers and marketers is viral marketing [5], which
aims at targeting a group of influential users to maximize the marketing campaign
ROI (Return on Investment). Other interesting applications include search [1],
expertise/tweet recommendation [22], trust/information propagation [9], and
customer handling prioritization in social customer relationship management.
Influence Models and Measures. Currently, most applications and tools compute
user influence based on static network properties, such as the number
of friends/followers in the social graph, the number of posted tweets/received
retweets/mentions/replies in the activity graph, or users' centrality (e.g., PageRank,
betweenness centrality, etc.). All of them make implicit assumptions about
the underlying dynamic process: that social network users can only influence their
followers/friends, with equal impact strength, equal infection probabilities, and
at the same propagation rate; or the influence is assumed to take a random walk
on the corresponding static network.
A few works investigate the adoption behaviors of social network users as a
dynamic influence propagation or diffusion process [18]. The adoption behaviors
refer to some activities or topics (tweets, products, Hashtags, URLs, etc.) shared
among users implicitly and explicitly, such as users forwarding a message to their
friends, recommending a product to others, joining groups with similar musical
tastes, and posting messages about the same topics. According to
diffusion theory, information cascades from social leaders to followers. In
most diffusion models, propagators have certain probabilities of influencing their
receivers, and the receivers also have certain thresholds for being influenced. Finding
the social leaders or the users who can maximize the influence coverage in the
network is the major goal of most diffusion models. Targeting these influential
users is the idea behind viral marketing, aiming to activate a chain-reaction of
influence with a small marketing cost.
Some drawbacks of existing social network influence models based on either
static networks or the influence maximization diffusion process are: (1) Most
existing models are descriptive models rather than predictive models. For example,
the number of friends or the centrality score of a given user describes his/her
underlying network connectivity. The number of tweets that a user posted or
got retweeted indicates the trust/interest that his/her followers have in his/her
tweets. All these measures/models are descriptive, and very few models are able
to predict users' future influence. (2) The existing influence maximization diffusion
process is often modeled by discrete-time models such as the Independent Cascade
Model or the Linear Threshold Model. Using discrete-time models to model a diffusion
process in continuous time is computationally very expensive. The time step t
needs to be defined as a very small number. Thus the estimated parameters (e.g.,
activation probability) tend to come out as very small numbers as well, and the
diffusion process simulation can take an extremely long time to run.
Fig. 1. The average number of topic adoptions over time on our Twitter dataset, shown
separately for Hashtags and URLs. The x-axis indicates the time (days 1 to 22), while the
y-axis denotes the average number of adoptions, summed over all topics, at each time point.
from a given node follows an exponential distribution over time, which can
be observed in real-world data [12]. Figure 1 shows that the average number
of topic adoptions decreases exponentially over time. Hashtags receive more
adoptions than URLs, and the number of Hashtag adoptions decreases
more slowly. Furthermore, transition rates q are calculated and treated as the transition
probabilities (or activation probabilities) of the embedded Markov chain in
CTMP. Then the transition probability P(t) can be computed from q, given any
time t. In this paper, the nodes in the propagation sequences are the users, and the
edges connect users who refer to the same topic contiguously in time. Topics here
particularly refer to Hashtags (expressed as # followed by a word) and short URLs
(e.g., bit.ly) on Twitter, one of the most popular microblog services, launched on
July 13, 2006. Hashtags and URLs are both unique identifiers tagging distinct tweets
with certain topic labels. We regard the temporal sequences
of Hashtags and URLs as the diffusion paths along which the topics are reposted
subsequently. Although retweeting is not included in our paper as a diffusion approach, it
is implicitly considered, because retweets usually contain the same Hashtags and URLs
as the original tweets. Essentially, marketers hope that by targeting
a small set of influential users, not only can their marketing campaign messages
receive many retweets, but the influential users can also create topics about their
products and stimulate a lot of tweets about them continuously. Our experimental
results on a large-scale Twitter dataset demonstrate promising prediction performance
in estimating the number of influenced users within a given time.
The contributions of this paper can be summarized as: (1) A dynamic information
diffusion model based on the Continuous-Time Markov Process is proposed
to model and predict the influence coverage of social network users (see Section 4).
(2) The proposed model is compared with two other baseline models. In
addition, the experiment is conducted over a large-scale Twitter dataset, which
validates the capabilities of our model in the real world.
2 Related Work
Some recent research aims at studying the content in social networks; e.g.,
Petrovic et al. [16] studied the problem of deciding whether a tweet will be
retweeted, and Tsur and Rappoport [23] did content-based prediction using hashtags
provided by microblogging websites. Meanwhile, a number of works have
addressed the matter of user influence in social networks. Many of them regard
user influence as a network metric. Kwak et al. [12] found the difference
between three influence measures: number of followers, PageRank, and number
of retweets. Cha et al. [4] also compared three such measures, and discovered
that the number of retweets and the number of mentions correlate well
with each other, while the number of friends does not correlate well with the
other two measures. Their hypothesis is that the number of followers of a user may
not be a good influence measure. Weng et al. [24] regarded the central users of
each topic-sensitive subnetwork of the follower-and-followee graph as influential
users. Other work, such as [7], mined users' influence from their static network
properties derived from either their social graphs or activity graphs.
Various dynamic diffusion models have also been proposed to discover influential
users. They have been shown to outperform influence models based on static network
metrics [17,7]. A lot of work in this direction is devoted to viral marketing.
Domingos and Richardson [5,17] were the first to mine customer network value for
influence maximization in viral marketing in the data mining domain. Their
approach is a probability optimization method with hill-climbing heuristics.
Kempe et al. [11] further showed that a natural greedy strategy can achieve 63%
of the optimum for two fundamental discrete-time propagation models, the Independent
Cascade Model (IC) and the Linear Threshold Model (LT). Many diffusion models
assume the influence probabilities on the edges or the probabilities of acceptance on
the nodes are given or randomly simulated. Goyal et al. [8] proposed to mine these
probabilities by analyzing the past behavior of users. Saito et al. [19,20] extended
the IC model and the LT model to incorporate asynchronous time delay. Model parameters
including activation probabilities and continuous time delays are estimated by
Maximum Likelihood. Our proposed diffusion model is different from the models
discussed above: (1) We model the dynamic probabilities of edge diffusion and
node thresholds changing over time, rather than computing static probabilities.
(2) Our model is a Continuous-Time diffusion model instead of a discrete-time
one. Although Saito et al. also proposed Continuous-Time models, the fundamental
diffusion process of their models follows the LT and IC models. For example,
in asynchronous IC, an active node can only infect one of its neighbors in one iteration,
while our proposed model does not assume iterations, so nodes can be activated
at any time without resetting the clock in a new iteration. Moreover, the
models proposed by Saito et al. assumed only one initial active user and focused on
model parameter estimation, not much on prediction. Their experiments are evaluated
on data simulated from some real network topologies. Our proposed model
estimates the model parameters from real large-scale social network data, allows
many initial active users to influence other users asynchronously or simultaneously,
and predicts the real diffusion sizes in the future.
In addition, most influence models are basically descriptive models instead
of predictive models. Bakshy et al. [3] studied the diffusion trees of URLs on
Twitter, and trained a regression tree model on a set of user network attributes,
users' past influence, and URL content to predict users' future influence. Our work
is quite different from the work of Bakshy et al. in the following aspects: (1) They
predict users' average spreading size in the next month based on the data from
the previous month. However, the dynamic nature of word-of-mouth marketing
means that the influence coverage varies over time. Thus our work aims
at predicting the spreading size of each individual user within a specific given
date, so we can answer what the spreading size of user A is within 2 hours,
1 day, or 1 month, etc. (2) Their work is built on top of a regression model,
while our work proposes a real-time stochastic model. The input and output
of these two models are quite different. (3) Besides URL diffusion, we also
study the diffusion of Hashtags on Twitter, which usually have a longer lifetime.
The Continuous-Time Markov Process (CTMP) has shown its applicability to
various research topics, including simulation, economics, optimal control, genetics,
queues, etc. A detailed description of CTMP can be found in the
book by Norris [15]. Huang et al. [10] adopted it to model web users' visiting
patterns. Liu et al. [13] also utilized CTMP to model user web browsing patterns
for ranking web pages. Song et al. [21] employed CTMP to mine document and
movie browsing patterns for recommendation. To the best of our knowledge,
our work is the first to construct an influence diffusion model based on CTMP for
spreading coverage prediction and user influence on social networks.
Fig. 2. An example temporal influencer network (nodes A to H) built from three topic
adoption sequences, where the number between two consecutive users is the elapsed time
between their posts: iPad: E, A, B, H, D, F (10, 70, 150, 100, 700); Barack Obama:
G, B, C, A (100, 60, 30); Earthquake: E, A, D, F, G, B, H (20, 80, 100, 70, 75, 40).
There can be multiple entries in T(V_i → V_j), since user i and user j can post
the same set of topics, or one topic, at multiple times. Note that we aggregate all
topics together to form this temporal influencer network in this paper.
the time point τ. We assume that the transition probability P_ij(t) does not
depend on the actual starting time of the propagation process; thus the CTMP
is time-homogeneous:

P_ij(t) = P{X(t + τ) = j | X(τ) = i} = P{X(t) = j | X(0) = i}.    (2)
In order to estimate the diffusion size of user i given a pre-defined time window
t, we need to compute the transition probability from user i to all the others, and
then determine the number of users affected by i at the end of t. The
diffusion size of user i over t based on CTMP can be defined as

DS_{i,t} = Σ_j P_ij(t) · n_i,    (3)

where n_i is the number of times that user i occurs at time t. It can be estimated
by supposing that it increments linearly with t. However, it is impractical to estimate
the transition probability matrix P(t) for the infinitely many possible t. Thus, instead
of estimating P(t) directly, we calculate the transition rate matrix Q, and then
P(t) can be estimated from Q.
Before jumping into the details of the Q matrix and the P(t) matrix, a simple example
is presented in Figure 3 to illustrate the procedure of predicting the diffusion
size of a particular user.
Fig. 3. An example showing the application of our model over a simple dataset composed
of only three users and two topics. Based on the two topic transitions in the dataset, we
can calculate the Q matrix. Then, the P(t) matrix can be obtained with the assistance of
the Q matrix, assuming the prediction time is t. Once the P(t) matrix is ready, the
diffusion size can be predicted.
Note that q_ij reflects the change in the transition probability from user i to user
j. q_i, namely the out-user transition rate in this paper, is equal to −q_ii. It indicates
the rate at which user i propagates topics to any other users. As shown in Figure 1,
the average number of topic adoptions decreases exponentially over time.
Thus, in order to compute q_i, we assume that the time for user i to propagate a
topic to all the other users follows an exponential distribution, as observed
for many users in our data, where the rate parameter is q_i. It is known that the
expected value of an exponentially distributed random variable T_i (in this case,
the topic propagation time for user i) with rate parameter q_i is given by:

E[T_i] = 1/q_i.    (5)

Thus q_i is one divided by the mean of ∪_j T(V_i → V_j), which is defined in the
temporal influencer network.
According to the theory of the Continuous-Time Markov Process, if a propagation
occurs on user i, the probability that another user j posts the topic forms
an embedded Markov chain. The transition probability is S_ij, with Σ_j S_ij = 1
(i ≠ j) and S_ii = 0. One important property is that q_ij = q_i S_ij. Then, the
transition rate from user i to j can be estimated by:

q_ij = Σ_m q_i² exp(−q_i t^m_ij),    (6)
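A rough sketch of the resulting prediction pipeline (build Q from the temporal influencer
network, obtain P(t) via the matrix exponential, and evaluate the diffusion size of
Eq. (3)) is given below in Python (requires NumPy and SciPy). The handling of n_i, the
toy elapsed times, and the row-sum normalization of Q are our assumptions rather than
details taken from the paper.

import numpy as np
from scipy.linalg import expm

def build_Q(T, n_users):
    # Build the transition rate matrix Q from elapsed-time lists T[(i, j)] = [t^1, t^2, ...].
    Q = np.zeros((n_users, n_users))
    for i in range(n_users):
        times = [t for (a, b), ts in T.items() if a == i for t in ts]
        if not times:
            continue
        qi = 1.0 / np.mean(times)               # out-user rate, Eq. (5): E[T_i] = 1/q_i
        for (a, j), ts in T.items():
            if a == i:
                # Eq. (6): q_ij = sum_m q_i^2 * exp(-q_i * t_ij^m)
                Q[i, j] = sum(qi ** 2 * np.exp(-qi * t) for t in ts)
        Q[i, i] = -Q[i].sum()                   # assumption: rows of a rate matrix sum to zero
    return Q

def diffusion_size(Q, i, t, n_i=1.0):
    # Eq. (3): DS_{i,t} = sum_j P_ij(t) * n_i, with P(t) = expm(Q t).
    P_t = expm(Q * t)
    return P_t[i].sum() * n_i

# toy example: three users, elapsed times (e.g., minutes) between consecutive posts
T = {(0, 1): [10.0, 70.0], (1, 2): [100.0], (0, 2): [20.0]}
Q = build_Q(T, n_users=3)
print(diffusion_size(Q, i=0, t=60.0))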
5 Experiment
This paper aims at modeling, predicting, and measuring users' influence in the
social network. Hence, to validate the dynamic metric and the predictive model
IDM-CTMP, instead of focusing on identifying influential users like the previous
work, we mainly evaluate its capability of predicting how many users would
adopt a topic within a certain period of time after a user posts it. Specifically,
in our experiment, we first train the IDM-CTMP model on the first 12 days of training
data, and then calculate the spreading coverage of each user for each day from day
1 to day 22, covering both the training and testing periods.
Baselines. To the best of our knowledge, this is the first attempt to predict the
continuous-time spreading coverage of social network users. Therefore, we employ
the Autoregressive Integrated Moving Average (ARIMA) model [14], which
is widely used for fitting and forecasting time series data in statistics
and econometrics, as one baseline. This model can first be fit to time series data
(in our case, a user's spreading sizes on different days in the history) and then
predict this user's spreading size in the future. Thus, the spreading sizes of the first 12
days are used to build the ARIMA model. Then, it predicts the entire 22 days.
Note that the optimal ARIMA is always selected based on the Akaike information
criterion (AIC) and the Bayesian information criterion (BIC) for comparison.
In addition to ARIMA, one of the basic information diffusion models, Independent
Cascade (IC) [11], is used as the second baseline. In the IC model, a user
u who mentions a topic in his/her tweets at the current time step t is treated
as a newly activated user. An activated user has one chance to activate each of
his/her neighbors (i.e., make them adopt this topic) with a certain probability.
If a neighbor v posts the same topic after time step t, it is said that v becomes
active at time step t + 1. If v becomes active at time step t + 1, u cannot activate
v in the subsequent rounds.
In order to apply the IC model to calculate users' spreading sizes, the activation
probability for every pair of users needs to be estimated. Specifically, for a
user u, we first obtain the spreading size of each of his/her topics during the first
12 days, from which we get the average spreading size over all of his/her topics. Then,
the daily average spreading size (DDS) is computed by dividing the average
spreading size by 12 days. Finally, 1/DDS is taken as the activation probability
between u and each of his/her neighbors.
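The IC baseline described above can be simulated roughly as follows (a minimal Python
sketch with a hypothetical neighbor map and per-user activation probabilities; the
experimental code used in the paper may differ):

import random

def independent_cascade(neighbors, act_prob, seeds, max_steps=100):
    # Simulate IC: each newly activated user gets one chance to activate each neighbor.
    active = set(seeds)
    frontier = list(seeds)
    for _ in range(max_steps):
        if not frontier:
            break
        new_frontier = []
        for u in frontier:
            for v in neighbors.get(u, []):
                if v not in active and random.random() < act_prob[u]:
                    active.add(v)           # v becomes active at the next time step
                    new_frontier.append(v)
        frontier = new_frontier             # u cannot re-activate in later rounds
    return len(active) - len(seeds)         # spreading size, excluding the seeds

# usage: average over several runs to estimate a user's spreading size
neighbors = {"A": ["B", "C"], "B": ["C"], "C": []}
act_prob = {"A": 0.3, "B": 0.3, "C": 0.3}
runs = [independent_cascade(neighbors, act_prob, ["A"]) for _ in range(1000)]
print(sum(runs) / len(runs))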
Table 1. The comparison over the 10,000 top users
Methods     MAE    RMSE   MASE
IDM-CTMP    3.290  4.231  0.714
ARIMA       4.369  5.470  1.294
IC          5.858  7.209  2.355

Table 2. The comparison over the 10,000 random users
Methods     MAE    RMSE   MASE
IDM-CTMP    1.686  2.055  0.702
ARIMA       2.026  2.855  0.764
IC          3.928  4.834  2.091
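The error measures reported in Tables 1 and 2 can be computed with their standard
definitions, sketched below in Python; the naive one-step forecast used to scale MASE is
an assumption, since the exact scaling series is not specified here.

import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mase(y_true, y_pred, y_train):
    # scale by the mean absolute error of a one-step naive forecast on the training series
    naive_error = np.mean(np.abs(np.diff(y_train)))
    return mae(y_true, y_pred) / naive_error

y_train = np.array([3.0, 4.0, 6.0, 5.0, 7.0])     # e.g., spreading sizes on earlier days
y_true = np.array([8.0, 6.0, 7.0])                # observed sizes on later days
y_pred = np.array([7.5, 6.5, 6.0])                # model predictions
print(mae(y_true, y_pred), rmse(y_true, y_pred), mase(y_true, y_pred, y_train))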
Fig. 4. The comparison between the predicted spreading sizes of the top-ranked 5 users
(left side) and 5 randomly picked users (right side) by IDM-CTMP and the baselines
(ARIMA and IC) against the ground truth, for each day from 1 to 22.
Although the predicted sizes are not exactly the same as the ground truth, most of the
predicted curves fit very closely to the true curves. In particular, most of the peaks and
valleys are well captured by our proposed method. However, ARIMA and IC do not
perform well, missing many peaks and valleys and making wrong predictions.
In addition to the prediction capability, we also conducted experiments to compare
the running times of the three models. It turns out that IDM-CTMP spends a
similar running time to the IC model for prediction tasks, while ARIMA requires
the least running time. However, to achieve the best performance of ARIMA,
a lot of parameter tuning work needs to be done beforehand. Considering this,
ARIMA cannot be the best choice in terms of running time.
6 Conclusion
References
1. Adamic, L.A., Adar, E.: How to search a social network. Social Networks 27, 2005
(2005)
2. Anderson, W.J., James, W.: Continuous-time Markov chains: An applications-oriented
approach, vol. 7. Springer, New York (1991)
3. Bakshy, E., Hofman, J.M., Mason, W.A., Watts, D.J.: Everyone's an influencer:
quantifying influence on twitter. In: WSDM, pp. 65-74 (2011)
4. Cha, M., Haddadi, H., Benevenuto, F., Gummadi, K.P.: Measuring user influence
in twitter: The million follower fallacy. In: AAAI, ICWSM (2010)
5. Domingos, P., Richardson, M.: Mining the network value of customers. In:
SIGKDD, pp. 57-66. ACM (2001)
6. Gardiner, C.W.: Handbook of stochastic methods. Springer, Berlin (1985)
7. Ghosh, R., Lerman, K.: Predicting influential users in online social networks.
CoRR, abs/1005.4882 (2010)
8. Goyal, A., Bonchi, F., Lakshmanan, L.V.S.: Learning influence probabilities in
social networks. In: WSDM, pp. 241-250 (2010)
9. Gruhl, D., Guha, R., Liben-Nowell, D., Tomkins, A.: Information diffusion through
blogspace. In: WWW, pp. 491-501. ACM (2004)
10. Huang, Q., Yang, Q., Huang, J.Z., Ng, M.K.: Mining of web-page visiting patterns
with continuous-time markov models. In: Dai, H., Srikant, R., Zhang, C. (eds.)
PAKDD 2004. LNCS (LNAI), vol. 3056, pp. 549-558. Springer, Heidelberg (2004)
11. Kempe, D., Kleinberg, J., Tardos, E.: Maximizing the spread of influence through
a social network. In: SIGKDD, pp. 137-146 (2003)
12. Kwak, H., Lee, C., Park, H., Moon, S.: What is twitter, a social network or a news
media? In: WWW, pp. 591-600 (2010)
13. Liu, Y., Gao, B., Liu, T.Y., Zhang, Y., Ma, Z., He, S., Li, H.: Browserank: letting
web users vote for page importance. In: SIGIR, pp. 451-458 (2008)
14. Mills, T.C.: Time series techniques for economists. Cambridge Univ. Pr. (1991)
15. Norris, J.R.: Markov chains. Cambridge University Press (1998)
16. Petrovic, S., Osborne, M., Lavrenko, V.: RT to win! Predicting message propagation
in twitter. In: 5th ICWSM (2011)
17. Richardson, M., Domingos, P.: Mining knowledge-sharing sites for viral marketing.
In: SIGKDD, pp. 61-70 (2002)
18. Rogers, E.M.: Diffusion of Innovations, vol. 27. Free Press (2003)
19. Saito, K., Kimura, M., Ohara, K., Motoda, H.: Efficient estimation of cumulative
influence for multiple activation information diffusion model with continuous time
delay. In: Zhang, B.-T., Orgun, M.A. (eds.) PRICAI 2010. LNCS, vol. 6230, pp.
244-255. Springer, Heidelberg (2010)
20. Saito, K., Kimura, M., Ohara, K., Motoda, H.: Generative models of information
diffusion with asynchronous time delay. JMLR - Proceedings Track 13, 193-208
(2010)
21. Song, X., Chi, Y., Hino, K., Tseng, B.L.: Information flow modeling based on
diffusion rate for prediction and ranking. In: WWW, pp. 191-200 (2007)
22. Song, X., Tseng, B.L., Lin, C.-Y., Sun, M.-T.: Personalized recommendation driven
by information flow. In: SIGIR, pp. 509-516 (2006)
23. Tsur, O., Rappoport, A.: What's in a hashtag?: content based prediction of the
spread of ideas in microblogging communities. In: Proceedings of the Fifth ACM
International Conference on Web Search and Data Mining. ACM (2012)
24. Weng, J., Lim, E.P., Jiang, J., He, Q.: Twitterrank: finding topic-sensitive influential
twitterers. In: WSDM, pp. 261-270 (2010)
Credibility-Based Twitter Social Network Analysis
1 Introduction
The immediacy of social network services makes them an excellent host for the various
social networks formed at times of crisis. Examining complex Twitter communication can
be challenging; however, it is known that in Social Networks (SNs) some members have
more influence over other members; they are known as leaders or pioneers [6, 13].
Finding a credible member within an SN is a challenging issue; therefore, the main
objective of this paper is to utilize social network information to identify members'
roles as Leader or Follower in the Twitter Social Network Service (SNS), based on their
credibility, and to assess the impact of Leaders in crisis situations.
Credibility refers to the objective and subjective components of the believability of
an agent. A credible SN member is defined as one who has performed consistently and
accurately, and whose generated content has been useful over a period of time in a
specific context. The credibility of an agent refers to its quality of being believable or
trustworthy and can be measured by its trustworthiness, expertise, and dynamism [8].
In SNSs, members interact in varied contexts. We consider the Victorian bushfires [11]
as the context of the interaction between Twitter members; the worst day, known as
Black Saturday, saw 173 people die.
2 Related Works
(1)
(2)
(3)
where W = (0.4, 0.1, 0.1, 0.1) represents the respective weight of each component's impact
on tweet quality, and the result is finally normalized to the maximum in the context.
Notably, the @ parameter does not appear in new messages, so we need to account for
this when computing new tweet quality by scaling new tweets' quality by 7/6.
item based on its frequency in the context as the standard deviation of the keyword
set. (4) Compute the quality of a tweet message as shown in Equation (3).
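Since Equations (1)-(3) are not reproduced above, the following Python fragment is only a
schematic sketch of the kind of weighted quality score they describe, with hypothetical
component scores; the component names and the exact combination are assumptions, apart
from the weights W and the 7/6 scaling mentioned in the text.

W = (0.4, 0.1, 0.1, 0.1)

def tweet_quality(components, is_new=False):
    # Weighted sum of (assumed, already normalized) component scores; new tweets are
    # scaled by 7/6 because the @ component does not appear in new messages.
    q = sum(w * c for w, c in zip(W, components))
    return q * 7.0 / 6.0 if is_new else q

def normalize_to_context(qualities):
    # Normalize scores to the maximum quality observed in the context.
    q_max = max(qualities) or 1.0
    return [q / q_max for q in qualities]

scores = [tweet_quality(c, is_new=n) for c, n in [((0.8, 0.2, 0.1, 0.5), True),
                                                  ((0.3, 0.9, 0.4, 0.2), False)]]
print(normalize_to_context(scores))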
members. This is because retweets reinforce a message and as such are considered
important, and hence they are passed on down the network.
From Table 1, for active members, we note the following observations: (1) 59.4% of
the tweets are new, of which 15.5% are retweeted again; furthermore, 25.1% of the
tweets are mentions or replies. This indicates that new tweets are a vital source of
information distribution. (2) Leaders, who represent 20% of the population in the
active set, generate 59.3% of all tweets and retweet 62% of all retweet posts, which to
a certain extent reflects a power-law distribution. (3) As shown in the (RT 2 Me)
column, 84.2% of the retweets in the communication are generated by members
addressing leaders' content, i.e., content that leaders originally generated as new
messages. On the other hand, 15.8% of the retweets are generated by members
addressing followers' content. (4) 67% of followers follow leaders, while 33% of
followers follow other members in the SNS. (5) The leaders' community possesses 72.8%
of indirect followers; on the other hand, 27.2% of indirect followers follow other
followers, and those appear in the fourth level of the Follow the Leader hierarchy.
(6) Followers reply to tweets more than leaders; they generate 53.1% of reply
messages. Leaders usually create new messages and pass on (retweet) messages more
than followers. (7) The leaders' community possesses 63.7% of the SNS credibility,
while the followers' community possesses 37.3%; this tends to align with the standard
power-law distribution.
proxy for power and influence; second, we conducted quality and spread analysis on
the new tweets and compared leaders' and followers' influence.
Table 2.a presents the centrality measures (i.e., in-degree and out-degree) for
leaders, followers, and all members in the dataset. The in-degree centrality of a
particular node is defined as the number of in-links (trust statements this node
received); it represents the number of retweets and reply tweets addressing that
member. The out-degree centrality of a node is defined as the number of out-links
(trust statements issued from this node); it represents the number of retweets and
reply tweets issued by that member to other members in the SNS. Table 2.a shows that
the leaders' in-degree (3.87) is about 8 times the followers' in-degree (0.48).
Furthermore, the leaders' out-degree (5.39) is about 4 times the followers' out-degree (1.21).
Since leaders possess both higher average in-degree and higher average out-degree
centrality than followers and than the average member in the SNS, leaders are the
most prominent and influential members in the SNS. This outcome is drawn from their
trustworthiness and expertise; leaders possess the highest trustworthiness and
expertise level among all members in the context.
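As an illustration of these centrality measures, the sketch below counts in-degree (retweets and replies received) and out-degree (retweets and replies issued) from a made-up set of directed member-to-member edges; it is not the authors' tool, only a restatement of the definitions above.

from collections import Counter

# Each retweet or reply addressing another member is a directed edge (source -> target).
edges = [("alice", "leader1"), ("bob", "leader1"), ("carol", "bob"),
         ("leader1", "alice"), ("dave", "leader1")]

in_degree = Counter(target for _, target in edges)    # retweets/replies received
out_degree = Counter(source for source, _ in edges)   # retweets/replies issued

members = {m for edge in edges for m in edge}
avg_in = sum(in_degree[m] for m in members) / len(members)
avg_out = sum(out_degree[m] for m in members) / len(members)
print(dict(in_degree), dict(out_degree), avg_in, avg_out)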
1. Quality Analysis Results: from Table 2.b, we note that: (1) High-quality content
represents 16.75% of all new tweet content, of which 41.6% is generated by leaders.
(2) As shown in column 3 (LdrsContent%), the percentage of new posting content
generated by leaders in each category increases with the quality of the new content.
This indicates that leaders are capable of generating the most credible content.
(3) As shown in column 4 (Retweeted%), retweeting increases with the content quality;
moreover, multiple retweets correspond to quality above the average.
2. Spread/Diffusion Time Analysis Based on Quality Category Results: from
Table 2.c, we note that the elapsed time between the posting of new content and its
first retweet, shown in the (1st Retweet Avg Time (Min)) column, decreases as the
quality of the new content increases. For the High quality category, the elapsed time
from new content posting to the first retweet is 8.5 minutes, with an overall
elapsed time for all retweeting of 195.0 minutes. One notable phenomenon is that
tweets which carry bad news spread faster than other tweets.
5 Results Summary
In a Twitter SNS, members' behaviour is the determinant of their credibility; different
members undertake varied activities and possess different interests. In an SN, the
credibility of a member acts as the predictor of their role. Leaders are the most
prominent and influential members in the Twitter SNS. They generate the highest-quality
content, which is adopted and retweeted by other members.
SN leaders are aware of and become involved in the crisis early; they produce
information-laden content and maintain high activity at all times. Thus, their postings
are an important resource for information management and decision making.
In a Twitter SNS, context leaders differ from Twitter leaders. In each context, there
is a set of leaders who are considered experts in that context; thus, it is difficult to
find the same leaders across all contexts, simply because each Twitter member has varied
interests and expertise in varied contexts.
References
1. Agichtein, E., Castillo, C., Donato, D., Gionis, A., Mishne, G.: Finding high-quality content in social media. In: WSDM 2008, pp. 183–194. ACM (2008)
2. Al-Sharawneh, J., Williams, M.-A.: Credibility-based Social Network Recommendation: Follow the Leader. In: ACIS 2010 Proceedings, Paper 24, pp. 1–10 (2010), http://aisel.aisnet.org/acis2010/24
3. Bruns, A., Burgess, J.E., Crawford, K., Shaw, F.: #qldfloods and @QPSMedia: Crisis Communication on Twitter in the 2011 South East Queensland Floods (2012)
4. Burt, R.S.: Brokerage and closure: An introduction to social capital. Oxford University Press, USA (2005)
5. Freeman, L.: Centrality in social networks: conceptual clarification. Social Networks 1(3), 215–239 (1979)
6. Goldbaum, D.: Follow the Leader: Simulations on a Dynamic Social Network. UTS Finance and Economics Working Paper No. 155 (2008), http://www.business.uts.edu.au/finance/research/wpapers/wp155.pdf
7. Hanneman, R., Riddle, M.: Introduction to social network methods. University of California, Riverside (2005)
8. Kouzes, J.M., Posner, B.Z.: Credibility: How Leaders Gain and Lose It, Why People Demand It, Revised Edition. Jossey-Bass, San Francisco (2003)
9. Kwon, K., Cho, J., Park, Y.: Multidimensional credibility model for neighbor selection in collaborative recommendation. Expert Systems with Applications 36(3), 7114–7122 (2009)
10. Liu, G., Wang, Y., Orgun, M.: Trust Inference in Complex Trust-Oriented Social Networks. In: Proceedings of the International Conference on Computational Science and Engineering, Vancouver, Canada, vol. 4, pp. 996–1001. IEEE (2009)
11. Sinnappan, S., Farrell, C., Stewart, E.: Priceless tweets! A study on Twitter messages posted during crisis: Black Saturday. In: ACIS 2010 Proceedings, Paper 39 (2010)
12. Starbird, K., Palen, L., Hughes, A.L., Vieweg, S.: Chatter on the red: what hazards threat reveals about the social life of microblogged information. In: Proceedings of the ACM 2010 Conference on Computer Supported Cooperative Work (CSCW 2010), pp. 241–250 (2010)
13. Xu, Y., Ma, J., Sun, Y., Hao, J., Zhao, Y.: Using Social Network Analysis as a Strategy for e-Commerce Recommendation. In: PACIS 2009 Proceedings, p. 106 (2009)
14. Ziegler, C.N., Golbeck, J.: Investigating interactions of trust and interest similarity. Decision Support Systems 43(2), 460–475 (2007)
Design and Evaluation of Access Control Model
Based on Classification of Users Network Behaviors
Peipeng Liu1,2,3,4, Jinqiao Shi3,4, Fei Xu3,4, Lihong Wang, and Li Guo3,4
1 Institute of Computing Technology, CAS, China
2 Graduate University, CAS, China
3 Institute of Information Engineering, CAS, China
4 National Engineering Laboratory for Information Security Technologies, IIE, China
[email protected]
Abstract. Nowadays, the rapid development of the Internet brings great convenience
to people's daily life and work, but the existence of pornography information
seriously endangers the health of youngsters. Research on network access control
models, aiming at preventing youngsters from accessing pornography information,
has therefore become a hotspot. However, existing models usually don't distinguish
between adults and youngsters, and they block all accesses to pornography
information, which does not meet the actual demand. Preventing youngsters from
accessing pornography information, while not affecting the freedom of expression of
adults, has thus become a reasonable yet challenging alternative. In this paper, an
access control model based on the classification of users' network behaviors is
proposed; this model can effectively decide whether to allow one's access to
pornography information according to his or her age. Besides, an evaluation
framework is also proposed, through which the model's controllability,
cost-effectiveness and users' accessibility are analyzed respectively.
1 Introduction
With the rapid development of the Internet, more and more network services are pro-
vided for peoples daily life and work, but the Internet is also becoming flooded with
services involving pornography, which seriously endangers youngsters physiological
and mental health [8]. Therefore, researches on access control model aiming at pre-
venting youngsters from accessing pornography information have attracted more and
more attentions.
This work is supported by the National Natural Science Foundation of China (Grant No. 61100174 and No. 61272500), the National High Technology Research and Development Program of China, 863 Program (Grant No. 2011AA010701 and No. 2011AA01A103), and the National Key Technology R&D Program (Grant No. 2012BAH37B04).
Many technical models for network access control have been proposed [1, 7].
However, it has been shown that most of these control models work in an assertive
manner by which, once the content or resource is identified as pornography, the model
refuses all accesses to it. Obviously, this does not meet the actual demand. So it is
necessary to develop a systematic model that can effectively prevent youngsters'
pornography access, while not affecting the freedom of expression of adults.
In this paper, an access control model based on the classification of users' network
behaviors is proposed; this model first judges whether the user is a youngster or not,
and then, based on the judgment, determines whether to allow the access. This paper
also proposes an evaluation framework, through which the model's controllability,
users' accessibility and cost-effectiveness are analyzed respectively.
The rest of the paper is organized as follows. Section 2 discusses existing
works related to access control models and their evaluation. In Section 3, the proposed
access control model is described in detail, and Section 4 presents the evaluation
framework. An example is given in Section 5. Finally, we summarize the paper and
give a conclusion in Section 6.
2 Related Works
In [1], the authors proposed the control model (ICCON) for ICS (Internet content
security) on the Internet. Although it involves the communicators' identities, it mainly
takes coarse-grained properties, such as IP address and URL, into consideration,
excluding the users' fine-grained properties, such as age. As a result, it cannot
distinguish between adults and youngsters.
Another kind of popular access control model is based on the resource: once
a resource on the Internet is identified as pornography, all access to it is
refused; otherwise, the resource is open to everyone.
As for evaluation, previous researches, such as [2, 6, 9, 10], usually focus on
controllability from a technical point of view and lack assessment of the users'
experience and the model's cost-effectiveness.
All these works have played important roles in preventing the spread of pornography
information on the Internet and in evaluating the existing models, but they are
sometimes not comprehensive enough. So it is necessary to develop a more
comprehensive access control model.
Framework
At first, we give the framework for the access control model based on classification of
users' network behaviors, as shown in Fig. 1.
Fig. 1. Framework for the access control model based on classification of users' network behaviors
The framework works as follows. At first, the model learns users' surfing habits
and predicts the users' ages depending on the learned knowledge; secondly, when
an access action is captured, the model analyzes it together with the prediction results
and the control strategies. At last, based on the analysis, the model makes a decision
whether to allow the access action.
In Fig. 2, the learning module is in a training process, through which we can get
the probability distribution of ages over webpages. Based on the results of the learning
module, the prediction module takes the user's IP and historical browsing behaviors as
inputs, and gives the prediction results in the form of (IP, Age).
In Fig. 3, extraction is applied to the network stream to get the (IP, Webpage) pairs;
then, together with the prediction results and the control rules, namely (IP, Age) and
(Age, Webpage), they are supplied to the analysis module as inputs. The analysis module
analyzes all the inputs comprehensively and at last decides the control rules'
satisfaction, i.e., whether the access action should be allowed or not.
The analysis process can be denoted by a tuple ⟨L_u, L_p, P, T, Δ, D, Auth⟩,
where L_u is the label of a requester; L_p is the label of the webpage requested; P is the
prediction result according to L_u, existing in a form like (L_i, Age_i), where
Age_i is the predicted age for L_i; T is a blacklist of control strategies, existing in a
form like {(Age_1, Webpage_1), (Age_2, Webpage_2), ..., (Age_n, Webpage_n)}, where each
pair (Age_i, Webpage_i) means users at Age_i should not be allowed to access
Webpage_i; Δ is a set of non-negative real numbers; D : (L_u, L_p) × P × T → Δ is a
distance function, and the detailed calculation method for the distance varies with
different applications. The smaller the distance is, the more likely it is that the user L_u
should not be allowed to access the service L_p. Auth : Δ → [0, 1] is an authentication
function, which is employed to compute the probability of the response. The input of the
function is the distance calculated above. The greater Auth is, the greater the probability
with which a response should be made.
Response
At last, in the stage of response, the active response is chosen for the proposed model.
Active response refers to taking approaches such as blocking or cutting off the
connection to refuse the access request.
4 Evaluation
An evaluation framework is also proposed to quantitatively evaluate the proposed
control model from several aspects, including controllability, accessibility and cost-
effectiveness.
4.1 Controllability
We first give the formal definition of controllability. According to the Internet
information access control model, we take extract, analyze, match and control as
atomic operations, which denote the processes of extracting information from the Internet,
analyzing information, matching information with control rules, and controlling
specific information or access actions, respectively.
That is, controllability is defined to cover two cases: information extracted
from the Internet does not match the control rules, or information that matches the
control rules is indeed controlled.
So controllability can be expressed using the P_i in Fig. 4 as
\[
P_{con} = P_e \cdot P_{tn} + P_e \cdot P_{tp} \cdot P_{res}.
\]
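For illustration, the controllability measure can be evaluated directly once the component probabilities are known; the values below are assumed example numbers, not results from the paper.

# Illustrative evaluation of P_con = P_e*P_tn + P_e*P_tp*P_res with assumed probabilities:
# P_e (extraction), P_tn (mismatching the rules), P_tp (matching the rules), P_res (response).
p_e, p_tn, p_tp, p_res = 0.95, 0.90, 0.10, 0.99
p_con = p_e * p_tn + p_e * p_tp * p_res
print(round(p_con, 4))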
4.2 Accessibility
Accessibility is proposed from a user's point of view, namely, the impact brought to
users by the control model. Accessibility is divided into two types according to users'
different access requests. For a pornography request, the accessibility is expressed as
follows.
Formula ( ) expresses the probability that a user's request which actually is
pornography is allowed.
For non-pornography requests, the accessibility is expressed as follows.
Formula ( ) expresses the probability that a user's request which actually is not
pornography is allowed.
For accessibility, clearly, the aim of the control model is to increase
P_non-pornography while decreasing P_pornography.
4.3 Cost-Effectiveness
In order to achieve the best performance while decreasing the costs as much as possible,
an evaluation method based on linear programming is proposed. Here, we suppose it
costs W_i to increase the probability P_i in Fig. 4 by one percentage point. For each
component in the system, the ceiling of its cost is limited, and we suppose each
P_i can cost up to C_i. Then the problem can be described as follows.
5 Example
\[
\delta = D((P, A), T) =
\begin{cases}
0, & \text{if } (P, A) \text{ is in } T,\\
1, & \text{otherwise},
\end{cases}
\qquad (4)
\]
\[
Auth(\delta) =
\begin{cases}
1, & \text{if } \delta = 0,\\
\text{undefined}, & \text{if } \delta = 1.
\end{cases}
\qquad (5)
\]
The authentication function means that if the distance between the access action and
the blacklist is 0, then the access should be responded to with probability 1; otherwise,
the authentication is undefined.
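The following is a small sketch of Equations (4)-(5): a blacklist lookup yields the distance, and the authentication function responds (i.e., blocks) with probability 1 only at distance 0. The blacklist contents and the age threshold are hypothetical.

# Sketch of Equations (4)-(5): distance to the blacklist and the authentication function.
def distance(predicted_age, webpage, blacklist):
    """D((P, A), T): 0 if the (age, webpage) pair is blacklisted, 1 otherwise."""
    return 0 if (predicted_age, webpage) in blacklist else 1

def auth(delta):
    """Auth(delta): respond (block) with probability 1 when delta == 0, else undefined."""
    return 1.0 if delta == 0 else None  # None stands for 'undefined'

# Hypothetical blacklist: users predicted younger than 18 may not access 'adult.example'.
T = {(age, "adult.example") for age in range(0, 18)}
delta = distance(15, "adult.example", T)
print(delta, auth(delta))  # 0, 1.0 -> the access action is refused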
References
1. Fang, B., Guo, Y., Zhou, Y.: Information content security on the Internet: the control model and its evaluation. Science China Information Sciences (2010)
2. Hongli, Z.: A Survey on Internet Measurement and Analysis. Journal of Software (2003)
3. Fortuna, B., Mladenic, D., Grobelnik, M.: User Modeling Combining Access Logs, Page Content and Semantics. In: 1st International Workshop on Usage Analysis and the Web of Data (USEWOD 2011) at the 20th International World Wide Web Conference (WWW 2011), Hyderabad, India, March 28 (2011)
4. Hu, J., Zeng, H.-J., Li, H., Niu, C., Chen, Z.: Demographic prediction based on user's browsing behavior. In: Proceedings of the 16th International Conference on World Wide Web, Banff, Alberta, Canada, pp. 151–160. ACM (2007)
5. Lihua, Y., Binxing, F.: Research on controllability of internet information content security. Networked Digital Technologies (2009)
6. Liu, Y.-Y.: Controllability of complex networks. Nature 473, 167–173 (2011)
7. Park, J., Sandhu, R.: The UCON_ABC usage control model. ACM Transactions on Information and System Security 7(1), 128–174 (2004)
8. Peter, R.W.: There is a need to regulate indecency on the internet. 6 Cornell J.L. & Pub. Pol'y 363 (1997)
9. Wiedenbeck, S., Waters, J., et al.: PassPoints: Design and longitudinal evaluation of a graphical password system. International Journal of Human-Computer Studies 63(1-2), 102–127 (2005)
10. Wei, J.: Evaluating Network Security and Optimal Active Defense Based on Attack-Defense Game Model. Chinese Journal of Computers (2008)
Two Phase Extraction Method
for Extracting Real Life Tweets Using LDA
1 Introduction
Many information sharing services currently exist, such as community knowledge
sharing sites, blogs, and microblogs. Twitter is one of the most widely spread
microblogs. Articles in Twitter are posted very easily within 140 characters.
Users post their experiences, opinions, and other events that arise in their daily
life.
Most articles in Twitter are both useful and timely because they are written
based on current events. For example, commuter train delay information posted
by a passenger is timely and useful for waiting passengers. Supermarket sales and
bargain information is also useful for neighborhood consumers. Such information
has high regionality and freshness. Thus, we call such posts Real Life Tweets.
Extracting real life tweets from a sea of tweets is quite an important research
issue for supporting life activities.
The Great East Japan Earthquake Disaster, which occurred in March of 2011,
is a perfect example of the benefits of real life tweets. There was a great amount
of confusion in the stricken area immediately following the earthquake. There
was a lack of food, suspension of water supply, and train service cancellations.
At that time, useful tweets reported the location of water supplies and food
distribution, as well as the service status of trains, demonstrating that such real
life tweets helped the users in the devastated region[1].
As mentioned above, highly useful real life tweets are increasingly posted on
Twitter. However, users post various types of tweets. Nods and sympathetic
phrases frequently appear on Twitter; for example, "Thank you" and "I see"
often appear in posts. These posts do not directly support the real life situations
of other users. We believe that users want a method of locating beneficial tweets
on Twitter. These types of nods and sympathies simply impede the discovery of
substantive tweets.
Information is used in various aspects of life. Real life tweets can accommodate
such aspects. For example, tweets such as "The train is not coming!" are
categorized in the Traffic aspect and will support users who want to ride the
train. Posts such as "Today, bargain sale items are 50% off!" are categorized
in the Expense aspect and will support users who are going shopping. In our
previous research, we detected 14 aspects of real life [2]. These 14 aspects, listed
in Table 1, are obtained from Local Community1 and Life2 in the Japanese
version of Wikipedia.
In this paper, we propose a two phase extraction method for extracting real life
tweets. In the first phase, many topics are extracted from a sea of tweets using
Latent Dirichlet Allocation (LDA). In the second phase, associations between
many topics and fewer aspects, shown in Table 1, are constructed using the
weight of feature terms calculated with information gain.
2 Related Works
Real life tweets consist of both the experiences and knowledge of users and re-
gional information. Several studies on experience mining have been conducted
1 http://ja.wikipedia.org/wiki/nR~ jeB
2 http://ja.wikipedia.org/wiki/
to extract experiences from documents. Kurashima et al. [3] reported that human
experience can be divided into five areas: time, space, action, object, and feeling.
Inui et al. [4] described a method of indexing personal experience information
from the viewpoint of time, polarity, and speaker modality. This information is
indexed as Topic object, Experiencer, Event expression, Event type, and Factuality.
These mining methods are effective for relatively long documents such as
blogs. Hence, these methods are not appropriate for Twitter posts, which consist
of many short sentences. In addition, experience mining would be much more
difficult because subjects and objects in sentences are often omitted in Twitter.
The study of Twitter is flourishing. Ramage et al. [5] used large scale topic
models to represent Twitter feeds and users, showing improved performance
on tasks, such as post and user recommendations. Bollen et al. [6] analysed
sentiment on Twitter according to a six-dimensional mood (tension, depression,
anger, vigor, fatigue, and confusion) representation, determining that sentiment
on Twitter correlates with real-world values such as stock prices and coincides
with cultural events. Diakopoulous and Shamma [7] conducted inspirational work
in this vein, demonstrating the use of timeline analytics to explore the 2008
Presidential debates through Twitter sentiment. Sakaki et al. [8] assumed that
Twitter users act as sensors, discovering an event occurring in real time in the
real world. Zhao et al. [9] suggested a model, called Twitter-LDA, based on
the hypothesis that one tweet expresses one content of a topic. They classified
tweets into appropriate topics and extracted keywords to express the contents of
the topic. Mathioudakis et al. [10] extracted burst keywords in tweets collected
automatically. They found a trend fluctuating in real time by creating groups
using the co-occurrence of keywords.
Our paper contends that the information is not only based on user experience,
but also user knowledge, which we believe to be useful in real life.
Real life tweets contain the various aspects mentioned in Section 1. Therefore,
it is difficult to enumerate keywords related to all aspects. Moreover, the
rule-based parsing approach used in experience mining does not work well because
Twitter posts consist of very short sentences.
We propose a two phase extraction method, shown in Figure 1. In the first
phase, a large number of topics are extracted from a sea of tweets using LDA.
LDA is an unsupervised learning model for clustering large amounts of docu-
ments [11]. In the second phase, we construct an association between the topics
and aspects.
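As a rough illustration of the first phase, the sketch below clusters a toy corpus of tokenized tweets into topics with LDA. The gensim library is used here only as an example choice, and the corpus, topic count, and other settings are assumptions, not the authors' configuration.

# Sketch of the first phase: extract topics from tokenized tweets with LDA (gensim).
from gensim import corpora, models

texts = [["train", "delay", "station", "late"],
         ["sale", "shop", "discount", "bargain"],
         ["train", "late", "platform", "delay"]]
dictionary = corpora.Dictionary(texts)                  # vocabulary of the toy corpus
corpus = [dictionary.doc2bow(t) for t in texts]         # bag-of-words representation
lda = models.LdaModel(corpus=corpus, id2word=dictionary,
                      num_topics=2, passes=10, random_state=0)
for topic_id, words in lda.show_topics(num_topics=2, num_words=3, formatted=False):
    print(topic_id, [w for w, _ in words])              # top words per extracted topic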
provides feature choice and represents a good feature. The information gain of
term w is calculated as in Equation (1). The relevance R(ap, k) between an aspect
ap and a topic k is then computed and normalized as
\[
R(ap, k) = \frac{1}{\mathrm{length}(W_{ap})} \sum_{w \in W_{ap}} IG(w)\, p_{w,k}, \qquad (2)
\]
\[
\bar{R}(ap, k) = \frac{R(ap, k)}{\sum_{k' \in T} R(ap, k')}, \qquad (3)
\]
and the threshold for aspect ap is set to
\[
\theta(ap) = \max_{k \in T} R(ap, k) - std(ap) \cdot d, \qquad (4)
\]
344 S. Yamamoto and T. Satoh
where std(ap) denotes the standard deviation of R(ap, k) over all topics. Parameter
d is varied for maximum extraction precision in Section 4.
Here, W denotes the set of terms extracted from an unknown tweet and p_{w,k}
denotes the occurrence probability of term w in topic k.
The aspect selected for each unknown tweet is then determined by Equation (6).
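The sketch below restates the second phase under stated assumptions: the relevance of Equations (2)-(3) and the threshold of Equation (4) are implemented directly, while the topic word distributions phi, the information-gain values, and the toy aspect terms are made up for illustration (Equations (5)-(6) for scoring an unknown tweet are not reproduced here).

import math

def relevance(aspect_terms, ig, phi, topics):
    """Eq. (2)-(3): IG-weighted topic probabilities of the aspect's labeled terms, normalized."""
    raw = {k: sum(ig.get(w, 0.0) * phi[k].get(w, 0.0) for w in aspect_terms) / len(aspect_terms)
           for k in topics}
    z = sum(raw.values()) or 1.0
    return {k: v / z for k, v in raw.items()}

def threshold(r, d):
    """Eq. (4): max_k R(ap, k) - std(ap) * d."""
    vals = list(r.values())
    mean = sum(vals) / len(vals)
    std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
    return max(vals) - std * d

# Toy data: phi[k][w] is topic k's word distribution; ig[w] is the information gain of w.
topics = ["t1", "t2"]
phi = {"t1": {"train": 0.3, "delay": 0.2}, "t2": {"sale": 0.4, "shop": 0.3}}
ig = {"train": 0.8, "delay": 0.7, "sale": 0.6, "shop": 0.5}
r_traffic = relevance(["train", "delay"], ig, phi, topics)
print(r_traffic, threshold(r_traffic, d=1.0))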
4 Experimental Evaluations
We evaluate the precision of the aspect selected by Equation (6). After that,
we clarify why there is a difference between high precision and lower precision.
To this end, we carefully observe the association between topics and aspects
while varying the threshold d.
5 Discussion
According to Figure 2, Living, Eating, Disaster, and Expense achieved the high-
est precisions. These four aspects are associated with fewer topics, as shown in
Figure 3, even if parameter d was increased. Association details between each
aspect and topics are shown in Figure 4. These ten topics in the figure have
a larger sum of relevances R from topics to aspects. In this figure, we confirm
that these four aspects have strong relevance to the specific topics, Living to
Topic418, Disaster to Topic144, and so on, respectively.
The worst cases, Event and Contact, have lower precision values, as shown in
Figure 2. These aspects were associated with many topics even if d was small. In
particular, Contact is associated with the same topics, Topic297, Topic383, and
Topic479, as Eating; moreover, its values of R are smaller than those of Eating. Event
is associated with the same topics, Topic465 and Topic320, as Locality.
6 Conclusion
In this paper, we propose a two phase extraction method for extracting real life
tweets. In the first phase, many topics are extracted from a sea of tweets using
LDA. In the second phase, associations between many topics and fewer aspects
are constructed using a small set of labeled tweets. To enhance accuracy, the
weight of the feature terms is calculated with information gain.
Based on the experimental evaluation results, our prototype system demon-
strates that our proposed method can extract aspects of each unknown tweet.
We confirm that high precision aspects are associated with fewer topics that are
similar to the aspects. However, low precision aspects are associated with many
topics. In this case, many topics are associated with many aspects. In the future,
we should consider evaluating the KL Divergence from aspect to aspect in order
to enhance separation among the aspects.
References
1. Yamamoto, M., Ogasawara, H., Suzuki, I., Furukawa, M.: Tourism informatics: 9. Information propagation network for 2012 Tohoku earthquake and tsunami on Twitter. IPSJ Magazine 53(11), 1184–1191 (2012) (in Japanese)
2. Yamamoto, S., Satoh, T.: Real life information extraction method from Twitter. In: The 4th Forum on Data Engineering and Information Management (DEIM 2012), F3-4 (2012) (in Japanese)
3. Kurashima, T., Tezuka, T., Tanaka, K.: Blog map of experiences: Extracting and geographically mapping visitor experiences from urban blogs. In: Ngu, A.H.H., Kitsuregawa, M., Neuhold, E.J., Chung, J.-Y., Sheng, Q.Z. (eds.) WISE 2005. LNCS, vol. 3806, pp. 496–503. Springer, Heidelberg (2005)
4. Inui, K., Abe, S., Morita, H., Eguchi, M., Sumida, A., Sao, C., Hara, K., Murakami, K., Matsuyoshi, S.: Experience mining: Building a large-scale database of personal experiences and opinions from web documents. In: Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence, pp. 314–321 (2008)
5. Ramage, D., Dumais, S., Liebling, D.: Characterizing microblogs with topic models. In: Proceedings of ICWSM 2010, pp. 130–137 (2010)
6. Bollen, J., Pepe, A., Mao, H.: Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena. In: Proceedings of WWW 2010, pp. 450–453 (2010)
7. Diakopoulous, N.A., Shamma, D.A.: Characterizing debate performance via aggregated twitter sentiment. In: Proceedings of CHI 2010, pp. 1195–1198 (2010)
8. Sakaki, T., Okazaki, M., Matsuo, Y.: Earthquake shakes twitter users: Real-time event detection by social sensors. In: Proceedings of the 18th International World Wide Web Conference, WWW 2010, pp. 851–860 (2010)
9. Zhao, X., Jiang, J., He, J., Song, Y., Achananuparp, P., Lim, E.P., Li, X.: Topical key phrase extraction from twitter. In: The 49th Annual Meeting of the Association for Computational Linguistics, pp. 379–388 (2011)
10. Mathioudakis, M., Koudas, N.: TwitterMonitor: trend detection over the twitter stream. In: Proceedings of the 2010 International Conference on Management of Data, pp. 1155–1158 (2010)
11. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. The Journal of Machine Learning Research 3, 993–1022 (2003)
12. Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proceedings of the National Academy of Sciences 101, 5228–5235 (2004)
A Probabilistic Model
for Diversifying Recommendation Lists
1 Introduction
Many online services, such as Amazon1 , Netflix2 , and YouTube3 , employ rec-
ommender systems because they are expected to improve user convenience and
service provider profit [6, 10, 12, 18]. This is promoting a lot of research on
recommendation methods [1]. Many conventional recommendation methods are
accuracy-oriented, i.e., their goal is to accurately predict items that the user will
purchase in the future.
These accuracy-oriented algorithms, however, sometimes degrade user satis-
faction. The recommendation results are often presented to users in the form of
a multiple item list, not one single item. The recommended items in a list gen-
erated by an accuracy-oriented method often have similar content. According to
[21], since such a list reflects only partial aspects of the user's interests, it cannot
necessarily satisfy the user.
1 Amazon - Online shopping, http://www.amazon.com
2 Netflix - Online movie rentals, http://www.netflix.com
3 YouTube - Online video service, http://www.youtube.com
2 Proposed Method
where i_l represents the l-th item in the list, and y_l is a binary variable indicating
whether the l-th item is purchased (y_l = 1 if purchased, y_l = 0 if not). On the
other hand, the objective function of a conventional accuracy-oriented method
is set individually for each item in the list without considering a purchase from
the whole list [2].
Here, we assume that the target user purchases at most one item from the
list. This assumption is based on the behavior of many actual recommendation
services, such as Amazon.com. On these services, a user cannot purchase
multiple items from a list, since the screen transitions when the user selects
(clicks a link to) an item from a recommendation list.
We use a greedy algorithm because exact maximization of the objective
function expressed by Eq. (1) is a combinatorial optimization problem,
which is difficult in general. Since many users decide whether to purchase each
item in the list in order of rank, applying the greedy approach seems reasonable.
The first item can be easily obtained by maximizing P (y1 = 1|u, i1 ) as is
done in conventional methods. Having chosen the first item i1 , we determine the
second item i_2 so as to maximize P(y_1 = 1 ∨ y_2 = 1 | u, i_1, i_2). For simplicity, we
expand P(y_1 = 1 ∨ y_2 = 1 | u, i_1, i_2) as
\[
\begin{aligned}
P(y_1 = 1 \lor y_2 = 1 \mid u, i_1, i_2)
&= P(y_1 = 1 \mid u, i_1, i_2) + P(y_1 = 0 \land y_2 = 1 \mid u, i_1, i_2)\\
&\approx P(y_1 = 1 \mid u, i_1) + P(y_2 = 1 \mid u, i_1, i_2, y_1 = 0)\, P(y_1 = 0 \mid u, i_1).
\end{aligned}
\]
Since the first term and P(y_1 = 0 | u, i_1) do not depend on i_2, maximizing this
expression amounts to maximizing P(y_2 = 1 | u, i_2, y_1 = 0, i_1).
Recursively, we can express the probability that the user u selects the l-th item
i_l from the list as in Eq. (2).
This probability means that, to choose i_l according to our two assumptions,
we should take into account not only the target user u but also the condition
that items i_1, ..., i_{l-1} are not purchased.
We formulate a model and its parameter estimation for our method based on
the following three policies:
1. We reduce the number of parameters for robust estimation. The direct
robust estimation of this model is difficult because the number of parameters
is of the order of O(U N^l), where U is the number of unique users and N
is the number of unique items, and so becomes excessive compared with the
size of the purchase history available for training.
2. We use only purchase history, which can be collected automatically. This
avoids the expensive manual tasks needed to acquire the other kinds of
information needed for recommendations, such as item taxonomy and ratings.
We also approximate the denominator of Eq. (3) in the same manner as the
numerator.
In summary, the probability that u will purchase the lth item from the list
can be expressed as follows:
\[
P(y_l = 1 \mid u, i_l, y_1 = 0, \ldots, y_{l-1} = 0, i_1, \ldots, i_{l-1})
\approx
\frac{P(y_l = 1 \mid u, i_l) \prod_{l'=1}^{l-1} P(y_{l'} = 0 \mid y_l = 1, i_{l'}, i_l)}
{\sum_{y_l = 0}^{1} P(y_l \mid u, i_l) \prod_{l'=1}^{l-1} P(y_{l'} = 0 \mid y_l, i_{l'}, i_l)}. \qquad (4)
\]
The number of parameters in this model is O(U N + N^2), which is far smaller
than O(U N^l), the number in the model expressed by Eq. (2).
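To make the greedy construction concrete, the following sketch selects a list by repeatedly maximizing the probability in Eq. (4). The component probabilities p_buy (standing in for P(y = 1 | u, i)) and p_skip (standing in for P(y_j = 0 | y_i, i_j, i_i)) are assumed to be given as lookup tables; the toy values only illustrate how a similar lower-ranked item is pushed down the list.

def score(u, cand, chosen, p_buy, p_skip):
    """Eq. (4): probability that cand is purchased given the items already listed were not."""
    terms = {}
    for y in (0, 1):
        t = p_buy[u][cand] if y == 1 else 1.0 - p_buy[u][cand]
        for prev in chosen:
            t *= p_skip[(prev, cand, y)]   # P(y_prev = 0 | y_cand = y, i_prev, i_cand)
        terms[y] = t
    den = terms[0] + terms[1]
    return terms[1] / den if den > 0 else 0.0

def greedy_list(u, items, K, p_buy, p_skip):
    """Greedily pick the next item that maximizes Eq. (4)."""
    chosen, pool = [], set(items)
    for _ in range(min(K, len(items))):
        best = max(pool, key=lambda i: score(u, i, chosen, p_buy, p_skip))
        chosen.append(best)
        pool.remove(best)
    return chosen

# Toy lookup tables: items 'a' and 'b' are assumed to be similar, so if 'b' were
# purchased it is unlikely that the similar item 'a' placed above it was skipped.
items = ["a", "b", "c"]
p_buy = {"u1": {"a": 0.5, "b": 0.45, "c": 0.2}}
p_skip = {(j, i, y): 1.0 for j in items for i in items for y in (0, 1)}
p_skip[("a", "b", 1)] = 0.3
print(greedy_list("u1", items, K=3, p_buy=p_buy, p_skip=p_skip))  # ['a', 'c', 'b']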
We approximate the personalization factor as follows, without considering the position l:
\[
\lambda_u = \frac{P(y = 1 \mid u)}{\sum_{i=1}^{I} P(i \mid u)^2}, \qquad (7)
\]
4 Related Work
Recently, diversification of recommendation lists has been targeted by many researchers
[9, 11, 17, 19, 21] since it is known to be an important factor in improving user
satisfaction through recommendations. [21] proposed a re-ranking algorithm that
uses item taxonomy to realize topic diversification. Their method can control the
trade-off between accuracy and diversity by using a hyperparameter determined in
an actual user experiment. [9] formalized the intra-list topic diversification problem
by addressing a multi-objective optimization problem on diversity and preference
similarity. [11] proposed a method to make diversified recommendations over time,
and discovered the negative correlation between user profile length and the degree
of recommendation diversity. [19] raised the issue of evaluating the novelty and
diversity of recommendations. [17] studied the adaptive diversification problem for
recommendations by connecting latent factor models with portfolio retrieval.
Additionally, in the area of information retrieval and Web search, many re-
searchers have studied the diversity of search results [3–5, 8, 14–16]. In particu-
lar, [4] is strongly related to our work since they are pioneers in the use of the
probabilistic approach to list diversification. Our method can be positioned as
an extension of [4] to recommendations. We expand Eq. (1) to Eq. (2) in the
same manner as the greedy approach to maximize the probability that at least one
document in the search results is relevant.
5 Evaluation
5.1 Formal Evaluation
First, we evaluated our method formally in terms of practicality, by comparing it
to conventional methods for diversifying recommendation lists. We pick three
conventional approaches, which we label Ziegler [21], Hurley [9], and Shi [17].
First, our method and Hurley do not require any manual tasks for generating
information sources, while Ziegler and Shi do. Ziegler uses item taxonomy, which
is manually provided by the service developer. Shi uses ratings, which are manually
given by users. On the other hand, our method and Hurley work even if only the
purchase history is available, and thus allow us to skip such manual tasks.
Second, our method and Shi do not use any hyperparameters for controlling
the balance between accuracy and diversity, while Ziegler and Hurley do. Accord-
ing to [21], there is a trade-off between accuracy and diversity, and balancing
the two maximizes end-user satisfaction. This balancing operation varies with
the service and/or the dataset, so parameter tuning is needed to determine
the optimal balance. In general, such tuning is difficult since it requires end-user
experiments.
To summarize, we formally demonstrated that our method is superior to con-
ventional methods in terms of practicality.
proposed in [22] to measure similarity between two items, and genre information
as the item taxonomy in calculating the similarity. In this paper, we used
the best settings yielded by the empirical analysis in [21, 22]; specifically,
propagation factor κ = 0.75 and diversification factor Θ_F = 0.3.
Result. In our experiments, we evaluated the accuracy and the diversity of the
recommendation lists produced by the three methods.
We used precision, micro-averaged recall, and macro-averaged recall curves to
measure accuracy. It is important to use both micro- and macro-averaged recall
[20], as micro-averaged metrics stress performance on prolific users, while macro-
averaged ones target those with few transactions. Both are used in this paper for
fairness and for detail. The formulae of precision@K, micro-averaged recall@K,
and macro-averaged recall@K are given below:
\[
\mathrm{precision@}K = 100 \cdot \frac{\sum_{u=1}^{U} \sum_{i=1}^{N} I[i \in i_u^{test} \land i \in i_u^{K}]}{U K},
\]
\[
\text{micro-averaged recall@}K = 100 \cdot \frac{\sum_{u=1}^{U} \sum_{i=1}^{N} I[i \in i_u^{test} \land i \in i_u^{K}]}{\sum_{u=1}^{U} \sum_{i=1}^{N} I[i \in i_u^{test}]},
\]
\[
\text{macro-averaged recall@}K = 100 \cdot \frac{1}{U} \sum_{u=1}^{U} \frac{\sum_{i=1}^{N} I[i \in i_u^{test} \land i \in i_u^{K}]}{\sum_{i=1}^{N} I[i \in i_u^{test}]},
\]
where i_u^test is the set of items for u appearing in the test set, and i_u^K is the set of
top-K items recommended to u. Figure 1 shows the recommendation accuracy of the
compared methods via precision, micro-averaged recall, and macro-averaged recall curves.
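The three accuracy metrics above can be computed directly from the test sets and the top-K recommendation lists; the sketch below uses made-up data and is only a restatement of the formulae.

def metrics_at_k(test, rec, K):
    """precision@K, micro-averaged recall@K, macro-averaged recall@K (in percent)."""
    users = list(test)
    hits = {u: len(test[u] & set(rec[u][:K])) for u in users}    # test items in the top-K list
    precision = 100.0 * sum(hits.values()) / (len(users) * K)
    micro_recall = 100.0 * sum(hits.values()) / sum(len(test[u]) for u in users)
    macro_recall = 100.0 * sum(hits[u] / len(test[u]) for u in users) / len(users)
    return precision, micro_recall, macro_recall

# Toy data: test[u] is the set of items u consumed in the test period, rec[u] the ranked list.
test = {"u1": {"a", "b", "c", "d", "e"}, "u2": {"x"}}
rec = {"u1": ["a", "z", "b"], "u2": ["x", "y", "w"]}
print(metrics_at_k(test, rec, K=3))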
These results reveal that our diversification also degrades the accuracy, as does
Ziegler. The precision curves show that our method is of practical use because
its loss in accuracy at high rank is less pronounced than that of Ziegler.
Specifically, our method degrades the accuracy less than Ziegler at high
ranks (K ≤ 5), but the accuracy curves for our method drop sharply, and
our method is then outperformed by Ziegler at low ranks (K > 8). Comparing
micro-averaged and macro-averaged recall, the drop of our method is smaller
for macro-averaged recall than for micro-averaged recall. This result means that
our method can offer more accurate recommendations to users who make
few purchases.
To measure diversity, we employed intra-list similarity and original list overlap
as used in [21]. Intra-list similarity is calculated as follows:
\[
\mathrm{ILS@}K = \frac{1}{U} \sum_{u=1}^{U} \sum_{k=1}^{K} \sum_{k'=k+1}^{K} sim(i_{uk}, i_{uk'}),
\]
where iuk is the kth recommendation to u, and sim() is the item similarity
measured by the metric proposed in [22]. The original list overlap can be calcu-
lated by counting the number of recommended items that stay the same after
the diversification.
Fig. 1. Recommendation accuracy of the compared methods (Proposed, Non-diversified, Ziegler): (a) Precision@K; (b) Micro-averaged recall@K; (c) Macro-averaged recall@K
Fig. 2. Recommendation diversity of the compared methods (Proposed, Non-diversified, Ziegler): (a) Intra-list similarity; (b) Original list overlap
Higher values of these metrics denote lower diversity. Figure
2 shows recommendation diversity of the compared methods based on intra-
list similarity and original list overlap. These figures reveal that our method
can increase diversity even though it uses only purchase history. From Figure
2 (a), we found that our method can significantly lower the taxonomy-driven
pairwise similarity in lists without recourse to item taxonomy. Specifically, in
Figure 2 (a), the curve of our method is closer to the Ziegler curve than to the
non-diversified curve.
Fig. 3. DCG of the compared methods (Proposed, Non-diversified, Ziegler) against K, the number of recommendations
Table 1. Example recommendation results for five items using a user profile. We refer
"Ac" to Action, "Ad" to Adventure, "An" to Animation, "Ch" to Children's, "Co"
to Comedy, "Cr" to Crime, "H" to Horror, "D" to Drama, "R" to Romance, "SF" to
Sci-Fi, "T" to Thriller, and "W" to War. ILS means intra-list similarity.
References
1. Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE TKDE 17(6), 734–749 (2005)
2. Blei, D., Ng, A., Jordan, M.: Latent Dirichlet allocation. J. Machine Learning Research 3, 993–1022 (2003)
3. Carbonell, J., Goldstein, J.: The use of MMR, diversity-based reranking for reordering documents and producing summaries. In: SIGIR 1998, pp. 335–336 (1998)
4. Chen, H., Karger, D.: Less is more: probabilistic models for retrieving fewer relevant documents. In: SIGIR 2006, pp. 429–436 (2006)
5. Clarke, C., Kolla, M., Cormack, G., Vechtomova, O., Ashkan, A., Büttcher, S., MacKinnon, I.: Novelty and diversity in information retrieval evaluation. In: SIGIR 2008, pp. 659–666 (2008)
6. Davidson, J., Liebald, B., Liu, J., Nandy, P., Vleet, T.V.: The YouTube video recommendation system. In: RecSys 2010, pp. 293–296 (2010)
7. Griffiths, T., Steyvers, M.: Finding scientific topics. PNAS 101, 5228–5235 (2004)
8. Guo, S., Sanner, S.: Probabilistic latent maximal marginal relevance. In: SIGIR 2010, pp. 833–834 (2010)
9. Hurley, N., Zhang, M.: Novelty and diversity in top-N recommendation analysis and evaluation. ACM Trans. Internet Technol. 10(4), 14:1–14:30 (2011)
10. Koren, Y.: Collaborative filtering with temporal dynamics. In: KDD 2009, pp. 447–456 (2009)
11. Lathia, N., Hailes, S., Capra, L., Amatriain, X.: Temporal diversity in recommender systems. In: SIGIR 2010, pp. 210–217 (2010)
12. Linden, G., Smith, B., York, J.: Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing 7(1), 76–80 (2003)
13. Minka, T.: Estimating a Dirichlet distribution. Tech. rep., M.I.T. (2000)
14. Radlinski, F., Dumais, S.: Improving personalized web search using result diversification. In: SIGIR 2006, pp. 691–692 (2006)
15. Santos, R., Macdonald, C., Ounis, I.: Selectively diversifying web search results. In: CIKM 2010, pp. 1179–1188 (2010)
16. Santos, R., Macdonald, C., Ounis, I.: Intent-aware search result diversification. In: SIGIR 2011, pp. 595–604 (2011)
17. Shi, Y., Zhao, X., Wang, J., Larson, M., Hanjalic, A.: Adaptive diversification of recommendation results via latent factor portfolio. In: SIGIR 2012, pp. 175–184 (2012)
18. Töscher, A., Jahrer, M., Bell, R.: The BigChaos solution to the Netflix grand prize (2009), http://www.netflixprize.com/assets/GrandPrize2009 BPC BigChaos.pdf
19. Vargas, S., Castells, P.: Rank and relevance in novelty and diversity metrics for recommender systems. In: RecSys 2011, pp. 109–116 (2011)
20. Yang, Y., Liu, X.: A re-examination of text categorization methods. In: SIGIR 1999, pp. 42–49 (1999)
21. Ziegler, C.N., McNee, S.M., Konstan, J.A., Lausen, G.: Improving recommendation lists through topic diversification. In: WWW 2005, pp. 22–32 (2005)
22. Ziegler, C., Lausen, G., Schmidt-Thieme, L.: Taxonomy-driven computation of product recommendations. In: CIKM 2004, pp. 406–415 (2004)
A Probabilistic Data Replacement Strategy
for Flash-Based Hybrid Storage System
1 Introduction
Although most people believe that the magnetic disk will be replaced by
flash-based solid state drives (SSDs) in the future, currently the popularization of
flash memory is still limited by its high price and low capacity. Hence magnetic
disks and flash memory will coexist over quite a long period of time. From the
comparison in Table 1, flash displays moderate I/O performance and price per
GB between DRAM and hard disk. Consequently, it is straightforward to adopt
flash memory as a level of memory between the HDD and main memory because
of its advantages in performance [18,19]. Flash-hard disk hybrid storage is
more and more widely adopted. Seagate provides a hybrid hard disk with a 4 GB
flash chip to improve the overall performance [2]. The Windows operating system
has supported Quick Boost since Vista to accelerate booting [1]. In addition, some
companies have started to replace some of their hard disks with SSDs to build hybrid
storage systems. In this context, how to design an effective flash-hard disk hybrid
storage system emerges as a critical issue.
The data migration among main memory, flash and disk is the most important
issue in hybrid storage design. Much work has been done on this problem.
TAC [7] (Temperature-Aware Caching) adopts temperature to determine page
placement. In TAC, a global temperature table is maintained for each page.
The temperature of a page is decided by its access numbers and patterns in a
period. Pages with higher temperature are placed in main memory and flash
memory, while a cold page evicted from main memory is moved to disk.
In contrast, LC [11] (Lazy Cleaning) and FaCE [13] (Flash as Cache Extension)
always cache a page evicted from main memory in flash memory. LC handles
flash memory as a write-back cache: dirty pages are kept on flash first, and if the
percentage of dirty pages exceeds a threshold, these pages are flushed to
disk. FaCE proposes FIFO replacement for flash memory management,
which is an ideal pattern for flash writes. In this way, FaCE improves the
throughput and shortens the recovery time of the database.
All the above methods are deterministic. In some cases, such a deterministic
migration policy is really inefficient. For example, some pages are accessed only
once, and thus it is suboptimal to keep these cold pages on the flash memory as LC
and FaCE do. The temperature method in TAC works well for hot page detection
with a stable pattern. However, when the workload changes, TAC takes a rather
long time to forget the history and learn the new pattern. Another disadvantage
of TAC is its high time and space consumption.
In order to overcome the problems of prior approaches, we propose a probability-based
policy named HyPro to manage data storage and migration in the storage
hierarchy, which is composed of main memory, SSD and HDD. In our approach,
the priority of data in each level of the hierarchy is maintained separately. The key
difference from prior work is that the data migration among different levels is no
longer deterministic but based on probabilities. Compared to prior deterministic
approaches, HyPro has several advantages:
2 Related Work
3 Probabilistic Framework
3.1 The HyPro Approach
The typical structure of a hybrid storage system is illustrated in Figure 1. All the
data is stored on the hard disk and organized as data pages. A page needs to be loaded
into main memory before being accessed. Since flash has better performance
than disk, it works as the level between main memory and disk. When
a page miss happens in main memory, the flash is checked first; the disk
is only accessed when the page is not found in the flash.
Fig. 1. The typical structure of a hybrid storage system (main memory, flash, and disk)
In this system, data placement and migration is a critical issue for achieving
better performance. In this part, we introduce our probabilistic approach for
hybrid storage management, named HyPro. The overall structure of HyPro is
shown in Fig. 2. In our framework, the pages in main memory and flash memory
are exclusive of each other; in other words, we do not keep a page in flash
memory if it is already in main memory.
In HyPro, we adopt two probabilities to control the data migration. If
some of the pages on flash are frequently accessed, it is better to elevate them
to main memory. We call this process elevation. Once an elevation happens
we have to evict a page from main memory. Obviously, the elevation should be
managed carefully so that the benefits of accessing hot pages can offset the I/O
cost overhead caused by data movement. In our probabilistic data management,
we use a probability named p_elevate to control the elevation frequency. As shown
in Fig. 3, when a page on flash is accessed, it has a chance of p_elevate to be kept in
main memory; otherwise, this page will be evicted on the next data access.
Obviously, a page has more chances to be elevated if it is more frequently used.
Hence, real hot pages are detected and promoted into main memory statistically
over a long runtime. In each elevation, a cold page in main memory needs to
be evicted to flash and placed in the original space of the elevated page.
In HyPro, pages may be evicted from main memory; we name this the sinking
operation in this paper. At first glance, such a page is likely to be
Fig. 2. The overall structure of HyPro: page migration among main memory, flash, and disk is controlled by the probabilities p_elevate and p_sink
Fig. 3. An example of elevation: a frequently accessed page on flash is elevated to main memory with probability p_elevate
hotter than the pages on flash memory, and should replace one flash page. However,
sinking to flash incurs a flash write. Whether the future benefit is worth this
write depends on the cost ratio between flash reads and writes as well as the
hotness of the page being sunk. An example of sinking is illustrated in Fig. 4. We
use a probability p_sink to control the ratio of sinking to flash. A page evicted from
main memory has a chance of p_sink to replace a flash page; otherwise, it is
discarded directly, or written back to disk if it is dirty. Let's see why this works.
We denote the page evicted from main memory as M (page 51 in Fig. 4 (a)) and the
page replaced on flash as F (page 18 in Fig. 4 (a)), respectively. The larger p_sink
is, the more evicted pages are sunk to flash, and the closer the hotness of M
and F becomes. Consequently, the benefit of sinking to flash will be small when the
hotness of M and F is close. By setting proper values of p_elevate and p_sink, we can
achieve a better trade-off between main memory and flash accesses, and the overall
I/O cost diminishes. We discuss parameter tuning in Section 4.
The pseudocode of HyPro is listed in Algorithm 1. A structure named frame
is used to store the position of each page, and the frames are organized in a
hash table to facilitate searching. Algorithm 1 illustrates the routine of
page access. First, the position of the page is determined, and then different
operations are conducted according to the page position. Algorithm 1 (line
13) may invoke Algorithm 2, which loads a page from disk and puts this
page in the right position according to p_sink. Although LRU is adopted in
our experiments for main memory and flash memory management, HyPro can
support other strategies such as LIRS [12] and ARC [17]. HyPro is easy to
Fig. 4. An example of sinking: a page evicted from main memory is either (b) discarded or written back to disk, or (c) sunk to flash, replacing a flash page
implement and quick enough for online processing; the time complexity is O(1)
for each page access.
In this paper, we focus on the data migration design. Nevertheless, some
optimizations can be applied in HyPro to further improve the performance,
e.g., by considering the asymmetric I/O costs and the access pattern
(random/sequential). For example, if the asymmetric I/O is considered, different
probabilities can be allocated to read and write operations respectively, which
can make one write operation equivalent to the effect of n read operations.
Furthermore, we could also manage flash in a FIFO manner, as FaCE does, to
transform flash space allocation into a sequential pattern.
4 Parameter Tuning
The probabilities of transferring data among different memory levels are the
crucial part of our stochastic page management policy. In this section, we provide
a study on how to automatically tune these probabilities based on cost
analysis. Table 2 facilitates a fast check on the notation used in this section.
Definition 1. For a cache management algorithm, we denote the place of the
page to be evicted as the evict position. (For example, the end of the LRU
queue). Nevict is defined as the total hit number on the evict position.
Note that in Definition 1 we consider the hit number on a position instead of on a
specific page. For example, if the LRU algorithm is adopted in main memory, N_evict
stands for the number of accesses on the LRU end. If a read operation hits the least
recently used
366 Y. Lv et al.
Table 2. Notation used in this section
C_dr, C_fr: the read time cost of one page on disk and flash, respectively
C_dw, C_fw: the write time cost of one page on disk and flash, respectively
p_elevate: the probability to elevate
p_sink: the probability to sink
R_fevict, R_mevict: the number of read hits on the evict position of flash and main memory, respectively
W_fevict, W_mevict: the number of write hits on the evict position of flash and main memory, respectively
page P, N_evict increases. However, at this time page P is moved to the LRU head
and a new page P' is moved to the LRU end. Then, the next time, we increase
N_evict when P' is accessed rather than P.
Assume that in a certain time period, a page Pf on flash is read Rf times
and written Wf times. Rmevict and Wmevict stand for the read and write times
on the evict position in the main memory during the same time period. The
read and write costs of flash are denoted as Cf r and Cf w .
In HyPro, p_sink is adopted to balance eviction to flash and to disk. Hence,
we discuss the tuning of p_sink by comparing the costs of two cases: 1) evicting
the page to flash (Figure 4 (c)) and 2) evicting the page to disk (Figure 4 (b)). The
I/O costs of the two cases are calculated as follows.
Case 1. Evict page P_mevict to flash, and if flash is full evict P_fevict to disk
(when dirty). Consequently, P_mevict will be accessed from flash and P_fevict from
disk, and the corresponding I/O cost is:
Case 2. Evict page P_mevict to disk. In this case P_mevict will be accessed from
disk, while the P_fevict mentioned in Case 1 is still accessed from flash. Thus the
I/O cost is:
In the case of C_sinkf < C_sinkd, it is more I/O efficient to evict a page to flash;
hence, p_sink should be increased, and vice versa. The above analysis only
takes the I/O cost of the pages transferred, that is, the evicted page and the elevated
page, into consideration. Actually, the I/O costs of other pages will also be influenced,
but they are not the primary cost, and experiments show that the obtained C_sinkf
and C_sinkd can guide the parameter adjustment effectively.
C_sinkf and C_sinkd are very small and unstable for a single page over a short
period of time. In practice, we accumulate C_sinkf and C_sinkd over all the pages
evicted in a certain window of the trace, so that p_sink can be adjusted based
on the comparison of the accumulated C_sinkf and C_sinkd. Note that the
calculation and parameter tuning described above need only O(1) time for each
access, so the tuning does not significantly increase the overhead of the whole strategy.
The tuning of p_elevate can be performed in a similar way to p_sink, and is thus
omitted here due to space limitations.
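A possible shape of this window-based tuning loop is sketched below; the accumulated costs C_sinkf and C_sinkd are taken as inputs (computed per the Case 1/Case 2 analysis above), and the step size, bounds, and example cost values are assumptions.

# Nudge p_sink toward the cheaper eviction target after each window of evictions.
def tune_p_sink(p_sink, c_sinkf, c_sinkd, step=0.05, lo=0.01, hi=0.9):
    if c_sinkf < c_sinkd:        # sinking to flash was cheaper over the last window
        return min(hi, p_sink + step)
    if c_sinkf > c_sinkd:        # evicting to disk was cheaper
        return max(lo, p_sink - step)
    return p_sink

# Hypothetical accumulated costs (in microseconds) for three successive windows.
p_sink = 0.2
for c_f, c_d in [(4200.0, 6100.0), (5300.0, 5100.0), (3900.0, 7000.0)]:
    p_sink = tune_p_sink(p_sink, c_f, c_d)
print(round(p_sink, 2))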
5 Performance Evaluation
In this section, we conduct a trace-driven simulation to evaluate the effectiveness
of the proposed framework. The traces used include TPC-B, TATP [3] and
making the Linux kernel (MLK for short), to evaluate the performance on
various workloads. TAC and FaCE are chosen as the competitors. The simulation
is developed in Visual Studio 2010 using C#. All experiments are run on a
Windows 7 PC with a 2.4 GHz Intel Quad CPU and 2 GB of physical memory.
We use the three traces mentioned above for performance evaluation. The benchmarks,
namely TPC-B [4] and TATP [3], are run on PostgreSQL 9.0.0 with
default settings, e.g., the page size is 8 KB. The dataset size of both TPC-B and
TATP is 2 GB. MLK is a record of the page accesses made while building Linux
kernel 2.6.27.39; we use a tool named strace to monitor these processes and
obtain the disk access history. The specification of the traces is given in Table 3.
In our experiments, the total I/O time is used as the primary metric to evaluate
the performance. We employ the Samsung SSD (64 GB, 470 series) in our
experiment and obtain its access latency by testing. The access latency of the hard
disk is taken from [15]. The parameters used in our experiments are listed in
Table 4, in which Flashsize/Pages denotes the ratio between the flash and dataset
sizes. We fix the main memory size to 1% of the dataset size.
Table 4. Parameter settings
C_r, C_w (μs): 271, 803 (for SSD); 12700, 13700 (for hard disk)
Flashsize/Pages: 1.25%, 2.5%, 5%, 10%, 20%
p_elevate: 0.001, 0.01, 0.015, 0.02, ..., 0.1
p_sink: 0.01, 0.1, 0.2, ..., 0.9
The performance of our approach also varies significantly with p_elevate, as
shown in Figure 5 (b), and the minimum I/O time is reached when it lies
between 0.01 and 0.02. As p_elevate controls whether a page on flash should be
elevated, a large p_elevate causes more exchanges between main memory and
flash, which may deteriorate the whole performance, while no exchange leaves
pages on flash no opportunity to get into main memory (corresponding
to the case p_elevate = 0). Other tests also show that the best parameters are
often around 0.02 and 0.2; thus, we adopt these values as the initial values
and use dynamic parameter tuning in the following experiments.
In this part, we compare HyPro with FaCE and TAC. We test extensive configurations,
but only show some results here in Figure 6 due to space limitations.
In the following results, the main memory is 1% of the total workload size and
the flash size varies from 1.25% to 20%. Our approach shows similar or better
performance compared with the FaCE and TAC approaches, and can reduce
up to 50% of the total I/O time compared with the other competitors. On the MLK
trace, FaCE performs the worst, since many pages in MLK are accessed only once,
but FaCE still caches these pages in flash memory. On TATP, HyPro and FaCE
have similar performance, both better than TAC, because TATP does not have a very
stable access pattern: FaCE and HyPro can adapt themselves to the workload
changes quickly, but TAC needs a rather long time to learn such changes. On
the TPC-B trace, TAC has the best performance when the flash size is small; this is
because the temperature-based hot-page detection can accurately discover the hottest
pages, which is an advantage when the cache size is small. However, when the flash
size is large, the performance of TAC degrades. This phenomenon is partially
because precise hot-page detection is not necessary for a larger cache, and
partially because of the write-through cache design, as the write ratio is very
high in TPC-B according to Table 3. HyPro adapts itself to the enlarged flash
size and shows good performance in all cases.
Fig. 6. (b) High-end SSD performance on the TATP trace; (c) low-end SSD performance on the TPC-B trace
6 Conclusion
In this paper we propose a novel stochastic approach for flash-based hybrid storage
system management, named HyPro. Different from the existing deterministic
models, HyPro controls the data migration between devices using two probabilities.
One probability describes the chance that a page will be kept in main
memory after it is accessed from flash. The other probability determines
where a page should be placed after it is evicted from main memory. By doing
this, HyPro achieves fewer hard disk accesses with only a small increase in
exchange overhead on flash writes. We also develop an approach to determine the
probabilities based on cost analysis. The experiments show that HyPro outperforms
the other competitors.
References
1. http://windows.microsoft.com/en-US/windows-vista/products/
features/performance
2. Momentus XT Solid State Hybrid Drives,
http://www.seagate.com/www/en-us/products/laptops/laptop-hdd/
3. Telecom Application Transaction Processing Benchmark,
http://tatpbenchmark.sourceforge.net/index.html
4. TPC Benchmark B (TPC-B), http://www.tpc.org/tpcb/
5. Bisson, T., Brandt, S.A., Long, D.D.E.: A hybrid disk-aware spin-down algorithm with I/O subsystem support. In: IPCCC, pp. 236-245 (2007)
6. Canim, M., Bhattacharjee, B., Mihaila, G.A., Lang, C.A., Ross, K.A.: An object placement advisor for DB2 using solid state storage. PVLDB 2(2), 1318-1329 (2009)
7. Canim, M., Mihaila, G.A., Bhattacharjee, B., Ross, K.A., Lang, C.A.: SSD bufferpool extensions for database systems. PVLDB 3(2), 1435-1446 (2010)
8. Chen, S.: Flashlogging: exploiting flash devices for synchronous logging performance. In: SIGMOD Conference, pp. 73-86 (2009)
9. Debnath, B.K., Sengupta, S., Li, J.: Flashstore: High throughput persistent key-value store. PVLDB 3(2), 1414-1425 (2010)
10. Debnath, B.K., Sengupta, S., Li, J.: Skimpystash: RAM space skimpy key-value store on flash-based storage. In: SIGMOD Conference, pp. 25-36 (2011)
11. Do, J., Zhang, D., Patel, J.M., DeWitt, D.J., Naughton, J.F., Halverson, A.: Turbocharging DBMS buffer pool using SSDs. In: SIGMOD Conference, pp. 1113-1124 (2011)
12. Jiang, S., Zhang, X.: LIRS: an efficient low inter-reference recency set replacement policy to improve buffer cache performance. In: SIGMETRICS, pp. 31-42 (2002)
13. Kang, W.-H., Lee, S.-W., Moon, B.: Flash-based extended cache for higher throughput and faster recovery. Proc. VLDB Endow. 5(11), 1615-1626 (2012)
14. Koltsidas, I., Viglas, S.: Flashing up the storage layer. PVLDB 1(1), 514-525 (2008)
15. Lee, S.-W., Moon, B.: Design of flash-based DBMS: an in-page logging approach. In: SIGMOD Conference, pp. 55-66 (2007)
16. Luo, T., Lee, R., Mesnier, M.P., Chen, F., Zhang, X.: hStorage-DB: Heterogeneity-aware data management to exploit the full capability of hybrid storage systems. CoRR abs/1207.0147 (2012)
17. Megiddo, N., Modha, D.S.: ARC: A self-tuning, low overhead replacement cache. In: FAST (2003)
18. Ou, Y., Härder, T.: Trading memory for performance and energy. In: Xu, J., Yu, G., Zhou, S., Unland, R. (eds.) DASFAA Workshops 2011. LNCS, vol. 6637, pp. 241-253. Springer, Heidelberg (2011)
19. Wu, X., Reddy, A.L.N.: Managing storage space in a flash and disk hybrid storage system. In: MASCOTS, pp. 1-4 (2009)
An Influence Strength Measurement via Time-Aware
Probabilistic Generative Model for Microblogs
Zhaoyun Ding1, Yan Jia1, Bin Zhou1, Jianfeng Zhang1, Yi Han1, and Chunfeng Yu2
1 School of Computer, National University of Defense Technology, Changsha, 410073, China
{zyding,yanjia,binzhou,jfzhang,yihan}@nudt.edu.cn
2 Naval Aeronautical Engineering Institute Qingdao Branch, Qingdao, Shandong, China
[email protected]
1 Introduction
Online social networking sites, such as Facebook, Twitter, and LinkedIn, are becoming
more popular every day. Among these sites, Twitter is growing faster than any
other social network. Microblogging services such as Twitter have recently emerged as a
new medium for communication. Unlike other social network services, the following
relationship between users can be unidirectional; a user is allowed to choose whom
she wants to follow without seeking any permission. Twitter employs a social network
based on the following relationship: the user whose updates are being followed is called
the friend, while the one who is following is called the follower. If the author does
not protect his tweets, they appear in the so-called public timeline and his followers
receive all messages from him.
It is well recognized that networks often contain both strong and weak ties. Moreover,
the influence strength between different users usually differs and is unidirectional
in microblogs, so treating all relationships as equal will increase the level of noise in
the learned models and likely lead to degraded performance. Recently, influence
strength analysis has attracted considerable research interest. However, most existing
work considers only the network structure or user behaviors in the social network,
and the micro-level mechanisms of a user's influence strength on her followers for a
specific topic have been largely ignored.
Studies by Kwak et al. [1] have shown that Twitter is more likely to be a news medium.
Users can not only find friends through microblogs, but also publish messages themselves,
and microblogs transform people from content consumers into content producers.
Usually, there are two aims for users of microblog services: finding friends
from real life and finding friends according to homophily. So, it is insufficient
to measure the influence strength of two users only by the number of common
friends.
Fig. 1. The correlation of the influence strength and the words
Fig. 2. An example of multi-level hierarchical structure
Moreover, if a user is influenced by her friends, she will retweet (by making use of
an informal convention such as RT @user) or reintroduce (post tweets that are similar
to tweets previously posted by other users, but without acknowledging the source) the
messages of her friends, resulting in some identical words in the tweets. The more
of these words a user copies, the higher the influence strength on that
user. Figure 1 gives an example to illustrate the correlation of the influence strength
and the number of shared words. For the topic of basketball, we can find that user A has
more words in common with her friend B, and we infer that user A copies more ideas from her
friend B. In fact, Figure 1 indicates two assumptions: 1) users with similar interests have
a stronger influence on each other, and 2) users whose actions frequently correlate have a
stronger influence on each other, which have been validated in [2] [3].
Also, the interval of time is an important factor for measuring the influence strength,
as it reflects the diffusion rate of each word. If a user copies the ideas of her friend
much faster for a topic, it indicates that the friend has a stronger influence on this user.
The interval of time has been ignored in recent studies of influence strength analysis.
2 Related Work
Liu et al [2] [3] proposed a generative graphical model which leveraged both hetero-
geneous link information and textual content associated with each user in the network
to mine topic-level influence strength. Based on the learned direct influence, they fur-
ther studied the influence propagation and aggregation mechanisms. Guo et al [4] pro-
posed a generative model for modeling the documents linked by the citations, called
the Bernoulli Process Topic (BPT) model, to measure the influence strength in Citation
Networks. Dietz et al [5] devised a probabilistic topic model that explained the genera-
tion of documents; the model incorporated the aspects of topical innovation and topical
inheritance via citations to predict the citation influences.
Our work is similar to theirs: we also propose a novel generative probabilistic
model of a corpus which leverages both the following network and the post
content associated with each user to measure the influence strength in
microblogs. However, time is considered in our work. Usually, the influence strength
between users evolves over time. Moreover, the interval of time is an
important factor in deciding the direct and indirect strength of influence between users.
Xiang et al [6] developed an unsupervised model to estimate relationship strength
from interaction activity (e.g., communication, tagging) and user similarity. More
specifically, they formulated a link-based latent variable model, along with a coordinate
ascent optimization procedure for the inference. However, micro-level influence
strength analysis is ignored in their work.
Also, interaction data has been used to predict relationship strength [7] [8] but this
work only considered two levels of relationship strength, namely weak and strong rela-
tionships.
3 Preliminary
Definition 1. (Word) A word is the basic unit of discrete data, defined to be an item
from a vocabulary indexed by {1, ..., V}. Words are represented with unit-basis vectors
that have a single component equal to one and all other components equal to zero.
Thus, the v-th word in the vocabulary is represented by a V-vector w such that w^v = 1
and w^u = 0 for u ≠ v.
Definition 4. (Network) The network in microblogs is a directed graph G = (N, E).
N stands for the set of users in microblogs, and the set of directed edges E ⊆ N × N stands
for the following relationships in microblogs. For euv = (u, v) ∈ E, if there exists a
directed edge between u and v, euv = 1; otherwise euv = 0.
Definition 5. (Direct and Indirect Influence) Given two user nodes u, v in the network
G = (N, E), we denote by Iu(v) ∈ R the influence strength of user v on user u. Furthermore,
if euv = 1, we call Iu(v) the direct influence of user v on u; if euv = 0, we
call Iu(v) the indirect influence of user v on u. Because the following relationship
between users in microblogs can be unidirectional, the influence strength is asymmetric,
i.e., Iv(u) ≠ Iu(v).
Similar to the existing topic models, each document is represented as a mixture over
latent topics. In our model, the content of each document is a mixture of two sources:
1) the content of the given document, and 2) the content of other documents related to
the given document through the network structure. This perspective actually reflects the
process of writing a post (tweet) in microblogs: either depending on the user's own
interests or influenced by one of her friends. When a user writes a post, she may create
the idea innovatively or copy it from one of her friends.
In the model, we use a parameter s to control the influence situation. The s is
generated from a Bernoulli distribution. When s = 1, the behavior is generated based
on the user's own interests. When s = 0, the behavior of the user is influenced by one
of her friends. The parameter of this coin is learned by the model, given an asymmetric
Beta prior that prefers the topic mixture of a friend document. Thus we need another
parameter to select one influencing user v from the friends set L(u).

Fig. 3. The graphical structure of the model (friends' posts f, user posts D, and a time stamp for each pair)
The key differences from the existing topic models are that the interval of time and
the relationships between users are combined in our model. The mixture distribution over
topics is influenced by both word co-occurrences and the document's time stamp. A
per-topic Beta distribution generates the document's time stamp, and the interval Δt
between two related documents is drawn from a negative exponential distribution with
parameter λ. Figure 3 shows the graphical structure of the model, which explains
the generative process of each word and time stamp in posts. The key differences of the
generative process are illustrated in Algorithm 1.
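As a rough illustration of the generative story just described (the Bernoulli coin s, the friend choice, the per-topic Beta time stamp, and the exponential interval), the sketch below samples one word and time stamp for a post. It is a paraphrase under our own assumptions: the argument names are placeholders rather than the paper's notation, and Algorithm 1 itself is not reproduced in this excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_word(pi, friend_ids, theta_user, theta_friends, phi, psi, lam):
    """Sample one (word, time stamp) pair for a user's post.
    pi: Bernoulli parameter of the coin s; theta_*: topic mixtures;
    phi: per-topic word distributions (T x V); psi: per-topic Beta parameters
    for the time stamp; lam: per-topic rate of the exponential interval.
    All names are illustrative placeholders."""
    s = rng.binomial(1, pi)                           # 1: own interests, 0: influenced by a friend
    if s == 1:
        z = rng.choice(len(phi), p=theta_user)        # topic from the user's own mixture
        f, dt = None, None                            # no copying friend or interval in this case
    else:
        f = rng.choice(friend_ids)                    # pick one influencing friend
        z = rng.choice(len(phi), p=theta_friends[f])  # topic from that friend's mixture
        dt = rng.exponential(1.0 / lam[z])            # interval to the friend's related post
    w = rng.choice(phi.shape[1], p=phi[z])            # word drawn from the topic's distribution
    t = rng.beta(psi[z][0], psi[z][1])                # normalised time stamp of the post
    return s, f, z, w, t, dt
```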
Using the chain rule, according to p(zi, z−i) = p(z) and Bayesian inference, we can
obtain the conditional probability conveniently, as follows.
p(z_i | z_{-i}, w_i, t_i, s_i = 0, f_i, ·)
  = p(w, w', z, z', t, t', f, s | ·) / p(w, w', z_{-i}, z'_{-i}, t, t', f, s | ·)
  ∝ p(t | ψ, z) · p(w, w' | z, z', ·) · p(z, z' | s, f, ·)
  ∝ [(1 − t_i)^{ψ^1_{z_i} − 1} t_i^{ψ^2_{z_i} − 1} / B(ψ^1_{z_i}, ψ^2_{z_i})]
    · [(N_{w,z}(w_i, z_i) + N_{w',z}(w_i, z_i) + β − 1) / (N_z(z_i) + N_{z'}(z_i) + Vβ − 1)]
    · [(N_{f,z}(f_i, z_i) + N_{f,z,s}(f_i, z_i, 0) + γ − 1) / (N_f(f_i) + N_{f,s}(f_i, 0) + Tγ − 1)]   (5)
p(z_i | z_{-i}, w_i, t_i, d_i, s_i = 1, f_i, ·) is derived analogously:

p(z_i | z_{-i}, w_i, t_i, d_i, s_i = 1, f_i, ·)
  ∝ [(1 − t_i)^{ψ^1_{z_i} − 1} t_i^{ψ^2_{z_i} − 1} / B(ψ^1_{z_i}, ψ^2_{z_i})]
    · [(N_{w,z}(w_i, z_i) + N_{w',z}(w_i, z_i) + β − 1) / (N_z(z_i) + N_{z'}(z_i) + Vβ − 1)]
    · [(N_{d,z,s}(d_i, z_i, 1) + α − 1) / (N_{d,s}(d_i, 1) + Tα − 1)]   (6)
After the Gibbs sampling process, we can estimate the parameter λ of the negative
exponential distribution:

  λ_i = 1 / ( (1/Y) Σ_{y=1}^{Y} t_i^y )   (7)

where Y is the number of time intervals observed in the i-th topic, and t_i^y is the
y-th time interval in the i-th topic.
Moreover, we will also obtain the sampled coin si, friend fi, and topic zi for each
word, and the influence strength can then be estimated by Eq. (8), averaged over the
sampling chain after convergence, where A denotes the length of the sampling chain.
  I_u(v|z) = φ_d(f|z) = (1/A) Σ_{n=1}^{A} [ Σ_{i=1}^{N_{d,z,f,s}(d,z,f,0)^n} λ_z exp(−λ_z t_i^n) + γ ] / [ N_{d,z,s}(d,z,0)^n + |L(d)| γ ]   (8)
Here t_i represents the shortest interval between two identical tokens assigned to the
same topic by the user u and the user v.
The equations reflect our assumptions in a statistical way. They indicate that if user
u copies more words of user v on topic z, then v has a stronger influence on u w.r.t. topic
z. Moreover, if user u copies the ideas of user v faster (with a shorter interval of time)
on topic z, then v has a stronger influence on u w.r.t. topic z. In addition, we can also
illustrate the evolutionary trend of the influence strength for a pair of users according to
the time stamps of each user in microblogs.
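A minimal sketch of how these estimates might be computed from the Gibbs samples, assuming the reconstructed forms of Eqs. (7) and (8) above; the sample data structure and its field names (intervals, N_dzs, lam_z) are our own illustrative assumptions rather than anything specified in the paper.

```python
import math

def estimate_lambda(intervals):
    """Eq. (7) as reconstructed above: the rate is the inverse of the mean interval."""
    return len(intervals) / sum(intervals) if intervals else 0.0

def influence_strength(samples, gamma, num_friends):
    """Sketch of Eq. (8): average over A Gibbs samples.  Each sample is assumed to
    record, for the (u, v, z) triple of interest, the copy intervals t_i, the
    count N_dzs of tokens with s = 0, and the topic rate lam_z."""
    total = 0.0
    for s in samples:
        decayed = sum(s["lam_z"] * math.exp(-s["lam_z"] * t) for t in s["intervals"])
        total += (decayed + gamma) / (s["N_dzs"] + num_friends * gamma)
    return total / len(samples)
```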
Fig. 4. An example of influence propagation (Δt = 1h and Δt = 5h)
  I_u(f_1, f_2 | z) = (1/A) Σ_{n=1}^{A} [ N_{d,z,s,f_1,f_2}(d, z, 0, f_1, f_2)^n + γ ] / [ N_{d,z,s}(d, z, 0)^n + |L(d)| γ ]   (9)

I_u(f_1, f_2 | z) represents the indirect influence of user f_2 on u through the user f_1.
N_{d,z,s,f_1,f_2}(d, z, 0, f_1, f_2) represents the number of the same tokens in user u that are
assigned to both the user f_1 and the user f_2. For example, N_{d,z,s,f_1,f_2}(u, z, 0, f_1, f_2) = 2
in Figure 4. Analogously, we can obtain the indirect influence strength for further hops.
Moreover, the influence of ideas decreases during the process of propagation. Fortunately,
the interval of time can be used to describe this decrease of influence.
Usually, the influence of ideas decreases over time: if the interval of time is longer, the
influence of ideas decreases significantly; otherwise, if the interval of time is shorter, the
influence of ideas decreases only slightly. We take advantage of the negative exponential
distribution to describe the decrease of influence quantitatively:

  I_u(f_1, f_2 | z) = (1/A) Σ_{n=1}^{A} [ Σ_{i=1}^{N_{d,z,s,f_1,f_2}(d,z,0,f_1,f_2)^n} λ_z exp(−λ_z t_i^n) + γ ] / [ N_{d,z,s}(d, z, 0)^n + |L(d)| γ ]   (10)

Also, t_i here represents the interval between two tokens of the two indirectly related users.
For example, t_i(A, C) = 6h in Figure 4.
Next, we analyze multi-path influence propagation in microblogs. In Figure 2, we can
see that user 4 is influenced indirectly by user 1 through two users (user 2 and user 5).
In order to measure the indirect influence strength through multiple paths, we give an
aggregation function to combine the multi-path indirect influence strengths:

  I_u(f_2 | z) = max( I_u(f_1^1, f_2 | z), I_u(f_1^2, f_2 | z) )   (11)

Here u is influenced indirectly by f_2 through the two users f_1^1 and f_1^2. We take the
highest multi-path indirect influence strength as the final indirect influence strength.
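Under the same assumptions as the previous sketch, the time-decayed indirect influence of Eq. (10) and the multi-path aggregation of Eq. (11) could be computed as follows (field names again illustrative).

```python
import math

def indirect_influence(samples, gamma, num_friends):
    """Sketch of Eq. (10): like Eq. (8), but over the tokens the user shares with the
    two-hop friend, with the same exponential time decay."""
    total = 0.0
    for s in samples:
        decayed = sum(s["lam_z"] * math.exp(-s["lam_z"] * t) for t in s["indirect_intervals"])
        total += (decayed + gamma) / (s["N_dzs"] + num_friends * gamma)
    return total / len(samples)

def aggregate_multipath(path_strengths):
    """Eq. (11): the final indirect influence is the maximum over all paths."""
    return max(path_strengths)
```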
7 Experiments
7.1 Dataset
For the purpose of this study, a set of Twitter data about Chinese-based Twitter users who
published at least one Chinese tweet from 2011-04-15 to 2011-07-15 was
prepared as follows. About 0.26 million users and 2.7 million tweets were collected
through the Twitter API. We begin by describing the dataset we are trying to mine.
As Figure 5 shows, the distributions of users' followers and tweets are approximately
power-law, implying that the vast majority of users have a small number of followers
and tweets, while a small fraction have a large number of followers and tweets.
In order to measure the influence strength with our model, we extracted the tweets of
users who published at least 10 tweets, and then extracted their network relationships. Due
to the power-law distribution of users' tweets, 3334 users and 168526 network
relationships (edges) were extracted from the data set. Our model is trained with the
hyperparameters set to 0.01, 0.1, 0.1, 0.1, and 1.0, and with 50 topics.
7.2 Evaluation
Predicting User Behaviors by Direct Influence Strength. We apply the derived
influence strength to help predict user behaviors and compare the prediction performance
with other methods. The behavior is defined as whether a user retweets a friend's
microblog. To do this, we rank the pairs of users by influence strength and measure the
area under the ROC curve (AUC) based on the feature values for the ranked pairs (e.g.,
1 if a user is retweeted by the other user, 0 otherwise). We compare the ranking using
our influence strength (our model is named TNIM, Time Network Influence Model) to
several alternative rankings:
1) The influence strength between users is calculated as the inverse Kullback-Leibler (KL)
divergence of the users' distributions over topics (named KLD):

  I_u(v) = 1 / KL(u, v) = 1 / ( Σ_{0 < i ≤ T} θ_u^{z_i} log( θ_u^{z_i} / θ_v^{z_i} ) )   (12)

2) The influence strength between users is calculated as the number of common friends
(named NCF):

  I_u(v) = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|   (13)

3) The influence strength between users is calculated as the method proposed by Liu
et al [2] [3] (named NIM, Network Influence Model):

  I_u(v) = (1/A) Σ_{n=1}^{A} [ N_{d,z,f,s}(d, z, f, 0)^n + γ ] / [ N_{d,z,s}(d, z, 0)^n + |L(d)| γ ]   (14)
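For concreteness, the two simpler baseline rankings can be sketched as below, following the reconstructed Eqs. (12) and (13); the small smoothing constant eps is our addition to avoid division by zero and is not part of the original formulation.

```python
import math

def kld_influence(theta_u, theta_v, eps=1e-12):
    """KLD baseline: inverse KL divergence between two users' topic mixtures (Eq. 12)."""
    kl = sum(pu * math.log((pu + eps) / (pv + eps)) for pu, pv in zip(theta_u, theta_v))
    return 1.0 / kl if kl > 0 else float("inf")

def ncf_influence(friends_u, friends_v):
    """NCF baseline: Jaccard coefficient of the two users' friend sets (Eq. 13)."""
    union = friends_u | friends_v
    return len(friends_u & friends_v) / len(union) if union else 0.0
```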
In order to predict user behaviors, we divide the data set into two parts. We learn the
influence strength from the data collected from 2011-04-15 to 2011-06-30, and we predict
user behaviors on the data from 2011-07-01 to 2011-07-15. We select
four users and their friends to evaluate our model. The ROC curves of the different methods
for the four users are shown in Figure 6.
Then, we select 1000 users and their friends to compute the average AUC (Area
Under Curve) of the above four methods. The results are shown in Figure 7.
The experimental results show that the prediction performance of our model is the best.
The average AUC (Area Under Curve) of our model is 0.0647 higher than that of NIM,
and 0.0876 higher than that of KLD. The prediction performance of
NCF is the worst. We can infer that it is not accurate to predict retweets by the
number of common friends, since users with common friends do not always have common topics in virtual life.
Fig. 6. The ROC curves of the different methods for four users
Fig. 7. The average AUC
Fig. 8. The correlation of our model and other methods
Fig. 9. The average AUC of indirect influence strength
We further select users and their indirect friends to compute the average AUC (Area
Under Curve). Experimental results are shown in Figure 9, where h denotes the hop of indirect influence.
Experimental results show that the prediction performance of our model is better
than that of the method proposed by Liu et al [2] [3]. Moreover, we find that the prediction
performance of our model drops more slowly as the number of hops increases. The indirect
influence strength proposed by Liu et al [2] [3] is affected by the hops of the network, and
all indirect influence strengths are computed as I_u(w) ∘ I_w(v), where ∘ is a combination
function, e.g., multiplication or minimum value. However, the interval of time
is ignored by them. Usually, up-to-date tweets are recommended to followers, and
followers are more easily influenced by these up-to-date tweets. It is not accurate
to compute the indirect influence strength without considering the interval of time. In fact,
even though a path in the network has more hops, the interval along the multi-hop path
may be shorter and the tweet is retweeted more easily by other users. Therefore, the
prediction performance of the method proposed by Liu et al [2] [3] drops faster as the
number of hops increases.
8 Conclusion
References
1. Kwak, H., Lee, C., Park, H., Moon, S.: What is twitter, a social network or a news media? In: Proc. of the 19th International World Wide Web Conference, WWW 2010, Raleigh, USA, pp. 591-600 (April 2010)
2. Liu, L., Tang, J., Han, J., Jiang, M., Yang, S.: Mining topic-level influence in heterogeneous networks. In: Proc. of the 19th ACM Conference on Information and Knowledge Management, CIKM 2010, Toronto, Ontario, Canada, pp. 199-208 (October 2010)
3. Liu, L., Tang, J., Han, J., Yang, S.: Learning influence from heterogeneous social networks. Data Mining and Knowledge Discovery 25(3), 511-544 (2012)
4. Guo, Z., Zhang, Z., Zhu, S., Chi, Y., Gong, Y.: Knowledge discovery from citation networks. In: Proc. of the 9th IEEE International Conference on Data Mining, ICDM 2009, Miami, Florida, USA, pp. 800-805 (December 2009)
5. Dietz, L., Bickel, S., Scheffer, T.: Unsupervised prediction of citation influences. In: Proc. of the 24th International Conference on Machine Learning, ICML 2007, Corvallis, Oregon, USA, pp. 233-240 (June 2007)
6. Xiang, R., Neville, J., Rogati, M.: Modeling relationship strength in online social networks. In: Proc. of the 19th International World Wide Web Conference, WWW 2010, Raleigh, USA, pp. 981-990 (April 2010)
7. Gilbert, E., Karahalios, K.: Predicting tie strength with social media. In: Proc. of the 27th International Conference on Human Factors in Computing Systems, CHI 2009, Boston, USA, pp. 211-220 (April 2009)
8. Kahanda, I., Neville, J.: Using transactional information to predict link strength in online social networks. In: Proc. of the 3rd International AAAI Conference on Weblogs and Social Media, ICWSM 2009, San Jose, California, USA, pp. 74-81 (May 2009)
A New Similarity Measure Based on Preference
Sequences for Collaborative Filtering
1 Introduction
The two most commonly used similarity measures are vector cosine
similarity and Pearson correlation coefficient [3]. Vector cosine similarity takes
each user or item as a vector of ratings, and computes the cosine of the angle formed
by the rating vectors. Pearson correlation coefficient measures the degree to which two
users or items linearly relate to each other. Both have achieved great success
in many practical CF applications.
However, in some cases, what recommender systems get are not ratings,
but preference sequences of users over a series of items. For this type of data,
the traditional similarity measures may fail to meet practical application
requirements. In this paper, from the point of view of user behavior, a similarity
measure based on inversion (Inversion) is naturally proposed for preference sequences.
Based on the Inversion similarity measure, some structural information
of user preference sequences is analyzed. By merging average precision (AP) and
weighted inversion into the similarity computation, a new similarity measure based
on preference sequences is proposed for collaborative filtering. Finally, experimental
results show that the proposed similarity measure based on preference sequences
achieves better performance than the common similarity measures on
preference sequence datasets.
The rest of this paper is organized as follows. Section 2 introduces some preliminary
knowledge, including an overview of CF recommender systems and similarity
measures. Section 3 analyzes some structural information of user preference sequences
and proposes a new similarity measure based on preference sequences for collaborative
filtering. Some preliminary experimental results and evaluations are shown in Section 4.
Finally, Section 5 presents conclusions and future work.
2 Preliminary Knowledge
Similarity computation is a key step for user-based and item-based CF recommender
systems. At present, there are many different methods for similarity computation [1],
but the most commonly used are vector cosine similarity (Cosine) and Pearson's
correlation coefficient (Pearson).
In this section, the ratings of one user are seen as a preference sequence and
inversion is employed as a similarity measure over the preference sequences. By
analyzing some problems of the similarity measure based on inversion and some
structural information of preference sequences, a new similarity measure based
on preference sequences for collaborative filtering is proposed by merging average
precision and weighted inversion into the similarity computation. The similarity
measure based on inversion is first introduced as follows.
From the point of view of behavioral psychology, one's behavior on some items (such
as a high or low score, purchase or not, interested or disgusted, etc.) can be seen
as a preference sequence; thus the similarity between two users can be measured
by the two preference sequences of their behaviors [6,8]. In some sense, it may
achieve better performance than the common vector cosine similarity and Pearson
correlation coefficient.
Based on the above analysis, a similarity measure based on inversion can be
proposed for collaborative filtering, and thus similarity computation between two
users can be converted to inversion computation of their preference sequences.
First, some definitions are given as follows.
Definition 1 (Inversion) [4]: Suppose that P = (p1, p2, ..., pn) is a sequence
containing n different real numbers. <pi, pj> (i, j = 1, 2, ..., n, i ≠ j) is called
a pair, and the number of pairs for which pi > pj occurs in sequence P is
called its inversion, written as t(P) and shown in Formula (1).

  t(P) = Σ_{i=1}^{n} Σ_{j=i+1}^{n} f(p_i > p_j)   (1)
According to the above definitions and theorems, the similarity between two sequences
can be calculated by normalizing the inversion between them, as shown
in Formula (2).

  Sim(P, Q) = 1 − 2 · t(Q|P) / (|P| · (|P| − 1))   (2)

According to Formula (1) and Formula (2), the time complexity of the similarity
measure based on inversion is O(n^2). However, by using the merge-sort algorithm [4],
the similarity measure based on inversion can reduce its time complexity
to O(n log n).
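As a sketch of the O(n log n) computation, the code below counts inversions with merge sort and then normalizes as in Formula (2). The relabelling step reflects our reading of t(Q|P) as the number of pairs on which Q disagrees with the ordering induced by the benchmark P; that reading is an assumption, since the notation is not fully spelled out in this excerpt.

```python
def count_inversions(seq):
    """Count inversions of seq in O(n log n) via merge sort (cf. [4])."""
    if len(seq) <= 1:
        return list(seq), 0
    mid = len(seq) // 2
    left, inv_left = count_inversions(seq[:mid])
    right, inv_right = count_inversions(seq[mid:])
    merged, inv = [], inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
            inv += len(left) - i       # every element remaining in left forms an inversion
    merged.extend(left[i:]); merged.extend(right[j:])
    return merged, inv

def inversion_similarity(p, q):
    """Normalized inversion-based similarity in the spirit of Formula (2).
    p and q are preference sequences (e.g., ratings) over the same items."""
    n = len(p)
    order_by_p = sorted(range(n), key=lambda i: p[i])   # items ordered by p's preference
    q_in_p_order = [q[i] for i in order_by_p]           # q's values listed in that order
    _, t = count_inversions(q_in_p_order)               # pairs on which q disagrees with p
    return 1 - 2 * t / (n * (n - 1))
```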
  AP(P) = ( Σ_{i=1}^{N} p(i) · rel(i) ) / N   (4)

where N is the length of the sequence, p(i) is the precision at cut-off i in the sequence
P, and rel(i) is an indicator function.
According to Formula (3) and Formula (4), two kinds of position information
of each element in preference sequences can be taken into the similarity computation,
and a new similarity measure based on preference sequences can be
proposed for collaborative filtering, named similarity measure based on Weighted
Inversion and Average Precision (WIAP), as shown in Formula (5).

  Sim(P, O) = (1 − WInv(P)) · AP(P)   (5)

where O is the benchmark sequence, and WInv(P) and AP(P) are calculated according
to Formula (3) and Formula (4) respectively. Finally, the proposed similarity
measure WIAP can be described as Algorithm 1.
Algorithm 1. Similarity measure WIAP
1. Sort O in ascending order under the constraint of keeping the permutation relationship between P and O;
2. Initialize WI = 0, AP = 0, c = 0;
3. For i = 1 to N − 1
4.   Calculate inv as the inversion of P(i);
5.   WI = WI + inv / i;
6.   If P(i) == i
7.     c = c + 1;
8.     AP = AP + c / i;
9.   End If
10. End For
11. If P(N) == N
12.   c = c + 1;
13.   AP = AP + c / N;
14. End If
15. Normalize WI into [0, 1];
16. Calculate Sim(P, O) = (1 − WInv(P)) · AP(P) and return it;
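A direct transcription of Algorithm 1 in Python might look as follows. Because Formula (3) and the normalization of WI are not reproduced in this excerpt, two details are our own assumptions: "inversion of P(i)" is read as the number of later elements smaller than P(i), and WI is normalized against the fully reversed sequence; AP is divided by N as in Formula (4).

```python
def wiap_similarity(p):
    """Sketch of Algorithm 1 (WIAP).  p is a preference sequence expressed as a
    permutation of 1..N relative to the benchmark sequence O (already sorted, step 1).
    The reading of 'inversion of P(i)' and the WI normalisation are assumptions."""
    n = len(p)
    wi, ap, c = 0.0, 0.0, 0
    for i in range(1, n):                                   # steps 3-10
        inv = sum(1 for j in range(i, n) if p[j] < p[i - 1])
        wi += inv / i
        if p[i - 1] == i:                                   # element in its benchmark position
            c += 1
            ap += c / i
    if p[n - 1] == n:                                       # steps 11-14
        c += 1
        ap += c / n
    worst = sum((n - i) / i for i in range(1, n)) or 1.0    # step 15: WI of reversed sequence
    return (1 - wi / worst) * (ap / n)                      # step 16, with AP / N per Formula (4)
```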
4 Experiments
In this section, some experiments are conducted on four similarity measures (Cosine,
Pearson, Inversion and WIAP) to measure the performance of the proposed similarity measure.
Fig. 1. Recall of the four similarity measures (Cosine, Pearson, Inversion and WIAP) at different matrix filling ratios
From Fig. 1, it can be seen that the proposed similarity measure WIAP outperforms
all the other similarity measures (Cosine, Pearson and Inversion) at different
matrix filling ratios. In particular, WIAP recalls almost all the items on the full
matrix. Besides, the recall of WIAP increases rapidly along with the rise of the
matrix filling ratio.
Table 1. Recall of the four similarity measures on three benchmark datasets at different sampling ratios

Dataset        Measure     20%     40%     60%     80%     100%
Movielens      Cosine      0.1444  0.2106  0.3400  0.3853  0.4790
               Pearson     0.3343  0.3723  0.4296  0.4473  0.4671
               Inversion   0.2893  0.2715  0.4195  0.3829  0.3942
               WIAP        0.3949  0.3725  0.4311  0.4469  0.4987
Jester         Cosine      0.5735  0.5059  0.4715  0.4621  0.4356
               Pearson     0.6235  0.5695  0.5805  0.5790  0.5644
               Inversion   0.6012  0.5108  0.4977  0.5037  0.4867
               WIAP        0.7302  0.8138  0.9220  0.9647  0.9814
BookCrossing   Cosine      0.9222  0.5204  0.2587  0.1566  0.1127
               Pearson     0.9970  0.9508  0.8618  0.6873  0.5167
               Inversion   0.9970  0.9529  0.8512  0.6564  0.4664
               WIAP        0.9970  0.9534  0.8715  0.6863  0.5128
Average1       Cosine      0.5467  0.4123  0.3567  0.3347  0.3424
               Pearson     0.6516  0.6309  0.6240  0.5712  0.5161
               Inversion   0.6292  0.5784  0.5895  0.5143  0.4491
               WIAP        0.7074  0.7132  0.7415  0.6993  0.6643

1 Average represents the average recall of each measure on the three benchmark datasets.
Table 1 contains four blocks: the first three blocks are the original recalls on the
three benchmark datasets and the last block is their average recalls. In all 15 cases
of the first three blocks, the proposed similarity measure WIAP achieves much better
performance than the other similarity measures in 12 cases. In the 80% case of the
Movielens dataset and the 80% and 100% cases of the BookCrossing dataset, Pearson
is just a little better than WIAP. Actually, the values of the Movielens dataset and the
BookCrossing dataset are just a few discrete numbers, which may not be suitable to be
seen as preference sequences. Thus, WIAP may not achieve the best performance in
some of those cases. For the Average block of Table 1, it can be seen that WIAP
achieves much better performance than the other similarity measures in all the cases.
Besides, for the BookCrossing dataset, there is an interesting phenomenon: on some
sparser matrices, all the similarity measures achieve higher recalls. That is because the
matrices are so sparse that there are not enough books to be recommended, and the
similarity measures can recall the very few books with a higher proportion. Generally
speaking, it can be concluded that WIAP outperforms the three other similarity measures
when measuring preference sequences.
5 Conclusions
References
1. Ahn, H.: A new similarity measure for collaborative filtering to alleviate the new user cold-starting problem. Information Sciences 178(1), 37-51 (2008)
2. Alsaleh, S., Nayak, R., Xu, Y., Chen, L.: Improving matching process in social network using implicit and explicit user information. In: Du, X., Fan, W., Wang, J., Peng, Z., Sharaf, M.A. (eds.) APWeb 2011. LNCS, vol. 6612, pp. 313-320. Springer, Heidelberg (2011)
3. Chowdhury, G.: Introduction to modern information retrieval. Facet Publishing (2010)
4. Cormen, T., Leiserson, C., Rivest, R., Stein, C.: Introduction to algorithms, 3rd edn. MIT Press and McGraw-Hill (2009)
5. Hofmann, T.: Latent semantic models for collaborative filtering. ACM Transactions on Information Systems (TOIS) 22(1), 89-115 (2004)
6. Jung, S., Hong, J., Kim, T.: A statistical model for user preference. IEEE Transactions on Knowledge and Data Engineering 17(6), 834-843 (2005)
7. Linden, G., Smith, B., York, J.: Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing 7(1), 76-80 (2003)
8. Park, M.-H., Hong, J.-H., Cho, S.-B.: Location-based recommendation system using Bayesian user's preference model in mobile devices. In: Indulska, J., Ma, J., Yang, L.T., Ungerer, T., Cao, J. (eds.) UIC 2007. LNCS, vol. 4611, pp. 1130-1139. Springer, Heidelberg (2007)
9. Su, X., Greiner, R., Khoshgoftaar, T., Zhu, X.: Hybrid collaborative filtering algorithms using a mixture of experts. In: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, pp. 645-649. IEEE Computer Society (2007)
10. Su, X., Khoshgoftaar, T.: A survey of collaborative filtering techniques. In: Advances in Artificial Intelligence 2009, p. 4 (2009)
Visually Extracting Data Records
from Query Result Pages
Abstract. Web databases are now pervasive. Query result pages are dynamically
generated from these databases in response to user-submitted queries. Automat-
ically extracting structured data from query result pages is a challenging prob-
lem, as the structure of the data is not explicitly represented. While humans have
shown good intuition in visually understanding data records on a query result
page as displayed by a web browser, no existing approach to data record extrac-
tion has made full use of this intuition. We propose a novel approach, in which
we make use of the common sources of evidence that humans use to understand
data records on a displayed query result page. These include structural regularity,
and visual and content similarity between data records displayed on a query result
page. Based on these observations we propose new techniques that can identify
each data record individually, while ignoring noise items, such as navigation bars
and adverts. We have implemented these techniques in a software prototype, rEx-
tractor, and tested it using two datasets. Our experimental results show that our
approach achieves significantly higher accuracy than previous approaches.
Fig. 1. Query Result Page from HMV.com with visual blocks highlighted
integrate the data and display it to the user. Other practical applications include flight
and hotel booking sites, financial product comparisons, property sales and rentals.
In this paper, we focus on the problem of data record extraction, that is, to identify
the groups of data items that make up each data record on a query result page.
Data records on a query result page display regularity in their content, structure and
appearance. They exhibit structural and visual similarities: that is, they form visual pat-
terns which are repeated on the page. This is because data records on the same site
are often presented using the same template. The displayed data is in an underlying
database, and as the data items for each record are retrieved from the database (in re-
sponse to the users query), the same template is used each time to present the record.
Much of the existing work that deals with the extraction of data records is based on the
theme of identifying repeated patterns. The common premise is to find and use the re-
peated patterns of data records. The main differences between the existing approaches
are where they look for these patterns and how they use them in data extraction.
Early approaches [1,4] identify repeated patterns in the HTML source code of multi-
ple training pages from a data source in order to infer a common structure or template,
and use it to extract structured data from new pages from the same source. Other ap-
proaches [3,15] identify repeated patterns in the source code of a query result page
in order to directly extract data records and align data items in them. However, these
approaches are all limited by the rapidly increasing complexity of source code. For in-
stance, the widespread use of Javascript libraries, such as jQuery, can make source code
structurally much more complex; thus these approaches start to fail.
Later approaches [9,16,18,19] use the tag tree representation of a web page to iden-
tify repeated patterns in the subtrees of the tag tree. This representation is useful be-
cause it provides a hierarchal representation of the source code. However, the tag tree
was designed for the browser to use when displaying the page, and unfortunately does
not accurately resemble the structure of the data records on the displayed page. The use
of scripts and other runtime features contributes further to the differences between the
structure of the tag tree and the dynamically displayed web page.
VBM vs. Tag Tree. While the VBM and the tag tree are related, they are not equiva-
lent. The tag tree is a complex representation of the HTML code of the page, created
for the browser to interpret and is only part of the information required to render the
page as the designer intended. We choose to use the VBM in preference to the tag tree
for data record extraction for a number of reasons. First, nodes that are close together
on the tag tree may be spatially far apart on the displayed page and vice-versa. By con-
sidering only the visual blocks in the VBM, our approach can see the result page in
the same way that a human can. Crucially, this is how the page was designed: inferred
relationships, such as a group of data items that form a data record, are much easier
to identify in a visual context. Second, our approach is insulated from developments in
coding practices and standards. Our approach relies on the rendering engine, accord-
ingly we are at liberty to make use of the best engine without the need to adapt our
VBM.
For example, as shown in green in Figure 1, the visual blocks that contain the DVD
title in both records on the query result page have the same visual properties and are,
therefore, visually similar.
Definition 2. Width Similarity: Two visual blocks have similar widths if the width
properties for both blocks are within a threshold of 5 pixels of each other.
For example, as shown in blue in Figure 1, the visual block that contains all of the data
items of the first record, is the same width as the visual block that contains all of the
data items of the second record. We say these blocks have similar widths.
Our approach also needs to decide if two blocks have block content similarity, that
is, they have similar sets of child blocks. Our observation is that two record blocks
have a high degree of block content similarity. This is because both record blocks are
created from the same template, repeated for each record on the page. For example,
the two record blocks, as shown in Figure 1, contain a large number of visually similar
blocks. In contrast, our observation is that there is little block content similarity between
a record block and a noise block. For example, the container block for the "sort by"
options, which is the same width as, and directly above, the record blocks in Figure 1,
does not share any visually similar child blocks with the record blocks.
We use a variant of the Jaccard index [13] to measure block content similarity be-
tween two visual container blocks. The index ranges between 0 and 1, where 1 means
that the two blocks are identical, and 0 means they have nothing in common. In our
approach, we consider that each container block contains a set of child blocks. We can
then measure block content similarity between two container blocks by a similarity
index between two corresponding sets of visual child blocks.
Definition 3. Block Content Similarity: Two visual container blocks have similar block
contents if they have a similarity index above a preset threshold.
For example, as shown in blue in Figure 1, the visual block that contains all of the data
items of the first record has a large number of child blocks that have the same visual
appearance as those child blocks in the visual block that contains all of the data items
of the second record. We say these blocks have block content similarity. In Section 4.2,
we formalise our usage of the similarity index.
Fig. 2. Examples of Spatially Related Visual Blocks
Fig. 3. A Clockwise Ulam Spiral Encountering a Basic Block
The goal of seed block selection is to identify a single basic visual block from the VBM
which is part of a single data record. Let us look at the organisation of a query result
page. Since the Western reading order is from top to bottom, and from left to right,
it follows that the data records should start in the top left of the page. However, most
web pages have common navigation menu and header structures that appear around the
edges of the page. Thus, the starting point of the data records can be shifted down and to
the right. As human readers expect this convention, they start looking for data records
in this area of the page. A study on web usability by eye tracking [12] confirms that the
highest priority area for content is between the centre and the top left of the page.
Our approach starts at the centre of the page, furthest from the noise blocks at the
edges and closest to the highest priority area for data records. We trace a clockwise
Ulam Spiral [5], as shown in Figure 3, which naturally grows from the centre towards
the top left of the page. This is the area of the page most likely to contain data records.
The Ulam Spiral was specially selected as it covers the largest possible proportion of
the highest priority area, before it reaches the edges of the page. A simple plane between
the centre of the page and the top left corner of the page has, on the other hand, the
potential to miss the basic blocks belonging to sparsely populated data records. Instead
it could quickly reach the edge of the page and select as the seed a basic block belonging
to a noisy feature such as a left menu.
The exponential growth of the spiral combined with its direction of travel (clockwise)
ensures that it shows bias to the area between the centre and the top left of the page
thereby covering more of the highest priority area than is possible for a simple plane
cover. As shown in Figure 3, the spiral terminates when it first encounters a basic block.
This block is taken as the seed block.
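The seed-block search can be pictured as the following sketch, which walks a clockwise square spiral outward from the page centre and stops at the first basic block it hits. The block representation (left, top, width, height) and the step size are illustrative assumptions, not details from the rExtractor implementation.

```python
def find_seed_block(basic_blocks, page_width, page_height, step=10):
    """Return the first basic block encountered by a clockwise spiral from the centre.
    basic_blocks: list of (left, top, width, height) rectangles (assumed layout)."""
    def hit(x, y):
        for block in basic_blocks:
            left, top, width, height = block
            if left <= x <= left + width and top <= y <= top + height:
                return block
        return None

    x, y = page_width / 2, page_height / 2      # start at the page centre
    dx, dy = step, 0                            # first move heads right, then turns clockwise
    run, moved, turns = 1, 0, 0
    while 0 <= x <= page_width and 0 <= y <= page_height:
        block = hit(x, y)
        if block is not None:
            return block
        x, y = x + dx, y + dy
        moved += 1
        if moved == run:                        # completed one side of the spiral: turn clockwise
            moved = 0
            dx, dy = -dy, dx                    # 90-degree clockwise turn (y grows downward)
            turns += 1
            if turns % 2 == 0:
                run += 1                        # spiral arms lengthen every two turns
    return None                                 # spiral left the page without finding a block
```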
The goal of data record selection is to identify a set of container blocks from the VBM,
one block for each of the data records on the query result page.
The seed block is contained inside a number of container blocks, each of which pro-
vides a structure on the page. Examples of these container blocks are shown in Figure 1,
highlighted in blue. By isolating only the container blocks in which the seed block is
actually contained, our approach identifies the set of candidate record blocks, as shown
in Figure 1, highlighted in red. We observe that one of these candidate record blocks is
the record block for the data record and, furthermore, this visual block has a similar
width to the record block of each of the records on the same query result page.
The seed block is a basic block, which is contained inside one or more of the container
blocks. As shown in Figure 1, the seed block, which was selected in the previous step, is
contained inside four container blocks, highlighted in red. As one of these blocks is the
record block, all four are taken as the candidate record blocks. Next, our approach filters
the set of all container blocks on the page, discarding any block that is not the same
width as one of the candidate record blocks. Our approach uses a one-pass algorithm to
cluster the filtered container blocks into a strict partition based on block width. This step
creates a number of clusters, one of which contains the record blocks. In our example,
the algorithm would create four clusters, one for each of the candidate record blocks.
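A possible rendering of this filtering and one-pass clustering step is sketched below; the dictionary-based block representation is an assumption on our part, and the 5-pixel tolerance is reused from Definition 2.

```python
def cluster_by_width(container_blocks, candidate_widths, tolerance=5):
    """Keep only container blocks whose width matches a candidate record block
    (within the 5-pixel tolerance of Definition 2), then partition them by width
    in a single pass.  Blocks are assumed to be dicts with a 'width' field."""
    clusters = {}                                    # representative width -> list of blocks
    for block in container_blocks:
        w = block["width"]
        if not any(abs(w - cw) <= tolerance for cw in candidate_widths):
            continue                                 # discard blocks that match no candidate
        for rep in clusters:                         # one pass: reuse a cluster within tolerance
            if abs(w - rep) <= tolerance:
                clusters[rep].append(block)
                break
        else:
            clusters[w] = [block]                    # start a new cluster for this width
    return list(clusters.values())
```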
Select one child block from each cluster in Ac as its representative, so we have a set of
representative child blocks, Ax , for Ac :
Given two sets of visual blocks, A and B, we use our similarity measure to find the
block content similarity between A and B, defined as follows:

  SimBlockContent(A, B) = |A ∩ B| / |A ∪ B|   (3)

We define an indicator function for Ax as follows:
We then have:

  |A ∩ B| = Σ_{i=1}^{m} Σ_{j=1}^{n} min{1_A(x_i), 1_B(y_j)} | Sim(x_i, y_j) = 1   (6)

  |A ∪ B| = Σ_{i=1}^{m} Σ_{j=1}^{n} max{1_A(x_i), 1_B(y_j)} | Sim(x_i, y_j) = 1 + |A − B| + |B − A|   (7)

where

  |A − B| = Σ_{i=1}^{m} Σ_{j=1}^{n} 1_A(x_i) | Sim(x_i, y_j) = 0

and

  |B − A| = Σ_{i=1}^{m} Σ_{j=1}^{n} 1_B(y_j) | Sim(x_i, y_j) = 0
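The similarity index can be sketched as below. This is our interpretation of Eqs. (3)-(7): two child blocks count as the same element when they are visually similar, and the threshold value is an assumption (the text above only says it is a preset threshold).

```python
def sim_block_content(a_children, b_children, visually_similar, threshold=0.6):
    """Jaccard-style block content similarity between two container blocks.
    visually_similar(x, y) is assumed to be a predicate built from the
    visual-similarity definitions above; threshold is an assumed preset value."""
    matched_a = {i for i, x in enumerate(a_children)
                 if any(visually_similar(x, y) for y in b_children)}
    matched_b = {j for j, y in enumerate(b_children)
                 if any(visually_similar(x, y) for x in a_children)}
    intersection = min(len(matched_a), len(matched_b))   # child blocks with a visual counterpart
    only_a = len(a_children) - len(matched_a)            # |A - B|
    only_b = len(b_children) - len(matched_b)            # |B - A|
    union = max(len(matched_a), len(matched_b)) + only_a + only_b
    index = intersection / union if union else 0.0
    return index, index >= threshold                     # Definition 3: compare to the threshold
```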
Only one of the candidate record blocks represents the container blocks that provide the
structure for each data record on the page. The other candidate record blocks represent
container blocks that are used to provide structure to other areas of the page. By selecting
the candidate record block that has content similarity to the maximum number
of container blocks, our approach identifies the blocks for each data record.
However, our example demonstrates a scenario where two candidate record blocks
both correspond to the maximum number of container blocks. In this case the candidate
record blocks C and D in Figure 1 both have similar contents to the same number of
container blocks, as both blocks are present for each displayed record. In the event
of a tie for maximum blocks, our approach selects the candidate record block which
represents the widest container block, as these container blocks provide the structure to
the whole data record (for example, container block labelled D), rather than the internal
structure provided by the smaller blocks (for example, container block labelled C).
5 Experimental Results
We have implemented our algorithms for data record extraction in a software prototype
called rExtractor. In this section we first describe the two datasets that we use to eval-
uate the accuracy of our algorithms, and then discuss the performance metrics used to
interpret the results of these experiments. We compare the performance of our proto-
type, rExtractor, with that of ViNTs [17], a state of the art extraction system available
on the web (our experimental analysis makes use of the online prototype), which is
based on both visual content features and HTML tag structures.
5.1 Datasets
In our experiments we use two datasets, DS1 and DS2. DS1 is used to compare our
method with ViNTs [17]. As there is no standard dataset for experimental analysis of
data record extraction techniques, DS1 comprises web sites taken from the third-party
list of web sites contained in Dataset 3 presented in [17]. The web pages contained in
Dataset 3 were downloaded from the Internet in 2004; as a result they are stale, that
is, they are not representative of the layout and appearance of modern web pages. From
a visual standpoint, modern pages are more vivid than older pages: they contain more
images and much more noise, in the form of adverts and menus for example. Therefore,
for each web site from Dataset 3 we visited the same web site and downloaded five
modern equivalent query result web pages. DS1 comprises 50 web pages from 50 web
sites, one web page per web site. In total, DS1 contains 752 data records.
DS2 contains a total of 500 web pages from 100 web sites, five per site. It is based
on Dataset 1 and Dataset 2 in [17], as well as the list of web sites in [6]. As with the
previous dataset, the source web pages used to compile DS2 are stale. Accordingly, for
each web site included in DS2 from Dataset 1 and Dataset 2, we visited the same web
site and downloaded a modern equivalent query result page. The web sites in [6] are
listed by their query interface, for each web site included in DS2, we used the query
interface to generate five modern query result pages. The web sites in our datasets are
drawn from a number of domains, including Books, Music, Movies, Shopping, Proper-
ties, Jobs, Automobiles and Hotels. In total, DS2 contains 11,458 data records.
page. Finally, we calculate the average precision and recall for all of the web sites in
DS1 for both ViNTs and rExtractor. Next, we undertake the same procedure for DS2
and rExtractor, to determine how rExtractor performs on a larger dataset. The results
for both DS1 and DS2 are presented in Table 1.
As we can see from Table 1, the performance of rExtractor is considerably better than
that of ViNTs. Our analysis of the results for ViNTs on DS1 found that ViNTs identified
a large number of false positive data records (427 in total). This contributed greatly to
their low precision score. By inspecting these false positives, we discovered that ViNTs
frequently selected the wrong sub-section of the page as the data-rich section section
(DRS). For example, their approach often selected a centrally located, or near centrally
located, page navigation menu (for instance, a large footer menu) and then extracted
each menu item as a false positive record. In these cases, it was then impossible for their
technique to extract the correct data records. Consequently, the number of true positive
data records they identified was also small, which contributed to their low recall score.
This is not a criticism of ViNTs. Their technique worked very well on the web pages
that were available when ViNTs was developed; rather it serves to highlight that modern
web pages are vastly different from older web pages.
Conversely, the seed block technique implemented by rExtractor selected correctly a
child block belonging to a data record in all of the test cases. This demonstrates that our
technique has no need to explicitly identify a DRS. Furthermore, as we can see from
Tables 1 and 2, rExtractor performs extremely well on both DS1 and DS2 (more than
12,000 data records in total). The experimental results show that the techniques imple-
mented in rExtractor to identify candidate record blocks, measure block similarity and
select record blocks are all robust and highly effective. Therefore, it is more worthwhile
to investigate the cases where rExtractor failed to correctly extract data records.
First, a small number of data records (1.3% in total) were not displayed in record con-
tainer blocks. rExtractor relies on these blocks to define the record boundaries. Close
inspection of web pages containing these data records reveals they use out-dated page
design conventions and are very much the exception rather than the norm.
Second, our content similarity technique prevented the extraction of a small number
of data records. These records were determined to have not enough in common with
other records on the same page. For instance, designers use a different structural layout
and visual appearance to distinguish between a normal record and a featured record. In
the cases where rExtractor failed, it was often because a featured record was excluded.
We see this as an opportunity to develop further our content similarity technique.
6 Conclusions
This paper presents a novel approach to the automatic extraction of data records from
query result pages. Our approach first identifies a single visual seed block in a data
record, and then discovers the candidate record blocks that contain the seed. Next the
container blocks on the page are clustered and a similarity measure is used to identify
which of these blocks have similar contents to each candidate record block. Finally, our
approach selects the cluster of visual blocks that correspond to the data records. We
plan to extend our approach so that the similarity threshold, used to determine if two
blocks have similar contents, is set automatically by a machine learning technique.
References
1. Arasu, A., Garcia-Molina, H.: Extracting structured data from web pages. In: SIGMOD Conference, New York, NY, USA, pp. 337-348 (2003)
2. Cai, D., Yu, S., Wen, J., Ma, W.-Y.: Extracting content structure for web pages based on visual representation. In: Zhou, X., Zhang, Y., Orlowska, M.E. (eds.) APWeb 2003. LNCS, vol. 2642, pp. 406-417. Springer, Heidelberg (2003)
3. Chang, C.-H., Lui, S.-C.: IEPAD: information extraction based on pattern discovery. In: WWW Conference, New York, NY, USA, pp. 681-688 (2001)
4. Crescenzi, V., Mecca, G., Merialdo, P.: RoadRunner: Towards automatic data extraction from large web sites. In: VLDB Conference, San Francisco, CA, USA, pp. 109-118 (2001)
5. Prime spiral (2012), http://mathworld.wolfram.com/PrimeSpiral.html
6. Tel-8 query interfaces (2004), http://metaquerier.cs.uiuc.edu/repository/datasets/tel8/
7. Jakob Nielsen - usable i.t (2002), http://www.useit.com/alertbox/20021223.html
8. WebKit - layout engine, http://www.webkit.org/
9. Liu, B., Grossman, R., Zhai, Y.: Mining data records in web pages. In: SIGKDD Conference, New York, NY, USA, pp. 601-606 (2003)
10. Liu, W., Meng, X., Meng, W.: ViDE: A vision-based approach for deep web data extraction. IEEE Transactions on Knowledge and Data Engineering 22, 447-460 (2010)
11. Miao, G., Tatemura, J., Hsiung, W.-P., Sawires, A., Moser, L.E.: Extracting data records from the web using tag path clustering. In: WWW Conference, pp. 981-990 (2008)
12. Nielsen, J., Pernice, K.: Eyetracking Web Usability, 1st edn., pp. 97-110. New Riders (2010)
13. Real, R., Vargas, J.M.: The probabilistic basis of Jaccard's index of similarity. Systematic Biology 45, 380-385 (1996)
14. Simon, K., Lausen, G.: ViPER: augmenting automatic information extraction with visual perceptions. In: CIKM Conference, New York, NY, USA, pp. 381-388 (2005)
15. Wang, J., Lochovsky, F.H.: Data extraction and label assignment for web databases. In: WWW Conference, New York, NY, USA, pp. 187-196 (2003)
16. Zhai, Y., Liu, B.: Web data extraction based on partial tree alignment. In: WWW Conference, New York, NY, USA, pp. 76-85 (2005)
17. Zhao, H., Meng, W., Wu, Z., Raghavan, V., Yu, C.: Fully automatic wrapper generation for search engines. In: WWW Conference, New York, NY, USA, pp. 66-75 (2005)
18. Zhao, H., Meng, W., Yu, C.: Automatic extraction of dynamic record sections from search engine result pages. In: VLDB Conference, pp. 989-1000 (2006)
19. Zhao, H., Meng, W., Yu, C.: Mining templates from search result records of search engines. In: SIGKDD Conference, New York, NY, USA, pp. 884-893 (2007)
Leveraging Visual Features
and Hierarchical Dependencies
for Conference Information Extraction
1 Introduction
As we all know, thousands of conferences, symposiums and workshops are held
all around the world every year. The conference Web site is a main and ocial
platform to share and post related conference information which can be accessed
by people everywhere. With the increase of conference Web pages, it becomes
a cumbersome and time-consuming job for researchers to collect conference in-
formation and keep track of the hot research topics. Furthermore, people some-
times are more interested in discovering the future trends of the research field,
Y. Ishikawa et al. (Eds.): APWeb 2013, LNCS 7808, pp. 404416, 2013.
c Springer-Verlag Berlin Heidelberg 2013
analyzing the social networks of scholars and the inner relationships among these
conferences. Building a repository of conference information can satisfy all the
above requirements and many value-added services and applications can be de-
veloped based on the repository. The ArnetMiner1 system [1] is a good example of
academic social networking based on the publication information repository. In or-
der to automatically and eectively archive clean and high quality academic data,
it is essential to extract useful academic information from the conference Web
pages, which is more up-to-date, comprehensive and reliable than other sources.
Web information extraction is a classical problem, which aims at identifying
interested information from unstructured or semi-structured data in Web pages,
and translates it into a semantic clearer structure. During the past years, many
techniques such as DOM structure analysis [2], visual based methods [3,4], ma-
chine learning methods [3,4] have been devised for Web information extraction.
Likewise, this paper studies the problem of automatically extracting structured
academic information from conference Web pages.
In the area of conference Web page extraction, most traditional approaches
focused on the use of visual features in the Web page. When people design
conference Web pages, they usually follow some implicit rules about how they
structure certain types of information on the page. So, it is easy to understand
that visual features are in some sense more stable than content features. VIsion-
based Page Segmentation (VIPS) algorithm [5] was proposed by Deng Cai et al
to segment the Web page into text blocks. But the segmentation results of VIPS
on conference Web page sometimes are not satisfied due to the ignorance of the
hierarchical dependencies (see the example as shown in Figure 3), thus we aim
to address this segmentation problem by proposing a new hybrid approach com-
bines these two kinds of features to archive better segmentation results. In order
to better organize the segmented text blocks, classification or block labeling pro-
cedures are then employed, after which we can perform the refined information
extraction on the annotated labels. The state-of-the-art models for sequence la-
beling problems include Support Vector Machine (SVM) [6] and Conditional
Random Fields (CRFs) [7], which mainly rely on the sequence structure. Look-
ing into the conference websites, we observe that apparent hierarchical relations
exist between dierent information blocks. For example, the submission dead-
line information always appears directly after the phrase - Important Dates
in a bold style. This observation gives us a hint that we should make full use of
the hierarchical dependencies to archive a better annotation performance. Due
to the fact that the hierarchical relation in conference Web page forms a tree
structure, we therefore introduce a Tree-structured Conditional Random Fields
model [8,9] to do the annotation work.
In summary, we have the following contributions: (1) we propose a hybrid
page segmentation algorithm, which combines the vision-based segmentation
algorithm with the DOM-based segmentation algorithm; (2) we introduce the
Tree-structured Conditional Random Fields model [8,9] to annotate the text
blocks, which making use of visual features, content features and hierarchical
1
http://arnetminer.org/
406 Y. You et al.
2 Related Work
In this section, we present a survey in two aspects: (1) Classification/sequence
labeling problem and (2) Web information extraction.
First, our work is related to a classification or sequence labeling problem in
some sense. Identifying the useful information from Web pages is sometimes
equivalent to annotating the text blocks with dierent predefined labels. There
have been a lot state-of-the-art approaches proposed for this problem, such as
Bayes Network [10], SVM [6], Conditional Random Fields [7] and so on. But
these approaches usually depend on individual features or linear-dependencies in
a sequence of information, while in Web information extraction the information
can be two-dimensional [4] or hierarchically depended [3].
Second, our work belongs to the area of Web information extraction, which
receives a lot of attentions during past years. This work could be categorized
into two types: (1) Template level wrapper induction systems. Several au-
tomatic or semi-automatic wrapper learning approaches based on the templates
have been proposed. Typical representatives are RoadRunner [11], EXALG [12]
and so on. (2) Visual feature assisted techniques. In contrast, the dier-
ent display styles of dierent parts in a Web page provides an additional means
for segmenting content blocks. For example, [13,3] reported that visual features,
such as width, height, font, etc. are useful in page segmentation.
The limitations of existing Web information extraction methods are:
Some rule-based Web information extraction techniques are not scalable for
the heterogeneity of dierent conference Web pages.
Previous studies may be eective for sequence of text blocks as a linear
model. But they are not suitable for conference Web information extraction,
where the information embedded in the page always follows a tree structure.
Traditional methods can only extract information from a single Web page,
but can not integrate the useful information of a conference that are located
in multiple webpages.
events (e.g. conference name, time, location, submission deadline and submis-
sion URL. (2) Information about conference topics (e.g. call for papers and
topics of interests). (3) Information about related people and institutes
(e.g. chairs, program committee, authors, companies and universities).
(a) (b)
Fig. 2. The layout structure and vision-based structure of a conference Web page
After investigating lots of conference Web pages, we find that the most useful
text blocks appear in tag <p>, <table>, <ul>, <h>, <a> or in text node,
while these nodes are always located at the same level of the DOM tree and
adjacent to each other. Therefore, we define these six types of HTML nodes as
Content Nodes. A DOM-based page segmentation algorithm has two phases: (1)
performing a depth-first traversal of the DOM tree, and saving useful Content
Nodes based on rules showed in Table 1. (2) re-dividing and reorganizing saved
nodes into semantic blocks based on rules shown in Table 2.
According to the analysis above, we propose a hybrid page segmentation al-
gorithm combining the VIPS and the DOM-based algorithm. Algorithm 1 shows
the detail of the algorithm. First, it segments an input web page into visual
blocks using the VIPS algorithm (line 1). Second, it removes the noise blocks
(navigation blocks, copyright blocks, etc.) using some heuristic rules (line 2).
Third, it segments the input web page using the DOM-based algorithm (line
3). Finally, it combines the text blocks produced by the VIPS and DOM-based
algorithm to archive a more complete segmentation result (line 4).
Based on the above definition, the main goal is to calculate p(l|b). Here, we
introduce TCRFs [8,9] to compute it, which will be described in the following.
Fig. 4. The text block tree of the Web page showed in 2(a)
edge features into account. We use f (bp , bc ), f (bc , bp ), and f (bs , bs ) to denote
the current text blocks parent-child dependency, child-parent dependency and
sibling dependency, respectively.
We use f to represent both edge feature function and vertex feature function; c
to represent both edge and vertex; and to represent two types of parameters:
and . Thus, the derivative of log-likelihood function with respect to parameter
j associated with clique index c is:
( )
L # # (i) (i)
## (i) (i)
= fj (c, y(c) , x ) p(yc |x )fj (c, y(c) , x ) , (4)
j i c y c
(i)
where y(c) is the annotation result of clique c in x(i) , and y(c) ranges over
annotation results to the clique c. p(y(c) |x(i) ) requires the computation of the
marginal probabilities. We utilize the Tree Reparameterization (TRP) algorithm
[14] to compute the approximate probabilities of the factors. The function L
can be optimized by several techniques; but in this paper, we adopt gradient
based Limited-memory BFGS method [15], which outperforms other optimiza-
tion techniques for linear-chain CRFs [16].
Post-processing. The text blocks annotation results may directly aect the
quality of the final conference academic information extraction result. In order
to improve the quality of the annotation results, the post-processing is essential.
Known from the conference Websites, the text blocks can be divided into five
groups according to their display characteristics and contents, i.e., Title, Data
Item/Date/Item, Topic/List no text, Committee Person, Yes Text/No Text. We
summarize some heuristic rules for post-processing for dierent label groups,
and repair text blocks with wrongly annotated labels from the following two
aspects automatically. (1) A block has special features but cannot be annotated
correctly. For example, given a text block Camera Ready Papers Due: Thursday,
August 11, 2011, which have typical DI features. However, this block is classified
as YT. For this situation, it can be repaired by identifying some typical combina-
tion of features. This method is suitable for text blocks with clear features, such
as DI/D/I, TO and CP. (2) A text block is classified as a category but it does
not contain corresponding features. For example, text blocks about submission
instruction are classified as YT. Although it contains a lot of words, it usually
does not contain any people name, date or location. Therefore, we can check its
Leveraging Visual Features and Hierarchical Dependencies for Conference IE 413
6 Experiments
The system is mainly implemented in Java, and the Web page segmentation
module is implemented in C#. We also use MALLET2 , a Java-based package
for machine learning applications. Our experiments are conducted on a PC with
2.93GHz CPU, 4GB RAM and Windows 7. Our system can process a conference
Web page in average of 0.80 seconds, which contains 0.18 seconds for segmenting,
0.62 seconds for annotating and 0.002 seconds for post-processing.
The object of the first experiment is to compare our hybrid segmentation al-
gorithm with the baseline VIPS algorithm. To verify the eectiveness of our
algorithm, we call these two algorithms for conference Web pages respectively,
and take the manually segmented results as the ground truth. If a text block
outputted by an algorithm is the same as the human segmented one, we call
it a correct block. Table 4 shows the count number of correct text blocks of a
conference Web site using the two segmentation algorithm respectively. From
the results, we can find that our hybrid segmentation algorithm can get more
correct text blocks than the VIPS segmentation algorithm. Especially for confer-
ence DASFAA-11, VIPS get a poor result because the visual separators between
two text blocks is not so obvious, while our hybrid segmentation algorithm gets
a satisfactory result.
2
http://mallet.cs.umass.edu/
414 Y. You et al.
7 Conclusions
This paper has proposed a hybrid approach to extract academic information
from conference Web pages. Particularly, dierent from existing approaches, our
method combined visual feature and hierarchical structure analysis. Meanwhile,
we also developed heuristic-rule based post-processing techniques to further im-
prove the annotation results. Our experimental results based on the real world
data sets have shown that the proposed method is highly eective and accurate,
and it is able to achieve average 94% precision and 93% recall.
References
1. Tang, J., Zhang, J., Zhang, D., Yao, L., Zhu, C., Li, J.Z.: Arnetminer: An expertise
oriented search system for web community. In: Semantic Web Challenge. CEUR
Workshop Proceedings, vol. 295 (2007)
2. Sun, F., Song, D., Liao, L.: Dom based content extraction via text density. In:
SIGIR, pp. 245254 (2011)
3. Zhu, J., Nie, Z., Wen, J.R., Zhang, B., Ma, W.Y.: Simultaneous record detection
and attribute labeling in web data extraction. In: KDD, pp. 494503 (2006)
4. Zhu, J., Nie, Z., Wen, J.R., Zhang, B., Ma, W.Y.: 2d conditional random fields for
web information extraction. In: ICML. ACM International Conference Proceeding
Series, vol. 119, pp. 10441051 (2005)
5. Cai, D., Yu, S., Wen, J.R., Ma, W.Y.: Block-based web search. In: SIGIR, pp.
456463 (2004)
6. Duan, K.-B., Keerthi, S.S.: Which is the best multiclass SVM method? An empir-
ical study. In: Oza, N.C., Polikar, R., Kittler, J., Roli, F. (eds.) MCS 2005. LNCS,
vol. 3541, pp. 278285. Springer, Heidelberg (2005)
7. Laerty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: Proba-
bilistic models for segmenting and labeling sequence data. In: ICML, pp. 282289
(2001)
416 Y. You et al.
8. Bradley, J.K., Guestrin, C.: Learning tree conditional random fields. In: ICML, pp.
127134 (2010)
9. Tang, J., Hong, M., Li, J., Liang, B.: Tree-structured conditional random fields for
semantic annotation. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe,
D., Mika, P., Uschold, M., Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp.
640653. Springer, Heidelberg (2006)
10. Heckerman, D.: A tutorial on learning with bayesian networks. In: Holmes, D.E.,
Jain, L.C. (eds.) Innovations in Bayesian Networks. SCI, vol. 156, pp. 3382.
Springer, Heidelberg (2008)
11. Crescenzi, V., Mecca, G., Merialdo, P.: Roadrunner: Towards automatic data ex-
traction from large web sites. In: VLDB, pp. 109118 (2001)
12. Arasu, A., Garcia-Molina, H.: Extracting structured data from web pages. In: SIG-
MOD Conference, pp. 337348 (2003)
13. Song, R., Liu, H., Wen, J.R., Ma, W.Y.: Learning block importance models for
web pages. In: WWW, pp. 203211 (2004)
14. Wainwright, M.J., Jaakkola, T., Willsky, A.S.: Tree-based reparameterization for
approximate inference on loopy graphs. In: NIPS, pp. 10011008 (2001)
15. Xiao, Y., Wei, Z., Wang, Z.: A limited memory bfgs-type method for large-scale
unconstrained optimization. Computers & Mathematics with Applications 56(4),
10011009 (2008)
16. Sha, F., Pereira, F.C.N.: Shallow parsing with conditional random fields. In:
HLT-NAACL (2003)
Aggregation-Based Probing
for Large-Scale Duplicate Image Detection
1 Introduction
The explosion of web image scale inevitably brings overwhelming number of du-
plicate images. According to a recent study [18], more than 8.1% of web images
have more than ten near duplicates and more than 20% of web images have
duplicate images. Without proper processing, such duplicates can cause redun-
dancies in web search results and reduces user experiences. Therefore, detecting
duplicate images from a large image collection becomes a hot research topic.
As defined in work[17,9], the duplicate image detection task is to find all
visually duplicate image groups from an image collection. According to the def-
inition, it is a cluster style task rather than a search style task. The input is a
gigantic collection of images and the output is a cleaned version of that image
collection(all duplicate ones are grouped into one cluster).
To cope with large scale, eciency is the core consideration in designing algo-
rithms. State-of-the-art approaches usually exploit hash-code representation of
images [9,17] and put images with the same hash-code together. Since duplicate
images could be represented with dierent hash codes, such approaches achieve
high precision but relatively low recall.
Y. Ishikawa et al. (Eds.): APWeb 2013, LNCS 7808, pp. 417428, 2013.
c Springer-Verlag Berlin Heidelberg 2013
2 Related Work
2.1 Duplicate Image Detection
Plenty of works [14,19,9,17,22] are devoted to duplicate image detection. Accord-
ing to dierent feature representations of an image, these works can be classified
into two categories.
Local Method: Typical works[3,9,22,16] are bases on SIFT[12]. The min-hash
[9] is used to speed up the process. The advantage of adopting local features
is that they are invariant to the scale, rotation, position and ane transform
of images. While the drawback is burden on computational resources con-
sumption, which makes it hard to build a system to handle the the billions
of images.
Global Method: [10,15,17] are based on global features. Typical global fea-
tures include color histogram [20], gray value [17] and edge histogram [23].
PCA is useful in duplicate task[6]. Wang et al.[17] proposed a PCA-based
hashing method depending on the global features. In their work, each image
is divided into k k regular grid elements and the corresponding mean gray
values are extracted to get simple but eective visual features. However, it
focuses on the precision but ignores the recall. In our work, we propose an
ecient rank-based probe method to increase the low recall.
3 Framework
In this section, we introduce the framework for the duplicate image detection
task, which contains two steps:
At first, all images in the entire data set will be represented by gray value
and indexed according to their PCA-code; Then we use the PCA codes to divide
the image set into smaller groups, the images in each group share the same
PCA code. We refer to each group as a bucket in the rest of this paper. In the
generation of PCA-code, similar images may vary a lot in their PCA codes. The
slight shift of the features may cause totally dierent code, thus put into dierent
buckets. To improve the performance, the PCA-codes representing similar image
features should be identified and related. The traditional similar metric is the
hamming distance, which means the PCA-codes with small hamming distance
will be grouped together. But it suers a low eciency to probe all images in
the dataset.
To increase the eciency, for each bucket, we generate a probing sequence.
The probing sequence can be regarded as similarity measurement of the buckets.
The later element in the sequence means the less probability that contains the
duplicate image. If we can generate the probing sequence, it could eectively
and eciently find the duplicate image groups.
Recently, a multi-probe method [13] is proposed to generate the probing se-
quence for LSH. Since the Gaussian distribution assumption of PCA is equal to
that of LSH in [13], we can apply [13] to the PCA-code. But [13] method relays
on the query as the input parameter, however in duplicate image detection, there
are more than one images in each bucket, therefore, we cannot apply [13] directly.
We develop a two-step algorithm to handle the problem. First, for each image
in the bucket, we generate the basic probing sequence. Second, we apply the
rank aggregation algorithm to generate the merged probing sequence. Because
the length of the candidate list is limited, it is a partial list rank aggregation.
420 Z. Feng et al.
/SGMK9KZ
,KGZ[XK )UJK
9VGIK 9VGIK
,KGZ[XK+^ZXGIZOUT )UJK-KTKXGZOUT
3KXMKJ6XUHOTM9KW[KTIK (GYOI6XUHOTM9KW[KTIK
*[VROIGZKJ/SGMK-XU[VY
Since we target to cope with large scale web images, its hard and expensive
to acquire the labeled data, thus it is hard to use supervised rank aggregation
method. We turn to an unsupervised score-based method which is displayed in
Figure 1.
4 PCA-Code Component
In this section, we introduce feature extraction and PCA hash-code generation.
For feature extraction, an image is divided into a k k grid, where k is set to 8
from previous work [17]. The mean gray value feature is extracted from each grid
as a robust image representation for duplicate detection. It can tolerate most
image transformations such as resizing, recoloring, compression, and a slight
amount of cropping and rotation. Each feature element is calculated by the
Equation (1)
1 " (i)
Fi = grayj i = 1, 2 . . . k 2 , (1)
Ni
(i)
where grayj is the j th gray value in grid i and Ni is the number of pixels in
grid i. So an image could be represented by a k 2 1 vector.
Before code generation, the raw features should be compressed into a compact
representation with noises reduced. The PCA hash-code algorithm uses a PCA-
based hash function to map a high-dimensional feature space to a low-dimensional
Aggregation-Based Probing for Large-Scale Duplicate Image Detection 421
feature space. The PCA model is used to compress the features extracted from the
image database. After the dimension reduction, visual features are transformed
to a low-dimensional feature space. Then the hash code H for an image is built
as follows: each dimension is transformed to one if its value in this dimension is
greater than mean value, and zero otherwise. This is summarized in Equation (2).
!
1 vi meani
H(i) = (2)
0 vi < meani
where the meani is the mean value of the dimension i. The flowchart of PCA-
code component is shown in Figure 2.
where the k is the mean value of dimension k and k means the standard
deviation of dimension k. It is used to estimate the probability that images in hj
duplicate the query image. A lower score means that hj is more likely to contain
duplicate images. For each image in the starting bucket, we get a basic probing
sequence based on scoreP CA .
It naturally describes the dierence between the two rankers (The result list
V also could be seen as a ranker).
2. Canberra Distance
But in the footrule distance, we dont take into account that the large
elements in the list will have an larger weight than the small elements (i.e.
|Vji Vj | is not weighted). The weighted footrule distance is the Canberra
Distance[7] .
" |Vji Vj |
dis(V i , V / ) = (7)
j
|Vji + Vj |
independently. The formulation for the gradient descent step is listed in the
following equation.
" Vji 2
Vj = Vj {I(Vji Vj ) 2+
i (Vj + Vji )
(10)
Vji 2
I(Vji < Vj ) 2}
(Vj + Vji )
Where I(x) is the Indicator function and is the learning rate(which is usually
setting as 0.01). The smallest k elements in V are the result list.
3. Because the rank adjustment factor of each element in T has relativity, the
problem of Discounted Canberra Distance cannot # be handled by the previous al-
gorithm. We add a restriction that the V = i Wi V i . Then we follow the idea
from the existing lambdarank optimization algorithm[1] (which approximately
optimizes NDCG as its object) and this problem can be optimized directly by
a approximate gradient descent algorithm. The gradient process is listed in the
following equation.
1 ""
Wk = WK {I(Vji Vj ) Vjk
1 + lg P osVj i j
(11)
2 Vji i k
2 Vji
2 + I(V j < Vj ) Vj 2 }
(Vj + Vji ) (Vj + Vji )
where is the learning rate (which is usually setting to 0.01) and I(x) is the
Indicator function. The P osVj will be updated after each iteration process. Then
we use the W to calculate the V and the smallest k elements in V are the result
list.
6 Experiment
6.1 Experimental Setting
The dataset contains 1,003,440 images crawled from the web. It is split into
two parts, the ground truth and the distractor set. The ground truth consists
239 groups of 1260 images. It is labeled manually. The distractor set is selected
randomly from the web with images in the ground truth removed. The size of the
distractor set is 1,002,180. All the images are resized to 160 pixels on their larger
side. We use 100,000 images to train a stable PCA for dimension reduction.
(#CorrectImageP airs)
Recall = 100% (13)
(#GroundT ruthImageP airs)
In the table, all the rank aggregation methods improve the recall and main-
tain the precision. The F Dst and M C4 methods show similar performance and
the CaDst is slightly better than either of those on precision performance.The
improvement of DCaDst on recall performance is higher than all the others.
When we fix the number of buckets at 40, Recall/P recision of Baseline is
only 0.556/0.998 but DCaDis reaches 0.813/0.994, improving the recall by 40%.
The other methods are around 0.775/0.993, which is better than the baseline but
lower than DCaDst.When we fix the recall score at 0.813, Baseline probes 780
buckets and DCaDst probes only 40 buckets. DCaDst probes far fewer buckets
to reach the same recall. The other methods probe more than 50 buckets. This
good performance of DCaDst can be explained by its design principle: it focuses
on the top-K items of the result list and optimizes them. This design is helpful
for optimizing the top-k items of the result list. When the length of result list is
shorter, the performance is more better than the others. So the DCaDst could
probe less buckets than the other methods to reach the same performance.
We conclude that DCaDst is better than the other methods for this task. It
eectively reduces the number of the buckets to be probed and increase the recall
Aggregation-Based Probing for Large-Scale Duplicate Image Detection 427
while preserving the precision performance. It also shows that the score function
in Equation 3 is a reasonable inference, which could reflect the probability of the
bucket containing the duplicate images.
Table 3 shows the stability of our aggregation-based probe algorithm on three
distances when the partial list length is changed.
Table 3. The recall performance with respect to dierent settings. The LR means the
length of the result list and the LP means the length of the partial ordered list.
From Table 3, we can see that the recall performance is stable as the result
list length fixed, which indicated that our proposed methods are robust as the
length of partial ordered list varies. It is a exciting property in our task. So that
we can set the length of the partial ordered list as length of the result list while
maintaining the performance. It could reduce the computation and easy to apply
to the next application.
To sum up, the experiments show that our aggregation-based probe algo-
rithm can eectively increase the recall performance while maintaining the high
precision with high robustness.
References
1. Burges, C.J.C., Ragno, R., Le, Q.V.: Learning to rank with nonsmooth cost func-
tions. In: NIPS, pp. 193200 (2006)
2. Chen, S., Wang, F., Song, Y., Zhang, C.: Semi-supervised ranking aggregation. In:
CIKM, pp. 14271428 (2008)
428 Z. Feng et al.
3. Chum, O., Philbin, J., Zisserman, A.: Near duplicate image detection: min-hash
and tf-idf weighting. In: BMVC (2008)
4. Dwork, C., Kumar, R., Naor, M., Sivakumar, D.: Rank aggregation methods for
the web. In: WWW, pp. 613622 (2001)
5. Fagin, R., Kumar, R., Sivakumar, D.: Ecient similarity search and classification
via rank aggregation. In: SIGMOD Conference, pp. 301312 (2003)
6. Huang, Z., Shen, H.T., Shao, J., Zhou, X., Cui, B.: Bounded coordinate system
indexing for real-time video clip search. ACM Trans. Inf. Syst., 27(3) (2009)
7. Jurman, G., Riccadonna, S., Visintainer, R., Furlanello, C.: Canberra distance on
ranked lists. In: Ranking NIPS 2009 Workshop, pp. 2227 (2009)
8. Klementiev, A., Roth, D., Small, K.: An unsupervised learning algorithm for rank
aggregation. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S.,
Mladenic, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 616
623. Springer, Heidelberg (2007)
9. Lee, D.C., Ke, Q., Isard, M.: Partition min-hash for partial duplicate image discov-
ery. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part I. LNCS,
vol. 6311, pp. 648662. Springer, Heidelberg (2010)
10. Li, Y., Jin, J., Zhou, X.: Video matching using binary signature. In: Intelligent
Signal Processing and Communication Systems, pp. 317320 (December 2005)
11. Liu, Y., Liu, T.-Y., Qin, T., Ma, Z., Li, H.: Supervised rank aggregation. In: WWW,
pp. 481490 (2007)
12. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International
Journal of Computer Vision 60(2), 91110 (2004)
13. Lv, Q., Josephson, W., Wang, Z., Charikar, M., Li, K.: Multi-probe lsh: Ecient
indexing for high-dimensional similarity search. In: VLDB, pp. 950961 (2007)
14. Ponitz, T., Stottinger, J.: Ecient and robust near-duplicate detection in large
and growing image data-sets. In: ACM Multimedia, pp. 15171518 (2010)
15. Qamra, A., Meng, Y., Chang, E.Y.: Enhanced perceptual distance functions and in-
dexing for image replica recognition. IEEE Trans. Pattern Anal. Mach. Intell. 27(3),
379391 (2005)
16. Valle, E., Cord, M., Philipp-Foliguet, S.: High-dimensional descriptor indexing for
large multimedia databases. In: CIKM, pp. 739748 (2008)
17. Wang, B., Li, Z., Li, M., Ma, W.-Y.: Large-scale duplicate detection for web image
search. In: ICME, pp. 353356 (2006)
18. Wang, X.-J., Zhang, L., Liu, M., Li, Y., Ma, W.-Y.: Arista - image search to
annotation on billions of web photos. In: CVPR, pp. 29872994 (2010)
19. Wang, Y., Hou, Z., Leman, K.: Keypoint-based near-duplicate images detection
using ane invariant feature and color matching. In: ICASSP, pp. 12091212 (2011)
20. Zhang, D., Chang, S.-F.: Detecting image near-duplicate by stochastic attributed
relational graph matching with learning. In: ACM Multimedia, pp. 877884 (2004)
21. Zhao, X., Li, G., Wang, M., Yuan, J., Zha, Z.-J., Li, Z., Chua, T.-S.: Integrating
rich information for video recommendation with multi-task rank aggregation. In:
ACM Multimedia, pp. 15211524 (2011)
22. Zhou, W., Lu, Y., Li, H., Song, Y., Tian, Q.: Spatial coding for large scale partial-
duplicate web image search. In: ACM Multimedia, pp. 511520 (2010)
23. Zhu, J., Hoi, S.C.H., Lyu, M.R., Yan, S.: Near-duplicate keyframe retrieval by
nonrigid image matching. In: ACM Multimedia, pp. 4150 (2008)
User Interest Based Complex Web Information
Visualization
Abstract. Web graph allows end user to visualize web information and
its connectivity. But due to its huge size, it is very challenging to visualize
web information from the enormous cyberspace. As a result, web graph
lacks simplicity while the user is searching for information. Clustering
and filtering techniques have been employed to make the visualization
concise but it is often not accurate to the expectation of user because
they do not utilize user-centric information. To deliver the information
concisely and precisely to the end users according to their interest, we
introduce personalized clustering techniques for web information visu-
alization. We propose a system architecture, which considers the user
interests while clustering & filtering, to make the web information more
meaningful and useful to the end users. By the implementation of the ar-
chitecture, an experimental example is provided to reflect our approach.
1 Introduction
It has become very dicult for end users to get their desired information quickly
and accurately from the gigantic World Wide Web. Research attempts to help
the users manage and control the information over the internet by presenting
user-focused information [4] [6] [11]. However, the existing technology is not ade-
quate enough to provide visual maps users expect to guide their web experience.
As a consequence, the next research challenge is to provide focused informa-
tion to the end users for quick and easy understanding of the web information
and its pattern. To provide an easy way to explore information according to
the need of user, techniques of information retrieval and visualization have been
introduced [7]. Information visualization presents the abstract data to amplify
human cognition and unfold the hidden relationships among the data. The pur-
pose of information visualization is not only to create interesting pictures for the
users but also to communicate information clearly and eectively [3] [7].
A web graph, which represents the web information with its internal con-
nectivity revealed, is a very useful way to visualize various aspects of the web
Y. Ishikawa et al. (Eds.): APWeb 2013, LNCS 7808, pp. 429436, 2013.
c Springer-Verlag Berlin Heidelberg 2013
information. Nevertheless, a huge web graph is less likely to create a good im-
pression for the end user compared to a relatively smaller one with information
relevant to the users interests. Therefore, reducing the size of the web graph is
very important from the end user viewpoint. Filtering and clustering techniques
from dierent aspects such as structure, context, have been applied on the web
graph to reduce the size. Besides the structure and content based clustering
techniques, compromise the direct users needs is expected for personalized visu-
alization. First of all, users are dierent and their needs vary. Most of the work
accomplished for personalization of web data is done for a group of people who
can be addressed as similar minded [13]. But it is not obvious that every user of
a specific group behaves similarly all the time. Secondly, short term interests or
preferences of a particular user change over time or in dierent context.
To provide the above personalization in web information visualization, we
propose to model every user separately. Instead of providing information di-
rectly in the traditional way, we introduce user profile driven clustering for web
information personalization. We strongly believe that this can classify the web
information into several categories and yield an improved visualization for the
end user as the information is clustered according to the individuals interest.
The rest of the paper is organized as follows: Section 2 presents related works;
Section 3 describes the proposed architecture for personalized visualization; Sec-
tion 4 presents the experiment; and finally, Section 5 concludes the paper.
2 Related Work
Work has been done in dierent directions to make the representation of web
information more eective to the end users. Some work deals with information
searching through search engines [14] [1], i.e., on search results. For personal-
ization, researchers consider the user search preference learned from the click
history or browsing history [9]; history of similar minded users [13]; location
metadata of the user for personalized search [2]; and other personalizing page
ranking techniques [11]. In the case of navigating in the web space, clustering
techniques plays important roles for reducing the complexity of the web graph.
Structure based clustering can be found in the works of Huang et al. [6] and
Rattigan et al. [12]. They provide structure based clustering techniques to re-
duce the web graph size. Graph clustering approaches like these only consider
the structural aspect of the graph without considering the content of the docu-
ments. Besides structure, Gao [4] integrates content in the process of clustering.
She first clusters using structure and then clusters using content only on the sub-
clusters created in the first phase. As a consequence, this two-phased clustering
heavily depends on the structure. Inclusion of structure in the visualization cre-
ates scopes for non-similar documents to be linked and hence put in the same
cluster. Our work avoids this possibility by considering only content based sim-
ilarities while relating documents. In contrast to previous research approaches,
this work is personalized as we consider the users point of view.
User Interest Based Complex Web Information Visualization 431
3 System Architecture
Our proposed system to resolve the issue of personalized visualization uses data
from user profile as preliminary knowledge-base to assist in filtering, cluster-
ing and hence final visualization. While the user uses the system, the system
automatically updates the existing user profile or creates a new one where ap-
propriate to reflect users latest focusing trends. Finding appropriate scope for
involving the user profiles in the system is very crucial for a personalized system.
Scopes of the user profiles for personalized searching from electronic information
space have been documented in [10]. In our system, after collecting information,
we include the user profile while analysing the web information, i.e., in filtering
& clustering, and finally we visualize the information.
Figure 1 depicts the architecture of the system for personalization of informa-
tion visualization. The key modules of this model are:
i) Result Analyser module (composed of Data Collection, Filtering and
Clustering components) collects the web data based on users request, filters
out irrelevant data, and finally generates a clustered web graph by using the
user profile for a handy and compact visualization.
ii) Visualization module is responsible for visualizing the web graph to pro-
vide a good overview; to help the user to find out the expected information;
and to create scopes for the user to interact with the information.
iii) Feedback Analyser module gets the user interaction data from the vi-
sualization module, then analyses the data, and lists possible updates.
iv) Profiler module gets the updates from Feedback Analyser module and
consults with the existing user profile to keep user profile up-to-date to
reflect users latest interests in order to provide more relevant output after-
wards. This module also gets direct inputs from the user where appropriate,
to construct or modify the user profile.
very less information rather discards them. We filter out all other html tags, any
data except text such as image, audio and video to keep page information simple.
Once crawling is finished, we have the original text data for further processing.
We convert the words to their base forms(for example: running to run) to get
the actual frequencies of words and remove stop words from the content.
After filtering the stop words, we have set of words, still big in size, to represent
a URL. To reduce the size, we apply well known TF-IDF to get the terms of
every URL weighted and then sort them. We take a fixed percentage of top
weighted terms as keywords to make further steps time & cost eective.
Now we have a set of documents in the corpus to be visualized by the web
graph. Every document is composed of a set of (Keyword, W eight) pairs. All
the distinct keywords of these documents form the total set of keywords. At this
point, it is very easy to define the documents as vectors of weighted keywords.
Hence, for the i th document, document vector di as < ti1 , ti2 , ti3 , ..., tin >
where tij is the weight j th keyword where 1 <= j <= n and n is the number
of total distinct keywords in the corpus.
We calculate the edges connecting the nodes for the initial graph based on the
content similarity of the nodes. If we consider two documents, the more keywords
they have common, the more they are connected. The similarity between them
is computed by cosine similarity1 . If the similarity score CSim(dy , dz ) is greater
than a threshold value then we create an edge between the documents dy and
dz . Finally, we have an undirected initial graph for next processing.
To optimize retrieval accuracy, we clearly need to model the user accurately and
personalize the information retrieval according to each individual user, which is
a dicult task. It is even harder for a user to precisely describe what the infor-
mation need is. Unfortunately, in the real world applications, users are usually
reluctant to spend time inputting additional personalized information which can
provide relevant examples for feedback.
Representation of user profiles has been classified into three categories in [5].
The simplest one to represent and maintain is keyword-based user profile where
the keywords are either automatically extracted from the web documents during
a users visit or through direct input by the user. Each keyword can represent
a topic of interest and weight can be associated with it to represent the degree
of interest. The other two categories are semantic network profile which deals to
address the polysemy problem and concept based user profile which consists of
concepts & reveals the relationship between the concepts.
In our system architecture, we adopt the keyword based user profiles for sim-
plicity to represent a user. So the user profile consists of a set of weighted key-
words, i.e., (Keyword, W eight) pairs. User profiles are maintained in two phases
by the Profiler component. In the creation phase, the user data are collected
directly from the user via a user interface. Here the user is allowed to answer
1
http://en.wikipedia.org/wiki/Cosine_similarity
User Interest Based Complex Web Information Visualization 433
some predefined questions for the creation of basic user profile. After collect-
ing the basic data from the user, the system will collect more user data from
the internet according to the users authorization. The maintaining or updating
phase occurs mostly whenever the user uses the system. Profiler updates the
preferences based on user interaction data passed from the Feedback Analyser.
3.3 Clustering
We want to cluster the total corpus into dierent clusters where the documents
will be clustered by biased similarity values based on the user profile. We modify
the document vectors computed in section 3.1 to represent both user interests
and main topics of the document. For this we need to replace the existing doc-
ument terms with the keywords of the user profile where appropriate.
We update the total number of keywords of the corpus by adding the user
profile keywords in the keyword set. If the number of new distinct keywords is m,
the total number of keywords becomes (n + m). Now, we can write the modified
document vector for di as: < Ti1 , Ti2 , ..., Tin , Ti(n+1) , Ti(n+2) , ..., Ti(n+m) >.
We calculate the Ti(n+j) ; where 1 < j <= m; in the following way. We mea-
sure the similarity value of cj , where cj is a keyword from user profile, with
each keyword of the di . We use one of the wordNet similarity measures for this
purpose. There are six similarity measures applied on wordNet, three measures
including Lin [8] are based on information content of the least common sub-
sumer(LCS). Lin yields slightly higher correlation with human judgements than
others, which actuates us to use it instead of others.
If cj is similar to document keyword kil ; where 1 < l <= n; meaning that
(SIM (kl , cj ) Til ) > , we consider that document di is containing keyword
cj instead of the original document keyword. So, we remove the kil by making
the weight Til as 0. We repeat it for every cj of user profile for each di . As a
particular cj can be similar with multiple keywords of a document by dierent
values, we take the maximum similarity value as the weight of that cj .
Hence, the new vector to represent a document, di , reflects the user interests
and the topics of the original content as well. We apply k-means algorithm for
clustering, where the number k and initial means are calculated from the cosine
similarity values of the document vectors. We select di as an initial mean if the
degree of di is greater than the sum of the average degree of the web graph and
a predefined threshold.
URL Keywords
1 - justice, evacuation, screencasts, wireless, concept
2 - centre, management, laboratory, college, group, resource
3 - directory, bing, yahoo, google, hospitality
... ... ...
6 - bachelor, flexibility, convenience, delivery
7 - postgraduate, information, city, night, suite
... ... ...
19 - command, wireless, sheet, layout
for the user to visit the web pages in the right pane. The user can- zoom in/out;
re-arrange the web graph by drag and drop; delete an inappropriate node or
cluster; and create new clusters.
As the user interacts with the visualization interface, it sends the browsing,
click and arrangement information generated from the actions accomplished by
the user to the feedback analyser component. Browsing feature is the primary
contributing feature in profile updating. When user browses a page, the keywords
with their weights along with the dwelling time are sent to the feedback analyser
module, which updates the weights of existing interests and adds the new ones.
While adding the new interests in the user profile, the interest and corresponding
weight-value to be added are shown to the user for any amendment.
4 Experiment
In this section, we provide an experimental example. It begins with showing a
simple web graph. For this example we analyse a university web site and set the
crawl style property as BFS whose termination is set to 1. We get 19 nodes while
crawling and name them according to their appearance number in the crawler.
We get the sets of keywords for representing the nodes after applying techniques
from 3.1, which are shown in Table 1. Figure 2a shows the original web graph.
We get two clusters and eight other nodes which is shown in Figure 2b after
applying content based clustering described in [4].
We also have two user profiles constructed earlier based on two dierent
users. We consider one user profile, U Pa , represents an international student
who looks for opportunity of postgraduate studies in computer science related
fields whether the other, U Pb , is of a local student focusing on undergraduate
studies in business, particularly interested in Hawthorn campus. The corre-
sponding user profiles are given in Table 2a as (Interest, W eight) pairs.
First, we take the user profile U Pa and generate the clusters based on the
methodology described in section 3.3. We get three clusters and eight unclustered
nodes. Second, applying the same process on the initial graph for the user profile
U Pb , we get four clusters and seven unclustered nodes as well. Table 2b shows
how the nodes are assigned in the clusters for all three cases. With no user
User Interest Based Complex Web Information Visualization 435
profile applied, the two clusters are calculated based on their content similarity.
Three and four clusters have been constructed for the user profiles U Pa & U Pb
and are shown in Figure 3a & 3b respectively. From these two visualization
it is noticed that clustering has been heavily influenced by the user profiles.
Rather displaying the same web graph of Figure 2b, our system shows the web
information to dierent users in dierent ways according to their interests.
References
1. Agrawal, R., Gollapudi, S., Halverson, A., Ieong, S.: Diversifying search results. In:
WSDM, pp. 514 (2009)
2. Bennett, P.N., Radlinski, F., White, R.W., Yilmaz, E.: Inferring and using location
metadata to personalize web search. In: SIGIR, pp. 135144 (2011)
3. Card, S.: Information visualization. In: The Human-Computer Interaction Hand-
book: Fundamentals, Evolving Technologies, and Emerging Applications (2007)
4. Gao, J., Lai, W.: Visualizing blogsphere using content based clusters. In: Web
Intelligence and Intelligent Agent Technology, pp. 832835 (2008)
5. Gauch, S., Speretta, M., Chandramouli, A., Micarelli, A.: User profiles for person-
alized information access. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) Adaptive
Web 2007. LNCS, vol. 4321, pp. 5489. Springer, Heidelberg (2007)
6. Huang, X., Eades, P., Lai, W.: A framework of filtering, clustering and dynamic
layout graphs for visualization. In: ACSC, pp. 8796 (2005)
7. Keim, D.A., Mansmann, F., Schneidewind, J., Ziegler, H.: Challenges in visual data
analysis. In: Information Visualization, pp. 916 (2006)
8. Lin, D.: An information-theoretic definition of similarity. In: ICML, pp. 296304
(1998)
9. Matthijs, N., Radlinski, F.: Personalizing web search using long term browsing
history. In: WSDM, pp. 2534 (2011)
10. Micarelli, A., Gasparetti, F., Sciarrone, F., Gauch, S.: Personalized search on the
world wide web. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) Adaptive Web
2007. LNCS, vol. 4321, pp. 195230. Springer, Heidelberg (2007)
11. Qiu, F., Cho, J.: Automatic identification of user interest for personalized search.
In: WWW, pp. 727736 (2006)
12. Rattigan, M.J., Maier, M., Jensen, D.: Graph clustering with network structure
indices. In: ICML, pp. 783790 (2007)
13. Smyth, B., Balfe, E., Boydell, O., Bradley, K., Briggs, P., Coyle, M., Freyne, J.: A
live-user evaluation of collaborative web search. In: IJCAI, pp. 14191424 (2005)
14. Wang, X., Zhai, C.: Learn from web search logs to organize search results. In:
SIGIR, pp. 8794 (2007)
FIMO: A Novel WiFi Localization Method
1 Introduction
For a long time, people expect the electronic devices around them can have
perception towards the environment.In early 2012, IDC (Internet Data Center)
predicted that around 1.1 billion smart connected devices would be sold in 2012,
and this number would even be doubled in 20161. As one of the important ap-
plications of ubiquitous computing system, localization application has drawn
more and more attention. In general, these systems often need to provide infor-
mation or services according to users current location. For example, a customer
who is shopping in a new mall may need to know where his /her favorite shop
is and how to reach it. Advertisements can be recommended to users based on
their locations etc..
Nowadays, outdoor localization technologies, such as GPS, have become quite
popular in daily life. Recent years also witness the increasing demand for indoor
localization. Several indoor localization techniques have been developed in recent
years, including Bluetooth [3], RFID [8] and WiFi [1]. Bluetooth and WiFi based
approaches rely on Received Signal Strength (RSS) for localization, while RFID
based approaches determine the location of the moving object by reading the
Corresponding author.
1
See http://www.199it.com/archives/29224.html
Y. Ishikawa et al. (Eds.): APWeb 2013, LNCS 7808, pp. 437448, 2013.
c Springer-Verlag Berlin Heidelberg 2013
active RFID tag. WiFi localization has many advantages, such as ubiquitous
coverage, scalability, no additional hardware required, extended range, no line
of sight restrictions and free measurement etc.
We can see that WiFi is a good choice for localization in indoor space. How-
ever, the accuracy of basic approaches is low due to the instability of WiFi signal
strength. Some projects combined WiFi with other devices such as RFID [5] [13],
Bluetooth [2] [12] to improve the accuracy. Although the accuracy is improved,
additional devices are required, and will cause extra burden to the setup process.
Besides, very few approaches have taken advantage of continuous monitoring of
users locations and considered the features of WiFi signal strength itself.
In this paper, we propose a novel approach to estimate a moving objects
location, by taking into consideration the instability of signal strength and the
movement of objects. This method is based on the Fingerprint approach in
estimating the location of a moving object.
The remainder of this paper is organized as follows. In Section 2, we survey
related work in localization technologies. In Section 3, we discuss our research
methodology. In Section 4, we present and analyze our approach. Section 5 re-
ports a comprehensive experimental study. Finally, we present our conclusions
and provide some future research directions in Section 6.
2 Related Work
The existing work on indoor localization can be divided into two main groups:
signal propagation approaches and fingerprinting approaches.
3 Research Methodology
In this section, we describe the experimental testbed and data collection.
Fig. 1. Map of the floor where the experiments were conducted. The stars show the
locations of the 5 access points. The black dots denote locations were empirical signal
information was collected.
deployed at dierent locations, denoted as AP1 , AP2 , AP3 , AP4 and AP5 . Each
AP is a Totolink N300R Router. In our tests, a mobile host is a laptop computer
equipped with Intel WiFi Link 5100 AGN running Windows 7. The APs provide
overlapping coverage over the entire floor together.
4 Our Algorithm
In this section, we describe our algorithm in detail. Ideally, we expect the signal
strength of an AP remains almost stable with time going on. Figure 2 shows the
signal strength levels at dierent distances. We can find that almost all of the
signal strength values received 1m away from the AP are the same, while the
stand deviation of the signal strength values received 10m away from the AP is
around 3%. When the distance reaches 35m, the stand deviation can be around
6%. From this, we conclude that the signal strength keeps fluctuating all the
time and the ones with higher value is more stable. We also test the stability
of signal strengths received by a moving object, as shown in Figure 2. In this
figure, a user walks from AP5 to AP1 slowly, we can observe that the signal
strengths would have greater fluctuation when the user is moving. How to use
these unstable signal strengths to localization is challenging to us.
110 100
100
80
signal strength(%)
signal strength(%)
90
80 60
70 40
60
20
50
40 0
1 11 21 31 41 51 61 71 2 22 42 62 82 102 122 142 162 182
time(s) time(s)
Fig. 2. Signal Level at dierent distances Fig. 3. The signal strength values re-
ceived the user along work
Figure 2 and 3 show that the signal strength would reflect distance more
accurately if the fluctuation of the signal strength could be smoothed. Besides,
in Figure 2, its easy to see that stronger signal strength are more stable. And
the degree of fluctuation of the signal strength changes with time. So we can be
convinced that the signal strength with dierent values and the signal strength
received at dierent time should have dierent credibility.
Due to these particular features, we propose an approach, called FIMO (FInd
Me Out), for indoor localization. FIMO continuously records signal strength
values from each AP at the users location at fixed intervals when the user
is walking around. When the user submits a location request, FIMO will pro-
cess the data and then determine the location of the user using Algorithm 1.
FIMO backtracks the signal strengths received by the user which are recorded
in the database and put them into ssArray (Line 2). It smoothes the signal
strength values in ssArray chronologically and puts the Smoothed Values into
smoothArray (Line 3), as discussed in Section 4.1. After calculating the weight of
442 Y. Zhou et al.
every signal strength value using the Original Value (The signal strength values
that havent been smoothed) (Line 5), as discussed in Section 4.2, FIMO returns
the location to the user by matching the reference points (Line 7-14, Section 4.4)
in the possible range (Line 6, Section 4.3).
1. Detecting the outliers: The system scans the signal strength values ac-
cording to the time sequence. Assume that represents the threshold of
normal value, and the system acquires four sequential values, a, b, c and d.
If (b a > and b c > and c d ) or (a b > and c b >
4
See http://en.wikipedia.org/wiki/Moving_average
FIMO: A Novel WiFi Localization Method 443
100
80
signal strength(%)
60
40
20
0
2 22 42 62 82 102 122 142 162 182
time(s)
Fig. 5. The signal strength values received by the user along work after smoothing
Figure 5 shows the signal strengths after smoothing. Compared with Figure 3,
which shows the values before smoothing, the quality of the data improves
noticeably after smoothing. We call the values after smoothing the Smoothed
Values and the values that have not been smoothed the Original Values.
Therefore, even at the same location, different signal strengths have different
credibility, and we should give different weights to different signal strengths when
matching against the signal strengths in the database. Considering the two factors
affecting the credibility of signal strength mentioned above, we propose a method
to calculate the weights. The rule is to give higher weight to signal strengths that
are higher and more stable. We can use the Original Value (o_i) of AP_i's signal
strength to represent the credibility of AP_i's signal strength.
Then it comes to the question of how to measure the stability of the signal
strength. It is generally known that the sample standard deviation shows how
much the data vary from the average (or expected value): a low sample standard
deviation indicates that the data points tend to be very close to the mean, and
vice versa. However, the average value is not the expected value here, because the
signal strength varies with distance, so we cannot simply use the sample standard
deviation directly. Although the overall trend of the signal strength should be
described by a curve, we only focus on the stability of the signal strength over a
short time (such a short time is treated as a time window), within which the trend
can be regarded as linear. In our experiments we regard the time period containing
four consecutive signal strength values as a time window, and the length of the
time window can be changed. If the length is too long, the readings cannot be
fitted well by a linear function; if it is too short, the fitted linear function cannot
reasonably reflect the trend of the changes in signal strength. The method of least
squares5 is a standard approach to the approximate solution of overdetermined
systems: the overall solution minimizes the sum of the squares of the errors made
in the results of every single equation. So we use least squares to find the linear
function that describes the distribution of the signal strength over a short time.
With this linear function we then obtain the expected signal strength value
(e_{i,j}) received from AP_i at time t_j.
Using the expected signal strength values, the sample standard deviation of the
signal strength values in the same time window ($\sigma_i$) can be calculated with
Equation (1) below:

$$\sigma_i = \sqrt{\frac{1}{W-1}\sum_{j=1}^{W}\left(o_{i,j}-e_{i,j}\right)^{2}} \qquad (1)$$

where W is the length of the time window and $o_{i,j}$ is the Original Value of
AP_i at time t_j.
To sum up, the weight of every signal strength value of AP_i is w_i, given by
Equation (2):

$$w_i = \frac{o_i/\sigma_i}{\max_k\left(o_k/\sigma_k\right)} \qquad (2)$$
5
See http://en.wikipedia.org/wiki/Least_squares
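The computation in Equations (1) and (2) can be sketched as follows with NumPy: fit a line to each AP's readings inside a W-sample window to obtain the expected values e_{i,j}, measure the deviation around that trend, and normalize. The array layout, the use of the latest reading as o_i, and the function name are our assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def ap_weights(window_readings):
    """window_readings: array of shape (num_aps, W) with the Original Values o_{i,j}
    observed in one time window. Returns the weight w_i of each AP (Equations 1-2)."""
    num_aps, W = window_readings.shape
    t = np.arange(W)
    sigma = np.empty(num_aps)
    for i, o in enumerate(window_readings):
        # Least-squares linear fit over the window gives the expected values e_{i,j}.
        slope, intercept = np.polyfit(t, o, 1)
        e = slope * t + intercept
        # Equation (1): sample standard deviation around the fitted trend.
        sigma[i] = np.sqrt(np.sum((o - e) ** 2) / (W - 1))
    o_i = window_readings[:, -1]               # assumption: use the latest Original Value as o_i
    ratio = o_i / np.maximum(sigma, 1e-9)      # guard against a zero deviation
    return ratio / ratio.max()                 # Equation (2): divide by max_k(o_k / sigma_k)

# Three APs, a window of W = 4 readings (signal strength in %):
window = np.array([[90, 91, 89, 90],
                   [60, 64, 58, 61],
                   [35, 42, 30, 38]], dtype=float)
print(ap_weights(window))
```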
(Figures: error distance (m) at each query point (IDs 1-21); the panels compare FIMO with FIMO w/o candidates, Fingerprint (NN), and Fingerprint (KNN).)
5 Experiments
     FIMO vs. FIMO w/o Candidates   FIMO vs. Fingerprint (NN)   FIMO vs. Fingerprint (kNN)
1    3.79                           78.71                       43.69
2    0.64                           136.69                      86.33
An Algorithm for Outlier Detection
on Uncertain Data Stream
Keyan Cao, Donghong Han, Guoren Wang, Yachao Hu, and Ye Yuan
1 Introduction
Uncertainty is inherent in data collected in various applications, such as sensor
networks, marketing research, and social science [7]. Owing to the intrinsic differences
between uncertain and deterministic data, existing data mining algorithms for
deterministic data are not suitable for uncertain data. Such data present
interesting research challenges for a variety of data sets and data mining
applications [1] [4] [6] [7] [9] [10] [11]. In recent years, more and more attention
has been paid to uncertain data mining.
Outlier detection is considered an important data mining task, aiming at the
discovery of elements that are significantly different from the other elements.
This research is supported by the NSFC (Grant No. 61173029, 61025007, 60933001,
75105487 and 61100024), the National Basic Research Program of China (973, Grant
No. 2011CB302200-G), the National High Technology Research and Development 863
Program of China (Grant No. 2012AA011004) and the Fundamental Research Funds
for the Central Universities (Grant No. N110404011).
2 Related Work
In this section, we illustrate related work in two fields: (1) outlier detection on
uncertain data; (2) outlier detection on uncertain data stream.
Objects with similar properties tend to have similar instances, so the normal
instances of each uncertain object can be learned from the instances of objects
with similar properties.
The algorithms mentioned above all work in a static fashion. This means that
the algorithm must be executed from scratch whenever the underlying data objects
change, leading to performance degradation when operating on an uncertain data
stream.
Symbol        Interpretation
K             the number-of-neighbors parameter
R             the distance parameter for outlier detection
W             the window size
S             the set of uncertain tuples in the current window
slide         the window slide length
x_i^j         the j-th instance of uncertain tuple x_i
p_{x_i^j}     the existence probability of instance x_i^j
pp_{x_i^j}    the probability that instance x_i^j is an outlier
sum(x_i)      the probability summation of the possible worlds in which x_i is an outlier
n(x_i^j, R)   the number of neighbors that instance x_i^j has
S_{x_i^j}     the set of succeeding neighbors of instance x_i^j
P_{x_i^j}     the set of preceding neighbors of instance x_i^j
N_{x_i}       the set of (R,K)-outlier possible worlds of x_i
x_i^*         the storage structure of x_i after update
x_arr         the newly arrived tuple
x_exp         the latest expired tuple
4.1 Framework
At first, we introduce the storage structure, which is designed to meet the demand
for incremental processing; it captures the key information about outliers during
the course of the uncertain data stream. As an example, consider the following
uncertain tuples in an uncertain data stream:
x1 {[(1, 2), (0.7)], [(1, 1), (0.3)]}, x2 {[(2, 1), (0.4)], [(3, 1), (0.6)]},
x3 {[(2, 2), (0.5)], [(2, 3), (0.5)]}, x4 {[(4, 4), (0.8)], [(3, 3), (0.2)]}
These are the values of the four tuples x1 to x4. x_i^1 and x_i^2 are the two
instances of tuple x_i, and p_{x_i^1} and p_{x_i^2} are the existence probabilities
of x_i^1 and x_i^2 respectively. Let R = 2 and K = 2; the range query result for
every instance in the current window is shown in Figure 1 (a range query finds the
neighbors of an instance). We neaten the range query results according to two
principles: (1) An instance of a tuple cannot appear in the range query result of
an instance of the same tuple; e.g., x_1^2 cannot appear in the range query result
of x_1^1, so we delete x_1^2 from the range query result of x_1^1. (2) If at least
two instances of the same uncertain tuple appear in the range query result of an
instance, we merge the existence probabilities of these instances; e.g., x_3^1 and
x_3^2 are both in the range query result of x_1^1, so their probabilities are added:
p_{x_3^1} + p_{x_3^2} = p_{x_3}. The neatened result, shown in Figure 2, is easier
to analyze.
Since a tuple is uncertain, the probability of an uncertain tuple being an outlier
is the summation of the probabilities of each of its instances being an outlier, as
stated in Definition 5.
Definition 5. (Uncertain Tuple Outlier Probability) Assume that the uncertain
tuple x_i consists of m instances $((x_i^1, p_{x_i^1}), (x_i^2, p_{x_i^2}), \ldots, (x_i^j, p_{x_i^j}), \ldots, (x_i^m, p_{x_i^m}))$,
$1 \le j \le m$, where $p_{x_i^j}$ is the existence probability of instance $x_i^j$.
The outlier probability of instance $x_i^j$ is $pp_{x_i^j} = p_{x_i^j} \cdot p_{n(x_i^j,R)<K}$,
where $p_{n(x_i^j,R)<K}$ is the probability that the number of neighbors of $x_i^j$
is less than K, $0 \le p_{n(x_i^j,R)<K} \le 1$, and the outlier probability of the
tuple is $p_{x_i} = \sum_{j=1}^{m} pp_{x_i^j}$.
In Figure 2, the range query result of x_1^1 contains x_2^1 and x_3, so the
probability of x_1^1 being an outlier is pp_{x_1^1} = 0.7 × (1 − 0.4) = 0.42. The
probability of x_1^2 being an outlier is pp_{x_1^2} = 0.3 × (1 − 0.5) = 0.15. Then
the probability of x_1 being an outlier is p_{x_1} = pp_{x_1^1} + pp_{x_1^2} =
0.42 + 0.15 = 0.57.
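The worked example can be reproduced with a short sketch that follows our reading of Definition 5: an instance's outlier probability is its existence probability times the probability that fewer than K of its (independent, per-tuple merged) neighbors exist. The data layout is hypothetical, and the neighbor probabilities assumed for x_1^2 are chosen to reproduce the numbers above, since the figure they come from is not in the text.

```python
def prob_fewer_than_k(neighbor_probs, k):
    """P(fewer than k of the independent neighbor tuples exist), Poisson-binomial DP."""
    dist = [1.0]                          # dist[c] = P(exactly c neighbors exist so far)
    for p in neighbor_probs:
        new = [0.0] * (len(dist) + 1)
        for c, pc in enumerate(dist):
            new[c] += pc * (1.0 - p)      # this neighbor absent
            new[c + 1] += pc * p          # this neighbor present
        dist = new
    return sum(dist[:k])

def tuple_outlier_probability(instances, k):
    """instances: list of (existence probability, [merged neighbor tuple probabilities])."""
    return sum(p * prob_fewer_than_k(nbrs, k) for p, nbrs in instances)

# Tuple x1 from the example: x1^1 exists with 0.7 and sees x2^1 (0.4) and x3 (merged, 1.0);
# the neighbor probabilities of x1^2 are assumed so that its term equals 0.3 * (1 - 0.5).
x1 = [(0.7, [0.4, 1.0]), (0.3, [0.5, 1.0])]
print(tuple_outlier_probability(x1, k=2))     # 0.42 + 0.15 = 0.57
```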
Fig. 1. Result of range query before neatening    Fig. 2. Result of range query after neatening
Definition 7. (Non-outlier Instance) Let s_{x_i} denote the set of instances that
tuple x_i has. An instance is a non-outlier instance if it is not an outlier in any
possible world. Let s_{x_i}^y be the set of outlier instances of tuple x_i and
s_{x_i}^n the set of non-outlier instances of x_i. (These two sets do not overlap
and together cover the complete instance set, i.e., $s_{x_i}^y \cup s_{x_i}^n = s_{x_i}$
and $s_{x_i}^y \cap s_{x_i}^n = \emptyset$.) If the probability summation of the
non-outlier instances is greater than $1-\theta$, i.e.,
$\sum_{x_i^j \in s_{x_i}^n} p_{x_i^j} > 1-\theta$, then the tuple is a non-outlier.
Through analysis of these results we can obtain the outliers in the current window;
we now design the storage structure. Since we are dealing with uncertain data in a
stream environment, the arrival time of the uncertain data affects the result. First,
we divide the range query result of every instance into two sets, the preceding-neighbor
set and the succeeding-neighbor set, denoted by P_{x_i^j} and S_{x_i^j} respectively.
Definition 8. (Non-outlier Instance Probability) The non-outlier probability of an
instance x_i^j increases as its neighbor count n(x_i^j, R) increases. If, given the
existence of only a subset of its neighbors, the non-outlier probability already
exceeds 1 − θ, then x_i^j is a non-outlier instance.
According to the above definition, we only need to store the neighbors that make
the tuple a non-outlier (in time order), instead of storing all neighbors of each
instance. The tuples in P_{x_i^j} expire before tuple x_i does, which may turn x_i
from a candidate into an outlier; therefore we must record the expiry times of the
tuples in the candidate set, and the tuples in P_{x_i} are stored with both time
information and probability. The tuples in S_{x_i} cannot expire before tuple x_i is
deleted, so for them we only store the probability information. For example,
Figure 3 depicts a two-dimensional example with K = 3 and θ = 0.7, where the
subscripts denote the order of arrival of the tuples. Focus on tuple x_18: the range
query result of instance x_18^1 is {x_6^1(0.4), x_10^2(0.6), x_15^1(0.5), x_16^2(0.6),
x_17^1(0.7), x_19^2(0.8)}, and the range query result of instance x_18^2 is
{x_3^2(0.6), x_16^1(0.4), x_20^1(0.7), x_22^2(0.8)}. The range query results are
divided by time into the two sets S_{x_18^j} and P_{x_18^j}. To reduce the storage
space and meet the demands of an uncertain data stream, we only store the necessary
information: following the arrival time of each neighbor, we look for a subset of
neighbors whose existence already makes tuple x_18 a non-outlier, i.e.,
p_{x_18}(no) > 1 − θ. When x_22, x_20, x_19, x_17 and x_16 exist,
p_{x_18}(no) = p_{x_18^1}(no) + p_{x_18^2}(no) = 0.3136 > 1 − θ, so x_18 is a
non-outlier, and only these neighbors are kept in the storage structure of x_18.
In this paper, we use a count-based window that always maintains the W most recent
uncertain tuples. The uncertain tuples maintained by the sliding window are termed
active objects. When a tuple leaves the window, it is considered expired: as new
uncertain tuples arrive, the old tuples disappear from the window.
In this section, we introduce the incremental processing for outlier detection on
an uncertain data stream. Whenever the window slides, a new uncertain tuple arrives
and an old uncertain tuple is moved out of the window. To reduce the calculation, we
first delete the expired tuple and then receive the new uncertain tuple. Deleting the
old tuple may turn a tuple in the candidate set into an outlier. The impact of a new
tuple's arrival is of two kinds: (1) the impact of the existing uncertain tuples on
the new uncertain tuple, and (2) the impact of the new uncertain tuple on the existing
uncertain tuples. The latter might turn an outlier into a candidate tuple, or turn a
candidate tuple into a safe inlier.
When a new tuple arrives, range queries are performed for its instances in descending
order of instance probability. According to Definition 6, the important tuples are
found and stored in the data structure. The range query can proceed according to the
arrival time of the tuples: when tuple x_i arrives, we first determine whether the
instances of x_{i-1} are neighbors of the instances of x_i, then x_{i-2}, and so on
backwards in time, until we find the tuples that make x_i a non-outlier. These
instances are stored in the storage structure of x_i. The pseudo code of these
operations is given in Algorithm 1.
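A minimal sketch of this arrival-time scan is given below, under the assumptions that per-tuple neighbor probabilities are independent and that a tuple counts as a non-outlier once the summed instance non-outlier probabilities exceed 1 − θ (our reading of Definitions 5-8). The data structures and names are ours, not those of the paper's Algorithm 1.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Instance:
    value: Tuple[float, float]   # the instance's point
    prob: float                  # its existence probability

@dataclass
class UTuple:
    tid: int
    instances: List[Instance]

def prob_at_least_k(probs, k):
    """P(at least k of the independent neighbors exist), via a Poisson-binomial DP."""
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for c, pc in enumerate(dist):
            new[c] += pc * (1 - p)      # neighbor absent
            new[c + 1] += pc * p        # neighbor present
        dist = new
    return sum(dist[k:])

def process_arrival(x_new, window, R, K, theta):
    """Scan the window backwards in time (newest first) and keep only the neighbors
    needed to certify x_new as a non-outlier; otherwise leave it as outlier/candidate."""
    stored = {j: [] for j in range(len(x_new.instances))}   # per-instance neighbor lists
    for old in reversed(window):
        for j, inst in enumerate(x_new.instances):
            for o_inst in old.instances:
                if math.dist(inst.value, o_inst.value) <= R:
                    stored[j].append((old.tid, o_inst.prob))
        p_no = sum(inst.prob * prob_at_least_k([p for _, p in stored[j]], K)
                   for j, inst in enumerate(x_new.instances))
        if p_no > 1 - theta:
            return "non-outlier", stored      # enough neighbors found; stop the scan early
    return "outlier-or-candidate", stored
```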
When a tuple expires, it is removed from the window. The storage structures that
contain instances of the expired tuple are updated. If a tuple is in O(R, K, θ), we
only update its storage structure; if it is in C(R, K, θ), we update its storage
structure and then determine whether it becomes an outlier. The pseudo code of these
operations is given in Algorithm 2.
In Figure 3, assume that a new tuple x_28 arrives and that its instance x_28^1 is a
neighbor of x_18^1 with existence probability 0.9. Then, according to Definition 6,
when x_28, x_22, x_20, x_19 and x_17 exist, x_18 is a non-outlier but not yet a safe
one, and we update the storage structure of x_18 accordingly.
Some time later, tuple x_35 arrives and its instance x_35^2 is a neighbor of x_18^1
with probability 0.6. Now the existence of x_35, x_28, x_22, x_20 and x_19 makes
x_18 a non-outlier, and all of them are in the set S_{x_18}, so x_18 is a safe
inlier. Once x_18 becomes a safe inlier as the window slides, we delete the storage
structure of x_18.
If there are enough instances that make the new tuple a non-outlier within the range
of M1, it is not necessary to consider the leaf nodes of M2. The SM-tree thus reduces
unnecessary calculations and improves the efficiency of range queries.
5 Performance Evaluation
We conducted a series of experiments on a synthetic data set and real data sets to
evaluate the performance of the algorithms proposed in this paper, namely Uncertain
Outlier Detection on Uncertain Data Stream (denoted by UOD) and Uncertain Outlier
Detection based on the M-tree (denoted by UOD-M). For comparison, we implemented the
naive outlier detection method based on examining possible worlds (denoted by NUOD).
All the algorithms are implemented in Visual C++ 2010, and the experiments are
carried out on a PC with a 3.3 GHz Core i3 CPU and 4 GB of main memory.
(Figures: running time (s) of NUOD, UOD and UOD-M versus window size (K), outlier percentage (% of W), R value, and the other varied parameters.)
First, we test the performance of the algorithms for values of W varying from 10^4
to 10^5. Figure 5 shows that the running time of all algorithms increases with the
length of the sliding window, and that UOD-M and UOD perform better than NUOD.
Figure 6 depicts the results as the number of outliers varies in the range
[0.1%, 3%]. As expected, the UOD-M algorithm performs better than the UOD algorithm.
In Figure 7, we show the performance of the NUOD, UOD-M and UOD algorithms as K
varies from 1 to 100; the running time of UOD-M and UOD decreases as K increases.
Figure 8 shows that the running time of UOD is much higher than that of UOD-M as R
varies over [10, 50]. Figure 9 demonstrates the running time as the probability
threshold θ increases; we vary θ from 0.1 to 0.9, and the running time of both
algorithms decreases as θ increases.
(Figure: memory requirements (MBytes) of UOD and UOD-M versus window size (K).)
6 Conclusion
References
1. Aggarwal, C.C.: On density based transforms for uncertain data mining. In: ICDE, pp. 866–875 (2007)
2. Aggarwal, C.C., Yu, P.S.: Outlier detection with uncertain data. In: SDM, pp. 483–493 (2008)
3. Assent, I., Kranen, P., Baldauf, C., Seidl, T.: Anyout: Anytime outlier detection on streaming data. In: VLDB, pp. 228–242 (2012)
4. Burdick, D., Deshpande, P.M., Jayram, T.S., Ramakrishnan, R., Vaithyanathan, S.: Olap over uncertain and imprecise data. In: VLDB, pp. 970–981 (2005)
5. Chandola, V., Banerjee, A., Kumar, V.: Outlier detection: A survey. ACM Computing Surveys (2007) (to appear)
6. Cheng, R., Kalashnikov, D., Prabhakar, S.: Evaluating probabilistic queries over imprecise data. In: SIGMOD, pp. 551–562 (2003)
7. Jiang, B., Pei, J.: Outlier detection on uncertain data: Objects, instances, and inferences. In: ICDE, pp. 422–433 (2011)
8. Kontaki, M., Gounaris, A., Papadopoulos, A., Tsichlas, K., Manolopoulos, Y.: Continuous monitoring of distance-based outliers over data streams. In: ICDE, pp. 135–146 (2011)
9. Sarma, A.D., Benjelloun, O., Halevy, A., Widom, J.: Working models for uncertain data. In: ICDE, p. 7 (2006)
10. Singh, S., Mayfield, C., Prabhakar, S., Shah, R., Hambrusch, S.: Indexing uncertain categorical data. In: ICDE, pp. 616–625 (2007)
11. Tao, Y., Cheng, R., Xiao, X., Ngai, W.K., Kao, B., Prabhakar, S.: Indexing multi-dimensional uncertain data with arbitrary probability density functions. In: ICDE, pp. 922–933 (2005)
12. Wang, B., Xiao, G., Yu, H., Yang, X.: Distance-based outlier detection on uncertain data. CIT 1, 293–298 (2009)
13. Wang, B., Yang, X., Wang, G., Yu, G.: Outlier detection over sliding windows for probabilistic data streams. Journal of Computer Science and Technology 25(3), 389–400 (2010)
Improved Spatial Keyword Search
Based on IDF Approximation
Xiaoling Zhou, Yifei Lu, Yifang Sun, and Muhammad Aamir Cheema
1 Introduction
There has been a strong trend in recent years of incorporating spatial information
into keyword-based search systems to support geographic queries. It is driven mainly
by two factors: high user demand and rapid advances in geo-positioning and geo-coding
technologies. User interests include finding nearby restaurants, tourist attractions,
entertainment services, public transport and so on, which makes spatial keyword
queries important. Studies [13] have shown that about 20% of web search queries are
geographical and exhibit local intent. In addition, both the various advanced
geo-positioning technologies such as GPS, Wi-Fi and cellular geo-location services
(offered by Google, Skyhook, and Spotigo), and the different geo-coding technologies
that enable annotating web content with positions, accelerate the evolution of
spatial keyword search.
Spatial keyword queries take a set of keywords and a location as parameters and
return the ranked documents that are most textually and spatially relevant to the
query keywords and location. Usually, distance is used to measure spatial relevance,
and document similarity functions such as TF-IDF weighting [7] are used to measure
textual relevance. A fair amount of work has been conducted on this problem, such as
algorithms for solving kNN queries (k-nearest neighbor queries) using the R-tree [12],
loose combinations of inverted files and the R*-tree [2] [6] [15], and the
signature-augmented IR²-tree [4]. However, limitations exist in these approaches.
2 Preliminaries
2.1 Problem Definition
We consider a database consisting of N documents, each associated with a spatial
location. A query Q consists of a set of keywords (denoted by W_Q) and a location
(denoted by Q.loc).
Given a query range SQ and a scoring function score(Q, d), a top-k spatial
keyword search returns top k scoring documents from the database.
The scoring function should take into consideration both the textual rele-
vance of the document to the query as well as the spatial proximity between
the locations of the query and the document. Hence, the final function can be
represented as $score(Q, d) = \alpha \cdot score_{text}(Q, d) + (1-\alpha)\, score_{spatial}(Q, d)$,
where $\alpha \in [0, 1]$ is a tuning parameter.
Spatial Relevance. We only need to consider documents within the search range
specified by S_Q (these documents are also known as spatially qualified). For a
spatially qualified document, we give its proximity score based on the Euclidean
distance between its location and the query location, i.e.,
$score_{spatial}(Q, d) = 1 - \frac{dist(d.loc, Q.loc)}{dist_{max}}$, where
$dist_{max}$ is the maximum possible distance between the query and any spatially
qualified document.
Textual Relevance. We adopt the popular vector space model to measure the textual
relevance between the document and the query. In this paper, we consider approaches
using the query-specific IDF, which is defined as
$idf_{w,D,S_Q} = \log \frac{D_s}{df_{w,D_s}}$. Note that $D_s$ is the number of
spatially qualified documents rather than the size of the entire collection, and
$df_{w,D_s}$ is the number of spatially qualified documents that contain w;
therefore, $idf_{w,D,S_Q}$ reflects the distribution of the query words among the
spatially qualified documents rather than in the entire collection. Finally, the
textual score is calculated as
$score_{text}(Q, d) = \sum_{w \in W_Q} (tf_{w,d} \cdot idf_{w,D,S_Q})$. We say a
document d is textually qualified for a given query Q iff d contains at least one
query keyword of Q.
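As an illustration of the scoring function, the sketch below combines the query-specific IDF, the term frequencies, and the distance-based spatial score; the base-10 logarithm matches the IDF values used later in Example 2, while the variable names, example numbers, and data layout are our own assumptions.

```python
import math

def score(query_words, q_loc, doc_loc, tf, df_s, D_s, dist_max, alpha=0.5):
    """score(Q, d) = alpha * score_text(Q, d) + (1 - alpha) * score_spatial(Q, d),
    with the IDF computed over the D_s spatially qualified documents only."""
    # Textual relevance: sum of tf * idf over the query keywords (vector space model).
    text = sum(tf.get(w, 0) * math.log10(D_s / df_s[w])
               for w in query_words if df_s.get(w))
    # Spatial relevance: 1 - normalized Euclidean distance to the query location.
    dist = math.hypot(doc_loc[0] - q_loc[0], doc_loc[1] - q_loc[1])
    spatial = 1 - dist / dist_max
    return alpha * text + (1 - alpha) * spatial

# Hypothetical example: D_s = 4 spatially qualified documents, query = {"hotel", "beach"}.
df_s = {"hotel": 3, "beach": 1}                 # df_{w,Ds} within the query range
print(score(["hotel", "beach"], q_loc=(0, 0), doc_loc=(30, 40),
            tf={"hotel": 2, "beach": 1}, df_s=df_s, D_s=4, dist_max=100))
```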
Notations. We use the following notations in the rest of the paper. The root of the
IR-tree is r. The number of documents within S_Q is D_s, the number of documents
within S_Q that contain w is df_{w,Ds}, and the query-specific idf value of w is
idf_{w,Ds}. Similarly, for a tree node e, df_{w,De} is the number of spatially
qualified documents stored underneath e that contain w. In later sections, unless
explicitly stated, df_w and idf_w refer to df_{w,Ds} and idf_{w,Ds} respectively.
Finally, the candidate set produced by the IDF calculation step is B.
each leaf node is linked to an inverted file indexing all the documents belonging
to the node.
An example IR-tree is shown in Figure 1.
22 forall d ∈ S2 do
23     B.add(d);  /* also keep tf_{w,d} information about d */
24 return {log(D_s / df_w)}, for all w ∈ W_Q
Top-k Document Retrieval. This step performs a best-first search on the IR-tree and
terminates as soon as the top-k objects have been determined. Initially, all the
entries in B and their estimated maxScores are pushed into a max-heap H.
(Figure 1: example data location map; IR-tree (1): tree part with nodes N1, N2, N3; IR-tree (2): per-node summary info (maxTF, DF) and inverted files; example data statistics for d5-d8.)
than spatial information and other index overhead associated with each document. In
addition, due to the sheer size of the inverted files, they are typically stored on
disk, and many random I/Os are needed to retrieve them.
Therefore, our goal is to retrieve the top-k results without accessing all of the
inverted files. We instead approximate each IDF value by a range, and we can find
conditions that guarantee the correctness of the top-k answers even with such
approximate IDF values.
3.1 Framework
Our idea is to leave out Lines 17-23 of Algorithm 1 and compute a DF range
$[DF_w^{min}, DF_w^{max}]$ for each query word w without traversing the inverted lists.
Consider the previous example. When node N1 is traversed, we find that it is an
intersecting leaf node. Instead of scanning its inverted files, we know that under
N1 there are $df_{w,D_{N1}}$ documents containing w, but we are not sure whether all
of them are spatially qualified. Thus, we can bound $DF_w$ as
$[DF_w, DF_w + df_{w,D_{N1}}]$. Subsequently, we can bound the IDF value of each
query keyword as $[\log \frac{D_s}{DF_w + df_{w,D_{N1}}}, \log \frac{D_s}{DF_w}]$.
Once we have the IDF range for every query keyword, the min and max scores of an
IR-tree node or a document can be computed using the IDF ranges and the scoring
function presented in Section 2. Note that the spatial score of an IR-tree node can
also be bounded by computing the min and max distances between the MBR of the node
and the query location.
Denote the minScore of d by minScore(d) and the maxScore of d by maxScore(d). It is
obvious that if minScore(d1) > maxScore(d2), then d1's score is better than d2's and
we can safely rank d1 before d2.
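The DF/IDF bounding step can be sketched in a few lines; the helper name and the example numbers are ours, chosen so that the resulting range matches the df_{w1} = [1, 3] case in Example 2.

```python
import math

def idf_range(D_s, df_confirmed, df_unverified):
    """Bound the query-specific IDF of a word: df_confirmed spatially qualified documents
    are known to contain it, and up to df_unverified more (under intersecting leaf nodes
    whose inverted files were not read) may also qualify. Assumes df_confirmed >= 1."""
    df_min, df_max = df_confirmed, df_confirmed + df_unverified
    return math.log10(D_s / df_max), math.log10(D_s / df_min)

# D_s = 4 spatially qualified documents, 1 confirmed occurrence of w, 2 unverified under N1:
idf_min, idf_max = idf_range(4, 1, 2)
print(round(idf_min, 3), round(idf_max, 3))   # 0.125 0.602, the df_{w1} = [1, 3] case of Example 2
```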
Algorithm 3 outlines the top-k document retrieval process. First, for each entry e
in B, we estimate maxScore(e) and minScore(e), and put the entry
(e, maxScore(e), minScore(e)) into a max-heap H that sorts its entries in descending
order of their maxScores.
Second, we iteratively examine the head entry (e, maxScore(e), minScore(e)) of H to
identify the top-k documents until H is empty. If e is a node, we expand the node by
adding its children and their scores into H. If e is a document and H is not empty,
the next top entry n in H is examined and compared with e in order to determine
whether e can be added into R. This leads to two cases: 1) the scores of e and n are
comparable, i.e., minScore(e) is greater than or equal to maxScore(n); then we can
simply add e into R. 2) Their scores are not comparable, i.e., minScore(e) is less
than maxScore(n); then we cannot directly determine whether e is the most relevant
document in H, so extra processing is required. If n is a node, we push e back into
H and expand node n. If n is also a document, the current DF ranges are too rough to
determine the exact top-k documents, so more intersecting leaf nodes are expanded to
update the min and max DFs of the query keywords; the IDF values are recomputed and
the entry scores in H are updated to tighter bounds for exact ranking.
The procedure ExpandTreeNode(n, H) examines all child nodes cn of the node n to be
expanded and, if n is an internal tree node, adds (cn, maxScore(cn), minScore(cn))
into H for later processing. If n is a leaf node that lies within the query range,
the textually qualified documents d (those that appear in at least one of the
inverted lists of the query keywords associated with n) are examined and
(d, maxScore(d), minScore(d)) entries are added into H. However, if n is an
intersecting leaf node (n.MBR intersects S_Q), the documents stored underneath n
must be checked not only for textual qualification but also for spatial
qualification before they can be added into H, and the minDFs and maxDFs are updated
accordingly.
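The following is a compact, hedged sketch of this best-first loop over score ranges. It abstracts the IR-tree away: expand(id) is assumed to return a node's children with their own score ranges, and refine() stands for the DF-range tightening step and is assumed to return tightened entries for every undecided entry (including the document being compared); neither is the paper's Algorithm 3.

```python
import heapq

def topk(entries, k, expand, refine):
    """Best-first top-k retrieval over [minScore, maxScore] ranges.
    entries: iterable of (id, is_doc, min_score, max_score); expand(id) yields such
    tuples for a node's children; refine() returns tightened tuples for every
    undecided entry after more intersecting leaf nodes have been expanded."""
    # heapq is a min-heap, so key on -max_score to pop the largest maxScore first.
    heap = [(-mx, eid, is_doc, mn, mx) for eid, is_doc, mn, mx in entries]
    heapq.heapify(heap)
    results = []
    while heap and len(results) < k:
        _, eid, is_doc, mn, mx = heapq.heappop(heap)
        if not is_doc:                                   # a tree node: expand it
            for c in expand(eid):
                heapq.heappush(heap, (-c[3], c[0], c[1], c[2], c[3]))
            continue
        if not heap or mn >= -heap[0][0]:                # case 1: e beats every remaining maxScore
            results.append(eid)
            continue
        nxt = heap[0]
        if not nxt[2]:                                   # case 2: the competitor is a node -> expand it
            heapq.heappop(heap)
            heapq.heappush(heap, (-mx, eid, is_doc, mn, mx))
            for c in expand(nxt[1]):
                heapq.heappush(heap, (-c[3], c[0], c[1], c[2], c[3]))
        else:                                            # two documents with overlapping ranges:
            heap = [(-x[3], x[0], x[1], x[2], x[3])      # tighten DF/IDF ranges, rebuild the heap
                    for x in refine()]
            heapq.heapify(heap)
    return results
```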
Lemma 1 (Correctness). Algorithm 2 correctly retrieves the top-k result.
Example 2. Using the example in Figure 1, the proposed algorithm works as follows.
The IDFRangeComputation procedure traverses the IR-tree to find the contained node
N2 and the intersecting leaf nodes N1 and N3; it thus sets D_s = 4, ILFs = (N1, N3),
B = (N1, N2, N3) and df_{w1,Ds} = [1, 3], df_{w2,Ds} = [2, 3], and computes
idf_{w1,Ds} = [0.125, 0.602], idf_{w2,Ds} = [0.125, 0.301].
TopKRetrieval first constructs the initial heap H = {(N2, 1.403, 0.75),
(N1, 0.501, 0.325), (N3, 0.476, 0.238)}. Node N2 is then expanded, giving the new
H = {(d1, 1.103, 0.538), (d2, 0.601, 0.425), (N1, 0.501, 0.325), (N3, 0.476, 0.238)}.
Two documents are now at the top of the heap, but with incomparable scores;
therefore, more intersecting leaf nodes have to be expanded. Assume that 50% of
|ILFs| is expanded at one time; in this case, node N1 is expanded. df_{w1,Ds} is
unchanged, while the new df_{w2,Ds} = [3, 3], so the new idf_{w2,Ds} = [0.125, 0.125].
The updated H = {(d1, 1.015, 0.538), (N3, 0.476, 0.238), (d2, 0.425, 0.425),
(N1, 0.325, 0.325)}.
d1 is returned since minScore(d1) > maxScore(N3), and the search stops.
4 Experiments
4.1 Experimental Setup
We select two publicly available datasets. Trec is from the TREC-9 Filtering
Track1; we use it as the text dataset. POI-Italy contains gas station and car
parking locations in Italy2; we use it as the spatial dataset.
There are 239,580 documents in the Trec dataset in total. We randomly combine two
documents to form a longer document, and 100,000 of these combined documents are
selected as the test data set. We assign each combined document a unique spatial
location selected randomly from the POI-Italy dataset. All the x and y coordinates
in the POI-Italy dataset are normalised into the range [0, 40000]. Correspondingly,
we randomly select 1 to 8 words from the vocabulary of the documents to form the
textual part of the queries and assign a random rectangle as the spatial part. Some
statistics are shown in Table 1.
1
http://trec.nist.gov/data/t9_filtering.html
2
http://poi-osm.tucristal.es/
Test Size   Total Distinct Words   Total Words   Average Document Size
100000      241297                 35883968      358

Parameters          Values
Query range         10, 100, 1000, 10000
Score ratio         0, 0.25, 0.5, 0.75, 1
Top k               1, 10, 20, 50, 100
Query word number   1, 2, 4, 8
Test size           10000, 20000, 40000, 60000, 80000, 100000
Vary Score Ratio α: Figure 2(c) presents the search performance when varying the
weight ratio between textual and spatial relevance. The search time stays steady for
both algorithms as α varies from 0 to 1, except that when α is set to 0 (only the
spatial score is considered), ApproIDF performs particularly well compared with the
other cases. The reason is that when α is 0, the ordering of documents depends only
on the spatial score, which renders the IDF values useless no matter whether they
are exact or approximate; the top-k documents can therefore be selected directly and
quickly using the accurate spatial score. In addition, the overall trend is the same
as above: ApproIDF performs much better than the original search algorithm for all
ratio settings. A pre-check of α can be added before the search starts as an
optimization to deal with special cases such as α = 0 or α = 1.
Vary Top K k: Figure 2(d) demonstrates the impact of the number of requested
documents (k). The search time of both algorithms increases slightly as k increases.
Both search processes are incremental, meaning that they stop as soon as the top-k
documents have been retrieved without having to go through the whole candidate set;
when k becomes larger, more document scores need to be computed and ranked. Our
proposed algorithm remains more efficient than the IR-tree search algorithm.
Vary Query Range S_Q: Next, the effect of the spatial query range on search
performance is evaluated. The log-scale search time is summarized in Figure 2(b).
When the query range is small (10-1000), the total search time for both algorithms
is less than 0.04 ms, whereas when the search range increases to 10000, the total
search time grows drastically to around 0.63 ms for IR-tree and 0.33 ms for
ApproIDF. This is mainly due to the increase in candidate size as the query range
grows: nearly 10% (9,231) of the documents are within the query range when S_Q is
set to 10000. Overall, ApproIDF outperforms the IR-tree search algorithm, especially
when the query range is large.
Vary Test Size |D|: As shown in Figure 2(e), the search time grows linearly for both
algorithms as the test size (the number of documents indexed) increases. This is
mainly due to the increase in candidate size as the test size becomes larger, as
shown in Figure 2(f).
5 Related Works
Numerous works have been done in this field. Some works use an inverted index for
textual retrieval and a separate index such as a Quad-tree or Grid index for spatial
filtering [2] [10]. Currently the most widely used index structures are hybrid
indexes, and many variations exist besides the IR-tree structure [3] [11] already
mentioned, such as the Naive Hybrid Index [9], the R*-tree [1] [8], the KR*-tree [6],
the IR²-tree [4], and the W-IR-tree [14].
The Naive Hybrid Index [9] combines the document location with every word in the
document to form a new word, and builds an index on these combined words. This
method clearly wastes a lot of space and does not support flexible spatial relevance
computation.
Fig. 2. (a) Search performance versus |WQ|; (b) search performance versus SQ (log scale); (c) search performance versus score ratio; (d) search performance versus number of top k; (e) search performance versus |D|; (f) candidate size versus |D|.
R*-tree-based methods use loose combinations of an inverted file and an R*-tree
[6] [15]. These works usually suffer from the inability to prune the search space
simultaneously by keyword similarity and spatial information. The IR²-tree [4]
augments the R-tree with signatures and enables keyword search on spatial data
objects that each have a limited number of keywords. This technique suffers from
substantial I/O cost, since the signature files of all words need to be loaded into
memory when a node is visited. Also, the signature approach is ineffective and
infeasible when it comes to handling ranked queries. The W-IR-tree [14] is
essentially an IR-tree combined with a word-partitioning technique, which partitions
keywords before constructing the IR-tree. The W-IBR-tree is a variation of the
W-IR-tree that replaces the inverted files attached to each node with inverted
bitmaps, which reduces the storage space and also saves I/O during query processing.
This tree structure combines both keyword and location information, better matches
the semantics of top-k spatial keyword queries, and performs better than the
IR-tree. However, it does not consider textual similarity as a ranking factor, as it
treats an object as either textually relevant (containing all the query keywords)
or not.
References
1. Beckmann, N., Kriegel, H.P., Schneider, R., Seeger, B.: The R*-Tree: An efficient and robust access method for points and rectangles. In: SIGMOD Conference, pp. 322–331 (1990)
2. Chen, Y.Y., Suel, T., Markowetz, A.: Efficient query processing in geographic web search engines. In: SIGMOD Conference, pp. 277–288 (2006)
3. Cong, G., Jensen, C.S., Wu, D.: Efficient retrieval of the top-k most relevant spatial web objects. In: PVLDB, vol. 2(1), pp. 337–348 (2009)
4. Felipe, I.D., Hristidis, V., Rishe, N.: Keyword search on spatial databases. In: ICDE, pp. 656–665 (2008)
5. Guttman, A.: R-trees: A dynamic index structure for spatial searching. In: SIGMOD Conference, pp. 47–57 (1984)
6. Hariharan, R., Hore, B., Li, C., Mehrotra, S.: Processing spatial-keyword (SK) queries in geographic information retrieval (GIR) systems. In: SSDBM, p. 16 (2007)
7. Hiemstra, D.: A probabilistic justification for using tf x idf term weighting in information retrieval. Int. J. on Digital Libraries 3(2), 131–139 (2000)
8. Hjaltason, G.R., Samet, H.: Distance browsing in spatial databases. ACM Trans. Database Syst. 24(2), 265–318 (1999)
9. Jones, C.B., Abdelmoty, A.I., Finch, D., Fu, G., Vaid, S.: The SPIRIT spatial search engine: Architecture, ontologies and spatial indexing. In: Egenhofer, M., Freksa, C., Miller, H.J. (eds.) GIScience 2004. LNCS, vol. 3234, pp. 125–139. Springer, Heidelberg (2004)
10. Lee, R., Shiina, H., Takakura, H., Kwon, Y.J., Kambayashi, Y.: Optimization of geographic area to a web page for two-dimensional range query processing. In: WISEW 2003, pp. 9–17. IEEE Computer Society, Washington, DC (2003)
11. Li, Z., Lee, K.C.K., Zheng, B., Lee, W.C., Lee, D.L., Wang, X.: IR-Tree: An efficient index for geographic document search. IEEE Trans. Knowl. Data Eng. 23(4), 585–599 (2011)
12. Roussopoulos, N., Kelley, S., Vincent, F.: Nearest neighbor queries. In: SIGMOD Conference, pp. 71–79 (1995)
13. Sanderson, M., Kohler, J.: Analyzing geographic queries. In: Workshop on Geographic Information Retrieval, SIGIR (2004)
14. Wu, D., Yiu, M.L., Cong, G., Jensen, C.S.: Joint top-k spatial keyword query processing. IEEE Trans. Knowl. Data Eng. 24(10), 1889–1903 (2012)
15. Zhou, Y., Xie, X., Wang, C., Gong, Y., Ma, W.Y.: Hybrid index structures for location-based web search. In: CIKM, pp. 155–162 (2005)
16. Zobel, J., Moffat, A.: Inverted files for text search engines. ACM Comput. Surv. 38(2) (2006)
Efficient Location-Dependent Skyline Retrieval
with Peer-to-Peer Sharing
Abstract. Motivated by the fact that a great deal of information is closely re-
lated to spatial location, a new type of skyline queries, location-dependent sky-
line query (LDSQ), was recently presented. Different from the conventional
skyline query, which focuses on non-spatial attributes of objects and does not
consider location relationship (i.e., proximity), LDSQ considers both the prox-
imity and non-spatial attributes of objects. In this paper, we explore the problem
of efficient LDSQ processing in a mobile environment. We propose a novel
LDSQ processing method based on peer-to-peer (P2P) sharing. Extensive expe-
riments are conducted, and the experimental results demonstrate the effective-
ness of our methods.
1 Introduction
Owing to the ever growing popularity of mobile devices and rapid advent of wireless
technology, information systems are expected to be able to handle a tremendous
amount of service requests from the mobile devices. The majority of these mobile
devices are equipped with positioning systems, which gave rise to a new class of mo-
bile services, i.e., location-based services (LBS). LBS enable users of mobile devices
to search for facilities such as restaurants, shops, and car-parks close to their route. In
general, mobile clients send location-dependent queries to an LBS server from where
the corresponding location-related information is returned as query results.
The skyline query is a very important query for users' decision making. The
conventional skyline query focuses on non-spatial attributes of objects and does not
consider proximity. Considering both the proximity and non-spatial attributes of
objects, a new type of skyline queries, location-dependent skyline query (LDSQ),
was recently presented [1, 2]. Unlike the conventional skyline query, whose result is
based only on the actual query itself and irrelevant to the location of the mobile
client, the query result of an LDSQ is closely relevant to the location of the mobile
client, i.e., the location of the mobile client is a parameter of the LDSQ. To answer
LDSQs, mobile clients need to connect to the corresponding LBS server through a
GSM/3G/Wi-Fi service provider, and the LBS server is responsible for processing
each LDSQ. With the ever growing number of mobile devices, the sharply increasing
query loads already result in network congestion and a performance bottleneck at
the LBS server. To address these scalability limitations, in this paper we explore
leveraging P2P data sharing to answer LDSQs.
The remainder of this paper is organized as follows. We review the related work in
Section 2. In Section 3, we first formulate the problem studied in this work, and then
describe the reference infrastructure for processing LDSQ. Section 4 presents the
proposed processing approach. We evaluate the proposed approaches through com-
prehensive experiments in Section 5. Finally, Section 6 concludes this paper.
2 Related Work
Skyline query has received considerable attention due to its importance in numerous
applications. The skyline operator is first introduced into the database community by
Borzsonyi et al. [3]. Since then a large number of methods have been proposed for
processing skyline queries in a conventional set. These methods can be classified in
two general categories depending on whether they use indexes or not. The popular
indexes-based methods mainly include index [4], nearest neighbor [5], branch-and-
bound skyline [6] and ZBtree [7]. The typical methods without indexes are block
nested loop [3], divide and conquer [3], sort first skyline [8], linear elimination sort
for skyline [9] and sort and limit skyline algorithm [10]. Furthermore, a number of
interesting variants of the conversional skyline, like multi-source skyline [11], distri-
buted skyline [12] and etc., have been addressed.
The abovementioned methods only focus on non-spatial attributes of objects and
do not consider location relationship. Motivated by the popularity of LBS and the
increased complexity of the queries issued by mobile clients, Zheng et al. [1] first
address the problem of LDSQ which is the most relevant to our work. LDSQ enables
clients to query on both spatial and non-spatial attributes of objects. In [1], the con-
cept of valid scope is introduced to address the query answer validation issue and two
query algorithms to answer LDSQ, namely brute-forth and -Scanning, are proposed.
Furthermore, [2] explores how to efficiently answer LDSQ in road networks where
the important measure is the network distance, rather than their Euclidean distance.
The main disadvantage of the methods present in [1, 2] is lack of scalability and thus
they cannot avoid network congestion and the performance bottle of the LBS server
caused by the ever growing number of mobile devices.
3 Preliminary
In this section, we first formally define LDSQ and related concepts, and then describe
the system infrastructure for supporting LDSQ.
Notations and Definitions. Let S be the set of spatial objects stored at the LBS
server; every object o ∈ S has spatial location attributes denoted by L(o) and a set
of non-spatial preference attributes denoted by P = {p1, p2, ..., pm}. We use o.pi
(1 ≤ i ≤ m) to represent o's value for the i-th non-spatial attribute and
d(L(o), L(q)) to represent the distance between o and a query point q, where d(.)
denotes a distance metric obeying the triangle inequality. Let t ≺ u denote that
object t dominates object u with respect to the non-spatial preference attributes.
We use SK(S) to denote the skyline over S, i.e., SK(S) = {o ∈ S | there is no k ∈ S
such that k ≺ o}. In the following, we formally define the LDSQ and related concepts.
Definition 1 (≺_q). Given two spatial objects t, u and a query point q, we say t
dominates u with respect to q, denoted t ≺_q u, if t ≺ u with respect to the
non-spatial preference attributes and d(L(t), L(q)) ≤ d(L(u), L(q)); formally,
t ≺_q u ⟺ t ≺ u ∧ d(L(t), L(q)) ≤ d(L(u), L(q)).
Definition 4 (Valid Scope). The valid scope of an LDSQ with respect to a query point
q, denoted by VS(S, q), is a spatial region within which all query points receive an
identical result, i.e., for any two query points q1, q2, if L(q1) ∈ VS(S, q) and
L(q2) ∈ VS(S, q), then LSK(S, q1) = LSK(S, q2) = LSK(S, q).
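As a small illustration of Definition 1 and LSK(S, q), the sketch below computes a location-dependent skyline by pairwise dominance checks. It assumes larger non-spatial values are preferred and plain Euclidean distance; both choices, and the data layout, are illustrative assumptions rather than the paper's method.

```python
import math

def dominates(t, u, q):
    """t dominates u w.r.t. query point q: t is no worse on every non-spatial attribute,
    strictly better on at least one, and not farther from q (Definition 1)."""
    better_somewhere = any(a > b for a, b in zip(t["attrs"], u["attrs"]))
    no_worse = all(a >= b for a, b in zip(t["attrs"], u["attrs"]))
    closer = math.dist(t["loc"], q) <= math.dist(u["loc"], q)
    return no_worse and better_somewhere and closer

def location_dependent_skyline(objects, q):
    """LSK(S, q): objects not dominated w.r.t. q by any other object."""
    return [o for o in objects
            if not any(dominates(p, o, q) for p in objects if p is not o)]

restaurants = [
    {"name": "A", "loc": (1, 1), "attrs": (4.5, 3)},   # (rating, cheapness)
    {"name": "B", "loc": (5, 5), "attrs": (4.7, 4)},
    {"name": "C", "loc": (2, 2), "attrs": (4.0, 2)},
]
print([o["name"] for o in location_dependent_skyline(restaurants, q=(0, 0))])  # ['A', 'B']
```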
The introduction of valid scope provides the chance of sharing query results among
neighboring mobile clients.
In Algorithm 1 and the following Algorithm 2, CacheData denotes the cache capacity
allocated by MC or peer_i for caching its last LDSQ result and the corresponding
valid scope; CacheData.LSK(S, MC) represents the part of CacheData used to cache the
last LDSQ result, and CacheData.VS(S, MC) denotes the remaining part of CacheData
used to cache the corresponding valid scope.
In Algorithm 2, CacheData.VS(S, peer_i) denotes the valid scope of the last LDSQ
cached in CacheData, and T_overdue is calculated by the formula
T_overdue = t_current − 2 · t_delay · slack, where t_current is the current system
time of peer_i, t_delay denotes the largest network delay between two neighboring
mobile clients in the MANET, and slack represents the slack factor.
In Algorithm 3, the functions -Scanning() and -ScanningVS() are responsible for
computing the LDSQ and the valid scope respectively; their detailed descriptions are
presented in [1].
5 Performance Evaluation
In this section, we first describe the experimental settings and then present the simula-
tion results.
Experiment Settings. Our simulation experiments are based on the same data sets of
spatial objects as [1], namely School and NBA. In the simulation, mobile clients have
sufficient memory space to buffer their last query results. Like [1], we also assume
the LBS server has sufficient memory space to load all the candidate objects. In order
to reduce the randomness effect we average the results of the algorithms over 50
LDSQs. Table 1 lists the main simulation parameters.
Our simulator consists of two main modules: the MC module and the BSC module.
The MC module generates and controls the query requests of all MCs. Each MC is an
independent object that decides its movement autonomously. The BSC module man-
ages the spatial object set and meanwhile coordinates the operations of MBSs.
Simulation Results. First, we compare SLSQ with brute-force and -Scanning in terms
of CPU time. Fig. 1 shows how the CPU time of the three methods varies with the
number of non-spatial preference attributes |P|, i.e., the performance of the three
methods in terms of CPU time as a function of |P|. As shown in Fig. 1, SLSQ
outperforms the other methods on both data sets (School and NBA). This is because
SLSQ enables a large number of LDSQs to be answered directly from the results cached
at neighboring mobile clients and thus avoids a great majority of the dominance
relationship tests required by brute-force and -Scanning. We can also see from
Fig. 1 that -Scanning has a distinct advantage over brute-force. The reason is that
brute-force blindly checks all objects and thus incurs a significant overhead of
dominance relationship tests. Another observation from Fig. 1 is that the CPU cost
under the School and NBA data sets shows slightly different trends. This is because
School is independent in its non-spatial attributes while NBA follows an
anti-correlated distribution in its non-spatial attributes, which causes a different
overhead of dominance relationship tests.
Fig. 2 shows the CPU time as a function of the P2P wireless transmission range R.
R reflects the range of P2P wireless transmission between neighboring mobile
clients; a bigger R usually means a larger number of single-hop peers for a mobile
client. As we expect, SLSQ achieves better CPU performance than brute-force and
-Scanning, and the performance gap becomes wider as R increases. This is not
surprising because SLSQ avoids a great majority of dominance relationship tests
through P2P sharing. Moreover, for SLSQ, the increase of R gives each mobile client
a higher opportunity to fulfill its LDSQ by P2P sharing, whereas brute-force and
-Scanning are not influenced by R.
Fig. 3. Network delay vs. |P| ((a) School, (b) NBA)    Fig. 4. Network delay vs. R ((a) School, (b) NBA)
We can see from Fig. 3 that the network delay grows as the attribute dimensionality
increases. This is because the volume of the query result, which needs to be
transmitted, grows with the attribute dimensionality.
Fig. 4 shows the performance of the three methods in terms of network delay as a
function of R. As we expect, SLSQ has a distinct advantage over brute-force and
-Scanning in network delay. The reason is that, for many LDSQs, SLSQ avoids the
overhead of routing, message forwarding and handoff caused by connecting to the
remote LBS server, whereas this overhead is required by brute-force and -Scanning
for all LDSQs. We can also see from Fig. 4 that brute-force and -Scanning are not
influenced by R in network delay, while the network delay of SLSQ decreases as R
grows. The reason is that brute-force and -Scanning directly ask the LBS server to
compute the LDSQs and thus do not involve P2P wireless transmission between mobile
clients, while SLSQ leverages P2P sharing; the increase of R therefore gives SLSQ a
higher opportunity to fulfill an LDSQ by P2P sharing.
6 Conclusion
References
1. Zhang, B., Lee, K.W.C., Lee, W.C.: Location-dependent skyline query. In: Proc. of MDM, pp. 148–155. IEEE Press, Beijing (2008)
2. Xiao, Y., Zhang, H., Wang, J.: Skyline computation for supporting location-based services in a road network. Information 15(5), 1937–1948 (2012)
3. Stephan, B., Donald, K., Konrad, S.: The skyline operator. In: Proc. of ICDE, pp. 421–430. IEEE Press, Heidelberg (2001)
4. Tan, K.L., Eng, P.K., Ooi, B.C.: Efficient progressive skyline computation. In: Proc. of VLDB, Roma, pp. 301–310 (2001)
5. Kossmann, D., Ramsak, F., et al.: Shooting stars in the sky: An online algorithm for skyline queries. In: Proc. of VLDB, pp. 275–286 (2002)
6. Papadias, D., Tao, Y., et al.: Progressive skyline computation in database systems. ACM Transactions on Database Systems 30(1), 41–82 (2005)
7. Lee, K., Zheng, B., Li, H., Lee, W.-C.: Approaching the skyline in Z order. In: Proc. of VLDB, pp. 279–290 (2007)
8. Chomicki, J., Godfrey, P., Gryz, J., et al.: Skyline with presorting. In: Proc. of ICDE, pp. 717–816 (2003)
9. Godfrey, P., Shipley, R., Gryz, J.: Maximal vector computation in large data sets. In: Proc. of VLDB, pp. 229–240 (2005)
10. Bartolini, I., Ciaccia, P., Patella, M.: Efficient sort-based skyline evaluation. ACM Transactions on Database Systems 33(4), 1–45 (2008)
11. Deng, K., Zhou, X., Shen, H.: Multi-source skyline query processing in road networks. In: Proc. of ICDE, Istanbul, Turkey, pp. 796–805 (2007)
12. Xiao, Y., Chen, Y.: Efficient distributed skyline queries for mobile applications. Journal of Computer Science and Technology 25(3), 523–536 (2010)
13. Ku, W.S., Zimmermann, R.: Location-Based Spatial Query Processing in Wireless Broadcast Environments. IEEE Transactions on Mobile Computing 7(6), 778–790 (2008)
What Can We Get from Learning Resource
Comments on Engineering Pathway
1 Introduction
Corresponding author.
Fig. 1. Screen image of the Engineering Pathway    Fig. 2. Search results for the keyword "data mining"
The "100 Most Commented" list is the dataset we use in this paper. Each comment
contains a rating, a comment title, a comment description, an author, and a post
time. How can we use this dataset to answer questions like "Which learning resource
should we choose?" or "Which search result is the best, and which one is exactly
what I want?" That is why we focus on using these comments to evaluate the learning
resources, which we achieve by selecting the best alternative that matches all of
the digital library users' criteria.
Evaluating learning resources by their comments is a multiple, diverse-criteria
decision-making problem with high complexity, so we draw lessons from the Analytic
Hierarchy Process (AHP), a classical method presented by Saaty [3] and cited more
than 1,000 times. We make some modifications to this method and give more details
in the following sections.
2 Related Work
AHP can be used in many applications besides our web data mining area, such as
ecosystems [4], emergency management [5], plant control [6] and so forth. From those
research works, it is clear that AHP has been widely applied in many data management
and decision making areas. To address the problem encountered in our web data mining
area, that is, digital learning resource mining for EP, we summarize some related
former research works as follows.
Some researchers choose a combination of fuzzy theory and AHP, such as Tao et al.
[8], Uzoka et al. [9] and Feng et al. [10]. Tao proposed a decision model that
applies AFS (axiomatic fuzzy set) theory and the AHP method to obtain the ranking
order; they also provided definite semantic interpretations of the decision results
within their theory. Uzoka used fuzzy logic and AHP for medical decision support
systems. It is an interesting idea to introduce the semantic information proposed by
Tao, and we tried to add it to our EP comment mining; however, it causes high
computational complexity, so we leave this aspect for future work.
AHP can also be used with traditional data mining algorithms, as in Svoray et al.
[11]. They used a data mining (DM) procedure based on decision trees to identify
areas of gully initiation risk and compared it with an AHP expert system. Rad et al.
[13] proposed a very inspiring idea by considering the problem of clustering and
ranking university majors in Iran: using eight different criteria and 177 existing
university majors, they rank all the university majors. In our paper, we use five
different criteria and 100 learning resources as the experimental dataset, and both
subjective expert information and objective quantitative information are used to
build the judgement matrices.
As for educational digital library applications, we focus on how to provide both
teachers and students with enriching and motivational educational resources, as the
former researchers do in [14] and [15]. Jing and Qingjun [16] provided teachers and
students a virtual knowledge environment where students and teachers enjoy a high
rate of participation. Alias et al. [17] described an implementation of digital
libraries that integrated semantic research. Mobile ad hoc networks are becoming an
important part of the digital library ecology. Social navigation can be used to
replicate some social features of traditional libraries, to extend digital library
portals and to guide portal users to valuable resources in [13].
There are three layers in our system: the objective layer, the criteria layer, and
the alternatives layer. The framework of the comment hierarchy and the content of
each layer are shown in Figure 7.
where N_j is the number of comments for resource j, r_{j,n} is the rating value
given by comment n, and α_1 is a properly selected weighting for N_j. Suppose
$AI_{max}^{1}$ and $AI_{min}^{1}$ are the maximum and minimum absolute importance
among the top 100 most commented resources respectively; then the relative
importance value of resource j is given by the following linear scaling method:

$$RI_j^i = \frac{8}{AI_{max}^{i} - AI_{min}^{i}}\left(AI_j^i - AI_{min}^{i}\right) + 1$$
We assume that AImax > AImin .The element in matrix M i can be given
i RIki
by Mk,j = RIji
Note that we will use the same linear scaling method and
definition of RIji for the following matrices, i = 2, . . . , 5, and omit them.
where $N_j$ is the number of comments for resource $j$, $LT_{j,n}$ is the length of the
comment title given by comment $n$, and $\alpha_2$ is a properly selected weighting
for $N_j$. The relative importance $RI^2$ and the comment title matrix $M^2$ can be
obtained by the same method as for the rating matrix.
$$AI^3_j = \frac{1}{N_j}\sum_{n=1}^{N_j} LC_{j,n} + \alpha_3 N_j$$

where $N_j$ is the number of comments for resource $j$, $LC_{j,n}$ is the length of the
comment given by comment $n$, and $\alpha_3$ is a properly selected weighting for $N_j$.
where $N_j$ is the number of comments for resource $j$, $V_{j,n}$ is the author type value
of comment $n$, and $\alpha_4$ is a properly selected weighting for $N_j$.
3.3 Weighting
After getting the judgement matrix $M$, we obtain the weightings of the five
criteria in the second layer by calculating its normalized eigenvector $w_{max}$
corresponding to the maximum eigenvalue $\lambda_{max}$, that is,

$$M w_{max} = \lambda_{max} w_{max}, \qquad \sum_{i=1}^{5} w_{max,i} = 1.$$

Note that the $M^i$, $i = 1, \ldots, 5$, are $100 \times 100$ matrices, thus it
is a time-consuming job to exactly calculate the eigenvalues and eigenvectors. To
obtain a high-quality estimate in a short time, we use the following approach:

1. Normalize each column in $M$, that is, $M_{i,j} = M_{i,j} / \sum_{i=1}^{n} M_{i,j}$;
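The remaining steps of this estimation are not reproduced above; as an illustration, the sketch below assumes the common completion of this procedure (column normalization followed by row averaging), applied to a hypothetical judgement matrix.

```python
import numpy as np

def approx_priority_vector(M):
    """Approximate the normalized principal eigenvector of a judgement matrix M
    by normalizing each column (step 1 above) and then averaging the rows
    (assumed remaining steps)."""
    M = np.asarray(M, dtype=float)
    col_norm = M / M.sum(axis=0)          # step 1: normalize each column
    w = col_norm.mean(axis=1)             # average each row to estimate the eigenvector
    return w / w.sum()                    # weights sum to 1

# Hypothetical 3x3 judgement matrix of pairwise importance ratios.
M = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 2.0],
              [1/5, 1/2, 1.0]])
w_max = approx_priority_vector(M)
lambda_max = float((M @ w_max / w_max).mean())   # estimate of the maximum eigenvalue
```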
3.4 Consistency
$$CR_M = \frac{CI_M}{RI_n}$$

thus $\lambda^i_{max} = n$ and $CI_{M^i} = 0$. Furthermore, the total consistency also holds
because

$$\frac{w_{max,1} CI_{M^1} + w_{max,2} CI_{M^2} + \cdots + w_{max,5} CI_{M^5}}{w_{max,1} RI_{100} + w_{max,2} RI_{100} + \cdots + w_{max,5} RI_{100}} = 0 < 0.1.$$
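For reference, a minimal sketch of this consistency check; the consistency index $CI = (\lambda_{max} - n)/(n - 1)$ and the random-index table $RI_n$ below are the standard ones from the AHP literature rather than values taken from this paper.

```python
import numpy as np

# Saaty's random consistency index RI_n for matrix order n (standard AHP table).
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45, 10: 1.49}

def consistency_ratio(M, w):
    """CR_M = CI_M / RI_n for judgement matrix M with priority vector w."""
    M = np.asarray(M, dtype=float)
    n = M.shape[0]
    if n <= 2:
        return 0.0                              # 1x1 and 2x2 matrices are always consistent
    lambda_max = float((M @ w / w).mean())      # estimated maximum eigenvalue
    ci = (lambda_max - n) / (n - 1)             # consistency index
    return ci / RANDOM_INDEX[n]

# A judgement matrix is usually accepted when consistency_ratio(M, w) < 0.1.
```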
1. Criteria Evaluation:
Get the judgment matrix $M$ based on expert information, calculate its maximum
eigenvalue and the corresponding eigenvector $w_{max}$, and check its consistency by
the Saaty rule (adjust it if necessary);
3. Alternative Evaluation:
The final evaluation $v_j$ for alternative $j$ is given as

$$v_j = \sum_{i=1}^{5} w_{max,i}\, w^i_{max,j}.$$
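The final aggregation follows directly from the formula above; in this sketch `w_max` is the criteria weight vector and `W_alt` stacks the (hypothetical) per-criterion priority vectors of the alternatives.

```python
import numpy as np

def evaluate_alternatives(w_max, W_alt):
    """v_j = sum_i w_max[i] * W_alt[i, j]: combine the per-criterion priority
    vectors of the alternatives using the criteria weights."""
    w_max = np.asarray(w_max, dtype=float)   # shape (5,)
    W_alt = np.asarray(W_alt, dtype=float)   # shape (5, number_of_alternatives)
    return W_alt.T @ w_max                   # one score v_j per alternative
```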
4 Experiment
In this section, we will evaluate the proposed AHP method on the K-Gray
Engineering-Pathway library data. The proposed method is experimentally
compared with other general methods, such as average rating and average comment
number, to validate its effectiveness.
i      1     2     3     4     5
α_i    0.8   8     10    0.5   0.05
To make it clear, we define the standard deviation between AHP and method $\beta$ as

$$SD = \sum_{j=1}^{100} \left( v_j^{AHP} - v_j^{\beta} \right)^2$$

where $v_j^{AHP}$ and $v_j^{\beta}$ are the normalized evaluation values for resource $j$ obtained
by AHP and by method $\beta$, respectively. Table 3 reports the standard deviation for
the different methods. ACL and AR have smaller SD values compared with the other
methods. This is consistent with the expert information and our proposed AHP method:
from the judgement matrix $M$, we see that the criteria rating and comment length have
the bigger weightings. Finally, we report the top 10 recommended resources by
AHP, ACL and AR based on the 100 most commented resources in the computing
sciences.
AHP ACL AR
1 Alice 2.0 The Black Swan Alice 2.0
2 iWoz ACM K-12 Web-CAT
3 First Barbie Pair Programming Jeliot 3
4 Pair Programming Educating Engineers Award Courseware
5 The Black Swan Computer Museum RVLS
6 Computing ThinkCycle Greenfoot
7 Computational Geometry Initiatives World Without Oil
8 Language Media The Enigma
9 Achieving Dreams Scratch Girl Geeks
10 ACM K-12 Children Website Award Courseware
References
1. Zhang, Y., Agogino, A.M., Li, S.: Lessons Learned from Developing and Evaluating a Comprehensive Digital Library for Engineering Education. In: JCDL 2012: Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 393–394 (2012)
2. Zhang, Y., Zhou, G., Zhang, J., Xie, M., Yu, W., Li, S.: Engineering pathway for user personal knowledge recommendation. In: Gao, H., Lim, L., Wang, W., Li, C., Chen, L. (eds.) WAIM 2012. LNCS, vol. 7418, pp. 459–470. Springer, Heidelberg (2012)
3. Saaty, T.L.: Fundamentals of Decision Making and Priority Theory with the AHP, 2nd edn. RWS Publications, Pittsburgh (2000)
4. Koschke, L., Fuerst, C., Frank, S.: A multi-criteria approach for an integrated land-cover-based assessment of ecosystem services provision to support landscape planning. Ecological Indicators 21, 54–66 (2012)
5. Ergu, D., Kou, G., Peng, Y.: Data Consistency in Emergency Management. International Journal of Computers Communications & Control 7(3), 450–458 (2012)
6. Forsyth, G.G., Le Maitre, D.C., O'Farrell, P.J.: The prioritization of invasive alien plant control projects using a multi-criteria decision model informed by stakeholder input and spatial data. Journal of Environmental Management 103, 51–57 (2012)
7. Tao, L., Chen, Y., Liu, X.: An integrated multiple criteria decision making model applying axiomatic fuzzy set theory
8. Uzoka, F.-M.E., Obot, O., Barker, K.: An experimental comparison of fuzzy logic and analytic hierarchy process for medical decision support systems. Computer Methods and Programs in Biomedicine 103(1), 10–27 (2011)
9. Feng, Z., Wang, Q.: Research on health evaluation system of liquid-propellant rocket engine ground-testing bed based on fuzzy theory. Acta Astronautica 61(10), 840–853 (2007)
10. Svoray, T., Michailov, E., Cohen, A.: Predicting gully initiation: comparing data mining techniques, analytical hierarchy processes and the topographic threshold. Earth Surface Processes and Landforms 37(6), 607–619 (2012)
11. Amin, G.R., Gattoufi, S., Seraji, E.R.: A maximum discrimination DEA method for ranking association rules in data mining. International Journal of Computer Mathematics 88(11), 2233–2245 (2011)
12. Rad, A., Naderi, B., Soltani, M.: Clustering and ranking university majors using data mining and AHP algorithms: A case study in Iran. Expert Systems with Applications 38(11), 755–763 (2011)
13. Brusilovsky, P., Cassel, L., Delcambre, L., Fox, E., Furuta, R., Garcia, D.D., Shipman III, F.M., Bogen, P., Yudelson, M.: Enhancing digital libraries with social navigation: The case of Ensemble. In: Lalmas, M., Jose, J., Rauber, A., Sebastiani, F., Frommholz, I. (eds.) ECDL 2010. LNCS, vol. 6273, pp. 116–123. Springer, Heidelberg (2010)
14. Fernandez-Villavicencio, N.G.: Helping students become literate in a digital, networking-based society: a literature review and discussion. International Information and Library Review 42(2), 124–136 (2010)
15. Hsu, K.-K., Tsai, D.-R.: Mobile Ad Hoc Network Applications in the Library. In: Proceedings of the 2010 Sixth International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIHMSP 2010), pp. 700–703 (2010)
16. Jing, H., Qingjun, G.: Prospect Application in the Library of SNS. In: 2011 Third Pacific-Asia Conference on Circuits, Communications and System (PACCS), Wuhan, China, July 17-18 (2011)
17. Alias, N.A.R., Noah, S.A., Abdullah, Z., et al.: Application of semantic technology in digital library. In: Proceedings of 2010 International Symposium on Information Technology (ITSim 2010), pp. 1514–1518 (2010)
Tuned X-HYBRIDJOIN
for Near-Real-Time Data Warehousing
M. Asif Naeem
1 Introduction
Near real-time data warehousing exploits the concepts of data freshness in tra-
ditional static data repositories in order to meet the required decision support
capabilities. The tools and techniques for promoting these concepts are rapidly
evolving. Most data warehouses have already switched from a full refresh [7] to
an incremental refresh policy [4]. Further the batch-oriented, incremental refresh
approach is moving towards a continuous, incremental refresh approach.
One important research area in the field of data warehousing is data transfor-
mation, since the updates coming from the data sources are not in the format
required for the data warehouse. Furthermore, a join operator is required to
implement the data transformation.
In traditional data warehousing the update tuples are buffered in memory and
joined when resources become available [6], whereas in real-time data warehousing
these update tuples are joined as soon as they are generated in the data sources.
One important factor related to the join is that the two inputs of the join come from
different sources with different arrival rates. The input from the data sources is
in the form of an update stream, which is fast, while the access rate of the lookup
table is comparatively slow due to disk I/O cost. This creates a bottleneck in the
processing of the update stream, and the research challenge here is to minimize this
bottleneck by optimizing the performance of the join operator.
To overcome these challenges a novel stream-based join algorithm called X-
HYBRIDJOIN (Extended Hybrid Join) [1] was proposed recently by the author.
This algorithm not only addressed the issues described above but was also designed
to take into account typical market characteristics, commonly known as
the 80/20 sales rule [2]. According to this rule, 80 percent of sales focus on only 20
percent of the products, i.e., one observes a Zipfian distribution. To achieve this
objective, one component of the algorithm, called the disk buffer, is divided into two
equal parts. The contents of one part of the disk buffer (called the non-swappable
part) are kept fixed, while the contents of the other part (called the swappable
part) are exchanged in each iteration of the algorithm. As the non-swappable
part of the disk buffer always contains the most frequently used disk pages, most
stream tuples can be processed without invoking the disk. Although the author
presented an adaptive algorithm that adopts the typical market characteristics, the
components of the algorithm are not tuned to make efficient use of the available
memory resources. Further details about this issue are provided in Section 3.
On the basis of these observations, a revised X-HYBRIDJOIN, named Tuned
X-HYBRIDJOIN, is proposed. The cost model of the existing X-HYBRIDJOIN
is revised and the components of the proposed algorithm are tuned based on that
cost model. As a result, the available memory is distributed among all components
optimally, which improves the performance of the algorithm significantly.
The rest of the paper is structured as follows. The work related to the proposed
algorithm is presented in Section 2. Section 3 describes the problem with
the current approach. The proposed solution for the stated problem is presented
in Section 4. Section 5 presents the tuning of the proposed algorithm based on
the revised cost model. The experimental study is discussed in Section 6, and
finally Section 7 concludes the paper.
2 Related Work
Some techniques have already been introduced to process join queries over
continuous streaming data [5]. This section presents only those approaches that
are directly related to the stated problem domain.
A stream-based algorithm called Mesh Join (MESHJOIN) [8] was designed
specifically for joining a continuous stream with disk-based data in an active
data warehouse. This is an adaptive approach, but there are some research issues
related to inefficient memory distribution among the join components, due to
unnecessary constraints, and an inefficient strategy for accessing the disk-based
relation.
(Figure: components of the stream-based join operator: the stream S, the stream buffer, the queue t1 ... tm, the hash table with its hash function, the join output, and the disk-based master data R consisting of pages p1, ..., pn.)
table. If the tuples match, the algorithm generates the resulting tuple as output
and deletes the stream tuple from the hash table, along with its join attribute
value from the queue. In the next iteration the algorithm again dequeues the
oldest element from the queue, loads the relevant disk page into the swappable
part of the disk buffer, and repeats the procedure.
X-HYBRIDJOIN minimizes the disk access cost and improves performance
significantly by introducing the non-swappable part of the disk buffer. But in
X-HYBRIDJOIN the memory assigned to the swappable part of the disk buffer
is equal to the size of the disk buffer in HYBRIDJOIN [11], and the same amount
of memory is allocated to the non-swappable part of the disk buffer. In the
following it will be shown that this is not optimal. The problem considered in
this paper is to tune the size of both parts of the disk buffer so that the memory
distribution between these two components is optimal. Once these two components
have their optimal settings, memory can be assigned to the rest of the join
components accordingly.
4 Proposed Solution
Memory Cost. Since the optimal values for the sizes of the swappable part
and the non-swappable part can be different, it is assumed that the swappable part
has $k$ pages and the non-swappable part has $l$ pages. Overall,
the largest portion of the total memory is used for the hash table, while a much
smaller amount is used for each of the disk buffer and the queue. The memory
for each component can be calculated as given below:

Memory for the swappable part of the disk buffer (bytes) $= k \cdot v_P$ (where $v_P$ is the
size of each disk page in bytes).
Memory for the non-swappable part of the disk buffer (bytes) $= l \cdot v_P$.
Memory for the hash table (bytes) $= \alpha [M - (k + l) v_P]$ (where $M$ is the total
allocated memory and $\alpha$ is the memory weight for the hash table).
Memory for the queue (bytes) $= (1 - \alpha)[M - (k + l) v_P]$ (where $(1 - \alpha)$ is the memory
weight for the queue).

The total memory used by the algorithm can be determined by aggregating the
above.
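As an illustration, the memory budget above can be written out as follows; the hash-table weight is rendered here as `alpha` (its Greek symbol is lost in the text), and all numeric values in the example are illustrative only.

```python
def memory_budget(M, k, l, v_P, alpha):
    """Split the total memory M (bytes) among the join components.
    k, l: pages in the swappable and non-swappable parts of the disk buffer,
    v_P: page size in bytes, alpha: memory weight of the hash table
    (1 - alpha of the remainder goes to the queue)."""
    swappable     = k * v_P
    non_swappable = l * v_P
    remaining     = M - (k + l) * v_P
    return {
        "swappable_part":     swappable,
        "non_swappable_part": non_swappable,
        "hash_table":         alpha * remaining,
        "queue":              (1 - alpha) * remaining,
    }

# Illustrative example: 50 MB total memory, 4 KB disk pages.
budget = memory_budget(M=50 * 2**20, k=250, l=500, v_P=4096, alpha=0.9)
```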
Processing Cost. This section presents the processing cost for the proposed
approach. The cost for one iteration of the algorithm is denoted by $c_{loop}$ and
expressed as the sum of the costs of the individual operations. Therefore the
processing cost for each component is first calculated separately.

Cost to read the non-swappable part of the disk buffer (nanoseconds) $= c_{I/O}(l \cdot v_P)$.
Cost to read the swappable part of the disk buffer (nanoseconds) $= c_{I/O}(k \cdot v_P)$.
Cost to look up the non-swappable part of the disk buffer in the hash table (nanoseconds) $= d_N \cdot c_H$ (where $d_N = l \cdot \frac{v_P}{v_R}$ is the size of the non-swappable part of the disk
buffer in terms of tuples, $v_P$ is the size of a disk page in bytes, $v_R$ is the size of a disk tuple
in bytes, and $c_H$ is the look-up cost for one disk tuple in the hash table).
Cost to look up the swappable part of the disk buffer in the hash table (nanoseconds) $= d_S \cdot c_H$ (where $d_S = k \cdot \frac{v_P}{v_R}$ is the size of the swappable part of the disk buffer in terms
of tuples).
Cost to generate the output for $w$ matching tuples (nanoseconds) $= w \cdot c_O$ (where
$c_O$ is the cost to generate one tuple as output).
Cost to delete $w$ tuples from the hash table and the queue (nanoseconds) $= w \cdot c_E$
(where $c_E$ is the cost to remove one tuple from the hash table and the queue).
Cost to read $w$ tuples from stream S into the stream buffer (nanoseconds) $= w \cdot c_S$
(where $c_S$ is the cost to read one stream tuple into the stream buffer).
Cost to append $w$ tuples to the hash table and the queue (nanoseconds) $= w \cdot c_A$
(where $c_A$ is the cost to append one stream tuple to the hash table and the queue).

As the non-swappable part of the disk buffer is read only once, before execution
starts, it is excluded. The total cost for one loop iteration is:
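The total-cost expression itself is not reproduced above; as a sketch, summing the listed components (with the one-off read of the non-swappable part excluded, as stated) would look as follows, where all cost constants are illustrative parameters.

```python
def cost_per_iteration(w, k, l, v_P, v_R, c_IO, c_H, c_O, c_E, c_S, c_A):
    """Approximate c_loop (nanoseconds) as the sum of the per-iteration cost
    components listed above; w is the number of matching stream tuples."""
    d_S = k * v_P / v_R              # swappable part of the disk buffer, in tuples
    d_N = l * v_P / v_R              # non-swappable part, in tuples (its one-off read is excluded)
    return (c_IO * (k * v_P)         # read the swappable part from disk
            + d_N * c_H              # probe the non-swappable part against the hash table
            + d_S * c_H              # probe the swappable part against the hash table
            + w * c_O                # generate w output tuples
            + w * c_E                # delete w tuples from the hash table and the queue
            + w * c_S                # read w stream tuples into the stream buffer
            + w * c_A)               # append w tuples to the hash table and the queue
```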
5 Tuning
Stream-based join operators normally execute within limited memory, and
therefore tuning of the join components is important to make efficient use of the
available memory. For each component in isolation, more memory would be better,
but assuming a fixed memory allocation there is a trade-off in the distribution
of memory. Assigning more memory to one component means less memory for
the other components. Therefore the optimal distribution of memory among all
components needs to be found in order to attain maximum performance. A very
important component is the disk buffer, because reading data from disk into memory
is expensive.
In the proposed approach tuning is first performed through performance mea-
surements by considering a series of values for the sizes of the swappable and
non-swappable parts of the disk buer. Later a mathematical model for tuning is
also derived from the cost model. Finally, the tuning results of both approaches
are compared to validate the cost model. The details about the experimental
setup are presented in Table 1.
Parameter                      Value
Disk-based data
  Size of R                    0.5 million to 8 million tuples
  Size of each tuple vR        120 bytes
Stream data
  Size of each tuple vS        20 bytes
  Size of each node in queue   12 bytes
Benchmark
  Based on Zipf's law
  Characteristics              Bursty and self-similar
particular memory settings for swappable and non-swappable parts rather than
on every contiguous value.
The measurement approach assumes that the total memory size and the size
of R are fixed. The sizes of the swappable and non-swappable parts are varied in
such a way that for each size of the swappable part the performance is measured
against a range of sizes for the non-swappable part. By changing the sizes of
both parts of the disk buffer, the memory sizes for the hash table and the queue
are also affected.
The performance measurements for varying the sizes of both the swappable and
non-swappable parts are shown in Figure 3. The figure shows that the performance
increases rapidly when the size of the non-swappable part is increased. After
reaching a particular value for the size of the non-swappable part, the performance
starts decreasing. The plausible reason for this behavior is that in the beginning,
when the size of the non-swappable part increases, the probability of
matching stream tuples with disk tuples also increases, which improves the
performance. But when the size of the non-swappable part is increased further,
it does not make a significant difference to the stream matching probability. On the
other hand, due to the higher look-up cost and the fact that less memory is available
for the hash table, the performance decreases gradually. A similar behavior
is seen when the performance against the swappable part is tested. In this case,
after attaining the maximum performance it decreases rapidly because of an
increase in the I/O cost for loading the growing swappable part. From the
measurements shown in the figure it is possible to approximate the optimal settings
for both the swappable and non-swappable parts by finding the maximum on
the two-dimensional surface.
A mathematical model for the tuning is also derived based on the cost model
presented in Section 4.2. From Equation 3 it is clear that the service rate depends
on the size of $w$ and the cost $c_{loop}$. To determine the optimal settings it is first
necessary to calculate the size of $w$.
(Figure 3: measured service rate (tuples/sec) over the sizes of the swappable and non-swappable parts of the disk buffer (tuples).)
(Figure 4: the master data R on disk, showing the part that exists permanently in memory (non-swappable) and the part that is loaded into memory in the form of partitions (swappable), together with their frequency in the stream.)
The main components on which the value of $w$ depends are: the size of the
non-swappable part ($d_N$), the size of the swappable part ($d_S$), the size of the
master data ($R_t$), and the size of the hash table ($h_S$).
Typically the stream of updates can be approximated by Zipf's law with
a certain exponent value. Therefore, a significant part of the stream is joined
with the non-swappable part of the disk buffer. Hence, if the size of the non-swappable
part (i.e. $d_N$) is increased, more stream tuples will match as a result.
But the probability of matching does not increase at the same rate as $d_N$
because, according to the Zipfian distribution, the matching probability for the
second tuple in R is half of that for the first tuple, the matching probability for
the third tuple is one third of that for the first tuple, and so on [2].
Due to this property, the size of R (denoted by $R_t$) also affects the matching
probability. The swappable part of the disk buffer deals with the rest of the
master data, denoted by $R'$ (where $R' = R_t - d_N$), which is less frequent in
the stream than the part that exists permanently in memory. The algorithm
reads $R'$ in partitions, where the size of each partition is equal to the size of the
swappable part of the disk buffer, $d_S$. In each iteration the algorithm reads one
partition of $R'$ using an index on the join attribute and loads it into memory through
the swappable part of the disk buffer. In the next iteration the current partition in
memory is replaced by a new partition, and so on. As mentioned earlier, under
the Zipfian distribution the matching probability for every next tuple is less than
that of the previous one. Therefore, the total number of matches against each partition
is not the same. This is explained further in Figure 4, where $n$ total partitions
are considered in $R'$. From the figure it can be seen that the matching probability for
each disk partition decreases continuously as one moves toward the end position
in R.
The size of the hash table is another component that affects $w$. The reason
is simple: if there are more stream tuples in memory, the number of matches
will be greater, and vice versa. Before deriving the formula to calculate $w$ it is
first necessary to understand the working strategy of Tuned X-HYBRIDJOIN.
Consider for a moment that the queue contains stream tuples instead of just
join attribute values. Tuned X-HYBRIDJOIN uses two independent inner loops
under one outer loop. After the end of the first inner loop, which means after
finishing the processing of the non-swappable part, the queue only contains those
stream tuples which are related to the swappable part of R, denoted by $R'$.
For the next outer iteration of the algorithm these stream tuples in the queue
are considered to be the old part of the queue. In that next outer iteration the
algorithm loads some new stream tuples into the queue, and these new stream
tuples are considered to be the new part of the queue. The reason for dividing the
queue into two parts is that the matching probability for the two parts of the queue
is different. The matching probability for the old part of the queue is denoted by
$p_{old}$ and is based only on the size of the swappable part of R, i.e. $R'$. On the
other hand, the matching probability for the new part of the queue, denoted
$p_{new}$, depends on both the non-swappable and the swappable parts of R.
Therefore, to calculate $w$ it is first necessary to calculate both these probabilities.
Therefore, if the stream of updates S obeys Zipf's law, then the matching
probability of any swappable partition $k$ with the old part of the queue can be
determined mathematically as shown below.
$$p_k = \frac{\sum_{x=d_N+(k-1)d_S+1}^{\,d_N+k d_S} \frac{1}{x}}{\sum_{x=d_N+1}^{R_t} \frac{1}{x}}$$
Each summation in the above equation generates a harmonic series, which can
be summed up using the formula $\sum_{x=1}^{k} \frac{1}{x} = \ln k + \gamma + \epsilon_k$, where $\gamma$ is Euler's
constant, whose value is approximately equal to 0.5772156649, and $\epsilon_k$ is another
constant, approximately $\frac{1}{2k}$. The value of $\epsilon_k$ approaches 0 as $k$ goes to $\infty$ [3]. In this
paper the value of $\frac{1}{2k}$ is small and is therefore ignored.
If there are $n$ partitions in $R'$, then the average probability of an arbitrary
partition of $R'$ matching the old part of the queue can be determined using
Equation 4:

$$p_{old} = \frac{\sum_{k=1}^{n} p_k}{n} = \frac{1}{n} \qquad (4)$$
Now the probability of matching is determined for the new part of the queue.
Since a new input stream tuple can match either the non-swappable or the
swappable part of R, the average matching probability of the new part of the
queue with both parts of the disk buffer can be calculated using Equation 5:

$$p_{new} = p_N + \frac{1}{n} p_S \qquad (5)$$

where $p_N$ and $p_S$ are the probabilities of a stream tuple matching the
non-swappable part and the swappable part of the disk buffer, respectively. The
values of $p_N$ and $p_S$ can be calculated as below.
$$p_N = \frac{\sum_{x=1}^{d_N} \frac{1}{x}}{\sum_{x=1}^{R_t} \frac{1}{x}} \qquad \text{and} \qquad p_S = \frac{\sum_{x=d_N+1}^{R_t} \frac{1}{x}}{\sum_{x=1}^{R_t} \frac{1}{x}}$$
Assume that $w$ new stream tuples are loaded by the algorithm into the
queue in the next outer iteration. Therefore,

The size of the new part of the queue (tuples) $= w$
The size of the old part of the queue (tuples) $= (h_S - w)$

If $w$ is also the average number of matches per outer iteration with both the
swappable and non-swappable parts, then $w$ can be calculated by applying the
binomial probability distribution to Equations 4 and 5 as given below:

$$w = \frac{h_S\, p_{old}(1 - p_{old})}{1 + p_{old}(1 - p_{old}) - p_{new}(1 - p_{new})} \qquad (6)$$
By using the values of $w$ and $c_{loop}$ in Equation 3, the algorithm can be tuned.
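Putting Equations 4–6 together, the sketch below estimates w using the ln k + γ approximation of the harmonic sums; note that the operator between the two denominator terms of Equation 6 is not fully legible in the source, so the minus sign used here is an assumption, and the parameter values in the example are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649

def harmonic(k):
    """Approximate H_k = sum_{x=1}^{k} 1/x by ln k + gamma (the 1/(2k) term is ignored)."""
    return math.log(k) + EULER_GAMMA if k > 0 else 0.0

def estimate_w(h_S, d_N, d_S, R_t):
    """Average number of matches per outer iteration, following Equations 4-6."""
    n = (R_t - d_N) / d_S                     # number of partitions in R'
    p_old = 1.0 / n                           # Equation 4
    H_Rt = harmonic(R_t)
    p_N = harmonic(d_N) / H_Rt                # match against the non-swappable part
    p_S = (H_Rt - harmonic(d_N)) / H_Rt       # match against the swappable part of R
    p_new = p_N + p_S / n                     # Equation 5
    # Equation 6 (the minus sign in the denominator is assumed):
    return (h_S * p_old * (1 - p_old)) / (1 + p_old * (1 - p_old) - p_new * (1 - p_new))

w = estimate_w(h_S=200_000, d_N=1500, d_S=850, R_t=2_000_000)   # illustrative values
```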
Swappable Part: This experiment compares the tuning results for the swappable
part of the disk buffer using both the measurement and the cost model approaches.
The tuning results of each approach (with a 95% confidence interval in the
case of the measurement approach) are shown in Figure 5(a). From the figure it is
evident that at every position the results in both cases are similar, with only
0.5% deviation.
6 Experimental Study
To strengthen the arguments, an experimental evaluation of the proposed Tuned
X-HYBRIDJOIN is performed using synthetic datasets. Normally, in algorithms
of the Tuned X-HYBRIDJOIN kind, the total memory and the size of R are
(Figure 5: measured vs. calculated service rate (tuples/sec); (a) tuning comparison for the swappable part and (b) for the non-swappable part of the disk buffer, based on measurements vs. based on the cost model.)
(Figure 6: service rate of Tuned X-HYBRIDJOIN, X-HYBRIDJOIN, HYBRIDJOIN, R-MESHJOIN, and MESHJOIN; (a) the size of the disk-based relation varies (on a log-log scale), (b) the total allocated memory varies.)
the common parameters that vary frequently. Therefore, the experiments presented
here compare the performance by varying both parameters individually.
Figure 6 depicts the performance results. From the figure it can be observed that for all
memory budgets Tuned X-HYBRIDJOIN again performs significantly better
than all other approaches. This improvement increases gradually as the total memory
budget increases.
7 Conclusions
This paper investigates a well-known stream-based join algorithm called X-
HYBRIDJOIN. The main observation about X-HYBRIDJOIN is that tuning
is not considered, although it is necessary, particularly when limited memory
resources are available to execute the join operation. By omitting the tuning factor,
the available memory cannot be distributed optimally among the join components,
and consequently the algorithm cannot perform optimally. This paper
presents a variant of X-HYBRIDJOIN called Tuned X-HYBRIDJOIN.
The cost model presented for X-HYBRIDJOIN is revised and the proposed
algorithm is tuned based on that revised cost model. To strengthen the arguments, a
prototype of Tuned X-HYBRIDJOIN is implemented and its performance is
compared with existing approaches.
References
1. Naeem, M.A., Dobbie, G., Weber, G.: X-HYBRIDJOIN for Near-Real-Time Data Warehousing. In: Fernandes, A.A.A., Gray, A.J.G., Belhajjame, K. (eds.) BNCOD 2011. LNCS, vol. 7051, pp. 33–47. Springer, Heidelberg (2011)
2. Anderson, C.: The Long Tail: Why the Future of Business is Selling Less of More. Hyperion (2006)
3. Milton, A., Irene, A.S.: Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Ninth Dover printing, Tenth GPO printing, New York (1964)
4. Labio, W.J., Wiener, J.L., Garcia-Molina, H., Gorelik, V.: Efficient resumption of interrupted warehouse loads. SIGMOD Rec. 29(2), 46–57 (2000)
5. Golab, L., Tamer Özsu, M.: Processing Sliding Window Multi-Joins in Continuous Queries over Data Streams. In: VLDB 2003, Berlin, Germany, pp. 500–511 (2003)
6. Wilschut, A.N., Apers, P.M.G.: Dataflow query execution in a parallel main-memory environment. Distrib. Parallel Databases 1(1), 103–128 (1993)
7. Gupta, A., Mumick, I.S.: Maintenance of Materialized Views: Problems, Techniques, and Applications. IEEE Data Engineering Bulletin 18, 3–18 (2000)
8. Polyzotis, N., Skiadopoulos, S., Vassiliadis, P., Simitsis, A., Frantzell, N.: Meshing Streaming Updates with Persistent Data in an Active Data Warehouse. IEEE Trans. on Knowl. and Data Eng. 20(7), 976–991 (2008)
9. Naeem, M.A., Dobbie, G., Weber, G.: R-MESHJOIN for Near-real-time Data Warehousing. In: DOLAP 2010: Proceedings of the ACM 13th International Workshop on Data Warehousing and OLAP. ACM, Toronto (2010)
10. Chakraborty, A., Singh, A.: A partition-based approach to support streaming updates over persistent data in an active data warehouse. In: IPDPS 2009: Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, pp. 1–11. IEEE Computer Society, Washington, DC (2009)
11. Naeem, M.A., Dobbie, G., Weber, G.: HYBRIDJOIN for Near-real-time Data Warehousing. International Journal of Data Warehousing and Mining (IJDWM) 7(4) (2011)
Exploiting Interaction Features
in User Intent Understanding
Università di Salerno, Via Ponte don Melillo, Fisciano (SA), Italy
{deufemia,mgiordano,gpolese}@unisa.it, [email protected]
1 Introduction
for UIU limit their analysis to the results contained in a SERP, ignoring many
important interactions and contents visited from such results. On the contrary,
the aim of this paper is to show that UIU can be considerably improved by
performing additional analysis of user interactions on the web pages of a SERP
that the user decides to visit. In particular, we define a new model for UIU that
incorporates interactions analyzed in the context of SERP results and of the
web pages visited starting from them. The interaction features considered in the
model are not global page-level statistics; rather, they are finer grained and refer
to portions of web pages. This is motivated by the results of literature
studies, performed with eye-trackers, revealing that although a user might focus
on many sections composing a web page, s/he will tend to overlook portions
of low interest [10]. Thus, capturing interaction features on specific portions of
web pages potentially conveys a better accuracy in the evaluation of the user
actions. These and other features are evaluated in our approach by means of a
classification algorithm to understand user intents. In particular, to simplify the
classification process, we use a common taxonomy that defines three types of
queries: informational, navigational, and transactional [11].
Experimental results highlight the efficiency of the proposed model in query
classification, showing how the interaction features extracted from visited web
pages contribute to enhancing user understanding. In particular, the proposed set of
features has been evaluated with three different classification algorithms, namely
Support Vector Machine (SVM) [12], Conditional Random Fields (CRF) [13], and
Latent Dynamic Conditional Random Fields (LDCRF) [14]. The achieved results
have been compared with those achieved with other subsets of features and demonstrate
that the proposed model outperforms them in terms of query classification.
Moreover, a further analysis highlights the effectiveness of the proposed features
for the classification of transactional queries.
The rest of this paper is organized as follows. In Section 2, we provide a
short review of related work. Then, we present the model exploiting interaction
features for UIU in Section 3. Section 4 describes experimental results. Finally,
conclusions are given in Section 5.
2 Related Work
A pioneering study by Carmel et al. [15] in the early 1990s focusses on the analysis of
hypertext. In particular, the study highlights three navigation strategies: scan
browsing, aiming to inquire into and evaluate information contained in the hypertext
based on interest; review browsing, aiming to integrate information contained
in the user's mental context; and search-oriented browsing, aiming to look for
new information based on specific goals and to integrate it with information
collected in successive scans.
A further interesting study by Morrison et al. [16] aimed to classify types of
queries based on search intents. Such taxonomies are defined as formalizations of
three basic questions that the user asks him/herself before starting a search
session: why, how, and what to search. The taxonomy that is closest to user intent
understanding is the one used in the context of a research effort called method taxonomy,
which aims to detect the following types of search activities: explore, monitor, find,
and collect. By monitoring user search activities, Sellen et al. extend previously
defined taxonomies, introducing a transactional search type [17], defined as
transacting, which is oriented to search purposes and to the use of web services on which
it is possible to better exploit user experiences while performing search activities.
In this context, the goal of the search is not to find information but rather services for
information manipulation: hence, the search engine becomes a means to migrate
towards a web service. Subsequently, new approaches to user intent understanding
have highlighted the necessity of introducing a clear classification of search queries
based on user intents. To this end, one of the earliest studies, by Broder [11],
refers to a simple taxonomy of search queries based on three distinct categories:
navigational queries, to search for a particular web site; informational queries, to
collect information from one or more web pages; and transactional queries, to reach
a web site on which further interactions are expected.
In the last decade, the existing approaches have aimed at applying such taxonomies
in a recognition and automatic classification context. One of the earliest
methodologies, by Kang et al. [4], targets two types of search activities: topic relevance,
that is, searching documents for a given topic, of informational type, and homepage
finding, aiming to find the main pages of several types of navigational web
sites. Starting from common information used by Information Retrieval (IR)
systems, such as web page content, hyperlinks, and URLs, the model proposes
methods to classify queries into the two categories mentioned above.
The analysis and comprehension of user interaction models during web
navigation is the basis of a predictive model on real case studies proposed by
Agichtein et al. [6]. The model tries to elicit and understand user navigation
behaviors by analyzing several activities, such as clicks, scrolls, and dwell times,
aiming to predict user intention during web page navigation. The features this
study proposes to analyze are used to characterize the complex system of
interactions following the click on a result. Subsequently, a study by Guo et al. [9]
starts from the hypothesis that such interactions can help to accurately infer
two particular, tightly correlated intents: search and purchase of products, a two-phase
activity defined as search-purchase.
Starting from experimental studies on real user navigation strategies, Lee
et al. propose a model for the automatic identification of search goals based
on features, restricted to navigational and informational queries [5]. These
studies have primarily revealed that most queries can be effectively associated
with one of the two categories defined within the taxonomy, making it possible
to construct an automatic identification system. Moreover, most queries that
are not effectively associable with a category are related to few topics, such as
proper nouns or names of software systems. This makes it possible to use ad-hoc
systems of features for all those unpredictable cases. The model proposes two
features: past user-click behavior, which infers user intent from the way s/he has
interacted with the results in the past, and anchor-link distribution, which uses
possible targets of links having the same text as the query.
The strategies for user intent understanding described so far aim to classify
search queries exclusively using features modeled to describe several aspects of
the search queries. A study by Tamine et al. proposes to analyze the search activities
that the user has previously performed in the same context, aiming to derive
data useful for inferring the type of the current query [8]. The set of queries
already performed represents the query profile.
In this section we describe the proposed model and the features used for the
classification process.
Several approaches and models have been proposed to provide solutions to the
user intention understanding problem, but all of them mainly focus on the
interaction between users and the SERP. Additional interactions originating from
the SERP's contents, such as browsing, reading, and multimedia content fruition, are
not considered by the research community.
The proposed model aims to extend existing models by analyzing the
interactions between users and web pages during a search session. Our aim is to
analyze not only the interactions between users and the SERP, but also those between
users and the web pages reached by clicking on the SERP's results. We believe that
data about interactions between users and web pages may be very useful to
clarify the intent of the user, because these interactions are driven by the same
motivation behind the initial search query. All user interactions with web pages
can be reduced to three main categories, which are the same we consider in
our model: session, search, and interaction.
Session. A session is a sequence of search activities aimed at achieving a goal.
When the first query does not provide the desired result, the user tries to gradually
approach the target, refining or changing search terms and keywords. All
these search activities constitute a session.
Search. A search activity is the combination of the following user actions: sub-
mission of a query to a search engine, analysis of search results, navigation on
one or more web pages inside them. During a search activity a user has a spe-
cific goal, generally described by the query itself. This goal is classifiable in a
taxonomy, as defined by previous studies [11, 18].
Interaction. Interaction is the navigation of a web page using a wide range of
interactions that include mouse clicks, page scrolling, pointer movements, and
text selection. Starting from these interactions, combined with features such as
dwell time, reading rate, and scrolling rate, it is possible to derive an implicit
feedback of users about web pages [2]. Moreover, several studies have proven the
usefulness of user interactions to assess the relevance of web pages [1, 3, 19, 20],
to classify queries, and to determine the intent of search sessions [9, 21].
3.2 Features
Interaction data extracted from user web navigation have been encoded into
features that characterize user behavior. We organize the set of features into the
following categories: query, search, interaction, and context.
Query. These features are derived from characteristics of a search query such
as keywords, the number of keywords, the semantic relations between them, and
other characteristics of a search or an interaction.
Search. These features act on the data from search activities such as: re-
sults, time spent on SERP, and number of results considered by the user. The
DwellTime is measured from the start of the search session until the end of
the last interaction originated by the same search session. The reaction time,
TimeToFirstInteraction, is the time elapsed from the start of the search session
and the complete loading of the first selected page. Other features dedicated
to interactions with the results are ClicksCount, which is the number of visited
results, and FirstResultClickedRank, determining the position of the first clicked
result.
Interaction. These features act on the data collected from interactions with
web pages and subpages, taking into account the absolute dwell time, the
effective dwell time, all the scrolling activities, and search and reading activities. The
DwellRate measures the effectiveness of the permanence of a user on a web
page, while the reading rate, ReadingRate, measures the amount of reading of a
web page [2]. Additional interaction features are: ViewedWords, the number of
words considered during the browsing; UrlContainsTransactionalTerms, which
verifies whether the URL of the page contains transactional terms (download, software,
video, watch, pics, images, audio, etc.); and AjaxRequestsCount, which represents the
number of AJAX requests originated during browsing.
Context. These features act on the relationship between the search activities
performed in a session, such as the position of a query in the sequence of search
requests for a session.
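To make the feature set more concrete, the sketch below assembles a few of the interaction features for a single visited page. Only UrlContainsTransactionalTerms is fully specified in the text; the formulas used for DwellRate and ReadingRate are placeholders, and all function and field names are ours.

```python
TRANSACTIONAL_TERMS = ("download", "software", "video", "watch", "pics", "images", "audio")

def interaction_features(url, viewed_words, ajax_requests,
                         dwell_time, effective_dwell_time,
                         read_words, total_words):
    """Assemble a partial, hypothetical interaction feature vector for one page visit."""
    return {
        "UrlContainsTransactionalTerms": any(t in url.lower() for t in TRANSACTIONAL_TERMS),
        "ViewedWords": viewed_words,          # words considered during browsing
        "AjaxRequestsCount": ajax_requests,   # AJAX requests originated during browsing
        # Placeholder rates (the paper does not spell out these formulas):
        "DwellRate": effective_dwell_time / dwell_time if dwell_time else 0.0,
        "ReadingRate": read_words / total_words if total_words else 0.0,
    }
```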
4 Experiments
In this section we describe the dataset constructed for evaluating the proposed
approach and the results achieved with different classification algorithms.
In particular, after an overview of the evaluation metrics used and of the subsets
of features taken into consideration, experimental results are presented.
Dataset. The dataset has been constructed by the thirteen participants in the
test. In particular, all the subjects were requested to perform a series of search
activities related to various topics and with different intentions. Each subject
was asked to determine his/her own goal in advance. After starting the session,
subjects performed a series of searches using the Google search engine, without
any limitation as to time or the results to visit. By following this protocol, we had
129 sessions and 353 web searches, which were subsequently manually classified
by relying on the intent of the user. Starting from the web searches, 490 web pages
and 2136 subpages were visited. All interactions were detected by the YAR
plug-in for Google Chrome/Chromium [2].
All: the subset of all the proposed features (query, search, interaction, and context);
4.2 Results
In order to simulate an operating environment, the set of queries made by users
was separated into two subsets, containing 60% and 40% of the web searches,
which were used for training and testing the classifiers, respectively.
Taking into account informational queries (see Figure 1), it can be seen
that the best classifications were achieved by using the CRF classifier with the All set
of features, whereas the worst performances were achieved by considering the
subsets of features Query and Search. Similar considerations hold for Search+Query.
These initial observations demonstrate the importance of the subset of interaction
features, which alone are able to achieve classification performances close
to those achieved by the All set of features.
Moreover, transactional features are also useful to improve the quality of the
classification process for informational queries (Table 2), whereas they have no
effect on navigational ones (Table 3).
5 Conclusions
In this paper we have proposed a new model for user intent understanding in web
search. Assuming that the interactions of the user with the web pages returned
by a search engine in response to a query can be highly useful, in this research
we aimed at defining a new model based on the results returned by the search
engine, on the interactions of the user with them, and with the web pages visited
by exploring them. By examining these interactions, we have produced a set of
features that are suitable for determining the intent of the user. Each feature
involves a different level of interaction between the user and the system: query,
search, and web pages (interaction). To simplify the classification process and make
it more efficient, we also adopt a simple one-level taxonomy: informational,
navigational, and transactional. In addition to the set of all the proposed
features, during the testing phase we also considered some subsets corresponding to
features of individual interaction contexts and to their union or difference, in
order to evaluate the effectiveness of the classification with respect to traditional
models and to interactional or transactional features.
Experimental results have highlighted the effectiveness of query classification
when applying both features representing interactions on web pages and those
representing interactions in the context of queries and results. This also arises in
the classification of transactional queries, further highlighting the effectiveness
of interactional features and, more importantly, of transactional features.
References
1. Agichtein, E., Brill, E., Dumais, S.: Improving web search ranking by incorporating user behavior information. In: Proceedings of the International Conference on Research and Development in Information Retrieval, SIGIR 2006, pp. 19–26. ACM (2006)
2. Deufemia, V., Giordano, M., Polese, G., Tortora, G.: Inferring web page relevance from human-computer interaction logging. In: Proceedings of the International Conference on Web Information Systems and Technologies, WEBIST 2012, pp. 653–662 (2012)
3. Guo, Q., Agichtein, E.: Beyond dwell time: estimating document relevance from cursor movements and other post-click searcher behavior. In: Proceedings of the International Conference on World Wide Web, WWW 2012, pp. 569–578. ACM (2012)
4. Kang, I., Kim, G.: Query type classification for web document retrieval. In: Proceedings of the Conference on Research and Development in Information Retrieval, SIGIR 2003, pp. 64–71. ACM (2003)
5. Lee, U., Liu, Z., Cho, J.: Automatic identification of user goals in web search. In: Proceedings of the International Conference on World Wide Web, WWW 2005, pp. 391–400. ACM (2005)
6. Agichtein, E., Brill, E., Dumais, S., Ragno, R.: Learning user interaction models for predicting web search result preferences. In: Proceedings of the International Conference on Research and Development in Information Retrieval, SIGIR 2006, pp. 3–10. ACM (2006)
7. Jansen, B.J., Booth, D.L., Spink, A.: Determining the user intent of web search engine queries. In: Proceedings of the International Conference on World Wide Web, WWW 2007, pp. 1149–1150. ACM (2007)
8. Tamine, L., Daoud, M., Dinh, B., Boughanem, M.: Contextual query classification in web search. In: Proceedings of the International Workshop on Information Retrieval Learning, Knowledge and Adaptability, LWA 2008, pp. 65–68 (2008)
9. Guo, Q., Agichtein, E.: Ready to buy or just browsing? Detecting web searcher goals from interaction data. In: Proceedings of the International Conference on Research and Development in Information Retrieval, SIGIR 2010, pp. 130–137. ACM (2010)
10. Nielsen, J.: F-shaped pattern for reading web content (2006), http://www.useit.com/articles/f-shaped-pattern-reading-web-content/
11. Broder, A.: A taxonomy of web search. SIGIR Forum 36, 3–10 (2002)
12. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20, 273–297 (1995)
13. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the International Conference on Machine Learning, ICML 2001, pp. 282–289 (2001)
14. Morency, L.P., Quattoni, A., Darrell, T.: Latent-dynamic discriminative models for continuous gesture recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007, pp. 1–8 (2007)
15. Carmel, E., Crawford, S., Chen, H.: Browsing in hypertext: a cognitive study. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans 22, 865–884 (1992)
16. Morrison, J., Pirolli, P., Card, S.K.: A taxonomic analysis of what world wide web activities significantly impact people's decisions and actions. In: Extended Abstracts on Human Factors in Computing Systems, CHI EA 2001, pp. 163–164. ACM (2001)
17. Sellen, A., Murphy, R., Shaw, K.: How knowledge workers use the web. In: Extended Abstracts on Human Factors in Computing Systems, CHI 2002, pp. 227–234. ACM (2002)
18. Rose, D., Levinson, D.: Understanding user goals in web search. In: Proceedings of the International Conference on World Wide Web, WWW 2004, pp. 13–19. ACM (2004)
19. Guo, Q., Agichtein, E.: Towards predicting web searcher gaze position from mouse movements. In: Proceedings of the International Conference on Human Factors in Computing Systems, CHI EA 2010, pp. 3601–3606. ACM (2010)
20. Kelly, D., Teevan, J.: Implicit feedback for inferring user preference: a bibliography. SIGIR Forum 37, 18–28 (2003)
21. Guo, Q., Agichtein, E.: Exploring mouse movements for inferring query intent. In: Proceedings of the International Conference on Research and Development in Information Retrieval, pp. 707–708. ACM (2008)
22. Lauer, F., Guermeur, Y.: MSVMpack: a multi-class support vector machine package. Journal of Machine Learning Research 12, 2269–2272 (2011)
Identifying Semantic-Related Search Tasks
in Query Log
Keywords: Query log analysis, User intent, Search task, Named entity.
1 Introduction
Detecting user intents in queries is crucial for search engines to better satisfy
users' information needs. Usually users may issue multiple related queries to
accomplish one search task [7]; e.g., the queries "magnolia junior high" and
"chino high school" may belong to the same search task of looking for schools
in the City of Chino. Identifying search tasks could help search engines to better
understand user intents. Many works have been proposed to identify search
tasks, but long-term and semantic-related search task identification has not been
well studied. Search tasks are often intertwined [11], and may span from seconds
to days [2]. For example, the search task "trip planning" can span many days.
During this period the user may issue many queries of other search tasks. To the
best of our knowledge, existing techniques do not work well for identifying such
long-term search tasks [3,9,8]. Because of the sparsity of clicks and lexical features
in the query log, queries triggered by the same semantic-related search task may
share few common terms or clicked documents. Identifying long-term and
semantic-related search tasks is difficult. Firstly, it is not straightforward how to exploit semantic
2 Related Work
The related work can be organized into the following two categories:
Query intent analysis has attracted substantial attention. User queries
are often classified into three high-level intents: navigational, informational and
transactional [4]. Recently, many works [6,12] have attempted to exploit named
entities to better understand user intent. Guo et al. [6] exploited the category
of named entities to classify user queries. Yin et al. [12] built a hierarchical
taxonomy of generic search intents based on named entities. Recently, research
on the query intent of multiple queries, such as query refinements [10] and
search tasks [9], has also attracted much attention [5,1,8,2].
Search task identification was proposed by Jones et al. [9] in the setting
of query session detection [3]. Jones et al. [9] determined the boundaries of
search tasks between chronological queries with a decision tree. Donato et al. [5]
determined whether pairs of chronologically issued queries target the same
search task. This approach cannot deal with intertwined search tasks well.
Aiello et al. [1] clustered the identified search tasks of different users into more
general topics with the approach proposed in [5]. In contrast, this paper aims to discover
more semantic-related search tasks accomplished by the same user. Lucchese et
al. [8] clustered user queries in a time-split session. They exploited both content
features and semantic features in a straightforward way. The different importance
of terms and the dependency among terms in queries are ignored, so the semantics
of queries might be distorted. Sadikov et al. [10] clustered the refinements of a query
according to different user intents, but intertwined search tasks could not be
effectively identified. Boldi et al. [3] introduced the query-flow graph to represent
the query log. This model was used to separate chronological queries into sets of
related queries, named query chains, using three categories of features. Kotov et
al. [2] modeled the user behaviors of cross-session search tasks. This work also
showed the importance of identifying long-term search tasks.
3 Problem Formalization
This section introduces the distance function and semantic-related features used
for search task identification.
$Q$ denotes all the queries issued by a user. A search task $T$ is a subset of $Q$. $E$
denotes the set of named entities collected from collaborative knowledge bases, such
as Wikipedia. $Cat$ denotes the set of categories of named entities. The category of
a named entity $e$ is denoted as $cat(e)$. A user query $q_i \in Q$ can be formalized
as a bi-tuple $q_i = \langle e_i, c_i \rangle$, where $e_i$ denotes the entity contained in the
query $q_i$ and $c_i$ denotes the remaining terms of that query. $q = \langle q_i, q_j \rangle$ denotes
any pair of queries issued by a user. The function $f: Q \times Q \to [0, 1]$ denotes the
semantic-related distance between pair-wise queries. Given labeled semantic-related
user query pairs, finding the distance function $f$ becomes the following
optimization problem:

$$f^{*} = \arg\min_{f \in \mathcal{H}_K} \Big[ \frac{1}{l} \sum_{i=1}^{l} V(f, x_i, y_i) + \lambda \|f\|^2 \Big] \qquad (1)$$

where $V$ is the loss function over the labeled data, such as the squared loss for
Regularized Least Squares, $\mathcal{H}_K$ is a Reproducing Kernel Hilbert Space of functions
with kernel function $K$, $l$ is the number of labeled query pairs, and $\lambda$ is the
coefficient of regularization on the ambient space. This paper defines the distance
function as a linear function of the feature vector $x_q$ between pair-wise queries.
The solution of the minimization problem will be discussed in Section 4.
The features between pair-wise queries can be split into two parts: entity features
and context features. The entity features used in the distance function are:
Semantic Relevance of Named Entities. To get the semantic relevance feature
between pair-wise named entities, it is necessary to use some external semantic
sources, such as Wikipedia. Therefore we crawled and indexed web pages
from Wikipedia and built a search engine. The relevance between a named entity
and all indexed documents constitutes the semantic feature vector for that entity.
Context Semantic Relevance. This paper also uses the vector model to build
a semantic feature vector for each context. Different from [8], this paper submits
each context as a whole to the search engine. The context semantic relevance
between pair-wise contexts is the cosine similarity between their semantic feature vectors.
Context Lexical Distance. As illustrated in [9,8], queries containing common
terms may target relevant search tasks. This paper uses both the Jaccard index
on tri-grams and the normalized Levenshtein distance to represent the lexical distance
between pair-wise contexts, as suggested by the works [9,3].
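Both lexical measures are standard; as an illustration, the sketch below computes the Jaccard index on character tri-grams and a normalized Levenshtein distance for a pair of query contexts (the function names are ours).

```python
def trigrams(s):
    """Character tri-grams of a lowercased string."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def jaccard_trigram(a, b):
    """Jaccard index on the character tri-grams of two contexts."""
    A, B = trigrams(a), trigrams(b)
    return len(A & B) / len(A | B) if (A | B) else 1.0

def normalized_levenshtein(a, b):
    """Levenshtein edit distance divided by the length of the longer string."""
    if not a and not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1] / max(len(a), len(b))
```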
where $l$ is the number of labeled pair-wise queries, $N$ is the total number of
pair-wise queries, $y_i$ denotes the labeled value of the semantic-related distance function,
$\lambda$ is the coefficient of regularization in the ambient space, and $\mu$ is the coefficient of
the same-category regularization. As mentioned above, the distance function is a
linear combination of the different features. So Eq. 1 can be reformulated as follows:

$$w^{*} = \arg\min_{w} \Big[ \frac{1}{l} \sum_{i=1}^{l} (y_i - w^T x_i)^2 + \lambda\, w^T w + \frac{\mu}{N}\, w^T X^T X w \Big] \qquad (4)$$
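Because the distance function is linear in the features, the reconstructed Equation 4 is a regularized least-squares problem with a closed-form solution. The sketch below solves it directly, with `lam` and `mu` standing in for the two regularization coefficients whose symbols are lost in the extraction.

```python
import numpy as np

def learn_distance_weights(X_labeled, y, X_all, lam, mu):
    """Minimize (1/l)*||y - X_l w||^2 + lam*||w||^2 + (mu/N)*||X w||^2 over w.
    X_labeled: (l, d) feature matrix of labeled query pairs, y: (l,) labels,
    X_all: (N, d) feature matrix of all query pairs used by the regularizer."""
    l, d = X_labeled.shape
    N = X_all.shape[0]
    A = (X_labeled.T @ X_labeled) / l + lam * np.eye(d) + (mu / N) * (X_all.T @ X_all)
    b = (X_labeled.T @ y) / l
    return np.linalg.solve(A, b)

def pairwise_distance(w, x_pair):
    """Semantic-related distance for one pair of queries, clipped to [0, 1]."""
    return float(np.clip(w @ x_pair, 0.0, 1.0))
```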
5 Experiments
5.1 Experiment Setup
This section discusses our experimental data set, the two extended state-of-the-art
approaches to be compared with our approach STI, and the evaluation measures.
Data Set. The 2006 AOL query log is used for the experiments, in order to compare
with related approaches tested on the same log. This query log consists of about
20 million web queries issued by more than 650,000 users over 3 months. In
order to learn the distance function, 200 randomly selected pairs of queries are
labeled as semantic-related and 200 randomly selected pairs of queries are labeled
as not semantic-related. Named entities are collected from Wikipedia with the
"list of *" queries, such as "list of states in U.S.".
5.2 Evaluation
The user annotation results are presented in Table 1. Each evaluator is required
to rate each identified search task as 1 (Poor), 2 (Bad), 3 (Acceptable), 4 (Good)
or 5 (Perfect) and optionally indicates whether an identified search task is under-clustered
or over-clustered. According to the user rating scores, the
differences among the three approaches are statistically significant, and STI
performs the best. The column of the ratio of under-clustered search tasks shows
that STI yields the lowest one. As shown in Fig. 1, the coverage of the search
tasks identified by STI is much higher than that of both mQCI and mTSDP. The column
of the ratio of over-clustered search tasks shows that STI yields the largest one.
We now look at the reasons that are likely to underlie the observed results.
Both mTSDP and mQCI perform better at aggregating misspelled queries,
such as "800flowers" and "1-800flower". But STI tends to yield longer-term
and more semantically related search tasks than the others, such as "magnolia junior
high" and "chino high school", which are not in the results of either mQCI
or mTSDP. This is consistent with what Fig. 1 and Fig. 2 show.
Table 1. User annotation results

Method   Average Rating   Under Clustered   Over Clustered
STI      4.4              7%                10%
mQCI     2.6              18.9%             5%
mTSDP    3.8              17%               7%

[Fig. 1: coverage of the identified search tasks versus the clustering threshold for STI, mQCI, and mTSDP]
[Fig. 2: number of identified search tasks versus the clustering threshold for STI, mQCI, and mTSDP]
[Fig. 3: coverage and cohesion of the search tasks identified by STI with and without entity category regularization]
Fig. 3 shows the impact of the entity category regularization, which was discussed
in Section 4.1. When the distance function is learned with entity category
regularization, the balance point of the coverage and cohesion curves is 50% higher
than that without regularization. With regularization, the coverage of the
identified search tasks is always higher, and when the clustering threshold exceeds 0.6,
the coverage and cohesion of the identified search tasks with regularization are both
higher than those without regularization. The category regularization helps to overcome
the over-fitting problem and reinforces that queries containing named entities of
the same category probably belong to the same search task.
6 Conclusion
References
1. Aiello, L.M., Donato, D., Ozertem, U., et al.: Behavior-driven clustering of queries into topics. In: Proc. of the 20th CIKM, pp. 1373-1382 (2011)
2. Kotov, A., Bennett, P.N., White, R.W., et al.: Modeling and analysis of cross-session search tasks. In: Proc. of the 34th SIGIR (2011)
3. Boldi, P., Bonchi, F., Castillo, C., et al.: The query-flow graph: model and applications. In: Proc. of the 17th CIKM, pp. 609-618 (2008)
4. Broder, A.: A taxonomy of web search. SIGIR Forum 36, 3-10 (2002)
5. Donato, D., Bonchi, F., Chi, T., et al.: Do you want to take notes?: identifying research missions in Yahoo! Search Pad. In: Proc. of the 19th WWW (2010)
6. Guo, J., Xu, G., Cheng, X., et al.: Named entity recognition in query. In: Proc. of the 32nd SIGIR, pp. 267-274 (2009)
7. Ji, M., Yan, J., Gu, S., et al.: Learning search tasks in queries and web pages via graph regularization. In: Proc. of the 34th SIGIR, pp. 55-64 (2011)
8. Lucchese, C., Orlando, S., Perego, R., et al.: Identifying task-based sessions in search engine query logs. In: Proc. of the 4th WSDM, pp. 277-286 (2011)
9. Jones, R., Klinkner, K.L.: Beyond the session timeout: automatic hierarchical segmentation of search topics in query logs. In: Proc. of the 17th CIKM (2008)
10. Sadikov, E., Madhavan, J., Wang, L., et al.: Clustering query refinements by user intent. In: Proc. of the 19th WWW, pp. 841-850 (2010)
11. Spink, A., Park, M., Jansen, B.J., et al.: Multitasking during web search sessions. Information Processing and Management 42(1), 264-275 (2006)
12. Yin, X., Shah, S.: Building taxonomy of web search intents for name entity queries. In: Proc. of the 19th WWW, pp. 1001-1010 (2010)
Multi-verifier:
A Novel Method for Fact Statement Verification
{wangteng,zq,swang}@ruc.edu.cn
Abstract. Untruthful information spreads on the web, which may mislead users
or have a negative impact on user experience. In this paper, we propose a method
called Multi-verifier to determine the truthfulness of a fact statement. The basic
idea is that whether a fact statement is truthful or not depends on the information
related to the fact statement. We utilize a popular search engine to collect the
top-n search results that are most related to the target fact statement. We then
propose a support score to measure the extent to which a search result supports the
fact statement, and we use a credibility ranking to rank the search results.
Finally, we combine the support score and credibility ranking value of a search
result to evaluate its contribution to the determination of the fact statement. Based on the
contributions of the search results, we determine the fact statement. Our proposals
are evaluated experimentally, and the results show the availability and high precision of
the proposed method.
1 Introduction
The Web has become one of the most important information sources in people's daily
lives; however, much untruthful information spreads on the web. Untruthful information
may mislead users or affect user experience; therefore, it is urgent to determine whether
a piece of information is truthful or not. Information is mainly carried by sentences.
Sentences that state facts, rather than opinions, are called fact statements [8]. Generally
speaking, fact statements fall into two categories: positive fact statements and negative
fact statements. In this paper, we mainly focus on positive fact statements, since for
any negative fact statement there is a corresponding positive one, and the truthfulness
of a negative fact statement can be inferred from the determination of the
positive one. In this paper, when we say a fact statement, we mean a positive one.
A fact statement is either trustful or untruthful, depending on whether the content
it carries is an objective fact or not. When a user determines the trustfulness of a fact
statement, he/she will specify some part(s) of the fact statement he/she is not sure about.
The part(s) are called the doubt unit(s) [8]. After the doubt unit(s) is/are specified, the
fact statement can be regarded as an answer to a question. Based on the number of
correct answers to the corresponding question, fact statements can be divided into two
categories: unique-answer fact statements and multi-answer fact statements. For example,
"Obama is the American President" is a trustful fact statement. If the doubt unit is "Obama",
it is an answer to the question "Who is the American President?". Since the question has only one correct
answer, the fact statement is a unique-answer fact statement.
Most of the existing works utilize a popular search engine to collect the search results
of the target fact statement, and the truthfulness of the target fact statement is
determined by analyzing the search results; e.g., the amount and the sentiments of the
search results are considered as the dominant factors in some works. However, this is
not desirable for the following reasons: (i) The amount of the search results is not
always reliable in trustfulness determination, since not all search results support the
target fact statement. (ii) The sentiments of the search results do not always represent
their sentiments on the target fact statement, since not every search result's theme happens
to be the fact statement. Fig. 1 shows a search result for the fact statement
"Warren Moon was born in 1956". Obviously, its theme is not the fact statement. In other
works, the trustful fact statement is identified from the alternative fact statements of a
target fact statement. Here, the alternative statements are the fact statements which are
answers to the question corresponding to the target fact statement. There are three limitations
in these works: (i) The doubt unit must be specified, otherwise the alternative
fact statements can't be found. (ii) If the doubt unit includes much information, it is
difficult, sometimes even impossible, to find proper alternative fact statements. (iii)
This method is not fit for the trustfulness determination of multi-answer fact statements,
since only one fact statement is picked out as the trustful one.
We run a set of experiments and evaluate the method in terms of reasonability and
precision. The results show availability and high precision of our proposals.
For the rest of this paper, we first summarize the related works. Then, we describe the
method in detail. Experiments and analysis are presented in Section 4. Finally, we conclude
the paper.
2 Related Works
We conduct a brief review of existing works on the following aspects: web page credi-
bility and statement trustfulness.
Generally speaking, credible sources are likely to present trustful information. Some
researchers study the credibility of web pages, and they utilize web page features to determine
whether a web page is credible or not, e.g., page keywords, page title, page style,
etc. [1][2][3]. However, the trustfulness of web page content hasn't been considered in
these works; some web pages that meet these features may still present incorrect facts.
Other studies focus on detecting spam web pages based on link analysis and content,
in order to filter out low-quality pages [4][5]. However, spam web pages are not the same as
untruthful web pages, and non-spam web pages may present incorrect facts. These studies
can't be used directly for fact statement determination.
Some researchers have started discussing the truthfulness of individual statements
in very recent years. [6] introduces the system Honto?search 1.0, which doesn't determine
an uncertain fact directly, but provides users with additional data on which users
can determine the uncertain fact. The authors believe that if an uncertain fact is truthful,
the sentiments of the search results on the uncertain fact should be consistent.
Honto?search 2.0 [7] is the improved version of Honto?search 1.0. This system has
two functions: comparative fact finding and aspect extraction. Given an uncertain fact,
the system collects comparative facts and estimates the validity of each fact. To check
the credibility of these facts in detail, necessary aspects about the uncertain
fact are extracted. By scoring the facts and the aspects, users can get more information and
determine the uncertain fact. Verify [8] can determine the truthfulness of a fact statement
directly. Given a target fact statement, Verify finds the alternative fact statements
of the fact statement, ranks these fact statements, and chooses the fact statement at the
highest position as the trustful fact statement. In [7][8], the doubt unit(s) of the fact statement
should be specified. In addition, when the doubt unit includes much information,
it is difficult, sometimes even impossible, to find proper alternative (comparative) fact
statements. In particular, they can't work well for multi-answer fact statements.
3 Proposed Solution
The goal of this paper is to determine whether a fact statement is trustful or not. Firstly,
we submit the target fact statement to a popular search engine, and get the top-n search
results; secondly, the support scores of the search results of the fact statement are mea-
sured; thirdly, the credibility ranking of the search results is captured; finally, based
on the support scores and the credibility ranking, the truthfulness of the fact statement is
determined.
3.1 Notation
We use fs to denote the target fact statement. After deleting the stop words in fs, the rest
are the keywords of fs, which are represented by K. R denotes the collection of the search
results of fs. We set R = {r_1, ..., r_n} (n ≥ 2) and r_i = <t_i, u_i, s_i> (1 ≤ i ≤ n). Suppose
that r_i is extracted from web page p_i; t_i and u_i denote the title and URL of p_i respectively,
and s_i is the snippet related to fs in p_i. K_{r_i} is the collection of words which belong to
both K and r_i. c_i denotes the shortest consecutive sentences including K_{r_i} in
r_i. sup(r_i, fs) ∈ [0, 1] is used to describe the support score of r_i to fs.
1 http://nlp.stanford.edu/software/stanford-dependencies.shtml
8  D_i = StanfordParser(c_i);
9  N_i = K_{r_i};
10 while LastN_i ≠ N_i do
11     LastN_i = N_i;
12     for each d_ij ∈ D_i do
13         if d_ij.name ∈ Re then
14             N_i ← N_i ∪ {d_ij.governor}; N_i ← N_i ∪ {d_ij.dependent};
15         if d_ij.name ∈ Ro then
16             if d_ij.dependent ∈ N_i then
17                 N_i ← N_i ∪ {d_ij.governor};
18 return N_i;
For example, S1 = {a, b, c} and S2 = {b, c, d} are two sets. If a is deleted from S1 and d
is inserted into S1, then S1 = S2. So the set distance of S1 and S2 is 2.
We use Sdis(N_i, K) to denote the set distance of N_i and K. Here, K is the collection of
keywords of fs. In order to get an accurate Sdis(N_i, K), the words in K and N_i are stemmed
before computing Sdis(N_i, K). However, Sdis(N_i, K) can't be used to measure dis(r_i, fs)
directly. The reason is that fs can be expressed in a different way. Given a sentence fs1
which is different from fs but expresses the meaning of fs, fs can be transformed into
fs1 by word changes. We suppose there are some words in fs that can't be changed from
fs to fs1; if these words change, the meaning of fs can't be kept. Here we use Du
to denote the collection of these unchangeable words in fs. Obviously, Du ⊆ K. Since
word matching is a factor considered in returning search results, Du is composed of
the nouns and numbers in fs. Based on K and Du, we give the following equation to compute
dis(r_i, fs):
dis(r_i, fs) = \begin{cases} \dfrac{Sdis(N_i, K) + |Du| - |K|}{|K|}, & \text{if } Sdis(N_i, K) + |Du| - |K| > 0 \\ 0, & \text{if } Sdis(N_i, K) + |Du| - |K| \le 0 \end{cases}    (1)
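For illustration, a minimal Python sketch of the set distance and of Eq. 1 follows; the word sets in the example are hypothetical, and stemming and stop-word removal are assumed to have been applied already.

def set_distance(a, b):
    # number of insert/delete operations turning set a into set b
    # (the size of the symmetric difference, as in the S1/S2 example above)
    return len(set(a) ^ set(b))

def dis(N_i, K, Du):
    # Eq. 1: N_i is the word set extracted from the search result,
    # K the keyword set of fs, Du the unchangeable words of fs (Du is a subset of K)
    gap = set_distance(N_i, K) + len(Du) - len(K)
    return gap / len(K) if gap > 0 else 0.0

K = {"warren", "moon", "born", "1956"}
Du = {"warren", "moon", "1956"}
print(dis({"warren", "moon", "born", "1956"}, K, Du))   # 0.0: the result matches fs
print(dis({"warren", "moon", "quarterback"}, K, Du))    # 0.5: the result is farther from fs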
In the support score computation, we do not consider whether a search result opposes the
fact statement or not, since the focus of this paper is positive fact statements and
there are few search results that explicitly oppose a target positive fact statement. In
addition, for the search results that implicitly oppose the fact statement, the corresponding
support scores will be smaller based on the equations, or the search results will be
considered neutral.
Importance Ranking. The search results of a given fact statement are displayed in a certain
order. We consider this order as the importance ranking of the search results. The order
is produced by the search engine which is used to find the search results. Besides, the
importance of the sources of the search results is considered by the search engine. For
a fact statement fs and a search result r_i on fs, Srank is used to denote the importance
ranking of the search results on fs, and Srank_i is the position of r_i in Srank. Given another
search result r_j on fs, if Srank_i < Srank_j, then r_i is more important than r_j. Note that
we do not adopt the popular PageRank for the following reasons: (i) The PageRank values
of web sites/pages can't be derived. (ii) Although the PageRank levels are available,
the range of PageRank levels is too small ([0, 10]) to satisfy our requirements.
Popularity Ranking. We use Arank to denote the popularity ranking, and Arank_i represents
the Arank value of r_i. We capture the popularity ranking of the search results based on
the Alexa² ranking. The Alexa ranking is a traffic ranking of each site based on reach and
page views, and it can be used to measure the popularity of web sites. Given a fact statement
fs and a collection of search results R on fs, Alexa_i is used to denote the Alexa ranking
value of the site from which r_i is derived. Based on Alexa_i (1 ≤ i ≤ n), we can get the
popularity ranking of the search results in R. Based on the Alexa ranking, Arank_i can be
derived by the following equations.
Amid_i = 1 + \Big\lfloor \dfrac{Alexa_i - \min(Alexa_i)}{Interval_{Amid}} \Big\rfloor    (4)

Arank_i = 1 + \Big\lfloor \dfrac{Amid_i - 1}{Interval_{Arank}} \Big\rfloor    (6)
Since the Alexa ranking covers all web sites, given Alexa_i and Alexa_j, the gap between them
may be very large. We use the above four equations to map Alexa_i into the range [1, n]. We
first introduce the median of Alexa_i, represented by Alexa_m, and get the interval ranking
Amid by Equation 3 and Equation 4, in order to keep the distribution of Alexa_i (1 ≤ i ≤ n)
unchanged. Then, we linearly map Amid_i into [1, n] and get Arank_i by Equation 5
and Equation 6.
Credibility Ranking. We use Crank to denote the credibility ranking of the search
results on the same fact statement, and Crank_i is the position of r_i in Crank. Based on Srank_i
and Arank_i, Crank_i can be computed. The following equation describes how to compute
Crank_i from Srank_i and Arank_i:
Crank_i = w_1 · Srank_i + w_2 · Arank_i    (7)
Here, Srank and Arank play equally important roles in obtaining Crank, so we set w_1 and
w_2 to 0.5.
2 https://www.alexa.com
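A small Python sketch of the popularity and credibility rankings follows. Because Equations 3 and 5 (which define the two interval widths) are not reproduced above, the sketch simply bins the raw Alexa values into n equal-width intervals as a stand-in for the Amid/Arank mapping; only Eq. 7 is implemented as stated, and the numbers in the example are made up.

def popularity_rank(alexa_values):
    # map raw Alexa values into [1, n] by equal-width binning
    # (a simplification of the Amid/Arank construction above)
    n = len(alexa_values)
    lo, hi = min(alexa_values), max(alexa_values)
    width = (hi - lo) / n or 1.0
    return [min(n, 1 + int((a - lo) / width)) for a in alexa_values]

def credibility_rank(srank, arank, w1=0.5, w2=0.5):
    # Eq. 7: Crank_i = w1 * Srank_i + w2 * Arank_i
    return [w1 * s + w2 * a for s, a in zip(srank, arank)]

alexa = [1200, 35, 980000, 15000, 60]     # hypothetical Alexa values of five result sites
srank = [1, 2, 3, 4, 5]                   # importance ranking = search engine order
crank = credibility_rank(srank, popularity_rank(alexa))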
Algorithm 2 shows the procedure of determining a fact statement. The inputs of the
algorithm are the collection of search results on the target fact statement R, the credibility
ranking of the search results Crank, the support scores of the search results to the fact
statement sup(r_i, fs) (1 ≤ i ≤ n), the sets of words which belong to both the keyword set of
fs and each search result, represented by K_{r_i}, the unchangeable word collection of the
fact statement Du, and the decision threshold. The output is the result of the determination
(trustful or untruthful). First, the contributions from positive search results and neutral
search results are worked out respectively (Lines 1-6). Secondly, the ratio between the positive
contributions and the neutral contributions is computed. Finally, the algorithm
decides whether the fact statement is truthful or not (Lines 7-10).
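Since the full listing of Algorithm 2 is not reproduced in this excerpt, the following Python sketch only illustrates the decision step it describes. The exact contribution formula and the threshold name (theta) are assumptions of this sketch; here each result contributes in proportion to its support score and inversely to its credibility position.

def determine(sup, crank, shares_keyword, theta):
    # sup: support scores sup(r_i, fs); crank: credibility positions (1 = most credible)
    # shares_keyword: whether K_{r_i} is non-empty; theta: assumed decision threshold
    positive = neutral = 0.0
    for s, c, related in zip(sup, crank, shares_keyword):
        weight = 1.0 / c
        if s > 0:                # the result supports the fact statement
            positive += s * weight
        elif related:            # related, but neither clearly for nor against
            neutral += weight
    if neutral == 0:
        return "trustful"
    return "trustful" if positive / neutral >= theta else "untruthful"

print(determine([0.9, 0.7, 0.0, 0.2], [1, 3, 2, 4], [True] * 4, theta=1.5))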
4 Experiments
We run a set of experiments to evaluate Multi-verifier in terms of precision and availability.
First, we examine the distribution of the search results that include the target fact statements.
In this experiment, we detect the average percentage of the search results which include
the meanings of the target fact statements. For untruthful fact statements, there are few
search results including them; thus, this experiment is about trustful fact statements.
We use P_f to represent the average percentage of the search results including the target
fact statements, fs_u denotes the trustful unique-answer fact statements, and fs_m represents
the trustful multi-answer fact statements. Fig. 2 shows the P_f values when n changes. Here,
n is the number of search results adopted for a fact statement. It can be seen from
the figure that both P_f for fs_u and P_f for fs_m decrease with the increase of n. For a
multi-answer fact statement, there is more than one correct answer to the corresponding
question; thus P_f for fs_u is always larger than P_f for fs_m. In addition, when n
increases, more search results which don't include the target fact statements come out.
The average Crank values, by contrast, do not change so sharply. This means that Arank
or Srank cannot replace Crank: some search results at higher positions in Srank or Arank may be at
lower positions in Crank. This is consistent with our observation.
[Fig. 2: P_f for fs_u and fs_m versus the top-n search results]
[Figures: average ranking (Arank, Srank, Crank) versus the rank of the search results, and correlation values versus the top-n search results]
[Figures: average distances rd_si and rd_su versus the top-n search results and versus the threshold value]
[Figures: precision versus the top-n search results (threshold = 1.5) and, in Fig. 8, versus the threshold value (n = 30)]
With the threshold fixed at 1.5, the precision first increases with the increase of n.
The reason is that more search results are positive among the top 30 search results,
and more positive search results are considered as n increases. When n exceeds 30, the
precision decreases with the increase of n, because more neutral search results
are considered once n grows beyond 30. Fig. 8 shows the precision with respect to the threshold when
n = 30. From this figure, when the threshold is 1.5, the precision reaches its greatest value (0.9).
Below 1.5, the precision improves as the threshold increases, since a smaller threshold causes
some untruthful fact statements to be regarded as trustful ones; above 1.5, some trustful fact
statements are regarded as untruthful ones, so the precision decreases as the threshold increases.
Fig. 9 shows the precision with respect to both the threshold and n. The overall trend is that the
precision first increases and then decreases with the increase of n and the threshold. In particular,
when the threshold is 1.5 and n = 30, the precision reaches its peak (0.9).
The experiments show that the method is available and accurate. Since the dataset includes
many multi-answer fact statements, we can see that the method is also available
for the determination of multi-answer fact statements.
Since Honto?search cannot determine a fact statement directly, it is not necessary
to run experiments to compare it with our method. Although Verify can directly determine
a fact statement, it is based on alternative fact statements and is not suitable for
multi-answer fact statements; thus, it is not necessary to compare it with our method either.
5 Conclusion
In this paper, we propose a method called Multi-verifier to determine the truthfulness
of a fact statement. In Multi-verifier, the search results on the fact statement whose
truthfulness needs to be determined are found by a popular search engine, and the
support scores and the credibility ranking are considered in determining the truthfulness
of the fact statement. The experiments show that the method is effective. However, our
focus is domain-independent fact statements, and domain knowledge isn't used. In
the future, we will focus on domain-dependent fact statements; we think that the usage
of domain knowledge can bring higher precision. In addition, the quality of the related
information is important for truthfulness determination. We will try to find a method
to enhance the quality of the related information on a fact statement.
Acknowledgments. This work is partly supported by the Important National Science
& Technology Specific Projects of China (Grant No. 2010ZX01042-001-002), the National
Natural Science Foundation of China (Grant No. 61070053), the Graduate Science
Foundation of Renmin University of China (Grant No. 12XNH177), and the Key
Lab Fund (07dz2230) of High Trusted Computing in Shanghai.
References
1. McKnight, D.H., Kacmar, J.: Factors and effects of information credibility. In: 9th International Conference on Electronic Commerce, pp. 423-432. ACM Press, New York (2007)
2. Schwarz, J., Ringel Morris, M.: Augmenting web pages and search results to support credibility assessment. In: International Conference on Human Factors in Computing Systems, pp. 1245-1254. ACM Press, New York (2011)
3. Lucassen, T., Schraagen, J.M.: Trust in Wikipedia: How Users Trust Information from an Unknown Source. In: 4th ACM Workshop on Information Credibility on the Web, pp. 19-26. ACM Press, New York (2010)
4. Gyongyi, Z., Garcia-Molina, H., Pedersen, J.: Combating Web Spam with TrustRank. In: 30th International Conference on Very Large Data Bases, pp. 576-587. Morgan Kaufmann, San Francisco (2004)
5. Ntoulas, A., Najork, M., Manasse, M., Fetterly, D.: Detecting spam web pages through content analysis. In: 15th International Conference on World Wide Web, pp. 83-92. ACM Press, New York (2006)
6. Yamamoto, Y., Tezuka, T., Jatowt, A., Tanaka, K.: Supporting Judgment of Fact Trustworthiness Considering Temporal and Sentimental Aspects. In: Bailey, J., Maier, D., Schewe, K.-D., Thalheim, B., Wang, X.S. (eds.) WISE 2008. LNCS, vol. 5175, pp. 206-220. Springer, Heidelberg (2008)
7. Yamamoto, Y., Tanaka, K.: Finding Comparative Facts and Aspects for Judging the Credibility of Uncertain Facts. In: Vossen, G., Long, D.D.E., Yu, J.X. (eds.) WISE 2009. LNCS, vol. 5802, pp. 291-305. Springer, Heidelberg (2009)
8. Li, X., Meng, W., Yu, C.: T-verifier: Verifying Truthfulness of Fact Statements. In: 27th International Conference on Data Engineering, pp. 63-74. IEEE Press, New York (2011)
9. Fogg, B.J.: Persuasive Technology: Using Computers to Change What We Think and Do. Morgan Kaufmann, San Francisco (2002)
10. Marneffe, M., MacCartney, B., Manning, C.: Generating typed dependency parses from phrase structure parses. In: The International Conference on Language Resources and Evaluation. ELRA, Luxembourg (2006)
An Efficient Privacy-Preserving RFID
Ownership Transfer Protocol
Wei Xin, Zhi Guan, Tao Yang , Huiping Sun, and Zhong Chen
1 Introduction
Radio Frequency Identification (RFID) technology represents a fundamental
change in the information technology infrastructure. RFID has many applica-
tions for both businesses and private individuals. Several of these applications will
include items that change owners at least once in their lifetime. The swapping and
resale of items is a practice that is likely to be popular in the future, and so
any item that depends on RFID for function or convenience should be equipped
to deal with change of ownership. Ownership transfer presents its own set of
threats, and therefore demands the attention of security researchers.
Generally speaking, a secure ownership transfer protocol should follow the
following assumptions: The old owner should not be able to access the tag after
the ownership transfer has taken place; The new owner should be able to perform
mutual authentication with the tag after the ownership transfer has taken place.
In this paper, we concentrate on designing an RFID ownership transfer protocol
with high efficiency and robust security and privacy properties. Our protocol
consists of three sub-protocols: an authentication protocol, an ownership transfer
protocol, and a secret update protocol. The rest of the paper is organized as
follows. In Section 2, we give a brief overview of some ownership transfer
protocols in RFID systems. Section 3 presents the security and threat model.
Section 4 describes the proposed ownership transfer protocol. Section 5 demonstrates
the security and privacy analysis. Finally, Section 6 concludes.
Corresponding author.
2 Related Work
One of the earliest protocols addressing ownership transfer with Trusted Third
Party (TTP) was proposed by Saito et al.[1]. The goal of the protocol is to
transfer ownership from one entity R1 to another R2 . The protocol steps are
summarized in Fig. 1. An adversary can commit a de-synchronization attack by
blocking the message from R2 to the tag: the TTP and R2 then have the new key, while
the tag keeps the previous one. This can be prevented by the TTP and R2 storing the
previous key. Unfortunately, in this case, since R2 knows s1, backward security
is violated.
[Fig. 1: Saito et al.'s ownership transfer protocol: R2 forwards s1 and s2 to the TTP, which returns E_s(s1, s2) for the tag; the tag, holding s and s1, checks s1 and updates s1 to s2]
[Protocol diagrams of the other related ownership transfer schemes discussed in Section 2, showing the message flows among the tag, the old owner R1, the new owner R2, and (where present) the TTP or issuer]
[Figure: the proposed authentication protocol between the reader R, holding (I, h, s, ID, pk, sk), and the tag, holding (Ie, s)]
1. The reader R initiates the protocol by sending a random number to the tag.
2. Upon receiving the challenge NR from R, the tag T derives two values, a new
secret k = g2(s) and a new state s' = g1(s). After updating the state to g1(s),
T sends Ie and M1 = g1(k ⊕ NR) to R.
3. On receipt of the messages, R first decrypts Ie using his secret key as
I = Dec_sk(Ie), then finds (I, h, s, ID) in the database using index I such that
both M1 = g1(g2(s) ⊕ NR) and e(h, g2^x) = e(I, g2) hold. If M1 = g1(g2(s) ⊕ NR) does not
hold, then for j = 1 up to the allowed desynchronization bound, R checks whether M1 equals
g1(g2(g1^j(s)) ⊕ NR); if so, R updates s to g1^j(s), otherwise R rejects. If e(h, g2^x) = e(I, g2)
does not hold, R rejects as well. After that, R encrypts I with his public key as Ie = Enc_pk(I) and
updates the state of T to s' = g1(s). In our proposal, ElGamal encryption is used for this purpose.
Ownership Transfer Protocol: In this protocol, the new owner of a tag first
requests ownership of the tag from the TTP. If the request is valid, the TTP then
transfers all the information related to the tag to the new owner via a secure
channel. The ownership transfer protocol is based on our authentication protocol;
since the authentication protocol satisfies backward privacy, the ownership
transfer protocol provides old-owner privacy innately. However, we have to update
the tag state to protect new-owner privacy. In addition, the index I must be
changed in order to achieve untraceability. The protocol steps are summarized
in Fig. 6 and described as follows:
1. At first, the old owner R1 sends ID_R1 to the new owner R2 through a secure
communication channel (omitted in Fig. 6 for clarity).
2. The new owner R2 initiates the protocol by sending a random number NR
to the tag.
3. Upon receiving the challenge NR from R2, the tag T derives two values, a new
secret k = g2(s) and a new state s' = g1(s). After updating the state to g1(s),
T sends Ie and M1 = g1(k ⊕ NR) to R2.
4. R2 then forwards Ie, M1 = g1(k ⊕ NR), NR, ID_R1 and ID_R2 to the TTP. In
our protocol, we use the issuer as the TTP.
5. On receipt of the messages, the TTP first decrypts Ie using R1's secret key as
I = Dec_skR1(Ie), then finds (I, h, s, ID) in the database using index I such that both
M1 = g1(g2(s) ⊕ NR) and e(h, g2^x) = e(I, g2) hold. If M1 = g1(g2(s) ⊕ NR) does not hold,
the TTP tries to find g1^j(s) such that M1 equals g1(g2(g1^j(s)) ⊕ NR) for j = 1 up to the allowed
desynchronization bound, and updates s to g1^j(s). After that, the TTP picks a new random
number t' ∈ Fq and computes h' = H(t'), u1 = g1^r1 and v1 = h'^x · g1^(β·r1) (where β is the
secret key of R2), and lets I' = h'^x and I'e = (u1, v1). Then, the TTP updates the
state of T to s' = g1(s). Finally, the TTP sends info(ID), s', I', h', I'e, M2 to R2.
6. R2 stores info(ID), s', I', h' and forwards I'e ⊕ s' and M2 to T.
7. When T receives I'e ⊕ s' and M2, T first verifies whether M2 = g1(s' ⊕ NR);
if true, T updates Ie to I'e (recovered by XORing the received value with s'). Otherwise, T does not update Ie.
Secret Update Protocol: The protocol steps are described as follows:
1. The reader R initiates the protocol by sending NR and M1 = g1(g2(s) ⊕ NR)
to the tag T.
2. Upon receiving the challenge NR and M1 from R, the tag T computes k = g2(s)
and verifies M1 = g1(k ⊕ NR). If the check succeeds, T computes M2 = g2(k ⊕ NR), updates
the tag state to s' = g1(s ⊕ NR), and then sends Ie ⊕ k and M2 to R.
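The tag-side computations of the authentication and secret update steps can be sketched in Python as follows. The functions g1 and g2 are modeled here as two independent hash functions and the byte lengths are illustrative; the paper's concrete instantiation of g1, g2 and of the encrypted index Ie is not shown in this excerpt.

import hashlib, secrets

def g1(x): return hashlib.sha256(b"g1" + x).digest()   # stand-ins for the one-way
def g2(x): return hashlib.sha256(b"g2" + x).digest()   # functions g1 and g2

def xor(a, b): return bytes(x ^ y for x, y in zip(a, b))

class Tag:
    def __init__(self, Ie, s):
        self.Ie, self.s = Ie, s              # encrypted index and current state

    def authenticate(self, NR):
        k = g2(self.s)
        M1 = g1(xor(k, NR))
        self.s = g1(self.s)                  # state moves forward (backward privacy)
        return self.Ie, M1

    def secret_update(self, NR, M1):
        k = g2(self.s)
        if M1 != g1(xor(k, NR)):             # reader not authenticated
            return None
        M2 = g2(xor(k, NR))
        self.s = g1(xor(self.s, NR))
        return xor(self.Ie, k), M2

tag = Tag(Ie=secrets.token_bytes(32), s=secrets.token_bytes(32))
Ie, M1 = tag.authenticate(secrets.token_bytes(32))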
[Fig. 6: message flow of the ownership transfer protocol among the tag (holding Ie, s), the new owner R2, and the TTP (holding I, h, s, ID and the issuer's keys)]
[Figure: message flow of the secret update protocol between the reader R (holding I, h, s, ID, pk, sk) and the tag (holding Ie, s)]
5 Security Analysis
5.1 Security Issues
In this section, we mainly discuss some attacks that are commonly used by adversaries
against RFID systems and present how our authentication protocol resists them.
                 Security
Schemes          RA  MITM  DoS  BT  NP  OP
Saito [1]        X   O     O    X   O   X
Kapoor [2]       O   O     O    O   O   O
Kulseng [15]     O   X     X    X   *   X
Osaka [16]       O   O     X    X   *   X
Fouladgar [17]   X   X     O    X   O   X
Song [3]         X   O     X    O   *   O
Elkhiyaoui [4]   O   O     O    X   X   X
ours             O   O     O    O   *   O

O: success   X: failure   *: success under an assumption
5.2 Performance
In this section, we analyze the computation costs of our protocol and show that our
scheme is viable for lightweight passive RFID tags. We mainly focus on the
performance characteristics of the ownership transfer protocol. T_E denotes the time
for one encryption or decryption; T_RNG is the time to generate a random number;
T_H is the time for one hash function; T_PO stands for the time of a pairing
operation; and T_exp denotes the time of an exponentiation operation. Table 2 compares
the performance characteristics of our ownership transfer protocol with
the other proposed schemes.
Note that we do not consider the database searching process in Table 2.
Actually, Osaka's protocol, Fouladgar's protocol and Song's protocol do not
support an index, so their computation costs are much higher than what we have
listed in Table 2. Although Kulseng's protocol supports an index, the index may
cause a de-synchronization problem (the index is updated asynchronously between the
reader and the tag), in which case the reader still has to search the whole database.
We adopt Elkhiyaoui's scheme of using a re-encryption primitive to
6 Conclusion
References
1. Saito, J., Imamoto, K., Sakurai, K.: Reassignment scheme of an RFID tag's key for owner transfer. In: Enokido, T., Yan, L., Xiao, B., Kim, D.Y., Dai, Y.-S., Yang, L.T. (eds.) EUC Workshops 2005. LNCS, vol. 3823, pp. 1303-1312. Springer, Heidelberg (2005)
2. Kapoor, G., Piramuthu, S.: Single RFID Tag Ownership Transfer Protocols. IEEE Transactions on Systems, Man, and Cybernetics, 1-10 (2011)
3. Song, B., Mitchell, C.J.: RFID authentication protocol for low-cost tags. In: WISEC, pp. 140-147 (2008)
4. Elkhiyaoui, K., Blass, E.-O., Molva, R.: ROTIV: RFID Ownership Transfer with Issuer Verification. In: Juels, A., Paar, C. (eds.) RFIDSec 2011. LNCS, vol. 7055, pp. 163-182. Springer, Heidelberg (2012)
5. Sarma, S., Weis, S., Engels, D.: Radio-Frequency Identification: Security Risks and Challenges. CryptoBytes, RSA Laboratories 6(1), 2-9 (2003)
6. Hopper, N.J., Blum, M.: Secure human identification protocols. In: Boyd, C. (ed.) ASIACRYPT 2001. LNCS, vol. 2248, pp. 52-66. Springer, Heidelberg (2001)
7. Juels, A., Weis, S.A.: Authenticating Pervasive Devices with Human Protocols. In: Shoup, V. (ed.) CRYPTO 2005. LNCS, vol. 3621, pp. 293-308. Springer, Heidelberg (2005)
Fractal Based Anomaly Detection over Data Streams
1 Introduction
Real-time monitoring over data streams has been attracting much attention.
Anomaly detection aims to detect the points which are significantly different
from others, or those which cause a dramatic change in the distribution of the underlying
data stream. It is a great challenge to accurately monitor and detect anomalies
in real time because of the uncertainty of the properties of the anomaly.
In general, existing anomaly detection methods detect abnormality based on
a model derived from the normal historical behavior of a data stream. They try to
reveal the differences over short-term or long-term behaviors that are inconsistent
with the derived model. However, the detected results largely depend on the
model defined, as well as the data distribution and time granularity of the underlying
data stream. For a data stream whose distribution changes constantly,
the existing methods will result in a large volume of false positives. On the other
hand, due to the application-dependent definition of abnormal behavior, each
existing method can only detect certain types of abnormal behaviors. In real
situations, it is meaningful to define abnormal behavior in a more general sense,
and it is highly desirable to have an efficient, accurate, and scalable mechanism for
detecting abnormal behaviors over dynamic data streams.
2 Preliminaries of Fractals
log q = log p + r log s. For discrete time series data, the power-law scaling relationship
can be transformed to q(st) = s^r q(t), where t is the basic time unit and
s (s ∈ R) is the scaling measurement [3]. r can be determined by r = log(q(st)/q(t)) / log s.
The power-law scaling relationship is also retained when q is a first-order
aggregate function (e.g., sum, count) or some second-order statistical variables
(e.g., variance).
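As a small Python illustration of estimating the scaling exponent r from the relation q(st) = s^r q(t); the numbers in the example are made up.

import math

def scaling_exponent(q_t, q_st, s):
    # r = log(q(st) / q(t)) / log(s)
    return math.log(q_st / q_t) / math.log(s)

# If the 'sum' aggregate doubles when the window is doubled, r = 1.
print(scaling_exponent(q_t=100.0, q_st=200.0, s=2.0))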
Intuitively, objects satisfying the self-similarity property can be mapped (by a series
of maps) to part of themselves. This mapping process can be conducted iteratively
until a fixed point, called the attractor, is found. Given the attractor, by iteratively
applying the inverse of the mapping, we can approximate the original objects.
The mapping process introduced above is mathematically called an iterated
function system (IFS).
More precisely, an IFS consists of a finite collection of contraction maps M_i
(i = 1, 2, 3, ..., m). Each M_i can map a compact metric space onto itself, provided
that the metric space is linear.
A point (x, y) can be mapped/transformed by M_i[x, y] with the form

M_i \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} a^i_{11} & a^i_{12} \\ a^i_{21} & a^i_{22} \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} b^i_1 \\ b^i_2 \end{bmatrix}.
Assume A is the attractor; then A = ∪_{i=1}^{m} M_i(A). By modeling the data with
an IFS, we can use the attractor of the IFS to approximate the data. When a^i_12
is zero, the mapping is called a shear transformation [14], and a^i_22 < 1 is called
the contraction factor for map M_i.
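For illustration, the affine map above can be applied to a point as in the following sketch (a shear transformation with a^i_12 = 0 and contraction factor a^i_22 = 0.5; all coefficients are made up).

def apply_map(point, a11, a12, a21, a22, b1, b2):
    # one affine contraction map M_i of an IFS applied to (x, y)
    x, y = point
    return (a11 * x + a12 * y + b1, a21 * x + a22 * y + b2)

p = (4.0, 8.0)
for _ in range(5):          # iterating the map pulls the point toward the attractor
    p = apply_map(p, a11=0.5, a12=0.0, a21=0.1, a22=0.5, b1=1.0, b2=0.0)
print(p)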
An IFS can only be used to manipulate linear fractals, which have a fixed contraction
factor for all objects at all scales. However, natural objects may be more
complex. Thus recurrent iterated function systems (RIFS) are introduced. RIFS
can be generalized for arbitrary metric spaces [2]. Intuitively, different from an IFS,
in an RIFS each map can be applied to part of the objects. The ranges of objects
the maps can be applied to may overlap. Thus, different objects may be mapped
to different parts and with different contraction scales.
In an RIFS, given the contraction maps M_i, the attractor A ⊆ R^d is ∪_{i=1}^{N} A_i,
where A_i ⊆ R^d (i = 1, ..., N) are possibly overlapping partitions. Furthermore,
M_j(A) = A_j = ∪_{<i,j>∈G} M_j(A_i). Here G is a directed graph, and
A = (A_1, ..., A_N) ∈ (R^d)^N. Each edge <i, j> ∈ G indicates that the mapping
composition M_j ∘ M_i is allowed. The attractor can also be represented by
A = M(A), in which M : (R^d)^N → (R^d)^N is defined as (M_1, M_2, ..., M_N). Thus,
given a set vector D = (D_1, D_2, ..., D_N) with D_i ⊆ R^d, A = lim_{i→∞} M^i(D).
Similar to an IFS, the attractor A can be used to approximate the data. When the
graph G is complete, the RIFS degrades to an IFS [8].
IFS and RIFS are often used to summarize and compress fractal data. The
problem lies in the determination of the attractor. Since the attractor can be
computed given the mapping, the computation of the parameters of the mapping
is usually called the inverse problem of IFS or RIFS [8].
Given the mapping M with D' = M(D), we consider the condition that D' ⊆ D
in this paper. Under this condition, the open set property holds, so that the
Collage Theorem can be applied to guarantee the precision of approximating
the original data using the attractor [14,2]. This theorem provides a way to
measure the goodness of fit of the attractor associated with an RIFS and a given
function.
Collage Theorem: Let (X, d) be a complete metric space, where d is a distance
measure. Let D be a given function and ε > 0 be given. Choose an RIFS with
contractility factor s = max{s_i : i = 1, 2, 3, ..., N} such that d(D, ∪_{i=1}^{N} M_i(D)) ≤ ε.
Then d(D, A) ≤ ε/(1 − s), where A is the attractor of the RIFS.
Since we cannot assume that the data streams to be monitored are linear fractals,
we consider RIFS in the rest of this paper.
3 Problem Statement
A data stream X is considered as a sequence of points x_1, ..., x_n. Each element
in the stream can be a nonnegative value, which is denoted as a (timestamp: i,
value: y_i) pair, or x_i for simplicity. The function F being monitored can be not only
a monotonic aggregate with respect to the window size, for example sum
and count, but also a non-monotonic one such as average and variance. One
common property of the above functions is that they all retain the power-law
scaling relationship [16].
Monitoring anomalies in a data stream is to detect the change of the power-law
scaling relationship, i.e., self-similarity, on the sequence with continuously incoming
new values x_i. Under such a data stream model, the anomaly detection problem
can be described as follows.
PROBLEM STATEMENT: A data stream X is represented as a sequence of
points x_1, ..., x_n. An anomaly is detected if the self-similarity of X changes (i.e.,
the historical one is violated) when a new value x_n comes.
Our basic idea is to use the attractor to represent the data stream, and to monitor
events that change the attractor, which indicate anomalies. Thus, the problem
lies in how to efficiently estimate the attractor in a streaming data environment,
how good it is to use the estimated attractor to approximate the original
data stream, and how to model the anomaly based on the attractor.
Optimal Model. Let the start point of a piece P of stream X be (s, y_s), and
the end point of P be (e, y_e), e > s. A contraction mapping M can be defined
as M(i, y_i) = (int(i·a_11 + b_1), i·a_21 + y_i·a_22 + b_2). The data points in a larger
piece P of the data stream, with endpoints (s, y_s) and (e, y_e), are mapped by M to a smaller
piece P' of the same stream with endpoints (s', y_s') and (e', y_e'), where P' ⊆ P and e − s > e' − s'.
Every two adjacent P's meet only at their endpoints and do not overlap. For
all pieces of the whole data stream, however, there might be some overlaps;
that is, ∪ P'_i = ∪ P_i = X. Suppose the L2 error between the data points in
the mapping M(P) and those in the original piece P' is E(M) = (M(P) − P')²;
then the error of a mapping M is E(M) = Σ_{i=s}^{e} ((i·a_21 + y_i·a_22 + b_2) − y_j)²,
where j = i·a_11 + b_1. Given P and P', the optimal mapping M_opt is the one
for which the minimum value of E(M) is reached.
For data stream X, it is ensured that each contraction mapping M maps the
start point and end point of piece P to the corresponding ones of piece P', namely,
(s', y_s') = M(s, y_s) and (e', y_e') = M(e, y_e). a_21 and b_2 can then be obtained by
Equations 2 and 3. Therefore, we have E(M) = Σ_{i=s}^{e} (A_i·a_22 − B_i)², in which
A_i = y_i − ((e−i)/(e−s)·y_s + (i−s)/(e−s)·y_e) and B_i = y_j − ((e−i)/(e−s)·y_s' + (i−s)/(e−s)·y_e').
The coefficient a_22 of the optimal mapping M_opt can be computed by setting
∂E(M)/∂a_22 = 2·Σ_{i=s}^{e} (A_i·a_22 − B_i)·A_i = 0, which gives a_22 = Σ_{i=s}^{e} A_i·B_i / Σ_{i=s}^{e} A_i². Consequently,
the optimal contraction mapping M_opt can be obtained by maintaining
Σ_{i=s}^{e} A_i·B_i and Σ_{i=s}^{e} A_i² over the data stream. Algorithm 1 illustrates the method
for constructing the optimal piecewise fractal model for a data stream.
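Since Algorithm 1 itself is not reproduced in this excerpt, the following Python sketch only illustrates how a_22 can be fitted for one (P, P') pair from the two running sums derived above. The linear index mapping between P and P' and the interpolation helper are assumptions of this sketch, not the paper's exact procedure.

def optimal_a22(P, P_prime):
    # P, P_prime: lists of (position, value) points; endpoints included
    (s, ys), (e, ye) = P[0], P[-1]
    (s2, ys2), (e2, ye2) = P_prime[0], P_prime[-1]
    sum_ab = sum_a2 = 0.0
    for i, yi in P:
        # A_i: deviation of y_i from the chord of P
        A = yi - ((e - i) / (e - s) * ys + (i - s) / (e - s) * ye)
        # matching position in P' under a linear index mapping, and its chord deviation B_i
        j = s2 + (i - s) * (e2 - s2) / (e - s)
        yj = interpolate(P_prime, j)
        B = yj - ((e2 - j) / (e2 - s2) * ys2 + (j - s2) / (e2 - s2) * ye2)
        sum_ab += A * B
        sum_a2 += A * A
    return sum_ab / sum_a2 if sum_a2 else 0.0

def interpolate(points, x):
    # piecewise-linear lookup of the value at position x (helper, not from the paper)
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 if x1 == x0 else y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return points[-1][1]

stream = [(i, float(i % 7)) for i in range(21)]
print(optimal_a22(stream, stream[:8]))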
In this way, a_22 and the corresponding approximate results M_app can be computed.
The approximate piecewise fractal model consists of m pieces which can be
used to reconstruct the original data stream. The error of reconstruction is
bounded by the Collage Theorem. Each piece P_i corresponds to a contraction mapping
M_i and a pair of partitions P and P'. One value that needs to be stored for P_i
is the suffix sum F(n − s_i) starting from (s_i, y_si), where a suffix sum of the data
stream is defined as F(w_i) = Σ_{j=n−w_i+1}^{n} x_j. The other value that needs to be stored
for P_i is the number of points included in this suffix sum. We first try to add
each data point of the evolving stream into the current piece. If this fails, a
new piece is created for the point. Because the end point of piece P_i is the start
point of piece P_{i+1}, it is not necessary to store the end point. The method of constructing
an approximate piecewise fractal model for a data stream is described
in Algorithm 2.
Theorem 2. Algorithm 2 can maintain the approximate piecewise fractal model
for a data stream with O(m) space in O(n) time.
The history-based model trains a model incrementally and compares the newly
arriving data against the model. The data that deviate from the model are reported
as anomalies. Existing burst detection [20,4,16] and change detection [10]
methods conform to this model. For instance, the threshold of bursty behaviors
of an aggregate function F on the i-th window is set as follows. At first, the moving aggregate
F(w_i) is computed over some training data, which could be the foremost 1% of
the data set. The training data forms another time series data set, say Y. Then
the threshold of anomaly is set to be T(w_i) = μ_y + λ·σ_y, where μ_y and σ_y are
the mean and standard deviation of Y, respectively. The threshold can be tuned by
varying the coefficient λ of the standard deviation. These methods then look for
significant differences in short-term or long-term behaviors which are inconsistent
with the model.
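As a concrete illustration of this history-based threshold, the following sketch computes T(w_i) from a training prefix; the coefficient name lam is an assumed name for the tunable multiplier of the standard deviation, and the training values are made up.

import statistics

def anomaly_threshold(Y, lam=3.0):
    # T(w_i) = mean(Y) + lam * std(Y), computed over the training series Y
    return statistics.fmean(Y) + lam * statistics.pstdev(Y)

Y = [10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 10.0]   # e.g. the foremost 1% of the data set
T = anomaly_threshold(Y)
print(T, 25.0 > T, 10.4 > T)                   # a burst is reported when F(w_i) exceeds T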
To maintain the model, the piecewise fractal model is used. For each piece, the
line connecting the end points of the piece is used to approximate the data within
the piece. The error is denoted by E(M) and is guaranteed to be bounded. The data generated by
all pieces are summed up to approximate the original data stream. Thus, the
piecewise fractal model is used as a synopsis data structure.
6 Performance Evaluation
6.1 Experimental Setup
The experiments are conducted on a platform with a 2.4 GHz CPU and 512 MB of main
memory, running Windows 2000. We applied our algorithms to a variety of data sets.
[Fig. 3 and Fig. 4: the accuracy on D1 and D3 when varying the two model parameters]
Due to space limitations, only results on three real-life datasets are reported:
1) Stocks Exchange (D1): This data set contains one year of tick-by-tick trading
records in 1998 from a corporation on the Hong Kong main board. 2) Network Traffic
(D2): This data set, BC-Oct89Ext4, is a network traffic trace obtained
from the Internet Traffic Archive [1,7]. 3) Web Site Requests (D3): It consists of
all requests made to the 1998 World Cup Web site between April 26, 1998 and
July 26, 1998 [1].
Two accuracy metrics, recall and precision, are considered. Recall is the ratio
of true alarms raised to the total true alarms which should be raised; precision
is the ratio of true alarms raised to the total alarms raised.
which leads to an increase in the error of anomaly detection. On the other hand,
it is easy to understand that a larger error tolerance results in longer pieces, and
therefore a larger reconstruction error.
With the two parameters fixed at 5 and 9, as the number of monitored windows increases, the
number of used pieces also increases, but the number of consumed pieces is still
bounded by m < 4·log2(NW). It can be seen from Figure 5 that with only
a few pieces the precision and recall are still high. The accuracy becomes better
when NW increases. This is because the average size of the monitored windows is
larger with the increase of NW, and the fractal approximation is more accurate over
large time scales, so the error decreases with increasing NW. This
illustrates our algorithm's ability to monitor large numbers of windows with high
accuracy over streaming data.
Figure 6 shows that SWT returns a large number of false positive results in every case.
For example, in Figure 6(b), 108 alarms are returned, which is definitely not
informative to the users. Moreover, it is difficult to set the threshold for anomalies:
a slightly smaller threshold boosts false alarms, while a slightly larger one misses true
anomalies. Therefore, our self-similarity-based method is more adaptable for
providing accurate alarms in anomaly detection.
Space and Time Efficiency. To show the time and space efficiency of our
algorithm, we compare our method with Stardust [4] and the query-based method
[16]. It can be observed from Figures 8 and 9 that our method uses much less
memory and runs more than tens of times faster than the other methods. We use the
code provided by the authors of [4] and [16] for the comparison test. The rest
of the setting is the same as before, except for the special setting for Stardust with
box capacity c = 2.
Figure 8 shows the time efficiency results. With the increase in the
number of monitored windows, the processing time saved by the self-similarity-based
algorithm becomes larger and larger. Our method can process 400 × 10^7
tuples in 10 seconds. This means that it is capable of processing traffic rates on
100 Mbps links, and with some further work, 1 Gbps or even higher rates are within reach.
Figure 9 shows the space efficiency results. The space cost of our piecewise
fractal model is only affected by the number of pieces maintained. However,
Stardust has to maintain all the monitored data points and the index structure at
the same time, and the query-based method has to maintain an inverted histogram
(IH) in memory. It can be seen that the space saved by the piecewise fractal
model is considerable. Thus, our method is more suitable for detecting anomalies
over streaming data.
7 Concluding Remarks
In this paper, we incorporate the fractal analysis technique into anomaly detection.
Based on the piecewise fractal model, we propose a novel method for
detecting anomalies over a data stream. Fractal analysis is employed to model
the self-similar pieces in the original data stream. We show that this fractal-based
method can be used to detect not only the bursts defined before, but also
more general anomalies that cause the fractal characteristics to change. The piecewise
fractal model is proved to provide accurate monitoring results while consuming
only limited storage space and computation time. Both theoretical and empirical
results show the high accuracy and efficiency of the proposed method. Furthermore,
this approach can also be used to reconstruct the original data stream
with a guaranteed error.
References
1. Internet traffic archive, http://ita.ee.lbl.gov/
2. Barnsley, M., Elton, J., Hardin, D.: Recurrent iterated function systems. Constructive Approximation 5(1), 3-31 (1989)
3. Borgnat, P., Flandrin, P., Amblard, P.: Stochastic discrete scale invariance. IEEE Signal Processing Letters 9(6), 181-184 (2002)
4. Bulut, A., Singh, A.: A unified framework for monitoring data streams in real time. In: Proceedings of the 21st International Conference on Data Engineering, ICDE 2005, pp. 44-55. IEEE (2005)
5. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: A survey. ACM Computing Surveys (CSUR) 41(3), 15 (2009)
6. Chen, Y., Dong, G., Han, J., Wah, B., Wang, J.: Multi-dimensional regression analysis of time-series data streams. In: Proceedings of the 28th International Conference on Very Large Data Bases, pp. 323-334. VLDB Endowment (2002)
7. Cormode, G., Muthukrishnan, S.: What's new: Finding significant differences in network data streams. In: Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies, INFOCOM 2004, vol. 3, pp. 1534-1545. IEEE (2004)
8. Hart, J.: Fractal image compression and recurrent iterated function systems. IEEE Computer Graphics and Applications 16(4), 25-33 (1996)
9. Jain, A., Chang, E., Wang, Y.: Adaptive stream resource management using Kalman filters. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pp. 11-22. ACM (2004)
10. Kifer, D., Ben-David, S., Gehrke, J.: Detecting change in data streams. In: Proceedings of the Thirtieth International Conference on Very Large Data Bases, vol. 30, pp. 180-191. VLDB Endowment (2004)
11. Krishnamurthy, B., Sen, S., Zhang, Y., Chen, Y.: Sketch-based change detection: methods, evaluation, and applications. In: Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement, pp. 234-247. ACM (2003)
12. Mandelbrot, B.: The fractal geometry of nature. Times Books (1982)
13. Mazel, D., Hayes, M.: Using iterated function systems to model discrete sequences. IEEE Transactions on Signal Processing 40(7), 1724-1734 (1992)
14. Michael, F.: Fractals everywhere. Academic Press, San Diego (1988)
15. Patcha, A., Park, J.: An overview of anomaly detection techniques: Existing solutions and latest technological trends. Computer Networks 51(12), 3448-3470 (2007)
16. Qin, S., Qian, W., Zhou, A.: Approximately processing multi-granularity aggregate queries over data streams. In: Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, p. 67. IEEE (2006)
17. Shahabi, C., Tian, X., Zhao, W.: TSA-tree: A wavelet-based approach to improve the efficiency of multi-level surprise and trend queries on time-series data. In: Proceedings of the 12th International Conference on Scientific and Statistical Database Management, pp. 55-68. IEEE (2000)
18. Wu, X., Barbara, D.: Using fractals to compress real data sets: Is it feasible? In: Proc. of SIGKDD (2003)
19. Zhou, A., Qin, S., Qian, W.: Adaptively detecting aggregation bursts in data streams. In: Zhou, L., Ooi, B.-C., Meng, X. (eds.) DASFAA 2005. LNCS, vol. 3453, pp. 435-446. Springer, Heidelberg (2005)
20. Zhu, Y., Shasha, D.: Efficient elastic burst detection in data streams. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 336-345. ACM (2003)
Preservation of Proximity Privacy
in Publishing Categorical Sensitive Data
Yujia Li1, Xianmang He2,*, Wei Wang1, Huahui Chen2, and Zhihui Wang1
1 School of Computer Science and Technology, Fudan University,
No. 220, Handan Road, Shanghai, 200433, P.R. China
{071021056,weiwang1,zhhwang}@fudan.edu.cn
2 School of Information Science and Technology, Ningbo University,
No. 818, Fenghua Road, Ningbo, 315122, P.R. China
{hexianmang,chenhuahui}@nbu.edu.cn
1 Introduction
The information age has witnessed a tremendous growth of personal data that can be
collected and analyzed, including a great deal of released microdata (i.e. data in raw,
non-aggregated format) [12].These released microdata offer significant advantages in
terms of information availability, which is particularly suitable for ad hoc analysis in
a variety of domains, such as medical research.
However, the publication of microdata leads to the concerns of individual private
information. For example, the medical institute and its cooperative organization need
a large amount of private data on patients for disease research. Table 1 is such an
example. One straightforward approach to achieve this is to exclude unique identifier
attributes, such as Name from Table 1, which is not sufficient for protecting privacy
leakage under linking-attack [1-3]. Suppose the attacker has known about Andy with
attribute values <age=5, sex=M, zip=12k>, then according to Table 1, s/he can deduce
Andys disease, which is pneumonia. Such combination of Age, Zipcode and Sex can
*
Corresponding author.
1.1 Motivation
Recently, there have been many works addressing the problem of privacy-preserving data
publishing. However, most of them ignore the case where there may exist semantic
proximity among sensitive categorical values. For example, an adversary possessing the QI-values
of Andy is able to find out that Andy's record is in the first QI-group of Table
2. Without further information, assuming that each tuple in the group has an equal
chance of belonging to Andy, the adversary can conclude that Andy suffers from
some respiratory infection disease with probability 2/3.
The existing anonymization principles fall short of solving such proximity privacy
leakage in our setting, which is a privacy threat specific to categorical sensitive
attributes (e.g., the proximity of pneumonia to other respiratory diseases in Table 2). To address the issue, this paper proposes a new
anonymization principle termed the m-Color constraint.
The rest of the paper is organized as follows. Previous related work is reviewed
in Section 2. Section 3 clarifies the basic concepts and the problem definition.
Section 4 discusses the properties of the m-Color constraint, and Section 5 elaborates the
generalization algorithm. In Section 6, we experimentally evaluate the efficiency
and effectiveness of our techniques. Finally, the paper is concluded in Section 7.
2 Related Work
In this section, previous related work is surveyed. All privacy-preserving transformations of the microdata are referred to as recoding, which can be classified into
two classes of models: global recoding and local recoding [11,16]. In global recoding, a particular detailed value must be mapped to the same generalized value in all records. Local recoding allows the same detailed value to be mapped to different generalized values in different QI-groups. Obviously, global recoding is a special case of local recoding. Efficient greedy solutions following certain heuristics have been proposed [7],[9],[11],[15] to obtain near-optimal solutions. Incognito [8] provides a practical framework for implementing full-domain generalization, borrowing ideas from frequent itemset mining, while Mondrian [9] takes a partitioning approach reminiscent of kd-trees. To achieve k-anonymity, Ghinita [16] presents a framework mapping the multi-dimensional quasi-identifiers to a one-dimensional (1-D) space. It has been discovered that k-anonymizing a data set is strikingly similar to building a spatial index over the data set, so that classical spatial indexing techniques can be used for anonymization [17]. The idea of non-homogeneous generalization was first introduced in k-anonymization revisited [18], which studies techniques with a guarantee that an adversary cannot associate a generalized tuple with fewer than K individuals, but which suffer from additional types of attacks. The authors of [19] proposed a randomization method that prevents such attacks and showed that k-anonymity is not compromised by the attack, but its partitioning algorithm is only a special case of the top-down algorithm presented in [11]. In contrast, our algorithm mainly addresses privacy preservation of categorical sensitive attributes under the m-Color constraint.
3 Problem Statement
Let T be a microdata table that contains the private information of a set of individuals. T has d QI-attributes A1, ..., Ad, and a sensitive attribute (SA) S. A partition P consists of several subsets Gi (1 ≤ i ≤ k) of T, such that each tuple in T belongs to exactly one subset and T = ∪_{i=1}^{k} Gi. We refer to each subset Gi as a QI-group or a bucket.
For an attribute Ai, NCP_{Ai}(t) measures the generalized range of tuple t on Ai normalized by |Ai|, where |Ai| is the domain of the attribute Ai. For a tuple t, the normalized certainty penalty of t is NCP(t) = Σ_{i=1}^{d} NCP_{Ai}(t). The normalized certainty penalty of T is NCP(T) = Σ_{t∈T} NCP(t).
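The following is a minimal illustrative sketch (not the authors' code) of how NCP can be evaluated for a partition, assuming numeric QI-attributes that each QI-group generalizes to the observed [min, max] interval; the tuple layout and the name ncp_of_group are our own.

    def ncp_of_group(group, domain_size):
        # Sum of NCP(t) over all tuples t in one QI-group.
        per_attr = []
        for i in range(len(domain_size)):
            values = [t[i] for t in group]
            # generalized range on attribute A_i, normalized by |A_i|
            per_attr.append((max(values) - min(values)) / domain_size[i])
        # every tuple in the group shares the same generalized interval,
        # so NCP(t) is identical for all of them
        return len(group) * sum(per_attr)

    def ncp_of_table(partition, domain_size):
        # NCP(T) = sum over all tuples of NCP(t), accumulated group by group
        return sum(ncp_of_group(g, domain_size) for g in partition)

    # Example: two QI-attributes with domain sizes 100 (Age) and 10 (Zipcode prefix).
    groups = [[(5, 2), (9, 3), (7, 2)], [(40, 8), (45, 9)]]
    print(ncp_of_table(groups, domain_size=[100, 10]))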
In reality, an attacker with sufficient background knowledge [10] may be able to deduce a concrete QI-group. Assume that the adversary considers every sensitive attribute value within that group equally probable for the victim. For example, after realizing that Andy's record appears in the first QI-group of Table 2, an adversary derives P[X = s] = 1/3, where X models the disease of Andy and s can be any disease (i.e., pneumonia, pulmonary embolism, flu) in that group.
While generalizing the table T, we do not change the sensitive attribute values of Disease but treat colors as the sensitive values. From Table 3, we can observe that the generalization algorithm does not cause any information loss in the sensitive attribute values, whereas personalized privacy preservation [5] does. We label tuples whose sensitive values are semantically similar with the same color class-tag, and assign tuples with the same color class-tag to different QI-groups to prevent proximity privacy leakage. Hence, the m-Color constraint is introduced.
Definition 2. A QI-group satisfies the m-Color constraint if the QI-group has at least m tuples and each tuple of this QI-group owns a different color class-tag. A table fulfills the m-Color constraint if each of its QI-groups satisfies the m-Color constraint.
Without considering the information loss, we can provide a possible partition of Table 1 (see Table 4). Based on Definition 2, the partition fulfills the 3-Color constraint.
In our paper, the m-Color constraint is used to solve the categorical-attribute proximity privacy problem, and it limits proximity privacy leakage effectively.
Given a table T and m, after each tuple has been labeled with a color class-tag, we consider the following question: given a table T and an integer m, can we find a generalized T* that fulfills the m-Color constraint? This question is equivalent to the definition below, and the results are given as follows.
Definition 3. Given a table T and m, if there exists a partition in which each QI-group has at least m tuples and all the sensitive attribute values of the tuples in each QI-group are unique, we refer to such a table as m-eligible.
Theorem 1. T is m-eligible if and only if the number of tuples sharing the same sensitive value is no more than 1/m of the total number of tuples in T.
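As a small illustration of Theorem 1, the check below (a sketch under our own data layout, with the color class-tag stored as the last field of each tuple) tests m-eligibility by counting how often each sensitive value occurs.

    from collections import Counter

    def is_m_eligible(table, m):
        # True iff no sensitive value occurs in more than 1/m of the tuples.
        counts = Counter(t[-1] for t in table)
        return max(counts.values()) * m <= len(table)

    # Example: 6 tuples, 3 colors, each color appearing twice.
    table = [("a", "red"), ("b", "red"), ("c", "blue"),
             ("d", "blue"), ("e", "green"), ("f", "green")]
    print(is_m_eligible(table, 3))   # True
    print(is_m_eligible(table, 4))   # False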
5 Generalization Algorithm
Step 1: We employ the following strategy to select t1 and t2: select a point u at random, then choose t1 to minimize NCP(t1, u) (see Definition 3). We then scan all tuples once to find the tuple t2 that maximizes NCP(t1, t2).
Step 2: For each tuple w in table T, we compute Δ1 = NCP(w ∪ S1) − NCP(S1) and Δ2 = NCP(w ∪ S2) − NCP(S2), and distribute w to the group that leads to the lower information loss.
Step 3: If Δ1 = Δ2, then w is put into the group with smaller cardinality.
Steps 4–5 check whether S1 and S2 are m-eligible. If both are generalizable, the partitioning of T is tried K times, and the tuples of G are randomly shuffled each time.
In the second step, we use the assign algorithm [3] to create the set of buckets.
In this section, we use the buckets passed from the assign algorithm to generate QI-groups. Let the bucket's signature be {v1, v2, ..., vk} (k ≥ m), and let each color tag correspond to the same number of tuples. If each color tag corresponds to a single tuple, the bucket is a QI-group; otherwise, we divide the bucket into two disjoint buckets buk1 and buk2, each with the same signature. The splitting continues until each bucket contains exactly one tuple per color tag.
The idea underlying the split algorithm is as follows: sort all the tuples in the bucket according to attribute A1, then put the first half of the tuples into buk1 and the rest into buk2. In this way, we obtain one partition of the bucket. For the attributes A2, A3, ..., we can get another d−1 choices, where d is the dimension of the QI-attributes. Among all d partitions, we pick the one that minimizes the sum of NCP(buk1) and NCP(buk2) as the final partition.
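A compact sketch of this splitting step is given below; it reuses the ncp_of_group helper sketched in Section 3 and is only an illustration of the heuristic, not the authors' implementation.

    def split_bucket(bucket, domain_size):
        # Try cutting the sorted bucket in half along each QI-attribute and
        # keep the cut that minimizes NCP(buk1) + NCP(buk2).
        best = None
        for attr in range(len(domain_size)):
            ordered = sorted(bucket, key=lambda t: t[attr])
            half = len(ordered) // 2
            buk1, buk2 = ordered[:half], ordered[half:]
            cost = ncp_of_group(buk1, domain_size) + ncp_of_group(buk2, domain_size)
            if best is None or cost < best[0]:
                best = (cost, buk1, buk2)
        return best[1], best[2]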
6 Experiments
7 Conclusion
Acknowledgement. This work was supported in part by the National Natural Science Foundation of China (No. 61033010, 61170006, 61202007) and the Project of the Education Department of Zhejiang Province (Y201224678).
References
[1] Sweeney, L.: k-anonymity: A model for protecting privacy. Journal on Uncertainty, Fuzziness, and Knowledge-Based Systems 10(5), 557–570 (2002)
[2] Machanavajjhala, A., Gehrke, J., Kifer, D.: l-diversity: Privacy beyond k-anonymity. In: ICDE (2006)
[3] Xiao, X., Tao, Y.: m-invariance: Towards privacy preserving re-publication of dynamic datasets. In: SIGMOD 2007, pp. 689–700 (2007)
[4] Li, N., Li, T.: t-closeness: Privacy beyond k-anonymity and l-diversity. In: ICDE (2007)
[5] Xiao, X., Tao, Y.: Personalized privacy preservation. In: SIGMOD 2006, pp. 229–240 (2006)
[6] Bayardo, R., Agrawal, R.: Data privacy through optimal k-anonymization. In: ICDE 2005, pp. 217–228 (2005)
[7] Fung, B.C.M., Wang, K., Yu, P.S.: Top-down specialization for information and privacy preservation. In: ICDE 2005, pp. 205–216 (2005)
[8] LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Incognito: Efficient full-domain k-anonymity. In: SIGMOD 2005, pp. 49–60 (2005)
[9] LeFevre, K., DeWitt, D.J., et al.: Mondrian multidimensional k-anonymity. In: ICDE, pp. 277–286 (2006)
[10] Li, J., Tao, Y., Xiao, X.: Preservation of Proximity Privacy in Publishing Numerical Sensitive Data. In: SIGMOD 2008 (2008)
[11] Xu, J., Wang, W., Pei, J., et al.: Utility-based anonymization using local recoding. In: KDD 2006, pp. 785–790 (2006)
[12] Willenborg, L., de Waal, T.: Elements of Statistical Disclosure Control. Lecture Notes in Statistics. Springer (2000)
[13] Samarati, P., Sweeney, L.: Protecting Privacy when Disclosing Information: k-Anonymity and Its Enforcement through Generalization and Suppression (1998)
[14] Samarati, P.: Protecting Respondents' Identities in Microdata Release. IEEE Trans. on Knowl. and Data Eng. 13(6), 1010–1027 (2001)
[15] Wong, W.K., Mamoulis, N., Cheung, D.W.L.: Non-homogeneous generalization in privacy preserving data publishing. In: SIGMOD 2010: Proceedings of the 2010 International Conference on Management of Data, pp. 747–758. ACM, New York (2010)
[16] Ghinita, G., Karras, P., Kalnis, P., Mamoulis, N.: Fast data anonymization with low information loss. In: VLDB 2007: Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 758–769. VLDB Endowment (2007)
[17] Iwuchukwu, T., Naughton, J.F.: K-anonymization as spatial indexing: toward scalable and incremental anonymization. In: VLDB 2007: Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 746–757. VLDB Endowment (2007)
[18] Gionis, A., Mazza, A., Tassa, T.: k-anonymization revisited. In: ICDE 2008: Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, pp. 744–753. IEEE Computer Society, Washington, DC (2008)
[19] Wong, W.K., Mamoulis, N., Cheung, D.W.L.: Non-homogeneous generalization in privacy preserving data publishing. In: SIGMOD 2010: Proceedings of the 2010 International Conference on Management of Data, pp. 747–758. ACM, New York (2010)
[20] Zhang, Q., Koudas, N., Srivastava, D., Yu, T.: Aggregate query answering on anonymized tables. In: ICDE 2007: Proceedings of the 23rd International Conference on Data Engineering, vol. 1, pp. 116–125 (2007)
[21] Agrawal, R., Srikant, R.: Privacy-preserving data mining. SIGMOD Rec. 29(2), 439–450 (2000)
S2MART: Smart Sql to Map-Reduce Translators
Abstract. With the rapid increase in the size of data in large cluster systems and the transformation of data into big data in major data-intensive organizations and applications, it is necessary to build efficient and flexible Sql to Map-Reduce translators that make terabytes to petabytes of data easy to access and retrieve, because conventional Sql-based data processing has limited scalability in these cases. In this paper we propose a Smart Sql to Map-Reduce Translator (S2MART), which transforms Sql queries into Map-Reduce jobs with the inclusion of intra-query correlation for minimizing redundant operations, sub-query generation, and a spiral modeled database for reducing data transfer cost and network transfer cost. S2MART also applies the database concept of views to make parallelization of big data easy and streamlined. This paper gives a comprehensive study of the various features of S2MART, and we compare the performance and correctness of our system with two widely used Sql to Map-Reduce translators, Hive and Pig.
1 Introduction
used query results to minimize the data transfer cost and network cost and increase time efficiency. 3) The spiral modeled database is adaptive to dynamic workload patterns. 4) Sub-query generation increases the throughput of the system. 5) We use the concept of views in databases to store intermediate results, which reduces the number of records to be checked for every computation. 6) We experimentally compare our system with legacy systems like Hive, Pig, etc.
2 Literature Survey
The popularity of hybrid systems transforming Sql queries into Map-Reduce jobs has increased due to the emergence of big data in many data-centric organizations. S2MART uses nuoDB [10] [11] as its cloud storage because nuoDB provides full indexing support, easy and flexible Sql querying with DB replication, offers drivers for ODBC, JDBC and Hibernate, can be deployed anywhere in the cloud, provides scalability up to 50 nodes and, most importantly, supports Hadoop HDFS. In this paper we focus on three important legacy Sql to Map-Reduce translators, namely Pig, Hive and Scope. Companies and organizations that deal with big data in the form of click streams, logs and web crawls have focused their attention on developing open source high-level dataflow systems. This led Yahoo to develop an open source incarnation called Pig to support parallelization of data retrieval in the cloud environment [12]. Pig supports a simple language called Pig Latin [13] for query formulation and data manipulation. Pig Latin is written using Java class libraries and provides support for simple operations like filtering, grouping, projection and selection. The major issue with Pig is that it provides limited flexibility for join operations and nested operators. Hive, developed by Facebook, adds Sql-like functions to Map-Reduce but follows a syntax different from conventional Sql, called HiveQL [14] [15]. HiveQL transforms the Sql query into Map-Reduce jobs that are executed in Hadoop (an open source incarnation of Map-Reduce). Hive provides the functionality of a metastore that supports query compilation and optimization [16]. Even though Hive follows a structured approach towards database concepts, it provides limited functionality for join predicates. The Sql to Map-Reduce translator developed by Microsoft is called Scope [17]. The compiler and optimizer of Scope define a new scripting language for managing and retrieving data from large data repositories, and are responsible for automatic job generation and parallelization of job execution. However, Scope provides no support for user-defined types or a metastore, which reduces the performance of the system. Hence Scope does not provide as rich a data model as Hive and Pig, since it does not include partitioning of tables. Other translators developed by Microsoft are Cosmos [18] and Dryad [19].
3 Proposed System
The proposed system consists of two linchpin modules, described below.
The full-fledged system architecture is shown in Fig. 1 and is divided into three major layers. The architecture of S2MART uses the classic master/worker pattern of the Map-Reduce framework, where the master node is referred to as the query distributor and the worker nodes are referred to as query analyzers. The input to the system is the Sql query from the user. This Sql query is given to the Sql query parser, whose output is in turn given to the IQ-RT (Intra-query relationship tree) parser, which constructs the tree based on the input query. The IQ-RT parser generates the tree structure, which we call the relationship tree, for the input query from the Sql query parser. The tree is then given as input to the IQ-RI (Intra-query relationship identifier). After identifying the existing relationships, the operations similar to each other are consolidated so that the number of scans of a particular database is reduced drastically. The resultant IQ-RT is passed to the sub-query generator engine, which disintegrates the different operations in the IQ-RT into sub-queries rather than a single large query. Now the sub-queries are distinctly separated, and every sub-query is compared with the queries stored in the Spiral database, which uses the FQR (Frequent Query Retrieval) algorithm to check whether the input query matches any of the entries in the Spiral database. There are now two possibilities: in the first scenario the input query and its results are cached in the Spiral database, and in the other scenario the query and its results are not found in the Spiral database. In the first case, the results of the individual sub-queries are fetched from the Spiral database [20] and given as input to the query optimizer, which optimizes the query results and returns them to the user. In the other case, each individual sub-query is given to the Map-Reduce framework, which is built inside a Hadoop infrastructure [21][22]. In this case no single node alone retrieves data from the database nodes, avoiding a bottleneck, and no single node does the entire mapping job of a single query, increasing the efficiency of data retrieval from the database nodes without changing the underlying structure of the Map-Reduce framework. Data transfer cost is reduced in this method by the use of the spiral DB: the frequently used queries are stored, reducing the need to transfer data unnecessarily. Network transfer cost is reduced by the use of buffers after the map-reduce step.
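To make the flow above concrete, the following is an illustrative orchestration sketch only; the helper names (parse_to_subqueries, run_mapreduce, merge) and the dictionary standing in for the Spiral database are hypothetical, not S2MART APIs.

    def answer_query(sql, parse_to_subqueries, spiral_db, run_mapreduce, merge):
        # spiral_db is any dict-like cache keyed by sub-query text (FQR lookup).
        results = []
        for sub_query in parse_to_subqueries(sql):   # IQ-RT/IQ-RI + GenQL steps
            cached = spiral_db.get(sub_query)
            if cached is None:                       # cache miss: run a Map-Reduce job
                cached = run_mapreduce(sub_query)
                spiral_db[sub_query] = cached        # remember frequently used queries
            results.append(cached)
        return merge(results)                        # query-optimizer merge step

    # Toy usage with stub helpers:
    cache = {}
    print(answer_query("select 1", lambda q: [q], cache, lambda q: ["row"], lambda r: r))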
3.2 Methodology
The entire working of S2MART basically depends on the following modules.
Fig. 2. Input Query to SQL Parser. Fig. 3. Relationship Tree for Input Query.
Fig. 4. Resultant Query after IQ-RI module (Select d.dept_name, d.dept_id, e.empname from dept d, emp e, country c where d.empno=e.empno and c.state_name=MUM;). Fig. 5. Resultant Tree after IQ-RI module (join nodes over C, E, D).
optimize query efficiency. Now the relationship tree is generated again based on all the groupings. The query that has been subjected to the IQ-RI module is shown in Fig. 4, and the IQ-RT regenerated for the resulting query is shown in Fig. 5. The IQ-RT obtained represents 4 selection operations and 3 join operations in total, and hence has fewer operations than the IQ-RT in Fig. 3. Therefore the number of table scans and operations that have to be performed against the database decreases, which increases the throughput and reduces execution time. The algorithm used to find the intra-query relations and combine similar operations is shown in Fig. 7.
IQRT Generator
Input: query, a string
  create empty tree IQRT;
  while query parsing till the end do
    make Join node;
    assign Join as root node of IQRT;
    if select operation is found in query then
      make select node;
      assign the input table_id to select node;
      point node to Join root node of query;
  return IQRT;

IQRI Generator
Input: IQRT, a tree
  create empty tree IQRT_new;
  if no_of_table_ids is greater than six then
    if join.sel.table_ids are repeating then
      group selection operations on same table_id;
    if join.partitionkey are same then
      use OR or AND operation into one join;
    if aggregate operations nodes on same table_id then
      group aggregate operations on same table_id;
    combine individual outer join and inner join into a single query, IQ;
    IQRT_new = IQRT(IQ);
    GenQl(IQRT_new);
  else
    GenQl(IQRT);
any match. If there is a match, then the results of the query are directly given to the buffer; otherwise the query is divided into a number of sub-queries based on the GenQL algorithm. In order to bring concurrency between the queries and obtain proper results, we use the concept of views in the database. After the sub-queries are generated, each sub-query represents a unique independent operation performed by the relationship tree. The number of views that have to be created is equal to the number of sub-queries generated. The query subjected to the IQ-RI module shown in Fig. 4 can be divided into 4 sub-queries based on the GenQL algorithm shown in Fig. 12. Since the number of sub-queries is 4, the number of views required is also 4. Every individual sub-query is subjected to a view creation, which reduces the amount of records or data that has to be accessed or referred to for further operations on the table, and the view creation can group operations that scan the same tables, minimizing the number of operations. The sub-queries generated for the query shown in Fig. 4 are shown in Figs. 8, 9, 10 and 11.
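The snippet below is a small illustrative sketch of this view-creation step, following the general format of Fig. 14; it simply wraps each generated sub-query text in a create view statement, and the view names are ours.

    def views_for_subqueries(sub_queries):
        statements = []
        for i, sub_query in enumerate(sub_queries, start=1):
            statements.append("create view view{0} as ( {1} );".format(
                i, sub_query.rstrip(";")))
        return statements

    # Example using the inner selects of the sub-queries of Fig. 18:
    for stmt in views_for_subqueries([
            "Select e.empname from employee e, country c where c.country=indian",
            "Select v.empname from view1 v, department d where d.dept_id=101"]):
        print(stmt)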
depending on our needs. The workflow of the Spiral database has already been discussed in Section 3. The complete procedure of the FQR algorithm is depicted in Fig. 13.
4 Experimental Results
Fig. 14. General Format for sub-queries: create view <view name> as ( <Individual sub-query is given here> );
Fig. 15. Simple Lightweight query: Select empname from employee where emp_id=101;
The comparison of the execution times of the query in S2MART, Hive and Pig is shown in Fig. 19, where the x-axis represents the number of nodes and the y-axis represents the execution time in seconds. The graph shows that there is no drastic decrease in the execution time for simple lightweight queries.
Fig. 16. Not Subjected to IQ-RT & IQ-RI. Fig. 17. Subjected to IQ-RT & IQ-RI.
Fig. 18. Subject to Sub-query Generation: Select e.empname from employee e, department d and country c where c.country=indian and d.dept_id=101; Sub-Query 1: Create view view1 as (Select e.empname from employee e, country c where c.country=indian); Sub-Query 2: Create view view2 as (Select v.empname from view1 v, department d where d.dept_id=101);
Fig. 19. Comparison of systems - case 4.1 (execution time in seconds vs. number of nodes for Hive, Pig and S2MART).
Fig. 21 shows the experimental comparison between the systems. The x-axis of the graph represents the amount of data in thousands and the y-axis represents the execution time of the query. The graph shows that the execution time decreases drastically after 20,000 records, and thus it can be inferred that S2MART is very useful when we deal with large amounts of data in the data repositories.
Fig. 20. Comparison of systems - case 4.2. Fig. 21. Comparison of systems - case 4.3.
Fig. 22. Comparison of systems - case 4.4. Fig. 23. Comparing systems of cached data (execution time for Hive and S2MART).
5 Conclusion
Execution of complex queries involving join operations between multiple tables and aggregation operations is very necessary for data-intensive applications; thus we developed a system that supports the concept of big data in data-intensive applications and eases data retrieval to provide useful information. We implement several novel modules, such as intra-query relationships (IQ-RT and IQ-RI), sub-query generation and the Spiral database, in the S2MART system, which make the complete system a novel one. We verified the correctness of the S2MART system by performing various experiments using different paradigms of queries, and we achieved efficiency in all the experimental results. The intended applications for S2MART are all data-intensive applications that use big data
analytics. Hence, in the future we can build the system with more efficient query optimization for a particular application rather than being generic, and enable secure cloud storage [30] for confidentiality and integrity of data.
References
1. Jiang, D., Tung, A.K.H., Chen, G.: MAP-JOIN-REDUCE: Toward Scalable and Efficient Data Analysis on Large Clusters. IEEE Transactions on Knowledge and Data Engineering 23(9) (September 2011)
2. Warneke, D., Kao, O.: Exploiting Dynamic Resource Allocation for Efficient Parallel Data Processing in the Cloud. IEEE Transactions on Parallel and Distributed Systems 22(6) (June 2011)
3. Jiang, W., Agrawal, G.: MATE-CG: A MapReduce-Like Framework for Accelerat-
ing Data-Intensive Computations on Heterogeneous Clusters. In: 2012 IEEE 26th
International Parallel and Distributed Processing Symposium (2012)
4. Sakr, S., Liu, A., Batista, D.M., Alomari, M.: A Survey of Large Scale Data Man-
agement Approaches in Cloud Environments. IEEE Communications Surveys and
Tutorials 13(3) (Third Quarter 2011)
5. Han, J., Song, M., Song, J.: A Novel Solution of Distributed Memory NoSQL
Database for Cloud Computing. In: 10th IEEE/ACIS International Conference on
Computer and Information Science (2011)
6. Bisdikian, C.: Challenges for Mobile Data Management in the Era of Cloud and
Social Computing. In: 2011 12th IEEE International Conference on Mobile Data
Management (2011)
7. Nicolae, B., Moise, D., Antoniu, G., Bouge, L., Dorier, M.: BlobSeer: Bringing High
Throughput under Heavy Concurrency to Hadoop Map-Reduce Applications. In:
International IEEE Conference (2010)
8. Zhang, J., Wu, X.: A 2-Tier Clustering Algorithm with Map-Reduce. In: The Fifth
Annual ChinaGrid Conference (2010)
9. Pallickara, S., Ekanayake, J., Fox, G.: Granules: A Lightweight, Streaming Run-
time for Cloud Computing With Support for Map-Reduce. In: IEEE International
Conference (2009)
10. Starkey, J.: Presentation of nuoDB,
http://www.siia.net/presentations/software/AATC2012/NextGen_NuoDB.pdf
11. Starkey, J.: Presentation of nuoDB,
http://www.cs.brown.edu/courses/cs227/slides/dtxn/nuodb.pdf
12. Zhang, Z., Cherkasova, L., Vermam, A., Loo, B.T.: Optimizing Completion Time
and Resource Provisioning of Pig Programs. In: 2012 12th IEEE/ACM Interna-
tional Symposium on Cluster, Cloud and Grid Computing (2012)
13. Olsten, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A. Presented by Welch, D.
14. Thusoo, A., Sen Sarma, J., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H., Murthy, R.: Hive – A Petabyte Scale Data Warehouse Using Hadoop. In: IEEE International Conference (2010)
15. Michel, S., Theobald, M.: HadoopDB: An Architectural Hybrid of MapReduce and
DBMS Technologies for Analytical Workloads. Presented by Raber, F.
16. Thusoo, A., Sen Sarma, J., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive – A Warehousing Solution Over a MapReduce Framework. In: VLDB 2009, Lyon, France (August 2009)
17. Chaiken, R., Jenkins, B., Larson, P.-A., Ramsey, B., Shakib, D., Weaver, S., Zhou, J.: SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets. In: VLDB 2008, Auckland, New Zealand, August 24-30 (2008)
18. Yang, C.: Osprey: Implementing MapReduce-Style Fault Tolerance in a Shared-
Nothing Distributed Database. In: ICDE Conference (2010)
19. Yu, Y., Isard, M., et al.: DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. Microsoft Research
20. He, Y., Lee, R.: RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems. In: ICDE Conference (2011)
21. Foley, M.: High Availability HDFS,
http://storageconference.org/2012/Presentations/M07.Foley.pdf
22. Chandar, J.: Join Algorithms using Map/Reduce. University of Edinburgh (2010)
23. Chen, T., Taura, K.: ParaLite: Supporting Collective Queries in Database System
to Parallelize User-Defined Executable. In: 2012 12th IEEE/ACM International
Symposium on Cluster, Cloud and Grid Computing (2012)
24. Leu, J.-S., Yee, Y.-S., Chen, W.-L.: Comparison of Map-Reduce and SQL on Large-
scale Data Processing. In: International Symposium on Parallel and Distributed
Processing with Applications (2010)
25. Hsieh, M.-J.: SQLMR: A Scalable Database Management System for Cloud Com-
puting. In: 2011 International Conference on Parallel Processing (2011)
26. Zhu, M., Risch, T.: Querying Combined Cloud-Based and Relational Databases.
In: 2011 International Conference on Cloud and Service Computing (2011)
27. Husain, M.F.: Scalable Complex Query Processing Over Large Semantic Web Data
Using Cloud. In: 2011 IEEE 4th International Conference on Cloud Computing
(2011)
28. Hu, W.: A Hybrid Join Algorithm on Top of Map Reduce. In: 2011 Seventh Inter-
national Conference on Semantics, Knowledge and Grids (2011)
29. Introduction to Hive by Cloudera 2009,
http://www.cloudera.com/wp-content/uploads/2010/01/6-IntroToHive.pdf
30. Wallom, D.: myTrustedCloud: Trusted Cloud Infrastructure for Security-critical
Computation and Data Management. In: 2011 Third IEEE International Confer-
ence on Cloud Computing Technology and Science (2011)
MisDis: An Efficient Misbehavior Discovering Method
Based on Accountability and State Machine in VANET
1 Introduction
* Corresponding author.
2 Related Work
There are only a few misbehavior detection schemes suggested for finding attackers. Golle et al. [4] propose a scheme to detect malicious data by finding outlying data. If there is data that does not fit the current view of the neighborhood, it will be marked as malicious. The current view, in turn, is established cooperatively: vehicles share sensor data with each other, and an adversary model helps to find explanations for inconsistencies. The main goal is to detect multiple adversaries or a single entity carrying out a Sybil attack. This work, however, does not provide concrete means to detect misbehavior but only mentions a sensor-driven detection.
Raya et al. [5] introduced a solution of immediate revocation of certificates of a misbehaving vehicle and formulated a detection system for misbehavior. It is assumed that the PKI is not omnipresent, hence the need for an infrastructure-assisted solution. One component of their work is a local misbehavior detection system running at each node. This system relies on gathering information from other nodes and then comparing the behavior of a vehicle with a model of the average behavior of the other vehicles built on the fly. It is based on the deviation of an attacker node from normal behavior and uses timestamps, signed messages and trusted components (hardware and software). Honest and illegitimate behavior are differentiated by using clustering techniques. The basic idea behind the autonomous solution is to evaluate deviation from the normal behavior of vehicles, while always assuming an honest majority. Once misbehavior is detected, a revocation of a certificate is indicated over a base station that the vehicle connects to. This revocation is then distributed to other vehicles and to the Certification Authority itself. However, the assumption of an honest majority of vehicles does not hold, as identities may be generated arbitrarily.
Ghosh et al. [6,7] presented and evaluated a misbehavior detection scheme (MDS) for the post crash notification (PCN) application. Vehicles can determine the presence of a bogus safety message by analyzing the response of the driver at the occurrence of an event. The PCN alert raised will be received by multiple vehicles in the vicinity, and more than one of these receiving vehicles may use an MDS in their OBUs to detect misbehavior. The basic cause-tree approach is illustrated and used effectively to jointly achieve misbehavior detection as well as identification of its root cause.
Schmidt et al. [8] introduced a framework (VEBAS) based on reputation models. By combining the output of multiple behavior analysis modules, a vehicle calculates a trustworthiness value for other vehicles in its near vicinity from the various sensor values obtained. The value is then used to identify a misbehaving vehicle and build up reputation. VEBAS requires continuous exchange of information between the vehicles, which is used by a vehicle to analyze the behavior of nearby vehicles. Based on this information, vehicles are classified as trustworthy, untrustworthy or neutral. Applications may then take this trust rating into consideration in order to react appropriately to incoming information. However, the paper only describes the scheme; no evaluation for any specific application is presented.
A message filtering model is introduced by Kim et al. [9] that uses multiple complementary sources of information. According to this model, drivers are sent alert messages after the existence of a certain event is proved. The model leverages multiple
3 Preliminaries
3.1 Definition
Closely related are the following two terms, tamper proof and tamper evident. Tamper-proof is a stricter definition of tamper resistant, which claims to be 100% secure against tampering, whereas tamper evident means that one is able to detect that a device has been tampered with. Tamper resistance means the vehicle's devices have to be secured and communication buses have to be protected. Furthermore, tamper-resistant devices can provide secure storage for keys and certificates, and maybe even for a history of recent messages sent over the external communication system, like an event data recorder (EDR), which could be used for legal purposes. When properly applied, tamper-proof and tamper-resistant hardware enable secured communication and prevent most attacks on the active safety communication system from inside the vehicle.
Time is the most important element in a secure log. Timing-information-based verification correlates the time data fields in the secure log against the vehicle's internal clock (synchronized and updated using information provided by the GPS system). An accurate time service can support the upper modules, like the Detector and the State Machine. Verifications with the secure log in order to detect misbehavior are possible with regard to the following aspects. A first step consists in comparing the dump log time to the creation log time stamp. Time verification with several logs originating from a single node can provide additional insight into the node's behavior (or into the fact that another node is trying to impersonate this node). In combination with position information, time-based plausibility checks for single nodes also lead towards vehicle-speed-related plausibility checking.
The security log is the basic data set for vehicle misbehavior detection. Each OBU maintains a security log with the following security features: normal permissions can only append to it, and only entities with high privileges, such as the RSU and the DoT, can dump or delete the security log.
The secure log entry is defined as follows:
len = (s_n, t_n, c_n, h_n)
where:
s_n: a strictly increasing sequence number;
t_n: either SEND or RECV, which is the category of the log entry;
c_n: some type-specific content;
h_n: a recursively defined hash value, h_n = H(h_{n-1} || s_n || t_n || H(c_n));
|| stands for concatenation, and the base hash h_0 is a predefined value. The hash function H(.) is pre-image resistant, second pre-image resistant, and collision resistant.
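A minimal sketch of such a hash-chained, append-only log follows; SHA-256 stands in for the unspecified hash function H(.), and the byte encodings and class name are our own choices rather than the MisDis implementation.

    import hashlib

    def _h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    class SecureLog:
        def __init__(self, base_hash: bytes = b"\x00" * 32):
            self.base_hash = base_hash      # predefined base hash
            self.last_hash = base_hash
            self.entries = []               # append-only list of (s_n, t_n, c_n, h_n)

        def append(self, category: str, content: bytes) -> bytes:
            assert category in ("SEND", "RECV")
            s_n = len(self.entries) + 1     # strictly increasing sequence number
            h_n = _h(self.last_hash + str(s_n).encode()
                     + category.encode() + _h(content))
            self.entries.append((s_n, category, content, h_n))
            self.last_hash = h_n
            return h_n                      # can be handed out as an authenticator

        def verify(self) -> bool:
            # Recompute the chain: tampering with any earlier entry breaks it.
            prev = self.base_hash
            for s_n, t_n, c_n, h_n in self.entries:
                if h_n != _h(prev + str(s_n).encode() + t_n.encode() + _h(c_n)):
                    return False
                prev = h_n
            return True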
The secure log is an append-only list and is securely controlled by the DoT. If required, the log can be dumped to the DoT over secure channels. In this paper, we suppose that the secure log of a vehicle node carries enough behavioral evidence for judgment. At the bottom level, the 802.11p communication module handles the interaction with the other OBUs and RSUs.
We define the security requirements and will show the fulfillment of these requirements after presenting the design details.
Accountability: This requirement states that misbehavior must be detectable from the evidence kept by the system. We consider the use of accountability to detect and expose node faults in distributed systems [11]. An accountable system maintains a tamper-evident record that provides non-repudiable evidence of all nodes' actions. Non-repudiation also requires that violators or misbehaving users cannot deny the fact that they have misbehaved. Based on this record, a faulty node whose observable behavior deviates from that of a correct node can eventually be detected. At the same time, a correct node can defend itself against any false accusations.
Identifiability: An accountable system requires strong node identities. Otherwise, an exposed faulty node could escape responsibility by assuming a different identity. In this paper, we assume that each node is in possession of a cryptographic key pair that can be linked to a unique node identifier. We also assume that correct actions are deterministic and that communication between correct nodes eventually succeeds.
4 Design of MisDis
We are inspired by Timeweave [12] and employ some ideas of the scheme in PeerReview [11]. The basic idea of MisDis is to keep a security log for each vehicle that strictly records its behavioral state as determined by behavioral state machines, and to introduce a moderate-sized set of monitoring OBUs that hold check-tokens authorized by the DoT. We then use this collection to monitor the security log of the target vehicle in real time and on site. If any exceptions are found, they are reported to the DoT through an RSU for further investigation and isolation.
First, Table 1 shows the notation used in our scheme. Next, we elaborate on the technical design of the proposed scheme as follows.
Table 1. (Continued). Notation and description: H(.) – hash function H: {0,1}* → Z_q*.
In the system initialization, the DoT initializes the system parameters in an offline manner, and registers vehicles and RSUs. The DoT assigns a PID to Vi and issues an anonymous certificate binding the public key K_i^+ to Vi.
4) OBU-TokenGen: According to the DoT's history record and the choice policy, the DoT authorizes RSUs to issue Check-Tokens to some OBUs. Each token has a limited valid time.
If vehicle i subsequently cannot produce a prefix of its log that matches the hash value in u_n^i, then vehicle j has verifiable evidence that i has tampered with its log and is faulty. In addition, u_n^i can be used by vehicle j as verifiable evidence to convince other vehicles or RSUs that an entry len exists in vehicle i's log. Any OBU or RSU can also use u_n^i to inspect len and the entries before it in i's log.
1) Hello: OBUi broadcasts a Hello-Message periodically to announce its existence to nearby vehicles. The Hello-Message is [K_i^+].
2) Check-Req: After receiving the Hello-Message from Vi, Vj sends back a Check-Req-Message, which is [K_i^+ || check-token || Log || T] encrypted with a session key between Vi and Vj. The Check-Req algorithm is shown as follows.
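The Check-Req algorithm itself is not reproduced in this excerpt; the function below is only a hedged sketch of the prefix-consistency check that the evidence mechanism above relies on, reusing the _h helper and entry layout from the SecureLog sketch (the authenticator format is our assumption).

    def check_prefix(prefix_entries, authenticator, base_hash=b"\x00" * 32):
        # authenticator = (sequence number, hash value) previously received for an entry
        seq_claimed, hash_claimed = authenticator
        prev = base_hash
        for s_n, t_n, c_n, h_n in prefix_entries:
            if h_n != _h(prev + str(s_n).encode() + t_n.encode() + _h(c_n)):
                return False                # chain broken: the log was tampered with
            prev = h_n
        # the produced prefix must end exactly at the authenticated entry
        return bool(prefix_entries) and prefix_entries[-1][0] == seq_claimed \
            and prefix_entries[-1][3] == hash_claimed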
5 Security Analysis
Accountability: The OBU's secure log keeps evidence about the vehicle node and can further be dumped into the DoT's overall database. The OBU's core components are tamper-resistant hardware. Based on this outsourced record, a misbehaving car whose observable behavior deviates from that of a correct node can eventually be detected.
6 Conclusion
In a VANET, each car on the road is equipped with an OBU. In this paper, we introduce MisDis, a method that can identify a misbehaving vehicle through state automata and supervision, together with a dedicated security log that records the behavioral characteristics of the target vehicle. If the host vehicle violates the established security policy, the system records the behavior and tries to dump it to a safer roadside unit or the department in charge.
References
1. Aijaz, A., Bochow, B., Dötzer, F., Festag, A., Gerlach, M., Kroh, R., Leinmüller, T.: Attacks on Inter Vehicle Communication Systems – an Analysis. In: Proceedings of WIT, pp. 189–194 (2006)
2. Raya, M., Hubaux, J.P.: Securing Vehicular Ad Hoc Networks. Journal of Computer Security, Special Issue on Security of Ad Hoc and Sensor Networks 15(1), 39–68 (2007)
3. Grover, J., Prajapati, N.K., Laxmi, V., Gaur, M.S.: Machine Learning Approach for Multiple Misbehavior Detection in VANET. In: Abraham, A., Mauri, J.L., Buford, J.F., Suzuki, J., Thampi, S.M. (eds.) ACC 2011, Part III. CCIS, vol. 192, pp. 644–653. Springer, Heidelberg (2011)
4. Golle, P., Greene, D., Staddon, J.: Detecting and Correcting Malicious Data in VANETs. In: Proceedings of the First ACM Workshop on Vehicular Ad Hoc Networks (VANET), pp. 139–151. ACM Press, Philadelphia (2004)
5. Raya, M., Papadimitratos, P., Gligor, V.D., Hubaux, J.P.: On data centric trust establishment in ephemeral ad hoc networks. In: 27th IEEE Conference on Computer Communications (INFOCOM 2008), pp. 1238–1246 (2008)
6. Raya, M., Papadimitratos, P., Aad, I., Jungels, D., Hubaux, J.P.: Eviction of Misbehaving and Faulty Nodes in Vehicular Networks. IEEE Journal on Selected Areas in Communications, Special Issue on Vehicular Networks 25(8), 1557–1568 (2007)
7. Ghosh, M., Varghese, A., Kherani, A.A., Gupta, A.: Distributed Misbehavior Detection in VANETs. In: 2009 IEEE Conference on Wireless Communications and Networking Conference (WCNC 2009), pp. 2909–2914. IEEE Press, Piscataway (2009)
8. Ghosh, M., Varghese, A., Kherani, A.A., Gupta, A., Muthaiah, S.N.: Detecting Misbehaviors in VANET with Integrated Root-cause Analysis. In: Ad Hoc Networks, vol. 8, pp. 778–790. Elsevier Press, Amsterdam (2010)
9. Schmidt, R.K., Leinmüller, T., Schoch, E., Held, A., Schäfer, G.: Vehicle Behavior Analysis to Enhance Security in VANETs. In: 4th Workshop on Vehicle to Vehicle Communications (V2VCOM 2008), pp. 168–176 (2008)
10. Kim, T.H., Studer, H., Dubey, R., Zhang, X., Perrig, A., Bai, F., Bellur, B., Iyer, A.: VANET Alert Endorsement Using Multi-Source Filters. In: Proceedings of the Seventh ACM International Workshop on Vehicular Internetworking, pp. 51–60. ACM Press, New York (2010)
11. Haeberlen, A., Kouznetsov, P., Druschel, P.: PeerReview: practical accountability for distributed systems. In: 21st ACM Symposium on Operating Systems Principles (SOSP 2007), pp. 175–188. ACM Press, New York (2007)
12. Maniatis, P., Baker, M.: Secure History Preservation Through Timeline Entanglement, pp. 297–312. USENIX Association, Berkeley (2002)
13. NCTUns 5.0, Network Simulator and Emulator,
http://NSL.csie.nctu.edu.tw/nctuns.html
A Scalable Approach
for LRT Computation in GPGPU Environments
Linsey Xiaolin Pang1,2, Sanjay Chawla1, Bernhard Scholz1, and Georgina Wilcox1
1 School of Information Technologies, University of Sydney, Australia
2 NICTA, Sydney, Australia
{qlinsey,scholz}@it.usyd.edu.au, {sanjay.chawla,georgina.wilcox}@sydney.edu.au
1 Introduction
64 × 64 spatial grid may take nearly six hundred days.1 Wu et al. [8] proposed a method which reduces the computation time to eleven days. However, as noted previously in [9], this approach will not scale to larger data sets, and the biggest spatial grid reported in [8] was 64 × 64.
The nature of LRT permits the computation of regions independently of each other,
which facilitates parallelization to some degree. However, new algorithmic techniques
and parallelization strategies are required in order to fully harvest the computational
power of GPGPUs. We have identified the following challenges that need to be ad-
dressed for achieving a speed up of several orders of magnitude with GPGPUs:
Fig. 1. Workflow: We distinguish between distributions which belong to the one-parameter ex-
ponential family (1EXP) and distributions outside of 1EXP. We then design two algorithms for
1EXP which are used to efficiently and exactly estimate the LRT. For general data distributions,
we design the AX-GPGPU algorithm which uses the exact algorithms as subroutines.
– Computing the LRT of a given region R requires the likelihood of R and of its complement, denoted R̄. While R has a rectangular shape, R̄ is irregularly shaped, making the parallelization more difficult [11,5,6]. The experiments of [9] showed that enumerating the complement regions is a bottleneck for efficient computation.
– The LRT computation on sequential computers is intractable for large spatial grids. GPGPUs can perform massively data-parallel computations, which are well suited to LRT computation. However, the LRT computation must be adapted to the GPGPU programming model, which exhibits a complex memory hierarchy consisting of a grid, blocks and threads. To minimize the communication between CPU and GPGPU memory, it is advantageous to perform the majority of the computation on GPGPUs using appropriate schemes for dividing work between threads and allocating threads to blocks.
Our strategy to overcome the challenges noted above is shown in Figure 1. First, in
our framework, we distinguish between data distributions following the one-parameter
exponential family (1EXP) and those that are outside of 1EXP. When the data belongs
1 The experimental results were reported in 2009.
to a 1EXP distribution, we can estimate the LRT of a given region R by omitting the computation of R̄ (cf. [3]). When the data is assumed to be generated by an arbitrary statistical distribution, we provide a solution to eliminate the bottleneck associated with the computation of R̄: we use an upper-bounding technique to replace the computation of R̄ with that of its constituent regions [8]. Second, we provide three GPGPU algorithms: Brute-Force GPGPU (BF-GPGPU), De Morgan GPGPU (DM-GPGPU) and Approximate GPGPU (AX-GPGPU). The BF-GPGPU and DM-GPGPU implementations provide exact solutions for 1EXP. The AX-GPGPU algorithm provides an approximate solution for any underlying data distribution and uses BF-GPGPU or DM-GPGPU as a sub-routine. We designed our blocking scheme for partitioning the work by dividing the spatial grid into overlapping sub-grids and mapping these regions onto blocks of threads [4]. The majority of the computation is performed on the GPGPUs, and we utilize shared memory for each block by pre-loading the data that will be used by multiple threads.
The rest of the paper is structured as follows. In Section 2, we provide background material on LRT computation and the upper-bounding technique. Related work is given in Section 3. In Section 4, we explain how we use De Morgan's law and dynamic programming to speed up the enumeration and processing of region measurements. The description and details of the three designed algorithms (i.e., BF-GPGPU, DM-GPGPU and AX-GPGPU) are presented in Section 5. In Section 6, we evaluate the algorithms on both synthetic and real data sets consisting of Magnetic Resonance Imaging (MRI) scans from patients suffering from dementia. We give our conclusions in Section 7.
2 Background
2.1 The Likelihood Ratio Test (LRT)
We provide a brief but self-contained introduction for using LRT to find anomalous
regions in a spatial setting. The regions are mapped onto a spatial grid. Given a data
set X, an assumed model distribution f(X, θ), a null hypothesis H0: θ ∈ Θ0 and an alternate hypothesis H1: θ ∈ Θ \ Θ0, the LRT statistic is the ratio

    λ = sup_{θ∈Θ0} {L(θ|X) | H0} / sup_{θ∈Θ} {L(θ|X) | H1}

where L(·) is the likelihood function and Θ is the set of parameters of the distribution [8].
In a spatial setting, the null hypothesis is that the data in a region R (that is currently being tested) and its complement (denoted R̄) are governed by the same parameters. Thus, if a region R is anomalous, then the alternate hypothesis will most likely be a better fit and the denominator of λ will have a higher value for the maximum likelihood estimator of θ. A remarkable fact about λ is that, under mild regularity conditions, the asymptotic distribution of −2 log λ follows a χ² distribution with k degrees of freedom, where k is the number of free parameters.2 Thus regions whose λ value falls in the tail of the χ² distribution are likely to be anomalous [8].
2 If the χ² distribution is not applicable, then Monte Carlo simulation can be used to ascertain the p-value.
For example, if we assume the counts m(R) in a region R follow a Poisson distribution with baseline b and intensity λ, then a random variable x ~ Poisson(λ) is a member of 1EXP with T(x) = x/φ, φ = 1/λ, a(φ) = φ, θ = log(λ), B_e(θ) = exp(θ) and g_e(x) = log(x). For any region R and its complement R̄, m(R) and m(R̄) are independently Poisson distributed with means exp(θ_R) b(R) and exp(θ_R̄) b(R̄), respectively. Let b_R = b(R) / (b(R) + b(R̄)) and m_R = m(R) / (m(R) + m(R̄)). The log-likelihood ratio is then calculated as

    c ( m_R log(m_R / b_R) + (1 − m_R) log((1 − m_R) / (1 − b_R)) )

(cf. [3]); this gives a closed-form formula for the LRT.
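A small numerical sketch of this score is given below. Treating c as the total count m(R) + m(R̄) is our reading of the formula (it is not spelled out in this excerpt), and the guards for the degenerate log terms follow the usual 0·log 0 = 0 convention.

    import math

    def poisson_llr(m_region, m_total, b_region, b_total):
        m_r = m_region / m_total        # m_R: fraction of the total count in R
        b_r = b_region / b_total        # b_R: fraction of the total baseline in R
        if not (0.0 < b_r < 1.0):
            raise ValueError("baseline fraction must lie strictly between 0 and 1")
        term1 = m_r * math.log(m_r / b_r) if m_r > 0.0 else 0.0
        term2 = ((1.0 - m_r) * math.log((1.0 - m_r) / (1.0 - b_r))
                 if m_r < 1.0 else 0.0)
        c = m_total                     # assumption: c is the total count
        return c * (term1 + term2)

    # Example: a region holding 30 of 100 counts but only 10% of the baseline.
    print(poisson_llr(m_region=30, m_total=100, b_region=10, b_total=100))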
3 Related Work
Previous attempts to parallelize LRT computation have achieved only limited success. For example, the Spatial Scan Statistic (SSS), which is a special case of LRT for Poisson data, is available as a program under the name SaTScan [2]. It has been parallelized for multi-core CPU environments, and its extension to GPGPU hardware [7] has achieved a speed-up of two over the multi-core baseline. The GPGPU implementation in [7] proposed loading parts of the data into shared memory but achieved only a modest speed-up. Another attempt [14] applied its own implementation of a spatial scan statistic program on the GPU to an epidemic disease dataset. The number of blocks was decided by both the number of explored villages and the number of diseases, to take full advantage of the stream multiprocessors on GPGPUs. This solution is only applicable to its specific disease scenario. In each of these cases, we believe there is further room for optimising the algorithms for the GPGPU by carefully exploiting the thread and block architecture and further utilising shared memory.
Furthermore, all the existing parallel solutions perform simplified Poisson LRT tests in a circular or cylindrical way, not in a grid-based scenario. Our parallel GPGPU
solution is different and provides a fully parallel template for general LRT computation in a grid, based on the work of [8].
|R| = |A ∩ B|                                           (1)
    = |G| − (|Ā| + |B̄| − |Ā ∩ B̄|)                        (2)
    = |G| − (2|G| − |A| − |B| − (|X| + |Y|))             (3)
    = |A| + |B| + |X| + |Y| − |G|                        (4)
Fig. 2. The grid G, a region R, and the corner-anchored regions A, B, X and Y.
The number of elements of a region R(x1, y1, x2, y2), where (x1, y1) is the upper left corner and (x2, y2) is the lower right corner, is expressed by:

|R(x1, y1, x2, y2)| = |A(x2, y2)| + |B(x1, y1)| + |X(x1, y2)| + |Y(x2, y1)| − |G|    (5)
To obtain the tables for A, B, X, and Y, we employ dynamic programming; for example, the table for A can be computed with a prefix-sum style recurrence, sketched below.
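The recurrence is the standard summed-area (prefix-sum) one; the sketch below assumes A(x, y) aggregates the rectangle spanning the grid's top-left corner to (x, y), with the other three tables obtained by flipping the grid. Indexing conventions are ours, and region_measure computes the same quantity that Equation (5) expresses through four corner tables.

    def corner_table(grid):
        # A[x][y] = sum of grid cells in the rectangle (1,1)..(x,y), 1-based
        n, m = len(grid), len(grid[0])
        A = [[0.0] * (m + 1) for _ in range(n + 1)]
        for x in range(1, n + 1):
            for y in range(1, m + 1):
                A[x][y] = (grid[x - 1][y - 1]
                           + A[x - 1][y] + A[x][y - 1] - A[x - 1][y - 1])
        return A

    def region_measure(A, x1, y1, x2, y2):
        # measurement of R(x1, y1, x2, y2) from a single corner-anchored table
        return A[x2][y2] - A[x1 - 1][y2] - A[x2][y1 - 1] + A[x1 - 1][y1 - 1]

    grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    A = corner_table(grid)
    print(region_measure(A, 2, 2, 3, 3))   # 5 + 6 + 8 + 9 = 28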
Fig. 3. Block and thread schemes: (a) is applicable for exact computation and pre-computation
of R in AX-GPGPU. Each (x, y) point corresponds to a block coordinate. Each thread is re-
sponsible for processing (w/tw, h/th) regions and each thread processes (2, 2) regions in our
implementation; Figure (b) is applicable to DM-GPGPU and pre-computation of R. The grid is
divided into equal disjoint parts. Each part is associated with a block.
In a Poisson model, each cell has a population count and a success count, which take 8 bytes. If the size of shared memory is 16KB, a maximum of 2048 cells can be loaded into the shared memory. This resembles a rectangular region of 45 × 45 cells. To use threads efficiently, the rectangular region is set to 44 × 44 cells and associated with a block. A block with coordinates (x, y) in the CUDA grid is associated with the region (x, y)–(x+43, y+43) in the spatial grid. In our implementation, the maximum number of threads per block is limited by the hardware to 512. Therefore, 22 × 22 threads are assigned to process 44 × 44 cells to maximize the usage of shared memory. Each thread is thus responsible for processing four sub-regions: (x1, y1, x2, y2), (x1, y1, x2+1, y2), (x1, y1, x2, y2+1), (x1, y1, x2+1, y2+1), where (x1, y1) is the upper left cell and (x2, y2) is the lower right cell. See Figure 3a and Algorithm 1; a sketch of this mapping is given after the two-pass scheme below.
A two-pass scheme is used to obtain the likelihood ratio of each region:
(a) In the first pass, all rectangular regions of fixed sizes up to 44 × 44 are enumerated in the grid. The data for each block is copied to its shared memory. After the block data is stored in shared memory, each thread computes the score function of its four rectangular regions and stores it back into global memory.
(b) The second pass enumerates all rectangular regions that do not fit into a single block on the GPGPU and uses the intermediate results from the first pass to calculate the top outlier over all regions.
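As referenced above, the following host-side sketch (plain Python, not CUDA) spells out our reading of the first-pass mapping: block (bx, by) covers cells (bx, by)..(bx+43, by+43), every sub-region starts at the block origin, and thread (tx, ty) of the 22 × 22 layout scores a 2 × 2 patch of end points, i.e. four sub-regions.

    def subregions_of_thread(bx, by, tx, ty):
        x1, y1 = bx, by                     # all sub-regions share the block origin
        x2, y2 = bx + 2 * tx, by + 2 * ty   # thread (tx, ty) owns end points (x2, y2)..(x2+1, y2+1)
        return [(x1, y1, x2, y2), (x1, y1, x2 + 1, y2),
                (x1, y1, x2, y2 + 1), (x1, y1, x2 + 1, y2 + 1)]

    def first_pass_regions(bx, by, threads=22):
        regions = []
        for tx in range(threads):
            for ty in range(threads):
                regions.extend(subregions_of_thread(bx, by, tx, ty))
        return regions

    print(len(first_pass_regions(0, 0)))    # 22 * 22 * 4 = 1936 rectangles per block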
5.1.2 DM-GPGPU
We now illustrate the GPGPU pre-computation step based on De Morgan's law; the rest of the computation is identical to BF-GPGPU. In our De Morgan model, four datasets are pre-computed for all possible rectangular regions. From above, we already know that the pre-computed datasets come from four groups of regions, each group of which shares one corner with the grid G. The computation of an n × n grid requires the pre-computation of 4 n × n regions.
To parallelize the pre-computation, we split the whole spatial grid G into equal disjoint parts. For a grid of size n × n, if the shared memory can load a region with maximum size w × h, we create (n/w) × (n/h) non-overlapping parts. Each such disjoint part is associated with a block. If the number of threads in a block is tw × th, each thread can be assigned to process (w/tw) × (h/th) regions. Each thread is only responsible for the regions that share one of the four corners of its corresponding disjoint sub-grid. The approach for pre-computing R̄ in the next section follows similar lines. See Figure 3b.
An additional second pass is carried out to merge the results from each sub-grid to obtain the full pre-computed set. In our implementation, each block holds 32 × 16 cells to fully utilize the threads' capability.
5.2 AX-GPGPU
a CPU environment [7] have identified that the LRT computational bottleneck comes from the enumeration of the complement region for each given region.
However, we can leverage the strategy proposed by Wu et al. [8] for pruning non-outliers to provide an efficient parallelization of the complement R̄. The likelihood of R̄ is upper-bounded by the sum of the likelihoods of four regions; these four regions each share one of the corners of the grid, and together they constitute R̄. We use an approach similar to DM-GPGPU for the R̄ computation and to BF-GPGPU for the R computation.
Each point (x, y) in the spatial grid is the coordinate of a block. To satisfy the tight-bound criterion and fully utilize the thread capability, our block size is defined as 32 × 16. Therefore, 32 × 16 regions are stored into shared memory for processing R. There are 8568 sub-regions to be enumerated in each block, so each thread is responsible for 2 sub-regions. Figure 3b shows the block scheme for pre-computing R; it is different from Figure 3a. The pre-computation of R only involves the groups of regions that share one of the four corners [8]. Each point (x, y) does not need to be associated with a block; the entire sub-grid is divided into several equal disjoint parts and each part corresponds to a block. We choose 32 × 32 regions as the block size for the pre-computation of R, and there are (n/32) × (n/32) blocks.
All our algorithms assume that the entire spatial grid can be loaded into GPGPU global memory. This may not be a realistic scenario; for example, in our environment, the maximum size of the spatial grid that can be processed by a single GPGPU is 128 × 128. We provide several solutions to overcome this constraint.
Split the Grid into Small Sub-grids. We can split the spatial grid into several sub-grids and load each of them in a sequential fashion onto a GPGPU. The block and thread scheme is applied to each sub-grid, and the final computation is merged on the CPU. This may involve multiple iterations of the same kernel function.
Multiple GPGPUs Approach. We combine multiple GPGPU cards into a computational node. For example, in our case, a single GPGPU implementation can handle a maximum grid size of 128 × 128 due to the 4GB global memory limitation. For a 256 × 256 grid, we can split it into 4 equal 128 × 128 sub-grids and send the grids
to the two GPGPUs separately. Merging is carried out on the CPU after the results are obtained from the two GPGPUs.
Multi-tasked Streaming. CUDA offers APIs for asynchronous memory transfer and streaming. With these capabilities it is possible to design algorithms that allow the computation to proceed on both the CPU and GPGPUs while memory transfer is in progress. In our implementation, we use streaming operations to support asynchronous memory transfer while the kernel is being executed.
Fig. 4. (a) The running time of the CPU implementation of De Morgan vs. brute-force searching. (b) Pre-computation time for De Morgan: CPU vs. GPGPU implementation.
– What are the performance gains of the brute-force versus the De Morgan technique on a sequential CPU?
– What are the performance gains of sequential versus parallel outlier detection?
– What are the performance gains of sequential versus parallel pruning using our upper-bounding technique?
– How does our technique perform on real-world data sets?
The experiments were conducted on an 8-core E5520 Intel server that is equipped with
two Tesla C1060 GPGPU cards supporting CUDA 4.0. Each GPGPU card has 4 GB of
global memory, 16 KB of shared memory, 240 cores and 30 multiprocessors. The experiments use synthetic data sets generated from a Poisson distribution model, with a randomly generated anomalous region planted for verification. To ensure correctness of
the parallel approaches, the results were verified by implementing a sequential version
of all the algorithms which ran on a single CPU machine. For the real data, we used
MRI images of people suffering from dementia and used MRIs of normal subjects as
the baseline.
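A sketch of how such synthetic data might be generated is given below; the base rate, anomaly rate, anomaly size and placement are illustrative assumptions, not the paper's actual parameters.

```python
import numpy as np

def make_synthetic_grid(n=128, base_rate=5.0, anomaly_rate=15.0,
                        anomaly_size=8, seed=0):
    """Poisson-distributed counts over an n x n grid with one planted
    anomalous (higher-rate) rectangular region for verification."""
    rng = np.random.default_rng(seed)
    counts = rng.poisson(base_rate, size=(n, n)).astype(float)
    r = rng.integers(0, n - anomaly_size)
    c = rng.integers(0, n - anomaly_size)
    counts[r:r + anomaly_size, c:c + anomaly_size] = rng.poisson(
        anomaly_rate, size=(anomaly_size, anomaly_size))
    return counts, (r, c, anomaly_size, anomaly_size)
```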
Fig. 5. (a) The running time of the exact outlier detection algorithm on a CPU and a single GPGPU. It shows that BF-GPGPU is faster than BF-CPU; BF-GPGPU becomes nearly 40 times faster than BF-CPU on a grid of 128 × 128. (b) The running time of exact searching on a single GPGPU and two GPGPUs.
Fig. 6. (a) AX-CPU is nearly 10 times faster than BF-CPU. (b) AX-GPGPU is 10 times faster than AX-CPU and almost 400 times faster than BF-CPU.
Fig. 7. MRI images: the LRT value of each region in (a) and (b) is compared to a corresponding region in the baseline image (c). The region in the yellow box in (b) has an LRT value significantly higher than the baseline (7562), so it is a potential outlier. The corresponding region in (a) has a low LRT value (2134), so it is normal.
The dimension of the MRI data is 176 × 176 × 208. Figure 7 shows one slice of each subject's data, which is 176 × 208. A large number of voxels are black in the first and last 20 rows and in the first and last 40 columns. The area tested has dimensions of 128 × 128.
The application of LRT using both AX-GPGPU and BF-GPGPU identifies regions where subjects suffering from dementia have a high likelihood ratio value for the affected regions of the brain. The same areas in normal subjects have low likelihood ratio values. AX-GPGPU found the anomalous region in 24 s, compared to 1 min 6 s for BF-GPGPU. BF-CPU, on the other hand, required 17 min 55 s to find the anomalous region.
7 Conclusion
The Likelihood Ratio Test Statistic (LRT) is the state-of-the-art method for identifying
hotspots or anomalous regions in large spatial settings. To speed up the LRT computa-
tion, this paper proposed three novel contributions: (i) a fast dynamic program to enu-
merate all the possible regions in a spatial grid. Compared to a brute force approach, the
dynamic programming method is nearly a hundred times faster. (ii) a novel way to use
an upper bounding technique, initially designed for pruning non-outliers, to accelerate
the likelihood computation of a complement region. (iii) a systematic way of mapping
the LRT computation onto the GPGPU programming model. In concert, the three contributions yield a speed-up of nearly four hundred times compared to the sequential counterpart. Until now, existing GPGPU implementations for specialized versions of
LRT such as Spatial Scan Statistic had given only modest gains. Moving the computa-
tion of the LRT statistics to the GPGPU enables the use of this sophisticated method of
outlier detection for larger spatial grids than ever before.
Acknowledgment. We thank Dr. Jinman Kim at the University of Sydney for providing
us with the MRI image resources and helping us analyze the data.
References
1. http://www.oasis-brains.org/
2. SatScan, http://www.SatScan.org
3. Agarwal, D., Phillips, J.M., Venkatasubramanian, S.: The hunting of the bump: On maximizing statistical discrepancy. In: SODA, pp. 1137–1146 (2006)
4. Beutel, A., Mølhave, T., Agarwal, P.K.: Natural neighbor interpolation based grid DEM construction using a GPU. In: GIS 2010, pp. 172–181. ACM, New York (2010)
5. Gregerson, A.: Implementing fast MRI gridding on GPUs via CUDA. NVIDIA Tech. Report on Medical Imaging using CUDA (2008)
6. Hong, S., Kim, S.K., Oguntebi, T., Olukotun, K.: Efficient parallel graph exploration on multi-core CPU and GPU. In: Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP 2011 (2011)
7. Larew, S.G., Maciejewski, R., Woo, I., Ebert, D.S.: Spatial scan statistics on the GPGPU. In: Proceedings of the Visual Analytics in Healthcare Workshop at the IEEE Visualization Conference (2010)
8. Wu, M., Song, X., Jermaine, C., Ranka, S., Gums, J.: A LRT Framework for Fast Spatial Anomaly Detection. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2009), pp. 887–896 (2009)
9. Pang, L.X., Chawla, S., Liu, W., Zheng, Y.: On mining anomalous patterns in road traffic streams. In: Tang, J., King, I., Chen, L., Wang, J. (eds.) ADMA 2011, Part II. LNCS, vol. 7121, pp. 237–251. Springer, Heidelberg (2011)
10. Wilks, S.S.: The large sample distribution of the likelihood ratio for testing composite hypotheses. Annals of Mathematical Statistics 9, 60–62 (1938)
11. Vuduc, R., Chandramowlishwaran, A., Choi, J., Guney, M., Shringarpure, A.: On the limits of GPU acceleration. In: HotPar 2010: Proceedings of the 2nd USENIX Conference on Hot Topics in Parallelism, pp. 237–251 (2010)
12. Wu, R., Zhang, B., Hsu, M.: GPU-accelerated large scale analytics. In: UCHPC 2009: Second Workshop on UnConventional High Performance Computing (2009)
13. Wu, R., Zhang, B., Hsu, M.C.: Clustering billions of data points using GPUs. In: UCHPC 2009: Second Workshop on UnConventional High Performance Computing (2009)
14. Zhao, S.S., Zhou, C.: Accelerating spatial clustering detection of epidemic disease with graphics processing unit. In: Proceedings of Geoinformatics, pp. 1–6 (2010)
ASAWA: An Automatic Partition Key Selection Strategy
Abstract. With the rapid increase of data volume, more and more applications
have to be implemented in a distributed environment. In order to obtain high
performance, we need to carefully divide the whole dataset into multiple parti-
tions and put them into distributed data nodes. During this process, the selection
of partition key would greatly affect the overall performance. Nevertheless,
there are few works addressing this topic. Most previous projects on data parti-
tioning either utilize a simple strategy, or rely on a commercial database sys-
tem, to choose partition keys. In this work, we present an automatic partition
key selection strategy called ASAWA. It chooses partition keys according to
an analysis of both the dataset and workload schemas. In this way, intimate tuples,
i.e., tuples that frequently co-appear in queries, would probably be put into the same
partition. Hence cross-node joins could be greatly reduced and the system
performance could be improved. We conduct a series of experiments over the
TPC-H datasets to illustrate the effectiveness of the ASAWA strategy.
1 Introduction
Nowadays, more and more applications need to manage huge volumes of data and
serve large numbers of requests concurrently. As it is too hard for a single machine to
support the operations of the whole system, these applications are mostly installed on clusters containing multiple data nodes. Thus, the whole data of an application needs to be
partitioned and then stored in different machines. The goal of a partitioning algorithm
is to achieve low response time and high throughput by minimizing inter-node com-
munications.
During data partitioning, partition key selection, which selects attributes of tables as
the basis of partitioning, is the most important decision affecting the scalability of the
database [1]. Data in different partitions are serviced by different processing nodes,
and concurrent read and write operations over multiple partitions can become a bottleneck
for applications. Poor partition key selection can thus greatly affect query performance [2].
For example, an application which has partitioned data
by the Location attribute but accesses data with queries containing many joins on the
Date column would undoubtedly have an expensive communication cost.
Previously, there have been many efforts on the problem of data partitioning [1, 2,
5-8]. However, few of them address the problem of partition key selection. Instead,
most of them utilize some simple strategy, while some others rely on a commercial
database system for this selection.
In this work, we discuss how to select partition keys automatically for tables in applications and propose the ASAWA (Automatic Selection based on Attribute Weight Analysis) strategy. Its general idea is based on the observation that if an attribute is accessed in many queries, it probably acts as a key in join operations. Thus it is better to partition datasets according to the attributes in the search conditions defined in query statements. The basic process of ASAWA first computes an aggregated score for each attribute by identifying the data dependencies among tables and finding the associated attributes, and then selects the partition keys with top scores, evaluated by grouping features in the workload schema. Combining both the predefined data and workload information, it is expected to reduce communication cost among data nodes and improve system performance.
The contributions of this paper are summarized as follows:
1. We propose an automatic partition key selection strategy.
2. We implement the strategy within a novel algorithm for choosing partition
keys based on data and workload information.
3. We conduct a series of experiments on TPC-H [3] to show both the efficiency
and effectiveness of our proposed approaches.
The remainder of this paper is organized as follows. Section 2 presents related work on data partitioning with a particular focus on partition key selection. Next, Section 3 describes the automatic partition key selection method ASAWA. Section 4 compares the performance of three kinds of partition key selection strategies to show the efficiency of ASAWA. Finally, Section 5 briefly summarizes the contributions of this work and discusses future work.
2 Related Works
Data partitioning [4] is one of the best-known techniques for achieving scalability in large-scale applications. Traditionally, there are three kinds of partitioning strategies: horizontal partitioning [5], vertical partitioning [6] and hybrid approaches [7], all of which need to select a partition key for each relation as the basis of the partitioning algorithm. In recent years, some new research has emerged that does not need to select partition keys. Schism [8] uses Metis [9] to divide the mapping graph of relations in the data into fine-grained partitioning results. In [2], a skew-aware partitioning method based on large-scale neighborhood search is proposed; it initially selects the most frequently accessed column in the workload as the partition key for each table and changes it when a better one is found. In practice, existing partition key selection strategies can be categorized as follows:
Primary Key Selection: This is the most widely used partition key selection
strategy as default in commercial database systems. One of its variations is to
use foreign key or the first column in a table as its partition key.
Random Selection: This occurs when partition keys are specified without sufficient database management experience or by systems without any constraints.
Exhaustive Selection: This tries all possible partition key groups to find the one with the best performance. It needs a sample data and workload set to simulate real-world conditions, which makes it hard to predict the cost correctly.
Analytical Selection: This is based on data and workload schema analysis,
which is usually applied manually, or recommended by optimizing advisors in-
tegrated within expensive commercial database systems.
As data volumes increase rapidly and workload requests vary widely, this work concentrates on automatic partition key selection from a top-down view that suits the requirements of both data and workload. We propose the ASAWA method for automatic partition key selection, which not only takes the data and workload schemas into consideration, but also combines them with design and execution information.
To illustrate the problem of partition key selection, consider the following motivating scenario: given a data center with many query workloads, we need to build a data management configuration that consists of many storage nodes and to partition the data onto suitable nodes so that the workloads can be executed in parallel. As the first step in data partitioning, a critical problem is how to select partition keys for the relations in the data center so as to minimize communication costs as much as possible.
For a well-designed application, the basic information of the system, such as the data schema, the workload pattern and their utility features, is given before data partitioning. In this work, we focus on applications with a well-defined data schema and a static workload pattern, for which the access frequencies of the queries can be obtained.
The partition key selection problem can be generally formulated as follows:
Definition (Partition Key Selection). Given a database D = {R1, R2, ..., Rd}, a query workload W = {Q1, Q2, ..., Qw}, and a storage bound S = {B1, B2, ..., Bs}, select an attribute PKSx (1 ≤ x ≤ d) for each table as its partition key.
The total execution time Ttotal consists of two parts: the data loading time Tload and the whole workload execution time Texec. In this work, we assume all data nodes have the same physical features and use a widely used hash function h1: S → [0, 1] to distribute datasets onto the data nodes uniformly. The best partition key selection solution should partition data to nodes and complete the queries within the least response time.
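A minimal sketch of hash-based placement in the spirit of h1 is given below; the MD5-based hash, the key extraction and the node count are illustrative assumptions of this sketch.

```python
import hashlib

def h1(value):
    """Map a partition-key value to a pseudo-uniform number in [0, 1)."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) / 16 ** len(digest)

def assign_node(row, partition_key, num_nodes):
    """Place a tuple on one of num_nodes data nodes by hashing its
    partition-key attribute, so tuples sharing a key value co-locate."""
    return int(h1(row[partition_key]) * num_nodes)

# Example: tuples with the same 'Location' value land on the same node.
row = {"OrderID": 42, "Location": "Sydney", "Date": "2013-04-04"}
print(assign_node(row, "Location", num_nodes=3))
```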
Database Analysis
Data Schema
As mentioned in Section 3.1, we assume that the data schema is well defined to guarantee data integrity for the database and that all of the basic information (including attribute names, data types, constraints in the table, etc.) can be obtained.
To guarantee the efficiency of ASAWA, we select the attributes with data integrity declarations as partition key candidates, including Primary Key, Foreign Key and columns declared Unique or Not Null. The weight of effect in the dimension of data schema for an attribute is then assigned as the values wDS(PK), wDS(FK), wDS(U) and wDS(NN), respectively, for each of the declarations mentioned above. The weight values can be the same or different, depending on the specific application. The existences of the four kinds of statements are represented as eDS(PK), eDS(FK), eDS(U) and eDS(NN), respectively, whose values are 1 or 0 depending on whether the corresponding integrity statement is declared on the attribute.
For a given attribute GA in the data schema, its weight in the dimension of data schema is given as wDS(GA) in Formula (1), which sums, over the four kinds of integrity declarations, the product of the existence indicator and the corresponding weight value:

wDS(GA) = wDS(PK) · eDS(GA, PK) + wDS(FK) · eDS(GA, FK) + wDS(U) · eDS(GA, U) + wDS(NN) · eDS(GA, NN)    (1)
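A small sketch of Formula (1) in code, using the weight values reported in the experiments (wDS(PK) = wDS(FK) = 1, wDS(U) = wDS(NN) = 0.5) as assumed defaults; the dictionary-based encoding of declarations is an assumption of this sketch.

```python
# Weights per integrity declaration (assumed defaults, matching Section 4).
W_DS = {"PK": 1.0, "FK": 1.0, "U": 0.5, "NN": 0.5}

def w_ds(declarations, weights=W_DS):
    """Formula (1): data-schema weight of an attribute GA, where
    declarations is e.g. {"PK": 1, "FK": 0, "U": 0, "NN": 1}."""
    return sum(weights[d] * declarations.get(d, 0) for d in weights)

# An attribute declared PRIMARY KEY and NOT NULL:
print(w_ds({"PK": 1, "NN": 1}))   # 1.5
```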
Scale Factor.
A scale factor is the number used as a multiplier in scaling. This is mainly used in
data generation. In ASAWA, the weight of effect for a given attribute in the dimension of scale factor is given in Formula (2):
Workload Analysis
Query Pattern
A query pattern is a piece of code that makes a reasonably normal query expression compile. As mentioned in Section 3.1, we assume that each query is executed using the same structure with variable parameters. This ensures that the attributes touched in each query are the same in every execution. As mentioned in [10], the most relevant operations that reveal the attributes touched in SQL statements include equi-joins, Group By operations, duplicate elimination, and selections. Their weights in the dimension of query pattern are denoted wQP(Join), wQP(GB), wQP(DR), wQP(SelectConstant) and wQP(SelectHostVariable), respectively, while their existences are denoted eQP(Join), eQP(GB), eQP(DR), eQP(SelectConstant) and eQP(SelectHostVariable), respectively, which are the counts of their appearances in the query.
For a given attribute GA in a given query GQ, its weight in the dimension of query pattern is given as wQP(GA, GQ) in Formula (4):

wQP(GA, GQ) = wQP(Join) · eQP(Join) + wQP(GB) · eQP(GB) + wQP(DR) · eQP(DR) + wQP(SelectConstant) · eQP(SelectConstant) + wQP(SelectHostVariable) · eQP(SelectHostVariable)    (4)
Access Frequency
Access frequency represents how many times an application runs in a given period. Its measurement unit is decided by the designer, e.g., an hour or a day. In ASAWA, we assume the access frequency is steady for the queries in the workload and define the weight of effect in the dimension of access frequency as in Formula (5):

wAF(Qy) = 1 + log(AFy / AFMIN)    (5)
According to the above analysis, the general weight w(a) of a given attribute can be calculated as presented in Formula (7):
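As an illustration of how the per-dimension weights might be aggregated, the sketch below implements Formulas (4) and (5) together with one plausible form of the overall score w(a): the data-schema weight plus the query-pattern weights scaled by each query's access-frequency weight. The aggregation form is an assumption of this sketch, not the paper's Formula (7).

```python
import math

def w_af(af, af_min):
    """Formula (5): access-frequency weight of a query."""
    return 1 + math.log(af / af_min)

def w_qp(counts, weights):
    """Formula (4): query-pattern weight of an attribute in one query;
    counts/weights cover Join, GB, DR, SelectConstant, SelectHostVariable."""
    return sum(weights[op] * counts.get(op, 0) for op in weights)

def overall_weight(wds, per_query_counts, qp_weights, freqs):
    """Assumed aggregation for w(a): the data-schema weight plus the
    query-pattern weights, each scaled by the access-frequency weight
    of the query in which the attribute appears."""
    af_min = min(freqs.values())
    return wds + sum(w_af(freqs[q], af_min) * w_qp(c, qp_weights)
                     for q, c in per_query_counts.items())
```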
ASAWA Algorithm
Although there are many manual tips for partition key selection, how to combine these tips for automatic partition key selection remains a problem. Here we propose the ASAWA (Automatic Selection based on Attribute Weight Analysis) method to solve this issue.
The whole ASAWA algorithm is listed in Fig. 1. Combining both the predefined data and workload information, it only takes attributes with data integrity constraints into consideration to reduce the complexity in practice, and it is expected to reduce communication cost among data nodes and benefit system performance.
4 Experimental Study
In this section, we show the results of our experiments using different partition key selection strategies and compare the results with analysis.
Input:
1. Data schema of the designed database
2. Scale Factors of tables defined in the data schema
3. Query Pattern of the designed workload
4. Access Frequency of each query in the workload
Output:
Solution PKSD = {PKS1, PKS2, ..., PKSx} (1 ≤ x ≤ d), in which
PKSj (1 ≤ j ≤ x) is an attribute of a table in D.
Procedure ASAWA:
1. Eliminate attributes without impact and obtain par-
tition key candidates with its weight w(a) for each.
a. Extract the attributes with data integrity con-
strains from the data schema.
b. Assign the weight of each attribute in dimen-
sion of data schema, scale factor, Query Pattern
and access frequency as w(a) using Formula (8).
2. Exhaustively consider all combinations of
attributes that are not eliminated:
a. Call optimizer with candidate set and grouping
to estimate execution times of queries.
b. Estimate aggregate Response time of all que-
ries, and update {best partition key, relation
grouping} to the current one to choose the best.
3. Output PKSD = {PKS1, PKS2, ..., PKSx} (1 ≤ x ≤ d) over D
as the partition keys for the given data schema.
Fig. 1. The ASAWA Algorithm
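A compact sketch of the procedure in Fig. 1 is given below; the estimate_response_time callback stands in for the optimizer call, and restricting the exhaustive search to the top-k candidates per table is a simplification of step 2. Both are assumptions of the sketch rather than the paper's implementation.

```python
from itertools import product

def asawa(candidate_weights, estimate_response_time, top_k=3):
    """candidate_weights: {table: {attribute: w(a)}} after step 1.
    Step 2 tries combinations of the surviving candidates (here, the
    top_k per table) and keeps the cheapest solution."""
    tables = list(candidate_weights)
    shortlists = [sorted(candidate_weights[t], key=candidate_weights[t].get,
                         reverse=True)[:top_k] for t in tables]
    best_cost, best_keys = float("inf"), None
    for combo in product(*shortlists):
        keys = dict(zip(tables, combo))
        cost = estimate_response_time(keys)   # optimizer-estimated aggregate time
        if cost < best_cost:
            best_cost, best_keys = cost, keys
    return best_keys
```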
In the experiments, we set up 3 data nodes, each of which is equipped with a 2.66 GHz 8-core CPU, more than 300 GB of disk, and CentOS 6.1 x86-64 with OpenSSH 5.2. As in the scenario mentioned in Section 3, we select Greenplum Database (GPDB for short) 4.2.0.0 [16] as the platform. The dataset and queries used are generated from TPC-H 2.14.2, which has a well-defined data schema and query patterns to illustrate a system that examines large volumes of data, executes queries with a high degree of complexity, and gives answers to critical business questions. The node with 32 GB of main memory is chosen as the master node, while the other two nodes have 16 GB of main memory and are set as slave nodes. Each of them has 8 segments.
In the first step, partition keys need to be selected based on the different strategies. In this work, we implement three partition key selection strategies for comparison, based on Primary Key, Random Assignment and ASAWA respectively. Detailed descriptions of the first two strategies can be found in Section 2. For ASAWA, we set the weights in the dimension of query pattern as wQP(Join) = 1.0, wQP(GB) = 0.1, wQP(DR) = 0.08, wQP(SelectConstant) = -0.05 and wQP(SelectHostVariable) = 0.05, as in [10]. In order to reflect the underlying structure information in the schema, we assign wDS(PK) = wDS(FK) = 1 and wDS(U) = wDS(NN) = 0.5 for wDS(GA) in this work. We then run Procedure ASAWA shown in Fig. 1 to select partition keys for the ASAWA case with the specific scale factor in each round.
In the TPC-H schema, there are two tables, named Nation and Region, that are not affected by the scale factor. Since the sizes of both are fixed and quite limited, instead of being partitioned, both are copied to each data node in every strategy case for performance reasons.
In order to get steady results and a more reliable analysis, each strategy with its dataset and workload at a specific scale should be executed for multiple rounds in the same physical environment. Due to the limitations of the current hardware environment, the scale factor of the dataset is set to 1, 10 and 100 respectively. We run the query set 5 times on the newly partitioned dataset in every round.
Overall Results
As a decision-support-oriented benchmark, it is hard to predict the access frequency of queries in TPC-H. We first run all of the 22 queries on the datasets with scale factor 1, 10 and 100 respectively. The results are shown in Fig. 3.
Fig. 3. Execution time of the 22 queries: (a) scale factor = 1, (b) scale factor = 10, (c) scale factor = 100
From the series of figures in Fig. 3, we can see that in each scale factor round, Case 2 (Random strategy) always runs for the longest time and shows the worst performance, Case 1 (Primary Key strategy) is faster than Case 2 and shows fine performance, while Case 3 (ASAWA) is better still and always shows the best performance at all three scales of datasets. Besides the results shown in Fig. 3, we have also recorded the data loading time in each round of every case. In these records, Case 2 shows obviously longer data loading time than the others.
The result verifies again that query performance is affected by the choice of partition key. We also found that, compared with Case 1, the advantage of Case 3 in Fig. 3(c) is not as distinct as in Fig. 3(a) and 3(b). This is because GPDB implements a mechanism named Motion to do live migration and eliminate partition differences at run time, which means that the longer the running time, the smaller the difference. However, this adds extra workload to the data nodes and is not commonly implemented in every product.
Again, Case 2 runs for the longest time and shows the worst performance, Case 1 shows better performance than Case 2, especially at large scales, while Case 3 shows the best. This indicates that the partition strategy should take the actual workloads of applications into consideration to obtain much better performance. Because it identifies tuples specifically but does not consider business semantic information, the Primary Key strategy always obtains reasonable but limited performance. With improper partition key selection, the Random strategy leads to poor system performance. Having taken data and workload features into consideration, ASAWA shows the best performance of these three strategies.
In addition, we intentionally selected Q1, Q5 and Q13 from the 22 queries of TPC-H, in which Q1 represents queries on a single large table, Q5 queries on multiple tables and Q13 queries only on small tables, and ran each of them continuously for 7 rounds on the dataset with the scale factor set to 100. The results shown in Fig. 5 indicate that, for queries whose data is distributed from the start and cannot be cached in memory, such as Q1 and Q5, the execution time does not change much, while for memory-cacheable queries such as Q13, the execution time is highest in the first run, lower in the second, and lowest and stable in the following 5 runs. A slight downward trend is also visible for Q1 and Q5. Due to the automatic Motion operation (like live migration) in GPDB, Rounds 3-7 show steady results. Obviously, if similar queries could be scheduled closer together, which is what our strategy aims to achieve, execution performance would improve considerably.
5 Conclusion
As more and more applications need to manage data in huge volumes and respond to large amounts of requests concurrently, the whole dataset of an application needs to be partitioned into multiple parts and put onto different machines. During this process, the selection of the partition key greatly affects the partitioning results. Neverthe-
less, there are few works addressing this topic. In this work, we present an automatic
partition key selection strategy, called ASAWA. It would first compute an aggregated
score for each attribute based on identifying the dependency among tables and finding
out the corresponding associated attributes, and then select partition keys with top
scores, optimized by grouping features in the workload schema. In this way, close tuples, i.e., tuples that frequently co-appear in queries, would probably be put into the same partition.
Hence the inter-node joins could be greatly reduced and the system performance
could be improved. We then conduct a series of experiments over TPC-H queries with
datasets generated in different scale factors. The experiment results illustrate the ef-
fectiveness of ASAWA method.
For most applications with huge data volumes and many concurrent requests, the partitioning strategy should be well planned before data storage and would be difficult to change once adopted. If it could be adjusted automatically according to the actual applications, this would undoubtedly be more meaningful and useful for the wide use of database system products and would relieve the burden of database administrators and developers. Partition key selection is the first step in data partitioning, and we only
use empirical weight values assigned in ASAWA in the experiments. In future work, we would like to adjust the values using learning methods to obtain a more practical weight set. Of course, there must be some differences among application features and visions. We would like to explore an automatic partitioning method combining ASAWA with proper partitioning algorithms in further research.
References
1. Zilio, D.C.: Physical Database Design Decision Algorithms and Concurrent Reorganiza-
tion for Parallel Database Systems. PhD Thesis, Department of Computer Science, Univer-
sity of Toronto (1998)
2. Pavlo, A., Curino, C., Zdonik, S.: Skew-aware Automatic Database Partitioning in Shared-Nothing Parallel OLTP Systems. In: Proc. of the ACM SIGMOD, pp. 61–72 (2012)
3. TPC Benchmark H, http://www.tpc.org/tpch/
4. Stonebraker, M., Cattell, R.: 10 Rules for Scalable Performance in Simple Operation Datastores. Communications of the ACM 54, 72–80 (2011)
5. Ceri, S., Negri, M., Pelagatti, G.: Horizontal Data Partitioning in Database Design. In: Proc. of the ACM SIGMOD, pp. 128–136 (1982)
6. Navathe, S., Ceri, G., Wiederhold, G., Dou, J.: Vertical Partitioning Algorithms for Database Systems. ACM Transactions on Database Systems 9(4), 680–710 (1984)
7. Agrawal, S., Narasayya, V., Yang, B.: Integrating Vertical and Horizontal Partitioning into Automated Physical Database Design. In: Proc. of the ACM SIGMOD, pp. 359–370 (2004)
8. Curino, C., Jones, E., Zhang, Y., Madden, S.: Schism: a Workload-Driven Approach to Database Replication and Partitioning. Proc. of the VLDB Endowment 3, 48–57 (2010)
9. Metis, http://glaros.dtc.umn.edu/gkhome/views/metis/index.html
10. Zilio, D.C., Jhingran, A., Padmanabhan, S.: Partition Key Selection for a Shared-nothing
Parallel Database System. Technical Report RC 19820(87739) 11/10/94, IBM T. J. Wat-
son Research Center (1994)
11. Eadon, G., Chong, E.I., Shankar, S., Raghavan, A., Srinivasan, J., Das, S.: Supporting Table Partitioning by Reference in Oracle. In: Proc. of the ACM SIGMOD, pp. 1111–1122 (2008)
12. Zilio, D.C., Rao, J., Lightstone, S., Lohman, G., et al.: DB2 Design Advisor: Integrated Automatic Physical Database Design. In: Proceedings of the VLDB, pp. 1087–1097 (2004)
13. Nehme, R., Bruno, N.: Automated Partitioning Design in Parallel Database Systems. In: Proc. of the ACM SIGMOD, pp. 1137–1148 (2011)
14. Özsu, M.T., Valduriez, P.: Principles of Distributed Database Systems, 3rd edn. Springer, New York (2011)
New York (2011)
15. Rahimi, S., Haug, F.S.: Distributed Database Management Systems: A Practical Approach.
IEEE Computer Society, Hoboken (2010)
16. Greenplum Database,
http://www.greenplum.com/products/greenplum-database
An Active Service Reselection Triggering
Mechanism
Ying Yin, Tiancheng Zhang, Bin Zhang, Gang Sheng, Yuhai Zhao, and Ming Li
1 Introduction
factors in advance and then avoids exceptional service interruptions. That is, prediction has gradually become a valid method to trigger service reselection [9]. Prediction is realized by algorithms that monitor and evaluate the quality of composite services. There are several kinds of prediction methods, such as ML-based methods [2], QoS-aware methods [3,4], collaborative filtering-based methods [5,9] and so on.
Despite previous work, three major challenges for prediction methods remain, as described below. First of all, most traditional prediction models, such as Support Vector Machines (SVM) and Artificial Neural Networks (ANN), can only take input data as vectors of features [12]; however, there are no explicit features in sequential data. Secondly, prediction models based on QoS monitoring, such as Naive Bayes and Markov models [10], assume that sequences in a class are generated by an underlying model M whose probability distributions are described by a set of parameters. These parameters, however, are obtained by predicting the QoS during the whole lifecycle of the services and therefore lead to high overhead costs. A third approach is prediction based on sequence distance, such as collaborative filtering [9]; this type of method requires the definition of a distance function to measure the similarity between a pair of sequences. However, selecting an optimal similarity function is far from trivial, as it introduces numerous parameters, and distance measures can be subjective.
In order to maintain the reliability of composite services, it is important to actively identify exception patterns and to trigger the service reselection algorithm quickly. In this paper, we propose an efficient service reselection mechanism based on mining previous patterns in advance. The system triggers service reselection once the currently executing sequence matches any mined historical sequence indicating imminent failure.
The remainder of this paper is organized as follows: Section 2 discusses the system architecture. In Section 3, we present the process of extracting execution instances from execution logs, while in Section 4 the basic definitions are given, an efficient method for identifying early patterns based on an interest measure is proposed, and implementation methods are presented. We summarize our research and discuss some future work directions in Section 5.
2 System Architecture
Below we describe the system architecture of our service reselection triggering mechanism. After accepting a user's query specification, our BPEL engine provides a composite service execution sequence. Once a latent exception is identified during execution, our system triggers the reselection process.
There are four phases in the trigger process: (1) recording the composite
service execution log information to generate the execution service sequence
repository; (2) cleaning collected data and converting service log into service
execution sequences; (3) mining early prediction patterns from converted execu-
tion sequences and deducing predictive rules; (4) triggering the service reselection
An Active Service Reselection Triggering Mechanism 623
process as soon as an executing service sequence matches one of the early predic-
tion patterns. Figure 1 shows the whole reselection triggering process; each of the steps is briefly discussed below.
Fig. 1. The service reselection triggering process: the execution engine and monitor record Web service execution logs, sequences are prepared from the logs, early rules are mined and pruned using a measure function based on the earliness and conciseness factors, and matching the executing sequence against the early rules triggers reselection
Recording of the composite service execution log. During this phase, the system needs to collect large amounts of failed composite service execution information, aiming to mine useful knowledge and to analyze the causes of failure so as to trigger service reselection as quickly as possible. The generated composite service execution log repository forms the basis for early prediction.
Extraction of the service execution sequence. The original service log includes a large variety of information, such as Web Service Description Language (WSDL) information, keywords, IP addresses, QoS and so on. The data needs to be cleaned and the QoS information needs to be transformed into sequences so that early patterns can be mined effectively. For functionally similar Web services, and even for each Web service itself, perceived QoS values may differ at different time points or in different network environments, thus indicating a difference in quality. That means that the quality of a Web service might differ in different situations. Therefore, extracting descriptive information from the service log and converting it into forms that can readily be mined is the key part of our method. The details of this step are given in Section 3.
Mining of early prediction patterns. Various behavioral patterns in different environments can lead to composite service execution failure. However, both the time when the fault occurs and the length of the behavioral pattern influence when the service reselection will be triggered. Under various environments, mining triggering patterns with minimal cost and triggering the reselection as early as possible are the important issues in our work. The contents of this part are described in Section 4.
3 Preliminary
Once a service-oriented application or a composite service is deployed in a run-time execution environment, the application can be executed in many execution instances. Each execution instance is uniquely identified by an identifier (i.e., an id). In each execution instance, a set of service sequences can be triggered. Various uncertain factors may make an execution instance fail; the reason is that a large amount of potential exception status information exists under different uncertain factors. We record the triggered events in a log which contains different types of information. We are interested in failed QoS information: extracting the execution status information from the execution log to predict service reliability is helpful for service quality management.
Table 1. Service execution status information
Status  Composite QoS     Description
S0      < av0, exe0 >     Server unavailable, runtime delay
S1      < av0.5, exe0 >   Server available intermittently, runtime delay
S2      < av1, exe0 >     Server available, runtime delay
-       < av0, exe1 >     Server unavailable, normal execution (not considered)
S3      < av0.5, exe1 >   Server available intermittently, normal execution
S4      < av1, exe1 >     Server available, normal execution
In order to depict the various exception status information clearly, for each component service we only consider two QoS attributes as an example: the availability attribute (AV) and the execution time attribute (EXE). There are three possible states for AV, namely inaccessible, intermittently accessible and accessible, which are denoted by av0, av0.5 and av1 respectively. EXE, on the other hand, can only signify two states, namely exe0 and exe1, representing delayed execution and normal execution respectively. Under these conditions, we obtain the five possible groups of service execution status information shown in Table 1: < av0, exe0 >, denoted by S0, represents a server that is unavailable with runtime delay; < av0.5, exe0 >, denoted by S1, a server that is available intermittently with runtime delay; < av1, exe0 >, denoted by S2, a server that is available but with runtime delay; < av0.5, exe1 >, denoted by S3, a server that is available intermittently with normal execution; and < av1, exe1 >, denoted by S4, a server that is available with normal execution. The remaining status
combination, < av0, exe1 >, would represent a server that is unavailable yet executing normally. We do not consider this status, as it cannot occur in practice.
Given a pattern p = s1 s2 ... sm and a sequence s = a1 a2 ... an, we say that p appears in s if there exists an index i0 with 1 ≤ i0 ≤ n - m + 1 such that ai0 = s1, ai0+1 = s2, ..., ai0+m-1 = sm.
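A minimal sketch of this occurrence test is shown below; it returns the first (earliest) matching position, which is the position relevant to the early support discussed in Section 4. The list-based encoding of sequences is an assumption of the sketch.

```python
def first_occurrence(pattern, sequence):
    """Return the earliest 1-based index i0 at which the pattern appears
    consecutively in the sequence, or None if it does not appear."""
    m, n = len(pattern), len(sequence)
    for i0 in range(n - m + 1):
        if sequence[i0:i0 + m] == pattern:
            return i0 + 1
    return None

# Example over the status alphabet of Table 1:
print(first_occurrence(["S1", "S0"], ["S4", "S1", "S0", "S1", "S0"]))  # 2
```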
Problem Statement. For a given failed service log, a pattern p should satisfy: (1) pattern p should frequently be present in the failed type; (2) pattern p should be as short as possible and should have a low prediction cost. All patterns satisfying the above conditions are called Early Patterns, and the optimal rules deduced from them are called Early and Concise (EC) Rule Sets.
Previous pattern mining methods require the definition of various thresholds as quality measures, such as support and confidence, so that only relatively frequent rules are selected for classifier construction. However, these mining methods are ineffective in practice and the sets of mined rules are still very large, which makes it very hard for a user to find appropriate rules. In our case, we consider the earliness and conciseness properties to design a utility measure and only preserve the early, concise patterns from the training database that pass the measure.
The mining process is conducted on a prefix tree. Limited by space, we omit the description of how this prefix tree structure is constructed. Complete pseudo-code for mining optimal EP sets is presented in Algorithm 1.
Finding interesting contrast sequence rules by checking all subsequences along the paths is still a tough task. Various pruning rules are introduced below to prevent unnecessary rule generation and to preserve only interesting rules. Reducing the search space is extremely important for the efficiency of EP mining. In Algorithm 1, we use two efficient pruning rules to improve efficiency.
Note that pruning rule 1 refers to the early support of a pattern, not the general support, because in an exception execution log a pattern p may appear several times; we only record its first position so as to trigger the service reselection procedure as soon as possible. This is a difference from traditional association rules.
Limited by space, we omit the proofs. The above pruning rules are very efficient, since they only generate a subset of frequent patterns with maximal interestingness instead of all of them.
5 Conclusion
In this paper, we discuss the mining of early prediction patterns, which are important for predicting service reliability, and propose two interest measures to decide whether the rules are concise and as short as possible, for quickly identifying exception patterns. Based on the proposed interest measures, we propose a new algorithm with efficient pruning rules to mine all optimal EP rules.
References
1. Yu, Q., Liu, X.M., Bouguettaya, A., Medjahed, B.: Deploying and managing Web services: issues, solutions, and directions. The VLDB Journal 17, 537–572 (2008)
2. Wang, F., Liu, L., Dou, C.: Stock Market Volatility Prediction: A Service-Oriented Multi-kernel Learning Approach. In: IEEE SCC, pp. 49–56 (2012)
3. Goldman, A., Ngoko, Y.: On Graph Reduction for QoS Prediction of Very Large Web Service Compositions. In: IEEE SCC, pp. 258–265 (2012)
4. Tao, Q., Chang, H., Gu, C., Yi, Y.: A novel prediction approach for trustworthy QoS of web services. Expert Syst. Appl. (ESWA) 39(3), 3676–3681 (2012)
5. Lo, W., Yin, J., Deng, S., Li, Y., Wu, Z.: Collaborative Web Service QoS Prediction with Location-Based Regularization. In: ICWS 2012, pp. 464–471 (2012)
6. Yu, T., Zhang, Y., Lin, K.J.: Efficient Algorithms for Web Services Selection with End-to-End QoS Constraints. ACM Trans. Web 1(1), 1–25 (2007)
7. Leitner, P., Michlmayr, A., Rosenberg, F., Dustdar, S.: Monitoring, Prediction and Prevention of SLA Violations in Composite Services. In: ICWS, pp. 369–376 (2010)
8. Zheng, Z., Ma, H., Lyu, M.R., King, I.: QoS-Aware Web Service Recommendation by Collaborative Filtering. IEEE Transactions on Services Computing (TSC) 4(2), 140–152 (2011)
9. Chen, L., Feng, Y., Wu, J., Zheng, Z.: An Enhanced QoS Prediction Approach for Service Selection. In: IEEE International Conference on Web Services, ICWS (2011)
10. Park, J.S., Yu, H.-C., Chung, K.-S., Lee, E.: Markov Chain Based Monitoring Service for Fault Tolerance in Mobile Cloud Computing. In: AINA Workshops, pp. 520–525 (2011)
11. Zhang, L., Zhang, J., Hong, C.: Services Computing. Tsinghua University Press, Beijing (2007)
12. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. The Mor-
gan Kaufmann Series in Data Management Systems. Morgan Kaufmann Publishers
(March 2006)
Linked Data Informativeness
Keywords: Semantic Web, Linked Data, the Web of Data, DBpedia, Semantic
Network, Partitioned Information Content, Information Filtering.
1 Introduction
The Semantic Web is a major evolution to the current World Wide Web (WWW) that
provides machine-readable and meaningful representation of Web content. Formal
knowledge representation languages such as RDF (Resource Description Framework)
[1] and OWL (Ontology Web Language) [2], and query languages such as SPARQL
[3] are key enablers of semantic applications. These technologies enable the Web of
Data, a giant graph of interconnected information resources, known as Linked Data.
Based on the principles of the Linking Open Data (LOD) project, an extensive
amount of structured data is now published and available for public use [4]. This in-
cludes interlinked datasets in a wide variety of domains that range from encyclopedic
knowledge bases such as DBpedia1, which is the Semantic Web-based representation of
Wikipedia2 [5], to domain-specific datasets such as GeoNames3 and Bio2RDF4.
1 http://dbpedia.org
2 http://wikipedia.org
3 http://www.geonames.org
4 http://bio2rdf.org
According to the Linking Open Data cloud5, more than 295 datasets in various domains, including Geography, Publications, Life Sciences, and so on, are now interconnected. This valuable source of semantic knowledge can be exploited in practical and innovative applications. It has also introduced several research opportunities.
Because most Linked Datasets have been created by automatic or semi-automatic extraction of structured data from Web pages, assessing the quality of the Web of Data has become a critical challenge for semantic data consumption. Moreover, not all available Linked Data is valuable for semantic software agents or human understanding. For example, it is likely that, among all relations available in the Web of Data relating to a particular movie, its director and actors are more interesting to most users than its language or runtime. Likewise, for a software agent that compares two movies, having the same director conveys more information than sharing the same language. Understanding the informativeness of resources and the structured data associated with them facilitates browsing as well as automatic consumption and interpretation of Linked Data. This paper introduces two novel metrics to evaluate the amount of information that a resource and its relations carry. After proposing the metrics, we present a set of experiments conducted in various domains and applications in order to demonstrate their characteristics.
5 http://lod-cloud.net
(3)
Thus, we have
(4)
(5)
(6)
where the probability of a feature is its frequency divided by the frequency of the most common feature in a given Linked Dataset.
In this definition, the amount of information conveyed by a given resource is calcu-
lated according to the probability of its features. Popular features carry less informa-
tion while specific ones are more informative. This metric also allows us to study the
Web of Data in terms of the value of information contained in each resource and do-
main. It highlights the most informative and influential information sources among a
large number of available resources in a particular domain.
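As a concrete illustration of the Partitioned Information Content idea described above, the sketch below assumes one specific form: each feature contributes -log2 of its frequency relative to the most common feature, and a resource's PIC sums these contributions. This form and the toy feature names and frequencies are assumptions of the sketch, not the paper's exact equations.

```python
import math

def pic(resource_features, feature_freq):
    """Assumed PIC: sum of -log2(freq(f) / freq(most common feature))
    over the features of a resource; rarer features contribute more."""
    nu = max(feature_freq.values())        # frequency of the most common feature
    return sum(-math.log2(feature_freq[f] / nu) for f in resource_features)

# Toy frequencies (hypothetical values for illustration only):
freqs = {("rdf:type", "owl:Thing"): 2350906,
         ("dbo:director", "dbr:Ridley_Scott"): 25,
         ("dbo:language", "dbr:English_language"): 60000}
print(pic([("dbo:director", "dbr:Ridley_Scott")], freqs))      # high (specific feature)
print(pic([("dbo:language", "dbr:English_language")], freqs))  # lower (popular feature)
```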
(7)
The domain could be the whole Linked Data graph or a subset of it that is filtered by a
search query. This metric enables us to indicate the most important relations that con-
vey the highest amount of information about resources inside a particular domain.
3 Experiments
At the time of writing this paper (October 2012), the latest version of DBpedia was 3.8, released in August 2012. In our final DBpedia dataset, the parameter in Equation (6) was set to the number of (rdf:type, owl:Thing) relations (2,350,906).
The first experiment was performed using the Partitioned Information Content-based metric in order to compare the information distribution across several domains in DBpedia. It is observable that information in the Music domain is more widely distributed than in the others, with a standard deviation of 590.2 in comparison to 148.2 for Films and 437.6 for Actors (see Fig. 1(a) and Fig. 2). Moreover, there is more distinctive information available for musical artists and actors than for films (Fig. 1(b)). This may be explained by the fact that each artist produces an increasing number of outputs during their career.
Fig. 1. (a) Distribution of IC and (b) cumulative distribution of IC among DBpedia resources
the Films dataset extracted from DBpedia (see Fig. 2). Comparing these figures underlines the difference between the quality of structured information provided by DBpedia, which is a general-purpose Linked Dataset, and that provided by LinkedMDB.
Fig. 2. Average and standard deviation of IC (in bits): Films: average 282.5, standard deviation 148.2; Music: average 485, standard deviation 590.2; Actors: average 369.4, standard deviation 437.6; LinkedMDB: average 157.8, standard deviation 198.9
Generated Information Content. The properties with the highest Generated Information Content hold the most valuable and distinctive information about instances. This is especially useful for Linked Data browsers and search engines, allowing users to easily navigate through a large collection of results.
4 Related Work
Since being introduced by Resnik [11], Information Content-based metrics have been
widely exploited for semantic similarity measurement between concepts in lexical
taxonomies [12-15]. These metrics demonstrated better performance than convention-
al edge-counting and feature-based methods. However, the notion of informativeness
has been rarely used in Linked Data. Cheng et al. [16] embed the mutual information
of resources into a random surfer model and present an approach for centrality-based
summarization of entities in Linked Data.
A number of metrics have been developed to analyze the Web of Data from vari-
ous aspects. Ell et al. [17] propose metrics to evaluate labels in Linked Data in terms
of completeness, unambiguity, and multilinguality. Guéret et al. [18] apply network
analysis techniques to measure and improve the robustness of Linked Data graph.
Another series of research has been carried out for entity ranking in Linked Data.
Although traditional graph-based ranking methods, such as PageRank [19], SimRank
[20], and HITS [21], can also be applied to the link structure of the Web of Data, its
semantics, represented using various relation types, will not be taken into account.
Numerous studies have attempted to extend the existing algorithms to consider link
types in the rating methodology [22-27]. ObjectRank [22] adds link weights to Page-
Rank for ranking in a directed labeled graph. Bamba and Mukherjea [23] extend
PageRank to rank the results of Semantic Web queries. TripleRank [26] is a generalization of the authority (importance) and connectivity (hub) scores of the HITS algorithm for Linked Data, with applications in faceted browsing. These methods rate the
importance of resources according to the link structure of the Web of Data graph
while our approach is based on the amount of useful information that each individual
resource and relation conveys. Nevertheless, further research needs to be done to
assess the performance of different algorithms in terms of ranking functionality.
In this paper, we addressed the problem of Linked Data quality assessment, which has
a significant impact on the performance of Semantic Web applications. By relying on
the principles of information theory, we have developed two metrics, namely Parti-
tioned Information Content (PIC) and Generated Information Content (GIC), for mea-
suring the informativeness of resources and relations in Linked Data. We analyzed the
Web of Data using the metrics in several domains. As it was demonstrated in the ex-
periments, assessing the quality of information contained in Linked Data offers the
potential for its deployment in a variety of applications including information filter-
ing, multi-faceted browsing, search engines, and entity ranking. Moreover, it can also
be exploited for Linked Data clustering and visualization to highlight the most impor-
tant information sources and relations. Currently, we are verifying and evaluating the
proposed metrics in a range of application areas. We are also refining them in order to enable further analysis and comparison of resources in the Web of Data.
References
1. Lassila, O.: Resource Description Framework (RDF) model and syntax specification.
World Wide Web Consortium, W3C (1999)
2. McGuinness, D.L., Harmelen, F.V.O.: Web Ontology Language: Overview. World Wide
Web Consortium, W3C (2004)
3. Prud'hommeaux, E., Seaborne, A.: SPARQL Query Language for RDF. World Wide Web Consortium, W3C (2008)
4. Bizer, C., Heath, T., Berners-Lee, T.: Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems 5(3), 1–22 (2009)
5. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: A Nuc-
leus for a Web of Open Data. Springer, Heidelberg (2007)
6. Gray, R.M.: Entropy and Information Theory. Springer, New York (2009)
7. Edwards, S.: Elements of information theory, 2nd edition. Information Processing & Management 44(1), 400–401 (2008)
8. Shannon, C.E.: A Mathematical Theory of Communication. Bell System Technical Journal 27, 379–423, 623–656 (1948)
9. Ross, S.M.: A First Course in Probability. Prentice Hall (2002)
10. Hassanzadeh, O., Consens, M.P.: Linked Movie Data Base (2009)
11. Resnik, P.: Using information content to evaluate semantic similarity in a taxonomy. In:
Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp.
448–453 (1995)
12. Jiang, J.J., Conrath, D.W.: Semantic Similarity Based on Corpus Statistics and Lexical
Taxonomy. In: Proceedings of the International Conference Research on Computational
Linguistics (ROCLING X), Taiwan (1997)
13. Lin, D.: An information-theoretic definition of similarity. Morgan Kaufmann, San Francis-
co (1998)
14. Resnik, P.: Semantic Similarity in a Taxonomy: An Information-Based Measure and its
Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelli-
gence Research 11, 95–130 (1999)
Linked Data Informativeness 637
15. Zili, Z., Yanna, W., Junzhong, G.: A New Model of Information Content for Semantic Si-
milarity in WordNet (2008)
16. Cheng, G., Tran, T., Qu, Y.: RELIN: Relatedness and Informativeness-Based Centrality
for Entity Summarization. Springer, Heidelberg (2011)
17. Ell, B., Vrandečić, D., Simperl, E.: Labels in the Web of Data. Springer, Heidelberg
(2011)
18. Guéret, C., Groth, P., van Harmelen, F., Schlobach, S.: Finding the Achilles Heel of the
Web of Data: Using Network Analysis for Link-Recommendation. Springer, Heidelberg
(2010)
19. Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. In: Pro-
ceedings of the Seventh International Conference on World Wide Web 7, Brisbane, Aus-
tralia. Elsevier Science Publishers B. V. (1998)
20. Jeh, G., Widom, J.: SimRank: a measure of structural-context similarity. In: Proceedings
of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, Edmonton, Alberta, Canada. ACM (2002)
21. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. J. ACM 46(5), 604–632 (1999)
22. Balmin, A., Hristidis, V., Papakonstantinou, Y.: Objectrank: authority-based keyword
search in databases. In: Proceedings of the Proceedings of the Thirtieth International Con-
ference on Very Large Data Bases - Volume 30, Toronto, Canada. VLDB Endowment
(2004)
23. Bamba, B., Mukherjea, S.: Utilizing Resource Importance for Ranking Semantic Web
Query Results. Springer, Heidelberg (2005)
24. Ding, L., Pan, R., Finin, T., Joshi, A., Peng, Y., Kolari, P.: Finding and Ranking Know-
ledge on the Semantic Web. Springer, Heidelberg (2005)
25. Hogan, A., Harth, A., Decker, S.: ReConRank: A Scalable Ranking Method for Semantic
Web Data with Context. In: Proceedings of Second International Workshop on Scalable
Semantic Web Knowledge Base Systems (SSWS 2006), in conjunction with International
Semantic Web Conference (ISWC 2006) (2006)
26. Franz, T., Schultz, A., Sizov, S., Staab, S.: TripleRank: Ranking Semantic Web Data by
Tensor Decomposition. Springer, Heidelberg (2009)
27. Delbru, R., Toupikov, N., Catasta, M., Tummarello, G., Decker, S.: Hierarchical Link
Analysis for Ranking Web Data. Springer, Heidelberg (2010)
Harnessing the Wisdom of Crowds
for Corpus Annotation through CAPTCHA*
1 Introduction
The term CAPTCHA (Completely Automated Public Turing test to tell Computers
and Humans Apart) was created by Luis von Ahn in 2000 [1], to name a simple Tur-
ing test which requires users to type letters from distorted images. Since its creation,
CAPTCHA has become a widely applied security measure on the Web, to prevent
large-scale abuses of online services. According to Ahn's estimates [2], humans around the world type more than 100 million CAPTCHAs every day. Suppose it takes 10 seconds to enter a CAPTCHA; at 100 million uses a day, this amounts to roughly 280,000 hours of human labor every day. It would be desirable to turn this effort into useful services. In recent years, there have been several proposals for better utilizing the labor resources spent on CAPTCHAs. A typical example is reCAPTCHA, a special type of CAPTCHA that can be used to digitize printed material.
In this work, we explore the possibility of utilizing the labor resources spent on CAPTCHAs to annotate textual corpora. We propose a new type of CAPTCHA, named AnnoCAPTCHA, which not only provides the functionality of a CAPTCHA but also works as a platform for corpus annotation. In each Turing test issued by AnnoCAPTCHA, a user is presented with a set of short textual items and is asked to select the right annotation for each item. Among the textual items, some items' annotations are already known, so they can be used to verify whether the user is a human being. The other items' annotations are unknown, such that the users' selections become new annotations.
*
This work is supported by the Basic Research fund in Renmin University of China from the
central government-No.12XNLJ01.
2 Related Work
1 For more information, go to http://www.solvemedia.com/
2 For more information, go to http://captcha.tw/
3 For more information, go to http://openmind.media.mit.edu/
The Model.
We assume that each CAPTCHA contains k+2 items, in which 2 items' annotations are known and the other k items' annotations are unknown.
A user of AnnoCAPTCHA is either a human or a machine. Let s% be the accuracy of a human's judgment and t% be the accuracy of a machine's judgment. Obviously, s% is much greater than t%.
As mentioned previously, once a machine is detected by AnnoCAPTCHA, its connection will be cut off. This effectively restricts the number of annotations made by machines. Let the proportion of CAPTCHA tests finished by humans be m% and that by machines/attackers be 1-m%. Thus, m% should be far greater than 1-m%.
After a user passes the verification based on the 2 known items, the system records the user's votes on the k unknown items in a database. The vote record of each unknown item is of the form i : j, where i represents the number of users voting for positive polarity and j represents the number of users voting for negative polarity.
When the difference between i and j reaches a certain statistical significance, that
is, when i (or j) is significantly larger than j (or i), we turn the unknown item into a
known item. We use error probability to measure the significance level, which is
calculated as:
(1)
Analysis of Efficiency
When an item receives i positive votes and j negative votes, if the error probability Pij falls below the threshold, the annotation process for this item terminates. Thus, i+j measures the cost of the annotation, i.e. the number of votes required to determine the sentiment polarity of a context-adjective phrase. The higher this number, the longer it takes to generate an annotation. We are interested in the expected annotation cost, which we use to characterize the efficiency of AnnoCAPTCHA.
Let Pt denote the threshold of error probability. Given an arbitrary pair of i and j,
let Qij denote the probability that the process does not terminate before it receives i
positive votes and j negative votes. It can be inferred that the value of Qij depends on
that of Q(i-1)j and Qi(j-1), because the vote record i : j must evolve from either i-1 : j or i : j-1. Thus, Qij can be calculated recursively as follows:
(2)
Let Oij denote the probability that the process terminates when it receives i positive
votes and j negative votes. Then Oij can be computed by:
(3)
We did not manage to integrate Equations (1), (2), (3) and (4) into a single formula to measure the expected annotation cost. However, this does not prevent us from evaluating the relationship between the expected cost and the various factors in our model. As suggested by our analysis, the main factors influencing the expected annotation cost include the accuracies of the annotations, i.e. s% and t%, and the proportion of attackers' votes, i.e. 1-m%.
We created a Java program to compute the expected annotation cost based on Equations (1)~(4). To evaluate the influence of humans' accuracy (s%) on the expected cost, we set m% to 80% and t% to 50% (machines only make random guesses), and vary s% from 55% to 100%. The relationship between the annotation cost and human accuracy (s%) is plotted in Fig. 2. Obviously, the higher the human accuracy, the smaller the expected cost. As we can see, to guarantee the efficiency of the annotation process, we need to ensure that the accuracy of humans is significantly higher than that of computers. For example, to ensure that the average cost is below 10 votes, the human accuracy must be above 75%. Fortunately, this condition can be satisfied for most corpus annotation tasks.
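As a rough sketch of how the expected annotation cost can be computed from the recursion described above (treating the error probability of Equation (1) as a pluggable function, and assuming a single per-vote positive probability p_pos, e.g. m%*s% + (1-m%)*t% when the item's true polarity is positive):

```python
def expected_cost(p_pos, error_probability, p_threshold, max_votes=100):
    """Rough sketch of the expected annotation cost of one unknown item.

    p_pos             - assumed probability that a single vote is positive
    error_probability - placeholder for Equation (1): maps (i, j) to P_ij
    p_threshold       - termination threshold Pt on the error probability
    """
    # Q[i][j]: probability that the process is still running with record i : j
    Q = [[0.0] * (max_votes + 1) for _ in range(max_votes + 1)]
    Q[0][0] = 1.0
    expected = 0.0
    for n in range(1, max_votes + 1):        # n = i + j, total votes so far
        for i in range(n + 1):
            j = n - i
            q = 0.0
            # the record i : j can only evolve from (i-1) : j or i : (j-1),
            # and only if the process had not already terminated there
            if i > 0 and error_probability(i - 1, j) > p_threshold:
                q += Q[i - 1][j] * p_pos
            if j > 0 and error_probability(i, j - 1) > p_threshold:
                q += Q[i][j - 1] * (1.0 - p_pos)
            Q[i][j] = q
            if error_probability(i, j) <= p_threshold:
                expected += (i + j) * q      # O_ij: process terminates exactly at i : j
    return expected
```

For example, with m% = 80%, s% = 90% and t% = 50%, one would use p_pos = 0.8*0.9 + 0.2*0.5 under the assumption above.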
To evaluate the influence of attacks, we vary m% from 50% to 100% and fix s% to 90% and t% to 50%. The relationship between the annotation cost and the proportion of humans' votes (m%) is plotted in Fig. 3. As we can see, the higher the proportion of humans' votes, the smaller the average cost. The proportion of humans' votes depends on how effectively a website detects and blocks attackers. When the proportion is more than 70%, the average cost can be kept below 5 votes.
[Fig. 2: average cost vs. human accuracy (0.55-0.95). Fig. 3: average cost vs. proportion of human annotations (0.50-0.90).]
4 Evaluation of Accessibility
4.1 User Response Time
User response time measures the time a human user needs to complete a CAPTCHA test. Fig. 4 shows the response times of the 9 volunteers on each CAPTCHA. As we can see, there is a learning curve for this new type of CAPTCHA. When a volunteer is presented with the first CAPTCHA, she/he needs a certain time to understand its requirements and get used to the test environment. This usually takes around 25 seconds, and can occasionally go up to 1 minute. After the second CAPTCHA, the volunteers become familiar with the exercises. Most of the response times fall in the interval between 5 and 25 seconds. The average response time is below 10 seconds. Overall, no volunteer encountered difficulty when doing the CAPTCHAs. The accessibility of AnnoCAPTCHA appears acceptable to most users.
Fig. 4. User Response Time (time in seconds of volunteers 1-9 over CAPTCHA group numbers 1-10)
Fig. 5. User Accuracy (volunteers 1-9)
Fig. 6. Annotation Cost vs. Proportion of Human Annotations
4.3 Efficiency
Using the annotations collected from the user study, we simulated the process of annotation generation by AnnoCAPTCHA. We increased the proportions of random annotations made by machines to evaluate the impact of attacks on the annotation efficiency. We assume that humans' accuracy is 96.67%, the exact value we measured in the user study, and we set the error probability threshold to 1%. The average cost for annotating each context-adjective phrase (i.e. the average number of users needed to annotate each phrase) is shown in Fig. 6. As we can see, with an accuracy of 96.67%, the annotation is very efficient. Without the disturbance of machines, the annotation cost is only about 2 votes. As more and more test results of machines are added, the annotation cost goes up. However, the overall cost still appears acceptable.
5 Conclusions
The main contribution of this paper is a new type of CAPTCHA that can utilize crowd intelligence to perform corpus annotation. We used sentiment corpus annotation as an example and discussed the effectiveness and efficiency of our system. We clarified the relationship between the efficiency of the system and the accuracy of human annotation, as well as the influence of attacks. Finally, we conducted a user study to show that our system is easy to access and efficient in annotation generation.
References
1. von Ahn, L., Blum, M., Hopper, N.J., Langford, J.: CAPTCHA: Telling Humans and Computers Apart. In: Biham, E. (ed.) EUROCRYPT 2003. LNCS, vol. 2656. Springer, Heidelberg (2003)
2. von Ahn, L., Maurer, B., McMillen, C., Abraham, D., Blum, M.: reCAPTCHA: Human-Based Character Recognition via Web Security Measures, http://www.sciencemag.org/content/321/5895/1465.full
3. von Ahn, L., Dabbish, L.: Labeling Images with a Computer Game, http://www.cs.cmu.edu/~biglou/ESP.pdf
4. von Ahn, L., Liu, R., Blum, M.: Peekaboom: A Game for Locating Objects in Images, http://www.cs.cmu.edu/~biglou/Peekaboom.pdf
5. Elson, J., Douceur, J.R., Howell, J., Saul, J.: Asirra: A CAPTCHA that Exploits Interest-Aligned Manual Image Categorization. In: Proceedings of the 14th ACM Conference on Computer and Communications Security. ACM, New York (2007)
6. Morrison, D., Marchand-Maillet, S., Bruno, E.: TagCaptcha: Annotating Images with CAPTCHAs. In: Proceedings of the ACM SIGKDD Workshop on Human Computation, HCOMP 2009, pp. 44–45. ACM, New York (2009)
7. Chklovski, T.: Improving the Design of Intelligent Acquisition Interfaces for Collecting World Knowledge from Web Contributions. In: Proceedings of the 3rd International Conference on Knowledge Capture, pp. 35–42 (2005)
A Framework for OLAP in Column-Store Database:
One-Pass Join and Pushing the Materialization to the End
Abstract. In a data warehouse modeled with the star schema, data are usually retrieved by performing a join between the fact table and the dimension table(s), followed by selection and projection, while the join operator is the most expensive operator in an RDBMS. In a column-store database, there are two ways to perform the join. The first is the early materialization join (EM join); the other is the late materialization join (LM join). In EM join, the columns involved in the query are glued together first, and the glued rows are then sent to the join operator. In LM join, only the attributes that participate in the join are accessed. The problem that access to the inner table is out of order cannot be ignored for LM join; otherwise, the naive LM join is usually slower than EM join [9]. Since late materialization is good for memory bandwidth and CPU efficiency, LM join attracts more attention in the academic research community. The state-of-the-art LM joins in column-stores, such as the radix-cluster hash join [8] in MonetDB and the invisible join [10] in C-Store, all try to avoid accessing tables randomly. In this paper, we devise a framework for OLAP called CDDTA-MMDB, in which a new join algorithm called CDDTA-LWMJoin (abbreviated to LWMJoin in the following) is introduced. LWMJoin builds on our prior work, CDDTA-Join [7]. We equip CDDTA-Join with light-weight materialization (LWM), which is designed to cut down memory accesses and reduce the production of intermediate data structures. Experiments show that CDDTA-MMDB is efficient and can be 2x faster than MonetDB and 4x faster than the invisible join in the context of data warehouses modeled with the star schema.
1 Introduction
Data warehousing (DW) and On-line analytical processing (OLAP) consist of essen-
tial elements of decision support, increasingly becoming a focus of the database in-
dustry. Deviating from the queries to transactional databases, the nature of the queries
to DWs is more complex, longer lasting and read-intensive, but less attributes to
touch. Since column-store database only retrieves attributes required in the query, it
can optimize cache performance. Thus columnar storage is more suitable for DW in
this sense, and now the Microsoft also adds columnar storage in SQL server 2012 to
improve performance on data warehousing queries [5]. With the increase in RAM density and the reduction of its price, a number of in-memory column-oriented database systems, such as MonetDB [1, 2] and C-Store [3], have emerged for better performance. These systems offer order-of-magnitude gains over traditional, row-oriented databases on certain workloads, particularly on read-intensive analytical processing workloads such as those encountered in DWs. In this paper, we propose a
framework for OLAP called CDDTA-MMDB (Columnar Direct Dimensional Tuple Access-Main Memory Database) to efficiently process queries on DWs modeled with the star schema. This work builds on our prior work on a one-pass join algorithm [7], which is a variant of the invisible join [10]. Our main innovations are as follows:
The rest of this paper is organized as follows. In Section 2, prior work on columnar storage, materialization and join algorithms is described. Section 3 presents the light-weight materialization technique. In Section 4, the system overview is presented. Section 5 shows representative experimental results. Finally, the conclusion is drawn in Section 6.
2 Related Works
Storage. The Decomposition Storage Model (DSM) [4] decomposes each n-attribute relation into n separate relations. DSM organizes values from the same attribute close together physically. Provided that the record reconstruction cost is low, the cache performance of main-memory database systems can be improved by DSM. As for the concrete representation of DSM, MonetDB proposed Binary Units (BUNs) [2] to construct the decomposed tables in the form of <OID, value> pairs.
Materialization and Join Algorithms in Column-Stores. A column-store modifies the physical data organization in the database; logically, applications treat a column-store the same as a row-store. At some point in a query plan, a column-oriented DBMS must therefore glue multiple attributes from the same logical tuple into a physical tuple. Materialization strategies are divided into two categories: early materialization (EM) and late materialization (LM). A recent work on materialization in column-stores is [14], which uses a dynamic and cache-friendly data structure called cracker maps, providing direct mappings between pairs of attributes in the query, to reduce random data access. In [6], the trade-offs between different strategies are systematically explored and advice is given for choosing a strategy for a particular query.
The naive join algorithm using the LM strategy (naive LM join) in a column-store database is about 2x slower than the join algorithm using the EM strategy (EM join) [9]. The main reason is that in the naive LM join the access to the inner table is out of order, causing a high cache miss rate. The state-of-the-art LM join algorithms in column-stores, such as the radix hash join [8], MCJoin [13] and the invisible join [10], all try to avoid random memory access to the inner table. The radix-clustered hash join [8] partitions the input relation into H clusters. By controlling the number of clustering bits in each pass, the H clusters generated can be kept under the number of Translation Lookaside Buffer (TLB) entries and cache lines available on that CPU, so random access only happens within a storage block. The invisible join [10] uses bitmaps to discard tuples that do not satisfy the predicates, which reduces random memory accesses. LWMJoin tries to avoid producing intermediate data structures and further reduces memory accesses compared to the invisible join [10] and CDDTA-Join [7]. Experiments show that LWMJoin runs about 4x faster than the invisible join and 2x faster than the radix-cluster join on average.
3 Light-Weight Materialization
In a typical LM plan, the system first scans each column involved and records in a position list whether the attributes in the corresponding positions of the column satisfy the predicates; it then uses a position-wise AND to intersect the two position lists. According to the intersected result, the block of R.a is re-accessed for aggregation. The latter strategy is called late materialization. Light-weight materialization (LWM) adopts the notion of LM and pushes materialization as far toward the end of the query plan as possible, but tries to reduce memory accesses and avoid producing intermediate data structures. Instead of re-accessing the block of R.a to get the record values for GROUP BY, LWM just lets the key of the triplet produced by scanning (R.a, R.b) go through the whole query plan. Furthermore, LWM does not need to be assisted by intermediate data structures other than the triplets (in some cases, such as attributes that are no longer used and just serve as filters, the triplets can be replaced by bitmaps to save space). Considering the above scenario again, we do not need to keep two intermediate position lists and use a position-wise AND to intersect them to produce the intersected result; we just utilize the key in the triplet and let it go through the query plan. Thus, LWM speeds up the system and greatly facilitates the aggregation operation.
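To make the contrast concrete, the following is a minimal sketch (not the authors' implementation) assuming selections on R.a and R.b and a GROUP BY on R.a; the LM variant builds two position lists and re-reads R.a, while the LWM variant produces <surrogate, key, value>-style triplets in a single pass so R.a never has to be re-accessed:

```python
def lm_scan(col_a, col_b, pred_a, pred_b):
    """Classic LM: build one position list per predicate, intersect them
    with a position-wise AND, then re-access col_a for the grouping values."""
    pos_a = {i for i, v in enumerate(col_a) if pred_a(v)}
    pos_b = {i for i, v in enumerate(col_b) if pred_b(v)}
    matched = pos_a & pos_b                            # position-wise AND
    return [(i, col_a[i]) for i in sorted(matched)]    # second access to col_a

def lwm_scan(col_a, col_b, pred_a, pred_b):
    """LWM sketch: one pass emits (surrogate, key, value) triplets whose key
    already carries the R.a grouping value through the rest of the plan."""
    return [(i, a, a)                                  # key doubles as the grouper (assumption)
            for i, (a, b) in enumerate(zip(col_a, col_b))
            if pred_a(a) and pred_b(b)]
```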
4 System Design
The sub-queries on the fact table and the dimension tables are handled separately in CDDTA-MMDB. When a query is received, it is decomposed. Then, the sub-queries on the dimension table(s) are rewritten in order to produce triplets, while the sub-query on the fact table is parsed and its information is stored for later use. The Query Parser is responsible for these tasks. Based on the triplets and the information from parsing the sub-query on the fact table, the result set can be obtained by scanning the fact table in one pass. Figure 1 illustrates the overview of the system.
CDDTA-LWMJoin is derived from the invisible join [10]. The invisible join is also divided into three phases. First, apply the predicates on the dimension tables and use the keys of the rows that pass the predicates to build hash tables. Second, scan the fact table and fetch the foreign keys to probe the hash tables; the bitmaps on the dimension tables are built up in this step and are then intersected to produce the final bitmap. At last, according to the final bitmap, the fact table is scanned again to construct the result relation: if the i-th value is set to true in the final bitmap, the i-th record in the fact table is retrieved. From the above analysis, the time t used by the invisible join can be expressed as t0 + 2t1 + t2, where t0 is spent on producing the hash tables, t1 on scanning the fact table, and t2 on producing the final bitmap. Since the size of the final bitmap is equal to the size of the fact table, we can assume that t1 is equal to t2. So, the total time t spent by the invisible join is t0 + 3t1.
Compared with the invisible join, the LWMJoin algorithm utilizes only the key information in the triplets produced in the first phase and thus avoids scanning the fact table twice. The algorithm proceeds as follows. First, the decomposed sub-queries involving dimension tables are rewritten, and the result sets are obtained from the dimension tables; the last step of this phase is to produce the <surrogate, key, value> triplets. For sub-queries such as RQ1, a bitmap is enough to represent them; in the other cases, the <surrogate, key, value> triplets are produced. This representation forms the basis of LWMJoin. Second, the fact table is scanned sequentially; the foreign keys are used to probe the key vectors produced in the first phase, deciding whether the row satisfies the predicates. If any probed key equals zero, the row is discarded; otherwise, a physical tuple is constructed. Now, operators such as GROUP BY can be applied to the rows passing qualification. The benefits brought by LWM are obvious: there is no need to dynamically allocate memory to build temporary position lists, and aggregation is accelerated, especially when the fact table and the result sets are big, because the records do not need to be fetched again for aggregation. In general, the key of the triplet serves as the grouper in the star schema, and in such cases we do not need to re-access the attributes in the dimension tables for aggregation. Therefore, random memory accesses can be further reduced. Figure 2 illustrates CDDTA-MMDB handling Q4.1. First, the rewritten SQL RQ1, RQ2, RQ3 and RQ4 are applied on the corresponding dimension tables to produce triplets or bitmaps. Then the fact table is scanned and the foreign keys are used to probe the key/OID arrays. In this example, there are four tuples passing the predicates, and the result set contains two groups. The key/OID vectors drawn with broken lines are used in GROUP BY (if it is the end of the query plan, the program fetches the value by key to construct the tuple directly). The time t used by LWMJoin is expressed as t0 + t1, where t0 is used to produce the triplets and t1 is used to scan the fact table. So, the time spent by the invisible join is almost three times that of LWMJoin. The following experiments show that the time consumed by the invisible join is more than three times the time used by LWMJoin, which accords with the above analysis.
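The following is a compact sketch of the two phases described above, under simplifying assumptions (dimension keys are small integers usable as array indices, a key of 0 means the predicate is not satisfied, and the grouping key is the value stored in the key vector); it is an illustration of the idea, not the authors' implementation:

```python
def build_key_vector(dim_keys, dim_attr, predicate, group_attr):
    """Phase 1: scan a dimension table (given as columns) and produce a key
    vector indexed by dimension key; 0 means the row fails the predicate,
    otherwise the entry holds the group-by value (the 'key' of the triplet)."""
    vec = [0] * (max(dim_keys) + 1)
    for k, a, g in zip(dim_keys, dim_attr, group_attr):
        if predicate(a):
            vec[k] = g
    return vec

def lwm_join(fact_fks, fact_measure, key_vectors):
    """Phase 2: a single sequential pass over the fact table. Each foreign key
    probes its key vector; if any probe returns 0 the row is discarded,
    otherwise the probed keys serve directly as the GROUP BY grouper."""
    groups = {}
    for row, m in enumerate(fact_measure):
        keys = []
        for fk_col, vec in zip(fact_fks, key_vectors):
            k = vec[fk_col[row]]
            if k == 0:
                break
            keys.append(k)
        else:
            g = tuple(keys)
            groups[g] = groups.get(g, 0) + m   # aggregate without re-fetching dimensions
    return groups
```

Here the first phase plays the role of the rewritten dimension sub-queries (RQ1-RQ4 in the example), and the second phase corresponds to the single sequential scan of the fact table.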
5 Experiments
Query 1.x does not have GROUP BY, so we omit it in the experiments. We simulate the state-of-the-art LM join algorithms such as those in [7, 10], in which the data is retrieved for aggregation after the join phase (LMJoin), and we compare the aggregation time used by LMJoin with that used by LWMJoin. To our surprise, as shown in Figure 3, the time consumed by aggregation in LMJoin even exceeds the time of the join phase on Q3.1. On Q3.1 the time taken up by aggregation is up to 61% of the total time, followed by 35% on Q4.1, 21% on Q2.1 and 17% on Q4.1, whereas in LWMJoin the time taken by aggregation is reduced to 28%, 13%, 8% and 5%, respectively. The difference in aggregation time between LMJoin and LWMJoin lies in random memory access: after the join phase, LMJoin fetches the tuples for aggregation, and this out-of-order memory access incurs cache misses. So the aggregation time in LMJoin is much higher than that in LWMJoin, especially when the selectivity is high.
6 Conclusion
The main contribution of this paper is a framework which introduces LWMJoin to efficiently deal with queries on DWs modeled with the star schema.
LWMJoin is the combination of CDDTA-Join [7] and LWM. Compared to the state-of-the-art materialization in columnar databases, LWM reduces the production of intermediate data structures and can greatly cut down memory accesses. Experiments show that CDDTA-MMDB has good scalability and can be 2x faster than MonetDB and 4x faster than the invisible join. In the future, we will add the triplet representation and the LWM strategy to open source columnar databases such as MonetDB.
References
1. Boncz, P.A., et al.: MonetDB/X100: Hyper-pipelining query execution. In: CIDR, pp. 225–237 (2005)
2. Boncz, P.A., et al.: MIL primitives for querying a fragmented world. VLDB Journal 8(2), 101–119 (1999)
3. Stonebraker, M., et al.: C-Store: A Column-Oriented DBMS. In: VLDB, pp. 553–564 (2005)
4. Copeland, G.P., et al.: A Decomposition Storage Model. In: ACM SIGMOD, pp. 268–279 (1985)
5. Larson, P.-A., et al.: Columnar Storage in SQL Server 2012. In: ICDE, pp. 15–20 (2012)
6. Abadi, D.J., et al.: Integrating compression and execution in column-oriented database systems. In: ACM SIGMOD, pp. 671–682 (2006)
7. Yan-Song, Z., et al.: One-size-fits-all OLAP technique for big data analysis. Chinese Journal of Computers 34(10), 1936–1946 (2011)
8. Boncz, P.A., et al.: Breaking the Memory Wall in MonetDB. Communications of the ACM 51(12), 77–85 (2008)
9. Abadi, D.J., et al.: Materialization strategies in a column-oriented DBMS. In: ICDE, pp. 466–475 (2007)
10. Abadi, D.J., et al.: Column-Stores vs. Row-Stores: How Different Are They Really? In: ACM SIGMOD, pp. 967–980 (2008)
11. O'Neil, P.E., et al.: The Star Schema Benchmark (SSB), http://www.cs.umb.edu/~poneil/StarSchemaB.PDF
12. O'Neil, P.E., et al.: Adjoined Dimension Column Index (ADC Index) to Improve Star Schema Query Performance. In: ICDE, pp. 1409–1411 (2008)
13. Begley, S., et al.: MCJoin: A Memory-Constrained Join for Column-Store Main-Memory Databases. In: ACM SIGMOD, pp. 121–132 (2012)
14. Idreos, S., et al.: Self-organizing Tuple Reconstruction in Column-stores. In: ACM SIGMOD, pp. 1–12 (2009)
A Self-healing Framework for QoS-Aware Web
Service Composition via Case-Based Reasoning
Guoqiang Li1,2, Lejian Liao1, Dandan Song1,*, Jingang Wang1, Fuzhen Sun1, and Guangcheng Liang1
1 Beijing Engineering Research Centre of High Volume Language Information Processing & Cloud Computing Applications, Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, Beijing, China
2 School of Informatics, Linyi University, Linyi, China
{lgqsj,sdd,liaolj,bitwjg,10907023,195812}@bit.edu.cn
1 Introduction
* Corresponding author.
of the environment [1]. To the best of our knowledge, our work is the first to use CBR in the service adaptation domain.
The remainder of this paper is structured as follows. In Section 2 we present the related work. The self-healing framework is described in Section 3. Section 4 sets forth the case representation, the retrieval algorithm and the reuse approach in detail. In Section 5 we present our experiments. Finally, in Section 6 we draw conclusions and discuss future work.
2 Related Work
Although WSBPEL1 specifies business process behavior as a standard, its fault-handling mechanism is limited, as it is difficult for users to consider the dynamic information of services and their running environment in advance. Many works have extended BPEL engines to enhance their fault-handling capabilities. A multi-layered control loop is proposed in [2], which focuses on the framework structure. VieDAME [3] monitors BPEL processes according to Quality of Service (QoS) and replaces existing partner services based on various strategies. In [4], AOP (aspect-oriented programming) is used to extend the ActiveBPEL engine by using the ECA (event-condition-action) rule paradigm. In [5], a Web service Recovery Language is defined, inspired by the ECA rule paradigm, with emphasis on process recovery. Similarly, in [6], ECA rules are designed to change the components' states according to the fault events. In addition, vital and non-vital components are distinguished to express the importance of activities in the workflow. In Dynamo [7] and DISC [8], JBoss rules and an EC (event calculus) model are used, respectively. However, in the above-mentioned works, it is difficult for designers to design the complete set of rules and event models before the process execution. Our approach can automatically generate the failure cases on the basis of the fault information. With this advantage, our approach is easier for users to accept.
When QoS constraints are violated during the execution, the problem of selecting supplementary services is formalized as a mixed integer linear programming problem in [9], where negotiation techniques are exploited to find a feasible solution. In [10], a staged approach generates multiple workflows for one template (abstract workflow), and some workflows are selected as an IP optimization problem by considering QoS parameters. In [11], a reconfigurable middleware is proposed, covering the whole cycle of adaptation management, including monitoring, QoS analysis and substitution-based reconfiguration. In [12], failures are classified into several categories from the perspective of a service composition; then process reorganization, substitution and retrial/data mediation are proposed to react to the failures. In [13], a region-based reconfiguration algorithm is designed to minimize the reconfiguration cost; supplementary services are selected as a backup source for recovery by using linear programming. In [14], the DSOL re-planning technique is leveraged by Q-DSOL to optimize orchestrations supporting QoS in the presence of faults. Different from previous works, our approach is based on case-based reasoning (CBR).
1 http://docs.oasis-open.org/wsbpel/2.0/wsbpel-v2.0.html
3 Self-healing Framework
There are three parts in our framework; the framework is shown in Figure 1. The first part is the business process, which achieves the users' requirements. The second part, the extractor, gets the fault information (FIN), the services' functional information (FI) and non-functional information (NFI), which are passed to the third part, the case-based reasoner. After reasoning, a case with solution (CWS) is returned to the business process, and the satisfied solution keeps the process running forward correctly. A case without solution (CWOS) is viewed as a new fault. We assume that the fault information has been caught using some method (e.g., AOP). The reasoning pseudo code is presented in Algorithm 1.
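A minimal sketch of the reasoning loop described above and detailed in Section 4 (the function names, data layout and similarity threshold are illustrative assumptions, not the authors' exact pseudo code of Algorithm 1):

```python
def case_based_reasoner(new_case, case_base, similarity, threshold):
    """Sketch of the reasoning step: retrieve the most similar stored case and,
    if it is similar enough, reuse its solution (repeat or replace, Section 4);
    otherwise the problem is kept as a case without solution (CWOS)."""
    best_case, best_sim = None, 0.0
    for case in case_base:                 # retrieval over the linear case base
        sim = similarity(new_case, case)
        if sim > best_sim:
            best_case, best_sim = case, sim
    if best_case is not None and best_sim >= threshold:
        return best_case["solution"]       # case with solution (CWS) returned to the process
    case_base.append(new_case)             # assumption: the new fault is stored for later revision
    return None                            # CWOS: treated as a new fault
```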
4 Detailed Implementation
In our approach, FI and NFI information can be easily extracted from WSDL
files. FIN is represented by using the HTTP status codes as in [15]. The case
symptoms (CS) can be described as a triple: CS = (FI, NFI, FIN), where FI =
(input, output), NFI = (availability, cost, successability, response time), FIN =
(fri, nfri). Their definitions refer to WSBPEL and [15]. Here we give only part of the definitions, as follows:
1. Functional information (FI)
   (a) input: abstract message format for the solicited request.
   (b) output: abstract message format for the solicited response.
2. Non-functional information (NFI)
   (a) availability: the percentage of time the service is operating during a time interval.
   (b) throughput: number of successful invocations / total invocations.
3. Fault information (FIN)
   (a) fri: function-related information.
   (b) nfri: non-function-related information.
We classify adaptive actions into two categories, repeat and replace, which are commonly used in previous works, e.g., [12]. For example, the code 404 denotes that the service is not found, so replace should be chosen. There are 20 standard faults defined in the WSBPEL specification, and we distinguish them according to whether they are related to input data. For example, the fault invalidExpressionValue is data related, so repeat can be used. Our analysis shows that only missingReply and missingRequest can be considered data unrelated.
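A hedged illustration of this classification rule (the fault lists below are only the examples mentioned above, not the full classification, and the default branch is an assumption):

```python
# Examples drawn from the text: data-related faults are retried with corrected
# data, while data-unrelated faults (and HTTP 404) lead to service replacement.
DATA_RELATED_FAULTS = {"invalidExpressionValue"}
DATA_UNRELATED_FAULTS = {"missingReply", "missingRequest", "404"}

def choose_action(fault_code):
    if fault_code in DATA_RELATED_FAULTS:
        return "repeat"
    if fault_code in DATA_UNRELATED_FAULTS:
        return "replace"
    return "replace"     # assumption: unknown faults fall back to replacement
```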
We construct a case representation with solution and solution templates, shown in Tables 1 and 2, respectively. Codes with the prefix d in the first column of Table 2 can be defined by the users. Codes with the prefix b come from WSBPEL; the prefix is added by us. The remaining codes come from [15]. The underlined tasks in the template column of Table 2 will be replaced with a concrete service. A new problem is a case without the solution part of Table 1.
In light of our initial study, the case base is stored in a linear list. In the real world, we can index the cases according to attributes of the service to improve the retrieval efficiency. Because the QoS attributes vary in their units, they are normalized before the similarity between cases is computed.
Table 1. Case representation with solution (example)
Problem (symptoms): (1) problem: fault description; (2) service location; (3) service input; (4) service output; (5) availability: 89; (6) throughput: 7.1; (7) successability: 90; (8) response time: 302.75
Solutions: (1) Diagnosis: the used data is faulty; (2) Repair: repeat invoking the service with correct data

Table 2. Solution templates
b901 ambiguousReceive - Diagnosis: the service does not work well with incorrect data.
411 Length Required - Repair: repeat this service with correct input data.
The factor k is the weight denoting the user's preference for an attribute of QoS, and the sum of these factors is 1. Cik comes from the matrix A of Ci.
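As a sketch of one way the weighted similarity over normalized QoS attributes could be computed (the min-max normalization and the attribute set are assumptions, not the paper's exact formula):

```python
def qos_similarity(case_a, case_b, weights, ranges):
    """Weighted similarity between the QoS parts of two cases.
    case_a, case_b - dicts attribute -> raw value (e.g. availability, throughput,
                     successability, response time)
    weights        - dict attribute -> k, with sum(weights.values()) == 1
    ranges         - dict attribute -> (min, max) used for normalization (assumed)"""
    sim = 0.0
    for attr, k in weights.items():
        lo, hi = ranges[attr]
        a = (case_a[attr] - lo) / (hi - lo)     # normalize to [0, 1] because units differ
        b = (case_b[attr] - lo) / (hi - lo)
        sim += k * (1.0 - abs(a - b))           # closer values contribute more
    return sim
```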
Condition 1: if the adaptive action is repeat, the failed service can be re-invoked directly.
Condition 2: if the adaptive action is replace, the solution is verified to check whether the QoS attributes deviate from the user's constraints.
Our approach automatically generates two alternatives for one new case. Assume S1, S2 and S3 occur in the case base, corresponding to C1, C2 and C3. C2 is similar to C1 and its solution is: replace S2 with S3. So, C3 is also a candidate for C1. Even though their similarity may be bigger than the similarity threshold, we believe that our method has a high success rate as long as the threshold is kept as small as possible.
Finally, we transform the validation of a solution against the QoS requirements into a constraint satisfaction problem. In this paper, it is assumed that a candidate service for the failed service can be found in the case base. The reason is that the business process will return to a correct state by using the compensation handler in WSBPEL. How to implement this compensation mechanism with CBR is a future research topic.
5 Experiment
The QWS dataset2, which includes 2,507 Web services, is used in our experiments. We use five attributes: response time, availability, throughput, successability and reliability. We assume all services are functionally equivalent and that the service composition uses a sequential structure including 20 services. We record the comparison times (CPT) for the failed services, which are generated randomly.
We build the case base using only the first 2,400 services. In the first experiment, we do not limit the search step (SS). The results in Figure 2(a) show that the CPT is not proportional to the number of failed services. We can see that the CPT is very high at points 8 and 10, where the CPT at point 10 is 938.2, because 30% of the found services come from the remaining 107 services. So, it is very important to build a case base that is as complete as possible.
In the second experiment, two conditions are compared: step-limited and step-unlimited. The latter is our reuse approach, and the value of SS for the step-limited condition is set to 5. In Figure 2(b), we can see that the CPT of step-unlimited is lower than that of step-limited. In particular, when there are 9 failed services, the comparison times are reduced 26 times. So, our method takes less time to achieve the adaptation.
At last, we make a further study of the influence of different search steps on the CPT. In Figure 2(c), the results show that the CPT can be reduced significantly with the increase of the search steps. But this does not mean that bigger steps are always better. For example, when SS is equal to 4 and 5, the comparison times are very close. So, in practice, we should determine a proper value of the search steps.
2 http://www.uoguelph.ca/~qmahmoud/qws/index.html#Quality of Web Service (QWS). This dataset has more QoS attributes than the one on the web site http://www.zibinzheng.com/
[Fig. 2: comparison times vs. number of the failed services (1-10); (a) unlimited search steps, (b) step-limited vs. step-unlimited, (c) search steps 3, 4 and 5]
References
1. Kolodner, J.L.: An introduction to case-based reasoning. Artif. Intell. Rev. 6(1), 3–34 (1992)
2. Guinea, S., Kecskemeti, G., Marconi, A., Wetzstein, B.: Multi-layered monitoring and adaptation. In: Kappel, G., Maamar, Z., Motahari-Nezhad, H.R. (eds.) ICSOC 2011. LNCS, vol. 7084, pp. 359–373. Springer, Heidelberg (2011)
3. Moser, O., Rosenberg, F., Dustdar, S.: Non-intrusive monitoring and service adaptation for ws-bpel. In: Proceedings of the 17th International Conference on World Wide Web, WWW 2008, pp. 815–824. ACM, New York (2008)
4. Liu, A., Li, Q., Huang, L., Xiao, M.: Facts: A framework for fault-tolerant composition of transactional web services. IEEE Transactions on Services Computing 3(1), 46–59 (2010)
5. Baresi, L., Guinea, S.: A dynamic and reactive approach to the supervision of bpel processes. In: ISEC 2008, pp. 39–48 (2008)
6. Ali, M.S., Reiff-Marganiec, S.: Autonomous failure-handling mechanism for wf long running transactions. In: Proceedings of the 2012 IEEE Ninth International Conference on Services Computing, SCC 2012, pp. 562–569. IEEE Computer Society, Washington, DC (2012)
7. Baresi, L., Guinea, S., Pasquale, L.: Self-healing bpel processes with dynamo and the jboss rule engine. In: International Workshop on Engineering of Software Services for Pervasive Environments, in Conjunction with the 6th ESEC/FSE Joint Meeting, ESSPE 2007, pp. 11–20. ACM, New York (2007)
8. Zahoor, E., Perrin, O., Godart, C.: Disc: A declarative framework for self-healing web services composition. In: IEEE International Conference on Web Services, ICWS 2010, Miami, Florida, USA, July 5-10, pp. 25–33. IEEE Computer Society (2010)
9. Ardagna, D., Pernici, B.: Adaptive service composition in flexible processes. IEEE Transactions on Software Engineering 33(6), 369–384 (2007)
10. Chafle, G., Dasgupta, K., Kumar, A., Mittal, S., Srivastava, B.: Adaptation in web service composition and execution. In: International Conference on Web Services, ICWS 2006, pp. 549–557 (September 2006)
11. Ben Halima, R., Drira, K., Jmaiel, M.: A qos-oriented reconfigurable middleware for self-healing web services. In: IEEE International Conference on Web Services, ICWS 2008, pp. 104–111 (September 2008)
12. Subramanian, S., Thiran, P., Narendra, N.C., Mostefaoui, G.K., Maamar, Z.: On the enhancement of bpel engines for self-healing composite web services. In: Proceedings of the 2008 International Symposium on Applications and the Internet, SAINT 2008, pp. 33–39. IEEE Computer Society, Washington, DC (2008)
13. Li, J., Ma, D., Mei, X., Sun, H., Zheng, Z.: Adaptive qos-aware service process reconfiguration. In: Proceedings of the 2011 IEEE International Conference on Services Computing, SCC 2011, pp. 282–289. IEEE Computer Society, Washington, DC (2011)
14. Cugola, G., Pinto, L., Tamburrelli, G.: Qos-aware adaptive service orchestrations. In: IEEE 19th International Conference on Web Services (ICWS), pp. 440–447 (June 2012)
15. Al-Masri, E., Mahmoud, Q.H.: Investigating web services on the world wide web. In: Proceedings of the 17th International Conference on World Wide Web, WWW 2008, pp. 795–804. ACM, New York (2008)
16. Al-Masri, E., Mahmoud, Q.: Qos-based discovery and ranking of web services. In: Proceedings of the 16th International Conference on Computer Communications and Networks, ICCCN 2007, pp. 529–534 (August 2007)
Workload-Aware Cache for Social Media Data
1 Introduction
The success of social network services (SNS), such as Facebook and Twitter, has brought up many interesting Web 2.0 applications, while posing great challenges for human-real-time management of huge volumes of data of an unstructured nature.
Many SNS tasks, including those for both SNS providers and end-users, can be represented by a specific kind of query, i.e., the timeline query (TQ). Intuitively, a timeline query is a top-k or windowed query for records satisfying specific conditions, in reverse order of the timestamps associated with those records. The home timeline query (HTQ), a typical kind of timeline query, retrieves a timeline with records created by a given user's followees.
Fig. 1, for example, shows a global timeline within the period between timestamps t0 and tj for six users, i.e. u1, u2, u3 and v1, v2, v3, whose following relationships are also given. Here, ui,j denotes the jth message posted by user ui. The top-5 results of the home timeline queries for u1, u2, u3 at timestamp tj are also listed. Here, the query condition is that the queried message should belong to the querying user himself or his followees. Thus, v3,1 is included in the results of HTQ(u1, tj, 5) and HTQ(u2, tj, 5), but not of HTQ(u3, tj, 5), since user u3 is not following v3.
Fig. 1. An example of home timeline queries over a global timeline of six users

However, evaluation of home timeline queries in a human-real-time manner is non-trivial. On the one hand, the data should be cached in memory with orderings or indices on timestamps, so that the top-k or time-window part of the query can be efficiently evaluated. On the other hand, a specific query may retrieve only a very small portion of the data distributed over the whole timeline, while different queries may extract different, overlapping sets of records. The sparsity of the results makes the ordering of the global timeline in the system meaningless.
The report in [1] describes three caching schemes to facilitate the evaluation of home timeline queries over social media data. The pull-based method caches the messages of each user separately by simply putting any item into the cache unit of its corresponding user. This strategy requires merging the messages to construct the home timeline, so it suffers from the high cost of cache merging, leading to high latency. The push-based approach directly caches home timeline query results by eagerly delivering each user-generated message to his followers. Clearly, it consumes too much memory for duplicated items and imposes a heavy CPU load for the vast number of deliveries. To strike a balance between the pull-based and push-based approaches, a hybrid strategy distinguishes popular users, such as celebrities who have a large number of followers, from the ordinary ones. The messages generated by popular users are cached following the pull-based method, while messages from the other users are processed with the push-based approach. Consequently, active users are most likely to have a worse user experience in terms of retrieval latency, since they are always following a certain number of popular users.
A workload-aware caching scheme named overlapping caching is proposed in this paper, based on the observations that users of an SNS tend to form homogeneous communities and that access frequency is highly skewed among users. It distinguishes itself from previous work in the following aspects.
The rest of the paper is organized as follows. In Section 2, we state the problem of home timeline queries formally. The workload-sensitive caching scheme is illustrated in Section 3 and the empirical study results are shown in Section 4. Section 5 reviews the related work, followed by the conclusion of this paper in Section 6.
2 Problem Statement
A timeline is a sequence of items (e.g., messages) that are ordered chronologically. Each item is a triple <ux, ty, mx,y>, in which ux is the user identifier, ty denotes the timestamp, and mx,y is the id of the item created by ux at ty. A followship network is a directed graph G : (V, E), in which V is the set of users and E is a subset of V × V. (u, ui) ∈ E means that user u is following user ui. In many real-life SNS, every user by default subscribes to his own messages, i.e. ∀u, (u, u) ∈ E. We denote all of ui's followees as Fi, i.e. Fi = {u | (ui, u) ∈ E}.
The raw data to be queried is a global timeline, denoted as TG, in which items from different users are interleaved. A timeline query TQ(U, tj, k) is a query at tj to retrieve the last k items <ux, ty, mx,y> in TG satisfying both ty ≤ tj and ux ∈ U, where U is a set of user identifiers.
Home timeline queries, as a typical kind of timeline query, are notoriously common in SNS applications. Every time a user logs in or refreshes his/her web page, a home timeline query is triggered. It retrieves ui's home timeline, denoted as Ti, which is the timeline with all items from this user's followees. Thus, the home timeline query of user ui at time tj can be denoted as HTQ(ui, tj, k) = TQ(Fi, tj, k).
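As a small illustration of this definition, a pull-style evaluation of HTQ(ui, tj, k) can merge the reverse-chronological user timelines of ui's followees (a sketch under the assumption that each user timeline is kept sorted by timestamp in descending order):

```python
import heapq
from collections import namedtuple
from itertools import islice

Item = namedtuple("Item", ["user", "ts", "mid"])   # <u_x, t_y, m_{x,y}>

def home_timeline_query(user_timelines, followees, t_j, k):
    """HTQ(u_i, t_j, k) = TQ(F_i, t_j, k): the last k items with ts <= t_j
    among the items of u_i's followees F_i.
    user_timelines: dict user -> list of Item sorted by ts descending."""
    streams = [
        (it for it in user_timelines.get(u, []) if it.ts <= t_j)
        for u in followees
    ]
    merged = heapq.merge(*streams, key=lambda it: it.ts, reverse=True)
    return list(islice(merged, k))
```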
The main problem in this work is how to achieve low latency and high throughput with limited system resources. Here, latency is the response time of a home timeline query, whereas throughput means the number of operations, consisting of creating items and retrieving home timelines, per time unit (e.g., seconds). In addition, we consider RAM and CPU as the system resources, since most of the data is stored in RAM for performance reasons in many real applications. Formally, let us define CostRAM() as the total consumption of memory and CostCPU() as the aggregate CPU utilization. Then, the home timeline query problem is to minimize both CostRAM() and CostCPU(), while ensuring that the average query latency is less than M time units (e.g., milliseconds) and the average throughput is higher than N operations per time unit, on the premise that home timelines are provided correctly.
3 Workload-Sensitive Caching
Two trivial solutions for home timeline query processing are the pull-based method and the push-based method, which are illustrated in Fig. 2(a) and Fig. 2(b), respectively. Under the pull strategy, any message item <ux, ty, mx,y> is cached only once, in ux's user timeline, where all messages posted by ux are organized chronologically. Thus, while processing a home timeline query HTQ(ui, tj, k), the problem is to retrieve the user timelines of ui's followees, i.e., Fi. In contrast, the push-based method assumes that all home timeline queries are known in advance and that their results are pre-computed and cached. When the query is triggered, the system just retrieves the result from the cache and delivers it to the user. Clearly, to support the push strategy, each item <ux, ty, mx,y> is cached as many times as the number of ux's followers. However, neither the pull-based nor the push-based method is ideal for home timeline query processing. The former suffers from high-cost cache merges leading to high latency, whereas the latter consumes too much memory for duplicated items and imposes a heavy load on the CPU.
The resulting shared groups are expected to resolve the problem stated in Section 2, i.e., how to achieve low latency and high throughput with limited system resources. Thus, the determination of the shared groups must facilitate the evaluation of home timeline queries so that RAM is consumed and CPU is utilized in an economical manner while satisfactory performance is guaranteed.
s.t.   ∪{ Cl : 1 ≤ l ≤ K, ui ∈ Pl } = Fi,   ∀i        (1)
       |Xi| ≤ β/λi,   ∀i                              (2)
Constraint 1 ensures that a user's home timeline can be generated by merging all his/her shared timelines. Constraint 2 limits the number of shared timelines each user owns, which is bounded according to the request frequency of the user's home timeline.
The problem of determining the shared timelines is non-trivial. We simplify the original problem by setting all λi = 1 and formally define the simplified problem as the Volume Bounded Set Basis Problem, to which the classic Set Basis problem [2] can be reduced.
Theorem 1. The Volume Bounded Set Basis Problem is NP-hard.
Details of the definition of the Volume Bounded Set Basis Problem and its complexity analysis, along with the proof of Theorem 1, are given in the Appendix.
4 Empirical Study
We now examine our experimental results. The four strategies (i.e., pull, push, hybrid and overlap) described in Section 3 are implemented to handle the message load obtained from a real-life social media service (Sina Weibo). We show a comparison of the above four caching strategies from the aspects of system cost and performance.
Our system cost metric is the aggregate memory usage and CPU utilization across all servers. We put a limit on memory capacity in order to evaluate the performance of the strategies when some cache misses happen, which is the common case in many real-life applications. CPU utilization is measured by the Unix sar command. Performance is evaluated in terms of latency and throughput. We measure the latency as the response time observed from the client for retrieving one user's home timeline. Throughput is measured as the number of operations performed per second. Here, an operation means posting a message or retrieving a home timeline.
In summary, our results show that the overlapping caching strategy strikes the best balance between system load and performance: it utilizes the least CPU, consumes the second least memory, and achieves the lowest query latency and the highest throughput most of the time under the memory constraint in our experiment.
[Figure: usage of memory vs. operations (x10^4) and utilization of CPU (%) over the time series, for the push, pull, hybrid and overlap strategies]
We rank the users by the number of their followers and regard the top 7,509 users, who have more than 1,000 followers, as popular ones. All the popular users are treated differently in the hybrid strategy.
To make a fair comparison of system cost and performance among the four caching strategies, each of them is run against the same workload. That is, the same sequence of messages posted in the real-life application and the same simulated retrieval log mentioned above are used in our experiments. The ratio of posts to retrievals is 1:4. We also launch the same number of threads for each strategy.
We run our experiments on server-class machines (8-core Intel Xeon CPU, 16GB of RAM, 1TB of disk and gigabit Ethernet). In particular, we use twelve Redis storage servers (six for meta data) and one node to launch and process requests. The memory capacity of each Redis server is limited to 100MB, under which a portion of cache misses occurs under all the strategies except the pull one.
4.3 Performance
Next, we evaluate the performance of the four strategies. For the latency and throughput reported in Fig. 4, we use average values. Our experimental results show that the push-based strategy brings the lowest query latency and the highest throughput before reaching the maximum memory capacity, but then suffers a sharp drop in performance since it consumes the most memory and meets the limit of memory capacity first; the performance of the pull method is quite stable yet modest; and the overlap-based strategy achieves the lowest query latency and the highest throughput most of the time.
Workload-Aware Cache for Social Media Data 671
Throughput(ops/s)
5000
5
4000
4
3000
3
2000
2
1 1000
0 0
0 100 200 300 400 500 600 700 800 0 100 200 300 400 500 600 700 800
Operations(x104) Operations(x104)
5 Related Work
Both industry and the research community have paid great effort to develop approaches to support the home timeline query, which is a key feature in SNS. Three strategies (i.e., pull, push and hybrid) were proposed previously to generate home timelines for users. Facebook is one of the largest SNS providers in the world; according to its report [5], a pull-based strategy was implemented to support its service and the push-based one was constructed on the fly. On the other hand, Twitter [1] recently reported a hybrid strategy: push messages from the majority of ordinary users and pull messages from popular ones.
A. Silberstein et al. [6] have introduced a formal solution to the hybrid strategy from the aspect of materialization. They proposed a cost-based heuristic method to decide whether the tweets should be materialized for each pair of followees and followers. Besides, some arguments can be tuned to make a trade-off between latency and throughput. Some other researchers [7][8] try to approach this problem from the distributed management of graph data. Each user in the social network is represented by a node in the graph and the home timeline is generated by issuing read requests to the neighbors. They try their best to minimize the communication between sites. In order to sustain the local semantics of the query, [7] replicates to the local site all the nodes that need to be accessed on remote sites. However, this approach causes a great deal of storage redundancy and may require great effort to maintain the replicas. [8] introduces a more elaborate approach: the nodes are partitioned randomly and clustered into multiple clusters in each partition, and the replication decision is made for each pair of clusters and partitions according to their access behaviors. The experiments illustrate that their method is better than [7]. Besides, several decentralized social network systems [9][10][11] have been proposed. [10] incorporates a concept similar to the hybrid strategy: users in its system are divided into social users and media users. Media users typically have to broadcast messages to a lot of users. As a result, messages of media users are disseminated by gossiping.
Our work focuses on the microblogging service and proposes a new strategy. [3] and [4] aim at detecting overlapping communities in social networks. [4] provides an algorithm that is linear in the number of edges in the network. We provide a MapReduce implementation to preprocess the dataset and help us understand the data.
6 Conclusion
In this paper, to minimize memory usage and CPU utilization while ensuring high throughput and low latency, we propose a novel workload-aware cache scheme, i.e., the overlap-based strategy, to evaluate home timeline queries. This strategy makes full use of workload information, including social communities and user access frequency in SNS. Our experimental studies show that the overlap-based approach outperforms the other three strategies, i.e., the pull-based, push-based and hybrid strategies, in both system cost and performance.
Appendix
We prove Theorem 1, i.e. that the Volume Bounded Set Basis Problem is NP-hard, by transforming the Set Basis problem to the Volume Bounded Set Basis Problem.

Proof. Let a collection C of subsets of a finite set S and a positive integer K ≤ |C| be an instance of the Set Basis problem. We construct an instance of the Volume Bounded Set Basis Problem as follows: collection F = C of subsets of the finite set U = S, positive integer m = K, and W = Σc∈C |c|.

Let B be a collection of subsets of S with |B| = K such that, for each c ∈ C, there is a subcollection of B whose union is exactly c. Therefore, for each f ∈ F = C, there is a subcollection of B whose union is exactly f and whose size is not greater than |B| = K = m. The volume of all basis sets in B is upper bounded by W, i.e. Σb∈B |b| ≤ Σc∈C |c| = W, which also follows from the fact that each c ∈ C is a union of a subcollection of B. Then B is a solution to the constructed instance of the Volume Bounded Set Basis Problem.

The other direction is obvious.
Shortening the Tour-Length
of a Mobile Data Collector in the WSN
by the Method of Linear Shortcut
1 Introduction
Wireless Sensor Networks (WSNs) are widely used for tracking, monitoring and other purposes. The problem of collecting data packets from the sensor nodes and depositing them at the sink node is known as the Data Gathering Problem [1,2]. Using mobile elements for data gathering in the WSN is a recent trend [3]. It has many advantages: for example, it increases connectivity, reduces the cost of deploying a dense WSN and increases the lifetime of the WSN. However, it has the disadvantage of high latency, as these mobile elements have limited speed (compared to high-speed data gathering by routing). The dedicated mobile element in the WSN which collects data packets and brings them to the static sink is called the Mobile Data Collector (MDC). A convenient way to control the latency in the case of data collection by the MDC is to carefully plan its path so that the path is as short as possible. In this work, we present a path-planning method that shortens the path of the MDC iteratively. We test our path-planning method in a simulation which is run on a realistic testbed. Experimental results show that the shortening of the path indeed translates into decreased latency and other improvements.
The rest of the paper is organized as follows. Related works are discussed in Section 2. The problem is formulated in Section 3. Our method is presented in Section 4. Experimental results are presented in Section 5. The prospects for future work are presented in Section 6.
2 Related Works
A complete survey on using mobile elements for data collection in the WSN is presented in [3]. Earlier works on mobile elements can be found in [4], [5], [6] and [7]. However, the random motion of the mobile elements in these works is not suitable for optimization. Mobile sinks have been considered in [8] and [9]. Mobile relay based approaches for opportunistic networks have been surveyed in [10]. However, these methods are not suitable for the WSN because of its differences with such networks. In [11], an energy-efficient data gathering mechanism for large-scale multi-hop networks has been proposed. The inter-cluster tour proposed in this work is NP-hard, and the latency issue is not addressed. One of the heuristics used in this work produces edges which are not connected with any nodes of the WSN. Latency is considered while planning the path of the mobile collector in [12] and [13]. The methods presented in these works produce a shorter tour, termed the Label Covering tour, from a TSP-tour. However, the transmission range of the sensor nodes is ignored in the shortening process. In [14], the authors propose an approximation algorithm which is based on the TSP-tour of the MDC. Although the computation time (O(n)) is impressive, the solution is applicable only to a certain kind of TSP-tour (tours for which the centroid of the tour-polygon lies within that polygon). If a condition regarding the concavity of the given TSP-tour is not met, the problem of finding the optimal solution becomes NP-hard. In [15], the authors address the problem of planning the paths of multiple robots so as to collect the data from all sensors in the least amount of time. The method presented there exploits earlier work on the TSP-tour with neighborhood problem. However, this work does not utilize the available location information of the sensor nodes to the fullest, as it allows the traversal of the full boundary of the transmission region of a node. In a sparse network, one or more sensor nodes have no neighbors at all. As a result, the traversal of the boundaries of those nodes is futile and adds to the tour-length.
3 Preliminaries
We represent a WSN with n nodes by a complete graph Kn where the graph-nodes
represent the sensors and the sink. The edges in this graph represent the
Euclidean distances between two nodes. We adopt the disk model for the given
transmission range TXR. A circle with radius TXR centered at a node represents
the area of radio transmission of that node. We assume that there are one sink
and one MDC in the WSN and that both the sink and the sensor nodes are
static. A tour or cycle for the MDC is a closed path in the graph Kn which
starts and ends at the sink node.
If n packets are collected in the current tour by the M DC, average packet
delivery latency tavg is given by:
t_avg = \frac{\sum_{i=1}^{n} [t_T - t_g(i)]}{n} = t_T - \frac{\sum_{i=1}^{n} t_g(i)}{n}    (2)

The term \frac{1}{n}\sum_{i=1}^{n} t_g(i) in Equation 2 is the average packet generation time. This
parameter is not controllable as it depends on the sampling rate of the sensor
nodes and the event frequency. However, we may improve both the per-packet DDL
(t_l) and the average DDL (t_avg) by decreasing the tour-time t_T (see Equations 1 and 2).
The tour-time of the TSP-tour, i.e. t_TSP, has two components: the fraction of
tour-time t_h when the MDC halts and collects data from the nearby nodes, and
the fraction of tour-time t_m when the MDC travels between the node positions.

t_TSP = t_h + t_m    (3)

When the number of nodes is very high and/or the network is sparse, t_h << t_m
and thus t_m dominates the tour-time t_TSP. This assumption is logical for practical
scenarios where the speed of a commercially available robotic car used as the MDC
is usually 5 m s^{-1}, whereas a packet transfer from a sensor node to the MDC
happens in the order of milliseconds [16]. Thus, decreasing the motion time t_m
contributes to improving latency. If the speed of the MDC is v_MDC, and if we
assume that it accelerates to this speed instantly and stops instantly, then

t_m = \frac{|t_TSP|}{v_MDC}    (4)
1 We use TSP-tour to denote the minimum-cost TSP-tour in this work.
where |t_TSP| is the path-length of the TSP-tour. Given a particular MDC, v_MDC
is fixed. The only way to decrease the tour-time is to decrease the length of the tour,
i.e. |t_TSP| (see Equation 4). However, decreasing the tour-length arbitrarily has
the risk of making the resulting tour incomplete. Therefore, we address the issue
carefully so that the resulting tour is complete and shorter than the TSP-tour.
Problem Statement
Given a TSP-tour of the MDC in a WSN, find a tour Td that is complete and
shorter than the TSP-tour.
[Figure: (a) Complete graph derived from the connectivity graph; (b) Label Covering tour (tour-edges marked in double-line).]
[Figure 3: A tour-edge (n_i, n_j) with intermediate nodes n_{i+1}, ..., n_{i+5} and their contact intervals [l_{n_{i+1}}, r_{n_{i+1}}], ..., [l_{n_{i+5}}, r_{n_{i+5}}].]
the circle and the straight line is beyond the edge, the end-point of the CI is the
nearest end-point of the edge. For example, in Figure 3, r-point of Node ni+5 is
the location of Node nj .
Definition 6. Given a list of contact intervals CIe of a tour-edge e, Critical
Contact Interval or CCI is the interval of the minimum length which has at least
one point from each contact interval.
The Critical Contact Interval or CCI of the i-th edge is represented by two points,
lcc_i and rcc_i, on that edge. These points can be determined as follows:
[Figure 4: Determining the CCI of the edge from Figure 3 from the contact intervals [l_{n_{i+1}}, r_{n_{i+1}}], ..., [l_{n_{i+5}}, r_{n_{i+5}}]; here lcc_i = r_{n_{i+2}} and rcc_i = l_{n_{i+5}}.]
For example, the CCI of the edge shown in Figure 3 is determined in Figure 4.
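The computation behind Definitions 5 and 6 can be sketched in a few lines. The snippet below is our own minimal illustration, not the authors' implementation: it parameterizes a tour-edge by arc length, clips the circle of radius TXR around each intermediate node against the edge to obtain its contact interval (the l and r points), and then takes the shortest stabbing interval of all contact intervals as the CCI. The helper names contact_interval and critical_contact_interval are ours.

import math

def contact_interval(a, b, center, txr):
    """Contact interval of a node's transmission circle with tour-edge (a, b).

    Returns (l, r) as distances along the edge measured from a, clipped to the
    edge end-points, or None if the circle does not reach the edge segment.
    """
    (ax, ay), (bx, by), (cx, cy) = a, b, center
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)                # a and b are assumed distinct
    ux, uy = dx / length, dy / length          # unit vector along the edge
    t0 = (cx - ax) * ux + (cy - ay) * uy       # projection of the circle center
    dist = abs((cx - ax) * uy - (cy - ay) * ux)  # distance from center to the line
    if dist > txr:
        return None
    half = math.sqrt(txr * txr - dist * dist)
    l = max(0.0, t0 - half)                    # clip to the nearest edge end-point
    r = min(length, t0 + half)
    return (l, r) if l <= r else None

def critical_contact_interval(intervals):
    """Shortest interval having at least one point in each contact interval."""
    if not intervals:
        return None
    lcc = min(r for _, r in intervals)         # e.g. r of n_{i+2} in Fig. 4
    rcc = max(l for l, _ in intervals)         # e.g. l of n_{i+5} in Fig. 4
    if lcc >= rcc:                             # all intervals share a common point
        return (rcc, rcc)
    return (lcc, rcc)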
After the CCIs have been computed, we can connect the r point of the CCI of
an edge with the l point of the CCI of the next edge. However, the nodes visited
in the given tour may be missed, as shown in Figure 5(a). Therefore, to cover
those nodes, we apply the following method for a Node ni:
1. If both edges have non-null CCIs, i.e. there are intermediate nodes on
both edges, then we join the r point of the incoming edge with the l
point of the outgoing edge. We call this line segment the r-l line segment.
(a) If the r-l line segment intersects the circle of Node ni, then the CCIs of both
adjacent edges are kept unchanged (see Figure 5(b)).
(b) If the r-l line segment does not intersect the circle of Node ni, then we draw
a straight line that is parallel to the r-l line segment and tangent to that
circle. Let this line intersect the incoming and outgoing edges at points
pi and po respectively (see Figure 5(b) for Node n11). We update the
r point of the incoming edge and the l point of the outgoing edge as pi
and po respectively.
2. If the incoming edge does not have any intermediate node with an overlapping
circle, or (in case it has) its r point is farther from node ni than TXR,
then we compute the point pi as the intersection between the incoming edge
and the circle centered at ni. If the incoming edge has a non-null CCI, then we
update its r point as pi. Otherwise, we set the incoming edge's r and l points
as pi. This case applies to Nodes n1, n2 and n13, as shown in Figure 5(b).
[Fig. 5. Generating the TLC-tour using Linear Shortcut: (a) connecting the CCIs of successive tour-edges; (b) updated l and r points to cover visited nodes; (c) TLC-tour derived in Iteration 1; (d) updating the l and r points after Iteration 1; (e) TLC-tour derived in Iteration 2; (f) updating the l and r points after Iteration 2; (g) TLC-tour derived in Iteration 3; (h) updating the l and r points after Iteration 3.]
Now, we join the r point of the previous edge with the l point of the next edge.
The final edges are shown as bold straight lines in Figure 5(c). We term this
shortening the tightening of the given tour by the linear shortcut method. We
term the shorter tour derived from the LC-tour the Tight Label Covering tour or
TLC-tour.
[Fig. 6. Comparison between the input LC-tour (dotted path) and the TLC-tour (solid path) derived in Iteration 4.]
Here |t_TLC|_i is the length of the TLC-tour derived in iteration i. We stop the
iterations of path-shortening as soon as the path gain falls below a threshold
such as 1% or 5%. The TLC-tour generator has a time complexity of O(n^2).
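As a rough illustration (ours, not the authors' code), the iteration driver can be written as the short loop below. We assume the path gain of iteration i is the relative reduction in tour-length, (|t_TLC|_{i-1} - |t_TLC|_i)/|t_TLC|_{i-1}, and tighten() stands for one pass of the linear shortcut described above.

import math

def tour_length(points):
    """Length of a closed tour given as a list of (x, y) way-points."""
    return sum(math.dist(points[i], points[(i + 1) % len(points)])
               for i in range(len(points)))

def shorten(tour, tighten, threshold=0.05):
    """Apply the tightening pass until the path gain drops below the threshold."""
    prev_len = tour_length(tour)
    while True:
        tour = tighten(tour)                   # one linear-shortcut pass
        cur_len = tour_length(tour)
        gain = (prev_len - cur_len) / prev_len # path gain of this iteration
        if gain < threshold:                   # e.g. 0.01 or 0.05
            return tour
        prev_len = cur_len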
5 Experimental Results
We use the Castalia [17] framework of the OMNeT++ simulator to distribute sensor
nodes randomly. The sensor nodes also generate packets randomly. We use the
Concorde TSP Solver [18] to compute the optimal TSP-tour. We derive the LC-tour
from the TSP-tour and the TLC-tour from the LC-tour using a path-gain threshold
of 5%. The MDC tours continuously in the TSP-tour, LC-tour and TLC-tour.
During its travel, it collects the packets from the sensor nodes and deposits
them at the sink node. We vary the value of TXR from 2 m to 32 m. A low value
of TXR indicates a lower degree of connectivity and hence a sparse network.
Similarly, a higher value of TXR indicates a dense network.
As shown in Figure 8(a), the average packet delivery latency is always the lowest
for the TLC-tour and the highest for the TSP-tour. The value is comparatively better
in the case of the sparse WSN. In Figure 8(b), the throughput for the entire run
is shown. The value is always the highest for the TLC-tour and the lowest for the
TSP-tour.
6 Conclusion
We have given a framework for shortening a given tour of the MDC. The resulting
tour decreases packet delivery latency and increases throughput. The
TLC-tour derived by the Linear Shortcut of the LC-tour is highly suitable for real-time
WSNs in which high latency is undesirable.
In our future work, we shall consider more objectives besides minimizing latency,
for example, facilitating multi-hop forwarding among the sensor nodes and
load-balancing of the network traffic. We shall also extend the path-planning
to a WSN with multiple MDCs and multiple sinks.
References
1. Akyildiz, I.F., Su, W., Sankarasubramaniam, Y., Cayirci, E.: Wireless sensor networks: a survey. Comput. Netw. 38(4), 393–422 (2002)
2. Rajagopalan, R., Varshney, P.K.: Data-aggregation techniques in sensor networks: a survey. IEEE Communications Surveys & Tutorials 8(4), 48–63 (2006)
3. Di Francesco, M., Das, S.K., Anastasi, G.: Data collection in wireless sensor networks with mobile elements: A survey. ACM Transactions on Sensor Networks 8, 7:1–7:31 (2011)
4. Zhao, W., Ammar, M.H.: Message ferrying: Proactive routing in highly-partitioned wireless ad hoc networks. In: Proceedings of the Ninth IEEE Workshop on Future Trends of Distributed Computing Systems, FTDCS 2003, pp. 308–314. IEEE Computer Society, Washington, DC (2003)
5. Shah, R.C., Roy, S., Jain, S., Brunette, W.: Data mules: Modeling a three-tier architecture for sparse sensor networks. In: IEEE SNPA Workshop, pp. 30–41 (2003)
6. Zhao, W., Ammar, M., Zegura, E.: A message ferrying approach for data delivery in sparse mobile ad hoc networks. In: Proceedings of the 5th ACM International Symposium on Mobile Ad Hoc Networking and Computing, MobiHoc 2004, pp. 187–198. ACM, New York (2004)
7. Jun, H., Zhao, W., Ammar, M.H., Zegura, E.W., Lee, C.: Trading latency for energy in wireless ad hoc networks using message ferrying. In: Proceedings of the Third IEEE International Conference on Pervasive Computing and Communications Workshops, PERCOMW 2005, pp. 220–225. IEEE Computer Society, Washington, DC (2005)
8. Wang, Z.M., Basagni, S., Melachrinoudis, E., Petrioli, C.: Exploiting sink mobility for maximizing sensor networks lifetime. In: Proceedings of the 38th Annual Hawaii International Conference on System Sciences - Volume 09, HICSS 2005, p. 287.1. IEEE Computer Society, Washington, DC (2005)
9. Rao, J., Wu, T., Biswas, S.: Network-assisted sink navigation protocols for data harvesting in sensor networks. In: WCNC 2008, IEEE Wireless Communications & Networking Conference, Las Vegas, Nevada, USA, March 31-April 3, Conference Proceedings, pp. 2887–2892. IEEE (2008)
10. Conti, M., Pelusi, L., Passarella, A., Anastasi, G.: Mobile-relay Forwarding in Opportunistic Networks. In: Adaptation and Cross Layer Design in Wireless Networks. CRC Press (2008)
11. Ma, M., Yang, Y.: SenCar: An energy-efficient data gathering mechanism for large-scale multihop sensor networks. IEEE Transactions on Parallel and Distributed Systems 18, 1476–1488 (2007)
12. Sugihara, R., Gupta, R.K.: Improving the data delivery latency in sensor networks with controlled mobility. In: 4th IEEE International Conference on Distributed Computing in Sensor Systems, pp. 386–399. Springer, Heidelberg (2008)
13. Sugihara, R., Gupta, R.K.: Path planning of data mules in sensor networks. ACM Transactions on Sensor Networks 8(1), 1:1–1:27 (2011)
14. Yuan, Y., Peng, Y.: Racetrack: an approximation algorithm for the mobile sink routing problem. In: Nikolaidis, I., Wu, K. (eds.) ADHOC-NOW 2010. LNCS, vol. 6288, pp. 135–148. Springer, Heidelberg (2010)
15. Bhadauria, D., Tekdas, O., Isler, V.: Robotic data mules for collecting data over sparse sensor fields. J. Field Robot. 28(3), 388–404 (2011)
16. Huang, P., Xiao, L., Soltani, S., Mutka, M., Xi, N.: The evolution of MAC protocols in wireless sensor networks: A survey. IEEE Communications Surveys & Tutorials PP(99), 1–20 (2012)
17. Castalia WSN simulator framework for OMNeT++, http://www.castelia.org
18. Concorde TSP solver, http://www.tsp.gatech.edu/concorde.html
Towards Fault-Tolerant Chord P2P System:
Analysis of Some Replication Strategies
Rafał Kapelko
1 Introduction
In recent years, Peer-to-Peer (P2P) networks (see e.g. [21], [17], [7], [15]) have been
powerful environments for successfully sharing certain resources. Despite the advantages
of P2P systems, there exists a significant problem with fault tolerance. When
some nodes depart from the system without uploading the gathered information, some data
stored on the departed nodes may be lost.
Among the structured P2P systems, the Chord protocol, introduced in [21] and developed
in [14], is one of the most popular, and we focus on it in this paper. In the Chord P2P
system each such unexpected node failure results in losing some data stored in the
node (see [2], [11]).
The replication techniques are used for storing documents at many nodes in the
system. The benefits of replication are that it can improve availability of documents
in case of failure [22]. Different strategies for replication in the Chord P2P system, such
as successor-list, symmetric replication or multiple hash functions, are discussed in
many publications (see [5], [18], [10], [16], [3], [20], [19]). Among the replication schemes,
successor-list replication is very popular and often applied in ring-based networks.
Our main contributions include:
– We propose a novel analytical model describing the process of unexpected node
failures. Let ε > 0 and let d be the replication degree in Chord with successor-list
replication. We prove that for large n, after the failure of n^{1−1/d−ε} nodes from Chord with
successor-list replication, no document is lost with high probability (see Theorem 3).
Our considerations, dedicated to the Chord P2P system, are also applicable to other
ring-based overlays such as Pastry [7] (with some modification), RP2P [1], Cassandra [12],
CRP-overlay [13] and others.
– We also investigate the fault tolerance of the overlay system called k multiple chord
rings. We prove that after the simultaneous unexpected departure of n^{1−1/k}/n^{1/(2k)} nodes from
k multiple chord rings no information disappears with high probability.
– As a practical application of our theoretical estimations we discuss the recovery
mechanism for partially lost information in both replication schemes. We show how
the fault tolerance of the Chord system changes for different replica degrees. We
also explain how to choose the proper replication degree to optimize these two
replication strategies.
The remainder of this article is organised as follows. In Section 2 we describe the
successor-list replication scheme in the Chord P2P system and present analytical formulas
describing the resistance to loss of documents. Section 3 presents the overlay k Chord.
Then, in Section 4, we discuss the recovery mechanism for partially lost information.
Finally, Section 5 summarizes and concludes our work.
2 Successor-List Replication
In this section, we briefly describe the successor-list replication scheme in the Chord P2P
system [5]. For a more detailed description of the Chord P2P system the reader is referred
to [21]. Then, we analyze the process of unexpected departure.
In the Chord P2P system nodes are arranged in a logical ring. The positions of nodes
and documents are created by a hash operation, which results in random placement on
the Chord ring. Let n denote the number of nodes in Chord. Then, the identifier space
of the nodes is defined as the set of integers {0, 1, . . . , n − 1}.
Let (0, 1, . . . , n − 1) denote the Chord ring. The first successor of a node with identifier
p is the first node found going in the clockwise direction on the ring starting at p. The
predecessor of a node with identifier p is the first node found going in the anticlockwise
direction on the ring starting at p. Each node p is responsible for storing the documents between
p's predecessor and p. To ensure the connectivity of the Chord ring, each node knows the
first s nodes whose identifiers are greater than the node's; these nodes are the node's
successors. Additionally, each node maintains a table of fingers of size O(log(n)), which
point to other nodes in the ring. Hence the size of the routing table is O(log(n) + s).
Let d < s. For a replication degree of d in successor-list replication, a document is
stored on the node p which is responsible for storing the document and on the d − 1 immediate
successors of p. Let the node p leave the system in an unexpected way, i.e., without
uploading its documents into the system. After the departure there are still (d − 1)
replicas of the documents of the node p.
For p = 0, 1, . . . , n − 1 and d > 1 we define a family of sets by the formula U_{p,d} = {p, (p + 1) mod n, . . . , (p + d − 1) mod n}.
Remark 1. Notice that, if the set A is safe, then no information disappears from Chord
with successor-list replication of degree d after simultaneous unexpected departure of
all nodes from A.
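To make the replica placement concrete, the following sketch (ours, not the paper's code; it assumes the idealized model above in which the node identifiers are exactly 0, 1, . . . , n − 1) lists the nodes holding a document owned by node p under successor-list replication of degree d, and checks whether a set of simultaneously departing nodes is safe, i.e. whether it leaves at least one replica of every document.

def replica_set(p, d, n):
    """Nodes holding the documents of node p: p and its d-1 immediate successors."""
    return {(p + i) % n for i in range(d)}

def is_safe(departed, d, n):
    """A departure set is safe iff no replica set is entirely contained in it."""
    departed = set(departed)
    return not any(replica_set(p, d, n) <= departed for p in range(n))

# Example: n = 8 nodes, replication degree d = 3.
# Losing nodes {2, 3, 4} destroys every replica of node 2's documents,
# while losing {1, 3, 5} leaves at least one replica of every document.
assert not is_safe({2, 3, 4}, d=3, n=8)
assert is_safe({1, 3, 5}, d=3, n=8)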
Proof. First observe that for |A| < d we have Pr[A is safe] = 1. Therefore, we may
assume that |A| ≥ d. Then
Pr[A is unsafe] = Pr[ ⋃_{p ∈ {0,1,...,n−1}} { U_{p,d} ∩ A = U_{p,d} } ] ≤ Σ_{p=0}^{n−1} Pr[ U_{p,d} ∩ A = U_{p,d} ]
= n · \binom{n−d}{|A|−d} / \binom{n}{|A|} ≤ n · |A|^d / (n − (d−1))^d
= \frac{1}{(1 − \frac{d−1}{n})^d} · \frac{|A|^d}{n^{d−1}} ≤ \frac{1}{(1 − \frac{d−1}{n})^d} · \frac{1}{\sqrt{n}}.  □
Theorem 3. Let ε > 0. Let d ≥ 2. Let n be the number of nodes in Chord with
successor-list replication of degree d and let A be a set of nodes. If |A| ≤ n^{1−1/d−ε}
then
lim_{n→∞} Pr[A is unsafe] = 0.
Proof. As in the proof of Theorem 2, observe that for |A| < d we have Pr[A is safe] =
1. Therefore, we may assume that |A| ≥ d. Then
lim_{n→∞} Pr[A is unsafe] ≤ lim_{n→∞} n · \binom{n−d}{n^{1−1/d−ε}−d} / \binom{n}{n^{1−1/d−ε}} = 0.  □
where R_{i,n} is the i-th Chord ring. The nodes of the first Chord ring are expressed in
the form R_{1,n} = (0, 1, 2, ..., n−1) and those on the i-th ring (i = 2, . . . , k) are generated by
an independent random permutation π_i ∈ S_n, i.e., R_{i,n} = (π_i(0), π_i(1), . . . , π_i(n−1)).
Each document has a unique position and is mapped into the same location on the different
Chord rings.
In the k Chord hybrid system each node maintains a k-dimensional finger table and a
successor list of size s (s > k) on each Chord ring. Hence the routing table is of size
O(k(log(n) + s)). With k constant and small (e.g., k ≤ 5), the size of the routing
table is O(log(n) + s), which is the same as in Chord. Let us consider the event when
the node p leaves the system in an unexpected way, without uploading the stored documents
into the system. In classical Chord such an event results in losing the information stored
on the node p. The situation changes in hybrid k Chord. There are (k − 1) replicas of
the documents stored on the nodes π_i(p) on the i-th ring (i = 2, . . . , k). The probability
of losing the information stored on the node p equals 1/(n−1)^{k−1}.
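A toy rendering of this structure (ours, and deliberately simplified: node identifiers are 0, . . . , n−1 and the π_i are drawn as uniform random permutations, as in the model above) builds the k rings and lists, for a node p, the nodes that hold replicas of its documents according to the description above.

import random

def build_k_chord(n, k, seed=0):
    """Ring 1 is the identity ordering; rings 2..k are independent random permutations."""
    rng = random.Random(seed)
    rings = [list(range(n))]                 # R_{1,n} = (0, 1, ..., n-1)
    for _ in range(2, k + 1):
        perm = list(range(n))
        rng.shuffle(perm)                    # pi_i drawn uniformly from S_n
        rings.append(perm)
    return rings

def replica_holders(p, rings):
    """Node p itself plus, per the description above, pi_i(p) on rings 2..k."""
    return {p} | {ring[p] for ring in rings[1:]}

rings = build_k_chord(n=10, k=3)
print(replica_holders(4, rings))   # e.g. {4, pi_2(4), pi_3(4)}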
Let us remark that, if the set A is safe, then no information disappears from the hybrid
system after the simultaneous unexpected departure of all nodes from A.
Theorem 2. Let k ≥ 2. Let n be the number of nodes in the structure k Chord and let
A be a random subset of nodes from the structure k Chord. If |A| ≤ n^{1−1/k}/n^{1/(2k)} then
Pr[A is safe] ≥ 1 − 1/√n.
Notice that, the bounds in Theorem 2 do not depend on the number of documents put
into the system.
Proof. For s = 1, . . . , l and i = 2, . . . , k we have
Pr[A_s ∩ K_{A,i} ≠ ∅] = Pr[(∃p)(π_i(n_p) = n_s)] = Pr[⋃_{p=1}^{l} {π_i(n_p) = n_s}] = l/n.
Hence
Pr[A is unsafe] = Pr[⋂_{i=1}^{k} K_{A,i} ≠ ∅] ≤ Σ_{s=1}^{l} Pr[A_s ∩ ⋂_{i=2}^{k} K_{A,i} ≠ ∅]
and
Σ_{s=1}^{l} Pr[A_s ∩ ⋂_{i=2}^{k} K_{A,i} ≠ ∅] = Σ_{s=1}^{l} Π_{i=2}^{k} Pr[A_s ∩ K_{A,i} ≠ ∅] ≤ l · (l/n)^{k−1}.
Finally notice that, for l ≤ n^{1−1/k}/n^{1/(2k)}, we have l · (l/n)^{k−1} ≤ 1/√n, so the Theorem is proved.  □
Table 1. Time for recovery of partially lost information T_recovery as a function of the number of
nodes N ∈ [10^4, 10^6] for different replica degrees D = 2, 3, 4, 5, with fixed T = 1800 sec and
u = 5%
[Four plots of T_recovery (in seconds) versus N, for D = 2, 3, 4 and 5 replicas.]
Table 2. Notations
Symbol Meaning
N the average number of nodes in the system
Nr the medium number of nodes waiting for complementing
T the average time a node spends in a system (during one session)
u the proportion of unexpected departures among all departures
Tr the average time needed for recovery of partially lost information
the average number of departures from the system in a second
m the average number of unexpected departures from the system in a second
Tc,d the time in which nodes: p, p + 1, . . . p + d 1, check the node p + d
the average number of documents in the system per one node
ds the size of the information item in the system
ts the transmission speed
Trecovery time for recovery of partially lost information for different replica degrees
so
T_r = \frac{T}{u} · \frac{N_r}{N}.
u N
For two replication schemes discussed in this paper we have obtained the following
safety bound
1
N 1 D
Nr = 1 , (2)
N 2D
where D is a replication degree (D = d for successor-list replication of degree d,
D = k for k Chord) . If we want to keep our system in a safe configuration with
high probability greater than 1 1N , then the following inequality must hold
T Nr
Tr . (3)
u N
Therefore, time for recovery of partially lost information for different replica degrees
is:
T Nr
Trecovery = . (4)
u N
Let T = 30 minutes (see [6], [23]). Suppose that u = 5% of the nodes leave the system in
an unexpected way without uploading the gathered information. In Table 1 we put four figures
describing the function T_recovery of the number of nodes N ∈ [10^4, 10^6] for different
replica degrees D = 2, 3, 4, 5 with fixed T = 1800 sec and u = 5%. The system gains
more time for recovery of partially lost information when the replica degree increases.
The following table contains the time for recovery of partially lost information with
fixed parameters N = 10^5, T = 1800 sec, u = 5%, for different replica degrees:

d           2      3        4      5         6         7         8
T_recovery  6 sec  113 sec  8 min  18.9 min  33.7 min  50.9 min  1 h 9 min
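The values in this table follow directly from Equations (2) and (4); the short script below (ours, for illustration only) reproduces them.

def recovery_time(N, D, T=1800.0, u=0.05):
    """T_recovery = (T/u) * (N_r/N) with N_r = N**(1 - 1/D) / N**(1/(2*D))."""
    Nr = N ** (1 - 1.0 / D) / N ** (1.0 / (2 * D))
    return (T / u) * (Nr / N)

for d in range(2, 9):
    print(d, round(recovery_time(N=1e5, D=d)), "sec")
# Prints roughly 6, 114, 480, 1138, 2024, 3053, 4157 seconds,
# matching the 6 sec ... 1 h 9 min row above.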
Let us consider a network with N = 10^5 nodes on average (T = 30 minutes, u = 5%).
From equation (1) we get m = (N/T) · u ≈ 2.77, so approximately 2.77 nodes unexpectedly
leave the network in one second.
Assume that the nodes p + 1, . . . , p + d − 1 in the network check the node p + d periodically
with a period of T_{c,d} seconds (see Algorithm 1). Assume that T_{c,d} < 2 sec.
During this time the nodes p + 1, . . . , p + d − 1 do not know that the node p + d has left
the network. Then, there are T_r − T_{c,d} seconds for information recovery (see inequality (3)).
During this time the nodes p + 1, . . . , p + d − 1 ask their successors for a portion of
documents in 0.2 sec. Then, the nodes p + 1, . . . , p + d − 1 wait for receiving the necessary
information. The nodes p + 1, . . . , p + d − 1 send some portion of the documents to some of
their successors. Therefore, the following inequality guarantees that the system is in a safe
configuration with high probability
T_{c,d} + 0.2 + 3 · \frac{d_s}{t_s} · (avg. number of documents per node) ≤ \frac{T}{u} · N^{−3/(2d)}.    (5)
The inequality (5) can be used in a practical way. It shows how to choose the proper
replication degree d to optimize the recovery process. For the estimated parameters
d_s, t_s, T, u, N of the network (together with the average number of documents per node)
and fixed T_{c,d}, the optimal replica degree is the
minimal natural d (d > 1) for which the inequality (5) holds.
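A direct way to apply this rule is to scan d upwards until (5) holds; the sketch below (ours) does exactly that, with docs_per_node standing in for the average number of documents per node from Table 2.

def min_replica_degree(docs_per_node, ds, ts, T, u, N, Tcd=2.0, d_max=32):
    """Smallest d > 1 satisfying inequality (5); None if no d up to d_max works."""
    lhs = Tcd + 0.2 + 3 * docs_per_node * ds / ts    # time needed for recovery
    for d in range(2, d_max + 1):
        rhs = (T / u) * N ** (-3.0 / (2 * d))        # time available, cf. Eq. (4)
        if lhs <= rhs:
            return d
    return None

# Example with the paper's settings: ds = 1.57 kB documents, ts = 100 kB/s,
# T = 1800 s, u = 5%, N = 10^5, and an assumed 100 documents per node.
print(min_replica_degree(100, ds=1.57, ts=100.0, T=1800.0, u=0.05, N=1e5))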
Let us assume that the transmission speed is t_s = 100 kB/s. As in [23] we consider two
scenarios with sizes of documents d_s of 1.57 kB and 1 MB, respectively.
Table 3 contains the upper bounds on the average number of documents per node for
different replica degrees, when the system stays in a safe configuration.
Table 3. Values of the upper bound on the average number of documents per node for different
replica degrees and sizes of documents, with fixed T_{c,d} = 2 sec, T = 30 minutes, u = 5%,
N = 10^5, t_s = 100 kB/s
4.2 k Chord
In the k Chord network the recovery process of partially lost information is more
complicated than in successor-list replication.
Let us consider a network with N = 10^5 nodes on average (T = 30 minutes, u = 5%).
From equation (1) we get m = (N/T) · u ≈ 2.77, so approximately 2.77 nodes unexpectedly
leave the network in one second.
Assume that the nodes p + 1, . . . , p + k − 1 in each Chord ring check the node p + k
periodically with a period of T_{c,k} seconds (see Algorithm 2). Assume that T_{c,k} < 2 sec.
During this time the nodes p + 1, . . . , p + k − 1 do not know that the node p + k has left
the network. Then, there are T_r − T_{c,k} seconds for information recovery (see inequality
(3)). During this time the nodes p + 1, . . . , p + k − 1 localise, in the remaining (k − 1)
Chord rings, the nodes responsible for storing the lost documents. The time for fulfilling this
operation equals approximately 1/2 · lg_2 N · 0.2 sec. Then, the nodes p + 1, . . . , p + k − 1
wait for receiving the necessary information, and this time is approximately d_s/t_s per
document. The nodes p + 1, . . . , p + k − 1 send the received documents to some of their
successors. Therefore, the following inequality guarantees that the system is in a safe
configuration with high probability
T_{c,k} + \frac{1}{2} · lg_2 N · 0.2 + 2 · \frac{d_s}{t_s} · (avg. number of documents per node) ≤ \frac{T}{u} · N^{−3/(2k)}.    (6)
Similarly to successor-list replication, inequality (6) can be used in a practical way. It
shows how to choose the proper number k of Chord rings to optimize the recovery
process. For the estimated parameters d_s, t_s, T, u, N of the network and fixed T_{c,k}, the
optimal replica degree is the minimal natural k (k > 1) for which the inequality (6)
holds.
Table 4. Values of the upper bound on the average number of documents per node for different
k Chord and sizes of documents, with fixed T_{c,k} = 2 sec, T = 30 minutes, u = 5%, N = 10^5,
t_s = 100 kB/s
Let us assume that the transmission speed is t_s = 100 kB/s. As in [23] we consider two
scenarios with sizes of documents d_s of 1.57 kB and 1 MB, respectively.
Table 4 contains the upper bounds on the average number of documents per node for
different k Chord, when the system stays in a safe configuration.
5 Conclusion
In this paper, we study two replication strategies for the Chord P2P system: successor-list
and multiple Chord rings. In both cases we obtain estimations describing the resistance
to loss of documents. As a practical illustration of our estimated safety bounds we
discuss the recovery mechanism for partially lost information. We show how the fault tolerance
of the Chord system changes depending on the replica degree and explain how
to choose the proper replication degree to optimize these two replication strategies.
In future research we plan to experimentally evaluate the efficiency of the proposed recovery
algorithms. We also plan to extend our work to the symmetric replication scheme.
References
1. Chen, S., Li, Y., Rao, K., Zhao, L., Li, T., Chen, S.: Building a scalable P2P network with small routing delay. In: Zhang, Y., Yu, G., Bertino, E., Xu, G. (eds.) APWeb 2008. LNCS, vol. 4976, pp. 456–467. Springer, Heidelberg (2008)
2. Cichon, J., Jasinski, A., Kapelko, R., Zawada, M.: How to improve the reliability of Chord? In: Meersman, R., Tari, Z., Herrero, P. (eds.) OTM-WS 2008. LNCS, vol. 5333, pp. 904–913. Springer, Heidelberg (2008)
3. Cichon, J., Kapelko, R., Marchwicki, K.: Brief announcement: A note on replication of documents. In: Défago, X., Petit, F., Villain, V. (eds.) SSS 2011. LNCS, vol. 6976, pp. 439–440. Springer, Heidelberg (2011)
4. Cooper, R.B.: Introduction to Queueing Theory, 2nd edn. Elsevier North Holland, Inc. (1981)
5. Dabek, F., Kaashoek, M., Karger, D., Morris, R., Stoica, I.: Wide-area cooperative storage with CFS. In: 18th ACM Symposium on Operating Systems Principles (SOSP), New York, USA, pp. 202–215 (2001)
6. Derek, L., Zhong, Y., Vivek, R., Loguinov, D.: On lifetime-based node failure and stochastic resilience of decentralized peer-to-peer networks. IEEE/ACM Transactions on Networking 5(15), 644–656 (2007)
7. Rowstron, A., Druschel, P.: Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems. In: Guerraoui, R. (ed.) Middleware 2001. LNCS, vol. 2218, pp. 329–350. Springer, Heidelberg (2001)
8. Flocchini, P., Nayak, A., Xie, M.: Hybrid-Chord: A peer-to-peer system based on Chord. In: Ghosh, R.K., Mohanty, H. (eds.) ICDCIT 2004. LNCS, vol. 3347, pp. 194–203. Springer, Heidelberg (2004)
9. Flocchini, P., Nayak, A., Xie, M.: Enhancing peer-to-peer systems through redundancy. IEEE Journal on Selected Areas in Communications 1(25), 15–24 (2007)
10. Ghodsi, A., Alima, L.O., Haridi, S.: Symmetric replication for structured peer-to-peer systems. In: Moro, G., Bergamaschi, S., Joseph, S., Morin, J.-H., Ouksel, A.M. (eds.) DBISP2P 2005 and DBISP2P 2006. LNCS, vol. 4125, pp. 74–85. Springer, Heidelberg (2007)
11. Park, G., Kim, S., Cho, Y., Kook, J., Hong, J.: Chordet: An Efficient and Transparent Replication for Improving Availability of Peer-to-Peer Networked Systems. In: 2010 ACM Symposium on Applied Computing (SAC), Sierre, Switzerland, pp. 221–225 (2010)
12. Lakshman, A., Malik, P.: Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review 2(44), 35–40 (2010)
13. Li, Z., Hwang, K.: Churn resilient protocol for massive data dissemination in P2P networks. IEEE Transactions on Parallel and Distributed Systems 22, 1342–1349 (2011)
14. Liben-Nowell, D., Balakrishnan, H., Karger, D.: Analysis of the Evolution of Peer-to-Peer Systems. In: ACM Conference on Principles of Distributed Computing, Monterey, California, USA, pp. 233–242 (2002)
15. Maymounkov, P., Mazières, D.: Kademlia: A Peer-to-Peer Information System Based on the XOR Metric. In: Druschel, P., Kaashoek, M.F., Rowstron, A. (eds.) IPTPS 2002. LNCS, vol. 2429, pp. 53–65. Springer, Heidelberg (2002)
16. Pitoura, T., Ntarmos, N., Triantafillou, P.: Replication, load balancing and efficient range query processing in DHTs. In: Ioannidis, Y., Scholl, M.H., Schmidt, J.W., Matthes, F., Hatzopoulos, M., Böhm, K., Kemper, A., Grust, T., Böhm, C. (eds.) EDBT 2006. LNCS, vol. 3896, pp. 131–148. Springer, Heidelberg (2006)
17. Ratnasamy, S., Francis, P., Handley, M., Karp, R., Shenker, S.: A Scalable Content-addressable Network. In: SIGCOMM 2001, San Diego, California, USA, pp. 161–172 (2001)
18. Rieche, S., Wehrle, K., Landsiedel, O., Götz, S., Petrak, L.: Reliability of data in structured peer-to-peer systems. In: HOT-P2P 2004, Volendam, The Netherlands, pp. 108–113 (2004)
19. Kapelko, R.: Towards efficient replication of documents in Chord: Case (r,s) erasure codes. In: Liu, B., Ma, M., Chang, J. (eds.) ICICA 2012. LNCS, vol. 7473, pp. 477–483. Springer, Heidelberg (2012)
20. Shafaat, T.M., Ahmad, B., Haridi, S.: ID-replication for structured peer-to-peer systems. In: Kaklamanis, C., Papatheodorou, T., Spirakis, P.G. (eds.) Euro-Par 2012. LNCS, vol. 7484, pp. 364–376. Springer, Heidelberg (2012)
21. Stoica, I., Morris, R., Karger, D., Kaashoek, M.F., Balakrishnan, H.: Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. In: SIGCOMM 2001, San Diego, California, USA, pp. 149–160 (2001)
22. Vu, Q., Lupu, M., Ooi, B.: Peer-to-Peer Computing. Springer (2010)
23. Wang, C., Harfoush, K.: On the stability-scalability tradeoff of DHT deployment. In: INFOCOM 2007, 26th IEEE International Conference on Computer Communications, Anchorage, Alaska, USA, pp. 2207–2215 (2007)
A MapReduce-Based Method
for Learning Bayesian Network from Massive Data
Qiyu Fang1, Kun Yue1,2,*, Xiaodong Fu3, Hong Wu1, and Weiyi Liu1
1 Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, 650091, Kunming, China
2 Key Laboratory of Software Engineering of Yunnan Province, 650091, Kunming, China
3 Faculty of Information Engineering and Automation, Kunming University of Science and Technology, 650500, Kunming, China
[email protected]
1 Introduction
With the rapid development of Internet and Web applications, more and more data are
generated and collected in scientific experiments, e-commerce and social network
applications [1]. In this case, computationally intensive analysis results in complex
and stringent performance demands that are not satisfied by any existing data management
system [2, 3]. Centered on massive data management and analysis, data intensive
computing may generate many queries, each involving access to supercomputer-class
computations on gigabytes or terabytes of data [4]. Thus, knowledge discovery from
massive data is naturally indispensable for the applications of data intensive paradigms,
such as Web 2.0 and large-scale business intelligence. However, it is quite
challenging to incorporate the specialties of massive data [1].
*
Corresponding author.
where the parameters Nij and Nijk can be obtained by statistics from data. Actually, the
values of Nij and Nijk of each record can be obtained in parallel, and then the values of
these two parameters of all records can be summed to obtain the final values required
in Equation (1). It can be observed that the parameters can be obtained separately
(i.e., independently), and the scoring function for local structures can be evaluated
separately as well. For each node, the local optimal structures could be chosen by
scoring all candidate ones separately. So, we can merge all local optimal structures to
obtain the global optimal one as the final result. The above observations illustrate
that both the scoring and the search process can be executed in parallel.
From the parallel point of view, some parallel methods for BN learning [12, 13, 14, 15]
have been developed. As a representative, the dynamic programming algorithm
[14] is used to obtain the local and global optimal structures. However, these methods
only focus on parallelization during algorithm execution and the allocation for
processor cooperation, and have not considered the throughput computing needed to access
massive data. Thus, it is desirable to explore a BN learning method that can be executed
in parallel and can access massive data efficiently.
Fortunately, MapReduce is a programming model for processing and generating
large data sets [16]. It not only offers a parallel programming model, but also can
conduct and process massive data. It has been used in data analysis and dependency-analysis-based
BN learning [17, 18, 19, 20, 21]. Although MapReduce cannot reveal
its advantages on data of normal volume, it is suitable for learning a BN
from massive data. Using MapReduce to extend the current centralized K2 algorithm,
we explore a parallel and data-intensive K2 algorithm. The purpose of our study is to make
the MapReduce-based K2 fit the demands of massive data.
Hadoop is an open source implementation of MapReduce, and has been used extensively
outside of Google by a number of organizations [22]. In this paper, we propose
a MapReduce-based K2 in which the scoring and search process can be
executed in parallel on Hadoop. Generally, the contributions of this paper are as
follows:
The rest of this paper is organized as follows: Section 2 introduces K2 algorithm and
MapReduce. Section 3 extends the K2 algorithm based on MapReduce. Section 4
gives experimental results. Section 5 concludes and discusses future work.
2 Preliminaries
(1) V is the set of random variables and makes up the nodes of the network.
(2) E is the set of directed edges connecting pairs of nodes. An arrow from node X
to node Y means that X has a direct influence on Y (X, Y ∈ V and X ≠ Y).
(3) Each node has a CPT that quantifies the effects that the parents have on the
node. The parents of node X are all those that have arrows pointing to it.
Let Z be a set of n discrete variables, where a variable xi in Z has ri possible value
assignments: (v_{i1}, ..., v_{ir_i}). Let D be a database with m records (i.e., rows), each of which
contains a value assignment for each variable in Z. Let BS denote a belief-network
structure containing just the variables in Z. Each variable xi in BS has a set of parents,
which we represent with a list Li of variables. Let wij denote the jth unique
instantiation of Li relative to D. Suppose there are qi such unique instantiations
of Li. Define Nijk as the number of cases in D in which the variable xi has the value vik and
Li is instantiated as wij. Here let
N_ij = Σ_{k=1}^{r_i} N_ijk    (2)
For each node i and its parents πi, we use Equation (1) to compute the score of the
candidate structures of each node, and use Pred(xi) to denote the set of nodes preceding
xi. We consider the preceding parents of each node according to the node
sequence. In the scoring function g(i, πi), Nij and Nijk can be obtained by counting
statistics from the given massive data in parallel, which will be discussed in Section
3. The following pseudo-code expresses the heuristic search algorithm, called K2 [8].
Algorithm 1. K2
Input:
1) A set of n nodes, a sequence of the nodes
2) An upper bound u on the number of parent nodes
3) A database D containing m records
Output: A printout of the parents of each node
Steps:
For i := 1 to n Do
  πi := ∅
  Pold := g(i, πi)  //by Equation (1)
  OKToProceed := true
  While OKToProceed and |πi| < u Do
    Let z be a node in Pred(xi) \ πi maximizing g(i, πi ∪ {z})
    Pnew := g(i, πi ∪ {z})
    If Pnew > Pold Then
      Pold := Pnew
      πi := πi ∪ {z}
    Else OKToProceed := false
  End while
  output('Node:', xi, 'Parents of this node:', πi)
End for
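For concreteness, a compact (and unoptimized) Python rendering of this search is given below. It is our own sketch, not the authors' code: we use the classical Cooper–Herskovits form of the K2 scoring function g(i, πi), which is defined in terms of the counts Nij and Nijk above, and we score in log space to avoid overflow. Records are assumed to be tuples of integer values 0..r−1 indexed by variable.

import math
from collections import Counter

def log_g(i, parents, data, arity):
    """log of the Cooper-Herskovits K2 score g(i, parents) built from N_ij, N_ijk."""
    n_ijk = Counter()                       # (parent_config, value) -> N_ijk
    for row in data:
        w_ij = tuple(row[p] for p in parents)
        n_ijk[(w_ij, row[i])] += 1
    n_ij = Counter()                        # parent_config -> N_ij
    for (w_ij, _), c in n_ijk.items():
        n_ij[w_ij] += c
    r_i = arity[i]
    score = 0.0
    for w_ij, nij in n_ij.items():          # product over the observed parent configs
        score += math.lgamma(r_i) - math.lgamma(nij + r_i)   # (r_i-1)!/(N_ij+r_i-1)!
        for k in range(r_i):
            score += math.lgamma(n_ijk[(w_ij, k)] + 1)       # log(N_ijk!)
    return score

def k2(order, data, arity, u):
    """Greedy K2 parent search following Algorithm 1."""
    parents = {}
    for pos, x in enumerate(order):
        pi, p_old = [], log_g(x, [], data, arity)
        while len(pi) < u:
            candidates = [z for z in order[:pos] if z not in pi]   # Pred(x) \ pi
            if not candidates:
                break
            z = max(candidates, key=lambda c: log_g(x, pi + [c], data, arity))
            p_new = log_g(x, pi + [z], data, arity)
            if p_new > p_old:
                p_old, pi = p_new, pi + [z]
            else:
                break
        parents[x] = pi
    return parents

# Toy usage: 3 binary variables, ordering x0, x1, x2, at most u = 2 parents.
data = [(0, 0, 0), (1, 1, 1), (0, 0, 1), (1, 1, 0)]
print(k2(order=[0, 1, 2], data=data, arity={0: 2, 1: 2, 2: 2}, u=2))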
For each node, we get a list of its parents. Actually, we can score the candidate local
structures composed of the node and its parents in parallel. In the scoring of each
local structure, the parameters in Equation (1) can be counted from data in parallel.
The computation takes or produces a set of input or output key/value pairs. Users of
MapReduce library express the computation as two functions: Map and Reduce.
Map, written by users, takes an input pair and produces a set of intermediate
key/value pairs. The MapReduce library groups all intermediate values associated
with the same intermediate key together and passes them to the Reduce function.
Reduce, also written by users, accepts
an intermediate key and a set of values
for that key. It merges these values to
form a possibly smaller set of values. The
intermediate values are supplied to the
user's reduce function via an iterator.
This allows us to handle lists of values
that are too large to fit in memory [15].
Fig. 1 (a) indicates that we send the input data to the Master node, by which the
data are split into many maps sent to Slave nodes, which return their results to the
Master node, where the map results are reduced into the final one for output. The
NameNode and JobTracker are allocated on the Master node to control the whole task and
the DataNodes, and a TaskTracker is allocated on each Slave node to communicate with
the NameNode and JobTracker, as shown in Fig. 1 (b).
Fig. 1. MapReduce Process
In this architecture, massive data will be split into m map processes dynamically,
whose results will be collected into the final result by R (usually R = 1) reduce processes.
For the first record in Table 1, we can obtain the <key/value> pair <N(x1=0, x2=0), 0>
by Algorithm 2.1; that is, the number of records satisfying x1=0, x2=0 is 0 and the value
of Nij is 0. Similarly, we can obtain the pairs for the other records.
In the reduce process (Algorithm 2.2), the intermediate results <Nij, rk> and <Nijk,
rk> that have the same key will be collected and summed to obtain the parameters Nij
and Nijk required in Equation (1).
Algorithm 2.2. Reduce(String key, Iterator values)
//key: Nij or Nijk; values: in the Iterator and have the
//same key in <key/value> pairs of <Nij, rk> or <Nijk, rk>
result := 0
For each key/value pair Do
  value := Iterator.Next()
  result := result + value
End For
Output <key, result>
For the data in Table 1, we can obtain <N(x1=0, x2=0), 4> by Algorithm 2.2.
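The counting step can be mimicked outside Hadoop with a few lines of Python. The sketch below is our own toy stand-in for Algorithms 2.1 and 2.2 rather than the actual Hadoop job: for simplicity the map function emits one <key, 1> pair per record for each counter it touches, and the reduce function sums all pairs sharing the same key, which is how Nij and Nijk are accumulated.

from collections import defaultdict

def map_counts(record, i, parents):
    """Toy map step: emit <key, 1> pairs for the N_ij and N_ijk counters
    touched by one record (record is a dict: variable -> value)."""
    w_ij = tuple(sorted((p, record[p]) for p in parents))
    yield (("Nij", i, w_ij), 1)
    yield (("Nijk", i, w_ij, record[i]), 1)

def reduce_counts(pairs):
    """Toy reduce step: sum all values that share the same key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Example on four records with x1 = 0, x2 = 0, as in the text above:
data = [{"x1": 0, "x2": 0}] * 4
pairs = [p for rec in data for p in map_counts(rec, "x2", ["x1"])]
print(reduce_counts(pairs)[("Nij", "x2", (("x1", 0),))])   # -> 4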
Fig. 3. Candidate structures of x1 and x2. Fig. 4. Local optimal structures of x1 and x2. Fig. 5. Global optimal structure.
For node x1, the candidate structures are (x1 → x2), (x1 → x3) and (x1 → x2, x1 → x3), shown in Fig. 3 (a), Fig. 3 (b) and Fig. 3 (c) respectively. For node x2, the candidate structure (x2 → x3) is shown in Fig. 3 (d).
The candidate structures generated according to Seq and Pred(xi) are formed as
<key/value> pairs <xi, Si>, where Si is the set of l candidate structures of node xi, and
Si = {Si1, Si2, ..., Sil}. For example, the candidate structure of x2 will be <x2, x2 → x3>. In
Algorithm 3, the candidate local structures are split into M map processes (Algorithm 3.1)
and scored in parallel to obtain the local optimal one.
Algorithm 3. MapReduce-based Search Function
Input: candidate local structures of each node
Output: the local optimal structure of each node
The intermediate results generated by map processes are formed as <key/value> pairs
<xi&Sij, scoreij>. In each reduce process (Algorithm 3.2), the structure of node xi with
the highest score will be chosen as the local optimal structure of xi.
Algorithm 3.2. Reduce(String key, String value)
// key: node name xi & structure Sij, value: score of Sij
max:=0;
For each Sij of xi Do
If max<scoreij Then
max:=scoreij
Content:=key.Sij
Node:=key.xi
End If
Output <Node,Content>
End For
Now, we exemplify the structure search process on the data in Table 1. First, the
scores of the candidate structures (x1 → x2), (x1 → x3) and (x1 → x2, x1 → x3), shown in Fig. 3 (a),
Fig. 3 (b) and Fig. 3 (c), can be computed in parallel by map processes. The scores of
these three structures are 1.44×10^{-5}, 7.22×10^{-6} and 2.23×10^{-10} respectively. Thus,
(x1 → x2) is chosen as the optimal structure for x1 with the highest score by the
reduce process. In the same way, we choose the only candidate structure
(x2 → x3), shown in Fig. 3 (d), as the optimal structure for x2, with the highest score
7.22×10^{-5}.
Second, by merging the local optimal structures (x1 → x2) and (x2 → x3) shown in
Fig. 4, we can obtain the global optimal structure, shown in Fig. 5. According to the
ideas in [8], the candidate local structures generated by Algorithm 1 are based on the
sequence Seq, which makes the merging result the global optimal structure.
Suppose the massive sample data have m records over n variables. The time complexity
of the traditional K2 is O(m·n²). Assume the number of Slave nodes is large enough and
a map is executed on each Slave node, that is, one Slave node processes exactly one
record of data. If there are t Slave nodes, the time complexity of Algorithm 2 will be
O(m·n²/t), whose order is much lower than O(m·n²). Comparing the time consumption
of Algorithm 2 and that of Algorithm 3, the former fulfills the preprocessing and
mainly determines the efficiency of the whole method. Thus, we conclude that the
time complexity of our method is O(m·n²/t).
The BN's DAG can be obtained by Algorithm 2 and Algorithm 3. Then, we can
easily compute the CPTs for each node in the DAG by MapReduce-based ideas similar
to those adopted in Algorithm 2 when accessing the massive data.
4 Experimental Results
[Four plots: total time and algorithm time (in seconds) versus the number of DataNodes (1-3), for different data volumes.]
Fig. 6. Comparisons between total time and algorithm time for various Slave node numbers and
data volumes
algorithms, and the algorithm time is the execution time of Algorithm 2 excluding the
time for Hadoop initialization and I/O. First, we recorded the total time (TT) and algorithm
time (AT) on different-sized data with an increasing number of Slave nodes, as shown in Fig. 6.
Then, we recorded the total time with the increase of data volume with 1, 2 and 3
Slave nodes respectively, as shown in Fig. 7. It can be seen that the total time increases
slowly with the increase of data volume under 1, 2 or 3 Slave nodes. Meanwhile,
for each number of Slave nodes, the smaller the data size, the less the total time
will be. This means that the method for learning a BN from sample data can be run
efficiently on the Hadoop-based architecture. With the increase of machines in the
cluster, the BN can be constructed in less time.
Meanwhile, we recorded the algorithm time (including those of Algorithm 2 and
Algorithm 3) with the increase of data volume with 1, 2 and 3 Slave nodes respectively,
as shown in Fig. 8. It can be seen that the algorithm time increases with the increase
of data volume under 1, 2 or 3 Slave nodes. Meanwhile, for each number of Slave nodes,
the smaller the data size, the less the algorithm time will be. This means
that the map and reduce steps can be executed efficiently in Hadoop, which verifies
that our extension of the classical algorithm for BN learning is efficient and effective.
[Two plots: time (s) versus data size (15.26, 152.59, 305.18, 457.76 and 610.35 MB).]
Fig. 7. Total time with the increase of data volume
Fig. 8. Algorithm time with the increase of data volume
References
1. Kouzes, R., Anderson, G., Elbert, S., Gorton, L., Gracio, D.: The changing paradigm of data-intensive computing. IEEE Computer 42(1), 26–34 (2009)
2. Agrawal, D., El Abbadi, A., Antony, S., Das, S.: Data Management Challenges in Cloud Computing Infrastructures. In: Kikuchi, S., Sachdeva, S., Bhalla, S. (eds.) DNIS 2010. LNCS, vol. 5999, pp. 1–10. Springer, Heidelberg (2010)
3. Borkar, V., Carey, M., Grover, R., Onose, N., Vernica, R.: Hyracks: A flexible and extensible foundation for data-intensive computing. In: Abiteboul, S., Böhm, K., Koch, C., Tan, K. (eds.) Proc. of ICDE 2011, pp. 1151–1162. IEEE Computer Society, Hannover (2011)
4. Deshpande, A., Sarawagi, S.: Probabilistic graphical models and their role in databases. In: Koch, C., Gehrke, J., Garofalakis, M.N., et al. (eds.) VLDB 2007, pp. 1435–1436. ACM (2007)
1 Introduction
community, and hundreds of bugs will be reported daily. Different QAs might
report the same problem many times. This causes the issue of duplicate bug reports.
The phenomenon of duplicate bug reports is very common.
If duplicate bug reports of the same problem are triaged to different developers,
a lot of effort will be wasted and potential conflicts will arise.
There have been a number of works which focus on the duplicate bug report
detection issue in a bug tracking system. Existing methods [8][11][10][13][5][12][7]
can be divided into two different categories: relevant bug report retrieval and duplicate
bug report identification. Given a newly submitted bug report, the former
methods return a sorted list of existing reports that are similar to it, whereas
the latter methods identify whether it is a duplicate report or not. As far
as we know, the state-of-the-art search approach [7] can guarantee that the related
bug reports will be found in the top-10 search results with 82% probability
on some datasets if the new bug report really is a duplicate. This is already a very
good result. However, in an actual development environment, it still requires users
to spend a lot of effort reviewing the search results manually. From a practical
point of view, duplicate bug report identification can really reduce the developers'
effort. Yuan Tian et al. [10] reported their progress in the field of duplicate
bug report identification. The true positive rate was 24% while keeping the
false positive rate at 9%. It still needs to be improved in order to meet the needs
of practical application. In this paper, we propose PDB, a practical duplicate
bug report detection method, which helps developers to reduce the effort
of processing bug reports by combining relevant bug report retrieval and
duplicate bug report identification.
For the duplicate bug report issue, both types of existing methods need to
compute the similarity between reports. Different methods choose different
features to compute the similarity, such as textual similarity [8][11], surface features
(timestamp, severity, product and component, etc.) [10], execution traces [13],
etc. Almost all of these features are extracted from the bug reports themselves.
Existing works have tried almost all possible methods in the fields of information
retrieval and machine learning to address the duplicate bug report issue.
It is hard to improve the performance depending only on these features. We
propose some new features extracted from comments and user profiles, based on
observations of bug reports. In a development community, comments are
a necessary complement to bug reports. Furthermore, the reporter's knowledge
and experience affect the quality of the bug reports he submits. Experiments
show that these new features can effectively help us to further improve
the duplicate bug report detection performance.
The contributions of this paper include:
– We propose a practical duplicate bug report detection method, which adopts
three stages of classifiers to combine the two existing categories of methods. It
really helps developers to reduce the effort of processing bug reports.
– We propose some new features from comments, user profiles and query feedback
to improve the effectiveness of duplicate bug report detection.
2 Background
2.1 Web-Based Bug Tracking System
Nowadays web-based development communities have matured, since they remove
geographical restrictions and foster communication and collaboration among group
members. Various kinds of web-based development communities are widely used:
Web forums provide a place for users to submit and discuss programming questions.
Distributed revision control systems like GitHub facilitate sharing code with
other people and code management for multiple contributors.
A web-based bug tracking system is also an important kind of web-based
development community. It is designed to keep track of reported software bugs.
It allows remote access to resources anytime, anywhere. It engages all the roles,
including QAs, developers and project managers, on a single collaboration platform
over the web. Users can review bugs and share comments and resources
freely. All the knowledge is stored and centralized. Currently many web-based
bug tracking systems such as Bugzilla and JIRA are widely used in software
projects, especially in large open source software projects like Firefox and
Eclipse.
The bug is automatically sent to a triager to check the quality of the bug. If the triager
judges that the bug information is insufficient, the bug will be sent back to the submitter
to complete. If the bug is invalid, will never be fixed or is a duplicate, the triager will set
the resolution and close the bug. If the triager has completed the entire checklist for a bug,
the bug will be marked as triaged and assigned to the related developer. Next, the bug is
fixed by the developer. After the bug is resolved, QAs verify the resolution of the
bug, and close the bug when it is fixed.
– Firstly, in general it costs much time to analyze a bug report and judge
manually whether it is a duplicate. The situation gets worse with the huge
number of bug reports and limited manpower.
– Secondly, the same error causes failures at different levels of the program. In some
cases, duplicate bugs show completely different symptoms. Due to a lack of
global knowledge, it is hard for the triager to detect the relation between these bugs.
– Thirdly, a bug may not be filed against the appropriate component, which might
mislead triagers.
In this section we describe the techniques in PDB in detail. Overall, PDB adopts
three stages of classification in duplicate bug detection. The first stage predicts
how much contribution a comment makes to defining a bug. The second stage predicts
the possibility that two bugs are relevant and duplicates; this possibility
is used as the score for ranking. The third stage predicts whether a bug is
a duplicate.
Comments on bugs are the points of view and discussions about the bugs issued
by the owner or others. Among these, a large proportion of comments are supplements
to the original content, and help people better understand the observed result
and find the root cause of the bug. In some cases, when the summary and description
in a bug are vague and ambiguous, some key comments are critical to defining the
bug, and also contain important clues for finding duplicate bugs. So PDB takes
comments into consideration when calculating the degree of relevance between
bugs.
Comments in a bug report are issued by various roles with various kinds
of intents. Some comments supplement the description of the bug phenomenon
or possible root causes, which can assist in identifying the relevance
of bugs. But other comments just talk about the bug fixing procedure or the results
of bug processing; they deviate from the topic of the main body of the bug report and
are not crucial in identifying the relevance of bugs. So it is intuitive to weight
comments differentially based on their type, and to give the former type of
comments a higher weight.
In order to measure the weight of a comment, we divide all comments into
two classes: usefulness and less usefulness. The first class of comment is one
which describes the bug phenomenon, discusses the root cause of the bug or a solution
for the bug. The remaining comments are categorized into the second class. We set the
comment weight as the probability that the classifier puts the content into the usefulness
class. For comment ci in bug Y, we define the weight of this comment as follows:
W_Y(c_i) = p(usefulness | c_i)    (1)
An SVM with an RBF kernel is adopted as the classifier. There are four types of features used to
classify the types of comments: average IDF, bug status, length of the comment text,
and unigrams of terms within comments.
Average-IDF: From our observation, useful comments usually talk about
the technical details of programs, and contain much professional vocabulary.
These professional terms generally have high IDF values. On the contrary,
the less useful comments are usually full of common words. It is quite reasonable
that such comments might appear in other bug reports, like "please fix
as soon as possible".
Bug status: For a comment, the bug status indicates the current state when the
comment is submitted. Usually the intent of a comment has a strong relationship
with the current bug status. E.g., when a bug is in the NEW state,
QAs usually supplement some source and test environment information for the bug to
help triagers understand it.
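A lightweight way to realize this comment classifier is sketched below. This is our own illustration with scikit-learn, not the authors' pipeline: it uses only unigram TF-IDF text features (the average-IDF, comment-length and bug-status features would be appended to the feature vector in a fuller implementation) and trains an RBF-kernel SVM whose calibrated output serves as p(usefulness | c_i) in Equation (1). The example comments are made up.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy training data: (comment text, label) with 1 = usefulness, 0 = less usefulness.
comments = ["crash reproduced on build 1.2 when wifi handover occurs",
            "root cause is a null pointer in the connman plugin",
            "stack trace attached, fails during bluez pairing",
            "please fix as soon as possible",
            "any update on this?",
            "thanks, closing"]
labels = [1, 1, 1, 0, 0, 0]

clf = make_pipeline(
    TfidfVectorizer(),                       # unigram TF-IDF text features
    SVC(kernel="rbf", probability=True),     # RBF-kernel SVM with probability output
)
clf.fit(comments, labels)

# W_Y(c_i) = p(usefulness | c_i) for a new comment:
print(clf.predict_proba(["segfault in the ofono stack when roaming"])[0][1])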
summary and description in the query bug and the comments in the candidate bug. Given a
set of comments C in bug Y and a query text q, the similarity function between them,
Score_Y(q, C), is the maximum weighted similarity between the text and a comment. It
is defined as follows:
Score_Y(q, C) = max{ Sim(q, c_i) · W_Y(c_i) | c_i ∈ C }    (2)
where c_i denotes the i-th comment in a bug report, c_i and the query text q are modeled
in the vector space model, and Sim(q, c_i) is the cosine similarity between the query text q
and comment c_i.
Given a query bug Q, text_set(Q) denotes the text set of bug Q, including the
summary and description. C denotes the set of comments in one bug Y, and Ps(Q, Y)
is defined as follows:
Ps(Q, Y) = ( Σ_{q_i ∈ text_set(Q)} W_Q(q_i) · Score_Y(q_i, C) ) / ( Σ_{q_i ∈ text_set(Q)} W_Q(q_i) )    (3)
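Equations (2) and (3) translate directly into code. The sketch below is ours: it uses plain TF-IDF cosine similarity for Sim and takes the comment weights W_Y(c_i) and field weights W_Q(q_i) as given inputs (e.g. the classifier output from the previous subsection); the example strings are invented.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def score_y(query_text, comments, comment_weights, vectorizer):
    """Equation (2): max over comments of Sim(q, c_i) * W_Y(c_i)."""
    q = vectorizer.transform([query_text])
    c = vectorizer.transform(comments)
    sims = cosine_similarity(q, c)[0]
    return max(s * w for s, w in zip(sims, comment_weights))

def ps(query_fields, field_weights, comments, comment_weights, vectorizer):
    """Equation (3): weighted average of Score_Y over the query's summary/description."""
    num = sum(w * score_y(q, comments, comment_weights, vectorizer)
              for q, w in zip(query_fields, field_weights))
    return num / sum(field_weights)

# Toy usage: fit the vectorizer on all available text first.
fields = ["wifi drops after resume", "network manager loses connection on wakeup"]
comments = ["reproduced after suspend/resume cycle", "duplicate of the wifi resume bug"]
vec = TfidfVectorizer().fit(fields + comments)
print(ps(fields, [1.0, 0.8], comments, [0.9, 0.6], vec))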
Table 1. All the features used in the bug retrieval model; CS is an abbreviation for Cosine
Similarity
The latter is hard to get, but valuable for predicting the resolution of submitted bugs.
Intuitively, novice users as well as expert or advanced users tend to submit
bugs of different quality.
PDB takes some user profile features into consideration. They include the
total number of issued bugs, the portion of issued bugs with resolution
FIXED and the portion of issued bugs with resolution DUPLICATE. The
total number of issued bugs tries to capture the participation and familiarity
of the submitter in a project. The two portions reflect the submitter's habits when issuing bugs.
These features are helpful for predicting the resolution of a new query bug. For
example, a high total number of issued bugs indicates that the submitter might
have profound knowledge of the project and program, and thus is less likely to submit
duplicate bugs; a low portion of bugs with resolution FIXED and a high
portion with resolution DUPLICATE mean the submitter might not
have the habit of searching for relevant bugs before issuing a new one, so
the bugs this author issues have a good chance of being duplicates.
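Extracting these profile features from a bug history is straightforward; the helper below (ours, with a hypothetical record layout of (reporter, resolution) pairs) computes the three values for one submitter.

def profile_features(history, reporter):
    """(total issued, portion FIXED, portion DUPLICATE) for one reporter.

    history: iterable of (reporter, resolution) pairs, e.g. ("alice", "FIXED").
    """
    resolutions = [res for rep, res in history if rep == reporter]
    total = len(resolutions)
    if total == 0:
        return 0, 0.0, 0.0
    fixed = sum(res == "FIXED" for res in resolutions)
    dup = sum(res == "DUPLICATE" for res in resolutions)
    return total, fixed / total, dup / total

history = [("alice", "FIXED"), ("alice", "DUPLICATE"), ("bob", "FIXED")]
print(profile_features(history, "alice"))   # -> (2, 0.5, 0.5)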
The idea of query feedback is to judge the features of a query from the ranked results
returned by the retrieval engine. Suppose that a user issues query bug Q to a
duplicate bug retrieval system and a ranked list L of candidate bugs is returned.
Intuitively, the more similar the top candidate bug is, the more probably query
bug Q is a duplicate. However, for convenience, submitters usually employ
many of the same words and a very similar writing style when filling in related
bug reports. This leads to very high similarity scores between bugs that are
submitted by the same author but are not duplicates. In order to catch these
situations, PDB considers several pieces of query feedback information. The features include
the similarity score of the first, fifth and tenth returned bug report; whether the first,
fifth and tenth returned bug is submitted by the same author; the portion of bugs by the
same author in the top five and in the top ten returned bug reports; and the average
similarity score of the top five and of the top ten returned bug reports. The similarity
score of two bugs is the probability that they are duplicates, as described in the previous
subsection.
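As an illustration of how these query-feedback signals could be assembled into a feature vector, consider the sketch below; the list layout (a similarity score plus a submitter id per returned bug) and the function name are assumptions rather than the paper's exact interface.

```python
# Hedged sketch of the query-feedback features; field layout is an assumption.
def query_feedback_features(ranked_list, query_author):
    """ranked_list: [(similarity_score, author), ...] sorted by decreasing similarity."""
    def sim(k):          # similarity of the k-th returned bug (1-based), 0 if list too short
        return ranked_list[k - 1][0] if len(ranked_list) >= k else 0.0
    def same_author(k):  # 1 if the k-th returned bug was filed by the query's submitter
        return int(len(ranked_list) >= k and ranked_list[k - 1][1] == query_author)
    def frac_same(n):
        top = ranked_list[:n]
        return sum(a == query_author for _, a in top) / len(top) if top else 0.0
    def avg_sim(n):
        top = ranked_list[:n]
        return sum(s for s, _ in top) / len(top) if top else 0.0

    return [sim(1), sim(5), sim(10),
            same_author(1), same_author(5), same_author(10),
            frac_same(5), frac_same(10),
            avg_sim(5), avg_sim(10)]
```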
4 Experimental Study
4.1 Dataset
MeeGo is a Linux-based free mobile operating system project. We crawled a
dataset of bug reports from the public MeeGo Bugzilla1 through its open XML-RPC
API. The dataset contains 22,486 bugs issued from March 2010 to
August 2011. Among them, 1,613 bugs have the resolution DUPLICATE.
The first experiment concerns the performance of PDB in finding duplicate bug
reports. We compare PDB with REP [10]. Several standard measures (MAP,
Precision, Recall) are used in the evaluation.
PDB adopts an SVM classifier2 with a linear kernel. The number of topics in the topic
model is set to 100. The IDs of the bugs in the training set range from 1 to 5000,
and the remaining bugs are used for testing. The query bug set contains both duplicate and
master bugs. The result set is expected to contain all the duplicate bug reports.
Table 2 displays the performance in recall, precision and MAP for the two approaches.
PDB outperforms REP in all measures. PDB achieves a higher recall rate by about
1%-8% for the resulting lists of top 1-20 bug reports, and the gap grows as the size
of the list increases. In precision, PDB outperforms REP by 1%-8% in the top 1-20
resulting bug reports. PDB improves MAP from 36% to 46%.
We analyze the contribution of the various features to the model by measuring
the F-score through the Python script provided with LIBSVM. Among all the new features,
weighted comment similarity has the highest F-score, which means that comments in a bug
play a major role in duplicate bug retrieval. The features related to the topic model
follow, and reporter distance ranks lowest.
Table 2. Performance results in Recall, Precision and MAP for the two methods
bugs respectively, and then apply Naive Bayes, Decision Tree3 and SVM for
classification. The RBF kernel is used in the SVM classifier. 10-fold cross validation is
performed to measure accuracy, true positive rate and true negative rate. The
parameters are grid-searched to maximize accuracy.
We compare the accuracy of the three classifiers under PDB and under the state-
of-the-art [12]; TIAN stands for [12] in the following. The best classifier refers to the
classifier that achieves the highest accuracy. Figure 2 shows the results. PDB
outperforms TIAN consistently with all three classifiers. SVM and Decision Tree
achieve similar accuracy with both PDB and TIAN, and improve accuracy by about 10%
compared to Naive Bayes.
Fig. 2. Comparisons on the accuracies of three classifiers with PDB and TIAN methods
We define the true positive (TP) rate as the fraction of duplicate bugs that are
correctly classified, and the true negative (TN) rate as the fraction of non-duplicate
bugs that are correctly classified. The higher both rates are, the better
the classifier works.
Table 3 shows the comparison of TP and TN rates for the three classifiers with
PDB and TIAN. For all three classifiers, PDB outperforms TIAN by 2%-31% on
both TP rate and TN rate, except on the TN rate with Decision Tree. The Naive Bayes
classifier shows the best performance on TP rate, but the worst on TN rate
among the three classifiers. Similar to accuracy, SVM and Decision Tree achieve close
performance on TP and TN rates overall, with SVM ahead of Decision Tree
by a narrow margin on both. Therefore, overall, SVM does the most stable and
accurate work of the three classifiers.
3 Naive Bayes and Decision Tree are implemented by WEKA. Available at
http://www.weka.net.nz
Table 3. Comparisons on the TP rate and TN Rate for three classifiers with PDB and
TIAN methods
Classifier       PDB TP Rate   PDB TN Rate   TIAN TP Rate   TIAN TN Rate
Naive Bayes      89.9%         33%           87.2%          12.2%
Decision Tree    77.9%         61.6%         45.4%          66.7%
SVM              76.2%         68%           52%            57.2%
5 Related Works
6 Conclusion
References
1. Bettenburg, N., Premraj, R., Zimmermann, T., Kim, S.: Duplicate bug reports
considered harmful... really? In: ICSM, pp. 337–345 (2008)
2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine
Learning Research 3, 993–1022 (2003)
3. Cavalcanti, Y.C., de Almeida, E.S., da Cunha, C.E.A., Lucredio, D., de Lemos
Meira, S.R.: An initial study on the bug report duplication problem. In: CSMR,
pp. 264–267 (2010)
4. Griffiths, T.: Gibbs sampling in the generative model of latent Dirichlet allocation.
Unpublished note (2002), http://citeseerx.ist.psu.edu/viewdoc/summary
5. Jalbert, N., Weimer, W.: Automated duplicate detection for bug tracking systems.
In: DSN, pp. 52–61 (2008)
6. Kaushik, N., Tahvildari, L.: A comparative study of the performance of IR models
on duplicate bug detection. In: CSMR, pp. 159–168 (2012)
7. Nguyen, A.T., Nguyen, T.T., Nguyen, T.N., Lo, D., Sun, C.: Duplicate bug report
detection with a combination of information retrieval and topic modeling. In: ASE,
pp. 70–79 (2012)
8. Runeson, P., Alexandersson, M., Nyholm, O.: Detection of duplicate defect reports
using natural language processing. In: ICSE, pp. 499–510 (2007)
9. Steyvers, M., Griffiths, T.: Probabilistic topic models. Handbook of Latent Semantic
Analysis 427(7), 424–440 (2007)
10. Sun, C., Lo, D., Khoo, S.-C., Jiang, J.: Towards more accurate retrieval of duplicate
bug reports. In: ASE, pp. 253–262 (2011)
11. Sun, C., Lo, D., Wang, X., Jiang, J., Khoo, S.-C.: A discriminative model approach
for accurate duplicate bug report retrieval. In: ICSE (1), pp. 45–54 (2010)
12. Tian, Y., Sun, C., Lo, D.: Improved duplicate bug report identification. In: CSMR,
pp. 385–390 (2012)
13. Wang, X., Zhang, L., Xie, T., Anvik, J., Sun, J.: An approach to detecting duplicate
bug reports using natural language and execution information. In: ICSE, pp. 461–470 (2008)
14. Wei, X., Croft, W.B.: LDA-based document models for ad-hoc retrieval. In: SIGIR,
pp. 178–185 (2006)
Selecting a Diversified Set of Reviews
1 Introduction
Online user reviews have great effects on users' decision-making processes.
Customers like to read reviews to get a full picture of an item (i.e., a
product or service) when they are selecting items. For example, from the
reviews of a camera, a customer can learn about its size, color and
convenience of use even without having bought it.
However, as user-generated reviews have proliferated in recent years, e-commerce
sites are facing the challenge of information overload. On eBay.com, there are
typically several hundred reviews for popular products such as the latest model
of iPhone. The massive volume of reviews prevents users from getting a clear
view of products. On one hand, it is not easy for users to read all of the reviews:
going through every review for each item is too time-consuming. On the other hand,
an increasing number of users shop online via mobile phones, for example reserving
a table at a restaurant or picking a movie to watch while they are out. Those users
have to make their purchase decisions in a short time, so they can only read a small
fraction of the reviews for each item. In addition, due to the limited screen size of
mobile phones, it is inconvenient for users to scan many reviews.
To address the problem of information overload, e-commerce sites have adopted
several kinds of review ranking and selection methods. Review ranking orders reviews
by their helpfulness votes, so as to provide the top-k reviews to users.
Helpfulness votes are given by users to reviews they find helpful,
and there is also a body of research on automatically estimating the quality
of reviews [8][11][15][6]. However, there are two drawbacks in these review ranking
works. First, the resulting top-k reviews of an item may contain redundant
information while some important attributes are not covered. For example, the top-k
reviews of a mobile phone may comment on display, camera and battery life, but
mention nothing about whether it is easy to use or carry. Second, since previous
experiments [3] showed that users tend to consider helpful the reviews that
follow the mainstream, the resulting top-k reviews may lack diversity of opinions.
Motivated by these two observations, review selection based on attribute coverage [14]
was proposed; it prefers to select reviews covering as many attributes as possible. But
it was found that this kind of method does not reflect the original distribution of
customer opinions and may present an unfair picture to users. Later, [9] addressed the
problem by selecting a set of diversified reviews that keep the proportion of positive
and negative opinions. However, we have found that this work does not perform well,
especially when selecting a smaller set of reviews, because it overlooks
attribute importance. For example, for a T-shirt, there may be a lot of reviews
complaining about its delivery speed and a smaller number of reviews complaining
about the product quality. In such a case, it may be better to show reviews about the
product quality.
Hence, to improve the overall value of the top-k reviews, we view the top-k
reviews as a review set rather than a simple aggregation of reviews. We propose
an approach to select a small set of high-quality reviews that cover important attributes,
while at the same time having well diversified attributes and
opinions through clustering. Our contributions in this paper are as follows:
1. We propose to evaluate attribute importance by assigning weights to
attributes. We aim to return to users reviews covering the attributes they are concerned about.
2. In order to improve the diversity of the top-k results, we propose to cluster
reviews into different topic groups. We design an algorithm that selects diversified
reviews from different groups, which helps improve diversification results.
3. We perform experiments on real data crawled from an e-commerce site to
evaluate our algorithm.
The rest of the paper is organized as follows. Section 2 introduces the related
work. Section 3 defines the problem. Section 4 gives the algorithms for the
problem. Section 5 presents the experimental evaluation. Section 6 concludes.
2 Related Work
Our work is related to existing work addressing the problems of review
selection, review assessment and ranking, and review summarization. A body
of recent work has focused on these problems.
Review Selection: The review selection problem is to select a set of reviews
from a review collection. Lappas and Gunopulos [10] proposed to select a set
of reviews that represent the majority opinions on attributes. The drawback of
this approach is that it reduces the diversity of opinions, despite the fact
that users tend to make purchase decisions after viewing different opinions on an
item. Tsaparas et al. [14] intended to select a set of reviews that contain at least
one positive and one negative opinion on each attribute. This method fails to
reflect the distribution of opinions in the entire review collection, thus misleading
users. Lappas et al. [9] proposed selecting a set of reviews that capture the
proportion of opinions in the entire review collection. The shortcoming of this
approach is the lack of consideration for the quality of reviews. Furthermore,
all three methods treat all attributes as equally important, which may lead to
attribute overload.
Review Assessment and Ranking: The review assessment and ranking problem
is to rank reviews according to their estimated quality. Kim et al. [8] trained
an SVM regression system on a variety of features to learn a helpfulness function
and applied it to automatically rank reviews. They also analyzed the importance
of different feature classes for capturing review helpfulness and found that the most
useful features were the length, unigrams and product rating of a review. Hong
et al. [6] used user preference features to train a binary classifier and an SVM
ranking system. The classifier divides reviews into helpful and unhelpful ones,
and the ranking system ranks reviews based on their helpfulness. Their evaluation
showed that the user preference features improved the classification and
ranking of reviews, and that jointly using the textual features proposed by Kim et al.
[8] achieves further improvement. However, these methods suffer from low
coverage of attributes, as the resulting top-k reviews may contain redundant
information. This is because they estimate review quality separately and do not
consider the overall quality of the top-k reviews.
Review Summarization: The review summarization problem is to extract
attributes and opinions from the review collection and summarize the overall
opinion on each attribute. Hu and Liu [7] proposed a three-step approach: first
extract attributes from reviews, then identify opinion sentences and whether
they are positive or negative, and finally summarize the number of positive and
negative opinion sentences for each attribute. Lu et al. [12] intended to predict
a rating for each attribute from the overall rating, and to extract representative
phrases. Although these methods extract useful information from the review
collection, they fail to provide the context of the reviews, from which users could
view the opinions in a more comprehensive and objective way.
3 Problem Definition
In this section, we formally define our problem and its properties.
Given an item i ∈ I, where I denotes a set of items in a certain category, let
R = {r_1, r_2, ..., r_n} be a set of reviews commenting on i which cover a set of
attributes A = {a_1, a_2, ..., a_m} and express a set of opinions O = {o_1, o_2, ..., o_z}.
We assume that reviewers may express two types of opinions on each attribute:
positive or negative. Hence, the size of O is z = 2m. We also say that a
where w(a) ∈ (0, 1) represents the weight of attribute a. We will introduce the
attribute weight later.
Mainstream Divergence: Users tend to consider helpful the reviews that follow
the mainstream, so we measure the gap between the viewpoint of a review
and the mainstream viewpoint. On e-commerce sites, users are required to give
an item an overall rating from one to five along with their review, so we
can say that the overall rating expresses the same viewpoint as the review, and
that the average rating reflects the mainstream viewpoint. Therefore, the divergence
from the mainstream viewpoint div(r) is defined as follows:
8 " 8
8 8
8v (r) r R v(r ) 8
8 |R| 8
div (r) = (5)
maxr R v (r )
"
v (r )
where v (r) {1, 2, . . . , 5} is the viewpoint of review r, r R
|R| is the main-
stream viewpoint, and maxr R v (r ) is the maximum rating.
Step 2: construct training data for SVM regression. We randomly choose reviews
with a certain number of votes and estimate their quality according to q(r), so
as to form a labeled dataset of {r, q(r)} pairs. We use the labeled dataset
to train our regression model and use the model to estimate review quality.
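A minimal sketch of this step is shown below, assuming a feature extractor featurize(r) and the labelling function q(r) (built from helpfulness votes and features such as div(r)) are provided; SVR from scikit-learn stands in for the SVM regression system.

```python
# Hedged sketch of Step 2 (training data construction and SVM regression); not the paper's code.
import numpy as np
from sklearn.svm import SVR

def divergence(rating, all_ratings):
    """div(r) from Eq. (5): gap between a review's rating and the mainstream rating."""
    mainstream = sum(all_ratings) / len(all_ratings)
    return abs(rating - mainstream) / max(all_ratings)

def train_quality_model(voted_reviews, featurize, q):
    """voted_reviews: reviews with enough helpfulness votes; q(r) supplies the label."""
    X = np.array([featurize(r) for r in voted_reviews])
    y = np.array([q(r) for r in voted_reviews])
    model = SVR()        # SVM regression, as described in the text
    model.fit(X, y)
    return model         # model.predict([featurize(r)]) then estimates quality of any review
```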
Assignment of Attribute Weight. Users read reviews to acquire information
about attributes. However, the review collection of an item may cover dozens of
attributes, while users are only interested in some of them. In order to provide
key information to users, we measure the importance of attributes and assign
corresponding weights.
Based on users' review-writing behavior, we regard attributes that are
frequently referred to in the reviews of an item, and of items in the same category, as
important attributes. Attributes that appear in the reviews of most items
in a category are taken into consideration because they are common attributes
for that category; when users are selecting items of this category, they are more
likely to look at those attributes first. Therefore, let R_a ⊆ R be the set of reviews
commenting on an item i ∈ I that cover attribute a, where I denotes the set of
items in a certain category, and let I_a ⊆ I be the set of items that have reviews
covering a. The attribute weight w(a) is defined as follows:
w(a) = \frac{|R_a|}{\max_{a' \in A} |R_{a'}|} \cdot \frac{|I_a|}{\max_{a' \in A} |I_{a'}|} \quad (6)

where |R_a| / max_{a'∈A} |R_{a'}| measures the importance of a for an item, and |I_a| / max_{a'∈A} |I_{a'}|
measures the importance of a for a category.
The definition of w(a) is similar to that of TF-IDF, as an attribute can be
viewed as a term and the review collection of an item as a document. Hence, the
importance of an attribute for an item corresponds to TF, while the importance of an
attribute for a category corresponds to DF rather than IDF, since we value attributes
that are frequently referred to in the reviews of items of the same category.
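The sketch below computes Eq. (6) for every item of a category under the assumption that each review has already been reduced to the set of attributes it covers; the input layout is illustrative rather than taken from the paper.

```python
# Hedged sketch of the attribute weights in Eq. (6); input format is an assumption.
def attribute_weights(reviews_per_item):
    """reviews_per_item: {item_id: [set_of_attributes_covered_by_each_review, ...]}."""
    item_counts = {}       # item_id -> {a: |R_a| for that item}
    category_counts = {}   # a -> |I_a|, number of items with a review covering a
    for item, reviews in reviews_per_item.items():
        counts = {}
        for attrs in reviews:
            for a in attrs:
                counts[a] = counts.get(a, 0) + 1
        item_counts[item] = counts
        for a in counts:
            category_counts[a] = category_counts.get(a, 0) + 1

    max_df = max(category_counts.values())
    weights = {}
    for item, counts in item_counts.items():
        if not counts:
            weights[item] = {}
            continue
        max_tf = max(counts.values())
        weights[item] = {a: (c / max_tf) * (category_counts[a] / max_df)
                         for a, c in counts.items()}
    return weights         # weights[item][a] == w(a) computed for that item
```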
Since each time we select one review to add to the review set, we regard the
diversity of a review set as the sum of the diversity of each review at the time
it is selected. Therefore, the review set diversity function D(S) is defined
as follows:

D(S) = \sum_{r \in S} d(r) \quad (7)
cluster C_i, and the number of reviews selected from C_i is less than n(C_i), select
r if it is similar to C_i while at the same time not similar to S. Therefore,
the review diversity function d(r) is defined as follows:
In this section, we give the algorithms for our problem, including the algorithms
for opinion extraction and review selection. Before selecting reviews, we
apply an opinion extraction algorithm that extracts attributes and opinions from
the original review collection to generate the input for the review selection algorithm.
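Since only a summary of the selection algorithm survives in this section, the following greedy loop is a hedged reconstruction consistent with the description in Section 6: the quality score q(r), the per-review diversity term d(r, S) and the diversity factor (called lam below) are assumed inputs, and the exact trade-off used in the paper may differ.

```python
# Hedged sketch of a greedy diversified review selection; the objective is an assumption.
def greedy_select(reviews, q, d, k=5, lam=0.5):
    """reviews: candidates; q(r): estimated quality; d(r, selected): diversity gain."""
    selected = []
    candidates = list(reviews)
    while candidates and len(selected) < k:
        best = max(candidates, key=lambda r: (1 - lam) * q(r) + lam * d(r, selected))
        selected.append(best)
        candidates.remove(best)
    return selected
```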
5 Experimental Evaluation
In this section, we evaluate our algorithm for review selection. We set the
size of the selected review set to 5. We apply the diversity factor with values in
{0, 0.1, ..., 0.9} to observe its effect on the quality and diversity of the selected
review set as it increases.
5.1 Datasets
In our experiments, we use data crawled from the e-commerce site eBay.com. This
dataset includes reviews from three categories: tablets and e-book readers, portable
audio, and digital cameras. There are 4789 items and 110480 reviews in all. After
pruning items having fewer than 20 reviews, there are 121 items on tablets
and e-book readers with 9555 reviews, 286 items on portable audio with 38799
reviews, and 583 items on digital cameras with 42783 reviews.
We are interested in how quality behaves when raising the diversity factor from 0.1
up to 0.9; we hypothesize that quality will decrease as the factor increases. We also
examine whether attributes and opinions are diversified by our algorithm. In addition,
we compare our algorithm with other review selection algorithms in both the quality
and the diversity of the selected review set.
Review Set Quality Analysis. We apply our algorithm to all three categories
of items and compute the average quality of all the selected review sets. As shown
in Fig. 1, the quality goes down smoothly as the diversity factor is raised by 0.1 each
time, which agrees with our hypothesis and reveals that the diversification of the
review set has a detrimental effect on quality.
to perform better than our algorithm in PROP-ERR. However, since the size
of the selected review set is small and the characteristic algorithm diversifies the
selected review set by maintaining the proportion of opinions in the original review
collection, some rarely mentioned attributes may not be covered, resulting
in low coverage of opinions and of important attributes as well.
In our work, we define important attributes as those that are frequently
referred to in the reviews of an item and of items in the same category. Some
attributes may only appear in a small fraction of the reviews of an item, but they are
likely to be important for the category the item belongs to. For example,
the reviews selected by our algorithm for "Canon EOS 1D 4.2 MP Digital
SLR Camera" cover not only common attributes such as:
picture quality: ". . . It delivers excellent image quality, with great colors
and detail. . ."
shooting: ". . . I also like the 6.5fps shooting, which is handy for my wildlife
shooting. . ."
price: ". . . This is one of the best cameras out there for the price in my
opinion. . ."
but also attributes that are not frequently mentioned in the reviews of this
item while in fact being of great importance to the camera category, such as:
screen and interface: ". . . What I love in it? Large viewfinder, great ISOs
range, nice speed, large screen and comfortable interface. . ."
menu system: ". . . The menu system is also very intuitive and easy to learn
and the my menu feature is good for me about 90% of the time and keeps me
from having to go through the whole menu interface. . ."
Users may be concerned about the screen, interface and menu system when selecting
cameras, as they are key to the user experience. Though they are not frequently
referred to in the reviews of this item, they are definitely important for the
camera category.
           Characteristic   Diversified
QLTY       0.86128730       0.95268065
OP-COV     0.10075375       0.31482820
WOP-COV    0.46727788       0.70877750
C-COV      0.17052619       0.37709308
WC-COV     0.60632420       0.80178090
PROP-ERR   0.21768339       1.30870930
6 Conclusion
We have proposed an approach to select a small set of high-quality reviews from the
review collection of an item that at the same time have diversified attributes and
opinions. We first train a regression model to estimate review quality. Then, to cover
important attributes, we assign weights to attributes according to their importance,
and to cover different opinions, we cluster reviews according to their opinions and
select reviews proportionally from different clusters. Finally, we use a diversity factor
to balance the quality and diversity of the selected review set, and implement our
method with a greedy algorithm to maximize the overall value of the selected review
set. The experimental results show that our algorithm helps diversify the selected
review set.
References
1. Carbonell, J., Goldstein, J.: The use of MMR, diversity-based reranking for reordering
documents and producing summaries. In: Proceedings of the 21st Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval,
pp. 335–336. ACM (1998)
2. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Transactions
on Intelligent Systems and Technology (TIST) 2(3), 27 (2011)
3. Danescu-Niculescu-Mizil, C., Kossinets, G., Kleinberg, J., Lee, L.: How opinions are
received by online communities: a case study on amazon.com helpfulness votes.
In: Proceedings of the 18th International Conference on World Wide Web, pp.
141–150. ACM (2009)
4. De Marneffe, M., MacCartney, B., Manning, C.: Generating typed dependency
parses from phrase structure parses. In: Proceedings of LREC, vol. 6, pp. 449–454 (2006)
5. Ding, X., Liu, B., Yu, P.S.: A holistic lexicon-based approach to opinion mining. In:
Proceedings of the International Conference on Web Search and Web Data Mining,
pp. 231–240. ACM (2008)
6. Hong, Y., Lu, J., Yao, J., Zhu, Q., Zhou, G.: What reviews are satisfactory: novel
features for automatic helpfulness voting. In: Proceedings of the 35th International
ACM SIGIR Conference on Research and Development in Information Retrieval,
pp. 495–504. ACM (2012)
7. Hu, M., Liu, B.: Mining and summarizing customer reviews. In: Proceedings of the
Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, pp. 168–177. ACM (2004)
8. Kim, S.M., Pantel, P., Chklovski, T., Pennacchiotti, M.: Automatically assessing
review helpfulness. In: Proceedings of the 2006 Conference on Empirical Methods
in Natural Language Processing, pp. 423–430 (2006)
9. Lappas, T., Crovella, M., Terzi, E.: Selecting a characteristic set of reviews. In:
Proceedings of the 18th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 832–840. ACM (2012)
10. Lappas, T., Gunopulos, D.: Efficient confident search in large review corpora. Machine
Learning and Knowledge Discovery in Databases, 195–210 (2010)
11. Liu, J., Cao, Y., Lin, C., Huang, Y., Zhou, M.: Low-quality product review detection
in opinion summarization. In: Proceedings of the 2007 Joint Conference on
Empirical Methods in Natural Language Processing and Computational Natural
Language Learning (EMNLP-CoNLL), pp. 334–342 (2007)
12. Lu, Y., Zhai, C.X., Sundaresan, N.: Rated aspect summarization of short comments.
In: Proceedings of the 18th International Conference on World Wide Web,
pp. 131–140. ACM (2009)
13. Moghaddam, S., Ester, M.: On the design of LDA models for aspect-based opinion
mining (2012)
14. Tsaparas, P., Ntoulas, A., Terzi, E.: Selecting a comprehensive set of reviews. In:
Proceedings of the 17th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 168–176. ACM (2011)
15. Tsur, O., Rappoport, A.: RevRank: A fully unsupervised algorithm for selecting
the most helpful book reviews. In: International AAAI Conference on Weblogs and
Social Media (2009)
Detecting Community Structures
in Microblogs from Behavioral Interactions
Ping Zhang1, Kun Yue1,*, Jin Li2, Xiaodong Fu3, and Weiyi Liu1
1 Department of Computer Science and Engineering, School of Information Science
and Engineering, Yunnan University, Kunming, China
2 Department of Software Engineering, School of Software,
Yunnan University, Kunming, China
3 Faculty of Information Engineering and Automation, Kunming University of Science
and Technology, 650500, Kunming, China
[email protected]
Abstract. Microblogs have created a fast-growing social network on the Internet,
and community detection is one of the subjects of great interest in microblog
network analysis. The follow relationships emphasized in many existing methods
are not sufficient to capture the actual relationship strength between users,
which leads to imprecise communities of microblog users. In this paper, we
present a graph-based method for modeling microblog users and correspondingly
propose a new metric for quantifying relationships based on interaction activities.
We then give a hierarchical algorithm for community detection that incorporates
the quantitative relationship strength and the modularity density criterion.
Experimental results show and verify the effectiveness, applicability and efficiency
of our method.
1 Introduction
* Corresponding author.
For example, many companies adopt microblogging social media services as ideal
places to obtain timely opinions from customers and potential customers directly.
Among the various perspectives of knowledge discovery in microblog-related applications,
the analysis of the so-called community structure of online social networks has
attracted much attention in recent years. For instance, Guan et al. [3] found
that new purchases can be effectively stimulated by targeting a product at potential
users of a community that has a significant number of users who have already
purchased the product.
Naturally, in a specific physical domain or logical range, the users who communicate
via microblog messages can be described by a network, in which their
associations are represented globally. Identifying the community structures in
microblogs could help us understand and exploit these networks more effectively,
and various studies on community discovery or detection have been conducted
around this goal [4, 5]. For this purpose, many researchers cluster users by the
follow relationships [1, 6], based on the hypothesis that microblog users are likely
to follow (or be followed by) other similar users if their message-communication
relationships are reflected in a community. Intuitively, by comparing the lists of
followers of the users in small communities, large communities can be generated
by merging the small ones that share a certain number of users according to a
pre-defined threshold. However, the follow relationships do not actually reflect the
strength of real relationships from weak to strong. Thus, these methods [1, 6] might
group users into a community mistakenly, since close or remote relationships depend
not only on the number of associations but also on their strength.
Consequently, some sociologists [7, 8] proposed measuring relationship strength
based on communication reciprocity and interaction frequency. In this way, the
strength can be predicted by analyzing behavioral interaction data [9, 10], but only
two levels of strength, namely weak and strong, were considered. Therefore, from
the community discovery point of view, it is desirable to develop a strength metric
that describes relationships both qualitatively and quantitatively.
As stated above, the relationships among microblog users can be represented by a
weighted network composed of vertices and weighted edges, in which the relationship
strengths serve as the weights. The property of community structure is as follows:
network nodes are joined together in tightly knit groups with strong relationship
strength, while there are only looser connections with weak relationship strength
between the groups. Starting from this observation, many clustering algorithms in
social network analysis, such as GN and CNM, have been applied for community
discovery on weighted networks [11, 12]. However, only the coarse structure of the
communities can be discovered by these algorithms. As an improvement, Li et al. [13]
introduced the modularity density criterion and proved that a k-means-style algorithm
optimizing this criterion obtains finer results. However, in clustering algorithms like
k-means the number of clusters has to be specified in advance, which cannot be done
for a microblog network, since the actual topology and scale of the concerned
microblog users are not known in advance.
Therefore, inspired by the basic idea of clustering algorithms and starting from
the relationship strengths of microblog users, we propose a precise method for
Suppose that user i calls user j w_ij times and user j calls user i w_ji times. What
is the degree of relationship strength between them? In this section, we discuss a
metric to quantify the relationship strength between a given pair of users.
Microblogs have two special tweet types for communication among users, called
replies and retweets. In the network of microblog users, the edges correspond to the
directed interactions of replies or retweets among users.
A reply is a tweet sent in response to an earlier tweet, and it includes the earlier
tweet's ID. This type of tweet typically includes "@account_name" followed by the
new message. All of the users following either the replying user or the replied-to user
can read this tweet. When a tweet is retweeted (abbreviated as RT) by a user, it is
broadcast to the user's followers. For an official RT made with Twitter's retweet
command, the tweet is simply reposted without changing the original message. An
unofficial RT is created by the retweeting user with the "RT @account_name original
message" notation, possibly followed by additional text.
It is well known that the more reciprocal and frequent the interaction of a pair is,
the closer their relationship will be. Therefore, we can simply define the relationship
strength as follows:

F_{ij} = \mathrm{Rec}_{ij} \cdot \mathrm{Fre}_{ij} \quad (1)

where F_ij, Rec_ij and Fre_ij represent the relationship, reciprocity and frequency
strength between i and j, respectively. It is worth noting that F_ij is computed by
adopting the weights in the directed microblog behavioral network. The most intuitive
way to quantify F_ij is as follows:

F_{ij} = \mathrm{Rec}_{ij} \cdot \mathrm{Fre}_{ij} = \frac{\min(w_{ij}, w_{ji})}{\max(w_{ij}, w_{ji})} \cdot (w_{ij} + w_{ji}) \quad (2)
However, Eq. (2) is not suitable for measuring relationship strength. For example,
the values of Eq. (2) for the different pairs (1, 10) and (1, 1000) are almost equal.
In microblog applications, the information posted by some nodes is transmitted
enormously by other nodes, while these nodes themselves transmit little from others.
This frequently happens between the nodes corresponding to stars and their fans.
Although the total number of interactions is very high, stars and their fans are
generally not very close to each other. From this standpoint, when the total interaction
volume is large enough, the value of F_ij should mainly depend on Rec_ij. Thus, the
relationship strength for (1, 10) should be higher than that for (1, 1000).
To distinguish the relationships illustrated above, the increase of Fre_ij should
flatten as the total number of interactions grows. So we compute the frequency
strength as Fre*_ij = \sqrt{w_{ij} + w_{ji}}, and Eq. (2) can be improved as follows:

F_{ij} = \mathrm{Rec}_{ij} \cdot \mathrm{Fre}^{*}_{ij} = \frac{\min(w_{ij}, w_{ji})}{\max(w_{ij}, w_{ji})} \cdot \sqrt{w_{ij} + w_{ji}} \quad (3)
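A small sketch of Eq. (3) follows; the square-root damping corresponds to the reconstruction above, and the dictionary-of-dictionaries layout for the directed interaction counts w is an assumption.

```python
# Hedged sketch of Eq. (3) and of folding the directed network into an undirected one.
import math

def relationship_strength(w_ij, w_ji):
    """F_ij = Rec_ij * Fre*_ij = (min/max) * sqrt(w_ij + w_ji); 0 if there is no interaction."""
    if w_ij == 0 and w_ji == 0:
        return 0.0
    reciprocity = min(w_ij, w_ji) / max(w_ij, w_ji)
    frequency = math.sqrt(w_ij + w_ji)      # dampened frequency strength
    return reciprocity * frequency

def to_undirected(w):
    """w: {i: {j: count}} directed reply/retweet counts -> {(i, j): F_ij} for each pair."""
    users = sorted(set(w) | {j for nbrs in w.values() for j in nbrs})
    undirected = {}
    for a in range(len(users)):
        for b in range(a + 1, len(users)):
            i, j = users[a], users[b]
            f = relationship_strength(w.get(i, {}).get(j, 0), w.get(j, {}).get(i, 0))
            if f > 0:
                undirected[(i, j)] = f
    return undirected
```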
Fig. 1. Microblog network    Fig. 2. Quantitative microblog network
By using Eq. (3), we can transform the directed weighted network into an undirected
one while deriving the general relationship strength between each pair of users. For
example, the directed weighted network shown in Fig. 1 is transformed into the
undirected one shown in Fig. 2. In Section 3, we present our method for community
detection based on the undirected network.
claimed that modularity contains an intrinsic scale that depends on the total number of
links in the network, and stated that Q-based methods can only detect the coarse
structure of communities. To improve the situation, Li et al. [13] introduced modularity
density (D), which takes both edges and nodes into account and was proved to be able
to recognize communities of various sizes in undirected networks. In our algorithm,
we take modularity density as the quantitative measure of a partition.
First, we introduce some related concepts.
where V is the vertex set and E is the edge set of a network G = (V, E), and A is the
adjacency matrix of G. Given a partition of a network into G_1 = (V_1, E_1), ..., G_m = (V_m, E_m),
where V_i and E_i are the node set and edge set of G_i respectively, for i = 1, ..., m
and m not larger than n, d(G_i) is the density of subgraph G_i, |V_i| is the number of
vertices in V_i, and L(V_i, V_i) = \sum_{i \in V_i, j \in V_i} A_{ij} (representing the total
number of edges within V_i).
Given a valid partition of a network G, the sum of the density values of all
communities, computed by Eq. (4), should be as large as possible. Li et al. [13] also
gave the definition of modularity density.
D = \sum_{i=1}^{m} d(G_i) = \sum_{i=1}^{m} \frac{L(V_i, V_i) - L(V_i, \bar{V}_i)}{|V_i|} \quad (5)
density in Definition 2. The problem is that a subgraph containing a single node has
an arbitrary density, so we cannot determine which pairs maximize ΔD (the increase
of D). To obtain the communities, instead of optimizing D directly, we set D's value
for all single-node subgraphs to 0 and define a subgraph proximity for merging nodes.
Our basic idea is as follows: two subgraphs are merged if their combination has the
maximal proximity and makes ΔD > 0, as specified in Definition 3.
\mathrm{pro}_{ij} = \frac{2\,L(V_i, V_j)}{L(V_i, V_i) + L(V_j, V_j)} \quad (6)
The greater the value of pro_ij, the closer the relationship between G_i and G_j. If there is
no connection between G_i and G_j, then pro_ij equals 0, and if G_i and G_j are connected
only to each other, then pro_ij equals 1.
In some networks, the idea of always joining the pair of subgraphs with the maximal
proximity may lead to asymmetric growth: large subgraphs are more likely to be
merged than small ones. In extreme cases, the whole network may be clustered into just
one community. The reason is that large subgraphs have more nodes, so even if they are
linked to others at a very small rate, the number of edges between them is still larger
than for small subgraphs. For example, w.r.t. the two communities in Fig. 2, the
result of subgraph merging is shown in Fig. 3, where all the nodes are clustered
together into one community each time ΔD is increased. To avoid this situation,
we choose subgraphs randomly during the merging process. According to this idea,
large communities have some probability of not being chosen to merge.
Fig. 4 illustrates the random choice, from which the correct result can be obtained.
Algorithm HCD starts from a state in which each vertex is the sole member of one
of n subgraphs. At each step, for a randomly chosen subgraph m, it joins the subgraph i
with the greatest pro_mi that increases the modularity density D. The value of D
will be maximized if all the nodes are partitioned correctly. Randomly choosing a
subgraph in F takes O(1) time. Since joining a pair of communities with no edges
between them can never increase D, computing pro_mi and ΔD depends only on k
(the mean node degree of the whole network) in the worst case. Updating the values
of V, F and C takes O(n) time in total. Thus, each step of the algorithm takes O(k+n)
time in the worst case. At most n−1 join operations are necessary to construct the
complete dendrogram, and hence the entire algorithm runs in O(kn + n^2) time on a
dense graph, or O(n^2) time on a sparse one.
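To make the procedure concrete, the sketch below implements the merging loop as described: communities are picked at random and merged with their most proximate neighbour (Eq. (6)) whenever the merge increases the modularity density D (Eq. (5)). The adjacency layout, the patience-based stopping rule and the plain treatment of single-node subgraphs are simplifying assumptions rather than the authors' exact implementation.

```python
# Hedged sketch of Algorithm HCD; stopping rule and data layout are assumptions.
import random

def L(A, S, T):
    """L(S, T) = sum of A[u][v] over u in S, v in T (A stores both directions)."""
    return sum(A.get(u, {}).get(v, 0) for u in S for v in T)

def modularity_density(A, communities, all_nodes):
    """Eq. (5): sum over communities of (internal minus external links) / size."""
    return sum((L(A, C, C) - L(A, C, all_nodes - C)) / len(C) for C in communities)

def proximity(A, Ci, Cj):
    """Eq. (6): pro_ij = 2 L(V_i, V_j) / (L(V_i, V_i) + L(V_j, V_j))."""
    denom = L(A, Ci, Ci) + L(A, Cj, Cj)
    return 2 * L(A, Ci, Cj) / denom if denom else 0.0

def hcd(A, patience=100):
    nodes = set(A) | {v for nbrs in A.values() for v in nbrs}
    communities = [{v} for v in nodes]           # every vertex starts as its own subgraph
    stalled = 0
    while len(communities) > 1 and stalled < patience:
        m = random.choice(communities)
        neighbours = [c for c in communities if c is not m and L(A, m, c) > 0]
        if not neighbours:
            stalled += 1
            continue
        best = max(neighbours, key=lambda c: proximity(A, m, c))
        merged = [c for c in communities if c is not m and c is not best] + [m | best]
        if modularity_density(A, merged, nodes) > modularity_density(A, communities, nodes):
            communities, stalled = merged, 0     # merge increased D: accept
        else:
            stalled += 1                         # reject and try another random subgraph
    return communities
```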
4 Experimental Results
Because a small network is easier to illustrate clearly, we tested the resolution of
our method on well-studied small networks. Zachary's karate club network [16] is
one of the classic studies in social network analysis. It consists of 34 members of a
karate club as nodes and 78 edges representing friendships between members of the
club, observed over a period of two years. Due to a disagreement between
the club's administrator and the club's instructor, the club split into two smaller
ones. Fig. 5 shows the karate club network, where the shapes of the vertices represent
the two groups: the instructor's and the administrator's factions, respectively.
Table 1 shows that the results of the different algorithms are reasonable given the
topology of the network. However, only two coarse structures were found by the
well-known Q-based algorithm proposed by Clauset, Newman and Moore (CNM) [17].
Communities 1# and 2# were split into 4 subgraphs by the density-based k-means
algorithm [13]. Community 1# was split into 2 subgraphs by the Algorithm HCD
proposed in this paper. This means that the resolution of the result of Algorithm
HCD is better than that of the CNM algorithm and not far from that of the density-based
k-means algorithm. Furthermore, we note that the underlined node 10 cannot be
assigned properly, since from the topology of the network it could be assigned to
either Community 1# or Community 2#. With the density-based k-means algorithm,
4 factions can be detected perfectly, since the number of groups is known in advance
as human knowledge.
Fig. 5. Karate club network. Square nodes and circle nodes represent the instructor's faction
and the administrator's faction, respectively. The dotted line represents our partition, and the
solid line represents the partition by the method in [13]. Shaded and open nodes represent the
partition by CNM [17].
4.2 Precision
We tested the precision of our algorithm on generated networks whose community
structure ranges from apparent to fuzzy. Our method was applied to the benchmark
computer-generated networks (denoted FLR) proposed by Lancichinetti et al.
[18]. These networks have heterogeneous distributions of node degree, edge weight
and community size, which are exactly the features of real microblog networks [1,
19]. Thus, they are generally accepted as a proxy for a real network with community
structure and are appropriate for evaluating the performance of community detection
algorithms. FLR allows users to specify distributions for the community sizes, the
vertex degrees and the edge weights via their exponents and the mixing parameter muw.
Users can also adjust the average mixing ratio to alter the community structure from
apparent to fuzzy.
Fig. 6. Comparison of the normalized mutual information between Algorithm HCD and the
CNM algorithm [17] on the benchmark computer-generated network
4.3 Efficiency
We tested the efficiency of our algorithm on generated networks of different
scales by comparing the execution time of Algorithm HCD proposed in this paper with
those of FN and of CNM, the improvement of FN. The comparisons were made as the
number of nodes in the generated networks increases and are shown in Fig. 7. It can
be seen that the execution time of Algorithm HCD (i.e., CommuDis) is larger than
that of CNM but smaller than that of FN as the network grows larger.
To sum up, we conclude from the above experimental results that our algorithm
for community detection achieves more precise results than CNM, but is less
efficient than CNM although more efficient than FN. This motivates us to improve our
algorithm to achieve better scalability, inspired by some of the ideas behind the
improvement from FN to CNM, which is our future work.
In this paper, we used a weighted graph to model microblog users and their
relationships derived from user interaction activities, and gave a hierarchical algorithm
that successfully incorporates the modularity density D to discover communities
precisely. To a certain extent, the experimental results showed that our method is
effective and efficient for community detection in microblog networks.
The methods proposed in this paper also leave open some other research issues.
For example, different types of behavioral interactions may have different effects on
relationships, but our method does not distinguish between them. Assigning different
weights to different interaction types may make it possible to predict relationships
better. Moreover, the scales of real microblog networks are frequently large, so the
efficiency and scalability of our method need to be improved. In addition, interactions
in microblogs occur gradually over time, and the features used in this work do not
consider the dynamic aspects of the data. These issues are exactly our future work.
Acknowledgement. This work was supported by the National Natural Science Foun-
dation of China (Nos. 61063009, 61163003, 61232002), the Ph. D Programs Founda-
tion of Ministry of Education of China (No. 20105301120001), the Natural Science
Foundation of Yunnan province (Nos. 2011FZ013, 2011FB020), and the Yunnan
Provincial Foundation for Leaders of Disciplines in Science and Technology (No.
2012HB004).
References
1. Song, X., Finin, T., Tseng, B.: Why we twitter: understanding microblogging usage and
communities. In: Zhang, H., Spiliopoulou, M., Mobasher, B., Lee Giles, C., McCallum,
A., Nasraoui, O., Srivastava, J., Yen, J. (eds.) Proc. of the 9th WebKDD and 1st SNA-KDD
2007 Workshop on Web Mining and Social Network Analysis, pp. 56–65. ACM Press (2007)
2. Kavanaugh, A., Fox, E., Sheetz, S., Yang, S., Li, L., Whalen, T., Shoemaker, D., Natsev,
P., Xie, L.: Social Media Use by Government: From the Routine to the Critical. Government
Information Quarterly 29(4), 480–491 (2012)
3. Guan, Z., Wu, J., Zhang, Q., Singh, A., Yan, X.: Assessing and Ranking Structural Correlations
in Graphs. In: Sellis, T., Miller, R., Kementsietsidis, A., Velegrakis, Y. (eds.) Proc.
of the 2011 ACM SIGMOD International Conference on Management of Data, pp. 937–948.
ACM Press (2011)
4. Ferrara, E.: A Large-scale Community Structure Analysis in Facebook. EPJ Data Science
1(9) (2012), doi:10.1140/epjds9
5. Liao, Y., Moshtaghi, M., Han, B., Karunasekera, S., Kotagiri, R., Baldwin, T., Harwood,
A., Pattison, P.: Mining Micro-Blogs: Opportunities and Challenges. Computational Social
Networks, Part 1, 129–159 (2013)
6. Enoki, M., Ikawa, Y., Rudy, R.: User Community Reconstruction using Sampled Microblogging
Data. In: Mille, A., Gandon, F., Misselis, J., Rabinovich, M., Staab, S. (eds.)
Proc. of the 21st International Conference Companion on World Wide Web, Workshop,
pp. 657–660. ACM Press (2012)
7. Burt, R.: Structural Holes: The Social Structure of Competition. Harvard University (1995)
8. Marsden, P.: Core Discussion Networks of Americans. American Sociological Review 52(1),
122–131 (1981)
9. Gilbert, E., Karahalios, K.: Predicting tie strength with social media. In: Dan, O., Ken, H.
(eds.) Proc. of the SIGCHI Conference on Human Factors in Computing Systems, pp.
211–220. ACM Press (2009)
10. Kahanda, I., Neville, J.: Using transactional information to predict link strength in online
social networks. In: Cohen, W., Nicolov, N., Glance, N. (eds.) Proc. of the Third International
AAAI Conference on Weblogs and Social Media, pp. 74–81. AAAI Press (2009)
11. Coscia, M., Giannotti, F., Pedreschi, D.: A Classification for Community Discovery Methods
in Complex Networks. Statistical Analysis and Data Mining 4(5), 512–546 (2011)
12. Huang, L., Li, R., Li, Y., Gu, X., Wen, K., Xu, Z.: 1-Graph Based Community Detection
in Online Social Networks. In: Sheng, Q.Z., Wang, G., Jensen, C.S., Xu, G. (eds.) APWeb
2012. LNCS, vol. 7235, pp. 644–651. Springer, Heidelberg (2012)
13. Li, Z., Zhang, S., Wang, R., Zhang, X., Chen, L.: Quantitative Function for Community
Detection. Physical Review E 77(3) (2008), doi:10.1103/PhysRevE.77.036109
14. Newman, M.: Fast Algorithm for Detecting Community Structure in Networks. Physical
Review E 69(6) (2004), doi:10.1103/PhysRevE.69.066133
15. Fortunato, S., Barthélemy, M.: Resolution limit in community detection. Proceedings of the
National Academy of Sciences of the United States of America, PNAS 104(1), 36–41 (2007)
16. Zachary, W.: An information flow model for conflict and fission in small groups. Anthropological
Research 33(4), 452–473 (1977)
17. Clauset, A., Newman, M., Moore, C.: Finding community structure in very large networks.
Physical Review E 70(6) (2004), doi:10.1103/PhysRevE.70.066111
18. Lancichinetti, A., Fortunato, S., Radicchi, F.: Benchmark graphs for testing community
detection algorithms. Physical Review E 78(4) (2008), doi:10.1103/PhysRevE.78.046110
19. Kwak, H., Lee, C., Park, H., Moon, S.: What is twitter, a social network or a news media?
In: Proc. of the 19th International Conference on World Wide Web 2010, pp. 591–600.
ACM Press (2010)
20. Danon, L., Díaz-Guilera, A., Duch, J., Arenas, A.: Comparing community structure
identification. J. Statistical Mechanics 2005(9) (2005), doi:10.1088/1742-5468/2005/09/P09008
Towards a Novel and Timely Search and Discovery
System Using the Real-Time Social Web
1 Introduction
Yokie is a novel search and discovery service that harnesses the social content
streams of curated sets of users on Twitter. The system aims to provide timely,
novel, useful and relevant results by streaming content from Twitter. It also
allows for partitioning the Twitter social graph into topical or arbitrary groups
of users (called Search Parties). While query-based search is at the core of both
Yokie and traditional search services, our system is not designed or conceived
to compete with or replace the latter. Yokie gathers its content from entirely
different sources for a start, and this content is indexed in response to real-time
search needs. As a result, the Yokie use-case of returning high-quality, real-time,
breaking or trending content is very different from the archival search approach of
traditional search engines.
Our work is motivated by the fact that the nature of online content is changing.
The conventional web of links and authored content is giving ground to a
web of user-generated content, opinions and ratings. At the same time, users
who have previously relied on traditional search as their basic discovery service
are finding more content through sharing on social networks. The advent of the
social web has seen an important shift in the role of users from passive consumers
of content to active contributors to content creation and dissemination.
New technologies and services have emerged that allow users to directly manipulate
and contribute diverse web content [7]. The Twitter microblogging service
has, like the web as a whole, grown significantly in recent years [6][12].
While the emphasis is on simple 140-character messaging, tweets can contain
a wealth of contextual metadata relating to the time, location, content and
the user profile. Both Twitter and Facebook have become vital link-sharing
networks [13][4]; our previous work has shown that approximately 22% of
tweets contain a hyperlink string [13]. This prevalence of link sharing on social
networks motivates an alternative approach to sourcing fresh, relevant and
human-selected search engine content, instead of crawling the entire web-sphere.
Also, by focussing on sourcing content from humans rather than the entire web,
we can subdivide the landscape of content sharers into useful groups. This has
numerous benefits, including the opportunity for users to curate their own ad-hoc
lists, to algorithmically group topical producers, and to aid in eliminating spam
contamination.
A proof-of-concept prototype of Yokie was described and presented in [13], and
a full system demonstration was presented in [14]. We have since added several
user-feedback features to gain insight into the system's usefulness, and propose
a per-item global interestingness score; both are described below. In
this paper, we are particularly interested in evaluating the performance of the
system in providing timely, relevant and useful content. This is done with a live
user trial, and an attempt at capturing comparative data with other established
online systems.
techniques for harnessing recently shared content from Twitter for a news discovery
and recommendation service [12]. Chen et al. and Bernstein et al. have
explored using the content of messages to recommend topical content [3]. Garcia
et al. successfully apply microblog data to product recommendation [5].
A main focus of applying social network properties to search systems should be
on the users and contributors. Indeed, a core component of Yokie is the selection
of groups of users for sourcing content (explained further below). Krishnamurthy
et al. demonstrate that the user base of Twitter can be split into definitive classes
based on behaviours and geographical dispersion [6]. They highlight the process
of producing and consuming content based on retweet actions, where users source
and disseminate information through the network. Several services have emerged
that take advantage of the open access to profile data with a view to providing a
score of a user's online activities (provided the user has given them permission).
One such service, Klout (http://www.klout.com), aims to provide an engagement
and influence score based on an aggregation of a user's activities across a broad
set of online social networks. Other work has focussed on applying user-level
reputation systems to social search contexts [10] and recommender systems [15].
Grouping users can have numerous benefits, including the opportunity for users
to curate or generate their own preferred lists of personal or topical sources from
which to receive content. This would, in turn, allow for vertical searching across
multiple domains [2]. This may be favourable compared to a catch-all attempt to
gather all shared content. Previous work [6][18] postulates that a large and active
subgraph of social networks like Twitter exists that exclusively publishes spam. As
such, any steps towards algorithmic subdivision of the social graph may need to
consider the potential of irrelevant or spam-laden accounts.
Fig. 1. Yokie's UI includes functions for viewing extra explanatory metadata related to
the item, and unary relevance ratings. This example result list is based on links labelled
"Instagram" in the past week by high-value technology people, ranked by popularity.
User and Item-Based Result Ranking. Users can rank using typical Relevance,
which is vanilla Term Frequency–Inverse Document Frequency scoring
(TFxIDF) [16]. However, a set of ranking strategies beyond standard relevance
has been devised using the added contextual metadata of the microblogs. These
include ranking using temporal features such as item age (Newer-First and
Older-First) and Longevity (the size of the time window between the first
and last mentions of the given hyperlink). Popular items can be ranked because we
capture a number of sample tweets that mention the hyperlink (labelled Mentions
in the UI). Potential Audience is a user-based sum of the follower counts
of each search-party member who mentioned the link; the premise of this
strategy is that high-volume in-links for users can be a sign of high user reputation.
It is easily possible to derive a range of user-based scores because we are specifically
listening to people for content, as opposed to relying on a document graph
that ignores content publishers.
We have recently implemented two new ranking strategies, both of which are
part of our evaluation. Klout is a service that provides users of social networks
with an influence score based on user reach, engagement and their ability to drive
other interactions. Using the Klout API, we can gather scores for each user (once
Klout has computed a score for them), so it is possible to rank content based on
the original publisher's or sharer's Klout score. Interestingness is an attempt to
measure how interesting an item is based on the activity around the item on the
social network and in other Yokie query sessions. Since we are privy to a multitude
of content, context and user interaction data for each item, and much of this data
depends on the conditions of the query, we can create a novel interestingness
algorithm. Given the query tuple Q, Interestingness is defined as:
& ' & '
P op(Ui , Q) |ClkUi |
Int(Ui , Q) = |LkUi | (1)
Lng(Ui , Q) |P kUi |
Pop(U_i, Q) and Lng(U_i, Q) are the popularity and longevity of the current item
U_i given the parameters of the query tuple (meaning their values depend
on the query's Tmax and Tmin values). |Clk_{U_i}|, |Pk_{U_i}| and |Lk_{U_i}| represent
the total number of clicks, peeks and likes for the item, irrespective of the parameters
of Q. These values default to 1 so as to avoid null interestingness values for items
with no user engagement. This algorithm utilizes the contextual features of the item,
benefits from prior user engagement with the item, and ignores the profile data of the
user. This allows an item's score to be potentially independent of the original sharer's
explicit profile data.
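A tiny sketch of Eq. (1) is given below; the popularity and longevity values are assumed to be computed elsewhere from the query window, and the default count of 1 mirrors the fallback described in the text.

```python
# Hedged sketch of the interestingness score in Eq. (1).
def interestingness(pop, lng, clicks=1, peeks=1, likes=1):
    """Int(U_i, Q) = (Pop/Lng) * (|Clk|/|Pk|) * |Lk|; counts default to 1."""
    return (pop / lng) * (clicks / peeks) * likes

# Example: a link mentioned 12 times over a 3-hour window, clicked 4 times and peeked twice.
score = interestingness(pop=12, lng=3.0, clicks=4, peeks=2, likes=1)
```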
2 Evaluating Yokie
Summary Statistics. During the course of the two-month trial, a total of
365,735 unique hyperlinks were mined from the Twitter profiles of our 1000
technology influencers. The user study attracted 125 unique users who submitted
1100 queries across 327 sessions. These queries led to the retrieval of 44,856
URLs, and in 95% of the cases the temporal window Tmin was set to "now",
indicating users' interest in up-to-the-minute results. Users interacted with 1061
items, 203 of which were click-throughs, and 858 were exploratory peeks.
While we encouraged users to click on the Like button for items, we found it
difficult to elicit meaningful volumes of user feedback during the trial. As such,
we focus on user engagement with items based on Click-through and Peek
actions. Further dataset and usage statistics are discussed below.
[Chart panels: (A) frequency of user-selected re-ranks per strategy (used to normalize the data in B); (C) number of user interactions (clicks and peeks) per result position; strategies: Longevity, Klout, Newer First, Older First, Interestingness, Potential Audience, Relevance, Mentions.]
Fig. 2. Chart (A) shows the frequency of user-selected ranking strategies. Chart (B) shows
normalized user interaction data for Clicks and Peeks. Charts (C) and (D) show click
position data for user interactions.
Interactions vs. Ranking Strategies. While Relevance was the most fre-
quently selected strategy by users, it performed poorly in terms of normalized
user interactions. As presented in Figure 2(B), users performed 2.6 times more
click-throughs on items ranked by Newer First, and over twice as many on items
ranked by Longevity. Post-trial, we discovered that over 90% of the queries per-
formed had a Tmax of "now" (i.e. the query time). As such, it is conceivable that the
items selected in Newer First result lists were more topical or new to the user,
or contained breaking news. Nevertheless, this is an interesting trend towards
a preference for novel context-sensitive ranking. While Peek engagement with
Relevance-ranked items remains low, the trends are starkly different compared
to the corresponding frequencies of selected ranking strategies. Users peeked at
more items ranked by Longevity and Klout scores.
Result Positioning. Figure 2(C) represents median positions for Clicks and
Peeks across all strategies. Both interaction methods follow a standard F-pattern
of user interactivity as observed in other search result evaluations [11][8][17]. This
chart presents a clear peak of user interaction for items on average above the 7th
element in the Yokie result-lists. It also illustrates a higher preference for users to
perform exploratory peek actions on items instead of visiting the hyperlinks
themselves. Finally, Figure 2(D) presents the median positions per strategy for
each interaction method. Here, a lower median position can be interpreted as
the system performing well at capturing interactions within the first
few result items. In this case, Potential Audience, Klout, Interestingness and
Mentions gained user click-throughs with a median position of the second item
in the list, while the rest were slightly higher. While it can be assumed that a better-
performing engine would present higher-frequency interactions at the earliest
positions in the list, we can also argue for the benefits of users exploring further
beyond the top-n results. Recency-based (Newer First) results garnered a higher
median position; users explored more recent items lower in the list.
Discussion. There are a number of take-home messages that arise from these
results. First, users do appear to be motivated to try a variety of different ranking
approaches. There is a tendency to prefer those such as Relevance and Interest-
ingness, at least in terms of their frequency of selection, perhaps because
they align well with conventional search engine ranking. However, based on re-
sult engagement, we can see that our novel strategies, such as Interestingness, Klout
and the temporal strategies, seem to outperform the others.
For every Yokie query performed during the open user trial, we also gathered
search results from Bing Web and Bing News using a vector of the same key-
word queries users submitted to Yokie (Bing's APIs don't support the contextual
features of the Yokie query tuple). For every Yokie query, we continued to collect
Bing's responses every 30 minutes so that we could track how these responses
evolved over the evaluation period. In each case, Bing Web and News returned a
maximum of 50 results, and each result item was stored based on its query, page
title, timestamp of indexing by Bing, item positioning and description. Useful
pieces of metadata returned by the API included a corresponding timestamp
that represented the moment Bing captured or indexed the item, and the result-
list position of an item with respect to that query in Yokie. These two features,
along with the ability to compare and contrast overlapping content with Yokie,
allow us to do comparative studies to see whether Yokie has gathered exclusive content,
and whether it can present timely results higher up its result lists. Our assump-
tion that the Twitter-sourced content would be newer and easier to index in a
more timely fashion seems obvious; however, little has been done to test it before.
[Figure panels: (A) comparison between the overall Yokie, Bing Web and Bing News datasets, showing unique URLs per dataset, the items overlapping between Yokie and each Bing dataset, and the average lead time with which overlapping items appeared in each dataset; (B) the same comparison restricted to user-interacted items (clicks and hovers).]
Fig. 3. Yokie dataset comparison with Bing Web Search and Bing News Search. Figure
A shows a breakdown of item overlaps in the Bing Web / Yokie / Bing News datasets.
Figure B shows the overlaps of user-engaged Yokie items and the total Bing datasets,
and also highlights items that appeared earlier in the respective datasets.
Figure 3 also reports which of the overlapping items appeared earliest in either Yokie or Bing,
and the average lead time in the corresponding dataset.
Time of User Interaction vs. Bing Indexing. In Figure 3(B), we see that
during the trial, 689 results were interacted with by Yokie users. Once again,
the vast majority were exclusive to Yokie; only 34 items were found by Bing Web,
and 89 by Bing News. In terms of timeliness, we also see Yokie performing
comparatively well against Bing Web: 13 items were interacted with by
users an average of 16 days earlier than they even appeared in the Bing Web
index. Compare this to Bing Web's larger share of 21 items, which appeared on
average only a week earlier in the Bing Web index than in Yokie's. Yokie is
also seen to fall behind in terms of user engagement relative to Bing News publication.
While Yokie users had clicked on items that were on average almost 5 days
newer than their appearance in Bing News, this was only represented by 15
items. Bing News had considerably broader coverage, with 71 items with which
Yokie users interacted; however, these were in its indexes only 7 hours before
Yokie. Overall, comparing the time of item interactions with Bing index time
shows Yokie under-performing in terms of coverage of items with which its users
interacted, despite its good performance in gaining content earlier than Bing
Web and News. As described in Figure 3(B), Yokie users interacted with a total
of 32 items that appeared on average 15.3 days earlier than they appeared in Bing
Web. They also interacted with the majority of the news overlaps on average
1.88 days earlier than they even appeared in Bing News.
Discussion. One could sensibly assume that Bing Web and Bing News would
garner higher coverage and timely content respectively compared to our system.
However, the vast majority of Yokie results were exclusive to the system, apart
from the 170 web items and 2837 news items that are overlapping. We can use
this overlapping data to further support the hypothesis that Yokie is able to
identify these results earlier than Bing, with over 70% of the Bing Web/Yokie
overlaps appearing almost 20 days earlier in Yokie, and almost 60% of the Bing
News/Yokie coverage appearing nearly a full day beforehand in Yokie.
Once again, the data supports the argument for the potential of a real-time
streaming technique for indexing web content. While Bing had more coverage of
earlier items amongst user-interacted results, Yokie's interacted content was not
only vastly exclusive, but also in many cases more timely than Bing's. This is a
surprising result considering Bing Web represents a capture of the entire web,
and Bing News represents a real-time capture of topical breaking news.
Our user trial and comparative evaluations have highlighted Yokie's potential
for timely, exclusive and relevant content compared with popular web and news
search services. Our approaches to contextual and user-based ranking strategies
also performed well at providing relevant content with which users interacted. In
particular, the temporally sensitive rankings that showed newer content and also
content with high longevity garnered significant interest from users. Our second
trial illustrates the shortcomings of both archival Search Engines and, surpris-
ingly, some reputable online news search services at providing timely content.
This work is not without its limitations. In building Yokie we have designed a
complex and novel information discovery service with many moving parts. This
presented something of an evaluation challenge. Unlike traditional IR systems,
there were no benchmark datasets (e.g. TREC [9], etc). Our approach has been
to attempt a live user trial. Obviously, in an ideal world one would welcome the
opportunity to engage a large cohort of users, perhaps thousands or tens of
thousands, in such an evaluation. Unfortunately this was not possible with the
resources available, and so our trial is limited to a smaller cohort of just 125 users.
Nevertheless, these users did help to clarify our primary evaluation objectives, and
their usage provided a consistent picture of common engagement patterns,
showing how they interacted with different ranking strategies and benefited
from the results they provided. Moreover, we bolstered this live evaluation with
an additional study to compare Yokie with more conventional search and news
discovery services by harnessing the activity of our live users as the basis for
search queries. Our focus on Bing was purely pragmatic, as its API's Terms and
1 Introduction
The population of the Internet is increasing day by day. Around 50 billion webpages are
now available on the Internet, and the size of the Internet is expected to double every 5
years [1, 2]. This gigantic size creates some problems too. When people
search the Internet, they are overwhelmed with information, and only
a small portion of that information is relevant to the person's interest. So a re-
commender system is needed to guide the user when searching such a vast space.
Recommender systems suggest items or persons to a user based on the user's taste so
that the user can quickly reach items of interest without going through irre-
levant information. Furthermore, a recommendation list may itself be large,
and the user usually considers only the top few items, so ranking the top items
accurately is a challenging problem in recommender systems.
Recommender systems are mainly divided into two categories: content-based fil-
tering and collaborative filtering [3]. An item is recom-
mended to an active user based on 1) the similarity of the item with other items for which
the active user has already shown some preference, or 2) the similarity of other users to the
active user, i.e. users who have similar preferences on some set of items. In content-
based filtering, the similarity is based on the item or user profile. For example, in
book recommendation, a book is recommended to a user based on its profile
similarity (book type, writer and so on) with other books that were already rated highly
by that user [3-5]. Rather than computing similarity from profiles, collaborative
filtering uses users' choices on items to derive similarity. Collaborative filtering (CF) is
widely used in social networks and on the web [3-5]. As the size of the Internet
increases day by day, scalability becomes an important issue, and matrix factorisa-
tion is an effective solution to this scalability problem [6].
In matrix factorisation (MF), data are placed into a matrix that records user in-
teractions with items. For example, R ∈ R^{N×M} is the rating matrix that represents
the preferences of N users over M items. The major purpose of matrix factorisation in this context is
to obtain some form of lower-rank approximation to R for understanding how the
users relate to the items [7]. A user's preferences over items are modelled by the user's latent
factors, and an item's appeal to users is modelled by the item's latent factors. These fac-
tor matrices are learned from the observed preferences in such a way that their product
approximates the original rating matrix [8]. For example, the target matrix R ∈ R^{N×M}
is approximated by two low-rank matrices P ∈ R^{N×K} and Q ∈ R^{M×K}:
R ≈ P·Q^t    (1)
Here K represents the dimensionality of the approximation. After learning, these la-
tent factors are used to predict unobserved ratings. The gradient descent algorithm [8] is
widely used for learning these latent factors.
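As a point of reference for the algorithm introduced below, the following is a minimal NumPy sketch of plain gradient descent matrix factorisation (R ≈ P·Q^t). It assumes that 0 marks an unobserved rating, and it is an illustrative sketch rather than the authors' exact GMF implementation.

import numpy as np

def gmf(R, K=10, lr=0.0002, iters=5000):
    # Plain gradient descent on the squared error over observed entries.
    N, M = R.shape
    rng = np.random.default_rng(0)
    P = 0.1 * rng.random((N, K))          # user latent factors
    Q = 0.1 * rng.random((M, K))          # item latent factors
    mask = (R > 0).astype(float)          # observed ratings only (assumption)
    for _ in range(iters):
        E = mask * (R - P @ Q.T)          # residual on observed cells
        P += lr * (E @ Q)                 # gradient step for P
        Q += lr * (E.T @ P)               # gradient step for Q
    return P, Q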
In this paper, we present a new algorithm for learning latent factors. Our main con-
tributions are as follows. Firstly, we develop a new gradient descent based matrix
factorisation algorithm, namely GWMF, which uses weights when updating the factor matrices in
each step of learning the latent factors. We also introduce a regularisation parameter to re-
duce overfitting. Results show that our algorithm converges faster and with higher
accuracy than the ordinary gradient descent based matrix factorisation algorithm.
2 Literature Survey
For its scalability and efficiency, matrix factorisation has become the most popular me-
thod in recommendation. A lot of research has been done to further improve ac-
curacy and reduce computational time.
In [6] probabilistic matrix factorisation (PMF) is proposed to approximate the user fea-
ture vectors and movie feature vectors from the user-movie rating matrix. The work ad-
dresses two crucial issues of recommender systems: i) complexity in terms of observed
ratings and ii) sparsity of ratings. With a cost function based on a probabilistic linear model
with Gaussian noise, this method achieves a 7% performance improvement over Netflix's
own system [17]. The authors in [9] identify a drawback of their previous PMF model [6]:
PMF is computationally complicated. To solve the problem, they propose Bayesian
Probabilistic Matrix Factorisation (BPMF), where the user and movie features have non-
zero means that need to be modelled by hyper-parameters. The pre-
dictive distribution of the rating is then approximated by a Markov chain Monte Carlo algorithm. When the
number of features is large, BPMF outperforms PMF in terms of time complexity.
A hybrid approach is used in [10] to learn incomplete ratings from a rating matrix.
Here they enforce a non-negativity constraint on the matrix factorisation model. First,
the Expectation Maximization algorithm is run for several iterations to select initial
values of the user and item matrices. These initial values are then used by Weighted Non-
negative Matrix Factorisation (WNMF). The weights serve as an indicator matrix:
a weight equals one if the rating is observed and zero if it is unobserved.
The authors in [11] extend WNMF in [10] to Graph Regularized Weighted Nonnegative
Matrix Factorisation (GWNMF) to approximate the low-rank matrices of users and items.
They use the users' and items' internal information, external information and also
social trust information. Experimental results show that this internal and external
information increases recommendation accuracy.
Clustering is applied with Orthogonal Non-negative Matrix Tri-factorisation
(ONMTF) in [12]. ONMTF is applied to decompose a user-item rating matrix into a
user matrix (U) and an item matrix (SVT). Using U and SVT, user/item clusters
and their centroids are identified. To predict the unknown rating of a user for an item,
the similarity between the test user/item and the clusters is first calculated and similar user/item clus-
ters are identified. Within the identified clusters, the K most similar neighbours are found.
A linear combination model based on the user-based approach, the item-based approach
and ONMTF is applied to calculate the test user's rating on the item. Although the expe-
rimental results do not show any accuracy improvement, their method requires less computa-
tional cost.
An ensemble of collaborative filters is applied in [13] to predict ratings on the Netflix
dataset. As MF algorithms are sensitive to local minima, the author applies different MF
algorithms with different initialisation parameters several times, and then the mean of
their predictions is used to compute the final ratings.
The authors in [14] develop non-negative matrix factorisation algorithms with the
additional constraint that all elements of the factor matrices must be non-negative.
They develop two algorithms using multiplicative update steps: one attempts to
minimise the least squares error and the other to minimise the Kullback-Leibler di-
vergence. Both algorithms guarantee locally optimal solutions and are easy
to implement.
To the best of our knowledge, there is no prior research that uses user or item
weights together with latent factor vectors to approximate ratings. Our algorithm, GWMF,
gives larger weight values to items and users with high approximation error, i.e.
those whose ratings the latent factors did not approximate well in the cur-
rent iteration, so that in the next gradient step the algorithm can focus on those items and
users and update them more. We also introduce a regularisation parameter to
reduce overfitting. Results show that GWMF achieves lower training and test error
and reaches its minimum RMSE faster than gradient descent.
The rest of the paper is arranged as follows: in the next section we present the gradient
descent based matrix factorisation algorithm (GMF) and our gradient weighted matrix
factorisation algorithm (GWMF). Experimental settings and the results of GWMF vs.
GMF are presented in Section 4. Finally, we draw conclusions and outline future work.
3 Algorithm
Given a rating matrix R and mean square error (MSE) as the optimisation function, gra-
dient descent based matrix factorisation aims to find two low-rank matrices P and Q
which minimise the following objective:
GMF = Σ_{i=1..N} Σ_{j=1..M} (R - P·Q^t)²_{ij}    (2)
As shown in Algorithm 1, GMF treats each element equally, i.e. it assigns an equal learn-
ing rate to each element in Q when updating P, and to each element in P when updating Q.
For each row R(i,:), we have R(i,:) = P(i,:)*Q, where matrix Q is fixed as the data
instance, and P(i,:) is the linear regression model parameter. Similarly, for each col-
umn R(:,j), we have:
R(:,j) = P*Q(:,j)
where we can fix matrix P and solve the parameter vector Q(:, j) for a linear regres-
sion model. In such a configuration, the matrix factorisation is formulated as a set of
linear regression models: one for each row of P and one for each column of Q.
One of the common assumptions, underlying most linear and nonlinear least
squares regression, is that each data instance provides equally precise information
about the deterministic part of the total process variation. More specifically, the standard
deviation of the error term is constant over all values of the predictor or explanatory
variables. However, this assumption does not hold, even approximately, in every
modelling application. In such situations, weighted least squares can be used to
maximise the efficiency of parameter estimation [18].
To this end, giving each instance its proper amount of influence over the parameter
estimates may improve the performance of the approximation, since a method that
treats all of the data instances equally would give less precisely measured instances
more influence than they should have and would give highly precise points too little
influence.
Therefore, in GWMF, we use weights to optimise the approximation of the target ma-
trix. The idea is that, if a user or item is not approximated well in previous
iterations, the algorithm focuses on reducing its error, so that it contributes
more to the update than others. In GWMF we give more weight to instances that are
not approximated well in previous iterations. We use two weight matrices: 1) the user
weight matrix WP ∈ R^{N×M}, which indicates how well each user's ratings over items
are approximated, and 2) the item weight matrix WQ ∈ R^{N×M}, which indicates how well
each item's ratings by users are approximated in the current iteration. At initiali-
sation, all user weight values for a rated item and all rated item weight values for a
user are equal (for example, as shown in Fig. 1(a)). Then, at each gradient step, these
weight values keep track of how the error between the rating matrix and the factor
matrices, i.e. (R - P·Q^t), changes for each observed user and item. We also enlarge this
effect with an exponential cost function for the weight updates, which results in large weight
values for instances with higher error. The weight update formula for WP is given below;
WQ follows the same update formula. Furthermore, we also normalise the weight values
by maximum normalisation.
WP = e^|R - P·Q^t|
Therefore, the algorithm converges to the training objective quickly. But in the real
world, there may be outliers or noise in datasets, so this weight update formula
constantly gives more weight to those instances and could thus make the algorithm
lose generalisation ability, i.e. overfit. To address this concern, we introduce a parameter
called rbeta, inspired by [15], which uses (R - rbeta·P·Q^t) for the weight update. So the
weight update formula becomes:
WP = e^|R - rbeta·P·Q^t|    (3)
From Fig. 1 we see how the weight values of WP and WQ change according to Equation (3) in
each gradient step. After the 1st iteration (Fig. 1(b)), the weight values of each user-item pair
vary by a large amount, so more gradient updates are performed for user-item pairs that have high
weight values than for others. When the algorithm converges, the user weight values for an item
and the item weight values for a user become equal. At 1000 iterations (Fig. 1(c)) and 2000
iterations (Fig. 1(d)) the training errors become very small, i.e. similar, and the weight values of
the user-item pairs become 1. The GWMF algorithm is given in Algorithm 2.
Algorithm 2: GWMF
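As an illustration of the weighted update described above (exponential weights on the residual scaled by rbeta, followed by maximum normalisation), the following NumPy sketch collapses WP and WQ into a single weight matrix for brevity, assumes 0 marks an unobserved rating and adds an assumed L2 regularisation strength; it is an interpretation of the description rather than the authors' exact Algorithm 2.

import numpy as np

def gwmf(R, K=10, lr=0.0002, iters=5000, rbeta=0.5, reg=0.02):
    # Weighted gradient descent in the spirit of GWMF.
    N, M = R.shape
    rng = np.random.default_rng(0)
    P = 0.1 * rng.random((N, K))
    Q = 0.1 * rng.random((M, K))
    mask = (R > 0).astype(float)                           # observed entries
    for _ in range(iters):
        E = mask * (R - P @ Q.T)                           # residual
        W = mask * np.exp(np.abs(R - rbeta * (P @ Q.T)))   # Equation (3)-style weights
        W /= max(W.max(), 1e-12)                           # maximum normalisation
        P += lr * ((W * E) @ Q - reg * P)                  # weighted gradient steps
        Q += lr * ((W * E).T @ P - reg * Q)
    return P, Q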
4 Results
To evaluate the performance of GWMF, we use two datasets: (i) a small dataset that consists
of the first 100-by-50 (user-by-item) ratings from the Movielens u1.base dataset [16] for training;
testing is done on the corresponding user-item ratings from the u1.test dataset. (ii) The Movielens data-
set, which consists of 100,000 ratings from 1,000 users on 1,700 movies. We perform
experiments separately on each of the five Movielens splits and average the results. We com-
pare the proposed GWMF with the standard GMF algorithm. For a fair comparison, we
need to make the learning rates of the two algorithms equal. To achieve this, we set the learn-
ing rates of the two algorithms in two different ways. For the smaller dataset, we divide the
learning rate of GWMF by the initialisation weight values. Here, initialisation weight values
means that if a user rated 5 items, all items receive equal weight values of 1/5 = 0.2. This is be-
cause in GWMF we multiply the weights with the errors, which decreases the learning rate. To
make it the same as GMF, we divide the learning rate of GWMF by the initialisation
weight. The initialisation weight values remain the same for all iterations, whereas the other
weight values (i.e. the multiplied WP and WQ) change according to GWMF. In the settings
for the large Movielens dataset, we keep the learning rate of GWMF unchanged but multiply
Fig. 1. The rating matrix R and the initialization weight values of WP and WQ are shown in (a). P·Q^t,
WP and WQ are shown after the 1st iteration (b), after 1000 iterations (c) and after 2000 iterations (d).
the weight value with the learning rate of GMF. So this rate is also decreased, and the learning rates
of GWMF and GMF are comparable.
RMSE is used to measure performance. For the small dataset, as the
training set is smaller and thus sparser, rating prediction is more challenging than
on the original Movielens dataset.
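For completeness, RMSE over the test ratings can be computed as in the following short sketch (the standard definition, not tied to the authors' code).

import numpy as np

def rmse(true_ratings, predicted_ratings):
    # Root mean squared error over the observed test ratings.
    t = np.asarray(true_ratings, dtype=float)
    p = np.asarray(predicted_ratings, dtype=float)
    return float(np.sqrt(np.mean((t - p) ** 2)))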
To see how weights work in matrix factorisation, we perform two experiments on the
small dataset. In one experiment we use one weight matrix to examine how instances
were approximated in previous iterations and give more weight to instances that were not
approximated well. In the other experiment, we use two weight matrices: the user weight
matrix WP, which indicates how items are approximated for a user in previous itera-
tions, and the item weight matrix WQ, which indicates how users are approximated
for an item. The idea behind this experiment is that the user factor (item factor) is multip-
lied by the rated item factors (user factors) in matrix factorisation, so these two weight
matrices give an idea of how well users and items were approximated in previous
iterations. The RMSE of two weight matrices vs. one weight matrix on the smaller version of the
Movielens dataset is given in Table 1. In Fig. 2, the RMSE on the test dataset is shown.
Table 1. RMSE of two weight matrices vs. one weight matrix in training and testing
Fig. 2. RMSE of two weight matrices vs. one weight matrix on the small dataset
From Fig. 2 we see that MF with two weight matrices performs better than with one
weight matrix. We therefore use two weight matrices for the rest of our experi-
ments. Next, we present the effect of the parameter rbeta in our algorithm. When
rbeta = 1, the weight update takes the current PQ approximation fully into account for the
next iteration, and when rbeta = 0 the weight update is the same for all iterations. For this
small dataset, we achieve the minimum RMSE for rbeta = 0.5. In Fig. 3 we present the
performance of various rbeta values for GWMF with two weight matrices.
[Plot: RMSE vs. number of iterations (×10^4) for rbeta = 1, 0.7, 0.5, 0.2 and 0.]
Fig. 3. RMSE of various rbeta values on GWMF with two weight matrices on small dataset
We also compare our algorithm to gradient descent; the results are presented in
Figs. 4 and 5 for the smaller dataset and Figs. 6 and 7 for the Movielens dataset. We set
rbeta = 1 for both datasets, and the learning rate to 0.0002 for the smaller dataset and 0.01 for the Mo-
vielens dataset.
Fig. 4. MF by gradient descent vs. MF by weighted gradient descent on the training set (Small).
Fig. 5. MF by gradient descent vs. MF by weighted gradient descent on the test set (Small).
Fig. 6. MF by gradient descent vs. MF by weighted gradient descent on the training set (Movielens)
From Figs. 4 and 5, we see that GWMF achieves lower RMSE than GMF in earlier iterations
but higher RMSE than GMF for the rest of the training phase. In testing (Fig. 5), however, GWMF
reaches its lower RMSE (0.7334) about 10 times earlier than GMF.
Fig. 7. MF by gradient descent vs. MF by weighted gradient descent on the test set (Movielens)
In earlier iterations, GWMF achieves lower RMSE, and at 5000 itera-
tions GWMF reaches its lowest RMSE. GMF reaches its lowest RMSE only at
50000 iterations, and that value is still larger than the lowest RMSE of GWMF.
On the Movielens dataset, GWMF also outperforms GMF (Figs. 6 and 7). On the test set,
GWMF achieves its lowest RMSE at 100 iterations (0.75), whereas GMF achieves
its lowest RMSE at 2000 iterations (0.76), which is also higher than that of GWMF. So our
GWMF performs better than GMF in terms of accuracy and computational cost on
both Movielens datasets.
5 Conclusion
Although gradient descent is the most widely used technique in matrix factorisation, it has
slow convergence. Researchers use multiplicative update steps or other techniques to re-
solve this problem. However, to the best of our knowledge, no research has been done
to optimise matrix factorisation by weighted gradient descent. GWMF is the first work in
this direction, where we use weights directly in the gradient update steps. As gradient descent
moves toward the objective function in each step, introducing extra weights is very challeng-
ing: there is a risk of destroying gradient descent's convergence property. By using an appro-
priate cost function and regularisation parameter, we achieve this, and our results are very
encouraging. In future, we plan to apply this algorithm to larger datasets and other prob-
lems where the gradient descent algorithm is also applied.
Acknowledgement. This project is funded by the Smart Services CRC under the
Australian Government's Cooperative Research Centre program.
References
1. The size of the World Wide Web, http://www.worldwidewebsize.com/
2. The Size of Internet to Double Every 5 Years, http://www.labnol.org/internet/internet-size-to-double-every-5-years/6569/
3. Adomavicius, G., Tuzhilin, A.: Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE Transactions on Knowledge and Data Engineering, pp. 734-749 (2005)
4. Pazzani, M., Billsus, D.: Learning and Revising User Profiles: The Identification of Interesting Web Sites. Machine Learning 27, 313-331 (1997)
5. Jannach, D., Friedrich, G.: Tutorial: Recommender Systems. In: Joint Conference on Artificial Intelligence, Barcelona (2011)
6. Salakhutdinov, R., Mnih, A.: Probabilistic Matrix Factorization. In: Neural Information Processing Systems (2008)
7. Hubert, L., Meulman, J., Heiser, W.: Two Purposes for Matrix Factorization: A Historical Appraisal. SIAM Review 42, 68-82 (2000)
8. Koren, Y., Bell, R.M., Volinsky, C.: Matrix Factorization Techniques for Recommender Systems. IEEE Computer 42(8), 30-37 (2009)
9. Salakhutdinov, R., Mnih, A.: Bayesian Probabilistic Matrix Factorization using Markov Chain Monte Carlo. In: 25th International Conference on Machine Learning (ICML), Helsinki (2008)
10. Zhang, S., Wang, W., Ford, J., Makedon, F.: Learning from Incomplete Ratings Using Non-negative Matrix Factorization. In: Proceedings of the SIAM International Conference on Data Mining, SDM (2006)
11. Gu, A., Zhou, J., Ding, C.: Collaborative Filtering: Weighted Nonnegative Matrix Factorization Incorporating User and Item Graphs. In: 10th SIAM International Conference on Data Mining (SDM), USA (2010)
12. Chen, G., Wang, F., Zhang, C.: Collaborative Filtering Using Orthogonal Nonnegative Matrix Trifactorization. Information Processing and Management 45, 368-379 (2009)
13. Wu, M.: Collaborative Filtering via Ensembles of Matrix Factorizations. In: Proceedings of KDD Cup and Workshop, pp. 43-47 (2007)
14. Lee, D.D., Seung, H.S.: Algorithms for Non-Negative Matrix Factorization. In: Advances in Neural Information Processing Systems, pp. 556-562 (2000)
15. Jin, R., Liu, Y., Si, L., Carbonell, J., Hauptmann, A.G.: A New Boosting Algorithm Using Input-Dependent Regularizer. In: Proceedings of the 20th International Conference on Machine Learning, Washington (2003)
16. MovieLens Data Sets, http://www.grouplens.org/node/73
17. Netflix Prize, http://www.netflixprize.com/
18. Gentle, J.: 6.8.1 Solutions that Minimize Other Norms of the Residuals. In: Matrix Algebra. Springer, New York (2007)
Collaborative Ranking
with Ranking-Based Neighborhood
1 Introduction
Electronic retailers and content providers offer a huge selection of products for
modern consumers. Matching consumers with products they like is very impor-
tant for users' satisfaction on the web. Recommendation systems, which provide
personalized recommendations for users, have become a very important technology for
helping users filter useless information. By automatically identifying rele-
vant information and learning users' tastes from their behavior data in the system,
recommendation systems greatly improve users' experience on the web. Inter-
net leaders like Amazon, Google, Netflix and Yahoo increasingly adopt different
kinds of recommendation systems, which generate huge profits for them.
Recommendation systems are usually divided into two classes: content-based
filtering and collaborative filtering. Content-based filtering approaches use the
content information of the users and items (products) to make recommenda-
tions, while collaborative filtering approaches just collect users' ratings on items
to make recommendations. Recommendation has been treated as a rating pre-
diction task for a long time. Two successful approaches to the rating prediction task
are the neighborhood model and the latent factor model. Neighborhood models pre-
dict a target user's rating on an item by (weighted) averaging of the item's ratings
from her/his k-nearest neighbors, who have a similar rating pattern to the
target user. Latent factor models transform both items and users into the same
latent factor space and predict a user's rating on an item by multiplying the
user's latent feature with the item's latent feature. The combination of the neighbor-
hood model and the latent factor model successfully completes the rating prediction
task [1].
However, in many recommendation systems only the top-K items are shown
to users. Recommendation is a ranking problem in the top-K recommenda-
tion situation. Ranking is more about predicting items' relative order rather
than predicting ratings on items, and it is broadly researched in information re-
trieval. The problem of ranking documents for given queries is called Learning
To Rank (LTR) [2] in information retrieval. If we treat users in recommendation
systems as queries and items as documents, then we can use LTR algorithms
to solve the recommendation problem. A key problem in using LTR models for
recommendation is the lack of features. In information retrieval, explicit features
are extracted from (query, document) pairs. Generally, three kinds of features
can be used: query features, document features, and query-document-dependent
features. In recommendation systems, users' profiles and items' profiles are not
easy to represent as explicit features. Extracting effective features for rec-
ommendation systems is therefore an emerging problem and is also very important for
learning a good ranking model.
Some work tries to extract features for user-item pairs [3,4]. In [3], the
authors extract features for a given (u, i) pair from user u's k-nearest neighbors
who rated item i. They use a rating-based user-user similarity metric to find neigh-
bors for a target user. However, in some e-commerce systems users' preferences
for items are perceived by tracking users' actions. A user can search and browse
an item page, bookmark an item, put an item into the shopping cart and purchase
an item. Different actions indicate different preferences for the item. For example,
if a user u purchased item i and bookmarked item j, we can assume that user u
prefers item i to item j. Mapping such user actions onto a numerical scale is neither
natural nor trivial. So it is hard to accurately compute the user-user similarity
in e-commerce systems where users' feedbacks are non-numerical scores.
In this paper, we argue that using ranking-based neighbors to extract features
for user-item pairs is more suitable than using rating-based neighbors, because we
are facing and completing a ranking task. Choosing neighbors through a ranking-
based user-user similarity metric is natural and much closer to the essence of
ranking. Neighbors with similar preferences over items to the target
user will give more accurate ranking information on items than rating-based
neighbors. Moreover, we can also easily use a ranking-based user-user similarity
metric to find neighbors for a target user even when the user's feedbacks are
non-numerical scores. Our contributions in the paper can be summarized as:
2.1 Notations
In a recommendation system, we are given a set of m users U = {u1, u2, ..., um}
and a set of n items I = {i1, i2, ..., in}. We also use u, v to represent any user
and i, j to represent any item for convenience. The users' ratings on the
items are represented by an m × n matrix R, where entry ru,i represents user
u's true rating on item i; r̂u,i represents user u's predicted rating on item i, and Ru
represents all the ratings of user u.
Since different users and items have different biases, PCC and ACC were proposed
to relieve the bias problem. An alternative form of the neighborhood-based
approach is the item-based model. Item-item similarity can be computed
in the same way as user-user similarity. Given a user-item pair (u, i), the item-based
neighborhood model estimates the rating on item i based on the k most similar items that
have been rated by u. PCC, ACC and VSS are rating-based similarity metrics.
The similarity is computed purely based on the rating vectors, and it ignores the ex-
plicit preference over items. In [6], the similarity between users is computed by
Vector Space Similarity) to find neighbors for a target user. Their work has
some drawbacks. First, they cannot define user-user similarity well when users'
feedbacks are not numerical scores. Second, they choose neighbors by a rating-
based user-user similarity metric, which does not correspond to the ultimate
goal of ranking. Our work is different from theirs. We choose neighbors
with a similar preference pattern rather than a similar rating pattern. The problem with
rating-oriented approaches is that the focus has been placed on approximating
the ratings rather than the rankings, but ranking is a more important goal for
modern recommender systems.
In this section, we use the ranking-based neighbors to extract the features for
user-item pairs. A feature vector for a user-item pair (u, i) is extracted
based on the K nearest ranking-based neighbors K(u, i) = {uk}, k = 1..K, of user u who
have rated i. The features are descriptive statistics of the ranking
of item i. Every neighbor uk of user u has her/his opinion on the ranking of
item i. For every item j (j ≠ i) in uk's profile, uk has her/his preference over
item i and item j. We aggregate the K neighbors' opinions and extract statistics
from their opinions as features. We summarize the preference for i
into three K × 1 vectors Wui, Tui and Lui:
Wui(k) = (1 / (|Ruk| - 1)) Σ_{i′ ∈ Ruk\i} I[ruk,i > ruk,i′]

Lui(k) = (1 / (|Ruk| - 1)) Σ_{i′ ∈ Ruk\i} I[ruk,i < ruk,i′]

Tui(k) = (1 / (|Ruk| - 1)) Σ_{i′ ∈ Ruk\i} I[ruk,i = ruk,i′]
The three vectors describe the relative preference for i across all the neighbors
of u. Then we summarize each vector with a set of simple descriptive statistics.
1"
(Wui ) = [(Wui ), (Wui ), max(Wui ), min(Wui ), I[Wui (k) = 0]
k
k
1"
(Lui ) = [(Lui ), (Lui ), max(Lui ), min(Lui ), I[Lui (k) = 0]
k
k
1"
(Tui ) = [(Tui ), (Tui ), max(Tui ), min(Tui ), I[Tui (k) = 0]
k
k
where μ is the mean function, σ is the standard deviation function, and max and min are the
maximum and minimum functions. The last term counts the fraction of neighbors
who express any preference towards i. We convert the neighbors' ratings on i to pref-
erence descriptive features in this way. After concatenating the three statistics
vectors, we obtain the final feature representation for (u, i).
The features are descriptive statistics for the ranking of item i.
These features are statistics of the neighbors' opinions on the ranking of items,
and our experiments show that they are very robust. It is clear that ranking-
based neighbors will give more significant and accurate ranking information than
rating-based neighbors in this feature extraction setting. After we have extracted
features for every user-item pair, we can use any learning to rank algorithm
to learn a ranking model and make recommendations. We will introduce the
procedure in the next part.
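A minimal Python sketch of this feature extraction is given below; it assumes ratings are stored as per-user dictionaries (item → rating) and that the K ranking-based neighbours have already been selected, which are assumptions for illustration rather than the authors' implementation.

import numpy as np

def pair_features(i, neighbors, ratings):
    # Build Wui, Lui, Tui for item i from the K neighbours' rating profiles,
    # then summarise each with [mean, std, max, min, fraction non-zero].
    W, L, T = [], [], []
    for v in neighbors:
        others = [j for j in ratings[v] if j != i]
        denom = max(len(others), 1)            # |R_uk| - 1
        r_vi = ratings[v][i]
        W.append(sum(ratings[v][j] < r_vi for j in others) / denom)   # i preferred
        L.append(sum(ratings[v][j] > r_vi for j in others) / denom)   # i less preferred
        T.append(sum(ratings[v][j] == r_vi for j in others) / denom)  # ties
    def stats(x):
        x = np.asarray(x, dtype=float)
        return [x.mean(), x.std(), x.max(), x.min(), float(np.mean(x != 0))]
    return np.array(stats(W) + stats(L) + stats(T))   # 15-dimensional feature vector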
where w and w0 are the learned ranking model parameters. Only ratings in the training
dataset are used to select the neighbors and make predictions for the items in
the validation and testing datasets. After the prediction ratings of all unrated items
are computed, we can order them by prediction rating value and make a
top-K recommendation. We can see that the rating is not on a scale of 1 to 5
anymore; it represents the relative position of the items. Moreover, based
4 Experiments
4.1 Evaluation Measure
Since recommendation systems often show only several items to the user, a good
ranking mechanism should place relevant items at the very top of the ranking
list. We therefore choose the Normalized Discounted Cumulative Gain (NDCG)
[16] metric from information retrieval as the evaluation measure, because it is very
sensitive to the ratings of the highest ranked items, making it highly desirable
for measuring ranking quality in recommendation systems. A ranking of the test
items can be represented as a permutation π : {1, 2, ..., N} → {1, 2, ..., N}, where
π(i) = p denotes that the rank of item i is p, and i = π^{-1}(p) is the item
index ranked at the p-th position. Items in the test dataset are sorted based on their
prediction ratings. The NDCG value for a given user u is computed by:
NDCG(u, π)@K = (1 / MDCG(u)) Σ_{p=1..K} (2^{r_{u,π^{-1}(p)}} - 1) / log(p + 1)    (6)
We assume that π* is the permutation over the test items sorted by their
true ratings, and MDCG@K is the DCG@K value corresponding to the permutation π*.
We can see that a perfect ranking's NDCG value is equal to one. We set K to 10
in all our experiments, as in [11,14,4].¹ We report the macro-average NDCG@K
across all users in our experiment results.
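The following Python sketch computes NDCG@K as defined in Equation (6); it assumes the true and predicted ratings of a user's test items are given as parallel arrays.

import numpy as np

def ndcg_at_k(true_ratings, predicted_ratings, k=10):
    # Sort items by predicted rating, accumulate (2^r - 1)/log(p + 1), and
    # normalise by the DCG of the ideal (true-rating) ordering.
    true_r = np.asarray(true_ratings, dtype=float)
    k = min(k, len(true_r))
    order = np.argsort(-np.asarray(predicted_ratings, dtype=float))
    ideal = np.sort(true_r)[::-1]
    discounts = np.log(np.arange(1, k + 1) + 1.0)
    dcg = np.sum((2.0 ** true_r[order[:k]] - 1.0) / discounts)
    mdcg = np.sum((2.0 ** ideal[:k] - 1.0) / discounts)
    return dcg / mdcg if mdcg > 0 else 0.0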
4.2 Datasets
We use two well-known movie rating datasets in our experiments: Movielens100K
and Movielens1M.² Most of the state-of-the-art ranking-based collaborative
ranking algorithms also use them. The Movielens100K dataset consists of about
100K ratings made by 943 users on 1682 movies. The Movielens1M dataset
consists of about 1M ratings made by 6040 users on 3706 movies. Since users in
the two datasets have rated at least 20 items, cold-start users won't appear. The
ratings for Movielens100K and Movielens1M are on a scale from 1 to 5. Table 1
gives statistics associated with the two datasets.
We divide the whole dataset into a training dataset and a testing dataset. We
randomly sample a fixed number N of ratings for every user for training and test
on the rest of the data. We sample 10 ratings from the training dataset for validation, and
¹ NDCG values at other cutoffs give different numerical values but qualitatively almost
the same results.
² http://www.grouplens.org/node/73
after we find the best hyper-parameters, we put the validation dataset back into the training
dataset and retrain the model; we then report the ranking performance on the testing
dataset. In our experiments, we set N to 20, 30, 40 and 50, so users with fewer than
30, 40, 50 and 60 ratings, respectively, are removed to ensure that we can compute NDCG@10
on the testing dataset. The numbers of users and items after filtering for the different
N are reported in Table 1.
A user often asks her/his friends for recommendations, and a user's taste is influenced
by friends' tastes on the social network. Incorporating social effects into our collab-
orative ranking framework is our next work. Different users in recommendation
systems may vary largely; for example, the influence of neighbors on different
users can be very different. Considering the large differences between users, it is not
the best choice to use a single ranking function to deal with all users. So we also
plan to investigate how to design personalized collaborative ranking algorithms
in the future.
References
1. Koren, Y.: Factor in the neighbors: Scalable and accurate collaborative filtering. ACM Trans. Knowl. Discov. Data 4(1), 1:1-1:24 (2010)
2. Liu, T.Y.: Learning to rank for information retrieval, Berlin, Germany, vol. 3(3), pp. 225-331. Springer (2011)
3. Volkovs, M.N., Zemel, R.S.: Collaborative ranking with 17 parameters. In: Proceedings of the Twenty-sixth Annual Conference on Neural Information Processing Systems, NIPS 2012, Lake Tahoe, Nevada, United States, December 3-6. MIT Press (2013)
4. Balakrishnan, S., Chopra, S.: Collaborative ranking. In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM 2012, pp. 143-152. ACM, New York (2012)
5. Deshpande, M., Karypis, G.: Item-based top-n recommendation algorithms. ACM Trans. Inf. Syst. 22(1), 143-177 (2004)
6. Liu, N.N., Yang, Q.: EigenRank: a ranking-oriented approach to collaborative filtering. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2008, pp. 83-90. ACM, New York (2008)
7. Marden, J.I.: Analyzing and Modeling Rank Data. Chapman & Hall Press, New York (1995)
8. Cremonesi, P., Koren, Y., Turrin, R.: Performance of recommender algorithms on top-n recommendation tasks. In: Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys 2010, pp. 39-46. ACM, New York (2010)
9. Koren, Y., Sill, J.: OrdRec: an ordinal model for predicting personalized item rating distributions. In: Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys 2011, pp. 117-124. ACM, New York (2011)
10. Rendle, S., Freudenthaler, C., Gantner, Z., Schmidt-Thieme, L.: BPR: Bayesian personalized ranking from implicit feedback. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2009, Arlington, Virginia, United States, pp. 452-461. AUAI Press (2009)
11. Shi, Y., Larson, M., Hanjalic, A.: List-wise learning to rank with matrix factorization for collaborative filtering. In: Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys 2010, pp. 269-272. ACM, New York (2010)
12. Shi, Y., Karatzoglou, A., Baltrunas, L., Larson, M., Hanjalic, A., Oliver, N.: TFMAP: optimizing MAP for top-n context-aware recommendation. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2012, pp. 155-164. ACM, New York (2012)
13. Shi, Y., Karatzoglou, A., Baltrunas, L., Larson, M., Oliver, N., Hanjalic, A.: CLiMF: learning to maximize reciprocal rank with collaborative less-is-more filtering. In: Proceedings of the Sixth ACM Conference on Recommender Systems, RecSys 2012, pp. 139-146. ACM, New York (2012)
14. Weimer, M., Karatzoglou, A., Bruch, M.: Maximum margin matrix factorization for code recommendation. In: Proceedings of the Third ACM Conference on Recommender Systems, RecSys 2009, pp. 309-312. ACM, New York (2009)
15. Burges, C.J.C., Ragno, R., Le, Q.V.: Learning to rank with nonsmooth cost functions. In: Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, NIPS 2006, Vancouver, British Columbia, Canada, December 4-7, pp. 193-200. MIT Press (2007)
16. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20(4), 422-446 (2002)
Probabilistic Top-k Dominating Query
over Sliding Windows
1 Introduction
Spatial databases have drawn attention in recent decades. Beckmann et al. [1] design
the R-tree as an index for efficient computation. Roussopoulos et al. [8] study nearest
neighbor queries and propose three effective pruning rules to speed up the computation.
They also extend nearest neighbor queries to k nearest neighbor queries, which
return the top-k preferred objects. Borzsonyi et al. [4] are among the first to study the skyline
operator and propose an SQL syntax for the skyline query.
The top-k dominating query is first studied by Yiu et al. [10]; it retrieves the top-
k objects that are better than the largest number of objects in a dataset. This is quite
different from the skyline query in [4], which retrieves objects that are not worse than
other objects.
Uncertain data analysis is of importance in many emerging applications, such as
sensor networks, trend prediction and moving object management. There has been a lot
of work focusing on uncertain data management, [5,11] to cite a few.
In this work, we study the probabilistic top-k dominating query over sliding windows.
We employ the sliding-window data model used in [9,13]. In sliding windows,
data is treated as a stream, and only the most recent N objects are considered.
There is some closely related work, seen in [7,12]. However, these works study objects
with multiple instances, and their queries retrieve the top-k dominating objects from
the whole data set, while our paper studies objects from an append-only data stream, and
our main concern is to maintain the top-k dominating objects among the most recent N
objects, where N is the window size.
The probabilistic top-k dominating query is desirable in various real-life applications.
Table 1 illustrates a house rental scenario, where answering such a type of query can be
beneficial. Lessees are interested in knowing the top-1 house among the 4 most recent advertise-
ments. Each advertisement is associated with its house ID, post time, price, distance
to the supermarket and trustability. Trustability is derived from lessees' feedback on the
lessor's product quality; this trustability value can also be regarded as the probability
that the house is the same as what it claims to be.
We assume that a lower price and a shorter distance to the supermarket are preferred and
that lessees do not care about other attributes. We also assume that lessees want a house
that is better than most of the others. A top-k dominating (k = 1) query would retrieve {H1} as the
result for the first 4 objects {H1, H2, H3, H4} because H1 is better than H2, H3 and H4.
{H4} is the result for the next sliding window {H2, H3, H4, H5}. However, we may
also notice that H4 is of low trustability, so it is more reasonable to take the trustabil-
ity into consideration. We model such an on-line selection problem as a probabilistic top-k
dominating query over sliding windows by regarding on-line advertisements as an
append-only data stream. This query will be formally defined in Section 2. Hence, the
probabilistic top-k dominating query over sliding windows provides a more reasonable
solution under the given scenario.
2 Preliminary
This section introduces the background information, and defines the problem.
2.1 Background
When comparing two objects, if one object p is not worse than another object q in all aspects
and p is better than q in at least one aspect, we say p dominates q. Formally, we have
Definition 1.
Definition 1 (Dominate). Consider two distinct d-dimensional objects p and q, and let
p[i] denote the i-th value of p. p dominates q (denoted by p ≺ q) iff p[i] ≤ q[i] holds
for 1 ≤ i ≤ d, and there exists j such that p[j] < q[j].
Based on Definition 1, [10] develops a score function that counts the number of objects
dominated by an object:
μ(p) = |{p′ ∈ D | p ≺ p′}|.    (1)
Consequently, a top-k dominating query on a data set retrieves the top-k objects with
maximal μ(p) in the data set.
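A minimal sketch of the dominance test and the score of Equation (1), assuming objects are tuples of numeric attributes where smaller values are preferred (as in the house-rental example):

def dominates(p, q):
    # Definition 1: p is no worse in every dimension and strictly better in one.
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def dominating_score(p, dataset):
    # mu(p): number of objects in the dataset dominated by p (Equation (1)).
    return sum(dominates(p, q) for q in dataset if q is not p)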
Given a sequential data stream DS of uncertain data objects, a possible world W
is a subsequence of DS; i.e., each object from DS can either be in W or not. For
example, given a data stream with 3 objects, {a, b, c}, there are in total 8 possible worlds:
{∅, {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c}}. The probability of W appearing is
the product of the probabilities of all objects which appear in W and one minus the
probabilities of all objects which do not appear in W; i.e.,
P(W) = Π_{ω ∈ W} P(ω) × Π_{ω ∈ DS\W} (1 - P(ω)).
We compute the function μ in Equation (1) for every object in each possi-
ble world W. For a top-k dominating query over a sliding window, the probability of each
qualified possible world W is accumulated in order to get the overall probability.
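As a small illustration of the possible-world semantics (with made-up existence probabilities), the following snippet enumerates the 8 worlds of the 3-object stream {a, b, c} and checks that their probabilities sum to one:

from itertools import combinations

def world_probability(world, existence_prob):
    # P(W): product of P(o) for objects in W and (1 - P(o)) for the rest.
    p = 1.0
    for obj, prob in existence_prob.items():
        p *= prob if obj in world else (1.0 - prob)
    return p

probs = {"a": 0.9, "b": 0.5, "c": 0.2}              # made-up probabilities
worlds = [set(c) for r in range(4) for c in combinations(probs, r)]
assert abs(sum(world_probability(w, probs) for w in worlds) - 1.0) < 1e-9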
[Figure: entries E, E1, E2 and E3 plotted in the unit square [0, 1] × [0, 1].]
We observe that the complexity can be reduced by viewing the problem from another
perspective. Instead of enumerating possible worlds, we retrieve the largest l such
that at least l of the M objects exist with sufficient probability, where M is the number of all
dominated objects.
Specifically, let mi,j denote the probability that exactly i objects out of j objects
exist, and P(i) denote the probability that the i-th object exists. We have the following
relations.
mi,j = m0,j-1 · (1 - P(j))                          if i = 0
mi,j = mi-1,j-1 · P(i)                              if i = j            (2)
mi,j = mi-1,j-1 · P(j) + mi,j-1 · (1 - P(j))        otherwise
Equation (2), known as the Poisson-Binomial Recurrence [6], is widely used in uncertain
database analysis [2,3]. Applying Equation (2), we propose a dynamic programming al-
gorithm that efficiently computes the l value of a given object, encapsulated in Algorithm 1.
Algorithm 1 takes ω and Ω as inputs, where ω is the object whose l value we need to
compute, and Ω contains all the objects which are dominated by ω. A matrix is
used to store the intermediate results, initialized in Line 2. Then, we use Equation (2) to
fill in the cells of the matrix (Lines 4-13). Finally, we subtract mi,size · P(ω) from f one term
at a time until f is less than q (Lines 15-17), and i - 1 is returned as the result.
Algorithm 1. calDS(Ω, ω)
1 if P(ω) < q then return 0;
2 m0,0 = 1; i = 1;
3 size = |Ω|; f = P(ω); j = 1;
4 while j ≤ size do
5   m0,j = m0,j-1 · (1 - P(j));
6   j = j + 1;
7 while i ≤ size do
8   mi,i = mi-1,i-1 · P(i);
9   j = i + 1;
10  while j ≤ size do
11    mi,j = mi-1,j-1 · P(j) + mi,j-1 · (1 - P(j));
12    j = j + 1;
13  i = i + 1;
14 i = 0;
15 while f > q do
16   f = f - mi,size · P(ω);
17   i = i + 1;
18 return i - 1;
Example 2. Consider the running example in Section 2.2. We only compute H4 and H6, since
Ω of H3 and of H5 is empty. For H4, P(H4) = 0.5 < 0.6, and hence H4.l = 0. For H6,
P(H6) = 0.9 > 0.6, m0,1 = 0.15 and m1,1 = 0.85. As 0.9 - 0.15 · P(H6) = 0.765 >
0.6 and 0.765 - 0.85 · P(H6) = 0 < 0.6, H6.l is 1. Therefore, the top-1 object among
{H3, H4, H5, H6} is H6.
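A compact Python sketch of the computation performed by Algorithm 1 is shown below; it fills the same Poisson-Binomial recurrence with an in-place one-dimensional table instead of the full matrix, and reproduces the values of Example 2.

def cal_l(p_obj, dominated_probs, q):
    # p_obj: existence probability of the object; dominated_probs: existence
    # probabilities of the objects it dominates; q: probability threshold.
    if p_obj < q:
        return 0
    size = len(dominated_probs)
    m = [1.0] + [0.0] * size          # m[i] = P(exactly i of the first j objects exist)
    for j, pj in enumerate(dominated_probs, start=1):
        for i in range(j, 0, -1):     # update high i first (Equation (2))
            m[i] = m[i - 1] * pj + m[i] * (1.0 - pj)
        m[0] *= (1.0 - pj)
    f, i = p_obj, 0
    while f > q and i <= size:
        f -= m[i] * p_obj
        i += 1
    return i - 1

cal_l(0.9, [0.85], q=0.6)   # returns 1, matching H6 in Example 2
cal_l(0.5, [0.85], q=0.6)   # returns 0, since P(H4) = 0.5 < q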
4 Updating Technique
A naive solution for retrieving top-k dominating objects over sliding windows is to visit
each object every time the window slides. When visiting an object, it also gets all the ob-
jects it dominates by traversing the R-tree and then computes its l value using Algorithm 1.
Eventually, it chooses the k objects with maximal l values. We observe that this tedious
process can be accelerated. Given a probability threshold q and a sliding window with
size N, Algorithm 2 shows how we process every arriving object.
When a new object ωnew comes, if there are already N objects in the window, we
remove the oldest object ωold. Then, we use the function insert(ωnew) to update the l
values of objects. Finally, we collect the top-k dominating objects using a min-heap.
In the following subsections, we present novel algorithms to efficiently execute Al-
gorithm 2. We first introduce the data structure, then provide our algorithm to handle
the situation when a new object is inserted (insert(ωnew)), followed by an algorithm
to deal with a removed old object (remove(ωold)).
4.2 Insert
When a new object o_new arrives, we need to consider all the objects dominating it (denoted S1) and all the objects dominated by it (denoted S2). To this end, we perform the following tasks: 1) compute S1 and S2 by traversing the aggregate R-tree; 2) update the l value of each object in S1; and 3) compute the l value of the new object.
Algorithm 3 describes the steps we take to handle an insertion. R is the aggregate R-tree used to store all objects in the current sliding window. S3 collects all the entries which partially dominate o_new, S4 collects all the entries which are partially dominated by o_new, and S34 collects the entries which both partially dominate o_new and are partially dominated by o_new. First, we classify all the entries into different sets (Lines 1 - 6). Then, we refine the result and obtain all the entries of interest, namely S1 and S2, using probe1, probe2 and probe3 (Lines 7 - 9). This completes task 1). Task 2) is then done by updating all the entries in S1 using update1 (Line 10). Afterwards, we accomplish task 3) by computing the l value of o_new (Line 11).
Algorithm 3. insert(o_new)
1 for each E ∈ R.root do
2     if E dominates o_new then add E to S1;
3     else if o_new dominates E then add E to S2;
4     else if E partially dominates o_new and o_new does not partially dominate E then add E to S3;
5     else if E partially dominates o_new and o_new partially dominates E then add E to S34;
6     else if o_new partially dominates E and E does not partially dominate o_new then add E to S4;
7 if S3 ≠ ∅ then probe1(S3);
8 if S4 ≠ ∅ then probe2(S4);
9 if S34 ≠ ∅ then probe3(S34);
10 if S2 ≠ ∅ then update1(S1);
11 if S1 ≠ ∅ then o_new.l = calDS(S2, o_new);
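The dominance predicates used in Lines 1 - 6 are not spelled out in this section; the sketch below uses one common convention for MBR-summarised entries (smaller coordinates preferable, the MBR corners standing in for the aggregate R-tree information), which is our assumption rather than the paper's definition.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, ...]

def dominates(a: Point, b: Point) -> bool:
    """a dominates b: no larger in every dimension and smaller in at least one
    (assuming smaller values are preferable)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

@dataclass
class Entry:           # an index entry summarised by its MBR corners
    lower: Point       # best corner of the MBR
    upper: Point       # worst corner of the MBR

def fully_dominates(e: Entry, p: Point) -> bool:
    # Every object inside e dominates p iff e's worst corner already dominates p.
    return dominates(e.upper, p)

def partially_dominates(e: Entry, p: Point) -> bool:
    # Some, but not necessarily all, objects inside e may dominate p.
    return dominates(e.lower, p) and not dominates(e.upper, p)

# Classifying one entry against a newly arrived point:
e = Entry(lower=(0.1, 0.1), upper=(0.4, 0.6))
print(fully_dominates(e, (0.5, 0.7)), partially_dominates(e, (0.5, 0.7)))  # True False
```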
Algorithm 4 shows how we refine S3. Note that none of the entries in S3 can be dominated by o_new. We add entries dominating o_new to S1 (Line 4 of Algorithm 4), and leave entries which only partially dominate o_new in S3 (Line 5 of Algorithm 4).
Algorithm 4. probe1(S3)
1 while S3 ≠ ∅ do
2     E = Dequeue(S3);
3     for each child E′ of E do
4         if E′ dominates o_new then add E′ to S1;
5         else if E′ partially dominates o_new then add E′ to S3;
The ideas behind Algorithm 4 and Algorithm 5 are similar. We add entries which are dominated by o_new to S2 (Line 4 of Algorithm 5), and leave entries which are only partially dominated by o_new in S4 (Line 5 of Algorithm 5).
Algorithm 5. probe2(S4)
1 while S4 ≠ ∅ do
2     E = Dequeue(S4);
3     for each child E′ of E do
4         if o_new dominates E′ then add E′ to S2;
5         else if o_new partially dominates E′ then add E′ to S4;
S34 is special in that it may contain both entries which dominate o_new and entries which are dominated by o_new. Thus, we first partition them (Lines 1 - 8), and apply probe1 and probe2 thereafter.
Algorithm 6. probe3(S34)
1 while S34 ≠ ∅ do
2     E = Dequeue(S34);
3     for each child E′ of E do
4         if E′ dominates o_new then add E′ to S1;
5         else if o_new dominates E′ then add E′ to S2;
6         else if E′ partially dominates o_new and o_new partially dominates E′ then add E′ to S34;
7         else if E′ partially dominates o_new and o_new does not partially dominate E′ then add E′ to S3;
8         else if o_new partially dominates E′ and E′ does not partially dominate o_new then add E′ to S4;
9 if S3 ≠ ∅ then probe1(S3);
10 if S4 ≠ ∅ then probe2(S4);
Algorithm 7 describes the update of S1. For each object at the leaf level, we update its l value and its top-k dominating set using o_new. As a consequence, the time complexity of the update is O(l).
Algorithm 7. update1(S1)
1 while S1 ≠ ∅ do
2     E = Dequeue(S1);
3     for each child E′ of E do
4         if E′ is at leaf level then
5             read E′;
6             for each child E″ of E′ do
7                 updateLambda(E″, o_new);
8                 updateL(E″);
9         else
10            for each child E″ of E′ do
11                add E″ to S1;
We analyse the worst-case complexity of Algorithm 3, i.e., when every object is dominated by all the previous objects. Let N denote the window size. In this case, S1 contains all N objects. Immediately, the most time-consuming part is Line 11, and thus the time complexity is given by O(N²).
Since Algorithm 3 is complicated, we give an example below.
Example 3. Consider the example in Table 1. We assume window size N = 4, probability threshold q = 0.6, and that exactly H1, H2, H3 are in the current window. H4 arrives as o_new. Before H4 arrives, we use Algorithm 1 to compute the corresponding l values of H1, H2, H3. As an example, the first three columns in Table 3 show how to compute the l value of H1, and we keep {0.015, 0.22, 0.765} as the top-k dominating set of H1. The aggregate information of the other objects is listed in the first three rows of Table 4. When H4 arrives, according to Algorithm 3, we first collect all objects dominating H4 in S1 and all objects dominated by H4 in S2. Here, we have S1 = {H1}, S2 = {H2, H3}. Then, we update the l value and the top-k dominating set of H1. Using Equation (2), we get the 4-th column in Table 3. Because (0.3825 + 0.4925) · P(H1) > q = 0.6, the l value still equals 2, and the top-k dominating set is updated to {0.0075, 0.1175, 0.4925, 0.3825}. Finally, we compute the l value and the top-k dominating set of H4. Because P(H4) < q = 0.6, we have l = 0 and an empty top-k dominating set. Therefore, the answer to the top-1 dominating query for window {H1, H2, H3, H4} is H1.
4.3 Expire
When o_old expires, the l value of each object dominated by o_old does not change. Therefore, we only need to collect all the objects dominating o_old and re-calculate their l values.
Algorithm 8. remove(o_old)
1 for each E ∈ R.root do
2     if E dominates o_old then add E to S1;
3     else if E partially dominates o_old then add E to S3;
4 while S3 ≠ ∅ do
5     E = Dequeue(S3);
6     for each child E′ of E do
7         if E′ dominates o_old then add E′ to S1;
8         else if E′ partially dominates o_old then add E′ to S3;
9 update2(S1);
10 delete o_old from R;
The update process for S1 is shown in Algorithm 9. For each object at the leaf level, we re-calculate its l value and top-k dominating set by traversing the R-tree to collect the set D of all the objects dominated by that object.
Algorithm 9. update2(S1)
1 while S1 ≠ ∅ do
2     E = Dequeue(S1);
3     for each child E′ of E do
4         if E′ is at leaf level then
5             read E′;
6             for each child E″ of E′ do
7                 traverse R to collect the set D of all objects dominated by E″;
8                 E″.l = calDS(D, E″);
9         else
10            for each child E″ of E′ do
11                add E″ to S1;
5 Experiments
This section reports the experimental studies.
5.1 Setup
We ran all experiments on a MacBook Pro with Mac OS X 10.8.1, a 2.26 GHz Intel Core 2 Duo CPU and 4 GB of 1333 MHz RAM. We evaluated the efficiency of our algorithm
against sliding window size, dimensionality, and probabilistic threshold, respectively.
The default values of the parameters are listed in Table 5.
Parameters are varied as follows:
sliding window size: 2k, 4k, 6k, 8k, 10k
dimensionality: 2, 3, 5
probabilistic threshold: 0.1, 0.3, 0.5, 0.7, 0.9
We ran all experiments with 100 windows, and repeated all experiments 10 times to get
an average execution time. The execution time in each following figure is the running
time for processing 100 windows.
5.2 Evaluation
Figures 2, 3 and 4 show the results when the window size is varied. In Figure 5, we vary the threshold probability to see the running time of the baseline algorithm. In Figure 6, we vary the threshold probability to see the running time of the efficient algorithm. From
the figures, we see that the efficient algorithm outperforms the baseline algorithm in all
cases.
(Figures 2, 3 and 4: average delay in seconds of the baseline and efficient algorithms when the window size is varied.)
As shown in Figures 2, 3 and 4, when the window size increases linearly, the running time increases in a super-linear manner for both the efficient and baseline algorithms. Moreover, by comparing Figures 2, 3 and 4, we conclude that the running time decreases greatly with increasing dimensionality. This is because the most time-consuming part of the baseline algorithm is to compute l, i.e., to collect all dominated objects and apply Algorithm 1; with higher dimensionality, each object has a smaller dominated set. A similar conclusion can be drawn for the efficient algorithm.
(Figures 5 and 6: average delay in seconds of the baseline (Figure 5) and efficient (Figure 6) algorithms on 2d, 3d and 5d data when the probabilistic threshold is varied from 0.1 to 0.9.)
From Figures 5 and 6, we see that the running time decreases as the threshold grows. With a larger threshold, more objects are pruned, i.e., objects whose probability is lower than the threshold. In addition, this effect is more pronounced in low dimensionality, as more objects are pruned by the algorithm in a lower-dimensional space.
6 Conclusion
In this paper, we have investigated the problem of top-k dominating queries in the context of sliding windows over uncertain data. We first model the probability-threshold-based top-k dominating problem, and then present a framework to handle it. Efficient techniques have been presented to process such queries continuously. Extensive experiments demonstrate the effectiveness and efficiency of our techniques.
References
1. Beckmann, N., Kriegel, H.-P., Schneider, R., Seeger, B.: The R*-tree: An efficient and robust access method for points and rectangles. In: SIGMOD Conference, pp. 322–331 (1990)
2. Bernecker, T., Kriegel, H.-P., Mamoulis, N., Renz, M., Zufle, A.: Scalable probabilistic similarity ranking in uncertain databases. IEEE Trans. Knowl. Data Eng. 22(9), 1234–1246 (2010)
3. Bernecker, T., Kriegel, H.-P., Mamoulis, N., Renz, M., Zuefle, A.: Continuous inverse ranking queries in uncertain streams. In: Bayard Cushing, J., French, J., Bowers, S. (eds.) SSDBM 2011. LNCS, vol. 6809, pp. 37–54. Springer, Heidelberg (2011)
4. Borzsonyi, S., Kossmann, D., Stocker, K.: The skyline operator. In: ICDE, pp. 421–430 (2001)
5. Cormode, G., Garofalakis, M.N.: Sketching probabilistic data streams. In: Chan, C.Y., Ooi, B.C., Zhou, A. (eds.) SIGMOD Conference, pp. 281–292. ACM (2007)
6. Lange, K.: Numerical Analysis for Statisticians. Statistics and Computing (1999)
7. Lian, X., Chen, L.: Top-k dominating queries in uncertain databases. In: Kersten, M.L., Novikov, B., Teubner, J., Polutin, V., Manegold, S. (eds.) EDBT. ACM International Conference Proceeding Series, vol. 360, pp. 660–671. ACM (2009)
8. Roussopoulos, N., Kelley, S., Vincent, F.: Nearest neighbor queries. In: SIGMOD Conference, pp. 71–79 (1995)
9. Shen, Z., Cheema, M.A., Lin, X., Zhang, W., Wang, H.: Efficiently monitoring top-k pairs over sliding windows. In: Kementsietsidis, A., Salles, M.A.V. (eds.) ICDE, pp. 798–809. IEEE Computer Society (2012)
10. Yiu, M.L., Mamoulis, N.: Efficient processing of top-k dominating queries on multi-dimensional data. In: VLDB, pp. 483–494 (2007)
11. Zhang, Q., Li, F., Yi, K.: Finding frequent items in probabilistic data. In: SIGMOD Conference, pp. 819–832 (2008)
12. Zhang, W., Lin, X., Zhang, Y., Pei, J., Wang, W.: Threshold-based probabilistic top-k dominating queries. VLDB J. 19(2), 283–305 (2010)
13. Zhang, W., Lin, X., Zhang, Y., Wang, W., Yu, J.X.: Probabilistic skyline operator over sliding windows. In: Ioannidis, Y.E., Lee, D.L., Ng, R.T. (eds.) ICDE, pp. 1060–1071. IEEE (2009)
Distributed Range Querying Moving Objects
in Network-Centric Warfare
Abstract. Sensing and analyzing moving objects is an important task for commanders to make decisions in Network-Centric Warfare (NCW). Spatio-temporal queries are essential operations for understanding the situation of moving objects. This paper proposes a new method for distributed range querying of moving objects, a problem not considered in related studies. First, we design an index structure called MOI, which uses a Hilbert curve to map moving objects into one-dimensional space and then uses a DHT to index them. Then, based on the locality-preserving characteristic of the Hilbert curve, we devise an efficient spatio-temporal range query algorithm, which greatly reduces the number of routing messages. Experimental results reveal that MOI outperforms other related algorithms and exhibits a low index maintenance cost.
1 Introduction
2 Problem Description
Moving objects constantly change their position and shape as time passes. This paper considers spatio-temporal data to be data related to positions detected by various detection methods in the earth reference space, expressing the gradual change over time of an entity's or process's state of being (position, shape, size, etc.) and properties.
On the battlefield, there exist all kinds of objects whose properties evolve over time, such as position-changing objects (individuals, tanks) and shape-changing objects (clouds, smoke, fire, etc.). In this paper, following the idea of traditional multidimensional indexes (such as R-trees), we abstract the spatio-temporal objects on the battlefield into minimum bounding rectangles (MBRs). The corresponding coordinates of an MBR (i.e., the two corner points at the bottom-left and top-right) change with the position and shape of the spatio-temporal object.
Besides moving objects, other command units exist on the battlefield. With positioning technology, the position and shape of objects can be transmitted to these units. Moving objects (such as individuals, tanks and other autonomous objects) can register with nearby units and report to them, whereas the units can also collect spatial information about the objects through continuous reconnaissance and detection.
Based on the description above, the problem can be defined as follows: on the battlefield, command units can be abstracted as network nodes (i.e., peers); each node is physically connected through ground links; the reconnaissance data are rectangles (i.e., MBRs) whose position and shape change over time; in this distributed environment, we are required to set up a spatio-temporal index mechanism, i.e., to set up a network in which each node maintains part of the global index and, through mutual communication, the nodes complete the index operations such as query, insert and update, as shown in Figure 1. This paper assumes that each moving object has a globally unique object ID.
Using consistent hashing, Chord [3] maps resources to a one-dimensional space, which destroys the locality of the resources: resources that are close in the original space are, after mapping, no longer neighbors, which does not meet the needs of spatial range queries. Therefore, this paper maps the spatio-temporal information to one-dimensional information using a Hilbert curve [4].
MOI is essentially an overlay network. Based on DHT technology, each node has a node identifier and is responsible for part of the key space. The spatio-temporal information of the moving objects whose keys fall into this part of the key space is maintained by the responsible node. The detailed description is as follows:
1) This paper views the moving objects under consideration as extended objects and represents their spatial state with a Minimal Bounding Rectangle (MBR); below, we use the MBR to represent a moving object's spatial state. We assume that the spatio-temporal information of a moving object obtained by a node has the format (Oid, Xmin, Xmax, Ymin, Ymax, t), in which Oid is the object's global ID, (Xmin, Xmax, Ymin, Ymax) are the coordinates of the MBR, and t is the MBR's timestamp.
2) This paper models the whole space under consideration as a two-dimensional square with side length E and uses a D-order Hilbert curve. The whole space is divided into cells of the same size. Through the construction of the Hilbert curve, there exists a mapping from the set P of all points in the whole space to the set V of Hilbert values.
3) This paper uses Chord as the overlay network structure, with [0, 2^m - 1] as its key space. We set m = 2D, which ensures the consistency of the key space and the set of Hilbert values.
4) The sizes of the MBRs of different moving objects may differ. One MBR may overlap several cells, while one cell may contain several MBRs. By the construction of the Hilbert curve, all points in a cell share a unique Hilbert value, so each cell can be regarded as having a unique corresponding Hilbert value; thus any MBR can be mapped to one or several Hilbert values. When an MBR maps to several Hilbert values, because of the locality-preservation property of the Hilbert curve, some of these values may be contiguous, so the MBR is in effect mapped to a series of one-dimensional regions. This paper calls such a one-dimensional region a segment (see the sketch after this list).
5) Each segment has a starting value and an end value. According to Chord's placement of keys in the key space, the MBR is stored at the corresponding nodes; the specific process is described in Section 3.2. Depending on its size, each MBR can be stored at one or more nodes, which preserves the spatio-temporal query rate. Each node builds a local HR-tree [5] index over the spatio-temporal information (Xmin, Xmax, Ymin, Ymax, t) of the MBRs it stores.
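As an illustration of items 4) and 5), the sketch below maps an MBR to Hilbert-curve segments: xy2d is the classic coordinate-to-Hilbert-value conversion, while the cell granularity handling and the example MBR are our own assumptions (the printed segments are illustrative and not those of Figure 2).

```python
def xy2d(n, x, y):
    """Hilbert value of cell (x, y) in an n x n grid, n a power of two."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                        # rotate/flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def mbr_to_segments(order, E, xmin, xmax, ymin, ymax):
    """Map an MBR inside a square space of side length E to the (H_enter, H_exit)
    segments of a D-order Hilbert curve, merging consecutive cell values."""
    n = 2 ** order                         # n x n cells
    cell = E / n
    cols = range(int(xmin / cell), min(int(xmax / cell), n - 1) + 1)
    rows = range(int(ymin / cell), min(int(ymax / cell), n - 1) + 1)
    values = sorted({xy2d(n, x, y) for x in cols for y in rows})
    segments, start = [], values[0]
    for prev, v in zip(values, values[1:]):
        if v != prev + 1:                  # a gap ends the current segment
            segments.append((start, prev))
            start = v
    segments.append((start, values[-1]))
    return segments

# A 3rd-order curve over a unit square (64 cells), analogous to the setting of Figure 2.
print(mbr_to_segments(order=3, E=1.0, xmin=0.55, xmax=0.8, ymin=0.3, ymax=0.45))
```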
Figure 2(a) shows the spatial states (namely MBRs) of STO1~STO7 (for simplicity, we write O1~O7 in the figure) at t=10, t=20 and t=40, respectively. Adopting a 3rd-order Hilbert curve, the whole space is divided into 64 cells, numbered according to the Hilbert curve. Figure 2(b) shows the distribution of the moving objects in the MOI index. If the segments of O6 obtained through the mapping are [43,45], [39,40] and 34, respectively (Figure 2(a) distinguishes different segments with different shading patterns), then, according to the segment values and Chord's resource placement strategy (see the data maintenance algorithm for details), at t=20, O6 is distributed to nodes N38, N42 and N48. Each node builds its HR-tree index locally; Figure 2(b) shows, as samples, the HR-tree index of node N21 and the finger table of node N8.
1) In accordance with the Hilbert-curve mapping rule that any coordinate in a cell maps to the number of the cell the Hilbert curve runs through, the spatial range (Qxmin, Qxmax, Qymin, Qymax) can be converted into a series of segments, each of which is a contiguous region of Hilbert values. Each segment has the structure (Henter, Hexit), where Henter is the Hilbert value at which the Hilbert curve enters the segment, Hexit is the Hilbert value at which the Hilbert curve exits the segment, and Henter ≤ Hexit;
2) For each segment, according to the Chord routing rules stated in [3] (the find_successor() function), node P routes the spatio-temporal query condition together with the segment to the successor node S of the key Hexit;
3) When receiving the spatio-temporal query request, node S performs the following check: if node R, the predecessor of node S, has an identifier value equal to or greater than the Henter value of the received segment, the query request is also forwarded to node R (see the sketch after these steps);
4) Using its local HR-tree, node S conducts time-point or time-interval queries and then routes the result back to node P.
All nodes that play the role of node S (the successor node corresponding to the Hexit value of each segment) or of node R (a predecessor node that satisfies the space-overlap condition as judged by its corresponding node S) perform steps 3) and 4).
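A compact sketch of the routing in steps 2) - 4), under simplifying assumptions (identifier wrap-around on the Chord ring is ignored, and find_successor and the local HR-tree search are placeholders supplied by the caller):

```python
class Node:
    """A toy Chord-style node; hr_tree stands for the node's local HR-tree and
    its search() method for the time-point / time-interval queries."""
    def __init__(self, ident, predecessor, hr_tree):
        self.ident, self.predecessor, self.hr_tree = ident, predecessor, hr_tree

    def handle_query(self, query, h_enter, h_exit, results):
        # Step 3): forward to the predecessor while it may still hold part of the
        # segment (identifier >= H_enter); ring wrap-around is not modelled here.
        if self.predecessor is not None and self.predecessor.ident >= h_enter:
            self.predecessor.handle_query(query, h_enter, h_exit, results)
        # Step 4): answer from the local HR-tree and report back.
        results.extend(self.hr_tree.search(query))

def route_query(segments, query, find_successor):
    """Steps 2) - 4): route the query to the successor of each segment's H_exit,
    which then fans out to qualifying predecessors."""
    results = []
    for h_enter, h_exit in segments:
        find_successor(h_exit).handle_query(query, h_enter, h_exit, results)
    return results
```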
An example of the spatio-temporal query algorithm is shown in Figure 3. According to the four segments, the query conditions are routed to nodes N38, N48 and N56, respectively. After comparison, node N56 forwards the query conditions to node N51. Each node searches its local HR-tree through time-point or time-interval queries and then returns the result to the original inquiring node.
4.1 Construction
On the battlefield, when a new command unit is added, i.e., when a node Q joins the network, the following steps are performed:
1) Node Q obtains a node identifier, e.g., by connecting to a match-making server. Node Q then connects to an arbitrary node M and sends its node identifier. Based on the Chord routing rules, namely the find_successor() function in [3], node M forwards the message. The join request of node Q is routed to the successor node N of the key corresponding to node Q's identifier.
2) Node N and node Q update their respective parts of the key space (in agreement with the Chord protocol's node-join algorithm). According to the relationship between the Hilbert mapping rules and the key space, node N shifts the moving objects in the scope of node Q to node Q and updates its local HR-tree index accordingly.
3) Node Q receives the moving objects and builds its local HR-tree index.
4.2 Maintenance
Index maintenance means that when a node leaves or fails, or the data on a node changes, the relevant nodes need to update the index; such situations occur frequently in the battlefield environment.
1) Node L sends an "I will leave the system" message together with its local moving objects to its successor node K; node K updates its local HR-tree, and then node L exits;
2) Node J, the predecessor of node L, keeps detecting the state of node L; once it finds that node L has left the system, it contacts node K according to its own successor list (the same as in Chord) and saves node K as its successor node;
3) Node K saves node J as its predecessor node; at this point, the index update is complete.
For node failures, we adopt a heartbeat mechanism that periodically tests the state of successor and predecessor nodes to detect whether a node has failed; then steps 2)~3) are taken to complete the index update.
Data Change
On the battlefield, the position and shape of all kinds of moving objects change continuously as time goes on; new situations and new objects appear constantly in the battle space, while some objects may quit or disappear (every moving object has a unique object ID). These correspond to the system operations update, insert and delete. These operations are similar to the spatio-temporal query. Specifically, when a node in the system obtains a new object O (Oid, Xmin, Xmax, Ymin, Ymax, t) to insert, the following steps are taken:
1) Based on the Hilbert mapping rules, the spatial range of object O (Xmin, Xmax, Ymin, Ymax) is transformed into a series of segments (Henter, Hexit);
2) For each segment, according to the Chord routing rules, node I routes object O and this segment to the successor node G of the key Hexit;
3) After receiving the object-insertion request, node G performs the following check: if the identifier value of node G's predecessor node H is greater than or equal to the value Henter of the received segment, the request is also forwarded to node H;
4) Using the HR-tree insertion algorithm, node G maintains the spatio-temporal data and returns the result of the successful insertion operation to node I.
All nodes that play the role of node G (the successor node corresponding to each segment's Hexit value) or of node H (a predecessor node that satisfies the space-overlap condition as judged by its corresponding node G) take steps 3)~4).
Both the delete and update operations for moving objects are similar to the insert operation; the difference is that the maintenance of local data uses the HR-tree's deletion and update algorithms instead. Due to space limitations, no more details are given here.
5 Performance Evaluation
In order to evaluate the performance of the proposed method, we use the open-source simulation platform PeerSim [6] to conduct simulation experiments. All experiments are carried out on computers with an Intel Core2 Quad 2.5 GHz CPU and 2 GB of main memory. We set the whole space under consideration to a 1×1 unit rectangle and set the length of time to 1 unit time region; since the side length of the rectangles in the experiment is far smaller than that of the whole space, we take a 7th-order Hilbert curve by default.
In order to simulate battle-space objects, we implement the GSTD [7] spatio-temporal data generation method, which is widely used in spatio-temporal index experiments. Based on the GSTD method, two data sets are used: one with synthetic data as the base state data (hereinafter referred to as data set A) and one with real data as the base state data (hereinafter referred to as data set B).
In data set A, the centers of the base state data follow a Gaussian distribution with expectation 0.5 and variance 0.1; the side lengths of the 1000 rectangles follow a uniform distribution on [0, 0.1]; the number of time sampling points is 100, and they follow a uniform distribution. The variation of the center (Δcenter) follows a uniform distribution on (-0.2, 0.2), and the variation of the side length (Δextent) follows a uniform distribution on (-0.05, 0.05). As time goes on, the overall trend of the data is to spread from the center to all around.
In data set B, we adopt the TIGER [8] data set as the ground state data. The detailed generation process is as follows: we choose all the roads in Montgomery County, Maryland, from the TIGER data set, with each road composed of a limited number of line segments. According to the start and end points of each line segment, a rectangle is generated, which serves as the ground state data. The number of time sampling points is 100, and they follow a uniform distribution. The variation of the center (Δcenter) follows a uniform distribution on (0, 0.4), and the variation of the side length (Δextent) follows a uniform distribution on (-0.05, 0.05). As time goes on, the overall trend of the data is to gradually move to the northeast of the space.
We evaluate the query performance by the number of messages produced when querying. The message number refers to the total number of messages sent by all the nodes in the network for one spatio-temporal query. For each message-number statistic, we randomly generate 300 queries and take the average number of messages as the evaluation result. We compare the query performance of MOI with Distributed Quadtree [9], P2P Meta-index [10] and Service Zone [11]. The message number is measured under different network scales and different selection rates. For data set A and data set B, the experimental results are similar, so here we only report the results for data set A.
We set the query's spatial selection rate to 0.01 and increase the number of nodes in the network from 250 to 10000, counting the corresponding average message number. As shown in Figure 4(a), the experimental results reveal that, as the network scale increases, the query performance of the proposed method is superior to that of the other three methods. This is because the space division method adopted in this paper uses a Hilbert curve with the locality-preservation property, which maps a query into contiguous one-dimensional queries. Moreover, the query algorithm proposed in this paper lets each node itself check whether its predecessor intersects the query region, which greatly reduces the number of messages and improves the query performance.
(Figure 4: number of messages of MOI, Distributed Quadtree, Service Zone and P2P Meta-index, (a) when varying the number of peers and (b) when varying the spatial selectivity.)
We set the number of nodes to 2000 and vary the query's spatial selection rate from 0.001 to 0.07, counting the corresponding average message number. As shown in Figure 4(b), the experimental results reveal that, as the spatial selection rate increases, the number of messages of the Service Zone and Distributed Quadtree methods increases very rapidly, whereas the number of messages of the proposed MOI method changes little and is lower than that of the P2P Meta-index method. This is because the Service Zone method uses CAN's [12] idea of binary space division, which is inefficient for searching. The Distributed Quadtree method uses consistent hashing, where a query is decomposed into a large number of one-dimensional intervals, increasing the traffic and degrading the query performance. The P2P Meta-index method, with its crawlers, needs to keep contacting each server, increasing the communication overhead.
Since MOI is based on a DHT, its processing performance for node join, exit and failure is similar to that of the DHT, so we do not elaborate here. In this section, the experimental evaluation is mainly based on the data-insertion cost. The numbers of messages caused by spatio-temporal data insertion for data set A and data set B were tested under different network sizes. The default spatial selection rate is 0.01, where the spatial selectivity refers to the size of the inserted MBR; the results are shown in Figure 5. With the increase in the number of nodes, for the two different data sets, the cost of maintaining the indexes shows logarithmic growth. This illustrates that the index maintenance performance of MOI is able to adapt to changes in the size of the network.
(Figure 5: number of messages for index maintenance on data sets A and B when varying the number of peers.)
6 Conclusion
In future digital battlefield, the effective index of the temporal and spatial objects will
greatly help to improve the battlefield space-time query performance, which is of
great significance. For the needs of fast access to spatial and temporal information,
faced with the distributed battlefield environment, based on peer-to-peer computing,
this article proposes distributed spatio-temporal indexing mechanism MOI. Make use
of Hilbert curve combined the Chord to design spatio-temporal query algorithms,
Distributed Range Querying Moving Objects in Network-Centric Warfare 803
indexing and maintenance algorithm. We test query performance on two different sets
of data, cost of the maintenance. The experimental results show that the local Hilbert
curves characteristics greatly improve the efficiency of the spatio-temporal query,
meanwhile the index maintenance performance is acceptable, so as to provide real-
time rapid access to the distributed spatio-temporal information in the future digital
battlefield with technical support.
References
1. Ilarri, S., Mena, E., Illarramendi, A.: Location-Dependent Query Processing: Where We Are and Where We Are Heading. ACM Computing Surveys 42(3), 1–73 (2010)
2. Wolfson, O., Xu, B., Chamberlain, S., Jiang, L.: Moving Objects Databases: Issues and Solutions. In: Proceedings of the 10th International Conference on Scientific and Statistical Database Management (SSDBM), pp. 111–122 (1998)
3. Stoica, I., Morris, R., Karger, D., Kaashoek, M.F., et al.: Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. In: Proceedings of ACM SIGCOMM, pp. 149–160 (2001)
4. Faloutsos, C., Roseman, S.: Fractals for Secondary Key Retrieval. In: Proceedings of the 8th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 247–252 (1989)
5. Nascimento, M., Silva, J.: Towards Historical R-trees. In: Proceedings of the ACM Symposium on Applied Computing (ACM-SAC), pp. 235–240 (1998)
6. PeerSim Simulator Project (2005), http://peersim.sourceforge.net
7. Theodoridis, Y., Silva, J.R.O., Nascimento, M.A.: On the Generation of Spatiotemporal Datasets. In: Proceedings of the 6th International Symposium on Spatial Databases (SSD), pp. 147–164 (1999)
8. TIGER Datasets, http://www.census.gov/geo/www/tiger/
9. Tanin, E., Harwood, A., Samet, H.: Using a Distributed Quadtree Index in Peer-to-Peer Networks. VLDB Journal 16(2), 165–178 (2007)
10. Hernández, C., Rodríguez, M.A., Marín, M.: A P2P Meta-Index for Spatio-Temporal Moving Object Databases. In: Proceedings of the 13th International Conference on Database Systems for Advanced Applications, pp. 653–660 (2008)
11. Wang, H., Zimmermann, R., Ku, W.-S.: Distributed continuous range query processing on moving objects. In: Bressan, S., Küng, J., Wagner, R. (eds.) DEXA 2006. LNCS, vol. 4080, pp. 655–665. Springer, Heidelberg (2006)
12. Ratnasamy, S., Francis, P., Handley, M., Karp, R., Shenker, S.: A Scalable Content-Addressable Network. In: Proceedings of ACM SIGCOMM, pp. 161–172 (2001)
An Efficient Approach on Answering Top-k
Queries with Grid Dominant Graph Index
Aiping Li, Jinghu Xu, Liang Gan, Bin Zhou, and Yan Jia
Abstract. Top-k queries are widely used in processing spatial data and multi-dimensional data streams to recognize important objects. To address the problem of top-k queries with a preference function, a grid dominant graph (GDG) index based on the reverse dominant point set (RDPS) is presented. Because of the correlation between top-k and skyline computation, when the size of a point's RDPS is greater than or equal to k, the point's dominant set can safely be pruned. This approach is used to prune a large number of free points for top-k queries. GDG uses a grid index to compute all RDPSs approximately and efficiently. When constructing the GDG, the grid index determines the k-max calculating region to prune free points that no top-k function would visit, which decreases the size of the index dramatically. When traversing the GDG, the grid index determines the k-max search region of the top-k function, which avoids visiting the free points of the ad-hoc top-k function. Because GDG uses the grid index to rank the data points in the same layer approximately by the k-max search region of the top-k function, this index structure visits fewer points than the traditional dominant graph (DG) structure. Moreover, the grid index needs little additional computation and storage, which makes the GDG index well suited for top-k queries. Analytical and experimental evidence shows that the GDG index performs better in both index storage and query efficiency.
1 Introduction
Given a record set D and a query function F, a top-k preference query (top-k query for short) returns the k records from D whose values of the function F on their attributes are the highest. The top-k query is crucial to many multi-criteria decision-making applications. For example, an autonomous robot exploring a given spatial region in a dangerous environment returns its search results as a sequence of objects ranked by a scoring function. As another example, in an autonomous robot soccer game, a mobile robotic system lists an ordered set of objects based on how aggressive a teammate object is or how weak a rival object is according to the score function used in the game.
Existing work on top-k query processing aims to avoid searching all the objects in the underlying dataset and to limit the number of random accesses to the spatial data. A key technique to evaluate top-k queries is building an index on the dataset. Existing indexes can be divided into four categories: sorted-list-based, layer-based, view-based and sketch-based.
1. Sorted-list-based. The methods in this category sort the records in each dimension and compute the answer by scanning the lists in parallel until the top-k results are returned, such as TA [6] and SA [2].
2. Layer-based. These algorithms organize all records into consecutive layers, such as DG [9], Onion [3] and AppRI [7]. The organization strategy is based on a common property among the records, such as belonging to the same convex-hull layer in Onion [3]. Any top-k query can be answered with at most k layers of records.
3. View-based. Materialized views are used to answer a top-k query, such as PREFER [5] and LPTA [4].
4. Sketch-based. Data-stream sketches are used to compute top-k queries, such as RankCube [8].
In this paper we propose a hybrid index structure, the gridded dominant graph (GDG), to improve top-k query efficiency; it integrates a grid index into the dominant graph (DG) index. The rest of this paper is organized as follows. The next section discusses related work. Section 3 describes the problem description and definitions. Section 4 gives the index structure and algorithms. Section 5 presents and analyzes the experiments, and Section 6 concludes.
2 Related Works
A considerable number of query answering algorithms have been proposed for top-k queries in the literature. The threshold algorithm (TA) [6] is one of the most common. TA first sorts the values of each attribute and then scans the sorted lists in parallel. A threshold value is used to prune the tuples in the rest of the lists if they cannot have better scores than the threshold. A lot of follow-on methods have built on the TA algorithm, as it addresses a common category of top-k queries in which the scoring function is a monotone function composed of the various attributes associated with the objects in the input dataset.
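For illustration, a compact Python sketch of the TA idea for a monotone scoring function (not the exact formulation of [6]; here an in-memory table doubles as the random-access source, and all names are ours):

```python
import heapq

def threshold_algorithm(table, score, k):
    """table: {object_id: (a_1, ..., a_m)}; score: monotone function of the attributes."""
    m = len(next(iter(table.values())))
    # One descending-sorted list per attribute for sorted access.
    lists = [sorted(((vals[i], oid) for oid, vals in table.items()), reverse=True)
             for i in range(m)]
    seen, top = set(), []                 # top is a min-heap of (score, object_id)
    for depth in range(len(table)):
        frontier = []
        for lst in lists:                 # one sorted access per list
            val, oid = lst[depth]
            frontier.append(val)
            if oid not in seen:           # random access for the full record
                seen.add(oid)
                heapq.heappush(top, (score(*table[oid]), oid))
                if len(top) > k:
                    heapq.heappop(top)
        if len(top) == k and top[0][0] >= score(*frontier):
            break                         # threshold reached: prune the rest
    return sorted(top, reverse=True)

data = {"a": (0.9, 0.2), "b": (0.6, 0.8), "c": (0.4, 0.9), "d": (0.1, 0.1)}
print(threshold_algorithm(data, lambda x, y: x + y, k=2))  # [(1.4, 'b'), (1.3, 'c')]
```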
The layer-based methods divide all the tuples of the dataset into several layers; any top-k query can be answered with at most k layers of records. There are several layered methods, such as DG [9], Onion [3] and AppRI [7]. The Onion method uses the convex hull as its organization rule: for a given linear preference function, the result can only lie on a convex-hull layer. The Onion process computes the convex hulls of the dataset: the first convex hull is computed, then the next is computed on the remaining data, and so on, until no data is left. The layering rule in the AppRI method is that a tuple t is put into layer l if and only if it satisfies two conditions: a) for any linear function, t is not in the result of top-(l-1); b) there exists at least one function for which t belongs to the result of top-l. The layering rule in DG is that each layer is a successive skyline: the first skyline is computed, then the second skyline is computed on the remaining data, and so on, until no data is left. The most important property of DG is the necessary condition that a record can be in the top-k answers only if all its parents in DG are in the top-(k-1) answers. Benefiting from this property, the search space (that is, the number of records retrieved from the record set to answer a top-k query) is reduced greatly. The difference between DG and the other two methods is that DG captures the dominance relationship, so it is not necessary to access and compute all the records in the first k layers.
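The DG layering rule can be illustrated in a few lines of Python (a naive sketch that is quadratic per layer; the larger-is-better dominance convention is our assumption):

```python
def dominates(a, b):
    """a dominates b: at least as good in every dimension and better in one
    (larger values preferable)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline_layers(points):
    """Peel off consecutive skylines: layer 1 is the skyline of the data set,
    layer 2 the skyline of the rest, and so on."""
    remaining, layers = list(points), []
    while remaining:
        layer = [p for p in remaining
                 if not any(dominates(q, p) for q in remaining if q is not p)]
        layers.append(layer)
        remaining = [p for p in remaining if p not in layer]
    return layers

print(skyline_layers([(0.9, 0.2), (0.6, 0.8), (0.4, 0.9), (0.1, 0.1), (0.5, 0.5)]))
# [[(0.9, 0.2), (0.6, 0.8), (0.4, 0.9)], [(0.5, 0.5)], [(0.1, 0.1)]]
```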
The DG method based on the k-skyband keeps adding candidate points. Our method instead prunes the free points based on the reverse dominant point set: in other words, if there exist k-1 points that dominate a point e, all the points dominated by e can be pruned.
That is, when f is given, we can improve the efficiency further by pruning the redundant cells in the k-MCR.
Theorem 2. For any monotone top-k preference query function f, its results are contained in the k-MSR_f.
The proof is omitted.
4 GDG Index
4.1 Grid Index Structure
The GDG index is based on the grid index and the DG index, as shown in Figure 2. There are two parts in the GDG index: the DG and the k-MCR. We maintain the layer and dominance relationships between the data points dp. Each data point dp is composed of its id, its cid (cell id), the pointer set of its children chd, and the next pointer to the next data point in the same layer sk-i. The sk-i (i = 1, ..., k) (skyline-i) represents the i-th skyline of the DG index. Data points dp with the same sk-i and in the same cell are stored in contiguous memory via a linked structure. There is one GL link table in the k-MCR; each node of the link table is a cell of the grid. The basic information Inf and the link information with the DG, DG-link, are stored in each cell.
Theorem 3. The GDG index correctly finds top-k answers.
The proof is omitted.
the CL list. The data in CL are ordered by the f value, and the count of the data in CL is k - n, where n is the size of the result set (RS). Line 1 puts the data point with the largest R value into RS. Lines 2 to 21 compute the rest of RS one by one. The proof of Algorithm 2's correctness is omitted.
We first analyze the space complexity of the algorithm. Suppose the dimensionality of the data set is d and m is the number of intervals per dimension in the index; then the total number of cells is M = m^d. Each data point needs d memory units. Let P be the ratio of k-MCR cells to the total number of cells; each such cell stores two additional values, the total number of its points and the f-value of the cell, so the total number of memory units is m^d · P · (2d + 4). Let P' be the ratio of k-MSR_f cells to the total number of cells; the number of cells whose f-value is mapped to a cell ID is m^d · P'. The number of k-MSR_f access points in GDG is s · d. So, the space complexity of our method is O(m^d), where m is the number of intervals in the index and d is the dimensionality.
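As a rough illustration, plugging the experimental settings of Section 5 (d = 3 dimensions and m = 40 or 80 intervals per dimension) into these formulas gives:

```latex
M = m^d = 40^3 = 64{,}000 \text{ cells (GDG40)}, \qquad 80^3 = 512{,}000 \text{ cells (GDG80)},
```

and the k-MCR part of the index then occupies about m^d · P · (2d + 4) = 10 · P · M memory units.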
Then we analyze the time complexity. Suppose the data are uniformly distributed; then the number of points in each cell is about N/m^d. During the construction of the k-MCR index, the complexity of building the k-MCR equals the time to scan the data set plus the time to find the determining point. Let N be the number of data points in the window; then the time to scan the data set is O(N). The time to find the determining point is the time to find the cell's Min point; in the worst case the whole cell set is searched, which takes O(M), where M is the total number of cells in the grid index. If N is large and M is small, then the complexity of building the k-MCR is O(N), where N is the window size. Similarly, the complexity of computing the k-MSR_f is O(M · P' + log(M · P')), which is approximately O(M), where M is the total number of cells in the grid index.
5 Experimental Result
All experiments were carried out on a PC with Intel Core2 6400 dual CPU
running at 2.13GHz and 2GB RAM. The operating system is Fedora with the
Linux kernel version 2.6.35.6. All algorithms were implemented in C++ and
compiled by GCC 4.5.1 with -O3 flag. Our experiments are conducted on both
real and synthetic datasets.
Synthetic datasets are generated as follows. We first use the methodologies in [10] to generate 1 million data elements with dimensionality from 2 to 5, where the spatial locations of the data elements follow an independent distribution. The synthetic datasets are denoted U_i; they have 1000K tuples, the granularity is 8000, and the values follow a normal distribution in [0, 7999]. The real dataset is from http://archive.ics.uci.edu/ml/machine-learning-databases/covtype/; it has 54 dimensions, 10 of which are numeric. We choose 3 numeric variables as dimensions, with granularities of 1989, 5787 and 5827, respectively.
We first compare the space and time cost of building the index between DG and GDG. The methods we compare with are RankCube [8] and DG [9]; the dataset is offline. Our methods are GDG40 and GDG80, where each dimension is divided into 40 and 80 intervals, respectively. For the synthetic data (3 dimensions) in Figure 3(a), when answering top-100 queries, the DG index contains 10K–572K tuples, a maximum ratio of 57.2%, while the GDG40 index contains 9K–74K tuples, a maximum ratio of 7.4%, and the GDG80 index contains 8K–39K tuples, a maximum ratio of 3.9%. For the real data in Figure 3(b) (580K tuples), when answering top-100 queries, the DG index contains 216K tuples, a ratio of 37%, while the GDG40 index contains 19K–34K tuples, a maximum ratio of 6%, and the GDG80 index contains 6K–24K tuples, a maximum ratio of 4%. So, our methods reduce the size of the index considerably.
Secondly, we compare the number of accessed data points and the query response time among our methods, RankCube [8] and DG [9]. As shown in Figure 4 and Figure 5, since each function f yields a different number of accessed data points and a different query response time, we choose 10 functions and take the average value as the result. As shown in Figure 4, our methods access fewer data points than RankCube [8] and DG [9], with GDG80 being the best. As shown in Figure 5, our methods perform better than the existing RankCube [8] and DG [9] methods, in both the synthetic-data and real-data environments. We can draw the following conclusions from the experimental results. Firstly, some data points accessed by the DG method are free points, which are pruned by the GDG method, so the GDG method decreases the number of accessed points and the response time. Secondly, the RankCube method accesses the data set in an unordered fashion and
6 Conclusions
The top-k query is widely used in spatial data processing and in searching multi-dimensional data streams to recognize important objects. A grid index based on the reverse dominant point set (RDPS) is presented to compute multi-dimensional preference top-k queries. Because of the correlation between top-k and skyline computation, when the size of a point's RDPS is greater than or equal to k, the point's dominant set can safely be pruned. The RDPS can be calculated quickly with a grid-index algorithm. Combining the grid index and the dominant graph index, we obtain the gridded dominant graph (GDG) index. Analytical and experimental evidence shows that the GDG index performs better in both index storage and query efficiency.
The GDG index proposed in this paper clearly reduces the index size and the number of random accesses to the dataset, and increases the performance of top-k queries, which can satisfy the critical performance demands of robot environments.
In our future work, we will study how to apply RDPS to other layered indexes, such as Onion and AppRI, to reduce the number of random accesses to the dataset. Moreover, we will use our methods to analyze online data streams, which can be useful for mobile robots exploring computationally complicated environments.
References
1. Börzsönyi, S., Bailey, M., Kossmann, D., Stocker, K.: The Skyline Operator. In: Proc. 17th International Conference on Data Engineering, pp. 421–430 (2001)
2. Chang, K.C., He, B., Li, C., Patel, M., Zhang, Z.: Structured databases on the web: observations and implications. SIGMOD Rec., 61–70 (2004)
3. Chang, Y.C., Bergman, L., Castelli, V., Li, C.S., Lo, M.L., Smith, J.R.: The onion technique: indexing for linear optimization queries. SIGMOD Rec., 391–402 (2000)
4. Das, G., Gunopulos, D., Koudas, N., Tsirogiannis, D.: Answering top-k queries using views. In: Proc. 32nd International Conference on Very Large Data Bases, pp. 451–462 (2006)
5. Hristidis, V., Koudas, N., Papakonstantinou, Y.: PREFER: a system for the efficient execution of multi-parametric ranked queries. In: Proc. 2001 ACM SIGMOD International Conference on Management of Data, pp. 259–270 (2001)
6. Nepal, S., Ramakrishna, M.V.: Query Processing Issues in Image (Multimedia) Databases. In: Proc. 15th International Conference on Data Engineering, pp. 22–29 (1999)
7. Dong, X., Chen, C., Han, J.W.: Towards robust indexing for ranked queries. In: Proc. 32nd International Conference on Very Large Data Bases, pp. 235–246 (2006)
8. Dong, X., Han, J.W., Cheng, H., Li, X.L.: Answering top-k queries with multi-dimensional selections: the ranking cube approach. In: Proc. 32nd International Conference on Very Large Data Bases, pp. 463–474 (2006)
9. Zou, L., Chen, L.: Dominant Graph: An Efficient Indexing Structure to Answer Top-k Queries. In: Proc. 24th International Conference on Data Engineering, pp. 536–545 (2008)
10. Li, A.P., Han, Y., Zhou, B., Han, W., Jia, Y.: Detecting Hidden Anomalies Using Sketch for High-speed Network Data Stream Monitoring. Appl. Math. Inf. Sci. 6(3), 759–765 (2012)
A Survey on Clustering Techniques
for Situation Awareness
1 Introduction
This work has been funded by the Austrian Federal Ministry of Transport, Innovation
and Technology (BMVIT) under grant FFG FIT-IT 829589, FFG BRIDGE 838526
and FFG Basisprogramm 838181.
relationships can be spotted that have not even been explicitly defined so far. Especially clustering techniques are considered to be beneficial, since they require neither a-priori user-created test data sets nor other background knowledge, and they also allow anomaly detection, such as uncommon relations. If we consider a road traffic SAW system, clustering might, for example, be used to detect (ST) hotspots on a highway, i.e., road segments and time windows where atypically many accidents occur, or to reveal current major traffic flows in an urban area.
Specific Requirements of SAW. The nature of SAW systems, however, poses
specific requirements on the applicability of existing clustering techniques. SAW
systems have to cope with a large number of heterogeneous but interrelated
real-world objects stemming from various sources, which evolve over time and
space, being quantitative or qualitative in nature (as, e.g., demonstrated in [3]).
The focus of this paper is therefore on evaluating existing spatio-temporal (ST)
clustering techniques with respect to their ability to complement the specification
of critical situations in SAW systems.
Contributions. In this paper, we first systematically examine the requirements
on clustering techniques to be applicable in the SAW domain. Then, we survey
several carefully selected approaches according to our criteria stemming from
the fields of SAW and ST clustering, and compare them in our lessons learned,
highlighting their advantages and shortcomings.
Structure of the Paper. In the following section, we compare our survey
with related work (cf. Section 2). Then, we introduce our evaluation criteria
(cf. Section 3), before we present lessons learned (cf. Section 4). Due to space
limitations, the in-depth criteria-driven evaluation of each approach surveyed
can be found online1 only.
2 Related Work
Although a plethora of work exists in the field of data mining, to the best of our knowledge no other survey has dealt with the topic of clustering ST data for SAW applications so far. However, there exist surveys about ST clustering (e. g.,
[21]) and ST data mining in general (e. g., [19], [29]), surveys about spatial (e. g.,
[11]) and temporal (e. g., [34]) clustering and various surveys on clustering in
other domains (e. g., [15], [36]). Furthermore, the field of stream data mining has
received a lot of attention over the last few years and as a result several surveys
on stream data clustering (e. g., [20]) and on stream data mining (e. g., [8], [13])
in general have been conducted.
In the following paragraphs, these surveys are briefly discussed with respect to the contributions of our survey.
ST Clustering. Kisilevich et al. [21] propose a classification of ST data and fo-
cus their survey of clustering techniques on trajectories, being the most complex
setting in their classification. The main part of the survey is a discussion of sev-
eral groups of clustering approaches including different example algorithms for
1 http://csi.situation-awareness.net/stc-survey
each group. Finally, they present examples from different application domains
where clustering of trajectory data is an issue, like studying movement behavior,
cellular networks or environmental studies. In contrast to our survey, Kisilevich
et al. focus on presenting various groups of ST data mining approaches, rather
than systematically analyzing different techniques backed by a catalog of evalu-
ation criteria. Furthermore, we considered the applicability of the techniques to
our domain on basis of SAW-specific criteria, while they conducted their analysis
in a more general way. Nevertheless, their work represented a valuable starting
point for our survey, from which we especially adopted their classification of
ST data, like trajectories or ST events.
Spatial, Temporal and General Clustering. Several surveys on spatial clus-
tering (e. g., [11]) and temporal clustering (e. g., [34]) have been conducted for
different application domains. Nevertheless, none of the surveyed approaches specifically deals with the unique characteristics that arise from the domain of SAW or from ST data in general (cf. Section 3.1), but takes only either spatial or temporal characteristics into account.
Finally, there exist numerous and comprehensive surveys on non-ST clustering
(e. g., [15], [36]). These approaches cannot simply be used to work with ST data,
but have to be adopted manually to deal with its particular nature.
ST Data Mining. Kalyani et al. [19] describe the peculiarities of ST data
models and the resulting increase of complexity for the data mining algorithms.
In their survey, they outline different data mining tasks compared to their spatial
counterparts and motivate the need for dedicated ST data mining techniques.
Geetha et al. [29] present a short overview of challenges in spatial, temporal and
ST data mining. However, none of the above focuses on concrete data mining
techniques, but rather gives an overview of the field of ST data mining.
Stream Data Mining. In their survey on clustering of time series data streams,
Kavitha et al. [20] review the concepts of time series and provide an overview
of available clustering algorithms for streaming data. More general surveys of
stream data mining were conducted by Gaber et al. [8] and Ikonomovska et al.
[13]. Both review the theoretical foundations of stream data analysis and give
a rough overview of algorithms for the various stream data mining tasks and
applications. Furthermore, they mention that stream data clustering is a major
task in stream data mining and lots of algorithms have been adapted to work
on streaming data (e. g., CluStream [1], DenStream [7], ClusTree [22]).
However, in contrast to our SAW systems which store a history of the data
received, thus allowing an arbitrary number of read accesses, such stream data
mining approaches do not store the huge amount of data they process. Even
though the elements of a data stream are temporally ordered, they might ar-
rive in a time-varying and unpredictable fashion and do not necessarily contain
a timestamp, which does not allow for the identification of temporal patterns
(especially cyclic ones). Furthermore, most stream data mining approaches sur-
veyed by these authors do not deal with spatial data at all and hence are not
applicable here.
3 Evaluation Criteria
In this section, we derive a systematic criteria catalog (methodologically adhering to some of our previous surveys, e.g., [35]), which we use for evaluating
selected ST clustering techniques. The criteria catalog consists of two sets of cri-
teria, as depicted in Fig. 1, comprising SAW-specific criteria (cf. Section 3.1) and
clustering-specific criteria (cf. Section 3.2), which are detailed in the following.
Each criterion is assigned an abbreviation for reference during evaluation.
a clustering technique supports, i.e., allows for heterogeneous input data (HID),
the better it is applicable to the SAW domain.
Evolution of Input Data. As indicated in the examples above, the temporal
dimension furthermore captures the potential evolution of the observed objects
(E) with respect to their spatial (e.g., location or length) or non-spatial (e.g.,
temperature) properties. An evolution along the spatial dimension corresponds
to a mobile, i. e., moving, object, whereas an object solely evolving with respect
to the non-spatial properties represents an immobile, i.e., static, object.
Context of Input Data. The input data used in SAW systems often com-
prises objects which are bound to a certain context (CID), enforcing constraints
on the interpretation of the input data (e. g., in road traffic, the majority of
objects cannot move around in space freely, but is bound to the underlying road
network), or further requirements on the input data (e. g., necessity of velocity
information). We evaluate if and what kind of context information is required.
Fuzzy Input Data. SAW systems often have to cope with fuzzy input data
(FID), for example incidents reported by humans who only have a partial over-
view of the situation. Fuzzy input data might address the spatial properties (SP)
(e.g., uncertainty about the exact location of an object), the temporal proper-
ties (TP) (e.g., an accident has occurred within the last half an hour), or the
non-ST properties (e.g., it cannot be defined exactly what has happened).
Besides the criteria mentioned above, which detail the nature of the input
data, the following further deal with the goal of the analysis.
Intention. This criterion (I) reflects the objective of the analysis, i.e., the kind of
implicit knowledge that should be extracted by the clustering technique. Possible
intentions are clustering of events or regions with similar ST characteristics,
clustering of trajectories, or the detection of moving clusters.
Online or Offline Analysis. The requirements imposed on the clustering techniques differ with respect to the phase in which they are employed. During the configuration phase of a SAW system, we have to perform an analysis on a complete, historical data set, i.e., offline analysis. However, since a SAW system typically operates in a real-time environment, we also want to perform an analysis at runtime, i.e., online analysis. Since clustering techniques devoted to online analysis often rely on optimizations and approximations in order to deliver fast results (e.g., compute only locally optimal clusters), these approaches are less suited for configuration tasks, where computation time is not a major issue and an exact result is preferred. Thus, this criterion (OOA) reflects whether the clustering technique is only suited for the offline configuration phase, or if it is also applicable to or favored for runtime analysis.
We first give a short description of the main contribution (MC) and distinguish
the clustering techniques according to their algorithm class (AC), describing the
method by which the clusters are obtained. Following [12], this can be partitioning
(i.e., the data space is partitioned as a whole into several clusters),
hierarchical clustering (i.e., clusters are created by merging close data points
bottom-up or splitting clusters top-down), density-based clustering (i.e., clusters
are defined as exceeding a certain density of data points over a defined
region), or grid-based clustering (i.e., the data space is overlaid by a grid, and grid
cells containing similar structures are merged).
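To make the density-based notion concrete, the following small C++ sketch (our own illustration, not code from any of the surveyed techniques) marks a 2-d point as a core point whenever at least minPts points lie within radius eps of it, which is the density criterion underlying DBSCAN-style approaches (e.g., [4], [6], [33]).

#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

// Density criterion of DBSCAN-style clustering: a point is a core point
// if its eps-neighbourhood contains at least minPts points (itself included).
bool isCorePoint(const std::vector<Point>& data, std::size_t idx,
                 double eps, std::size_t minPts) {
    std::size_t neighbours = 0;
    for (const Point& q : data) {
        const double dx = data[idx].x - q.x;
        const double dy = data[idx].y - q.y;
        if (std::sqrt(dx * dx + dy * dy) <= eps)
            ++neighbours;
    }
    return neighbours >= minPts;
}

Clusters then grow by connecting core points whose neighbourhoods overlap; partitioning, hierarchical, and grid-based methods replace this neighbourhood test with, respectively, an assignment to a fixed number of clusters, iterative merging or splitting, and aggregation of grid cells.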
The remaining criteria are structured into functional and non-functional ones.
Functional Criteria. The distinct algorithmic methods yield different cluster
shapes (CS), distinguishing spherical, rectangular, or arbitrarily shaped clusters.
Besides, it is investigated if a technique can handle clusters which overlap (CO).
For analysis properties, we consider spatial analysis properties (SA), i. e., what
kind of spatial data is internally processed by the algorithm, temporal analysis
properties (TA), i. e., time instants or intervals, temporal patterns (TAP), i. e.,
linear or cyclic time patterns, and the focus of the analysis (F), i. e., if spatial
and temporal aspects are handled equally or if one is favored without ignoring
the other. We evaluate if the techniques support dedicated handling of noise (N)
within the data set and furthermore investigate the employed similarity measures
(SM) and the determinism of the produced results (D).
Non-functional Criteria. These criteria comprise the configurability (C) of the
approaches, reflecting to which degree the clustering technique can be tweaked, and
the computational complexity (CC). Furthermore, we examine whether and which
optimization strategies (O) are provided, which would be beneficial in reducing runtime
but might also affect the quality of the obtained clustering result.
(Fig. 1: The criteria catalog - intentions (cluster ST events, cluster ST regions, cluster trajectories, moving clusters), spatial properties (point, line, region), temporal properties (instant, interval), evolution (mobile, immobile), fuzziness (FID), context (CID), and online/offline analysis.)
(Fig. 2: Evaluation of the surveyed approaches against the SAW-specific criteria - Wang 2006, Birant 2006, and Tai 2007 (no evolution); Gaffney 1999, Nanni 2006, Li 2010, Lu 2011, and Gariel 2011 (object evolution); Kalnis 2005, Iyengar 2004, Neill 2005, and Li 2004 (cluster evolution); Li 2004 requires velocity information.)
The first group comprises techniques that do not entail any evolution (NE),
like clustering ST events (i. e., grouping events in close or similar regions) or
ST groups (i. e., finding regions sharing similar physical properties over time).
Techniques for trajectory clustering (i.e., grouping similar trajectories) deal with
the evolution of objects that move along these trajectories (OE). Finally, the
area of moving-object clustering (i.e., the discovery of groups of objects moving
together during the same time period, so-called moving clusters) and the detection
of spatio-temporal trends (e.g., disease outbreaks) deal with the evolution of
clusters over time (CE).
Our evaluation of ST clustering approaches for the field of SAW has revealed
interesting peculiarities of current clustering techniques. In the following, we
explain our findings grouped according to our evaluation criteria (cf. also Figures
2, 3, and 4). Note that a tick in parentheses means that the criterion is only partly
fulfilled by the approach.
Evolution Mostly Supported (E). Algorithms from the NE group work with
objects in fixed locations and thus do not support spatial evolution at all. All
other techniques allow for moving objects.
Spatial and Temporal Extent Not Handled (SP, TP). None of the surveyed
techniques can handle anything but spatial points or temporal instants;
lines, rectangles, or temporal intervals are not dealt with. Hence, currently none
of the approaches is able to directly deal with the heterogeneity (HID) criterion.
Fuzziness of Data Is Not an Issue (FID). All of the approaches treat objects
as facts and do not consider any uncertainties, except Jensen et al. [16] who
consider that objects might disappear without notifying the server and reduce
the confidence in the assumed object movement as time passes.
Almost No Context Knowledge Supported (CID). Most of the algorithms
work without any knowledge of context and do not accept any further
information than a data set and parameter values. Exceptions are techniques
that cluster the objects backed by a network graph (e. g., [5]), or techniques that
make use of velocity information for moving objects (e. g., [16], [23]).
Techniques Mainly Focus on Offline Clustering (OOA). While the surveyed
techniques mainly aim at offline clustering of closed data sets, only a few
exceptions (from each group) allow online clustering of ever-changing data sets
(e.g., [32], [10], [16]).
Majority of Algorithms Is Density-Based (AC). Our survey comprises
clustering techniques from all classes of algorithms, whereby OE techniques are
usually density-based, while CE techniques cover all algorithm classes.
Predominance of Arbitrarily Shaped Clusters (CS). OE and NE techniques
mostly result in arbitrary clusters, whereas clusters produced by CE techniques
are often restricted to spherical or rectangular shape (e. g., [14], [30]).
Cluster Overlap in CE Techniques (CO). Overlapping clusters are only
considered by a few of the techniques, mostly from the CE group (e.g., Kalnis et
al. [18]). These approaches are able to detect different clusters moving through
each other and can keep them separated until they split again.
Spatial Analysis Properties Highly Dependent on Intent of Analysis
(SA). While NE and CE techniques are focused on clustering data points, OE
techniques mostly deal with clustering of lines, since a trajectory can be inter-
polated to a line. Li et al. [23] first group the objects into micro-clusters (i. e.,
small regions) and then combine these regions to complete clusters. However, as
they cluster the micro-clusters' center points, they also deal with points only.
Only Temporal Instants (TA) and Linear Patterns (TAP). Linear pat-
terns for time instants are predominant in the surveyed approaches, while other
temporal properties like cyclic time patterns or intervals are unhandled in the
majority of cases, especially by CE techniques. Only Birant et al. [4] include
cyclic time patterns, and Nanni et al. [27] consider time intervals.
Focus on ST data (F). Most of the techniques are focused on handling the
specific nature of ST data. However, there are several spatial-data dominant
techniques originally stemming from the field of spatial clustering, enriched with
the ability to deal with temporal data aspects (e.g., [4], [32]). Only Nanni et
al. [27] presented a temporal-data dominant variation of their algorithm.
Sparse Explicit Noise Handling (N). As most of the proposed techniques
extend well-known clustering algorithms to work with ST data, the handling
(Fig. 3: Evaluation against the clustering-specific criteria - main contribution (MC), algorithm class (AC: partitioning, hierarchical, density-based, grid-based), configurability (C), computational complexity (CC), and optimization strategies (O) - for Wang 2006, Birant 2006, Chen 2007, Rosswog 2008, Jeung 2008, Rosswog 2012, and Zheng 2013.)
(Fig. 4: Evaluation against the functional criteria - cluster shape (CS), overlap (CO), spatial (SA) and temporal (TA) analysis properties, temporal patterns (TAP), focus (F), noise handling (N), similarity measures (SM), and determinism (D).)
4.3 Conclusion
temporal extent (SP) as well as cyclic time patterns (TAP) are not the focus of
ST clustering techniques. Also, fuzzy input data (FID) is not handled in the predominant
number of cases, and only a few online approaches (OOA) exist. As long as
no appropriate techniques are available, we suggest transforming the input
data to enable application of the clustering techniques reviewed in this survey
(e.g., only the starting points of traffic jams could be used for clustering, in order
to apply techniques operating on point input data only, thus discarding the
information about their extent).
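As a purely hypothetical illustration of such a transformation (the TrafficJam record and its fields are our own assumption, not part of any surveyed technique), extended traffic-jam objects could be reduced to their spatio-temporal starting points before a point-based clustering technique is applied:

#include <vector>

// Hypothetical input record: a traffic jam with a spatial extent
// (start/end positions along a road) and a temporal extent.
struct TrafficJam {
    double startX, startY, endX, endY;  // spatial extent
    double tBegin, tEnd;                // temporal extent
};

struct STPoint { double x, y, t; };     // point input for point-based techniques

// Discard the extent and keep only the starting point of each jam.
std::vector<STPoint> toStartPoints(const std::vector<TrafficJam>& jams) {
    std::vector<STPoint> points;
    points.reserve(jams.size());
    for (const TrafficJam& j : jams)
        points.push_back({j.startX, j.startY, j.tBegin});
    return points;
}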
References
1. Aggarwal, C.C., et al.: A framework for clustering evolving data streams. In: Proc.
of 29th Int. Conf. on Very Large Data Bases, pp. 81–92. VLDB Endowment (2003)
2. Andrienko, G., Andrienko, N.: Interactive cluster analysis of diverse types of spatio-temporal
data. ACM SIGKDD Explorations Newsletter 11(2), 19–28 (2009)
3. Baumgartner, N., et al.: BeAware! - situation awareness, the ontology-driven way.
Int. Journal of Data and Knowledge Engineering 69(11), 1181–1193 (2010)
4. Birant, D., Kut, A.: ST-DBSCAN: An algorithm for clustering spatial and temporal
data. Data & Knowledge Engineering 60(1), 208–221 (2007)
5. Chen, J., Lai, C., Meng, X., Xu, J., Hu, H.: Clustering moving objects in spatial
networks. In: Kotagiri, R., Radha Krishna, P., Mohania, M., Nantajeewarawat, E.
(eds.) DASFAA 2007. LNCS, vol. 4443, pp. 611–623. Springer, Heidelberg (2007)
6. Ester, M., et al.: A density-based algorithm for discovering clusters in large spatial
databases with noise. In: Proc. of 2nd Int. Conf. on Knowledge Discovery and Data
Mining (KDD), pp. 226–231. AAAI Press (1996)
7. Cao, F., et al.: Density-based clustering over an evolving data stream with noise.
In: SIAM Conf. on Data Mining, pp. 328–339 (2006)
8. Gaber, M.M., et al.: Mining data streams: a review. ACM SIGMOD Record 34(2),
18–26 (2005)
9. Gaffney, S., Smyth, P.: Trajectory clustering with mixtures of regression models.
In: Proc. of the 5th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data
Mining, KDD 1999, pp. 63–72. ACM (1999)
10. Gariel, M., et al.: Trajectory clustering and an application to airspace monitoring.
Trans. Intell. Transport. Sys. 12(4), 1511–1524 (2011)
11. Han, J., et al.: Spatial clustering methods in data mining: A survey. In: Geographic
Data Mining and Knowledge Discovery, pp. 1–29 (2011)
12. Han, J., et al.: Data Mining: Concepts and Techniques, 3rd edn. Morgan Kaufmann
(2011)
13. Ikonomovska, E., et al.: A survey of stream data mining. In: Proc. of 8th National
Conf. with Int. Participation, ETAI (2007)
14. Iyengar, V.S.: On detecting space-time clusters. In: Proc. of 2nd Int. Conf. on
Knowledge Discovery and Data Mining (KDD), pp. 587–592. AAAI Press (1996)
15. Jain, A.K., et al.: Data clustering: a review. ACM Computing Surveys 31(3),
264–323 (1999)
16. Jensen, C., et al.: Continuous clustering of moving objects. IEEE Transactions on
Knowledge and Data Engineering 19(9), 1161–1174 (2007)
17. Jeung, H., et al.: Discovery of convoys in trajectory databases. Proc. VLDB Endow.
1(1), 1068–1080 (2008)
18. Kalnis, P., Mamoulis, N., Bakiras, S.: On discovering moving clusters in spatio-temporal
data. In: Medeiros, C.B., Egenhofer, M., Bertino, E. (eds.) SSTD 2005.
LNCS, vol. 3633, pp. 364–381. Springer, Heidelberg (2005)
19. Kalyani, D., Chaturvedi, S.K.: A survey on spatio-temporal data mining. Int. Journal
of Computer Science and Network (IJCSN) 1(4) (2012)
20. Kavitha, V., Punithavalli, M.: Clustering time series data stream - a literature
survey. Int. Journal of Computer Science and Inf. Sec. (IJCSIS) 8(1) (2010)
21. Kisilevich, S., et al.: Spatio-temporal clustering: a survey. In: Data Mining and
Knowledge Discovery Handbook, pp. 1–22 (2010)
22. Kranen, P., et al.: The ClusTree: Indexing micro-clusters for anytime stream mining.
Knowledge and Information Systems Journal 29(2), 249–272 (2011)
23. Li, Y., et al.: Clustering moving objects. In: Proc. of the 10th ACM Int. Conf. on
Knowledge Discovery and Data Mining (KDD), pp. 617–622. ACM (2004)
24. Li, Z., Lee, J.-G., Li, X., Han, J.: Incremental clustering for trajectories. In: Kitagawa,
H., Ishikawa, Y., Li, Q., Watanabe, C. (eds.) DASFAA 2010. LNCS, vol. 5982,
pp. 32–46. Springer, Heidelberg (2010)
25. Lu, C.-T., Lei, P.-R., Peng, W.-C., Su, I.-J.: A framework of mining semantic
regions from trajectories. In: Yu, J.X., Kim, M.H., Unland, R. (eds.) DASFAA
2011, Part I. LNCS, vol. 6587, pp. 193–207. Springer, Heidelberg (2011)
26. Matheus, C., et al.: SAWA: An assistant for higher-level fusion and situation awareness.
In: Proc. of SPIE Conf. on Multisensor, Multisource Information Fusion:
Architectures, Algorithms, and Applications, pp. 75–85 (2005)
27. Nanni, M., Pedreschi, D.: Time-focused clustering of trajectories of moving objects.
Journal of Intelligent Information Systems 27(3), 267–289 (2006)
28. Neill, D.B., et al.: Detection of emerging space-time clusters. In: Proc. of the 11th
ACM SIGKDD Int. Conf. on Knowledge Discovery in Data Mining, KDD 2005,
pp. 218–227. ACM (2005)
29. Geetha, R., et al.: A survey of spatial, temporal and spatio-temporal data mining.
Journal of Computer Applications 1(4), 31–33 (2008)
30. Rosswog, J., Ghose, K.: Detecting and tracking spatio-temporal clusters with adaptive
history filtering. In: Proc. of 8th IEEE Int. Conf. on Data Mining, Workshops
(ICDMW), pp. 448–457 (2008)
31. Rosswog, J., Ghose, K.: Detecting and tracking coordinated groups in dense, systematically
moving, crowds. In: Proc. of the 12th SIAM Int. Conf. on Data Mining,
pp. 1–11. SIAM/Omnipress (2012)
32. Tai, C.-H., Dai, B.-R., Chen, M.-S.: Incremental clustering in geography and optimization
spaces. In: Zhou, Z.-H., Li, H., Yang, Q. (eds.) PAKDD 2007. LNCS
(LNAI), vol. 4426, pp. 272–283. Springer, Heidelberg (2007)
33. Wang, M., Wang, A., Li, A.: Mining spatial-temporal clusters from geo-databases.
In: Li, X., Zaïane, O.R., Li, Z. (eds.) ADMA 2006. LNCS (LNAI), vol. 4093, pp.
263–270. Springer, Heidelberg (2006)
34. Warren Liao, T.: Clustering of time series data - a survey. Pattern Recogn. 38(11),
1857–1874 (2005)
35. Wimmer, M., et al.: A survey on UML-based aspect-oriented design modeling. ACM
Computing Surveys 43(4), 28:1–28:33 (2011)
36. Xu, R., Wunsch II, D.C.: Survey of clustering algorithms. IEEE Transactions on
Neural Networks 16(3), 645–678 (2005)
37. Zheng, K., et al.: On discovery of gathering patterns from trajectories. In: IEEE
Int. Conf. on Data Engineering, ICDE (2013)
Parallel k-Skyband Computation
on Multicore Architecture
Xing Feng1, Yunjun Gao1, Tao Jiang2, Lu Chen1, Xiaoye Miao1, and Qing Liu1
1 Zhejiang University, China
{3090103362,gaoyj,chenl,miaoxy,liuq}@zju.edu.cn
2 Jiaxing University, China
jiangtao [email protected]
1 Introduction
In recent decades, more manufacturers are producing multicore architecture processors
due to the frequency wall. As a consequence, even personal computers have 4 cores,
8 cores, or more. It is meaningful to design algorithms that leverage this feature, especially
for problems which are computationally expensive. The k-skyband query is a typical
problem of this kind that requires extensive computation. This paper investigates the
problem and designs a parallel solution using multicore processors.
Given a data set D of elements in a d-dimensional numerical space, we say an element
p dominates another element q if p is not worse than q in every dimension and
is better than q in at least one dimension. Without loss of generality, in this paper, we
assume lower values are preferred. Given a data set D, a k-skyband query retrieves the set
of elements which are dominated by at most k elements. The concept of k-skyband is
closely related to two other concepts, skyline and top-k ranking queries. Given a data
set D, a skyline query retrieves the elements of the data set that are not dominated by any other element.
Clearly, the skyline query is a special case of k-skyband where k = 0. Given a data set
D and a ranking function f, a top-k ranking query retrieves the k elements with the highest
scores according to f. Note that the k-skyband is a minimal candidate set for top-k
queries. Thus, keeping the k-skyband of a data set is sufficient to answer all top-k queries.
Below, we give an example of k-skyband.
The 1-skyband consists of {a, i, k, b, h, m}, which are worse than at most one other element;
the 2-skyband consists of {a, i, k, b, h, m, c, g}, which are worse than at most two other elements.
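The dominance test and the k-skyband membership condition follow directly from the definitions above. The C++ sketch below is a naive O(N^2) illustration only (it is not the parallel algorithm developed in this paper) and assumes, as stated above, that lower values are preferred in every dimension.

#include <cstddef>
#include <vector>

using Element = std::vector<double>;  // one value per dimension, lower is better

// p dominates q: p is not worse in every dimension and better in at least one.
bool dominates(const Element& p, const Element& q) {
    bool betterSomewhere = false;
    for (std::size_t d = 0; d < p.size(); ++d) {
        if (p[d] > q[d]) return false;        // p is worse somewhere
        if (p[d] < q[d]) betterSomewhere = true;
    }
    return betterSomewhere;
}

// Naive k-skyband: keep every element dominated by at most k others.
std::vector<Element> kSkyband(const std::vector<Element>& data, std::size_t k) {
    std::vector<Element> result;
    for (const Element& q : data) {
        std::size_t dominators = 0;
        for (const Element& p : data)
            if (dominates(p, q)) ++dominators;
        if (dominators <= k) result.push_back(q);
    }
    return result;
}

The index-based algorithms discussed below avoid most of these pairwise comparisons by pruning whole R-tree subtrees early.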
To the best of our knowledge, this paper is among the first to study the problem of
parallel k-skyband computation on multicore architecture. Our main contributions are
summarized as follows.
- We design a parallel algorithm to efficiently compute the k-skyband query and analyse
its time complexity.
- We conduct an extensive experimental evaluation to verify the effectiveness and efficiency
of our proposed algorithm.
The rest of the paper is organized as follows. In Section 2, we formally define the
k-skyband problem and provide some necessary background information about our
parallel techniques. In Section 3, we present the BBSkyband algorithm for k-skyband
computation, which is used as the baseline algorithm. Section 4 presents the parallel
version of the BBSkyband algorithm (Parallel BBSkyband). In Section 5, we examine the
efficiency of our algorithm. Related work is summarized in Section 6. This is followed
by conclusions.
2 Background Information
We present the problem definition and introduce a parallel library, OpenMP, in this
section.
2.2 OpenMP
In this paper, we use OpenMP [5] to parallelize the k-skyband computation. OpenMP
(Open Multiprocessing) is an API that supports multi-platform shared memory multi-
processing programming in C, C++, and Fortran, on most processor architectures and
operating systems. It consists of a set of compiler directives, library routines, and envi-
ronment variables that influence run-time behavior.
OpenMP is an implementation of multithreading, a method of parallelizing whereby
a master thread (a series of instructions executed consecutively) forks a specified num-
ber of slave threads and a task is divided among them. The threads then run concur-
rently, with the runtime environment allocating threads to different processors.
There are many details about the usage of OpenMP. Here, we only present the two
examples we used in our implementation. More information can be found on their
website1 .
Example 2.
#pragma omp parallel for default(shared) private(i)
for (i = 0; i < size; i++)
    output[i] = f(input[i]);
The first line of Example 2 instructs the compiler to distribute the loop iterations among the
team of threads that encounters this work-sharing construct. Besides, default(shared)
private(i) means that all variables in the loop region are shared across all threads except
i. In the following sections, for a concise presentation, we write parallel instead of
#pragma omp parallel for default(shared) private(i).
Example 3.
omp_set_num_threads(NUMBER_OF_PROCESSORS);
The API call in Example 3 instructs the runtime to set the number of threads in subsequent
parallel regions to NUMBER_OF_PROCESSORS. This is useful for observing how the
running time changes with a varying number of threads.
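Putting the two examples together, a minimal self-contained OpenMP program might look as follows (our own illustration; the function f, the array size, and the thread count of 4 are placeholders, and the code is compiled with an OpenMP flag such as -fopenmp):

#include <omp.h>
#include <cstdio>
#include <vector>

static double f(double x) { return x * x; }  // placeholder per-element work

int main() {
    const int size = 1 << 20;
    std::vector<double> input(size, 2.0), output(size);

    omp_set_num_threads(4);  // Example 3: fix the number of threads

    int i;
    // Example 2: distribute the loop iterations among the team of threads;
    // every variable is shared except the private loop counter i.
    #pragma omp parallel for default(shared) private(i)
    for (i = 0; i < size; i++)
        output[i] = f(input[i]);

    std::printf("output[0] = %.1f\n", output[0]);
    return 0;
}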
3 Baseline Algorithm
Papadias et al. develop a branch-and-bound algorithm (BBS) to compute the skyline with
minimal I/O cost [11]. They also point out that their algorithm can be used to compute
the k-skyband with some modifications.
First, we demonstrate the main idea behind BBS.
1 http://openmp.org
Algorithm 1. BBSkyband
1  S = ∅;
2  insert all entries of the root R in the heap H;
3  while H not empty do
4      remove top entry e;
5      if e is dominated by more than k elements in S then
6          discard e;
7      else if e is an intermediate entry then
8          for each child ei of e do
9              if ei is dominated by at most k elements in S then
10                 insert ei into H;
11     else
12         insert e into S
Correctness and Complexity Analysis. We prove the correctness in two steps. First,
all the elements in S must be in the k-skyband. This is guaranteed by Theorem 1. Second,
all the elements that are discarded cannot be in the k-skyband. This is intuitive because
these elements are already dominated by more than k elements. We analyse the worst-case
complexity of Algorithm 1, i.e., the case when all the elements are k-skyband elements. Let N
denote the size of the data set. In this case, the most time-consuming parts are Line 5
and Line 9 of Algorithm 1; thus, the time complexity is O(N^2).
Although the worst-case time complexity is O(N^2), this algorithm is still efficient
because it discards non-k-skyband elements at a level as high as possible. Below is
an example of Algorithm 1.
Example 4. Consider the running example in Section 1. Suppose we are going to find the
2-skyband and that the R-tree is depicted in Figure 2.
(Fig. 1: the example data set of points a-n in a 2-d space, grouped into the entries e1-e7. Fig. 2: the corresponding R-tree.)
Applying BBSkyband (Algorithm 1),
we obtain the steps shown in Table 2. We omit some steps in which the top entry is an
element. In the initial step, we insert all the entries of the root R, e6 and e7, into H. Then
we remove the top entry, e7, from H. Note that the distance of e7 is 0 + 3 = 3, which is smaller
than that of e6, 5 + 1 = 6. Because no element in S dominates e7, we expand e7. After
expanding e7, we have {e3, e6, e5, e4} in H. Then we remove and expand e3. After
that, we have {i, e6, h, e5, e4, g} in H. Then, we remove i. Because i is at leaf level, we
insert i into S instead of expanding it. This process continues until there is no element or entry
left in H. Finally, S contains the 2-skyband, that is, {i, h, m, k, a, g, c, b}.
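The processing order in Example 4 (the distance of e7 is 0 + 3 = 3, i.e., the sum of the coordinates of its lower-left corner) can be captured by a min-heap keyed on this distance. The following C++ sketch is our own simplification; the Entry struct and its fields are assumptions, not the authors' data structures.

#include <numeric>
#include <queue>
#include <vector>

struct Entry {
    std::vector<double> corner;  // lower-left corner of the MBR, or the point itself
    int id;                      // placeholder payload (e.g., an R-tree node id)
};

// Heap key used when processing entries: the sum of the corner coordinates.
double mindist(const Entry& e) {
    return std::accumulate(e.corner.begin(), e.corner.end(), 0.0);
}

struct ByMindist {
    bool operator()(const Entry& a, const Entry& b) const {
        return mindist(a) > mindist(b);  // smallest distance on top of the heap
    }
};

// H processes entries in ascending order of mindist, as in Example 4.
using Heap = std::priority_queue<Entry, std::vector<Entry>, ByMindist>;

Pushing, for instance, e6 with corner (5, 1) and e7 with corner (0, 3) makes e7 (distance 3) come out of the heap before e6 (distance 6), matching the order in Example 4.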
4 Parallel BBSkyband
In this section, we are going to parallelize the baseline algorithm in Section 3.
By analysing the algorithm, we find that the most computation-consuming parts are Line
5 and Line 9 in Algorithm 1, because the element is compared with all the elements in S.
A naive solution is to parallelize these two lines by using the parallel command
in Example 2. However, consider Line 5: when one thread has already found more
than k dominating elements and breaks out of the loop, how can the other threads be notified? To
solve this, we either have to let a thread wait until all other comparing threads terminate,
or use a communication method between threads. Both ways decrease the benefit
we gain from parallelization.
10 else
11     parallel for each e′ of S′ do
12         for each ei ∈ S that dominates e′ do
13             e′.k++;
14             if e′.k > k then
15                 break;
16     for each e′ of S′ do
17         if e′.k ≤ k then
18             if e′ is an intermediate node of R then
19                 parallel for each child cj of e′ do
20                     for each e ∈ S that dominates cj do
21                         cj.k++;
22                         if cj.k > k then
23                             break;
27             else
28                 S = S ∪ {e′};
29 S′ = ∅;
30 return S;
elements in this buffer (Lines 4–9). Then, we compare these candidate entries with
elements that are already in the k-skyband, i.e., elements in S (Lines 11–15). This operation
can be run concurrently. After that, if the entry is an intermediate node, we compare
each child node of the entry with the k-skyband elements in S concurrently (Lines 19–23);
if the element is a leaf node, we insert it into S (Line 28). By doing this, we parallelize
the most time-consuming parts, Line 5 and Line 9 in Algorithm 1.
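A minimal sketch of the concurrent comparison step (Lines 11–15), written with the OpenMP directive from Section 2: each buffered candidate owns its private dominator counter, so a thread that exceeds k simply breaks out of its own inner loop and no cross-thread notification is needed. The types and function names are our own simplification, not the authors' implementation.

#include <cstddef>
#include <vector>

using Vec = std::vector<double>;  // one coordinate per dimension, lower is better

// Dominance test as defined in Section 1.
static bool dominates(const Vec& p, const Vec& q) {
    bool betterSomewhere = false;
    for (std::size_t d = 0; d < p.size(); ++d) {
        if (p[d] > q[d]) return false;
        if (p[d] < q[d]) betterSomewhere = true;
    }
    return betterSomewhere;
}

// For every buffered candidate in Sprime, count how many elements of the
// current k-skyband S dominate it; counting stops once the count exceeds k.
std::vector<std::size_t> countDominators(const std::vector<Vec>& Sprime,
                                         const std::vector<Vec>& S,
                                         std::size_t k) {
    std::vector<std::size_t> counts(Sprime.size(), 0);
    #pragma omp parallel for default(shared)
    for (long i = 0; i < static_cast<long>(Sprime.size()); ++i) {
        for (const Vec& s : S) {
            if (dominates(s, Sprime[i]) && ++counts[i] > k)
                break;  // only this thread's inner loop stops; others are unaffected
        }
    }
    return counts;
}

Candidates whose final count is at most k are then either expanded (if they are intermediate entries) or inserted into S, mirroring Lines 16–28 of Algorithm 2.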
Correctness and Complexity Analysis. The proof of correctness is quite similar to
that of Algorithm 1: all the elements in S must be in the k-skyband, and all the elements
that are discarded cannot be in the k-skyband. Next, we analyse the worst-case complexity
of Algorithm 2; similarly, we assume all the elements are k-skyband elements. Let N
denote the size of the data set. In this case, the most time-consuming parts are Line 12
and Line 20 of Algorithm 2, and thus the time complexity is O(N^2/c), where
c is the number of cores.
Example 5. Consider the running example in Section 3. Suppose we are going to find
the 2-skyband using Parallel BBSkyband (Algorithm 2). Table 3 shows the steps of the
algorithm. We omit some steps where there is no intermediate entry in S′. In the initial
step, we insert all the entries of the root R, e6 and e7, into the min-heap H. Then we remove
the top entry e7. Because no element in S dominates e7, e7 is stored in S′. In the next loop,
because there is an intermediate entry (e7) in S′, we compare e7 with every element in
S and record the number of elements that dominate e7 (here 0). Note that, if there
were other elements in S′, we could compare them with the elements in S concurrently. Then,
we expand e7 (Lines 19–23 in Algorithm 2); this process can also be done
concurrently. After expanding e7, we add the entries that are not dominated by more than
k elements to H (Lines 24–26 in Algorithm 2). Here, we have {e3, e6, e5, e4} in
H. Then we remove and expand e3. After that, we have {i, e6, h, e5, e4, g} in H. Note
that i will be moved to S′. A similar process goes on until there is no element or entry
left in H and S′. Finally, S contains the 2-skyband, that is, {i, h, m, a, k, g, b, c}.
5 Experiment
5.1 Setup
We ran all experiments on a DELL OPTIPLEX 790 with Windows 7, two quad-core Intel
CPUs, and 2 GB of 1333 MHz RAM. We evaluated the efficiency of our algorithm
with respect to data set distribution, dimensionality, data set size, number of cores, and
k-skyband width, respectively. Our experiments are conducted on synthetic data sets.
The synthetic data sets are generated as follows. We use the methodologies in [12] to
generate 1 million data elements with dimensionality from 2 to 5, and the spatial locations
of the data elements follow two kinds of distributions, independent and anti-correlated.
The default values of the parameters are listed in Table 4. Data set distribution and
the number of cores do not have default values because we vary them in all
experiments.
5.2 Evaluation
We evaluate our algorithm on both independent and anti-correlated data sets. Besides, we
use the command demonstrated in Example 3 to set different numbers of threads, which
can also be regarded as different numbers of cores. In particular, when c = 1, we
use BBSkyband as the baseline algorithm rather than Parallel BBSkyband.
The first set of experiments is reported in Figure 3, where the dimensionality varies. The
performance of our algorithm decreases as the dimensionality increases. This is because,
when the dimensionality increases, the cost of comparing two elements increases (note that the
comparison cost is O(d), where d is the dimensionality). Besides, there are usually more
elements in the k-skyband at higher dimensionality, because an element then has to be
better than another one in more dimensions in order to dominate it. Note that having
more elements in the k-skyband also increases the cost of operations such as Line 12 and
Line 20 in Algorithm 2.
Figure 4 evaluates the scalability with respect to the data set size. The performance
of our algorithm decreases as the data set size increases: a larger data set means
a larger R-tree index, so operations on the R-tree take more time. What is more,
a larger data set usually results in more elements in the k-skyband.
Figure 5 evaluates the impact of the k-skyband width. As expected, Figure 5 shows that
the processing cost increases when k increases. With a larger k,
we have to perform more dominance tests, and there are usually more elements in the k-skyband.
(Fig. 3: average delay in seconds for c = 1, 2, 4, 8, varying the dimensionality; (a) anti-correlated, (b) independent. Fig. 4: average delay for c = 1, 2, 4, 8, varying the data set size N from 64K to 1024K; (a) anti-correlated, (b) independent. Fig. 5: average delay for c = 1, 2, 4, 8, varying k from 2 to 10; (a) anti-correlated, (b) independent.)
In all figures, there is an obvious improvement when the number of cores rises. However,
it is hard to obtain a speedup that strictly equals the number of cores, because
there are other costs that are not parallelized in our algorithm (I/O operations, collecting
candidate elements, insertion into the heap, etc.).
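This observation can be made precise with Amdahl's law (a standard result, quoted here for illustration and not derived in the paper): if a fraction p of the running time is parallelized over c cores and the remaining 1 - p stays sequential, the achievable speedup is

S(c) = 1 / ((1 - p) + p / c),

which stays strictly below c whenever p < 1; for instance, p = 0.9 and c = 8 give a speedup of only about 4.7.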
6 Related Work
Spatial databases have drawn much attention in recent decades. Beckmann et al. [2] design
the R*-tree as an index for efficient computation. Roussopoulos et al. [12] study nearest
neighbor queries and propose three efficient pruning rules. They also extend nearest
neighbor queries to K nearest neighbor queries, which return the top-K preferred objects.
Börzsönyi et al. [3] first study the skyline operator and propose an SQL syntax
for the skyline query. After that, the skyline query is widely studied in many papers,
such as Chomicki et al. [4], Godfrey et al. [7], and Tan et al. [13]. Papadias et al. develop a
branch-and-bound algorithm (BBS) to compute the skyline with minimal I/O cost [11].
The concept of k-skyband is first proposed in [11]. [10] shows that it suffices to keep
the k-skyband of the data set to answer top-k ranked queries for any monotone
preference function f. The reverse k-skyband query is recently studied in [9].
With the development of multicore processors, a recent trend is to study parallel
skyline computation. Most of the proposed approaches are divide-and-conquer algorithms. Some papers
focus on distributed skyline computation, that is, their algorithms run on a server cluster
with several servers, such as [1]. Other papers focus on parallelizing skyline computation
on different cores instead of servers; this is different since all the threads share the
same memory and there is much less communication cost. Park et al. design a parallel
algorithm on multicore architecture with a simple idea that distributes the most
computation-consuming part to every core; our work is based on their algorithm. Vlachou
et al. [14] employ a technique that partitions the data set according to the
weight of each axis. Moreover, Köhler et al. [8] use a similar idea but a faster way
to partition the data set. Besides, Gao et al. [6] design a parallel algorithm for skyline
queries in a multi-disk environment.
7 Conclusion
In this paper, we investigate the problem of the k-skyband query. We adapt the traditional
algorithm into a modified algorithm (BBSkyband) and parallelize it, obtaining the Parallel
BBSkyband algorithm. The experimental results demonstrate that a simple design of the
parallel algorithm provides satisfactory runtime performance. We believe that more
efficient algorithms can be further developed based on the presented method.
Acknowledgements. This work was supported in part by NSFC 61003049, the Natural
Science Foundation of Zhejiang Province of China under Grant LY12F02047, the
Fundamental Research Funds for the Central Universities under Grant 2012QNA5018,
the Key Project of the Zhejiang University Excellent Young Teacher Fund (Zijin Plan), and
the Bureau of Jiaxing City Science and Technology Project under Grant 2011AY1005.
References
1. Afrati, F.N., Koutris, P., Suciu, D., Ullman, J.D.: Parallel skyline queries. In: Deutsch, A.
(ed.) ICDT, pp. 274–284. ACM (2012)
2. Beckmann, N., Kriegel, H.-P., Schneider, R., Seeger, B.: The R*-tree: An efficient and robust
access method for points and rectangles. In: SIGMOD Conference, pp. 322–331 (1990)
3. Börzsönyi, S., Kossmann, D., Stocker, K.: The skyline operator. In: ICDE, pp. 421–430
(2001)
4. Chomicki, J., Godfrey, P., Gryz, J., Liang, D.: Skyline with presorting. In: ICDE, pp. 717–719
(2003)
5. Clark, D.: OpenMP: a parallel standard for the masses. IEEE Concurrency 6(1), 10–12 (1998)
6. Gao, Y., Chen, G., Chen, L., Chen, C.: Parallelizing progressive computation for skyline
queries in multi-disk environment. In: Bressan, S., Küng, J., Wagner, R. (eds.) DEXA 2006.
LNCS, vol. 4080, pp. 697–706. Springer, Heidelberg (2006)
7. Godfrey, P., Shipley, R., Gryz, J.: Maximal vector computation in large data sets. In: VLDB,
pp. 229–240 (2005)
8. Köhler, H., Yang, J., Zhou, X.: Efficient parallel skyline processing using hyperplane projections.
In: SIGMOD Conference, pp. 85–96 (2011)
9. Liu, Q., Gao, Y., Chen, G., Li, Q., Jiang, T.: On efficient reverse k-skyband query processing.
In: Lee, S.-g., Peng, Z., Zhou, X., Moon, Y.-S., Unland, R., Yoo, J. (eds.) DASFAA 2012,
Part I. LNCS, vol. 7238, pp. 544–559. Springer, Heidelberg (2012)
10. Mouratidis, K., Bakiras, S., Papadias, D.: Continuous monitoring of top-k queries over sliding
windows. In: Chaudhuri, S., Hristidis, V., Polyzotis, N. (eds.) SIGMOD Conference, pp.
635–646. ACM (2006)
11. Papadias, D., Tao, Y., Fu, G., Seeger, B.: Progressive skyline computation in database systems.
ACM Trans. Database Syst. 30(1), 41–82 (2005)
12. Roussopoulos, N., Kelley, S., Vincent, F.: Nearest neighbor queries. In: SIGMOD Conference,
pp. 71–79 (1995)
13. Tan, K.-L., Eng, P.-K., Ooi, B.C.: Efficient progressive skyline computation. In: VLDB, pp.
301–310 (2001)
14. Vlachou, A., Doulkeridis, C., Kotidis, Y.: Angle-based space partitioning for efficient parallel
skyline computation. In: SIGMOD Conference, pp. 227–238 (2008)
Moving Distance Simulation for Electric Vehicle
Sharing Systems for Jeju City Area
1 Introduction
Considering power efficiency and low carbon emissions, electric vehicles, or EVs,
are expected to penetrate into our daily lives [1]. However, private
ownership is not yet affordable to customers due to their high cost. Hence, EV
sharing is considered to be a good intermediary business model to accelerate
their deployment. EV sharing has many operation models according to manageability,
user convenience, serviceability, and other factors. In one-way rental, a customer
picks up an EV at a station and returns it to a different station. Even though this
model provides great flexibility to customers, it suffers from uneven EV distribution:
some stations do not have an EV, so a pick-up request issued at those stations
cannot be admitted. This problem can be solved only by an appropriate vehicle
relocation strategy. Obviously, there is a trade-off between relocation frequency
and relocation cost.
This research was supported by the MKE, Republic of Korea, under the IT/SW
Creative research program supervised by the NIPA (NIPA-2012-H0502-12-1002).
This research was also financially supported in part by the Ministry of Knowledge
Economy (MKE), Korea Institute for Advancement of Technology (KIAT) through
the Inter-ER Cooperation Projects.
2 Parameter Setting
The EV relocation policy must consider such factors as relocation time, the number
of service men, and the relocation goal. Practically speaking, relocation during
the operation hours may lead to service discontinuity, so non-operation hours are
appropriate for relocation and other management activities. This paper considers
three strategies and evaluates their performance using the sharing system analysis
framework implemented in our previous work [2]. First, the even relocation strategy
makes the number of EVs equal at each station. Second, the utilization-based relocation
scheme distributes EVs according to the pick-up ratio of each station.
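For illustration only, the two relocation targets described above can be computed as in the following C++ sketch (the data structures are our own assumptions and are not taken from the analysis framework of [2]): the even strategy gives every station the same number of EVs, while the utilization-based strategy allocates EVs in proportion to each station's pick-up ratio.

#include <cstddef>
#include <vector>

// Even relocation: every station receives (almost) the same number of EVs.
// Assumes stations > 0.
std::vector<int> evenTargets(int totalEVs, std::size_t stations) {
    std::vector<int> target(stations, totalEVs / static_cast<int>(stations));
    int remainder = totalEVs % static_cast<int>(stations);
    for (int i = 0; i < remainder; ++i)
        ++target[i];  // spread the leftover EVs one by one
    return target;
}

// Utilization-based relocation: EVs in proportion to each station's pick-up ratio.
std::vector<int> utilizationTargets(int totalEVs, const std::vector<double>& pickupRatio) {
    if (pickupRatio.empty()) return {};
    double sum = 0.0;
    for (double p : pickupRatio) sum += p;
    if (sum <= 0.0) return evenTargets(totalEVs, pickupRatio.size());

    std::vector<int> target(pickupRatio.size(), 0);
    int assigned = 0;
    for (std::size_t i = 0; i < pickupRatio.size(); ++i) {
        target[i] = static_cast<int>(totalEVs * pickupRatio[i] / sum);  // truncate
        assigned += target[i];
    }
    for (std::size_t i = 0; assigned < totalEVs; ++i, ++assigned)
        ++target[i % target.size()];  // distribute any rounding leftovers
    return target;
}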
3 Experiment Result
Figure 1 measures the moving distance according to the number of EVs. This is the
distance the customer actually moves to reach the sharing station. The moving distance
decreases as the number of EVs increases, since more requests can
be served. The moving distance for the utilization-based scheme is the worst. The
difference between the morning-focused and even relocation schemes reaches 2.5 m.
Here again, the morning-focused scheme shows a slightly shorter moving distance
than the even relocation scheme. The gap gets smaller when the access distance is 500 m,
where both the even and morning-focused schemes show almost the same distance,
as shown in Figure 1(b).
(Fig. 1: moving distance in meters against the number of EVs (5 to 50) for the Even, Utilization, and Morning schemes; panels (a) and (b) differ in the access distance. Fig. 2: normalized moving distance against the access distance (100 to 1000 m) for the three schemes.)
Next, Figure 2 measures the moving distance according to the access distance, which
denotes how far away from the sharing stations requests are issued. As the moving
distance is sure to increase with the access distance, the moving distance is normalized
to 1.0. Figure 2(a) and Figure 2(b) have almost the same shape, indicating independence
from the number of EVs. When the access distance is 100 m, the number of trip records
is not sufficient, so this point is not consistent with the other access distance ranges.
When the access distance is around 500 m, sharing requests from a wider range can be
served. Again, the utilization-based scheme is the worst of the three.
4 Conclusions
This paper has measured the moving distance for three practical relocation
strategies in electric vehicle sharing systems. Combined with the serviceability,
the moving distance indicates how easily the sharing system can be accessed and
thus the service quality. For the even, utilization-based, and morning-focused relocation
schemes, the experiments are conducted using taxi trip records in the Jeju City
area. The measurement results show that the morning-focused scheme generally
outperforms the others, allowing customers to access the EV sharing service more
conveniently and revealing the necessity of a proactive relocation scheme based on
future demand forecasts.
References
1. Lee, J., Kim, H.-J., Park, G.-L., Kwak, H.-Y., Lee, M.Y.: Analysis Framework for
Electric Vehicle Sharing Systems Using Vehicle Movement Data Stream. In: Wang,
H., Zou, L., Huang, G., He, J., Pang, C., Zhang, H.L., Zhao, D., Yi, Z. (eds.) APWeb
2012 Workshops. LNCS, vol. 7234, pp. 89–94. Springer, Heidelberg (2012)
2. Lee, J., Kim, H., Park, G., Kang, M.: Energy Consumption Scheduler for Demand
Response Systems in the Smart Grid. Journal of Information Science and Engineering,
955–969 (2012)
3. Kek, A., Cheu, R., Meng, Q., Fung, C.: A Decision Support System for Vehicle
Relocation Operations in Carsharing Systems. Transportation Research Part E,
149–158 (2009)
4. Barth, M., Todd, M., Xue, L.: User-based Vehicle Relocation Techniques
for Multiple-Station Shared-Use Vehicle Systems. Transportation Research
Record 1887, 137–144 (2004)
Author Index