Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
Grigoris Antoniou, Marko Grobelnik, Elena Simperl, Bijan Parsia,
Dimitris Plexousakis, Pieter De Leenheer, Jeff Pan (Eds.)
Volume Editors
Grigoris Antoniou
FORTH-ICS and University of Crete, 71110 Heraklion, Crete, Greece
E-mail: [email protected]
Marko Grobelnik
Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
E-mail: [email protected]
Elena Simperl
Karlsruhe Institute of Technology, 76128 Karlsruhe, Germany
E-mail: [email protected]
Bijan Parsia
University of Manchester, Manchester M13 9PL, UK
E-mail: [email protected]
Dimitris Plexousakis
FORTH-ICS and University of Crete, 70013 Heraklion, Crete, Greece
E-mail: [email protected]
Pieter De Leenheer
VU University of Amsterdam, 1012 ZA Amsterdam, The Netherlands
E-mail: [email protected]
Jeff Pan
University of Aberdeen, Aberdeen AB24 3UE, UK
E-mail: [email protected]
LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI
Preface
Every year ESWC brings together researchers and practitioners dealing with
different aspects of semantic technologies. Following a successful re-launch in
2010 as a multi-track conference, the 8th Extended Semantic Web Conference
built on the success of the ESWC conference series initiated in 2004. Through its
extended concept this series seeks to reach out to other communities and research
areas, in which Web semantics play an important role, within and outside ICT,
and in a truly international, not just ‘European’ context. This volume contains
the papers accepted for publication in key tracks of ESWC 2011: the technical
tracks including research tracks, an in-use track and two special tracks, as well
as the PhD symposium and the demo track.
Semantic technologies provide machine-understandable representations of
data, processes and resources — hardware, software and network infrastruc-
ture — as a foundation for the provisioning of a fundamentally new level of
functionality of IT systems across application scenarios and industry sectors.
Using automated reasoning services over ontologies and metadata, semantically
enabled systems will be able to better interpret and process the information
needs of their users, and to interact with other systems in an interoperable way.
Research on semantic technologies can benefit from ideas and cross-fertilization
with many other areas, including artificial intelligence, natural language pro-
cessing, database and information systems, information retrieval, multimedia,
distributed systems, social networks, Web engineering, and Web science. These
complementarities are reflected in the outline of the technical program of ESWC
2011; in addition to the research and in-use tracks, we introduced
two special tracks this year, putting particular emphasis on inter-disciplinary
research topics and areas that show the potential of exciting synergies for the
future. In 2011, these special tracks focused on data-driven, inductive and prob-
abilistic approaches to managing content, and on digital libraries, respectively.
The technical program of the conference received 247 submissions, which were
reviewed by the Program Committees of the corresponding tracks. Each track was
coordinated by Track Chairs and had a dedicated Program Committee. The
review process included paper bidding, assessment by at least three Program
Committee members, and meta-reviewing for each submission considered for
acceptance in the conference program and proceedings. In all, 57
papers were selected as a result of this process, following comparable evaluation
criteria devised for all technical tracks.
The PhD symposium received 25 submissions, which were reviewed by the
PhD Symposium Program Committee. Seven papers were selected for presenta-
tion at a separate track and for inclusion in the ESWC 2011 proceedings. The
demo track received 19 submissions, 14 of which were accepted for demonstration
in a dedicated session during the conference. Ten of the demo papers were also
selected for inclusion in the conference proceedings.
ESWC 2011 had the pleasure and honor to welcome seven renowned keynote
speakers from academia and industry, addressing a variety of exciting topics of
highest relevance for the research agenda of the semantic technologies community
and its impact on ICT:
– James Hendler, Tetherless World Professor of Computer and Cognitive Sci-
ence and Assistant Dean for Information Technology and Web Science at
Rensselaer Polytechnic Institute
– Abe Hsuan, founding partner of the law firm Irwin & Hsuan LLP
– Prasad Kantamneni, principal architect of the Eye-Tracking platform at
Yahoo!
– Andraž Tori, CTO and co-founder of Zemanta
– Lars Backstrom, data scientist at Facebook
– Jure Leskovec, assistant professor of Computer Science at Stanford University
– Chris Welty, Research Scientist at the IBM T.J. Watson Research Center in
New York
We would like to take the opportunity to express our gratitude to the Chairs,
Program Committee members and additional reviewers of all refereed tracks,
who ensured that ESWC 2011 maintained its highest standards of scientific qual-
ity. Our thanks should also reach the Organizing Committee of the conference,
for their dedication and hard work in selecting and coordinating the organiza-
tion of a wide array of interesting workshops, tutorials, posters and panels that
completed the program of the conference. Special thanks go to the various or-
ganizations that kindly supported this year's edition of ESWC as sponsors, to
the Sponsorship Chair who coordinated these activities, and to the team at
STI International who provided excellent service in all administrative and
logistical issues related to the organization of the event. Last, but not least, we
would like to say thank you to the Proceedings Chair, to the development team
of the EasyChair conference management system and to our publisher, Springer,
for their support in the preparation of this volume and the publication of the
proceedings.
Organizing Committee
General Chair Grigoris Antoniou
(FORTH-ICS and University of Crete,
Greece)
Program Chairs Marko Grobelnik
(Jozef Stefan Institute, Slovenia)
Elena Simperl
(Karlsruhe Institute of Technology,
Germany)
News from the Front Coordinators Lyndon Nixon (STI International, Austria)
Alexander Wahler (STI International, Austria)
Poster and Demo Chairs Bijan Parsia (University of Manchester, UK)
Dimitris Plexousakis
(FORTH-ICS and University of Crete,
Greece)
Workshop Chairs Dieter Fensel (University of Innsbruck, Austria)
Raúl Garcı́a Castro (UPM, Spain)
Tutorials Chair Manolis Koubarakis
(University of Athens, Greece)
PhD Symposium Chairs Jeff Pan (University of Aberdeen, UK)
Pieter De Leenheer
(VU Amsterdam, The Netherlands)
Semantic Technologies
Coordinators Matthew Rowe (The Open University, UK)
Sofia Angeletou (The Open University, UK)
Proceedings Chair Antonis Bikakis
(University of Luxembourg, Luxembourg)
Sponsorship Chair Anna Fensel (FTW, Austria)
Publicity Chair Lejla Ibralic-Halilovic (STI, Austria)
Panel Chairs John Domingue (The Open University, UK)
Asuncion Gomez-Perez (UPM, Spain)
Treasurer Alexander Wahler (STI International, Austria)
Local Organization and
Conference Administration STI International, Austria
Referees
Sofia Angeletou Huyen Do Günter Ladwig
Darko Anicic Vicky Dritsou Feiyu Lin
Fedor Bakalov George Eadon Dong Liu
Jürgen Bock Angela Fogarolli Christian Meilicke
Stefano Bortoli Anika Gross Ivan Mikhailov
Sebastian Brandt Karl Hammar Rammohan Narendula
Siarhei Bykau Matthias Hert Axel-Cyrille Ngonga
Michele Caci Martin Homola Ngomo
Elena Cardillo Matthew Horridge Maximilian Nickel
Michele Catasta Wei Hu Andriy Nikolov
Gong Cheng Prateek Jain Dusan Omercevic
Annamaria Chiasera Mouna Kamel Giorgio Orsi
Catherine Comparot Malte Kiesel Matteo Palmonari
Gianluca Correndo Szymon Klarman Bastien Rance
Brian Davis Johannes Knopp Mariano Rodriguez
Evangelia Daskalaki Matthias Knorr Muro
Renaud Delbru Haridimos Kondylakis Marco Rospocher
Kathrin Dentler Jacek Kopecky Brigitte Safar
Steering Committee
Chair
John Domingue
Sponsoring Institutions
Platinum Sponsors
Gold Sponsors
Silver Sponsors
Demo Track
SmartLink: A Web-Based Editor and Search Environment for Linked
Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
Stefan Dietze, Hong Qing Yu, Carlos Pedrinaci, Dong Liu, and
John Domingue
ViziQuer: A Tool to Explore and Query SPARQL Endpoints . . . . . . . . . . 441
Martins Zviedris and Guntis Barzdins
EasyApp: Goal-Driven Service Flow Generator with Semantic Web
Service Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
Yoo-mi Park, Yuchul Jung, HyunKyung Yoo, Hyunjoo Bae, and
Hwa-Sung Kim
Who’s Who – A Linked Data Visualisation Tool for Mobile
Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
A. Elizabeth Cano, Aba-Sah Dadzie, and Melanie Hartmann
OntosFeeder – A Versatile Semantic Context Provider for Web Content
Authoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
Alex Klebeck, Sebastian Hellmann, Christian Ehrlich, and Sören Auer
wayOU – Linked Data-Based Social Location Tracking in a Large,
Distributed Organisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
Mathieu d’Aquin, Fouad Zablith, and Enrico Motta
SeaFish: A Game for Collaborative and Visual Image Annotation and
Interlinking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
Stefan Thaler, Katharina Siorpaes, David Mear, Elena Simperl, and
Carl Goodman
The Planetary System: Executable Science, Technology, Engineering
and Math Papers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
Christoph Lange, Michael Kohlhase, Catalin David,
Deyan Ginev, Andrea Kohlhase, Bogdan Matican, Stefan Mirea, and
Vyacheslav Zholudev
Semantic Annotation of Images on Flickr . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
Pierre Andrews, Sergey Kanshin, Juan Pane, and Ilya Zaihrayeu
FedX: A Federation Layer for Distributed Query Processing on Linked
Open Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481
Andreas Schwarte, Peter Haase, Katja Hose, Ralf Schenkel, and
Michael Schmidt
PhD Symposium
Reasoning in Expressive Extensions of the RDF Semantics . . . . . . . . . . . . 487
Michael Schneider
Linked Data Metrics for Flexible Expert Search on the open Web . . . . . . 108
Milan Stankovic, Jelena Jovanovic, and Philippe Laublet
Ontologies Track
Elimination of Redundancy in Ontologies . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
Stephan Grimm and Jens Wissmann
Evaluating the Stability and Credibility of Ontology Matching
Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
Xing Niu, Haofen Wang, Gang Wu, Guilin Qi, and Yong Yu
How Matchable Are Four Thousand Ontologies on the Semantic Web . . . 290
Wei Hu, Jianfeng Chen, Hang Zhang, and Yuzhong Qu
Reasoning Track
A Tableaux-Based Algorithm for SHIQ with Transitive Closure of
Roles in Concept and Role Inclusion Axioms . . . . . . . . . . . . . . . . . . . . . . . . 367
Chan Le Duc, Myriam Lamolle, and Olivier Curé
Semantics and Optimization of the SPARQL 1.1 Federation Extension
Carlos Buil-Aranda, Marcelo Arenas, and Oscar Corcho
Abstract. The W3C SPARQL working group is defining the new SPARQL 1.1
query language. The current working draft of SPARQL 1.1 focuses mainly on the
description of the language. In this paper, we provide a formalization of the syntax
and semantics of the SPARQL 1.1 federation extension, an important fragment of
the language that has not yet received much attention. Besides, we propose opti-
mization techniques for this fragment, provide an implementation of the fragment
including these techniques, and carry out a series of experiments that show that
our optimization procedures significantly speed up the query evaluation process.
1 Introduction
The recent years have witnessed a constant growth in the amount of RDF data avail-
able, exposed by means of Linked Data-enabled URLs and SPARQL endpoints. Several
non-exhaustive, and sometimes out-of-date, lists of SPARQL endpoints or data catalogs
are available in different formats (from wiki-based HTML pages to SPARQL endpoints
using data catalog description vocabularies). Besides, most of these datasets are inter-
linked, which allows navigating through them and facilitates building complex queries
combining data from heterogeneous datasets.
These SPARQL endpoints accept queries written in SPARQL and adhere to the
SPARQL protocol, as defined by the W3C recommendation. However, the current
SPARQL recommendation has an important limitation in defining and executing queries
that span across distributed datasets, since it only considers the possibility of executing
these queries in isolated SPARQL endpoints. Hence users willing to federate queries
across a number of SPARQL endpoints have been forced to create ad-hoc extensions
of the query language or to include additional information about data sources in the
configuration of their SPARQL endpoint servers [14,15]. This has led to the inclusion
of query federation extensions in the current SPARQL 1.1 working draft [12] (together
with other extensions that are outside the scope of this paper), which are being studied
in detail in order to produce a new W3C recommendation in the coming months.
The federation extension of SPARQL 1.1 includes two new operators in the query
language: SERVICE and BINDINGS. The former allows specifying, inside a SPARQL
query, the SPARQL query service in which a portion of the query will be executed. This
query service may be known at the time of building the query, and hence the SERVICE
operator will already specify the IRI of the SPARQL endpoint where it will be executed;
or may be retrieved at query execution time after executing an initial SPARQL query
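For illustration, a federated query combining both operators could look as follows in
the concrete syntax of the SPARQL 1.1 working draft; the endpoint IRI and the FOAF
predicates are chosen only for this example and are not taken from the specification:

  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?name ?email
  WHERE {
    ?person foaf:name ?name .
    SERVICE <http://example.org/sparql> {   # remote endpoint, known when the query is built
      ?person foaf:mbox ?email
    }
  }
  BINDINGS ?name { ("Alice") ("Bob") }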
Moreover, in this definition μ∅ represents the mapping with empty domain (which is
compatible with any other mapping).
The evaluation of a graph pattern P over an RDF graph G, denoted by ⟦P⟧G , is de-
fined recursively as follows (due to the lack of space, we refer the reader to the extended
version of the paper for the definition of the semantics of the FILTER operator). In
particular, the rules for the SERVICE operator used with a variable and for the
BINDINGS operator are:
⟦(SERVICE ?X P1 )⟧G = ⋃c∈I { μ | there exists μ′ ∈ ⟦(SERVICE c P1 )⟧G s.t.
    dom(μ) = (dom(μ′ ) ∪ {?X}), μ(?X) = c and μ(?Y ) = μ′ (?Y ) for every ?Y ∈ dom(μ′ ) }
⟦(P1 BINDINGS S {A1 , . . . , An })⟧G = ⟦P1 ⟧G ⋈ {μS,A1 , . . . , μS,An }.
It is important to notice that the rules (1)–(4) above were introduced in [11], while we
propose in the rules (5)–(8) a semantics for the operators SERVICE and BINDINGS
introduced in [12]. Intuitively, if c ∈ I is the IRI of a SPARQL endpoint, then the idea
behind the definition of (SERVICE c P1 ) is to evaluate query P1 in the SPARQL end-
point specified by c. On the other hand, if c ∈ I is not the IRI of a SPARQL endpoint,
then (SERVICE c P1 ) leaves unbound all the variables in P1 , as this query cannot
be evaluated in this case. This idea is formalized by making μ∅ the only mapping in the
evaluation of (SERVICE c P1 ) if c ∉ dom(ep). In the same way, (SERVICE ?X P1 )
is defined by considering all the possible IRIs for the variable ?X, that is, all the val-
ues c ∈ I. In fact, (SERVICE ?X P1 ) is defined as the union of the evaluation of the
graph patterns (SERVICE c P1 ) for the values c ∈ I, but also storing in ?X the IRIs
from where the values of the variables in P1 are coming from. Finally, the idea behind
the definition of (P1 BINDINGS S {A1 , . . . , An }) is to constrain the values of the
variables in S to the values specified in A1 , . . ., An .
Example 1. Assume that G is an RDF graph that uses triples of the form
(a, service address, b) to indicate that a SPARQL endpoint with name a is located at
the IRI b. Moreover, let P be the following SPARQL query:
SELECT {?X, ?N, ?E}
(?X, service address, ?Y ) AND (SERVICE ?Y (?N, email, ?E))
BINDINGS [?N ] {[ John ], [ Peter ]}
Query P is used to compute the list of names and email addresses that can be retrieved
from the SPARQL endpoints stored in an RDF graph. In fact, if μ ∈ P G , then μ(?X)
is the name of a SPARQL endpoint stored in G, μ(?N ) is the name of a person stored
in that SPARQL endpoint and μ(?E) is the email address of that person. Moreover,
the operator BINDINGS in this query is used to filter the values of the variable ?N .
Specifically, if μ ∈ P G , then μ(?N ) is either John or Peter.
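In the concrete syntax of the SPARQL 1.1 working draft, a query with the shape of P
could be written as follows (the ex: prefix and the exact predicate IRIs are placeholders
introduced here for readability):

  PREFIX ex: <http://example.org/>
  SELECT ?X ?N ?E
  WHERE {
    ?X ex:service_address ?Y .
    SERVICE ?Y {                # the endpoint IRI is taken from the data
      ?N ex:email ?E
    }
  }
  BINDINGS ?N { ("John") ("Peter") }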
The goal of the rules (5)–(8) is to define in an unambiguous way what the result of
evaluating an expression containing the operators SERVICE and BINDINGS should
be. As such, these rules should not be considered as an implementation of the language.
In fact, a direct implementation of the rule (6), that defines the semantics of a pattern of
the form (SERVICE ?X P1 ), would involve evaluating a particular query in every pos-
sible SPARQL endpoint, which is obviously infeasible in practice. In the next section,
we face this issue and, in particular, we introduce a syntactic condition on SPARQL
queries that ensures that a pattern of the form (SERVICE ?X P1 ) can be evaluated by
only considering a finite set of SPARQL endpoints, whose IRIs are actually taken from
the RDF graph where the query is being evaluated.
as for every RDF graph G and mapping μ ∈ P1 G , we know that ?Y ∈ dom(μ)
and μ(?Y ) ∈ dom(G). Moreover, we also have that variable ?Y is bound in
(SELECT {?X, ?N, ?E} P1 ) as ?Y is bound in graph pattern P1 .
A natural way to ensure that a SPARQL query P can be evaluated in practice is by
imposing the restriction that for every sub-pattern (SERVICE ?X P1 ) of P , it holds
that ?X is bound in P . However, in the following theorem we show that such a condition
is undecidable and, thus, a SPARQL query engine would not be able to check it in order
to ensure that a query can be evaluated.
Theorem 1. The problem of verifying, given a SPARQL query P and a variable ?X ∈
var(P ), whether ?X is bound in P is undecidable.
The fact that the notion of boundedness is undecidable prevents one from using it as
a restriction over the variables in SPARQL queries. To overcome this limitation, we
introduce here a syntactic condition that ensures that a variable is bound in a pattern
and that can be efficiently verified.
The previous definition recursively collects from a SPARQL query P a set of vari-
ables that are guaranteed to be bound in P . For example, if P is a triple pattern t, then
SB(P ) = var(t) as one knows that for every variable ?X ∈ var(t) and for every RDF
graph G, if μ ∈ tG , then ?X ∈ dom(μ) and μ(?X) ∈ dom(G). In the same way,
if P = (P1 AND P2 ), then SB(P ) = SB(P1 ) ∪ SB(P2 ) as one knows that if ?X
is bound in P1 or in P2 , then ?X is bound in P . As a final example, notice that if
P = (P1 BINDINGS S {A1 , . . . , An }) and ?X is a variable mentioned in S such
that ?X ∈ dom(μS,Ai ) for every i ∈ {1, . . . , n}, then ?X ∈ SB(P ). In this case, one
knows that ?X is bound in P since ⟦P ⟧G = ⟦P1 ⟧G ⋈ {μS,A1 , . . . , μS,An } and ?X is in
the domain of each one of the mappings μS,Ai , which implies that μ(?X) ∈ dom(P )
for every μ ∈ P G . In the following proposition, we formally show that our intuition
about SB(P ) is correct, in the sense that every variable in this set is bound in P .
Proposition 1. For every SPARQL query P and variable ?X ∈ var(P ), if ?X ∈
SB(P ), then ?X is bound in P .
Given a SPARQL query P and a variable ?X ∈ var(P ), it can be efficiently verified
whether ?X is strongly bound in P . Thus, a natural and efficiently verifiable way to en-
sure that a SPARQL query P can be evaluated in practice is by imposing the restriction
that for every sub-pattern (SERVICE ?X P1 ) of P , it holds that ?X is strongly bound
in P . However, this notion still needs to be modified in order to be useful in practice, as
shown by the following examples.
Example 2. Assume again that G is an RDF graph that stores triples of the form
(a, service address, b), and also triples of the form (a, service description, b), indicating
that b is a description of the functionalities of the SPARQL endpoint named a. Moreover,
let P1 be the following graph pattern:
((?X, service description, ?Z) UNION ((?X, service address, ?Y ) AND (SERVICE ?Y (?N, email, ?E))))
That is, either ?X and ?Z store the name of a SPARQL endpoint and a de-
scription of its functionalities, or ?X and ?Y store the name of a SPARQL end-
point and the IRI where it is located (together with a list of names and email
addresses retrieved from that location). Variable ?Y is neither bound nor strongly
bound in P1 . However, there is a simple strategy that ensures that P1 can be
evaluated over an RDF graph G: first compute (?X, service description, ?Z)G ,
then compute (?X, service address, ?Y )G , and finally for every μ in the set
Semantics and Optimization of the SPARQL 1.1 Federation Extension 9
(?X, service address, ?Y )G , compute (SERVICE a (?N, email, ?E))G with a =
μ(?Y ). In fact, the reason why P1 can be evaluated in this case is that ?Y is
bound (and strongly bound) in the sub-pattern ((?X, service address, ?Y ) AND
(SERVICE ?Y (?N, email, ?E))) of P1 .
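As an illustration, P1 corresponds to a SPARQL 1.1 query of roughly the following
shape (prefix and predicate IRIs are placeholders):

  PREFIX ex: <http://example.org/>
  SELECT *
  WHERE {
    { ?X ex:service_description ?Z }
    UNION
    { ?X ex:service_address ?Y .
      SERVICE ?Y { ?N ex:email ?E }   # ?Y is bound within this UNION branch
    }
  }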
As a second example, assume that G is an RDF graph that uses triples of the form
(a1 , related with, a2 ) to indicate that the SPARQL endpoints located at the IRIs a1 and
a2 store related data. Moreover, assume that P2 is the following graph pattern:
((?U1 , related with, ?U2 ) AND (SERVICE ?U1 ((?N, email, ?E) OPT (SERVICE ?U2 (?N, phone, ?F )))))
When this query is evaluated over the RDF graph G, it returns for every tuple
(a1 , related with, a2 ) in G, the list of names and email addresses that can be re-
trieved from the SPARQL endpoint located at a1 , together with the phone number for
each person in this list for which this data can be retrieved from the SPARQL endpoint
located at a2 (recall that graph pattern (SERVICE ?U2 (?N, phone, ?F )) is nested in-
side the first SERVICE operator in P2 ). To evaluate this query over an RDF graph, first
it is necessary to determine the possible values for variable ?U1 , and then to submit the
query ((?N, email, ?E) OPT (SERVICE ?U2 (?N, phone, ?F ))) to each one of the
endpoints located at the IRIs stored in ?U1 . In this case, variable ?U2 is bound (and
also strongly bound) in P2 . However, this variable is not bound in the graph pattern
((?N, email, ?E) OPT (SERVICE ?U2 (?N, phone, ?F ))), which has to be evaluated
in some of the SPARQL endpoints stored in the RDF graph where P2 is being evalu-
ated, something that is infeasible in practice. Notice that the difficulties in evaluating P2
are caused by the nesting of SERVICE operators (more precisely, by the fact that P2
has a sub-pattern of the form (SERVICE ?X1 Q1 ), where Q1 has in turn a sub-pattern
of the form (SERVICE ?X2 Q2 ) such that ?X2 is bound in P2 but not in Q1 ).
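In SPARQL 1.1 syntax, the nesting in P2 would look roughly as follows (placeholder
IRIs again); the inner SERVICE block is the part that would have to be evaluated by the
remote endpoints selected for ?U1:

  PREFIX ex: <http://example.org/>
  SELECT *
  WHERE {
    ?U1 ex:related_with ?U2 .
    SERVICE ?U1 {
      ?N ex:email ?E .
      OPTIONAL {
        SERVICE ?U2 { ?N ex:phone ?F }   # ?U2 is not bound inside the pattern sent to the remote endpoint
      }
    }
  }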
In the following section, we use the concept of strong boundedness to define a notion
that ensures that a SPARQL query containing the SERVICE operator can be evaluated
in practice, and which takes into consideration the ideas presented in Example 2.
Fig. 1. Parse tree T (Q) for the graph pattern Q = ((?Y, a, ?Z) UNION ((?X, b, c) AND
(SERVICE ?X (?Y, a, ?Z))))
are bound inside sub-patterns and nested SERVICE operators. It should be noticed that
these two features were identified in the previous section as important for the definition
of a notion of boundedness (see Example 2).
Definition 3 (Service-boundedness). A SPARQL query P is service-bound if for every
node u of T (P ) with label (SERVICE ?X P1 ), it holds that: (1) there exists a node v
of T (P ) with label P2 such that v is an ancestor of u in T (P ) and ?X is bound in P2 ;
(2) P1 is service-bound.
For example, query Q in Figure 1 is service-bound. In fact, condition (1) of Def-
inition 3 is satisfied as u5 is the only node in T (Q) having as label a SERVICE
graph pattern, in this case (SERVICE ?X (?Y, a, ?Z)), and for the node u3 , it holds
that: u3 is an ancestor of u5 in T (P ), the label of u3 is P = ((?X, b, c) AND
(SERVICE ?X (?Y, a, ?Z))) and ?X is bound in P . Moreover, condition (2) of Defini-
tion 3 is satisfied as the sub-pattern (?Y, a, ?Z) of the label of u5 is also service-bound.
The notion of service-boundedness captures our intuition about the condition that a
SPARQL query containing the SERVICE operator should satisfy. Unfortunately, the
following theorem shows that such a condition is undecidable and, thus, a query engine
would not be able to check it in order to ensure that a query can be evaluated.
Theorem 2. The problem of verifying, given a SPARQL query P , whether P is service-
bound is undecidable.
As for the case of the notion of boundedness, the fact that the notion of service-
boundedness is undecidable prevents one from using it as a restriction over the variables
used in SERVICE calls. To overcome this limitation, we replace the restriction that the
variables used in SERVICE calls are bound by the decidable restriction that they are
strongly bound. In this way, we obtain a syntactic condition over SPARQL patterns that
ensures that they are service-bound, and which can be efficiently verified.
Definition 4 (Service-safeness). A SPARQL query P is service-safe if for every node
u of T (P ) with label (SERVICE ?X P1 ), it holds that: (1) there exists a node v of
T (P ) with label P2 such that v is an ancestor of u in T (P ) and ?X ∈ SB(P2 ); (2) P1
is service-safe.
Proposition 2. If a SPARQL query P is service-safe, then P is service-bound.
The notion of service-safeness is used in our system to verify that a SPARQL pattern
can be evaluated in practice. We conclude this section by pointing out that it can be
efficiently verified whether a SPARQL query P is service-safe, by using a bottom-up
approach over the parse tree T (P ) of P .
In [11,17], the authors study the complexity of evaluating the fragment of SPARQL
consisting of the operators AND, UNION, OPT and FILTER. One of the conclusions
of these papers is that the main source of complexity in SPARQL comes from the use
of the OPT operator. In light of these results, a fragment of SPARQL was introduced
in [11] that forbids a special form of interaction between variables appearing in
optional parts, which rarely occurs in practice. The patterns in this fragment, which are
called well-designed patterns [11], can be evaluated more efficiently and are suitable for
reordering and optimization. In this section, we extend the definition of the notion of
being well-designed to the case of SPARQL patterns using the SERVICE operator, and
prove that the reordering rules proposed in [11], for optimizing the evaluation of well-
designed patterns, also hold in this extension. The use of these rules makes it possible to
reduce the number of tuples transferred and joined in federated queries, and hence our
implementation benefits from this as shown in Section 5.
Let P be a graph pattern constructed by using the operators AND, OPT, FILTER
and SERVICE, and assume that P satisfies the safety condition that for every sub-
pattern (P1 FILTER R) of P , it holds that var(R) ⊆ var(P1 ). Then, by following [11],
we say that P is well-designed if for every sub-pattern P ′ = (P1 OPT P2 ) of P and
for every variable ?X occurring in P : if ?X occurs both inside P2 and outside P ′ ,
then it also occurs in P1 . All the graph patterns given in the previous sections are well-
designed. On the other hand, the following pattern P is not well-designed:
((?X, nickname, ?Y ) AND (SERVICE c ((?X, email, ?U ) OPT (?Y, email, ?V )))),
since variable ?Y occurs inside the optional part (?Y, email, ?V ) and also outside the OPT
sub-pattern, in the triple pattern (?X, nickname, ?Y ), but it does not occur in P1 = (?X, email, ?U ).
graph pattern P retrieves from G a list of people with their nicknames, and retrieves
from the SPARQL endpoint located at the IRI c the email addresses of these people
and, optionally, the email addresses associated to their nicknames. What is unnatu-
ral about this graph pattern is the fact that (?Y, email, ?V ) is giving optional infor-
mation for (?X, nickname, ?Y ), but in P appears as giving optional information for
(?X, email, ?U ). In fact, it could happen that some of the results retrieved by using the
triple pattern (?X, nickname, ?Y ) are not included in the final answer of P , as the value
of variable ?Y in these intermediate results could be incompatible with the values for
this variable retrieved by using the triple pattern (?Y, email, ?V ).
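For concreteness, a SPARQL 1.1 query with the shape of this non-well-designed pattern
would look roughly as follows (the endpoint IRI c and the predicate IRIs are placeholders):

  PREFIX ex: <http://example.org/>
  SELECT *
  WHERE {
    ?X ex:nickname ?Y .
    SERVICE <http://example.org/sparql> {
      ?X ex:email ?U .
      OPTIONAL { ?Y ex:email ?V }   # ?Y occurs here and outside the OPT, but not in the required part
    }
  }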
In the following proposition, we show that well-designed patterns including the
SERVICE operator are suitable for reordering and, thus, for optimization.
Proposition 3. Let P be a well-designed pattern and P a pattern obtained from P by
using one of the following reordering rules:
and enables handling large datasets and tuple streams, which may result from the exe-
cution of queries in different query services and data sources. The low level technical
details of our implementation can be found in [5].
5.2 Evaluation
In our evaluation, we compare the results and performance of our system with
other similar systems that provide some support for SPARQL query federation. Cur-
rently, the engines supporting the official SPARQL 1.1 federation extension are:
DARQ [14], Networked Graphs [15] and ARQ, which is available via an on-
line web service (http://www.sparql.org/) as well as a library for Jena
(http://jena.sourceforge.net/). Another system that supports distributed
RDF querying is presented in [18]. We do not consider this system here as it uses the
query language SeRQL instead of SPARQL.
The objective of our evaluation is to show first that we can handle SPARQL queries
that comply with the federated extension, and second that the optimization techniques
proposed in Section 4.1 actually reduce the time needed to process queries. We have
considered existing SPARQL benchmarks such as the Berlin SPARQL Benchmark [4],
SP2 Bench [16] and the benchmark proposed in [7]. Unfortunately for our purposes, the
first two are not designed for a distributed environment, while the third one is based
on a federated scenario but is not as comprehensive as the Berlin SPARQL Benchmark
and SP2 Bench. Thus, we decided to base our evaluation on some queries from the life
sciences domain, similar to those in [7] but using a base query and increasing its com-
plexity like in [4]. These queries are real queries used by Bio2RDF experts.
Datasets description. The Bio2RDF datasets contain 2.3 billion triples organized
around 40 datasets with sometimes overlapping information. The Bio2RDF datasets
that we have used in our benchmark are: Entrez Gene (13 million triples, stored in the
local endpoint sparql-pubmed), Pubmed (797 million triples), HHPID (244,021 triples)
and MeSH (689,542 triples, stored in the local endpoint sparql-mesh). One of the prac-
tical problems that these benchmarks have is that public SPARQL endpoints normally
restrict the amount of results that they provide. To overcome this limitation we installed
Entrez Gene and MeSH on servers without these restrictions. We also divided them into
files of 300,000 triples, creating endpoints for each one of them.
Queries used in the evaluation. We used 7 queries in our evaluation. The query struc-
ture follows this path: using the Pubmed references obtained from the Entrez
gene dataset, we access the Pubmed endpoint (queries Q1 and Q2). In these queries,
we retrieve information about genes and their references in the Pubmed dataset. From
Pubmed we access the information in the National Library of Medicine’s controlled
vocabulary thesaurus (queries Q3 and Q4), stored at the MeSH endpoint, so we have more
complete information about such genes. Finally, to increase the data retrieved by our
queries we also access the HHPID endpoint (queries Q5, Q6 and Q7), which is the
knowledge base for the HIV-1 protein. The queries, in increasing order of complexity,
can be found at http://www.oeg-upm.net/files/sparql-dqp/. Next we
show query Q4 to give the reader an idea of the type of queries that we are considering:
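As a rough illustration of the shape of these queries only (the actual Q4 is available at
the URL above; all IRIs and predicates below are placeholders, not the real Bio2RDF
vocabulary), such a federated query could look as follows:

  PREFIX ex: <http://example.org/bio2rdf/>
  SELECT ?gene ?article ?meshTerm
  WHERE {
    ?gene ex:xref-pubmed ?article .                      # evaluated at the local Entrez Gene endpoint
    SERVICE <http://pubmed.example.org/sparql> {
      ?article ex:subject-heading ?heading
    }
    SERVICE <http://mesh.example.org/sparql> {
      ?heading ex:label ?meshTerm
    }
  }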
Results. Our evaluation was done in an Amazon EC2 instance. The instance has 2
cores and 7.5 GB of memory, running Ubuntu 10.04. The data used in this evaluation,
together with the generated query plans and the original queries in Java formatting, can
be found at http://www.oeg-upm.net/files/sparql-dqp/. The results of
our evaluation are shown in the following table:
Query | Not optimized SPARQL-DQP | Optimized SPARQL-DQP | DARQ | NetworkedGraphs | ARQ
A first clear advantage of our implementation is the ability to use asynchronous calls,
facilitated by the indirect access mode, which means that we do not get timeouts
in any of the queries. Such timeouts happen when accessing an online distributed query
processing service, as in the case of ARQ (www.sparql.org/query). It is important to
note that the ability to handle this type of queries is essential for many types of data-
intensive applications, such as those based on Bio2RDF. Data transfer also plays a key
role in query response times. For example, in some queries the local query engine re-
ceived 150,000 results from Entrez gene, 10,000 results from Pubmed, 23,841 results
from MeSH and 10,000 results from HHPID. The implemented optimizations are less
noticeable when the amount of transferred data is smaller.
It is possible to observe three different sets of results from this preliminary evalua-
tion. The first set (Q1–Q3 and Q5) consists of queries that are not optimized because the
reordering rules in Section 4.1 are not applicable. The second query group (Q4) represents
the class of queries that can be optimized using our approach, but where the differ-
ence is less relevant, because less data is transferred. The last group
of queries (Q6–Q7) shows a clear optimization when using the well-designed patterns
rewriting rules. For example, in query 6 the amount of transferred data varies from a
join of 150, 000 × 10, 000 tuples to a join of 10, 000 × 23, 841 tuples (using Entrez,
Pubmed and MeSH endpoints), which greatly reduces the global processing time of the
query. Regarding the comparison with other systems, they do not properly handle these
amounts of data. We represent as 10+ min. those queries that need more than 10 minutes
to be answered.
In summary, we have shown that our implementation provides better results than
other similar systems. Besides, we have also shown that our implementation, which ben-
efits from an indirect access mode, is better suited to dealing with large datasets.
References
1. Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley, Reading
(1995)
2. Angles, R., Gutierrez, C.: The Expressive Power of SPARQL. In: Sheth, A.P., Staab, S.,
Dean, M., Paolucci, M., Maynard, D., Finin, T., Thirunarayan, K. (eds.) ISWC 2008. LNCS,
vol. 5318, pp. 114–129. Springer, Heidelberg (2008)
3. Antonioletti, M., et al.: OGSA-DAI 3.0 - The Whats and the Whys. UK e-Science All Hands
Meeting, pp. 158–165 (2007)
4. Bizer, C., Schultz, A.: The Berlin SPARQL Benchmark. Int. J. Semantic Web Inf. Syst. 5(2),
1–24 (2009)
5. Buil, C., Corcho, O.: Federating Queries to RDF repositories. Technical Report (2010),
http://oa.upm.es/3302/
6. Dürst, M., Suignard, M.: RFC 3987, Internationalized Resource Identifiers (IRIs),
http://www.ietf.org/rfc/rfc3987.txt
7. Haase, P., Mathäß, T., Ziller, M.: An evaluation of approaches to federated query processing
over linked data. In: I-SEMANTICS (2010)
8. Harris, S., Seaborne, A.: SPARQL 1.1 Query. W3C Working Draft (June 1, 2010),
http://www.w3.org/TR/sparql11-query/
9. Klyne, G., Carroll, J.J., McBride, B.: Resource description framework (RDF): Concepts and
abstract syntax. W3C Recommendation (February 10, 2004),
http://www.w3.org/TR/rdf-concepts/
10. Lynden, S., et al.: The design and implementation of OGSA-DQP: A service-based dis-
tributed query processor. Future Generation Computer Systems 25(3), 224–236 (2009)
11. Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of SPARQL. TODS 34(3)
(2009)
12. Prud’hommeaux, E.: SPARQL 1.1 Federation Extensions. W3C Working Draft (June 1,
2010), http://www.w3.org/TR/sparql11-federated-query/
13. Prud’hommeaux, E., Seaborne, A.: SPARQL query language for RDF. W3C Recommenda-
tion (January 15, 2008), http://www.w3.org/TR/rdf-sparql-query/
14. Quilitz, B., Leser, U.: Querying distributed RDF data sources with SPARQL. In: Bechhofer,
S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp.
524–538. Springer, Heidelberg (2008)
15. Schenk, S., Staab, S.: Networked graphs: a declarative mechanism for SPARQL rules,
SPARQL views and RDF data integration on the Web. In: WWW, pp. 585–594 (2008)
16. Schmidt, M., Hornung, T., Lausen, G., Pinkel, C.: SP2Bench: A SPARQL Performance
Benchmark. In: ICDE, pp. 222–233 (2009)
17. Schmidt, M., Meier, M., Lausen, G.: Foundations of SPARQL query optimization. In: ICDT,
pp. 4–33 (2010)
18. Stuckenschmidt, H., Vdovjak, R., Houben, G.-J., Broekstra, J.: Index structures and algo-
rithms for querying distributed RDF repositories. In: WWW, pp. 631–639 (2004)
GRR: Generating Random RDF
Daniel Blum and Sara Cohen
Abstract. This paper presents GRR, a powerful system for generating random
RDF data, which can be used to test Semantic Web applications. GRR has a
SPARQL-like syntax, which allows the system to be both powerful and conve-
nient. It is shown that GRR can easily be used to produce intricate datasets, such
as the LUBM benchmark. Optimization techniques are employed, which make
the generation process efficient and scalable.
1 Introduction
Testing is one of the most critical steps of application development. For data-centric
applications, testing is a challenge both due to the large volume of input data needed,
and due to the intricate constraints that this data must satisfy. Thus, while finding or
generating input data for testing is pivotal in the development of data-centric applica-
tions, it is often a difficult undertaking. This is a significant stumbling block in system
development, since considerable resources must be expended to generate test data.
Automatic generation of data has been studied extensively for relational databases
(e.g., [4, 7, 11, 13]), and there has also been considerable progress on this problem for
XML databases [1, 3, 8]. This paper focuses on generating test data for RDF databases.
While some Semantic Web applications focus on varied and unexpected types of data,
there are also many others that target specific domains. For such applications, to be
useful, datasets used should have at least two properties. First, the data structure must
conform to the schema of the target application. Second, the data should match the
expected data distribution of the target application.
Currently, there are several distinct sources for RDF datasets. First, there are down-
loadable RDF datasets that can be found on the web, e.g., Barton libraries, UniProt
catalog sequence, and WordNet. RDF Benchmarks, which include both large datasets
and sample queries, have also been developed, e.g., the Lehigh University Benchmark
(LUBM) [10] (which generates data about universities), the SP2 Bench Benchmark [14]
(which provides DBLP-style data) and the Berlin SPARQL Benchmark [5] (which is
built around an e-commerce use case). Such downloadable RDF datasets are usually an
excellent choice when testing the efficiency of an RDF database system. However, they
will not be suitable for experimentation and analysis of a particular RDF application.
Specifically, since these datasets are built for a single given scenario, they may not have
either of the two specified properties, for the application at hand.
This work was partially supported by the GIF (Grant 2201-1880.6/2008) and the ISF (Grant
143/09).
Data generators are another source for datasets. A data generator is a program that
generates data according to user constraints. As such, data generators are usually more
flexible than benchmarks. Unfortunately, there are few data generators available for
RDF (SIMILE1 , RBench2 ) and none of these programs can produce data that conforms
to a specific given structure, and thus, again, will not have the specified properties.
In this paper, we present the GRR system for generating RDF that satisfies both desir-
able properties given above. Thus, GRR is not a benchmark system, but rather, a system
to use for Semantic Web application testing.3 GRR can produce data with a complex
graph structure, as well as draw the data values from desirable domains. As a motivat-
ing (and running) example, we will discuss the problem of generating the data described
in the LUBM Benchmark. However, GRR is not limited to creating benchmark data. For
example, we also demonstrate using GRR to create FOAF [9] data (friend-of-a-friend,
social network data) in our experimentation.
Example 1. LUBM [10] is a collection of data describing university classes (i.e., enti-
ties), such as departments, faculty members, students, etc. These classes have a plethora
of properties (i.e., relations) between them, e.g., faculty members work for departments
and head departments, students take courses and are advised by faculty members.
In order to capture a real-world scenario, LUBM defines interdependencies of dif-
ferent types between the various entities. For example, the number of students in a de-
partment is a function of the number of faculty members. Specifically, LUBM requires
there to be a 1:8-14 ratio of faculty members to undergraduate students. As another
example, the cardinality of a property may be specified, such as each department must
have a single head of department (who must be a full professor). Properties may also
be required to satisfy additional constraints, e.g., courses, taught by faculty members,
must be pairwise disjoint.
Our main challenge is to provide powerful generation capabilities, while still retaining
a simple enough user interface, to allow the system to be easily used. To demonstrate
the capabilities of G RR, we show that G RR can easily be used to reproduce the entire
LUBM benchmark. Actually, our textual commands are not much longer than the intu-
itive description of the LUBM benchmark! We are interested in generating very large
datasets. Thus, a second challenge is to ensure that the runtime remains reasonable,
and that data generation scales up well. Several optimization techniques have been em-
ployed to improve the runtime, which make GRR efficient and scalable. We note that a
short demo of this work appeared in [6].
In this section we present the abstract syntax and the semantics for our data generation
language. Our data generation commands are applied to a (possibly empty) label graph,
1 http://simile.mit.edu/
2 http://139.91.183.30:9090/RDF/RBench
3 Our focus is on batch generation of data for testing, and not on online testing, where data
generation is influenced by application responses.
to augment it with additional nodes and edges. There are four basic building blocks for
a data generation command. First, a composite query Q̄ finds portions of a label graph
to augment. Second, a construction command C defines new nodes and edges to create.
Third, a query result sampler defines which results of Q̄ should be used for input to C.
Fourth, a natural number sampler determines the number of times that C is applied.
Each of these components is explained below.
Label Graphs. This paper presents a method of generating random graphs of RDF
data. As such, graphs figure prominently in this paper, first as the target output, and
second, as an integral part of the generation commands.
A label graph L = (V, E) is a pair, where V is a set of labeled nodes and E is a
set of directed, labeled edges. We use a, b, . . . to denote nodes of a label graph. Note
that there can be multiple edges connecting a pair of nodes, each with a different label.
Label graphs can be seen as an abstract data model for RDF.
4 SPARQL Query Language for RDF. http://www.w3.org/TR/rdf-sparql-query
choosing labels, using value samplers), and then discuss the former augmentation (cre-
ating new structures, using construction patterns), later on.
Example 2. In order to produce data conforming to the LUBM schema, many data
value samplers are needed. Each data value sampler will associate a different type of
node or edge with a value. For example, to produce the age of a faculty member, a data
value sampler choosing an age within a given range can be provided. To produce RDF
identifiers of faculty members, a data value sampler producing a specific string, ended
by a running number (e.g., http://www.Department.Univ/FullProf14)
can be provided. Finally, to produce the predicates of an RDF graph, constant data value
samplers can be used, e.g., to produce the edge label ub:worksFor, a data sampler
which always returns this value can be provided.
A class value sampler for faculty provides methods of producing values for all of
the literal properties of faculty, e.g., age, email, officeNumber, telephone.
Construction Patterns. Construction patterns are used to define the nodes and edges
that will be generated, and are denoted as C = (x̄, z̄, ē, Π), where:
– x̄ is a tuple of input variables, whose values (nodes of the label graph) are provided
when the construction pattern is applied;
– z̄ is a tuple of variables denoting the new nodes to be created;
– ē is a set of edges over the variables in x̄ and z̄; and
– Π assigns a value sampler to each variable in z̄ and to each edge in ē.
Fig. 1. A partial label graph L of RDF data: a university (univ1), departments (dept1, dept2),
faculty members (prof1, prof2, lecturer1) and students (stud1, stud2), connected by edges labeled
rdf:type, ub:suborganizationOf, ub:memberOf, ub:worksFor, ub:name and ub:email; the two
student sub-structures appear in dotted circles
– Adding New Nodes: For each variable z ∈ z̄, we add a new node a to L. If Π(z) is
a data value sampler, then the label of a is randomly drawn from Π(z). If Π(z) is
a class value sampler, then we also choose and add all literal properties associated
with z by Π(z). Define μ′ (z) = a.
When this stage is over, every variable in z̄ is mapped to a new node via μ′ . We define
μ′′ as the union of μ and μ′ .
– Adding New Edges: For each edge (u, v) ∈ ē, we add a new edge (μ′′ (u), μ′′ (v))
to L. The label of (μ′′ (u), μ′′ (v)) is chosen as a random sample drawn from
Π(u, v).
A construction pattern can be applied several times, given the same mapping μ. Each
application results in additional new nodes and edges.
Example 3. Consider5 C = ((?dept), (?stud), {(?stud, ?dept)}, Π), where Π as-
signs ?stud a class value sampler for students, and assigns the edge (?stud, ?dept)
with a data value sampler that returns the constant ub:memberOf. Furthermore, sup-
pose that the only literal values a student has are his name and email.
Figure 1 contains a partial label graph L of RDF data. Literal values appear in rect-
angles. To make the example smaller, we have omitted many of the literal properties, as
well as much of the typing information (e.g., prof1 is of type ub:FullProfessor,
but this does not appear).
Now, consider L− , the label graph L, without the parts appearing in the dotted cir-
cles. Let μ be the mapping which assigns input variable ?dept to the node labeled
dept1. Then each application of C ⇓µ L− can produce one of the circled struc-
tures. To see this, observe that each circled structure contains a new node, e.g., stud1,
whose literal properties are added by the class value structure. The connecting edge,
i.e., (stud1, dept1), will be added due to the presence of the corresponding con-
struction edge in C, and its label ub:memberOf is determined by the constant data
value sampler.
5 We use the SPARQL convention and prepend variable names with ?.
Query Result Samplers. As stated earlier, there are four basic building blocks for
a data generation command: composite queries, construction commands, query result
samplers and natural number samplers. The first two items have been defined already.
The last item, a natural number sampler or n-sampler, for short, is simply a function
that returns a non-negative natural number. For example, the function πi that always
returns the number i, π[i,j] that uniformly returns a number between i and j, and πm,v
which returns a value using the normal distribution, given a mean and variance, are all
special types of natural number samplers.
We now define the final remaining component in our language, i.e., query result
samplers. A query, along with a label graph, and an assignment μ of values for the
input nodes, defines a set of assignments for the output nodes. As mentioned earlier, the
results of a query guide the generation process. However, we will sometimes desire to
choose a (random) subset of the results, to be used in data generation. A query result
sampler is provided precisely for this purpose.
Given (1) a label graph L, (2) a query Q and (3) a tuple of nodes ā for the input
variables of Q, a query result sampler, or q-sampler, chooses mappings in Mā (Q, L).
Thus, applying a q-sampler πq to Mā (Q, L) results in a series μ1 , . . . , μk of mappings
from Mā (Q, L). A q-sampler determines both the length k of this series of samples
(i.e., how many mappings are returned) and whether the sampling is with, or without,
repetition.
Example 4. Consider a query Q1 (having no input values) that returns department nodes
from a label graph. Consider label graph L from Figure 1. Observe that M(Q1 , L) con-
tains two mappings: μ1 which maps ?dept to the node labeled dept1 and μ2 which
maps ?dept to the node labeled dept2.
A q-sampler πq is used to derive a series of mappings from M(Q1 , L). The q-
sampler πq can be defined to return each mapping once (e.g., the series μ1 , μ2 or
μ2 , μ1 ), or to return a single random mapping (e.g., μ2 or μ1 ), or to return two ran-
dom choices of mappings (e.g., one of μ1 , μ1 or μ1 , μ2 or μ2 , μ1 or μ2 , μ2 ), or can be
defined in countless other ways. Note that regardless of how πq is defined, a q-sampler
always returns a series of mappings. The precise definition of πq defines properties of
this series (i.e., its length and repetitions).
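Composite queries are evaluated as SPARQL queries; for instance, Q1 of Example 4
could be expressed along the following lines (the ub: prefix IRI is a placeholder):

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX ub: <http://example.org/univ-bench#>
  SELECT ?dept
  WHERE { ?dept rdf:type ub:Department }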
Fig. 3. (a) Possible result of application of C1 to the empty label graph. (b) Possible result of
application of C2 to (a).
Data Generation Commands. To generate data, the user provides a series of data
generation commands, each of which is a 4-tuple C = (Q̄, π̄q , C, πn ), where
– Q̄ = Q1 (x̄1 ; ȳ1 ), . . . , Qk (x̄k ; ȳk ) is a composite query;
– π̄q = (πq1 , . . . , πqk ) is a tuple of q-samplers;
– C = (x̄, z̄, ē, Π) is a construction pattern and
– πn is an n-sampler.
In addition, we require every input variable in tuple x̄ of the construction pattern C to
appear among the output variables in Q̄.
Algorithm Apply (Figure 2) applies a data generation command C to a (possibly
empty) label graph L. It is called with C, L, L∗ = L, the empty mapping μ∅ and j = 1.
Intuitively, Apply runs in a recursive fashion, described next.
We start with query Q1 , which cannot have any input variables (by definition). There-
fore, Line 6 is well-defined and assigns ā the empty tuple (). Then, we choose mappings
from the result of applying Q1 to L, using the q-sampler πq1 . For each of these mappings
μ , we recursively call Apply, now with the extended mapping μ∪μ and with the index
2, which will cause us to consider Q2 within the recursive application.
When we reach some 1 < j ≤ |Q̄|, the algorithm has a mapping μ which assigns
values for all output variables of Qi , for i < j. Therefore, μ assigns a value to each
of its input variables x̄j . Given this assignment for the input variables, we call Qj , and
choose some of the results in its output. This process continues recursively, until we
reach j = |Q̄| + 1.
When j = |Q̄| + 1, the mapping μ must assign values for all of the input variables
of C. We then choose a number n using the n-sampler πn , and apply the construction
pattern C to L∗ a total of n times. Note that at this point, we recurse back, and may
eventually return to j = |Q̄| + 1 with a different mapping μ.
We note that Apply takes two label graphs L and L∗ as parameters. Initially, these are
the same. Throughout the algorithm, we use L to evaluate queries, and actually apply the
construction patterns to L∗ (i.e., the actual changes are made to L∗ ). This is important
Grr: Generating Random RDF 23
from two aspects. First, use of L∗ allows us to make sure that all constructions are
made to the same graph, which eventually returns a graph containing all new additions.
Second, and more importantly, since we only apply queries to L, the end result is well-
defined. Specifically, this means that nodes constructed during the recursive calls where
j = |Q̄| + 1, cannot be returned from queries applied when j ≤ |Q̄|. This makes the
algorithm insensitive to the particular order in which we iterate over mappings in Line 7.
Maintaining a copy L∗ of L is costly. Hence, in practice, GRR avoids copying L by
deferring all updates until the processing of the data generation command has reached
the end; all updates are performed only once no more queries will be issued.
The heart of the GRR system is a Java implementation of our abstract language, which
interacts with an RDF database for both evaluation of the SPARQL composite queries,
and for construction and storage of the RDF data. The user can provide the data gener-
ation commands (i.e., all details of the components of our abstract commands) within
an RDF file. Such files tend to be quite lengthy, and rather arduous to define.
To make data generation simple and intuitive, we provide the user with a simpler
textual language within which all components of data generation commands can be
defined. Our textual input is compiled into the RDF input format. Thus, the user has
the flexibility of using textual commands whenever they are expressive enough for his
needs, and augmenting the RDF file created with additional (more expressive) com-
mands, when needed. We note that the textual language is quite expressive, and
therefore we believe that such augmentations will be rare. For example, commands
to recreate the entire LUBM benchmark were easily written in the textual interface.
The underlying assumption of the textual interface is that the user is interested in
creating instances of classes and connecting them, as opposed to arbitrary
construction of nodes and edges. To further simplify the language, we allow users to
use class names instead of variables, for queries that do not have self-joins. This simpli-
fication allows many data generation commands to be written without the explicit use
of variables, which makes the syntax more succinct and readable. The general syntax
of a textual data generation command appears below.
Observe that a data generation command can contain three types of clauses: FOR,
CREATE and CONNECT. There can be any number (zero or more) FOR clauses. Each
FOR clause defines a q-sampler (Lines 1–2), and a query (Lines 3-4). The CREATE and
CONNECT clauses together determine the n-sampler (i.e., the number of times that the
construction pattern will be applied), and the construction pattern. Each of CREATE
and CONNECT is optional, but at least one among them must appear.
We present several example commands that demonstrate the capabilities of the lan-
guage. These examples were chosen from among the commands needed to recreate the
LUBM benchmark data.6
each undergraduate student takes 2–4 undergraduate courses. In C6 , the first FOR clause
loops over all undergraduate students. The second FOR clause chooses 2–4 undergrad-
uate courses for each student.
Our interface can be used to define query samplers with three different repetition
modes (Line 2). In repeatable mode, the same query results can be sampled multiple
times. Using this mode in C6 would allow the same course to be sampled multiple times
for a given student, yielding several connections of a student to the same course. In
global distinct mode, the same query result is never returned more than once, even
for different samples of query results in previous FOR clauses. Using global distinct in
C6 would ensure that the q-sampler of the second FOR clause never returns the same
course, even when called for different undergraduates returned by the first FOR clause.
Hence, no two students would take a common course. Finally, in local distinct mode,
the same query result can be repeatedly returned only for different results of previous
FOR clauses. However, given specific results of all previous FOR clauses, query results
will not be repeated. Thus, for each undergraduate student, we will sample 2–4 different
undergraduate courses. However for different undergraduate students we may sample
the same courses, i.e., several students may study the same course, as is natural.
As a final example, we show how variables can be used to express queries with self-
joins (i.e., using several instances of the same class). This example connects people
within a FOAF RDF dataset. Note the use of FILTER in the WHERE clause, which is
naturally included in our language, since our WHERE clause is immediately translated
into the WHERE clause of a SPARQL query.
FOR EACH {foaf:Person ?p1}
FOR 15-25 {foaf:Person ?p2}
WHERE {FILTER ?p1 != ?p2}
CONNECT {?p1 foaf:knows ?p2}
Finally, we must explain how the data value samplers and class value samplers are
defined. An optional class mappings input file provides a mapping of each class to the
literal properties that it has, e.g., the line
ub:GraduateStudent; ub:name; ub:email; ub:age;
associates GraduateStudents with three properties. An additional sampler function file
states which value samplers should be used for each literal property, e.g., the line
CounterDictSampler; GlobalDistinctMode; ub:Person; Person
states that the literals which identify a Person should never be repeated, and should be
created by appending Person with a running counter.
4 Optimization Techniques
Each evaluation of a data generation command requires two types of interactions with
an RDF database: query evaluation of the composite queries (possibly many times), and
the addition of new triples to the RDF database, as the construction patterns are applied.
Thus, the OLTP speed of the RDF database used will largely dictate the speed at which
our commands can be executed. Two strategies were used to decrease evaluation time.
Caching Query Results. A data generation command has a composite query Q̄ =
Q1 (x̄1 ; ȳ1 ), . . . , Qk (x̄k ; ȳk ) and a tuple of q-samplers, πq1 , . . . , πqk . During the execu-
tion of a data generation command, a specific query Qi may be called several times.
Specifically, a naive implementation will evaluate Qi once, for each sample chosen by
πqj from the result of Qj , for all j < i. For example, the query determined by the sec-
ond FOR clause of C6 (Section 3), returning all undergraduate courses, will be called
repeatedly, once for each undergraduate student.
Sometimes, repeated evaluations of Qi cannot be avoided. This occurs if the input
parameters x̄i have been given new values. However, repeated computations with the
same input values can be avoided by using a caching technique. We use a hash-table
to store the output of a query, for each different instantiation of the input parameters.
Then, before evaluating Qi with input values ā for x̄i , we check whether the result of
this query already appears in the hash table. If so, we use the cached values. If not, we
evaluate Qi , and add its result to the hash-table.
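The idea can be sketched as follows (illustrative Java only, not GRR's actual code; the result type and the evaluation callback are placeholders):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Illustrative sketch of the caching described above (not GRR's actual API):
 *  query results are memoized per instantiation of the input parameters. */
class QueryResultCache<R> {
    // one entry per distinct tuple of input values chosen for the input variables
    private final Map<List<String>, R> cache = new HashMap<>();

    /** Returns the cached result for the given input values,
     *  evaluating the query only on a cache miss. */
    R resultFor(List<String> inputValues, Function<List<String>, R> evaluateQuery) {
        return cache.computeIfAbsent(inputValues, evaluateQuery);
    }
}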
Avoiding Unnecessary Caching. Caching is effective in reducing the number of times
queries will be applied. However, caching query results incurs a significant storage
overhead. In addition, there are cases in which caching does not bring any benefit, since
some queries will never be called repeatedly with the same input values.
To demonstrate, consider C4 (Section 3), which chooses a head for each department.
As the q-sampler for the query defined by the first FOR clause iterates over all de-
partments without repetition, the query defined by the second FOR clause will always
be called with different values for its input parameter ub:Dept. Hence, caching the
results of the second query is not useful.
In GRR, we avoid caching of results when caching is guaranteed to be useless. In
particular, we will not cache the results of Qi if, for all j < i we have (1) Qj runs in
global distinct mode and (2) all output parameters of Qj are input parameters of Qi .
As a special case, this also implies that caching is not employed in C4 , but will be used
for C5 (as only Dept, and not Undergrad, is an input parameter to the query defined
by its second FOR clause). Note that our caching technique is in the spirit of result
memoization, a well-known query optimization technique (e.g., [12, 2]).
5 Experimentation
We implemented GRR within the Jena Semantic Web Framework for Java.7 Our exper-
imentation uses Jena’s TDB database implementation (a high performance, pure-Java,
non-SQL storage system). All experiments were carried out on a personal computer
running Windows Vista (64 bit) with 4GB of RAM.
We created two types of data sets using GRR. First, we recreated the LUBM bench-
mark, with various scaling factors. Second, we created simple RDF data conforming
to the FOAF (friend of a friend) schema [9]. The full input provided to GRR for these
7 Jena – A Semantic Web Framework for Java, http://jena.sourceforge.net
[Figure 4 plots (log scale): Time (ms), number of queries, and cache size (KB), for 10–40 departments (top row) and for various N people / M projects combinations (bottom row), comparing No Cache, Always Cache, and Smart Cache (cache size shown for the latter two only).]
Fig. 4. Time, number of queries and cache size for generating data
experiments, as well as working downloadable code for GRR, is available online.8 Our
goal is to determine the scalability of GRR in terms of runtime and memory, as well as
to (at least anecdotally) determine its ease of use.
LUBM Benchmark. In our first experiment we recreated the LUBM benchmark data.
All data could be created directly using the textual interface. In total, 24 data generation
commands were needed. These commands together consisted of only 284 words. This
is approximately 1.8 times the number of words used in the intuitive description of the
LUBM benchmark data (consisting of 158 words), which is provided in the LUBM
project for users to read. We found the data generation commands to be intuitive, as
demonstrated in the examples of Section 3. Thus, anecdotally, we found GRR to be
quite easy to use.
Together with the class mappings and sampler function files (details not discussed
in the LUBM intuitive description), 415 words were needed to recreate the LUBM
benchmark. This compares quite positively to the 2644 words needed to recreate LUBM
when writing these commands directly in RDF—i.e., the result of translating the textual
interface into a complete RDF specification.
We consider the performance of GRR. The number of instantiations of each class type
in LUBM is proportional to the number of departments.9 We used GRR to generate the
benchmark using different numbers of departments. We ran three versions of GRR to
determine the effect of our optimizations. In NoCache, no caching of query results was
performed. In AlwaysCache, all query results were cached. Finally, in SmartCache,
only query results that can potentially be reused (as described in Section 4) are stored.
Each experiment was run three times, and the average runtime was taken.
8 www.cs.huji.ac.il/~danieb12
9 LUBM suggests 12–25 departments in a university.
The runtime appears in Figure 4(a). As can be seen in this figure, the runtime in-
creases linearly as the scaling factor grows. This is true for all three versions of G RR.
Both AlwaysCache and SmartCache significantly outperform NoCache, and both
have similar runtime. For the former two, the runtime is quite reasonable, with con-
struction taking approximately 1 minute for the 200,000 tuples generated when 40 de-
partments are created, while NoCache requires over 12 minutes to generate this data.
The runtime of Figure 4(a) is easily explained by the graph of Figure 4(b), which de-
picts the number of queries applied to the database. Since SmartCache takes a conser-
vative approach to caching, it always caches results that can potentially be useful. Thus,
AlwaysCache and SmartCache apply the same number of queries to the database.
NoCache does not store any query results, and hence, must make significantly more
calls to the database, which degrades its runtime.10
Finally, Figure 4(c) shows the size of the cache for GRR. NoCache is omitted, as
it does not use a cache at all. To measure the cache size, we serialized the cache af-
ter each data generation command, and measured the size of the resulting file. The
maximum cache size generated (among the 24 data generation commands) for each of
AlwaysCache and SmartCache, and for different numbers of departments, is shown
in the figure. Clearly, SmartCache significantly reduces the cache size.
FOAF Data. In this experiment we specifically chose data generation commands that
would allow us to measure the effectiveness of our optimizations. Therefore, we used
the following (rather contrived) data generation commands, with various values for M
and N .
(C1 ) CREATE N {foaf:Person}
(C2 ) CREATE M {foaf:Project}
(C3 ) FOR EACH {foaf:Person}
FOR EACH {foaf:Project}
CONNECT {foaf:Person foaf:currProject foaf:Project}
(C4 ) FOR EACH {foaf:Person}
FOR EACH {foaf:Project}
WHERE {foaf:Person foaf:currProject foaf:Project}
CONNECT {foaf:Person foaf:maker foaf:Project}
Commands C1 and C2 create N people and M projects. Command C3 connects all
people to all projects using the currProject predicate, and C4 adds an additional
edge maker between each person and each of his projects. Observe that caching is
useful for C3 , as the inner query has no input parameters, but not for C4 , as it always has
a different value for its input parameters.11
Figures 4(d) through 4(f) show the runtime, number of query applications to the
RDF database and memory size, for different parameters of N and M . Note that ap-
proximately 200,000 triples are created for the first two cases, and 2 million for the
10 This figure does not include the number of construction commands applied to the database, which is equal for all three versions.
11 This example is rather contrived, as a simpler method to achieve the same effect is to include both CONNECT clauses within C3.
last three cases. For the last three cases, statistics are missing for AlwaysCache, as
its large cache size caused it to crash due to lack of memory. Observe that our system
easily scales up to creating 2 million tuples in approximately 1.5 minutes. Interestingly,
SmartCache and NoCache perform similarly in this case, as the bulk of the time is
spent iterating over the instances and constructing new triples.
We conclude that SmartCache is an excellent choice for GRR, due to its speed and
reasonable cache memory requirements.
6 Conclusion
We presented the GRR system for generating random RDF data. GRR is unique in that
it can create data with arbitrary structure (as opposed to benchmarks, which provide
data with a single specific structure). Thus, GRR is useful for generating test data for
Semantic Web applications. By abstracting SPARQL queries, GRR presents a method to
create data that is both natural and powerful.
Future work includes extending GRR to allow for easy generation of other types
of data. For example, we are considering adding recursion to the language to add an
additional level of power. User studies to validate ease of use are also of interest.
References
1. Aboulnaga, A., Naughton, J., Zhang, C.: Generating synthetic complex-structured XML data.
In: WebDB (2001)
2. Bancilhon, F., Ramakrishnan, R.: An amateur’s introduction to recursive query processing
strategies. In: SIGMOD, pp. 16–52 (1986)
3. Barbosa, D., Mendelzon, A., Keenleyside, J., Lyons, K.: ToXgene: an extensible template-
based data generator for XML. In: WebDB (2002)
4. Binnig, C., Kossman, D., Lo, E.: Testing database applications. In: SIGMOD (2006)
5. Bizer, C., Schultz, A.: The Berlin SPARQL benchmark. IJSWIS 5(2), 1–24 (2009)
6. Blum, D., Cohen, S.: Generating RDF for application testing. In: ISWC (2010)
7. Bruno, N., Chaudhuri, S.: Flexible database generators. In: VLDB (2005)
8. Cohen, S.: Generating XML structure using examples and constraints. PVLDB 1(1), 490–
501 (2008)
9. The friend of a friend (foaf) project, http://www.foaf-project.org
10. Guo, Y., Pan, Z., Heflin, J.: LUBM: a benchmark for OWL knowledge base systems.
JWS 3(2-3), 158–182 (2005)
11. Houkjaer, K., Torp, K., Wind, R.: Simple and realistic data generation. In: VLDB (2006)
12. McKay, D.P., Shapiro, S.C.: Using active connection graphs for reasoning with recursive
rules. In: IJCAI, pp. 368–374 (1981)
13. Neufeld, A., Moerkotte, G., Lockemann, P.C.: Generating consistent test data for a variable
set of general consistency constraints. The VLDB Journal 2(2), 173–213 (1993)
14. Schmidt, M., Hornung, T., Lausen, G., Pinkel, C.: SP2Bench: a SPARQL performance bench-
mark. In: ICDE (March 2009)
High-Performance Computing Applied to
Semantic Databases
1 Introduction
The Semantic Web is a loosely defined notion, but generally includes such stan-
dards as the
2 Cray XMT
The Cray XMT is a unique shared-memory machine with multithreaded pro-
cessors especially designed to support fine-grained parallelism and perform well
despite memory and network latency. Each of the custom-designed compute
processors (called Threadstorm processors) comes equipped with 128 hardware
threads, called streams in XMT parlance, and the processor, rather than the operating
system, is responsible for scheduling the streams. To allow for single-cycle
context switching, each stream has a program counter, a status word, eight tar-
get registers, and thirty-two general purpose registers. At each instruction cycle,
an instruction issued by one stream is moved into the execution pipeline. The
large number of streams allows each processor to avoid stalls due to memory
requests to a much larger extent than commodity microprocessors. For exam-
ple, after a processor has processed an instruction for one stream, it can cycle
through the other streams before returning to the original one, by which time
some requests to memory may have completed. Each Threadstorm can currently
support 8 GB of memory, all of which is globally accessible. One system we use
in this study has 512 processors and 4 TB of shared memory.
Programming on the XMT consists of writing C/C++ code augmented with
non-standard language features including generics, intrinsics, futures, and per-
formance-tuning compiler directives such as pragmas. Generics are a set of func-
tions the Cray XMT compiler supports that operate atomically on scalar values,
performing read, write, purge, touch, and int_fetch_add operations.
Each 8-byte word of memory is associated with a full-empty bit and the read and
write operations interact with these bits to provide light-weight synchronization
between threads. Here are some examples of the generics provided:
– readxx: Returns the value of a variable without checking the full-empty bit.
– readfe: Returns the value of a variable when the variable is in a full state,
and simultaneously sets the bit to be empty.
– writeef: Writes a value to a variable if the variable is in the empty state,
and simultaneously sets the bit to be full.
– int_fetch_add: Atomically adds an integer value to a variable.
Parallelism is achieved explicitly through the use of futures, or implicitly when
the compiler attempts to automatically parallelize for loops. Futures allow pro-
grammers to explicitly launch threads to perform some function. Besides explicit
parallelism through futures, the compiler attempts to automatically parallelize
for loops, enabling implicit parallelism. The programmer can also provide prag-
mas that give hints to the compiler on how to schedule iterations of the for
loop on to various threads, which can be by blocks, interleaved, or dynamic. In
addition, the programmer can supply hints on how many streams to use per
processor, etc. We extensively use the #pragma mta for all streams i of n
construct that allows programmers to be cognizant of the total number of streams
that the runtime has assigned to the loop, as well as providing an iteration index
that can be treated as the id of the stream assigned to each iteration.
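A hedged sketch of how these constructs combine is given below (XMT-specific; the function and variable names are ours, while int_fetch_add and the pragma are the compiler-provided constructs described above, so the code only compiles with the Cray XMT compiler):

#include <stdint.h>

/* Illustrative sketch: each stream claims an interleaved share of the input
 * and tallies matches with the atomic int_fetch_add generic. */
uint64_t count_matches(const uint64_t *data, uint64_t len, uint64_t target) {
  uint64_t count = 0;
  int i, n;                                  /* stream id and number of streams */
  #pragma mta for all streams i of n
  {
    for (uint64_t j = (uint64_t)i; j < len; j += (uint64_t)n)  /* interleaved partition */
      if (data[j] == target)
        int_fetch_add(&count, 1);            /* atomic increment, no explicit lock */
  }
  return count;
}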
(SPEED-MT)2 library. The first is a set of algorithms and data structures de-
signed to run scalably on shared-memory platforms such as the XMT. The sec-
ond is a novel scalable Semantic Web processing capability being developed for
the XMT.
3 Dictionary Encoding
The first aspect of semantic databases we examine is that of translating semantic
data from a string representation to an integer format. To simplify the discussion,
we consider only semantic web data represented in N-Triples. In this format,
semantic data is presented as a sequence of lines, each line containing three
elements, a subject, a predicate, and an object. An element can either be a
URI, a blank node (an anonymous resource), or a literal value (a string value
surrounded by quotes with optional language and datatype modifiers). In all
cases, an element is a string of arbitrary length. To speed up later processing
of the data and to also reduce the size of the semantic graph, a common tactic
is to create a dictionary encoding – a mapping from strings to integers and vice
versa. On the data sets we explore in this paper, we were able to compress the
raw data by a factor of between 3.2 and 4.4.
The dictionary encoding algorithm, outlined in Figure 1, is described in more
detail below. The dictionary is encapsulated within a class, RDF Dictionary,
that has three important members: fmap, rmap, and carray. The fmap, or for-
ward map, is an instance of a hash table class that stores the mapping from
strings to integer ids. Similarly, rmap, or reverse map, stores the opposite map-
ping, from integers to strings. We use unsigned 64-bit integers in order to support
data sets with more than 4 billion unique strings. The hash table implementa-
tion is similar to the linear probing method described in Goodman et al. [1].
However, we made some modifications that significantly reduce the memory
footprint; these are described in the next section.
Both of fmap and rmap reference carray, which contains a single instance of
each string, separated by null terminators. Having a single character array store
the unique instances of each string reduces the memory footprint and allows for
easy reading and writing of the dictionary to and from disk; however, it does
add some complexity to the algorithm, as seen in Figure 1. Also, we support
iteratively adding to the dictionary, which introduces further complications.
The dictionary encoding algorithm is invoked with a call to parse_file. The
variable ntriple_file contains the location on disk of the file to be encoded.
As of now, we only support processing files in N-Triples or N-Quads3 format.
After reading in the raw data, the algorithm tokenizes the array into individual
elements (i.e. subjects, predicates, and objects) in lines 6-10. It does this by
inserting a null terminator at the conclusion of each element, and storing the
beginning of each element in the words array.
We allow for updates to an existing dictionary, so the next for loop on lines
11-15 extracts the subset of elements that are new this iteration. Line 11 checks
2 https://software.sandia.gov/trac/MapReduceXMT
3 http://sw.deri.org/2008/07/n-quads/
to see whether each string is already stored in fmap; new strings are inserted into a function-
scoped instance of the map class, tmap. Notice that for each new word we insert
the value one. The actual ids that will be added to the dictionary are assigned
in the next block of code. Doing so allows us to avoid memory contention on a
counter variable and use efficient range iterators that come with the hash table
class.
The block of lines from 16 through 20 assigns ids to the new set of elements,
and then appends the new elements to the end of carray. Line 16 determines
the largest id contained within the dictionary and increments that value by one,
thus specifying the starting id for the new batch of strings. If the dictionary is
empty, the starting id is one, reserving zero as a special value required by the
hash table implementation. Line 17 calls the function assign contiguous ids
which iterates through the keys of the hash table and assigns them values
v ∈ [start, start + num new], thus ensuring that regardless of how many times
parse file is called, the ids are in the range [1, num keys], where num keys
is the total number of keys. Line 18 gathers the new elements into a contiguous
array, keys. Line 19 takes keys and copies the data to the end of carray, plac-
ing null terminators between each element. The function consolidate returns
the previous size of carray and assigns that value to plen. Line 20 updates the
total number of unique elements.
Once we’ve updated the number of keys, we can then test if the forward
and reverse maps need to be resized. On line 21, if the total number of keys
divided by the maximum load factor exceeds the current capacity of the ta-
ble (the total number of slots in the table, claimed or unclaimed), then we
resize both maps. The new size is set to be the smallest power of two such that
num_keys/capacity < max_load.
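For concreteness, the resize rule can be sketched as follows (illustrative code, not the actual implementation):

#include <stdint.h>

/* Sketch of the resize rule: grow the capacity to the smallest power-of-two
 * multiple that keeps the load factor num_keys/capacity below max_load. */
uint64_t next_capacity(uint64_t num_keys, uint64_t capacity, double max_load) {
  uint64_t cap = capacity;
  while ((double)num_keys / (double)cap >= max_load)
    cap *= 2;                 /* double until the load factor drops below max_load */
  return cap;
}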
After the forward and reverse maps have been resized if necessary, they are
then updated with the new elements and new ids in lines 27 through 33. Since
the new elements have been added to the end of carray, we iterate through that
portion of the array. Each time we find a null terminator at position i, we know
that an element to be added starts at i + 1. We find the corresponding id from
tmap, and then add the pair to each map. With the forward and reverse maps
updated, we are finally ready to translate the elements listed in the words array
into integers and store the result in the output buffer in lines 34 through 36.
After the data has been encoded as integers, we are then ready to move on to
the next step, that of performing inferencing. An optional step is to write out
the translated data and the mapping between strings and integers to disk. This
is done by means of three files:
– <dataset>.translated : A binary file of 64-bit unsigned integers that contain
the triples encoded as integer values.
– <dataset>.chararr : This contains the contents of carray.
– <dataset>.intarr : Another binary file of 64-bit unsigned integers. The se-
quence of integers corresponds to the same sequence of words found in
<dataset>.chararr, thus preserving the mapping defined between strings and
integers.
3.1 Results
We examined four data sets: Uniprot4, DBPedia5, Billion Triple Challenge 20096
(BTC2009), and the Lehigh University Benchmark (LUBM(8000)). We also ran
the dictionary encoding on a LUBM data set consisting of 16.5 billion triples.
This is roughly equivalent to LUBM(120000), though we generated it using
several different concurrent runs of the generator using different random seeds
and different offsets. These sets represent a wide variety, ranging from the well-
behaved, synthetic triple set of LUBM, to real-world but curated sets such as
DBPedia and Uniprot, to the completely wild sources like BTC2009, which was
formed by crawling the web.
We evaluated the dictionary encoding code using two different-sized systems,
a 512-processor XMT and a 128-processor system. Each XMT comes equipped
with a service partition. On the service nodes a Linux process called a file service
worker (fsworker ) coordinates the movement of data from disk to the compute
nodes. Multiple fsworkers can run on multiple service nodes, providing greater
4 http://www.uniprot.org
5 http://wiki.dbpedia.org
6 http://challenge.semanticweb.org
aggregate bandwidth. The 512 system has 16 service nodes and can run 16
fsworkers. However, our 128 system is limited to 2 service nodes and 2 fsworkers.
Since this is an artificial constraint of our smaller installation, we decided to
estimate the rate that would have been achieved had 16 fsworkers been available.
Since throughput degrades linearly as the number of fsworkers is reduced, we
compute I/O performance by multiplying the measured rate by a factor of eight
to characterize the behavior of a machine configuration more amenable to I/O.
Table 1 shows the raw sizes of the original data sets and the compression ratio
achieved. The compression ratio is calculated as s_o / (s_i + s_c + s_t), where s_o is the size of the original data set, s_i is the size of the dictionary integer array, s_c is the size of the dictionary character array, and s_t is the size of the encoded triples. The size of the dictionary on disk is s_i + s_c, while the size of the dictionary in memory is the total memory footprint of the dictionary. Going from disk to memory increases the size of the dictionary by a factor of about 1.5 to 2. This is due to the hash table implementation, which requires load factors lower than 0.7 to work efficiently.
Table 2 gives a comparison to a MapReduce dictionary encoding algorithm
presented by Urbani, et al. [8]. We compare rates achieved using 32 Threadstorm
processors versus a cluster of 32 quad-core nodes. We see between a 2.4 and a 3.3 times
improvement. Rate is calculated by dividing the size of the original data set
by the total time to read the data from disk to memory, perform the encoding
algorithm, and write the encoding and dictionary to disk. It should be noted that
the datasets are of similar variety, but of different sizes. DBPedia and Uniprot
have grown since the time when the Urbani paper was published to when we
examined them. Also, we used a larger LUBM dataset. Figure 2(a) displays the
[Figure 2 plots (log scale): (a) Times for Dictionary Encoding — Time (seconds) vs. Processors (1–128); (b) Dictionary Encoding Rates — Rate (MB/s) vs. Processors (1–128); series: DBPedia, BTC2009, LUBM(8000), Uniprot.]
Fig. 2. (a) shows the compute times for the data sets and varying number of processors.
(b) displays the encoding rates achieved. The rate is defined as the original file size
divided by the time it takes to read the file into memory, perform the calculation, and
write the translated triples and dictionary to disk. The file I/O times were estimated
assuming the use of 16 fsworkers.
times obtained for the compute portion (i.e. excluding file I/O) of the dictionary
encoding process. Regardless of the nature of the data, we see nearly linear
speedup of 47-48x. Figure 2(b) presents the encoding rates. This includes an
estimated I/O time that would have been obtained with 16 fsworkers. The rates
fall within a relatively tight band except DBPedia, which is about 15% slower.
We are unsure if this is due to the nature of the data within DBPedia, or due to
the fact that the file is significantly smaller than the other datasets.
We ran the 512 system on LUBM(120000). We ran once using all 512 pro-
cessors, iteratively processing a third of the data at a time. The times for each
chunk were 1412, 2011, and 1694 seconds. The latter chunks take longer than the first
due to the need to check against the existing table; in addition, the second chunk
required a resize of the forward and reverse hash tables. Overall, the rate
achieved was 561 MB/s. Extrapolating from our LUBM(8000) 2-processor run,
ideally we would have achieved 2860 MB/s, representing an efficiency of about
0.20. Had we concatenated all the data together, the rate of the 512-processor
run would have been significantly better.
4 RDFS Closure
We presented an algorithm for RDFS closure in previous work [2]. In general the
process we described is to keep a large hash table, ht, in memory and also smaller
hash tables as queues for the RDFS rules, qi . We first iterate through all the
triples, adding the original set to ht; any triple that matches a given rule is
added to the appropriate qi. Then, upon invocation of a rule, we iterate through
its queue instead of the entire data set. The algorithm assumes the ontology does
not operate on RDFS properties. As such, a single pass through the RDFS rule
set is sufficient.
The algorithm we employed in this paper is largely the same. We did make
some modifications that resulted in a 50% decrease in the memory footprint,
namely:
– removal of the occupied array in the hash table and hash set implementa-
tions, and
– removal of the rule queues.
In our previous work on hashing for the Cray XMT [1], we outlined an open
addressing scheme with linear probing, the key contribution being a mechanism
for avoiding locking except for when a slot in the hash table is declared occupied
for a given key. The open addressing scheme makes use of two arrays, a key array
and an occupied array. The key array stores the keys assigned to various slots
in the hash table, while the occupied array handles hash collisions and thread
synchronization. The occupied array acts as a boolean, a 1 indicating that the
slot is taken and a 0 otherwise (this assumes we don’t care about deleting and
reclaiming values, else we need another bit). Despite the occupied array being a
boolean, each position in the array is a 64-bit integer. Threads need to be able to
interact with the full-empty bit for synchronization, and full-empty bits are only
associated with each 8-byte word. However, an important observation is that the
occupied array is only necessary for a general implementation that is agnostic
to the key distribution. In situations where there is a guarantee that a particular
key k will never occur, we can use the key array itself for thread synchronization
and use k as the value indicating a slot is empty. When we initialize the key
array, we set all the values to k. Since we control what values are assigned during
the dictionary encoding, we reserve k = 0 as the special value indicating a slot
is open.
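A hedged sketch of an insert under this scheme is shown below (details differ from the library described in [1]; hash() is a placeholder, and readxx, readfe, and writeef are the generics introduced in Section 2):

#include <stdint.h>

uint64_t hash(uint64_t key);  /* placeholder for a suitable hash function */

/* Illustrative linear-probing insert in which the key array itself carries the
 * synchronization and key 0 is reserved to mean "empty slot". */
void insert(uint64_t *keys, uint64_t *values, uint64_t capacity,
            uint64_t key, uint64_t value) {
  uint64_t i = hash(key) % capacity;
  for (;;) {
    uint64_t k = readxx(&keys[i]);        /* peek without touching the full-empty bit */
    if (k == key) return;                 /* key already owns this slot */
    if (k == 0) {                         /* slot looks empty: try to claim it */
      uint64_t cur = readfe(&keys[i]);    /* read when full, leave the bit empty */
      if (cur == 0) {                     /* still empty, so claim the slot */
        values[i] = value;
        writeef(&keys[i], key);           /* publish the key and set the bit full */
        return;
      }
      writeef(&keys[i], cur);             /* lost the race: restore the key */
      if (cur == key) return;
    }
    i = (i + 1) % capacity;               /* probe the next slot */
  }
}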
The second change we employed is the removal of queues. In our previous
implementation, we made use of queues for RDFS rules. As we processed ex-
isting triples or added new triples through inference, we would check to see if
the triple under consideration matches a rule. If so, we would add it to the ap-
propriate queue. Then, when the rule was actually evaluated, we iterated over
the queue instead of the entire dataset, thus saving computation. To save on
space, we removed the queues. This change did result in a small increase in com-
putation time. We examined LUBM(8000) and found about a 33% increase in
computation time for small processor counts, but for 128 processors the increase in time
was only 11%.
4.1 Results
We examined performing closure on LUBM(8000) and BTC2009. For BTC2009,
we used the higher-level ontology described by Williams et al. [10]. BTC2009 is
a collection of data crawled from the web. As such, it is questionable whether the
ontological information procured from sundry sources should be applied to the
entire data set. For instance, some ontological triples specified superproperties
for rdf:type. While expansion of rdf and rdfs namespaces may be appropriate
for some portion of BTC2009, namely the source from which the ontological
information is taken, it doesn’t make sense for the rest of the data. Also, this
type of expansion violates the single-pass nature of our algorithm, and would
require multiple passes. As such, we removed all ontological triples (i.e. any triple
with rdfs or owl in the namespace of the predicate) from BTC2009 and added
the higher level ontology.
Figure 3 displays the results of running our RDFS closure algorithm on the
two different data sets. For comparison, we also include the times using the
previous approach on LUBM(8000). Table 3 provides comparison with other
approaches. We refer to the work of Weaver and Hendler [9] as MPI as they
use an MPI-based approach. WebPIE refers to the work of Urbani et al. [7].
We extract the WebPIE rate for RDFS out of a larger OWL computation. In
both cases we compare equal number of Threadstorm processors with quad-core
nodes (32 for MPI and 64 for WebPIE). We present the comparison with and
without I/O. As this part of our pipeline does not require I/O, it seems fair
to compare our non-I/O numbers with the previous approaches, whose processing
relies upon access to disk. Nevertheless, to aid an apples-to-apples comparison,
we also include estimated rates that would be obtained with I/O using 16 fsworkers.
We also ran RDFS closure on LUBM(120000) with 512 processors. The final
triple total came in at 20.1 billion unique triples. We achieved an inference
rate of 13.7 million inferences/second when we include I/O, and 21.7 million
inferences/second without I/O. Again using the 2 processor run on LUBM(8000)
as a baseline, ideally we would want to see 77.2 million inferences/second when
ignoring I/O. This gives an estimated efficiency of 0.28.
Once we have the data encoded as integers, and all RDFS inferences have been
materialized, we are now ready to store the data within a data model. Previous
to this step, the triples had been stored in one large array. Instead of trying to
fit standard relational DBMS-style models to sets of triples, we opt to model
each triple as a directed edge in a graph. The subject of a triple is a vertex in
the graph, and the predicate is a typed edge whose head is the subject and whose
tail is the object, itself another vertex.
We present some basic notation to facilitate discussions of the graph data
model. A graph is defined in terms of vertices, V , and edges E, i.e. G = (V, E).
The graphs we consider are directed, meaning that the edges point from a head
vertex to a tail vertex. We use E(v) to denote the edges incident on vertex v,
while E−(v) denotes only the incoming edges and E+(v) the outgoing
edges. Similarly, we define the degree, the number of edges incident to vertex v, as
deg(v), deg−(v), and deg+(v). We use source(e) and dest(e) to denote the head
and tail vertices of an edge e. Also, we enumerate the edges, and refer to the
ith edge incident with v using the notations E(v)[i], E−(v)[i], and E+(v)[i].
6 Querying
Once we have the data in graph form, we can now utilize that information to
perform efficient SPARQL queries. LUBM [3] provides several standard queries.
For the purposes of discussion we list query 1:
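(Sketched in SPARQL; prefix declarations are omitted, ub: denotes the LUBM univ-bench ontology namespace, and the course IRI follows the benchmark's naming scheme.)

SELECT ?X
WHERE {
  ?X rdf:type ub:GraduateStudent .
  ?X ub:takesCourse <http://www.Department0.University0.edu/GraduateCourse0> .
}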
The WHERE clause contains a set of what are called basic graph patterns
(BGPs). They are triple patterns that are applied to the data set, and those
elements that fit the described constraints are returned as the result. The above
SPARQL query describes formally a request to retrieve all graduate students
that take a particular course. It is important to note that there are basically
only two possibilities for the number of variables within a BGP, one and two.
The other two cases are degenerate: a BGP with no variables has no effect on
the result set, and a BGP with all variables simply matches everything.
Here we present an algorithm we call Sprinkle SPARQL. The algorithm begins
by creating an array of size |V | for each variable specified in the query (see line 1
of Figure 4). Each array is called aw for every variable w. We then evaluate each
BGP by incrementing a counter in the array for a given variable each time a
node in the graph matches the BGP. For example, say we have a BGP, b, similar
to the two BGPs of LUBM query 1, where the subject b.s is a variable but b.p
and b.o are fixed terms (line 2 of Figure 5). In that case, we use the graph data
model (line 3 of Figure 5) and start at the object and iterate through all the
deg−(b.o) edges in E−(b.o). If an edge matches b.p, we then increment the counter
for the subject at the source of the edge, source(E−(b.o)[i]), in the temporary
array t. We use a temporary array to prevent the possibility of the counter for
a node being incremented more than once during the application of a single
BGP. Once we have iterated through all the edges associated with b.o and found
matching subjects, we then increment positions within ab.s that have a non-zero
corresponding position in t. In the interest of space, we omit the description
of the other cases. Note that the algorithm as currently defined excludes the
possibility of having a variable in the predicate position; we will leave that as
future work. It is this process of iterating through the BGPs and incrementing
counters that we liken to sprinkling the information from SPARQL BGPs across
the array data structures.
Procedure: Sprinkle(b, A)
Let B and W be the same as above. Let F be the set of fixed terms (not variables) in B.
1: Create a temporary array t of size |V| where ∀i ∈ [0, |V| − 1] : t[i] = 0
2: if b.s ∈ W ∧ b.p ∈ F ∧ b.o ∈ F then
3:   for i ← 0 ... deg−(b.o) − 1 do
4:     if E−(b.o)[i] = b.p then
5:       s ← source(E−(b.o)[i])
6:       t[s]++
7:     end if
8:   end for
9:   for i ← 0 ... |V| − 1 do
10:    if t[i] > 0 then
11:      ab.s[i]++
12:    end if
13:  end for
14: else if b.s ∈ F ∧ b.p ∈ F ∧ b.o ∈ W then
     ...
15: else if b.s ∈ W ∧ b.p ∈ F ∧ b.o ∈ W then
     ...
16: end if
Fig. 5. This figure outlines the Sprinkle process
At this point, one can think of the result set as a relational table with one attribute.
We then iterate through all the 2-variable BGPs in lines 6 through 11, where for
each iteration we select the BGP that creates the smallest result set. For lack
of a better term, we use the term join to denote the combination of the result
set R with a BGP b, and we use GraphJoin(R, b) to represent the operation.
Consider that GraphJoin(R, b) has two cases:
– A variable in R matches one variable in b, and the other variable in b is
unmatched.
– Two variables in R match both variables in b.
In the former case, the join adds an additional attribute to R. In the latter case,
the join further constrains the existing set of results. Our current implementation
calculates the exact size of each join. An obvious improvement is to select a
random sample to estimate the size of the join.
6.1 Results
Here we present the results of running Sprinkle SPARQL on LUBM queries 1-5
and 9. Of the queries we tested, 4, 5, and 9 require inferencing, with 9 needing
owl:intersectionOf to infer that all graduate students are also students. Since we
do not yet support OWL, we added this information as a post-processing step
after calculating RDFS closure. Figure 6 shows the times we obtained with the
method. We report the time to perform the calculation together with the time to
either print the results to the console or store them on disk, whichever is faster.
For smaller queries, it makes sense to report the time to print to screen, as a human
operator can easily digest small result sets. For larger result sets, more analysis is
likely necessary, so we report the time to store the query results on disk.
[Figure 6: query times (seconds, log scale) versus number of processors (1–128) for LUBM queries 1, 2, 3, 4, 5, and 9.]
Table 4. Speedup achieved by Sprinkle SPARQL over other approaches for queries 2 and 9
Query   MapReduce   BigOWLIM
2       13.6        2.12
9       28.0        2.82
Queries 2 and 9 are the most complicated and they are where we see the
most improvement in comparison to other approaches. We compare against a
MapReduce approach by Husain et al. [4] and against the timings reported for
BigOWLIM on LUBM(8000) 7 . This comparison is found in Table 4. For the
MapReduce work, we compare 10 Threadstorm processors to an equal number
of quad-core processors. For the comparison against BigOWLIM, we compare
against the quad-core system, ontosol, dividing the times of our two-processor
runs by two to give a fair comparison. Ideally, we would like to compare
against larger processor counts, but we could find nothing in the literature.
Queries 1, 3, and 4 have similar performance curves. The majority of the time
is consumed during the Sprinkle phase, which down-selects so much that later
computation (if there is any) is inconsequential. For comparison we ran a simple
algorithm on query 1 that skips the Sprinkle phase, but instead executes each
BGP in a greedy selection process, picking the BGPs based upon how many
triples match the pattern. For query 1, this process chooses the second BGP,
which has 4 matches, followed by the first BGP, which evaluated by itself has
over 20 million matches. For this simple approach, we arrive at a time of 0.33
seconds for 2 processors as opposed to 29.28 with Sprinkle SPARQL, indicating
that Sprinkle SPARQL may be overkill for simple queries. Query 5 has similar
computational runtime to 1, 3, and 4, but because of a larger result set (719
versus 4, 6, and 34), takes longer to print to screen. For these simple queries,
Sprinkle SPARQL performs admirably in comparison to the MapReduce work,
ranging between 40 and 225 times faster; compared to the BigOWLIM results, however,
we do not match their times of between 25 and 52 ms. As future work, we plan
to investigate how we can combine the strategies of Sprinkle SPARQL and a
simpler approach without Sprinkle (and perhaps other approaches) to achieve
good results on both simple and complex queries.
7 http://www.ontotext.com/owlim/benchmarking/lubm.html
7 Conclusions
In this paper we presented a unique supercomputer with architecturally-advanta-
geous features for housing a semantic database. We showed dramatic improvement
for three fundamental tasks: dictionary encoding, RDFS closure, and querying. We
have shown the performance value of holding large triple stores in shared memory.
We have also demonstrated scaling up to 512 processors.
Acknowledgments. This work was funded under the Center for Adaptive
Supercomputing Software - Multithreaded Architectures (CASS-MT) at the
Dept. of Energy’s Pacific Northwest National Laboratory. Pacific Northwest Na-
tional Laboratory is operated by Battelle Memorial Institute under Contract
DE-ACO6-76RL01830.
References
1. Goodman, E.L., Haglin, D.J., Scherrer, C., Chavarría-Miranda, D., Mogill, J., Feo,
J.: Hashing Strategies for the Cray XMT. In: Proceedings of the IEEE Workshop
on Multi-Threaded Architectures and Applications, Atlanta, GA, USA (2010)
2. Goodman, E.L., Mizell, D.: Scalable In-memory RDFS Closure on Billions of
Triples. In: Proceedings of the 4th International Workshop on Scalable Seman-
tic Web Knowledge Base Systems, Shanghai, China (2010)
3. Guo, Y., Pan, Z., Heflin, J.: LUBM: A Benchmark for OWL Knowledge Base
Systems. Web Semantics: Science, Services and Agents on the World Wide Web 3(2-
3), 158–182 (2005)
4. Husain, M.F., Khan, L., Kantarcioglu, M., Thuraisingham, B.: Data Intensive
Query Processing for Large RDF Graphs Using Cloud Computing Tools. In: Pro-
ceedings of the 3rd International Conference on Cloud Computing, Miami, Florida
(2010)
5. Schmidt, M., Hornung, T., Lausen, G., Pinkel, C.: SP2Bench: A SPARQL Performance Bench-
mark. In: Proceedings of the 25th International Conference on Data Engineering,
Shanghai, China (2009)
6. Urbani, J., Kotoulas, S., Oren, E., van Harmelen, F.: Scalable Distributed Reason-
ing Using MapReduce. In: Bernstein, A., Karger, D.R., Heath, T., Feigenbaum, L.,
Maynard, D., Motta, E., Thirunarayan, K. (eds.) ISWC 2009. LNCS, vol. 5823,
pp. 634–649. Springer, Heidelberg (2009)
7. Urbani, J., Kotoulas, S., Maassen, J., van Harmelen, F., Bal, H.: OWL reasoning
with WebPIE: calculating the closure of 100 billion triples. In: Proceedings of the
7th Extended Semantic Web Conference, Heraklion, Greece (2010)
8. Urbani, J., Maassen, J., Bal, H.: Massive Semantic Web data compression with
MapReduce. In: Proceedings of the MapReduce Workshop at High Performance
Distributed Computing Symposium, Chicago, IL, USA (2010)
9. Weaver, J., Hendler, J.A.: Parallel materialization of the finite RDFS closure for
hundreds of millions of triples. In: Bernstein, A., Karger, D.R., Heath, T., Feigen-
baum, L., Maynard, D., Motta, E., Thirunarayan, K. (eds.) ISWC 2009. LNCS,
vol. 5823, pp. 682–697. Springer, Heidelberg (2009)
10. Williams, G.T., Weaver, J., Atre, M., Hendler, J.: Scalable Reduction of Large
Datasets to Interesting Subsets. In: Billion Triples Challenge, Washington D.C.,
USA (2009)
An Intermediate Algebra for Optimizing RDF Graph
Pattern Matching on MapReduce
1 Introduction
With the recent surge in the amount of RDF data, there is an increasing need for scalable
and cost-effective techniques to exploit this data in decision-making tasks. MapReduce-
based processing platforms are becoming the de facto standard for large scale analyt-
ical tasks. MapReduce-based systems have been explored for scalable graph pattern
matching [1][2], reasoning [3], and indexing [4] of RDF graphs. In the MapReduce [5]
programming model, users encode their tasks as map and reduce functions, which are
executed in parallel on the Mappers and Reducers respectively. This two-phase com-
putational model is associated with an inherent communication and I/O overhead due
to the data transfer between the Mappers and the Reducers. Hadoop1 based systems
like Pig [6] and Hive [7] provide high-level query languages that improve usability
and support automatic data flow optimization similar to database systems. However,
1 http://hadoop.apache.org/core/
most of these systems are targeted at structured relational data processing workloads
that require relatively few join operations, as stated in [6]. On the contrary,
processing RDF query patterns typically requires several join operations due to
the fine-grained nature of the RDF data model. Currently, Hadoop supports only partition
parallelism in which a single operator executes on different partitions of data across
the nodes. As a result, the existing Hadoop-based systems with the relational style join
operators translate multi-join query plans into a linear execution plan with a sequence
of multiple Map-Reduce (MR) cycles. This significantly increases the overall commu-
nication and I/O overhead involved in RDF graph processing on MapReduce platforms.
Existing work [8][9] directed at uniprocessor architectures exploits the fact that joins
present in RDF graph pattern queries are often organized into star patterns. In this
context, they prefer bushy query execution plans over linear ones for query process-
ing. However, supporting bushy query execution plans in Hadoop based systems would
require significant modification to the task scheduling infrastructure.
In this paper, we propose an approach for increasing the degree of parallelism by
enabling some form of inter-operator parallelism. This allows us to “sneak in” bushy-like
query execution plans into Hadoop by interpreting star-joins as groups of triples
or TripleGroups. We provide the foundations for supporting TripleGroups as first class
citizens. We introduce an intermediate algebra called the Nested TripleGroup Algebra
(NTGA) that consists of TripleGroup operators as alternatives to relational style oper-
ators. We also present a data representation format called the RDFMap that allows for
a more easy-to-use and concise representation of intermediate query results than the
existing format targeted at relational tuples. RDFMap aids in efficient management of
schema-data associations, which is important while querying schema-last data models
like RDF. Specifically, we propose the following:
– A TripleGroup data model and an intermediate algebra called Nested TripleGroup
Algebra (NTGA), that leads to efficient representation and manipulation of RDF
graphs.
– A compact data representation format (RDFMap) that supports efficient
TripleGroup-based processing.
– An extension to Pig’s computational infrastructure to support NTGA operators,
and compilation of NTGA logical plans to MapReduce execution plans. Operator
implementation strategies are integrated into Pig to minimize costs involved in RDF
graph processing.
– A comparative performance evaluation of Pig and RAPID+ (Pig extended with
NTGA operators) for graph pattern queries on a benchmark dataset is presented.
This paper is organized as follows: In section 2, we review the basics of RDF graph
pattern matching, and the issues involved in processing such pattern queries in systems
like Pig. We also summarize the optimization strategies presented in our previous work,
which form the basis for the algebra proposed in this paper. In section 3.1, we present the
TripleGroup data model and the supported operations. In 3.2, we discuss the integration
of NTGA operators into Pig. In section 4, we present the evaluation results comparing
the performance of RAPID+ with the existing Pig implementation.
Fig. 1. Example pattern matching query in (a) SPARQL (b) Pig Latin (VP approach)
2 http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/
Graph Pattern Matching in Pig. MapReduce data processing platforms like Pig fo-
cus on ad hoc data processing in the cloud environment where the existence of pre-
processed and suitably organized data cannot be presumed. Therefore, in the context
of RDF graph pattern processing which is done directly from input documents, the VP
approach with smaller relations is more suitable. To capture the VP storage model in
Pig, an input triple relation needs to be “split” into property-based partitions using Pig
Latin’s SPLIT command. Then, the star-structured joins are achieved using an m-way
JOIN operator, and chain joins are executed using the traditional binary JOIN operator.
Fig. 1(b) shows how the graph pattern query in Fig. 1(a) can be expressed and processed
in Pig Latin. Fig. 2 (a) shows the corresponding query plan for the VP approach. We
refer to this sort of query plan as Pig’s approach in the rest of the paper. Alternative
plans may change the order of star-joins based on cost-based optimizations. However,
that issue does not affect our discussion because the approaches compared in this pa-
per all benefit similarly from such optimizations. Pig Latin queries are compiled into
a sequence of Map-Reduce (MR) jobs that run over Hadoop. The Hadoop scheduling
supports partition parallelism such that in every stage, one operator is running on dif-
ferent partitions of data at different nodes. This leads to a linear style physical execution
plan. The above logical query plan will be compiled into a linear execution plan with
a sequence of five MR cycles as shown in Fig. 2 (b). Each join step is executed as a
separate MR job. However, Pig optimizes the multi-way join on the same column, and
compiles it into a single MR cycle.
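As a hedged illustration (the relation, field, and property names below are ours, not those of the benchmark), the vertical partitioning and a star-join over the resulting relations can be written in Pig Latin roughly as follows:

-- load triples and split them into property-based (VP) partitions
triples = LOAD 'input.nt' USING PigStorage(' ') AS (s, p, o);
SPLIT triples INTO price IF p == 'price', validTo IF p == 'validTo',
                   vendor IF p == 'vendor', product IF p == 'product';
-- star-join on the shared subject column; Pig compiles this multi-way
-- join on the same column into a single MR cycle
SJ1 = JOIN price BY s, validTo BY s, vendor BY s, product BY s;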
Issues. (i) Each MR cycle involves communication and I/O costs due to the data transfer
between the Mappers and Reducers. Intermediate results are written to disk by Mappers
after the map phase, which are read by Reducers and processed in the reduce phase after
which the results are written to HDFS (Hadoop Distributed File System). These costs
are summarized in Fig. 2 (b). Using this style of execution where join operations are
executed in different MR cycles, join-intensive tasks like graph pattern matching will
Fig. 2. Pattern Matching using VP approach (a) Query plan (b) Map-Reduce execution flow
result in significant I/O and communication overhead. There are other issues that
contribute to I/O costs, e.g., the SPLIT operator for creating VP relations generates
concurrent sub-flows that compete for memory resources and are prone to disk spills. (ii) In im-
perative languages like Pig Latin, users need to explicitly manipulate the intermediate
results. In schema-last data models like RDF, there is an increased burden due to the
fact that users have to keep track of which columns of data are associated with which
schema items (properties) as well as their corresponding values. For example, for the
computation of join J2, the user needs to specify the join between intermediate relation
J1 on the value of property type “product”, and relation SJ3 on the value of property
type “reviewFor”. It is not straightforward for the user to determine that the value cor-
responding to property “product” is in column 20 of relation J1. In schema-first data
models, users simply reference desired columns by attribute names.
TripleGroup-based Pattern Matching. In our previous work [10], we proposed an
approach to exploit star sub patterns by re-interpreting star-joins using a grouping-based
join algorithm. It can be observed that performing a group by Subject yields groups of
tuples or TripleGroups that represent all the star sub graphs in the database. We can
obtain all these star sub graphs using the relational style GROUP BY which executes
in a single MR cycle, thus minimizing the overall I/O and communication overhead in
RDF graph processing. Additionally, repeated data processing costs can be reduced by
coalescing operators in a manner analogous to “pushing select into cartesian product”
in relational algebra to produce a more efficient operator. The empirical study in our
previous work showed significant savings using this TripleGroup computational style,
suggesting that it was worth further consideration.
In this paper, we present a generalization of this strategy by proposing an interme-
diate algebra based on the notion of TripleGroups. This provides a formal foundation
to develop first-class operators with more precise semantics and to enable tighter
integration into existing systems, supporting automatic optimization opportunities. Additionally,
we propose a more suitable data representation format that aids in efficient and user-
friendly management of intermediate results of operators in this algebra. We also show
how this representation scheme can be used to implement our proposed operators.
3 Foundations
Nested TripleGroup Algebra (NTGA) is based on the notion of the TripleGroup data
model which is formalized as follows:
Definition 1. (TripleGroup) A TripleGroup tg is a relation of triples t1, t2, ..., tk whose
schema is defined as (S, P, O). Further, any two triples ti, tj ∈ tg have overlapping
components, i.e., ti[coli] = tj[colj], where coli, colj refer to the subject or object
component. When all triples agree on their subject (object) values, we call them subject
(object) TripleGroups respectively. Fig. 3 (a) is an example of a subject TripleGroup
which corresponds to a star sub graph. Our data model allows TripleGroups to be nested
at the object component.
(proj) The proj operator extracts from each TripleGroup the required triple com-
ponent from the triple matching the triple pattern. From our example data,
proj?hpage (T G) = {www.vendors.org/V1}.
(filter) The filter operator is used for value-based filtering, i.e., to check whether the
TripleGroups satisfy the given filter condition. For our example data, filterprice>500
(T G) would eliminate the TripleGroup ntg in Fig. 3 (b), since the triple (&Offer1,
price, 108) does not satisfy the filter condition.
(groupfilter) The groupfilter operation is used for structure-based filtering
i.e. to retain only those TripleGroups that satisfy the required query sub structure.
For example, the groupfilter operator can be used to eliminate TripleGroups like
tg in Fig. 3 (a), that are structurally incomplete with respect to the equivalence class
T G{label,country,homepage,mbox} .
(join) The join expression join(?vtpx :T Gx ,?vtpy :T Gy ) computes the join between
a TripleGroup tgx in equivalence class T Gx with a TripleGroup tgy in equivalence class
T Gy based on the given triple patterns. The triple patterns tpx and tpy share a common
variable ?v at O or S component. The result of an object-subject (O-S) join is a nested
TripleGroup in which tgy is nested at the O component of the join triple in tgx . For ex-
ample, Fig. 6 shows the nested TripleGroup resulting from the join operation between
equivalence classes T G{price,validTo,vendor,product} and T G{label,country,homepage}
that join based on triple patterns {?o vendor ?v} and {?v country ?vcountry}
respectively. For object-object (O-O) joins, the join operator computes a TripleGroup
by union of triples in the individual TripleGroups.
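Under the simplifying assumption that a TripleGroup is just a Python list of (subject, property, object) tuples, the value-based and structure-based operators above can be sketched as follows; this is an informal rendering of their semantics, not the actual RAPID+ operators.

def proj(triple_groups, prop):
    # Extract the required component (here: the object) of the triple
    # matching the given property from each TripleGroup.
    return {t[2] for tg in triple_groups for t in tg if t[1] == prop}

def filter_tg(triple_groups, prop, predicate):
    # Value-based filtering: keep TripleGroups whose triple for prop
    # satisfies the predicate.
    return [tg for tg in triple_groups
            if any(t[1] == prop and predicate(t[2]) for t in tg)]

def groupfilter(triple_groups, equivalence_class):
    # Structure-based filtering: keep only TripleGroups that contain
    # every property of the given equivalence class.
    return [tg for tg in triple_groups
            if equivalence_class <= {t[1] for t in tg}]

tgs = [[("&V1", "label", "vendor1"), ("&V1", "country", "US"),
        ("&V1", "homepage", "www.vendors.org/V1")]]
print(proj(tgs, "homepage"))                                      # {'www.vendors.org/V1'}
print(groupfilter(tgs, {"label", "country", "homepage", "mbox"})) # [] since mbox is missing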
Execution plan using NTGA and its mapping to Relational Algebra. TripleGroup-
based pattern matching for a query with n star sub patterns, compiles into a MapRe-
duce flow with n MR cycles as shown in Fig. 7. The same query executes in nearly twice
the number of MR cycles (2n − 1) using the Pig approach. Fig. 7 shows the equivalence
between NTGA and relational algebra operators based on our notion of content equiva-
lence. This mapping suggests rules for lossless transformation between queries written
in relational algebra and NTGA. First, the input triples are loaded and the triples that
are not part of the query pattern are filtered out. Pig load and filter operators are
coalesced into a loadFilter operator to minimize costs of repeated data handling.
The graph patterns are then evaluated using the NTGA operators, (i) star-joins using
Pig’s GROUP BY operator, which is coalesced with the NTGA groupFilter operator
to enable structure-based filtering (represented as StarGroupFilter) and, (ii) chain
joins on TripleGroups using the NTGA join operator (represented as RDFJoin). The
final result can be converted back to n-tuples using the NTGA flatten operator. In
general, TripleGroups resulting from any of the NTGA operations can be mapped to
Pig’s tupled results using the flatten operator. For example, the StarGroupFilter
operation results in a set of TripleGroups. Each TripleGroup can be transformed to an
equivalent n-tuple resulting from relational star-joins SJ1, SJ2, or SJ3.
Implementing NTGA operators using RDFMap. In this section, we show how the
property-based indexing scheme of an RDFMap can be exploited for efficient imple-
mentation of the NTGA operations. We then discuss the integration of NTGA operators
into Pig.
StarGroupFilter. A common theme in our implementation is to coalesce operators
where possible in order to minimize the costs of parameter passing, and context switch-
ing between methods. The starGroupFilter is one such operator, which coalesces
the NTGA groupfilter operator into Pig’s relational GROUP BY operator. Creating
subject TripleGroups using this operator can be expressed as:
T G = StarGroupFilter triples by S
The corresponding map and reduce functions for the StarGroupFilter operator are
shown in Algorithm 1. In the map phase, the tuples are annotated based on the S compo-
nent analogous to the map of a GROUP BY operator. In the reduce function, the different
tuples sharing the same S component are packaged into an RDFMap that corresponds
to a subject TripleGroup. The groupfilter operator is integrated for structure-based
filtering based on the query sub structures (equivalence classes). This is achieved us-
ing global bit patterns (stored as BitSet) that concisely represent the property types
in each equivalence class. As the tuples are processed in the reduce function, the lo-
cal BitSet keeps track of the property types processed (line 6). After processing all
tuples in a group, if the local BitSet (locBitSet) does not match the global BitSet
(ECBitSet), the structure is incomplete and the group of tuples is eliminated (lines 10-
11). Fig. 9 shows the mismatch between the locBitSet and ECBitSet in the sixth po-
sition that represents the missing property “product” belonging to the equivalence class
T G{price,validTo,delivDays,vendor,product}. If the bit patterns match, a labeled RDFMap
is generated (line 13) whose propM ap contains all the (p,o) pairs representing the edges
of the star sub graph. The output of StarGroupFilter is a single relation containing
a list of RDFMaps corresponding to the different star sub graphs in the input data.
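A minimal Python sketch of this reduce-side check (the bit assignment, the dictionary standing in for the RDFMap, and the function name are assumptions made for illustration; the actual operator runs inside Pig's GROUP BY machinery):

EC_PROPS = ["price", "validTo", "delivDays", "vendor", "product"]
BIT = {p: i for i, p in enumerate(EC_PROPS)}   # global bit position per property type
EC_BITSET = (1 << len(EC_PROPS)) - 1           # global bit pattern: all properties required

def reduce_star_group_filter(subject, tuples):
    loc_bitset = 0
    prop_map = {}                              # stand-in for the RDFMap's propMap
    for s, p, o in tuples:
        if p in BIT:
            loc_bitset |= 1 << BIT[p]          # record the property types seen so far
            prop_map[p] = o
    if loc_bitset != EC_BITSET:
        return None                            # structurally incomplete: eliminate the group
    return (subject, prop_map)                 # labeled RDFMap for a complete star sub graph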
4 Evaluation
Our goal was to empirically evaluate the performance of NTGA operators with re-
spect to pattern matching queries involving combinations of star and chain joins. We
compared the performance of RAPID+ with two implementations of Pig, (i) the naive
Pig with the VP storage model, and (ii) an optimized implementation of Pig (Pigopt),
in which we introduced additional project operations to eliminate the redundant join
columns. Our evaluation tasks included, (i) Task1 - Scalability of TripleGroup-based
approach with size of RDF graphs, (ii) Task2 - Scalability of TripleGroup-based pat-
tern matching with denser star patterns, and (iii) Task3 - Scalability of NTGA operators
with increasing cluster sizes.
Table 1. Testbed queries and performance gain of RAPID+ over Pig (10-node cluster / 32GB)
Query  #Triple Patterns  #Edges in Stars  %gain
Q1     3                 1:2              56.8
Q2     4                 2:2              46.7
Q3     5                 2:3              47.8
Q4     6                 3:3              51.6
Q5     7                 3:4              57.4
Q6     8                 4:4              58.4
Q7     9                 5:4              58.6
Q8     10                6:4              57.3
2S1C   6                 2:4              65.4
3S2C   10                2:4:4            61.5
4.1 Setup
Testbed - Dataset and Queries: Synthetic datasets (n-triple format) generated using
the BSBM tool were used. A comparative evaluation was carried out based on size of
data ranging from 8.6GB (approx. 35 million triples) at the lower end, to a data size
of 40GB (approx. 175 million triples). 10 queries (shown in Table 1) adapted from the
BSBM benchmark (Explore use case) with at least a star and chain join were used.
The evaluation tested the effect of query structure on performance with, (i) Q1 to Q8
consisting of two star patterns with varying cardinality, (ii) 2S1C consisting of two star
patterns, a chain join, and a filter component (6 triple patterns), and (iii) 3S2C consisting
of three star patterns, two chain joins, and a filter component (10 triple patterns). Query
details and additional experiment results are available on the project website4 .
Task1: Fig. 11 (a) shows the execution times of the three approaches on a 5-node clus-
ter for 2S1C. For all four data sizes, we see a substantial improvement in
the execution times for RAPID+. The two star patterns in 2S1C are computed in two
separate MR cycles in both the Pig approaches, resulting in the query compiling into
a total of three MR cycles. However, RAPID+ benefits from the grouping-based join al-
gorithm (StarGroupFilter operator) that computes the star patterns in a single MR
cycle, thus saving one MR cycle in total. We also observe cost savings due to the
integration of the loadFilter operator in RAPID+, which coalesces the LOAD and FILTER
phases. As expected, Pigopt performs better than the naive Pig approach due to the
decrease in the size of the intermediate results.
3 https://vcl.ncsu.edu/
4 http://research.csc.ncsu.edu/coul/RAPID/ESWC_exp.htm
Fig. 11. Cost analysis on a 5-node cluster for (a) 2S1C (b) 3S2C
Fig. 11 (b) shows the performance comparison of the three approaches on a 5-node
cluster for 3S2C. This query compiles into three MR cycles in RAPID+ and five MR
cycles in Pig / Pigopt. We see similar results, with RAPID+ outperforming the Pig-based
approaches and achieving up to 60% performance gain with the 32GB dataset. The Pig
based approaches did not complete execution for the input data size of 40GB. We sus-
pect that this was due to the large sizes of intermediate results. In this situation, the
compact representation format offered by the RDFMap proved advantageous to the
RAPID+ approach. In the current implementation, RAPID+ has the overhead that the
computation of the star patterns results in a single relation containing TripleGroups
belonging to different equivalence classes. In our future work, we will investigate tech-
niques for delineating different types of intermediate results.
Task2: Table 1 summarizes the performance of RAPID+ and Pig for star-join queries
with varying edges in each star sub graph. NTGA operators achieve a performance gain
of 47% with Q2 (2:2 cardinality) which increases with denser star patterns, reaching
59% with Q8 (6:4 cardinality). In addition to the savings in MR cycles in RAPID+, this
demonstrates the cost savings due to smaller intermediate relations achieved by elimi-
nating redundant subject values and join triples that are no longer required. Fig. 12 (b)
shows a comparison on a 5-node cluster (20GB data size) with Pigopt, which eliminates
join column redundancy in Pig, similar to RDFMap’s concise representation of subjects
within a TripleGroup. RAPID+ maintains a consistent performance gain of 50% across
the varying density of the two star patterns.
Task3: Fig. 12(a) shows the scalability study of 3S2C on different sized clusters, for
32GB data. RAPID+ starts with a performance gain of about 56% with the 10-node
cluster, but its advantage over Pig and Pigopt decreases with an increasing number of nodes.
Increasing the number of nodes decreases the amount of data processed by each node,
thereby reducing the probability of disk spills with the SPLIT operator in the Pig-based
approaches. However, RAPID+ still consistently outperforms the Pig based approaches
with at least 45% performance gain in all experiments.
Fig. 12. Scalability study for (a) 3S2C varying cluster sizes (b) two stars with varying cardinality
5 Related Work
Data Models and High-Level Languages for Cluster-Based Environments. There has
been a recent proliferation of data flow languages such as Sawzall [11], DryadLINQ [12],
HiveQL [7], and Pig Latin [6] for processing structured data on parallel data processing
systems such as Hadoop. Another such query language, JAQL5, is designed for semi-
structured data analytics and uses the (key, value) JSON model. However, this model
splits RDF sub graphs into different bags, and may not be efficient for executing bushy
plans. Our previous work, RAPID [13], focused on optimizing analytical processing of
RDF data on Pig. RAPID+ [10] extended Pig with UDFs to enable TripleGroup-based
processing. In this work, we provide formal semantics to integrate TripleGroups as
first-class citizens, and present operators for graph pattern matching.
RDF Data Processing on MapReduce Platforms. The MapReduce framework has been
explored for scalable processing of Semantic Web data. For reasoning tasks, specialized
map and reduce functions have been defined based on RDFS rules [3] and the OWL
Horst rules [14], for materializing the closure of RDF graphs. Yet another work [15]
extends Pig by integrating schema-aware RDF data loader and embedding reasoning
support into the existing framework. For scalable pattern matching queries, there have
been MapReduce-based storage and query systems [2],[1] that process RDFMolecules.
Also, [16] uses HadoopDB [17] with a column-oriented database to support a scal-
able Semantic Web application. This framework enables parallel computation of star-
joins if the data is partitioned based on the Subject component. However, graph pattern
queries with multiple star patterns and chain joins may not benefit much. Another re-
cent framework [4] pre-processes RDF triples to enable efficient querying of billions of
triples over HDFS. We focus on ad hoc processing of RDF graphs that cannot presume
pre-processed or indexed data.
Optimizing Multi-way Joins. RDF graph pattern matching typically involves sev-
eral join operations. There have been optimization techniques [9] to re-write SPARQL
queries into small-sized star-shaped groups and generate bushy plans using two phys-
ical join operators called njoin and gjoin. It is similar in spirit to the work presented
5 http://code.google.com/p/jaql
here since both exploit star-shaped sub patterns. However, our work focuses on parallel
platforms and uses a grouping-based algorithm to evaluate star-joins. There has been
work on optimizing m-way joins on structured relations like slice join [18]. However,
we focus on joins involving RDF triples for semi-structured data. Another approach
[19] efficiently partitions and replicates tuples across reducer processes in a way that
minimizes the communication cost. This is complementary to our approach, and the
partitioning schemes could further improve the performance of join operations. [20] in-
vestigates several join algorithms that leverage pre-processing techniques on Hadoop,
but mainly focuses on log processing. RDFBroker [21] is an RDF store that is based on the
concept of a signature (set of properties of a resource), similar to NTGA’s structure-
labeling function λ. However, the focus of [21] is to provide a natural way to map RDF
data to database tables, without presuming schema knowledge. Pregel [22] and Sig-
nal/Collect [23] provide graph-oriented primitives as opposed to relational algebra type
operators, and also target parallel platforms. The latter is still in a preliminary stage and
has not completely demonstrated its advantages across parallel platforms.
6 Conclusion
In this paper, we presented an intermediate algebra (NTGA) that enables more natural
and efficient processing for graph pattern queries on RDF data. We proposed a new data
representation format (RDFMap) that supports NTGA operations in a more efficient
manner. We integrated these NTGA operators into Pig, and presented a comparative
performance evaluation with the existing Pig implementation. For certain classes of
queries, we saw a performance gain of up to 60%. However, there might be certain
scenarios in which it may be preferable not to compute all star patterns. In such cases,
we need a hybrid approach that utilizes cost-based optimization techniques to determine
when the NTGA approach is the best. We will also investigate a more efficient method
for dealing with heterogeneous TripleGroups resulting from join operations.
References
1. Newman, A., Li, Y.F., Hunter, J.: Scalable Semantics: The Silver Lining of Cloud Computing.
In: IEEE International Conference on eScience (2008)
2. Newman, A., Hunter, J., Li, Y., Bouton, C., Davis, M.: A Scale-Out Rdf Molecule Store
for Distributed Processing of Biomedical Data. In: Semantic Web for Health Care and Life
Sciences Workshop (2008)
3. Urbani, J., Kotoulas, S., Oren, E., van Harmelen, F.: Scalable Distributed Reasoning Using
MapReduce. In: Bernstein, A., Karger, D.R., Heath, T., Feigenbaum, L., Maynard, D., Motta,
E., Thirunarayan, K. (eds.) ISWC 2009. LNCS, vol. 5823, pp. 634–649. Springer, Heidelberg
(2009)
4. Husain, M., Khan, L., Kantarcioglu, M., Thuraisingham, B.: Data Intensive Query Processing
for Large Rdf Graphs Using Cloud Computing Tools. In: IEEE International Conference on
Cloud Computing, CLOUD (2010)
5. Dean, J., Ghemawat, S.: Simplified Data Processing on Large Clusters. ACM Commun. 51,
107–113 (2008)
6. Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig Latin: A Not-So-Foreign
Language for Data Processing. In: Proc. International Conference on Management of data
(2008)
7. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P.,
Murthy, R.: Hive: A Warehousing Solution over a Map-Reduce Framework. Proc. VLDB
Endow. 2, 1626–1629 (2009)
8. Neumann, T., Weikum, G.: The Rdf-3X Engine for Scalable Management of Rdf Data. The
VLDB Journal 19, 91–113 (2010)
9. Vidal, M.-E., Ruckhaus, E., Lampo, T., Martı́nez, A., Sierra, J., Polleres, A.: Efficiently Join-
ing Group Patterns in SPARQL Queries. In: Aroyo, L., Antoniou, G., Hyvönen, E., ten Teije,
A., Stuckenschmidt, H., Cabral, L., Tudorache, T. (eds.) ESWC 2010. LNCS, vol. 6088, pp.
228–242. Springer, Heidelberg (2010)
10. Ravindra, P., Deshpande, V.V., Anyanwu, K.: Towards Scalable Rdf Graph Analytics on
Mapreduce. In: Proc. Workshop on Massive Data Analytics on the Cloud (2010)
11. Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the Data: Parallel Analysis
with Sawzall. Sci. Program. 13, 277–298 (2005)
12. Yu, Y., Isard, M., Fetterly, D., Budiu, M., Erlingsson, U., Gunda, P.K., Currey, J.: Dryadlinq:
A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level
Language. In: Proc. USENIX Conference on Operating Systems Design and Implementa-
tion (2008)
13. Sridhar, R., Ravindra, P., Anyanwu, K.: RAPID: Enabling Scalable Ad-Hoc Analytics on
the Semantic Web. In: Bernstein, A., Karger, D.R., Heath, T., Feigenbaum, L., Maynard, D.,
Motta, E., Thirunarayan, K. (eds.) ISWC 2009. LNCS, vol. 5823, pp. 715–730. Springer,
Heidelberg (2009)
14. Urbani, J., Kotoulas, S., Maassen, J., van Harmelen, F., Bal, H.: OWL Reasoning with
Webpie: Calculating the Closure of 100 Billion Triples. In: Aroyo, L., Antoniou, G.,
Hyvönen, E., ten Teije, A., Stuckenschmidt, H., Cabral, L., Tudorache, T. (eds.) ESWC 2010.
LNCS, vol. 6088, pp. 213–227. Springer, Heidelberg (2010)
15. Tanimura, Y., Matono, A., Lynden, S., Kojima, I.: Extensions to the Pig Data Processing
Platform for Scalable Rdf Data Processing using Hadoop. In: IEEE International Conference
on Data Engineering Workshops (2010)
16. Abouzied, A., Bajda-Pawlikowski, K., Huang, J., Abadi, D.J., Silberschatz, A.: Hadoopdb in
Action: Building Real World Applications. In: Proc. International Conference on Manage-
ment of data (2010)
17. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A., Rasin, A.: Hadoopdb: an
Architectural Hybrid of Mapreduce and Dbms Technologies for Analytical Workloads. Proc.
VLDB Endow. 2, 922–933 (2009)
18. Lawrence, R.: Using Slice Join for Efficient Evaluation of Multi-Way Joins. Data Knowl.
Eng. 67, 118–139 (2008)
19. Afrati, F.N., Ullman, J.D.: Optimizing Joins in a Map-Reduce Environment. In: Proc. Inter-
national Conference on Extending Database Technology (2010)
20. Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E.J., Tian, Y.: A Comparison of Join
Algorithms for Log Processing in Mapreduce. In: Proc. International Conference on Man-
agement of data (2010)
21. Sintek, M., Kiesel, M.: RDFBroker: A Signature-Based High-Performance RDF Store. In:
Sure, Y., Domingue, J. (eds.) ESWC 2006. LNCS, vol. 4011, pp. 363–377. Springer, Heidel-
berg (2006)
22. Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.:
Pregel: A System for Large-Scale Graph Processing. In: Proc. International Conference on
Management of data (2010)
23. Stutz, P., Bernstein, A., Cohen, W.: Signal/Collect: Graph Algorithms for the (Semantic)
Web. In: Patel-Schneider, P.F., Pan, Y., Hitzler, P., Mika, P., Zhang, L., Pan, J.Z., Horrocks,
I., Glimm, B. (eds.) ISWC 2010, Part I. LNCS, vol. 6496, pp. 764–780. Springer, Heidelberg
(2010)
Query Relaxation for Entity-Relationship Search
1 Introduction
1.1 Motivation
There is a trend towards viewing Web or digital-library information in an entity-centric
manner: what is the relevant information about a given sports club, a movie star, a politi-
cian, a company, a city, a poem, etc. Moreover, when querying the Web, news, or blogs,
we like the search results to be organized on a per-entity basis. Prominent examples of
this kind of search are entitycube.research.microsoft.com or google.com/squared/. Ad-
ditionally, services that contribute towards more semantic search are large knowledge
repositories, including both handcrafted ones such as freebase.com as well as automat-
ically constructed ones such as trueknowledge.com or dbpedia.org. These have been
enabled by knowledge-sharing communities such as Wikipedia and by advances in in-
formation extraction (e.g., [2, 6, 17, 23, 20]).
One way of representing entity-centric information, along with structured relation-
ships between entities, is the Semantic-Web data model RDF. An RDF collection con-
sists of a set of subject-property-object (SPO) triples. Each triple is a pair of entities
with a named relationship. A small example about books is shown in Table 1.
RDF data of this kind can be queried using a conjunction of triple patterns – the
core of SPARQL – where a triple pattern is a triple with variables and the same variable
in different patterns denotes a join condition. For example, searching for Pulitzer-prize
winning science fiction authors from the USA could be phrased as:
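With relation names that are purely illustrative (the exact vocabulary depends on the underlying knowledge base), such a query could take the form:
?a type Science Fiction Writer; ?a hasWonPrize Pulitzer Prize; ?a bornIn USA; ?b hasAuthor ?a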
This query contains a conjunction (denoted by “;”) of four triple patterns where ?a
and ?b denote variables that should match authors and their books respectively.
While the use of triple patterns enables users to formulate their queries in a precise
manner, it is possible that the queries are overly constrained and lead to unsatisfactory
recall. For example, this query would return very few results even on large book collec-
tions, and only one - Carl Sagan - for our example data. However, if the system were
able to automatically reformulate one or more conditions in the query, say, replacing
bornIn with citizenOf, the system would potentially return a larger number of results.
This paper addresses the query relaxation problem: automatically broadening or refor-
mulating triple-pattern queries to retrieve more results without unduly sacrificing preci-
sion. We can view this problem as the entity-relationship-oriented counterpart of query
expansion in the traditional keyword-search setting. Automatically expanding queries
in a robust way so that they would not suffer from topic drifts (e.g., overly broad gener-
alizations) is a difficult problem [3].
The problem of query relaxation for triple-pattern queries has been considered in lim-
ited form in [22, 11, 7, 10] and our previous work [8]. Each of these prior approaches
focused on very specific aspects of the problem, and only two of them [22, 8] con-
ducted experimental studies on the effectiveness of their proposals. These techniques
are discussed in more detail in Sect. 5.
2 Relaxation Framework
We start by describing the basic setting and some of the terminology used in the rest of
this paper.
Knowledge Base. A knowledge base KB of entities and relations is a set of triples,
where a triple is of the form (e1, r, e2) with entities e1, e2 and relation r (or s, p, o, for subject, property, and object).
1 Note that, even though we refer only to entities in the following, the same applies to relations as well.
We now need to define the set of terms over which the LM is estimated. We de-
fine two kinds of terms: i) “unigrams” U , corresponding to all entities in KB, and, ii)
“bigrams” B, corresponding to all entity-relation pairs. That is,
U = {e : (e r o) ∈ KB ∨ (s r e) ∈ KB}
B = {(e, r) : (e r o) ∈ KB} ∪ {(r, e) : (s r e) ∈ KB}
Example: The entity Woody Allen would have a document consisting of triples
Woody Allen directed Manhattan, Woody Allen directed Match Point, Woody Allen acte-
dIn Scoop, Woody Allen type Director, Federico Fellini influences Woody Allen, etc.
The terms in the document would include Scoop, Match Point, (type,Director), (Fed-
erico Fellini,influences), etc.
Note that a bigram usually occurs exactly once per entity, but it is still important
to capture this information. When we compare the LMs of two entities, we would like
identical relationships to be recognized. For example, if a given entity has the
bigram (hasWonAward, Academy Award), we can then distinguish between a candidate
entity that has the term (hasWonAward, Academy Award) and one that has the term
(nominatedFor, Academy Award). This distinction cannot be made if only unigrams are considered.
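A small Python sketch of this term construction, with the knowledge base represented simply as a list of (subject, relation, object) tuples (an assumption of the sketch):

KB = [
    ("Woody Allen", "directed", "Manhattan"),
    ("Woody Allen", "directed", "Match Point"),
    ("Woody Allen", "actedIn", "Scoop"),
    ("Woody Allen", "type", "Director"),
    ("Federico Fellini", "influences", "Woody Allen"),
]

def document(entity, kb):
    # D(E): all triples in which the entity occurs.
    return [(s, r, o) for (s, r, o) in kb if entity in (s, o)]

def terms(entity, kb):
    unigrams, bigrams = [], []
    for s, r, o in document(entity, kb):
        if s == entity:
            unigrams.append(o)
            bigrams.append((r, o))   # the other entity occurs as object: (relation, entity)
        else:
            unigrams.append(s)
            bigrams.append((s, r))   # the other entity occurs as subject: (entity, relation)
    return unigrams, bigrams

# terms("Woody Allen", KB) yields unigrams such as Scoop and Director, and
# bigrams such as ("type", "Director") and ("Federico Fellini", "influences").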
Estimating the LM. The LM corresponding to document D(E) is now a mixture model
of two LMs: PU, corresponding to the unigram LM, and PB, the bigram LM. That is,
PE(w) = μ PU(w) + (1 − μ) PB(w)
where μ controls the influence of each component. The unigram and bigram LMs are
estimated in the standard way with linear interpolation smoothing from the corpus. That
is,
PU(w) = α · c(w; D(E)) / Σ_{w′∈U} c(w′; D(E)) + (1 − α) · c(w; D(KB)) / Σ_{w′∈U} c(w′; D(KB))
where w ∈ U , c(w; D(E)) and c(w; D(KB)) are the frequencies of occurrences of w
in D(E) and D(KB) respectively and α is the smoothing parameter. The bigram LM is
estimated in an analogous manner.
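For concreteness, the smoothed unigram estimate can be expressed in a few lines of Python (the count dictionaries and the α value are assumptions of the sketch):

def p_unigram(w, counts_doc, counts_kb, alpha=0.5):
    # P_U(w) = alpha * relative frequency of w in D(E)
    #        + (1 - alpha) * relative frequency of w in D(KB)
    p_doc = counts_doc.get(w, 0) / max(1, sum(counts_doc.values()))
    p_kb = counts_kb.get(w, 0) / max(1, sum(counts_kb.values()))
    return alpha * p_doc + (1 - alpha) * p_kb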
Documents and LMs for relations. Let R be the relation of interest and let D(R) be
its document, which is constructed as the set of all triples in which R occurs. That is,
D(R) = {(s r o) ∈ KB : r = R}.
As with the case of entities, we again define two kinds of terms – “unigrams” and
“bigrams”. Unigrams correspond to the set of all entities in KB. But, we make a distinc-
tion here between entities that occur as subjects and those that occur as objects, since
the relation is directional (note that there could be entities that occur as both). That is,
S = {s : (s r o) ∈ KB}
O = {o : (s r o) ∈ KB}
B = {(s, o) : (s r o) ∈ KB}
Example: Given the relation directed, D(directed) would consist of all triples
containing that relation, including, James Cameron directed Aliens, Woody Allen di-
rected Manhattan, Woody Allen directed Match Point, Sam Mendes directed Ameri-
can Beauty, etc. The terms in the document would include James Cameron, Manhattan,
Woody Allen, (James Cameron, Aliens), (Sam Mendes, American Beauty), etc.
Estimating the LM. The LM of D(R) is a mixture model of three LMs: PS , corre-
sponding to the unigram LM of terms in S, PO , corresponding to the unigram LM of
terms in O and PB , corresponding to the bigram LM. That is,
PR (w) = μs PS (w) + μo PO (w) + (1 − μs − μo )PB (w)
where μs , μo control the influence of each component. The unigram and bigram LMs
are estimated in the standard way with linear interpolation smoothing from the corpus.
That is,
PS(w) = α · c(w; D(R)) / Σ_{w′∈S} c(w′; D(R)) + (1 − α) · c(w; D(KB)) / Σ_{w′∈S} c(w′; D(KB))
where w ∈ S, c(w; D(R)) and c(w; D(KB)) are the frequencies of occurrences of w in
D(R) and D(KB) respectively, and α is a smoothing parameter. The other unigram LM
and the bigram LM are estimated in an analogous manner.
Generating the Candidate List of Relaxations. As previously mentioned, we make
use of the square root of the JS-divergence as the similarity score between two entities
(or relations). Given probability distributions P and Q, the JS-divergence between them
is defined as follows,
JS(P ||Q) = KL(P ||M ) + KL(Q||M )
where, given two probability distributions R and S, the KL-divergence is defined as,
KL(R||S) = Σ_j R(j) log ( R(j) / S(j) )
and
M = (P + Q) / 2
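A direct Python transcription of these formulas, with distributions represented as dictionaries over a shared vocabulary (following the convention above, the two KL terms are not halved):

from math import log, sqrt

def kl(r, s):
    return sum(rj * log(rj / s[j]) for j, rj in r.items() if rj > 0)

def js(p, q):
    m = {j: 0.5 * (p.get(j, 0.0) + q.get(j, 0.0)) for j in set(p) | set(q)}
    return kl(p, m) + kl(q, m)

def relaxation_score(p, q):
    # Square root of the JS-divergence, used to rank relaxation candidates;
    # smaller values indicate closer relaxations.
    return sqrt(js(p, q))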
2.3 Examples
Table 2 shows example entities and relations from the IMDB and LibraryThing datasets
and their top-5 relaxations derived from these datasets, using the techniques described
above. The entry var represents the variable candidate. As previously explained, a vari-
able substitution indicates that there were no other specific candidates which had a high
similarity to the given entity or relation. For example, the commentedOn relation has
only one specific candidate relaxation above the variable relaxation – hasFriend. Note
that the two relations are relations between people - a person X could comment on
something a person Y wrote, or a person X could have a friend Y - whereas the remain-
ing relations are not relations between people. When generating relaxed queries using
these individual relaxations, we ignore all candidates which occur after the variable.
The process of generating relaxed queries will be explained in Sect. 3.
Table 2. Example entities and relations from LibraryThing and IMDB with their top-5 relaxations

LibraryThing                               IMDB
Egypt           Non-fiction                Academy Award for Best Actor                          Thriller
Ancient Egypt   Politics                   BAFTA Award for Best Actor                            Crime
Mummies         American History           Golden Globe Award for Best Actor Drama               Horror
Egyptian        Sociology                  var                                                   Action
Cairo           Essays                     Golden Globe Award for Best Actor Musical or Comedy   Mystery
Egyptology      History                    New York Film Critics Circle Award for Best Actor     var

wrote                 commentedOn          directed                                              bornIn
hasBook               hasFriend            actedIn                                               livesIn
hasTagged             var                  created                                               originatesFrom
var                   hasBook              produced                                              var
hasTag                hasTag               var                                                   diedIn
hasLibraryThingPage   hasTagged            type                                                  isCitizenOf
Table 3. Top-3 relaxation lists for the triple patterns for an example query
Table 4. Top-5 relaxed queries for an example query. The relaxed entities/relations are underlined.
Now, we describe how results of the original and relaxed queries can be merged and
ranked before presenting them to the user.
To be able to rank the results of the original and relaxed queries, we assume that a
result T matching query Qj is scored using some score function f . The scoring function
can be any ranking function for structured triple-pattern queries over RDF-data. In this
paper, we make use of the language-model based ranking function described in [8]. Let
the score of each result T with respect to query Qj be f (T, Qj ) and let the score of each
relaxed query Qj be sj where the score of a relaxed query is computed as described in
the previous subsection. In order to merge the results of the original and relaxed queries
into a unified result set, we utilize the following scoring function for computing the
score of a result T :
S(T) = Σ_{j=0}^{r} λ_j f(T, Q_j)
We next describe two techniques to set the values of the different λ’s.
Adaptive Weighting. In this weighting scheme, we assume that the user is interested
in seeing the results in a holistic manner. That is, a match to a lower ranked (relaxed)
query can appear before a match to a higher ranked query. For example, consider the
query Q0 = ?m hasGenre Action and a relaxation Qj = ?m hasGenre Thriller. The
assumption now is that the user would rather see a “famous” movie of genre thriller,
rather than an obscure movie of genre action. And so, a “mixing” of results is allowed.
To this end, we set the λj ’s as a function of the scores of the relaxed queries sj ’s as
follows:
λ_j = (1 − s_j) / Σ_{i=0}^{r} (1 − s_i)
Recall that the smaller the sj is, the closer Qj is to the original query Q0 . Also
recall that s0 is equal to 0. This weighting scheme basically gives higher weights to the
matches to relaxed queries which are closer to the original query. However, matches for
a lower ranked query with sufficiently high scores can be ranked above matches for a
higher ranked query.
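The adaptive weights and the merged score can be computed directly from the relaxed-query scores, as in this Python sketch (f stands for whatever triple-pattern ranking function is plugged in):

def adaptive_weights(scores):
    # scores[0] = 0 for the original query; smaller scores mean closer relaxations.
    total = sum(1 - s for s in scores)
    return [(1 - s) / total for s in scores]

def merged_score(result, queries, scores, f):
    lambdas = adaptive_weights(scores)
    return sum(lam * f(result, q) for lam, q in zip(lambdas, queries))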
Incremental Weighting. In this weighting scheme, we assume that the user is interested
in seeing results in order. That is, all ranked matches of the original query first, followed
by all ranked matches of the first relaxed query, then those of the second relaxed query,
etc. That is, the results are presented “block-wise”.
In order to do this, we need to set the λj ’s by examining the scores of the highest
scoring and lowest scoring result to a given query. For example, consider our example
query : ?m hasGenre Action. Suppose a relaxation to this query is ?x hasGenre Thriller.
If we want to ensure that all matches of the original query are displayed before the first
match of the relaxed query, we first examine the result with the lowest score for the
original query and the highest score for the relaxed query. Let these results be T_0^low and
T_1^high, respectively. We now need to ensure that λ_0 · f(T_0^low, Q_0) > λ_1 · f(T_1^high, Q_1).
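One way to realize this block-wise ordering is to derive the λ values backwards from the per-query score extremes, as in the following sketch; it assumes strictly positive scores and that the extreme scores of each block are known, and the margin factor merely enforces the strict inequality.

def incremental_weights(low, high, margin=1.01):
    # low[j]  = score of the lowest-ranked match of query Q_j
    # high[j] = score of the highest-ranked match of query Q_j
    r = len(low) - 1
    lam = [0.0] * (r + 1)
    lam[r] = 1.0
    for j in range(r - 1, -1, -1):
        # Ensure lam[j] * low[j] > lam[j + 1] * high[j + 1], so that every match
        # of Q_j is ranked above every match of Q_{j+1}.
        lam[j] = margin * lam[j + 1] * high[j + 1] / low[j]
    return lam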
Note that both incremental as well as adaptive weighting are only two ways in which
we can present results to the user. Additional schemes can include a mixture of both
schemes for instance, or any other variations. Our ranking model is general enough and
can support any number of such fine-grained result presentation schemes.
4 Experimental Evaluation
We evaluated the effectiveness of our relaxation techniques in two experiments. The first
one evaluated the quality of individual entity and relation relaxations. It also evaluated
the quality of the relaxed queries overall. The second experiment evaluated the quality
of the final query results obtained from both original and relaxed queries. The complete
set of evaluation queries used, relevance assessments collected and an online demo can
be found at http://www.mpii.de/~elbass/demo/demo.html.
4.1 Setup
All experiments were conducted over two datasets using the Amazon Mechanical Turk
service2. The first dataset was derived from the LibraryThing community, which is an
online catalog and forum about books. The second dataset was derived from a subset
of the Internet Movie Database (IMDB). The data from both sources was automatically
parsed and converted into RDF triples. Overall, the number of unique entities was over
48,000 for LibraryThing and 59,000 for IMDB. The number of triples was over 700,000
and 600,000 for LibraryThing and IMDB, respectively.
Due to the lack of an RDF query benchmark, we constructed 40 evaluation queries
for each dataset and converted them into structured triple-pattern queries. The number
of triple patterns in the constructed queries ranged from 1 to 4. Some example queries
include: “People who both acted as well as directed an Academy Award winning movie”
(3 triple patterns), “Children’s book writers who have won the Booker prize” (3 triple
patterns), etc.
correlation between the average rating and the JS-divergence. We achieved a strong
negative correlation for all relaxations which shows that the smaller the score of the
relaxation (closer the relaxation is to the original), the higher the rating assigned by the
evaluators. The fourth row shows the average rating for the top relaxation.
The fifth and sixth rows in Table 5 show the average rating for relaxations that ranked
above and below the variable relaxation respectively. Recall that, for each entity or
relation, a possible entry in the relaxation candidate-list is a variable as described in
Section 2. For those relaxations that ranked above a variable (i.e., whose JS-divergence
is less than that of a variable), the average rating was more than 1.29 for both entities
and relations, indicating how close these relaxations are to the original entity or relation.
For those relaxations that ranked below a variable, the average rating was less than 1.1
for entities and 0.8 for relations. This shows that the evaluators, in effect, agreed with
the ranking of the variable relaxation.
Table 5. Results for entity, relation and query relaxations
Table 6. Average NDCG and rating for all evaluation queries for both datasets
Table 7. Top-ranked results for the example query "A science-fiction book that has tag Film"
Result Rating
Q: ?b type Science Fiction; ?b hasTag Film
Adaptive
Star Trek Insurrection type Science Fiction; Star Trek Insurrection hasTag Film 2.50
Blade type Science Fiction; Blade hasTag Movies 2.83
Star Wars type Science Fiction; Star Wars hasTag Made Into Movie 2.00
Incremental
Star Trek Insurrection type Science Fiction; Star Trek Insurrection hasTag Film 2.50
The Last Unicorn type Science Fiction; The Last Unicorn hasTag Movie/tv 2.50
The Mists of Avalon type Science Fiction; The Mists of Avalon hasTag Movie/tv 2.17
Baseline
Star Trek Insurrection type Science Fiction; Star Trek Insurrection hasTag Film 2.50
Helter Skelter type History; Helter Skelter hasTag Film 0.83
Fear & Loathing in Vegas type History; Fear & Loathing in Vegas hasTag Film 1.83
5 Related Work
One of the problems addressed in this paper is that of relaxing entities and relations with
similar ones. This is somewhat related to both record linkage [14], and ontology match-
ing [19]. But a key difference is that we are merely trying to find candidates which are
close in spirit to an entity or relation, and not trying to solve the entity disambiguation
problem. Other kinds of reformulations such as spelling correction, etc. directly benefit
from techniques for record linkage, but are beyond the scope of our work.
Query reformulation in general has been studied in other contexts such as keyword
queries [5] (more generally called query expansion), XML [1, 13], SQL [4, 22] as well
as RDF [11, 7, 10]. Our setting of RDF and triple patterns is different in being schema-
less (as opposed to relational data) and graph-structured (as opposed to XML which is
mainly tree-structured and supports navigational predicates).
For RDF triple-pattern queries, relaxation has been addressed to some extent in [22,
11, 7, 10, 8]. This prior work can be classified based on several criteria as described
below. Note that except for [22] and our previous work in [8], none of the other papers
report on experimental studies.
Scope of Relaxations. With the exception of [7, 11], the types of relaxations considered
in previous papers are limited. For example, [22] considers relaxations of relations only,
while [10, 8] consider both entity and relation relaxations. The work in [8], in particular,
considers a very limited form of relaxation – replacing entities or relations specified in
the triple patterns with variables. Our approach, on the other hand, considers a compre-
hensive set of relaxations and in contrast to most other previous approaches, weights
the relaxed query in terms of the quality of the relaxation, rather than the number of
relaxations that the query contains.
Relaxation Framework. While each of the proposals mentioned generates multiple
relaxed query candidates, the methods by which they do so differ. While [7, 11, 10]
make use of rule-based rewriting, the work in [22] and our own work make use of the
data itself to determine appropriate relaxation candidates. Note that rule-based rewriting
requires human input, while our approach is completely automatic.
Result Ranking. Our approach towards result ranking is the only one that takes a holis-
tic view of both the original and relaxed query results. This allows us to rank results
based on both the relevance of the result itself, as well as the closeness of the relaxed
query to the original query. The “block-wise” ranking adopted by previous work – that
is, results for the original query are listed first, followed by results of the first relaxation
and so on – is only one strategy for ranking, among others, that can be supported by our
ranking model.
6 Conclusion
We proposed a comprehensive and extensible framework for query relaxation for entity-
relationship search. Our framework makes use of language models as its foundation and
can incorporate a variety of information sources on entities and relations. We showed
how to use an RDF knowledge base to generate high quality relaxations. Furthermore,
we showed how different weighting schemes can be used to rank results. Finally, we
showed the effectiveness of our techniques through a comprehensive user evaluation.
We believe that our contributions are of great importance for an extended-SPARQL API
that could underlie the emerging “Web-of-Data” applications such as Linking-Open-
Data across heterogeneous RDF sources.
References
[1] Amer-Yahia, S., Lakshmanan, L.V.S., Pandit, S.: Flexpath: Flexible structure and full-text
querying for xml. In: SIGMOD (2004)
[2] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.G.: DBpedia: A nu-
cleus for a web of open data. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I.,
Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-
Mauroux, P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 722–735. Springer,
Heidelberg (2007)
[3] Billerbeck, B., Zobel, J.: When query expansion fails. In: SIGIR (2003)
[4] Chaudhuri, S., Das, G., Hristidis, V., Weikum, G.: Probabilistic information retrieval ap-
proach for ranking of database query results. ACM Trans. on Database Syst. 31(3) (2006)
[5] Croft, B., Metzler, D., Strohman, T.: Search Engines: Information Retrieval in Practice.
Pearson Education, London (2009)
[6] Doan, A., Gravano, L., Ramakrishnan, R., Vaithyanathan, S. (eds.): Special issue on man-
aging information extraction. ACM SIGMOD Record 37(4) (2008)
[7] Dolog, P., Stuckenschmidt, H., Wache, H., Diederich, J.: Relaxing rdf queries based on user
and domain preferences. Journal of Intell. Inf. Sys. (2008)
[8] Elbassuoni, S., Ramanath, M., Schenkel, R., Sydow, M., Weikum, G.: Language-model-
based ranking for queries on RDF-graphs. In: CIKM (2009)
[9] Fang, H., Zhai, C.: Probabilistic models for expert finding. In: Amati, G., Carpineto, C.,
Romano, G. (eds.) ECiR 2007. LNCS, vol. 4425, pp. 418–430. Springer, Heidelberg (2007)
[10] Huang, H., Liu, C., Zhou, X.: Computing relaxed answers on RDF databases. In: Bailey, J.,
Maier, D., Schewe, K.-D., Thalheim, B., Wang, X.S. (eds.) WISE 2008. LNCS, vol. 5175,
pp. 163–175. Springer, Heidelberg (2008)
[11] Hurtado, C., Poulovassilis, A., Wood, P.: Query relaxation in rdf. Journal on Data Semantics
(2008)
[12] Järvelin, K., Kekäläinen, J.: Ir evaluation methods for retrieving highly relevant documents.
In: SIGIR (2000)
[13] Lee, D.: Query Relaxation for XML Model. Ph.D. thesis, UCLA (2002)
[14] Naumann, F., Herschel, M.: An Introduction to Duplicate Detection. Morgan & Claypool,
San Francisco (2010)
[15] Nie, Z., Ma, Y., Shi, S., Wen, J.-R., Ma, W.Y.: Web object retrieval. In: WWW (2007)
[16] Petkova, D., Croft, W.: Hierarchical language models for expert finding in enterprise cor-
pora. Int. J. on AI Tools 17(1) (2008)
[17] Sarawagi, S.: Information extraction. Foundations and Trends in Databases 2(1) (2008)
[18] Serdyukov, P., Hiemstra, D.: Modeling documents as mixtures of persons for expert finding.
In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008.
LNCS, vol. 4956, pp. 309–320. Springer, Heidelberg (2008)
[19] Staab, S., Studer, R.: Handbook on Ontologies (International Handbooks on Information
Systems). Springer, Heidelberg (2004)
[20] Suchanek, F., Sozio, M., Weikum, G.: SOFIE: A self-organizing framework for information
extraction. In: WWW (2009)
[21] Vallet, D., Zaragoza, H.: Inferring the most important types of a query: a semantic approach.
In: SIGIR (2008)
[22] Zhou, X., Gaugaz, J., Balke, W.T., Nejdl, W.: Query relaxation using malleable schemas.
In: SIGMOD (2007)
[23] Zhu, J., Nie, Z., Liu, X., Zhang, B., Wen, J.R.: Statsnowball: a statistical approach to ex-
tracting entity relationships. In: WWW (2009)
Optimizing Query Shortcuts in RDF Databases
Abstract. The emergence of the Semantic Web has led to the creation of large
semantic knowledge bases, often in the form of RDF databases. Improving the
performance of RDF databases necessitates the development of specialized data
management techniques, such as the use of shortcuts in the place of path queries.
In this paper we deal with the problem of selecting the most beneficial shortcuts
that reduce the execution cost of path queries in RDF databases given a space
constraint. We first demonstrate that this problem is an instance of the quadratic
knapsack problem. Given the computational complexity of solving such prob-
lems, we then develop an alternative formulation based on a bi-criterion linear
relaxation, which essentially seeks to minimize a weighted sum of the query cost
and of the required space consumption. As we demonstrate in this paper, this
relaxation leads to very efficient classes of linear programming solutions. We
utilize this bi-criterion linear relaxation in an algorithm that selects a subset of
shortcuts to materialize. This shortcut selection algorithm is extensively evalu-
ated and compared with a greedy algorithm that we developed in prior work. The
reported experiments show that the linear relaxation algorithm manages to sig-
nificantly reduce the query execution times, while also outperforming the greedy
solution.
1 Introduction
The Semantic Web involves, among other things, the development of semantic reposito-
ries in which structured data is expressed in RDF(S) or OWL. The structure of this data -
and of its underlying ontologies - is commonly seen as a directed graph, with nodes
representing concepts and edges representing relationships between concepts. A basic
issue that semantic repositories need to address is the formulation of ontology-based
queries, often by repeatedly traversing particular paths [11] of large data graphs. Re-
gardless of the specific model used to store these data graphs (RDF/OWL files, relational
databases, etc.), path expressions require substantial processing; for instance, when us-
ing relational databases, multiple join expressions are often involved [3,20,22,28].
By analogy to materialized views in relational databases, a shortcut construct can be
used in RDF repositories to achieve better performance in formulating and executing
frequent path queries. In this paper we elaborate on the creation of shortcuts that corre-
spond to frequently accessed paths by augmenting the schema and the data graph of an
RDF repository with additional triples, effectively substituting the execution of the corresponding path queries.
Fig. 1. Example schema graph: Institute funds Author; Author writes Paper; Paper acceptedBy Conference; Conference hasTitle Title; Conference takesPlace City
Considering all possible shortcuts in a large RDF repository
gives rise to an optimization problem in which we seek to select shortcuts that maximize
the reduction of query processing cost subject to a given space allocation for storing the
shortcuts. For this problem, which turns out to be a knapsack problem, known to be
NP-hard, we have previously developed a greedy algorithm [13] which, however, tends
to waste space for particular types of query workloads, especially correlated ones.
We thus consider an alternative formulation in
this paper, which leads to the development of a more efficient linear algorithm.
The contributions of this work are: (i) we elaborate on the notion of shortcuts and
the decomposition of a set of user queries that describe popular user requests into a
set of simple path expressions in the RDF schema graph of an RDF repository, which
are then combined in order to form a set of candidate shortcuts that can help during
the evaluation of the original queries; (ii) we formally define the shortcut selection
problem as one of maximizing the expected benefit of reducing query processing cost
by materializing shortcuts in the knowledge base, under a given space constraint; (iii)
we provide an alternative formulation of the shortcut selection problem that trades off
the benefit of a shortcut with the space required for storing its instances in the RDF
database: the constraint matrix of the resulting bi-criterion optimization problem enjoys
the property of total unimodularity, by virtue of which we obtain very efficient classes
of linear programming solutions; and (iv) through extensive experimental evaluation we
demonstrate that the linear algorithm outperforms our greedy solution in most cases.
2 Problem Formulation
In this section we provide the underlying concepts that are required in order to formulate
our problem. Sections 2.1 and 2.2 describe these preliminary concepts, which have been
introduced in [13].
Fig. 3. Candidate shortcuts sh1 through sh6 and their query fragments qf1 through qf5 over the example schema graph and query workload
query q ∈ Q. For example, consider again the sample graph of Figure 1 and the queries
of Figure 2 constituting the workload Q. Given Q, the candidate shortcut nodes are: (i)
Institute, as starting node of q1 and q2 ; (ii) Paper, as starting node of q3 ; (iii) Conference,
since two edges belonging to two different queries originate from it; (iv) Title, as ending
node of q1 and q3; and (v) City, as ending node of q2. Starting from these, we generate
the set of candidate shortcuts presented in Figure 3.
Each candidate shortcut shi maps to exactly one query fragment qfi , which may be
contained in more than one queries of Q. The set of queries in which qfi is contained
is called related queries of shi and is denoted as RQi . For instance, the set of related
queries of sh4 in Figure 3 is RQ4 = {q1 , q2 }. Regarding the relationships between
shortcuts, we distinguish the following three cases: (i) a shortcut shi is disjoint to shj
if the query fragments they map to do not share any common edge; (ii) a shortcut shi
is fully contained in shj iff the query fragment that shi maps to is a subpath of the
corresponding query fragment of shj , denoted hereafter as shi ≺ shj ; and (iii) a short-
cut shi overlaps shj if they are not disjoint, and none of them fully contains the other.
Finally, for each shortcut shi we denote the set SFi = {qfj | qfj ≺ qfi }.
We now turn to defining the benefit of introducing a shortcut and formulating shortcut
selection as a benefit maximization problem. Assume a candidate shortcut shi with un-
derlying query fragment qfi and its set of related queries RQi . An estimate of the cost
of retrieving the answer to qfi may be given by (i) the number of edges that need to be
traversed in the data graph GD ; (ii) the actual query time of qfi ; (iii) an estimate by a
query optimizer. For the formulation of our problem we assume that this cost is given
by the number of edges, tri , that need to be traversed in the data graph GD (however
in Section 4 we present a comparison of the results obtained when considering as cost
the number of traversals on one hand and the actual query times on the other). Note
that not all edges traversed necessarily lead to an answer, however they do contribute
to the counting of tri . Now suppose that we augment the database by shortcut shi .
This involves inserting one edge in the schema graph GS , while in the data graph GD
one edge is inserted for each result of qfi , a total of ri edges between the correspond-
ing nodes. Hence, by inserting one instance of shi for each result of qfi using RDF
Bags2 , thus allowing duplicate instances of shortcuts (triples) to be inserted, we retain
the result cardinalities and obtain exactly the same query results. Then the new cost of
answering qfi is equal to the new number of traversals required, i.e. the number ri of
results of qfi . Therefore, the benefit obtained by introducing a shortcut shi in order
to answer qfi is equal to the difference between the initial cost minus the new cost,
tri − ri . Since qfi is used in answering each of its related queries, the benefit for query
qk ∈ RQi can be estimated by multiplying the fragment benefit by the frequency of the
query, i.e. fk (tri − ri ). The total benefit obtained by introducing shortcut shi is then
equal to the sum of the benefits obtained for each related query. If the query fragments
1 Due to readability issues, the number contained in the label of shortcuts is not presented in the figure as a subscript.
2 http://www.w3.org/TR/rdf-schema/
underlying the candidate shortcuts are disjoint then the aggregate benefit is the sum of
the benefits of all candidate shortcuts. If, however, there are containment relationships
between fragments, things get more complicated. For example, take the above men-
tioned queries q1 and q3 described in Figure 2. According to the definition of candidate
shortcuts, there are four candidates beneficial for these queries, namely sh1 , sh3 , sh4
and sh5 . Assume now that shortcut sh5 has been implemented first and that adding sh1
is being considered. The benefit of adding sh1 with regard to query q1 will be smaller
than what it would have been without sh5 in place. Indeed, sh1 is not going to eliminate
the traversals initially required for the fragment qf1 = Institute →(funds) Author →(writes) Paper →(acceptedBy) Conference →(hasTitle) Title but, rather, the traversals of edges
of type sh5 . We denote as dij the difference in the number of traversals required to
answer qfi due to the existence of a shortcut induced by qfj . This difference is positive
for all qfj ≺ qfi , i.e. for each qfj ∈ SFi . Finally, shortcuts induced by overlapping
query fragments do not interfere in the above sense: since the starting node of one of
them is internal to the other, they cannot have common related queries.
Let us now turn to the space consumption of shortcuts. Regardless of how many
queries are related to a query fragment qfi , the space consumption that results from
introducing the corresponding shortcut shi is ri as above. The total space consumption
of all shortcuts actually introduced should not exceed some given space budget b.
subject to
Σ_{i=1}^{n} ri xi ≤ b
xi = 1 if shortcut shi is established, and xi = 0 otherwise
SFi = {qfj | qfj ≺ qfi, qfj ∈ QF}
RQi = {qj | qfi ≺ qj, qj ∈ Q}.
This is a 0-1 quadratic knapsack problem, known to be NP-hard [15]. Several sys-
tematic and heuristic methods have been proposed in the literature [27] to obtain ap-
proximate solutions to this problem. All computationally reasonable methods assume
non-negative terms in the objective function. This is obviously not the case for our prob-
lem, since the reductions dij in the number of traversals are non-negative integers, thus resulting in
non-positive quadratic terms.
To address this problem, we have previously developed a greedy algorithm [13] that
seeks to maximize the overall benefit by selecting a subset of candidate shortcuts to
materialize given a space budget. The algorithm takes as input the schema and data
graph, together with the query workload, finds the candidate shortcuts and then com-
putes the benefit of each one, considering as cost the number of edge traversals. It then
incrementally selects the candidate with the maximum per-unit of space benefit that fits
into the remaining budget, until it reaches the defined space consumption. This algo-
rithm succeeds in identifying beneficial shortcuts, yet it wastes space, especially when
the queries of the workload are correlated. This led us to consider an alternative formu-
lation and develop a new algorithm, as described in the following Section. A detailed
comparison of the two approaches is presented in Section 4.
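Paraphrased in Python, the greedy selection looks roughly as follows (candidate benefits and space requirements are assumed to be precomputed; the recomputation of benefits after each pick, needed because of the d_ij interactions, is only hinted at in a comment):

def greedy_select(candidates, budget):
    # candidates: list of (shortcut_id, benefit, space) triples, where benefit
    # aggregates f_k * (tr_i - r_i) over the related queries of the shortcut.
    chosen, used = [], 0
    remaining = list(candidates)
    while remaining:
        fitting = [c for c in remaining if used + c[2] <= budget]
        if not fitting:
            break
        # Pick the candidate with the best benefit per unit of space.
        best = max(fitting, key=lambda c: c[1] / c[2])
        chosen.append(best[0])
        used += best[2]
        remaining.remove(best)
        # In the full algorithm, the benefits of the remaining candidates would now
        # be reduced by their d_ij interactions with the shortcut just selected.
    return chosen, used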
3 Bi-criterion Optimization
One established class of solution methods for the 0-1 quadratic knapsack problem are
Lagrangean methods, including the traditional Lagrangean relaxation of the capacity
constraint and a more sophisticated variant, Lagrangean decomposition [8,10,25,27].
As explained in Section 2, such techniques are inapplicable to our problem, due to
the negative quadratic terms in the objective function. However, the techniques cited
above, provided the inspiration for an alternative modelling approach that would avoid
the combined presence of binary decision variables and a capacity constraint.
Instead of stating a space constraint, we employ a cost term for space consumption
in the objective function. This means that the utility of each calculated solution will, on
one hand, increase based on the benefit of the selected shortcuts but, on the other hand,
it will decrease proportionally to the solution’s space consumption. The factor that de-
termines the exact penalty incurred in our objective function per space unit significantly
influences the final space required by the solution of our algorithm. A small penalty per
space unit used typically leads to solutions that select many shortcuts, thus consuming a
lot of space. On the contrary, a large penalty per space unit used typically leads to select-
ing fewer shortcuts of high benefit for materialization. As we describe at the end of this
section, a binary search process will help determine the right value of this parameter,
so as to select a solution that respects a space budget. Unlike Lagrangean methods that
penalize consumption in excess of the budget, in our formulation space consumption is
penalized from the first byte. As explained below, this formulation yields an efficiently
solvable linear program.
We will now describe how the constraints are derived for three important cases:
– Specifying that a candidate shortcut may be useful (or not) for evaluating specific
queries that contain its corresponding query fragment.
– Specifying how to prevent containment relations for the same query: whenever we consider two materialized shortcuts such that one fully contains the other, then only one of them can be used for the evaluation of the query.3 Expressing such constraints is crucial for the quality of the obtained solution. The greedy algorithm may select a smaller shortcut for materialization, and then select a shortcut that fully contains the first one.
– Specifying that two overlapping shortcuts (where neither is fully contained in the other) cannot be useful at the same time for the same query.
3 In this case, it is not true that the larger of the two shortcuts will always be used, as the presence of additional materialized shortcuts for the same query may result in utilizing the smaller one.
In the model of Section 2, the resource-oriented treatment of space employs the space
budget as a parameter for controlling system behavior. In the alternative, price-oriented
model presented in this section, a price coefficient on space consumption is the control
parameter. A small price on space consumption has an analogous effect to setting a high
space budget. By iteratively solving the price-oriented model over a range of prices we
effectively determine pairs of space consumption and corresponding solutions (sets of
shortcuts). So, both the resource-oriented and the price-oriented approach yield shortcut
selection policies the outcomes of which will be compared in terms of query evaluation
cost and space consumption in Section 4.
Another important feature of the alternative model is the explicit representation of the
usage of shortcuts in evaluating queries. In the model of Section 2, decision variables xi
express whether shortcut shi is inserted or not. The actual usage of selected shortcuts
regarding each query is not captured by these variables. In fact, the set of shortcuts that
serve a given query in an optimal solution as determined by the greedy algorithm has
no containment relationships among shortcuts and no overlapping shortcuts. Of course,
containment and overlapping can occur between shortcuts serving different queries.
For example, consider again the queries q1 , q2 and q3 mentioned above. The greedy
algorithm will not propose to use both shortcuts sh1 and sh3 , which have a containment
relationship, to serve q1 . Even if sh3 is selected in one step and sh1 in a subsequent step,
only sh1 will finally be used for the execution of q1 . Similarly, sh5 and sh3 will not
be proposed by greedy for q1, since their query fragments overlap. If we use sh5 then the part of q1 that corresponds to the subgraph Paper –acceptedBy→ Conference will be “hidden” under the new edge sh5 and therefore shortcut sh3 cannot be applied.
Specifying that a shortcut is useful for a specific query. We introduce additional
decision variables that express the usage of a given shortcut to serve a given query. In
particular, xik denotes whether shortcut shi is useful for the query qk , where xik =
{0, 1} and qk ∈ RQi .
The xik variables depend on each other and on xi in various ways. Note that if a shortcut shi is selected to serve any query qk (specified by xik = 1), then this shortcut is also selected by our algorithm for materialization (specified by xi = 1). Thus:
    xik ≤ xi,        i = 1, ..., n; k ∈ RQi        (1)
Moreover, if a shortcut has been selected for materialization, then this shortcut is useful for at least one query. Thus:
    xi ≤ Σ_{k∈RQi} xik,        i = 1, ..., n        (2)
exist any qfk (k = i, j) such that qfi ≺ qfk and qfk ≺ qfj , then an edge is inserted in
GQk starting from the bigger fragment qfj and ending at the smaller fragment qfi . Note
that we connect each such node with its children only and not with all of its descendants.
An example illustrating such a graph is shown in Figure 4, where the graph of query q1
is presented. This graph is constructed with the help of Figure 3, and by recalling that
each candidate shortcut shi matches the query fragment qfi .
Using GQk we generate the containment constraints by the following procedure:
1. For each root node of the graph (i.e., each node with no incoming edges) of query
qk , find all possible paths pkt leading to leaf nodes of the graph (i.e., to nodes with
no outgoing edges) and store them in a set of paths P k .
2. For each path pkt ∈ P k , create a containment constraint by restricting the sum of
all variables xik whose fragment is present in the path to be less than or equal to 1.
The second step enforces that among all the candidate shortcuts participating in each
path pkt, at most one can be selected. Continuing our previous example based on Figure 4, there exist two paths in Pk for query q1, namely p11 = {qf1, qf3} and p12 = {qf1, qf5, qf4}. These paths generate two containment constraints, namely x11 + x31 ≤ 1 and x11 + x51 + x41 ≤ 1.
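A small sketch of the two-step procedure above, assuming the containment graph GQ_k is given as an adjacency list from each fragment to the fragments it directly contains; the dictionary gq1 mirrors the running example of Figure 4, everything else is illustrative.

```python
def root_to_leaf_paths(gq):
    """Enumerate all root-to-leaf paths in the containment graph GQ_k.

    gq maps each fragment to the fragments it directly contains; roots have
    no incoming edges, leaves have no outgoing edges."""
    targets = {v for vs in gq.values() for v in vs}
    roots = [n for n in gq if n not in targets]
    paths = []

    def walk(node, path):
        children = gq.get(node, [])
        if not children:
            paths.append(path)
            return
        for child in children:
            walk(child, path + [child])

    for r in roots:
        walk(r, [r])
    return paths

# GQ_1 of the running example: qf1 contains qf3 and qf5, qf5 contains qf4.
gq1 = {"qf1": ["qf3", "qf5"], "qf5": ["qf4"], "qf3": [], "qf4": []}
for p in root_to_leaf_paths(gq1):
    # each path yields one constraint: the sum of x_{i,1} over fragments on the path is at most 1
    print(" + ".join(f"x[{frag},q1]" for frag in p) + " <= 1")
```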
Avoiding overlaps. Overlapping candidates are treated in a similar way. In graph GQk
we compare each node with all other nodes on different paths of the graph. For each
pair of compared nodes, if the corresponding query fragments of the nodes under com-
parison have edges in common, then we insert this pair of nodes in a set OFk which
stores the pairs of overlapping fragments with respect to query qk . For each pair in
OFk we create a constraint specifying that the sum of its decision variables be less
than or equal to 1, indicating that only one of the candidate shortcuts in the pair can be
selected to serve the query qk . In the previous example of query q1 , we would check
the two pairs (qf5, qf3) and (qf4, qf3). Only the first of these pairs contains fragments that have edges in common, as qf5 and qf3 both contain the edge Paper –acceptedBy→ Conference. In order to specify that the shortcuts of these two query fragments cannot be useful for the same query q1, we generate the constraint: x51 + x31 ≤ 1.
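A companion sketch for the overlap check, representing each query fragment by its set of typed edges. The only fact taken from the text is that qf5 and qf3 share the Paper –acceptedBy→ Conference edge; the remaining edge sets are guesses made for the sake of the example.

```python
from itertools import combinations

def overlap_constraints(fragment_edges, paths, query_id):
    """For fragments that never appear on a common path of GQ_k (so neither contains
    the other), emit x_i,k + x_j,k <= 1 whenever their fragments share an edge."""
    frags = sorted({f for p in paths for f in p})
    constraints = []
    for fi, fj in combinations(frags, 2):
        if any(fi in p and fj in p for p in paths):
            continue  # containment-related pair: already handled by the path constraints
        if fragment_edges[fi] & fragment_edges[fj]:
            constraints.append(f"x[{fi},q{query_id}] + x[{fj},q{query_id}] <= 1")
    return constraints

edges = {
    "qf1": {("Institute", "funds", "Author"), ("Author", "writes", "Paper"),
            ("Paper", "acceptedBy", "Conference"), ("Conference", "hasTitle", "Title")},
    "qf3": {("Paper", "acceptedBy", "Conference"), ("Conference", "hasTitle", "Title")},
    "qf5": {("Author", "writes", "Paper"), ("Paper", "acceptedBy", "Conference")},
    "qf4": {("Author", "writes", "Paper")},
}
paths = [["qf1", "qf3"], ["qf1", "qf5", "qf4"]]
print(overlap_constraints(edges, paths, 1))   # ['x[qf3,q1] + x[qf5,q1] <= 1']
```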
In conclusion, given a parameter c (called price coefficient) that specifies the per
space penalty of the chosen solution, we obtain the following bi-criterion 0-1 integer
linear programming formulation for the shortcut selection problem:
    max  Σ_{i=1..n} Σ_{qk∈RQi} fk (tri − ri) xik  −  c Σ_{i=1..n} ri xi        (3)
subject to
    xik − xi ≤ 0,        i = 1, ..., n; k ∈ RQi        (4)
    − Σ_{k∈RQi} xik + xi ≤ 0,        i = 1, ..., n        (5)
    Σ_{i∈pku} xik ≤ 1,        k = 1, ..., m; pku ∈ Pk        (6)
    xik + xjk ≤ 1,        k = 1, ..., m; (qfi, qfj) ∈ OFk        (7)
    xi ∈ {0, 1},   xik ∈ {0, 1}
Ignoring the overlap constraints (7) and constraint (5), and relaxing the integrality requirement on the decision variables, we obtain the following linear program:
    max  Σ_{i=1..n} Σ_{qk∈RQi} fk (tri − ri) xik  −  c Σ_{i=1..n} ri xi        (8)
subject to
    xik − xi ≤ 0,        i = 1, ..., n; k ∈ RQi        (9)
    Σ_{i∈pku} xik ≤ 1,        k = 1, ..., m; pku ∈ Pk        (10)
    0 ≤ xik ≤ 1,   0 ≤ xi ≤ 1
The constraint matrix of the above linear program can be proven to fulfill the suffi-
cient conditions of total unimodularity given in [31]. The proof is omitted due to space
limitations. By virtue of this property, the linear program is guaranteed to have integer
solutions (thus each xik and xi will be equal to either 0 or 1). Furthermore, the relaxation enables a much more efficient solution, since the empirical average performance of simplex-based linear program solvers is polynomial, in fact almost linear [4,7].
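As an illustration of how the relaxed formulation (8)-(10) can be fed to an off-the-shelf LP solver, here is a toy instance using scipy.optimize.linprog; the benefit and space numbers are invented and the variable layout is just one possible encoding, not the authors' implementation (which used Lingo, see Section 4).

```python
import numpy as np
from scipy.optimize import linprog

# toy instance: shortcuts sh1 and sh3 both serve query q1 and are containment-related
r = np.array([30.0, 10.0])        # space cost r_i (invented numbers)
gain = np.array([120.0, 80.0])    # f_1 * (tr_i - r_i) for query q1 (invented numbers)
c_price = 1.5                     # price coefficient c

# variable vector: [x_1, x_3, x_{1,1}, x_{3,1}]
obj = np.concatenate([c_price * r, -gain])   # linprog minimizes, so negate the max objective

A_ub = [
    [-1, 0, 1, 0],   # x_{1,1} - x_1 <= 0            (constraint (9))
    [0, -1, 0, 1],   # x_{3,1} - x_3 <= 0            (constraint (9))
    [0, 0, 1, 1],    # x_{1,1} + x_{3,1} <= 1        (containment path constraint (10))
]
b_ub = [0, 0, 1]

res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * 4, method="highs")
print(res.x)   # optimal solution here is integral: [1, 0, 1, 0], i.e. select sh1 and use it for q1
```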
Satisfying the overlap constraints. The solution of the relaxed problem may violate the overlap constraints (7) that we have ignored in the relaxation. Here we propose a cutting plane method with two alternative heuristics for restoring constraint satisfaction. After solv-
ing the relaxed problem, we check whether the solution violates any of the omitted
overlap constraints. Recall that these constraints contain two variables (representing
two shortcuts used for the same query) that are not both allowed to be equal to 1. For
each such violation we set the value of one of the variables equal to 0 (i.e., we exclude
the respective shortcut) and insert this as an additional constraint into the initial prob-
lem. It can be proven that adding such constraints in the problem does not affect total
unimodularity. The problem is solved again and new cuts are introduced as needed until
we reach a solution that does not violate any overlap constraint.
But how do we choose which of the two variables to set equal to zero? We explore
two alternative heuristics. The first heuristic (LRb) prohibits the shortcut with the smallest benefit for the given query qk. The second heuristic (LRl) prohibits the shortcut with the shortest corresponding query fragment. Both heuristics have been evaluated experimentally and are discussed in Section 4.
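A sketch of the cut-and-resolve loop with the two heuristics; solve_lp and violated_overlaps are hypothetical helpers standing in for the LP solver and the constraint check, and benefit/frag_len are assumed lookup tables keyed by the decision variables.

```python
def restore_overlaps(solve_lp, violated_overlaps, benefit, frag_len, heuristic="LRb"):
    """Iteratively solve the relaxed LP; for every violated overlap constraint
    fix one of the two conflicting variables to 0 and re-solve."""
    fixed_to_zero = set()
    while True:
        solution = solve_lp(fixed_to_zero)          # placeholder LP solve with the extra cuts
        violations = violated_overlaps(solution)    # pairs ((i, k), (j, k)) both set to 1
        if not violations:
            return solution
        for var_a, var_b in violations:
            if heuristic == "LRb":
                # LRb: drop the shortcut with the smaller benefit for this query
                loser = min((var_a, var_b), key=lambda v: benefit[v])
            else:
                # LRl: drop the shortcut with the shorter corresponding query fragment
                loser = min((var_a, var_b), key=lambda v: frag_len[v[0]])
            fixed_to_zero.add(loser)   # added as a cut x = 0 in the next solve
```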
Selecting the Value of the Price Coefficient c. In order to find a solution that is ap-
propriate for a given space budget, we perform a binary search on the value of c. Both
linear algorithms (i.e., heuristics) developed here take as input the space budget and
then create linear models based on different values of c until they reach a solution whose space consumption is close to (or, even better, equal to) the given budget.
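A sketch of the outer binary search on c, assuming a routine space_used(c) that solves the model for a given price and reports the space consumed by the resulting solution (a larger c yields fewer shortcuts and hence less space).

```python
def find_price_coefficient(space_used, budget, lo=1e-6, hi=1e6, tol=0.01, max_iter=50):
    """Binary-search c so that the solution's space consumption approaches the budget."""
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        used = space_used(mid)           # placeholder: solve the linear model for price mid
        if abs(used - budget) <= tol * budget:
            return mid
        if used > budget:
            lo = mid                     # too much space consumed: raise the price
        else:
            hi = mid                     # under budget: lower the price
    return (lo + hi) / 2.0
```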
4 Evaluation
In this section we present an experimental study of the two variations of the linear
algorithm. We implemented LRb and LRl in C++ and compared their results with our
previously developed greedy algorithm (GR) described in [13]. Our goal is to evaluate
the reduction in query costs w.r.t. different RDF stores and data sets, to compare the
performance of LRb/LRl with GR in general and more specifically to check whether
LRb/LRl can capture the dependencies produced by strongly correlated workloads in
a more effective way than GR. By strongly correlated we mean workloads containing
path queries with containment relationships among the majority of them. In this spirit,
we report on the reduction of the total execution time of the given query workloads
after augmenting the database by the proposed shortcuts, vis-a-vis the allocated space
budget. We used four different systems for our experiments, each of them following
a different type of RDF store approach: (i) the SWKM4 management system, which
relies on a relational DBMS, (ii) the native Sesame5 repository, (iii) the in-memory
Sesame repository and (iv) the native RDF3X6 system. In the following experiments,
unless otherwise stated, we present in the graphs the percentage of reduction achieved
in the query time of the entire workload after augmenting the system by the proposed
shortcuts (y-axis) with regard to the space consumed for this augmentation, expressed
as a fraction of the database size (x-axis). All reported experiments were executed on
an Intel Core 2 Duo 2.33GHz PC with 4GB RAM running 32-bit Windows Vista. In the reported experiments we used Lingo 9.07 as the linear solver for LRb and LRl.
Before discussing the results obtained with strongly correlated queries, we examine
the dependence of performance on the query cost metrics used. Recall from Section 2.2
4 Available at http://139.91.183.30:9090/SWKM/
5 Available at http://www.openrdf.org/
6 Available at http://www.mpi-inf.mpg.de/~neumann/rdf3x/
7 LINDO Systems, http://www.lindo.com
[Fig. 5: reduction in query time (%) and running time (sec) vs. percentage of data space consumed, comparing LR and GR with input cost measured as actual times and as traversals.]
[Figure: percentage of reduction in query time vs. percentage of data space consumed.]
that computing the benefit of candidate shortcuts requires a cost metric of the query
fragments. Our theoretical formulation employs the number of traversals required to
answer the query, while the actual query times must also be considered. In [13] we have
shown that both cost metrics give qualitatively similar results for the greedy algorithm.
To confirm that this holds true for LRb/LRl too, we used the Yago data set [34] that
contains 1.62 million real data triples. We considered as query workload the 233 distinct
query paths of length 3 contained in the schema graph of Yago and used as RDF store
the Sesame native repository. In Figure 5 we present the percentage of reduction in
query time achieved for the entire workload (y-axis) after augmenting the system by
the selected shortcuts w.r.t. the fraction of data space required to be consumed by the
shortcut edges. In this experiment LRb and LRl give the same solutions, due to the
small length of queries (presented as LR in Figure 5).
It is obvious that our approach significantly reduces the overall query execution time
without consuming large amounts of space. Moreover, LR achieves a bigger reduction
compared to greedy (GR): when using traversals as cost metric, LR reduces the query
time by 55% by consuming only 6% of the data space, while GR reaches a reduction
of 33% for the same consumption. LR reaches the maximum reduction of 86% of the
initial query time with GR stopping at a maximum reduction of 66%. The reported
results confirm that the two cost metrics used give qualitatively similar results. We
[Data: reduction in query time (%) at increasing levels of consumed data space.]
    2.5%    −2%    5%
    5.0%    32%    34%
    7.5%    32%    34%
    10.0%   33%    40%
    12.5%   36%    42%
    15.0%   49%    52%
    17.5%   61%    83%
[Figures 8 and 9: percentage of reduction in query time vs. percentage of data space consumption.]
Fig. 8. Experiment with CIDOC Ontology
Fig. 9. Using Long Path Queries
5 Related Work
Several query languages for RDF have been proposed, with SPARQL [2], RQL [19] and RDQL [1] standing as main representatives.
In the relational world, views have long been explored in database management sys-
tems [30], either as pure programs [32,33], derived data that can be further queried [23],
pure data (detached from the view query) or pure index [29]. The selection of shortcuts
in RDF databases is similar to materialized views in relational systems, which has long
been studied: selecting and materializing a proper set of views with respect to the query
workload and the available system resources optimizes frequent computations [18,21].
Moreover, restricting the solution space into a set of views with containment properties
has been proposed in the Data Cube [16] framework.
Indexing RDF data is similar in concept (due to the hierarchical nature of path
queries) to indexing data in object oriented (OO) and XML Databases. An overview
of indexing techniques for OO databases is presented in [5]. In [17] a uniform index-
ing scheme, termed U-index, for OO databases is presented, while in [6] the authors
present techniques for selecting the optimal index configuration in OO databases when
only a single path is considered and without taking into account overlaps of subpaths.
The work of [12] targets building indices over XML data. Unlike XML data, RDF data
is not rooted at a single node. Moreover, the above techniques target the creation of
indices over paths, thus not tackling the problem of this paper, which is the selection of
shortcuts to materialize given a workload of overlapping queries.
Following the experience from relational databases, indexing techniques for RDF
databases have already been explored [14,24,35]. A technique for selecting which (out
of all available) indices can help accelerate a given SPARQL query is proposed in [9].
In [3] the authors propose a vertically partitioned approach to storing RDF triples and
consider materialized path expression joins for improving performance. While their ap-
proach is similar in spirit to the notion of shortcuts we introduce, the path expression
joins they consider are hand-crafted (thus, there is no automated way to generate them in
an arbitrary graph). Moreover the proposed solutions are tailored to relational backends
(and more precisely column stores), while in our work we do not assume a particular
storage model for the RDF database. Our technique can be applied in conjunction with
these approaches to accelerate query processing, since it targets the problem of deter-
mining the proper set of shortcuts to materialize without assuming a particular RDF
storage model. In [26] the authors propose a novel architecture for indexing and query-
ing RDF data. While their system (RDF3X) efficiently handles large data sets using
indices and optimizes the execution cost of small path queries, our evaluation shows
that our technique further reduces the execution cost of long path queries in RDF3X.
6 Conclusions
Shortcuts are introduced as a device for facilitating the expression and accelerating the
execution of frequent RDF path queries, much like materialized database views, but
with distinctive traits and novel features. Shortcut selection is initially formulated as
a benefit maximization problem under a space budget constraint. In order to overcome
the shortcomings of a previous greedy algorithm solution to this problem, we developed
an alternative bi-criterion optimization model and a linear solution method.
References
1. RDQL - A Query language for RDF. W3C Member Submission,
http://www.w3.org/Submission/RDQL/
2. SPARQL Query Language for RDF. W3C Recommendation,
http://www.w3.org/TR/rdf-sparql-query/
3. Abadi, D.J., Marcus, A., Madden, S., Hollenbach, K.J.: Scalable Semantic Web Data Man-
agement Using Vertical Partitioning. In: VLDB (2007)
4. Beasley, J.E.: Advances in Linear and Integer Programming. Oxford Science (1996)
5. Bertino, E.: A Survey of Indexing Techniques for Object-Oriented Database Management
Systems. Query Processing for Advanced Database Systems (1994)
6. Bertino, E.: Index Configuration in Object-Oriented Databases. The VLDB Journal 3(3)
(1994)
7. Borgwardt, K.H.: The average number of pivot steps required by the simplex-method is poly-
nomial. Mathematical Methods of Operations Research 26(1), 157–177 (1982)
8. Caprara, A., Pisinger, D., Toth, P.: Exact Solution of the Quadratic Knapsack Problem. IN-
FORMS J. on Computing 11(2), 125–137 (1999)
9. Castillo, R., Leser, U., Rothe, C.: RDFMatView: Indexing RDF Data for SPARQL Queries.
Tech. rep., Humboldt University (2010)
10. Chaillou, P., Hansen, P., Mahieu, Y.: Best network flow bounds for the quadratic knapsack
problem. Lecture Notes in Mathematics, vol. 1403, pp. 225–235 (2006)
11. Constantopoulos, P., Dritsou, V., Foustoucos, E.: Developing query patterns. In: Agosti, M.,
Borbinha, J., Kapidakis, S., Papatheodorou, C., Tsakonas, G. (eds.) ECDL 2009. LNCS,
vol. 5714, pp. 119–124. Springer, Heidelberg (2009)
12. Cooper, B.F., Sample, N., Franklin, M.J., Hjaltason, G.R., Shadmon, M.: A Fast Index for
Semistructured Data. In: VLDB (2001)
13. Dritsou, V., Constantopoulos, P., Deligiannakis, A., Kotidis, Y.: Shortcut selection in RDF
databases. In: ICDE Workshops. IEEE Computer Society, Los Alamitos (2011)
14. Fletcher, G.H.L., Beck, P.W.: Indexing social semantic data. In: Sheth, A.P., Staab, S.,
Dean, M., Paolucci, M., Maynard, D., Finin, T., Thirunarayan, K. (eds.) ISWC 2008. LNCS,
vol. 5318, Springer, Heidelberg (2008)
15. Gallo, G., Hammer, P., Simeone, B.: Quadratic knapsack problems. Mathematical Program-
ming 12, 132–149 (1980)
16. Gray, J., Bosworth, A., Layman, A., Pirahesh, H.: Data Cube: A Relational Aggregation
Operator Generalizing Group-By, Cross-Tab, and Sub-Total. In: ICDE (1996)
17. Gudes, E.: A Uniform Indexing Scheme for Object-Oriented Databases. Information Sys-
tems 22(4) (1997)
18. Harinarayan, V., Rajaraman, A., Ullman, J.D.: Implementing Data Cubes Efficiently. In: SIG-
MOD Conference (1996)
19. Karvounarakis, G., Alexaki, S., Christophides, V., Plexousakis, D., Scholl, M.: RQL: A
declarative query language for RDF. In: WWW (2002)
20. Kotidis, Y.: Extending the Data Warehouse for Service Provisioning Data. Data Knowl.
Eng. 59(3) (2006)
21. Kotidis, Y., Roussopoulos, N.: A Case for Dynamic View Management. ACM Trans.
Database Syst. 26(4) (2001)
22. Larson, P., Deshpande, V.: A File Structure Supporting Traversal Recursion. In: SIGMOD
Conference (1989)
23. Larson, P., Yang, H.Z.: Computing Queries from Derived Relations. In: VLDB (1985)
24. Liu, B., Hu, B.: Path Queries Based RDF Index. In: SKG, Washington, DC, USA (2005)
25. Michelon, P., Veilleux, L.: Lagrangean methods for the 0-1 Quadratic Knapsack Problem.
European Journal of Operational Research 92(2), 326–341 (1996)
26. Neumann, T., Weikum, G.: The rdf-3x engine for scalable management of rdf data. VLDB
J. 19(1) (2010)
27. Pisinger, D.: The quadratic knapsack problem - a survey. Discrete Applied Mathemat-
ics 155(5), 623–648 (2007)
28. Rosenthal, A., Heiler, S., Dayal, U., Manola, F.: Traversal Recursion: A Practical Approach
to Supporting Recursive Applications. In: SIGMOD Conference (1986)
29. Roussopoulos, N., Chen, C.M., Kelley, S., Delis, A., Papakonstantinou, Y.: The ADMS
Project: View R Us. IEEE Data Eng. Bull. 18(2) (1995)
30. Roussopoulos, N.: Materialized Views and Data Warehouses. SIGMOD Record 27 (1997)
31. Schrijver, A.: Theory of linear and integer programming. John Wiley, Chichester (1998)
32. Sellis, T.K.: Efficiently Supporting Procedures in Relational Database Systems. In: SIGMOD
Conference (1987)
33. Stonebraker, M.: Implementation of Integrity Constraints and Views by Query Modification.
In: SIGMOD Conference (1975)
34. Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: A Core of Semantic Knowledge. In: WWW.
ACM Press, New York (2007)
35. Udrea, O., Pugliese, A., Subrahmanian, V.S.: GRIN: A Graph Based RDF Index. In: AAAI
(2007)
RDFS Update: From Theory to Practice
1 Introduction
RDF has become one of the prime languages for publishing data on the Web,
thanks to initiatives like Linked Data, Open Data, Datagovs, etc. The next
step is to work on the evolution of such data, thus, facing the issue of informa-
tion update. If one analyzes the data that is being published, the vocabulary
used includes the core fragment of RDFS plus some OWL features. This poses
strong challenges to the goal of updating such information. It is well-known
that the problem of updating and schema evolution in Knowledge Bases is both intractable and non-deterministic in the general case. For example, erasing a statement ϕ (that is, updating the knowledge base so that the statement ϕ cannot be deduced from it) could not only take exponential time, but there could also be many different and equally reasonable solutions. Thus, there is no global solution
and the problem has to be attacked by parts.
In this paper we study the problem of updating data under the RDFS vo-
cabulary, considering the rest of the vocabulary as constant. Many proposals
on updates in RDFS and light knowledge bases (e.g. DL-lite ontologies) have
been presented and we discuss them in detail in Section 5. Nevertheless, such
proposals have addressed the problem from a strictly theoretical point of view,
making them – due to the inherent complexity of the general problem – hard or impossible to use in practice.
Using the Katsuno-Mendelzon theoretical approach (from now on, K-M ap-
proach) for update and erasure [8], which has been investigated and proved
fruitful for RDFS (see [3,4,6]), we show that updates in RDFS can be made
practical. We are able to get this result by (a) following the approach typical in
ontology evolution, where schema and instance updates are treated separately;
(b) focusing on the particular form of the deductive rules of RDFS; and (c) con-
sidering blank nodes as constants (which for current big data sets is a rather safe
assumption) [1]. In this paper we concentrate on the erasure operation (‘deleting’ a statement), because update (adding information) in RDFS, due to its positive logic nature, turns out to be almost trivial [6]. Our two main results are a deterministic and efficient algorithm for updating instances, and a reduction of the update problem for schemas to a graph-theoretical problem on very small graphs.
Regarding instance update, we show that due to the particular form of the rules involved in RDFS [13], and using a case-by-case analysis, instance erasure (i.e., erasing data without touching the schema) is a deterministic process for RDFS, that is, it can be uniquely defined, hence opening the door to automating it. Then, we show that this process can be done efficiently, and reduces essentially to computing reachability in small graphs. We present pseudo-code of the algorithms
that implement this procedure.
As for schema erasure, the problem is intrinsically non-deterministic and, worse, intractable in general. A trivial example is a chain of subclasses (ai, sc, ai+1) from which one would like to erase the triple (a1, sc, an). The minimal solutions consist in deleting one of the triples. In fact, we show that in general, each solution corre-
sponds bi-univocally to the well-known problem of finding minimal cuts for certain
graphs constructed from the original RDF graph to be updated. This problem is
known to be intractable. The good news here is that the graphs where the cuts
have to be performed are very small (for the data we have, it is almost of constant
size: see Table 1). They correspond essentially to the subgraphs containing triples
with predicates subClassOf and subPropertyOf. Even better, the cuts have to be
performed over each connected component of these graphs (one can avoid cuts be-
tween different connected components), whose size is proportional to the length
of subClassOf (respectively subPropertyOf) chains in the original graph. We also
present pseudo-code for this procedure.
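Because the sc/sp graphs involved are so small, even a brute-force enumeration of minimal cuts is feasible; the sketch below only illustrates the cut correspondence on the subclass-chain example above and is not the pseudo-code referred to in the text.

```python
from itertools import combinations

def reachable(edges, src, dst):
    """Is dst reachable from src following the directed edges?"""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
    stack, seen = [src], {src}
    while stack:
        u = stack.pop()
        if u == dst:
            return True
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False

def minimal_cuts(edges, src, dst):
    """Enumerate minimal sets of edges whose removal disconnects src from dst.

    Brute force over subsets, which is fine for the very small sc/sp subgraphs
    observed in practice (see Table 1)."""
    edges = list(edges)
    cuts = []
    for size in range(1, len(edges) + 1):
        for cand in combinations(edges, size):
            cand_set = set(cand)
            if any(c <= cand_set for c in cuts):
                continue  # a strictly smaller cut is already contained in this candidate
            remaining = [e for e in edges if e not in cand_set]
            if not reachable(remaining, src, dst):
                cuts.append(cand_set)
    return cuts

# subclass chain a -> b -> c -> d: erasing (a, sc, d) means cutting every a-to-d path
chain = [("a", "b"), ("b", "c"), ("c", "d")]
print(minimal_cuts(chain, "a", "d"))   # each single edge is a minimal cut, i.e. one erase option
```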
The remainder of the paper is organized as follows. Section 2 reviews RDF
notions and notations and the basics of the K-M approach to erasure. Section 3
studies the theoretical basis of the erasure operations proposed, and Section 4
puts to practice the ideas presenting algorithms for efficiently computing erasure
in practice. Section 5 discusses related work. We conclude in Section 6.
2 Preliminaries
To make this paper self-contained we present in this section a brief review of
basic notions on RDF, and theory of the K-M approach to update in RDFS.
Most of the material in this section can be found in [5,6,13] with more detail.
Definition 1 (RDF Graph). Consider infinite sets U (URI references); B =
{Nj : j ∈ N} (Blank nodes); and L (RDF literals). A triple (v1 , v2 , v3 ) ∈ (U ∪
B) × U × (U ∪ B ∪ L) is called an RDF triple. The union of U, B, L will be
denoted by U BL.
Table 1. Statistics of triples in schema, instances and sc, sp chains of some RDF
datasets. (The difference between # triples and #(schema + instances) is due to the predicates sameAs, sameClass, which, being schema, do not have semantics in RDFS.)
An RDF graph (just graph from now on) is a set of RDF triples. A subgraph
is a subset of a graph. A graph is ground if it has no blank nodes.
A set of reserved words defined in RDF Schema (called the rdfs-vocabulary)
can be used to describe properties like attributes of resources, and to represent
relationships between resources. In this paper we restrict to a fragment of this
vocabulary which represents the essential features of RDF and that contains the
essential semantics (see [13]): rdfs:range [range], rdfs:domain [dom], rdf:type [type], rdfs:subClassOf [sc] and rdfs:subPropertyOf [sp]. The following set of rule schemas
captures the semantics of this fragment [13]. In each rule schema, capital letters
A, B, C, D, X, Y,... represent variables to be instantiated by elements of UBL.
GROUP A (Subproperty)
(1) (A, sp, B), (B, sp, C) ⊢ (A, sp, C)
(2) (A, sp, B), (X, A, Y) ⊢ (X, B, Y)
GROUP B (Subclass)
(3) (A, sc, B), (B, sc, C) ⊢ (A, sc, C)
(4) (A, sc, B), (X, type, A) ⊢ (X, type, B)
GROUP C (Typing)
(5) (A, dom, C), (X, A, Y) ⊢ (X, type, C)
(6) (A, range, C), (X, A, Y) ⊢ (Y, type, C)
1
As in [13], we are avoiding triples of the form (a, sc, a) and (b, sp, b), because this
causes no harm to the core of the deductive system (see [13]).
[Figures 1 and 2 (graph drawings): small RDF graphs over nodes a, b, c, d, e connected by sc edges, shown in panels (a) and (b).]
Fig. 2. (a) The set of erase candidates ecand(G, {(a, sc, d)}). (b) The set of minimal
bases minbases(cl(G), {(a, sc, d)}).
3. For all formulas F of RDF, (⋂_{E ∈ ecand(G,H)} E) |= F if and only if Mod(G • H) ⊆ Mod(F).
Items (1) and (2) in Theorem 1 state that if we had disjunction in RDF, erasure
could be expressed by the following finite disjunction of RDF graphs:
G • H “=” G ∨ E1 ∨ · · · ∨ En ,
where Ej are the erase candidates of G • H. Item (3) states the fact that all the
statements entailed by G • H expressible in RDF are exactly represented by the
RDF graph defined by the intersection of all the erase candidates graphs.
Note that the smaller the size of ecand(G, H), the better the approximation to G • H, the limit being the case when it is a singleton:
Corollary 1. If ecand(G, H) = {E}, then (G • H) ≡ E.
Thus, the relationship between delta and erase candidates is the following:
Definition 8 (Bases and Minimal Bases). (1) The set of leaves of a proof
tree (of H from G) is called the base of such proof.
(2) A base B of H from G is a minimal base iff it is minimal under set-inclusion among all the bases of proofs of H from G (that is, there is no base B′ of H from G such that B′ ⊊ B). We denote by minbases(G, H) the set of minimal bases of G, H.
Example 3. For the graph G given in Figure 1 (a), the set minbases(cl(G),
{(a, sc, d)}) contains the graphs given in Figure 2 (b).
Theorem 2. Let G, H, C be RDF graphs. Then, C is a hitting set for the collection of sets minbases(G, H) iff (cl(G) \ C) ⊭ H. Moreover, C is a minimal hitting set iff cl(G) \ C is a maximal subgraph G′ of cl(G) such that G′ ⊭ H.
Proof. (sketch) Note that if C is a hitting set, its minimality follows from the maximality of its complement, G \ C, and vice versa. Hence we only have to prove that C is a hitting set for minbases(G, H) iff (G \ C) ⊭ H.
Proof. Follows from Definition 7, Theorem 2, and the observation that C ⊆ cl(G) is minimal iff cl(G) \ C is maximal.
For a triple t in a graph G, we will show that the graphs in dcand(G, t) correspond
to certain cuts defined in two directed graphs derived from G, that we denote
G[sc] and G[sp], defined as follows:
Table 2. Description of the construction of the graphs G[sc] (above) and G[sp] (below)
For an RDF triple t, the set of multicuts (set of pairs of nodes) associated to
the erasure of t from an RDF graph G, is defined as follows:
Definition 12 (Set of edges t[sc, G] and t[sp, G]). The set t[sc, G] contains
the pairs of nodes (u, v) as described in Table 3 (second column) with u, v nodes
in G[sc]. Analogously, we define t[sp, G] using Table 3 (third column).
Example 4. Let us consider graph G on the left hand side of Figure 3. The center
part and the right hand side show, respectively, the graphs G[sp] and G[sc], built
according to Table 2. For example, for the triple t = (d, sc, c), the sets of edges
are (d, sc, c)[sc, G] = {(nd , nc )} and (d, sc, c)[sp, G] = ∅. There are triples which
Fig. 3. An RDF graph G and its associated G[sc] and G[sp] graphs
give rise to multiple pairs of nodes. For example, for the triple t = (a, type, c) and
the graph in Figure 3, the sets contain the pairs (a, type, c)[sc, G] = {(nt,a , nc )}∩
G[sc] = ∅, and (a, type, c)[sp, G] = {(mab , mc,dom), (mba , mc,dom)}.
Table 3. Construction of the pairs of nodes t[sc, G] and t[sp, G] associated to a triple
t. The minimal multicuts of them in G[sc] and G[sp] respectively, will give the elements
of dcand(G, t) (Theorem 3).
The next theorem shows that computing the delta candidates can be reduced to computing minimal multicuts, in particular the set of cuts defined in Table 3 in
the graphs defined in Table 2.
Proof. The proof is a case-by-case analysis of each form of t. For t = (a, dom, c)
or t = (a, range, c), the set dcand(G, t) = {t}, because t cannot be derived by
any rule, thus, G |= t if and only if t ∈ G.
Case t = (a, sc, b). From the deduction rules in Section 2, t can be deduced
from G if and only if there is a path in G[sc] from na to nb (note that the only
rule that can derive t is (3)). Hence dcand(G, t) is in correspondence with the
set of the minimal cuts from na to nb in G[sc].
Case t = (a, sp, b). This is similar to the previous one. From the deduction rules, it follows that t can only be deduced from G if there is a path in G[sp]
from ma to mb (by rule (1)). Thus dcand(G, t) is the set of the minimal cuts
from ma to mb in G[sp].
Case t = (a, type, c). To deduce t we can use rules (4), (5) and (6). Rule (4)
recursively needs a triple of the same form (a, type, d) and additionally the fact
that (d, sc, c). Thus, t can be derived from G if there is path in G[sc] from nt,a
to nc . Triple t can also be derived from (5) and (6). Let us analyze (5) (the other
case is symmetric). We need the existence of triples (a, P, x) and (P, dom, u)
and u →∗ c in G[sc], i.e., (u, sc, c). Then (a, P, x) can be recursively derived
by rule(2) (and by no other rule); (P, dom, u) should be present; and the last
condition needs (u, sc, c). Hence t can be derived if for some x there is a path from ma,x to mc,dom in G[sp] (this explains the last two lines of Table 3). Analyzing the rules, we can conclude that t is no longer derivable from G if and only if we block the previous forms of deducing it, that is, by producing a minimal cut between nt,a and nc in G[sc] and a minimal multicut between the set of pairs (ma,x, mc,dom) for all x, and the set of pairs (my,a, mrange,c) for all y, in the graph G[sp].
Case t = (a, p, b). Here, t can only be derived using rule (2). This needs the
triples (a, q, b) and (q, sp, p). With similar arguments as above, it can be shown
that t can be derived from G iff there is path in G[sp] from ma,b to mp . Hence
dcand(G, t) is the set of minimal cuts from ma,b to mp in G[sp].
Lemma 1. Let G, H be ground RDF graphs in normal form (i.e. without re-
dundancies, see [5]). (i) If E ∈ ecand(G, H), then there exists a triple ti ∈ H
such that E ∈ ecand(G, {ti }); (ii) If D ∈ dcand(G, H), then there exists a triple
ti ∈ H such that D ∈ dcand(G, {ti }).
For instance erasure, the situation is optimal, since it assumes that the schema of
the graph remains untouched. In this setting, we will show that (i) the procedure
is deterministic, that is, there is a unique choice of deletion (i.e., dcand(G, t) has
a single element); (ii) this process can be done efficiently.
Algorithm 3 computes dcand(G, t) for instances. The key fact to note is that
for instances t, that is, triples of the form (a, type, b) or (a, p, b), where p does
not belong to RDFS vocabulary, the minimal multicut is unique. For triples of
the form (a, p, b), it follows from Table 3 that one has to cut paths from mab
to mp in G[sp]. Note that nodes of the form mp are non-leaf ones, hence all
edges in such paths in G[sp] come from schema triples (u, sp, v) (see Table 2).
Because we cannot touch the schema, it follows that if there is such a path, the unique option is to eliminate the edge mab, which corresponds to triples (a, w, b) in G. For triples of the form (a, type, b) the analysis (omitted here for the sake
5 Related Work
Although updates have attracted the attention of the RDF community, only
updates to the instance part of an RDF graph have been addressed so far.
Sarkar [15] identifies five update operators and presents algorithms for two of them. Zhan [18] proposes an extension to RQL, and defines a set of up-
date operators. Both works define updates in an operational way. Ognyanov and
Kiryakov [14] describe a graph updating procedure based on the addition and
the removal of a statement (triple), and Magiridou et al [12] introduce RUL, a
declarative update language for RDF instances (schema updates are not stud-
ied). SPARQL/Update [17] is an extension to SPARQL to support updates over
a collection of RDF graphs. The treatment is purely syntactic, not considering
the semantics of RDFS vocabulary. In this sense, our work can be considered as
input for future enrichments of this language to include RDF semantics.
Konstantinidis et al. [9] introduce a framework for RDF/S ontology evolu-
tion, based on the belief revision principles of Success and Validity. The authors
map RDF to First-Order Logic (FOL), and combine FOL rules (representing
the RDF ontology), with a set of validity rules (which capture the semantics of
the language), showing that this combination captures an interesting fragment
of the RDFS data model. Finally, an ontology is represented as a set of positive
ground facts, and an update is a set of negative ground facts. If the update causes
side-effects on the ontology defined above, they choose the best-option approach based on the principle of minimal change, for which they define an order between the FOL predicates. The paper overcomes the lack of disjunction and negation in RDF by working with FOL. In contrast to this approach, in the present
paper we remain within RDF.
Chirkova and Fletcher [3], building on [6] and [9], present a preliminary study of what they call well-behaved RDF schema evolution (namely, updates that are unique and well-defined). They focus on the deletion of a triple from an
RDF graph, and define a notion of determinism, showing that when deleting a
triple t from a graph G, an update graph for the deletion exists (and is unique),
if and only if t is not in the closure of G, or t is deterministic in G. Although
closely related to our work, the proposal does not study instance and schema
updates separately, and practical issues are not discussed.
Description Logics ontologies can be seen as knowledge bases (KB) composed
of two parts, denoted TBox and ABox, expressing intensional and extensional
knowledge, respectively. So far, only updates to the extensional part (i.e., in-
stance updates) have been addressed. De Giacomo et al. [4] study the non-
expressibility problem for erasure. This states that given a fixed Description
Logic L, the result of an instance level update/erasure is not expressible in L
(for update, this has already been proved by Liu et al. [11]). Here, it is also
assumed that the schema remains unchanged (i.e., only the ABox is updated).
For update they use the possible models approach [16], and for erasure, the K-M
approach. Building also on the ideas expressed in [6], the authors show that for
a fragment of Description Logic, updates can be performed in PTIME with re-
spect to the sizes of the original KB and the update formula. Calvanese et al. also
study updates to ABoxes in DL-Lite ontologies. They present a classification of
the existing approaches to evolution, and show that ABox evolution under what
they define as bold semantics is uniquely defined [2].
6 Conclusions
Following the approach typical in ontology evolution, where schema and instance
updates are treated separately, we proposed practical procedures for computing
schema and instance RDF erasure, basing ourselves on the K-M approach. We
focused on bringing to practice the theory developed on this topic. As one of
our main results, we have shown that instance erasure is deterministic and fea-
sible for RDFS. Further, we presented an algorithm to perform it. For schema
erasure, the problem is non-deterministic and intractable in the general case.
However, we show that since schemas are very small in practice, it can become
tractable. Thus, we proposed an algorithm to compute schema updates, based
on computing multicuts in graphs. Future work includes developing an update
language for RDF based on the principles studied here, and implementing the
proposal at big scale.
References
1. Arenas, M., Consens, M., Mallea, A.: Revisiting blank nodes in rdf to avoid the
semantic mismatch with sparql. In: W3C Workshop: RDF Next Steps, Palo Alto,
CA (2010)
2. Calvanese, D., Kharlamov, E., Nutt, W., Zheleznyakov, D.: Evolution of DL-Lite
knowledge bases. In: Patel-Schneider, P.F., Pan, Y., Hitzler, P., Mika, P., Zhang,
L., Pan, J.Z., Horrocks, I., Glimm, B. (eds.) ISWC 2010, Part I. LNCS, vol. 6496,
pp. 112–128. Springer, Heidelberg (2010)
3. Chirkova, R., Fletcher, G.H.L.: Towards well-behaved schema evolution. In:
WebDB (2009)
4. De Giacomo, G., Lenzerini, M., Poggi, A., Rosati, R.: On instance-level update
and erasure in description logic ontologies. J. Log. Comput. 19(5), 745–770 (2009)
5. Gutierrez, C., Hurtado, C.A., Mendelzon, A.O., Pérez, J.: Foundations of semantic
web databases. Journal of Computer and System Sciences (JCSS) 77, 520–541
(2010); journal version of the paper with the same title presented at the PODS Conference, pp. 95–106 (2004)
6. Gutiérrez, C., Hurtado, C.A., Vaisman, A.A.: The meaning of erasing in RDF
under the katsuno-mendelzon approach. In: WebDB (2006)
7. Hayes, P. (ed.): RDF semantics. W3C Working Draft (October 1, 2003)
8. Katsuno, H., Mendelzon, A.O.: On the difference between updating knowledge base
and revising it. In: International Conference on Principles of Knowledge Represen-
tation and Reasoning, Cambridge, MA, pp. 387–394 (1991)
9. Konstantinidis, G., Flouris, G., Antoniou, G., Christophides, V.: A formal approach
for RDF/S ontology evolution. In: ECAI, pp. 70–74 (2008)
10. Lin, H.-Y., Kuo, S.-Y., Yeh, F.-M.: Minimal cutset enumeration and network relia-
bility evaluation by recursive merge and BDD. In: IEEE Symposium on Computers
and Communications (ISCC 2003), Kiris-Kemer, Turkey, June 30 - July 3 (2003)
11. Liu, H., Lutz, C., Milicic, M., Wolter, F.: Updating description logic aboxes. In:
KR, pp. 46–56 (2006)
12. Magiridou, M., Sahtouris, S., Christophides, V., Koubarakis, M.: RUL: A declar-
ative update language for RDF. In: Gil, Y., Motta, E., Benjamins, V.R., Musen,
M.A. (eds.) ISWC 2005. LNCS, vol. 3729, pp. 506–521. Springer, Heidelberg (2005)
13. Muñoz, S., Pérez, J., Gutierrez, C.: Minimal deductive systems for RDF. In: Fran-
coni, E., Kifer, M., May, W. (eds.) ESWC 2007. LNCS, vol. 4519, pp. 53–67.
Springer, Heidelberg (2007)
14. Ognyanov, D., Kiryakov, A.: Tracking changes in RDF(S) repositories. In: Gómez-
Pérez, A., Benjamins, V.R. (eds.) EKAW 2002. LNCS (LNAI), vol. 2473, pp. 373–
378. Springer, Heidelberg (2002)
15. Sarkar, S., Ellis, H.C.: Five update operations for RDF. Rensselaer at Hartford
Technical Report, RH-DOES-TR 03-04 (2003)
16. Winslett, M.: Reasoning about action using a possible models approach. In: AAAI,
pp. 89–93 (1988)
17. WWW Consortium. SPARQL/Update: A language for updating RDF graphs
(2008), http://www.w3.org/Submission/SPARQL-Update/
18. Zhan, Y.: Updating RDF. In: 21st Computer Science Conference, Rensselaer at
Hartford (2005)
Benchmarking Matching Applications on
the Semantic Web
1 Introduction
In recent years, the increasing availability of structured linked data over the
semantic web has stimulated the development of a new generation of semantic
web applications capable of recognizing identity and similarity relations among
data descriptions provided by different web sources. This kind of applications are
generally known as matching applications and are more and more focused on the
specific peculiarities of instance and linked data matching [7]. Due to this situa-
tion, the evaluation of matching applications is becoming an emerging problem
which requires the capability to measure the effectiveness of these applications
in discovering the right correspondences between semantically-related data. One
of the most popular approaches to the evaluation of a matching application con-
sists in extracting a test dataset of linked data from an existing repository, such
as those available in the linked data project, in deleting the existing links, and in
measuring the capability of the application to automatically restore the deleted
links. However, the datasets extracted from a linked data repository suffer from three main limitations: i) the majority of them are created by acquiring data
from web sites using automatic extraction techniques, thus both data and links
are not validated nor checked; ii) the logical structure of these datasets is usually
quite simple and the level of semantic complexity quite low; iii) the number and
2 Related Work
In recent years, significant research effort has been devoted to ontology matching, with special attention to techniques for instance matching [7]. In the litera-
ture, most of the existing approaches/techniques use their individually created
benchmarks for evaluation, which makes a comparison difficult or even impos-
sible [10]. In the fields of object reconciliation, duplicate detection, and entity
resolution, which are closely related to instance matching, a widely used set of benchmarks is proposed by the Transaction Processing Performance Council (TPC)1, which focuses on evaluating transaction processing and databases. A
number of benchmarks are also available for XML data management. Popular
examples are presented in [3,5]. Since these datasets are not defined in a Seman-
tic Web language (e.g., OWL) their terminological complexity is usually very
shallow. In the area of ontology matching, the Ontology Alignment Evaluation
Initiative (OAEI) [6] has organized annual campaigns since 2005, aiming at evaluating ontology matching technologies through the use of common benchmarks.
1
http://www.tpc.org
However, the main focus of the past OAEI benchmarks was to compare and eval-
uate schema-level ontology matching tools. From 2009, a new track specifically
focused on instance matching applications has been introduced in OAEI and a
benchmark has been developed to this end [8]. The weakness of this 2009 OAEI
benchmark is the basic level of flexibility enforced during the dataset creation
and the limited size of the generated test cases. The benchmark provided by
Yatskevich et al. [13] is based on real-world data, using the taxonomy of Google
and Yahoo as input. In [13], the limitation of the proposed approach is the difficulty of creating an error-free gold standard, since the huge size of the datasets prevents a manual alignment. Intelligent semi-automatic approximations are used
to overcome such a weakness, however it is not possible to guarantee that all the
correct correspondences are found and that none of the found correspondences
is incorrect. The same problem arises with the linked data benchmarks DI2 and VLCR3. Alexe et al. [1] provide a benchmark for mapping systems, which generates schema files from a number of given parameters. Their automatic generation process ensures that a correct gold standard is obtained. However, real-world data
are not employed and artificial instances with meaningless content are mainly
considered. Other benchmarks in the area of ontology and instance matching are
presented in [13] and [9]. In these cases, the weak point is still the limited degree
of flexibility in generating the datasets of the benchmark. We stress that the pro-
posed SWING approach provides a general framework for creating benchmarks
for instance matching applications starting with a linked data source and ending
with various transformed ABox ontologies. In particular, the SWING approach
combines the strength of both benchmarks [1] and [12] by taking real-world data
from the linked data cloud as input and by performing transformations on them
which ensure that the gold standard must be correct in all cases. Moreover, a
further contribution of the SWING approach is the high level of flexibility en-
forced in generating the datasets through data transformations, which is a widely recognized weak point of the other existing benchmarks.
[Figure: overview of the SWING process — input data acquired from Freebase (http://www.freebase.com/) and example benchmark outputs (IIMB 2010), involving films such as Star Wars Episode IV: A New Hope and the actor Harrison Ford.]
base that contains information about 11 Million real objects including movies,
books, TV shows, celebrities, locations, companies and more. Data extraction
has been performed using the query language JSON together with the Freebase
JAVA API5 .
4 Data Acquisition
The SWING data acquisition phase is articulated in two tasks, called data selec-
tion and data enrichment. The task of data selection has the aim of finding the right
balance between the creation of a realistic benchmark and the manageability of
the dataset therein contained. In SWING, the data selection task is performed
according to an initial query that is executed against a linked data repository
with the supervision of the evaluation designer. In particular, the size of the
linked data source is narrowed down by i) selecting a specific subset of all available linked data classes and ii) limiting the individuals belonging to these selected classes. With the latter selection technique, we can easily scale the number of
individuals from hundreds to millions only by adjusting one single parameter.
The goal of the data enrichment task is to provide a number of data enrich-
ment techniques which can be applied to any linked data source for extending
its structural and semantic complexity from the description logic ALE(D) up
to ALCHI(D). This data enrichment is needed because, in the open linked data cloud, concept hierarchies are usually very shallow and disjointness axioms or domain and range restrictions are rarely defined. The limited level
of semantic complexity is a distinguishing feature of linked data. Nevertheless,
many matching applications are capable of dealing with data at different levels
of OWL expressiveness.
To illustrate the SWING enrichment techniques, we will refer to a small snip-
pet of the IIMB 2010 benchmark displayed in Figure 4. The black colored nodes,
arrows, and names represent information that has been extracted from Freebase,
while the gray colored information has been added according to the following
enrichment techniques of our SWING approach.
Add Super Classes and Super Properties. The designer can adopt two dif-
ferent approaches for determining new super classes. The first one is a bottom-up
approach where new super classes are created by aggregating existing classes.
Thereby, the evaluation designer has to define a super class name which encom-
passes all the classes to include. The second approach is top-down and it requires
to define how to split a class into more subclasses. The same approaches can be
applied for determining super object properties, respectively. This task is mainly
performed manually by the designer, with the support of the system to avoid
the insertion of inconsistency errors.
In IIMB we added, for instance, the following statements for classes and object properties:
5
http://code.google.com/p/freebase-java/. However, we stress that any kind of
linked data-compatible source can be used to implement the SWING approach.
[Figure: snippet of the IIMB 2010 benchmark — individuals a (Star Wars Episode IV: A New Hope), b (George Lucas), c (Harrison Ford) and d (Han Solo), classes such as Film, Science Fiction, Fantasy, Director, Creator, Person, Creature and Character, and properties such as directed_by, directs, created_by, featuring, featured_by, starring_in, acted_by and name.]
Specify Domain and Range Restrictions. For all existing properties, the challenge is to find the best, that is, the narrowest, domain and range without making the ontology inconsistent. For an object property, all the
possible domain and range restrictions can be determined by attempting to as-
sign all the existing classes to be the potential domain/range. If the ontology
is still consistent after the assignment of this new domain/range restriction, the
change is saved, otherwise it is discarded. In the example of IIMB 2010, among
the others, the following domain and range restrictions have been added to the
object property created by:
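The trial-and-rollback loop just described can be sketched as follows; add_range_axiom, remove_axiom and is_consistent are hypothetical helpers standing in for whatever ontology API and reasoner the evaluation designer uses, and the same loop applies to domain restrictions.

```python
def enrich_with_ranges(ontology, obj_property, classes,
                       add_range_axiom, remove_axiom, is_consistent):
    """For each class, tentatively assert it as the range of obj_property and
    keep the assertion only if the ontology stays consistent."""
    kept = []
    for cls in classes:
        axiom = add_range_axiom(ontology, obj_property, cls)   # hypothetical helper
        if is_consistent(ontology):                            # hypothetical reasoner call
            kept.append(cls)                                   # the change is saved
        else:
            remove_axiom(ontology, axiom)                      # the change is discarded
    return kept
```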
5 Data Transformation
Once the data acquisition phase is executed and a reference ontology O is pro-
duced, we start the SWING data transformation phase that has the goal of pro-
ducing a set T = {O1 , O2 , . . . , On−1 , On } of new ontologies, called test cases.
Each test case Oi ∈ T has the same schema (i.e., TBox) of O but a different set
of instances (ABox) generated by transforming the ABox AO of O. In detail,
the input of each transformation is a reference ontology O and a configuration
scheme C, which contains the specification of properties involved in the trans-
formations process, the kind of transformations enforced, and the parameters
required by the transformation functions. The output is a new ontology Oi . The
implementation of each ontology transformation can be described in terms of a
transformation function θ : AO → AiO , where AO and AiO denote two ABoxes
consistent with the TBox TO of the ontology O. The transformation function
θ maps each assertion αk ∈ AO into a new set of assertions θ(αk )C according
to the configuration scheme C. Thus, given a configuration scheme C and an
ontology O, a test case Oi is produced as follows:
    Oi = TO ∪ AiO,   with   AiO = ⋃_{k=1}^{N} θ(αk)C
property or a data property. In the first case, the value y is another individual,
while in the second case y is a concrete value. As an example, in the reference
ontology used for IIMB 2010, we have the following assertions: Director(b), name(b, “George Lucas”), created by(d, b), denoting the fact that b is a Director whose name is represented by the string “George Lucas”. Moreover, the object denoted by the individual d is created by the individual b (d denotes the character “Han Solo” as shown in Figure 2). Both
the kinds of assertions taken into account in SWING can be described in terms
of an assertion subject, denoted αs, that is, the individual to which the assertion α refers, an assertion predicate, denoted αp, that is the RDF property
rdf : type in case of class assertions or the property involved in the assertion in
case of property assertions, and an assertion object, denoted αo , that is a class
in case of class assertions and a concrete or abstract value in case of property
assertions. For example, referring to the IIMB 2010 example above, we have
αs1 = b, αp1 = rdf : type, αo1 = Director and αs2 = b, αp2 = name, αo2 = “George
Lucas”. According to this notation, we can define the individual description Dj
of and individual j into an ABox AO as follows:
Dj = {αk ∈ AO | αsk = j}
that is the set of all assertions in AO having j as subject. According to this notion
of individual description, we define also the notion of individual transformation
θ(j) as the result of the transformation θ(αk ) of each assertion αk in the definition
Dj of j.
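A tiny data-model sketch of this notation: assertions as (subject, predicate, object) tuples, the description D_j of an individual, and a transformation θ applied assertion by assertion; the concrete θ below (deletion of date_of_birth assertions) is only a stand-in for the operations defined in the following paragraphs.

```python
from collections import namedtuple

Assertion = namedtuple("Assertion", ["s", "p", "o"])   # subject, predicate, object

def description(abox, j):
    """D_j: all assertions of the ABox whose subject is the individual j."""
    return {a for a in abox if a.s == j}

def drop_date_of_birth(a):
    """One example of a transformation theta: property assertion deletion."""
    return set() if a.p == "date_of_birth" else {a}

def transform(abox, theta):
    """The transformed ABox: the union of theta(alpha) over every assertion alpha."""
    return set().union(*(theta(a) for a in abox))

abox = {Assertion("n", "name", "Natalie Portman"),
        Assertion("n", "date_of_birth", "1981-06-09"),
        Assertion("n", "born_in", "m")}
print(description(abox, "n"))               # D_n: all three assertions
print(transform(abox, drop_date_of_birth))  # the date_of_birth assertion is removed
```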
Preprocessing of the Initial Ontology. The preprocessing step has the goal
of adding some axioms to the ontology TBox O, that will be the reference TBox
for the rest of the transformation and will be identical for all the test cases.
These additions are required in order to implement some of the subsequent data
transformations without altering the reference TBox. In particular, we add two
kinds of axioms. As a first addition, we take into account all the data properties
Pi ∈ O and, for each property, we add a new object property Ri , such that
O = O ∪ Ri . Moreover, we add a data property has value to O. These additions
are required for transforming data property assertions into object property as-
sertions. The second addition is performed only if the semantic complexity of
the ontology chosen by the designer allows the usage of inverse properties. In
this case, we take into account all the object properties Ri that are not already
associated with an inverse property and we add to O a new property Ki such
that Ki ≡ Ri− .
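A possible rendering of this preprocessing step with rdflib is sketched below; the paper does not state which RDF library SWING uses, and the example namespace and the property-naming scheme are assumptions made only for illustration.

```python
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF, OWL

EX = Namespace("http://example.org/swing#")   # hypothetical namespace

def preprocess(g, allow_inverses):
    # collect the data properties P_i before adding anything new
    data_props = set(g.subjects(RDF.type, OWL.DatatypeProperty))
    g.add((EX.has_value, RDF.type, OWL.DatatypeProperty))   # the has value property
    for p in data_props:
        # for each data property P_i add a new object property R_i
        g.add((URIRef(str(p) + "_obj"), RDF.type, OWL.ObjectProperty))
    if allow_inverses:
        for r in set(g.subjects(RDF.type, OWL.ObjectProperty)):
            if (r, OWL.inverseOf, None) not in g and (None, OWL.inverseOf, r) not in g:
                # add an inverse property K_i for object properties lacking one
                g.add((URIRef(str(r) + "_inv"), OWL.inverseOf, r))
    return g
```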
Data structure transformation operations change the way data values are
connected to individuals in the original ontology graph and change the type
and number of properties associated with a given individual. A comprehensive
example of data structure transformation is shown in Table 3, where an initial
set of assertions A is transformed into the corresponding set of assertions A′ by
applying the property type transformation, property assertion deletion/addition,
and property assertion splitting.
Finally, data semantic transformation operations are based on the idea of
changing the way individuals are classified and described in the original ontology.
For the sake of brevity, we illustrate the main semantic transformation operations
by means of the following example, by taking into account the portion of TO and
the assertion sets A and A′ shown in Table 4.
Table 3. Example of data structure transformations

A                                    A′
name(n, “Natalie Portman”)           name(n, “Natalie”)
born in(n, m)                        name(n, “Portman”)
name(m, “Jerusalem”)                 born in(n, m)
gender(n, “Female”)                  name(m, “Jerusalem”)
date of birth(n, “1981-06-09”)       name(m, “Auckland”)
                                     obj gender(n, y)
                                     has value(y, “Female”)

Table 4. Example of data semantic transformations

A                                    A′
Character(k)                         Creature(k)
Creature(b)                          Country(b)
Creature(r)                          (r)
created by(k, b)                     creates(b, k)
acted by(k, r)                       featuring(k, r)
name(k, “Luke Skywalker”)            name(k, “Luke Skywalker”)
name(b, “George Lucas”)              name(b, “George Lucas”)
name(r, “Mark Hamill”)               name(r, “Mark Hamill”)
In the example, we can see how the combination of all the data semantic
operations may change the description of the individual k. In fact, in the original
set A, k (i.e., the Luke Skywalker of Star Wars) is a character created by the
individual b (i.e., George Lucas) and acted by r (i.e., Mark Hamill). In A′, instead,
k is a more generic “creature” and also the relation with r is more generic
(i.e., “featuring” instead of “acted by”). Moreover, the individual k is no longer
created by b as it was in A, but it is b that creates k. Furthermore, the individual b of
A′ cannot be considered the same as b ∈ A, since the classes Creature and
Country are disjoint.
According to Table 1, data transformation operations may also be categorized
as operations that add, delete or modify the information originally provided by
the initial ontology. Table 1 also shows how some operations are used to
implement more than one action over the initial ontology, such as in the case of
deletion and modification of string tokens, which are both implemented by means
of the operation ρ. Moreover, some operations cause more than one consequence
on the initial ontology. For example, the property assertion splitting ζ causes
both the modification of the original property assertion and the addition of
some new assertions in the new ontology.
[Figure: an example of the SWING data transformations applied to a Star Wars ontology fragment (characters, directors and article text), together with the expected results.]
6 Experimental Results
This section reports on the evaluation of SWING with a selection of algorithms in the field of instance matching. We recall that the goal of our evaluation is
to verify the capability of the IIMB 2010 benchmark generated with SWING
to provide a reliable and sufficiently complete dataset to measure the effective-
ness of different and often complementary matching algorithms/applications.
The considered matching algorithms are divided into two categories, namely
simple matching and complex matching. Simple matching algorithms are three
variations of string matching functions that are used for the comparison of a
selected number of property values featuring the test case instances. Complex
matching algorithms work on the structure and semantics of test cases and on
the expected cardinality of resulting mappings. In this category we executed
the algorithms LexicalMatch, StructuralMatch, and HMatch. LexicalMatch and
StructuralMatch [11] use integer linear programming to calculate the optimal
one-to-one alignment based on the sum of the lexical similarity values. In Lexical-
Match, no structural information is considered. StructuralMatch uses both lex-
ical and structural information. Finally, HMatch [4] is a flexible matching suite
where a number of matching techniques are implemented and organized in dif-
ferent modules providing linguistic, concept, and instance matching techniques
that can be invoked in combination.
A summary of the matching results is reported in Figure 4(a), where we show
the average values of the harmonic mean of precision and recall (i.e., FMeasure)
for each algorithm over data value, structure and semantic transformation test
cases, respectively. The last two groups of results refer to a combination of
transformations and to the benchmark as a whole, respectively.
The goal of our experiments is to evaluate the effectiveness of IIMB 2010, that
is, its capability of distinguishing among different algorithms when they are
tested on different kinds of transformations. To this end, we observe that IIMB
2010 allows us to stress the difference between simple and complex matching
algorithms. In fact, in Figure 4(a), the FMeasure for simple matching algorithms is
between 0.4 and 0.5, while we obtain values in the range 0.8-0.9 with complex
algorithms. It is interesting to see how simple algorithms have their best performance
on value transformations and their worst performance on structural transformations.
This result is coherent with the fact that simple matching takes into account
neither the semantics nor the structure of individuals, and it shows that
SWING simulates structural transformations correctly. In the case of semantic
transformations, instead, simple algorithms perform quite well
because many semantic transformations affect individual classification,
information that is ignored by simple algorithms. In Figure 4(b), we show
the values of precision and recall of the considered algorithms in the test cases
where all the expected mappings were one-to-one mappings (i.e., 1-1 Mappings)
and in the test cases where one-to-many mappings were expected for 20% of
the individuals (i.e., 1-N Mappings). We note that the algorithms that are not
based on the assumption of finding one-to-one mappings (i.e., simple matching
algorithms and HMatch) have similar results in case of 1-1 and 1-N mappings.
Instead, LexicalMatch and StructuralMatch are based on the idea of finding the
best 1-1 mapping set. Thus, precision and recall of these algorithms are lower in the 1-N mapping test cases.
Fig. 4. Results for the execution of different matching algorithms against IIMB 2010
7 Concluding Remarks
In this paper we have presented SWING, our approach to the supervised gen-
eration of benchmarks for the evaluation of matching applications. Experiments
presented in the paper show that SWING is applicable to the evaluation of real
matching applications with good results. Our future work is focused on collecting
more evaluations results in the instance matching evaluation track of OAEI 2010,
where SWING has been used to generate the IIMB 2010 benchmark. Moreover,
we are interested in studying the problem of extending SWING to the creation
of benchmarks for evaluation of ontology matching applications in general, by
providing a suite of comprehensive evaluation techniques and tools tailored for
the specific features of TBox constructs.
References
1. Alexe, B., Tan, W.C., Velegrakis, Y.: STBenchmark: towards a Benchmark for
Mapping Systems. Proc. of the VLDB Endowment 1(1), 230–244 (2008)
2. Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a Collabo-
ratively Created Graph Database for Structuring Human Knowledge. In: Proc. of
the ACM SIGMOD Int. Conference on Management of Data, pp. 1247–1250 (2008)
3. Bressan, S., Li Lee, M., Guang Li, Y., Lacroix, Z., Nambiar, U.: The XOO7 Bench-
mark. In: Proc. of the 1st VLDB Workshop on Efficiency and Effectiveness of XML
Tools, and Techniques, EEXTT 2002 (2002)
4. Castano, S., Ferrara, A., Montanelli, S.: Matching Ontologies in Open Networked
Systems: Techniques and Applications. Journal on Data Semantics V (2006)
5. Duchateau, F., Bellahsene, Z., Hunt, E.: XBenchMatch: a Benchmark for XML
Schema Matching Tools. In: Proc. of the 33rd Int. Conference on Very Large Data
Bases, VLDB 2007 (2007)
6. Euzenat, J., Ferrara, A., Hollink, L., et al.: Results of the Ontology Alignment Eval-
uation Initiative 2009. In: Proc. of the 4th Int. Workshop on Ontology Matching,
OM 2009 (2009)
7. Euzenat, J., Shvaiko, P.: Ontology Matching. Springer, Heidelberg (2007)
8. Ferrara, A., Lorusso, D., Montanelli, S., Varese, G.: Towards a Benchmark for
Instance Matching. In: Proc. of the ISWC Int. Workshop on Ontology Matching,
OM 2008 (2008)
9. Guo, Y., Pan, Z., Heflin, J.: An Evaluation of Knowledge Base Systems for Large
OWL Datasets. In: McIlraith, S.A., Plexousakis, D., van Harmelen, F. (eds.) ISWC
2004. LNCS, vol. 3298, pp. 274–288. Springer, Heidelberg (2004)
10. Koepcke, H., Thor, A., Rahm, E.: Evaluation of Entity Resolution Approaches on
Real-World Match Problems. In: Proc. of the 36th Int. Conference on Very Large
Data Bases, VLDB 2010 (2010)
11. Noessner, J., Niepert, M., Meilicke, C., Stuckenschmidt, H.: Leveraging Termino-
logical Structure for Object Reconciliation. In: Aroyo, L., Antoniou, G., Hyvönen,
E., ten Teije, A., Stuckenschmidt, H., Cabral, L., Tudorache, T. (eds.) ESWC 2010.
LNCS, vol. 6089, pp. 334–348. Springer, Heidelberg (2010)
12. Perry, M.: TOntoGen: A Synthetic Data Set Generator for Semantic Web Appli-
cations. AIS SIGSEMIS Bulletin 2(2), 46–48 (2005)
13. Yatskevich, M., Giunchiglia, F., Avesani, P.: A Large Scale Dataset for the Eval-
uation of Matching Systems. In: Franconi, E., Kifer, M., May, W. (eds.) ESWC
2007. LNCS, vol. 4519, Springer, Heidelberg (2007)
Efficiently Evaluating Skyline Queries on RDF Databases
Abstract. Skyline queries are a class of preference queries that compute the
pareto-optimal tuples from a set of tuples and are valuable for multi-criteria
decision making scenarios. While this problem has received significant
attention in the context of a single relational table, skyline queries over joins of
multiple tables, which are typical of storage models for RDF data, have received
much less attention. A naïve approach such as a join-first-skyline-later strategy
splits the join and skyline computation phases, which limits opportunities for
optimization. Other existing techniques for multi-relational skyline queries
assume storage and indexing techniques that are not typically used with RDF
and would therefore require a preprocessing step for data transformation. In this paper,
we present an approach for optimizing skyline queries over RDF data stored
using a vertically partitioned schema model. It is based on the concept of a
“Header Point” which maintains a concise summary of the already visited
regions of the data space. This summary allows some fraction of non-skyline
tuples to be pruned from advancing to the skyline processing phase, thus
reducing the overall cost of expensive dominance checks required in the skyline
phase. We further present more aggressive pruning rules that result in the
computation of near-complete skylines in significantly less time than the
complete algorithm. A comprehensive performance evaluation of different
algorithms is presented using datasets with different types of data distributions
generated by a benchmark data generator.
1 Introduction
The amount of RDF data available on the Web is growing rapidly with the
broadening adoption of Semantic Web tenets in industry, government and research
communities. With datasets increasing in diversity and size, more and more
research effort has been spent on supporting complex decision making over such data.
An important class of queries for this purpose is preference queries, and in
particular skyline queries. Skyline queries are valuable for supporting multi-criteria
decision making and have been extensively investigated in the context of relational
databases [1][2][3][4][5][6][11][12], but only in a very limited way for the Semantic Web [8].
A skyline query over a D-dimensional data set S returns the subset of S containing the points that are not dominated by any other data point in S. For two D-dimensional data points p = (p1, p2, …, pD) and q = (q1, q2, …, qD), point p is said to dominate q if p is no worse than q in every dimension and strictly better than q in at least one dimension.
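For reference, the dominance relation used throughout the paper can be written as follows (a plain illustration, assuming all dimensions are to be minimized):

```python
def dominates(p, q):
    """p dominates q iff p is no worse than q everywhere and strictly better somewhere."""
    return all(pi <= qi for pi, qi in zip(p, q)) and any(pi < qi for pi, qi in zip(p, q))

def skyline(points):
    """Brute-force skyline of a list of equal-length tuples (for illustration only)."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]
```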
1.1 Contributions
This paper proposes an approach for efficient processing of skyline queries over RDF
data that is stored as vertically partitioned relations. Specifically, we propose
• The concept of a Header Point that maintains a concise summary of the already
visited region of the data space, used for pruning incoming non-skyline tuples during
the join phase. This improves efficiency by reducing the number of comparisons needed
during the later skyline processing phase.
Consider again the example of a sales promotion targeting young customers with
little debt. Also assume that the company would like to focus its campaigns on
customers that live close to some branch. We express such a query using an extended
SPARQL syntax, as shown in Figure 1 (b). The properties in front of the MIN/MAX keywords
are the skyline properties (dimensions) to be considered during the skyline
computation. The MIN/MAX keyword specifies whether the value of the
corresponding property is to be minimized or maximized. We now formalize the concept
of a skyline graph pattern, which builds on the formalization of SPARQL graph pattern
queries.
An RDF triple is a 3-tuple (s, p, o), where s is the subject, p is the predicate and o is
the object. Let I, L and B be the pairwise disjoint infinite sets of IRIs, literals
and blank nodes. Also assume the existence of an infinite set V of variables disjoint from
the above sets. A triple pattern is a triple in which variables from V may appear in the
subject, predicate or object position. A graph pattern is a combination of triple patterns
by means of the binary operators UNION, AND and OPT. Given an RDF database graph, a solution
to a graph pattern is a substitution of the variables in the graph pattern that yields a
subgraph of the database graph, and the answer to the graph pattern is the set of all
such solutions.
In other words, at the end of a join iteration we would have computed D tuples,
each of which is based on the 2-tuple pointed to by the table pointer of some dimension VPT.
A Header Point summarizes the region of data explored in earlier join iterations.
It enables a newly joined tuple in the subsequent join iteration to be compared against this
summary rather than requiring multiple comparisons against the tuples in the skyline candidate list.
Definition 2 (Header Point). Let t1, t2, …, tD be the tuples computed in the jth join
iteration. A Header Point of the computation is a tuple H = (f1(t1[1], …, tD[1]),
f2(t1[2], …, tD[2]), …, fD(t1[D], …, tD[D])), where each fi is either the min() or the
max() function, i.e. the one selecting the worst value for dimension i. We call
the tuples that form the basis of the Header Point (i.e. the ti's) Header Tuples.
To illustrate the advantage of the header point concept, we will use a smaller version
of our motivating example considering only a graph sub pattern with skyline
properties, Age and Debt. We will assume that data is organized as VPT and indexed
by Subject (SO) and by Object (OS). Using the OS index, the triples can be accessed
in decreasing order of “goodness” when minimizing/maximizing skyline properties,
i.e. in increasing/decreasing order of object values. Let us consider the earliest join
iteration involving the first tuples of each relation. Figure 2 (a) shows the table
pointers (JAge and JDebt) for the two relations and the two red lines show the matching
tuples to be joined resulting in the tuples T1 (C1, 25, 2800) and T2 (C13, 32, 800)
shown in Figure 2 (b). Since these tuples are formed from the best tuples in each
dimension, they have the best values in their respective dimensions, therefore no other
tuples dominate them and they are part of the skyline result. We can create a header
point based on the worst values (here, the largest values) in each dimension (Age and
Debt) across all the currently joined tuples resulting in a tuple H (32, 2800). T1 and
T2 are called Header Tuples.
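The construction just described can be summarized by the small sketch below (not the authors' code); tuples are reduced to their dimension values and a per-dimension flag states whether the dimension is minimized.

```python
def header_point(header_tuples, minimize):
    """header_tuples: list of equal-length value tuples (ids stripped);
    minimize: list of booleans, one per dimension (True = minimized)."""
    dims = zip(*header_tuples)                 # group values per dimension
    return tuple(max(vals) if mini else min(vals)   # worst value per dimension
                 for vals, mini in zip(dims, minimize))

# With the first-iteration header tuples of the example:
# header_point([(25, 2800), (32, 800)], [True, True]) -> (32, 2800), i.e. H
```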
Our goal with the header point is to use it to determine whether newly joined tuples
in subsequent iterations should be considered as candidate skyline tuples or be pruned
from advancing to the expensive skyline processing phase. For example, assume that
in the next join iteration we compute the next set of joined tuples by advancing the
table pointer to the next tuple in each VPT. Then, one of the resulting joined tuples is
(C12, 25, 3100). For the rest of the paper, we will use C12 to denote the tuple (C12,
25, 3100). However, the current candidate skyline list { (C1, 25, 2800), (C13, 32,
800) } contains a tuple (C1, 25, 2800) that clearly dominates C12, therefore C12
should not be considered as a candidate skyline tuple. Further, we should be able to
use the information from our header point to make this determination rather than
compare every newly joined tuple with tuples in the candidate skyline list. We
observe the following relationship between newly joined tuples and the header point
that can be exploited: If a newly joined tuple is worse than the header point in at least
one dimension, then that new tuple cannot be a skyline tuple and can be pruned.
Applying this rule to our previous example, we will find that the tuple C12 is worse
than our header point H in the Debt dimension. Therefore, it can be pruned. In
contrast, the other joined tuple (C6, 30, 1200) is not worse than H in any
dimension (in fact, it is better in both dimensions) and is not pruned. Figure 2 (b)
shows the relationships between these points in a 2-dimensional plane: any
subsequent joined tuple located in the regions of S, Q and T can be pruned by header
point H. We now make a more precise statement about the relationship between a
header point and tuples created after the creation of the header point.
Lemma 1. Given a D-dimensional header point H = (h1, h2, …, hD), any
“subsequent” D-tuple (i.e. a tuple constructed after the current header point) whose
values are worse than H in at least D − 1 dimensions is not a skyline point.
Proof. Let P = (p1, p2, …, pD) be a new tuple that is worse than the current header
point H = (h1, h2, …, hD) in at least D − 1 dimensions but is a candidate skyline
point. Let H be the header point that was just formed during the construction of the
most recently computed set B = {(b1, x12, …, x1D), (x21, b2, …, x2D), …,
(xD1, xD2, …, bD)} of candidate skyline points (header tuples), where bi
denotes the next best value in dimension i. Recall that B consists of the last D tuples that
resulted from the join between the best tuple in each dimension and a matching tuple in
each of the other dimensions.
Assume that the dimensions in which P has worse values than the header point H are
dimensions 2 to D. Then, H “partially dominates” P in dimensions 2 to D. Further,
since the header point is formed from the combination of the worst values in each
dimension across the current header tuples, P is also partially dominated by the
current header tuples, which are also current candidate skyline tuples. Therefore, the
only way for P to remain in the skyline is that no candidate skyline tuple dominates it
in the only remaining dimension, dimension 1. However, the header tuple
T1 = (b1, x12, …, x1D), which is currently a candidate skyline tuple, has a dimension-1
value b1 that is better than p1. This is because the values are sorted and visited
in decreasing order of “goodness” and the tuple T1 was constructed before P, so the
value b1 must be better than p1. This means that T1 is better than P in all
dimensions and therefore dominates P. Therefore, P cannot be a skyline tuple, which
contradicts our assumption.
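In code form, the pruning test implied by Lemma 1 amounts to the following check (a sketch assuming minimized dimensions, so "worse" means strictly larger):

```python
def prunable(t, header):
    """Prune t if it is worse than the header point in at least D-1 dimensions."""
    worse = sum(1 for v, h in zip(t, header) if v > h)
    return worse >= len(header) - 1

# From the running example: prunable((25, 3100), (32, 2800)) is True (C12 is pruned),
# while prunable((30, 1200), (32, 2800)) is False (C6 survives).
```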
Updating the Header Point. Since the header point is formed using the worst values in each dimension
among the joined tuples, it may represent very loose boundary conditions, which
significantly reduces its pruning power. This occurs when these worst values are
among the worst overall in the relations. In our example, this occurs in the first join
iteration with the construction of the initial header point H (32, 2800). To strengthen
the pruning power of the header point, we can update its values as we encounter new
tuples with better values, i.e. the next set of D tuples whose worst values are
better than the worst values in the current header point. These tuples become the
next iteration's header tuples.
Figure 3 (a) shows the second join iteration where new header tuples are used to
update the Header Point. The next tuple in Age VPT is (C12, 25) and its joined tuple
is (C12, 25, 3100). Compared with H, C12 can be pruned since it is worse than H in at
least D-1 dimensions (D is 2). RSJFH advances the table pointer to the next tuple in
the Age table, (C2, 26) whose joined tuple is (C2, 26, 2000). Compared with H, C2 is
not pruned and this tuple is adopted as a header tuple. Then, RSJFH moves to the next
VPT Debt where the next tuple is (C6, 1200) and its joined tuple is (C6, 30, 1200).
Compared with H, C6 is not pruned. Now, there is one header tuple from each VPT
and the header point can be updated to H’ (30, 2000). Similarly, in the third join
iteration (Figure 3 (b)), RSJFH checks the subsequent joined tuples in tables Age and
Debt and finds (C5, 28, 1400) and (C5, 28, 1400) are the next header tuples in tables
Age and Debt respectively. Then, the header point is updated to H’’ (28, 1400) based
on these two header tuples.
If, after a join iteration, the header point is not updated, or either one of the table
pointers advances to the end of its VPT, then the search for additional candidate
skyline tuples can be terminated losslessly.
Proof. Recall that during a join iteration we pick the next best tuple that is not
pruned in each dimension and join it with the other dimensions to form the next D
candidate skyline tuples. So each resulting tuple contains the next best value in
some dimension, i.e. B = {(b1, x12, …, x1D), (x21, b2, …, x2D), …, (xD1, xD2, …, bD)}.
If, after computing the set B, the header point is not updated, it
implies that each bi is worse than the corresponding hi in H. It is clear that all
these tuples can be pruned, because their best dimension values are worse than the
current worst values seen in our header point, and the values in the other dimensions
are clearly also worse. Thus, the header tuples dominate all of them. Further, since the
tuples in each dimension are ordered in decreasing order of “goodness”, the next
set of best values cannot be better than those currently considered. Therefore, no
better tuples can be found by further scanning of the tables. When either one of the
table pointers advances to the end of its VPT, all the values for that
dimension have been explored. Therefore, no new D-tuples can be formed and the
search for additional candidate skyline tuples can be terminated losslessly.
Discussion. Intuitively, the header point summarizes neighborhoods that have been
explored and guides the pruning of the tuples in neighborhood around it during each
iteration. However, since RSJFH uses the worst values in each dimension, and prunes
only tuples that are really worse (worse in D−1 dimensions) than the header point, it only
allows conservative pruning decisions. Therefore, some non-skyline points can still be
inserted into the candidate skyline list. For example, assume that RSJFH has just
processed the following tuples into the candidate skyline list {(25, 4000), (28, 3500),
(30, 3000)} and computed the header point (30, 4000). Then, an incoming tuple (29,
3750) would be advanced to the next stage because it is better than the header point in
both dimensions. However, it is unnecessary to advance the tuple (29, 3750) into the
next stage of the computation because it would eventually be dropped from the candidate
skyline list, since the tuple (28, 3500) dominates it.
4 Near-Complete Algorithms
We can try to improve the header point to allow for more aggressive pruning. We
may however risk pruning out correct skyline tuples. In the following section, we
propose a strategy that strikes a balance between two objectives: increasing the
pruning power of the header point and minimizing the risk of
pruning correct skyline tuples. We posit that in the context of the Web, partial results
can be tolerated particularly if the result set is “near-complete” and can be generated
much faster than the “complete” set. The approach we take is based on performing
partial updates to the header point after each join iteration rather than updating all the
dimensions.
Definition 3 (Partial Header Point Update). Let H be the header point generated in
the previous join iteration and t1, t2, …, tD be the tuples in the current join iteration.
A partial update to the header point means that, for the ith dimension of the header point,
the value hi is updated only if the worst value of t1, t2, …, tD in the ith dimension is
better than the ith dimensional value of H.
This implies that if all values in the ith dimension are better than the ith dimensional
value of the header point, then the ith dimension of the header point is updated with
the worst value, as before; otherwise, the ith dimension is not updated. Thus, the header
point is aggressively updated by the improving (or advancing) dimension values of
the tuples joined in the current join iteration.
Fig. 4. Partially Update Header Points by Bitmap Computation and its Termination
Algorithm 2. RDFSkyJoinWithPartialHeader(RSJPH)
INPUT: n VPT which are sorted according to object value, VPTList.
OUTPUT: A set of skyline points.
Define Variables: for the n newly-joined tuples t1, t2, …, tn, let the dimension-1 value of t1 be
v1, the dimension-2 value of t2 be v2, …, and the dimension-n value of tn be vn.
1. Initialization // to get the first header point H
2. While (H is updated) && (<v1, v2, …, vn> is better than H) do
3.   for each VPT t ∈ VPTList do
4.     read one tuple and hash-join to get the complete tuple T, and compare T with H
5. use the bitmap representation to record T’s better value and worse value
6. if T is prunable by H then
7. T is pruned
8. else
9. T is inserted into Candidate List C
10. end for
11. read the bitmap as bit-slice and calculate bit-slice value by bitwise AND operation
12. for each bit-slice do // partially update H
13. if bit-slice value is 1 then
14. update the corresponding header point value
15. else
16. no update
17. end for
18.end while
19.BNLCalculate(C).
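The partial update at the core of Algorithm 2 can be paraphrased in a few lines of Python (a sketch, not the authors' implementation); minimized dimensions are assumed and the bitmap of Lines 5 and 11 is represented as a list of booleans.

```python
def partial_update(header, new_tuples):
    """Update dimension i of H only when every newly joined tuple improves on it."""
    header = list(header)
    for i in range(len(header)):
        column = [t[i] for t in new_tuples]
        better_bits = [v < header[i] for v in column]   # the bit-slice for dimension i
        if all(better_bits):                            # bitwise AND of the slice is 1
            header[i] = max(column)                     # worst of the improving values
    return tuple(header)

# e.g. partial_update((32, 2800), [(26, 2000), (30, 1200)]) returns (30, 2000)
```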
Discussion. In RSJFH, we always update the dimensions using the worst values from
the header tuples of that iteration, regardless of whether those values are better or
worse than the current header point values. Essentially, a header point summarizes
only the iterations preceding it. In RSJPH, a header point may hold “good” pruning
values from earlier iterations, updating only those dimensions whose values are
improved in the current iteration.
Given a header point, if the ith dimension is updated in the last iteration, we regard it
as an “updated dimension”; otherwise, we regard it as “non-updated dimension”.
Given a tuple p, if p has n dimensions whose values are better than that of the header
point h, we say that p has n better dimensions. From Lemma 1, we can infer that a
tuple p needs to have at least two better dimensions to survive the pruning check.
Assume that we have: (1) Hp = (u, w), a header point partially updated from a fully
updated header point Hf = (u′, w′), where u ≻ u′ and w = w′ (≻ denotes
“better”); thus, u is the “updated” dimension of Hp and w is the “non-updated”
dimension of Hp; and (2) a tuple p = (p1, p2), where p1 ≺ u, p1 ≻ u′ and p2 ≻ w.
Compared to Hp, p can be pruned because p has only one better dimension. However,
when compared to Hf, p would survive the pruning check since p has two better
dimensions (p1 ≻ u′ and p2 ≻ w′). Since the partial update approach makes the
“updated” dimensions of Hp too good, tuples that would survive given the fully
updated header point Hf, such as p, are mistakenly pruned. To alleviate this situation,
we relax the pruning condition with the following crosscheck constraint.
Crosscheck. If an incoming tuple has some dimensional values better than a “non-updated”
dimension and some dimensional values worse than an “updated” dimension,
we add this tuple to the candidate list. To implement this algorithm, we basically add
this additional condition check between Lines 6 and 7 in Algorithm 2. The resulting
algorithm is called RDFSkyJoinWithPartialHeader+ (RSJPH+).
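The crosscheck can be expressed as the following predicate (a sketch under the same assumptions as before: minimized dimensions and a per-dimension flag recording whether the dimension was updated in the last iteration):

```python
def crosscheck(t, header, updated):
    """Keep t as a candidate if it beats H in some non-updated dimension
    while losing to H in some updated dimension."""
    better_non_updated = any(v < h for v, h, u in zip(t, header, updated) if not u)
    worse_updated = any(v > h for v, h, u in zip(t, header, updated) if u)
    return better_non_updated and worse_updated
```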
Proposition 2. Let H be a header point in RSJPH with an “updated” dimension u
that has been updated in iteration i−1 and a “non-updated” dimension w that has
not been updated in iteration i−1. Let p be a new tuple that has failed the pruning
5 Experimental Evaluation
Experimental Setup and Datasets. In this section, we present an experimental
evaluation of the three algorithms presented in the sections above, in terms of scalability,
dimensionality, average completeness coverage and prunability. We use the synthetic
datasets with independent, correlated and anti-correlated data distributions generated
by a benchmark data generator [1]. Independent data points follow the uniform
distribution. Correlated data points are not only good in one dimension but also good
in other dimensions. Anti-correlated data points are good in one dimension but bad in
one or all of the other dimensions. All the data generated by the data generator is
converted into RDF format using the Jena API and is stored as VPTs using BerkeleyDB.
All the algorithms are implemented in Java and the experiments are executed on a
Linux machine (kernel 2.6.18) with a 2.33GHz Intel Xeon and 16GB of memory. The
detailed experimental results can be found at sites.google.com/site/chenlingshome.
Scalability. Figure 5 (A), (B) and (C) show the scalability evaluation of RSJFH,
RSJPH, RSJPH+ and Naïve for independent, correlated and anti-correlated datasets (1
million to 4 million triples). In all data distributions, RSJPH and RSJPH+ are superior
to RSJFH and Naïve. The difference in execution time between RSJPH, RSJPH+ and
RSJFH comes from the fact that the partial update method makes the header point
stronger (i.e., the header point has better values in each dimension and can dominate
more non-skyline tuples, resulting in stronger prunability) earlier, which terminates the
algorithm earlier. For independent data (Figure 5 (A)), RSJPH and RSJPH+ use only
about 20% of the execution time needed in RSJFH and Naïve. The execution time of
RSJFH increases quickly in the independent dataset with size 4M of triples. The
reason for this increase is that the conservativeness of the full header point update
technique leads to limited effectiveness in prunability. This results in an increased
size for the candidate skyline set and, consequently, an increased total number of comparisons with
the header point. RSJPH+ relaxes the check condition in RSJPH and so allows more
tuples to be inserted into the candidate list, which explains the slight increase in the
execution time in Figure 5(A). Figure 5 (B) shows that RSJFH, RSJPH and RSJPH+
perform better in correlated datasets than in independent datasets. In the correlated
data distribution, the header points tend to become stronger earlier than in the case of
independent datasets especially when the data is accessed in the decreasing order of
“goodness”. The reason for this is that the early join iterations produce tuples that are
made of the best values in each dimension. Stronger header points make the algorithm
terminate earlier and reduce the number of tuples joined and checked against the
header point and the size of candidate skyline set. Figure 5 (C) shows particularly bad
performance for the anti-correlated datasets which often have the best value in one
dimension but the worst value in one or all of the other dimensions. This leads to very
weak header points because header points are constructed from worst values of joined
tuples. RSJFH has to explore almost the entire search space, resulting in the poor
performance shown in Figure 5 (C). Although RSJPH seems to outperform the other
algorithms, this advantage is attributed to the fact that it computes only 32% of
the complete skyline result set.
[Fig. 5. Experimental results: (A)-(C) execution time (sec) vs. triple size (1M-4M) for independent, correlated and anti-correlated data; (D)-(F) execution time (sec) vs. number of dimensions (2-5) for the same distributions; (G)-(H) average completeness coverage and average prunability (%) per data distribution. Each plot compares Naïve, RSJFH, RSJPH and RSJPH+.]
Dimensionality. Figure 5 (D), (E) and (F) show the effect of increasing the
dimensionality (2-5) on the performance of the four algorithms for different data
distributions. As in previous experiments, RSJPH and RSJPH+ consistently
outperform RSJFH and Naïve. The execution time of RSJFH starts to increase when the
number of dimensions is greater than 3. The reason is that the conservative way of
updating the header point greatly reduces its pruning power on high-dimensional data,
and the extra comparisons almost double the total execution time. For RSJPH+, as the
number of dimensions increases, the number of tuples saved by the crosscheck condition
increases; therefore, the size of the candidate skyline set increases and the execution
time increases as well.
Average Completeness Coverage and Average Prunability. Figure 5 (G) and (H)
show the average completeness coverage (ACC) and average prunability (AP) of RSJFH,
RSJPH and RSJPH+. ACC and AP are the averages for completeness coverage and the
number of pruned triples across all the experiments shown in Figure 5 (A) to (F)
respectively. The pruned triples include the ones pruned by header points as well as the
ones pruned by early termination. Figure 5 (E) shows that RSJFH has 100% of ACC in
all data distributions. For the correlated datasets, the data distribution is advantageous in
forming a strong header point without harming the completeness of skyline results when
the data is sorted from “best” to “worst”. Thus, RSJFH, RSJPH and RSJPH+ have 100%
of ACC and 99% of AP in correlated datasets. For independent datasets, RSJPH
aggressively updates the header points to increase AP with the cost of decreasing ACC.
RSJPH+ improves the ACC by using crosscheck while only sacrificing 2.7% of the AP
compared with RSJPH. For anti-correlated datasets, the data distribution makes all the
algorithms perform poorly. Although RSJFH achieves 100% of ACC, the AP decreases
to 7%. RSJPH still maintains 99% for AP but its ACC is only 32%.
RSJPH+ achieves a good tradeoff between completeness coverage and prunability.
RSJPH+ computes about 80% of skyline results when it scans about the first 35% of
sorted datasets.
6 Related Work
In recent years, much effort has been spent on evaluating skylines over a single relation.
[1] first introduced the skyline operator and proposed BNL, D&C and an
algorithm using B-trees that adopted the first step of Fagin's algorithm. [2][3][9] proposed
algorithms that can terminate earlier based on sorting functions. [4][5][6] proposed
index-based algorithms that can progressively report results. Since these approaches focus
on a single relation, they consider skyline computation independently from the join phase,
which makes query execution blocking.
Some techniques have been proposed for skyline-join over multiple relations. [11]
proposed a partitioning method that classified tuples into three types: general, local
and non-local skyline tuples. The first two types are joined to generate a subset of the
final results. However, this approach is not suitable for single-dimension tables like
VPTs [7] in RDF databases, because each VPT can only be divided into general and
local skyline tuples, neither of which can be pruned, requiring a complete join of all
relevant tables. [16] proposed a framework, SkyDB, that partitions the skyline-join
process into a macro and a micro level. The macro level generates abstractions while the micro
level populates regions that are not pruned at the macro level. Since RDF databases only
involve single-dimension tables, SkyDB is not suitable for RDF databases.
In addition, there are some techniques proposed for skyline computation for
Semantic Web data and services. [8] focused on extending SPARQL with support of
expression of preference queries but it does not address the optimization of query
processing. [13] formulated the problem of semantic web services selection using the
notion of skyline query and proposed a solution for efficiently identifying the best
match between requesters and providers. [14] computed the skyline of QoS-based web
services and [15] proposed several algorithms to retrieve the top-k dominating
advertised web services. [17] presented methods for automatically relaxing over-
constrained queries based on domain knowledge and user preferences.
References
[1] Börzsönyi, S., Kossmann, D., Stocker, K.: The Skyline Operator. In: ICDE (2001)
[2] Chomicki, J., Godfrey, P., Gryz, J., Liang, D.: Skyline with Presorting. In: ICDE (2003)
[3] Bartolini, I., Ciaccia, P., Patella, M.: SaLSa: computing the skyline without scanning the
whole sky. In: CIKM, Arlington, Virginia, USA, pp. 405–414 (2006)
[4] Tan, K.L., Eng, P.K., Ooi, B.C.: Efficient Progressive Skyline Computation. In: VLDB,
San Francisco, CA, USA, pp. 301–310 (2001)
[5] Kossmann, D., Ramsak, F., Rost, S.: Shooting stars in the sky: an online algorithm for
skyline queries. In: VLDB, HK, China, pp. 275–286 (2002)
[6] Papadias, D., Tao, Y., Fu, G., Seeger, B.: Progressive Skyline Computation in
Database Systems. ACM Trans. Database Syst. (2005)
[7] Abadi, D.J., Marcus, A., Madden, S.R., Hollenbach, K.: Scalable semantic web data
management using vertical partitioning. In: VLDB, Vienna, Austria, pp. 411–422 (2007)
[8] Siberski, W., Pan, J.Z., Thaden, U.: Querying the semantic web with preferences. In:
Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo,
L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 612–624. Springer, Heidelberg (2006)
[9] Godfrey, P., Shipley, R., Gryz, J.: Maximal Vector Computation in Large Data Sets. In:
VLDB, Norway (2005)
[10] Raghavan, V., Rundensteiner, E.A.: Progressive Result Generation for Multi-Criteria
Decision Support Queries. In: ICDE (2010)
[11] Jin, W., Ester, M., Hu, Z., Han, J.: The Multi-Relational Skyline Operator. In: ICDE
(2007)
[12] Sun, D., Wu, S., Li, J., Tung, A.K.H.: Skyline-join in Distributed Databases. In: ICDE
Workshops, pp. 176–181 (2008)
[13] Skoutas, D., Sacharidis, D., Simitsis, A., Sellis, T.: Serving the Sky: Discovering and
Selecting Semantic Web Services through Dynamic Skyline Queries. In: ICSC, USA
(2008)
[14] Alrifai, M., Skoutas, D., Risse, T.: Selecting Skyline Services for QoS-based Web Service
Composition. In: WWW, Raleigh, NC, USA (2010)
[15] Skoutas, D., Sacharidis, D., Simitsis, A., Kantere, V., Sellis, T.: Top-k Dominant Web
Services Under Multi-Criteria Matching. In: EDBT, Russia, pp. 898–909 (2009)
[16] Raghavan, V., Rundensteiner, E.: SkyDB: Skyline Aware Query Evaluation Framework.
In: IDAR (2009)
[17] Dolog, P., Stuckenschmidt, H., Wache, H., Diederich, J.: Relaxing RDF Queries based on
User and Domain Preferences. JIIS 33(3) (2009)
The Design and Implementation of Minimal
RDFS Backward Reasoning in 4store
1 Introduction
RDF stores - or triple stores - implement some features that make them very
attractive for certain types of applications. Data is not bound to a schema and it
can be asserted directly from RDF sources (e.g. RDF/XML or Turtle files) due to
their native support of Semantic Web data standards. But the most attractive
characteristic is the possibility of implementing an entailment regime. Having
entailment regimes in a triple store allows us to infer new facts, exploiting the
semantics of properties and the information asserted in the knowledge base. To
agree on common semantics, standards have arisen that provide different levels of
complexity encoded in sets of inference rules, from RDF and RDFS to
OWL and RIF, each of them applicable to different scenarios.
Traditionally, reasoning can be implemented via forward chaining (FC hence-
forth), backward chaining (or BC), or hybrid algorithms (a mixture of the two).
Minimal RDFS refers to the RDFS fragment published in [8].
1
Preliminary results were presented at the Web-KR3 Workshop [10] and demoed at
ISWC 2010 [9].
FC algorithms tend to apply a set of inference rules to expand the data set be-
fore or during the data assertion phase; with this approach, the database will
contain all the facts that are needed when a query is issued. On the other hand,
BC algorithms are goal directed and thus the system fires rules and/or axioms
at runtime to find the solutions. Hybrid algorithms use a combination of for-
ward and backward chaining. The pros and cons of these approaches are well
known to the AI and database community. FC approaches force the system to
retract entailments when there is an update or insert, making data transactions
very expensive. Complete materialisation of a knowledge base could lead to an
explosion of data not manageable by current triple store technology. Backward
chaining performs better for data transactions and the size of the KB is smaller,
but queries tend to have worse performance.
Reasoners are also classified by their level of completeness. A reasoner can
claim to be complete over an entailment regime R if and only if : (a) it is able
to detect all entailments between any two expressions; and (b) it is able to draw
all valid inferences; according to R semantics. For some types of applications a
complete reasoner might be required but one should assume that higher com-
pleteness tends to degrade query response time. There is, therefore, a clear com-
promise: performance versus completeness and the old AI debate about speed
and scalability versus expressiveness. In our specific case, 4sr excludes a subset of
RDFS semantics rarely used by Semantic Web applications and implements the
semantics from the Minimal RDFS fragment [8]. 4sr entailment is complete con-
sidering Minimal RDFS semantics but incomplete for the full normative RDFS
semantics [5]. In that sense, our main contribution is a system that proves that
Minimal RDFS semantics can scale if implemented in a clustered triple store. In
comparison to our previous research, this paper formalizes 4sr against Minimal
RDFS, describes the components to be synchronized across the cluster, and
benchmarks the bind operation to test its scalability.
The remainder of the paper is as follows: Section 2 describes the related re-
search in the area and introduces basic 4store concepts and Minimal RDFS.
Section 3 introduces and formalizes 4sr ’s distributed model. Section 4 explains
the design and implementation of 4sr explaining the modifications undertaken
in 4store. Section 5 studies the scalability of the new bind operation by bench-
marking it under different conditions, and finally Section 6 analyses the results
achieved by this work.
2 Related Work
Well-known tools in this space include Pellet [11] and Sesame [1]. These tools can perform
RDFS reasoning with datasets containing up to a few million triples. But, even though
they have played a key role in helping Semantic Web technologies to get adopted, their
scalability and performance are still a major issue.
BigOWLIM2 is one of the few enterprise tools that claims to perform OWL
reasoning over billions of triples. It can run FC reasoning against the LUBM
(90K,0) [3], which comprises around 12 billion triples. They materialize the
KBs after asserting the data which means that BigOWLIM has to retract the
materialization if the data is updated. There is no information on how this tool
behaves in this scenario, even though they claim their inferencing system is
retractable.
In the context of distributed techniques, [15] performs FC parallel reasoning
to expand the RDFS closure over hundreds of millions of triples, and it uses a
C/MPI platform tested on a 128-core infrastructure with the LUBM 10k dataset.
[13] pursues a similar goal and using MapReduce computes the RDFS closure
over 865M triples in less than two hours. A continuation of this work has been
presented in [12] providing a parallel solution to compute the OWL Horst regime.
This solution, built on top of Hadoop, is deployed on a cluster of 64 machines
and has been tested against a synthetic data set containing 100 billion triples
and 1.5 billion triples of real data from the LDSR and UniProt datasets.
[7] presented a novel method based on the fact that Semantic Web data present
very skewed distributions among terms. Based on this evidence, the authors
present a FC algorithm that works on top of data flows in a p2p infrastructure.
This approach reported a materialization of RDFS for 200 million triples in 7.2
minutes on a cluster of 64 nodes.
Obviously, in the last 2-3 years there have been significant advances in the
materialization of closures for both RDFS and OWL languages. However, very little
work has been presented on how to query such vast amounts of data and how to
connect those solutions with SPARQL engines. Furthermore, these types of solu-
tions are suitable for static datasets where updates and/or deletes are sparse or
non-existent. Applying this mechanism to dynamic datasets with more frequent
updates and deletes whose axioms need to be recomputed will lead to processing
bottlenecks.
To avoid these bottlenecks, progress on backward chained reasoning is re-
quired. To date, there has been little progress on distributed backward chained
reasoning for triple stores. [6] presented an implementation on top of DHTs us-
ing p2p techniques. So far, such solutions have not provided the community with
tools, and recent investigations have concluded that due to load balancing issues
they cannot scale [7].
With the SPARQL/Update specification to be ratified soon, we expect more
triple/quad stores to implement and support transactions, which makes BC rea-
soning necessary at this juncture.
2
http://www.ontotext.com/owlim/big/index.html accessed 21/06/2010
RDFS extends RDF with a schema vocabulary, a regime that contains seman-
tics to describe light-weight ontologies. RDFS focusses mostly on expressing
class, property and data type relationships and its interpretations can potentially
generate inconsistencies. For instance, by using rdfs:Literal as rdfs:domain
for a predicate P, any statement (S,P,O) with P as predicate would entail (S
a rdfs:Literal) which is clearly inconsistent since RDF doesn’t allow lit-
erals to be subjects (see Section 4.3 of [5]). Another issue with RDFS
interpretations is decidability. There is a specific case when using container memberships
(rdfs:ContainerMembershipProperty) that can cause an RDFS closure to be
infinite [14]. Other similar cases appear when constructing ontologies
with different combinations of the RDF reification vocabulary, rdf:XMLLiteral,
disjoint XSD datatypes, etc. A complete RDFS reasoner must examine the ex-
istence of such paradoxes and generate errors when models hold inconsistencies.
Consistency checking is computationally very expensive and reduces query answering
performance, and one should question the applicability, for most types of applications,
of the semantics that generate such inconsistencies.
Another known issue with RDFS is that there is no differentiation between
language constructors and ontology vocabulary, and therefore constructors can
be applied to themselves. Given (P rdfs:subPropertyOf rdfs:subPropertyOf), for
example, it is not clear how an RDFS reasoner should behave with such a
construction. Thankfully, this type of construction is rarely used in Semantic Web
applications and Linked Data.
[8] summarizes all the above problems, among many others, motivating the
use of an RDFS fragment, called Minimal RDFS. This fragment preserves the
normative semantics of the core functionalities avoiding the complexity described
in [5]. Because Minimal RDFS also prevents RDFS constructors from being applied to
themselves, it has been proven that algorithms to implement reasoning can be
bound within tight complexity limits (see section 4.2 in [8]).
Minimal RDFS is built upon the ρdf fragment which includes the following
RDFS constructors: rdfs:subPropertyOf, rdfs:subClassOf, rdfs:domain,
rdfs:range and rdf:type3. It is worth mentioning that this fragment is relevant
because it is non-trivial and associates pieces of data external to the vocabulary
of the language. Conversely, predicates left out of the ρdf fragment essentially
characterize inner semantics in the ontological design of RDFS concepts.
2.2 4store
4store [4] is an RDF storage and SPARQL query system that became open
source under the GNU license in July 2009. Since then, a growing number of
users have been using it as a highly scalable quad store. 4store provides a sta-
ble infrastructure to implement decentralized backward chained reasoning: first
3
For the sake of clarity we use the same shortcuts as in [8] ([sp], [sc] [dom] and [type]
respectively).
In order to apply the deductive rules from [8], Gmrdf is replicated in all
segments; this replication process is described throughout the rest of this section. The following rules,
extracted from [8], implement the Minimal RDFS semantics:
(sp0) (_, A, sp, B), (_, B, sp, C) → (Ge, A, sp, C)
(sp1) (_, A, sp, B), (_, X, A, Y) → (Ge, X, B, Y)
These rules have been reformulated taking into account that we are dealing
with a quad system and not just with triples. The m element of the quads is
irrelevant for the rule condition; in the consequence the m element takes the value
of Ge, which we consider the graph of entailments contained in the KB. The model
element of the quad (m) does not play any role unless the SPARQL query
being processed projects named graphs into the query result set (see the discussion
of named graph semantics in Section 6.1, Future Work).
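To illustrate how these rules can be exploited at query time, the sketch below (plain Python, not 4store's C implementation; the quad and graph representations are assumptions) computes the sub-property closure of a predicate from the replicated Gmrdf quads (rule sp0) and then answers a triple pattern by matching any predicate in that closure (rule sp1).

```python
SP = "rdfs:subPropertyOf"

def sub_properties(g_mrdf, prop):
    """All properties A with A sp* prop, derived from (m, A, sp, B) quads."""
    closure, frontier = {prop}, {prop}
    while frontier:
        frontier = {a for (_, a, p, b) in g_mrdf if p == SP and b in frontier} - closure
        closure |= frontier
    return closure

def match_pattern(kb_quads, g_mrdf, subj, pred, obj):
    """Backward-chained match of (subj, pred, obj): any sub-property of pred matches."""
    preds = sub_properties(g_mrdf, pred)
    return [(m, s, p, o) for (m, s, p, o) in kb_quads
            if p in preds
            and (subj is None or s == subj)
            and (obj is None or o == obj)]
```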
At this point we have the definition of KB, S, and a set of deductive rules
for ρdf . Every segment in S contains a non-overlapping set of quads from KB.
One important aspect of 4store’s scalability is that the bind operation runs
concurrently on every segment of KB. Therefore, we need to look at data inter-
dependency in order to investigate rule chaining locks that can occur between
segments.
The chain of rule dependencies for Minimal RDFS is shown in Figure 1.
In the figure, mrdf-quads are marked with a '*', and quads in bold are initial
quads not triggered by any rule. The rest of the quads are entailed by triggering
one or more rules. There is an interesting characteristic in the chains of rules
that can be triggered in Minimal RDFS: in any possible chain, only a bounded
number of rule applications can follow one another.
[Fig. 1. Chain of rule dependencies for Minimal RDFS over example quads; mrdf-quads are marked with '*'.]
The RDFS inferencing in 4sr is based on two new components that have been
incorporated into 4store’s architecture:
– RDFS Sync: A new processing node to replicate Gmrdf called RDFS sync.
This node gathers all the quads that satisfy the condition to be held in
Gmrdf from all the segments and replicates them to every Storage Node
keeping them synchronized. After every import, update, or delete, this pro-
cess extracts the new set of quads from Gmrdf in the KB and sends it to
the Storage Nodes. Even for large KBs this synchronization is fast because
Gmrdf tends to be a very small portion of the dataset.
– bind’: The new bind function matches the quads, not just taking into ac-
count the explicit knowledge, but also the extensions from the RDFS se-
mantics. This modified bind’ accesses Gmrdf to implement backward chained
reasoning. bind’ is depicted in detail in Section 4.1.
Figure 2 shows in the architecture how bind’ and RDFS Sync interact for
a hypothetical two storage-node deployment. The dashed arrows refer to the
messages exchanged between the RDFS Sync process and the Storage Nodes in
order to synchronize Gmrdf ; the arrows between the Processing Node and the
Storage Nodes refer to the bind operation requested from the Query Engine
(QE).
The rationale for such a replacement is based on the fact that chaining dom0 and
dom1 generates entailments equivalent to just applying dom1. Similarly, chaining
range0 and range1 is the same as applying range1.
[Fig. 3. Example showing how the entailed graphs Gmrdf and G∗mrdf are derived from a small set of sc, sp, dom and range assertions.]
Our design relies on the definition of two entailed graphs Gmrdf and G∗mrdf ; to
deduce these graphs we will entail dom1 and range1 over Gmrdf . That process is
depicted in Figure 3, and the generated graphs hold the following characteristics:
We also define the following operations over Gmrdf and G∗mrdf , where X is
considered an arbitrary resource in KB:
With these functions we define the access to the graphs Gmrdf and G∗mrdf , which
we should emphasize are accessible to every segment in S. These operations,
therefore, will be used for the definition of bind ’.
A bind operation in 4store is requested by the query processor as we explained
in section 2.2. The bind receives 4 multisets with the resources to be matched or
NULL in case some part of the quad pattern is unbound, so a bind to be executed
receives as input (BM , BS , BP , BO ). We omit in this paper the description of
how the query processor works in 4store, but for the sake of understanding the
system we depict a very simple example:
A potential query plan in 4store would issue two bind operations, first the
most restrictive query pattern:
The only point to clarify is that the second bind receives B0s as BS (B0s refers
to the subject element of B0 ). The original bind operation in 4store is made of
four nested loops that traverse the indexes in an optimized manner. The ranges
of the loops are the input lists (BM , BS , BP , BO ) which are used to build up the
combination of patterns to be matched. For simplicity, we draw the following
bind function:
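Since the original listing is not reproduced here, the following is only a rough sketch of the four nested loops the text describes; index.lookup stands in for 4store's radix-trie index access and is not a real 4store API.

```python
def bind(index, b_m, b_s, b_p, b_o):
    """Enumerate quads matching the combinations of the bound input multisets;
    None stands for an unbound position that matches anything."""
    results = []
    for m in (b_m or [None]):
        for s in (b_s or [None]):
            for p in (b_p or [None]):
                for o in (b_o or [None]):
                    results.extend(index.lookup(m, s, p, o))  # stand-in for the index scan
    return results
```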
4store’s real implementation uses radix tries as indexes, and the index selec-
tion is optimized based on the pattern to match. This simplified algorithm plays
the role of explaining our transformation from original bind to the new bind’
that implements Minimal RDFS semantics.
(d) For patterns where the predicate is any of (sc, sp, range, dom), the pattern match comes down to a simple bind operation over the replicated graph: G∗mrdf |bind(s, p, o).
(e) No reasoning needs to be triggered and a normal bind is processed. The rationale for not triggering reasoning is that, in the set of deductive rules, reasoning is processed only for patterns where p is one of (type, sc, sp, range, dom) or p is part of an sp closure with more than one element. These two conditions are satisfied in (a) and (c). (b) covers the case of the pattern p = ∅, for which a recursive call for every predicate is requested.
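Putting the cases together, the dispatch structure of bind' can be sketched as follows; this is a hedged illustration with hypothetical store methods, and the reasoning performed in cases (a) and (c), which are not shown in this excerpt, is hidden behind a placeholder call:

SCHEMA_PREDICATES = {"rdfs:subClassOf", "rdfs:subPropertyOf", "rdfs:domain", "rdfs:range"}

def bind_prime(store, BM, BS, BP, BO):
    if BP is None:                                  # case (b): unbound predicate
        results = []
        for p in store.all_predicates():            # recursive call for every predicate
            results.extend(bind_prime(store, BM, BS, [p], BO))
        return results
    if all(p in SCHEMA_PREDICATES for p in BP):     # case (d): sc/sp/dom/range patterns
        return store.replicated_gmrdf_bind(BS, BP, BO)
    if any(store.needs_reasoning(p) for p in BP):   # cases (a)/(c): type or sp closure
        return store.reasoned_bind(BM, BS, BP, BO)
    return store.bind(BM, BS, BP, BO)               # case (e): normal bind, no reasoning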
This evaluation studies 4sr's distributed model and its behaviour under different configurations in terms of number of distributed processes (segments) and size of datasets. The analysis is based on the LUBM synthetic benchmark [3]; we have used six datasets, LUBM(100,0), LUBM(200,0), LUBM(400,0), ..., LUBM(1000,0), which progressively grow from 13M triples for LUBM(100,0) to 138M triples for LUBM(1000,0). In [10] we presented a preliminary benchmark demonstrating that 4sr can handle SPARQL queries over datasets of up to 500M triples; that benchmark shows the overall performance of the whole system. The benchmark we analyse in this paper, instead of studying performance for big datasets, studies how the bind operation behaves when trying to find solutions that require Minimal RDFS reasoning under different conditions, i.e., how the bind operation behaves when adding more processors or when the data size is increased. Studying just the bind operation also leaves out components of 4store that are not affected by the implementation of reasoning, such as the query engine.
Our deployment infrastructure comprises two different configurations:
1. Server set-up: one Dell PowerEdge R410 with two quad-core processors (8 cores, 16 threads) at 2.40GHz, 48GB of memory and 15k rpm SATA disks.
2. Cluster set-up: an infrastructure made of 5 Dell PowerEdge R410s, each of them with 4 dual-core processors at 2.27GHz, 48GB of memory and 15k rpm SATA disks. The network connectivity is standard gigabit Ethernet and all the servers are connected to the same network switch.
4sr does not materialize any entailments in the assertion phase; therefore, the import throughput we obtained when importing the LUBM datasets is similar to the figures reported by the 4store developers: around 100kT/s for the cluster set-up and 114kT/s for the server set-up6.
The LUBM benchmark evaluates OWL inference, and therefore there are constructors not supported by 4sr. We have selected a set of 5 individual triple patterns that cover all the reasoning implemented by 4sr:
6 Throughput obtained asserting the data in N-Triples; the LUBM datasets had been converted from RDF/XML into N-Triples using the rapper tool.
For the biggest datasets, LUBM(800,0) and LUBM(1000,0), the system degraded drastically; for these datasets the 1-, 2- and 4-segment deployments did not respond properly.
The cluster benchmark (shown in the corresponding figure) exhibits better performance. The time needed for transmitting messages over the network is offset by considerably better disk I/O throughput. The server configuration has 2 mirrored 15K RPM disks, the same as each of the nodes in the cluster, but every node in the cluster can use its disks independently from the other nodes, so the segments collide less on I/O operations.
For the biggest datasets, LUBM(800,0) and LUBM(1000,0), the cluster shows optimal performance, with all bind throughputs reaching between 150K and 300K solutions per second. Domain and range inference for Faculty, Organisation and Person shows linear scalability and no degradation, unlike in the server configuration. Throughput tends to increase with the size of the dataset because bigger datasets generate more solutions without yet reaching the performance limits of the cluster.
Overall, the server configuration reaches its scalability limit at LUBM(400,0) for the 16-segment configuration; for datasets bigger than LUBM(400,0) the server configuration behaves worse. The cluster configuration performs better due to its more distributed nature. It is fair to mention that, of course, the cluster infrastructure cost is higher than that of the server, and for some applications the performance shown by the server configuration could be good enough.
In this paper we have presented the design and implementation of the Minimal RDFS fragment in 4store, with two novel characteristics: decentralisation and backward chaining. We have also described 4sr's distributed model and how subProperty, subClass, domain, range and type semantics can be parallelized by synchronizing a small subset of the triples, namely the ones held in Gmrdf. The scalability analysis showed that the distributed model makes efficient use of the cluster infrastructure with datasets of up to 138M triples, providing better throughput for bigger datasets.
Since no materialization is processed at the data assertion phase, 4sr offers a
good balance between import throughput and query performance. In that sense,
4sr will support the development of Semantic Web applications where data can
change regularly and RDFS inference is required.
Our plans for future work include the implementation of stronger semantics for named graphs. At the time of writing, the research community is discussing how named graphs with attached semantics should behave in a quad store. Our current implementation simply makes Gmrdf and G∗mrdf available to every graph, and we delegate the semantics of named graphs to the query engine, which will treat entailed solutions as part of Ge.
Acknowledgements
This work was supported by the EnAKTing project funded by the Engineering
and Physical Sciences Research Council under contract EP/G008493/1.
References
1. Broekstra, J., Kampman, A., Van Harmelen, F.: Sesame: A Generic Architecture
for Storing and Querying RDF and RDF Schema, pp. 54–68. Springer, Heidelberg
(2002)
2. Carroll, J.J., Dickinson, I., Dollin, C., Reynolds, D., Seaborne, A., Wilkinson, K.:
Jena: Implementing the Semantic Web Recommendations. In: WWW (2004)
3. Guo, Y., Pan, Z., Heflin, J.: LUBM: A Benchmark for OWL Knowledge Base
Systems. Journal of Web Semantics 3(2-3), 158–182 (2005)
4. Harris, S., Lamb, N., Shadbolt, N.: 4store: The Design and Implementation of a
Clustered RDF Store. In: Scalable Semantic Web Knowledge Base Systems - SSWS
2009, pp. 94–109 (2009)
5. Hayes, P., McBride, B.: RDF Semantics, W3C Recommendation (February 10,
2004), http://www.w3.org/TR/rdf-mt/
6. Kaoudi, Z., Miliaraki, I., Koubarakis, M.: RDFS Reasoning and Query Answering
on Top of DHTs. In: Sheth, A.P., Staab, S., Dean, M., Paolucci, M., Maynard,
D., Finin, T., Thirunarayan, K. (eds.) ISWC 2008. LNCS, vol. 5318, pp. 499–516.
Springer, Heidelberg (2008)
7. Kotoulas, S., Oren, E., van Harmelen, F.: Mind the Data Skew: Distributed In-
ferencing by Speeddating in Elastic Regions. In: Proceedings of the WWW 2010,
Raleigh NC, USA (2010)
8. Muñoz, S., Pérez, J., Gutierrez, C.: Simple and Efficient Minimal RDFS. Journal
of Web Semantics 7, 220–234 (2009)
9. Salvadores, M., Correndo, G., Harris, S., Gibbins, N., Shadbolt, N.: 4sr - Scal-
able Decentralized RDFS Backward Chained Reasoning. In: Posters and Demos.
International Semantic Web Conference (2010)
10. Salvadores, M., Correndo, G., Omitola, T., Gibbins, N., Harris, S., Shadbolt, N.:
4s-reasoner: RDFS Backward Chained reasoning Support in 4store. In: Web-scale
Knowledge Representation, Retrieval, and Reasoning, Web-KR3 (2010)
11. Sirin, E., Parsia, B., Grau, B.C., Kalyanpur, A., Katz, Y.: Pellet: A Practical
OWL-DL Reasoner. Journal of Web Semantics 5(2), 51–53 (2007)
12. Urbani, J., Kotoulas, S., Maassen, J., van Harmelen, F., Bal, H.E.: OWL Reasoning with WebPIE: Calculating the Closure of 100 Billion Triples. In: Extended Semantic Web Conference (2010)
13. Urbani, J., Kotoulas, S., Oren, E., van Harmelen, F.: Scalable Distributed Reason-
ing Using MapReduce. In: Bernstein, A., Karger, D.R., Heath, T., Feigenbaum, L.,
Maynard, D., Motta, E., Thirunarayan, K. (eds.) ISWC 2009. LNCS, vol. 5823,
pp. 634–649. Springer, Heidelberg (2009)
14. Weaver, J.: Redefining the RDFS closure to be decidable. In: W3C Workshop RDF
Next Steps, Stanford, Palo Alto, CA, USA (2010),
http://www.w3.org/2009/12/rdf-ws/papers/ws16
15. Weaver, J., Hendler, J.A.: Parallel Materialization of the Finite RDFS Closure
for Hundreds of Millions of Triples. In: Bernstein, A., Karger, D.R., Heath, T.,
Feigenbaum, L., Maynard, D., Motta, E., Thirunarayan, K. (eds.) ISWC 2009.
LNCS, vol. 5823, pp. 682–697. Springer, Heidelberg (2009)
miKrow: Semantic Intra-enterprise
Micro-Knowledge Management System
iSOCO
Avda. Partenón. 16-18, 28042, Madrid, Spain
{vpenela,galvaro,cruiz,ccordoba,fcarbone,mcastagnone,
jmgomez,jcontreras}@isoco.com
http://lab.isoco.net/
1 Introduction
The increasing amount of information generated by enterprises during the last
decade has lead to the introduction of the new Knowledge Management (KM)
concept, that has grown from a mere accessory to a full discipline that allows
companies to grow more efficient and competitive.
Best practices in KM strategies usually attack several key objectives: i) iden-
tify, gather and organize the existing knowledge within the enterprise, ii) facil-
itate the creation of new knowledge, and iii) foster innovation in the company
through the reuse and support of workers’ abilities. However, in most of the cases,
situations and effectively envision and create their future. Due to new features of the market, such as the increasing availability and mobility of skilled workers and ideas sitting on the shelf, knowledge is no longer a static resource of the company: it resides in its employees, suppliers and customers. If companies do not use the knowledge they have inside, one of their main resources goes stale.
In recent years computer science has faced more and more complex problems related to information creation and use. Applications in which small groups of users publish static information or perform complex tasks in a closed system are not scalable. In 2004, James Surowiecki introduced the concept of “The Wisdom of Crowds” [17], demonstrating how complex problems can be solved more effectively by groups operating under specific conditions than by any individual of the group. The collaborative paradigm leads to the generation of large amounts of content, and when a critical mass of documents is reached, information becomes effectively unavailable. Knowledge and information management are not scalable unless formalisms are adopted. The Semantic Web's aim is to transform human-readable content into machine-readable content; with this goal, languages such as RDF Schema and OWL have been defined.
Research on computer-supported collaborative work [10] has analyzed the introduction of Web 2.0 in corporations: McAfee [11] coined the term “Enterprise 2.0” for a paradigm shift in corporations towards the 2.0 philosophy, in which collaborative work should not be based on the hierarchical structure of the organization but should follow the Web 2.0 principles of open collaboration. This is especially true for innovation processes, which can particularly benefit from the new open innovation paradigm [7]. In a world of widely distributed knowledge, companies do not have to rely entirely on their own research, but should open up innovation to all the employees of the organization, to providers and to customers.
Web 2.0 tools do not have formal models that allow the creation of complex systems managing large amounts of data. Nowadays, solutions such as folksonomies, collaborative tagging and social tagging are adopted for the collaborative categorization of contents. In this scenario we have to face the problem of scalability and interoperability [9]: letting users use any keyword is very powerful, but this approach does not consider the natural semantic relations between the tags. The Semantic Web can contribute by introducing computer-readable representations for simple fragments of meaning. As will be seen, an ontology-based analysis of plain text provides a semantic contextualization of the content, supports tasks such as computing the semantic distance between contents, and helps in creating relations between people with shared knowledge and interests.
Different mechanisms for leveraging all this scattered enterprise knowledge have been studied during the last decade, particularly trying to ease the pain of introducing new tools into the already overcrowded worker's desktop by adding a semantic layer on top of current applications. CALO3, based on the use of cognitive systems, and NEPOMUK4, which tries to add social and semantic aspects
3 CALO is part of the PAL Program: https://pal.sri.com/
4 NEPOMUK Project: http://nepomuk.semanticdesktop.org/
to the user's personal desktop, are two of the main references of ACTIVE5, a project that aims to increase the productivity of knowledge workers with pro-active and contextualized mechanisms and whose technology has been used to improve the proposed solution.
Microblogging is one of the recent social phenomena of Web 2.0, being one of the key concepts that have brought the Social Web to more than merely early adopters and tech-savvy users. The simplest definition of microblogging, a light version of blogging where messages are restricted to a small number of characters, does not do justice to the real implications of this apparent constraint. Its simplicity and ubiquitous usage possibilities have made microblogging one of the new standards in social communication. There is a large number of social networks and sites, with more blooming every day, that have some microblogging functionalities, although currently there are two big players in the field: Twitter and Facebook, with 175 and 600 million users respectively.
One of the main issues microblogging has today is the lack of proper semantics, which makes building any kind of intelligent system on top of it quite hard. Even though different user initiatives have emerged, such as the use of hashtags to define channels of communication and provide a context for the conversation, their use is mostly related to user consumption of the information and does not allow for any real analysis of the meaning of the related data.
Twitter introduced Annotations6 as a mechanism to add structured metadata to a tweet. It proposes an open key/value structure as properties of a type entity, with recommended types such as “place”, “movie” or “review”. This low-level approach is simplistic in that it does not define a formal model, but only a mechanism to add metadata to messages.
Facebook has proposed the Open Graph protocol as a mechanism to add metadata to its network; however, the target has been quite the opposite: instead of adding metadata to messages, as with Twitter Annotations, the main goal is to improve information linking with external resources by proposing a modified RDFa structure for webpages.
SMOB [12] tries to solve this by proposing the use of semantically-enabled hashtags such as #dbp:Eiffel Tower in #geo:Paris France. However, this approach puts all the burden of explicitly giving meaning to the different elements on the user, which is counterproductive to the idea of microblogging as a lightweight communication tool.
This lack of semantics is an even stronger constraint in a work environment, where employees need faster and more reliable tools for KM while expecting new tools not to disturb their usual work experience and thus not to force them to perform new tasks. Passant et al. [13] extended their previous approach by trying to solve these issues with a mixture of different user-friendly
5 ACTIVE Project: http://www.active-project.eu/
6 Twitter Annotations: http://dev.twitter.com/pages/annotations_overview
Web 2.0 interfaces for users to both provide and consume RDF/OWL annotations. This approach still seems quite hard on common employees, who are experts in their domain but have no knowledge of semantic technologies.
analyzed and stored in the message index. The set of terms present in users' statuses composes their entries in the experts index. The text of the messages is also used to perform a semantic search against the same index.
Semantic Indexing. When a user posts a new status message into the system,
its content is analyzed and included into a message index (status repository), al-
lowing future retrieval. Similarly, a repository of expert users (experts repository)
is populated by relating the relevant terms of the message with the particular
author.
Technically, the messages that users post to the system are groups of terms T, comprising both key-terms T^K (relevant terms from the domain ontology) and normal terms. The process of indexing each message results in a message repository that contains each document indexed by the different terms it contains, as shown in Figure 1(a).
In the case of the update of the semantic repository of experts, which follows the message indexing, each user can be represented by a group of key-terms T^K (only those present in the domain ontology). This way, the repository of experts will contain the different users of the system, who can be retrieved by the key-terms. Figure 1(b) illustrates this experts repository.
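The following minimal sketch (with hypothetical names; miKrow's actual implementation builds these indexes with Lucene) illustrates the two repositories just described: a message index keyed by all terms T of a status, and an experts index keyed only by the key-terms T^K found in the domain ontology:

from collections import defaultdict

message_index = defaultdict(set)  # term -> ids of messages containing it
experts_index = defaultdict(set)  # key-term -> ids of users who used it

def index_status(user_id, message_id, terms, ontology_terms):
    for term in terms:
        message_index[term].add(message_id)   # every term feeds the message index
        if term in ontology_terms:            # key-terms also feed the experts index
            experts_index[term].add(user_id)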
– Given the text of a status update, the search on the experts index returns semantically related people, such as other co-workers with experience in related areas.
From a technical point of view, the semantic repository is queried by using the group of terms T of the posted message, as depicted in Figure 2(a). This search returns messages semantically relevant to the one that the user has just posted.
It is worth noting that, as will be covered in Section 3.3, the search process in the repository is semantic; therefore, the relevant messages might contain not only some of the exact terms present in the current status message, but also terms semantically related through the domain ontology.
As stated above, along with the search for relevant messages, the system is also able to extract experts (identified by the terms present in the messages they have written previously) associated with the current status message being posted. In this case, the search over the semantic repository of experts is performed by using the key-terms T^K contained in the posted message, as depicted in Figure 2(b).
matches the concept “product” in the ontology), ii) semantic relations between concepts (e.g., the keyword “target” matches the relation product has target), or iii) instance entities (e.g., the keyword “Sem10 Engine”, which matches the instance Sem10 Engine, a particular product of the company). This process can produce any number of matches for each term, depending strongly on the size and number of elements of the ontology, how well it covers the business knowledge base of the company, and how the message relates to the core elements in it.
Once associations between terms and ontology entities (concepts, attributes and instances) are identified, the semantic search engine builds a query exploiting the relations defined by the ontology, and hence in the company knowledge base. Different expansions can be performed depending on the initial input, with each one being weighted according to the relevance of its relation (a minimal sketch of this expansion is given after the list):
– If a synonym of an ontology term is detected, the ontology term is added to
the query.
– If a term corresponding to an ontology class is found, subclasses and in-
stances labels are used to expand the query.
– If an instance label is identified, the corresponding class name and sibling
instance labels are added to the query.
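A minimal sketch of these expansion rules is given below; the ontology access functions and the weights are assumptions for illustration, not miKrow's actual API:

EXPANSION_WEIGHTS = {"synonym": 1.0, "class": 0.8, "instance": 0.8}

def expand_query(keywords, ontology):
    expanded = []                                   # list of (term, weight) pairs
    for kw in keywords:
        expanded.append((kw, 1.0))                  # always keep the original keyword
        term = ontology.synonym_of(kw)              # synonym of an ontology term
        if term:
            expanded.append((term, EXPANSION_WEIGHTS["synonym"]))
        if ontology.is_class(kw):                   # term corresponds to an ontology class
            for label in ontology.subclass_and_instance_labels(kw):
                expanded.append((label, EXPANSION_WEIGHTS["class"]))
        if ontology.is_instance_label(kw):          # term is an instance label
            expanded.append((ontology.class_of(kw), EXPANSION_WEIGHTS["instance"]))
            for sibling in ontology.sibling_instance_labels(kw):
                expanded.append((sibling, EXPANSION_WEIGHTS["instance"]))
    return expanded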
Even though the global solution is built upon a microblogging environment and obviously focused on lightweight KM, interaction with the systems currently deployed in an enterprise environment is a key element in order to ease possible entry barriers as well as to leverage the knowledge already available in the company.
As a test use case, different levels of information will be extracted from services provided by the ACTIVE project7 [16]. ACTIVE aims to increase the productivity of knowledge workers in a pro-active, contextualized, yet easy and unobtrusive way through an integrated knowledge management workspace that reduces information overload by significantly improving the mechanisms through which enterprise information is created, managed, and used. Combining this approach with our microblogging solution will increase the benefits for workers.
ACTIVE tries to extract information from the whole employee environment, dividing the provided data into three main types of concepts:
The microblogging tool will extend its classical interface by including links to different instances of each class. These instances will be obtained by consuming ACTIVE services, with the terms detected in a particular message as tags for the query, and will function as interaction channels between both systems, allowing the employee to gather further information and acting as a bridge between the lightweight KM tool and the more resource-intensive platform.
7 ACTIVE Project: http://www.active-project.eu/
The theoretical contribution covered in the previous section has been imple-
mented as a prototype, codenamed miKrow, in order to be able to evaluate and
validate our ideas. In the following subsections, we address the implementation
details and the evaluation performed.
inside the ACTIVE project, as well as to solve some of the issues raised by the first evaluation performed inside iSOCO8.
miKrow is divided into two main components: a semantic engine that uses Lucene to offer search and indexing functionalities, and a microblogging engine, for which Google's Jaiku9 has been forked and extended to properly include and show the new type of related information that miKrow offers to the final user.
The original semantic engine is also extended by introducing two main additional functionalities, whose main goal is to reduce the cold start typical of this type of service:
– Linked Data entities. External services such as OpenCalais13 are used to connect the posted messages to external entities in the Linked Data paradigm, allowing the system to propose new entities not included in the enterprise ontology.
– Knowledge resources. ACTIVE technology is used to recommend knowledge resources related to the entities extracted from the user messages, narrowing the gap between the lightweight tool and more intensive desktop platforms.
– Informing users about the reasons for the suggestions (both internal to the tool, for messages and experts, and external, for documents found in the existing enterprise information systems) is important, as they perceive some sort of intelligence in the system and are significantly more pleased. Also, if the suggestion is not good, they at least know why it has been produced. Again, letting them provide feedback on these occasions will generate a beneficial loop that will enrich the system.
6 Conclusions
References
1. Álvaro, G., Córdoba, C., Penela, V., Castagnone, M., Carbone, F., Gómez-Pérez,
J.M., Contreras, J.: mikrow: An intra-enterprise semantic microblogging tool as a
micro-knowledge management solution. In: International Conference on Knowledge
Management and Information Sharing 2010, KMIS 2010 (2010)
2. Armbrust, M., Fox, A., Griffith, R., Joseph, A., Katz, R., Konwinski, A., Lee, G.,
Patterson, D., Rabkin, A., Stoica, I., et al.: Above the clouds: A berkeley view
of cloud computing. EECS Department, University of California, Berkeley, Tech.
Rep. UCB/EECS-2009-28 (2009)
3. Bellinger, G.: Systems thinking-an operational perspective of the universe. Systems
University on the Net 25 (1996)
4. Berners-Lee, T.: Linked data. International Journal on Semantic Web and Infor-
mation Systems 4(2) (2006)
5. Cadenas, A., Ruiz, C., Larizgoitia, I., Garcı́a-Castro, R., Lamsfus, C., Vázquez,
I., González, M., Martı́n, D., Poveda, M.: Context management in mobile envi-
ronments: a semantic approach. In: 1st Workshop on Context, Information and
Ontologies (CIAO 2009), pp. 1–8 (2009)
6. Carbone, F., Contreras, J., Hernández, J.: Enterprise 2.0 and semantic technolo-
gies: A technological framework for open innovation support. In: 11th European
Conference on Knowledge Management, ECKM 2010 (2010)
7. Chesbrough, H., Vanhaverbeke, W., West, J.: Open Innovation: Researching a new
paradigm. Oxford University Press, USA (2006)
8. Dey, A., Abowd, G.: Towards a better understanding of context and context-
awareness. In: CHI 2000 Workshop on the What, Who, Where, When, and How of
Context-Awareness, pp. 304–307 (2000)
9. Graves, M.: The relationship between web 2.0 and the semantic web. In: European
Semantic Technology Conference, ESTC 2007 (2007)
10. Grudin, J.: Computer-supported cooperative work: History and focus. Com-
puter 27(5), 19–26 (1994)
11. McAfee, A.: Enterprise 2.0: The dawn of emergent collaboration. MIT Sloan Man-
agement Review 47(3), 21 (2006)
12. Passant, A., Hastrup, T., Bojars, U., Breslin, J.: Microblogging: A semantic and
distributed approach. In: Proceedings of the 4th Workshop on Scripting for the
Semantic Web (2008)
13. Passant, A., Laublet, P., Breslin, J., Decker, S.: Semslates: Improving enterprise
2.0 information systems thanks to semantic web technologies. In: Proceedings of
the 5th International Conference on Collaborative Computing: Networking, Appli-
cations and Worksharing (2009)
14. Penela, V., Ruiz, C., Gómez-Pérez, J.M.: What context matters? Towards mul-
tidimensional context awareness. In: Augusto, J.C., Corchado, J.M., Novais, P.,
Analide, C. (eds.) ISAmI 2010. AISC, vol. 72, pp. 113–120. Springer, Heidelberg
(2010)
15. Schein, A., Popescul, A., Ungar, L., Pennock, D.: Methods and metrics for cold-start recommendations. In: Proceedings of the 25th ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 253–260. ACM, New York (2002)
16. Simperl, E., Thurlow, I., Warren, P., Dengler, F., Davies, J., Grobelnik, M.,
Mladenic, D., Gomez-Perez, J.M., Ruiz, C.: Overcoming information overload in
the enterprise: the active approach. IEEE Internet Computing 14(6), 39–46 (2010)
17. Surowiecki, J., Silverman, M., et al.: The wisdom of crowds. American Journal of
Physics 75, 190 (2007)
18. Warren, P., Kings, N., Thurlow, I., Davies, J., Bürger, T., Simperl, E., Ruiz, C., Gómez-Pérez, J., Ermolayev, V., Ghani, R., Tilly, M., Bösser, T., Imtiaz, A.: Improving knowledge worker productivity – the ACTIVE approach. BT Technology Journal 26, 165–176 (2009)
A Faceted Ontology for a Semantic Geo-Catalogue
1 Introduction
Geo-spatial applications need to provide powerful search capabilities to support users
in their daily activities. This is specifically underlined by the INSPIRE1 directive and
regulations [15, 16] that establish minimum criteria for the discovery services to sup-
port search within the INSPIRE metadata elements. However, discovery services are
often limited to syntactically matching user terminology to the metadata describing
geographical resources [1]. This weakness has been identified as one of the key issues
for the future of the INSPIRE implementation [11, 17, 18, 19].
As a matter of fact, current geographical standards only aim at syntactic agreement [23]. For example, if it is decided that the standard term to denote a harbour (defined in WordNet as “a sheltered port where ships can take on or discharge cargo”) is harbour, they will fail in applications where the same concept is denoted with seaport. As part of the solution, domain-specific geo-spatial ontologies need to be adopted. In [14] we reviewed some of the existing frameworks supporting the creation and maintenance of geo-spatial ontologies and proposed GeoWordNet - a multi-lingual geo-spatial ontology providing knowledge about geographic classes (features), geo-spatial entities (locations), entities' metadata and part-of relations between them - as one of the best candidates, both in terms of quantity and quality of the information provided, to give semantic support to spatial applications.
1 http://inspire.jrc.ec.europa.eu/
The purpose of the Semantic Geo-Catalogue (SGC) project [20] - promoted by the
Autonomous Province of Trento (PAT) in Italy with the collaboration of Informatica
Trentina, Trient Consulting Group and the University of Trento - was to develop a
semantic geo-catalogue as an extension of the existing geo-portal of the PAT. It was
conceived to support everyday activities of the employees of the PAT. The main re-
quirement was to allow users to submit queries such as Bodies of water in Trento, run
them on top of the available geographical resources metadata and get results also for
more specific features such as rivers and lakes. This is clearly not possible without
semantic support. As reported in [12], other technological requirements directly com-
ing from the INSPIRE directives included (a) performance - send one metadata record
within 3s. (this includes, in our case, the time required for the semantic expansion of
the query); (b) availability - service up by 99% of the time; (c) capacity - 30 simulta-
neous service requests within 1s.
In this paper we report our work on the implementation of the semantic geographi-
cal catalogue for the SDI of the PAT. In particular, we focus on the semantic exten-
sion of its discovery service. The semantic extension is based on the adoption of the
S-Match2 semantic matching tool [4] and on the use of a specifically designed faceted
ontology [2] codifying the necessary domain knowledge about geography and includ-
ing inter-alia the administrative divisions (e.g., municipalities, villages), the bodies of
water (e.g., lakes, rivers) and the land formations (e.g., mountains, hills) of the PAT.
Before querying the geo-resources, user queries are expanded by S-Match with do-
main specific terms taken from the faceted ontology. In order to increase the domain
coverage, we integrated the faceted ontology with GeoWordNet.
The rest of the paper is organized as follows. Section 2 describes the overall sys-
tem architecture and focuses on the semantic extension in particular. Section 3 de-
scribes the dataset containing the locations within the PAT and how we cleaned it.
Sections 4, 5 and 6 provide details about the construction of the faceted ontology, its
population and integration with GeoWordNet, respectively. The latter step allows
supporting multiple languages (English and Italian), enlarging the background ontol-
ogy and increasing the coverage of locations and corresponding metadata such as lati-
tude and longitude coordinates. Finally Section 7 concludes the paper by summarizing
the main findings and the lessons learned.
2 The Architecture
As described in [1], the overall architecture is constituted by the front-end, business
logic and back-end layers as from the standard three-tier paradigm. The geo-catalogue
is one of the services of the existing geo-cartographic portal3 of the PAT. It has been
implemented by adapting available open-source tool4 conforming to the INSPIRE di-
rective and by taking into account the rules enforced at the national level. Following
the best practices for the integration of the third-party software into the BEA ALUI
2 S-Match is open source and can be downloaded from http://sourceforge.net/projects/s-match/
3 http://www.territorio.provincia.tn.it/
4 GeoNetwork OpenSource, http://geonetwork-opensource.org
framework5 (the current engine of the geo-portal), external services are brought to-
gether using a portlet6-based scheme, where GeoNetwork is used as a back-end. Fig. 1
provides an integrated view of the system architecture. At the front-end, the function-
alities are realized as three portlets for:
1. metadata management, including harvesting, search and catalogue navigation
functionalities;
2. user/group management, to administer access control on the geo-portal;
3. system configuration, which corresponds to the functionalities of GeoNetwork's Administrator Survival Tool (GAST).
These functionalities are mapped 1-to-1 to the back-end services of GeoNetwork.
Notice that external applications, such as ESRI ArcCatalog, can also access the back-
end services of GeoNetwork.
Fig. 1. Integrated view of the system architecture: front-end portlets (metadata management, user and group management, system configuration) mapped to the corresponding GeoNetwork back-end web services, with external applications issuing queries and the semantic matching component relying on the faceted and background ontologies.
5 http://download.oracle.com/docs/cd/E13174_01/alui/
6 http://jcp.org/en/jsr/detail?id=168
7 S-Match uses WordNet by default but it can be easily substituted programmatically, for instance by plugging in GeoWordNet in its place.
geographical classes and corresponding locations and filter out noisy data. The picture
below summarizes the main phases, described in detail in the next paragraphs.
The data are available in MS Excel files (Table 1) and were gathered from the PAT administration. The features file contains information about the main 45 geographical classes; the ammcom file contains 256 municipalities; the localita file contains 1,507 wards and ward parts, which we generically call populated places; the toponimi file contains 18,480 generic locations (including, inter alia, villages, mountains, lakes and rivers). Comune, frazione and località popolata are the Italian class names for municipality, ward and populated place respectively.
Table 1. The names and descriptions of the files containing PAT data
With the construction of the faceted ontology we identified suitable names for the rest of the Italian classes from the analysis of the PAT geographical classes in the
features file. In fact, they are very generic as they are meant to contain several, but
similar, kinds of locations. For instance, there is a class that includes springs, water-
falls and other similar entities.
We retrieved the main PAT classes, which we call macro-classes (as they group different types of locations), from the features file. In this file each class is associated with an id (e.g., P110) and an Italian name (e.g., Monti principali).
We did not process the macro-class with id P310 (Regioni limitrofe), as it represents locations in the regions neighbouring Trento (outside the scope of our interest), nor P472 (Indicatori geografici), as it represents geographic codes. Notice that the names of the macro-classes needed to be refined, as they are too generic and represent many kinds of locations grouped together. As this file lacks classes for provinces, municipalities, wards and populated places, we created them as shown in Table 2.
We imported all the locations into a temporary database, organizing them into the part-of hierarchy province > municipality > ward > populated place (and other location kinds) as follows (a simplified sketch of this import is given after the list):
• The province level. We created an entity representing the Province of Trento.
This entity is not explicitly defined in the dataset but it is clearly the root of the
hierarchy. We assigned the following names to it: Provincia Autonoma di
Trento, Provincia di Trento and Trento. It was assigned to the province class.
• The municipality level. Municipalities were extracted from the ammcom file. We
created an entity for each municipality and a part-of relation between each mu-
nicipality and the province. They were assigned to the municipality class.
• The ward and populated place level. Wards and populated places (sections of
wards) were extracted from the localita file. Here each ward is connected to the
corresponding municipality and each populated place to the corresponding ward
by specific internal codes. For each ward and populated place we created a corre-
sponding entity. Using the internal codes, each ward was connected to the corre-
sponding municipality and each populated place to the corresponding ward. They
were assigned to the class ward or populated place accordingly.
• All other locations. All other (non-administrative) locations were extracted from the toponimi file. Here each of them is connected either to a municipality, a ward or a populated place by specific internal codes. Using the internal codes, we connected them accordingly. A few of them are not connected to any place and therefore we connected them directly to the province. Each location was temporarily assigned to the corresponding macro-class.
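The following sketch summarizes the import just described; file parsing is omitted and the record fields are assumptions for illustration:

def build_part_of_hierarchy(municipalities, wards, populated_places, other_locations):
    entities, part_of = [], []                  # (name, class) pairs and (child, parent) pairs

    province = "Provincia Autonoma di Trento"
    entities.append((province, "province"))

    for m in municipalities:                    # from the ammcom file
        entities.append((m["name"], "municipality"))
        part_of.append((m["name"], province))

    for w in wards:                             # from the localita file
        entities.append((w["name"], "ward"))
        part_of.append((w["name"], w["municipality"]))

    for p in populated_places:                  # sections of wards, from the localita file
        entities.append((p["name"], "populated place"))
        part_of.append((p["name"], p["ward"]))

    for loc in other_locations:                 # from the toponimi file
        entities.append((loc["name"], loc["macro_class"]))
        parent = loc.get("parent") or province  # unconnected locations go under the province
        part_of.append((loc["name"], parent))

    return entities, part_of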
Locations are provided with latitude and longitude coordinates in Cartesian WGS84
(World Geodetic System 1984) format, a standard coordinate reference system mainly
used in cartography, geodesy and navigation to represent geographical coordinates on
the Earth8. Since in GeoWordNet we store coordinates in WGS84 decimal format, for
compatibility we converted them accordingly.
macro-classes 44
locations 20,162
part-of relations 20,161
alternative names 7,929
8 https://www1.nga.mil/ProductsServices/GeodesyGeophysics/WorldGeodeticSystem/
9 Note that the missing municipalities are due to the fact that they were merged with other municipalities on 1st January 2010, while the duplicates are related to administrative islands (regions which are not geometrically connected to the main area of each municipality).
We started from the 45 macro-classes extracted from the features file and imported into the temporary database. Notice that they are not accompanied by any description. Therefore, analyzing the locations contained in each macro-class, each macro-class was manually disambiguated and refined - split, merged or renamed - and as a result new classes had to be created.
MACRO-CLASSES -> CLASSES
P410 Capoluogo di Provincia -> Province
P465 Malghe e rifugi -> Shelter, Farm, Hut
P510 Antichità importanti; P520 Antichità di importanza minore -> Antiquity
P210 Corsi d'acqua/laghi (1 ord.) -> Lake
P220 Corsi d'acqua/laghi (2 ord.) -> Group of lakes
P230 Corsi d'acqua/Canali/Fosse/Cond. forz./Laghi (3 ord.) -> Stream
P240 Corsi d'acqua/Canali/Fosse/Cond. forz./Laghi (>3 ord. - 25.000) -> River, Rivulet
P241 Corsi d'acqua/Canali/Fosse/Cond. forz./Laghi (>3 ord.) -> Canal
By identifying semantic relations between atomic concepts and following the ana-
lytico-synthetic approach we finally created the faceted ontology of the PAT with five
distinct facets: antiquity, geological formation (further divided into natural elevation
and natural depression), body of water, facility and administrative division. As an ex-
ample, below we provide the body of water and geological formation facets.
For instance, entities with names starting with Monte were considered instances of the class montagna in Italian (mountain in English), while entities with names starting with Passo were mapped to the class passo in Italian (pass in English). The general criterion we used is that if we could successfully apply a heuristic we classified the entity in the corresponding class; otherwise we chose a more generic class, which in the worst case is the root of a facet (same as the block name). For some specific macro-classes we reached a success rate of 98%. On average, about 50% of the locations were put in a leaf class thanks to the heuristics.
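A simplified sketch of such name-prefix heuristics follows; the prefix table only shows the two rules mentioned above and is an illustration, not the full set used in the project:

PREFIX_RULES = [
    ("Monte", "montagna"),  # mountain
    ("Passo", "passo"),     # pass
]

def classify_by_name(entity_name, facet_root_class):
    for prefix, target_class in PREFIX_RULES:
        if entity_name.startswith(prefix):
            return target_class
    return facet_root_class  # worst case: the root of the facet (the block name)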
Finally, we applied the heuristics beyond the boundaries of the blocks for a further refinement of the instantiation of the entities. The idea was to understand whether, by mistake, entities had been classified in the wrong macro-class. For instance, in the natural depression block (the 5 macro-classes from P320 to P350), 6 entities have names starting with Monte and are therefore supposed to be mountains instead; the right place for them is the natural elevation facet. In total we found 48 potentially misplaced entities, which were checked manually. In 41.67% of the cases the check revealed that the heuristics were valid, in only 8.33% of the cases the heuristics were invalid, and the rest remained unknown because of the lack of information available on the web about the entities. We moved those considered valid to the right classes.
3. Parent Identification. If the class name starts with either “group of” or “chain of”, remove this string from the name and convert the remaining part to the singular form. Identify the synset/concept of the converted part; the parent of the identified concept is selected as the parent of the class. If the class name consists of two or more words, take the last word, retrieve its synset/concept, and assign this concept as the parent of the atomic concept corresponding to the class. If neither the concept nor the parent is identified, ask for manual intervention (a simplified sketch of this step is given after the note below).
Note that while matching classes across datasets, we took into account the sub-
sumption hierarchy of their concepts. For example, Trento as municipality in the
PAT dataset is matched with Trento as administrative division in GeoWordNet
because the former is more specific than the latter. Note also that the heuristic
above aims only at minimizing the number of duplicated entities but it cannot
prevent the possibility of still having some duplicates. However, further relaxing
it would generate false positives. For instance, by dropping the condition of hav-
ing same children we found 5% (1 over 20) of false matches.
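As a rough illustration of the parent-identification step (step 3 above), the sketch below stubs out the singularization and synset lookup, which in the real process rely on the knowledge base and may fall back to manual intervention:

def identify_parent(class_name, lookup_concept, singularize):
    name = class_name.lower()
    for prefix in ("group of ", "chain of "):
        if name.startswith(prefix):
            base = singularize(name[len(prefix):])
            concept = lookup_concept(base)
            if concept is not None:
                return concept.parent          # parent of the identified concept
    words = name.split()
    if len(words) >= 2:
        concept = lookup_concept(words[-1])    # use the head (last) word of the name
        if concept is not None:
            return concept
    return None                                # neither found: ask for manual intervention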
With this step non overlapping locations and part-of relations between them were im-
ported from the temporary database to GeoWordNet following the macro steps below:
1. For each location:
a. Create a new entity in GeoWordNet
b. Use the main name of the location to fill the name attribute both in English
and Italian
c. For each Italian alternative name add a value to the name attribute in Italian
d. Create an instance-of entry between the entity and the corresponding class
concept
2. Create part-of relations between the entities using the part-of hierarchy built as
described in Section 3.3
3. Generate an Italian and English gloss for each entity created with previous steps
Note that natural language glosses were automatically generated. We used several rules, according to the language, for their generation. For instance, one in English is:
entity_name + " is " + article + " " + class_name + " in " + parent_name + " (" + parent_class + " in " + country_name + ")";
This allows for instance to describe the Garda Lake as “Garda Lake is a lake in
Trento (Administrative division in Trentino Alto-Adige)”.
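For illustration, the English rule above can be transcribed directly into a small function (the article choice is simplified and the parameter names are ours):

def english_gloss(entity_name, class_name, parent_name, parent_class, country_name):
    article = "an" if class_name[:1].lower() in "aeiou" else "a"
    return (entity_name + " is " + article + " " + class_name + " in " +
            parent_name + " (" + parent_class + " in " + country_name + ")")

# english_gloss("Garda Lake", "lake", "Trento", "Administrative division", "Trentino Alto-Adige")
# -> "Garda Lake is a lake in Trento (Administrative division in Trentino Alto-Adige)"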
7 Conclusions
We briefly reported our experience with the integration of the geo-catalogue into the SDI of the PAT and in particular with its semantic extension. S-Match, initially designed as a standalone application, was integrated with GeoNetwork. S-Match performs a semantic expansion of the query using a faceted ontology codifying the necessary domain knowledge about the geography of the PAT. This allows identifying information that would be more difficult to find using traditional information retrieval approaches. Future work includes extended support for Italian and the semantic expansion of entities such as Trento into their (administrative and topological) parts.
In this work we have also dealt with data refinement, concept integration through
parent or equivalent concept identification, ontology population using a heuristic-
based approach and finally with entity integration through entity matching. In particu-
lar, with the data refinement, depending on the cases, most of the macro-classes
needed to be split or merged so that their equivalent atomic concepts or parents could
be found in the knowledge base used (GeoWordNet in our case). We accomplished
the splitting/merging task manually supported by a statistical analysis, while the inte-
gration with the knowledge base was mostly automatic. Working on the PAT macro-
classes helped in learning how to reduce manual work in dealing with potentially
noisy sources. Entity integration was accomplished through entity matching, which we experimented with both within and across the entity repositories. The entity matching criteria that perform well within a single repository might need to be expanded or relaxed when the comparison takes place across datasets. Note that entity-type-specific matchers
might be necessary when dealing with different kinds of entities (e.g., persons,
organizations, events).
Acknowledgements
This work has been partially supported by the TasLab network project funded by the
European Social Fund under the act n° 1637 (30.06.2008) of the Autonomous Prov-
ince of Trento, by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 231126 LivingKnowledge - Facts, Opinions and Bias in Time, and by “LiveMemories - Active Digital Memories of Collective Life”, funded by the Autonomous Province of Trento. We are
thankful to our colleagues of the Informatica Trentina and in particular to Pavel
Shvaiko for the fruitful discussions on the implementation of the geo-catalogue within
the geo-portal of the Autonomous Province of Trento. We acknowledge Aliaksandr
Autayeu for his support for the integration of S-Match. We are grateful to Veronica
Rizzi for her technical support within the SGC project and to Biswanath Dutta for his
suggestions for the creation of the faceted ontology. Finally, we want to thank Daniela
Ferrari, Giuliana Ucelli, Monica Laudadio, Lydia Foess and Lorenzo Vaccari of the
PAT for their kind support.
References
1. Shvaiko, P., Ivanyukovich, A., Vaccari, L., Maltese, V., Farazi, F.: A semantic geo-catalogue implementation for a regional SDI. In: Proc. of the INSPIRE Conference (2010)
2. Giunchiglia, F., Dutta, B., Maltese, V.: Faceted Lightweight Ontologies. In: Borgida, A.T.,
Chaudhri, V.K., Giorgini, P., Yu, E.S. (eds.) Conceptual Modeling: Foundations and Ap-
plications. LNCS, vol. 5600, pp. 36–51. Springer, Heidelberg (2009)
3. Giunchiglia, F., Zaihrayeu, I.: Lightweight Ontologies. The Encyclopedia of Database Sys-
tems (2007)
4. Giunchiglia, F., Autayeu, A., Pane, J.: S-Match: an open source framework for matching
lightweight ontologies. The Semantic Web Journal (2010)
5. Ranganathan, S.R.: Prolegomena to library classification. Asia Publishing House (1967)
6. Cruz, I., Sunna, W.: Structural alignment methods with applications to geospatial ontolo-
gies. Transactions in Geographic Information Science 12(6), 683–711 (2008)
7. Euzenat, J., Shvaiko, P.: Ontology matching. Springer, Heidelberg (2007)
8. Janowicz, K., Wilkes, M., Lutz, M.: Similarity-based information retrieval and its role
within spatial data infrastructures. In: Proc. of GIScience (2008)
9. Maué, P.: An extensible semantic catalogue for geospatial web services. Journal of Spatial
Data Infrastructures Research 3, 168–191 (2008)
10. Stock, K., Small, M., Ou, Y., Reitsma, F.: OGC catalogue services - OWL application pro-
file of CSW. Technical report, Open Geospatial Consortium (2009)
11. Vaccari, L., Shvaiko, P., Marchese, M.: A geo-service semantic integration in spatial data
infrastructures. Journal of Spatial Data Infrastructures Research 4, 24–51 (2009)
12. Shvaiko, P., Vaccari, L., Trecarichi, G.: Semantic Geo-Catalog: A Scenario and Require-
ments. In: Proc. of the 4th Workshop on Ontology Matching at ISWC (2009)
13. Giunchiglia, F., McNeill, F., Yatskevich, M., Pane, J., Besana, P., Shvaiko, P.: Approxi-
mate structure-preserving semantic matching. In: Proc. of ODBASE (2008)
14. Giunchiglia, F., Maltese, V., Farazi, F., Dutta, B.: GeoWordNet: A resource for geo-spatial
applications. In: Aroyo, L., Antoniou, G., Hyvönen, E., ten Teije, A., Stuckenschmidt, H.,
Cabral, L., Tudorache, T. (eds.) ESWC 2010. LNCS, vol. 6088, pp. 121–136. Springer,
Heidelberg (2010)
15. European Parliament, Directive 2007/2/EC establishing an Infrastructure for Spatial In-
formation in the European Community (INSPIRE) (2009)
16. European Commission, COMMISSION REGULATION (EC) No 976/2009 implementing
Directive 2007/2/EC as regards the Network Services (2009)
17. Lutz, M., Ostlander, N., Kechagioglou, X., Cao, H.: Challenges for Metadata Creation and
Discovery in a multilingual SDI - Facing INSPIRE. In: Proc. of ISRSE (2009)
18. Crompvoets, J., Wachowicz, M., de Bree, F., Bregt, A.: Impact assessment of the
INSPIRE geo-portal. In: Proc. of the 10th EC GI & GIS workshop (2004)
19. Smits, P., Friis-Christensen, A.: Resource discovery in a European Spatial Data Infra-
structure. Transactions on Knowledge and Data Engineering 19(1), 85–95 (2007)
20. Ivanyukovich, A., Giunchiglia, F., Rizzi, V., Maltese, V.: SGC: Architettura del sistema.
Technical report, TCG/INFOTN/2009/3/D0002R5 (2009)
21. Giunchiglia, F., Villafiorita, A., Walsh, T.: Theories of Abstraction. In: AI Communica-
tions, vol. 10(3/4), pp. 167–176. IOS Press, Amsterdam (1997)
22. Giunchiglia, F., Walsh, T.: Abstract Theorem Proving. In: Proceedings of the 11th Interna-
tional Joint Conference on Artificial Intelligence (IJCAI 1989), pp. 372–377 (1989)
23. Kuhn, W.: Geospatial semantics: Why, of What, and How? Journal of Data Semantics
(JoDS) III, 1–24 (2005)
SoKNOS – Using Semantic Technologies
in Disaster Management Software
1 Introduction
to a large extent non-IT experts and only accustomed to casual usage of disas-
ter management software, since large incidents luckily occur rather infrequently.
This casual usage implies that, for example, users may not always have the right
terminology at hand in the first place, especially when facing information over-
load due to a large number of potentially relevant information sources. A large
incident poses a stressful situation for all involved users of disaster management
software. Users often need to operate multiple applications in a distributed and
heterogeneous application landscape in parallel to obtain a consistent view on
all available information.
The SoKNOS1 system [1] is a working prototype (see Fig. 1) for such soft-
ware using semantic technologies for various purposes. In SoKNOS, information
sources and services are annotated with ontologies for improving the provision of
the right information at the right time. The annotations are used for connecting
existing systems and databases to the SoKNOS system, and for creating visual-
izations of the information. Furthermore, the users’ actions are constantly super-
vised, and errors are avoided by employing ontology-based consistency checking.
A central design decision for the system was to ensure that any newly created
information as well as all integrated sensor information is semantically character-
ized, supporting the goal of a shared and semantically unambiguous information
basis across organizations. In this sense, semantic technologies were used in a
holistic and pervasive manner throughout the system, making SoKNOS a good example of the successful application of semantic technologies. Fig. 2 shows an overview of the ontologies developed and used in the SoKNOS project.
1 More information on the publicly funded SoKNOS project can be found at http://www.soknos.de
Fig. 2. Overview of the SoKNOS design-time application ontologies (Dialog Ontology, Ontology on Resources, Ontology on Damages, Ontology on Deployment Regulations) and the Geosensor Discovery Ontology, all importing the foundational ontology DOLCE, with mappings to the Geosensor Discovery Ontology.
Fig. 3. Six use cases for semantic technologies covered in the SoKNOS project
staff, etc.) or by software engineers. Fig. 3 shows an overview of these use cases identified in SoKNOS, classified according to the two criteria.
Each of the use cases presented in this paper has been implemented in the SoKNOS support system [3]. In this section, we illustrate the benefits of ontologies for the end users in each of these use cases. Furthermore, we explain how this functionality has been implemented using ontologies and semantic technology.
2 The Sensor Observation Service Interface Standard (SOS) is specified by the Sensor Web Enablement (SWE) initiative of the Open Geospatial Consortium (OGC); http://www.opengeospatial.org
3 A demo video of the Semantic Data Explorer can be found at http://www.soknos.de/index.php?id=470.
Finding the right tools. Current tools are rarely designed with end users, i.e., laymen with respect to ontological modeling, in mind. In fact, ontology editors are especially weak when it comes to complex ontologies [14].
The complex terminology modeling involved in ontology engineering is hard to comprehend for domain experts when confronted with existing modeling artifacts. In our experience, domain experts in disaster management have not been trained in any modeling environment.
Ontology editors need improvement in their “browsing mechanisms, help systems and visualization metaphors” [14], a statement from 2005 which unfortunately still holds true. Better ontology visualization helps to quickly gain an
Finding the right visualization depth. Offering only a class browser with a tree-like visualization of the ontology's extensive OWL class hierarchy caused confusion among end users. The SoKNOS inventory management application visualized the modeled OWL class hierarchy directly in a class browser. Here, the end user can, for example, browse resources like cars, trucks and aircraft. However, due to the numerous concepts and the extensive class hierarchy, the user actions of selecting and expanding nodes were often too complicated for the end user. If the end user does not know exactly where in the class hierarchy the desired concept is located, browsing involves a high error rate in the exploration process, because the explored classes and sub-classes may not contain the concept the end user looks for. As shown in the example above, the concept “rescue helicopter” has six upper classes. In this example, an end user needs to select and expand the right node six times to finally select the concept “rescue helicopter” as a resource in the class browser, due to the direct visualization of the OWL class hierarchy. In sum, we found the simple browser visualization of an OWL class hierarchy not sufficient for an end-user interface.
In SoKNOS, we have addressed this challenge by hiding top-level categories in the user interface. Only concepts defined in the core domain ontology are used in the user interface (but are still available to the reasoner); the foundational categories from DOLCE are not. Thus, the end user only works with concepts from her own domain. As a further extension, intermediate categories that do not provide additional value for the end user, such as “motorized means of transportation”, can be suppressed in the visualization.
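A minimal sketch of this filtering (not the SoKNOS code; the class objects and the hiding predicate are assumptions) shows how hidden categories are skipped while their children are re-attached to the nearest visible ancestor:

def visible_children(cls, is_hidden):
    """Yield the nearest descendants of cls that should appear in the UI class tree."""
    for child in cls.subclasses:
        if is_hidden(child):
            # skip the hidden category but keep exploring its subtree
            yield from visible_children(child, is_hidden)
        else:
            yield child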
Finding the right visualization. Various ways of visualizing ontologies and annotated data exist [18]. In the SoKNOS Semantic Data Explorer discussed above, we have used a straightforward graph view which, like the standard OWL visualization, uses ellipses for instances and rectangles for data values. The user studies have shown that even that simple, straightforward solution provides a large benefit for the end user. Thus, slightly modifying Jim Hendler's famous quote [19], we can state that a little visualization goes a long way.
4 Conclusion
In this paper, we have introduced the SoKNOS system, a functional prototype
for an integrated emergency management system which makes use of ontologies
and semantic technologies for various purposes.
In SoKNOS, ontologies have been used both at design-time and at run-time of
the system. Ontologies are used for providing a mutual understanding between
developers and end users as well as between end users from different organiza-
tions. By annotating information objects and data sources, information retrieval,
the discovery of relevant Web services and the integration of different databases
containing necessary information, are simplified and partly automated. Further-
more, ontologies are used for improving the interaction with the system by fa-
cilitating user actions across application borders, and by providing plausibility
checks for avoiding mistakes due to stressful situations.
During the course of the project, we have employed ontologies and semantic technologies in various settings and derived several key lessons learned. First, the ontology engineering process should involve end users from the very beginning and should foresee the role of dedicated ontology engineers: ontology engineering is a non-trivial task which differs significantly from software engineering, so it cannot simply be taken over by a software engineer. Tool support is currently not sufficient for letting untrained users build a useful ontology.
Second, current semantic annotation mechanisms for class models are not
suitable. Those mechanisms are most often intrusive and require a 1:1 mapping
between the class model and the ontology. When dealing with legacy code, both
assumptions are unrealistic. Thus, different mechanisms for semantically anno-
tating class models are needed. Furthermore, relying on a programming model
backed by an ontology and using reasoning at run-time imposes significant chal-
lenges to reactivity and performance.
Third, it is not trivial to find an appropriate modeling and visualization depth for ontologies. While a large modeling depth is useful for some tasks, the feedback from the end users pointed to the need for simpler visualizations. In SoKNOS, we have addressed that need by reducing the complexity of the visualization and by providing a straightforward, but very useful graphical visualization of the annotated data.
In summary, we have shown a number of use cases which demonstrate how
the employment of ontologies and semantic technologies can make emergency
management systems more useful and versatile. The lessons learned can also be
transferred to projects with similar requirements in other domains.
Acknowledgements
The work presented in this paper has been partly funded by the German Federal
Ministry of Education and Research under grant no. 01ISO7009.
References
1. Paulheim, H., Döweling, S., Tso-Sutter, K., Probst, F., Ziegert, T.: Improving Usability of Integrated Emergency Response Systems: The SoKNOS Approach. In: Proceedings of the 39. Jahrestagung der Gesellschaft für Informatik e.V. (GI) – Informatik 2009. LNI, vol. 154, pp. 1435–1449 (2009)
2. Masolo, C., Borgo, S., Gangemi, A., Guarino, N., Oltramari, A.: WonderWeb De-
liverable D18 – Ontology Library (final) (2003),
http://wonderweb.semanticweb.org/deliverables/documents/D18.pdf
(accessed August 2, 2010)
3. Döweling, S., Probst, F., Ziegert, T., Manske, K.: SoKNOS - An Interactive Visual
Emergency Management Framework. In: Amicis, R.D., Stojanovic, R., Conti, G.
(eds.) GeoSpatial Visual Analytics. NATO Science for Peace and Security Series
C: Environmental Security, pp. 251–262. Springer, Heidelberg (2009)
4. Paulheim, H., Probst, F.: Application Integration on the User Interface Level: an
Ontology-Based Approach. Data & Knowledge Engineering Journal 69(11), 1103–
1116 (2010)
5. Paulheim, H.: Efficient semantic event processing: Lessons learned in user inter-
face integration. In: Aroyo, L., Antoniou, G., Hyvönen, E., ten Teije, A., Stucken-
schmidt, H., Cabral, L., Tudorache, T. (eds.) ESWC 2010. LNCS, vol. 6089, pp.
60–74. Springer, Heidelberg (2010)
6. Angele, J., Erdmann, M., Wenke, D.: Ontology-Based Knowledge Management in
Automotive Engineering Scenarios. In: Hepp, M., Leenheer, P.D., Moor, A.D., Sure,
Y. (eds.) Ontology Management. Semantic Web and Beyond, vol. 7, pp. 245–264.
Springer, Heidelberg (2008)
7. Angele, J., Lausen, G.: Handbook on Ontologies. In: Staab, S., Studer, R. (eds.)
International Handbooks on Information Systems, 2nd edn., pp. 45–70. Springer,
Heidelberg (2009)
8. Voigt, K., Ivanov, P., Rummler, A.: MatchBox: Combined Meta-model Matching
for Semi-automatic Mapping Generation. In: Proceedings of the 2010 ACM Sym-
posium on Applied Computing, pp. 2281–2288. ACM, New York (2010)
9. Sonntag, D., Deru, M., Bergweiler, S.: Design and Implementation of Combined
Mobile and Touchscreen-based Multimodal Web 3.0 Interfaces. In: Arabnia, H.R.,
de la Fuente, D., Olivas, J.A. (eds.) Proceedings of the 2009 International Confer-
ence on Artificial Intelligence (ICAI 2009), pp. 974–979. CSREA Press (2009)
10. Babitski, G., Bergweiler, S., Hoffmann, J., Schön, D., Stasch, C., Walkowski, A.C.:
Ontology-based integration of sensor web services in disaster management. In:
Janowicz, K., Raubal, M., Levashkin, S. (eds.) GeoS 2009. LNCS, vol. 5892, pp.
103–121. Springer, Heidelberg (2009)
11. Liu, B., Chen, H., He, W.: Deriving User Interface from Ontologies: A Model-Based
Approach. In: ICTAI 2005: Proceedings of the 17th IEEE International Conference
on Tools with Artificial Intelligence, pp. 254–259. IEEE Computer Society, Wash-
ington, DC, USA (2005)
12. Bizer, C., Heath, T., Berners-Lee, T.: Linked Data - The Story So Far. International
Journal on Semantic Web and Information Systems 5(3), 1–22 (2009)
13. Paulheim, H., Meyer, L.: Ontology-based Information Visualization in Integrated
UIs. In: Proceedings of the 2011 International Conference on Intelligent User In-
terfaces (IUI), pp. 451–452. ACM, New York (2011)
14. García-Barriocanal, E., Sicilia, M.A., Sánchez-Alonso, S.: Usability evaluation of ontology editors. Knowledge Organization 32(1), 1–9 (2005)
15. Babitski, G., Probst, F., Hoffmann, J., Oberle, D.: Ontology Design for Information
Integration in Catastrophy Management. In: Proceedings of the 4th International
Workshop on Applications of Semantic Technologies, AST 2009 (2009)
16. Paulheim, H., Plendl, R., Probst, F., Oberle, D.: Mapping Pragmatic Class Mod-
els to Reference Ontologies. In: 2nd International Workshop on Data Engineering
meets the Semantic Web, DESWeb (2011)
17. Hepp, M.: Possible Ontologies: How Reality Constrains the Development of Rele-
vant Ontologies. IEEE Internet Computing 11(1), 90–96 (2007)
18. Katifori, A., Halatsis, C., Lepouras, G., Vassilakis, C., Giannopoulou, E.G.: On-
tology Visualization Methods - A Survey. ACM Comput. Surv. 39(4) (2007)
19. Hendler, J.: On Beyond Ontology (2003),
http://iswc2003.semanticweb.org/hendler_files/v3_document.htm;
Invited Talk at the International Semantic Web Conference (2003)
Semantic Technologies for Describing
Measurement Data in Databases
1 Introduction
Reliable and safe automation is one foundation of modern traffic systems and part of concepts for assistance and automation systems. Therefore, for the analysis and reflection of users in automated environments, experimental systems (e.g., vehicles and driving simulators) play an important role. They produce exhaustive amounts of data, which can be used for an evaluation of the considered system in a realistic environment. Long-term studies and naturalistic driving studies [7] result in similar datasets. Much data means a lot of information to interpret and many potential results to discover. Our motivation is therefore to store derived metadata closely connected with its original data for integrated processing. By using semantic technologies, a continuous technical and semantic process can be provided.
As a starting point, the experimental data is considered to be already stored in a relational database and should not be changed in the process. As many of the following principles apply not only to measurement data but to tables in relational databases in general, the term bulk data is used as an equivalent for measurement data. The aim is to use semantic technologies for describing the database elements in order to support their interpretation [15]. They allow complex data structures to be handled formally in a very flexible way, and schema knowledge can be extended easily. This is a demand resulting from the fact that experimental systems are under continuous development and projects touch different knowledge domains.
Figure 1 shows the intended database elements for semantic annotations. Tables containing sensor data of an experiment can be tagged with experiment information. Columns can have a semantic annotation about the physical unit of the contained elements and the sensor used for recording. Rows or cells can get an annotation about bad quality or detected events in the considered elements. A complex application case is presented in Sect. 5. More meaningful examples of generic metadata annotations are also discussed in [16,12].
In some of the following considerations it is assumed that table records have an order, so the term row is preferred over record or tuple. Rows are also considered with the restriction that they must be uniquely identifiable. The needed database elements from Fig. 1 are transformed into unique URIs to be used in RDF statements. However, the data itself is not transformed, since the values themselves are not used for the presented purposes. Based on this, annotations are used similarly to a formal memo or notepad mechanism. As domain knowledge is solely modelled in exchangeable ontologies, the presented concepts can also be applied to other domains.
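A minimal sketch of this URI derivation and memo-style annotation in Python with rdflib; the db: URI pattern and the annotation properties are illustrative assumptions, only the ts: prefix is taken from the example query later in this paper.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS

# Hypothetical namespace for database elements; ts: is the vocabulary prefix
# used in the example query in the application case.
DB = Namespace("http://example.org/db/")
TS = Namespace("http://www.dlr.de/ts/")

def column_uri(schema, table, column):
    """Derive a unique URI for a column from its database coordinates."""
    return DB[f"{schema}/{table}#{column}"]

def row_uri(schema, table, key):
    """Rows must be uniquely identifiable, e.g. via their primary key."""
    return DB[f"{schema}/{table}/row/{key}"]

g = Graph()
g.bind("ts", TS)

# Column annotation: physical unit and recording sensor
# (hasUnit and recordedBy are assumed property names).
velocity = column_uri("exp42", "measurements", "Velocity")
g.add((velocity, TS.hasUnit, Literal("m/s")))
g.add((velocity, TS.recordedBy, TS.GPSReceiver))

# Memo-style annotation on a single row: bad quality plus a free-text note.
row = row_uri("exp42", "measurements", 1017)
g.add((row, TS.hasQuality, TS.BadQuality))
g.add((row, RDFS.comment, Literal("sensor dropout", lang="en")))

print(g.serialize(format="turtle"))

The data values themselves stay in the relational table; only the element URIs appear in the RDF statements.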
The Web Ontology Language (OWL) is based on the principles of RDF and can be used to formally describe complete ontologies. An ontology describes terms as well as their relations and rules. Especially for plain rules, the Semantic Web Rule Language (SWRL) is available as an extension of OWL [13].
2.2 RDF-Stores
An RDF repository [18,1] (triple store) is used to manage RDF statements. It supports adding, manipulating and searching statements in the repository. Such repositories can have different backends to store the managed statements. Using relational databases as a backend is the focus of this paper. The statements, consisting of subject, predicate and object, are stored in a separate relation in the underlying database. The advantages of different optimization strategies are exhaustively discussed in [22]. Current repositories often directly support the SPARQL Protocol and RDF Query Language (SPARQL), which has a similar importance for RDF stores as SQL has for relational databases.
3 Data Management
The recorded time-structured bulk data is stored in a relational database. All other data either describes these time series or is deduced from them. As a result, all data besides the time series is considered as metadata. For the annotations of each bulk-data table, a separate RDF model is used. In this way, good scalability can be reached. Generic information is stored in a jointly used RDF model.
The strict separation of bulk data and annotations also reflects their different usage. Once recorded, measurement data is not modified anymore. Its meaning is hidden in the recorded numeric values, which have to be interpreted. On top of the bulk data, there is an abstract layer of interpretations. Descriptions (e.g., for columns) or identified events (e.g., an overtaking manoeuvre), as conceived by humans, reside on this layer. In contrast to the bulk data, it contains far fewer elements. This abstraction layer is used during analysis, so its statements are highly dynamic. These statements are usually assertional (ABox). Schema knowledge (TBox) can be edited using Protégé [19] or a Semantic Wiki [17], as known from RDFS/OWL, and then be integrated (cf. Sect. 4). The presented approach adapts the commonly discussed multimedia annotations [9,11] to measurement data. The term micro-annotations is also adopted from this domain.
4 User Interface
The graphical user interface allows semantic annotations to be visualized. During subsequent processing steps, it helps to understand intermediate results. Fig. 3 shows a screenshot of the application, called Semantic Database Browser. In the left window a schema, table or column can be chosen. The main window shows the selected tables. Database elements that have annotations are highlighted with a colour. When such an annotated element is selected, the bottom window shows the annotations as a tree. In Fig. 3 the column Velocity is selected in the left window.
The tree representation was selected as it is very compact, especially compared to graph visualizations. Since only the actually needed sub-tree is fetched from the memory model, this approach conserves resources. However, it must be kept in mind that only outgoing edges of the graph model are drawn. If there are cycles in the graph and the tree is expanded, nodes can appear multiple times.
In front of each tree element is a symbol which indicates the type of the node (named resource, blank node, property or literal). The option describe properties lets the tree also show properties of properties. In this way, e.g., domain and range restrictions of properties can be viewed. Another feature is to optionally show labels (identified by rdfs:label [10]) instead of resource names (URIs or anonymous ids). In this case the literal is chosen which best matches the language selected in the user interface (montage in Fig. 3). Using the context menu it is also possible to directly open URI resources with the associated program, e.g., a web page in a semantic wiki with the web browser.
Fig. 3. User interface for interactive working with annotations, including highlighted elements and a montage showing nationalized labels
Since annotations are stored in different models for each bulk-data table (cf. Sect. 3.2), they are loaded on demand for the currently shown table. For that reason, interactions with annotations in the interface are generally performed on the base graph model GB. This graph is the union of the actually used graphs. It at least includes the generic graph GG, containing annotations for the database instance, the schemas and the table schemas. Annotations on bulk data, which is always stored in new tables, are managed in separate RDF models. For each table Ti there exists a graph model GTi. The graph GB is manipulated during runtime and is the union of the generic annotations GG and those of the currently viewed table Tj: GB = GG ∪ GTj. Since GG grows only very moderately and most annotations are managed in the different GTi, the scalability of the approach for many experiments with their measurement tables is assured. If one table can be processed, this is also possible for a great many of them.
It can be necessary to reference external knowledge bases GOWLi in the form of OWL documents, where i ∈ I and I is the set of relevant OWL documents. As a consequence, each GOWLi can be added as a sub-graph to GB, with the result GB = GG ∪ GTj ∪ ⋃i∈I GOWLi. In the case of measurement data handling, such an additional OWL document primarily contains additional schema knowledge, as assertional knowledge concerning database elements is maintained in the GTi.
The resulting graph of graphs is shown in the bottom half of Fig. 4, where the sub-graphs are labelled with full-text names. This graphical presentation is available in the main window, accessible via a tab. The user can interact with the graph, e.g., remove sub-graphs or request some statistics about them. To further support visualization, layers are offered, which are configured by the user in the left window (accessed via a tab). A layer is either the graph GB as the complete union of the included knowledge bases (base layer) or a view derived from GB with an additional layer name and colour (Fig. 4, upper half).
If a database element is contained in a layer, it is highlighted in the specific colour of that layer. Furthermore, layers have an order defined by the user. An element contained in several layers is shown in the colour of the layer with the greatest index (Fig. 5). In this way, a layer concept is realized similar to that found in geographic information systems. Derived graphs for layers are created using a reasoner or SPARQL queries (Construct, Describe or Select) on the base graph GB.
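A rough sketch of how this graph composition and layer derivation could be realized in Python with rdflib; the file names, the layer query and the quality vocabulary are illustrative assumptions, not the authors' implementation.

from rdflib import Graph

# Generic annotations (G_G), per-table annotations (G_Tj) and external OWL
# documents (G_OWLi); file names are placeholders.
g_generic = Graph().parse("generic-annotations.ttl")
g_table = Graph().parse("table-T7-annotations.ttl")
owl_docs = [Graph().parse("driving-domain.owl")]

# Base graph G_B = G_G ∪ G_Tj ∪ ⋃ G_OWLi; '+' unions the triples of two graphs.
g_base = g_generic + g_table
for g_owl in owl_docs:
    g_base = g_base + g_owl

# A layer is a view derived from G_B, here e.g. all bad-quality annotations
# (the query and the quality vocabulary are assumptions).
LAYER_QUERY = """
PREFIX ts: <http://www.dlr.de/ts/>
CONSTRUCT { ?elem ts:hasQuality ts:BadQuality }
WHERE     { ?elem ts:hasQuality ts:BadQuality }
"""
bad_quality_layer = Graph()
for triple in g_base.query(LAYER_QUERY):
    bad_quality_layer.add(triple)

# Elements contained in this layer would be highlighted in the layer's colour.
print(len(bad_quality_layer), "annotations in the 'bad quality' layer")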
Fig. 6. Time line representation from user interface for the base layer and two layers
containing selected events
In the following steps of semantic enrichment, every analysis step builds on the previous ones. Thus, with every step the point of view gets more abstract and more comprehensible for domain specialists. Below, the concept overtaking is defined. An overtaking manoeuvre occurs if the events blink left, near centreline and vehicle ahead lost arise at the same time. Based on this, an analyst can search for all these events in an analysis tool and then directly visualize the found situations. Technically, this search is implemented using SPARQL queries. In the previously presented user interface, PREFIX statements for registered namespaces are optional:
PREFIX ts: <http://www.dlr.de/ts/>
SELECT ?x WHERE {
?x ts:hasProperty ts:BlinkLeft .
?x ts:hasProperty ts:NearCentreline .
?x ts:hasProperty ts:VehicleAheadLost .
}
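For illustration, the query can also be executed programmatically, e.g. with rdflib in Python; the annotation file name below is a placeholder, and the query is the one shown above.

from rdflib import Graph

# Annotation model of the currently analysed table (placeholder file name).
g = Graph().parse("table-T7-annotations.ttl")

OVERTAKING = """
PREFIX ts: <http://www.dlr.de/ts/>
SELECT ?x WHERE {
  ?x ts:hasProperty ts:BlinkLeft .
  ?x ts:hasProperty ts:NearCentreline .
  ?x ts:hasProperty ts:VehicleAheadLost .
}
"""

# Every result is a database element annotated with all three events,
# i.e. a candidate overtaking manoeuvre that can be visualized directly.
for row in g.query(OVERTAKING):
    print("overtaking candidate:", row.x)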
The presented user interface allows an intuitive workflow with the introduced annotations. Semantic annotations are flexible and well suited for this kind of knowledge integration. By storing the time-structured data in a relational database, it can easily be accessed through the standard database connectivity of the applications in use, without any adaptations.
On the other hand, there are also challenges in the presented approach. Using two technologies instead of one for accessing specific data increases complexity. That also influences the modelling of the considered data: the developer must decide whether to store data in a relational way or as annotations. Moreover, accessing annotations can be slower than a purely relational storage. An RDF store allows separate data models to be handled for the annotation instances, so separate models can be used for the different time-structured bulk-data tables to ensure scalability. Generally, we recommend a relational design for bulk data, which has a fixed schema. For lightweight data and metadata, which constantly evolve and whose interpretation is in focus, we prefer annotations.
The use cases and ontologies are currently refined with a stronger focus on
temporal aspects. Furthermore, the integration of automated services in a SWS
has to be realized. A further medium-term objective is to polish the user interface
and port it to the Eclipse Rich Client Platform for a productive deployment.
References
1. Aduna-Software: OpenRDF.org — ... home of Sesame. WWW (October 2008),
http://www.openrdf.org/
2. Altintas, I., et al.: Kepler: An extensible system for design and execution of scientific workflows. In: Scientific and Statistical Database Management (2004)
3. Barrasa, J., Corcho, Ó., Gómez-Pérez, A.: R2O, an Extensible and Semantically Based Database-to-ontology Mapping Language. In: Workshop on Semantic Web and Databases (2004)
4. Baumann, M., et al.: Integrated modelling for safe transportation — driver mod-
eling and driver experiments. In: 2te Fachtagung Fahrermodellierung (2008)
5. Berners-Lee, T.: Relational Databases on the Semantic Web (1998),
http://www.w3.org/DesignIssues/RDB-RDF.html
6. Bizer, C.: D2R MAP — A Database to RDF Mapping Language. In: World Wide
Web Conference (2003)
7. Fancher, P., et al.: Intelligent cruise control field operational test (final report).
Tech. rep., University of Michigan (1998)
8. Gačnik, J., et al.: DESCAS — Design Process for the Development of Safety-
Critical Advanced Driver Assistance Systems. In: FORMS (2008)
9. Gertz, M., Sattler, K.U.: Integrating scientific data through external, concept-based
annotations. In: Data Integration over the Web (2002)
10. Hayes, P.: RDF Semantics. W3C Recommendation (February 2004)
11. Herrera, P., et al.: Mucosa: A music content semantic annotator. In: Music Infor-
mation Retrieval (2005)
12. Köster, F.: Datenbasierte Kompetenz- und Verhaltensanalyse — Anwendungs-
beispiele im selbstorganisierten eLearning. In: OlWIR (2007)
13. Motik, B., Sattler, U., Studer, R.: Query answering for OWL-DL with rules. Journal
of Web Semantics: Science, Services and Agents on the World Wide Web 1, 41–60
(2005)
14. Noyer, U., Beckmann, D., Köster, F.: Semantic annotation of sensor data to sup-
port data analysis processes. In: Semantic Authoring, Annotation and Knowledge
Markup Workshop (SAAKM) (2009)
15. Noyer, U., Beckmann, D., Köster, F.: Semantic technologies and metadata system-
atisation for evaluating time series in the context of driving experiments. In: 11th
International Protégé Conference, pp. 17 – 18 (2009)
16. Rothenberg, J.: Metadata to support data quality and longevity. In: Proceedings
of the 1st IEEE Metadata Conference (1996)
17. Schaffert, S., Bry, F., Baumeister, J., Kiesel, M.: Semantic Wiki. Informatik-Spektrum 30, 434–439 (2007)
18. SourceForge.net. Jena — A Semantic Web Framework for Java (October 2008),
http://jena.sourceforge.net/
19. Stanford Center for Biomedical Informatics Research: The protégé ontology editor
and knowledge acquisition system. WWW (April 2009),
http://protege.stanford.edu/
20. myGrid team: Taverna workbench project website (November 2009),
http://taverna.sourceforge.net/
21. Tusch, G., Huang, X., O’Connor, M., Das, A.: Exploring microarray time series
with Protégé. In: Protégé Conference (2009)
22. Velegrakis, Y.: Relational Technologies, Metadata and RDF, ch. 4, pp. 41–66.
Springer, Heidelberg (2010)
23. Virgilio, R.D., Giunchiglia, F., Tanca, L. (eds.): Semantic Web Information Man-
agement: A Model-Based Perspective, ch. 11, pp. 225–246. Springer, Heidelberg
(2010)
24. Vollrath, M., et al.: Erkennung von Fahrmanövern als Indikator für die Belastung
des Fahrers. In: Fahrer im 21. Jahrhundert (2005)
Ontology-Driven Guidance for Requirements
Elicitation
1 Introduction
This work is organized in the following way: Section 2 presents related work. Section 3 motivates our research; Section 4 introduces our approach to ontology-based guidance. Section 5 presents an evaluation of the tool, and Section 6 concludes and gives ideas about future work.
2 Related Work
This section summarizes related work, going from the broad field of requirements
engineering to the more specific areas of elicitation guidance and finally pattern-
based requirements.
two or more products of the development process can be established [1]. Gotel [7]
and Watkins [21] describe why requirements tracing can help project managers
in verification, cost reduction, accountability, change management, identification
of conflicting requirements and consistency checking of models.
The authors did not propose a fixed list of boilerplates1 but instead envisioned a flexible language that can be adapted or enriched when necessary.
Stålhane, Omoronyia and Reichenbach [20] extended boilerplates with a do-
main ontology by linking attribute values to ontology concepts. They adapted
the requirements analyses introduced by Kaiya [13] to boilerplate requirements
and added a new analysis called opacity. The requirement language used in this
work is based on their combination of boilerplates and the domain ontology.
Ibrahim et al. [10] use boilerplate requirements in their work about require-
ments change management. They define a mapping from boilerplate attributes
to software design artifacts (e.g., classes, attributes, operations) and add trace-
ability links between requirements and artifacts accordingly. There are several
other pattern based languages similar to boilerplates, e.g., requirements based
on EBNF grammars [19]. Denger et al. [4] propose natural language patterns
to specify requirements in the embedded systems domain. They include a meta-
model for requirement statements and one for events and reactions which they
use to check the completeness of the pattern language. Compared to boilerplate
requirements, their patterns seem to be a bit less generic, e.g., some of the non-
functional requirements used in our evaluation would be impossible to express.
Matsuo, Ogasawara and Ohnishi [16] use controlled natural language for re-
quirements, basically restraining the way in which simple sentences can be com-
posed to more complex ones. They use a frame model to store information about
the domain. There are three kinds of frames. The noun frame classifies a noun
into one of several predefined categories. The case frame classifies verbs into
operations and contains the noun types which are required for the operation. Fi-
nally the function frame represents a composition of several simple operations.
The authors use these frames to parse requirements specifications, to organize
them according to different viewpoints and to check requirements completeness.
In contrast to domain ontologies, the frame-based approach seems to be harder
to understand and to adapt by non-experts.
3 Research Issues
Several approaches have been presented that use ontologies to analyze requirements. These approaches try to measure quality aspects like completeness,
correctness and consistency on a set of requirements. In [20] there is an analysis
called opacity that basically checks if, for two concepts occurring in a require-
ment, there is a relation between them in the domain ontology. A conclusion
of our review of this analysis was that, rather than first writing an incorrect
requirement, then analyzing and improving it, a better approach would be to
actually suggest the very same domain information which is used for the opacity
analysis to the requirements engineer in the first place. There are two points to
this idea:
1 J. Dick maintains a list of suggestions at http://freespace.virgin.net/gbjedi/books/re/boilerplates.htm, though.
2 http://cesarproject.eu/
Concept: A concept represents an entity in the problem domain. The entity can
be material (e.g., a physical component of the system) or immaterial (e.g., a
temporal state). OWL classes are used to represent concepts. The reason for
using classes instead of individuals is the built-in support for sub-classing.
A concept has two attributes, its name and a textual definition. The defini-
tion is intended to provide the possibility to check whether the correct term
is used.
Relation: A relation is a labeled directed connection between two concepts.
A relation contains a label which is expected to be a verb. The label, the
relation’s source and destination concepts form a subject-verb-object triple.
Relations are used for guidance (section 4.3). Relations map to OWL object
properties and object property restrictions.
Axiom: There are two types of axioms that are relevant to guidance: sub-class
and equivalence axioms. The first one specifies that one concept is a sub-class
of another concept, e.g., cargo door is a sub-class of door. This information is
used to “inherit” suggestions to sub-class concepts, e.g., the guidance system
infers the suggestion the user opens the cargo door from the base class’ the
user opens the door.
The equivalence axiom is used to express that two concepts having different
names refer to the same entity in the domain. An example from DMS is the
equivalence of aircraft and airplane. Ideally each real-world phenomenon
has exactly one name. However, due to requirements coming from different
stakeholders or due to legacy reasons, at times several names are required.
It is possible to mark a concept as being deprecated; the tool will warn about occurrences of such concepts and will suggest using an equivalent non-deprecated concept instead. A small sketch of these constructs is given below.
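The following sketch (Python with rdflib) shows how the concept, relation and axiom types described above can be expressed in OWL; the dms: namespace, the textual definitions and the use of domain/range instead of property restrictions are illustrative assumptions, not the project's actual DMS ontology.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS

DMS = Namespace("http://example.org/dms#")  # hypothetical namespace
g = Graph()
g.bind("dms", DMS)

# Concepts: OWL classes with a name and a textual definition (rdfs:comment).
for cls, definition in [
    (DMS.Door, "A movable barrier in the aircraft fuselage."),       # illustrative text
    (DMS.CargoDoor, "A door used for loading and unloading cargo."),
]:
    g.add((cls, RDF.type, OWL.Class))
    g.add((cls, RDFS.comment, Literal(definition, lang="en")))

# Sub-class axiom: cargo door is a sub-class of door.
g.add((DMS.CargoDoor, RDFS.subClassOf, DMS.Door))

# Equivalence axiom: aircraft and airplane denote the same entity.
g.add((DMS.Aircraft, RDF.type, OWL.Class))
g.add((DMS.Airplane, RDF.type, OWL.Class))
g.add((DMS.Airplane, OWL.equivalentClass, DMS.Aircraft))

# Relation: a labelled, directed connection between two concepts
# ("the user opens the door"), modelled as an object property.  The paper
# uses object property restrictions; domain/range is a simplification here.
g.add((DMS.opens, RDF.type, OWL.ObjectProperty))
g.add((DMS.opens, RDFS.label, Literal("opens", lang="en")))
g.add((DMS.opens, RDFS.domain, DMS.User))
g.add((DMS.opens, RDFS.range, DMS.Door))

print(g.serialize(format="turtle"))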
In this work we assume the pre-existence of a suitable domain ontology. See [17] for ways of constructing new domain ontologies. The tool contains an ontology editor that is tailored to the information described here. We found this editor to be more user-friendly than generic OWL editors like Protégé.
4.3 Guidance
When filling the attributes of a boilerplate, the tool provides a list of suggestions
to the requirements engineer. The provided guidance depends on the attribute
the requirements engineer is currently filling, e.g., the suggestions for system
will be completely different than for action. The idea is to apply an attribute-
based pre-filter to avoid overwhelming the user with the complete list of ontology
entities. Typing characters further filters this list of suggestions to only those
entries matching the typed string.
It is not mandatory to choose from the list of suggestions; the tool will not
stop the requirements engineer from entering something completely different.
In case information is missing from the domain ontology, an update of the on-
tology should be performed to improve the guidance for similar requirements.
All changes to the domain ontology should be validated by a domain expert to
ensure data correctness.
There are three types of suggestions for an attribute; Table 3 provides an overview of the suggestion types.
Concept: The tool suggests using the name of a concept for an attribute. The tool generates two variants: the plain name and the name prefixed with the article “the”. The idea is that most of the time using “the” will be appropriate, but sometimes other determiners like “all” or “each” are more suitable and are typed in manually.
Verb-Object: The tool uses a relation from the domain ontology to suggest a verb phrase to the requirements engineer. The suggestion is the concatenation of the verb’s infinitive form, the word “the” and the relation’s destination object. This construction is chosen in order to be grammatically correct following a modal verb like “shall”. An example from Figure 2 is the suggestion check the door status.
Subject-Verb-Object: For this kind of suggestion the entire relation including
subject and object is taken into account. The suggestion text is “the”, the
subject, the verb conjugated into third person singular form, “the” and the
object. An example from Figure 2 is the suggestion the person tries to open
the door.
Table 3. Overview of the suggestion types

Type                 Suggestion
Concept              concept
                     the concept
Verb-Object          verb (inf.) the object
Subject-Verb-Object  the subject verb (3rd sing.) the object
per attribute and the sub-class axioms mentioned in Table 1. The domain on-
tology imports the attributes ontology to use its classes. Domain concepts are
linked to attributes by means of sub-class axioms which are stored in the domain
ontology.
An example for the semantic guidance system is given in Figure 2. The do-
main ontology is shown in the upper part of the figure, the attributes ontology
is below. The concept Doors Management System is a sub-class of class system,
which in turn allows the tool to suggest using the Doors Management System
for a boilerplate containing the attribute system. The blue regions represent
verb-object and subject-verb-object suggestions in the domain ontology. Their
mapping to the attributes action and operational condition is inferred auto-
matically by the tool.
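A rough sketch of how such suggestions could be computed from the ontology (Python with rdflib); the attributes namespace, the file name and the handling of verb forms are simplifying assumptions about the tool's internals.

from rdflib import Graph, Namespace
from rdflib.namespace import RDFS

ATTR = Namespace("http://example.org/attributes#")  # attributes ontology (hypothetical)

g = Graph()
g.parse("dms-domain.owl")  # domain ontology importing the attributes ontology

def label(term):
    """Human-readable name: rdfs:label if present, otherwise the local name."""
    return str(g.value(term, RDFS.label) or term.split("#")[-1])

def concept_suggestions(attribute):
    """Concept suggestions: every domain concept linked (directly or via
    inherited sub-class axioms) to the attribute class, in two variants."""
    out = []
    for cls in g.transitive_subjects(RDFS.subClassOf, attribute):
        if cls == attribute:
            continue
        out += [label(cls), "the " + label(cls)]
    return out

def verb_object_suggestions():
    """Verb-object suggestions built from relations: the verb label followed
    by 'the' and the destination concept (the label is assumed to already be
    the infinitive form; ranges stand in for property restrictions)."""
    q = "SELECT ?verb ?obj WHERE { ?rel rdfs:label ?verb ; rdfs:range ?obj . }"
    return [f"{row.verb} the {label(row.obj)}"
            for row in g.query(q, initNs={"rdfs": RDFS})]

print(concept_suggestions(ATTR.system))  # e.g. ['Doors Management System', 'the Doors Management System']
print(verb_object_suggestions())         # e.g. ['check the door status', ...]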
Figure 3 shows the boilerplates for two requirements and some of the suggestions provided by the guidance system. The information that the Doors Management System is a system and that second and millisecond are values for the attribute unit is stored in the domain ontology itself. The suggestions check the door status and determine the door unlockability are inferred from the domain ontology relations. The knowledge to suggest verb-object pairs for the attribute action is a built-in feature of the tool. The attribute operational condition turned out to be the most difficult one in terms of providing useful suggestions.
5 Evaluation
As mentioned before we evaluated the semantic guidance system with a domain
ontology and a set of requirements from the Doors Management System.
5.1 Setting
The use case contains a set of 43 requirements specified using natural language
text. Various types of requirements are included: functional, safety, performance,
reliability, availability and cost. Each requirement was reformulated into a boil-
erplate requirement using DODT. The semantic guidance system was used to
assist in filling the boilerplate attributes.
The domain ontology for DMS was specifically developed for usage with the
boilerplates tool. The data for the DMS ontology was initially provided by EADS
and was then completed by the authors. Table 4 lists the number of concepts,
relations and axioms of the DMS ontology.
Figure 4 shows the graphical user interface of the tool. At the top of the
interface boilerplates can be selected. The center shows the currently selected
boilerplates and text boxes for the attribute values of the requirements. The list
of phrases below the text boxes are the suggestions provided by the semantic
guidance system. Typing in the text boxes filters the list of suggestions. The tool
shows the textual definitions of the concepts as tooltips. Selecting a list entry
will add the text to the corresponding text box. The bottom of the interface lists
all requirements. Expressions that refer to entities from the domain ontology are
underlined with green lines; fixed syntax elements of boilerplates, with black lines. Nouns missing from the domain ontology would be highlighted in red.
Table 5 presents statistics about the suggestions produced by the guidance system for the DMS ontology.
5.2 Results
Table 6 lists the major results of the evaluation. For the 43 requirements, we used 21 different boilerplates. The boilerplate used most often (16 times) is system shall action. The 43 boilerplate requirements have a total of 120 attributes. For 36 of the 120 attributes (30%) the semantic guidance system was able to suggest the entire attribute value without any need for a manual change. For another 69 attributes (57.5%) the guidance could suggest at least parts of the attribute value. This leaves 15 attribute values (12.5%) for which the guidance was of no help. For partial matches, these are some of the reasons why the attribute values had to be modified:
– A different determiner is used than the suggested “the”, e.g., “a”, “each” or
“all”.
– The plural is used instead of singular.
– A combination of two or more suggestions is used.
– A subordinate clause is added, e.g., “each door that could be a hazard if it
unlatches”.
Reasons for no guidance are these:
– Numbers for the number attribute cannot be suggested.
– Words are used that do not exist in the domain ontology.
Future work will include setting up an evaluation to compare the elicitation time with and without the semantic guidance system. However, due to the high percentage of cases where the guidance was able to help (>85%), we are confident that efficiency has improved, even though the presentation of explicit numbers has to be postponed to future work.
We also hope to improve the quality of requirements using the tool. We did
a qualitative comparison of the original DMS requirements and the boilerplate
requirements. These are our findings:
– Boilerplate requirements encourage using the active voice. In our evaluation
the original requirement “Information concerning the door status shall be
sent from the Doors Management System to ground station. . . ” was turned
into “The Doors Management System shall send information concerning the
door status to ground station. . . ” 8 requirements were improved in this way.
In some cases missing subjects were added.
– Requirements like “There shall be. . . ” and “It shall not be possible to. . . ”
were changed into “The subject shall have” and “The subject shall not al-
low. . . ”. Such changes make it obvious what part of the system is responsible
to fulfill the requirement. To determine the right value for subject the origi-
nal stakeholders should be asked for clarification. Due to timing constraints
this was not possible and plausible values were inserted by the authors.
– During the requirements transformation we found that the original requirements used different expressions for seemingly identical things, e.g., “provision to prevent pressurization” and “pressure prevention means”, or “airplane” and “aircraft”.
Acknowledgments
The research leading to these results has received funding from the ARTEMIS
Joint Undertaking under grant agreement No 100016 and from specific national
programs and/or funding authorities. This work has been supported by the
Christian Doppler Forschungsgesellschaft and the BMWFJ, Austria.
References
1. IEEE Recommended Practice for Software Requirements Specifications. IEEE Std
830 (1998)
2. OWL 2 Web Ontology Language Direct Semantics. Tech. rep., W3C (2009),
http://www.w3.org/TR/2009/REC-owl2-direct-semantics-20091027/
3. Cobleigh, R., Avrunin, G., Clarke, L.: User Guidance for Creating Precise and Ac-
cessible Property Specifications. In: 14th International Symposium on Foundations
of Software Engineering, pp. 208–218. ACM, New York (2006)
4. Denger, C., Berry, D., Kamsties, E.: Higher Quality Requirements Specifications
through Natural Language Patterns. In: 2003 IEEE International Conference on Soft-
ware - Science, Technology and Engineering, pp. 80–90. IEEE, Los Alamitos (2003)
5. Egyed, A., Grunbacher, P.: Identifying Requirements Conflicts and Coopera-
tion: How Quality Attributes and Automated Traceability Can Help. IEEE Soft-
ware 21(6), 50–58 (2004)
6. Elazhary, H.H.: REAS: An Interactive Semi-Automated System for Software Re-
quirements Elicitation Assistance. IJEST 2(5), 957–961 (2010)
7. Gotel, O., Finkelstein, C.: An Analysis of the Requirements Traceability Problem.
In: 1st International Conference on Requirements Engineering, pp. 94–101 (1994)
8. Gottesdiener, E.: Requirements by Collaboration: Workshops for Defining Needs.
Addison-Wesley, Reading (2002)
9. Hull, E., Jackson, K., Dick, J.: Requirements Engineering. Springer, Heidelberg
(2005)
10. Ibrahim, N., Kadir, W., Deris, S.: Propagating Requirement Change into Software
High Level Designs towards Resilient Software Evolution. In: 16th Asia-Pacific
Software Engineering Conference, pp. 347–354. IEEE, Los Alamitos (2009)
11. Jackson, J.: A Keyphrase Based Traceability Scheme. IEEE Colloquium on Tools
and Techniques for Maintaining Traceability During Design, 2/1–2/4 (1991)
12. Kaindl, H.: The Missing Link in Requirements Engineering. Software Engineering
Notes 18, 30–39 (1993)
13. Kaiya, H., Saeki, M.: Ontology Based Requirements Analysis: Lightweight Seman-
tic Processing Approach. In: 5th Int. Conf. on Quality Software, pp. 223–230 (2005)
14. Kitamura, M., Hasegawa, R., Kaiya, H., Saeki, M.: A Supporting Tool for Re-
quirements Elicitation Using a Domain Ontology. Software and Data Technologies,
128–140 (2009)
15. Kotonya, G., Sommerville, I.: Requirements Engineering. John Wiley & Sons,
Chichester (1998)
16. Matsuo, Y., Ogasawara, K., Ohnishi, A.: Automatic Transformation of Organization
of Software Requirements Specifications. In: 4th International Conference on Re-
search Challenges in Information Science, pp. 269–278. IEEE, Los Alamitos (2010)
17. Omoronyia, I., Sindre, G., Stålhane, T., Biffl, S., Moser, T., Sunindyo, W.: A
Domain Ontology Building Process for Guiding Requirements Elicitation. In: 16th
REFSQ, pp. 188–202 (2010)
18. Pedrinaci, C., Domingue, J., Alves de Medeiros, A.K.: A Core Ontology for Busi-
ness Process Analysis. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis,
M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 49–64. Springer, Heidelberg (2008)
19. Rupp, C.: Requirements-Engineering und -Management. Hanser (2002)
20. Stålhane, T., Omoronyia, I., Reichenbach, F.: Ontology-Guided Requirements and Safety Analysis. In: 6th International Conference on Safety of Industrial Automated Systems (2010)
21. Watkins, R., Neal, M.: Why and How of Requirements Tracing. IEEE Soft-
ware 11(4), 104–106 (1994)
The Semantic Public Service Portal (S-PSP)
1 Introduction
Public service provision is an important duty of all government departments.
Independent of what kind of service it is, every public service can be broken down into
two distinct but complementary phases: the informative phase and the performative
phase [1]. During the informative phase, the service provider provides information about the service to the citizen/business1, while during the performative phase the citizen utilises the public service. The informative phase is essential for optimal service utilisation and public-administration efficiency; however, it is often overlooked by governments. In order to use a public service effectively, citizens must identify which public services address their needs and find answers to their questions regarding these services, e.g. “am I eligible for this service?”, “what is the outcome of the service?”, “which public agency provides the service?”, etc. In this paper, we present a portal that facilitates the informative phase of public-service provision.
An additional characteristic of public services is that they often have complex structures and may be specialized into a variety of service versions. For example, a public service concerned with the issuing of a driving license may have alternative versions if this is the first license of the applicant, if the applicant is over 60, if the applicant wishes to drive lorries, etc. It is therefore not enough for citizens to identify a public service in general; they must also go one step further and identify the specific service version for which they are eligible. The service versions refine the generic public service further and may be differentiated from one another according to:
i. the profile of the citizen that wishes to consume the service;
ii. the service inputs and outputs; and/or
iii. the service workflow.
However, traditional governmental portals still follow a one-size-fits-all approach.
Thus the portal cannot react differently and tailor the offered public services to the
needs and the profile of each individual citizen. Moreover, the citizen has to figure
out on their own whether they are eligible for the service by reading lengthy public
service descriptions (which very often include legal terminology). These are common
problems in all existing national eGovernment portals. According to [2], the most
typical problems of eGovernment portals can be grouped into the following
categories:
• The user is unable to find the desired information or service.
• The user is unable to achieve his goal, even though the system supports it
and he has started along the path to achieve it.
• The user is able to accomplish his goal, but not efficiently, e.g., easily and
quickly.
In order to enhance the informative part of public service provision and improve
existing governmental portals, this paper introduces the Semantic Public Service
Portal (S-PSP), which aims:
• To inform citizens whether they are eligible for a specific public service;
• To personalize the public-service-related information according to the profile
and the specific needs and wants of the citizen and identify the specific
public service version;
1 For the remainder of the paper we refer only to citizens for the sake of brevity, but citizens and businesses are implied.
2 Related Work
Researchers have tried to solve parts of the problem that we described in the previous
section, focusing mostly on facilitating service search and discovery. Fang et al. [5]
support the selection of an optimal set of featured service-links. These links will then
appear on the homepage of an eGovernment portal, thus helping users to locate
services more easily by reducing the number of steps that they have to perform until
the desired service is found. This is expected to improve the citizens’ satisfaction and
consequently increase the number of people using the portal. Therefore, a heuristic
Web-mining algorithm called ServiceFinder is proposed, which aims to help citizens
find the services that they are looking for in eGovernment portals. ServiceFinder uses
three metrics to measure the quality of eGovernment service selection, which will
then appear as links on the homepage of the portal. These are effectiveness (degree of
easiness to locate the desired service), efficiency (probability to locate the desired
service) and utilization (sum of easily located desired services). The metrics are
calculated using patterns either extracted from the structure of the eGovernment portal or mined from a Web log. Although this approach may improve service discovery by organizing the available services within the portal better, the process of finding a service in the portal is still based on trial and error. This means that the user still has to browse the eGovernment portal in order to find the desired service. Another drawback of this approach compared to ours is that it provides no information to the citizen with respect to his/her eligibility for the identified public service.
2 http://www.semantic-gov.org
3 http://www.rural-inclusion.eu
4 www.onestopgov-project.org
5 http://www.fit-project.org/
the execution of the service. This is a very strong asset of our approach, as the eligibility check at an early stage, during the informative part of service provision, can save the citizen a lot of time and money. It is interesting that the work of [6] bears some resemblance to ours in the way that services are modelled and organized; what is different, apart from the use of ontologies versus taxonomies, is the fact that in our work services are described at a finer level of granularity, e.g., the distinction between service type and service version. This difference is very important: because of it, Sacco’s work is not able to personalize services, but only provides generic information about them. Moreover, it does not answer the eligibility question.
The Semantic Public Service Portal (S-PSP) provides information about available
public-services, which a user may browse and search in a customisable and user-
friendly manner. Through the use of the S-PSP, a user can quickly identify:
• which public service(s) are applicable to their individual use-case,
• whether they are eligible for these public service(s), and
• what is required from the user to complete these public service(s) for their
individual use-case.
Fig. 1 shows the homepage of the S-PSP with the list of all currently available public services and the languages they are available in.
Once a user selects the public service they are interested in, the dialogue page appears, as shown in Fig. 2. At this step, the user answers a series of questions, which determine whether the user is eligible for this service and what information they will need to provide/complete to utilise this service. Fig. 3 shows the customised information that this particular user requires to utilise this service, moving away from a one-size-fits-all approach that is unrealistic in the case of public services.
The S-PSP follows a three-tier architecture, as shown in Fig. 4, which comprises of:
• the User Interface Layer, which facilitates the interaction between the
citizens and the portal, acting as an entry-point to the portal’s functionalities.
• the Application Layer, which implements the functionalities provided to the
citizens. This layer consists of two components:
o the Service Tree Locator (STL) and
o the Query Mechanism.
• the Semantic Repository Layer where all the semantic artefacts (ontologies)
used by the portal are stored.
While the user interacts directly with the S-PSP interface, the service provider
collaborates with an ontology manager to define the public-service descriptions,
which are then added to the Semantic Repository. This process of creating semantic public-service descriptions is currently being automated, so that the service provider may create the service description using a tool that will replace the ontology manager.
The User Interface (UI) Layer provides citizens with the means to interact with the
portal. Its main functionality includes presenting the questions asked by the Query
Mechanism to the citizens and collecting their answers. The answers are then returned
to the Query Mechanism. It is important to clarify that all information that is made
available through the UI, e.g. list items in dropdown lists, questions and possible
answers etc., comes from the underlying ontologies stored in the Semantic
Repository.
The Application Layer consists of the Service Tree Locator (STL) and the Query
Mechanism components. The Service Tree Locator (STL) identifies the appropriate
Service Tree Ontology (STO), which models the public service that addresses the
user’s requirements. Citizens can enter keywords in the portal’s UI to describe the
service that they are looking for. These keywords are sent to the STL, which then
queries the semantic repository using SPARQL6 queries to find matching public-
service descriptions. The STL may contact WordNet7 in order to find synonyms and
hypernyms/hyponyms for the keywords entered by the user, thus making the keyword
search more effective. Finally, the resulting public services are returned to the citizens
in order for them to select the appropriate one. SPARQL was chosen as the semantic
query language, as it is a W3C Recommendation and has a large, active community.
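As an illustration, a simplified keyword lookup over the stored service descriptions could look as follows (Python with rdflib); the repository file and the spsp: vocabulary are hypothetical, and the WordNet expansion is reduced to a plain keyword list.

from rdflib import Graph, Literal

repo = Graph().parse("public-services.ttl")  # dump of the semantic repository (placeholder)

QUERY = """
PREFIX spsp: <http://example.org/s-psp#>   # hypothetical portal vocabulary
SELECT DISTINCT ?service ?title WHERE {
  ?service a spsp:PublicService ;
           spsp:title ?title ;
           spsp:description ?descr .
  FILTER ( regex(str(?title), ?kw, "i") || regex(str(?descr), ?kw, "i") )
}
"""

def find_services(keywords):
    """Return services whose title or description matches any keyword.
    In the S-PSP, WordNet synonyms and hypernyms/hyponyms would be added
    to 'keywords' before querying."""
    hits = set()
    for kw in keywords:
        for row in repo.query(QUERY, initBindings={"kw": Literal(kw)}):
            hits.add((row.service, str(row.title)))
    return hits

print(find_services(["driving license", "license"]))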
The Query Mechanism (QM) is the core component of the S-PSP as it identifies the questions to include in the public-service dialogue, based on a user’s previous answers. The QM continually checks the user’s eligibility for the public service and, if eligible, it stores the user’s answers in order to determine the personalised information required to provide the service.
6 http://www.w3.org/TR/rdf-sparql-query/
7 http://wordnet.princeton.edu/
The Repository Layer contains the semantic repository component, which houses all of
the ontologies of the S-PSP. These will be discussed in more detail in the next section.
A Service Tree Ontology (STO) formally defines the dialogue that would usually take
place between a public-service provider and a citizen for a particular public service.
STOs have a tree-like structure and are written in OWL. The dialogue starts from a
generic public service, which is stepwise refined after every question/answer pair. If the citizen is eligible for the specific public service, the dialogue leads to the public-service version that matches their profile, and a detailed, structured description of that version is made available. Otherwise the citizen is informed that they are not eligible for the specific public service. STOs contain the
business rules from which the different service versions derive as well as the
questions that will be asked to the citizen in order to collect information, which
enables the portal to personalize the public service and decide on the eligibility of the
citizen and on the matching service version. Moreover, the user interface of the portal
is dynamically created based on information encoded in the STOs.
• Service Provider is the PA Entity that provides the service to the Societal
Entities (clients). The PA Entities belong to an Administrative Level (e.g.
municipality, regional).
• Evidence Provider is the PA Entity that provides necessary Evidence to the
Service Provider in order to execute the PA Service.
• Consequence Receiver is the PA Entity that should be informed about a PA
Service execution.
• Service Collaborator is the PA Entity that participates in the provision of a
public service (but is not the service provider).
Political Entities define PA Services, which are governed by Preconditions usually specified in Legal Acts (Laws). Preconditions set the general framework in which the service should be performed and the underlying business rules that should be fulfilled.
In addition to the STOs, the meta-ontology for STOs, and the Public service ontology,
the following OWL ontologies are also used by the S-PSP:
• Ontologies that model the profile of businesses and citizens, for example the
brand name, type, or legal status.
• Ontologies that contain public service related information, such as the
administrative documents that are required as input for the different versions
of the public service (modelled as instances of the EvidencePlaceholder class
of the Public Service Ontology).
• Ontologies that include listings of countries, nationalities, and business
types.
The QM, as discussed in section 3.1, is the core component of the S-PSP, as it
identifies the questions to include in the public-service dialogue, by traversing the
corresponding public-service STO. During the traversal of the STO, the public service
that the citizen has selected is being personalized according to their answers. This is
achieved by resolving the generic service type into the appropriate service version. It
is important to note that at each stage of the traversal, the next step option is unique.
This means there is no case where the same citizen could follow two different paths in
the same STO. If the current node is an InternalNode then the QM has to verify the
conditions of all its descendants, which are expressed as SPARQL queries. Therefore,
the QM takes the appropriate question from the STO and forwards it to the UI so that
the question can be displayed to the citizen. In case the current node is a LeafNode,
i.e. it has no descendants, then the end of the structured conversation has been
reached. At this point the portal has collected all the necessary information for
identifying the specific public service version that matches the citizen’s profile and
for deciding on their eligibility. In case the citizen is not eligible for one of the service
versions that are modelled in the STO (isNotEligible is set to true), then the QM
terminates its execution and returns a notification message, for example, ‘You are not
eligible for this service because you are under 18 years old’.
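A condensed sketch of this traversal (Python with rdflib); the sto: node, question and condition properties are hypothetical stand-ins for the actual STO vocabulary, conditions are assumed to be SPARQL ASK queries over the collected answers, and the UI is reduced to console input.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

STO = Namespace("http://example.org/sto#")  # hypothetical STO vocabulary

sto = Graph().parse("issue-driving-license-sto.ttl")  # placeholder STO file
profile = Graph()  # collected citizen answers

def ask(question_text):
    # In the portal the question is forwarded to the UI layer; here: console.
    return input(question_text + " ")

def traverse(node):
    """Walk the STO from the root to the leaf matching the citizen's answers."""
    while (node, RDF.type, STO.LeafNode) not in sto:
        # InternalNode: display its question and record the answer.
        question = sto.value(node, STO.hasQuestion)
        answer = ask(str(question))
        profile.add((STO.citizen, STO.answered, Literal(answer)))

        # Exactly one descendant's condition (a SPARQL ASK query over the
        # collected answers) is expected to hold -- the next step is unique.
        for child in sto.objects(node, STO.hasDescendant):
            condition = str(sto.value(child, STO.hasCondition))
            if profile.query(condition).askAnswer:
                node = child
                break
        else:
            raise ValueError(f"no matching service path at {node}")

    # LeafNode: either a personalised service version or a rejection message.
    if (node, STO.isNotEligible, Literal(True)) in sto:
        return str(sto.value(node, STO.notificationMessage))
    return str(sto.value(node, STO.serviceVersion))

root = sto.value(predicate=RDF.type, object=STO.RootNode)
print(traverse(root))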
The S-PSP is currently being utilised by Rural Inclusion8, an EC project funded under
the Competitiveness and Innovative Framework Programme. Rural Inclusion aims at
adopting a state-of-art infrastructure that will facilitate the offering of innovative
services by public administration in rural areas. The S-PSP is one of three
components that make up the Rural Inclusion Platform, which is being rolled out
across five European trial-sites in France, Martinique, Greece, Latvia and Spain.
The Chios Chamber of Commerce is one of the trial partners. It is supervised by the Greek Ministry of Development and serves as a consultant on a wide range of business-related matters for the Greek island of Chios. Public services that the chamber provides and that are presented in the Rural Inclusion platform include:
8 http://www.rural-inclusion.eu
6 Evaluation
The evaluation of the S-PSP is ongoing as part of the Rural Inclusion Project. As
stated previously, the S-PSP is one of three components that make up the Rural
Inclusion Platform, which is being rolled out across five European public-sector trial-
sites in France, Martinique, Greece, Latvia and Spain. Initial results are positive, with the main constructive criticism focusing on a more intuitive integration of the S-PSP into the trial partners' existing sites, where they currently provide the actual public services. A complete evaluation will be published at a future date.
7 Conclusion
This paper presents an ontology-based, public-services portal, the S-PSP, which
facilitates the informative phase of public service provision. It checks the eligibility of
the citizens for a specific public service before the actual execution of the service,
thus saving them time, effort and money. Also, the public-service-related information is personalised according to the profile and the specific needs and wants of the citizen, and the specific public-service version required is identified, thus providing targeted, tailored and comprehensive information. The S-PSP's architecture is modular and as
such it is easily extendable. This has been shown with the Rural Inclusion Platform,
where the S-PSP has been integrated with other components for a specific solution.
The S-PSP is also decoupled from the public-service execution environment, which may be implemented in different technologies, and communicates with it using Web Services.
The main advantage of this portal is its use of semantics to describe all aspects of
public-services, resulting in reusable, extensible public-service data. New public-
services may be added to the portal through the creation of a new STO.
Acknowledgments. The work presented in this paper has been funded in part by
Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and the
European Union under Grant No. CIP-ICT PSP-2008-2/238900 (Rural Inclusion).
The authors would like to thank all the Rural Inclusion project partners for the
creative discussions and ideas.
DataFinland—A Semantic Portal for Open and Linked
Datasets
Abstract. The number of open datasets available on the web is increasing rapidly
with the rise of the Linked Open Data (LOD) cloud and various governmen-
tal efforts for releasing public data in different formats, not only in RDF. The
aim in releasing open datasets is for developers to use them in innovative applications, but the datasets need to be found first, and the available metadata is often minimal, heterogeneous, and distributed, making the search for the right dataset problematic. To address the problem, we present DataFinland, a semantic
portal featuring a distributed content creation model and tools for annotating and
publishing metadata about LOD and non-RDF datasets on the web. The metadata
schema for DataFinland is based on a modified version of the voiD vocabulary for
describing linked RDF datasets, and annotations are done using an online meta-
data editor SAHA connected to ONKI ontology services providing a controlled
set of annotation concepts. The content is published instantly on an integrated
faceted search and browsing engine HAKO for human users, and as a SPARQL
endpoint and a source file for machines. As a proof of concept, the system has
been applied to LOD and Finnish governmental datasets.
licensing conditions, and so on. Such information should be available both to human
users as well as machines of the Semantic Web.
Aside from properly linked datasets in RDF format, various organizations have also begun publishing open data in whatever format they happen to have it in. The governments of the United States and the United Kingdom have been releasing their governmental data in an open format4 and other governments are following suit. This provides another source of datasets that pose their own challenges for classification and subsequent discovery, as they are released in arbitrary formats with varying amounts of associated metadata. Setting up a uniform schema and vocabulary for annotating these
datasets as well as providing effective search tools helps developers find these sets in
order to use them for new applications [6].
There are search engines for finding RDF and other datasets, such as ordinary search engines, SWSE [11], Swoogle5, Watson6, and others. However, with such systems, which are based on the Google-like search paradigm, it is difficult to get an overview of the whole cloud of offered datasets. Furthermore, finding suitable datasets based on different selection criteria such as topic, size, licensing, publisher, language, etc. is not supported.
To facilitate this, interoperable metadata about the different aspects or facets of datasets
is needed, and faceted search (also called view-based search) [19,9,12] can be used to
provide an alternative paradigm for string-based semantic search.
This paper presents DataFinland, a semantic portal for creating, publishing, and find-
ing datasets based on metadata. In contrast to systems like CKAN7 , the LOD-oriented
voiD8 (Vocabulary of Interlinked Datasets) metadata schema is used to describe datasets
with property values taken from a set of shared domain ontologies providing controlled
vocabularies with clearly defined semantics. Content is annotated using a web-based
annotation tool SAHA 39 connected to ONKI ontology services10 [22,21] that publish
the domain ontologies. SAHA 3 has been integrated with the lightweight multifaceted
search engine HAKO11 [16], which facilitates automatically forming a faceted search
and browsing application for taking in and discerning the datasets on offer. The anno-
tation data itself is stored in RDF format, which makes combining the metadata about
different datasets from different sources simple. This means that it would be possible
to have several annotation projects for different sets of datasets, which could then be
combined as needed for searching purposes. As a proof of concept, the system has
been applied to describing the LOD cloud datasets and datasets in the Finnish Open
Data Catalogue Project12 complementing the linked open governmental datasets on a
national level. The demonstration is available online13 and the system received the first
prize in this year’s ”Apps4Finland–Doing Good With Open Data” competition.
4 http://www.data.gov/ and http://data.gov.uk/
5 http://swoogle.umbc.edu/
6 http://watson.kmi.open.ac.uk/WatsonWUI/
7 http://www.ckan.net/
8 http://semanticweb.org/wiki/VoiD
9 http://www.seco.tkk.fi/services/saha/
10 http://www.onki.fi/
11 http://www.seco.tkk.fi/tools/hako/
12 http://data.suomi.fi/
13 http://demo.seco.tkk.fi/saha3sandbox/voiD/hako.shtml
In the following we will first present the general model and tools for creating and
publishing metadata about (linked) datasets, and then discuss the voiD metadata schema
and ontology repository ONKI presenting a controlled vocabulary. After this, the anno-
tation tool SAHA for distributed semantic content creation is presented along with the
faceted publication engine HAKO. In conclusion, the main contributions of the paper are listed, related work is discussed, and directions for future research are proposed.
Our solution for the process of producing metadata and publishing the annotated
datasets is depicted in Figure 1. The process begins with the publication of a dataset.
Metadata for the dataset is produced either by its original publisher or by a third party,
using an annotation tool, in our case SAHA 3. A metadata schema, in our case a modified voiD, is used to specify, for the distributed and independent content providers, the exact nature of the metadata needed. Interoperability in annotation values is achieved
through shared ontologies that are used for certain property values in the schema (e.g.,
subject matter and publisher resources are taken from corresponding ontologies). The
ontologies are provided for the annotation tool as services, in our case by the national
ONKI Ontology Service (or by SAHA itself). Finally, the metadata about the datasets is
published in a semantic portal capable of using the annotations to make the data more
accessible to the end-user, be that a human or a computer application. For this part the
faceted search engine HAKO is used.
In the figure, we have marked the tools and resources used in our proof-of-concept
system in parentheses, but the process model itself is generic.
Fig. 1. The distributed process of producing and publishing metadata about (linked) datasets
From a semantic viewpoint, the key ingredients of the general model presented above are the metadata schema and the domain ontologies/vocabularies used for filling in values in the schema. As for the metadata schema, the Vocabulary of Interlinked Datasets (voiD), an RDF vocabulary for describing linked datasets [1], seemed like a natural starting point because it specifically addresses the problems of representing linked data. It was therefore chosen as the basis of our proof-of-concept system.
The basic component in voiD is a dataset, a collection of RDF triples that share a meaningful connection with each other in the form of a shared topic, source or host. The
different aspects of metadata that voiD collects could be classified into the following
three categories or facets:
1. Descriptive metadata tells what the dataset is about. This includes properties such
as the name of the dataset, the people and organizations responsible for it, as well as
the general subject of the dataset. Here voiD reuses other, established vocabularies,
such as dcterms and foaf. Additionally, voiD allows for the recording of statistics
concerning the dataset.
2. Accessibility metadata tells how to access the dataset. This includes information on
SPARQL endpoints, URI lookup as well as licensing information so that potential
users of the dataset know the terms and conditions under which the dataset can be
used.
3. Interlinking metadata tells how the dataset is linked to other datasets. This is done
by defining a linkset, the concept of which is depicted in Figure 2. If dataset :DS1 includes relations to dataset :DS2, a subset of :DS1 of the type void:Linkset is made (:LS1), which collects all the triples that include links between the two datasets (that is, triples whose subject is a part of :DS1 and whose object is a part of :DS2).
In order to facilitate annotating also non-linked open datasets, we made some exten-
sions to voiD. The most important of these was a class for datasets in formats other
than RDF. This void-addon:NonRdfDataset is similar to the void:Dataset but
does not have the RDF-specific properties such as the SPARQL endpoint, while including a property for describing the format of the dataset, void-addon:format. The addition of this class also resulted in modifications to most of the voiD properties to include void-addon:NonRdfDataset in their domain specifications. Another addition to the basic voiD in our system was dcterms:language, which facilitates multi-language applications.
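To make this concrete, the sketch below shows what descriptions of an RDF and a non-RDF dataset could look like under the modified schema. The void namespace is the standard one, but the void-addon namespace URI and all instance data are assumptions made purely for illustration; in DataFinland the dcterms:subject, dcterms:license and dcterms:language values are taken from shared ontologies.

```python
# Minimal sketch of dataset descriptions using voiD plus the void-addon
# extension described above. The void-addon namespace URI and the instance
# data are assumptions for illustration only.
from rdflib import Graph

TURTLE = """
@prefix void:       <http://rdfs.org/ns/void#> .
@prefix void-addon: <http://example.org/void-addon#> .
@prefix dcterms:    <http://purl.org/dc/terms/> .
@prefix ex:         <http://example.org/datasets/> .

# A linked dataset, described with plain voiD.
ex:ExampleLOD a void:Dataset ;
    dcterms:title "Example linked dataset" ;
    dcterms:subject ex:InformationTechnology ;
    dcterms:license <http://creativecommons.org/licenses/by/3.0/> ;
    dcterms:language ex:English ;
    void:sparqlEndpoint <http://example.org/sparql> .

# A non-RDF dataset: no SPARQL endpoint, but an explicit format.
ex:ExampleSpreadsheet a void-addon:NonRdfDataset ;
    dcterms:title "Example statistics spreadsheet" ;
    dcterms:subject ex:InformationTechnology ;
    dcterms:language ex:Finnish ;
    void-addon:format "MS Excel" .
"""

g = Graph()
g.parse(data=TURTLE, format="turtle")
print(g.serialize(format="turtle"))
```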
HAKO is a faceted search engine that can be used to publish a SAHA 3 project as a read-
ily usable portal. The RDF data produced in SAHA 3 is exported into HAKO, which is
then configured to produce a portal matching the needs of the end user. The publisher
configures the classes whose instances are to be searched and whose properties form
the search facets for these instances.
The end result is a semantic portal supporting both faceted search as well as free text
search, which is done as a prefix search by default. For machine use, SAHA 3 also has
a SPARQL endpoint14 which can be used to access the metadata from the outside as a
service instead of accessing the HAKO portal human interface. The SPARQL interface
14 http://demo.seco.tkk.fi/saha/service/data/voiD/sparql?query={query}
can be used also internally in SAHA for providing semantic recommendation links
between data objects on the human interface.
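As an illustration of such machine access, the sketch below sends a query to the endpoint over HTTP using only the Python standard library. The endpoint URL follows the pattern given in the footnote above and may no longer be reachable, and the SPARQL JSON result format is an assumption about the service.

```python
# Minimal sketch: listing dataset titles through the SPARQL endpoint exposed
# by SAHA 3 (URL pattern from the footnote above; availability and the JSON
# result format are assumptions).
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://demo.seco.tkk.fi/saha/service/data/voiD/sparql"

QUERY = """
PREFIX void:    <http://rdfs.org/ns/void#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?dataset ?title WHERE {
  ?dataset a void:Dataset ;
           dcterms:title ?title .
} LIMIT 10
"""

url = ENDPOINT + "?" + urllib.parse.urlencode({"query": QUERY})
request = urllib.request.Request(url, headers={"Accept": "application/sparql-results+json"})

with urllib.request.urlopen(request) as response:
    results = json.load(response)

for binding in results["results"]["bindings"]:
    print(binding["dataset"]["value"], "-", binding["title"]["value"])
```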
4.3 DataFinland
DataFinland is the name given to the whole solution combining SAHA 3 and the HAKO search portal with the extended voiD schema for creating, publishing, and finding datasets based on metadata.
When configuring SAHA 3 for voiD, the dcterms:subject was connected to the
ONKI instance of the General Finnish Ontology (YSO)15 with over 20,000 concepts.
The property dcterms:license was linked to an ONKI instance featuring six Creative Commons license types, but the system also allows for defining other license types as new instances of a simple license class. Its properties include a free text description of the license as well as a possible link to a webpage describing the license further. Finally, dcterms:language was connected to the ONKI instance of the Lingvoj16 vocabulary listing the languages of the world.
The SAHA 3 annotation environment for voiD (depicted in Figure 5) allows for
the annotation of both RDF and non-RDF datasets as well as licenses, formats and
organizations. Licenses are additional licenses that the user may want to use aside from
the ready linked Creative Commons licenses. Formats are simple resources to identify
the format of the dataset, e.g. PDF, MS Word document, etc. Finally, the organizations category allows for a simple way of describing an organization or a person responsible for a given dataset in the form of a title, a free text description and a link to a homepage or a similar information source.
HAKO was configured to search for both RDF and non-RDF datasets and to form
facets based on the license, language, format and subject properties. This way the end-
user can, for example, limit his/her search to cover only Linked Open datasets by choos-
ing the RDF format. In Figure 6 the user has selected, from the facets on the left, RDF datasets concerning the information technology industry in the English language. Out of the nine results provided by HAKO, the user has chosen Advogato to see its metadata.
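Conceptually, such a facet selection corresponds to a conjunctive query over the dataset metadata. The sketch below expresses the selection "RDF datasets about the information technology industry in English" as SPARQL and runs it over a toy graph; the property names follow the extended voiD schema discussed earlier, while the subject and language URIs stand in for the corresponding YSO and Lingvoj concepts.

```python
# Minimal sketch: the facet selection "RDF datasets about the IT industry in
# English" expressed as a SPARQL query over toy voiD metadata. The subject and
# language URIs are placeholders for the real YSO and Lingvoj concepts.
from rdflib import Graph

g = Graph()
g.parse(format="turtle", data="""
@prefix void:    <http://rdfs.org/ns/void#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .

ex:Advogato a void:Dataset ;
    dcterms:title "Advogato" ;
    dcterms:subject ex:InformationTechnologyIndustry ;
    dcterms:language ex:English .

ex:SomeCsvData a ex:NonRdfDataset ;
    dcterms:title "Some CSV data" ;
    dcterms:language ex:Finnish .
""")

results = g.query("""
PREFIX void:    <http://rdfs.org/ns/void#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX ex:      <http://example.org/>
SELECT ?title WHERE {
  ?dataset a void:Dataset ;                  # the "RDF format" facet
           dcterms:title ?title ;
           dcterms:subject ex:InformationTechnologyIndustry ;
           dcterms:language ex:English .
}""")

for row in results:
    print(row.title)   # only "Advogato" matches all three facet selections
```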
A problem of faceted search with wide-ranging datasets is that facets tend to get very
large, which makes category selection more difficult. A solution to this is to use hier-
archical facets. However, using the hierarchy of a thesaurus or an ontology intended
originally for annotations and reasoning may not be an optimal facet for information
retrieval from the end-user’s perspective [20]. For example, the top levels of large on-
tologies with complete hierarchies can be confusing for the end-users. Our planned
solution in the future is to provide the annotators with a simple tool for building hi-
erarchies for the facets as a part of the annotation process. Another possible solution
would be to use some kind of an all-inclusive classification system as the top level of
the facets. There has been some discussion of a classification schema for open datasets in the community, but no clear standard has emerged. In the future we plan to explore the
possibility of using the Finnish Libraries’ classification system that is based on Dewey
Decimal Classification.
15 http://www.yso.fi/onki/yso/
16 http://www.lingvoj.org/
5 Discussion
5.1 Contributions
This paper presented a distributed content creation model for metadata about datasets
published on the web. The model emphasizes and supports the idea that metadata
should be created in an interoperable way by the actors that publish the actual content.
Making metadata interoperable afterwards is usually more difficult and costly [13]. In
practice this requires support for using shared metadata schemas and domain ontolo-
gies/vocabularies, as well as a shared publication channel, a semantic portal. These
facilities are provided in our model by the combination of ONKI, SAHA and HAKO
tools.
One of the main challenges of any model dealing with dataset metadata is to motivate
dataset publishers to also publish semantically annotated metadata about their content.
Our work is driven by the hope that this social challenge can be addressed by making annotation easy with online tools (such as SAHA and ONKI), and by providing the anno-
tators with instant feedback on how their dataset is shown in the final semantic portal
(HAKO).
There are a number of tools available for creating voiD descriptions, such as the voiD editor ve17 and liftSSM18, an XSLT script that transforms a semantic sitemap in XML into voiD RDF/XML format. However, these allow building only rudimentary descriptions, which must then be complemented by manually editing the RDF file.
17 http://ld2sd.deri.org/ve/
18 http://vocab.deri.ie/void/guide#sec 4 3 Publishing tools
As for datasets, there are a number of tools for finding Linked Open data. Semantic
Web Search Engine [11] (SWSE) takes a free-text approach allowing the user to enter
a query string and returning entities from Linked Open datasets that match the query
term. Searching for whole datasets is not supported.
Aside from search tools intended for human users, there are a number of search indexes intended for applications, including Sindice [18], Watson [5] and Swoogle [7].
These provide APIs supporting the discovery of RDF documents based on URIs or key-
words. Sindice is intended for finding individual documents while Swoogle is used for
finding ontologies. Watson allows the finding of all sorts of semantic data and features
advanced filtering abilities intended for both machine and human users. However, none
of these search engines are very good for exploring what sorts of datasets are available
or for getting a whole picture of a given domain.
Governmental Open Data is widely published through CKAN19 (Comprehensive
Knowledge Archive Network), a registry for Open Data packages. CKAN provides sup-
port for publishing and versioning Open data packages and includes robust API support.
However, the metadata about the data packages is recorded utilizing free tagging, which
does not support hierarchical, view-based search and does not contain semantic relation
data between different tags.
Finally, concurrently to our work, an interoperability format for governmental data
catalogues based on the dcat RDF vocabulary was proposed in [17]. There, the metadata
schema was based on existing metadata used in the data catalogues as opposed to the
LOD-based voiD. Furthermore, this solution does not contain tools for editing metadata, nor does it link to existing ontologies for use in dataset descriptions. A faceted search using
Gridworks in combination with dcat was also proposed in [4].
The distributed semantic content creation and publishing approach, using shared metadata schemas, ontology services, and semantic portals for publication, was originally developed in the semantic portals of the FinnONTO project [15].
Acknowledgements
This work was conducted as a part of the National Semantic Web Ontology project
in Finland20 (FinnONTO, 2003-2012), funded mainly by the National Technology and
Innovation Agency (Tekes) and a consortium of 38 public organizations and companies.
Furthermore, we would like to thank Tuomas Palonen for his annotation work on the
datasets for the demonstration and Petri Kola and Antti Poikola for fruitful discussions
on publishing open datasets.
19 http://www.ckan.net/
20 http://www.seco.tkk.fi/projects/finnonto/
References
1. Alexander, K., Cyganiak, R., Hausenblas, M., Zhao, J.: Describing linked datasets - on the design and usage of voiD, the vocabulary of interlinked datasets. In: Linked Data on the Web Workshop (LDOW 2009), in conjunction with the 18th International World Wide Web Conference (WWW 2009) (2009)
2. Bizer, C., Cyganiak, R., Heath, T.: How to publish linked data on the web (2007)
3. Bizer, C., Heath, T., Berners-Lee, T.: Linked data - the story so far. International Journal on
Semantic Web and Information Systems, IJSWIS (2009)
4. Cyganiak, R., Maali, F., Peristeras, V.: Self-service linked government data with dcat and
gridworks. In: Proceedings of the 6th International Conference on Semantic Systems, Graz,
Austria. I-SEMANTICS 2010, pp. 37:1–37:3. ACM, New York (2010)
5. d'Aquin, M., Motta, E.: Watson, more than a semantic web search engine (2010)
6. Dekkers, M., Polman, F., te Velde, R., de Vries, M.: MEPSIR: Measuring European public sector information resources. Final report of study on exploitation of public sector information. Technical report (2006)
7. Finin, T., Ding, L., Pan, R., Joshi, A., Kolari, P., Java, A., Peng, Y.: Swoogle: Searching for
knowledge on the semantic web. In: AAAI 2005 (intelligent systems demo), pp. 1682–1683.
The MIT Press, Cambridge (2005)
8. Hausenblas, M., Halb, W., Raimond, Y., Heath, T.: What is the size of the semantic web? In:
Proceedings of I-SEMANTICS 2008, Graz, Austria (2008)
9. Hearst, M., Elliott, A., English, J., Sinha, R., Swearingen, K., Lee, K.-P.: Finding the flow in
web site search. CACM 45(9), 42–49 (2002)
10. Hildebrand, M., van Ossenbruggen, J., Amin, A., Aroyo, L., Wielemaker, J., Hardman,
L.: The design space of a configurable autocompletion component. Technical Report INS-
E0708, Centrum voor Wiskunde en Informatica, Amsterdam (2007)
11. Hogan, A., Harth, A., Umrich, J., Decker, S.: Towards a scalable search and query engine for
the web. In: WWW 2007: Proceedings of the 16th international conference on World Wide
Web, pp. 1301–1302. ACM, New York (2007)
12. Hyvönen, E., Saarela, S., Viljanen, K.: Application of ontology techniques to view-based se-
mantic search and browsing. In: Bussler, C.J., Davies, J., Fensel, D., Studer, R. (eds.) ESWS
2004. LNCS, vol. 3053, pp. 92–106. Springer, Heidelberg (2004)
13. Hyvönen, E.: Preventing interoperability problems instead of solving them. Semantic Web Journal (2010) (accepted for publication)
14. Hyvönen, E., Mäkelä, E.: Semantic autocompletion. In: Mizoguchi, R., Shi, Z.-Z.,
Giunchiglia, F. (eds.) ASWC 2006. LNCS, vol. 4185, pp. 739–751. Springer, Heidelberg
(2006)
15. Hyvönen, E., Viljanen, K., Mäkelä, E., Kauppinen, T., Ruotsalo, T., Valkeapää, O., Seppälä,
K., Suominen, O., Alm, O., Lindroos, R., Känsälä, T., Henriksson, R., Frosterus, M., Tuomi-
nen, J., Sinkkilä, R., Kurki, J.: Elements of a national semantic web infrastructure—case
study finland on the semantic web. In: Proceedings of the First International Semantic Com-
puting Conference (IEEE ICSC 2007). IEEE Press, Irvine (2007) (invited paper)
16. Kurki, J., Hyvönen, E.: Collaborative metadata editor integrated with ontology services and
faceted portals. In: Workshop on Ontology Repositories and Editors for the Semantic Web
(ORES 2010), The Extended Semantic Web Conference ESWC 2010, CEUR Workshop Pro-
ceedings, Heraklion, Greece (2010) http://ceur-ws.org
17. Maali, F., Cyganiak, R., Peristeras, V.: Enabling interoperability of government data cata-
logues. In: Wimmer, M.A., Chappelet, J.-L., Janssen, M., Scholl, H.J. (eds.) EGOV 2010.
LNCS, vol. 6228, pp. 339–350. Springer, Heidelberg (2010),
http://dx.doi.org/10.1007/978-3-642-14799-9_29
254 M. Frosterus, E. Hyvönen, and J. Laitio
18. Oren, E., Delbru, R., Catasta, M., Cyganiak, R., Tummarello, G.: Sindice.com: A document-
oriented lookup index for open linked data. International Journal of Metadata, Semantics and
Ontologies 3 (2008)
19. Pollitt, A.S.: The key role of classification and indexing in view-based searching. Technical report, University of Huddersfield, UK (1998), http://www.ifla.org/IV/ifla63/63polst.pdf
20. Suominen, O., Viljanen, K., Hyvönen, E.: User-centric faceted search for semantic portals
(2007)
21. Tuominen, J., Frosterus, M., Viljanen, K., Hyvönen, E.: ONKI SKOS server for publish-
ing and utilizing SKOS vocabularies and ontologies as services. In: Aroyo, L., Traverso,
P., Ciravegna, F., Cimiano, P., Heath, T., Hyvönen, E., Mizoguchi, R., Oren, E., Sabou, M.,
Simperl, E. (eds.) ESWC 2009. LNCS, vol. 5554, pp. 768–780. Springer, Heidelberg (2009)
22. Viljanen, K., Tuominen, J., Hyvönen, E.: Ontology libraries for production use: The finnish
ontology library service ONKI. In: Aroyo, L., Traverso, P., Ciravegna, F., Cimiano, P., Heath,
T., Hyvönen, E., Mizoguchi, R., Oren, E., Sabou, M., Simperl, E. (eds.) ESWC 2009. LNCS,
vol. 5554, pp. 781–795. Springer, Heidelberg (2009)
Biological Names and Taxonomies on the
Semantic Web – Managing the Change in
Scientific Conception
1 Introduction
Exploitation of natural resources, urbanisation, pollution, and climate change accelerate the extinction of organisms on Earth, which has raised common concern about maintaining biodiversity. For this purpose, management of in-
formation about plants and animals is needed, a task requiring an efficient us-
age of heterogeneous, dynamic biological data from distributed sources, such
as observational records, literature, and natural history collections. Central re-
sources in biodiversity management are names and ontological taxonomies of
organisms [1,19,20,3,4]. Animal ontologies are stereotypical examples in semantic web textbooks, but in reality semantic web technologies have hardly been applied to managing the real-life taxonomies of biological organisms and
biodiversity on the web. This paper tries to fill this gap.1
1 We discuss the taxonomies of contemporary species, not ’phylogenetic trees’ that model evolutionary development of species, where humans are successors, e.g., of dinosaurs.
The scientific name system is based on the Linnean binomial name system where
the basic unit is a species. Every species belongs to some genus and every genus
belongs to a higher taxon. A scientific name often has a reference to the original
publication where it was first published. For example, the scientific name of the honeybee, Apis mellifera Linnaeus, 1758, means that Linnaeus published the description of the honeybee in 1758 (in the 10th edition of Systema Naturae) and that the honeybee belongs to the genus Apis. The upper levels of the taxonomic hierarchy do not show in a scientific name. A confusing feature of scientific
names is that the meaning of the name may change although the name remains
the same. Taxon boundaries may vary according to different studies, and there
may be multiple simultaneous views of taxon limits of the same organism group.
For example, a genus may be delimited in three ways and according to each
view different sets of species are included in the genus as illustrated in Fig. 1.
These differing views are taxonomic concepts. The usage of the correct name is not enough, and Berendsohn [1] suggested that taxonomic concepts should be referred to by an abbreviation sec (secundum) after the author's name to indicate in which meaning the name is used.
Change is in the nature of a biological name system, as there is no single interpretation of evolution. Typically there is no agreement on whether the variation observed in an organism is taxon-specific or shared by more than one taxon, which makes the name system dynamic. For example, the fruit fly Drosophila melanogaster was shifted into the genus Sophophora, resulting in the new name combination Sophophora melanogaster [7]. The most common taxonomic changes and their implications for the scientific names are the following: 1) A
species has been shifted to another genus - the genus name changes. 2) One
species turns out to be several species - new species are described and named,
and the old name remains the same with a narrower interpretation. 3) Several
species are found to be just one species - the oldest name is valid and the other
names become its synonyms.
Fig. 1. A genus is delimited in three different ways according to three different studies.
Black squares indicate species.
names. The TaxMeOn model supports the referring system that is typical to bi-
ology. Some of the properties used in TaxMeOn are part-specific as the uses of
the parts differ from each other. For instance, the property that refers to a ver-
nacular name is only available in the name collection part as it is not relevant
in the other parts of the model.
The most distinctive feature of the research part [14] is that a scientific name and the taxonomic concepts associated with it are separated, which allows detailed management of them both. In the name collection and species list parts, a name and its taxonomic concepts are treated as a unit. Different statuses can be associated with names, such as validity (accepted/synonym), a stage of a naming process (proposed/accepted) and spelling errors.
The model has a top-level hierarchy that is based on a rough classification, such as the division into organism classes and orders. Ontologies that are generated using TaxMeOn can be attached to the top-level classification. A hierarchy is created using the transitive isPartOfHigherTaxon relation, e.g. to indicate that the species forrestii belongs to the genus Abies.
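As a concrete illustration of this pattern, the sketch below encodes the forrestii/Abies example in RDF with rdflib. The taxmeon namespace URI is an assumption, since the paper does not give the exact identifiers; the point is that taxa are instances linked by isPartOfHigherTaxon rather than subclasses of each other.

```python
# Minimal sketch of the hierarchy pattern described above: taxa are instances
# (not classes) linked by the transitive isPartOfHigherTaxon relation.
# The taxmeon namespace URI is an assumption for illustration.
from rdflib import Graph, Namespace, RDF

TAXMEON = Namespace("http://example.org/taxmeon#")
EX = Namespace("http://example.org/taxa/")

g = Graph()
g.add((EX.Abies, RDF.type, TAXMEON.Genus))
g.add((EX.forrestii, RDF.type, TAXMEON.Species))
g.add((EX.forrestii, TAXMEON.isPartOfHigherTaxon, EX.Abies))

# Because taxa are instances rather than subclasses, a query for all genera
# returns only Abies and never the species forrestii.
for genus in g.subjects(RDF.type, TAXMEON.Genus):
    print(genus)
```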
Taxon names that refer to the same taxon can occur as different names in
the published species lists and different types of relations (see Table 1) can be
set between the taxa. Similarly, research results of phylogenetic studies can be
mapped using the same relations. The relations for mapping taxa are divided on the basis of the attributes of taxa (intensional) or of being a member of a group (ostensive). If it is known that two taxa have an association which is not specified,
a class is provided for expressing incomplete information (see the empty ellipse in
Fig. 2). This allows associations of taxa without detailed taxonomic knowledge,
and especially between taxa originating from different sources.
Table 1. Mapping relations used in species lists and research results. The three rela-
tions can be used as intensional and/or ostensive, using their subproperties.
Relation - Description
congruent with taxon - taxonomic concepts of two taxa are equal
is part of taxon - a taxonomic concept of a taxon is included in a taxonomic concept of another taxon
overlaps with taxon - taxonomic concepts of two taxa overlap
references related to the changes of a name status can be added. This allows tracking the temporal order of the statuses. The model for vernacular names is
illustrated in Fig. 2.
Species lists. Species lists have a single hierarchy and they seldom include ver-
nacular names. Species lists have more relevance in science than name collections,
but they lack information about name changes and a single list does not express
the parallel or contradictory views of taxonomy which are crucial for researchers.
Synonyms of taxa are typically presented and the taxonomic concept is included
in a name like in a name collection. Taxa occurring in different species lists can
be mapped to each other or to research results using the relations in Table 1.
In addition, a general association without taxonomic details can be used (see
Fig. 2).
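A hedged sketch of such a mapping is given below: a taxon appearing in two species lists is stated to be (ostensively) congruent. The property name congruentWithTaxonOst is the one used in the ONKI browsing example later in the paper; the namespace and taxon URIs are placeholders.

```python
# Minimal sketch: mapping a taxon that appears in two species lists with an
# ostensive congruence relation (congruentWithTaxonOst, cf. the ONKI example
# later in the paper). The namespace and taxon URIs are placeholders.
from rdflib import Graph, Namespace

TAXMEON = Namespace("http://example.org/taxmeon#")
LIST_A = Namespace("http://example.org/speciesListA/")
LIST_B = Namespace("http://example.org/speciesListB/")

g = Graph()
# The same beetle occurs in two independently published lists; the mapping
# states that their taxonomic concepts are equal (by membership, i.e. ostensively).
g.add((LIST_A.Grammoptera_abdominalis,
       TAXMEON.congruentWithTaxonOst,
       LIST_B.Grammoptera_abdominalis))

for s, p, o in g:
    print(s, p, o)
```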
Biological research results. In biological research results a key element is a
taxonomic concept that can have multiple scientific names (and vice versa). In-
stead of names, taxonomic concepts are used for defining the relations between
taxa. The same relations are applied here as in the species list part (see Table 1). The latest research results often redefine taxon boundaries, for example
a split of taxa narrows the original taxonomic concept and the meaning of the
name changes although the name itself may remain the same. The new and the
old concepts are connected into a temporal chain by instantiation of a change
event. In Fig. 3 the concept of the beetle genus Galba is split into the concepts
of the Balgus and Pterotarsus. The taxon names are shown inside the ellipses
hasVernacular
VernacularName Name TaxonInNameCollection Synonym
Genus
forrestii Reference
TaxonInNameCollection
isPartO n
Taxo
Species
fHighe
list r AcceptedVernacularName
hasVernacula
r
hasVernacular NameStatus
TaxonInNameCollection Name VernacularName
refersTo
TaxMeOn rdfs:la
Taxon Species bel
Forrest fir
Research
rdfs:seeAlso
results VernacularName
Reference
AlternativeVernacularName
˘
Accepted
dc:title
1919
Chundian
lengshan
Wikipedia
Gard. Chron., III, 65:150
Schenkling
Fleutiaux1928 Fleutiaux, Crowson, Cobos,
1920 1945 1967 1961
Publication Publication
Publication Publication Publication
isPartOf isPartOf
Balgus congruent Balgus isOlder Balgus
Taxon Than
HigherTaxon HigherTaxon
1 2 3 after
before
Galba Pterotarsus potential Galba Split
changeIn Relation Eucnemidae Eucnemidae Eucnemidae Eucnemidae
Taxonomic
Concept publishedIn after
tuberculata historio 4a
iPOHT
4b iPOHT
4c iPOHT
4d
Pterotarsus Galbites Pterotarsus Galbites
Publication changeIn changeIn changeIn
Taxonomic Taxonomic Taxonomic
Publication Concept Concept Concept
Publication
Lameere, 1900 Publication Publication Publication Publication
Guerin-
Guerin-Meneville, Meneville,
1830 1831 Fleutiaux, Fleutiaux, Fleutiaux, Muona,
Guerin-Meneville, 1920 1918 1945 1987
1838. In the illustrations of the book
were publishe later and Galba
tuberculata had a name Pterotarsus
marmorata
4 Use Cases
We have applied the TaxMeOn ontology model to three use cases that are based
on different needs. The datasets include a name collection of common names of
vascular plants, several species lists of different animal groups and a collection
of biological research results of Afro-tropical beetles. The use cases were selected
on the basis of the active usage of the data (vernacular names), usefulness to
the users (species lists), and the taxonomic challenges with available expertise
(scientific names based on research results). The datasets used are depicted in
Table 2.
4 http://demo.seco.tkk.fi/saha/VascularPlants/index.shtml
Table 2. Datasets TaxMeOn has been applied to. Vascular plants are included in the
name collection, the false click beetles are biological research results, and all other
datasets are based on species lists.
New scientific species names are added by creating a new instance of the
Species class and then adding the other necessary information, such as their
status. Similarly, a higher taxon can be created if it does not already exist,
and the former is linked to the latter with the isPartOfHigherTaxon relation.
SAHA has search facilities for querying the data, and a journalist writing a non-
scientific article about a house plant, for example, can use the system for finding
a common name for the plant.
Currently, the mapped beetle names are published as services for humans and
machines in the ONKI Ontology Service6 [25]. The ONKI Ontology Service is a
general ontology library for publishing ontologies and providing functionalities
for accessing them, using ready-to-use web widgets as well as APIs. ONKI sup-
ports content indexing, concept disambiguation, searching, and query expansion.
Fig. 4 depicts the user interface of the ONKI server [24]. The user is browsing
the species lists of cerambycid beetles, and has made a query for taxon names
starting with a string “ab”. The selected species abdominalis has been described
by Stephens in 1831, and it occurs in the species list Catalogue of Palaearctic
Coleoptera, published in the year 2010 [15]. The species abdominalis belongs to
the subgenus and genus Grammoptera. The taxonomy of the family Cerambyci-
dae is visualised as a hierarchy tree. The same species also occurs in other species lists, which is indicated by the congruentWithTaxonOst relation. Browsing the taxa reveals varying taxon names and classifications. For example, Grammoptera (Grammoptera) abdominalis has a subgenus in this example, but the rank subgenus does not exist in the other cerambycid lists. Also, the synonyms of the selected taxon are shown (analis, femorata, nigrescens and variegata).
The ONKI Ontology Services can be integrated into applications on the user
interface level (in HTML) by utilising the ONKI Selector, a lightweight web
widget providing functionalities for accessing ontologies. The ONKI API has
6 http://demo.seco.tkk.fi/onkiskos/cerambycids/
The use case of scientific names is the Afro-tropical beetle family Eucnemidae,
which consists of ca. nine genera that have gone through numerous taxonomic
treatments. Also, mistakes and uncertain events are modelled if they are rel-
evant to name changes. For example, the position of the species Pterotarsus
historio in taxonomic classification has changed 22 times and at least eight tax-
onomic concepts are associated with the genus Pterotarsus [17]. Fig. 3 illustrates the problematic nature of the beetle group in a simplified example. A comparable situation concerns most organism groups on Earth. Due to the
numerous changes in scientific names, even researchers find it hard to remember
them and this information can only be found in publications of taxonomy. The
option of managing individual names is advantageous as it completes the species
lists and allows the mapping of detailed taxonomic information to the species
lists. For example, environmental authorities and most biologists prefer a simple
representation of species lists instead of complicated change series.
5 Discussion
We have explored the applicability of semantic web technologies to the management needs of biological names. Separating taxonomic concepts from scientific and vernacular names is justified due to the ambiguity of the names referring to taxa. This also enables relating relevant attributes separately to a concept and to a name, although it is not always clear to which of these an attribute should be linked and subjective decisions have to be made. The idea of
the model is simplicity and practicality in real-world use cases.
The fruitfulness lies in the possibility of linking divergent data serving divergent purposes and of linking detailed information with more general information.
For example, a common name of a house plant, a taxonomic concept that ap-
pears to be a species complex (a unit formed by several closely related species)
and the geographical area can be linked.
The most complex use case is the management of scientific name changes of
biological research results. The main goal is to maintain the temporal control
of the name changes and classifications. The instantiation of taxon names and concepts leads to a situation in which they are hard to manage when they form a
long chain. Every change increases the number of instances created. Protégé7 was used for editing the ontologies, although managing names is quite inconvenient
because they are shown as an alphabetically ordered flat list, not as a taxonomic
hierarchy.
As Protégé is rather complicated for a non-expert user, the metadata editor
SAHA was used for maintaining the continuous changes of common names of
plants. The simplicity of SAHA makes it a suitable option for ordinary users
who want to concentrate on the content. However, we noticed that some useful
features are missing from SAHA. The visualisation of a nested hierarchy would
help users to compare differing classifications.
In many biological ontologies the ’subclass of’ relation is used for expressing the taxon hierarchies. However, in the TaxMeOn model we use the isPartOfHigherTaxon relation instead. If the ’subclass of’ relation were used to express the
taxonomic hierarchy, a taxon would incorrectly be an instance of the higher
taxon ranks, e.g., a species would be an instance of the class Genus. This would
lead to a situation in which queries for genera also return species.
NCBO BioPortal8 and OBO Foundry9 have large collections of life science ontologies, mainly concentrating on biomedicine and physiology. The absence of taxonomic ontologies is notable, which may indicate the complexity of the bi-
ological name system. The portals contain only three taxonomic ontologies (Am-
phibian taxonomy, Fly taxonomy and Teleost taxonomy) and one broader clas-
sification (NCBI organismal classification). The taxonomic hierarchy is defined
using the rdfs:subClassOf relation in the existing ontologies. Taxonconcept.org10
provides Linked Open Data identifiers for species concepts and links data about
them originating from different sources. All names are expressed using literals
and the following taxonomic ranks are included: a combination of a species and a
genus, a class and an order. Parallel hierarchies are not supported. Geospecies11
uses the properties skos:broaderTransitive and skos:narrowerTransitive to ex-
press the hierarchy.
Page [19] discusses the importance of persistent identifiers for organism names
and presents a solution for managing names and their synonyms on the semantic
web. The taxon names from different sources referring to the same taxon are
mapped using the owl:sameAs relation which is a strong statement. Hierarchy
is expressed using two different methods in order to support efficient queries.
Schulz et al. [20] presented the first ontology model of biological taxa and its
application to physical individuals. The organisation of taxa into a hierarchy is thoroughly discussed, but the model is static and based on a single unchangeable taxonomy.
7 http://protege.stanford.edu/
8 http://bioportal.bioontology.org/
9 http://www.obofoundry.org/
10 http://www.taxonconcept.org/
11 http://lod.geospecies.org/
Despite recognising the dynamic nature of taxonomy and the name system, the
model is not applicable in the management of biological names as such.
Franz and Peet [3] illuminate the problematic nature of the topic by describing how semantics can be applied in relating taxa to each other. They introduce two essentially important terms from philosophy into taxonomy to specify the way in which differing classifications that include different sets of taxa can be compared.
An ostensive relation is specified by being a member of a group and intensional
relations are based on properties uniting the group. These two fundamentally
different approaches can be used simultaneously, which increases the information
content of the relation.
Franz and Thau [4] developed the model of scientific names further by eval-
uating the limitations of applying ontologies. They concluded that ontologies
should focus either on a nomenclatural point of view or on strategies for align-
ing multiple taxonomies.
Tuominen et al. [23] model the taxonomic hierarchy using the skos:broader
property, and preferred scientific and common names of the taxa are represented
with the property skos:prefLabel and alternative names with skos:altLabel. The
property rdf:type is used to indicate the taxonomic rank. This is applicable to
relatively simple taxonomies such as species lists, but it does not support ex-
pressing more elaborate information (changes in a concept or a name).
The Darwin Core (DwC) [2] is a metadata schema developed for observation
data by the TDWG (Biodiversity Information Standards). The goal of the DwC is to standardise the form of presenting biological information in order to enhance its usage. However, it lacks the semantic aspect, and the terms related to
biological names are restricted due to the wide and general scope of the DwC.
The scope of the related work presented above differs from our approach as
our focus is on practical name management and retrieval of names.
Research on ontology versioning [10] and ontology evolution [18] has focused
on finding mappings between different ontology versions, performing ontology
refinements and other changes in the conceptualisation [9,21], and in reasoning
with multi-version ontologies [5]. There are similarities to our problem field, but our focus is on supporting multiple parallel ontologies that interpret the domain differently, not on the versioning or evolution of a specific ontology. For example,
there is no single taxonomy of all organisms, but different views of how they
should be organised into hierarchies.
A similar type of an approach for managing changes and parallel views of
concepts has been proposed by Tennis and Sutton [22] in the context of SKOS
vocabularies. However, TaxMeOn supports richer ways of expressing informa-
tion, e.g. for managing changes of taxon names and concepts separately.
References
1. Berendsohn, W.: The concept of ”potential taxon” in databases. Taxon 44, 207–212
(1995)
2. Darwin Core Task Group. Darwin core. Tech. rep (2009),
http://www.tdwg.org/standards/450/
3. Franz, N., Peet, R.: Towards a language for mapping relationships among taxo-
nomic concepts. Systematics and Biodiversity 7(1), 5–20 (2009)
4. Franz, N., Thau, D.: Biological taxonomy and ontology development: scope and
limitations. Biodiversity Informatics 7, 45–66 (2010)
5. Huang, Z., Stuckenschmidt, H.: Reasoning with multi-version ontologies: A tem-
poral logic approach. In: Gil, Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.)
ISWC 2005. LNCS, vol. 3729, pp. 398–412. Springer, Heidelberg (2005)
6. Hyvönen, E., Viljanen, K., Tuominen, J., Seppälä, K.: Building a national semantic
web ontology and ontology service infrastructure – the FinnONTO approach. In:
Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008.
LNCS, vol. 5021, pp. 95–109. Springer, Heidelberg (2008)
7. ICZN: Opinion 2245 (Case 3407) Drosophila Fallén, 1823 (Insecta, Diptera): Drosophila funebris Fabricius, 1787 is maintained as the type species. Bulletin of Zoological Nomenclature 67(1) (2010)
8. Kauppinen, T., Väätäinen, J., Hyvönen, E.: Creating and using geospatial ontology
time series in a semantic cultural heritage portal. In: Bechhofer, S., Hauswirth, M.,
Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 110–123.
Springer, Heidelberg (2008)
9. Klein, M.: Change Management for Distributed Ontologies. Ph.D. thesis, Vrije
Universiteit Amsterdam (August 2004)
10. Klein, M., Fensel, D.: Ontology versioning on the Semantic Web. In: Proceedings of
the International Semantic Web Working Symposium (SWWS), July 30 – August
1, pp. 75–91. Stanford University, California (2001)
11. Knapp, S., Polaszek, A., Watson, M.: Spreading the word. Nature 446, 261–262
(2007)
12. Kurki, J., Hyvönen, E.: Collaborative metadata editor integrated with ontology
services and faceted portals. In: Workshop on Ontology Repositories and Edi-
tors for the Semantic Web (ORES 2010), the Extended Semantic Web Confer-
ence ESWC 2010, CEUR Workshop Proceedings, Heraklion, Greece (June 2010),
http://ceur-ws.org/
12 http://www.seco.tkk.fi/projects/finnonto/
13. Laurenne, N., Tuominen, J., Koho, M., Hyvönen, E.: Modeling and publishing
biological names and classifications on the semantic web. In: TDWG 2010 Annual
Conference of the Taxonomic Databases Working Group (September 2010); poster
abstract
14. Laurenne, N., Tuominen, J., Koho, M., Hyvönen, E.: Taxon meta-ontology
TaxMeOn – towards an ontology model for managing changing scientific names
in time. In: TDWG 2010 Annual Conference of the Taxonomic Databases Working
Group (September 2010); contributed abstract
15. Löbl, I., Smetana, A.: Catalogue of Palearctic Coleoptera Chrysomeloidea, vol. 6.
Apollo Books, Stenstrup (2010)
16. Mayden, R.L.: A hierarchy of species concepts: the denouement in the saga of the
species problem. In: Claridge, M.F., Dawah, H.A., Wilson, M.R. (eds.) Species:
The Units of Biodiversity Systematics Association Special, vol. 54, pp. 381–424.
Chapman and Hall, London (1997)
17. Muona, J.: A revision of the Indomalesian tribe Galbitini new tribe (Coleoptera, Eucnemidae). Entomologica Scandinavica, Supplement 39, 1–67 (1991)
18. Noy, N., Klein, M.: Ontology evolution: Not the same as schema evolution. Knowl-
edge and Information Systems 6(4) (2004)
19. Page, R.: Taxonomic names, metadata, and the semantic web. Biodiversity Infor-
matics 3, 1–15 (2006)
20. Schulz, S., Stenzhorn, H., Boeker, M.: The ontology of biological taxa. Bioinfor-
matics 24(13), 313–321 (2008)
21. Stojanovic, L.: Methods and Tools for Ontology Evolution. Ph.D. thesis, University
of Karlsruhe, Germany (2004)
22. Tennis, J.T., Sutton, S.A.: Extending the simple knowledge organization system
for concept management in vocabulary development applications. Journal of the
American Society for Information Science and Technology 59(1), 25–37 (2008)
23. Tuominen, J., Frosterus, M., Laurenne, N., Hyvönen, E.: Publishing biological
classifications as SKOS vocabulary services on the semantic web. In: TDWG 2010
Annual Conference of the Taxonomic Databases Working Group (September 2010);
demonstration abstract
24. Tuominen, J., Frosterus, M., Viljanen, K., Hyvönen, E.: ONKI SKOS server for
publishing and utilizing SKOS vocabularies and ontologies as services. In: Aroyo,
L., Traverso, P., Ciravegna, F., Cimiano, P., Heath, T., Hyvönen, E., Mizoguchi,
R., Oren, E., Sabou, M., Simperl, E. (eds.) ESWC 2009. LNCS, vol. 5554, pp.
768–780. Springer, Heidelberg (2009)
25. Viljanen, K., Tuominen, J., Hyvönen, E.: Ontology libraries for production use: The
finnish ontology library service ONKI. In: Aroyo, L., Traverso, P., Ciravegna, F.,
Cimiano, P., Heath, T., Hyvönen, E., Mizoguchi, R., Oren, E., Sabou, M., Simperl,
E. (eds.) ESWC 2009. LNCS, vol. 5554, pp. 781–795. Springer, Heidelberg (2009)
An Approach for More Efficient Energy Consumption
Based on Real-Time Situational Awareness
1 Introduction
Real-time processing has become very important for sensor-based applications, since the quantity of data generated by sensors requires on-the-fly processing and immediate reaction in order to be effective. There are many examples, starting from
item-tracking in RFID-supported logistics to remote patient monitoring in eHealth.
Indeed, real-time awareness enables the detection of problems (e.g. a damaged item in
a delivery, or an acute health problem in a patient) as soon as they happen, so that the
reaction can be successfully performed. Note that the same mechanism can be used
for preventive reactions, i.e. reacting before a problem would happen.
At the core of this mechanism is the ability to recognize in real-time1 (or even ahead of time) certain interesting situations, which is called “real-time situational awareness”. Note that this goes beyond the traditional (static) situational awareness
1 We consider “business real-time” as the criterion for declaring something to be processed in real-time.
(like in [1]) that is focused on understanding a situation (if possible in real-time). Real-time situational awareness introduces the notion of real-time emergency: the main goal is to recognize a situation of interest as soon as possible in order to be able to react to it properly.
On the other hand, such a process introduces several challenges for the processing
of sensor-data:
a) it should be very efficient in order to retain its “real-time” flavor and
b) it should be very flexible in order to deal with various and dynamically changing
patterns (situations) of interests (that should be recognized in real-time).
Complex event processing is a technology that can resolve these challenges.
Energy efficiency is one of the application areas where real-time situational awareness can bring substantial added value. The Smart Grid is a well-known example: smart meters2 enable real-time publishing of information about the energy consumption of a user (usually once every seven minutes), which consequently can support the real-time optimization of energy production. Decreasing energy consumption in
buildings, public spaces and households is another very promising area for applying
real-time situational awareness. It has been shown that creating awareness of the current (real-time) energy consumption in a household can by itself bring up to 20% savings in electricity bills. Besides these passive savings, active energy saving by switching off electrical devices in particular situations is a very common method of reducing energy consumption3.
Although well developed, current approaches for achieving energy efficiency seem to be “inefficient”: they usually rely on customized solutions tuned to predefined energy consumption scenarios. The consequence is that the costs of introducing and maintaining these solutions are quite high, since e.g. each new scenario (the so-called energy consumption pattern) has to be modeled separately. On the other
hand, due to a high variation in the energy consumption profile (e.g. the spatial distribution of energy consumers), energy consumption patterns are very variable and modeling new patterns is more the rule than the exception. Therefore, although oriented towards real-time recognition of the interesting situations, these approaches suffer from inflexibility in the detection process, e.g. by not having a declarative description of the situations to be detected available and by not performing intelligent processing (reasoning) on the incoming data. Obviously, the application of semantic technologies can be very promising for resolving these challenges.
In this paper we present a novel approach for achieving energy efficiency that
exploits real-time situational awareness based on the use of Complex Event
Processing and semantic technologies. The main idea is to enable a semantic-based
description of the situations of interest (i.e. energy consumption patterns) and to
perform reasoning about those situations in real time. The approach builds on our
work in the domain of intelligent Complex Event Processing (iCEP, see iCEP.fzi.de),
especially complex event reasoning, which combines very efficient in-memory
(on-the-fly) processing of large amounts of streaming data with on-the-fly reasoning
over available domain knowledge.
The approach has been implemented using the iCEP framework and deployed in
our experimental environment, which supports testing novel technologies with a higher
degree of user involvement. We have performed a case study related to office occupancy
control, which limits the operation of the lighting system based on the actual use of the
space. Preliminary evaluation tests have shown very promising results regarding the
usability and the efficiency of the approach: the approach is able to abstract from the
particular patterns to be recognized to general, declarative situations to be reasoned
about.
The paper is structured as follows. Section 2 gives more details about our energy
efficiency use case from the real-time consumption point of view. Section 3 outlines
the architecture of our solution. Section 4 describes some evaluation details, Section 5
briefly elaborates on related work, and Section 6 gives some concluding remarks.
5 http://www.energysavingsecrets.co.uk/HowToRunAnEnergyEfficientOffice.html
To prevent the system from turning the lights off while the space is still occupied but
there is very little activity, a time delay typically ranging from 1 to 15 minutes can be
programmed into the controls.
However, as already mentioned in the Introduction, current approaches are based
on a fixed set of rules designed for a particular scenario. Let us illustrate this with
the example depicted in Figure 1: an office with six desks, each equipped with a lamp.
The task is to switch off the corresponding lamp when a person leaves the room.
The most advanced systems would model several patterns that describe situations in
which a person is leaving the room, which implies the need to switch off the light at
her/his desk. A pattern for region F would be (cf. Figure 1):
If sequence (RegionF, RegionD, RegionB, Door) Then SwitchOff(LampF) (1)
Fig. 1. The distribution of sensors in a smart office. There are three types of sensors: 1) contact
sensors (TFK, attached to the door), 2) movement sensors (M1–M6), and 3) light barrier sensors
(L1–L4)
It consists of three layers: the raw data provided by sensors, the Digital Entities
provided by sensor resources, and the context information provided by the advanced
system components or context-level resources.
The raw data consists of the value that the sensor provides, e.g. the numerical
value 25. A resource may augment this information with meta-information, e.g. that
the measured value is a temperature, that it is in degrees Celsius, that it was measured
by sensor X at a certain point in time, etc. We call the resulting information a Digital
Entity.
3.2 ETALIS
ETALIS Language for Events [2] is a logic-based formalism for Complex Event
Processing (CEP) and Stream Reasoning. It uses the SWI-Prolog Semantic Web Library
(http://www.swi-prolog.org/pldoc/package/semweb.html) to represent an RDF/XML
ontology as a set of Prolog rules and facts. ETALIS is an open-source implementation
of the language (http://code.google.com/p/etalis/). The language and the corresponding
implementation are based on a novel event processing strategy which detects complex
events by maintaining intermediate states. Every time an atomic event (relevant
w.r.t. the set of monitored events) occurs, the system updates the internal state of
complex events. Essentially, this internal state encodes which atomic events are still
missing for the completion of a certain complex event. Complex events are detected as
soon as the last event required for their detection has occurred. Descriptions specifying
which event occurrences drive the detection of complex events (including the
relationships between complex events and the events they consist of) are given by
deductive rules. Consequently, the detection of complex events amounts to an
inference problem.
Event processing formalisms based on deductive or logic rules [4, 5, 6] have been
attracting considerable attention as they feature formal, declarative semantics.
Declarative semantics of a CEP system prescribe what the system needs to detect, i.e.,
a user does not need to worry about how it will be detected. In this respect, declarative
semantics guarantees predictability and repeatability of results produced by a CEP
system. Moreover, CEP systems based on deductive rules can process not only events,
but also any additional background knowledge relevant with respect to the detection
of complex situations in real-time. Hence a rule-based approach enables a high
abstraction level and a uniform framework for realizing knowledge-based CEP
applications (i.e., specification of complex event patterns, contextual knowledge, and
their interaction). Such applications can be further supported by machine learning
(more specifically data mining) tools, to automate the construction and refinement of
event patterns (see [7]). Although machine learning support per se is out of the scope
of this paper, we want to emphasize the importance of the formal, rule-based
semantics, which can further enable the automated construction of both event patterns
and background knowledge. These features are beyond the capabilities of existing
approaches [8, 9, 10], and this is one of the reasons why ETALIS follows a logic
rule-based approach to event processing.
In the following, we identify a number of benefits of the ETALIS event processing
model, realized via deductive rules: First, a rule-based formalism (like the one we
present in this paper) is expressive enough and convenient to represent diverse
complex event patterns. Second, a formal deductive procedure guarantees the
correctness of the entire event processing. Unlike reactive rules (production rules and
ECA rules), declarative rules are free of side-effects; the order in which rules are
evaluated is irrelevant. Third, although it is outside the scope of this paper, a
deductive rule representation of complex events may further help in the verification of
complex event patterns defined by a user (e.g., by discovering patterns that can never
be detected due to inconsistency problems). Further on, ETALIS can also express
responses to complex events, and reason about them in the same formalism [11].
Fourth, by maintaining the state of changes, the ETALIS event model is also capable
of handling queries over the entire space (i.e. answering queries that span over
multiple ongoing detections of complex events). Ultimately, the proposed event
model allows for reasoning over events, their relationships, entire state, and possible
contextual knowledge available for a particular domain (application). Reasoning in
the ETALIS event model can be further exploited to find ways to reach a given aim,
which is a task that requires some intelligence. For example, an application or a
service needs to reach a stable or known (desired) state. To achieve this, the system
has to have a capability to reason about, or to asses states (in a changing
environment). Another example is to just “track and trace” the state of any entity at
any time (in order to be able to “sense and respond” in a proactive way).
Technically, the ETALIS approach is based on the decomposition of complex event
patterns into intermediate patterns (i.e. goals). The statuses of achieved goals are
materialized as first-class citizens of a fact base. These materialized goals show the
progress toward the completion of one or more complete event patterns. Such goals are
automatically asserted by rules as relevant events occur. They can persist over a
period of time “waiting” in order to support detection of a more complex goal or
complete pattern. Important characteristics of these goals are that they are asserted
only if they are used later on (to support a more complex goal or an event pattern),
that goals are all unique and persist as long as they remain relevant (after that they can
be deleted). Goals are asserted by rules which are executed in the backward chaining
mode. The notable property of these rules is that they are event-driven. Hence,
although the rules are executed backwards, overall they exhibit a forward chaining
behavior. For more information, an interested reader is referred to [2].
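To make the idea of materialized goals more concrete, the following deliberately simplified Prolog sketch shows how an occurrence of a first event can assert a goal that a later event then completes. It only illustrates the general principle and is not the actual ETALIS compilation scheme; in particular, ETALIS handles time windows, goal retention and rule compilation far more carefully, and the predicate and functor names used here are made up for this sketch.

:- dynamic goal/3.

% Simplified illustration of detecting the pattern  e3 <- e1 SEQ e2.
% An occurrence of e1 over the interval [T1,T2] materializes a goal that
% "waits" for a later e2.
event(e1, T1, T2) :-
    assertz(goal(e2, interval(T1, T2), e3)).

% An occurrence of e2 over [T3,T4] completes the goal if it starts after e1
% has ended, and derives the complex event e3 spanning both occurrences.
event(e2, T3, T4) :-
    goal(e2, interval(T1, T2), e3),
    T2 =< T3,
    retract(goal(e2, interval(T1, T2), e3)),
    event(e3, T1, T4).

% Derived complex events are simply reported in this toy setting.
event(e3, T1, T2) :-
    format('complex event e3 detected over [~w,~w]~n', [T1, T2]).

% Example session:
% ?- event(e1, 1, 2).   asserts a goal waiting for e2
% ?- event(e2, 3, 4).   completes the goal and reports e3 over [1,4]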
3.3 Example
As already mentioned, one of the main advantages of our approach is the possibility
to define the situations of interest in a declarative way and to reason about them based
on the incoming sensor data.
In order to illustrate the abstractions introduced by ETALIS, we present here an
illustrative example of occupancy control based on the office layout presented in
Figure 1.
In traditional approaches, the situation of interest,
“a person left the room and her/his desk lamp should be switched off within 5 sec”,
must be described using one rule for each possible situation. An example is
illustrated in Figure 3: a person leaves the room by traversing from Region F through
Regions D and B to the door.
Fig. 3. A possible path from the office desk to the door: a situation that can lead to switching
off the lamp at the desk in Region F
Therefore, traditional approaches must cover all possible “evacuation” paths, which
is a tedious and error-prone process. The situation is even worse when we consider
that the distribution of objects in the office can change: the whole set of rules must
then be rewritten.
On the other hand, in our approach there is only one logic-based statement that
covers all requested situations, by describing them declaratively:
Namespace: cep: http://www.icep.fzi.de/cepsensor.owl#
Pattern.event:
door_open <- status('cep:door', 'cep:door_opened').

status(A, B) <- sensor(X, Y)
WHERE
    (rdfs_individual_of(Sensor, 'cep:Sensor'),
     rdf(Sensor, 'cep:hasName', X),
     rdf(State, 'cep:hasValue', Y),
     rdfs_individual_of(State, 'cep:State'),
     rdf(B, 'cep:detectedWithState', State),
     rdfs_individual_of(B, 'cep:Status'),
     rdf(Sensor, 'cep:locatedIn', A)).

movement(Loc1, Loc2) <- status(Loc1, 'cep:movementInRegion') SEQ status(Bord, 'cep:moveover') SEQ status(Loc2, 'cep:movementInRegion')
WHERE
    (rdfs_individual_of(Loc1, 'cep:Region'),
     rdfs_individual_of(Loc2, 'cep:Region'),
     rdfs_individual_of(Bord, 'cep:Borderline'),
     rdf(Loc1, 'cep:hasNeighbor', Loc2),
     rdf(Loc1, 'cep:hasBorderline', Bord),
     rdf(Loc2, 'cep:hasBorderline', Bord)) 2sec.
comment: this statement detects that a person has changed region, if within 2 sec the movement sensor and the light barrier sensor for a region have been activated

movement(Loc1, Loc3) <- movement(Loc1, Loc2) SEQ movement(Loc2, Loc3).

movement(Loc, 'cep:door') <- (movement(Loc, 'cep:regionB') SEQ status(Bord, 'cep:moveover')) 2sec.
comment: this statement is the most crucial one: by introducing recursive rules we are able to describe all possible paths consisting of succeeding regions

SwitchOff(Loc) <- (movement(Loc, 'cep:door') SEQ door_open) 5sec.
comment: this statement detects that a person has left the room (after a sequence of traversals between regions) and that after 5 sec the light at the starting location should be switched off
Note that the particular state of the world (like the one presented in Figure 2) is
represented in the domain ontology, and the CEP engine (ETALIS) accesses that
knowledge in real time.
rdf(?Subject, ?Predicate, ?Object)
is a predicate in the SWI-Prolog Semantic Web Library. It is the elementary query for triples.
Subject and Predicate are atoms representing the fully qualified URL of the resource. Object
is either an atom representing a resource, or literal(Value) if the object is a literal value.
rdfs_individual_of(?Resource, ?Class)
is a predicate in the SWI-Prolog Semantic Web Library. It tests whether Resource is an
individual of Class, i.e. it succeeds if Resource has an rdf:type property that refers to Class or
a sub-class thereof. It can be used to test membership, to generate the classes a Resource
belongs to, or to generate the individuals described by a Class.
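As an illustration of how these predicates are used together, the following minimal SWI-Prolog sketch lifts a raw sensor reading (sensor name and value) to the located object and its detected status, mirroring the WHERE part of the status pattern above. The local file name 'cepsensor.owl' and the assumption that names and values are stored as plain literals are ours, not the paper's.

:- use_module(library(semweb/rdf_db)).
:- use_module(library(semweb/rdfs)).

% Load the domain ontology used as background knowledge
% ('cepsensor.owl' is an assumed local file name).
:- rdf_load('cepsensor.owl').

% cep(+Local, -URI): expand a local name into the cep namespace.
cep(Local, URI) :-
    atom_concat('http://www.icep.fzi.de/cepsensor.owl#', Local, URI).

% sensor_status(+Name, +Value, -Location, -Status)
% Maps a raw reading to the object located at the sensor and the status
% detected with that value.
sensor_status(Name, Value, Location, Status) :-
    cep('Sensor', SensorClass), cep(hasName, HasName),
    cep(hasValue, HasValue), cep('State', StateClass),
    cep(detectedWithState, DetectedWith), cep('Status', StatusClass),
    cep(locatedIn, LocatedIn),
    rdfs_individual_of(Sensor, SensorClass),
    rdf(Sensor, HasName, literal(Name)),     % assumes names are plain literals
    rdf(State, HasValue, literal(Value)),    % assumes values are plain literals
    rdfs_individual_of(State, StateClass),
    rdf(Status, DetectedWith, State),
    rdfs_individual_of(Status, StatusClass),
    rdf(Sensor, LocatedIn, Location).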
The Object hierarchy describes the real-world entities such as Lamp, Door and
Region, which are connected to a sensor or an actuator. The object property locatedIn
describes the connection between the Objects and the Sensors or Actuators. Each object
has several statuses, e.g. Door has two statuses: open and closed. Some of these
statuses can be detected by a Sensor with a specific State; the others are controlled by
an Actuator using a related Process.
This ontology is used as background knowledge by the ETALIS engine. Indeed,
ETALIS allows the use of background knowledge in the detection process: any constraint
can easily be associated with a situation, which enables a straightforward definition of
new occupancy situations to be detected. For example, it is very easy to introduce a new
property of a region in an office, e.g. to treat regions that have a window separately from
other regions; a small sketch of how such background knowledge could be asserted is
shown below.
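For instance, a new region property could be asserted as additional background knowledge with a single call. In the sketch below, 'hasWindow' and 'regionF' are hypothetical names chosen for illustration; the real ontology may model window regions differently.

:- use_module(library(semweb/rdf_db)).

% mark_region_with_window(+RegionLocal): assert that a region has a window.
% ('hasWindow' is a hypothetical property name.)
mark_region_with_window(RegionLocal) :-
    atom_concat('http://www.icep.fzi.de/cepsensor.owl#', RegionLocal, Region),
    atom_concat('http://www.icep.fzi.de/cepsensor.owl#', hasWindow, Property),
    rdf_assert(Region, Property,
               literal(type('http://www.w3.org/2001/XMLSchema#boolean', true))).

% Example: ?- mark_region_with_window(regionF).

An event pattern could then constrain its WHERE part with such a property, so that window regions are handled by a separate rule.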
4 Evaluation
In order to evaluate the performance of the proposed system, we have implemented a
test case concerning efficient energy consumption in an office. We have used the FZI
Living Lab environment for the testing. (Living Labs is a practical approach to realizing
open innovation with a regional dimension; by definition, a Living Lab is a “research
methodology for sensing, validating and refining complex solutions in real life contexts”.
It is conceptualized as an innovation platform that brings together and involves a
multitude of actors, such as end-users, researchers, industrialists and policy makers.
Their crucial characteristic is that they are user-centered, with an active participation
of users within the entire development process. Usually, Living Labs build upon or
create a technology platform geared to answer the needs of users in a particular
situation.)
The use case is based on simulating occupancy control situations that limit the
operation of the lighting system based on the actual use of the space. In other words,
if a situation that potentially allows energy to be saved is recognized in the way
specified in Section 3.3, the corresponding lighting source should be either dimmed or
switched off. In order to make the test realistic, we have implemented the set of energy
consumption patterns developed for a Building Energy Challenge, a contest regarding
energy consumption between several office buildings within a company (see
http://www.artist-embedded.org/docs/Events/2010/GREEMBED/0_GREEMBED_Papers/IntUBE%20-%20GREEMBED.pdf).
Table 1 shows some of those patterns. Note that, in order to be realistic, we assume that
there are both negative and positive situations from the energy efficiency point of view,
shown as Penalties and Bonuses in Table 1 (the setting has been taken entirely from the
Building Energy Challenge).
Table 1. Examples of the consumption patterns from the Building Energy Challenge

Penalties:
– Having a window open while the heating system is on
– Leaving the office at the end of the day with the computer switched on
– Switching on the artificial light while daylight is sufficient
– Having a temperature lower than 26 °C with the air conditioning on [12]

Bonuses:
– Switching off the light each time when leaving the office
– Switching off the heating each time when no one is in the office
– Switching off the computer when leaving the office for more than one hour
– Switching off the artificial light while daylight is sufficient
We have modeled all these patterns using the ETALIS language and ontologies, as
discussed in Section 3. The sensor setting was very similar to that presented in
Figure 1 (additional sensors for measuring temperature and light intensity, as well as
actuators for electric devices, have been introduced).
We performed an experiment in order to measure the savings in energy
consumption. We measured the power saving time over a period of one month in
an office with five people, which we consider a very common setting. As already
explained, our declarative approach does not depend on the number of sensors or the
size of the room. We made several changes to the layout of the room (positions of
sensors) without the need to change the complex event patterns. Therefore, the
abstraction provided by our language is correct: interesting situations are defined at
the level of objects, independently of the current positions of the sensors.
Table 2 presents the results of this experiment. In the last column we give the
average value of the measurements, and in the remaining columns the values for four
particular days (the 1st, 10th, 20th and 30th) in order to illustrate how these consumption
values varied.
The power saving time represents the time during which some electric devices were
switched off because the corresponding person (related to that device) had left the
room.
We are quite satisfied with the general result of the experiment: the proposed
approach leads to significant reductions in energy consumption. We did not
encounter any false positives.
The only problem we faced was the rather high error rate, where an error
represents a situation that could not be detected using the currently deployed
patterns (out of the scope of the experiment). In the following we explain these
situations in terms of three interfering factors.
The first interfering factor is the precision of the sensors. In the evaluation we used
ELV FS20 sensor systems, including the FS20 PIRI-2 motion sensor, the FS20 IR light
barrier sensor, the FS20 TFK contact sensor and the FS20 ST-3 radio-controlled
electrical socket. The motion sensor and the light barrier sensor have a minimal send
interval of 8 seconds, which means they can only send a single value every 8 seconds.
In the case of high activity frequency, the sensors cannot detect all activities.
Furthermore, the sensors cannot detect some situations, such as two people entering the
office together; in this situation the sensors are not able to recognize the number of
people and only one lamp will be switched on. To overcome this, more and better
sensors can be used to increase the precision of event detection.
The second interfering factor is unanticipated activity in the office. For example, a
user may forget to close the door after coming into the office; when another user
then leaves the office, he does not need to open the door, which is a necessary event
according to the pattern, and his lamp will not be switched off. Similarly, when a
visitor leaves the office, one lamp in the office will be falsely switched off. This
problem can be overcome by installing an automatic door-closing device and by using
additional sensor technologies (such as RFID) to recognize the identity of the user.
The third interfering factor results from the fact that the pattern definition does not
match the habits of a user. In the pattern we defined that the movement event
and the door-open event must happen within 5 seconds to trigger the switch-off event. If a
user is accustomed to doing something else that takes more than 5 seconds before he opens
the door, then his lamp will not be switched off. This problem can be addressed by
studying the habits of the users before defining the patterns.
Modeling the above-mentioned situations will be one of the subjects of further
work.
5 Related Work
In this section we only present related work on current lighting control systems.
Related work on our approach to complex event processing can be found in [2].
Current lighting and climate control systems often rely on building regulations that
define maximum occupancy numbers for maintaining proper lighting and
temperatures. However, in many situations, there are rooms that are used infrequently,
and may be lighted, heated or cooled needlessly. Having knowledge regarding
occupancy and being able to accurately predict usage patterns may allow significant
energy-savings.
In [13], the authors reported on the deployment of a wireless camera sensor
network for collecting data regarding occupancy in a large multi-function building.
They constructed multivariate Gaussian and agent based models for predicting user
mobility patterns in buildings.
In [14], the authors identified that the majority of this energy waste occurs during
the weekdays, not during the weeknights or over the weekends. They showed that this
pattern of energy waste is particularly suited to be controlled by occupancy sensors,
which not only prevent runaway operation after typical business hours, but also
capture savings during the business day.
An analysis of the impact of the new trends in energy efficient lighting design
practices on human comfort and productivity in the modern IT offices is given in [14].
In [15], the authors presented the design and implementation of a presence sensor
platform that can be used for accurate occupancy detection at the level of individual
offices. The presence sensor is low-cost, wireless, and incrementally deployable
within existing buildings.
An examination of different types of buildings and their energy use is given in
[16]. The authors discussed opportunities available to improve energy efficient
operation through various strategies from lighting to computing.
In conclusion, there are many approaches to lighting control, but none of
them uses a declarative approach that would enable efficient real-time
situation detection.
6 Conclusions
In this paper we presented a novel approach for achieving energy efficiency in public
buildings (especially sensor-enabled offices) based on the application of intelligent
complex event processing and semantic technologies. At the core of the approach
is an efficient method for realizing real-time situational awareness that helps to
recognize situations in which more efficient energy consumption is possible and to
react to those opportunities promptly. Semantics allows a proper contextualization
of the sensor data (its abstract interpretation), while complex event processing enables
efficient real-time processing of the sensor data, and its logic-based nature supports a
declarative definition of the situations of interest.
The approach has been implemented using the iCEP framework and deployed in the
FZI Living Lab environment, which supports testing novel technologies with a higher
degree of user involvement. We have performed a case study related to office occupancy
control that limits the operation of the lighting system based on the actual use of the
space. Preliminary evaluation tests have shown very promising results regarding the
usability and the efficiency of the approach: the approach is able to abstract from the
particular patterns to be recognized to general, declarative situations to be reasoned
about.
Future work will be related to modeling a more comprehensive set of patterns for
representing the more complex situations described in the evaluation section.
Additionally, new tests in the Living Lab are planned.
Acknowledgments
Research for this paper was partially financed by the EU within the following FP7
projects: ALERT (ICT-258098), PLAY (ICT-258659) and ARtSENSE (ICT-270318).
References
1. Thirunarayan, K., Henson, C., Sheth, A.: Situation Awareness via Abductive Reasoning
for Semantic Sensor Data: A Preliminary Report. In: Proceedings of the 2009 International
Symposium on Collaborative Technologies and Systems (CTS 2009), Baltimore, MD,
May 18-22 (2009)
2. Anicic, D., Fodor, P., Rudolph, S., Stuehmer, R., Stojanovic, N., Studer, R.: A rule-based
language for complex event processing and reasoning. In: Proceedings of the 4th
International Conference on Web Reasoning and Rule Systems (RR 2010), pp. 42–57
(2010)
3. Deepak, M.: SNOOP: An Event Specification Language For Active Database Systems.
Master Thesis, University of Florida (1991)
4. Carney, D., Cetintemel, U., Cherniack, M., Convey, C., Lee, S., Seidman, G., Stonebraker,
M., Tatbul, N., Zdonik, S.: Monitoring streams: a new class of data management
applications. In: VLDB 2002: Proceedings of the 28th international conference on Very
Large Data Bases. VLDB Endowment (2002)
5. Ray, O.: Nonmonotonic abductive inductive learning. Journal of Applied Logic (2008)
6. Gutierrez, C., Hurtado, C.A., Vaisman, A.A.: Introducing time into rdf. IEEE Transactions
on Knowledge and Data Engineering 19(2), 207–218 (2007)
7. Ryvkina, E., Maskey, A.S., Cherniack, M., Zdonik, S.: Revision processing in a stream
processing engine: A high-level design. In: Proc. Int. Conf. on Data Eng (ICDE), Atlanta,
GA, USA (2006)
8. Agrawal, J., Diao, Y., Gyllstrom, D., Immerman, N.: Efficient pattern matching over event
streams. In: Proceedings of the 28th ACM SIGMOD Conference, pp. 147–160 (2008)
9. Barga, R.S., Goldstein, J., Ali, M.H., Hong, M.: Consistent streaming through time: A
vision for event stream processing. In: Proceedings of the 3rd Biennial Conference on
Innovative Data Systems Research (CIDR 2007), pp. 363–374 (2007)
10. Arasu, A., Babu, S., Widom, J.: The cql continuous query language: semantic foundations
and query execution. VLDB Journal 15(2), 121–142 (2006)
11. Anicic, D., Stojanovic, N.: Expressive logical framework for reasoning about complex
events and situations. In: Intelligent Event Processing - AAAI Spring Symposium 2009.
Stanford University, California (2009)
12. Recommendation from the French Construction code for construction and housing,
http://www.legifrance.gouv.fr/affichCodeArticle.do;jsessionid=87AE72FAE86DC9CF56B8673C1B88F9AD.tpdjo08v_2?cidTexte=LEGITEXT000006074096&idArticle=LEGIARTI000006896264&dateTexte=20090619&categorieLien=id
13. Erickson, V.L., et al.: Energy Efficient Building Environment Control Strategies Using
Real-time Occupancy Measurements. In: Proceedings of the First ACM Workshop on
Embedded Sensing Systems for Energy-Efficiency in Buildings, pp. 19–24 (2009)
14. von Neida, B., et al.: An analysis of the energy and cost savings potential of occupancy
sensors for commercial lighting systems,
http://www.lrc.rpi.edu/resources/pdf/dorene1.pdf
15. Walawalkar, R., et al.: Effect of Efficient Lighting on Ergonomic Aspects in Modern IT
Offices, http://www.walawalkar.com/info/Publications/Papers/EE&Ergonomics.pdf
16. Agarwal, Y., et al.: Occupancy-driven energy management for smart building automation.
In: Proceedings of the 2nd ACM Workshop on Embedded Sensing Systems for Energy-
Efficiency in Building (2010)
Ontology-Driven Complex Event Processing in
Heterogeneous Sensor Networks
1 Introduction
Sensor networks, especially low-cost wireless sensor networks (WSNs), are rapidly
gaining popularity for use in developing scientific knowledge. Scientists are per-
forming dense monitoring of natural environment parameters to learn about
matters including faunal distribution and behaviour, biodiversity and biological
interactions, air and water quality, micro-climatic conditions, and human im-
pact. In some cases a regular collection of homogenous sensor data is sufficient
to support subsequent intensive data analysis over a data archive. In other cases,
there is a need for real-time data collection, analysis, and active response. This
will arise when scientists are unsure about how to detect the phenomena
being investigated, when a human response is necessary to an observed event,
or when the recognised occurrence of an event should cause a change in the
monitoring behavior or equipment configuration. For example, when turtle eggs
begin to hatch on the beach in stormy weather, a scientist should be alerted and
cameras should be activated. When an endangered plant begins to flower and
pollinators are detected in the locality, a protecting cloche should be removed to
enable pollination. When nitrate concentrations at several points in the water
course exceed a threshold, the downstream oyster farmers should be warned and
additional water quality parameters should be monitored.
In these cases the events may be detected only by recognising a complex cor-
relation of observations made over multiple sensors attached to multiple WSN
nodes, and possibly distributed over multiple networks. The sensors may be
heterogeneous, as may be the sensing nodes that control them. Although in principle
it is possible to use in-network processing techniques by explicitly programming
each device to make observations and to coordinate and correlate those obser-
vations with neighbouring nodes, this is very difficult in general and certainly
requires advanced programming skills and dedication with current technology. If
we add to the mix the typical experimental challenge that it is not well known
how to define the desired event in terms of measurable properties, nor even what
all the interesting events might be over the life of a sensor network deployment,
then we can conclude that better tools are needed to offer this capability.
A better tool should enable an experimental scientist to
– discover sensors that could be used to make relevant measurements;
– develop a specification for the event of interest in terms of the available
sensing capability;
– reuse measurements that are already being made if possible or to
– program the sensor devices to make the necessary measurements otherwise;
– describe an action to be taken if the interesting event is detected; and to
– easily deploy the specification for efficient runtime processing.
In this paper we propose the use of ontologies as an important part of such a
tool. In earlier work we have shown how to effectively program sensor networks
and devices by modelling the sensing capability and programming language in
an OWL ontology, using the ontology to add contextual concepts for use in
the programming task, and using the ontology reasoning capability to validate
commands [12]. In this work, we propose that an ontology can capture valuable
domain context for sensor capability description and discovery, for description
of an interesting event in terms of potential sensor measurements, and for opti-
mising the execution strategy for run-time event detection.
We do not address sensor discovery, but discovery methods based on ontolog-
ical descriptions abound, for example [11] and [6]. In our work, we rely on the
newly developed SSN ontology from the W3C Semantic Sensor Network Incu-
bator Group [13], to take advantage of an emerging community terminology.
Fig. 1 shows the event framework and illustrates the communication between
components. The basic process can be described as follows:
1. The user defines a complex event definition composed from several atomic
sensed events within the user interface.
2. The complex event definition is processed and stored in the event ontology.
3. The ontology data is used to program the required sensors.
4. The event detection part for the current event definition is separately saved
in an ExportOntology and sent to the semantic event middleware.
5. The semantic event middleware transforms the received ontology into CEP
streams and queries, written in the CEP-dependent Event Processing Lan-
guage (EPL).
6. The semantic event middleware sets up the CEP server with created streams
and queries, and initiates the event detection process.
7. The CEP server performs the event detection and sends an alert if the spec-
ified event has been recognized.
The EventOntology is the central part of the Event Framework. It allows the use
of reasoning over event information to obtain additional knowledge and to perform
semantic optimisations. A clear formalization of the event and measurement en-
vironment is necessary to exploit these advantages. Our OWL 2.0 event ontology
extends an early form of the SSN ontology of the W3C SSN-XG [13]. It is reported
to be in ALCIQ(D) by the ontology editor, Protégé. Along with the sensors, it mod-
els the domain of application and the concept and model of the entire event frame-
work. Events, alerts, triggers, streams, locations, instruments, phenomena, sensors
and sensor programs are all part of this description. Additional classes and instance
data are included to describe relations between single events, time intervals, and
to define different kinds of sensor programs. The high-level structure is apparent
through its reflection in the user interface described in Section 3. The interested reader
may find the ontology at
http://research.ict.csiro.au/conferences/ssn/EventOntology_no_imports.owl.
The CEP server application (built on the Coral8 CEP platform,
http://www.aleri.com/products/aleri-cep/coral8-engine) performs the actual complex
event detection. For this, the server receives a stream with configuration information
for the event data sources and a query for the complex event detection, both expressed
in the CEP platform's proprietary EPL. The CEP server is also responsible for sending
the user-defined alert message if an event has been detected.
We now provide more detail on our user interface, our semantic event middleware
and our management module interfaces for each source.
3.1 Events
Every event description consists of two major parts: an alert which will be acti-
vated if an event has been recognized, and the definition of the event itself. The
definition can be expressive and must be able to represent user-defined complex
structures. To achieve this goal, every complex event definition, called an obser-
vation, is composed of several atomic observations. Each atomic observation can
be individually configured in four main steps.
could be useful to add extra information like physical values or the kind of loca-
tion and an environment description. Both location and instrument descriptions
are stored within the ontology. The user interface loads this information dynam-
ically and only displays valid entries within the event definition window.
3.2 Alerts
The user can choose to receive an email or a text message on a mobile phone
if an event has been detected. Every email includes standard passages about
the event processing supplemented with event specific information such as the
latest sensor data and a description of the trigger configuration. The SMS alert
is shorter. Alerts are defined within the ontology, together with the relevant EPL
code fragments, and dynamically loaded and integrated into the user interface.
This easily allows the integration of additional alert types. In future work we
will add an alert type to send a control message to external software systems.
3.3 Observations
Complex events are designed within the ontology so that every complex event
contains an observation. An observation is a generic term covering five different
kinds of construct that are used to realize the complexity of definitions. Observation
operators define the logical AND, OR and FOLLOWED BY relationships between
atomic or compound observations, and bracketed grouping can be represented by
observation groups; a small sketch of such a composite structure is given after the
list below.
Atomic Observation is the description of an atomic event within the entire
complex event definition. It contains the information to program the selected
sensor and the trigger definition for the event.
Observation Intersection is used to create a logical AND (&&) relationship
between two observations. Each of these observations can be a single observation,
an observation union, an observation sequence or another observation
intersection in turn.
Observation Union is the counterpart to the observation intersection. It links
two observations by a logical OR (||) relationship. Here again, each of the
observations can be a single observation, an observation intersection, an
observation sequence or another observation union.
Observation Sequence is used to create a FOLLOWED BY (,) relationship
between two observations. FOLLOWED BY specifies that the next event
has to be recognized chronologically after the previous one. Each of these
observations can be a single observation, an observation intersection, an
observation union or another observation sequence. This is only available if strict
order is used, as described in the next paragraph.
Observation Groups are used to combine multiple single observations. Each
group belongs to a certain event and contains all consecutive observations
summarized by the user within one parenthetical group.
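As a purely illustrative sketch (written in Prolog for brevity; the framework itself stores these constructs in the ontology and generates EPL from them), a composite observation can be viewed as a nested term and rendered into an expression using the &&, || and , operators listed above. The functor names obs/1, intersection/2, union/2 and sequence/2 are assumptions made for this sketch.

% render(+Observation, -Expression): turn a nested observation term into an
% EPL-style boolean expression atom.
render(obs(Name), Name).
render(intersection(A, B), Out) :-
    render(A, RA), render(B, RB),
    format(atom(Out), '(~w && ~w)', [RA, RB]).
render(union(A, B), Out) :-
    render(A, RA), render(B, RB),
    format(atom(Out), '(~w || ~w)', [RA, RB]).
render(sequence(A, B), Out) :-
    render(A, RA), render(B, RB),
    format(atom(Out), '(~w , ~w)', [RA, RB]).

% Example (event names are hypothetical):
% ?- render(sequence(obs(frostWarning), union(obs(lowSoilTemp), obs(lowAirTemp))), E).
% E = '(frostWarning , (lowSoilTemp || lowAirTemp))'.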
3.4 Triggers
To create expressive and practical event descriptions, it is necessary to interpret,
analyse and filter environmental observations in order to define which occurrences
are interesting for an event. The data source is already described by the sensor
program within an atomic observation. In order to be in a position to recognize
complex events, it is not enough to simply compare incoming values; it is much more
revealing to observe time-dependent behaviour and value patterns. For this reason,
nine different kinds of trigger were designed; the semantics of a few of them are
sketched in code after the list below.
About examines whether the received data is equal to a user-defined value; values
within a 10 percent tolerance are detected.
Area recognizes all readings in the interval between two given values.
Change monitors whether the current reading changes with respect to the average
value of the previous two readings.
Decrease is used to detect whether the latest value is lower than the average of the
two preceding readings.
Equal simply checks whether the current value equals a value defined by the user.
Greater triggers if the received value is greater than a user-defined value.
Increase is the opposite of Decrease and observes whether the latest value is greater
than the average of the two preceding readings.
Less triggers if the received reading is smaller than a user-defined value.
Received recognizes that data from a sensor has been received, without further
qualification.
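The following small Prolog sketch pins down the semantics of three of these trigger kinds as described above. It is illustrative only, since the deployed system expresses triggers as Coral8 EPL code fragments, and the 10 percent tolerance of About is assumed here to be relative to the user-defined target value.

% About: the received value is within 10 percent of the user-defined target
% (tolerance assumed to be relative to the target).
about(Value, Target) :-
    Tolerance is abs(Target) * 0.1,
    abs(Value - Target) =< Tolerance.

% Area: the reading lies strictly between two given bounds (matching the
% template clause "column > value1 AND column < value2" discussed below).
area(Value, Low, High) :-
    Value > Low,
    Value < High.

% Increase: the latest value is greater than the average of the two
% preceding readings (Decrease is the symmetric case).
increase(Latest, Prev1, Prev2) :-
    Latest > (Prev1 + Prev2) / 2.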
Like the observation concept, the trigger specification within the ontology contains
CEP-platform-specific code fragments which are used to generate queries.
Fig. 3 illustrates this usage of ontology data to create Coral8 EPL program
code. The ontology data shown and the code snippet correspond to a complex
event definition example which, amongst other observations, recognizes temperatures
between -20 and -10 degrees. All shaded statements are generated from the
description of sensor programs in the ontology. For example, the sensor program
individual “program_0_1283418466826Ind” has the object properties
“WS-TemperatureSensor” and “WM1600” (an Environdata WeatherMaster1600,
http://www.environdata.com.au/weathermaster1600). Both are used as variable names
to describe input and output streams as well as CEP windows that include the
definition of which data has to be filtered. The black-highlighted expression
within the “where” clause is also created automatically. The template clause
“trigCmd” for the Area trigger, “column > value1 AND column < value2”, is
used as the basis for the “where” clause. This information is stored in the ontology
as part of the definition of the Area trigger class, and so, through a reasoner,
it is also a data property value of the “trigger_0_1283418466826Ind” individual of
that class. Other data property values instantiated through the user interface
provide the user-defined thresholds “hasValue1” and “hasValue2” to replace the
terms “value1” and “value2” inside the template clause. The template string
“column” is replaced by the corresponding variable name generated from the
description of the sensor program “program_0_1283418466826Ind”.
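To illustrate how such a template clause can be instantiated, the following minimal Prolog sketch substitutes the placeholders of the Area template with concrete values. It only mimics the substitution step; the middleware performs it in its own implementation, and the column name used in the example is a hypothetical stream variable name.

:- use_module(library(apply)).

% instantiate(+Template, +Bindings, -Clause): replace every Placeholder-Value
% pair from Bindings inside the template atom.
instantiate(Template, Bindings, Clause) :-
    foldl(substitute, Bindings, Template, Clause).

substitute(Placeholder-Value, In, Out) :-
    atomic_list_concat(Parts, Placeholder, In),   % split at the placeholder
    atomic_list_concat(Parts, Value, Out).        % re-join with the value

% Example (the column name is hypothetical):
% ?- instantiate('column > value1 AND column < value2',
%                ['column'-'WS_TemperatureSensor.Temperature',
%                 'value1'-'-20', 'value2'-'-10'], C).
% C = 'WS_TemperatureSensor.Temperature > -20 AND WS_TemperatureSensor.Temperature < -10'.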
6 Related Work
Research over the past twenty years has produced a large number of research
prototypes and commercial products to continuously monitor data streams for
detection of events arising as complex correlations of source data values. These
may be known as distributed stream management systems, complex event pro-
cessors, or sensor data management systems. Some are directed at scientific
users and incorporate grid computing elements; others are optimised for high
message throughput rates; and others again are closely integrated with the sen-
sor networks that generate the data and can instigate collection of data from
the sensors on demand. Some of the best known are Aurora/Borealis[1], TinyDB
[10], Caldera[9], and Esper [4].
In concurrent work, [3] extends the standard RDF query language SPARQL
to provide a SPARQL interface to querying streaming data sources, together
with a temporally extended RDF response. One aim of that work was to make
streaming sensor data available within the linked open data framework, hence
the choice of an RDF/SPARQL model. However, as in our work, the queries are
mapped to a native stream processing language for run-time execution. In that
case a more traditional query translation approach is used with an intermediate
algebraic representation, and reverse translation of query answers.
Our event framework does not offer a query language interface directly but we
offer an exploratory GUI that can be understood as defining an interesting event
rather than querying for instances of the event. However, because an event spec-
ification in our approach is simply an ontology fragment, other query-language-like
interfaces or APIs could readily be designed. A query language based on
description logic conjunctive queries over the classes and properties of an event
definition would be a more natural match for our work than extended SPARQL.
Although we do not offer a linked open data solution, we rely on the more
expressive OWL ontology language to provide design-time contextualisation of
the sensor data and optimisation, but with no run-time processing overhead.
Our approach provides a more direct translation path to the underlying EPL, so
is likely to allow more expressive queries (where modelled in the ontology) and
possibly also more compact and efficient EPL code. For example, our system
straightforwardly offers complex events defined by integrating multiple sensor
data streams, whereas that capability is set down for future work in [3].
In [6], a SensorMashup platform is described that offers a visual composer
for sensor data streams. Data sources and intermediate analytical tools are de-
scribed by reference to an ontology, enabling an integrated discovery mechanism
for such sources. A SPARQL endpoint is offered to query both sensor descrip-
tions (in the conventional manner) and also individual sensor data streams.
Like our event framework, a third-party DSMS is used to manage the raw sen-
sor data streams. A SPARQL query is translated to an SQL-like continuous
query over the streams to be handled by the DSMS, but the usual essential
windowing and aggregation functions of a DSMS (such as "average") cannot be
used as there is no SPARQL counterpart. The most important difference to our
taken when complex events arising from the sensed data are detected. Further-
more, a user can specify a complex event of interest, within the capability of
the available sensor networks, and if the necessary phenomena are not currently
being monitored, the system will automatically and transparently program those
various sensor networks to monitor the necessary phenomena.
Our work is currently deployed on our Phenonet network for agricultural
monitoring installed near Leeton, NSW, Australia. Although a fairly small
deployment, the architecture is highly heterogeneous. The network includes several
Fleck devices (Powercom Fleck,
http://www.powercomgroup.com/Latest_News_Stories/Fleck_long_range_wireless_sensing_and_control.shtml)
of a WSN with sensors for soil moisture and leaf temperature. There is a separate
Fleck WSN with the node directly connected to a Vaisala Automatic Weather Station
(Vaisala WM30, http://www.vaisala.com/files/WM30_Brochure_in_English.pdf), a
solar radiation sensor, and a photosynthetically active radiation sensor. Another
independent wireless network of Hussat data loggers (http://hussat.com.au/) with
soil temperature profile and soil moisture profile sensors is also present. Finally, there
is an independent IP-connected Environdata automatic weather station. Currently,
stream data arising from the Fleck and Hussat nodes is retrieved by polling a database
archive, and automated programming of the nodes through the event framework is not
possible. The programming capability and live stream feed for these sources will be
available shortly, taking advantage of code optimisation work [7] in a service
architecture [8].
One event that is particularly important in this domain is the occurrence
of frost. We will investigate including frequent weather reports as a source of
streaming data together with our sensors in the field. We will use the system
to develop an adequate recognition of a frost occurrence and connect the event
notification to the control system for infrastructure that can protect the experi-
mental crop in the field.
The strength and novelty of this work lies in its use of ontologies and reason-
ing. We have shown how a scientist can develop a specification for an event of
interest in terms of the available sensing capability, reusing measurements that
are already being made. We have shown how a scientist can describe an action
to be taken if the interesting event is detected, and can easily deploy the spec-
ification for efficient runtime processing. Because of the ontological component,
the work can also be used together with semantic discovery techniques and also
semantic sensor network programming techniques to offer a complete solution for
user-driven scientific and experimental reuse of heterogenous sensor networks.
Acknowledgement. The authors thank the members of the CSIRO semantic sen-
sor networks team, Peter Lamb, Doug Palmer, Laurent Lefort, Leakha Henry
and Michael Compton, and the CSIRO Phenomics scientists, Xavier Sirault and
David Deery. This work was conducted using the Protégé resource, which is sup-
ported by grant LM007885 from the United States National Library of Medicine.
References
1. Abadi, D.J., Carney, D., Cetintemel, U., Cherniack, M., Convey, C., Lee, S., Stone-
braker, M., Tatbul, N., Zdonik, S.B.: Aurora: a new model and architecture for data
stream management. VLDB Journal 12(2), 120–139 (2003)
2. Arasu, A., Babu, S., Widom, J.: The CQL continuous query language: Semantic
foundations and query execution. Very Large Database (VLDB) Journal 14 (2005)
3. Calbimonte, J.-P., Corcho, O., Gray, A.J.G.: Enabling ontology-based access to
streaming data sources. In: Patel-Schneider, P.F., Pan, Y., Hitzler, P., Mika, P.,
Zhang, L., Pan, J.Z., Horrocks, I., Glimm, B. (eds.) ISWC 2010, Part I. LNCS,
vol. 6496, pp. 96–111. Springer, Heidelberg (2010)
4. Esper–Complex Event Processing. Espertech event stream intelligence (December
2010), http://esper.codehaus.org/
5. Hinze, A., Sachs, K., Buchmann, A.: Event-based applications and enabling tech-
nologies. In: DEBS 2009: Proceedings of the Third ACM International Conference
on Distributed Event-Based Systems, pp. 1–15. ACM, New York (2009)
6. Le-Phuoc, D., Hauswirth, M.: Linked open data in sensor data mashups. In: Taylor,
K., Ayyagari, A., De Roure, D. (eds.) Proceedings of the 2nd International Workshop
on Semantic Sensor Networks, SSN 2009, Washington DC, USA, October 2009.
CEUR Workshop Proceedings, vol. 522, pp. 1–16 (2009)
7. Li, L., Taylor, K.: Generating an efficient sensor network program by partial
deduction. In: Zhang, B.-T., Orgun, M.A. (eds.) PRICAI 2010. LNCS, vol. 6230, pp.
134–145. Springer, Heidelberg (2010)
8. Li, L., Taylor, K.: A framework for semantic sensor network services. In: Bouguet-
taya, A., Krueger, I., Margaria, T. (eds.) ICSOC 2008. LNCS, vol. 5364, pp. 347–
361. Springer, Heidelberg (2008)
9. Liu, Y., Vijayakumar, N., Plale, B.: Stream processing in data-driven computa-
tional science. In: Proc. 7th IEEE/ACM International Conference on Grid Com-
puting, GRID 2006, pp. 160–167. IEEE, Washington, DC, USA (2006)
10. Madden, S.R., Franklin, M.J., Hellerstein, J.M., Hong, W.: TinyDB: an acquisi-
tional query processing system for sensor networks. ACM Trans. Database Syst. 30,
122–173 (2005)
11. Sirin, E., Parsia, B., Hendler, J.: Filtering and selecting semantic web services with
interactive composition techniques. IEEE Intelligent Systems 19, 42–49 (2004)
12. Taylor, K., Penkala, P.: Using explicit semantic representations for user programming
of sensor devices. In: Advances in Ontologies: Proceedings of the Australasian
Ontology Workshop, Melbourne, Australia, December 1. Conferences in Research and
Practice in Information Technology (CRPIT), vol. 112. Australasian Computer
Society (2009)
13. W3C SSN-XG members. SSN: Semantic sensor network ontology (December 2010),
http://purl.oclc.org/NET/ssnx/ssn
A Semantically Enabled Service Architecture for
Mashups over Streaming and Stored Data
1 Introduction
Sensor networks promise to bridge the gap that, for too long, has separated
computing applications from the physical world that they model and in which
they are ultimately embedded. Many scientific and technological challenges need
to be tackled before sensor networks can be exploited to their full capacity
for aiding decision support applications. Additionally, as more and more sensor
networks are independently developed and deployed, it becomes increasingly
important to support their reuse in applications that were not foreseen or that
transcend their original purpose. This will facilitate the use of sensor network
technology to support decision-making that requires on-the-fly integration of
data of differing modalities, e.g. sensed data with data stored in databases, as
well as the ad hoc generation of mashups over data stemming from computations
that combine real-time and legacy historical data. This, in turn, will enable the
enacting of decisions based on such real-time sensed data.
One area that has seen a massive increase in the deployment of sensing devices to
continuously gather data is environmental monitoring [9]. For example, metocean
data, i.e. wave, tide, and meteorology data, for the south coast of England is measured
by two independently deployed sensor networks, the Channel Coastal Observatory
(CCO, http://www.channelcoast.org/) and WaveNet
(http://www.cefas.co.uk/our-science/observing-and-modelling/monitoring-programmes/wavenet.aspx);
meteorological data is also available from the MetOffice (http://www.metoffice.gov.uk/).
More broadly, data from sensors is being used worldwide to improve predictions of,
and plan responses to, environmental disasters, e.g. coastal and estuarine flooding,
forest fires, and tsunamis. The models that make these predictions can be improved by
combining data from a wide variety of sources. For example, in a coastal flooding
scenario, details of sea defences, combined with current wave information, can be used
to identify potential overtopping events, i.e. when waves go over the top of a sea
defence. When responding to a flooding event, additional data sources such as live
traffic feeds (e.g. http://www.highways.gov.uk/rssfeed/rss.xml) and details of public
transport infrastructure can help inform decisions.
Enabling the rapid development of flexible and user-centric decision support
systems that use data from multiple autonomous independently deployed sensor
networks and other applications raises several technical challenges, including:
(i) Discovering relevant sources of data based on their content, e.g. features of
interest and the region covered by the dataset; (ii) Reconciling heterogeneity in
the data sources, e.g. the modality, data model, or interface of the data source,
and enabling users to retrieve data using domain concepts; and (iii) Integrating
and/or mashing up data from multiple sources to enable more knowledge about
a situation to become available. OGC-SWE [3] and GSN [1] are previous proposals
that share some of our aims in this paper. However, both require data sources
to expose their respective data models, thus limiting the reuse of existing sources
from a multitude of domains. Additionally, neither supports integrating data
from heterogeneous data sources. Our proposed approach makes extensive use
of semantic technologies to reconcile the heterogeneity of data sources whilst
offering services for correlating data from independent sources. This enables user-
level applications to generate queries over ontologies, which are then translated
into queries to be executed over the data sources. Data is returned expressed
in terms of the user-level ontology and can be correlated and juxtaposed with
other data in a meaningful and controlled manner.
The rest of the paper is structured as follows. In Section 2 we provide a
detailed description of the flood emergency planning scenario for the south coast
of England and identify the set of requirements. Section 3 provides an overview
of the ontology network that we have developed to represent the information
needed in this scenario. We then present our architecture for a semantic sensor
web in Section 4, and describe the role of ontologies and semantic annotations in
the architecture. Section 5 describes a prototype deployment of our architecture
for the flood emergency response use case. We discuss related work in Section 6
and present our conclusions in Section 7.
state (available from sensors), and a forecast of how the latter are likely to evolve
in the region (available from predictive models). This would enable them to make
more accurate decisions about the likely effects of the sea-state on shipping. Sim-
ilarly, they must assess the risk to the public. To aid this, they need details of
the transportation infrastructure (available from stored data), populated areas
(available as maps), and the predicted effects of the flood (available from models
based on the sensor data).
2.2 Requirements
The following general requirements can be drawn from the scenario.
Fig. 1. The SemSorGrid4Env ontology network in the flood emergency planning scenario. The
arrows indicate ontology reuse. (The figure groups the ontologies into upper ontologies, e.g.
DOLCE UltraLite and SWEET; the SSG4Env infrastructure ontologies SSN, Service and Schema;
external ontologies such as FOAF and Ordnance Survey; and the flood domain ontologies.)
Great Britain, and by the Additional Regions ontology, which includes other regions
needed in our scenario. (v) To represent those features of interest and their proper-
ties that are specific to the flood emergency planning scenario. This is covered by the
Coastal Defences ontology. (vi) To represent the different roles involved in a flood
emergency planning scenario. This is covered by the Roles ontology.
All the ontologies have been implemented using OWL. While some of the
ontologies presented here are specific to the flood warning scenario, e.g. Roles,
the architecture proposed in the next section is generic. Thus, it can be adapted
to other situations by replacing the flood domain ontologies.
Fig. 2. Conceptual view of the service architecture. Boxes denote SemSorGrid4Env ser-
vices, ovals denote external services/entities, arrows denote caller-callee relationships.
Table 1. Services and their interfaces. Interfaces shown in italics are optional.
excerpts of which are shown in Fig. 3. Note that these declarations make use of strdf
[12], a spatiotemporal extension of rdf that defines uris of the form &term.
Data source services provide the mechanism to publish data: either coming from
a sensor network or some other data source, e.g. a database or another data
service. Depending on the interfaces supported by the data service, operations
are provided for querying, retrieving, and subscribing to data. A distributed
query processing service can be offered, using the integration interface, which
consumes data from other services that may only support the data access or
subscription interfaces.
Data source services publish a property document about the data that they
provide, and the mechanisms by which it may be accessed. The first part of
the property document in Fig. 3 describes the interaction mechanisms provided,
1 <service:WebService rdf:about="#cco-ws">
2 <rdfs:label>Channel coastal observatory streaming data service</rdfs:label>
3 <service:hasInterface rdf:resource="service:ssg4ePullStream"/>
4 <service:hasDataset rdf:resource="#envdata_SandownPier_Tide"/>
5 <service:hasDataset rdf:resource="#envdata_SandownPier_Met"/>
6 ...
7 </service:WebService>
8 <sweet:Dataset rdf:about="#envdata_SandownPier_Tide">
9 <rdfs:label>envdata_SandownPier_Tide</rdfs:label>
10 <service:coversRegion rdf:resource="&AdditionalRegions;SandownPierLocation"/>
11 <time:hasTemporalExtent rdf:datatype="&registry;TemporalInterval">
12 [2005, NOW]</time:hasTemporalExtent>;
13 <service:includesFeatureType rdf:resource="&CoastalDefences;Sea"/>
14 <service:includesPropertyType rdf:resource="&CoastalDefences;TideHeight"/>
15 <service:includesPropertyType rdf:resource="&CoastalDefences;WaveHeight"/>
16 <service:hasSchema rdf:resource="#envdata_SandownPier_Tide_Schema"/>
17 ...
18 </sweet:Dataset>
19 <schema:Stream rdf:about="#envdata_SandownPier_Tide_Schema">
20 <schema:extent-name>envdata_SandownPier_Tide</schema:extent-name>
21 <schema:hasAttribute rdf:resource="#HMax"/>
22 <schema:hasAttribute rdf:resource="#Tp"/>
23 ...
24 </schema:Stream>
25 <schema:Attribute rdf:about="#HMax">
26 <schema:attribute-name>HMax</schema:attribute-name>
27 ...
28 </schema:Attribute>
29 ...
Fig. 3. Snippets from the cco sensor data web service semantic property document
expressed in strdf [12] using xml notation. We assume appropriate namespace decla-
rations and represent omitted parts of the document with ‘. . . ’.
i.e. the interfaces (line 3) and operations supported by the data service. The
rest of the property document describes the data that is available through the
service, which is not covered in the wsdl definition of the service.
A data source may publish one or more datasets, as per the ws-dai stan-
dard [2]. Lines 4 and 5 show that the cco sensor data service publishes multi-
ple datasets including the two identified as #envdata_SandownPier_Tide and
#envdata_SandownPier_Met. Each dataset is described in terms of its spa-
tiotemporal and thematic coverage, and (where appropriate) its schema. Lines 8
to 18 provide details of the #envdata_SandownPier_Tide dataset. Specifically,
lines 10 to 12 describe the spatiotemporal range of the dataset as providing data
for the ‘Sandown Pier’ location and that the time range is from 2005 until the
current time, represented with the distinguished literal ‘NOW’. The types of fea-
tures covered by the dataset are declared by the statements in line 13, which
state that the #envdata_SandownPier_Tide dataset contains information about
the CoastalDefences concept Sea. Lines 14 and 15 give the property types covered
as the CoastalDefences concepts of TideHeight and WaveHeight. Where appro-
priate, e.g. for relational data sources, the property document also includes an
ontological description of the schema of the dataset using the Schema ontology.
Line 16 declares that the #envdata_SandownPier_Tide dataset has a schema
described by the resource #envdata_SandownPier_Tide_Schema. Lines 19 to 28
describe the relational schema of the #envdata_SandownPier_Tide data stream:
its name, attributes, types of the attributes, primary key, and timestamp at-
tribute. It is this information that enables a distributed query service, which
itself can be seen as a data service, to support queries over external data sources.
integrator can infer that the #CCO-WS only supports data retrieval through the
pull-stream interface (line 3), i.e. it does not support queries, and that the schema
of the #envdata_SandownPier_Tide is as described (lines 19-28). The integrator
can also query the registry to discover suitable distributed query processing
services to invoke in answering queries over the integrated data resource. The
property documents of the data sources also aid the integrator in the creation of
the property document that describes the integrated data model, in particular
its spatiotemporal and thematic content. Note that the semantic representation
of a source schema, as provided in lines 19-28 of Fig. 3, can help mapping tools
in understanding the schema of a data source and, therefore, in the creation of
mappings between source schemas and an ontology for the domain of interest.
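To make this concrete, the following is a minimal sketch, not part of the SemSorGrid4Env implementation, of how a registry or integrator could query a property document such as the one in Fig. 3 for the datasets that cover a given property type. It uses Python with rdflib; the service vocabulary namespace URI is a placeholder, since the real URI is not given here.

from rdflib import Graph

SERVICE_NS = "http://example.org/service#"  # placeholder for the service vocabulary URI

def datasets_with_property(property_doc_xml, property_type_uri):
    # Load the RDF/XML property document published by a data source service.
    g = Graph()
    g.parse(data=property_doc_xml, format="xml")
    query = """
        PREFIX service: <%s>
        SELECT ?ws ?dataset WHERE {
            ?ws service:hasDataset ?dataset .
            ?dataset service:includesPropertyType <%s> .
        }""" % (SERVICE_NS, property_type_uri)
    # Each result row pairs a web service with one of its matching datasets.
    return [(str(row.ws), str(row.dataset)) for row in g.query(query)]

For the document in Fig. 3, and assuming the entity references expand to full URIs, a query for the WaveHeight property type would pair the #cco-ws service with the #envdata_SandownPier_Tide dataset.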
4.5 Summary
The property document enables well-informed interactions between the services
in the architecture, and is instrumental in all aspects of the functionality offered.
It is not a requirement to provide the semantic property document, and no
parts of it are mandatory. As such, external services, e.g. those defined by ogc
[3,5], can be incorporated into the architecture. However, by describing the non-
functional properties, particularly the spatiotemporal and thematic coverage of
its data, in a property document a service can be discovered through the registry,
and used by the integrator and application services in a seamless manner.
Fig. 4. Screenshots from the flood emergency response Web application available from
http://webgis1.geodata.soton.ac.uk/flood.html
Fig. 5. Interaction diagram showing how data is integrated across heterogeneous data
sources. The Web application issues a GET request to an application service, which calls
SPARQLExecuteFactory on the integrator; the integrator calls GenericQueryFactory on the
DQP service, which in turn issues SQLExecute against CCO-Stored and GetStreamItem
calls against CCO-WS. Results are returned as WebRowSet, SPARQLResultSet, and finally
JSON. Operations below the dotted line are repeated periodically.
from the available sensor networks for the region are juxtaposed on top. The val-
ues are displayed as red circles: the larger the circle, the higher the wave value
measured.
Identifying an overtopping event requires data from heterogeneous
sources with different schemas and data modalities, viz. stored and sensed, to be
integrated. The required orchestration is depicted in Fig. 5 which shows a web
application retrieving data through an integrator that exposes an ontological
view of the source data. Note that the orchestration assumes that the integrated
resource has already been created, i.e. the mapping document relating the data
sources to the ontological view has already been passed to the integrator. The
web application supports the user in discovering potential data services for de-
tecting overtopping events based on the contents of the property documents
stored by the registry service (not shown in the orchestration in Fig. 5). The
web application then supports the user in characterising an overtopping event
as a query over the ontological view, hiding all the complexities of the required
orchestration. The web application uses a restful interface offered by an appli-
cation service to pass the query as a service call to the integrator. The integrator
translates the query over the ontological view into a query expressed in terms of
the source schemas. The integrator instantiates a distributed query processing
service (dqp) to evaluate the query over the sources. As the query is evalu-
ated, answers are periodically retrieved through the interactions shown below
the dotted line in Fig. 5. The rate at which the dqp service polls its sources is
controlled by the rate declared in the property document of each source. Simi-
larly, the rates at which the integrator and the application poll their respective
source is controlled by the property document declarations.
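As an illustration of this rate-controlled polling, the following Python sketch, which is not the actual dqp implementation, polls each source at the rate declared in its property document; get_stream_item is a hypothetical stand-in for the GetStreamItem operation of Fig. 5 and is supplied by the caller.

import time

def poll_sources(sources, get_stream_item):
    # Each source is a dict holding a stream identifier, the poll rate declared
    # in its property document (in seconds), and a position cursor into the stream.
    while True:
        now = time.time()
        for src in sources:
            if now - src.get("last_poll", 0.0) >= src["poll_rate_s"]:
                items = get_stream_item(src["stream"], src["position"])
                src["position"] += len(items)
                src["last_poll"] = now
                yield src["stream"], items
        time.sleep(0.1)  # avoid busy-waiting between polling rounds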
6 Related Work
We describe related work in its ability to satisfy the requirements identified in
Section 2.
The Open Geospatial Consortium Sensor Web Enablement (ogc-swe) [3] de-
fines a set of xml data models for describing sensors and their data, as well as
a set of web services for publishing sensor data, locating data, and receiving
alerts about the data. A reference implementation of the framework was devel-
oped in the sany project [16]. The framework can be seen to satisfy R4, and
provides support for satisfying R5. However, data access patterns are limited by
the service interfaces and there is no support for declarative query languages.
As such, it does not fully satisfy R1. Data must be published according to the
ogc-swe xml data models, which is not always possible with autonomous data
sources; thus, R3 is not satisfied. ogc-swe does not fully meet R2: there is support for
unmediated merging of sensor and stored data but not for correlating it. We note
that the GetCapabilities operation provided by the services provides support for
the functional properties in our property documents but not the spatiotemporal
or thematic properties. Henson et al. [10] have extended the sensor observa-
tion service by semantically annotating the data. Our proposal goes beyond this
by using semantics to support the discovery, integration, and mashup of data
stemming from autonomous heterogeneous data sources.
Global Sensor Network (gsn) [1] is a middleware platform for the deployment
and programming of sensor networks. It allows for the abstraction of sensor
nodes as data sources irrespective of the underlying hardware and provides query
processing capabilities within the middleware. It enables a data-oriented view of
sensor networks and the processing of queries over that data. gsn satisfies R1
and R2 provided that the data is all published in the same data model. It does
not satisfy the other requirements.
Collaborative Oceanography [17] used semantic annotations to support the
reuse of oceanographic data. Their approach relied on a centralised triple store
containing the annotations and the manual mashup of data based on these an-
notations. Our approach provides support for semantic integration and mashup
of heterogeneous data sources.
7 Conclusions
We have presented a service architecture for providing support to semantic sensor
web applications. The architecture provides a semantically integrated informa-
tion space for sensed and stored data drawn from heterogeneous autonomous
data sources. The architecture enables rapid development of thin applications
(mashups) over this information space through the use of (i) declarative queries
to describe the data need, both for locating data based on its spatiotemporal
and thematic coverage, and for integrating and accessing data, and (ii) seman-
tically annotated property documents which support well-informed interactions
between the architecture services.
Five high-level requirements, considered to be relevant for a broad range of
applications, were identified from the application use case presented.
The next steps for the implementation of our architecture are to provide services
which can push data from the sources through the architecture, and to provide
mechanisms for interacting with existing infrastructures such as ogc-swe. For
future work we intend to investigate offering configurable mechanisms for sup-
porting rest interfaces to integrated information spaces. We will also perform
a user evaluation with coastal managers from the Solent region.
References
1. Aberer, K., Hauswirth, M., Salehi, A.: Infrastructure for data processing in large-
scale interconnected sensor networks. In: 8th International Conference on Mobile
Data Management (MDM 2007), pp. 198–205 (2007)
2. Antonioletti, M., Krause, A., Paton, N.W., Eisenberg, A., Laws, S., Malaika, S.,
Melton, J., Pearson, D.: The WS-DAI family of specifications for web service data
access and integration. SIGMOD Record 35(1), 48–55 (2006)
3. Botts, M., Percivall, G., Reed, C., Davidson, J.: OGC® sensor web enablement:
Overview and high level architecture. In: Nittel, S., Labrinidis, A., Stefanidis, A.
(eds.) GSN 2006. LNCS, vol. 4540, pp. 175–190. Springer, Heidelberg (2008)
4. Calbimonte, J.-P., Corcho, O., Gray, A.J.G.: Enabling ontology-based access to
streaming data sources. In: Patel-Schneider, P.F., Pan, Y., Hitzler, P., Mika, P.,
Zhang, L., Pan, J.Z., Horrocks, I., Glimm, B. (eds.) ISWC 2010, Part I. LNCS,
vol. 6496, pp. 96–111. Springer, Heidelberg (2010)
5. de la Beaujardiere, J.: OpenGIS® web map server implementation specification.
Standard Specification 06-042, Open Geospatial Consortium Inc. (2006)
6. Fielding, R.T.: Architectural Styles and the Design of Network-based Software
Architectures. Ph.D. thesis, Information and Computer Science, University of Cal-
ifornia, Irvine, California, USA (2000)
7. Galpin, I., Brenninkmeijer, C.Y.A., Gray, A.J.G., Jabeen, F., Fernandes, A.A.A.,
Paton, N.W.: SNEE: a query processor for wireless sensor networks. Distributed
and Parallel Databases 29(1-2), 31–85 (2010)
8. Gray, A.J.G., Galpin, I., Fernandes, A.A.A., Paton, N.W., Page, K., Sadler, J.,
Kyzirakos, K., Koubarakis, M., Calbimonte, J.P., Garcia, R., Corcho, O., Ga-
baldón, J.E., Aparicio, J.J.: SemSorGrid4Env architecture – phase II. Deliverable
D1.3v2, SemSorGrid4Env (December 2010),
http://www.semsorgrid4env.eu/files/deliverables/wp1/D1.3v2.pdf
9. Hart, J.K., Martinez, K.: Environmental sensor networks: A revolution in earth
system science? Earth Science Reviews 78, 177–191 (2006)
10. Henson, C., Pschorr, J., Sheth, A.P., Thirunarayan, K.: SemSOS: Semantic sensor
observation service. In: International Symposium on Collaborative Technologies
and Systems, CTS 2009 (2009)
11. Geographic information – services. International Standard ISO19119:2005, ISO
(2005)
12. Koubarakis, M., Kyzirakos, K.: Modeling and querying metadata in the semantic
sensor web: The model stRDF and the query language stSPARQL. In: Aroyo, L.,
Antoniou, G., Hyvönen, E., ten Teije, A., Stuckenschmidt, H., Cabral, L., Tudo-
rache, T. (eds.) ESWC 2010. LNCS, vol. 6088, pp. 425–439. Springer, Heidelberg
(2010)
13. Page, K., De Roure, D.C., Martinez, K., Sadler, J., Kit, O.: Linked sensor data:
RESTfully serving RDF and GML. In: International Workshop on Semantic Sensor
Networks, pp. 49–63 (2009)
14. Pérez de Laborda, C., Conrad, S.: Relational.OWL: a data and schema repre-
sentation format based on OWL. In: 2nd Asia-Pacific Conference on Conceptual
Modelling (APCCM 2005), Newcastle, Australia, pp. 89–96 (2005)
15. Raskin, R.G., Pan, M.J.: Knowledge representation in the semantic web for earth
and environmental terminology (SWEET). Computers and Geosciences 31(9),
1119–1125 (2005)
16. Schimak, G., Havlik, D.: Sensors anywhere – sensor web enablement in risk man-
agement applications. ERCIM News 76, 40–41 (2009)
17. Tao, F., Campbell, J., Pagnani, M., Griffiths, G.: Collaborative ocean resource in-
teroperability: Multi-use of ocean data on the semantic web. In: Aroyo, L., Traverso,
P., Ciravegna, F., Cimiano, P., Heath, T., Hyvönen, E., Mizoguchi, R., Oren, E.,
Sabou, M., Simperl, E. (eds.) ESWC 2009. LNCS, vol. 5554, pp. 753–767. Springer,
Heidelberg (2009)
Zhi# – OWL Aware Compilation
1 Introduction
A common difficulty with widely used OWL APIs and with the usage of wrapper
classes to represent entities of an ontology is the different conceptual basis of
types and instances in a programming language compared with classes, properties,
individuals, and XML Schema Definition [4] data type values in OWL DL. In
particular, the Web Ontology Language reveals the following major differences
to object-oriented programming languages and database management systems:
only elementary examples of an integrated use of XSD data types and ontological
class descriptions in Zhi# are presented. The Zhi# programming language is
implemented by a compiler framework [16] that is – by means of plug-ins –
extensible with external type systems. Detailed descriptions of the compiler
framework and the XSD and OWL plug-ins can be found in [17]. Zhi# programs
are compiled into conventional C# and are interoperable with .NET assemblies.
The Zhi# approach is distinguished by a combination of features that is targeted
to make ontologies available in an object-oriented programming language using
conventional object-oriented notation.
In contrast to naïve approaches that are based on the generation of wrapper
classes for XSD and OWL types, no code generation in form of an additional
class hierarchy is required in Zhi#. Instead, ontologies are integrated into the
programming language, which facilitates OWL aware compilation including type
checking on the ontology level. At runtime, the results of ontological reasoning
influence the execution of Zhi# programs: Zhi# programs don’t just execute,
they reason. The underlying ontology management system can be substituted
without recompilation of Zhi# programs. The Zhi# programming language pro-
vides full support for XSD data types. Thus, Zhi# can compensate for datatype
agnostic OWL APIs. Zhi# programs can be used concurrently with API-based
knowledge base clients to allow for a smooth migration of an existing code-base.
extensibility with respect to external type systems, are the following: External
types (i.e. XSD data types and OWL class descriptions) can be included using
the keyword import, which works analogously for external types like the C#
using keyword for .NET programming language type definitions. It permits the
use of external types in a Zhi# namespace so that one does not have to qualify
the use of a type in that namespace. An import directive can be used in all places
where a using directive is permissible. As shown below, the import keyword is
followed by a type system evidence, which specifies the external type system (i.e.
XSD or OWL). Like using directives, import directives do not provide access to
any nested namespaces.
import type_system_evidence alias = external_namespace;
In Zhi# program text that follows an arbitrary number of import directives,
external type and property references must be fully qualified using an alias that is
bound to the namespace in which the external type is defined. Type and property
references have the syntactic form #alias#local_name (both the namespace
alias and the local name must be preceded by a ’#’-symbol).
External types can be used in Zhi# programs in all places where .NET types
are admissible except for type declarations (i.e. external types can only be im-
ported but not declared in Zhi# programs). For example, methods can be over-
ridden using external types, user defined operators can have external input and
output parameters, and arithmetic and logical expressions can be built up us-
ing external objects. Because Zhi#’s support for external types is a language
feature and not (yet) a feature of the runtime, similar restrictions to the usage
of external types apply as for generic type definitions in the Java programming
language (e.g., methods cannot be overloaded based on external types from the
same type system at the same position in the method signature).
In Zhi# programs, types of different type systems can cooperatively be used
in one single statement. As shown in line 5 in the following code snippet, the
.NET System.Int32 variable age can be assigned the XSD data type value of the
OWL datatype property hasAge of the ontological individual Alice.
1 import OWL chil = http://chil.server.de;
2 class C {
3   public static void Main() {
4     #chil#Person alice = new #chil#Person("#chil#Alice");
5     int age = alice.#chil#hasAge;
6   }
7 }
Syntax checks. The most fundamental compile-time feature that Zhi# provides
for OWL is checking the existence of referenced ontology elements in the im-
ported terminology. The C# statements below declare the ontological individu-
als a and b. Individual b is added as a property value for property R of individual
a. For the sake of brevity, in this work, the URI fragment identifier “#” may
be used to indicate ontology elements in Zhi# programs instead of using fully-
qualified names. The object o shall be an instance of an arbitrary OWL API.
The given code is a well-typed C# program. It may, however, fail at runtime if
the classes A and B and the property R do not exist in the TBox of the referenced
ontology.
1 IOWLAPI o = [...];
2 o.addIndividual("#a", "#A");
3 o.addIndividual("#b", "#B");
4 o.addObjectPropertyValue("#a", "#R", "#b");
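The kind of check that Zhi# performs at compile time can be pictured as a lookup of the referenced names in the imported terminology. The following Python sketch, using rdflib rather than the Zhi# compiler plug-in, performs an analogous existence check over a TBox; it only inspects asserted owl:Class and owl:ObjectProperty declarations, so it is a simplification of what the compiler does.

from rdflib import Graph, RDF, OWL, URIRef

def missing_references(tbox_file, class_uris, property_uris):
    g = Graph()
    g.parse(tbox_file)  # serialisation format is guessed from the file extension
    missing = [c for c in class_uris
               if (URIRef(c), RDF.type, OWL.Class) not in g]
    missing += [p for p in property_uris
                if (URIRef(p), RDF.type, OWL.ObjectProperty) not in g]
    return missing  # a non-empty result signals the runtime failure described above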
Creation of individuals. In C#, the new operator can be used to create objects
on the heap and to invoke constructors. In Zhi#, the new operator can also be
used to return ontological individuals in a knowledge base as follows.
Zhi# provides a constructor for OWL class instances that takes the URI of
the individual. As in conventional C#, the new operator cannot be overloaded.
In contrast to .NET objects, ontological individuals are not created on the heap
but in the shared ontological knowledge base, and as such they are subject
to ontological reasoning. This is also in contrast to naïve approaches where
wrapper classes for ontological classes are instantiated as plain .NET objects.
Zhi# programs use handles to the actual individuals in the shared ontological
knowledge base. Also note that an existing individual in the ontology with the
same URI is reused, following Semantic Web standards. As for assignments of
.NET object creation expressions to variables or fields, the type of the individual
creation expression must be subsumed by the type of the lvalue based on the
class hierarchy (see line 2 in the code snippet above). Zhi# supports covariant
coercions for ontological individuals and arrays of ontological individuals.
Disjoint classes. In OWL DL, classes can be stated to be disjoint from each
other using the owl:disjointWith constructor. It guarantees that an individual
that is a member of one class cannot simultaneously be a member of the other
class. In the following code snippet, the Zhi# compiler reports an error in line
2 for the disjoint classes MeetingRoom and LargeRoom.
Disjoint XSD data types. In Zhi#, a “frame-like” view on OWL object proper-
ties is provided by the checked operator used in conjunction with assignments
to OWL object properties (see Section 2.2). For assignments to OWL datatype
properties in Zhi# programs, the “frame-like” composite view is the default be-
havior. The data type of the property value must be a subtype of the datatype
property range restriction. The following assignment in line 2 fails to type-check
for an OWL datatype property hasCapacity with domain MeetingRoom and
range xsd#byte because in Zhi# programs the literal 23.5 is interpreted as a
.NET floating point value (i.e. xsd#double), which is disjoint with the primitive
base type of xsd#byte (i.e. xsd#decimal ).
1 #MeetingRoom r = [...];
2 r.#hasCapacity = 23.5;
The XSD compiler plug-in allows for downcasting objects to compatible XSD
data types (i.e. XSD types that are derived from the same primitive base type).
The assignment in line 3 in the following Zhi# program is validated by a down-
cast. In general, this may lead to an InvalidCastException at runtime, which
prevents OWL datatype properties from being assigned invalid property values.
1 int i = [...];
2 #MeetingRoom r = [...];
3 r.#hasCapacity = (xsd#byte) i;
Properties. Erik Meijer and Peter Drayton note that “at the moment that you
define a [programming language] class Person you have to have the divine insight
to define all possible relationships that a person can have with any other possible
object or keep type open” [13]. Ontology engineers do not need to make early
commitments about all possible relationships that instances of a class may have.
In Zhi# programs, ontological individuals can be related to other individuals
and XSD data type values using an object-oriented notation. In contrast to au-
thoritative type declarations of class members in statically typed object-oriented
programming languages, domain and range declarations of OWL object proper-
ties are used to infer the types of the subject (i.e. host object) and object (i.e.
property value). Hence, the types of the related individuals do not necessarily
need to be subsumed by the domain and range declarations of the used object
property before the statement. The only requirement here is that the related
individuals are not declared to be instances of classes disjoint to the declared
domain and range. In the following Zhi# program, the ontological individuals
referred to by e and l are inferred to be not only an Event and a Location but
also a Lecture and a LargeRoom, respectively.
1 #Event e = [...];                 // e refers to e, e:Event
2 #Location l = [...];              // l refers to l, l:Location
3 e.#takesPlaceInAuditorium = l;    // e:Lecture, l:LargeRoom
Both for OWL object and non-functional OWL datatype properties the prop-
erty assignment semantics in Zhi# are additive. The following assignment state-
ment adds the individual referred to by b as a value for property R of the
individual referred to by a; it does not remove existing triples in the ontology.
a.#R = b;
Correspondingly, property access expressions yield arrays of individuals and
arrays of XSD data type values for OWL object properties and non-functional
OWL datatype properties, respectively, since an individual may be related to
more than one property value. Accordingly, the type of OWL object property
and non-functional OWL datatype property access expressions in Zhi# is always
an array type, where the base type is the range declaration of the property.
The type of an assignment to an OWL object property and a non-functional
OWL datatype property is always an array type, too. This behavior is slightly dif-
ferent from the typical typing assumptions in programming languages. Because
the assignment operator (=) cannot be overloaded in .NET, after an assignment
of the form x = y = z all three objects can be considered equal based on the
applicable kind of equivalence (i.e. reference and value equality). The same is
not always true for assignments to OWL properties considering the array ranks
of the types of the involved objects. In the following cascaded assignment ex-
pression, the static type of the expression b.#R = c is Array Range(R) because
individual b may be related by property R to more individuals than only c.
As a result, with the following assignment in Zhi#, individual a is related by
property R to all individuals that are related to individual b by property R.
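The additive semantics can be pictured at the triple level. The following illustrative Python sketch uses rdflib rather than the Zhi# runtime: each assignment adds a triple without removing existing ones, and a property access yields all related individuals.

from rdflib import Graph, URIRef

g = Graph()
a, b, c, R = (URIRef("http://example.org#" + n) for n in ("a", "b", "c", "R"))

g.add((a, R, b))  # corresponds to a.#R = b; existing triples are not removed
g.add((a, R, c))  # corresponds to a.#R = c; the assignment is additive

values = set(g.objects(a, R))  # analogous to the array-typed expression a.#R
assert values == {b, c}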
Ontological equality. In Zhi#, the equality operator (==) can be used to deter-
mine if two ontological individuals are identical (i.e. refer to the same entity in
the described world). The inequality operator (!=) returns true if two individu-
als are known (either explicitly or implicitly) to be not identical. Note that the
As a second example, for static references of OWL classes the auxiliary prop-
erties Exists, Count, and Individuals are defined. The Exists property yields
true if individuals of the given type exist in the ontology, otherwise false. Count
returns the number of individuals in the extension of the specified class de-
scription. Individuals yields an array of defined individuals of the given type.
The Individuals property is generic with respect to the static type reference on
which it is invoked. In the following array definition, the type of the property
access expression #Person.Individuals is Array Person (and not Array Thing).
Accordingly, it can be assigned to variable persons of type Array Person.
#Person[] persons = #Person.Individuals;
Runtime type checks. Reasoning is used to infer the classes an individual belongs
to. This corresponds to the use of the instanceof and is-operator in Java and C#,
respectively. In Zhi#, the is-operator is used to determine whether an individual
is in the extension of a particular class description. The use of the is-operator is
completely statically type-checked both on the programming language and the
ontology level. For example, the Zhi# compiler will detect if an individual will
never be included by a class description that is disjoint with its asserted type.
See the Zhi# program in Section 3 for an exemplary use of the is-operator.
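To illustrate what such a runtime type check amounts to at the ontology level, the following Python sketch using rdflib tests whether an individual falls in the extension of a class, considering only asserted types and the rdfs:subClassOf hierarchy; a DL reasoner, as used by Zhi#'s underlying ontology management system, would additionally take inferred types into account.

from rdflib import Graph, RDF, RDFS, URIRef

def is_instance_of(g: Graph, individual: URIRef, cls: URIRef) -> bool:
    for asserted in g.objects(individual, RDF.type):
        # transitive_objects yields the asserted class itself plus all its superclasses
        if cls in g.transitive_objects(asserted, RDFS.subClassOf):
            return True
    return False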
XSD and OWL were integrated with the Zhi# programming language simi-
larly to generic types in Java. XSD data types and OWL DL class descriptions
in Zhi# programs are subject to type substitution where references of external
types are translated into 1) a constant set of proxy classes and 2) function calls
on the Zhi# runtime library and its extensions for XSD and OWL. Detailed
explanations how Zhi# programs are compiled into conventional C# are given
in [17].
3 Validation
The Zhi# compiler was regression tested with 12 KLOC of Zhi# test code, which
was inductively constructed based on the Zhi# language grammar, and 9 KLOC
of typing information to regression test the semantic analysis of Zhi# programs.
The ease of use of ontological class descriptions and property declarations in
Zhi# is illustrated in [17] by contradistinction with C#-based implementations
of “ontological” behavior for .NET objects. The Zhi# approach facilitates the
use of readily available ontology management systems compared to handcrafted
reasoning code.
Examples for the advantage of OWL aware compilation in the Zhi# program-
ming language over an API-based use of ontology management systems can be
shown based on the following programming tasks, which are all frequent for
ontology-based applications. Assume the following TBox.
In line 1 in the Zhi# program shown below, the given TBox, which is de-
fined in the http://www.zhimantic.com/eval namespace, is imported into the
Zhi# compile unit. In line 2, XML Schema Definition’s built-in data types are
imported. In lines 4–9, ontological individuals are created. In line 9, the fully
qualified name of individual c is inferred from the containing namespace of the
named class description C. In line 10, the RDF triple [a R o] is declared in
the ontology. Note how Zhi# facilitates the declaration of ad hoc relationships
(instead of enforcing a frame-like view, where Thing a would not have a slot
R). In line 11, a foreach-loop is used to iterate over all values of the auxiliary
Types property of individual o. Note, in line 12, how ontological individuals are
implicitly convertible to .NET strings. In line 15, the is-operator is used to dy-
namically check the RDF type of individual o. Be aware that the type check is
performed on the ontology level. In line 16, a foreach-loop iterates over all values
of the auxiliary Individuals property of the static class reference B. Note that
the Individuals property is generic with respect to the static class reference on
Java code using the Jena Semantic Web Framework for the same given pro-
gramming tasks is available online3 . It can be seen that the Zhi# code shown
above treats the OWL terminology as first-class citizens of the program code,
and is thus not only inherently less error-prone but can also be checked at com-
pile time. Zhi#’s inherent support for ontologies facilitates type checking on the
ontology level, which is completely unavailable if OWL APIs are used.
Finally, we mapped the CHIL OWL API [8,16,17] on auxiliary properties and
methods of ontology entities in Zhi# programs. The functionality of 50 of the
3 http://sourceforge.net/p/zhisharp/wiki/Examples of Use/
4 Related Work
A major disadvantage of using an OWL API compared to, for example, Java-
based domain models is the lack of type checking for ontological individuals.
This lack of compile-time support has led to the development of code generation
tools such as the Ontology Bean Generator [18] for the Java Agent Development
Framework [22], which generates proxy classes in order to represent elements
of an ontology. Similarly, Kalyanpur et al. [9] devised an automatic mapping
of particular elements of an OWL ontology to Java code. Although carefully
engineered, the main shortcomings of this implementation are the blown-up Java
class hierarchy and the lack of a concurrently accessible ontological knowledge
base at runtime (i.e. the “knowledge base” is only available in one particular Java
virtual machine in the form of instances of automatically generated Java classes).
This separation of the ontology definition from the reasoning engine results in
a lack of available ABox reasoning (e.g., type inference based on nominals).
The two latter problems were circumvented by the RDFReactor approach [25]
where a Java API for processing RDF data is automatically generated from an
RDF schema. However, RDFReactor only provides a frame-like view of OWL
ontologies whereas Zhi# allows for full-fledged ontological reasoning.
In stark contrast to these systems, the Zhi# programming language syntac-
tically integrates OWL classes and properties with the C# programming lan-
guage using conventional object-oriented notation. Also, Zhi# provides static
type checking for atomic XSD data types, which may be the range of OWL
datatype properties, while many ontology management systems – not to men-
tion the above approaches – simply discard range restrictions of OWL datatype
properties. A combination of static typing and dynamic checking is used for on-
tological class descriptions. In contrast to static type checking that is based on
generated proxy classes, Zhi#’s OWL compiler plug-in adheres to disjoint class
descriptions and copes well with multiple inheritance.
Koide and Takeda [11] implemented an OWL reasoner for the FL0 Description
Logic on top of the Common Lisp Object System [5] by means of the Meta-Object
Protocol [10]. Their implementation of the used structural subsumption algo-
rithm [2] is described, however, to yield only incomplete results. The integration
of OWL with the Python programming language was suggested by Vrandečić
and implemented by Babik and Hluchy [3], who used metaclass programming
to embed OWL class and property descriptions with Python. Their approach,
however, offers mainly a syntactic integration in the form of LISP-like macros. Also,
their prototypical implementation does not support namespaces and open world
semantics.
The representation and the type checking of ontological individuals in Zhi#
is similar to the type Dynamic, which was introduced by Abadi et al. [1]. Values
of type Dynamic are pairs of a value v and a type tag T, where v has the
type denoted by T. The result of evaluating the expression dynamic e:T is a
pair of a value v and a type tag T, where v is the result of evaluating e. The
expression dynamic e:T has type Dynamic if e has type T. Zhi#’s dynamic
type checking of ontological individuals corresponds to the typecase construct as
proposed by Abadi et al. in order to inspect the type tag of a given Dynamic.
In Zhi# source programs, the use of OWL class names corresponds to explicit
dynamic constructs. In compiled Zhi# code, invocations of the AssertKindOf
method of the Zhi# runtime correspond to explicit typecase constructs.
Thatte described a “quasi-static” type system [23], where explicit dynamic
and typecase constructs are replaced by implicit coercions and runtime checks.
As in Thatte’s work, Zhi#’s dynamic typing for OWL detects errors as early
as possible to make it easy to find the programming error that led to the type
error. Abadi et al. and Thatte’s dynamic types were only embedded with a
simple λ-calculus. The same is true for recent gradual typing proposals [20].
Tobin-Hochstadt and Felleisen developed the notion of occurrence typing and
implemented a Typed Scheme [24]. Occurrence typing assigns distinct subtypes
of a parameter to distinct occurrences, depending on the control flow of the
program. Such distinctions are not made by Zhi#’s OWL compiler plug-in since
it is hard to imagine that appropriate subtypes can be computed considering
complex OWL class descriptions.
5 Conclusion
The Zhi# programming language makes the property-centric modeling features
of the Web Ontology Language available via C#’s object-oriented notation (i.e.
normal member access). The power of the “.” can be used to declare ad hoc rela-
tionships between ontological individuals on a per instance basis. Zhi#’s OWL
aware compilation integrates value space-based subtyping of XML Schema Def-
inition and ontological classification with features of the programming language
such as method overriding, user-defined operators, and runtime type checks. The
Zhi# programming language is implemented by an extensible compiler frame-
work, which is tailored to facilitate the integration of external classifier and
reasoner components with the type checking of Zhi# programs. The compiler
was written in C# 3.0 and integrated with the MSBuild build system for Mi-
crosoft Visual Studio. An Eclipse-based frontend was developed including an
editor with syntax highlighting and autocompletion. The complete Zhi# tool
suite totals 110 C# KLOC and 35 Java KLOC. Zhi# is available online4 .
4 http://zhisharp.sourceforge.net
Zhi# offers a combination of static typing and dynamic checking for ontolog-
ical class descriptions. Ontological reasoning directly influences the execution of
programs: Zhi# programs don’t just execute, they reason. Thus, the develop-
ment of intelligent applications is facilitated. In contrast to many OWL APIs,
Zhi# contains extensive support for XSD data types. Zhi# code that uses ele-
ments of an ontology is compiled into conventional C#. All functionality related
to the use of ontologies is provided in a “pay-as-you-go” manner. The underlying
ontology management system can be substituted in the Zhi# runtime library
without recompilation of Zhi# programs.
Future work will include the transformation of Ontology Definition Metamod-
els [15] into Zhi# programs. With ontological class descriptions being first-class
citizens the complete MOF [14] modeling space can be translated into the Zhi#
programming language. We further plan to investigate the interplay of closed
world semantics in an ontology with autoepistemic features (e.g., the epistemo-
logical K-operator) with the static typing in Zhi#.
The Zhi# solution to provide programming language inherent support for
ontologies is the first of its kind. Earlier attempts either lack ABox reason-
ing, concurrent access to a shared ontological knowledge base, or fall short in
fully supporting OWL DL’s modeling features. In recent years, numerous pub-
lications described the – apparently relevant – OWL-OO integration problem.
However, the plethora of naïve code generation approaches and contrived hy-
brid methodologies all turned out to not solve the problem in its entirety. This
work demonstrates that OWL DL ontologies can be natively integrated into a
general-purpose programming language. The Zhi# compiler infrastructure has
proven to be a viable approach to solving the OWL-OO integration problem.
References
1. Abadi, M., Cardelli, L., Pierce, B., Plotkin, G.: Dynamic typing in a statically-
typed language. ACM Transactions on Programming Languages and Systems
(TOPLAS) 13(2), 237–268 (1991)
2. Baader, F., Calvanese, D., McGuiness, D., Nardi, D., Patel-Schneider, P.F.: The
Description Logic Handbook. Cambridge University Press, Cambridge (2003)
3. Babik, M., Hluchy, L.: Deep integration of Python with Web Ontology Language.
In: Bizer, C., Auer, S., Miller, L. (eds.) 2nd Workshop on Scripting for the Semantic
Web (SFSW) CEUR Workshop Proceedings, vol. 183 (June 2006)
4. Biron, P.V., Malhotra, A.: XML Schema Part 2: Datatypes Second Edition. Tech-
nical report, World Wide Web Consortium (W3C) (October 2004)
5. Demichiel, L.G., Gabriel, R.P.: The Common Lisp Object System: An Overview.
In: Bézivin, J., Hullot, J.-M., Lieberman, H., Cointe, P. (eds.) ECOOP 1987. LNCS,
vol. 276, pp. 151–170. Springer, Heidelberg (1987)
6. Hejlsberg, A., Wiltamuth, S., Golde, P.: C# language specification version 1.0.
Technical report, ECMA International (2002)
7. HP Labs: Jena Semantic Web Framework (2004)
8. Information Society Technology integrated project 506909. Computers in the Hu-
man Interaction Loop, CHIL (2004)
9. Kalyanpur, A., Pastor, D.J., Battle, S., Padget, J.A.: Automatic mapping of OWL
ontologies into Java. In: Maurer, F., Ruhe, G. (eds.) 16th International Conference
on Software Engineering and Knowledge Engineering (SEKE), pp. 98–103 (June
2004)
10. Kiczales, G., de Rivières, J., Bobrow, D.G.: The Art of the Metaobject Protocol.
MIT Press, Cambridge (1991)
11. Koide, S., Takeda, H.: OWL Full reasoning from an object-oriented perspective.
In: Mizoguchi, R., Shi, Z.-Z., Giunchiglia, F. (eds.) ASWC 2006. LNCS, vol. 4185,
pp. 263–277. Springer, Heidelberg (2006)
12. McGuinness, D.L., van Harmelen, F.: OWL Web Ontology Language Overview.
Technical report, World Wide Web Consortium (W3C) (February 2004)
13. Meijer, E., Drayton, P.: Static typing where possible, dynamic typing when needed.
In: OOPSLA Workshop on Revival of Dynamic Languages (2004)
14. Object Management Group (OMG). MetaObject Facility (August 2005)
15. Object Management Group (OMG). Ontology Definition Metamodel (2005)
16. Paar, A.: Zhi# – programming language inherent support for ontologies. In: Favre,
J.-M., Gasevic, D., Lämmel, R., Winter, A. (eds.) ateM 2007: Proceedings of the 4th
International Workshop on Software Language Engineering, Mainzer Informatik-
Berichte, Mainz, Germany, pp. 165–181. Johannes Gutenberg Universität Mainz,
Nashville (2007)
17. Paar, A.: Zhi# – Programming Language Inherent Support for Ontologies. PhD
thesis, Universität Karlsruhe (TH), Am Fasanengarten 5, 76137 Karlsruhe, Ger-
many (July 2009),
http://digbib.ubka.uni-karlsruhe.de/volltexte/1000019039
18. Protégé Wiki: Ontology Bean Generator (2007)
19. Puleston, C., Parsia, B., Cunningham, J., Rector, A.L.: Integrating object-oriented
and ontological representations: A case study in Java and OWL. In: Sheth, A.P.,
Staab, S., Dean, M., Paolucci, M., Maynard, D., Finin, T., Thirunarayan, K. (eds.)
ISWC 2008. LNCS, vol. 5318, pp. 130–145. Springer, Heidelberg (2008)
20. Siek, J.G., Taha, W.: Gradual typing for functional languages. In: Bailey, M.W.
(ed.) 7th Workshop on Scheme and Functional Programming (Scheme) ACM SIG-
PLAN Notices, vol. 41. ACM Press, New York (2006)
21. Stanford University School of Medicine. Protégé knowledge acquisition system
(2006)
22. Telecom Italia. Java Agent Development Framework, JADE (2007)
23. Thatte, S.R.: Quasi-static typing. In: 17th ACM SIGPLAN-SIGACT Symposium
on Principles of Programming Languages (POPL), pp. 367–381. ACM Press, New
York (1990)
24. Tobin-Hochstadt, S., Felleisen, M.: The design and implementation of typed
Scheme. ACM SIGPLAN Notices 43(1), 395–406 (2008)
25. Völkel, M.: RDFReactor – from ontologies to programmatic data access. In: 1st
Jena User Conference (JUC). HP Bristol (May 2006)
Lightweight Semantic Annotation of Geospatial
RESTful Services
Abstract. RESTful services are increasingly gaining traction over WS-* ones.
As with WS-* services, their semantic annotation can provide benefits in tasks
related to their discovery, composition and mediation. In this paper we present
an approach to automate the semantic annotation of RESTful services using a
cross-domain ontology like DBpedia, domain ontologies like GeoNames, and
additional external resources (suggestion and synonym services). We also
present a preliminary evaluation in the geospatial domain that proves the
feasibility of our approach in a domain where RESTful services are increasingly
appearing and highlights that it is possible to carry out this semantic annotation
with satisfactory results.
1 Introduction
In recent years, since the advent of Web 2.0 applications and given some of the
limitations of “classical” Web services based on SOAP and WSDL, Representational
State Transfer (REST) services have become an increasingly widespread phenomenon. Machine-
oriented Web applications and APIs that are conformant to the REST architectural
style [23], normally referred to as RESTful Web services, have started appearing
mainly due to their relative simplicity and their natural suitability for the Web.
However, using RESTful services still requires much human intervention since the
majority of their descriptions are given in the form of unstructured text in a Web page
(HTML), which contains a list of the available operations, their URIs and parameters
(also called attributes), expected output, error messages, and a set of examples of their
execution. This hampers the automatic discovery, interpretation and invocation of
these services, which may be required in the development of applications, without
extensive user involvement.
Traditionally, semantic annotation approaches for services have focused on
defining formalisms to describe them, and have been normally applied to WS-*
service description formalisms and middleware. More recently, these (usually
heavyweight) approaches have started to be adapted into a more lightweight manner
for the semantic description of RESTful services [1, 5, 8]. However, most of the
processes related to the annotation of RESTful services (e.g., [2, 11]) still require a
large amount of human intervention. First, humans have to understand the informal
descriptions provided in the RESTful service description pages, and then the semantic
annotation of RESTful services is done manually, with or without assistance.
In this paper, we address the challenge of automating the semantic annotation of
RESTful services by: (1) obtaining and formalising their syntactic descriptions, which
allows their registration and invocation, and (2) interpreting and semantically
enriching their parameters.
The main contribution of our work is the partial automation of the process of
semantic annotation of RESTful services using diverse types of resources: a cross-
domain ontology, DBpedia (combined with GeoNames in the specific case of
geospatial services), and diverse external services, such as suggestion and synonym
services.
The remainder of this paper is structured as follows: Section 2 presents related
work in the context of semantic annotation of WS-* and RESTful services. Section 3
introduces our approach for automating the annotation of RESTful services, including
explanations on how we derive their syntactic description and semantic annotation.
Section 4 presents the evaluation of our system in the context of these services from
the geospatial domain. Finally, Section 5 presents some conclusions of this paper and
identifies future lines of work.
2 Related Work
Most research in the semantic annotation of RESTful services has focused on the
definition of formal description languages for creating semantic annotations. The
main proposed formalisms for describing these services are: the Web Application
Description Language1 (WADL), which syntactically describes RESTful services and
the resources that they access; its semantic annotation extension [19]; MicroWSMO
[3], which uses hREST (HTML for RESTful services) [3, 5]; and SA-REST [2, 8],
which uses SAWSDL [1] and RDFa2 to describe service properties.
From a broader point of view, the work done in the state of the art on Semantic
Web Services (SWS) has mainly focused on WS-* services. OWL-S and WSMO are
approaches that use ontologies to describe services.
Some authors propose the adaptation of heavyweight WS-* approaches to describe
RESTful services. An example is proposed in [10], which makes use of OWL-S as the
base ontology for services, whereas WADL is used for syntactically describing them.
Then, the HTTP protocol is used for transferring messages, defining the action to be
executed, and also defining the execution scope. Finally, URI identifiers are
responsible for specifying the service interface.
Other approaches are more lightweight (e.g., [1, 2]). The authors advocate an
integrated lightweight approach for semantically describing RESTful services. This
approach is based on the use of the hREST and MicroWSMO microformats to facilitate
the annotation process. The SWEET tool [2] supports users in creating semantic
descriptions of RESTful services based on the aforementioned technologies. Unlike
1 http://www.w3.org/Submission/wadl/
2 http://www.w3.org/TR/xhtml-rdfa-primer/
this work, our approach is focused on automating this process, and could be well
integrated into this tool. Once the semantics of the RESTful service is obtained, this
could be represented in any of the existing semantic description approaches, such as
hREST, MicroWSMO, etc.
Finally, another approach for service description that focuses on automation, and
hence can be considered closer to our work, is presented in [17]. This approach
classifies service parameter datatypes with a Naïve Bayes classifier, treating HTML
Web form files as descriptions of the Web service's parameters.
Nowadays the largest online repository of information about Web 2.0 mashups and
APIs is ProgrammableWeb.com. This aggregator site provides information on 5,401
mashups and 2,390 APIs that were registered between September 2005 and
• Service 1. http://ws.geonames.org/countryInfo?country=ES
This service retrieves information related to a ‘country’. More specifically, it
returns information about the following parameters: ‘capital’, ‘population’, ‘area’
(km2), and ‘bounding box of mainland’ (excluding offshore islands). In the
specified URL, we retrieve information about Spain.
• Service 2. http://api.eventful.com/rest/venues/search?app_key=p4t8BFcLDtCzpxdS&location=Madrid
This service retrieves information about places (venues). More specifically, it
returns parameters like: ‘city’, ‘venue_name’, ‘region_name’, ‘country_name’,
‘latitude’, ‘longitude’, etc. In the specified URL, we retrieve information about
Madrid.
3 http://www.oasis-opencsa.org/sdo
The service invocation of a specific RESTful service may return diverse formats,
such as JSON, XML, etc. In our work we use any of these formats, although for
presentation purposes in this paper we will show how we handle XML responses. The
results of a sample invocation of the services that we presented in section 3.1 are
shown in Table 1.
Service 1

<geonames>
 <country>
  <countryCode>ES</countryCode>
  <countryName>Spain</countryName>
  <isoNumeric>724</isoNumeric>
  <isoAlpha3>ESP</isoAlpha3>
  <fipsCode>SP</fipsCode>
  <continent>EU</continent>
  <capital>Madrid</capital>
  <areaInSqKm>504782.0</areaInSqKm>
  <population>40491000</population>
  <currencyCode>EUR</currencyCode>
  <languages>es-ES,ca,gl,eu</languages>
  <geonameId>2510769</geonameId>
  <bBoxWest>-18.169641494751</bBoxWest>
  <bBoxNorth>43.791725</bBoxNorth>
  <bBoxEast>4.3153896</bBoxEast>
  <bBoxSouth>27.6388</bBoxSouth>
 </country>
</geonames>

Service 2

<venue id="V0-001-000154997-6">
 <url>http://eventful.com/madrid/venues/la-ancha-/V0-001-000154997-6</url>
 <country_name>Spain</country_name>
 <name>La Ancha</name>
 <venue_name>La Ancha</venue_name>
 <description></description>
 <venue_type>Restaurant</venue_type>
 <address></address>
 <city_name>Madrid</city_name>
 <region_name></region_name>
 <region_abbr></region_abbr>
 <postal_code></postal_code>
 <country_abbr2>ES</country_abbr2>
 <country_abbr>ESP</country_abbr>
 <longitude>-3.68333</longitude>
 <latitude>40.4</latitude>
 <geocode_type>City Based GeoCodes</geocode_type>
 <owner>frankg</owner>
 <timezone></timezone>
 <created></created>
 <event_count>0</event_count>
 <trackback_count>0</trackback_count>
 <comment_count>0</comment_count>
 <link_count>0</link_count>
 <image></image>
</venue>
<venue id="V0-001-000154998-5">
These XML responses are processed using SDO, which enables navigating
through the XML and extracting the output parameters of each service4. The result of this
invocation process is a syntactic definition of the RESTful service in XML, which can
be expressed in description languages like WADL or stored into a relational model.
Table 2 shows the different output parameters of each service, where we can observe
by manual inspection that there is some similarity between diverse parameters (e.g.,
countryName and country_name) and that they return similar values (Spain).
However, these parameters are written differently. These differences between
parameters are described and dealt with in sections 3.3.1 and 3.3.2.
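A minimal sketch of this invocation and extraction step, using Python's standard library rather than SDO, could look as follows; it keeps only leaf XML tags that carry a value, mirroring the restriction noted in footnote 4.

import urllib.request
import xml.etree.ElementTree as ET

def output_parameters(service_url):
    with urllib.request.urlopen(service_url) as response:
        root = ET.fromstring(response.read())
    params = {}
    for element in root.iter():
        # keep leaf elements that actually carry a text value
        if len(element) == 0 and element.text and element.text.strip():
            params[element.tag] = element.text.strip()
    return params

Applied to Service 1 (http://ws.geonames.org/countryInfo?country=ES), this would yield pairs such as countryName -> Spain and capital -> Madrid.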
With URL and input/output parameters, we generate a WADL description that can
be used as the input to the next process. Additionally, we register and store this
description into a repository using an object-oriented model. This repository is
implemented as a database that is specifically designed to store syntactic descriptions
of RESTful services and parameters’ values of invocations. We selected this storage
model in order to increase efficiency in the recovery of the RESTful services.
4 In the work reported here, we considered only XML tags with values.
Once the RESTful service is registered and the WADL description is generated,
our system invokes the service without associated parameters. For example:
On the other hand, our system also considers service URLs as http://www.foo.org/
weather/Madrid. These services belong to a specific RESTful entity and they are
always invoked with their associated parameters.
In this way, the system invokes the service for retrieving a collection of instances
(countries)5 related to the service. The results of this invocation are stored into the
object-oriented model. Thus, this process allows collecting additional information
about a service (output parameters and instances), which is registered in our system,
and retrieving it for future processes without the need to invoke the original service.
Service 1:
countryInfo($country,bBoxSouth,isoNumeric,continent,fipsCode,areaInSqKm,languages,iso
Alpha3,countryCode,bBoxNorth,population,bBoxWest,currencyCode,bBoxEast,capital,geo
nameId,countryName)
Service 2:
rest/venues/search($location,$app_key,id,link_count,page_count,longitude,trackback_count,
version,venue_type,owner,url,country_name,event_count,total_items,city_name,address,na
me,latitude,page_number,postal_code,country_abbr,first_item,page_items,last_item,page_si
ze,country_abbr2,comment_count,geocode_type,search_time,venue_name)
Some of the difficulties that arise in the semantic annotation of RESTful services are
briefly described in [1, 11]. In order to cope with them, we rely on techniques and
processes that permit: a) semantic annotation using only the syntactic description of
the services and their input/output parameters, or b) semantic annotation by
identifying a set of example values that allow the automatic invocation of the service.
The starting point of the semantic annotation process is the list of syntactic
parameters obtained previously (a WADL file or the model stored into a relational
database). Once the RESTful service is syntactically described with all its identified
input and output parameters, we proceed into its semantic annotation. We follow a
heuristic approach that combines a number of external services and semantic
resources to propose annotations for the parameters as shown in Figure 2. Next, we
describe the main components of the semantic annotation.
5 The results of this service invocation are available at http://delicias.dia.fi.upm.es/RESTfulAnnotationWeb/RESTService1/RESTservice1.xml
- Parameter. This class provides a list of all parameters (inputs and outputs)
collected from different services. Likewise, we search for additional information
for each parameter, such as suggestions and synonyms, for enriching the initial
description of parameters. The relation hasCollection relates Parameter
with DBpediaOntology. Every parameter can be related to any number of
DBpedia classes or properties (from 0 to N).
- Ontologies. This class contains the classes and properties of the DBpedia and GeoNames6 ontologies that are related to the parameters of each service. This class is related to the classes DBpediaInstance and GeonamesInstance.
6 http://www.geonames.org/
- First, the system retrieves all the classes from the DBpedia ontology whose names have a match with each parameter of the RESTful service. In this matching process we test two different techniques:
• On the one hand, our approach uses an exact match to compare the parameters of the RESTful service with the labels of the ontologies' classes and properties.
• On the other hand, our approach uses a combination of various similarity metrics (Jaro, JaroWinkler and Levenshtein metrics)7 to compare parameters with the labels of the elements of these ontologies. This allows matching parameters whose names differ slightly in spelling (e.g., countryName and country_name); a small sketch of both matching strategies is given after this list.
7 http://staffwww.dcs.shef.ac.uk/people/S.Chapman/stringmetrics.html
If the system obtains correspondences from the matching process, it uses these DBpedia concepts individually to retrieve samples (concept instances) from the DBpedia SPARQL Endpoint. Likewise, when a parameter matches an ontology class related to some geospatial information, such as latitude, longitude, or a bounding box, our system retrieves samples from the GeoNames SPARQL Endpoint (a query sketch follows after this list). The resulting information (RDF) is automatically proposed to the system and registered as a possible value for the corresponding parameter. When a parameter matches more than once in the DBpedia ontology, our system only considers those concepts that have information (instances), and automatically discards those ontology concepts without instances.
- Next, the system tries to find correspondences between the parameters of the RESTful service and ontology properties. If the system obtains some correspondences, it uses these DBpedia properties individually to retrieve information from the DBpedia or GeoNames SPARQL Endpoints, as described above. Furthermore, this information is registered as a possible correct value for the corresponding parameter.
- Finally, with the obtained classes and properties, the system calls the DBpedia
and GeoNames SPARQL Endpoints to retrieve values (instances) for those
classes and properties, so that now we have possible values for them.
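To make the two matching strategies concrete, the following sketch (ours, not the authors' implementation) compares parameter names against ontology labels using an exact match and a normalized Levenshtein similarity; the threshold value and the label list are illustrative assumptions.

# Sketch of the two matching strategies: exact match vs. string similarity.
# The threshold (0.8) and the label list are illustrative assumptions.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    a, b = a.lower().replace("_", ""), b.lower().replace("_", "")
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def match(parameter: str, labels: list[str], threshold: float = 0.8):
    exact = [l for l in labels if l.lower() == parameter.lower()]
    fuzzy = [l for l in labels if similarity(parameter, l) >= threshold]
    return exact or fuzzy

print(match("country_name", ["countryName", "capital", "population"]))
# -> ['countryName'], found via the similarity metric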
8 http://developer.yahoo.com/search/boss/boss_guide/Spelling_Suggest.html
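The retrieval of sample instances from the DBpedia SPARQL Endpoint can be sketched with the SPARQLWrapper library as follows; the query shape and the LIMIT are illustrative assumptions rather than the exact queries used by the system.

# Sketch: retrieve sample instances for a matched DBpedia class.
# The query shape and LIMIT are illustrative assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

def sample_instances(class_uri: str, limit: int = 20):
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery(f"""
        SELECT DISTINCT ?instance WHERE {{
            ?instance a <{class_uri}> .
        }} LIMIT {limit}
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["instance"]["value"] for b in results["results"]["bindings"]]

# Example: candidate values for a parameter matched to the class dbo:Country
print(sample_instances("http://dbpedia.org/ontology/Country"))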
When the spelling suggestion service returns corrected spellings or token splits for a parameter, the system calls the DBpedia (and GeoNames) SPARQL Endpoints again. The output is registered and stored into the repository. Following the previous example, the parameter 'countryName' is not found in the DBpedia ontology. Nevertheless, the suggestion service allows separating this parameter into 'country' and 'name', and the system then calls the DBpedia SPARQL Endpoint with these new strings to obtain results.
Synonym services
This external service9 is incorporated into the system to retrieve possible synonyms for a certain parameter. It tries to improve the semantic annotation process when the previous steps do not produce results, that is, when we still have parameters in a RESTful service without any potential annotations.
As an example, we may have a parameter called 'address'. The invocation process uses the synonym service to retrieve a set of synonyms of 'address' such as extension, reference, mention, citation, denotation, destination, source, cite, acknowledgment, and so on. These outputs are registered and stored into the repository, and then the service calls the DBpedia (and GeoNames) SPARQL Endpoints again for results.
Both spelling suggestion and synonym services use the matching process described
in section 3.3.1 to find possible matches between the output of these services and the
components of the used ontologies.
In order to check the collected sample individuals and the initial semantic annotations obtained as a result of the previous process, our system invokes the RESTful services that were already registered in the repository (as described in Section 3.2) and validates the input and output parameters to determine which is the best option for describing each parameter.
For the validation of the input parameters, our system selects, for each parameter, a random subset of the example instances (of classes and/or properties) coming from the DBpedia (and GeoNames) ontologies that we have obtained and registered before. Next, it makes several invocations of the RESTful service, iterating over these registered values. The system does not check all possible combinations of the collected instances for all parameters, for two reasons: first, because of the combinatorial explosion that may be produced in such a case, and second, because many RESTful services have invocation limitations.
When a service has one or more input parameters, the system randomly obtains some instances of each parameter for the validation process. Each parameter generates a collection (list) of instances from our repository. Then, the system joins these instances to obtain a table of combinations of the parameters. Likewise, the geospatial parameters, specifically the latitude and longitude parameters, are combined to obtain some values (instances) that can be used for this invocation.
If the service returns results from the invocation, then the service is considered executable, and the corresponding annotations are marked as valid. If a service cannot be invoked successfully, the service is classified as non-executable and is automatically discarded from the list of services that can be automatically annotated.
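A minimal sketch of this input validation, assuming a service that takes its inputs as URL query parameters; the sample sizes, the use of the requests library, and treating a non-empty 200 response as "returns results" are our own simplifying assumptions.

# Sketch: validate candidate input annotations by invoking the service
# with a bounded random sample of value combinations.
import itertools
import random
import requests

def validate_inputs(base_url: str, candidates: dict[str, list[str]],
                    sample_size: int = 5, max_calls: int = 20) -> bool:
    """candidates maps each input parameter to instance values from the repository."""
    samples = {p: random.sample(v, min(sample_size, len(v)))
               for p, v in candidates.items()}
    names = list(samples)
    combinations = itertools.product(*(samples[n] for n in names))
    for combo in itertools.islice(combinations, max_calls):  # respect invocation limits
        params = dict(zip(names, combo))
        resp = requests.get(base_url, params=params, timeout=10)
        if resp.status_code == 200 and resp.text.strip():
            return True          # service is executable, annotations marked valid
    return False                 # classified as non-executable

# Hypothetical usage with candidate values drawn from DBpedia/GeoNames:
# validate_inputs("http://api.geonames.org/countryInfo",
#                 {"country": ["ES", "FR"], "username": ["demo"]})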
9 http://www.synonyms.net/
For the validation of the output parameters, our system only takes into account executions with correct inputs from the input sets considered before. Next, the system compares the outputs obtained after execution with the information already stored in the repository as a result of the initial retrieval processes done with DBpedia (and GeoNames) and the external utility services. If the output can be matched, our system considers the output annotation as valid.
Finally, the correspondences that have been established between the different parameters of the RESTful service and the DBpedia (and GeoNames) ontologies are registered and stored in the repository, so that they can be used later. In this way, the RESTful service is semantically annotated, which allows generating semantic descriptions or annotations of any of the types identified in the related work section (WADL, hREST, etc.). Table 3 provides an abbreviated form of this description for our exemplar Service 1.
($country, http://www.w3.org/2003/01/geo/wgs84_pos#lat, http://www.w3.org/2003/01/geo/wgs84_pos#long, isoNumeric, http://dbpedia.org/ontology/Continent, fipsCode, http://dbpedia.org/property/areaMetroKm, languages, isoAlpha3, http://dbpedia.org/ontology/country, http://www.w3.org/2003/01/geo/wgs84_pos#lat, http://www.w3.org/2003/01/geo/wgs84_pos#long, http://dbpedia.org/ontology/populationDensity, http://www.w3.org/2003/01/geo/wgs84_pos#lat, http://www.w3.org/2003/01/geo/wgs84_pos#long, http://dbpedia.org/ontology/Currency, http://www.w3.org/2003/01/geo/wgs84_pos#lat, http://www.w3.org/2003/01/geo/wgs84_pos#long, http://dbpedia.org/ontology/capital, geonameId, http://dbpedia.org/ontology/country)
4 Experimental Results
In order to evaluate our approach in the geospatial domain, we have used 60 different RESTful services found in http://www.programmableweb.com/, which we selected randomly from those that were available and could be determined, by manual inspection, to contain geospatial information. The list of these services can be found on our experiment website10. In the syntactic registration of all these services in the system, by means of introducing the list of their URLs, our system successfully registered 56 of them into the repository (4 services could not be registered due to invocation errors). As a result of this syntactic registration, the system produced a complete list of 369 different parameters (52 input parameters and 342 output parameters), without duplications.
This analysis follows the three steps described in our semantic annotation process. First, our system correctly identifies 191 of the 369 parameters by calling the DBpedia and GeoNames ontologies directly. Second, the system uses the initial parameters plus the suggestion service and calls the DBpedia and GeoNames ontologies. In this case, it identifies 33 correspondences and adds 57 parameters to the initial ones. Third, the system uses the initial parameters plus the synonym service, and calls the DBpedia and GeoNames ontologies. It identifies 126 correspondences and incorporates 1,147 additional parameters into the system. Finally, the system combines all the resources that result from the enrichment process and calls the DBpedia and GeoNames SPARQL endpoints again. Here it identifies 159 correspondences and adds 1,573 more parameters. A detailed view of these results is shown in Table 4.
10 http://delicias.dia.fi.upm.es/RESTfulAnnotationWeb/SourcesList/sources.ods
Table 4.
Attributes                              Total   Additional parameters   Matches (DBpedia and GeoNames ontologies)
Initial parameters                      369     -                       191
Parameters + Suggestions                426     57                      33
Parameters + Synonyms                   1573    1147                    126
Parameters + Suggestions + Synonyms     1573    1204                    159
With respect to the validation of the input parameters11 (see Table 5), our system recognizes 152 inputs from the initial list, of which 76 parameters can be annotated automatically with the DBpedia (33 parameters) and GeoNames (45 parameters) ontologies.
Likewise, we have discovered in our evaluation that some other parameters are useless in terms of the semantic annotation process, since they refer to the navigation through the RESTful service results or are "special" parameters. These parameters (input/output) are not considered for this validation (nevertheless, they are considered in the invocation process); concretely, there are 155 "special" parameters, for instance userID, api_key, page, total, hits, etc. These parameters were detected manually and a list of them is collected on this website12. Our system removes them automatically from the service registration process13.
One limitation of our system is that we cannot always guarantee a successful annotation, because in some cases the system cannot find any correspondence between the service parameters and the concepts or properties of the DBpedia or GeoNames ontologies. This is common, for instance, when parameter names consist of only one letter (e.g., s, l or q) and hence are not sufficiently descriptive for our automated approach to find any correspondence. In our evaluation, we had 12 parameters of this type. In these cases the parameters should be shown to users for a manual description.
In summary, of the 56 registered services (out of the 60 initial geospatial RESTful services), we obtained correct input parameter associations in all but 4 cases, where no correspondence could be found.
11 A detailed analysis of these input parameters is available at http://delicias.dia.fi.upm.es/RESTfulAnnotationWeb/inputs/inputs.ods
12 http://delicias.dia.fi.upm.es/RESTfulAnnotationWeb/parameters/Parameters.ods
13 This step was not included in Section 3 since we did not consider it essential to the description of the overall process.
Table 5.
RESTful Service      Total parameters   Annotated parameters   Annotated (DBpedia)   Annotated (GeoNames)   Special parameters   Service validation
Input parameters     152                76                     33                    45                     73                   569 48
Output parameters    862                315                    202                   113                    299                  -
With respect to the validation of the output parameters14 (see Table 5), our system recognizes 862 outputs that belong to the 56 services whose input parameters have been validated. This total of output parameters is divided into 315 whose correspondences can be found using the DBpedia (202 parameters) and GeoNames (113 parameters) ontologies, and 391 (299 special and 92 not found parameters) whose correspondences cannot be found.
Output parameters: 475 found, 92 not found, 315 annotated, 160 not annotated, 242 correctly annotated, precision 0.66, recall 0.77.
While in the context of the input parameters we are interested in determining whether we can call the service or not, in the case of the output parameters we are interested in the precision and recall of the annotation process. Hence, we have generated a gold standard for the studied services, manually assigning the annotations that should be produced for all output parameters of these services, and we have evaluated the results obtained by the system for the parameters that are found.
Regarding the parameters that are found, our system annotates 315 of them automatically, of which 242 parameters are annotated correctly according to the gold standard, while 160 parameters are not annotated. This yields an average precision of 0.66 and recall of 0.77.
To the best of our knowledge, there are no available results from existing research works to compare our results against. Nevertheless, these preliminary results demonstrate the feasibility of our system and highlight that it is possible to carry out an assisted semantic annotation of RESTful services.
5 Conclusions
The system presented in this paper takes into account the DBpedia ontology and its SPARQL Endpoint for general annotation, and GeoNames and its SPARQL Endpoint for geospatially specific results, as well as different external resources such as synonym and suggestion services. We use combinations of these resources to discover meanings for the parameters of the RESTful services that a user may select, and to perform semantic annotations of them.
To illustrate our work and guide the explanation of the proposed semantic annotation process, we have used two exemplary RESTful services related to the geospatial domain. Besides, we have presented some preliminary experimental results that demonstrate the feasibility of our approach and show that it is possible, at least in the geospatial domain, to assist the semantic annotation of RESTful services.
Future work will focus on the development of a GUI that will ease the introduction of existing services by users for their semantic annotation, possibly incorporated into an existing RESTful semantic annotation tool or utility suite. Furthermore, we also plan to improve the proposed system through the analysis of the instances retrieved in the matching process, so as to improve on the results reported in our evaluation. In the same vein, we aim at improving the SPARQL queries to DBpedia and other semantic resources, whether or not they are associated with a specific domain, in order to better exploit these resources in the annotation process, and at optimizing the use of the suggestion and synonym services. Finally, we will incorporate more domain-specific ontologies into the annotation process to take advantage of domain-specific characteristics.
Acknowledgments
This work has been supported by the R&D project España Virtual (CENIT2008-
1030), funded by Centro Nacional de Información Geográfica and CDTI under the
R&D programme Ingenio 2010.
References
1. Maleshkova, M., Kopecky, J., Pedrinaci, C.: Adapting SAWSDL for Semantic
Annotations of RESTful Services. In: Workshop: Beyond SAWSDL at OnTheMove
Federated Conferences & Workshops, Vilamoura, Portugal (2009)
2. Maleshkova, M., Pedrinaci, C., Domingue, J.: Semantically Annotating RESTful Services
with SWEET, Demo at 8th ISWC, Washington D.C., USA (2009)
3. Maleshkova, M., Gridinoc, L., Pedrinaci, C., Domingue, J.: Supporting the Semi-
Automatic Acquisition of Semantic RESTful Service Descriptions. In: ESWC (2009)
4. Pedrinaci, C., Domingue, J., Krummenacher, R.: Linked Data Meets Artificial Intelligence.
In: Services and the Web of Data: An Unexploited Symbiosis, Workshop: Linked AI:
AAAI Spring Symposium (2010)
5. Kopecký, J., Gomadam, K., Vitvar, T.: hRESTS: An HTML Microformat for Describing
RESTful Web Services. Web Intelligence, 619–625 (2008)
6. Lambert, D., Domingue, J.: Grounding semantic web services with rules. In: Workshop:
Semantic Web Applications and Perspectives, Rome, Italy (2008)
7. Steinmetz, N., Lausen, H., Brunner, M.: Web Service Search on Large Scale.
ICSOC/ServiceWave, 437–444 (2009)
8. Lathem, J., Gomadam, K., Sheth, A.P.: SA-REST and (S)mashups: Adding Semantics to
RESTful Services. In: ICSC 2007, pp. 469–476 (2007)
9. García Rodríguez, M., Álvarez, J.M., Berrueta, D., Polo, L.: Declarative Data Grounding
Using a Mapping Language. Communications of SIWN 6, 132–138 (2009)
10. Freitas Ferreira Filho, O., Grigas Varella Ferreira, M.A.: Semantic Web Services: A
RESTful Approach. In: IADIS Int. Conference WWW/INTERNET 2009, Rome, Italy
(2009)
11. Alowisheq, A., Millard, D.E., Tiropanis, T.: EXPRESS: EXPressing REstful Semantic
Services Using Domain Ontologies. In: Bernstein, A., Karger, D.R., Heath, T.,
Feigenbaum, L., Maynard, D., Motta, E., Thirunarayan, K. (eds.) ISWC 2009. LNCS,
vol. 5823, pp. 941–948. Springer, Heidelberg (2009)
12. Alarcon, R., Wilde, E.: Linking Data from RESTful Services. In: LDOW 2010, Raleigh,
North Carolina (2010)
13. Lerman, K., Plangprasopchok, A., Knoblock, C.A.: Semantic Labeling of Online
Information Sources. Int. J. Semantic Web Inf. Syst. 3(3), 36–56 (2007)
14. Ambite, J.L., Darbha, S., Goel, A., Knoblock, C.A., Lerman, K., Parundekar, R., Russ, T.:
Automatically constructing semantic web services from online sources. In: Bernstein, A.,
Karger, D.R., Heath, T., Feigenbaum, L., Maynard, D., Motta, E., Thirunarayan, K. (eds.)
ISWC 2009. LNCS, vol. 5823, pp. 17–32. Springer, Heidelberg (2009)
15. Doan, A., Domingos, P., Halevy, A.Y.: Reconciling Schemas of Disparate Data Sources:
A Machine-Learning Approach. In: SIGMOD Conference 2001, pp. 509–520 (2001)
16. Doan, A., Domingos, P., Halevy, A.Y.: Learning to Match the Schemas of Data Sources:
A Multistrategy Approach. Machine Learning 50(3), 279–301 (2003)
17. Heß, A., Kushmerick, N.: Learning to attach semantic metadata to web services. In:
Fensel, D., Sycara, K., Mylopoulos, J. (eds.) ISWC 2003. LNCS, vol. 2870, pp. 258–273.
Springer, Heidelberg (2003)
18. Rahm, E., Bernstein, P.: On matching schemas automatically. VLDB Journal 10(4) (2001)
19. Battle, R., Benson, E.: Bridging the semantic web and web 2.0 with Representational State Transfer (REST). Web Semantics 6, 61–69 (2008)
20. Braga, D., Ceri, S., Martinenghi, D., Daniel, F.: Mashing Up Search Services. IEEE
Internet Computing 12(5), 16–23 (2008)
21. Altinel, M., Brown, P., Cline, S., Kartha, R., Louie, E., Markl, V., Mau, L., Ng, Y.H.,
Simmen, D., Singh, A.: Damia: a data mashup fabric for intranet applications. In:
Proceedings of the 33rd Int. Conference on VLDB 2007 Endowment, pp. 1370–1373
(2007)
22. Resende, L.: Handling heterogeneous data sources in a SOA environment with service data
objects (SDO). In: Proceedings of the ACM SIGMOD International Conference on
Management of Data, pp. 895–897. ACM, New York (2007)
23. Fielding, R.: Architectural Styles and The Design of Network-based Software Architectures.
PhD thesis, University of California, Irvine (2000)
Towards Custom Cloud Services
Using Semantic Technology to Optimize Resource Configuration
S. Haak and S. Grimm
1 Introduction
Cloud computing (and its SaaS, PaaS and IaaS offerings) promises to be a solution for cutting costs in IT spending while simultaneously increasing flexibility. According to thecloudmarket.com [1], there already exist over 11,000 different preconfigured Amazon EC2 images from various providers that can be run as virtual appliances on Amazon's Elastic Compute Cloud (EC2). The large variety indicates the importance of custom service offers. However, the customer has little support in the technical and economic decision process of selecting an appropriate image and deploying it on the Cloud. Nor is it guaranteed that an image exists that is configured exactly according to the customer's needs.
Apparently there is a great business opportunity for Cloud providers who
manage to offer Custom Cloud Services tailored to their customers’ needs. Trans-
ferring the selection and deployment process to the provider simplifies the cus-
tomer’s decision process significantly. In such a scenario, the provider faces the
challenging task of automatically finding the optimal service composition based
on the customer request, from both a technical and an economic view. A cus-
tomer requesting a database service having for example only preferences on the
underlying operating system or the available storage space leaves a lot of room
for economic optimization of the service composition. A MySQL database run-
ning on a Linux-based virtual machine (VM) might be advantageous over the
more costly Oracle alternative on a Windows-based VM.
Custom Cloud Services, as referred to in this paper, are compositions of com-
mercial off-the-shelf software, operating system and virtualized hardware, bun-
dled to offer a particular functionality as requested by a customer. The functional
requirements usually leave many choices when choosing required resources from
groups that offer equal or similar functionalities as in the above mentioned exam-
ple. Finding the optimal choice involves many different aspects, including (among
others) technical dependencies and interoperability constraints, customer prefer-
ences on resources and Quality-of-Service (QoS) attributes, capacity constraints
and license costs. In a sense, it is a configuration problem as known from typical
configurators for products like cars, etc. However, stated as an integer programming or constraint optimization problem [15,7], it is underspecified, as not all configuration variables (the set of required service resource types) can be clearly specified ex ante from the customer request. The missing variables, however, can be derived dynamically, as they are implicitly contained in the resources' dependencies. For example, a customer request for CRM software might implicitly require an additional database and some underlying operating system on some virtualized hardware.
As mentioned before, traditional linear or constraint programming techniques require knowledge of all variables. In order to overcome this problem, we propose an ontology-based approach for a convenient and standardized knowledge representation of all known Cloud service resources, which allows deriving the complete configuration problem by successively resolving resource dependencies.
The remainder of this paper is structured as follows. Section 2 describes
an example use case, which helps us to derive requirements for the designated
framework. These requirements are also used for a qualitative evaluation of our
approach and to distinguish it from related work.
In Section 3 we describe our main contribution. We start by describing our
understanding of functional requirements, which serve as input for the ontology-
based optimization framework. We then propose the usage of a three-fold ontol-
ogy system to serve as knowledge base. We distinguish between a generic service
ontology, that contains the meta concepts provided in this paper, a domain on-
tology that makes use of these concepts and contains the actual knowledge about
known infrastructure resources, and a result ontology that is used to represent a
graph, spanning a network over different choices based on abstract dependency
relations that can exist between different resources (e.g. that every application
needs some operating system). We formally describe this dependency graph and
show how it can be derived algorithmically, making use of different queries to
the ontology system. Further we show how this graph can be used to obtain all
feasible infrastructure compositions. In addition, we show how the graph can be
transformed into an integer program for finding the profit optimal configuration
with respect to customer preferences and costs. For knowledge representation,
we make use of the Semantic Web ontology and rule languages OWL and SWRL
combined with SPARQL querying facilities.
Section 4 describes a proof-of-concept implementation of the presented frame-
work and reviews the requirements from Section 2 with respect to our contribu-
tion. In Section 5 we conclude this paper by giving an overview on open issues
and an outlook on future research.
2.2 Requirements
Based on the use case, we can now derive a set of requirements, defining and
clarifying the goals of the desired solution. We identify five major properties that
we find necessary to provide an adequate solution for finding the optimal service
configuration in this case:
R 1 (Top Down Dependency Resolution). Automatic resolution of all transitive dependencies between resource classes, starting from the top-level functional requirement resources until no more unresolved dependencies exist, thereby deducing all variables for the Custom Cloud Service configuration problem.
R 2 (Functional Requirements). The functional requirements for the desig-
nated service should be describable by abstract or concrete resources (on different
levels).
R 3 (Customer Preferences). The approach should be able to consider cus-
tomer preferences regarding different configuration options.
R 4 (Interoperability Check). The interoperability/compatibility between re-
sources has to be validated. For reducing the modeling overhead, this validation
should be possible on both instance and higher abstraction levels.
R 5 (Profit Optimization). The profit maximizing Custom Cloud Service configuration has to be found, i.e. the configuration yielding the greatest difference between the achievable offer price for a configuration and the costs accruing on the provider side.
(Fig. 1. Customer/provider interaction: the customer states functional requirements, the provider's dependency graph algorithm returns a dependency graph, the customer adds preferences, the provider runs the optimization and makes a service offer, which is accepted or declined, followed by deployment.)
(Figure: service ontology concepts, ServiceComponent with FixCosts and VariableCosts (xsd:float), OrNode, SourceNode and SinkNode, related via the properties contains, hasOption, connectsTo, isCompatibleTo, choice and multiplyer.)
The most fundamental concept for deriving all feasible deployment alterna-
tives is the class ServiceComponent along with the corresponding object properties
requires and isCompatibleTo. The requires property is used to describe the func-
tional dependency between two resource instances. In most cases, dependencies
can and should be described in an abstract way at class-level. We can do this
by including the dependency relation into the class axiom in conjunction with
an object restriction in the form of an existential quantifier on the required
class:
ComponentA ⊑ ServiceComponent ⊓ ∃requires.ComponentB
Hereby we state that each resource of type ComponentA requires some resource
of type ComponentB. As a more concrete example, we could state that every
instance of the class Application requires at least some operating system:
Application ⊑ ServiceComponent ⊓ ∃requires.OS
Compatibility between resources can likewise be asserted at class level by a DL-safe SWRL rule, stating that every instance of ComponentA is compatible with every instance of ComponentB:
isCompatibleTo(x,y) ← ComponentA(x) ∧ ComponentB(y)    (L1)
We can exploit the full expressiveness of DL-safe SWRL rules. E.g. to state that
all versions of MySQL are compatible to all Windows versions except Windows
95, we include the following rule:
isCompatibleTo(x,y) ← MySQL(x) ∧ Windows(y) ∧ differentFrom(y, 'Windows95')    (L2)
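For illustration only (the authors use OWL with DL-safe SWRL rules and the Pellet reasoner, cf. Section 4.1), the class-level dependency axiom and a simplified compatibility rule (omitting the Windows 95 exception) could be written with the owlready2 Python library roughly as follows; the ontology IRI is an assumption.

# Illustrative sketch with owlready2; the ontology IRI is an assumption and
# the rule omits the Windows 95 exception for brevity.
from owlready2 import Thing, ObjectProperty, Imp, get_ontology

onto = get_ontology("http://example.org/service-ontology.owl")

with onto:
    class ServiceComponent(Thing): pass
    class OS(ServiceComponent): pass
    class Application(ServiceComponent): pass
    class MySQL(Application): pass
    class Windows(OS): pass

    class requires(ObjectProperty):
        domain = [ServiceComponent]
        range  = [ServiceComponent]

    class isCompatibleTo(ObjectProperty):
        domain = [ServiceComponent]
        range  = [ServiceComponent]

    # Application ⊑ ServiceComponent ⊓ ∃requires.OS
    Application.is_a.append(requires.some(OS))

    # Class-level compatibility as a SWRL rule:
    # MySQL(?x) ∧ Windows(?y) → isCompatibleTo(?x, ?y)
    rule = Imp()
    rule.set_as_rule("MySQL(?x), Windows(?y) -> isCompatibleTo(?x, ?y)")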
Domain Ontology. The domain ontology uses the concepts described in the
preceding section to capture the knowledge about the Cloud service resources of
a certain domain of interest. This makes it easy to use the same technology for many different contexts, simply by loading a different domain ontology. In addition, knowledge can be combined by loading several domain ontologies, as long as this does not lead to inconsistencies.
Result Ontology. By resolving the transitive dependencies for the set of resources from the functional requirements, it is clear that we cannot add any knowledge; we can only make explicit additional knowledge that is already contained implicitly in the knowledge base.
For persisting the extended model, we also rely on an OWL ontology, such
that it can be used for further reasoning tasks. The result ontology makes use
of the concepts SourceNode, SinkNode, OrNode and Alternative, all defined in
the service ontology. SourceNode and SinkNode are helper nodes used to have distinct starting and ending points in the graph. They correspond to the source
and sink nodes in a network. The OrNode is introduced to capture the branching
whenever there is more than one compatible resource instance that fulfills the
dependency requirement.
In the remainder of the paper we work with a more formal notation for the
dependency graph. Note that both notations are semantically the same. An
example graph is depicted in Figure 3.
(Figure: example dependency graph for the online survey use case, connecting a source node via option and requires edges through the alternative resources to a sink node.)
– if n ∈ V is a node with a nominal class label L(n) = {o} then there is a node nC with label L(nC) = C for each atomic or nominal class C with O |= C(x) for all x such that O |= requires(o, x) and O |= isCompatibleTo(o, x), and an edge e = (n, nC) with label L(e) = and.
In such cases there do not exist two named individuals for which we find the requires property, but from the axiomatic knowledge we know that there has to be at least one.
The query answers with the set of all types of these anonymous individuals. This has one disadvantage: we are only interested in the most specific class assertions. E.g., if it is stated that every application needs an operating system, the query would have the class OS in its result set, but also every superclass up to Thing.
Therefore, in a second step, we need to find the most specific classes, i.e. all classes that have no subclasses also contained in the result set. This can be achieved by a simple algorithm which has a worst-case runtime of O(n^2) subsumption checks.
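A sketch of this filtering step; the subsumption oracle is_subclass_of(a, b) stands for a reasoner call and its name is our own.

# Sketch: keep only the most specific classes from a reasoner's answer set.
# is_subclass_of(a, b) is an assumed oracle answering "a ⊑ b?" via the reasoner.

def most_specific(classes, is_subclass_of):
    """O(n^2) subsumption checks: drop every class that has a proper
    subclass which is also contained in the result set."""
    keep = []
    for c in classes:
        has_more_specific = any(
            d != c and is_subclass_of(d, c) for d in classes
        )
        if not has_more_specific:
            keep.append(c)
    return keep

# Example with a toy hierarchy {Linux ⊑ OS ⊑ Thing}:
hierarchy = {("Linux", "OS"), ("Linux", "Thing"), ("OS", "Thing")}
print(most_specific(["Thing", "OS", "Linux"],
                    lambda a, b: (a, b) in hierarchy))
# -> ['Linux']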
As there are redundant dependencies, which themselves might again have further dependencies, we memorize the visited resources (V∗), as we do not need to resolve their dependencies more than once.
Furthermore, the algorithm remembers unfulfilled requirements and recursively traces them back, deleting infeasible paths. We have not included these steps in the pseudo algorithm printed above, as they would only confuse the reader.
(Figure: example dependency graph in the formal notation, with variables X1, X3, X4 and X5, their options x31, x32, x41, x42, x51 and x52, connected by requires and option edges between the source and sink nodes.)
3.4 Preferences
As shown in Figure 1, after receiving the dependency graph, the customer can
specify preferences. For quantifying the customer satisfaction for a certain con-
figuration, we introduce the notion of a scoring function [3]. The scoring function
maps the customer preferences on certain configuration choices to a real number
in the interval [0, 1]. We achieve that by a weighted aggregation of the single
preference values regarding the different configuration choices (cf. Section 3.5).
For expressing non-compensating preferences, we define a set of additional restrictions R that is added to the set of constraints C.
Definition 4 (Preferences). For a dependency graph GF the customer preferences are described by the triplet P = (P, Λ, R), with P = {P1, . . . , Pn} a set of preference vectors, Λ = {λ1, . . . , λn} a set of weights for each variable Xi in X with Σi λi = 1, and R a set of non-compensating restrictions.
Example. For a variable XOS representing the various operating system choices with DXOS = {Win, Linux}, the preferences are denoted by the vector POS = (1, 0.8)^T, expressing the slight preference for a Windows-based system, as described in our use case. Analogously, we would denote PDB = (0.5, 1)^T with DXDB = {MySQL, Oracle}. If both preference values are considered equally important, we would denote λOS = λDB = 0.5. A non-compensating restriction could be that the online survey has to be PHP-based, denoted by R = {XScript = PHP}.
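As a small illustration of the weighted aggregation, a sketch using the example's values follows; treating a violated restriction as score 0 is our own simplification of adding R to the constraint set.

# Sketch: scoring function S for one configuration, as a weighted aggregation
# of per-variable preference values (values taken from the example above).
preferences = {"OS": {"Win": 1.0, "Linux": 0.8},
               "DB": {"MySQL": 0.5, "Oracle": 1.0}}
weights      = {"OS": 0.5, "DB": 0.5}          # must sum to 1
restrictions = {"Script": "PHP"}               # non-compensating restriction

def score(configuration: dict[str, str]) -> float:
    # Simplification: a violated restriction makes the configuration infeasible.
    if any(configuration.get(var) != val for var, val in restrictions.items()):
        return 0.0
    return sum(weights[v] * preferences[v][configuration[v]]
               for v in preferences)

print(score({"OS": "Win", "DB": "MySQL", "Script": "PHP"}))    # 0.75
print(score({"OS": "Linux", "DB": "Oracle", "Script": "PHP"}))  # 0.9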
3.5 Optimization
Having the dependency graph GF and the customer preferences P we want to
find the optimal configuration that will be offered to the customer. The optimal
configuration can differ in various scenarios with different pricing schemes and
potential additional constraints (like capacity restrictions). In this work we define
the optimum as the configuration that yields the highest profit, i.e. the achieved
price minus the accruing costs. Hereby we assume that the customer shares his willingness to pay for a service perfectly fulfilling all his preferences, denoted by α. Extensions to this simple economic model are considered in ongoing research.
In a naive approach, we could try to iterate over all feasible configurations in GF. One configuration is a subgraph, as the meaning of the edges can be interpreted as and and or. As a matter of fact, we can rewrite the dependency graph as a Boolean formula. If we convert this Boolean formula (which is already in negation normal form) into its disjunctive normal form (DNF) by a step-wise application of the distributive laws, we get exactly an enumeration over all sets of resource instances that reflect the different configurations.
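Under the simplifying assumption that the or-branches of the dependency graph have already been collected into per-variable option sets, the naive enumeration can be sketched as a Cartesian product with a compatibility filter; the option sets and the incompatibility constraint below are illustrative, and this is not the authors' DNF rewriting procedure itself.

# Sketch: naive enumeration of feasible configurations.
# options: the or-branches per variable; incompatible: assumed example constraint.
from itertools import product

options = {"DB": ["MySQL", "Oracle"], "OS": ["Linux", "Windows"]}
incompatible = {("Oracle", "Linux")}

def compatible(cfg):
    return all((a, b) not in incompatible and (b, a) not in incompatible
               for a in cfg for b in cfg)

names = list(options)
configurations = [dict(zip(names, combo))
                  for combo in product(*(options[n] for n in names))
                  if compatible(combo)]
print(configurations)
# [{'DB': 'MySQL', 'OS': 'Linux'}, {'DB': 'MySQL', 'OS': 'Windows'},
#  {'DB': 'Oracle', 'OS': 'Windows'}]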
However, we are interested in the optimal configuration, thus iterating over
all configurations might not be the best choice as it is very costly with respect
to runtime. We therefore set up an integer program which calculates the profit-maximizing configuration. The variables from the constraint satisfaction problem are modeled by a vector of binary decision variables Xi = (Xi^1, . . . , Xi^m) for m different choices.
maximize_X   α · S(X) − C(X)
subject to   Σ_{x_i^j ∈ X_i} x_i^j = 1        ∀ X_i ∈ X
             X_i · X_k ≤ I_ik                 ∀ i, j : X_i →_r X_k
             constraints from R ∈ R
with
S(X) = Σ_{X_i ∈ X} λ_i · X_i · P_i
C(X) = Σ_{X_i ∈ X} C_i · X_i
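The authors implement this integer program with CPLEX (Section 4.1); as a rough open-source sketch of the same structure (binary decision variables, one "exactly one option" constraint per variable, pairwise interoperability constraints, objective α·S − C), the PuLP library could be used as follows, with all numbers being illustrative assumptions.

# Sketch of the integer program with PuLP; all numbers are illustrative.
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary, value

alpha = 100.0                                   # stated willingness to pay
options = {"OS": ["Win", "Linux"], "DB": ["MySQL", "Oracle"]}
pref    = {("OS", "Win"): 1.0, ("OS", "Linux"): 0.8,
           ("DB", "MySQL"): 0.5, ("DB", "Oracle"): 1.0}
cost    = {("OS", "Win"): 20, ("OS", "Linux"): 0,
           ("DB", "MySQL"): 0, ("DB", "Oracle"): 60}
weight  = {"OS": 0.5, "DB": 0.5}
incompatible = [(("DB", "Oracle"), ("OS", "Linux"))]  # assumed constraint

prob = LpProblem("custom_cloud_service", LpMaximize)
x = {(v, o): LpVariable(f"x_{v}_{o}", cat=LpBinary)
     for v, opts in options.items() for o in opts}

# Objective: alpha * S(X) - C(X)
prob += lpSum(alpha * weight[v] * pref[v, o] * x[v, o] - cost[v, o] * x[v, o]
              for (v, o) in x)
# Exactly one option per variable
for v, opts in options.items():
    prob += lpSum(x[v, o] for o in opts) == 1
# Interoperability constraints
for a, b in incompatible:
    prob += x[a] + x[b] <= 1

prob.solve()
print({k: int(value(var)) for k, var in x.items() if value(var) > 0.5})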
4 Evaluation
We evaluate our contribution qualitatively by having implemented a proof-of-
concept prototype and comparing the presented approach to the requirements
from Section 2.2.
4.1 Implementation
The implemented prototype can be executed using Java Web Start1 . For using
the prototype, a domain ontology (an example ontology is provided) has to be
1 http://research.steffenhaak.de/ServicePlanner/
loaded, before one can add the set of resources C. Eventually, the algorithms can
be started by using the menu items ResolveDependencies and Evaluate. However,
not all functionalities presented in this paper are integrated. We can derive all
feasible configurations using the described DNF approach. The integer program
has been implemented using CPLEX, thus not being part of the downloadable
prototype. As reasoning engine we have chosen Pellet [17], as to our knowledge it
is the only OWL DL reasoner that is capable of both SWRL rules and SPARQL
queries that involve blank nodes.
5 Conclusion
In this paper, we propose an ontology-based optimization framework for finding
the optimal Cloud service configuration based on a set of functional require-
ments, customer preferences and cost information. We do this by means of an
OWL DL approach, combined with DL-safe SWRL rules and SPARQL querying
facilities. We can model dependencies between resource classes of any abstraction
level and use complex rules to ensure compatibility between resources. An inte-
ger program allows us to find the profit maximizing configuration. We provide a
prototypical implementation as proof-of-concept.
We recognize several reasonable extensions of and shortcomings in our approach. The economic model for the profit maximization is very simplistic. It is arguable whether the customer is willing to disclose his preferences and willingness to pay in a truthful manner. Further extensions from an economic perspective are the integration of capacity constraints and the optimization of several concurrent requests. Ongoing research is dealing with both issues.
From the semantic perspective, a more complex cost and quality model would be beneficial. In addition, we plan to investigate how the proposed domain ontology can be maintained collaboratively by incentivizing resource suppliers to contribute the necessary knowledge about their resources, such as interoperability and dependencies, themselves.
We also recognize the need for a better evaluation, qualitatively, through rely-
ing on an industry use case, and quantitatively, by analyzing both the economic
benefit of our solution and its computational complexity.
References
1. The Cloud Market EC2 Statistics (2010), http://thecloudmarket.com/stats
2. Apache: Apache Ivy (2010), http://ant.apache.org/ivy/
3. Asker, J., Cantillon, E.: Properties of Scoring Auctions. The RAND Journal of
Economics 39(1), 69–85 (2008)
4. Berardi, D., Calvanese, D., De Giacomo, G., Lenzerini, M., Mecella, M.: Auto-
matic service composition based on behavioral descriptions. Int. J. of Cooperative
Information Systems 14(4), 333–376 (2005)
5. Blau, B., Neumann, D., Weinhardt, C., Lamparter, S.: Planning and pricing of
service mashups. In: 10th IEEE Conference on E-Commerce Technology and the
Fifth IEEE Conference on Enterprise Computing, E-Commerce and E-Services,
pp. 19–26 (2008)
6. Gaimon, C., Singhal, V.: Flexibility and the choice of manufacturing facilities under
short product life cycles. European Journal of Operational Research 60(2), 211–223
(1992)
7. van Hoeve, W.: Operations Research Techniques in Constraint Programming.
Ph.D. thesis, Tepper School of Business (2005)
8. Horrocks, I., Patel-Schneider, P.F., Boley, H., Tabet, S., Grosof, B., Dean, M.:
SWRL: A semantic web rule language combining OWL and RuleML. W3C Member
submission 21 (2004)
9. Junker, U., Mailharro, D.: The logic of ILOG (J)Configurator: Combining constraint
programming with a description logic. In: Proceedings of Workshop on Configura-
tion, IJCAI, vol. 3, pp. 13–20. Citeseer (2003)
10. Lamparter, S., Ankolekar, A., Studer, R., Grimm, S.: Preference-based selection of
highly configurable web services. In: Proceedings of the 16th international confer-
ence on World Wide Web, pp. 1013–1022. ACM Press, New York (2007)
11. Lécué, F., Léger, A.: A formal model for web service composition. In: Proceeding
of the 2006 conference on Leading the Web in Concurrent Engineering, pp. 37–46.
IOS Press, Amsterdam (2006)
12. McGuinness, D.L., Van Harmelen, F., et al.: OWL web ontology language overview.
W3C recommendation 10, 2004–03 (2004)
13. Motik, B., Studer, R.: KAON2–A Scalable Reasoning Tool for the Semantic Web.
In: Proceedings of the 2nd European Semantic Web Conference (ESWC 2005),
Heraklion, Greece (2005)
14. Motik, B., Sattler, U., Studer, R.: Query Answering for OWL-DL with Rules. Jour-
nal of Web Semantics: Science, Services and Agents on the World Wide Web 3(1),
41–60 (2005)
15. Sabin, D., Freuder, E.: Configuration as composite constraint satisfaction. In: Pro-
ceedings of the Artificial Intelligence and Manufacturing Research Planning Work-
shop, pp. 153–161 (1996)
16. Silva, G.: APT howto (2003), http://www.debian.org/doc/manuals/apt-howto/index.en.html
17. Sirin, E., Parsia, B., Grau, B.C., Kalyanpur, A., Katz, Y.: Pellet: A practical
OWL-DL reasoner. Web Semantics: Science, Services and Agents on the World Wide
Web 5(2), 51–53 (2007)
One Tag to Bind Them All:
Measuring Term Abstractness
in Social Metadata
1 Introduction
Since the advent of participatory web applications like Flickr1 , Youtube2 or
Delicious3 , social annotations (especially in the form of collaboratively created
Both authors contributed equally to this work.
1 http://www.flickr.com
2 http://www.youtube.com
3 http://www.delicious.com
2 Related Work
The first research direction relevant to this work has its roots in the analysis
of the structure of collaborative tagging systems. Golder and Huberman [11]
provided a first systematic analysis, mentioning among others the hypothesis of
“varying basic levels” – according to which users use more specific tags in their
domain of expertise. However, the authors only provided exemplary proofs for
this hypothesis, lacking a well-grounded measure of tag generality. In the follow-
ing, a considerable number of approaches proposed methods to make the implicit
semantic structures within a folksonomy explicit [19,13,22,2]. All of the previous
works comprise in a more or less explicit way methods to capture the “gener-
ality” of a tag (e.g. by investigating the centrality of tags in a similarity graph
or by applying a statistical model of subsumption) – however, a comparison of
the chosen methods has not been given. Henschel et al. [12] claim to generate
more precise taxonomies by applying an entropy filter. In our own recent work [17] we showed that the quality of semantics within a social tagging system also depends on the tagging habits of individual users. Heymann [14] introduced another entropy-based tag generality measure in the context of tag recommendation.
From a completely different point of view, the question of which factors deter-
mine the generality or abstractness of natural language terms has been addressed
by researchers coming from the areas of Linguistics and Psychology. The psy-
chologist Paivio [20] published in 1968 a list of 925 nouns along with human con-
creteness rankings; an extended list was published by Clark [8]. Kammann [16]
compared two definitions of word abstractness in a psychological study, namely
imagery and the number of subordinate words, and concluded that both capture
basically independent dimensions. Allen et al. [1] identify the generality of texts
with the help of a set of “reference terms”, whose generality level is known.
They also showed a correlation between a word's generality and its depth
in the WordNet hierarchy. In their work they developed statistics from analy-
sis of word frequency and the comparison to a set of reference terms. In [25],
Zhang makes an attempt to distinguish the four linguistic concepts fuzziness,
vagueness, generality and ambiguity.
3 Basic Notions
Term Graphs. Both core ontologies and folksonomies introduce various kinds
of relations among the lexical items contained in them. A typical example are
tag cooccurrence networks, which constitute an aggregation of the folksonomy
structure indicating which tags have occurred together. Generally speaking, these term graphs G can be formalized as weighted undirected graphs G = (L, E, w), whereby L is a set of vertices (corresponding to lexical items), E ⊆ L × L models the edges, and w: E → R is a function which assigns a weight to each edge. As an
example, given a folksonomy (U, T, R, Y ), one can define the post-based4 tag-tag
cooccurrence graph as Gcooc = (T, E, w) whose set of vertices corresponds to
the set T of tags. Two tags t1 and t2 are connected by an edge, iff there is at
least one post (u, Tur , r) with t1 , t2 ∈ Tur . The weight of this edge is given by
the number of posts that contain both t1 and t2 , i.e. w(t1 , t2 ) := card{(u, r) ∈
U × R | t1 , t2 ∈ Tur }
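A minimal sketch of the post-based cooccurrence weights, assuming posts are given as (user, resource, tag set) triples (the toy data is ours):

# Sketch: post-based tag-tag cooccurrence weights w(t1, t2) =
# number of posts (u, r) whose tag set contains both t1 and t2.
from collections import Counter
from itertools import combinations

posts = [  # assumed toy folksonomy: (user, resource, tags of that post)
    ("u1", "r1", {"python", "web", "programming"}),
    ("u1", "r2", {"python", "programming"}),
    ("u2", "r1", {"web", "design"}),
]

w = Counter()
for user, resource, tags in posts:
    for t1, t2 in combinations(sorted(tags), 2):
        w[(t1, t2)] += 1

print(w[("programming", "python")])   # 2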
As we will define term abstractness measures based on core ontologies, folk-
sonomies and term graphs, we will commonly refer to them as term structures
S in the remainder of this paper. L(S) is a projection on the set of lexical items
contained in S. Based on the above terminology, we now formally define a term
abstractness measure in the following way:
Definition 1. A term abstractness measure ⪰S based upon a term structure S is a partial order among the lexical items L present in S, i.e. ⪰S ⊆ L(S) × L(S). If (l1, l2) ∈ ⪰S (or l1 ⪰S l2) we say that l1 is more abstract than l2.
In the following, we will make frequent use of ranking functions r: L(S) → R for lexical items in order to define a tag abstractness measure; please note that a ranking function corresponds to a partial order according to (l1, l2) ∈ ⪰S ⇔ r(l1) > r(l2). We will denote the resulting term abstractness measure as ⪰S_r.
4
Other possibilities are resource-based and user-based cooccurrence; we use post-
based cooccurrence in the scope of this work as it is efficiently computable and
captures a sufficient amount of information.
364 D. Benz et al.
whereby cooc(t) is the set of tags which cooccur with t, and p(t′|t) = w(t′, t) / Σ_{t′′ ∈ cooc(t)} w(t′′, t) (with w(t′, t) being the cooccurrence weight defined in Section 3).
Hereby σst denotes the number of shortest paths between s and t and σst (v) is
the number of shortest paths between s and t passing through v. As its compu-
tation is obviously very expensive, it is often approximated [4] by calculating the
shortest paths only between a fraction of points. Finally, a vertex ranks higher
according to closeness centrality the shorter its shortest path length to all other
reachable nodes is:
cc(v) = 1 / Σ_{t ∈ V\{v}} d_G(v, t)    (4)
dG (v, t) denotes hereby the geodesic distance (shortest path) between the vertices
v and t.
5 Evaluation
In order to assess the quality of the tag abstractness measures Ffreq, Fentr, Gdc, Gbc, Gcc and Fsubs introduced above, a natural approach is to compare them
against a ground truth. A suitable grounding should yield reliable judgments
about the “true” abstractness of a given lexical item. Of special interest are
hereby taxonomies and concept hierarchies, whose hierarchical structure typi-
cally contains more abstract terms like “entity” or “thing” close to the taxonomy
root, whereby more concrete terms are found deeper in the hierarchy. Hence, we
have chosen a set of established core ontologies and taxonomies, each of which covers a rather broad spectrum of topics. They vary in their degree of controlledness
– WordNet (see below) on the one hand being manually crafted by language
experts, while the Wikipedia category hierarchy and DMOZ on the other hand
are built in a much less controlled manner by a large number of motivated web
users. In the following, we first briefly introduce each dataset; an overview about
their statistical properties can be found in Table 1.
We include the DMOZ topic hierarchy as a grounding dataset because it was built for a purpose similar to that of many collaborative bookmarking services (namely organizing WWW references). In addition, some of its top-level categories (like “arts” or “business”) are described by rather abstract terms.
As stated above, our grounding datasets contain information about concept sub-
sumptions. If a concept c1 subsumes concept c2 (i.e. (c1 , c2 ) ∈≥C ), we assume
c1 to be more abstract than c2 ; as the taxonomic relation is transitive, we can
infer (c1 , c2 ), (c2 , c3 ) ∈≥C ⇒ (c1 , c3 ) ∈≥C and hence that c1 is also more ab-
stract than c3 . In other words, thinking of the taxonomic relation as a directed
graph, a given concept c is more abstract than all other concepts contained in
the subgraph rooted at c. As we are interested in abstractness judgments about
lexical items, we can consequently infer that concept labels for more abstract
concepts are more abstract themselves. However, hereby we are facing the prob-
lem of polysemy: A given lexical item l can be used as a label for several concepts
10 The data set is publicly available at http://www.uni-koblenz-landau.de/koblenz/fb4/AGStaab/Research/DataSets/PINTSExperimentsDataSets/index_html
(Fig. 1. Gamma correlation of each abstractness measure with the taxonomic relation of each grounding dataset; correlation values range between 0 and 0.8.)
Note that there may exist pairs which are neither concordant nor discordant.
Based on these notions, the gamma rank correlation is defined as
CR(⪰, ⪰∗) = (|C| − |D|) / (|C| + |D|)    (7)
whereby C and D denote the sets of concordant and discordant pairs, respectively.
In our case, ⪰∗ is not a partial ordering, but only a relation – which means that in the worst case, a pair l, k can be concordant and discordant at the same time. As is obvious from the definition of the gamma correlation (see Eq. 7), such inconsistencies lead to a lower correlation. Hence, our proposed method of "resolving" term ambiguity by constructing ⪰O according to Eq. 6 leads to a lower bound of the correlation.
a lower bound of correlation. Figure 1 summarizes the correlation of each of
our analyzed measures, grounded against each of our ground truth taxonomies.
First of all, one can observe that the correlation values between the different
grounding datasets differ significantly. This is most obvious for the DMOZ hi-
erarchy, where almost all measures perform only slightly better than random
guessing. A slight exception is the entropy-based abstractness measure Fentropy, which in general gives correlation values greater than 0.25 across all datasets. Another relatively
constant impression is that the centrality measures based on the tag similarity
graph (cc sim and bc sim) show a smaller correlation than the other measures.
The globally best correlations are found for the WikiTaxonomy dataset, namely
by the subsumption-model-based measure subs. Apart from that, the centrality
measures based on the tag cooccurrence graph and the frequency-based measure
show a similar behavior.
The grounding approach of the previous section gave a first impression of the
ability of each measure to predict term abstractness judgments explicitly present
in a given taxonomy. This methodology allowed only for an evaluation based on
term pairs between which a connection exists in O , i.e. pairs where l1 is either
a predecessor or a successor of l2 in the term subsumption hierarchy. However,
our proposed measures make further distinctions among terms between which no
connection exists within a taxonomy (e.g., ⪰freq states that the most frequent term t is more abstract than all other terms). This phenomenon can probably
also be found when asking humans – e.g. if one would ask which of the terms “art”
or “theoretical computer science” is more abstract, most people will probably
choose “art”, even though both words are not connected by the is-a relation in
(at least most) general-purpose taxonomies.
In order to extend our evaluation to these cases, we derived two straightfor-
ward measures from a taxonomy which allow for a comparison of the abstract-
ness level between terms occurring in disconnected parts of the taxonomy graph.
Because this approach goes beyond the explicitly encoded abstractness informa-
tion, the question is justified to which extent it makes sense to compare the
generality of completely unrelated terms, e.g. between “waterfall” and “chair”.
Besides our own intuition, we are not aware of any reliable method to determine
when humans perceive the abstractness of two terms as comparable or not. For
this reason, we validated the derived measures – namely (i) the shortest path to
the taxonomy root and (ii) the number of subordinate terms – by an experiment
with human subjects.
The relatively high number of "not comparable" judgments shows that even with our elaborate filtering, the task of differentiating abstractness levels is quite difficult. Despite this fact, our user study provided us with a well-agreed set of 41 term pairs, for which we got reliable abstractness judgments. Denoting these pairs as ⪰manual, we can now check the accuracy of the term abstractness measures introduced by sp root and subgraph size, i.e. the percentage of correctly predicted pairs.
subgraph size, i.e. the percentage of correctly predicted pairs. Table 3 contains
the resulting accuracy values. From our sample data, it seems that the subgraph
size (i.e. the number of subordinate terms) is a more reliable predictor of human
abstractness judgments. Hence, we will use it for a more detailed grounding of
our folksonomy-based abstractness measures.
The ranking function subgraph size naturally induces a partial order ⪰O_subgraph_size among the set of lexical items present in a core ontology O. In order to check how closely each of our introduced term abstractness measures correlates with it, we computed the gamma correlation coefficient [6] between the two partial orders (see Eq. 7). Figure 2 shows the resulting correlations. Again, the correlation level between the datasets differs, with DMOZ having the lowest values. This is consistent with the first evaluation based solely on the taxonomic relations (see Figure 1). Another consistent observation is that the measures based on the tag similarity network (bc sim and cc sim) show the weakest performance. The globally best value is found for the subsumption model compared against the WikiTaxonomy (0.5); for the remaining conditions, almost all correlation values lie in the range between 0.25 and 0.4, and hence correlate only weakly.
5.5 Discussion
Our primary goal during the evaluation was to check if folksonomy-based term
abstractness measures are able to make reliable judgments about the relative ab-
stractness level of terms. A first consistent observation is that measures based on
frequency, entropy or centrality in the tag cooccurrence graph do exhibit a cor-
relation to the abstractness information encoded in gold-standard-taxonomies.
One exception is DMOZ, for which almost all measures exhibit only very weak
correlation values. We attribute this to the fact that the semantics of the DMOZ
topic hierarchy is much less precise compared to the other grounding datasets;
as an example, the category Top/Computers/Multimedia/Music and Audio/Software/Java hardly implies that Software "is a kind of" Music and Audio. WordNet, on the contrary, subsumes the term Java (among others) under taxonomically much more precise parents: [...] > communication > language > artificial language > programming language > java. The same holds for Yago, and the WikiTaxonomy was also built
with a strong focus on is-a relations [21]. This is actually an interesting obser-
vation: Despite the fact that both DMOZ and Delicious were built for similar
(Fig. 2. Gamma correlation of each abstractness measure with the subgraph-size-based order derived from each grounding dataset; correlation values range between 0 and 0.8.)
6 Conclusions
In this paper, we performed a systematic analysis of folksonomy-based term ab-
stractness measures. To this end, we first provided a common terminology to
subsume the notion of term abstractness in folksonomies and core ontologies.
We then contributed a methodology to compare the abstractness information
contained in each of our analyzed measures to established taxonomies, namely
WordNet, Yago, DMOZ and the WikiTaxonomy. Our results suggest that cen-
trality and entropy measures can differentiate well between abstract and concrete
terms. In addition, we have provided evidence that the tag cooccurrence graph is a more valuable input to centrality measures than tag similarity graphs when measuring abstractness. Apart from that, we also shed light on the tag generality vs. popularity problem by showing that, in fact, popularity seems to be a fairly good indicator of the “true” generality of a given tag. These insights are useful for all kinds of applications that benefit from a deeper understanding of tag semantics. As an example, tag recommendation engines could take generality information into account in order to improve their predictions, and folksonomy navigation facilities could offer new ways of browsing towards more general or more specific terms. Finally, our results inform the design of algorithms
geared towards making the implicit semantics in folksonomies explicit.
As next steps, we plan to apply our measures to identify generalists and spe-
cialists in social tagging systems. A possible hypothesis here is that specialists
use a more specific vocabulary whereas generalists rely mainly on abstract tags.
Acknowledgments. We would like to thank Dr. Denis Helic and Beate Krause
for fruitful discussions during the creation of this work. The research presented
in this work is in part funded by the Know-Center, the FWF Austrian Science
Fund Grant P20269, the European Commission as part of the FP7 Marie Curie
IAPP project TEAM (grant no. 251514), the WebZubi project funded by BMBF
and the VENUS project funded by Land Hessen.
References
1. Allen, R., Wu, Y.: Generality of texts. In: Digital Libraries: People, Knowledge,
and Technology. LNCS, Springer, Heidelberg (2010)
2. Benz, D., Hotho, A., Stumme, G.: Semantics made by you and me: Self-emerging
ontologies can capture the diversity of shared knowledge. In: Proc. of WebSci 2010,
Raleigh, NC, USA (2010)
3. Bozsak, E., Ehrig, M., Handschuh, S., Hotho, A., Maedche, A., Motik, B., Oberle,
D., Schmitz, C., Staab, S., Stojanovic, L., Stojanovic, N., Studer, R., Stumme, G.,
Sure, Y., Tane, J., Volz, R., Zacharias, V.: KAON - Towards a Large Scale Semantic
Web. In: Bauknecht, K., Tjoa, A.M., Quirchmayr, G. (eds.) EC-Web 2002. LNCS,
vol. 2455, pp. 304–313. Springer, Heidelberg (2002)
Semantic Enrichment of Twitter Posts for User Profile Construction
2 Related Work
Since Twitter was launched in 2007 research started to investigate the phe-
nomenon of microblogging. Most research on Twitter investigates the network
structure and properties of the Twitter network, e.g. [4,5,6,7]. Kwak et al. con-
ducted a temporal analysis of trending topics in Twitter and discovered that
over 85% of the tweets posted every day are related to news [5]. They also show
that hashtags are good indicators to detect events and trending topics. Huang et al. analyze the semantics of hashtags in more detail and reveal that tagging in Twitter is used rather to join public discussions than to organize content for future retrieval [10]. Laniado and Mika [11] have defined metrics to characterize hashtags with respect to four dimensions: frequency, specificity, consistency, and stability over time. The combination of these measures can help to assess whether hashtags are strong representative identifiers. Efron explored the retrieval of hashtags for recommendation purposes and introduced a method which considers user interests
in a certain topic to find hashtags that are often applied to posts related to this
topic [12]. In this paper, we compare hashtag-based methods with methods that
extract and analyze the semantics of tweets. While SMOB [13], the semantic mi-
croblogging framework, enables users to explicitly attach semantic annotations
(URIs) to their short messages by applying MOAT [14] and therewith allows for
making the meaning of (hash)tags explicit, our ambition is to infer the semantics
of individual Twitter activities automatically.
Research on information retrieval and personalization in Twitter has focused on
ranking users and content. For example, Cha et al. [7] present an in-depth com-
parison of three measures of influence, in-degree, re-tweets, and mentions, to
identify and rank influential users. Based on these measures, they also investi-
gate the dynamics of user influence across topics and time. Weng et al. [6] focus
on identifying influential users of microblogging services as well. They reveal
that the presence of reciprocity can be explained by the phenomenon of homophily,
i.e. people who are similar are likely to follow each other. Content recommen-
dations in Twitter aim at evaluating the importance of information for a given
user and directing the user’s attention to certain items. Dong et al. [3] propose a
method to use microblogging streams to detect fresh URLs mentioned in Twit-
ter messages and compute rankings of these URLs. Chen et al. also focus on
recommending URLs posted in Twitter messages and propose to structure the
problem of content recommendations into three separate dimensions [15]: discov-
ering the source of content, modeling the interests of the users to rank content
and exploiting the social network structure to adjust the ranking according to
the general popularity of the items. Chen et al. however do not investigate user
modeling in detail, but represent users and their tweets by means of a bag of
words, from which they remove stop-words. In this paper we go beyond bag-
of-word representations and link tweets to news articles from which we extract
entities to generate semantically more meaningful user profiles.
Interweaving traditional news media and social media is the goal of research
projects such as SYNC32 , which aims to enrich news events with opinions from
the blogosphere. Twitris 2.0 [16] is a Semantic Web platform that connects event-
related Twitter messages with other media such as YouTube videos and Google
News. Using Twarql [17] for the detection of DBpedia entities and making the
semantics of hashtags explicit (via tagdef 3 ), it captures the semantics of major
news events. TwitterStand [18] also analyzes the Twitter network to capture
tweets that correspond to late breaking news. Such analyses on certain news
events, such as the election in Iran 2009 [2] or the earthquake in Chile 2010 [19],
have also been conducted by other related work. However, analyzing the feasi-
bility of linking individual tweets with news articles for enriching and contex-
tualizing the semantics of user activities on Twitter to generate valuable user
profiles for the Social Web – which is the main contribution of this paper – has
not been researched yet.
2 http://www.sync3.eu
3 http://tagdef.com
Fig. 1. Generic solution for semantic enrichment of tweets and user profile construction:
(a) generic architecture and (b) example of processing tweets and news articles
Linkage. The challenge of linking tweets and news articles is to identify those articles that a certain Twitter message refers to. Sometimes, users explicitly link to the corresponding Web sites, but often there is no hyperlink within a Twitter message, which requires more advanced strategies. In Section 4 we
introduce and evaluate different strategies that allow for the discovery of
relations between tweets and news articles.
Semantic Enrichment. Given the content of tweets and news articles, another
challenge is to extract valuable semantics from the textual content. Further,
when processing news article Web sites, an additional challenge is to extract
the main content of the news article. While RSS facilitates aggregation of
news articles, the main content of a news article is often not embedded
within the RSS feed, but is available via the corresponding HTML-formatted
Web site. These Web sites contain supplemental content (boilerplate) such
as navigation menus, advertisements or comments provided by readers of the
article. To extract the main content of news articles we use BoilerPipe [20], a library that applies linguistic rules to separate main content from the boilerplate (a minimal sketch of this step is given after this list).
In order to support user modeling and personalization it is important to –
given the raw content of tweets and news articles – distill topics and extract
entities users are concerned with. We therefore utilize Web services provided
by OpenCalais4 , which allow for the extraction of entities such as people,
organizations or events and moreover assign unique URIs to known entities
and topics.
The connections between the semantically enriched news articles and Twitter
posts enable us to construct a rich RDF graph that represents the microblog-
ging activities in a semantically well-defined context.
User Modeling. Based on the RDF graph, which connects Twitter posts, news
articles, related entities and topics, we introduce and analyze user modeling
strategies that create semantically rich user profiles describing different facets
of the users (see Section 5).
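The sketch below (referred to in the Semantic Enrichment item above) illustrates the enrichment step only; it is not the authors' implementation. It assumes the boilerpy3 package as a Python port of the BoilerPipe approach, and extract_entities is a hypothetical stand-in for a call to an OpenCalais-style entity-extraction service whose real API is not reproduced here.

```python
# Illustrative sketch of the semantic enrichment step (not the authors' code).
from boilerpy3 import extractors  # assumed Python port of the BoilerPipe idea

def extract_entities(text):
    """Hypothetical placeholder: would call an entity-extraction Web service
    (e.g. an OpenCalais-style API) and return (entity_uri, label) pairs."""
    raise NotImplementedError

def enrich(tweet_text, news_html):
    # Separate the main article text from navigation menus, ads and comments.
    article_text = extractors.ArticleExtractor().get_content(news_html)
    # Extract entities/topics from both the tweet and the linked news article.
    return {
        "tweet_entities": extract_entities(tweet_text),
        "news_entities": extract_entities(article_text),
    }
```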
Figure 1(b) further illustrates our generic solution by means of an example
taken from our dataset: a user is posting a message about the election of the
sportsman of the year and states that she supports Francesca Schiavone, an
Italian tennis player. The Twitter message itself just mentions the given name
francesca and indicates with a hashtag (#sport ) that this post is related to
sports. Hence, given just the text from this Twitter message it is not possi-
ble to automatically infer that the user is concerned with the tennis player.
Given our linkage strategies (see Section 4), one can relate the Twitter mes-
sage with a corresponding news article published by CNN, which reports on the SI sportsman election and Francesca Schiavone in particular. Entity and topic
recognition reveal that the article is about tennis (topic:Tennis) and Schiavone’s
(person:Francesca Schiavone) success at French Open (event:FrenchOpen) and
therewith enrich the semantics which can be extracted from the Twitter message
itself (topic:Sports).
4 http://www.opencalais.com
The bag-of-words strategy thus compares a tweet t with every news article in
N and chooses the most similar ones to build a relation between t and the
corresponding article n. TF × IDF is applied to measure the similarity. Given a Twitter post t and a news article n, the term frequency TF_i of a term i (with α_i > 0 in the vector representation of t) is β_i, i.e. the number of occurrences of the word i in n. And IDF_i, the inverse document frequency, is IDF_i = 1 + log(|N| / (|{n ∈ N : β_i > 0}| + 1)), where |{n ∈ N : β_i > 0}| is the number of news articles in which the term i appears. Given TF and IDF, the similarity between t and n is calculated as follows:

sim(t, n) = Σ_{i=1}^{m} TF_i · IDF_i    (1)
The hashtag-based strategy relates a tweet t (represented via its hashtags) with the news article n for which the TF × IDF score is maximized: (t, n) ∈ R_h, where R_h ⊆ T × N.
While the hashtag-based strategy thus varies the style of representing Twitter
messages, the entity-based strategy introduces a new approach for representing
news articles.
Definition 5 (Entity-based strategy). Twitter posts t ∈ T are represented by a vector t = (α_1, α_2, ..., α_m), where α_i is the frequency of a word i in t and m denotes the total number of words in t. Each news article n ∈ N is represented by means of a vector n = (β_1, β_2, ..., β_k), where β_i is the frequency of an entity within the news article, i is the label of the entity and k denotes the total number of distinct entities in the news article n.
The entity-based strategy relates the Twitter post t (represented via bag-of-words) with the news article n (represented via the labels of entities mentioned in n) for which the TF × IDF score is maximized: (t, n) ∈ R_e, where R_e ⊆ T × N.
Entities are extracted by exploiting OpenCalais as described in Section 3. For
the hashtag- and entity-based strategies, we thus use Equation 1 to generate
a set of candidate tweet-news pairs and then filter out those pairs that do not fulfill the temporal constraint that prescribes that the tweet and
news article should be published within a time span of two days. Such temporal
constraints may reduce the recall but have a positive effect on the precision as
we will see in our analysis below.
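A compact sketch of this candidate generation and temporal filtering is given below; it follows Equation 1 but is not the authors' code, and the dictionary-based article representation is an assumption made for illustration.

```python
# Sketch of the linkage step: rank news articles for a tweet by the TF x IDF
# score of Equation 1 and keep only articles published within two days.
import math
from datetime import timedelta

def tfidf_score(tweet_terms, article_terms, num_articles, doc_freq):
    # Sum over the tweet's terms (alpha_i > 0) of TF_i * IDF_i, where TF_i is
    # the number of occurrences of term i in the article (beta_i).
    score = 0.0
    for term in set(tweet_terms):
        tf = article_terms.count(term)
        idf = 1 + math.log(num_articles / (doc_freq.get(term, 0) + 1))
        score += tf * idf
    return score

def link_tweet(tweet, articles, doc_freq, window=timedelta(days=2)):
    # tweet/articles are dicts with "terms" (token list) and "published" (datetime).
    candidates = [a for a in articles
                  if abs(a["published"] - tweet["published"]) <= window]
    if not candidates:
        return None
    return max(candidates, key=lambda a: tfidf_score(
        tweet["terms"], a["terms"], len(articles), doc_freq))
```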
Fig. 2. Number of tweets per user u ∈ U as well as the number of interactions (re-tweeting or reply activities) with Twitter accounts maintained by mainstream news media
" '!.(**
' '!( '!) '!* '!+ '!, '!- '!. '!/ '!0
Fig. 3. Precision of different strategies for relating Twitter messages with news articles.
(considered to refer to accurate).
Given this ground truth of correct tweet-news relations, we compare the preci-
sion of the different strategies, i.e. the fraction of correctly generated tweet-news
relations. Figure 3 plots the results and shows that the URL-based strategies
perform best with a precision of 80.59% (strict) and 78.8% (lenient) respec-
tively. The naive content-based strategy, which utilizes the entire Twitter mes-
sage (excluding stop-words) as search query and applies TFxIDF to rank the
news articles, performs worst and is clearly outperformed by all other strategies.
It is interesting to see that the entity-based strategy, which considers the pub-
lishing date of the Twitter message and news article, is nearly as good as the
lenient URL-based strategy and clearly outperforms the hashtag-based strategy,
which uses the temporal constraints as well. Even without considering tempo-
ral constraints, the entity-based strategy results in higher accuracy than the
hashtag-based strategy. We conclude that the set of entities mentioned in a news article and in a Twitter message respectively, i.e. the number of shared entities, is a good indicator for relating tweets and news articles.
Figure 4 shows the coverage of the strategies, i.e. the number of tweets per
user, for which the corresponding strategy found an appropriate news article.
The URL-based strategies, which achieve the highest accuracy, are very restric-
tive: for less than 1000 users the number of tweets that are connected to news
articles is higher than 10. The coverage of the lenient URL-based strategy is
clearly higher than for the strict one, which can be explained by the number of
interactions with Twitter accounts from mainstream news media (see Figure 2).
The hashtag-based and entity-based strategies even allow for a far higher number of tweet-news pairs. However, the hashtag-based strategy fails to relate
tweets for more than 79% of the users, because most of these people do not make
use of hashtags. By contrast, the entity-based strategy is applicable for the great majority of people and, given that it showed an accuracy of more than 70%, can be considered the most successful strategy.
Combining all strategies results in the highest coverage: for more than 20%
of the users, the number of tweet-news relations is higher than 10. In the next
section we will show that, given these tweet-news relations, we can create rich profiles that go beyond the variety of profiles constructed from the tweets of the users alone.
Fig. 4. Number of tweets per user that are related to news articles according to the different strategies
Based on the linkage of Twitter activities with news articles, we can exploit the
semantics embodied in the news articles to create and enrich user profiles. In this
section, we first present approaches for user modeling based on Twitter activities
and then analyze the impact of exploiting related news articles for user profile
construction in Twitter.
In Twitter, a naive strategy for computing a weight w(u, e) is to count the num-
ber of u’s tweets that refer to the given entity e. |P(u)| denotes the number of distinct entities that appear in a profile P(u). While entity-based profiles represent a user in a detailed and fine-grained fashion, topic-based profiles describe a user’s interest in topics such as sports, politics or technology and can be specified analogously (see Definition 7).
From a technical point of view, both types of profiles specify the interest of a user in a certain URI, which represents an entity or topic respectively. Given the URI-based representation, the entity- and topic-based profiles become part of the Web of Linked Data and can therewith not only be applied for personalization purposes in Twitter (e.g., recommendations of tweet messages or information streams to follow) but in other systems as well. For the construction of entity-
and topic-based profiles we compare the following two strategies.
Tweet-based. The tweet-based baseline strategy constructs entity- and topic-
based user profiles by considering only the Twitter messages posted by a
user, i.e. the first step of our user modeling approach depicted in Figure 1(a)
is omitted so that tweets are not linked to news articles. Entities and topics
are directly extracted from tweets using OpenCalais. The weight of an entity
corresponds to the number of tweets, from which an entity was successfully
extracted, and the weight of a topic corresponds to the number of tweets,
which were categorized with the given topic.
News-based. The news-based user modeling strategy applies the full pipeline of
our architecture for constructing the user profiles (see Figure 1(a)). Twitter
messages are linked to news articles by combining the URL-based and entity-
based (with temporal restrictions) strategies introduced in Section 4 and
entities and topics are extracted from the news articles, which have been
linked with the Twitter activities of the given user. The weights correspond
again to the number of Twitter activities which relate to an entity and topic
respectively.
Our hypothesis is that the news-based user modeling strategy, which benefits
from the linkage of Twitter messages with news articles, creates more valuable
profiles than the tweet-based strategy.
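The two strategies can be contrasted with the simplified sketch below; it is not the authors' implementation, and the helper functions tweet_entities, linked_article and news_entities are hypothetical stand-ins for the extraction and linkage components described earlier.

```python
# Simplified sketch of the two profile construction strategies. A profile maps
# entity URIs to weights; the weight of an entity is the number of the user's
# Twitter activities that refer to it.
from collections import Counter

def tweet_based_profile(tweets, tweet_entities):
    # tweet_entities(tweet) -> set of entity URIs extracted directly from the tweet
    profile = Counter()
    for tweet in tweets:
        profile.update(tweet_entities(tweet))
    return profile

def news_based_profile(tweets, linked_article, news_entities):
    # linked_article(tweet) -> related news article or None;
    # news_entities(article) -> set of entity URIs extracted from that article
    profile = Counter()
    for tweet in tweets:
        article = linked_article(tweet)
        if article is not None:
            profile.update(news_entities(article))
    return profile
```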
[Figure: plots over the user profiles (x-axis: user profiles); remaining figure content not recovered.]
Profiles built with the tweet-based strategy typically cover just less than four types of entities (mostly persons and organizations), while for
the news-based strategy more than 50% of the profiles reveal interests in more
than 20 types of entities. For example, they show that users are – in addition
to persons or organizations – also concerned with certain events or products.
The news-based strategy, i.e. the complete profile construction pipeline proposed in Figure 1, thus allows for the construction of profiles that cover different facets of interest, which increases the number of applications that can be built on top
of our user modeling approaches (e.g., product recommendations).
Related research stresses the role of hashtags as valuable descriptors [11,12,10]. However, a comparison between hashtag-based profiles and entity-
based profiles created via the news-based strategy shows that for user modeling
on Twitter, hashtags seem to be a less valuable source of information. Figure 5(d)
reveals that the number of distinct hashtags available in the corresponding user
profiles is much smaller than the number of distinct entities that are discovered
with our strategy, which relates Twitter messages with news articles. Given that
each named entity as well as each topic of an entity- and topic-based user profile
has a URI, the semantic expressiveness of profiles generated with the news-based
user modeling strategy is much higher than for the hashtag-based profiles.
References
1. Sakaki, T., Okazaki, M., Matsuo, Y.: Earthquake shakes Twitter users: real-time
event detection by social sensors. In: Proc. of 19th Int. Conf. on World Wide Web,
pp. 851–860. ACM, New York (2010)
2. Gaffney, D.: #iranElection: quantifying online activism. In: Proc. of the WebSci10:
Extending the Frontiers of Society On-Line (2010)
3. Dong, A., Zhang, R., Kolari, P., Bai, J., Diaz, F., Chang, Y., Zheng, Z., Zha, H.:
Time is of the essence: improving recency ranking using Twitter data. In: Proc. of
19th Int. Conf. on World Wide Web, pp. 331–340. ACM, New York (2010)
4. Lerman, K., Ghosh, R.: Information contagion: an empirical study of spread of
news on Digg and Twitter social networks. In: Cohen, W.W., Gosling, S. (eds.)
Proc. of 4th Int. Conf. on Weblogs and Social Media. AAAI Press, Menlo Park
(2010)
5. Kwak, H., Lee, C., Park, H., Moon, S.: What is Twitter, a social network or a news
media? In: Proc. of the 19th Int. Conf. on World Wide Web, pp. 591–600. ACM,
New York (2010)
5 Code and further results: http://wis.ewi.tudelft.nl/umap2011/
6. Weng, J., Lim, E.P., Jiang, J., He, Q.: TwitterRank: finding topic-sensitive influ-
ential Twitterers. In: Davison, B.D., Suel, T., Craswell, N., Liu, B. (eds.) Proc. of
3rd ACM Int. Conf. on Web Search and Data Mining, pp. 261–270. ACM, New
York (2010)
7. Cha, M., Haddadi, H., Benevenuto, F., Gummadi, P.K.: Measuring user influence
in twitter: The million follower fallacy. In: Cohen, W.W., Gosling, S. (eds.) Proc.
of 4th Int. Conf. on Weblogs and Social Media. AAAI Press, Menlo Park (2010)
8. Lee, K., Caverlee, J., Webb, S.: The social honeypot project: protecting online
communities from spammers. In: Proc. of 19th Int. Conf. on World Wide Web, pp.
1139–1140. ACM, New York (2010)
9. Lee, K., Caverlee, J., Webb, S.: Uncovering social spammers: social honeypots
+ machine learning. In: Proc. of 33rd Int. ACM SIGIR Conf. on Research and
Development in Information Retrieval, pp. 435–442. ACM, New York (2010)
10. Huang, J., Thornton, K.M., Efthimiadis, E.N.: Conversational tagging in twitter.
In: Proc. of 21st Conf. on Hypertext and Hypermedia, pp. 173–178. ACM, New
York (2010)
11. Laniado, D., Mika, P.: Making sense of twitter. In: Patel-Schneider, P.F., Pan, Y.,
Hitzler, P., Mika, P., Zhang, L., Pan, J.Z., Horrocks, I., Glimm, B. (eds.) ISWC
2010, Part I. LNCS, vol. 6496, pp. 470–485. Springer, Heidelberg (2010)
12. Efron, M.: Hashtag retrieval in a microblogging environment. In: Proc. of 33rd Int.
ACM SIGIR Conf. on Research and Development in Information Retrieval, pp.
787–788. ACM, New York (2010)
13. Passant, A., Hastrup, T., Bojars, U., Breslin, J.: Microblogging: A Semantic Web
and Distributed Approach. In: Bizer, C., Auer, S., Grimnes, G.A., Heath, T. (eds.)
Proc. of 4th Workshop Scripting For the Semantic Web (SFSW 2008) co-located
with ESWC 2008, vol. 368 (2008), CEUR-WS.org
14. Passant, A., Laublet, P.: Meaning Of A Tag: A collaborative approach to bridge
the gap between tagging and Linked Data. In: Proceedings of the WWW 2008
Workshop Linked Data on the Web (LDOW 2008), Beijing, China (2008)
15. Chen, J., Nairn, R., Nelson, L., Bernstein, M., Chi, E.: Short and tweet: experi-
ments on recommending content from information streams. In: Proc. of 28th Int.
Conf. on Human Factors in Computing Systems, pp. 1185–1194. ACM, New York
(2010)
16. Jadhav, A., Purohit, H., Kapanipathi, P., Ananthram, P., Ranabahu, A., Nguyen,
V., Mendes, P.N., Smith, A.G., Cooney, M., Sheth, A.: Twitris 2.0: Semantically
empowered system for understanding perceptions from social data. In: Proc. of the
Int. Semantic Web Challenge (2010)
17. Mendes, P.N., Passant, A., Kapanipathi, P.: Twarql: tapping into the wisdom of
the crowd. In: Proc. of the 6th International Conference on Semantic Systems, pp.
45:1–45:3. ACM, New York (2010)
18. Sankaranarayanan, J., Samet, H., Teitler, B.E., Lieberman, M.D., Sperling, J.:
Twitterstand: news in tweets. In: Proc. of 17th ACM SIGSPATIAL International
Conference on Advances in Geographic Information Systems, pp. 42–51. ACM,
New York (2009)
19. Mendoza, M., Poblete, B., Castillo, C.: Twitter Under Crisis: Can we trust what
we RT? In: Proc. of 1st Workshop on Social Media Analytics (SOMA 2010). ACM
Press, New York (2010)
20. Kohlschütter, C., Fankhauser, P., Nejdl, W.: Boilerplate detection using shallow
text features. In: Proc. of 3rd ACM Int. Conf. on Web Search and Data Mining,
pp. 441–450. ACM, New York (2010)
Improving Categorisation in Social Media Using
Hyperlinks to Structured Data Sources
1 Introduction
Social media such as blogs, discussion forums, micro-blogging services and social-
networking sites have grown significantly in popularity in recent years. By low-
ering the barriers to online communication, social media enables users to easily
access and share content, news, opinions and information in general. Recent re-
search has investigated how microblogging services such as Twitter enable real-
time, first-hand reporting of news events [15] and how question-answering sites
such as Yahoo! Answers allow users to ask questions on any topic and receive
community-evaluated answers [1]. Social media sites like these are generating
huge amounts of user-generated content and are becoming a valuable source of
information for the average Web user.
The work presented in this paper has been funded in part by Science Foundation
Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).
2 Related Work
anchor text for Web search [17]. Previous studies in the field of Web document
categorisation have proven that the classification of webpages can be boosted by
taking into account the text of neighbouring webpages ([2], [16]). Our work differs
in that we focus on social media and rather than incorporating entire webpages
or parts of webpages, we include specific metadata items that have a semantic
relation to the objects discussed in a post. This approach enables us to compare
the effectiveness of different metadata types for improving classification.
Other work has looked at classifying particular types of Web objects using
metadata. Our work is related to that of Figueiredo et al. [8], who assess the
quality of various textual features in Web 2.0 sites such as YouTube for classi-
fying objects within that site. They do not use any external data. Yin et al. [21]
propose improving object classification within a website by bridging heteroge-
neous objects so that category information can be propagated from one domain
to another. They improve the classification of Amazon products by learning from
the tags of HTML documents contained within ODP1 categories.
There is previous work on classifying social media using the metadata of the
post itself. Berendt and Hanser [4] investigated automatic domain classification
of blog posts with different combinations of body, tags and title. Sun et al. [19]
showed that blog topic classification can be improved by including tags and
descriptions. Our work differs from these because we use metadata from objects
on the Web to describe a social media post that links to those objects.
There has also been related work in the area of classifying Twitter messages.
Garcia Esparza et al. [9] investigated tweet categorisation based on content.
Jansen et al. [12] classified posts relating to consumer brands according to their
sentiment. Irani et al. [11] studied the problem of identifying posters who aim
to dishonestly gain visibility by misleadingly tagging posts with popular topics.
They build models that correspond to topics in order to identify messages that
are tagged with a topic but are in fact spam. They take advantage of hyperlinks
by augmenting their models with text from webpages linked to within posts.
Our work differs in that we focus on the potential offered by structured Web
data and show that extracting relevant metadata gives superior results to using
entire webpages. In the Semantic Web domain, Stankovic et al. [18] proposed a
method for mapping conference-related posts to their corresponding talks and
then to relevant DBpedia topics, enabling the posts to be more easily searched.
A relevant study that used structured data from hyperlinks in social media
was performed by Cha et al. [7] who used hyperlinks in a blog dataset to study
information propagation in the blogosphere. They downloaded the metadata of
YouTube videos and analysed the popularity of categories, the age distribution
of videos, and the diffusion patterns of different categories of videos.
Fig. 1. (a) Web sources that were used to enrich social media data; (b) example of how a social media post can be enhanced with external structured data relating to it
4 Data Corpus
In our experiments, we use datasets originating from two different types of social
media: Forum, from an online discussion forum, and Twitter, from the microblog-
ging site2 . We examined the domains linked to in each dataset and identified the
most common sources of structured data. We extracted the posts that contained
hyperlinks to these sources, and for each hyperlink we retrieved the related meta-
data as well as the corresponding HTML page. An identical pre-processing step
was applied to each text source - post content, metadata and HTML documents.
All text was lower-cased, non-alphabetic characters were omitted and stopwords
were removed. Table 1 shows the average number of unique tokens remaining
per post for each text source after preprocessing. The discrepancy in lengths of
metadata and HTML between datasets is due to differences in the distribution
of domains linked to in each dataset. Wikipedia articles, for example, tend to
have particularly long HTML documents.
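A minimal version of this pre-processing step might look as follows; the stopword list shown here is only illustrative and not the one used in the experiments.

```python
# Sketch of the pre-processing applied to every text source (post content,
# metadata, HTML): lower-case, keep alphabetic tokens only, remove stopwords.
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}  # illustrative

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("Check out the new Wikipedia article!"))
# ['check', 'out', 'new', 'wikipedia', 'article']
```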
We now describe each stage of the data collection in more detail. The process
of metadata collection for the forum dataset is described further in [13].
Forum dataset. We use the corpus from the 2008 boards.ie SIOC Data Com-
petition3 , which covers ten years of discussion forum posts represented in the
SIOC (Semantically-Interlinked Online Communities) format [6]. Each post be-
longs to a thread, or conversation, and each thread belongs to a forum, which
typically covers one particular area of interest. Our analysis considers only the
posts contained in the final year of the dataset, since the more recent posts con-
tain more links to structured data sources. From the most common domains in
the dataset we identified MySpace, IMDB and Wikipedia as sources of Linked
Data, via third-party data publishers detailed later in this section. We iden-
tified Amazon, YouTube and Flickr as sources of metadata via APIs. We use
forum titles as categories for the classification experiments, since authors gen-
erally choose a forum to post in according to the topic of the post. We selected
ten forums for these experiments based on the criteria that they were among the
most popular forums in the dataset and they each have a clear topic (as opposed
to general “chat” forums). The percentage of posts that have hyperlinks varies
between forums, from 4% in Poker to 14% in Musicians, with an average of 8%
across forums. These are a minority of posts; however, we believe they are worth
focusing on because the presence of a hyperlink often indicates that the post is a
Table 1. Average unique tokens from each text source (± standard deviation)
2 http://twitter.com/, accessed March 2011.
3 http://data.sioc-project.org/, accessed March 2011.
useful source of information rather than just chat. Of the posts with hyperlinks,
we focus on the 23% that link to one or more of the structured data sources
listed previously. For the 23% of posts that have a title, this is included as part
of the post content. Since discussion forums are typically already categorised,
performing topic classification is not usually necessary. However, this data is
representative of the short, informal discussion systems that are increasingly
found on Web 2.0 sites, so the results obtained from utilising the class labels in
this dataset should be applicable to similar uncategorised social media sites.
Twitter dataset. The Twitter dataset4 comes from Yang and Leskovec [20],
and covers 476 million posts from June 2009 to December 2009. Twitter is a mi-
croblogging site that allows users to post 140 character status messages (tweets)
to other users who subscribe to their updates. Due to the post length restriction,
Twitter users make frequent use of URL shortening services such as bit.ly5 to
substantially shorten URLs in order to save space. Therefore for this dataset it
was necessary to first decode short URLs via cURL6 . From the most common
domains we identified Amazon, YouTube and Flickr as sources of metadata via
APIs. Like many social media websites, but in contrast to the previous dataset,
Twitter does not provide a formal method for categorising tweets. However, a
convention has evolved among users to tag updates with topics using words or
phrases prefixed by a hash symbol (#). We make use of these hashtags to create
six categories for classification experiments. Our approach borrows the hashtag-
to-category mapping method from Garcia Esparza et al. [9] to identify tweets that
relate to selected categories. We reuse and extend the hashtag categories of [9];
Table 2 shows the mappings between hashtags and categories. These categories
were chosen because they occur with a high frequency in the dataset and they
have a concrete topic. Tweets belonging to more than one category were omitted,
since our goal is to assign items to a single category. All hashtags were removed
from tweets, including those that do not feature in Table 2, since they may also
contain category information. Any URLs to websites other than the selected
metadata sources were eliminated from tweets. Finally, to avoid repeated posts
caused by users retweeting (resending another post), all retweets were omitted.
External metadata. Amazon product, Flickr photo and YouTube video meta-
data was retrieved from the respective APIs. MySpace music artist information
was obtained from DBTune7 (an RDF wrapper of various musical sources in-
cluding MySpace), IMDB movie information from LinkedMDB8 (a movie dataset
with links to IMDB) and Wikipedia article information from DBpedia9 . The lat-
ter three services are part of the Linking Open Data project [5]. The number
4 http://snap.stanford.edu/data/twitter7.html, accessed March 2011.
5 http://bit.ly, accessed March 2011.
6 http://curl.haxx.se/, accessed March 2011.
7 http://dbtune.org/, accessed March 2011.
8 http://linkedmdb.org/, accessed March 2011.
9 http://dbpedia.org/, accessed March 2011.
Table 2. Mappings between hashtags and categories

Category      #hashtags
Books         book, books, comic, comics, bookreview, reading, readingnow, literature
Games         game, pcgames, videogames, gaming, gamer, xbox, psp, wii
Movies        movie, movies, film, films, cinema
Photography   photography, photo
Politics      politics
Sports        nfl, sports, sport, football, f1, fitness, nba, golf
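Using the mappings of Table 2, assigning tweets to categories and stripping hashtags can be sketched as follows; the excerpted mapping dictionary and the regular expressions are assumptions made for illustration, not the authors' code.

```python
# Sketch: label a tweet via the hashtag-to-category mappings of Table 2, omit
# tweets matching several categories, then remove all hashtags from the text.
import re

HASHTAG_CATEGORIES = {  # excerpt of Table 2
    "book": "Books", "books": "Books", "comic": "Books",
    "movie": "Movies", "film": "Movies", "cinema": "Movies",
    "politics": "Politics",
    "nfl": "Sports", "football": "Sports", "f1": "Sports",
}

def categorise(tweet):
    tags = re.findall(r"#(\w+)", tweet.lower())
    categories = {HASHTAG_CATEGORIES[t] for t in tags if t in HASHTAG_CATEGORIES}
    if len(categories) != 1:
        return None, None                        # uncategorised or ambiguous
    cleaned = re.sub(r"#\w+", "", tweet).strip() # hashtags may carry category info
    return categories.pop(), cleaned

print(categorise("Watching the #F1 race"))     # ('Sports', 'Watching the  race')
print(categorise("Great #film on #politics"))  # (None, None): two categories
```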
[Bar charts: number of posts per object type; legend: Video, Product, Photo, Music Artist, Movie, Article.]
Fig. 2. No. of posts containing links to each type of object for (a) Forum, (b) Twitter
of posts containing links to each type of object in the Forum dataset is shown
in Figure 2(a), and the number of posts containing links to each type of object
for Twitter is shown in Figure 2(b). For the Forum dataset, hyperlinks to mu-
sic artists occur mainly in the Musicians forum, movies in the Films forum, and
photos in the Photography forum. The other object types are spread more evenly
between the remaining seven forums. In total, Forum contains 6,626 posts and
Twitter contains 2,415 posts. Note that in rare cases in Forum, a post contains
links to multiple object types, in which case that post is included twice in a col-
umn. Therefore the total counts in Figure 2(a) are inflated by approximately 1%.
For our analysis, we select only the most commonly available metadata types in
order to make comparisons between them, but our method could be applied us-
ing arbitrary metadata. The metadata types that we chose were Title, Category
(includes music/movie genre), Description (includes Wikipedia abstract), Tags
and Author/Director (for Amazon books and IMDB movies only).
We now investigate some features of the metadata that was collected for the
Forum dataset. Statistics are not reported for Twitter due to space constraints.
Note that this analysis was performed after pre-processing the metadata text.
The first section of Table 3 shows the percentage of non-empty metadata for
each type of object. This is of interest since a metadata type that occurs rarely
will have limited usefulness. Due to the unique features of each website, not
every object type can have every metadata type. There are large variations in
the percentage of non-empty features for different metadata types. Titles are
typically essential to identify an object and categories are typically required by
a website’s browsing interface, so these features are almost always present. For
user-generated content, the frequency of non-empty fields depends on whether
the field is mandatory. For example, tags are often absent in Flickr because
they are optional, while for videos they are almost always present because in the
absence of user-provided tags, YouTube automatically assigns tags. For products,
the author feature is often empty since this field is only available for books.
For movies, the director feature is sometimes empty, presumably due to some
inconsistencies in the various sources from which LinkedMDB integrates data.
The second section of Table 3 shows the average number of unique tokens
found in non-empty metadata fields. These figures are an indicator of how much
information each feature provides. In general, titles and authors/directors pro-
vide few tokens since they are quite short. For categories, the number of tokens
depends on whether the website allows multiple categories (e.g., Wikipedia) or
single categories (e.g., YouTube). The number of unique tokens obtained from
descriptions and tags are quite similar across all object types studied.
The third section of Table 3 gives the average percentage of unique tokens from
metadata that do not occur in post content. This section is important since it
shows which features tend to provide novel information. Note that for article
titles, the percentage is zero since all titles are contained within the article’s
URL. For music artist titles, the figure is low since bands often use their title
as their username, which is contained within the artist’s URL. All other object
types have URLs that are independent of the object properties. This section
also allows us to see how users typically describe an object. For example, 40%
of the tokens from product titles are novel, indicating that posters often do
not precisely name the products that they link to. For the subset of products
that are books, 23% of tokens from titles were novel. Approximately 32% of the
tokens from book authors and 43% of the tokens from movie directors are novel,
showing that posters often mention these names in their posts, but that in many
other cases this is new information which can aid retrieval.
Object         Title       Category      Description    Tags        Author/Director
Average % of text features that are non-empty after pre-processing
Article 100.0 100.0 99.7 - -
Movie 100.0 100.0 - - 39.9
Music Artist 99.7 100.0 - - -
Photo 100.0 - 58.8 84.9 -
Product 100.0 100.0 - 75.2 65.4
Video 100.0 100.0 99.5 99.5 -
Average unique metadata tokens for non-empty fields (± standard deviation)
Article 2.1 ± 0.9 13.6 ± 12.1 15.8 ± 8.3 - -
Movie 1.7 ± 0.7 4.1 ± 1.8 - - 2.2 ± 0.6
Music Artist 1.8 ± 0.9 2.7 ± 0.9 - - -
Photo 2.0 ± 1.1 - 10.9 ± 17.2 6.5 ± 4.9 -
Product 5.2 ± 3.0 11.5 ± 7.8 - 5.7 ± 2.1 2.0 ± 0.4
Video 3.7 ± 1.6 1.0 ± 0.0 13.1 ± 26.3 7.2 ± 5.0 -
Average % of unique metadata tokens that are novel (do not occur in post content)
Article 0.0 78.5 68.4 - -
Movie 17.4 76.2 - - 43.3
Music Artist 10.1 85.4 - - -
Photo 72.5 - 50.3 74.6 -
Product 39.5 81.0 - 51.1 32.2
Video 62.0 95.7 78.5 74.4 -
6 Classification Experiments
In this section, we evaluate the classification of posts in the Forum and Twit-
ter datasets, based on different post representations including the original text
augmented with external metadata.
Bag of words: The same term in different sources is represented by the same
element in the document vector. For these experiments, we test different
weightings of the two sources, specifically {0.1:0.9, 0.2:0.8, ... , 0.9:0.1}. Two
vectors v1 and v2 are combined into a single vector v where a term i in v is
given by, for example, v[i] = (v1[i] × 0.1) + (v2[i] × 0.9).
Concatenate: The same term in different sources is represented by different
elements in the feature vector - i.e., “music” appearing in a post is distinct
from “music” in a HTML page. Two vectors v1 and v2 are combined into a
single vector v via concatenation, i.e., v = ⟨v1, v2⟩.
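The two combination schemes can be illustrated with a short sketch over term-weight dictionaries; this only mirrors the definitions above and is not the authors' implementation.

```python
# Combining a post representation v1 with an external representation v2
# (e.g. metadata or HTML text), following the two schemes defined above.

def bag_of_words(v1, v2, w1=0.5, w2=0.5):
    # Same term from both sources maps to the same feature:
    # v[i] = w1 * v1[i] + w2 * v2[i]
    return {t: w1 * v1.get(t, 0.0) + w2 * v2.get(t, 0.0)
            for t in set(v1) | set(v2)}

def concatenate(v1, v2):
    # Same term from different sources stays distinct, e.g. "music" in the
    # post and "music" in the metadata become separate features.
    merged = {("post", t): w for t, w in v1.items()}
    merged.update({("external", t): w for t, w in v2.items()})
    return merged

post = {"music": 2, "gig": 1}
metadata = {"music": 1, "genre": 1}
print(bag_of_words(post, metadata, 0.1, 0.9))
print(concatenate(post, metadata))
```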
Data Source              Forum                              Twitter
                         Bag of Words     Concatenate       Bag of Words     Concatenate
Content (without URLs)   0.745 ± 0.009    -                 0.722 ± 0.019    -
Content                  0.811 ± 0.008    -                 0.759 ± 0.015    -
HTML                     0.730 ± 0.007    -                 0.645 ± 0.020    -
Metadata                 0.835 ± 0.009    -                 0.683 ± 0.018    -
Content+HTML             0.832 ± 0.007    0.795 ± 0.004     0.784 ± 0.016    0.728 ± 0.016
Content+Metadata         0.899 ± 0.005    0.899 ± 0.005     0.820 ± 0.013    0.804 ± 0.018
Table 5. Results per category for Content, Metadata and Content+Metadata

Forum dataset:
Forum          Content   Metadata   Content+M’data
Musicians      0.973*    0.911      0.981
Photography    0.922*    0.844      0.953
Soccer         0.805     0.902*     0.945
Martial Arts   0.788     0.881*     0.917
Motors         0.740     0.869*     0.911
Movies         0.825     0.845*     0.881
Politics       0.791*    0.776      0.846
Poker          0.646     0.757*     0.823
Atheism        0.756*    0.732      0.821
Television     0.559     0.664*     0.716
Macro-Avgd     0.781     0.818*     0.879

Twitter dataset:
Category       Content   Metadata   Content+M’data
Books          0.804     0.836*     0.877
Photography    0.785*    0.728      0.842
Games          0.772*    0.675      0.830
Movies         0.718     0.777*     0.827
Sports         0.744*    0.563      0.781
Politics       0.685*    0.499      0.733
Macro-Avgd     0.751*    0.680      0.815
Table 5 shows the detailed results for each category, for Content, Metadata
and Content+Metadata (using the bag-of-words weighting with the best perfor-
mance). There is a large variation in classification results for different categories.
For post classification based on Content, Forum results vary from 0.973 down to
0.559 and Twitter results vary from 0.804 down to 0.685. Despite the variation
between categories, Content+Metadata always results in the best performance.
For the two single source representations, some categories obtain better results
using Content and others using Metadata. The higher result between these two representations is marked with an asterisk in Table 5.
Table 6 shows the gains in accuracy achieved by performing classification
based on different types of metadata from Wikipedia articles and YouTube
videos, for the Forum dataset. We limit our analysis to these object types be-
cause they have consistently good coverage across all of the forums, apart from
Musicians which we excluded from this analysis. These results are based only
on the posts with links to objects that have non-empty content for every meta-
data type and amount to 1,623 posts for Wikipedia articles and 2,027 posts
for YouTube videos. We compare the results against Content (without URLs),
because Wikipedia URLs contain article titles and our aim is to measure the
effects of the inclusion of titles and other metadata. Table 6 shows that the re-
sults for different metadata types vary considerably. For posts containing links
to Wikipedia articles, the article categories alone result in a better classification
of the post’s topic than the original post content, with an F1 of 0.811 compared
to 0.761. Likewise, for posts that contain links to YouTube videos, the video tags
provide a much better indicator of the post topic than the actual post text. The
Content+Metadata column shows results where each metadata type was com-
bined with post content (without URLs), using a bag-of-words representation
with 0.5:0.5 weightings. Every metadata type examined improved post classifi-
cation relative to the post content alone. However some metadata types improve
the results significantly more than others, with Content+Category achieving the
best scores for articles, and Content+Tags achieving the best scores for videos.
7 Discussion
The usage of external information from hyperlinks for categorisation or retrieval
on the Web is a well-established technique. Our experiments show that categori-
sation of social media posts can be improved by making use of semantically-rich
data sources where the most relevant data items can be experimentally iden-
tified. Both datasets showed similar patterns, although the Twitter scores are
consistently lower. It may be that the Twitter hashtags are not as accurate de-
scriptors of topic as the forum categories. Also, for Forum the external metadata
is a better indicator of the category than the post content while for Twitter the
reverse is true. This may be partially due to the fact that the distribution of
domains linked to in each dataset is different and some domains may provide
more useful information than others, either within URLs or within metadata.
We also observe that results vary considerably depending on the topic that
is under discussion. For example in Forum, classification of a post in the Mu-
sicians forum is trivial, since almost all posts that feature a link to MySpace
belong here. In contrast, the classification of a Television forum post is much
more challenging, because this forum mentions a wide variety of topics which
are televised. We also note that some topics achieve better classification results
using only external metadata but others have better results with the original
content. In the case of the Musicians and Photography forums, the good results
for Content may be due to the fact that links to MySpace are highly indicative
of the Musicians forum, and links to Flickr are usually from the Photography
forum. The Politics and Atheism forums also achieve better results based on post
content - this may be because they have a high percentage of links to Wikipedia
articles, whose URLs include title information. We can conclude that for posts whose
hyperlinks contain such useful indicators, the addition of external metadata may
give only a slight improvement, but for posts whose URLs do not give such ex-
plicit clues, the addition of external metadata can be an important advantage
for topic classification.
A major benefit of using structured data rather than HTML documents is that
it becomes possible to compare the improvements gained by integrating different
metadata types. Our results show that the effect of the addition of different
metadata types varies greatly, e.g., Wikipedia categories and descriptions are
much more useful than article titles. The benefit of different metadata types is
not consistent across sites - Wikipedia’s rich categories are far more useful than
YouTube’s limited categories. Often particular metadata types from hyperlinked
objects in a post can be a better descriptor of the post topic than the post itself,
for example YouTube tags, titles and descriptions. In these cases the structure of
the data could be exploited to highly weight the most relevant metadata types.
Thus, even classification on unstructured Web content can immediately benefit
from semantically-rich data, provided that there are hyperlinks to some of the
many websites that do provide structured data. While this paper focused on
commonly-available metadata types, our approach could be applied to arbitrary
metadata types from unknown sources, where machine-learning techniques would
be employed to automatically select and weight the most useful metadata.
In our experiments, we used the structure of the external data to identify
which types of metadata provide the most useful texts for improving classifi-
cation. In addition to providing metadata, the Linked Data sources are also
part of a rich interconnected graph with semantic links between related enti-
ties. We have shown that the textual information associated with resources can
improve categorisation, and it would be interesting to also make use of the se-
mantic links between concepts. For example, imagine a Television post contains
links to the series dbpedia:Fawlty Towers. A later post that links to the se-
ries dbpedia:Mr Bean could be classified under the same category, due to the
fact that the concepts are linked in several ways, including their genres and the
fact that they are both produced by British television channels. Just as we used
machine-learning techniques to identify the most beneficial metadata types, we
could also identify the most useful properties between entities.
Potential applications for our approach include categorisation of either new or
existing post items. For example, on a multi-forum site (i.e., one that contains a
hierarchy of individual forums categorised according to topic area), a user may
not know the best forum where they should make their post, or where it is most
likely to receive comments that are useful to the user. This can be the case where
the content is relevant to not just one forum topic but to multiple topic areas.
On post creation, the system could use previous metadata-augmented posts and any links present in the new post to suggest potential categories for this post.
Similarly, posts that have already been created but are not receiving many com-
ments could be compared against existing augmented posts to determine if they
should be located in a different topic area than they are already in.
This approach also has potential usage across different platforms. While it
may be difficult to use augmented posts from Twitter to aid with categorisation
of posts on forums due to the differing natures of microblogs and discussion
forums, there could be use cases where augmented posts from discussion forums,
news groups or mailing lists (e.g., as provided via Google Groups) could be
used to help categorisations across these heterogeneous, yet similar, platforms.
Also, the categories from augmented discussion forum posts could be used to
recommend tags or topics for new blog content at post creation time.
8 Conclusion
In this work, we have investigated the potential of using metadata from hyper-
linked objects for classifying the topic of posts in online forums and microblogs.
The approach could also be applied to other types of social media. Our exper-
iments show that post categorisation based on a combination of content and
object metadata gives significantly better results than categorisation based on
either content alone or content and hyperlinked HTML documents. We observed
that the significance of the improvement obtained from including external meta-
data varies by topic, depending on the properties of the URLs that tend to
occur within that category. We also found that different metadata types vary
in their usefulness for post classification, and some types of object metadata
are even more useful for topic classification than the actual content of the post.
We conclude that for posts that contain hyperlinks to structured data sources,
the semantically-rich descriptions of entities can be a valuable resource for post
classification. The enriched structured representation of a post as content plus
object metadata also has potential for improving search in social media.
References
1. Agichtein, E., Castillo, C., Donato, D., Gionis, A., Mishne, G.: Finding high-quality
content in social media. In: 1st Int’l Conference on Web Search and Data Mining,
WSDM 2008. ACM, New York (2008)
2. Angelova, R., Weikum, G.: Graph-based text classification: Learn from your neigh-
bors. In: 29th Int’l SIGIR Conference on Research and Development in Information
Retrieval. SIGIR 2006. ACM, New York (2006)
3. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.G.: DB-
pedia: A nucleus for a web of open data. In: Aberer, K., Choi, K.-S., Noy, N.,
Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mi-
zoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) ASWC 2007 and ISWC 2007.
LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007)
4. Berendt, B., Hanser, C.: Tags are not metadata, but “just more content”–to some
people. In: 5th Int’l Conference on Weblogs and Social Media, ICWSM 2007 (2007)
5. Bizer, C., Heath, T., Berners-Lee, T.: Linked Data - The story so far. International
Journal on Semantic Web and Information Systems 5(3) (2009)
6. Breslin, J.G., Harth, A., Bojars, U., Decker, S.: Towards semantically-interlinked
online communities. In: Gómez-Pérez, A., Euzenat, J. (eds.) ESWC 2005. LNCS,
vol. 3532, pp. 500–514. Springer, Heidelberg (2005)
7. Cha, M., Pérez, J., Haddadi, H.: Flash Floods and Ripples: The spread of media
content through the blogosphere. In: 3rd Int’l Conference on Weblogs and Social
Media, ICWSM 2009 (2009)
8. Figueiredo, F., Belém, F., Pinto, H., Almeida, J., Gonçalves, M., Fernandes, D.,
Moura, E., Cristo, M.: Evidence of quality of textual features on the Web 2.0. In:
18th Conference on Information and Knowledge Management, CIKM 2009. ACM,
New York (2009)
9. Garcia Esparza, S., O’Mahony, M.P., Smyth, B.: Towards tagging and categoriza-
tion for micro-blogs. In: 21st National Conference on Artificial Intelligence and
Cognitive Science, AICS 2010 (2010)
10. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.: The
WEKA data mining software: An update. ACM SIGKDD Exp. 11(1) (2009)
11. Irani, D., Webb, S., Pu, C., Li, K.: Study of trend-stuffing on Twitter through text
classification. In: 7th Collaboration, Electronic messaging, Anti-Abuse and Spam
Conference, CEAS 2010 (2010)
12. Jansen, B.J., Zhang, M., Sobel, K., Chowdury, A.: Twitter power: Tweets as elec-
tronic word of mouth. J. Am. Soc. Inf. Sci. 60(11) (2009)
13. Kinsella, S., Passant, A., Breslin, J.G.: Using hyperlinks to enrich message board
content with Linked Data. In: 6th Int’l Conference on Semantic Systems, I-
SEMANTICS 2010. ACM, New York (2010)
14. Kinsella, S., Passant, A., Breslin, J.G.: Topic classification in social media using
metadata from hyperlinked objects. In: Clough, P., Foley, C., Gurrin, C., Jones,
G.J.F., Kraaij, W., Lee, H., Murdock, V. (eds.) ECIR 2011. LNCS, vol. 6611, pp.
201–206. Springer, Heidelberg (2011)
15. Mendoza, M., Poblete, B., Castillo, C.: Twitter under crisis: Can we trust what we
RT? In: 1st Workshop on Social Media Analytics, SOMA 2010. ACM, New York
(2010)
16. Qi, X., Davison, B.: Classifiers without borders: Incorporating fielded text from
neighboring web pages. In: 31st Int’l SIGIR Conference on Research and Develop-
ment in Information Retrieval, SIGIR 2008. ACM, New York (2008)
17. Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search
engine. Computer Networks and ISDN Systems 30(1-7), 107–117 (1998)
18. Stankovic, M., Rowe, M., Laublet, P.: Mapping tweets to conference talks: a gold-
mine for semantics. In: 3rd Int’l Workshop on Social Data on the Web, SDoW 2010
(2010), CEUR-WS.org
19. Sun, A., Suryanto, M.A., Liu, Y.: Blog classification using tags: An empirical study.
In: Goh, D.H.-L., Cao, T.H., Sølvberg, I.T., Rasmussen, E. (eds.) ICADL 2007.
LNCS, vol. 4822, pp. 307–316. Springer, Heidelberg (2007)
20. Yang, J., Leskovec, J.: Patterns of temporal variation in online media. In: Fourth
Int’l Conference on Web Search and Data Mining, WSDM 2011. ACM, New York
(2011)
21. Yin, Z., Li, R., Mei, Q., Han, J.: Exploring social tagging graph for web object
classification. In: 15th SIGKDD Int’l Conference on Knowledge Discovery and Data
Mining, KDD 2009. ACM, New York (2009)
Predicting Discussions on the Social
Semantic Web
M. Rowe, S. Angeletou, and H. Alani
Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom
{m.c.rowe,s.angeletou,h.alani}@open.ac.uk
Abstract. Social Web platforms are quickly becoming the natural place
for people to engage in discussing current events, topics, and policies.
Analysing such discussions is of high value to analysts who are interested
in assessing up-to-the-minute public opinion, consensus, and trends. How-
ever, we have a limited understanding of how content and user features
can influence the amount of response that posts (e.g., Twitter messages)
receive, and how this can impact the growth of discussion threads. Un-
derstanding these dynamics can help users to issue better posts, and
enable analysts to make timely predictions on which discussion threads
will evolve into active ones and which are likely to wither too quickly.
In this paper we present an approach for predicting discussions on the
Social Web, by (a) identifying seed posts, then (b) making predictions
on the level of discussion that such posts will generate. We explore the
use of post-content and user features and their subsequent effects on
predictions. Our experiments produced an optimum F1 score of 0.848 for
identifying seed posts, and an average measure of 0.673 for Normalised
Discounted Cumulative Gain when predicting discussion levels.
1 Introduction
The rise of the Social Web is encouraging more and more people to use these
media to share opinions and ideas, and to engage in discussions about all kinds
of topics and current events. As a consequence, the rate at which such discus-
sions are growing, and new ones are initiated, is extremely high. The last few
years have witnessed a growing demand for tools and techniques for searching
and processing such online conversations to, for example, get a more up-to-date
analysis of public opinion on certain products or brands, identify the main topics
that the public is interested in at any given time, and gauge the popularity of certain
governmental policies and politicians. Furthermore, governments and businesses
are investing more into using social media as an effective and fast approach for
reaching out to the public, to draw their attention to new policies or products,
and to engage them in open consultations and customer support discussions.
In spite of the above, there is a general lack of intelligent techniques for
timely identification of which of the countless discussions are likely to gain more
momentum than others. Such techniques can help tools and social media analysts
overcome the great challenge of scale. For example, more than 7.4 million tweets
on Wikileaks1 were posted in just a few weeks. As early as 1997, Goldhaber [6]
introduced the concept of attention economics as a way to stress the importance
of engaging user attention in the new information era. However, there is a very
limited understanding of the role that certain user-characteristics and content-
features play in influencing the amount of response and attention generated
on these social media platforms. Understanding the impact of such features can support
interested parties in building more effective strategies for engaging with the
public on social media.
In this work we are interested in identifying the characteristics of content
posted on the Social Web that generate a high volume of attention - using the
microblogging platform Twitter as our data source. In particular, we explore
the attributes of posts (i.e., content and the author properties) that evolved into
popular discussions and therefore received a lot of attention. We then investigate
the use of such attributes for making predictions on discussion activity, and the
contribution of individual attributes to those predictions. We also present a behaviour ontology, designed to model statistical
features that our prediction techniques use from the Social Web in a common
format. More precisely, we explore the following research questions: Is it possible
to identify discussion seed posts with high-levels of accuracy? What are the key
features that describe seed posts? And: How can the level of discussion that a post
will yield be predicted? Investigation of these questions has led to the
contributions presented in this paper.
This paper is structured as follows: section 2 presents related work in the area
of discussion and activity prediction on social media. Section 3 describes our
ontology for modelling statistical features of users and posts. Section 4 presents
our method for identifying discussion seed posts, and our experiments using two
datasets of tweets. Section 5 describes our prediction method, the features used
and our prediction experiments. Conclusions and future work are covered in
section 6.
2 Related Work
A discussion begins when a post is published and a reply is posted in response, thus forming a discussion chain. Our work in this
paper addresses the issue of predicting discussion activity on the Social Semantic
Web. This entails the identification of discussion seeds as well as the response
activity and volume of the conversation they initiate. To this end, the related
literature can be studied in two partially overlapping research lines.
The first line concerns the topic of identifying high quality users and content
on the Social Web, understanding the factors that initiate attention towards
them and contribute to their popularity. Mishne and Glance [10] juxtapose the
number of comments per weblog with the number of page views and incoming
links, factors which constitute the weblog’s popularity - where the number of
comments is strongly correlated with popularity. Hsu et al. [8] present a method
to identify good quality comments on Digg stories and rank them based on
user features (e.g., number of posts, age in the system, number of friendships,
number of profile views and topic of activity) and content features (e.g., length,
informativeness, readability and complexity). Szabo and Huberman [14] use Digg
and YouTube data, exploiting the number of post votes as a feature to predict the views
of future content. Adamic et al. [1] and Bian et al. [2] use the Yahoo! question
answering service to assess the quality of questions and answers and predict the
best answer chosen by the question poster, where bespoke features are used (e.g.,
no. of best answers per user, no. of replies per question, answer length and thread
length). Ratkiewicz et al. [11] study the attention towards Wikipedia pages prior
to and after certain events, quantifying their models using the number of clicks
on a specific page. Work by Cha et al. [5] regards attention towards a user as an
indicator of influence; they study users' influence on Twitter by measuring the
number of followers, retweets and mentions. Our work extends the
current state of the art by exploring new user and content features for identifying
discussion seed posts. We focus on the use of Twitter as a source for our datasets,
where the role of user reputation and content quality are linked together.
The second line of work concerns the identification of conversation activity
on Twitter. Although a significant amount of work exists that analyses the phe-
nomenon of microblogging in many levels, in this work we focus on the ones
that study discussion activity. Only recent work by [12] has constructed conver-
sation models using data obtained from Twitter by employing a state transition
mechanism to label dialogue acts (e.g., reference broadcast, question, reaction,
comment) within a discussion chain. The most relevant work to ours, by Suh
et al. [13], explores the factors which lead to a post being retweeted, finding that the
existence of a hashtag or URL in the original post does not affect its retweeting
chance, while the number of followers and followed does. Our work differs from
[13] by identifying discussions conducted, as opposed to identifying information
spread, and unlike existing work by [12] we extend our scope to discussions,
rather than merely considering interactions of replies between two parties. In do-
ing so we identify discussion seed posts and the key features that lead to starting
a discussion and generating attention. To the best of our knowledge there is no
existing work that can suggest and predict if and to what extent a post will be
responded to.
3 Behaviour Ontology
In the context of our work - predicting discussions - we rely on statistical fea-
tures of both users and posts, which we describe in greater detail in the following
section. No ontologies at present allow the capturing of such information, and
its description using common semantics. To fill this gap we have developed a
behaviour ontology,2 closely integrated with the Semantically Interlinked Online
Communities (SIOC) ontology [3]; this enables the modelling of various user
activities and their related impacts on a Social Networking Site (SNS). The be-
haviour ontology represents posts on social networking sites, the content of those
posts, the sentiment of the content and the impact which a post has had on the
site. It also extends existing SIOC concepts, in particular sioc:UserAccount, with
information about the impact the user has had on the site by capturing the num-
ber of followers and friends the user has at a given time (represented with the
Data property CollectionDate). For the sake of brevity and space restrictions
Figure 1 shows a part of the ontology that is relevant to the representation of
the information used in our approach.
The two key classes presented in Figure 1 are: PostImpact and UserImpact.
The former models the number of replies that a given post has generated, char-
acterising the level of discussion that the post has yielded up until a given point
in time. The latter class models the impact that the user has had within a given
SNS. Capturing this information is crucial to predicting discussion activity, as
according to [8,13] user reputation and standing within an online space is often
a key factor in predicting whether content will generate attention or not.
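To make the modelling concrete, the following is a minimal sketch, using rdflib, of how a post, its author and their impacts could be described with the behaviour ontology. The namespace URI and the property names (numReplies, numFollowers, numFriends, hasImpact, collectionDate) are illustrative assumptions: only the classes PostImpact and UserImpact and the CollectionDate data property are named in the text.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

SIOC = Namespace("http://rdfs.org/sioc/ns#")
# Hypothetical namespace and property names for the behaviour ontology.
BEHAV = Namespace("http://example.org/behaviour#")

g = Graph()
g.bind("sioc", SIOC)
g.bind("behav", BEHAV)

post = URIRef("http://example.org/post/123")
user = URIRef("http://example.org/user/alice")
post_impact = URIRef("http://example.org/post/123#impact")
user_impact = URIRef("http://example.org/user/alice#impact")

g.add((post, RDF.type, SIOC.Post))
g.add((post, SIOC.has_creator, user))
g.add((user, RDF.type, SIOC.UserAccount))

# Impact of the post: how many replies it has attracted so far.
g.add((post_impact, RDF.type, BEHAV.PostImpact))
g.add((post_impact, BEHAV.numReplies, Literal(7, datatype=XSD.integer)))
g.add((post, BEHAV.hasImpact, post_impact))

# Impact of the user at collection time: follower and friend counts.
g.add((user_impact, RDF.type, BEHAV.UserImpact))
g.add((user_impact, BEHAV.numFollowers, Literal(1500, datatype=XSD.integer)))
g.add((user_impact, BEHAV.numFriends, Literal(320, datatype=XSD.integer)))
g.add((user_impact, BEHAV.collectionDate,
       Literal("2011-01-25T16:00:00", datatype=XSD.dateTime)))
g.add((user, BEHAV.hasImpact, user_impact))

print(g.serialize(format="turtle"))
```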
4 Identifying Discussion Seed Posts
A seed post is a post on a given Social Web platform that will yield at least one reply - within this paper
we concentrate on the use of tweets, therefore a seed post is regarded as the
initial tweet that generates a reply. Features which describe seed posts can be
divided into two sets: user features - attributes that define the user making the
post; and content features - attributes that are based solely on the post itself.
We wish to explore the application of such features in identifying seed posts; to
do this we train several machine learning classifiers and report on our findings.
First, however, we describe the features used.
User Features
In Degree: number of followers of U (#)
Out Degree: number of users U follows (#)
List Degree: number of lists U appears on; lists group users by topic (#)
Post Count: total number of posts the user has ever posted (#)
User Age: number of minutes since the user's join date (#)
Post Rate: posting frequency of the user, PostCount / UserAge
Content Features
Post Length: length of the post in characters (#)
Complexity: cumulative entropy of the n unique words in post p of total word length λ, with p_i the frequency of each word: (1/λ) Σ_{i∈[1,n]} p_i (log λ − log p_i)
Uppercase Count: number of uppercase words (#)
Readability: Gunning fog index using average sentence length (ASL) and the percentage of complex words (PCW) [7]: 0.4 (ASL + PCW)
Verb Count: number of verbs (#)
Noun Count: number of nouns (#)
Adjective Count: number of adjectives (#)
Referral Count: number of @user mentions (#)
Time in the Day: normalised time in the day, measured in minutes (#)
Informativeness: terminological novelty of the post with respect to other posts, i.e. the cumulative TF-IDF value of each term t in post p: Σ_{t∈p} tfidf(t, p)
Polarity: cumulation of polar term weights in p (using the SentiWordNet3 lexicon), normalised by the number of polar terms: (Po + Ne) / |terms|
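As an illustration of how some of the content features above can be derived, here is a minimal Python sketch computing post length, uppercase count, referral count, complexity and the Gunning fog readability score for a single post. The tokenisation, sentence splitting and the "complex word" heuristic are simplified assumptions; polarity and informativeness are omitted because they require external resources (SentiWordNet, a background corpus).

```python
import math
import re

def content_features(post: str) -> dict:
    words = re.findall(r"[A-Za-z']+", post)
    sentences = [s for s in re.split(r"[.!?]+", post) if s.strip()]
    lam = len(words)                          # total word length (lambda)
    counts = {}
    for w in words:
        counts[w.lower()] = counts.get(w.lower(), 0) + 1

    # Complexity: (1/lambda) * sum_i p_i * (log lambda - log p_i)
    complexity = 0.0
    if lam:
        complexity = sum(p * (math.log(lam) - math.log(p))
                         for p in counts.values()) / lam

    # Gunning fog index: 0.4 * (ASL + PCW); complex words are crudely
    # approximated here as words of 8 or more characters.
    asl = lam / len(sentences) if sentences else 0.0
    complex_words = [w for w in words if len(w) >= 8]
    pcw = 100.0 * len(complex_words) / lam if lam else 0.0
    readability = 0.4 * (asl + pcw)

    return {
        "post_length": len(post),
        "uppercase_count": sum(1 for w in words if w.isupper()),
        "referral_count": post.count("@"),
        "complexity": round(complexity, 3),
        "readability": round(readability, 3),
    }

print(content_features("Quake hits #Haiti. @user PLEASE retweet emergency numbers!"))
```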
4.2 Experiments
Experiments are intended to test the performance of different classification mod-
els in identifying seed posts. Therefore we used four classifiers: discriminative
classifiers Perceptron and SVM, the generative classifier Naive Bayes and the
decision-tree classifier J48. For each classifier we used three feature settings:
user features, content features and user+content features.
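A sketch of this experimental setup using scikit-learn, which offers close analogues of the four classifiers (DecisionTreeClassifier standing in for J48), is shown below. The feature matrix and labels are random placeholders; in practice they would be the user, content or user+content features extracted from the datasets, split 70/20/10 as described later in this section.

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 10))           # placeholder feature matrix
y = rng.integers(0, 2, size=1000)    # placeholder seed / non-seed labels

# 70/20/10 split into training, validation and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=1/3, random_state=0)

classifiers = {
    "Perceptron": Perceptron(),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "J48 (decision tree)": DecisionTreeClassifier(),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, "F1 on validation:", round(f1_score(y_val, clf.predict(X_val)), 3))
```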
3 http://sentiwordnet.isti.cnr.it/
Datasets. For our experiments we used two datasets of tweets available on the
Web: Haiti earthquake tweets4 and the State of the Union Address tweets.5 The
former dataset contains tweets which relate to the Haiti earthquake disaster -
tagged with #haiti - covering a varying timespan. The latter dataset contains
all tweets published during President Barack Obama's State of
the Union Address speech. Our goal is to predict discussion activity based on
the features of a given post by first identifying seed posts, before moving on to
predict the discussion level.
Within the above datasets many of the posts are not seeds, but are instead
replies to previous posts, thereby featuring in the discussion chain as a node.
In [13] retweets are considered as part of the discussion activity. In our work
we identify discussions using the explicit “in reply to” information obtained
by the Twitter API, which does not include retweets. We make this decision
based on the work presented by boyd et al. [4], where an analysis of retweeting
as a discussion practice is presented, arguing that message forwards adhere to
different motives which do not necessarily designate a response to the initial
message. Therefore, we only investigate explicit replies to messages. To gather
our discussions, and our seed posts, we iteratively move up the reply chain - i.e.,
from reply to parent post - until we reach the seed post in the discussion. We
define this process as dataset enrichment, and is performed by querying Twitter’s
REST API6 using the in reply to id of the parent post, and moving one-step at
a time up the reply chain. This same approach has been employed successfully
in work by [12] to gather a large-scale conversation dataset from Twitter.
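A minimal sketch of this dataset-enrichment loop is given below; fetch_status stands in for whatever function looks a tweet up by id through the Twitter REST API, which is not shown here, and the in_reply_to_status_id field is the "in reply to" identifier mentioned above.

```python
def enrich_thread(reply, fetch_status):
    """Walk up the reply chain from a reply to its seed post, one step at a time.

    `reply` is a dict-like tweet with an `in_reply_to_status_id` field;
    `fetch_status` is a placeholder for a function retrieving a tweet by id.
    """
    chain = [reply]
    current = reply
    while current.get("in_reply_to_status_id"):
        parent = fetch_status(current["in_reply_to_status_id"])
        if parent is None:          # deleted or protected tweet: stop here
            break
        chain.append(parent)
        current = parent
    seed = chain[-1]                # the post that started the discussion
    return seed, chain
```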
Table 2 shows the statistics that describe our collected datasets. One can ob-
serve the difference in conversational tweets between the two corpora, where the
Haiti dataset contains fewer seed posts as a percentage than the Union dataset,
and therefore fewer replies. However, as we explain in a later section, this does
not correlate with a higher discussion volume in the former dataset. We con-
vert the collected datasets from their proprietary JSON formats into triples,
annotated using concepts from the behaviour ontology described above; this enables our
features to be derived by querying our datasets using basic SPARQL queries.
Our trained classifiers assign labels to one of two classes: seed and non-seed. To
evaluate the performance of our method we use four measures: precision, recall, f-measure and area under the
Receiver Operator Curve. Precision measures the proportion of retrieved posts
which were actually seed posts, recall measures the proportion of seed posts
which were correctly identified and fallout measures the proportion of non-seed
posts which were incorrectly classified as seed posts (i.e., false positive rate). We
use f-measure, as defined in Equation 1 as the harmonic mean between precision
and recall, setting β = 1 to weight precision and recall equally. We also plot the
Receiver Operator Curve of our trained models to show graphical comparisons
of performance.
Fβ = (1 + β²) · P · R / (β² · P + R)    (1)
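A minimal helper implementing Equation (1):

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F_beta as in Equation (1); beta = 1 weights precision and recall equally."""
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(f_measure(0.9, 0.8))   # harmonic mean of precision and recall
```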
For our experiments we divided each dataset up into 3 sets: a training set, a
validation set and a testing set using a 70/20/10 split. We trained our classifi-
cation models using the training split and then applied them to the validation
set, labelling the posts within this split. From these initial results we performed
model selection by choosing the best performing model - based on maximising
the F1 score - and used this model together with the best performing features,
using a ranking heuristic, to classify posts contained within our test split. We
first report on the results obtained from our model selection phase, before moving
onto our results from using the best model with the top-k features.
Table 3. Results from the classification of seed posts using varying feature sets and
classification models
4.3 Results
Our findings from Table 3 demonstrate the effectiveness of using solely user
features for identifying seed posts. In both the Haiti and Union Address datasets
training a classification model using user features shows improved performance
over the same models trained using content features. In the case of the Union
dataset we are able to achieve an F1 score of 0.782, coupled with high precision,
when using the J48 decision-tree classifier - where the latter figure (precision)
indicates conservative estimates using only user features. We also achieve similar
high-levels of precision when using the same classifier on the Haiti dataset. The
plots of the Receiver Operator Characteristic (ROC) curves in Figure 2 show
similar levels of performance for each classifier over the two corpora. When using
solely user features J48 is shown to dominate the ROC space, subsuming the
plots from the other models. A similar behaviour is exhibited for the Naive
Bayes classifier where SVM and Perceptron are each outperformed. The plots
also demonstrate the poor recall levels when using only content features, where
each model fails to yield the same performance as the use of only user features.
However, the plots show the effectiveness of combining both user and content
features.
Experiments identify the J48 classifier as being our best performing model,
yielding optimum F1 scores, and by analysing the induced decision tree we observe
the effects of individual features. Extremes of post polarity are found to be good
indicators of seed posts, while posts which fall within the mid-polarity range are
likely to be objective. One reason for this could be that the posts which elicit
an emotional response are more likely to generate a reply. Analysis of the time
of day identifies 4pm to midnight and 3pm to midnight as being associated with
seed posts for the Haiti and Union Address datasets respectively.
Top-k Feature Selection. Thus far we have only analysed the use of fea-
tures grouped together, prompting two questions: which features are more important
than others? And which features are good indicators of a seed post? To gauge
the importance of features in identifying seed posts we rank our features by
their Information Gain Ratio (IGR) with respect to seed posts. Our rankings in
Table 4 indicate that the number of lists that a user is featured in appears in the
first position for both the Haiti and Union Address datasets, and the in-degree
of the user also features towards the top of each ranking. Such features increase
Table 4. Features ranked by Information Gain Ratio (IGR) with respect to the Seed Post class
label. Each feature name is paired with its IGR in brackets.
the broadcast capability of the user, where any post made by the user is read
by a large audience, increasing the likelihood of yielding a response. To gauge
the similarity between the rankings we measured the Pearson Correlation Co-
efficient, which we found to be 0.674 indicating a good correlation between the
two lists and their respective ranks.
The top-most ranks from each dataset are dominated by user features includ-
ing the list-degree, in-degree, num-of-posts and post-rate. Such features describe
a user’s reputation, where higher values are associated with seed posts. Figure 3
shows the contributions of each of the top-5 features to class decisions in the
training set, where the list-degree and in-degree of the user are seen to correlate
heavily with seed posts. Using these rankings our next experiment explored the
effects of training a classification model using only the top-k features, observing
the effects of iteratively increasing k and the impact upon performance. We se-
lected the J48 classifier for training - based on its optimum performance during
the model selection phase - and trained the classifier using the training split
from each dataset and only the top-k features based on our observed rankings.
The model was then applied to the held out test split of 10%, thereby ensuring
independence from our previous experiment.
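The sketch below shows how such a ranking by Information Gain Ratio can be computed for discretised feature values and a binary class label; the toy data and feature names are illustrative only.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain_ratio(feature_values, labels):
    """IGR of a (discretised) feature with respect to the class labels."""
    total = len(labels)
    gain = entropy(labels)
    split_info = 0.0
    for value, count in Counter(feature_values).items():
        subset = [l for v, l in zip(feature_values, labels) if v == value]
        gain -= (count / total) * entropy(subset)
        split_info -= (count / total) * math.log2(count / total)
    return gain / split_info if split_info > 0 else 0.0

def rank_features(dataset, labels):
    """Rank features (columns of a dataset) by IGR, highest first."""
    return sorted(dataset,
                  key=lambda name: information_gain_ratio(dataset[name], labels),
                  reverse=True)

data = {"list_degree": ["high", "high", "low", "low"],
        "post_length": ["short", "long", "short", "long"]}
print(rank_features(data, ["seed", "seed", "non-seed", "non-seed"]))
```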
Figure 4 shows the results from our experiments, where at lower levels of k we
observe similar levels of performance - particularly when only the highest ranked
feature is used (i.e., list-degree). As we increase k, including more features within
our classification model, we observe improvements in F1 scores for both datasets.
The lower ranks shown in Table 4 are dominated by content features. As we
include the lower-ranked features, our plots show a slight decrease in performance
for the Haiti dataset due to the low IGR scores yielded for such features. For
both datasets we achieve a precision of 1 when each model is trained using just
the list-degree of the user, although J48 induces different cutoff points for the
two datasets when judging the level of this nominal value. Using this feature
More in-depth analysis of the data is shown in Figure 5(a) and Figure 5(b), dis-
playing the probability distributions and cumulative distributions respectively.
For each dataset we used maximum likelihood estimation to optimise parame-
ters for various distribution models, and selected the best fitting model using
the Kolmogorov-Smirnov goodness-of-fit test against the training splits from our
datasets. For the Haiti dataset we fitted the Gamma distribution - found to be
a good fit at α = 0.1, and for the Union dataset we fitted the Chi-squared dis-
tribution - however this was found to provide the minimum deviation from the
data without satisfying any of the goodness of fit tests. The distributions convey,
for both fitted datasets, that the probability mass is concentrated towards the
head of the distribution where the volume of the discussion is at its lowest. The
likelihood of a given post generating many replies - where many can be gauged as
the mean number of replies within the discussion volume distribution - tends to
0 as the volume increases. Such density levels render the application of standard
prediction error measures such as Relative Absolute Error inapplicable, given
that the mean of the volumes would be used as the random estimator for accu-
racy measurement. A solution to this problem is instead to assess whether one
post will generate a larger discussion than another, thereby producing a ranking,
similar to the method used in [8].
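A sketch of this fitting procedure using SciPy is given below: candidate distributions are fitted by maximum likelihood and the fit is checked with a Kolmogorov-Smirnov test. The sample data are synthetic placeholders for the discussion-volume distribution.

```python
import numpy as np
from scipy import stats

# Placeholder discussion volumes (number of replies per seed post).
volumes = np.random.default_rng(0).gamma(shape=0.8, scale=3.0, size=500) + 1

candidates = {"gamma": stats.gamma, "chi2": stats.chi2, "expon": stats.expon}

for name, dist in candidates.items():
    params = dist.fit(volumes)                        # maximum likelihood estimation
    d_stat, p_value = stats.kstest(volumes, name, args=params)
    print(f"{name}: KS D={d_stat:.3f}, p={p_value:.3f}")
# A high p-value (e.g. above alpha = 0.1) means the fit is not rejected.
```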
To predict the discussion activity level we use a Support Vector Regression
model trained using the three distinct feature set combinations that we intro-
duced earlier in this paper: user features, content features and user+content
features. Using the predicted values for each post we can then form a ranking,
which is comparable to a ground truth ranking within our data. This provides
discussion activity levels, where posts are ordered by their expected volume. This
approach also enables contextual predictions where a post can be compared with
existing posts that have produced lengthy debates.
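A minimal sketch of this ranking step with scikit-learn's Support Vector Regression is shown below; the feature matrices and observed volumes are random placeholders.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X_train = rng.random((300, 10))            # features of training seed posts
y_train = rng.poisson(2.0, size=300)       # observed discussion volumes (placeholder)
X_val = rng.random((50, 10))               # seed posts identified in the validation split

model = SVR()
model.fit(X_train, y_train)

predicted_volume = model.predict(X_val)
# Rank posts by expected discussion volume, largest first; this predicted
# ranking is then compared against the ground-truth ranking.
predicted_ranking = np.argsort(-predicted_volume)
print(predicted_ranking[:10])
```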
5.1 Experiments
Datasets. For our experiments we used the same datasets as in the previous
section: tweets collected from the Haiti crisis and the Union Address speech. We
maintain the same splits as before - training/validation/testing with a 70/20/10
split - but without using the test split. Instead we train the regression models
using the seed posts in the training split and then test the prediction accuracy
using the seed posts in the validation split - seed posts in the validation set
are identified using the J48 classifier trained using both user+content features.
Table 5 describes our datasets for clarification.
In order to define rel_i we use the same approach as [8]: rel_i = N − rank_i + 1,
where rank_i is the ground truth rank of the element at index i of the predicted
ranking. Therefore, when dividing the predicted rank by the actual rank, we
get a normalised value ranging between 0 and 1, where 1 defines the predicted
rank as being equivalent to the actual rank. To provide a range of measures
we calculated NDCG@k for six different values, k = {1, 5, 10, 20, 50, 100},
thereby assessing the accuracy of our rankings over different portions of the top-
k posts. We learnt a Support Vector Regression model for each dataset using the
same feature sets as our earlier identification task: user features, content features
and user+content features.
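The following is a sketch of NDCG@k with the relevance assignment rel_i = N − rank_i + 1 described above; the discounted gain follows the standard formulation of Järvelin and Kekäläinen [9], which is an assumption here since the exact variant is not reproduced in this excerpt.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: rel_1 + sum_{i>=2} rel_i / log2(i)."""
    return sum(rel / max(math.log2(i), 1.0)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(predicted_ranking, true_ranking, k):
    """NDCG@k where rel_i = N - rank_i + 1, rank_i being the true rank of the
    item placed at position i of the predicted ranking."""
    n = len(true_ranking)
    true_rank = {item: pos + 1 for pos, item in enumerate(true_ranking)}
    relevances = [n - true_rank[item] + 1 for item in predicted_ranking]
    ideal = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

predicted = ["p3", "p1", "p2", "p4"]   # ranking by predicted discussion volume
actual = ["p1", "p2", "p3", "p4"]      # ground-truth ranking
for k in (1, 5):
    print(k, round(ndcg_at_k(predicted, actual, k), 3))
```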
5.2 Results
Figure 6 shows the ranking accuracy that we achieve using a Support Vector
Regression model for prediction over the two datasets, where we observe differing
performance levels achieved using different feature set combinations. For the
Haiti dataset the user features play a greater role in predicting discussion activity
levels for larger values of k. For the Union Address dataset user features also
outperform content features as k increases. In each case we note that content
features do not provide as accurate predictions as the use of solely user features.
Such findings are consistent with experiments described in [8] which found that
user features yielded improved ranking of comments posted on Stories from Digg
in comparison with merely content features.
Following on from our initial rank predictions we identify the user features as
being important predictors of discussion activity levels. By performing analysis of
the learnt regression model over the training split we can analyse the coefficients
induced by the model - Table 6 presents the coefficients learnt from the user
features. Although different coefficients are learnt for each dataset, for major
features with greater weights the signs remain the same. In the case of the list-
degree of the user, which yielded high IGR during classification, there is a similar
positive association with the discussion volume - the same is also true for the
in-degree and the out-degree of the user. This indicates that a constant increase
in the combination of a user’s in-degree, out-degree, and list-degree will lead to
increased discussion volumes. Out-degree plays an important role by enabling
the seed post author to see posted responses - given that the larger the out-degree
the greater the reception of information from other users.
Table 6. Coefficients of user features learnt using Support Vector Regression over the
two Datasets. Coefficients are rounded to 4 dp.
6 Conclusions
The abundance of discussions carried out on the Social Web hinders the tracking
of debates and opinion: while some discussions may form lengthy debates, others
may simply die out. Effective monitoring of high-activity discussions can be
supported by predicting which posts will start a discussion and their subsequent
discussion activity levels. In this paper we have explored three research questions,
the first of which asked Is it possible to identify discussion seed posts with high-
levels of accuracy? We have presented a method to identify discussion seed
posts achieving an optimum F1 score of 0.848 for experiments over one dataset.
Experiments with both content and user features demonstrated the importance
of user features in identifying seed posts.
Acknowledgements
This work is funded by the EC-FP7 projects WeGov (grant number 248512) and
Robust (grant number 257859).
References
1. Adamic, L.A., Zhang, J., Bakshy, E., Ackerman, M.S.: Knowledge sharing and
Yahoo Answers: Everyone knows something. In: Proceedings of WWW 2008, pp.
665–674. ACM, New York (2008)
2. Bian, J., Liu, Y., Zhou, D., Agichtein, E., Zha, H.: Learning to Recognize Reliable
Users and Content in Social Media with Coupled Mutual Reinforcement. In: 18th
International World Wide Web Conference (WWW 2009) (April 2009)
3. Bojars, U., Breslin, J.G., Peristeras, V., Tummarello, G., Decker, S.: Interlinking
the social web with semantics. IEEE Intelligent Systems 23, 29–40 (2008)
4. Boyd, D., Golder, S., Lotan, G.: Tweet, tweet, retweet: Conversational aspects of
retweeting on twitter. In: Hawaii International Conference on System Sciences, pp.
1–10 (2010)
5. Cha, M., Haddadi, H., Benevenuto, F., Gummadi, K.P.: Measuring User Influence
in Twitter: The Million Follower Fallacy. In: Fourth International AAAI Conference
on Weblogs and Social Media (May 2010)
6. Goldhaber, M.H.: The Attention Economy and the Net. First Monday 2(4) (1997)
7. Gunning, R.: The Technique of Clear Writing. McGraw-Hill, New York (1952)
8. Hsu, C.-F., Khabiri, E., Caverlee, J.: Ranking Comments on the Social Web. In:
International Conference on Computational Science and Engineering, CSE 2009,
vol. 4 (August 2009)
9. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques.
ACM Trans. Inf. Syst. 20, 422–446 (2002)
10. Mishne, G., Glance, N.: Leave a Reply: An Analysis of Weblog Comments. In:
Third annual workshop on the Weblogging ecosystem (2006)
11. Ratkiewicz, J., Menczer, F., Fortunato, S., Flammini, A., Vespignani, A.: Charac-
terizing and modeling the dynamics of online popularity. Physical Review Letters
(May 2010)
12. Ritter, A., Cherry, C., Dolan, B.: Unsupervised Modeling of Twitter Conversations.
In: Proc. HLT-NAACL 2010 (2010)
13. Suh, B., Hong, L., Pirolli, P., Chi, E.H.: Want to be retweeted? Large scale analytics
on factors impacting retweet in Twitter network. In: Proceedings of the IEEE
Second International Conference on Social Computing (SocialCom), pp. 177–184
(August 2010)
14. Szabo, G., Huberman, B.A.: Predicting the popularity of online content. Commun.
ACM 53(8), 80–88 (2010)
Mining for Reengineering: An Application to
Semantic Wikis Using Formal and Relational
Concept Analysis
1 Introduction
semantic wiki can be considered as a wide blackboard where human agents inter-
act with software agents [3] for producing and completing knowledge. However,
the collaborative and multi-user aspects introduce different perceptions of a do-
main and thus differences in knowledge organization. Incremental building
over time may also introduce gaps or over-definitions in the knowledge base.
Accordingly, learning techniques can be used to solve these kinds of problems
by reengineering, i.e. a semantic wiki is considered as a base for discovering and
organizing existing and new knowledge units. Furthermore, semantic data em-
bedded in the semantic wikis are rarely used to enrich and improve the quality
of semantic wikis themselves. There is a large body of potential knowledge units
hidden in the content of a wiki, and knowledge discovery techniques are good
candidates for making these units explicit. The objective of the present work is to
use knowledge discovery techniques –based on Formal Concept Analysis and Re-
lational Concept Analysis– for learning new knowledge units such as categories
and links between objects in pages for enriching the content and the organization
of a semantic wiki. Thus, the present work aims at reengineering a semantic wiki
for ensuring a well-founded description and organization of domain objects and
categories, as well as setting relations between objects at the most appropriate
level of description.
Reengineering improves the quality of a semantic wiki by allowing stability
and optimal factorization of the category hierarchy, by identifying similar cat-
egories, by creating new categories, and by detecting inaccuracy or omissions.
The knowledge discovery techniques used in this reengineering approach are
based on Formal Concept Analysis (FCA) [5] and Relational Concept Analy-
sis (RCA) [10]. FCA is a mathematical approach for designing a concept lattice
from a binary context composed of a set of objects described by attributes. RCA
extends FCA by taking into account relations between objects and introducing
relational attributes within formal concepts, i.e. reifying relations between ob-
jects at the concept level. FCA and RCA are powerful techniques that allow
a user to answer a set of questions related to the quality of organization and
content of semantic wiki contents. The originality of this approach is to con-
sider the semantic wiki content as a starting point for knowledge discovery and
reengineering, applying FCA and RCA for extracting knowledge units from this
content and allowing a completion and a factorization of the wiki structure (a
first attempt in this direction can be found in [1]). The present approach is gen-
eral: it does not depend on any domain nor require customized rules or
queries, and thus it can be generalized to any semantic wiki.
After defining some terminology about semantic wikis in Section 2, we intro-
duce basic elements of Formal and Relational Concept Analysis in Sec-
tion 3. Section 4 gives details on the proposed approach for reengineering a
semantic wiki. In Section 5, we propose an evaluation of the method based on
experimental results, and we discuss the results and related issues (Section 6).
After a brief review of related work, we conclude with a summary in Section 7.
Fig. 1. A wiki page titled “Harry Potter and the Philosopher’s Stone”. The upper half
shows the content of the wiki page while the lower half is the underlying annotated
form.
Throughout this paper, we use wiki(s) to refer to semantic wiki(s). Each wiki
page is considered as a “source ontological element”, including classes and prop-
erties [8]. Annotations in the page provide statements about this source element.
For example, the page entitled “Harry Potter and the Philosopher’s Stone” de-
scribes a movie and a set of annotations attached to this page (Figure 1).
Editors annotate an object represented by a wiki page with categories, data
types, and relations, i.e. an object can be linked to other objects through re-
lations. A category allows a user to classify pages and categories can be orga-
nized into a hierarchy. For example, the annotation [[Category:Film]] states
that the page about “Harry Potter and the Philosopher’s Stone” belongs to
the category Film. The category Fantasy is a subcategory of Film as soon
as the annotation [[Category:Film]] is inserted in the Fantasy page. Bi-
nary relationships are introduced between pages. For example, the annotation
[[Directed by::Chris Colombus]] is inserted in the “Harry Potter. . . ” page
for making explicit the Directed by relation between “Harry Potter. . . ” page
and “Chris Colombus” page.
Some attributes are allowed to assign “values”: they specify a relationship
from a page to a data type such as numbers. For example, [[Duration::152min]] gives
the duration of the film “Harry Potter. . . ”.
Basically, all categories in a wiki are manually created by various editors, pos-
sibly introducing several sources of inconsistency and redundancy. The fact that
the number of pages is continuously growing and that new categories are intro-
duced is a major challenge for managing the category hierarchy construction.
To keep browsing and navigation within the wiki efficient, the category
hierarchy has to be periodically updated. Thus, a tool for automatically manag-
ing the category hierarchy in a wiki is of prime importance. In the following, we
show how this can be done using FCA and RCA, which are both detailed in the
next section.
Table 1. The two binary contexts of films KFilms (left) and actors KActors (right). The
film context describes the films Jeux d'enfants, Good Bye, Lenin, Catch me if you can,
And now my love, America's Sweethearts and Kleinruppin Forever by attributes such as
hasRunningTime, hasYear, hasAwards, ComedyDrama, Romance, American, English, French
and Germany; the actor context describes Guillaume Canet, Daniel Brühl, Leonardo DiCaprio,
Marthe Keller, Tolias Schenke, Julia Roberts, Catherine Zeta Jones and Anna Maria Muhe
by the attributes Female, Male, Age20, Age30 and hasAwards.
levels in a lattice. For example, concept c5 in LInitActors has for intent the set
of attributes {Female, hasAward} (respectively from c1 and c3). By contrast,
concept c3 in LInitActors has for extent the set of individuals {Julia Roberts,
Leonardo DiCaprio, Daniel Brühl, Guillaume Canet} (respectively from c5,
c6, and c9). When attributes are mentioned following reduced labelling, they
are called local attributes, otherwise inherited attributes. Attributes obey the
following rules: when there are at least two local attributes a1 and a2 in the same
intent, these attributes are equivalent, i.e. a1 appears as soon as a2 does, and recipro-
cally. For example, hasRunningTime and hasYear are equivalent in LInitFilms
(see Figure 2). Moreover, local attributes imply inherited attributes. For exam-
ple, ComedyDrama implies hasRunningTime and hasYear.
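To make the FCA step tangible, the following is a brute-force sketch that enumerates the formal concepts of a small binary context; it is only suitable for toy contexts, since the number of concepts can grow exponentially. The attribute assignments in the example context are illustrative, not the exact cell values of Table 1.

```python
from itertools import combinations

def derive_concepts(context):
    """Enumerate the formal concepts (extent, intent) of a binary context.

    `context` maps each object to the set of attributes it has. For every
    subset of objects we compute the shared intent, then close it back to an
    extent; each resulting pair is a formal concept.
    """
    objects = list(context)
    attributes = set().union(*context.values()) if context else set()
    concepts = set()
    for r in range(len(objects) + 1):
        for subset in combinations(objects, r):
            intent = attributes.copy()
            for obj in subset:
                intent &= context[obj]           # attributes shared by the subset
            extent = frozenset(o for o in objects if intent <= context[o])
            concepts.add((extent, frozenset(intent)))
    return concepts

films = {
    "Jeux d'enfants":      {"French", "Romance", "hasYear"},
    "Good Bye, Lenin":     {"Germany", "ComedyDrama", "hasYear"},
    "Catch me if you can": {"American", "ComedyDrama", "hasYear"},
}
for extent, intent in sorted(derive_concepts(films), key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```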
of relational attributes that are added to complete the “initial context” describ-
ing the object set Gj = dom(rk). For each relation rk ⊆ Gj × Gl, there is an initial
lattice for each object set, i.e. Lj for Gj and Ll for Gl. For example, the two ini-
tial lattices for the relation Starring are LInitFilms (Figure 2) and LInitActors
(Figure 3).
Given the relation rk ⊆ Gj × Gl, the RCA mechanism starts from two ini-
tial lattices, Lj and Ll, and builds a series of intermediate lattices by gradually
completing the initial context of Gj = dom(rk) with new “relational attributes”. For
that, relational scaling follows the DL semantics of role restrictions. Given the re-
lation rk ⊆ Gj × Gl, a relational attribute ∃rk : c – c being a concept and ∃ the ex-
istential quantifier – is associated to an object g ∈ Gj whenever rk(g) ∩ extent(c)
≠ ∅ (other quantifiers are available, see [10]). For example, let us consider the
concept c1 whose intent is Starring : c3 in LFinalFilms, i.e. the final lattice of
films in Figure 4. This means that all films in the extent of c1 are related to
(at least one or more) actors in the extent of concept c3 in LFinalActors, i.e. the
final lattice of actors, through the relation Starring (actors in the extent of c3
are characterized by the hasAward attribute).
The series of intermediate lattices converges toward a “fixpoint” or “final
lattice”, and the RCA mechanism then terminates. This is why there is one initial
and one final lattice for each context of the considered RCF. Here, LInitActors is
identical to LFinalActors (Figure 3), and there are two different lattices for films,
namely the initial LInitFilms (Figure 2) and the final LFinalFilms (Figure 4).
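The following sketch illustrates the existential relational scaling step just described: given a relation between films and actors and the extents of concepts in the actor lattice, a relational attribute "exists Starring:c" is added to every film related to at least one actor in the extent of c. The concept labels and the toy data are illustrative only.

```python
def relational_scaling(relation, relation_name, target_concepts):
    """Existential relational scaling.

    `relation` maps each source object (e.g. a film) to the set of target
    objects (e.g. actors) it is related to; `target_concepts` maps a concept
    label to its extent in the target lattice. Returns the relational
    attributes to add to each source object.
    """
    scaled = {}
    for source, targets in relation.items():
        attrs = set()
        for concept_label, extent in target_concepts.items():
            if targets & extent:            # r(g) ∩ extent(c) is non-empty
                attrs.add(f"exists {relation_name}:{concept_label}")
        scaled[source] = attrs
    return scaled

starring = {
    "Catch me if you can": {"Leonardo DiCaprio"},
    "America's Sweethearts": {"Julia Roberts", "Catherine Zeta Jones"},
}
actor_concepts = {
    "c1 (Female)": {"Julia Roberts", "Catherine Zeta Jones", "Marthe Keller"},
    "c3 (hasAward)": {"Julia Roberts", "Leonardo DiCaprio"},
}
print(relational_scaling(starring, "Starring", actor_concepts))
```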
4 Methodology
In this section, we give details on the knowledge discovery approach used for
wiki reengineering. We first explain how data are retrieved and how the different
formal contexts and associated concept lattices are built. Then, we analyze the
concepts and propose a new organization for the category hierarchy in the wiki.
the wiki can be improved. In the lattice LFinalFilms, we assume that categories
English and American are subcategories of ComedyDrama by observing concepts
c4 and c9. This sounds strange but is mainly due to the Closed-World Assump-
tion of RCA, which collides with the openness of a semantic wiki and the reduced
size of our running example (see Section 6).
Question 4: Defining categories. Definitions are quite rare in SMW although they
are essential. Definitions can help humans understand the
purposes of categories and can be used for automatic classification by introduc-
ing necessary and sufficient conditions for an individual to belong to a category.
As seen in question 1, elements in the local intent are substantially equivalent.
Therefore, if a formal concept contains a category and one or more attributes in
its local intent, then these attributes can be considered as a definition of that cat-
egory. Moreover, any new introduced object to that category should be assigned
these attributes. The case of equivalence between a category and a relational
attribute is similar. For instance, concept c2 has the category RomanceMovie in
its intent. This category can be defined by the relational attribute Starring:c1
where the intent of c1 is Female (see lattice LInitActors ). Then a romance movie
would involve a female actor, and any new object in this concept should be
related to at least one actress.
The result of all these steps is an RDF model that defines an OWL ontology con-
taining both a TBox (new class definitions, subsumptions and equivalences) and
an ABox (new instantiations). The final step is to contribute back with new
knowledge to the original wiki. Our method acts as a wiki user suggesting up-
dates. These changes, as any change from any other user, can be undone. Even
if all the results are correct and consistent, some of them may be useless in prac-
tice. Therefore, in this approach it is the responsibility of a human expert or
the wiki community to evaluate the reengineering proposals following the spirit
of collaborative editing work for wikis. Moreover, discarded proposals should be
remembered, so they are not proposed again in subsequent executions.
5 Experimental Results
We applied our method to several semantic wikis and defined criteria for evalu-
ating the experimental results.
Table 3. Wiki characteristics in terms of the total number of pages (AP), the number
of content pages (CP), the number of uncategorized content pages (UCP), the number
of categories (CAT), the number of subsumptions (SUBCAT), the number of datatype
attributes (DP), the number of relations (OP), the average cardinality of categories
(CATSIZE), the average number of datatype attributes in content pages (DPS/CP)
and the average number of relations in content pages (OPS/CF)
5.2 Results
Table 4 shows the topological characteristics of the lattices of all wikis. The
number of formal concepts defines the size of the lattice. Apparently, this num-
ber is not always proportional to the size of the wiki. For instance, in spite of
Bioclipse wiki being smaller than Hackerspace wiki in terms of pages, the lattice
3 http://jena.sourceforge.net/
4 http://sourceforge.net/projects/galicia/
of Bioclipse has more concepts than the Hackerspace one. In the lattice, each edge
represents a subsumption relationship between concepts. Moreover, the depth
of the lattice is defined by the longest path from the top concept down to the
bottom concept, knowing that there are no cycles. The higher it is, the deeper
the concept hierarchy is.
The connectivity of the lattice is defined as the average number of edges per
concept. It is noteworthy that all lattices have a similar connectivity in a narrow
range between 1.62 and 2.05. It seems that the characteristics of the wikis do
not have a strong influence on the connectedness of the lattice. Finally, the last
column gives the average number of concepts per level in the lattice. This value
indicates the width of the lattice and it correlates to the size of the lattice.
Consequently, the shape of a lattice is determined by both its depth and width.
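A sketch of how these lattice statistics (size, depth as the longest top-to-bottom path, connectivity as edges per concept, and average width) could be computed from a lattice given as a directed acyclic graph with networkx; the example edges are an arbitrary toy lattice.

```python
import networkx as nx

def lattice_statistics(edges):
    """Compute size, depth, connectivity and average width of a concept lattice
    given its covering relation as directed (upper -> lower) edges."""
    lattice = nx.DiGraph(edges)
    n_concepts = lattice.number_of_nodes()
    n_edges = lattice.number_of_edges()
    depth = nx.dag_longest_path_length(lattice)      # longest path top -> bottom
    connectivity = n_edges / n_concepts
    width = n_concepts / (depth + 1)                 # average concepts per level
    return {"concepts": n_concepts, "edges": n_edges, "depth": depth,
            "connectivity": round(connectivity, 2), "width": round(width, 2)}

# Toy lattice: top -> {c1, c2} -> bottom
print(lattice_statistics([("top", "c1"), ("top", "c2"),
                          ("c1", "bottom"), ("c2", "bottom")]))
```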
Galicia produces XML files that represent the lattices as graphs, where con-
cepts are labeled with their intent and extent. Using another custom Java ap-
plication, we interpret these files and transform them into OWL/RDF graphs.
Specifically, our application processes all concepts.
After subtracting the model given by FCA/RCA and removing trivial new
categories, an RDF model is generated that contains concrete findings for reengi-
neering the wiki. In Table 5 we report some metrics about the findings. It should
be noticed that, in the case of VSK, the number of equivalences between origi-
nally existing categories (CAT-EQ) rises quickly due to the combinatorial effect
and the symmetry of the equivalence (i.e. A ≡ B and B ≡ A count for two
entries). Although some content pages are classified (CAT-CP), the lattice fails
to classify originally uncategorized pages. The prime reason is that these pages
often lack any attribute that can be used to derive a categorization.
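As a sketch of how such findings could be written back as OWL/RDF with rdflib, the snippet below emits one example statement for each kind of proposal reported in Table 5: a new membership (CAT-CP), a sub-categorization (SUB-CAT'), and a category equivalence (CAT-EQ). The namespace and page names are placeholders.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

WIKI = Namespace("http://example.org/wiki/")   # placeholder namespace for wiki pages/categories

g = Graph()
g.bind("wiki", WIKI)

# CAT-CP: a content page is proposed as a member of an existing category.
g.add((WIKI["Harry_Potter_and_the_Philosophers_Stone"], RDF.type, WIKI["Fantasy"]))

# SUB-CAT': a discovered category is placed under an existing one.
g.add((WIKI["NewCategory_12"], RDFS.subClassOf, WIKI["Film"]))

# CAT-EQ: two originally existing categories are found to be equivalent.
g.add((WIKI["Movie"], OWL.equivalentClass, WIKI["Film"]))

print(g.serialize(format="turtle"))
```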
Table 5. The result of analyzing the lattices in terms of the number of new pro-
posed memberships of content pages to categories (CAT-CP), the number of proposed
sub-categorizations (SUB-CAT’), the number of category equivalences between origi-
nally existing categories (CAT-EQ) and the number of proposed non-trivial categories
(NEW-CAT-NT)
Fig. 5. Category size histogram of VSK wiki before and after FCA/RCA. Shaded area
represents new proposed categories.
Figure 5 compares the category size histogram before and after the reengi-
neering of the VSK wiki. The shaded area accounts for the number of newly discovered
categories. The histogram clearly shows that most of the discovered categories
are small in terms of their sizes.
The number of discovered subsumption relationships (SUB-CAT’) seems to
be more related to the number of discovered new categories than to the num-
ber of pre-existing ones. This indicates that in general the new categories are
refinements of other ones; in other words, they have a “place” in the hierarchy.
Two of the studied wikis (Hackerspace and Referata) lead to only a few new
categories. By looking into these two wikis, we found that they are already well
organized and therefore provide fewer opportunities for reengineering, combined
with the fact that these wikis do not use datatype properties.
6 Discussion
The experimental results show that our proposed method leads to reengi-
neering proposals that can go beyond what is obtained by DL-reasoning and
querying on the original wiki. It is worth noting that we do not compute
the closure of our resulting model, which would increase the
values in Table 5 but with little practical effect on the quality of the feedback
provided to the wiki.
The method is suitable for any semantic wiki regardless of its topic or lan-
guage. However, the computations are not linear with the size of the wiki (in
terms of the number of pages). Precisely, the maximum size of the lattice is
2^min(|G|,|M|) with respect to FCA and 2^min(|G|,2·|G|) with respect to RCA. There-
fore, processing large wikis can be a computational challenge.
FCA/RCA operates under the Closed-World Assumption (CWA), which di-
verges from the Open-World Assumption (OWA) of OWL reasoning. More im-
portantly, CWA collides with the open nature of wikis. As a consequence, some
of the results are counter-intuitive when they are translated back to the wiki,
as it was exemplified in the previous section. However, the results are always
consistent with the current data in the wiki, and the method can be repeated
over time if the wiki changes. Recall that the process is “semi-automatic” and
that an analysis is required.
A feedback analysis remains to be done. An approach for such an analysis
is to provide results to human experts (e.g., wiki editors), who may evaluate
the quality of the results based on their knowledge and experience. The quality
can be measured in terms of correctness and usefulness. The latter will produce
a subjective indication of the “insights” of the results, i.e., how much they go
beyond the “trivial” and “irrelevant” proposals for reengineering.
the semantic data contained in wikis. We argue that the use of FCA and RCA
helps to build a well-organized category hierarchy. Our experiments show that
the proposed method is adaptable and effective for reengineering semantic wikis.
Moreover, our findings pose several open problems for future study.
References
1. Blansché, A., Skaf-Molli, H., Molli, P., Napoli, A.: Human-machine collaboration
for enriching semantic wikis using formal concept analysis. In: Lange, C., Reu-
telshoefer, J., Schaffert, S., Skaf-Molli, H. (eds.) Fifth Workshop on Semantic
Wikis – Linking Data and People (SemWiki-2010), CEUR Workshop Proceedings,
vol. 632 (2010)
2. Chernov, S., Iofciu, T., Nejdl, W., Zhou, X.: Extracting semantic relationships
between Wikipedia categories. In: 1st International Workshop SemWiki2006 - From
Wiki to Semantics, co-located with ESWC 2006, Budva (2006)
3. Cordier, A., Lieber, J., Molli, P., Nauer, E., Skaf-Molli, H., Toussaint, Y.: Wiki-
Taaable: A semantic wiki as a blackboard for a textual case-based reasoning system.
In: 4th Workshop on Semantic Wikis (SemWiki2009), held in the 6th European
Semantic Web Conference (May 2009)
4. Dao, M., Huchard, M., Hacene, M.R., Roume, C., Valtchev, P.: Improving gener-
alization level in UML models: Iterative cross generalization in practice. In: ICCS,
pp. 346–360 (2004)
5. Ganter, B., Wille, R.: Formal Concept Analysis. Springer, Berlin (1999)
6. Krötzsch, M., Schaffert, S., Vrandečić, D.: Reasoning in semantic wikis. In: Anto-
niou, G., Aßmann, U., Baroglio, C., Decker, S., Henze, N., Patranjan, P.-L., Tolks-
dorf, R. (eds.) Reasoning Web. LNCS, vol. 4636, pp. 310–329. Springer, Heidelberg
(2007)
7. Krötzsch, M., Vrandečić, D.: SWiVT ontology specification,
http://semantic-mediawiki.org/swivt/
8. Krötzsch, M., Vrandečić, D., Völkel, M., Haller, H., Studer, R.: Semantic Wikipedia.
J. Web Sem., 251–261 (2007)
9. Leuf, B., Cunningham, W.: The Wiki Way: Quick Collaboration on the Web.
Addison-Wesley Longman, Amsterdam (2001)
10. Rouane, M.H., Huchard, M., Napoli, A., Valtchev, P.: A proposal for combin-
ing formal concept analysis and description logics for mining relational data. In:
Kuznetsov, S.O., Schmidt, S. (eds.) ICFCA 2007. LNCS (LNAI), vol. 4390, pp.
51–65. Springer, Heidelberg (2007)
11. Schaffert, S.: Ikewiki: A semantic wiki for collaborative knowledge management.
In: 1st International Workshop on Semantic Technologies in Collaborative Appli-
cations (STICA 2006), Manchester, UK (2006)
12. Sertkaya, B.: Formal Concept Analysis Methods for Description Logics. PhD the-
sis, Dresden University (2008)
SmartLink: A Web-Based Editor and
Search Environment for Linked Services
Stefan Dietze, Hong Qing Yu, Carlos Pedrinaci, Dong Liu, and John Domingue
Knowledge Media Institute, The Open University, MK7 6AA, Milton Keynes, UK
{s.dietze,h.q.yu,c.pedrinaci,d.liu,j.b.domingue}@open.ac.uk
1 Introduction
The past decade has seen a range of research efforts in the area of Semantic Web
Services (SWS), aiming at the automation of Web service-related tasks such as
discovery, orchestration or mediation. Several conceptual models, such as OWL-S
[6], WSMO [3], and standards like SAWSDL [7] have been proposed, usually
covering aspects such as service capabilities, interfaces and non-functional properties.
However, SWS research has for the most part targeted WSDL or SOAP-based Web
services, which are not prevalent on the Web. Also, due to the inherent complexity
required to fully capture computational functionality, creating SWS descriptions has
represented an important knowledge acquisition bottleneck and required the use of
rich knowledge representation languages and complex reasoners. Hence, so far there
has been little take-up of SWS technology within non-academic environments.
That is particularly concerning since Web services – nowadays including a range of
often more light-weight technologies beyond the WSDL/SOAP approach, such as
RESTful services or XML-feeds – are in widespread use throughout the Web. That has
led to the emergence of more simplified SWS approaches to which we shall refer here
as “lightweight”, such as WSMO-Lite [9], SA-REST [7] and Micro-WSMO/hRESTs [4],
which replace “heavyweight” SWS with simpler models expressed in RDF(S). This
aligns them with current practices in the growing Semantic Web [1] and simplifies the
creation of service descriptions. While the Semantic Web has successfully redefined
itself as a Web of Linked (Open) Data (LOD) [2], the emerging Linked Services
approach [7] exploits the established LOD principles for service description and
publication, and is catering for exploiting the complementarity of the Linked Data and
services to support the creation of advanced applications for the Web.
In order to support annotation of a variety of services, such as WSDL services as
well as REST APIs, the EC-funded project SOA4ALL1, has developed the Linked
Services registry and discovery engine iServe2. iServe supports publishing service
annotations as linked data expressed in terms of a simple conceptual model that is
suitable for both human and machine consumption and abstracts from existing
heterogeneity of service annotation formalisms: the Minimal Service Model (MSM).
The MSM is a simple RDF(S) ontology able to capture (part of) the semantics of both
Web services and Web APIs. While MSM [7] is extensible to benefit from the added
expressivity of other formalisms, iServe allows import of service annotations
following, for instance, SAWSDL, WSMO-Lite, MicroWSMO, or OWL-S. Once
imported, service annotations are automatically published on the basis of the Linked
Data principles. Service descriptions are thus accessible based on resolvable HTTP
URIs by utilising content negotiation to return service instances in either plain HTML
or RDF. In addition to a SPARQL endpoint, a REST API allows remote applications
to publish annotations and to discover services through an advanced set of discovery
strategies that combine semantic reasoning and information retrieval techniques. In
order to support users in creating semantic annotations for services two editors have
been developed: SWEET [5] (SemanticWeb sErvices Editing Tool) and SOWER
(SWEET is nOt a Wsdl EditoR), which support users in annotating Web APIs and
WSDL services respectively.
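As an illustration of how a client could retrieve published service annotations through a SPARQL endpoint of such a Linked Services store, here is a sketch using SPARQLWrapper; the endpoint URL is a placeholder and the MSM namespace URI and query shape are assumptions for illustration only.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoint URL; the MSM namespace below is assumed, not taken from the text.
sparql = SPARQLWrapper("http://example.org/iserve/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
PREFIX msm: <http://iserve.kmi.open.ac.uk/ns/msm#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?service ?label WHERE {
  ?service a msm:Service ;
           rdfs:label ?label .
} LIMIT 10
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["service"]["value"], "-", row["label"]["value"])
```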
However, SWEET and SOWER build on the assumption that either HTML
documentation of services/APIs (SWEET) or WSDL files (SOWER) are available as
starting point for annotation. While that holds for a certain set of services, a number
of services on the Web neither provide a WSDL nor an HTML documentation and
hence, current Linked Services editors cannot be deployed in a range of cases. In
addition, we would like to promote an approach where service documentation relies
on structured RDF(S) and additional human-readable documentation is not provided
manually but automatically generated to avoid redundancies. Therefore, we introduce
and demonstrate SmartLink, an editing and search environment for Linked Services
addressing the issues described above.
operates on top of LOD stores such as iServe and is an open environment accessible
to users simply via OpenID4 authentication.
SmartLink exploits an extension of the MSM schema including a number of
additional non-functional properties. These non-functional properties cover, for
instance, contact person, developer name, Quality of Service (QoS), development
status, service license, and WSMO goal reference. The latter property directly
contributes to facilitate our approach of allowing MSM models to refer to existing
WSMO goals which utilize the same service entity. MSM-schema properties are
directly stored in iServe, while additional properties are captured in a complementary
RDF store based on OpenRDF Sesame5. Due to the SmartLink-specific extensions to
the MSM, we refer in the following to our Linked Services RDF store as iServe+. The
following figure depicts the overall architecture of the SmartLink environment.
APIs – currently the WATSON6 API - to identify and recommend suitable model
references to the user.
Dedicated APIs allow machines and third party applications to interact with
iServe+, e.g., to submit service instances or to discover and execute services. In
addition, the Web application provides a search form which allows to query for
particular services. Service matchmaking is being achieved by matching a set of core
properties (input, output, keywords), submitting SPARQL queries, and a dedicated set
of APIs.
6 http://watson.kmi.open.ac.uk/WatsonWUI/
7 http://www.notube.tv
future efforts. For instance, the recommendation of LOD model references via open
APIs proved very useful to aid SmartLink users when annotating services. However,
due to the increasing number of LOD datasets – strongly differing in terms of quality
and usefulness – it might be necessary in the future to select recommendations only
based on a controlled subset of the LOD cloud in order to reduce available choices.
While SmartLink proved beneficial when creating light-weight service annotations,
the lack of service automation and execution support in our extended MSM models
and, more importantly, in the current tool support made it necessary to transform
and augment these models into more comprehensive service models (WSMO). Due
to the lack of overlap between concurrent SWS models, transformation is a manual and
costly process. Hence, our current research and development deals with the extension of
the MSM by taking into account execution and composition oriented aspects and the
development of additional APIs, which allow the discovery, execution and semi-
automated composition of Linked Services, and make the exploitation of additional
SWS approaches obsolete.
Acknowledgments
This work is partly funded by the European projects NoTube and mEducator. The
authors would like to thank the European Commission for their support.
References
[1] Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American
Magazine (2001) (retrieved March 29, 2009)
[2] Bizer, C., Heath, T., et al.: Linked data - The Story So Far. Special Issue on Linked data.
International Journal on Semantic Web and Information Systems, IJSWIS (2009)
[3] Fensel, D., Lausen, H., Polleres, A., de Bruijn, J., Stollberg, M., Roman, D., Domingue,
J.: Enabling Semantic Web Services: The Web Service Modeling Ontology. Springer,
Heidelberg (2007)
[4] Kopecky, J., Vitvar, T., Gomadam, K.: MicroWSMO. Deliverable, Conceptual
Models for Services Working Group (2008),
http://cms-wg.sti2.org/TR/d12/v0.1/20090310/d12v01_20090310.pdf
[5] Maleshkova, M., Pedrinaci, C., Domingue, J.: Supporting the creation of semantic restful
service descriptions. In: 8th International Semantic Web Conference on Workshop:
Service Matchmaking and Resource Retrieval in the Semantic Web, SMR2 (2009)
[6] Martin, D., Burstein, M., Hobbs, J., Lassila, O., McDermott, D., McIlraith, S., Narayanan,
S., Paolucci, M., Parsia, B., Payne, T., Sirin, E., Srinivasan, N., Sycara, K.: OWL-S:
Semantic Markup for Web Services. Member submission, W3C. W3C Member
Submission, November 22 (2004)
[7] Pedrinaci, C., Domingue, J.: Toward the Next Wave of Services: Linked Services for the
Web of Data. Journal of Universal Computer Science 16(13), 1694–1719 (2010)
[8] Sheth, A.P., Gomadam, K., Ranabahu, A.: Semantics enhanced services: Meteor-s,
SAWSDL and SA-REST. IEEE Data Eng. Bull. 31(3), 8–12 (2008)
[9] Vitvar, T., Kopecky, J., Viskova, J., Fensel, D.: Wsmo-lite annotations for web services.
In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008.
LNCS, vol. 5021, Springer, Heidelberg (2008)
ViziQuer: A Tool to Explore and Query SPARQL
Endpoints
Institute of Mathematics and Computer Science, University of Latvia, Raina bulv. 29,
Riga LV-1459, Latvia
[email protected], [email protected]
Abstract. The presented tool uses a novel approach to explore and query a
SPARQL endpoint. The tool is simple to use: a user only needs to enter the
address of a SPARQL endpoint of interest. The tool then extracts and
graphically visualizes the data schema of the endpoint. The user can overview
the data schema and use it to construct a SPARQL query that conforms to it.
The tool, together with additional information and help on how to use it in
practice, can be downloaded from http://viziquer.lumii.lv.
1 Introduction
SPARQL endpoints play a vital role in the Semantic Web as they provide access to
actual data for end-users and software agents. Thus, it is important for developers and
end-users to understand the structure of the underlying data in a SPARQL endpoint in
order to use the available data efficiently. A problem arises because there is not enough
fundamental work on SPARQL endpoint schema definition or on documenting the
underlying data. There is ongoing work [1] on endpoint descriptions, but it is in an
early development phase and not in actual use.
Until now, SPARQL endpoints have been developed just as access points for
SPARQL queries over semantically structured data. There is not enough work on
the SPARQL endpoint management process to document and give an overview of the
underlying data. For example, in SQL-like database systems a database programmer
can easily extract and view the underlying data schema, which makes it much easier to
develop a system that works with the database data. Likewise, for a SPARQL
endpoint user it would be much faster and easier to work with unknown
endpoint data if the user had more information about the actual data schema
(ontology) according to which the instance data in the endpoint is organized.
Use of existing query tools [2, 3, 4] is mostly like black-box testing of a SPARQL
endpoint, because they do not provide an overview of the endpoint's data schema.
Existing approaches are mostly based on faceted querying, meaning that a user gets an
overview of only a small part of the actual ontology. The user is like an explorer
without knowledge of what will be encountered in the next two steps and in which
direction to browse for the needed data. Alternatively, a programmer could construct
a SPARQL query by looking at some ontology and hoping that a SPARQL endpoint
contains the needed data and that the correct URI namespaces are used. Note that most
SPARQL endpoints nowadays do not provide any ontology or data schema at all as
documentation; thus it is quite hard and time-consuming for a programmer to guess
their data structure. There exist tools that can show all classes and all properties found
in a SPARQL endpoint, but not the structure of the underlying data.
There also exists an approach [5] that can visualize ontologies as diagram graphs to
aid better understanding of an ontology, but these applications are standalone and
cannot connect to a SPARQL endpoint to view or edit the underlying ontology. Unlike
[5], we do not intend to show all possible ontology details; rather, we
concentrate on providing a simplified ontology derived from the actual data in a
SPARQL endpoint.
1
http://viziquer.lumii.lv/
2
http://DBpedia.org/sparql
depict some relations more than once – as each logical, but not formally defined,
subclass might have the same outgoing relation present in the data, resulting in an explosion
of duplicate relations, as one can see in the naïve schema visualization in
Fig. 1, which depicts the schema visualization result for the Semantic Web Dog Food
endpoint3. We use a UML-like graphical notation similar to the one proposed in [5].
As one can easily see in Fig. 1, the naïve data schema visualization is opaque
because subclass relations are not defined and each relation is drawn multiple times.
To make it more comprehensible, we currently use a hack and display each relation
only once (see Fig. 2), even if the relation actually occurs between more than one pair
of classes. Thus, we get a comprehensible picture that is not fully semantically correct,
but understandable for an end-user. In the third step, when an end-user composes a
SPARQL query based on the extracted schema, all relation occurrences are taken into
account. A semantically correct way to solve the duplicate-relations problem would be
to allow the end-user to manually define missing subclass relations (based on meaningful
names of classes) and then automatically “lift” duplicate relations to the most abstract
superclass, making each relation appear only once.
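One straightforward way to derive such a data-driven schema is to ask the endpoint which class pairs are connected by which properties. The sketch below is purely illustrative (it is not the authors' implementation) and uses the Semantic Web Dog Food endpoint mentioned above.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://data.semanticweb.org/sparql")
# For every property, list the class pairs it connects in the instance data.
# LIMIT keeps this exploratory query cheap; a full extraction would page through results.
sparql.setQuery("""
SELECT DISTINCT ?sourceClass ?property ?targetClass WHERE {
  ?s ?property ?o .
  ?s a ?sourceClass .
  ?o a ?targetClass .
} LIMIT 500
""")
sparql.setReturnFormat(JSON)
for b in sparql.query().convert()["results"]["bindings"]:
    print(b["sourceClass"]["value"], b["property"]["value"], b["targetClass"]["value"])
```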
Fig. 2. The Semantic Web Dog Food with namespace and limited associations
3
http://data.semanticweb.org/sparql
In ViziQuer we have implemented a subset of the GQL graphical query language [8], which
provides basic features for selecting a subset of data. As queries are based on the
ontology, querying resembles constructing or drawing the subset of the ontology
corresponding to the data that an end-user is interested in for further analysis.
Query construction is possible using two paradigms. First, a user can construct a
query by selecting one class and browsing onwards, as in faceted browsing, where the
user selects a possible way forward from the already selected classes. Second, a user can
use connection-based query construction. This means that the user selects two classes
of interest by adding them to a query diagram and, by drawing a line between them,
indicates that both classes should be interconnected. The tool uses an advanced
algorithm to propose suitable paths by which these two classes can be interconnected, and the
user just needs to select the path that best fits the intended query. Thus, when a user wants
to connect classes that are somewhat further from one another, selecting an
acceptable path between the classes is all that is needed, rather than guessing by manual browsing,
which can be quite hard in the faceted browsing paradigm if the user is not very
familiar with the ontology structure. Both paradigms
can also be used in parallel when constructing a SPARQL query.
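The paper does not disclose the path-proposal algorithm; the following sketch only illustrates the idea with a plain breadth-first search over an extracted schema graph, using made-up class and property names rather than the Semantic Web Dog Food vocabulary.

```python
from collections import deque

def candidate_paths(schema_edges, start, goal, max_len=4):
    """Enumerate property paths between two classes over a schema graph.

    schema_edges: iterable of (source_class, property, target_class) triples,
    e.g. as extracted by the schema query sketched earlier.
    """
    adjacency = {}
    for src, prop, dst in schema_edges:
        adjacency.setdefault(src, []).append((prop, dst))

    queue = deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if node == goal and path:
            yield path
            continue
        if len(path) >= max_len:
            continue
        for prop, nxt in adjacency.get(node, []):
            queue.append((nxt, path + [(node, prop, nxt)]))

# Toy schema: connect Person to Conference via papers and proceedings.
edges = [
    ("Person", "madePublication", "Paper"),
    ("Paper", "partOf", "Proceedings"),
    ("Proceedings", "relatedToEvent", "Conference"),
]
for path in candidate_paths(edges, "Person", "Conference"):
    print(" -> ".join(f"{s}.{p}" for s, p, _ in path) + " -> Conference")
```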
We will briefly explain the GQL with an example query. The main idea of the GQL is
the selection of an interconnected subset of classes, attributes, and associations that
can be viewed as an ontology subset. Additionally, it is possible to restrict
the subset by basic filters on class attributes. Fig. 3 depicts an example query
constructed for the Semantic Web Dog Food endpoint.
Fig. 3. Example query of the GQL based on the Semantic Web Dog Food endpoint
The query formulated in Fig. 3 could be rephrased as “select those authors that
have edited some proceedings and also have some paper in edited proceedings that are
related to some conference event”. We add the restriction that the conference acronym is
ESWC2009. We set the answer to contain a person's first name, the name of the published
paper, the year when the proceedings were published and also the name of the conference.
For space reasons we do not show this query translated into SPARQL.
The ViziQuer implementation is based on Model Driven Architecture
technologies, which allow it to be very flexible and to connect to any SPARQL endpoint.
We used the GrTP platform [9] as the environment for the tool development. GrTP allows
easy manipulation of graphical diagrams, which is essential for constructing graphical
queries and visualizing a SPARQL endpoint's underlying data schema. The main drawback of
the platform is that it supports only the Windows operating system; thus ViziQuer
currently works only in a Windows environment.
Acknowledgments
This work has been partly supported by the European Social Fund within the project
«Support for Doctoral Studies at University of Latvia».
References
1. SPARQL endpoint description,
http://esw.w3.org/SparqlEndpointDescription
2. Heim, P., Ertl, T., Ziegler, J.: Facet Graphs: Complex Semantic Querying Made Easy. In:
Aroyo, L., Antoniou, G., Hyvönen, E., ten Teije, A., Stuckenschmidt, H., Cabral, L., Tudorache,
T. (eds.) ESWC 2010. LNCS, vol. 6088, pp. 288–302. Springer, Heidelberg (2010)
3. Longwell RDF Browser, SIMILE (2005),
http://simile.mit.edu/longwell/
4. List of SPARQL faceted browsers,
http://esw.w3.org/SemanticWebTools#Special_Browsers
5. Barzdins, J., Barzdins, G., Cerans, K., Liepins, R., Sprogis, A.: OWLGrEd: a UML Style
Graphical Notation and Editor for OWL. In: Clark, K., Sirin, E. (eds.) Proc. 7th
International Workshop OWL: Experience and Directions, OWLED-2010 (2010)
6. Berners-Lee, T., Hollenbach, J., Lu, K., Presbrey, J., Prud’hommeaux, E., Schraefel, M.C.:
Tabulator Redux: Browsing and writing Linked Data. In: Proc. WWW 2008 Workshop:
LDOW (2008)
7. Zviedris, M.: Ontology repository for User Interaction. In: d’Aquin, M., Castro,
A.G., Lange, C., Viljanen, K. (eds.) ORES-2010 Workshop on Ontology Repositories and
Editors for the Semantic Web, CEUR (2010),
http://CEUR-WS.org/Vol-596/
8. Barzdins, G., Rikacovs, R., Zviedris, M.: Graphical Query Language as SPARQL
Frontend. In: Grundspenkis, J., Kirikova, M., Manolopoulos, Y., Morzy, T., Novickis,
L., Vossen, G. (eds.) Local Proceedings of 13th East-European Conference (ADBIS 2009),
pp. 93–107. Riga Technical University, Riga (2009)
9. Barzdins, J., Zarins, A., Cerans, K., Kalnins, A., Rencis, E., Lace, L., Liepins, R., Sprogis,
A.: GrTP:Transformation Based Graphical Tool Building Platform. In: Proceedings of the
MoDELS 2007 Workshop on Model Driven Development of Advanced User Interfaces
(MDDAUI-2007), CEUR Workshop Proceedings, vol. 297 (2007)
EasyApp: Goal-Driven Service Flow Generator with
Semantic Web Service Technologies
Yoo-mi Park1, Yuchul Jung1, HyunKyung Yoo1, Hyunjoo Bae1, and Hwa-Sung Kim2
1
Service Convergence Research Team, Service Platform Department,
Internet Research Laboratory, ETRI, Korea
{parkym,jyc77,hkyoo,hjbae}@etri.re.kr
2
Dept. of Electronics & Communications Eng., KwangWoon Univ., Korea
[email protected]
1 Introduction
The Semantic Web Service is a new paradigm that combines the Semantic Web and Web
Services. It is expected to support dynamic computation of services as well as
distributed computation. Ultimately, for Web Services, it leads to goal-based computing,
which is fully declarative in nature [1-3].
Previous research on Semantic Web Services includes OWL-S [4,6] and WSMO
[5,6]. These approaches suggested new service models and description languages with ontologies
for goal-based computing. However, such semantic web service approaches using
new models and languages require ontology expertise and a lot of manual work,
even for experts. In addition, they dealt with WSDL-based web services rather than
the RESTful web services that have recently become common in industry. These
limitations make it difficult to respond quickly to the dynamically changing web service
landscape, which comprises more than 30,000 services [8,9].
In this demo paper, we introduce a goal-driven service flow generator (EasyApp)
built on novel semantic web service technologies that can be applied to currently
existing web services without any changes. Towards automatic service composition,
we have considered semi-automatic service annotation, goal-driven semantic
service discovery, and automatic service flow generation based on our goal ontology
as the key enabling technologies. These key technologies can be applied to both
WSDL-based services and RESTful services. In the use case ‘hiring process’, we show
how a software developer creates service flows that satisfy his/her goals in EasyApp,
where our novel semantic web service technologies are embedded.
2 EasyApp
(Figure: overall EasyApp environment, comprising the graphical user interfaces for goal input, the semantic criteria selector and the service flow editor, together with the semantic service annotator.)
First of all, a developer has a keyword in mind describing his/her goal, which
can be ‘job’, ‘work’, or ‘hiring’. He/she enters the keyword in the goal box (1). Based
on the given keyword, EasyApp expands it with substitute keywords (e.g. ‘vocation’,
‘occupation’, ‘career’, ‘recruiting’, ‘resume’) drawn from the goal ontology and then
looks up the relevant goals for the expanded keywords in the goal pattern library.
After goal analysis, EasyApp suggests to the developer several candidate goals which
concretize the user’s keyword. The suggested goals are ‘hiring a person’, ‘recruiting
people’, ‘offering a job’, ‘getting a job’, and ‘making a resume’. He/she can select an
appropriate goal, here ‘hiring a person’ (2). Then, he/she can select
additional criteria (non-functional semantics mentioned in Section 2.2) (3) to make
his/her requirement clearer. When he/she clicks the ‘search’ button (4), EasyApp
decomposes the goal into sub-goals using the goal pattern library and builds a basic
service flow composed of sequential sub-goals. Then, EasyApp discovers
relevant services which satisfy the sub-goals through semantic service discovery. During
semantic service discovery, the top-k services are ranked by a weighted sum of
the textual similarity score given by keyword search and the functional similarity that
represents the goal achievability of the given service. The top-k services are then re-ranked
by the weighted sum of each NFP’s weight and its importance. After discovery, the
service flow is displayed in the service flow editor (5).
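The concrete weights and similarity functions are not given in the paper; the sketch below merely illustrates the two-stage ranking scheme described above with made-up scores and weights.

```python
def rank_services(candidates, w_text=0.6, w_func=0.4):
    """Stage 1: weighted sum of textual and functional similarity (weights assumed)."""
    return sorted(candidates,
                  key=lambda s: w_text * s["text_sim"] + w_func * s["func_sim"],
                  reverse=True)

def rerank_by_nfp(top_k, nfp_importance):
    """Stage 2: re-rank the top-k list by the weighted sum of NFP weights and their importance."""
    def nfp_score(service):
        return sum(weight * nfp_importance.get(nfp, 0.0)
                   for nfp, weight in service["nfp_weights"].items())
    return sorted(top_k, key=nfp_score, reverse=True)

# Hypothetical candidate services with made-up similarity scores and NFP weights.
candidates = [
    {"name": "PostJobService", "text_sim": 0.8, "func_sim": 0.7,
     "nfp_weights": {"QoS": 0.9, "license": 0.2}},
    {"name": "HiringBroker", "text_sim": 0.6, "func_sim": 0.9,
     "nfp_weights": {"QoS": 0.5, "license": 0.8}},
]
top_k = rank_services(candidates)[:2]
print([s["name"] for s in rerank_by_nfp(top_k, {"QoS": 1.0, "license": 0.3})])
```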
(Figure: EasyApp user interface screenshot; the numbers (1)–(10) in the text refer to the interaction steps marked in this figure.)
In this use case, a service flow consists of the following sub-goals: ‘make request’
(for the requester to make a hiring request to the broker in the company) → ‘post document’
(for the requester to upload a required specification) → ‘get document’ (for the broker to get
the required specification) → ‘make meeting’ (for the requester to meet the applicants) →
‘notify person’ (for the requester to send the result to the applicants). Each sub-goal includes
the finally ranked top-k services resulting from service discovery.
The developer can choose the most appropriate service (6) by referring to the service
properties shown in the ‘property’ view (7). He/she can modify the service flow
in the editor as desired by dragging and dropping activity icons (8) from the palette
(9). After the developer finishes selecting services and modifying the service flow,
he/she can obtain Java code generated from the service flow (10). Finally, the
developer obtains a service flow for the ‘hiring process’ in EasyApp.
4 Conclusions
In this demo, we present EasyApp, a novel and practical semantic web
service composition environment. With EasyApp, a software developer can create
service flows that match the target goal regardless of web service programming
proficiency. Further work will employ semantic service mediation technology for
on-the-fly service discovery and composition of web services.
References
1. Fensel, D., Kerrigan, M., Zaremba, M.: Implementing Semantic Web Services. Springer,
Heidelberg (2008)
2. Preist, C.: A Conceptual Architecture for Semantic Web Services. In: McIlraith, S.A.,
Plexousakis, D., van Harmelen, F. (eds.) ISWC 2004. LNCS, vol. 3298, pp. 395–409.
Springer, Heidelberg (2004)
3. Cabral, L., Domingue, J., Motta, E., Payne, T.R., Hakimpour, F.: Approaches to Semantic
Web Services: an Overview and Comparisons. In: Bussler, C.J., Davies, J., Fensel, D.,
Studer, R. (eds.) ESWS 2004. LNCS, vol. 3053, pp. 225–239. Springer, Heidelberg (2004)
4. Web Service Modeling Ontology (WSMO),
http://www.w3.org/Submission/WSMO/
5. Semantic Markup for Web Services (OWL-S),
http://www.w3.org/Submission/2004/SUBM-OWL-S-20041122/
6. Lara, R., Roman, D., Polleres, A., Fensel, D.: A Conceptual Comparison of WSMO and
OWL-S. In: European Conference on Web Services, pp. 254–269 (2004)
7. Shin, D.-H., Lee, K.-H., Suda, T.: Automated Generation of Composite Web Services
based on Functional Semantics. Journal of Web Semantics 7(4), 332–343 (2009)
8. Seekda,
http://webservices.seekda.com/
9. ProgrammableWeb,
http://www.programmableweb.com/
Who’s Who – A Linked Data Visualisation Tool for
Mobile Environments
1 Introduction
Mobile devices are increasingly becoming an extension of the lives of humans in the
physical world. The popularity of these devices simplifies in-situ management of the
ordinary end user’s information needs. Specifically, smart phones’ embedded devices
(e.g., built-in cameras) make it possible to build an abstraction of the user’s environment. Such
abstraction provides contextual information that designers can leverage in adapting a
mobile interface to the user’s information needs. Further, context can act as a set of
parameters to query the Linked Data (LD) cloud. The cloud connects distributed data
across the Semantic Web; it exposes a wide range of heterogeneous data, information
and knowledge using URIs (Uniform Resource Identifiers) and RDF (Resource De-
scription Framework) [2,6]. This large amount of structured data supports SPARQL
querying and the follow your nose principle in order to obtain facts. We present Who’s
Who, a tool that leverages structured data extracted from the LD cloud to satisfy users’
information needs ubiquitously. The application provides the following contributions:
1. Exploiting contextual information: Who’s Who facilitates access to the LD cloud
by exploiting contextual information, linking the physical world with the virtual.
2. Enhanced processing of Linked Data on mobile devices: Who’s Who enables
processing of semantic, linked data, tailoring its presentation to the limited
resources of mobile devices, e.g., reducing latency when querying semantic data by
processing triples within a mobile browser’s light-weight triple store.
3. Mobile access to Linked Data: Who’s Who uses novel visualisation strategies to
access LD on mobile devices, in order to overcome the usability challenges arising
from the huge amount of information in the LD cloud and limited mobile device
display size. This visualisation also enables intuitive, non-expert access to LD.
2 Application Scenario
To illustrate our work consider the following scenario:
Bob is a potential masters student invited to an open day at a university. He will tour
different departments to get to know their facilities and research. To make the best of the
open day, Bob will use his mobile phone as a location guide, allowing him to retrieve
information (encoded as LD) about each department he visits, with the aid of visual
markers distributed around the university. This will allow him to identify researchers he
would like to meet and potential projects to work on.
We demonstrate the approach taken in Who’s Who to realise this scenario, i.e., to
support user- and context-sensitive information retrieval from the LD cloud using a mobile
device. We exemplify this using the Data.dcs [6] linked dataset, which describes the
research groups in the Department of Computer Science at the University of Sheffield.
iPhone 3GS has a 600 MHz CPU and 256 MB RAM, and the HTC Hero has a 528 MHz
CPU and 288 MB RAM; see also benchmarks for semantic data processing on small
devices in [3]). Although it is possible to handle triples in local RDF stores on
Android-based mobiles, this is not possible on other platforms such as the iPhone. An
alternative is to use existing lightweight developments such as rdfQuery1, which runs in web
browsers, and HTML5 features for persistent storage.
However, processing and rendering of semantic resources in mobile web browsers is
still limited by low memory allocation (e.g., 10-64 MB in WebKit and Firefox mobile on
iPhone and Android phones). Leaving the processing and rendering of semantic
resources entirely to the mobile client improves the user experience by reducing latency due
to multiple requests; however, memory allocation restrictions make this a sub-optimal
option. On the other hand, executing the semantic analysis and data processing entirely
on the server side results in continuous calls to the server, which translates
to high data latency and a degradation of the responsiveness of the user interface
and its interactivity. There must therefore be a compromise between the number of triples handled
by a (mobile device) web browser and visualisation flow performance.
Who’s Who follows the combined mobile and server-side architecture in Fig. 2. Based on
the parameters encoded in a visual marker, Who’s Who queries Data.dcs. The Data.dcs
triples are loaded in-memory via Jena on the server side, after which SPARQL
queries are executed. The retrieved triples are encoded with JSON – a lightweight
data-interchange format – using JSONLib2, and returned with a JavaScript callback
to the mobile device. On the Who’s Who mobile side, the triples are loaded into an
rdfQuery lightweight triple store. Interaction with the visualisation triggers local SPARQL
queries that further filter the information.
The advantages of adopting this approach are that: 1) users need not download the
application in advance (as is the case with applications relying on local RDF storage);
2) users need not know the URI corresponding to the physical entity they want to enrich,
as contextual information is pre-encoded in the visual markers; 3) there is a balance between
the triple load handled by the server and mobile sides, which translates to more
responsive user interfaces; 4) the mobile-side triple store allows semantic filtering of
the views exposed to the user, reducing latency and improving the interface’s usability.
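In the demo this server-side flow is implemented with Jena in Java; purely for illustration, the Python sketch below reproduces the same steps (load the dataset into memory, run a SPARQL query, and wrap the bindings as a JSON callback). The dump URL, the FOAF-based query and the callback name are assumptions, not details taken from the paper.

```python
import json
from rdflib import Graph

# Placeholder location of a Data.dcs dump; the demo loads the dataset via Jena instead.
graph = Graph()
graph.parse("http://example.org/data-dcs-dump.rdf")

def researchers_for_group(group_uri, callback="handleTriples"):
    """Query the researchers of a group and wrap the bindings as a JSONP-style callback."""
    query = """
    SELECT ?person ?name WHERE {
      <%s> <http://xmlns.com/foaf/0.1/member> ?person .
      ?person <http://xmlns.com/foaf/0.1/name> ?name .
    }""" % group_uri
    bindings = [{"person": str(row.person), "name": str(row.name)}
                for row in graph.query(query)]
    # The JSON payload is consumed by the rdfQuery triple store on the mobile side.
    return "%s(%s);" % (callback, json.dumps(bindings))
```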
1
rdfQuery: http://code.google.com/p/rdfquery
2
JSONLib: http://json-lib.sourceforge.net
3.3 Visualisation
Who’s Who supports the user in retrieving information stored in the LD cloud with
visualisations tailored to the application domain. User requests are automatically
translated to SPARQL queries executed on the lightweight triple store on the mobile device
itself. If required, additional triples are retrieved from the Who’s Who server. Fig. 3
describes the interaction flow for retrieving publications: 1) the user is presented with a list of
researchers corresponding to the physical entity encoded in the scanned visual marker;
2) when the user taps on a researcher – in this case Fabio Ciravegna – a SPARQL query
is executed; 3) the publication view is presented, providing an additional filtering layer.
The publication view shows a graph containing the triples resulting from the SPARQL
query – the number of publications per year and the number of collaborators involved
in each publication. In Fig. 3 (3), the user has tapped on the graph bubble corresponding
to the year 2009, which links to two collaborators. The publications are arranged in a
“card deck”, where the first publication appears in the foreground. The user can traverse
the publications – where there are multiple – by selecting the next in the deck.
Fig. 3. (1) After selecting a researcher; (2) a SPARQL query is executed; (3) the resulting triples
are presented in the graph in the publication view
4 Related Work
Searching for information about entities and events in a user’s environment is an
oft-performed activity. The state of the art focuses on text-based browsing and querying of
LD in desktop browsers, e.g., Sig.ma [8] and Marbles [1], targeted predominantly at
technical experts (see also [2]). This excludes a significant part of the user population –
non-technical end users – who make use of advanced technology embedded in everyday
devices such as mobile phones. One of the best examples of a visual browser targeted at
mainstream use is DBpedia Mobile [1], a location-aware Semantic Web client
that identifies and enriches information about nearby objects. However, it relies on GPS
sensors for retrieving context, which makes it unsuitable for our indoor scenario. Our
approach improves on existing LD browsers for mobile devices in that Who’s Who:
1) extracts contextual information encoded in visual markers; 2) hides explicit SPARQL
filters from the user, increasing usability especially for non-technical users.
5 Summary
Who’s Who was developed to support especially those end users who may have little to
no knowledge about where to find information on nearby physical entities. It provides
exploratory navigation through new environments, guided by the user’s context. Studies
(see, e.g., [5,7]) evaluating the utility and usability of tag-based interaction with mobile
device applications illustrate the potential of lowering barriers to LD use.
We have demonstrated the use of a set of visual markers, corresponding to research
groups in a university department, to explore the linked data exposed in Data.dcs, using
a smart phone equipped with a camera and a QR code scanner. We have also illustrated
how the approach taken in Who’s Who simplifies such tasks, by using visualisation
of structured data to extract relevant context and manage information load, to reveal
interesting facts (otherwise difficult to identify), and to facilitate knowledge extraction.
References
1. Becker, C., Bizer, C.: Exploring the geospatial semantic web with DBpedia Mobile. Journal
of Web Semantics 7(4), 278–286 (2009)
2. Bizer, C., Heath, T., Berners-Lee, T.: Linked Data – The Story So Far. International Journal
on Semantic Web and Information Systems (2009)
3. d’Aquin, M., Nikolov, A., Motta, E.: How much semantic data on small devices? In: Cimiano,
P., Pinto, H.S. (eds.) EKAW 2010. LNCS, vol. 6317, pp. 565–575. Springer, Heidelberg (2010)
4. Fröhlich, P., Oulasvirta, A., Baldauf, M., Nurminen, A.: On the move, wirelessly connected to
the world. ACM Commun. 54, 132–138 (2011)
5. Mäkelä, K., Belt, S., Greenblatt, D., Häkkilä, J.: Mobile interaction with visual and RFID tags:
a field study on user perceptions. In: Proc. CHI 2007, pp. 991–994 (2007)
6. Rowe, M.: Data.dcs: Converting legacy data into linked data. In: Proc., Linked Data on the
Web Workshop at WWW’10 (2010)
7. Toye, E., Sharp, R., Madhavapeddy, A., Scott, D., Upton, E., Blackwell, A.: Interacting with
mobile services: an evaluation of camera-phones and visual tags. Personal and Ubiquitous
Computing 11, 97–106 (2007)
8. Tummarello, G., Cyganiak, R., Catasta, M., Danielczyk, S., Delbru, R., Decker, S.: Sig.ma:
live views on the web of data. In: WWW’10, pp. 1301–1304 (2010)
OntosFeeder – A Versatile Semantic Context
Provider for Web Content Authoring
1 Introduction
One of the routine tasks of a content author (e.g. a journalist) while
writing is researching context information required for the intended article.
Without proper tool support, the author has to resort to manual searching (e.g.
via Google) and skimming through available information sources. The availability of
structured data on the Semantic Data Web allows these routine activities
to be automated by identifying topics within the article with the aid of Natural Language
Processing (NLP) and subsequently presenting relevant context information
retrieved from the Linked Open Data Web (LOD).
We present the Ontos Feeder 1 – a system serving as context information
provider, that can be integrated into Content Management Systems in order to
1
http://www.ontos.com
Fig. 1. Entities are highlighted in the WYSIWYG editor of the CMS; pop-ups allow
further information to be selected
3 Architecture
While the server side consists of the OWS, the client side consists of the Core
system and the CMS adapters (see Figure 3). The Core is CMS-independent
and can be embedded into a specific CMS by an appropriate adapter. Currently,
adapters for Drupal and WordPress are available.
Ontos Web Service (OWS). The Core system sends queries to the OWS. The Ontos
Knowledge Base contains aggregated and refined information from around 1
million documents, mainly from English online news. The Ontos Semantic Engine
(NLP) extracts entities and their relationships from the text and disambiguates
entities based on significance [1]. The significance is a complex measure based on
the position of the occurrence in the text, the overall number of occurrences and
the number of connected events and facts extracted from the text. The resulting
information is returned to the Ontos Feeder.
Ontos Feeder. The Ontos Feeder currently supports requesting information for
persons, organisations, locations and products, but can generally be extended to
handle any of the entity types supported by the OWS. The user can configure
which types of entities the OWS should try to recognize in the provided text.
The retrieval of each single piece of contextual information is encapsulated as a
separate task by Ontos Feeder to increase flexibility. The task engine supports
task chaining, so if information could not be retrieved from a particular
Linked Data source, it is requested from another one. The type of presented
contextual information depends on the type of the recognized entity. The contextual
information for a Person, for example, can consist of the age, nationality,
status roles, connections to other persons and organisations, the latest articles about
this person, a Wikipedia article, a New York Times article, the personal homepage
and a collection of different public social network profiles from Twitter or
Facebook. Information about connections to other people and organisations, the
status roles and the relevant articles is collected from the OWS. As every single
information piece is requested by its own task, the variety of the presented
contextual information can easily be adapted to personal needs.
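The task-chaining behaviour can be pictured as a simple fallback chain. The sketch below is illustrative only (it is not OntosFeeder code), and the source names and lookup functions are made up.

```python
class ContextTask:
    """One retrieval task; tasks are chained so a failed lookup falls through to the next source."""
    def __init__(self, source_name, fetch, next_task=None):
        self.source_name = source_name
        self.fetch = fetch          # callable: entity -> dict or None
        self.next_task = next_task

    def run(self, entity):
        result = self.fetch(entity)
        if result is not None:
            return self.source_name, result
        if self.next_task is not None:
            return self.next_task.run(entity)
        return None, None

# Hypothetical sources: try one Linked Data source first, fall back to another.
chain = ContextTask("dbpedia", lambda e: None,
                    ContextTask("nytimes", lambda e: {"abstract": "..."}))
print(chain.run("Barack Obama"))  # falls through to the second source
```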
4 Embedding Metadata
The OWS is able to annotate plain text as well as markup data such as HTML
documents. The result is returned as a stand-off annotation, either in the
form of start and end positions for text or an XPath expression for XML markup.
A specialized annotation algorithm is used to: 1. highlight the annotations in the
source HTML document in the editors, and 2. insert the annotations inline (as,
e.g., RDFa) into the HTML source of the article. Because all of the supported
CMS WYSIWYG editors (currently FCKEditor and TinyMCE5) are capable
of returning the current article as plain text, Ontos Feeder utilizes the Web
Service in plain-text mode. As each of the editors has a different API, a special
abstraction layer is put in front of the annotation algorithm to make it
editor-independent. Furthermore, to make the annotation algorithm work faster for a
plain-text document, all annotations are sorted in descending order and inserted
bottom-up into the text. This avoids the recalculation of annotation positions
that top-down insertion would require. The annotation algorithm is capable
of dealing with all of the supported semantic markup languages (RDFa and
Microformats) and allows for annotation highlighting and on-the-fly binding of
the contextual pop-up menu (see Figure 1).
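The bottom-up insertion trick can be sketched as follows (illustrative code, not the OntosFeeder implementation): because annotations are processed in descending order of their start offset, every insertion happens to the right of all annotations still to be processed, so their character positions never have to be recalculated.

```python
def insert_annotations(text, annotations):
    """Insert inline markup for stand-off annotations given as (start, end, uri) tuples."""
    # Sort by descending start offset so earlier insertions never shift later ones.
    for start, end, uri in sorted(annotations, key=lambda a: a[0], reverse=True):
        entity = text[start:end]
        # RDFa-style wrapping; the attribute values are placeholders.
        markup = '<span about="%s" typeof="rdf:Resource">%s</span>' % (uri, entity)
        text = text[:start] + markup + text[end:]
    return text

print(insert_annotations(
    "Berlin is the capital of Germany.",
    [(0, 6, "http://dbpedia.org/resource/Berlin"),
     (25, 32, "http://dbpedia.org/resource/Germany")]))
```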
References
1. Efimenko, I., Minor, S., Starostin, A., Drobyazko, G., Khoroshevsky, V.: Providing
Semantic Content for the Next Generation Web. In: Semantic Web, pp. 39–62.
InTech (2010)
2. Glaser, H., Jaffri, A., Millard, I.: Managing co-reference on the semantic web. In:
WWW 2009 Workshop: Linked Data on the Web, LDOW 2009 (April 2009)
5
http://ckeditor.com/ and http://tinymce.moxiecode.com/
6
http://drupal.org/project/[opencalais|zemanta]
wayOU – Linked Data-Based Social Location
Tracking in a Large, Distributed Organisation
Abstract. While the publication of linked open data has gained momentum
in large organisations, the way for users of these organisations to engage
with these data is still unclear. Here, we demonstrate a mobile application
called wayOU (where are you at the Open University) which relies
on the data published by The Open University (under data.open.ac.uk)
to provide social, location-based services to its students and members of
staff. An interesting aspect of this application is that it not only consumes
linked data produced by the University from various repositories,
but also contributes to it by creating new connections between people,
places and other types of resources.
1 Introduction
The Open University1 is a large UK University dedicated to distance learning.
Apart from its main campus, it is distributed over 13 regional centres across the
country. As part of the LUCERO project (Linking University Content for Education
and Research Online2), the Open University is publishing linked open data
concerning people, publications, courses, places, and open educational material
from existing institutional repositories and databases, under data.open.ac.uk.3
While the collection, conversion, exposure and maintenance of linked data
in large organisations is slowly becoming easier, it is still an issue to get users
of these organisations to engage with the data in a way that suits them and
that could also help re-enforce the data. Many ‘applications’ of linked
data concern the visualisation or exploration of available data for a particular
purpose (see for example [1]), especially in mobile applications (see for
example [2] or [3]), or the use of linked data to accomplish a specific task (e.g.,
recommendation in DBrec [4]). Our goal is to provide features that not
only make use of linked data, but whose usage also contributes to creating new
connections in the data, including currently implicit relations between people
and places.
We demonstrate wayOU (where are you at the Open University): a mobile
application developed for the Google Android platform4 that allows users of the
1
http://www.open.ac.uk
2
http://lucero-project.info
3
http://data.open.ac.uk
4
http://www.android.com
Open University (members of staff and students) to keep track of the places in
which they have been on the main campus and in the regional centres, and to connect
this information to other aspects of their interaction with the organisation (their
workplace, meetings with colleagues, tutorials, etc.). This application relies on
the data available through the SPARQL endpoint of data.open.ac.uk to get
information regarding places (buildings, floors) and people (identifier, network), as
well as the linked data published by the UK government regarding locations in
the UK (especially data.ordnancesurvey.co.uk5). More importantly, it generates
and keeps track of information for each user regarding their current location,
usual workplace, history of visits to the Open University and the reasons
for these visits. In this way, it can be compared to the foursquare application6,
working in a specific ‘corporate’ environment and allowing users to declare and expose
more complex connections between themselves and places in this environment.
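For illustration, a query of the kind the application might send to the data.open.ac.uk endpoint is sketched below; the endpoint URL and the way buildings are typed are assumptions, not details taken from the paper.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Assumed endpoint location for data.open.ac.uk.
sparql = SPARQLWrapper("http://data.open.ac.uk/sparql")
# Generic exploratory query: list labelled resources whose type name contains "Building".
# The exact class URIs used by data.open.ac.uk are not given in the paper.
sparql.setQuery("""
SELECT ?place ?label WHERE {
  ?place a ?type ;
         <http://www.w3.org/2000/01/rdf-schema#label> ?label .
  FILTER regex(str(?type), "[Bb]uilding")
} LIMIT 20
""")
sparql.setReturnFormat(JSON)
for b in sparql.query().convert()["results"]["bindings"]:
    print(b["place"]["value"], b["label"]["value"])
```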
In the next section, we give a general overview of the structure of the wayOU
application and in Section 3, we give more details about the way users interact
with it. We discuss in Section 4 the future work and challenges we identified from
our experience in building a social, mobile application based on linked data.
References
1. Lehmann, J., Knappe, S.: DBpedia Navigator. In: ISWC Billion Triple Challenge
(2008)
2. Becker, C., Bizer, C.: DBpedia mobile: A location-enabled linked data browser. In:
Proceedings of Linked Data on the Web Workshop (LDOW 2008), Citeseer (2008)
3. van Aart, C., Wielinga, B., van Hage, W.R.: Mobile Cultural Heritage Guide:
Location-Aware Semantic Search. In: Cimiano, P., Pinto, H.S. (eds.) EKAW 2010.
LNCS, vol. 6317, pp. 257–271. Springer, Heidelberg (2010)
4. Passant, A., Decker, S.: Hey! Ho! Let’s Go! Explanatory Music Recommendations
with dbrec. In: Aroyo, L., Antoniou, G., Hyvönen, E., ten Teije, A., Stuckenschmidt,
H., Cabral, L., Tudorache, T. (eds.) ESWC 2010. LNCS, vol. 6089, pp. 411–415.
Springer, Heidelberg (2010)
8
http://twitter.com
9
http://facebook.com
10
http://www.facebook.com/apps/application.php?api_key=06d85b85540794e2fd02e9ef83206bf6
SeaFish: A Game for Collaborative and Visual
Image Annotation and Interlinking
1 Motivation
The automated interpretation and description of images is still a major challenge.
CAPTCHAs are a prime example of a task that is trivial for a human user
but impossible for a computer [6], allowing the distinction between a human user
and a machine. The paradigm of human computation is also the foundation for
”games with a purpose” [4], which aim to exploit human intelligence for solving
computationally difficult tasks by masquerading them as games. Thereby,
an abstract (knowledge acquisition or curation) task is not only hidden behind
an attractive, easy-to-understand user interface, but users are also given incentives
to dedicate their time to solving the task. Playing, competition, and social
Related work. We sketch related work focusing on games for the creation of image
annotations. A comprehensive list of games for knowledge acquisition is available
at the INSEMTIVES homepage2. Luis von Ahn introduced the ESP game [7],
aka Google’s Image Labeler, for annotating images. Additionally, he created
other serious games that are published at the GWAP homepage3. PopVideo4
aims to label videos and Waisda5 allows players to tag live TV shows. Magic
Bullet [9] evaluates the robustness of CAPTCHAs, and TagCaptcha [3] produces image
annotations as a side product of the CAPTCHA process. Finally, Picture This6 is
another game for annotating images; however, unlike in the ESP game, users don’t
create tags but have to decide on the best-matching tag.
SeaFish. SeaFish is a single-player game where players have to fish for related images
that are floating around. Users can play from the OntoGame website7 as either a
registered user or an anonymous guest. Players may also log in via Facebook by
visiting the OntoGame Facebook page8. After they click the SeaFish! button
1
flickr™ wrappr, http://www4.wiwiss.fu-berlin.de/flickrwrappr/
2
INSEMTIVES, http://www.insemtives.eu/games.php
3
Games with a Purpose, www.gwap.com
4
www.gwap.com/popvideo
5
Waisda, http://waisda.nl/
6
Picture This
7
OntoGame, http://www.ontogame.org/games
8
OntoGame Facebook page, http://apps.facebook.com/ontogame
the game starts. From there players may either get playing instructions or start
a game. Each game round is about a concept taken from DBpedia that is represented
by an image. Players see this image on the right-hand side. Additionally,
players see the results of a search for the concept on flickr™ wrappr, which
are floating across the main screen (see Figure 1). Players have 2 minutes to
mark those images as either related to the concept or unrelated to it.
They can do so by catching the images with the fishing rod and dragging them
either to the green basket (related) or the red basket (not related).
Players always obtain a small reward for catching an image. Generally, players
get more points when their decision is consensual with the decisions
of the majority of other players. Additionally, the reward a player gets always
depends on their reliability as well as their level. When an image is caught, a
new image appears, so that there are always fewer than ten images floating on the screen
at the same time. The game stops when all images have been fished or the time
is up. After each game round (see Figure 2) players can compare their answers
with the community’s answers as well as view statistics on accuracy and time.
To give an example of SeaFish: in a game round the player is shown an image of
Blackberry (see Figure 1) on the right-hand side (retrieved from DBpedia). As a
result of the query Blackberry on flickr™ wrappr, pictures are floating around
on the main screen. The task of the player is to select those images that are
related to the concept Blackberry (the berry) and those that are not by putting
images in the ”Discard” or ”Catch” baskets at the bottom of the screen.
Data export. SeaFish exports annotations as RDF triples. In our first scenario
we create annotations of the form <image> <http://xmlns.com/foaf/spec/depiction>
<concept>, as well as the inverse property. Our concepts are taken from
DBpedia and follow the four Linked Open Data principles [1]. This means that
our data may be used to contribute to the Linked Open Data cloud without
being modified.
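A minimal sketch of such an export with rdflib is shown below (illustrative only; the property direction follows the FOAF specification, with foaf:depiction pointing from the concept to the image and foaf:depicts as its inverse, and the image URL is a placeholder).

```python
from rdflib import Graph, Namespace, URIRef

FOAF = Namespace("http://xmlns.com/foaf/0.1/")

def export_annotation(image_uri, concept_uri):
    """Emit the image/concept link and its inverse as RDF triples in Turtle."""
    g = Graph()
    g.bind("foaf", FOAF)
    image, concept = URIRef(image_uri), URIRef(concept_uri)
    g.add((concept, FOAF.depiction, image))   # concept -> image
    g.add((image, FOAF.depicts, concept))     # inverse: image -> concept
    return g.serialize(format="turtle")

print(export_annotation(
    "http://example.org/photos/blackberry.jpg",   # placeholder image URL
    "http://dbpedia.org/resource/Blackberry"))
```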
Our annotations are generated from stored answers in the following manner:
let s_i be an answer stating that an image is related to a concept, i a number
greater than six, r_i the reliability of the player giving the answer at the time of
the answer, and n the total number of answers stored about an image. A player’s
reliability is a measure of how accurately this player played in previous games.
Currently, three months after the initial release of SeaFish, we have collected
feedback, implemented the proposed changes and re-released the game. A still
open issue is the lack of returning players. To counter this we have integrated
levelling mechanisms. We have also integrated the game on Facebook to profit from the
social network effect. We are currently evaluating the outcome of these measures.
In addition, we are evaluating the effect of tweaking the rewarding function on the
quality of the generated annotations.
4 Conclusion
At ESWC 2011 the demo audience will be able to play the games
of the OntoGame series (including SeaFish) and see how and which data is
generated. In this paper, we have described the SeaFish game for collaborative
image annotation. We are currently collecting massive user input in order to
thoroughly evaluate the game by assessing the quality of the generated data and the
user experience.
Acknowledgments
The work presented has been funded by the FP7 project INSEMTIVES under
EU Objective 4.3 (grant number FP7-231181).
References
1. Berners-Lee, T.: Linked data - design issues (2006)
2. Hausenblas, M., Troncy, R., Raimond, Y., Bürger, T.: Interlinking multimedia: How
to apply linked data principles to multimedia fragments. In: Linked Data on the Web
Workshop, LDOW 2009 (2009)
3. Morrison, D., Marchand-Maillet, S., Bruno, E.: Tagcaptcha: annotating images with
captchas. In: Proceedings of the ACM SIGKDD Workshop on Human Computation,
HCOMP 2009, pp. 44–45. ACM, New York (2009)
4. Von Ahn, L.: Games with a purpose. IEEE Computer 29(6), 92–94 (2006)
5. Von Ahn, L.: Peekaboom: A Game for Locating Objects in Images (2006)
6. Von Ahn, L., Blum, M., Hopper, N.J., Langford, J.: Captcha: Using hard ai problems
for security (2003)
7. Ahn, L.v., Dabbish, L.: Labeling images with a computer game. In: CHI, pp. 319–326
(2004)
8. von Ahn, L., Ginosar, S., Kedia, M., Blum, M.: Improving Image Search with
PHETCH. In: IEEE International Conference on Acoustics, Speech and Signal Pro-
cessing, ICASSP 2007, vol. 4, pp. IV-1209 –IV-1212 (2007)
9. Yan, J., Yu, S.-Y.: Magic bullet: a dual-purpose computer game. In: Proceedings of
the ACM SIGKDD Workshop on Human Computation, HCOMP 2009, pp. 32–33.
ACM, New York (2009)
9
Ookaboo, http://ookaboo.com/
The Planetary System: Executable Science,
Technology, Engineering and Math Papers
corpora have been imported from external sources and are presented read-only.
We have prepared demos of selected services in all of these instances.
Fig. 1. Interacting with an arχiv article via FoldingBar, InfoBar, and localized
discussions. On the right: localized folding inside formulæ
4 Related Work
Like a semantic wiki, Planetary supports editing and discussing resources.
Many wikis support LaTeX formulæ, but without fine-grained semantic annotation.
They can merely render formulæ in a human-readable way but not
make them executable. The Living Document [19] environment enables users
to annotate and share life science documents and interlink them with
Web knowledge bases, turning – like Planetary – every single paper into a
portal for exploring the underlying network. However, life science knowledge
structures, e.g. proteins and genes, are relatively flat, compared to the tree-like
and context-sensitive formulæ of STEM. State-of-the-art math e-learning
systems, including ActiveMath [20] and MathDox [21], also make papers executable.
However, they do not preserve the semantic structure of these papers
in their human-readable output, which makes it harder for developers to embed
additional services into papers.
References
1. Executable Paper Challenge, http://www.executablepapers.com
2. David, C., et al.: eMath 3.0: Building Blocks for a social and semantic Web for
online mathematics & ELearning. In: Workshop on Mathematics and ICT (2010),
http://kwarc.info/kohlhase/papers/malog10.pdf
3. Kohlhase, A., Kohlhase, M., Lange, C.: Dimensions of formality: A case study
for MKM in software engineering. In: Autexier, S., Calmet, J., Delahaye, D., Ion,
P.D.F., Rideau, L., Rioboo, R., Sexton, A.P. (eds.) AISC 2010. LNCS(LNAI),
vol. 6167, pp. 355–369. Springer, Heidelberg (2010)
4. Lange, C.: Ontologies and Languages for Representing Mathematical Knowledge
on the Semantic Web. Submitted to Semantic Web Journal,
http://www.semantic-web-journal.net/underreview
5. arXMLiv Build System, http://arxivdemo.mathweb.org
6. PlanetMath Redux, http://planetmath.mathweb.org
7. Kohlhase, M., et al.: Planet GenCS, http://gencs.kwarc.info
8. David, C., Kohlhase, M., Lange, C., Rabe, F., Zhiltsov, N., Zholudev, V.: Publish-
ing math lecture notes as linked data. In: Aroyo, L., Antoniou, G., Hyvönen, E.,
ten Teije, A., Stuckenschmidt, H., Cabral, L., Tudorache, T. (eds.) ESWC 2010.
LNCS, vol. 6089, pp. 370–375. Springer, Heidelberg (2010)
9. Logic Atlas and Integrator, http://logicatlas.omdoc.org
10. Lange, C.: SWiM – A semantic wiki for mathematical knowledge management. In:
Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008.
LNCS, vol. 5021, pp. 832–837. Springer, Heidelberg (2008)
11. Kohlhase, M.: OMDoc An open markup format for mathematical documents [Ver-
sion 1.2]. LNCS (LNAI), vol. 4180. Springer, Heidelberg (2006)
12. Open Math 2.0. (2004), http://www.openmath.org/standard/om20
13. MathML 3.0., http://www.w3.org/TR/MathML3
14. Gardner, J., Krowne, A., Xiong, L.: NNexus: Towards an Automatic Linker for
a Massively-Distributed Collaborative Corpus. IEEE Transactions on Knowledge
and Data Engineering 21.6 (2009)
15. Kohlhase, A., Kohlhase, M., Lange, C.: sTeX – A System for Flexible Formalization
of Linked Data. In: I-Semantics (2010)
16. Lange, C., et al.: Expressing Argumentative Discussions in Social Media Sites. In:
Social Data on the Web Workshop at ISWC (2008)
17. David, C., Lange, C., Rabe, F.: Interactive Documents as Interfaces to Computer
Algebra Systems: JOBAD and Wolfram|Alpha. In: CALCULEMUS, Emerging
Trends (2010)
18. Giceva, J., Lange, C., Rabe, F.: Integrating web services into active mathemat-
ical documents. In: Carette, J., Dixon, L., Coen, C.S., Watt, S.M. (eds.) MKM
2009, Held as Part of CICM 2009. LNCS(LNAI), vol. 5625, pp. 279–293. Springer,
Heidelberg (2009)
19. García, A., et al.: Semantic Web and Social Web heading towards Living Docu-
ments in the Life Sciences. In: Web Semantics 8.2–3 (2010)
20. ActiveMath, http://www.activemath.org
21. MathDox Interactive Mathematics, http://www.mathdox.org
Semantic Annotation of Images on Flickr
1 Introduction
The task of discovering the semantics of photos is still very difficult and hence
automatic annotation with concepts “is widely recognized as an extremely difficult
issue” [1]. It is thus preferable to ask the creators of the photos to annotate
them directly when they share them. As was demonstrated in [2], simple free-text
annotations are not sufficient for performing good indexing and leveraging
semantic search.
In this paper we discuss an extension to a popular open source photo uploader
for the Flickr1 website that allows the annotation of image files with semantic
annotations without extra involvement from the user. One of the features of this
tool is the bootstrapping of semantic annotations by extracting the intrinsic
semantics contained in the context in which the images reside on the local
computer of the user before uploading them to Flickr, using the technology
described in [3]. The users can also manually provide semantic annotations through
an extended interface. These semantic annotations are linked to their meaning
in a knowledge organisation system such as WordNet2.
The source code for the tools described in this paper is available freely at
https://sourceforge.net/projects/insemtives/.
The platform services are exposed through the INSEMTIVES platform APIs,
which can then be used by third-party applications. The APIs are based on a
communication framework that implements a highly modular client-server architecture
with a communication layer over the JMS3 messaging protocol
as well as six other protocols, including REST4 and Web Service Notification.
The platform is divided into two main components: Structured Knowledge and
User and Communities. The Structured Knowledge component stores all artifacts
in RDF following the semantic annotation model defined in [5]. The semantic
store relies on OWLIM5, while the initial lexical resource is based on WordNet
and DBpedia. The User and Communities component contains user profiles and
is responsible for maintaining provenance information about the annotations and
resources in the platform.
3
Java Messaging Service, see http://java.sun.com/products/jms/
4
http://en.wikipedia.org/wiki/Representational_State_Transfer
5
http://www.ontotext.com/owlim/
6
http://www.juploadr.org/
jUploadr allows the user to queue photos for uploading to Flickr and to set their
properties (see Figure 1), in particular the tags, description and title, which are
the main metadata stored and displayed by the Flickr website.
Location Tagging. In the “geo” tab of the properties window for a photo, the user
can specify where the photo was taken. Once a location is specified for a photo, the
user can press the “Find concepts” button to ask the INSEMTIVES tool to automatically
find new concepts that relate to the location where the photo was taken,
by looking at popular terms already used for this location on Flickr. These concepts
are then added to the list of semantic tags in the “Photo Info” tab, where the
user can remove the ones that might not fit the photo.
7
These senses are taken from WordNet.
For the concepts that were automatically proposed by the services described
earlier, the user can correct the disambiguation by selecting the right sense from
a drop-down list. Here also, a summary is shown and the user can display the
definition of the concept (see Figure 3.2b).
4 Semantic Search
On the Flickr website, photos can only be searched by keywords; hence, if a
photo is tagged with “cat” and another one with “dog”, these photos will not
be found when searching for “animal”.
However, if the photos were uploaded with the INSEMTIVES Uploadr and
were assigned the concepts “cat” and “dog”, then they can be retrieved with a
semantic search for “animal”. To do this, the INSEMTIVES Uploadr provides a
specific search interface as shown in Figure 3.
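The mechanics of this search are not spelled out in the paper; as a rough sketch of the underlying idea (hypothetical vocabulary, plain RDFS subsumption via a SPARQL property path rather than the platform's actual reasoning), photos annotated with narrower concepts can be retrieved for a broader concept as follows:

from rdflib import Graph

# Toy data: two annotated photos and a tiny concept hierarchy.
g = Graph()
g.parse(data="""
    @prefix ex:   <http://example.org/annotation#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    ex:Cat rdfs:subClassOf ex:Animal .
    ex:Dog rdfs:subClassOf ex:Animal .

    ex:photo1 ex:hasConcept ex:Cat .
    ex:photo2 ex:hasConcept ex:Dog .
""", format="turtle")

# A search for "animal" follows subClassOf chains of any length, so the photos
# annotated with the narrower concepts Cat and Dog are both returned.
query = """
    PREFIX ex:   <http://example.org/annotation#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?photo WHERE {
        ?photo ex:hasConcept ?c .
        ?c rdfs:subClassOf* ex:Animal .
    }
"""
for row in g.query(query):
    print(row.photo)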
Fig. 3. Example of a Semantic Search. A search for “journey” found photos about a
“trip”.
5 Demonstration
The goal of the demonstration will be to show the extension of the standard
open source jUploadr application with the specific semantic annotation tools.
The demonstration will be split into two main scenarios:
Annotation of Photos: the new tools for semantic annotation of images will be
demonstrated in a scenario showing the use of automatic concept extraction
from the local context, recommendation of concepts from the location, and
manual input of concepts to annotate a photo.
Semantic Search for Photos: once some photos have been annotated and uploaded
to Flickr, we will demonstrate the semantic search tool, which is able
to retrieve photos not only by the tags attached to them but also by the
concepts used for annotation, thus finding related photos by mapping synonymous
terms to the same concept and by reasoning about the subsumption
relationship between concepts.
The visitors will also be shown how to download and install the tool for their
own use with their Flickr accounts so that they can upload ESWC’11 photos
with semantic annotations.
References
1. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Image retrieval: Ideas, influences, and trends
of the new age. ACM Comput. Surv. 40, 5:1–5:60 (2008)
2. Andrews, P., Pane, J., Zaihrayeu, I.: Semantic disambiguation in folksonomy: a case
study. In: Bernardi, R., Chambers, S., Gottfried, B., Segond, F., Zaihrayeu, I. (eds.)
Advanced Language Technologies for Digital Libraries. LNCS Hot Topic subline.
Springer, Heidelberg (2011)
3. Zaihrayeu, I., Sun, L., Giunchiglia, F., Pan, W., Ju, Q., Chi, M., Huang, X.: From
web directories to ontologies: Natural language processing challenges. In: Aberer,
K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika,
P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) ASWC
2007 and ISWC 2007. LNCS, vol. 4825, pp. 623–636. Springer, Heidelberg (2007)
4. Siorpaes, K., Konstantinov, M., Popov, B.: Requirement analysis and architectural
design of semantic content management platform. Technical report, Insemtives.eu
(September 2009)
5. Andrews, P., Zaihrayeu, I., Pane, J., Autayeu, A., Nozhchev, M.: Report on the
refinement of the proposed models, methods and semantic search. Technical report,
Insemtives.eu (November 2010)
FedX: A Federation Layer for Distributed Query
Processing on Linked Open Data
1 Introduction
Motivated by the ongoing success of the Linked Open Data initiative and the
growing amount of semantic data sources available on the Web, new approaches
to query processing are emerging. While query processing in the context of RDF
is traditionally done locally using centralized stores, recently one can observe
a paradigm shift towards federated approaches which can be attributed to the
decentralized structure of the Semantic Web. The Linked Open Data cloud,
representing a large portion of the Semantic Web, comprises more than 200
datasets that are interlinked by RDF links. In practice, many scenarios exist
where more than one data source can contribute information, making query
processing more complex. Contrary to the idea of Linked Data, centralized query
processing requires copying and integrating relevant datasets into a local repository.
Given this decentralized structure, the natural approach in such a setting is
federated query processing over the distributed data sources.
While there exist efficient solutions to query processing in the context of RDF
for local, centralized repositories [7,5], research contributions and frameworks for
distributed, federated query processing are still in the early stages. In practical
terms the Sesame framework in conjunction with AliBaba1 is one possible sample
solution allowing for federations of distributed repositories and endpoints. How-
ever, benchmarks have shown poor performance for many queries in the federated
setup due to the absence of advanced optimization techniques [6]. From the re-
search community DARQ [9] and Networked Graphs [10] contribute approaches
to federated SPARQL queries and federated integration. Since both require pro-
prietary extensions to languages and protocols which are not supported by most
of today’s endpoints, they are not applicable in practical environments.
In this demonstration paper we present FedX, a practical framework for trans-
parent access to data sources through a federation. The framework offers efficient
query processing in the distributed setting, while using only protocols and stan-
dards that are supported by most of today’s data sources.
In the following we will describe the FedX system and give a demonstration
of its practical applicability in the Information Workbench. In section 2 we give
some insights into the federation layer. Next, in section 3 we present the demon-
stration scenario. Finally, we conclude with some remarks on future work.
[Figure: FedX architecture overview. The Information Workbench application layer sits on top of the Sesame framework's query processing infrastructure (parsing, Java mappings, I/O, public API); data sources are accessed as SPARQL endpoints, native repositories, or custom repositories.]
• Join order: Join order tremendously influences performance, since the number
of intermediate results determines the overall query runtime. In FedX, the variable
counting technique proposed in [3], supplemented with various heuristics,
is used to estimate the cost of each join. Following a greedy approach, the
joins are then executed in ascending order of cost (a simplified sketch of this idea follows the list).
• Bound joins: To reduce the number of requests and thus the overall runtime,
joins are computed as block nested loop joins.
• Groupings: Statements which have the same relevant data source are co-executed
in a single SPARQL query to push joins to the particular endpoint.
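As an illustration of the join-ordering idea only (a minimal sketch assuming a toy triple-pattern representation, not FedX's actual cost model or heuristics), variable counting scores a pattern by how many of its positions are still unbound, and a greedy loop then repeatedly picks the cheapest pattern, treating variables bound by already-ordered patterns as constants:

# Simplified variable-counting join ordering (illustrative only).
def cost(pattern, bound_vars):
    # A position is free if it is a variable not bound by an earlier pattern;
    # fewer free positions means the pattern is assumed to be more selective.
    return sum(1 for term in pattern
               if term.startswith("?") and term not in bound_vars)

def greedy_join_order(patterns):
    remaining, bound_vars, ordered = list(patterns), set(), []
    while remaining:
        best = min(remaining, key=lambda p: cost(p, bound_vars))
        remaining.remove(best)
        ordered.append(best)
        bound_vars.update(t for t in best if t.startswith("?"))
    return ordered

patterns = [
    ("?drug", "rdf:type", "drugbank:Drug"),
    ("?drug", "owl:sameAs", "?dbpediaDrug"),
    ("?dbpediaDrug", "dbpedia:casNumber", "?cas"),
]
print(greedy_join_order(patterns))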
With the goal of illustrating the practicability of our system we provide a demon-
stration scenario using the previously discussed architecture. We employ the
Information Workbench for demonstrating the federated approach to query pro-
cessing with FedX. The Information Workbench is a flexible platform for Linked
Data application development and provides, among others, frontend facilities for
our UI as well as the integration with the backend, i.e., the query processing layers.
In our demonstration we show a browser-based UI allowing dynamic access
to and manipulation of federations at query time as well as ad hoc query formulation;
we then execute the optimized query at the configured data sources using
FedX, and finally we present the query results in the platform's widget-based
visualization components. The scenario steps from the user's point of view are
summarized in the following and illustrated in Figure 2.
1. Linked Open Data discovery. Data sources can be visually explored and
discovered using a global data registry.
2. Federation setup. The federation is constructed and/or modified dynam-
ically on demand using a browser based self-service interface. Discovered
Linked Data repositories can be integrated into the federation with a single
click.
3. Query definition. A query can be formulated ad hoc using SPARQL or
selected from a subset of the FedBench queries. The predefined queries are
designed to match the domain-specific data sources and produce results.
3 FedBench project page: http://code.google.com/p/fbench/
4 For an initial comparison we employed the AliBaba extension for the Sesame framework. To the best of our knowledge AliBaba provides the only federation layer available that does not require any proprietary extensions (e.g. SPARQL extensions).
[Figure 2: the four demonstration steps. (1) Linked Open Data discovery: visual exploration of data sets; (2) self-service federation setup: integrate discovered Linked Data; (3) query definition: ad hoc query formulation using SPARQL; (4) query execution and result presentation: widget-based visualization in the Information Workbench.]
For the demonstration we use cross-domain and life-science datasets and queries
as proposed in the FedBench benchmark. Those collections span a subset of the
Linked Open Data cloud and are useful for illustrating the practical applicability of
query processing techniques such as those of FedX. Since FedX improves query
response time compared to existing solutions, and moreover since the total run-
time for most queries is in a range that is considered responsive, it is a valuable
contribution for practical federated query processing.
References
1. Broekstra, J., Kampman, A., van Harmelen, F.: Sesame: A Generic Architecture
for Storing and Querying RDF and RDF Schema. In: Horrocks, I., Hendler, J.
(eds.) ISWC 2002. LNCS, vol. 2342, p. 54. Springer, Heidelberg (2002)
2. Alexander, K., et al.: Describing Linked Datasets – On the Design and Usage of
voiD. In: Proceedings of the Linked Data on the Web Workshop (2009)
3. Stocker, M., et al.: SPARQL basic graph pattern optimization using selectivity
estimation. In: WWW, pp. 595–604. ACM, New York (2008)
4. Görlitz, O., et al.: Federated Data Management and Query Optimization for Linked
Open Data. In: New Directions in Web Data Management (2011)
5. Erling, O., et al.: RDF Support in the Virtuoso DBMS. In: CSSW (2007)
6. Haase, P., Mathäß, T., Ziller, M.: An Evaluation of Approaches to Federated Query
Processing over Linked Data. In: I-SEMANTICS (2010)
7. Neumann, T., Weikum, G.: RDF-3X: a RISC-style engine for RDF. PVLDB 1(1)
(2008)
8. Haase, P., et al.: The Information Workbench - Interacting with the Web of Data.
Technical report, fluid Operations & AIFB Karlsruhe (2009)
9. Quilitz, B., Leser, U.: Querying distributed RDF data sources with SPARQL. In:
Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008.
LNCS, vol. 5021, pp. 524–538. Springer, Heidelberg (2008)
10. Schenk, S., Staab, S.: Networked graphs: a declarative mechanism for SPARQL rules,
SPARQL views and RDF data integration on the Web. In: WWW (2008)
Reasoning in Expressive Extensions
of the RDF Semantics
Michael Schneider
This research deals with reasoning in expressive semantic extensions of the RDF
Semantics specification [5]. The focus will specifically be on the ontology lan-
guage OWL 2 Full [9], which was standardized by the World Wide Web
Consortium (W3C) in 2009 as an RDFS-compatible flavor of OWL that essen-
tially covers all other members of the RDF and OWL language families. Several
W3C languages have dependencies on OWL Full, including SKOS, RIF, and the
current revision of SPARQL (“SPARQL 1.1”). So far, however, OWL Full has
largely been ignored by the research community and no practically applicable
reasoner has been implemented.
There may be a variety of reasons for the current situation. The most frequently
heard technical argument against OWL Full reasoning is that OWL Full is
computationally undecidable with regard to key reasoning tasks [7]. However,
undecidability is a common theoretical problem in other fields as well, such as first-order
logic reasoning, which still has many highly efficient implementations with
relevant industrial applications [11]. Nevertheless, the undecidability argument
and other arguments have led to strong reservations about OWL Full and have
effectively prevented researchers from studying the distinctive features of the
language and from searching for methods to realize at least useful partial im-
plementations of OWL Full reasoning. But without a better understanding of
OWL Full reasoning and its relationship to other reasoning approaches it will
not even become clear what the added value of an OWL Full reasoner would be.
An OWL Full reasoner would make the full expressivity of OWL available
to unrestricted RDF data on the Semantic Web. A conceivable use case for
OWL Full reasoners is to complement RDF rule reasoners in reasoning-enabled
logical conclusions from such an axiom; neither is supported by OWL 2 DL. For
each identified feature, a precise description, a motivation, an explanation for its
distinctiveness, and one or more concrete examples will be given. Identification
of the features may make use of any kind of source, including literature, ontolo-
gies, forum discussions, or technical aspects of the language. For each candidate
feature, concrete evidence will be sought in order to support its validity.
The implementability analysis will concentrate on studying the FOL trans-
lation approach, as mentioned in Sec. 1. This approach has the advantage that
it applies to arbitrary extensions of the RDF Semantics and it enjoys strong and
mature reasoning tool support through existing ATPs. The idea is to create a
corresponding FOL formula for every model-theoretic semantic condition of the
OWL 2 Full semantics. For example, the semantic condition for class subsump-
tion, as given in Sec. 5.8 of [9], can be translated into the FOL formula
∀c, d : iext(rdfs:subClassOf, c, d) ⇔ ic(c) ∧ ic(d) ∧ ∀x : [icext(c, x) ⇒ icext(d, x)] .
An RDF triple ‘:s :p :o’ is mapped to an atomic FOL formula ‘iext(p, s, o)’.
An RDF graph is translated into a conjunction of such ‘iext’ atoms, with ex-
istentially quantified variables representing blank nodes. An entailment query
is represented by the conjunction of the OWL Full axiomatisation, the transla-
tion of the premise graph, and the negated translation of the conclusion graph.
Entailment checking can then be performed by means of an ATP.
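As a small worked instance of this scheme (the premise and conclusion graphs are our own illustration; AX denotes the FOL axiomatisation of the OWL 2 Full semantics): to check whether the premise graph { :Cat rdfs:subClassOf :Animal . :felix rdf:type :Cat } entails the conclusion graph { :felix rdf:type :Animal }, the converter produces the conjunction
AX ∧ iext(rdfs:subClassOf, Cat, Animal) ∧ iext(rdf:type, felix, Cat) ∧ ¬ iext(rdf:type, felix, Animal),
and the ATP is asked to show that this conjunction is unsatisfiable, which establishes the entailment.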
The implementability analysis will be based on a prototypical reasoner that
is going to be built from an ATP, an FOL axiomatisation of the OWL 2 Full
semantics, and a converter for translating RDF graphs into FOL formulas. The
reasoner will then be evaluated based on the identified distinctive OWL Full
features. This will be done by using the created concrete feature examples as test
cases for conformance and performance testing of the parsing and the reasoning
capabilities of the reasoner. Conversely, this method will help ensure that the
identified distinctive features are technically valid. The evaluation results will
be compared to those for OWL DL reasoners and RDF rule reasoners.
To the author’s knowledge, the proposed research will be the first in-depth
study of the features and the implementability of OWL 2 Full reasoning. No
analysis of the distinctive pragmatic features of OWL 2 Full has been done so
far. Also, there has been no rigorous analysis of OWL 2 Full reasoning based on
the FOL translation approach yet.
It has been observed that acceptable reasoning performance can often only be
achieved by reducing the whole OWL Full axiomatisation to a small sufficient
sub-axiomatisation. A next step will therefore be to search for an automated
method to eliminate redundant axioms with regard to the given input ontology.
So far, the implementability analysis has been restricted to entailment and incon-
sistency checking. To fulfil the discussed use case of complementing RDF rule
reasoners in reasoning-enabled applications, OWL Full reasoners should also
support flexible query answering on arbitrary RDF data. This will specifically
be needed to realize the OWL 2 RDF-Based Semantics entailment regime of
SPARQL 1.1 [2]. Some ATPs have been reported to offer query answering on
FOL knowledgebases [6]. It will be analyzed to what extent these capabilities
can be exploited for OWL Full query answering.
For the syntactic-aspect feature analysis, which has already been finished for
OWL 1 Full, the remaining work will be to extend the analysis to the whole
of OWL 2 Full. The semantic-aspect feature analysis is in a less-complete state
and still requires the development of a feature categorization similar to that
of the syntactic-aspect feature analysis. The already started work on building reasoning
test suites will be continued and will eventually lead to a collection of concrete
examples for the still to-be-identified semantic-aspect features.
References
1. Fikes, R., McGuinness, D., Waldinger, R.: A First-Order Logic Semantics for Se-
mantic Web Markup Languages. Tech. Rep. KSL-02-01, Knowledge Systems Lab-
oratory, Stanford University, Stanford, CA 94305 (January 2002)
2. Glimm, B., Ogbuji, C. (eds.): SPARQL 1.1 Entailment Regimes. W3C Working
Draft (October 14, 2010)
3. Hawke, S.: Surnia (2003), http://www.w3.org/2003/08/surnia
4. Hayes, P.: Translating Semantic Web Languages into Common Logic (July 18,
2005), http://www.ihmc.us/users/phayes/CL/SW2SCL.html
5. Hayes, P. (ed.): RDF Semantics. W3C Recommendation (February 10, 2004)
6. Horrocks, I., Voronkov, A.: Reasoning Support for Expressive Ontology Languages
Using a Theorem Prover. In: Dix, J., Hegner, S.J. (eds.) FoIKS 2006. LNCS,
vol. 3861, pp. 201–218. Springer, Heidelberg (2006)
7. Motik, B.: On the Properties of Metamodeling in OWL. Journal of Logic and
Computation 17(4), 617–637 (2007)
8. Motik, B., Grau, B.C., Horrocks, I., Wu, Z., Fokoue, A., Lutz, C. (eds.): OWL 2
Web Ontology Language: Profiles. W3C Recommendation (October 27, 2009)
9. Schneider, M. (ed.): OWL 2 Web Ontology Language: RDF-Based Semantics. W3C
Recommendation (October 27, 2009)
10. Tsarkov, D., Riazanov, A., Bechhofer, S., Horrocks, I.: Using Vampire to Reason
with OWL. In: McIlraith, S.A., Plexousakis, D., van Harmelen, F. (eds.) ISWC
2004. LNCS, vol. 3298, pp. 471–485. Springer, Heidelberg (2004)
11. Voronkov, A.: Automated Reasoning: Past Story and New Trends. In: Proc. IJCAI
2003, pp. 1607–1612 (2003)
12. W3C WebOnt OWL Working Group: OWL 1 Test Results (March 9, 2004),
http://www.w3.org/2003/08/owl-systems/test-results-out
Personal Semantics: Personal Information Management
in the Web with Semantic Technologies
Salman Elahi
Abstract. Every web user has several online profiles through which personal
information is exchanged with many service providers. This exchange of
personal information happens at a pace that is difficult to fully comprehend and
manage without a global view and control, with obvious consequences for data
control, ownership and, of course, privacy. To tackle issues associated with
current service-centric approaches, we propose a user-centric architecture where
the interaction between a user and other agents is managed based on a global
profile for the user, maintained in a profile management system and controlled
by the user herself. In this PhD, we will investigate research issues and
challenges in realizing such a system based on semantic technologies.
1 Research Problem
Web users maintain several online profiles across e-commerce websites, social
networks and others, making it difficult for them to realize how much information is
exchanged and what happens to that information. In other words, web interactions are
happening in a ‘one to many’ mode, where different agents, with different statuses
and relationships to the user, take part and receive personal information. We refer to
this phenomenon as the fragmentation of personal data exchange, where many
different ‘destinations’ of the data receive various fragments of personal information
over time, without the user having a global view of her own data.
The problem stems from the model these online interactions are based upon, i.e. a
service-centric model, where everything is focused on the needs of a particular
organization. Current research suggests a rather contrasting model to address these
issues: a user-centric approach [1] where the interaction between a user and other agents
is managed based on a global profile for the user, maintained in a profile management
system and controlled by the user herself. Parts of this profile can be requested by various
agents (websites), with the user being given the possibility to control these accesses and
to keep track of them directly within the profile management system. In this PhD, we will
investigate research issues and challenges in realizing a system based on user-centric
profiles, and show how semantic technologies can support users in making sense,
managing and controlling their exchanges of personal data on the Web with such a
system.
1 http://www.openid.net
2 http://www.oauth.net
3 http://www.microsoft.com/windows/products/winfamily/cardspace/default.aspx
4 http://www.projectliberty.org/liberty/about/
5 http://www.eclipse.org/higgins/
4 Initial Results
In this section we discuss initial results obtained (during the first year of this part-time
PhD). In [10], our objective was to collect all the fragments of personal information
sent by a user during several weeks, and to try to reconstruct from these fragments a
coherent global profile. From logs of HTTP traffic collected through specifically developed
tools, the user managed to create 36 profile attributes mapped onto 1,108 data
attributes. However, while this small and simple exercise satisfied the
basic hypothesis, we also identified a few research issues and challenges which will
need to be addressed to create more complex and flexible user profiles [10], including
the need for complex and sophisticated profile representations (with temporal and
multi-faceted representations), the need to include multiple, external sources of
information to enrich the profile (including information about the accessing agents)
and the need for appropriate mechanisms for the definition of semantic access control
model over semantic data. On the last point, we are currently investigating a
prospective model for access control in a user-centric scenario, and applying it in the
scenario of the Open University’s information and access model. Through the use of
an ontological model, and the possibility of employing inference upon such a model,
we expect to obtain results showing how the user-centric approach can outsmart the
organization-centric approach, providing better overviews of the access control
aspects over the considered data and possibly detecting unintended behaviours which
normally remain hidden in an organization-centric view.
5 Future Work
The next major steps for this part-time PhD include completing the current work on
semantic policies and access rights in the next four months. The results obtained from
this exercise will be used to develop a prototype profile management system based on
the user-centric framework discussed earlier. This framework will employ semantic
technologies to provide access control through semantic policies with evolvable
profiles representing changing needs of a user in today’s online interactions. This
phase will investigate and try to address the technical issues mentioned above. Social
issues will be investigated in the next phase with the deployment of this system in
place of the Open University’s usual information and access mechanisms to gather
empirical evidence of how users interact with such a system and its impact on them.
This empirical evidence will be used in the evaluation of the system.
References
1. Iannella, R.: Social Web Profiles. Position paper, SNI (2009)
2. Jones, W., Teevan, J.: Personal Information Management (2007); ISBN: 9780295987378
3. Sauermann, L., Bernardi, A., Dengel, A.: Overview and outlook on the semantic desktop.
In: ISWC (2005)
4. Leonardi, E., Houben, G., Sluijs, K., Hidders, J., Herder, E., Abel, F., Krause, D.,
Heckmann, D.: User Profile Elicitation and Conversion in a Mashup Environment. In:
FIWLIW (2009)
5. Abel, F., Heckmann, D., Herder, E., Hidders, J., Houben, G., Krause, D., Leonardi, E.,
Slujis, K.: A Framework for Flexible User Profile Mashups, AP-WEB 2.0 (2009)
6. Ghosh, R., Dekhil, M.: Mashups for semantic user profiles. Poster at the International World
Wide Web Conference, WWW (2008)
7. Ghosh, R., Dekhil, M.: I, Me and My Phone: Identity and Personalization Using Mobile
Devices. In: HPL 2007-184 (2007)
8. Story, H., Harbulot, B., Jacobi, I., Jones, M.: FOAF+SSL: RESTful Authentication for the
Social Web. In: SPOT (2009)
9. Olesen, H., Noll, J., Hoffmann, M. (eds.): User profiles, personalization and privacy.
WWRF Outlook series, Wireless World Research Forum (May 2009)
10. Elahi, S., d’Aquin, M., Motta, E.: Who Wants a Piece of Me? Reconstructing a User
Profile from Personal Web Activity Logs. In: LUPAS, ESWC (2010)
Reasoning with Noisy Semantic Data
1 Problem Statement
Based on URIs, HTTP and RDF, the Linked Data project [3] aims to expose,
share and connect related data from diverse sources on the Semantic Web. Linked
Open Data (LOD) is a community effort to apply the Linked Data principles
to data published under open licenses. With this effort, a large number of LOD
datasets have been gathered in the LOD cloud, such as DBpedia, Freebase and
FOAF profiles. These datasets are connected by links such as owl:sameAs. LOD
has progressed rapidly and is still growing constantly. By May 2009,
there were 4.7 billion RDF triples and around 142 million RDF links [3]. By
March 2010, the total had increased to 16 billion triples, and another
14 billion triples have been published by the AIFB according to [17].
With the ever-growing LOD datasets, one problem naturally arises: the
generation of the data may introduce noise, which hinders the application
of the data in practice. To make Linked Data more useful, it is important
to propose approaches for dealing with noise within the data. In [6], the authors
classify noise in Linked Data into three main categories: accessibility1 and
dereferencability2 w.r.t. URI/HTTP, syntax errors, and noise3 and inconsistency4
w.r.t. reasoning. In our work, we focus on dealing with the third category of
noise, namely noise and inconsistency w.r.t. reasoning, but may also consider
the other categories. We further consider one more kind of noise at the logical level,
namely the logical inconsistency caused by ontology mapping.
Developing an ontology is not an easy task and often introduces noise and incompleteness.
This happens in LOD datasets as well. The ontologies in LOD
datasets are generally inexpressive and may contain a lot of noise. For example,
one of the most popular ontologies, the DBpedia ontology5, is claimed to be a shallow
ontology. The TBox of this ontology mainly includes a class hierarchy.
To deal with incomplete ontologies, we plan to use statistical relational
learning (SRL) techniques to learn expressive ontologies from LOD datasets
(more details can be found in [19]).
Before learning ontologies, we will propose methods to measure the noise of
a dataset, which can be defined according to data quality assessment [13] in
the database area. For instance, an objective metric can be defined as the proportion
of misused vocabulary terms among all vocabulary terms in an ontology. Such
measures provide a reference for deciding whether a dataset needs to be cleaned
or not. If cleaning is necessary, we could apply various cleaning strategies
to correct or remove the noise. For example, we can correct the misused
5 http://wiki.dbpedia.org/Ontology
vocabulary terms manually with the help of an ontology editor like Protégé6. After
ontology learning, we can define a measure of ontology incompleteness, i.e.,
the degree to which axioms are missing from the ontology.
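As a toy illustration of such an objective metric (our own sketch, not the measure that will eventually be defined in this work), one could compute the fraction of predicates used in a dataset that are not declared as properties in the accompanying vocabulary:

from rdflib import Graph
from rdflib.namespace import OWL, RDF

def undeclared_predicate_ratio(data: Graph, vocab: Graph) -> float:
    # Crude proxy for "misused vocabulary": predicates used in the data that
    # the vocabulary does not declare as properties.
    declared = set(vocab.subjects(RDF.type, RDF.Property))
    declared |= set(vocab.subjects(RDF.type, OWL.ObjectProperty))
    declared |= set(vocab.subjects(RDF.type, OWL.DatatypeProperty))
    # Ignore the RDF/RDFS/OWL built-ins themselves.
    used = {p for _, p, _ in data if not str(p).startswith("http://www.w3.org/")}
    if not used:
        return 0.0
    return len(used - declared) / len(used)

A ratio close to zero suggests the dataset sticks to its vocabulary; a high ratio is a hint that cleaning is needed.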
After learning expressive ontologies from LOD datasets and linking them with
other datasets, we may confront the problem of inconsistency7 handling. Ac-
cording to [6], it may be quite difficult to deal with this kind of noise. Due to the
large scale of the data, it is hard to apply existing approaches for reasoning with
inconsistent OWL ontologies to deal with OWL inconsistency in Linked Data.
In [6], the authors suggested that, to handle inconsistency in Linked Data, in-
consistent data may be pre-processed with those triples causing inconsistencies
dropped according to some heuristic measures. We fully agree with this pro-
posal. One measure that we will consider is the inconsistency measure defined
by four-valued semantics (see [10]). In the open philosophy of the Web, it may
not be desirable to completely repair inconsistent ontologies. One reason, as suggested
in [6], is that contradiction could be considered a ‘healthy’ symptom
of differing opinions. Therefore, when we repair inconsistency in OWL ontologies
in Linked Data, our goal is not to result in fully consistent ontologies, but to
reduce the inconsistency degrees of those ontologies. After that, we can apply
some inconsistency-tolerant approaches to reasoning with those inconsistent on-
tologies. To partially repair an inconsistent ontology, we will apply some patterns
to efficiently detect the sources of inconsistency, such as patterns given in [18].
To provide inconsistency-tolerant reasoning services, we will further develop the
idea of using selection functions [7] to reason with inconsistent ontologies. The
idea is to propose specific selection functions for specific ontology languages.
As reported in [8], LOD datasets are well connected by RDF links on the instance
level. But on the schema level, the ontologies are loosely linked. It is interesting
to consider aligning these ontologies based on the plentiful resources of LOD
datasets. A few such approaches have been proposed in [8,12].
With the mappings generated, we may confront the problem of dealing with
inconsistency caused by mappings and ontologies if we interpret mappings with
OWL semantics. We will first consider evaluating the inconsistent mappings8 by
defining a nonstandard reasoner. We will then consider mapping repair based on
work in [14,11]. For example, we can apply some patterns to efficiently detect
problematic correspondences in the mappings.
6 http://protege.stanford.edu/
7 A dataset is inconsistent iff it has no model.
8 An inconsistent mapping means that no concept in O1 ∪ O2 is interpreted as empty, but there is such a concept in the union of O1 and O2 connected by M.
3.4 Evaluation
To evaluate our work, we will implement our proposed approaches and evaluate
them over LOD datasets. Based on our previously developed system RaDON9,
which is a tool to repair and diagnose ontology networks, we will develop a
system for reasoning with noisy Linked Data.
4 Results
We have studied repair and diagnosis in ontology networks and developed a
tool, called RaDON, to deal with logical contradictions in ontology networks
(see [9]). The functionalities provided by RaDON have been implemented by
extending the capabilities of existing reasoners. Specifically, the functionalities
include debugging and repairing an inconsistent ontology or mapping, and coping
with inconsistency using a paraconsistency-based algorithm.
In [15], we proposed a possibilistic extension of OWL to deal with inconsistency
and uncertainty in OWL ontologies. Some novel inference services have been
defined and algorithms for implementing these inference services were given. We
have implemented these algorithms and provided evaluations for their efficiency.
For an inconsistent mapping, the semantic precision and recall defined in [4]
suffer from trivialization problems. To resolve such problems, we define
the meaningfulness of an answer given by an inconsistency reasoner: Given two
ontologies O1 and O2 and a mapping M between them, for a correspondence
c = ⟨e, e′, r, α⟩, an answer provided by an inconsistency reasoner is meaningful
iff the following condition holds: whenever the reasoner returns t(c), there exists
a sub-ontology Σ′ ⊆ Σ such that Σ′ ⊭ e ⊑ ⊥, Σ′ ⊭ e′ ⊑ ⊥ and Σ′ |= t(c).
Here, e and e′ are atomic concepts, r is a semantic relation such as equivalence,
and α is a confidence value. t is a translation function that transfers a correspondence
into a DL axiom. Σ is the union of O1, O2 and the set of axioms obtained by
translating all correspondences in M into DL axioms.
An inconsistency reasoner is regarded as meaningful iff all of its answers are
meaningful. Based on this definition, we can redefine the semantic measures of [4].
5 Conclusions
Reasoning with noisy Linked Data is a challenging and interesting topic. In
our work, we mainly consider the following: (1) We will propose methods
for measuring noise in LOD datasets, such as incompleteness, and clean the noise in
these datasets if necessary. Based on the plentiful LOD datasets, we will propose
methods for learning expressive ontologies using SRL techniques. (2) To deal with
logical inconsistency, we propose to partially repair an inconsistent ontology by
considering some patterns to achieve good scalability for LOD datasets. Then we
plan to apply some novel inconsistency-tolerant reasoning strategies like defining
specific selection functions for specific ontology languages. (3) We will propose
methods for evaluating inconsistent mappings and methods to repair inconsistent
mappings by applying some patterns.
9 http://neon-toolkit.org/wiki/RaDON
Acknowledgements
This paper is sponsored by NSFC 60873153 and 60803061.
References
1. Bechhofer, S., Volz, R.: Patching syntax in OWL ontologies. In: McIlraith, S.A.,
Plexousakis, D., van Harmelen, F. (eds.) ISWC 2004. LNCS, vol. 3298, pp. 668–682.
Springer, Heidelberg (2004)
2. Bell, D., Qi, G., Liu, W.: Approaches to inconsistency handling in description-logic
based ontologies. In: Chung, S., Herrero, P. (eds.) OTM-WS 2007, Part II. LNCS,
vol. 4806, pp. 1303–1311. Springer, Heidelberg (2007)
3. Bizer, C., Heath, T., Berners-Lee, T.: Linked data - the story so far. In: IJSWIS,
pp. 1–22 (2009)
4. Euzenat, J.: Semantic precision and recall for ontology alignment evaluation. In:
IJCAI, Hyderabad, India, pp. 348–353 (2007)
5. Halpin, H., Hayes, P.J., McCusker, J.P., McGuinness, D.L., Thompson, H.S.: When
owl:sameAs Isn’t the Same: An Analysis of Identity in Linked Data. In: Patel-
Schneider, P.F., Pan, Y., Hitzler, P., Mika, P., Zhang, L., Pan, J.Z., Horrocks,
I., Glimm, B. (eds.) ISWC 2010, Part I. LNCS, vol. 6496, pp. 305–320. Springer,
Heidelberg (2010)
6. Hogan, A., Harth, A., Passant, A., Decker, S., Polleres, A.: Weaving the pedantic
web. In: LDOW, Raleigh, NC, USA (2010)
7. Huang, Z., van Harmelen, F., ten Teije, A.: Reasoning with inconsistent ontologies.
In: IJCAI, pp. 454–459. Morgan Kaufmann, San Francisco (2005)
8. Jain, P., Hitzler, P., Sheth, A.P., Verma, K., Yeh, P.Z.: Ontology alignment for
linked open data. In: Patel-Schneider, P.F., Pan, Y., Hitzler, P., Mika, P., Zhang,
L., Pan, J.Z., Horrocks, I., Glimm, B. (eds.) ISWC 2010, Part I. LNCS, vol. 6496,
pp. 402–417. Springer, Heidelberg (2010)
9. Ji, Q., Haase, P., Qi, G., Hitzler, P., Stadtmüller, S.: RaDON — repair and diag-
nosis in ontology networks. In: Aroyo, L., Traverso, P., Ciravegna, F., Cimiano, P.,
Heath, T., Hyvönen, E., Mizoguchi, R., Oren, E., Sabou, M., Simperl, E. (eds.)
ESWC 2009. LNCS, vol. 5554, pp. 863–867. Springer, Heidelberg (2009)
10. Ma, Y., Qi, G., Hitzler, P.: Computing inconsistency measure based on paracon-
sistent semantics. Journal of Logic and Computation (2010)
11. Meilicke, C., Stuckenschmidt, H., Tamilin, A.: Repairing ontology mappings. In:
AAAI, pp. 1408–1413 (2007)
12. Parundekar, R., Knoblock, C.A., Ambite, J.L.: Linking and building ontologies of
linked data. In: Patel-Schneider, P.F., Pan, Y., Hitzler, P., Mika, P., Zhang, L.,
Pan, J.Z., Horrocks, I., Glimm, B. (eds.) ISWC 2010, Part I. LNCS, vol. 6496, pp.
598–614. Springer, Heidelberg (2010)
13. Pipino, L., Lee, Y.W., Wang, R.Y.: Data quality assessment. ACM Commun. 45(4),
211–218 (2002)
14. Qi, G., Ji, Q., Haase, P.: A conflict-based operator for mapping revision. In:
Bernstein, A., Karger, D.R., Heath, T., Feigenbaum, L., Maynard, D., Motta, E.,
Thirunarayan, K. (eds.) ISWC 2009. LNCS, vol. 5823, pp. 521–536. Springer, Hei-
delberg (2009)
15. Qi, G., Ji, Q., Pan, J.Z., Du, J.: Extending description logics with uncertainty rea-
soning in possibilistic logic. International Journal of Intelligent System (to appear,
2011)
16. Schlobach, S., Huang, Z., Cornet, R., van Harmelen, F.: Debugging incoherent
terminologies. J. Autom. Reasoning 39(3), 317–349 (2007)
17. Vrandecic, D., Krotzsch, M., Rudolph, S., Losch, U.: Leveraging non-lexical knowl-
edge for the linked open data web. Review of AF Transactions, 18–27 (2010)
18. Wang, H., Horridge, M., Rector, A.L., Drummond, N., Seidenberg, J.: Debugging
OWL-DL ontologies: A heuristic approach. In: Gil, Y., Motta, E., Benjamins, V.R.,
Musen, M.A. (eds.) ISWC 2005. LNCS, vol. 3729, pp. 745–757. Springer, Heidelberg
(2005)
19. Zhu, M., Gao, Z.: SRL based ontology learning from linked open data. In: ESWC
PhD Symposium, Crete, Greece (to appear, 2011)
Extracting and Modeling Historical Events to
Enhance Searching and Browsing of Digital
Cultural Heritage Collections
Roxane Segers
searching, browsing and representing historical events online. Both projects use
the digital collections of the Netherlands Institute for Sound and Vision5 and
the Rijksmuseum Amsterdam6 .
In my research, I focus on the following research questions:
1. What is an adequate event model to capture variances and invariances of
historical events? Can we instantiate this event model (semi-)automatically?
2. What is an adequate organization of a historical ontology to be used for
annotations for cultural heritage objects?
3. What is a suitable organization of a historical event thesaurus to allow for
diverse interpretations and expressions for events?
4. What are relevant evaluation criteria for quality of the event model, the
historical ontology and the thesaurus and their added value for exploration
and search of cultural heritage objects?
Issues Related to the Research Questions
1. The event model provides the vocabulary for the event elements and their
relations. However, the domain-specific requirements for modeling historical
events in terms of classes and properties cannot be given beforehand. Addi-
tionally, the historical event model facilitates relations between events, e.g.
causality, meronymy. However, these relations are not part of an event but
exist as an interpretational element between two or more events and thus
need to be modeled as a separate module.
2. The historical ontology serves as a semantic meta-layer to type historical
events, independently of the expressions used. However, it is unknown to
what degree an ontology can be used as an unprejudiced meta-layer for such
typing as it might imply an interpretation. Ontologies typically represent a
time-fixed view on reality which influences the modeling of objects that can
only play a role in an event after a certain point in time. Additionally, the
expressivity and extensibility of the ontology depends on the expressivity and
extensibility of the event model and vice versa. It is critical to know how
they interact, as incompatible properties can affect reasoning about events.
3. The instantiation of the event model needs to be based on different
sources to capture the different perspectives and interpretations of events.
Typically, event descriptions reside in unstructured text documents. Thus,
portable information extraction techniques should be applied for detecting
events and their elements in document collections of different style and topic.
4. The event thesaurus is a structured set of historical events used for event-
based annotation of cultural heritage objects and for aligning different object
collections. For the creation of such a thesaurus we need to know (1) how
to identify and organize equal and similar event descriptions and (2) how
to identify and structure multiple interpretations of the relations between
events. Properties such as hasSubevent can become problematic for structur-
ing the thesaurus, as some sources might only report temporal inclusion.
5 http://portal.beeldengeluid.nl/
6 http://www.rijksmuseum.nl/
3 Approach
We propose the following novel approach for extracting and structuring knowl-
edge of historical events from various text sources. First, we adapt an existing
event model to meet the domain specific requirements. Next, we populate this
model and learn a historical ontology using information extraction techniques.11
For the creation of the event thesaurus we consider using different reasoning
techniques over both the instances and types of the modeled event descriptions.
In the following, we elaborate on the approach in relation to the research questions:
RQ1: We consider SEM [6] as a model to start from, as it is not domain-
specific, represents a minimal set of event classes and includes placeholders for
a foreign typing system.
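Purely as an illustration of what an instantiated SEM-style event description could look like (the namespace and property names below follow our reading of SEM and should be treated as assumptions rather than the definitive vocabulary):

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS, XSD

# Namespace assumed for illustration; the published SEM namespace may differ.
SEM = Namespace("http://semanticweb.cs.vu.nl/2009/11/sem/")
EX = Namespace("http://example.org/history#")

g = Graph()
event = EX["siege-of-leiden"]

g.add((event, RDF.type, SEM.Event))
g.add((event, RDFS.label, Literal("Siege of Leiden", lang="en")))
# SEM keeps the event type separate from the event instance, so a foreign
# typing system (here a historical ontology class) can be plugged in.
g.add((event, SEM.eventType, EX.Siege))
g.add((event, SEM.hasActor, EX["william-of-orange"]))
g.add((event, SEM.hasPlace, EX["leiden"]))
g.add((event, SEM.hasTimeStamp, Literal("1574", datatype=XSD.gYear)))

print(g.serialize(format="turtle"))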
7 http://semanticweb.cs.vu.nl/agora/relatedwork
8 http://motools.sf.net/event/event.html
9 http://cidoc.ics.forth.gr/officialreleasecidoc.html
10 http://www.kulttuurisampo.fi/
11 see: http://semanticweb.cs.vu.nl/agora/experiments
4 Methodology
We apply the following iterative methodology in order to realize the approach
described in Section 3. Iteration I is scoped on the acquisition of basic models:
– Analysis of SEM classes for the information extraction process.
– Learn patterns to instantiate SEM classes, starting with the event class.
Next, we extend to other classes and pertaining relations. We combine the
results of three IE techniques: (1) pattern-based and (2) co-occurrence-based,
both using Yahoo and Wikipedia, and (3) lexical framing in newspaper collections.
For each we evaluate recall, precision and reusability.
– Ontology, version 1, based on the first extraction results.
– Thesaurus, version 1, with limited relations.
– Test and evaluate the ontology and thesaurus in the Agora demonstrator.
We define new requirements from the evaluation.
In Iteration II we iterate all the RQs once again to extend the models with
domain specific requirements:
– Extend the document collection to domain-specific texts, e.g. scope notes
with links to historical themes and historical handbooks. We scope the do-
main to two periods/themes of interest to the involved cultural heritage in-
stitutions. Apply the IE techniques and the extended event model. Creation
of ontology version 2 with unprejudiced typing of events.
– Evaluate the ontology and thesaurus version 2 by applying the IE module
and event model to another historical period/theme to ensure that the results
are not over-fitting the data. Integrate the results in the Agora demonstrator.
– Define requirements for evaluating the thesaurus in the Agora demonstrator,
e.g. added value in terms of links between objects (quantitative), added value
in terms of relevant and coherent links (qualitative).
References
1. Buitelaar, P., Cimiano, P., Magnini, B. (eds.): Ontology learning from Text: Meth-
ods, Evaluation and Applications. IOS Press, Amsterdam (2005)
2. Cimiano, P.: Ontology Learning and Population from Text Algorithms, Evaluation
and Application. Springer, Heidelberg (2006)
3. Cohen, W., Borgida, A., Hirsh, H.: Computing least common subsumers in de-
scription logics. In: Proceedings of AAAI 1992. AAAI Press, Menlo Park (1992)
4. de Boer, V.: Ontology Enrichment from Heterogeneous Sources on the Web. PhD
thesis, VU University, Amsterdam, The Netherlands (2010)
5. Fellbaum, C. (ed.): Wordnet: An Electronical Lexical Database. MIT Press, Cam-
bridge (1998)
6. van Hage, W., Malaisé, V., de Vries, G., Schreiber, G., van Someren, M.: Combining
ship trajectories and semantics with the simple event model (sem). In: EiMM 2009,
New York, NY, USA, pp. 73–80 (2009)
7. Hyvönen, E., Alm, O., Kuittinen, H.: Using an ontology of historical events in
semantic portals for cultural heritage. In: ISWC 2007 (2007)
8. Ide, N., Woolner, D.: Historical ontologies. In: Words and Intelligence II: Essays in
Honor of Yorick Wilks, pp. 137–152 (2007)
9. Maedche, A., Staab, S.: Ontology learning for the semantic web. IEEE Intelligent
Systems, 72–79 (2001)
10. Masolo, C., Borgo, S., Gangemi, A., Guarino, N., Oltramari, A., Schneider, L.:
Wonderweb deliverable d18. Technical report, ISTC-CNR (2003)
11. Niles, I., Pease, A.: Towards a standard upper ontology. In: Proceedings of FOIS
2001, pp. 2–9. ACM, New York (2001)
12. Scherp, A., Franz, T., Saathoff, C., Staab, S.: F—a model of events based on
the foundational ontology dolce+dns ultralight. In: K-CAP 2009, Redondo Beach
(2009)
13. Shaw, R., Troncy, R., Hardman, L.: LODE: Linking open descriptions of events.
In: Gómez-Pérez, A., Yu, Y., Ding, Y. (eds.) ASWC 2009. LNCS, vol. 5926, pp.
153–167. Springer, Heidelberg (2009)
Enriching Ontologies by Learned Negation
Or How to Teach Ontologies Vegetarianism
Daniel Fleischhacker
1 Problem Statement
Ontologies form the basis of the semantic web by providing knowledge on concepts,
relations and instances. Unfortunately, the manual creation of ontologies is a time-
intensive and hence expensive task. This leads to the so-called knowledge acquisition
bottleneck being a major problem for a more widespread adoption of the semantic web.
Ontology learning tries to widen the bottleneck by supporting human knowledge engi-
neers in creating ontologies. For this purpose, knowledge is extracted from existing data
sources and is transformed into ontologies. So far, most ontology learning approaches
are limited to very basic types of ontologies consisting of concept hierarchies and relations,
and do not exploit much of the expressivity that ontologies provide.
Negation is of great importance in ontologies since many common ideas and con-
cepts are only fully expressible using negation. An example of the usefulness of negation
is the notion of a vegetarian, who is characterized by not eating meat. It is
impossible to fully formalize this notion without applying negation at some level. Not
stating this additional information about vegetarians would severely limit the possibility
of deducing new knowledge about vegetarians from the ontology by reasoning.
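In description logic notation, one way to formalize the example (our own rendering, not an axiom taken from the paper) is
Vegetarian ⊑ Person ⊓ ¬∃eats.Meat,
where the negated existential restriction is exactly the part that cannot be expressed without negation.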
Furthermore, negation is of great significance for assessing the quality of ontologies.
Without it, ontologies may never get incoherent or inconsistent which is an important
quality criterion. Additionally, with negations contained in ontologies, it is possible to
use ontology debugging approaches more effectively.
Given all these points, we consider it important to put effort into a more elaborate
research of automatic or semi-automatic learning of negation for enriching ontologies.
classes and are not directly applicable for generating axioms containing complements
of complex class expressions. Even if most negation axioms, i.e., axioms containing
explicit negation, may be represented by disjointness, the representation in data (e.g.,
vegetarian as someone who does not eat meat) does not necessarily resemble disjointness.
Another ontology learning method which also generates negation is implemented by
the DL-Learner tool [11]. It uses inductive logic programming (ILP) to yield complex
axioms describing concepts from a given ontology. Unfortunately, this method suffers
from two issues. First, it is limited to using ontologies or data sets convertible to ontolo-
gies as data sources, thus it is not adequate to handle unstructured data and probably
most semi-structured information. Secondly, the approach is not directly applicable to
large data sets. This is mainly because of the dependency on reasoning for generating
the relevant axioms which introduces scalability problems. Hellmann et al. [9] propose
a method to extract fragments from the data to reduce it to a processable size but this
could nevertheless lead to the loss of relevant data.
Texts are reasonable sources to extract knowledge about negation axioms, and de-
tecting negation in texts could be a valid first step towards reaching this goal. Thus,
work regarding the general detection of negation in biomedical texts is also of interest
for learning negation. Most research on detecting negation in texts has been done in the
biomedical domain, where the approaches are used to extract data on the presence or
absence of certain findings. This detection is mainly done by means of a list of negation
markers and regular expressions [1], by additionally using linguistic approaches like
grammatical parsing [10, 5] or by applying machine learning techniques [13, 12, 14].
Importantly, the detection of negation also requires the identification
of its scope, i.e., the parts of the sentence the negation refers to. Even if some of
the mentioned works might be usable on open-domain texts, there is no evaluation in
this direction but only for the biomedical domain and thus there is no information on
their performance for other domains. Furthermore, it is not clear if detected negations
are similar to the ones required in ontologies.
Recently, there has been more work on negation detection for open-domain texts
mainly driven by its usefulness for sentiment analysis [3] or contradiction detection
[8]. Councill et al., who particularly concentrate on the task of detecting the scopes
of negation, also evaluated their approach on product review texts using an appropriate,
annotated gold standard which unfortunately seems not to be publicly available. Despite
these recent works, detecting negation in open-domain texts remains an open problem.
3 Expected Contributions
The main contribution of this work is expected to be the development of approaches
to enrich given ontologies with negation axioms extracted from texts as a part of an
overall ontology learning approach and accompanied by a corresponding implementa-
tion. For this purpose, we will take already existing, manually engineered ontologies
and add negation axioms extracted from free texts.
Negations in ontologies could provide great benefit for many applications. In the field
of biomedicine, one example would be an ontology containing information on different
drugs. For some of these drugs, it is known that there are bacteria which are resis-
tant against them. For instance, methicillin-resistant Staphylococcus aureus (MRSA)
are strains of Staphylococcus aureus resistant against beta-lactam antibiotics. The ax-
iom BetaLactamAntibiotic ⊑ ¬∃effectiveAgainst.MRSA could be used
to represent this in an ontology. Given such negation axioms, it would be possible to
deduce from the ontology which drugs are not suitable for treating diseases caused by
specific pathogens.
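As a one-step illustration with a hypothetical individual (not taken from the paper): adding the assertion BetaLactamAntibiotic(methicillin) to the axiom above allows a reasoner to derive (¬∃effectiveAgainst.MRSA)(methicillin), so methicillin would be excluded when querying for drugs that are effective against MRSA infections.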
A second contribution will be developing and employing approaches to combine
multiple ways of extracting negation. This will help compensate for possible shortcomings
of certain approaches or data sources and achieve better overall results.
When enriching ontologies by negation, we have to pay special attention to the main-
tenance of the ontology’s consistency and coherence. Without this, there is the risk of
rendering the ontology inconsistent and less useful for reasoning tasks. Such inconsis-
tencies do not have to come from the addition of the learned negation axioms themselves
but may also arise from erroneous non-negation axioms added by the overall learning
approach.
To be able to actually evaluate the results gained by extracting negations from differ-
ent data sources, an appropriate evaluation strategy is necessary. Based on related work,
we will develop methodologies suited for the evaluation.
Negation Extraction from Text. We expect the detection of negation in textual data
to be domain-dependent to a high degree. However, we will focus on the biomedical
domain because of the large amount of work already done there regarding negation de-
tection and the availability of expressive ontologies. There are several kinds of negations
in texts which we will have to handle. Mostly, these kinds of textual negations are dis-
tinguishable into direct negation like caused by the word not and indirect negation rec-
ognizable by words like doubt, which introduce the negation solely by their semantics,
and misunderstanding, where the semantics of negation is characterized by morpholog-
ical markers like mis-. For the first manner of indirect negation, the lexical-semantic
relation of antonymy may provide some additional hints for detection. This is why we
already did experiments on detecting antonymy relations by means of relatedness and
similarity measures. We will evaluate the approaches from the biomedical domain re-
garding their coverage for these different kinds of negation and develop approaches to
treat the yet uncovered ones. To do this, we will most likely start with pattern-based
detection approaches and then additionally apply machine learning methods.
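To make the pattern-based starting point concrete, the following is a minimal NegEx-style sketch (our simplification of the idea in [1], with a hand-picked trigger list and a fixed token window, not the detector that will actually be built):

# Minimal pattern-based negation detection: a trigger word negates a fixed
# window of following tokens (a crude stand-in for real scope detection).
TRIGGERS = {"no", "not", "without", "denies", "never"}
SCOPE_WINDOW = 5  # tokens to the right of a trigger

def negated_spans(sentence):
    tokens = sentence.split()
    spans = []
    for i, tok in enumerate(tokens):
        if tok.lower().strip(".,;") in TRIGGERS:
            spans.append((i + 1, min(len(tokens), i + 1 + SCOPE_WINDOW)))
    return spans

sentence = "The patient denies chest pain but reports a persistent cough."
tokens = sentence.split()
for start, end in negated_spans(sentence):
    print("negated:", " ".join(tokens[start:end]))

The fixed window deliberately over-reaches past the coordinating "but", which illustrates why determining the actual scope of a negation is the hard part of the task.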
For the enrichment of ontologies, we have to develop approaches to actually trans-
fer the extracted textual negations into suitable logical negations, which is not a trivial
problem because of the ambiguity of natural language. Furthermore, we will evaluate
the way negation is used in existing ontologies particularly regarding possible modeling
errors made by humans and regarding the expressivity required for these negation ax-
ioms. Based on the findings, we will choose description logic fragments best suited for
representing the learned negation while maintaining desirable computational properties.
5 Conclusion
In this paper, we presented our plans to develop and implement approaches to enrich
ontologies by complex negation axioms. As described above, we consider this bene-
ficial for a couple of reasons. Given the existing results in the area of negation detection for
biomedical texts, and some for open-domain texts, we already have foundations
regarding negation in texts which should enable us to achieve first results soon. All
in all, learning approaches for negation can assist humans in creating more thoroughly
formalized ontologies and thus lead to a more expressive semantic web.
References
1. Chapman, W.W., Bridewell, W., Hanbury, P., Cooper, G.F., Buchanan, B.G.: A simple al-
gorithm for identifying negated findings and diseases in discharge summaries. Journal of
Biomedical Informatics 34(5), 301–310 (2001)
2. Cimiano, P., Mädche, A., Staab, S., Völker, J.: Ontology learning. In: Staab, S., Studer, R.
(eds.) Handbook on Ontologies, 2nd edn., pp. 245–267. Springer, Heidelberg (2009)
3. Councill, I.G., McDonald, R., Velikovich, L.: What’s great and what’s not: learning to clas-
sify the scope of negation for improved sentiment analysis. In: Proc. of the Workshop on
Negation and Speculation in Natural Language Processing, pp. 51–59 (2010)
4. Dellschaft, K., Staab, S.: On how to perform a gold standard based evaluation of ontology
learning. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold,
M., Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 228–241. Springer, Heidelberg
(2006)
5. Gindl, S., Kaiser, K., Miksch, S.: Syntactical negation detection in clinical practice guide-
lines. Studies in Health Technology and Informatics 136, 187–192 (2008)
6. Haase, P., Völker, J.: Ontology learning and reasoning - dealing with uncertainty and in-
consistency. In: Proc. of the Workshop on Uncertainty Reasoning for the Semantic Web
(URSW), pp. 45–55 (2005)
7. Haase, P., Völker, J.: Ontology learning and reasoning — dealing with uncertainty and in-
consistency. In: da Costa, P.C.G., d’Amato, C., Fanizzi, N., Laskey, K.B., Laskey, K.J.,
Lukasiewicz, T., Nickles, M., Pool, M. (eds.) URSW 2005 - 2007. LNCS (LNAI), vol. 5327,
pp. 366–384. Springer, Heidelberg (2008)
8. Harabagiu, S., Hickl, A., Lacatusu, F.: Negation, contrast and contradiction in text process-
ing. In: Proc. of the 21st national conference on Artificial intelligence, vol. 1, pp. 755–762
(2006)
9. Hellmann, S., Lehmann, J., Auer, S.: Learning of OWL class descriptions on very large
knowledge bases. International Journal On Semantic Web and Information Systems 5, 25–48
(2009)
10. Huang, Y., Lowe, H.J.: A novel hybrid approach to automated negation detection in clinical
radiology reports. Journal of the American Medical Informatics Association 14(3), 304–311
(2007)
11. Lehmann, J.: DL-Learner: Learning concepts in description logics. Journal of Machine
Learning Research 10, 2639–2642 (2009)
12. Li, J., Zhou, G., Wang, H., Zhu, Q.: Learning the scope of negation via shallow semantic
parsing. In: Proc. of the 23rd International Conference on Computational Linguistics, pp.
671–679 (2010)
13. Morante, R., Daelemans, W.: A metalearning approach to processing the scope of negation.
In: Proc. of the 13th Conference on Computational Natural Language Learning, pp. 21–29
(2009)
14. Sarafraz, F., Nenadic, G.: Using SVMs with the command relation features to identify
negated events in biomedical literature. In: Proc. of the Workshop on Negation and Spec-
ulation in Natural Language Processing, pp. 78–85 (2010)
15. Schlobach, S.: Debugging and semantic clarification by pinpointing. In: Gómez-Pérez, A.,
Euzenat, J. (eds.) ESWC 2005. LNCS, vol. 3532, pp. 226–240. Springer, Heidelberg (2005)
16. Völker, J., Vrandečić, D., Sure, Y., Hotho, A.: Learning disjointness. In: Franconi, E., Kifer,
M., May, W. (eds.) ESWC 2007. LNCS, vol. 4519, pp. 175–189. Springer, Heidelberg (2007)
Optimizing Query Answering over OWL Ontologies
Ilianna Kollia
Abstract. Query answering is a key reasoning task for many ontology-based
applications in the Semantic Web. Unfortunately for OWL, the worst case com-
plexity of query answering is very high. That is why, when the schema of an
ontology is written in a highly expressive language like OWL 2 DL, currently
used query answering systems do not find all answers to queries posed over the
ontology, i.e., they are incomplete. In this paper optimizations are discussed that
may make query answering over expressive languages feasible in practice. These
optimizations mostly focus on the use of traditional database techniques that will
be adapted to be applicable to knowledge bases. Moreover, caching techniques
and a form of progressive query answering are also explored.
1 Problem
Query answering is an important task in the Semantic Web since it allows for the extrac-
tion of information from data as specified by the user. The answers to queries are based
not only on the explicitly stated facts but also on the inferred facts. In order to derive
such implicit facts, we distinguish between the terminological and the assertional part
of an ontology [2]. The terminological part, called TBox, describes general informa-
tion about the modeled domain of the ontology, e.g., the relationships between classes
and properties. The assertional part, called ABox, contains concrete instance data, e.g.,
stating which individuals belong to a class or are related by a property.
The derivation of the implicit facts of a knowledge base is done by reasoners and
is a computational problem of high complexity. For example, OWL 2 DL entailment
is known to be N2ExpTime-complete [7]. Hence the expressivity of the TBox, which
determines how complex the background knowledge used to derive implicit
facts is, constitutes one source of complexity in query answering. The other source is
the size of the ABox.
A query answer is a mapping from the variables appearing in the query to terms of
the queried knowledge base such that replacing the variables with their mappings yields
an entailed consequence of the ontology. The well-known conjunctive queries contain
variables which can be mapped to individuals and literals appearing in the ABox of
the queried knowledge base. A naive algorithm for finding the answers of a conjunctive
query w.r.t. a knowledge base would check which of the instantiated queries (formed by
substituting the variables of the query with every individual appearing in the ABox) are
entailed by the knowledge base. Hence such an algorithm would perform m^n entailment
checks, where m is the number of individuals in the ontology and n is the number of
variables in the query. The cost of an entailment check depends on the used logic, i.e.,
by using a less expressive formalism than OWL 2 DL one can reduce the complexity.
Because of the high complexity of entailment checking in OWL 2 DL, it is evident
that query answering becomes problematic. Our research tackles this problem, attempting
to achieve as high expressivity as possible as well as efficiency in practice.
(This work is partially funded by the EC Indicate project.)
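To make the cost of the naive grounding procedure concrete, the following sketch (purely illustrative) enumerates all m^n candidate mappings and keeps those whose instantiated query is entailed; the callback is_entailed stands in for a call to an OWL reasoner and is an assumption of this sketch, not an existing API.

from itertools import product

def naive_query_answers(kb, variables, query_atoms, individuals, is_entailed):
    # Naive conjunctive query answering by grounding: try every mapping of
    # query variables to ABox individuals (m^n combinations) and keep those
    # whose instantiated query is entailed by the knowledge base.
    # `is_entailed(kb, atoms)` is a placeholder for a reasoner call.
    answers = []
    for binding in product(individuals, repeat=len(variables)):
        mapping = dict(zip(variables, binding))
        instantiated = [(pred, tuple(mapping.get(t, t) for t in args))
                        for pred, args in query_atoms]
        if is_entailed(kb, instantiated):
            answers.append(mapping)
    return answers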
A naive query answering algorithm that checks all possible mappings for query vari-
ables needs optimizations in order to deal with expressive languages and behave well
in practice. In my PhD, starting from the highly expressive OWL 2 DL, an optimized
query answering algorithm will be devised that will use resolution-based reasoners for
answering conjunctive queries; it will also be extended to cover queries about the
schema of an ontology. Through this work, it will be seen whether optimizations can
make query answering over OWL 2 DL feasible.
A first set of optimizations will be targeted at transferring techniques from relational
and deductive databases [1] to knowledge bases. For example, since it has been observed
that the order in which query atoms are evaluated is of critical importance for
the running time of a query, techniques from databases, such as cost-based query reordering,
will be used to find optimal query execution plans. These techniques will be
appropriately adapted to take into account the schema of the ontology in addition to the
data. As an example, let us consider a conjunctive query C(x), R(x,y), D(y), where x, y
are individual variables, C, D are classes and R is a property. At the moment it is not
clear whether the query is more efficiently evaluated with the query atoms in the order
presented above or in a different order such as C(x), D(y), R(x,y) or R(x,y), C(x), D(y).
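As an illustration of how such cost-based reordering could look, the following sketch greedily picks the next cheapest atom relative to the variables already bound; the cost model estimate_cost is a hypothetical stand-in for statistics derived from the schema and the ABox.

def reorder_atoms(atoms, estimate_cost):
    # Greedy cost-based query reordering: repeatedly choose the atom with the
    # lowest estimated cost given the variables bound by the atoms chosen so far.
    remaining = list(atoms)
    bound_vars = set()
    plan = []
    while remaining:
        best = min(remaining, key=lambda atom: estimate_cost(atom, bound_vars))
        plan.append(best)
        bound_vars.update(t for t in best[1] if t.startswith("?"))
        remaining.remove(best)
    return plan

# Example with the query C(x), R(x,y), D(y); the toy cost model prefers atoms
# whose variables are already bound and, among those, atoms with fewer instances.
atoms = [("C", ("?x",)), ("R", ("?x", "?y")), ("D", ("?y",))]
sizes = {"C": 50, "R": 5000, "D": 10}
cost = lambda atom, bound: sizes[atom[0]] / (1 + sum(v in bound for v in atom[1]))
print(reorder_atoms(atoms, cost))   # [('D', ('?y',)), ('C', ('?x',)), ('R', ('?x', '?y'))]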
In many cases there is no need to consider all elements of the data part as possible
mappings for query variables in conjunctive queries, and hence we can avoid checking whether
all of them lead to the entailment of the instantiated queries by the queried knowledge
base. This can be the case either because the user is not interested in answers belonging to
some set or because some sets of mappings are not relevant w.r.t. a query. Such cases
will be identified and only an appropriate subset of the possible mappings for query
variables will be checked, hopefully leading to an important reduction in the running
time of queries. Moreover, efficient caching techniques will be used to store parts of the
models of the queried ontology since queries instantiated by many different
mappings and checked afterwards for entailment by the queried ontology often use
the same parts of models. Hence saving these model parts will avoid reconstructing
them every time they are needed, hopefully reducing the execution time of queries. This
is especially useful in the highly expressive OWL 2 DL, in which the construction of
models requires a substantial amount of time.
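A minimal sketch of the first idea, restricting candidate mappings, assuming that told (or precomputed) class memberships and the subclass relation are available; only individuals known to belong to a class mentioned for a variable are tried as bindings for it. All names are illustrative, and in general such pruning has to be done carefully so as not to lose entailed answers; the sketch only shows the shape of the optimization.

def candidate_bindings(var, query_atoms, instances, subclasses, all_individuals):
    # Restrict the candidate bindings for `var` using the unary (class) atoms of
    # the query: only known instances of the mentioned class or of one of its
    # subclasses are retained as candidates.
    mentioned = [pred for pred, args in query_atoms if args == (var,)]
    if not mentioned:
        return set(all_individuals)      # no restriction can be derived
    candidates = set(all_individuals)
    for cls in mentioned:
        relevant = {cls} | subclasses.get(cls, set())
        known = set().union(*(instances.get(c, set()) for c in relevant))
        candidates &= known
    return candidates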
The query answering algorithm will be made to work progressively, i.e., to output
query answers as soon as they are computed, outputting first the answers that are easily
computed and then the answers that are more difficult to find. For example, the answers
coming from the explicitly stated facts in the ABox can be found and given to the user
relatively quickly. Answers which require reasoning are more difficult to compute
and require more time. What is more, even between mappings that require reasoning to
decide whether they constitute answers, the computation time needed differs substantially.
This happens because, in order to decide whether different mappings constitute
answers, the reasoner might have to build models of different size and complexity. For
example, the amount of backtracking that the reasoner performs while trying the different
possibilities that arise from disjunctions defines the complexity of a model and hence
the time needed to construct it. This, in turn, influences the time needed to
decide whether a mapping constitutes an answer or not. A more complex setting will
then be adopted in which query answers are given to the user in order of decreasing
relevance. This includes the definition of appropriate relevance criteria. The profile of
the user who types the query can be exploited to define such measures of relevance.
Since in highly expressive languages the time to compute all the answers for a query is
high, through progressive query answering the user is given some answers to work with
as soon as they are derived and more answers as time passes. However, the user should
expect incomplete answers since some “hard” answers cannot be computed in reasonable
time. The user may, therefore, be interested in the degree of (in)completeness of
the used system, which can be computed and presented to them.
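A possible shape of such a progressive procedure, sketched as a generator under the assumption of two hypothetical helpers: lookup_explicit returns the mappings that follow directly from asserted ABox facts, and entailed_by_reasoning is the expensive reasoner call. Cheap answers are yielded immediately, harder candidates are checked afterwards.

def progressive_answers(kb, query, candidates, lookup_explicit, entailed_by_reasoning):
    # Yield answers as soon as they are computed: first the mappings supported
    # by explicitly stated ABox facts, then the candidates that need a
    # (potentially expensive) entailment check by the reasoner.
    easy = list(lookup_explicit(kb, query))
    for mapping in easy:
        yield mapping
    for mapping in candidates:
        if mapping in easy:
            continue
        if entailed_by_reasoning(kb, query, mapping):
            yield mapping

Because the consumer receives answers incrementally, it can stop early or present partial results; ordering the remaining candidates by an estimated relevance criterion would give the relevance-ordered variant described above.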
The above-described algorithm will be developed in conjunction with the SPARQL
query language, an RDF-based query language that has recently been extended
by the W3C to find query answers under the OWL entailment relation (the OWL entailment
regime of SPARQL [5]). In SPARQL a new class of powerful queries can be written
which go beyond conjunctive queries. These queries allow variables in place of classes
and object and data properties of OWL axioms, in addition to individuals and literals, and need
different optimizations than the ones applicable to conjunctive queries. Such queries
have only partly been considered [13].
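For illustration, the following query (shown as a plain SPARQL string; the ex: prefix and the individual are made up) uses a variable in class position and therefore goes beyond conjunctive queries: answering it under the OWL entailment regime requires schema reasoning rather than just matching ABox individuals.

query = """
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/>

# ?c ranges over class names: all (possibly inferred) types of ex:alice
# that are subclasses of ex:Person.
SELECT ?c WHERE {
  ex:alice rdf:type ?c .
  ?c rdfs:subClassOf ex:Person .
}
"""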
The steps that will be followed during the research are briefly described below. First,
the formalized optimizations and techniques will be implemented in a system that will
use SPARQL as a query language. As explained above, we will start with ontologies
expressed in OWL 2 DL and see whether the applicable optimizations reduce the running
time of queries to such an extent that query answering becomes more feasible. In case
the time for query answering is not acceptable even with the use of the considered optimizations,
we will use techniques like knowledge compilation to approximate OWL
DL ontologies with simplified versions of lower complexity and see how the
use of these simplified ontologies affects the query answering times.
4 Results
A first attempt towards an optimized algorithm has been made. In particular, SPARQL
has already been extended to allow the use of OWL inference for computing query
answers. A cost-based query reordering approach that seems to work well with conjunctive
queries has been developed. A couple of optimizations have been made for the
new class of expressive queries that can be represented in SPARQL. Such optimizations
include the use of query rewriting techniques that transform the initial query into an
equivalent one that can be evaluated more efficiently, the use of the class and property
hierarchy of the queried ontology to prune the search space of candidate bindings for
query variables, and the use of more specialized tasks of OWL reasoners than entailment
checking to speed up query execution. The proposed optimizations can reduce query
execution time by up to three orders of magnitude [8].¹
¹ This work has been done in collaboration with Dr. Birte Glimm and Professor Ian Horrocks in
the Oxford University Computing Laboratory.
5 Conclusion
Taking into account the fact that naive query answering techniques are impractical over
expressive languages like OWL 2 DL, in my PhD I will try to devise optimized algorithms
that will hopefully behave well in practice. In order to achieve this, techniques
from relational and deductive databases will be transferred to knowledge bases and an
evaluation of their applicability and efficiency will be made. For example, we will analyse
whether the magic set technique for finding the parts of the data relevant w.r.t. queries and
rules can be extended to our setting, where we have disjunction and existential quantification
in the rule head. The results obtained so far are promising. However, more tests
need to be performed using a greater range of ontologies and queries.
References
1. Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley, Reading
(1994)
2. Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P. (eds.): The De-
scription Logic Handbook: Theory, Implementation, and Applications. Cambridge Univer-
sity Press, Cambridge (2007)
3. Beeri, C., Ramakrishnan, R.: On the power of magic. In: PODS, pp. 269–284 (1987)
4. Calvanese, D., Giacomo, G.D., Lembo, D., Lenzerini, M., Rosati, R.: Tractable reasoning
and efficient query answering in description logics: The DL-Lite family. J. of Automated
Reasoning 39(3), 385–429 (2007)
5. Glimm, B., Krötzsch, M.: SPARQL beyond subgraph matching. In: Patel-Schneider, P.F.,
Pan, Y., Hitzler, P., Mika, P., Zhang, L., Pan, J.Z., Horrocks, I., Glimm, B. (eds.) ISWC 2010,
Part I. LNCS, vol. 6496, pp. 241–256. Springer, Heidelberg (2010)
6. Hustadt, U., Motik, B., Sattler, U.: Reasoning in description logics by a reduction to disjunc-
tive datalog. J. Autom. Reason. 39, 351–384 (2007)
7. Kazakov, Y.: RIQ and SROIQ are harder than SHOIQ. In: Brewka, G., Lang, J. (eds.)
Proc. 11th Int. Conf. on Principles of Knowledge Representation and Reasoning (KR 2008),
pp. 274–284. AAAI Press, Menlo Park (2008)
8. Kollia, I., Glimm, B., Horrocks, I.: SPARQL Query Answering over OWL Ontologies
(2010), accepted for publication, http://www.comlab.ox.ac.uk/files/3681/paper.pdf
9. Pan, J.Z., Thomas, E.: Approximating OWL-DL ontologies. In: Proceedings of the 22nd
National Conference on Artificial Intelligence, vol. 2, pp. 1434–1439. AAAI Press, Menlo
Park (2007)
10. Pérez-Urbina, H., Motik, B., Horrocks, I.: Tractable query answering and rewriting under
description logic constraints. Journal of Applied Logic 8(2), 186–209 (2010)
11. Rosati, R.: On conjunctive query answering in EL. In: Proceedings of the 2007 International
Workshop on Description Logic (DL 2007). CEUR Electronic Workshop Proceedings (2007)
12. Sirin, E., Parsia, B.: Optimizations for answering conjunctive abox queries: First results. In:
Proc. of the Int. Description Logics Workshop, DL (2006)
13. Sirin, E., Parsia, B.: SPARQL-DL: SPARQL query for OWL-DL. In: Golbreich, C., Kalyan-
pur, A., Parsia, B. (eds.) Proc. OWLED 2007 Workshop on OWL: Experiences and Direc-
tions. CEUR Workshop Proceedings, vol. 258 (2007), CEUR-WS.org
Hybrid Search
Ranking for Structured and Unstructured Data
Daniel M. Herzig
1 Introduction
the combination of both for search is not yet investigated in a satisfying way [1].
This thesis is situated between these two disciplines and combines them on the
data and on the query level. We call this scenario Hybrid Search. However, search
comprises the entire technical spectrum from indexing to the user interface. This
thesis concentrates on ranking, which is a core method of search and crucial for
its effectiveness. The goal of this thesis is to investigate a unified ranking framework
for hybrid search, i.e., search on structured and unstructured data with
queries consisting of keywords and structured elements. The question this thesis
addresses is how structured and unstructured data can be used to improve
search and how hybrid queries can be answered on hybrid data.
2 Problem Definition
This thesis addresses the problem of ranking on hybrid data for hybrid queries.
The frame of this research problem is defined by the following data and query
model: The proposed data model follows the RDF data model with Named
Graphs¹ and is represented as a graph G(R, L, ER, EL, EG, Ĝ) consisting of resource
nodes R, edges ER connecting resource nodes, edges EL connecting resource
nodes to literal nodes L, and edges EG connecting resource nodes to
Named Graphs Ĝ, which are graphs Ĝ(R′, L′, E′R, E′L) consisting of subsets of
the elements of G, e.g. R′ ⊂ R. Textual data is integrated following the same
modeling paradigm and using the already mentioned modeling constructs. Each
text entity is represented by a resource of the type textual document. This re-
source has one edge, labelled content, pointing to a literal node holding the
textual information. In a later stage, there can be more edges providing more
fine-grained distinctions, such as headline, paragraph, etc. All triples comprised
by the textual information of one textual entity form a Named Graph ĝ ∈ Ĝ, as
illustrated in Figure 1 by the dashed circle. The data model is a simplified RDF
model with Named Graphs and allows RDF data to be used easily.
Fig. 1. Illustration of the data model. A textual document on the left side and struc-
tured data on the right side.
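A minimal sketch of this data model (names and structure are illustrative, not from the thesis): a graph holds resource-resource edges, resource-literal edges, and a mapping from named graphs to the triples they contain; a textual document becomes a resource with a single content literal inside its own named graph, as in Fig. 1.

from dataclasses import dataclass, field

@dataclass
class HybridGraph:
    # resource_edges ~ ER, literal_edges ~ EL, named_graphs ~ Ĝ.
    resource_edges: set = field(default_factory=set)   # (subject, property, resource)
    literal_edges: set = field(default_factory=set)    # (subject, property, literal)
    named_graphs: dict = field(default_factory=dict)   # graph name -> set of triples

    def add_text_document(self, doc_uri, text):
        # Represent a textual entity as a resource of type textual document whose
        # `content` edge points to a literal holding the text; all its triples
        # form one named graph.
        triples = {
            (doc_uri, "rdf:type", "ex:TextualDocument"),
            (doc_uri, "ex:content", text),
        }
        self.resource_edges.add((doc_uri, "rdf:type", "ex:TextualDocument"))
        self.literal_edges.add((doc_uri, "ex:content", text))
        self.named_graphs[doc_uri] = triples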
Queries to this data model should have a seamless flexibility ranging from
purely textual keyword queries, over hybrid queries, to crisp structured queries.
A hybrid query q can consist of a structured part qs and a textual part qt, i.e.
q = qs ∧ qt. If one part is empty, the query is either purely textual or purely
structured. The structured part qs follows the SPARQL query language and is a
set of graph patterns, qs = {qs1, qs2, ...}. The textual part qt allows a keyword
query kw to be associated with each variable, i.e. qt = {qti | qti = (xi, kw), xi ∈ Var(q)}.
For example, assume the information need “Formula One drivers who moved to
Switzerland”², which is illustrated in Figure 2. The result of such a query is a set of
bindings to the distinguished variables. This model allows purely structured, hybrid,
and purely textual queries to be represented. A purely textual query, i.e. a simple
keyword query, would be the query in Fig. 2 without line (1). This query model
is a close adaptation of the model of [3].

¹ Named Graphs: http://www.w3.org/2004/03/trix/
Select ?x where {
?x rdf:type ns:FormulaOneRacer # (1)
?x {moved to Switzerland } } # (2)
Fig. 2. Illustration of the query model, consisting of a structured part, i.e. triple
patterns (1), and an unstructured part, i.e. keyword patterns (2).
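To connect the query model to the example in Fig. 2, a hybrid query can be represented as a pair of a set of triple patterns qs and a set of per-variable keyword constraints qt; the sketch below (illustrative names only) encodes exactly that query.

# Hybrid query q = qs ∧ qt for the information need of Fig. 2: a structured
# part qs as triple patterns and a textual part qt as keyword constraints
# attached to variables.
qs = [("?x", "rdf:type", "ns:FormulaOneRacer")]   # line (1)
qt = [("?x", "moved to Switzerland")]             # line (2)
hybrid_query = (qs, qt)

# A purely textual query keeps only qt; a purely structured query keeps only qs.
keyword_only = ([], qt)
structured_only = (qs, [])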
4 Proposed Approach
The starting point are retrieval methods similar to [3,5] applied to a hybrid scenario,
because they have proven to be applicable in similar settings and are the state of
the art in IR. Following the idea of language models, we rank result graphs according
to the probability P(g|q) of g being a result graph for the given query q. The
structured part of the query is regarded as a constraint for the beginning and can
be relaxed later. It fulfills the purpose of selecting candidate results. Since qs determines
the shape of the result graphs, all possible result graphs share the same structure.
Therefore, the rank of a result depends only on the aspects which differentiate the
results, i.e. the bindings to the variables and their relations to qt. Therefore, we can
reduce the ranking to P(g|q) ∝ ∏_{i=1}^{n} P(qi|xi), with qi = qtj ∧ qsk, xi ∈ qtj, qsk,
the keyword and triple patterns associated with the variable xi:

    P(g|q) ∝ P(g) · P(q|g) ∝ P(g) · ∏_{i=1}^{n} P(qi|xi)    (1)
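The following sketch shows how Equation (1) could be turned into a scoring routine, assuming the per-variable match probabilities P(qi|xi) are supplied by a language-model estimator (estimate_p is hypothetical) and the prior P(g) is uniform; it only illustrates the shape of the computation, not the thesis implementation.

from math import log

def score_result_graph(bindings, patterns, estimate_p, prior=1.0):
    # bindings: variable -> bound resource in the candidate result graph g
    # patterns: variable -> (keyword query, triple patterns) associated with it
    # Computes log P(g|q) ∝ log P(g) + sum_i log P(qi|xi), cf. Equation (1);
    # logs are used for numerical stability and do not change the ranking.
    score = log(prior)
    for var, q_i in patterns.items():
        x_i = bindings[var]
        score += log(max(estimate_p(q_i, x_i), 1e-12))   # smooth zero probabilities
    return score

# Candidate result graphs would then be sorted by this score in descending order.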
5 Evaluation Methodology
The most widely accepted methodology in IR for evaluating ranking approaches is the
so-called Cranfield methodology [14]. It provides well-studied grounds and will be the basis
of the evaluation, in line with [15]. However, the setting needs to be adapted to the
hybrid scenario. This can be done by adding structured elements to the keyword
queries of [15] and by using datasets which are a combination of structured and
unstructured data, e.g. the combination of Wikipedia and DBpedia.
6 Conclusion
The goal of this thesis is to investigate a unified ranking methodology for search
on hybrid data using adaptive queries. The proposed approach builds on a graph-based
data model, which is compatible with RDF and incorporates textual documents.
The query model allows seamless querying, ranging from purely textual
queries over hybrid queries to purely structured queries. The ranking ap-
proach builds methodologically on language models. The evaluation methodol-
ogy uses existing standards from the IR community, if applicable, but needs to
be adapted to the hybrid context. The question this thesis addresses is how the
combination of structured and unstructured data can be used to improve search.
References
1. Weikum, G.: DB & IR: both sides now. In: SIGMOD (2007)
2. Santos, D., Cabral, L.M.: GikiCLEF: crosscultural issues in an international set-
ting: asking non-English-centered questions to wikipedia. In: CLEF (2009)
3. Elbassuoni, S., Ramanath, M., Schenkel, R., Sydow, M., Weikum, G.: Language-
model-based ranking for queries on RDF-graphs. In: CIKM 2009 (2009)
4. Robertson, S., Zaragoza, H.: The Probabilistic Relevance Framework: BM25 and
Beyond. Foundations and Trends in Information Retrieval 3(4), 333–389 (2010)
5. Zhao, L., Callan, J.: Effective and efficient structured retrieval. In: CIKM (2009)
6. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. In: ACM-
SIAM, San Francisco, California, United States, pp. 668–677 (1998)
7. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine.
In: WWW, Brisbane, Australia, pp. 107–117 (1998)
8. Harth, A., Kinsella, S., Decker, S.: Using naming authority to rank data and on-
tologies for web search. In: Bernstein, A., Karger, D.R., Heath, T., Feigenbaum,
L., Maynard, D., Motta, E., Thirunarayan, K. (eds.) ISWC 2009. LNCS, vol. 5823,
pp. 277–292. Springer, Heidelberg (2009)
9. Delbru, R., Toupikov, N., Catasta, M., Tummarello, G., Decker, S.: Hierarchical
link analysis for ranking web data. In: Aroyo, L., Antoniou, G., Hyvönen, E., ten
Teije, A., Stuckenschmidt, H., Cabral, L., Tudorache, T. (eds.) ESWC 2010. LNCS,
vol. 6089, pp. 225–239. Springer, Heidelberg (2010)
10. Lalmas, M.: XML Retrieval. Synthesis Lectures on Information Concepts, Re-
trieval, and Services. Morgan & Claypool Publishers, San Francisco (2009)
11. Chaudhuri, S., Das, G., Hristidis, V., Weikum, G.: Probabilistic ranking of database
query results. In: VLDB, pp. 888–899 (2004)
12. Rocha, C., Schwabe, D., Aragao, M.P.: A hybrid approach for searching in the
semantic web. In: World Wide Web, WWW 2004, New York, NY, USA (2004)
13. Bhagdev, R., Chapman, S., Ciravegna, F., Lanfranchi, V., Petrelli, D.: Hybrid
search: Effectively combining keywords and semantic searches. In: Bechhofer, S.,
Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021,
pp. 554–568. Springer, Heidelberg (2008)
14. Cleverdon, C.: The Cranfield Tests on Index Language Devices. Aslib (1967)
15. Halpin, H., Herzig, D.M., Mika, P., Blanco, R., Pound, J., Thompson, H.S., Tran,
D.T.: Evaluating ad-hoc object retrieval. In: IWEST 2010, ISWC (2010)
Author Index