
A Distributed Graph Engine for Web Scale RDF Data

Kai Zeng (UCLA)∗   Jiacheng Yang (Columbia University)∗   Haixun Wang (Microsoft Research Asia)   Bin Shao (Microsoft Research Asia)   Zhongyuan Wang (Microsoft Research Asia / Renmin University of China)
[email protected]   [email protected]   {haixunw, binshao, zhy.wang}@microsoft.com
∗ This work is done at Microsoft Research Asia.

ABSTRACT

Much work has been devoted to supporting RDF data. But state-of-the-art systems and methods still cannot handle web scale RDF data effectively. Furthermore, many useful and general purpose graph-based operations (e.g., random walk, reachability, community discovery) on RDF data are not supported, as most existing systems store and index data in particular ways (e.g., as relational tables or as a bitmap matrix) to maximize one particular operation on RDF data: SPARQL query processing. In this paper, we introduce Trinity.RDF, a distributed, memory-based graph engine for web scale RDF data. Instead of managing the RDF data in triple stores or as bitmap matrices, we store RDF data in its native graph form. It achieves much better (sometimes orders of magnitude better) performance for SPARQL queries than the state-of-the-art approaches. Furthermore, since the data is stored in its native graph form, the system can support other operations (e.g., random walks, reachability) on RDF graphs as well. We conduct comprehensive experimental studies on real life, web scale RDF data to demonstrate the effectiveness of our approach.

1 Introduction

RDF data is becoming increasingly available: the semantic web movement towards a web 3.0 world is proliferating a huge amount of RDF data. Commercial search engines including Google and Bing are pushing web sites to use RDFa to explicitly express the semantics of their web contents. Large public knowledge bases, such as DBpedia [9] and Probase [37], contain billions of facts in RDF format. Web content management systems, which model data in RDF, mushroom in various communities all around the world.

Challenges RDF data management systems are facing two challenges: namely, systems' scalability and generality. The challenge of scalability is particularly urgent. Tremendous efforts have been devoted to building high performance RDF systems and SPARQL engines [6, 12, 3, 36, 14, 5, 35, 27]. Still, scalability remains the biggest hurdle. Essentially, RDF data is highly connected graph data, and SPARQL queries are like subgraph matching queries. But most approaches model RDF data as a set of triples, and use RDBMSs for storing, indexing, and query processing. These approaches do not scale, as processing a query often involves a large number of join operations that produce large intermediate results. Furthermore, many systems, including SW-Store [5], Hexastore [35], and RDF-3X [27], are single-machine systems. As the size of RDF data keeps soaring, it is not realistic for single-machine approaches to provide good performance. Recently, several distributed RDF systems, such as SHARD [29], YARS2 [17], Virtuoso [15], and [20], have been introduced. However, they still model RDF data as a set of triples. The cost incurred by excessive join operations is further exacerbated by network communication overhead. Some distributed solutions try to overcome this limitation by brute-force replication of data [20]. However, this approach simply fails in the face of complex SPARQL queries (e.g., queries with a multi-hop chain), and has a considerable space overhead (usually exponential).

The second challenge lies in the generality of RDF systems. State-of-the-art systems are not able to support general purpose queries on RDF data. In fact, most of them are optimized for SPARQL only, but a wide range of meaningful queries and operations on RDF data cannot be expressed in SPARQL. Consider an RDF dataset that represents an entity/relationship graph. One basic query on such a graph is reachability, that is, checking whether a path exists between two given entities in the RDF data. Many other queries (e.g., community detection) on entity/relationship data rely on graph operations. For example, random walks on the graph can be used to calculate the similarity between two entities. All of the above queries and operations require some form of graph-based analytics [34, 28, 22, 33]. Unfortunately, none of these can be supported in current RDF systems, and one of the reasons is that they manage RDF data in some foreign forms (e.g., relational tables or bitmap matrices) instead of its native graph form.

Overview of Our Approach We introduce Trinity.RDF, a distributed in-memory RDF system that is capable of handling web scale RDF data (billions or even trillions of triples). Unlike existing systems that use relational tables (triple stores) or bitmap matrices to manage RDF, Trinity.RDF builds on top of a memory cloud, and models RDF data in its native graph form (i.e., representing entities as graph nodes, and relationships as graph edges). We argue that such a memory-based architecture that logically and physically models RDF in native graphs opens up a new paradigm for RDF management. It not only leads to new optimization opportunities for SPARQL query processing, but also supports more advanced graph analytics on RDF data.

To see this, we must first understand that most graph operations do not have locality [23, 31], and rely exclusively on random accesses. As a result, storing RDF graphs in disk-based triple stores is not a feasible solution, since random accesses on hard disks are notoriously slow. Although sophisticated indices can be created to speed up query processing, they introduce excessive join operations, which become a major cost of SPARQL query processing.

Trinity.RDF models RDF data as an in-memory graph. Naturally, it supports fast random accesses on the RDF graph. But in order to process SPARQL queries efficiently, we still need to address the issues of how to reduce the number of join operations, and how to reduce the size of intermediary results. In this paper, we develop novel techniques that use efficient in-memory graph exploration instead of join operations for SPARQL processing. Specifically, we decompose a SPARQL query into a set of triple patterns, and conduct a sequence of graph explorations to generate bindings for each of the triple patterns. The exploration-based approach uses the binding information of the explored subgraphs to prune candidate matches in a greedy manner. In contrast, previous approaches isolate individual triple patterns, that is, they generate bindings for them separately, and make excessive use of costly join operations to combine those bindings, which inevitably results in large intermediate results. Our new query paradigm greatly reduces the amount of intermediate results, boosts the query performance in a distributed environment, and makes the system scale. We show in experiments that even without a smart graph partitioning scheme, Trinity.RDF achieves several orders of magnitude speed-up on web scale RDF data over state-of-the-art RDF systems.

We also note that since Trinity.RDF models data as a native graph, we enable a large range of advanced graph analytics on RDF data. For example, random walks, regular expression queries, reachability queries, distance oracles, and community searches can be performed on web scale RDF data directly. Even large scale vertex-based analytical tasks on graph platforms such as Pregel [24] can be easily supported in our system. However, these topics are out of the scope of this paper, and we refer interested readers to the Trinity system [30, 4] for detailed information.

Contributions We summarize the novelty and advantages of our work as follows.

1. We introduce a novel graph-based scheme for managing RDF data. Trinity.RDF has the potential to support efficient graph-based queries, as well as advanced graph analytics, on RDF.

2. We leverage graph exploration for SPARQL processing. The new query paradigm greatly reduces the volume of intermediate results, which in turn boosts query performance and system scalability.

3. We introduce a new cost model, novel cardinality estimation techniques, and optimization algorithms for distributed query plan generation. These approaches ensure excellent performance on web scale RDF data.

Paper Layout The rest of the paper is organized as follows. Section 2 describes the difference between join operations and graph exploration. Section 3 presents the architecture of the Trinity.RDF system. Section 4 describes how we model RDF data as native graphs. Section 5 describes SPARQL query processing techniques. Section 6 shows experimental results. We conclude in Section 8.

2 Join vs. Graph Exploration

Joins are the major operator in SPARQL query processing. Trinity.RDF outperforms existing systems by orders of magnitude because it replaces expensive join operations with efficient graph exploration. In this section, we discuss the performance implications of the two different approaches.

2.1 RDF and SPARQL

Before we discuss join operations vs. graph exploration, we first introduce RDF and SPARQL query processing on RDF data. An RDF data set consists of statements in the form of (subject, predicate, object). Each statement, also known as a triple, is about a fact, which can be interpreted as: subject has a predicate property whose value is object. For example, a movie knowledge base may contain the following triples about the movie "Titanic":

(Titanic, has_award, Best_Picture)
(Titanic, casts, L_DiCaprio)
(J_Cameron, directs, Titanic)
(J_Cameron, wins, Oscar_Award)
...

An RDF dataset can be considered as representing a directed graph, with entities (i.e., subjects and objects) as nodes, and relationships (i.e., predicates) as directed edges. SPARQL is the standard query language for retrieving data stored in RDF format. The core syntax of SPARQL is a conjunctive set of triple patterns called a basic graph pattern. A triple pattern is similar to an RDF triple except that any component in the triple pattern can be a variable. A basic graph pattern describes a subgraph which a user wants to match against the RDF data. Thus, SPARQL query processing is essentially a subgraph matching problem. For example, we can retrieve the cast of an award-winning movie directed by an award-winning director using the following query:

Example 1.
SELECT ?movie, ?actor WHERE {
  ?director wins ?award .
  ?director directs ?movie .
  ?movie has_award ?movie_award .
  ?movie casts ?actor . }

SPARQL also contains other language constructs that support disjunctive queries and filtering.
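To make the subgraph-matching view concrete, here is a minimal, self-contained sketch (ours, not the paper's code): the toy movie graph of this section written as triples, Example 1 written as a basic graph pattern, and a brute-force matcher that binds variables pattern by pattern. The underscore entity names and the match helper are our own illustrative choices.

TRIPLES = [
    ("J_Cameron", "wins", "Oscar_Award"),
    ("G_Lucas", "wins", "Saturn_Award"),
    ("J_Cameron", "directs", "Titanic"),
    ("J_Cameron", "directs", "Avatar"),
    ("P_Haggis", "directs", "Crash"),
    ("Titanic", "has_award", "Best_Picture"),
    ("Crash", "has_award", "Best_Picture"),
    ("Titanic", "casts", "L_DiCaprio"),
    ("Avatar", "casts", "S_Worthington"),
    ("Crash", "casts", "D_Cheadle"),
    ("Star_War_VI", "casts", "M_Hamill"),
]

# The basic graph pattern of Example 1: a conjunction of triple patterns;
# components starting with '?' are variables.
BGP = [
    ("?director", "wins", "?award"),
    ("?director", "directs", "?movie"),
    ("?movie", "has_award", "?movie_award"),
    ("?movie", "casts", "?actor"),
]

def match(bgp, triples, binding=None):
    """Brute-force subgraph matching: bind variables pattern by pattern."""
    binding = binding or {}
    if not bgp:
        yield dict(binding)
        return
    head, rest = bgp[0], bgp[1:]
    for triple in triples:
        trial, ok = dict(binding), True
        for pat, val in zip(head, triple):
            if pat.startswith("?"):
                ok = trial.setdefault(pat, val) == val
            else:
                ok = pat == val
            if not ok:
                break
        if ok:
            yield from match(rest, triples, trial)

# Exactly one binding satisfies all four patterns simultaneously.
print(list(match(BGP, TRIPLES)))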
2.2 Using Join Operations

Many state-of-the-art RDF systems store RDF data as a set of triples in relational tables, and therefore they rely excessively on join operations for processing SPARQL queries. In general, query processing consists of two phases [25]. The first phase is known as the scan phase. It decomposes a SPARQL query into a set of triple patterns. For the query in Example 1, the triple patterns are ?director wins ?award, ?director directs ?movie, ?movie has_award ?movie_award, and ?movie casts ?actor. For each triple pattern, we scan tables or indices to generate bindings. Assume we are processing the query against the RDF graph in Figure 1; the base tables that contain the bindings are shown in Table 1. The second phase is the join phase. The base tables are joined to produce the final answer to the query.

Figure 1: An example RDF graph

?director   ?award
J Cameron   Oscar Award
G Lucas     Saturn Award

?director   ?movie
P Haggis    Crash
J Cameron   Titanic
J Cameron   Avatar

?movie      ?movie_award
Titanic     Best Picture
Crash       Best Picture

?movie       ?actor
Crash        D Cheadle
Titanic      L DiCaprio
Avatar       S Worthington
Star War VI  M Hamill

Table 1: Base tables and bound variables.

Sophisticated techniques have been used to optimize the order of joins to improve query performance. Still, the approach has inherent limitations: (1) it uses many costly join operations; (2) the scan-join process produces large redundant intermediary results. From Table 1, we can see that most intermediary join results will be produced in vain. After all, only Titanic, directed by J Cameron, matches the query. Moreover, useless intermediary results may only be detected in later stages of the join process. For example, if we choose to join ?director directs ?movie and ?movie casts ?actor first, we will not know that the resulting rows related to Avatar and Crash are useless until joining with ?director wins ?award and ?movie has_award ?movie_award. Sideways Information Passing (SIP) [26] was proposed to alleviate this problem. SIP is a dynamic optimization technique for pipelined execution plans. It introduces filters on subject, predicate, or object identifiers, and passes these filters to joins and scans in other parts of the query that need to process similar identifiers.
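The following sketch mimics the two-phase scan-join pipeline on the toy data above. It is our own simplification (real engines such as RDF-3X use sorted indices and merge/hash joins, not this nested loop), but it reproduces the blow-up the text describes: joining the directs and casts tables first materializes Avatar and Crash rows that the final answer never uses.

from itertools import product

TRIPLES = [
    ("J_Cameron", "wins", "Oscar_Award"), ("G_Lucas", "wins", "Saturn_Award"),
    ("J_Cameron", "directs", "Titanic"), ("J_Cameron", "directs", "Avatar"),
    ("P_Haggis", "directs", "Crash"), ("Titanic", "has_award", "Best_Picture"),
    ("Crash", "has_award", "Best_Picture"), ("Titanic", "casts", "L_DiCaprio"),
    ("Avatar", "casts", "S_Worthington"), ("Crash", "casts", "D_Cheadle"),
]

def scan(pattern):
    """Scan phase: build the base table of variable bindings for one pattern."""
    s, p, o = pattern
    return [{s: ts, o: to} for ts, tp, to in TRIPLES if tp == p]

def join(left, right):
    """Join phase: combine two binding tables on their shared variables."""
    out = []
    for l, r in product(left, right):
        if all(l[v] == r[v] for v in set(l) & set(r)):
            out.append({**l, **r})
    return out

directs = scan(("?director", "directs", "?movie"))
casts = scan(("?movie", "casts", "?actor"))
inter = join(directs, casts)          # 3 rows; Avatar and Crash are dead ends
final = join(join(inter, scan(("?director", "wins", "?award"))),
             scan(("?movie", "has_award", "?movie_award")))
print(len(inter), len(final))         # 3 1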
2.3 Using Graph Explorations

In this paper, we adopt a new approach that greatly improves the performance of SPARQL query processing. The idea is to use graph exploration instead of joins.

The intuition can be illustrated by an example. Assume we perform the query in Example 1 over the RDF graph in Figure 1, starting with the pattern ?director wins ?award. After exploring the neighbors of ?award connected via the wins edge, we find that the possible bindings for ?director are J Cameron and G Lucas. Then, we explore the graph further from nodes J Cameron and G Lucas via edge directs, and we generate bindings for ?director directs ?movie. In the above exploration, we prune G Lucas because it does not have a directs edge. Also, we do not produce useless bindings such as those shown in Table 1, e.g., the binding (P Haggis, Crash). Thus, we are able to prune unnecessary intermediate results efficiently.

The above intuition is only valid if graph exploration can be implemented more efficiently than join. This is not true for existing RDF systems. If the RDF graph is managed by relational tables, triple stores, or disk-based key-value stores, then we need to use join operations to implement graph exploration, which means graph exploration cannot be more efficient than join: with an index, it usually requires an O(log N) operation to access the triples relating to a subject/object¹. In our work, we use native graphs to model RDF data, which enables us to perform the same operation in O(1) time. With the support of the underlying architecture, we make graph exploration extremely efficient. In fact, Trinity.RDF can explore as many as 2.3 million nodes on a graph distributed over an 8-server cluster within one tenth of a second [30]. This lays the foundation for exploration-based SPARQL query processing.

We need to point out that the order of exploration is important. Starting with the highly selective pattern ?movie has_award ?movie_award, we can prune a lot of candidate bindings of other patterns. If we explore the graph in a different order, i.e., exploring ?movie casts ?actor followed by ?director directs ?movie, then we will still generate useless intermediate results. Thus, query plans need to be carefully optimized to pick the optimal exploration order, which is not trivial. We will discuss our algorithm for optimal graph exploration plan generation in Section 5.

Note that graph exploration (following the links) is to a certain extent similar to an index-nested-loops join. However, index-nested-loops join is costly for RDBMSs or disk-based data, because it needs a random access for each index lookup. Hence, in previous approaches, scan-joins, which perform sequential reads on sorted data, are preferred. Our approach further extends the random access approach to a distributed environment and minimizes the size of intermediate results.

¹ N is the total number of RDF triples.
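A minimal rendering of the exploration idea on the toy data (a sketch under our own naming, not Trinity.RDF's implementation): following an edge is a single dict lookup, and candidates that cannot continue the exploration are dropped immediately instead of surviving into a join.

from collections import defaultdict

TRIPLES = [
    ("J_Cameron", "wins", "Oscar_Award"), ("G_Lucas", "wins", "Saturn_Award"),
    ("J_Cameron", "directs", "Titanic"), ("J_Cameron", "directs", "Avatar"),
    ("P_Haggis", "directs", "Crash"),
]

# In-memory adjacency lists: following an edge is an O(1) dict access,
# not an O(log N) index probe.
out_edges = defaultdict(list)              # node -> [(predicate, neighbor)]
for s, p, o in TRIPLES:
    out_edges[s].append((p, o))

# q1: ?director wins ?award -> candidate directors.
directors = {s for s, p, o in TRIPLES if p == "wins"}

# q2: explore `directs` edges only from those candidates. G_Lucas has no
# such edge, so it is pruned; (P_Haggis, Crash) is never produced at all.
movies = {(d, n) for d in directors
          for p, n in out_edges[d] if p == "directs"}
print(movies)       # {('J_Cameron', 'Titanic'), ('J_Cameron', 'Avatar')}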
3 System Architecture

In this section, we give an overall description of the data model and the architecture of Trinity.RDF. We model and store RDF data as a directed graph. Each node in the graph represents a unique entity, which may appear as a subject and/or an object in an RDF statement. Each RDF statement corresponds to an edge in the graph. Edges are directed, pointing from subjects to objects. Furthermore, edges are labeled with the predicates. We will present the data structure for nodes and edges in more detail in Section 4.

To ensure fast random data access in graph exploration, we store RDF graphs in memory. A web scale RDF graph may contain billions of entities (nodes) and trillions of triples. It is unlikely that a web scale RDF graph can fit in the RAM of a single machine. Trinity.RDF is based on Trinity [30], which is a distributed in-memory key-value store. Trinity.RDF builds a graph interface on top of the key-value store. It randomly partitions an RDF graph across a cluster of commodity machines by hashing on the nodes. Thus, each machine holds a disjoint part of the graph. Given a SPARQL query, we perform search in parallel on each machine. During query processing, machines may need to exchange data, as a query pattern may span multiple partitions.

Figure 2: Distributed query processing framework

Figure 2 shows the high level architecture of Trinity.RDF. A user submits his query to a proxy. The proxy generates a query plan and delivers the plan to all the Trinity machines, which hold the RDF data. Then, each machine executes the query plan under the coordination of the proxy. When the bindings for all the variables are resolved, all Trinity machines send the bindings (answers) back to the proxy, where the final result is assembled and sent back to the user. As we can see, the proxy plays an important role in the architecture. Specifically, it performs the following tasks. First, it generates a query plan based on available statistics and indices. Second, it keeps track of the status of each Trinity machine in query processing by, for example, synchronizing the execution of each query step. However, the Trinity machines do not communicate only with the proxy; they also communicate among themselves during query execution to exchange intermediary results. All communications are handled by a message passing mechanism built into Trinity.

Besides the proxy and the Trinity machines, we also employ a string indexing server. We replace all literals in RDF triples by their ids. The string indexing server implements a literal-to-id mapping that translates literals in a SPARQL query into ids, and an id-to-literal mapping that maps ids in the output back to literals for the user. The mapping can be implemented either by a separate Trinity in-memory key-value store for efficiency, or by a persistent key-value store if memory space is a concern. Usually the cost of mapping is negligible compared to that of query processing.
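A minimal sketch of the two pieces of bookkeeping just described: hash sharding on node ids and the string indexing server's two maps. The function names are ours, not Trinity's API.

NUM_MACHINES = 8

def machine_of(node_id: int) -> int:
    """Default sharding: hash on node-id, i.e., a random partition."""
    return node_id % NUM_MACHINES      # a stand-in for the real hash function

literal_to_id: dict[str, int] = {}
id_to_literal: dict[int, str] = {}

def intern(literal: str) -> int:
    """Replace a literal by a compact integer id, keeping the reverse map
    so query answers can be translated back for the user."""
    if literal not in literal_to_id:
        nid = len(literal_to_id)
        literal_to_id[literal] = nid
        id_to_literal[nid] = literal
    return literal_to_id[literal]

nid = intern("J_Cameron")
print(nid, machine_of(nid), id_to_literal[nid])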
4 Data Modeling

To support graph-based operations, including SPARQL queries, on RDF data more effectively, we store RDF data in its native graph form. In this section, we describe how we model and manipulate RDF data as distributed graphs.

4.1 Modeling Graphs

Trinity.RDF is based on Trinity, which is a key-value store in a memory cloud. We create a graph model on top of the key-value store. Specifically, we represent each RDF entity as a graph node with a unique id, and store it as a key-value pair in the Trinity memory cloud:

(node-id, ⟨in-adjacency-list, out-adjacency-list⟩)    (1)

The key-value pair consists of the node-id as the key, and the node's adjacency list as the value. The adjacency list is divided into two lists, one for neighbors with incoming edges and the other for neighbors with outgoing edges. Each element in the adjacency lists is a (predicate, node-id) pair, which records the id of the neighbor and the predicate on the edge.

Thus, we have created a graph model on top of the key-value store. Given any node, we can find the node-id of any of its neighbors, and the underlying Trinity memory cloud will retrieve the key-value pair for that node-id. This enables us to explore the graph from any given node by accessing its adjacency lists. Figure 3 shows an example of the data structure.

Figure 3: An example of model (1)
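Model (1) rendered as a plain Python dict (a sketch only; Trinity stores binary cells and integer ids, not Python objects or strings):

store = {
    # key: node-id, value: (in-adjacency-list, out-adjacency-list),
    # each list holding (predicate, neighbor-node-id) pairs.
    "Titanic": ([("directs", "J_Cameron")],
                [("has_award", "Best_Picture"), ("casts", "L_DiCaprio")]),
}

def neighbors(node, direction):
    """One key-value lookup, then scan the requested adjacency list."""
    in_adj, out_adj = store[node]
    return in_adj if direction == "in" else out_adj

print(neighbors("Titanic", "out"))
# [('has_award', 'Best_Picture'), ('casts', 'L_DiCaprio')]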
4.2 Graph Partitioning

We distribute an RDF graph across multiple machines, and this is achieved by the underlying memory cloud, which partitions the key-value pairs in a cluster. However, due to the characteristics of graphs, we need to look into how the graph is partitioned in order to ensure the best performance.

Two factors may have an impact on network overhead when we explore a graph. The first factor is how the graph is partitioned. In our system, sharding is supported by the underlying key-value store, and the default sharding mechanism is hashing on node-id. In other words, the graph is randomly partitioned. Certainly, sophisticated graph partitioning methods can be adopted for sharding. However, graph partitioning is beyond the scope of this paper.

The second factor is how we model graphs on top of the key-value store. The model given by (1) may have potential problems for real-life large graphs. Many real-life RDF graphs are scale-free graphs whose node degrees follow the power law distribution. In DBpedia [9], for example, over 90% of the nodes have less than 5 neighbors, while some top nodes have more than 100,000 neighbors. The model may incur a large amount of network traffic when we explore the graph from a top node x. For simplicity, let us assume none of x's neighbors resides on the same machine as x does. To visit x's neighbors, we need to send the node-ids of its neighbors to other machines. The total amount of information we need to send across the network is exactly the entire set of node-ids in x's adjacency list. For the DBpedia data, in the worst case, whenever we encounter a top node in graph exploration, we need to send 800KB of data (each node-id is 64 bits) across the network. This is a huge cost in graph exploration.

We take the power law distribution into consideration in modeling RDF data. Specifically, we model a node x by the following key-value pair:

(node-id, ⟨in_1, · · ·, in_k, out_1, · · ·, out_k⟩)    (2)

where in_i and out_i are keys to some other key-value pairs:

(in_i, in-adjacency-list_i)    (out_i, out-adjacency-list_i)    (3)

The essence of this model is the following: the key-value pair (in_i, in-adjacency-list_i) and the nodes in in-adjacency-list_i are stored on the same machine i. In other words, we partition the adjacency lists in model (1) by machine.

The benefit of this design is obvious. No matter how many neighbors x has, we will send no more than k nids (in_i and out_i) over the network, since each machine i, upon receiving nid_i, can retrieve x's neighbors that reside on machine i without incurring any network communication. However, for nodes with few neighbors, model (2) is more costly than model (1). In our work, we use a threshold t to decide which model to use: if a node has more than t neighbors, we use model (2) to map it to the key-value store; otherwise, we use model (1). Figure 4 gives an example with t = 1. Furthermore, in our design, all triples are stored decentralized at their subjects and objects. Thus, an update has little cost, as it only affects a few nodes. However, updates are out of the scope of this paper, and we omit a detailed discussion here.

Figure 4: An example of model (2)
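A sketch of the threshold-t hybrid layout (our own rendering with hypothetical data; Figure 4's exact contents are an image): hub nodes keep only per-machine keys as in model (2), while the partial adjacency lists of model (3) are co-located with the neighbors they name.

T, K = 1, 2                          # threshold t and number of machines

def machine_of(node):
    return hash(node) % K            # stand-in for the store's shard function

def layout(node, out_adj):
    """out_adj: [(predicate, neighbor)]; in-lists are handled the same way."""
    if len(out_adj) <= T:            # model (1): keep the list inline
        return {"kind": "model-1", "out": out_adj}
    shards = [[] for _ in range(K)]  # model (3): one partial list per machine
    for p, n in out_adj:
        shards[machine_of(n)].append((p, n))
    # model (2): the node's value keeps only the k keys out_0..out_{k-1};
    # shards[i] is stored on machine i, next to the neighbors it names.
    return {"kind": "model-2",
            "out_keys": [(node, "out", i) for i in range(K)],
            "shards": shards}

print(layout("G_Lucas", [("wins", "Saturn_Award")])["kind"])      # model-1
print(layout("J_Cameron", [("wins", "Oscar_Award"),
                           ("directs", "Titanic"),
                           ("directs", "Avatar")])["kind"])       # model-2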
4.3 Indexing Predicates

Graph exploration relies on retrieving nodes connected by an edge with a given predicate. We use two additional indices for this purpose.

Local predicate indexing We create a local predicate index for each node x. We sort all (predicate, node-id) pairs in x's adjacency lists first by predicate and then by node-id. This corresponds to the SPO or OPS index in traditional RDF approaches. In addition, we also create an aggregate index that enables us to quickly decide whether a node has a given predicate, and the number of its neighbors connected by that predicate.

Global predicate indexing The global predicate index enables us to find all nodes that have incoming or outgoing neighbors labeled by a given predicate. This corresponds to the PSO or POS index in traditional approaches. Specifically, for each predicate, machine i stores a key-value pair

(predicate, ⟨subject-list_i, object-list_i⟩)

where subject-list_i (object-list_i) consists of all unique subjects (objects) with that predicate on machine i.

4.4 Basic Graph Operators

We provide the following three graph operators, with which we implement graph exploration:

1. LoadNodes(predicate, dir): Return nodes that have an incoming or outgoing edge labeled as predicate.

2. LoadNeighborsOnMachine(node, dir, i): For a given node, return its incoming or outgoing neighbors that reside on machine i.

3. SelectByPredicate(nid, predicate): From a given partial adjacency list specified by nid, return nodes that are labeled with the given predicate.

Here, dir is a parameter that specifies whether the predicate is on an incoming or an outgoing edge. LoadNodes() is straightforward to understand. When it is called, it uses the global predicate index on each machine to find nodes that have at least one incoming or outgoing edge that is labeled as predicate.

The next two operators together find specific neighbors of a given node. LoadNeighborsOnMachine() finds a node's incoming or outgoing neighbors on a given machine. But instead of returning all the neighbors, it simply returns the in_i or out_i id as given in (2). Then, given the in_i or out_i id, SelectByPredicate() finds the nodes in the adjacency list that are associated with the given predicate. Certainly, if the node has fewer than t neighbors, then its adjacency list is not distributed, and the two functions simply operate on the local adjacency list.

We now use some examples to illustrate the use of the above three operators on the RDF graph shown in Figure 4. LoadNodes(l2, out) finds n2 on machine 1 and n3 on machine 2. LoadNeighborsOnMachine(n0, in, 1) returns the partial adjacency list's id in1, and SelectByPredicate(in1, l2) returns n2.
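The three operators over a toy two-machine store (a sketch under our own data layout and Python names; Trinity's real interfaces are richer, and the movie data here replaces Figure 4's unrecoverable node labels):

# machine i holds: "global" = predicate -> (subjects, objects) on machine i,
# and "adj" = nid -> one partial adjacency list of (predicate, node) pairs.
MACHINES = [
    {"global": {"wins": ({"J_Cameron"}, {"Oscar_Award"})},
     "adj": {("J_Cameron", "out", 0): [("wins", "Oscar_Award"),
                                       ("directs", "Titanic")]}},
    {"global": {"wins": ({"G_Lucas"}, {"Saturn_Award"})},
     "adj": {("J_Cameron", "out", 1): [("directs", "Avatar")]}},
]

def load_nodes(predicate, direction):
    """Global predicate index: 'out' returns subjects, 'in' returns objects."""
    pos = 0 if direction == "out" else 1
    found = set()
    for m in MACHINES:
        if predicate in m["global"]:
            found |= m["global"][predicate][pos]
    return found

def load_neighbors_on_machine(node, direction, i):
    """Return the key (nid) of node's partial adjacency list on machine i."""
    nid = (node, direction, i)
    return nid if nid in MACHINES[i]["adj"] else None

def select_by_predicate(nid, predicate):
    """Runs on machine nid[2]: filter one partial adjacency list by predicate."""
    return [n for p, n in MACHINES[nid[2]]["adj"][nid] if p == predicate]

nid = load_neighbors_on_machine("J_Cameron", "out", 1)
print(load_nodes("wins", "out"), select_by_predicate(nid, "directs"))
# {'J_Cameron', 'G_Lucas'} ['Avatar']   (set order may vary)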
5 Query Processing

In this section, we present our exploration-based approach for SPARQL query processing.

5.1 Overview

We represent a SPARQL query Q by a query graph G. Nodes in G denote subjects and objects in Q, and directed edges in G denote predicates. Figure 5 shows the query graph corresponding to the query in Example 1, and lists the 4 triple patterns in the query as q1 to q4.

• q1: (?director wins ?award)
• q2: (?director directs ?movie)
• q3: (?movie has_award ?movie_award)
• q4: (?movie casts ?actor)

Figure 5: The query graph of Example 1

With G defined, the problem of SPARQL query processing can be transformed into the problem of subgraph matching. However, as we pointed out in Section 2, existing RDF query processing and subgraph matching algorithms rely excessively on costly joins, which cannot scale to RDF data of billions or even trillions of triples. Instead, we use efficient graph exploration in an in-memory key-value store to support fast query processing. The exploration is conducted as follows: we first decompose Q into an ordered sequence of triple patterns q1, · · ·, qn. Then, we find matches for each qi, and from each match, we explore the graph to find matches for qi+1. Thus, to a large extent, graph exploration acts as joins. Furthermore, the exploration is carried out on all distributed machines in parallel. In the final step, we gather the matches for all individual triple patterns to the centralized query proxy, and combine them together to produce the final results.

5.2 Single Triple Pattern Matching

We start with matching a single triple pattern. For a triple pattern q, our goal is to find all its matches R(q). Let P denote the predicate in q, V denote the variables in q, and B(V) denote the binding of V. If V is a free variable (not bound), we also use B(V) to denote all possible values V can take. We regard a constant as a special variable with only one binding.

We use graph exploration to find matches for q. There are two ways of exploration: from subject to object (we first try to find matches for the subject in q, and then for each match, we find matches for the object in q; we denote this exploration as →q) and from object to subject (we denote this exploration as ←q). We use src and tgt to refer to the source and target of an exploration (i.e., in →q the src is the subject, while in ←q the src is the object).

Algorithm 1 MatchPattern(e)
  obtain src, tgt, and predicate P from e (e = →q or e = ←q)
  // On the src side:
  if src is a free variable then
      B(src) = ⋃_{p ∈ B(P)} LoadNodes(p, dir)
  set M_i = ∅ for all i            // initialize messages to machine i
  for each s in B(src) do
      for each machine i do
          nid_i = LoadNeighborsOnMachine(s, dir, i)
          M_i = M_i ∪ {(s, nid_i)}
  batch send messages M to all machines
  // On the tgt side:
  for each (s, nid) in M do
      for each p in B(P) do
          N = SelectByPredicate(nid, p)
          for each n in N ∩ B(tgt) do
              R = R ∪ {(s, p, n)}
  return R

Algorithm 1 outlines the matching procedure using the basic operators introduced in Section 4.4. If src is a constant, we only need to explore from one node. If src is a variable, we initialize its bindings by calling LoadNodes, which searches the global predicate index to find the matches for src. Note that if the predicate itself is a free variable, then we have to load nodes for every predicate. After src is bound, for each node that matches src and for each machine i, we call LoadNeighborsOnMachine() to find the key nid_i. The node's neighbors on machine i are stored in the key-value pair with nid_i as the key. We then send nid_i to machine i.

Each machine, on receiving the message, starts the matching on the tgt side. For each eligible predicate p in B(P), we filter the neighbors in the adjacency list by p by calling SelectByPredicate(). If tgt is a free variable, any neighbor is eligible as a binding, so we add (s, p, n) as a match for every neighbor n. If tgt is a constant, however, only the constant node is eligible. As we treat a constant as a special variable with only one binding, we can uniformly handle these two cases: we match a new edge only if its target is in B(tgt).

Figure 6: Distribution of the RDF graph in Figure 1

We use an example to demonstrate how MatchPattern works. Assume the RDF graph is distributed on two machines as shown in Figure 6. Suppose we want to find matches for ←q1, where q1 is "?director wins ?award". In this case, src is ?award. We first call LoadNodes(wins, in) to find B(?award), i.e., the nodes having an incoming wins edge. This results in Oscar Award on machine 1 and Saturn Award on machine 2. Next, on the target ?director side, machine 1 gets the key of the adjacency list sent by Saturn Award, and after calling SelectByPredicate(), it gets G Lucas. Since the target ?director is a free variable, any edge labeled with wins will be matched. We add the matching edge (G Lucas, wins, Saturn Award) to R. Similarly, on machine 2, we get (J Cameron, wins, Oscar Award).

As Algorithm 1 shows, given a triple pattern q, each machine performs MatchPattern() independently, and obtains and stores the results on the target side, that is, on the machines where the target is matched. For the example in Figure 6, matches for ←q1, where q1 is "?director wins ?award", are stored on machine 1, where the target G Lucas is located. Table 2 shows the results on both machines for q1. We use Ri(q) to denote the matches of q on machine i. Note that the constant column wins is not stored.

(a) R1(q1)
?director  ?award
G Lucas    Saturn Award

(b) R2(q1)
?director  ?award
J Cameron  Oscar Award

Table 2: Individual matching result of q1

We now use an example to illustrate the exploration and
pruning process. Assume we explore the graph in Figure
1 in the order of − →q1 , −

q2 , ←
q− −

3 , q4 . Clearly, how the triple
patterns are ordered may have a big impact on the inter-
mediate results size. We discuss query plan optimization in
Section 5.5.
There are two different cases in exploration and pruning,
and they are examplified by matching − →q2 after −

q1 , and by
Figure 6: Distribution of the RDF graph in Figure 1 ←− −

matching q3 after q2 , repsectively. We describe them sepa-
We use an example to demonstrate how M atchP attern rately. In the first case, the source of exploration is bound.
works. Assume the RDF graph is distributed on two ma- Exploring q2 after q1 belongs to this case, as the source
chines as shown in Figure 6. Suppose we want to find ?director is bound by q1 . So, instead of using LoadN odes()
matches for ← q−1 where q1 is “?director wins ?award ”. In to find all possible directors, we start the exploration from
this case, src is ?award. We first call LoadN odes(wins, in) existing bindings (J Cameron and G Lucas), so we won’t
to find B(?award), which are nodes having an incoming generate movies not directed by award-winning directors.
wins edge. This results in Oscar Award on machine 1, and Moreover, note that in Figure 1, G Lucas does not have
a directs edge, so exploring from G Lucas will not produce our scenario and theirs. In the relational optimizer, later
any matching triple. It means we can prune G Lucas safely: joins depend on previous intermediary join results, while
There is no need to send the key to its adjacency-list across for us, later explorations depend on previous intermediary
the network. The results are in Table 4, which contains bindings. The intermediary join results do not depend on
fewer tuples than Table 3. the order of join, while the intermediary bindings do depend
In the second case, the target of exploration is bound. Ex- on the order of exploration. For example, consider two plans
ploring q3 after q2 belongs to this case, as ?movie is bound (1) {−→
q1 , −

q2 , ←
q− −
→ −
→ − → ← − − →
3 , q4 } and (2) { q2 , q3 , q1 , q4 }, where the first
to {T itanic, Avatar} by − →
q2 . We only add results in this 3 elements are q1 , q2 , and q3 , but in different order. For the
binding set to the matching results, namely (Best P icture, relational optimizer (ignore the direction of each qi ), the join
T itanic). Independently, (Best P icture, Crash) also saties- results q1 , q2 , and q3 are the same no matter how they are
fies the pattern, but Crash is not in the binding set, so ordered. But in our case, plan (1) produces {Titanic} and
it is pruned. Furthermore, since the previous binding of plan (2) produces {Titanic, Crash} for B(?movie), as shown
Avatar does not match any triple in this round, it is also in Table 5. The redundant Crash will makes − →
q4 in plan (2)
safely pruned from ?movie’s binding. Finally, we incorpo- more costly than plan (1).
rate the matches of q3 into the result. As shown in Table 5, We now introduce our approach for exploration order op-
it now has three bound variables ?movie, ?director, and timization. For a query graph, we find exploration plans for
?movie award, and contains one row (T itanic, J Cameron, its subgraphs (starting with single nodes), and expand/com-
Best P icture) on machine 1 where T itanic is located. bine the plans until we derive the plan for the entire query
?movie ?director ?movie award
graph. There are two ways to grow a subgraph: expansion
Titanic J Cameron Best Picture and combination. Figure 7(a) depicts an example of expan-
Table 5: Results after incorporating q2 and q3 sion: we explore to a free variable or a constant and add
an edge to the subgraph. The subgraph {q1 } is expanded
5.4 Final Join after Exploration

We used two mechanisms to prune intermediate results: a binding is pruned if it cannot reach any bound target, or if it cannot be reached from any bound source. Furthermore, once we prune the target (source), we also prune the corresponding values from the source (target). This greatly reduces the size of the intermediary results, and does not incur much additional communication, as shown in the previous example.

However, invalid intermediary results may still remain after the pruning. This is because the pruning of q's intermediary results only affects the bindings of q and the immediate neighbors of q. Bindings of other patterns are not considered, because otherwise we would need to carry all historical bindings in the exploration, which incurs a big communication cost.

After the exploration finishes, we obtain all the matches in R. Since R is distributed and may contain invalid results, we gather these results to a centralized proxy and perform a final join to assemble the final answer. As we have eagerly pruned most of the invalid results in the exploration phase, our join phase is light-weight compared with traditional RDF systems that rely intensively on joins, and we simply adopt the left-deep join for this purpose.
5.5 Exploration Plan Optimization

Section 5.3 described the query processing for an ordered sequence of triple patterns. The order has a significant impact on query performance. We now describe a cost-based approach for finding the optimal exploration plan.

We define an exploration plan as a graph traversal plan, and we denote it as a sequence ⟨e1, · · ·, en⟩, where each ei denotes a directed exploration of a predicate qi in the query graph, that is, ei = →qi or ei = ←qi. The cost of the plan is Σi cost(ei), where cost(ei), the cost of matching →qi or ←qi, is roughly proportional to the size of qi's results (Section 5.6 will describe cost estimation in more depth). Clearly, the size of qi's results depends on the matching of some qj, j < i. Thus, the total cost depends on the order of the ei's in the sequence.

Naive query plan optimization is costly. There are n! different orders for a query graph with n edges, and for each qi there are two directions of exploration. It is also tempting to adopt the join ordering method of a relational query optimizer. However, there is a fundamental difference between our scenario and theirs. In the relational optimizer, later joins depend on previous intermediary join results, while for us, later explorations depend on previous intermediary bindings. The intermediary join results do not depend on the order of joins, while the intermediary bindings do depend on the order of exploration. For example, consider two plans (1) {→q1, →q2, ←q3, →q4} and (2) {→q2, →q3, ←q1, →q4}, where the first 3 elements are q1, q2, and q3, but in different order. For the relational optimizer (ignoring the direction of each qi), the join results of q1, q2, and q3 are the same no matter how they are ordered. But in our case, plan (1) produces {Titanic} and plan (2) produces {Titanic, Crash} for B(?movie), as shown in Table 5. The redundant Crash makes →q4 in plan (2) more costly than in plan (1).

We now introduce our approach for exploration order optimization. For a query graph, we find exploration plans for its subgraphs (starting with single nodes), and expand/combine the plans until we derive the plan for the entire query graph. There are two ways to grow a subgraph: expansion and combination. Figure 7(a) depicts an example of expansion: we explore to a free variable or a constant and add an edge to the subgraph; the subgraph {q1} is expanded to a larger graph {q1, q2}. The other way to grow a subgraph is to combine two disjoint subgraphs by exploring an edge starting from one subgraph to the other. Figure 7(b) shows such an example: we combine the subgraph with one edge q1 with the subplan of q3 by exploring ←q2. This way, we construct a larger subgraph from two smaller subgraphs.

Figure 7: Expansion and combination examples ((a) expansion; (b) combination)

Now, we introduce heuristics for exploration optimization. Let E denote a subgraph, R(E) denote its intermediary join results, and B(E) denote the bindings of variables in E. Note that in our exploration, we compute B(E) only, but not R(E). Furthermore, bindings for some variables in E may contain redundant values. We define a variable ?c as an exploration point if it satisfies B(c) = Π_c R(E). Intuitively, node ?c is an exploration point if it does not contain any redundant value; in other words, each of its values must appear in the intermediary join results R(E). We then adopt the following heuristic in subgraph expansion/combination.

Heuristic 1. We expand a subgraph from its exploration point. We combine two subgraphs by connecting their exploration points.

The reason we want to expand/combine at the exploration points is that the exploration points do not contain redundant values. Hence, they introduce fewer redundant values for other variables in the exploration.

After the expansion/combination, we need to determine the exploration points of the resulting graph. Heuristic 1 leads to the following property:

Property 1. We expand a subgraph or combine two subgraphs through an edge. The two nodes on both ends of the edge are valid exploration points in the new graph.

Proof. For expansion from subgraph E, we start from an exploration point c that satisfies B(c) = Π_c R(E) and explore a new predicate q = c ⇝ c′. Based on our algorithm, we have Π_c R(q) ⊆ Π_c R(E). Since q ∉ E and c′ ∉ E, we get R(E ∪ q) = R(E) ⋈_c R(q). Thus:

    Π_{c′} R(E ∪ q) = Π_{c′} (R(E) ⋈_c R(q)) = Π_{c′} R(q) = B(c′)

which means c′ is an exploration point of E ∪ q. After B(c′) is obtained, the algorithm uses it to prune B(c) so that c's new binding satisfies B(c) = Π_c R(q). Thus:

    Π_c R(E ∪ q) = Π_c (R(E) ⋈_c R(q)) = Π_c R(E) ⋈_c Π_c R(q) = Π_c R(q) = B(c)

which means c is a valid exploration point of E ∪ q. Similarly, we can show that Property 1 holds in subgraph combination. ∎

We use dynamic programming (DP) for exploration optimization. We use (E, c) to denote a state in the DP. We start with subgraphs of size 1, that is, subgraphs of a single edge q = u ⇝ v. The states are ({q}, u) and ({q}, v). For their cost, we consider both explorations ←q and →q to obtain the minimal cost of reaching the state.

After computing the costs for subgraphs of size k, we perform expansion and combination to derive subgraphs of size ≥ k + 1. Specifically, assuming we are expanding (E, c) through edge q = c ⇝ v, we reach two states:

    (E ∪ {q}, v) and (E ∪ {q}, c)    (4)

Let C denote the cost of the state before expansion, and C′ the cost of the state after expansion. We have:

    C′ = min{C′, C + cost(→q)}    (5)

Note that: i) we may reach the expanded state in different ways, and we record the minimal cost of reaching the state; ii) C is the cost of a state of size ≤ k, which was determined in previous iterations; iii) if q is in the other direction, i.e., q = v ⇝ c, then cost(→q) above becomes cost(←q).

For combining two states (E1, c1) and (E2, c2), where E1 ∩ E2 = ∅, through edge q = c1 ⇝ c2, we reach two states:

    (E1 ∪ E2 ∪ {q}, c1) and (E1 ∪ E2 ∪ {q}, c2)    (6)

Let C1 and C2 denote the costs of the two states before combination. We update the cost of the combined state to be:

    C′ = min{C′, C1 + C2 + cost(→q)}    (7)

We now show the complexity of the DP:

Theorem 1. For a query graph G(V, E), the DP has time complexity O(n · |V| · |E|), where n is the number of connected subgraphs in G.

Here is a brief sketch-proof: there are n · |V| states in the DP process (each of the n connected subgraphs can be paired with at most |V| nodes), and each update takes at most O(|E|) time.

Theorem 2. Any acyclic query Q with query graph G is guaranteed to have an exploration plan.

We give a brief sketch-proof. Our optimizer resembles the idea of semi-joins, although we do not perform joins. Bernstein et al. proved [10] that for any relation in an acyclic query, there exists a semi-join program that can fully reduce the relation by evaluating each join condition only once. By mapping each node in G to a relation, and each edge in G to a join condition, we can see that our algorithm can find an exploration plan that evaluates each pattern exactly once.

Discussion. There are two cases we have not considered formally: i) G is cyclic, and ii) G contains a join on predicates. For the first case, our algorithm may not be able to find an exploration plan. However, we can break a cycle in G by duplicating some variable in the cycle. For example, one heuristic for picking the break point is to break a cycle at node u if it has the smallest cost when we explore u's adjacent edges uv1 and uv2 from u; in the case of many cycles, we apply this process repeatedly. The resulting query graph G′ is acyclic, and we can apply our algorithm to search for an approximate plan. For the second case, consider a join on predicates (?s ?p ?u), (?x ?p ?y). Here, we cannot explore the first pattern from bound variables ?s or ?u, because they are not connected with the second pattern. To handle this case, after we explore an edge with a variable predicate, we iterate through all unvisited patterns sharing the same predicate variable ?p, i.e., (?x ?p ?y), and use LoadNodes to create an initial binding for ?x and ?y. This enables us to continue the exploration.
mization. We use (E, c) to denote a state in DP. We start and use LoadN odes to create an initial binding for ?x and
with subgraphs of size 1, that is, subgraphs of a single edge ?y. This enable us to contine the exploration.
q = u ; v. The states are ({q}, u) and ({q}, v). For their 5.6 Cost Estimation
cost, we consider both explorations ← −
q and −
→q to obtain the
minimal cost of reaching the state. SPARQL selectivity estimation is a challenging task. Stocker
After computing cost for subgraphs of size k, we perform et al [32] assumes subject, predicate and object are indepen-
expansion and combination to derive subgraphs of size ≥ dent and the selectivity of each triple is the product of the
k+1. Specifically, assuming we are expanding (E, c) through three. The result is far from optimal. RDF-3X [25] uses
edge q = c ; v, we reach two states: two approaches: One assumes independence between triples
and relies on traditional join estimation techniques. The
(E ∪ {q}, v) and (E ∪ {q}, c) (4) other mines frequent join pathes for large joins and main-
Let C denote the cost of the state before expansion, and C 0 tains statistics for these pathes, which is very costly and
the cost of the state after expansion. We have: unfeasible for web-scale RDF data.
We propose a novel estimation method that captures the
C 0 = min{C 0 , C + cost(−

q )} (5) correlation between triples but requires little extra statistics
Note that: i) We may reach the expanded state in different and data preprocessing. Specifically, we estimate cost(e)
ways, and we record the minimal cost of reaching the state; where e = − →q or ←

q . In the following, we estimate cost(− →
q)


only, and the estimation of cost( q ) can be obtained in the
ii) C is the cost of state of size ≤ k, which is determined
in previous iterations; iii) If q is in the other direction, i.e., same way. Also, we use src and tgt to denote the source and
q = v ; c, then cost(−→q ) above becomes cost(← −
q ). target nodes in e. The computation cost of matching q is
For combining two states (E1 , c1 ) and (E2 , c2 ) where E1 ∩ estimated as the size of the results, namely |R(q)|. Since we
E2 = ∅ through edge q = c1 ; c2 , we reach two states: operate in a distributed environment, we model communica-
tion cost as well. During exploration, we send bindings and
(E1 ∪ E2 ∪ q, c1 ) and (E1 ∪ E2 ∪ q, c2 ) (6) ids of adjacency lists across network, so we measure com-
munication cost as the binding size of the source node of
Let C1 and C2 denote the cost of the two states before com-
the exploration, i.e. |B(src)|. The final cost(− →
q ) is a linear
bination. We update the cost of the combined state to be:
combination of |R(q)| and |B(src)|.
C 0 = min{C 0 , C + C + cost(−
1 2
→q )} (7) Now, if we know |B(src)|, we can estimate |R(q)| and
|B(tgt)| as
We now show the complexity of the DP:
Cp Cp (tgt)
Theorem 1. For a query graph G(V, E), the DP has time |R(q)| = |B(src)| , |B(tgt)| = |B(src)|
complexity O(n·|V |·|E|) where n is the number of connected Cp (src) Cp (src)
subgraphs in G. where Cq , Cq (src), Cq (tgt) are the number of triples and con-
Here is a brief sketch-proof: There are n · |V | states in the nected subject/object with predicate p, which can be ob-
DP process (each subgraph E can have |E| ≤ |V | nodes), tained from a single global predicate index look-up. If the
and each update can take at most O(|E|) time. predicate of q is unknown, we consider the average case for
all possible predicates. For the case where the source or tar-
Theorem 2. Any acyclic query Q with query graph G is get of q is constant, we use the local predicate index to get
guaranteed to have an exploration plan. a more accurate estimation.
We then derive |B(src)|. For a standalone − → Dataset #Triples #S/O
q , we can BTC-10 3,171,793,030 279,082,615
derive |B(src)| from the global predicate index. When − →
q is DBPSB
LUBM-40
15,373,833
5,309,056
5,514,599
1,309,072
not standalone, the binding size of src is affected by related LUBM-160 21,347,999 5,259,588
LUBM-640 85,420,588 21,037,012
patterns already explored. To capture this correlation, we LUBM-2560 341,888,947 84,202,729
maintain a two-dimensional predicate × predicate matrix2 . LUBM-10240
LUBM-100000
1,367,122,031
9,956,527,583
336,711,191
2,452,700,932
Each cell (i, j) stores four statistics: the number of unique
nodes with predicates pi , pj as its incoming/outgoing edges Table 6: Statistics of datasets used in experiments
(4 combinations). When no confusion shall arise, we simply BTC-10 S1 S2 S3 S4 S5 S6 S7
# of joins 7 5 9 12 6 9 7
use Cpi pj to denote the correlation. DBPSB D1 D2 D3 D4 D5 D6 D7 D8
As shown in Section 5.5, the query optimizer handles two # of joins 1 1 2 3 3 4 4 5
LUBM L1 L2 L3 L4 L5 L6 L7
cases: expansion and combination. In the first case, assume # of joins 6 1 6 4 1 3 6

we expand through a new edge p2 from variable x which is


already connected with p1 . Assume the original binding size Table 7: Statistics of queries used in experiments
of x is Nx . We have the new binding size Nx0 as and Table 7. All of the queries used in our experiments can
Cp1 p2 be found online3 .
Nx0 = Nx (8)
Cp1 Join vs. Exploration We compare graph exploration (Trin-
ity.RDF) with scan-join (RDF-3X and BitMat) on DBPSB
The second case is combining two edges p1 and p2 on x.
and LUBM-160 datasets. The experiment results show that
Assume the original binding sizes of x with predicate p1 and
Trinity.RDF outperforms RDF-3X and BitMat; and more
predicate p2 are Nx,1 and Nx,2 respectively. We have the
importantly, its superiority does not just come from its in-
new binding size Nx0 as
memory architecture, but from the fact that graph explo-
Cp1 p2 ration itself is more efficient than join.
Nx0 = Nx,1 Nx,2 (9)
Cp1 Cp2 For a fair comparison, we set up Trinity.RDF on a single
machine, so we have the same computation infrastructure
For more complex cases in expansion and combination for all three systems. Specifically, to compare the in-memory
during exploration, e.g. expanding a new pattern from a performance, we set up a 20 GB tmpfs (an in-memory file
subgraph, or joining two subgraphs, we simply pick the most system supported by Linux kernel from version 2.4), and
selective pair from all pairs of involved predicates. deploy the database images of RDF-3X and BitMat in the
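Formulas (8) and (9) worked through on toy statistics (all numbers below are hypothetical, as is the naming of the two count tables):

# C1[p] counts nodes carrying predicate p; C2[(p1, p2)] is one cell of the
# predicate x predicate matrix (nodes carrying both p1 and p2, for one of
# the four direction combinations).
C1 = {"wins": 1000, "directs": 5000}
C2 = {("wins", "directs"): 200}

def expand_estimate(n_x, p1, p2):
    """Formula (8): expanding a new edge p2 from x already bound via p1."""
    return n_x * C2[(p1, p2)] / C1[p1]

def combine_estimate(n_x1, n_x2, p1, p2):
    """Formula (9): combining two edges p1 and p2 on x."""
    return n_x1 * n_x2 * C2[(p1, p2)] / (C1[p1] * C1[p2])

print(expand_estimate(100, "wins", "directs"))         # 20.0
print(combine_estimate(100, 500, "wins", "directs"))   # 2.0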
6 Evaluation

We evaluate Trinity.RDF on both real-life and synthetic datasets, and compare it against state-of-the-art centralized and distributed RDF systems. The results show that Trinity.RDF is a highly scalable, highly parallel RDF engine.

Systems We implement Trinity.RDF in C#, and deploy it on a cluster, wherein each machine has 96 GB DDR3 RAM, two 2.67 GHz Intel Xeon E5650 CPUs (each with 6 cores and 12 threads), and one 40 Gb/s InfiniBand network adaptor. The OS is 64-bit Windows Server 2008 R2 Enterprise with Service Pack 1.

We compare Trinity.RDF with the centralized RDF-3X [27] and BitMat [8], as well as the distributed MapReduce-RDF-3X (a Hadoop-based RDF-3X solution [20]). We deploy the three systems on machines running 64-bit Linux 2.6.32 with the same hardware configuration as used by Trinity.RDF. Just like Trinity.RDF, all of the competitor systems map literals to IDs in query processing, but BitMat relies on manual mapping. For a fair comparison, we measure query execution time excluding the cost of literal/ID mapping. Since these three systems are disk-based, we report both their warm-cache and cold-cache times.

Datasets We use two real-life datasets and one synthetic dataset. The real-life datasets are the Billion Triple Challenge 2010 dataset (BTC-10) [1] and DBpedia's SPARQL Benchmark (DBPSB) [2]. The synthetic dataset is the Lehigh University Benchmark (LUBM) [16]. We generated 6 datasets of different sizes using the LUBM data generator v1.7. We summarize the statistics of the data and some exemplary queries (the LUBM queries are also published in [8]) in Table 6 and Table 7. All of the queries used in our experiments can be found online³.

Dataset        #Triples         #S/O
BTC-10         3,171,793,030    279,082,615
DBPSB          15,373,833       5,514,599
LUBM-40        5,309,056        1,309,072
LUBM-160       21,347,999       5,259,588
LUBM-640       85,420,588       21,037,012
LUBM-2560      341,888,947      84,202,729
LUBM-10240     1,367,122,031    336,711,191
LUBM-100000    9,956,527,583    2,452,700,932

Table 6: Statistics of datasets used in experiments

BTC-10       S1  S2  S3  S4  S5  S6  S7
# of joins   7   5   9   12  6   9   7

DBPSB        D1  D2  D3  D4  D5  D6  D7  D8
# of joins   1   1   2   3   3   4   4   5

LUBM         L1  L2  L3  L4  L5  L6  L7
# of joins   6   1   6   4   1   3   6

Table 7: Statistics of queries used in experiments

³ http://research.microsoft.com/trinity/Trinity.RDF.aspx

Join vs. Exploration We compare graph exploration (Trinity.RDF) with scan-join (RDF-3X and BitMat) on the DBPSB and LUBM-160 datasets. The experimental results show that Trinity.RDF outperforms RDF-3X and BitMat; more importantly, its superiority does not just come from its in-memory architecture, but from the fact that graph exploration itself is more efficient than join.

For a fair comparison, we set up Trinity.RDF on a single machine, so we have the same computation infrastructure for all three systems. Specifically, to compare in-memory performance, we set up a 20 GB tmpfs (an in-memory file system supported by the Linux kernel from version 2.4), and deploy the database images of RDF-3X and BitMat in the in-memory file system.

The first observation is that managing RDF data in graph form is space-efficient. The database images of LUBM-160 and DBPSB in Trinity.RDF are 1.6 GB and 1.9 GB respectively, which is smaller than or comparable to RDF-3X (2 GB and 1.4 GB respectively), and much more efficient than BitMat (3.6 GB and 19 GB respectively, even without literal/ID mapping).

The results on LUBM-160 and DBPSB are shown in Tables 8 and 9. For RDF-3X and BitMat, both in-memory and on-disk (cold-cache) performances are reported. Trinity.RDF outperforms the on-disk performance of RDF-3X and BitMat by a large margin for all queries: for most queries, Trinity.RDF has 1 to 2 orders of magnitude performance gain; for some queries, it has 3 orders of magnitude speed-up. The results of the in-memory performance comparison are more interesting. Here, since all systems are memory-based, the comparison is solely about graph exploration versus scan-join. We can see that the improvement is easily 2-5 fold, and for L4, Trinity.RDF has 3 orders of magnitude speed-up. This also shows that, although SIP and semi-joins were proposed to overcome the shortcomings of the scan-join approach, they are not always effective, as shown by L1, L2, L4, D1, D7, etc. Moreover, we vary the complexity of the DBPSB queries from 1 join to 5 joins, and Trinity.RDF achieves a very stable performance gain. This proves that our query algorithm can effectively find the optimal exploration order even for complex queries with many patterns.

We also observe that in-memory RDF-3X or BitMat runs slightly better than Trinity.RDF on L2, L3 and D2. This is because L2 and D2 have very simple structures and few intermediate results, and Trinity has some overhead due to its C# implementation.
L1 L2 L3 L4 L5 L6 L7 Geo. mean
Trinity.RDF 281 132 110 5 4 9 630 46
RDF-3X (In Memory) 34179 88 485 7 5 18 1310 143
BitMat (In Memory) 1224 4176 49 6381 6 51 2168 376
RDF-3X (Cold Cache) 35739 653 1196 735 367 340 2089 1271
BitMat (Cold Cache) 1584 4526 286 6924 57 194 2334 866
Table 8: Query run-time in milliseconds on the LUBM-160 dataset (21 million triples)

D1 D2 D3 D4 D5 D6 D7 D8 Geo. mean
Trinity.RDF 7 220 5 7 8 21 13 28 15
RDF-3X (In Memory) 15 79 14 18 22 34 68 35 29
BitMat (In Memory) 335 1375 209 113 431 619 617 593 425
RDF-3X (Cold Cache) 522 493 394 498 366 524 458 658 482
BitMat (Cold Cache) 392 1605 326 279 770 890 813 872 639
Table 9: Query run-time in milliseconds on the DBPSB dataset (15 million triples)

Performance on Large Datasets We experiment on three a query into multiple subqueries and each subquery pro-
datasets, LUBM-10240, LUBM-100000 and BTC-10, to study duces a much larger result set. This result again proves
the performance of Trinity.RDF on billion scale datasets, the performance impact of exploiting the correlations be-
and compare it against both centralized and distributed tween patterns in a query, which is the key idea behind
RDF systems. The results are shown in Table 11, 12 and 13. graph exploration.
As distributed systems, Trinity.RDF and MapReduce-RDF-
3X are deployed on a 5-server cluster for LUBM-10240, a
8-server cluster for LUBM-100000 and a 5-server cluster for
BTC-10. And we implement the directed 2-hop guarantee
partition for MapReduce-RDF-3X.
BitMat fails to run on BTC-10, as it generates terabytes of data for just a single SPO index. Similar issues arise on LUBM-100000. For some datasets and queries, BitMat and RDF-3X fail to return answers in a reasonable time (denoted as "aborted" in our experiment results).

On LUBM-10240 and LUBM-100000, Trinity.RDF achieves performance gains over RDF-3X and BitMat similar to those on LUBM-160. Even compared with MapReduce-RDF-3X, Trinity.RDF gives surprisingly competitive performance, and for some queries, e.g. L4-L6, Trinity.RDF is even faster. These results become more remarkable if we note that all the LUBM queries have simple structures, and MapReduce-RDF-3X specially partitions the data so that these queries can be answered fully in parallel with zero network communication. In comparison, Trinity.RDF randomly partitions the data, and thus incurs network overhead. However, data partitioning is orthogonal to our algorithm and can easily be applied to reduce the network overhead. This is also evidenced by the results of L4-L6: these queries explore only a small set of triples (as shown in Table 14) and incur little network overhead. Thus, Trinity.RDF outperforms even MapReduce-RDF-3X. Moreover, MapReduce-RDF-3X's partition algorithm incurs a large space overhead: as shown in Table 10, MapReduce-RDF-3X indexes twice as many triples as RDF-3X and Trinity.RDF do.

            LUBM-10240      LUBM-100000       BTC-10
#triple     2,459,450,365   20,318,973,699    6,322,986,673
Overhead    1.80X           2.04X             1.99X
Table 10: The space overhead of MapReduce-RDF-3X compared with the original datasets

The BTC-10 benchmark has more complex queries, some with up to 13 patterns. In particular, S3, S4, S6 and S7 are not parallelizable without communication in MapReduce-RDF-3X, and additional MapReduce jobs are invoked to answer these queries. In Table 13, we list the time of the RDF-3X jobs and the MapReduce jobs separately for MapReduce-RDF-3X. Interestingly, Trinity.RDF shows up to 2 orders of magnitude speed-up even over the RDF-3X jobs of MapReduce-RDF-3X. This is probably because MapReduce-RDF-3X divides a query into multiple subqueries, and each subquery produces a much larger result set. This result again proves the performance impact of exploiting the correlations between patterns in a query, which is the key idea behind graph exploration.
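To make the contrast with join-based processing concrete, here is a toy, self-contained sketch of pattern matching by graph exploration (an illustration of the idea only, not Trinity.RDF's distributed engine): matching starts from a selective binding and expands along edges, so each step touches only bindings that already satisfy the earlier patterns, instead of materializing each pattern's full binding set and joining.

```python
from collections import defaultdict

adj = defaultdict(list)  # toy adjacency index keyed by (subject, predicate)
for s, p, o in [("alice", "advisor", "bob"), ("bob", "worksFor", "cs"),
                ("carol", "advisor", "dan"), ("dan", "worksFor", "ee")]:
    adj[(s, p)].append(o)

def explore(start_bindings, predicates):
    """Match a chain query ?x -p1-> ?y -p2-> ?z by exploration:
    expand only from rows that matched the earlier patterns."""
    rows = [(v,) for v in start_bindings]
    for p in predicates:
        rows = [row + (o,) for row in rows for o in adj[(row[-1], p)]]
    return rows

# ?s advisor ?a . ?a worksFor ?d, starting from the selective binding "alice":
print(explore(["alice"], ["advisor", "worksFor"]))  # [('alice', 'bob', 'cs')]
```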
[Figure 8: Data scalability; panels (a) LUBM group (I) and (b) LUBM group (II)]

[Figure 9: Machine scalability; panels (a) LUBM group (I) and (b) LUBM group (II)]

            L1     L2        L3   L4   L5   L6    L7
LUBM-160    397    173040    0    10   10   125   7125
LUBM-10240  2502   11016920  0    10   10   125   450721
Table 14: The result sizes of LUBM queries

Scalability To evaluate the scalability of our system, we carry out two experiments: (1) scaling the data while fixing the number of servers, and (2) scaling the number of servers while fixing the data. We group the LUBM queries into two categories according to the sizes of their results, as shown in Table 14: (I) L1, L2, L3 and L7, whose results grow as the size of the dataset increases (note that although L3 produces an empty result set, it behaves like the queries in group (I), since its intermediate result set grows with the input dataset); and (II) L4, L5 and L6, which are very selective and produce results of constant size as the dataset grows.

Varying size of data: We test Trinity.RDF on a 3-server cluster on 5 datasets of increasing size, LUBM-40 to LUBM-10240. The results are shown in Figure 8 (a) and (b).
L1 L2 L3 L4 L5 L6 L7 Geo. mean
Trinity.RDF 12648 6018 8735 5 4 9 31214 450
RDF-3X (Warm Cache) 36m47s 14194 27245 8 8 65 69560 2197
BitMat (Warm Cache) 33097 209146 2538 aborted 407 1057 aborted 5966
RDF-3X (Cold Cache) 39m2s 18158 34241 1177 1017 993 98846 15003
BitMat (Cold Cache) 39716 225640 9114 aborted 494 2151 aborted 9721
MapReduce-RDF-3X (Warm Cache) 17188 3164 16932 14 10 720 8868 973
MapReduce-RDF-3X (Cold Cache) 32511 7371 19328 675 770 1834 19968 5087
Table 11: Query run-times in milliseconds for the LUBM-10240 dataset (1.36 billion triples)
L1 L2 L3 L4 L5 L6 L7 Geo. mean
Trinity.RDF 176 21 119 0.005 0.006 0.010 126 1.494
RDF-3X (Warm Cache) aborted 96 363 0.011 0.006 0.021 548 1.726
RDF-3X (Cold Cache) aborted 186 1005 874 578 981 700 633.842
MapReduce-RDF-3X (Warm Cache) 102 19 113 0.022 0.016 0.226 51.98 2.645
MapReduce-RDF-3X (Cold Cache) 171 32 151 1.113 0.749 1.428 89 13.633
Table 12: Query run-times in seconds for the LUBM-100000 dataset (9.96 billion triples)
Trinity.RDF utilizes selective patterns to do efficient pruning. Therefore, Trinity.RDF produces intermediate results of constant size and delivers stable performance for group (II) regardless of the increasing data size. For group (I), Trinity.RDF scales linearly as the size of the dataset increases, which shows that the network overhead is alleviated by the efficient pruning of intermediate results in graph exploration.

Varying number of machines: We deploy Trinity.RDF in clusters with a varying number of machines, and test its performance on the LUBM-10240 dataset. The results are shown in Figure 9 (a) and (b). For group (I), the query time of Trinity.RDF decreases reciprocally w.r.t. the number of machines, which testifies that Trinity.RDF can efficiently utilize the parallelism of a distributed system. Moreover, although more partitions increase the amount of intermediate data delivered across the network, our storage model effectively bounds this overhead. For group (II), due to selective query patterns, the intermediate results are relatively small. Using more machines does not improve the performance, but again the performance is very stable and is not impacted by the extra network overhead.
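In idealized form (a simplification consistent with Figure 9, not a fitted model), with $T(m)$ the run time of a query on $m$ machines:

```latex
T_{\mathrm{(I)}}(m) \approx \frac{T_{\mathrm{(I)}}(1)}{m},
\qquad
T_{\mathrm{(II)}}(m) \approx c
```

where the constant $c$ for group (II) is dominated by fixed per-query overhead rather than by data volume or machine count.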
7 Related Work

Tremendous efforts have been devoted to building high performance RDF management systems [12, 36, 14, 5, 35, 27, 26, 8, 7, 17]. State-of-the-art approaches can be classified into two categories:

Relational Solutions Most existing RDF systems use a relational model to manage RDF data, i.e., they store RDF triples in relational tables and use RDBMS indexing to tune query processing, aiming solely at answering SPARQL queries. SW-Store [5] exploits the fact that RDF data has a small number of predicates: it vertically partitions RDF data (by predicates) into a set of property tables, maps them onto a column-oriented database, and builds a subject-object index on each property table. Hexastore [35] and RDF-3X [27] manage all triples in a giant triple table, and build indices of all six combinations (SPO, SOP, etc.). The relational model dictates that SPARQL queries are processed as large join queries, and most prior systems rely on SQL join optimization techniques for query processing. RDF-3X [27], which is considered the fastest existing system, proposed sophisticated bushy-join planning and fast merge joins for query answering. However, this approach requires scanning a large fraction of the indexes even for very selective queries. Such redundancy overhead quickly becomes a bottleneck for billion-triple datasets and/or complex queries. Several join optimization techniques have been proposed. SIP (sideways information passing) is a dynamic optimization technique for pipelined execution plans [26]. It introduces filters on subject, predicate, or object identifiers, and passes these filters to other joins and scans in different parts of the operator tree that need to process similar identifiers. This introduces opportunities to avoid some unnecessary index scans. BitMat [8] uses a matrix of bitmaps to compress the indexes, and uses lightweight semi-join operations on compressed data to reduce the intermediate results before actually joining. However, these optimizations do not solve the fundamental problem of the join approach. In comparison, our exploration-based approach is radically different from the join approach.
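As a toy illustration of the SIP idea just described (a simplification of the technique in [26], not RDF-3X's actual implementation), a filter built from one scan's identifiers is passed sideways so that a later scan skips rows that cannot possibly join:

```python
triples = [("s1", "type", "Student"), ("s2", "type", "Student"),
           ("s1", "takes", "c1"), ("s3", "takes", "c2")]

def scan(pred, sip_filter=None):
    """Index scan over one predicate; sip_filter prunes subjects early."""
    for s, p, o in triples:
        if p == pred and (sip_filter is None or s in sip_filter):
            yield (s, o)

subjects = {s for s, _ in scan("type")}             # first scan builds a filter
joined = list(scan("takes", sip_filter=subjects))   # second scan skips "s3"
print(joined)  # [('s1', 'c1')]
```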
Graph-based Solutions Another direction of research investigated the possibility of storing RDF data as graphs [18, 7, 11]. Many argued that graph primitives besides pattern matching (SPARQL queries) should be incorporated into RDF languages, and several graph models for advanced applications on RDF data have been proposed [18, 7]. There are several non-distributed implementations, including one that builds an in-memory graph model for RDF data using Jena, and another that stores RDF as a graph in an object-oriented database [11]. However, both of them are single-machine solutions with limited scalability. A related research area is subgraph matching [13, 40, 19, 39], but most solutions rely on complex indexing techniques that are often very costly, and do not have the scalability to process web scale RDF graphs.

Recently, several distributed RDF systems [17, 15, 29, 20, 21] have been proposed. YARS2 [17], Virtuoso [15] and SHARD [29] hash-partition triples across multiple machines and parallelize the query processing. Their solutions are limited to simple index loop queries and do not support advanced SPARQL queries, because of the need to ship data around. Huang et al. [20] deploy single-node RDF systems on multiple machines, and use the MapReduce framework to synchronize query execution. Their system partitions and aggressively replicates the data in order to reduce network communication. However, for complex SPARQL queries, it has high time and space overhead, because it needs additional MapReduce jobs and data replication. Furthermore, Husain et al. [21] developed a batch system relying solely on MapReduce for SPARQL queries; it does not provide real-time query support. Yang et al. [38] recently proposed a graph partition management strategy for fast graph query processing, and demonstrated their system on answering SPARQL queries. However, their work focuses on partition optimization, not on developing scalable graph query engines. Further, their partitioning strategy is orthogonal to our solution, and Trinity.RDF can apply their algorithm to data partitioning to achieve better performance.
S1 S2 S3 S4 S5 S6 S7 Geo. mean
Trinity.RDF 12 10 31 21 23 33 27 21
RDF-3X (Warm Cache) 108 8407 27428 62846 32 260 238 1175
RDF-3X (Cold Cache) 5265 23881 41819 91140 1041 3065 1497 8101
MapReduce-RDF-3X (Warm Cache w/o MapReduce) 132 8 4833 6059 24 1931 2732 453
MapReduce-RDF-3X (Cold Cache w/o MapReduce) 2617 661 13755 18712 801 4347 7950 3841
MapReduce-RDF-3X (MapReduce) N/A N/A 39928 39782 N/A 33699 33703 36649
Table 13: Query run-times in milliseconds for the BTC-10 dataset (3.17 billion triples)
8 Conclusion

We propose a scalable solution for managing RDF data as graphs in a distributed in-memory key-value store. Our query processing and optimization techniques support SPARQL queries without relying on join operations, and we report performance numbers for querying RDF datasets of billions of triples. Besides scalability, our approach also has the potential to support queries and analytical tasks that are far more advanced than SPARQL queries, as RDF data is stored as graphs. In addition, our solution only utilizes basic (distributed) key-value store functions and thus can be ported to any in-memory key-value store.
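To illustrate this portability claim, the following is a hypothetical sketch of the minimal key-value surface such an engine needs (our abstraction for illustration, not Trinity's actual API):

```python
from abc import ABC, abstractmethod

class KVStore(ABC):
    """Minimal (distributed) key-value surface assumed by the design:
    vertex ids map to serialized adjacency lists, and exploration
    steps are routed to the machine that owns each key."""

    @abstractmethod
    def get(self, key: bytes) -> bytes: ...

    @abstractmethod
    def put(self, key: bytes, value: bytes) -> None: ...

    @abstractmethod
    def owner(self, key: bytes) -> int:
        """Which machine holds this key (for routing exploration steps)."""
```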
9 References

[1] Billion Triple Challenge. http://challenge.semanticweb.org/.
[2] DBpedia SPARQL Benchmark (DBPSB). http://aksw.org/Projects/DBPSB.
[3] Jena. http://jena.sourceforge.net.
[4] Trinity. http://research.microsoft.com/en-us/projects/trinity/.
[5] D. J. Abadi, A. Marcus, S. Madden, and K. Hollenbach. SW-Store: a vertically partitioned DBMS for Semantic Web data management. VLDB J., 18(2):385–406, 2009.
[6] S. Alexaki, V. Christophides, G. Karvounarakis, D. Plexousakis, and K. Tolle. The ICS-FORTH RDFSuite: Managing Voluminous RDF Description Bases. In SemWeb, 2001.
[7] R. Angles and C. Gutiérrez. Querying RDF data from a graph database perspective. In ESWC, pages 346–360, 2005.
[8] M. Atre, V. Chaoji, M. J. Zaki, and J. A. Hendler. Matrix "Bit" loaded: a scalable lightweight join query processor for RDF data. In WWW, pages 41–50, 2010.
[9] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. G. Ives. DBpedia: A Nucleus for a Web of Open Data. In ISWC/ASWC, pages 722–735, 2007.
[10] P. A. Bernstein and D.-M. W. Chiu. Using Semi-Joins to Solve Relational Queries. J. ACM, 28(1):25–40, 1981.
[11] V. Bönström, A. Hinze, and H. Schweppe. Storing RDF as a graph. In LA-WEB, pages 27–36, 2003.
[12] J. Broekstra, A. Kampman, and F. van Harmelen. Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. In ISWC, 2002.
[13] J. Cheng, J. X. Yu, B. Ding, P. S. Yu, and H. Wang. Fast graph pattern matching. In ICDE, pages 913–922, 2008.
[14] E. I. Chong, S. Das, G. Eadon, and J. Srinivasan. An Efficient SQL-based RDF Querying Scheme. In VLDB, 2005.
[15] O. Erling and I. Mikhailov. Virtuoso: RDF Support in a Native RDBMS. In Semantic Web Information Management, pages 501–519, 2009.
[16] Y. Guo, Z. Pan, and J. Heflin. LUBM: A benchmark for OWL knowledge base systems. Journal of Web Semantics, 3(2-3):158–182, 2005.
[17] A. Harth, J. Umbrich, A. Hogan, and S. Decker. YARS2: A federated repository for querying graph structured data from the web. In ISWC/ASWC, pages 211–224, 2007.
[18] J. Hayes and C. Gutierrez. Bipartite graphs as intermediate model for RDF. In ISWC, 2004.
[19] H. He and A. K. Singh. Graphs-at-a-time: query language and access methods for graph databases. In SIGMOD, 2008.
[20] J. Huang, D. J. Abadi, and K. Ren. Scalable SPARQL Querying of Large RDF Graphs. PVLDB, 4(11), 2011.
[21] M. F. Husain, J. P. McGlothlin, M. M. Masud, L. R. Khan, and B. M. Thuraisingham. Heuristics-Based Query Processing for Large RDF Graphs Using Cloud Computing. IEEE Trans. Knowl. Data Eng., 23(9):1312–1327, 2011.
[22] J. Lu, Y. Yu, K. Tu, C. Lin, and L. Zhang. An Approach to RDF(S) Query, Manipulation and Inference on Databases. In WAIM, pages 172–183, 2005.
[23] A. Lumsdaine, D. Gregor, B. Hendrickson, and J. W. Berry. Challenges in Parallel Graph Processing. Parallel Processing Letters, 17(1):5–20, 2007.
[24] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, 2010.
[25] T. Neumann and G. Weikum. RDF-3X: a RISC-style engine for RDF. PVLDB, 1(1), 2008.
[26] T. Neumann and G. Weikum. Scalable Join Processing on Very Large RDF Graphs. In SIGMOD, 2009.
[27] T. Neumann and G. Weikum. The RDF-3X engine for scalable management of RDF data. VLDB J., 19(1):91–113, 2010.
[28] M. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2):026113, 2004.
[29] K. Rohloff and R. E. Schantz. High-performance, massively scalable distributed systems using the MapReduce software framework: the SHARD triple-store. In PSI EtA, 2010.
[30] B. Shao, H. Wang, and Y. Li. The Trinity graph engine. Technical Report 161291, Microsoft Research, 2012.
[31] B. Shao, H. Wang, and Y. Xiao. Managing and mining large graphs: Systems and implementations. In SIGMOD, 2012.
[32] M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds. SPARQL basic graph pattern optimization using selectivity estimation. In WWW, 2008.
[33] Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li. Efficient subgraph matching on billion node graphs. PVLDB, 5(9):788–799, 2012.
[34] H. Wang, H. He, J. Yang, P. S. Yu, and J. X. Yu. Dual Labeling: Answering Graph Reachability Queries in Constant Time. In ICDE, page 75, 2006.
[35] C. Weiss, P. Karras, and A. Bernstein. Hexastore: sextuple indexing for semantic web data management. PVLDB, 1(1):1008–1019, 2008.
[36] K. Wilkinson, C. Sayers, H. A. Kuno, and D. Reynolds. Efficient RDF Storage and Retrieval in Jena2. In SWDB, pages 131–150, 2003.
[37] W. Wu, H. Li, H. Wang, and K. Zhu. Probase: A probabilistic taxonomy for text understanding. In SIGMOD, 2012.
[38] S. Yang, X. Yan, B. Zong, and A. Khan. Towards effective partition management for large graphs. In SIGMOD, pages 517–528, 2012.
[39] F. Zhu, Q. Qu, D. Lo, X. Yan, J. Han, and P. S. Yu. Mining top-k large structural patterns in a massive network. PVLDB, 4(11):807–818, 2011.
[40] L. Zou, L. Chen, and M. T. Özsu. DistanceJoin: Pattern match query in a large graph database. PVLDB, 2(1):886–897, 2009.