A Distributed Graph Engine For Web Scale RDF Data

Kai Zeng†∗   Jiacheng Yang♯∗   Haixun Wang‡   Bin Shao‡   Zhongyuan Wang‡,♭

†UCLA   ♯Columbia University   ‡Microsoft Research Asia   ♭Renmin University of China
[email protected]   [email protected]   {haixunw, binshao, zhy.wang}@microsoft.com
∗ This work was done at Microsoft Research Asia.

ABSTRACT

Much work has been devoted to supporting RDF data. But state-of-the-art systems and methods still cannot handle web scale RDF data effectively. Furthermore, many useful and general purpose graph-based operations (e.g., random walk, reachability, community discovery) on RDF data are not supported, as most existing systems store and index data in particular ways (e.g., as relational tables or as a bitmap matrix) to maximize one particular operation on RDF data: SPARQL query processing. In this paper, we introduce Trinity.RDF, a distributed, memory-based graph engine for web scale RDF data. Instead of managing the RDF data in triple stores or as bitmap matrices, we store RDF data in its native graph form. It achieves much better (sometimes orders of magnitude better) performance for SPARQL queries than the state-of-the-art approaches. Furthermore, since the data is stored in its native graph form, the system can support other operations (e.g., random walks, reachability) on RDF graphs as well. We conduct comprehensive experimental studies on real life, web scale RDF data to demonstrate the effectiveness of our approach.

1 Introduction

RDF data is becoming increasingly more available: The semantic web movement towards a web 3.0 world is proliferating a huge amount of RDF data. Commercial search engines including Google and Bing are pushing web sites to use RDFa to explicitly express the semantics of their web contents. Large public knowledge bases, such as DBpedia [9] and Probase [37], contain billions of facts in RDF format. Web content management systems, which model data in RDF, mushroom in various communities all around the world.

Challenges  RDF data management systems are facing two challenges: namely, systems' scalability and generality. The challenge of scalability is particularly urgent. Tremendous efforts have been devoted to building high performance RDF systems and SPARQL engines [6, 12, 3, 36, 14, 5, 35, 27]. Still, scalability remains the biggest hurdle. Essentially, RDF data is highly connected graph data, and SPARQL queries are like subgraph matching queries. But most approaches model RDF data as a set of triples, and use RDBMSs for storing, indexing, and query processing. These approaches do not scale, as processing a query often involves a large number of join operations that produce large intermediate results. Furthermore, many systems, including SW-Store [5], Hexastore [35], and RDF-3X [27], are single-machine systems. As the size of RDF data keeps soaring, it is not realistic for single-machine approaches to provide good performance. Recently, several distributed RDF systems, such as SHARD [29], YARS2 [17], Virtuoso [15], and the system of Huang et al. [20], have been introduced. However, they still model RDF data as a set of triples. The cost incurred by excessive join operations is further exacerbated by network communication overhead. Some distributed solutions try to overcome this limitation by brute-force replication of data [20]. However, this approach simply fails in the face of complex SPARQL queries (e.g., queries with a multi-hop chain), and has a considerable space overhead (usually exponential).

The second challenge lies in the generality of RDF systems. State-of-the-art systems are not able to support general purpose queries on RDF data. In fact, most of them are optimized for SPARQL only, but a wide range of meaningful queries and operations on RDF data cannot be expressed in SPARQL. Consider an RDF dataset that represents an entity/relationship graph. One basic query on such a graph is reachability, that is, checking whether a path exists between two given entities in the RDF data. Many other queries (e.g., community detection) on entity/relationship data rely on graph operations. For example, random walks on the graph can be used to calculate the similarity between two entities. All of the above queries and operations require some form of graph-based analytics [34, 28, 22, 33]. Unfortunately, none of these can be supported in current RDF systems, and one of the reasons is that they manage RDF data in some foreign forms (e.g., relational tables or bitmap matrices) instead of its native graph form.

Overview of Our Approach  We introduce Trinity.RDF, a distributed in-memory RDF system that is capable of handling web scale RDF data (billion or even trillion triples). Unlike existing systems that use relational tables (triple stores) or bitmap matrices to manage RDF, Trinity.RDF builds on top of a memory cloud, and models RDF data in its native graph form (i.e., representing entities as graph nodes, and relationships as graph edges).
We argue that such a memory-based architecture that logically and physically models RDF in native graphs opens up a new paradigm for RDF management. It not only leads to new optimization opportunities for SPARQL query processing, but also supports more advanced graph analytics on RDF data.

To see this, we must first understand that most graph operations do not have locality [23, 31], and rely exclusively on random accesses. As a result, storing RDF graphs in disk-based triple stores is not a feasible solution, since random accesses on hard disks are notoriously slow. Although sophisticated indices can be created to speed up query processing, they introduce excessive join operations, which become a major cost for SPARQL query processing.

Trinity.RDF models RDF data as an in-memory graph. Naturally, it supports fast random accesses on the RDF graph. But in order to process SPARQL queries efficiently, we still need to address the issues of how to reduce the number of join operations, and how to reduce the size of intermediary results. In this paper, we develop novel techniques that use efficient in-memory graph exploration instead of join operations for SPARQL processing. Specifically, we decompose a SPARQL query into a set of triple patterns, and conduct a sequence of graph explorations to generate bindings for each of the triple patterns. The exploration-based approach uses the binding information of the explored subgraphs to prune candidate matches in a greedy manner. In contrast, previous approaches isolate individual triple patterns, that is, they generate bindings for them separately, and make excessive use of costly join operations to combine those bindings, which inevitably results in large intermediate results. Our new query paradigm greatly reduces the amount of intermediate results, boosts the query performance in a distributed environment, and makes the system scale. We show in experiments that even without a smart graph partitioning scheme, Trinity.RDF achieves several orders of magnitude speed-up on web scale RDF data over state-of-the-art RDF systems.

We also note that since Trinity.RDF models data as a native graph, we enable a large range of advanced graph analytics on RDF data. For example, random walks, regular expression queries, reachability queries, distance oracles, and community searches can be performed on web scale RDF data directly. Even large scale vertex-based analytical tasks on graph platforms such as Pregel [24] can be easily supported in our system. However, these topics are out of the scope of this paper, and we refer interested readers to the Trinity system [30, 4] for detailed information.

Contributions  We summarize the novelty and advantages of our work as follows.

1. We introduce a novel graph-based scheme for managing RDF data. Trinity.RDF has the potential to support efficient graph-based queries, as well as advanced graph analytics, on RDF.

2. We leverage graph exploration for SPARQL processing. The new query paradigm greatly reduces the volume of intermediate results, which in turn boosts query performance and system scalability.

3. We introduce a new cost model, novel cardinality estimation techniques, and optimization algorithms for distributed query plan generation. These approaches ensure excellent performance on web scale RDF data.

Paper Layout  The rest of the paper is organized as follows. Section 2 describes the difference between join operations and graph exploration. Section 3 presents the architecture of the Trinity.RDF system. Section 4 describes how we model RDF data as native graphs. Section 5 describes SPARQL query processing techniques. Section 6 shows experimental results. Section 7 discusses related work, and we conclude in Section 8.

2 Join vs. Graph Exploration

Joins are the major operator in SPARQL query processing. Trinity.RDF outperforms existing systems by orders of magnitude because it replaces expensive join operations by efficient graph exploration. In this section, we discuss the performance implications of the two different approaches.

2.1 RDF and SPARQL

Before we discuss join operations vs. graph exploration, we first introduce RDF and SPARQL query processing on RDF data. An RDF data set consists of statements in the form of (subject, predicate, object). Each statement, also known as a triple, is about a fact, which can be interpreted as: subject has a predicate property whose value is object. For example, a movie knowledge base may contain the following triples about the movie "Titanic":

    (Titanic, has_award, Best_Picture)
    (Titanic, casts, L_DiCaprio)
    (J_Cameron, directs, Titanic)
    (J_Cameron, wins, Oscar_Award)
    ...

An RDF dataset can be considered as representing a directed graph, with entities (i.e., subjects and objects) as nodes, and relationships (i.e., predicates) as directed edges. SPARQL is the standard query language for retrieving data stored in RDF format. The core syntax of SPARQL is a conjunctive set of triple patterns called a basic graph pattern. A triple pattern is similar to an RDF triple except that any component in the triple pattern can be a variable. A basic graph pattern describes a subgraph which a user wants to match against the RDF data. Thus, SPARQL query processing is essentially a subgraph matching problem. For example, we can retrieve the cast of an award-winning movie directed by an award-winning director using the following query:

Example 1.

    SELECT ?movie, ?actor WHERE {
        ?director wins ?award .
        ?director directs ?movie .
        ?movie has_award ?movie_award .
        ?movie casts ?actor . }

SPARQL also contains other language constructs that support disjunctive queries and filtering.
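
To make the graph view of an RDF dataset concrete, the following minimal sketch (our illustration, not code from the paper) loads the movie triples above into an adjacency-list structure and writes down the basic graph pattern of Example 1 as a list of triple patterns; answering the query then amounts to matching this pattern, as a subgraph, against the data graph.

# A minimal sketch (ours): the movie triples viewed as a directed,
# edge-labeled graph, and Example 1's basic graph pattern as triple patterns.
triples = [
    ("Titanic",   "has_award", "Best_Picture"),
    ("Titanic",   "casts",     "L_DiCaprio"),
    ("J_Cameron", "directs",   "Titanic"),
    ("J_Cameron", "wins",      "Oscar_Award"),
]

# Adjacency-list view: every subject/object is a node, every triple a
# labeled, directed edge from subject to object.
out_edges = {}
for s, p, o in triples:
    out_edges.setdefault(s, []).append((p, o))

# The basic graph pattern of Example 1: a conjunctive set of triple patterns,
# with variables marked by a leading '?'.
pattern = [
    ("?director", "wins",      "?award"),
    ("?director", "directs",   "?movie"),
    ("?movie",    "has_award", "?movie_award"),
    ("?movie",    "casts",     "?actor"),
]

# Answering the query means finding subgraphs of out_edges whose edge labels
# and endpoints are consistent with all four patterns at once.
print(out_edges["J_Cameron"])  # [('directs', 'Titanic'), ('wins', 'Oscar_Award')]
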
2.2 Using Join Operations

Many state-of-the-art RDF systems store RDF data as a set of triples in relational tables, and therefore, they rely excessively on join operations for processing SPARQL queries. In general, query processing consists of two phases [25]: The first phase is known as the scan phase. It decomposes a SPARQL query into a set of triple patterns. For the query in Example 1, the triple patterns are ?director wins ?award, ?director directs ?movie, ?movie has_award ?movie_award, and ?movie casts ?actor. For each triple pattern, we scan tables or indices to generate bindings. Assuming we are processing the query against the RDF graph in Figure 1, the base tables that contain the bindings are shown in Table 1. The second phase is the join phase. The base tables are joined to produce the final answer to the query.

Figure 1: An example RDF graph

    ?director   ?award            ?movie    ?movie_award
    J Cameron   Oscar Award       Titanic   Best Picture
    G Lucas     Saturn Award      Crash     Best Picture

    ?director   ?movie            ?movie        ?actor
    P Haggis    Crash             Crash         D Cheadle
    J Cameron   Titanic           Titanic       L DiCaprio
    J Cameron   Avatar            Avatar        S Worthington
                                  Star War VI   M Hamill

Table 1: Base tables and bound variables.

Sophisticated techniques have been used to optimize the order of joins to improve query performance. Still, the approach has inherent limitations: (1) It uses many costly join operations. (2) The scan-join process produces large redundant intermediary results. From Table 1, we can see that most intermediary join results will be produced in vain. After all, only Titanic, directed by J Cameron, matches the query. Moreover, useless intermediary results may only be detected in later stages of the join process. For example, if we choose to join ?director directs ?movie and ?movie casts ?actor first, we will not know that the resulting rows related to Avatar and Crash are useless until joining with ?director wins ?award and ?movie has_award ?movie_award. Sideways Information Passing (SIP) [26] was proposed to alleviate this problem. SIP is a dynamic optimization technique for pipelined execution plans. It introduces filters on subject, predicate, or object identifiers, and passes these filters to joins and scans in other parts of the query that need to process similar identifiers.

2.3 Using Graph Explorations

In this paper, we adopt a new approach that greatly improves the performance of SPARQL query processing. The idea is to use graph exploration instead of joins.

The intuition can be illustrated by an example. Assume we perform the query in Example 1 over the RDF graph in Figure 1 starting with the pattern ?director wins ?award. After exploring the neighbors of ?award connected via the wins edge, we find that the possible bindings for ?director are J Cameron and G Lucas. Then, we explore the graph further from nodes J Cameron and G Lucas via edge directs, and we generate bindings for ?director directs ?movie. In the above exploration, we prune G Lucas because it does not have a directs edge. Also, we do not produce useless bindings such as the binding (P Haggis, Crash) shown in Table 1. Thus, we are able to prune unnecessary intermediate results efficiently.

The above intuition is only valid if graph exploration can be implemented more efficiently than join. This is not true for existing RDF systems. If the RDF graph is managed by relational tables, triple stores, or disk-based key-value stores, then we need to use join operations to implement graph exploration, which means graph exploration cannot be more efficient than join: With an index, it usually requires an O(log N) operation to access the triples relating to a subject/object¹. In our work, we use native graphs to model RDF data, which enables us to perform the same operation in O(1) time. With the support of the underlying architecture, we make graph exploration extremely efficient. In fact, Trinity.RDF can explore as many as 2.3 million nodes on a graph distributed over an 8-server cluster within one tenth of a second [30]. This lays the foundation for exploration-based SPARQL query processing.

¹ N is the total number of RDF triples.

We need to point out that the order of exploration is important. Starting with the highly selective pattern ?movie has_award ?movie_award, we can prune a lot of candidate bindings of other patterns. If we explore the graph in a different order, e.g., exploring ?movie casts ?actor followed by ?director directs ?movie, then we will still generate useless intermediate results. Thus, query plans need to be carefully optimized to pick the optimal exploration order, which is not trivial. We will discuss our algorithm for optimal graph exploration plan generation in Section 5.

Note that graph exploration (following the links) is to a certain extent similar to an index-nested-loops join. However, an index-nested-loops join is costly for RDBMSs or disk-based data, because it needs a random access for each index lookup. Hence, in previous approaches, scan-joins, which perform sequential reads on sorted data, are preferred. Our approach further extends the random access approach in a distributed environment and minimizes the size of intermediate results.
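
To make the contrast drawn in this section concrete, the following sketch (our own toy reenactment, not the paper's implementation) replays the scan-then-join strategy of Section 2.2 on the data behind Table 1: every triple pattern is matched in isolation, and the binding tables are joined afterwards, so useless rows such as (P Haggis, Crash) are materialized and only discarded by later joins, which is precisely what graph exploration avoids.

# A toy reenactment (ours) of the scan-then-join approach on the Table 1 data.
TRIPLES = [
    ("J_Cameron", "wins", "Oscar_Award"),      ("G_Lucas", "wins", "Saturn_Award"),
    ("J_Cameron", "directs", "Titanic"),       ("J_Cameron", "directs", "Avatar"),
    ("P_Haggis", "directs", "Crash"),
    ("Titanic", "has_award", "Best_Picture"),  ("Crash", "has_award", "Best_Picture"),
    ("Titanic", "casts", "L_DiCaprio"),        ("Avatar", "casts", "S_Worthington"),
    ("Crash", "casts", "D_Cheadle"),           ("Star_War_VI", "casts", "M_Hamill"),
]

def scan(pred):
    """Scan phase: bindings for one triple pattern (?s pred ?o), in isolation."""
    return [{"s": s, "o": o} for s, p, o in TRIPLES if p == pred]

def join(left, right, keys):
    """Join phase: nested-loop join of two binding tables on shared variables."""
    return [{**l, **r} for l in left for r in right
            if all(l[k] == r[k] for k in keys)]

# Join q2 and q4 first (?director directs ?movie  joined with  ?movie casts ?actor):
directs = [{"director": b["s"], "movie": b["o"]} for b in scan("directs")]
casts   = [{"movie": b["s"], "actor": b["o"]} for b in scan("casts")]
step1 = join(directs, casts, ["movie"])
print(len(step1))  # 3 intermediate rows; the Avatar and Crash rows are useless

# They are only discarded later, when joining with q1 and q3:
wins      = [{"director": b["s"], "award": b["o"]} for b in scan("wins")]
has_award = [{"movie": b["s"], "movie_award": b["o"]} for b in scan("has_award")]
final = join(join(step1, wins, ["director"]), has_award, ["movie"])
print(final)       # only the Titanic / J_Cameron / L_DiCaprio row survives
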
3 System Architecture

In this section, we give an overall description of the data model and the architecture of Trinity.RDF. We model and store RDF data as a directed graph. Each node in the graph represents a unique entity, which may appear as a subject and/or an object in an RDF statement. Each RDF statement corresponds to an edge in the graph. Edges are directed, pointing from subjects to objects. Furthermore, edges are labeled with the predicates. We will present the data structure for nodes and edges in more detail in Section 4.

To ensure fast random data access in graph exploration, we store RDF graphs in memory. A web scale RDF graph may contain billions of entities (nodes) and trillions of triples. It is unlikely that a web scale RDF graph can fit in the RAM of a single machine. Trinity.RDF is based on Trinity [30], which is a distributed in-memory key-value store. Trinity.RDF builds a graph interface on top of the key-value store. It randomly partitions an RDF graph across a cluster of commodity machines by hashing on the nodes. Thus, each machine holds a disjoint part of the graph. Given a SPARQL query, we perform search in parallel on each machine. During query processing, machines may need to exchange data, as a query pattern may span multiple partitions.

Figure 2 shows the high level architecture of Trinity.RDF. A user submits his query to a proxy. The proxy generates a query plan and delivers the plan to all the Trinity machines, which hold the RDF data. Then, each machine executes the query plan under the coordination of the proxy. When the bindings for all the variables are resolved, all Trinity machines send back the bindings (answers) to the proxy, where the final result is assembled and sent back to the user. As we can see, the proxy plays an important role in the architecture. Specifically, it performs the following tasks. First, it generates a query plan based on available statistics and indices. Second, it keeps track of the status of each Trinity machine in query processing by, for example, synchronizing the execution of each query step. However, each Trinity machine not only communicates with the proxy; the machines also communicate among themselves during query execution to exchange intermediary results. All communications are handled by a message passing mechanism built in Trinity.

Figure 2: Distributed query processing framework

Besides the proxy and the Trinity machines, we also employ a string indexing server. We replace all literals in RDF triples by their ids. The string indexing server implements a literal-to-id mapping that translates literals in a SPARQL query into ids, and an id-to-literal mapping that maps ids in the output back to literals for the user. The mapping can be either implemented by a separate Trinity in-memory key-value store for efficiency, or by a persistent key-value store if memory space is a concern. Usually the cost of the mapping is negligible compared to that of query processing.

4 Data Modeling

To support graph-based operations including SPARQL queries on RDF data more effectively, we store RDF data in its native graph form. In this section, we describe how we model and manipulate RDF data as distributed graphs.

4.1 Modeling Graphs

Trinity.RDF is based on Trinity, which is a key-value store in a memory cloud. We then create a graph model on top of the key-value store. Specifically, we represent each RDF entity as a graph node with a unique id, and store it as a key-value pair in the Trinity memory cloud:

    (node-id, ⟨in-adjacency-list, out-adjacency-list⟩)    (1)

The key-value pair consists of the node-id as the key, and the node's adjacency list as the value. The adjacency list is divided into two lists, one for neighbors with incoming edges and the other for neighbors with outgoing edges. Each element in the adjacency lists is a (predicate, node-id) pair, which records the id of the neighbor and the predicate on the edge.

Thus, we have created a graph model on top of the key-value store. Given any node, we can find the node-id of any of its neighbors, and the underlying Trinity memory cloud will retrieve the key-value pair for that node-id. This enables us to explore the graph from any given node by accessing its adjacency lists. Figure 3 shows an example of the data structure.

Figure 3: An example of model (1)

4.2 Graph Partitioning

We distribute an RDF graph across multiple machines, and this is achieved by the underlying memory cloud, which partitions the key-value pairs in a cluster. However, due to the characteristics of graphs, we need to look into how the graph is partitioned in order to ensure the best performance.

Two factors may have an impact on network overhead when we explore a graph. The first factor is how the graph is partitioned. In our system, sharding is supported by the underlying key-value store, and the default sharding mechanism is hashing on node-id. In other words, the graph is randomly partitioned. Certainly, sophisticated graph partitioning methods can be adopted for sharding. However, graph partitioning is beyond the scope of this paper.

The second factor is how we model graphs on top of the key-value store. The model given by (1) may have potential problems for real-life large graphs. Many real-life RDF graphs are scale-free graphs whose node degrees follow the power law distribution. In DBpedia [9], for example, over 90% of the nodes have fewer than 5 neighbors, while some top nodes have more than 100,000 neighbors. The model may incur a large amount of network traffic when we explore the graph from a top node x. For simplicity, let us assume none of x's neighbors resides on the same machine as x does. To visit x's neighbors, we need to send the node-ids of its neighbors to other machines. The total amount of information we need to send across the network is exactly the entire set of node-ids in x's adjacency list. For the DBpedia data, in the worst case, whenever we encounter a top node in graph exploration, we need to send 800KB of data (each node-id is 64 bits) across the network. This is a huge cost in graph exploration.

We take the power law distribution into consideration in modeling RDF data. Specifically, we model a node x by the following key-value pair:

    (node-id, ⟨in_1, ..., in_k, out_1, ..., out_k⟩)    (2)

where in_i and out_i are keys to some other key-value pairs:

    (in_i, in-adjacency-list_i)    (out_i, out-adjacency-list_i)    (3)

The essence of this model is the following: the key-value pair (in_i, in-adjacency-list_i) and the nodes in in-adjacency-list_i are stored on the same machine i. In other words, we partition the adjacency lists in model (1) by machine.

The benefits of this design are obvious. No matter how many neighbors x has, we will send no more than k nids (in_i and out_i) over the network, since each machine i, upon receiving nid_i, can retrieve x's neighbors that reside on machine i without incurring any network communication. However, for nodes with few neighbors, model (2) is more costly than model (1). In our work, we use a threshold t to decide which model to use. If a node has more than t neighbors, we use model (2) to map it to the key-value store; otherwise, we use model (1). Figure 4 gives an example with t = 1. Furthermore, in our design, all triples are stored decentralized at their subjects and objects. Thus, an update has little cost as it only affects a few nodes. However, update is out of the scope of this paper and we omit detailed discussion here.
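
The following toy simulation (ours; the cell-key scheme, helper names, and cluster size are assumptions rather than Trinity's actual layout) contrasts the two encodings: model (1) for nodes with at most t neighbors, and models (2)/(3) for high-degree nodes, whose adjacency lists are split into one partial list per machine so that exploring such a node ships at most k small keys instead of the full neighbor list.

# A sketch (ours) of the two key-value encodings of Section 4.
NUM_MACHINES = 3   # hypothetical cluster size
T = 1              # threshold t from the text

kv_store = {}      # simulated memory cloud: key -> value

def machine_of(node_id):
    return sum(node_id.encode()) % NUM_MACHINES   # stand-in for hash sharding

def store_node(node_id, in_adj, out_adj):
    """in_adj / out_adj are lists of (predicate, neighbor-id) pairs."""
    if len(in_adj) + len(out_adj) <= T:
        # Model (1): (node-id, <in-adjacency-list, out-adjacency-list>)
        kv_store[node_id] = {"in": in_adj, "out": out_adj}
        return
    # Model (2): (node-id, <in_1..in_k, out_1..out_k>).  The partial lists are
    # the model (3) cells; in the real system the cell for in_i/out_i is placed
    # on machine i together with the neighbors it lists.
    record = {"in": {}, "out": {}}
    for direction, adj in (("in", in_adj), ("out", out_adj)):
        per_machine = {}
        for pred, nbr in adj:
            per_machine.setdefault(machine_of(nbr), []).append((pred, nbr))
        for i, partial in per_machine.items():
            key = (node_id, direction, i)         # plays the role of in_i / out_i
            kv_store[key] = partial               # model (3) cell
            record[direction][i] = key
    kv_store[node_id] = record

store_node("J_Cameron", in_adj=[],
           out_adj=[("directs", "Titanic"), ("directs", "Avatar"),
                    ("wins", "Oscar_Award")])
store_node("Best_Picture", in_adj=[("has_award", "Titanic")], out_adj=[])

print(kv_store["J_Cameron"])     # model (2) record: only per-machine keys
print(kv_store["Best_Picture"])  # model (1) record: full adjacency lists inline
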
Figure 4: An example of model (2)

4.3 Indexing Predicates

Graph exploration relies on retrieving nodes connected by an edge of a given predicate. We use two additional indices for this purpose.

Local predicate indexing  We create a local predicate index for each node x. We sort all (predicate, node-id) pairs in x's adjacency lists first by predicate then by node-id. This corresponds to the SPO or OPS index in traditional RDF approaches. In addition, we also create an aggregate index to enable us to quickly decide whether a node has a given predicate and the number of its neighbors connected by the predicate.

Global predicate indexing  The global predicate index enables us to find all nodes that have incoming or outgoing neighbors labeled by a given predicate. This corresponds to the PSO or POS index in traditional approaches. Specifically, for each predicate, machine i stores a key-value pair

    (predicate, ⟨subject-list_i, object-list_i⟩)

where subject-list_i (object-list_i) consists of all unique subjects (objects) with that predicate on machine i.

4.4 Basic Graph Operators

We provide the following three graph operators with which we implement graph exploration:

1. LoadNodes(predicate, dir): Return nodes that have an incoming or outgoing edge labeled as predicate.

2. LoadNeighborsOnMachine(node, dir, i): For a given node, return its incoming or outgoing neighbors that reside on machine i.

3. SelectByPredicate(nid, predicate): From a given partial adjacency list specified by nid, return nodes that are labeled with the given predicate.

Here, dir is a parameter that specifies whether the predicate is an incoming or an outgoing edge. LoadNodes() is straightforward to understand. When it is called, it uses the global predicate index on each machine to find nodes that have at least one incoming or outgoing edge that is labeled as predicate.

The next two operators together find specific neighbors for a given node. LoadNeighborsOnMachine() finds a node's incoming or outgoing neighbors on a given machine. But, instead of returning all the neighbors, it simply returns the in_i or out_i id as given in (2). Then, given the in_i or out_i id, SelectByPredicate() finds nodes in the adjacency list that are associated with the given predicate. Certainly, if the node has fewer than t neighbors, then its adjacency list is not distributed, and the two functions simply operate on the local adjacency list.

We now use some examples to illustrate the use of the above 3 operators on the RDF graph shown in Figure 4. LoadNodes(l2, out) finds n2 on machine 1, and n3 on machine 2. LoadNeighborsOnMachine(n0, in, 1) returns the partial adjacency list's id in_1, and SelectByPredicate(in_1, l2) returns n2.
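
The sketch below (our own single-process simulation; the stored layout and the two-machine instance are assumptions chosen to be consistent with the example calls above, not Trinity's API) shows how the three operators can be realized over the global predicate index and the partial adjacency lists of model (2): n2 (on machine 1) and n3 (on machine 2) each have an outgoing l2-edge to n0.

# A runnable sketch (ours) of the three operators of Section 4.4.
# Per-machine global predicate index: predicate -> (subjects, objects) on that machine.
GLOBAL_INDEX = {
    1: {"l2": ({"n2"}, {"n0"})},
    2: {"l2": ({"n3"}, set())},
}
# Partial adjacency lists of model (2)/(3): key -> [(predicate, neighbor-id)].
PARTIAL_ADJ = {
    ("n0", "in", 1): [("l2", "n2")],
    ("n0", "in", 2): [("l2", "n3")],
}

def LoadNodes(predicate, dir):
    """Nodes with an incoming ('in') or outgoing ('out') edge labeled predicate, per machine."""
    side = 0 if dir == "out" else 1          # subjects carry the outgoing edges
    result = {}
    for machine, index in GLOBAL_INDEX.items():
        if predicate in index and index[predicate][side]:
            result[machine] = set(index[predicate][side])
    return result

def LoadNeighborsOnMachine(node, dir, i):
    """Return the key (nid) of node's partial in/out adjacency list on machine i."""
    return (node, dir, i)

def SelectByPredicate(nid, predicate):
    """From the partial adjacency list keyed by nid, keep neighbors under predicate."""
    return [nbr for pred, nbr in PARTIAL_ADJ.get(nid, []) if pred == predicate]

print(LoadNodes("l2", "out"))            # {1: {'n2'}, 2: {'n3'}}
nid1 = LoadNeighborsOnMachine("n0", "in", 1)
print(SelectByPredicate(nid1, "l2"))     # ['n2']
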
5 Query Processing

In this section, we present our exploration-based approach for SPARQL query processing.

5.1 Overview

We represent a SPARQL query Q by a query graph G. Nodes in G denote subjects and objects in Q, and directed edges in G denote predicates. Figure 5 shows the query graph corresponding to the query in Example 1, and lists the 4 triple patterns in the query as q1 to q4.

    • q1: (?director wins ?award)
    • q2: (?director directs ?movie)
    • q3: (?movie has_award ?movie_award)
    • q4: (?movie casts ?actor)

Figure 5: The query graph of Example 1

With G defined, the problem of SPARQL query processing can be transformed into the problem of subgraph matching. However, as we pointed out in Section 2, existing RDF query processing and subgraph matching algorithms rely excessively on costly joins, which cannot scale to RDF data of billions or even trillions of triples. Instead, we use efficient graph exploration in an in-memory key-value store to support fast query processing. The exploration is conducted as follows: We first decompose Q into an ordered sequence of triple patterns: q1, ..., qn. Then, we find matches for each qi, and from each match, we explore the graph to find matches for qi+1. Thus, to a large extent, graph exploration acts as joins. Furthermore, the exploration is carried out on all distributed machines in parallel. In the final step, we gather the matchings for all individual triple patterns to the centralized query proxy, and combine them together to produce the final results.

5.2 Single Triple Pattern Matching

We start with matching a single triple pattern. For a triple pattern q, our goal is to find all its matches R(q). Let P denote the predicate in q, V denote the variables in q, and B(V) denote the binding of V. If V is a free variable (not bound), we also use B(V) to denote all possible values V can take. We regard a constant as a special variable with only one binding.

We use graph exploration to find matches for q. There are two ways of exploration: from subject to object (we first try to find matches for the subject in q, and then for each match, we find matches for the object in q; we denote this exploration as →q) and from object to subject (we denote this exploration as ←q). We use src and tgt to refer to the source and target of an exploration (i.e., in →q the src is the subject, while in ←q the src is the object).

Algorithm 1 MatchPattern(e)    (e = →q or ←q)
    obtain src, tgt, and predicate p from e
    // On the src side:
    if src is a free variable then
        B(src) = ∪_{p ∈ B(P)} LoadNodes(p, dir)
    set M_i = ∅ for all i    // initialize messages to machine i
    for each s in B(src) do
        for each machine i do
            nid_i = LoadNeighborsOnMachine(s, dir, i)
            M_i = M_i ∪ {(s, nid_i)}
    batch send messages M to all machines
    // On the tgt side:
    ...

For each s in B(src) and each machine i, LoadNeighborsOnMachine(s, dir, i) returns a key nid_i. The node's neighbors on machine i are stored in the key-value pair with nid_i as the key. We then send nid_i to machine i.

Each machine, on receiving the message, starts the matching on the tgt side. For each eligible predicate p in B(P), we filter neighbors in the adjacency list by p by calling SelectByPredicate(). If tgt is a free variable, any neighbor is eligible as a binding, so we add (s, p, n) as a match for every neighbor n. If tgt is a constant, however, only the constant node is eligible. As we treat a constant as a special variable with only one binding, we can uniformly handle these two cases: we match a new edge only if its target is in B(tgt).

... Saturn Award on machine 2. Next, on the target ?director side, machine 1 gets the key of the adjacency list sent by Saturn Award, and after calling SelectByPredicate(), it gets G Lucas. Since the target ?director is a free variable, any edge labeled with wins will be matched. We add the matching edge (G Lucas, wins, Saturn Award) to R. Similarly, on machine 2, we get (J Cameron, wins, Oscar Award).

As Algorithm 1 shows, given a triple q, each machine performs MatchPattern() independently, and obtains and stores the results on the target side, that is, on machines where the target is matched. For the example in Figure 6, matches for ←q1, where q1 is "?director wins ?award", are stored on machine 1, where the target G Lucas is located. Table 2 shows the results on both machines for q1. We use R_i(q) to denote the matches for q on machine i. Note that the constant column wins is not stored.

    (a) R1(q1)                      (b) R2(q1)
    ?director   ?award              ?director   ?award
    G Lucas     Saturn Award        J Cameron   Oscar Award

Table 2: Individual matching result of q1

Instead of matching single patterns independently, we treat the query as a sequence of patterns. The matching of the current pattern is based on the matches of previous patterns, i.e., we "explore" the RDF graph from the matches of previous patterns to find matches for the current pattern. In other words, we eagerly prune invalid matchings by exploration to avoid the cost of joining large sets of results later.

    (a) R1(q2)                      (b) R2(q2)
    ?movie    ?director             ?movie   ?director
    Titanic   J Cameron             Avatar   J Cameron

Table 3: Individual matching result of q2
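
As a companion to the fragments of Algorithm 1 above, here is a compact single-process sketch (ours, not Trinity's implementation; the node-to-machine placement is taken from the Table 2 example and the messaging is simulated in memory) of matching one triple pattern from object to subject and storing the results on the machines where the targets reside.

# A single-process sketch (ours) of the src/tgt flow of Algorithm 1 for
# q1 = ?director wins ?award, explored from object to subject.
from collections import defaultdict

MACHINE = {"J_Cameron": 2, "G_Lucas": 1}           # where each director node lives
IN_ADJ = {"Oscar_Award": [("wins", "J_Cameron")],  # in-edges: award -> [(pred, director)]
          "Saturn_Award": [("wins", "G_Lucas")]}

def match_pattern_reverse(pred, src_bindings):
    # src side: for each bound source s, group s's in-neighbors by the machine
    # that holds them (the real system ships only the key nid_i per machine).
    messages = defaultdict(list)                   # machine i -> [(s, partial adjacency list)]
    for s in src_bindings:
        per_machine = defaultdict(list)
        for p, nbr in IN_ADJ.get(s, []):
            per_machine[MACHINE[nbr]].append((p, nbr))
        for i, partial in per_machine.items():     # batch send M_i to machine i
            messages[i].append((s, partial))
    # tgt side: each machine filters its partial lists by the predicate and
    # stores the matching edges locally, giving the R_i tables of Table 2.
    results = defaultdict(list)                    # machine i -> [(tgt, pred, src)]
    for i, batch in messages.items():
        for s, partial in batch:
            for p, nbr in partial:
                if p == pred:
                    results[i].append((nbr, pred, s))
    return dict(results)

# src = ?award is free, so its bindings would come from LoadNodes(wins, in).
print(match_pattern_reverse("wins", ["Oscar_Award", "Saturn_Award"]))
# {2: [('J_Cameron', 'wins', 'Oscar_Award')], 1: [('G_Lucas', 'wins', 'Saturn_Award')]}
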
D1 D2 D3 D4 D5 D6 D7 D8 Geo. mean
Trinity.RDF 7 220 5 7 8 21 13 28 15
RDF-3X (In Memory) 15 79 14 18 22 34 68 35 29
BitMat (In Memory) 335 1375 209 113 431 619 617 593 425
RDF-3X (Cold Cache) 522 493 394 498 366 524 458 658 482
BitMat (Cold Cache) 392 1605 326 279 770 890 813 872 639
Table 9: Query run-times in milliseconds on the DBPSB dataset (15 million triples)

Performance on Large Datasets  We experiment on three datasets, LUBM-10240, LUBM-100000 and BTC-10, to study the performance of Trinity.RDF on billion scale datasets, and compare it against both centralized and distributed RDF systems. The results are shown in Tables 11, 12 and 13. As distributed systems, Trinity.RDF and MapReduce-RDF-3X are deployed on a 5-server cluster for LUBM-10240, an 8-server cluster for LUBM-100000 and a 5-server cluster for BTC-10. And we implement the directed 2-hop guarantee partition for MapReduce-RDF-3X.

BitMat fails to run on BTC-10 as it generates terabytes of data for just a single SPO index. Similar issues happen on LUBM-100000. For some datasets and queries, BitMat and RDF-3X fail to return answers in a reasonable time (denoted as "aborted" in our experiment results).

On LUBM-10240 and LUBM-100000, Trinity.RDF gets similar performance gains over RDF-3X and BitMat as on LUBM-160. Even compared with MapReduce-RDF-3X, Trinity.RDF gives surprisingly competitive performance, and for some queries, e.g., L4-6, Trinity.RDF is even faster. These results become more remarkable if we note that all the LUBM queries have simple structures, and MapReduce-RDF-3X specially partitions the data so that these queries can be answered fully in parallel with zero network communication. In comparison, Trinity.RDF randomly partitions the data, and has a network overhead. However, data partitioning is orthogonal to our algorithm and can be easily applied to reduce the network overhead. This is also evidenced by the results of L4-6. L4-6 only explore a small set of triples (as shown in Table 14) and incur little network overhead. Thus, Trinity.RDF outperforms even MapReduce-RDF-3X. Moreover, MapReduce-RDF-3X's partition algorithm incurs great space overhead. As shown in Table 10, MapReduce-RDF-3X indexes twice as many triples as RDF-3X and Trinity.RDF do.

              LUBM-10240      LUBM-100000      BTC-10
    #triples  2,459,450,365   20,318,973,699   6,322,986,673
    Overhead  1.80X           2.04X            1.99X

Table 10: The space overhead of MapReduce-RDF-3X compared with the original datasets

The BTC-10 benchmark has more complex queries, some with up to 13 patterns. Specifically, S3, S4, S6 and S7 are not parallelizable without communication in MapReduce-RDF-3X, and additional MapReduce jobs are invoked to answer the queries. In Table 13, we list separately the time of RDF-3X jobs and MapReduce jobs for MapReduce-RDF-3X. Interestingly, Trinity.RDF shows up to 2 orders of magnitude speed-up even over the RDF-3X jobs of MapReduce-RDF-3X. This is probably because MapReduce-RDF-3X divides a query into multiple subqueries and each subquery produces a much larger result set. This result again proves the performance impact of exploiting the correlations between patterns in a query, which is the key idea behind graph exploration.

Figure 8: Data scalability ((a) LUBM group (I); (b) LUBM group (II))

Figure 9: Machine scalability ((a) LUBM group (I); (b) LUBM group (II))

                 L1     L2         L3   L4   L5   L6    L7
    LUBM-160     397    173040     0    10   10   125   7125
    LUBM-10240   2502   11016920   0    10   10   125   450721

Table 14: The result sizes of LUBM queries

Scalability  To evaluate the scalability of our system, we carry out two experiments by (1) scaling the data while fixing the number of servers, and (2) scaling the number of servers while fixing the data. We group the LUBM queries into two categories according to the sizes of their results, as shown in Table 14: (I) L1, L2, L3, L7: the results of these queries increase as the size of the dataset increases. Note that although L3 produces an empty result set, it is more similar to the queries in group (I), as its intermediate result set increases when the input dataset increases. (II) L4, L5, L6: these queries are very selective, and produce results of constant size as the size of the dataset increases.

Varying size of data: We test Trinity.RDF running on a 3-server cluster on 5 datasets, LUBM-40 to LUBM-10240, of increasing sizes. The results are shown in Figure 8 (a) and (b).
L1 L2 L3 L4 L5 L6 L7 Geo. mean
Trinity.RDF 12648 6018 8735 5 4 9 31214 450
RDF-3X (Warm Cache) 36m47s 14194 27245 8 8 65 69560 2197
BitMat (Warm Cache) 33097 209146 2538 aborted 407 1057 aborted 5966
RDF-3X (Cold Cache) 39m2s 18158 34241 1177 1017 993 98846 15003
BitMat (Cold Cache) 39716 225640 9114 aborted 494 2151 aborted 9721
MapReduce-RDF-3X (Warm Cache) 17188 3164 16932 14 10 720 8868 973
MapReduce-RDF-3X (Cold Cache) 32511 7371 19328 675 770 1834 19968 5087
Table 11: Query run-times in milliseconds for the LUBM-10240 dataset (1.36 billion triples)
L1 L2 L3 L4 L5 L6 L7 Geo. mean
Trinity.RDF 176 21 119 0.005 0.006 0.010 126 1.494
RDF-3X (Warm Cache) aborted 96 363 0.011 0.006 0.021 548 1.726
RDF-3X (Cold Cache) aborted 186 1005 874 578 981 700 633.842
MapReduce-RDF-3X (Warm Cache) 102 19 113 0.022 0.016 0.226 51.98 2.645
MapReduce-RDF-3X (Cold Cache) 171 32 151 1.113 0.749 1.428 89 13.633
Table 12: Query run-times in seconds for the LUBM-100000 dataset (9.96 billion triples)

Trinity.RDF utilizes selective patterns to do efficient pruning. Therefore, Trinity.RDF achieves a constant size of intermediate results and stable performance for group (II) regardless of the increasing data size. For group (I), Trinity.RDF scales linearly as the size of the dataset increases, which shows that the network overhead is alleviated by the efficient pruning of intermediate results in graph exploration.

Varying number of machines: We deploy Trinity.RDF in clusters with varying numbers of machines, and test its performance on the LUBM-10240 dataset. The results are shown in Figure 9 (a) and (b). For group (I), the query time of Trinity.RDF decreases reciprocally w.r.t. the number of machines, which testifies that Trinity.RDF can efficiently utilize the parallelism of a distributed system. Moreover, although more partitions increase the amount of intermediate data delivered across the network, our storage model effectively bounds this overhead. For group (II), due to selective query patterns, the intermediate results are relatively small. Using more machines does not improve the performance, but again the performance is very stable and is not impacted by the extra network overhead.

7 Related Work

Tremendous efforts have been devoted to building high performance RDF management systems [12, 36, 14, 5, 35, 27, 26, 8, 7, 17]. State-of-the-art approaches can be classified into two categories:

Relational Solutions  Most existing RDF systems use a relational model to manage RDF data, i.e., they store RDF triples in relational tables, and use RDBMS indexing to tune query processing, which aims solely at answering SPARQL queries. SW-Store [5] exploits the fact that RDF data has a small number of predicates. Therefore, it vertically partitions RDF data (by predicates) into a set of property tables, maps them onto a column-oriented database, and builds a subject-object index on each property table; Hexastore [35] and RDF-3X [27] manage all triples in a giant triple table, and build indices of all six combinations (SPO, SOP, etc.).

The relational model dictates that SPARQL queries are processed as large join queries, and most prior systems rely on SQL join optimization techniques for query processing. RDF-3X [27], which is considered the fastest existing system, proposed sophisticated bushy-join planning and fast merge joins for query answering. However, this approach requires scanning a large fraction of the indexes even for very selective queries. Such redundancy overhead quickly becomes a bottleneck for billion triple datasets and/or complex queries. Several join optimization techniques have been proposed. SIP (sideways information passing) is a dynamic optimization technique for pipelined execution plans [26]. It introduces filters on subject, predicate, or object identifiers, and passes these filters to other joins and scans in different parts of the operator tree that need to process similar identifiers. This introduces opportunities to avoid some unnecessary index scans. BitMat [8] uses a matrix of bitmaps to compress the indexes, and uses lightweight semi-join operations on compressed data to reduce the intermediate results before actually joining. However, these optimizations do not solve the fundamental problem of the join approach. In comparison, our exploration-based approach is radically different from the join approach.

Graph-based Solutions  Another direction of research investigated the possibility of storing RDF data as graphs [18, 7, 11]. Many argued that graph primitives besides pattern matching (SPARQL queries) should be incorporated into RDF languages, and several graph models for advanced applications on RDF data have been proposed [18, 7]. There are several non-distributed implementations, including one that builds an in-memory graph model for RDF data using Jena, and another that stores RDF as a graph in an object-oriented database [11]. However, both of them are single-machine solutions with limited scalability. A related research area is subgraph matching [13, 40, 19, 39], but most solutions rely on complex indexing techniques that are often very costly, and do not have the scalability to process web scale RDF graphs.

Recently, several distributed RDF systems [17, 15, 29, 20, 21] have been proposed. YARS2 [17], Virtuoso [15] and SHARD [29] hash-partition triples across multiple machines and parallelize the query processing. Their solutions are limited to simple index loop queries and do not support advanced SPARQL queries, because of the need to ship data around. Huang et al. [20] deploy single-node RDF systems on multiple machines, and use the MapReduce framework to synchronize query execution. Their system partitions and aggressively replicates the data in order to reduce network communication. However, for complex SPARQL queries, it has high time and space overhead, because it needs additional MapReduce jobs and data replication. Furthermore, Husain et al. [21] developed a batch system relying solely on MapReduce for SPARQL queries. It does not provide real-time query support. Yang et al. [38] recently proposed a graph partition management strategy for fast graph query processing, and demonstrated their system on answering SPARQL queries. However, their work focuses on partition optimization but not on developing scalable graph query engines.
S1 S2 S3 S4 S5 S6 S7 Geo. mean
Trinity.RDF 12 10 31 21 23 33 27 21
RDF-3X (Warm Cache) 108 8407 27428 62846 32 260 238 1175
RDF-3X (Cold Cache) 5265 23881 41819 91140 1041 3065 1497 8101
MapReduce-RDF-3X (Warm Cache w/o MapReduce) 132 8 4833 6059 24 1931 2732 453
MapReduce-RDF-3X (Cold Cache w/o MapReduce) 2617 661 13755 18712 801 4347 7950 3841
MapReduce-RDF-3X (MapReduce) N/A N/A 39928 39782 N/A 33699 33703 36649
Table 13: Query run-times in milliseconds for BTC-10 dataset (3.17 billion triples)

Further, the partitioning strategy is orthogonal to our solution, and Trinity.RDF can apply their algorithm on data partitioning to achieve better performance.

8 Conclusion

We propose a scalable solution for managing RDF data as graphs in a distributed in-memory key-value store. Our query processing and optimization techniques support SPARQL queries without relying on join operations, and we report performance numbers of querying against RDF datasets of billions of triples. Besides scalability, our approach also has the potential to support queries and analytical tasks that are far more advanced than SPARQL queries, as RDF data is stored as graphs. In addition, our solution only utilizes basic (distributed) key-value store functions and thus can be ported to any in-memory key-value store.

9 References

[1] Billion Triple Challenge. http://challenge.semanticweb.org/.
[2] DBpedia SPARQL Benchmark (DBPSB). http://aksw.org/Projects/DBPSB.
[3] Jena. http://jena.sourceforge.net.
[4] Trinity. http://research.microsoft.com/en-us/projects/trinity/.
[5] D. J. Abadi, A. Marcus, S. Madden, and K. Hollenbach. SW-Store: a vertically partitioned DBMS for Semantic Web data management. VLDB J., 18(2):385–406, 2009.
[6] S. Alexaki, V. Christophides, G. Karvounarakis, D. Plexousakis, and K. Tolle. The ICS-FORTH RDFSuite: Managing Voluminous RDF Description Bases. In SemWeb, 2001.
[7] R. Angles and C. Gutiérrez. Querying RDF data from a graph database perspective. In ESWC, pages 346–360, 2005.
[8] M. Atre, V. Chaoji, M. J. Zaki, and J. A. Hendler. Matrix "Bit" loaded: a scalable lightweight join query processor for RDF data. In WWW, pages 41–50, 2010.
[9] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. G. Ives. DBpedia: A Nucleus for a Web of Open Data. In ISWC/ASWC, pages 722–735, 2007.
[10] P. A. Bernstein and D.-M. W. Chiu. Using Semi-Joins to Solve Relational Queries. J. ACM, 28(1):25–40, 1981.
[11] V. Bönström, A. Hinze, and H. Schweppe. Storing RDF as a graph. In LA-WEB, pages 27–36, 2003.
[12] J. Broekstra, A. Kampman, and F. van Harmelen. Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. In ISWC, 2002.
[13] J. Cheng, J. X. Yu, B. Ding, P. S. Yu, and H. Wang. Fast graph pattern matching. In ICDE, pages 913–922, 2008.
[14] E. I. Chong, S. Das, G. Eadon, and J. Srinivasan. An Efficient SQL-based RDF Querying Scheme. In VLDB, 2005.
[15] O. Erling and I. Mikhailov. Virtuoso: RDF Support in a Native RDBMS. In Semantic Web Information Management, pages 501–519, 2009.
[16] Y. Guo, Z. Pan, and J. Heflin. LUBM: A benchmark for OWL knowledge base systems. Journal of Web Semantics, 3(2-3):158–182, 2005.
[17] A. Harth, J. Umbrich, A. Hogan, and S. Decker. YARS2: A federated repository for querying graph structured data from the web. In ISWC/ASWC, pages 211–224, 2007.
[18] J. Hayes and C. Gutierrez. Bipartite graphs as intermediate model for RDF. In ISWC, 2004.
[19] H. He and A. K. Singh. Graphs-at-a-time: query language and access methods for graph databases. In SIGMOD, 2008.
[20] J. Huang, D. J. Abadi, and K. Ren. Scalable SPARQL Querying of Large RDF Graphs. PVLDB, 4(11), 2011.
[21] M. F. Husain, J. P. McGlothlin, M. M. Masud, L. R. Khan, and B. M. Thuraisingham. Heuristics-Based Query Processing for Large RDF Graphs Using Cloud Computing. IEEE Trans. Knowl. Data Eng., 23(9):1312–1327, 2011.
[22] J. Lu, Y. Yu, K. Tu, C. Lin, and L. Zhang. An Approach to RDF(S) Query, Manipulation and Inference on Databases. In WAIM, pages 172–183, 2005.
[23] A. Lumsdaine, D. Gregor, B. Hendrickson, and J. W. Berry. Challenges in Parallel Graph Processing. Parallel Processing Letters, 17(1):5–20, 2007.
[24] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, 2010.
[25] T. Neumann and G. Weikum. RDF-3X: a RISC-style engine for RDF. PVLDB, 1(1), 2008.
[26] T. Neumann and G. Weikum. Scalable Join Processing on Very Large RDF Graphs. In SIGMOD, 2009.
[27] T. Neumann and G. Weikum. The RDF-3X engine for scalable management of RDF data. VLDB J., 19(1):91–113, 2010.
[28] M. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2):026113, 2004.
[29] K. Rohloff and R. E. Schantz. High-performance, massively scalable distributed systems using the MapReduce software framework: the SHARD triple-store. In PSI EtA, 2010.
[30] B. Shao, H. Wang, and Y. Li. The Trinity graph engine. Technical Report 161291, Microsoft Research, 2012.
[31] B. Shao, H. Wang, and Y. Xiao. Managing and mining large graphs: Systems and implementations. In SIGMOD, 2012.
[32] M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds. SPARQL basic graph pattern optimization using selectivity estimation. In WWW, 2008.
[33] Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li. Efficient subgraph matching on billion node graphs. PVLDB, 5(9):788–799, 2012.
[34] H. Wang, H. He, J. Yang, P. S. Yu, and J. X. Yu. Dual Labeling: Answering Graph Reachability Queries in Constant Time. In ICDE, page 75, 2006.
[35] C. Weiss, P. Karras, and A. Bernstein. Hexastore: sextuple indexing for semantic web data management. PVLDB, 1(1):1008–1019, 2008.
[36] K. Wilkinson, C. Sayers, H. A. Kuno, and D. Reynolds. Efficient RDF Storage and Retrieval in Jena2. In SWDB, pages 131–150, 2003.
[37] W. Wu, H. Li, H. Wang, and K. Zhu. Probase: A probabilistic taxonomy for text understanding. In SIGMOD, 2012.
[38] S. Yang, X. Yan, B. Zong, and A. Khan. Towards effective partition management for large graphs. In SIGMOD Conference, pages 517–528, 2012.
[39] F. Zhu, Q. Qu, D. Lo, X. Yan, J. Han, and P. S. Yu. Mining top-k large structural patterns in a massive network. PVLDB, 4(11):807–818, 2011.
[40] L. Zou, L. Chen, and M. T. Özsu. DistanceJoin: Pattern match query in a large graph database. PVLDB, 2(1):886–897, 2009.