
Converting Relational to Graph Databases

Roberto De Virgilio, Antonio Maccioni, Riccardo Torlone
Università Roma Tre, Rome, Italy

ABSTRACT

Graph Database Management Systems provide an effective and efficient solution to data storage in current scenarios where data are increasingly connected, graph models are widely used, and systems need to scale to large data sets. In this framework, the conversion of the persistent layer of an application from a relational to a graph data store can be convenient, but it is usually a hard task for database administrators. In this paper we propose a methodology for converting a relational database into a graph database by exploiting the schema and the constraints of the source. The approach supports the translation of conjunctive SQL queries over the source into graph traversal operations over the target. We provide experimental results that show the feasibility of our solution and the efficiency of query answering over the target database.

1. INTRODUCTION

There are several application domains in which the data have a natural representation as a graph. This happens, for instance, in the Semantic Web, in social and computer networks, and in geographic applications. In these contexts, relational systems are usually unsuitable for storing the data, since they hardly capture its inherent graph structure. Moreover, and more importantly, graph traversals over highly connected data require complex join operations, which can make typical operations on this kind of data inefficient and applications hard to scale. For these reasons, a brand new category of data stores, called GDBMSs (Graph Database Management Systems), is emerging. In GDBMSs data are natively stored as graphs and queries are expressed in terms of graph traversal operations. This allows applications to scale to very large graph-based data sets. In addition, since GDBMSs do not rely on a rigid schema, they provide a more flexible solution in scenarios where the organization of data evolves rapidly. In this framework, the migration of the persistent layer of an application from a relational to a graph-based storage system can be very beneficial. This task can, however, be very hard for software engineers, and a tool supporting this activity, possibly in an automatic way, is clearly essential. Actually, there already exist solutions to this problem [3, 11], but they usually refer to specific target data models, such as RDF. Moreover, they usually follow a naive approach in which, basically, tuples are mapped to nodes and foreign keys to edges; this approach does not take the query load into account and can make graph traversals expensive. Last, but not least, none of them considers the problem of mapping queries over the source into efficient queries over the target. Yet, this is fundamental to reduce the impact on the logic layer of the application and to provide, if needed, a relational view over the target.

In this paper we propose a comprehensive approach to the automatic migration of databases from relational to graph storage systems. Specifically, our technique converts a relational database r into a graph database g and maps any conjunctive query over r into a graph query over g. The translation takes advantage of the integrity constraints defined over the source and tries to minimize the number of accesses needed to answer queries over the target. Intuitively, this is done by storing in the same node data that likely occur together in query results. We refer to a general graph data model and a generic query language for graph databases: this makes the approach independent of the specific GDBMS chosen as a target. In order to test the feasibility of our approach, we have developed a complete system for converting relational to graph databases that implements the technique described above. A number of experiments over available data stores have shown that there is no loss of data in the translation, and that queries over the source are translated into efficient queries over the target.

The rest of the paper is organized as follows. Section 6 discusses related work. In Section 2 we introduce some preliminary notions that are used in Section 3 and in Section 4 to illustrate the data and the query mapping techniques, respectively. Finally, Section 5 discusses some experimental results and Section 7 sketches conclusions and future work.
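The naive mapping followed by the existing tools mentioned above (tuples to nodes, foreign keys to edges) can be sketched as follows; the function and the toy data are illustrative only and are not taken from any specific importer.

```python
# Sketch of the naive relational-to-graph mapping: one node per tuple,
# one edge per pair of tuples matched by a foreign key.

def naive_convert(tables, fks):
    """tables: {relation: [tuple-dicts]}; fks: [(rel1, attr1, rel2, attr2)]."""
    nodes = [(rel, t) for rel, ts in tables.items() for t in ts]
    edges = [(i, j)
             for r1, a, r2, b in fks
             for i, (ri, ti) in enumerate(nodes) if ri == r1
             for j, (rj, tj) in enumerate(nodes)
             if rj == r2 and ti.get(a) == tj.get(b)]
    return nodes, edges

# Toy data in the spirit of the running example of the paper.
tables = {"User": [{"uid": "u01"}, {"uid": "u02"}],
          "Blog": [{"bid": "b01", "admin": "u02"}]}
fks = [("Blog", "admin", "User", "uid")]
nodes, edges = naive_convert(tables, fks)
# 3 nodes and 1 edge (the blog points to its admin user)
```

As the paper argues, this mapping ignores the query load: every join over the source becomes an edge traversal per tuple pair, which can make traversals expensive.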
2. PRELIMINARIES
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Proceedings of the First International Workshop on Graph Data Management Experience and Systems (GRADES 2013), June 23, 2013, New York, NY, USA.
Copyright 2013 ACM 978-1-4503-2188-4 ...$15.00.

A graph data model for relational databases. As usual, we assume that: (i) a relational database schema R is a set of relation schemas R1(X1), ..., Rn(Xn), where Ri is the name of the i-th relation and Xi is the set of its attributes, and (ii) a relational database r over R is a set of relations r1, ..., rn over R1(X1), ..., Rn(Xn), respectively, where ri is a set of tuples over Ri(Xi). In the following, we will underline the attributes of a relation that belong
to its primary key, and we will denote by Ri.A −fk→ Rj.B a foreign key between the attribute A of a relation Ri and the attribute B of a relation Rj¹. A relational schema R can be naturally represented in terms of a graph by considering the keys and the foreign keys of R. This representation will be used in the first step of the conversion of a relational into a graph database and is defined as follows.

Definition 1 (Relational Schema Graph). Given a relational schema R, the relational schema graph RG for R is a directed graph ⟨N, E⟩ such that: (i) there is a node A ∈ N for each attribute A of a relation in R, and (ii) there is an edge (Ai, Aj) ∈ E if one of the following holds: (a) Ai belongs to a key of a relation R in R and Aj is a non-key attribute of R; (b) Ai and Aj belong to a key of a relation R in R; (c) Ai and Aj belong to Ri and Rj, respectively, and there is a foreign key between Ri.Ai and Rj.Aj.

For instance, let us consider the relational database R for a social application in Figure 1. Note that this is a typical application scenario for which relational DBMSs are considered not suited [9]. It involves the following foreign keys: FR.fuser −fk→ US.uid, FR.fblog −fk→ BG.bid, BG.admin −fk→ US.uid, CT.cblog −fk→ BG.bid, CT.cuser −fk→ US.uid, TG.tuser −fk→ US.uid, and TG.tcomment −fk→ CT.cid. The relational schema graph for R is depicted in Figure 2. We say that a hub in a graph is a node having more than one incoming edge, a source is a node without incoming edges, and a sink is a node without outgoing edges. For instance, in the graph in Figure 2, FR.fuser is a source, CT.date is a sink, and US.uid is a hub. In a relational schema graph we focus our attention on full schema paths, i.e., paths from a source node to a sink node. This is because, in relational schema graphs, they represent logical relationships between concepts of the database, and for this reason they correspond to the natural ways of joining the tables of the database for answering queries. Referring to Figure 2, we have the full schema paths shown in Figure 3.

¹Note that, in this paper, we only consider foreign keys over single attributes. Foreign keys over multiple attributes can be managed by means of references to tuple identifiers.

User (US)                  Follower (FR)           Tag (TG)
     uid   uname                fuser  fblog            tuser  tcomment
t1:  u01   Date            t3:  u01    b01         t7:  u02    c01
t2:  u02   Hunt            t4:  u01    b02
                           t5:  u01    b03
                           t6:  u02    b01

Blog (BG)
     bid   bname                 admin
t8:  b01   Information Systems   u02
t9:  b02   Database              u01
t10: b03   Computer Science      u02

Comment (CT)
     cid   cblog  cuser  msg                               date
t11: c01   b01    u01    Exactly what I was looking for!   25/02/2013

Figure 1: An example of relational database

[Figure 2: An example of schema graph — a node for each attribute of Figure 1 (FR.fuser, FR.fblog, US.uid, US.uname, BG.bid, BG.bname, BG.admin, TG.tuser, TG.tcomment, CT.cid, CT.cblog, CT.cuser, CT.msg, CT.date), with edges induced by the keys and foreign keys.]

sp1: FR.fuser → US.uid → US.uname
sp2: FR.fuser → FR.fblog → BG.bid → BG.bname
sp3: FR.fuser → FR.fblog → BG.bid → BG.admin → US.uid → US.uname
sp4: TG.tuser → US.uid → US.uname
sp5: TG.tuser → TG.tcomment → CT.cid → CT.msg
sp6: TG.tuser → TG.tcomment → CT.cid → CT.date
sp7: TG.tuser → TG.tcomment → CT.cid → CT.cblog → BG.bid → BG.bname
sp8: TG.tuser → TG.tcomment → CT.cid → CT.cuser → US.uid → US.uname
sp9: TG.tuser → TG.tcomment → CT.cid → CT.cblog → BG.bid → BG.admin → US.uid → US.uname

Figure 3: An example of full schema paths

Graph Databases. Recently, graph database models are receiving new interest with the diffusion of GDBMSs. Unfortunately, due to the diversity of the various systems and the lack of theoretical studies on them, there is no accepted definition of a data model for GDBMSs and of the features they provide. However, almost all the existing systems exhibit three main characteristics. First of all, at the physical level, a graph database satisfies the so-called index-free adjacency property: each node stores information about its neighbors only, and no global index of the connections between nodes exists. As a result, the traversal of an edge is basically independent of the size of the data. This makes a GDBMS very efficient at computing local analyses on graph-based data and makes it suitable in scenarios where the data size increases rapidly. Secondly, a GDBMS stores data by means of a multigraph, usually called a property graph [12], where every node and every edge is associated with a set of key-value pairs, called properties. We consider here a simplified version of a property graph where only nodes have properties, which represent actual data, while edges have just labels that represent relationships between data in nodes.

Definition 2 (Graph Database). A graph database is a multigraph g = (N, E) where every node n ∈ N is associated with a set of pairs ⟨key, value⟩ and every edge e ∈ E is associated with a label.

An example of graph database is reported in Figure 4: it represents a portion of the relational database in Figure 1. Note that a tuple t over a relation schema R(X) is represented here by a set of pairs ⟨A, t[A]⟩, where A ∈ X and t[A] is the restriction of t to A. The third feature common to GDBMSs is the fact that data is queried using path traversal operations expressed in some graph-based query language, as discussed next.

[Figure 4: An example of property graph — two nodes, n1 (FR.fuser: u01, US.uid: u01, US.uname: Date) and n2 (FR.fblog: b02, BG.bid: b02, BG.bname: Database, BG.admin: u01), connected by a FOLLOWER_FUSER edge and a BLOG_ADMIN edge.]

Graph Query Languages. The various proposals of query languages for graph data models [14] can be classified into two main categories.
The former includes languages, such as SPARQL and Cypher, in which queries are expressed as graphs and query evaluation relies on graph matching between the query and the database. The limitation of this approach is that graph matching is very expensive on large databases [4]. The latter category includes languages that rely on expressions denoting paths of the database. Among them, we mention Gremlin, XPath, and XQuery. These languages, usually called traversal query languages, are more suitable for an efficient implementation. For the sake of generality, in this paper we consider an abstract traversal query language that adopts an XQuery-like syntax. Expressions of this language are based on path expressions in which, as usual, square brackets denote conditions on nodes and the slash character (/) denotes the relationship between a node n and an edge incoming to or outgoing from n. We will also make use of variables, which range over paths and are denoted by the prefix $; of the for construct, to iterate over path expressions; and of the return construct, to specify the values to return as output.

[Figure 5: An example of graph database — the six nodes obtained from the database of Figure 1: n1 (FR.fuser: u01, US.uid: u01, US.uname: Date), n2 (FR.fblog: b02, BG.bid: b02, BG.bname: Database, BG.admin: u01), n3 (FR.fblog: b01, BG.bid: b01, BG.bname: Information Systems, BG.admin: u02), n4 (TG.tcomment: c01, CT.cid: c01, CT.cblog: b01, CT.cuser: u01, CT.msg: Exactly what I was looking for!, CT.date: 25/02/2013), n5 (TG.tuser: u02, FR.fuser: u02, US.uid: u02, US.uname: Hunt), n6 (FR.fblog: b03, BG.bid: b03, BG.bname: Computer Science, BG.admin: u02), connected by edges labeled FOLLOWER_FUSER, BLOG_ADMIN, COMMENT_CUSER and TAG_TUSER.]
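Definition 1 and the path enumeration of Figure 3 can be sketched in a few lines. The edge list below is hand-built for the schema of Figure 1 (key-to-non-key, key-order, and foreign-key edges), and the helper name full_schema_paths is ours, not the authors'.

```python
# Relational schema graph of Definition 1 for the database of Figure 1,
# and the enumeration of its full schema paths (cf. Figure 3).
EDGES = {
    "US.uid": ["US.uname"],
    "FR.fuser": ["FR.fblog", "US.uid"],
    "FR.fblog": ["BG.bid"],
    "BG.bid": ["BG.bname", "BG.admin"],
    "BG.admin": ["US.uid"],
    "TG.tuser": ["TG.tcomment", "US.uid"],
    "TG.tcomment": ["CT.cid"],
    "CT.cid": ["CT.cblog", "CT.cuser", "CT.msg", "CT.date"],
    "CT.cblog": ["BG.bid"],
    "CT.cuser": ["US.uid"],
    "US.uname": [], "BG.bname": [], "CT.msg": [], "CT.date": [],
}

def full_schema_paths(edges):
    """All paths from a source (no incoming edge) to a sink (no outgoing)."""
    targets = {t for ts in edges.values() for t in ts}
    sources = [n for n in edges if n not in targets]
    paths = []

    def dfs(node, path):
        if not edges[node]:          # sink reached
            paths.append(path)
            return
        for nxt in edges[node]:
            dfs(nxt, path + [nxt])

    for s in sources:
        dfs(s, [s])
    return paths

paths = full_schema_paths(EDGES)
# yields the 9 full schema paths sp1..sp9 of Figure 3
```

On this schema the two sources are FR.fuser and TG.tuser, matching the discussion around Figure 2.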
3. DATA CONVERSION

This section describes our method for converting a relational database r into a graph database g. Usually, existing GDBMSs provide ad-hoc importers implementing a naive approach that creates a node n for each tuple t over a schema R(X) occurring in r, such that n has a property ⟨A, t[A]⟩ for each attribute A ∈ X. Moreover, two nodes n1 and n2 for a pair of tuples t1 and t2 are connected in g if t1 and t2 are joined. Conversely, in our approach we try to aggregate values of different tuples in the same node to speed up traversal operations over g. The basic idea is to try to store in the same node of g data values that are likely to be retrieved together in the evaluation of queries. Intuitively, these values are those that belong to joinable tuples, that is, tuples t1 and t2 over R1 and R2, respectively, such that there is a foreign key constraint between R1.A and R2.B and t1[A] = t2[B]. Referring to Figure 1, t11 and t8 are joinable tuples, since CT.cblog −fk→ BG.bid and t11[cblog] = t8[bid]. However, by just aggregating joinable tuples we could run the risk of accumulating a lot of data in each node, which is not appropriate for graph databases. Therefore, we consider a data aggregation strategy based on a more restrictive property, which we call unifiability. First, we need to introduce a preliminary notion. We say that an attribute Ai of a relation R is n2n if: (i) Ai belongs to the key K = {A1, ..., Ak} of R, and (ii) for each Aj of K there exists a foreign key constraint R.Aj −fk→ R′.B for some relation R′ in r different from R. Intuitively, a set of n2n attributes of a relation implements a many-to-many relationship between entities. Referring again to Figure 1, FR.fuser and FR.fblog are n2n.

Then we say that two data values v1 and v2 are unifiable in a relational database r if one of the following holds: (i) there is a tuple t of a relation R in r such that t[A] = v1, t[B] = v2, and A and B are not n2n; (ii) there is a pair of joinable tuples t1 and t2 of relations R1 and R2, respectively, in r such that t1[A] = v1, t2[B] = v2, and A is n2n; or (iii) there is a pair of joinable tuples t1 and t2 of relations R1 and R2, respectively, in r such that t1[A] = v1, t2[B] = v2, A and B are not n2n, and there is no other tuple t3 in r that is joinable with t2.

While this notion seems quite intricate, we show that it guarantees a balanced distribution of data among the nodes of the target graph database and an efficient evaluation of queries over the target that correspond to joins over the source. Indeed, our technique aims at identifying and efficiently aggregating unifiable data by exploiting the schema and the constraints of the source relational database. Let us consider the relational database in Figure 1. In this case, data is aggregated in six nodes, as shown in Figure 5. For instance, the node labeled n1 aggregates data values occurring in t1, t3, t4 and t5. Similarly, the node labeled n2 involves data from t9 and t4, while n3 aggregates data values from t8 and t6. In this paper, the data conversion process takes into account only the schema of r. Of course, a set of "frequent" queries over r could also be taken into account. This is subject of future work.

More in detail, given the relational database r with schema R, and the set SP of all full schema paths in the relational schema graph RG for R, we generate a graph database g = (N, E) from r as shown in Algorithm 1. Our procedure iterates on the elements of SP; in each iteration, a schema path sp = A1 → ... → Ak is analyzed from the source A1 to the sink Ak. Recall that each Ai of sp corresponds to an attribute in r. The set of data values associated to Ai in the tuples of r is the active domain of Ai: we will use a primitive getAll(r, Ai) that, given the relational database r and an attribute Ai, returns all the values v associated to Ai in r. The set of elements {⟨Ai, vj⟩ | vj ∈ getAll(r, Ai)} is the set of properties to associate to the nodes of g. In our procedure, when we include all the active domain of an attribute Ai in the nodes of g, we say that Ai is visited, i.e., Ai is inserted in a set VS of visited attributes. Therefore, the analysis of a schema path (performed by cond(sp, Ai, VS)) can encounter five cases.

case 1. The current attribute Ai to analyze is a source, i.e., A1, and both Ai and the following attribute Ai+1, i.e., A2, are not visited. In this case we are at the beginning of the migration, and we are creating new nodes from scratch: the function NewNode is responsible for this task. For instance, referring to Figure 3, our procedure analyzes sp1 first; Ai is FR.fuser while Ai+1 is US.uid. Since Ai is a source and Ai+1 is not visited, we encounter case 1. For each data value in the domain of FR.fuser, that is {u01, u02}, we generate a new node to insert in the set N of g: n1 and n5. Then we include the properties ⟨FR.fuser, u01⟩ and ⟨FR.fuser, u02⟩ in n1 and n5, respectively. At the end, the attribute FR.fuser is included in VS.

Algorithm 1: Create a graph database g
Input: A relational database r, a set SP of full schema paths
Output: A graph database g
1  VS ← ∅;
2  g ← (∅, ∅);
3  foreach sp ∈ SP do
4      foreach Ai ∈ sp do
5          switch cond(sp, Ai, VS) do
6              case 1: NewNode(Ai, r, g);
7              case 2: NewProperty(Ai, r, g);
8              case 3: NewProperty(Ai, sp, r, g);
9              case 4: NewNodeEdge(Ai, sp, r, g);
10             case 5: NewEdge(Ai, sp, r, g);
11         VS ← VS ∪ {Ai};
12 return g;
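The control flow of Algorithm 1 can be sketched as follows; this is not the authors' code — the five case handlers are stubbed out, cond is assumed to be supplied by the caller, and only the bookkeeping of the visited set VS (line 11 of Algorithm 1) is modelled faithfully.

```python
# Sketch of Algorithm 1's driver loop: iterate over the full schema
# paths, dispatch each attribute to the case selected by cond(), and
# record the attribute as visited.

def convert(r, schema_paths, cond, handlers):
    g = (set(), set())            # (N, E): empty graph database
    vs = set()                    # VS: visited attributes
    for sp in schema_paths:
        for ai in sp:
            case = cond(sp, ai, vs)       # which of the five cases applies
            handlers[case](ai, sp, r, g)  # NewNode / NewProperty / ...
            vs.add(ai)                    # VS <- VS U {Ai}
    return g, vs

# With no-op handlers, after the run VS contains exactly the attributes
# occurring in the processed paths.
noop = {c: (lambda ai, sp, r, g: None) for c in range(1, 6)}
sp1 = ["FR.fuser", "US.uid", "US.uname"]
sp4 = ["TG.tuser", "US.uid", "US.uname"]
g, vs = convert(None, [sp1, sp4], lambda sp, ai, vs: 1, noop)
```

The real cond() distinguishes the five cases using the source/hub/n2n classification of the schema graph, as detailed below.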
case 2. The current attribute Ai to analyze is a source, i.e., A1; Ai is not visited, but the following attribute Ai+1, i.e., A2, is visited. In this case there is a foreign key constraint between Ai and Ai+1, i.e., Ai → Ai+1. Since Ai+1 is visited, we have a node n ∈ N with the property ⟨Ai+1, v⟩ where v ∈ getAll(r, Ai). Therefore, for each v ∈ getAll(r, Ai) we have to retrieve a node n ∈ N (i.e., the label l associated to n) and insert a new property ⟨Ai, v⟩ in n, as performed by the function NewProperty taking as input Ai, r, and g. For instance, when we start analyzing sp4 (i.e., after sp1, sp2 and sp3 were analyzed), we have Ai = TG.tuser and Ai+1 = US.uid. TG.tuser is a source and not visited, while US.uid is visited, since it was encountered in both sp1 and sp3. Therefore we have case 2: getAll(r, TG.tuser) is {u02} and n5 is the node with the property ⟨US.uid, u02⟩. Finally, we insert the new property ⟨TG.tuser, u02⟩ in n5.

case 3. In this case the current attribute Ai is not visited and is neither a source, nor a hub, nor an n2n node. Therefore, we have to iterate on all nodes n generated or updated by analyzing Ai−1. In each node n where a property ⟨Ai−1, v1⟩ was inserted, we also have to insert a property ⟨Ai, v2⟩, as shown in Case 3: we call the function NewProperty taking as input Ai, sp, r, and g. More in detail, we have to understand whether Ai and Ai−1 are in the same relation (i.e., we are in the same tuple) or not (i.e., we are following a foreign key). In the former case we extract the data value v2 from the same tuple containing v1 (line 5); otherwise v2 is v1 (line 6). We use the function getTable to retrieve the relation R in r containing a given attribute (lines 3-4). Finally, we insert the new property (by calling the function INS) in the node n to which the label label(n) is associated, coming from the iteration on the attribute Ai−1. For instance, iterating on sp1, when Ai is US.uname and Ai−1 is US.uid we have case 3: we iterate on the nodes n1 and n5 containing the properties ⟨US.uid, u01⟩ and ⟨US.uid, u02⟩, respectively. Since US.uname and US.uid are in the same relation User (US), we extract from US the values associated to US.uname in the tuples t1 and t2, referring to Figure 1. Then we insert the properties ⟨US.uname, Date⟩ and ⟨US.uname, Hunt⟩ in n1 and n5, respectively.

Case 3: NewProperty(Ai, sp, r, g)
1  Ai−1 ← sp[i − 1];
2  foreach node n in g such that n contains a property ⟨Ai−1, v1⟩ do
3      R1 ← getTable(r, Ai−1);
4      R2 ← getTable(r, Ai);
5      if R1 = R2 then v2 ← πAi σAi−1=v1(R1);
6      else v2 ← v1;
7      INS(g, label(n), Ai, v2);

case 4. The current attribute Ai is not visited and it is a hub or an n2n node in g. As in case 3, we have to iterate on all nodes n generated or updated by analyzing Ai−1. Differently from case 3, for each data value in the domain of Ai we generate a new node with label li and we insert the property ⟨Ai, v⟩ in the node. Then we link the node with label lj, generated or updated analyzing Ai−1, to the node with label li just generated. Given the attribute Ai−1 and the relation R which Ai−1 belongs to, the label le assigned to the new edge is built by the concatenation of R and Ai−1. This task is performed by the function NewNodeEdge. Let us consider the schema path sp2 and the attribute FR.fblog as the current attribute Ai to analyze. It is not visited and an n2n node in g. In the previous iteration, the analysis of FR.fuser (i.e., Ai−1) updated the node with label n1. In the current iteration, we have to generate three new nodes, with labels n2, n3 and n6, and to include the properties ⟨FR.fblog, b01⟩, ⟨FR.fblog, b02⟩, ⟨FR.fblog, b03⟩, respectively, since getAll(r, FR.fblog) is {b01, b02, b03}. Finally, given the label le equal to FOLLOWER_FUSER (FR.fuser belongs to the relation Follower), we generate the edges with label le between n1 and n2, n1 and n3, n1 and n6.

Case 5: NewEdge(Ai, sp, r, g)
1  Ai−1 ← sp[i − 1];
2  foreach node n in g such that n contains a property ⟨Ai−1, v1⟩ do
3      R1 ← getTable(r, Ai−1); R2 ← getTable(r, Ai);
4      if R1 = R2 then V ← πAi σAi−1=v1(R1);
5      else V ← {v1};
6      foreach v ∈ V do
7          li ← getNode(g, Ai, v);
8          if li ≠ NIL then le ← build(r, Ai−1); newEdge(g, lj, li, le);

case 5. The last case occurs when we are analyzing the last schema paths and, in particular, the last attributes in a schema path. In this case we link two nodes generated in previous iterations. The current attribute Ai is (i) not visited and n2n, or (ii) visited and a hub. Moreover, there exists a node in g with a property ⟨Ai, v⟩, and the attribute Ai−1 is not a source. As shown in Case 5, our procedure iterates on the nodes with label lj built or updated analyzing Ai−1 and retrieves the node with label li to link it with the node with label lj. We have to discern whether Ai−1 and Ai are in the same relation or not. Given R1 and R2, the relations which Ai−1 and Ai belong to, respectively, if R1 and R2 are the same then Ai−1 and Ai are in the same tuple and we extract all data values V associated to Ai in the tuple (line 4). Otherwise, we are considering a foreign key constraint between Ai−1 and Ai: V is {v1} (line 5), where v1 is the value in the property ⟨Ai−1, v1⟩ included in the node with label lj. Finally, for each data value v in V we retrieve the node with label li including the property ⟨Ai, v⟩ and, if it exists, we link the node with label lj to the node with label li (lines 6-8). Let us consider the schema path sp3 and US.uid as the current attribute Ai. Since in the previous iteration the procedure analyzed sp1, US.uid
is visited now; moreover, US.uid is a hub and the previous attribute BG.admin is not a source. We have case 5: since the nodes with labels n1 and n2 contain the properties ⟨US.uid, u01⟩ and ⟨BG.admin, u01⟩, respectively, a new edge with label BLOG_ADMIN is built between those nodes (and similarly between the nodes with labels n3 and n5).
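The n2n and joinable notions that drive the aggregation above can be checked mechanically on the database of Figure 1. The helper names below are ours, and the key and foreign-key catalogues are transcribed from the running example.

```python
# n2n test and joinable-tuples test of Section 3, on the schema of Figure 1.

KEYS = {"US": ["uid"], "FR": ["fuser", "fblog"], "TG": ["tuser", "tcomment"],
        "BG": ["bid"], "CT": ["cid"]}
# foreign keys: (relation, attribute) -> (relation, attribute)
FKS = {("FR", "fuser"): ("US", "uid"), ("FR", "fblog"): ("BG", "bid"),
       ("BG", "admin"): ("US", "uid"), ("CT", "cblog"): ("BG", "bid"),
       ("CT", "cuser"): ("US", "uid"), ("TG", "tuser"): ("US", "uid"),
       ("TG", "tcomment"): ("CT", "cid")}

def is_n2n(rel, attr):
    """Ai is n2n iff it belongs to the key and every key attribute has a
    foreign key towards some other relation."""
    key = KEYS[rel]
    return attr in key and all((rel, a) in FKS for a in key)

def joinable(rel1, t1, rel2, t2):
    """t1 and t2 are joinable iff some foreign key R1.A -> R2.B links them
    with t1[A] = t2[B]."""
    return any(t1.get(a) == t2.get(b)
               for (r1, a), (r2, b) in FKS.items()
               if r1 == rel1 and r2 == rel2)

# Tuples t11 (Comment) and t8 (Blog) from Figure 1.
t11 = {"cid": "c01", "cblog": "b01", "cuser": "u01"}
t8 = {"bid": "b01", "bname": "Information Systems", "admin": "u02"}
```

As in the text, FR.fuser and FR.fblog come out as n2n, US.uid does not, and t11 and t8 are joinable via CT.cblog −fk→ BG.bid.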

4. QUERY TRANSLATION

Our mechanism for translating conjunctive (that is, select-join-projection) queries, expressed in SQL, into path traversal operations over the graph database exploits the schema of the relational source. For the sake of simplicity, we consider an intermediate step in which we map the SQL query into a graph-based internal structure that we call a query template (QT for short). Basically, a QT denotes all the subgraphs of the target graph database that include the result of the query. A QT is then translated into a path traversal query (see Section 2). Given a query Q, the construction of a QT proceeds as follows.

1. We build a minimal set SP of full schema paths such that for each join condition Ri.Ai = Rj.Aj occurring in Q, an edge (Ri.Ai, Rj.Aj) is contained in at least one sp in SP;

2. If there is an attribute in a selection condition (i.e., Ri.Ai = c) that does not occur in any full schema path in SP, another full schema path sp that includes both Ai and an attribute in a full schema path sp′ in SP is added to SP;

3. We build a relational database rQ made of: (i) a set of tables Ri(Ai) having c as instance for each selection condition Ri.Ai = c, and (ii) a set of tables Rj(Aj) having the special symbol ? as instance for each attribute Rj.Aj in the SELECT clause of Q;

4. QT is the graph database obtained by applying the data conversion procedure illustrated in Section 3 over SP and rQ.
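Step 1 above can be sketched as a greedy cover of the join conditions by full schema paths. The greedy choice is our assumption (the paper only requires a minimal set), and only the two paths relevant to the example are listed for brevity.

```python
# Choosing a small set of full schema paths (Figure 3) that covers every
# join condition of a query; a greedy cover is assumed.

PATHS = {
    "sp4": ["TG.tuser", "US.uid", "US.uname"],
    "sp7": ["TG.tuser", "TG.tcomment", "CT.cid", "CT.cblog",
            "BG.bid", "BG.bname"],
    # sp1-sp3, sp5, sp6, sp8, sp9 omitted for brevity
}

# join conditions of the example query Q', oriented as schema-graph edges
JOINS = [("CT.cblog", "BG.bid"),      # BG.bid = CT.cblog
         ("TG.tcomment", "CT.cid"),   # CT.cid = TG.tcomment
         ("TG.tuser", "US.uid")]      # TG.tuser = US.uid

def edges_of(path):
    return set(zip(path, path[1:]))

def cover_joins(paths, joins):
    uncovered, chosen = set(joins), set()
    while uncovered:
        # greedily pick the path covering the most uncovered join edges
        name = max(paths, key=lambda p: len(edges_of(paths[p]) & uncovered))
        gained = edges_of(paths[name]) & uncovered
        if not gained:
            break                     # remaining joins are not coverable
        chosen.add(name)
        uncovered -= gained
    return chosen
```

On the example query of this section, the cover is {sp4, sp7}, the set SP1 built in the walkthrough that follows.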
We explain our technique by the following example query Q′:

select US.uname
from User US, Tag TG, Blog BG, Comment CT
where (BG.bid = CT.cblog) and (CT.cid = TG.tcomment) and
      (TG.tuser = US.uid) and (BG.bname = 'Information Systems')

On the relational database of Figure 1, Q′ selects all the users that have left a comment on the Information Systems blog. As said above, referring to Figure 3: (1) a minimal set of full schema paths that contains all the join conditions of Q′, SP1 = {sp4, sp7}, is built. (2) Since the attribute BG.bname of the selection condition (BG.bname = 'Information Systems') already occurs in sp7, we do not have to include more paths in SP1. (3) From the selection condition (BG.bname = 'Information Systems') and the attribute US.uname of the SELECT clause, we build rQ′ = {BLOG(bname), USER(uname)}, where BLOG(bname) contains one tuple with the data value Information Systems and USER(uname) contains one tuple with the special symbol ?, respectively, as instance. (4) From SP1 and rQ′, we obtain the query template QT′ shown in Figure 6.

[Figure 6: The query template for Q′ — a node with property BG.bname: Information Systems linked through a BLOG_ADMIN edge to a node with property US.uname: ?]

It is straightforward to map QT′ into an XQuery-like path traversal expression QPT′ as follows:

for $x in /[BG.bname='Information Systems'],
    $y in $x/BLOG_ADMIN/*
return $y/US.uname

We start from the node with the property ⟨BG.bname, Information Systems⟩. Moreover, in the condition we express the fact that this node reaches another node through the link BLOG_ADMIN. Finally, from these nodes we return the values of the property with key US.uname (in our example, only Hunt).

5. EXPERIMENTAL RESULTS

We have implemented the techniques described in this paper in a Java system called R2G. Experiments were conducted on a dual-core 2.66GHz Intel Xeon running Linux RedHat, with 4 GB of memory and a 2-disk 1TB striped RAID array. We considered real datasets of different sizes (i.e., number of tuples). In particular, we used Mondial (17,115 tuples in 28 relations) and two ideal counterpoints (due to their larger size), IMDb (1,673,074 tuples in 6 relations) and Wikipedia (200,000 tuples in 6 relations), as described in [6]. The authors in [6] defined a benchmark of 50 keyword search queries for each dataset. We used the tool in [7] to generate SQL queries from the keyword-based queries defined in [6].

Dataset    | Neo4J    | OrientDB | R2G_N     | R2G_O
Mondial    | 7.4 sec  | 5.3 sec  | 13.9 sec  | 9.3 sec
Wikipedia  | 70.7 sec | 66.5 sec | 161.5 sec | 148.7 sec
IMDb       | 8.1 min  | 10.2 min | 16.2 min  | 22.1 min

Table 1: Performance of translations from r to g

R2G has been embedded and tested in two different GDBMSs: Neo4J and OrientDB. In the following we denote with R2G_N and R2G_O the implementations of R2G in Neo4J and OrientDB, respectively. First of all, we evaluated data loading, that is, the time needed to produce a graph database starting from an SQL dump. We compared R2G against the native data importers of Neo4J and OrientDB, which use a naive approach to import an SQL dump, that is, one node for each tuple and one edge for each foreign key reference. In our transformation process we query the RDBMS directly to build the schema graph and compute the schema paths, and then to extract the data values. For our purposes we used PostgreSQL 9.1 (denoted as RDB). Table 1 shows the performance of this task. The Neo4J and OrientDB importers perform better than our system, about two times better: R2G has to process the schema information of the relational database (i.e., the schema graph), while the competitor systems directly import data values from the SQL dump.

Then we evaluated the performance of query execution. For each dataset, we grouped the queries in five sets (ten queries per set): each set is homogeneous with respect to the complexity of the queries (e.g., number of keywords, number of results, and so on). For instance, referring to IMDb, the first set (Q1-Q10) searches information about actors (providing the name as input), while the second set (Q11-Q20) seeks information about movies (providing the title as input). The other sets combine actors, movies and characters. For each set, we ran the queries ten times and measured the average response time.

[Figure 7: Performance on databases — average response times (msec) on Wikipedia and IMDb over the query sets Q1-Q10 through Q41-Q50, for Neo4J, OrientDB, R2G_N, R2G_O and RDB.]

We performed cold-cache experiments (i.e., by dropping all file-system caches before restarting the various systems and
running the queries) and warm-cache experiments (i.e., without dropping the caches). Figure 7 shows the performance for cold-cache experiments. Due to space constraints, the figure reports times only on IMDb and Wikipedia, since their much larger size poses more challenges. We also show the times on the relational database (i.e., RDB) as a global time reference, not for a direct comparison with relational DBMSs. Our system performs consistently better for most of the queries, significantly outperforming the others in some cases (e.g., sets Q21-Q30 or Q31-Q40). We highlight how our data mapping procedure allows OrientDB to perform better than RDB on IMDb (which has a more complex schema). This is due to our strategy reducing the space overhead and, consequently, the time complexity of the overall process w.r.t. the competitors, which spend much time traversing a large number of nodes. Warm-cache experiments follow a similar trend.
of our approach and the efficiency of query answering. In
6. RELATED WORKS future works we intend to refine the technique proposed in
The need to convert relational data into graph modeled this paper to obtain a more compact target database.
data [1] emerged particularly with the advent of Linked
Open Data (LOD) [8] since many organizations needed to 8.[1] R.REFERENCES
Angles and C. Gutiérrez. Survey of graph database
make available their information, usually stored in relational models. ACM Comput. Surv., 40(1), 2008.
databases, on the Web using RDF. For this reason, sev- [2] P. Atzeni, P. Cappellari, R. Torlone, P. A. Bernstein, and
eral solutions have been proposed to support the translation G. Gianforme. Model-independent schema translation.
of relational data into RDF. Some of them focus on map- VLDB J., 17(6):1347–1370, 2008.
ping the source schema into an ontology [5, 10, 13] and rely [3] C. Bizer. D2r map - a database to rdf mapping language. In
on a naive transformation technique in which every rela- WWW (Posters), 2003.
tional attribute becomes an RDF predicate and every rela- [4] C. Bizer and A. Schultz. The berlin sparql benchmark. Int.
J. Semantic Web Inf. Syst., 5(2):1–24, 2009.
tional values becomes an RDF literal. Other approaches,
[5] F. Cerbah. Learning highly structured semantic repositories
such as R2 O [11] and D2RQ [3], are based on a declara- from relational databases:. In ESWC, pages 777–781, 2008.
tive language that allows the specification of the map be- [6] J. Coffman and A. C. Weaver. A framework for evaluating
tween relational data and RDF. As shown in [8], they all database keyword search strategies. In CIKM, pages
provide rather specific solutions and do not fulfill all the 729–738, 2010.
requirements identified by the RDB2RDF (http://www.w3. [7] S. B. et al. Keyword search over relational databases: a
org/TR/2012/CR-rdb-direct-mapping-20120223/) Working metadata approach. In SIGMOD, pages 565–576, 2011.
Group of the W3C. Inspired by draft methods defined by the [8] M. Hert, G. Reif, and H. C. Gall. A comparison of
W3C, the authors in [13] provide a formal solution where re- rdb-to-rdf mapping languages. In I-SEMANTICS, pages
25–32, 2011.
lational databases are directly mapped to RDF and OWL
[9] F. Holzschuher and R. Peinl. Performance of graph query
trying to preserve the semantics of information in the trans- languages - comparison of cypher, gremlin and native access
formation. All of those proposals focus on mapping rela- in neo4j. In EDBT/ICDT Workshops, pages 195–204, 2013.
tional databases to Semantic Web stores, a problem that is [10] W. Hu and Y. Qu. Discovering simple mappings between
more specific than converting relational to general, graph relational database schemas and ontologies. In
databases, which is our concern. On the other hand, some ISWC/ASWC, pages 225–238, 2007.
approaches have been proposed to the general problem of [11] J. B. Rodrı́guez and A. Gómez-Pérez. Upgrading relational
database translation between different data models (e.g., [2]) legacy data to the semantic web. In WWW, pages
1069–1070, 2006.
but, to the best of our knowledge, there is no work that tack-
[12] M. A. Rodriguez and P. Neubauer. Constructions from dots
les specifically the problem of migrating data and queries and lines. CoRR, abs/1006.2361, 2010.
from a relational to a graph database management system. [13] J. Sequeda, M. Arenas, and D. P. Miranker. On directly
Actually, existing GDBMSs are usually equipped with fa- mapping relational databases to rdf and owl. In WWW,
cilities for importing data from a relational database, but pages 649–658, 2012.
they all rely on naive techniques in which, basically, each [14] P. T. Wood. Query languages for graph databases.
tuple is mapped to a node and foreign keys are mapped to SIGMOD Record, 41(1):50–60, 2012.
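The naive import strategy attributed above to existing GDBMSs (each tuple becomes a node, each foreign key becomes an edge) can be sketched as follows. This is an illustrative sketch only: the input encoding (relations as lists of dicts with an "id" key, foreign keys as column pairs) is an assumption made here, not the API of any particular importer.

```python
def naive_import(relations, foreign_keys):
    """Map each tuple to a node and each foreign-key value to an edge."""
    nodes = {}   # (table, pk value) -> node properties
    edges = []   # ((src table, src pk), (dst table, dst pk)) pairs

    # One node per tuple, keyed by table name and primary key.
    for table, tuples in relations.items():
        for t in tuples:
            nodes[(table, t["id"])] = dict(t, _table=table)

    # One edge per non-null foreign-key value.
    for (src_table, fk_col), (dst_table, _pk_col) in foreign_keys.items():
        for t in relations[src_table]:
            if t.get(fk_col) is not None:
                edges.append(((src_table, t["id"]), (dst_table, t[fk_col])))
    return nodes, edges

# Hypothetical toy instance of the blog schema used in the running example.
relations = {
    "USER": [{"id": 1, "uname": "alice"}],
    "BLOG": [{"id": 10, "bname": "Information Systems", "owner": 1}],
}
foreign_keys = {("BLOG", "owner"): ("USER", "id")}

nodes, edges = naive_import(relations, foreign_keys)
```

Because every tuple yields its own node, a query joining several relations must traverse one node per intermediate tuple, which is why this layout, oblivious to the query load, can make graph traversals expensive.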
