Grades 2014

ABSTRACT

Graph Database Management Systems provide an effective and efficient solution to data storage in current scenarios where data are more and more connected, graph models are widely used, and systems need to scale to large data sets. In this framework, the conversion of the persistent layer of an application from a relational to a graph data store can be convenient, but it is usually a hard task for database administrators. In this paper we propose a methodology to convert a relational to a graph database by exploiting the schema and the constraints of the source. The approach supports the translation of conjunctive SQL queries over the source into graph traversal operations over the target. We provide experimental results that show the feasibility of our solution and the efficiency of query answering over the target database.

1. INTRODUCTION

There are several application domains in which the data have a natural representation as a graph. This happens for instance in the Semantic Web, in social and computer networks, and in geographic applications. In these contexts, relational systems are usually unsuitable to store data since they hardly capture their inherent graph structure. Moreover, and more importantly, graph traversals over highly connected data require complex join operations, which can make typical operations on this kind of data inefficient and applications hard to scale. For these reasons, a brand new category of data stores, called GDBMSs (Graph Database Management Systems), is emerging. In GDBMSs data are natively stored as graphs and queries are expressed in terms of graph traversal operations. This allows applications to scale to very large graph-based data sets. In addition, since GDBMSs do not rely on a rigid schema, they provide a more flexible solution in scenarios where the organization of data evolves rapidly. In this framework, the migration of the persistent layer of an application from a relational to a graph-based storage system can be very beneficial. This task can, however, be very hard for software engineers, and a tool supporting this activity, possibly in an automatic way, is clearly essential. Actually, solutions to this problem already exist [3, 11], but they usually refer to specific target data models, such as RDF. Moreover, they usually follow a naive approach in which, basically, tuples are mapped to nodes and foreign keys to edges; this approach does not take into account the query load and can make graph traversals expensive. Last, but not least, none of them considers the problem of mapping queries over the source into efficient queries over the target. Yet, this is fundamental to reduce the impact on the logic layer of the application and to provide, if needed, a relational view over the target.

In this paper we propose a comprehensive approach to the automatic migration of databases from relational to graph storage systems. Specifically, our technique converts a relational database r into a graph database g and maps any conjunctive query over r into a graph query over g. The translation takes advantage of the integrity constraints defined over the source and tries to minimize the number of accesses needed to answer queries over the target. Intuitively, this is done by storing in the same node data that likely occur together in query results. We refer to a general graph data model and a generic query language for graph databases: this makes the approach independent of the specific GDBMS chosen as a target. In order to test the feasibility of our approach, we have developed a complete system for converting relational to graph databases that implements the technique described above. A number of experiments over available data stores have shown that there is no loss of data in the translation, and that queries over the source are translated into efficient queries over the target.

The rest of the paper is organized as follows. Section 6 discusses related work. In Section 2 we introduce some preliminary notions that are used in Section 3 and in Section 4 to illustrate the data and the query mapping techniques, respectively. Finally, Section 5 discusses some experimental results and Section 7 sketches conclusions and future work.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Proceedings of the First International Workshop on Graph Data Management Experience and Systems (GRADES 2013), June 23, 2013, New York, NY, USA.
Copyright 2013 ACM 978-1-4503-2188-4 ...$15.00.

2. PRELIMINARIES

A graph data model for relational databases. As usual, we assume that: (i) a relational database schema R is a set of relation schemas R1(X1), ..., Rn(Xn), where Ri is the name of the i-th relation and Xi is the set of its attributes, and (ii) a relational database r over R is a set of relations r1, ..., rn over R1(X1), ..., Rn(Xn), respectively, where ri is a set of tuples over Ri(Xi). In the following, we will underline the attributes of a relation that belong to its key.
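The relational notions above (a schema R as a set of relation schemas, and a database r as a set of relations of tuples) can be made concrete with a minimal Python sketch. This encoding is ours, not part of the paper, and all names in it are illustrative.

```python
# Hypothetical encoding of the preliminaries (names are illustrative).
# A relational schema R: relation name -> tuple of attribute names X_i.
R = {
    "User":     ("uid", "uname"),
    "Follower": ("fuser", "fblog"),
}

# A relational database r over R: relation name -> list of tuples,
# each tuple represented as a dict from attribute name to value.
r = {
    "User": [
        {"uid": "u01", "uname": "Date"},
        {"uid": "u02", "uname": "Hunt"},
    ],
    "Follower": [
        {"fuser": "u01", "fblog": "b01"},
        {"fuser": "u02", "fblog": "b01"},
    ],
}

def attributes(schema, rel):
    """Return the attribute set X_i of relation R_i."""
    return set(schema[rel])

def is_database_over(schema, db):
    """Check that every tuple in db is a tuple over its relation schema,
    i.e. it is defined exactly on the attributes of that relation."""
    return all(set(t) == attributes(schema, rel)
               for rel, tuples in db.items() for t in tuples)

print(is_database_over(R, r))  # True: r is a database over R
```

A tuple over Ri(Xi) is simply a mapping defined on exactly the attributes Xi, which is what the check above enforces.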
[Figure 1: An example relational database. The recoverable tables are:

  User (US)            Follower (FR)        Tag (TG)              Blog (BG)
     uid  uname           fuser  fblog         tuser  tcomment       bid  bname                admin
  t1 u01  Date         t3 u01    b01        t7 u02    c01         t8 b01  Information Systems  u02
  t2 u02  Hunt         t4 u01    b02
                       t5 u01    b03
                       t6 u02    b01

A Comment (CT) relation, with attribute cblog and tuples t9-t11 referenced in the text, did not survive extraction.]

[Figure 5: An example of graph database. Only residue of the drawing survives extraction; the node properties mentioned include US.uid, US.uname, FR.fuser, FR.fblog, BG.bid, BG.bname, BG.admin, TG.tuser, TG.tcomment, and CT.cblog.]

languages, are more suitable for an efficient implementation. For the sake of generality, in this paper we consider an abstract traversal query language that adopts an XQuery-like syntax. Expressions of this language are based on path expressions in which, as usual, square brackets denote conditions on nodes and the slash character (/) denotes the relationship between a node n and an edge incoming to or outgoing from n. We will also make use of variables, which range over paths and are denoted by the prefix $, of the for construct, to iterate over path expressions, and of the return construct, to specify the values to return as output.

3. DATA CONVERSION

This section describes our method for converting a relational database r into a graph database g. Usually, existing GDBMSs provide ad-hoc importers implementing a naive approach that creates a node n for each tuple t over a schema R(X) occurring in r, such that n has a property ⟨A, t[A]⟩ for each attribute A ∈ X. Moreover, two nodes n1 and n2 for a pair of tuples t1 and t2 are connected in g if t1 and t2 are joined. Conversely, in our approach we try to aggregate values of different tuples in the same node to speed up traversal operations over g. The basic idea is to store in the same node of g data values that are likely to be retrieved together in the evaluation of queries. Intuitively, these values are those that belong to joinable tuples, that is, tuples t1 and t2 over R1 and R2, respectively, such that there is a foreign key constraint between R1.A and R2.B and t1[A] = t2[B]. Referring to Figure 1, t11 and t8 are joinable tuples, since CT.cblog --fk--> BG.bid and t11[cblog] = t8[bid]. However, by just aggregating together joinable tuples we could run the risk of accumulating a lot of data in each node, which is not appropriate for graph databases. Therefore, we consider a data aggregation strategy based on a more restrictive property, which we call unifiability. First, we need to introduce a preliminary notion. We say that an attribute Ai of a relation R is n2n if: (i) Ai belongs to the key K = {A1, ..., Ak} of R and (ii) for each Aj of K there exists a foreign key constraint R.Aj --fk--> R'.B for some relation R' in r different from R. Intuitively, a set of n2n attributes of a relation implements a many-to-many relationship between entities. Referring again to Figure 1, FR.fuser and FR.fblog are n2n. Then we say that two data values v1 and v2 are unifiable in a relational database r if one of the following holds: (i) there is a tuple t of a relation R in r such that: t[A] = v1, t[B] = v2, and A and B are not n2n; (ii) there is a pair of joinable tuples t1 and t2 of relations R1 and R2, respectively, in r such that: t1[A] = v1, t2[B] = v2, and A is n2n; or (iii) there is a pair of joinable tuples t1 and t2 of relations R1 and R2, respectively, in r such that: t1[A] = v1, t2[B] = v2, A and B are not n2n, and there is no other tuple t3 in r that is joinable with t2.

While this notion seems quite intricate, we show that it guarantees a balanced distribution of data among the nodes of the target graph database and an efficient evaluation of queries over the target that correspond to joins over the source. Indeed, our technique aims at identifying and efficiently aggregating unifiable data by exploiting the schema and constraints of the source relational database. Let us consider the relational database in Figure 1. In this case, data is aggregated in six nodes, as shown in Figure 5. For instance, the node labeled n1 aggregates data values occurring in t1, t3, t4 and t5. Similarly, the node labeled n2 involves data from t9 and t4, while n3 aggregates data values from t8 and t6. In this paper, the data conversion process takes into account only the schema of r. Of course, a set of "frequent" queries over r could also be taken into account; this is the subject of future work.

More in detail, given the relational database r with the schema R, and the set SP of all full schema paths in the relational schema graph RG for R, we generate a graph database g = (N, E) from r as shown in Algorithm 1. Our procedure iterates on the elements of SP; in each iteration, a schema path sp = A1 → ... → Ak is analyzed from the source A1 to the sink Ak. Recall that each Ai of sp corresponds to an attribute in r. The set of data values associated to Ai in the tuples of r is the active domain of Ai: we will use a primitive getAll(r, Ai) that, given the relational database r and an attribute Ai, returns all the values v associated to Ai in r. The set of elements {⟨Ai, vj⟩ | vj ∈ getAll(r, Ai)} is the set of properties to associate to the nodes of g. In our procedure, when we include all the active domain of an attribute Ai in the nodes of g, we say that Ai is visited, i.e. Ai is inserted in a set VS of visited attributes. Therefore, the analysis of a schema path (performed by cond(sp, Ai, VS)) can encounter five cases.

case 1. The current attribute Ai to analyze is a source, i.e. A1, and both Ai and the following attribute Ai+1, i.e. A2, are not visited. In this case we are at the beginning of the migration, and we are creating new nodes from scratch: the function NewNode is responsible for this task. For instance, referring to Figure 3, our procedure analyzes sp1 first; Ai is FR.fuser while Ai+1 is US.uid. Since Ai is a source and Ai+1 is not visited, we encounter case 1. For each
data value in the domain of FR.fuser, that is {u01, u02}, we generate a new node to insert in the set N of g: n1 and n5. Then we include the properties ⟨FR.fuser, u01⟩ and ⟨FR.fuser, u02⟩ in n1 and n5, respectively. At the end, the attribute FR.fuser will be included in VS.

Algorithm 1: Create a graph database g
  Input : A relational database r, a set SP of full schema paths
  Output: A graph database g
  1  VS ← ∅;
  2  g ← (∅, ∅);
  3  foreach sp ∈ SP do
  4    foreach Ai ∈ sp do
  5      switch cond(sp, Ai, VS) do
  6        case 1: NewNode(Ai, r, g);
  7        case 2: NewProperty(Ai, r, g);
  8        case 3: NewProperty(Ai, sp, r, g);
  9        case 4: NewNodeEdge(Ai, sp, r, g);
  10       case 5: NewEdge(Ai, sp, r, g);
  11     VS ← VS ∪ {Ai};
  12 return g;

case 2. The current attribute Ai to analyze is a source, i.e. A1; Ai is not visited but the following attribute Ai+1, i.e. A2, is visited. In this case there is a foreign key constraint between Ai and Ai+1, i.e. Ai → Ai+1. Since Ai+1 is visited, we have a node n ∈ N with the property ⟨Ai+1, v⟩ where v ∈ getAll(r, Ai). Therefore, for each v ∈ getAll(r, Ai) we have to retrieve a node n ∈ N (i.e. the label l associated to n) and to insert a new property ⟨Ai, v⟩ in n, as performed by the function NewProperty taking as input Ai, r, and g. For instance, when we start to analyze sp4 (i.e., sp1, sp2 and sp3 were analyzed), we have Ai = TG.tuser and Ai+1 = US.uid. TG.tuser is a source and not visited, while US.uid is visited, since it was encountered in both sp1 and sp3. Therefore we have case 2: getAll(r, TG.tuser) is {u02} and n5 is the node with the property ⟨US.uid, u02⟩. Finally, we insert the new property ⟨TG.tuser, u02⟩ in n5.

case 3. In this case the current attribute Ai is not visited and is neither a source, nor a hub, nor an n2n node. Therefore we have to iterate on all nodes n generated or updated by analyzing Ai−1. In each node n where a property ⟨Ai−1, v1⟩ was inserted, we have to insert also a property ⟨Ai, v2⟩, as shown in Case 3: we call the function NewProperty taking as input Ai, sp, r, and g. More in detail, we have to understand whether Ai and Ai−1 are in the same relation (i.e. we are in the same tuple) or not (i.e. we are following a foreign key). In the former case we have to extract the data value v2 from the same tuple containing v1 (line 5); otherwise v2 is v1 (line 6). We use the function getTable to retrieve the relation R in r containing a given attribute (lines 3-4). Finally, we insert the new property (by calling the function INS) in the node n to which the label label(n) is associated, coming from the iteration on the attribute Ai−1. For instance, iterating on sp1, when Ai is US.uname and Ai−1 is US.uid we have case 3: we iterate on the nodes n1 and n5 containing the properties ⟨US.uid, u01⟩ and ⟨US.uid, u02⟩, respectively. Since US.uname and US.uid are in the same relation User (US), we extract from US the values associated to US.uname in the tuples t1 and t2, referring to Figure 1. Then we insert the properties ⟨US.uname, Date⟩ and ⟨US.uname, Hunt⟩ in n1 and n5, respectively.

Case 3: NewProperty(Ai, sp, r, g)
  1 Ai−1 ← sp[i − 1];
  2 foreach node n in g such that n contains a property ⟨Ai−1, v1⟩ do
  3   R1 ← getTable(r, Ai−1);
  4   R2 ← getTable(r, Ai);
  5   if R1 = R2 then v2 ← π_Ai σ_{Ai−1 = v1}(R1);
  6   else v2 ← v1;
  7   INS(g, label(n), Ai, v2);

case 4. The current attribute Ai is not visited and it is a hub or an n2n node in g. As in case 3, we have to iterate on all nodes n generated or updated by analyzing Ai−1. Differently from case 3, for each data value in the domain of Ai we generate a new node with label li and we insert the property ⟨Ai, v⟩ in the node. Then we link the node with label lj, generated or updated analyzing Ai−1, to the node with label li just generated. Given the attribute Ai−1 and the relation R which Ai−1 belongs to, the label le assigned to the new edge is built by the concatenation of R and Ai−1. This task is performed by the function NewNodeEdge. Let us consider the schema path sp2 and the attribute FR.fblog as the current attribute Ai to analyze. It is not visited and an n2n node in g. In the previous iteration, the analysis of FR.fuser (i.e. Ai−1) updated the node with label n1. In the current iteration, we have to generate three new nodes, i.e. with labels n2, n3 and n6, and to include the properties ⟨FR.fblog, b01⟩, ⟨FR.fblog, b02⟩, ⟨FR.fblog, b03⟩, respectively, since getAll(r, FR.fblog) is {b01, b02, b03}. Finally, given the label le equal to FOLLOWER_FUSER (FR.fuser belongs to the relation Follower), we generate the edges with label le between n1 and n2, n1 and n3, and n1 and n6.

Case 5: NewEdge(Ai, sp, r, g)
  1 Ai−1 ← sp[i − 1];
  2 foreach node n in g such that n contains a property ⟨Ai−1, v1⟩ do
  3   R1 ← getTable(r, Ai−1); R2 ← getTable(r, Ai);
  4   if R1 = R2 then V ← π_Ai σ_{Ai−1 = v1}(R1);
  5   else V ← {v1};
  6   foreach v ∈ V do
  7     li ← getNode(g, Ai, v);
  8     if li ≠ NIL then le ← build(r, Ai−1); newEdge(g, lj, li, le);

case 5. The last case occurs when we are analyzing the last schema paths and, in particular, the last attributes in a schema path. In this case we link two nodes generated in the previous iterations. The current attribute Ai is (i) not visited and n2n, or (ii) visited and a hub. Moreover, there exists a node in g with a property ⟨Ai, v⟩, and the attribute Ai−1 is not a source. As shown in Case 5, our procedure iterates on the nodes with label lj built or updated analyzing Ai−1 and retrieves the node with label li to link it with the node with label lj. We have to discern whether Ai−1 and Ai are in the same relation or not. Given R1 and R2, the relations which Ai−1 and Ai belong to, respectively, if R1 and R2 are the same then Ai−1 and Ai are in the same tuple and we extract all data values V associated to Ai in the tuple (line 4). Otherwise we are considering a foreign key constraint between Ai−1 and Ai: V is {v1} (line 5), where v1 is the value in the property ⟨Ai−1, v1⟩ included in the node with label lj. Finally, for each data value v in V we retrieve the node with label li including the property ⟨Ai, v⟩ and, if it exists, we link the node with label lj to the node with label li (lines 6-8). Let us consider the schema path sp3 and US.uid as the current attribute Ai. Since in the previous iteration the procedure analyzed sp1, US.uid is visited now; moreover, US.uid is a hub and the previous attribute BG.admin is not a source. We have case 5: since the nodes with labels n1 and n2 contain the properties ⟨US.uid, u01⟩ and ⟨BG.admin, u01⟩, respectively, a new edge with label BLOG_ADMIN is built between those nodes (and similarly between the nodes with labels n3 and n5).

[Figure 6: The query template for Q'. It depicts a node with BG.bname : Information Systems linked through a BLOG_ADMIN edge to a node with US.uname : ?]

straightforward to map QT' into an XQuery-like path traversal expression QPT' as follows:

  for $x in /[BG.bname='Information Systems'],
      $y in $x/BLOG_ADMIN/*
  return $y/US.uname

We start from the node with the property ⟨BG.bname, Information Systems⟩. Moreover, in the condition we express the fact that this node reaches another node through the link BLOG_ADMIN. Finally, from these nodes we return the values of the property with key US.uname (in our example we have only Hunt).
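The semantics of this kind of traversal can be illustrated with a toy Python sketch: select the start nodes by a property condition, follow edges with the given label, and project a property of the reached nodes. This is our illustration, not the paper's system; the node contents and the edge direction below are assumptions based on the running example (n3 holds the Blog data of t8, n5 holds the User data of Hunt).

```python
# Toy property graph: node label -> property map, plus labeled edges.
# The contents are assumed from the running example, not taken verbatim.
nodes = {
    "n3": {"BG.bid": "b01", "BG.bname": "Information Systems", "BG.admin": "u02"},
    "n5": {"US.uid": "u02", "US.uname": "Hunt"},
}
edges = [("n3", "BLOG_ADMIN", "n5")]  # assumed direction: admin node -> user node

def evaluate(nodes, edges, start_key, start_val, edge_label, out_key):
    """Evaluate the pattern:
         for $x in /[start_key = start_val], $y in $x/edge_label/*
         return $y/out_key
    over the toy graph above."""
    result = []
    for x, props in nodes.items():
        # Condition on the start node, as in /[BG.bname='Information Systems'].
        if props.get(start_key) != start_val:
            continue
        # Follow outgoing edges with the requested label.
        for (src, lab, dst) in edges:
            if src == x and lab == edge_label:
                v = nodes[dst].get(out_key)
                if v is not None:
                    result.append(v)
    return result

print(evaluate(nodes, edges, "BG.bname", "Information Systems",
               "BLOG_ADMIN", "US.uname"))  # ['Hunt']
```

On this fragment the traversal returns only Hunt, matching the example above.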
[Figure 7: Cold-cache query execution times on IMDb and Wikipedia; the compared systems are Neo4J, OrientDB, R2G_N, R2G_O, and RDB.]
running the queries) and warm-cache experiments (i.e. without dropping the caches). Figure 7 shows the performance for cold-cache experiments. Due to space constraints, in the figure we report times only for IMDb and Wikipedia, since their much larger size poses more challenges. In particular, we also show the times of the relational database (i.e. RDB) as a global time reference, not for a direct comparison with the relational DBMS. Our system performs consistently better for most of the queries, significantly outperforming the others in some cases (e.g., sets Q21-Q30 or Q31-Q40). We highlight how our data mapping procedure allows OrientDB to perform better than RDB on IMDb (which has a more complex schema). This is due to our strategy reducing the space overhead and consequently the time complexity of the overall process w.r.t. the competitors, which spend much time traversing a large number of nodes. Warm-cache experiments follow a similar trend.

6. RELATED WORK

The need to convert relational data into graph-modeled data [1] emerged particularly with the advent of Linked Open Data (LOD) [8], since many organizations needed to make their information, usually stored in relational databases, available on the Web using RDF. For this reason, several solutions have been proposed to support the translation of relational data into RDF. Some of them focus on mapping the source schema into an ontology [5, 10, 13] and rely on a naive transformation technique in which every relational attribute becomes an RDF predicate and every relational value becomes an RDF literal. Other approaches, such as R2O [11] and D2RQ [3], are based on a declarative language that allows the specification of the mapping between relational data and RDF. As shown in [8], they all provide rather specific solutions and do not fulfill all the requirements identified by the RDB2RDF (http://www.w3.org/TR/2012/CR-rdb-direct-mapping-20120223/) Working Group of the W3C. Inspired by draft methods defined by the W3C, the authors in [13] provide a formal solution where relational databases are directly mapped to RDF and OWL, trying to preserve the semantics of information in the transformation. All of those proposals focus on mapping relational databases to Semantic Web stores, a problem that is more specific than converting relational to general graph databases, which is our concern. On the other hand, some approaches have been proposed for the general problem of database translation between different data models (e.g., [2]) but, to the best of our knowledge, there is no work that tackles specifically the problem of migrating data and queries from a relational to a graph database management system. Actually, existing GDBMSs are usually equipped with facilities for importing data from a relational database, but they all rely on naive techniques in which, basically, each tuple is mapped to a node and foreign keys are mapped to edges. This approach, however, does not fully exploit the capabilities of GDBMSs to represent graph-shaped information. Moreover, there is no support for query translation in these systems. Finally, it should be mentioned that some work has been done on the problem of translating SPARQL queries to SQL to support a relational implementation of RDF databases [13]. But this is different from the problem addressed in this paper.

7. CONCLUSION AND FUTURE WORK

In this paper we have presented an approach to automatically migrate data and queries from relational to graph databases. The translation makes use of the integrity constraints defined over the source to suitably build a target database in which the number of accesses needed to answer queries is reduced. We have also developed a system that implements the translation technique to show the feasibility of our approach and the efficiency of query answering. In future work we intend to refine the technique proposed in this paper to obtain a more compact target database.

8. REFERENCES
[1] R. Angles and C. Gutiérrez. Survey of graph database models. ACM Comput. Surv., 40(1), 2008.
[2] P. Atzeni, P. Cappellari, R. Torlone, P. A. Bernstein, and G. Gianforme. Model-independent schema translation. VLDB J., 17(6):1347-1370, 2008.
[3] C. Bizer. D2R Map - a database to RDF mapping language. In WWW (Posters), 2003.
[4] C. Bizer and A. Schultz. The Berlin SPARQL benchmark. Int. J. Semantic Web Inf. Syst., 5(2):1-24, 2009.
[5] F. Cerbah. Learning highly structured semantic repositories from relational databases. In ESWC, pages 777-781, 2008.
[6] J. Coffman and A. C. Weaver. A framework for evaluating database keyword search strategies. In CIKM, pages 729-738, 2010.
[7] S. B. et al. Keyword search over relational databases: a metadata approach. In SIGMOD, pages 565-576, 2011.
[8] M. Hert, G. Reif, and H. C. Gall. A comparison of RDB-to-RDF mapping languages. In I-SEMANTICS, pages 25-32, 2011.
[9] F. Holzschuher and R. Peinl. Performance of graph query languages - comparison of Cypher, Gremlin and native access in Neo4j. In EDBT/ICDT Workshops, pages 195-204, 2013.
[10] W. Hu and Y. Qu. Discovering simple mappings between relational database schemas and ontologies. In ISWC/ASWC, pages 225-238, 2007.
[11] J. B. Rodríguez and A. Gómez-Pérez. Upgrading relational legacy data to the semantic web. In WWW, pages 1069-1070, 2006.
[12] M. A. Rodriguez and P. Neubauer. Constructions from dots and lines. CoRR, abs/1006.2361, 2010.
[13] J. Sequeda, M. Arenas, and D. P. Miranker. On directly mapping relational databases to RDF and OWL. In WWW, pages 649-658, 2012.
[14] P. T. Wood. Query languages for graph databases. SIGMOD Record, 41(1):50-60, 2012.