The Impact of Data Caching on Query Execution for Linked Data - Olaf Hartig
The document discusses link traversal based query execution for querying linked data on the web. It describes an approach that alternates between evaluating parts of a query on a continuously augmented local dataset, and looking up URIs in solutions to retrieve more data and add it to the local dataset. This allows querying linked data as if it were a single large database, without needing to know all data sources in advance. A key issue is how to efficiently cache retrieved data to avoid redundant lookups.
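The alternation described above, evaluating against a growing local dataset while dereferencing discovered URIs, with a cache suppressing redundant lookups, can be sketched in a few lines. This is a minimal illustration of the idea, not Hartig's implementation; `dereference` and the in-memory `TOY_WEB` are hypothetical stand-ins for real HTTP lookups.

```python
# Minimal sketch of link-traversal query execution with a lookup cache.
# `dereference` stands in for an HTTP lookup of a URI returning RDF
# triples; here it reads from a toy in-memory "web".

TOY_WEB = {
    "ex:alice": [("ex:alice", "knows", "ex:bob")],
    "ex:bob":   [("ex:bob", "knows", "ex:carol")],
    "ex:carol": [("ex:carol", "name", '"Carol"')],
}

def dereference(uri):
    return TOY_WEB.get(uri, [])

def traverse(seed_uris, max_rounds=5):
    """Alternate between extending the local dataset and looking up
    newly discovered URIs; the cache prevents redundant lookups."""
    cache = {}                    # uri -> triples retrieved for it
    dataset = []
    frontier = list(seed_uris)
    for _ in range(max_rounds):
        next_frontier = []
        for uri in frontier:
            if uri in cache:      # cache hit: skip the remote lookup
                continue
            triples = dereference(uri)
            cache[uri] = triples
            dataset.extend(triples)
            # URIs mentioned in new triples feed the next round
            for s, p, o in triples:
                next_frontier.extend(u for u in (s, o) if u.startswith("ex:"))
        if not next_frontier:
            break
        frontier = next_frontier
    return dataset, cache

data, cache = traverse(["ex:alice"])
```

Starting from `ex:alice`, the traversal discovers `ex:bob` and then `ex:carol`; each URI is fetched exactly once even though it reappears in later frontiers.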
Full-Text Retrieval in Unstructured P2P Networks using BloomCast Efficiently - ijsrd.com
Efficient and effective full-text retrieval in unstructured peer-to-peer networks remains a challenge for the research community. First, it is difficult, if not impossible, for unstructured P2P systems to locate items with guaranteed recall. Second, existing schemes that improve the search success rate often rely on replicating a large number of item replicas across the wide-area network, incurring large communication and storage costs. In this paper, we propose BloomCast, an efficient and effective full-text retrieval scheme for unstructured P2P networks. By leveraging a hybrid P2P protocol, BloomCast replicates items uniformly at random across the network, achieving guaranteed recall at a communication cost of O(N), where N is the size of the network. Furthermore, by casting Bloom filters instead of raw documents across the network, BloomCast significantly reduces the communication and storage costs of replication. Results show that BloomCast achieves an average query recall that outperforms the existing WP algorithm by 18 percent, while reducing the search latency of query processing by 57 percent.
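As an illustration of why casting Bloom filters is cheap: a node can advertise its term vocabulary as a small bit array instead of shipping documents, at the price of occasional false positives. The sketch below is a generic Bloom filter, not BloomCast's actual protocol; the size `m` and hash count `k` are arbitrary choices.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: set membership with possible false positives
    but no false negatives, so a node can advertise its term vocabulary
    as m bits instead of replicating raw documents."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)      # the only thing a node must cast

    def _positions(self, item):
        # k independent positions derived from one cryptographic hash
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
for term in ["peer", "retrieval", "network"]:
    bf.add(term)
```

Three terms set at most nine bits of the 1024-bit array, which is what makes casting the filter far cheaper than casting the documents themselves.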
Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs - Andreas Wagner
Many databases today are text-rich, comprising not only structured but also textual data. Querying such databases involves predicates matching structured data combined with string predicates featuring textual constraints. Based on selectivity estimates for these predicates, query processing, as well as other tasks solvable through such queries, can be optimized. Existing work on selectivity estimation focuses either on string predicates or on structured query predicates alone. Further, the probabilistic models proposed to capture dependencies between predicates target the relational setting. In this work, we propose a template-based probabilistic model that enables selectivity estimation for general graph-structured data. Our model allows dependencies between structured data and its text-rich parts to be captured. With this general probabilistic solution, BN+, selectivity estimates can be obtained for queries over text-rich graph-structured data that contain both structured and string predicates (hybrid queries). In experiments on real-world data, we show that capturing dependencies between structured and textual data in this way greatly improves the accuracy of selectivity estimates without compromising efficiency.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
The document discusses linked data and services. It describes the linked data principles of using URIs to name things and including links between URIs. It then discusses querying linked data from multiple sources using either a materialization or distributed query processing approach. It proposes the concept of linked data services that adhere to REST principles and linked data principles by describing their input and output using RDF graph patterns. Integrating linked data services with linked open data could enable querying across both interconnected datasets and services.
This document discusses distributed database systems and distributed query processing. It begins with an introduction that notes the differences between distributed and centralized query processing, including considering the physical data distribution and communication costs during query optimization in distributed systems. The document then provides an overview of its contents, which include discussions of centralized query processing, the basics of distributed query processing, global query optimization, and a summary. It also gives examples of motivations for distributed query processing like low response times, high throughput, and efficient hardware usage.
Abstract:
An increasing number of applications rely on RDF, OWL 2, and SPARQL for storing and querying data. SPARQL, however, is not targeted towards end-users, and suitable query interfaces are needed. Faceted search is a prominent approach for end-user data access, and several RDF-based faceted search systems have been developed. There is, however, a lack of rigorous theoretical underpinning for faceted search in the context of RDF and OWL 2. In this paper, we provide such solid foundations. We formalise faceted interfaces for this context, identify a fragment of first-order logic capturing the underlying queries, and study the complexity of answering such queries for RDF and OWL 2 profiles. We then study interface generation and update, and devise efficiently implementable algorithms. Finally, we have implemented and tested our faceted search algorithms for scalability, with encouraging results.
A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles - Besnik Fetahu
The increasing adoption of Linked Data principles has led to an abundance of datasets on the Web. However, take-up and reuse is hindered by the lack of descriptive information about the nature of the data, such as its topic coverage, dynamics, or evolution. To address this issue, we propose an approach for creating linked dataset profiles. A profile consists of structured dataset metadata describing topics and their relevance. Profiles are generated by configuring techniques for resource sampling from datasets, topic extraction from reference datasets, and topic ranking based on graphical models. To achieve a good trade-off between scalability and accuracy of the generated profiles, appropriate parameters are determined experimentally. Our evaluation considers topic profiles for all accessible datasets from the Linked Open Data cloud. The results show that our approach generates accurate profiles even with comparably small sample sizes (10%) and outperforms established topic modelling approaches.
Usability of Keyword-driven Schema-agnostic Search - A Comparative Study of K... - Thanh Tran
The document describes a 2010 paper on schema-agnostic search approaches for querying linked data. It discusses the motivation for such approaches given complex information needs on the evolving web of data. The paper presents conceptual studies of four widely used schema-agnostic search approaches, and conducts experimental evaluations to assess their efficiency, effectiveness, and usability.
Template-based information access, in which templates are constructed for keywords, is a recent development in linked data information retrieval. However, most such approaches suffer from ineffective template management. Because linked data has a structured data representation, we assume that the statistics inside the data can effectively guide template management. In this work, we exploit this influence for template creation, template ranking, and scaling. Our proposal can effectively be used for automatic linked data information retrieval and can be combined with other techniques, such as ontology inclusion and sophisticated matching, to further improve performance.
The document discusses information retrieval (IR) and provides definitions and examples of different IR models and techniques. It describes how documents and queries can be represented as vectors, with weights like term frequency-inverse document frequency (tf-idf) used to indicate importance. Various IR models are covered, including boolean, vector space, and probabilistic models, along with common weighting and ranking methods used in IR systems.
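The tf-idf weighting and vector-space cosine ranking summarized above can be made concrete with the standard formulas over toy documents (a generic illustration, not the surveyed document's own code):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each term by tf * idf, with idf = log(N / df)."""
    N = len(docs)
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    return [
        {t: tf * math.log(N / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "linked data query".split(),
    "linked data cache".split(),
    "frequent pattern mining".split(),
]
vecs = tfidf_vectors(docs)
```

With these toy documents, the two linked-data documents score higher against each other than against the unrelated third, which is exactly the ranking behaviour tf-idf is meant to produce.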
Search results clustering (SRC) is a challenging algorithmic problem that requires grouping together the results returned by one or more search engines in topically coherent clusters, and labeling the clusters with meaningful phrases describing the topics of the results included in them.
Topic detection by clustering and text mining - IRJET Journal
This document discusses topic detection from text documents using text mining and clustering techniques. It proposes extracting keywords from documents, representing topics as groups of keywords, and using k-means clustering on the keywords to group them into topics. The keywords are extracted based on frequency counts and preprocessed by removing stop words and stemming. The k-means clustering algorithm is used to assign keywords to topics represented by cluster centroids, and the centroids are iteratively updated until cluster assignments converge.
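The clustering step of the pipeline above boils down to plain k-means over keyword vectors. The sketch below is a generic k-means, not the paper's exact configuration; representing each keyword by a binary document-occurrence vector (so co-occurring keywords cluster together) is an illustrative assumption, as are the toy keywords.

```python
def kmeans(points, k, iters=20):
    """Plain k-means: assign each point to its nearest centroid, then
    recompute centroids, until the assignment stabilises."""
    # seed centroids deterministically with the first k distinct points
    centroids = []
    for p in points:
        if p not in centroids:
            centroids.append(p)
        if len(centroids) == k:
            break
    assign = [0] * len(points)
    for _ in range(iters):
        new_assign = [
            min(range(k), key=lambda c: dist2(p, centroids[c]))
            for p in points
        ]
        if new_assign == assign:          # converged
            break
        assign = new_assign
        for c in range(k):                # recompute centroids
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(x) / len(members) for x in zip(*members)
                )
    return assign

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Each keyword is represented by the documents it occurs in
# (a binary occurrence vector over three toy documents).
keywords = ["rdf", "sparql", "cluster", "centroid"]
vectors = [(1, 1, 0), (1, 1, 0), (0, 0, 1), (0, 0, 1)]
labels = kmeans(vectors, k=2)
```

Here "rdf"/"sparql" and "cluster"/"centroid" co-occur, so k-means separates them into two topics.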
Query Distributed RDF Graphs: The Effects of Partitioning - DBOnto
Abstract: Web-scale RDF datasets are increasingly processed using distributed RDF data stores built on top of a cluster of shared-nothing servers. Such systems critically rely on their data partitioning scheme and query answering scheme, the goal of which is to facilitate correct and efficient query processing. Existing data partitioning schemes are commonly based on hashing or graph partitioning techniques. The latter split a dataset in a way that minimises the number of connections between the resulting subsets, thus reducing the need for communication between servers; however, to facilitate efficient query answering, considerable duplication of data at the intersection between subsets is often needed. Building upon the known graph partitioning approaches, in this paper we present a novel data partitioning scheme that employs minimal duplication and keeps track of the connections between partition elements; moreover, we propose a query answering scheme that uses this additional information to correctly answer all queries. We show experimentally that, on certain well-known RDF benchmarks, our data partitioning scheme often allows more answers to be retrieved without distributed computation than the known schemes, and that our query answering scheme can efficiently answer many queries.
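For contrast with the graph-based schemes discussed above, the simplest baseline, hash partitioning, can be sketched in a few lines. This is a generic subject-hash scheme for illustration only, not the paper's partitioning algorithm; the triples and server count are made up.

```python
def hash_partition(triples, n_servers):
    """Subject-hash partitioning: triples sharing a subject land on the
    same server, so subject-centred (star-shaped) queries need no
    communication between servers."""
    parts = [[] for _ in range(n_servers)]
    for s, p, o in triples:
        parts[hash_str(s) % n_servers].append((s, p, o))
    return parts

def hash_str(s):
    # deterministic string hash (Python's built-in hash() is salted
    # per process, so it is unsuitable for stable placement)
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) % (2 ** 32)
    return h

triples = [
    ("ex:a", "p", "ex:b"),
    ("ex:a", "q", "ex:c"),
    ("ex:b", "p", "ex:c"),
]
parts = hash_partition(triples, 2)
```

Both `ex:a` triples are guaranteed to land on the same server; queries that join across subjects (e.g. `ex:a` to `ex:b`) are exactly the ones that may still require inter-server communication, which is the problem the duplication and connection-tracking schemes above address.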
Distributed Algorithm for Frequent Pattern Mining using Hadoop MapReduce Fram... - idescitation
With the rapid growth of information technology, and in many business applications, mining frequent patterns and finding associations among them requires handling large and distributed databases. As the FP-tree is considered the most compact data structure for holding data patterns in memory, there have been efforts to make it parallel and distributed to handle large databases. However, this incurs a lot of communication overhead during mining. In this paper, a parallel and distributed frequent pattern mining algorithm using the Hadoop MapReduce framework is proposed, which shows the best performance results for large databases. The proposed algorithm partitions the database in such a way that it works independently at each local node and locally generates the frequent patterns by sharing the global frequent pattern header table. These local frequent patterns are merged at the final stage. This reduces the overall communication overhead during both structure construction and pattern mining. The itemset count is also taken into consideration, reducing processor idle time. The Hadoop MapReduce framework is used effectively in all steps of the algorithm. Experiments carried out on a PC cluster with 5 computing nodes show execution time efficiency compared to other algorithms. The experimental results show that the proposed algorithm efficiently handles scalability for very large databases.
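The partition-locally-then-merge structure of such algorithms can be illustrated with a MapReduce-style count of frequent item pairs. This is a single-process sketch of the map and reduce roles only, not the paper's Hadoop implementation; the transactions and support threshold are made up.

```python
from collections import Counter
from itertools import combinations

def map_phase(partition):
    """Each mapper counts item pairs within its local database
    partition, independently of the other nodes."""
    counts = Counter()
    for transaction in partition:
        counts.update(combinations(sorted(transaction), 2))
    return counts

def reduce_phase(mapper_outputs, min_support):
    """The reducer merges the local counts and keeps only the
    globally frequent pairs."""
    total = Counter()
    for counts in mapper_outputs:
        total.update(counts)
    return {pair: n for pair, n in total.items() if n >= min_support}

partitions = [
    [{"bread", "milk"}, {"bread", "milk", "eggs"}],   # node 1
    [{"bread", "milk"}, {"eggs", "jam"}],             # node 2
]
frequent = reduce_phase((map_phase(p) for p in partitions), min_support=3)
```

Each node ships only its pair counts, never its transactions, which mirrors how the merge-at-final-stage design keeps communication overhead low.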
EFFICIENTLY PROCESSING OF TOP-K TYPICALITY QUERY FOR STRUCTURED DATA - csandit
This work presents a novel ranking scheme for structured data. We show how to apply the notion of typicality analysis from cognitive science and how to use it to formulate the problem of ranking data with categorical attributes. First, we formalize the typicality query model for relational databases. We adopt the Pearson correlation coefficient to quantify the typicality of an object; the coefficient estimates the strength of the statistical relationship between two variables based on the patterns of occurrence and absence of their values. Second, we develop a top-k query processing method, TPFilter, for efficient computation: it prunes unpromising objects based on tight upper bounds and selectively joins the tuples with the highest typicality scores. Experimental results show our approach is promising for real data.
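The Pearson correlation coefficient used above is computed from the occurrence patterns of two values. Below is a standard self-contained implementation over toy binary occurrence vectors (illustrative data, not the paper's); how the score feeds the TPFilter bounds is outside this sketch.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

# Occurrence indicators of two attribute values across five tuples:
# 1 = the value is present in the tuple, 0 = absent.
a = [1, 1, 0, 0, 1]
b = [1, 1, 0, 0, 0]
```

Values that tend to occur and be absent together score close to 1, which is what makes the coefficient a usable proxy for how typical their co-occurrence is.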
Automated building of taxonomies for search engines - Boris Galitsky
We build a taxonomy of entities intended to improve the relevance of a search engine in a vertical domain. The taxonomy construction process starts from seed entities and mines the web for new entities associated with them. To form these new entities, machine learning of syntactic parse trees (their generalization) is applied to the search results for existing entities to find commonalities between them. These commonality expressions then form parameters of existing entities, and are turned into new entities at the next learning iteration.
Taxonomy and paragraph-level syntactic generalization are applied to relevance improvement in search and to text similarity assessment. We evaluate the search relevance improvement in vertical and horizontal domains and observe a significant contribution of the learned taxonomy in the former, and a noticeable contribution of a hybrid system in the latter. We also perform an industrial evaluation of taxonomy- and syntactic-generalization-based text relevance assessment and conclude that the proposed algorithm for automated taxonomy learning is suitable for integration into industrial systems. The algorithm is implemented as part of the Apache OpenNLP.Similarity project.
Matching and merging anonymous terms from web sources - IJwest
This paper describes a workflow of simplifying and matching special language terms in RDF generated...
This document outlines the BoTLRet system, a template-based linked data information retrieval system. It begins with an introduction to linked data and related work in linked data access. It then describes the problem with current template-based systems and proposes BoTLRet as a solution. BoTLRet constructs templates according to linked data structure and ranks templates using dataset statistics. It can handle queries with two or more keywords by progressively constructing and merging templates for adjacent keyword pairs. The document concludes with experimental results showing BoTLRet achieves close to exhaustive retrieval with lower computational cost than alternative systems, and outperforms other state-of-the-art template-based linked data retrieval systems.
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH - IJDKP
Text mining is an emerging research field evolving from the information retrieval area. Clustering and classification are two approaches in data mining that may also be used to perform text classification and text clustering; the former is supervised while the latter is unsupervised. In this paper, our objective is to perform text clustering by defining an improved distance metric to compute the similarity between two text files. We use incremental frequent pattern mining to find frequent items and reduce dimensionality. The improved distance metric may also be used to perform text classification. The distance metric is validated for the worst, average, and best case situations [15]. The results show the proposed distance metric outperforms the existing measures.
This document provides an overview of probabilistic approaches to information retrieval. It discusses why probabilities are useful for IR given the inherent uncertainty. It covers the Probability Ranking Principle, which aims to rank documents by estimated probability of relevance. Other probabilistic techniques discussed include probabilistic indexing, probabilistic inference using logic representations, and using Bayesian networks for IR. The document notes open issues with some of these approaches and concludes by surveying existing survey papers on probabilistic IR.
The document describes an evaluation of existing relational keyword search systems. It notes discrepancies in how prior studies evaluated systems using different datasets, query workloads, and experimental designs. The evaluation aims to conduct an independent assessment that uses larger, more representative datasets and queries to better understand systems' real-world performance and tradeoffs between effectiveness and efficiency. It outlines schema-based and graph-based search approaches included in the new evaluation.
This document summarizes two algorithms - MFA and ATRA - for processing top-k spatial preference queries. MFA is a threshold-based algorithm that partitions queries into three features - spatial, preference, and text - and retrieves objects with the highest aggregate scores. ATRA uses a hybrid indexing structure called AIR-tree to more efficiently retrieve only relevant objects without revisiting the same data. The paper then proposes using an R-tree index structure combined with an enhanced branch-and-bound search algorithm to answer preference-based top-k spatial keyword queries by ranking objects based on feature quality in their neighborhoods.
Fedbench - A Benchmark Suite for Federated Semantic Data Processing - Peter Haase
(1) FedBench is a benchmark suite for evaluating federated semantic data processing systems.
(2) It includes parameterized benchmark drivers, a variety of RDF datasets and SPARQL queries, and an evaluation framework to measure system performance.
(3) An initial evaluation was conducted to demonstrate FedBench's flexibility in comparing centralized and federated query processing using different systems and scenarios.
Query Processing: the query processing problem and the layers of query processing; query processing in centralized systems (parsing and translation, optimization, code generation, examples); query processing in distributed systems (mapping global queries to local queries, optimization).
This document summarizes research on implementing search-as-you-type functionality in relational database forms. It motivates the approach by noting limitations of existing search paradigms like SQL and keyword search. Key challenges include enabling fast prefix matching, synchronizing local and global search results, handling errors and misspellings, and improving scalability for large databases. Initial achievements include a prototype called Seaform-DBLP that supports basic prefix search over a single database table, but it has limitations: no error tolerance, returning all results rather than the top-k, and being memory-resident rather than native to a database system. Overall, search-as-you-type in database forms shows promise for balancing usability and functionality, but addressing these challenges remains open.
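Fast prefix matching, the first challenge listed above, is commonly served by a trie. The sketch below is a generic in-memory prefix index for illustration, not the Seaform prototype; the class names and toy titles are made up.

```python
class TrieNode:
    __slots__ = ("children", "ids")

    def __init__(self):
        self.children = {}
        self.ids = set()        # record ids reachable via this prefix

class PrefixIndex:
    """In-memory trie for search-as-you-type: every keystroke narrows
    the candidate set with a single O(len(prefix)) walk."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, record_id):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
            node.ids.add(record_id)

    def search(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return set()
            node = node.children[ch]
        return node.ids

idx = PrefixIndex()
for rid, title in enumerate(["dataset", "database", "datum", "query"]):
    idx.insert(title, rid)
```

Typing "dat" and then "data" shrinks the candidate set from three records to two without rescanning the table, which is the behaviour search-as-you-type depends on; error tolerance and top-k ranking would need additional machinery on top of this.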
This document provides an overview of peer-to-peer computing and distributed shared memory. It discusses characteristics of peer-to-peer networks like decentralized control and anonymity. Structured and unstructured overlays are described, with Chord given as an example of a structured overlay using distributed hash tables. Search techniques for unstructured overlays include flooding and random walks. Distributed shared memory provides abstraction through memory consistency models for shared access across distributed nodes.
LODOP - Multi-Query Optimization for Linked Data Profiling Queries - Anja Jentzsch
The document describes LODOP, a system for optimizing Linked Data profiling queries. LODOP implements 15 profiling tasks as Apache Pig scripts and develops 3 optimization rules for executing multiple profiling scripts concurrently. The rules merge identical operators, combine FILTER operators, and combine FOREACH operators to reduce the number of operations and MapReduce jobs. Applying the rules reduces execution time by 70% compared to sequential execution but a more advanced cost-based approach is needed. Future work includes additional optimization rules and strategies.
Usability of Keyword-driven Schema-agnostic Search - A Comparative Study of K...Thanh Tran
The document describes a 2010 paper on schema-agnostic search approaches for querying linked data. It discusses the motivation for such approaches given complex information needs on the evolving web of data. The paper presents conceptual studies of four widely used schema-agnostic search approaches, and conducts experimental evaluations to assess their efficiency, effectiveness, and usability.
Template-based information access, in which templates are constructed for keywords, is a recent development of linked data information retrieval. However, most such approaches suffer from ineffective template management. Because linked data has a structured data representation, we assume the data’s inside statistics can effectively influence template management. In this work, we use this influence for template
creation, template ranking, and scaling. Our proposal can effectively be used for automatic linked data information retrieval and can be incorporated with other techniques such as ontology inclusion and sophisticated matching to further improve performance.
The document discusses information retrieval (IR) and provides definitions and examples of different IR models and techniques. It describes how documents and queries can be represented as vectors, with weights like term frequency-inverse document frequency (tf-idf) used to indicate importance. Various IR models are covered, including boolean, vector space, and probabilistic models, along with common weighting and ranking methods used in IR systems.
Search results clustering (SRC) is a challenging algorithmic
problem that requires grouping together the results returned
by one or more search engines in topically coherent clusters,
and labeling the clusters with meaningful phrases describing
the topics of the results included in them.
Topic detecton by clustering and text miningIRJET Journal
This document discusses topic detection from text documents using text mining and clustering techniques. It proposes extracting keywords from documents, representing topics as groups of keywords, and using k-means clustering on the keywords to group them into topics. The keywords are extracted based on frequency counts and preprocessed by removing stop words and stemming. The k-means clustering algorithm is used to assign keywords to topics represented by cluster centroids, and the centroids are iteratively updated until cluster assignments converge.
Query Distributed RDF Graphs: The Effects of Partitioning PaperDBOnto
Abstract: Web-scale RDF datasets are increasingly processed using distributed RDF data stores built on top of a cluster of shared-nothing servers. Such systems critically rely on their data partitioning scheme and query answering scheme, the goal of which is to facilitate correct and ecient query processing. Existing data partitioning schemes are
commonly based on hashing or graph partitioning techniques. The latter techniques split a dataset in a way that minimises the number of connections between the resulting subsets, thus reducing the need for communication between servers; however, to facilitate ecient query answering,
considerable duplication of data at the intersection between subsets is often needed. Building upon the known graph partitioning approaches, in this paper we present a novel data partitioning scheme that employs minimal duplication and keeps track of the connections between partition elements; moreover, we propose a query answering scheme that
uses this additional information to correctly answer all queries. We show experimentally that, on certain well-known RDF benchmarks, our data partitioning scheme often allows more answers to be retrieved without distributed computation than the known schemes, and we show that our query answering scheme can eciently answer many queries.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Distributed Algorithm for Frequent Pattern Mining using HadoopMap Reduce Fram...idescitation
With the rapid growth of information technology and in many business
applications, mining frequent patterns and finding associations among them requires
handling large and distributed databases. As FP-tree considered being the best compact data
structure to hold the data patterns in memory there has been efforts to make it parallel and
distributed to handle large databases. However, it incurs lot of communication over head
during the mining. In this paper parallel and distributed frequent pattern mining algorithm
using Hadoop Map Reduce framework is proposed, which shows best performance results
for large databases. Proposed algorithm partitions the database in such a way that, it works
independently at each local node and locally generates the frequent patterns by sharing the
global frequent pattern header table. These local frequent patterns are merged at final stage.
This reduces the complete communication overhead during structure construction as well as
during pattern mining. The item set count is also taken into consideration reducing
processor idle time. Hadoop Map Reduce framework is used effectively in all the steps of the
algorithm. Experiments are carried out on a PC cluster with 5 computing nodes which
shows execution time efficiency as compared to other algorithms. The experimental result
shows that proposed algorithm efficiently handles the scalability for very large datab ases.
Index Terms—
EFFICIENTLY PROCESSING OF TOP-K TYPICALITY QUERY FOR STRUCTURED DATAcsandit
This work presents a novel ranking scheme for structured data. We show how to apply the
notion of typicality analysis from cognitive science and how to use this notion to formulate the
problem of ranking data with categorical attributes. First, we formalize the typicality query
model for relational databases. We adopt Pearson correlation coefficient to quantify the extent
of the typicality of an object. The correlation coefficient estimates the extent of statistical
relationships between two variables based on the patterns of occurrences and absences of their
values. Second, we develop a top-k query processing method for efficient computation. TPFilter
prunes unpromising objects based on tight upper bounds and selectively joins tuples of highest
typicality score. Our methods efficiently prune unpromising objects based on upper bounds.
Experimental results show our approach is promising for real data.
Automated building of taxonomies for search enginesBoris Galitsky
We build a taxonomy of entities which is intended to improve the relevance of a search engine in a vertical domain. The taxonomy construction process starts from seed entities and mines the web for new entities associated with them. To form these new entities, machine learning of syntactic parse trees (their generalization) is applied to the search results for existing entities to form commonalities between them. These commonality expressions then form parameters of existing entities, and are turned into new entities at the next learning iteration.
Taxonomy and paragraph-level syntactic generalization are applied to relevance improvement in search and text similarity assessment. We conduct an evaluation of the search relevance improvement in vertical and horizontal domains and observe a significant contribution of the learned taxonomy in the former, and a noticeable contribution of a hybrid system in the latter domain. We also perform an industrial evaluation of taxonomy- and syntactic-generalization-based text relevance assessment and conclude that the proposed algorithm for automated taxonomy learning is suitable for integration into industrial systems. The proposed algorithm is implemented as part of the Apache OpenNLP.Similarity project.
Matching and merging anonymous terms from web sourcesIJwest
This paper describes a workflow of simplifying and matching special language terms in RDF generated from web sources.
This document outlines the BoTLRet system, a template-based linked data information retrieval system. It begins with an introduction to linked data and related work in linked data access. It then describes the problem with current template-based systems and proposes BoTLRet as a solution. BoTLRet constructs templates according to linked data structure and ranks templates using dataset statistics. It can handle queries with two or more keywords by progressively constructing and merging templates for adjacent keyword pairs. The document concludes with experimental results showing BoTLRet achieves close to exhaustive retrieval with lower computational cost than alternative systems, and outperforms other state-of-the-art template-based linked data retrieval systems.
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHIJDKP
Text mining is an emerging research field evolving from the information retrieval area. Clustering and
classification are two approaches in data mining which may also be used to perform text clustering
and text classification; the former is unsupervised while the latter is supervised. In this paper, our objective is
to perform text clustering by defining an improved distance metric to compute the similarity between two
text files. We use incremental frequent pattern mining to find frequent items and reduce dimensionality.
The improved distance metric may also be used to perform text classification. The distance metric is
validated for the worst, average, and best case situations [15]. The results show the proposed distance
metric outperforms existing measures.
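To make the idea concrete, here is a minimal sketch (not the paper's actual metric) of using frequent 1-itemsets to reduce dimensionality and then measuring distance only over that frequent vocabulary. All names, the support threshold, and the Jaccard-style distance are illustrative assumptions.

```python
from collections import Counter

def frequent_items(docs, min_support=2):
    """Words appearing in at least `min_support` documents (frequent 1-itemsets)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return {w for w, c in df.items() if c >= min_support}

def distance(doc_a, doc_b, frequent):
    """Jaccard distance restricted to the frequent vocabulary."""
    a = set(doc_a.lower().split()) & frequent
    b = set(doc_b.lower().split()) & frequent
    if not a and not b:
        return 1.0
    return 1.0 - len(a & b) / len(a | b)

docs = ["data mining of text", "text mining methods", "cooking recipes at home"]
freq = frequent_items(docs)               # only words shared by >= 2 docs survive
d_close = distance(docs[0], docs[1], freq)
d_far = distance(docs[0], docs[2], freq)
```

Restricting the vocabulary to frequent items is what keeps the metric cheap as the collection grows incrementally.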
This document provides an overview of probabilistic approaches to information retrieval. It discusses why probabilities are useful for IR given the inherent uncertainty. It covers the Probability Ranking Principle, which aims to rank documents by estimated probability of relevance. Other probabilistic techniques discussed include probabilistic indexing, probabilistic inference using logic representations, and using Bayesian networks for IR. The document notes open issues with some of these approaches and concludes by surveying existing survey papers on probabilistic IR.
The document describes an evaluation of existing relational keyword search systems. It notes discrepancies in how prior studies evaluated systems using different datasets, query workloads, and experimental designs. The evaluation aims to conduct an independent assessment that uses larger, more representative datasets and queries to better understand systems' real-world performance and tradeoffs between effectiveness and efficiency. It outlines schema-based and graph-based search approaches included in the new evaluation.
This document summarizes two algorithms - MFA and ATRA - for processing top-k spatial preference queries. MFA is a threshold-based algorithm that partitions queries into three features - spatial, preference, and text - and retrieves objects with the highest aggregate scores. ATRA uses a hybrid indexing structure called AIR-tree to more efficiently retrieve only relevant objects without revisiting the same data. The paper then proposes using an R-tree index structure combined with an enhanced branch-and-bound search algorithm to answer preference-based top-k spatial keyword queries by ranking objects based on feature quality in their neighborhoods.
Fedbench - A Benchmark Suite for Federated Semantic Data ProcessingPeter Haase
(1) FedBench is a benchmark suite for evaluating federated semantic data processing systems.
(2) It includes parameterized benchmark drivers, a variety of RDF datasets and SPARQL queries, and an evaluation framework to measure system performance.
(3) An initial evaluation was conducted to demonstrate FedBench's flexibility in comparing centralized and federated query processing using different systems and scenarios.
Query Processing: the query processing problem and layers of query processing. Query processing in centralized systems: parsing and translation, optimization, code generation. Query processing in distributed systems: mapping a global query to local queries, optimization.
This document summarizes research on implementing search-as-you-type functionality in relational database forms. It motivates this approach by noting limitations of existing search paradigms like SQL and keyword search. Key challenges include enabling fast prefix matching, synchronizing local and global search results, handling errors and misspellings, and improving scalability for large databases. Initial achievements include a prototype called Seaform-DBLP that supports basic prefix search of a single database table, but has limitations around error tolerance, returning all results rather than top-k, and being memory-resident rather than native to a database system. Overall, search-as-you-type in database forms shows promise for balancing usability and functionality, but addressing these challenges remains open.
This document provides an overview of peer-to-peer computing and distributed shared memory. It discusses characteristics of peer-to-peer networks like decentralized control and anonymity. Structured and unstructured overlays are described, with Chord given as an example of a structured overlay using distributed hash tables. Search techniques for unstructured overlays include flooding and random walks. Distributed shared memory provides abstraction through memory consistency models for shared access across distributed nodes.
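As an illustration of the Chord idea mentioned above, the sketch below hashes node names and keys onto an identifier ring and stores each key at its successor node. The `Ring` class and the 16-bit identifier space are simplifications for the example; real Chord uses SHA-1's 160-bit space and finger tables for O(log N) lookups.

```python
import hashlib
from bisect import bisect_left

M = 2 ** 16  # toy identifier space; real Chord uses 2^160 (SHA-1)

def node_id(name: str) -> int:
    """Hash a node name or key onto the identifier ring."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % M

class Ring:
    """Minimal Chord-style ring: a key lives on its successor node
    (the first node whose identifier is >= the key's identifier)."""
    def __init__(self, nodes):
        self.ids = sorted(node_id(n) for n in nodes)
        self.by_id = {node_id(n): n for n in nodes}

    def successor(self, key: str) -> str:
        i = bisect_left(self.ids, node_id(key))
        return self.by_id[self.ids[i % len(self.ids)]]  # wrap around the ring

ring = Ring(["node-a", "node-b", "node-c"])
owner = ring.successor("some-file.txt")  # deterministic key placement
```

Because placement depends only on the hash, any peer can locate a key's owner without flooding, which is the structured overlay's advantage over random walks.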
LODOP - Multi-Query Optimization for Linked Data Profiling QueriesAnja Jentzsch
The document describes LODOP, a system for optimizing Linked Data profiling queries. LODOP implements 15 profiling tasks as Apache Pig scripts and develops 3 optimization rules for executing multiple profiling scripts concurrently. The rules merge identical operators, combine FILTER operators, and combine FOREACH operators to reduce the number of operations and MapReduce jobs. Applying the rules reduces execution time by 70% compared to sequential execution but a more advanced cost-based approach is needed. Future work includes additional optimization rules and strategies.
Machine Language and Pattern Analysis IEEE 2015 ProjectsVijay Karan
List of Machine Language and Pattern Analysis IEEE 2015 Projects. It contains the IEEE projects in the domain Machine Language and Pattern Analysis for the year 2015.
Executing Provenance-Enabled Queries over Web DataeXascale Infolab
The proliferation of heterogeneous Linked Data on the Web poses new challenges to database systems. In particular, because of this heterogeneity, the capacity to store, track, and query provenance data is becoming a pivotal feature of modern triple stores. In this paper, we tackle the problem of efficiently executing provenance-enabled queries over RDF data. We propose, implement and empirically evaluate five different query execution strategies for RDF queries that incorporate knowledge of provenance. The evaluation is conducted on Web Data obtained from two different Web crawls (The Billion Triple Challenge, and the Web Data Commons). Our evaluation shows that using an adaptive query materialization execution strategy performs best in our context. Interestingly, we find that because provenance is prevalent within Web Data and is highly selective, it can be used to improve query processing performance. This is a counterintuitive result as provenance is often associated with additional overhead.
Metabolomic Data Analysis Workshop and Tutorials (2014)Dmitry Grapov
This document provides an introduction and overview of tutorials for metabolomic data analysis. It discusses downloading required files and software. The goals of the analysis include using statistical and multivariate analyses to identify differences between sample groups and impacted biochemical domains. It also discusses various data analysis techniques including data quality assessment, univariate and multivariate statistical analyses, clustering, principal component analysis, partial least squares modeling, functional enrichment analysis, and network mapping.
Using CA ERwin modeling to assure data 09162010ERwin Modeling
Data profiling analyzes data content to infer metadata and increase the accuracy of data assets and models. It can help with data quality assessments, master data management, and reducing risks in data warehousing projects. The presentation provided examples of how profiling was used to uncover issues, validate models and requirements, standardize values, and reduce development times for various organizations.
Machine Learned Relevance at A Large Scale Search EngineSalford Systems
The document discusses machine learned relevance at a large scale search engine. It provides biographies of the two authors who have extensive experience in machine learning and search engines. It then outlines the topics to be covered, including an introduction to machine learned ranking for search, relevance evaluation methodologies, data collection and metrics, the Quixey search engine system, model training approaches, and conclusions.
The document describes Panda, a system for managing data provenance and workflows. Panda aims to merge data and process provenance, define provenance operators to query and analyze mixed data and provenance, and create an open-source configurable system. An example workflow demonstrates deduplicating and processing datasets to predict purchased items. Panda allows for backward and forward tracing of data and refreshing results due to new data. It implements a query language and uses predicates to trace data back to its origins.
Performance Analysis of MapReduce Implementations on High Performance Homolog...Koichi Shirahata
This document describes performance analyses of MapReduce implementations for large-scale homology searches. It introduces homology searches and their use in metagenome analysis using sequence databases that are growing enormously in size. Two MapReduce designs for homology searches are proposed: one replicates the database on all nodes, while the other distributes the database. Preliminary experiments show MapReduce exhibits good scaling and comparable performance to MPI implementations. The goal is high-performance MapReduce homology searches for extremely large databases.
The document proposes a novel ranking approach called Manifold Ranking with Sink Points (MRSP) that addresses relevance, importance, and diversity simultaneously. MRSP uses manifold ranking over data objects to find the most relevant and important objects. It then designates ranked objects as "sink points" to prevent redundant objects from receiving high ranks. The approach is applied to update summarization and query recommendation tasks, demonstrating strong performance compared to existing methods.
Sharing massive data analysis: from provenance to linked experiment reportsGaignard Alban
The document discusses scientific workflows, provenance, and linked data. It covers:
1) Scientific workflows can automate data analysis at scale, abstract complex processes, and capture provenance for transparency.
2) Provenance represents the origin and history of data and can be represented using standards like PROV. It allows reasoning about how results were produced.
3) Capturing and publishing provenance as linked open data can help make scientific results more reusable and queryable, but challenges remain around multi-site studies and producing human-readable reports.
This document discusses optimizing database management of large-scale web access logs. It proposes using a pre-processor to hash and sort logs in memory before writing to the database. An experiment compares the performance of using the pre-processor versus directly writing to the database. The results show the pre-processor is 18-20 times faster for input time and memory usage is twice as high but run time is much better compared to only using the database. The document concludes the proposed approach of using a pre-processor for in-memory processing before database storage provides better performance and optimization than traditional approaches.
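A minimal sketch of the pre-processing idea described above, assuming plain-text log lines: aggregate hits in an in-memory hash table and emit one sorted batch for a bulk database insert, instead of issuing one write per raw line. The function and input format are hypothetical.

```python
def preprocess(log_lines):
    """Hash-aggregate raw access-log lines in memory, then emit them sorted,
    so the database receives one ordered bulk insert instead of many random writes."""
    counts = {}                       # hash table: line -> hit count
    for line in log_lines:
        line = line.strip()
        if line:
            counts[line] = counts.get(line, 0) + 1
    return sorted(counts.items())     # sorted batch, ready for a bulk INSERT

batch = preprocess([
    "/index.html 200\n",
    "/about.html 200\n",
    "/index.html 200\n",
])
```

Hashing deduplicates repeated requests and sorting turns the database load into sequential, index-friendly writes, which is where the reported speedup comes from.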
Efficient top-k queries processing in column-family distributed databasesRui Vieira
The document discusses efficient top-k query processing on distributed column family databases. It begins by introducing top-k queries and their uses. It then discusses challenges with naive solutions and prior work using batch processing. The document proposes three algorithms - TPUT, Hybrid Threshold, and KLEE - to enable real-time top-k queries on distributed data in a memory, bandwidth, and computation efficient manner. It also discusses implementation considerations for Cassandra's data model and CQL.
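The TPUT algorithm mentioned above can be sketched roughly as follows: a simplified, single-process model of its three phases. The names and the guard against small inputs are illustrative; this is not a faithful distributed implementation.

```python
def tput_topk(node_lists, k):
    """Simplified TPUT over m nodes, each a dict {item: local_score}.
    Returns the exact global top-k while examining only part of each list."""
    m = len(node_lists)

    # Phase 1: every node reports its local top-k; build partial sums.
    partial = {}
    for scores in node_lists:
        for item, s in sorted(scores.items(), key=lambda kv: -kv[1])[:k]:
            partial[item] = partial.get(item, 0) + s
    # The k-th highest partial sum is a lower bound on the k-th highest total.
    tau = sorted(partial.values(), reverse=True)[min(k, len(partial)) - 1]

    # Phase 2: fetch every item scoring at least tau/m somewhere; an item
    # below tau/m on all m nodes cannot reach a total of tau (pigeonhole).
    candidates = set(partial)
    for scores in node_lists:
        candidates |= {i for i, s in scores.items() if s >= tau / m}

    # Phase 3: exact totals for the surviving candidates only.
    totals = {i: sum(sc.get(i, 0) for sc in node_lists) for i in candidates}
    return sorted(totals.items(), key=lambda kv: -kv[1])[:k]

top2 = tput_topk([{"a": 10, "b": 4, "c": 1}, {"a": 3, "b": 9, "d": 8}], 2)
```

The uniform threshold tau/m is what bounds the data each node must ship, which is why TPUT suits bandwidth-constrained column-family stores.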
This document discusses keyword query routing to identify relevant data sources for keyword searches over multiple structured and linked data sources. It proposes using a multilevel inter-relationship graph and scoring mechanism to compute relevance and generate routing plans that route keywords only to pertinent sources. This improves keyword search performance without compromising result quality. An algorithm is developed based on modeling the search space and developing a summary model to incorporate relevance at different levels and dimensions. Experiments showed the summary model preserves relevant information compactly.
Standard Datasets in Information Retrieval Jean Brenda
The document discusses standard datasets used for information retrieval (IR) system evaluation and research. It describes several major datasets including the Cranfield collection, which was the first test collection and used aeronautical papers, and the Text REtrieval Conference (TREC) collection, which is a large collection of newswire articles. It also mentions other datasets like Gov2, NTCIR, CLEF, and 20Newsgroups. The datasets provide documents, queries, and relevance judgments and allow comparison of IR systems and algorithms.
M phil-computer-science-machine-language-and-pattern-analysis-projectsVijay Karan
List of Machine Language and Pattern Analysis IEEE 2006 Projects. It contains the IEEE projects in the domain Machine Language and Pattern Analysis for M.Phil Computer Science students.
LDQL: A Query Language for the Web of Linked DataOlaf Hartig
I used this slideset to present our research paper at the 14th Int. Semantic Web Conference (ISWC 2015). Find a preprint of the paper here:
http://olafhartig.de/files/HartigPerez_ISWC2015_Preprint.pdf
A Context-Based Semantics for SPARQL Property Paths over the WebOlaf Hartig
- The document proposes a formal context-based semantics for evaluating SPARQL property path queries over the Web of Linked Data.
- This semantics defines how to compute the results of such queries in a well-defined manner and ensures the "web-safeness" of queries, meaning they can be executed directly over the Web without prior knowledge of all data.
- The paper presents a decidable syntactic condition for identifying SPARQL property path queries that are web-safe based on their sets of conditionally bounded variables.
Rethinking Online SPARQL Querying to Support Incremental Result VisualizationOlaf Hartig
These are the slides of my invited talk at the 5th Int. Workshop on Usage Analysis and the Web of Data (USEWOD 2015): http://usewod.org/usewod2015.html
The abstract of this talks is given as follows:
To reduce user-perceived response time many interactive Web applications visualize information in a dynamic, incremental manner. Such an incremental presentation can be particularly effective for cases in which the underlying data processing systems are not capable of completely answering the users' information needs instantaneously. An example of such systems are systems that support live querying of the Web of Data, in which case query execution times of several seconds, or even minutes, are an inherent consequence of these systems' ability to guarantee up-to-date results. However, support for an incremental result visualization has not received much attention in existing work on such systems. Therefore, the goal of this talk is to discuss approaches that enable query systems for the Web of Data to return query results incrementally.
Tutorial "Linked Data Query Processing" Part 2 "Theoretical Foundations" (WWW...Olaf Hartig
This document summarizes the theoretical foundations of linked data query processing presented in a tutorial. It discusses the SPARQL query language, data models for linked data queries, full-web and reachability-based query semantics. Under full-web semantics, a query is computable if its pattern is monotonic, and eventually computable otherwise. Reachability-based semantics restrict queries to data reachable from a set of seed URIs. Queries under this semantics are always finitely computable if the web is finite. The document outlines computability results and properties regarding satisfiability and monotonicity for different semantics.
An Overview on PROV-AQ: Provenance Access and QueryOlaf Hartig
The slides which I used at the Dagstuhl seminar on Principles of Provenance (Feb.2012) for presenting the main contributions and open issues of the PROV-AQ document created by the W3C provenance working group.
Zero-Knowledge Query Planning for an Iterator Implementation of Link Traversa...Olaf Hartig
The document describes zero-knowledge query planning for an iterator-based implementation of link traversal-based query execution. It discusses generating all possible query execution plans from the triple patterns in a query and selecting the optimal plan using heuristics without actually executing the plans. The key heuristics explored are using a seed triple pattern containing a URI as the first pattern, avoiding vocabulary terms as seeds, and placing filtering patterns close to the seed pattern. Evaluation involves generating all plans and executing each repeatedly to estimate costs and benefits for plan selection.
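A rough sketch of such zero-knowledge heuristics, under the assumption that triple patterns are (s, p, o) tuples of `<URI>` strings and `?var` names: score seed candidates by preferring non-vocabulary URIs, then greedily chain patterns that share already-bound variables. This is illustrative, not the paper's actual planner.

```python
VOCAB = ("http://www.w3.org/1999/02/22-rdf-syntax-ns#",
         "http://www.w3.org/2000/01/rdf-schema#")

def order_patterns(patterns):
    """Order triple patterns for left-to-right execution: pick as seed a
    pattern with non-vocabulary URIs in subject/object position, then
    greedily append patterns sharing already-bound variables."""
    def seed_score(tp):
        s, p, o = tp
        score = 0
        for term in (s, o):             # predicates are vocabulary terms anyway
            if term.startswith("<"):
                score += 1 if any(v in term for v in VOCAB) else 2
        return score

    remaining = sorted(patterns, key=seed_score, reverse=True)
    plan = [remaining.pop(0)]
    bound = {t for t in plan[0] if t.startswith("?")}
    while remaining:
        # next, the pattern sharing the most already-bound variables
        nxt = max(remaining, key=lambda tp: sum(t in bound for t in tp))
        remaining.remove(nxt)
        plan.append(nxt)
        bound |= {t for t in nxt if t.startswith("?")}
    return plan

patterns = [
    ("?p", "<http://xmlns.com/foaf/0.1/name>", "?n"),
    ("<http://bob.name#me>", "<http://xmlns.com/foaf/0.1/knows>", "?p"),
    ("?p", "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>", "?t"),
]
plan = order_patterns(patterns)
```

The point of "zero knowledge" is that this ordering uses only the syntactic shape of the query, with no statistics about the data behind the URIs.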
Brief Introduction to the Provenance Vocabulary (for W3C prov-xg)Olaf Hartig
The document describes the Provenance Vocabulary, which defines an OWL ontology for describing provenance metadata on the Semantic Web. The vocabulary aims to integrate provenance into the Web of data to enable quality assessment. It partitions provenance descriptions into a core ontology and supplementary modules. Examples are provided to illustrate how the vocabulary can be used to describe the provenance of Linked Data, including information about data creation and retrieval processes. The design principles emphasize usability, flexibility, and integration with other vocabularies. Future work includes further alignment and additional modules to cover more provenance aspects.
Web & Graphics Designing Training at Erginous Technologies in Rajpura offers practical, hands-on learning for students, graduates, and professionals aiming for a creative career. The 6-week and 6-month industrial training programs blend creativity with technical skills to prepare you for real-world opportunities in design.
The course covers Graphic Designing tools like Photoshop, Illustrator, and CorelDRAW, along with logo, banner, and branding design. In Web Designing, you’ll learn HTML5, CSS3, JavaScript basics, responsive design, Bootstrap, Figma, and Adobe XD.
Erginous emphasizes 100% practical training, live projects, portfolio building, expert guidance, certification, and placement support. Graduates can explore roles like Web Designer, Graphic Designer, UI/UX Designer, or Freelancer.
For more info, visit erginous.co.in , message us on Instagram at erginoustechnologies, or call directly at +91-89684-38190 . Start your journey toward a creative and successful design career today!
Technology Trends in 2025: AI and Big Data AnalyticsInData Labs
At InData Labs, we have been keeping an ear to the ground, looking out for AI-enabled digital transformation trends coming our way in 2025. Our report will provide a look into the technology landscape of the future, including:
-Artificial Intelligence Market Overview
-Strategies for AI Adoption in 2025
-Anticipated drivers of AI adoption and transformative technologies
-Benefits of AI and Big data for your business
-Tips on how to prepare your business for innovation
-AI and data privacy: Strategies for securing data privacy in AI models, etc.
Download your free copy now and implement the key findings to improve your business.
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025BookNet Canada
Book industry standards are evolving rapidly. In the first part of this session, we’ll share an overview of key developments from 2024 and the early months of 2025. Then, BookNet’s resident standards expert, Tom Richardson, and CEO, Lauren Stewart, have a forward-looking conversation about what’s next.
Link to recording, transcript, and accompanying resource: https://bnctechforum.ca/sessions/standardsgoals-for-2025-standards-certification-roundup/
Presented by BookNet Canada on May 6, 2025 with support from the Department of Canadian Heritage.
UiPath Agentic Automation: Community Developer OpportunitiesDianaGray10
Please join our UiPath Agentic: Community Developer session where we will review some of the opportunities that will be available this year for developers wanting to learn more about Agentic Automation.
Bepents tech services - a premier cybersecurity consulting firmBenard76
Introduction
Bepents Tech Services is a premier cybersecurity consulting firm dedicated to protecting digital infrastructure, data, and business continuity. We partner with organizations of all sizes to defend against today’s evolving cyber threats through expert testing, strategic advisory, and managed services.
🔎 Why You Need us
Cyberattacks are no longer a question of “if”—they are a question of “when.” Businesses of all sizes are under constant threat from ransomware, data breaches, phishing attacks, insider threats, and targeted exploits. While most companies focus on growth and operations, security is often overlooked—until it’s too late.
At Bepents Tech, we bridge that gap by being your trusted cybersecurity partner.
🚨 Real-World Threats. Real-Time Defense.
Sophisticated Attackers: Hackers now use advanced tools and techniques to evade detection. Off-the-shelf antivirus isn’t enough.
Human Error: Over 90% of breaches involve employee mistakes. We help build a "human firewall" through training and simulations.
Exposed APIs & Apps: Modern businesses rely heavily on web and mobile apps. We find hidden vulnerabilities before attackers do.
Cloud Misconfigurations: Cloud platforms like AWS and Azure are powerful but complex—and one misstep can expose your entire infrastructure.
💡 What Sets Us Apart
Hands-On Experts: Our team includes certified ethical hackers (OSCP, CEH), cloud architects, red teamers, and security engineers with real-world breach response experience.
Custom, Not Cookie-Cutter: We don’t offer generic solutions. Every engagement is tailored to your environment, risk profile, and industry.
End-to-End Support: From proactive testing to incident response, we support your full cybersecurity lifecycle.
Business-Aligned Security: We help you balance protection with performance—so security becomes a business enabler, not a roadblock.
📊 Risk is Expensive. Prevention is Profitable.
A single data breach costs businesses an average of $4.45 million (IBM, 2023).
Regulatory fines, loss of trust, downtime, and legal exposure can cripple your reputation.
Investing in cybersecurity isn’t just a technical decision—it’s a business strategy.
🔐 When You Choose Bepents Tech, You Get:
Peace of Mind – We monitor, detect, and respond before damage occurs.
Resilience – Your systems, apps, cloud, and team will be ready to withstand real attacks.
Confidence – You’ll meet compliance mandates and pass audits without stress.
Expert Guidance – Our team becomes an extension of yours, keeping you ahead of the threat curve.
Security isn’t a product. It’s a partnership.
Let Bepents tech be your shield in a world full of cyber threats.
🌍 Our Clientele
At Bepents Tech Services, we’ve earned the trust of organizations across industries by delivering high-impact cybersecurity, performance engineering, and strategic consulting. From regulatory bodies to tech startups, law firms, and global consultancies, we tailor our solutions to each client's unique needs.
TrsLabs - AI Agents for All - Chatbots to Multi-Agents SystemsTrs Labs
AI Adoption for Your Business
AI applications have evolved from chatbots
into sophisticated AI agents capable of
handling complex workflows. Multi-agent
systems are the next phase of evolution.
Generative Artificial Intelligence (GenAI) in BusinessDr. Tathagat Varma
My talk for the Indian School of Business (ISB) Emerging Leaders Program Cohort 9. In this talk, I discussed key issues around adoption of GenAI in business - benefits, opportunities and limitations. I also discussed how my research on Theory of Cognitive Chasms helps address some of these issues
Artificial Intelligence is providing benefits in many areas of work within the heritage sector, from image analysis, to ideas generation, and new research tools. However, it is more critical than ever for people, with analogue intelligence, to ensure the integrity and ethical use of AI. Including real people can improve the use of AI by identifying potential biases, cross-checking results, refining workflows, and providing contextual relevance to AI-driven results.
News about the impact of AI often paints a rosy picture. In practice, there are many potential pitfalls. This presentation discusses these issues and looks at the role of analogue intelligence and analogue interfaces in providing the best results to our audiences. How do we deal with factually incorrect results? How do we get content generated that better reflects the diversity of our communities? What roles are there for physical, in-person experiences in the digital world?
AI and Data Privacy in 2025: Global TrendsInData Labs
In this infographic, we explore how businesses can implement effective governance frameworks to address AI data privacy. Understanding it is crucial for developing effective strategies that ensure compliance, safeguard customer trust, and leverage AI responsibly. Equip yourself with insights that can drive informed decision-making and position your organization for success in the future of data privacy.
This infographic contains:
-AI and data privacy: Key findings
-Statistics on AI data privacy in the today’s world
-Tips on how to overcome data privacy challenges
-Benefits of AI data security investments.
Keep up-to-date on how AI is reshaping privacy standards and what this entails for both individuals and organizations.
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...Raffi Khatchadourian
Efficiency is essential to support responsiveness w.r.t. ever-growing datasets, especially for Deep Learning (DL) systems. DL frameworks have traditionally embraced deferred execution-style DL code—supporting symbolic, graph-based Deep Neural Network (DNN) computation. While scalable, such development is error-prone, non-intuitive, and difficult to debug. Consequently, more natural, imperative DL frameworks encouraging eager execution have emerged but at the expense of run-time performance. Though hybrid approaches aim for the “best of both worlds,” using them effectively requires subtle considerations to make code amenable to safe, accurate, and efficient graph execution—avoiding performance bottlenecks and semantically inequivalent results. We discuss the engineering aspects of a refactoring tool that automatically determines when it is safe and potentially advantageous to migrate imperative DL code to graph execution and vice-versa.
How Caching Improves Efficiency and Result Completeness for Querying Linked Data
1. How Caching Improves
Efficiency and Result Completeness
for Querying Linked Data
Olaf Hartig
http://olafhartig.de/foaf.rdf#olaf
Database and Information Systems Research Group
Humboldt-Universität zu Berlin
2. Can we query the Web of Data
as if it were a single,
giant database?
SELECT DISTINCT ?i ?label
WHERE {
?prof rdf:type <http://res ... data/dbprofs#DBProfessor> ;
foaf:topic_interest ?i .
}
OPTIONAL {
}
?i rdfs:label ?label
FILTER( LANG(?label)="en" || LANG(?label)="")
ORDER BY ?label
?
Our approach: Link Traversal Based Query Execution
[ISWC'09]
Olaf Hartig - How Caching Improves Efficiency and Result Completeness for Querying Linked Data 2
3. Main Idea
● Intertwine query evaluation with traversal of data links
● We alternate between:
  ● Evaluate parts of the query (triple patterns) on a continuously augmented set of data
  ● Look up URIs in intermediate solutions and add retrieved data to the query-local dataset
[Figure: the (initially empty) query-local dataset]
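The alternating evaluate/look-up loop described above can be written down compactly. The following is a toy, self-contained Python sketch, not the paper's implementation; the WEB dict, the URI http://x/AlicesPrj, and all function names are hypothetical stand-ins for the running example of the next slides.

```python
# Toy sketch of link traversal based query execution, NOT the paper's
# implementation. WEB maps a URI to the "descriptor object" (a set of
# triples) retrieved by looking that URI up.
WEB = {
    "http://bob.name":    [("http://bob.name", "knows", "http://alice.name")],
    "http://alice.name":  [("http://alice.name", "project", "http://x/AlicesPrj")],
    "http://x/AlicesPrj": [("http://x/AlicesPrj", "name", "Alice's Project")],
}

def is_var(term):
    return term.startswith("?")

def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    b = dict(binding)
    for p, v in zip(pattern, triple):
        if is_var(p):
            if b.get(p, v) != v:
                return None
            b[p] = v
        elif p != v:
            return None
    return b

def evaluate(patterns, dataset):
    """Evaluate the triple patterns over the query-local dataset; also return
    the intermediate solutions produced after each pattern."""
    solutions, partials = [{}], []
    for pat in patterns:
        solutions = [b2 for b in solutions for t in dataset
                     if (b2 := match(pat, t, b)) is not None]
        partials.extend(solutions)
    return solutions, partials

def execute(patterns):
    dataset, retrieved = set(), set()
    while True:
        # (1) evaluate parts of the query on the current local dataset
        solutions, partials = evaluate(patterns, dataset)
        # (2) look up URIs from the query and from intermediate solutions
        uris = {t for pat in patterns for t in pat if not is_var(t)}
        uris |= {v for b in partials for v in b.values()}
        new = [u for u in uris if u in WEB and u not in retrieved]
        if not new:
            return solutions, dataset
        for u in new:
            retrieved.add(u)
            dataset.update(WEB[u])  # augment the query-local dataset

query = [("http://bob.name", "knows", "?acq"),
         ("?acq", "project", "?prj"),
         ("?prj", "name", "?prjName")]
solutions, local = execute(query)
```

Starting from the single seed URI in the query, the execution discovers http://alice.name and the project document on its own, mirroring the slide animation: the final solution binds ?acq, ?prj, and ?prjName.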
4.–21. Main Idea (animated example)
These slides repeat the bullets above while an example execution is animated. The recoverable steps:
● Query pattern: http://bob.name knows ?acq . ?acq project ?prj . ?prj name ?prjName
● Look up http://bob.name; the retrieved “descriptor object” is added to the query-local dataset
● Evaluating the first triple pattern on the local dataset yields the intermediate solution ?acq = http://alice.name
● Look up http://alice.name and add the retrieved data, which includes http://alice.name project http://.../AlicesPrj
● Evaluating the second pattern extends the solution: ?acq = http://alice.name, ?prj = http://.../AlicesPrj
● Evaluating the third pattern binds ?prjName, giving the final solution ?acq = http://alice.name, ?prj = http://.../AlicesPrj, ?prjName = “…“
22. Characteristics
● Link traversal based query execution:
  ● Evaluation on a continuously augmented dataset
  ● Discovery of potentially relevant data during execution
  ● Discovery driven by intermediate solutions
● Main advantage:
  ● No need to know all data sources in advance
● Limitations:
  ● Query has to contain a URI as a starting point
  ● Ignores data that is not reachable* by the query execution
* formal definition in the paper
23.–27. The Issue
A second query, http://bob.name knows ?acq . ?acq interest ?i . ?i label ?iLabel, starts again with an empty query-local dataset. Its execution therefore repeats the very same look-ups (http://bob.name, then the URIs it mentions) that the first query already performed; the animation shows the result table for ?acq, ?i, ?iLabel being built up from scratch.
[Figures: the second query over a fresh, empty query-local dataset, contrasted with the first query and its already populated dataset]
28.–30. Reusing the Query-Local Dataset
If the query-local dataset populated by the first query is reused, the second query can be evaluated over already retrieved data: the cached triple http://bob.name knows http://alice.name immediately yields the intermediate solution ?acq = http://alice.name without any new look-up.
[Figures: the second query evaluated over the query-local dataset left behind by the first query]
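The reuse idea amounts to keeping one look-up cache across query executions. A minimal sketch, with a hypothetical CachingLookup wrapper and toy data; the hit rate computed here is the measure used in the experiments (look-ups answered from cache / all look-up requests).

```python
# Minimal sketch of reusing the query-local dataset as a data cache across
# query executions. CachingLookup and the toy data are hypothetical.
class CachingLookup:
    def __init__(self, web):
        self.web = web        # stands in for the Web of Data
        self.cache = {}       # the reused query-local dataset
        self.hits = 0
        self.requests = 0

    def lookup(self, uri):
        self.requests += 1
        if uri in self.cache:
            self.hits += 1    # answered from the cache, no retrieval needed
        else:
            self.cache[uri] = self.web.get(uri, [])
        return self.cache[uri]

    def hit_rate(self):
        # hit rate = look-ups answered from cache / all look-up requests
        return self.hits / self.requests if self.requests else 0.0

web = {"http://bob.name":   [("http://bob.name", "knows", "http://alice.name")],
       "http://alice.name": [("http://alice.name", "interest", "http://ex.org/DB")]}
lookups = CachingLookup(web)

# First query execution: every look-up has to retrieve data.
for uri in ("http://bob.name", "http://alice.name"):
    lookups.lookup(uri)
# Second query over the same seed: the same look-ups hit the cache.
for uri in ("http://bob.name", "http://alice.name"):
    lookups.lookup(uri)
# lookups.hit_rate() is now 0.5 (2 of 4 look-ups answered from cache)
```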
31. Hypothesis
Re-using the query-local dataset (a.k.a. data caching) may benefit query performance and result completeness.
32. Contributions
● Systematic analysis of the impact of data caching
  ● Theoretical foundation*
  ● Conceptual analysis*
  ● Empirical evaluation of the potential impact
● Out of scope: caching strategies (replacement, invalidation)
* see paper
33. Experiment – Scenario
● Information about the distributed social network of FOAF profiles
● 5 types of queries
● Experiment setup:
  ● 23 persons
  ● Sequential use
  ➔ 115 queries
34.–37. Experiment – Complete Sequence
[Charts: hit rate, number of query results, and query execution time (in seconds) for queries No. 36–40: ContactInfoPhillipe, UnsetPropsPhillipe, 2ndDegree1Phillipe, 2ndDegree2Phillipe, IncomingPhillipe]
● no reuse experiment: no data caching
● given order experiment: reuse of the query-local dataset for the complete sequence of all 115 queries
● Hit rate: look-ups answered from cache / all look-up requests
38. Summary
● Contributions:
  ● Theoretical foundation
  ● Conceptual analysis
  ● Empirical evaluation
● Main findings:
  ● Additional results possible (for semantically similar queries)
  ● Impact on performance may be positive but also negative
● Future work:
  ● Analysis of caching strategies in our context
  ● Main issue: invalidation
39. Backup Slides
40. Contributions
● Theoretical foundation (extension of the original definition)
  ● Reachability by a Dseed-initialized execution of a BGP query b
  ● Dseed-dependent solution for a BGP query b
  ● Reachability R(B) for a serial execution of B = b1, …, bn
  ➔ Each solution for bcur is also an R(B)-dependent solution for bcur
● Conceptual analysis of the impact of data caching
  ● Performance factor: p( bcur , B ) = c( bcur , [ ] ) – c( bcur , B )
  ● Serendipity factor: s( bcur , B ) = b( bcur , B ) – b( bcur , [ ] )
● Empirical verification of the potential impact
● Out of scope: caching strategies (replacement, invalidation)
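The two factors are simple differences and can be stated as code. A minimal sketch with function names of my own choosing; the example numbers are those reported for query q39 in the given order experiment later in the deck.

```python
# The performance and serendipity factors from the conceptual analysis:
#   p( b_cur, B ) = c( b_cur, [] ) - c( b_cur, B )   (cost saved by reuse)
#   s( b_cur, B ) = b( b_cur, B ) - b( b_cur, [] )   (extra results from reuse)
def performance_factor(cost_fresh, cost_with_reuse):
    """Positive: reusing the dataset made b_cur cheaper; negative: slower."""
    return cost_fresh - cost_with_reuse

def serendipity_factor(results_with_reuse, results_fresh):
    """Number of additional results obtained thanks to the reused dataset."""
    return results_with_reuse - results_fresh

# Numbers reported for query q39 in the 'given order' experiment:
s39 = serendipity_factor(9, 1)            # 8 additional results
p39 = performance_factor(31.48, 68.64)    # -37.16 s: caching slowed q39 down
```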
42. Query Template UnsetProps
SELECT DISTINCT ?result ?resultLabel WHERE
{
?result rdfs:isDefinedBy <http://xmlns.com/foaf/0.1/> .
?result rdfs:domain foaf:Person .
OPTIONAL { <PERSON> ?result ?var0 }
FILTER ( !bound(?var0) )
<PERSON> foaf:knows ?var2 .
?var2 ?result ?var3 .
?result rdfs:label ?resultLabel .
?result vs:term_status ?var1 .
}
ORDER BY ?var1
43. Query Template Incoming
SELECT DISTINCT ?result WHERE
{
?result foaf:knows <PERSON> .
OPTIONAL
{
?result foaf:knows ?var1 .
FILTER ( <PERSON> = ?var1 )
<PERSON> foaf:knows ?result .
}
FILTER ( !bound(?var1) )
}
44. Query Template 2ndDegree1
SELECT DISTINCT ?result WHERE
{
<PERSON> foaf:knows ?p1 .
<PERSON> foaf:knows ?p2 .
FILTER ( ?p1 != ?p2 )
?p1 foaf:knows ?result .
FILTER ( <PERSON> != ?result )
?p2 foaf:knows ?result .
OPTIONAL {
<PERSON> ?knows ?result .
FILTER ( ?knows = foaf:knows )
}
FILTER ( !bound(?knows) )
}
45. Query Template 2ndDegree2
SELECT DISTINCT ?result WHERE
{
<PERSON> foaf:knows ?p1 .
<PERSON> foaf:knows ?p2 .
FILTER ( ?p1 != ?p2 )
?result foaf:knows ?p1 .
FILTER ( <PERSON> != ?result )
?result foaf:knows ?p2 .
OPTIONAL {
<PERSON> ?knows ?result .
FILTER ( ?knows = foaf:knows )
}
FILTER ( !bound(?knows) )
}
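The templates above all rely on the classic SPARQL 1.0 idiom for negation: an OPTIONAL block followed by FILTER( !bound(?var) ) keeps only those solutions for which the optional part found no match. A toy Python sketch of the same logic, with hypothetical data; it mirrors the Incoming template's "knows <PERSON> but is not known back" condition.

```python
# Toy data (hypothetical): pairs (a, b) meaning "a foaf:knows b".
knows = {("bob", "alice"), ("alice", "bob"), ("carol", "bob")}

def incoming_only(person):
    """People who know `person` without being known back (cf. template Incoming)."""
    results = []
    for a, b in knows:
        if b != person:
            continue                  # matches: ?result foaf:knows <PERSON>
        if (person, a) not in knows:  # the OPTIONAL part found no match, so
            results.append(a)         # FILTER( !bound(?var1) ) keeps ?result
    return results

# incoming_only("bob") yields only "carol": alice and bob know each other,
# but bob does not know carol back.
```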
46.–49. Experiment – Single Query
[Charts: hit rate, number of query results, and query execution time (in seconds) for queries No. 36–40: ContactInfoPhillipe, UnsetPropsPhillipe, 2ndDegree1Phillipe, 2ndDegree2Phillipe, IncomingPhillipe]
● no reuse experiment: no data caching
● upper bound experiment:
  ● Reuse of the query-local dataset for 3 executions of each query
  ● Third execution measured
● Hit rate: look-ups answered from cache / all look-up requests
50.–51. Experiment – Single Query

Experiment    | Avg.¹ number of query results (std.dev.) | Average¹ hit rate (std.dev.) | Avg.¹ query execution time (std.dev.)
no reuse      | 4.983 (11.658)                           | 0.576 (0.182)                | 30.036 s (46.708)
upper bound   | 5.070 (11.813)                           | 0.996 (0.017)                | 1.943 s (11.375)
¹ Averaged over all 115 queries

● In the ideal case for Bupper = [ bcur , bcur ]:
  ● pupper( bcur , Bupper ) = c( bcur , [ ] ) – c( bcur , Bupper ) = c( bcur , [ ] )
  ● supper( bcur , Bupper ) = b( bcur , Bupper ) – b( bcur , [ ] ) = 0
● Summary (measurement errors aside):
  ● Same number of query results
  ● Significant improvements in query performance
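The averages in the table imply how much an ideal cache speeds up a single query; a back-of-the-envelope calculation on the reported numbers:

```python
# Averages from the 'single query' experiment table (over all 115 queries).
avg_time_no_reuse = 30.036    # seconds, no data caching
avg_time_upper_bound = 1.943  # seconds, fully reused query-local dataset
speedup = avg_time_no_reuse / avg_time_upper_bound
print(f"average speedup with an ideal cache: {speedup:.1f}x")  # about 15.5x
```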
52.–55. Experiment – Complete Sequence
[Charts: hit rate, number of query results, and query execution time (in seconds) for queries No. 36–40: ContactInfoPhillipe, UnsetPropsPhillipe, 2ndDegree1Phillipe, 2ndDegree2Phillipe, IncomingPhillipe]
● given order experiment: reuse of the query-local dataset for the complete sequence of all 115 queries
56. Experiment – Complete Sequence
Bgiven order = [ q1 , … , q38 ]
s( q39 , Bgiven order ) = b( q39 , Bgiven order ) – b( q39 , [ ] ) = 9 – 1 = 8
[Charts as on the previous slides, highlighting query No. 39 (2ndDegree2Phillipe)]
57. Experiment – Complete Sequence
Bgiven order = [ q1 , … , q38 ]
p'( q39 , Bgiven order ) = c'( q39 , [ ] ) – c'( q39 , Bgiven order ) = 31.48 s – 68.64 s = – 37.16 s
[Charts as on the previous slides, highlighting query No. 39 (2ndDegree2Phillipe)]
58. Experiment – Complete Sequence

Experiment    | Avg.¹ number of query results (std.dev.) | Average¹ hit rate (std.dev.) | Avg.¹ query execution time (std.dev.)
no reuse      | 4.983 (11.658)                           | 0.576 (0.182)                | 30.036 s (46.708)
upper bound   | 5.070 (11.813)                           | 0.996 (0.017)                | 1.943 s (11.375)
given order   | 6.878 (12.158)                           | 0.932 (0.139)                | 39.845 s (145.898)
¹ Averaged over all 115 queries

● Summary:
  ● Data cache may provide for additional query results
  ● Impact on performance may be positive but also negative
59. Experiment – Complete Sequence

Experiment    | Avg.¹ number of query results (std.dev.) | Average¹ hit rate (std.dev.) | Avg.¹ query execution time (std.dev.)
no reuse      | 4.983 (11.658)                           | 0.576 (0.182)                | 30.036 s (46.708)
upper bound   | 5.070 (11.813)                           | 0.996 (0.017)                | 1.943 s (11.375)
given order   | 6.878 (12.158)                           | 0.932 (0.139)                | 39.845 s (145.898)
random orders | 6.652 (11.966)                           | 0.954 (0.036)                | 36.994 s (118.700)
¹ Averaged over all 115 queries

● Executing the query sequence in a random order results in measurements similar to the given order.
60. These slides have been created by
Olaf Hartig
http://olafhartig.de
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 License
(http://creativecommons.org/licenses/by-sa/3.0/)