I used these slides for an introductory lecture (90min) to a seminar on SPARQL. This slideset introduces the RDF query language SPARQL from a user's perspective.
"SPARQL Cheat Sheet" is a short collection of slides intended to act as a guide to SPARQL developers. It includes the syntax and structure of SPARQL queries, common SPARQL prefixes and functions, and help with RDF datasets.
The "SPARQL Cheat Sheet" is intended to accompany the SPARQL By Example slides available at http://www.cambridgesemantics.com/2008/09/sparql-by-example/ .
Sling is a RESTful web application framework that uses JCR repositories like Apache Jackrabbit as its data store. Sling maps HTTP requests to content resources in the repository and uses the resource type to locate the appropriate rendering script. The request URL is decomposed into the resource path, selectors, extension, and suffix path. Sling searches for a node matching the resource path and then locates a script based on the resource type and any selectors. Sling scripts cannot be called directly and must be resolved through the resource to follow REST principles. This document discusses how Sling maps URLs to content resources and scripts to process requests.
This document provides an overview of using SPARQL to extract and explore data from an RDF graph. It covers key SPARQL concepts like graph patterns, triple patterns, optional patterns, UNION queries, sorting, limiting, filtering and DISTINCT clauses. It also discusses different SPARQL query forms like SELECT, ASK, DESCRIBE, and CONSTRUCT and provides examples of each. Useful links are included for additional SPARQL tutorials and references.
The document discusses the RDF data model. The key points are:
1. RDF represents data as a graph of triples consisting of a subject, predicate, and object. Triples can be combined to form an RDF graph.
2. The RDF data model has three types of nodes - URIs to identify resources, blank nodes to represent anonymous resources, and literals for values like text strings.
3. RDF graphs can be merged to integrate data from multiple sources in an automatic way due to RDF's compositional nature.
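As a quick illustration of the triple model summarized above, here is a minimal sketch using Apache Jena (the Java RDF library that also appears later in this document); the example.org URIs are made up for illustration:
import com.hp.hpl.jena.rdf.model.*;

public class TripleExample {
    public static void main( String[] args ) {
        Model model = ModelFactory.createDefaultModel();

        // subject and predicate are URIs; the object here is a literal
        Resource subject = model.createResource( "http://example.org/Mount_Etna" );
        Property predicate = model.createProperty( "http://example.org/", "lastEruption" );
        subject.addProperty( predicate, "1880" );

        // print the resulting one-triple graph in Turtle syntax
        model.write( System.out, "TURTLE" );
    }
}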
This document provides an introduction to Apache Solr, including:
- Solr is an open-source search engine and REST API built on Lucene for indexing and searching documents.
- Solr architecture includes nodes, cores, schemas, and concepts like SolrCloud which uses Zookeeper for coordination across collections and shards.
- Documents are indexed, queried, updated, and deleted via the REST API or client libraries. Queries support various types including range, date, boolean, and proximity queries.
- Installation and configuration of standalone Solr involves downloading, extracting, and running bin/solr scripts to start the server and create cores.
- Resources for learning more include tutorials, documentation, and integration options
- SPARQL is a query language for retrieving and manipulating data stored in RDF format. It is similar to SQL but for RDF data.
- SPARQL queries contain prefix declarations, specify a dataset using FROM, and include a graph pattern in the WHERE clause to match triples.
- The main types of SPARQL queries are SELECT, ASK, DESCRIBE, and CONSTRUCT. SELECT returns variable bindings, ASK returns a boolean, DESCRIBE returns a description of a resource, and CONSTRUCT generates an RDF graph.
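To make the four query forms concrete, here is a hedged sketch of one query of each kind (the FOAF property and example.org URIs are illustrative, not taken from the original):
// SELECT: returns a table of variable bindings
String select = "SELECT ?name WHERE { ?p <http://xmlns.com/foaf/0.1/name> ?name }";

// ASK: returns a boolean
String ask = "ASK { ?p <http://xmlns.com/foaf/0.1/name> \"Alice\" }";

// DESCRIBE: returns an RDF graph describing the given resource
String describe = "DESCRIBE <http://example.org/alice>";

// CONSTRUCT: returns an RDF graph built from the given template
String construct = "CONSTRUCT { ?p a <http://example.org/Person> } "
                 + "WHERE { ?p <http://xmlns.com/foaf/0.1/name> ?name }";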
Tutorial on SPARQL: SPARQL Protocol and RDF Query Language (Biswanath Dutta)
In this tutorial, we show how to query an RDF triple store. SPARQL is an RDF query language and also a protocol for issuing SPARQL queries on the Web, which enables searching against a SPARQL endpoint.
This document provides an overview of Oracle REST Data Services (ORDS). It discusses REST concepts and architecture, how to install and configure ORDS, enable database schemas and define modules and templates to expose SQL and PL/SQL as REST APIs. It also covers securing APIs using basic authentication and OAuth 2.0 flows, and enabling AutoREST to automatically expose database objects as REST resources.
This developer-focused webinar will explain how to use the Cypher graph query language. Cypher, a query language designed specifically for graphs, allows for expressing complex graph patterns using simple ASCII art-like notation and offers a simple but expressive approach for working with graph data.
During this webinar you'll learn:
-Basic Cypher syntax
-How to construct graph patterns using Cypher
-Querying existing data
-Data import with Cypher
-Using aggregations such as statistical functions
-Extending the power of Cypher using procedures and functions
Working With a Real-World Dataset in Neo4j: Import and Modeling (Neo4j)
This webinar will cover how to work with a real-world dataset in Neo4j, with a focus on how to build a graph from an existing dataset (in this case a series of JSON files). We will explore how to performantly import the data into Neo4j - both in the case of an initial import and when scaling writes for your graph application. We will demonstrate different approaches for data import (neo4j-import, LOAD CSV, and using the official Neo4j drivers), and discuss when it makes sense to use each import technique. If you've ever asked these questions, then this webinar is for you!
- How do I design a property graph model for my domain?
- How do I use the official Neo4j drivers?
- How can I deal with concurrent writes to Neo4j?
- How can I import JSON into Neo4j?
OSGi is a Java framework that implements a dynamic component model. It allows modular Java applications to be built from separate bundles that can be loaded, started, stopped and updated independently. Key aspects include:
- Bundles are JAR files with additional configuration that can be loaded and stopped independently without affecting other bundles.
- Services allow bundles to publish and discover capabilities via a registry. Dependencies between bundles are resolved dynamically.
- The OSGi framework provides a lifecycle to manage the loading, starting and stopping of bundles.
- Annotations like @Component, @Service and @Reference allow defining OSGi components and services using declarative services (see the sketch after this list).
- The OSGi specification is implemented by frameworks like Apache Felix and Eclipse Equinox.
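As a rough sketch of how declarative services look in code, the following uses the standard OSGi DS annotations (@Service belongs to the older Felix SCR annotation set); the service interface and component here are hypothetical:
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;
import org.osgi.service.log.LogService;

// hypothetical service contract
interface GreetingService {
    String greet( String name );
}

// registers GreetingService in the OSGi service registry
@Component( service = GreetingService.class )
public class GreetingComponent implements GreetingService {

    // injected by the framework when a matching service is available
    @Reference
    private LogService log;

    public String greet( String name ) {
        log.log( LogService.LOG_INFO, "greeting " + name );
        return "Hello, " + name;
    }
}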
Semantic Web technologies (such as RDF and SPARQL) excel at bringing together diverse data in a world of independent data publishers and consumers. Common ontologies help to arrive at a shared understanding of the intended meaning of data.
However, they don’t address one critically important issue: What does it mean for data to be complete and/or valid? Semantic knowledge graphs without a shared notion of completeness and validity quickly turn into a Big Ball of Data Mud.
The Shapes Constraint Language (SHACL), an upcoming W3C standard, promises to help solve this problem. By keeping semantics separate from validity, SHACL makes it possible to resolve a slew of data quality and data exchange issues.
Presented at the Lotico Berlin Semantic Web Meetup.
Version 6 of Adobe Experience Manager (AEM 6) is a major release that introduces significant innovations. Sightly is a new template system to be used in place of (or together with) JSP. Along with Sling Models, Sightly strongly improves the separation between logic and presentation. The development effort is reduced because a Sightly template is an HTML5 document, easily maintainable even by front-end developers.
The presentation provides an overview of the basic features of Sightly and introduces the fundamentals of the new development model with the support of the tools released together with AEM 6.
This document provides an overview of a training course on RDF, SPARQL and semantic repositories. The training course took place in August 2010 in Montreal as part of the 3rd GATE training course. The document outlines the modules covered in the course, including introductions to RDF/S and OWL semantics, querying RDF data with SPARQL, semantic repositories and benchmarking triplestores.
SPARQL is a query language, result format, and access protocol for querying and accessing RDF data. SPARQL queries use a SELECT-FROM-WHERE structure to match triple patterns against RDF graphs. The WHERE clause contains a conjunction of triple patterns that can be extended with filters, optional patterns, and unions of patterns. SPARQL results are returned in an XML format and the protocol defines HTTP and SOAP bindings for sending queries and receiving results over the web.
This document provides an overview of the RDF data model. It discusses the history and development of RDF standards from 1997 to 2014. It explains that an RDF graph is made up of triples consisting of a subject, predicate, and object. It provides examples of RDF triples and their N-triples representation. It also describes RDF syntaxes like Turtle and features of RDF like literals, blank nodes, and language-tagged strings.
This document provides an introduction to the RDF data model. It describes RDF as a data model that represents data as subject-predicate-object triples that can be used to describe resources. These triples form a directed graph. The document provides examples of RDF triples and graphs, and compares the RDF data model to relational and XML data models. It also describes common RDF formats like RDF/XML, Turtle, N-Triples, and how RDF graphs from different sources can be merged.
AEM (CQ) Dispatcher Security and CDN+Browser Caching (Andrew Khoury)
This presentation covers Adobe AEM Dispatcher security as well as CDN and browser caching.
This presentation is the second part of a webinar on AEM Dispatcher:
http://dev.day.com/content/ddc/en/gems/dispatcher-caching---new-features-and-optimizations.html
Visit the URL above to view the whole presentation. Dominique Pfister, the primary engineer developing AEM Dispatcher, covers the first part on new features.
The document provides instructions for installing Solr on Windows by downloading and configuring Tomcat and Solr. It describes downloading Tomcat and Solr, configuring server.xml, extracting Solr to c:\web\solr, copying the Solr WAR file to Tomcat, and accessing the Solr admin page at http://localhost:8080/solr/admin to verify the installation.
ReactJS Tutorial For Beginners | ReactJS Redux Training For Beginners | React... (Edureka!)
This Edureka ReactJS Tutorial For Beginners will help you in understanding the fundamentals of ReactJS and help you in building a strong foundation in React framework. Below are the topics covered in this tutorial:
1. Why ReactJS?
2. What Is ReactJS?
3. Advantages Of ReactJS
4. ReactJS Installation and Program
5. ReactJS Fundamentals
This document provides an introduction and examples for SHACL (Shapes Constraint Language), a W3C recommendation for validating RDF graphs. It defines key SHACL concepts like shapes, targets, and constraint components. An example shape validates nodes with a schema:name and schema:email property. Constraints like minCount, maxCount, datatype, nodeKind, and logical operators like and/or are demonstrated. The document is an informative tutorial for learning SHACL through examples.
Apache Jackrabbit Oak - Scale your content repository to the cloud (Robert Munteanu)
The document discusses Apache Jackrabbit Oak, an open source content repository that can scale to the cloud. It provides an overview of content, repositories, scaling techniques using different storage backends like TarMK and MongoMK, and how Oak can be deployed in the cloud using technologies like S3 and MongoDB. The presentation covers key JCR concepts and shows how Oak can be used for applications like content management, digital asset management, and invoice management.
Linked Open Data is the most usable kind of Open Data. An example of a well-integrated source of Linked Open Data on tourism and mobility is the Open Data Hub operated by NOI. We will use the SPARQL query language, a W3C standard, to query the data and show how this differs from other access methods. The tour will start by querying the endpoint directly from the command line with tools like curl. Then, one by one, well-known data science software packages, like R and Pandas, will be used to directly work with these datasets, to perform statistical calculations and generate graphs from data.
In the final part, these software packages will be used to query data from other well known data sources, like Wikidata and DBpedia.
balloon Fusion: SPARQL Rewriting Based on Unified Co-Reference Information (Kai Schlegel)
Presentation for the 5th International Workshop on Data Engineering meets the Semantic Web (DESWeb), held in conjunction with ICDE 2014, Chicago, IL, USA, March 31, 2014, by Kai Schlegel.
Sesam4 project presentation: SPARQL - April 2011 (Robert Engels)
SPARQL is a query language for retrieving and manipulating data stored in RDF format. It allows for querying linked data graphs through operations like SELECT, DESCRIBE, ASK and CONSTRUCT. Unlike SQL, SPARQL can query data across decentralized datasets and systems as it works with globally unique identifiers rather than local schemas. Examples show how SPARQL can be used to retrieve descriptive information about a resource, select specific values from a graph, construct new triples based on pattern matching in a graph, and ask simple true/false questions against a dataset.
Sesam4 project presentation: SPARQL - April 2011 (sesam4able)
This slide set is provided by the SESAM4 consortium as one of three Technology Primers on Semantic Web technology. This Primer is on SPARQL and gives you a short introduction to its constructs, followed by some examples. You can find the accompanying slide set on YouTube under SESAM4.
A hands-on overview of the Semantic Web (Marakana Inc.)
This document provides an overview of the Semantic Web. It defines the Semantic Web as linking data to data using technologies like RDF, RDFS, OWL and SPARQL. It explains that RDF represents information as subject-predicate-object statements that can be queried using SPARQL. RDFS allows defining schemas and classes for RDF data, while OWL adds more expressiveness for defining complex ontologies. The document outlines popular Semantic Web tools, public ontologies, and companies working in this domain. It positions the Semantic Web as a way to represent and share data universally on the web.
SparkR is an R package that provides an interface to Apache Spark to enable large scale data analysis from R. It introduces the concept of distributed data frames that allow users to manipulate large datasets using familiar R syntax. SparkR improves performance over large datasets by using lazy evaluation and Spark's relational query optimizer. It also supports over 100 functions on data frames for tasks like statistical analysis, string manipulation, and date operations.
SPARQL is a standard query language for RDF that has undergone two iterations (1.0 and 1.1) through the W3C process. SPARQL 1.1 includes updates to RDF stores, subqueries, aggregation, property paths, negation, and remote querying. It also defines separate specifications for querying, updating, protocols, graph store protocols, and federated querying. Apache Jena provides implementations of SPARQL 1.1 and tools like Fuseki for deploying SPARQL servers.
Linked Open Data - Masaryk University in Brno 8.11.2016 (Martin Necasky)
This document discusses Linked Open Data, including its principles, usage examples, and research challenges. It begins by defining open data and Linked Open Data, describing the four Linked Data principles of using URIs, HTTP URIs, providing useful information via standards like RDF and SPARQL, and including links between data. Examples are given of querying and combining Linked Data sets. Two research challenges are identified: dataset discovery to find relevant data based on natural language queries, and dataset visualization to identify appropriate visualizations for discovered data combinations. The document concludes by discussing OpenData.cz's role in advancing open data in the Czech Republic through assisting institutions, helping establish open data standards and legislation, and educating on open data practices.
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E... (Edureka!)
This Edureka Spark SQL Tutorial will help you to understand how Apache Spark offers SQL power in real time. This tutorial also demonstrates a use case on Stock Market Analysis using Spark SQL. Below are the topics covered in this tutorial:
1) Limitations of Apache Hive
2) Spark SQL Advantages Over Hive
3) Spark SQL Success Story
4) Spark SQL Features
5) Architecture of Spark SQL
6) Spark SQL Libraries
7) Querying Using Spark SQL
8) Demo: Stock Market Analysis With Spark SQL
SPARQL is a query language for retrieving and manipulating data stored in RDF format. It is a W3C recommendation similar to SQL for relational databases. SPARQL queries contain SELECT, FROM and WHERE clauses to identify result variables, specify the RDF dataset, and provide a basic graph pattern to match against the data. SPARQL can be used to query RDF knowledge bases and retrieve variable bindings or boolean results. Query results are returned in XML format according to the SPARQL Query Results specification.
The document discusses data discovery, conversion, integration and visualization using RDF. It covers topics like ontologies, vocabularies, data catalogs, converting different data formats to RDF including CSV, XML and relational databases. It also discusses federated SPARQL queries to integrate data from multiple sources and different techniques for visualizing linked data including analyzing relationships, events, and multidimensional data.
This document provides an overview of semantic web technologies including the Resource Description Framework (RDF), which is a standard model for data interchange on the web. It describes RDF concepts like triples, graphs, and syntaxes like RDF/XML and Turtle. It also covers ontologies, SPARQL for querying RDF data, linked data, and tools for working with RDF and semantic web technologies.
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori... (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2sm9c61
This CloudxLab Introduction to Spark SQL & DataFrames tutorial helps you to understand Spark SQL & DataFrames in detail. Below are the topics covered in this slide:
1) Loading XML
2) What is RPC - Remote Procedure Call
3) Loading AVRO
4) Data Sources - Parquet
5) Creating DataFrames From Hive Table
6) Setting up Distributed SQL Engine
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library... (Spark Summit)
With the rapid growth of available datasets, it is imperative to have good tools for extracting insight from big data. The Spark ML library has excellent support for performing at-scale data processing and machine learning experiments, but more often than not, Data Scientists find themselves struggling with issues such as: low level data manipulation, lack of support for image processing, text analytics and deep learning, as well as the inability to use Spark alongside other popular machine learning libraries. To address these pain points, Microsoft recently released The Microsoft Machine Learning Library for Apache Spark (MMLSpark), an open-source machine learning library built on top of SparkML that seeks to simplify the data science process and integrate SparkML Pipelines with deep learning and computer vision libraries such as the Microsoft Cognitive Toolkit (CNTK) and OpenCV. With MMLSpark, Data Scientists can build models with 1/10th of the code through Pipeline objects that compose seamlessly with other parts of the SparkML ecosystem. In this session, we explore some of the main lessons learned from building MMLSpark. Join us if you would like to know how to extend Pipelines to ensure seamless integration with SparkML, how to auto-generate Python and R wrappers from Scala Transformers and Estimators, how to integrate and use previously non-distributed libraries in a distributed manner and how to efficiently deploy a Spark library across multiple platforms.
An Introduction to Spark - Atlanta Spark Meetup (jlacefie)
- Apache Spark is an open-source cluster computing framework that provides fast, in-memory processing for large-scale data analytics. It can run on Hadoop clusters and standalone.
- Spark allows processing of data using transformations and actions on resilient distributed datasets (RDDs). RDDs can be persisted in memory for faster processing.
- Spark comes with modules for SQL queries, machine learning, streaming, and graphs. Spark SQL allows SQL queries on structured data. MLlib provides scalable machine learning. Spark Streaming processes live data streams.
LDQL: A Query Language for the Web of Linked Data (Olaf Hartig)
I used this slideset to present our research paper at the 14th Int. Semantic Web Conference (ISWC 2015). Find a preprint of the paper here:
http://olafhartig.de/files/HartigPerez_ISWC2015_Preprint.pdf
A Context-Based Semantics for SPARQL Property Paths over the Web (Olaf Hartig)
- The document proposes a formal context-based semantics for evaluating SPARQL property path queries over the Web of Linked Data.
- This semantics defines how to compute the results of such queries in a well-defined manner and ensures the "web-safeness" of queries, meaning they can be executed directly over the Web without prior knowledge of all data.
- The paper presents a decidable syntactic condition for identifying SPARQL property path queries that are web-safe based on their sets of conditionally bounded variables.
Rethinking Online SPARQL Querying to Support Incremental Result Visualization (Olaf Hartig)
These are the slides of my invited talk at the 5th Int. Workshop on Usage Analysis and the Web of Data (USEWOD 2015): http://usewod.org/usewod2015.html
The abstract of this talk is given as follows:
To reduce user-perceived response time many interactive Web applications visualize information in a dynamic, incremental manner. Such an incremental presentation can be particularly effective for cases in which the underlying data processing systems are not capable of completely answering the users' information needs instantaneously. An example of such systems are systems that support live querying of the Web of Data, in which case query execution times of several seconds, or even minutes, are an inherent consequence of these systems' ability to guarantee up-to-date results. However, support for an incremental result visualization has not received much attention in existing work on such systems. Therefore, the goal of this talk is to discuss approaches that enable query systems for the Web of Data to return query results incrementally.
Tutorial "Linked Data Query Processing" Part 2 "Theoretical Foundations" (WWW...Olaf Hartig
This document summarizes the theoretical foundations of linked data query processing presented in a tutorial. It discusses the SPARQL query language, data models for linked data queries, full-web and reachability-based query semantics. Under full-web semantics, a query is computable if its pattern is monotonic, and eventually computable otherwise. Reachability-based semantics restrict queries to data reachable from a set of seed URIs. Queries under this semantics are always finitely computable if the web is finite. The document outlines computability results and properties regarding satisfiability and monotonicity for different semantics.
An Overview on PROV-AQ: Provenance Access and Query (Olaf Hartig)
The slides which I used at the Dagstuhl seminar on Principles of Provenance (Feb.2012) for presenting the main contributions and open issues of the PROV-AQ document created by the W3C provenance working group.
Zero-Knowledge Query Planning for an Iterator Implementation of Link Traversa... (Olaf Hartig)
The document describes zero-knowledge query planning for an iterator-based implementation of link traversal-based query execution. It discusses generating all possible query execution plans from the triple patterns in a query and selecting the optimal plan using heuristics without actually executing the plans. The key heuristics explored are using a seed triple pattern containing a URI as the first pattern, avoiding vocabulary terms as seeds, and placing filtering patterns close to the seed pattern. Evaluation involves generating all plans and executing each repeatedly to estimate costs and benefits for plan selection.
The Impact of Data Caching on Query Execution for Linked Data (Olaf Hartig)
The document discusses link traversal based query execution for querying linked data on the web. It describes an approach that alternates between evaluating parts of a query on a continuously augmented local dataset, and looking up URIs in solutions to retrieve more data and add it to the local dataset. This allows querying linked data as if it were a single large database, without needing to know all data sources in advance. A key issue is how to efficiently cache retrieved data to avoid redundant lookups.
Brief Introduction to the Provenance Vocabulary (for W3C prov-xg) (Olaf Hartig)
The document describes the Provenance Vocabulary, which defines an OWL ontology for describing provenance metadata on the Semantic Web. The vocabulary aims to integrate provenance into the Web of data to enable quality assessment. It partitions provenance descriptions into a core ontology and supplementary modules. Examples are provided to illustrate how the vocabulary can be used to describe the provenance of Linked Data, including information about data creation and retrieval processes. The design principles emphasize usability, flexibility, and integration with other vocabularies. Future work includes further alignment and additional modules to cover more provenance aspects.
Using Web Data Provenance for Quality Assessment (Olaf Hartig)
This document proposes using web data provenance for automated quality assessment. It defines provenance as information about the origin and processing of data. The goal is to develop methods to automatically assess quality criteria like timeliness. It outlines a general provenance-based assessment approach involving generating a provenance graph, annotating it with impact values representing how provenance elements influence quality, and calculating a quality score with an assessment function. As an example, it shows how the approach could be applied to assess the timeliness of sensor measurements based on their provenance.
Querying Linked Data with SPARQL
1. Querying Linked Data with SPARQL
ISWC 2009 Tutorial "How to Consume Linked Data on the Web"
2. Brief Introduction to SPARQL
● SPARQL: Query Language for RDF data
● Main idea: pattern matching
● Describe subgraphs of the queried RDF graph
● Subgraphs that match your description yield a result
● Means: graph patterns (i.e. RDF graphs with variables)
?v rdf:type http://.../Volcano
3. Brief Introduction to SPARQL
Queried graph:
  http://.../Mount_Baker   rdf:type         http://.../Volcano
  http://.../Mount_Baker   p:lastEruption   "1880"
  http://.../Mount_Etna    rdf:type         http://.../Volcano
Graph pattern:
  ?v rdf:type http://.../Volcano
Results:
  ?v
  http://.../Mount_Baker
  http://.../Mount_Etna
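Written out in full, the pattern from this example could be posed as the following SELECT query (a sketch; the URI stays abbreviated as in the slides):
SELECT ?v
WHERE { ?v rdf:type <http://.../Volcano> }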
4. SPARQL Endpoints
● Linked data sources usually provide a
SPARQL endpoint for their dataset(s)
● SPARQL endpoint: SPARQL query processing
service that supports the SPARQL protocol*
● Send your SPARQL query, receive the result
* http://www.w3.org/TR/rdf-sparql-protocol/
5. SPARQL Endpoints
Data Source           Endpoint Address
DBpedia               http://dbpedia.org/sparql
Musicbrainz           http://dbtune.org/musicbrainz/sparql
U.S. Census           http://www.rdfabout.com/sparql
Semantic Crunchbase   http://cb.semsol.org/sparql
More complete list:
http://esw.w3.org/topic/SparqlEndpoints
6. Accessing a SPARQL Endpoint
● SPARQL endpoints: RESTful Web services
● Issuing SPARQL queries to a remote SPARQL
endpoint is basically an HTTP GET request to
the SPARQL endpoint with parameter query
GET /sparql?query=PREFIX+rd... HTTP/1.1
Host: dbpedia.org
User-agent: my-sparql-client/0.1
URL-encoded string with the SPARQL query
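A minimal sketch of such a request in plain Java, using only the standard library (the endpoint address and query are placeholders):
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;
import java.util.Scanner;

public class SparqlGet {
    public static void main( String[] args ) throws Exception {
        String endpoint = "http://dbpedia.org/sparql";
        String query = "SELECT ?v WHERE { ?v a ?type } LIMIT 5";

        // append the URL-encoded query string as the 'query' parameter
        URL url = new URL( endpoint + "?query=" + URLEncoder.encode( query, "UTF-8" ) );
        URLConnection conn = url.openConnection();
        // ask for a specific result format (see the following slides)
        conn.setRequestProperty( "Accept", "application/sparql-results+json" );

        // read the whole response into a string and print it
        try ( InputStream in = conn.getInputStream();
              Scanner s = new Scanner( in, "UTF-8" ).useDelimiter( "\\A" ) ) {
            System.out.println( s.hasNext() ? s.next() : "" );
        }
    }
}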
7. Query Results Formats
● SPARQL endpoints usually support different
result formats:
● XML, JSON, plain text
(for ASK and SELECT queries)
● RDF/XML, NTriples, Turtle, N3
(for DESCRIBE and CONSTRUCT queries)
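For instance, a SELECT result in the JSON format has roughly the following shape (standardized as the SPARQL Query Results JSON Format; the binding shown is illustrative):
{
  "head":    { "vars": [ "v" ] },
  "results": {
    "bindings": [
      { "v": { "type": "uri", "value": "http://dbpedia.org/resource/Mount_Etna" } }
    ]
  }
}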
8. Query Results Formats
PREFIX dbp: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>
SELECT ?name ?bday WHERE {
?p dbp:birthplace <http://dbpedia.org/resource/Berlin> ;
dbpprop:dateOfBirth ?bday ;
dbpprop:name ?name .
}
name | bday
------------------------+------------
Alexander von Humboldt | 1769-09-14
Ernst Lubitsch | 1892-01-28
...
11. Query Result Formats
● Use the Accept header to request the preferred result format:
GET /sparql?query=PREFIX+rd... HTTP/1.1
Host: dbpedia.org
User-agent: my-sparql-client/0.1
Accept: application/sparql-results+json
12. Query Result Formats
● As an alternative, some SPARQL endpoint implementations (e.g. Joseki) provide an additional parameter out
GET /sparql?out=json&query=... HTTP/1.1
Host: dbpedia.org
User-agent: my-sparql-client/0.1
13. Accessing a SPARQL Endpoint
● More convenient: use a library
● Libraries:
● SPARQL JavaScript Library
http://www.thefigtrees.net/lee/blog/2006/04/sparql_calendar_demo_a_sparql.html
● ARC for PHP
http://arc.semsol.org/
● RAP – RDF API for PHP
http://www4.wiwiss.fu-berlin.de/bizer/rdfapi/index.html
14. Accessing a SPARQL Endpoint
● Libraries (cont.):
● Jena / ARQ (Java) http://jena.sourceforge.net/
● Sesame (Java) http://www.openrdf.org/
● SPARQL Wrapper (Python)
http://sparql-wrapper.sourceforge.net/
● PySPARQL (Python)
http://code.google.com/p/pysparql/
15. Accessing a SPARQL Endpoint
● Example with Jena / ARQ:
import com.hp.hpl.jena.query.*;

String service = "..."; // address of the SPARQL endpoint
String query = "SELECT ..."; // your SPARQL query

// execute the query remotely at the given endpoint
QueryExecution e = QueryExecutionFactory.sparqlService( service, query );
ResultSet results = e.execSelect();
while ( results.hasNext() ) {
    QuerySolution s = results.nextSolution();
    // process the current solution ...
}
e.close();
16. ● Querying a single dataset is quite boring
compared to:
● Issuing SPARQL queries over multiple datasets
● How can you do this?
1. Issue follow-up queries to different endpoints
2. Querying a central collection of datasets
3. Build store with copies of relevant datasets
4. Use query federation system
17. Follow-up Queries
● Idea: issue follow-up queries over other
datasets based on results from previous
queries
● Substituting placeholders in query templates
18.
// Find a list of companies filtered by some criteria and
// return the DBpedia URIs of them
String s1 = "http://cb.semsol.org/sparql";
String s2 = "http://dbpedia.org/sparql";
String qTmpl = "SELECT ?c WHERE { <%s> rdfs:comment ?c }";
String q1 = "SELECT ?s WHERE { ...";
QueryExecution e1 = QueryExecutionFactory.sparqlService( s1, q1 );
ResultSet results1 = e1.execSelect();
while ( results1.hasNext() ) {
    QuerySolution sol = results1.nextSolution();
    // substitute the placeholder with a URI from the previous result
    String q2 = String.format( qTmpl, sol.getResource( "s" ).getURI() );
    QueryExecution e2 = QueryExecutionFactory.sparqlService( s2, q2 );
    ResultSet results2 = e2.execSelect();
    while ( results2.hasNext() ) {
        // ...
    }
    e2.close();
}
e1.close();
19. Follow-up Queries
● Advantage:
● Queried data is up-to-date
● Drawbacks:
● Requires the existence of a SPARQL endpoint for
each dataset
● Requires program logic
● Very inefficient
20. Querying a Collection of Datasets
● Idea: Use an existing SPARQL endpoint that
provides access to a set of copies of relevant
datasets
● Example:
● SPARQL endpoint by OpenLink SW over a majority
of datasets from the LOD cloud at:
http://lod.openlinksw.com/sparql
21. Querying a Collection of Datasets
● Advantage:
● No need for specific program logic
● Drawbacks:
● Queried data might be out of date
● Not all relevant datasets in the collection
22. Own Store of Dataset Copies
● Idea: Build your own store with copies of
relevant datasets and query it
● Possible stores:
● Jena TDB http://jena.hpl.hp.com/wiki/TDB
● Sesame http://www.openrdf.org/
● OpenLink Virtuoso http://virtuoso.openlinksw.com/
● 4store http://4store.org/
● AllegroGraph http://www.franz.com/agraph/
● etc.
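A minimal sketch of this approach with Jena TDB (assuming Jena's TDB API of that era; the store directory and the dump file dump.nt are placeholders):
import com.hp.hpl.jena.query.*;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.tdb.TDBFactory;
import java.io.FileInputStream;

public class LocalStoreExample {
    public static void main( String[] args ) throws Exception {
        // open (or create) a persistent TDB store in the given directory
        Dataset dataset = TDBFactory.createDataset( "/tmp/mystore" );

        // load a local copy of a dataset (here an N-Triples dump)
        Model model = dataset.getDefaultModel();
        model.read( new FileInputStream( "dump.nt" ), null, "N-TRIPLE" );

        // query the local copy like any other Jena model
        String query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10";
        QueryExecution e = QueryExecutionFactory.create( query, model );
        ResultSet results = e.execSelect();
        while ( results.hasNext() ) {
            System.out.println( results.nextSolution() );
        }
        e.close();
    }
}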
23. Own Store of Dataset Copies
● Advantages:
● No need for specific program logic
● Can include all datasets
● Independent of the existence, availability, and
efficiency of SPARQL endpoints
● Drawbacks:
● Requires effort to set up and to operate the store
● Ideally, data sources provide RDF dumps; if not?
● How to keep the copies in sync with the originals?
● Queried data might be out of date
24. Federated Query Processing
● Idea: Querying a mediator which distributes subqueries to relevant sources and integrates the results
25. Federated Query Processing
● Instance-based federation
● Each thing described by only one data source
● Untypical for the Web of Data
● Triple-based federation
● No restrictions
● Requires more distributed joins
● Statistics about datasets are required (in both cases)
26. Federated Query Processing
● DARQ (Distributed ARQ)
http://darq.sourceforge.net/
● Query engine for federated SPARQL queries
● Extension of ARQ (query engine for Jena)
● Last update: June 28, 2006
27. Federated Query Processing
● Semantic Web Integrator and Query Engine
(SemWIQ) http://semwiq.sourceforge.net/
● Actively maintained by Andreas Langegger
28. Federated Query Processing
● Advantages:
● No need for specific program logic
● Queried data is up to date
● Drawbacks:
● Requires the existence of a SPARQL endpoint for
each dataset
● Requires effort to set up and configure the mediator
29. In any case:
● You have to know the relevant data sources
● When developing the app using follow-up queries
● When selecting an existing SPARQL endpoint over
a collection of dataset copies
● When setting up your own store with a collection of
dataset copies
● When configuring your query federation system
● You restrict yourself to the selected sources
30. In any case:
There is an alternative:
Remember, URIs link to data
31. Automated
Link Traversal
32. Automated Link Traversal
● Idea: Discover further data by looking up
relevant URIs in your application
● Can be combined with the previous approaches
33. Link Traversal Based
Query Execution
● Applies the idea of automated link traversal to the
execution of SPARQL queries
● Idea:
● Intertwine query evaluation with traversal of RDF links
● Discover data that might contribute to query results
during query execution
● In alternation:
● Evaluate parts of the query
● Look up URIs in intermediate solutions
(see the sketch below)
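A naive sketch of this idea with plain Jena, using the (partly hypothetical) URIs of the example on the following slides; a real engine interleaves lookups with the evaluation of individual triple patterns instead of re-running whole queries:

import com.hp.hpl.jena.query.*;
import com.hp.hpl.jena.rdf.model.*;

Model queriedData = ModelFactory.createDefaultModel();
// Seed: look up a URI mentioned in the query
queriedData.read( "http://mymovie.db/movie2449" );

// Evaluate the first pattern over the data retrieved so far
// (the full predicate URI behind mov:filming_location is assumed)
String q = "SELECT ?c WHERE { <http://mymovie.db/movie2449> "
         + "<http://mymovie.db/ns#filming_location> ?c }";
QueryExecution qe = QueryExecutionFactory.create( q, queriedData );
ResultSet results = qe.execSelect();
while ( results.hasNext() ) {
    Resource c = results.nextSolution().getResource( "c" );
    queriedData.read( c.getURI() ); // traverse the link: fetch more data
    // ... then evaluate the remaining patterns over the grown model
}
qe.close();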
34. Link Traversal Based
Query Execution
SELECT ?c ?u WHERE {
<http://mymovie.db/movie2449> mov:filming_location ?c .
?c geo:statistics ?cStats .
?cStats stat:unempRate ?u . }
● Example:
Return the unemployment rate of the countries in
which the movie http://mymovie.db/movie2449
was filmed.
35. Link Traversal Based
Query Execution
[Figure: the engine looks up the query's seed URI
http://mymovie.db/movie2449]
36. Link Traversal Based
Query Execution
[Figure: the retrieved data is added to the queried data]
37. Link Traversal Based
Query Execution
The queried data now contains the retrieved triples:
...
<http://mymovie.db/movie2449>
mov:filming_location <http://geo.../Italy> .
...
38. Link Traversal Based
Query Execution
Evaluating the first triple pattern against the queried data
yields an intermediate solution: ?c = http://geo.../Italy
39. Link Traversal Based
Query Execution
[Figure: the engine looks up http://geo.../Italy, the URI bound
in the intermediate solution]
40. Link Traversal Based
Query Execution
[Figure: the lookup of http://geo.../Italy is answered with RDF data]
41. Link Traversal Based
Query Execution
[Figure: the data retrieved for http://geo.../Italy is added
to the queried data]
42. Link Traversal Based
Query Execution
The queried data now also contains:
...
<http://geo.../Italy>
geo:statistics <http://example.db/stat/IT> .
...
43. Link Traversal Based
Query Execution
Evaluating the second triple pattern extends the intermediate
solution: ?c = http://geo.../Italy, ?cStats = http://example.db/stat/IT
44. Link Traversal Based
Query Execution
● Proceed with this strategy
(traverse RDF links
during query execution)
45. Link Traversal Based
Query Execution
● Advantages:
● No need to know all data sources in advance
● No need for specific programming logic
● Queried data is up to date
● Independent of the existence of SPARQL endpoints
provided by the data sources
● Drawbacks:
● Not as fast as a centralized collection of copies
● Unsuitable for some queries
● Results might be incomplete
46. Implementations
● Semantic Web Client library (SWClLib) for Java
http://www4.wiwiss.fu-berlin.de/bizer/ng4j/semwebclient/
● SWIC for Prolog http://moustaki.org/swic/
47. Implementations
● SQUIN http://squin.org
● Provides SWClLib functionality as a Web service
● Accessible like a SPARQL endpoint
● Public SQUIN service at:
http://squin.informatik.hu-berlin.de/SQUIN/
● Install package: unzip and start
● Convenient access with SQUIN PHP tools:
$s = 'http:// …'; // address of the SQUIN service
$q = new SparqlQuerySock( $s, '… SELECT ...' );
$res = $q->getJsonResult(); // or getXmlResult()
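Since the service is accessible like a SPARQL endpoint, it should equally be reachable from Jena; a sketch under that assumption (placeholder query):

import com.hp.hpl.jena.query.*;

String service = "http://squin.informatik.hu-berlin.de/SQUIN/";
String query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10"; // placeholder query
QueryExecution qe = QueryExecutionFactory.sparqlService( service, query );
ResultSet results = qe.execSelect();
while ( results.hasNext() ) {
    System.out.println( results.nextSolution() );
}
qe.close();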
48. Real-World Examples
Return the phone numbers of authors of ontology engineering papers at ESWC'09:

SELECT DISTINCT ?author ?phone WHERE {
  ?pub swc:isPartOf
    <http://data.semanticweb.org/conference/eswc/2009/proceedings> .
  ?pub swc:hasTopic ?topic .
  ?topic rdfs:label ?topicLabel .
  FILTER regex( str(?topicLabel), "ontology engineering", "i" ) .
  ?pub swrc:author ?author .
  { ?author owl:sameAs ?authorAlt }
  UNION
  { ?authorAlt owl:sameAs ?author }
  ?authorAlt foaf:phone ?phone .
}

Execution statistics:
# of query results: 2
# of retrieved graphs: 297
# of accessed servers: 16
avg. execution time: 1min 30sec