The document describes data workflows and data integration systems. It defines a data integration system as IS=<O,S,M>, where O is a global schema, S is a set of data sources, and M is a set of mappings between them. It discusses different views of data workflows, including ETL processes, Linked Data workflows, and the data science process. Key steps in data workflows include extraction, integration, cleansing, enrichment, etc. Tools to support different steps are also listed. The document introduces the global-as-view (GAV) and local-as-view (LAV) approaches to specifying the mappings M between the global and local schemas using conjunctive rules.
This document outlines a course on Knowledge Representation (KR) on the Web. The course aims to expose students to challenges of applying traditional KR techniques to the scale and heterogeneity of data on the Web. Students will learn about representing Web data through formal knowledge graphs and ontologies, integrating and reasoning over distributed datasets, and how characteristics such as volume, variety and veracity impact KR approaches. The course involves lectures, literature reviews, and milestone projects where students publish papers on building semantic systems, modeling Web data, ontology matching, and reasoning over large knowledge graphs.
Provenance and Reuse of Open Data (PILOD 2.0, June 2014) - Rinke Hoekstra
The document discusses ingredients for publishing open data, including using URIs, versioning, repeatable transformations, choosing an appropriate level of detail, combining vocabularies, contextualizing information, and provenance. Provenance, or the origin and history of data, is a key issue in publishing open government data and builds trust for application developers and the public. Standards like the W3C PROV ontology can help represent provenance.
The document describes an analysis of 177 scientific workflows from Taverna and Wings systems. The analysis identified common "motifs" in workflows, including data-oriented motifs characterizing common data activities, and workflow-oriented motifs characterizing how activities are implemented. These motifs could help inform workflow design and the creation of automated tools to generate workflow abstractions, in order to facilitate understanding and reuse of workflows.
Prov-O-Viz is a visualisation service for provenance graphs expressed using the W3C PROV vocabulary. It uses the Sankey-style visualisation from D3js.
See http://provoviz.org
Managing Metadata for Science and Technology Studies: the RISIS case - Rinke Hoekstra
Presentation of our paper at the WHISE workshop at ESWC 2016 on requirements for metadata over non-public datasets for the science & technology studies field.
The document discusses the concept of "Broad Data" which refers to the large amount of freely available but widely varied open data on the World Wide Web, including structured and semi-structured data. It provides examples such as the growing linked open data cloud and over 710,000 datasets available from governments around the world. Broad data poses new challenges for data search, modeling, integration and visualization of partially modeled datasets. International open government data search and linking government data to additional contexts are also discussed.
The literature contains a myriad of recommendations, advice, and strictures about what data providers should do to facilitate data reuse. It can be overwhelming. Based on recent empirical work (analyzing data reuse proxies at scale, understanding data sensemaking and looking at how researchers search for data), I talk about what practices are a good place to start for helping others to reuse your data.
Presentation for NEC Lab Europe.
Knowledge graphs are increasingly built using complex, multifaceted machine learning-based systems relying on a wide range of different data sources. To be effective, these must constantly evolve and thus be maintained. I present work on combining knowledge graph construction (e.g. information extraction) and refinement (e.g. link prediction) in end-to-end systems. In particular, I will discuss recent work on using inductive representations for link prediction. I then discuss the challenges of ongoing system maintenance, knowledge graph quality and traceability.
1. The document describes a study that aimed to develop an open government data (OGD) platform that integrates OGD and social media features to better stimulate value generation from OGD.
2. Researchers designed a prototype platform with features like data processing, feedback/collaboration, data quality ratings, and grouping/interaction capabilities.
3. An evaluation of the prototype found that users appreciated the novel social media-inspired features and found them useful for collaborating around OGD.
Data Communities - reusable data in and outside your organization - Paul Groth
Description
Data is critical both to the functioning of an organization and as a product. How can you make that data more usable for both internal and external stakeholders? There are a myriad of recommendations, advice, and strictures about what data providers should do to facilitate data (re)use. It can be overwhelming. Based on recent empirical work (analyzing data reuse proxies at scale, understanding data sensemaking and looking at how researchers search for data), I talk about what practices are a good place to start for helping others to reuse your data. I put this in the context of the notion of data communities that organizations can use to help foster the use of data both within your organization and externally.
Content + Signals: The value of the entire data estate for machine learning - Paul Groth
Content-centric organizations have increasingly recognized the value of their material for analytics and decision support systems based on machine learning. However, as anyone involved in machine learning projects will tell you the difficulty is not in the provision of the content itself but in the production of annotations necessary to make use of that content for ML. The transformation of content into training data often requires manual human annotation. This is expensive particularly when the nature of the content requires subject matter experts to be involved.
In this talk, I highlight emerging approaches to tackling this challenge using what's known as weak supervision - using other signals to help annotate data. I discuss how content companies often overlook resources that they have in-house to provide these signals. I aim to show how looking at a data estate in terms of signals can amplify its value for artificial intelligence.
This document discusses how semantic technologies can help link datasets to publications and institutions to enable new forms of data search and showcasing. It notes that standard schemas and formats are needed to allow linkages between data repositories. Knowledge graphs can help relate entities like papers, authors and institutions to facilitate disambiguation and multi-institutional search capabilities. Semantic technologies are seen as central to efficiently building these linkages at scale across the research data ecosystem.
This document summarizes Ted Dunning's approach to recommendations based on his 1993 paper. The approach involves:
1. Analyzing user data to determine which items are statistically significant co-occurrences
2. Indexing items in a search engine with "indicator" fields containing IDs of significantly co-occurring items
3. Providing recommendations by searching the indicator fields for a user's liked items
The approach is demonstrated in a simple web application using the MovieLens dataset. Further work could optimize and expand on the approach.
This document discusses two presentations on cognition for the semantic web. The first presentation discusses methods for involving humans in semantic data management, including crowdsourcing, citizen science, and games with a purpose. It provides examples of how these techniques can be used for tasks like data linking and validation. The second presentation discusses building cognitive and semantic systems to support understanding data and phenomena through visual examples. It aims to explain why and how these systems can make sense of data and foster understanding.
A New Year in Data Science: ML Unpaused - Paco Nathan
This document summarizes Paco Nathan's presentation at Data Day Texas in 2015. Some key points:
- Paco Nathan discussed observations and trends from the past year in machine learning, data science, big data, and open source technologies.
- He argued that the definitions of data science and statistics are flawed and ignore important areas like development, visualization, and modeling real-world business problems.
- The presentation covered topics like functional programming approaches, streaming approximations, and the importance of an interdisciplinary approach combining computer science, statistics, and other fields like physics.
- Paco Nathan advocated for newer probabilistic techniques for analyzing large datasets that provide approximations using less resources compared to traditional batch processing approaches.
Crowdsourced Data Processing: Industry and Academic Perspectives - Aditya Parameswaran
This document provides a tutorial on crowdsourced data processing from both academic and industry perspectives. The tutorial is divided into three parts. Part 0 provides a background on crowdsourcing and surveys Parts 1 and 2. Part 1 surveys crowdsourced data processing algorithms from academia, discussing unit operations, cost models, error models, and examples like filtering and sorting. Part 2 surveys crowdsourced data processing in industry, finding that many large companies use internal platforms at large scale for tasks like categorization and content moderation, and that academic research is not yet widely used in industry.
Knowledge Graphs - Ilaria Maresi, The Hyve, 23 Apr 2020 - Pistoia Alliance
Data for drug discovery and healthcare is often trapped in silos which hampers effective interpretation and reuse. To remedy this, such data needs to be linked both internally and to external sources to make a FAIR data landscape which can power semantic models and knowledge graphs.
This document summarizes a seminar on machine learning using big data. It discusses the history of data storage and traditional databases. It then introduces machine learning and the types of learning, including supervised and unsupervised learning. Specific algorithms for each type are covered such as k-means clustering for unsupervised and naive Bayes for supervised. Case studies on applications like Amazon product recommendations are presented. The document concludes by discussing tools for machine learning and future applications as more connected devices generate extensive data.
Entity matching and entity resolution are becoming increasingly important disciplines in data management, driven by the growing number of data sources that must be addressed in an economy undergoing digital transformation, growing data volumes and increasing requirements related to data privacy. The data matching process is also called record linkage, entity matching or entity resolution in some published works. For a long time, research on the process focused on matching entities from the same dataset (i.e. deduplication) or from two datasets. Different algorithms for matching different types of attributes have been described in the literature, developed, and implemented in data matching and data cleansing platforms. Entity resolution is an element of the larger entity integration process, which includes data acquisition, data profiling, data cleansing, schema alignment, data matching and data merging (fusion).
As a motivating example, we can use a global pharmaceutical company with offices in more than 60 countries worldwide that migrated customer data from various legacy systems in different countries to a new common CRM system in the cloud. The migration was phased by regions and countries, with new sources and data incrementally added and merged with data already migrated in previous phases. Entity integration in such a case requires deep understanding of the data architectures, the data content and each step of the process. Even with such deep understanding, designing and implementing the solution requires many iterations in the development process that consume human resources, time and money. Reducing the number of iterations by automating and optimizing steps in the process can save a vast amount of resources. There is a lot of available literature addressing each of the steps in the process, proposing different options for improving results or optimizing processing, but the whole process still requires a lot of human work, subject-matter-specific knowledge and many iterations to produce results with a high F-measure (both high precision and recall). Most of the algorithms used in the various steps of the process are human-in-the-loop (HITL) algorithms that require human interaction. The human is always part of the simulation and consequently influences the outcome.
This paper is part of work in progress aimed at defining a conceptual framework that tries to automate and optimize some steps of the entity integration process and to reduce the need for human involvement in the process. The focus of this paper is on the conceptual process definition, a recommended data architecture and the use of existing open source solutions for automating and optimizing the entity integration process.
This document discusses several key aspects of mathematics and algorithms used in internet information retrieval and search engines:
1. It explains how search engines like Google can rapidly rank billions of web pages using algorithms based on the topology and link structure of the web graph, such as PageRank.
2. It describes two main types of page ranking algorithms - static importance ranking based on link analysis, and dynamic relevance ranking based on statistical learning models to match pages to queries.
3. It proposes a new ranking algorithm called BrowseRank that models user browsing behavior using Markov chains and takes into account visit duration to better reflect true page importance.
This document summarizes a presentation on data discovery. It discusses key concepts in data discovery including data source joining, ontologies and taxonomies, rules of data discovery, single source of truth, and data visualization. It emphasizes the importance of not discarding original data and keeping track of the data transformation process to maintain data provenance and lineage. Overall, the presentation aims to illuminate how to understand and work with data through concepts in data discovery.
The document discusses several mathematical models and algorithms used in internet information retrieval and search engines:
1. Markov chain methods can be used to model a user's web surfing behavior and page visit transitions.
2. BrowseRank models user browsing as a Markov process to calculate page importance based on observed user behavior rather than artificial assumptions.
3. Learning to rank problems in information retrieval can be framed as a two-layer statistical learning problem where queries are the first layer and document relevance judgments are the second layer.
4. Stability theory can provide generalization bounds for learning to rank algorithms under this two-layer framework. Modifying algorithms like SVM and Boosting to have query-level stability improves performance.
Democratizing Data within your organization - Data Discovery - Mark Grover
In this talk, we talk about the challenges at scale in an organization like Lyft. We delve into data discovery as a challenge towards democratizing data within your organization. And we go into detail about the solution to the challenge of data discovery.
Talk at ISIM 2017 in Durham, UK on applying database techniques to querying model results in the geosciences, with a broader position about the interaction between data science and simulation as modes of scientific inquiry.
Towards Knowledge Graph based Representation, Augmentation and Exploration of... - Sören Auer
This document discusses improving scholarly communication through knowledge graphs. It describes some current issues with scholarly communication like lack of structure, integration, and machine-readability. Knowledge graphs are proposed as a solution to represent scholarly concepts, publications, and data in a structured and linked manner. This would help address issues like reproducibility, duplication, and enable new ways of exploring and querying scholarly knowledge. The document outlines a ScienceGRAPH approach using cognitive knowledge graphs to represent scholarly knowledge at different levels of granularity and allow for intuitive exploration and question answering over semantic representations.
The document discusses data workflows and integrating open data from different sources. It defines a data workflow as a series of well-defined functional units where data is streamed between activities such as extraction, transformation, and delivery. The document outlines key steps in data workflows including extraction, integration, aggregation, and validation. It also discusses challenges around finding rules and ontologies, data quality, and maintaining workflows over time. Finally, it provides examples of data integration systems and relationships between global and source schemas.
This document discusses transforming open government data from Romania into linked open data. It begins with background on linked data and open data initiatives. Then it describes efforts to model, transform, link, and publish Romanian open data as linked open data. This includes identifying common vocabularies and properties, creating URIs, linking to external datasets like DBPedia, and publishing the linked data for use in applications via a SPARQL endpoint. Overall the goal is to make this data more accessible and interoperable through semantic web standards.
Fox-Keynote-Now and Now of Data Publishing-nfdp13 - DataDryad
The document summarizes Peter Fox's presentation at the Now and Now for Data conference in Oxford, UK on May 22, 2013. Fox discusses different metaphors for making data publicly available, including data publication, ecosystems, and frameworks for conversations about data. He examines pros and cons of different approaches like data centers, publishers, and linked data. The presentation considers how to improve data sharing and what roles different stakeholders like producers and consumers play.
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data... - Denodo
Watch the full session: Denodo DataFest 2016 sessions: https://goo.gl/Bvmvc9
Data prep and data blending are terms that have come to prominence over the last year or two. On the surface, they appear to offer functionality similar to data virtualization…but there are important differences!
In this session, you will learn:
• How data virtualization complements or contrasts technologies such as data prep and data blending
• Pros and cons of functionality provided by data prep, data catalog and data blending tools
• When and how to use these different technologies to be most effective
This session is part of the Denodo DataFest 2016 event. You can also watch more Denodo DataFest sessions on demand here: https://goo.gl/VXb6M6
Linked data provides benefits for publishing and sharing research data on the web in a flexible, cost-efficient way without unnecessary copies. It uses the RDF data model and SPARQL query language to represent data as connected triples with URIs. This allows data to be interlinked across sources and queried as a web of data. Initiatives like GO FAIR have incorporated linked data practices like FAIRification to help make data findable, accessible, interoperable and reusable according to FAIR principles. The future potential of linked data includes enabling global access to connected knowledge across heterogeneous environments and facilitating smart collaboration.
Unlock Your Data for ML & AI using Data Virtualization - Denodo
How Denodo Complements the Logical Data Lake in the Cloud
● Denodo does not substitute data warehouses, data lakes, ETLs...
● Denodo enables the use of all of them together, plus other data sources
○ In a logical data warehouse
○ In a logical data lake
○ They are very similar; the only difference is in the main objective
● There are also use cases where Denodo can be used as a data source in an ETL flow
This document discusses implementing Linked Data in low resource conditions. It begins by outlining goals of providing a high-level view of Linked Data, identifying possible bottlenecks due to limited resources, and offering suggestions to overcome bottlenecks based on experience. It then defines what is meant by "low-resource conditions", including limited IT competencies, software, hardware, electricity, and internet access. The document outlines the Linked Data workflow and discusses each step in more detail, including data generation, conversion to RDF, data storage, maintenance, linking, and exposure. It highlights the example of AGRIS, a collaborative Linked Data application, and emphasizes starting small, being strategic, reusing existing resources, and collaborating to maximize resources in low-resource conditions.
This document outlines DBpedia's strategy to become a global open knowledge graph by facilitating collaboration on data. It discusses establishing governance and curation processes to improve data quality and enable organizations to incubate their knowledge graphs. The goals are to have millions of users and contributors collaborating on data through services like GitHub for data. Technologies like identifiers, schema mapping, and test-driven development help integrate data. The vision is for DBpedia to connect many decentralized data sources so data becomes freely available and easier to work with.
NSF Workshop Data and Software Citation, 6-7 June 2016, Boston USA, Software Panel
Findable, Accessible, Interoperable, Reusable Software and Data Citation: Europe, Research Objects, and BioSchemas.org
Lesson of a course on "Open Data and Linked Open Data" for the Master in "ICT for Cultural Heritage" of the Technological District for Cultural Heritage (DATABENC).
The current status of Linked Open Data (LOD) shows evidence of many datasets available on the Web in RDF. In the meantime, there are still many challenges for organizations to overcome in their journey of publishing five-star datasets on the Web. Those challenges are not only technical, but also organizational. At this moment, when connectionist AI is gaining a wave of popularity with many applications, LOD needs to go beyond the guarantee of FAIR principles. One direction is to build a sustainable LOD ecosystem with FAIR-S principles. In parallel, LOD should serve as a catalyzer for solving societal issues (LOD for Social Good) and personal empowerment through data (Social Linked Data).
Being FAIR: FAIR data and model management - SSBSS 2017 Summer School - Carole Goble
Lecture 1:
Being FAIR: FAIR data and model management
In recent years we have seen a change in expectations for the management of all the outcomes of research – that is the “assets” of data, models, codes, SOPs, workflows. The “FAIR” (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship [1] have proved to be an effective rallying-cry. Funding agencies expect data (and increasingly software) management retention and access plans. Journals are raising their expectations of the availability of data and codes for pre- and post- publication. The multi-component, multi-disciplinary nature of Systems and Synthetic Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
Our FAIRDOM project (http://www.fair-dom.org) supports Systems Biology research projects with their research data, methods and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety. The FAIRDOM Platform has been installed by over 30 labs or projects. Our public, centrally hosted Asset Commons, the FAIRDOMHub.org, supports the outcomes of 50+ projects.
Now established as a grassroots association, FAIRDOM has over 8 years of experience of practical asset sharing and data infrastructure at the researcher coal-face ranging across European programmes (SysMO and ERASysAPP ERANets), national initiatives (Germany's de.NBI and Systems Medicine of the Liver; Norway's Digital Life) and European Research Infrastructures (ISBE) as well as in PI's labs and Centres such as the SynBioChem Centre at Manchester.
In this talk I will show explore how FAIRDOM has been designed to support Systems Biology projects and show examples of its configuration and use. I will also explore the technical and social challenges we face.
I will also refer to European efforts to support public archives for the life sciences. ELIXIR (http://www.elixir-europe.org/) is the European Research Infrastructure of 21 national nodes and a hub funded by national agreements to coordinate and sustain key data repositories and archives for the Life Science community, improve access to them and related tools, support training and create a platform for dataset interoperability. As the Head of the ELIXIR-UK Node and co-lead of the ELIXIR Interoperability Platform I will show how this work relates to your projects.
[1] Wilkinson et al, The FAIR Guiding Principles for scientific data management and stewardship Scientific Data 3, doi:10.1038/sdata.2016.18
Ben Gardner | Delivering a Linked Data warehouse and integrating across the w... - semanticsconference
- The document discusses building a linked data warehouse at a law firm to integrate siloed information from diverse sources and allow for flexible information discovery.
- They extracted data from various sources like Excel reports, XML files, and databases into RDF triples and loaded them into a triple store with an ETL platform.
- This created a logical data warehouse that connects matters, people, clients, and other entities as "things" with relationships, allowing exploration from different domain views.
- An OData integration allows querying the triple store from tools like Excel and Tableau, while a web interface demonstrates lens-based exploration of entities and composing advanced semantic searches.
OrientDB: Unlock the Value of Document Data Relationships - Fabrizio Fortino
a) A general introduction of graph databases and OrientDB,
b) Why connected data has more value than just data,
c)How to "have fun" with OrientDB combining documents with graphs via SQL,
d) A use case on how OrientDB has helped to raise standards in Irish Public Office.
On OrientDB: NOSQL document databases provide an elegant way to deal with data in different shapes enabling developers to create better and faster products quickly. The main goal of these systems is to find the most efficient solution to manage data itself. With the Big Data Explosion we need to deal with a myriad of highly interconnected information. The challenge now is not only on how to store data but on how to manage, analyse, traverse and use your data within the context of relationships. Graph databases shine at maintaining highly connected data and is the fastest growing category in database management systems: 2014 registered an increase of 250% in terms of adoption and Forrester Research predicts that more than a quarter of enterprises will be using graphs by 2017. OrientDB combines more than one NOSQL model offering the unique flexibility of modelling data in the form of either documents, or graphs, while incorporating object oriented programming as a way of encapsulating relationships.
Linked Open Data Principles, benefits of LOD for sustainable development - Martin Kaltenböck
Presentation held on 18.09.2013 at the OKCon 2013 in Geneva, Switzerland in the course of the workshop: How Linked Open data supports Sustainable Development and Climate Change Development by Martin Kaltenböck (SWC), Florian Bauer (REEEP) and Jens Laustsen (GBPN).
3. 3
Outline
• Motivation
• Integrating (Open) Data from different sources
• Not only Linked Data (NoLD)
• Data workflows and Data Management in the context of the rise of Big Data
• What is a "Data Workflow"?
• Different Views of Data Workflows in the context of the Semantic Web
• Key steps involved
• Tools?
• Data Integration Systems
• GAV vs. LAV
• The Mediator and Wrapper Architecture
• Query rewriting vs. Materialisation
• Data Integration using Ontologies
• Challenges:
• How to find Rules and ontologies?
• Handling Incompleteness
• How to find the data?
• Open Problems – Research Tasks
5. 5
Open Data is a global trend – Good for us!
• Cities, International Organizations, National and European portals, etc.:
• In general: more and more structured data available at our fingertips
• It's on the Web
• It's open → no restrictions w.r.t. re-use
6. 6
Buzzword Bingo 1/3:
Open Data vs. Big Data vs. Open Government
• http://www.opendatanow.com/2013/11/new-big-data-vs-open-data-mapping-it-out/
7. 7
Buzzword Bingo 2/3:
Open Data vs. Big Data
• Volume:
• It's growing! (we currently monitor 90 CKAN portals, 512,543 resources / 160,069 datasets; at the moment (statically) ~1 TB of CSV files alone...)
• Variety:
• different datasets (from different cities, countries, etc.), only partially comparable, partially not
• Different metadata to describe datasets
• Different data formats
• Velocity:
• Open Data changes regularly (fast and slow)
• New datasets appear, old ones disappear
• Value:
• building ecosystems ("Data value chain") around Open Data is a key priority of the EC
• Veracity:
• quality, trust
8. 8
Buzzword Bingo 3/3:
Open Data vs. Linked Data
cf.: [Polleres OWLED2013], [Polleres et al. Reasoning Web 2013]
LD efforts discontinued?!
LOD in OGD growing, but slowly
Alternatives in the meantime: (Wikidata...)
LOD is still growing, but OD is growing faster and the challenges aren't necessarily exactly the same…
So, let's focus on Open Data in general…
… more specifically on Open Structured Data
This talk is NOT about DL Reasoning over Linked Data.
9. 9
What makes Open Data useful
beyond "single dataset" Apps...
Great stuff, but limited potential...
More interesting:
• Data Integration & building Data Workflows from different Open Data sources!!!
10. 10
Is Open Data useful at all?
A concrete use case:
What is the CO2/capita in Bologna?
What is the population density of Athens?
What is the length of public transport in Vienna?
Overall ratings computed from (ideally most current) base indicators per city
11. 11
A concrete use case:
The "City Data Pipeline"
Idea – a "classic" Semantic Web use case!
• Regularly integrate various relevant Open Data sources (e.g. eurostat, UNData, ...)
• Make integrated data available for re-use
(How) can ontologies help me?
• Are ontology languages expressive enough?
• Which ontologies could I (re-)use?
• Is there enough data at all?
• Where to find the right data?
• Where to find the right ontologies?
• How to tackle inconsistencies?
12. 12
A concrete use case:
The "City Data Pipeline" – a "fairly standard" data workflow
That's quite a standard Data Workflow, isn't it?
13. 13
So:
a) What is a "standard data workflow"?
b) Where can/shall Semantic Technologies, but also traditional Data Integration technologies, be used to build such workflows?
16. 16
Different Views & Examples:
1/7 „Classic" ETL-Process in Datawarehousing
Wikipedia:
• In computing, Extract, Transform and Load (ETL) refers to a process in database usage and especially in data warehousing that:
• Extracts data from homogeneous or heterogeneous data sources
• Cleansing: deduplication, inconsistencies, missing data, ...
• Transforms the data for storing it in proper format or structure for querying and analysis purposes
• Loads it into the final target (database, more specifically, operational data store, data mart, or data warehouse)
• Typically assumes: fixed, static pipeline, fixed final schema in the final DB/DW
• Cleansing sometimes viewed as a part of Transform, sometimes not.
• Typically assumes complete/clean data at the "load" stage
• Aggregation sometimes viewed as a part of transformation, sometimes higher up in the data warehouse access layer (OLAP)
• WARNING: At each stage, things can go wrong! Filtering/aggregation may bias the data!
• References: [Golfarelli, Rizzi, 2009]
• https://en.wikipedia.org/wiki/Extract,_transform,_load
• https://en.wikipedia.org/wiki/Staging_%28data%29#Functions
"Hard-wired" Data integration
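A minimal sketch of the three ETL steps described above, using only the Python standard library; the file name cities.csv, the database warehouse.db and the table city_stats are hypothetical and only serve the illustration.

```python
# Minimal ETL sketch: extract from a CSV source, cleanse/transform, load into SQLite.
import csv
import sqlite3

def extract(path):
    """Extract: read rows from a (here: CSV) source."""
    with open(path, newline="", encoding="utf-8") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Cleanse and transform: drop incomplete rows, normalise types, deduplicate."""
    seen = set()
    for row in rows:
        if not row.get("city") or not row.get("population"):
            continue                      # missing data
        key = row["city"].strip().lower()
        if key in seen:
            continue                      # deduplication
        seen.add(key)
        yield (key, int(row["population"]))

def load(records, db="warehouse.db"):
    """Load: write the cleaned records into the target store."""
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS city_stats (city TEXT PRIMARY KEY, population INTEGER)")
    con.executemany("INSERT OR REPLACE INTO city_stats VALUES (?, ?)", records)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("cities.csv")))
```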
18. 18
Different Views & Examples:
3/7 Or is it rather a Lifecycle...
• E.g. good example: Linked Data Lifecycle
• NOTE: Independent of whether Linked Data or other sources, you need to revisit/revalidate your workflow, either for improving it or for maintenance (sources changing, source formats changing, etc.)
Axel-Cyrille Ngonga Ngomo, Sören Auer, Jens Lehmann, Amrapali Zaveri. Introduction to Linked Data and Its Lifecycle on the Web. Reasoning Web. 2014
19. 19
Different Views & Examples:
4/7 We're not the first ones to recognize this is actually a lifecycle… [Wiederhold92]
20. 20
Different Views & Examples:
5/7 The "Data Science" Process:
http://semanticommunity.info/Data_Science/Doing_Data_Science
What Would a Next-Gen Data Scientist Do?
"[…] data scientists […] spend a lot more time trying to get data into shape than anyone cares to admit—maybe up to 90% of their time. Finally, they don't find religion in tools, methods, or academic departments. They are versatile and interdisciplinary"
21. 21
Different Views & Examples:
7/7 Big Data & Data Management against “Data Lake”
Data Lake:
• Install all data regardless of requirements
• Store all data in native format without schema definition
• Do analysis using analytic engines (e.g., Hadoop)
Raghu Ramakrishnan, Big Data @ Microsoft
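A tiny sketch of the "schema-on-read" idea behind the data-lake view above: raw records are stored as-is and a schema is only imposed at analysis time. The file lake/events.jsonl and the field names are hypothetical.

```python
# Schema-on-read sketch: store raw JSON records untouched, decide on fields at analysis time.
import json

def read_lake(path="lake/events.jsonl"):
    """Read raw records from the lake; nothing is validated or reshaped here."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

def co2_by_city(records):
    """The analysis step decides which fields it needs and tolerates missing ones."""
    totals = {}
    for rec in records:
        city, value = rec.get("city"), rec.get("co2_per_capita")
        if city is not None and value is not None:
            totals.setdefault(city, []).append(value)
    return {c: sum(v) / len(v) for c, v in totals.items()}

# Example (assuming the file exists): print(co2_by_city(read_lake()))
```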
22. 22
General challenges to be addressed
Syntactic heterogeneity (different formats)
Distributed data sources
Non-standard processing
Semantic heterogeneity
Naming ambiguity
Uncertainty and evolving concepts
23. 23
Specific Steps (non-exhaustive, overlapping!)
• Extraction
• Inconsistency handling
• Incompleteness handling (sometimes called "Enrichment", sometimes imputation of missing values...)
• Data Integration (alignment, source reconciliation)
• Aggregation
• Cleansing (removing outliers)
• Deduplication/Interlinking (could involve "triplification")
• Analytics
• Enrichment
• Change detection (Maintenance/Evolution)
• Validation (quality analysis)
• Efficient, sometimes distributed (query) processing
• Visualization
Tools and current approaches support you partially in different parts of these steps... Bad news: there is no "one-size-fits-all" solution.
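A tiny illustration of a few of the steps above (deduplication, cleansing via a simple outlier rule, aggregation) on invented city-indicator data; pandas is used here only as one possible tool, and the numbers are made up.

```python
# Deduplication, crude outlier removal, and aggregation on hypothetical indicator data.
import pandas as pd

df = pd.DataFrame({
    "city": ["vienna", "vienna", "athens", "bologna", "bologna"],
    "year": [2014, 2014, 2014, 2014, 2015],
    "co2_per_capita": [6.1, 6.1, 5.4, 5.9, 180.0],  # 180.0 is an implausible outlier
})

# Deduplication: keep one record per (city, year).
df = df.drop_duplicates(subset=["city", "year"])

# Cleansing: drop values above a simple quantile threshold (a crude outlier rule).
upper = df["co2_per_capita"].quantile(0.95)
df = df[df["co2_per_capita"] <= upper]

# Aggregation: one value per city.
print(df.groupby("city")["co2_per_capita"].mean())
```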
24. 24
Some Tools (again, exemplary and SW-biased!):
• Linux command-line tools: curl, sed, awk (plus PostgreSQL) do a good job in
many cases...
• LOD2 stack, a stack of tools for integrating and generating Linked Data,
http://stack.lod2.eu/
• e.g., SILK http://silk-framework.com/ (interlinking/object consolidation)
• KARMA (extraction, data integration) http://usc-isi-i2.github.io/karma/
• RapidMiner Linked Data extension
http://dws.informatik.uni-mannheim.de/en/research/rapidminerlodextension/ [Gentile, Paulheim, et al. 2016]
• XSPARQL (extraction from XML and JSON, triplification)
http://sourceforge.net/projects/xsparql/ [Bischof et al. 2012]
• See also: https://ai.wu.ac.at/~polleres/20140826xsparql_st.etienne/
• STTL: A SPARQL-based Transformation Language for RDF
• See also: https://hal.inria.fr/hal-01150623 [Corby et al. 2015]
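As an illustration of how far the basic scripting route gets you, here is a minimal extraction-and-triplification sketch in Python with rdflib. The CSV file name, its columns, and the namespace are hypothetical assumptions; a real pipeline would add provenance, error handling, and a proper mapping language such as R2RML or XSPARQL.

```python
# Minimal extraction + "triplification" sketch: read a (hypothetical) CSV of
# city indicators and emit RDF triples with rdflib.
import csv
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

EX = Namespace("http://example.org/citydata#")   # illustrative namespace

def triplify(csv_path: str) -> Graph:
    g = Graph()
    g.bind("ex", EX)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):           # expects columns: city, area_km2
            city = EX[row["city"].replace(" ", "_")]
            # One triple per indicator column; here only the area in km2.
            g.add((city, EX.area, Literal(row["area_km2"], datatype=XSD.decimal)))
    return g

if __name__ == "__main__":
    print(triplify("cities.csv").serialize(format="turtle"))
```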
25. 25
Outline
• Motivation
• Integrating (Open) Data from different sources
• Not only Linked Data
• Data workflows and Open data in the context of rise of Big Data
• What is a "Data Workflow"?
• Different Views of Data Workflows in the context of the Semantic Web
• Key steps involved
• Tools?
• Data Integration Systems & Query Processing
• Data Integration Systems - GAV vs. LAV
• The Mediator and Wrapper Architecture
• Query rewriting vs. Materialisation
• Challenges:
• How to find Rules and ontologies?
• Incomplete Data
• How to find the data?
31. 31
Data Integration Systems[Lenzerini2002]
• IS=<O,S,M>
• Let O be a set of concepts in a global, virtual schema.
• Let S = {S1,...,Sn} be a set of data sources.
• Let M be a set of mappings between sources in S
and concepts in O.
cf. [Lenzerini 2002]
40. 40
Source Schema
(amFinancial rdf:type rdf:Property).
(euClimate rdf:type rdf:Property).
(tunisRating rdf:type rdf:Property).
(similarFinancial rdf:type rdf:Property).
amFinancial(C,R) provides the financial rating R of an American city C.
euClimate(C,R) provides the climate rating R of a European city C.
tunisRating(T,R) provides the rating R of Tunis for indicator T (climate or financial).
similarFinancial(C1,C2) relates two American cities C1 and C2 that have the same financial
rating.
42. 42
Integration Systems
S = {amFinancial(C,R), euClimate(C,R), tunisRating(T,R), similarFinancial(C1,C2)}
(Figure: the local schema S next to the global schema O, which contains the classes euroCity(C), amCity(C), afCity(C) and the properties grossGDP(C,R) and avgTemp(C,R), both rdfs:subPropertyOf rating(C,R).)
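For later reference, a minimal sketch of this example global schema written down as RDFS with rdflib; the namespace IRI is an illustrative assumption, only the class and property names come from the figure.

```python
# Load the example global schema O as a small RDFS vocabulary.
from rdflib import Graph

GLOBAL_SCHEMA = """
@prefix :     <http://example.org/global#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:euroCity a rdfs:Class . :amCity a rdfs:Class . :afCity a rdfs:Class .
:rating   a rdf:Property .
:grossGDP a rdf:Property ; rdfs:subPropertyOf :rating .
:avgTemp  a rdf:Property ; rdfs:subPropertyOf :rating .
"""

g = Graph()
g.parse(data=GLOBAL_SCHEMA, format="turtle")
print(len(g), "schema triples loaded")
```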
43. 43
Global-as-View (GAV):
• Concepts in the Global Schema (O) are defined in terms of
combinations of Sources (S).
Local-As-View (LAV):
• Sources in S are defined in terms of combinations of Concepts in O.
Global- & Local-As-View (GLAV):
• Combinations of concepts in the Global Schema (O) are defined in terms of
combinations of Sources (S).
Integration Systems
IS=<O,S,M>
49. 49
Query Rewriting GAV
§ A query Q in terms of the global schema elements in O.
§ Problem: Rewrite Q into a query Q’ expressed in sources
in S.
query(C):-grossGDP(C,R), amCity(C)
Example GAV:
α0: amCity(C):-amFinancial(C,R).
α1: grossGDP(C,R):-amFinancial(C,R).
α2: euroCity(C):-euClimate(C,R).
α3: avgTemp(C,R):-euClimate(C,R).
α4: grossGDP(“Tunis”,R):-tunisRating(“financial”,R).
α5: avgTemp(“Tunis”,R):-tunisRating(“climate”,R)
α6: afCity(“Tunis”).
α7: amCity(C1):-similarFinancial(C1,C2).
α8: amCity(C2):-similarFinancial(C1,C2).
α9: grossGDP(C1,R):-similarFinancial(C1,C2), amFinancial(C2,R).
50. 50
query1(C):-amFinancial(C,R),similarFinancial(C,C2).
Query Rewriting GAV
§ A query Q in terms of the global schema elements in O.
§ Problem: Rewrite Q into a query Q’ expressed in sources
in S.
Rewritings
query(C):-grossGDP(C,R), amCity(C)
Example GAV:
α1: grossGDP(C,R):-amFinancial(C,R).
α7: amCity(C1):-similarFinancial(C1,C2).
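To make the unfolding step concrete, here is a minimal GAV rewriting sketch in Python. It is a simplification: it only handles rules with variable head arguments (α1 and α7 above), so constants in rule heads as in α4-α6 and the removal of redundant rewritings are left out.

```python
# GAV unfolding sketch: replace each global atom of the query by the body of a
# matching GAV rule, renaming the rule's own variables apart.
from itertools import product, count

# atoms are (predicate, (args...)); rules are (head_atom, [body_atoms])
ALPHA1 = (("grossGDP", ("C", "R")), [("amFinancial", ("C", "R"))])
ALPHA7 = (("amCity", ("C1",)), [("similarFinancial", ("C1", "C2"))])
GAV_RULES = [ALPHA1, ALPHA7]

QUERY = [("grossGDP", ("C", "R")), ("amCity", ("C",))]

fresh = count()

def unfold_atom(atom):
    """All source conjunctions a single global atom can unfold into."""
    pred, args = atom
    for (hpred, hargs), body in GAV_RULES:
        if hpred != pred or len(hargs) != len(args):
            continue
        subst = dict(zip(hargs, args))     # head variable -> query term
        rename = {}                        # body-only variables -> fresh names
        def term(t):
            if t in subst:
                return subst[t]
            return rename.setdefault(t, f"_V{next(fresh)}")
        yield [(bpred, tuple(term(t) for t in bargs)) for bpred, bargs in body]

def unfold_query(query):
    """Cartesian product of the per-atom unfoldings = GAV rewritings."""
    for combo in product(*(list(unfold_atom(a)) for a in query)):
        yield [atom for part in combo for atom in part]

for i, rewriting in enumerate(unfold_query(QUERY), 1):
    print(f"query{i}(C) :- " + ", ".join(f"{p}({', '.join(a)})" for p, a in rewriting))
```

Running it prints query1(C) :- amFinancial(C, R), similarFinancial(C, _V0), i.e. the rewriting shown on the next slide (with a fresh name in place of C2).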
56. 56
query1(C):-amFinancial(C,R),similarFinancial(C,C2).
Query Rewriting GAV
§ A query Q in terms of the global schema elements in O.
§ Problem: Rewrite Q into a query Q’ expressed in sources
in S.
Rewritings
query(C):-grossGDP(C,R), amCity(C)
Example GAV:
query2(C):-similarFinancial(C,C2), amFinancial(C2,R),
similarFinancial(C,R1).
α9:grossGDP(C1,R):-similarFinancial(C1,C2),
amFinancial(C2,R).
α7: amCity(C1):-similarFinancial(C1,C2).
57. 57
When to use GAV
• Query rewriting is simpler (polynomial time in the size of the query).
• Sources do not change, and the global schema can change over time.
58. 58
Lower Bounds for the Space of Query
Rewritings
• For CQs and OWL 2 QL ontologies [Gottlob14]:
• Exponential and superpolynomial lower bounds on the
size of pure rewritings.
• Polynomial-size rewritings under some restrictions.
[Gottlob14]
Georg Gottlob, Stanislav Kikot, Roman Kontchakov, Vladimir V. Podolskii,
Thomas Schwentick, Michael Zakharyaschev: The price of query
rewriting in ontology-based data access. Artif. Intell. 213: 42-59 (2014)
59. 59
Global-as-View (GAV):
• Concepts in the Global Schema (O) are defined in terms of
combinations of Sources (S).
Local-As-View (LAV):
• Sources in S are defined in terms of combinations of Concepts in O.
Global- & Local-As-View (GLAV):
• Combinations of concepts in the Global Schema (O) are defined in terms of
combinations of Sources (S).
Integration Systems
IS=<O,S,M>
62. 62
Local As View-Query Rewriting
query(X1,X5):-C1(X1,X2),C2(X2,X3),C3(X3,X4),C4(X4,X5)
S1(X1,X2,X3):-C1(X1,X2),C2(X2,X3).
S2(X3,X4,X5):-C3(X3,X4),C4(X4,X5).
S3(X2,X3,X4):-C2(X2,X3),C3(X3,X4).
S4(X1,X2):-C1(X1,X2).
63. 63
Local As View-Query Rewriting
query(X1,X5):-C1(X1,X2),C2(X2,X3),C3(X3,X4),C4(X4,X5)
S1(X1,X2,X3):-C1(X1,X2),C2(X2,X3).
S2(X3,X4,X5):-C3(X3,X4),C4(X4,X5).
S3(X2,X3,X4):-C2(X2,X3),C3(X3,X4).
S4(X1,X2):-C1(X1,X2).
query’(X1,X5):-C1(X1,X2),C2(X2,X3),C3(X3,X4),C4(X4,X5)
Rewriting 1 (covers all query atoms): S1(X1,X2,X3), S2(X3,X4,X5)
64. 64
Local As View-Query Rewriting
query(X1,X5):-C1(X1,X2),C2(X2,X3),C3(X3,X4),C4(X4,X5)
S1(X1,X2,X3):-C1(X1,X2),C2(X2,X3).
S2(X3,X4,X5):-C3(X3,X4),C4(X4,X5).
S3(X2,X3,X4):-C2(X2,X3),C3(X3,X4).
S4(X1,X2):-C1(X1,X2).
query’(X1,X5):-C1(X1,X2),C2(X2,X3),C3(X3,X4),C4(X4,X5)
Rewriting 1: S1(X1,X2,X3), S2(X3,X4,X5)
query’’(X1,X5):-C1(X1,X2),C2(X2,X3),C3(X3,X4),C4(X4,X5)
Rewriting 2: S4(X1,X2), S3(X2,X3,X4), S2(X3,X4,X5)
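A simplified sketch of how such LAV rewritings can be enumerated with a bucket-style algorithm: for every query subgoal, collect the views whose definitions mention its predicate, then combine one view per subgoal. A real rewriter (Bucket, MiniCon) additionally checks that join variables are exported by the views and that each candidate's expansion is contained in the query; both checks are omitted here, so the output over-approximates the valid rewritings.

```python
# Bucket-style enumeration of candidate LAV rewritings for the example above.
from itertools import product

VIEWS = {  # view name -> predicates used in its definition
    "S1": {"C1", "C2"},
    "S2": {"C3", "C4"},
    "S3": {"C2", "C3"},
    "S4": {"C1"},
}
QUERY_SUBGOALS = ["C1", "C2", "C3", "C4"]

# Bucket for each subgoal: the views that could cover it.
buckets = {g: [v for v, preds in VIEWS.items() if g in preds]
           for g in QUERY_SUBGOALS}
print("buckets:", buckets)

# One view per subgoal; duplicate view uses are collapsed.
candidates = set()
for combo in product(*(buckets[g] for g in QUERY_SUBGOALS)):
    candidates.add(tuple(sorted(set(combo))))

for cand in sorted(candidates):
    print("candidate rewriting over views:", ", ".join(cand))
```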
65. 65
Query Rewriting
DB is a virtual database containing the instances of the
elements in O.
Query Containment: Q’ ⊆ Q ⟺ ∀DB: Q’(DB) ⊆ Q(DB)
query1(C):- amFinancial(C,R),similarFinancial(C,C2). ⊆ query(C):-grossGDP(C,R), amCity(C)
(Figure: the answers to query1 over the source database (Washington, NYC, Miami, Caracas, Lima) are a subset of the answers to query over the virtual database (Washington, NYC, Miami, Caracas, Lima, Bogota, College Park, Quito, ...).)
68. 68
Time Complexity
Checking whether there is a valid rewriting R of Q with at
most the same number of goals as Q is an NP-complete
problem.
Levy, A.; Mendelzon, A.; Sagiv, Y.; and Srivastava, D. 1995. Answering
queries using views. In Proc. of PODS, 95–104.
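The containment test itself can be implemented with the classical canonical-database (homomorphism) construction: freeze the candidate query into a database and evaluate the other query over it. The following is a generic, self-contained sketch for conjunctive queries without constants; it is the Q' ⊆ Q test used inside the rewriting algorithms, not the full view-rewriting check of [Levy et al. 1995].

```python
# Conjunctive-query containment via the canonical database.
def contained_in(q_sub, q_sup):
    """True iff q_sub ⊆ q_sup; queries are (head_vars, body_atoms), variables only."""
    head_sub, body_sub = q_sub
    head_sup, body_sup = q_sup
    # Freeze q_sub: every variable becomes a unique constant.
    frozen = {v: f"c_{v}" for atom in body_sub for v in atom[1]}
    db = {(p, tuple(frozen[v] for v in args)) for p, args in body_sub}
    target = tuple(frozen[v] for v in head_sub)

    def evaluate(atoms, subst):
        # Naive backtracking evaluation of q_sup over the canonical database.
        if not atoms:
            yield tuple(subst[v] for v in head_sup)
            return
        (pred, args), rest = atoms[0], atoms[1:]
        for dpred, dargs in db:
            if dpred != pred or len(dargs) != len(args):
                continue
            new = dict(subst)
            if all(new.setdefault(v, c) == c for v, c in zip(args, dargs)):
                yield from evaluate(rest, new)

    return target in set(evaluate(list(body_sup), {}))

# Example: q1(X) :- p(X,Y), p(Y,Z)   is contained in   q2(X) :- p(X,Y)
q1 = (("X",), [("p", ("X", "Y")), ("p", ("Y", "Z"))])
q2 = (("X",), [("p", ("X", "Y"))])
print(contained_in(q1, q2))  # True
```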
70. 70
When to use LAV
• A GAV catalog cannot be easily adapted to changes in the data sources.
• LAV views can be easily adapted to changes in the data sources.
• Data sources can be easily described.
71. 71
Global-as-View (GAV):
• Concepts in the Global Schema (O) are defined in terms of
combinations of Sources (S).
Local-As-View (LAV):
• Sources in S are defined in terms of combinations of Concepts in O.
Global- & Local-As-View (GLAV):
• Combinations of concepts in the Global Schema (O) are defined in terms of
combinations of Sources (S).
Integration Systems
IS=<O,S,M>
74. 74
Query Rewriting
DB is a virtual database containing the instances of the
elements in O.
Query Containment: Q’ ⊆ Q ⟺ ∀DB: Q’(DB) ⊆ Q(DB)
query1(C):-amFinancial(C,R),similarFinancial(C,C2). ⊆ query(C):-grossGDP(C,R), amCity(C)
75. 75
When to use GLAV
• A GLAV catalog cannot be easily adapted to changes in the data sources.
• Data sources can be easily described.
76. 76
The Mediator and Wrapper Architecture [Wiederhold92]
(Figure: a data integration system: a mediator with its catalog answers a query over wrappers for the sources amFinancial(C,R), euClimate(C,R), similarFinancial(C1,C2), and tunisRating(T,R).)
[Wiederhold92] Gio Wiederhold: Mediators in the Architecture of Future Information Systems. IEEE Computer 25(3): 38-49 (1992)
77. 77
The Mediator and Wrapper Architecture [Wiederhold92]
(Figure: the same mediator/wrapper architecture, highlighting the wrappers for the sources amFinancial(C,R), euClimate(C,R), similarFinancial(C1,C2), and tunisRating(T,R).)
[Wiederhold92] Gio Wiederhold: Mediators in the Architecture of Future Information Systems. IEEE Computer 25(3): 38-49 (1992)
Wrappers:
• Software components specific to each type of data source.
• Export a uniform schema for heterogeneous sources.
78. 78
e.g. RDB2RDF Systems
Wrappers in the context of RDF data: transformation rules (e.g., R2RML) map relational data to RDF.
Cf. the R2RML W3C standard: http://www.w3.org/TR/r2rml/, see also [Priyatna et al. 2014]
Ultrawrap http://capsenta.com/ultrawrap/ [Sequeda & Miranker 2013],
D2RQ http://d2rq.org/
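To illustrate what such a wrapper does, here is a hand-rolled sketch (not the API of Ultrawrap, D2RQ, or an R2RML engine) that exposes rows of a relational table as RDF triples using sqlite3 and rdflib; the table, column, and namespace names are made up.

```python
# Hand-rolled RDB-to-RDF wrapper sketch: relational rows become RDF triples.
import sqlite3
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/src#")   # illustrative namespace

def wrap_am_financial(conn: sqlite3.Connection) -> Graph:
    g = Graph()
    for city, rating in conn.execute("SELECT city, rating FROM amFinancial"):
        subj = EX[city.replace(" ", "_")]
        g.add((subj, EX.financialRating, Literal(rating)))
    return g

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE amFinancial (city TEXT, rating REAL)")
conn.execute("INSERT INTO amFinancial VALUES ('NYC', 4.2)")
print(wrap_am_financial(conn).serialize(format="turtle"))
```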
79. 79
The Mediator and Wrapper Architecture [Wiederhold92]
(Figure: the same mediator/wrapper architecture, highlighting the mediator and its catalog on top of the wrappers for amFinancial(C,R), euClimate(C,R), similarFinancial(C1,C2), and tunisRating(T,R).)
[Wiederhold92] Gio Wiederhold: Mediators in the Architecture of Future Information Systems. IEEE Computer 25(3): 38-49 (1992)
Mediators:
•Export a unified schema.
•Query Decomposition.
•Identify relevant sources for each
query.
•Generate query execution plans.
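A toy sketch of the mediator side, assuming the wrappers are simple Python callables returning tuples in the unified schema: the mediator executes the GAV rewriting query1(C) :- amFinancial(C,R), similarFinancial(C,C2) by joining the wrapper results. The data values are invented for illustration.

```python
# Toy mediator over two wrappers: join their results and project the answer.
def am_financial():       # wrapper 1: (city, financial rating)
    return [("NYC", 4.2), ("Miami", 3.1)]

def similar_financial():  # wrapper 2: (city1, city2) with the same rating
    return [("NYC", "Washington"), ("Lima", "Caracas")]

def mediator_query():
    """Join amFinancial(C,R) with similarFinancial(C,C2) on C, project C."""
    answers = set()
    for c, _r in am_financial():
        for c1, _c2 in similar_financial():
            if c == c1:
                answers.add(c)
    return answers

print(mediator_query())  # {'NYC'}
```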
80. 80
Some recent works which implement Wiederhold’s
mediator/wrapper architecture in the SW:
Linked Data-Fu [Stadtmüller et al. 2013]
SemLAV [Montoya et al. 2014]
… both LAV-inspired.
82. 82
Data Warehouse-Materialized Global Schema
(Figure: a data warehouse engine with a catalog of GLAV rules; ETL processes load the sources amFinancial(C1,R), similarFinancial(C1,C2), euClimate(C,R), and tunisIndicator(T,R) into a materialized global schema with classes amCity(C), euCity(C), afCity(C) and properties financial(C,R) and climate(C,R), both rdfs:subPropertyOf rating(C,R).)
Example of a GLAV rule:
α0: amFinancial(C1,R), similarFinancial(C1,C2) :-
amCity(C1), amCity(C2), grossGDP(C1,R), grossGDP(C2,R).
83. 83
Materialized versus Virtual Access
• The Mediator and Wrapper Architecture requires accessing remote data
sources on the fly.
• Materialized data can be accessed locally; convenient whenever the data is static.
85. 85
What is the role of Ontologies in Data
Workflows/Data Integration Systems?
86. 86
Linked Data integration using ontologies:
• Also popular under the term Ontology-Based Data Access
(OBDA) [Kontchakov et al. 2013]:
• Typically considers a relational DB, mappings (rules), and an ontology
TBox (typically OWL QL (DL-Lite) or OWL RL (rules)).
(Figure: the OBDA setup: a SPARQL query Q is posed against an OWL/RDFS ontology O, which is connected via RDB2RDF mappings (Datalog) to an RDBMS queried in SQL.)
87. 87
Linked Data integration using ontologies:
• Also popular under the term Ontology-Based Data Access
(OBDA) [Kontchakov et al. 2013]:
• Typically considers a relational DB, mappings (rules), and an ontology
TBox (typically OWL QL (DL-Lite) or OWL RL (rules)).
• For simplicity, let's leave out the relational DB part,
assuming the data is already in RDF...
(Figure: the simplified setup: a SPARQL query Q is posed against an OWL/RDFS ontology O on top of an RDF triple store accessed via RDF/SPARQL.)
88. 88
Linked Data integration using
ontologies (example)
"Places with a Population Density below 5000/km2"?
90. 90
A concrete use case:
The "City Data Pipeline"
City Data Model: an extensible ALH(D) ontology covering provenance, temporal information, spatial context, and indicators (e.g., area in km2, tons CO2 per capita):
dbo:PopulatedPlace rdfs:subClassOf :Place .
dbo:populationDensity rdfs:subPropertyOf :populationDensity .
eurostat:City rdfs:subClassOf :Place .
eurostat:popDens rdfs:subPropertyOf :populationDensity .
dbpedia:areakm rdfs:subPropertyOf :area .
eurostat:area rdfs:subPropertyOf :area .
91. 91
A concrete use case:
The "City Data Pipeline"
City Data Model: an extensible ALH(D) ontology covering provenance, temporal information, spatial context, and indicators (e.g., area in km2, tons CO2 per capita).
(Figure: the same mappings drawn as a diagram: dbo:areakm and eurostat:area map to :area; dbo:PopulatedPlace and eurostat:City map to :Place; dbo:populationDensity and eurostat:popDens map to :populationDensity.)
92. 92
A concrete use case:
The "City Data Pipeline"
City Data Model: an extensible ALH(D) ontology covering provenance, temporal information, spatial context, and indicators (e.g., area in km2, tons CO2 per capita).
The same mappings written as rules:
:area(X,Y) ← dbo:areakm(X,Y)
:area(X,Y) ← eurostat:area(X,Y)
:Place(X) ← dbo:PopulatedPlace(X)
:populationDensity(X,Y) ← dbo:populationDensity(X,Y)
:Place(X) ← eurostat:City(X)
:populationDensity(X,Y) ← eurostat:popDens(X,Y)
93. 93
A concrete use case:
The "City Data Pipeline"
:area(X,Y) ← dbo:areakm(X,Y)
:area(X,Y) ← eurostat:area(X,Y)
:Place(X) ← dbo:PopulatedPlace(X)
:populationDensity(X,Y) ← dbo:populationDensity(X,Y)
:Place(X) ← eurostat:City(X)
:populationDensity(X,Y) ← eurostat:popDens(X,Y)
"Places with a Population Density below 5000/km2"?
SELECT ?X WHERE { ?X a :Place . ?X :populationDensity ?Y .
FILTER(?Y < 5000) }
94. 94
Approach 1: Materialization
(input: triple store + Ontology
output: materialized triple store)
:area(X,Y) ← dbo:areakm(X,Y)
:area(X,Y) ← eurostat:area(X,Y)
:Place(X) ← dbo:PopulatedPlace(X)
:populationDensity(X,Y) ← dbo:populationDensity(X,Y)
:Place(X) ← eurostat:City(X)
:populationDensity(X,Y) ← eurostat:popDens(X,Y)
:Vienna a dbo:PopulatedPlace .
:Vienna dbo:populationDensity 4326.1 .
:Vienna dbo:areaKm 414.65 .
:Vienna dbo:populationTotal 1805681 .
:Vienna a :Place .
:Vienna :populationDensity 4326.1 .
:Vienna :area 414.65 .
SELECT ?X WHERE { ?X a :Place . ?X :populationDensity ?Y .
FILTER(?Y < 5000) }
• RDF triple stores implement it natively (OWLIM, Jena Rules, Sesame)
• Can handle a large part of OWL [Krötzsch 2012, Glimm et al. 2012]
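A minimal sketch of Approach 1 with rdflib: the rdfs:subClassOf / rdfs:subPropertyOf mappings are forward-chained over the Vienna triples so that the plain query over :Place and :populationDensity succeeds. The namespace IRIs are illustrative assumptions; production systems would rely on the RDFS/OWL-RL materialization built into the triple store rather than this hand-written loop.

```python
# Materialize RDFS subclass/subproperty inferences, then query the result.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

DBO = Namespace("http://dbpedia.org/ontology/")
CD  = Namespace("http://citydata.wu.ac.at/ns#")   # assumed namespace for ":"
EX  = Namespace("http://example.org/")

g = Graph()
g.add((DBO.PopulatedPlace, RDFS.subClassOf, CD.Place))
g.add((DBO.populationDensity, RDFS.subPropertyOf, CD.populationDensity))
g.add((EX.Vienna, RDF.type, DBO.PopulatedPlace))
g.add((EX.Vienna, DBO.populationDensity, Literal(4326.1)))

# Forward-chain subclass/subproperty once (enough here: no axiom chains).
for c1, _, c2 in list(g.triples((None, RDFS.subClassOf, None))):
    for s in list(g.subjects(RDF.type, c1)):
        g.add((s, RDF.type, c2))
for p1, _, p2 in list(g.triples((None, RDFS.subPropertyOf, None))):
    for s, o in list(g.subject_objects(p1)):
        g.add((s, p2, o))

q = """
PREFIX cd: <http://citydata.wu.ac.at/ns#>
SELECT ?x WHERE { ?x a cd:Place . ?x cd:populationDensity ?y . FILTER(?y < 5000) }
"""
print([str(row.x) for row in g.query(q)])   # ['http://example.org/Vienna']
```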
95. 95
Approach 2: Query rewriting
(input: conjunctive query (CQ) + Ontology
output: UCQ)
:area(X,Y) ← dbo:areakm(X,Y)
:area(X,Y) ← eurostat:area(X,Y)
:Place(X) ← dbo:PopulatedPlace(X)
:populationDensity(X,Y) ← dbo:populationDensity(X,Y)
:Place(X) ← eurostat:City(X)
:populationDensity(X,Y) ← eurostat:popDens(X,Y)
:Vienna a dbo:PopulatedPlace .
:Vienna dbo:populationDensity 4326.1 .
:Vienna dbo:areaKm 414.65 .
:Vienna dbo:populationTotal 1805681 .
SELECT ?X WHERE { ?X a :Place . ?X :populationDensity ?Y .
FILTER(?Y < 5000) }
SELECT ?X WHERE { { {?X a :Place . ?X :populationDensity ?Y . }
UNION {?X a dbo:PopulatedPlace . ?X :populationDensity ?Y . }
UNION {?X a :Place . ?X dbo:populationDensity ?Y . }
UNION {?X a dbo:PopulatedPlace . ?X dbo:populationDensity ?Y . }
... }
FILTER(?Y < 5000) }
96. 96
SELECT ?X WHERE { ?X a :Place . ?X :populationDensity ?Y .
FILTER(?Y < 5000) }
• Observation: essentially, GAV-style rewriting
• Can handle a large part of OWL (corresponding to DL-Lite [Calvanese et al.
2007]): OWL 2 QL
• Query-rewriting-based tools and systems are available, with many optimizations over
naive rewritings, e.g., taking into account mappings to a DB:
• REQUIEM [Perez-Urbina et al., 2009]
• Quest [Rodriguez-Muro, et al. 2012]
• ONTOP [Rodriguez-Muro, et al. 2013]
• Mastro [Calvanese et al. 2011]
• Presto [Rosati et al. 2010]
• KYRIE2 [Mora & Corcho, 2014]
• Rewriting vs. Materialization – tradeoff: [Sequeda et al. 2014]
• OBDA is a booming field of research!
Approach 2: Query rewriting
(input: conjunctive query (CQ) + Ontology
output: UCQ)
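A naive sketch of the GAV-style rewriting step itself: each class/property of the query is expanded with its known sub-classes/sub-properties (the mappings of the example ontology, written here as plain Python dictionaries, so the prefixes are only meaningful within the generated query text), and the result is emitted as a UNION of basic graph patterns. Real OBDA systems such as Ontop or Mastro apply many optimizations on top of this idea.

```python
# Generate the UNION-of-BGPs rewriting of the population-density query.
from itertools import product

SUBCLASSES = {":Place": [":Place", "dbo:PopulatedPlace", "eurostat:City"]}
SUBPROPS   = {":populationDensity":
              [":populationDensity", "dbo:populationDensity", "eurostat:popDens"]}

def rewrite(class_term, prop_term):
    branches = []
    for c, p in product(SUBCLASSES[class_term], SUBPROPS[prop_term]):
        branches.append(f"{{ ?X a {c} . ?X {p} ?Y . }}")
    return ("SELECT ?X WHERE {\n  "
            + "\n  UNION ".join(branches)
            + "\n  FILTER(?Y < 5000)\n}")

print(rewrite(":Place", ":populationDensity"))
```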
97. 97
Where to find suitable ontologies?
Ok, so where do I find suitable
ontologies?
(Figure: the OBDA setup again: a SPARQL query Q over an OWL/RDFS ontology O on top of an RDF triple store accessed via RDF/SPARQL.)
98. 98
Ontologies and mapping between Linked Data
Vocabularies
• Good Starting points: Linked Open Vocabularies
http://lov.okfn.org/dataset/lov/
• Still, probably a lot of manual mapping...
• Literature search for suitable ontologies → don't re-invent
the wheel, re-use where possible
• Crawl
• Ontology learning, i.e. learn mappings?
• e.g. using Ontology matching [Shvaiko&Euzenat, 2013]
99. 99
Specific Steps (non-exhaustive, overlapping!)
• Extraction
• Inconsistency handling
• Incompleteness handling (sometimes called "Enrichment",
sometimes imputation of missing values...)
• Data Integration (alignment, source reconciliation)
• Aggregation
• Cleansing (removing outliers)
• Deduplication/Interlinking (could involve "triplification")
• Analytics
• Enrichment
• Change detection (Maintenance/Evolution)
• Validation (quality analysis)
• Efficient, sometimes distributed (query) processing
• Visualization
Tools and current approaches only partially support the different parts of these
steps.... Bad news: there is no "one-size-fits-all" solution.
Recall that slide from the beginning? What did we actually cover, and where could Semantic Web techniques help?
101. 101
q(PD) ← (S, popDensity, PD0), (S, area, A0), (S, area, A), PD := P/A, P := PD0 * A0
(S, popDensity, PD) ← (S, population, P), (S, area, A), PD := P/A, A ≠ 0.
(S, area, A) ← (S, population, P), (S, popDensity, PD), A := P/PD, PD ≠ 0.
(S, population, P) ← (S, area, A), (S, popDensity, PD), P := A * PD.
• [Bischof&Polleres 2013] Basic Idea: Consider
clausal form of all variants of equations and use
Query rewriting with "blocking":
:Bologna dbo:population 386298 .
:Bologna dbo:areaKm 140.7 .
SELECT ?PD WHERE { :Bologna dbo:popDensity ?PD}
q(PD) ← (S, popDensity, PD)
q(PD) ← (S, population, P), (S, area, A), PD := P/A
… infinite expansion even if only 1 equation is considered.
Solution: “blocking” recursive expansion of the same equation for the same value.
SELECT ?PD WHERE { { :Bologna dbo:popDensity ?PD }
UNION
{ :Bologna dbo:population ?P ; dbo:area ?A .
BIND (?P/?A AS ?PD) }
}
Finally, the resulting UCQs with
assignments can be rewritten back to
SPARQL using BIND
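A small sketch of the blocking idea under simplifying assumptions: the rearranged density/population/area equations are given as a table, and an indicator is expanded either to its stored value or via an equation, but never re-using an equation that is already on the current expansion path. This keeps the expansion finite; the full approach of [Bischof & Polleres 2013] additionally tracks the arithmetic and rewrites the result back to SPARQL BIND clauses, which is omitted here.

```python
# Blocked recursive expansion of indicator equations (finite by construction).
from itertools import product

EQUATIONS = {
    # derived indicator: the indicators it can be computed from
    "popDensity": ["population", "area"],        # PD := P / A
    "population": ["area", "popDensity"],        # P  := A * PD
    "area":       ["population", "popDensity"],  # A  := P / PD
}

def expansions(indicator, blocked=frozenset()):
    """All ways to obtain `indicator`: stored directly, or via one unblocked equation."""
    yield [indicator]                            # the value is stored as such
    if indicator in EQUATIONS and indicator not in blocked:
        inputs = EQUATIONS[indicator]
        sub = [list(expansions(i, blocked | {indicator})) for i in inputs]
        for combo in product(*sub):              # one expansion per input
            yield [x for part in combo for x in part]

for branch in expansions("popDensity"):
    print("UNION branch needs stored indicators:", branch)
```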
102. 102
A concrete use case:
The "City Data Pipeline"
(Figure: the City Data Pipeline ontology modules again: provenance, temporal information, spatial context, and indicators such as area in km2 and tons CO2 per capita.)
Ok, so where do I find these
equations?
105. 105
City Data Model: an extensible ALH(D) ontology covering provenance, temporal information, spatial context, and indicators (e.g., area in km2, tons CO2 per capita).
Hmmm... Still a lot of work to do, e.g. adding aggregates for statistical data (Eurostat, RDF Data Cube Vocabulary)... cf. [Kämpgen, 2014, PhD Thesis]
:avgIncome per country is the population-weighted average income of all its provinces.
Hmmm... we actually need Claudia!
But Eurostat data is incomplete... I don't have the avg. income for all provinces or countries in the EU!
Incompleteness Handling:
Are RDFS and OWL and equations enough?
106. • Individual datasets (e.g. from Eurostat) have lots of missing values
• Merging together datasets with different indicators/cities adds sparsity
Challenges – Missing values [Bischof et al. 2015]
Integrated Open Data is (too?) sparse
(Figure: cities × indicators matrices with 51% and 97% of the values missing.)
We don't get very far here with equations… Let's try Data Mining/ML!
107. Missing Values – Hybrid approach:
choose the best imputation method per indicator [Bischof et al. 2015]
• Our assumption: every indicator has its own distribution and relationship to the others.
• Basket of "standard" regression methods:
• k-Nearest-Neighbour Regression (KNN)
• Multiple Linear Regression (MLR)
• Random Forest Decision Trees (RFD)
• Pick the "best" method per indicator; validation: 10-fold cross-validation
However: many/most machine learning methods need more or less complete training data!
More trickery needed, cf. e.g. [Bischof et al. 2015] … or ask Claudia :-)
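A sketch of that hybrid selection with scikit-learn, assuming the integrated data sits in a pandas DataFrame of cities × indicators (hypothetical variable `table`): the three regressors are compared with 10-fold cross-validation per indicator, and the best-scoring one is kept. As stated above, this only works on the (small) complete part of the table; the additional tricks of [Bischof et al. 2015] are not reproduced here.

```python
# Pick the best imputation (regression) method per indicator via cross-validation.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

METHODS = {
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "MLR": LinearRegression(),
    "RFD": RandomForestRegressor(n_estimators=100, random_state=0),
}

def best_method(table: pd.DataFrame, indicator: str):
    """Return (name, model) of the best-scoring regressor for one indicator."""
    complete = table.dropna()                    # needs (near-)complete rows
    X = complete.drop(columns=[indicator])
    y = complete[indicator]
    scores = {name: cross_val_score(model, X, y, cv=10).mean()
              for name, model in METHODS.items()}
    name = max(scores, key=scores.get)
    return name, METHODS[name]
```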
108. 108
Specific Steps (non-exhaustive, overlapping!)
• Extraction
• Inconsistency handling
• Incompleteness handling (sometimes called "Enrichment",
sometimes imputation of missing values...)
• Data Integration (alignment, source reconciliation)
• Aggregation
• Cleansing (removing outliers)
• Deduplication/Interlinking (could involve "triplification")
• Analytics
• Enrichment
• Change detection (Maintenance/Evolution)
• Validation (quality analysis)
• Efficient, sometimes distributed (query) processing
• Visualization
Tools and current approaches only partially support the different parts of these
steps.... Bad news: there is no "one-size-fits-all" solution.
Last but not least… really don't forget the basic steps, e.g.:
110. 110
Ambiguities/inconsistencies also affected some older versions of our
City Data Pipeline:
• This example on the right
was due to naïve object
consolidation/deduplication,
BUT:
• Open Data is often
incomparable/inconsistent in
itself (e.g. across years the
method of data collection
might change)
→ inconsistencies across and
within datasets are common
111. 111
A concrete use case:
The "City Data Pipeline"
Idea – a "classic" Semantic Web use case!
• Regularly integrate various relevant Open Data sources
(e.g. eurostat, UNData, ...)
• Make integrated data available for re-use:
(How) can ontologies help me?
• Are ontology languages expressive enough?
• Which ontologies could I (re-)use?
• Is there enough data at all?
• Where to find the right data?
• Where to find the right ontologies?
• How to tackle inconsistencies?
citydata.wu.ac.at
Where to find the right data?
112. 112
Where to find the data?
• Bad news:
• Finding suitable ontologies to map data sources to is not
the only challenge:
• Foremost… even before a Data workflow starts, a main
challenge is to find the right Datasets/Resources
• Semantic Web search engines... failed? :-(
• https://www.w3.org/wiki/Search_engines
• ... The obvious entry point:
• Open Data portals
• Still quite messy cf. http://data.wu.ac.at/portalwatch/
• Different formats, encodings, metadata of varying quality
• No proper Search!
• … but again: Semantic Web Technologies could help here!
No reason not to try again and succeed this time! :-)
114. 114
How to search in/for Open Data?
https://www.youtube.com/watch?v=kCAymmbYIvc
vs.
Compared to Web (Table) search...
a) This looks like a slightly different problem...
b) Can linking to "Open" knowledge graphs help?
(wikidata, dbpedia?) ... Probably.
Cf. work on structured data in Web search by Alon Halevy
... BTW: Google seems to have partially given up on it.
→ Some more recent work in a SW & Open Data context:
[Neumaier et al., 2015+2016] [Ramnandan et al. 2015]
cf. also mini-projects!
117. 117
Conclusions
(Figure: the mediator/wrapper data integration system again: a mediator with its catalog answers queries over wrappers for amFinancial(C,R), euClimate(C,R), similarFinancial(C1,C2), and tunisRating(T,R).)
[Wiederhold92] Gio Wiederhold: Mediators in the Architecture of Future Information Systems. IEEE Computer 25(3): 38-49 (1992)
118. 118
Integration Systems
S = {amFinancial(C,R), euClimate(C,R), tunisRating(T,R), similarFinancial(C1,C2)}
(Figure: the local schema S and the global schema O (classes euroCity(C), amCity(C), afCity(C); properties grossGDP(C,R) and avgTemp(C,R), both rdfs:subPropertyOf rating(C,R)), connected by GAV, LAV, or GLAV mappings.)
119. 119
Take-home messages:
• Semantic Web technologies help in Open Data Integration
workflows and can add flexibility
• It's worthwhile to consider traditional "Data Integration"
approaches & literature AND more recent work on OBDA
• Non-clean data requires statistics & machine learning
(outlier detection, imputing missing values, resolving
inconsistencies, etc.)
• Despite being 15 years into the Semantic Web, "finding the right data"
remains a major challenge!
Many Thanks!
Questions
121. 121
References 1
• [Polleres 2013] Axel Polleres. Tutorial "OWL vs. Linked Data: Experiences and Directions". OWLED 2013.
http://polleres.net/presentations/20130527OWLED2013_Invited_talk.pdf
• [Polleres et al. 2013] Axel Polleres, Aidan Hogan, Renaud Delbru, Jürgen Umbrich:
RDFS and OWL Reasoning for Linked Data. Reasoning Web 2013: 91-149
• [Golfarelli, Rizzi, 2009] Matteo Golfarelli, Stefano Rizzi. Data Warehouse Design: Modern Principles and
Methodologies. McGraw-Hill, 2009.
• [Lenzerini2002] Maurizio Lenzerini: Data Integration: A Theoretical Perspective. PODS 2002: 233-246
• [Auer et al. 2012] Sören Auer, Lorenz Bühmann, Christian Dirschl, Orri Erling, Michael Hausenblas, Robert Isele,
Jens Lehmann, Michael Martin, Pablo N. Mendes, Bert Van Nuffelen, Claus Stadler, Sebastian Tramp, Hugh Williams:
Managing the Life-Cycle of Linked Data with the LOD2 Stack. International Semantic Web Conference (2) 2012: 1-16,
see also http://stack.lod2.eu/
• [Taheriyan et al. 2012] Mohsen Taheriyan, Craig A. Knoblock, Pedro A. Szekely, José Luis Ambite: Rapidly
Integrating Services into the Linked Data Cloud. International Semantic Web Conference (1) 2012: 559-574
• [Gentile et al. 2016] Anna Lisa Gentile, Sabrina Kirstein, Heiko Paulheim, Christian Bizer: Extending
RapidMiner with Data Search and Integration Capabilities
• [Bischof et al. 2012] Stefan Bischof, Stefan Decker, Thomas Krennwallner, Nuno Lopes, Axel Polleres:
Mapping between RDF and XML with XSPARQL. J. Data Semantics 1(3): 147-185 (2012)
• [Corby et al. 2015] Olivier Corby, Catherine Faron-Zucker, Fabien Gandon:
A Generic RDF Transformation Software and Its Application to an Online Translation Service for Common
Languages of Linked Data. International Semantic Web Conference (2) 2015: 150-165
• [Nonaka & Takeuchi, 1995] "The Knowledge-Creating Company - How Japanese Companies Create the Dynamics of
Innovation" (Nonaka, Takeuchi, New York/Oxford 1995)
• [Bischof et al. 2015] Stefan Bischof, Christoph Martin, Axel Polleres, Patrik Schneider:
Collecting, Integrating, Enriching and Republishing Open City Data as Linked Data. International Semantic Web
Conference (2) 2015: 57-75
122. 122
References 2
• [Doan et al. 2012] AnHai Doan, Alon Y. Halevy, Zachary G. Ives:
Principles of Data Integration. Morgan Kaufmann 2012, ISBN 978-0-12-416044-6, pp. I-XVIII, 1-497
• [Levy & Rajaraman & Ullman 1996] Alon Y. Levy, Anand Rajaraman, Jeffrey D. Ullman: Answering Queries
Using Limited External Processors. PODS 1996: 227-237
• [Duscka & Genesereth 1997]
• [Pottinger & Halevy 2001] Rachel Pottinger, Alon Y. Halevy: MiniCon: A scalable algorithm for answering
queries using views. VLDB J. 10(2-3): 182-198 (2001)
• [Arvelo & Bonet & Vidal 2006] Yolifé Arvelo, Blai Bonet, Maria-Esther Vidal: Compilation of Query-Rewriting
Problems into Tractable Fragments of Propositional Logic. AAAI 2006: 225-230
• [Konstantinidis & Ambite, 2011] George Konstantinidis, José Luis Ambite: Scalable query rewriting: a graph-
based approach. SIGMOD Conference 2011: 97-108
• [Izquierdo & Vidal & Bonet 2011] Daniel Izquierdo, Maria-Esther Vidal, Blai Bonet: An Expressive and Efficient
Solution to the Service Selection Problem. International Semantic Web Conference (1) 2010: 386-401
• [Wiederhold92] Gio Wiederhold: Mediators in the Architecture of Future Information Systems. IEEE Computer 25(3): 38-49 (1992)
• [Stadtmüller et al. 2013] Steffen Stadtmüller, Sebastian Speiser, Andreas Harth, Rudi Studer: Data-Fu: a
language and an interpreter for interaction with read/write linked data. WWW 2013: 1225-1236
• [Montoya et al. 2014] Gabriela Montoya, Luis Daniel Ibáñez, Hala Skaf-Molli, Pascal Molli, Maria-Esther Vidal.
SemLAV: Local-As-View Mediation for SPARQL Queries. T. Large-Scale Data- and Knowledge-Centered
Systems 13: 33-58 (2014).
123. 123
References 3
• [Priyatna et al. 2014] Freddy Priyatna, Óscar Corcho, Juan Sequeda:
Formalisation and experiences of R2RML-based SPARQL to SQL query translation using morph. WWW 2014: 479-490
• [Sequeda & Miranker 2013] Juan Sequeda, Daniel P. Miranker: Ultrawrap: SPARQL execution on relational data.
J. Web Sem. 22: 19-39 (2013)
• [Krötzsch 2012] Markus Krötzsch: OWL 2 Profiles: An Introduction to Lightweight Ontology Languages. Reasoning
Web 2012: 112-183
• [Glimm et al. 2012] Birte Glimm, Aidan Hogan, Markus Krötzsch, Axel Polleres: OWL: Yet to arrive on the Web of
Data? LDOW 2012
• [Kontchakov et al. 2013] Roman Kontchakov, Mariano Rodriguez-Muro, Michael Zakharyaschev: Ontology-Based
Data Access with Databases: A Short Course. Reasoning Web 2013: 194-229
• [Calvanese et al. 2007] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Riccardo
Rosati: Tractable Reasoning and Efficient Query Answering in Description Logics: The DL-Lite Family. J. Autom.
Reasoning 39(3): 385-429 (2007)
• [Perez-Urbina et al. 2009] Héctor Pérez-Urbina, Boris Motik, Ian Horrocks: A Comparison of Query Rewriting
Techniques for DL-Lite. In Proc. of the Int. Workshop on Description Logics (DL 2009), Oxford, UK, July 2009.
• [Rodriguez-Muro et al. 2012] Mariano Rodriguez-Muro, Diego Calvanese:
Quest, an OWL 2 QL Reasoner for Ontology-based Data Access. OWLED 2012
• [Rodriguez-Muro et al. 2013] Mariano Rodriguez-Muro, Roman Kontchakov, Michael Zakharyaschev: Ontology-
Based Data Access: Ontop of Databases. International Semantic Web Conference (1) 2013: 558-573
• [Calvanese et al. 2011] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Antonella
Poggi, Mariano Rodriguez-Muro, Riccardo Rosati, Marco Ruzzi, Domenico Fabio Savo:
The MASTRO system for ontology-based data access. Semantic Web 2(1): 43-53 (2011)
• [Rosati et al. 2010] Riccardo Rosati, Alessandro Almatelli: Improving Query Answering over DL-Lite Ontologies. KR 2010
• [Mora & Corcho, 2014] José Mora, Riccardo Rosati, Óscar Corcho: kyrie2: Query Rewriting under Extensional
Constraints in ELHIO. Semantic Web Conference (1) 2014: 568-583
• [Sequeda et al. 2014] Juan F. Sequeda, Marcelo Arenas, Daniel P. Miranker:
OBDA: Query Rewriting or Materialization? In Practice, Both! Semantic Web Conference (1) 2014: 535-551
124. 124
References 4
• [Acosta et al 2011] M. Acosta, M.-E. Vidal, T. Lampo, J. Castillo, and E. Ruckhaus. Anapsid: an adaptive query
processing engine for sparql endpoints. ISWC 2011.
• [Basca and Bernstein 2014] C. Basca and A. Bernstein. Querying a messy web of data with avalanche. In
Journal of Web Semantics, 2014.
• [Cohen-Boulakia and Leser 2013] S. Cohen-Boulakia, U. Leser. Next Generation Data Integration for the Life
Sciences. Tutorial at ICDE 2013. https://www2.informatik.hu-berlin.de/~leser/icde_tutorial_final_public.pdf
• [Doan et al. 2012] A. Doan, A. Halevy, Z. Ives, Data Integration. Morgan Kaufmann 2012.
• [Halevy et al 2006] A. Y. Halevy, A. Rajaraman, J. Ordille: Data Integration: The Teenage Years. VLDB 2006:
9-16.
• [Halevy et al 2001] A. Y. Halevy. Answering queries using views: A survey. VLDB J., 2001.
• [Hassanzadeh et al. 2013] Oktie Hassanzadeh, Ken Q. Pu, Soheil Hassas Yeganeh, Renée J. Miller, Lucian
Popa, Mauricio A. Hernández, Howard Ho: Discovering Linkage Points over Web Data. PVLDB 2013
• [Gorlitz and Staab 2011] O. Gorlitz and S. Staab. SPLENDID: SPARQL Endpoint Federation Exploiting VOID
Descriptions. In Proceedings of the 2nd International Workshop on Consuming Linked Data, 2011.
• [Schwarte et al. 2011] A. Schwarte, P. Haase, K. Hose, R. Schenkel, and M. Schmidt. Fedx: Optimization
techniques for federated query processing on linked data. ISWC 2011.
• [Verborgh et al. 2014] Ruben Verborgh, Olaf Hartig, Ben De Meester, Gerald Haesendonck, Laurens De
Vocht, Miel Vander Sande, Richard Cyganiak, Pieter Colpaert, Erik Mannens, Rik Van de Walle: Querying
Datasets on the Web with High Availability. ISWC2014
125. 125
References 5
• [Acosta et al. 2015] Maribel Acosta, Amrapali Zaveri, Elena Simperl, Dimitris Kontokostas, Sören Auer, Jens
Lehmann: Crowdsourcing Linked Data Quality Assessment. ISWC 2013
• [Lenz 2007] Hans-J. Lenz. Data Quality: Defining, Measuring and Improving. Tutorial at IDA 2007.
• [Naumann02] Felix Naumann: Quality-Driven Query Answering for Integrated Information Systems. LNCS
2261, Springer 2002
• [Ngonga et al. 2011] Axel-Cyrille Ngonga Ngomo, Sören Auer: LIMES - A Time-Efficient Approach for Large-Scale
Link Discovery on the Web of Data. IJCAI 2011
• [Saleem et al. 2014] Muhammad Saleem, Maulik R. Kamdar, Aftab Iqbal, Shanmukha Sampath, Helena F. Deus,
Axel-Cyrille Ngonga Ngomo: Big linked cancer data: Integrating linked TCGA and PubMed. J. Web Sem. 2014
• [Soru et al. 2015] Tommaso Soru, Edgard Marx, Axel-Cyrille Ngonga Ngomo: ROCKER: A Refinement Operator for
Key Discovery. WWW 2015
• [Volz et al. 2009] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov: Discovering and Maintaining Links on
the Web of Data. ISWC 2009
• [Zaveri et al. 2015] Amrapali J. Zaveri, Anisa Rula, Andrea Maurino, Ricardo Pietrobon, Jens Lehmann, Sören
Auer: Quality Assessment for Linked Data: A Survey. Semantic Web Journal 2015
• [Hernandez & Stolfo, 1998] M. A. Hernández, S. J. Stolfo: Real-world Data is Dirty: Data Cleansing and The
Merge/Purge Problem. Data Min. Knowl. Discov. 2(1): 9-37 (1998)
• [Sarma et al. 2012] Das Sarma, A., Fang, L., Gupta, N., Halevy, A., Lee, H., Wu, F., Xin, R., Yu, C.: Finding related
tables. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 817–828.
ACM (2012)
• [Venetis et al. 2011] Venetis, P., Halevy, A.Y., Madhavan, J., Pasca, M., Shen, W., Wu, F., Miao, G., Wu, C.:
Recovering semantics of tables on the web. PVLDB 4(9), 528–538 (2011)
• [Ramnandan et al. 2015] Ramnandan, S.K., Mittal, A., Knoblock, C.A., Szekely, P.A.: Assigning semantic labels to
data sources. In: ESWC 2015, pp. 403–417
• [Neumaier et al. 2015] Jürgen Umbrich, Sebastian Neumaier, Axel Polleres: Quality assessment & evolution of
open data portals. In IEEE International Conference on Open and Big Data, Rome, Italy, August 2015.
• [Neumaier et al. 2016] S. Neumaier, J. Umbrich, J. Parreira, A. Polleres: Multi-level semantic labelling of numerical
values. ISWC 2016, to appear.
130. 130
References
• Andreas Schwarte, Peter Haase, Katja Hose, Ralf
Schenkel, Michael Schmidt: FedX: Optimization
Techniques for Federated Query Processing on
Linked Data. International Semantic Web Conference
(1) 2011: 601-616
http://www2.informatik.uni-freiburg.de/~mschmidt/docs/iswc11_fedx.pdf
• Maribel Acosta, Maria-Esther Vidal, Tomas Lampo,
Julio Castillo, Edna Ruckhaus: ANAPSID: An
Adaptive Query Processing Engine for SPARQL
Endpoints. International Semantic Web Conference
(1) 2011: 18-34
http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Research_Paper/03/70310017.pdf
131. 131
• Describe the problem presented in the related
papers as a Data Integration System.
• Select the most suitable mapping approach to
describe the Data Integration System.
• Use the mediator and wrapper architecture to
describe the Data Integration System.
• Illustrate with an example the Data Integration
System, and show the features implemented by
the mediator and wrappers of the Data
Integration System
RQ1: Can a federation of SPARQL Endpoints be
seen as a Data Integration System?
132. 132
SPARQL Query Execution using LAV views
(Figure: publicly available Linked Data Fragments, viewed as LAV views, are served by a Linked Data Fragments server and queried by a Linked Data Fragments client.)
134. 134
SPARQL Query Execution using Linked Data Fragments
• Ruben Verborgh, Miel Vander Sande, Olaf Hartig, Joachim Van
Herwegen, Laurens De Vocht, Ben De Meester, Gerald Haesendonck,
Pieter Colpaert:
Triple Pattern Fragments: A low-cost knowledge graph interface for the
Web. J. Web Sem. 37: 184-206 (2016)
http://www.sciencedirect.com/science/article/pii/S1570826816000214
• Maribel Acosta, Maria-Esther Vidal: Networks of Linked Data Eddies:
An Adaptive Web Query Processing Engine for RDF Data.
International Semantic Web Conference (1) 2015: 111-127
http://www.aifb.kit.edu/images/f/f0/Acosta_vidal_iswc2015.pdf
References
135. 135
• Describe the problem presented in the related
papers as a Data Integration System.
• Select the most suitable mapping approach to
describe the Data Integration System.
• Use the mediator and wrapper architecture to
describe the Data Integration System.
• Illustrate with an example the Data Integration
System, and show the features implemented by
the mediator and wrappers of the Data
Integration System
RQ2: Can a federation of Linked Data Fragments
be seen as a Data Integration System?
136. 136
RQ3: What challenges does archiving of
RDF and Open Data involve?
• If Open Data is Big Data, then archiving Open Data and RDF
data is yet another order of magnitude bigger!
• Challenges on creating (crawling), maintaining, storing
and querying such archives:
• cf. slides
• “On Archiving Linked and Open Data"
at the 2nd Workshop on Managing
the Evolution and Preservation
of the Data Web (MEPDaW 2016),
http://polleres.net/presentations/20160530Keynote-MEPDaW2016.pptx
137. 137
RQ4: How to publish and use Linked Open Data
alongside Closed Data?
• Which policies need to be supported?
• How to describe these policies?
• How to enforce them, how to protect and securely store
closed linked data?
• Surprisingly few starting points in our community on
access control/encryption for RDF/Linked Data, cf. e.g.
• S. Kirrane. Linked data with access control. PhD thesis, 2015. NUI Galway
https://aran.library.nuigalway.ie/handle/10379/4903
• Mark Giereth: On Partial Encryption of RDF-Graphs. International Semantic Web Conference
2005: 308-322
• Lots of work on policy languages, e.g. ODRL:
• Simon Steyskal, Axel Polleres: Towards Formal Semantics for ODRL Policies. RuleML 2015: 360-375
• Simon Steyskal, Axel Polleres: Defining expressive access policies for linked data using the ODRL ontology 2.0.
SEMANTICS 2014: 20-23
138. 138
Your Research Task(s) for the rest of the day:
• Work on one of the overall Research Questions (too generic on purpose!!!!) RQ1-RQ4
from the slides before in your mini-project groups!
• 4 questions / 11 groups → 1 RQ can be chosen by at most 3 groups!
• RQ1-2 → Maria-Esther
• RQ3-4 → Axel
For each problem you work on:
1) Problems: Why is it difficult? Find obstacles. Define concrete open (sub-)research questions!
2) Solutions: What could be strategies to overcome these obstacles?
3) Systems: What could be a strategy/roadmap/method to implement these strategies?
4) Benchmarks: What could be a strategy/roadmap/method to evaluate a solution?
Result: short presentation per group addressing these 4 questions and findings.
Tips:
• Think about how much time you dedicate to which of these four questions.
• Don’t start with 3)
• Prepare some answers or discussion points for a final plenary session, to be presented in a 2-3 min
pitch SUMMARIZING your discussion
• no more than 2 slides
• focus on take-home messages
mandatory
optional
→ Please email your notes and (a link to) your slides to axel[at]polleres.net ...
We will review them and provide feedback tomorrow morning!