Exploiting the query structure
for efficient join ordering in
SPARQL queries
Luiz Henrique Zambom Santana
Vinicius da Silveira Segalin
Agenda
•Paper and authors
•Background
•Problem and solution
•Example
•Algorithms
•Analysis
•Conclusions
Paper and authors
Gubichev, Andrey, and Thomas Neumann. "Exploiting
the query structure for efficient join ordering in
SPARQL queries." EDBT. 2014.
Extending Database Technology – Qualis A2/H-index 52
Background
•SPARQL
• W3C standard
• Semantic Web
• Inspired by SQL
•Query structure
•Join ordering (similar to matrix product)
Problem
•The join ordering problem is a fundamental challenge that has to be
solved by any query optimizer
•Depending on the order of the join, there is a different computation
time
•SQL solutions are not immediately capable of handling large SPARQL
queries. The paper introduces a new join ordering algorithm that
performs a SPARQL-tailored query simplification
Problem
•Cardinality estimation is an essential part of any cost-based query
optimizer
•Two different approaches:
• RDF-3X: query compilation time (dominated by finding the optimal
join order) is one order of magnitude higher than the actual
execution time
• Virtuoso 7: greedy algorithm for compilation leads to a slow run
time (sub-optimal order)
Solution
•Best of both worlds:
• A heuristic that spends a reasonable amount of time optimizing the
query, and yet gets a decent join order
• The paper presents a SPARQL-tailored query simplification
procedure, that decomposes the query’s join graph into star-shaped
subqueries and chain-shaped subqueries
Challenges
•RDF can be very verbose
• TPC-H Query 2 written in SPARQL contains joins between 26 index
scans (as opposed to joins between 5 tables in the SQL
formulation)
• Number of plans:
• $5! = 120$ plans in SQL vs $26! \approx 4 \times 10^{26}$
•Lack of schema
• Foreign keys become structural correlations
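The plan counts on this slide can be checked with a few lines of Python. This is only a sketch of the arithmetic: `plan_count` is a hypothetical helper name, and it counts orderings of the inputs as n!, which is the figure the slide uses.

```python
import math

def plan_count(n_inputs: int) -> int:
    """Number of orderings of n inputs, i.e. the n! figure quoted on the slide."""
    return math.factorial(n_inputs)

sql_plans = plan_count(5)      # 5 tables in the SQL formulation
sparql_plans = plan_count(26)  # 26 index scans in the SPARQL formulation

print(sql_plans)               # 120
print(f"{sparql_plans:.2e}")   # scientific notation, roughly 4 x 10^26
```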
Solution
• Characteristic set for s defines the properties (attributes) of an entity, thus defining its class (type) in a sense
that the subjects that have the same characteristic set tend to be similar
• Hierarchical Characterization:
• 1. $H_0$ is the set of all characteristic sets of $R$
• 2. $H_i = \{\operatorname{argmin}_{C \subset S,\ |C| = |S| - 1} \mathrm{cost}(C) \mid S \in H_{i-1}\}$, that is, $H_i$ consists of the subsets $C$ of sets from $H_{i-1}$ that minimize $\mathrm{cost}(C)$
• 3. $\forall S \in H_k : |S| = 2$
• 4. Every $S \in H_{i-1}$ stores a pointer to its cheapest subset $C \in H_i$
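The definitions above can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the toy triples, the function names, and in particular the cost function (the number of subjects whose characteristic set contains C), which stands in for whatever cost model the paper actually uses. Sets with fewer than two predicates are dropped from the first level as a simplification.

```python
from collections import defaultdict
from itertools import combinations

# Toy RDF data (hypothetical) as (subject, predicate, object) triples.
TRIPLES = [
    ("p1", "hasName", "Marie Curie"), ("p1", "bornIn", "c1"), ("p1", "livedIn", "c2"),
    ("p2", "hasName", "Ada"), ("p2", "bornIn", "c3"),
    ("p3", "hasName", "Bob"), ("p3", "livedIn", "c1"),
]

def characteristic_sets(triples):
    """Map each subject to the frozenset of predicates it occurs with."""
    preds = defaultdict(set)
    for s, p, _ in triples:
        preds[s].add(p)
    return {s: frozenset(ps) for s, ps in preds.items()}

def make_cost(char_sets):
    """Assumed cost: number of subjects whose characteristic set contains C."""
    def cost(c):
        return sum(1 for cs in char_sets.values() if c <= cs)
    return cost

def build_hierarchy(char_sets, cost):
    """Return (levels, pointer): levels[0] plays the role of H_0 and
    pointer[S] is the cheapest subset of S with |S| - 1 predicates."""
    levels = [{cs for cs in char_sets.values() if len(cs) >= 2}]
    pointer = {}
    while any(len(s) > 2 for s in levels[-1]):
        next_level = set()
        for s in levels[-1]:
            if len(s) <= 2:
                continue  # already at the minimal size, nothing to derive
            candidates = [frozenset(c) for c in combinations(sorted(s), len(s) - 1)]
            best = min(candidates, key=cost)
            pointer[s] = best
            next_level.add(best)
        levels.append(next_level)
    return levels, pointer

char_sets = characteristic_sets(TRIPLES)
levels, pointer = build_hierarchy(char_sets, make_cost(char_sets))
print(pointer[char_sets["p1"]])
```

In the toy data only `p1` has both `bornIn` and `livedIn`, so that pair is the cheapest two-predicate subset of its characteristic set under the assumed cost.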
Algorithm 1 (part. 1)
• Line 2: S=[{created, bornIn, livedIn, hasName},
{ bornIn, livedIn, hasName},...]
• Line 8: initializes the Banker's iteration, i.e., enumeration from the
smallest to the biggest possible set of
predicates
Algorithm 1 (part. 2)
• Line 12: guarantees that $S_2$ is smaller than $S_1$
• Lines 15-16: find the subsets that have smaller
cost
• Cost
• Banker's iteration potentially enumerates all
the subsets of all predicates in the dataset, but in
reality it stops relatively early, since it is
always bounded by the largest set in Sets
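A Banker's-style enumeration visits subsets in order of increasing size. The sketch below shows that ordering with `itertools.combinations`. The function name and the `max_size` parameter are hypothetical; `max_size` models the bound by the largest set in Sets mentioned above.

```python
from itertools import combinations

def bankers_sequence(items, max_size=None):
    """Yield non-empty subsets of items from smallest to largest size.

    max_size caps the subset size (an assumed stand-in for the size of the
    largest characteristic set, which bounds the enumeration).
    """
    items = sorted(items)
    limit = len(items) if max_size is None else min(max_size, len(items))
    for k in range(1, limit + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

for subset in bankers_sequence(["bornIn", "hasName", "livedIn"], max_size=2):
    print(sorted(subset))
```

With three predicates and `max_size=2`, the loop prints the three singletons first and then the three pairs, and never builds the full three-element set.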
Algorithm 2 (part. 1/2)
• Objective: finding the optimal join order in (sub)queries of the form:
select * where { ?s p1 ?o1 . ... ?s pk ?ok }
• Idea: extract the part of the Hierarchical
Characterisation of the dataset starting with the
set S
• Input: Star-shaped graph
• Output: Order of the joins
• Lines 1-9:
• While size S > 2, find the most expensive
subset and push to front of O
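One possible reading of lines 1-9, consistent with the cheapest-subset pointers of the Hierarchical Characterisation, is sketched below: at each step the predicate that is not part of the cheapest subset is pushed to the front of O, so it ends up being joined late. This interpretation, the toy `SUBJECTS_WITH` data, and the intersection-size cost are all assumptions for illustration, not the paper's implementation.

```python
from itertools import combinations

# Hypothetical statistics: which subject ids occur with each predicate.
SUBJECTS_WITH = {
    "hasName": {1, 2, 3, 4, 5, 6, 7, 8},
    "bornIn": {1, 2, 3, 4, 5, 9},
    "livedIn": {1, 2, 3, 6, 9},
    "created": {1, 4, 7, 9},
}

def cost(predicates):
    """Assumed cost: number of subjects that have all the given predicates."""
    return len(set.intersection(*(SUBJECTS_WITH[p] for p in predicates)))

def cheapest_subset(s):
    """Cheapest subset of s with one predicate removed (the hierarchy pointer)."""
    candidates = [frozenset(c) for c in combinations(sorted(s), len(s) - 1)]
    return min(candidates, key=cost)

def star_join_order(predicates):
    """Return predicates in join order: the cheapest pair first, then the
    predicates peeled off on the way down, in reverse peeling order."""
    s = frozenset(predicates)
    order = []
    while len(s) > 2:
        c = cheapest_subset(s)
        (removed,) = s - c          # exactly one predicate is dropped per step
        order.insert(0, removed)    # push to the front of O
        s = c
    return sorted(s) + order

print(star_join_order(["hasName", "bornIn", "livedIn", "created"]))
```

The loop runs once per predicate beyond the first two, which matches the slide's point that the order is found in time linear in the size of the star, given precomputed pointers.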
Algorithm 2 (part. 2/2)
• The first part yields the optimal order for star-shaped
queries in time linear in the graph size
• However, it does not find the optimal solution if
the query has constants:
select * where { ?s p1 "Berlin" . ... ?s pk ?ok }
• Then:
• Lines 12-14: only one of the bound
objects is in the triple with the key
predicate, i.e., the entire star query is
therefore a lookup of properties of a
specific entity
• Lines 15-16: otherwise (many objects are
key), keep pushing down the constants in
the join tree and stop when the cost of the
corresponding index scan is bigger than the
cost of the join on that level of the tree
Algorithm 3 (part. 1/4)
• Objective: ordering joins in general SPARQL queries
(s1, hasName, "Marie Curie"),
(s1, bornIn, s2),
(s2, label, "Warsaw"),
(s2, locatedIn, "Poland")
• Problem: s2
links person to city, corresponding to the "foreign key", but RDF does not require any
schema. Knowledge of such dependencies is extremely useful for the query optimizer: without it, the
optimizer has to assume independence between two entities linked via bornIn predicate, thus almost
inevitably underestimating the selectivity of the join of corresponding triple pattern
• Thus, it uses Characteristic Pairs (Paar Charakteristisch) in order to discover this kind of relation, where:
$P_C(S_C(s), S_C(o)) = \{(S_C(s), S_C(o), p) \mid S_C(o) \neq \emptyset \wedge \exists p : (s, p, o) \in R\}$
• The characteristic pairs synopsis is an in-memory structure. In theory, n distinct characteristic sets can yield up to $n^2$
characteristic pairs, but in real datasets only a few pairs appear frequently enough to be stored. For example,
in the YAGO-Facts dataset, of the 250000 existing pairs only 5292 appear more than 100 times.
The frequent characteristic pairs therefore consume less than 16 KB.
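The characteristic pair definition can be sketched in Python as a counting pass over the triples. The toy triples and the names `characteristic_pairs` and `frequent_pairs` are hypothetical. The `min_count` threshold is an assumed parameter standing in for the "appear frequently enough" cutoff (the slide's example uses 100).

```python
from collections import Counter, defaultdict

# Toy triples (hypothetical) that extend the slide's example with a second person.
TRIPLES = [
    ("s1", "hasName", "Marie Curie"),
    ("s1", "bornIn", "s2"),
    ("s2", "label", "Warsaw"),
    ("s2", "locatedIn", "Poland"),
    ("s3", "hasName", "Frederic Chopin"),
    ("s3", "bornIn", "s4"),
    ("s4", "label", "Zelazowa Wola"),
    ("s4", "locatedIn", "Poland"),
]

def characteristic_sets(triples):
    """Map each subject to the frozenset of its predicates, i.e. S_C(s)."""
    preds = defaultdict(set)
    for s, p, _ in triples:
        preds[s].add(p)
    return {s: frozenset(ps) for s, ps in preds.items()}

def characteristic_pairs(triples):
    """Count occurrences of (S_C(s), S_C(o), p) over triples with non-empty S_C(o)."""
    sc = characteristic_sets(triples)
    counts = Counter()
    for s, p, o in triples:
        if o in sc:  # S_C(o) is non-empty only when o also occurs as a subject
            counts[(sc[s], sc[o], p)] += 1
    return counts

def frequent_pairs(counts, min_count):
    """Keep only the pairs seen at least min_count times (assumed cutoff)."""
    return {pair: n for pair, n in counts.items() if n >= min_count}

pairs = characteristic_pairs(TRIPLES)
for (subj_set, obj_set, pred), n in pairs.items():
    print(sorted(subj_set), pred, sorted(obj_set), n)
```

In the toy data both persons link to a city through `bornIn`, so a single characteristic pair is observed twice. A count like this is what lets an optimizer avoid assuming independence between the two stars.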
Algorithm 3 (part. 2/4)
• Idea: to decompose the query into star-shaped subqueries
connected by chains, and to collapse the subqueries into
meta-nodes
• Input: SPARQL graph
• Output: join ordering for this graph
• Lines 11-24: starts with clustering the query into disjoint
star-shaped subqueries around subjects
• Line 13: order the triple patterns in the query by subject
• Line 15: group triple patterns with identical subjects, since
they potentially form star-shaped subqueries
• Lines 20-23: find stars around objects
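The subject-clustering step (sort the triple patterns by subject, then group equal subjects) can be sketched as follows. The function name and the tuple representation of triple patterns are assumptions. The sketch covers only stars around subjects, not the object-centered stars of lines 20-23.

```python
from itertools import groupby

# The slide's example query as (subject, predicate, object) patterns, shuffled.
PATTERNS = [
    ("?s1", "hasName", '"Marie Curie"'),
    ("?s2", "label", '"Warsaw"'),
    ("?s1", "bornIn", "?s2"),
    ("?s2", "locatedIn", '"Poland"'),
]

def subject_stars(patterns):
    """Group triple patterns by subject; each group is a candidate star."""
    ordered = sorted(patterns, key=lambda t: t[0])  # order by subject
    return {subj: list(group) for subj, group in groupby(ordered, key=lambda t: t[0])}

for subject, star in subject_stars(PATTERNS).items():
    print(subject, [p for _, p, _ in star])
```

Sorting first matters because `groupby` only merges adjacent items with the same key.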
Algorithm 3 (part. 3/4)
• Lines 4-5: for every star it adds the new meta-node to the
query graph and removes the intra-star edges
• Lines 6-7: the plan for the star subquery is computed using
the Hierarchical Characterisation (Algorithm 2) and added to
the DP table along with the meta-node
• Line 8: After all the star subqueries have been optimized, we
add the edges between meta-nodes to the query graph, if
the original graph has edges between the corresponding star
sub-queries
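The collapse into meta-nodes can be sketched as below, under simplifying assumptions: each subject-centered star becomes one meta-node, and an edge between two meta-nodes is kept whenever a pattern in one star has the other star's subject as its object. The function name and data layout are hypothetical, and the per-star plan computation and the DP table are left out.

```python
# The slide's example query as (subject, predicate, object) patterns.
PATTERNS = [
    ("?s1", "hasName", '"Marie Curie"'),
    ("?s1", "bornIn", "?s2"),
    ("?s2", "label", '"Warsaw"'),
    ("?s2", "locatedIn", '"Poland"'),
]

def collapse_stars(patterns):
    """Return (meta_nodes, edges) for the simplified query graph.

    meta_nodes maps a star's subject to its triple patterns; edges holds
    (from_star, to_star, predicate) links between different stars.
    """
    meta_nodes = {}
    for pattern in patterns:
        meta_nodes.setdefault(pattern[0], []).append(pattern)
    edges = set()
    for s, p, o in patterns:
        if o in meta_nodes and o != s:  # the object is the center of another star
            edges.add((s, o, p))
    return meta_nodes, edges

meta_nodes, edges = collapse_stars(PATTERNS)
print(len(PATTERNS), "triple patterns ->", len(meta_nodes), "meta-nodes")
print(edges)
```

For the example query, four triple patterns collapse into two meta-nodes joined by a single `bornIn` edge, which is the part of the graph the dynamic programming step then has to order.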
Algorithm 3 (part. 4/4)
• Line 10: selectivities associated with these edges are
computed using the Characteristic Pairs synopsis, and the
regular Dynamic Programming algorithm starts working on
this simplified graph
• In the example figure, simplifying the graph from 8 nodes to
3 nodes reduces the number of plans from 8! = 40320 to 3! = 6
• This algorithm is also linear in the size of the input graph
Analysis
Conclusions
•The problem is very similar to the matrix product problem
•The query simplification technique reduces the search space by
simplifying the query graph before the DP algorithm starts
•The time analysis shows how important the complexity study is
•There is no complexity analysis, though the paper mentions DP and greedy
algorithms throughout
•The tests did not turn the cache off
•The paper does not cover OPTIONAL clauses of SPARQL, which are equivalent to
left outer joins and cannot be freely reordered with other joins
