May 15th, 2019
HOW REPRESENTATIVE IS A SPARQL
BENCHMARK? AN ANALYSIS OF RDF
TRIPLESTORE BENCHMARKS
Muhammad Saleem, Gábor Szárnyas, Felix Conrads,
Syed Ahmad Chan Bukhari, Qaiser Mehmood, Axel-
Cyrille Ngonga Ngomo
The Web Conference 2019, San Francisco
MOTIVATION
 Various RDF Triplestores
 e.g., Virtuoso, Fuseki, Blazegraph, Stardog, RDF-3X, etc.
 Various Triplestore benchmarks
 e.g., WatDiv, FEASIBLE, LDBC, BSBM, SP2Bench, etc.
 Varying workloads on triplestores
 Various important SPARQL query features
 Which benchmark is more representative?
 Which benchmark is most suitable for testing a given triplestore?
 How do SPARQL features affect query runtimes?
QUERYING BENCHMARK
COMPONENTS
 Dataset(s)
 Queries
 Performance metrics
 Execution rules
IMPORTANT RDF DATASET
FEATURES
RDF datasets used in a querying benchmark should vary in:
 Number of triples
 Number of classes
 Number of resources
 Number of properties
 Number of objects
 Average properties per class
 Average instances per class
 Average in-degree and out-degree
 Structuredness or coherence
 Relationship specialty
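Most of the dataset features above can be derived directly from the triples. A minimal sketch, using an invented toy graph of (subject, predicate, object) tuples purely for illustration:

```python
# Sketch: computing a few dataset features from a toy RDF graph.
# The triples below are invented for illustration only.
from collections import defaultdict

RDF_TYPE = "rdf:type"

triples = [
    ("ex:alice", RDF_TYPE, "ex:Person"),
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", RDF_TYPE, "ex:Person"),
    ("ex:bob", "ex:worksFor", "ex:acme"),
    ("ex:acme", RDF_TYPE, "ex:Company"),
]

def dataset_features(triples):
    classes = {o for s, p, o in triples if p == RDF_TYPE}
    properties = {p for _, p, _ in triples}
    subjects = {s for s, _, _ in triples}
    objects = {o for _, _, o in triples}
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for s, p, o in triples:
        out_deg[s] += 1   # triples leaving the subject
        in_deg[o] += 1    # triples arriving at the object
    return {
        "triples": len(triples),
        "classes": len(classes),
        "properties": len(properties),
        "objects": len(objects),
        "avg_out_degree": sum(out_deg.values()) / len(subjects),
        "avg_in_degree": sum(in_deg.values()) / len(objects),
    }

print(dataset_features(triples))
```

Structuredness and relationship specialty are more involved aggregate measures and are discussed on the analysis slides below.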
IMPORTANT SPARQL QUERY FEATURES
 Number of triple patterns
 Number of projection variables
 Number of BGPs
 Number of join vertices
 Mean join vertex degree
 Query result set sizes
 Mean triple pattern selectivity
 BGP-restricted triple pattern selectivity
 Join-restricted triple pattern selectivity
 Overall diversity score (average coefficient of variation)
 Join vertex types ('star', 'path', 'hybrid', 'sink')
 SPARQL clauses used (e.g., LIMIT, UNION, OPTIONAL, FILTER etc.)
SPARQL queries as directed hypergraph
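Several of these query features fall out of the BGP structure itself. A minimal sketch, assuming a BGP represented as a list of triple patterns where strings starting with `?` are variables (the example query is invented for illustration):

```python
# Sketch: join-related query features from a BGP.
# A join vertex is a variable shared by >= 2 triple patterns;
# its degree is the number of patterns it appears in.
from collections import Counter

bgp = [
    ("?person", "rdf:type", "ex:Person"),
    ("?person", "ex:knows", "?friend"),
    ("?friend", "ex:worksFor", "?company"),
]

def is_var(term):
    return term.startswith("?")

def join_features(bgp):
    occurrences = Counter()
    for tp in bgp:
        for var in {t for t in tp if is_var(t)}:
            occurrences[var] += 1
    join_vertices = {v: n for v, n in occurrences.items() if n >= 2}
    mean_degree = (sum(join_vertices.values()) / len(join_vertices)
                   if join_vertices else 0.0)
    return {
        "triple_patterns": len(bgp),
        "join_vertices": len(join_vertices),
        "mean_join_vertex_degree": mean_degree,
    }

print(join_features(bgp))
```

Here `?person` and `?friend` are join vertices (each in two patterns), while `?company` is not.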
IMPORTANT PERFORMANCE METRICS
(1/2)
 Query Processing Related
 Query execution time
 Query Mix per Hour (QMpH)
 Queries per Second (QpS)
 CPU and memory usage
 Intermediate results
 Number of disk/memory swaps
 Result Set Related
 Result set correctness
 Result set completeness
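QMpH and QpS follow directly from measured per-query runtimes. A minimal sketch with invented timings:

```python
# Sketch: QpS and QMpH from per-query runtimes (seconds).
# The timings below are invented for illustration only.
runtimes = [0.12, 0.30, 0.08, 1.50]  # one query mix of four queries

total = sum(runtimes)        # wall time for one full query mix
qps = len(runtimes) / total  # Queries per Second
qmph = 3600.0 / total        # Query Mixes per Hour

print(qps, qmph)
```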
IMPORTANT PERFORMANCE METRICS
(2/2)
 Data Storage Related
 Data loading time
 Storage space
 Index size
 Parallelism with/without Updates
 Parallel querying agents
 Parallel data update agents
BENCHMARK SELECTION CRITERIA
 Target query runtime performance evaluation of triplestores
 RDF Datasets available
 SPARQL queries available
 No reasoning required to get complete results
SELECTED BENCHMARKS
 Benchmarks with real data and/or queries
 FishMark
 BioBench
 FEASIBLE
 DBpedia SPARQL Benchmark (DBPSB)
 Synthetic benchmarks
 Bowlogna
 TrainBench
 Berlin SPARQL Benchmark (BSBM)
 SP2Bench
 WatDiv
 Social Networking Benchmark (SNB)
 Real-world datasets and queries
 DBpedia 3.5.1
 Semantic Web Dog Food (SWDF)
 NCBIGene
 SIDER
 DrugBank
DATASETS ANALYSIS:
STRUCTUREDNESS
 Duan et al. assumption
 Real datasets are less structured
 Synthetic datasets are highly structured
The dataset structuredness problem
is well covered in recent synthetic
data generators (e.g., WatDiv,
TrainBench)
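For intuition, the structuredness idea of Duan et al. can be sketched for a single type as coverage: the fraction of (instance, property) cells that are actually filled. This is a simplified illustration, not the full weighted formula from the paper; the instance data are invented:

```python
# Sketch: coverage of one type T, in the spirit of Duan et al.
# A perfectly structured type (every instance sets every property)
# scores 1.0. Data below are invented for illustration only.
instances = {
    "ex:alice": {"ex:name", "ex:email"},
    "ex:bob":   {"ex:name"},             # missing ex:email
    "ex:carol": {"ex:name", "ex:email"},
}
properties = {"ex:name", "ex:email"}     # properties used by type T

filled = sum(len(props & properties) for props in instances.values())
coverage = filled / (len(properties) * len(instances))  # 5 of 6 cells

print(coverage)
```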
DATASETS ANALYSIS:
RELATIONSHIP SPECIALTY
 Qiao et al. assumption
 Synthetic datasets have low relationship
specialty
The low relationship specialty
problem in synthetic datasets still
exists in general and needs to be
covered in future synthetic
benchmark generation approaches
QUERIES ANALYSIS: OVERALL
DIVERSITY SCORE
Benchmark query diversity (high to low): FEASIBLE > BioBench >
FishMark > WatDiv > Bowlogna > SP2Bench > BSBM > DBPSB >
SNB-BI > SNB-INT > TrainBench
QUERIES ANALYSIS: DISTRIBUTION OF SPARQL CLAUSES
AND JOIN VERTEX TYPES
Only FEASIBLE and BioBench neither
completely miss nor overuse
important features
Synthetic benchmarks often
fail to contain important
SPARQL clauses
PERFORMANCE METRICS
Among the selected benchmarks,
BSBM reports results for the
largest number of metrics
SPEARMAN’S CORRELATION WITH RUNTIMES
Highest impact on query runtimes:
PV > JV > TP > Result > JVD >
JTPS > TPS > BGPs > LSQ > BTPS
The SPARQL query features we selected
have a weak correlation with query
execution time, suggesting that the query
runtime is a complex measure affected by
multidimensional SPARQL query features
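Spearman's rank correlation, as used here, can be computed from rank differences. A minimal sketch (invented data, and ties are not handled):

```python
# Sketch: Spearman's rank correlation between a query feature
# (e.g., number of projection variables) and runtimes.
# Data are invented; tied values are not handled here.
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

projection_vars = [1, 2, 3, 4, 5]
runtimes_ms = [12, 15, 14, 40, 90]
print(spearman(projection_vars, runtimes_ms))  # -> 0.9
```

A value near +1 or -1 indicates a strong monotonic relationship; the weak correlations observed in the study correspond to values much closer to 0.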
EFFECT OF DATASETS STRUCTUREDNESS
CONCLUSIONS
 The dataset structuredness problem is well covered in recent synthetic data
generators (e.g., WatDiv, TrainBench)
 The low relationship specialty problem in synthetic datasets still exists in
general and needs to be covered in future synthetic benchmark generation
approaches
 The FEASIBLE framework employed on DBpedia generated the most diverse
benchmark in our evaluation
 The SPARQL query features we selected have a weak correlation with query
execution time, suggesting that the query runtime is a complex measure
affected by multidimensional SPARQL query features
 Still, the number of projection variables, join vertices, triple patterns, the
result sizes, and the join vertex degree are the top five SPARQL features
that most impact the overall query execution time
 Synthetic benchmarks often fail to contain important SPARQL clauses such
as DISTINCT, FILTER, OPTIONAL, LIMIT and UNION
 The dataset structuredness has a direct correlation with the result sizes
and execution times of queries, and an indirect correlation with dataset