The Impact of Data Caching on Query Execution for Linked Data - Olaf Hartig
The document discusses link traversal based query execution for querying linked data on the web. It describes an approach that alternates between evaluating parts of a query on a continuously augmented local dataset, and looking up URIs in solutions to retrieve more data and add it to the local dataset. This allows querying linked data as if it were a single large database, without needing to know all data sources in advance. A key issue is how to efficiently cache retrieved data to avoid redundant lookups.
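The alternation described above, evaluating against a growing local dataset while dereferencing discovered URIs, with a cache suppressing redundant lookups, can be sketched in a few lines. This is a minimal illustration of the idea, not Hartig's implementation; `dereference` and the in-memory `TOY_WEB` are hypothetical stand-ins for real HTTP lookups.

```python
# Minimal sketch of link-traversal query execution with a lookup cache.
# `dereference` stands in for an HTTP lookup of a URI returning RDF
# triples; here it reads from a toy in-memory "web".

TOY_WEB = {
    "ex:alice": [("ex:alice", "knows", "ex:bob")],
    "ex:bob":   [("ex:bob", "knows", "ex:carol")],
    "ex:carol": [("ex:carol", "name", '"Carol"')],
}

def dereference(uri):
    return TOY_WEB.get(uri, [])

def traverse(seed_uris, max_rounds=5):
    """Alternate between extending the local dataset and looking up
    newly discovered URIs; the cache prevents redundant lookups."""
    cache = {}                    # uri -> triples retrieved for it
    dataset = []
    frontier = list(seed_uris)
    for _ in range(max_rounds):
        next_frontier = []
        for uri in frontier:
            if uri in cache:      # cache hit: skip the remote lookup
                continue
            triples = dereference(uri)
            cache[uri] = triples
            dataset.extend(triples)
            # URIs mentioned in new triples feed the next round
            for s, p, o in triples:
                next_frontier.extend(u for u in (s, o) if u.startswith("ex:"))
        if not next_frontier:
            break
        frontier = next_frontier
    return dataset, cache

data, cache = traverse(["ex:alice"])
```

Starting from `ex:alice`, the traversal discovers `ex:bob` and then `ex:carol`; each URI is fetched exactly once even though it reappears in later frontiers.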
Full-Text Retrieval in Unstructured P2P Networks using BloomCast Efficiently - ijsrd.com
Efficient and effective full-text retrieval in unstructured peer-to-peer networks remains a challenge for the research community. First, it is difficult, if not impossible, for unstructured P2P systems to locate items with guaranteed recall. Second, existing schemes that improve the search success rate often rely on replicating a large number of item replicas across the wide-area network, incurring large communication and storage costs. In this paper, we propose BloomCast, an efficient and effective full-text retrieval scheme for unstructured P2P networks. By leveraging a hybrid P2P protocol, BloomCast replicates items uniformly at random across the network, achieving guaranteed recall at a communication cost of O(N), where N is the size of the network. Furthermore, by casting Bloom filters instead of raw documents across the network, BloomCast significantly reduces the communication and storage costs of replication. Results show that BloomCast achieves an average query recall that outperforms the existing WP algorithm by 18 percent, while reducing the search latency of query processing by 57 percent.
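As an illustration of why casting Bloom filters is cheap: a node can advertise its term vocabulary as a small bit array instead of shipping documents, at the price of occasional false positives. The sketch below is a generic Bloom filter, not BloomCast's actual protocol; the size `m` and hash count `k` are arbitrary choices.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: set membership with possible false positives
    but no false negatives, so a node can advertise its term vocabulary
    as m bits instead of replicating raw documents."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)      # the only thing a node must cast

    def _positions(self, item):
        # k independent positions derived from one cryptographic hash
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
for term in ["peer", "retrieval", "network"]:
    bf.add(term)
```

Three terms set at most nine bits of the 1024-bit array, which is what makes casting the filter far cheaper than casting the documents themselves.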
Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs - Andreas Wagner
Many databases today are text-rich, comprising not only structured but also textual data. Querying such databases involves predicates matching structured data combined with string predicates featuring textual constraints. Based on selectivity estimates for these predicates, query processing, as well as other tasks solvable through such queries, can be optimized. Existing work on selectivity estimation focuses either on string predicates or on structured query predicates alone. Further, the probabilistic models proposed to capture dependencies between predicates target the relational setting. In this work, we propose a template-based probabilistic model that enables selectivity estimation for general graph-structured data. Our model allows dependencies between structured data and its text-rich parts to be captured. With this general probabilistic solution, BN+, selectivity estimates can be obtained for queries over text-rich graph-structured data that contain both structured and string predicates (hybrid queries). In experiments on real-world data, we show that capturing dependencies between structured and textual data in this way greatly improves the accuracy of selectivity estimates without compromising efficiency.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
The document discusses linked data and services. It describes the linked data principles of using URIs to name things and including links between URIs. It then discusses querying linked data from multiple sources using either a materialization or distributed query processing approach. It proposes the concept of linked data services that adhere to REST principles and linked data principles by describing their input and output using RDF graph patterns. Integrating linked data services with linked open data could enable querying across both interconnected datasets and services.
This document discusses distributed database systems and distributed query processing. It begins with an introduction that notes the differences between distributed and centralized query processing, including considering the physical data distribution and communication costs during query optimization in distributed systems. The document then provides an overview of its contents, which include discussions of centralized query processing, the basics of distributed query processing, global query optimization, and a summary. It also gives examples of motivations for distributed query processing like low response times, high throughput, and efficient hardware usage.
Abstract:
An increasing number of applications rely on RDF, OWL 2, and SPARQL for storing and querying data. SPARQL, however, is not targeted towards end-users, and suitable query interfaces are needed. Faceted search is a prominent approach for end-user data access, and several RDF-based faceted search systems have been developed. There is, however, a lack of rigorous theoretical underpinning for faceted search in the context of RDF and OWL 2. In this paper, we provide such solid foundations. We formalise faceted interfaces for this context, identify a fragment of first-order logic capturing the underlying queries, and study the complexity of answering such queries for RDF and OWL 2 profiles. We then study interface generation and update, and devise efficiently implementable algorithms. Finally, we have implemented and tested our faceted search algorithms for scalability, with encouraging results.
A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles - Besnik Fetahu
The increasing adoption of Linked Data principles has led to an abundance of datasets on the Web. However, take-up and reuse is hindered by the lack of descriptive information about the nature of the data, such as its topic coverage, dynamics, or evolution. To address this issue, we propose an approach for creating linked dataset profiles. A profile consists of structured dataset metadata describing topics and their relevance. Profiles are generated by configuring techniques for resource sampling from datasets, topic extraction from reference datasets, and topic ranking based on graphical models. To achieve a good trade-off between scalability and accuracy of the generated profiles, appropriate parameters are determined experimentally. Our evaluation considers topic profiles for all accessible datasets from the Linked Open Data cloud. The results show that our approach generates accurate profiles even with comparably small sample sizes (10%) and outperforms established topic modelling approaches.
Usability of Keyword-driven Schema-agnostic Search - A Comparative Study of K... - Thanh Tran
The document describes a 2010 paper on schema-agnostic search approaches for querying linked data. It discusses the motivation for such approaches given complex information needs on the evolving web of data. The paper presents conceptual studies of four widely used schema-agnostic search approaches, and conducts experimental evaluations to assess their efficiency, effectiveness, and usability.
Template-based information access, in which templates are constructed for keywords, is a recent development in linked data information retrieval. However, most such approaches suffer from ineffective template management. Because linked data has a structured data representation, we assume that the statistics inside the data can effectively guide template management. In this work, we exploit this influence for template creation, template ranking, and scaling. Our proposal can effectively be used for automatic linked data information retrieval and can be combined with other techniques, such as ontology inclusion and sophisticated matching, to further improve performance.
The document discusses information retrieval (IR) and provides definitions and examples of different IR models and techniques. It describes how documents and queries can be represented as vectors, with weights like term frequency-inverse document frequency (tf-idf) used to indicate importance. Various IR models are covered, including boolean, vector space, and probabilistic models, along with common weighting and ranking methods used in IR systems.
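The tf-idf weighting and vector-space cosine ranking summarized above can be made concrete with the standard formulas over toy documents (a generic illustration, not the surveyed document's own code):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each term by tf * idf, with idf = log(N / df)."""
    N = len(docs)
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    return [
        {t: tf * math.log(N / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "linked data query".split(),
    "linked data cache".split(),
    "frequent pattern mining".split(),
]
vecs = tfidf_vectors(docs)
```

With these toy documents, the two linked-data documents score higher against each other than against the unrelated third, which is exactly the ranking behaviour tf-idf is meant to produce.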
Search results clustering (SRC) is a challenging algorithmic problem that requires grouping together the results returned by one or more search engines in topically coherent clusters, and labeling the clusters with meaningful phrases describing the topics of the results included in them.
Topic detection by clustering and text mining - IRJET Journal
This document discusses topic detection from text documents using text mining and clustering techniques. It proposes extracting keywords from documents, representing topics as groups of keywords, and using k-means clustering on the keywords to group them into topics. The keywords are extracted based on frequency counts and preprocessed by removing stop words and stemming. The k-means clustering algorithm is used to assign keywords to topics represented by cluster centroids, and the centroids are iteratively updated until cluster assignments converge.
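The clustering step of the pipeline above boils down to plain k-means over keyword vectors. The sketch below is a generic k-means, not the paper's exact configuration; representing each keyword by a binary document-occurrence vector (so co-occurring keywords cluster together) is an illustrative assumption, as are the toy keywords.

```python
def kmeans(points, k, iters=20):
    """Plain k-means: assign each point to its nearest centroid, then
    recompute centroids, until the assignment stabilises."""
    # seed centroids deterministically with the first k distinct points
    centroids = []
    for p in points:
        if p not in centroids:
            centroids.append(p)
        if len(centroids) == k:
            break
    assign = [0] * len(points)
    for _ in range(iters):
        new_assign = [
            min(range(k), key=lambda c: dist2(p, centroids[c]))
            for p in points
        ]
        if new_assign == assign:          # converged
            break
        assign = new_assign
        for c in range(k):                # recompute centroids
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(x) / len(members) for x in zip(*members)
                )
    return assign

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Each keyword is represented by the documents it occurs in
# (a binary occurrence vector over three toy documents).
keywords = ["rdf", "sparql", "cluster", "centroid"]
vectors = [(1, 1, 0), (1, 1, 0), (0, 0, 1), (0, 0, 1)]
labels = kmeans(vectors, k=2)
```

Here "rdf"/"sparql" and "cluster"/"centroid" co-occur, so k-means separates them into two topics.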
Query Distributed RDF Graphs: The Effects of Partitioning - DBOnto
Abstract: Web-scale RDF datasets are increasingly processed using distributed RDF data stores built on top of a cluster of shared-nothing servers. Such systems critically rely on their data partitioning scheme and query answering scheme, the goal of which is to facilitate correct and efficient query processing. Existing data partitioning schemes are commonly based on hashing or graph partitioning techniques. The latter split a dataset in a way that minimises the number of connections between the resulting subsets, thus reducing the need for communication between servers; however, to facilitate efficient query answering, considerable duplication of data at the intersection between subsets is often needed. Building upon the known graph partitioning approaches, in this paper we present a novel data partitioning scheme that employs minimal duplication and keeps track of the connections between partition elements; moreover, we propose a query answering scheme that uses this additional information to correctly answer all queries. We show experimentally that, on certain well-known RDF benchmarks, our data partitioning scheme often allows more answers to be retrieved without distributed computation than the known schemes, and that our query answering scheme can efficiently answer many queries.
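For contrast with the graph-based schemes discussed above, the simplest baseline, hash partitioning, can be sketched in a few lines. This is a generic subject-hash scheme for illustration only, not the paper's partitioning algorithm; the triples and server count are made up.

```python
def hash_partition(triples, n_servers):
    """Subject-hash partitioning: triples sharing a subject land on the
    same server, so subject-centred (star-shaped) queries need no
    communication between servers."""
    parts = [[] for _ in range(n_servers)]
    for s, p, o in triples:
        parts[hash_str(s) % n_servers].append((s, p, o))
    return parts

def hash_str(s):
    # deterministic string hash (Python's built-in hash() is salted
    # per process, so it is unsuitable for stable placement)
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) % (2 ** 32)
    return h

triples = [
    ("ex:a", "p", "ex:b"),
    ("ex:a", "q", "ex:c"),
    ("ex:b", "p", "ex:c"),
]
parts = hash_partition(triples, 2)
```

Both `ex:a` triples are guaranteed to land on the same server; queries that join across subjects (e.g. `ex:a` to `ex:b`) are exactly the ones that may still require inter-server communication, which is the problem the duplication and connection-tracking schemes above address.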
Distributed Algorithm for Frequent Pattern Mining using Hadoop MapReduce Fram... - idescitation
With the rapid growth of information technology, and in many business applications, mining frequent patterns and finding associations among them requires handling large and distributed databases. As the FP-tree is considered the most compact data structure for holding data patterns in memory, there have been efforts to make it parallel and distributed to handle large databases. However, this incurs a lot of communication overhead during mining. In this paper, a parallel and distributed frequent pattern mining algorithm using the Hadoop MapReduce framework is proposed, which shows the best performance results for large databases. The proposed algorithm partitions the database in such a way that it works independently at each local node and locally generates the frequent patterns by sharing the global frequent pattern header table. These local frequent patterns are merged at the final stage. This reduces the overall communication overhead during both structure construction and pattern mining. The itemset count is also taken into consideration, reducing processor idle time. The Hadoop MapReduce framework is used effectively in all steps of the algorithm. Experiments carried out on a PC cluster with 5 computing nodes show execution time efficiency compared to other algorithms. The experimental results show that the proposed algorithm efficiently handles scalability for very large databases.
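The partition-locally-then-merge structure of such algorithms can be illustrated with a MapReduce-style count of frequent item pairs. This is a single-process sketch of the map and reduce roles only, not the paper's Hadoop implementation; the transactions and support threshold are made up.

```python
from collections import Counter
from itertools import combinations

def map_phase(partition):
    """Each mapper counts item pairs within its local database
    partition, independently of the other nodes."""
    counts = Counter()
    for transaction in partition:
        counts.update(combinations(sorted(transaction), 2))
    return counts

def reduce_phase(mapper_outputs, min_support):
    """The reducer merges the local counts and keeps only the
    globally frequent pairs."""
    total = Counter()
    for counts in mapper_outputs:
        total.update(counts)
    return {pair: n for pair, n in total.items() if n >= min_support}

partitions = [
    [{"bread", "milk"}, {"bread", "milk", "eggs"}],   # node 1
    [{"bread", "milk"}, {"eggs", "jam"}],             # node 2
]
frequent = reduce_phase((map_phase(p) for p in partitions), min_support=3)
```

Each node ships only its pair counts, never its transactions, which mirrors how the merge-at-final-stage design keeps communication overhead low.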
EFFICIENTLY PROCESSING OF TOP-K TYPICALITY QUERY FOR STRUCTURED DATA - csandit
This work presents a novel ranking scheme for structured data. We show how to apply the notion of typicality analysis from cognitive science and how to use it to formulate the problem of ranking data with categorical attributes. First, we formalize the typicality query model for relational databases. We adopt the Pearson correlation coefficient to quantify the typicality of an object; the coefficient estimates the strength of the statistical relationship between two variables based on the patterns of occurrence and absence of their values. Second, we develop a top-k query processing method, TPFilter, for efficient computation: it prunes unpromising objects based on tight upper bounds and selectively joins the tuples with the highest typicality scores. Experimental results show our approach is promising for real data.
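The Pearson correlation coefficient used above is computed from the occurrence patterns of two values. Below is a standard self-contained implementation over toy binary occurrence vectors (illustrative data, not the paper's); how the score feeds the TPFilter bounds is outside this sketch.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

# Occurrence indicators of two attribute values across five tuples:
# 1 = the value is present in the tuple, 0 = absent.
a = [1, 1, 0, 0, 1]
b = [1, 1, 0, 0, 0]
```

Values that tend to occur and be absent together score close to 1, which is what makes the coefficient a usable proxy for how typical their co-occurrence is.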
Automated building of taxonomies for search engines - Boris Galitsky
We build a taxonomy of entities intended to improve the relevance of a search engine in a vertical domain. The taxonomy construction process starts from seed entities and mines the web for new entities associated with them. To form these new entities, machine learning of syntactic parse trees (their generalization) is applied to the search results for existing entities to find commonalities between them. These commonality expressions then form parameters of existing entities, and are turned into new entities at the next learning iteration.
Taxonomy and paragraph-level syntactic generalization are applied to relevance improvement in search and to text similarity assessment. We evaluate the search relevance improvement in vertical and horizontal domains and observe a significant contribution of the learned taxonomy in the former, and a noticeable contribution of a hybrid system in the latter. We also perform an industrial evaluation of taxonomy- and syntactic-generalization-based text relevance assessment and conclude that the proposed algorithm for automated taxonomy learning is suitable for integration into industrial systems. The algorithm is implemented as part of the Apache OpenNLP.Similarity project.
Matching and merging anonymous terms from web sources - IJwest
This paper describes a workflow of simplifying and matching special language terms in RDF generated...
This document outlines the BoTLRet system, a template-based linked data information retrieval system. It begins with an introduction to linked data and related work in linked data access. It then describes the problem with current template-based systems and proposes BoTLRet as a solution. BoTLRet constructs templates according to linked data structure and ranks templates using dataset statistics. It can handle queries with two or more keywords by progressively constructing and merging templates for adjacent keyword pairs. The document concludes with experimental results showing BoTLRet achieves close to exhaustive retrieval with lower computational cost than alternative systems, and outperforms other state-of-the-art template-based linked data retrieval systems.
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH - IJDKP
Text mining is an emerging research field evolving from the information retrieval area. Clustering and classification are two approaches in data mining that may also be used to perform text classification and text clustering; the former is supervised while the latter is unsupervised. In this paper, our objective is to perform text clustering by defining an improved distance metric to compute the similarity between two text files. We use incremental frequent pattern mining to find frequent items and reduce dimensionality. The improved distance metric may also be used to perform text classification. The distance metric is validated for the worst, average, and best case situations [15]. The results show the proposed distance metric outperforms the existing measures.
This document provides an overview of probabilistic approaches to information retrieval. It discusses why probabilities are useful for IR given the inherent uncertainty. It covers the Probability Ranking Principle, which aims to rank documents by estimated probability of relevance. Other probabilistic techniques discussed include probabilistic indexing, probabilistic inference using logic representations, and using Bayesian networks for IR. The document notes open issues with some of these approaches and concludes by surveying existing survey papers on probabilistic IR.
The document describes an evaluation of existing relational keyword search systems. It notes discrepancies in how prior studies evaluated systems using different datasets, query workloads, and experimental designs. The evaluation aims to conduct an independent assessment that uses larger, more representative datasets and queries to better understand systems' real-world performance and tradeoffs between effectiveness and efficiency. It outlines schema-based and graph-based search approaches included in the new evaluation.
This document summarizes two algorithms - MFA and ATRA - for processing top-k spatial preference queries. MFA is a threshold-based algorithm that partitions queries into three features - spatial, preference, and text - and retrieves objects with the highest aggregate scores. ATRA uses a hybrid indexing structure called AIR-tree to more efficiently retrieve only relevant objects without revisiting the same data. The paper then proposes using an R-tree index structure combined with an enhanced branch-and-bound search algorithm to answer preference-based top-k spatial keyword queries by ranking objects based on feature quality in their neighborhoods.
Fedbench - A Benchmark Suite for Federated Semantic Data Processing - Peter Haase
(1) FedBench is a benchmark suite for evaluating federated semantic data processing systems.
(2) It includes parameterized benchmark drivers, a variety of RDF datasets and SPARQL queries, and an evaluation framework to measure system performance.
(3) An initial evaluation was conducted to demonstrate FedBench's flexibility in comparing centralized and federated query processing using different systems and scenarios.
Query Processing: the query processing problem and the layers of query processing; query processing in centralized systems (parsing and translation, optimization, code generation, examples); query processing in distributed systems (mapping global queries to local queries, optimization).
This document summarizes research on implementing search-as-you-type functionality in relational database forms. It motivates the approach by noting limitations of existing search paradigms like SQL and keyword search. Key challenges include enabling fast prefix matching, synchronizing local and global search results, handling errors and misspellings, and improving scalability for large databases. Initial achievements include a prototype called Seaform-DBLP that supports basic prefix search over a single database table, but it has limitations: no error tolerance, returning all results rather than the top-k, and being memory-resident rather than native to a database system. Overall, search-as-you-type in database forms shows promise for balancing usability and functionality, but addressing these challenges remains open.
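Fast prefix matching, the first challenge listed above, is commonly served by a trie. The sketch below is a generic in-memory prefix index for illustration, not the Seaform prototype; the class names and toy titles are made up.

```python
class TrieNode:
    __slots__ = ("children", "ids")

    def __init__(self):
        self.children = {}
        self.ids = set()        # record ids reachable via this prefix

class PrefixIndex:
    """In-memory trie for search-as-you-type: every keystroke narrows
    the candidate set with a single O(len(prefix)) walk."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, record_id):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
            node.ids.add(record_id)

    def search(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return set()
            node = node.children[ch]
        return node.ids

idx = PrefixIndex()
for rid, title in enumerate(["dataset", "database", "datum", "query"]):
    idx.insert(title, rid)
```

Typing "dat" and then "data" shrinks the candidate set from three records to two without rescanning the table, which is the behaviour search-as-you-type depends on; error tolerance and top-k ranking would need additional machinery on top of this.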
This document provides an overview of peer-to-peer computing and distributed shared memory. It discusses characteristics of peer-to-peer networks like decentralized control and anonymity. Structured and unstructured overlays are described, with Chord given as an example of a structured overlay using distributed hash tables. Search techniques for unstructured overlays include flooding and random walks. Distributed shared memory provides abstraction through memory consistency models for shared access across distributed nodes.
LODOP - Multi-Query Optimization for Linked Data Profiling Queries - Anja Jentzsch
The document describes LODOP, a system for optimizing Linked Data profiling queries. LODOP implements 15 profiling tasks as Apache Pig scripts and develops 3 optimization rules for executing multiple profiling scripts concurrently. The rules merge identical operators, combine FILTER operators, and combine FOREACH operators to reduce the number of operations and MapReduce jobs. Applying the rules reduces execution time by 70% compared to sequential execution but a more advanced cost-based approach is needed. Future work includes additional optimization rules and strategies.
Usability of Keyword-driven Schema-agnostic Search - A Comparative Study of K...Thanh Tran
The document describes a 2010 paper on schema-agnostic search approaches for querying linked data. It discusses the motivation for such approaches given complex information needs on the evolving web of data. The paper presents conceptual studies of four widely used schema-agnostic search approaches, and conducts experimental evaluations to assess their efficiency, effectiveness, and usability.
Template-based information access, in which templates are constructed for keywords, is a recent development of linked data information retrieval. However, most such approaches suffer from ineffective template management. Because linked data has a structured data representation, we assume the data’s inside statistics can effectively influence template management. In this work, we use this influence for template
creation, template ranking, and scaling. Our proposal can effectively be used for automatic linked data information retrieval and can be incorporated with other techniques such as ontology inclusion and sophisticated matching to further improve performance.
The document discusses information retrieval (IR) and provides definitions and examples of different IR models and techniques. It describes how documents and queries can be represented as vectors, with weights like term frequency-inverse document frequency (tf-idf) used to indicate importance. Various IR models are covered, including boolean, vector space, and probabilistic models, along with common weighting and ranking methods used in IR systems.
Search results clustering (SRC) is a challenging algorithmic
problem that requires grouping together the results returned
by one or more search engines in topically coherent clusters,
and labeling the clusters with meaningful phrases describing
the topics of the results included in them.
Topic detecton by clustering and text miningIRJET Journal
This document discusses topic detection from text documents using text mining and clustering techniques. It proposes extracting keywords from documents, representing topics as groups of keywords, and using k-means clustering on the keywords to group them into topics. The keywords are extracted based on frequency counts and preprocessed by removing stop words and stemming. The k-means clustering algorithm is used to assign keywords to topics represented by cluster centroids, and the centroids are iteratively updated until cluster assignments converge.
Query Distributed RDF Graphs: The Effects of Partitioning PaperDBOnto
Abstract: Web-scale RDF datasets are increasingly processed using distributed RDF data stores built on top of a cluster of shared-nothing servers. Such systems critically rely on their data partitioning scheme and query answering scheme, the goal of which is to facilitate correct and ecient query processing. Existing data partitioning schemes are
commonly based on hashing or graph partitioning techniques. The latter techniques split a dataset in a way that minimises the number of connections between the resulting subsets, thus reducing the need for communication between servers; however, to facilitate ecient query answering,
considerable duplication of data at the intersection between subsets is often needed. Building upon the known graph partitioning approaches, in this paper we present a novel data partitioning scheme that employs minimal duplication and keeps track of the connections between partition elements; moreover, we propose a query answering scheme that
uses this additional information to correctly answer all queries. We show experimentally that, on certain well-known RDF benchmarks, our data partitioning scheme often allows more answers to be retrieved without distributed computation than the known schemes, and we show that our query answering scheme can eciently answer many queries.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Distributed Algorithm for Frequent Pattern Mining using HadoopMap Reduce Fram...idescitation
With the rapid growth of information technology and in many business
applications, mining frequent patterns and finding associations among them requires
handling large and distributed databases. As FP-tree considered being the best compact data
structure to hold the data patterns in memory there has been efforts to make it parallel and
distributed to handle large databases. However, it incurs lot of communication over head
during the mining. In this paper parallel and distributed frequent pattern mining algorithm
using Hadoop Map Reduce framework is proposed, which shows best performance results
for large databases. Proposed algorithm partitions the database in such a way that, it works
independently at each local node and locally generates the frequent patterns by sharing the
global frequent pattern header table. These local frequent patterns are merged at final stage.
This reduces the complete communication overhead during structure construction as well as
during pattern mining. The item set count is also taken into consideration reducing
processor idle time. Hadoop Map Reduce framework is used effectively in all the steps of the
algorithm. Experiments are carried out on a PC cluster with 5 computing nodes which
shows execution time efficiency as compared to other algorithms. The experimental result
shows that proposed algorithm efficiently handles the scalability for very large datab ases.
Index Terms—
EFFICIENTLY PROCESSING OF TOP-K TYPICALITY QUERY FOR STRUCTURED DATAcsandit
This work presents a novel ranking scheme for structured data. We show how to apply the
notion of typicality analysis from cognitive science and how to use this notion to formulate the
problem of ranking data with categorical attributes. First, we formalize the typicality query
model for relational databases. We adopt Pearson correlation coefficient to quantify the extent
of the typicality of an object. The correlation coefficient estimates the extent of statistical
relationships between two variables based on the patterns of occurrences and absences of their
values. Second, we develop a top-k query processing method for efficient computation. TPFilter
prunes unpromising objects based on tight upper bounds and selectively joins tuples of highest
typicality score. Our methods efficiently prune unpromising objects based on upper bounds.
Experimental results show our approach is promising for real data.
Automated building of taxonomies for search enginesBoris Galitsky
We build a taxonomy of entities which is intended to improve the relevance of a search engine in a vertical domain. The taxonomy construction process starts from seed entities and mines the web for new entities associated with them. To form these new entities, machine learning of syntactic parse trees (their generalization) is applied to the search results for existing entities to form commonalities between them. These commonality expressions then form parameters of existing entities, and are turned into new entities at the next learning iteration.
Taxonomy and paragraph-level syntactic generalization are applied to relevance improvement in search and text similarity assessment. We conduct an evaluation of the search relevance improvement in vertical and horizontal domains and observe a significant contribution of the learned taxonomy in the former, and a noticeable contribution of a hybrid system in the latter domain. We also perform an industrial evaluation of taxonomy- and syntactic-generalization-based text relevance assessment and conclude that the proposed algorithm for automated taxonomy learning is suitable for integration into industrial systems. The proposed algorithm is implemented as part of the Apache OpenNLP.Similarity project.
Matching and merging anonymous terms from web sourcesIJwest
This paper describes a workflow of simplifying and matching special language terms in RDF generated from web sources.
This document outlines the BoTLRet system, a template-based linked data information retrieval system. It begins with an introduction to linked data and related work in linked data access. It then describes the problem with current template-based systems and proposes BoTLRet as a solution. BoTLRet constructs templates according to linked data structure and ranks templates using dataset statistics. It can handle queries with two or more keywords by progressively constructing and merging templates for adjacent keyword pairs. The document concludes with experimental results showing BoTLRet achieves close to exhaustive retrieval with lower computational cost than alternative systems, and outperforms other state-of-the-art template-based linked data retrieval systems.
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHIJDKP
Text mining is an emerging research field evolving from the information retrieval area. Clustering and
classification are two approaches in data mining which may also be used to perform text clustering
and text classification; the former is unsupervised while the latter is supervised. In this paper, our objective is
to perform text clustering by defining an improved distance metric to compute the similarity between two
text files. We use incremental frequent pattern mining to find frequent items and reduce dimensionality.
The improved distance metric may also be used to perform text classification. The distance metric is
validated for the worst, average, and best case situations [15]. The results show the proposed distance
metric outperforms existing measures.
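To make the idea concrete, here is a minimal sketch (not the paper's actual metric) of using frequent 1-itemsets to reduce dimensionality and then measuring distance only over that frequent vocabulary. All names, the support threshold, and the Jaccard-style distance are illustrative assumptions.

```python
from collections import Counter

def frequent_items(docs, min_support=2):
    """Words appearing in at least `min_support` documents (frequent 1-itemsets)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return {w for w, c in df.items() if c >= min_support}

def distance(doc_a, doc_b, frequent):
    """Jaccard distance restricted to the frequent vocabulary."""
    a = set(doc_a.lower().split()) & frequent
    b = set(doc_b.lower().split()) & frequent
    if not a and not b:
        return 1.0
    return 1.0 - len(a & b) / len(a | b)

docs = ["data mining of text", "text mining methods", "cooking recipes at home"]
freq = frequent_items(docs)               # only words shared by >= 2 docs survive
d_close = distance(docs[0], docs[1], freq)
d_far = distance(docs[0], docs[2], freq)
```

Restricting the vocabulary to frequent items is what keeps the metric cheap as the collection grows incrementally.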
This document provides an overview of probabilistic approaches to information retrieval. It discusses why probabilities are useful for IR given the inherent uncertainty. It covers the Probability Ranking Principle, which aims to rank documents by estimated probability of relevance. Other probabilistic techniques discussed include probabilistic indexing, probabilistic inference using logic representations, and using Bayesian networks for IR. The document notes open issues with some of these approaches and concludes by surveying existing survey papers on probabilistic IR.
The document describes an evaluation of existing relational keyword search systems. It notes discrepancies in how prior studies evaluated systems using different datasets, query workloads, and experimental designs. The evaluation aims to conduct an independent assessment that uses larger, more representative datasets and queries to better understand systems' real-world performance and tradeoffs between effectiveness and efficiency. It outlines schema-based and graph-based search approaches included in the new evaluation.
This document summarizes two algorithms - MFA and ATRA - for processing top-k spatial preference queries. MFA is a threshold-based algorithm that partitions queries into three features - spatial, preference, and text - and retrieves objects with the highest aggregate scores. ATRA uses a hybrid indexing structure called AIR-tree to more efficiently retrieve only relevant objects without revisiting the same data. The paper then proposes using an R-tree index structure combined with an enhanced branch-and-bound search algorithm to answer preference-based top-k spatial keyword queries by ranking objects based on feature quality in their neighborhoods.
Fedbench - A Benchmark Suite for Federated Semantic Data ProcessingPeter Haase
(1) FedBench is a benchmark suite for evaluating federated semantic data processing systems.
(2) It includes parameterized benchmark drivers, a variety of RDF datasets and SPARQL queries, and an evaluation framework to measure system performance.
(3) An initial evaluation was conducted to demonstrate FedBench's flexibility in comparing centralized and federated query processing using different systems and scenarios.
Query Processing: the query processing problem and layers of query processing. Query processing in centralized systems: parsing and translation, optimization, code generation. Query processing in distributed systems: mapping a global query to local queries, optimization.
This document summarizes research on implementing search-as-you-type functionality in relational database forms. It motivates this approach by noting limitations of existing search paradigms like SQL and keyword search. Key challenges include enabling fast prefix matching, synchronizing local and global search results, handling errors and misspellings, and improving scalability for large databases. Initial achievements include a prototype called Seaform-DBLP that supports basic prefix search of a single database table, but has limitations around error tolerance, returning all results rather than top-k, and being memory-resident rather than native to a database system. Overall, search-as-you-type in database forms shows promise for balancing usability and functionality, but addressing these challenges remains open.
This document provides an overview of peer-to-peer computing and distributed shared memory. It discusses characteristics of peer-to-peer networks like decentralized control and anonymity. Structured and unstructured overlays are described, with Chord given as an example of a structured overlay using distributed hash tables. Search techniques for unstructured overlays include flooding and random walks. Distributed shared memory provides abstraction through memory consistency models for shared access across distributed nodes.
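As an illustration of the Chord idea mentioned above, the sketch below hashes node names and keys onto an identifier ring and stores each key at its successor node. The `Ring` class and the 16-bit identifier space are simplifications for the example; real Chord uses SHA-1's 160-bit space and finger tables for O(log N) lookups.

```python
import hashlib
from bisect import bisect_left

M = 2 ** 16  # toy identifier space; real Chord uses 2^160 (SHA-1)

def node_id(name: str) -> int:
    """Hash a node name or key onto the identifier ring."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % M

class Ring:
    """Minimal Chord-style ring: a key lives on its successor node
    (the first node whose identifier is >= the key's identifier)."""
    def __init__(self, nodes):
        self.ids = sorted(node_id(n) for n in nodes)
        self.by_id = {node_id(n): n for n in nodes}

    def successor(self, key: str) -> str:
        i = bisect_left(self.ids, node_id(key))
        return self.by_id[self.ids[i % len(self.ids)]]  # wrap around the ring

ring = Ring(["node-a", "node-b", "node-c"])
owner = ring.successor("some-file.txt")  # deterministic key placement
```

Because placement depends only on the hash, any peer can locate a key's owner without flooding, which is the structured overlay's advantage over random walks.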
LODOP - Multi-Query Optimization for Linked Data Profiling QueriesAnja Jentzsch
The document describes LODOP, a system for optimizing Linked Data profiling queries. LODOP implements 15 profiling tasks as Apache Pig scripts and develops 3 optimization rules for executing multiple profiling scripts concurrently. The rules merge identical operators, combine FILTER operators, and combine FOREACH operators to reduce the number of operations and MapReduce jobs. Applying the rules reduces execution time by 70% compared to sequential execution but a more advanced cost-based approach is needed. Future work includes additional optimization rules and strategies.
Machine Language and Pattern Analysis IEEE 2015 ProjectsVijay Karan
List of Machine Language and Pattern Analysis IEEE 2015 Projects. It contains the IEEE projects in the domain Machine Language and Pattern Analysis for the year 2015.
Executing Provenance-Enabled Queries over Web DataeXascale Infolab
The proliferation of heterogeneous Linked Data on the Web poses new challenges to database systems. In particular, because of this heterogeneity, the capacity to store, track, and query provenance data is becoming a pivotal feature of modern triple stores. In this paper, we tackle the problem of efficiently executing provenance-enabled queries over RDF data. We propose, implement and empirically evaluate five different query execution strategies for RDF queries that incorporate knowledge of provenance. The evaluation is conducted on Web Data obtained from two different Web crawls (The Billion Triple Challenge, and the Web Data Commons). Our evaluation shows that using an adaptive query materialization execution strategy performs best in our context. Interestingly, we find that because provenance is prevalent within Web Data and is highly selective, it can be used to improve query processing performance. This is a counterintuitive result as provenance is often associated with additional overhead.
Metabolomic Data Analysis Workshop and Tutorials (2014)Dmitry Grapov
This document provides an introduction and overview of tutorials for metabolomic data analysis. It discusses downloading required files and software. The goals of the analysis include using statistical and multivariate analyses to identify differences between sample groups and impacted biochemical domains. It also discusses various data analysis techniques including data quality assessment, univariate and multivariate statistical analyses, clustering, principal component analysis, partial least squares modeling, functional enrichment analysis, and network mapping.
Using CA ERwin modeling to assure data 09162010ERwin Modeling
Data profiling analyzes data content to infer metadata and increase the accuracy of data assets and models. It can help with data quality assessments, master data management, and reducing risks in data warehousing projects. The presentation provided examples of how profiling was used to uncover issues, validate models and requirements, standardize values, and reduce development times for various organizations.
Machine Learned Relevance at A Large Scale Search EngineSalford Systems
The document discusses machine learned relevance at a large scale search engine. It provides biographies of the two authors who have extensive experience in machine learning and search engines. It then outlines the topics to be covered, including an introduction to machine learned ranking for search, relevance evaluation methodologies, data collection and metrics, the Quixey search engine system, model training approaches, and conclusions.
The document describes Panda, a system for managing data provenance and workflows. Panda aims to merge data and process provenance, define provenance operators to query and analyze mixed data and provenance, and create an open-source configurable system. An example workflow demonstrates deduplicating and processing datasets to predict purchased items. Panda allows for backward and forward tracing of data and refreshing results due to new data. It implements a query language and uses predicates to trace data back to its origins.
Performance Analysis of MapReduce Implementations on High Performance Homolog...Koichi Shirahata
This document describes performance analyses of MapReduce implementations for large-scale homology searches. It introduces homology searches and their use in metagenome analysis using sequence databases that are growing enormously in size. Two MapReduce designs for homology searches are proposed: one replicates the database on all nodes, while the other distributes the database. Preliminary experiments show MapReduce exhibits good scaling and comparable performance to MPI implementations. The goal is high-performance MapReduce homology searches for extremely large databases.
The document proposes a novel ranking approach called Manifold Ranking with Sink Points (MRSP) that addresses relevance, importance, and diversity simultaneously. MRSP uses manifold ranking over data objects to find the most relevant and important objects. It then designates ranked objects as "sink points" to prevent redundant objects from receiving high ranks. The approach is applied to update summarization and query recommendation tasks, demonstrating strong performance compared to existing methods.
Sharing massive data analysis: from provenance to linked experiment reportsGaignard Alban
The document discusses scientific workflows, provenance, and linked data. It covers:
1) Scientific workflows can automate data analysis at scale, abstract complex processes, and capture provenance for transparency.
2) Provenance represents the origin and history of data and can be represented using standards like PROV. It allows reasoning about how results were produced.
3) Capturing and publishing provenance as linked open data can help make scientific results more reusable and queryable, but challenges remain around multi-site studies and producing human-readable reports.
This document discusses optimizing database management of large-scale web access logs. It proposes using a pre-processor to hash and sort logs in memory before writing to the database. An experiment compares the performance of using the pre-processor versus directly writing to the database. The results show the pre-processor is 18-20 times faster for input time and memory usage is twice as high but run time is much better compared to only using the database. The document concludes the proposed approach of using a pre-processor for in-memory processing before database storage provides better performance and optimization than traditional approaches.
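A minimal sketch of the pre-processing idea described above, assuming plain-text log lines: aggregate hits in an in-memory hash table and emit one sorted batch for a bulk database insert, instead of issuing one write per raw line. The function and input format are hypothetical.

```python
def preprocess(log_lines):
    """Hash-aggregate raw access-log lines in memory, then emit them sorted,
    so the database receives one ordered bulk insert instead of many random writes."""
    counts = {}                       # hash table: line -> hit count
    for line in log_lines:
        line = line.strip()
        if line:
            counts[line] = counts.get(line, 0) + 1
    return sorted(counts.items())     # sorted batch, ready for a bulk INSERT

batch = preprocess([
    "/index.html 200\n",
    "/about.html 200\n",
    "/index.html 200\n",
])
```

Hashing deduplicates repeated requests and sorting turns the database load into sequential, index-friendly writes, which is where the reported speedup comes from.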
Efficient top-k queries processing in column-family distributed databasesRui Vieira
The document discusses efficient top-k query processing on distributed column family databases. It begins by introducing top-k queries and their uses. It then discusses challenges with naive solutions and prior work using batch processing. The document proposes three algorithms - TPUT, Hybrid Threshold, and KLEE - to enable real-time top-k queries on distributed data in a memory, bandwidth, and computation efficient manner. It also discusses implementation considerations for Cassandra's data model and CQL.
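The TPUT algorithm mentioned above can be sketched roughly as follows: a simplified, single-process model of its three phases. The names and the guard against small inputs are illustrative; this is not a faithful distributed implementation.

```python
def tput_topk(node_lists, k):
    """Simplified TPUT over m nodes, each a dict {item: local_score}.
    Returns the exact global top-k while examining only part of each list."""
    m = len(node_lists)

    # Phase 1: every node reports its local top-k; build partial sums.
    partial = {}
    for scores in node_lists:
        for item, s in sorted(scores.items(), key=lambda kv: -kv[1])[:k]:
            partial[item] = partial.get(item, 0) + s
    # The k-th highest partial sum is a lower bound on the k-th highest total.
    tau = sorted(partial.values(), reverse=True)[min(k, len(partial)) - 1]

    # Phase 2: fetch every item scoring at least tau/m somewhere; an item
    # below tau/m on all m nodes cannot reach a total of tau (pigeonhole).
    candidates = set(partial)
    for scores in node_lists:
        candidates |= {i for i, s in scores.items() if s >= tau / m}

    # Phase 3: exact totals for the surviving candidates only.
    totals = {i: sum(sc.get(i, 0) for sc in node_lists) for i in candidates}
    return sorted(totals.items(), key=lambda kv: -kv[1])[:k]

top2 = tput_topk([{"a": 10, "b": 4, "c": 1}, {"a": 3, "b": 9, "d": 8}], 2)
```

The uniform threshold tau/m is what bounds the data each node must ship, which is why TPUT suits bandwidth-constrained column-family stores.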
This document discusses keyword query routing to identify relevant data sources for keyword searches over multiple structured and linked data sources. It proposes using a multilevel inter-relationship graph and scoring mechanism to compute relevance and generate routing plans that route keywords only to pertinent sources. This improves keyword search performance without compromising result quality. An algorithm is developed based on modeling the search space and developing a summary model to incorporate relevance at different levels and dimensions. Experiments showed the summary model preserves relevant information compactly.
Standard Datasets in Information Retrieval Jean Brenda
The document discusses standard datasets used for information retrieval (IR) system evaluation and research. It describes several major datasets including the Cranfield collection, which was the first test collection and used aeronautical papers, and the Text REtrieval Conference (TREC) collection, which is a large collection of newswire articles. It also mentions other datasets like Gov2, NTCIR, CLEF, and 20Newsgroups. The datasets provide documents, queries, and relevance judgments and allow comparison of IR systems and algorithms.
M phil-computer-science-machine-language-and-pattern-analysis-projectsVijay Karan
List of Machine Language and Pattern Analysis IEEE 2006 Projects. It contains the IEEE projects in the domain Machine Language and Pattern Analysis for M.Phil Computer Science students.
LDQL: A Query Language for the Web of Linked DataOlaf Hartig
I used this slideset to present our research paper at the 14th Int. Semantic Web Conference (ISWC 2015). Find a preprint of the paper here:
http://olafhartig.de/files/HartigPerez_ISWC2015_Preprint.pdf
A Context-Based Semantics for SPARQL Property Paths over the WebOlaf Hartig
- The document proposes a formal context-based semantics for evaluating SPARQL property path queries over the Web of Linked Data.
- This semantics defines how to compute the results of such queries in a well-defined manner and ensures the "web-safeness" of queries, meaning they can be executed directly over the Web without prior knowledge of all data.
- The paper presents a decidable syntactic condition for identifying SPARQL property path queries that are web-safe based on their sets of conditionally bounded variables.
Rethinking Online SPARQL Querying to Support Incremental Result VisualizationOlaf Hartig
These are the slides of my invited talk at the 5th Int. Workshop on Usage Analysis and the Web of Data (USEWOD 2015): http://usewod.org/usewod2015.html
The abstract of this talks is given as follows:
To reduce user-perceived response time many interactive Web applications visualize information in a dynamic, incremental manner. Such an incremental presentation can be particularly effective for cases in which the underlying data processing systems are not capable of completely answering the users' information needs instantaneously. An example of such systems are systems that support live querying of the Web of Data, in which case query execution times of several seconds, or even minutes, are an inherent consequence of these systems' ability to guarantee up-to-date results. However, support for an incremental result visualization has not received much attention in existing work on such systems. Therefore, the goal of this talk is to discuss approaches that enable query systems for the Web of Data to return query results incrementally.
Tutorial "Linked Data Query Processing" Part 2 "Theoretical Foundations" (WWW...Olaf Hartig
This document summarizes the theoretical foundations of linked data query processing presented in a tutorial. It discusses the SPARQL query language, data models for linked data queries, full-web and reachability-based query semantics. Under full-web semantics, a query is computable if its pattern is monotonic, and eventually computable otherwise. Reachability-based semantics restrict queries to data reachable from a set of seed URIs. Queries under this semantics are always finitely computable if the web is finite. The document outlines computability results and properties regarding satisfiability and monotonicity for different semantics.
An Overview on PROV-AQ: Provenance Access and QueryOlaf Hartig
The slides which I used at the Dagstuhl seminar on Principles of Provenance (Feb.2012) for presenting the main contributions and open issues of the PROV-AQ document created by the W3C provenance working group.
Zero-Knowledge Query Planning for an Iterator Implementation of Link Traversa...Olaf Hartig
The document describes zero-knowledge query planning for an iterator-based implementation of link traversal-based query execution. It discusses generating all possible query execution plans from the triple patterns in a query and selecting the optimal plan using heuristics without actually executing the plans. The key heuristics explored are using a seed triple pattern containing a URI as the first pattern, avoiding vocabulary terms as seeds, and placing filtering patterns close to the seed pattern. Evaluation involves generating all plans and executing each repeatedly to estimate costs and benefits for plan selection.
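A rough sketch of such zero-knowledge heuristics, under the assumption that triple patterns are (s, p, o) tuples of `<URI>` strings and `?var` names: score seed candidates by preferring non-vocabulary URIs, then greedily chain patterns that share already-bound variables. This is illustrative, not the paper's actual planner.

```python
VOCAB = ("http://www.w3.org/1999/02/22-rdf-syntax-ns#",
         "http://www.w3.org/2000/01/rdf-schema#")

def order_patterns(patterns):
    """Order triple patterns for left-to-right execution: pick as seed a
    pattern with non-vocabulary URIs in subject/object position, then
    greedily append patterns sharing already-bound variables."""
    def seed_score(tp):
        s, p, o = tp
        score = 0
        for term in (s, o):             # predicates are vocabulary terms anyway
            if term.startswith("<"):
                score += 1 if any(v in term for v in VOCAB) else 2
        return score

    remaining = sorted(patterns, key=seed_score, reverse=True)
    plan = [remaining.pop(0)]
    bound = {t for t in plan[0] if t.startswith("?")}
    while remaining:
        # next, the pattern sharing the most already-bound variables
        nxt = max(remaining, key=lambda tp: sum(t in bound for t in tp))
        remaining.remove(nxt)
        plan.append(nxt)
        bound |= {t for t in nxt if t.startswith("?")}
    return plan

patterns = [
    ("?p", "<http://xmlns.com/foaf/0.1/name>", "?n"),
    ("<http://bob.name#me>", "<http://xmlns.com/foaf/0.1/knows>", "?p"),
    ("?p", "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>", "?t"),
]
plan = order_patterns(patterns)
```

The point of "zero knowledge" is that this ordering uses only the syntactic shape of the query, with no statistics about the data behind the URIs.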
Brief Introduction to the Provenance Vocabulary (for W3C prov-xg)Olaf Hartig
The document describes the Provenance Vocabulary, which defines an OWL ontology for describing provenance metadata on the Semantic Web. The vocabulary aims to integrate provenance into the Web of data to enable quality assessment. It partitions provenance descriptions into a core ontology and supplementary modules. Examples are provided to illustrate how the vocabulary can be used to describe the provenance of Linked Data, including information about data creation and retrieval processes. The design principles emphasize usability, flexibility, and integration with other vocabularies. Future work includes further alignment and additional modules to cover more provenance aspects.
Web & Graphics Designing Training at Erginous Technologies in Rajpura offers practical, hands-on learning for students, graduates, and professionals aiming for a creative career. The 6-week and 6-month industrial training programs blend creativity with technical skills to prepare you for real-world opportunities in design.
The course covers Graphic Designing tools like Photoshop, Illustrator, and CorelDRAW, along with logo, banner, and branding design. In Web Designing, you’ll learn HTML5, CSS3, JavaScript basics, responsive design, Bootstrap, Figma, and Adobe XD.
Erginous emphasizes 100% practical training, live projects, portfolio building, expert guidance, certification, and placement support. Graduates can explore roles like Web Designer, Graphic Designer, UI/UX Designer, or Freelancer.
For more info, visit erginous.co.in , message us on Instagram at erginoustechnologies, or call directly at +91-89684-38190 . Start your journey toward a creative and successful design career today!
Technology Trends in 2025: AI and Big Data AnalyticsInData Labs
At InData Labs, we have been keeping an ear to the ground, looking out for AI-enabled digital transformation trends coming our way in 2025. Our report will provide a look into the technology landscape of the future, including:
-Artificial Intelligence Market Overview
-Strategies for AI Adoption in 2025
-Anticipated drivers of AI adoption and transformative technologies
-Benefits of AI and Big data for your business
-Tips on how to prepare your business for innovation
-AI and data privacy: Strategies for securing data privacy in AI models, etc.
Download your free copy now and implement the key findings to improve your business.
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025BookNet Canada
Book industry standards are evolving rapidly. In the first part of this session, we’ll share an overview of key developments from 2024 and the early months of 2025. Then, BookNet’s resident standards expert, Tom Richardson, and CEO, Lauren Stewart, have a forward-looking conversation about what’s next.
Link to recording, transcript, and accompanying resource: https://bnctechforum.ca/sessions/standardsgoals-for-2025-standards-certification-roundup/
Presented by BookNet Canada on May 6, 2025 with support from the Department of Canadian Heritage.
UiPath Agentic Automation: Community Developer OpportunitiesDianaGray10
Please join our UiPath Agentic: Community Developer session where we will review some of the opportunities that will be available this year for developers wanting to learn more about Agentic Automation.
Bepents tech services - a premier cybersecurity consulting firmBenard76
Introduction
Bepents Tech Services is a premier cybersecurity consulting firm dedicated to protecting digital infrastructure, data, and business continuity. We partner with organizations of all sizes to defend against today’s evolving cyber threats through expert testing, strategic advisory, and managed services.
🔎 Why You Need us
Cyberattacks are no longer a question of “if”—they are a question of “when.” Businesses of all sizes are under constant threat from ransomware, data breaches, phishing attacks, insider threats, and targeted exploits. While most companies focus on growth and operations, security is often overlooked—until it’s too late.
At Bepents Tech, we bridge that gap by being your trusted cybersecurity partner.
🚨 Real-World Threats. Real-Time Defense.
Sophisticated Attackers: Hackers now use advanced tools and techniques to evade detection. Off-the-shelf antivirus isn’t enough.
Human Error: Over 90% of breaches involve employee mistakes. We help build a "human firewall" through training and simulations.
Exposed APIs & Apps: Modern businesses rely heavily on web and mobile apps. We find hidden vulnerabilities before attackers do.
Cloud Misconfigurations: Cloud platforms like AWS and Azure are powerful but complex—and one misstep can expose your entire infrastructure.
💡 What Sets Us Apart
Hands-On Experts: Our team includes certified ethical hackers (OSCP, CEH), cloud architects, red teamers, and security engineers with real-world breach response experience.
Custom, Not Cookie-Cutter: We don’t offer generic solutions. Every engagement is tailored to your environment, risk profile, and industry.
End-to-End Support: From proactive testing to incident response, we support your full cybersecurity lifecycle.
Business-Aligned Security: We help you balance protection with performance—so security becomes a business enabler, not a roadblock.
📊 Risk is Expensive. Prevention is Profitable.
A single data breach costs businesses an average of $4.45 million (IBM, 2023).
Regulatory fines, loss of trust, downtime, and legal exposure can cripple your reputation.
Investing in cybersecurity isn’t just a technical decision—it’s a business strategy.
🔐 When You Choose Bepents Tech, You Get:
Peace of Mind – We monitor, detect, and respond before damage occurs.
Resilience – Your systems, apps, cloud, and team will be ready to withstand real attacks.
Confidence – You’ll meet compliance mandates and pass audits without stress.
Expert Guidance – Our team becomes an extension of yours, keeping you ahead of the threat curve.
Security isn’t a product. It’s a partnership.
Let Bepents tech be your shield in a world full of cyber threats.
🌍 Our Clientele
At Bepents Tech Services, we’ve earned the trust of organizations across industries by delivering high-impact cybersecurity, performance engineering, and strategic consulting. From regulatory bodies to tech startups, law firms, and global consultancies, we tailor our solutions to each client's unique needs.
TrsLabs - AI Agents for All - Chatbots to Multi-Agents SystemsTrs Labs
AI Adoption for Your Business
AI applications have evolved from chatbots
into sophisticated AI agents capable of
handling complex workflows. Multi-agent
systems are the next phase of evolution.
Generative Artificial Intelligence (GenAI) in BusinessDr. Tathagat Varma
My talk for the Indian School of Business (ISB) Emerging Leaders Program Cohort 9. In this talk, I discussed key issues around adoption of GenAI in business - benefits, opportunities and limitations. I also discussed how my research on Theory of Cognitive Chasms helps address some of these issues
Artificial Intelligence is providing benefits in many areas of work within the heritage sector, from image analysis, to ideas generation, and new research tools. However, it is more critical than ever for people, with analogue intelligence, to ensure the integrity and ethical use of AI. Including real people can improve the use of AI by identifying potential biases, cross-checking results, refining workflows, and providing contextual relevance to AI-driven results.
News about the impact of AI often paints a rosy picture. In practice, there are many potential pitfalls. This presentation discusses these issues and looks at the role of analogue intelligence and analogue interfaces in providing the best results to our audiences. How do we deal with factually incorrect results? How do we get content generated that better reflects the diversity of our communities? What roles are there for physical, in-person experiences in the digital world?
AI and Data Privacy in 2025: Global TrendsInData Labs
In this infographic, we explore how businesses can implement effective governance frameworks to address AI data privacy. Understanding it is crucial for developing effective strategies that ensure compliance, safeguard customer trust, and leverage AI responsibly. Equip yourself with insights that can drive informed decision-making and position your organization for success in the future of data privacy.
This infographic contains:
-AI and data privacy: Key findings
-Statistics on AI data privacy in the today’s world
-Tips on how to overcome data privacy challenges
-Benefits of AI data security investments.
Keep up-to-date on how AI is reshaping privacy standards and what this entails for both individuals and organizations.
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...Raffi Khatchadourian
Efficiency is essential to support responsiveness w.r.t. ever-growing datasets, especially for Deep Learning (DL) systems. DL frameworks have traditionally embraced deferred execution-style DL code—supporting symbolic, graph-based Deep Neural Network (DNN) computation. While scalable, such development is error-prone, non-intuitive, and difficult to debug. Consequently, more natural, imperative DL frameworks encouraging eager execution have emerged but at the expense of run-time performance. Though hybrid approaches aim for the “best of both worlds,” using them effectively requires subtle considerations to make code amenable to safe, accurate, and efficient graph execution—avoiding performance bottlenecks and semantically inequivalent results. We discuss the engineering aspects of a refactoring tool that automatically determines when it is safe and potentially advantageous to migrate imperative DL code to graph execution and vice-versa.
How Caching Improves Efficiency and Result Completeness for Querying Linked Data
1. How Caching Improves
Efficiency and Result Completeness
for Querying Linked Data
Olaf Hartig
http://olafhartig.de/foaf.rdf#olaf
Database and Information Systems Research Group
Humboldt-Universität zu Berlin
2. Can we query the Web of Data
as if it were a single,
giant database?
SELECT DISTINCT ?i ?label
WHERE {
?prof rdf:type <http://res ... data/dbprofs#DBProfessor> ;
foaf:topic_interest ?i .
}
OPTIONAL {
}
?i rdfs:label ?label
FILTER( LANG(?label)="en" || LANG(?label)="")
ORDER BY ?label
?
Our approach: Link Traversal Based Query Execution
[ISWC'09]
Olaf Hartig - How Caching Improves Efficiency and Result Completeness for Querying Linked Data 2
3. Main Idea
● Intertwine query evaluation with traversal of data links
● We alternate between:
  ● Evaluate parts of the query (triple patterns) on a continuously augmented set of data
  ● Look up URIs in intermediate solutions and add retrieved data to the query-local dataset
[Figure: the (initially empty) query-local dataset]
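The alternating evaluate/look-up loop described above can be written down compactly. The following is a toy, self-contained Python sketch, not the paper's implementation; the WEB dict, the URI http://x/AlicesPrj, and all function names are hypothetical stand-ins for the running example of the next slides.

```python
# Toy sketch of link traversal based query execution, NOT the paper's
# implementation. WEB maps a URI to the "descriptor object" (a set of
# triples) retrieved by looking that URI up.
WEB = {
    "http://bob.name":    [("http://bob.name", "knows", "http://alice.name")],
    "http://alice.name":  [("http://alice.name", "project", "http://x/AlicesPrj")],
    "http://x/AlicesPrj": [("http://x/AlicesPrj", "name", "Alice's Project")],
}

def is_var(term):
    return term.startswith("?")

def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    b = dict(binding)
    for p, v in zip(pattern, triple):
        if is_var(p):
            if b.get(p, v) != v:
                return None
            b[p] = v
        elif p != v:
            return None
    return b

def evaluate(patterns, dataset):
    """Evaluate the triple patterns over the query-local dataset; also return
    the intermediate solutions produced after each pattern."""
    solutions, partials = [{}], []
    for pat in patterns:
        solutions = [b2 for b in solutions for t in dataset
                     if (b2 := match(pat, t, b)) is not None]
        partials.extend(solutions)
    return solutions, partials

def execute(patterns):
    dataset, retrieved = set(), set()
    while True:
        # (1) evaluate parts of the query on the current local dataset
        solutions, partials = evaluate(patterns, dataset)
        # (2) look up URIs from the query and from intermediate solutions
        uris = {t for pat in patterns for t in pat if not is_var(t)}
        uris |= {v for b in partials for v in b.values()}
        new = [u for u in uris if u in WEB and u not in retrieved]
        if not new:
            return solutions, dataset
        for u in new:
            retrieved.add(u)
            dataset.update(WEB[u])  # augment the query-local dataset

query = [("http://bob.name", "knows", "?acq"),
         ("?acq", "project", "?prj"),
         ("?prj", "name", "?prjName")]
solutions, local = execute(query)
```

Starting from the single seed URI in the query, the execution discovers http://alice.name and the project document on its own, mirroring the slide animation: the final solution binds ?acq, ?prj, and ?prjName.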
4.–21. Main Idea (animated example)
These slides repeat the bullets above while an example execution is animated. The recoverable steps:
● Query pattern: http://bob.name knows ?acq . ?acq project ?prj . ?prj name ?prjName
● Look up http://bob.name; the retrieved “descriptor object” is added to the query-local dataset
● Evaluating the first triple pattern on the local dataset yields the intermediate solution ?acq = http://alice.name
● Look up http://alice.name and add the retrieved data, which includes http://alice.name project http://.../AlicesPrj
● Evaluating the second pattern extends the solution: ?acq = http://alice.name, ?prj = http://.../AlicesPrj
● Evaluating the third pattern binds ?prjName, giving the final solution ?acq = http://alice.name, ?prj = http://.../AlicesPrj, ?prjName = “…“
22. Characteristics
● Link traversal based query execution:
  ● Evaluation on a continuously augmented dataset
  ● Discovery of potentially relevant data during execution
  ● Discovery driven by intermediate solutions
● Main advantage:
  ● No need to know all data sources in advance
● Limitations:
  ● Query has to contain a URI as a starting point
  ● Ignores data that is not reachable* by the query execution
* formal definition in the paper
23.–27. The Issue
A second query, http://bob.name knows ?acq . ?acq interest ?i . ?i label ?iLabel, starts again with an empty query-local dataset. Its execution therefore repeats the very same look-ups (http://bob.name, then the URIs it mentions) that the first query already performed; the animation shows the result table for ?acq, ?i, ?iLabel being built up from scratch.
[Figures: the second query over a fresh, empty query-local dataset, contrasted with the first query and its already populated dataset]
28.–30. Reusing the Query-Local Dataset
If the query-local dataset populated by the first query is reused, the second query can be evaluated over already retrieved data: the cached triple http://bob.name knows http://alice.name immediately yields the intermediate solution ?acq = http://alice.name without any new look-up.
[Figures: the second query evaluated over the query-local dataset left behind by the first query]
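The reuse idea amounts to keeping one look-up cache across query executions. A minimal sketch, with a hypothetical CachingLookup wrapper and toy data; the hit rate computed here is the measure used in the experiments (look-ups answered from cache / all look-up requests).

```python
# Minimal sketch of reusing the query-local dataset as a data cache across
# query executions. CachingLookup and the toy data are hypothetical.
class CachingLookup:
    def __init__(self, web):
        self.web = web        # stands in for the Web of Data
        self.cache = {}       # the reused query-local dataset
        self.hits = 0
        self.requests = 0

    def lookup(self, uri):
        self.requests += 1
        if uri in self.cache:
            self.hits += 1    # answered from the cache, no retrieval needed
        else:
            self.cache[uri] = self.web.get(uri, [])
        return self.cache[uri]

    def hit_rate(self):
        # hit rate = look-ups answered from cache / all look-up requests
        return self.hits / self.requests if self.requests else 0.0

web = {"http://bob.name":   [("http://bob.name", "knows", "http://alice.name")],
       "http://alice.name": [("http://alice.name", "interest", "http://ex.org/DB")]}
lookups = CachingLookup(web)

# First query execution: every look-up has to retrieve data.
for uri in ("http://bob.name", "http://alice.name"):
    lookups.lookup(uri)
# Second query over the same seed: the same look-ups hit the cache.
for uri in ("http://bob.name", "http://alice.name"):
    lookups.lookup(uri)
# lookups.hit_rate() is now 0.5 (2 of 4 look-ups answered from cache)
```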
31. Hypothesis
Re-using the query-local dataset (a.k.a. data caching) may benefit query performance and result completeness.
32. Contributions
● Systematic analysis of the impact of data caching
  ● Theoretical foundation*
  ● Conceptual analysis*
  ● Empirical evaluation of the potential impact
● Out of scope: caching strategies (replacement, invalidation)
* see paper
33. Experiment – Scenario
● Information about the distributed social network of FOAF profiles
● 5 types of queries
● Experiment setup:
  ● 23 persons
  ● Sequential use
  ➔ 115 queries
34.–37. Experiment – Complete Sequence
[Charts: hit rate, number of query results, and query execution time (in seconds) for queries No. 36–40: ContactInfoPhillipe, UnsetPropsPhillipe, 2ndDegree1Phillipe, 2ndDegree2Phillipe, IncomingPhillipe]
● no reuse experiment: no data caching
● given order experiment: reuse of the query-local dataset for the complete sequence of all 115 queries
● Hit rate: look-ups answered from cache / all look-up requests
38. Summary
● Contributions:
  ● Theoretical foundation
  ● Conceptual analysis
  ● Empirical evaluation
● Main findings:
  ● Additional results possible (for semantically similar queries)
  ● Impact on performance may be positive but also negative
● Future work:
  ● Analysis of caching strategies in our context
  ● Main issue: invalidation
39. Backup Slides
40. Contributions
● Theoretical foundation (extension of the original definition)
  ● Reachability by a Dseed-initialized execution of a BGP query b
  ● Dseed-dependent solution for a BGP query b
  ● Reachability R(B) for a serial execution of B = b1, …, bn
  ➔ Each solution for bcur is also an R(B)-dependent solution for bcur
● Conceptual analysis of the impact of data caching
  ● Performance factor: p( bcur , B ) = c( bcur , [ ] ) – c( bcur , B )
  ● Serendipity factor: s( bcur , B ) = b( bcur , B ) – b( bcur , [ ] )
● Empirical verification of the potential impact
● Out of scope: caching strategies (replacement, invalidation)
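The two factors are simple differences and can be stated as code. A minimal sketch with function names of my own choosing; the example numbers are those reported for query q39 in the given order experiment later in the deck.

```python
# The performance and serendipity factors from the conceptual analysis:
#   p( b_cur, B ) = c( b_cur, [] ) - c( b_cur, B )   (cost saved by reuse)
#   s( b_cur, B ) = b( b_cur, B ) - b( b_cur, [] )   (extra results from reuse)
def performance_factor(cost_fresh, cost_with_reuse):
    """Positive: reusing the dataset made b_cur cheaper; negative: slower."""
    return cost_fresh - cost_with_reuse

def serendipity_factor(results_with_reuse, results_fresh):
    """Number of additional results obtained thanks to the reused dataset."""
    return results_with_reuse - results_fresh

# Numbers reported for query q39 in the 'given order' experiment:
s39 = serendipity_factor(9, 1)            # 8 additional results
p39 = performance_factor(31.48, 68.64)    # -37.16 s: caching slowed q39 down
```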
42. Query Template UnsetProps
SELECT DISTINCT ?result ?resultLabel WHERE
{
?result rdfs:isDefinedBy <http://xmlns.com/foaf/0.1/> .
?result rdfs:domain foaf:Person .
OPTIONAL { <PERSON> ?result ?var0 }
FILTER ( !bound(?var0) )
<PERSON> foaf:knows ?var2 .
?var2 ?result ?var3 .
?result rdfs:label ?resultLabel .
?result vs:term_status ?var1 .
}
ORDER BY ?var1
43. Query Template Incoming
SELECT DISTINCT ?result WHERE
{
?result foaf:knows <PERSON> .
OPTIONAL
{
?result foaf:knows ?var1 .
FILTER ( <PERSON> = ?var1 )
<PERSON> foaf:knows ?result .
}
FILTER ( !bound(?var1) )
}
44. Query Template 2ndDegree1
SELECT DISTINCT ?result WHERE
{
<PERSON> foaf:knows ?p1 .
<PERSON> foaf:knows ?p2 .
FILTER ( ?p1 != ?p2 )
?p1 foaf:knows ?result .
FILTER ( <PERSON> != ?result )
?p2 foaf:knows ?result .
OPTIONAL {
<PERSON> ?knows ?result .
FILTER ( ?knows = foaf:knows )
}
FILTER ( !bound(?knows) )
}
45. Query Template 2ndDegree2
SELECT DISTINCT ?result WHERE
{
<PERSON> foaf:knows ?p1 .
<PERSON> foaf:knows ?p2 .
FILTER ( ?p1 != ?p2 )
?result foaf:knows ?p1 .
FILTER ( <PERSON> != ?result )
?result foaf:knows ?p2 .
OPTIONAL {
<PERSON> ?knows ?result .
FILTER ( ?knows = foaf:knows )
}
FILTER ( !bound(?knows) )
}
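The templates above all rely on the classic SPARQL 1.0 idiom for negation: an OPTIONAL block followed by FILTER( !bound(?var) ) keeps only those solutions for which the optional part found no match. A toy Python sketch of the same logic, with hypothetical data; it mirrors the Incoming template's "knows <PERSON> but is not known back" condition.

```python
# Toy data (hypothetical): pairs (a, b) meaning "a foaf:knows b".
knows = {("bob", "alice"), ("alice", "bob"), ("carol", "bob")}

def incoming_only(person):
    """People who know `person` without being known back (cf. template Incoming)."""
    results = []
    for a, b in knows:
        if b != person:
            continue                  # matches: ?result foaf:knows <PERSON>
        if (person, a) not in knows:  # the OPTIONAL part found no match, so
            results.append(a)         # FILTER( !bound(?var1) ) keeps ?result
    return results

# incoming_only("bob") yields only "carol": alice and bob know each other,
# but bob does not know carol back.
```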
46.–49. Experiment – Single Query
[Charts: hit rate, number of query results, and query execution time (in seconds) for queries No. 36–40: ContactInfoPhillipe, UnsetPropsPhillipe, 2ndDegree1Phillipe, 2ndDegree2Phillipe, IncomingPhillipe]
● no reuse experiment: no data caching
● upper bound experiment:
  ● Reuse of the query-local dataset for 3 executions of each query
  ● Third execution measured
● Hit rate: look-ups answered from cache / all look-up requests
50.–51. Experiment – Single Query

Experiment    | Avg.¹ number of query results (std.dev.) | Average¹ hit rate (std.dev.) | Avg.¹ query execution time (std.dev.)
no reuse      | 4.983 (11.658)                           | 0.576 (0.182)                | 30.036 s (46.708)
upper bound   | 5.070 (11.813)                           | 0.996 (0.017)                | 1.943 s (11.375)
¹ Averaged over all 115 queries

● In the ideal case for Bupper = [ bcur , bcur ]:
  ● pupper( bcur , Bupper ) = c( bcur , [ ] ) – c( bcur , Bupper ) = c( bcur , [ ] )
  ● supper( bcur , Bupper ) = b( bcur , Bupper ) – b( bcur , [ ] ) = 0
● Summary (measurement errors aside):
  ● Same number of query results
  ● Significant improvements in query performance
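The averages in the table imply how much an ideal cache speeds up a single query; a back-of-the-envelope calculation on the reported numbers:

```python
# Averages from the 'single query' experiment table (over all 115 queries).
avg_time_no_reuse = 30.036    # seconds, no data caching
avg_time_upper_bound = 1.943  # seconds, fully reused query-local dataset
speedup = avg_time_no_reuse / avg_time_upper_bound
print(f"average speedup with an ideal cache: {speedup:.1f}x")  # about 15.5x
```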
52.–55. Experiment – Complete Sequence
[Charts: hit rate, number of query results, and query execution time (in seconds) for queries No. 36–40: ContactInfoPhillipe, UnsetPropsPhillipe, 2ndDegree1Phillipe, 2ndDegree2Phillipe, IncomingPhillipe]
● given order experiment: reuse of the query-local dataset for the complete sequence of all 115 queries
56. Experiment – Complete Sequence
Bgiven order = [ q1 , … , q38 ]
s( q39 , Bgiven order ) = b( q39 , Bgiven order ) – b( q39 , [ ] ) = 9 – 1 = 8
[Charts as on the previous slides, highlighting query No. 39 (2ndDegree2Phillipe)]
57. Experiment – Complete Sequence
Bgiven order = [ q1 , … , q38 ]
p'( q39 , Bgiven order ) = c'( q39 , [ ] ) – c'( q39 , Bgiven order ) = 31.48 s – 68.64 s = – 37.16 s
[Charts as on the previous slides, highlighting query No. 39 (2ndDegree2Phillipe)]
58. Experiment – Complete Sequence

Experiment    | Avg.¹ number of query results (std.dev.) | Average¹ hit rate (std.dev.) | Avg.¹ query execution time (std.dev.)
no reuse      | 4.983 (11.658)                           | 0.576 (0.182)                | 30.036 s (46.708)
upper bound   | 5.070 (11.813)                           | 0.996 (0.017)                | 1.943 s (11.375)
given order   | 6.878 (12.158)                           | 0.932 (0.139)                | 39.845 s (145.898)
¹ Averaged over all 115 queries

● Summary:
  ● Data cache may provide for additional query results
  ● Impact on performance may be positive but also negative
59. Experiment – Complete Sequence

Experiment    | Avg.¹ number of query results (std.dev.) | Average¹ hit rate (std.dev.) | Avg.¹ query execution time (std.dev.)
no reuse      | 4.983 (11.658)                           | 0.576 (0.182)                | 30.036 s (46.708)
upper bound   | 5.070 (11.813)                           | 0.996 (0.017)                | 1.943 s (11.375)
given order   | 6.878 (12.158)                           | 0.932 (0.139)                | 39.845 s (145.898)
random orders | 6.652 (11.966)                           | 0.954 (0.036)                | 36.994 s (118.700)
¹ Averaged over all 115 queries

● Executing the query sequence in a random order results in measurements similar to the given order.
60. These slides have been created by
Olaf Hartig
http://olafhartig.de
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 License
(http://creativecommons.org/licenses/by-sa/3.0/)