The document describes data workflows and data integration systems. It defines a data integration system as IS=<O,S,M>, where O is a global schema, S is a set of data sources, and M is a set of mappings between them. It discusses different views of data workflows, including ETL processes, Linked Data workflows, and the data science process. Key steps in data workflows include extraction, integration, cleansing, enrichment, etc. Tools to support different steps are also listed. The document introduces the global-as-view (GAV) and local-as-view (LAV) approaches to specifying the mappings M between the global and local schemas using conjunctive rules.
This document outlines a course on Knowledge Representation (KR) on the Web. The course aims to expose students to challenges of applying traditional KR techniques to the scale and heterogeneity of data on the Web. Students will learn about representing Web data through formal knowledge graphs and ontologies, integrating and reasoning over distributed datasets, and how characteristics such as volume, variety and veracity impact KR approaches. The course involves lectures, literature reviews, and milestone projects where students publish papers on building semantic systems, modeling Web data, ontology matching, and reasoning over large knowledge graphs.
Provenance and Reuse of Open Data (PILOD 2.0, June 2014) - Rinke Hoekstra
The document discusses ingredients for publishing open data, including using URIs, versioning, repeatable transformations, choosing an appropriate level of detail, combining vocabularies, contextualizing information, and provenance. Provenance, or the origin and history of data, is a key issue in publishing open government data and builds trust for application developers and the public. Standards like the W3C PROV ontology can help represent provenance.
The document describes an analysis of 177 scientific workflows from Taverna and Wings systems. The analysis identified common "motifs" in workflows, including data-oriented motifs characterizing common data activities, and workflow-oriented motifs characterizing how activities are implemented. These motifs could help inform workflow design and the creation of automated tools to generate workflow abstractions, in order to facilitate understanding and reuse of workflows.
Prov-O-Viz is a visualisation service for provenance graphs expressed using the W3C PROV vocabulary. It uses the Sankey-style visualisation from D3js.
See http://provoviz.org
Managing Metadata for Science and Technology Studies: the RISIS case - Rinke Hoekstra
Presentation of our paper at the WHISE workshop at ESWC 2016 on requirements for metadata over non-public datasets for the science & technology studies field.
The document discusses the concept of "Broad Data" which refers to the large amount of freely available but widely varied open data on the World Wide Web, including structured and semi-structured data. It provides examples such as the growing linked open data cloud and over 710,000 datasets available from governments around the world. Broad data poses new challenges for data search, modeling, integration and visualization of partially modeled datasets. International open government data search and linking government data to additional contexts are also discussed.
The literature contains a myriad of recommendations, advice, and strictures about what data providers should do to facilitate data reuse. It can be overwhelming. Based on recent empirical work (analyzing data reuse proxies at scale, understanding data sensemaking and looking at how researchers search for data), I talk about what practices are a good place to start for helping others to reuse your data.
Presentation for NEC Lab Europe.
Knowledge graphs are increasingly built using complex, multifaceted machine learning-based systems relying on a wide range of different data sources. To be effective, these must constantly evolve and thus be maintained. I present work on combining knowledge graph construction (e.g. information extraction) and refinement (e.g. link prediction) in end-to-end systems. In particular, I will discuss recent work on using inductive representations for link prediction. I then discuss the challenges of ongoing system maintenance, knowledge graph quality and traceability.
1. The document describes a study that aimed to develop an open government data (OGD) platform that integrates OGD and social media features to better stimulate value generation from OGD.
2. Researchers designed a prototype platform with features like data processing, feedback/collaboration, data quality ratings, and grouping/interaction capabilities.
3. An evaluation of the prototype found that users appreciated the novel social media-inspired features and found them useful for collaborating around OGD.
Data Communities - reusable data in and outside your organization - Paul Groth
Description
Data is critical both to the functioning of an organization and as a product. How can you make that data more usable for both internal and external stakeholders? There are a myriad of recommendations, advice, and strictures about what data providers should do to facilitate data (re)use. It can be overwhelming. Based on recent empirical work (analyzing data reuse proxies at scale, understanding data sensemaking and looking at how researchers search for data), I talk about what practices are a good place to start for helping others to reuse your data. I put this in the context of the notion of data communities that organizations can use to help foster the use of data both within your organization and externally.
Content + Signals: The value of the entire data estate for machine learning - Paul Groth
Content-centric organizations have increasingly recognized the value of their material for analytics and decision support systems based on machine learning. However, as anyone involved in machine learning projects will tell you the difficulty is not in the provision of the content itself but in the production of annotations necessary to make use of that content for ML. The transformation of content into training data often requires manual human annotation. This is expensive particularly when the nature of the content requires subject matter experts to be involved.
In this talk, I highlight emerging approaches to tackling this challenge using what's known as weak supervision - using other signals to help annotate data. I discuss how content companies often overlook resources that they have in-house to provide these signals. I aim to show how looking at a data estate in terms of signals can amplify its value for artificial intelligence.
This document discusses how semantic technologies can help link datasets to publications and institutions to enable new forms of data search and showcasing. It notes that standard schemas and formats are needed to allow linkages between data repositories. Knowledge graphs can help relate entities like papers, authors and institutions to facilitate disambiguation and multi-institutional search capabilities. Semantic technologies are seen as central to efficiently building these linkages at scale across the research data ecosystem.
This document summarizes Ted Dunning's approach to recommendations based on his 1993 paper. The approach involves:
1. Analyzing user data to determine which items are statistically significant co-occurrences
2. Indexing items in a search engine with "indicator" fields containing IDs of significantly co-occurring items
3. Providing recommendations by searching the indicator fields for a user's liked items
The approach is demonstrated in a simple web application using the MovieLens dataset. Further work could optimize and expand on the approach.
This document discusses two presentations on cognition for the semantic web. The first presentation discusses methods for involving humans in semantic data management, including crowdsourcing, citizen science, and games with a purpose. It provides examples of how these techniques can be used for tasks like data linking and validation. The second presentation discusses building cognitive and semantic systems to support understanding data and phenomena through visual examples. It aims to explain why and how these systems can make sense of data and foster understanding.
A New Year in Data Science: ML Unpaused - Paco Nathan
This document summarizes Paco Nathan's presentation at Data Day Texas in 2015. Some key points:
- Paco Nathan discussed observations and trends from the past year in machine learning, data science, big data, and open source technologies.
- He argued that the definitions of data science and statistics are flawed and ignore important areas like development, visualization, and modeling real-world business problems.
- The presentation covered topics like functional programming approaches, streaming approximations, and the importance of an interdisciplinary approach combining computer science, statistics, and other fields like physics.
- Paco Nathan advocated for newer probabilistic techniques for analyzing large datasets that provide approximations using less resources compared to traditional batch processing approaches.
Crowdsourced Data Processing: Industry and Academic Perspectives - Aditya Parameswaran
This document provides a tutorial on crowdsourced data processing from both academic and industry perspectives. The tutorial is divided into three parts. Part 0 provides a background on crowdsourcing and surveys Parts 1 and 2. Part 1 surveys crowdsourced data processing algorithms from academia, discussing unit operations, cost models, error models, and examples like filtering and sorting. Part 2 surveys crowdsourced data processing in industry, finding that many large companies use internal platforms at large scale for tasks like categorization and content moderation, and that academic research is not yet widely used in industry.
Knowledge Graphs - Ilaria Maresi, The Hyve, 23 Apr 2020 - Pistoia Alliance
Data for drug discovery and healthcare is often trapped in silos which hampers effective interpretation and reuse. To remedy this, such data needs to be linked both internally and to external sources to make a FAIR data landscape which can power semantic models and knowledge graphs.
This document summarizes a seminar on machine learning using big data. It discusses the history of data storage and traditional databases. It then introduces machine learning and the types of learning, including supervised and unsupervised learning. Specific algorithms for each type are covered such as k-means clustering for unsupervised and naive Bayes for supervised. Case studies on applications like Amazon product recommendations are presented. The document concludes by discussing tools for machine learning and future applications as more connected devices generate extensive data.
Entity matching and entity resolution are becoming increasingly important disciplines in data management, driven by the growing number of data sources that must be addressed in an economy undergoing digital transformation, growing data volumes and increasing requirements related to data privacy. The data matching process is also called record linkage, entity matching or entity resolution in some published works. For a long time, research on the process focused on matching entities from the same dataset (i.e. deduplication) or from two datasets. Different algorithms for matching different types of attributes have been described in the literature, developed, and implemented in data matching and data cleansing platforms. Entity resolution is an element of the larger entity integration process, which includes data acquisition, data profiling, data cleansing, schema alignment, data matching and data merging (fusion).
As a motivating example, we can use a global pharmaceutical company with offices in more than 60 countries worldwide that migrated customer data from various legacy systems in different countries to a new common CRM system in the cloud. The migration was phased by regions and countries, with new sources and data incrementally added and merged with data already migrated in previous phases. Entity integration in such a case requires deep understanding of the data architectures, the data content and each step of the process. Even with such deep understanding, designing and implementing the solution requires many iterations in the development process that consume human resources, time and money. Reducing the number of iterations by automating and optimizing steps in the process can save a vast amount of resources. There is a lot of available literature addressing each of the steps in the process, proposing different options for improving results or optimizing processing, but the whole process still requires a lot of human work, subject-matter-specific knowledge and many iterations to produce results with a high F-measure (both high precision and recall). Most of the algorithms used in the various steps of the process are human-in-the-loop (HITL) algorithms that require human interaction. The human is always part of the simulation and consequently influences the outcome.
This paper is part of work in progress aimed at defining a conceptual framework that tries to automate and optimize some steps of the entity integration process and to reduce the need for human involvement in the process. The focus of this paper is on the conceptual process definition, a recommended data architecture and the use of existing open source solutions for automating and optimizing the entity integration process.
This document discusses several key aspects of mathematics and algorithms used in internet information retrieval and search engines:
1. It explains how search engines like Google can rapidly rank billions of web pages using algorithms based on the topology and link structure of the web graph, such as PageRank.
2. It describes two main types of page ranking algorithms - static importance ranking based on link analysis, and dynamic relevance ranking based on statistical learning models to match pages to queries.
3. It proposes a new ranking algorithm called BrowseRank that models user browsing behavior using Markov chains and takes into account visit duration to better reflect true page importance.
This document summarizes a presentation on data discovery. It discusses key concepts in data discovery including data source joining, ontologies and taxonomies, rules of data discovery, single source of truth, and data visualization. It emphasizes the importance of not discarding original data and keeping track of the data transformation process to maintain data provenance and lineage. Overall, the presentation aims to illuminate how to understand and work with data through concepts in data discovery.
The document discusses several mathematical models and algorithms used in internet information retrieval and search engines:
1. Markov chain methods can be used to model a user's web surfing behavior and page visit transitions.
2. BrowseRank models user browsing as a Markov process to calculate page importance based on observed user behavior rather than artificial assumptions.
3. Learning to rank problems in information retrieval can be framed as a two-layer statistical learning problem where queries are the first layer and document relevance judgments are the second layer.
4. Stability theory can provide generalization bounds for learning to rank algorithms under this two-layer framework. Modifying algorithms like SVM and Boosting to have query-level stability improves performance.
Democratizing Data within your organization - Data Discovery - Mark Grover
In this talk, we talk about the challenges at scale in an organization like Lyft. We delve into data discovery as a challenge towards democratizing data within your organization. And we go into detail about the solution to the challenge of data discovery.
Talk at ISIM 2017 in Durham, UK on applying database techniques to querying model results in the geosciences, with a broader position about the interaction between data science and simulation as modes of scientific inquiry.
Towards Knowledge Graph based Representation, Augmentation and Exploration of... - Sören Auer
This document discusses improving scholarly communication through knowledge graphs. It describes some current issues with scholarly communication like lack of structure, integration, and machine-readability. Knowledge graphs are proposed as a solution to represent scholarly concepts, publications, and data in a structured and linked manner. This would help address issues like reproducibility, duplication, and enable new ways of exploring and querying scholarly knowledge. The document outlines a ScienceGRAPH approach using cognitive knowledge graphs to represent scholarly knowledge at different levels of granularity and allow for intuitive exploration and question answering over semantic representations.
The document discusses data workflows and integrating open data from different sources. It defines a data workflow as a series of well-defined functional units where data is streamed between activities such as extraction, transformation, and delivery. The document outlines key steps in data workflows including extraction, integration, aggregation, and validation. It also discusses challenges around finding rules and ontologies, data quality, and maintaining workflows over time. Finally, it provides examples of data integration systems and relationships between global and source schemas.
This document discusses transforming open government data from Romania into linked open data. It begins with background on linked data and open data initiatives. Then it describes efforts to model, transform, link, and publish Romanian open data as linked open data. This includes identifying common vocabularies and properties, creating URIs, linking to external datasets like DBPedia, and publishing the linked data for use in applications via a SPARQL endpoint. Overall the goal is to make this data more accessible and interoperable through semantic web standards.
Fox-Keynote-Now and Now of Data Publishing-nfdp13 - DataDryad
The document summarizes Peter Fox's presentation at the Now and Now for Data conference in Oxford, UK on May 22, 2013. Fox discusses different metaphors for making data publicly available, including data publication, ecosystems, and frameworks for conversations about data. He examines pros and cons of different approaches like data centers, publishers, and linked data. The presentation considers how to improve data sharing and what roles different stakeholders like producers and consumers play.
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data... - Denodo
Watch the full session: Denodo DataFest 2016 sessions: https://goo.gl/Bvmvc9
Data prep and data blending are terms that have come to prominence over the last year or two. On the surface, they appear to offer functionality similar to data virtualization…but there are important differences!
In this session, you will learn:
• How data virtualization complements or contrasts technologies such as data prep and data blending
• Pros and cons of functionality provided by data prep, data catalog and data blending tools
• When and how to use these different technologies to be most effective
This session is part of the Denodo DataFest 2016 event. You can also watch more Denodo DataFest sessions on demand here: https://goo.gl/VXb6M6
Linked data provides benefits for publishing and sharing research data on the web in a flexible, cost-efficient way without unnecessary copies. It uses the RDF data model and SPARQL query language to represent data as connected triples with URIs. This allows data to be interlinked across sources and queried as a web of data. Initiatives like GO FAIR have incorporated linked data practices like FAIRification to help make data findable, accessible, interoperable and reusable according to FAIR principles. The future potential of linked data includes enabling global access to connected knowledge across heterogeneous environments and facilitating smart collaboration.
Unlock Your Data for ML & AI using Data Virtualization - Denodo
How Denodo Complements the Logical Data Lake in the Cloud
● Denodo does not substitute data warehouses, data lakes, ETLs...
● Denodo enables the use of all of them together, plus other data sources
○ In a logical data warehouse
○ In a logical data lake
○ They are very similar; the only difference is in the main objective
● There are also use cases where Denodo can be used as a data source in an ETL flow
This document discusses implementing Linked Data in low resource conditions. It begins by outlining goals of providing a high-level view of Linked Data, identifying possible bottlenecks due to limited resources, and offering suggestions to overcome bottlenecks based on experience. It then defines what is meant by "low-resource conditions", including limited IT competencies, software, hardware, electricity, and internet access. The document outlines the Linked Data workflow and discusses each step in more detail, including data generation, conversion to RDF, data storage, maintenance, linking, and exposure. It highlights the example of AGRIS, a collaborative Linked Data application, and emphasizes starting small, being strategic, reusing existing resources, and collaborating to maximize resources in low-resource conditions.
This document outlines DBpedia's strategy to become a global open knowledge graph by facilitating collaboration on data. It discusses establishing governance and curation processes to improve data quality and enable organizations to incubate their knowledge graphs. The goals are to have millions of users and contributors collaborating on data through services like GitHub for data. Technologies like identifiers, schema mapping, and test-driven development help integrate data. The vision is for DBpedia to connect many decentralized data sources so data becomes freely available and easier to work with.
NSF Workshop Data and Software Citation, 6-7 June 2016, Boston USA, Software Panel
Findable, Accessible, Interoperable, Reusable Software and Data Citation: Europe, Research Objects, and BioSchemas.org
Lesson of a course on "Open Data and Linked Open Data" for the Master in "ICT for Cultural Heritage" of the Technological District for Cultural Heritage (DATABENC).
The current status of Linked Open Data (LOD) shows evidence of many datasets available on the Web in RDF. In the meantime, there are still many challenges for organizations to overcome in their journey of publishing five-star datasets on the Web. Those challenges are not only technical, but also organizational. At this moment, when connectionist AI is gaining a wave of popularity with many applications, LOD needs to go beyond the guarantee of FAIR principles. One direction is to build a sustainable LOD ecosystem with FAIR-S principles. In parallel, LOD should serve as a catalyzer for solving societal issues (LOD for Social Good) and personal empowerment through data (Social Linked Data).
Being FAIR: FAIR data and model management - SSBSS 2017 Summer School - Carole Goble
Lecture 1:
Being FAIR: FAIR data and model management
In recent years we have seen a change in expectations for the management of all the outcomes of research – that is the “assets” of data, models, codes, SOPs, workflows. The “FAIR” (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship [1] have proved to be an effective rallying-cry. Funding agencies expect data (and increasingly software) management retention and access plans. Journals are raising their expectations of the availability of data and codes for pre- and post- publication. The multi-component, multi-disciplinary nature of Systems and Synthetic Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
Our FAIRDOM project (http://www.fair-dom.org) supports Systems Biology research projects with their research data, methods and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety. The FAIRDOM Platform has been installed by over 30 labs or projects. Our public, centrally hosted Asset Commons, the FAIRDOMHub.org, supports the outcomes of 50+ projects.
Now established as a grassroots association, FAIRDOM has over 8 years of experience of practical asset sharing and data infrastructure at the researcher coal-face ranging across European programmes (SysMO and ERASysAPP ERANets), national initiatives (Germany's de.NBI and Systems Medicine of the Liver; Norway's Digital Life) and European Research Infrastructures (ISBE) as well as in PI's labs and Centres such as the SynBioChem Centre at Manchester.
In this talk I will show explore how FAIRDOM has been designed to support Systems Biology projects and show examples of its configuration and use. I will also explore the technical and social challenges we face.
I will also refer to European efforts to support public archives for the life sciences. ELIXIR (http://www.elixir-europe.org/) is the European Research Infrastructure of 21 national nodes and a hub funded by national agreements to coordinate and sustain key data repositories and archives for the Life Science community, improve access to them and related tools, support training and create a platform for dataset interoperability. As the Head of the ELIXIR-UK Node and co-lead of the ELIXIR Interoperability Platform I will show how this work relates to your projects.
[1] Wilkinson et al, The FAIR Guiding Principles for scientific data management and stewardship Scientific Data 3, doi:10.1038/sdata.2016.18
Ben Gardner | Delivering a Linked Data warehouse and integrating across the w... - semanticsconference
- The document discusses building a linked data warehouse at a law firm to integrate siloed information from diverse sources and allow for flexible information discovery.
- They extracted data from various sources like Excel reports, XML files, and databases into RDF triples and loaded them into a triple store with an ETL platform.
- This created a logical data warehouse that connects matters, people, clients, and other entities as "things" with relationships, allowing exploration from different domain views.
- An OData integration allows querying the triple store from tools like Excel and Tableau, while a web interface demonstrates lens-based exploration of entities and composing advanced semantic searches.
OrientDB: Unlock the Value of Document Data Relationships - Fabrizio Fortino
a) A general introduction of graph databases and OrientDB,
b) Why connected data has more value than just data,
c)How to "have fun" with OrientDB combining documents with graphs via SQL,
d) A use case on how OrientDB has helped to raise standards in Irish Public Office.
On OrientDB: NOSQL document databases provide an elegant way to deal with data in different shapes enabling developers to create better and faster products quickly. The main goal of these systems is to find the most efficient solution to manage data itself. With the Big Data Explosion we need to deal with a myriad of highly interconnected information. The challenge now is not only on how to store data but on how to manage, analyse, traverse and use your data within the context of relationships. Graph databases shine at maintaining highly connected data and is the fastest growing category in database management systems: 2014 registered an increase of 250% in terms of adoption and Forrester Research predicts that more than a quarter of enterprises will be using graphs by 2017. OrientDB combines more than one NOSQL model offering the unique flexibility of modelling data in the form of either documents, or graphs, while incorporating object oriented programming as a way of encapsulating relationships.
Linked Open Data Principles, benefits of LOD for sustainable development - Martin Kaltenböck
Presentation held on 18.09.2013 at the OKCon 2013 in Geneva, Switzerland in the course of the workshop: How Linked Open data supports Sustainable Development and Climate Change Development by Martin Kaltenböck (SWC), Florian Bauer (REEEP) and Jens Laustsen (GBPN).
3. 3
Outline
• Motivation
• Integrating (Open) Data from different sources
• Not only Linked Data (NoLD)
• Data workflows and Data Management in the context of the rise of Big Data
• What is a "Data Workflow"?
• Different Views of Data Workflows in the context of the Semantic Web
• Key steps involved
• Tools?
• Data Integration Systems
• GAV vs. LAV
• The Mediator and Wrapper Architecture
• Query rewriting vs. Materialisation
• Data Integration using Ontologies
• Challenges:
• How to find Rules and ontologies?
• Handling Incompleteness
• How to find the data?
• Open Problems – Research Tasks
5. 5
Open Data is a global trend – Good for us!
• Cities, International Organizations, National and European portals, etc.:
• In general: more and more structured data available at our fingertips
• It's on the Web
• It's open → no restrictions w.r.t. re-use
6. 6
Buzzword Bingo 1/3:
Open Data vs. Big Data vs. Open Government
• http://www.opendatanow.com/2013/11/new-big-data-vs-open-data-mapping-it-out/
7. 7
Buzzword Bingo 2/3:
Open Data vs. Big Data
• Volume:
• It's growing! (we currently monitor 90 CKAN portals, 512,543 resources / 160,069 datasets; at the moment (statically) ~1 TB of CSV files alone...)
• Variety:
• different datasets (from different cities, countries, etc.), only partially comparable, partially not
• Different metadata to describe datasets
• Different data formats
• Velocity:
• Open Data changes regularly (fast and slow)
• New datasets appear, old ones disappear
• Value:
• building ecosystems ("Data value chain") around Open Data is a key priority of the EC
• Veracity:
• quality, trust
8. 8
Buzzword Bingo 3/3:
Open Data vs. Linked Data
cf.: [Polleres OWLED2013], [Polleres et al. Reasoning Web 2013]
LD efforts discontinued?!
LOD in OGD growing, but slowly
Alternatives in the meantime: (Wikidata...)
LOD is still growing, but OD is growing faster and the challenges aren't necessarily exactly the same…
So, let's focus on Open Data in general…
… more specifically on Open Structured Data
This talk is NOT about DL Reasoning over Linked Data.
9. 9
What makes Open Data useful
beyond "single dataset" Apps...
Great stuff, but limited potential...
More interesting:
• Data Integration & building Data Workflows from different Open Data sources!!!
10. 10
Is Open Data useful at all?
A concrete use case:
What is the CO2/capita in Bologna?
What is the population density of Athens?
What is the length of public transport in Vienna?
Overall ratings computed from (ideally most current) base indicators per city
11. 11
A concrete use case:
The "City Data Pipeline"
Idea – a "classic" Semantic Web use case!
• Regularly integrate various relevant Open Data sources (e.g. eurostat, UNData, ...)
• Make integrated data available for re-use
(How) can ontologies help me?
• Are ontology languages expressive enough?
• Which ontologies could I (re-)use?
• Is there enough data at all?
• Where to find the right data?
• Where to find the right ontologies?
• How to tackle inconsistencies?
12. 12
A concrete use case:
The "City Data Pipeline" – a "fairly standard" data workflow
That's quite a standard Data Workflow, isn't it?
13. 13
So:
a) What is a "standard data workflow"?
b) Where can/shall Semantic Technologies, but also traditional Data Integration technologies, be used to build such workflows?
16. 16
Different Views & Examples:
1/7 „Classic" ETL-Process in Datawarehousing
Wikipedia:
• In computing, Extract, Transform and Load (ETL) refers to a process in database usage and especially in data warehousing that:
• Extracts data from homogeneous or heterogeneous data sources
• Cleansing: deduplication, inconsistencies, missing data, ...
• Transforms the data for storing it in proper format or structure for querying and analysis purposes
• Loads it into the final target (database, more specifically, operational data store, data mart, or data warehouse)
• Typically assumes: fixed, static pipeline, fixed final schema in the final DB/DW
• Cleansing sometimes viewed as a part of Transform, sometimes not.
• Typically assumes complete/clean data at the "load" stage
• Aggregation sometimes viewed as a part of transformation, sometimes higher up in the data warehouse access layer (OLAP)
• WARNING: At each stage, things can go wrong! Filtering/aggregation may bias the data!
• References: [Golfarelli, Rizzi, 2009]
• https://en.wikipedia.org/wiki/Extract,_transform,_load
• https://en.wikipedia.org/wiki/Staging_%28data%29#Functions
"Hard-wired" Data integration
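A minimal sketch of the three ETL steps described above, using only the Python standard library; the file name cities.csv, the database warehouse.db and the table city_stats are hypothetical and only serve the illustration.

```python
# Minimal ETL sketch: extract from a CSV source, cleanse/transform, load into SQLite.
import csv
import sqlite3

def extract(path):
    """Extract: read rows from a (here: CSV) source."""
    with open(path, newline="", encoding="utf-8") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Cleanse and transform: drop incomplete rows, normalise types, deduplicate."""
    seen = set()
    for row in rows:
        if not row.get("city") or not row.get("population"):
            continue                      # missing data
        key = row["city"].strip().lower()
        if key in seen:
            continue                      # deduplication
        seen.add(key)
        yield (key, int(row["population"]))

def load(records, db="warehouse.db"):
    """Load: write the cleaned records into the target store."""
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS city_stats (city TEXT PRIMARY KEY, population INTEGER)")
    con.executemany("INSERT OR REPLACE INTO city_stats VALUES (?, ?)", records)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("cities.csv")))
```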
18. 18
Different Views & Examples:
3/7 Or is it rather a Lifecycle...
• E.g. good example: Linked Data Lifecycle
• NOTE: Independent of whether Linked Data or other sources, you need to revisit/revalidate your workflow, either for improving it or for maintenance (sources changing, source formats changing, etc.)
Axel-Cyrille Ngonga Ngomo, Sören Auer, Jens Lehmann, Amrapali Zaveri. Introduction to Linked Data and Its Lifecycle on the Web. Reasoning Web. 2014
19. 19
Different Views & Examples:
4/7 We're not the first ones to recognize this is actually a lifecycle… [Wiederhold92]
20. 20
Different Views & Examples:
5/7 The "Data Science" Process:
http://semanticommunity.info/Data_Science/Doing_Data_Science
What Would a Next-Gen Data Scientist Do?
"[…] data scientists […] spend a lot more time trying to get data into shape than anyone cares to admit—maybe up to 90% of their time. Finally, they don't find religion in tools, methods, or academic departments. They are versatile and interdisciplinary"
21. 21
Different Views & Examples:
7/7 Big Data & Data Management against “Data Lake”
Data Lake:
• Install all data regardless of requirements
• Store all data in native format without schema definition
• Do analysis using analytic engines (e.g., Hadoop)
Raghu Ramakrishnan, Big Data @ Microsoft
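A tiny sketch of the "schema-on-read" idea behind the data-lake view above: raw records are stored as-is and a schema is only imposed at analysis time. The file lake/events.jsonl and the field names are hypothetical.

```python
# Schema-on-read sketch: store raw JSON records untouched, decide on fields at analysis time.
import json

def read_lake(path="lake/events.jsonl"):
    """Read raw records from the lake; nothing is validated or reshaped here."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

def co2_by_city(records):
    """The analysis step decides which fields it needs and tolerates missing ones."""
    totals = {}
    for rec in records:
        city, value = rec.get("city"), rec.get("co2_per_capita")
        if city is not None and value is not None:
            totals.setdefault(city, []).append(value)
    return {c: sum(v) / len(v) for c, v in totals.items()}

# Example (assuming the file exists): print(co2_by_city(read_lake()))
```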
22. 22
General challenges to be addressed
Syntactic heterogeneity (different formats)
Distributed data sources
Non-standard processing
Semantic heterogeneity
Naming ambiguity
Uncertainty and evolving concepts
23. 23
Specific Steps (non-exhaustive, overlapping!)
• Extraction
• Inconsistency handling
• Incompleteness handling (sometimes called "Enrichment", sometimes imputation of missing values...)
• Data Integration (alignment, source reconciliation)
• Aggregation
• Cleansing (removing outliers)
• Deduplication/Interlinking (could involve "triplification")
• Analytics
• Enrichment
• Change detection (Maintenance/Evolution)
• Validation (quality analysis)
• Efficient, sometimes distributed (query) processing
• Visualization
Tools and current approaches support you partially in different parts of these steps... Bad news: there is no "one-size-fits-all" solution.
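A tiny illustration of a few of the steps above (deduplication, cleansing via a simple outlier rule, aggregation) on invented city-indicator data; pandas is used here only as one possible tool, and the numbers are made up.

```python
# Deduplication, crude outlier removal, and aggregation on hypothetical indicator data.
import pandas as pd

df = pd.DataFrame({
    "city": ["vienna", "vienna", "athens", "bologna", "bologna"],
    "year": [2014, 2014, 2014, 2014, 2015],
    "co2_per_capita": [6.1, 6.1, 5.4, 5.9, 180.0],  # 180.0 is an implausible outlier
})

# Deduplication: keep one record per (city, year).
df = df.drop_duplicates(subset=["city", "year"])

# Cleansing: drop values above a simple quantile threshold (a crude outlier rule).
upper = df["co2_per_capita"].quantile(0.95)
df = df[df["co2_per_capita"] <= upper]

# Aggregation: one value per city.
print(df.groupby("city")["co2_per_capita"].mean())
```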
24. 24
Some Tools (again, exemplary and SW-biased!):
• Linux command-line tools: curl, sed, awk (plus PostgreSQL) do a good job in
many cases...
• LOD2 stack, a stack of tools for integrating and generating Linked Data,
http://stack.lod2.eu/
• e.g., SILK http://silk-framework.com/ (interlinking/object consolidation)
• KARMA (extraction, data integration) http://usc-isi-i2.github.io/karma/
• RapidMiner Linked Data extension
http://dws.informatik.uni-mannheim.de/en/research/rapidminerlodextension/ [Gentile, Paulheim, et al. 2016]
• XSPARQL (extraction from XML and JSON, triplification)
http://sourceforge.net/projects/xsparql/ [Bischof et al. 2012]
• See also: https://ai.wu.ac.at/~polleres/20140826xsparql_st.etienne/
• STTL: A SPARQL-based Transformation Language for RDF
• See also: https://hal.inria.fr/hal-01150623 [Corby et al. 2015]
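As an illustration of how far the basic scripting route gets you, here is a minimal extraction-and-triplification sketch in Python with rdflib. The CSV file name, its columns, and the namespace are hypothetical assumptions; a real pipeline would add provenance, error handling, and a proper mapping language such as R2RML or XSPARQL.

```python
# Minimal extraction + "triplification" sketch: read a (hypothetical) CSV of
# city indicators and emit RDF triples with rdflib.
import csv
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

EX = Namespace("http://example.org/citydata#")   # illustrative namespace

def triplify(csv_path: str) -> Graph:
    g = Graph()
    g.bind("ex", EX)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):           # expects columns: city, area_km2
            city = EX[row["city"].replace(" ", "_")]
            # One triple per indicator column; here only the area in km2.
            g.add((city, EX.area, Literal(row["area_km2"], datatype=XSD.decimal)))
    return g

if __name__ == "__main__":
    print(triplify("cities.csv").serialize(format="turtle"))
```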
25. 25
Outline
• Motivation
• Integrating (Open) Data from different sources
• Not only Linked Data
• Data workflows and Open data in the context of rise of Big Data
• What is a "Data Workflow"?
• Different Views of Data Workflows in the context of the Semantic Web
• Key steps involved
• Tools?
• Data Integration Systems & Query Processing
• Data Integration Systems - GAV vs. LAV
• The Mediator and Wrapper Architecture
• Query rewriting vs. Materialisation
• Challenges:
• How to find Rules and ontologies?
• Incomplete Data
• How to find the data?
31. 31
Data Integration Systems[Lenzerini2002]
• IS=<O,S,M>
• Let O be a set of concepts in a global, virtual schema.
• Let S = {S1,...,Sn} be a set of data sources.
• Let M be a set of mappings between sources in S
and concepts in O.
cf. [Lenzerini 2002]
40. 40
Source Schema
(amFinancial rdf:type rdf:Property).
(euClimate rdf:type rdf:Property).
(tunisRating rdf:type rdf:Property).
(similarFinancial rdf:type rdf:Property).
amFinancial(C,R) provides the financial rating R of an American city C.
euClimate(C,R) provides the climate rating R of a European city C.
tunisRating(T,R) provides the rating R of Tunis for indicator T (climate or financial).
similarFinancial(C1,C2) relates two American cities C1 and C2 that have the same financial
rating.
42. 42
Integration Systems
S = {amFinancial(C,R), euClimate(C,R), tunisRating(T,R), similarFinancial(C1,C2)}
(Figure: the local schema S next to the global schema O, which contains the classes euroCity(C), amCity(C), afCity(C) and the properties grossGDP(C,R) and avgTemp(C,R), both rdfs:subPropertyOf rating(C,R).)
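For later reference, a minimal sketch of this example global schema written down as RDFS with rdflib; the namespace IRI is an illustrative assumption, only the class and property names come from the figure.

```python
# Load the example global schema O as a small RDFS vocabulary.
from rdflib import Graph

GLOBAL_SCHEMA = """
@prefix :     <http://example.org/global#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:euroCity a rdfs:Class . :amCity a rdfs:Class . :afCity a rdfs:Class .
:rating   a rdf:Property .
:grossGDP a rdf:Property ; rdfs:subPropertyOf :rating .
:avgTemp  a rdf:Property ; rdfs:subPropertyOf :rating .
"""

g = Graph()
g.parse(data=GLOBAL_SCHEMA, format="turtle")
print(len(g), "schema triples loaded")
```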
43. 43
Global-as-View (GAV):
• Concepts in the Global Schema (O) are defined in terms of
combinations of Sources (S).
Local-As-View (LAV):
• Sources in S are defined in terms of combinations of Concepts in O.
Global- & Local-As-View (GLAV):
• Combinations of concepts in the Global Schema (O) are defined in terms of
combinations of Sources (S).
Integration Systems
IS=<O,S,M>
49. 49
Query Rewriting GAV
§ A query Q in terms of the global schema elements in O.
§ Problem: Rewrite Q into a query Q’ expressed in sources
in S.
query(C):-grossGDP(C,R), amCity(C)
Example GAV:
α0: amCity(C):-amFinancial(C,R).
α1: grossGDP(C,R):-amFinancial(C,R).
α2: euroCity(C):-euClimate(C,R).
α3: avgTemp(C,R):-euClimate(C,R).
α4: grossGDP(“Tunis”,R):-tunisRating(“financial”,R).
α5: avgTemp(“Tunis”,R):-tunisRating(“climate”,R)
α6: afCity(“Tunis”).
α7: amCity(C1):-similarFinancial(C1,C2).
α8: amCity(C2):-similarFinancial(C1,C2).
α9: grossGDP(C1,R):-similarFinancial(C1,C2), amFinancial(C2,R).
50. 50
query1(C):-amFinancial(C,R),similarFinancial(C,C2).
Query Rewriting GAV
§ A query Q in terms of the global schema elements in O.
§ Problem: Rewrite Q into a query Q’ expressed in sources
in S.
Rewritings
query(C):-grossGDP(C,R), amCity(C)
Example GAV:
α1: grossGDP(C,R):-amFinancial(C,R).
α7: amCity(C1):-similarFinancial(C1,C2).
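To make the unfolding step concrete, here is a minimal GAV rewriting sketch in Python. It is a simplification: it only handles rules with variable head arguments (α1 and α7 above), so constants in rule heads as in α4-α6 and the removal of redundant rewritings are left out.

```python
# GAV unfolding sketch: replace each global atom of the query by the body of a
# matching GAV rule, renaming the rule's own variables apart.
from itertools import product, count

# atoms are (predicate, (args...)); rules are (head_atom, [body_atoms])
ALPHA1 = (("grossGDP", ("C", "R")), [("amFinancial", ("C", "R"))])
ALPHA7 = (("amCity", ("C1",)), [("similarFinancial", ("C1", "C2"))])
GAV_RULES = [ALPHA1, ALPHA7]

QUERY = [("grossGDP", ("C", "R")), ("amCity", ("C",))]

fresh = count()

def unfold_atom(atom):
    """All source conjunctions a single global atom can unfold into."""
    pred, args = atom
    for (hpred, hargs), body in GAV_RULES:
        if hpred != pred or len(hargs) != len(args):
            continue
        subst = dict(zip(hargs, args))     # head variable -> query term
        rename = {}                        # body-only variables -> fresh names
        def term(t):
            if t in subst:
                return subst[t]
            return rename.setdefault(t, f"_V{next(fresh)}")
        yield [(bpred, tuple(term(t) for t in bargs)) for bpred, bargs in body]

def unfold_query(query):
    """Cartesian product of the per-atom unfoldings = GAV rewritings."""
    for combo in product(*(list(unfold_atom(a)) for a in query)):
        yield [atom for part in combo for atom in part]

for i, rewriting in enumerate(unfold_query(QUERY), 1):
    print(f"query{i}(C) :- " + ", ".join(f"{p}({', '.join(a)})" for p, a in rewriting))
```

Running it prints query1(C) :- amFinancial(C, R), similarFinancial(C, _V0), i.e. the rewriting shown on the next slide (with a fresh name in place of C2).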
56. 56
query1(C):-amFinancial(C,R),similarFinancial(C,C2).
Query Rewriting GAV
§ A query Q in terms of the global schema elements in O.
§ Problem: Rewrite Q into a query Q’ expressed in sources
in S.
Rewritings
query(C):-grossGDP(C,R), amCity(C)
Example GAV:
query2(C):-similarFinancial(C,C2), amFinancial(C2,R),
similarFinancial(C,R1).
α9:grossGDP(C1,R):-similarFinancial(C1,C2),
amFinancial(C2,R).
α7: amCity(C1):-similarFinancial(C1,C2).
57. 57
When to use GAV
• Query rewriting is simpler (polynomial time in the size of the query).
• Sources do not change, and the global schema can change over time.
58. 58
Lower Bounds for the Space of Query
Rewritings
• For CQs and OWL 2 QL ontologies [Gottlob14]:
• Exponential and superpolynomial lower bounds on the
size of pure rewritings.
• Polynomial-size rewritings under some restrictions.
[Gottlob14]
Georg Gottlob, Stanislav Kikot, Roman Kontchakov, Vladimir V. Podolskii,
Thomas Schwentick, Michael Zakharyaschev: The price of query
rewriting in ontology-based data access. Artif. Intell. 213: 42-59 (2014)
59. 59
Global-as-View (GAV):
• Concepts in the Global Schema (O) are defined in terms of
combinations of Sources (S).
Local-As-View (LAV):
• Sources in S are defined in terms of combinations of Concepts in O.
Global- & Local-As-View (GLAV):
• Combinations of concepts in the Global Schema (O) are defined in terms of
combinations of Sources (S).
Integration Systems
IS=<O,S,M>
62. 62
Local As View-Query Rewriting
query(X1,X5):-C1(X1,X2),C2(X2,X3),C3(X3,X4),C4(X4,X5)
S1(X1,X2,X3):-C1(X1,X2),C2(X2,X3).
S2(X3,X4,X5):-C3(X3,X4),C4(X4,X5).
S3(X2,X3,X4):-C2(X2,X3),C3(X3,X4).
S4(X1,X2):-C1(X1,X2).
63. 63
Local As View-Query Rewriting
query(X1,X5):-C1(X1,X2),C2(X2,X3),C3(X3,X4),C4(X4,X5)
S1(X1,X2,X3):-C1(X1,X2),C2(X2,X3).
S2(X3,X4,X5):-C3(X3,X4),C4(X4,X5).
S3(X2,X3,X4):-C2(X2,X3),C3(X3,X4).
S4(X1,X2):-C1(X1,X2).
query’(X1,X5):-C1(X1,X2),C2(X2,X3),C3(X3,X4),C4(X4,X5)
Rewriting 1 (covers all query atoms): S1(X1,X2,X3), S2(X3,X4,X5)
64. 64
Local As View-Query Rewriting
query(X1,X5):-C1(X1,X2),C2(X2,X3),C3(X3,X4),C4(X4,X5)
S1(X1,X2,X3):-C1(X1,X2),C2(X2,X3).
S2(X3,X4,X5):-C3(X3,X4),C4(X4,X5).
S3(X2,X3,X4):-C2(X2,X3),C3(X3,X4).
S4(X1,X2):-C1(X1,X2).
query’(X1,X5):-C1(X1,X2),C2(X2,X3),C3(X3,X4),C4(X4,X5)
Rewriting 1: S1(X1,X2,X3), S2(X3,X4,X5)
query’’(X1,X5):-C1(X1,X2),C2(X2,X3),C3(X3,X4),C4(X4,X5)
Rewriting 2: S4(X1,X2), S3(X2,X3,X4), S2(X3,X4,X5)
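A simplified sketch of how such LAV rewritings can be enumerated with a bucket-style algorithm: for every query subgoal, collect the views whose definitions mention its predicate, then combine one view per subgoal. A real rewriter (Bucket, MiniCon) additionally checks that join variables are exported by the views and that each candidate's expansion is contained in the query; both checks are omitted here, so the output over-approximates the valid rewritings.

```python
# Bucket-style enumeration of candidate LAV rewritings for the example above.
from itertools import product

VIEWS = {  # view name -> predicates used in its definition
    "S1": {"C1", "C2"},
    "S2": {"C3", "C4"},
    "S3": {"C2", "C3"},
    "S4": {"C1"},
}
QUERY_SUBGOALS = ["C1", "C2", "C3", "C4"]

# Bucket for each subgoal: the views that could cover it.
buckets = {g: [v for v, preds in VIEWS.items() if g in preds]
           for g in QUERY_SUBGOALS}
print("buckets:", buckets)

# One view per subgoal; duplicate view uses are collapsed.
candidates = set()
for combo in product(*(buckets[g] for g in QUERY_SUBGOALS)):
    candidates.add(tuple(sorted(set(combo))))

for cand in sorted(candidates):
    print("candidate rewriting over views:", ", ".join(cand))
```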
65. 65
Query Rewriting
DB is a virtual database containing the instances of the
elements in O.
Query Containment: Q’ ⊆ Q ⟺ ∀DB: Q’(DB) ⊆ Q(DB)
query1(C):- amFinancial(C,R),similarFinancial(C,C2). ⊆ query(C):-grossGDP(C,R), amCity(C)
(Figure: the answers to query1 over the source database (Washington, NYC, Miami, Caracas, Lima) are a subset of the answers to query over the virtual database (Washington, NYC, Miami, Caracas, Lima, Bogota, College Park, Quito, ...).)
68. 68
Time Complexity
Checking whether there is a valid rewriting R of Q with at
most the same number of goals as Q is an NP-complete
problem.
Levy, A.; Mendelzon, A.; Sagiv, Y.; and Srivastava, D. 1995. Answering
queries using views. In Proc. of PODS, 95–104.
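The containment test itself can be implemented with the classical canonical-database (homomorphism) construction: freeze the candidate query into a database and evaluate the other query over it. The following is a generic, self-contained sketch for conjunctive queries without constants; it is the Q' ⊆ Q test used inside the rewriting algorithms, not the full view-rewriting check of [Levy et al. 1995].

```python
# Conjunctive-query containment via the canonical database.
def contained_in(q_sub, q_sup):
    """True iff q_sub ⊆ q_sup; queries are (head_vars, body_atoms), variables only."""
    head_sub, body_sub = q_sub
    head_sup, body_sup = q_sup
    # Freeze q_sub: every variable becomes a unique constant.
    frozen = {v: f"c_{v}" for atom in body_sub for v in atom[1]}
    db = {(p, tuple(frozen[v] for v in args)) for p, args in body_sub}
    target = tuple(frozen[v] for v in head_sub)

    def evaluate(atoms, subst):
        # Naive backtracking evaluation of q_sup over the canonical database.
        if not atoms:
            yield tuple(subst[v] for v in head_sup)
            return
        (pred, args), rest = atoms[0], atoms[1:]
        for dpred, dargs in db:
            if dpred != pred or len(dargs) != len(args):
                continue
            new = dict(subst)
            if all(new.setdefault(v, c) == c for v, c in zip(args, dargs)):
                yield from evaluate(rest, new)

    return target in set(evaluate(list(body_sup), {}))

# Example: q1(X) :- p(X,Y), p(Y,Z)   is contained in   q2(X) :- p(X,Y)
q1 = (("X",), [("p", ("X", "Y")), ("p", ("Y", "Z"))])
q2 = (("X",), [("p", ("X", "Y"))])
print(contained_in(q1, q2))  # True
```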
70. 70
When to use LAV
• A GAV catalog cannot be easily adapted to changes in the data sources.
• LAV views can be easily adapted to changes in the data sources.
• Data sources can be easily described.
71. 71
Global-as-View (GAV):
• Concepts in the Global Schema (O) are defined in terms of
combinations of Sources (S).
Local-As-View (LAV):
• Sources in S are defined in terms of combinations of Concepts in O.
Global- & Local-As-View (GLAV):
• Combinations of concepts in the Global Schema (O) are defined in terms of
combinations of Sources (S).
Integration Systems
IS=<O,S,M>
74. 74
Query Rewriting
DB is a virtual database containing the instances of the
elements in O.
Query Containment: Q’ ⊆ Q ⟺ ∀DB: Q’(DB) ⊆ Q(DB)
query1(C):-amFinancial(C,R),similarFinancial(C,C2). ⊆ query(C):-grossGDP(C,R), amCity(C)
75. 75
When to use GLAV
• A GLAV catalog cannot be easily adapted to changes in the data sources.
• Data sources can be easily described.
76. 76
The Mediator and Wrapper Architecture [Wiederhold92]
(Figure: a data integration system: a mediator with its catalog answers a query over wrappers for the sources amFinancial(C,R), euClimate(C,R), similarFinancial(C1,C2), and tunisRating(T,R).)
[Wiederhold92] Gio Wiederhold: Mediators in the Architecture of Future Information Systems. IEEE Computer 25(3): 38-49 (1992)
77. 77
The Mediator and Wrapper Architecture [Wiederhold92]
(Figure: the same mediator/wrapper architecture, highlighting the wrappers for the sources amFinancial(C,R), euClimate(C,R), similarFinancial(C1,C2), and tunisRating(T,R).)
[Wiederhold92] Gio Wiederhold: Mediators in the Architecture of Future Information Systems. IEEE Computer 25(3): 38-49 (1992)
Wrappers:
• Software components specific to each type of data source.
• Export a uniform schema for heterogeneous sources.
78. 78
e.g. RDB2RDF Systems
Wrappers in the context of RDF data: transformation rules (e.g., R2RML) map relational data to RDF.
Cf. the R2RML W3C standard: http://www.w3.org/TR/r2rml/, see also [Priyatna et al. 2014]
Ultrawrap http://capsenta.com/ultrawrap/ [Sequeda & Miranker 2013],
D2RQ http://d2rq.org/
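To illustrate what such a wrapper does, here is a hand-rolled sketch (not the API of Ultrawrap, D2RQ, or an R2RML engine) that exposes rows of a relational table as RDF triples using sqlite3 and rdflib; the table, column, and namespace names are made up.

```python
# Hand-rolled RDB-to-RDF wrapper sketch: relational rows become RDF triples.
import sqlite3
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/src#")   # illustrative namespace

def wrap_am_financial(conn: sqlite3.Connection) -> Graph:
    g = Graph()
    for city, rating in conn.execute("SELECT city, rating FROM amFinancial"):
        subj = EX[city.replace(" ", "_")]
        g.add((subj, EX.financialRating, Literal(rating)))
    return g

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE amFinancial (city TEXT, rating REAL)")
conn.execute("INSERT INTO amFinancial VALUES ('NYC', 4.2)")
print(wrap_am_financial(conn).serialize(format="turtle"))
```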
79. 79
The Mediator and Wrapper Architecture [Wiederhold92]
(Figure: the same mediator/wrapper architecture, highlighting the mediator and its catalog on top of the wrappers for amFinancial(C,R), euClimate(C,R), similarFinancial(C1,C2), and tunisRating(T,R).)
[Wiederhold92] Gio Wiederhold: Mediators in the Architecture of Future Information Systems. IEEE Computer 25(3): 38-49 (1992)
Mediators:
•Export a unified schema.
•Query Decomposition.
•Identify relevant sources for each
query.
•Generate query execution plans.
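A toy sketch of the mediator side, assuming the wrappers are simple Python callables returning tuples in the unified schema: the mediator executes the GAV rewriting query1(C) :- amFinancial(C,R), similarFinancial(C,C2) by joining the wrapper results. The data values are invented for illustration.

```python
# Toy mediator over two wrappers: join their results and project the answer.
def am_financial():       # wrapper 1: (city, financial rating)
    return [("NYC", 4.2), ("Miami", 3.1)]

def similar_financial():  # wrapper 2: (city1, city2) with the same rating
    return [("NYC", "Washington"), ("Lima", "Caracas")]

def mediator_query():
    """Join amFinancial(C,R) with similarFinancial(C,C2) on C, project C."""
    answers = set()
    for c, _r in am_financial():
        for c1, _c2 in similar_financial():
            if c == c1:
                answers.add(c)
    return answers

print(mediator_query())  # {'NYC'}
```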
80. 80
Some recent works which implement Wiederhold’s
mediator/wrapper architecture in the SW:
Linked Data-Fu [Stadtmüller et al. 2013]
SemLAV [Montoya et al. 2014]
… both LAV-inspired.
82. 82
Data Warehouse-Materialized Global Schema
(Figure: a data warehouse engine with a catalog of GLAV rules; ETL processes load the sources amFinancial(C1,R), similarFinancial(C1,C2), euClimate(C,R), and tunisIndicator(T,R) into a materialized global schema with classes amCity(C), euCity(C), afCity(C) and properties financial(C,R) and climate(C,R), both rdfs:subPropertyOf rating(C,R).)
Example of a GLAV rule:
α0: amFinancial(C1,R), similarFinancial(C1,C2) :-
amCity(C1), amCity(C2), grossGDP(C1,R), grossGDP(C2,R).
83. 83
Materialized versus Virtual Access
• The Mediator and Wrapper Architecture requires accessing remote data
sources on the fly.
• Materialized data can be accessed locally; convenient whenever the data is static.
85. 85
What is the role of Ontologies in Data
Workflows/Data Integration Systems?
86. 86
Linked Data integration using ontologies:
• Also popular under the term Ontology-Based Data Access
(OBDA) [Kontchakov et al. 2013]:
• Typically considers a relational DB, mappings (rules), and an ontology
TBox (typically OWL QL (DL-Lite) or OWL RL (rules)).
(Figure: the OBDA setup: a SPARQL query Q is posed against an OWL/RDFS ontology O, which is connected via RDB2RDF mappings (Datalog) to an RDBMS queried in SQL.)
87. 87
Linked Data integration using ontologies:
• Also popular under the term Ontology-Based Data Access
(OBDA) [Kontchakov et al. 2013]:
• Typically considers a relational DB, mappings (rules), and an ontology
TBox (typically OWL QL (DL-Lite) or OWL RL (rules)).
• For simplicity, let's leave out the relational DB part,
assuming the data is already in RDF...
(Figure: the simplified setup: a SPARQL query Q is posed against an OWL/RDFS ontology O on top of an RDF triple store accessed via RDF/SPARQL.)
88. 88
Linked Data integration using
ontologies (example)
"Places with a Population Density below 5000/km2"?
90. 90
A concrete use case:
The "City Data Pipeline"
City Data Model: an extensible ALH(D) ontology covering provenance, temporal information, spatial context, and indicators (e.g., area in km2, tons CO2 per capita):
dbo:PopulatedPlace rdfs:subClassOf :Place .
dbo:populationDensity rdfs:subPropertyOf :populationDensity .
eurostat:City rdfs:subClassOf :Place .
eurostat:popDens rdfs:subPropertyOf :populationDensity .
dbpedia:areakm rdfs:subPropertyOf :area .
eurostat:area rdfs:subPropertyOf :area .
91. 91
A concrete use case:
The "City Data Pipeline"
City Data Model: an extensible ALH(D) ontology covering provenance, temporal information, spatial context, and indicators (e.g., area in km2, tons CO2 per capita).
(Figure: the same mappings drawn as a diagram: dbo:areakm and eurostat:area map to :area; dbo:PopulatedPlace and eurostat:City map to :Place; dbo:populationDensity and eurostat:popDens map to :populationDensity.)
92. 92
A concrete use case:
The "City Data Pipeline"
City Data Model: an extensible ALH(D) ontology covering provenance, temporal information, spatial context, and indicators (e.g., area in km2, tons CO2 per capita).
The same mappings written as rules:
:area(X,Y) ← dbo:areakm(X,Y)
:area(X,Y) ← eurostat:area(X,Y)
:Place(X) ← dbo:PopulatedPlace(X)
:populationDensity(X,Y) ← dbo:populationDensity(X,Y)
:Place(X) ← eurostat:City(X)
:populationDensity(X,Y) ← eurostat:popDens(X,Y)
93. 93
A concrete use case:
The "City Data Pipeline"
:area(X,Y) ← dbo:areakm(X,Y)
:area(X,Y) ← eurostat:area(X,Y)
:Place(X) ← dbo:PopulatedPlace(X)
:populationDensity(X,Y) ← dbo:populationDensity(X,Y)
:Place(X) ← eurostat:City(X)
:populationDensity(X,Y) ← eurostat:popDens(X,Y)
"Places with a Population Density below 5000/km2"?
SELECT ?X WHERE { ?X a :Place . ?X :populationDensity ?Y .
FILTER(?Y < 5000) }
94. 94
Approach 1: Materialization
(input: triple store + Ontology
output: materialized triple store)
:area(X,Y) ← dbo:areakm(X,Y)
:area(X,Y) ← eurostat:area(X,Y)
:Place(X) ← dbo:PopulatedPlace(X)
:populationDensity(X,Y) ← dbo:populationDensity(X,Y)
:Place(X) ← eurostat:City(X)
:populationDensity(X,Y) ← eurostat:popDens(X,Y)
:Vienna a dbo:PopulatedPlace .
:Vienna dbo:populationDensity 4326.1 .
:Vienna dbo:areaKm 414.65 .
:Vienna dbo:populationTotal 1805681 .
:Vienna a :Place .
:Vienna :populationDensity 4326.1 .
:Vienna :area 414.65 .
SELECT ?X WHERE { ?X a :Place . ?X :populationDensity ?Y .
FILTER(?Y < 5000) }
• RDF triple stores implement it natively (OWLIM, Jena Rules, Sesame)
• Can handle a large part of OWL [Krötzsch 2012, Glimm et al. 2012]
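A minimal sketch of Approach 1 with rdflib: the rdfs:subClassOf / rdfs:subPropertyOf mappings are forward-chained over the Vienna triples so that the plain query over :Place and :populationDensity succeeds. The namespace IRIs are illustrative assumptions; production systems would rely on the RDFS/OWL-RL materialization built into the triple store rather than this hand-written loop.

```python
# Materialize RDFS subclass/subproperty inferences, then query the result.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

DBO = Namespace("http://dbpedia.org/ontology/")
CD  = Namespace("http://citydata.wu.ac.at/ns#")   # assumed namespace for ":"
EX  = Namespace("http://example.org/")

g = Graph()
g.add((DBO.PopulatedPlace, RDFS.subClassOf, CD.Place))
g.add((DBO.populationDensity, RDFS.subPropertyOf, CD.populationDensity))
g.add((EX.Vienna, RDF.type, DBO.PopulatedPlace))
g.add((EX.Vienna, DBO.populationDensity, Literal(4326.1)))

# Forward-chain subclass/subproperty once (enough here: no axiom chains).
for c1, _, c2 in list(g.triples((None, RDFS.subClassOf, None))):
    for s in list(g.subjects(RDF.type, c1)):
        g.add((s, RDF.type, c2))
for p1, _, p2 in list(g.triples((None, RDFS.subPropertyOf, None))):
    for s, o in list(g.subject_objects(p1)):
        g.add((s, p2, o))

q = """
PREFIX cd: <http://citydata.wu.ac.at/ns#>
SELECT ?x WHERE { ?x a cd:Place . ?x cd:populationDensity ?y . FILTER(?y < 5000) }
"""
print([str(row.x) for row in g.query(q)])   # ['http://example.org/Vienna']
```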
95. 95
Approach 2: Query rewriting
(input: conjunctive query (CQ) + Ontology
output: UCQ)
:area(X,Y) ← dbo:areakm(X,Y)
:area(X,Y) ← eurostat:area(X,Y)
:Place(X) ← dbo:PopulatedPlace(X)
:populationDensity(X,Y) ← dbo:populationDensity(X,Y)
:Place(X) ← eurostat:City(X)
:populationDensity(X,Y) ← eurostat:popDens(X,Y)
:Vienna a dbo:PopulatedPlace .
:Vienna dbo:populationDensity 4326.1 .
:Vienna dbo:areaKm 414.65 .
:Vienna dbo:populationTotal 1805681 .
SELECT ?X WHERE { ?X a :Place . ?X :populationDensity ?Y .
FILTER(?Y < 5000) }
SELECT ?X WHERE { { {?X a :Place . ?X :populationDensity ?Y . }
UNION {?X a dbo:PopulatedPlace . ?X :populationDensity ?Y . }
UNION {?X a :Place . ?X dbo:populationDensity ?Y . }
UNION {?X a dbo:PopulatedPlace . ?X dbo:populationDensity ?Y . }
... }
FILTER(?Y < 5000) }
96. 96
SELECT ?X WHERE { ?X a :Place . ?X :populationDensity ?Y .
FILTER(?Y < 5000) }
• Observation: essentially, GAV-style rewriting
• Can handle a large part of OWL (corresponding to DL-Lite [Calvanese et al.
2007]): OWL 2 QL
• Query-rewriting-based tools and systems are available, with many optimizations over
naive rewritings, e.g., taking into account mappings to a DB:
• REQUIEM [Perez-Urbina et al., 2009]
• Quest [Rodriguez-Muro, et al. 2012]
• ONTOP [Rodriguez-Muro, et al. 2013]
• Mastro [Calvanese et al. 2011]
• Presto [Rosati et al. 2010]
• KYRIE2 [Mora & Corcho, 2014]
• Rewriting vs. Materialization – tradeoff: [Sequeda et al. 2014]
• OBDA is a booming field of research!
Approach 2: Query rewriting
(input: conjunctive query (CQ) + Ontology
output: UCQ)
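A naive sketch of the GAV-style rewriting step itself: each class/property of the query is expanded with its known sub-classes/sub-properties (the mappings of the example ontology, written here as plain Python dictionaries, so the prefixes are only meaningful within the generated query text), and the result is emitted as a UNION of basic graph patterns. Real OBDA systems such as Ontop or Mastro apply many optimizations on top of this idea.

```python
# Generate the UNION-of-BGPs rewriting of the population-density query.
from itertools import product

SUBCLASSES = {":Place": [":Place", "dbo:PopulatedPlace", "eurostat:City"]}
SUBPROPS   = {":populationDensity":
              [":populationDensity", "dbo:populationDensity", "eurostat:popDens"]}

def rewrite(class_term, prop_term):
    branches = []
    for c, p in product(SUBCLASSES[class_term], SUBPROPS[prop_term]):
        branches.append(f"{{ ?X a {c} . ?X {p} ?Y . }}")
    return ("SELECT ?X WHERE {\n  "
            + "\n  UNION ".join(branches)
            + "\n  FILTER(?Y < 5000)\n}")

print(rewrite(":Place", ":populationDensity"))
```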
97. 97
Where to find suitable ontologies?
Ok, so where do I find suitable
ontologies?
(Figure: the OBDA setup again: a SPARQL query Q over an OWL/RDFS ontology O on top of an RDF triple store accessed via RDF/SPARQL.)
98. 98
Ontologies and mapping between Linked Data
Vocabularies
• Good Starting points: Linked Open Vocabularies
http://lov.okfn.org/dataset/lov/
• Still, probably a lot of manual mapping...
• Literature search for suitable ontologies → don't re-invent
the wheel, re-use where possible
• Crawl
• Ontology learning, i.e. learn mappings?
• e.g. using Ontology matching [Shvaiko&Euzenat, 2013]
99. 99
Specific Steps (non-exhaustive, overlapping!)
• Extraction
• Inconsistency handling
• Incompleteness handling (sometimes called "Enrichment",
sometimes imputation of missing values...)
• Data Integration (alignment, source reconciliation)
• Aggregation
• Cleansing (removing outliers)
• Deduplication/Interlinking (could involve "triplification")
• Analytics
• Enrichment
• Change detection (Maintenance/Evolution)
• Validation (quality analysis)
• Efficient, sometimes distributed (query) processing
• Visualization
Tools and current approaches only partially support the different parts of these
steps.... Bad news: there is no "one-size-fits-all" solution.
Recall that slide from the beginning? What did we actually cover, and where could Semantic Web techniques help?
101. 101
q(PD) ← (S, popDensity, PD0), (S, area, A0), (S, area, A), PD := P/A, P := PD0 * A0
(S, popDensity, PD) ← (S, population, P), (S, area, A), PD := P/A, A ≠ 0.
(S, area, A) ← (S, population, P), (S, popDensity, PD), A := P/PD, PD ≠ 0.
(S, population, P) ← (S, area, A), (S, popDensity, PD), P := A * PD.
• [Bischof&Polleres 2013] Basic Idea: Consider
clausal form of all variants of equations and use
Query rewriting with "blocking":
:Bologna dbo:population 386298 .
:Bologna dbo:areaKm 140.7 .
SELECT ?PD WHERE { :Bologna dbo:popDensity ?PD}
q(PD) ← (S, popDensity, PD)
q(PD) ← (S, population, P), (S, area, A), PD := P/A
… infinite expansion even if only 1 equation is considered.
Solution: “blocking” recursive expansion of the same equation for the same value.
SELECT ?PD WHERE { { :Bologna dbo:popDensity ?PD }
UNION
{ :Bologna dbo:population ?P ; dbo:area ?A .
BIND (?P/?A AS ?PD) }
}
Finally, the resulting UCQs with
assignments can be rewritten back to
SPARQL using BIND
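A small sketch of the blocking idea under simplifying assumptions: the rearranged density/population/area equations are given as a table, and an indicator is expanded either to its stored value or via an equation, but never re-using an equation that is already on the current expansion path. This keeps the expansion finite; the full approach of [Bischof & Polleres 2013] additionally tracks the arithmetic and rewrites the result back to SPARQL BIND clauses, which is omitted here.

```python
# Blocked recursive expansion of indicator equations (finite by construction).
from itertools import product

EQUATIONS = {
    # derived indicator: the indicators it can be computed from
    "popDensity": ["population", "area"],        # PD := P / A
    "population": ["area", "popDensity"],        # P  := A * PD
    "area":       ["population", "popDensity"],  # A  := P / PD
}

def expansions(indicator, blocked=frozenset()):
    """All ways to obtain `indicator`: stored directly, or via one unblocked equation."""
    yield [indicator]                            # the value is stored as such
    if indicator in EQUATIONS and indicator not in blocked:
        inputs = EQUATIONS[indicator]
        sub = [list(expansions(i, blocked | {indicator})) for i in inputs]
        for combo in product(*sub):              # one expansion per input
            yield [x for part in combo for x in part]

for branch in expansions("popDensity"):
    print("UNION branch needs stored indicators:", branch)
```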
102. 102
A concrete use case:
The "City Data Pipeline"
(Figure: the City Data Pipeline ontology modules again: provenance, temporal information, spatial context, and indicators such as area in km2 and tons CO2 per capita.)
Ok, so where do I find these
equations?
105. 105
City Data Model: an extensible ALH(D) ontology covering provenance, temporal information, spatial context, and indicators (e.g., area in km2, tons CO2 per capita).
Hmmm... Still a lot of work to do, e.g. adding aggregates for statistical data (Eurostat, RDF Data Cube Vocabulary)... cf. [Kämpgen, 2014, PhD Thesis]
:avgIncome per country is the population-weighted average income of all its provinces.
Hmmm... we actually need Claudia!
But Eurostat data is incomplete... I don't have the avg. income for all provinces or countries in the EU!
Incompleteness Handling:
Are RDFS and OWL and equations enough?
106. • Individual datasets (e.g. from Eurostat) have lots of missing values
• Merging together datasets with different indicators/cities adds sparsity
Challenges – Missing values [Bischof et al. 2015]
Integrated Open Data is (too?) sparse
(Figure: cities × indicators matrices with 51% and 97% of the values missing.)
We don't get very far here with equations… Let's try Data Mining/ML!
107. Missing Values – Hybrid approach:
choose the best imputation method per indicator [Bischof et al. 2015]
• Our assumption: every indicator has its own distribution and relationship to the others.
• Basket of "standard" regression methods:
• k-Nearest-Neighbour Regression (KNN)
• Multiple Linear Regression (MLR)
• Random Forest Decision Trees (RFD)
• Pick the "best" method per indicator; validation: 10-fold cross-validation
However: many/most machine learning methods need more or less complete training data!
More trickery needed, cf. e.g. [Bischof et al. 2015] … or ask Claudia :-)
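A sketch of that hybrid selection with scikit-learn, assuming the integrated data sits in a pandas DataFrame of cities × indicators (hypothetical variable `table`): the three regressors are compared with 10-fold cross-validation per indicator, and the best-scoring one is kept. As stated above, this only works on the (small) complete part of the table; the additional tricks of [Bischof et al. 2015] are not reproduced here.

```python
# Pick the best imputation (regression) method per indicator via cross-validation.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

METHODS = {
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "MLR": LinearRegression(),
    "RFD": RandomForestRegressor(n_estimators=100, random_state=0),
}

def best_method(table: pd.DataFrame, indicator: str):
    """Return (name, model) of the best-scoring regressor for one indicator."""
    complete = table.dropna()                    # needs (near-)complete rows
    X = complete.drop(columns=[indicator])
    y = complete[indicator]
    scores = {name: cross_val_score(model, X, y, cv=10).mean()
              for name, model in METHODS.items()}
    name = max(scores, key=scores.get)
    return name, METHODS[name]
```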
108. 108
Specific Steps (non-exhaustive, overlapping!)
• Extraction
• Inconsistency handling
• Incompleteness handling (sometimes called "Enrichment",
sometimes imputation of missing values...)
• Data Integration (alignment, source reconciliation)
• Aggregation
• Cleansing (removing outliers)
• Deduplication/Interlinking (could involve "triplification")
• Analytics
• Enrichment
• Change detection (Maintenance/Evolution)
• Validation (quality analysis)
• Efficient, sometimes distributed (query) processing
• Visualization
Tools and current approaches only partially support the different parts of these
steps.... Bad news: there is no "one-size-fits-all" solution.
Last but not least… really don't forget the basic steps, e.g.:
110. 110
Ambiguities/inconsistencies also affected some older versions of our
City Data Pipeline:
• This example on the right
was due to naïve object
consolidation/deduplication,
BUT:
• Open Data is often
incomparable/inconsistent in
itself (e.g. across years the
method of data collection
might change)
→ inconsistencies across and
within datasets are common
111. 111
A concrete use case:
The "City Data Pipeline"
Idea – a "classic" Semantic Web use case!
• Regularly integrate various relevant Open Data sources
(e.g. eurostat, UNData, ...)
• Make integrated data available for re-use:
(How) can ontologies help me?
• Are ontology languages expressive enough?
• Which ontologies could I (re-)use?
• Is there enough data at all?
• Where to find the right data?
• Where to find the right ontologies?
• How to tackle inconsistencies?
citydata.wu.ac.at
Where to find the right data?
112. 112
Where to find the data?
• Bad news:
• Finding suitable ontologies to map data sources to is not
the only challenge:
• Foremost… even before a Data workflow starts, a main
challenge is to find the right Datasets/Resources
• Semantic Web search engines... failed? :-(
• https://www.w3.org/wiki/Search_engines
• ... The obvious entry point:
• Open Data portals
• Still quite messy cf. http://data.wu.ac.at/portalwatch/
• Different formats, encodings, metadata of varying quality
• No proper Search!
• … but again: Semantic Web Technologies could help here!
No reason not to try again and succeed this time! :-)
114. 114
How to search in/for Open Data?
https://www.youtube.com/watch?v=kCAymmbYIvc
vs.
Compared to Web (Table) search...
a) This looks like a slightly different problem...
b) Can linking to "Open" knowledge graphs help?
(wikidata, dbpedia?) ... Probably.
Cf. work on structured data in Web search by Alon Halevy
... BTW: Google seems to have partially given up on it.
→ Some more recent work in a SW & Open Data context:
[Neumaier et al., 2015+2016] [Ramnandan et al. 2015]
cf. also mini-projects!
117. 117
Conclusions
(Figure: the mediator/wrapper data integration system again: a mediator with its catalog answers queries over wrappers for amFinancial(C,R), euClimate(C,R), similarFinancial(C1,C2), and tunisRating(T,R).)
[Wiederhold92] Gio Wiederhold: Mediators in the Architecture of Future Information Systems. IEEE Computer 25(3): 38-49 (1992)
118. 118
Integration Systems
S = {amFinancial(C,R), euClimate(C,R), tunisRating(T,R), similarFinancial(C1,C2)}
(Figure: the local schema S and the global schema O (classes euroCity(C), amCity(C), afCity(C); properties grossGDP(C,R) and avgTemp(C,R), both rdfs:subPropertyOf rating(C,R)), connected by GAV, LAV, or GLAV mappings.)
119. 119
Take-home messages:
• Semantic Web technologies help in Open Data Integration
workflows and can add flexibility
• It's worthwhile to consider traditional "Data Integration"
approaches & literature AND more recent work on OBDA
• Non-clean data requires statistics & machine learning
(outlier detection, imputing missing values, resolving
inconsistencies, etc.)
• Despite being 15 years into the Semantic Web, "finding the right data"
remains a major challenge!
Many Thanks!
Questions
121. 121
References 1
• [Polleres 2013] Axel Polleres. Tutorial "OWL vs. Linked Data: Experiences and Directions". OWLED 2013.
http://polleres.net/presentations/20130527OWLED2013_Invited_talk.pdf
• [Polleres et al. 2013] Axel Polleres, Aidan Hogan, Renaud Delbru, Jürgen Umbrich:
RDFS and OWL Reasoning for Linked Data. Reasoning Web 2013: 91-149
• [Golfarelli, Rizzi, 2009] Matteo Golfarelli, Stefano Rizzi. Data Warehouse Design: Modern Principles and
Methodologies. McGraw-Hill, 2009.
• [Lenzerini2002] Maurizio Lenzerini: Data Integration: A Theoretical Perspective. PODS 2002: 233-246
• [Auer et al. 2012] Sören Auer, Lorenz Bühmann, Christian Dirschl, Orri Erling, Michael Hausenblas, Robert Isele,
Jens Lehmann, Michael Martin, Pablo N. Mendes, Bert Van Nuffelen, Claus Stadler, Sebastian Tramp, Hugh Williams:
Managing the Life-Cycle of Linked Data with the LOD2 Stack. International Semantic Web Conference (2) 2012: 1-16,
see also http://stack.lod2.eu/
• [Taheriyan et al. 2012] Mohsen Taheriyan, Craig A. Knoblock, Pedro A. Szekely, José Luis Ambite: Rapidly
Integrating Services into the Linked Data Cloud. International Semantic Web Conference (1) 2012: 559-574
• [Gentile et al. 2016] Anna Lisa Gentile, Sabrina Kirstein, Heiko Paulheim, Christian Bizer: Extending
RapidMiner with Data Search and Integration Capabilities
• [Bischof et al. 2012] Stefan Bischof, Stefan Decker, Thomas Krennwallner, Nuno Lopes, Axel Polleres:
Mapping between RDF and XML with XSPARQL. J. Data Semantics 1(3): 147-185 (2012)
• [Corby et al. 2015] Olivier Corby, Catherine Faron-Zucker, Fabien Gandon:
A Generic RDF Transformation Software and Its Application to an Online Translation Service for Common
Languages of Linked Data. International Semantic Web Conference (2) 2015: 150-165
• [Nonaka & Takeuchi, 1995] "The Knowledge-Creating Company - How Japanese Companies Create the Dynamics of
Innovation" (Nonaka, Takeuchi, New York/Oxford 1995)
• [Bischof et al. 2015] Stefan Bischof, Christoph Martin, Axel Polleres, Patrik Schneider:
Collecting, Integrating, Enriching and Republishing Open City Data as Linked Data. International Semantic Web
Conference (2) 2015: 57-75
122. 122
References 2
• [Doan et al. 2012] AnHai Doan, Alon Y. Halevy, Zachary G. Ives:
Principles of Data Integration. Morgan Kaufmann 2012, ISBN 978-0-12-416044-6, pp. I-XVIII, 1-497
• [Levy & Rajaraman & Ullman 1996] Alon Y. Levy, Anand Rajaraman, Jeffrey D. Ullman: Answering Queries
Using Limited External Processors. PODS 1996: 227-237
• [Duscka & Genesereth 1997]
• [Pottinger & Halevy 2001] Rachel Pottinger, Alon Y. Halevy: MiniCon: A scalable algorithm for answering
queries using views. VLDB J. 10(2-3): 182-198 (2001)
• [Arvelo & Bonet & Vidal 2006] Yolifé Arvelo, Blai Bonet, Maria-Esther Vidal: Compilation of Query-Rewriting
Problems into Tractable Fragments of Propositional Logic. AAAI 2006: 225-230
• [Konstantinidis & Ambite, 2011] George Konstantinidis, José Luis Ambite: Scalable query rewriting: a graph-
based approach. SIGMOD Conference 2011: 97-108
• [Izquierdo & Vidal & Bonet 2011] Daniel Izquierdo, Maria-Esther Vidal, Blai Bonet: An Expressive and Efficient
Solution to the Service Selection Problem. International Semantic Web Conference (1) 2010: 386-401
• [Wiederhold92] Gio Wiederhold: Mediators in the Architecture of Future Information Systems. IEEE Computer 25(3): 38-49 (1992)
• [Stadtmüller et al. 2013] Steffen Stadtmüller, Sebastian Speiser, Andreas Harth, Rudi Studer: Data-Fu: a
language and an interpreter for interaction with read/write linked data. WWW 2013: 1225-1236
• [Montoya et al. 2014] Gabriela Montoya, Luis Daniel Ibáñez, Hala Skaf-Molli, Pascal Molli, Maria-Esther Vidal.
SemLAV: Local-As-View Mediation for SPARQL Queries. T. Large-Scale Data- and Knowledge-Centered
Systems 13: 33-58 (2014).
123. 123
References 3
• [Priyatna et al. 2014] Freddy Priyatna, Óscar Corcho, Juan Sequeda:
Formalisation and experiences of R2RML-based SPARQL to SQL query translation using morph. WWW 2014: 479-490
• [Sequeda & Miranker 2013] Juan Sequeda, Daniel P. Miranker: Ultrawrap: SPARQL execution on relational data.
J. Web Sem. 22: 19-39 (2013)
• [Krötzsch 2012] Markus Krötzsch: OWL 2 Profiles: An Introduction to Lightweight Ontology Languages. Reasoning
Web 2012: 112-183
• [Glimm et al. 2012] Birte Glimm, Aidan Hogan, Markus Krötzsch, Axel Polleres: OWL: Yet to arrive on the Web of
Data? LDOW 2012
• [Kontchakov et al. 2013] Roman Kontchakov, Mariano Rodriguez-Muro, Michael Zakharyaschev: Ontology-Based
Data Access with Databases: A Short Course. Reasoning Web 2013: 194-229
• [Calvanese et al. 2007] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Riccardo
Rosati: Tractable Reasoning and Efficient Query Answering in Description Logics: The DL-Lite Family. J. Autom.
Reasoning 39(3): 385-429 (2007)
• [Perez-Urbina et al. 2009] Héctor Pérez-Urbina, Boris Motik, Ian Horrocks: A Comparison of Query Rewriting
Techniques for DL-Lite. In Proc. of the Int. Workshop on Description Logics (DL 2009), Oxford, UK, July 2009.
• [Rodriguez-Muro et al. 2012] Mariano Rodriguez-Muro, Diego Calvanese:
Quest, an OWL 2 QL Reasoner for Ontology-based Data Access. OWLED 2012
• [Rodriguez-Muro et al. 2013] Mariano Rodriguez-Muro, Roman Kontchakov, Michael Zakharyaschev: Ontology-
Based Data Access: Ontop of Databases. International Semantic Web Conference (1) 2013: 558-573
• [Calvanese et al. 2011] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Antonella
Poggi, Mariano Rodriguez-Muro, Riccardo Rosati, Marco Ruzzi, Domenico Fabio Savo:
The MASTRO system for ontology-based data access. Semantic Web 2(1): 43-53 (2011)
• [Rosati et al. 2010] Riccardo Rosati, Alessandro Almatelli: Improving Query Answering over DL-Lite Ontologies. KR 2010
• [Mora & Corcho, 2014] José Mora, Riccardo Rosati, Óscar Corcho: kyrie2: Query Rewriting under Extensional
Constraints in ELHIO. Semantic Web Conference (1) 2014: 568-583
• [Sequeda et al. 2014] Juan F. Sequeda, Marcelo Arenas, Daniel P. Miranker:
OBDA: Query Rewriting or Materialization? In Practice, Both! Semantic Web Conference (1) 2014: 535-551
124. 124
References 4
• [Acosta et al 2011] M. Acosta, M.-E. Vidal, T. Lampo, J. Castillo, and E. Ruckhaus. Anapsid: an adaptive query
processing engine for sparql endpoints. ISWC 2011.
• [Basca and Bernstein 2014] C. Basca and A. Bernstein. Querying a messy web of data with avalanche. In
Journal of Web Semantics, 2014.
• [Cohen-Boulakia and Leser 2013] S. Cohen-Boulakia, U. Leser. Next Generation Data Integration for the Life
Sciences. Tutorial at ICDE 2013. https://www2.informatik.hu-berlin.de/~leser/icde_tutorial_final_public.pdf
• [Doan et al. 2012] A. Doan, A. Halevy, Z. Ives, Data Integration. Morgan Kaufmann 2012.
• [Halevy et al 2006] A. Y. Halevy, A. Rajaraman, J. Ordille: Data Integration: The Teenage Years. VLDB 2006:
9-16.
• [Halevy et al 2001] A. Y. Halevy. Answering queries using views: A survey. VLDB J., 2001.
• [Hassanzadeh et al. 2013] Oktie Hassanzadeh, Ken Q. Pu, Soheil Hassas Yeganeh, Renée J. Miller, Lucian
Popa, Mauricio A. Hernández, Howard Ho: Discovering Linkage Points over Web Data. PVLDB 2013
• [Gorlitz and Staab 2011] O. Gorlitz and S. Staab. SPLENDID: SPARQL Endpoint Federation Exploiting VOID
Descriptions. In Proceedings of the 2nd International Workshop on Consuming Linked Data, 2011.
• [Schwarte et al. 2011] A. Schwarte, P. Haase, K. Hose, R. Schenkel, and M. Schmidt. Fedx: Optimization
techniques for federated query processing on linked data. ISWC 2011.
• [Verborgh et al. 2014] Ruben Verborgh, Olaf Hartig, Ben De Meester, Gerald Haesendonck, Laurens De
Vocht, Miel Vander Sande, Richard Cyganiak, Pieter Colpaert, Erik Mannens, Rik Van de Walle: Querying
Datasets on the Web with High Availability. ISWC2014
125. 125
References 5
• [Acosta et al. 2015] Maribel Acosta, Amrapali Zaveri, Elena Simperl, Dimitris Kontokostas, Sören Auer, Jens
Lehmann: Crowdsourcing Linked Data Quality Assessment. ISWC 2013
• [Lenz 2007] Hans-J. Lenz. Data Quality: Defining, Measuring and Improving. Tutorial at IDA 2007.
• [Naumann02] Felix Naumann: Quality-Driven Query Answering for Integrated Information Systems. LNCS
2261, Springer 2002
• [Ngonga et al. 2011] Axel-Cyrille Ngonga Ngomo, Sören Auer: LIMES - A Time-Efficient Approach for Large-Scale
Link Discovery on the Web of Data. IJCAI 2011
• [Saleem et al. 2014] Muhammad Saleem, Maulik R. Kamdar, Aftab Iqbal, Shanmukha Sampath, Helena F. Deus,
Axel-Cyrille Ngonga Ngomo: Big linked cancer data: Integrating linked TCGA and PubMed. J. Web Sem. 2014
• [Soru et al. 2015] Tommaso Soru, Edgard Marx, Axel-Cyrille Ngonga Ngomo: ROCKER: A Refinement Operator for
Key Discovery. WWW 2015
• [Volz et al. 2009] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov: Discovering and Maintaining Links on
the Web of Data. ISWC 2009
• [Zaveri et al. 2015] Amrapali J. Zaveri, Anisa Rula, Andrea Maurino, Ricardo Pietrobon, Jens Lehmann, Sören
Auer: Quality Assessment for Linked Data: A Survey. Semantic Web Journal 2015
• [Hernandez & Stolfo, 1998] M. A. Hernández, S. J. Stolfo: Real-world Data is Dirty: Data Cleansing and The
Merge/Purge Problem. Data Min. Knowl. Discov. 2(1): 9-37 (1998)
• [Sarma et al. 2012] Das Sarma, A., Fang, L., Gupta, N., Halevy, A., Lee, H., Wu, F., Xin, R., Yu, C.: Finding related
tables. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 817–828.
ACM (2012)
• [Venetis et al. 2011] Venetis, P., Halevy, A.Y., Madhavan, J., Pasca, M., Shen, W., Wu, F., Miao, G., Wu, C.:
Recovering semantics of tables on the web. PVLDB 4(9), 528–538 (2011)
• [Ramnandan et al. 2015] Ramnandan, S.K., Mittal, A., Knoblock, C.A., Szekely, P.A.: Assigning semantic labels to
data sources. In: ESWC 2015, pp. 403–417
• [Neumaier et al. 2015] Jürgen Umbrich, Sebastian Neumaier, Axel Polleres: Quality assessment & evolution of
open data portals. In IEEE International Conference on Open and Big Data, Rome, Italy, August 2015.
• [Neumaier et al. 2016] S. Neumaier, J. Umbrich, J. Parreira, A. Polleres: Multi-level semantic labelling of numerical
values. ISWC 2016, to appear.
130. 130
References
• Andreas Schwarte, Peter Haase, Katja Hose, Ralf
Schenkel, Michael Schmidt: FedX: Optimization
Techniques for Federated Query Processing on
Linked Data. International Semantic Web Conference
(1) 2011: 601-616
http://www2.informatik.uni-freiburg.de/~mschmidt/docs/iswc11_fedx.pdf
• Maribel Acosta, Maria-Esther Vidal, Tomas Lampo,
Julio Castillo, Edna Ruckhaus: ANAPSID: An
Adaptive Query Processing Engine for SPARQL
Endpoints. International Semantic Web Conference
(1) 2011: 18-34
http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Research_Paper/03/70310017.pdf
131. 131
• Describe the problem presented in the related
papers as a Data Integration System.
• Select the most suitable mapping approach to
describe the Data Integration System.
• Use the mediator and wrapper architecture to
describe the Data Integration System.
• Illustrate with an example the Data Integration
System, and show the features implemented by
the mediator and wrappers of the Data
Integration System
RQ1: Can a federation of SPARQL Endpoints be
seen as a Data Integration System?
132. 132
SPARQL Query Execution using LAV views
(Figure: publicly available Linked Data Fragments, viewed as LAV views, are served by a Linked Data Fragments server and queried by a Linked Data Fragments client.)
134. 134
SPARQL Query Execution using Linked Data Fragments
• Ruben Verborgh, Miel Vander Sande, Olaf Hartig, Joachim Van
Herwegen, Laurens De Vocht, Ben De Meester, Gerald Haesendonck,
Pieter Colpaert:
Triple Pattern Fragments: A low-cost knowledge graph interface for the
Web. J. Web Sem. 37: 184-206 (2016)
http://www.sciencedirect.com/science/article/pii/S1570826816000214
• Maribel Acosta, Maria-Esther Vidal: Networks of Linked Data Eddies:
An Adaptive Web Query Processing Engine for RDF Data.
International Semantic Web Conference (1) 2015: 111-127
http://www.aifb.kit.edu/images/f/f0/Acosta_vidal_iswc2015.pdf
References
135. 135
• Describe the problem presented in the related
papers as a Data Integration System.
• Select the most suitable mapping approach to
describe the Data Integration System.
• Use the mediator and wrapper architecture to
describe the Data Integration System.
• Illustrate with an example the Data Integration
System, and show the features implemented by
the mediator and wrappers of the Data
Integration System
RQ2: Can a federation of Linked Data Fragments
be seen as a Data Integration System?
136. 136
RQ3: What challenges does archiving of
RDF and Open Data involve?
• If Open Data is Big Data, then archiving Open Data and RDF
data is yet another order of magnitude bigger!
• Challenges on creating (crawling), maintaining, storing
and querying such archives:
• cf. slides
• “On Archiving Linked and Open Data"
at the 2nd Workshop on Managing
the Evolution and Preservation
of the Data Web (MEPDaW 2016),
http://polleres.net/presentations/20160530Keynote-MEPDaW2016.pptx
137. 137
RQ4: How to publish and use Linked Open Data
alongside Closed Data?
• Which policies need to be supported?
• How to describe these policies?
• How to enforce them, how to protect and securely store
closed linked data?
• Surprisingly few starting points in our community on
access control/encryption for RDF/Linked Data, cf. e.g.
• S. Kirrane. Linked data with access control. PhD thesis, 2015. NUI Galway
https://aran.library.nuigalway.ie/handle/10379/4903
• Mark Giereth: On Partial Encryption of RDF-Graphs. International Semantic Web Conference
2005: 308-322
• Lots of work on policy languages, e.g. ODRL:
• Simon Steyskal, Axel Polleres: Towards Formal Semantics for ODRL Policies. RuleML 2015: 360-375
• Simon Steyskal, Axel Polleres: Defining expressive access policies for linked data using the ODRL ontology 2.0.
SEMANTICS 2014: 20-23
138. 138
Your Research Task(s) for the rest of the day:
• Work on one of the overall Research Questions (too generic on purpose!!!!) RQ1-RQ4
from the slides before in your mini-project groups!
• 4 questions / 11 groups → 1 RQ can be chosen by at most 3 groups!
• RQ1-2 → Maria-Esther
• RQ3-4 → Axel
For each problem you work on:
1) Problems: Why is it difficult? Find obstacles. Define concrete open (sub-)research questions!
2) Solutions: What could be strategies to overcome these obstacles?
3) Systems: What could be a strategy/roadmap/method to implement these strategies?
4) Benchmarks: What could be a strategy/roadmap/method to evaluate a solution?
Result: short presentation per group addressing these 4 questions and findings.
Tips:
• Think about how much time you dedicate to which of these four questions.
• Don’t start with 3)
• Prepare some answers or discussion points for a final plenary session, to be presented in a 2-3 min
pitch SUMMARIZING your discussion
• no more than 2 slides
• focus on take-home messages
mandatory
optional
→ Please email your notes and (a link to) your slides to axel[at]polleres.net ...
We will review them and provide feedback tomorrow morning!