Tutorial given at the Informatics for Health 2017 Conference. These slides are for the second part of the tutorial, describing provenance capture and management tools.
Course: Bioinformatics for Biomedical Research (2014).
Session: 2.2- Introduction to Galaxy. A web-based genome analysis platform.
Statistics and Bioinformatics Unit (UEB) & High Technology Unit (UAT) from Vall d'Hebron Research Institute (www.vhir.org), Barcelona.
Adding Transparency and Automation into the Galaxy Tool Installation ProcessEnis Afgan
The talk will discuss the process of unifying the tool installation approach within the Galaxy project and how it can be used by anyone to install potentially hundreds of tools in an automated fashion.
The Taverna Suite provides tools for interactive and batch workflow execution. It includes a workbench for graphical workflow construction, various client interfaces, and servers for multi-user workflow execution. The suite utilizes a plug-in framework and supports a variety of domains, infrastructures, and tools through custom plug-ins.
Genomics Is Not Special: Towards Data Intensive BiologyUri Laserson
Genomics and the life sciences are using antiquated technology for processing data. As data volumes increase in the life sciences, many in the biology community are reinventing the wheel, without realizing that a rich ecosystem of tools for processing large data sets already exists: Hadoop.
Computational workflows for omics analyses at the IARCMatthieu Foll
This document discusses the use of computational workflows and nextflow at the International Agency for Research on Cancer (IARC). IARC uses high throughput sequencing and omics data to study cancer causes and prevention. While research groups have different scientific questions, nextflow helps avoid duplicating bioinformatics efforts and promotes best practices. Key benefits of nextflow include easy installation, reproducibility, and ability to run pipelines on any machine or cluster. Challenges include debugging, handling multiple inputs, and deleting intermediate files. Overall nextflow has changed how IARC conducts bioinformatics research for the better.
Hadoop for Bioinformatics: Building a Scalable Variant StoreUri Laserson
Talk at Mount Sinai School of Medicine. Introduction to the Hadoop ecosystem, problems in bioinformatics data analytics, and a specific use case of building a genome variant store backed by Cloudera Impala.
In this talk at the CECAM 2015 Workshop on Future Technologies in Automated Atomistic Simulations, I will discuss the Materials Project Ecosystem, an initiative to develop a comprehensive set of open-source software and data tools for materials informatics. The Materials Project is a US Department of Energy-funded initiative to make the computed properties of all known inorganic materials publicly available to all materials researchers to accelerate materials innovation. Today, the Materials Project database boasts more than 58,000 materials, covering a broad range of properties, including energetic properties (e.g., phase and aqueous stability, reaction energies), electronic structure (bandstructures, DOSs) and structural and mechanical properties (e.g., elastic constants).
A linchpin of the Materials Project is its robust data and software infrastructure, built on best open-source software development practices such as continuous testing and integration, and comprehensive documentation. I will provide an overview of the open-source software modules that have been developed for materials analysis (Python Materials Genomics), error handling (Custodian) and scientific workflow management (FireWorks), as well as the Materials API, a first-of-its-kind interface for accessing materials data based on REpresentational State Transfer (REST) principles. I will show how a materials researcher may use and build on these software and data tools for materials informatics, as well as to accelerate their own research.
Taverna workflows: provenance and reproducibility - STFC/NERC workshop 2013anpawlik
Slides on Taverna (www.taverna.org.uk) from the talk given at the STFC/NERC workshop "Workflow approaches to investigation of biological complexity", 15-16 October 2013.
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2UkZRIC.
Monal Daxini presents a blueprint for streaming data architectures and a review of desirable features of a streaming engine. He also talks about streaming application patterns and anti-patterns, and use cases and concrete examples using Apache Flink. Filmed at qconsf.com.
Monal Daxini is the Tech Lead for Stream Processing platform for business insights at Netflix. He helped build the petabyte scale Keystone pipeline running on the Flink powered platform. He introduced Flink to Netflix, and also helped define the vision for this platform. He has over 17 years of experience building scalable distributed systems.
GA4GH meeting at the Sanger InstituteMatt Massie
ADAM is a fast, scalable genome analysis platform built using Apache Spark and data formats like Avro and Parquet. It provides tools for read processing, variant calling, and multi-sample analysis on whole genome, high coverage data. The platform is designed to be easy to use for developers while leveraging existing open-source systems and deploying on both local and cloud infrastructures.
The Materials Project is an open initiative that makes calculated materials property data publicly available to accelerate materials innovation. It has calculated properties for over 30,000 materials using over 10 million CPU hours. The project provides a Python library and API to access and analyze materials data, as well as a workflow manager to run calculations on supercomputers. It aims to calculate all known inorganic materials and establish collaborations to develop new materials design tools.
This document proposes an approach called PTU (Provenance-To-Use) to improve the repeatability of scientific experiments by minimizing computation time during repeatability testing. PTU builds a package containing the software, input data, and provenance trace from a reference execution. Testers can then selectively replay parts of the provenance graph using the ptu-exec tool, reducing testing time compared to full re-execution. The document describes the PTU components, including tools for auditing reference runs, building provenance packages, and selectively replaying parts of the provenance graph. Examples applying PTU to the PEEL0 and TextAnalyzer applications show reductions in testing time.
High Performance Machine Learning in R with H2OSri Ambati
This document summarizes a presentation by Erin LeDell from H2O.ai about machine learning using the H2O software. H2O is an open-source machine learning platform that provides APIs for R, Python, Scala and other languages. It allows distributed machine learning on large datasets across clusters. The presentation covers H2O's architecture, algorithms like random forests and deep learning, and how to use H2O within R including loading data, training models, and running grid searches. It also discusses H2O on Spark via Sparkling Water and real-world use cases with customers.
Scalable Parallel Programming in Python with ParslGlobus
Parsl is a Python library that allows for the natural expression of parallelism in Python programs. It allows Python functions to be executed concurrently while respecting data dependencies. Parsl returns "futures" as proxies for results that may not yet be available. It decomposes parallel execution into a task dependency graph. Parsl scripts can run on local machines, grids, clouds, or supercomputers without changes to the code.
Genome Analysis Pipelines with Spark and ADAMAllen Day, PhD
Spark is a powerful new tool for processing large volumes of data quickly across a cluster of networked computers.
Typical bioinformatics workflow requirements are well-matched to Spark’s capabilities. However, Spark is not commonly used because many legacy bioinformatics applications make assumptions about their computing environment. These assumptions present a barrier to integrating the tools into more modern computing environments.
These barriers are quickly coming down. ADAM is a software library and set of tools built on top of Spark that makes it easy to work with file formats commonly used for genome analysis like FastQ, BAM, and VCF.
In this presentation, we’ll explore how a step that is common to many bioinformatics workflows, sequence alignment, can be done with Bowtie and ADAM inside a Spark environment to quickly align short reads to a reference genome. A complete code example is demonstrated and provided at https://github.com/allenday/spark-genome-alignment-demo
DNA sequencing is producing a wave of data which will change the way that drugs are developed, patients diagnosed, and our understanding of human biology. To fulfill this promise, however, the tools for interpretation and analysis must scale to match the quantity and diversity of "big data genomics."
ADAM is an open-source genomics processing engine, built using Spark, Apache Avro, and Parquet. This talk will discuss some of the advantages that the Spark platform brings to genomics, the benefits of using technologies like Parquet in conjunction with Spark, and the challenges of adapting new technologies for existing tools in bioinformatics.
These are slides for a talk given at the Apache Spark Meetup in Boston on October 20, 2014.
Good Practices for Developing Scientific Software Frameworks: The WRENCH fram...Rafael Ferreira da Silva
The document provides guidelines for best practices in developing scientific software frameworks. It discusses hosting open source projects on version control platforms and ensuring documentation, testing, continuous integration/delivery, and other development practices are followed. Specific examples mentioned include the WRENCH simulation framework, Pegasus workflow system, and scikit-learn machine learning library. The document emphasizes practices like writing tests, tracking issues, reviewing code quality, and releasing versions in a semantic and citable manner.
Workflow Support for Continuous Data Quality Control in a FilteredPush Network
J. Hanken, D. Lowery, B. Ludäscher, J. Macklin, T. McPhillips, P. Morris, B. Morris, T. Song
Presentation given at TDWG 2014
Jönköping, Sweden
These slides provide a quick overview of the Materials API, an open platform for materials researchers to access data from the Materials Project. A few simple examples are provided, as well as links where more information can be obtained.
Search at Twitter: Presented by Michael Busch, TwitterLucidworks
Twitter processes over 500 million tweets per day and more than 2 billion search queries per day. The company uses a search architecture based on Lucene with custom extensions. This includes an in-memory real-time index optimized for concurrency without locks, and a schema-based document factory. Future work includes support for parallel index segments and additional Lucene features.
Group meeting: UniSan - Proactive Kernel Memory Initialization to Eliminate D...Yu-Hsin Hung
UniSan is a compiler-based approach that uses static program analysis to identify unsafe kernel memory allocations that have potential to leak sensitive data. It instruments the code to initialize only the unsafe allocations with memset to zero. The evaluation found it effectively prevented 43 recent Linux kernel uninitialized data leaks while having low overhead for both system operations and user space programs. Future work could focus on custom heap allocators, analyzing more kernel modules, and applying the technique beyond kernels.
Lightning fast genomics with Spark, Adam and ScalaAndy Petrella
This document discusses using Apache Spark and ADAM to perform scalable genomic analysis. It provides an overview of genomics and challenges with existing approaches. ADAM uses Apache Spark and Parquet to efficiently store and query large genomic datasets. The document demonstrates clustering genomic data from the 1000 Genomes Project to predict populations, showing ADAM and Spark can handle large genomic workloads. It concludes these tools provide scalable genomic data processing but future work is needed to implement more advanced algorithms.
Efficient top-k queries processing in column-family distributed databasesRui Vieira
The document discusses efficient top-k query processing on distributed column family databases. It begins by introducing top-k queries and their uses. It then discusses challenges with naive solutions and prior work using batch processing. The document proposes three algorithms - TPUT, Hybrid Threshold, and KLEE - to enable real-time top-k queries on distributed data in a memory, bandwidth, and computation efficient manner. It also discusses implementation considerations for Cassandra's data model and CQL.
H2O World - Sparkling Water - Michal MalohlavaSri Ambati
H2O World 2015 - Michal Malohlava
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Ontology-based multi-domain metadata for research data management using tripl...João Rocha da Silva
A presentation given on the IDEAS 2014 Conference about database modelling using triple stores for research data management.
IDEAS '14, July 07 - 09 2014, Porto, Portugal.
Paper Abstract:
Most current research data management solutions rely on a fixed set of descriptors (e.g. Dublin Core Terms) for the description of the resources that they manage. While these are easy to understand and use, their semantics are limited to general concepts, leaving out domain-specific metadata and representing values as sets of text values. While this enables retrieval through free-text search, faceted search and dataset interlinking become limited. From the point of view of the relational database schema modeler, designing a more flexible metadata model represents a non-trivial challenge because of the open nature of the model. This work examines the approaches followed by current open-source platforms and proposes a graph-based model for achieving modular, ontology-based metadata for interlinked data assets in the Semantic Web. The proposed model was implemented in a collaborative research data management platform currently under development at the University of Porto.
Slides presented at the Spark Summit East 2015 (http://spark-summit.org/east). Video should be available through their site, at some point in the future.
(Some of these slides were adapted from an earlier talk "Why is Bioinformatics a Good Fit for Spark?", given to a Spark meetup audience.)
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...Ilkay Altintas, Ph.D.
Scientific workflows are used by many scientific communities to capture, automate and standardize computational and data practices in science. Workflow-based automation is often achieved through a craft that combines people, process, computational and Big Data platforms, and application-specific purpose and programmability, leading to provenance-aware archival and publication of the results. This talk summarizes varying and changing requirements for distributed workflows influenced by Big Data and heterogeneous computing architectures and presents a methodology for workflow-driven science based on these maturing requirements.
2016-10-20 BioExcel: Advances in Scientific Workflow EnvironmentsStian Soiland-Reyes
Carole Goble, Stian Soiland-Reyes
http://orcid.org/0000-0001-9842-9718
Presented at 2016-10-20 BioExcel Workflow Training, BSC, Barcelona
http://bioexcel.eu/events/bioexcel-workflow-training-for-computational-biomolecular-research/
NOTE: Although these slides are licensed as CC Attribution, it includes various logos which are covered by their own licenses and copyrights.
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit
The document describes IBM's transition from a single-tenant Hadoop architecture for their Watson Analytics for Social Media product to a multitenant Apache Spark architecture supporting over 3000 tenants. Key aspects of the new architecture included splitting analytics into tenant-specific and language-specific components, aggregating social media feeds from all tenants into a single stream for processing, and removing tenant state from processing components to enable low-latency switching between tenants. This resulted in a scalable, robust pipeline for real-time social media analytics based on Spark, Kafka and Zookeeper.
Spark Summit - Watson Analytics for Social Media: From single tenant Hadoop t...Behar Veliqi
- WHAT IS WATSON ANALYTICS FOR SOCIAL MEDIA
- PREVIOUS ARCHITECTURE ON HADOOP
- THOUGHT PROCESS TOWARDS MULTITENANCY
- NEW ARCHITECTURE ON TOP OF APACHE SPARK
- LESSONS LEARNED
Proactive ops for container orchestration environmentsDocker, Inc.
This document discusses different approaches to monitoring systems from manual and reactive to proactive monitoring using container orchestration tools. It provides examples of metrics to monitor at the host/hardware, networking, application, and orchestration layers. The document emphasizes applying the principles of observability including structured logging, events and tracing with metadata, and monitoring the monitoring systems themselves. Speakers provide best practices around failure prediction, understanding failure modes, and using chaos engineering to build system resilience.
Open Annotation Rollout, Manchester, 2013-06-25
See also PDF version: http://www.slideshare.net/soilandreyes/2013-0624annotatingr-osopenannotationmeeting-23289491
Open Annotation Rollout, Manchester, 2013-06-25
See also PPTX version with Notes: http://www.slideshare.net/soilandreyes/2013-0624annotatingr-osopenannotationmeeting
Building and deploying LLM applications with Apache AirflowKaxil Naik
Behind the growing interest in Generative AI and LLM-based enterprise applications lies an expanded set of requirements for data integration and ML orchestration. Enterprises want to use proprietary data to power LLM-based applications that create new business value, but they face challenges in moving beyond experimentation. The pipelines that power these models need to run reliably at scale, bringing together data from many sources and reacting continuously to changing conditions.
This talk focuses on the design patterns for using Apache Airflow to support LLM applications created using private enterprise data. We’ll go through a real-world example of what this looks like, as well as a proposal to improve Airflow and to add additional Airflow Providers to make it easier to interact with LLMs such as the ones from OpenAI (such as GPT4) and the ones on HuggingFace, while working with both structured and unstructured data.
In short, this shows how these Airflow patterns enable reliable, traceable, and scalable LLM applications within the enterprise.
https://airflowsummit.org/sessions/2023/keynote-llm/
This document provides guidance on interpreting and reporting performance test results. It discusses collecting various metrics like load, errors, response times and system resources during testing. It emphasizes aggregating the raw data into meaningful statistics and visualizing the results in graphs to gain insights. Key steps in the process include interpreting observations and correlations to develop hypotheses, assessing conclusions to make recommendations, and reporting the findings to stakeholders in a clear and actionable manner. The overall approach is to turn large amounts of data into a few insightful pictures and conclusions that can guide technical or business decisions.
Azure Machine Learning - Deep Learning with Python, R, Spark, and CNTK Herman Wu
The document discusses Microsoft's Cognitive Toolkit (CNTK), an open source deep learning toolkit developed by Microsoft. It provides the following key points:
1. CNTK uses computational graphs to represent machine learning models like DNNs, CNNs, RNNs in a flexible way.
2. It supports CPU and GPU training and works on Windows and Linux.
3. CNTK achieves state-of-the-art accuracy and is efficient, scaling to multi-GPU and multi-server settings.
This presentation covers the basic concepts of distributed tracing and OpenTracing, and walks through Jaeger's hands-on sample application (HotROD).
Distributed tracing allows requests to be tracked across multiple services in a distributed system. The Jaeger distributed tracing system was used with the HotROD sample application to visualize and analyze the request flow. Key aspects like latency bottlenecks and non-parallel processing were identified. Traditional logs lack the request context provided by distributed tracing.
Sharing massive data analysis: from provenance to linked experiment reportsGaignard Alban
The document discusses scientific workflows, provenance, and linked data. It covers:
1) Scientific workflows can automate data analysis at scale, abstract complex processes, and capture provenance for transparency.
2) Provenance represents the origin and history of data and can be represented using standards like PROV. It allows reasoning about how results were produced.
3) Capturing and publishing provenance as linked open data can help make scientific results more reusable and queryable, but challenges remain around multi-site studies and producing human-readable reports.
This document discusses tools for distributed data analysis including Apache Spark. It is divided into three parts:
1) An introduction to cluster computing architectures like batch processing and stream processing.
2) The Python data analysis library stack including NumPy, Matplotlib, Scikit-image, Scikit-learn, Rasterio, Fiona, Pandas, and Jupyter.
3) The Apache Spark cluster computing framework and examples of its use including contexts, HDFS, telemetry, MLlib, streaming, and deployment on AWS.
Workshop: Big Data Visualization for SecurityRaffael Marty
Big Data is the latest hype in the security industry. We will have a closer look at what big data is comprised of: Hadoop, Spark, ElasticSearch, Hive, MongoDB, etc. We will learn how to best manage security data in a small Hadoop cluster for different types of use-cases. Doing so, we will encounter a number of big-data open source tools, such as LogStash and Moloch that help with managing log files and packet captures.
As a second topic we will look at visualization and how we can leverage visualization to learn more about our data. In the hands-on part, we will use some of the big data tools, as well as a number of visualization tools to actively investigate a sample data set.
RDF Validation in a Linked Data World - A vision beyond structural and value ...Nandana Mihindukulasooriya
This document discusses RDF validation in a Linked Data context. It outlines factors to consider in designing an RDF validation process, including data source dynamics, publication strategy, and access control. It also covers procedural factors like the number of data sources and validation scope. Context factors like the validation purpose and data provenance must also be taken into account. The conclusion is that RDF validation for Linked Data needs to accommodate the particularities of the data sources, processes, and context involved.
"Data Provenance: Principles and Why it matters for BioMedical Applications"
1. Data Provenance: Principles and Why it matters for BioMedical Applications
PART-2: Tools and Techniques
Informatics for Health 2017 Preconference Tutorial
Vasa Curcin, Lecturer, HSCR & Department of Informatics KCL
Pinar Alper, Postdoc at HSCR KCL
22.Apr.2017
Manchester/UK
1
3. Provenance Recap
Provenance has a Particular Subject & has a Standard Model:
• “information about entities, activities, and people involved in producing a
piece of data or thing…” (W3C PROV).
Figure: the activity "edit" used entity "page1" and generated (wasGeneratedBy) entity "page2"; the activity wasAssociatedWith the agent "Wikipedia Editor".
• PROV-DM Core vocabulary:
• Agent, Activity, Entity
• Causal relations among elements
• Conceptually a graph.
• Constraints:
• Typing: if two things are linked with actedOnBehalfOf, they are of type agent
• Impossibility: activities and entities are disjoint; specialisation is not reflexive
• Ordering: Usage must occur between the start and end events of activities
• Human and Machine understandable representation:
• PROV-O,
• PROV-N
• Extensibility points
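As a concrete illustration of the PROV-DM core, the Wikipedia example above can be built and serialized to PROV-N programmatically. This is a minimal sketch using the community-maintained "prov" Python package (pip install prov); the ex: namespace and node names are illustrative, mirroring the figure rather than coming from the slides.

    from prov.model import ProvDocument

    doc = ProvDocument()
    doc.add_namespace('ex', 'http://example.org/')

    doc.entity('ex:page1')                     # entity
    doc.entity('ex:page2')                     # entity
    doc.activity('ex:edit')                    # activity
    doc.agent('ex:WikipediaEditor')            # agent

    doc.used('ex:edit', 'ex:page1')            # edit used page1
    doc.wasGeneratedBy('ex:page2', 'ex:edit')  # page2 wasGeneratedBy edit
    doc.wasAssociatedWith('ex:edit', 'ex:WikipediaEditor')

    print(doc.get_provn())                     # human-readable PROV-N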
5. Provenance Capture
• Rigorously studied in the context of Scientific Computing.
• Workflows
• Scripts
• Command-Line Tools
• System-Level (File and Operating System)
• Databases
• Templates
5
6. • Became popular over the past decade.
• Pipelines of tasks with dataflow
dependencies.
• Automation, Resource Access Client
• Somewhat disruptive, need to wrap
resources.
• Provenance for the output scientific data:
• An outline of the method followed
• Resources used (repositories, tools,
services).
• Parameter configurations and
intermediary results.
Scientific Workflows
Figure: a workflow of Analysis, Visualization and Adaptation steps connected by data and parameter links, drawing on a community data repository, community tools, and local tools & data.
7.
• Backward-Looking history
of what the workflow
engine observed during
the run:
• Data nodes with identifiers
minted by WF engine. Data
often stored separately.
• Task invocations,
timestamp.
• Data usage and generation
by tasks.
• Actor is primarily the WF
engine.
• Inferred provenance:
• Task causal dependencies
• Data causal dependencies
Workflows Execution Provenance
8. WF Provenance - Transparency
• Annotated grey-box provenance.
• Annotate workflow, get auto-annotated provenance
• Annotate provenance
Figure: input, activity, output; e.g. a DNA Sequence used by a Sequence Alignment activity that generates a BLAST Report.
• Wraps resources to incorporate them into the workflow, hence an "Observer"
perspective. In the most basic case this provides black-box provenance.
• We also have white-box provenance; we will come to that later!
Figure: PROV extensibility points: prov:type marks an activity as a CL tool execution, script execution or web service invocation; hadRole and other attributes annotate inputs and outputs.
9. WF Provenance - Perspective
• Workflow Provenance is the first
to be referred to as “Prospective
Provenance” ‼
• Prospective provenance is often
viewed as the provenance by end-
users (scientists).
• Prospective Provenance can be
useful as an abstraction over
(bulky) retrospective provenance.
Figure: prospective provenance layered above retrospective provenance, connected by lineage.
13. A workflow may not necessarily be implemented by a
workflow system
13
• Provenance Challenge Series 1-4
• Provenance Challenge Workflow: 5-step computational process using Functional
Magnetic Resonance Imaging (fMRI) data.
• Have been realized with WF, tool, and system-level provenance
http://twiki.ipaw.info/bin/view/Challenge/FirstProvenanceChallenge
Figure: lineage queries combine domain data with patterns of inquiry, e.g. lineage traversal.
14. Scripts
• Popular, established method of data processing
• Integration into statistical, numerical analysis libraries
• Visualization libraries
• Researchers have recently started paying attention to script
provenance.
• Currently research prototypes rather than out-of-the-box features.
• Minimally disrupting existing practices. No technology change, No
wrapping!
14
15. Script Provenance - RDataTracker
Barbara Lerner and Emery Boose. RDataTracker: Collecting provenance
in an interactive scripting environment. In 6th USENIX Workshop on
the Theory and Practice of Provenance (TaPP 2014).
15
• Extend R scripts with logging
statements
• Provenance ON/OFF:
ddg.init, ddg.save
• Abstract multiple commands:
ddg.procedure
• Post-execution visualize the
Data Derivation Graph
Figure legend: process nodes, data nodes, data-flow edges, control-flow edges.
https://github.com/End-to-end-provenance/RDataTracker
16. Script Provenance- YesWorkflow
16
• Annotate method declarations in R,
Matlab scripts
• Makes “latent workflow information
from scripts explicit”.
• Prospective provenance, a workflow
abstraction over the script.
• Visualized in process, data and
combined views
• @begin, @end
• @in, @out, @as
• Inputs/outputs can be concrete files
• Or can refer to prospective
resources identified by templates.
Timothy M. McPhillips, Tianhong Song et al. YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts. CoRR, abs/1502.02403, 2015.
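To make the annotation style concrete, here is a minimal sketch of a YesWorkflow-annotated script. The slides mention R and Matlab scripts; the tags are language-independent comments, shown here in Python, and the file and step names are invented for illustration.

    # @begin analysis
    # @in raw.csv @as raw_data
    # @out plot.png @as result_plot

    # @begin clean_data
    # @in raw.csv @as raw_data
    # @out clean.csv @as clean_data
    def clean_data():
        pass  # e.g. filter malformed rows from raw.csv into clean.csv
    # @end clean_data

    # @begin visualize
    # @in clean.csv @as clean_data
    # @out plot.png @as result_plot
    def visualize():
        pass  # e.g. render clean.csv as plot.png
    # @end visualize

    # @end analysis

YesWorkflow recovers the two-step dataflow graph from these comments alone, without executing the script.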
18. Script Provenance - noWorkflow
• Definition: Analyze the AST of
Python script.
• Arguments, function calls, global vars
• User defined functions
• Deployment: Environment and
library dependencies.
• Uses Python os, socket and
platform libraries
• Collected right before script execution
Leonardo Murta, Vanessa Braganholo, Fernando Chirigati, David Koop, and Juliana Freire. noWorkflow: Capturing and analyzing provenance of scripts. In 5th International Provenance and Annotation Workshop (IPAW), Cologne, Jun 2014.
18
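The definition-time analysis can be pictured with Python's standard ast module. This is an illustrative sketch of the kind of static facts such a tool can harvest (function definitions and call sites), not noWorkflow's actual implementation; the analyzed source is an invented toy script.

    import ast, textwrap

    source = textwrap.dedent('''
        def double(x):
            return 2 * x

        result = double(21)
    ''')
    tree = ast.parse(source)

    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            # user-defined function and its argument names
            print('defines:', node.name, [a.arg for a in node.args.args])
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # direct function call site
            print('calls:', node.func.id, 'line', node.lineno)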
19. Script Provenance -noWorkflow
• Execution: Runtime information
• Uses python’s runtime profiling and reflection capabilities
• the start time of the function activation, together with the values of every argument,
return, and globals
• open() system calls to record dependencies on files and snapshot file contents at the
time of the call
19
https://github.com/gems-uff/noworkflow
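The runtime side can be sketched with Python's profiling hook, which observes every function activation. This is a toy illustration of the mechanism (timestamps, argument values, return values), not noWorkflow's code.

    import sys, time

    def observer(frame, event, arg):
        # record activations: name, argument values, return value, timestamp
        if event == 'call':
            print(time.time(), 'call', frame.f_code.co_name,
                  dict(frame.f_locals))
        elif event == 'return':
            print(time.time(), 'return', frame.f_code.co_name, arg)

    def double(x):
        return 2 * x

    sys.setprofile(observer)   # start auditing
    double(21)
    sys.setprofile(None)       # stop auditing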
20. Script Provenance -noWorkflow
• A lot of information!
• Analysis
1. Highly summarized Activation Graph. Nodes in one control flow block (e.g. loop) merged.
2. Diff two executions (report diff in dates, argument values, environment settings)
3. Query in Prolog access_influence(File, ’output.png’)
20
https://github.com/gems-uff/noworkflow
21. Script Provenance -CXXR
• R Interpreter with audit features
• Tracks provenance for bindings generated in the user workspace
• provenance(x): Returns a list comprising: expression, symbol, timestamp, parents, children
• pedigree(x): Displays the sequence of commands issued which result in x's current state
21
https://github.com/timothyjgraham/cxxr
22. Script Provenance
22
• End-use:
• Minimally disruptive, but yet to prove its utility.
• Black-box provenance.
Too fine-grained a process abstraction?
Are variables, arguments or files useful data abstractions?
23. Command Line Tools
• Popular interface for many scientific analysis libraries
• Web-based reproducibility frameworks: a layer between the user and the OS
command shell.
• Register tools with metadata in the framework. Grey-box provenance.
• Layer directs tool execution
• Layer records provenance
Sumatra: automated tracking of scientific computations. https://pythonhosted.org/Sumatra/
Galaxy Data intensive biology for everyone https://galaxyproject.org/
23
24. Tool Provenance
• End use:
• Convert exploratory steps into workflows
• Compare different invocations
24
25. Database Provenance
• Databases widely used for
managing scientific resource
metadata:
• Data Catalogs
• Service Registries
• White-box Provenance
• Why: witness tuples
• How: the way witness tuples contribute to the result.
• Where: cell-level data copying from
witness tuples to result
James Cheney, Laura Chiticariu, and Wang Chiew Tan. Provenance in databases: Why, how, and where. Foundations and Trends in
Databases, 1(4):379–474, 2009.
25
26. Database Provenance
• Why provenance:
• Query debugging: Why did I get this result?
• View maintenance: If I update this record, do I need to refresh this
(materialized) view?
• How provenance:
• Trust and uncertainty computation
• Where provenance:
• Annotation propagation
• If desired, white-box provenance can be computed after query execution, when needed.
• Limited to research prototypes
26
http://infolab.stanford.edu/trio/
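A toy sketch of why-provenance over two in-memory "tables" (invented data; not the Trio system): each result of a select-join query is returned together with its witness tuples, the input tuples whose presence explains it.

    runs    = [('r1', 'blast'), ('r2', 'bowtie')]      # (run_id, tool)
    outputs = [('r1', 'hits.txt'), ('r2', 'aln.bam')]  # (run_id, file)

    def join_with_why(runs, outputs):
        # SELECT tool, file FROM runs JOIN outputs USING (run_id)
        for run_id, tool in runs:
            for run_id2, fname in outputs:
                if run_id == run_id2:
                    witnesses = [('runs', (run_id, tool)),
                                 ('outputs', (run_id2, fname))]
                    yield (tool, fname), witnesses

    for result, why in join_with_why(runs, outputs):
        print(result, '<-', why)   # result plus its why-provenance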
27. System-Level Provenance
• Every application runs on some OS and uses some storage system
• Not disruptive: zero modification to applications.
• An audit mode for systems, explore feasibility and overhead.
27
28. System-Level Provenance - SPADE
• A consumer of the audit APIs of different OSs (Windows, Mac OS, Linux, Android)
• Collects process information: name, owner, creation time, command line,
environment vars, files read/written.
• Reporters:
• Default for each OS. E.g. in Linux exec(), fork(), clone(), exit(), open(),
close(), read(), write(), truncate() are picked up by reporters
• Domain specific reporters. Add onto default.
• Reporting overhead is variable; it depends on the nature of the application and the target OS.
• Compile Apache Web Server on Windows (53%)
• Run BLAST tool on Linux (5%), on MacOS (10%)
• OPM compliant
A. Gehani and D. Tariq. SPADE: Support for Provenance Auditing in Distributed Environments. In Middleware 2012 - Proceedings, pages 101–120, 2012.
28
29. System-level Provenance - PASS
• Modified Linux Kernel. Relies on the tracking of system calls.
• Focuses on files, keeping a detailed change record.
29
• For each file in the filesystem,
• The executable that created it
• Any input files
• “Complete” hardware platform
description
• Command line
• Process environment
• Other data such as random seeds
> sort a > b
Kiran-Kumar Muniswamy-Reddy, David A. Holland, Uri Braun, and Margo Seltzer. Provenance-aware storage systems. In Proceedings of the Annual Conference on
USENIX ’06 Annual Technical Conference, ATEC ’06, pages 4–4.
30. System-Level Provenance - PASS
• End Use
• Script generation
• Generate a Makefile that reproduces a file
• Detecting system changes
• Compare provenance of two files to detect changes in
environment, libraries, etc.
• Intrusion detection
• Detailed logs of how objects have changed
• Very fine-grained provenance!
• 5000 objects in response to a lineage inquiry over a trace with 50 elements.
30
31. To this end
• We focused on capturing provenance from a particular computational
system
• System dictates granularity
Data: Tuples, files, variables
Activity: Query operator, sys-call, wf activity
• System dictates transparency
White-box for DB vs Black/Grey Box for WFs and Tools
31
32. In reality
• We have heterogeneity and legacy:
• We use heterogeneous technologies.
• Provenance may be recorded in mixed granularities.
• Existing audit capabilities and provenance-like information recorded.
• Audited processes can span long time frames.
32
33. Consider the case
Figure: a Medical Practice Management (MPM) system and a Stroke Decision Aid (SDA), each with its own DB and log files, linked through a Portal and an Authentication Service used by clinicians and patients; the goal is a global provenance picture spanning portal usage info and authentication traces.
Designing a decision aid to improve secondary prevention for stroke survivors with multimorbidity: A stakeholder engagement study (T Porat, I Marshall, E Sadler, MA Vadillo, V Curcin, C McKevitt, C Wolfe) - Abstract presentation at Farr 2017, Tuesday 4pm
• Has there been an update to the patient's EHR in MPM between the initial assessment and the follow-up in SDA?
• Do patients use any part of the portal after the initial assessment activity in SDA?
• What are all the prior authentication records that led to MPM activities updating this erroneous EHR?
• What is the average time between initial assessments and follow-ups?
34. A Solution Approach – Provenance Templates
• Focus on Provenance rather than its capture from a specific system.
• How can we combine provenance of multiple sources in a controlled manner?
• Focus on incorporating Domain-Specific information
• Heterogeneous systems interoperate via domain ontologies. How can we obtain
domain annotated provenance?
• Two groups actively working on it
34
“Templates as a method for implementing data provenance in decision support systems.” Vasa Curcin, Elliot Fairweather, Roxana Danger, Derek Corrigan. Journal of Biomedical Informatics 65 (2017) 1-21
https://bitbucket.org/kclbig/templates
https://provenance.ecs.soton.ac.uk/prov-template/
In publication.
37. Zone
• A connected sub-graph of a template graph, which can be instantiated multiple times
• Series/parallel types of zones
• Restrictions on minimum and maximum number of instantiations
Example template (figure): variables var:a, var:b, var:c, var:d and var:x linked by used, wasGeneratedBy, wasDerivedFrom and wasAssociatedWith relations; a zone (id=zone1, type=parallel, min=1, max=5) marks which variables are zone variables and which are external variables.
41. What constitutes a Valid Substitution?
• For Template Base
• Distinct values among all external variables
• Distinct values among the global node id space for non-graft variables
• Values for all value variables
• For each Zone substitution
• Distinct values for all zone variables among each other and among the global
node id space
• Values for all value variables
41
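A minimal sketch of what instantiating a template with a substitution might look like (illustrative only; not the KCL or Southampton implementations, and the identifiers are invented). A bindings map replaces the var:* placeholders, and the distinctness rules above are checked before the concrete provenance edges are emitted.

    template_edges = [
        ('var:c', 'used', 'var:a'),
        ('var:b', 'wasGeneratedBy', 'var:c'),
        ('var:c', 'wasAssociatedWith', 'var:x'),
    ]

    bindings = {                        # one candidate substitution
        'var:a': 'ehr:record-123-v1',   # hypothetical identifiers
        'var:b': 'ehr:record-123-v2',
        'var:c': 'app:update-456',
        'var:x': 'user:clinician-7',
    }

    def instantiate(edges, bindings):
        # validity check: every variable bound, bound values distinct
        assert all(v in bindings for e in edges for v in (e[0], e[2]))
        assert len(set(bindings.values())) == len(bindings)
        return [(bindings[s], rel, bindings[o]) for s, rel, o in edges]

    for edge in instantiate(template_edges, bindings):
        print(edge)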
47. 1. Enforce structure over provenance
• Templates can be seen as
Provenance Schemas. Acceptable
patterns of provenance.
• Strict: there is no optionality. If a variable is included in a template, bindings will be expected for it.
• Loose: Templates may not encode all
possible patterns. (Grafts)
• Prospective Provenance in its true
form!
47
48. 2. Go beyond single-system restriction
• Combine provenance at different granularities - the importance of identification.
• User identity
• EHR associated with identity. Track CRUD at the EHR level.
• Fields of the EHR relevant for provenance. Track updates at the selected EHR field level
• Build up provenance over time
1. Authentication
2. Activity on portal
3. Activity within application
4. Authentication
5. Activity on portal
6. Activity within application
48
49. 3. Support other forms of Provenance
• Provenance can be found in other forms (XLS files, logs)
• We can create substitutions from these other forms
• Hence migrate legacy provenance to a standards-compliant (graph-based) representation
49
Figure: a MIAME spreadsheet mapped to a provenance template; rows become substitutions, columns become variables, and template generation produces standard provenance.
Minimum Information for Biological and Biomedical Investigations
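A sketch of how such tabular legacy records could be turned into template substitutions (the file name and column headers are hypothetical): each spreadsheet row becomes one substitution, each column one template variable.

    import csv

    with open('miame_records.csv') as f:          # legacy spreadsheet
        for row in csv.DictReader(f):
            # one row -> one substitution; columns -> template variables
            bindings = {'var:' + column: value
                        for column, value in row.items()}
            print(bindings)   # feed into template instantiation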
50. 4. Possibility of devising Model-Driven techniques
for introducing provenance support to existing tools
• Identify parts of the system (Actor, Entity, Activity) that are relevant for
provenance tracking
• Data Flow Diagrams to understand the data entities
• Use cases that should be tracked
• Create sample (concrete) provenance
• Refactor templates from concrete provenance
50
• Map templates to service end points
on the template server
• Add service client code to existing
tools
• Map templates to (Java) beans
• Add bean population code to existing
tools
52. Querying Provenance
Most common pattern of access:
1. Locate node(s) of interest:
• based on attributes. e.g. Process nodes of a particular name, Data nodes produced after
some date, nodes with timestamp in a range
• based on involved in relationship(s) (consumption, generation). e.g. Data generated by a
particular buggy process.
2. Traverse lineage, causation (with transitive closure)
Figure: locate nodes of interest, then traverse lineage.
53. A Less common pattern of access:
Locate sub-graph of interest:
• These are advanced queries that may involve regular path expressions
Querying Provenance
Figure: locate a sub-graph of interest.
54. Data Model for Provenance
• Relational
• Traversals require several self-joins in SQL, which can be costly in the case of deep
lineage traces.
• Yet still very popular!
• Joint querying of application data and provenance
• Graph
• Graph DB neo4j
• Cypher Query Language (Variable length relations) +Free text search
• Visualization
• Triple stores:
• W3C SPARQL based querying + free text search
• Ontology integration
• Reasoning and Rules support,
54
http://sig.biostr.washington.edu/projects/ontviews/gleen/index.html
https://neo4j.com/product/
http://rdf4j.org/
https://jena.apache.org/
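As a sketch of the locate-then-traverse pattern on a graph store (using the official neo4j Python driver; the labels, relationship name, property key and connection details are assumptions, not a prescribed schema), a Cypher variable-length path computes the transitive lineage closure:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver('bolt://localhost:7687',
                                  auth=('neo4j', 'password'))

    LINEAGE = """
    MATCH (d:Entity {name: $name})-[:wasDerivedFrom*1..]->(a:Entity)
    RETURN DISTINCT a.name AS ancestor
    """

    with driver.session() as session:
        # locate the node of interest, then traverse its lineage
        for record in session.run(LINEAGE, name='output.png'):
            print(record['ancestor'])
    driver.close()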
55. Exploring provenance graphs
• Browsing Layers of provenance
• WF Design - Manual
• (Semi) automated - grouping
Manish Kumar Anand, Shawn Bowers, Bertram Ludäscher. Provenance browser: Displaying and querying scientific workflow provenance graphs. Proceedings of the ICDE 2010 Conference.
Interactive Visualization of Provenance Graphs for Reproducible Biomedical Research. Stefan Luger et al. 5th Symposium on Biological Data Visualization.
55
Extract a sub-graph prior to browsing
Olivier Biton, Sarah Cohen-Boulakia, Susan B. Davidson, and
Carmem S. Hara. Querying and Managing Provenance through
User Views in Scientific Workflows. pages 1072–1081, ICDE 2008.
56. Exploring provenance graphs
• Aggregating graph elements
Aggregation by Provenance Types: A Technique for Summarising Provenance Graphs. Luc Moreau. In Proceedings GaM 2015, arXiv:1504.02448
Rinke Hoekstra, Paul Groth. PROV-O-Viz - Understanding the Role of Activities in
Provenance. IPAW 2014: pp 215-220 56
57. Exploring provenance
• A move towards customized visualizations:
E.g. Provenance of annotation sentence “inactivated by cyanide”. It
originates in TrEMBL, but ends up in Swiss-Prot.
Bell MJ, Collison M, Lord P (2013) Can Inferred Provenance and Its Visualisation Be Used to Detect Erroneous Annotation?
A Case Study Using UniProtKB. PLoS ONE 8(10): e75541. https://doi.org/10.1371/journal.pone.0075541
57
58. PROV implementations
The full list is at: https://www.w3.org/2011/prov/wiki/ProvImplementations
60. Komadu
• Collect and Query API as a web service
• Visualize provenance, export to CSV format viewable in Cytoscape
• Export as PROV-XML
60
https://github.com/Data-to-Insight-Center/komadu/blob/master/docs/KomaduUserGuide.pdf
61. The Oh-Yeah Button
61
• PROV-AQ Implementation
• Browser add-on
http://users.ugent.be/~tdenies/OhYeah/
65. Sample PROV Datasets
65
PROV BENCH, PROV RECONSTRUCTION
1st ProvBench: Benchmarking Provenance Management Systems. In K. Belhajjame and J. Zhao,
editors, Proceedings of the Joint EDBT/ICDT 2013 Workshops.
Data2Semantics. Provenance reconstruction challenge. http://data2semantics.github.io/, 2014.
2nd ProvBench: Benchmarking Provenance Management Systems. K. Belhajjame and A. Chapman. https://sites.google.com/site/provbench/home/provbench-provenance-week-2014, 2014.
66. Follow State of the Art in Provenance
IPAW (every two years) and TaPP (annual) have been held jointly as Provenance Weeks.
IPAW 2018 will be held at King's College London!
66
#3: In the first part of the tutorial we presented:
What constitutes provenance information,
The W3C standard PROV vocabulary to represent this information.
In the second part we will present the existing set of tools and techniques (coming out of research) that help us to capture and manage provenance information.
#14: You do not need to use a workflow system to have a workflow.