Provenance and Scientific Workflows

This tutorial provides an overview of research issues in provenance for scientific workflows. Provenance is an essential component to allow for result reproducibility, sharing, and knowledge re-use. The provenance of a data product contains information about the process and data used to derive the data product.

Provenance and Scientific Workflows: Challenges and Opportunities

Susan B. Davidson
University of Pennsylvania
3330 Walnut Street
Philadelphia, PA 19104-6389

Juliana Freire
University of Utah
50 S. Central Campus Dr, rm 3190
Salt Lake City, UT 84112

ABSTRACT
Provenance in the context of workflows, both for the data they derive and for their specification, is an essential component to allow for result reproducibility, sharing, and knowledge re-use in the scientific community. Several workshops have been held on the topic, and it has been the focus of many research projects and prototype systems. This tutorial provides an overview of research issues in provenance for scientific workflows, with a focus on recent literature and technology in this area. It is aimed at a general database research audience and at people who work with scientific data and workflows. We will (1) provide a general overview of scientific workflows; (2) describe research on provenance for scientific workflows and show in detail how provenance is supported in existing systems; (3) discuss emerging applications that are enabled by provenance; and (4) outline open problems and new directions for database-related research.

Categories and Subject Descriptors
H.2 [Database Management]: General

General Terms
Documentation, Experimentation

Keywords
provenance, scientific workflows

1. IMPORTANCE OF PROVENANCE FOR WORKFLOWS
Computing has been an enormous accelerator to science and has led to an information explosion in many different fields. To analyze and understand scientific data, complex computational processes must be assembled, often requiring the combination of loosely-coupled resources, specialized libraries, and grid and Web services. These processes may generate many final and intermediate data products, adding to the overflow of information scientists need to deal with. Ad-hoc approaches to data exploration (e.g., Perl scripts) have been widely used in the scientific community, but have serious limitations. In particular, scientists and engineers need to expend substantial effort managing data (e.g., scripts that encode computational tasks, raw data, data products, and notes) and recording provenance information so that basic questions can be answered, such as: Who created this data product and when? When was it modified and by whom? What was the process used to create the data product? Were two data products derived from the same raw data? Not only is this process time-consuming, it is also error-prone.

Workflow systems have therefore grown in popularity within the scientific community [25, 41, 31, 42, 43, 45, 16, 17, 27, 38]. Not only do they support the automation of repetitive tasks, but they can also capture complex analysis processes at various levels of detail and systematically capture provenance information for the derived data products [15]. The provenance (also referred to as the audit trail, lineage, and pedigree) of a data product contains information about the process and data used to derive the data product. It provides important documentation that is key to preserving the data, to determining the data’s quality and authorship, and to reproducing as well as validating the results. These are all important requirements of the scientific process.

Provenance in scientific workflows is thus of paramount and increasing importance, as evidenced by recent specialized workshops [6, 15, 21, 32, 33] and surveys [18, 14, 7, 36]. While provenance in workflows bears some similarity to provenance in databases (which was the topic of a tutorial in SIGMOD’2007 [10] and a recent survey [40]), there are important differences and new challenges for the database community to consider.

Our objective in this tutorial is to give an overview of the problem of managing provenance data for scientific workflows, illustrate some of the techniques that have been developed to address different aspects of the problem, and outline interesting directions for future work in the area. In particular, we will present techniques for reducing provenance overload as well as making provenance information more “fine-grained.” We will examine uses of provenance that go beyond the ability to reproduce and share results, and will demonstrate how workflow evolution provenance can be leveraged to explain differences in data products, streamline exploratory computational tasks, and enable knowledge re-use. We will also discuss new applications that are enabled by provenance, such as social data analysis [19], which have the potential to change the way people explore data and do science.

Copyright is held by the author/owner(s).

SIGMOD’08, June 9–12, 2008, Vancouver, BC, Canada.
ACM 978-1-60558-102-6/08/06.
2. TUTORIAL OUTLINE
2.1 Overview of Scientific Workflows
We motivate the need for scientific workflows using real applica-
tions as examples, in particular within genomics, medical imaging,
environmental observatories and forecasting systems. We also in-
troduce basic concepts for scientific workflows that are related to
provenance.
Workflow and workflow-based systems have emerged as an al-
ternative to ad-hoc approaches for constructing computational sci-
entific experiments [25, 39, 41, 45, 31]. Workflow systems help
scientists conceptualize and manage the analysis process, support
scientists by allowing the creation and reuse of analysis tasks, aid
in the discovery process by managing the data used and generated
at each step, and (more recently) systematically record provenance
information for later use. Workflows are rapidly replacing primi-
tive shell scripts as evidenced by the release of Apple’s Mac OS X
Automator, Microsoft’s Workflow Foundation, and Yahoo! Pipes.
Scientific workflow systems often adopt simple computational
models, in particular a dataflow model, where the execution order
of workflow modules is determined by the flow of data through the
workflow. This is in contrast to business workflows which provide
expressive languages (such as the Business Process Execution Language, BPEL [9]) to specify complex control flows [1]. In addition, unlike business workflows, scientific workflows are often used to perform data-intensive tasks.

Workflow systems have a number of advantages for constructing and managing computational tasks compared to programs and scripts. They provide a simple programming model whereby a sequence of tasks is composed by connecting the outputs of one task to the inputs of another. Furthermore, workflow systems often provide intuitive visual programming interfaces, which make them more suitable for users who do not have substantial programming expertise. Workflows also have an explicit structure. They can be viewed as graphs, where nodes represent processes (or modules) and edges capture the flow of data between the processes. The benefits of structure are well-known when it comes to exploring data. A program (or script) is to a workflow what an unstructured document is to a (structured) database.

2.2 Managing Provenance
We first describe different kinds of provenance that can be captured for scientific workflows. Then, we discuss the three key components of a provenance management solution: the capture mechanism; the data model for representing provenance information; and the infrastructure for storing, accessing, and querying provenance. Last, but not least, we present different approaches used for each of these components and classify the different workflow systems based on a set of dimensions along which their treatments of the issues differ.

Information represented in provenance. In the context of scientific workflows, provenance is a record of the derivation of a set of results. There are two distinct forms of provenance [11]: prospective and retrospective. Prospective provenance captures the specification of a computational task (i.e., a workflow): it corresponds to the steps that need to be followed (or a recipe) to generate a data product or class of data products. Retrospective provenance captures the steps that were executed as well as information about the execution environment used to derive a specific data product, that is, a detailed log of the execution of a computational task. Figure 1 illustrates these two kinds of provenance.

Figure 1: Prospective versus retrospective provenance. The workflow generates two data products: a histogram of the scalar values of a structured grid data set; and a visualization of an isosurface of the data set. The workflow definition provides prospective provenance, a recipe to derive these two kinds of data products. On the left, we show some of the retrospective provenance that was collected during a run of this workflow. This figure also illustrates user-defined provenance in the form of annotations, shown in yellow boxes.

An important piece of information present in workflow provenance is information about causality: the dependency relationships among data products and the processes that generate them. Causality can be inferred from both prospective and retrospective provenance and it captures the sequence of steps which, together with input data and parameters, caused the creation of a data product. Causality consists of different types of dependencies. Data-process dependencies (e.g., the fact that head-hist.png was derived by the sub-workflow on the left in Figure 1) are useful for documenting the data generation process, and they can also be used to reproduce or validate the process. For example, they would allow new histograms to be derived for different input data sets. Data dependencies are also useful. For example, in the event that the CT scanner used to generate the input file head.120.vtk is found to be defective, results that depend on the scan can be invalidated by examining data dependencies.

Another key component of provenance is user-defined information. This includes documentation that cannot be automatically captured but records important decisions and notes. This data is often captured in the form of annotations. As Figure 1 illustrates, annotations can be added at different levels of granularity and associated with different components of both prospective and retrospective provenance (e.g., for modules, data products, execution log records).

Capturing, modeling, storing and querying provenance. One of the major advantages of using workflow systems is that they can be easily instrumented to automatically capture provenance; this information can be accessed directly through system APIs. While early workflow systems (e.g., Taverna [41] and Kepler [25]) have been extended to capture provenance, newer systems, such as VisTrails [45], have been designed to support provenance.
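
As an illustration of how an instrumented dataflow engine might capture provenance, the following Python sketch treats a workflow specification as prospective provenance, logs one record per module execution as retrospective provenance, and follows data dependencies to find results that would need to be invalidated. It is a minimal sketch and not the API of any system cited here: the module names, the product `head-iso.png`, the user name, and the functions `run` and `downstream` are all hypothetical.

```python
import time
from collections import defaultdict

# Prospective provenance: a hypothetical recipe. Each module names the
# data products it consumes and produces.
WORKFLOW = {
    "read_grid": {"inputs": ["head.120.vtk"], "outputs": ["grid"]},
    "histogram": {"inputs": ["grid"], "outputs": ["head-hist.png"]},
    "isosurface": {"inputs": ["grid"], "outputs": ["head-iso.png"]},
}


def run(workflow, available, user="alice"):
    """Fire each module once all of its inputs exist (dataflow order) and
    log one retrospective-provenance record per execution."""
    available = set(available)
    pending = dict(workflow)
    log = []
    while pending:
        ready = [name for name, spec in pending.items()
                 if set(spec["inputs"]) <= available]
        if not ready:
            raise ValueError("unsatisfied inputs for: %s" % sorted(pending))
        for name in ready:
            spec = pending.pop(name)
            # A real engine would execute the module here; this sketch
            # only records that it ran, on what, by whom, and when.
            log.append({"module": name,
                        "inputs": list(spec["inputs"]),
                        "outputs": list(spec["outputs"]),
                        "user": user,
                        "started": time.time()})
            available.update(spec["outputs"])
    return log


def downstream(log, source):
    """Return every data product that transitively depends on `source`."""
    derived_from = defaultdict(set)
    for record in log:
        for product in record["inputs"]:
            derived_from[product].update(record["outputs"])
    affected, frontier = set(), [source]
    while frontier:
        for product in derived_from[frontier.pop()]:
            if product not in affected:
                affected.add(product)
                frontier.append(product)
    return affected


log = run(WORKFLOW, ["head.120.vtk"])
# If the scanner behind head.120.vtk is found to be defective, these
# products are the ones to invalidate.
print(sorted(downstream(log, "head.120.vtk")))
# ['grid', 'head-hist.png', 'head-iso.png']
```

The log answers the basic questions raised in Section 1 (who ran which step, and when), while the dependency traversal uses only the inputs and outputs recorded for each step.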
Figure 2: Usable interface to refine workflows by analogy. The user chooses a pair of data products to serve as an analogy template. In this case, the pair represents a change from a workflow that downloads a file from the Web and creates a simple visualization to a new workflow in which the resulting visualization is smoothed. Then, the user chooses a set of other workflows to which the same change is applied automatically. The workflow on the left reflects the original changes, and the one on the right reflects the changes when translated to the workflow used to derive the last visualization on the right. The workflow components to be removed are shown in orange, and the ones to be added, in blue. Note that the surrounding modules do not match exactly: the system identifies the most likely match. Image from [34].
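
The idea behind the Figure 2 caption, computing the change between a pair of workflows and replaying it on another workflow, can be illustrated with a deliberately simplified Python sketch. It is not the method of [34]: it assumes linear pipelines represented as lists of module names, and it matches modules by exact name rather than finding the most likely match between workflow graphs. All module names and both functions (`pipeline_diff`, `apply_by_analogy`) are hypothetical.

```python
from difflib import SequenceMatcher


def pipeline_diff(before, after):
    """Describe how `before` was changed into `after`.

    Each change is (anchor, removed, added), where `anchor` is the module
    that immediately precedes the change in `before` (None at the start).
    """
    changes = []
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            anchor = before[i1 - 1] if i1 > 0 else None
            changes.append((anchor, before[i1:i2], after[j1:j2]))
    return changes


def apply_by_analogy(changes, target):
    """Replay `changes` on `target`, locating each anchor by module name."""
    result = list(target)
    for anchor, removed, added in changes:
        if anchor is None:
            pos = 0
        elif anchor in result:
            pos = result.index(anchor) + 1
        else:
            continue  # no plausible match in the target: skip this change
        # Drop the removed modules only if the target has them at this spot.
        if result[pos:pos + len(removed)] == removed:
            del result[pos:pos + len(removed)]
        result[pos:pos] = added
    return result


# Hypothetical analogy template (original -> smoothed) and target workflow.
original = ["DownloadFile", "ReadData", "Isosurface", "Render"]
smoothed = ["DownloadFile", "ReadData", "Isosurface", "Smooth", "Render"]
other = ["LoadLocalFile", "ReadData", "Isosurface", "Colormap", "Render"]

changes = pipeline_diff(original, smoothed)
refined = apply_by_analogy(changes, other)
print(refined)
```

In this sketch the target workflow differs from the template in its surrounding modules, yet the smoothing step is still inserted after `Isosurface` because that anchor is present in both. Real workflows are graphs rather than lists, which is why an approximate matching step is needed in practice.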

Several provenance models have been proposed in the literature [37, 28, 12, 2, 46, 26, 11, 20, 22]. All of these models support some form of retrospective provenance and many also provide the means to capture prospective provenance as well as annotations. Although these models differ in many ways, including the use of different structures and storage strategies, they all share an essential type of information: process and data dependencies. In fact, a recent exercise to explore interoperability issues among provenance models has shown that it is possible to integrate different provenance models [33].

While several approaches have been proposed to capture and model provenance, only recently has the problem of storing, accessing, and querying provenance started to receive attention. Besides allowing users to explore and better understand results, the ability to query the provenance of workflows enables knowledge re-use. For example, users can identify workflows that are suitable and can be re-used for a given task; compare and understand differences between workflows; and refine workflows by analogy (see Figure 2). Provenance information can also be associated with data products (e.g., images, graphs), allowing structured queries to be posed over these unstructured data.

A common feature across many of the approaches to querying provenance is that their solutions are closely tied to the storage models used. A wide variety of data models and storage systems have been used, ranging from specialized Semantic Web languages (e.g., RDF and OWL) and XML dialects that are stored as files to tuples stored in relational database tables. Hence, they require users to write queries in languages like SQL [3], Prolog [8] and SPARQL [46, 26, 22]. While such standard languages can be useful if users are already familiar with their syntax, none of them has been designed for provenance. For that reason, simple queries can be awkward and complex. We will discuss these approaches and contrast them to recent work on intuitive visual interfaces to query workflows [4, 34].

Provenance systems. We survey approaches to provenance adopted by scientific workflow systems. We present and compare different proposals for capturing, modeling, storing and querying provenance (e.g., [34, 13, 8, 18, 20, 29, 36, 32, 33]).

2.3 Using Provenance for Reproducibility and Beyond
We will also discuss a number of emerging applications for workflow provenance and discuss the challenges they pose to database research. Some of these applications are described below.

Provenance and scientific publications. A key benefit of maintaining provenance of computational results is reproducibility: a detailed record of the steps followed to produce a result allows others to reproduce and validate these results. Recently, the issue of publishing reproducible research has started to receive attention in the scientific community. In 2008, SIGMOD introduced the “experimental repeatability requirement” to “help published papers achieve an impact and stand as reliable reference-able works for future research”.¹ A number of journals are also encouraging authors to make their publications reproducible, including, for example, the IEEE Transactions on Signal Processing² and the Computing in Science and Engineering (CiSE) magazine.³ Provenance management infrastructure and tools have the potential to transform scientific publications as we know them today. However, for these to be widely adopted, they need to be usable and within reach for scientists who do not have computer science training.

¹ http://www.sigmod08.org/sigmod_research.shtml
² http://ewh.ieee.org/soc/sps/tsp
³ http://www.computer.org/portal/site/cise/index.jsp

Provenance and data exploration. Provenance can also be used to simplify exploratory processes. In particular, we present mechanisms that allow the flexible re-use of workflows; scalable exploration of large parameter spaces; and comparison of data products as well as their corresponding workflows [20, 35]. In addition, we show that useful knowledge is embedded in provenance which can be re-used to simplify the construction of workflows [34].

Social analysis of scientific data. Social Web sites and Web-based communities (e.g., Flickr, Facebook, Yahoo! Pipes), which facilitate collaboration and sharing between users, are becoming increasingly popular. An important benefit of these sites is that they enable users to leverage the wisdom of the crowds. In the (very) recent past, a new class of Web site has emerged that enables users to upload and collectively analyze many types of data (e.g., Many Eyes [44]). These are part of a broad phenomenon that has been called “social data analysis”. This trend is expanding to the scientific domain, where a number of collaboratories are under development. As the cost of hardware decreases over time, the cost of people goes up as analyses get more involved, larger groups need to collaborate, and the volume of data manipulated increases. Science collaboratories aim to bridge this gap by allowing scientists to share, re-use and refine their workflows. We discuss the challenges and key components that are needed to enable the development of effective social data analysis (SDA) sites for the scientific domain [19]. For example, usable interfaces that allow users to query and re-use the information in these collaboratories are key to their success. We will present recent work that has addressed usability issues in the context of workflow systems and provenance (see Figure 2).

Provenance in education. Teaching is one of the killer applications of provenance-enabled workflow systems, in particular for courses which have a strong data exploration component, such as data mining and visualization. Provenance can help instructors to be more effective and improve the students’ learning experience. By using a provenance-enabled tool in class, an instructor can keep a detailed record of all the steps she tried while responding to students’ questions; and after the class, all these results and their provenance can be made available to students. For assignments, students can turn in the detailed provenance of their work, showing all the steps they followed to solve a problem.

2.4 Open Problems
We discuss a number of open problems and outline possible directions for future research, including:

• Information management infrastructure. With the growing volume of raw data, workflows and provenance information, there is a need for efficient and effective techniques to manage these data. Besides the need to handle large volumes of heterogeneous and distributed data, an important challenge that needs to be addressed is usability: Information management systems are notoriously hard to use [23, 24]. As the need for these systems grows in a wide range of applications, notably in the scientific domain, usability is of paramount importance. The growth in the volume of provenance data also calls for techniques that deal with information overload [5].

• Provenance analytics and visualization. The problem of mining and extracting knowledge from provenance data has been largely unexplored. By analyzing and creating insightful visualizations of provenance data, scientists can debug their tasks and obtain a better understanding of their results. Mining this data may also lead to the discovery of patterns that can potentially simplify the notoriously hard, time-consuming process of designing and refining scientific workflows.

• Interoperability. Complex data products may result from long processing chains that require multiple tools (e.g., scientific workflows and visualization tools). In order to provide detailed provenance for such data products, it becomes necessary to integrate provenance derived from different systems and represented using different models. This was the goal of the Second Provenance Challenge [33], which brought together several research groups with the goal of integrating provenance across their independently developed workflow systems. Although the preliminary results are promising and indicate that such an integration is possible, more principled approaches to this problem are needed. One direction currently being investigated is the creation of a standard for representing provenance [30].

• Connecting database and workflow provenance. In many scientific applications, database manipulations co-exist with the execution of workflow modules: Data is selected from a database, potentially joined with data from other databases, reformatted, and used in an analysis. The results of the analysis may then be put into a database and potentially used in other analyses. To understand the provenance of a result, it is therefore important to be able to connect provenance information across databases and workflows. Combining these disparate forms of provenance information will require a framework in which database operators and workflow modules can be treated uniformly, and a model in which the interaction between the structure of data and the structure of workflows can be captured.

3. ABOUT THE PRESENTERS
Susan B. Davidson received a B.A. degree in Mathematics from Cornell University in 1978, and a Ph.D. degree in Electrical Engineering and Computer Science from Princeton University in 1982. Dr. Davidson joined the University of Pennsylvania in 1982, and is now the Weiss Professor and Department Chair of Computer and Information Science. She is an ACM Fellow, a Fulbright scholar, and a founding co-Director of the Center for Bioinformatics at UPenn (PCBI). Dr. Davidson’s research interests include database systems, database modeling, distributed systems, and bioinformatics. Within bioinformatics she is best known for her work in data integration, XML query and update technologies, and more recently provenance in workflow systems.

Juliana Freire joined the faculty of the School of Computing at the University of Utah in July 2005. Before that, she was a member of the technical staff at the Database Systems Research Department at Bell Laboratories (Lucent Technologies) and an Assistant Professor at OGI/OHSU. She received Ph.D. and M.S. degrees in Computer Science from the State University of New York at Stony Brook, and a B.S. degree in Computer Science from Universidade Federal do Ceará, Brazil. Dr. Freire’s research has focused on extending traditional database technology and developing techniques to address new data management problems introduced by the Web and scientific applications. She is a co-creator of VisTrails (www.vistrails.org), an open-source scientific workflow and provenance management system.

4. ACKNOWLEDGMENTS
This work was partially supported by the National Science Foundation and the Department of Energy.

5. REFERENCES
[1] W. Aalst and K. Hee. Workflow Management: Models, Methods, and Systems. MIT Press, 2002.
[2] I. Altintas, O. Barney, and E. Jaeger-Frank. Provenance collection support in the Kepler scientific workflow system. In Proceedings of the International Provenance and Annotation Workshop (IPAW), pages 118–132, 2006.
[3] R. S. Barga and L. A. Digiampietri. Automatic capture and efficient storage of e-Science experiment provenance. Concurrency and Computation: Practice and Experience, 20(5):419–429, 2008.
[4] C. Beeri, A. Eyal, S. Kamenkovich, and T. Milo. Querying business processes. In VLDB, pages 343–354, 2006.
[5] O. Biton, S. Cohen-Boulakia, S. Davidson, and C. Hara. Querying and managing provenance through user views in scientific workflows. In Proceedings of ICDE, 2008. To appear.
[6] R. Bose, I. Foster, and L. Moreau. Report on the International Provenance and Annotation Workshop. SIGMOD Rec., 35(3):51–53, 2006.
[7] R. Bose and J. Frew. Lineage retrieval for scientific data processing: a survey. ACM Computing Surveys, 37(1):1–28, 2005.
[8] S. Bowers, T. McPhillips, and B. Ludaescher. A provenance model for collection-oriented scientific workflows. Concurrency and Computation: Practice and Experience, 20(5):519–529, 2008.
[9] Business Process Execution Language for Web Services. http://www.ibm.com/developerworks/library/specification/ws-bpel/.
[10] P. Buneman and W. Tan. Provenance in databases. In Proceedings of ACM SIGMOD, pages 1171–1173, 2007.
[11] B. Clifford, I. Foster, M. Hategan, T. Stef-Praun, M. Wilde, and Y. Zhao. Tracking provenance in a virtual data grid. Concurrency and Computation: Practice and Experience, 20(5):565–575, 2008.
[12] S. Cohen, S. C. Boulakia, and S. B. Davidson. Towards a model of provenance and user views in scientific workflows. In DILS, pages 264–279, 2006.
[13] S. Cohen-Boulakia, O. Biton, S. Cohen, and S. Davidson. Addressing the provenance challenge using ZOOM. Concurrency and Computation: Practice and Experience, 20(5):497–506, 2008.
[14] S. B. Davidson, S. C. Boulakia, A. Eyal, B. Ludäscher, T. M. McPhillips, S. Bowers, M. K. Anand, and J. Freire. Provenance in scientific workflow systems. IEEE Data Eng. Bull., 30(4):44–50, 2007.
[15] E. Deelman and Y. Gil. NSF Workshop on Challenges of Scientific Workflows. Technical report, NSF, 2006. http://vtcpc.isi.edu/wiki/index.php/Main_Page.
[16] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, G. B. Berriman, J. Good, A. Laity, J. C. Jacob, and D. S. Katz. Pegasus: a Framework for Mapping Complex Scientific Workflows onto Distributed Systems. Scientific Programming Journal, 13(3):219–237, 2005.
[17] I. Foster, J. Voeckler, M. Wilde, and Y. Zhao. Chimera: A virtual data system for representing, querying and automating data derivation. In Proceedings of SSDBM, pages 37–46, 2002.
[18] J. Freire, D. Koop, E. Santos, and C. Silva. Provenance for computational tasks: A survey. Computing in Science & Engineering, 10(3), May/June 2008. To appear.
[19] J. Freire and C. Silva. Towards enabling social analysis of scientific data. In CHI Social Data Analysis Workshop, 2008. To appear.
[20] J. Freire, C. T. Silva, S. P. Callahan, E. Santos, C. E. Scheidegger, and H. T. Vo. Managing rapidly-evolving scientific workflows. In International Provenance and Annotation Workshop (IPAW), LNCS 4145, pages 10–18, 2006. Invited paper.
[21] D. Gannon et al. A Workshop on Scientific and Scholarly Workflow Cyberinfrastructure: Improving Interoperability, Sustainability and Platform Convergence in Scientific And Scholarly Workflow. Technical report, NSF and Mellon Foundation, 2007. https://spaces.internet2.edu/display/SciSchWorkflow.
[22] J. Golbeck and J. Hendler. A semantic web approach to tracking provenance in scientific workflows. Concurrency and Computation: Practice and Experience, 20(5):431–439, 2008.
[23] L. Haas. Information for people. http://www.almaden.ibm.com/cs/people/laura/ Information For People keynote.pdf, 2007. Keynote talk at ICDE.
[24] H. V. Jagadish. Making database systems usable. http://www.eecs.umich.edu/db/usable/usability-sigmod.ppt, 2007. Keynote talk at SIGMOD.
[25] The Kepler Project. http://kepler-project.org.
[26] J. Kim, E. Deelman, Y. Gil, G. Mehta, and V. Ratnakar. Provenance trails in the Wings/Pegasus system. Concurrency and Computation: Practice and Experience, 20(5):587–597, 2008.
[27] Microsoft Workflow Foundation. http://msdn2.microsoft.com/en-us/netframework/aa663322.aspx.
[28] S. Miles, P. Groth, S. Munroe, S. Jiang, T. Assandri, and L. Moreau. Extracting Causal Graphs from an Open Provenance Data Model. Concurrency and Computation: Practice and Experience, 20(5):577–586, 2008.
[29] L. Moreau and I. Foster, editors. Provenance and Annotation of Data - International Provenance and Annotation Workshop, volume 4145. Springer-Verlag, 2006.
[30] L. Moreau, J. Freire, J. Futrelle, R. McGrath, J. Myers, and P. Paulson. The open provenance model, December 2007. http://eprints.ecs.soton.ac.uk/14979.
[31] S. G. Parker and C. R. Johnson. SCIRun: a scientific programming environment for computational steering. In Supercomputing, page 52, 1995.
[32] First provenance challenge. http://twiki.ipaw.info/bin/view/Challenge/FirstProvenanceChallenge, 2006. S. Miles and L. Moreau (organizers).
[33] Second provenance challenge. http://twiki.ipaw.info/bin/view/Challenge/SecondProvenanceChallenge, 2007. J. Freire, S. Miles, and L. Moreau (organizers).
[34] C. Scheidegger, D. Koop, H. Vo, J. Freire, and C. Silva. Querying and creating visualizations by analogy. IEEE Transactions on Visualization and Computer Graphics, 13(6):1560–1567, 2007. Papers from the IEEE Information Visualization Conference 2007.
[35] C. Silva, J. Freire, and S. P. Callahan. Provenance for visualizations: Reproducibility and beyond. Computing in Science & Engineering, 9(5):82–89, 2007.
[36] Y. L. Simmhan, B. Plale, and D. Gannon. A survey of data provenance in e-science. SIGMOD Record, 34(3):31–36, 2005.
[37] Y. L. Simmhan, B. Plale, and D. Gannon. Karma2: Provenance management for data driven workflows. International Journal of Web Services Research, Idea Group Publishing, 5:1, 2008. To appear.
[38] Y. L. Simmhan, B. Plale, D. Gannon, and S. Marru. Performance evaluation of the Karma provenance framework for scientific workflows. In L. Moreau and I. T. Foster, editors, International Provenance and Annotation Workshop (IPAW), Chicago, IL, volume 4145 of Lecture Notes in Computer Science, pages 222–236. Springer, 2006.
[39] The Swift System. www.ci.uchicago.edu/swift.
[40] W. C. Tan. Provenance in databases: Past, current, and future. IEEE Data Eng. Bull., 30(4):3–12, 2007.
[41] The Taverna Project. http://taverna.sourceforge.net.
[42] The Triana Project. http://www.trianacode.org.
[43] VDS - The GriPhyN Virtual Data System. http://www.ci.uchicago.edu/wiki/bin/view/VDS/VDSWeb/WebMain.
[44] F. B. Viegas, M. Wattenberg, F. van Ham, J. Kriss, and M. McKeon. ManyEyes: a site for visualization at internet scale. IEEE Transactions on Visualization and Computer Graphics, 13(6):1121–1128, 2007.
[45] The VisTrails Project. http://www.vistrails.org.
[46] J. Zhao, C. Goble, R. Stevens, and D. Turi. Mining Taverna’s semantic web of provenance. Concurrency and Computation: Practice and Experience, 20(5):463–472, 2008.



Copyright is held by the author/owner(s).