Provenance and Scientific Workflows

This tutorial provides an overview of research issues in provenance for scientific workflows. Provenance is an essential component to allow for result reproducibility, sharing, and knowledge re-use. The provenance of a data product contains information about the process and data used to derive the data product.

Provenance and Scientific Workflows: Challenges and Opportunities

Susan B. Davidson
University of Pennsylvania
3330 Walnut Street
Philadelphia, PA 19104-6389
[email protected]

Juliana Freire
University of Utah
50 S. Central Campus Dr, rm 3190
Salt Lake City, UT 84112
[email protected]

ABSTRACT
Provenance in the context of workflows, both for the data they derive and for their specification, is an essential component to allow for result reproducibility, sharing, and knowledge re-use in the scientific community. Several workshops have been held on the topic, and it has been the focus of many research projects and prototype systems. This tutorial provides an overview of research issues in provenance for scientific workflows, with a focus on recent literature and technology in this area. It is aimed at a general database research audience and at people who work with scientific data and workflows. We will (1) provide a general overview of scientific workflows; (2) describe research on provenance for scientific workflows and show in detail how provenance is supported in existing systems; (3) discuss emerging applications that are enabled by provenance; and (4) outline open problems and new directions for database-related research.

Categories and Subject Descriptors
H.2 [Database Management]: General

General Terms
Documentation, Experimentation

Keywords
provenance, scientific workflows

1. IMPORTANCE OF PROVENANCE FOR WORKFLOWS
Computing has been an enormous accelerator to science and has led to an information explosion in many different fields. To analyze and understand scientific data, complex computational processes must be assembled, often requiring the combination of loosely-coupled resources, specialized libraries, and grid and Web services. These processes may generate many final and intermediate data products, adding to the overflow of information scientists need to deal with. Ad-hoc approaches to data exploration (e.g., Perl scripts) have been widely used in the scientific community, but have serious limitations. In particular, scientists and engineers need to expend substantial effort managing data (e.g., scripts that encode computational tasks, raw data, data products, and notes) and recording provenance information so that basic questions can be answered, such as: Who created this data product and when? When was it modified and by whom? What was the process used to create the data product? Were two data products derived from the same raw data? Not only is this process time-consuming, it is also error-prone.

Workflow systems have therefore grown in popularity within the scientific community [25, 41, 31, 42, 43, 45, 16, 17, 27, 38]. Not only do they support the automation of repetitive tasks, but they can also capture complex analysis processes at various levels of detail and systematically capture provenance information for the derived data products [15]. The provenance (also referred to as the audit trail, lineage, and pedigree) of a data product contains information about the process and data used to derive the data product. It provides important documentation that is key to preserving the data, to determining the data’s quality and authorship, and to reproducing as well as validating the results. These are all important requirements of the scientific process.

Provenance in scientific workflows is thus of paramount and increasing importance, as evidenced by recent specialized workshops [6, 15, 21, 32, 33] and surveys [18, 14, 7, 36]. While provenance in workflows bears some similarity to provenance in databases (which was the topic of a tutorial in SIGMOD’2007 [10] and a recent survey [40]), there are important differences and new challenges for the database community to consider.

Our objective in this tutorial is to give an overview of the problem of managing provenance data for scientific workflows, illustrate some of the techniques that have been developed to address different aspects of the problem, and outline interesting directions for future work in the area. In particular, we will present techniques for reducing provenance overload as well as making provenance information more “fine-grained.” We will examine uses of provenance that go beyond the ability to reproduce and share results, and will demonstrate how workflow evolution provenance can be leveraged to explain differences in data products, streamline exploratory computational tasks, and enable knowledge re-use. We will also discuss new applications that are enabled by provenance, such as social data analysis [19], which have the potential to change the way people explore data and do science.

Copyright is held by the author/owner(s).


SIGMOD’08, June 9–12, 2008, Vancouver, BC, Canada.
ACM 978-1-60558-102-6/08/06.
2. TUTORIAL OUTLINE
2.1 Overview of Scientific Workflows
We motivate the need for scientific workflows using real applications as examples, in particular within genomics, medical imaging, environmental observatories and forecasting systems. We also introduce basic concepts for scientific workflows that are related to provenance.

Workflow and workflow-based systems have emerged as an alternative to ad-hoc approaches for constructing computational scientific experiments [25, 39, 41, 45, 31]. Workflow systems help scientists conceptualize and manage the analysis process, support scientists by allowing the creation and reuse of analysis tasks, aid in the discovery process by managing the data used and generated at each step, and (more recently) systematically record provenance information for later use. Workflows are rapidly replacing primitive shell scripts, as evidenced by the release of Apple’s Mac OS X Automator, Microsoft’s Workflow Foundation, and Yahoo! Pipes.

Scientific workflow systems often adopt simple computational models, in particular a dataflow model, where the execution order of workflow modules is determined by the flow of data through the workflow. This is in contrast to business workflows, which provide expressive languages (such as the Business Process Execution Language, BPEL [9]) to specify complex control flows [1]. In addition, unlike business workflows, scientific workflows are often used to perform data-intensive tasks.

Workflow systems have a number of advantages for constructing and managing computational tasks compared to programs and scripts. They provide a simple programming model whereby a sequence of tasks is composed by connecting the outputs of one task to the inputs of another. Furthermore, workflow systems often provide intuitive visual programming interfaces, which make them more suitable for users who do not have substantial programming expertise. Workflows also have an explicit structure. They can be viewed as graphs, where nodes represent processes (or modules) and edges capture the flow of data between the processes. The benefits of structure are well-known when it comes to exploring data. A program (or script) is to a workflow what an unstructured document is to a (structured) database.

2.2 Managing Provenance
We first describe different kinds of provenance that can be captured for scientific workflows. Then, we discuss the three key components of a provenance management solution: the capture mechanism; the data model for representing provenance information; and the infrastructure for storing, accessing, and querying provenance. Last, but not least, we present different approaches used for each of these components and classify the different workflow systems based on a set of dimensions along which their treatments of the issues differ.

Information represented in provenance. In the context of scientific workflows, provenance is a record of the derivation of a set of results. There are two distinct forms of provenance [11]: prospective and retrospective. Prospective provenance captures the specification of a computational task (i.e., a workflow); it corresponds to the steps that need to be followed (or a recipe) to generate a data product or class of data products. Retrospective provenance captures the steps that were executed as well as information about the execution environment used to derive a specific data product, that is, a detailed log of the execution of a computational task. Figure 1 illustrates these two kinds of provenance.

Figure 1: Prospective versus retrospective provenance. The workflow generates two data products: a histogram of the scalar values of a structured grid data set; and a visualization of an isosurface of the data set. The workflow definition provides prospective provenance, a recipe to derive these two kinds of data products. On the left, we show some of the retrospective provenance that was collected during a run of this workflow. This figure also illustrates user-defined provenance in the form of annotations, shown in yellow boxes.

An important piece of information present in workflow provenance is information about causality: the dependency relationships among data products and the processes that generate them. Causality can be inferred from both prospective and retrospective provenance, and it captures the sequence of steps which, together with input data and parameters, caused the creation of a data product. Causality consists of different types of dependencies. Data-process dependencies (e.g., the fact that head-hist.png was derived by the sub-workflow on the left in Figure 1) are useful for documenting the data generation process, and they can also be used to reproduce or validate the process. For example, they would allow new histograms to be derived for different input data sets. Data dependencies are also useful. For example, in the event that the CT scanner used to generate the input file head.120.vtk is found to be defective, results that depend on the scan can be invalidated by examining data dependencies.

Another key component of provenance is user-defined information. This includes documentation that cannot be automatically captured but records important decisions and notes. This data is often captured in the form of annotations. As Figure 1 illustrates, annotations can be added at different levels of granularity and associated with different components of both prospective and retrospective provenance (e.g., for modules, data products, execution log records).

Capturing, modeling, storing and querying provenance. One of the major advantages of using workflow systems is that they can be easily instrumented to automatically capture provenance; this information can be accessed directly through system APIs. While early workflow systems (e.g., Taverna [41] and Kepler [25]) have been extended to capture provenance, newer systems, such as VisTrails [45], have been designed to support provenance.
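
The distinction between prospective and retrospective provenance, and the use of data dependencies to invalidate results, can be made concrete with a small sketch. The code below is illustrative only and does not reproduce the API of any system cited here: the classes `Module`, `Workflow` and `RunLog`, the stand-in functions, the `isovalue` parameter, and the output name `head-iso.png` are all assumptions introduced for the example. Only `head.120.vtk` and `head-hist.png` come from Figure 1. The workflow definition plays the role of prospective provenance, the log written during execution plays the role of retrospective provenance, and `depends_on` follows data dependencies transitively.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Module:
    """One workflow step: a function, the names of its inputs, and its output."""
    name: str
    func: Callable[..., object]
    inputs: List[str]
    output: str


@dataclass
class RunLog:
    """Retrospective provenance: what actually executed, and on which data."""
    records: List[Tuple[str, Tuple[str, ...], str]] = field(default_factory=list)
    data: Dict[str, object] = field(default_factory=dict)

    def depends_on(self, source: str) -> Set[str]:
        """Data products derived, directly or transitively, from `source`."""
        affected: Set[str] = set()
        frontier = {source}
        while frontier:
            frontier = {out for _, ins, out in self.records
                        if frontier & set(ins) and out not in affected}
            affected |= frontier
        return affected


@dataclass
class Workflow:
    """Prospective provenance: the recipe, independent of any particular run."""
    modules: List[Module]

    def run(self, raw: Dict[str, object]) -> RunLog:
        log = RunLog(data=dict(raw))
        pending = list(self.modules)
        while pending:
            # Dataflow execution: a module fires once all of its inputs exist.
            ready = [m for m in pending if all(i in log.data for i in m.inputs)]
            if not ready:
                raise ValueError("unsatisfied inputs; workflow cannot proceed")
            for m in ready:
                log.data[m.output] = m.func(*(log.data[i] for i in m.inputs))
                log.records.append((m.name, tuple(m.inputs), m.output))
                pending.remove(m)
        return log


# Stand-in computations; real modules would read VTK data and render images.
def histogram(values):
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return counts


def isosurface(values, level):
    return [v for v in values if v >= level]


workflow = Workflow([
    Module("Histogram", histogram, ["head.120.vtk"], "head-hist.png"),
    Module("Isosurface", isosurface, ["head.120.vtk", "isovalue"], "head-iso.png"),
])
log = workflow.run({"head.120.vtk": [1, 2, 2, 3], "isovalue": 2})

# If the scanner that produced head.120.vtk is defective, these are suspect:
print(sorted(log.depends_on("head.120.vtk")))  # ['head-hist.png', 'head-iso.png']
```

Keeping the recipe (`Workflow`) separate from the execution record (`RunLog`) means the same definition can be re-run on a different input data set, while each run keeps its own log for later dependency queries.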
Figure 2: Usable interface to refine workflows by analogy. The user chooses a pair of data products to serve as an analogy template.
In this case, the pair represents a change to a workflow that downloads a file from the Web and creates a simple visualization, into
a new workflow where the resulting visualization is smoothed. Then, the user chooses a set of other workflows to apply the same
change automatically. The workflow on the left reflects the original changes, and the one on the right reflects the changes when
translated to the workflow used to derive the last visualization on the right. The workflow components to be removed are shown in
orange, and the ones to be added, in blue. Note that the surrounding modules do not match exactly: the system identifies the
most likely match. Image from [34].
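
The idea behind Figure 2 can be sketched in a few lines. This is a deliberately simplified illustration and not the algorithm of [34]: it assumes workflows are linear pipelines of module names, it locates the change site by exact match on the module preceding the change, and all module names (`HTTPFile`, `Contour`, `Smooth`, and so on) are hypothetical. The real system works on workflow graphs and has to find the most likely match when surrounding modules differ.

```python
from typing import List, Optional, Tuple

Pipeline = List[str]  # simplification: a workflow is a linear list of module names
Change = Tuple[Optional[str], List[str], List[str]]  # (predecessor, removed, added)


def diff(before: Pipeline, after: Pipeline) -> Change:
    """Describe how `before` was changed into `after` as one replaced segment."""
    shortest = min(len(before), len(after))
    p = 0  # length of the common prefix
    while p < shortest and before[p] == after[p]:
        p += 1
    s = 0  # length of the common suffix that does not overlap the prefix
    while s < shortest - p and before[len(before) - 1 - s] == after[len(after) - 1 - s]:
        s += 1
    removed = before[p:len(before) - s]
    added = after[p:len(after) - s]
    predecessor = before[p - 1] if p > 0 else None
    return predecessor, removed, added


def apply_by_analogy(change: Change, target: Pipeline) -> Pipeline:
    """Replay `change` on `target`, anchored at the matching predecessor module."""
    predecessor, removed, added = change
    if predecessor is None:
        start = 0
    elif predecessor in target:
        start = target.index(predecessor) + 1
    else:
        raise ValueError(f"no module named {predecessor!r} in target workflow")
    if target[start:start + len(removed)] != removed:
        raise ValueError("modules to remove not found at the matched position")
    return target[:start] + added + target[start + len(removed):]


# Analogy template: a visualization pipeline and its smoothed variant.
template_before = ["HTTPFile", "Reader", "Contour", "Render"]
template_after = ["HTTPFile", "Reader", "Contour", "Smooth", "Render"]
change = diff(template_before, template_after)

# A different workflow whose surroundings do not match the template exactly.
target = ["LocalFile", "Reader", "Contour", "Decimate", "Render"]
result = apply_by_analogy(change, target)
print(result)
```

Here the change is captured once from the template pair and then replayed on any number of other workflows, which is what lets a user refine many workflows without editing each one by hand.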

Several provenance models have been proposed in the literature [37, 28, 12, 2, 46, 26, 11, 20, 22]. All of these models support some form of retrospective provenance and many also provide the means to capture prospective provenance as well as annotations. Although these models differ in many ways, including the use of different structures and storage strategies, they all share an essential type of information: process and data dependencies. In fact, a recent exercise to explore interoperability issues among provenance models has shown that it is possible to integrate different provenance models [33].

While several approaches have been proposed to capture and model provenance, only recently has the problem of storing, accessing, and querying provenance started to receive attention. Besides allowing users to explore and better understand results, the ability to query the provenance of workflows enables knowledge re-use. For example, users can identify workflows that are suitable and can be re-used for a given task; compare and understand differences between workflows; and refine workflows by analogy (see Figure 2). Provenance information can also be associated with data products (e.g., images, graphs), allowing structured queries to be posed over these unstructured data.

A common feature across many of the approaches to querying provenance is that their solutions are closely tied to the storage models used. A wide variety of data models and storage systems have been used, ranging from specialized Semantic Web languages (e.g., RDF and OWL) and XML dialects that are stored as files to tuples stored in relational database tables. Hence, they require users to write queries in languages like SQL [3], Prolog [8] and SPARQL [46, 26, 22]. While such standard languages can be useful if users are already familiar with their syntax, none of them have been designed for provenance. For that reason, simple queries can be awkward and complex. We will discuss these approaches and contrast them with recent work on intuitive visual interfaces to query workflows [4, 34].

Provenance systems. We survey approaches to provenance adopted by scientific workflow systems. We present and compare different proposals for capturing, modeling, storing and querying provenance (e.g., [34, 13, 8, 18, 20, 29, 36, 32, 33]).

2.3 Using Provenance for Reproducibility and Beyond
We will also discuss a number of emerging applications for workflow provenance and the challenges they pose to database research. Some of these applications are described below.

Provenance and scientific publications. A key benefit of maintaining provenance of computational results is reproducibility: a detailed record of the steps followed to produce a result allows others to reproduce and validate these results. Recently, the issue of publishing reproducible research has started to receive attention in the scientific community. In 2008, SIGMOD introduced the “experimental repeatability requirement” to “help published papers achieve an impact and stand as reliable reference-able works for future research”.¹ A number of journals are also encouraging authors to make their publications reproducible, including, for example, the IEEE Transactions on Signal Processing² and the Computing in Science and Engineering (CiSE) magazine.³ Provenance management infrastructure and tools have the potential to transform scientific publications as we know them today. However, for these to be widely adopted, they need to be usable and within reach for scientists who do not have computer science training.

¹ http://www.sigmod08.org/sigmod_research.shtml
² http://ewh.ieee.org/soc/sps/tsp
³ http://www.computer.org/portal/site/cise/index.jsp

Provenance and data exploration. Provenance can also be used to simplify exploratory processes. In particular, we present mechanisms that allow the flexible re-use of workflows; scalable exploration of large parameter spaces; and comparison of data products as well as their corresponding workflows [20, 35]. In addition, we show that useful knowledge is embedded in provenance, which can be re-used to simplify the construction of workflows [34].

Social analysis of scientific data. Social Web sites and Web-based communities (e.g., Flickr, Facebook, Yahoo! Pipes), which facilitate collaboration and sharing between users, are becoming increasingly popular. An important benefit of these sites is that they enable users to leverage the wisdom of the crowds. In the (very) recent past, a new class of Web site has emerged that enables users to upload and collectively analyze many types of data (e.g., Many Eyes [44]). These are part of a broad phenomenon that has been called “social data analysis”. This trend is expanding to the scientific domain, where a number of collaboratories are under development. As the cost of hardware decreases over time, the cost of people goes up as analyses get more involved, larger groups need to collaborate, and the volume of data manipulated increases. Science collaboratories aim to bridge this gap by allowing scientists to share, re-use and refine their workflows. We discuss the challenges and key components that are needed to enable the development of effective social data analysis (SDA) sites for the scientific domain [19]. For example, usable interfaces that allow users to query and re-use the information in these collaboratories are key to their success. We will present recent work that has addressed usability issues in the context of workflow systems and provenance (see Figure 2).

Provenance in education. Teaching is one of the killer applications of provenance-enabled workflow systems, in particular for courses which have a strong data exploration component, such as data mining and visualization. Provenance can help instructors to be more effective and improve the students’ learning experience. By using a provenance-enabled tool in class, an instructor can keep a detailed record of all the steps she tried while responding to students’ questions; after the class, all these results and their provenance can be made available to students. For assignments, students can turn in the detailed provenance of their work, showing all the steps they followed to solve a problem.

2.4 Open Problems
We discuss a number of open problems and outline possible directions for future research, including:

• Information management infrastructure. With the growing volume of raw data, workflows and provenance information, there is a need for efficient and effective techniques to manage these data. Besides the need to handle large volumes of heterogeneous and distributed data, an important challenge that needs to be addressed is usability: Information management systems are notoriously hard to use [23, 24]. As the need for these systems grows in a wide range of applications, notably in the scientific domain, usability is of paramount importance. The growth in the volume of provenance data also calls for techniques that deal with information overload [5].

• Provenance analytics and visualization. The problem of mining and extracting knowledge from provenance data has been largely unexplored. By analyzing and creating insightful visualizations of provenance data, scientists can debug their tasks and obtain a better understanding of their results. Mining this data may also lead to the discovery of patterns that can potentially simplify the notoriously hard, time-consuming process of designing and refining scientific workflows.

• Interoperability. Complex data products may result from long processing chains that require multiple tools (e.g., scientific workflows and visualization tools). In order to provide detailed provenance for such data products, it becomes necessary to integrate provenance derived from different systems and represented using different models. This was the goal of the Second Provenance Challenge [33], which brought together several research groups with the goal of integrating provenance across their independently developed workflow systems. Although the preliminary results are promising and indicate that such an integration is possible, more principled approaches to this problem are needed. One direction currently being investigated is the creation of a standard for representing provenance [30].

• Connecting database and workflow provenance. In many scientific applications, database manipulations co-exist with the execution of workflow modules: Data is selected from a database, potentially joined with data from other databases, reformatted, and used in an analysis. The results of the analysis may then be put into a database and potentially used in other analyses. To understand the provenance of a result, it is therefore important to be able to connect provenance information across databases and workflows. Combining these disparate forms of provenance information will require a framework in which database operators and workflow modules can be treated uniformly, and a model in which the interaction between the structure of data and the structure of workflows can be captured.

3. ABOUT THE PRESENTERS
Susan B. Davidson received a B.A. degree in Mathematics from Cornell University in 1978, and a Ph.D. degree in Electrical Engineering and Computer Science from Princeton University in 1982. Dr. Davidson joined the University of Pennsylvania in 1982, and is now the Weiss Professor and Department Chair of Computer and Information Science. She is an ACM Fellow, a Fulbright scholar, and a founding co-Director of the Center for Bioinformatics at UPenn (PCBI). Dr. Davidson’s research interests include database systems, database modeling, distributed systems, and bioinformatics. Within bioinformatics she is best known for her work in data integration, XML query and update technologies, and more recently provenance in workflow systems.

Juliana Freire joined the faculty of the School of Computing at the University of Utah in July 2005. Before that, she was a member of technical staff at the Database Systems Research Department at Bell Laboratories (Lucent Technologies) and an Assistant Professor at OGI/OHSU. She received Ph.D. and M.S. degrees in Computer Science from the State University of New York at Stony Brook, and a B.S. degree in Computer Science from Universidade Federal do Ceará, Brazil. Dr. Freire’s research has focused on extending traditional database technology and developing techniques to address new data management problems introduced by the Web and scientific applications. She is a co-creator of VisTrails (www.vistrails.org), an open-source scientific workflow and provenance management system.

4. ACKNOWLEDGMENTS
This work was partially supported by the National Science Foundation and the Department of Energy.

5. REFERENCES
[1] W. Aalst and K. Hee. Workflow Management: Models,
Methods, and Systems. MIT Press, 2002.
[2] I. Altintas, O. Barney, and E. Jaeger-Frank. Provenance
collection support in the kepler scientific workflow system.
In Proceedings of the International Provenance and
Annotation Workshop (IPAW), pages 118–132, 2006.
[3] R. S. Barga and L. A. Digiampietri. Automatic capture and
efficient storage of escience experiment provenance.
Concurrency and Computation: Practice and Experience,
20(5):419–429, 2008.
[4] C. Beeri, A. Eyal, S. Kamenkovich, and T. Milo. Querying
business processes. In VLDB, pages 343–354, 2006.
[5] O. Biton, S. Cohen-Boulakia, S. Davidson, and C. Hara.
Querying and managing provenance through user views in
scientific workflows. In Proceedings of ICDE, 2008. To
appear.
[6] R. Bose, I. Foster, and L. Moreau. Report on the
International Provenance and Annotation Workshop.
SIGMOD Rec., 35(3):51–53, 2006.
[7] R. Bose and J. Frew. Lineage retrieval for scientific data
processing: a survey. ACM Computing Surveys, 37(1):1–28,
2005.
[8] S. Bowers, T. McPhillips, and B. Ludaescher. A provenance
model for collection-oriented scientific workflows.
Concurrency and Computation: Practice and Experience,
20(5):519–529, 2008.
[9] Business Process Execution Language for Web Services.
http://www.ibm.com/developerworks/library/specification/ws-bpel/.
[10] P. Buneman and W. Tan. Provenance in databases. In
Proceedings of ACM SIGMOD, pages 1171–1173, 2007.
[11] B. Clifford, I. Foster, M. Hategan, T. Stef-Praun, M. Wilde,
and Y. Zhao. Tracking provenance in a virtual data grid.
Concurrency and Computation: Practice and Experience,
20(5):565–575, 2008.
[12] S. Cohen, S. C. Boulakia, and S. B. Davidson. Towards a
model of provenance and user views in scientific workflows.
In DILS, pages 264–279, 2006.
[13] S. Cohen-Boulakia, O. Biton, S. Cohen, and S. Davidson.
Addressing the provenance challenge using zoom.
Concurrency and Computation: Practice and Experience,
20(5):497–506, 2008.
[14] S. B. Davidson, S. C. Boulakia, A. Eyal, B. Ludäscher, T. M.
McPhillips, S. Bowers, M. K. Anand, and J. Freire.
Provenance in scientific workflow systems. IEEE Data Eng.
Bull., 30(4):44–50, 2007.
[15] E. Deelman and Y. Gil. NSF Workshop on Challenges of
Scientific Workflows. Technical report, NSF, 2006.
http://vtcpc.isi.edu/wiki/index.php/Main_Page.
[16] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil,
C. Kesselman, G. Mehta, K. Vahi, G. B. Berriman, J. Good,
A. Laity, J. C. Jacob, and D. S. Katz. Pegasus: a Framework
for Mapping Complex Scientific Workflows onto Distributed
Systems. Scientific Programming Journal, 13(3):219–237,
2005.
[17] I. Foster, J. Voeckler, M. Wilde, and Y. Zhao. Chimera: A
virtual data system for representing, querying and
automating data derivation. In Proceedings of SSDBM, pages
37–46, 2002.
[18] J. Freire, D. Koop, E. Santos, and C. Silva. Provenance for
computational tasks: A survey. Computing in Science &
Engineering, 10(3), May/June 2008. To appear.
[19] J. Freire and C. Silva. Towards enabling social analysis of
scientific data. In CHI Social Data Analysis Workshop, 2008.
To appear.
[20] J. Freire, C. T. Silva, S. P. Callahan, E. Santos, C. E.
Scheidegger, and H. T. Vo. Managing rapidly-evolving
scientific workflows. In International Provenance and
Annotation Workshop (IPAW), LNCS 4145, pages 10–18,
2006. Invited paper.
[21] D. Gannon et al. A Workshop on Scientific and Scholarly
Workflow Cyberinfrastructure: Improving Interoperability,
Sustainability and Platform Convergence in Scientific And
Scholarly Workflow. Technical report, NSF and Mellon
Foundation, 2007.
https://spaces.internet2.edu/display/SciSchWorkflow.
[22] J. Golbeck and J. Hendler. A semantic web approach to
tracking provenance in scientific workflows. Concurrency
and Computation: Practice and Experience, 20(5):431–439,
2008.
[23] L. Haas. Information for people.
http://www.almaden.ibm.com/cs/people/laura/ Information For
People keynote.pdf, 2007. Keynote talk at ICDE.
[24] H. V. Jagadish. Making database systems usable.
http://www.eecs.umich.edu/db/usable/usability-sigmod.ppt,
2007. Keynote talk at SIGMOD.
[25] The Kepler Project. http://kepler-project.org.
[26] J. Kim, E. Deelman, Y. Gil, G. Mehta, and V. Ratnakar.
Provenance trails in the wings/pegasus system. Concurrency
and Computation: Practice and Experience, 20(5):587–597,
2008.
[27] Microsoft Workflow Foundation.
http://msdn2.microsoft.com/en-us/netframework/aa663322.aspx.
[28] S. Miles, P. Groth, S. Munroe, S. Jiang, T. Assandri, and
L. Moreau. Extracting Causal Graphs from an Open
Provenance Data Model. Concurrency and Computation:
Practice and Experience, 20(5):577–586, 2008.
[29] L. Moreau and I. Foster, editors. Provenance and Annotation
of Data - International Provenance and Annotation
Workshop, volume 4145. Springer-Verlag, 2006.
[30] L. Moreau, J. Freire, J. Futrelle, R. McGrath, J. Myers, and
P. Paulson. The open provenance model, December 2007.
http://eprints.ecs.soton.ac.uk/14979.
[31] S. G. Parker and C. R. Johnson. SCIRun: a scientific
programming environment for computational steering. In
Supercomputing, page 52, 1995.
[32] First provenance challenge.
http://twiki.ipaw.info/bin/view/Challenge/FirstProvenanceChallenge,
2006. S. Miles, and L. Moreau (organizers).
[33] Second provenance challenge.
http://twiki.ipaw.info/bin/view/Challenge/SecondProvenanceChallenge,
2007. J. Freire, S. Miles, and L. Moreau (organizers).
[34] C. Scheidegger, D. Koop, H. Vo, J. Freire, and C. Silva.
Querying and creating visualizations by analogy. IEEE
Transactions on Visualization and Computer Graphics,
13(6):1560–1567, 2007. Papers from the IEEE Information
Visualization Conference 2007.
[35] C. Silva, J. Freire, and S. P. Callahan. Provenance for
visualizations: Reproducibility and beyond. Computing in
Science & Engineering, 9(5):82–89, 2007.
[36] Y. L. Simmhan, B. Plale, and D. Gannon. A survey of data
provenance in e-science. SIGMOD Record, 34(3):31–36,
2005.
[37] Y. L. Simmhan, B. Plale, and D. Gannon. Karma2:
Provenance management for data driven workflows.
International Journal of Web Services Research, Idea Group
Publishing, 5:1, 2008. To Appear.
[38] Y. L. Simmhan, B. Plale, D. Gannon, and S. Marru.
Performance evaluation of the karma provenance framework
for scientific workflows. In L. Moreau and I. T. Foster,
editors, International Provenance and Annotation Workshop
(IPAW), Chicago, IL, volume 4145 of Lecture Notes in
Computer Science, pages 222–236. Springer, 2006.
[39] The Swift System. www.ci.uchicago.edu/swift.
[40] W. C. Tan. Provenance in databases: Past, current, and
future. IEEE Data Eng. Bull., 30(4):3–12, 2007.
[41] The Taverna Project. http://taverna.sourceforge.net.
[42] The Triana Project. http://www.trianacode.org.
[43] VDS - The GriPhyN Virtual Data System.
http://www.ci.uchicago.edu/wiki/bin/view/VDS/VDSWeb/WebMain.
[44] F. B. Viegas, M. Wattenberg, F. van Ham, J. Kriss, and
M. McKeon. Manyeyes: a site for visualization at internet
scale. IEEE Transactions on Visualization and Computer
Graphics, 13(6):1121–1128, 2007.
[45] The VisTrails Project. http://www.vistrails.org.
[46] J. Zhao, C. Goble, R. Stevens, and D. Turi. Mining taverna’s
semantic web of provenance. Concurrency and Computation:
Practice and Experience, 20(5):463–472, 2008.
