Copyright
© Attribution Non-Commercial (BY-NC)
Towards Enabling Social Analysis of Scientific Data

Juliana Freire
School of Computing
University of Utah
Salt Lake City, UT 84112 USA
[email protected]

Cláudio Silva
SCI Institute
University of Utah
Salt Lake City, UT 84112 USA
[email protected]

Abstract
Computing has been an enormous accelerator to science and it has led to an
information explosion in many different fields. Future advances in science
depend on the ability to comprehend these vast amounts of data. In this
paper, we discuss challenges and opportunities for social data analysis in
the scientific domain.

Keywords
Social data analysis, scientific data, workflows, provenance

ACM Classification Keywords
H3 [Information Storage and Retrieval]: Information Search and Retrieval,
Online Information Services; H5.m. [Information interfaces and
presentation]: Miscellaneous.

Copyright is held by the author/owner(s).
CHI 2008, April 5 – April 10, 2008, Florence, Italy
ACM 1-xxxxxxxxxxxxxxxxxx.

Introduction
Social Web sites and web-based communities (e.g., Flickr, Facebook, Yahoo!
Pipes), which facilitate collaboration and sharing between users, are
becoming increasingly popular. An important benefit of these sites is that
they enable users to leverage the wisdom of the crowds. For example, in
Flickr, users, in a mass collaboration approach, tag large volumes of
pictures. These tags, in turn, help them to more easily find pictures they
are looking for. In the (very) recent past, a new class of Web site has
emerged that enables users to upload and collectively analyze many types of
data (e.g., Many Eyes and Swivel). These are part of a broad phenomenon
that has been called “social data analysis”. This trend is expanding to the
scientific domain where a number of collaboratories are under development.
As the cost of hardware decreases over time, the cost of people goes up as
analyses get more involved, larger groups need to collaborate, and the
volume of data manipulated increases. Science collaboratories aim to bridge
this gap by allowing scientists to share, re-use and refine their
computational tasks (workflows). In this position paper, we discuss the
challenges and key components that are needed to enable the development of
effective social data analysis (SDA) sites for the scientific domain.

Challenges and Requirements for Scientific SDA
To analyze and understand scientific data, complex computational processes
need to be assembled and insightful visualizations need to be generated,
often requiring the combination of loosely-coupled resources, specialized
libraries, and grid and Web services. The heterogeneity of the data, its
size, and its location greatly complicate the data analysis pipelines.
Whereas existing SDA systems require that data be uploaded to a central
location for analysis, this is not feasible in many scientific applications
that manipulate large volumes of data. Furthermore, the visualization
process is likely to require staged processing and a larger variety of
underlying visualization techniques than what is currently supported by
systems such as ManyEyes.¹

Data analysis generates more data (e.g., graphs, visualizations) that add
to the overflow of information scientists need to deal with. Ad-hoc
approaches to data exploration, which are widely used in the scientific
community, have serious limitations. In particular, scientists and
engineers need to expend substantial effort managing data (e.g., scripts
that encode computational tasks, raw data, data products, images, and
notes) and to record provenance information² so that basic questions can be
answered, such as: Who created a data product and when? When was it
modified and by whom? What was the process used to create the data product?
Were two data products derived from the same raw data? Not only is the
process time-consuming, it is also error-prone. The absence of provenance
makes it hard (and sometimes impossible) to reproduce and share results, to
solve problems collaboratively, to validate results with different input
data, to understand the process used to solve a particular problem, and to
re-use the knowledge involved in the data analysis process. In addition, it
greatly limits the longevity of the data products: without precise and
sufficient information about how the data product was generated, its value
is greatly diminished. SDA systems aimed at the scientific domain need to
provide a flexible framework that not only enables scientists to perform
complex analyses over large data, but that also captures detailed
provenance of the analysis process [1].

¹ http://services.alphaworks.ibm.com/manyeyes/home
² Provenance refers to all information needed to reproduce a certain piece
of data. It is also referred to as audit trail, lineage, or pedigree.

Analysis Pipelines and Provenance: The Basis for Scientific SDA
Shared pipeline and provenance repositories can expose scientists to
computational tasks that provide examples of sophisticated uses of tools.
They can also uncover common pipeline patterns that can be re-used to solve
different problems. For example, given a set of pipeline patterns with high
support in a collection, a recommendation system can suggest a series of
modules that are most likely to fit the pipeline being developed, like an
auto-completion for workflows. In addition, by analyzing how people refine
pipelines over time, we can potentially re-use these refinements in
different tasks as well as learn effective strategies to solve problems. By
querying and analyzing the information in these shared repositories,
scientists can leverage the wisdom of the crowds to learn by example,
expedite their scientific training, and potentially reduce their time to
insight. But for this to become reality, we need to give the scientists
appropriate and usable tools to explore the data in these shared
repositories. In early 2005, we started the VisTrails project in an attempt
to address some of these issues.
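
The auto-completion idea described above can be illustrated with a minimal
sketch. It rests on simplifying assumptions that are not taken from the
paper: each pipeline is reduced to a linear sequence of module names (real
workflows are graphs), and "support" is approximated by counting how often
one module directly follows another in the shared collection. The module
names, the repository contents, and the functions build_successor_counts
and suggest_next are hypothetical and are not part of VisTrails.

```python
from collections import Counter, defaultdict


def build_successor_counts(pipelines):
    """For each module, count how often each other module directly follows it."""
    successors = defaultdict(Counter)
    for pipeline in pipelines:
        for current, following in zip(pipeline, pipeline[1:]):
            successors[current][following] += 1
    return successors


def suggest_next(successors, partial_pipeline, k=3):
    """Suggest up to k modules that most often follow the last module so far."""
    if not partial_pipeline:
        return []
    last = partial_pipeline[-1]
    # .get avoids creating an empty entry for a module never seen before.
    ranked = successors.get(last, Counter()).most_common(k)
    return [module for module, _ in ranked]


# Hypothetical shared repository (assumption): linear pipelines of module names.
repository = [
    ["ReadVolume", "Resample", "Isosurface", "Render"],
    ["ReadVolume", "Resample", "Isosurface", "Smooth", "Render"],
    ["ReadVolume", "Isosurface", "Render"],
    ["ReadVolume", "Resample", "VolumeRender"],
]

counts = build_successor_counts(repository)
suggestions = suggest_next(counts, ["ReadVolume", "Resample"])
print(suggestions)
```

With this toy collection, a scientist whose pipeline so far ends in Resample
would be offered the modules that most frequently came next in the pipelines
of other users. A realistic recommender would have to match subgraphs rather
than adjacent pairs, so this should be read only as an illustration of
ranking candidates by support.
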
The VisTrails System. VisTrails is an open-source
scientific workflow and provenance management
system [2][3][4]. The system was designed to aid
users in exploratory computational tasks, such as
visualization, data mining, and simulations, which are
adjusted in an iterative process. This is in contrast to
previous scientific workflow systems, such as Kepler and
Taverna, and workflow-based visualization systems,
such as SCIRun and ParaView, which were designed
primarily for automating repetitive processes.
With VisTrails, we introduced a series of operations and intuitive
interfaces that simplify common tasks in the scientific exploration
process: scientists can easily navigate through the space of workflows
created for a given exploration task; visually compare workflows and their
results; explore large parameter spaces; and query and refine workflows by
analogy. These operations and interfaces are made possible by VisTrails'
change-based provenance model, which uniformly captures changes to
parameter values as well as to workflow definitions. Besides the lineage of
data products, this model also captures information about how workflows
evolve over time: workflows are treated as first-class data products.

Figure 1. An example of exploratory visualization for radiation treatment
planning. Complete provenance of the exploration process is displayed as a
history tree with each node representing a workflow that generates a
visualization. The visual difference interface allows users to correlate
the differences between data products (the images on the right) with the
differences between the workflows used to derive the data products.

In exploratory tasks, it is important to understand what the differences
between workflows are, especially if multiple people are collaboratively
exploring data. Figure 1 shows the visual difference interface provided by
VisTrails. A visual difference is enacted by dragging one node in the
history tree onto another, which opens a new window with a difference
workflow. Modules unique to the first node are shown in orange, modules
unique to the second node in blue, modules that are the same in dark gray,
and modules that have different parameter values in light gray. The ability
to compute differences between pipelines can be combined with the analogy
mechanism [4]: the difference between two pipelines can be applied to a
third pipeline, like a patch. This mechanism lets naive users modify
workflows without having to directly edit their definitions, and it has the
potential to lower the barrier of adoption for workflow-based systems,
which are notoriously hard to use. VisTrails also provides a
query-by-example interface which allows users to construct complex,
structure-based queries (e.g., find workflows that resample a data set before
extracting an isosurface) by example, using the same interface used to
build workflows [4].

Ongoing and Future Research. In the context of the VisTrails project, we
are developing infrastructure to enable social analysis of scientific data.

• Provenance analytics. The problem of mining and extracting knowledge from
provenance data has been largely unexplored. By analyzing provenance data,
scientists can debug their tasks and obtain a better understanding of their
results. Mining this data may also lead to the discovery of patterns that
can potentially simplify the notoriously hard, time-consuming process of
designing and refining scientific workflows.

• Usable web-enabled interfaces. By weaving services together, it is
possible to construct complex applications such as scientific workflows and
Web mashups, which both automate repetitive tasks and ensure result
reproducibility. However, constructing these applications is a non-trivial
task, especially for users who do not have programming expertise. This
problem is compounded for exploratory tasks, where the application needs to
be iteratively refined. We are working on a new framework for manipulating
collections of services and workflows, based on a workflow manipulation
language (and visual interface) that naturally represents operations that
are common in exploratory tasks.

• Information management infrastructure. With the growing volume of raw
data, pipelines and provenance information, there is a need for efficient
and effective techniques to manage these data. Besides the need to handle
large volumes of heterogeneous and distributed data, an important challenge
that needs to be addressed is usability: Information management systems are
notoriously hard to use. As the need for these systems grows in a wide
range of applications, notably in the scientific domain, usability is of
paramount importance [5][6].

Acknowledgements
This article describes work being done in the VisTrails project. We thank
all team members for their contributions. This work was partially supported
by the National Science Foundation, the Department of Energy, and IBM
Faculty Awards.

References
[1] S. Callahan, J. Freire, E. Santos, C. Scheidegger, C. Silva, and H. Vo.
Managing the evolution of dataflows with VisTrails. In IEEE Workshop on
Workflow and Data Flow for Scientific Applications (SciFlow), 2006.
[2] The VisTrails Project. http://www.vistrails.org.
[3] L. Bavoil, S. Callahan, P. Crossno, J. Freire, C. Scheidegger, C.
Silva, and H. Vo. VisTrails: Enabling Interactive Multiple-View
Visualizations. In IEEE Visualization 2005, pages 135–142, 2005.
[4] C. E. Scheidegger, H. T. Vo, D. Koop, J. Freire, and C. T. Silva.
Querying and creating visualizations by analogy. IEEE Transactions on
Visualization and Computer Graphics, 2007.
[5] L. Haas. Information for people. ICDE Keynote talk,
http://www.almaden.ibm.com/cs/people/laura/Information For People keynote.pdf,
2007.
[6] H. V. Jagadish. Making database systems usable. SIGMOD Keynote talk,
http://www.eecs.umich.edu/db/usable/usability-sigmod.ppt, 2007.
