Towards Enabling Social Analysis of Scientific Data

Juliana Freire
School of Computing
University of Utah
Salt Lake City, UT 84112 USA
[email protected]

Cláudio Silva
SCI Institute
University of Utah
Salt Lake City, UT 84112 USA
[email protected]

Abstract
Computing has been an enormous accelerator to science and it has led to an information explosion in many different fields. Future advances in science depend on the ability to comprehend these vast amounts of data. In this paper, we discuss challenges and opportunities for social data analysis in the scientific domain.

Keywords
Social data analysis, scientific data, workflows, provenance

ACM Classification Keywords
H3 [Information Storage and Retrieval]: Information Search and Retrieval, Online Information Services; H5.m. [Information interfaces and presentation]: Miscellaneous.

Copyright is held by the author/owner(s).
CHI 2008, April 5 – April 10, 2008, Florence, Italy
ACM 1-xxxxxxxxxxxxxxxxxx.

Introduction
Social Web sites and web-based communities (e.g., Flickr, Facebook, Yahoo! Pipes), which facilitate collaboration and sharing between users, are becoming increasingly popular. An important benefit of these sites is that they enable users to leverage the wisdom of the crowds. For example, in Flickr, users, in a mass collaboration approach, tag large volumes of pictures. These tags, in turn, help them to more easily find pictures they are looking for. In the (very) recent past, a new class of Web site has emerged that enables users to upload and collectively analyze many types of data (e.g., Many Eyes and Swivel). These are part of a broad phenomenon that has been called “social data analysis”. This trend is expanding to the scientific domain where a number of collaboratories are under development. As the cost of hardware decreases over time, the cost of people goes up as analyses get more involved, larger groups need to collaborate, and the volume of data manipulated increases. Science collaboratories aim to bridge this gap by allowing scientists to share, re-use and refine their computational tasks (workflows). In this position paper, we discuss the challenges and key components that are needed to enable the development of effective social data analysis (SDA) sites for the scientific domain.

Challenges and Requirements for Scientific SDA
To analyze and understand scientific data, complex computational processes need to be assembled and insightful visualizations need to be generated, often requiring the combination of loosely-coupled resources, specialized libraries, and grid and Web services. The heterogeneity, size, and location of the data greatly complicate the data analysis pipelines. Whereas existing SDA systems require that data be uploaded to a central location for analysis, in many scientific applications that manipulate large volumes of data, this is not feasible. Furthermore, the visualization process is likely to require staged processing and a larger variety of underlying visualization techniques than what is currently supported by systems such as ManyEyes.¹

Data analysis generates more data (e.g., graphs, visualizations) that add to the overflow of information scientists need to deal with. Ad-hoc approaches to data exploration, which are widely used in the scientific community, have serious limitations. In particular, scientists and engineers need to expend substantial effort managing data (e.g., scripts that encode computational tasks, raw data, data products, images, and notes) and recording provenance information² so that basic questions can be answered, such as: Who created a data product and when? When was it modified and by whom? What was the process used to create the data product? Were two data products derived from the same raw data? Not only is this process time-consuming, it is also error-prone. The absence of provenance makes it hard (and sometimes impossible) to reproduce and share results, to solve problems collaboratively, to validate results with different input data, to understand the process used to solve a particular problem, and to re-use the knowledge involved in the data analysis process. In addition, it greatly limits the longevity of the data products: without precise and sufficient information about how a data product was generated, its value is greatly diminished. SDA systems aimed at the scientific domain need to provide a flexible framework that not only enables scientists to perform complex analyses over large data, but that also captures detailed provenance of the analysis process [1].

¹ http://services.alphaworks.ibm.com/manyeyes/home
² Provenance refers to all information needed to reproduce a certain piece of data. It is also referred to as audit trail, lineage, or pedigree.

Analysis Pipelines and Provenance: The Basis for Scientific SDA
Shared pipeline and provenance repositories can expose scientists to computational tasks that provide examples of sophisticated uses of tools. They can also uncover common pipeline patterns that can be re-used to solve different problems. For example, given a set of pipeline patterns with high support in a collection, a recommendation system can suggest a series of modules that are most likely to fit the pipeline being developed, like an auto-completion for workflows. In addition, by analyzing how people refine pipelines over time, we can potentially re-use these refinements in different tasks as well as learn effective strategies to solve problems. By querying and analyzing the information in these shared repositories, scientists can leverage the wisdom of the crowds to learn by example, expedite their scientific training, and potentially reduce their time to insight. But for this to become reality, we need to give scientists appropriate and usable tools to explore the data in these shared repositories. In early 2005, we started the VisTrails project in an attempt to address some of these issues.
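
The idea of an auto-completion for workflows can be illustrated with a minimal sketch. It is not the method of any particular system: it assumes that each pipeline in a shared repository is stored as a list of (source module, destination module) edges, defines the support of an edge as the number of pipelines containing it, and ranks candidate successor modules by that support. The module names, the `REPOSITORY` data, and the function names are hypothetical.

```python
from collections import Counter

def mine_edge_support(pipelines):
    """Count, for each directed edge (src, dst), how many pipelines contain it."""
    support = Counter()
    for edges in pipelines:
        support.update(set(edges))  # count each edge at most once per pipeline
    return support

def suggest_next(support, current_module, min_support=2, k=3):
    """Suggest up to k successor modules of current_module, ranked by edge support."""
    candidates = [
        (dst, count)
        for (src, dst), count in support.items()
        if src == current_module and count >= min_support
    ]
    # Highest support first; ties broken by module name for determinism.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [dst for dst, _ in candidates[:k]]

# Hypothetical repository: four pipelines, each a list of module-to-module edges.
REPOSITORY = [
    [("FileReader", "Resample"), ("Resample", "Isosurface"), ("Isosurface", "Renderer")],
    [("FileReader", "Resample"), ("Resample", "Isosurface"),
     ("Isosurface", "Smooth"), ("Smooth", "Renderer")],
    [("FileReader", "Isosurface"), ("Isosurface", "Renderer")],
    [("FileReader", "Resample"), ("Resample", "VolumeRender")],
]

if __name__ == "__main__":
    support = mine_edge_support(REPOSITORY)
    # Which module most often follows "Resample" in the repository?
    print(suggest_next(support, "Resample"))
```

A realistic recommender would likely mine larger subgraph patterns rather than single edges; the edge-level count is used here only to keep the sketch short.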
The VisTrails System. VisTrails is an open-source scientific workflow and provenance management system [2][3][4]. The system was designed to aid users in exploratory computational tasks, such as visualization, data mining, and simulations, which are adjusted in an iterative process. This is in contrast to previous scientific workflow systems, such as Kepler and Taverna, and workflow-based visualization systems, such as SCIRun and ParaView, which were designed primarily for automating repetitive processes.
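
One way to keep provenance while a task is adjusted iteratively is to store each adjustment as a change relative to a parent version, so that any earlier workflow can be rebuilt by replaying changes from the root. The following is a minimal sketch under that assumption, not the VisTrails data model: the `Version` and `History` classes, the change tuples, and the module and parameter names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Version:
    """One node in the exploration history: a change applied to a parent version."""
    id: int
    parent: Optional[int]
    user: str
    change: tuple  # ("add_module", name) | ("delete_module", name) | ("set_param", module, key, value)

@dataclass
class History:
    versions: dict = field(default_factory=dict)

    def commit(self, parent, user, change):
        """Record a change on top of `parent` and return the new version id."""
        vid = len(self.versions) + 1
        self.versions[vid] = Version(vid, parent, user, change)
        return vid

    def changes_to(self, vid):
        """Return the changes leading from the root to `vid`, oldest first."""
        chain = []
        while vid is not None:
            version = self.versions[vid]
            chain.append(version.change)
            vid = version.parent
        return list(reversed(chain))

    def materialize(self, vid):
        """Rebuild the workflow at `vid` (module name -> parameters) by replaying changes."""
        workflow = {}
        for change in self.changes_to(vid):
            if change[0] == "add_module":
                workflow[change[1]] = {}
            elif change[0] == "delete_module":
                workflow.pop(change[1], None)
            elif change[0] == "set_param":
                _, module, key, value = change
                workflow[module][key] = value
        return workflow

if __name__ == "__main__":
    history = History()
    v1 = history.commit(None, "alice", ("add_module", "Isosurface"))
    v2 = history.commit(v1, "alice", ("set_param", "Isosurface", "isovalue", 57))
    v3 = history.commit(v1, "bob", ("set_param", "Isosurface", "isovalue", 90))  # a branch
    print(history.materialize(v2), history.materialize(v3))
```

Because every version records who made which change, questions such as who created a workflow, or whether two workflows share an ancestor, reduce to walking the tree.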
With VisTrails, we introduced a series of operations and intuitive interfaces that simplify common tasks in the scientific exploration process: scientists can easily navigate through the space of workflows created for a given exploration task; visually compare workflows and their results; explore large parameter spaces; and query and refine workflows by analogy. These operations and interfaces are made possible by VisTrails' change-based provenance model, which uniformly captures changes to parameter values as well as to workflow definitions. Besides the lineage of data products, this model also captures information about how workflows evolve over time: workflows are treated as first-class data products. In exploratory tasks, it is important to understand what the differences between workflows are, especially if multiple people are collaboratively exploring data. Figure 1 shows the visual difference interface provided by VisTrails. A visual difference is enacted by dragging one node in the history tree onto another, which opens a new window with a difference workflow. Modules unique to the first node are shown in orange, modules unique to the second node in blue, modules that are the same in dark gray, and modules that have different parameter values in light gray. The ability to compute differences between pipelines can be combined with the analogy mechanism [4]: the difference between two pipelines can be applied to a third pipeline, like a patch. This mechanism lets naive users modify workflows without having to directly edit their definitions, and it has the potential to lower the barrier of adoption for workflow-based systems, which are notoriously hard to use.

Figure 1. An example of exploratory visualization for radiation treatment planning. Complete provenance of the exploration process is displayed as a history tree with each node representing a workflow that generates a visualization. The visual difference interface allows users to correlate the differences between data products (the images on the right) with the differences between the workflows used to derive the data products.

VisTrails also provides a query-by-example interface, which allows users to construct complex, structure-based queries (e.g., find workflows that resample a data set before
extracting an isosurface) by example, using the same interface used to build workflows [4].

Ongoing and Future Research. In the context of the VisTrails project, we are developing infrastructure to enable social analysis of scientific data.

• Provenance analytics. The problem of mining and extracting knowledge from provenance data has been largely unexplored. By analyzing provenance data, scientists can debug their tasks and obtain a better understanding of their results. Mining this data may also lead to the discovery of patterns that can potentially simplify the notoriously hard, time-consuming process of designing and refining scientific workflows.

• Usable web-enabled interfaces. By weaving services together, it is possible to construct complex applications such as scientific workflows and Web mashups, which both automate repetitive tasks and ensure result reproducibility. However, constructing these applications is a non-trivial task, especially for users who do not have programming expertise. This problem is compounded for exploratory tasks, where the application needs to be iteratively refined. We are working on a new framework for manipulating collections of services and workflows, based on a workflow manipulation language (and visual interface) that naturally represents operations that are common in exploratory tasks.

• Information management infrastructure. With the growing volume of raw data, pipelines and provenance information, there is a need for efficient and effective techniques to manage these data. Besides the need to handle large volumes of heterogeneous and distributed data, an important challenge that needs to be addressed is usability: information management systems are notoriously hard to use. As the need for these systems grows in a wide range of applications, notably in the scientific domain, usability is of paramount importance [5][6].

Acknowledgements
This article describes work being done in the VisTrails project. We thank all team members for their contributions. This work was partially supported by the National Science Foundation, the Department of Energy, and IBM Faculty Awards.

References
[1] S. Callahan, J. Freire, E. Santos, C. Scheidegger, C. Silva, and H. Vo. Managing the evolution of dataflows with VisTrails. In IEEE Workshop on Workflow and Data Flow for Scientific Applications (SciFlow), 2006.
[2] The VisTrails Project. http://www.vistrails.org.
[3] L. Bavoil, S. Callahan, P. Crossno, J. Freire, C. Scheidegger, C. Silva, and H. Vo. VisTrails: Enabling Interactive Multiple-View Visualizations. In IEEE Visualization 2005, pages 135–142, 2005.
[4] C. E. Scheidegger, H. T. Vo, D. Koop, J. Freire, and C. T. Silva. Querying and creating visualizations by analogy. IEEE Transactions on Visualization and Computer Graphics, 2007.
[5] L. Haas. Information for people. ICDE Keynote talk, http://www.almaden.ibm.com/cs/people/laura/Information For People keynote.pdf, 2007.
[6] H. V. Jagadish. Making database systems usable. SIGMOD Keynote talk, http://www.eecs.umich.edu/db/usable/usability-sigmod.ppt, 2007.
