Characteristics of Collaboration in the Emerging Practice of Open Data Analysis
This new abundance of government data provides fodder for civic hacking. Civic hackers make use of open data to build software applications with the aim of providing transparency and better understanding of government functions. Moreover, there is an emerging community of non-profit organizations, startups, independent information technologists and volunteers engaged in this analysis [17]. One example is a local community group in Chicago that created an interactive visualization of all lobbyist activity in the city, including lobbyists, lobbying firms, clients, and actions sought by lobbyists from the city (ChicagoLobbyists.org). Other examples include hackathons, such as the Green Hackathon, that have brought together 20-60 individuals with broad expertise to work on societal issues and have resulted in software products that use open data [39]. One such application combined supply chain information with child labor data from the UN to provide an estimate of the likelihood that child labor was used in the manufacturing of specific products.

Several papers from the HCI and CSCW communities have described some isolated uses of open government data within specific domains. For example, researchers have studied the use of Geographic Information System (GIS) data to empower regional communities [35, 38] and tax data to engage citizens with the tradeoffs in government spending [19]. Others have studied certain practices in the area of open data. Boehner and DiSalvo [2] interviewed the leaders of civic tech in Atlanta, finding that openness in government data is more a spectrum than a binary. Erete and colleagues [9] showed that non-profit organizations use data-driven stories as arguments to potential funders and stakeholders. As yet, there have been no attempts to give a broad overview of the analysis of open data from a CSCW perspective.

Data Science and Data Analysis

Advances in hardware and software technologies have led to a rapid increase in the amount of data collected. Companies and organizations are recognizing the advantages of using this data in decision making and are hiring people with the skills to exploit it. This has led to the burgeoning field of "data science". Despite the recognition of the importance of data science and the need to train data scientists, the field and its skills are fuzzily defined [31]. Data scientists are expected to make meaning from data using a broad collection of skills. There is little to no academic research about data scientists and their work practices; instead, a majority of the discussion has come from position articles in popular media. Harris and colleagues [13] have argued that data scientists come from many different backgrounds that draw analytic skills from five different areas: business, machine learning, math, programming, and statistics. A perfect data scientist is often described as a "unicorn" because it is impossible for an individual to have all the skills needed. Renowned data scientists have urged their field to make use of more teams because it is so difficult for any individual to gain a complete skillset [29, 30].

Collaboration is common in the practice of statistics, one of the parent disciplines of data science. One frequent type of collaboration is between a set of domain scientists and one or more statisticians [16]. Data analysis has multiple stages, from problem formulation to data collection through analysis to conclusion [24]. Collaboration and communication between domain scientists and statisticians is important throughout all stages, but particularly during the problem formulation period. Because the domain scientist may not clearly formulate the problem, the goal of the statistician is to listen and draw out the nature of the problem, then reformulate it in a way that can be tested statistically [18]. In this way the statistician establishes "a mapping from the client's domain to a statistical question" [12]. Chatfield [5] argues that statistical tasks are tricky because the context of the data matters: there is often messiness in the data, and the objectives of the analysis are not necessarily clear. The statistician is encouraged to ask many questions of the domain scientist to gain background information and context to understand the data. Because communication during this period is both difficult and critical, Chatfield suggests the following: "from bitter experience, I particularly advise against consulting by telephone or electronic mail, where one cannot see the data". In this type of collaboration domain scientists provide understanding of the problem, the goals, and the data; statisticians provide the technical skills to construct the appropriate analysis and extract meaningful results.

Open Science, Open Collaboration, and Open Innovation

Open data analysis shares commonalities with several forms of collaboration in which sharing and openness are important tenets. There is a movement toward more open sharing in science, particularly of data. Data sharing holds scientists accountable by allowing others to confirm findings. Data sharing also accelerates scientific progress through the reuse of a valuable resource [23]. In spite of these advantages, data sharing in science is difficult [1]. One obstacle is the willingness of scientists to share their data. There is a tradeoff between cooperation and openness on one hand and competition and secrecy on the other [36], and different scientific disciplines adopt different norms of openness. Another difficulty is in the use of shared data. Scientists must assess whether a given dataset is relevant, whether they can understand the data, and whether they trust the data before deciding whether or not to reuse it [10]. Data often lacks adequate documentation to understand the context in which it was created, its format, and the meaning of its fields [1]. Understanding the data often requires interaction with one of its creators [32]. Open data analysis, like the movement toward open science, involves the sharing and reuse of open data.

Open data analysis also involves the joint production of a shared artifact. Forte and Lampe [11] define open collaboration as online collaboration that satisfies four conditions: it must produce a shared artifact; collaboration must be supported by a technological platform; this platform must allow contributors to enter and exit the collaboration easily; and the platform must allow for flexible social structures. The two most studied, prototypical examples of open collaboration are encyclopedia editing on Wikipedia and open source software development. Easy entry into a collaboration on technologically-mediated collaboration platforms allows large-scale participation [11]. Successful open source projects can attract tens of thousands of participants [26].
However, easy exit means turnover is high in open collaboration [8]. On Wikipedia, a large majority of editors only make a few edits on one occasion [4]. While technologically-mediated communication helps to facilitate large-scale collaboration by reducing the costs of communication, it may not be well suited for collaboration that requires high levels of iterative feedback between participants [11]. Technologically mediated communication often lacks the richness needed to establish common ground and support tightly coupled work [27].

Open data analysis also shares similarities with the Do-It-Yourself (DIY) and maker movements. The maker movement is the practice of working with materials (e.g. electronics, fabrics) and fabrication tools [22]. Some have argued that it represents the "democratization of technological practice" [33]. Like work with open data, many participants embrace a hacker ethos in which creating, playfulness, and tinkering are encouraged [37]. Offline collaborations in hackerspaces are as important as online spaces [33, 22]. The movement has a mix of lay experts and professionals and has been described both as a hobby activity and as a form of open innovation that leads to the creation of professional manufacturing products [22]. Wang and Kaye [37] argue that it has looser community boundaries than traditional communities of practice and describe it as a collective of practice.

RESEARCH QUESTIONS

Given the large quantities of data and the complexity and multidisciplinary nature of data analysis, collaboration is likely to play an important role in the analysis of open government data. Governments make hundreds of datasets available and the most interesting and valuable analyses often come from combining several of these in a novel way [14]. Thus, no one individual can single-handedly analyze all of the available data. Even the work involved for a single project can be substantial and may require a variety of skills and knowledge. Open data is often provided in bad formats and must be extracted, cleaned, and processed; this requires coding skills. To test hypotheses and claims requires statistical knowledge. Context is often critical to understand data, to understand how statistical models fit into research questions, and to interpret the results of these models [5]. Thus, we expect that individuals working with open data would often work in collaboration with others.

One of the goals of this project was to understand how collaboration unfolds in open data analysis projects. Open data analysis shares commonalities and differences with multiple forms of collaboration in which sharing and openness are important tenets, such as open science, open collaboration (e.g. open source software), and the maker movement. To address this question, we employ the Lee and Paine [20] Model of Coordinated Action (MoCA). MoCA is a descriptive model used to understand collaborative work; it expands Johansen's 1988 time-space matrix, with its two dimensions of synchronicity and physical distribution, to seven dimensions: synchronicity, physical distribution, scale, planned permanence, turnover, number of communities of practice, and nascence. Collaborations can be characterized along each dimension. Participants either communicate at the same time or at different times (synchronicity); communication is remote or in person (physical distance); many people participate or few people participate (scale); collaborations are short-term or long-term (planned permanence); and turnover is high or low. Some collaborations draw participants from many different backgrounds (number of communities of practice), and these participants have different norms, practices, expertise and tools. Work practices can be established, routine, and well understood, or they may be unestablished and in development (nascence).
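To make these dimensions concrete, here is a minimal sketch, in Python, of how a single project's coding along them might be recorded; the structure and all values are our own illustration, not part of MoCA itself or of any project we studied.

from dataclasses import dataclass, field

@dataclass
class MoCACoding:
    """One project's position on the seven MoCA dimensions (illustrative)."""
    synchronicity: str          # "synchronous", "asynchronous", or "mixed"
    physical_distribution: str  # "co-located", "remote", or "mixed"
    scale: int                  # number of participants
    planned_permanence: str     # e.g. "open-ended" or "fixed end date"
    turnover: str               # "low" or "high"
    communities_of_practice: list = field(default_factory=list)  # backgrounds represented
    nascence: str = "developing"  # "established" vs. "developing" work practices

# Hypothetical coding of a small civic hacking project:
lobbying_viz = MoCACoding(
    synchronicity="mixed",
    physical_distribution="co-located",
    scale=4,
    planned_permanence="open-ended",
    turnover="low",
    communities_of_practice=["software development", "journalism"],
    nascence="developing",
)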
MoCA can be used to describe meaningful differences in work practices. For example, a collaboration in which individuals come from many different communities of practice entails a culturally diverse group with different norms, practices, tools, and languages. Members can make use of their different backgrounds by engaging in work in complementary ways, but diverse backgrounds may also cause more difficulty in working together. We use MoCA to address the specific question:

Research Question 1: How do open data analysis groups coordinate their analytic activities?

Open data analysis projects use open data to produce a tangible artifact, such as a tool or a report. A second goal of this project was to understand what types of artifacts were being created. The intent of analyzing data is to produce insights, such as identifying trends, observing anomalies, and drawing meaningful inferences. These insights can be used to reflect on government practices and to suggest changes in these practices.

There are multiple approaches for extracting insights from data. By building data processing, summarization, and visualization tools, projects make data more accessible to others. Tools provide the means for an audience to find their own insights in the data and to create their own meaning from these insights. Alternatively, authors analyze data and summarize their analyses in reports. In reports, authors present the insights they discovered, their interpretation, and their conclusions to an audience. There are different types of analyses, some of which are more complex than others [21]. Exploratory analyses identify trends, correlations, or relationships in the data. These analyses can be used to generate ideas, but have not been formally evaluated. Inferential analyses evaluate whether a pattern will continue to hold for new samples. Finally, predictive analyses use a set of features to predict an outcome of interest for a single person or unit. The latter two, inferential and predictive analyses, require substantially more skill to apply, but can provide more reliable conclusions.
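To illustrate the distinction among the three types of analysis, consider the following Python sketch on synthetic data; the dataset, column names, and numbers are invented for illustration and are not drawn from any project we studied.

import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for an open dataset: share of workers earning
# minimum wage, sampled over ten years.
rng = np.random.default_rng(0)
years = np.repeat(np.arange(2000, 2010), 50)
share = 0.05 + 0.002 * (years - 2000) + rng.normal(0, 0.01, years.size)
df = pd.DataFrame({"year": years, "min_wage_share": share})

# Exploratory: look for a trend; this generates ideas but proves nothing.
print(df.groupby("year")["min_wage_share"].mean())

# Inferential: test whether the apparent trend would hold in new samples.
result = stats.linregress(df["year"], df["min_wage_share"])
print(f"slope={result.slope:.4f}, p={result.pvalue:.3g}")

# Predictive: use features to predict the outcome for a new unit.
model = LinearRegression().fit(df[["year"]], df["min_wage_share"])
print(model.predict(pd.DataFrame({"year": [2010]})))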
We categorized projects based on the type of artifact (tool, or exploratory, inferential, or predictive analysis) they created to address the question:

Research Question 2: What types of artifacts are being produced by open data analysis projects?
semi-structured questions (e.g. "How did you communicate with each other during the project?", "Did you work remotely or in the same place?") asked during the interviews and from responses to specific questions asked of survey participants (e.g. "How frequently did other people join your project after it was started?"). While coding interview transcripts, an additional dimension, beneficiary of the project, emerged as another important aspect of collaboration. We included it as an eighth dimension.

Scale

Scale refers to the size of the collaborating group. We asked interviewees and survey participants how many people worked on their project in some capacity, such as by providing feedback, providing guidance, or conducting analyses. Both interviewees and survey participants reported working on data analysis projects in small teams (Table 1). So many projects may have been small in part because participants found collaborators primarily among people they already knew. Several interviewees described seeking out colleagues and friends as collaborators because they knew these people had the expertise they needed to analyze the data. For example, Participant 13, a data journalist who worked with super PAC donation data, contacted a colleague at an organization which he knew had experience reporting on political donations. Another interviewee, who worked as a data analyst at a governmental institution, contacted a colleague who had expertise with human resources data to help provide context to understand the data (Participant 4). Participant 6, a civic hacker, said that they often looked for people with relevant domain expertise (e.g. labor law, health care) from within their organization when starting a project.

Survey participants were evenly split, finding collaborators through personal connections, such as work or friends (38%), online through organizations like Code for America (42%), or both (21%). While most groups were small, there were a few exceptions; one of our survey participants indicated that he worked on the project in a group of 40 people.

Another factor that affected scale was the degree to which project groups were open. Some projects made their materials open, allowed anyone to join, and made their end products open; others were only partially open. Participant 7 used GitHub to make all code publicly available both during and after the project. In contrast, Participant 14, a data journalist, worked with two other journalists; they compiled data from multiple sources including open and (previously) closed data, only making their results available once they were finished. On average, projects made their end products "Mostly open". Groups were bimodal in terms of making their materials open and allowing anyone to join. Data journalists on average
made their project data, code and materials open, but did not
allow anyone to contribute to the project. By contrast, civic
hackers were more likely to make their project data, code, and
materials open and to allow anyone to join. Larger groups
were formed when project materials were made available and
anyone could join.
Turnover

Turnover refers to the frequency with which old members leave the group or how often new members join. In general, member turnover was rarely identified in the interviews, which is consistent with our survey findings. We asked survey participants to rate how frequently other people joined or left their projects. On average, survey participants reported that other people "Rarely" joined or left a project after it began. Many civic hacking projects (e.g. Participants 6 and 7) were almost entirely open throughout the whole life cycle of the project. However, limited resources often prevented groups from actively responding to and incorporating feedback from others during development.

Planned Permanence

It was difficult to address planned permanence as described by the MoCA framework because of the decentralized, informal nature of open data analysis. The majority of groups did not start with a fixed end date for their projects. Hence, rather than ask how long projects were intended to last, we asked participants how long their projects had actually lasted. Projects lasted anywhere from two days to nearly four years. Most projects had a specific end goal, and most of the projects reached that goal, creating a preliminary or finalized tool or report.

Number of Communities of Practice

A community of practice is a collection of people who share norms, practices, expertise, and tools. Participants and their collaborators came from multiple communities of practice, from software development to data science to journalism to city government. In describing the members of their groups, participants described collaborators with heterogeneous backgrounds. Furthermore, people with different backgrounds played different roles within the group.

Interviewees reported that within almost all of the projects there was at least one person who acted as a domain expert and at least one person who acted as a technical expert. Domain experts provided information about the larger context of the data, including explaining what was and was not captured by the data, identifying other sources of data, and identifying interesting and meaningful questions to ask with the data. Technical experts completed most of the work and provided guidance on which analytic methods to use. They also helped shape the questions by using "quantitative thinking". Participant 8, a front-end web developer, described how his team worked with a city employee who understood regulatory frameworks in order to parse the data and focus in on the most important parts. He said "she was the domain expert. I am just a software engineer ... There were like 7 different forms to fill in to enter campaign finance data and there were tons of different ways to fill out the seven forms ... she let me know the three things that she thought were most important out of all seven forms". This helped the team focus their attention on the most interesting parts of the data. She also helped them understand what data was missing: only contributions of at least $100 were recorded in the data.

The role of domain and technical experts was similar to the roles of "thinker and doer" [25], where the domain experts did more of the thinking and framing of the work and the technical expert did more of the implementation and conducted the analyses.

Figure 1. Mean level of experience (standard error) estimated for each area of expertise. Participants rated their own expertise (Self) and the most expert other member of the group for each area (Others).

Group members came from many different backgrounds. This resulted in groups that had a wide variety of skills and experience. Survey participants rated their own and other team members' levels of expertise (Figure 1, Figure 2). More than half of groups had members with at least one year of training per subject in two or more specialized areas of inferential statistics, software development or machine learning. These are areas of skill that come from very different schools of training. Individuals skilled at software development are unlikely to be skilled at inferential statistics. Most groups also had members with at least one year of experience in two or more domain-specific areas, such as government, journalism, activism, or non-profits. As we described in the section on scale, collaborators were often chosen specifically because they had the complementary expertise needed to understand the data.

Not only are many different communities of practice actively engaged with open government data, even within groups there are people from many different backgrounds. These people bring together a diversity of skills and practices that make such groups highly interdisciplinary. This interdisciplinarity is in part intentional, with people from different backgrounds playing different roles within the group.

Synchronicity and Physical Distribution

Many interviewees made use of regular synchronous communication. Participant 15 and her collaborator spoke on the phone regularly while they were trying to design and scope the project. Participant 15, who had more experience with data science projects, was in charge of coordinating the project. She worked with her collaborator to identify a project goal and an appropriate data set. These conversations helped them to shape the project into one that would provide tangible benefits and that could be carried out in the few months they had to work on the project. Similarly, Participant 18 met
appeal to a larger group, they had to make educated guesses about how to produce something that would effectively appeal to their intended audience.

Survey responses were somewhat consistent with interviewee responses, with the largest number of survey participants indicating that their intended audience was the citizens of a region. This was followed by projects that built tools or conducted analyses for one of the members of their group. A smaller percentage indicated that their audience was a specific client or a specific group, such as a government official or agency.

Research Question 2: What types of artifacts are being produced by open data analysis projects?

To understand what types of artifacts were produced by these open data analysis projects we coded the interview transcripts and project materials when provided. Two types of projects emerged: those that conducted statistical analyses to address a specific research question and those that built a tool for end users to explore the data on their own. For projects that conducted statistical analyses, transcripts and project materials were further coded to identify the type of question: exploratory, inferential, or predictive.

Slightly under half of the projects built tools for end users (Table 2). These projects developed software programs or websites that made the data easier to use for others. Some projects built tools for readers to explore the data. In New York City, one group built an interactive map using 311 citizen complaints so that readers could explore which neighborhoods had the most rat-infested restaurants. Participant 3 helped create a visualization tool for port officials to monitor real-time international shipping price data. Using this tool, port officials could observe unexpected changes in prices that could help them detect fraud. This tool allowed end users to monitor changes in the data in near real-time. Other projects included tools to support data analysis by speeding up the processing of data (e.g. conversion between data file formats). These tools empower end users to use data to come to their own conclusions. Using interactive visualizations, end users can focus in on specific data points, monitor trends over time, and make their own comparisons. The purpose of these types of projects is not to make an observation, to make an argument, or to support a decision. Instead the purpose is to make it easier for others to use the data. While some of these projects included visualization, none made use of statistical analyses.
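As a rough sketch of the kind of aggregation such a tool performs behind its interface, consider the following Python fragment; the file name and column names are invented, not taken from the New York project.

import pandas as pd

# Hypothetical export of open 311 complaint data (columns invented).
complaints = pd.read_csv("311_complaints.csv")

# Count rodent-related complaints per neighborhood so readers can
# explore which areas have the most reports.
rodent = complaints[complaints["complaint_type"] == "Rodent"]
by_neighborhood = rodent.groupby("neighborhood").size().sort_values(ascending=False)
print(by_neighborhood.head(10))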
The other half of projects aimed to extract insights from data, and these insights were often summarized in a report. The vast majority of these projects used descriptive statistics or exploratory analyses to draw insights from the data, while only a few projects used inferential statistics or predictive statistics (Table 2). Exploratory analyses focused on finding patterns in the data, such as trends over time, anomalies, or extreme values. Almost all of the exploratory projects made heavy use of visualizations. For example, Participant 12 investigated whether it takes longer to get out of minimum wage jobs now than it did in the past. For this they created a visualization of changes in the percentage of workers who held minimum wage jobs now and in the past. They then used these visualizations to make an argument that escaping minimum wage jobs does take longer than in the past.

Predictive analyses were used to inform decisions. Participant 4 constructed projections based on census data to plan for various future scenarios. A local government intended to place a limited number of language institutes around their county, and this data project aimed to find an optimal distribution for these institutes to maximize both the number of people who would benefit and the diversity of immigrant communities they served.
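The interview did not specify how the placement was computed; one plausible way to frame the objective described here is a greedy coverage heuristic, sketched below in Python with invented data and an arbitrary weighting between population served and community diversity.

# Hypothetical sketch only: choose k sites to maximize residents served,
# with a bonus for covering not-yet-served immigrant communities.
# Candidate site -> (residents served, communities served); values invented.
candidates = {
    "site_a": (1200, {"korean", "spanish"}),
    "site_b": (900, {"spanish"}),
    "site_c": (700, {"mandarin", "arabic"}),
    "site_d": (1100, {"korean"}),
}
DIVERSITY_BONUS = 500  # arbitrary weight per newly covered community

def place_institutes(candidates, k):
    chosen, covered = [], set()
    for _ in range(k):
        def gain(site):
            residents, communities = candidates[site]
            return residents + DIVERSITY_BONUS * len(communities - covered)
        best = max((s for s in candidates if s not in chosen), key=gain)
        chosen.append(best)
        covered |= candidates[best][1]
    return chosen

print(place_institutes(candidates, k=2))  # -> ['site_a', 'site_c']

A real deployment would negotiate the weighting between the two objectives with stakeholders rather than fix it as a constant.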
Through interviews and surveys we found that nearly half of projects left it up to the end users to draw their own conclusions, while the other half drew conclusions that relied almost exclusively on exploratory and descriptive statistics. Very few projects used sophisticated statistical analyses.

DISCUSSION

Through interviews and survey responses we gathered information on 40 projects that involved the analysis of open government data. We characterized the way in which work was coordinated and we categorized the type of artifacts produced by these projects. Three major themes emerged. One, groups were typically small, with low turnover, and relied heavily on synchronous communication. Two, interdisciplinarity played an important part in the formation of groups and the roles individuals played within these groups. Three, very few projects produced artifacts that used sophisticated statistical methods such as inferential or predictive analyses.

In these respects, open data analysis shares some similarities and differences with other forms of open collaboration. Like prototypical open collaboration (e.g. Wikipedia, open source software), the production of a shared artifact was central to open data analysis; unlike prototypical open collaboration, this shared artifact varied in the degree to which it was open, and work on this artifact was not universally supported by a technologically mediated collaboration platform. Open data analysis projects had different levels of openness. All projects made use of data that was at least partially open
and most made their end products open. However, they varied in whether the project was open to new collaborators and whether materials were open while work was taking place. Only some projects used GitHub, an online version control system with social transparency designed for software development [8]. Technologically mediated platforms with low barriers to entry and exit and flexible social structures enable the large-scale, asynchronous, high-turnover collaborations typical in most open collaboration [11]. Inconsistent norms about openness and the lack of a universal, technologically mediated platform may partially explain why we observed open data analysis collaborations that were small, with low turnover and synchronous communication.

Open data analysis shares more similarities with less prototypical forms of open collaboration such as the maker movement and open science. Similar to the maker movement, there is no universal technologically mediated collaboration platform. In the maker movement, as in open data analysis, collaboration takes place through a variety of different means. Sharing of designs and ideas takes place in person or on a variety of online websites (e.g. the Ikea Hacks website, Instructables) [33, 37]. Collaboration frequently takes place offline in hackerspaces and Fab Labs [22]. In hackerspaces individuals exchange knowledge of fabrication techniques; these spaces are used to collaborate, to learn and to teach [33]. The lack of central technologically-mediated collaboration likely shapes practice both at a community and artifact level for both the maker movement and open data analysis. Decentralization likely creates looser community boundaries; both activities are better explained as collectives of practice rather than communities of practice [37]. Decentralization also may explain why collaborations are smaller in scale.

The nature of data analysis tasks may create demands that constrain collaboration practices as well. Many projects organized work interdependently to support interdisciplinary roles within groups. Domain experts and technical experts took on the roles of thinker and doer, respectively, which required iterative feedback between these two types of experts. Using data that was collected by someone else is difficult. This is one of the challenges that scientists face in the reuse of other scientists' data. Data often lacks adequate documentation to understand the context in which it was created, its format, and its meaning [1]. Scientists often need to interact with the original creators of the data in order to fully understand it [32]. In open data analysis projects, domain experts who have more familiarity with the data play an invaluable role explaining to technical experts the meaning of data entries and fields and assessing issues of data quality. Domain experts also acted as advisors, guiding research questions and interpretation. Through back-and-forth discussions technical experts provided new results while domain experts gave feedback on these results. This pattern of feedback resembles the back-and-forth communication between scientists and statisticians that helps statisticians turn scientific questions into statistical questions [12].

Analysis of open data requires interdisciplinary skills that a single individual rarely possesses. The task demands of interdisciplinarity engender a high level of interdependence, which in turn may explain why collaborations are typically small in scale and use synchronous communication. Many forms of technologically-mediated communication that help collaborations scale may be insufficient to support the iterative feedback required by complex, interdependent work [27].

Open data analysis is an emerging practice, in which the contributors, norms, methods, and artifacts are still developing. Currently we find that collaboration is interdisciplinary, interdependent, and small in scale, with low turnover and synchronous communication. We argue that these characteristics stem from the lack of a centralized, technologically mediated collaboration platform as well as the task demands inherent in reusing data and performing statistical analysis. We expect open data analysis as a practice to evolve rapidly. Collective norms develop over time and, while norms of openness and sharing are currently heterogeneous, they may converge towards greater openness. More openness, together with the development of a technologically-mediated collaboration platform to support data analysis, might facilitate the larger-scale collaborations typical of other forms of open collaboration. A greater total quantity of work can be completed with larger collaborations.

Similarly, techniques, methods, and objectives also develop over time. On average, participants were highly educated, and project groups had contributors with years of experience in relevant technical areas and subject domains. Despite these skills, research questions remained exploratory. The availability of data science technologies, which have lowered barriers to entry in data science, may not be enough to make sophisticated analyses accessible even for well-educated people [6]. For the few cases in which sophisticated analyses were used, these projects were often modeled after other existing projects. It may take time to build up a collective repository of ideas to support more complex methods and questions.

In this paper we characterized collaboration in open data analysis using the Model of Coordinated Action [20]. This paper is one of the first to apply MoCA to describe collaboration for an emerging coordinated action. The model provided a systematic framework to compare and contrast collaborative practices in open data analysis against other forms of collaboration. This paper demonstrates that MoCA is an effective framework for making task- and platform-independent comparisons. The largest challenge we faced in using MoCA was the operationalization of its seven dimensions. Nascence, in particular, was difficult to measure. We chose to operationalize nascence as the degree of uncertainty individuals felt in their work. However, it was difficult to untangle whether individuals felt uncertainty because of the inherent uncertainty in discovering meaning from data or because individuals were trying to figure out which questions, methods, and tools to use in their analyses. There are also important aspects of collaboration that fall outside the scope of MoCA. For example, we found that the intended audience of the project shaped collaboration in open data projects. Future work will be required to determine whether seven dimensions are sufficient to characterize coordinated actions.
Limitations and Future Work

The greatest limitation of this study is the low survey sample size. We gathered survey data to provide quantitative data to complement results from our interviews and to increase our sample size. Even so, we were only able to recruit a small number of survey participants, despite multiple strategies for recruiting a larger survey sample, including posting recruitment messages in multiple online locales, sending personalized email messages, and providing a monetary incentive (albeit a low one). One of the challenges in studying open data analysis is that it has not yet developed a unified community of practice. This creates two complications. First, the lack of a unified community led to difficulties in recruiting a representative cross section of participants. Second, the lack of a centralized community made it hard to identify a sizable sample of the community. The participants that we were able to recruit are likely to be more actively involved in the projects than typical individuals and more likely to identify with open data as a community of practice. As a result of the low sample size, and the heterogeneity within this community, we do not believe these participants are necessarily representative of all individuals who work with open government data. Instead, what the data does provide is a collection of over 40 example projects. Using this set of projects we have identified a number of patterns and themes in the way that these groups collaborate. Though these themes may not hold true for all projects, they are at least important considerations for many such projects. As an exploratory study, this work lays the groundwork for future research, which will hopefully complement these findings using a broader and more representative sample.

Future work should look at open data analysis using a global sample. We specifically focused on the practices of open data analysis in a single country because different countries have very different political climates. Collaboration and the use of open data to fight government corruption in countries with substantial political repression or retribution may be very different from the forms of collaboration in the U.S.

In this paper we observed that large-scale collaborations are less typical of open data analysis than of other, more prototypical forms of open collaboration. In part, this can be explained by the lack of a centralized, technologically mediated collaboration platform. Future work should evaluate this claim as well as investigate what sorts of platforms could best support data analysis. We found that some projects used GitHub, but this platform may not be well suited for data analysis. In particular, it lacks some technical capabilities such as the ability to store large quantities of data, to develop documentation and metadata for data sets, and version control that supports data cleaning and processing. We also argue that this work requires interdisciplinarity and interdependent work which may not be supported by the limited communication channels built into platforms like GitHub.
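To illustrate the kind of dataset documentation such a platform could keep under version control alongside the data itself, here is a minimal metadata record, written as a Python dictionary and loosely inspired by the Frictionless Data "data package" idea; every value is invented.

# Illustrative only: minimal metadata a platform could version with a dataset.
dataset_metadata = {
    "name": "city-lobbyist-actions",
    "description": "Actions sought by registered lobbyists, one row per filing.",
    "source": "https://data.example.gov/lobbying",  # hypothetical open data portal
    "retrieved": "2016-06-01",
    "fields": [
        {"name": "lobbyist_id", "type": "string", "notes": "stable across filings"},
        {"name": "client", "type": "string"},
        {"name": "amount_usd", "type": "number", "notes": "blank if under $100"},
    ],
    "provenance": "Exported monthly by the city clerk; known gaps before 2012.",
}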
CONCLUSION

The analysis of open government data is expected to encourage citizens to participate in government as well as to improve transparency and efficiency in government processes. We found that interdisciplinarity was important and that groups were typically small, had low turnover, and relied heavily on synchronous communication. We found that most of the projects analyzing government data asked exploratory questions and made use of descriptive statistics and visualizations rather than more sophisticated questions and approaches. The emerging practice of open data analysis faces many challenges going forward, including how to tackle more complex questions, how to collaborate effectively with so many different communities of practice, and how to collaborate in ways that scale when interdependent teamwork is so important.

REFERENCES

1. Jeremy P. Birnholtz and Matthew J. Bietz. 2003. Data at Work: Supporting Sharing in Science and Engineering. In Proceedings of the SIGGROUP Conference on Supporting Group Work (GROUP'03). ACM, New York, NY, USA, 339-348. DOI: http://dx.doi.org/10.1145/958160.958215

2. Kirsten Boehner and Carl DiSalvo. 2016. Data, Design and Civics: An Exploratory Study of Civic Tech. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI'16). ACM, New York, NY, USA, 2970-2981. DOI: http://dx.doi.org/10.1145/2858036.2858326

3. Morgan Bazilian, Andrew Rice, Juliana Rotich, Mark Howells, Joseph DeCarolis, Cameron Brooks, Florian Bauer, and Michael Liebreich. 2012. Open Source Software and Crowdsourcing for Energy Analysis. Energy Policy 49 (2012), 149-153. DOI: http://dx.doi.org/10.1016/j.enpol.2012.06.032

4. Brian Butler, Elisabeth Joyce, and Jacqueline Pike. 2008. Don't Look Now, But We've Created a Bureaucracy: The Nature and Roles of Policies and Rules in Wikipedia. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI'08). ACM, New York, NY, USA, 1101-1110. DOI: http://dx.doi.org/10.1145/1357054.1357227

5. Chris Chatfield. 2002. Confessions of a pragmatic statistician. Journal of the Royal Statistical Society Series D: The Statistician 51, 1 (2002), 1-20. DOI: http://dx.doi.org/10.1111/1467-9884.00294

6. Sophie Chou, William Li, and Ramesh Sridharan. 2014. Democratizing Data Science: Effecting positive social change with data science. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) at Bloomberg (2014). DOI: http://dx.doi.org/10.1.1.478.3295
8. Laura Dabbish, Rosta Farzan, Robert Kraut, and Tom Postmes. 2012. Fresh Faces in the Crowd: Turnover, Identity, and Commitment in Online Groups. In Proceedings of the ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW'12). ACM, New York, NY, USA, 245-248. DOI: http://dx.doi.org/10.1145/2145204.2145243

9. Sheena Erete, Emily Ryou, Geoff Smith, Khristina Fassett, and Sarah Duda. 2016. Storytelling with Data: Examining the Use of Data by Non-Profit Organizations. In Proceedings of the ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW'16). ACM, New York, NY, USA, 1273-1283.

10. Ixchel M. Faniel and Trond E. Jacobsen. 2010. Reusing scientific data: How earthquake engineering researchers assess the reusability of colleagues' data. Computer Supported Cooperative Work 19, 3-4 (2010), 355-375. DOI: http://dx.doi.org/10.1007/s10606-010-9117-8

11. Andrea Forte and Cliff Lampe. 2013. Defining, Understanding and Supporting Open Collaboration: Lessons from the Literature. American Behavioral Scientist 57, 5 (2013), 535-547. DOI: http://dx.doi.org/10.1177/0002764212469362

12. David J. Hand. 1994. Deconstructing Statistical Questions. Journal of the Royal Statistical Society. Series A (Statistics in Society) 157, 3 (1994), 317-356. http://www.jstor.org/stable/2983526

13. Harlan Harris, Sean Murphy, and Marck Vaisman. 2013. Analyzing the Analyzers. O'Reilly Media.

14. Marijn Janssen, Yannis Charalabidis, and Anneke Zuiderwijk. 2012. Benefits, adoption barriers and myths of open data and open government. Information Systems Management 29 (2012), 258-268.

15. Thorhildur Jetzek, Michel Avital, and Niels Bjorn-Andersen. 2014. Data-Driven Innovation through Open Government Data. Journal of Theoretical and Applied Electronic Commerce Research 9, 2 (2014), 15-16. DOI: http://dx.doi.org/10.4067/S0718-18762014000200008

16. Brian L. Joiner. 2010. Statistical consulting. In Encyclopedia of Statistical Sciences. John Wiley & Sons, Inc., 1-9. DOI: http://dx.doi.org/10.1002/0471667196.ess0409.pub3

17. Maxat Kassen. 2013. A promising phenomenon of open data: A case study of the Chicago open data project. Government Information Quarterly 30, 4 (2013), 508-513. DOI: http://dx.doi.org/10.1016/j.giq.2013.05.012

18. Ron S. Kenett. 2015. Statistics: A Life Cycle View. Quality Engineering 27, 1 (2015), 111-121. DOI: http://dx.doi.org/10.1080/08982112.2015.968054

19. Namwook Kim and Juho Kim. 2015. BudgetMap: Issue-Driven Navigation for a Government Budget. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI'15). ACM, New York, NY, USA, 1097-1102.

20. Charlotte P. Lee and Drew Paine. 2015. From The Matrix to a Model of Coordinated Action (MoCA): A Conceptual Framework of and for CSCW. In Proceedings of the ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW'15). ACM, New York, NY, USA, 179-194.

21. Jeffrey Leek and Roger D. Peng. 2015. What is the Question? Science 347, 6228 (2015), 1314-1315.

22. Silvia Lindtner, Garnet D. Hertz, and Paul Dourish. 2014. Emerging sites of HCI innovation: Hackerspaces, Hardware Startups & Incubators. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI'14). ACM, New York, NY, USA, 439-448. DOI: http://dx.doi.org/10.1145/2556288.2557132

23. Karen Seashore Louis, Lisa M. Jones, and Eric G. Campbell. 2002. Sharing in Science. American Scientist 90, 4 (2002), 304-307.

24. Jock R. MacKay and Wayne R. Oldford. 2000. Scientific Method, Statistical Method and the Speed of Light. Statistical Science 15, 3 (2000), 254-278. http://www.jstor.org/stable/2676665

25. Henry Mintzberg. 1994. The Fall and Rise of Strategic Planning. Harvard Business Review 72, 1 (1994), 107-114. https://hbr.org/1994/01/the-fall-and-rise-of-strategic-planning

26. Jae Yun Moon and Lee Sproull. 2000. Essence of distributed work: The case of the Linux kernel. First Monday 5, 11 (2000). DOI: http://dx.doi.org/10.5210/fm.v0i0.1479

27. Gary Olson and Judith Olson. 2000. Distance Matters. Human-Computer Interaction 15, 2 (2000), 139-178. DOI: http://dx.doi.org/10.1207/S15327051HCI1523_4

28. Sylvain Parasie and Eric Dagiral. 2013. Data-driven Journalism and the Public Good: "Computer-assisted-reporters" and "Programmer-journalists" in Chicago. New Media & Society 15, 6 (2013), 853-871. DOI: http://dx.doi.org/10.1177/1461444812463345

29. DJ Patil. 2011. Building Data Science Teams. (2011). http://radar.oreilly.com/2011/09/building-data-science-teams.html

30. Gregory Piatetsky. 2013. Unicorn Data Scientists vs Data Science Teams. (2013). http://www.kdnuggets.com/2013/12/unicorn-data-scientists-vs-data-science
31. Foster Provost and Tom Fawcett. 2013. Data Science and its Relationship to Big Data and Data-Driven Decision Making. Big Data 1, 1 (2013), 51-59. DOI: http://dx.doi.org/10.1089/big.2013.1508

32. Betsy Rolland and Charlotte P. Lee. 2013. Beyond trust and reliability: reusing data in collaborative cancer epidemiology research. In Proceedings of the ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW'13). ACM, New York, NY, USA, 435-444. DOI: http://dx.doi.org/10.1145/2441776.2441826

33. Joshua G. Tanenbaum, Amanda M. Williams, Audrey Desjardins, and Karen Tanenbaum. 2013. Democratizing technology: pleasure, utility and expressiveness in DIY and maker practice. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI'13). ACM, New York, NY, USA, 2603-2612. DOI: http://dx.doi.org/10.1145/2470654.2481360

34. Joshua Tauberer. 2014. Open Government Data: The Book (2nd ed.). Self-published. https://opengovdata.io/

35. Alex S. Taylor, Siân Lindley, Tim Regan, and David Sweeney. 2015. Data-in-Place: Thinking through the Relations Between Data and Community. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI'15). ACM, New York, NY, USA, 2863-2872.

36. Theresa Velden. 2013. Explaining Field Differences in Openness and Sharing in Scientific Communities. In Proceedings of the ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW'13). ACM, New York, NY, USA, 445-457. DOI: http://dx.doi.org/10.1145/2441776.2441827

37. Tricia Wang and Joseph Jofish Kaye. 2011. Inventive Leisure Practices: Understanding Hacking Communities as Sites of Sharing and Innovation. In CHI '11 Extended Abstracts on Human Factors in Computing Systems (CHI EA '11). ACM, New York, NY, USA, 263-272. DOI: http://dx.doi.org/10.1145/1979742.1979615

38. Gemma Webster, David E. Beel, Chris Mellish, Claire D. Wallace, and Jeff Pan. 2015. CURIOS: Connecting Community Heritage through Linked Data. In Proceedings of the ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW'15). ACM, New York, NY, USA, 639-648.

39. Jorge L. Zapico, Daniel Pargman, Hannes Ebner, and Elina Eriksson. 2013. Hacking sustainability: Broadening participation through Green Hackathons. In Fourth International Symposium on End-User Development. IT University of Copenhagen, Denmark.