Computationally Intensive Research
Additional copies are available for $20 each. Orders must be placed through CLIR’s Web site.
This publication is also available online at http://www.clir.org/pubs/reports/pub151.
The paper in this publication meets the minimum requirements of the American National Standard
for Information Sciences—Permanence of Paper for Printed Library Materials ANSI Z39.48-1984.
Copyright © 2012 by Council on Library and Information Resources. This work is made available under the terms
of the Creative Commons Attribution-ShareAlike 3.0 license, http://creativecommons.org/licenses/by-sa/3.0/.
Contents
About the Authors
Acknowledgments
Executive Summary
Acknowledgments
The Council on Library and Information Resources is grateful to the National
Endowment for the Humanities for its sponsorship of the data collection,
travel, and publication costs of this project through Cooperative Agreement
HC-50007-10. Special thanks are due to Brett Bobley and Jennifer Serventi at
the Office of Digital Humanities for their guidance and support throughout
the research process, including reading multiple drafts of this document.
Their clear visions for the Digging into Data Challenge program generally
and for this assessment specifically were invaluable to shaping this report’s
arguments. Their colleagues at the Joint Information Systems Committee in
the United Kingdom, the Social Science and Humanities Research Council in
Canada, and the National Science Foundation were equally encouraging and
enthusiastic throughout; we are especially thankful for their help in under-
standing the complexity of the international landscape of e-research. Above
all, the many researchers whose groundbreaking work this report describes
gave generously of their time over a two-year period that included telephone
interviews, site visits, and focus groups. These individuals are listed at the
end of this report. Many of the most insightful and eloquent thoughts ex-
pressed here are theirs; any mistakes are the responsibility of the authors.
Kathlin Smith and Brian Leney at CLIR edited and helped prepare both the
abridged print and lengthier web version of this report.
Executive Summary
How many lifetimes? This question often arose when the authors
of this report pondered the extraordinary scale and
complexity of research conducted in the Digging into Data
Challenge program. Analyzing and extrapolating patterns of mean-
ing from tens of thousands of audio files; nearly 200,000 trial tran-
scripts; millions of spoken words, recorded over many years; and
hundreds of thousands of primary and secondary texts in ancient
languages would, if undertaken using printed resources and analog
materials, have required the lifetimes and generations of scholars.
Because the resources in question were digital, the time of analysis
and discovery was compressed into months, not decades. By choos-
ing to work with very large quantities of digital data and to use the
assistance of machines, the Digging into Data Challenge investiga-
tors have demarcated a new era—one with the promise of revelatory
explorations of our cultural heritage that will lead us to new insights
and knowledge, and to a more nuanced and expansive understand-
ing of the human condition.
As articulated in section one, the Digging into Data projects are
built on collaborations that are neither contrived nor strained. These
collaborations include humanists, social scientists, computer scien-
tists, and other specialists working together toward shared goals that
also meet their individual research aspirations. Rather than working
in silos bounded by disciplinary methods, participants in this project
have created a single culture of e-research that encompasses what
have been called the e-sciences as well as the digital humanities: not a
choice between the scientific and humanistic visions of the world, but
a coherent amalgam of people and organizations embracing both.
Within this one culture are many important differences and dis-
tinctions (think of a magnifying lens adjusting to expose increasing
levels of granularity). Regardless of their disciplinary significance, at
the lowest level all data in a digital environment are zeros and ones,
a flattening of information that, while necessary for its storage within
a computer’s architecture, is not particularly meaningful to humans.
At an intermediate level, the human user can appreciate the diver-
sity of digital resources. Data for the humanities and social sciences
comprise many media and formats; among the types examined by
the Digging into Data investigators are digital images of American
quilts, fifteenth-century manuscripts, and seventeenth-century maps;
conversations recorded in kitchens; news broadcasts; court tran-
scripts; digitized music; and thousands upon thousands of digitized
2 Christa Williford and Charles Henry
Recommendations
This report results from a study of eight international projects that
have uncovered previously unimagined correlations between social
and historical phenomena through computational analysis of large,
complex data sets. The following recommendations are based on
this study; they are urgent, pointed, and even disruptive. To address
them, we must recognize the impediments of tradition that hinder
the contemporary university’s ability to adapt to, support, or sus-
tain this emerging research over time. Traditional organizations and
funding patterns reflect a much more strictly delineated intellectual
landscape. It is time to question which among these boundaries
remain useful, which should be more porous, and which no longer
serve a useful purpose.
3. Embrace interdisciplinarity
The scholars participating in the first eight Digging into Data projects
are active members of multiple academic communities that cross tradi-
tionally bounded fields. Their need to work across disciplines mirrors
a larger need for organizational flexibility and possible restructuring
of institutions of higher learning to promote successful working part-
nerships between differently trained scholars and academic profes-
sionals. Interdisciplinary collaboration benefits not only researchers
but also students. Today’s colleges and universities must equip stu-
dents with skills appropriate for a rapidly changing and diverse work-
force: the intellectual flexibility that an interdisciplinary perspective
cultivates is an excellent foundation for developing these skills.
1
John Coleman, Mark Liberman, Greg Kochanski, and colleagues make compelling
comparisons between the sizes of major data corpora in the sciences and humanities
on page 3 of their white paper Mining Years and Years of Speech. See
http://www.phon.ox.ac.uk/files/pdfs/MiningaYearofSpeechWhitePaper.pdf.
For administrators:
• Commit to investing in the long-term management and preserva-
tion of data.
• Create opportunities for humanities and social science faculty, ad-
junct faculty, staff, and students to develop skills in the manage-
ment, analysis, and interpretation of these data.
• Offer incentives for engagement in collaborative research
initiatives.
• Develop models for the assessment of collaborative work.
• Develop partnerships with institutions with complementary
strengths.
• Adopt clear policies for sharing hardware, software, and data
resources among on- and off-campus researchers that maximize
openness yet protect privacy and intellectual property.
2
Snow, C. P. The Two Cultures (Cambridge: Cambridge University Press, 1998), p.
2. Based upon a talk given by Snow at Cambridge University on May 7, 1959, first
published in the same year by Cambridge University Press. Cited in Patricia Waugh,
Review of The Two Cultures Controversy: Science, Literature and Cultural Politics in
Postwar Britain, by Guy Ortolano (Cambridge: Cambridge University Press, 2009).
3
A 2009 CLIR report titled Working Together or Apart: Promoting the Next Generation
of Digital Scholarship, which was the outcome of a symposium planned by former
CLIR Director of Programs Amy Friedlander, provided the foundation for the study
upon which this report is based. The insights of its authors, who write from specific
disciplinary perspectives, resonate well with the findings here.
4
Crane, Gregory. “What Do You Do With a Million Books?” D-Lib Magazine 12.3
(March 2006). Available at http://www.dlib.org/dlib/march06/crane/03crane.html.
5
This speculation was in addition to the class action lawsuit filed by the Authors
Guild and the Association of American Publishers, still in litigation after a proposed
settlement agreement was rejected by the New York Southern District Court in March
2011.
One Culture. Computationally Intensive Research in the Humanities and Social Sciences 9
and social scientists refine their existing questions and articulate new
ones. Furthermore, many of these projects show collaborators making
significant advances in the field of computer science as well as within
the relevant subject domain. Conducting research “at scale,” espe-
cially across the unstructured and heterogeneous data upon which
humanists depend, can inspire new and more nuanced applications
of computer tools, which can in turn lead to new questions.
The web-based version of this report includes individual case studies
that describe key findings as well as some of the challenges each
project team encountered. This printed report describes the cases in
aggregate, extrapolating the commonly shared characteristics. Table
2 notes the represented disciplines, numbers of researchers, data
types, methodologies, and tools used for each project.
The web-based version of this report is available at
http://www.clir.org/pubs/reports/pub151.
It became clear in our work that humanists, who are often exceptional
experts in their fields, often have a difficult time describing how they
go about their work and analyses. Having humanists work in teams
and with computer scientists required them to explain and detail
their processes. The work we have done has made us more sensitive
to this issue and opens up many new areas of research—How can we
develop better collaborative models that help humanists explicate their
processes? Can we build tools to help capture the way humanists work?
How can we enhance digital archives to facilitate the ways humanists
work with objects?
2.2 Interdisciplinarity
Crossing disciplinary boundaries often increases the impact of com-
putationally intensive scholarship by exposing it to greater num-
bers of researchers, students, and the public. At the same time, it
complicates project management: traditions, concepts, and research
vocabularies must be adapted to accommodate other points of view.
When the common ground for a collaboration is methodological (the
“how”) rather than driven by a shared desire for a particular dis-
covery or outcome (the “why”), partners must be prepared to work
in ways that do not neatly fit the models they have been trained to
emulate. This results in products for which partners cannot take sole
credit, some of which defy traditional kinds of peer review. The level
of stress this transformation may create for the researcher varies by
discipline, by institution, and by individual, but acceptance of this
change is obligatory.
These projects point to new avenues for investigation more of-
ten than they provide conclusive answers to their original framing
questions. This is not surprising, given that topics as complex as pat-
terns of human creativity, authorship, and the continuity of culture
over time often elude conclusive explanation. But many practitioners
of computer-assisted investigation contend that in time, with enough
attention to the curation of valid data, the formation of suitably com-
plex and replicable methods of analysis, and the framing of increas-
ingly precise questions, it may be possible to combine computer-
based analysis of large data corpora with the creativity and critical
power of the human researcher to promote a greater understanding
of our society and culture than has ever been possible. The prospects
of new discovery at such a scale seem achievable only through con-
tinued collaboration across disciplines.
31
http://criminalintent.org/.
There are important questions also about how such resources ac-
quire authority status (e.g., through quality of referencing back to
original sources, through collaborative work by leading research
groups in the field, by peer review, by crowd sourcing from citi-
zen scholars).32
2.4 Expertise
Investigators stressed frequently that the research they pursued
would not be possible without extensive collaboration with partners
who contributed many kinds of expertise working in what Peter
Ainsworth (Digging into Image Data) called a “transformative, symbi-
otic partnership.” Collaborators’ expertise and training overlapped
more in some cases (such as Mining a Year of Speech, Data Mining with
Criminal Intent) than in others (such as Digging into the Enlightenment,
Digging into Image Data). When the teams included experts with
complementary, rather than overlapping, strengths, the coordination
and management of the project, including communications among
the partners and dividing responsibility for shared resources, was
especially vital, as were significant investments of time in planning
for and framing the project.
Four generic kinds of expertise were represented among part-
ners in each project: domain expertise, data management expertise,
analytical expertise, and project management expertise. Participants
in all the projects shared an appreciation for each of these kinds of
skills. While not always represented in the same proportions, each
of these areas was represented in the eight projects by one or more
individuals. These categories of expertise seemed important counter-
balances to one another, as if they were four supporting legs of a
table (Figure 1).
Although the investigators agreed that the four categories were
equally important, some observed that the contributions of research-
ers with more than one of these kinds of expertise were most critical
to project success. Dan Edelstein, who worked on Digging into the
Enlightenment, put it this way: “What made our project possible was
that we had these hybrid people with more than one leg of the ‘ta-
ble’. Those people are very hard to find. They don’t do well naturally
in a university setting.” Students, short-term project staff, and junior
faculty all played crucial roles, often in a “hybrid” capacity.
32
E-mail from Richard Healey to Christa Williford, June 11, 2011.
33
Simeone, Michael, Jennifer Guiliano, Rob Kooper, and Peter Bajcsy. “Digging
into Data Using New Collaborative Infrastructures Supporting Humanities-Based
Computer Science Research.” First Monday 16.5 (May 2, 2011). Available at http://
firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/3372/2950. Section
3.2 of this article, titled “Legal and Ethical Aspects of Scholarly Collaborations,” is
especially salient here.
34
At time of writing, some of the Digging into Data projects funded in 2009 are still
under way and have yet to report final results.
35
Tim Hitchcock, Robert Shoemaker, Clive Emsley, Sharon Howard, Jamie
McLaughlin, et al., The Old Bailey Proceedings Online, 1674-1913 version 7.0, 24 March
2012. Available at http://www.oldbaileyonline.org.
[Scatter plot omitted. Vertical axis: Number of Words (10 to 10^4); horizontal axis: Year (1860 to 1870).]
Fig. 2: Scatter plot created in Mathematica showing distribution of Old Bailey trial lengths in the 1860s,
by Tim Hitchcock and William Turkel.
investigating the reason(s) why this may have been the case.”36
Project leaders indicated that perhaps the chief motivation of us-
ing these tools is that they allow scholars to ask questions that would
not have occurred to them otherwise: this is the power of unexpected
discovery that opens paths to new thinking and further questioning.
Dan Edelstein and Paula Findlen emphasize this point in their white
paper for Digging into the Enlightenment: their geographic visualiza-
tions of historic letters “can serve a heuristic purpose, leading the
user toward less known corners of the dataset.”37 “We’re discovering
research questions that we didn’t have when we started off,” echoed
Peter Ainsworth, who led one of the teams on the Digging into Im-
age Data project. Using an image segmentation algorithm developed
during this project, Robert Markley and Michael Simeone were able
to analyze digital images of 40 British and French historic maps of
the Great Lakes dating from 1650 to 1800. Results showed marked
inconsistencies between the depictions of some of the lakes’ borders
over this period; they also showed that the mapmakers’ work did not
become more “accurate” over time. Examining these inconsistencies,
Simeone and Markley hypothesized that some “inaccuracies” reflect-
ed in the maps may actually correspond with water-level fluctuations
and periods of prolonged ice cover. If they are able to collect more
evidence to support this theory in their future research, Simeone and
Markley “can begin to analyze maps prior to 1800 in order to provide
36
Cohen, D., Hitchcock, T., Rockwell, G., et al. Data Mining with Criminal Intent. Final
White Paper. August 31, 2011. Available at
http://criminalintent.org/wp-content/uploads/2011/09/Data-Mining-with-Criminal-Intent-Final1.pdf.
37
See page 7 of Edelstein’s and Findlen’s white paper for the NEH-funded portion of
this project.
Fig. 3: Integrated census (“Railroad Men”) and railroad company (“Shop Index”) data show the extent
of railroad employment in the U.S. in 1880. Revealed are 38 “highly concentrated railroad centers.” 39
39
Thomas, William G., and Richard Healey. “Railroad Workers and Worker Mobility
on the Great Plains,” Western History Association, Lake Tahoe, Nevada, October 2010.
Funding
• Funding is not always available in the amounts or for the re-
sources most needed by investigators.
• Investigators need to continually seek external funding to sus-
tain ongoing work.
• Many institutions lack long-term support for valuable project
staff.
• Young scholars have difficulty getting travel support for meet-
ings with collaborators.
• Computer storage infrastructure and processing cycles can
be prohibitively expensive for humanists and social scientists
working with large data sets.
Time
• Planning for and managing complex international, multidisci-
plinary collaborations takes extensive time.
• Data correction and tool development are time-consuming.
• Deep collaboration requires frequent synchronous communi-
cation, which is a major time commitment.
• Partners often have conflicting academic calendars and work
schedules.
Communication
• Partners need patience and understanding to grasp perspec-
tives of others from different backgrounds.
• Convincing technologists or computer scientists of the value
of investing in humanities and social science work can be
challenging.
• Managing expectations among partners with responsibilities
for multiple projects can be tricky.
Data
• Data sharing requires shared tools and storage, and demands
that partners trust one another.
• Making data “diggable” can be extremely labor-intensive.
• Error rates in data can be difficult to predict when planning a
project and hard to account for in an analysis.
• Data management and analysis are iterative and cyclical, rath-
er than sequential, activities.
40
The awards were limited to 100,000 US dollars (NEH, NSF), 100,000 Canadian
41
Formal and informal education and training opportunities related to data-intensive
research methodologies are growing more common worldwide, although formal
training is still more common in the sciences (such as bioinformatics) than in the social
sciences or humanities.
42
See, for example, the MediaCommons project
(http://mediacommons.futureofthebook.org/about-mediacommons); Ball, A., and M. Duke, “Data Citation
and Linking,” DCC Briefing Papers (Edinburgh: Digital Curation Centre, 2011),
available at http://www.dcc.ac.uk/resources/briefing-papers/; Fitzpatrick, K. Planned
Obsolescence: Publishing, Technology, and the Future of the Academy (New York: NYU
Press, 2011); Withey, L., et al., “Sustaining Scholarly Publishing: New Business Models
for University Presses” (2011); available at
http://mediacommons.futureofthebook.org/mcpress/sustaining/.
43
The academic library community has already taken up the challenge of supporting
data-intensive research, but much work remains to be done to establish sustainable
best practices for the management of research data that will work globally and
across disciplines, at institutions both large and small. See Marcum, Deanna, and
Gerald George, eds. The Data Deluge: Can Libraries Cope with E-Science? (Westport, CT:
Greenwood Press, 2010).
44
See links collected at “Evaluating Digital Work for Tenure and Promotion: A
Workshop for Evaluators and Candidates.” Modern Language Association. Available
at http://www.mla.org/resources/documents/rep_it/dig_eval.
45
In the second Digging into Data Challenge, the eight participating agencies have
been advocating increased transparency in handling intellectual property–related
issues. For example, JISC Director of the Strategic Content Alliance Stuart Dempster
reports “paying site visits, forwarding exemplars of good practice and giving grantees
a strong steer on licensing of project outputs” (e-mail to Brett Bobley, March 23,
2012). The JISC website has helpful information and tools related to intellectual
property rights and licensing. See, for example, Naomi Korn, “Embedding Creative
Commons Licenses into Digital Resources.” JISC Strategic Content Alliance Briefing
Paper, 2011. Available at
http://www.jisc.ac.uk/publications/programmerelated/2011/scaembeddingcclicencesbp.aspx.
46
See page 18 of the project white paper. See also Crane, G. “Give us Editors! Re-
inventing the Edition and Re-thinking the Humanities.” In Online Humanities
Scholarship: The Shape of Things to Come. (University of Virginia: Mellon Foundation,
2010). Available at http://cnx.org/content/m34316/latest/.
47
http://www.diggingintodata.org/.
With ten billion digital objects being created every day, for those who
work in the humanities, for those who deal with the human record, it
cannot be a question of whether computer tools will be an important
part of the humanistic disciplines—they will need to be. This will
require re-imagining the humanities, rethinking and re-envisioning the
way humanists go about their work.
For case studies describing each of the eight 2009 Digging into Data projects,
visit the web-based version of this report at
http://www.clir.org/pubs/reports/pub151.
Arms, William, and Ronald Larsen, eds. 2007. The Future of Scholarly
Communication: Building the Infrastructure for Cyberscholarship.
NSF/JISC Workshop, Phoenix, Arizona, April 17–19, 2007. Available at:
http://www.sis.pitt.edu/~repwkshop/SIS-NSFReport2.pdf.
Gold, Matthew K., ed. 2012. Debates in the Digital Humanities.
Minneapolis: University of Minnesota Press.
High Level Expert Group on Scientific Data. 2010. Riding the Wave:
How Europe Can Gain from the Rising Tide of Scientific Data. Report to
the European Commission. Available at:
http://cordis.europa.eu/fp7/ict/e-infrastructure/high-level-group_en.html.
Kroll, Susan, and Rick Forsman. 2010. A Slice of Research Life:
Information Support for Research in the United States. Report commissioned
by OCLC Research in support of the RLG Partnership. Available at:
http://www.oclc.org/research/publications/library/2010/2010-15.pdf.
Lynch, Clifford A. 2008. “Big Data: How do Your Data Grow?”
Nature 455.7209. Abstract available at:
http://www.nature.com/nature/journal/v455/n7209/full/455028a.html.
Lyon, Liz. 2009. Open Science at Web Scale: Optimising Participation and
Predictive Potential. London: Joint Information Systems Committee.
Available at: http://www.jisc.ac.uk/publications/reports/2009/open-
sciencerpt.aspx.
Maron, Nancy L., and K. Kirby Smith. 2008. Current Models of Digital
Scholarly Communication. Results of an Investigation Conducted by Ithaka
for the Association of Research Libraries. Washington, DC: Association
of Research Libraries. Available at:
http://www.arl.org/bm~doc/current-models-report.pdf.
Sehat, Connie Moon, and Erika Farr. 2009. “The Future of Digital
Scholarship: Preparation, Training, Curricula.” Washington, DC:
Council on Library and Information Resources. Available at:
http://www.clir.org/pubs/resources/archives/SehatFarr2009.pdf.
Van den Eynden, Veerle, Libby Bishop, Laurence Horton, and Louise
Corti. 2010. “Data Management Practices in the Social Sciences.”
Essex: UK Data Archive. Available at:
http://www.data-archive.ac.uk/about/publications.
van der Graaf, Maurits, and Leo Waaijers. 2011. A Surfboard for Riding
the Wave: Towards a Four Country Action Programme for Research Data.
Copenhagen: The Knowledge Exchange. Available at:
http://www.knowledge-exchange.info/Default.aspx?ID=469.
Whyte, A., and J. Tedds. 2011. Making the Case for Research Data
Management. DCC Briefing Papers. Edinburgh: Digital Curation Centre.
Available at: http://www.dcc.ac.uk/resources/briefing-papers.
Alison Babeu (Tufts University, US) is the digital librarian for the
Perseus Digital Library and contributed both data and subject exper-
tise to the project.
David Bamman (Tufts University, US) is a computational linguist
who contributed both technical and subject expertise to the project.
Federico Boschetti (Institute of Computational Linguistics of the
National Research Council, Italy) worked with Robertson on custom-
izing optical character recognition engines for ancient Greek source
texts.
Lisa Cerrato (Tufts University, US) is managing editor of the Perseus
Project and contributed both data and subject expertise.
Gregory Crane (Tufts University, US) served as principal investiga-
tor for the NEH-funded portion of the project.
John Darlington (Imperial College London, UK) served as principal
investigator for the JISC-funded portion of the project.
Brian Fuchs (Imperial College London, UK) designed and imple-
mented a scalable computer infrastructure for processing large data
sets of page images from books.
David Mimno (University of Massachusetts Amherst, US) is a com-
puter scientist who contributed both technical and analytical exper-
tise to the project.
Bruce Robertson (Mount Allison University, Canada) served as
principal investigator for the SSHRC-funded portion of the project.
Music annotators:
Christa Emerson, David Adamcyk, Elizabeth Llewellyn, Meghan
Goodchild, Michel Vallières, Mikaela Miller, Parker Bert, Rona
Nadler, and Rémy Bélanger de Beauport