Quantitative Methods I: Reproducible Research and Quantitative Geography
Progress report
Progress in Human Geography
2016, Vol. 40(5) 687–696
© The Author(s) 2015
Chris Brunsdon
National Centre for Geocomputation, Maynooth University, Ireland
Abstract
Reproducible quantitative research is research that has been documented sufficiently rigorously that a third
party can replicate any quantitative results that arise. It is argued here that such a goal is desirable for
quantitative human geography, particularly as trends in this area suggest a turn towards the creation of
algorithms and codes for simulation and the analysis of Big Data. A number of examples of good practice in
this area are considered, spanning a time period from the late 1970s to the present day. Following this,
practical aspects such as tools that enable research to be made reproducible are discussed, and some
beneficial side effects of adopting the practice are identified. The paper concludes by considering some of the
challenges faced by quantitative geographers aspiring to publish reproducible research.
Keywords
Big Data, computational paradigm, geocomputation, programming, reproducibility
Complete details of any reported results, and of the computation used to obtain them, should be available, so that others following the same procedures and using the same data can obtain identical results. This article considers the relevance and implications of this for geographical data analysis and GIS. Although the idea was put forward over two decades ago, the need to adopt reproducible practices is more relevant than ever. It has been argued that in addition to the two 'classical' paradigms of science that were commonly acknowledged at the time of the Claerbout (1992) paper (Hey et al., 2009; Kitchin, 2014b), two further paradigms are emerging: a computational paradigm, centred on simulation, and an exploratory, data-driven paradigm associated with Big Data. Both of these use code (written either by the researcher or a third party) as an enabling technology. In both of the newer paradigms, although important ideas may be articulated in published texts, distinct intellectual contributions are embedded in software code, where the ideas are represented in their most detailed form. Given this, a full critical engagement with researchers working within these paradigms is inhibited if code is not available openly. This is generally the case for quantitative science and social science, and for digital humanities. Here attention will be focused on the implications for quantitative geography, geocomputation and geographical information science.

To illustrate why this matters, consider the following scenarios:

1. You have a data set that you would like to analyse using the same technique as described in a paper recently published by another researcher in your area. In that paper the technique is outlined in prose form, but no explicit algorithm is given. Although you have access to the data used in the paper, and have attempted to recreate the technique, you are unable to reproduce the results reported there.

2. You published a paper five years ago in which an analytical technique was applied to a data set. You now discover an alternative method of analysis, and wish to compare the results.

3. A particular form of analysis was reported in a paper; subsequently it was discovered that one software package offered an implementation of this method that contained errors. You wish to check whether this affects the findings in the paper.

4. A data set used in a reported analysis was subsequently found to contain rogue data, and has now been corrected. You wish to update the analysis with the newer version of the data.

Articles providing precise verbal descriptions of algorithms are useful in these scenarios – as the earlier examples demonstrate – and it is certainly the case that this is a great improvement on vaguer descriptions that provide insufficient information to reproduce initial analyses. However, one could argue that the code itself is a much stronger aid to reproduction – a verbal description being prone both to incorrect interpretation and to omission of necessary detail. In addition, there is the possibility that the code used in an article may contain an error, so that the verbal description is in fact precise only in outlining what the author thinks the code does – only the code itself will reveal what it actually does. In most cases, the omission of such information is not done with malice aforethought on the part of researchers. Until the issue was raised in the article by Claerbout (1992) and those following, providing such detail was not considered standard practice in many disciplines. Indeed, even now, few journals (none in geography, although this could be changing soon) insist that such precise details are provided, and it could perhaps be argued that there is some contributory negligence on their part.

Similarly, although it is usually required that researchers cite the sources of secondary data, such citations often consist of an acknowledgement of the agency that supplied the data, possibly with a link to a general website, rather than an explicit link (or links) to the file (or files) that contained the actual data used in the research, or details of any re-formatting of the data (including code) prior to analysis. However, both pieces of information allow published results to be critically assessed and scrutinized – ultimately leading to more trustworthy research conclusions.
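To make this concrete, the following R sketch shows what such a reproducible data citation might look like in practice. The URL, file name and variable names here are invented for illustration – the point is that the exact file used, and every re-formatting step applied to it, are recorded as code:

```r
# Explicit link to the exact file used in the analysis
# (hypothetical URL and variable names, for illustration only)
data_url <- "http://data.example.org/census/small_area_2011.csv"
download.file(data_url, destfile = "small_area_2011.csv")
census <- read.csv("small_area_2011.csv")

# Re-formatting prior to analysis, documented as code rather than
# performed by hand: drop unpopulated areas and derive a rate
census <- subset(census, population > 0)
census$unemployment_rate <- census$unemployed / census$labour_force
```

A citation of this form – the code together with the archived file – tells a reader not just which agency supplied the data, but precisely which file was analysed and how it was modified before analysis.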
IV The case for reproducible quantitative geography

The above is a general argument for reproducibility. However, one could ask whether this is relevant or practical for applications in quantitative human geography. In terms of relevance, it is worth noting that a great deal of analysis of social and economic data is inherently spatial – whether focusing on the regional, local or street level – and that the results of such analyses are often used to inform policy-makers and are used in decision-making processes. In many cases, the data being analysed is publicly available – for example, the US Census Bureau provides a number of APIs to access official statistics such as economic time series indicators and the decennial census for 1990, 2000 and 2010; the UK provides public access to census and reported crime data; and Ireland provides access to Irish census data.
However, not all reports or articles analysing this and other publicly available data provide precise details of the analysis.

There are a number of arguments as to why such information should be provided. The first is a purely academic one – a useful and informed critical discourse on any analytical work can only take place when full details are provided. When the data analysis is a black box, it is difficult either to uphold or to argue against any conclusions reached. One cannot tell whether the underlying models or techniques are appropriate or, even if they are, whether the underlying code or other computational approach faithfully reflects them.

A second argument is one of accountability. Many quantitative studies inform policy decisions by governments and other institutions – different quantitative analyses with different outcomes could well lead to different policy decisions. Providing information not only about the sources of data used but also about the methods used to analyse the data is a key strategy of open government and democratic decision-making. As suggested earlier, this in turn leads to a more trustworthy approach – although it does not guarantee that an analysis is without error, it provides a mechanism whereby the analysis is open to public scrutiny, so that the probability that any error is identified and corrected is notably increased. Also, relating to the earlier point, it implies that any assumptions made in the analysis are open to scrutiny, so that public discussion and debate regarding the basis of policy decisions is made possible.

A reminder of the relevance of this is provided by the recent controversy surrounding a paper by Reinhart and Rogoff (2010), whose published findings have been widely cited as an argument for fiscal austerity. However, in an article by Herndon, Ash and Pollin (2013), flaws were identified in the data analysis carried out in the paper. Quoting from the abstract of the latter article:

We replicate . . . and find that selective exclusion of available data, coding errors and inappropriate weighting of summary statistics lead to serious miscalculations that inaccurately represent the relationship between public debt and GDP growth among 20 advanced economies . . . Our overall evidence refutes RR's claim that public debt/GDP ratios above 90% consistently reduce a country's GDP growth. (2013: 1)

This arose after a student, Thomas Herndon, unsuccessfully attempted to reproduce the analysis in Reinhart and Rogoff's paper as a coursework exercise. Investigations unearthed that the analysis was flawed – in part due to an error in an Excel spreadsheet. In this case measures were not taken to ensure reproducibility in the original paper, and it took a considerable amount of forensic computing to discover the problem. Following this, an erratum was published (Reinhart and Rogoff, 2013), although Rogoff and Reinhart have defended their conclusions – if not their original analysis. However, the debate continues, as authors of the critique continue to challenge a number of assumptions in the corrected analysis.

Putting aside any criticisms I may have of the original paper, the outcome here is perhaps one of cautious optimism, in that an open debate about the underlying analysis is now taking place – albeit after a great deal of public controversy. Again quoting from Herndon, Ash and Pollin:

Beyond these strictly analytical considerations, we also believe that the debate generated by our critique of RR has produced some forward progress in the sphere of economic policy making. (2013: 279)

However, a reproducible approach here could have resulted in a smoother path to the final situation of public debate and a resolution of the erroneous analysis. Indeed, the spirit of the exercise set to the student was that of reproducing the published analysis.

V Achieving reproducibility

To address these problems, one approach proposed is that of literate programming (Knuth, 1984).
This was initially proposed as a tool for documenting code, where a single file contained both the code documentation and the code itself. This file was used to generate both a human-readable document and computer-readable content from which software was built. The purpose was that the human-readable output provided an explanation of the working of the program (together with neatly printed listings of the code), offering an accessible overview of the program's function. However, such compendium files can also be used in a slightly different way: rather than describing the code, the human-readable output is an article containing some data analysis performed by the incorporated code. Tabulated results, graphs and maps are created by the embedded code. As before, two operations can be applied to the files – document creation and code extraction. The embedded code is also visible in the original file. Thus information about both the reporting and the processing can be contained in a single document – and if this document is shared then a reproducible analysis (together with associated discussion) is achieved.

Examples of this approach are the NOWEB system (Ramsey, 1994) and the Sweave and knitr packages (Leisch, 2002; Xie, 2013). The first of these incorporates code into LaTeX documents using two very simple extensions to the markup language. The latter two are extended implementations of this system using R as the language for the embedded code. knitr also offers the possibility of embedding code into markdown – a simpler markup language than LaTeX – which facilitates very quick production of reproducible documents.
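As a sketch of what this looks like in practice, the following is a minimal knitr document in the markdown format (the data file and variable names are hypothetical):

````
# House prices and floor area

The model reported below is re-estimated from the raw data
each time this document is compiled.

```{r model}
# This chunk is executed when the document is compiled;
# 'house_prices.csv' and its variables are hypothetical
hp <- read.csv("house_prices.csv")
fit <- lm(price ~ floor_area, data = hp)
summary(fit)
```
````

Applying knitr::knit() to such a file executes the chunk and weaves its output into the finished document (document creation), while knitr::purl() extracts the chunk as a plain R script (code extraction) – the two operations described above.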
The fact that R is used in the latter two approaches is encouraging for geographers, since R offers a number of packages for spatial analysis, for geographical data manipulation of the kind provided by geographical information systems, and for spatial statistics (Brunsdon and Comber, 2015). Furthermore, as R is open source software, the code used in any of these packages is also publicly available. Thus it is possible to share not only high-level data analysis operations, but also the code used to build the tools at the higher level.

Another possibility here is an approach using Pweave (Pastell, 2014) – a similar extension of NOWEB that embeds Python code rather than R. Again, Python offers many tools for geographical data analysis, such as the PySAL package (Rey, 2015).

VI Beneficial side effects

Although much of the justification of a reproducible approach given so far has been defensive, the approach also provides a number of benefits. Many of these occur as side effects of using the kinds of approach outlined above. In particular:

Reproducible analyses can be compared: Different analytical approaches attempting to address the same hypothesis can be compared on the same data set, to assess the robustness of any conclusions drawn. In particular, a third party can take an existing reproducible document and add an alternative analysis to it.

Methods are documented: One option with many reproducibility tools is to incorporate the code itself – as well as its outputs – in the documents produced. This allows for transparency in the way that results are obtained.

Methods are portable: Since the code may be extracted from the documents, others may use it and apply it to other data sets, or modify it and combine it with other methods. This allows approaches to be assessed in terms of their generality, and encourages further dialogue on the interpretation of existing data.
Results may be updated: If updated versions of the data used in an analysis are published (for example, new census data), the methods applied to the old data may be re-applied and the updated results compared to the original ones. Also, if the original data required amendment, an updated analysis could easily be carried out.

Reports may have greater impact: Recent work has shown that papers in a number of fields that include reproducible analyses have higher impact and visibility; this is discussed in Vandewalle, Kovačević and Vetterli (2009).

VII Challenges

The above sections argue that reproducible approaches offer a number of benefits. However, their adoption requires challenging changes in current practice. Perhaps one of the most notable is that the knitr, Sweave and Pweave approaches all require the use of code to carry out statistical analysis, visualization and data manipulation, rather than commonly adopted GUI-based tools such as Excel. Unfortunately, this is an inherent characteristic of reproducibility: after a series of point-and-click operations, results are cut and pasted into a Word document (or similar), and the link between the reported result and the analytical procedure is lost. It is perhaps no surprise that the Reinhart and Rogoff affair was seeded by an error in Excel.

Despite this, perhaps it is more realistic to consider ways in which the divide between GUI-based tools and reproducibility could be bridged than to propose that such tools be abandoned. One possibility might be to provide GUI-based software in which every interactive event is echoed by a recorded code equivalent; the recorded code could then be embedded in a document. One such tool, with a web-based interface, is Radiant (Radiant News, 2015). However, it is perhaps also worth noting a general turn towards coding and away from GUI solutions in data analysis, as indicated by the popularity of a number of books such as O'Neil and Schutt (2013) and McKinney (2012) – suggesting that there is a current wave of practitioners for whom the adoption of coding as a tool for data analysis does not imply a change of culture. Recent attendance at GIS conferences by the author would suggest, at least anecdotally, that these trends are reflected in geocomputation and geographical information science.

Other, more minor practical challenges also exist – for example, how can a sequence of random numbers used in a simulation be reproduced? Many of these can be resolved by adopting 'best practice'. In the given example, random sequences may be made reproducible by noting that they are in fact pseudo-random, and by specifying the code used to produce them together with the seed value(s).
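In R, for instance, this amounts to publishing the generator settings and seed alongside the simulation code – a minimal sketch:

```r
# Record the generator and seed so the 'random' sequence is reproducible
RNGkind("Mersenne-Twister")   # state the pseudo-random generator used
set.seed(20150722)            # a fixed, published seed value
sims <- rnorm(1000)           # simulated draws

# Anyone running these lines with the same generator and seed
# obtains an identical sequence, and hence identical results
```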
However, a more significant challenge is created by the so-called 'Data Revolution' (Kitchin, 2014b) and the idea of Big Data, which relate to the new paradigm of exploration and the search for empirical pattern, with its implications of data mining. The term Big Data refers not only to the size of data sets, but also to the diversity of applications, the complexity of data, and the fact that data is produced in a real-time 'firehose' environment in which sensors and other data-gathering devices stream vast quantities of data every second. This is of importance to geographers applying quantitative techniques, since much of this data has a geographical component. The exploratory paradigm is not without controversy – while the computational paradigm could be viewed as working in co-operation with deductive and empirical approaches, some propose the exploration of Big Data as a superior competitor to theory-led approaches (see Mayer-Schonberger and Cukier, 2013, or Anderson, 2008), suggesting that working with near-universal data sets and identifying patterns supplants the need for theory and experiment. The title of the Anderson piece leaves little doubt as to the magnitude of the claim being made!
However, such boosterish claims have not gone unchallenged – notably, in the discipline of geography, by Miller and Goodchild (2014), who argue, among other things, that there is still a need to understand the nature of the data being used and to discriminate between spurious and meaningful patterns. Kitchin (2014) likewise warns of the risks of ignoring contextual knowledge in the analysis of Big Data. Although reproducibility in research involving Big Data analysis would not fully address these issues, it may be argued that it can provide a foothold. Giving precise details of the assumptions made in coding (for example, what kinds of patterns are being sought out by a particular data mining algorithm?) will certainly provide an entry point into dialogues addressing the issues raised above.
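For example, a clustering step whose assumptions are stated in code might look like the following sketch (the data and variables are hypothetical stand-ins):

```r
# The pattern being sought is made explicit: k compact clusters in two
# standardized variables - an analyst's assumption, visible in the code
geo <- data.frame(income = rnorm(500), density = rnorm(500))  # stand-in data

set.seed(1)                                  # k-means results depend on the seed
clusters <- kmeans(scale(geo), centers = 5)  # k = 5 is a stated, contestable choice
table(clusters$cluster)                      # resulting cluster sizes
```

A reader can now see – and dispute – exactly what kind of pattern the algorithm was instructed to find.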
Despite this, currently many examples of reproducible research have used fairly 'traditional' approaches to data analysis, where a data set consists of a static file containing a rectangular table of cases by variables. More complex data poses less of a conceptual problem per se in terms of reproducibility – the challenge here is to devise appropriate analytical methods, but if that can be achieved then code can be created and reproducible research can be carried out in the ways outlined above. Similarly, diversity of applications presents no further conceptual difficulties for reproducibility. However, the real-time aspect does provide some challenges – clearly, even with the same code, two people accessing the same data stream at different points in time will not obtain identical results. One possibility might be to acknowledge that the data used in a given publication is a static entity, consisting of data obtained from a stream at a given point in time – and to time stamp and archive the data obtained and used in the analysis at the moment it was carried out. Although it would be impossible for a third party to obtain identical data from the stream, and consequently impossible to obtain identical analytical results, it would at least be possible to see the code used to access the stream, note the time the stream was accessed, and access a copy of the data obtained at that time. This would also enable scrutiny of the representativeness of the data – one contextual factor that may enable more meaningful analysis of Big Data.
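A minimal sketch of this practice in R (the stream URL is hypothetical) might be:

```r
# Snapshot a live data stream, time stamp it, and archive the snapshot so
# the exact data used in the analysis can be cited and re-examined later
stream_url  <- "http://feeds.example.org/sensors/latest.csv"  # hypothetical
accessed_at <- format(Sys.time(), "%Y-%m-%d_%H%M%S", tz = "UTC")

snapshot_file <- paste0("sensor_snapshot_", accessed_at, ".csv")
download.file(stream_url, destfile = snapshot_file)

# All analysis reads from the archived snapshot, never the live stream,
# and the access time forms part of the published record
sensor_data <- read.csv(snapshot_file)
```

Publishing the snapshot alongside the document then allows a third party to re-run the analysis on exactly the data that was gathered, even though the live stream has since moved on.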
VIII Conclusion

There are strong arguments for reproducibility in the quantitative analysis of human geography data – not just for academics, but also for public agencies and private consultancies charged with analysing data that may influence policy. Achieving this in some situations is clearly within reach, although there are also some challenges ahead as the diversity and volume of geographically referenced information increase. Arguably there is also a role for such methods in addressing the Big Data Revolution. However, the adoption of reproducible approaches does call for some changes in practice from both researchers – in adopting reproducible research practices – and publishers – in providing a medium through which reproducible documents may be easily submitted, handled and distributed.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Anderson C (2008) The end of theory: The data deluge makes the scientific method obsolete. Wired. Available at: http://www.wired.com/science/discoveries/magazine/16-07/pb_theory (accessed 22 July 2015).
Ballas D, Clarke G, Dorling D, Eyre H, Thomas B and Rossiter D (2005) SimBritain: A spatial microsimulation approach to population dynamics. Population, Space and Place 11(1): 13–34.
Barni M, Perez-Gonzalez F, Comesaña P and Bartoli G (2007) Putting reproducible signal processing into practice: A case study in watermarking. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing. Available at: http://gpsc.uvigo.es/sites/default/files/publications/icassp07reproducible.pdf (accessed 22 July 2015).
Bergmann L (2013) Bound by chains of carbon: Ecological-economic geographies of globalization. Annals of the Association of American Geographers 103(6): 1348–1370. DOI: 10.1080/00045608.2013.779547.
Brunsdon C and Comber A (2015) An Introduction to R for Spatial Analysis and Mapping. London: SAGE.
Brunsdon C and Singleton A (2015) Reproducible research: Concepts, techniques and issues. In: Brunsdon C and Singleton A (eds) Geocomputation: A Practical Primer. London: SAGE, 254–264.
Buckheit JB and Donoho DL (1995) WaveLab and Reproducible Research. Tech. Rep. 474, Dept of Statistics, Stanford University.
Claerbout J (1992) Electronic documents give reproducible research a new meaning. In: Proc. 62nd Ann. Int. Meeting of the Soc. of Exploration Geophysics, 601–604.
Clarke M and Holm E (1987) Microsimulation methods in spatial analysis and planning. Geografiska Annaler Series B, Human Geography 69(2): 145–164.
Gentleman R and Temple Lang D (2004) Statistical analyses and reproducible research. Bioconductor Project Working Paper 2.
Heppenstall A, Crooks A, See L and Batty M (2012) Agent-Based Models of Geographical Systems. New York: Springer.
Herndon T, Ash M and Pollin R (2013) Does high public debt consistently stifle economic growth? A critique of Reinhart and Rogoff. Cambridge Journal of Economics 38: 257–279.
Hey T, Tansley S and Tolle K (2009) Jim Gray on eScience: A transformed scientific method. In: Hey T, Tansley S and Tolle K (eds) The Fourth Paradigm: Data-Intensive Scientific Discovery. Redmond: Microsoft Research. Available at: http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_jim_gray_transcript.pdf (accessed 22 July 2015).
Kelling S, Hochachka WH, Fink D, Riedewald M, Caruana R, Ballard G and Hooker G (2009) Data-intensive science: A new paradigm for biodiversity studies. BioScience 59(7): 613–620. DOI: 10.1525/bio.2009.59.7.12.
Kitchin R (2014) Big Data and human geography: Opportunities, challenges and risks. Dialogues in Human Geography 3(3): 262–267.
Kitchin R (2014a) Big Data, new epistemologies and paradigm shifts. Big Data & Society 1(1). DOI: 10.1177/2053951714528481.
Kitchin R (2014b) The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences. London: SAGE.
Knuth D (1984) Literate programming. Computer Journal 27(2): 97–111.
Koenker R (1996) Reproducible Econometric Research. Department of Econometrics, University of Illinois.
Leisch F (2002) Dynamic generation of statistical reports using literate data analysis. In: Härdle W and Rönz B (eds) Compstat 2002: Proceedings in Computational Statistics. Heidelberg: Physika Verlag, 575–580.
Lovelace R and Ballas D (2013) 'Truncate, replicate, sample': A method for creating integer weights for spatial microsimulation. Computers, Environment and Urban Systems 41: 1–11.
Mayer-Schonberger V and Cukier K (2013) Big Data: A Revolution That Will Change How We Live, Work and Think. London: John Murray.
McKinney W (2012) Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. New York: O'Reilly.
Miller HJ and Goodchild M (2014) Data-driven geography. GeoJournal. DOI: 10.1007/s10708-014-9602-6.
O'Neil C and Schutt R (2013) Doing Data Science: Straight Talk from the Frontline. New York: O'Reilly.
Openshaw S and Taylor PJ (1979) A million or so correlation coefficients: Three experiments on the modifiable areal unit problem. In: Statistical Applications in the Spatial Sciences. London: Pion, 127–144.
Parker J and Epstein J (2011) A distributed platform for global-scale agent-based models of disease transmission. ACM Transactions on Modeling and Computer Simulation 22(1). DOI: 10.1145/2043635.2043637.
Pastell M (2014) Pweave: Reports from data with Python. Available at: http://mpastell.com/pweave/docs.html (accessed 22 July 2015).
R Core Team (2015) R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing. Available at: http://www.R-project.org/ (accessed 22 July 2015).
Radiant News (2015) Introducing Radiant: A shiny interface for R. Available at: http://www.r-bloggers.com/