Jupyter Notebooks—a Publishing Format for Reproducible Computational Workflows
1. Introduction
Researchers today across all academic disciplines often need to write computer code in
order to collect and process data, carry out statistical tests, run simulations or draw
figures. The widely applicable libraries and tools for this are often developed as open
source projects (such as NumPy, Julia, or FEniCS), but the specific code researchers
write for a particular piece of work is often left unpublished, hindering reproducibility.
Some authors may describe computational methods in prose, as part of a general
description of research methods. But human language lacks the precision of code, and
reproducing such methods is not as quick or as reliable as it should be. Others provide
code separately as supplementary material, but it may be difficult for readers to
cross-reference between code and prose, and there is a risk that the two become inconsistent
as the author works on them.
Notebooks—documents integrating prose, code and results—offer a way to
publish a computational method which can be readily read and replicated.
88 T. Kluyver et al. / Jupyter Notebooks – A Publishing Format
2. Notebooks
including a computational environment in which users can execute the code. Authors
can publish notebooks on GitHub along with an environment specification in one of a
few common formats. When the Binder web service is pointed at the repository, it
automatically creates a temporary environment containing the notebooks and any
libraries and data required to run them. This allows authors to publish their code in an
interactive and immediately verifiable form.
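As a rough illustration of what an environment specification can contain, the sketch below writes pinned dependency lines in the pip requirements.txt style. It assumes that requirements.txt is one of the accepted formats, and the helper name pin_requirements and the example package names are hypothetical, chosen only for this sketch.

```python
from importlib import metadata


def pin_requirements(packages, lookup=metadata.version):
    """Return requirements.txt-style lines pinning each package to a version.

    `lookup` maps a package name to a version string; by default it asks the
    current Python environment. This is a hypothetical helper, not part of
    Binder or Jupyter.
    """
    lines = []
    for name in sorted(packages):
        try:
            lines.append(f"{name}=={lookup(name)}")
        except metadata.PackageNotFoundError:
            # Not installed here: leave the entry unpinned rather than guess.
            lines.append(name)
    return lines
```

Calling "\n".join(pin_requirements(["numpy", "matplotlib"])) and saving the result as requirements.txt next to the notebooks would record the library versions used by the author, under the assumptions stated above.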
Together, these tools allow the preservation and reuse of scientific code, the
computational environment to run that code, and data within the size constraints of a git
repository. Third-party tools such as noWorkflow can integrate with this to track
provenance: how inputs, code and generated files relate to one another. noWorkflow
captures the execution of a marked notebook cell, or a script run through its command
line tool, as a ‘trial’, recording in a database the code that was used, the environment in
which it ran, the versions of modules that were used, and the files read and written.
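To make the idea of a recorded 'trial' concrete, the following is a minimal sketch of the kind of information listed above, built only from the Python standard library. It is not noWorkflow's API or implementation; the function names record_trial, module_versions and file_digest are hypothetical, and the sketch returns a dictionary rather than writing to a database.

```python
import hashlib
import json
import platform
import sys
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def module_versions():
    """Collect __version__ strings from modules that are currently imported."""
    versions = {}
    for name, module in list(sys.modules.items()):
        version = getattr(module, "__version__", None)
        if isinstance(version, str):
            versions[name] = version
    return versions


def record_trial(code, files_read=(), files_written=()):
    """Describe one execution of `code` as a JSON-serialisable dictionary."""
    return {
        # Which code was used.
        "code_sha256": hashlib.sha256(code.encode("utf-8")).hexdigest(),
        # The environment in which it ran.
        "python": platform.python_version(),
        "platform": platform.platform(),
        # The versions of modules that were loaded.
        "modules": module_versions(),
        # The files read and written, identified by content hash.
        "files_read": {str(p): file_digest(p) for p in files_read},
        "files_written": {str(p): file_digest(p) for p in files_written},
    }
```

A record produced this way can be serialised with json.dumps and stored alongside the results. Hashing file contents, rather than storing only file names, lets a later reader check whether an input has changed since the trial ran.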
Several papers have been published with supporting notebooks to reproduce the
analysis, or the creation of key plots. The detection of gravitational waves by the LIGO
experiment (LIGO Scientific Collaboration and Virgo Collaboration et al., 2016),
announced earlier this year, is one such: the researchers posted a notebook on their
website illustrating in detail how to filter and process the data to reveal the signature of
a distant black hole merger (LIGO collaboration). Others quickly made this available
through Binder, as described above (https://github.com/minrk/ligo-binder), allowing
anyone to replicate the analysis even without downloading or installing anything. Other
papers published in fields from geology to genetics to computer science have used
notebooks as supporting material (e.g. Sylvester et al., 2013; Olson & Roberts, 2015;
Brown et al., 2012).
Authors have also written books as a collection of IPython notebooks. Some of
these have been published in hard copy (e.g. Unpingco, 2014; Davidson-Pilon, 2015;
Rossant, 2014), but with the internet blurring traditional categorisations, similar
collections of notebooks are being published purely online. Of these, course materials
are a notable group, both to accompany teaching and for learners to work through
independently (e.g. Caporaso; Barba; Johansson).
It is not yet very practical to write academic papers themselves as notebooks, but
we are working towards this. One tricky point is inserting academic citations, which
require structured data about sources to be formatted in a very precise way that may
depend on the journal. One of us (TK) has an experimental plugin, cite2c
(https://github.com/takluyver/cite2c), which allows the author to search their reference
library stored in the Zotero service and insert citations into a Markdown cell. The
citations and bibliography are rendered by the citeproc-js package (Bennett), using the
common Citation Style Language format (http://citationstyles.org/).
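The sketch below illustrates why citations need structured data: one record about a source is rendered both as an in-text citation and as a bibliography entry. The record uses CSL-JSON style field names, and the two formatting functions are hypothetical stand-ins that hard-code the author-year style of this paper for journal articles only. It does not show how cite2c or citeproc-js work, since those apply a journal's style definition instead.

```python
# One source described as structured data, using CSL-JSON style field names.
PERKEL = {
    "type": "article-journal",
    "author": [{"family": "Perkel", "given": "J.M."}],
    "issued": {"date-parts": [[2015]]},
    "title": "Annotating the Scholarly Web",
    "container-title": "Nature",
    "volume": "528",
    "issue": "7580",
    "page": "153",
}


def issued_year(item):
    """Extract the publication year from the 'issued' date parts."""
    return item["issued"]["date-parts"][0][0]


def format_in_text(item):
    """Render an in-text citation such as 'Perkel, 2015'."""
    families = [author["family"] for author in item["author"]]
    if len(families) == 1:
        names = families[0]
    elif len(families) == 2:
        names = f"{families[0]} & {families[1]}"
    else:
        names = f"{families[0]} et al."
    return f"{names}, {issued_year(item)}"


def format_reference(item):
    """Render a bibliography entry for a journal article."""
    authors = [f"{a['family']}, {a['given']}" for a in item["author"]]
    if len(authors) > 1:
        names = ", ".join(authors[:-1]) + " & " + authors[-1]
    else:
        names = authors[0]
    return (
        f"{names} ({issued_year(item)}) {item['title']}, "
        f"{item['container-title']} {item['volume']} "
        f"({item['issue']}): {item['page']}"
    )
```

Because the record is kept separate from its presentation, switching to another journal's style means changing only the formatting step, not the notebook text or the stored data.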
Notebooks also fit well into novel publishing paradigms, such as post-publication
review. Digital objects such as GitHub repositories, which may contain notebooks, and
blog posts, which may be made from notebooks, can now be archived and given
permanent DOI references (GitHub; Yarkoni, 2015), making it practical to cite them in
other publications. The Jupyter Project is part of the coalition around Hypothes.is, an
open source tool to annotate documents on the web (Perkel, 2015; Hypothes.is, 2015).
Finally, work is under way to support real-time collaboration in notebooks. This will
let multiple authors work on a notebook together, with the changes instantly visible to
all, reducing the chance of two people trying to change the same thing in different
ways.
References
Brown, C.T., Howe, A., Zhang, Q., Pyrkosz, A.B. & Brom, T.H. (2012) A Reference-Free Algorithm for
Computational Normalization of Shotgun Sequencing Data, arXiv:1203.4802 [q-bio] Available from:
<http://arxiv.org/abs/1203.4802> [Accessed: 4 March 2016]
Davidson-Pilon, C. (2015) Bayesian Methods for Hackers: Probabilistic Programming and Bayesian
Inference, New York: Addison-Wesley Professional
GitHub Making Your Code Citable, Available
from: <https://guides.github.com/activities/citable-code/> [Accessed: 4 March 2016]
Iverson, K.E. (1962) A Programming Language, New York, NY, USA: John Wiley & Sons, Inc.
LIGO collaboration Signal Processing with GW150914 Open Data, Available from:
<https://losc.ligo.org/s/events/GW150914/GW150914_tutorial.html> [Accessed: 4 March 2016]
LIGO Scientific Collaboration and Virgo Collaboration, Abbott, B.P., Abbott, R., Abbott, T.D., Abernathy,
M.R., Acernese, F., et al. (2016) Observation of Gravitational Waves from a Binary Black Hole Merger,
Physical Review Letters 116 (6): 061102
Olson, C.E. & Roberts, S.B. (2015) Indication of Family-Specific DNA Methylation Patterns in Developing
Oysters, bioRxiv: 012831
Pérez, F. & Granger, B.E. (2007) IPython: A System for Interactive Scientific Computing, Computing in
Science & Engineering 9 (3): 21–29
Perkel, J.M. (2015) Annotating the Scholarly Web, Nature 528 (7580): 153
Rossant, C. (2014) IPython Interactive Computing and Visualization Cookbook, Packt Publishing
Spence, R. (1975) APL Demonstration, Imperial College London Available from:
<https://www.youtube.com/watch?v=_DTpQ4Kk2wA> [Accessed: 4 March 2016]
Sylvester, Z., Pirmez, C., Cantelli, A. & Jobe, Z.R. (2013) Global (latitudinal) Variation in Submarine
Channel Sinuosity: COMMENT, Geology 41 (5): e287–e287
Yarkoni, T. (2015) Now I Am Become DOI, Destroyer of Gatekeeping Worlds, The Winnower Available
from: <https://thewinnower.com/papers/282-now-i-am-become-doi-destroyer-of-gatekeeping-worlds>
[Accessed: 4 March 2016]