knitr: A Comprehensive Tool for Reproducible Research in R

Yihui Xie
Department of Statistics, Iowa State University

CONTENTS

1.1 A Web Application
1.2 Design
    1.2.1 Parser
    1.2.2 Evaluator
    1.2.3 Renderer
1.3 Features
    1.3.1 Code Decoration
    1.3.2 Graphics
        1.3.2.1 Graphical Devices
        1.3.2.2 Plot Recording
        1.3.2.3 Plot Rearrangement
        1.3.2.4 Plot Size
        1.3.2.5 The tikz Device
    1.3.3 Cache
    1.3.4 Code Externalization
    1.3.5 Chunk Reference
    1.3.6 Evaluation of Chunk Options
    1.3.7 Child Document
    1.3.8 R Notebook
1.4 Extensibility
    1.4.1 Hooks
    1.4.2 Language Engines
1.5 Discussion
Acknowledgments

Reproducibility is the ultimate standard by which scientific findings are judged. From the
computer science perspective, reproducible research is often related to literate programming
[13], a paradigm conceived by Donald Knuth, and the basic idea is to combine computer
code and software documentation in the same document; the code and documentation can
be identified by different special markers. We can either compile the code and mix the results
with documentation, or extract the source code from the document. To some extent, this
implies reproducibility because everything is generated automatically from computer code,
and the code can reflect all the details about computing.

Early implementations like WEB [12] and Noweb [20] were not directly suitable for data
analysis and report generation, which was partly overcome by later tools like Sweave [14].
There are still a number of challenges which were not solved by existing tools; for example,
Sweave is closely tied to LaTeX and hard to extend. The knitr package [28, 29] was built
upon the ideas of previous tools with a redesigned framework, enabling easy and fine control
of many aspects of a report. Sweave can be regarded as a subset of knitr in terms of
features.
In this chapter, we begin with a simple but striking example that shows how reproducible
research can become natural practice to authors given a simple and appealing tool. We
introduce the design of the package in Section 1.2 and how it works with a variety of
document formats including LaTeX, HTML and Markdown. Section 1.3 lists the features
that can be useful to data analysis such as the cache system and graphics support. Section
1.4 covers advanced features which extend knitr to a comprehensive environment for data
analysis; for example, other languages such as Python, awk and shell scripts can also be
integrated into the knitr framework. We will conclude with a few significant examples
including student homework, data reports, blog posts and websites built with knitr.
The main design philosophy of knitr is to make reproducible research easier and more
enjoyable than the common practice of cut-and-paste results. This package was written in
the R language [11, 19]. It is freely available on CRAN (Comprehensive R Archive Network)
and documented on its website http://yihui.name/knitr/; the development repository
is on GitHub: https://github.com/yihui/knitr, where users can file bug reports and
feature requests and participate in the development.
There are obvious advantages of writing a literate programming document over copying
and pasting results across software packages and documents. An overview of literate
programming applied to statistical analysis can be found in [22]; [8] introduced general
concepts of literate programming documents for statistical analysis, with a discussion of the
software architecture; [7] is a practical example based on [8], using an R package GolubRR
to distribute reproducible analysis; [2] revealed several problems that may arise with the
standard practice of publishing data analysis results, which can lead to false discoveries
due to a lack of sufficient detail for reproducibility (even with datasets supplied). Instead of
separating results from computing, we can actually put everything in one document (called
a compendium in [8]), including the computer code and narratives. When we compile this
document, the computer code will be executed, giving us the results directly. This is the
central idea of this chapter: we go from the source code to the report in one step, and
everything is automated by the source code.

1.1 A Web Application

R Markdown (referred to as Rmd hereafter) is one of the document formats that knitr
supports, and it is also the simplest one. Markdown [10] is a language that is both easy
to read and easy to write; it was designed primarily for writing web content and can be
translated to HTML (e.g. **text** translates to <strong>text</strong>). Below is a
trivial example of what Rmd looks like:

# First section

Description of the methods.

```{r brownian-motion, fig.height=4, fig.cap='Brownian Motion'}


x <- cumsum(rnorm(100))
plot(x)
```

The mean of x is `r mean(x)`.


We can compile this document with knitr and the output will be an HTML web page
containing all the results from R, including numeric and graphical results. This not only
makes it easier for authors to write a report, but also guarantees that the report is
reproducible since no cut-and-paste operations are involved. To compile the report, we
only need to load the knitr package in R and call the knit() function:
library(knitr)
knit("myfile.Rmd") # suppose we saved the above file as myfile.Rmd

Based on this simple idea, knitr users have contributed hundreds of reports to the
hosting website RPubs (http://rpubs.com) within a few months since it was launched,
ranging from student homework and data analysis reports to HTML5 slides and class quizzes.
Traditionally, literate programming tools often choose LaTeX as the authoring environment,
which has a steep learning curve for beginners. The success of R Markdown and RPubs
shows that one does not have to be a typesetting expert in order to make use of literate
programming and write reproducible reports.

1.2 Design

The package design consists of three components: the parser, evaluator and renderer. The
parser identifies and extracts computer code from the source document; the evaluator
executes the code; the renderer generates the final output by appropriately marking up the
results according to the output format.

1.2.1 Parser

To include computer code in a document, we have to use special patterns to separate it
from normal text. For instance, the Rmd example in Section 1.1 has an R code chunk
which starts with ```{r} and ends with ```.
Internally knitr uses the object knit_patterns to set or get the pattern rules, which
are essentially regular expressions. Different document formats use different sets of regular
expressions by default, and all built-in patterns are stored in the object all_patterns as
a named list. For example, all_patterns$rnw is a set of patterns for the Rnw format,
which has R code embedded in a LaTeX document using the Noweb syntax. Similarly,
knitr has default syntax patterns for other formats like Markdown (md), HTML (html)
and reStructuredText (rst). We take the Rnw syntax as an example.

library(knitr)
names(all_patterns) # all built-in document formats

## [1] "rnw" "brew" "tex" "html" "md" "rst"

all_patterns$rnw[c("chunk.begin", "chunk.end", "inline.code")]


## $chunk.begin
## [1] "^\\s*<<(.*)>>="
##
## $chunk.end
## [1] "^\\s*@\\s*(%+.*|)$"
##
## $inline.code
## [1] "\\\\Sexpr\\{([^}]+)\\}"

In the pattern list for the Rnw format, there are three major elements as shown above:
chunk.begin, chunk.end and inline.code, which are regular expressions indicating the
patterns for the beginning and ending of a code chunk, and inline code respectively. For
example, the regular expression ^\s*<<(.*)>>= means the pattern for the beginning of
a code chunk is: at the beginning (^) of the line, there may be some white space
(\s*), then the chunk header starts with <<; inside the chunk header, there can be some
text denoting chunk options ((.*)), which can be regarded as metadata for a chunk (e.g.
fig.height=4 means the figure height will be 4 inches for this chunk); the chunk header is
closed by >>=. The code chunk is usually closed by @ (white space is allowed before it
and TeX comments are allowed after it), and we can also write inline code inside the pseudo
TeX command \Sexpr{}. Below is an example of a fragment of an Rnw document:

\section{First section}

Description of the methods.

<<brownian-motion, fig.height=4, fig.cap='Brownian Motion'>>=


x <- cumsum(rnorm(100))
plot(x)
@

The mean of x is \Sexpr{mean(x)}.
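
To make the role of the three patterns concrete, here is a minimal sketch in base R that applies them to a few lines of an Rnw fragment. The variable names (chunk_begin, begin_idx and so on) are only illustrative, and perl = TRUE is an assumption made here for predictable regular expression behavior; this is not how knitr organizes its parser internally.

```r
# The three Rnw patterns quoted above, written as plain R strings.
chunk_begin <- "^\\s*<<(.*)>>="
chunk_end   <- "^\\s*@\\s*(%+.*|)$"
inline_code <- "\\\\Sexpr\\{([^}]+)\\}"

# A shortened version of the Rnw fragment, one element per line.
lines <- c(
  "\\section{First section}",
  "<<brownian-motion, fig.height=4>>=",
  "x <- cumsum(rnorm(100))",
  "@",
  "The mean of x is \\Sexpr{mean(x)}."
)

# Which lines open and close a chunk?
begin_idx <- grep(chunk_begin, lines, perl = TRUE)
end_idx   <- grep(chunk_end, lines, perl = TRUE)

# The captured group (.*) holds the chunk label and options.
header <- sub(chunk_begin, "\\1", lines[begin_idx], perl = TRUE)

# The captured group of the inline pattern holds the R expression.
inline <- regmatches(lines, regexec(inline_code, lines, perl = TRUE))[[5]][2]
```

Everything between begin_idx and end_idx would be handed to the evaluator as chunk code, while header and inline carry the chunk options and the inline expression.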


Based on the Rnw syntax, knitr will find the code chunk as well as the inline code
mean(x). Anything else in the document will remain untouched, and will be mixed with the
results from the computer code eventually. To show the parser can be easily generalized,
we take a look at the Rmd syntax as well:

str(all_patterns$md[c("chunk.begin", "chunk.end", "inline.code")])

## List of 3
## $ chunk.begin: chr "^\\s*`{3,}\\s*\\{r(.*)\\}\\s*$"
## $ chunk.end : chr "^\\s*`{3,}\\s*$"
## $ inline.code: chr "`r +([^`\n]+)\\s*`"

Roughly speaking, the three major patterns are changed to ```{r *} (beginning), ```
(ending) and `r *` (inline) respectively. If we want to specify our own syntax, we can use
the knit_patterns$set() function, which will override the default syntax, e.g.

knit_patterns$set(chunk.begin = "^<<r(.*)", chunk.end = "^r>>$",
                  inline.code = "\\{\\{([^}]+)\\}\\}")

Then we will be able to parse a document like this with the custom syntax:

<<r brownian-motion, fig.height=4, fig.cap='Brownian Motion'
x <- cumsum(rnorm(100))
plot(x)
r>>

The mean of x is {{mean(x)}}.

In practice, however, this kind of customization is often unnecessary. It is better to follow
the default syntax; otherwise additional instructions will be required in order to compile
a literate programming document. Table 1.1 shows all the document formats which are
currently supported by knitr.

TABLE 1.1
Code syntax for different document formats (* denotes local chunk options, e.g. <<label,
eval=FALSE>>=; x denotes inline R code, e.g. <% 1+2 %>).

format   start                end             inline             output
Rnw      <<*>>=               @               \Sexpr{x}          TeX
Rmd      ```{r *}             ```             `r x`              Markdown
Rhtml    <!--begin.rcode *    end.rcode-->    <!--rinline x-->   HTML
Rrst     .. {r *}             .. ..           :r:`x`             reST
Rtex     % begin.rcode *      % end.rcode     \rinline{x}        TeX
brew                                          <% x %>            text

Among all chunk options, there is a special option called the chunk label. It is the only
chunk option that does not have to be of the form option = value. The chunk label is
supposed to be a unique identifier of a code chunk, which will be used as the filename for
figure files, cache files, and also IDs for chunk references. We will mention them later in
Section 1.3.

1.2.2 Evaluator

Once we have the code chunks and inline code expressions extracted from the document, we
need to evaluate them. The evaluate package [26] is used to execute code chunks, and the
eval() function in base R is used to execute inline R code. The latter is easy to understand
and made possible by the power of computing on the language [18] of R. Suppose we have
a code fragment 1+1 as a character string; we can parse and evaluate it as R code:

eval(parse(text = "1+1"))

## [1] 2
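
The same parse-and-evaluate step can be combined with the Rmd inline pattern shown earlier to substitute results into a line of text. The function below is a toy sketch under that assumption: knit_inline is a hypothetical name, format() is a simplification of the number formatting, and none of this is knitr's actual implementation.

```r
# Toy inline evaluator: replaces each `r expr` in a string with the value of expr.
knit_inline <- function(text, envir = new.env()) {
  pattern <- "`r +([^`\n]+)\\s*`"
  # All inline code spans found in the text (possibly none).
  hits <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
  for (hit in hits) {
    code  <- sub(pattern, "\\1", hit, perl = TRUE)    # the bare R expression
    value <- eval(parse(text = code), envir = envir)  # parse and evaluate it
    text  <- sub(hit, format(value), text, fixed = TRUE)
  }
  text
}

# Evaluate in an environment that already holds x.
env <- new.env()
assign("x", c(1, 2, 3), envir = env)
out <- knit_inline("The mean of x is `r mean(x)`.", env)
print(out)
```

Passing an explicit environment keeps the variables created by earlier code visible to later inline expressions, which is the behavior a report author expects.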

For code chunks, it is more complicated. The evaluate package takes a piece of R source
code, evaluates it and returns a list containing results of six possible classes: character
(normal text output), source (source code), warning, message, error and recordedplot
(plots).

library(evaluate)
res <- evaluate(c("'hello world!'", "1:2+1:3"))
str(res, nchar.max = 37)

## List of 5
## $ :List of 1
## ..$ src: chr "'hello world!'\n"
## ..- attr(*, "class")= chr "source"
## $ : chr "[1] \"hello world!\"\n"
## $ :List of 1
## ..$ src: chr "1:2+1:3"
## ..- attr(*, "class")= chr "source"
## $ :List of 2
## ..$ message: chr "longer object length is not a multip"| __truncated__
## ..$ call : language 1:2 + 1:3
## ..- attr(*, "class")= chr [1:3] "simpleWarning" "warning" "condition"
## $ : chr "[1] 2 4 4\n"

An internal S3 generic function wrap() in knitr is used to deal with different types
of output, using output hooks defined in the object knit_hooks, which constitutes the
renderer. Before the final output is rendered, we may have to post-process the output from
evaluate according to chunk options. For example, if the chunk option echo=FALSE, we
need to remove the source code. This is one advantage of using the evaluate package,
because we can easily filter out the result elements that we do not want according to the
classes of the elements. Continuing the example above, we can remove the source code by:

## filter out elements which are not source
res <- Filter(Negate(is.source), res)
str(res, nchar.max = 37)

## List of 3
## $ : chr "[1] \"hello world!\"\n"
## $ :List of 2
## ..$ message: chr "longer object length is not a multip"| __truncated__
## ..$ call : language 1:2 + 1:3
## ..- attr(*, "class")= chr [1:3] "simpleWarning" "warning" "condition"
## $ : chr "[1] 2 4 4\n"

Similarly we can process other elements according to the chunk options; for instance,
warning=FALSE means to remove warning messages, and results='hide' means to remove
elements of the class character; knitr has a large number of chunk options to tweak the
output, which are documented at http://yihui.name/knitr/options.
One notable feature of the evaluate package that may be surprising to most R users is
that it does not stop on errors by default. This is to mimic the behavior of R when we copy
and paste R code in the console (or terminal): if an error occurs in a previous R expression,
the rest of the code will still be pasted and executed. To completely stop on errors, we need
to set a package option in knitr:
opts_knit$set(stop_on_error = 2L)
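
The continue-after-error behavior described above can be imitated in base R by evaluating expressions one at a time and catching errors individually. This is only a sketch of the idea; run_all is a hypothetical helper and not part of evaluate or knitr.

```r
# Evaluate each top-level expression in turn; an error in one expression
# is captured as text and does not prevent the next one from running.
run_all <- function(code) {
  exprs <- parse(text = code)
  env <- new.env()
  lapply(exprs, function(e) {
    tryCatch(
      eval(e, envir = env),
      error = function(err) paste("Error:", conditionMessage(err))
    )
  })
}

res <- run_all(c("a <- 1", "stop('boom')", "a + 1"))
str(res)
```

The third expression still runs and sees the variable a, just as it would if the three lines were pasted into the console.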

1.2.3 Renderer

Unlike other implementations such as Sweave, knitr makes almost everything accessible
to the users, including every piece of results returned from evaluate. The users are free
to write these results in any format they like via output hook functions.

TABLE 1.2
Output hook functions and the object classes of results from the evaluate package.

Class          Output hook   Arguments
source         source        x, options
character      output        x, options
recordedplot   plot          x, options
message        message       x, options
warning        warning       x, options
error          error         x, options
               chunk         x, options
               document      x

Consider the following simple example:

1 + 1

## [1] 2

There are two parts in the returned results: the source code 1 + 1 and the output
[1] 2. Users may define a hook function for the source code like this to use the lstlisting
environment in LaTeX:

knit_hooks$set(source = function(x, options) {
  paste("\\begin{lstlisting}\n", x, "\\end{lstlisting}\n",
        sep = "")
})

Or put it inside the <pre> tag with a CSS class source in HTML:

knit_hooks$set(source = function(x, options) {
  paste("<pre class='source'>", x, "</pre>", sep = "")
})

Here the name of the hook function corresponds to the class of the element returned
from evaluate; see Table 1.2 for the mapping between the two sets of names. The argument
x of the hook denotes the corresponding output (a character string), and options is a list of
chunk options for the current code chunk, e.g. options$fig.width is a numeric value which
determines the width of figures in the current chunk. Note there are two additional output
hooks called chunk and document. The chunk hook takes the output of the whole chunk
as input, which has been processed by the previous six output hooks; the document hook
takes the output of the whole document as input and allows further post-processing of the
output text.
Like the parser, knitr also has a series of default output hooks for different document
formats, so users do not have to rewrite the renderer in most cases.
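
Because an output hook is an ordinary R function of x and options, it can be written and tried out on its own before it is registered. The sketch below is a variant of the HTML source hook that also escapes special characters; escape_html and html_source_hook are hypothetical names introduced here for illustration.

```r
# Escape the characters that have a special meaning in HTML.
escape_html <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)  # must come first
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  gsub(">", "&gt;", x, fixed = TRUE)
}

# A source hook with the (x, options) signature described above.
html_source_hook <- function(x, options) {
  paste("<pre class='source'>", escape_html(x), "</pre>", sep = "")
}

# Call the hook directly, as the renderer would, to inspect its output.
out <- html_source_hook("x <- 1 & y > 2", options = list())
print(out)
```

Once the output looks right, such a function could be registered with knit_hooks$set(source = html_source_hook), in the same way as the hooks shown earlier.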

1.3 Features

The knitr package borrowed features such as TikZ graphics [25] and cache from pgfSweave
[3] and cacheSweave [16] respectively, but the implementations are completely different.
New features like code reference from an external R script as well as output customization
are also introduced. The feature of hook functions in Sweave was re-implemented and hooks
have extended power now. Special emphasis was put on graphics: there can be any number of
plots per chunk, there are more than 20 graphical devices to choose from (PDF, PNG and
Cairo devices, etc.), and it is also easy to specify the size and alignment of plots via chunk
options.
There are several other small features which were motivated by the experience of using
Sweave. For example, a progress bar is provided when knitting a file so we more or less
know how long we still need to wait; output from inline R code (e.g. \Sexpr{x[1]}) is
automatically formatted in scientific notation (like 1.2346 × 10^8) if the result is numeric
(this applies to all document formats), and we will not get too many digits by default (the
default number in R is 7 which is too long).
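
The kind of formatting described here can be sketched in a few lines of base R. The function name sci_latex, the threshold of five orders of magnitude and the LaTeX markup are all assumptions chosen for illustration; the rules knitr actually applies may differ.

```r
# Format a single number for LaTeX: plain rounding for moderate magnitudes,
# mantissa-times-power-of-ten for very large or very small values.
sci_latex <- function(x, digits = 4) {
  if (x == 0) return("0")
  exponent <- floor(log10(abs(x)))
  # Assumed cutoff: keep ordinary notation below five orders of magnitude.
  if (abs(exponent) < 5) return(format(signif(x, digits)))
  mantissa <- signif(x / 10^exponent, digits + 1)
  paste0(format(mantissa), " \\times 10^{", exponent, "}")
}

big   <- sci_latex(123456789)
small <- sci_latex(3.14159265)
print(c(big, small))
```

A production version would also need to handle vectors and the case where rounding pushes the mantissa up to 10, which this sketch ignores.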
As we emphasize the ease of use, the concept of an R Notebook was also introduced
in this package, which enables one to write a pure R script to create a report, and knitr
will take care of the details of formatting and compilation.

1.3.1 Code Decoration

Syntax highlighting comes by default in knitr (chunk option highlight=TRUE), since we
believe it enhances the readability of the source code. The formatR package [27] is used to
reformat R code (option tidy=TRUE), e.g. add spaces and indentation, break long lines into
shorter ones and automatically replace the assignment operator = with <-; see the manual of
formatR for details.
For LaTeX output, the framed package is used to decorate code chunks with a light gray
background (as we can see in this document). If this LaTeX package is not found in the
system, a version will be copied directly from knitr. The output for HTML documents is
styled with CSS, which looks similar to LaTeX (with gray shadings and syntax highlighting).
The prompt characters are removed by default because they mangle the R source code
in the output and make it difficult to copy R code. The R output is masked in comments by
default based on the same rationale. In fact, this was largely motivated by my experience
of grading homework; with the default prompts, it is difficult to verify the results in the
homework because it is so inconvenient to copy the source code. Anyway, it is easy to
revert to the output with prompts (set option prompt=TRUE), and we will quickly realize
the inconvenience to the readers if they want to run the code in the output document:

> x <- rnorm(5)
> x
[1] -0.56048 -0.23018 1.55871 0.07051 0.12929
> var(x)
[1] 0.6578

The example below shows the effect of tidy=TRUE/FALSE:

## option tidy=FALSE
for(k in 1:10){j=cos(sin(k)*k^2)+3;print(j-5)}

## option tidy=TRUE
for (k in 1:10) {
j <- cos(sin(k) * k^2) + 3
print(j - 5)
}
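
As a rough illustration of what reformatting involves, base R can already re-print parsed code with consistent spacing and indentation. The sketch below uses parse() and deparse() as a stand-in; it is not how formatR works internally, and unlike tidy=TRUE it leaves = as the assignment operator and drops comments.

```r
# One long, unformatted line of R code.
messy <- "for(k in 1:10){j=cos(sin(k)*k^2)+3;print(j-5)}"

# Parse it into a language object, then turn it back into text.
expr <- parse(text = messy)[[1]]
tidy_lines <- deparse(expr)

cat(tidy_lines, sep = "\n")
```

The result is split over several lines with spaces around operators, which is the same kind of transformation that the tidy option applies with more care.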

While this may seem to be irrelevant to reproducible research, we would argue that it
is of great importance to design styles that look appealing and helpful at first glance,
which can encourage users to write reports in this way.

1.3.2 Graphics

Graphics are an important part of reports, and several enhancements have been made in
knitr. For example, grid graphics [15] may not need to be explicitly printed as long as
the same code can produce plots in the R console (in some cases, however, they have to be
printed, e.g. in a loop, because we have to do so in an R console); below is a chunk of code
which will produce a plot in both the R console and knitr:
library(ggplot2)
p <- qplot(carat, price, data = diamonds) + geom_hex()
p # no need to print(p)

[Figure: hexagon-binned plot of price (y axis) against carat (x axis) for the diamonds data, with a legend for count.]

1.3.2.1 Graphical Devices


For a long time, a frequently requested feature for Sweave was support for other
graphics devices, which has been implemented since R 2.13.0. Instead of using several
logical options like png or jpeg, knitr uses a single option dev (like grdevice in Sweave)
which has support for more than 20 devices. For instance, dev='png' will use the png()
device in the grDevices package in base R, and dev='CairoJPEG' uses the CairoJPEG()
device in the add-on package Cairo (it has to be installed first, of course). Here are the
possible values for dev:

## [1] "bmp" "postscript" "pdf" "png"
## [5] "svg" "jpeg" "pictex" "tiff"
## [9] "win.metafile" "cairo_pdf" "cairo_ps" "quartz_pdf"
## [13] "quartz_png" "quartz_jpeg" "quartz_tiff" "quartz_gif"
## [17] "quartz_psd" "quartz_bmp" "CairoJPEG" "CairoPNG"
## [21] "CairoPS" "CairoPDF" "CairoSVG" "CairoTIFF"
## [25] "Cairo_pdf" "Cairo_png" "Cairo_ps" "Cairo_svg"
## [29] "tikz"

If none of these devices is satisfactory, we can provide the name of a customized device
function, which must have been defined in this form before it is used:

custom_dev <- function(file, width, height, ...) {
  # open the device here, e.g. pdf(file, width, height, ...)
}

Then we can set the chunk option dev='custom_dev'.
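
For illustration, a filled-in version of such a device function might look like the sketch below, which simply wraps the base pdf() device. The body is an assumption about what a user might want; only the function signature comes from the text above. The last lines call the function by hand to check that it opens a working device.

```r
# A custom device with the required signature; it opens a PDF device
# and passes any extra arguments through to grDevices::pdf().
custom_dev <- function(file, width, height, ...) {
  grDevices::pdf(file, width = width, height = height, ...)
}

# Try the device manually, outside of knitr.
f <- tempfile(fileext = ".pdf")
custom_dev(f, width = 5, height = 4)
plot(1:10)
invisible(dev.off())
file.exists(f)
```

Testing the function by hand like this is a quick way to find mistakes in the device arguments before using it as a chunk option.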

1.3.2.2 Plot Recording


All the plots in a code chunk are first recorded as R objects and then replayed inside a
graphical device to generate plot files. The evaluate package records plots on a
per-expression basis; in other words, the source code is split into individual complete
expressions and evaluate will examine possible plot changes in snapshots after each single
expression has been evaluated. For example, the code below consists of three expressions,
out of which two are related to drawing plots, therefore evaluate will produce two plots by
default:
par(mar = c(3, 3, 0.1, 0.1))
plot(1:10, ann = FALSE, las = 1)
text(5, 9, "mass $\\rightarrow$ energy\n$E=mc^2$")

[Figure: two side-by-side plots of the points 1:10, one of them annotated with the text "mass → energy, E = mc^2".]

This brings a significant difference with traditional tools in R for dynamic report
generation, since low-level plotting changes can also be recorded. The option fig.keep controls
which plots to keep in the output; fig.keep='all' will keep low-level changes in separate
plots; by default (fig.keep='high'), knitr will merge low-level plot changes into the
previous high-level plot, like most graphics devices do. This feature may be useful for teaching R
graphics step by step. Note, however, low-level plotting commands in a single expression (a
typical case is a loop) will not be recorded cumulatively, but high-level plotting commands,
regardless of where they are, will always be recorded. For example, this chunk will only
produce 2 plots instead of 21 plots because there are 2 complete expressions:

plot(0, 0, type = "n", ann = FALSE)
for (i in seq(0, 2 * pi, length = 20)) points(cos(i), sin(i))

But this will produce 20 plots as expected:

for (i in seq(0, 2 * pi, length = 20)) {
  plot(cos(i), sin(i), xlim = c(-1, 1), ylim = c(-1, 1))
}

We can discard all previous plots and keep the last one only by fig.keep='last', or
keep only the first plot by fig.keep='first', or discard all plots by fig.keep='none'.
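
The record-then-replay idea behind this section can be tried directly with the base functions recordPlot() and replayPlot(). The sketch below is a simplified illustration of the mechanism, not the code that evaluate or knitr actually runs; the file name and sizes are arbitrary.

```r
# Step 1: draw on a device that writes no file, with recording enabled.
pdf(NULL)
dev.control(displaylist = "enable")
plot(1:10, ann = FALSE, las = 1)
recorded <- recordPlot()   # snapshot of the plot as an R object
invisible(dev.off())

# Step 2: replay the snapshot on a real device to produce a plot file.
out_file <- tempfile(fileext = ".pdf")
pdf(out_file, width = 5, height = 4)
replayPlot(recorded)
invisible(dev.off())

class(recorded)
```

Separating recording from rendering in this way is what lets one decide afterwards which plots to keep and which device and size to use.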

1.3.2.3 Plot Rearrangement


The chunk option fig.show can decide whether to hold all plots while evaluating the
code and flush all of them to the end of a chunk (fig.show='hold'; see the previous
plot example), or just insert them to the places where they were created (by default
fig.show='asis'). Here is an example of fig.show='asis' for two plots in one chunk:

contour(volcano) # contour lines


[Figure: contour plot of the volcano data, with contour lines labeled from 100 to 190;
both axes run from 0.0 to 1.0.]

filled.contour(volcano) # fill contour plot with colors


[Figure: filled contour plot of the volcano data, with a color key from 100 to 180;
both axes run from 0.0 to 1.0.]

Besides 'hold' and 'asis', the option fig.show can take a third value, 'animate',
which makes it possible to insert animations into the output document. In LaTeX, the
package animate is used to put together image frames as an animation. For animations
to work, there must be more than one plot produced in a chunk. The option interval
controls the time interval between animation frames; by default it is 1 second. Note that we
have to add \usepackage{animate} to the LaTeX preamble, because knitr will not add it
automatically. Animations in the PDF output can only be viewed in Adobe Reader. There
are animation examples in both the main manual and the graphics manual of knitr, which can
be found on the package website.
We can specify the figure alignment via the chunk option fig.align ('left', 'center'
and 'right'). The plot example in the previous section used fig.align='center', so the
two plots were centered.

1.3.2.4 Plot Size


The fig.width and fig.height options specify the size of plots in the graphics device
(in inches), and the real size in the output document can be different (specified by
out.width and out.height). When there are multiple plots per code chunk, it is possible
to arrange them side by side. For example, in LaTeX we only need to set out.width
to be less than half of the current line width, e.g. out.width='.49\\linewidth'.
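
A minimal sketch of the side-by-side arrangement just described, assuming an Rnw (LaTeX) document; the chunk label two-plots and the two plotting calls are illustrative only. The option fig.show='hold' is added so that both plots are placed after the code and can sit on the same line.

```rnw
<<two-plots, fig.width=4, fig.height=4, out.width='.49\\linewidth', fig.show='hold'>>=
plot(1:10)          # first plot, about half the line width
hist(rnorm(100))    # second plot, placed next to the first
@
```

Here fig.width and fig.height set the device size in inches, while out.width only scales the included image in the final document.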

1.3.2.5 The tikz Device


Besides PDF, PNG and other traditional R graphical devices, knitr has special support
for TikZ graphics via the tikzDevice package [24], which is similar to the feature of
pgfSweave. If we set the chunk option dev='tikz', the tikz() device in tikzDevice
will be used to generate plots. The options sanitize (for escaping special TeX characters)
and external are related to the tikz device: see the documentation of tikz() for details.
Note that external=TRUE in knitr has a different meaning from that in pgfSweave: it means
standAlone=TRUE in tikz(), and the TikZ graphics output will be compiled to PDF immediately
after it is created, so the externalization does not depend on the official but complicated
externalization commands in the tikz package in LaTeX. To maintain consistency
in (font) styles, knitr will read the preamble of the input document and pass it to the tikz
device, so that the font style in the plots will be the same as the style of the whole LaTeX
document.
Besides consistency of font styles, the tikz device also enables us to write arbitrary
LaTeX expressions into R plots. A typical use is to write math expressions. The traditional
approach in R is to use an expression() object to write math symbols in the plot; with
the tikz device, we only need to write normal LaTeX code. Below is an example of the math
expression p(θ|x) ∝ π(θ)f(x|θ) using the two approaches respectively:

plot(0, type = "n", ann = FALSE)
text(0, expression(p(theta ~ "|" ~ bold(x)) %prop% pi(theta) * f(bold(x) ~
    "|" ~ theta)), cex = 2)

[Figure: the expression p(θ | x) ∝ π(θ)f(x | θ) drawn with an expression() object.]

With the tikz device, it is both straightforward (if we are familiar with LaTeX) and more
beautiful:

plot(0, type = "n", ann = FALSE)
text(0, "$p(\\theta|\\mathbf{x})\\propto\\pi(\\theta)f(\\mathbf{x}|\\theta)$",
    cex = 2)

[Figure: the same expression, p(θ|x) ∝ π(θ)f(x|θ), typeset by LaTeX through the tikz device.]

One disadvantage of the tikz device is that LaTeX may not be able to handle very large
tikz files (it can run out of memory). For example, an R plot with tens of thousands of
graphical elements may fail to compile in LaTeX if we use the tikz device. In such cases, we
can switch to the PDF or PNG device, or reconsider our choice of plot type; e.g.,
a scatter plot with millions of points is usually difficult to read, and a contour plot or a
hexagon plot showing the 2D density can be a better alternative (they are smaller in size).
We emphasized the uniqueness of chunk labels in Section 1.2.1, and here is one reason
why they have to be unique: the chunk label is used in the filenames of plots; if there are two
chunks which share the same label, the later chunk will overwrite the plots generated in the
earlier chunk. The same is true for cache files in the next section.

1.3.3 Cache

The basic idea of cache is that we directly load results from a previous run instead of
recomputing everything from scratch if nothing has changed since the last run. This is not
a new idea: both cacheSweave [16] and weaver [6] have implemented it based on Sweave,
with the former using filehash [17] and the latter using .RData images; cacheSweave also
supports lazy-loading of objects based on filehash. The knitr package directly uses internal
base R functions to save (tools:::makeLazyLoadDB()) and lazy-load objects (lazyLoad()).
The cacheSweave vignette explains lazy-loading clearly; roughly speaking, lazy-loading
means an object will not be loaded into memory until it is actually used
somewhere. This is very useful for cache: sometimes we read a large object and cache it,
then take a subset for analysis and this subset is also cached; in the future, the initial large
object will not be loaded into R if our computation is based only on the subset object.
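
The lazy-loading behavior described above can be imitated in plain R with a promise. The sketch below is only a loose analogy, not what knitr does internally: it uses base R's delayedAssign() rather than lazyLoad(), and the object names (big, loaded, small) are made up for the illustration.

```r
# Flag that records whether the "large" object has actually been built
loaded <- FALSE

# Create a promise: the expression is not evaluated until `big` is first used
delayedAssign("big", {
  loaded <- TRUE      # side effect that reveals when evaluation happens
  seq_len(1e6)
})

# Work that only needs a small object does not touch `big`
small <- 1:10
mean(small)
was_loaded_before <- loaded   # FALSE: `big` has not been evaluated yet

# The first real use forces the promise
n <- length(big)
was_loaded_after <- loaded    # TRUE: `big` is now in memory
```

In the same spirit, a cached large object that is never referenced in a later run need not be read back into memory.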
The paths of cache files are determined by the chunk option cache.path; by default all
cache files are created under a directory cache/ relative to the current working directory,
and if the option value contains a directory (e.g. cache.path='cache/abc-'), cache files
will be stored under the directory cache/ (automatically created if it does not exist) with
a prefix abc-. The cache is invalidated and purged on any change to the code chunk,
including both the R code and the chunk options; this means previous cache files of this chunk
are removed (filenames are identified by the chunk label) and a new set of cache files will
be written. The change is detected by checking whether the MD5 hash of the code and options,
calculated with the digest package [5], has changed.
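
The invalidation rule above can be sketched in a few lines of base R. This is an illustration of the idea only, not knitr's implementation: the helper chunk_hash() is hypothetical, and it uses tools::md5sum() on a temporary file as a stand-in for the digest package.

```r
# Hypothetical helper: compute an MD5 hash from a chunk's code and options
chunk_hash <- function(code, options) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  # sort options by name so that their order does not affect the hash
  opts <- deparse(options[order(names(options))])
  writeLines(c(code, opts), tmp)
  unname(tools::md5sum(tmp))
}

h1 <- chunk_hash("x <- rnorm(100)",  list(eval = TRUE, echo = FALSE))
h2 <- chunk_hash("x <- rnorm(100)",  list(echo = FALSE, eval = TRUE))  # same content
h3 <- chunk_hash("x <- rnorm(1000)", list(eval = TRUE, echo = FALSE))  # code changed

same_chunk    <- identical(h1, h2)  # TRUE: the cache could be reused
changed_chunk <- identical(h1, h3)  # FALSE: the cache would be purged and rebuilt
```

A hash mismatch is the signal to remove the old cache files for that chunk label and write new ones.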
Two features that make knitr different from other packages are: cache files will
never accumulate since old cache files will always be removed, and knitr will also try to
preserve side effects such as printing and loading add-on packages. However, there are still
other types of side effects, like setting par() or options(), which are not cached. Users
should be aware of these special cases, and make sure to move code which is
not meant to be cached into separate chunks which are not cached, e.g., set all global options
in the first chunk of a document and do not cache that chunk.
Sometimes a cached chunk may need to use objects from other cached chunks, which can
bring a serious problem: if objects in previous chunks have changed, this chunk will not
be aware of the changes and will still use old cached results, unless there is a way to detect
such changes from other chunks. There is an option called dependson in cacheSweave
which does this job. In knitr, we can also explicitly specify which other chunks this chunk
depends on by setting an option like dependson=c('chunkA', 'chunkB') (a character
vector of chunk labels). Each time the cache of a chunk is rebuilt, all other chunks which
depend on this chunk will lose their cache, hence their cache will be rebuilt as well.
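
A minimal sketch of an explicit dependency between two cached chunks in an Rnw document; the labels chunkA and chunkB follow the example above, and the code inside them is illustrative.

```rnw
<<chunkA, cache=TRUE>>=
x <- rnorm(100)
@

<<chunkB, cache=TRUE, dependson='chunkA'>>=
m <- mean(x)   # uses an object created in chunkA
@
```

If the code or options of chunkA change, its cache is rebuilt, and the cache of chunkB is then rebuilt as well.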
There are two alternative approaches to specifying chunk dependencies: dep_auto() and
dep_prev(). For the former, we need to turn on the chunk option autodep (i.e. set
autodep=TRUE), then put dep_auto() in the first chunk in a document. This is an
experimental feature borrowed from weaver which frees us from setting chunk dependencies
manually. The basic idea is, if a later chunk uses any objects created in a previous chunk,
the later chunk is said to depend on the previous one. The function findGlobals() in
the codetools package is used to find all global objects in a chunk, and according to
its documentation, the result is an approximation. Global objects roughly mean the ones
which are not created locally, e.g. in the expression function() {y <- x}, x should be a
global object, whereas y is local. Meanwhile, we also need to save the list of objects created
in each cached chunk, so that we can compare them to the global objects in later chunks.
For example, if chunk A created an object x and chunk B uses this object, chunk B must
depend on A, i.e. whenever A changes, B must also be updated. When autodep=TRUE,
knitr will write out the names of objects created in a cached chunk as well as those global
objects in two files named __objects and __globals respectively; later we can use the
function dep_auto() to analyze the object names to figure out the dependencies automatically.
As for dep_prev(), it is a very conservative approach which sets the dependencies so that a
cached chunk will depend on all of its previous chunks, i.e. whenever a previous chunk is
updated, all later chunks will be updated accordingly; similarly, this function needs to be
called in the first code chunk in a document.
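
The distinction between global and local objects can be checked directly with findGlobals(). The sketch below applies it to a small function similar to the one mentioned above; the function f is made up for the illustration, and codetools is assumed to be available (it normally ships with R as a recommended package).

```r
library(codetools)

f <- function() {
  y <- x   # y is created locally; x must come from somewhere else
  y + 1
}

# merge = FALSE separates global functions from global variables
g <- findGlobals(f, merge = FALSE)
g$variables   # should include "x" but not "y"
```

Comparing such lists of global variables with the objects created in earlier cached chunks is the basis of the automatic dependency detection described above.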
1.3.4 Code Externalization

It can be more convenient to write R code in a separate file, rather than mixing it into a
literate programming document; for example, we can run R code successively in a pure R
script from one chunk to the next without jumping over text chunks in between. This may
not sound important for editors that support interaction with R, such as RStudio
(http://www.rstudio.com/ide) or Emacs with ESS [21], since we can send R code chunks
directly from the editor to R, but for other editors like LyX (http://www.lyx.org), we
can only compile the whole report as a batch job, which can be inconvenient when we only
want to know the results of a single chunk.
The second reason for code externalization is to be able to reuse code
across different documents. Currently it works like this: the external R script also has
chunk labels for the code in it (marked in the form ## @knitr chunk-label by default);
if the code chunk in the input document is empty, knitr will match its label with the label
in the R script to input external R code. For example, suppose this is a code chunk labeled
as Q1 in an R script named mycode.R which is under the same directory as the source
document:

## @knitr Q1
#' find the greatest common divisor of m and n
gcd <- function(m, n) {
while ((r <- m%%n) != 0) {
m <- n
n <- r
}
n
}

In the source document, we can first read the script using the function read_chunk(),
which is available in knitr:
read_chunk("mycode.R")

This is usually done in an early chunk, and we can use the chunk Q1 later in the source
document (e.g. an Rnw document):

<<Q1, echo=TRUE, tidy=TRUE>>=
@

Different documents can read the same R script, so the R code can be reused across
different input documents. In a large project, however, this may not be an ideal approach
to organizing code since there are too many code fragments. We may consider an R package
to organize functions, which can be easier to call and test.
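
Because the external script is ordinary R code, the function in the Q1 chunk above can also be run and checked in a plain R session, outside any report. The sketch below repeats the definition so that it is self-contained; the test values are arbitrary.

```r
# find the greatest common divisor of m and n (same definition as in chunk Q1)
gcd <- function(m, n) {
  while ((r <- m %% n) != 0) {
    m <- n
    n <- r
  }
  n
}

gcd(12, 18)  # 6
gcd(17, 5)   # 1
```

Quick checks like these are one step toward the package-based organization mentioned above, where functions come with their own tests.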

1.3.5 Chunk Reference

Code externalization is one way to reuse code chunks across documents, and for a single
document, all its code chunks are also reusable in this document. We can either reuse a
whole chunk, or embed one chunk into the other one. The former is done through the chunk
option ref.label, e.g.

<<chunkA>>=
x <- rnorm(100)
@
Now we reuse chunkA in another chunk:

<<chunkB, ref.label="chunkA">>=
@

Then all the code in chunkA will be put into chunkB. Note only the code is reused; in
this example, chunkB will generate a new batch of random numbers, regardless of the value
of x in chunkA.
To embed a code chunk as a part of another chunk, we can use the syntax <<label>>,
e.g.

<<chunkA>>=
x <- rnorm(100)
@
Now we embed chunkA into chunkB:

<<chunkB>>=
<<chunkA>>
mean(x)
@

The location of the chunks does not matter. We can even define a code chunk later,
but reference it in an earlier chunk. We can also recursively embed chunks, and there is no
limit on the levels of recursion. For example, we can embed A in B, and B in C, then C
will reuse the code in A as well.

1.3.6 Evaluation of Chunk Options

By default knitr treats chunk options like function arguments instead of a text string to
be split by commas to obtain option values. This gives the user much more power than
the traditional syntax in Sweave; we can pass arbitrary R objects to chunk options besides
simple ones like TRUE/FALSE, numbers and character strings. The page
http://yihui.name/knitr/demo/sweave/ gives two examples to show the advantages of the new
syntax. Here we show yet another useful application: conditional evaluation.
The idea is, instead of setting chunk options such as eval to TRUE or FALSE (logical constants),
we can let their values be controlled by a variable in the current R session. This enables knitr
to conditionally evaluate code chunks according to variables. For example, here we assign
TRUE to a variable dothis:

dothis <- TRUE

In the next chunk, we set chunk options eval=dothis and echo=!dothis, both of which are valid
R expressions since the variable dothis exists. As we can see, the source code is hidden,
but it was indeed evaluated since we can see the output:

## [1] "you cannot see my source because !dothis is FALSE"

Then we set eval=dothis and echo=dothis for another chunk:



if (dothis) print("you can see everything now because dothis is TRUE")

## [1] "you can see everything now because dothis is TRUE"

If we change the value of dothis to FALSE, neither of the above chunks will be evaluated
any more. Therefore we can control many chunks with a single variable, and present results
selectively. When chunk options are parsed and evaluated like function arguments, a literate
programming document becomes really programmable.
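
Put together, the chunks discussed above might look like this in an Rnw source document. The chunk labels are made up, and the body of the hidden chunk is inferred from its printed output, so this is a sketch rather than the exact source.

```rnw
<<switch>>=
dothis <- TRUE
@

<<hidden, eval=dothis, echo=!dothis>>=
print("you cannot see my source because !dothis is FALSE")
@

<<visible, eval=dothis, echo=dothis>>=
if (dothis) print("you can see everything now because dothis is TRUE")
@
```

Setting dothis <- FALSE in the first chunk would turn off evaluation of both later chunks at once.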

1.3.7 Child Document

We do not have to put everything in one single document; instead, we can write smaller child
documents and include them into a main document. This can be done through the child
option, e.g. child=c('child1.Rnw', 'child2.Rnw'). When knitr sees the child option
is not empty, it will parse, evaluate and render the child documents as usual, and include
the results back into the main document. Child documents can have a nested structure
(one child can have a further child), and there is no limit on the depth of nesting. This
feature enables us to better organize large projects, e.g. one author can focus on one child
document.
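
A minimal sketch of a main Rnw document that pulls in the two child documents named above; the chunk labels are illustrative, and child1.Rnw and child2.Rnw are assumed to sit next to the main file.

```rnw
\documentclass{article}
\begin{document}

<<part1, child='child1.Rnw'>>=
@

<<part2, child='child2.Rnw'>>=
@

\end{document}
```

Each child is compiled when its chunk is reached, and its rendered output takes the place of the (empty) chunk in the main document.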

1.3.8 R Notebook

We can obtain a report based on a pure R script, without taking care of authoring
tools such as LaTeX or HTML. This kind of R script is called an R notebook in knitr. There
are two approaches to compile R notebooks: stitch() and spin(). The idea of stitch() is
that we fit an R script into a predefined template in knitr (choices of templates include LaTeX,
HTML and Markdown), and compile the mixed document to a report; all the code in the
script will be put into one single chunk. The idea of spin() is to write a specially formatted
script, with normal text masked in roxygen comments (i.e. after #') and chunk options
after #+. Here is an example for spin():

#' This is a report.
#'
#+ chunkA, eval=TRUE
# generate data
x <- rnorm(100)
#'
#' The report is done.

This script will be parsed and translated to one of the document formats that knitr
supports (Table 1.1), and then compiled to a report. This can be done through a single
click in RStudio, or we can also call the functions manually in R:

library(knitr)
stitch("mycode.R") # stitch it, or spin it
spin("mycode.R")

1.4 Extensibility

The knitr package is highly extensible. We have seen in Section 1.2 that both the syn-
tax patterns and output hooks can be customized. In this section we introduce two new
concepts: chunk hooks and language engines.

1.4.1 Hooks

A chunk hook (not to be confused with the output hooks) is a function to be called when a
corresponding chunk option is not NULL, and the returned value of the function is written
into the output if it is character. All chunk hooks are also stored in the object knit_hooks.
One common and tedious task when using R base graphics is that we often have to call par()
to set graphical parameters. This can be abstracted into a chunk hook, so that before a code
chunk is evaluated, a set of graphical parameters can be automatically set. A chunk hook
can be arbitrarily named, as long as it does not conflict with existing hooks in knit_hooks.
For example, we create a hook named pars:

knit_hooks$set(pars = function(before, options, envir) {
  if (before)
    par(options$pars)
})

Now we can pass a list of parameters to the pars option in a chunk, e.g. <<pars =
list(col = 'gray', mar = c(4, 4, .1, .1), pch = 19)>>=. Because this list is obviously
not NULL, knitr will run the chunk hook pars. In this hook, we specified that par()
is called before a chunk is evaluated (that is what if (before) means), and the options
argument in the hook function is a list of current chunk options, so the value of options$pars
is just the list we passed to the chunk option pars. As we can see, the name of the hook
function and the name of the chunk option should be the same, and that is how knitr
knows which hook function to call based on a chunk option. Below is a code chunk testing
the pars hook:

plot(rnorm(100), ann = FALSE)

[Figure: scatter plot of 100 random normal values drawn as solid gray points; the x-axis
runs from 0 to 100 and the y-axis from about -2 to 2.]

We see a scatter plot with solid gray points, which means par() was indeed called (the
default of R is black open circles), although it did not show up in the source code. Because
the hook function does not return character results, nothing else is written in the output.
Now we show another example on how to save rgl plots [1] using a built-in chunk hook
hook_rgl() in knitr. Note that this function returns a character string depending on the output
format, e.g. if it is LaTeX, it returns a character string like \includegraphics{filename}
where filename is the filename of the rgl plot captured by knitr.

knit_hooks$set(rgl = hook_rgl)
head(hook_rgl, 7) # the hook function is defined as this

##
## 1 function (before, options, envir)
## 2 {
## 3 library(rgl)
## 4 if (before || rgl.cur() == 0)
## 5 return()
## 6 name = fig_path("", options)
## 7 par3d(windowRect = 100 + options$dpi * c(0, 0, options$fig.width,

Then we only have to set the chunk option rgl to a non-NULL value, e.g. <<rgl=TRUE,
dev='png'>>= (when dev='png', we record the plot using rgl.snapshot() in rgl to cap-
ture the snapshot as a PNG image):

library(rgl)
demo("bivar", package = "rgl", echo = FALSE)
par3d(zoom = 0.7)

In all, chunk hooks help us do additional tasks before or after the evaluation of code
chunks, and we can also use them to write additional content to the output document.
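
To illustrate the last point, here is a sketch of a hook that writes content to the output after a chunk. The hook name note and its output format are hypothetical; the function is called directly so that the example runs without knitr, and the registration call is shown only as a comment.

```r
# A hypothetical chunk hook: it returns text only after the chunk has been evaluated
note_hook <- function(before, options, envir) {
  if (before) return(NULL)               # nothing to do before evaluation
  paste0("\\emph{", options$note, "}")   # character result, so it would be written to the output
}

# In a knitr session the hook would be registered under the option name, e.g.
# knitr::knit_hooks$set(note = note_hook)

# Calling the function by hand mimics the two calls made around a chunk
opts <- list(note = "Data simulated with rnorm().")
res_before <- note_hook(TRUE, opts, globalenv())    # NULL
res_after  <- note_hook(FALSE, opts, globalenv())   # LaTeX string with the note in \emph{}
```

With the hook registered, a chunk option such as note='Data simulated with rnorm().' would trigger it for that chunk, by the same matching of option name and hook name described for pars.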

1.4.2 Language Engines

Although knitr was created in R, it also supports other languages like Python, Perl, awk
and shell scripts. For the time being, the interface is still very preliminary: it is a call to
external programs via the system() function in R, and the results are collected as character
strings.
The chunk option engine is used to specify the language engine, which is 'R' by default.
It can be 'python', 'perl', 'awk', 'haskell', 'bash', etc. Although the interface
is naive, the design is very general: these engines can be used for all the
document formats, and appropriate renderers have been set up for them. For example, we
can call Python in this LaTeX document:

x = 'hello python from knitr'
print(x.split(' '))
['hello', 'python', 'from', 'knitr']
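
In the Rnw source, a chunk producing the output above might be written as follows; the label py-hello is made up, and a Python interpreter is assumed to be available on the system path.

```rnw
<<py-hello, engine='python'>>=
x = 'hello python from knitr'
print(x.split(' '))
@
```

The only difference from an R chunk is the engine option; other options such as echo or eval are written in the same way.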

Like all other components of knitr, language engines can be customized as well. The
object that controls the engines is knit_engines, e.g. we can call knit_engines$get('python')
to check how the Python engine was defined, or knit_engines$set(python = ...) to
override the default engine. See the documentation in the package for more details.
A data analysis project often involves multiple tools other than R: we may use a shell
script to decompress the data, awk to pre-process the data, and R to read the data. By
integrating all tools into one framework, a project can be tighter in the sense that all
the relevant code lives in the same document. It will be easy to redo the whole analysis
without worrying whether a certain part of the project is out of date.

1.5 Discussion

A few future directions for tools for reproducible research were outlined in [8], including
multi-language compendiums, conditional chunks and interactivity. All of these have been
made possible in the knitr framework. For example, modern web technologies have enabled
us to interact with web pages easily. RPubs, mentioned in Section 1.1, is a good example: we
can publish reports to the web from RStudio with a single mouse click; besides, we can also
write interactive content into the web page based on knitr and other tools like JavaScript:
http://rpubs.com/jverzani/1143 is an interactive quiz for R; the questions and answers
were generated dynamically from knitr. Another application is the googleVis package [9]:
http://rpubs.com/gallery/googleVis (we are able to interact with tables and Google
maps there).
We observed a lot of homework submissions on RPubs (e.g. http://rpubs.com/kaz_yos/1519),
and we believe this is a good sign from the educational point of view.
When students are trained to write homework in a reproducible manner, it should have
a positive impact on scientific research in the future.
It is debatable which authoring environment is ideal for reproducible research (e.g. [8]
suggested XML), and we would argue that a wide range of choices should be made available.
LaTeX is a perfect typesetting tool for experts, but it is very likely that beginners will get
stuck. Markdown is much less frustrating, and most importantly, users can step
into the paradigm of reproducible research really quickly rather than spending most of
their time figuring out typesetting problems. As one example, an RPubs user published a
data analysis about Hurricane Sandy almost immediately after it hit the east coast of
the United States: http://rpubs.com/JoFrhwld/sandy.
Everything is moving to the cloud nowadays, and lots of applications are developed and
deployed on the server side. OpenCPU is a platform that provides the service of R through
a set of APIs which can be programmed in JavaScript; knitr has a simple application there
which allows one to write a report in the web browser: http://public.opencpu.org/apps/knitr.
The computing is done on OpenCPU, and nothing is required on the client side
except a web browser. This could be one of the future directions of statistical computing
and report generation. A similar platform sponsored by RStudio is the Shiny [23] server,
and a knitr example can be found at http://glimmer.rstudio.com/yihui/knitr/.
Web applications may also have an impact on publications related to data analysis,
because it is convenient to collaborate with other people, and fast to publish reports and get
feedback. Vistat (http://vis.supstat.com) is an attempt to build a collaborative and
reproducible website featuring statistical graphics, like a journal. It is based on GitHub and
R Markdown; authors can submit new articles through the version control tool Git and
reviewers can make comments online. All the graphics will be verified independently, hence
it requires the author(s) to submit a detailed source document for other people to reproduce
the results.
There are a number of important issues when implementing a software package for
reproducible research. For example, cache may be handy because it can save us a lot of
time, but we have to be cautious about when to invalidate the cache. Even if the code and
chunk options are not changed, do we need to purge the cache and recompute everything
after we have upgraded R from version 2.15.1 to 2.15.2? To address this kind
of question, knitr provides additional approaches to invalidate the cache, e.g. we can
add a chunk option cache.extra=R.version.string so that whenever the R version has
changed, the cache will be rebuilt. Besides R itself, there can also be problems with add-on
packages. In knitr there is a convenience function write_bib() which can automatically
write the citation information about R packages in the current R session into a BibTeX
database; this guarantees that the version information of packages is always up to date.
We illustrate one more issue as a potential problem: when we distribute our analysis, how
are we supposed to include external materials such as the figure files? For LaTeX, this is
not a problem since images are embedded in PDF; for Markdown/HTML, knitr uses the
R package markdown to encode images as base64 strings and embed the character strings
into HTML, so that a web page is self-contained (i.e. no extra files are required to publish
it). However, it can be difficult, if not impossible, to embed everything in a single document,
e.g. how should we disseminate datasets and unit tests? A potential medium is an R package
as proposed by [8], which has a nice structure for a project (source code, documentation,
vignettes, tests and datasets, etc). In this case, knitr will be one part of a reproducible
project. In fact, this has been made possible since R 3.0.0: we can build package vignettes
with knitr (traditionally only Sweave was allowed) and the document formats can be LaTeX,
HTML and Markdown, etc.
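
The two knitr features mentioned in this paragraph, cache.extra and write_bib(), might be used as in the sketch below. The chunk labels, the model being fitted, the package names and the file name packages.bib are all illustrative assumptions.

```rnw
<<slow-fit, cache=TRUE, cache.extra=R.version.string>>=
# the cache of this chunk is tied to the R version string
fit <- lm(dist ~ speed, data = cars)
@

<<pkg-bib, include=FALSE>>=
# write BibTeX entries for the installed versions of these packages
knitr::write_bib(c('knitr', 'xtable'), file = 'packages.bib')
@
```

The first chunk would be recomputed after an R upgrade even if its code is unchanged, and the second keeps the cited package versions in step with what is actually installed.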
The knitr package has gained support in many editors which make it easy to write
the source documents; at the moment, RStudio has the most comprehensive support. We
can also use LyX, Emacs/ESS, WinEdt, Eclipse and Tinn-R, etc. All of them support the
compilation of the source document with one mouse click or keyboard shortcut.
We emphasized graphics but not tables in this article because tables are essentially text
output, and can be supported by other packages such as xtable [4]; in knitr, we just need
to use the chunk option results='asis' when we want a table in the chunk output. To put
it another way, tables are orthogonal to knitr's design.
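
As a sketch of the table workflow just described, the chunk below prints a LaTeX table with xtable in an Rnw document; the chunk label and the choice of the built-in mtcars data are illustrative.

```rnw
<<cars-table, results='asis', echo=FALSE>>=
library(xtable)
print(xtable(head(mtcars[, 1:4])))   # LaTeX markup is written as-is into the output
@
```

With results='asis', the markup printed by xtable is passed through untouched, so LaTeX typesets it as a table rather than showing it as verbatim output.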
In all, we have mainly introduced one comprehensive tool for reproducible research,
namely knitr, in this chapter. It has a flexible design to allow customization and extension
in several aspects from the input to the output. The major functionality of this package has
stabilized, and future work will be primarily bug fixes and improvements to existing features
such as the language engines. A much more detailed introduction to this package can be
found in the book [28].

Acknowledgments

First I would like to thank Friedrich Leisch for the seminal work on Sweave, which
deserves credit for the design and many features of knitr. As I mentioned in Section 1.3,
the ideas of cache and TikZ graphics were from cacheSweave (Roger Peng), pgfSweave
(Cameron Bracken and Charlie Sharpsteen) and weaver (Seth Falcon); syntax highlighting
was inspired by Romain Francois from his highlight package. I thank all these package
authors as well as Hadley Wickham for his unpublished decumar package, which greatly
influenced the initial design of knitr. There have been a large number of users giving me
valuable feedback in the mailing list https://groups.google.com/group/knitr and on
GitHub, and I really appreciate the communications. I thank the authors and contributors
of open-source editors such as LyX and RStudio for the quick support. I thank my advisors
Di Cook and Heike Hofmann for their guidance. Last but not least, I thank the R Core
Team for providing such a wonderful environment for both data analysis and programming.
There are a few nice functions in R which introduced very useful features into knitr, such
as recordPlot() and lazyLoad().
Bibliography

[1] Daniel Adler and Duncan Murdoch. rgl: 3D visualization device system (OpenGL),
2013. R package version 0.93.929/r929.

[2] Keith A. Baggerly, Jeffrey S. Morris, and Kevin R. Coombes. Reproducibility of SELDI-TOF
protein patterns in serum: comparing datasets from different experiments.
Bioinformatics, 20(5):777-785, 2004.
[3] Cameron Bracken and Charlie Sharpsteen. pgfSweave: Quality speedy graphics compi-
lation and caching with Sweave, 2012. R package version 1.3.0.

[4] David B. Dahl. xtable: Export tables to LaTeX or HTML, 2013. R package version
1.7-1.

[5] Dirk Eddelbuettel. digest: Create cryptographic hash digests of R objects, 2013. R
package version 0.6.3.

[6] Seth Falcon. weaver: Tools and extensions for processing Sweave documents, 2013. R
package version 1.24.0.

[7] Robert Gentleman. Reproducible research: A bioinformatics case study. Statistical
Applications in Genetics and Molecular Biology, 4(1):1034, 2005.
[8] Robert Gentleman and Duncan Temple Lang. Statistical analyses and reproducible
research. Bioconductor Project Working Papers, 2004.
[9] Markus Gesmann and Diego de Castillo. googleVis: Interface between R and the Google
Chart Tools, 2013. R package version 0.4.2.

[10] John Gruber. The Markdown Project, 2004. URL: http://daringfireball.net/projects/markdown/.
[11] Ross Ihaka and Robert Gentleman. R: A language for data analysis and graphics.
Journal of Computational and Graphical Statistics, 5(3):299-314, 1996.
[12] Donald E. Knuth. The WEB system of structured documentation. Technical report,
Department of Computer Science, Stanford University, 1983.

[13] Donald E. Knuth. Literate programming. The Computer Journal, 27(2):97-111, 1984.
[14] Friedrich Leisch. Sweave: Dynamic generation of statistical reports using literate data
analysis. In COMPSTAT 2002 Proceedings in Computational Statistics, number 69,
pages 575-580. Physica Verlag, Heidelberg, 2002.

[15] Paul Murrell. R Graphics. Chapman & Hall/CRC, 2nd edition, 2011.

[16] Roger D. Peng. cacheSweave: Tools for caching Sweave computations, 2012. R package
version 0.6-1.

[17] Roger D. Peng. filehash: Simple key-value database, 2012. R package version 2.2-1.

[18] R Core Team. R Language Definition. R Foundation for Statistical Computing, Vienna,
Austria, 2012.

[19] R Core Team. R: A Language and Environment for Statistical Computing. R Founda-
tion for Statistical Computing, Vienna, Austria, 2013. ISBN 3-900051-07-0.

[20] Norman Ramsey. Literate programming simplified. Software, IEEE, 11(5):97-105, 1994.

[21] A.J. Rossini, R.M. Heiberger, R.A. Sparapani, M. Maechler, and K. Hornik. Emacs
speaks statistics: A multiplatform, multipackage development environment for statistical
analysis. Journal of Computational and Graphical Statistics, 13(1):247-261, 2004.
[22] Anthony Rossini. Literate statistical analysis. In Proceedings of the 2nd International
Workshop on Distributed Statistical Computing, pages 15-17, 2002.
[23] RStudio, Inc. shiny: Web Application Framework for R, 2013. R package version
0.4.0.99.

[24] Charlie Sharpsteen and Cameron Bracken. tikzDevice: R Graphics Output in LaTeX
Format, 2012. R package version 0.6.3/r49.

[25] Till Tantau. The TikZ and PGF Packages, 2008. URL: http://sourceforge.net/projects/pgf/.
[26] Hadley Wickham. evaluate: Parsing and evaluation tools that provide more details
than the default., 2013. R package version 0.4.3.
[27] Yihui Xie. formatR: Format R Code Automatically, 2012. R package version 0.7.2.

[28] Yihui Xie. Dynamic Documents with R and knitr. Chapman and Hall/CRC, 2013.
ISBN 978-1482203530.

[29] Yihui Xie. knitr: A general-purpose package for dynamic report generation in R, 2013.
R package version 1.1.8.
