R Programming for Data Science
Roger D. Peng
© 2014 - 2016 Roger D. Peng
Also By Roger D. Peng
The Art of Data Science
Exploratory Data Analysis with R
Report Writing for Data Science in R
Contents
1. Stay in Touch!
2. Preface
5.15 Summary
13.8 rename()
13.9 mutate()
13.10 group_by()
13.11 %>%
13.12 Summary
15. Functions
15.1 Functions in R
15.2 Your First Function
15.3 Argument Matching
15.4 Lazy Evaluation
15.5 The ... Argument
15.6 Arguments Coming After the ... Argument
15.7 Summary
23. Data Analysis Case Study: Changes in Fine Particle Air Pollution in the U.S.
23.1 Synopsis
23.2 Loading and Processing the Raw Data
23.3 Results
2. Preface
I started using R in 1998 when I was a college undergraduate working on my senior thesis.
The version was 0.63. I was an applied mathematics major with a statistics concentration and
I was working with Dr. Nicolas Hengartner on an analysis of word frequencies in classic texts
(Shakespeare, Milton, etc.). The idea was to see if we could identify the authorship of each of the
texts based on how frequently they used certain words. We downloaded the data from Project
Gutenberg and used some basic linear discriminant analysis for the modeling. The work was
eventually published¹ and was my first ever peer-reviewed publication. I guess you could argue
it was my first real “data science” experience.
Back then, no one was using R. Most of my classes were taught with Minitab, SPSS, Stata, or
Microsoft Excel. The cool people on the cutting edge of statistical methodology used S-PLUS. I
was working on my thesis late one night and I had a problem. I didn’t have a copy of any of those
software packages because they were expensive and I was a student. I didn’t feel like trekking over
to the computer lab to use the software because it was late at night.
But I had the Internet! After a couple of Yahoo! searches I found a web page for something called R,
which I figured was just a play on the name of the S-PLUS package. From what I could tell, R was a
“clone” of S-PLUS that was free. I had already written some S-PLUS code for my thesis so I figured
I would try to download R and see if I could just run the S-PLUS code.
It didn’t work. At least not at first. It turns out that R is not exactly a clone of S-PLUS and quite a few
modifications needed to be made before the code would run in R. In particular, R was missing a lot of
statistical functionality that had existed in S-PLUS for a long time already. Luckily, R’s programming
language was pretty much there and I was able to more or less re-implement the features that were
missing in R.
After college, I enrolled in a PhD program in statistics at the University of California, Los Angeles.
At the time the department was brand new and they didn’t have a lot of policies or rules (or classes,
for that matter!). So you could kind of do what you wanted, which was good for some students and
not so good for others. The Chair of the department, Jan de Leeuw, was a big fan of XLisp-Stat and
so all of the department’s classes were taught using XLisp-Stat. I diligently bought my copy of Luke
Tierney’s book² and learned to really love XLisp-Stat. It had a number of features that R didn’t have
at all, most notably dynamic graphics.
But ultimately, there were only so many parentheses that I could type, and still all of the research-
level statistics was being done in S-PLUS. The department didn’t really have a lot of copies of S-PLUS
lying around so I turned back to R. When I looked around at my fellow students, I realized that I
was basically the only one who had any experience using R. Since there was a budding interest in R
¹http://amstat.tandfonline.com/doi/abs/10.1198/000313002100#.VQGiSELpagE
²http://www.amazon.com/LISP-STAT-Object-Oriented-Environment-Statistical-Probability/dp/0471509167/
around the department, I decided to start a “brown bag” series where every week for about an hour
I would talk about something you could do in R (which wasn’t much, really). People seemed to like
it, if only because there wasn’t really anyone to turn to if you wanted to learn about R.
By the time I left grad school in 2003, the department had essentially switched over from XLisp-Stat
to R for all its work (although there were a few holdouts). Jan discusses the rationale for the
transition in a paper³ in the Journal of Statistical Software.
In the next step of my career, I went to the Department of Biostatistics⁴ at the Johns Hopkins
Bloomberg School of Public Health, where I have been for the past 12 years. When I got to Johns
Hopkins people already seemed into R. Most people had abandoned S-PLUS a while ago and were
committed to using R for their research. Of all the available statistical packages, R had the most
powerful and expressive programming language, which was perfect for someone developing new
statistical methods.
However, we didn’t really have a class that taught students how to use R. This was a problem because
most of our grad students were coming into the program having never heard of R. Most likely in
their undergraduate programs, they used some other software package. So along with Rafael Irizarry,
Brian Caffo, Ingo Ruczinski, and Karl Broman, I started a new class to teach our graduate students
R and a number of other skills they’d need in grad school.
The class was basically a weekly seminar where one of us talked about a computing topic of interest.
I gave some of the R lectures in that class and when I asked people who had heard of R before, almost
no one raised their hand. And no one had actually used it before. The main selling point at the time
was “It’s just like S-PLUS but it’s free!” A lot of people had experience with SAS or Stata or SPSS. A
number of people had used something like Java or C/C++ before and so I often used that as a reference
frame. No one had ever used a functional-style of programming language like Scheme or Lisp.
To this day, I still teach the class, known as Biostatistics 140.776 (“Statistical Computing”). However,
the nature of the class has changed quite a bit over the past 10 years. The population of students
(mostly first-year graduate students) has shifted to the point where many of them have been
introduced to R as undergraduates. This trend mirrors the overall trend with statistics where we
are seeing more and more students do undergraduate majors in statistics (as opposed to, say,
mathematics). Eventually, by 2008–2009, when I’d asked how many people had heard of or used
R before, everyone raised their hand. However, even at that late date, I still felt the need to convince
people that R was a “real” language that could be used for real tasks.
R has grown a lot in recent years, and is being used in so many places now, that I think it’s
essentially impossible for a person to keep track of everything that is going on. That’s fine, but
it makes “introducing” people to R an interesting experience. Nowadays in class, students are often
teaching me something new about R that I’ve never seen or heard of before (they are quite good
at Googling around for themselves). I feel no need to “bring people over” to R. In fact it’s quite the
opposite–people might start asking questions if I weren’t teaching R.
³http://www.jstatsoft.org/v13/i07
⁴http://www.biostat.jhsph.edu
This book comes from my experience teaching R in a variety of settings and through different stages
of its (and my) development. Much of the material has been taken from my Statistical Computing
class as well as the R Programming⁵ class I teach through Coursera.
I’m looking forward to teaching R to people as long as people will let me, and I’m interested in
seeing how the next generation of students will approach it (and how my approach to them will
change). Overall, it’s been just an amazing experience to see the widespread adoption of R over the
past decade. I’m sure the next decade will be just as amazing.
⁵https://www.coursera.org/course/rprog
3. History and Overview of R
There are only two kinds of languages: the ones people complain about and the ones
nobody uses —Bjarne Stroustrup
3.1 What is R?
This is an easy question to answer. R is a dialect of S.
3.2 What is S?
S is a language that was developed by John Chambers and others at the old Bell Telephone
Laboratories, originally part of AT&T Corp. S was initiated in 1976² as an internal statistical analysis
environment—originally implemented as Fortran libraries. Early versions of the language did not
even contain functions for statistical modeling.
In 1988 the system was rewritten in C and began to resemble the system that we have today (this
was Version 3 of the language). The book Statistical Models in S by Chambers and Hastie (the white
book) documents the statistical analysis functionality. Version 4 of the S language was released in
1998 and is the version we use today. The book Programming with Data by John Chambers (the
green book) documents this version of the language.
Since the early 90’s the life of the S language has gone down a rather winding path. In 1993 Bell Labs
gave StatSci (later Insightful Corp.) an exclusive license to develop and sell the S language. In 2004
Insightful purchased the S language from Lucent for $2 million. In 2006, Alcatel purchased Lucent
Technologies and is now called Alcatel-Lucent.
Insightful sold its implementation of the S language under the product name S-PLUS and built a
number of fancy features (GUIs, mostly) on top of it—hence the “PLUS”. In 2008 Insightful was
acquired by TIBCO for $25 million. As of this writing TIBCO is the current owner of the S language
and is its exclusive developer.
The fundamentals of the S language itself have not changed dramatically since the publication of the
Green Book by John Chambers in 1998. In 1998, S won the Association for Computing Machinery’s
Software System Award, a highly prestigious award in the computer science field.
¹https://youtu.be/STihTnVSZnI
²http://cm.bell-labs.com/stat/doc/94.11.ps
The key part here was the transition from user to developer. They wanted to build a language that
could easily service both “people”. More technically, they needed to build a language that would
be suitable for interactive data analysis (more command-line based) as well as for writing longer
programs (more traditional programming language-like).
3.4 Back to R
The R language came into use quite a bit after S had been developed. One key limitation of the S
language was that it was only available in a commercial package, S-PLUS. In 1991, R was created
by Ross Ihaka and Robert Gentleman in the Department of Statistics at the University of Auckland. In
1993 the first announcement of R was made to the public. Ross’s and Robert’s experience developing
R is documented in a 1996 paper in the Journal of Computational and Graphical Statistics:
Ross Ihaka and Robert Gentleman. R: A language for data analysis and graphics. Journal
of Computational and Graphical Statistics, 5(3):299–314, 1996
In 1995, Martin Mächler made an important contribution by convincing Ross and Robert to use the
GNU General Public License⁴ to make R free software. This was critical because it allowed for the
source code for the entire R system to be accessible to anyone who wanted to tinker with it (more
on free software later).
In 1996, public mailing lists were created (the R-help and R-devel lists) and in 1997 the R Core
Group was formed, containing some people associated with S and S-PLUS. Currently, the core group
controls the source code for R and is solely able to check in changes to the main R source tree. Finally,
in 2000 R version 1.0.0 was released to the public.
³http://www.stat.bell-labs.com/S/history.html
⁴http://www.gnu.org/licenses/gpl-2.0.html
2.0⁶.
According to the Free Software Foundation, with free software, you are granted the following four
freedoms⁷
• The freedom to run the program, for any purpose (freedom 0).
• The freedom to study how the program works, and adapt it to your needs (freedom 1). Access
to the source code is a precondition for this.
• The freedom to redistribute copies so you can help your neighbor (freedom 2).
• The freedom to improve the program, and release your improvements to the public, so that
the whole community benefits (freedom 3). Access to the source code is a precondition for
this.
You can visit the Free Software Foundation’s web site⁸ to learn a lot more about free software. The
Free Software Foundation was founded by Richard Stallman in 1985 and Stallman’s personal web
site⁹ is an interesting read if you happen to have some spare time.
1. The “base” R system that you download from CRAN: Linux¹¹ Windows¹² Mac¹³ Source Code¹⁴
2. Everything else.
• The “base” R system contains, among other things, the base package which is required to run
R and contains the most fundamental functions.
• The other packages contained in the “base” system include utils, stats, datasets, graphics,
grDevices, grid, methods, tools, parallel, compiler, splines, tcltk, stats4.
⁶http://www.gnu.org/licenses/gpl-2.0.html
⁷http://www.gnu.org/philosophy/free-sw.html
⁸http://www.fsf.org
⁹https://stallman.org
¹⁰http://cran.r-project.org
¹¹http://cran.r-project.org/bin/linux/
¹²http://cran.r-project.org/bin/windows/
¹³http://cran.r-project.org/bin/macosx/
¹⁴http://cran.r-project.org/src/base/R-3/R-3.1.3.tar.gz
• There are also “Recommended” packages: boot, class, cluster, codetools, foreign, KernSmooth,
lattice, mgcv, nlme, rpart, survival, MASS, spatial, nnet, Matrix.
When you download a fresh installation of R from CRAN, you get all of the above, which represents
a substantial amount of functionality. However, there are many other packages available:
• There are over 4000 packages on CRAN that have been developed by users and programmers
around the world.
• There are also many packages associated with the Bioconductor project¹⁵.
• People often make packages available on their personal websites; there is no reliable way to
keep track of how many packages are available in this fashion.
• There are a number of packages being developed on repositories like GitHub and BitBucket
but there is no reliable listing of all these packages.
3.8 Limitations of R
No programming language or statistical analysis system is perfect. R certainly has a number of
drawbacks. For starters, R is essentially based on almost 50-year-old technology, going back to the
original S system developed at Bell Labs. There was originally little built-in support for dynamic or
3-D graphics (but things have improved greatly since the “old days”).
Another commonly cited limitation of R is that objects must generally be stored in physical memory.
This is in part due to the scoping rules of the language, but R generally is more of a memory hog
than other statistical packages. However, there have been a number of advancements to deal with
this, both in the R core and also in a number of packages developed by contributors. Also, computing
power and capacity have continued to grow over time and the amount of physical memory that can be
installed on even a consumer-level laptop is substantial. While we will likely never have enough
physical memory on a computer to handle the increasingly large datasets that are being generated,
the situation has gotten quite a bit easier over time.
At a higher level one “limitation” of R is that its functionality is based on consumer demand and
(voluntary) user contributions. If no one feels like implementing your favorite method, then it’s your
job to implement it (or you need to pay someone to do it). The capabilities of the R system generally
reflect the interests of the R user community. As the community has ballooned in size over the past
10 years, the capabilities have similarly increased. When I first started using R, there was very little
in the way of functionality for the physical sciences (physics, astronomy, etc.). However, now some
of those communities have adopted R and we are seeing more code being written for those kinds of
applications.
If you want to know my general views on the usefulness of R, you can see them here in the following
exchange on the R-help mailing list with Douglas Bates and Brian Ripley in June 2004:
¹⁵http://bioconductor.org
Roger D. Peng: I don’t think anyone actually believes that R is designed to make
everyone happy. For me, R does about 99% of the things I need to do, but sadly, when I
need to order a pizza, I still have to pick up the telephone.
Douglas Bates: There are several chains of pizzerias in the U.S. that provide for Internet-
based ordering (e.g. www.papajohnsonline.com) so, with the Internet modules in R, it’s
only a matter of time before you will have a pizza-ordering function available.
Brian D. Ripley: Indeed, the GraphApp toolkit (used for the RGui interface under R for
Windows, but Guido forgot to include it) provides one (for use in Sydney, Australia, we
presume as that is where the GraphApp author hails from). Alternatively, a Padovian
has no need of ordering pizzas with both home and neighbourhood restaurants ….
At this point in time, I think it would be fairly straightforward to build a pizza ordering R package
using something like the RCurl or httr packages. Any takers?
3.9 R Resources
Official Manuals
As far as getting started with R by reading stuff, there is of course this book. Also, available from
CRAN¹⁶ are
• An Introduction to R¹⁷
• R Data Import/Export¹⁸
• Writing R Extensions¹⁹: Discusses how to write and organize R packages
• R Installation and Administration²⁰: This is mostly for building R from the source code
• R Internals²¹: This manual describes the low level structure of R and is primarily for developers
and R core members
• R Language Definition²²: This documents the R language and, again, is primarily for developers
¹⁶http://cran.r-project.org
¹⁷http://cran.r-project.org/doc/manuals/r-release/R-intro.html
¹⁸http://cran.r-project.org/doc/manuals/r-release/R-data.html
¹⁹http://cran.r-project.org/doc/manuals/r-release/R-exts.html
²⁰http://cran.r-project.org/doc/manuals/r-release/R-admin.html
²¹http://cran.r-project.org/doc/manuals/r-release/R-ints.html
²²http://cran.r-project.org/doc/manuals/r-release/R-lang.html
Other Resources
• Major technical publishers like Springer and Chapman & Hall/CRC have entire series of books
dedicated to using R in various applications. For example, Springer has a series of books called
Use R!.
• A longer list of books can be found on the CRAN web site²³.
²³http://www.r-project.org/doc/bib/R-books.html
4. Getting Started with R
4.1 Installation
The first thing you need to do to get started with R is to install it on your computer. R works on
pretty much every platform available, including the widely available Windows, Mac OS X, and Linux
systems. If you want to watch a step-by-step tutorial on how to install R for Mac or Windows, you
can watch these videos:
• Installing R on Windows¹
• Installing R on the Mac²
There is also an integrated development environment available for R that is built by RStudio. I really
like this IDE—it has a nice editor with syntax highlighting, there is an R object viewer, and there are
a number of other nice features that are integrated. You can see how to install RStudio here
• Installing RStudio³
¹http://youtu.be/Ohnk9hcxf9M
²https://youtu.be/uxuuWXU-7UQ
³https://youtu.be/bM7Sfz-LADM
⁴http://rstudio.com
⁵https://youtu.be/8xT3hmJQskU
⁶https://youtu.be/XBcvH1BpIBo
5. R Nuts and Bolts
5.1 Entering Input
At the R prompt we type expressions. The <- symbol is the assignment operator.
> x <- 1
> print(x)
[1] 1
> x
[1] 1
> msg <- "hello"
The # character indicates a comment. Anything to the right of the # (including the # itself) is ignored.
This is the only comment character in R. Unlike some other languages, R does not support multi-line
comments or comment blocks.
5.2 Evaluation
When a complete expression is entered at the prompt, it is evaluated and the result of the evaluated
expression is returned. The result may be auto-printed.
The [1] shown in the output indicates that x is a vector and 5 is its first element.
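A minimal sketch of the kind of session this describes, assuming x has been assigned the value 5 as the sentence above states:

```r
x <- 5      # assignment: nothing is printed
x           # auto-printing: typing the name prints the object
print(x)    # explicit printing produces the same output
```

Both of the last two lines print the value with a leading [1], because even a single number is stored as a vector of length one.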
Typically with interactive work, we do not explicitly print objects with the print function; it is much
easier to just auto-print them by typing the name of the object and hitting return/enter. However,
when writing scripts, functions, or longer programs, there is sometimes a need to explicitly print
objects because auto-printing does not work in those settings.
When an R vector is printed you will notice that an index for the vector is printed in square brackets
[] on the side. For example, see this integer sequence of length 20.
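A sketch of such a sequence. The endpoints 11 and 30 are my assumption, picked only because they give 20 elements:

```r
x <- 11:30   # the : operator builds an integer sequence
x            # each printed line starts with the index of its first element
length(x)    # 20
class(x)     # "integer"
```

How many elements appear on each printed line depends on the width of your console, so the bracketed indices you see may differ from someone else's.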
The numbers in the square brackets are not part of the vector itself, they are merely part of the
printed output.
With R, it’s important that one understand that there is a difference between the actual R object
and the manner in which that R object is printed to the console. Often, the printed output may have
additional bells and whistles to make the output more friendly to the users. However, these bells and
whistles are not inherently part of the object.
Note that the : operator is used to create integer sequences.
5.3 R Objects
R has five basic or “atomic” classes of objects:
• character
• numeric (real numbers)
• integer
• complex
• logical (TRUE/FALSE)
The most basic type of R object is a vector. Empty vectors can be created with the vector() function.
There is really only one rule about vectors in R, which is that a vector can only contain objects
of the same class.
But of course, like any good rule, there is an exception, which is a list, which we will get to a bit later.
A list is represented as a vector but can contain objects of different classes. Indeed, that’s usually
why we use them.
There is also a class for “raw” objects, but they are not commonly used directly in data analysis and
I won’t cover them here.
5.4 Numbers
Numbers in R are generally treated as numeric objects (i.e. double precision real numbers). This
means that even if you see a number like “1” or “2” in R, which you might think of as integers, they
are likely represented behind the scenes as numeric objects (so something like “1.00” or “2.00”). This
isn’t important most of the time…except when it is.
If you explicitly want an integer, you need to specify the L suffix. So entering 1 in R gives you a
numeric object; entering 1L explicitly gives you an integer object.
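A quick way to see the difference for yourself, sketched with the base functions class(), typeof(), and identical():

```r
class(1)           # "numeric"
class(1L)          # "integer"
typeof(1)          # "double": how a numeric object is stored
identical(1, 1L)   # FALSE: same value, different types
1 == 1L            # TRUE: the comparison coerces before comparing
```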
There is also a special number Inf which represents infinity. This allows us to represent entities like
1 / 0. This way, Inf can be used in ordinary calculations; e.g. 1 / Inf is 0.
The value NaN represents an undefined value (“not a number”); e.g. 0 / 0; NaN can also be thought of
as a missing value (more on that later).
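These special values can be checked directly at the prompt. A short sketch using only base R:

```r
1 / 0           # Inf
1 / Inf         # 0
0 / 0           # NaN
is.nan(0 / 0)   # TRUE
is.na(0 / 0)    # TRUE: NaN is also treated as missing
is.nan(NA)      # FALSE: NA is missing, but it is not NaN
```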
5.5 Attributes
R objects can have attributes, which are like metadata for the object. These metadata can be very
useful in that they help to describe the object. For example, column names on a data frame help to
tell us what data are contained in each of the columns. Some examples of R object attributes are
• names, dimnames
• dimensions (e.g. matrices, arrays)
• class (e.g. integer, numeric)
• length
• other user-defined attributes/metadata
Attributes of an object (if any) can be accessed using the attributes() function. Not all R objects
contain attributes, in which case the attributes() function returns NULL.
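Here is a small sketch of attributes() in action. The object names and the "units" attribute are arbitrary choices made for illustration:

```r
x <- 1:3
attributes(x)                  # NULL: a plain vector has no attributes

m <- matrix(1:6, nrow = 2)
attributes(m)                  # a list with one element, $dim, equal to 2 3

names(x) <- c("a", "b", "c")
attributes(x)                  # now a list with $names

attr(x, "units") <- "cm"       # a user-defined attribute
attr(x, "units")               # "cm"
```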
Note that in the above example, T and F are short-hand ways to specify TRUE and FALSE. However,
in general one should try to use the explicit TRUE and FALSE values when indicating logical values.
The T and F values are primarily there for when you’re feeling lazy.
You can also use the vector() function to initialize vectors.
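A minimal sketch of creating vectors with c() and initializing one with vector(). The particular values are illustrative assumptions:

```r
x <- c(0.5, 0.6)         # numeric
x <- c(TRUE, FALSE)      # logical
x <- c(T, F)             # logical, using the shorthand
x <- c("a", "b", "c")    # character
x <- 9:29                # integer
x <- c(1 + 0i, 2 + 4i)   # complex

v <- vector("numeric", length = 10)
v                        # ten zeros, the default value for numeric vectors
```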
In each case above, we are mixing objects of two different classes in a vector. But remember that
the only rule about vectors says this is not allowed. When different objects are mixed in a vector,
coercion occurs so that every element in the vector is of the same class.
In the example above, we see the effect of implicit coercion. What R tries to do is find a way to
represent all of the objects in the vector in a reasonable fashion. Sometimes this does exactly what
you want and…sometimes not. For example, combining a numeric object with a character object
will create a character vector, because numbers can usually be easily represented as strings.
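A sketch of implicit coercion with c(). The variable names are arbitrary:

```r
y1 <- c(1.7, "a")    # character: "1.7" "a"
y2 <- c(TRUE, 2)     # numeric: 1 2 (TRUE becomes 1)
y3 <- c("a", TRUE)   # character: "a" "TRUE"

class(y1)            # "character"
class(y2)            # "numeric"
class(y3)            # "character"
```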
Sometimes, R can’t figure out how to coerce an object and this can result in NAs being produced.
When nonsensical coercion takes place, you will usually get a warning from R.
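Objects can also be coerced explicitly with the as.* functions. A short sketch, in which the last call is deliberately nonsensical:

```r
x <- 0:6
as.numeric(x)      # 0 1 2 3 4 5 6, now stored as doubles
as.logical(x)      # FALSE for 0, TRUE for everything else
as.character(x)    # "0" "1" "2" "3" "4" "5" "6"

y <- c("a", "b", "c")
z <- as.numeric(y) # NA NA NA, with a warning that NAs were introduced by coercion
```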
5.9 Matrices
Matrices are vectors with a dimension attribute. The dimension attribute is itself an integer vector
of length 2 (number of rows, number of columns).
Matrices are constructed column-wise, so entries can be thought of starting in the “upper left” corner
and running down the columns.
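The column-wise fill is easiest to see with a small example. This sketch uses the values 1 through 6, which are an arbitrary choice:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
m
dim(m)          # 2 3
attributes(m)   # the only attribute is $dim
m[2, 1]         # 2: the second value lands in row 2 of column 1
m[1, 2]         # 3: the third value starts column 2
```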
Matrices can also be created directly from vectors by adding a dimension attribute.
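A sketch of that approach, assuming a vector of ten integers:

```r
m <- 1:10
is.matrix(m)        # FALSE: still an ordinary vector
dim(m) <- c(2, 5)   # adding a dimension attribute turns it into a 2 x 5 matrix
is.matrix(m)        # TRUE
m
```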
Matrices can be created by column-binding or row-binding with the cbind() and rbind() functions.
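For example, with two short vectors (the values are arbitrary):

```r
x <- 1:3
y <- 10:12
cbind(x, y)   # 3 rows, 2 columns named x and y
rbind(x, y)   # 2 rows named x and y, 3 columns
```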
5.10 Lists
Lists are a special type of vector that can contain elements of different classes. Lists are a very
important data type in R and you should get to know them well. Lists, in combination with the
various “apply” functions discussed later, make for a powerful combination.
Lists can be explicitly created using the list() function, which takes an arbitrary number of
arguments.
[[2]]
[1] "a"
[[3]]
[1] TRUE
[[4]]
[1] 1+4i
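A sketch of a list() call that yields a list like the one printed above. The first element (1) is an assumption on my part; the other three match the output shown:

```r
x <- list(1, "a", TRUE, 1 + 4i)
x                    # prints each element under [[1]], [[2]], ...
sapply(x, class)     # "numeric" "character" "logical" "complex"
```

The double brackets in the printed output are what distinguish a list from an ordinary vector.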
We can also create an empty list of a prespecified length with the vector() function.
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
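The call that produces a five-element empty list like this one can be sketched as:

```r
x <- vector("list", length = 5)
length(x)          # 5
is.null(x[[1]])    # TRUE: every element starts out as NULL
```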
5.11 Factors
Factors are used to represent categorical data and can be unordered or ordered. One can think of
a factor as an integer vector where each integer has a label. Factors are important in statistical
modeling and are treated specially by modelling functions like lm() and glm().
Using factors with labels is better than using integers because factors are self-describing. Having a
variable that has values “Male” and “Female” is better than a variable that has values 1 and 2.
Factor objects can be created with the factor() function.
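A minimal sketch, using a made-up yes/no variable:

```r
x <- factor(c("yes", "yes", "no", "yes", "no"))
x
levels(x)    # "no" "yes": alphabetical order by default
table(x)     # counts per level: 2 "no", 3 "yes"
unclass(x)   # the underlying integer codes 2 2 1 2 1, plus a levels attribute
```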
Often factors will be automatically created for you when you read a dataset in using a function like
read.table(). Those functions often default to creating factors when they encounter data that look
like characters or strings.
The order of the levels of a factor can be set using the levels argument to factor(). This can be
important in linear modelling because the first level is used as the baseline level.
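Continuing with the same made-up yes/no variable, here is a sketch of setting the level order explicitly so that "yes" becomes the baseline:

```r
x <- factor(c("yes", "yes", "no", "yes", "no"),
            levels = c("yes", "no"))
levels(x)       # "yes" "no"
as.integer(x)   # 1 1 2 1 2: "yes" is now coded as 1
```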
¹https://github.com/hadley/dplyr
5.14 Names
R objects can have names, which is very useful for writing readable code and self-describing objects.
Here is an example of assigning names to an integer vector.
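A sketch of that idea. The city names are placeholders chosen for illustration:

```r
x <- 1:3
names(x)                  # NULL: no names yet
names(x) <- c("New York", "Seattle", "Los Angeles")
x                         # the names are printed above the values
x["Seattle"]              # elements can now be looked up by name
```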
$Boston
[1] 2
$London
[1] 3
> names(x)
[1] "Los Angeles" "Boston" "London"
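Lists can have names too. The following sketch builds a list consistent with the output above; the value 1 for "Los Angeles" is an assumption, since that part of the output is not shown:

```r
x <- list("Los Angeles" = 1, Boston = 2, London = 3)
names(x)              # "Los Angeles" "Boston" "London"
x$Boston              # 2
x[["Los Angeles"]]    # names containing spaces need [[ ]] or backticks
```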
Column names and row names can be set separately using the colnames() and rownames()
functions.
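A sketch with a small 2 x 2 matrix. The letters used as names are arbitrary:

```r
m <- matrix(1:4, nrow = 2, ncol = 2)

# Set row and column names together with dimnames() ...
dimnames(m) <- list(c("a", "b"), c("c", "d"))

# ... or separately with rownames() and colnames()
colnames(m) <- c("h", "f")
rownames(m) <- c("x", "z")
m
m["x", "f"]   # 3: row "x", column "f"
```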
Note that for data frames, there is a separate function for setting the row names, the row.names()
function. Also, data frames do not have column names, they just have names (like lists). So to set
the column names of a data frame just use the names() function. Yes, I know it’s confusing. Here’s a
quick summary:
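In code, the data frame side of this looks as follows. This is a sketch with a tiny hypothetical data frame:

```r
df <- data.frame(a = 1:2, b = c("u", "v"))

names(df) <- c("id", "label")             # column names of a data frame
row.names(df) <- c("first", "second")     # row names of a data frame
df
```

For a matrix, the corresponding functions are colnames() and rownames().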
5.15 Summary
There are a variety of different built-in data types in R. In this chapter we have reviewed the following
All R objects can have attributes that help to describe what is in the object. Perhaps the most useful
attribute is names, such as column and row names in a data frame, or simply names in a vector or
list. Attributes like dimensions are also important as they can modify the behavior of objects, like
turning a vector into a matrix.
6. Getting Data In and Out of R
6.1 Reading and Writing Data
Watch a video of this section¹
There are a few principal functions for reading data into R.
There are of course, many R packages that have been developed to read in all kinds of other datasets,
and you may need to resort to one of these packages if you are working in a specific area.
There are analogous functions for writing data to files
• write.table, for writing tabular data to text files (i.e. CSV) or connections
• writeLines, for writing character data line-by-line to a file or connection
• dump, for dumping a textual representation of multiple R objects
• dput, for outputting a textual representation of an R object
• save, for saving an arbitrary number of R objects in binary format (possibly compressed) to
a file.
• serialize, for converting an R object into a binary format for outputting to a connection (or
file).
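Here is a small sketch of two of these writing functions in use. The data and the temporary file paths are made up for illustration.

```r
# Hypothetical data to write out
scores <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# write.table() with sep = "," produces a CSV-style text file
csv_path <- tempfile(fileext = ".csv")
write.table(scores, file = csv_path, sep = ",", row.names = FALSE)
roundtrip <- read.csv(csv_path)

# writeLines() writes character data one element per line
txt_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), con = txt_path)
lines_back <- readLines(txt_path)
```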
For small to moderately sized datasets, you can usually call read.table without specifying any other
arguments
Telling R all these things directly makes R run faster and more efficiently. The read.csv() function
is identical to read.table except that some of the defaults are set differently (like the sep argument).
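A minimal sketch of both calls follows, assuming two small made-up files written to temporary paths. It shows read.table() with no extra arguments, and read.csv() giving the same result as read.table() once header and sep are set by hand.

```r
# A whitespace-delimited file with no header (hypothetical contents)
ws_path <- tempfile(fileext = ".txt")
writeLines(c("ann 1.62", "bob 1.80"), ws_path)
plain <- read.table(ws_path)    # columns are named V1 and V2
plain

# A comma-separated file with a header line
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,height", "ann,1.62", "bob,1.80"), csv_path)

# read.csv() is read.table() with different defaults
via_table <- read.table(csv_path, header = TRUE, sep = ",")
via_csv <- read.csv(csv_path)
all.equal(via_table, via_csv)
```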
• Read the help page for read.table, which contains many hints
²https://youtu.be/BJYYIJO3UFI
• Make a rough calculation of the memory required to store your dataset (see the next section
for an example of how to do this). If the dataset is larger than the amount of RAM on your
computer, you can probably stop right here.
• Set comment.char = "" if there are no commented lines in your file.
• Use the colClasses argument. Specifying this option instead of using the default can make
read.table run MUCH faster, often twice as fast. In order to use this option, you have to know
the class of each column in your data frame. If all of the columns are “numeric”, for example,
then you can just set colClasses = "numeric". A quick and dirty way to figure out the classes
of each column is to read in only the first 100 or so rows, let read.table work out the classes,
and then reuse those classes when reading the full file.
• Set nrows. This doesn’t make R run faster but it helps with memory usage. A mild overestimate
is okay. You can use the Unix tool wc to calculate the number of lines in a file.
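The colClasses, nrows, and comment.char tips above can be sketched together as follows. The file is generated on the fly and its column names are hypothetical; with a real dataset you would point path at your own file.

```r
# Build a small stand-in for a large file (hypothetical columns)
set.seed(1)
path <- tempfile(fileext = ".txt")
big <- data.frame(id = 1:500,
                  value = rnorm(500),
                  group = sample(letters, 500, replace = TRUE))
write.table(big, path, row.names = FALSE)

# Step 1: read only the first 100 rows and let R work out the classes
initial <- read.table(path, header = TRUE, nrows = 100)
classes <- sapply(initial, class)
classes

# Step 2: read the whole file, reusing the classes; nrows is a mild
# overestimate, and comment.char = "" turns off comment scanning
full <- read.table(path, header = TRUE, colClasses = classes,
                   nrows = 600, comment.char = "")
dim(full)
```

The first pass is cheap because it touches only 100 rows, and the second pass avoids the type guessing that read.table would otherwise do for every column.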
In general, when using R with larger datasets, it’s also useful to know a few things about your
system.
³http://en.wikipedia.org/wiki/Double-precision_floating-point_format
So the dataset would require about 1.34 GB of RAM. Most computers these days have at least that
much RAM. However, you need to be aware of
Reading in a large dataset for which you do not have enough RAM is one easy way to freeze up your
computer (or at least your R session). This is an unpleasant experience that usually requires
you to kill the R process, in the best case scenario, or reboot your computer, in the worst case. So
make sure to do a rough calculation of memory requirements before reading in a large dataset.
You’ll thank me later.
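A rough calculation of this kind is just arithmetic. The sketch below assumes a data frame of 1,500,000 rows and 120 columns, all numeric at 8 bytes per value. Those dimensions are an assumption chosen because they reproduce the 1.34 GB figure quoted above.

```r
# Assumed dimensions of the dataset (hypothetical)
n_rows <- 1500000
n_cols <- 120

# Each numeric value is stored as a double: 8 bytes
bytes_per_numeric <- 8

total_bytes <- n_rows * n_cols * bytes_per_numeric
total_mb <- total_bytes / 2^20    # bytes -> megabytes
total_gb <- total_mb / 2^10       # megabytes -> gigabytes
round(total_gb, 2)                # 1.34
```

This counts only the raw data. R needs additional working memory while reading the file, so the real requirement will be higher than this estimate.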
7. Using the readr Package
The readr package was recently developed by Hadley Wickham to deal with reading in large flat
files quickly. The package provides replacements for functions like read.table() and read.csv().
The analogous functions in readr are read_table() and read_csv(). These functions are often much
faster than their base R analogues and provide a few other nice features such as progress meters.
For the most part, you can use read_table() and read_csv() pretty much anywhere you might
use read.table() and read.csv(). In addition, if there are non-fatal problems that occur while
reading in the data, you will get a warning and the returned data frame will have some information
about which rows/observations triggered the warning. This can be very helpful for “debugging”
problems with your data before you get neck deep in data analysis.
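Here is a minimal sketch of read_csv() in use, assuming the readr package is installed (the code is skipped otherwise). The file contents and column names are made up. The col_types string "id" asks for an integer column followed by a double column, and problems() returns any parsing issues that were recorded.

```r
parsed <- NULL
issues <- NULL

if (requireNamespace("readr", quietly = TRUE)) {
  # Hypothetical two-column file
  path <- tempfile(fileext = ".csv")
  writeLines(c("id,score", "1,3.5", "2,4.0"), path)

  # Read it with explicit column types and no progress meter
  parsed <- readr::read_csv(path, col_types = "id", progress = FALSE)

  # Inspect any non-fatal parsing problems (none expected for this file)
  issues <- readr::problems(parsed)
}
```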
8. Using Textual and Binary Formats
for Storing Data
Watch a video of this chapter¹
There are a variety of ways that data can be stored, including structured text files like CSV or tab-
delimited, or more complex binary formats. However, there is an intermediate format that is textual,
but not as simple as something like CSV. The format is native to R and is somewhat readable because
of its textual nature.
One can create a more descriptive representation of an R object by using the dput() or dump()
functions. The dump() and dput() functions are useful because the resulting textual format is editable,
and in the case of corruption, potentially recoverable. Unlike writing out a table or CSV file,
dump() and dput() preserve the metadata (sacrificing some readability), so that another user doesn’t
have to specify it all over again. For example, we can preserve the class of each column of a table or
the levels of a factor variable.
Textual formats can work much better with version control programs like subversion or git which
can only track changes meaningfully in text files. In addition, textual formats can be longer-lived;
if there is corruption somewhere in the file, it can be easier to fix the problem because one can just
open the file in an editor and look at it (although this would probably only be done in a worst case
scenario!). Finally, textual formats adhere to the Unix philosophy², if that means anything to you.
There are a few downsides to using these intermediate textual formats. The format is not very space-
efficient, because all of the metadata is specified. Also, it is really only partially readable. In some
instances it might be preferable to have data stored in a CSV file and then have a separate code file
that specifies the metadata.
¹https://youtu.be/5mIPigbNDfk
²http://www.catb.org/esr/writings/taoup/
Notice that the dput() output is in the form of R code and that it preserves metadata like the class
of the object, the row names, and the column names.
The output of dput() can also be saved directly to a file.
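A minimal sketch of dput() and its counterpart dget() follows. The data frame is made up, and stringsAsFactors = TRUE is set explicitly so that the factor metadata is there to be preserved on any version of R.

```r
# A small data frame with a factor column (hypothetical data)
y <- data.frame(a = 1L, b = "a", stringsAsFactors = TRUE)

# Print the R code that would recreate y
dput(y)

# Save that code to a file, then read the object back with dget()
path <- tempfile(fileext = ".R")
dput(y, file = path)
new_y <- dget(path)
str(new_y)
```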
Multiple objects can be deparsed at once using the dump function and read back in using source.
> source("data.R")
> str(y)
'data.frame': 1 obs. of 2 variables:
$ a: int 1
$ b: Factor w/ 1 level "a": 1
> x
[1] "foo"
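The session above shows only the read-back step. A sketch of the full round trip might look like this; the objects mirror the output shown, but the temporary file location and the explicit stringsAsFactors = TRUE (needed on R 4.0.0 and later to get a factor column) are assumptions.

```r
# Two objects to deparse together
x <- "foo"
y <- data.frame(a = 1L, b = "a", stringsAsFactors = TRUE)

# dump() takes the object names as a character vector
path <- file.path(tempdir(), "data.R")
dump(c("x", "y"), file = path)

# Remove the originals, then recreate them from the file
rm(x, y)
source(path, local = TRUE)    # evaluate in the current environment
str(y)
x
```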
If you have a lot of objects that you want to save to a file, you can save all objects in your workspace
using the save.image() function.
Notice that I’ve used the .rda extension when using save() and the .RData extension when using
save.image(). This is just my personal preference; you can use whatever file extension you want.
The save() and save.image() functions do not care. However, .rda and .RData are fairly common
extensions and you may want to use them because they are recognized by other software.
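Here is a small sketch of save() and load() for specific objects. The objects and the temporary file path are made up. save.image() works the same way but writes every object in the workspace, so it is shown only as a comment.

```r
# Hypothetical objects to save
a <- data.frame(x = rnorm(5), y = runif(5))
b <- c(3, 4.4, 1 / 3)

# Save both objects to one binary file
path <- tempfile(fileext = ".rda")
save(a, b, file = path)

# Remove them, then restore; load() returns the names it restored
rm(a, b)
restored <- load(path)
restored

# To save the whole workspace instead:
# save.image(file = "mydata.RData")
```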
The serialize() function is used to convert individual R objects into a binary format that can be
communicated across an arbitrary connection. This may get sent to a file, but it could get sent over
a network or other connection.
When you call serialize() on an R object, the output will be a raw vector coded in hexadecimal
format.
If you want, this can be sent to a file, but in that case you are better off using something like save().
The benefit of the serialize() function is that it is the only way to perfectly represent an R object
in an exportable format, without losing precision or any metadata. If that is what you need, then
serialize() is the function for you.
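A minimal sketch of serialize() and its inverse unserialize() follows. Passing connection = NULL returns the raw vector directly instead of writing it anywhere. The list is made up.

```r
# A simple object to serialize
a <- list(1, 2, 3)

# With connection = NULL the bytes come back as a raw vector
bytes <- serialize(a, connection = NULL)
class(bytes)
head(bytes)

# unserialize() rebuilds the object exactly
restored <- unserialize(bytes)
identical(a, restored)
```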
9. Interfaces to the Outside World
Watch a video of this chapter¹
Data are read in using connection interfaces. Connections can be made to files (most common) or to
other more exotic things.
In general, connections are powerful tools that let you navigate files or other external objects.
Connections can be thought of as a translator that lets you talk to objects that are outside of R.
Those outside objects could be anything from a database to a simple text file or a web service API.
Connections allow R functions to talk to all these different external objects without you having to
write custom code for each object.
> str(file)
function (description = "", open = "", blocking = TRUE, encoding = getOption("encoding"),
raw = FALSE)
The file() function has a number of arguments that are common to many other connection
functions so it’s worth going into a little detail here.
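To give a feel for how a connection behaves, here is a sketch that opens a file connection explicitly and reads from it in two steps. The file contents are made up; description and open are the arguments shown in the signature above.

```r
# Create a small text file to connect to (hypothetical contents)
path <- tempfile(fileext = ".txt")
writeLines(c("alpha", "beta", "gamma"), path)

# Open a read-only connection to the file
con <- file(description = path, open = "r")

# An open connection remembers its position between reads
first_two <- readLines(con, n = 2)
rest <- readLines(con)

# Always close a connection you opened yourself
close(con)
```

In everyday use you rarely need to do this by hand, because functions like read.csv() and readLines() open and close the connection for you when given a file path.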
The Project Gutenberg eBook of Rifles and
Riflemen at the Battle of Kings Mountain
Language: English
History No. 12
Reprinted 1947
CONTENTS
Kings Mountain, A Hunting Rifle Victory
The American Rifle at the Battle of Kings Mountain
Testing the Ferguson Rifle—Modern Marksman Attains High Precision
With Arm of 1776
Kings Mountain
[1]
A Hunting Rifle Victory
Only a few of the original Ferguson rifles are extant. The one shown is
exhibited at Kings Mountain National Military Park, South Carolina. Here
we see the profile of the piece with an 18-inch ruler to indicate scale.
To the long-standing local strife between Whig and Tory, the results
of Kings Mountain were direct and considerable. It was an
unexpected blow which completely unnerved and undermined the
Loyalist organization in the Carolinas, and placed the downtrodden
Whig cause of the Piedmont in the ascendancy. Kings Mountain was a
climax to the social, economic, and military clashes between
democratic Whig and propertied Tory elements. In a sense it
epitomized this bitter struggle and its abrupt ending on what then
was the southwestern frontier. Heartening to the long repressed
Whigs, the engagement placed them in the control of the Piedmont,
and encouraged them to renewed resistance.
This time Cornwallis’ march was more cautious in its initial stages. For
the enforced delay of the major British advance occasioned by Kings
Mountain and lengthened by indecision, had enabled Greene, the
new American commander in the South, to reorganize his shattered
and dispirited army and launch a renewed and two-fold offensive
upon the main British movement. It was this offensive in 1781, which
first successfully struck the British at Cowpens, then rapidly withdrew
through the Piedmont, further dissipated Cornwallis’ energies at
Guilford Courthouse, and prepared the way for the American victory
at Yorktown.
Maj. Patrick Ferguson was born in 1744, the son of a Scottish jurist,
James Ferguson of Pitfour. At an early age he became an officer in
the Royal North British Dragoons, and by the time the American
colonists revolted against British rule he had distinguished himself in
service with the Scotch militia and as an expeditionist during the
Carib insurrection in the West Indies. In 1776 he demonstrated to
British Government officials a weapon of his own invention, “a rifle
gun on a new construction which astonished all beholders.”
BREECH MECHANISM OF THE FERGUSON RIFLE
Breech plug lowered by one turn of the trigger guard
In September 1780, while this spirit of hatred was at its height, the
regiments of backwoods patriots, who were to go down in history as
“Kings Mountain Men,” rendezvoused at South Mountain north of
Gilbert Town and determined to set upon Ferguson and his
command, then believed to be in Gilbert Town. The followers of the
Whig border leaders, Campbell, Shelby, Sevier, Cleveland, Lacey,
Williams, McDowell, Hambright, Hawthorne, Brandon, Chronicle, and
Hammond, descended upon Gilbert Town on October 4 only to find
that the Tories, apprised of the planned attack, had evacuated that
place; Ferguson was in full retreat in an attempt to evade an
engagement. His goal was Charlotte and the safety of the British
forces there stationed under Cornwallis. On October 6,
Ferguson was attracted from his line of march to the
commanding eminence, Kings Mountain, known at that time by the
famous name that we apply today. His 1,100 loyalists went into camp
on these heights, and Ferguson declared that “he was on Kings
Mountain, that he was King of that mountain, and God Almighty
could not drive him from it.” He took none of the ordinary military
precautions of forming breastworks, but merely placed his baggage
wagons along the northeastern part of the mountain to give some
slight appearance of protection in the neighborhood of his
headquarters.
The story of the battle which ensued is one of the thrilling chapters in
our history. The Whigs surrounded the mountain and, in spite of a
few bayonet charges made by the Tories, pressed up the slopes and
poured into the Loyalist lines such deadly fire from the long rifles that
in less than an hour 225 had been killed, 163 wounded, and 716
made prisoners. Major Ferguson fell with eight bullets in his body.
The Whigs lost 28 killed and 62 wounded.
PERFORMANCE OF THE FERGUSON RIFLE
Six shots a minute
Efficient in any weather
Four shots a minute while advancing
THE FERGUSON RIFLE
The rifle was ahead of its time and was discarded after his
death. It is now rare.
With the death of Ferguson, the rifles of his invention, with which
probably 150 of his men were armed, disappeared. Some were
broken in the fight and others were carried off by the victors. One
given by Ferguson to his companion, De Peyster, is today an heirloom
in the family of the latter’s descendants in New York City. It was
exhibited by the United States Government at the World’s Fair at
Chicago in 1893. A very few are to be found in museum collections in
this country and in England. The one possessed by the National Park
Service was obtained from a dealer in England through the vigilance
of members of the staff of the Colonial National Historical Park,
Virginia, and is now exhibited in the museum at Kings Mountain
National Military Park, South Carolina.
The Kings Mountain museum tells the story of the Revolutionary
backwoodsman and his place in the scheme of Americanism. Here
also is presented the story of the cultural, social, and economic
background of the Kings Mountain patriots, as well as the details of
the battle and its effect on the Revolution as a whole. Here lies the
rare opportunity to preserve for all time significant relics of Colonial
and Revolutionary days and at the same time interpret for a
multitude of visitors the basic elements in the story of the old frontier
—a story which affected most of the Nation during the century that
followed the Revolution.
Our interest here will turn to those intriguing reminders of how our
Colonial ancestors lived—their houses, their tools and implements,
their furniture, their books, and their guns. Because of the
significance of the American rifle in the battle of Kings Mountain, it
must be a feature of any Kings Mountain exhibit. In the Carolinas it
was as much a part of each patriot as was his good right arm.
I never in my life saw better rifles (or men who shot better) than
those made in America; they are chiefly made in Lancaster, and
two or three neighboring towns in that vicinity, in Pennsylvania.
The barrels weigh about six pounds two or three ounces, and carry
a ball no larger than thirty-six to the pound; at least I never saw
one of the larger caliber, and I have seen many hundreds and
hundreds. I am not going to relate any thing respecting the
American war; but to mention one instance, as a proof of most
excellent skill of an American rifleman. If any man shew me an
instance of better shooting, I will stand corrected.
Now, observe how well this fellow shot. It was in the month of
August, and not a breath of wind was stirring. Colonel Tarleton’s
horse and mine, I am certain, were not anything like two feet
apart; for we were in close consultation, how we should attack with
our troops, which laid 300 yards in the wood, and could not be
perceived by the enemy. A rifle-ball passed between him and me;
looking directly to the mill, I observed the flash of the powder. I
said to my friend, “I think we had better move, or we shall have
two or three of these gentlemen, shortly, amusing themselves at
our expence.” The words were hardly out of my mouth, when the
bugle horn man, behind us, and directly central, jumped off his
horse, and said, “Sir, my horse is shot.” The horse staggered, fell
down, and died. He was shot directly behind the foreleg, near to
the heart, at least where the great blood-vessels lie, which lead to
the heart. He took the saddle and bridle off, went into the woods,
and got another horse. We had a number of spare horses, led by
negro lads.
The rifle had been introduced into America about 1700 when there
was considerable immigration into Pennsylvania from Switzerland and
Austria, the only part of the world at that time where it was in use. It
was then short, heavy, clumsy, and little more accurate than the
musket. From this arm the American gunsmiths evolved the long,
slender, small-bore gun (about 36 balls to the pound) which by 1750
had reached the same state of development that characterized it at
the time of the Revolution. The German Jäger rifle brought to
America during the Revolution was by no means the equal of the
American piece. It was short-barreled and took a ball of 19 to the
pound. With its large ball and small powder charge its recoil was
heavy and its accurate range but little greater than that of the
smoothbore musket. It was the same gun that had been introduced
into America in 1700.
Each Whig on Kings Mountain had been told to act as his own
captain, to yield as he found it necessary, and to take every
advantage that was presented. In short, the patriots followed the
Indian mode of attack, using the splendid cover that the timber about
the mountain afforded, and selecting a definite human target for
every ball fired. Splendid leadership and command were exercised by
the Whig officers to make for concerted action every time a crisis
arose. This coordination, plus the Kentucky rifle and the “individual
power of woodcraft, marksmanship, and sportsmanship” of each
participant in the American forces, overcame all the military training
and discipline which had been injected into his Tory troops by
Ferguson.
Testing the Ferguson Rifle
[3]
Modern Marksman Attains High Precision With Arm of 1776
While it is understood that tests of this historic arm have been made
in England within late years, it is believed that in this country the
sinister crack of a Ferguson had not been heard since 1780 at the
Battle of Kings Mountain, South Carolina.
The Centennial Monument at Kings Mountain, unveiled on the 100th
anniversary of the Battle, October 7, 1880.
View of the Kings Mountain region, taken from the eastern slope of the
battlefield ridge, looking northeastwardly toward Henry’s Knob.
Granite obelisk erected by the Federal Government at Kings Mountain in
1909 to commemorate the Battle.
U. S. GOVERNMENT PRINTING OFFICE: 1947
Footnotes
[1]
From The Regional Review, National Park Service, Region One,
Richmond, Va., Vol. III, No. 6, December 1939, pp. 25-29.
[2]
From Idem., vol. V, No. 1, July 1940, pp. 15-21.
[3]
Idem., vol. VI, Nos. 1 and 2.
National Park Service
Popular Study Series
No. 1.—Winter Encampments of the Revolution.
No. 2.—Weapons and Equipment of Early American Soldiers.
No. 3.—Wall Paper News of the Sixties.
No. 4.—Prehistoric Cultures in the Southeast.
No. 5.—Mountain Speech in the Great Smokies.
No. 6.—New Echota, Birthplace of the American Indian Press.
No. 7.—Hot Shot Furnaces.
No. 8.—Perry at Put in Bay: Echoes of the War of 1812.
No. 9.—Wharf Building of a Century and More Ago.
No. 10.—Gardens of the Colonists.
No. 11.—Robert E. Lee and Fort Pulaski.
No. 12.—Rifles and Riflemen at the Battle of Kings Mountain.
No. 13.—Rifle Making in the Great Smoky Mountains.
No. 14.—American Charcoal Making in the Era of the Cold Blast
Furnace.