Foundations and Applications of Statistics
An Introduction Using R
Randall Pruim
Copying and reprinting. Individual readers of this publication, and nonprofit libraries
acting for them, are permitted to make fair use of the material, such as to copy a chapter for use
in teaching or research. Permission is granted to quote brief passages from this publication in
reviews, provided the customary acknowledgment of the source is given.
Republication, systematic copying, or multiple reproduction of any material in this publication
is permitted only under license from the American Mathematical Society. Requests for such
permission should be addressed to the Acquisitions Department, American Mathematical Society,
201 Charles Street, Providence, Rhode Island 02904-2294 USA. Requests can also be made by
e-mail to [email protected].
© 2011 by Randall Pruim. All rights reserved.
Printed in the United States of America.
∞ The paper used in this book is acid-free and falls within the guidelines
established to ensure permanence and durability.
Visit the AMS home page at http://www.ams.org/
Preface
Intended Audience
As the title suggests, this book is intended as an introduction to both the foun-
dations and applications of statistics. It is an introduction in the sense that it
does not assume a prior statistics course. But it is not introductory in the sense
of being suitable for students who have had nothing more than the usual high
school mathematics preparation. The target audience is undergraduate students at
the equivalent of the junior or senior year at a college or university in the United
States.
Students should have had courses in differential and integral calculus, but not
much more is required in terms of mathematical background. In fact, most of my
students have had at least another course or two by the time they take this course,
but the only courses that they have all had are those in the calculus sequence. The majority
of my students are not mathematics majors. I have had students from biology,
chemistry, computer science, economics, engineering, and psychology, and I have
tried to write a book that is interesting, understandable, and useful to students
with a wide range of backgrounds and career goals.
This book is suitable for what is often a two-semester sequence in “mathe-
matical statistics”, but it is different in some important ways from many of the
books written for such a course. I was trained as a mathematician first, and the
book is clearly mathematical at some points, but the emphasis is on the statistics.
Mathematics and computation are brought in where they are useful tools. The
result is a book that stretches my students in different directions at different times
– sometimes statistically, sometimes mathematically, sometimes computationally.
Features of this book that help distinguish it from other books available for such a
course include the following:
• The use of R, a free software environment for statistical computing and graph-
ics, throughout the text.
Many books claim to integrate technology, but often technology appears
to be more of an afterthought. In this book, topics are selected, ordered, and
discussed in light of the current practice in statistics, where computers are an
indispensable tool, not an occasional add-on.
R was chosen because it is both powerful and available. Its “market share”
is increasing rapidly, so experience with R is likely to serve students well in
their future careers in industry or academics. A large collection of add-on
packages are available, and new statistical methods are often available in R
before they are available anywhere else.
R is open source and is available at the Comprehensive R Archive Network
(CRAN, http://cran.r-project.org) for a wide variety of computing plat-
forms at no cost. This allows students to obtain the software for their personal
computers – an essential ingredient if computation is to be used throughout
the course.
The R code in this book was executed on a 2.66 GHz Intel Core 2 Duo
MacBook Pro running OS X (version 10.5.8) and the current version of R (ver-
sion 2.12). Results using a different computing platform or different version
of R should be similar.
• An emphasis on practical statistical reasoning.
The idea of a statistical study is introduced early on using Fisher’s famous
example of the lady tasting tea. Numerical and graphical summaries of data
are introduced early to give students experience with R and to allow them
to begin formulating statistical questions about data sets even before formal
inference is available to help answer those questions.
• Probability for statistics.
One model for the undergraduate mathematical statistics sequence presents
a semester of probability followed by a semester of statistics. In this book,
I take a different approach and get to statistics early, developing the neces-
sary probability as we go along, motivated by questions that are primarily
statistical. Hypothesis testing is introduced almost immediately, and p-value
computation becomes a motivation for several probability distributions. The
binomial test and Fisher’s exact test are introduced formally early on, for ex-
ample. Where possible, distributions are presented as statistical models first,
and their properties (including the probability mass function or probability
density function) derived, rather than the other way around. Joint distribu-
tions are motivated by the desire to learn about the sampling distribution of
a sample mean.
Confidence intervals and inference for means based on t-distributions must
wait until a bit more machinery has been developed, but my intention is that
a student who only takes the first semester of a two-semester sequence will
have a solid understanding of inference for one variable – either quantitative
or categorical.
Brief Outline
The first four chapters of this book introduce important ideas in statistics (dis-
tributions, variability, hypothesis testing, confidence intervals) while developing a
mathematical and computational toolkit. I cover this material in a one-semester
course. Also, since some of my students only take the first semester, I wanted to
be sure that they leave with a sense for statistical practice and have some useful
statistical skills even if they do not continue. Interestingly, as a result of designing
my course so that stopping halfway makes some sense, I am finding that more of
my students are continuing on to the second semester. My sample size is still small,
but I hope that the trend continues and would like to think it is due in part to the fact that
the students are enjoying the course and can see “where it is going”.
The last three chapters deal primarily with two important methods for handling
more complex statistical models: maximum likelihood and linear models (including
regression, ANOVA, and an introduction to generalized linear models). This is not
a comprehensive treatment of these topics, of course, but I hope it both provides
flexible, usable statistical skills and prepares students for further learning.
Chi-squared tests for goodness of fit and for two-way tables using both the
Pearson and likelihood ratio test statistics are covered after first generating em-
pirical p-values based on simulations. The use of simulations here reinforces the
notion of a sampling distribution and allows for a discussion about what makes a
good test statistic when multiple test statistics are available. I have also included
a brief introduction to Bayesian inference, some examples that use simulations to
investigate robustness, a few examples of permutation tests, and a discussion of
Bradley-Terry models. The latter topic is one that I cover between Selection Sun-
day and the beginning of the NCAA Division I Basketball Tournament each year.
An application of the method to the 2009–2010 season is included.
Various R functions and methods are described as we go along, and Appendix A
provides an introduction to R focusing on the way R is used in the rest of the book.
I recommend working through Appendix A simultaneously with the first chapter –
especially if you are unfamiliar with programming or with R.
Some of my students enter the course unfamiliar with the notation for things
like sets, functions, and summation, so Appendix B contains a brief tour of the basic
mathematical results and notation that are needed. The linear algebra required for
parts of Chapter 4 and again in Chapters 6 and 7 is covered in Appendix C. These
can be covered as needed or used as a quick reference. Appendix D is a review of
the first four chapters in outline form. It is intended to prepare students for the
remainder of the book after a semester break, but it could also be used as an end
of term review.
All of the data sets and code fragments used in this book are available for use
in R on your own computer. Data sets and other utilities that are not provided
by R packages in CRAN are available in the fastR package. This package can
be obtained from CRAN, from the companion web site for this book, or from the
author’s web site.
Among the utility functions in fastR is the function snippet(), which provides
easy access to the code fragments that appear in this book. The names of the code
fragments in this book appear in boxes at the right margin where code output is
displayed. Once fastR has been installed and loaded,
snippet('snippet')
will both display and execute the code named “snippet”, and
snippet('snippet', exec=FALSE)
will display the code without executing it.
Additional material related to this book is available at the companion web site,
http://www.ams.org/bookpages/amstext-13
There you will find
• an errata list,
• additional instructions, with links, for installing R and the R packages used in
this book,
• additional examples and problems,
• additional student solutions,
• additional material – including a complete list of solutions – available only to
instructors.
Acknowledgments
Every author sets out to write the perfect book. I was no different. Fortunate
authors find others who are willing to point out the ways they have fallen short of
their goal and suggest improvements. I have been fortunate.
Most importantly, I want to thank the students who have taken advanced
undergraduate statistics courses with me over the past several years. Your questions
and comments have shaped the exposition of this book in innumerable ways. Your
enthusiasm for detecting my errors and your suggestions for improvements have
saved me countless embarrassments. I hope that your moments of confusion have
added to the clarity of the exposition.
If you look, some of you will be able to see your influence in very specific ways
here and there (happy hunting). But so that you all get the credit you deserve, I
want to list you all (in random order, of course): Erin Campbell, John Luidens, Kyle
DenHartigh, Jessica Haveman, Nancy Campos, Matthew DeVries, Karl Stough,
Heidi Benson, Kendrick Wiersma, Dale Yi, Jennifer Colosky, Tony Ditta, James
Hays, Joshua Kroon, Timothy Ferdinands, Hanna Benson, Landon Kavlie, Aaron
Dull, Daniel Kmetz, Caleb King, Reuben Swinkels, Michelle Medema, Sean Kidd,
Leah Hoogstra, Ted Worst, David Lyzenga, Eric Barton, Paul Rupke, Alexandra
Cok, Tanya Byker Phair, Nathan Wybenga, Matthew Milan, Ashley Luse, Josh
Vesthouse, Jonathan Jerdan, Jamie Vande Ree, Philip Boonstra, Joe Salowitz,
Elijah Jentzen, Charlie Reitsma, Andrew Warren, Lucas Van Drunen, Che-Yuan
Tang, David Kaemingk, Amy Ball, Ed Smilde, Drew Griffioen, Tim Harris, Charles
Blum, Robert Flikkema, Dirk Olson, Dustin Veldkamp, Josh Keilman, Eric Sloter-
beek, Bradley Greco, Matt Disselkoen, Kevin VanHarn, Justin Boldt, Anthony
Boorsma, Nathan Dykhuis, Brandon Van Dyk, Steve Pastoor, Micheal Petlicke,
Michael Molling, Justin Slocum, Jeremy Schut, Noel Hayden, Christian Swenson,
Aaron Keen, Samuel Zigterman, Kobby Appiah-Berko, Jackson Tong, William Van-
den Bos, Alissa Jones, Geoffry VanLeeuwen, Tim Slager, Daniel Stahl, Kristen
Vriesema, Rebecca Sheler, and Andrew Meneely.
I also want to thank various colleagues who read or class-tested some or all of
this book while it was in progress. They are
Ming-Wen An, Vassar College
Alan Arnholdt, Appalachian State University
Stacey Hancock, Clark University
Jo Hardin, Pomona College
Nicholas Horton, Smith College
Laura Kapitula, Calvin College
Daniel Kaplan, Macalester College
John Kern, Duquesne University
Kimberly Muller, Lake Superior State University
Ken Russell, University of Wollongong, Australia
Greg Snow, Intermountain Healthcare
Nathan Tintle, Hope College
What Is Statistics?
This is a course primarily about statistics, but what exactly is statistics? In other
words, what is this course about?¹ Here are some definitions of statistics from other
people:
• a collection of procedures and principles for gaining information in order to
make decisions when faced with uncertainty (J. Utts [Utt05]),
• a way of taming uncertainty, of turning raw data into arguments that can
resolve profound questions (T. Amabile [fMA89]),
• the science of drawing conclusions from data with the aid of the mathematics
of probability (S. Garfunkel [fMA86]),
• the explanation of variation in the context of what remains unexplained (D.
Kaplan [Kap09]),
• the mathematics of the collection, organization, and interpretation of numer-
ical data, especially the analysis of a population’s characteristics by inference
from sampling (American Heritage Dictionary [AmH82]).
While not exactly the same, these definitions highlight four key elements of statis-
tics.
Data are the raw material for doing statistics. We will learn more about different
types of data, how to collect data, and how to summarize data as we go along. This
will be the primary focus of Chapter 1.
¹ As we will see, the words statistic and statistics get used in more than one way. More on that later.
The tricky thing about statistics is the uncertainty involved. If we measure one box
of cereal, how do we know that all the others are similarly filled? If every box of
cereal were identical and every measurement perfectly exact, then one measurement
would suffice. But the boxes may differ from one another, and even if we measure
the same box multiple times, we may get different answers to the question How
much cereal is in the box?
So we need to answer questions like How many boxes should we measure? and
How many times should we measure each box? Even so, there is no answer to these
questions that will give us absolute certainty. So we need to answer questions like
How sure do we need to be?
In order to answer a question like How sure do we need to be?, we need some way of
measuring our level of certainty. This is where mathematics enters into statistics.
Probability is the area of mathematics that deals with reasoning about uncertainty.
So before we can answer the statistical questions we just listed, we must first develop
some skill in probability. Chapter 2 provides the foundation that we need.
Once we have developed the necessary tools to deal with uncertainty, we will
be able to give good answers to our statistical questions. But before we do that,
let’s take a bird’s eye view of the processes involved in a statistical study. We’ll
come back and fill in the details later.
the conversation changed at this suggestion, and the scientists began to discuss how
the claim should be tested. Within a few minutes cups of tea with milk had been
prepared and presented to the woman for tasting.
Let’s take this simple example as a prototype for a statistical study. What
steps are involved?
And how should we prepare the cups? Should we make 5 each way? Does
it matter if we tell the woman that there are 5 prepared each way? Should we
flip a coin to decide even if that means we might end up with 3 prepared one
way and 7 the other way? Do any of these differences matter?
(5) Make and record the measurements.
Once we have the design figured out, we have to do the legwork of data
collection. This can be a time-consuming and tedious process. In the case
of the lady tasting tea, the scientists decided to present her with ten cups
of tea which were quickly prepared. A study of public opinion may require
many thousands of phone calls or personal interviews. In a laboratory setting,
each measurement might be the result of a carefully performed laboratory
experiment.
(6) Organize the data.
Once the data have been collected, it is often necessary or useful to orga-
nize them. Data are typically stored in spreadsheets or in other formats that
are convenient for processing with statistical packages. Very large data sets
are often stored in databases.
Part of the organization of the data may involve producing graphical and
numerical summaries of the data. We will discuss some of the most important
of these kinds of summaries in Chapter 1. These summaries may give us initial
insights into our questions or help us detect errors that may have occurred to
this point.
(7) Draw conclusions from data.
Once the data have been collected, organized, and analyzed, we need to
reach a conclusion. Do we believe the woman’s claim? Or do we think she is
merely guessing? How sure are we that this conclusion is correct?
Eventually we will learn a number of important and frequently used meth-
ods for drawing inferences from data. More importantly, we will learn the basic
framework used for such procedures so that it should become easier and easier
to learn new procedures as we become familiar with the framework.
(8) Produce a report.
Typically the results of a statistical study are reported in some manner.
This may be as a refereed article in an academic journal, as an internal re-
port to a company, or as a solution to a problem on a homework assignment.
These reports may themselves be further distilled into press releases, newspa-
per articles, advertisements, and the like. The mark of a good report is that
it provides the essential information about each of the steps of the study.
As we go along, we will learn some of the standard terminology and pro-
cedures that you are likely to see in basic statistical reports and will gain a
framework for learning more.
At this point, you may be wondering who the innovative scientist was and
what the results of the experiment were. The scientist was R. A. Fisher, who first
described this situation as a pedagogical example in his 1925 book on statistical
methodology [Fis25]. We’ll return to this example in Sections 2.4.1 and 2.7.3.
Chapter 1
Summarizing Data
1.1. Data in R
Most data sets in R are stored in a structure called a data frame that reflects
the 2-dimensional structure described above. A number of data sets are included
with the basic installation of R. The iris data set, for example, is a famous data set.
[129] 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
iris-vector2
> iris$Species # get one variable and print as vector
[1] setosa setosa setosa setosa setosa
[6] setosa setosa setosa setosa setosa
[11] setosa setosa setosa setosa setosa
< 19 lines removed >
[111] virginica virginica virginica virginica virginica
[116] virginica virginica virginica virginica virginica
[121] virginica virginica virginica virginica virginica
[126] virginica virginica virginica virginica virginica
[131] virginica virginica virginica virginica virginica
[136] virginica virginica virginica virginica virginica
[141] virginica virginica virginica virginica virginica
[146] virginica virginica virginica virginica virginica
Levels: setosa versicolor virginica
This is not a particularly good way to get a feel for data. There are a number of
graphical and numerical summaries of a variable or set of variables that are usually
preferred to merely listing all the values – especially if the data set is large. That
is the topic of our next section.
It is important to note that the name iris is not reserved in R for this data
set. There is nothing to prevent you from storing something else with that name.
If you do, you will no longer have access to the iris data set unless you first reload
it, at which point the previous contents of iris are lost.
iris-reload
> iris <- 'An iris is a beautiful flower.'
> str(iris)
chr "An iris is a beautiful flower."
> data(iris) # explicitly reload the data set
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length:num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width :num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length:num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width :num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species :Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1
1 1 1 ...
The fastR package includes data sets and other utilities to accompany this
text. Instructions for installing fastR appear in the preface. We will use data sets
from a number of other R packages as well. These include the CRAN packages alr3,
car, DAAG, Devore6, faraway, Hmisc, MASS, and multcomp. Appendix A includes
instructions for reading data from various file formats, for entering data manu-
ally, for obtaining documentation on R functions and data sets, and for installing
packages from CRAN.
1.2. Graphical and Numerical Summaries of Univariate Data

Now that we can get our hands on some data, we would like to develop some tools
to help us understand the distribution of a variable in a data set. By distribution
we mean answers to two questions:
• What values does the variable take on?
• With what frequency?
Simply listing all the values of a variable is not an effective way to describe a
distribution unless the data set is quite small. For larger data sets, we require some
better methods of summarizing a distribution.
The types of summaries used for a variable depend on the kind of variable we
are interested in. Some variables, like iris$Species, are used to put individuals
into categories. Such variables are called categorical (or qualitative) variables to
distinguish them from quantitative variables which have numerical values on some
numerically meaningful scale. iris$Sepal.Length is an example of a quantitative
variable.
Usually the categories are either given descriptive names (our preference) or
numbered consecutively. In R, a categorical variable is usually stored as a factor.
The possible categories of an R factor are called levels, and you can see in the
output above that R not only lists out all of the values of iris$Species but also
provides a list of all the possible levels for this variable. A more useful summary of
a categorical variable can be obtained using the table() function.
iris-table
> table(iris$Species) # make a table of values

    setosa versicolor  virginica
        50         50         50

From this we can see that there were 50 of each of three species of iris.
Tables can be used for quantitative data as well, but often this does not work
as well as it does for categorical data because there are too many categories.
iris-table2
> table(iris$Sepal.Length) # make a table of values
4.3 4.4 4.5 4.6 4.7 4.8 4.9 5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9
1 3 1 4 2 5 6 10 9 4 1 6 7 6 8 7 3
6 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7 7.1 7.2 7.3 7.4 7.6 7.7
6 6 4 9 7 5 2 8 3 4 1 1 3 1 1 1 4
7.9
1
Sometimes we may prefer to divide our quantitative data into two groups based on
a threshold or some other boolean test.
iris-logical
> table(iris$Sepal.Length > 6.0)
FALSE TRUE
89 61
The cut() function provides a more flexible way to build a table from quantitative
data.
iris-cut
> table(cut(iris$Sepal.Length,breaks=2:10))

 (2,3]  (3,4]  (4,5]  (5,6]  (6,7]  (7,8]  (8,9] (9,10]
     0      0     32     57     49     12      0      0
The cut() function partitions the data into sections, in this case with break points
at each integer from 2 to 10. (The breaks argument can be used to set the break
points wherever one likes.) The result is a categorical variable with levels describing
the interval in which each original quantitative value falls. If we prefer to have the
intervals closed on the other end, we can achieve this using right=FALSE.
iris-cut2
> table(cut(iris$Sepal.Length,breaks=2:10,right=FALSE))

  [2,3)   [3,4)   [4,5)   [5,6)   [6,7)   [7,8)   [8,9)  [9,10)
      0       0      22      61      54      13       0       0
Notice too that it is possible to define factors in R that have levels that do not
occur. This is why the 0’s are listed in the output of table(). See ?factor for
details.
A tabular view of data like the example above can be converted into a vi-
sual representation called a histogram. There are two R functions that can be
used to build a histogram: hist() and histogram(). hist() is part of core R.
histogram() can only be used after first loading the lattice graphics package,
which now comes standard with all distributions of R. Default versions of each are
depicted in Figure 1.1. A number of arguments can be used to modify the resulting
plot, set labels, choose break points, and the like.
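The exact commands used for Figure 1.1 are not shown in the text; assuming the lattice package has been loaded, a minimal sketch along these lines produces a default histogram with each function and illustrates a couple of the optional arguments:

> hist(iris$Sepal.Length)                    # base R histogram; vertical axis shows counts
> require(lattice)                           # load lattice if it is not already loaded
> histogram(~Sepal.Length, data=iris)        # lattice histogram; vertical axis shows percent of total
> histogram(~Sepal.Length, data=iris,
+     xlab="Sepal length (cm)", breaks=seq(4, 8, by=0.5))   # custom label and break points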
Looking at the plots generated by histogram() and hist(), we see that they
use different scales for the vertical axis. The default for histogram() is to use
percentages (of the entire data set). By contrast, hist() uses counts.

[Figure 1.1: default histograms of iris$Sepal.Length; hist() titles its plot "Histogram of iris$Sepal.Length" and uses a Frequency (count) scale, while histogram() uses a Percent of Total scale.]

The shapes of
the two histograms differ because they use slightly different algorithms for choosing
the default break points. The user can, of course, override the default break points
(using the breaks argument). There is a third scale, called the density scale, that is
often used for the vertical axis. This scale is designed so that the area of each bar is
equal to the proportion of data it represents. This is especially useful for histograms
that have bins (as the intervals between break points are typically called in the
context of histograms) of different widths. Figure 1.2 shows an example of such a
histogram generated using the following code:
iris-histo-density
> histogram(~Sepal.Length,data=iris,type="density",
+ breaks=c(4,5,5.5,6,6.5,7,8,10))
We will generally use the newer histogram() function because it has several
nice features. One of these is the ability to split up a plot into subplots called
panels. For example, we could build a separate panel for each species in the iris
data set. Figure 1.2 suggests that part of the variation in sepal length is associated
with the differences in species. Setosa are generally shorter, virginica longer, and
versicolor intermediate. The right-hand plot in Figure 1.2 was created using
iris-condition
> histogram(~Sepal.Length|Species,data=iris)
If we only want to see the data from one species, we can select a subset of the data
using the subset argument.
iris-histo-subset
> histogram(~Sepal.Length|Species,data=iris,
+ subset=Species=="virginica")
Figure 1.2. Left: A density histogram of sepal length using unequal bin
widths. Right: A histogram of sepal length by species.
Figure 1.3. This histogram is the result of selecting a subset of the data using
the subset argument.
By keeping the groups argument, our plot will continue to have a strip at the top
identifying the species even though there will only be one panel in our plot (Figure
1.3).
The lattice graphing functions all use a similar formula interface. The generic
form of a formula is
y ~ x | z
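For instance (these particular calls are illustrative and do not appear in the surrounding text), each of the following uses the same pattern, with the part before ~ omitted or the conditioning part after | omitted when it is not needed:

> histogram(~Sepal.Length | Species, data=iris)              # no y; condition on Species
> bwplot(Sepal.Length ~ Species, data=iris)                  # y ~ x, no conditioning
> xyplot(Sepal.Length ~ Sepal.Width | Species, data=iris)    # y ~ x | z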
Figure 1.5. Left: Skewed and symmetric distributions. Right: Old Faithful
eruption times illustrate a bimodal distribution.
the degree and direction of skewness with a number; for now it is sufficient to
describe distributions qualitatively as symmetric or skewed. See Figure 1.5 for
some examples of symmetric and skewed distributions.
Notice that each of these distributions is clustered around a center where most
of the values are located. We say that such distributions are unimodal. Shortly we
will discuss ways to summarize the location of the “center” of unimodal distributions
numerically. But first we point out that some distributions have other shapes that
are not characterized by a strong central tendency. One famous example is eruption
times of the Old Faithful geyser in Yellowstone National Park.
faithful-histogram
> plot <- histogram(~eruptions,faithful,n=20)
produces the histogram in Figure 1.5 which shows a good example of a bimodal
distribution. There appear to be two groups or kinds of eruptions, some lasting
about 2 minutes and others lasting between 4 and 5 minutes.
+-------+----------+---+------------+
| | |N |Sepal.Length|
+-------+----------+---+------------+
|Species|setosa | 50|5.0060 |
| |versicolor| 50|5.9360 |
| |virginica | 50|6.5880 |
+-------+----------+---+------------+
|Overall| |150|5.8433 |
+-------+----------+---+------------+
> summary(Sepal.Length~Species,iris,fun=median) # median instead
Sepal.Length N=150
+-------+----------+---+------------+
| | |N |Sepal.Length|
+-------+----------+---+------------+
|Species|setosa | 50|5.0 |
| |versicolor| 50|5.9 |
| |virginica | 50|6.5 |
+-------+----------+---+------------+
|Overall| |150|5.8 |
+-------+----------+---+------------+
Comparing with the histograms in Figure 1.2, we see that these numbers are indeed
good descriptions of the center of the distribution for each species.
We can also compute the mean and median of the Old Faithful eruption times.
faithful-mean-median
> mean(faithful$eruptions)
[1] 3.4878
> median(faithful$eruptions)
[1] 4
Notice, however, that in the Old Faithful eruption times histogram (Figure 1.5)
there are very few eruptions that last between 3.5 and 4 minutes. So although
these numbers are the mean and median, neither is a very good description of the
typical eruption time(s) of Old Faithful. It will often be the case that the mean
and median are not very good descriptions of a data set that is not unimodal.
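The stemplot below (Figure 1.6) appears to be the output of R's stem() function applied to the eruption times; a command along these lines reproduces it:

> stem(faithful$eruptions)    # stem() also prints a note that the decimal point is 1 digit to the left of the |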
16 | 070355555588
18 | 000022233333335577777777888822335777888
20 | 00002223378800035778
22 | 0002335578023578
24 | 00228
26 | 23
28 | 080
30 | 7
32 | 2337
34 | 250077
36 | 0000823577
38 | 2333335582225577
40 | 0000003357788888002233555577778
42 | 03335555778800233333555577778
44 | 02222335557780000000023333357778888
46 | 0000233357700000023578
48 | 00000022335800333
50 | 0370
In the case of our Old Faithful data, there seem to be two predominant peaks,
but unlike in the case of the iris data, we do not have another variable in our
data that lets us partition the eruption times into two corresponding groups. This
observation could, however, lead to some hypotheses about Old Faithful eruption
times. Perhaps eruption times at night are different from those during the day.
Perhaps there are other differences in the eruptions. Subsequent data collection
(and statistical analysis of the resulting data) might help us determine whether our
hypotheses appear correct.
One disadvantage of a histogram is that the actual data values are lost. For a
large data set, this is probably unavoidable. But for more modestly sized data sets,
a stemplot can reveal the shape of a distribution without losing the actual (perhaps
rounded) data values. A stemplot divides each value into a stem and a leaf at some
place value. The leaf is rounded so that it requires only a single digit. The values
are then recorded as in Figure 1.6.
From this output we can readily see that the shortest recorded eruption time
was 1.60 minutes. The second 0 in the first row represents 1.70 minutes. Note that
the output of stem() can be ambiguous when there are not enough data values in
a row.
Why bother with two different measures of central tendency? The short answer is
that they measure different things. If a distribution is (approximately) symmetric,
the mean and median will be (approximately) the same (see Exercise 1.5). If the
distribution is not symmetric, however, the mean and median may be very different,
and one measure may provide a more useful summary than the other.
For example, if we begin with a symmetric distribution and add in one addi-
tional value that is very much larger than the other values (an outlier), then the
median will not change very much (if at all), but the mean will increase substan-
tially. We say that the median is resistant to outliers while the mean is not. A
similar thing happens with a skewed, unimodal distribution. If a distribution is
positively skewed, the large values in the tail of the distribution increase the mean
(as compared to a symmetric distribution) but not the median, so the mean will
be larger than the median. Similarly, the mean of a negatively skewed distribution
will be smaller than the median.
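A quick numerical experiment illustrates this resistance; the particular values below are arbitrary and chosen only for illustration:

> x <- c(2, 3, 4, 5, 6)
> c(mean(x), median(x))          # both equal 4 for this small symmetric data set
[1] 4 4
> y <- c(x, 100)                 # add one unusually large value (an outlier)
> mean(y)                        # the mean jumps to 20
[1] 20
> median(y)                      # the median barely moves
[1] 4.5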
Whether a resistant measure is desirable or not depends on context. If we are
looking at the income of employees of a local business, the median may give us a
much better indication of what a typical worker earns, since there may be a few
large salaries (the business owner’s, for example) that inflate the mean. This is also
why the government reports median household income and median housing costs.
On the other hand, if we compare the median and mean of the value of raffle
prizes, the mean is probably more interesting. The median is probably 0, since
typically the majority of raffle tickets do not win anything. This is independent of
the values of any of the prizes. The mean will tell us something about the overall
value of the prizes involved. In particular, we might want to compare the mean
prize value with the cost of the raffle ticket when we decide whether or not to
purchase one.
There is another measure of central tendency that is less well known and represents
a kind of compromise between the mean and the median. In particular, it is more
sensitive to the extreme values of a distribution than the median is, but less sensitive
than the mean. The idea of a trimmed mean is very simple. Before calculating the
mean, we remove the largest and smallest values from the data. The percentage of
the data removed from each end is called the trimming percentage. A 0% trimmed
mean is just the mean; a 50% trimmed mean is the median; a 10% trimmed mean
is the mean of the middle 80% of the data (after removing the largest and smallest
10%). A trimmed mean is calculated in R by setting the trim argument of mean(),
e.g., mean(x,trim=0.10). Although a trimmed mean in some sense combines the
advantages of both the mean and median, it is less common than either the mean or
the median. This is partly due to the mathematical theory that has been developed
for working with the median and especially the mean of sample data.
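As an illustration (again with arbitrary made-up values), the trim argument interpolates between the mean and the median:

> x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)   # one large value inflates the mean
> mean(x)
[1] 14.5
> mean(x, trim=0.10)             # drop the smallest and largest 10% (here, 1 and 100)
[1] 5.5
> median(x)                      # the 50% trimmed mean
[1] 5.5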
more “spread out”. “Almost all” of the data in distribution A is quite close to 10; a
much larger proportion of distribution B is “far away” from 10. The intuitive (and
not very precise) statement in the preceding sentence can be quantified by means
of quantiles. The idea of quantiles is probably familiar to you since percentiles
are a special case of quantiles.
Definition 1.2.1 (Quantile). Let p ∈ [0, 1]. A p-quantile of a quantitative dis-
tribution is a number q such that the (approximate) proportion of the distribution
that is less than q is p.
So, for example, the 0.2-quantile divides a distribution into 20% below and 80%
above. This is the same as the 20th percentile. The median is the 0.5-quantile (and
the 50th percentile).
The idea of a quantile is quite straightforward. In practice there are a few
wrinkles to be ironed out. Suppose your data set has 15 values. What is the 0.30-
quantile? Exactly 30% of the data would be (0.30)(15) = 4.5 values. Of course,
there is no number that has 4.5 values below it and 11.5 values above it. This is
the reason for the parenthetical word approximate in Definition 1.2.1. Different
schemes have been proposed for giving quantiles a precise value, and R implements
several such methods. They are similar in many ways to the decision we had to
make when computing the median of a variable with an even number of values.
Two important methods can be described by imagining that the sorted data
have been placed along a ruler, one value at every unit mark and also at each end.
To find the p-quantile, we simply snap the ruler so that proportion p is to the left
and 1−p to the right. If the break point happens to fall precisely where a data value
is located (i.e., at one of the unit marks of our ruler), that value is the p-quantile. If
the break point is between two data values, then the p-quantile is a weighted mean
of those two values.
Example 1.2.1. Suppose we have 10 data values: 1, 4, 9, 16, 25, 36, 49, 64, 81, 100.
The 0-quantile is 1, the 1-quantile is 100, the 0.5-quantile (median) is midway
between 25 and 36, that is, 30.5. Since our ruler is 9 units long, the 0.25-quantile
is located 9/4 = 2.25 units from the left edge. That would be one quarter of the
way from 9 to 16, which is 9 + 0.25(16 − 9) = 9 + 1.75 = 10.75. (See Figure 1.8.)
Other quantiles are found similarly. This is precisely the default method used by
quantile().
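A short computation (not taken from the text; just a sketch of the "ruler" arithmetic) reproduces the value 10.75 found above and agrees with R's default quantile() method:

> x <- sort((1:10)^2)            # 1 4 9 16 25 36 49 64 81 100
> p <- 0.25
> h <- 1 + p * (length(x) - 1)   # position along the "ruler" (here 3.25)
> lo <- floor(h); hi <- ceiling(h)
> x[lo] + (h - lo) * (x[hi] - x[lo])   # weighted mean of the two neighboring values
[1] 10.75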
Figure 1.7. Histograms showing smaller (A) and larger (B) amounts of variation.
Figure 1.8. Illustrations of two methods for determining quantiles from data.
Arrows indicate the locations of the 0.25-, 0.5-, and 0.75-quantiles.
intro-quantile
> quantile((1:10)^2)
0% 25% 50% 75% 100%
1.00 10.75 30.50 60.25 100.00
A second scheme is just like the first one except that the data values are placed
midway between the unit marks. In particular, this means that the 0-quantile
is not the smallest value. This could be useful, for example, if we imagined we
were trying to estimate the lowest value in a population from which we only had
a sample. Probably the lowest value overall is less than the lowest value in our
particular sample. The only remaining question is how to extrapolate in the last
half unit on either side of the ruler. If we set quantiles in that range to be the
minimum or maximum, the result is another type of quantile().
Example 1.2.2. The method just described is what type=5 does.
intro-quantile05a
> quantile((1:10)^2,type=5)
0% 25% 50% 75% 100%
1.0 9.0 30.5 64.0 100.0
Notice that quantiles below the 0.05-quantile are all equal to the minimum value.
intro-quantile05b
> quantile((1:10)^2,type=5,seq(0,0.10,by=0.005))
0% 0.5% 1% 1.5% 2% 2.5% 3% 3.5% 4% 4.5% 5% 5.5% 6% 6.5%
1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.15 1.30 1.45
7% 7.5% 8% 8.5% 9% 9.5% 10%
1.60 1.75 1.90 2.05 2.20 2.35 2.50
A similar thing happens with the maximum value for the larger quantiles.
Other methods refine this idea in other ways, usually based on some assump-
tions about what the population of interest is like.
Fortunately, for large data sets, the differences between the different quantile
methods are usually unimportant, so we will just let R compute quantiles for us
using the quantile() function. For example, here are the deciles and quartiles
of the Old Faithful eruption times.
faithful-quantile
> quantile(faithful$eruptions,(0:10)/10)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90%
1.6000 1.8517 2.0034 2.3051 3.6000 4.0000 4.1670 4.3667 4.5330 4.7000
100%
5.1000
> quantile(faithful$eruptions,(0:4)/4)
Figure 1.9. Boxplots for iris sepal length and Old Faithful eruption times.
The latter of these provides what is commonly called the five-number summary.
The 0-quantile and 1-quantile (at least in the default scheme) are the minimum
and maximum of the data set. The 0.5-quantile gives the median, and the 0.25-
and 0.75-quantiles (also called the first and third quartiles) isolate the middle 50%
of the data. When these numbers are close together, then most (well, half, to be
more precise) of the values are near the median. If those numbers are farther apart,
then much (again, half) of the data is far from the center. The difference between
the first and third quartiles is called the interquartile range and is abbreviated
IQR. This is our first numerical measure of dispersion.
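In R the interquartile range can be computed directly with IQR() or from the quartiles themselves; for example, for the Old Faithful eruption times:

> IQR(faithful$eruptions)                         # third quartile minus first quartile
> diff(quantile(faithful$eruptions, c(0.25, 0.75)))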
The five-number summary can also be presented graphically using a boxplot
(also called box-and-whisker plot) as in Figure 1.9. These plots were generated
using
iris-bwplot
> bwplot(Sepal.Length~Species,data=iris)
> bwplot(Species~Sepal.Length,data=iris)
> bwplot(~eruptions,faithful)
The size of the box reflects the IQR. If the box is small, then the middle 50% of
the data are near the median, which is indicated by a dot in these plots. (Some
boxplots, including those made by boxplot(), use a vertical line to indicate
the median.) Outliers (values that seem unusually large or small) can be indicated
by a special symbol. The whiskers are then drawn from the box to the largest
and smallest non-outliers. One common rule for automating outlier detection for
boxplots is the 1.5 IQR rule. Under this rule, any value that is more than 1.5 IQR
away from the box is marked as an outlier. Indicating outliers in this way is useful
since it allows us to see if the whisker is long only because of one extreme value.
The trouble with this is that the total deviation from the mean is always 0 because
the negative deviations and the positive deviations always exactly cancel out. (See
Exercise 1.10).
To fix this problem, we might consider taking the absolute value of the devia-
tions from the mean:
total absolute deviation from the mean = Σ |x_i − x̄| .
This number will only be 0 if all of the data values are equal to the mean. Even
better would be to divide by the number of data values:
mean absolute deviation = (1/n) Σ |x_i − x̄| .
Otherwise large data sets will have large sums even if the values are all close to the
mean. The mean absolute deviation is a reasonable measure of the dispersion in
a distribution, but we will not use it very often. There is another measure that is
much more common, namely the variance, which is defined by
variance = Var(x) = (1/(n−1)) Σ (x_i − x̄)² .
You will notice two differences from the mean absolute deviation. First, instead
of using an absolute value to make things positive, we square the deviations from
the mean. The chief advantage of squaring over the absolute value is that it is
much easier to do calculus with a polynomial than with functions involving absolute
values. Because the squaring changes the units of this measure, the square root
of the variance, called the standard deviation, is commonly used in place of the
variance.
The second difference is that we divide by n − 1 instead of by n. There is
a very good reason for this, even though dividing by n probably would have felt
much more natural to you at this point. We’ll get to that very good reason later
in the course (in Section 4.6). For now, we’ll settle for a less good reason. If you
know the mean and all but one of the values of a variable, then you can determine
the remaining value, since the sum of all the values must be the product of the
number of values and the mean. So once the mean is known, there are only n − 1
independent pieces of information remaining. That is not a particularly satisfying
explanation, but it should help you remember to divide by the correct quantity.
All of these quantities are easy to compute in R.
intro-dispersion02
> x=c(1,3,5,5,6,8,9,14,14,20)
>
> mean(x)
[1] 8.5
> x - mean(x)
[1] -7.5 -5.5 -3.5 -3.5 -2.5 -0.5 0.5 5.5 5.5 11.5
> sum(x - mean(x))
[1] 0
> abs(x - mean(x))
[1] 7.5 5.5 3.5 3.5 2.5 0.5 0.5 5.5 5.5 11.5
> sum(abs(x - mean(x)))
[1] 46
> (x - mean(x))^2
[1] 56.25 30.25 12.25 12.25 6.25 0.25 0.25 30.25 30.25
[10] 132.25
> sum((x - mean(x))^2)
[1] 310.5
> n= length(x)
> 1/(n-1) * sum((x - mean(x))^2)
[1] 34.5
> var(x)
[1] 34.5
> sd(x)
[1] 5.8737
> sd(x)^2
[1] 34.5
1.3.2. Scatterplots
There is another plot that is useful for looking at the relationship between two
quantitative variables. A scatterplot (or scattergram) is essentially the familiar
Cartesian coordinate plot you learned about in school. Since each observation in a
bivariate data set has two values, we can plot points on a rectangular grid repre-
senting both values simultaneously. The lattice function for making a scatterplot
is xyplot().
The scatterplot in Figure 1.10 becomes even more informative if we separate
the dots of the three species. Figure 1.11 shows two ways this can be done. The
first uses a conditioning variable, as we have seen before, to make separate panels
for each species. The second uses the groups argument to plot the data in the same
panel but with different symbols for each species. Each of these clearly indicates
that, in general, plants with wider sepals also have longer sepals but that the typical
values of and the relationship between width and length differ by species.
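The code used for Figures 1.10 and 1.11 is not reproduced at this point in the text, but commands along these lines produce similar plots:

> xyplot(Sepal.Length ~ Sepal.Width, data=iris)               # all species together
> xyplot(Sepal.Length ~ Sepal.Width | Species, data=iris)     # one panel per species
> xyplot(Sepal.Length ~ Sepal.Width, groups=Species, data=iris,
+     auto.key=TRUE)                                          # one panel, different symbols per species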
[Figure 1.10: a scatterplot of Sepal.Length versus Sepal.Width for the iris data.]
For each case they noted the race of the defendant and whether or not the death
penalty was imposed. We can use R to cross tabulate this data for us:
intro-deathPenalty01
> xtabs(~Penalty+Victim,data=deathPenalty)
Victim
Penalty Black White
Death 6 30
Not 106 184
Perhaps you are surprised that white defendants are more likely to receive
the death penalty. It turns out that there is more to the story. The researchers
also recorded the race of the victim. If we make a new table that includes this
information, we see something interesting.
intro-deathPenalty02
> xtabs(~Penalty+Defendant+Victim,
+     data=deathPenalty)
, , Victim = Black

       Defendant
Penalty Black White
  Death     6     0
  Not      97     9

, , Victim = White

       Defendant
Penalty Black White
  Death    11    19
  Not      52   132
[Figure 1.11: scatterplots of Sepal.Length versus Sepal.Width separated by species; left, one panel per species; right, a single panel with a different symbol for each species.]
Figure 1.12. A mosaic plot of death penalty by race of defendant and victim.
It appears that black defendants are more likely to receive the death penalty
when the victim is white and also when the victim is black, but not if we ignore
the race of the victim. This sort of apparent contradiction is known as Simpson’s
paradox. In this case, it appears that the death penalty is more likely to be given
for a white victim, and since most victims are the same race as their murderer, the
result is that overall white defendants are more likely (in this data set) to receive
the death penalty even though black defendants are more likely (again, in this data
set) to receive the death penalty for each race of victim.
The fact that our understanding of the data is so dramatically influenced by
whether or not our analysis includes the race of the victim is a warning to watch for
lurking variables – variables that have an important effect but are not included
in our analysis – in other settings as well. Part of the design of a good study is
selecting the right things to measure.
These cross tables can be visualized graphically using a mosaic plot. Mosaic
plots can be generated with the core R function mosaicplot() or with mosaic()
from the vcd package. (vcd is short for visualization of categorical data.) The
latter is somewhat more flexible and usually produces more esthetically pleasing
output. A number of different formula formats can be supplied to mosaic(). The
results of the following code are shown in Figure 1.12.
intro-deathPenalty03
> require(vcd)
> mosaic(~Victim+Defendant+DeathPenalty,data=deathPen)
> structable(~Victim+Defendant+DeathPenalty,data=deathPen)
                    Defendant  Bl  Wh
Victim DeathPenalty
Bl     No                      97   9
       Yes                      6   0
Wh     No                      52 132
       Yes                     11  19
As always, see ?mosaic for more information. The vcd package also provides an
alternative to xtabs() called structable(), and if you print() a mosaic(), you
will get both the graph and the table.
1.4. Summary
1.4.1. R Commands
Here is a table of important R commands introduced in this chapter. Usage details
can be found in the examples and using the R help.
Exercises
1.2. The pulse variable in the littleSurvey data set contains self-reported pulse
rates.
a) Make a histogram of these values. What problem does this histogram reveal?
b) Make a decision about what values should be removed from the data and make
a histogram of the remaining values. (You can use the subset argument of
the histogram() function to restrict the data or you can create a new vector
and make a histogram from that.)
c) Compute the mean and median of your restricted set of pulse rates.
1.3. The pulse variable in the littleSurvey data set contains self-reported pulse
rates. Make a table or graph showing the distribution of the last digits of the
recorded pulse rates and comment on the distribution of these digits. Any conjec-
tures?
Note: %% is the modulus operator in R. So x %% 10 gives the remainder after
dividing x by 10, which is the last digit.
1.4. Some students in introductory statistics courses were asked to select a num-
ber between 1 and 30 (inclusive). The results are in the number variable in the
littleSurvey data set.
a) Make a table showing the frequency with which each number was selected
using table().
b) Make a histogram of these values with bins centered at the integers from 1 to
30.
c) What numbers were most frequently chosen? Can you get R to find them for
you?
d) What numbers were least frequently chosen? Can you get R to find them for
you?
e) Make a table showing how many students selected odd versus even numbers.
1.6. Describe some situations where the mean or median is clearly a better measure
of central tendency than the other.
1.7. Below are histograms and boxplots from six distributions. Match each his-
togram (A–F) with its corresponding boxplot (U–Z).
[Figure: six histograms labeled A–F and six boxplots labeled U–Z.]
1.8. The function bwplot() does not use the quantile() function to compute its
five-number summary. Instead it uses fivenum(). Technically, fivenum() com-
putes the hinges of the data rather than quantiles. Sometimes fivenum() and
quantile() agree:
fivenum-a
> fivenum(1:11)
[1] 1.0 3.5 6.0 8.5 11.0
> quantile(1:11)
0% 25% 50% 75% 100%
1.0 3.5 6.0 8.5 11.0
1.9. Design some data sets to test whether by default bwplot() uses the 1.5 IQR
rule to determine whether it should indicate data as outliers.
1.10. Show that the total deviation from the mean, defined by
total deviation from the mean = Σ_{i=1}^{n} (x_i − x̄) ,
is 0 for any distribution.
1.11. We could compute the mean absolute deviation from the median instead of
from the mean. Show that the mean absolute deviation from the median is never
larger than the mean absolute deviation from the mean.
1.12. We could compute the mean absolute deviation from any number c (c for
center). Show that the mean absolute deviation from c is always at least as large
as the mean absolute deviation from the median. Thus the median is a minimizer
of mean absolute deviation.
1.13. Let SS(c) = Σ (x_i − c)² . (SS stands for sum of squares.) Show that the
smallest value of SS(c) occurs when c = x. This shows that the mean is a minimizer
of SS.
1.14. Find a distribution with 10 values between 0 and 10 that has as large a
variance as possible.
1.15. Find a distribution with 10 values between 0 and 10 that has as small a
variance as possible.
1.16. The pitching2005 data set in the fastR package contains 2005 season statis-
tics for each pitcher in the major leagues. Use graphical and numerical summaries
of this data set to explore whether there are differences between the two leagues,
restricting your attention to pitchers that started at least 5 games (the variable GS
stands for ‘games started’). You may select the statistics that are of interest to
you.
If you are not much of a baseball fan, try using ERA (earned run average), which
is a measure of how many runs score while a pitcher is pitching. It is measured in
runs per nine innings.
1.17. Repeat the previous problem using batting statistics. The fastR data set
batting contains data on major league batters over a large number of years. You
may want to restrict your attention to a particular year or set of years.
1.18. Have major league batting averages changed over time? If so, in what ways?
Use the data in the batting data set to explore this question. Use graphical and
numerical summaries to make your case one way or the other.
1.19. The faithful data set contains two variables: the duration (eruptions) of
the eruption and the time until the next eruption (waiting).
a) Make a scatterplot of these two variables and comment on any patterns you
see.
b) Remove the first value of eruptions and the last value of waiting. Make a
scatterplot of these two vectors.
c) Which of the two scatterplots reveals a tighter relationship? What does that
say about the relationship between eruption duration and the interval between
eruptions?
1.20. The results of a little survey that has been given to a number of statistics
students are available in the littleSurvey data set. Make some conjectures about
the responses and use R’s graphical and numerical summaries to see if there is any
(informal) evidence to support your conjectures. See ?littleSurvey for details
about the questions on the survey.
1.21. The utilities data set contains information from utilities bills for a personal
residence over a number of years. This problem explores gas usage over time.
a) Make a scatterplot of gas usage (ccf) vs. time. You will need to combine
month and year to get a reasonable measurement for time. Such a plot is
called a time series plot.
b) Use the groups argument (and perhaps type=c(’p’,’l’), too) to make the
different months of the year distinguishable in your scatterplot.
c) Now make a boxplot of gas usage (ccf) vs. factor(month). Which months
are most variable? Which are most consistent?
d) What patterns do you see in the data? Does there appear to be any change
in gas usage over time? Which plots help you come to your conclusion?
1.22. Note that March and May of 2000 are outliers due to a bad meter reading.
Utility bills come monthly, but the number of days in a billing cycle varies from
month to month. Add a new variable to the utilities data set using
utilities-ccfpday
> utilities$ccfpday <- utilities$ccf / utilities$billingDays
> plot1 <- xyplot( ccfpday ~ (year + month/12), utilities, groups=month )
> plot2 <- bwplot( ccfpday ~ factor(month), utilities )
Repeat the previous exercise using ccfpday instead of ccf. Are there any noticeable
differences between the two analyses?
1.23. The utilities data set contains information from utilities bills for a personal
residence over a number of years. One would expect that the gas bill would be
related to the average temperature for the month.
Make a scatterplot showing the relationship between ccf (or, better, ccfpday;
see Exercise 1.22) and temp. Describe the overall pattern. Are there any outliers?
1.24. The utilities data set contains information from utilities bills for a personal
residence over a number of years. The variables gasbill and ccf contain the gas
bill (in dollars) and usage (in 100 cubic feet) for a personal residence. Use plots
to explore the cost of gas over the time period covered in the utilities data set.
Look for both seasonal variation in price and any trends over time.
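One possible derived variable to work with, a minimal sketch (the name priceperccf is ours):
> # price paid per 100 cubic feet of gas in each billing period
> utilities$priceperccf <- utilities$gasbill / utilities$ccf
> xyplot( priceperccf ~ (year + month/12), utilities, groups=month, type=c('p','l') )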
1.25. The births78 data set contains the number of births in the United States
for each day of 1978.
a) Make a histogram of the number of births. You may be surprised by the shape
of the distribution. (Make a stemplot too if you like.)
b) Now make a scatterplot of births vs. day of the year. What do you notice?
Can you conjecture any reasons for this?
c) Can you make a plot that will help you see if your conjecture seems correct?
(Hint: Use groups.)
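A minimal sketch for getting started; the variable names used here (births, dayofyear, wday) are assumptions, so check names(births78) for the actual names:
> require(fastR); require(lattice)   # births78 data; histogram() and xyplot()
> histogram( ~births, births78 )                          # part (a)
> xyplot( births ~ dayofyear, births78 )                  # part (b)
> xyplot( births ~ dayofyear, births78, groups=wday )     # part (c), grouping by day of week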
Chapter 2
Probability and Random Variables
The excitement that a gambler feels when making a bet is equal to the
amount he might win times the probability of winning it.
Blaise Pascal [Ros88]
A good example is flipping a coin. The result of any given toss of a fair coin is
unpredictable in advance. It could be heads, or it could be tails. We don’t know
with certainty which it will be. Nevertheless, we can say something about the
long-run behavior of flipping a coin many times. This is what makes us surprised
if someone flips a coin 20 times and gets heads all 20 times, but not so surprised if
the result is 12 heads and 8 tails.
[Footnote 1] It is traditional within the study of probability to refer to random processes as random experiments. We avoid that usage here to prevent confusion with randomized experiments: statistical studies where the values of some variables are determined using randomness.
There is much to be said for this definition. It will certainly give numbers
between 0 and 1 since the numerator is never larger than the denominator. Fur-
thermore, events that happen frequently will be assigned large numbers and events
which occur infrequently will be assigned small numbers. Nevertheless, (2.1) doesn’t
make a good definition, at least not in its current form. The problem with (2.1) is
that if two different people each repeat the random process and calculate the prob-
ability of E, very likely they will get different numbers. So perhaps the following
would be a better statement:
$$P(E) \approx \frac{\text{number of times outcome was in the event } E}{\text{number of times the random process was repeated}}. \qquad (2.2)$$
[Figure: relative frequency of the event plotted against the number of tosses; the vertical axis runs from 0.2 to 0.8.]
The observation that the relative frequency of our event appears to be con-
verging as the number of repetitions increases might lead us to try a definition
like
$$P(E) = \lim_{n\to\infty} \frac{\text{number of times in } n \text{ repetitions that outcome was in the event } E}{n}. \qquad (2.3)$$
It’s not exactly clear how we would formally define such a limit, and even less
clear how we would attempt to evaluate it. But the intuition is still useful, and for
now we will think of the empirical probability method as an approximation method
(postponing for now any formal discussion of the quality of the approximation)
that estimates a probability by repeating a random process some number of times
and determining what percentage of the outcomes observed were in the event of
interest.
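For example, here is a minimal sketch in R estimating the probability that two fair dice total 7 (the exact value is 1/6, about 0.167):
> # roll two dice 10,000 times and record each total
> totals <- replicate(10000, sum(sample(1:6, 2, replace=TRUE)))
> mean(totals == 7)     # proportion of repetitions falling in the event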
Such empirical probabilities can be very useful, especially if the process is quick
and cheap to repeat. But who has time to flip a coin 10,000 or more times just
to see if the coin is fair? Actually, there have been folks who have flipped a coin
a large number of times and recorded the results. One such was John Kerrich, a
South African mathematician who recorded 5067 heads in 10,000 flips while in a
prison camp during World War II. That isn’t exactly 50% heads, but it is pretty
close.
Since repeatedly carrying out even a simple random process like flipping a coin
can be tedious and time consuming, we will often make use of computer simulations.
If we have a reasonably good model for a random event and can program it into a
computer, we can let the computer repeat the simulation many times very rapidly
and (hopefully) get good approximations to what would happen if we actually
repeated the process many times. The histograms below show the results of 1000
simulations of flipping a fair coin 1000 times (left) and 10,000 times (right).
[Figure: side-by-side histograms of the simulated proportion of heads (horizontal axis, 0.46 to 0.54; vertical axis, Percent of Total). Left panel: Results of 1000 simulations of 1000 coin tosses. Right panel: Results of 1000 simulations of 10,000 coin tosses.]
Figure 2.2. As the number of coin tosses increases, the results of the simu-
lation become more consistent from simulation to simulation.
Notice that the simulation-based probability is closer to 0.50 more of the time
when the coin is tossed 10,000 times than when it is tossed only 1000 times, but that
there is some variation in both cases. Simulations with even larger sample sizes would reveal
that this variation decreases as the sample size increases but that there is always
some amount of variation from sample to sample. Also notice that John Kerrich’s
results are quite consistent with the results of our simulations, which assumed a
tossed coin has a 50% probability of being heads.
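Simulations like those summarized in Figure 2.2 can be sketched in a few lines of R; rbinom() returns the number of heads in each simulated batch of tosses, and histogram() comes from the lattice package:
> require(lattice)
> # 1000 simulations, each consisting of 1000 tosses of a fair coin
> propHeads <- rbinom(1000, size=1000, prob=0.5) / 1000
> histogram( ~propHeads, xlab="proportion of heads" )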
So it appears that flipping a fair coin is a 50-50 proposition. That’s not too
surprising; you already knew that. But how do you “know” this? You probably
have not flipped a coin 10,000 times and carefully recorded the outcomes of each
toss.[2] So you must be using some other method to derive the probability of 0.5.
our empirical probabilities (at least given a fixed set of repetitions). Despite the
fact that our axioms are so few and so simple, they are quite useful. In Section 2.2
we’ll see that a number of other useful general principles follow easily from our
three axioms. Together these rules and the axioms form the basis of theoretical
probability calculations. But before we provide examples of using these rules to
calculate probabilities, we’ll introduce the important notion of a random variable.