
Foundations and Applications of Statistics
An Introduction Using R

Randall Pruim

Pure and Applied Undergraduate Texts • 13
(The Sally Series)

American Mathematical Society
Providence, Rhode Island
EDITORIAL COMMITTEE
Paul J. Sally, Jr. (Chair)
Joseph Silverman
Francis Su
Susan Tolman

2010 Mathematics Subject Classification. Primary 62–01; Secondary 60–01.

For additional information and updates on this book, visit


www.ams.org/bookpages/amstext-13

Library of Congress Cataloging-in-Publication Data


Pruim, Randall J.
Foundations and applications of statistics : an introduction using R / Randall Pruim.
p. cm. — (Pure and applied undergraduate texts ; v. 13)
Includes bibliographical references and index.
ISBN 978-0-8218-5233-0 (alk. paper)
1. Mathematical statistics—Data processing. 2. R (Computer program language) I. Title.
QA276.45.R3P78 2010
519.50285—dc22
2010041197

Copying and reprinting. Individual readers of this publication, and nonprofit libraries
acting for them, are permitted to make fair use of the material, such as to copy a chapter for use
in teaching or research. Permission is granted to quote brief passages from this publication in
reviews, provided the customary acknowledgment of the source is given.
Republication, systematic copying, or multiple reproduction of any material in this publication
is permitted only under license from the American Mathematical Society. Requests for such
permission should be addressed to the Acquisitions Department, American Mathematical Society,
201 Charles Street, Providence, Rhode Island 02904-2294 USA. Requests can also be made by
e-mail to [email protected].


© 2011 by Randall Pruim. All rights reserved.
Printed in the United States of America.

∞ The paper used in this book is acid-free and falls within the guidelines
established to ensure permanence and durability.
Visit the AMS home page at http://www.ams.org/
10 9 8 7 6 5 4 3 2 1 16 15 14 13 12 11
Contents

Preface ix

What Is Statistics? xv

Chapter 1. Summarizing Data 1


1.1. Data in R 1
1.2. Graphical and Numerical Summaries of Univariate Data 4
1.3. Graphical and Numerical Summaries of Multivariate Data 17
1.4. Summary 20
Exercises 21

Chapter 2. Probability and Random Variables 27


2.1. Introduction to Probability 28
2.2. Additional Probability Rules and Counting Methods 33
2.3. Discrete Distributions 49
2.4. Hypothesis Tests and p-Values 56
2.5. Mean and Variance of a Discrete Random Variable 65
2.6. Joint Distributions 72
2.7. Other Discrete Distributions 80
2.8. Summary 92
Exercises 97

Chapter 3. Continuous Distributions 113


3.1. pdfs and cdfs 113
3.2. Mean and Variance 124
3.3. Higher Moments 126
3.4. Other Continuous Distributions 133


3.5. Kernel Density Estimation 145


3.6. Quantile-Quantile Plots 150
3.7. Joint Distributions 155
3.8. Summary 163
Exercises 166

Chapter 4. Parameter Estimation and Testing 175


4.1. Statistical Models 175
4.2. Fitting Models by the Method of Moments 177
4.3. Estimators and Sampling Distributions 183
4.4. Limit Theorems 191
4.5. Inference for the Mean (Variance Known) 200
4.6. Estimating Variance 208
4.7. Inference for the Mean (Variance Unknown) 212
4.8. Confidence Intervals for a Proportion 223
4.9. Paired Tests 225
4.10. Developing New Tests 227
4.11. Summary 234
Exercises 238

Chapter 5. Likelihood-Based Statistics 251


5.1. Maximum Likelihood Estimators 251
5.2. Likelihood Ratio Tests 266
5.3. Confidence Intervals 274
5.4. Goodness of Fit Testing 277
5.5. Inference for Two-Way Tables 288
5.6. Rating and Ranking Based on Pairwise Comparisons 297
5.7. Bayesian Inference 304
5.8. Summary 312
Exercises 315

Chapter 6. Introduction to Linear Models 323


6.1. The Linear Model Framework 324
6.2. Simple Linear Regression 333
6.3. Inference for Simple Linear Regression 341
6.4. Regression Diagnostics 353
6.5. Transformations in Linear Regression 362
6.6. Categorical Predictors 369
6.7. Categorical Response (Logistic Regression) 377
6.8. Simulating Linear Models to Check Robustness 384

6.9. Summary 387


Exercises 391
Chapter 7. More Linear Models 399
7.1. Additive Models 399
7.2. Assessing the Quality of a Model 413
7.3. One-Way ANOVA 423
7.4. Two-Way ANOVA 455
7.5. Interaction and Higher Order Terms 470
7.6. Model Selection 474
7.7. More Examples 482
7.8. Permutation Tests and Linear Models 493
7.9. Summary 496
Exercises 499
Appendix A. A Brief Introduction to R 507
A.1. Getting Up and Running 507
A.2. Working with Data 513
A.3. Lattice Graphics in R 527
A.4. Functions in R 535
A.5. Some Extras in the fastR Package 542
A.6. More R Topics 544
Exercises 546

Appendix B. Some Mathematical Preliminaries 549


B.1. Sets 550
B.2. Functions 552
B.3. Sums and Products 553
Exercises 555
Appendix C. Geometry and Linear Algebra Review 559
C.1. Vectors, Spans, and Bases 559
C.2. Dot Products and Projections 563
C.3. Orthonormal Bases 567
C.4. Matrices 569
Exercises 575

Appendix D. Review of Chapters 1–4 579


D.1. R Infrastructure 579
D.2. Data 580
D.3. Probability Basics 581
D.4. Probability Toolkit 581

D.5. Inference 582


D.6. Important Distributions 583
Exercises 583
Hints, Answers, and Solutions to Selected Exercises 587
Bibliography 599

Index to R Functions, Packages, and Data Sets 605


Index 609
Preface

Intended Audience

As the title suggests, this book is intended as an introduction to both the foun-
dations and applications of statistics. It is an introduction in the sense that it
does not assume a prior statistics course. But it is not introductory in the sense
of being suitable for students who have had nothing more than the usual high
school mathematics preparation. The target audience is undergraduate students at
the equivalent of the junior or senior year at a college or university in the United
States.
Students should have had courses in differential and integral calculus, but not
much more is required in terms of mathematical background. In fact, most of my
students have had at least another course or two by the time they take this course,
but the only courses that they have all had are those in the calculus sequence. The majority
of my students are not mathematics majors. I have had students from biology,
chemistry, computer science, economics, engineering, and psychology, and I have
tried to write a book that is interesting, understandable, and useful to students
with a wide range of backgrounds and career goals.
This book is suitable for what is often a two-semester sequence in “mathe-
matical statistics”, but it is different in some important ways from many of the
books written for such a course. I was trained as a mathematician first, and the
book is clearly mathematical at some points, but the emphasis is on the statistics.
Mathematics and computation are brought in where they are useful tools. The
result is a book that stretches my students in different directions at different times
– sometimes statistically, sometimes mathematically, sometimes computationally.

The Approach Used in This Book

Features of this book that help distinguish it from other books available for such a
course include the following:


• The use of R, a free software environment for statistical computing and graph-
ics, throughout the text.
Many books claim to integrate technology, but often technology appears
to be more of an afterthought. In this book, topics are selected, ordered, and
discussed in light of the current practice in statistics, where computers are an
indispensable tool, not an occasional add-on.
R was chosen because it is both powerful and available. Its “market share”
is increasing rapidly, so experience with R is likely to serve students well in
their future careers in industry or academics. A large collection of add-on
packages is available, and new statistical methods are often available in R
before they are available anywhere else.
R is open source and is available at the Comprehensive R Archive Network
(CRAN, http://cran.r-project.org) for a wide variety of computing plat-
forms at no cost. This allows students to obtain the software for their personal
computers – an essential ingredient if computation is to be used throughout
the course.
The R code in this book was executed on a 2.66 GHz Intel Core 2 Duo
MacBook Pro running OS X (version 10.5.8) and the current version of R (ver-
sion 2.12). Results using a different computing platform or different version
of R should be similar.
• An emphasis on practical statistical reasoning.
The idea of a statistical study is introduced early on using Fisher’s famous
example of the lady tasting tea. Numerical and graphical summaries of data
are introduced early to give students experience with R and to allow them
to begin formulating statistical questions about data sets even before formal
inference is available to help answer those questions.
• Probability for statistics.
One model for the undergraduate mathematical statistics sequence presents
a semester of probability followed by a semester of statistics. In this book,
I take a different approach and get to statistics early, developing the neces-
sary probability as we go along, motivated by questions that are primarily
statistical. Hypothesis testing is introduced almost immediately, and p-value
computation becomes a motivation for several probability distributions. The
binomial test and Fisher’s exact test are introduced formally early on, for ex-
ample. Where possible, distributions are presented as statistical models first,
and their properties (including the probability mass function or probability
density function) derived, rather than the other way around. Joint distribu-
tions are motivated by the desire to learn about the sampling distribution of
a sample mean.
Confidence intervals and inference for means based on t-distributions must
wait until a bit more machinery has been developed, but my intention is that
a student who only takes the first semester of a two-semester sequence will
have a solid understanding of inference for one variable – either quantitative
or categorical.

• The linear algebra middle road.


Linear models (regression and ANOVA) are treated using a geometric,
vector-based approach. A more common approach at this level is to intro-
duce these topics without referring to the underlying linear algebra. Such an
approach avoids the problem of students with minimal background in linear
algebra but leads to mysterious and unmotivated identities and notions.
Here I rely on a small amount of linear algebra that can be quickly re-
viewed or learned and is based on geometric intuition and motivation (see
Appendix C). This works well in conjunction with R since R is in many ways
vector-based and facilitates vector and matrix operations. On the other hand,
I avoid using an approach that is too abstract or requires too much background
for the typical student in my course.

Brief Outline

The first four chapters of this book introduce important ideas in statistics (dis-
tributions, variability, hypothesis testing, confidence intervals) while developing a
mathematical and computational toolkit. I cover this material in a one-semester
course. Also, since some of my students only take the first semester, I wanted to
be sure that they leave with a sense for statistical practice and have some useful
statistical skills even if they do not continue. Interestingly, as a result of designing
my course so that stopping halfway makes some sense, I am finding that more of
my students are continuing on to the second semester. My sample size is still small,
but I hope that the trend continues and would like to think it is due in part to the
fact that the students are enjoying the course and can see “where it is going”.
The last three chapters deal primarily with two important methods for handling
more complex statistical models: maximum likelihood and linear models (including
regression, ANOVA, and an introduction to generalized linear models). This is not
a comprehensive treatment of these topics, of course, but I hope it both provides
flexible, usable statistical skills and prepares students for further learning.
Chi-squared tests for goodness of fit and for two-way tables using both the
Pearson and likelihood ratio test statistics are covered after first generating em-
pirical p-values based on simulations. The use of simulations here reinforces the
notion of a sampling distribution and allows for a discussion about what makes a
good test statistic when multiple test statistics are available. I have also included
a brief introduction to Bayesian inference, some examples that use simulations to
investigate robustness, a few examples of permutation tests, and a discussion of
Bradley-Terry models. The latter topic is one that I cover between Selection Sun-
day and the beginning of the NCAA Division I Basketball Tournament each year.
An application of the method to the 2009–2010 season is included.
Various R functions and methods are described as we go along, and Appendix A
provides an introduction to R focusing on the way R is used in the rest of the book.
I recommend working through Appendix A simultaneously with the first chapter –
especially if you are unfamiliar with programming or with R.
Some of my students enter the course unfamiliar with the notation for things
like sets, functions, and summation, so Appendix B contains a brief tour of the basic

mathematical results and notation that are needed. The linear algebra required for
parts of Chapter 4 and again in Chapters 6 and 7 is covered in Appendix C. These
can be covered as needed or used as a quick reference. Appendix D is a review of
the first four chapters in outline form. It is intended to prepare students for the
remainder of the book after a semester break, but it could also be used as an end
of term review.

Access to R Code and Data Sets

All of the data sets and code fragments used in this book are available for use
in R on your own computer. Data sets and other utilities that are not provided
by R packages in CRAN are available in the fastR package. This package can
be obtained from CRAN, from the companion web site for this book, or from the
author’s web site.
Among the utility functions in fastR is the function snippet(), which provides
easy access to the code fragments that appear in this book. The names of the code
fragments in this book appear in boxes at the right margin where code output is
displayed. Once fastR has been installed and loaded,
snippet('snippet')

will both display and execute the code named “snippet”, and
snippet('snippet', exec=FALSE)

will display but not execute the code.


fastR also includes a number of additional utility functions. Several of these
begin with the letter x. Examples include xplot, xhistogram, xpnorm, etc. These
functions add extra features to the standard functions they are based on. In most
cases they are identical to their x-less counterparts unless new arguments are used.

Companion Web Site

Additional material related to this book is available online at

http://www.ams.org/bookpages/amstext-13

Included there are

• an errata list,
• additional instructions, with links, for installing R and the R packages used in
this book,
• additional examples and problems,
• additional student solutions,
• additional material – including a complete list of solutions – available only to
instructors.

Acknowledgments
Every author sets out to write the perfect book. I was no different. Fortunate
authors find others who are willing to point out the ways they have fallen short of
their goal and suggest improvements. I have been fortunate.
Most importantly, I want to thank the students who have taken advanced
undergraduate statistics courses with me over the past several years. Your questions
and comments have shaped the exposition of this book in innumerable ways. Your
enthusiasm for detecting my errors and your suggestions for improvements have
saved me countless embarrassments. I hope that your moments of confusion have
added to the clarity of the exposition.
If you look, some of you will be able to see your influence in very specific ways
here and there (happy hunting). But so that you all get the credit you deserve, I
want to list you all (in random order, of course): Erin Campbell, John Luidens, Kyle
DenHartigh, Jessica Haveman, Nancy Campos, Matthew DeVries, Karl Stough,
Heidi Benson, Kendrick Wiersma, Dale Yi, Jennifer Colosky, Tony Ditta, James
Hays, Joshua Kroon, Timothy Ferdinands, Hanna Benson, Landon Kavlie, Aaron
Dull, Daniel Kmetz, Caleb King, Reuben Swinkels, Michelle Medema, Sean Kidd,
Leah Hoogstra, Ted Worst, David Lyzenga, Eric Barton, Paul Rupke, Alexandra
Cok, Tanya Byker Phair, Nathan Wybenga, Matthew Milan, Ashley Luse, Josh
Vesthouse, Jonathan Jerdan, Jamie Vande Ree, Philip Boonstra, Joe Salowitz,
Elijah Jentzen, Charlie Reitsma, Andrew Warren, Lucas Van Drunen, Che-Yuan
Tang, David Kaemingk, Amy Ball, Ed Smilde, Drew Griffioen, Tim Harris, Charles
Blum, Robert Flikkema, Dirk Olson, Dustin Veldkamp, Josh Keilman, Eric Sloter-
beek, Bradley Greco, Matt Disselkoen, Kevin VanHarn, Justin Boldt, Anthony
Boorsma, Nathan Dykhuis, Brandon Van Dyk, Steve Pastoor, Micheal Petlicke,
Michael Molling, Justin Slocum, Jeremy Schut, Noel Hayden, Christian Swenson,
Aaron Keen, Samuel Zigterman, Kobby Appiah-Berko, Jackson Tong, William Van-
den Bos, Alissa Jones, Geoffry VanLeeuwen, Tim Slager, Daniel Stahl, Kristen
Vriesema, Rebecca Sheler, and Andrew Meneely.
I also want to thank various colleagues who read or class-tested some or all of
this book while it was in progress. They are
Ming-Wen An, Vassar College
Alan Arnholdt, Appalachian State University
Stacey Hancock, Clark University
Jo Hardin, Pomona College
Nicholas Horton, Smith College
Laura Kapitula, Calvin College
Daniel Kaplan, Macalester College
John Kern, Duquesne University
Kimberly Muller, Lake Superior State University
Ken Russell, University of Wollongong, Australia
Greg Snow, Intermountain Healthcare
Nathan Tintle, Hope College

Interesting data make for interesting statistics, so I want to thank colleagues


and students who helped me locate data for use in the examples and exercises in
this book, especially those of you who made original data available. In the latter
cases, specific attributions are in the documentation for the data sets in the fastR
package.
Thanks also go to those at the American Mathematical Society who were in-
volved in the production of this book: Edward Dunne, the acquisitions editor with
whom I developed the book from concept to manuscript; Arlene O’Sean, produc-
tion editor; Cristin Zannella, editorial assistant; and Barbara Beeton, who provided
TEXnical support. Without their assistance and support the final product would
not have been as satisfying.
Alas, despite the efforts of so many, this book is still not perfect. No books
are perfect, but some books are useful. My hope is that this book is both useful
and enjoyable. A list of those (I hope few) errors that have escaped detection until
after the printing of this book will be maintained at
http://www.ams.org/bookpages/amstext-13
My thanks in advance to those who bring these to my attention.
What Is Statistics?

Some Definitions of Statistics

This is a course primarily about statistics, but what exactly is statistics? In other
words, what is this course about?[1] Here are some definitions of statistics from other
people:
• a collection of procedures and principles for gaining information in order to
make decisions when faced with uncertainty (J. Utts [Utt05]),
• a way of taming uncertainty, of turning raw data into arguments that can
resolve profound questions (T. Amabile [fMA89]),
• the science of drawing conclusions from data with the aid of the mathematics
of probability (S. Garfunkel [fMA86]),
• the explanation of variation in the context of what remains unexplained (D.
Kaplan [Kap09]),
• the mathematics of the collection, organization, and interpretation of numer-
ical data, especially the analysis of a population’s characteristics by inference
from sampling (American Heritage Dictionary [AmH82]).
While not exactly the same, these definitions highlight four key elements of statis-
tics.

Data – the raw material

Data are the raw material for doing statistics. We will learn more about different
types of data, how to collect data, and how to summarize data as we go along. This
will be the primary focus of Chapter 1.

[1] As we will see, the words statistic and statistics get used in more than one way.
More on that later.


Information – the goal

The goal of doing statistics is to gain some information or to make a decision.


Statistics is useful because it helps us answer questions like the following:
• Which of two treatment plans leads to the best clinical outcomes?
• How strong is an I-beam constructed according to a particular design?
• Is my cereal company complying with regulations about the amount of cereal
in its cereal boxes?
In this sense, statistics is a science – a method for obtaining new knowledge.

Uncertainty – the context

The tricky thing about statistics is the uncertainty involved. If we measure one box
of cereal, how do we know that all the others are similarly filled? If every box of
cereal were identical and every measurement perfectly exact, then one measurement
would suffice. But the boxes may differ from one another, and even if we measure
the same box multiple times, we may get different answers to the question How
much cereal is in the box?
So we need to answer questions like How many boxes should we measure? and
How many times should we measure each box? Even so, there is no answer to these
questions that will give us absolute certainty. So we need to answer questions like
How sure do we need to be?

Probability – the tool

In order to answer a question like How sure do we need to be?, we need some way of
measuring our level of certainty. This is where mathematics enters into statistics.
Probability is the area of mathematics that deals with reasoning about uncertainty.
So before we can answer the statistical questions we just listed, we must first develop
some skill in probability. Chapter 2 provides the foundation that we need.
Once we have developed the necessary tools to deal with uncertainty, we will
be able to give good answers to our statistical questions. But before we do that,
let’s take a bird’s eye view of the processes involved in a statistical study. We’ll
come back and fill in the details later.

A First Example: The Lady Tasting Tea


There is a famous story about a lady who claimed that tea with milk tasted different
depending on whether the milk was added to the tea or the tea added to the milk.
The story is famous because of the setting in which she made this claim. She was
attending a party in Cambridge, England, in the 1920s. Also in attendance were a
number of university dons and their wives. The scientists in attendance scoffed at
the woman and her claim. What, after all, could be the difference?
All the scientists but one, that is. Rather than simply dismiss the woman’s
claim, he proposed that they decide how one should test the claim. The tenor of

the conversation changed at this suggestion, and the scientists began to discuss how
the claim should be tested. Within a few minutes cups of tea with milk had been
prepared and presented to the woman for tasting.
Let’s take this simple example as a prototype for a statistical study. What
steps are involved?

(1) Determine the question of interest.


Just what is it we want to know? It may take some effort to make a
vague idea precise. The precise questions may not exactly correspond to our
vague questions, and the very exercise of stating the question precisely may
modify our question. Sometimes we cannot come up with any way to answer
the question we really want to answer, so we have to live with some other
question that is not exactly what we wanted but is something we can study
and will (we hope) give us some information about our original question.
In our example this question seems fairly easy to state: Can the lady tell
the difference between the two tea preparations? But we need to refine this
question. For example, are we asking if she always correctly identifies cups of
tea or merely if she does better than we could do ourselves (by guessing)?
(2) Determine the population.
Just who or what do we want to know about? Are we only interested in
this one woman or women in general or only women who claim to be able to
distinguish tea preparations?
(3) Select measurements.
We are going to need some data. We get our data by making some mea-
surements. These might be physical measurements with some device (like a
ruler or a scale). But there are other sorts of measurements too, like the an-
swer to a question on a form. Sometimes it is tricky to figure out just what to
measure. (How do we measure happiness or intelligence, for example?) Just
how we do our measuring will have important consequences for the subsequent
statistical analysis.
In our example, a measurement may consist of recording for a given cup
of tea whether the woman’s claim is correct or incorrect.
(4) Determine the sample.
Usually we cannot measure every individual in our population; we have to
select some to measure. But how many and which ones? These are important
questions that must be answered. Generally speaking, bigger is better, but
it is also more expensive. Moreover, no size is large enough if the sample is
selected inappropriately.
Suppose we gave the lady one cup of tea. If she correctly identifies the
mixing procedure, will we be convinced of her claim? She might just be
guessing; so we should probably have her taste more than one cup. Will we
be convinced if she correctly identifies 5 cups? 10 cups? 50 cups?
What if she makes a mistake? If we present her with 10 cups and she
correctly identifies 9 of the 10, what will we conclude? A success rate of 90%
is, it seems, much better than just guessing, and anyone can make a mistake
now and then. But what if she correctly identifies 8 out of 10? 80 out of 100?
(A short sketch after this list shows how such guessing probabilities can be
computed.)

And how should we prepare the cups? Should we make 5 each way? Does
it matter if we tell the woman that there are 5 prepared each way? Should we
flip a coin to decide even if that means we might end up with 3 prepared one
way and 7 the other way? Do any of these differences matter?
(5) Make and record the measurements.
Once we have the design figured out, we have to do the legwork of data
collection. This can be a time-consuming and tedious process. In the case
of the lady tasting tea, the scientists decided to present her with ten cups
of tea which were quickly prepared. A study of public opinion may require
many thousands of phone calls or personal interviews. In a laboratory setting,
each measurement might be the result of a carefully performed laboratory
experiment.
(6) Organize the data.
Once the data have been collected, it is often necessary or useful to orga-
nize them. Data are typically stored in spreadsheets or in other formats that
are convenient for processing with statistical packages. Very large data sets
are often stored in databases.
Part of the organization of the data may involve producing graphical and
numerical summaries of the data. We will discuss some of the most important
of these kinds of summaries in Chapter 1. These summaries may give us initial
insights into our questions or help us detect errors that may have occurred to
this point.
(7) Draw conclusions from data.
Once the data have been collected, organized, and analyzed, we need to
reach a conclusion. Do we believe the woman’s claim? Or do we think she is
merely guessing? How sure are we that this conclusion is correct?
Eventually we will learn a number of important and frequently used meth-
ods for drawing inferences from data. More importantly, we will learn the basic
framework used for such procedures so that it should become easier and easier
to learn new procedures as we become familiar with the framework.
(8) Produce a report.
Typically the results of a statistical study are reported in some manner.
This may be as a refereed article in an academic journal, as an internal re-
port to a company, or as a solution to a problem on a homework assignment.
These reports may themselves be further distilled into press releases, newspa-
per articles, advertisements, and the like. The mark of a good report is that
it provides the essential information about each of the steps of the study.
As we go along, we will learn some of the standard terminology and pro-
cedures that you are likely to see in basic statistical reports and will gain a
framework for learning more.
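How unlikely is a given number of correct identifications if the lady is merely
guessing? That is precisely the kind of question the probability tools of Chapter 2
are designed to answer. As a preview, here is a minimal R sketch (added for
illustration; it is not one of the book's snippets) assuming a guesser gets each of 10
independently prepared cups right with probability 1/2:

> dbinom(9, size=10, prob=0.5)          # exactly 9 of 10 correct by guessing
[1] 0.009765625
> sum(dbinom(9:10, size=10, prob=0.5))  # 9 or more of 10 correct by guessing
[1] 0.01074219

A guesser matches 9 or more cups out of 10 only about 1% of the time, so such a
performance would be hard to attribute to luck; 80 or more correct out of 100 by
guessing alone is rarer still.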
At this point, you may be wondering who the innovative scientist was and
what the results of the experiment were. The scientist was R. A. Fisher, who first
described this situation as a pedagogical example in his 1925 book on statistical
methodology [Fis25]. We’ll return to this example in Sections 2.4.1 and 2.7.3.
Chapter 1

Summarizing Data

It is a capital mistake to theorize before one has data. Insensibly one
begins to twist facts to suit theories, instead of theories to suit facts.
Sherlock Holmes [Doy27]

Graphs are essential to good statistical analysis.
F. J. Anscombe [Ans73]

Data are the raw material of statistics.


We will organize data into a 2-dimensional schema, which we can think of as
rows and columns in a spreadsheet. The rows correspond to the individuals (also
called cases, subjects, or units depending on the context of the study). The
columns correspond to variables. In statistics, a variable is one of the measure-
ments made for each individual. Each individual has a value for each variable. Or
at least that is our intent. Very often some of the data are missing, meaning that
values of some variables are not available for some of the individuals.
How data are collected is critically important, and good statistical analysis
requires that the data were collected in an appropriate manner. We will return to
the issue of how data are (or should be) collected later. In this chapter we will
focus on the data themselves. We will use R to manipulate data and to produce
some of the most important numerical and graphical summaries of data. A more
complete introduction to R can be found in Appendix A.

1.1. Data in R

Most data sets in R are stored in a structure called a data frame that reflects
the 2-dimensional structure described above. A number of data sets are included
with the basic installation of R. The iris data set, for example, is a famous data


set containing a number of physical measurements of three varieties of iris. These


data were published by Edgar Anderson in 1935 [And35] but are famous because
R. A. Fisher [Fis36] gave a statistical analysis of these data that appeared a year
later.
The str() function provides our first overview of the data set.
iris-str
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length:num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width :num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length:num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width :num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species :Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
From this output we learn that our data set has 150 observations (rows) and 5
variables (columns). Also displayed is some information about the type of data
stored in each variable and a few sample values.
While we could print the entire data frame to the screen, this is inconvenient
for large data sets. We can look at the first few or last few rows of the data set
using head() and tail(). This is enough to give us a feel for how the data look.
iris-head
> head(iris,n=3) # first three rows
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
iris-tail
> tail(iris,n=3) # last three rows
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
We can access any subset we want by directly specifying which rows and columns
are of interest to us.
iris-subset
> iris[c(1:3,148:150),3:5] # first and last rows, only 3 columns
Petal.Length Petal.Width Species
1 1.4 0.2 setosa
2 1.4 0.2 setosa
3 1.3 0.2 setosa
148 5.2 2.0 virginica
149 5.4 2.3 virginica
150 5.1 1.8 virginica
It is also possible to look at just one variable using the $ operator.
iris-vector
> iris$Sepal.Length # get one variable and print as vector
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7
[17] 5.4 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4
< 5 lines removed >
[113] 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1

Box 1.1. Using the snippet() function

If you have installed the fastR package (and any other additional packages that
may be needed for a particular example), you can execute the code from this
book on your own computer using snippet(). For example,
snippet('iris-str')
will both display and execute the first code block on page 2, and
snippet('iris-str', exec=FALSE)
will display the code without executing it. Keep in mind that some code blocks
assume that prior blocks have already been executed and will not work as
expected if this is not true.

[129] 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9

iris-vector2
> iris$Species # get one variable and print as vector
[1] setosa setosa setosa setosa setosa
[6] setosa setosa setosa setosa setosa
[11] setosa setosa setosa setosa setosa
< 19 lines removed >
[111] virginica virginica virginica virginica virginica
[116] virginica virginica virginica virginica virginica
[121] virginica virginica virginica virginica virginica
[126] virginica virginica virginica virginica virginica
[131] virginica virginica virginica virginica virginica
[136] virginica virginica virginica virginica virginica
[141] virginica virginica virginica virginica virginica
[146] virginica virginica virginica virginica virginica
Levels: setosa versicolor virginica

This is not a particularly good way to get a feel for data. There are a number of
graphical and numerical summaries of a variable or set of variables that are usually
preferred to merely listing all the values – especially if the data set is large. That
is the topic of our next section.
It is important to note that the name iris is not reserved in R for this data
set. There is nothing to prevent you from storing something else with that name.
If you do, you will no longer have access to the iris data set unless you first reload
it, at which point the previous contents of iris are lost.
iris-reload
> iris <- 'An iris is a beautiful flower.'
> str(iris)
chr "An iris is a beautiful flower."
> data(iris) # explicitly reload the data set
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length:num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width :num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length:num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width :num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species :Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

The fastR package includes data sets and other utilities to accompany this
text. Instructions for installing fastR appear in the preface. We will use data sets
from a number of other R packages as well. These include the CRAN packages alr3,
car, DAAG, Devore6, faraway, Hmisc, MASS, and multcomp. Appendix A includes
instructions for reading data from various file formats, for entering data manu-
ally, for obtaining documentation on R functions and data sets, and for installing
packages from CRAN.

1.2. Graphical and Numerical Summaries of Univariate Data

Now that we can get our hands on some data, we would like to develop some tools
to help us understand the distribution of a variable in a data set. By distribution
we mean answers to two questions:
• What values does the variable take on?
• With what frequency?
Simply listing all the values of a variable is not an effective way to describe a
distribution unless the data set is quite small. For larger data sets, we require some
better methods of summarizing a distribution.

1.2.1. Tabulating Data

The types of summaries used for a variable depend on the kind of variable we
are interested in. Some variables, like iris$Species, are used to put individuals
into categories. Such variables are called categorical (or qualitative) variables to
distinguish them from quantitative variables which have numerical values on some
numerically meaningful scale. iris$Sepal.Length is an example of a quantitative
variable.
Usually the categories are either given descriptive names (our preference) or
numbered consecutively. In R, a categorical variable is usually stored as a factor.
The possible categories of an R factor are called levels, and you can see in the
output above that R not only lists out all of the values of iris$Species but also
provides a list of all the possible levels for this variable. A more useful summary of
a categorical variable can be obtained using the table() function.
iris-table
> table(iris$Species) # make a table of values

    setosa versicolor  virginica
        50         50         50

From this we can see that there were 50 of each of three species of iris.

Tables can be used for quantitative data as well, but often this does not work
as well as it does for categorical data because there are too many categories.
iris-table2
> table(iris$Sepal.Length) # make a table of values

4.3 4.4 4.5 4.6 4.7 4.8 4.9 5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9
1 3 1 4 2 5 6 10 9 4 1 6 7 6 8 7 3
6 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7 7.1 7.2 7.3 7.4 7.6 7.7
6 6 4 9 7 5 2 8 3 4 1 1 3 1 1 1 4
7.9
1

Sometimes we may prefer to divide our quantitative data into two groups based on
a threshold or some other boolean test.
iris-logical
> table(iris$Sepal.Length > 6.0)

FALSE  TRUE
   89    61

The cut() function provides a more flexible way to build a table from quantitative
data.
iris-cut
> table(cut(iris$Sepal.Length,breaks=2:10))

 (2,3]  (3,4]  (4,5]  (5,6]  (6,7]  (7,8]  (8,9] (9,10]
     0      0     32     57     49     12      0      0

The cut() function partitions the data into sections, in this case with break points
at each integer from 2 to 10. (The breaks argument can be used to set the break
points wherever one likes.) The result is a categorical variable with levels describing
the interval in which each original quantitative value falls. If we prefer to have the
intervals closed on the other end, we can achieve this using right=FALSE.
iris-cut2
> table(cut(iris$Sepal.Length,breaks=2:10,right=FALSE))

 [2,3)  [3,4)  [4,5)  [5,6)  [6,7)  [7,8)  [8,9) [9,10)
     0      0     22     61     54     13      0      0

Notice too that it is possible to define factors in R that have levels that do not
occur. This is why the 0’s are listed in the output of table(). See ?factor for
details.
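For instance, here is a small sketch (added for illustration) of a factor with a
declared level that never occurs in the data; table() still reports the empty level:

> f <- factor(c("a", "a", "b"), levels=c("a", "b", "c"))
> table(f)
f
a b c
2 1 0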
A tabular view of data like the example above can be converted into a vi-
sual representation called a histogram. There are two R functions that can be
used to build a histogram: hist() and histogram(). hist() is part of core R.
histogram() can only be used after first loading the lattice graphics package,
which now comes standard with all distributions of R. Default versions of each are
depicted in Figure 1.1. A number of arguments can be used to modify the resulting
plot, set labels, choose break points, and the like.
Looking at the plots generated by histogram() and hist(), we see that they
use different scales for the vertical axis. The default for histogram() is to use
percentages (of the entire data set). By contrast, hist() uses counts.

[Figure 1.1. Comparing two histogram functions: histogram() on the left (vertical
axis: Percent of Total) and hist() on the right (vertical axis: Frequency).]

The shapes of the two histograms differ because they use slightly different algorithms for choosing
the default break points. The user can, of course, override the default break points
(using the breaks argument). There is a third scale, called the density scale, that is
often used for the vertical axis. This scale is designed so that the area of each bar is
equal to the proportion of data it represents. This is especially useful for histograms
that have bins (as the intervals between break points are typically called in the
context of histograms) of different widths. Figure 1.2 shows an example of such a
histogram generated using the following code:
iris-histo-density
> histogram(~Sepal.Length,data=iris,type="density",
+ breaks=c(4,5,5.5,6,6.5,7,8,10))
We will generally use the newer histogram() function because it has several
nice features. One of these is the ability to split up a plot into subplots called
panels. For example, we could build a separate panel for each species in the iris
data set. Figure 1.2 suggests that part of the variation in sepal length is associated
with the differences in species. Setosa are generally shorter, virginica longer, and
versicolor intermediate. The right-hand plot in Figure 1.2 was created using
iris-condition
> histogram(~Sepal.Length|Species,data=iris)
If we only want to see the data from one species, we can select a subset of the data
using the subset argument.
iris-histo-subset
> histogram(~Sepal.Length|Species,data=iris,
+ subset=Species=="virginica")

[Figure 1.2. Left: A density histogram of sepal length using unequal bin widths
(vertical axis: Density). Right: A histogram of sepal length by species, with one
panel each for setosa, versicolor, and virginica.]

[Figure 1.3. This histogram is the result of selecting a subset of the data using
the subset argument; it contains a single panel, labeled virginica.]

By keeping Species in the conditioning part of the formula, our plot will continue
to have a strip at the top identifying the species even though there will be only one
panel in our plot (Figure 1.3).
The lattice graphing functions all use a similar formula interface. The generic
form of a formula is
y ~ x | z

which can often be interpreted as “y modeled by x conditioned on z”. For plotting,


y will typically indicate a variable presented on the vertical axis, and x a variable
to be plotted along the horizontal axis. In the case of a histogram, the values for
the vertical axis are computed from the x variable, so y is omitted. The condition
z is a variable that is used to break the data into sections which are plotted in
separate panels. When z is categorical, there is one panel for each level of z. When
z is quantitative, the data is divided into a number of sections based on the values
of z. This works much like the cut() function, but some data may appear in
more than one panel. In R terminology, each panel represents a shingle of the data.
The term shingle is supposed to evoke an image of overlapping coverage like the
shingles on a roof. Finer control over the number of panels can be obtained by using
equal.count() or co.intervals() to make the shingles directly. See Figure 1.4.
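For example, the right-hand plot in Figure 1.4 can be reproduced with a call along
these lines (a sketch; equal.count() is provided by the lattice package):

> histogram(~Sepal.Length | equal.count(Sepal.Width, number=4),
+           data=iris)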

1.2.2. Shapes of Distributions


A histogram gives a shape to a distribution, and distributions are often described
in terms of these shapes. The exact shape depicted by a histogram will depend not
only on the data but on various other choices, such as how many bins are used,
whether the bins are equally spaced across the range of the variable, and just where
the divisions between bins are located. But reasonable choices of these arguments
will usually lead to histograms of similar shape, and we use these shapes to describe
the underlying distribution as well as the histogram that represents it.
Some distributions are approximately symmetrical with the distribution of
the larger values looking like a mirror image of the distribution of the smaller values.
We will call a distribution positively skewed if the portion of the distribution
with larger values (the right of the histogram) is more spread out than the other
side. Similarly, a distribution is negatively skewed if the distribution deviates
from symmetry in the opposite manner. Later we will learn a way to measure
the degree and direction of skewness with a number; for now it is sufficient to
describe distributions qualitatively as symmetric or skewed. See Figure 1.5 for
some examples of symmetric and skewed distributions.

[Figure 1.4. The output of histogram(~Sepal.Length|Sepal.Width,iris), with one
panel for each shingle of Sepal.Width, and of
histogram(~Sepal.Length|equal.count(Sepal.Width,number=4),iris), with four
overlapping shingles made by equal.count().]

[Figure 1.5. Left: Skewed and symmetric distributions (panels: neg. skewed, pos.
skewed, symmetric). Right: Old Faithful eruption times illustrate a bimodal
distribution.]
Notice that each of these distributions is clustered around a center where most
of the values are located. We say that such distributions are unimodal. Shortly we

will discuss ways to summarize the location of the “center” of unimodal distributions
numerically. But first we point out that some distributions have other shapes that
are not characterized by a strong central tendency. One famous example is eruption
times of the Old Faithful geyser in Yellowstone National Park.
faithful-histogram
> histogram(~eruptions, faithful, n=20)
produces the histogram in Figure 1.5 which shows a good example of a bimodal
distribution. There appear to be two groups or kinds of eruptions, some lasting
about 2 minutes and others lasting between 4 and 5 minutes.
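Shapes like those in Figure 1.5 are easy to explore with simulated data. In the
following sketch the particular distributions, rexp() and rnorm(), are illustrative
choices, not the code used to produce the figure:

> histogram(~rexp(1000, rate=0.5))    # positively skewed sample
> histogram(~rnorm(1000, mean=10))    # approximately symmetric sample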

1.2.3. Measures of Central Tendency


Qualitative descriptions of the shape of a distribution are important and useful.
But we will often desire the precision of numerical summaries as well. Two aspects
of unimodal distributions that we will often want to measure are central tendency
(what is a typical value? where do the values cluster?) and the amount of variation
(are the data tightly clustered around a central value, or more spread out?).
Two widely used measures of center are the mean and the median. You are
probably already familiar with both. The mean is calculated by adding all the
values of a variable and dividing by the number of values. Our usual notation will
be to denote the n values as $x_1, x_2, \ldots, x_n$ and the mean of these values as
$\bar{x}$. Then the formula for the mean becomes
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}.$$
The median is a value that splits the data in half – half of the values are smaller
than the median and half are larger. By this definition, there could be more than
one median (when there are an even number of values). This ambiguity is removed
by taking the mean of the “two middle numbers” (after sorting the data). See the
exercises for some problems that explore aspects of the mean and median that may
be less familiar.
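A quick check of the rule for an even number of values (added for illustration):

> median(c(1, 2, 3, 4))   # mean of the two middle numbers, 2 and 3
[1] 2.5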
The mean and median are easily computed in R. For example,
iris-mean-median
> mean(iris$Sepal.Length); median(iris$Sepal.Length)
[1] 5.8433
[1] 5.8
Of course, we have already seen (by looking at histograms), that there are
some differences in sepal length between the various species, so it would be better
to compute the mean and median separately for each species. While one can use
the built-in aggregate() function, we prefer to use the summary() function from
the Hmisc package. This function uses the same kind of formula notation that the
lattice graphics functions use.
iris-Hmisc-summary
> require(Hmisc) # load Hmisc package
> summary(Sepal.Length~Species,iris) # default function is mean
Sepal.Length N=150

Box 1.2. R packages used in this text


From now on we will assume that the lattice, Hmisc, and fastR packages have
been loaded and will not show the loading of these packages in our examples.
If you try an example in this book and R reports that it cannot find a function
or data set, it is likely that you have failed to load one of these packages. You
can set up R to automatically load these packages every time you launch R if
you like. (See Appendix A for details.)
Other packages will be used from time to time as well. In this case, we will
show the require() statement explicitly. The documentation for the fastR
package includes a list of required and recommended packages.

+-------+----------+---+------------+
| | |N |Sepal.Length|
+-------+----------+---+------------+
|Species|setosa | 50|5.0060 |
| |versicolor| 50|5.9360 |
| |virginica | 50|6.5880 |
+-------+----------+---+------------+
|Overall| |150|5.8433 |
+-------+----------+---+------------+
> summary(Sepal.Length~Species,iris,fun=median) # median instead
Sepal.Length N=150

+-------+----------+---+------------+
| | |N |Sepal.Length|
+-------+----------+---+------------+
|Species|setosa | 50|5.0 |
| |versicolor| 50|5.9 |
| |virginica | 50|6.5 |
+-------+----------+---+------------+
|Overall| |150|5.8 |
+-------+----------+---+------------+
Comparing with the histograms in Figure 1.2, we see that these numbers are indeed
good descriptions of the center of the distribution for each species.
We can also compute the mean and median of the Old Faithful eruption times.
faithful-mean-median
> mean(faithful$eruptions)
[1] 3.4878
> median(faithful$eruptions)
[1] 4
Notice, however, that in the Old Faithful eruption times histogram (Figure 1.5)
there are very few eruptions that last between 3.5 and 4 minutes. So although
these numbers are the mean and median, neither is a very good description of the
typical eruption time(s) of Old Faithful. It will often be the case that the mean
and median are not very good descriptions of a data set that is not unimodal.

faithful-stem
> stem(faithful$eruptions)

The decimal point is 1 digit(s) to the left of the |

16 | 070355555588
18 | 000022233333335577777777888822335777888
20 | 00002223378800035778
22 | 0002335578023578
24 | 00228
26 | 23
28 | 080
30 | 7
32 | 2337
34 | 250077
36 | 0000823577
38 | 2333335582225577
40 | 0000003357788888002233555577778
42 | 03335555778800233333555577778
44 | 02222335557780000000023333357778888
46 | 0000233357700000023578
48 | 00000022335800333
50 | 0370

Figure 1.6. Stemplot of Old Faithful eruption times using stem().

In the case of our Old Faithful data, there seem to be two predominant peaks,
but unlike in the case of the iris data, we do not have another variable in our
data that lets us partition the eruption times into two corresponding groups. This
observation could, however, lead to some hypotheses about Old Faithful eruption
times. Perhaps eruption times at night are different from those during the day.
Perhaps there are other differences in the eruptions. Subsequent data collection
(and statistical analysis of the resulting data) might help us determine whether our
hypotheses appear correct.
One disadvantage of a histogram is that the actual data values are lost. For a
large data set, this is probably unavoidable. But for more modestly sized data sets,
a stemplot can reveal the shape of a distribution without losing the actual (perhaps
rounded) data values. A stemplot divides each value into a stem and a leaf at some
place value. The leaf is rounded so that it requires only a single digit. The values
are then recorded as in Figure 1.6.
From this output we can readily see that the shortest recorded eruption time
was 1.60 minutes. The second 0 in the first row represents 1.70 minutes. Note that
the output of stem() can be ambiguous when there are not enough data values in
a row.

Comparing mean and median

Why bother with two different measures of central tendency? The short answer is
that they measure different things. If a distribution is (approximately) symmetric,
the mean and median will be (approximately) the same (see Exercise 1.5). If the

distribution is not symmetric, however, the mean and median may be very different,
and one measure may provide a more useful summary than the other.
For example, if we begin with a symmetric distribution and add in one addi-
tional value that is very much larger than the other values (an outlier), then the
median will not change very much (if at all), but the mean will increase substan-
tially. We say that the median is resistant to outliers while the mean is not. A
similar thing happens with a skewed, unimodal distribution. If a distribution is
positively skewed, the large values in the tail of the distribution increase the mean
(as compared to a symmetric distribution) but not the median, so the mean will
be larger than the median. Similarly, the mean of a negatively skewed distribution
will be smaller than the median.
Whether a resistant measure is desirable or not depends on context. If we are
looking at the income of employees of a local business, the median may give us a
much better indication of what a typical worker earns, since there may be a few
large salaries (the business owner’s, for example) that inflate the mean. This is also
why the government reports median household income and median housing costs.
On the other hand, if we compare the median and mean of the value of raffle
prizes, the mean is probably more interesting. The median is probably 0, since
typically the majority of raffle tickets do not win anything. This is independent of
the values of any of the prizes. The mean will tell us something about the overall
value of the prizes involved. In particular, we might want to compare the mean
prize value with the cost of the raffle ticket when we decide whether or not to
purchase one.

The trimmed mean compromise

There is another measure of central tendency that is less well known and represents
a kind of compromise between the mean and the median. In particular, it is more
sensitive to the extreme values of a distribution than the median is, but less sensitive
than the mean. The idea of a trimmed mean is very simple. Before calculating the
mean, we remove the largest and smallest values from the data. The percentage of
the data removed from each end is called the trimming percentage. A 0% trimmed
mean is just the mean; a 50% trimmed mean is the median; a 10% trimmed mean
is the mean of the middle 80% of the data (after removing the largest and smallest
10%). A trimmed mean is calculated in R by setting the trim argument of mean(),
e.g., mean(x,trim=0.10). Although a trimmed mean in some sense combines the
advantages of both the mean and median, it is less common than either the mean or
the median. This is partly due to the mathematical theory that has been developed
for working with the median and especially the mean of sample data.
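For example, with one artificial outlier:

> x <- c(1,2,3,4,5,6,7,8,9,100)
> mean(x)
[1] 14.5
> mean(x,trim=0.10)   # drops the 1 and the 100 before averaging
[1] 5.5
> median(x)
[1] 5.5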

1.2.4. Measures of Dispersion


It is often useful to characterize a distribution in terms of its center, but that is not
the whole story. Consider the distributions depicted in the histograms in Figure 1.7.
In each case the mean and median are approximately 10, but the distributions
clearly have very different shapes. The difference is that distribution B is much

more “spread out”. “Almost all” of the data in distribution A is quite close to 10; a
much larger proportion of distribution B is “far away” from 10. The intuitive (and
not very precise) statement in the preceding sentence can be quantified by means
of quantiles. The idea of quantiles is probably familiar to you since percentiles
are a special case of quantiles.
Definition 1.2.1 (Quantile). Let p ∈ [0, 1]. A p-quantile of a quantitative dis-
tribution is a number q such that the (approximate) proportion of the distribution
that is less than q is p.

So, for example, the 0.2-quantile divides a distribution into 20% below and 80%
above. This is the same as the 20th percentile. The median is the 0.5-quantile (and
the 50th percentile).
The idea of a quantile is quite straightforward. In practice there are a few
wrinkles to be ironed out. Suppose your data set has 15 values. What is the 0.30-
quantile? Exactly 30% of the data would be (0.30)(15) = 4.5 values. Of course,
there is no number that has 4.5 values below it and 11.5 values above it. This is
the reason for the parenthetical word approximate in Definition 1.2.1. Different
schemes have been proposed for giving quantiles a precise value, and R implements
several such methods. They are similar in many ways to the decision we had to
make when computing the median of a variable with an even number of values.
Two important methods can be described by imagining that the sorted data
have been placed along a ruler, one value at every unit mark and also at each end.
To find the p-quantile, we simply snap the ruler so that proportion p is to the left
and 1−p to the right. If the break point happens to fall precisely where a data value
is located (i.e., at one of the unit marks of our ruler), that value is the p-quantile. If
the break point is between two data values, then the p-quantile is a weighted mean
of those two values.
Example 1.2.1. Suppose we have 10 data values: 1, 4, 9, 16, 25, 36, 49, 64, 81, 100.
The 0-quantile is 1, the 1-quantile is 100, the 0.5-quantile (median) is midway
between 25 and 36, that is, 30.5. Since our ruler is 9 units long, the 0.25-quantile
is located 9/4 = 2.25 units from the left edge. That would be one quarter of the
way from 9 to 16, which is 9 + 0.25(16 − 9) = 9 + 1.75 = 10.75. (See Figure 1.8.)
Other quantiles are found similarly. This is precisely the default method used by
quantile().

[Figure: two histograms, panels A and B; x-axis from −10 to 30, y-axis Density from 0.00 to 0.20]
Figure 1.7. Histograms showing smaller (A) and larger (B) amounts of variation.

[Figure: the ten sorted values 1, 4, 9, 16, 25, 36, 49, 64, 81, 100 laid out along a ruler under each of the two schemes, with arrows marking the quantile positions]
Figure 1.8. Illustrations of two methods for determining quantiles from data.
Arrows indicate the locations of the 0.25-, 0.5-, and 0.75-quantiles.

intro-quantile
> quantile((1:10)^2)
0% 25% 50% 75% 100%
1.00 10.75 30.50 60.25 100.00
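We can mimic the ruler computation by hand. The following sketch reproduces the default 0.25-quantile above:

> x <- sort((1:10)^2)
> pos <- 0.25 * (length(x) - 1)              # 2.25 units from the left edge
> lo <- floor(pos)
> x[lo+1] + (pos - lo) * (x[lo+2] - x[lo+1]) # interpolate between neighbors
[1] 10.75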

A second scheme is just like the first one except that the data values are placed
midway between the unit marks. In particular, this means that the 0-quantile
is not the smallest value. This could be useful, for example, if we imagined we
were trying to estimate the lowest value in a population from which we only had
a sample. Probably the lowest value overall is less than the lowest value in our
particular sample. The only remaining question is how to extrapolate in the last
half unit on either side of the ruler. If we set quantiles in that range to be the
minimum or maximum, the result is another type of quantile().
Example 1.2.2. The method just described is what type=5 does.
intro-quantile05a
> quantile((1:10)^2,type=5)
0% 25% 50% 75% 100%
1.0 9.0 30.5 64.0 100.0
Notice that the quantiles at or below the 0.05-quantile are all equal to the minimum value.
intro-quantile05b
> quantile((1:10)^2,type=5,seq(0,0.10,by=0.005))
0% 0.5% 1% 1.5% 2% 2.5% 3% 3.5% 4% 4.5% 5% 5.5% 6% 6.5%
1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.15 1.30 1.45
7% 7.5% 8% 8.5% 9% 9.5% 10%
1.60 1.75 1.90 2.05 2.20 2.35 2.50
A similar thing happens with the maximum value for the larger quantiles.

Other methods refine this idea in other ways, usually based on some assump-
tions about what the population of interest is like.
Fortunately, for large data sets, the differences between the different quantile
methods are usually unimportant, so we will just let R compute quantiles for us
using the quantile() function. For example, here are the deciles and quartiles
of the Old Faithful eruption times.
faithful-quantile
> quantile(faithful$eruptions,(0:10)/10)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90%
1.6000 1.8517 2.0034 2.3051 3.6000 4.0000 4.1670 4.3667 4.5330 4.7000
100%
5.1000
> quantile(faithful$eruptions,(0:4)/4)
    0%    25%    50%    75%   100%
1.6000 2.1627 4.0000 4.4543 5.1000

[Figure: boxplots of iris Sepal.Length by Species, in both orientations, and of the Old Faithful eruption times]
Figure 1.9. Boxplots for iris sepal length and Old Faithful eruption times.

The latter of these provides what is commonly called the five-number summary.
The 0-quantile and 1-quantile (at least in the default scheme) are the minimum
and maximum of the data set. The 0.5-quantile gives the median, and the 0.25-
and 0.75-quantiles (also called the first and third quartiles) isolate the middle 50%
of the data. When these numbers are close together, then most (well, half, to be
more precise) of the values are near the median. If those numbers are farther apart,
then much (again, half) of the data is far from the center. The difference between
the first and third quartiles is called the interquartile range and is abbreviated
IQR. This is our first numerical measure of dispersion.
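In R, the interquartile range can be computed with IQR() or directly from quantile(); for example, using the quartiles computed above:

> IQR(faithful$eruptions)                            # about 2.29
> diff(quantile(faithful$eruptions, c(0.25, 0.75)))  # the same value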
The five-number summary can also be presented graphically using a boxplot
(also called box-and-whisker plot) as in Figure 1.9. These plots were generated
using
iris-bwplot
> bwplot(Sepal.Length~Species,data=iris)
> bwplot(Species~Sepal.Length,data=iris)
> bwplot(~eruptions,faithful)

The size of the box reflects the IQR. If the box is small, then the middle 50% of
the data are near the median, which is indicated by a dot in these plots. (Some
boxplots, including those made by boxplot(), use a vertical line to indicate
the median.) Outliers (values that seem unusually large or small) can be indicated
by a special symbol. The whiskers are then drawn from the box to the largest
and smallest non-outliers. One common rule for automating outlier detection for
boxplots is the 1.5 IQR rule. Under this rule, any value that is more than 1.5 IQR
away from the box is marked as an outlier. Indicating outliers in this way is useful
since it allows us to see if the whisker is long only because of one extreme value.
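The fences used by this rule are easy to compute by hand; a small sketch for the eruption times:

> q <- quantile(faithful$eruptions, c(0.25, 0.75))
> iqr <- q[[2]] - q[[1]]
> c(lower = q[[1]] - 1.5*iqr, upper = q[[2]] + 1.5*iqr)  # values outside would be flagged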

Variance and standard deviation

Another important way to measure the dispersion of a distribution is by comparing
each value with the mean of the distribution. If the distribution is spread out, these
differences will tend to be large; otherwise these differences will be small. To get a
single number, we could simply add up all of the deviations from the mean:

total deviation from the mean = Σ(x − x̄).

The trouble with this is that the total deviation from the mean is always 0 because
the negative deviations and the positive deviations always exactly cancel out. (See
Exercise 1.10).
To fix this problem, we might consider taking the absolute value of the devia-
tions from the mean:

total absolute deviation from the mean = Σ|x − x̄|.
This number will only be 0 if all of the data values are equal to the mean. Even
better would be to divide by the number of data values:
mean absolute deviation = (1/n) Σ|x − x̄|.
Otherwise large data sets will have large sums even if the values are all close to the
mean. The mean absolute deviation is a reasonable measure of the dispersion in
a distribution, but we will not use it very often. There is another measure that is
much more common, namely the variance, which is defined by
variance = Var(x) = (1/(n−1)) Σ(x − x̄)².
You will notice two differences from the mean absolute deviation. First, instead
of using an absolute value to make things positive, we square the deviations from
the mean. The chief advantage of squaring over the absolute value is that it is
much easier to do calculus with a polynomial than with functions involving absolute
values. Because the squaring changes the units of this measure, the square root
of the variance, called the standard deviation, is commonly used in place of the
variance.
The second difference is that we divide by n − 1 instead of by n. There is
a very good reason for this, even though dividing by n probably would have felt
much more natural to you at this point. We’ll get to that very good reason later
in the course (in Section 4.6). For now, we’ll settle for a less good reason. If you
know the mean and all but one of the values of a variable, then you can determine
the remaining value, since the sum of all the values must be the product of the
number of values and the mean. So once the mean is known, there are only n − 1
independent pieces of information remaining. That is not a particularly satisfying
explanation, but it should help you remember to divide by the correct quantity.
All of these quantities are easy to compute in R.
intro-dispersion02
> x=c(1,3,5,5,6,8,9,14,14,20)
>
> mean(x)
[1] 8.5
> x - mean(x)
[1] -7.5 -5.5 -3.5 -3.5 -2.5 -0.5 0.5 5.5 5.5 11.5
> sum(x - mean(x))
[1] 0
> abs(x - mean(x))
[1] 7.5 5.5 3.5 3.5 2.5 0.5 0.5 5.5 5.5 11.5
> sum(abs(x - mean(x)))
[1] 46

> (x - mean(x))^2
[1] 56.25 30.25 12.25 12.25 6.25 0.25 0.25 30.25 30.25
[10] 132.25
> sum((x - mean(x))^2)
[1] 310.5
> n= length(x)
> 1/(n-1) * sum((x - mean(x))^2)
[1] 34.5
> var(x)
[1] 34.5
> sd(x)
[1] 5.8737
> sd(x)^2
[1] 34.5

1.3. Graphical and Numerical Summaries of Multivariate Data

1.3.1. Side-by-Side Comparisons


Often it is useful to consider two or more variables together. In fact, we have already
done some of this. For example, we looked at iris sepal length separated by species.
This sort of side-by-side comparison – in graphical or tabular form – is especially
useful when one variable is quantitative and the other categorical. Graphical or
numerical summaries of the quantitative variable can be made separately for each
group defined by the categorical variable (or by shingles of a second quantitative
variable). See Appendix A for more examples.
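For example, group means of sepal length by species can be computed with aggregate() (one base R option among several):

> aggregate(Sepal.Length ~ Species, data=iris, FUN=mean)
     Species Sepal.Length
1     setosa        5.006
2 versicolor        5.936
3  virginica        6.588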

1.3.2. Scatterplots
There is another plot that is useful for looking at the relationship between two
quantitative variables. A scatterplot (or scattergram) is essentially the familiar
Cartesian coordinate plot you learned about in school. Since each observation in a
bivariate data set has two values, we can plot points on a rectangular grid repre-
senting both values simultaneously. The lattice function for making a scatterplot
is xyplot().
The scatterplot in Figure 1.10 becomes even more informative if we separate
the dots of the three species. Figure 1.11 shows two ways this can be done. The
first uses a conditioning variable, as we have seen before, to make separate panels
for each species. The second uses the groups argument to plot the data in the same
panel but with different symbols for each species. Each of these clearly indicates
that, in general, plants with wider sepals also have longer sepals but that the typical
values of and the relationship between width and length differ by species.

1.3.3. Two-Way Tables and Mosaic Plots


A 1981 paper [Rad81] investigating racial biases in the application of the death
penalty reported on 326 cases in which the defendant was convicted of murder.

[Figure: scatterplot of Sepal.Length (y) against Sepal.Width (x) for the iris data]
Figure 1.10. A scatterplot made with xyplot(Sepal.Length∼Sepal.Width,iris).

For each case they noted the race of the defendant and whether or not the death
penalty was imposed. We can use R to cross tabulate this data for us:
intro-deathPenalty01
> xtabs(~Penalty+Defendant,data=deathPenalty)
        Defendant
Penalty  Black White
  Death     17    19
  Not      149   141

Perhaps you are surprised that white defendants are more likely to receive
the death penalty. It turns out that there is more to the story. The researchers
also recorded the race of the victim. If we make a new table that includes this
information, we see something interesting.
intro-deathPenalty02

> xtabs(~Penalty+Defendant+Victim,data=deathPenalty)
, , Victim = Black

        Defendant
Penalty  Black White
  Death      6     0
  Not       97     9

, , Victim = White

        Defendant
Penalty  Black White
  Death     11    19
  Not       52   132

[Figure: scatterplots of Sepal.Length vs. Sepal.Width for the iris data, one panel per species (left) and a single panel with a different symbol for each species (right)]
Figure 1.11. The output of xyplot(Sepal.Length∼Sepal.Width|Species,iris)
and xyplot(Sepal.Length∼Sepal.Width,groups=Species,iris,auto.key=TRUE).

[Figure: mosaic plot of DeathPenalty (Yes/No) by Defendant and Victim race]
Figure 1.12. A mosaic plot of death penalty by race of defendant and victim.

It appears that black defendants are more likely to receive the death penalty
when the victim is white and also when the victim is black, but not if we ignore
the race of the victim. This sort of apparent contradiction is known as Simpson’s
paradox. In this case, it appears that the death penalty is more likely to be given
for a white victim, and since most victims are the same race as their murderer, the
result is that overall white defendants are more likely (in this data set) to receive
the death penalty even though black defendants are more likely (again, in this data
set) to receive the death penalty for each race of victim.
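We can check this numerically by converting the three-way table into proportions within each defendant and victim combination; a sketch using prop.table():

> tab <- xtabs(~Penalty+Defendant+Victim, data=deathPenalty)
> prop.table(tab, margin=c(2,3))   # proportions within each Defendant x Victim cell

The death-penalty rates implied by the counts above are 5.8% (black defendant, black victim) vs. 0% (white defendant, black victim), and 17.5% vs. 12.6% when the victim is white.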
The fact that our understanding of the data is so dramatically influenced by
whether or not our analysis includes the race of the victim is a warning to watch for
lurking variables – variables that have an important effect but are not included
in our analysis – in other settings as well. Part of the design of a good study is
selecting the right things to measure.
These cross tables can be visualized graphically using a mosaic plot. Mosaic
plots can be generated with the core R function mosaicplot() or with mosaic()
from the vcd package. (vcd is short for visualization of categorical data.) The
latter is somewhat more flexible and usually produces more esthetically pleasing
output. A number of different formula formats can be supplied to mosaic(). The
results of the following code are shown in Figure 1.12.
intro-deathPenalty03
> require(vcd)
> mosaic(~Victim+Defendant+DeathPenalty,data=deathPen)
> structable(~Victim+Defendant+DeathPenalty,data=deathPen)
Defendant Bl Wh
Victim DeathPenalty
Bl No 97 9
Yes 6 0
Wh No 52 132
Yes 11 19
As always, see ?mosaic for more information. The vcd package also provides an
alternative to xtabs() called structable(), and if you print() a mosaic(), you
will get both the graph and the table.

1.4. Summary

Data can be thought of in a 2-dimensional structure in which each variable has
a value (possibly missing) for each observational unit. In most statistical soft-
ware, including R, columns correspond to variables and rows correspond to the
observations.
The distribution of a variable is a description of the values obtained by a
variable and the frequency with which they occur. While simply listing all the
values does describe the distribution completely, it is not easy to draw conclusions
from this sort of description, especially when the number of observational units is
large. Instead, we will make frequent use of numerical and graphical summaries
that make it easier to see what is going on and to make comparisons.
The mean, median, standard deviation, and interquartile range are
among the most common numerical summaries. The mean and median give an
indication of the “center” of the distribution. They are especially useful for uni-
modal distributions but may not be appropriate summaries for distributions with
other shapes. When a distribution is skewed, the mean and median can be quite
different because the extreme values of the distribution have a large effect on the
mean but not on the median. A trimmed mean is sometimes used as a compromise
between the median and the mean. Although one could imagine other measures of
spread, the standard deviation is especially important because of its relationship
to important theoretical results in statistics, especially the Central Limit Theorem,
which we will encounter in Chapter 4.
Even as we learn formal methods of statistical analysis, we will not abandon
these numerical and graphical summaries. Appendix A provides a more complete
introduction to R and includes information on how to fine-tune plots. Additional
examples can be found throughout the text.

1.4.1. R Commands
Here is a table of important R commands introduced in this chapter. Usage details
can be found in the examples and using the R help.

x <- c(...)                     Concatenate arguments into a single vector and
                                store in object x.
data(x)                         (Re)load the data set x.
str(x)                          Print a summary of the object x.
head(x,n=4)                     First four rows of the data frame x.
tail(x,n=4)                     Last four rows of the data frame x.
table(x)                        Table of the values in vector x.
xtabs(~x+y,data)                Cross tabulation of x and y.
cut(x,breaks,right=TRUE)        Divide up the range of x into intervals and code
                                the values in x according to which interval they
                                fall into.
require(fastR);                 Load packages.
require(lattice);
require(Hmisc)
histogram(~x|z,data,...)        Histogram of x conditioned on z.
bwplot(x~z,data,...)            Boxplot of x conditioned on z.
xyplot(y~x|z,data,...)          Scatterplot of y by x conditioned on z.
stem(x)                         Stemplot of x.
sum(x); mean(x); median(x);     Sum, mean, median, variance, standard
var(x); sd(x); quantile(x)      deviation, quantiles of x.
summary(y~x,data,fun)           Summarize y by computing the function fun on
                                each group defined by x [Hmisc].

Exercises

1.1. Read as much of Appendix A as you need to do the exercises there.

1.2. The pulse variable in the littleSurvey data set contains self-reported pulse
rates.
a) Make a histogram of these values. What problem does this histogram reveal?
b) Make a decision about what values should be removed from the data and make
a histogram of the remaining values. (You can use the subset argument of
the histogram() function to restrict the data or you can create a new vector
and make a histogram from that.)
c) Compute the mean and median of your restricted set of pulse rates.

1.3. The pulse variable in the littleSurvey data set contains self-reported pulse
rates. Make a table or graph showing the distribution of the last digits of the
recorded pulse rates and comment on the distribution of these digits. Any conjec-
tures?
Note: %% is the modulus operator in R. So x %% 10 gives the remainder after
dividing x by 10, which is the last digit.

1.4. Some students in introductory statistics courses were asked to select a num-
ber between 1 and 30 (inclusive). The results are in the number variable in the
littleSurvey data set.

a) Make a table showing the frequency with which each number was selected
using table().
b) Make a histogram of these values with bins centered at the integers from 1 to
30.
c) What numbers were most frequently chosen? Can you get R to find them for
you?
d) What numbers were least frequently chosen? Can you get R to find them for
you?
e) Make a table showing how many students selected odd versus even numbers.

1.5. The distribution of a quantitative variable is symmetric about m if whenever
there are k observations with value m + d, there are also k observations with value
m − d. Equivalently, if the values are x_1 ≤ x_2 ≤ · · · ≤ x_n, then x_i + x_{n+1−i} = 2m
for all i.
a) Show that if a distribution is symmetric about m, then m is the median. (You
may need to handle separately the cases where the number of values is odd
and even.)
b) Show that if a distribution is symmetric about m, then m is the mean.
c) Create a small distribution such that the mean and median are equal to m
but the distribution is not symmetric about m.

1.6. Describe some situations where the mean or median is clearly a better measure
of central tendency than the other.
1.7. Below are histograms and boxplots from six distributions. Match each his-
togram (A–F) with its corresponding boxplot (U–Z).

[Figure: six histograms labeled A–F (x-axis 0 to 10, y-axis Percent of Total) and six boxplots labeled U–Z on the same 0-to-10 scale]

1.8. The function bwplot() does not use the quantile() function to compute its
five-number summary. Instead it uses fivenum(). Technically, fivenum() com-
putes the hinges of the data rather than quantiles. Sometimes fivenum() and
quantile() agree:
fivenum-a
> fivenum(1:11)
[1] 1.0 3.5 6.0 8.5 11.0
> quantile(1:11)
0% 25% 50% 75% 100%
1.0 3.5 6.0 8.5 11.0

But sometimes they do not:


fivenum-b
> fivenum(1:10)
[1] 1.0 3.0 5.5 8.0 10.0
> quantile(1:10)
0% 25% 50% 75% 100%
1.00 3.25 5.50 7.75 10.00
Compute fivenum() on a number of data sets and answer the following questions:
a) When does fivenum() give the same values as quantile()?
b) What method is fivenum() using to compute the five numbers?

1.9. Design some data sets to test whether by default bwplot() uses the 1.5 IQR
rule to determine whether it should indicate data as outliers.
1.10. Show that the total deviation from the mean, defined by

total deviation from the mean = Σ_{i=1}^{n} (x_i − x̄),

is 0 for any distribution.
1.11. We could compute the mean absolute deviation from the median instead of
from the mean. Show that the mean absolute deviation from the median is never
larger than the mean absolute deviation from the mean.
1.12. We could compute the mean absolute deviation from any number c (c for
center). Show that the mean absolute deviation from c is always at least as large
as the mean absolute deviation from the median. Thus the median is a minimizer
of mean absolute deviation.

1.13. Let SS(c) = Σ(x_i − c)². (SS stands for sum of squares.) Show that the
smallest value of SS(c) occurs when c = x̄. This shows that the mean is a minimizer
of SS.
1.14. Find a distribution with 10 values between 0 and 10 that has as large a
variance as possible.
1.15. Find a distribution with 10 values between 0 and 10 that has as small a
variance as possible.
1.16. The pitching2005 data set in the fastR package contains 2005 season statis-
tics for each pitcher in the major leagues. Use graphical and numerical summaries
of this data set to explore whether there are differences between the two leagues,
restricting your attention to pitchers that started at least 5 games (the variable GS
stands for ‘games started’). You may select the statistics that are of interest to
you.
If you are not much of a baseball fan, try using ERA (earned run average), which
is a measure of how many runs score while a pitcher is pitching. It is measured in
runs per nine innings.
1.17. Repeat the previous problem using batting statistics. The fastR data set
batting contains data on major league batters over a large number of years. You
may want to restrict your attention to a particular year or set of years.

1.18. Have major league batting averages changed over time? If so, in what ways?
Use the data in the batting data set to explore this question. Use graphical and
numerical summaries to make your case one way or the other.
1.19. The faithful data set contains two variables: the duration (eruptions) of
the eruption and the time until the next eruption (waiting).
a) Make a scatterplot of these two variables and comment on any patterns you
see.
b) Remove the first value of eruptions and the last value of waiting. Make a
scatterplot of these two vectors.
c) Which of the two scatterplots reveals a tighter relationship? What does that
say about the relationship between eruption duration and the interval between
eruptions?

1.20. The results of a little survey that has been given to a number of statistics
students are available in the littleSurvey data set. Make some conjectures about
the responses and use R’s graphical and numerical summaries to see if there is any
(informal) evidence to support your conjectures. See ?littleSurvey for details
about the questions on the survey.
1.21. The utilities data set contains information from utilities bills for a personal
residence over a number of years. This problem explores gas usage over time.
a) Make a scatterplot of gas usage (ccf) vs. time. You will need to combine
month and year to get a reasonable measurement for time. Such a plot is
called a time series plot.
b) Use the groups argument (and perhaps type=c('p','l'), too) to make the
different months of the year distinguishable in your scatterplot.
c) Now make a boxplot of gas usage (ccf) vs. factor(month). Which months
are most variable? Which are most consistent?
d) What patterns do you see in the data? Does there appear to be any change
in gas usage over time? Which plots help you come to your conclusion?

1.22. Note that March and May of 2000 are outliers due to a bad meter reading.
Utility bills come monthly, but the number of days in a billing cycle varies from
month to month. Add a new variable to the utilities data set using
utilities-ccfpday
> utilities$ccfpday <- utilities$ccf / utilities$billingDays
> plot1 <- xyplot( ccfpday ~ (year + month/12), utilities, groups=month )
> plot2 <- bwplot( ccfpday ~ factor(month), utilities )
Repeat the previous exercise using ccfpday instead of ccf. Are there any noticeable
differences between the two analyses?
1.23. The utilities data set contains information from utilities bills for a personal
residence over a number of years. One would expect that the gas bill would be
related to the average temperature for the month.
Make a scatterplot showing the relationship between ccf (or, better, ccfpday;
see Exercise 1.22) and temp. Describe the overall pattern. Are there any outliers?

1.24. The utilities data set contains information from utilities bills for a personal
residence over a number of years. The variables gasbill and ccf contain the gas
bill (in dollars) and usage (in 100 cubic feet) for a personal residence. Use plots
to explore the cost of gas over the time period covered in the utilities data set.
Look for both seasonal variation in price and any trends over time.
1.25. The births78 data set contains the number of births in the United States
for each day of 1978.
a) Make a histogram of the number of births. You may be surprised by the shape
of the distribution. (Make a stemplot too if you like.)
b) Now make a scatterplot of births vs. day of the year. What do you notice?
Can you conjecture any reasons for this?
c) Can you make a plot that will help you see if your conjecture seems correct?
(Hint: Use groups.)
Chapter 2

Probability and Random Variables

The excitement that a gambler feels when making a bet is equal to the
amount he might win times the probability of winning it.
Blaise Pascal [Ros88]

In this chapter we will develop the foundations of probability. As an area of
mathematics, probability is the study of random processes.1 Randomness, as we
use it, describes a particular type of uncertainty.

Box 2.1. What is randomness?


We will say that a repeatable process is random if its outcome is
• unpredictable in the short run and
• predictable in the long run.
Note that it is the process that is random, not the outcome, even though we
often speak as if it were the outcome that is random.

A good example is flipping a coin. The result of any given toss of a fair coin is
unpredictable in advance. It could be heads, or it could be tails. We don’t know
with certainty which it will be. Nevertheless, we can say something about the
long-run behavior of flipping a coin many times. This is what makes us surprised
if someone flips a coin 20 times and gets heads all 20 times, but not so surprised if
the result is 12 heads and 8 tails.
1 It is traditional within the study of probability to refer to random processes as random exper-
iments. We will avoid that usage to avoid confusion with the randomized experiments – statistical
studies where the values of some variables are determined using randomness.


To facilitate talking about randomness more generally, it is useful to introduce
some terminology. An outcome of a random process is one of the potential results
of the process. An event is any set of outcomes. Two important events are the
empty set (the set of no outcomes, denoted ∅) and the set of all outcomes, which is
called the sample space. The probability of an event will be a number between
0 and 1 (inclusive) that indicates its relative likelihood of occurring. Things that
happen most of the time will be assigned numbers near 1. Things that almost never
happen “by chance” will be assigned numbers near 0.
But how do we assign these numbers? There are two important methods for
assigning probabilities: the empirical method and the theoretical method. We will
look at those in the next section. We will then rapidly turn our attention to a
special case, namely when the outcomes of our random process are numbers or can
be converted into numbers. Such a random process is called a random variable,
and just as we did with data, we will develop a number of graphical and numerical
tools for studying the distributions of random variables.
In our definitions of probability and the derivations of some basic properties of
random variables, we make use of the standard mathematical notation for sets and
functions. This notation is reviewed in Appendix B.

2.1. Introduction to Probability

As mentioned in the introduction to this chapter, a probability is a number assigned
to an event that represents how likely it is to occur. In this section we will consider
two ways of assigning these numbers.

2.1.1. Empirical Probability Calculations

The empirical method for assigning probabilities is straightforward and is based on
the fact that we are interested in the long-run behavior of a random process. If we
repeat the random process many times, some of the outcomes will belong to our
event of interest while others will not. We could then define the probability of the
event E to be
P(E) = (number of times outcome was in the event E) / (number of times the random process was repeated).   (2.1)

There is much to be said for this definition. It will certainly give numbers
between 0 and 1 since the numerator is never larger than the denominator. Fur-
thermore, events that happen frequently will be assigned large numbers and events
which occur infrequently will be assigned small numbers. Nevertheless, (2.1) doesn’t
make a good definition, at least not in its current form. The problem with (2.1) is
that if two different people each repeat the random process and calculate the prob-
ability of E, very likely they will get different numbers. So perhaps the following
would be a better statement:
P(E) ≈ (number of times outcome was in the event E) / (number of times the random process was repeated).   (2.2)

It would be nice if we knew something about how accurate such an approximation
is. And in fact, we will be able to say something about that later. But
for now, intuitively, we expect such approximations to be better when the number
of repetitions is larger. A simulation of 1000 tosses of a fair coin supports this
intuition. After each simulated toss, the proportion of heads to that point was
calculated. The results are displayed in Figure 2.1.
[Figure: line plot of relative frequency of heads (y-axis, roughly 0.2 to 0.8) against number of tosses (x-axis, 0 to 1000)]
Figure 2.1. Cumulative proportion of heads in 1000 simulated coin tosses.
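A sketch of the kind of simulation behind Figure 2.1, using rbinom() to generate the tosses (this assumes the lattice package is loaded, as elsewhere in this book):

> n <- 1:1000
> tosses <- rbinom(1000, size=1, prob=0.5)   # 1 = heads, 0 = tails
> runningProp <- cumsum(tosses) / n          # proportion of heads so far
> xyplot(runningProp ~ n, type="l",
+        xlab="number of tosses", ylab="relative frequency")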

The observation that the relative frequency of our event appears to be con-
verging as the number of repetitions increases might lead us to try a definition
like
P(E) = lim_{n→∞} (number of times in n repetitions that outcome was in the event E) / n.   (2.3)
It’s not exactly clear how we would formally define such a limit, and even less
clear how we would attempt to evaluate it. But the intuition is still useful, and for
now we will think of the empirical probability method as an approximation method
(postponing for now any formal discussion of the quality of the approximation)
that estimates a probability by repeating a random process some number of times
and determining what percentage of the outcomes observed were in the event of
interest.
Such empirical probabilities can be very useful, especially if the process is quick
and cheap to repeat. But who has time to flip a coin 10,000 or more times just
to see if the coin is fair? Actually, there have been folks who have flipped a coin
a large number of times and recorded the results. One such was John Kerrich, a
South African mathematician who recorded 5067 heads in 10,000 flips while in a
prison camp during World War II. That isn’t exactly 50% heads, but it is pretty
close.
Since repeatedly carrying out even a simple random process like flipping a coin
can be tedious and time consuming, we will often make use of computer simulations.
If we have a reasonably good model for a random event and can program it into a
computer, we can let the computer repeat the simulation many times very rapidly
and (hopefully) get good approximations to what would happen if we actually
repeated the process many times. The histograms below show the results of 1000
simulations of flipping a fair coin 1000 times (left) and 10,000 times (right).

[Figure: two histograms of the proportion of heads (x-axis 0.46 to 0.54, y-axis Percent of Total): results of 1000 simulations of 1000 coin tosses (left) and of 10,000 coin tosses (right)]
Figure 2.2. As the number of coin tosses increases, the results of the simulation become more consistent from simulation to simulation.
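A sketch of the simulations summarized in Figure 2.2: each call to rbinom() below simulates one sequence of 1000 tosses, and replicate() repeats the whole simulation 1000 times.

> props <- replicate(1000, mean(rbinom(1000, size=1, prob=0.5)))
> histogram(~props, xlab="proportion heads")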

Notice that the simulation-based probability is closer to 0.50 more of the time
when we flip 10,000 coins than when we flip only 1000 but that there is some
variation in both cases. Simulations with even larger sample sizes would reveal
that this variation decreases as the sample size increases but that there is always
some amount of variation from sample to sample. Also notice that John Kerrich’s
results are quite consistent with the results of our simulations, which assumed a
tossed coin has a 50% probability of being heads.
So it appears that flipping a fair coin is a 50-50 proposition. That’s not too
surprising; you already knew that. But how do you “know” this? You probably
have not flipped a coin 10,000 times and carefully recorded the outcomes of each
toss.2 So you must be using some other method to derive the probability of 0.5.

2.1.2. Theoretical Probability Calculations


When you say you “know” that the toss of a fair coin has a 50% chance of being
heads, you are probably reasoning something like this:
• the outcome will be either a head or a tail,
• these two outcomes are equally likely (because the coin is fair),
• the two probabilities must add to 100% because there is a 100% chance of
getting one of these two outcomes.
From this we conclude that the probability of getting heads is 50%. This kind of
reasoning is an example of the theoretical method of probability calculation. The
basic idea is to combine
• some general properties that should be true of all probability situations (the
probability axioms) and
• some additional assumptions about the situation at hand,
using deductive reasoning to reach conclusions, just as we did with the coin example.
To use this method, we first need to have some axioms. These should be
statements about probability that are intuitively true and provide us with enough
to get the process going. Box 2.2 contains our three probability axioms. That’s
it. Notice that each of these makes intuitive sense and is easily seen to be true of
2 Another person who has flipped a coin a large number of times is Persi Diaconis, professor of
statistics and mathematics at Stanford University – and former magician – who in 2004 wrote a paper
with some colleagues demonstrating that there is actually a slight bias in a coin toss: It is more likely
to land with the same side up that was up before the toss [DHM07].

Box 2.2. Axioms for probability


Let S be a sample space for a random process and let E ⊆ S be an event.
Then:
(1) P(E) ∈ [0, 1].
(2) P(S) = 1 (where S is the sample space).
(3) The probability of a disjoint union is the sum of probabilities.
• P(A ∪ B) = P(A) + P(B), provided A ∩ B = ∅.
• P(A_1 ∪ A_2 ∪ · · · ∪ A_k) = P(A_1) + P(A_2) + · · · + P(A_k), provided
A_i ∩ A_j = ∅ whenever i ≠ j.
• P(⋃_{i=1}^{∞} A_i) = Σ_{i=1}^{∞} P(A_i), provided A_i ∩ A_j = ∅ whenever i ≠ j.

our empirical probabilities (at least given a fixed set of repetitions). Despite the
fact that our axioms are so few and so simple, they are quite useful. In Section 2.2
we’ll see that a number of other useful general principles follow easily from our
three axioms. Together these rules and the axioms form the basis of theoretical
probability calculations. But before we provide examples of using these rules to
calculate probabilities, we’ll introduce the important notion of a random variable.

2.1.3. Random Variables


We are particularly interested in random events that produce (or can be converted
to) numerical outcomes. For example:
• We may roll a die and consider the outcome as a number between 1 and 6
(inclusive).
• We may roll two dice and consider the outcome to be the sum of the numbers
showing on each die.
• We may flip a coin 10 times and count the number of heads.
• We may telephone 1000 randomly selected households, ask the oldest person
at home if he or she voted in the last election, and record the number (or
percentage) who say yes.
In each case a number is being obtained as or from the result of a random process.
We will refer to such numbers as random variables. More formally, we define a
random variable as a function on the sample space.
Definition 2.1.1 (Random Variable). Let S be a sample space, and let X : S → R.
Then X is called a random variable.
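For instance, two of the random variables described above are easy to simulate in R (a quick sketch):

> sum(sample(1:6, 2, replace=TRUE))   # sum of two dice
> rbinom(1, size=10, prob=0.5)        # number of heads in 10 coin flips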

We will be concerned primarily with two types of random variables: discrete
and continuous. A discrete random variable is one that can only take on a finite
or countably infinite set of values. Often these values are a subset of the integers.
A continuous random variable on the other hand can take on all values in some
interval.
Exploring the Variety of Random
Documents with Different Content
The Project Gutenberg eBook of Fort
Duquesne and Fort Pitt; Early Names of
Pittsburgh Streets
This ebook is for the use of anyone anywhere in the United States
and most other parts of the world at no cost and with almost no
restrictions whatsoever. You may copy it, give it away or re-use it
under the terms of the Project Gutenberg License included with this
ebook or online at www.gutenberg.org. If you are not located in the
United States, you will have to check the laws of the country where
you are located before using this eBook.

Title: Fort Duquesne and Fort Pitt; Early Names of Pittsburgh Streets

Creator: Pa.) Daughters of the American Revolution. Pittsburgh


Chapter (Pittsburgh

Release date: June 18, 2012 [eBook #40037]


Most recently updated: October 23, 2024

Language: English

Credits: Produced by Charlene Taylor, Barbara Kosker and the Online


Distributed Proofreading Team at http://www.pgdp.net
(This
file was produced from images generously made available
by The Internet Archive/American Libraries.)

*** START OF THE PROJECT GUTENBERG EBOOK FORT DUQUESNE


AND FORT PITT; EARLY NAMES OF PITTSBURGH STREETS ***
Block House of Fort Pitt. Built 1764.

FORT DUQUESNE
AND

FORT PITT

EARLY NAMES OF PITTSBURGH STREETS


SIXTH EDITION

PUBLISHED BY

FORT PITT SOCIETY

DAUGHTERS OF THE AMERICAN REVOLUTION

OF

ALLEGHENY COUNTY, PENNSYLVANIA

Reed & Witting Co., Press


1921

This little sketch of Fort Duquesne and Fort Pitt is compiled from
extracts taken mainly from Parkman's Histories; The Olden Time, by
Neville B. Craig; Fort Pitt, by Mrs. Wm. Darlington; Pioneer History,
by S. P. Hildreth, etc.
Pittsburgh
September, 1898.
CHRONOLOGY
1753—The French begin to build a chain of forts to enforce their
boundaries.
December 11, 1753.—Washington visits Fort Le Boeuf.
January, 1754.—Washington lands on Wainwright's Island in the
Allegheny river.—Recommends that a Fort be built at the "Forks
of the Ohio."
February 17, 1754.—A fort begun at the "Forks of the Ohio" by
Capt. William Trent.
April 16, 1754.—Ensign Ward, with thirty-three men, surprised
here by the French, and surrenders.
June, 1754.—Fort Duquesne completed.
May 28, 1754.—Washington attacks Coulon de Jumonville at Great
Meadows.
July 9, 1755.—Braddock's defeat.
April, 1758.—Brig. Gen. John Forbes takes command.
August, 1758.—Fort Bedford built.
October, 1758.—Fort Ligonier built.
November 24, 1758.—Fort Duquesne destroyed by the retreating
French.
November 25, 1758.—Gen. Forbes takes possession.
August, 1759.—Fort Pitt begun by Gen. John Stanwix.
May, 1763.—Conspiracy of Pontiac.
July, 1763.—Fort Pitt besieged by Indians.
1764.—Col. Henry Bouquet builds the Redoubt.
October 10, 1772.—Fort Pitt abandoned by the British.
January, 1774.—Dr. James Connelly occupies Fort Pitt with Virginia
militia, and changes name to Fort Dunmore.
July, 1776.—Indian conference at Fort Pitt.—Pontiac and Guyasuta.
June 1, 1777.—Brig. Gen. Hand takes command of the fort.
1778.—Gen. McIntosh succeeds Hand.
November, 1781.—Gen. William Irvine takes command.
May 19, 1791.—Maj. Isaac Craig reports Fort Pitt in a ruinous
condition.—Builds Fort Lafayette.
September 4, 1805.—The historic site purchased by Gen. James
O'Hara.
April 1, 1894.—Mrs. Mary E. Schenley, granddaughter of Gen.
James O'Hara, presents Col. Bouquet's Redoubt to the
Daughters of the American Revolution of Allegheny County,
Pennsylvania.
FORT DUQUESNE

Conflicting Claims of France and England in North America.


On maps of British America in the earlier part of the eighteenth
century, one sees the eastern coast, from Maine to Georgia, gashed
with ten or twelve colored patches, very different in size and shape,
and defined more or less distinctly by dividing lines, which in some
cases are prolonged westward until they reach the Mississippi, or
even across it and stretch indefinitely towards the Pacific.
These patches are the British Provinces, and the western
prolongation of their boundary represents their several claims to vast
interior tracts founded on ancient grants, but not made good by
occupation or vindicated by an exertion of power * * *
Each Province remained in jealous isolation, busied with its own
work, growing in strength, in the capacity of self-rule, in the spirit of
independence, and stubbornly resisting all exercise of authority from
without. If the English-speaking population flowed westward, it was
in obedience to natural laws, for the King did not aid the movement,
and the royal Governor had no authority to do so. The power of the
colonies was that of a rising flood, slowly invading and conquering
by the unconscious force of its own growing volume, unless means
be found to hold it back by dams and embankments within
appointed limits.
In the French colonies it was different. Here the representatives of
the crown were men bred in the atmosphere of broad ambition and
masterful, far-reaching enterprise. They studied the strong and weak
points of their rivals, and with a cautious forecast and a daring
energy set themselves to the task of defeating them. If the English
colonies were comparatively strong in numbers these numbers could
not be brought into action, while if French forces were small they
were vigorously commanded and always ready at a word. It was
union confronting division, energy confronting apathy, and military
centralization opposed to industrial democracy, and for a time the
advantage was all on one side. Yet in view of what France had
achieved, of the patient gallantry of her explorers, the zeal of her
missionaries, the adventurous hardihood of her bush-rangers,
revealing to mankind the existence of this wilderness world, while
her rivals plodded at their workshops, their farms, their fisheries; in
view of all this, her pretensions were moderate and reasonable
compared to those of England.

Forks of the Ohio.—Washington's First Visit.


The Treaty of Utrecht had decided that the Iroquois or Five
Nations were British subjects; therefore it was insisted that all
countries conquered by them belonged to the British crown. The
range of the Iroquois war parties was prodigious, and the English
laid claim to every mountain, forest and prairie where an Iroquois
had taken a scalp. This would give them not only all between the
Alleghanies and the Mississippi, but all between Ottawa and Huron,
leaving nothing to France but the part now occupied by the Province
of Quebec.
The Treaty of Utrecht in 1713, and that of Aix la Chapelle in 1748,
were supposed to settle the disputed boundaries of the French and
English possessions in America; France, however, repented of her
enforced concessions, and claimed the whole American continent as
hers, except a narrow strip of sea-coast. To establish this boundary,
it was resolved to build a line of forts from Canada to the Mississippi,
following the Ohio, for they perceived that the "Forks of the Ohio,"
so strangely neglected by the English, formed together with Niagara
the key of the great West.
This chain of forts began at Niagara; then another was built of
squared logs at Presque Isle (now Erie), and a third called Fort Le
Boeuf, on what is now called French Creek. Here the work stopped
for a time, and Lagardeur de St. Pierre went into winter quarters
with a small garrison at Fort Le Boeuf.
On the 11th of December, 1753, Major George Washington, with
Christopher Gist as guide, Abraham Van Braam as interpreter, and
several woodsmen,[A] presented himself as a bearer of a letter from
Governor Dinwiddie of Virginia to the commander of Fort Le Boeuf.
He was kindly received. In fact, no form of courtesy was omitted
during the three days occupied by St. Pierre in framing his reply to
Governor Dinwiddie's letter. This letter expressed astonishment that
his (St. Pierre's) troops should build forts upon lands so notoriously
known to be the property of Great Britain, and demanded their
immediate and peaceable departure. In his answer, St. Pierre said he
acted in accordance with the commands of his general, that he
would forward Governor Dinwiddie's letter to the Marquis Duquesne
and await his orders.
It was on his return journey that Washington twice escaped death.
First from the gun of a French Indian; then in attempting to cross
the Allegheny, which was filled with ice, on a raft that he and his
companions had hastily constructed with the help of one hatchet
between them. He was thrown into the river and narrowly escaped
drowning; but Gist succeeded in dragging him out of the water, and
the party landed on Wainwright's Island, about opposite the foot of
Thirty-third Street. On making his report Washington recommended
that a fort be built at the "Forks of the Ohio."
Men and money were necessary to make good Governor
Dinwiddie's demand that the French evacuate the territory they had
appropriated; these he found it difficult to get. He dispatched letters,
orders, couriers from New Jersey to South Carolina, asking aid.
Massachusetts and New York were urged to make a feint against
Canada, but as the land belonged either to Pennsylvania or Virginia,
the other colonies did not care to vote money to defend them.
In Pennsylvania the placid obstinacy of the Quakers was matched
by the stolid obstinacy of the German farmers; notwithstanding,
Pennsylvania voted sixty thousand pounds, and raised twelve
hundred men at eighteen pence per day. All Dinwiddie could muster
elsewhere was the promise of three or four hundred men from North
Carolina, two companies from New York and one from South
Carolina, with what recruits he could gather in Virginia. In
accordance with Washington's recommendation, Capt. William Trent,
once an Indian trader of the better class, now a commissioned
officer, had been sent with a company of backwoodsmen to build a
fort at the Forks of the Ohio, and it was hoped he would fortify
himself sufficiently to hold the position. Trent began the fort, but left
it with forty men under Ensign Ward and went back to join
Washington. The recruits gathered in Virginia were to be
commanded by Joshua Fry, with Washington as second in command.

Fort Duquesne.—Washington at Fort Necessity.


On the 17th of April, 1754, Ward was surprised by the appearance
of a swarm of canoes and bateaux descending the Allegheny,
carrying, according to Ward, about one thousand Frenchmen, who
landed, planted their cannon and summoned the Ensign to
surrender. He promptly complied and was allowed to depart with all
his men. The French soon demolished the unfinished fort and built in
its place a much larger and better one, calling it Fort Duquesne, in
honor of the Marquis Duquesne, then Governor of Canada.
Washington, with his detachment of ragged recruits, without tents
and scarcely armed, was at Will's Creek, about one hundred and
forty miles from the "Forks of the Ohio," and he was deeply
chagrined when Ward joined him and reported the loss of the fort.
Dinwiddie then ordered Washington to advance. In order to do so, a
road must be cut for wagons and cannon, through a dense forest;
two mountain ranges must be crossed, and innumerable hills and
streams. Towards the end of May he reached Great Meadows with
one hundred and fifty men. While encamped here, Washington
learned that a detachment of French had marched from the fort in
order to attack him. They met in a rocky hollow and a short fight
ensued. Coulon de Jumonville, the commander, was killed; all the
French were taken prisoners or killed except one Canadian. This
skirmish was the beginning of the war. Washington then advanced as
far as Christopher Gist's settlement, twelve or fourteen miles on the
other side of the Laurel Ridge. He soon heard that strong
reinforcements had been sent to Fort Duquesne, and that another
detachment was even then on the march under Coulon de Villiers, so
on June 28th he began to retreat. Not having enough horses, the
men had to carry the baggage on their backs, and drag nine swivels
over miserable roads. Two days brought them to Great Meadows,
and they had but one full day to strengthen the slight fortification
they had made there, and which Washington named Fort Necessity.
The fighting began at about 11, and lasted for nine hours; the
English, notwithstanding their half starved condition, and their want
of ammunition, keeping their ground against double their number.
When darkness came a parley was sounded, to which Washington at
first paid no attention, but when the French repeated the proposal,
and requested that an officer might be sent, he could refuse no
longer. There were but two in Washington's command who could
understand French, and one of them was wounded. Capt. Van
Braam, a Dutchman, acted as interpreter. The articles were signed
about midnight. The English troops were to march out with drums
beating, carrying with them all their property. The prisoners taken in
the Jumonville affair were to be released, Capt. Van Braam and
Major Stobo to be detained as hostages for their safe return to Fort
Duquesne.
This defeat was disastrous to the English. There was now not an
English flag waving west of the Alleghanies. Villiers went back
exultant to Fort Duquesne, and Washington began his wretched
march to Will's Creek. No horses, no cattle, most of the baggage
must be left behind, while the sick and wounded must be carried
over the Alleghanies on the backs of their weary, half starved
comrades. And this was the Fourth of July, 1754.
The conditions of the surrender were never carried out. The
prisoners taken in the skirmish with Jumonville were not returned.
Van Braam and Stobo were detained for some time at Fort
Duquesne, then sent to Quebec, where they were kept prisoners for
several years. While a prisoner on parole Major Stobo made good
use of his opportunities by acquainting himself with the
neighborhood; afterwards he was kept in close confinement and
endured great hardships; but in the spring of 1759 he succeeded in
making his escape in the most miraculous manner. While Wolfe was
besieging Quebec he returned from Halifax, and, it is said, it was he
who guided the troops up the narrow wooded path to the Heights of
Abraham. Strange, that one taken prisoner in a far distant province,
in a skirmish which began the war, should guide the gallant Wolfe to
the victory at Quebec, which virtually closed the war in America.

Braddock.
Nothing of importance was done in Virginia and Pennsylvania until
the arrival of Braddock in February, 1755, bringing with him two
regiments. Governor Dinwiddie hailed his arrival with joy, hoping that
his troubles would now come to an end. Of Braddock, Governor
Dinwiddie's Secretary, Shirley wrote to Governor Morris: "We have a
general most judiciously chosen for being disqualified for the service
he is in, in almost every respect." Braddock issued a call to the
provincial governors to meet him in council, which was answered by
Dinwiddie of Virginia, Dobbs of North Carolina, Sharpe of Maryland,
Morris of Pennsylvania, Delancy of New York, and Shirley of
Massachusetts. The result was a plan to attack the French at four
points at once. Braddock was to advance on Fort Duquesne, Fort
Niagara was to be reduced, Crown Point seized, and a body of men
from New England to capture Beausejour and Arcadia.
We will follow Braddock. In his case prompt action was of the
utmost importance, but this was impossible, as the people refused to
furnish the necessary supplies. Franklin, who was Postmaster
General in Pennsylvania, was visiting Braddock's camp with his son
when the report of the agents sent to collect wagons was brought
in. The number was so wholly inadequate that Braddock stormed,
saying the expedition was at an end. Franklin said it was a pity he
had not landed in Pennsylvania, where he might have found horses
and wagons more plentiful. Braddock begged him to use his
influence to obtain the necessary supply, and Franklin on his return
to Pennsylvania issued an address to the farmers. In about two
weeks a sufficient number was furnished, and at last the march
began. He reached Will's Creek on May 10, 1755, where fortifications
had been erected by the colonial troops, and called Fort
Cumberland. Here Braddock assembled a force numbering about
twenty-two hundred. Although Braddock despised the provincial
troops and the Indians, he honored Col. George Washington, who
commanded the troops from Virginia, by placing him on his staff.
A month elapsed before this army was ready to leave Fort
Cumberland. Three hundred axemen led the way, the long, long
train of pack-horses, wagons, and cannon following, as best they
could, along the narrow track, over stumps and rocks and roots. The
road cut was but twelve feet wide, so that the line of march was
sometimes four miles long, and the difficulties in the way were so
great that it was impossible to move more than three miles a day.
On the 18th of June they reached Little Meadows, not thirty miles
from Fort Cumberland, where a report reached them that five
hundred regulars were on their way to reinforce Fort Duquesne.
Washington advised Braddock to leave the heavy baggage and press
forward, and following this advice, the next day, June 19th, the
advance corps of about twelve hundred soldiers with what artillery
was thought indispensable, thirty wagons, and a number of pack-
horses, began its march; but the delays were such that it did not
reach the mouth of Turtle Creek until July 7th. The distance to Fort
Duquesne by a direct route was about eight miles, but the way was
difficult and perilous, so Braddock crossed the Monongahela and re-
crossed farther down, at one o'clock.
Washington describes the scene at the ford with admiration. The
music, the banners, the mounted officers, the troops of light cavalry,
the naval detachment, the red-coated regulars, the blue-coated
Virginians, the wagons and tumbrils, the cannon, howitzers and
coehorns, the train of pack-horses and the droves of cattle passed in
long procession through the rippling shallows and slowly entered the
forest.
Fort Duquesne was a strong little fort, compactly built of logs,
close to the point where the waters of the Allegheny and
Monongahela unite. Two sides were protected by these waters, and
the other two by ravelins, a ditch and glacis and a covered way,
enclosed by a massive stockade. The garrison consisted of a few
companies of regulars and Canadians and eight hundred Indian
warriors, under the command of Contrecœur. The captains under
him were Beaujeu, Dumas, and Ligneris.
When the scouts brought the intelligence that the English were
within six leagues of the fort, the French, in great excitement and
alarm, decided to march at once and ambuscade them at the ford.
The Indians at first refused to move, but Beaujeu, dressed as one of
them, finally persuaded them to march, and they filed off along the
forest trail that led to the ford of the Monongahela—six hundred
Indians and about three hundred regulars and Canadians. They did
not reach the ford in time to make the attack there.

Braddock's Defeat.
Braddock advanced carefully through the dense and silent forest,
when suddenly this silence was broken by the war-whoop of the
savages, of whom not one was visible. Gage's column wheeled
deliberately into line and fired; and at first the English seemed to
carry everything before them, for the Canadians were seized by a
panic and fled; but the scarlet coats of the English furnished good
targets for their invisible enemies. The Indians, yelling their war-
cries, swarmed in the forest, but were so completely hidden in
gullies and ravines, behind trees and bushes and fallen trunks, that
only the trees were struck by the volley after volley fired by the
English, who at last broke ranks and huddled together in a
bewildered mass. Both men and officers were ignorant of this mode
of warfare. The Virginians alone were equal to the emergency and
might have held the enemy in check; but when Braddock found
them hiding behind trees and bushes like the Indians, he became so
furious at this seeming want of courage and discipline that he
ordered them, with oaths, to join the line, even beating them with his
sword; they replied to his threats and commands that they would
fight if they could see anyone to fight with. The ground was strewn
with the dead and dying; maddened horses plunged about; and
the roar of musketry and cannon, and above all the yells that came
from the throats of six hundred invisible savages, formed a chaos of
anguish and terror indescribable.
Braddock saw that all was lost and ordered a retreat, but had
scarcely done so when a bullet pierced his lungs. It is alleged that
the shot was fired by one of his own men, but this statement is
without proof. The retreat soon turned into a rout, and all who
remained dashed pell-mell through the river to the opposite shore,
abandoning the wounded, the cannon, and all the baggage and
papers to the mercy of the Indians. Beaujeu had fallen early in the
conflict. Dumas and Ligneris did not pursue the flying enemy, but
retired to the Fort, abandoning the field to the savages, which soon
became a pandemonium of pillage and murder. Of the eighty-six
English officers all but twenty-three were killed or disabled, and but
a remnant of the soldiers escaped.
When the Indians returned to the Fort, they brought with them
twelve or fourteen prisoners, their bodies blackened and their hands
tied behind their backs. These were all burned to death on the bank
of the Allegheny, opposite the Fort. The loss of the French was
slight; of the regulars there were but four killed or wounded, and all
the Canadians returned to the Fort unhurt except five.
The miserable remnant of Braddock's army continued their wild
flight all that night and all the next day, when before nightfall those
who had not fainted by the way reached Christopher Gist's farm, but
six miles from Dunbar's camp. The wounded general had shown an
incredible amount of courage and endurance. After trying in vain to
stop the flight, he was lifted onto a horse; when he fainted from the
effects of his mortal wound, some of the men were induced by large
bribes to carry him in a litter. Braddock ordered a detachment from
the camp to go to the relief of the stragglers, but as the fugitives
kept coming in with their tales of horror, the panic seized the camp,
and soldiers and teamsters fled.
The next day, whether from orders given by Braddock or Dunbar is
not known, more than one hundred wagons were burned, cannon,
coehorns, and shells were destroyed, barrels of gunpowder were
staved and the contents thrown into a brook, and provisions
scattered about through the woods and swamps, while the enemy,
with no thought of pursuit, had returned to Fort Duquesne. Braddock
died on the 13th of July, 1755, and was buried on the road; men,
horses and wagons passing over the grave of their dead commander
as they retreated to Fort Cumberland, thus effacing every trace of it,
lest it should be discovered by the Indians and the body mutilated.
Thus ended the attempt to capture Fort Duquesne, and for about
three years, while the storm of blood and havoc raged elsewhere,
that point was undisturbed.
Henry Bouquet.

Brigadier General Forbes.


In the meantime Dinwiddie had gone, a new governor was in his
place, while in the plans of Pitt the capture of Fort Duquesne held an
important place. Brigadier General John Forbes was charged with it.
He was Scotch by birth, a well bred man of the world, and unlike
Braddock, by his conduct toward the provincial troops, commanded
both the respect and affection of the colonists. He only resembled
Braddock in his determined resolution, but he did not hesitate to
embrace modes of warfare that Braddock would have scorned. He
wrote to Bouquet: "I have been long of your opinion of equipping
numbers of our men like the savages, and I fancy Col. Burd of
Virginia has most of his men equipped in that manner. In this
country we must learn our art of war from the Indians, or any one
else who has carried it on here." He arrived in Philadelphia in April
1758, but it was the end of June before his troops were ready to
march. His force consisted of Montgomery's Highlanders, twelve
hundred strong; Provincials from Pennsylvania, Virginia, Maryland,
North Carolina, and a detachment of Royal Americans, amounting in all to
about six or seven thousand men. The Royal Americans were
Germans from Pennsylvania, the Colonel-in-Chief being Lord
Amherst, Colonel Commandant Frederick Haldimand, and
conspicuous among them was Lieutenant Colonel Henry Bouquet, a
brave and accomplished Swiss, who commanded one of the four
battalions of which the regiment was composed.
General Forbes was detained in Philadelphia by a painful and
dangerous malady. Bouquet advanced and encamped at Raystown,
now Bedford. Then arose the question of opening a new road
through Pennsylvania to Fort Duquesne, or following the old road
made by Braddock. Washington, who commanded the Virginians,
foretold the ruin of the expedition unless Braddock's road was
chosen, but Forbes and Bouquet were firm and it was decided to
adopt the new route through Pennsylvania. Forbes was able to reach
Carlisle early in July, but his disorder was so increased by the
journey that he was not able to leave that place until the 11th of
August, and then in a kind of litter swung between two horses. In
this way he reached Shippensburg, where he lay helpless until far in
September. His plan was to advance slowly, establishing fortified
magazines as he went, and at last when within easy distance of the
Fort, to advance upon it with all his force, as little impeded as possible
with wagons and pack-horses. Having secured his magazines at
Raystown, and built a fort which he called Fort Bedford in honor of
his friend and patron, the Duke of Bedford,[B] Bouquet was sent with
his command to forward the heavy work of road making over the
main range of the Alleghanies and the Laurel Hills; "hewing, digging,
blasting, laying fascines and gabions, to support the track along the
sides of the steep declivities, or worming their way like moles
through the jungle of swamp and forest." As far as the eye or mind
could reach a prodigious forest vegetation spread its impervious
canopy over hill, valley and plain. His next post was on the
Loyalhanna Creek, scarcely fifty miles distant from Fort Duquesne,
and here he built a fortification, naming it Fort Ligonier, in honor of
Lord Ligonier, commander-in-chief of His Majesty's armies. Forbes
had served under Ligonier, and his influence, together with that of
the Duke of Bedford, secured to Forbes his appointment.
Now came the difficult and important task of securing Indian
allies. Sir William Johnson for the English, and Joncaire for the
French, were trying in every way to frighten or cajole them into
choosing sides; but that which neither of them could accomplish was
done by a devoted Moravian missionary, Christian Frederick Post.
Post spoke the Delaware language, had married a converted squaw,
and by his simplicity, directness and perfect honesty, had gained
their full confidence. He was a plain German, upheld by a sense of
duty and single-hearted trust in God. The Moravians were apostles
of peace, and they succeeded in a surprising way in weaning their
converts from their ferocious instincts and savage practices, while
the mission Indians of Canada retained all their native ferocity, and
their wigwams were strung with scalps, male and female, adult and
infant. These so-called missions were but nests of baptized savages,
who wore the crucifix instead of the medicine-bag.
Post accepted the dangerous mission as envoy to the camp of the
hostile Indians, and making his way to a Delaware town on Beaver
Creek, he was kindly received by the three kings; but when they
conducted him to another town he was surrounded by a crowd of
warriors, who threatened to kill him. He managed to pacify them,
but they insisted that he should go with them to Fort Duquesne. In
his Journal he gives thrilling accounts of his escape from dangers
threatened by both French and Indians. But he at last succeeded in
securing a promise from both Delaware and Shawnees, and other
hostile tribes, to meet with the Five Nations, the Governor of
Pennsylvania and commissioners from other provinces, in the town
of Easton, before the middle of September. The result of this council
was that the Indians accepted the white wampum belt of peace,
and agreed on a joint message of peace to the tribes of Ohio.
A few weeks before this Col. Bouquet, from his post at Fort
Ligonier, forgot his usual prudence and, at the urgent request of
Major Grant, commander of the Highlanders, allowed him to advance. On
the 14th of September, at about 2 A. M., he reached an eminence
about half a mile from the Fort. He divided his forces, placing
detachments in different positions, being convinced that the enemy
was too weak to attack him. Infatuated with this idea, when the fog
had cleared away, he ordered the reveille to be sounded. It was as if
he had put his foot into a hornet's nest. The roll of drums was answered
by a burst of war-whoops, while the French came swarming out,
many of them in their night shirts, just as they had jumped from
their beds. There was a hot fight for about three-quarters of an hour,
when the Highlanders broke away in a wild flight. Captain Bullit and
his Virginians tried to cover the retreat, and fought until two-thirds
of them were killed and Grant taken prisoner. The name of "Grant's
Hill" still clings to the much-ambushed "hump" where the Court
House now stands.
The French pushed their advantages with spirit, and there were
many skirmishes in the forest between Fort Ligonier and Fort
Duquesne, but their case was desperate. Their Indian allies had
deserted them, and their supplies had been cut off; so Ligneris, who
succeeded Contrecœur, was forced to dismiss the greater part of his
force. The English, too, were enduring great hardships. Rain had
continued almost without cessation all through September; the
newly made road was liquid mud, into which the wagons sank up to
the hubs. In October the rain changed to snow, while all this time
Forbes was chained to a sick-bed at Raystown, now Fort Bedford. In
the beginning of November he was carried from Fort Bedford to Fort
Ligonier in a litter, and a council of officers, then held, decided to
attempt nothing more that season, but a few days later a report of
the condition of the French was brought in, which led Forbes to give
orders for an immediate advance. On November 18, 1758, two
thousand five hundred picked men, without tents or baggage,
without wagons or artillery except a few light pieces, began their
march.
Lord Viscount Ligonier.

FOOTNOTES:

[A] The names of these woodsmen were Barnaby Currin and
James MacGuire, Indian Traders; Henry Stewart and William
Jenkins; Half King, Monokatoocha, Jeskakake, White Thunder and
the Hunter.
[B] In recognition of this honor, the Duke of Bedford presented to
the fort a large flag of crimson brocade silk. In 1895 this flag was
in the possession of Mrs. Moore, of Bedford, who kindly lent it to
the Pittsburgh Chapter Daughters of the American Revolution, for
exhibition at a reception given by them at Mrs. Park Painter's
residence, February 15th, 1895.
Fort Pitt

French Abandon Fort Duquesne.——Fort Pitt is Built.


On the evening of the 24th they encamped on the hills around
Turtle Creek, and at midnight the sentinels heard a heavy boom as if
a magazine had exploded. In the morning the march was resumed.
After the advance guard came Forbes, carried in a litter, the troops
following in three columns, the Highlanders in the center headed by
Montgomery, the Royal Americans and Provincials on the right and
left under Bouquet and Washington. Slowly they made their way
beneath an endless entanglement of bare branches. The Highlanders
were goaded to madness by seeing, as they approached the Fort, the
heads of their countrymen who had fallen in Grant's rash attack
stuck on poles, around which their plaids had been wrapped in
imitation of petticoats. Foaming with rage, they rushed
forward, abandoning their muskets and drawing their broadswords;
but their fury was in vain, for when they reached a point where the
Fort should have been in sight, there was nothing between them and
the hills on the opposite banks of the Monongahela and Allegheny
but a mass of blackened and smouldering ruins. The enemy, after
burning the barracks and storehouses, had blown up the
fortifications and retreated, some down the Ohio, others overland to
Presque Isle, and others up the Allegheny to Venango.
There were two forts, and some idea may be formed of their size,
with barracks and storehouses, from the fact that John Haslet writes
to the Rev. Dr. Allison, two days after the English took possession,
that there were thirty chimney stacks standing.
The troops had no shelter until the first fort was built. Col.
Bouquet wrote to Miss Ann Willing from Fort Duquesne, November
25th, 1758, "they have burned and destroyed to the ground their
fortifications, houses and magazines, and left us no other cover than
the heavens—a very cold one for an army without tents or
equipages."
Col. Bouquet in a letter written to Chief Justice Allen of
Pennsylvania on November 26th, enumerated the needs of the
garrison, which he hopes the Provinces of Pennsylvania and Virginia
will immediately supply. He adds: "After God, the success of this
expedition is entirely due to the general. He has shown the greatest
prudence, firmness and ability. No one is better informed than I am,
who had an opportunity to see every step that has been taken from
the beginning and every obstacle that was thrown in his way."
Forbes' first care was to provide defense and shelter for his troops,
and a strong stockade was built around the traders' cabins and
soldiers' huts, which he named Pittsburgh, in honor of England's
great Minister, William Pitt. Two hundred Virginians under Col.
Mercer were left to defend the new fortification, a force wholly
inadequate to hold the place if the French chose to return and
attempt to take it again. Those who remained must for a time
depend largely on stream and forest to supply their needs, while the
army that was to return began its homeward march early in
December, with starvation staring them in the face.
No sooner was this work done than Forbes utterly succumbed. He
left with the soldiers, and was carried all the way to Philadelphia in a
litter, arriving there January 18, 1759. He lingered through the
winter, died in March, and was buried in Christ Church, March 14,
1759. Parkman says: "If his achievement was not brilliant, its solid
value was above price; it opened the Great West to English
enterprise, took from France half her savage allies, and relieved the
western borders from the scourge of Indian war. From Southern New
York to North Carolina the frontier population had cause to bless the
memory of this steadfast and all-enduring soldier."
Just sixty days after the taking of Fort Duquesne, William Pitt
wrote a letter, dated Whitehall, January 23, 1759, of which the
following extract will show how important this place was considered
in Great Britain.
"Sir:—I am now to acquaint you that the King has been pleased
immediately upon receiving the news of the success of his armies on
the river Ohio, to direct the commander-in-chief of His Majesty's
forces in North America, and General Forbes, to lose no time in
concerting the properest and speediest means for completely
restoring, if possible, the ruined Fort Duquesne to a defensible and
respectable state, or for erecting another in the room of it of
sufficient strength, and every way adequate to the great importance
of the several objects of maintaining His Majesty's subjects in the
undisputed possession of the Ohio," etc., etc.
In a letter dated Pittsburgh, August 1759, Col. Mercer writes to
Gov. Denny: "Capt. Gordon, chief engineer, has arrived with most of
the artificers, but does not fix the spot for constructing the Fort till
the general comes up. We are preparing the materials for building
with what expedition so few men are capable of."
There was no attempt made to restore the old fortifications, but
about a year afterwards work was begun on a new fort. Gen. John
Stanwix, who succeeded Gen. Forbes, is said to have been a man of
high military standing, with a liberal and generous spirit. In 1759, he
appeared on the Ohio at the head of an army, and with full power to
build a large fort where Fort Duquesne had stood. The exact date of
his arrival and the day when work was commenced are not known,
but the work must have been begun the last of August or the first of
September, 1759. A letter dated September 24, 1759, gives the
following account: "It is now near a month since the army has been
employed in erecting a most formidable fortification, such a one as
will to latest posterity secure the British Empire on the Ohio. There is
no need to enumerate the abilities of the chief engineer nor the
spirit shown by the troops in executing the important task; the fort
will soon be a lasting monument of both."
The fort was built near the point where the Allegheny and
Monongahela unite their waters, but a little farther inland than the
site of Fort Duquesne. It stood on the present site of the Duquesne
Freight Station, while all the ground from the Point to Third Street
and from Liberty Street to the Allegheny River was enclosed in a
stockade and surrounded by a moat. It was a solid and substantial
building, constructed at an enormous expense to the English
Government.[C] It was five-sided, the two sides facing the land
being of brick, the others stockade. The earth around was thrown up
so that all was enclosed by a rampart of earth, supported on the land side by a
perpendicular wall of brick; on the other sides a line of pickets was
fixed on the outside of the slope, and a moat encompassed the
entire work. Casemates, barracks and storehouses were completed
for a garrison of one thousand men and officers, and eighteen
pieces of artillery mounted on the bastions. This strong fortification
was thought to establish the British dominion of the Ohio. The exact
date of its completion is not known, but on March 21, 1760, Maj.
Gen. Stanwix, having finished his work, set out on his return journey
to Philadelphia.

Conspiracy of Pontiac and Col. Bouquet.


The effect of this stronghold was soon apparent in the return of
about four thousand settlers to their lands on the frontiers of
Pennsylvania, Virginia and Maryland, from which they had been
driven by their savage enemies, and the brisk trade which at once
began to be carried on with the now, to all appearance, friendly
Indians. However, this security was not of long duration. The definitive
treaty of peace between England, Spain and France was signed
February 10, 1763, but before that time, Pontiac, the great chief of
the Ottawas, was planning his great conspiracy, which carried death
and desolation throughout the frontier.
The French had always tried to ingratiate themselves with the
Indians. When their warriors came to French forts they were
hospitably welcomed and liberally supplied with guns, ammunition
and clothing, until the weapons and garments of their forefathers
were forgotten. The English, on the contrary, either gave reluctantly
or did not give at all. Many of the English traders were of the
coarsest stamp, who vied with each other in rapacity and violence.
When an Indian warrior came to an English fort, instead of the
kindly welcome he had been accustomed to receive from the French,
he got nothing but oaths, and menaces, and blows, sometimes
being assisted to leave the premises by the butt of a sentinel's
musket. But above and beyond all, they watched with wrath and fear
the progress of the white man into their best hunting grounds, for as
the English colonists advanced, the Indians' beloved forests disappeared
under the strokes of the axe. The French did all in their power to
augment this discontent.
In this spirit of revenge and hatred a powerful confederacy was
formed, including all the western tribes, under the command of
Pontiac, alike renowned for his warlike spirit, his wisdom and his
bravery, and whose name was a terror to the entire region of the
lakes. The blow was to be struck in the month of May, 1763. The
tribes were to rise simultaneously and attack the English garrisons.
Thus a sudden attack was made on all the western posts. Detroit
was saved after a long and close siege. Forts Pitt and Niagara
narrowly escaped, while Le Boeuf, Venango, Presqu' Isle, Miamis, St.
Joseph, Ouachtanon, Sandusky and Michillimackinac all fell into the
hands of the Indians. Their garrisons were either butchered on the
spot, or carried off to be tortured for the amusement of their cruel
captors.
The savages swept over the surrounding country, carrying death
and destruction wherever they went. Hundreds of traders were
slaughtered without mercy, while their wives and children, if not
murdered, were carried off captives. The property destroyed or
stolen amounted, it is said, to five hundred thousand pounds. Attacks
were made on Forts Bedford and Ligonier, but without success. Fort
Ligonier was under siege for two months. The preservation of this
post was of the utmost importance, and Lieut. Blaine, by his courage
and good conduct, managed to hold it until August 2, 1763, when
Col. Bouquet arrived with his little army.