Introductory Statistics for Data Analysis
Warren J. Ewens • Katherine Brumberg
Warren J. Ewens, Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA, USA
Katherine Brumberg, Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA, USA
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
We have often given detailed answers to the problems since this allows them to be
considered as instructive examples rather than as problems. We have also provided
flowcharts that help put the material discussed into perspective.
We are well aware of the practical aspects of data analysis, for example of
ensuring that the data analyzed form an unbiased representative sample of the
population of interest and that the assumptions made in the theory are justified,
and have referred to these and similar matters several times throughout the book.
However, our focus is on the basic theory, since in our experience this is sometimes
little understood, so that incorrect procedures and inappropriate assumptions are
sometimes used in data analysis.
Any errors or obscurities observed in this book will be reported at the webpage
https://kbrumberg.com/publication/textbook/ewens/. Possible errors can be reported
according to the instructions on the same webpage.
Contents
Part I Introduction
1 Statistics and Probability Theory
1.1 What is Statistics?
1.2 The Relation Between Probability Theory and Statistics
1.3 Problems
1 Statistics and Probability Theory
1.1 What is Statistics?
The word “Statistics” means different things to different people. For a baseball fan,
it might relate to batting averages. For an actuary, it might relate to life tables. In
this book, we use the scientific definition of “Statistics”: Statistics is the
science of analyzing data in whose generation chance has played some part. This
sentence is the most important one in the book, and it permeates the whole of
it. Statistics as we understand it via this definition has become a central area of
modern science and data analysis, as discussed below.
Why is Statistics now central to modern science and data analysis? This question
is best answered by considering the historical context. In the past, Mathematics
developed largely in association with areas of science in which chance mechanisms
were either non-existent or not important. Thus in the past a great deal of progress
was made in such areas as Physics, Engineering, Astronomy and Chemistry using
mathematical methods which did not allow any chance, or random, features in the
analysis. For example, no randomness is involved in Newton’s laws or in the theory
of relativity, both of which are entirely deterministic. It is true that quantum theory
is the prevailing paradigm in the physical sciences and that this theory intrinsically
involves randomness. However, that intrinsic level of randomness is not discussed
in this book.
Our focus is on more recently developed areas of science such as Medicine,
Biology and Psychology, in which various chance mechanisms are at work, so that
deterministic theory is no longer appropriate. In a clinical
trial of a proposed new medicine, the number of people cured by the new medicine
will depend on the choice of individuals in the trial: with a different group of
individuals, a different number of people cured will probably be seen. (Clinical
trials are discussed later in this book.) In areas such as Biology, there are many
random factors deriving from, for example, the random transmission of genes from
parent to offspring, implying that precise statements concerning the evolution of a
population cannot be made. Similar comments arise for many other areas of modern
science and indeed in all areas where inferences are to be drawn from data in whose
generation randomness played some part.
The data in “data analysis” are almost always a sample of some kind. A different
sample would almost certainly yield different data, so that the sampling process
introduces a second chance element over and above the inherent randomness in
areas of science described above. This means that in order to make progress in these
areas, one has to know how to analyze data in whose generation chance mechanisms
were at work.
This is where Statistics becomes relevant. The role played by Mathematics in
Physics, Engineering, Astronomy and Chemistry is played by Statistics in Medicine,
Economics, Biology and many other associated areas. Statistics is fundamental to
making progress in those areas. The following examples illustrate this.
Example 1.1.1 In a study to examine the effects of sunlight exposure on the growth
of a new type of grass, grass seeds were sown in 22 identical specifically designed
containers. Grass in 11 of these containers was exposed to full sunlight during the
growing period and grass in the remaining 11 containers was exposed to 50% shade
during the growing period. At the end of the growing period, the biomass in each
container was measured and the following data (in coded units) were obtained:
Full sun:  1903, 1935, 1910, 2096, 2008, 1961, 2060, 1644, 1612, 1811, 1714
50% Shade: 1759, 1718, 1820, 1933, 1990, 1920, 1796, 1696, 1578, 1682, 1526    (1.1)
There are clearly several chance mechanisms determining the data values that we
observed. A different experiment would almost certainly give different data. The
data do not immediately indicate an obvious difference between the two groups,
and in order to make our assessment about a possible difference, we will have to
use statistical methods, which allow for the randomness in the data. The statistical
analysis of data of this form is discussed in Sects. 8.5 and 13.2.
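To see why the data in (1.1) do not settle the question on their own, it can help to compute a few simple summaries. The following is a minimal sketch, not the analysis developed later in the book. The variable names are our own, and the choice of the sample mean and sample standard deviation as summaries is an assumption made only for illustration.

```python
from statistics import mean, stdev

# Biomass data (coded units) copied from display (1.1).
full_sun = [1903, 1935, 1910, 2096, 2008, 1961, 2060, 1644, 1612, 1811, 1714]
half_shade = [1759, 1718, 1820, 1933, 1990, 1920, 1796, 1696, 1578, 1682, 1526]

# Group averages and their difference.
mean_sun = mean(full_sun)
mean_shade = mean(half_shade)
diff_means = mean_sun - mean_shade

# Sample standard deviations: how much containers vary within each group.
sd_sun = stdev(full_sun)
sd_shade = stdev(half_shade)

print(f"Full sun:  mean = {mean_sun:.1f}, sd = {sd_sun:.1f}")
print(f"50% shade: mean = {mean_shade:.1f}, sd = {sd_shade:.1f}")
print(f"Difference in means = {diff_means:.1f}")
```

The full-sun average is somewhat higher, but the variation from container to container within each group is of a similar size to the difference between the averages. This is the sense in which the data show no obvious difference and a formal statistical method is needed.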
Example 1.1.2 The data from the 2020 clinical trial of the proposed Moderna
COVID vaccine, in which 30,420 volunteers were divided into two groups, 15,210
being given the proposed vaccine and 15,210 being given a harmless placebo, are
given below. The data are taken from L. R. Baden et al. Efficacy and Safety of the
mRNA-1273 SARS-CoV-2 Vaccine, New England Journal of Medicine 384:403-416,
February 2021.
The way in which data such as those in this table are analyzed statistically will be
described in Chap. 10. For now, we note that if this clinical trial had been carried out
on a different sample of 30,420 people, almost certainly different data would have
arisen. Again, Statistics provides a process for handling data where randomness
such as this arises.
These two examples are enough to make two important points. The first is that
because of the randomness inherent in the sampling process, no exact statements
such as those made, for example, in Physics are possible. We will have to make
statements indicating some level of uncertainty in our conclusions. It is not possible,
in analyzing data derived from a sampling process, to be 100% certain that our
conclusion is correct. This indicates a real limitation to what can be asserted in
modern science. More specific information about this lack of certainty is introduced
in Sect. 9.2.2 and then methods for handling this uncertainty are developed in later
sections.
The second point is that, because of the unpredictable random aspect in the
generation of the data arising in many areas of science, it is necessary to first
consider various aspects of probability theory in order to know what probability
calculations are needed for the statistical problem at hand. This book therefore
starts with an introduction to probability theory, with no immediate reference to
the associated statistical procedures. This implies that before discussing the details
of probability theory, we first discuss the relation between probability theory and
Statistics.
1.2 The Relation Between Probability Theory and Statistics
We start with a simple example concerning the flipping of a coin. Suppose that we
have a coin that we suspect is biased towards heads. To check on this suspicion, we
flip the coin 2000 times and observe the number of heads that we get. Even if the
coin is fair, we would not expect, beforehand, to get exactly 1000 heads from the
2000 flips. This is because of the randomness inherent in the coin-flipping operation.
However, we would expect to see approximately 1000 heads. If, after flipping the
coin, we got 1373 heads, we would obviously (and reasonably) claim that we have
very good evidence that the coin is biased towards heads. The reasoning that one
goes through in coming to this conclusion is probably something like this: “if the
coin is fair, it is extremely unlikely that we would get 1373 or more heads from 2000
flips. But since we did in fact get 1373 heads, we have strong evidence that the coin
is unfair.”
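The claim that a fair coin will rarely give exactly 1000 heads, but will usually give a number close to 1000, can be checked numerically. The sketch below assumes that the 2000 flips are independent and that each has probability 1/2 of heads. The function name `count_heads`, the seed and the number of simulated experiments are arbitrary choices of ours, not taken from the text.

```python
import random
from math import comb

N_FLIPS = 2000

# Exact probability of exactly 1000 heads from 2000 flips of a fair coin:
# C(2000, 1000) / 2**2000, using exact integers before the final division.
p_exactly_1000 = comb(N_FLIPS, 1000) / 2**N_FLIPS

def count_heads(rng, n_flips=N_FLIPS):
    """Simulate n_flips flips of a fair coin and return the number of heads."""
    return sum(rng.random() < 0.5 for _ in range(n_flips))

rng = random.Random(0)  # fixed seed (arbitrary) so the run is reproducible
simulated_counts = [count_heads(rng) for _ in range(5)]

print(f"P(exactly 1000 heads) = {p_exactly_1000:.4f}")
print("Heads in 5 simulated experiments:", simulated_counts)
```

The exact probability is a little under 0.02, so exactly 1000 heads is itself an uncommon outcome, while the simulated counts should scatter around 1000 from one experiment to the next.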
Conversely, if we got 1005 heads, we would not reasonably conclude that we
have good evidence that the coin is biased towards heads. The reason for coming to
this conclusion is that, because of the randomness involved in the flipping of a coin,
a fair coin can easily give 1005 or more heads from 2000 flips, so that observing
1005 heads gives no significant evidence that the coin is unfair.
These two examples are extreme cases, and in reality we often have to deal with
more gray-area situations. If we saw 1072 heads, intuition and common sense might
not help. What we have to do is to calculate the probability of getting 1072 or more
heads if the coin is fair. Probability theory calculations (which we will do later) show
that the probability of getting 1072 or more heads from 2000 flips of a fair coin is
very low (about 0.0006). This probability calculation is a deduction, or implication.
It is very unlikely that a fair coin would turn up heads 1072 times or more from
2000 flips. From this fact and the fact that we did see 1072 heads on the 2000 flips
of the coin, we make the statistical induction, or inference, that we can reasonably
conclude that we have significant evidence that the coin is biased.
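The probability calculation referred to here can be sketched directly. The code below assumes independent flips of a fair coin, so that every sequence of 2000 flips is equally likely, and simply counts the sequences with at least the stated number of heads. The function name `fair_coin_upper_tail` is our own, and this exact count is not necessarily the method used later in the book.

```python
from math import comb

def fair_coin_upper_tail(n_flips, k_min):
    """P(at least k_min heads in n_flips independent flips of a fair coin)."""
    # Count the equally likely flip sequences with k_min or more heads,
    # then divide by the total number of sequences, 2**n_flips.
    favourable = sum(comb(n_flips, k) for k in range(k_min, n_flips + 1))
    return favourable / 2**n_flips

for heads in (1005, 1072, 1373):
    prob = fair_coin_upper_tail(2000, heads)
    print(f"P({heads} or more heads) = {prob:.3g}")
```

For 1005 heads the probability is large (well over 0.3), for 1373 heads it is vanishingly small, and for 1072 heads it is of the same order as the value of about 0.0006 quoted above. A small discrepancy in the last case would be expected if the quoted figure comes from an approximation rather than an exact count, which is an assumption on our part.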
The logic is as follows. Either the coin is fair and something very unlikely has
happened (probability about 0.0006) or the coin is not fair. We prefer to believe the
second possibility. We do not like to entertain a hypothesis that does not reasonably
explain what we saw in practice. This argument follows the procedures of modern
science.
In coming to the opinion that the coin is unfair we could be incorrect: the coin
might have been fair and something very unlikely might have happened (1072
heads). We have to accept this possibility when using Statistics: we cannot be certain
that any conclusion, that is, any statistical induction or inference, that we reach is
correct. This problem is discussed in detail later in this book.
To summarize: probability theory makes deductions, or implications. Statistics
makes inductions, or inferences. Each induction, or inference, is always based both
on data and the corresponding probability theory calculation relating to those data.
This induction might be incorrect because it is based on data in whose generation
randomness was involved.
In the coin example above, the statistical induction, or inference, that we made
(that we believe we have good evidence that the coin is unfair, given that there
were 1072 heads in the 2000 flips) was based entirely on the probability calculation
leading to the value 0.0006. In general, no statistical inference can be made
without first making the relevant probability theory calculation. This is one reason
why people often find Statistics difficult. In doing Statistics, we have to consider
aspects of probability theory, and unfortunately our intuition concerning probability
calculations is often incorrect.
Here is a more important example. Suppose that we are using some medicine
(the “current” medicine) to cure some illness. From experience we know that the
probability that this current medicine cures any person having this illness is
0.8. A new medicine is proposed as being better than the current one. To
test whether this claim is justified, we plan to conduct a clinical trial in which the
new medicine will be given to 2000 people suffering from the illness in question.
If the new medicine is only as effective as the current one, we would, beforehand,
expect it to cure about 2000 × 0.8 = 1600 of these people. Suppose that after the
clinical trial is conducted, the proposed new medicine cured 1643 people. Is this
significantly more than 1600? Calculations that we will do later show that the
probability that we would get 1643 or more people cured with the new medicine if it
is only as effective as the current medicine is about 0.009, or a bit less than 0.01.
Thus if the new medicine did
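The probability quoted for the clinical trial can be sketched as follows. The code assumes that each of the 2000 patients is cured independently with probability 0.8 (a binomial model, which is our assumption here) and writes 0.8 as the fraction 4/5 so that exact integer arithmetic can be used. The function name `binomial_upper_tail` and its parameters are hypothetical.

```python
from math import comb

def binomial_upper_tail(n, k_min, p_num, p_den):
    """P(X >= k_min) for X ~ Binomial(n, p), where p = p_num / p_den.

    Exact integers are used for the sum because the individual terms
    would underflow if computed with ordinary floating-point numbers.
    """
    q_num = p_den - p_num  # numerator of 1 - p
    favourable = sum(
        comb(n, k) * p_num**k * q_num**(n - k) for k in range(k_min, n + 1)
    )
    return favourable / p_den**n

n_patients = 2000
expected_cures = n_patients * 4 // 5  # 2000 * 0.8
p_tail = binomial_upper_tail(n_patients, 1643, 4, 5)

print("Expected number cured:", expected_cures)
print(f"P(1643 or more cured) = {p_tail:.4f}")
```

Under these assumptions the result should be a bit less than 0.01, in line with the value of about 0.009 given in the text.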