Statistics From Basics To Advanced
Statistical Inference
CHAPMAN & HALL/CRC
Texts in Statistical Science Series
Joseph K. Blitzstein, Harvard University, USA
Julian J. Faraway, University of Bath, UK
Martin Tanner, Northwestern University, USA
Jim Zidek, University of British Columbia, Canada
Surrogates
Gaussian Process Modeling, Design, and Optimization for the Applied Sciences
Robert B. Gramacy
Statistical Rethinking
A Bayesian Course with Examples in R and STAN, Second Edition
Richard McElreath
Miltiadis C. Mavrakakis
Jeremy Penzer
First edition published 2021
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
The right of Miltiadis C. Mavrakakis and Jeremy Penzer to be identified as the authors of the editorial
material, and of the authors for their individual chapters, has been asserted in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use.
The authors and publishers have attempted to trace the copyright holders of all material reproduced
in this publication and apologize to copyright holders if permission to publish in this form has not
been obtained. If any copyright material has not been acknowledged please write and let us know so
we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected].
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are
used only for identification and explanation without intent to infringe.
Preface
1 Introduction
2 Probability
2.1 Intuitive probability
2.2 Mathematical probability
2.2.1 Measure
2.2.2 Probability measure
2.3 Methods for counting outcomes
2.3.1 Permutations and combinations
2.3.2 Number of combinations and multinomial coefficients
2.4 Conditional probability and independence
2.4.1 Conditional probability
2.4.2 Law of total probability and Bayes’ theorem
2.4.3 Independence
2.5 Further exercises
2.6 Chapter summary
3.3.1 Discrete random variables and mass functions
3.3.2 Continuous random variables and density functions
3.3.3 Parameters and families of distributions
3.4 Expectation, variance, and higher moments
3.4.1 Mean of a random variable
3.4.2 Expectation operator
3.4.3 Variance of a random variable
3.4.4 Inequalities involving expectation
3.4.5 Moments
3.5 Generating functions
3.5.1 Moment-generating functions
3.5.2 Cumulant-generating functions and cumulants
3.6 Functions of random variables
3.6.1 Distribution and mass/density for g(X)
3.6.2 Monotone functions of random variables
3.7 Sequences of random variables and convergence
3.8 A more thorough treatment of random variables
3.9 Further exercises
3.10 Chapter summary
Index
Preface
Who is it for?
This book is for students who have already completed a first course in probability and
statistics, and now wish to deepen and broaden their understanding of the subject. It
can serve as a foundation for advanced undergraduate or postgraduate courses. Our
aim is to challenge and excite the more mathematically able students, while providing
explanations of statistical concepts that are more detailed and approachable than those
in advanced texts. This book is also useful for data scientists, researchers, and other
applied practitioners who want to understand the theory behind the statistical methods
used in their fields. As such, it is intended as an instructional text as well as a reference
book.
Chapter summary
Chapter 7 covers sampling distributions, including those of sample moments and order statistics. Chapter 8 introduces the key topics in classical statistical inference: point estimation, interval estimation, and hypothesis testing.
The main techniques in likelihood-based inference are developed further in Chapter 9,
while Chapter 10 examines their theoretical basis. Chapter 11 is an overview of the
fundamental concepts of Bayesian statistics. Finally, Chapter 12 is an introduction to
some of the main computational methods used in modern statistical inference.
Teaching a course
The material in this book forms the basis of our 40-lecture course, spread across two
academic terms. Our students are mostly second- and third-year undergraduates who
have already completed an introductory statistics course, so they have some level
of familiarity with probability and distribution theory (Chapters 2–5). We use the
opportunity to focus on the more abstract concepts – measure-theoretic probability
(section 2.2), convergence of sequences of random variables (3.7), etc. – and introduce
some useful modelling frameworks, such as hierarchical models and mixtures (5.5),
and random sums (5.7).
We approach the section on statistical inference (Chapters 7–10) in a similar way.
As our students have already encountered the fundamental concepts of sampling
distributions, parameter estimation, and hypothesis tests, we spend more time on
advanced topics, such as order statistics (7.4), likelihood-ratio tests (9.4), and the
theoretical basis for likelihood-based inference (9.3 and 10.1).
In addition to these, we cover the first few sections in Chapter 6 (6.1–6.3) and a
selection of topics from the remaining parts. These are almost entirely self-contained,
so it is possible to focus on, say, generalised linear models (6.4) and survival models
(6.5), and omit stochastic processes (6.7 and 6.8) – or vice versa. Note that the MCMC
techniques in section 12.3 require at least a basic familiarity with Markov chain theory
(6.8).
For most of our students, this course is their first encounter with Bayesian inference,
and we spend quite a lot of time on the basic ideas in Chapter 11 (prior and posterior,
loss functions and estimators). Finally, we demonstrate the main techniques in Chapter 12 by implementing the algorithms in R, and encourage the students to do the same.
For a shorter course, both chapters can be omitted entirely.
There are many other ways to structure a course from this book. For example, Chapters
2–5 could be taught as a standalone course in mathematical probability and distribution theory. This would be useful as a prerequisite for advanced courses in statistics.
Alternatively, Chapters 6, 11, and 12 could form the basis of a more practical course
on statistical modelling.
Exercises appear within each section and also at the end of each chapter. The former
are generally shorter and more straightforward, while the latter seek to test the key
learning outcomes of each chapter, and are more difficult as a result. Solutions to the
exercises will appear in a separate volume.
Jeremy’s acknowledgements
This book started life as a set of study guides. A number of people helped enormously
in the preparation of these guides; in particular, Chris Jones and Wicher Bergsma (who
reviewed early versions), James Abdey (who proofread patiently and thoroughly),
and Martin Knott and Qiwei Yao (who were generous with both their time and their
advice) – thank you all. I would like to acknowledge that without Milt this would have
remained, at best, a half-finished piece of work. I am very grateful to my wife Debbie
whose love and support make all things possible.
Milt’s acknowledgements
I would like to thank Jeremy for starting this book and trusting me to finish it; Mark
Spivack, for fighting my corner when it really mattered; Lynn Frank, for being the
very first person who encouraged me to teach; my LSE students, for helping me
polish this material; and my colleagues at Smartodds, particularly Iain and Stu, for
their valuable suggestions which helped keep the book on track. I am forever grateful
to my parents, Maria and Kostas, for their unconditional love and support. Finally,
I would like to thank my wife, Sitali, for being my most passionate fan and fiercest
critic, as circumstances dictated; this book would never have been completed without
her.
About the authors
Jeremy Penzer’s first post-doc job was as a research assistant at the London School of
Economics. Jeremy went on to become a lecturer at LSE and to teach the second-year
statistical inference course (ST202) that formed the starting point for this book. While
working at LSE, his research interests were time series analysis and computational
statistics. After 12 years as an academic, Jeremy shifted career to work in financial
services. He is currently Chief Marketing and Analytics Officer for Capital One
Europe (plc). Jeremy lives just outside Nottingham with his wife and two daughters.
Miltiadis C. Mavrakakis obtained his PhD in Statistics at LSE under the supervision
of Jeremy Penzer. His first job was as a teaching fellow at LSE, taking over course
ST202 and completing this book in the process. He splits his time between lecturing (at LSE, Imperial College London, and the University of London International Programme) and his applied statistical work. Milt is currently a Senior Analyst at Smartodds, a sports betting consultancy, where he focuses on the statistical modelling of sports and financial markets. He lives in London with his wife, son, and
daughter.
Chapter 1
Introduction
Statistics is concerned with methods for collecting, analysing, and drawing conclu-
sions from data. A clear understanding of the theoretical properties of these methods
is of paramount importance. However, this theory is often taught in a way that is
completely detached from the real problems that motivate it.
To infer is to draw general conclusions on the basis of specific observations, which
is a skill we begin to develop at an early age. It is such a fundamental part of our
intelligence that we do it without even thinking about it. We learn to classify objects
on the basis of a very limited training set. From a few simple pictures, a child learns
to infer that anything displaying certain characteristics (a long neck, long thin legs,
large brown spots) is a giraffe. In statistical inference, our specific observations take
the form of a data set. For our purposes, a data set is a collection of numbers.
Statistical inference uses these numbers to make general statements about the process
that generated the data.
Uncertainty is part of life. If we were never in any doubt about what was going to
happen next, life would be rather dull. We all possess an intuitive sense that some
things are more certain than others. If I knock over the bottle of water that is sitting
on my desk, I can be pretty sure some of the contents will spill out; as we write this
book, we might hope that it is going to be an international bestseller, but there are
many other (far more likely) outcomes. Statistical inference requires us to do more
than make vague statements like “I can be pretty sure” and “there are more likely
outcomes”. We need to be able to quantify our uncertainty by attaching numbers to
possible events. These numbers are referred to as probabilities.
The theory of probability did not develop in a vacuum; early work in the field was motivated by an interest in gambling odds. Our aim in studying probability is to build the framework for modelling real-world phenomena. At the heart of the models that we
build is the notion of a random variable and an associated distribution. Probability
and distribution theory provide the foundation. Their true value becomes apparent
when they are applied to questions of inference.
Statistical inference is often discussed in the context of a sample from a population. Suppose that we are interested in some characteristic of individuals in a given
population. We cannot take measurements for every member of the population so
instead we take a sample. The data set consists of a collection of measurements of
the characteristic of interest for the individuals in the sample. In this context, we will
often refer to the data set as the observed sample. Our key inferential question is then,
“what can we infer about the population from our observed sample?”.
Consider the following illustration. Zoologists studying sea birds on an island in the
South Atlantic are interested in the journeys made in search of food by members of
a particular species. They capture a sample of birds of this species, fit them with a
small tracking device, then release them. They measure the distance travelled over a
week by each of the birds in the sample. In this example, the population might be
taken to be all birds of this particular species based on the island. The characteristic
of interest is distance travelled over the week for which the experiment took place.
The observed sample will be a collection of measurements of distance travelled.
Inferential questions that we might ask include:
• What is a reasonable estimate of the population mean?
• How sure can we be of our estimate?
• Do the data provide evidence to support a particular claim about the distances travelled?
Consider the first of these questions. The population mean is the value that we would
get if we measured distance travelled for every member of the population, added up
these values, and divided by the population size. We can use our observed sample
to construct an estimate of the population mean. The usual way to do this is to
add up all the observed sample values and divide by the sample size; the resulting
value is the observed sample mean. Computing observed sample values, such as the
mean, median, maximum, and so on, forms the basis of descriptive statistics. Although
widely used (most figures given in news reports are presented as descriptive statistics),
descriptive statistics cannot answer questions of inference adequately.
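To make this concrete, the following R code computes these descriptive statistics for a small set of distance measurements; the values are invented purely for illustration.

# A hypothetical observed sample: distance travelled (km) by ten tracked birds
distances <- c(312, 287, 355, 298, 401, 276, 333, 290, 318, 362)

mean(distances)    # observed sample mean: sum of the values divided by the sample size
median(distances)  # middle value of the sorted sample
max(distances)     # largest observed value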
The first of our questions asks, “what is a reasonable estimate of the population
mean?”. This brings us to a key idea in statistical inference: it is the properties of the
sample that determine whether an estimate is reasonable or not. Consider a sample
that consists of a small number of the weakest birds on the island; it is clear that
this sample will not generate a reasonable estimate of the population mean. If, on
the other hand, we sample at random and our sample is large, the observed sample
mean provides an intuitively appealing estimate. Using ideas from probability and
distribution theory, we can give a precise justification for this intuitive appeal.
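A small simulation sketches this intuition. The population below is entirely notional (its size and distribution are assumptions made for illustration), but it shows how a large random sample tracks the population mean while a "weakest birds" sample does not.

set.seed(1)
# Notional population of weekly distances (km) for 10,000 birds
population <- rgamma(10000, shape = 20, rate = 0.06)
mean(population)                                 # the population mean (about 333 km)

random_sample <- sample(population, size = 100)  # every bird equally likely to be chosen
mean(random_sample)                              # close to the population mean

biased_sample <- head(sort(population), 10)      # only the ten shortest journeys
mean(biased_sample)                              # grossly underestimates the population mean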
It is important to recognise that all statistical inference takes place in the context
of a model, that is, a mathematical idealisation of the real situation. In practice,
data collection methods often fall well short of what is required to satisfy the model’s
properties. Consider sampling at random from a population. This requires us to ensure
that every member of the population is equally likely to be included in the sample. It
is hard to come up with a practical situation where this requirement can be met. In
the sea birds illustration, the sample consists of those birds that the zoologists capture
and release. While they might set their traps in order to try to ensure that every bird in
the population is equally likely to be captured, in practice they may be more likely to
capture young, inexperienced birds or, if their traps are baited, the hungriest members
of the population. In the sage, if well-worn, words of George Box, “all models are
wrong, but some are useful”. The best we can hope for is a model that provides us
with a useful representation of the process that generates the data.
A common mistake in statistical inference is to extend conclusions beyond what can
be supported by the data. In the sea bird illustration, there are many factors that
potentially have an impact on the length of journeys: for example, prevailing weather
conditions and location of fish stocks. As the zoologists measure the distance travelled
for one week, it is reasonable to use our sample to draw inferences for this particular
week. However, if we were to try to extend this to making general statements for
any week of the year, the presence of factors that vary from week to week would
invalidate our conclusions. To illustrate, suppose the week in which the experiment
was conducted was one with particularly unfavourable weather conditions. In this
situation, the observed sample mean will grossly underestimate the distance travelled
by the birds in a normal week. One way in which the zoologists could collect data that
would support wider inferences would be to take measurements on randomly selected
days of the year. In general, close attention should be paid to the circumstances in
which the data are collected; these circumstances play a crucial role in determining
what constitutes sensible inference.
Statistical inference is often divided into three topics: point estimation, interval estimation, and hypothesis testing. Roughly, these correspond to the three questions
asked in the sea birds illustration. Finding a reasonable estimate for the population
mean is a question of point estimation. In general, point estimation covers methods for
generating single-number estimates for population characteristics and the properties
of these methods. Any estimate of a population characteristic has some uncertainty
associated with it. One way to quantify this uncertainty is to provide an interval that
covers, with a given probability, the value of the population characteristic of interest;
this is referred to as interval estimation. An interval estimate provides one possible
mechanism for answering the second question, “how sure can we be of our estimate?”.
The final question relates to the use of data as evidence. This is an important type
of problem whose relevance is sometimes lost in the formality of the third topic,
hypothesis testing. Although these labels are useful, we will try not to get too hung up
on them; understanding the common elements in these topics is at least as important
as understanding the differences.
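As a rough sketch of how the three topics look in practice, the R code below applies each to a simulated sample of distances; the normal model and all numerical values are assumptions made for illustration only.

set.seed(2)
distances <- rnorm(50, mean = 330, sd = 40)  # stand-in for an observed sample (km)

mean(distances)               # point estimation: a single-number estimate of the mean
t.test(distances)$conf.int    # interval estimation: a 95% confidence interval
t.test(distances, mu = 300)   # hypothesis testing: do the data provide evidence
                              # against a population mean of 300 km?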
The ideas of population and sample are so useful that the terminology appears even in situations where the population is not clearly identified. Consider the
following illustration. A chemist measures the mass of a chemical reagent that is
produced by an experiment. The experiment is repeated many times; on each occasion
the mass is measured and all the measurements are collected as a data set. There is no
obvious population in this illustration. However, we can still ask sensible inferential
questions, such as, “what is a reasonable estimate of the mass of the reagent we might
expect to be produced if we were to conduct the experiment again?”. The advantage of
thinking in terms of statistical models is now more apparent. It may be reasonable to
build a model for the measurements of mass of reagent that shares the properties of the
model of distance travelled by the sea birds. Although there is no obvious population,
the expected mass of reagent described by this model may still be referred to as the
population mean. Although we are not doing any sampling, the data set may still be
referred to as the observed sample. In this general context, the model may be thought
of as our mathematical representation of the process that generated the data.
So far, the illustrations we have considered are univariate, that is, we consider a
single characteristic of interest. Data sets that arise from real problems often contain
measurements of several characteristics. For example, when the sea birds are captured,
the zoologists might determine their gender and measure their weight and wingspan before releasing them. In this context, we often refer to the characteristics
that we record as variables. Data sets containing information on several variables
are usually represented as a table with columns corresponding to variables and rows
corresponding to individual sample members (a format familiar to anyone who has
used a spreadsheet). We may treat all the variables as having identical status and, by
building models that extend those used in the univariate case, try to take account of
possible relationships between variables. This approach and its extensions are often
referred to as multivariate analysis.
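In R, such a table is naturally represented as a data frame, with one row per sampled bird and one column per variable; the values below are invented for illustration.

birds <- data.frame(
  gender   = c("F", "M", "F", "M"),
  weight   = c(3.1, 3.6, 2.9, 3.4),  # kg
  wingspan = c(2.0, 2.2, 1.9, 2.1)   # m
)
birds           # columns correspond to variables, rows to sample members
summary(birds)  # descriptive statistics for each variable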
Data sets containing several variables often arise in a related, though subtly different, context. Consider the chemistry experiment described above. Suppose that the
temperature of the reaction chamber is thought to have an impact on the mass of
reagent produced. In order to test this, the experimenter chooses two temperatures
and runs the experiment 100 times at the first temperature, then 100 times at the
second temperature. The most convenient way to represent this situation is as two
variables, one being the mass of reagent produced and the other the temperature of the reaction chamber. An inferential question arises: do the results of the experiment provide
evidence to suggest that reaction chamber temperature has an impact on the mass of
reagent produced? As we think that the variations in reaction chamber temperature
may explain variations in the production of reagent, we refer to temperature as an
explanatory variable, and mass of reagent produced as the response variable. This
setup is a simple illustration of a situation where a linear model may be appropriate.
Linear models are a broad class that include as special cases models for regression,
analysis of variance, and factorial design.
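The sketch below fits this two-temperature setup as a simple linear model in R. The temperatures, sample sizes, and data-generating mechanism are all hypothetical; the point is simply how the explanatory and response variables enter the model.

set.seed(3)
temperature <- rep(c(150, 170), each = 100)            # explanatory variable (two settings)
mass <- 5 + 0.02 * temperature + rnorm(200, sd = 0.5)  # response: mass of reagent produced

fit <- lm(mass ~ temperature)  # regress mass on temperature
summary(fit)                   # does temperature appear to affect the mass produced?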
In the example above, notice that we make a distinction between the variable that
we control (the explanatory variable, temperature) and the variable that we measure
(the response variable, mass of reagent). In many situations, particularly in the social
sciences, our data do not come from a designed experiment; they are values of
phenomena over which we exert no influence. These are sometimes referred to as
observational studies. To illustrate, suppose that we collect monthly figures for the
UK average house price and a broad measure of money supply. We may then ask
“what is the impact of money supply on house prices?”. Although we do not control
the money supply, we may treat it as an explanatory variable in a model with average
house price as the response; this model may provide useful inferences about the
association between the two variables.
The illustrations in this chapter attempt to give a flavour of the different situations in
which questions of statistical inference arise. What these situations have in common
is that we are making inferences under uncertainty. In measuring the wingspan of sea
birds, it is sampling from a population that introduces the uncertainty. In the house
price illustration, in common with many economic problems, we are interested in a
variable whose value is determined by the actions of tens of thousands of individuals.
The decisions made by each individual will, in turn, be influenced by a large number
of factors, many of them barely tangible. Thus, there are a huge number of factors that
combine to determine the value of an economic variable. It is our lack of knowledge
of these factors that is often thought of as the source of uncertainty. While sources
of uncertainty may differ, the statistical models and methods that are useful remain
broadly the same. It is these models and methods that are the subject of this book.