Probability and

Statistical Inference
CHAPMAN & HALL/CRC
Texts in Statistical Science Series
Joseph K. Blitzstein, Harvard University, USA
Julian J. Faraway, University of Bath, UK
Martin Tanner, Northwestern University, USA
Jim Zidek, University of British Columbia, Canada

Recently Published Titles

Surrogates
Gaussian Process Modeling, Design, and Optimization for the Applied Sciences
Robert B. Gramacy

Statistical Analysis of Financial Data
With Examples in R
James Gentle

Statistical Rethinking
A Bayesian Course with Examples in R and STAN, Second Edition
Richard McElreath

Statistical Machine Learning
A Model-Based Approach
Richard Golden

Randomization, Bootstrap and Monte Carlo Methods in Biology
Fourth Edition
Bryan F. J. Manly, Jorge A. Navarro Alberto

Principles of Uncertainty, Second Edition
Joseph B. Kadane

Beyond Multiple Linear Regression
Applied Generalized Linear Models and Multilevel Models in R
Paul Roback, Julie Legler

Bayesian Thinking in Biostatistics
Gary L. Rosner, Purushottam W. Laud, and Wesley O. Johnson

Modern Data Science with R, Second Edition
Benjamin S. Baumer, Daniel T. Kaplan, and Nicholas J. Horton

Probability and Statistical Inference
From Basic Principles to Advanced Models
Miltiadis C. Mavrakakis and Jeremy Penzer

For more information about this series, please visit:
https://www.crcpress.com/Chapman--HallCRC-Texts-in-Statistical-Science/book-series/CHTEXSTASCI

Probability and
Statistical Inference
From Basic Principles to
Advanced Models

Miltiadis C. Mavrakakis
Jeremy Penzer
First edition published 2021
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742

and by CRC Press
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

© 2021 Taylor & Francis Group, LLC

CRC Press is an imprint of Taylor & Francis Group, LLC

The right of Miltiadis C. Mavrakakis and Jeremy Penzer to be identified as the authors of the editorial
material, and of the authors for their individual chapters, has been asserted in accordance with sec-
tions 77 and 78 of the Copyright, Designs and Patents Act 1988.

Reasonable efforts have been made to publish reliable data and information, but the author and pub-
lisher cannot assume responsibility for the validity of all materials or the consequences of their use.
The authors and publishers have attempted to trace the copyright holders of all material reproduced
in this publication and apologize to copyright holders if permission to publish in this form has not
been obtained. If any copyright material has not been acknowledged please write and let us know so
we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information stor-
age or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com
or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA
01923, 978-750-8400. For works that are not available on CCC please contact
mpkbookspermissions@tandf.co.uk.
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are
used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

ISBN: 9781584889397 (hbk)
ISBN: 9780367749125 (pbk)
ISBN: 9781315366630 (ebk)

Typeset in Computer Modern font by KnowledgeWorks Global Ltd.
For JP... this book is dedicated to Debbie, Lily, and Bella.

For MCM... this book is dedicated to Sitali, Kostas, and Liseli.
Contents

Preface xv

About the authors xix

1 Introduction 1

2 Probability 7
2.1 Intuitive probability 7
2.2 Mathematical probability 9
2.2.1 Measure 9
2.2.2 Probability measure 11
2.3 Methods for counting outcomes 16
2.3.1 Permutations and combinations 17
2.3.2 Number of combinations and multinomial coefficients 20
2.4 Conditional probability and independence 23
2.4.1 Conditional probability 23
2.4.2 Law of total probability and Bayes’ theorem 25
2.4.3 Independence 27
2.5 Further exercises 31
2.6 Chapter summary 32

3 Random variables and univariate distributions 33
3.1 Mapping outcomes to real numbers 33
3.2 Cumulative distribution functions 37
3.3 Discrete and continuous random variables 41

3.3.1 Discrete random variables and mass functions 43
3.3.2 Continuous random variables and density functions 51
3.3.3 Parameters and families of distributions 60
3.4 Expectation, variance, and higher moments 61
3.4.1 Mean of a random variable 61
3.4.2 Expectation operator 63
3.4.3 Variance of a random variable 65
3.4.4 Inequalities involving expectation 67
3.4.5 Moments 70
3.5 Generating functions 74
3.5.1 Moment-generating functions 74
3.5.2 Cumulant-generating functions and cumulants 79
3.6 Functions of random variables 81
3.6.1 Distribution and mass/density for g(X) 82
3.6.2 Monotone functions of random variables 84
3.7 Sequences of random variables and convergence 88
3.8 A more thorough treatment of random variables 93
3.9 Further exercises 96
3.10 Chapter summary 99

4 Multivariate distributions 101
4.1 Joint and marginal distributions 101
4.2 Joint mass and joint density 104
4.2.1 Mass for discrete distributions 104
4.2.2 Density for continuous distributions 107
4.3 Expectation and joint moments 116
4.3.1 Expectation of a function of several variables 116
4.3.2 Covariance and correlation 117
4.3.3 Joint moments 120
4.3.4 Joint moment-generating functions 121
4.4 Independent random variables 123
4.4.1 Independence for pairs of random variables 124
4.4.2 Mutual independence 125
4.4.3 Identical distributions 126
4.5 Random vectors and random matrices 127
4.6 Transformations of continuous random variables 130
4.6.1 Bivariate transformations 130
4.6.2 Multivariate transformations 133
4.7 Sums of random variables 134
4.7.1 Sum of two random variables 135
4.7.2 Sum of n independent random variables 137
4.8 Multivariate normal distribution 138
4.8.1 Bivariate case 139
4.8.2 n-dimensional multivariate case 140
4.9 Further exercises 143
4.10 Chapter summary 145

5 Conditional distributions 147
5.1 Discrete conditional distributions 147
5.2 Continuous conditional distributions 150
5.3 Relationship between joint, marginal, and conditional 153
5.4 Conditional expectation and conditional moments 154
5.4.1 Conditional expectation 155
5.4.2 Conditional moments 158
5.4.3 Conditional moment-generating functions 161
5.5 Hierarchies and mixtures 162
5.6 Random sums 164
5.7 Conditioning for random vectors 166
5.8 Further exercises 167
5.9 Chapter summary 169
6 Statistical models 171
6.1 Modelling terminology, conventions, and assumptions 171
6.1.1 Sample, observed sample, and parameters 171
6.1.2 Structural and distributional assumptions 172
6.2 Independent and identically distributed sequences 173
6.2.1 Random sample 173
6.2.2 Error sequences 173
6.3 Linear models 174
6.3.1 Simple linear regression 174
6.3.2 Multiple linear regression 175
6.3.3 Applications 176
6.4 Generalised linear models 177
6.4.1 Motivation 177
6.4.2 Link function 178
6.5 Time-to-event models 181
6.5.1 Survival function and hazard function 182
6.5.2 Censoring of time-to-event data 186
6.5.3 Covariates in time-to-event models 187
6.6 Time series models 188
6.6.1 Autoregressive models 191
6.6.2 Moving-average models 191
6.6.3 Autocovariance, autocorrelation, and stationarity 192
6.7 Poisson processes 197
6.7.1 Stochastic processes and counting processes 197
6.7.2 Definitions of the Poisson process 199
6.7.3 Thinning and superposition 203
6.7.4 Arrival and interarrival times 203
6.7.5 Compound Poisson process 205
6.7.6 Non-homogeneous Poisson process 205
6.8 Markov chains 207
6.8.1 Classification of states and chains 210
6.8.2 Absorption 211
6.8.3 Periodicity 213
6.8.4 Limiting distribution 214
6.8.5 Recurrence and transience 215
6.8.6 Continuous-time Markov chains 216
6.9 Further exercises 217
6.10 Chapter summary 220

7 Sample moments and quantiles 223
7.1 Sample mean 223
7.1.1 Mean and variance of the sample mean 224
7.1.2 Central limit theorem 225
7.2 Higher-order sample moments 228
7.2.1 Sample variance 229
7.2.2 Joint sample moments 233
7.3 Sample mean and variance for a normal population 236
7.4 Sample quantiles and order statistics 240
7.4.1 Sample minimum and sample maximum 242
7.4.2 Distribution of ith order statistic 243
7.5 Further exercises 248
7.6 Chapter summary 249

8 Estimation, testing, and prediction 251
8.1 Functions of a sample 251
8.1.1 Statistics 251
8.1.2 Pivotal functions 252
8.2 Point estimation 255
8.2.1 Bias, variance, and mean squared error 257
8.2.2 Consistency 258
8.2.3 The method of moments 261
8.2.4 Ordinary least squares 263
8.3 Interval estimation 267
8.3.1 Coverage probability and length 269
8.3.2 Constructing interval estimators using pivotal functions 272
8.3.3 Constructing interval estimators using order statistics 275
8.3.4 Confidence sets 278
8.4 Hypothesis testing 278
8.4.1 Statistical hypotheses 279
8.4.2 Decision rules 280
8.4.3 Types of error and the power function 281
8.4.4 Basic ideas in constructing tests 282
8.4.5 Conclusions and p-values from tests 283
8.5 Prediction 285
8.6 Further exercises 289
8.7 Chapter summary 290

9 Likelihood-based inference 293
9.1 Likelihood function and log-likelihood function 293
9.2 Score and information 296
9.3 Maximum-likelihood estimation 302
9.3.1 Properties of maximum-likelihood estimates 304
9.3.2 Numerical maximisation of likelihood 308
9.3.3 EM algorithm 310
9.4 Likelihood-ratio test 316
9.4.1 Testing in the presence of nuisance parameters 317
9.4.2 Properties of the likelihood ratio 318
9.4.3 Approximate tests 320
9.5 Further exercises 323
9.6 Chapter summary 326
10 Inferential theory 327
10.1 Sufficiency 327
10.1.1 Sufficient statistics and the sufficiency principle 327
10.1.2 Factorisation theorem 329
10.1.3 Minimal sufficiency 334
10.1.4 Application of sufficiency in point estimation 338
10.2 Variance of unbiased estimators 341
10.3 Most powerful tests 345
10.4 Further exercises 350
10.5 Chapter summary 352

11 Bayesian inference 355
11.1 Prior and posterior distributions 355
11.2 Choosing a prior 357
11.2.1 Constructing reference priors 357
11.2.2 Conjugate priors 360
11.3 Bayesian estimation 363
11.3.1 Point estimators 363
11.3.2 Absolute loss 364
11.3.3 0-1 loss 364
11.3.4 Interval estimates 366
11.4 Hierarchical models and empirical Bayes 370
11.4.1 Hierarchical models 370
11.4.2 Empirical Bayes 371
11.4.3 Predictive inference 373
11.5 Further exercises 376
11.6 Chapter summary 378
12 Simulation methods 379
12.1 Simulating independent values from a distribution 379
12.1.1 Table lookup 380
12.1.2 Probability integral 381
12.1.3 Box-Muller method 382
12.1.4 Accept/reject method 382
12.1.5 Composition 385
12.1.6 Simulating model structure and the bootstrap 387
12.2 Monte Carlo integration 390
12.2.1 Averaging over simulated instances 390
12.2.2 Univariate vs. multivariate integrals 391
12.2.3 Importance sampling 394
12.2.4 Antithetic variates 397
12.3 Markov chain Monte Carlo 400
12.3.1 Discrete Metropolis 400
12.3.2 Continuous Metropolis 402
12.3.3 Metropolis-Hastings algorithm 402
12.3.4 Gibbs sampler 404
12.4 Further exercises 408
12.5 Chapter summary 410

A Proof of Proposition 5.7.2 411

Index 415
Preface

About this book

Probability and Statistical Inference: From Basic Principles to Advanced Models
covers aspects of probability, distribution theory, and inference that are fundamental
to a proper understanding of data analysis and statistical modelling. It presents these
topics in an accessible manner without sacrificing mathematical rigour, bridging the
gap between the many excellent introductory books and the more advanced, graduate-
level texts. The book introduces and explores techniques that are relevant to modern
practitioners, while being respectful to the history of statistical inference. It seeks to
provide a thorough grounding in both the theory and application of statistics, with
even the more abstract parts placed in the context of a practical setting.

Who is it for?

This book is for students who have already completed a first course in probability and
statistics, and now wish to deepen and broaden their understanding of the subject. It
can serve as a foundation for advanced undergraduate or postgraduate courses. Our
aim is to challenge and excite the more mathematically able students, while providing
explanations of statistical concepts that are more detailed and approachable than those
in advanced texts. This book is also useful for data scientists, researchers, and other
applied practitioners who want to understand the theory behind the statistical methods
used in their fields. As such, it is intended as an instructional text as well as a reference
book.

Chapter summary

Chapter 2 is a complete introduction to the elements of probability theory. This is
the foundation for random variables and distribution theory, which are explored in
Chapter 3. These concepts are extended to multivariate distributions in Chapter 4,
and conditional distributions in Chapter 5. Chapter 6 is a concise but broad account of
statistical modelling, covering many topics that are essential to modern statisticians,
such as generalised linear models, survival analysis, time series, and random pro-
cesses. Sampling distributions are first encountered in Chapter 7, through the concept

of sample moments and order statistics. Chapter 8 introduces the key topics in classi-
cal statistical inference: point estimation, interval estimation, and hypothesis testing.
The main techniques in likelihood-based inference are developed further in Chapter 9,
while Chapter 10 examines their theoretical basis. Chapter 11 is an overview of the
fundamental concepts of Bayesian statistics. Finally, Chapter 12 is an introduction to
some of the main computational methods used in modern statistical inference.

Teaching a course

The material in this book forms the basis of our 40-lecture course, spread across two
academic terms. Our students are mostly second- and third-year undergraduates who
have already completed an introductory statistics course, so they have some level
of familiarity with probability and distribution theory (Chapters 2–5). We use the
opportunity to focus on the more abstract concepts – measure-theoretic probability
(section 2.2), convergence of sequences of random variables (3.7), etc. – and introduce
some useful modelling frameworks, such as hierarchical models and mixtures (5.5),
and random sums (5.6).
We approach the section on statistical inference (Chapters 7–10) in a similar way.
As our students have already encountered the fundamental concepts of sampling
distributions, parameter estimation, and hypothesis tests, we spend more time on
advanced topics, such as order statistics (7.4), likelihood-ratio tests (9.4), and the
theoretical basis for likelihood-based inference (9.3 and 10.1).
In addition to these, we cover the first few sections in Chapter 6 (6.1–6.3) and a
selection of topics from the remaining parts. These are almost entirely self-contained,
so it is possible to focus on, say, generalised linear models (6.4) and survival models
(6.5), and omit stochastic processes (6.7 and 6.8) – or vice versa. Note that the MCMC
techniques in section 12.3 require at least a basic familiarity with Markov chain theory
(6.8).
For most of our students, this course is their first encounter with Bayesian inference,
and we spend quite a lot of time on the basic ideas in Chapter 11 (prior and posterior,
loss functions and estimators). Finally, we demonstrate the main techniques in Chapter
12 by implementing the algorithms in R, and encouraging the students to do the same.
For a shorter course, both chapters can be omitted entirely.
There are many other ways to structure a course from this book. For example, Chapters
2–5 could be taught as a standalone course in mathematical probability and distribu-
tion theory. This would be useful as a prerequisite for advanced courses in statistics.
Alternatively, Chapters 6, 11, and 12 could form the basis of a more practical course
on statistical modelling.
Exercises appear within each section and also at the end of each chapter. The former
are generally shorter and more straightforward, while the latter seek to test the key
learning outcomes of each chapter, and are more difficult as a result. Solutions to the
exercises will appear in a separate volume.
Jeremy’s acknowledgements

This book started life as a set of study guides. A number of people helped enormously
in the preparation of these guides; in particular, Chris Jones and Wicher Bergsma (who
reviewed early versions), James Abdey (who proofread patiently and thoroughly),
and Martin Knott and Qiwei Yao (who were generous with both their time and their
advice) – thank you all. I would like to acknowledge that without Milt this would have
remained, at best, a half-finished piece of work. I am very grateful to my wife Debbie
whose love and support make all things possible.

Milt’s acknowledgements

I would like to thank Jeremy for starting this book and trusting me to finish it; Mark
Spivack, for fighting my corner when it really mattered; Lynn Frank, for being the
very first person who encouraged me to teach; my LSE students, for helping me
polish this material; and my colleagues at Smartodds, particularly Iain and Stu, for
their valuable suggestions which helped keep the book on track. I am forever grateful
to my parents, Maria and Kostas, for their unconditional love and support. Finally,
I would like to thank my wife, Sitali, for being my most passionate fan and fiercest
critic, as circumstances dictated; this book would never have been completed without
her.
About the authors

Jeremy Penzer’s first post-doc job was as a research assistant at the London School of
Economics. Jeremy went on to become a lecturer at LSE and to teach the second-year
statistical inference course (ST202) that formed the starting point for this book. While
working at LSE, his research interests were time series analysis and computational
statistics. After 12 years as an academic, Jeremy shifted career to work in financial
services. He is currently Chief Marketing and Analytics Officer for Capital One
Europe (plc). Jeremy lives just outside Nottingham with his wife and two daughters.

Miltiadis C. Mavrakakis obtained his PhD in Statistics at LSE under the supervision
of Jeremy Penzer. His first job was as a teaching fellow at LSE, taking over course
ST202 and completing this book in the process. He splits his time between lectur-
ing (at LSE, Imperial College London, and the University of London International
Programme) and his applied statistical work. Milt is currently a Senior Analyst at
Smartodds, a sports betting consultancy, where he focuses on the statistical mod-
elling of sports and financial markets. He lives in London with his wife, son, and
daughter.

Chapter 1

Introduction

Statistics is concerned with methods for collecting, analysing, and drawing conclu-
sions from data. A clear understanding of the theoretical properties of these methods
is of paramount importance. However, this theory is often taught in a way that is
completely detached from the real problems that motivate it.
To infer is to draw general conclusions on the basis of specific observations, which
is a skill we begin to develop at an early age. It is such a fundamental part of our
intelligence that we do it without even thinking about it. We learn to classify objects
on the basis of a very limited training set. From a few simple pictures, a child learns
to infer that anything displaying certain characteristics (a long neck, long thin legs,
large brown spots) is a giraffe. In statistical inference, our specific observations take
the form of a data set. For our purposes, a data set is a collection of numbers.
Statistical inference uses these numbers to make general statements about the process
that generated the data.
Uncertainty is part of life. If we were never in any doubt about what was going to
happen next, life would be rather dull. We all possess an intuitive sense that some
things are more certain than others. If I knock over the bottle of water that is sitting
on my desk, I can be pretty sure some of the contents will spill out; as we write this
book, we might hope that it is going to be an international bestseller, but there are
many other (far more likely) outcomes. Statistical inference requires us to do more
than make vague statements like “I can be pretty sure” and “there are more likely
outcomes”. We need to be able to quantify our uncertainty by attaching numbers to
possible events. These numbers are referred to as probabilities.
The theory of probability did not develop in a vacuum. Our aim in studying probability
is to build the framework for modelling real-world phenomena; early work in the field
was motivated by an interest in gambling odds. At the heart of the models that we
build is the notion of a random variable and an associated distribution. Probability
and distribution theory provide the foundation. Their true value becomes apparent
when they are applied to questions of inference.
Statistical inference is often discussed in the context of a sample from a popula-
tion. Suppose that we are interested in some characteristic of individuals in a given

population. We cannot take measurements for every member of the population so
instead we take a sample. The data set consists of a collection of measurements of
the characteristic of interest for the individuals in the sample. In this context, we will
often refer to the data set as the observed sample. Our key inferential question is then,
“what can we infer about the population from our observed sample?”.
Consider the following illustration. Zoologists studying sea birds on an island in the
South Atlantic are interested in the journeys made in search of food by members of
a particular species. They capture a sample of birds of this species, fit them with a
small tracking device, then release them. They measure the distance travelled over a
week by each of the birds in the sample. In this example, the population might be
taken to be all birds of this particular species based on the island. The characteristic
of interest is distance travelled over the week for which the experiment took place.
The observed sample will be a collection of measurements of distance travelled.
Inferential questions that we might ask include:

i. Based on the observed sample, what is a reasonable estimate of the population
mean distance travelled?
ii. How sure can we be about this estimate?
iii. The zoological community has an established view on the mean distance that
these birds travel in search of food. Does the observed sample provide any
evidence to suggest that this view is incorrect?

Consider the first of these questions. The population mean is the value that we would
get if we measured distance travelled for every member of the population, added up
these values, and divided by the population size. We can use our observed sample
to construct an estimate of the population mean. The usual way to do this is to
add up all the observed sample values and divide by the sample size; the resulting
value is the observed sample mean. Computing observed sample values, such as the
mean, median, maximum, and so on, forms the basis of descriptive statistics. Although
widely used (most figures given in news reports are presented as descriptive statistics),
descriptive statistics cannot answer questions of inference adequately.
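
In symbols (a notational sketch using standard conventions; the book develops its
own notation in later chapters): if the population consists of the N values
v_1, ..., v_N, and the observed sample consists of the n values x_1, ..., x_n, then
the population mean and the observed sample mean are

\[
\mu = \frac{1}{N} \sum_{j=1}^{N} v_j
\qquad \text{and} \qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i .
\]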
The first of our questions asks, “what is a reasonable estimate of the population
mean?”. This brings us to a key idea in statistical inference: it is the properties of the
sample that determine whether an estimate is reasonable or not. Consider a sample
that consists of a small number of the weakest birds on the island; it is clear that
this sample will not generate a reasonable estimate of the population mean. If, on
the other hand, we sample at random and our sample is large, the observed sample
mean provides an intuitively appealing estimate. Using ideas from probability and
distribution theory, we can give a precise justification for this intuitive appeal.
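
As a concrete, if artificial, illustration of this point, the following R sketch (our
own, built on a hypothetical population of travel distances, not an example from the
book) shows the sample mean of a large random sample settling close to the population
mean:

set.seed(1)                                         # for reproducibility
population <- rgamma(10000, shape = 2, rate = 0.01) # hypothetical distances (km)
mu <- mean(population)                              # the population mean

# Draw many random samples of each size and compute the sample mean of each
xbar_10  <- replicate(1000, mean(sample(population, 10)))
xbar_500 <- replicate(1000, mean(sample(population, 500)))

mu            # population mean (about 200 km for this choice of parameters)
sd(xbar_10)   # spread of the sample mean when n = 10
sd(xbar_500)  # much smaller spread when n = 500

The output shows the sample mean concentrating around the population mean as the
sample size grows; Chapter 7 makes this precise through the distribution of the
sample mean.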
It is important to recognise that all statistical inference takes place in the context
of a model, that is, a mathematical idealisation of the real situation. In practice,
data collection methods often fall well short of what is required to satisfy the model’s
properties. Consider sampling at random from a population. This requires us to ensure
that every member of the population is equally likely to be included in the sample. It
is hard to come up with a practical situation where this requirement can be met. In
the sea birds illustration, the sample consists of those birds that the zoologists capture
and release. While they might set their traps in order to try to ensure that every bird in
the population is equally likely to be captured, in practice they may be more likely to
capture young, inexperienced birds or, if their traps are baited, the hungriest members
of the population. In the sage, if well-worn, words of George Box, “all models are
wrong, but some are useful”. The best we can hope for is a model that provides us
with a useful representation of the process that generates the data.
A common mistake in statistical inference is to extend conclusions beyond what can
be supported by the data. In the sea bird illustration, there are many factors that
potentially have an impact on the length of journeys: for example, prevailing weather
conditions and location of fish stocks. As the zoologists measure the distance travelled
for one week, it is reasonable to use our sample to draw inferences for this particular
week. However, if we were to try to extend this to making general statements for
any week of the year, the presence of factors that vary from week to week would
invalidate our conclusions. To illustrate, suppose the week in which the experiment
was conducted was one with particularly unfavourable weather conditions. In this
situation, the observed sample mean will grossly underestimate the distance travelled
by the birds in a normal week. One way in which the zoologists could collect data that
would support wider inferences would be to take measurements on randomly selected
days of the year. In general, close attention should be paid to the circumstances in
which the data are collected; these circumstances play a crucial role in determining
what constitutes sensible inference.
Statistical inference is often divided into three topics: point estimation, interval es-
timation, and hypothesis testing. Roughly, these correspond to the three questions
asked in the sea birds illustration. Finding a reasonable estimate for the population
mean is a question of point estimation. In general, point estimation covers methods for
generating single-number estimates for population characteristics and the properties
of these methods. Any estimate of a population characteristic has some uncertainty
associated with it. One way to quantify this uncertainty is to provide an interval that
covers, with a given probability, the value of the population characteristic of interest;
this is referred to as interval estimation. An interval estimate provides one possible
mechanism for answering the second question, “how sure can we be of our estimate?”.
The final question relates to the use of data as evidence. This is an important type
of problem whose relevance is sometimes lost in the formality of the third topic,
hypothesis testing. Although these labels are useful, we will try not to get too hung up
on them; understanding the common elements in these topics is at least as important
as understanding the differences.
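
To make interval estimation concrete, here is one standard construction, sketched in
conventional notation and anticipating material developed properly in Chapter 8: if
\bar{x} is the observed sample mean, \sigma is the population standard deviation, and
the sample size n is large, then the interval

\[
\left( \bar{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} , \;
       \bar{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right)
\]

covers the population mean with probability approximately 1 - \alpha, where
z_{\alpha/2} denotes the upper \alpha/2 quantile of the standard normal distribution.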
The ideas of population and sample are useful. In fact, so useful that the terminology
appears in situations where the population is not clearly identified. Consider the
following illustration. A chemist measures the mass of a chemical reagent that is
produced by an experiment. The experiment is repeated many times; on each occasion
the mass is measured and all the measurements are collected as a data set. There is no
obvious population in this illustration. However, we can still ask sensible inferential
questions, such as, “what is a reasonable estimate of the mass of the reagent we might
expect to be produced if we were to conduct the experiment again?”. The advantage of
thinking in terms of statistical models is now more apparent. It may be reasonable to
build a model for the measurements of mass of reagent that shares the properties of the
model of distance travelled by the sea birds. Although there is no obvious population,
the expected mass of reagent described by this model may still be referred to as the
population mean. Although we are not doing any sampling, the data set may still be
referred to as the observed sample. In this general context, the model may be thought
of as our mathematical representation of the process that generated the data.
So far, the illustrations we have considered are univariate, that is, we consider a
single characteristic of interest. Data sets that arise from real problems often contain
measurements of several characteristics. For example, when the sea birds are captured,
the zoologists might determine their gender, measure their weight and measure their
wingspan before releasing them. In this context, we often refer to the characteristics
that we record as variables. Data sets containing information on several variables
are usually represented as a table with columns corresponding to variables and rows
corresponding to individual sample members (a format familiar to anyone who has
used a spreadsheet). We may treat all the variables as having identical status and, by
building models that extend those used in the univariate case, try to take account of
possible relationships between variables. This approach and its extensions are often
referred to as multivariate analysis.
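
For instance, the sea bird data might be laid out as follows (the values are purely
illustrative, not measurements from any real study):

bird   gender   weight (kg)   wingspan (cm)   distance (km)
   1   F                1.8             210             412
   2   M                2.1             224             380
   3   F                1.7             205             455

Each row corresponds to one captured bird and each column to one recorded variable.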
Data sets containing several variables often arise in a related, though subtly differ-
ent, context. Consider the chemistry experiment described above. Suppose that the
temperature of the reaction chamber is thought to have an impact on the mass of
reagent produced. In order to test this, the experimenter chooses two temperatures
and runs the experiment 100 times at the first temperature, then 100 times at the
second temperature. The most convenient way to represent this situation is as two
variables, one being the mass of reagent produced and the other temperature of reac-
tion chamber. An inferential question arises: do the results of the experiment provide
evidence to suggest that reaction chamber temperature has an impact on the mass of
reagent produced? As we think that the variations in reaction chamber temperature
may explain variations in the production of reagent, we refer to temperature as an
explanatory variable, and mass of reagent produced as the response variable. This
setup is a simple illustration of a situation where a linear model may be appropriate.
Linear models are a broad class that include as special cases models for regression,
analysis of variance, and factorial design.
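
In the notation commonly used for such models (a sketch; the book's own treatment of
linear models is in Chapter 6), the chemistry experiment might be written as

\[
Y_i = \alpha + \beta x_i + \varepsilon_i , \qquad i = 1, \dots, 200,
\]

where x_i is the reaction chamber temperature for the i-th run, Y_i is the mass of
reagent produced, and \varepsilon_i is a random error term. The inferential question
then amounts to asking whether \beta differs from zero.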
In the example above, notice that we make a distinction between the variable that
we control (the explanatory variable, temperature) and the variable that we measure
(the response variable, mass of reagent). In many situations, particularly in the social
sciences, our data do not come from a designed experiment; they are values of
phenomena over which we exert no influence. These are sometimes referred to as
observational studies. To illustrate, suppose that we collect monthly figures for the
UK average house price and a broad measure of money supply. We may then ask
“what is the impact of money supply on house prices?”. Although we do not control
the money supply, we may treat it as an explanatory variable in a model with average
house price as the response; this model may provide useful inferences about the
association between the two variables.
The illustrations in this chapter attempt to give a flavour of the different situations in
which questions of statistical inference arise. What these situations have in common
is that we are making inferences under uncertainty. In measuring the wingspan of sea
birds, it is sampling from a population that introduces the uncertainty. In the house
price illustration, in common with many economic problems, we are interested in a
variable whose value is determined by the actions of tens of thousands of individuals.
The decisions made by each individual will, in turn, be influenced by a large number
of factors, many of them barely tangible. Thus, there are a huge number of factors that
combine to determine the value of an economic variable. It is our lack of knowledge
of these factors that is often thought of as the source of uncertainty. While sources
of uncertainty may differ, the statistical models and methods that are useful remain
broadly the same. It is these models and methods that are the subject of this book.
