
Introduction to mathematical statistics

J.S. Abdey

ST1215
2023

Undergraduate study in Economics, Management, Finance and the Social Sciences

This subject guide is for a 100 course offered as part of the University of London’s
undergraduate study in Economics, Management, Finance and the Social
Sciences. This is equivalent to Level 4 within the Framework for Higher Education
Qualifications in England, Wales and Northern Ireland (FHEQ).
For more information about the University of London, see: london.ac.uk
This guide was prepared for the University of London by:
James S. Abdey, BA (Hons), MSc, PGCertHE, PhD, Department of Statistics, London
School of Economics and Political Science.
This is one of a series of subject guides published by the University. We regret that
due to pressure of work the author is unable to enter into any correspondence
relating to, or arising from, the guide. If you have any comments on this subject
guide, please communicate these through the discussion forum on the virtual
learning environment.

University of London
Publications office
Stewart House
32 Russell Square
London WC1B 5DN
United Kingdom
london.ac.uk

Published by: University of London


© University of London 2023
The University of London asserts copyright over all material in this subject guide
except where otherwise indicated. All rights reserved. No part of this work may
be reproduced in any form, or by any means, without permission in writing from
the publisher. We make every effort to respect copyright. If you think we have
inadvertently used your copyright material, please let us know.
Contents

0 Preface 1
0.1 Route map to the subject guide . . . . . . . . . . . . . . . . . . . . . . . 1
0.2 Introduction to the subject area . . . . . . . . . . . . . . . . . . . . . . . 1
0.3 The role of statistics in the research process . . . . . . . . . . . . . . . . 2
0.4 Aims and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
0.5 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
0.6 Employability outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
0.7 Overview of the learning resources . . . . . . . . . . . . . . . . . . . . . . 6
0.7.1 The subject guide . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
0.7.2 Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
0.7.3 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
0.8 Examination advice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
0.8.1 Online study resources . . . . . . . . . . . . . . . . . . . . . . . . 10
0.8.2 The VLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
0.8.3 Making use of the Online Library . . . . . . . . . . . . . . . . . . 11

1 Data visualisation and descriptive statistics 13


1.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 Continuous and discrete variables . . . . . . . . . . . . . . . . . . . . . . 16
1.5 The sample distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.5.1 Bar charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.5.2 Sample distributions of variables with many values . . . . . . . . 19
1.5.3 Skewness of distributions . . . . . . . . . . . . . . . . . . . . . . . 20
1.6 Measures of central tendency . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.6.1 Notation for variables . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.6.2 Summation notation . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.6.3 The sample mean . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.6.4 The (sample) median . . . . . . . . . . . . . . . . . . . . . . . . . 24


1.6.5 Sensitivity to outliers . . . . . . . . . . . . . . . . . . . . . . . . . 25


1.6.6 Skewness, means and medians . . . . . . . . . . . . . . . . . . . . 25
1.6.7 Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.7 Measures of dispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.7.1 Variance and standard deviation . . . . . . . . . . . . . . . . . . . 27
1.7.2 Sample quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1.7.3 Quantile-based measures of dispersion . . . . . . . . . . . . . . . 29
1.7.4 Boxplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
1.8 Associations between two variables . . . . . . . . . . . . . . . . . . . . . 30
1.8.1 Scatterplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1.8.2 Line plots (time series plots) . . . . . . . . . . . . . . . . . . . . . 31
1.8.3 Side-by-side boxplots for comparisons . . . . . . . . . . . . . . . . 31
1.8.4 Two-way contingency tables . . . . . . . . . . . . . . . . . . . . . 32
1.9 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.10 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.11 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 34

2 Probability theory 35
2.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4 Set theory: the basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.5 Axiomatic definition of probability . . . . . . . . . . . . . . . . . . . . . 43
2.5.1 Basic properties of probability . . . . . . . . . . . . . . . . . . . . 44
2.6 Classical probability and counting rules . . . . . . . . . . . . . . . . . . . 48
2.6.1 Brute force: listing and counting . . . . . . . . . . . . . . . . . . . 50
2.6.2 Combinatorial counting methods . . . . . . . . . . . . . . . . . . 50
2.6.3 Combining counts: rules of sum and product . . . . . . . . . . . . 55
2.7 Conditional probability and Bayes’ theorem . . . . . . . . . . . . . . . . 57
2.7.1 Independence of multiple events . . . . . . . . . . . . . . . . . . . 58
2.7.2 Independent versus mutually exclusive events . . . . . . . . . . . 59
2.7.3 Conditional probability of independent events . . . . . . . . . . . 60
2.7.4 Chain rule of conditional probabilities . . . . . . . . . . . . . . . . 61
2.7.5 Total probability formula . . . . . . . . . . . . . . . . . . . . . . . 63
2.7.6 Bayes’ theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.8 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68


2.9 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 69


2.10 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 69
2.11 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 70

3 Random variables 73
3.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.4 Discrete random variables . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.4.1 Probability distribution of a discrete random variable . . . . . . . 75
3.4.2 The cumulative distribution function (cdf) . . . . . . . . . . . . . 78
3.4.3 Properties of the cdf for discrete distributions . . . . . . . . . . . 81
3.4.4 General properties of the cdf . . . . . . . . . . . . . . . . . . . . . 81
3.4.5 Properties of a discrete random variable . . . . . . . . . . . . . . 82
3.4.6 Expected value versus sample mean . . . . . . . . . . . . . . . . . 82
3.4.7 Moments of a random variable . . . . . . . . . . . . . . . . . . . . 88
3.4.8 The moment generating function . . . . . . . . . . . . . . . . . . 90
3.5 Continuous random variables . . . . . . . . . . . . . . . . . . . . . . . . 93
3.5.1 Moment generating functions . . . . . . . . . . . . . . . . . . . . 102
3.5.2 Median of a random variable . . . . . . . . . . . . . . . . . . . . . 102
3.6 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.7 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.8 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 104
3.9 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 105

4 Common distributions of random variables 107


4.1 Synopsis of chapter content . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.4 Common discrete distributions . . . . . . . . . . . . . . . . . . . . . . . . 109
4.4.1 Discrete uniform distribution . . . . . . . . . . . . . . . . . . . . 109
4.4.2 Bernoulli distribution . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.4.3 Binomial distribution . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.4.4 Poisson distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.4.5 Connections between probability distributions . . . . . . . . . . . 116
4.4.6 Poisson approximation of the binomial distribution . . . . . . . . 116


4.4.7 Some other discrete distributions . . . . . . . . . . . . . . . . . . 118


4.5 Common continuous distributions . . . . . . . . . . . . . . . . . . . . . . 119
4.5.1 The (continuous) uniform distribution . . . . . . . . . . . . . . . 119
4.5.2 Exponential distribution . . . . . . . . . . . . . . . . . . . . . . . 120
4.5.3 Two other distributions . . . . . . . . . . . . . . . . . . . . . . . 123
4.5.4 Normal (Gaussian) distribution . . . . . . . . . . . . . . . . . . . 124
4.5.5 Normal approximation of the binomial distribution . . . . . . . . 130
4.6 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
4.7 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
4.8 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 134
4.9 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 135

5 Multivariate random variables 137


5.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.4 Joint probability functions . . . . . . . . . . . . . . . . . . . . . . . . . . 138
5.5 Marginal distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
5.6 Continuous multivariate distributions . . . . . . . . . . . . . . . . . . . . 141
5.7 Conditional distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.7.1 Properties of conditional distributions . . . . . . . . . . . . . . . . 144
5.7.2 Conditional mean and variance . . . . . . . . . . . . . . . . . . . 145
5.7.3 Continuous conditional distributions (non-examinable) . . . . . . 145
5.8 Covariance and correlation . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.8.1 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.8.2 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.8.3 Sample covariance and correlation . . . . . . . . . . . . . . . . . . 149
5.9 Independent random variables . . . . . . . . . . . . . . . . . . . . . . . . 151
5.9.1 Joint distribution of independent random variables . . . . . . . . 152
5.10 Sums and products of random variables . . . . . . . . . . . . . . . . . . . 153
5.10.1 Distributions of sums and products . . . . . . . . . . . . . . . . . 154
5.10.2 Expected values and variances of sums of random variables . . . . 154
5.10.3 Expected values of products of independent random variables . . 155
5.10.4 Some proofs of previous results . . . . . . . . . . . . . . . . . . . 155
5.10.5 Distributions of sums of random variables . . . . . . . . . . . . . 156
5.11 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158


5.12 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 159


5.13 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 159
5.14 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 160

6 Sampling distributions of statistics 163


6.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.4 Random samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
6.4.1 Joint distribution of a random sample . . . . . . . . . . . . . . . . 164
6.5 Statistics and their sampling distributions . . . . . . . . . . . . . . . . . 165
6.5.1 Sampling distribution of a statistic . . . . . . . . . . . . . . . . . 166
6.6 Sample mean from a normal population . . . . . . . . . . . . . . . . . . . 168
6.7 The central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 172
6.8 Some common sampling distributions . . . . . . . . . . . . . . . . . . . . 174
6.8.1 The χ² distribution . . . . . . . . . . . . . . . . . . . . . . . . 176
6.8.2 (Student’s) t distribution . . . . . . . . . . . . . . . . . . . . . . . 178
6.8.3 The F distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.9 Prelude to statistical inference . . . . . . . . . . . . . . . . . . . . . . . . 181
6.9.1 Population versus random sample . . . . . . . . . . . . . . . . . . 181
6.9.2 Parameter versus statistic . . . . . . . . . . . . . . . . . . . . . . 182
6.9.3 Difference between ‘Probability’ and ‘Statistics’ . . . . . . . . . . 184
6.10 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
6.11 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
6.12 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 185
6.13 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 186

7 Point estimation 189


7.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
7.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
7.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
7.4 Estimation criteria: bias, variance and mean squared error . . . . . . . . 190
7.5 Method of moments (MM) estimation . . . . . . . . . . . . . . . . . . . . 196
7.6 Least squares (LS) estimation . . . . . . . . . . . . . . . . . . . . . . . . 198
7.7 Maximum likelihood (ML) estimation . . . . . . . . . . . . . . . . . . . . 200
7.8 Asymptotic distribution of MLEs . . . . . . . . . . . . . . . . . . . . . . 205


7.9 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206


7.10 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
7.11 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 207
7.12 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 208

8 Interval estimation 211


8.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
8.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
8.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
8.4 Interval estimation for means of normal distributions . . . . . . . . . . . 212
8.4.1 An important property of normal samples . . . . . . . . . . . . . 214
8.5 Approximate confidence intervals . . . . . . . . . . . . . . . . . . . . . . 215
8.5.1 Means of non-normal distributions . . . . . . . . . . . . . . . . . 215
8.5.2 MLE-based confidence intervals . . . . . . . . . . . . . . . . . . . 215
8.6 Use of the chi-squared distribution . . . . . . . . . . . . . . . . . . . . . 215
8.7 Interval estimation for variances of normal distributions . . . . . . . . . . 216
8.8 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
8.9 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
8.10 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 217
8.11 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 218

9 Hypothesis testing 221


9.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
9.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
9.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
9.4 Introductory examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
9.5 Setting p-value, significance level, test statistic . . . . . . . . . . . . . . . 224
9.5.1 General setting of hypothesis tests . . . . . . . . . . . . . . . . . 224
9.5.2 Statistical testing procedure . . . . . . . . . . . . . . . . . . . . . 225
9.5.3 Two-sided tests for normal means . . . . . . . . . . . . . . . . . . 226
9.5.4 One-sided tests for normal means . . . . . . . . . . . . . . . . . . 227
9.6 t tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
9.7 General approach to statistical tests . . . . . . . . . . . . . . . . . . . . . 229
9.8 Two types of error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
9.9 Tests for variances of normal distributions . . . . . . . . . . . . . . . . . 230
9.10 Summary: tests for µ and σ² in N(µ, σ²) . . . . . . . . . . . . . . . . . . 232


9.11 Comparing two normal means with paired observations . . . . . . . . . . 232


9.11.1 Power functions of the test . . . . . . . . . . . . . . . . . . . . . . 233
9.12 Comparing two normal means . . . . . . . . . . . . . . . . . . . . . . . . 233
9.12.1 Tests on µX − µY with known σ²X and σ²Y . . . . . . . . . . . 234
9.12.2 Tests on µX − µY with σ²X = σ²Y but unknown . . . . . . . . . 234
9.13 Tests for correlation coefficients . . . . . . . . . . . . . . . . . . . . . . . 237
9.13.1 Tests for correlation coefficients . . . . . . . . . . . . . . . . . . . 239
9.14 Tests for the ratio of two normal variances . . . . . . . . . . . . . . . . . 240
9.15 Summary: tests for two normal distributions . . . . . . . . . . . . . . . . 242
9.16 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
9.17 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
9.18 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 243
9.19 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 244

10 Analysis of variance (ANOVA) 247


10.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
10.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
10.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
10.4 Testing for equality of three population means . . . . . . . . . . . . . . . 247
10.5 One-way analysis of variance . . . . . . . . . . . . . . . . . . . . . . . . . 249
10.6 From one-way to two-way ANOVA . . . . . . . . . . . . . . . . . . . . . 258
10.7 Two-way analysis of variance . . . . . . . . . . . . . . . . . . . . . . . . 258
10.8 Residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
10.9 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
10.10 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . 264
10.11 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . 265
10.12 Solutions to Sample examination questions . . . . . . . . . . . . . . . . 265

11 Linear regression 267


11.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
11.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
11.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
11.4 Introductory examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
11.5 Simple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
11.6 Inference for parameters in normal regression models . . . . . . . . . . . 274
11.7 Regression ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278


11.8 Confidence intervals for E(y) . . . . . . . . . . . . . . . . . . . . . . . . . 279


11.9 Prediction intervals for y . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
11.10 Multiple linear regression models . . . . . . . . . . . . . . . . . . . . . 283
11.11 Regression using R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
11.12 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
11.13 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . 293
11.14 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . 294
11.15 Solutions to Sample examination questions . . . . . . . . . . . . . . . . 295

A Data visualisation and descriptive statistics 299


A.1 (Re)vision of fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . 299
A.2 Worked example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
A.3 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305

B Probability theory 307


B.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
B.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318

C Random variables 321


C.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
C.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332

D Common distributions of random variables 333


D.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
D.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346

E Multivariate random variables 349


E.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
E.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359

F Sampling distributions of statistics 361


F.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
F.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368

G Point estimation 371


G.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
G.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382


H Interval estimation 385


H.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
H.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390

I Hypothesis testing 391


I.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
I.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397

J Analysis of variance (ANOVA) 401


J.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
J.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407

K Linear regression 409


K.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
K.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415

L Solutions to Practice questions 417


L.1 Appendix A – Data visualisation and descriptive statistics . . . . . . . . 417
L.2 Appendix B – Probability theory . . . . . . . . . . . . . . . . . . . . . . 419
L.3 Appendix C – Random variables . . . . . . . . . . . . . . . . . . . . . . . 422
L.4 Appendix D – Common distributions of random variables . . . . . . . . . 424
L.5 Appendix E – Multivariate random variables . . . . . . . . . . . . . . . . 429
L.6 Appendix F – Sampling distributions of statistics . . . . . . . . . . . . . 430
L.7 Appendix G – Point estimation . . . . . . . . . . . . . . . . . . . . . . . 432
L.8 Appendix H – Interval estimation . . . . . . . . . . . . . . . . . . . . . . 436
L.9 Appendix I – Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . 437
L.10 Appendix J – Analysis of variance . . . . . . . . . . . . . . . . . . . . . . 442
L.11 Appendix K – Linear regression . . . . . . . . . . . . . . . . . . . . . . . 443

M Formula sheet in the examination 447

N Sample examination paper 449

O Sample examination paper – Solutions 455

Chapter 0
Preface

0.1 Route map to the subject guide


This subject guide provides you with a framework for covering the syllabus of the
ST1215 Introduction to mathematical statistics course and directs you to
additional resources such as readings and the virtual learning environment (VLE). The
up-to-date course syllabus for ST1215 Introduction to mathematical statistics
can be found in the course information sheet, which is available on the course VLE page.
The following chapters will cover important aspects of mathematical statistics, upon
which many applications in EC2020 Elements of econometrics, as well as ST2133
Advanced statistics: distribution theory and ST2134 Advanced statistics:
statistical inference draw heavily (among other courses). The chapters are not a
series of self-contained topics; rather, they build on each other sequentially. As such, you
are strongly advised to follow the subject guide in chapter order. There is little point in
rushing past material which you have only partially understood in order to reach the
final chapter. Once you have completed your work on all of the chapters, you will be
ready for examination revision. A good place to start is the sample examination paper
which you will find at the end of the subject guide.
Colour has been included in places to emphasise important items. Formulae in the main
body of chapters are in blue – these exclude formulae used in examples. Key terms and
concepts when introduced are shown mainly in red, with a few in blue to avoid
repetition. References to other courses and half courses are shown in purple (such as
above). Terms in italics are shown in purple for emphasis. References to chapters,
sections, figures and tables are shown in teal.

0.2 Introduction to the subject area


Why study statistics?

By successfully completing this course, you will understand the ideas of randomness and
variability, and the way in which they link to probability theory. This will allow the use
of a systematic and logical collection of statistical techniques of great practical
importance in many applied areas. The examples in this subject guide will concentrate
on the social sciences, but the methods are important for the physical sciences too. This
subject aims to provide a grounding in probability theory and some of the most
common statistical methods.
The material in ST1215 Introduction to mathematical statistics is necessary as
preparation for other subjects you may study later on in your degree. The full details of
the ideas discussed in this subject guide will not always be required in these other
subjects, but you will need to have a solid understanding of the main concepts. This
can only be achieved by seeing how the ideas emerge in detail.

How to study statistics

For statistics, you need some familiarity with abstract mathematical ideas, as well as
the ability and common sense to apply these to real-life problems. The concepts you will
encounter in probability and statistical inference are hard to absorb by just reading
about them in a book. You need to read, then think a little, then try some problems,
and then read and think some more. This procedure should be repeated until the
problems are easy to do; you should not spend a long time reading and forget about
solving problems.
We begin with an illustrative example of how statistics can be applied in a research
context.

0.3 The role of statistics in the research process


Before we get into details, let us begin with the ‘big picture’. First, some definitions.

Research: trying to answer questions about the world in a systematic (scientific) way.

Empirical research: doing research by first collecting relevant information (data) about the world.

Research may be about almost any topic: physics, biology, medicine, economics, history,
literature etc. Most of our examples will be from the social sciences: economics,
management, finance, sociology, political science, psychology etc. Research in this sense
is not just what universities do. Governments, businesses, and all of us as individuals do
it too. Statistics is used in essentially the same way for all of these.

Example 0.1 It all starts with a question.

Can labour regulation hinder economic performance?

Understanding the gender pay gap: what has competition got to do with it?

Children and online risk: powerless victims or resourceful participants?

Refugee protection as a collective action problem: is the European Union (EU) shirking its responsibilities?

Do directors perform for pay?

Heeding the push from below: how do social movements persuade the rich to
listen to the poor?


Does devolution lead to regional inequalities in welfare activity?

The childhood origins of adult socio-economic disadvantage: do cohort and gender matter?

Parental care as unpaid family labour: how do spouses share?

Key stages of the empirical research process

We can think of the empirical research process as having five key stages.

1. Formulating the research question.

2. Research design: deciding what kinds of data to collect, how and from where.

3. Collecting the data.

4. Analysis of the data to answer the research question.

5. Reporting the answer and how it was obtained.

The main job of statistics is the analysis of data, although it also informs other stages
of the research process. Statistics are used when the data are quantitative, i.e. in the
form of numbers.
Statistical analysis of quantitative data has the following features.

It can cope with large volumes of data, in which case the first task is to provide an
understandable summary of the data. This is the job of descriptive statistics.
It can deal with situations where the observed data are regarded as only a part (a
sample) from all the data which could have been obtained (the population). There
is then uncertainty in the conclusions. Measuring this uncertainty is the job of
statistical inference.

We continue with an example of how statistics can be used to help answer a research
question.

Example 0.2 CCTV, crime and fear of crime.


Our research question is what is the effect of closed-circuit television (CCTV)
surveillance on:

the number of recorded crimes?

the fear of crime felt by individuals?

We illustrate this using part of the following study.

Gill, M. and A. Spriggs ‘Assessing the impact of CCTV’, Home Office Research
Study 292.


The research design of the study comprised the following.

Target area: a housing estate in northern England.

Control area: a second, comparable housing estate.

Intervention: CCTV cameras installed in the target area but not in the
control area.

Compare measures of crime and the fear of crime in the target and control
areas in the 12 months before and 12 months after the intervention.

The data and data collection were as follows.

Level of crime: the number of crimes recorded by the police, in the 12 months
before and 12 months after the intervention.

Fear of crime: a survey of residents of the areas.


• Respondents: random samples of residents in each of the areas.
• In each area, one sample before the intervention date and one about 12
months after.
• Sample sizes:
Before After
Target area 172 168
Control area 215 242

• Question considered here: ‘In general, how much, if at all, do you worry
that you or other people in your household will be victims of crime?’ (from
1 = ‘all the time’ to 5 = ‘never’).
Statistical analysis of the data.

% of respondents who worry ‘sometimes’, ‘often’ or ‘all the time’:


              Target area                        Control area
              [a]       [b]                      [c]       [d]
              Before    After     Change         Before    After     Change      RES     Confidence interval
              26        23        −3             53        46        −7          0.98    (0.55, 1.74)

It is possible to calculate various statistics. For example, the Relative Effect Size
RES = ([d]/[c])/([b]/[a]) = (46/53)/(23/26) ≈ 0.98 is a summary measure which compares the
changes in the two areas.


RES < 1, which means that the observed change in the reported fear of crime
has been slightly less favourable in the target area than in the control area.

However, there is uncertainty because of sampling: only 168 and 242 individuals
were actually interviewed at each time in each area, respectively.

The confidence interval for RES includes 1, which means that changes in the
self-reported fear of crime in the two areas are ‘not statistically significantly
different’ from each other.

The number of (any kind of) recorded crimes:


              Target area                        Control area
              [a]       [b]                      [c]       [d]
              Before    After     Change         Before    After     Change      RES     Confidence interval
              112       101       −11            73        88        +15         1.34    (0.79, 1.89)

Now RES = (88/73)/(101/112) ≈ 1.34 > 1, which means that the observed change in the number of
crimes has been worse in the control area than in the target area.

However, the numbers of crimes in each area are fairly small, which means that
these estimates of the changes in crime rates are fairly uncertain.

The confidence interval for RES again includes 1, which means that the changes
in crime rates in the two areas are not statistically significantly different from
each other.

In summary, this study did not support the claim that the introduction of CCTV
reduces crime or the fear of crime.

If you want to read more about research of this question, see Welsh, B.C. and
D.P. Farrington ‘Effects of closed circuit television surveillance on crime’,
Campbell Systematic Reviews 17 2008.

Many of the statistical terms and concepts mentioned above have not been explained yet
– that is what the rest of the course is for! However, it serves as an interesting example
of how statistics can be employed in the social sciences to investigate research questions.

0.4 Aims and objectives

The course provides a precise and accurate treatment of introductory probability and
distribution theory, statistical ideas, methods and techniques. Topics covered are data
visualisation and descriptive statistics, probability theory, random variables, common
distributions of random variables, multivariate random variables, sampling distributions
of statistics, point estimation, interval estimation, hypothesis testing, analysis of
variance (ANOVA) and linear regression.


0.5 Learning outcomes


At the end of this course, and having completed the Recommended reading and
activities, students should be able to:

compute probabilities of events, including for univariate and multivariate random variables

apply and be competent users of standard statistical operators and be able to recall a variety of well-known probability distributions and their respective moments

derive estimators of unknown parameters using method of moments, least squares and maximum likelihood estimation techniques, and analyse the statistical properties of estimators

explain the fundamentals of statistical inference and develop the ability to formulate the hypothesis of interest, derive the necessary tools to test this hypothesis and interpret the results in a number of different settings

be familiar with the fundamental concepts of statistical modelling, with an emphasis on analysis of variance and linear regression models

demonstrate understanding that statistical techniques are based on assumptions and the plausibility of such assumptions must be investigated when analysing real problems.

0.6 Employability outcomes


Below are the three most relevant skill outcomes for students undertaking this course
which can be conveyed to future prospective employers:

1. complex problem-solving

2. decision making

3. communication.

0.7 Overview of the learning resources

0.7.1 The subject guide


The subject guide is a self-contained resource, i.e. the content provided here is sufficient
to prepare for the examination. All examinable topics are discussed in detail with
numerous activities and practice problems. Studying extensively using the subject guide
is essential to perform well in the final examination. As such, there is no necessity to
purchase a textbook, although some students may wish to consult other resources to
read about the same topics through an alternative tutorial voice – please see the
suggested ‘Further reading’ below.


The subject guide provides a range of activities that will enable you to test your
understanding of the basic ideas and concepts. We want to encourage you to try the
exercises you encounter throughout the material before working through the solutions.
With statistics, the motto has to be ‘practise, practise, practise. . .’. It is the best way to
learn the material and prepare for examinations. The course is rigorous and demanding,
but the skills you will be developing will be rewarding and well recognised by future
employers.
A suggested approach for students studying ST1215 Introduction to mathematical
statistics is to split the material into 10 two-week blocks.

1. Chapter 1: Data visualisation and descriptive statistics.

2. Chapter 2: Probability theory.

3. Chapter 3: Random variables.

4. Chapter 4: Common distributions of random variables.

5. Chapter 5: Multivariate random variables.

6. Chapter 6: Sampling distributions of statistics.

7. Chapters 7 and 8: Point estimation and Interval estimation.

8. Chapter 9: Hypothesis testing.

9. Chapter 10: Analysis of variance (ANOVA).

10. Chapter 11: Linear regression.

The following procedure is recommended:

1. Read the introductory comments.

2. Study the chapter content, worked examples and practice questions.

3. Go through the learning outcomes carefully.

4. Refer back to this subject guide, or to supplementary texts, to improve your understanding until you are able to work through the problems confidently.

The last step is the most important. It is easy to think that you have understood the
material after reading it, but working through problems is the crucial test of
understanding. Problem-solving should take up most of your study time.
To prepare for the examination, you will only need to read the material in the subject
guide, but it may be helpful from time to time to look at the suggested ‘Further
reading’ below.


Basic notation

We often use the symbol ∎ to denote the end of a proof, where we have finished
explaining why a particular result is true. This is just to make it clear where the proof
ends and the following text begins.

Calculators

A calculator may be used when answering questions on the examination paper for
ST1215 Introduction to mathematical statistics. It must comply in all respects
with the specification given in the Programme regulations. You should also refer to the
admission notice you will receive when entering the examination and the ‘Notice on
permitted materials’.

Computers

If you are aiming to carry out serious statistical analysis (which is beyond the level of
this course) you will probably want to use some statistical software package, such as R.
It is not necessary for this course to have such software available, but if you do have
access to it you may benefit from using it in your study of the material. On a few
occasions in this subject guide R will be used for illustrative purposes only. You will not
be examined on R.

0.7.2 Essential reading

This subject guide is ‘self-contained’ meaning that this is the only resource which is
essential reading for ST1215 Introduction to mathematical statistics. Throughout
the subject guide there are many worked examples, practice problems and sample
examination questions replicating resources typically provided in statistical textbooks.
You may, however, feel you could benefit from reading textbooks, and a suggested list of
these is provided below.

Statistical tables

In the examination you will be provided with relevant extracts of:

Dougherty, C. Introduction to Econometrics. (Oxford: Oxford University Press, 2016) fifth edition [ISBN 9780199676828].

Lindley, D.V. and W.F. Scott New Cambridge Statistical Tables. (Cambridge:
Cambridge University Press, 1995) second edition [ISBN 9780521484855].

As relevant extracts of these statistical tables are the same as those distributed for use
in the examination, it is advisable that you become familiar with them, rather than
those at the end of a textbook.


0.7.3 Further reading


As mentioned above, this subject guide is sufficient for study of ST1215 Introduction
to mathematical statistics. Of course, you are free to read around the subject area
in any text, paper or online resource to support your learning and help you to think
about how these principles apply in the real world. To help you read extensively, you
have free access to the virtual learning environment (VLE) and University of London
Online Library (see below).
Unless otherwise stated, all websites in this subject guide were accessed in May 2023.
We cannot guarantee, however, that they will stay current and you may need to
perform an internet search to find the relevant pages.
Other useful texts for this course include:

Freedman, D., R. Pisani and R. Purves Statistics. (New York: W.W. Norton &
Company, 2007) fourth edition [ISBN 9780393930436].

Johnson, R.A. and G.K. Bhattacharyya Statistics: Principles and Methods. (New
York: John Wiley and Sons, 2010) sixth edition [ISBN 9780470505779].

Larsen, R.J. and M.J. Marx An Introduction to Mathematical Statistics and Its
Applications. (London: Pearson, 2017) sixth edition [ISBN 9780134114217].

Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and
Economics. (London: Pearson, 2012) eighth edition [ISBN 9780273767060].

0.8 Examination advice


Important: The information and advice given here are based on the examination
structure used at the time this subject guide was written. Please note that subject
guides may be used for several years. Because of this we strongly advise you to always
check both the current Programme regulations for relevant information about the
examination, and the VLE where you should be advised of any forthcoming changes.
You should also carefully check the rubric/instructions on the paper you actually sit
and follow those instructions.
The examination is by a three-hour unseen question paper. No books may be taken into
the examination, but the use of calculators is permitted, and statistical tables and a
formula sheet are provided (the formula sheet can be found at the end of the subject
guide).
The examination paper has a variety of questions, some quite short and others longer.
All questions must be answered correctly for full marks. You may use your calculator
whenever you feel it is appropriate, always remembering that the examiners can give
marks only for what appears on the examination script. Therefore, it is important to
always show your working.
In terms of the examination, as always, it is important to manage your time carefully
and not to dwell on one question for too long – move on and focus on solving the easier
questions, coming back to harder ones later.


Remember, it is important to check the VLE for:

up-to-date information on examination and assessment arrangements for this course

where available, past examination papers and Examiners’ commentaries for the
course which give advice on how each question might best be answered.

0.8.1 Online study resources


In addition to the subject guide, it is crucial that you take advantage of the study
resources that are available online for this course, including the VLE and the Online
Library.
You can access the VLE, the Online Library and your University of London email
account via the Student Portal at: https://my.london.ac.uk
You should have received your login details for the Student Portal with your official
offer, which was emailed to the address that you gave on your application form. You
have probably already logged into the Student Portal in order to register! As soon as
you registered, you will automatically have been granted access to the VLE, Online
Library and your fully functional University of London email account.
If you have forgotten these login details, please click on the ‘Forgot Password’ link on
the login page.

0.8.2 The VLE


The VLE, which complements this subject guide, has been designed to enhance your
learning experience, providing additional support and a sense of community. It forms an
important part of your study experience with the University of London and you should
access it regularly.
The VLE provides a range of resources for EMFSS courses:

Course materials: Subject guides and other course materials available for
download. In some courses, the content of the subject guide is transferred into the
VLE and additional resources and activities are integrated with the text.

Readings: Direct links, wherever possible, to essential readings in the Online Library, including journal articles and ebooks.

Video content: Including introductions to courses and topics within courses, interviews, lessons and debates.

Screencasts: Videos of PowerPoint presentations, animated podcasts and on-screen worked examples.

External material: Links out to carefully selected third-party resources.

Self-test activities: Multiple-choice, numerical and algebraic quizzes to check your understanding.


Collaborative activities: Work with fellow students to build a body of knowledge.

Discussion forums: A space where you can share your thoughts and questions
with fellow students. Many forums will be supported by a ‘course moderator’, a
subject expert employed by LSE to facilitate the discussion and clarify difficult
topics.

Past examination papers: We provide up to three years of past examinations alongside Examiners’ commentaries that provide guidance on how to approach the questions.

Study skills: Expert advice on getting started with your studies, preparing for
examinations and developing your digital literacy skills.

Note: Students registered for Laws courses also receive access to the dedicated Laws
VLE.
Some of these resources are available for certain courses only, but we are expanding our
provision all the time and you should check the VLE regularly for updates.

0.8.3 Making use of the Online Library


The Online Library (https://onlinelibrary.london.ac.uk) contains a huge array of journal
articles and other resources to help you read widely and extensively.
To access the majority of resources via the Online Library you will either need to use
your University of London Student Portal login details, or you will be required to
register and use an Athens login.
The easiest way to locate relevant content and journal articles in the Online Library is
to use the Summon search engine.
If you are having trouble finding an article listed in a reading list, try removing any
punctuation from the title, such as single quotation marks, question marks and colons.
For further advice, please use the online help pages
(https://onlinelibrary.london.ac.uk/resources/summon) or contact the Online Library
team using the ‘Chat with us’ function.

Chapter 1
Data visualisation and descriptive
statistics

1.1 Synopsis of chapter


Graphical representations of data provide us with a useful view of the distribution of
variables. In this chapter, we shall cover a selection of approaches for displaying data
visually – each being appropriate in certain situations. We then consider descriptive
statistics, whose main objective is to interpret key features of a dataset numerically.
Graphs and charts have little intrinsic value per se; rather, their main function is to
bring out interesting features of a dataset. For this reason, simple descriptions should be
preferred to complicated graphics.
Although data visualisation is useful as a preliminary form of data analysis to get a ‘feel’
for the data, in practice we also need to be able to summarise data numerically. We
introduce descriptive statistics and distinguish between measures of location, measures
of dispersion and skewness. All these statistics provide useful summaries of raw datasets.

1.2 Learning outcomes


At the end of this chapter, you should be able to:

interpret and summarise raw data on social science variables graphically

interpret and summarise raw data on social science variables numerically

calculate basic measures of location and dispersion

describe the skewness of a distribution and interpret boxplots

discuss the key terms and concepts introduced in the chapter.

1.3 Introduction
Starting point: a collection of numerical data (a sample) has been collected in order to
answer some questions. Statistical analysis may have two broad aims.


1. Descriptive statistics: summarise the data which were collected, in order to make them more understandable.

2. Statistical inference: use the observed data to draw conclusions about some
broader population.

Sometimes ‘1.’ is the only aim. Even when ‘2.’ is the main aim, ‘1.’ is still an essential
first step.
Data do not speak for themselves. There are usually simply too many numbers to make
sense of just by staring at them. Descriptive statistics attempt to summarise some
key features of the data to make them understandable and easy to communicate.
These summaries may be graphical or numerical (tables or individual summary
statistics).

Example 1.1 We consider data for 155 countries and territories on three variables
from around 2002. The data can be found in the file ‘Countries.csv’ (available on
the VLE). The variables are the following.

Region of the country.

• This is a nominal variable coded (in alphabetical order) as follows:
  1 = Africa, 2 = Asia, 3 = Europe, 4 = South America, 5 = North America, 6 = Oceania.

The level of democracy, i.e. a democracy index, in the country.

• This is an 11-point ordinal scale from 0 (lowest level of democracy) to 10 (highest level of democracy).

Gross domestic product per capita (GDP per capita) (i.e. per person, in $000s), which is measured on a ratio scale.

The statistical data in a sample are typically stored in a data matrix, as shown in
Figure 1.1 (on the next page).
Rows of the data matrix correspond to different units (subjects/observations).

Here, each unit is a country.

The number of units in a dataset is the sample size, typically denoted by n.

Here, n = 155 countries.

Columns of the data matrix correspond to variables, i.e. different characteristics of the units.

Here, region, the level of democracy, and GDP per capita are the variables.


Figure 1.1: Example of a data matrix.
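R is used in this guide for illustrative purposes only, but as an illustration of how such a data matrix is handled in practice, the dataset could be read into R as follows. This is a minimal sketch which assumes that ‘Countries.csv’ has been downloaded from the VLE into R’s working directory; the variable names shown by str() will be whatever column names are used in the file itself.

# Read the dataset into a data frame (one row per country, one column per variable)
countries <- read.csv("Countries.csv")

str(countries)    # variable names and types
head(countries)   # the first few rows of the data matrix
nrow(countries)   # the sample size n (here 155)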


1.4 Continuous and discrete variables


Different variables may have different properties. These determine which kinds of
statistical methods are suitable for the variables.

Continuous and discrete variables

A continuous variable can, in principle, take any real values within some interval.

In Example 1.1, GDP per capita is continuous, taking any non-negative value.

A variable is discrete if it is not continuous, i.e. if it can only take certain values,
but not any others.

In Example 1.1, region and the level of democracy are discrete, with possible
values of 1, 2, . . . , 6, and 0, 1, 2, . . . , 10, respectively.

Many discrete variables have only a finite number of possible values. In Example 1.1, the
region variable has 6 possible values, and the level of democracy has 11 possible values.
The simplest possibility is a binary, or dichotomous, variable, with just two possible
values. For example, a person’s sex could be recorded as 1 = female and 2 = male.¹
A discrete variable can also have an unlimited number of possible values.

For example, the number of visitors to a website in a day: 0, 1, 2, . . . ²

Example 1.2 In Example 1.1, the levels of democracy have a meaningful ordering,
from less democratic to more democratic countries. The numbers assigned to the
different levels must also be in this order, i.e. a larger number = more democratic.
In contrast, different regions (Africa, Asia, Europe, South America, North America
and Oceania) do not have such an ordering. The numbers used for the region
variable are just labels for different regions. A different numbering (such as
6 = Africa, 5 = Asia, 1 = Europe, 3 = South America, 2 = North America and
4 = Oceania) would be just as acceptable as the one we originally used. Some
statistical methods are appropriate for variables with both ordered and unordered
values, some only in the ordered case. Unordered categories are nominal data;
ordered categories are ordinal data.

¹ Note that because sex is a nominal variable, the coding is arbitrary. We could also have, for example, 0 = male and 1 = female, or 0 = female and 1 = male. However, it is important to remember which coding has been used!
² In practice, of course, there is a finite number of internet users in the world. However, it is reasonable to treat this variable as taking an unlimited number of possible values.


1.5 The sample distribution


The sample distribution of a variable consists of:

a list of the values of the variable which are observed in the sample
the number of times each value occurs (the counts or frequencies of the observed
values).

When the number of different observed values is small, we can show the whole sample
distribution as a frequency table of all the values and their frequencies.

Example 1.3 Continuing with Example 1.1, the observations of the region variable
in the sample are:

3 5 3 3 3 5 3 3 6 3 2 3 3 3 3

3 3 2 2 2 3 6 2 3 2 2 2 3 3 2

2 3 3 3 2 4 3 2 3 1 4 3 1 3 3

4 4 4 1 2 4 3 4 3 2 1 2 3 1 3

2 1 4 2 4 3 1 4 6 2 1 3 4 2 1

4 4 4 2 3 2 4 1 4 1 4 2 2 2 4

2 2 1 4 2 1 4 2 2 4 4 1 6 3 1

2 1 2 2 1 1 2 1 1 3 2 2 1 2 4

2 1 2 1 1 2 1 2 1 2 1 1 1 1 1

1 1 1 2 1 1 1 1 1 2 1 1 1 1 1

1 1 1 2 1

We may construct a frequency table for the region variable as follows:

Region               Frequency (count)   Relative frequency (%)
(1) Africa                  48                  31.0   (= 100 × (48/155))
(2) Asia                    44                  28.4
(3) Europe                  34                  21.9
(4) South America           23                  14.8
(5) North America            2                   1.3
(6) Oceania                  4                   2.6
Total                      155                 100


Here ‘%’ is the percentage of countries in a region, out of the 155 countries in the
sample. This is a measure of proportion (that is, relative frequency).
Similarly, for the level of democracy, the frequency table is:

Level of democracy   Frequency      %    Cumulative %
0                        35        22.6      22.6
1                        12         7.7      30.3
2                         4         2.6      32.9
3                         6         3.9      36.8
4                         5         3.2      40.0
5                         5         3.2      43.2
6                        12         7.7      50.9
7                        13         8.4      59.3
8                        16        10.3      69.6
9                        15         9.7      79.3
10                       32        20.6     100
Total                   155       100

‘Cumulative %’ for a value of the variable is the sum of the percentages for that
value and all lower-numbered values.
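For illustration only, frequency tables such as those above can be produced in R. A minimal sketch, assuming the data frame countries from the earlier sketch with columns named region and democracy (adjust these to the actual column names in ‘Countries.csv’):

# Counts of each region code (1 to 6)
region.freq <- table(countries$region)
region.freq

# Relative frequencies as percentages, e.g. 100 * (48/155) = 31.0 for Africa
round(100 * prop.table(region.freq), 1)

# Level of democracy: percentages and the 'Cumulative %' column
dem.pct <- round(100 * prop.table(table(countries$democracy)), 1)
cbind(Percent = dem.pct, Cumulative = cumsum(dem.pct))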

1.5.1 Bar charts


A bar chart is the graphical equivalent of the table of frequencies. Figure 1.2 displays
the region variable data as a bar chart. The relative frequencies of each region are
clearly visible.

Figure 1.2: Example of a bar chart showing the region variable.
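A bar chart such as Figure 1.2 can be drawn in R with the barplot() function. A minimal sketch, again assuming a column countries$region holding the numeric codes 1 to 6:

# Attach the region names to the numeric codes 1 to 6
region.f <- factor(countries$region, levels = 1:6,
                   labels = c("Africa", "Asia", "Europe", "South America",
                              "North America", "Oceania"))

# Bar chart of the counts: the graphical equivalent of the frequency table
barplot(table(region.f), ylab = "Frequency")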


1.5.2 Sample distributions of variables with many values


If a variable has many distinct values, listing frequencies of all of them is not very
practical.
A solution is to group the values into non-overlapping intervals, and produce a table or
graph of the frequencies within the intervals. The most common graph used for this is a
histogram.
A histogram is like a bar chart, but without gaps between bars, and often uses more
bars (intervals of values) than is sensible in a table. Histograms are usually drawn using
statistical software, such as R. You can let the software choose the intervals and the
number of bars.
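For instance, a histogram like Figure 1.3 could be drawn along the following lines in R, assuming the GDP per capita values are stored in a vector gdp (a hypothetical name); by default R chooses the intervals for you:

    # gdp: GDP per capita (in $000s) for the 155 countries (assumed available)
    hist(gdp,
         xlab = "GDP per capita (thousands of U.S. dollars)",
         main = "Histogram of GDP per capita")   # R chooses the intervals
    hist(gdp, breaks = 20)                        # or suggest roughly 20 intervals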

Example 1.4 Continuing with Example 1.1, a table of frequencies for GDP per
capita where values have been grouped into non-overlapping intervals is shown
below. Figure 1.3 shows a histogram of GDP per capita with a greater number of
intervals to better display the sample distribution.

GDP per capita (in $000s) Frequency %


[0, 2) 49 31.6
[2, 5) 32 20.6
[5, 10) 29 18.7
[10, 20) 21 13.5
[20, 30) 19 12.3
[30, 50) 5 3.2
Total 155 100

[Figure: histogram of GDP per capita; x-axis is GDP per capita (thousands of U.S. dollars), y-axis is frequency.]

Figure 1.3: Histogram of GDP per capita.


1.5.3 Skewness of distributions


Skewness and symmetry are terms used to describe the general shape of a sample
distribution.
From Figure 1.3 (on the previous page), it is clear that a small number of countries has
much larger values of GDP per capita than the majority of countries in the sample. The
distribution of GDP per capita has a ‘long right tail’. Such a distribution is called
positively skewed (or skewed to the right).
A distribution with a longer left tail (i.e. toward small values) is negatively skewed (or
skewed to the left). A distribution is symmetric if it is not skewed in either direction.

Example 1.5 Figure 1.4 shows a (more-or-less) symmetric sample distribution for
diastolic blood pressure.
[Figure: histogram of diastolic blood pressure, with proportion on the y-axis.]

Figure 1.4: Diastolic blood pressures of 4,489 respondents aged 25 or over, Health Survey
for England, 2002.

Example 1.6 Figure 1.5 (on the next page) shows a (slightly) negatively-skewed
distribution of marks in an examination. Note the data relate to all candidates
sitting the examination. Therefore, the histogram shows the population distribution,
not a sample distribution.

1.6 Measures of central tendency


Frequency tables, bar charts and histograms aim to summarise the whole sample
distribution of a variable. Next we consider descriptive statistics (also known as
summary statistics), each of which summarises one feature of the sample distribution
in a single number.


[Figure: histogram of examination marks; x-axis is marks, y-axis is frequency.]

Figure 1.5: Final examination marks of a first-year statistics course.

We begin with measures of central tendency. These answer the question: where is
the ‘centre’ or ‘average’ of the distribution?
We consider the following measures of central tendency:

mean (i.e. the average, sample mean or arithmetic mean)

median

mode.

1.6.1 Notation for variables


In formulae, a generic variable is denoted by a single letter. In these course notes, that
letter is usually X. However, any other letter (Y , W etc.) can also be used, as long as it
is used consistently. A letter with a subscript denotes a single observation of a variable.

Example 1.7 We use Xi to denote the value of X for unit i, where i can take
values 1, 2, . . . , n, and n is the sample size.
Therefore, the n observations of X in the dataset (the sample) are X1 , X2 , . . . , Xn .
These can also be written as Xi , for i = 1, 2, . . . , n.

1.6.2 Summation notation


Let X1 , X2 , . . . , Xn (i.e. Xi , for i = 1, 2, . . . , n) be a set of n numbers. The sum of the
numbers is written as:

\sum_{i=1}^{n} X_i = X_1 + X_2 + \cdots + X_n.


This may be written as \sum_i X_i, or just \sum X_i. Other versions of the same idea are:

infinite sums: \sum_{i=1}^{\infty} X_i = X_1 + X_2 + \cdots

sums of sets of observations other than 1 to n, for example:

\sum_{i=2}^{n/2} X_i = X_2 + X_3 + \cdots + X_{n/2}.

1.6.3 The sample mean


The sample mean (‘arithmetic mean’, ‘mean’ or ‘average’) is the most common
measure of central tendency. The sample mean of a variable X is denoted X̄. It is the
‘sum of the observations’ divided by the ‘number of observations’ (sample size)
expressed as:
X̄ = \frac{\sum_{i=1}^{n} X_i}{n}.

Example 1.8 The mean X̄ = \sum_i X_i / n of the numbers 1, 4 and 7 is:

\frac{1 + 4 + 7}{3} = \frac{12}{3} = 4.

Example 1.9 For the variables in Example 1.1:

the level of democracy has X̄ = 5.3


GDP per capita has X̄ = 8.6 (in $000s)
for region the mean is not meaningful(!), because the values of the variable do
not have a meaningful ordering.

The frequency table of the level of democracy is:

Level of democracy (Xj)    Frequency (fj)    %    Cumulative %
0 35 22.6 22.6
1 12 7.7 30.3
2 4 2.6 32.9
3 6 3.9 36.8
4 5 3.2 40.0
5 5 3.2 43.2
6 12 7.7 50.9
7 13 8.4 59.3
8 16 10.3 69.6
9 15 9.7 79.3
10 32 20.6 100
Total 155 100


If a variable has a small number of distinct values, X̄ is easy to calculate from the
frequency table. For example, the level of democracy has just 11 different values
which occur in the sample 35, 12, . . . , 32 times each, respectively.
Suppose X has K different values X1, X2, . . . , XK, with corresponding frequencies
f1, f2, . . . , fK. Therefore, \sum_{j=1}^{K} f_j = n and:

X̄ = \frac{\sum_{j=1}^{K} f_j X_j}{\sum_{j=1}^{K} f_j} = \frac{f_1 X_1 + f_2 X_2 + \cdots + f_K X_K}{f_1 + f_2 + \cdots + f_K} = \frac{f_1 X_1 + f_2 X_2 + \cdots + f_K X_K}{n}.

In our example, the mean of the level of democracy (where K = 11) is:
X̄ = \frac{35 \times 0 + 12 \times 1 + \cdots + 32 \times 10}{35 + 12 + \cdots + 32} = \frac{0 + 12 + \cdots + 320}{155} \approx 5.3.
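As a quick check, the grouped-data formula can be evaluated directly in R; a sketch using the frequencies in the table above:

    x <- 0:10                                       # distinct values X_j
    f <- c(35, 12, 4, 6, 5, 5, 12, 13, 16, 15, 32)  # frequencies f_j
    sum(f)                                          # n = 155
    sum(f * x) / sum(f)                             # weighted mean, approximately 5.3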

Why is the mean a good summary of the central tendency?

Consider the following small dataset:

Deviations:
from X̄ (= 4) from the median (= 3)
i Xi Xi − X̄ (Xi − X̄)2 Xi − 3 (Xi − 3)2
1 1 −3 9 −2 4
2 2 −2 4 −1 1
3 3 −1 1 0 0
4 5 +1 1 +2 4
5 9 +5 25 +6 36
Sum 20 0 40 +5 45
X̄ = 4

We see that the sum of deviations from the mean is 0, i.e. we have:

\sum_{i=1}^{n} (X_i − X̄) = 0.

The mean is ‘in the middle’ of the observations X1 , X2 , . . . , Xn , in the sense that
positive and negative values of the deviations Xi − X̄ cancel out, when summed over all
the observations.
Also, the smallest possible value of the sum of squared deviations \sum_{i=1}^{n} (X_i − C)^2 for any
constant C is obtained when C = X̄.
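Both properties are easy to verify numerically; a minimal sketch in R using the small dataset above:

    x <- c(1, 2, 3, 5, 9)
    sum(x - mean(x))                  # deviations from the mean sum to 0
    ssd <- function(a) sum((x - a)^2) # sum of squared deviations from a constant a
    ssd(mean(x))                      # 40, using C = X-bar
    ssd(median(x))                    # 45, larger for any other constant C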


1.6.4 The (sample) median


Let X(1) , X(2) , . . . , X(n) denote the sample values of X when ordered from the smallest
to the largest, known as the order statistics, such that:

X(1) is the smallest observed value (the minimum) of X

X(n) is the largest observed value (the maximum) of X.

Median

The (sample) median, q50 , of a variable X is the value which is ‘in the middle’ of
the ordered sample.

If n is odd, then q50 = X((n+1)/2) .

For example, if n = 3, q50 = X(2) : (1) (2) (3).

If n is even, q50 = (X(n/2) + X(n/2+1) )/2.

For example, if n = 4, q50 = (X(2) + X(3) )/2: (1) (2) (3) (4).

Example 1.10 Continuing with Example 1.1, n = 155, so q50 = X(78) . For the level
of democracy, the median is 6.
From a table of frequencies, the median is the value for which the cumulative
percentage first reaches 50% (or, if a cumulative % is exactly 50%, the average of the
corresponding value of X and the next highest value).
The ordered values of the level of democracy are:

(.0) (.1) (.2) (.3) (.4) (.5) (.6) (.7) (.8) (.9)
(0.) 0 0 0 0 0 0 0 0 0
(1.) 0 0 0 0 0 0 0 0 0 0
(2.) 0 0 0 0 0 0 0 0 0 0
(3.) 0 0 0 0 0 0 1 1 1 1
(4.) 1 1 1 1 1 1 1 1 2 2
(5.) 2 2 3 3 3 3 3 3 4 4
(6.) 4 4 4 5 5 5 5 5 6 6
(7.) 6 6 6 6 6 6 6 6 6 6
(8.) 7 7 7 7 7 7 7 7 7 7
(9.) 7 7 7 8 8 8 8 8 8 8
(10.) 8 8 8 8 8 8 8 8 8 9
(11.) 9 9 9 9 9 9 9 9 9 9
(12.) 9 9 9 9 10 10 10 10 10 10
(13.) 10 10 10 10 10 10 10 10 10 10
(14.) 10 10 10 10 10 10 10 10 10 10
(15.) 10 10 10 10 10 10


The median can be determined from the frequency table of the level of democracy:

Level of democracy (Xj)    Frequency (fj)    %    Cumulative %
0 35 22.6 22.6
1 12 7.7 30.3
2 4 2.6 32.9
3 6 3.9 36.8
4 5 3.2 40.0
5 5 3.2 43.2
6 12 7.7 50.9
7 13 8.4 59.3
8 16 10.3 69.6
9 15 9.7 79.3
10 32 20.6 100
Total 155 100

1.6.5 Sensitivity to outliers


For the following small ordered dataset, the mean and median are both 4:

1, 2, 4, 5, 8.

Suppose we add one observation to get the ordered sample:

1, 2, 4, 5, 8, 100.

The median is now 4.5, and the mean is 20. In general, the mean is affected much more
than the median by outliers, i.e. unusually small or large observations. Therefore, you
should identify outliers early on and investigate them – perhaps there has been a data
entry error, which can simply be corrected. If deemed genuine outliers, a decision has to
be made about whether or not to remove them.
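A small illustration in R of how differently the two measures react to the added observation:

    x <- c(1, 2, 4, 5, 8)
    c(mean(x), median(x))    # both equal 4
    y <- c(x, 100)           # add one outlying observation
    c(mean(y), median(y))    # mean jumps to 20, median only to 4.5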

1.6.6 Skewness, means and medians


Due to its sensitivity to outliers, the mean, more than the median, is pulled toward the
longer tail of the sample distribution.

For a positively-skewed distribution, the mean is larger than the median.

For a negatively-skewed distribution, the mean is smaller than the median.

For an exactly symmetric distribution, the mean and median are equal.

When summarising variables with skewed distributions, it is useful to report both the
mean and the median.


Example 1.11 For the datasets considered previously:

Mean Median
Level of democracy 5.3 6
GDP per capita 8.6 4.7
Diastolic blood pressure 74.2 73.5
Examination marks 56.6 57.0

1.6.7 Mode
The (sample) mode of a variable is the value which has the highest frequency (i.e.
appears most often) in the data.

Example 1.12 For Example 1.1, the modal region is 1 (Africa) and the mode of
the level of democracy is 0.

The mode is not very useful for continuous variables which have many different values,
such as GDP per capita in Example 1.1. A variable can have several modes (i.e. be
multimodal). For example, GDP per capita has modes 0.8 and 1.9, both with 5
countries out of the 155.
The mode is the only measure of central tendency which can be used even when the
values of a variable have no ordering, such as for the (nominal) region variable in
Example 1.1.

1.7 Measures of dispersion


Central tendency is not the whole story. The two sample distributions in Figure 1.6
have the same mean, but they are clearly not the same. In one (red) the values have
more dispersion (variation) than in the other.

Figure 1.6: Two sample distributions.


Example 1.13 A small example determining the sum of the squared deviations
from the (sample) mean, used to calculate common measures of dispersion.

Deviations from X̄
i Xi Xi2 Xi − X̄ (Xi − X̄)2
1 1 1 −3 9
2 2 4 −2 4
3 3 9 −1 1
4 5 25 +1 1
5 9 81 +5 25
Sum   20   120    0    40

(X̄ = 4; the column totals give \sum X_i^2 = 120 and \sum (X_i − X̄)^2 = 40.)

1.7.1 Variance and standard deviation


The first measures of dispersion, the sample variance and its square root, the sample
standard deviation, are based on (Xi − X̄)2 , i.e. the squared deviations from the mean.

Sample variance and standard deviation

The sample variance of a variable X, denoted S^2 (or S_X^2), is defined as:

S^2 = \frac{1}{n − 1} \sum_{i=1}^{n} (X_i − X̄)^2.

The sample standard deviation of X, denoted S (or S_X), is the positive square
root of the sample variance:

S = \sqrt{\frac{1}{n − 1} \sum_{i=1}^{n} (X_i − X̄)^2}.

These are the most commonly-used measures of dispersion. The standard deviation is
more understandable than the variance, because the standard deviation is expressed in
the same units as X (rather than the variance, which is expressed in squared units).
A useful rule-of-thumb for interpretation is that for many symmetric distributions, such
as the ‘normal’ distribution:

about 2/3 of the observations are between X̄ − S and X̄ + S, that is, within one
(sample) standard deviation about the (sample) mean

about 95% of the observations are between X̄ − 2 × S and X̄ + 2 × S, that is,


within two (sample) standard deviations about the (sample) mean.

Remember that standard deviations (and variances) are never negative, and they are


zero only if all the Xi observations are the same (that is, there is no variation in the
data).
If we are using a frequency table, we can also calculate:

S^2 = \frac{1}{n − 1} \left( \sum_{j=1}^{K} f_j X_j^2 − n X̄^2 \right).

Example 1.14 Consider the following simple dataset:

Deviations from X̄
i Xi Xi2 Xi − X̄ (Xi − X̄)2
1 1 1 −3 9
2 2 4 −2 4
3 3 9 −1 1
4 5 25 +1 1
5 9 81 +5 25
Sum   20   120    0    40

(X̄ = 4; the column totals give \sum X_i^2 = 120 and \sum (X_i − X̄)^2 = 40.)

We have:
S^2 = \frac{1}{n − 1} \sum (X_i − X̄)^2 = \frac{40}{4} = 10 = \frac{1}{n − 1} \left( \sum X_i^2 − n X̄^2 \right) = \frac{120 − 5 \times 4^2}{4}

and S = \sqrt{S^2} = \sqrt{10} = 3.16.
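For reference, R's built-in var() and sd() functions use the same n − 1 divisor, so they reproduce this calculation:

    x <- c(1, 2, 3, 5, 9)
    var(x)                                                # 10
    sd(x)                                                 # 3.162278
    (sum(x^2) - length(x) * mean(x)^2) / (length(x) - 1)  # same value via the shortcut formula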

1.7.2 Sample quantiles

The median, q50 , is basically the value which divides the sample into the smallest 50%
of observations and the largest 50%. If we consider other percentage splits, we get other
(sample) quantiles (percentiles), qc .

Example 1.15 Some special quantiles are given below.

The first quartile, q25 or Q1 , is the value which divides the sample into the
smallest 25% of observations and the largest 75%.

The third quartile, q75 or Q3 , gives the 75%–25% split.

The extremes in this spirit are the minimum, X(1) (the ‘0% quantile’, so to
speak), and the maximum, X(n) (the ‘100% quantile’).

These are no longer ‘in the middle’ of the sample, but they are more general
measures of location of the sample distribution.


1.7.3 Quantile-based measures of dispersion

Range and interquartile range

Two measures based on quantile-type statistics are the:

range: X(n) − X(1) = maximum − minimum

interquartile range (IQR): IQR = q75 − q25 = Q3 − Q1 .

The range is, clearly, extremely sensitive to outliers, since it depends on nothing but the
extremes of the distribution, i.e. the minimum and maximum observations. The IQR
focuses on the middle 50% of the distribution, so it is completely insensitive to outliers.
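Both measures can be computed from the sample quantiles; a short sketch in R, using the small dataset with the outlier from Section 1.6.5:

    x <- c(1, 2, 4, 5, 8, 100)
    max(x) - min(x)              # range (here heavily inflated by the outlier 100)
    quantile(x, c(0.25, 0.75))   # first and third quartiles
    IQR(x)                       # interquartile range Q3 - Q1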

1.7.4 Boxplots
A boxplot (in full, a box-and-whiskers plot) summarises some key features of a sample
distribution using quantiles. The plot comprises the following.

The line inside the box, which is the median.

The box, whose edges are the first and third quartiles (Q1 and Q3 ). Hence the box
captures the middle 50% of the data. Therefore, the length of the box is the
interquartile range.

The bottom whisker extends either to the minimum or up to a length of 1.5 times
the interquartile range below the first quartile, whichever is closer to the first
quartile.

The top whisker extends either to the maximum or up to a length of 1.5 times the
interquartile range above the third quartile, whichever is closer to the third quartile.

Points beyond 1.5 times the interquartile range below the first quartile or above the
third quartile are regarded as outliers, and plotted as individual points.

A much longer whisker (and/or outliers) in one direction relative to the other indicates
a skewed distribution, as does a median line not in the middle of the box.

Example 1.16 Figure 1.7 (on the next page) displays a boxplot of GDP per capita
using the sample of 155 countries introduced in Example 1.1. Some summary
statistics for this variable are reported below.

                   Mean   Median   Standard deviation   IQR   Range
GDP per capita      8.6      4.7          9.5            9.7   37.3


Figure 1.7: Boxplot of GDP per capita.
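A boxplot such as Figure 1.7, together with the summary statistics above, could be obtained roughly as follows in R (again assuming the GDP per capita values are in a vector gdp, a hypothetical name):

    # gdp: GDP per capita (in $000s), assumed available
    boxplot(gdp, ylab = "GDP per capita ($000s)")
    c(mean = mean(gdp), median = median(gdp),
      sd = sd(gdp), IQR = IQR(gdp), range = diff(range(gdp)))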

1.8 Associations between two variables


So far, we have tried to summarise (some aspect of) the sample distribution of one
variable at a time.
However, we can also look at two (or more) variables together. The key question is then
whether some values of one variable tend to occur frequently together with particular
values of another, for example high values with high values. This would be an example
of an association between the variables. Such associations are central to most
interesting research questions, so you will hear much more about them in the future.
Some common methods of descriptive statistics for two-variable associations are
introduced here, but only very briefly now and mainly through examples.
The best way to summarise two variables together depends on whether the variables
have ‘few’ or ‘many’ possible values. We illustrate one method for each combination, as
listed below.

‘Many’ versus ‘many’: scatterplots (including line plots).

‘Few’ versus ‘many’: side-by-side boxplots.

‘Few’ versus ‘few’: two-way contingency tables (cross-tabulations).

1.8.1 Scatterplots
A scatterplot shows the values of two continuous variables against each other, plotted
as points in a two-dimensional coordinate system.

Example 1.17 A plot of data for 164 countries is shown in Figure 1.8 (on the next
page) which plots the following variables.


On the horizontal axis (the x-axis): a World Bank measure of ‘control of corruption’, where high values indicate low levels of corruption.

On the vertical axis (the y-axis): GDP per capita in $.

Interpretation: it appears that virtually all countries with high levels of corruption
have relatively low GDP per capita. At lower levels of corruption there is a positive
association, where countries with very low levels of corruption also tend to have high
GDP per capita.

Figure 1.8: GDP per capita plotted against control of corruption.

1.8.2 Line plots (time series plots)


A common special case of a scatterplot is a line plot (time series plot), where the
variable on the x-axis is time. The points are connected in time order by lines, to show
how the variable on the y-axis changes over time.

Example 1.18 Figure 1.9 (on the next page) is a time series of an index of prices
of consumer goods and services in the UK for the period 1800–2009 (Office for
National Statistics; scaled so that the price level in 1974 = 100). This shows the
price inflation over this period.

1.8.3 Side-by-side boxplots for comparisons


Boxplots are useful for comparisons of how the distribution of a continuous variable
varies across different groups, i.e. across different levels of a discrete variable.


Figure 1.9: UK index of prices of consumer goods and services.

Example 1.19 Figure 1.10 (on the next page) shows side-by-side boxplots of GDP
per capita for the different regions in Example 1.1.

GDP per capita in African countries tends to be very low. There is a handful of
countries with somewhat higher GDPs per capita (shown as outliers in the plot).

The median for Asia is not much higher than for Africa. However, the
distribution in Asia is very much skewed to the right, with a tail of countries
with very high GDPs per capita.

The median in Europe is high, and the distribution is fairly symmetric.

The boxplots for North America and Oceania are not very useful, because they
are based on very few countries (two and three countries, respectively).

1.8.4 Two-way contingency tables


A (two-way) contingency table (or cross-tabulation) shows the frequencies in the
sample of each possible combination of the values of two discrete variables. Such tables
often show the percentages within each row or column of the table.

Example 1.20 The table on the next page reports the results from a survey of 972
private investors.3 The variables are as follows.

Row variable: age as a discrete, grouped variable (four categories).

Column variable: how much importance the respondent places on short-term gains from their investments (four levels).


Figure 1.10: Side-by-side boxplots of GDP per capita by region.

Interpretation: look at the row percentages. For example, 17.8% of those aged under
45, but only 5.2% of those aged 65 and over, think that short-term gains are ‘very
important’. Among the respondents, the older age groups seem to be less concerned
with quick profits than the younger age groups.

Importance of short-term gains

Age group      Irrelevant   Slightly important   Important   Very important   Total
Under 45 37 45 38 26 146
(25.3) (30.8) (26.0) (17.8) (100)
45–54 111 77 57 37 282
(39.4) (27.3) (20.2) (13.1) (100)
55–64 153 49 31 20 253
(60.5) (19.4) (12.3) (7.9) (100)
65 and over 193 64 19 15 291
(66.3) (22.0) (6.5) (5.2) (100)
Total 494 235 145 98 972
(50.8) (24.2) (14.9) (10.1) (100)

Numbers in parentheses are percentages within the rows. For example,


25.3 = (37/146) × 100.
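Row percentages of this kind are obtained by dividing each cell count by its row total; a sketch in R using the counts above:

    counts <- matrix(c( 37, 45, 38, 26,
                       111, 77, 57, 37,
                       153, 49, 31, 20,
                       193, 64, 19, 15),
                     nrow = 4, byrow = TRUE,
                     dimnames = list(c("Under 45", "45-54", "55-64", "65 and over"),
                                     c("Irrelevant", "Slightly important",
                                       "Important", "Very important")))
    round(100 * prop.table(counts, margin = 1), 1)   # row percentages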

3
Lewellen, W.G., R.C. Lease and G.G. Schlarbaum (1977). ‘Patterns of investment strategy and
behavior among individual investors’. The Journal of Business, 50(3), pp. 296–333.


1.9 Overview of chapter


This chapter has looked at different ways of presenting data visually. Which type of
diagram is most appropriate will be determined by the types of data being analysed.
You should be able to interpret any important features which are apparent from the
diagram. This chapter has also introduced some quantitative approaches to
summarising data, known as descriptive statistics. We have distinguished measures of
location, dispersion and skewness. Although descriptive statistics serve as a very basic
form of statistical analysis, they nevertheless are extremely useful for capturing the
main characteristics of a dataset. Therefore, any statistical analysis of data should start
with data visualisation and the calculation of descriptive statistics!

1.10 Key terms and concepts


(Arithmetic) mean Association
Bar chart Binary
Boxplot Contingency table
Continuous Count
Data matrix Descriptive statistics
Dichotomous Discrete
Distribution Frequency
Frequency table Histogram
Interquartile range Line plot
Maximum Measures of central tendency
Measures of dispersion Median
Minimum Mode
Nominal Order statistics
Ordinal Outlier
Proportion Quantile
Quartile Range
Relative frequency Sample distribution
Sample size Scatterplot
Skewness Standard deviation
Symmetry Unit
Variable Variance

1.11 Sample examination questions


Note Chapter 1 will not explicitly be examined, although it introduced the important
concept of a distribution, as well as measures of central tendency and dispersion (albeit
for sample data).

Torture numbers, and they’ll confess to anything.


(Gregg Easterbrook)

Chapter 2
Probability theory

2.1 Synopsis of chapter


Probability theory is very important for statistics because it provides the rules which
allow us to reason about uncertainty and randomness, which is the basis of statistics.
Independence and conditional probability are profound ideas, but they must be fully
understood in order to think clearly about any statistical investigation.

2.2 Learning outcomes


After completing this chapter, you should be able to:

explain the fundamental ideas of random experiments, sample spaces and events
list the axioms of probability and be able to derive all the common probability
rules from them
list the formulae for the number of combinations and permutations of k objects out
of n, and be able to routinely use such results in problems
explain conditional probability and the concept of independent events
prove the law of total probability and apply it to problems where there is a
partition of the sample space
prove Bayes’ theorem and apply it to find conditional probabilities.

2.3 Introduction
Consider the following hypothetical example. A country will soon hold a referendum
about whether it should leave the European Union (EU). An opinion poll of a random
sample of people in the country is carried out.
950 respondents say that they plan to vote in the referendum. They answer the question
‘Will you vote ‘Yes’ or ‘No’ to leaving the EU?’ as follows:

Answer
Yes No Total
Count 513 437 950
% 54% 46% 100%


However, we are not interested in just this sample of 950 respondents, but in the
population which they represent, that is, all likely voters.
Statistical inference will allow us to say things like the following about the
population.

‘A 95% confidence interval for the population proportion, π, of ‘Yes’ voters is


(0.5083, 0.5717).’

‘The null hypothesis that π = 0.50, against the alternative hypothesis that
π > 0.50, is rejected at the 5% significance level.’

In short, the opinion poll gives statistically significant evidence that ‘Yes’ voters are in
the majority among likely voters. Such methods of statistical inference will be discussed
later in the course.
The inferential statements about the opinion poll rely on the following assumptions and
results.

Each response Xi is a realisation of a random variable from a Bernoulli


distribution with probability parameter π.

The responses X1 , X2 , . . . , Xn are independent of each other.

The sampling distribution of the sample mean (proportion) X̄ has expected


value π and variance π(1 − π)/n.

By use of the central limit theorem, the sampling distribution is approximately


a normal distribution.

In the next few chapters, we will learn about the terms in bold, among others.

The need for probability in statistics

In statistical inference, the data we have observed are regarded as a sample from a
broader population, selected with a random process.

Values in a sample are variable. If we collected a different sample we would not


observe exactly the same values again.

Values in a sample are also random. We cannot predict the precise values which
will be observed before we actually collect the sample.

Probability theory is the branch of mathematics which deals with randomness. So we


need to study this first.


A preview of probability

The first basic concepts in probability are the following.

Experiment: for example, rolling a single die and recording the outcome.
Outcome of the experiment: for example, rolling a 3.
Sample space S: the set of all possible outcomes, here {1, 2, 3, 4, 5, 6}.
Event: any subset A of the sample space, for example A = {4, 5, 6}.1

Probability of an event A, P (A), will be defined as a function which assigns


probabilities (real numbers) to events (sets). This uses the language and concepts of set
theory. So we need to study the basics of set theory first.

2.4 Set theory: the basics


A set is a collection of elements (also known as ‘members’ of the set).

Example 2.1 The following are all examples of sets.

A = {Amy, Bob, Sam}.

B = {1, 2, 3, 4, 5}.

C = {x | x is a prime number} = {2, 3, 5, 7, 11, . . .}.

D = {x | x ≥ 0} (that is, the set of all non-negative real numbers).

Membership of sets and the empty set

x ∈ A means that object x is an element of set A.


x ∉ A means that object x is not an element of set A.

The empty set, denoted ∅, is the set with no elements, i.e. x ∉ ∅ is true for every
object x, and x ∈ ∅ is not true for any object x.

Example 2.2 If A = {1, 2, 3, 4, 5}, then:

1 ∈ A and 2 ∈ A

6 ∉ A and 1.5 ∉ A.

The familiar Venn diagrams help to visualise statements about sets. However, Venn
diagrams are not formal proofs of results in set theory.

1
Strictly speaking not all subsets are events, as discussed later.


Example 2.3 In Figure 2.1, the darkest area in the middle is A ∩ B, the total
shaded area is A ∪ B, and the white area is (A ∪ B)c = Ac ∩ B c .

Figure 2.1: Venn diagram depicting A ∪ B (the total shaded area).

Subsets and equality of sets

A ⊂ B means that set A is a subset of set B, defined as:

A⊂B when x ∈ A ⇒ x ∈ B.

Hence A is a subset of B if every element of A is also an element of B. An example


is shown in Figure 2.2.

Figure 2.2: Venn diagram depicting a subset, where A ⊂ B.

Example 2.4 An example of the distinction between subsets and non-subsets is:

{1, 2, 3} ⊂ {1, 2, 3, 4}, because all elements appear in the larger set

{1, 2, 5} ⊄ {1, 2, 3, 4}, because the element 5 does not appear in the larger set.

Two sets A and B are equal (A = B) if they have exactly the same elements. This
implies that A ⊂ B and B ⊂ A.


Unions of sets (‘or’)

The union, denoted ∪, of two sets is:

A ∪ B = {x | x ∈ A or x ∈ B}.

That is, the set of those elements which belong to A or B (or both). An example is
shown in Figure 2.3.

Figure 2.3: Venn diagram depicting the union of two sets.

Example 2.5 If A = {1, 2, 3, 4}, B = {2, 3} and C = {4, 5, 6}, then:

A ∪ B = {1, 2, 3, 4}

A ∪ C = {1, 2, 3, 4, 5, 6}

B ∪ C = {2, 3, 4, 5, 6}.

Intersections of sets (‘and’)

The intersection, denoted ∩, of two sets is:

A ∩ B = {x | x ∈ A and x ∈ B}.

That is, the set of those elements which belong to both A and B. An example is
shown in Figure 2.4.

Figure 2.4: Venn diagram depicting the intersection of two sets.


Example 2.6 If A = {1, 2, 3, 4}, B = {2, 3} and C = {4, 5, 6}, then:

A ∩ B = {2, 3}

A ∩ C = {4}

B ∩ C = ∅.

Unions and intersections of many sets

Both set operators can also be applied to more than two sets, such as A ∩ B ∩ C.
Concise notation for the unions and intersections of sets A1 , A2 , . . . , An is:
\bigcup_{i=1}^{n} A_i = A_1 ∪ A_2 ∪ \cdots ∪ A_n

and:

\bigcap_{i=1}^{n} A_i = A_1 ∩ A_2 ∩ \cdots ∩ A_n.

These can also be used for an infinite number of sets, i.e. when n is replaced by ∞.

Complement (‘not’)

Suppose S is the set of all possible elements which are under consideration. In
probability, S will be referred to as the sample space.
It follows that A ⊂ S for every set A we may consider. The complement of A with
respect to S is:
Ac = {x | x ∈ S and x ∉ A}.
That is, the set of those elements of S that are not in A. An example is shown in
Figure 2.5.

Figure 2.5: Venn diagram depicting the complement of a set.

We now consider some useful properties of set operators. In proofs and derivations
about sets, you can use the following results without proof.


Properties of set operators

Commutativity:

A ∩ B = B ∩ A and A ∪ B = B ∪ A.

Associativity:

A ∩ (B ∩ C) = (A ∩ B) ∩ C and A ∪ (B ∪ C) = (A ∪ B) ∪ C.

Distributive laws:

A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)

and:
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).

De Morgan’s laws:

(A ∩ B)c = Ac ∪ B c and (A ∪ B)c = Ac ∩ B c .

Further properties of set operators

If S is the sample space and A and B are any sets in S, you can also use the following
results without proof:

∅c = S.

∅ ⊂ A, A ⊂ A and A ⊂ S.

A ∩ A = A and A ∪ A = A.

A ∩ Ac = ∅ and A ∪ Ac = S.

If B ⊂ A, A ∩ B = B and A ∪ B = A.

A ∩ ∅ = ∅ and A ∪ ∅ = A.

A ∩ S = A and A ∪ S = S.

∅ ∩ ∅ = ∅ and ∅ ∪ ∅ = ∅.


Mutually exclusive events

Two sets A and B are disjoint or mutually exclusive if:

A ∩ B = ∅.

Sets A1 , A2 , . . . , An are pairwise disjoint if all pairs of sets from them are disjoint,
i.e. Ai ∩ Aj = ∅ for all i ≠ j.

Partition

The sets A1, A2, . . . , An form a partition of the set A if they are pairwise disjoint
and if \bigcup_{i=1}^{n} A_i = A, that is, A1, A2, . . . , An are collectively exhaustive of A.

Therefore, a partition divides the entire set A into non-overlapping pieces Ai, as
shown in Figure 2.6 for n = 3. Similarly, an infinite collection of sets A1, A2, . . . form
a partition of A if they are pairwise disjoint and \bigcup_{i=1}^{\infty} A_i = A.

Figure 2.6: The partition of the set A into A1 , A2 and A3 .

Example 2.7 Suppose that A ⊂ B. Show that A and B ∩ Ac form a partition of B.


We have:
A ∩ (B ∩ Ac ) = (A ∩ Ac ) ∩ B = ∅ ∩ B = ∅
and:
A ∪ (B ∩ Ac ) = (A ∪ B) ∩ (A ∪ Ac ) = B ∩ S = B.
Hence A and B ∩ Ac are mutually exclusive and collectively exhaustive of B, and so
they form a partition of B.

2.5 Axiomatic definition of probability


First, we consider four basic concepts in probability.
An experiment is a process which produces outcomes and which can have several
different outcomes. The sample space S is the set of all possible outcomes of the
experiment. An event is any subset A of the sample space such that A ⊂ S.

Example 2.8 If the experiment is ‘select a trading day at random and record the
% change in the FTSE 100 index from the previous trading day’, then the outcome
is the % change in the FTSE 100 index.
S = [−100, +∞) for the % change in the FTSE 100 index (in principle).
An event of interest might be A = {x | x > 0} – the event that the daily change is
positive, i.e. the FTSE 100 index gains value from the previous trading day.

The sample space and events are represented as sets. For two events A and B, set
operations are then interpreted as follows.

A ∩ B: both A and B happen.

A ∪ B: either A or B happens (or both happen).

Ac : A does not happen, i.e. something other than A happens.

Once we introduce probabilities of events, we can also say that:

the sample space, S, is a certain event

the empty set, ∅, is an impossible event.

2
The precise definition also requires a careful statement of which subsets of S are allowed as events,
which we can skip on this course.


Axioms of probability

‘Probability’ is formally defined as a function P (·) from subsets (events) of the sample
space S onto real numbers.2 Such a function is a probability function if it satisfies
the following axioms (‘self-evident truths’).

Axiom 1: P (A) ≥ 0 for all events A.

Axiom 2: P (S) = 1.

Axiom 3: If events A1, A2, . . . are pairwise disjoint (i.e. Ai ∩ Aj = ∅ for all i ≠ j), then:

P\left( \bigcup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty} P(A_i).

The axioms require that a probability function must always satisfy these requirements.

Axiom 1 requires that probabilities are always non-negative.

Axiom 2 requires that the outcome is some element from the sample space with
certainty (that is, with probability 1). In other words, the experiment must have
some outcome.

Axiom 3 states that if events A1 , A2 , . . . are mutually exclusive, the probability of


their union is simply the sum of their individual probabilities.

All other properties of the probability function can be derived from the axioms. We
begin by showing that a result like Axiom 3 also holds for finite collections of mutually
exclusive sets.

2.5.1 Basic properties of probability

Probability property

For the empty set, ∅, we have:


P (∅) = 0. (2.1)

Proof: Since ∅ ∩ ∅ = ∅ and ∅ ∪ ∅ = ∅, Axiom 3 gives:



P(∅) = P(∅ ∪ ∅ ∪ \cdots) = \sum_{i=1}^{\infty} P(∅).

However, the only real number for P (∅) which satisfies this is P (∅) = 0.



Probability property (finite additivity)

If A1 , A2 , . . . , An are pairwise disjoint, then:


P\left( \bigcup_{i=1}^{n} A_i \right) = \sum_{i=1}^{n} P(A_i).

Proof: In Axiom 3, set An+1 = An+2 = · · · = ∅, so that:



P\left( \bigcup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty} P(A_i) = \sum_{i=1}^{n} P(A_i) + \sum_{i=n+1}^{\infty} P(A_i) = \sum_{i=1}^{n} P(A_i)

since P (Ai ) = P (∅) = 0 for i = n + 1, n + 2, . . ..



In pictures, the previous result means that in a situation like the one shown in Figure
2.7, the probability of the combined event A = A1 ∪ A2 ∪ A3 is simply the sum of the
probabilities of the individual events:
P (A) = P (A1 ) + P (A2 ) + P (A3 ).
That is, we can simply sum probabilities of mutually exclusive sets. This is very useful
for deriving further results.


Figure 2.7: Venn diagram depicting three mutually exclusive sets, A1 , A2 and A3 . Note
although A2 and A3 have touching boundaries, there is no actual intersection and hence
they are (pairwise) mutually exclusive.

Probability property

For any event A, we have:


P (Ac ) = 1 − P (A).

Proof: We have that A ∪ Ac = S and A ∩ Ac = ∅. Therefore:


1 = P (S) = P (A ∪ Ac ) = P (A) + P (Ac )
using the previous result, with n = 2, A1 = A and A2 = Ac .



Probability property

For any event A, we have:


P (A) ≤ 1.

Proof (by contradiction): If it was true that P (A) > 1 for some A, then we would have:

P (Ac ) = 1 − P (A) < 0.

This violates Axiom 1, so cannot be true. Therefore, it must be that P (A) ≤ 1 for all A.
Putting this and Axiom 1 together, we get:

0 ≤ P (A) ≤ 1

for all events A.




Probability property

For any two events A and B, if A ⊂ B, then P (A) ≤ P (B).

Proof: We proved in Example 2.7 that we can partition B as B = A ∪ (B ∩ Ac ) where


the two sets in the union are disjoint. Therefore:

P (B) = P (A ∪ (B ∩ Ac )) = P (A) + P (B ∩ Ac ) ≥ P (A)

since P (B ∩ Ac ) ≥ 0.


Probability property

For any two events A and B, then:

P (A ∪ B) = P (A) + P (B) − P (A ∩ B).

Proof: Using partitions:

P (A ∪ B) = P (A ∩ B c ) + P (A ∩ B) + P (Ac ∩ B)

P (A) = P (A ∩ B c ) + P (A ∩ B)

P (B) = P (Ac ∩ B) + P (A ∩ B)

and hence:

P (A ∪ B) = (P (A) − P (A ∩ B)) + P (A ∩ B) + (P (B) − P (A ∩ B))


= P (A) + P (B) − P (A ∩ B).


In summary, the probability function has the following properties.

P (S) = 1 and P (∅) = 0.

0 ≤ P (A) ≤ 1 for all events A.

If A ⊂ B, then P (A) ≤ P (B).

These show that the probability function has the kinds of values we expect of something
called a ‘probability’.

P (Ac ) = 1 − P (A).

P (A ∪ B) = P (A) + P (B) − P (A ∩ B).

These are useful for deriving probabilities of new events.

Example 2.9 Suppose that, on an average weekday, of all adults in a country:

86% spend at least 1 hour watching television (event A, with P (A) = 0.86)

19% spend at least 1 hour reading newspapers (event B, with P (B) = 0.19)

15% spend at least 1 hour watching television and at least 1 hour reading
newspapers (P (A ∩ B) = 0.15).

We select a member of the population for an interview at random. For example, we


then have:

P (Ac ) = 1 − P (A) = 1 − 0.86 = 0.14, which is the probability that the


respondent watches less than 1 hour of television

P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = 0.86 + 0.19 − 0.15 = 0.90, which is the


probability that the respondent spends at least 1 hour watching television or
reading newspapers (or both).

What does ‘probability’ mean?

Probability theory tells us how to work with the probability function and derive
‘probabilities of events’ from it. However, it does not tell us what ‘probability’ really
means.
There are several alternative interpretations of the real-world meaning of ‘probability’
in this sense. One of them is outlined on the next page. The mathematical theory of
probability and calculations on probabilities are the same whichever interpretation we
assign to ‘probability’. So, in this course, we do not need to discuss the matter further.


Frequency interpretation of probability

This states that the probability of an outcome A of an experiment is the proportion


(relative frequency) of trials in which A would be the outcome if the experiment was
repeated a very large number of times under similar conditions.

Example 2.10 How should we interpret the following, as statements about the real
world of coins and babies?

‘The probability that a tossed coin comes up heads is 0.5.’ If we tossed a coin a
large number of times, and the proportion of heads out of those tosses was 0.5,
the ‘probability of heads’ could be said to be 0.5, for that coin.

‘The probability is 0.51 that a child born in the UK today is a boy.’ If the
proportion of boys among a large number of live births was 0.51, the
‘probability of a boy’ could be said to be 0.51.

How to find probabilities

A key question is how to determine appropriate numerical values of P (A) for the
probabilities of particular events.
This is usually done empirically, by observing actual realisations of the experiment and
using them to estimate probabilities. In the simplest cases, this basically applies the
frequency definition to observed data.

Example 2.11 Consider the following.

If I toss a coin 10,000 times, and 5,023 of the tosses come up heads, it seems
that, approximately, P (heads) = 0.5, for that coin.

Of the 7,098,667 live births in England and Wales in the period 1999–2009,
51.26% were boys. So we could assign the value of about 0.51 to the probability
of a boy in this population.

The estimation of probabilities of events from observed data is an important part of


statistics.

2.6 Classical probability and counting rules


Classical probability is a simple special case where values of probabilities can be
found by just counting outcomes. This requires that:

the sample space contains only a finite number of outcomes

all of the outcomes are equally likely.


Standard illustrations of classical probability are devices used in games of chance, such
as:

tossing a coin (heads or tails) one or more times

rolling one or more dice (each scored 1, 2, 3, 4, 5 or 6)

drawing one or more playing cards from a deck of 52 cards.

We will use these often, not because they are particularly important but because they
provide simple examples for illustrating various results in probability.
Suppose that the sample space, S, contains m equally likely outcomes, and that event A
consists of k ≤ m of these outcomes. Therefore:

P(A) = \frac{k}{m} = \frac{\text{number of outcomes in } A}{\text{total number of outcomes in the sample space, } S}.

That is, the probability of A is the proportion of outcomes which belong to A out of all
possible outcomes.
In the classical case, the probability of any event can be determined by counting the
number of outcomes which belong to the event, and the total number of possible
outcomes.

Example 2.12 Rolling two dice, what is the probability that the sum of the two
scores is 5?

The sample space is the 36 ordered pairs:

S = {(1, 1), (1, 2), (1, 3), (1, 4) , (1, 5), (1, 6),
(2, 1), (2, 2), (2, 3) , (2, 4), (2, 5), (2, 6),
(3, 1), (3, 2) , (3, 3), (3, 4), (3, 5), (3, 6),
(4, 1) , (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
(5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}.

The event of interest is A = {(1, 4), (2, 3), (3, 2), (4, 1)}.

The probability is P (A) = 4/36 = 1/9.
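Classical probabilities like this can also be found by enumerating the sample space with software; a minimal sketch in R:

    S <- expand.grid(die1 = 1:6, die2 = 1:6)   # the 36 equally likely outcomes
    A <- (S$die1 + S$die2 == 5)                # event: the sum of the two scores is 5
    sum(A) / nrow(S)                           # 4/36 = 1/9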

Now that we have a way of obtaining probabilities for events in the classical case, we
can use it together with the rules of probability.
The formula P (A) = 1 − P (Ac ) is convenient when we want P (A) but the probability of
the complementary event Ac , i.e. P (Ac ), is easier to find.


Example 2.13 When rolling two fair dice, what is the probability that the sum of
the dice is greater than 3?

The complement is that the sum is at most 3, i.e. the complementary event is
Ac = {(1, 1), (1, 2), (2, 1)}.

Therefore, P (A) = 1 − 3/36 = 33/36 = 11/12.

The formula:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)

says that the probability that A or B happens (or both happen) is the sum of the
probabilities of A and B, minus the probability that both A and B happen.

Example 2.14 When rolling two fair dice, what is the probability that the two
scores are equal (event A) or that the total score is greater than 10 (event B)?

P (A) = 6/36, P (B) = 3/36 and P (A ∩ B) = 1/36.

So P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = (6 + 3 − 1)/36 = 8/36 = 2/9.

How to count the outcomes

In general, it is useful to know about three ways of counting.

Listing and counting all outcomes.

Combinatorial methods: choosing k objects out of n objects.

Combining different methods: rules of sum and product.

2.6.1 Brute force: listing and counting


In small problems, just listing all possibilities is often quickest.

Example 2.15 Consider a group of four people, where each pair of people is either
connected (= friends) or not. How many different patterns of connections are there
(ignoring the identities of who is friends with whom)?
The answer is 11. See the patterns in Figure 2.8 (on the next page).

2.6.2 Combinatorial counting methods


A powerful set of counting methods answers the following question: how many ways are
there to select k objects out of n distinct objects?


Figure 2.8: Friendship patterns in a four-person network.

The answer will depend on:

whether the selection is with replacement (an object can be selected more than
once) or without replacement (an object can be selected only once)
whether the selected set is treated as ordered or unordered.

Ordered sets, with replacement

Suppose that the selection of k objects out of n needs to be:

ordered, so that the selection is an ordered sequence where we distinguish between


the 1st object, 2nd, 3rd etc.
with replacement, so that each of the n objects may appear several times in the
selection.

Therefore:

n objects are available for selection for the 1st object in the sequence
n objects are available for selection for the 2nd object in the sequence
. . . and so on, until n objects are available for selection for the kth object in the
sequence.

Therefore, the number of possible ordered sequences of k objects selected with
replacement from n objects is:

n × n × \cdots × n (k times) = n^k.


Ordered sets, without replacement

Suppose that the selection of k objects out of n now needs to be:

ordered, so that the selection is an ordered sequence where we distinguish between


the 1st object, 2nd, 3rd etc.

without replacement, so that if an object is selected once, it cannot be selected


again.

Now:

n objects are available for selection for the 1st object in the sequence

n − 1 objects are available for selection for the 2nd object

n − 2 objects are available for selection for the 3rd object

. . . and so on, until n − k + 1 objects are available for selection for the kth object.

Therefore, the number of possible ordered sequences of k objects selected without


replacement from n objects is:

n × (n − 1) × · · · × (n − k + 1). (2.2)

An important special case is when k = n.

Factorials

The number of ordered sets of n objects, selected without replacement from n objects,
is:
n! = n × (n − 1) × · · · × 2 × 1.
The number n! (read ‘n factorial’) is the total number of different ways in which
n objects can be arranged in an ordered sequence. This is known as the number of
permutations of n objects.
We also define 0! = 1.

Using factorials, (2.2) can be written as:

n × (n − 1) × \cdots × (n − k + 1) = \frac{n!}{(n − k)!}.

Unordered sets, without replacement

Suppose now that the identities of the objects in the selection matter, but the order
does not.


For example, the sequences (1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1) are
now all treated as the same, because they all contain the elements 1, 2 and 3.

The number of such unordered subsets (combinations) of k out of n objects is


determined as follows.

The number of ordered sequences is n!/(n − k)!.

Among these, every different combination of k distinct elements appears k! times,


in different orders.

Ignoring the ordering, there are:


 
\binom{n}{k} = \frac{n!}{k! (n − k)!}

different combinations, for each k = 0, 1, 2, . . . , n.

The number \binom{n}{k} is known as the binomial coefficient. Note that because 0! = 1,
\binom{n}{0} = \binom{n}{n} = 1, so there is only 1 way of selecting 0 or n out of n objects.

Summary of the combinatorial counting rules

The number of ways of selecting k objects from n distinct objects can be summarised as follows:

                 With replacement          Without replacement
Ordered          n^k                       n!/(n − k)!
Unordered        \binom{n+k−1}{k}          \binom{n}{k} = n!/(k! (n − k)!)

We have not discussed the unordered, with replacement case which is non-examinable.
It is provided here only for completeness with an illustration given in Example 2.16.

Example 2.16 We consider an outline of the proof, using n = 5 and k = 3 for


illustration.
Half-graphically, let x denote selected values and | the ‘walls’ between different
distinct values. For example:

x|xx||| denotes the selection of set (1, 2, 2)

x||x||x denotes the set (1, 3, 5)

||||xxx denotes the set (5, 5, 5).


In general, we have a sequence of n + k − 1 symbols, i.e. n − 1 walls (|) and k
selections (x). The number of different unordered sets of k objects selected with
replacement from n objects is the number of different ways of choosing the locations
of the xs in this, that is:

\binom{n + k − 1}{k}.

Example 2.17 Suppose we have k = 3 people (Amy, Bob and Sam). How many
different sets of birthdays can they have (day and month, ignoring the year, and
pretending 29 February does not exist, so that n = 365) in the following cases?

1. It makes a difference who has which birthday (ordered), i.e. Amy (1 January),
Bob (5 May) and Sam (5 December) is different from Amy (5 May), Bob (5
December) and Sam (1 January), and different people can have the same
birthday (with replacement). The number of different sets of birthdays is:

(365)^3 = 48,627,125.

2. It makes a difference who has which birthday (ordered), and different people
must have different birthdays (without replacement). The number of different
sets of birthdays is:
\frac{365!}{(365 − 3)!} = 365 × 364 × 363 = 48,228,180.

3. Only the dates matter, but not who has which one (unordered), i.e. Amy (1
January), Bob (5 May) and Sam (5 December) is treated as the same as Amy (5
May), Bob (5 December) and Sam (1 January), and different people must have
different birthdays (without replacement). The number of different sets of
birthdays is:
 
\binom{365}{3} = \frac{365!}{3! (365 − 3)!} = \frac{365 × 364 × 363}{3 × 2 × 1} = 8,038,030.
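These counts can be checked with R's built-in arithmetic and choose() function (the factorial ratio is evaluated as the product 365 × 364 × 363 to avoid overflowing factorial(365)):

    365^3           # ordered, with replacement: 48,627,125
    prod(365:363)   # ordered, without replacement: 365 * 364 * 363 = 48,228,180
    choose(365, 3)  # unordered, without replacement: 8,038,030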

Example 2.18 Consider a room with r people in it. What is the probability that
at least two of them have the same birthday (call this event A)? In particular, what
is the smallest r for which P (A) > 1/2?
Assume that all days are equally likely.
Label the people 1 to r, so that we can treat them as an ordered list and talk about
person 1, person 2 etc. We want to know how many ways there are to assign
birthdays to this list of people. We note the following.

1. The number of all possible sequences of birthdays, allowing repeats (i.e. with
replacement) is (365)^r.

2. The number of sequences where all birthdays are different (i.e. without
replacement) is 365!/(365 − r)!.


Here ‘1.’ is the size of the sample space, and ‘2.’ is the number of outcomes which
satisfy Ac , the complement of the case in which we are interested.
Therefore:
P(Ac) = \frac{365!/(365 − r)!}{(365)^r} = \frac{365 × 364 × \cdots × (365 − r + 1)}{(365)^r}

and:

P(A) = 1 − P(Ac) = 1 − \frac{365 × 364 × \cdots × (365 − r + 1)}{(365)^r}.
Probabilities, for P (A), of at least two people sharing a birthday, for different values
of the number of people r are given in the following table:

r P (A) r P (A) r P (A) r P (A)


2 0.003 12 0.167 22 0.476 32 0.753
3 0.008 13 0.194 23 0.507 33 0.775
4 0.016 14 0.223 24 0.538 34 0.795
5 0.027 15 0.253 25 0.569 35 0.814
6 0.040 16 0.284 26 0.598 36 0.832
7 0.056 17 0.315 27 0.627 37 0.849
8 0.074 18 0.347 28 0.654 38 0.864
9 0.095 19 0.379 29 0.681 39 0.878
10 0.117 20 0.411 30 0.706 40 0.891
11 0.141 21 0.444 31 0.730 41 0.903
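The entries in this table can be reproduced directly from the formula; a minimal sketch in R:

    birthday <- function(r) 1 - prod(365:(365 - r + 1)) / 365^r   # P(A) for r people
    round(sapply(c(10, 23, 41), birthday), 3)                     # 0.117, 0.507, 0.903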

2.6.3 Combining counts: rules of sum and product


Even more complex cases can be handled by combining counts.

Rule of sum

If an element can be selected in m1 ways from set 1, or m2 ways from set 2, . . . or


mK ways from set K, the total number of possible selections is:

m1 + m2 + · · · + mK .

Rule of product

If, in an ordered sequence of K elements, element 1 can be selected in m1 ways, and


then element 2 in m2 ways, . . . and then element K in mK ways, the total number
of possible sequences is:
m1 × m2 × · · · × mK .

Example 2.19 Five playing cards are drawn from a well-shuffled deck of 52 playing
cards. What is the probability that the cards form a hand which is higher than ‘a
flush’ ? The cards in a hand are treated as an unordered set.


First, we determine the size of the sample space which is all unordered subsets of 5
cards selected from 52. So the size of the sample space is:
 
\binom{52}{5} = \frac{52!}{5! × 47!} = \frac{52 × 51 × 50 × 49 × 48}{5 × 4 × 3 × 2 × 1} = 2,598,960.

The hand is higher than a flush if it is a:

‘straight flush’ or ‘four-of-a-kind’ or ‘full house’.


The rule of sum says that the number of hands better than a flush is:

number of straight flushes + number of four-of-a-kinds + number of full houses


= 40 + 624 + 3,744
= 4,408.

Therefore, the probability we want is:


\frac{4,408}{2,598,960} ≈ 0.0017.
How did we get the counts above?
A ‘straight flush’ is 5 cards of the same suit and with successive ranks (for example,
1–5, all clubs; note that 10–11–12–13–1 also count as successive). The straight can
start at ranks 1–10, and it can be of any of 4 suits. Therefore, the total number of
straight flushes is 10 × 4 = 40.
‘Four-of-a-kind’ is any hand with 4 cards of the same rank (for example, 8–8–8–8–2).
The rank of the four can be chosen in 13 ways, and the fifth card can be any of the
remaining 48. Therefore, the total number of four-of-a-kinds is 13 × 48 = 624.
A ‘full house’ is three cards of the same rank and two of another rank, for example:

♦2 ♠2 ♣2 ♦4 ♠4.

We can break the number of ways of choosing these into two steps.

The total number of ways of selecting the three: the rank of these can be any of
the 13 ranks. There are four cards of this rank, so the three of that rank can be
chosen in \binom{4}{3} = 4 ways. So the total number of different triplets is 13 × 4 = 52.
The total number of ways of selecting the two: the rank of these can be any of
the remaining 12 ranks, and the two cards of that rank can be chosen in \binom{4}{2} = 6
ways. So the total number of different pairs (with a different rank than the
triplet) is 12 × 6 = 72.

The rule of product then says that the total number of full houses is:

52 × 72 = 3,744.

(You do not need to memorise the different types of hands for the examination!)


The following is a summary of the numbers of all types of 5-card hands, and their
probabilities (to reiterate you will not need to know these for the examination):

Hand Number Probability


Straight flush 40 0.000015
Four-of-a-kind 624 0.00024
Full house 3,744 0.00144
Flush 5,108 0.0020
Straight 10,200 0.0039
Three-of-a-kind 54,912 0.0211
Two pairs 123,552 0.0475
One pair 1,098,240 0.4226
High card 1,302,540 0.5012
Total 2,598,960 1.0

2.7 Conditional probability and Bayes’ theorem


Next we introduce some of the most important concepts in probability:

independence
conditional probability
Bayes’ theorem.

These give us powerful tools for:

deriving probabilities of combinations of events


updating probabilities of events, after we learn that some other event has happened.

Independence

Two events A and B are (statistically) independent if:

P (A ∩ B) = P (A) P (B).

Independence is sometimes denoted A ⊥⊥ B. Intuitively, independence means that:

if A happens, this does not affect the probability of B happening (and vice versa)

if you are told that A has happened, this does not give you any new information
about the value of P (B) (and vice versa).

For example, independence is often a reasonable assumption when A and B


correspond to physically separate experiments.


Example 2.20 Suppose we roll two dice. We assume that all combinations of the
values of them are equally likely. Define the events:

A = ‘Score of die 1 is not 6’

B = ‘Score of die 2 is not 6’.

Therefore:

P (A) = 30/36 = 5/6

P (B) = 30/36 = 5/6

P (A ∩ B) = 25/36 = 5/6 × 5/6 = P (A) P (B), so A and B are independent.

2.7.1 Independence of multiple events


Events A1 , A2 , . . . , An are independent if the probability of the intersection of any subset
of these events is the product of the individual probabilities of the events in the subset.
This implies the important result that if events A1 , A2 , . . . , An are independent, then:
P (A1 ∩ A2 ∩ · · · ∩ An ) = P (A1 ) P (A2 ) · · · P (An ).
Note that there is a difference between pairwise independence and full independence.
The following example illustrates.

Example 2.21 It can be cold in London. Four impoverished teachers dress to feel
warm. Teacher A has a hat and a scarf and gloves, Teacher B only has a hat, Teacher
C only has a scarf and Teacher D only has gloves. One teacher out of the four is
selected at random. It is shown that although each pair of events H = ‘the teacher
selected has a hat’, S = ‘the teacher selected has a scarf’, and G = ‘the teacher
selected has gloves’ are independent, all three of these events are not independent.
Two teachers have a hat, two teachers have a scarf, and two teachers have gloves, so:
P(H) = 2/4 = 1/2, P(S) = 2/4 = 1/2 and P(G) = 2/4 = 1/2.
Only one teacher has both a hat and a scarf, so:
P(H ∩ S) = 1/4

and similarly:

P(H ∩ G) = 1/4 and P(S ∩ G) = 1/4.
From these results, we can verify that:
P (H ∩ S) = P (H) P (S)
P (H ∩ G) = P (H) P (G)
P (S ∩ G) = P (S) P (G)


and so the events are pairwise independent. However, one teacher has a hat, a scarf
and gloves, so:
P (H ∩ S ∩ G) = 1/4 ≠ P (H) P (S) P (G).
Hence the three events are not independent. If the selected teacher has a hat and a
scarf, then we know that the teacher has gloves. There is no independence for all
three events together.

2.7.2 Independent versus mutually exclusive events

The idea of independent events is quite different from that of mutually exclusive
(disjoint) events, as shown in Figure 2.9.

Figure 2.9: Venn diagram depicting mutually exclusive events.

For mutually exclusive events A ∩ B = ∅, and so, from (2.1), P (A ∩ B) = 0. For


independent events, P (A ∩ B) = P (A) P (B). So since:

P (A ∩ B) = 0 ≠ P (A) P (B)

in general (except in the uninteresting case when P (A) = 0 or P (B) = 0), then
mutually exclusive events and independent events are different.
In fact, mutually exclusive events are extremely non-independent (i.e. dependent). For
example, if you know that A has happened, you know for certain that B has not
happened. There is no particularly helpful way to represent independent events using a
Venn diagram.


Conditional probability

Consider two events A and B. Suppose you are told that B has occurred. How does
this affect the probability of event A?

The answer is given by the conditional probability of A given that B has occurred,
or the conditional probability of A given B for short, defined as:

P (A | B) = P (A ∩ B) / P (B)

assuming that P (B) > 0. The conditional probability is not defined if P (B) = 0.

Example 2.22 Suppose we roll two independent fair dice again. Consider the
following events.

A = ‘at least one of the scores is 2’.

B = ‘the sum of the scores is greater than 7’.

These are shown in Figure 2.10 (on the next page). Now P (A) = 11/36 ≈ 0.31,
P (B) = 15/36 and P (A ∩ B) = 2/36. Therefore, the conditional probability of A
given B is:
P (A | B) = P (A ∩ B) / P (B) = (2/36) / (15/36) = 2/15 ≈ 0.13.
Learning that B has occurred causes us to revise (update) the probability of A
downward, from 0.31 to 0.13.

One way to think about conditional probability is that when we condition on B, we


redefine the sample space to be B.

Example 2.23 In Example 2.22, when we are told that the conditioning event B
has occurred, we know we are within the solid green line in Figure 2.10 (on the next
page). So the 15 outcomes within it become the new sample space. There are 2
outcomes which satisfy A and which are inside this new sample space, so:
P (A | B) = 2/15 = (number of cases of A within B) / (number of cases of B).
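
The 'redefine the sample space to be B' idea translates directly into code. A small sketch (not examinable), again enumerating the 36 outcomes:

    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))

    A = {o for o in outcomes if 2 in o}          # at least one score is 2
    B = {o for o in outcomes if sum(o) > 7}      # sum of the scores is greater than 7

    # P(A | B): count the cases of A inside the new sample space B
    print(len(A & B) / len(B))                   # 2/15 = 0.1333...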

2.7.3 Conditional probability of independent events


If A ⊥⊥ B, i.e. P (A ∩ B) = P (A) P (B), and P (B) > 0 and P (A) > 0, then:

P (A | B) = P (A ∩ B) / P (B) = P (A) P (B) / P (B) = P (A)


[The figure shows the grid of all 36 equally likely outcomes (1,1), (1,2), . . . , (6,6), with the 11 outcomes forming A and the 15 outcomes forming B outlined; the outcomes (2,6) and (6,2) lie in both.]

Figure 2.10: Events A, B and A ∩ B for Example 2.22.

and:
P (B | A) = P (A ∩ B) / P (A) = P (A) P (B) / P (A) = P (B).
In other words, if A and B are independent, learning that B has occurred does not
change the probability of A, and learning that A has occurred does not change the
probability of B. This is exactly what we would expect under independence.

2.7.4 Chain rule of conditional probabilities


Since P (A | B) = P (A ∩ B)/P (B), then:
P (A ∩ B) = P (A | B) P (B).
That is, the probability that both A and B occur is the probability that A occurs given
that B has occurred multiplied by the probability that B occurs. An intuitive graphical
version of this is:

• ———→ B ———→ A

The path to A is to get first to B, and then from B to A.


It is also true that:
P (A ∩ B) = P (B | A) P (A)
and you can use whichever is more convenient. Very often some version of this chain
rule is much easier than calculating P (A ∩ B) directly.
The chain rule generalises to multiple events:
P (A1 ∩ A2 ∩ · · · ∩ An ) = P (A1 ) P (A2 | A1 ) P (A3 | A1 , A2 ) · · · P (An | A1 , A2 , . . . , An−1 )


where, for example, P (A3 | A1 , A2 ) is shorthand for P (A3 | A1 ∩ A2 ). The events can be
taken in any order, as shown in Example 2.24.

Example 2.24 For n = 3, we have:

P (A1 ∩ A2 ∩ A3 ) = P (A1 ) P (A2 | A1 ) P (A3 | A1 , A2 )


= P (A1 ) P (A3 | A1 ) P (A2 | A1 , A3 )
= P (A2 ) P (A1 | A2 ) P (A3 | A1 , A2 )
= P (A2 ) P (A3 | A2 ) P (A1 | A2 , A3 )
= P (A3 ) P (A1 | A3 ) P (A2 | A1 , A3 )
= P (A3 ) P (A2 | A3 ) P (A1 | A2 , A3 ).

Example 2.25 Suppose you draw 4 cards from a deck of 52 playing cards. What is
the probability of A = ‘the cards are the 4 aces (cards of rank 1)’ ?
We could calculate this using counting rules. There are \binom{52}{4} = 270,725 possible subsets of 4 different cards, and only 1 of these consists of the 4 aces. Therefore, P (A) = 1/270,725.
Let us try with conditional probabilities. Define Ai as ‘the ith card is an ace’, so
that A = A1 ∩ A2 ∩ A3 ∩ A4 . The necessary probabilities are:

P (A1 ) = 4/52 since there are initially 4 aces in the deck of 52 playing cards

P (A2 | A1 ) = 3/51. If the first card is an ace, 3 aces remain in the deck of 51
playing cards from which the second card will be drawn

P (A3 | A1 , A2 ) = 2/50

P (A4 | A1 , A2 , A3 ) = 1/49.

Putting these together with the chain rule gives:

P (A) = P (A1) P (A2 | A1) P (A3 | A1, A2) P (A4 | A1, A2, A3)
      = (4/52) × (3/51) × (2/50) × (1/49) = 24/6,497,400 = 1/270,725.
Here we could obtain the result in two ways. However, there are very many situations
where classical probability and counting rules are not usable, whereas conditional
probabilities and the chain rule are completely general and always applicable.
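
Both routes to P (A) are easy to reproduce numerically. A short (non-examinable) sketch comparing the counting-rule answer with the chain-rule product; math.comb is available from Python 3.8 onwards:

    from math import comb

    p_counting = 1 / comb(52, 4)
    p_chain = (4/52) * (3/51) * (2/50) * (1/49)

    print(p_counting, p_chain)    # both equal 1/270,725 ≈ 3.69e-06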

More methods for summing probabilities

We now return to probabilities of partitions like the situation shown in Figure 2.11 (on
the next page).


Figure 2.11: On the left, a Venn diagram depicting A = A1 ∪ A2 ∪ A3 , and on the right
the ‘paths’ to A.

Both diagrams in Figure 2.11 represent the partition A = A1 ∪ A2 ∪ A3 . For the next
results, it will be convenient to use diagrams like the one on the right in Figure 2.11,
where A1 , A2 and A3 are symbolised as different ‘paths’ to A.
We now develop powerful methods of calculating sums like:
P (A) = P (A1 ) + P (A2 ) + P (A3 ).

2.7.5 Total probability formula


Suppose B1 , B2 , . . . , BK form a partition of the sample space. Therefore, A ∩ B1 ,
A ∩ B2 , . . ., A ∩ BK form a partition of A, as shown in Figure 2.12.
Figure 2.12: On the left, a Venn diagram depicting the set A and the partition of S, and
on the right the ‘paths’ to A.

In other words, think of event A as the union of all the A ∩ Bi s, i.e. of ‘all the paths to
A via different intervening events Bi ’.
To get the probability of A, we now:

1. apply the chain rule to each of the paths:


P (A ∩ Bi ) = P (A | Bi ) P (Bi )

2. add up the probabilities of the paths:


P (A) = Σ_{i=1}^{K} P (A ∩ Bi) = Σ_{i=1}^{K} P (A | Bi) P (Bi).


This is known as the formula of total probability. It looks complicated, but it is


actually often far easier to use than trying to find P (A) directly.

Example 2.26 Any event B has the property that B and its complement B c
partition the sample space. So if we take K = 2, B1 = B and B2 = B c in the formula
of total probability, we get:

P (A) = P (A | B) P (B) + P (A | B c ) P (B c )
= P (A | B) P (B) + P (A | B c )(1 − P (B)).


Example 2.27 Suppose that 1 in 10,000 people (0.01%) has a particular disease. A
diagnostic test for the disease has 99% sensitivity. If a person has the disease, the
test will give a positive result with a probability of 0.99. The test has 99% specificity.
If a person does not have the disease, the test will give a negative result with a
probability of 0.99.
Let B denote the presence of the disease, and B c denote no disease. Let A denote a
positive test result. We want to calculate P (A).
The probabilities we need are P (B) = 0.0001, P (B c ) = 0.9999, P (A | B) = 0.99 and
P (A | B c ) = 0.01. Therefore:

P (A) = P (A | B) P (B) + P (A | B c ) P (B c )
= 0.99 × 0.0001 + 0.01 × 0.9999
= 0.010098.

2.7.6 Bayes’ theorem

So far we have considered how to calculate P (A) for an event A which can happen in
different ways, ‘via’ different events B1 , B2 , . . . , BK .
Now we reverse the question. Suppose we know that A has occurred, as shown in Figure
2.13 (on the next page).
What is the probability that we got there via, say, B1 ? In other words, what is the
conditional probability P (B1 | A)? This situation is depicted in Figure 2.14 (on the next
page).


Figure 2.13: Paths to A indicating that A has occurred.

Figure 2.14: A being achieved via B1 .

So we need:
P (Bj | A) = P (A ∩ Bj) / P (A)

and we already know how to get this.

P (A ∩ Bj) = P (A | Bj) P (Bj) from the chain rule.

P (A) = Σ_{i=1}^{K} P (A | Bi) P (Bi) from the total probability formula.

Bayes’ theorem

Using the chain rule and the total probability formula, we have:

P (Bj | A) = P (A | Bj) P (Bj) / Σ_{i=1}^{K} P (A | Bi) P (Bi)

which holds for each Bj , j = 1, 2, . . . , K. This is known as Bayes’ theorem.

Example 2.28 Continuing with Example 2.27, let B denote the presence of the
disease, B c denote no disease, and A denote a positive test result.
We want to calculate P (B | A), i.e. the probability that a person has the disease,
given that the person has received a positive test result.
The probabilities we need are:

P (B) = 0.0001 P (B c ) = 0.9999


P (A | B) = 0.99 and P (A | B c ) = 0.01.


Therefore:

P (B | A) = P (A | B) P (B) / (P (A | B) P (B) + P (A | B c) P (B c)) = (0.99 × 0.0001) / 0.010098 ≈ 0.0098.

Why is this so small? The reason is because most people do not have the disease and
the test has a small, but non-zero, false positive rate P (A | B c ). Therefore, most
positive test results are actually false positives.
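
The calculations in Examples 2.27 and 2.28 are easy to reproduce, and doing so makes it simple to experiment with other prevalences or error rates. A short (non-examinable) Python sketch; the function and variable names are arbitrary:

    def positive_test_posterior(prior, sensitivity, specificity):
        # P(positive) by the total probability formula
        p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
        # P(disease | positive) by Bayes' theorem
        return sensitivity * prior / p_positive, p_positive

    posterior, p_pos = positive_test_posterior(0.0001, 0.99, 0.99)
    print(p_pos)        # 0.010098
    print(posterior)    # ≈ 0.0098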

Example 2.29 You are taking part in a gameshow. The host of the show, who is
known as Monty, shows you three outwardly identical boxes. In one of them is a
prize, and the other two are empty.
You are asked to select, but not open, one of the boxes. After you have done so,
Monty, who knows where the prize is, opens one of the two remaining boxes.
He always opens a box he knows to be empty, and randomly chooses which box to
open when he has more than one option (which happens when your initial choice
contains the prize).
After opening the empty box, Monty gives you the choice of either switching to the
other unopened box or sticking with your original choice. You then receive whatever
is in the box you choose.
What should you do, assuming you want to win the prize?
Suppose the three boxes are numbered 1, 2 and 3. Let us define the following events.

B1 , B2 , B3 : the prize is in Box 1, 2 and 3, respectively.


M1 , M2 , M3 : Monty opens Box 1, 2 and 3, respectively.

Suppose you choose Box 1 first, and then Monty opens Box 3 (the answer works the
same way for all combinations of these). So Boxes 1 and 2 remain unopened.
What we want to know now are the conditional probabilities P (B1 | M3 ) and
P (B2 | M3 ).
You should switch boxes if P (B2 | M3 ) > P (B1 | M3 ), and stick with your original
choice otherwise. (You would be indifferent about switching if it was the case that
P (B2 | M3 ) = P (B1 | M3 ).)
Suppose that you first choose Box 1, and then Monty opens Box 3. Bayes’ theorem
tells us that:
P (B2 | M3) = P (M3 | B2) P (B2) / (P (M3 | B1) P (B1) + P (M3 | B2) P (B2) + P (M3 | B3) P (B3)).

We can assign values to each of these.

The prize is initially equally likely to be in any of the boxes. Therefore,


P (B1 ) = P (B2 ) = P (B3 ) = 1/3.

If the prize is in Box 1 (which you choose), Monty chooses at random between
the two remaining boxes, i.e. Boxes 2 and 3. Hence P (M3 | B1 ) = 1/2.


If the prize is in one of the two boxes you did not choose, Monty cannot open
that box, and must open the other one. Hence P (M3 | B2 ) = 1 and so
P (M3 | B3 ) = 0.

Putting these probabilities into the formula gives:

P (B2 | M3) = (1 × 1/3) / (1/2 × 1/3 + 1 × 1/3 + 0 × 1/3) = 2/3

and hence P (B1 | M3 ) = 1 − P (B2 | M3 ) = 1/3 (because also P (M3 | B3 ) = 0 and so


P (B3 | M3 ) = 0).
The same calculation applies to every combination of your first choice and Monty’s
choice. Therefore, you will always double your probability of winning the prize if you
switch from your original choice to the box that Monty did not open.
The Monty Hall problem has been called a ‘cognitive illusion’, because something
about it seems to mislead most people’s intuition. In experiments, around 85% of
people tend to get the answer wrong at first.
The most common incorrect response is that the probabilities of the remaining boxes
after Monty’s choice are both 1/2, so that you should not (or rather need not) switch.
This is typically based on ‘no new information’ reasoning. Since we know in advance
that Monty will open one empty box, the fact that he does so appears to tell us
nothing new and should not cause us to favour either of the two remaining boxes –
hence a probability of 1/2 for each.
It is true that Monty’s choice tells you nothing new about the probability of your
original choice, which remains at 1/3. However, it tells us a lot about the other two
boxes. First, it tells us everything about the box he chose, namely that it does not
contain the prize. Second, all of the probability of that box gets ‘inherited’ by the
box neither you nor Monty chose, which now has the probability 2/3.
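
If the 2/3 answer still feels counter-intuitive, it can be checked by simulation. A minimal (non-examinable) sketch of the game as described above; the seed and number of repetitions are arbitrary choices:

    import random

    def play(switch, n_games=100_000):
        wins = 0
        for _ in range(n_games):
            prize = random.randint(1, 3)
            choice = random.randint(1, 3)
            # Monty opens an empty box other than the player's choice
            opened = random.choice([b for b in (1, 2, 3) if b != choice and b != prize])
            if switch:
                choice = next(b for b in (1, 2, 3) if b != choice and b != opened)
            wins += (choice == prize)
        return wins / n_games

    random.seed(1)
    print(play(switch=False), play(switch=True))   # ≈ 1/3 and ≈ 2/3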

Example 2.30 You are waiting for your bag at the baggage reclaim carousel of an
airport. Suppose that you know that there are 200 bags to come from your flight,
and you are counting the distinct bags which come out. Suppose that x bags have
arrived, and your bag is not among them. What is the probability that your bag will
not arrive at all, i.e. that it has been lost (or at least delayed)?
Define A = ‘your bag has been lost’ and x = ‘your bag is not among the first x bags
to arrive’. What we want to know is the conditional probability P (A | x) for any
x = 0, 1, 2, . . . , 200. The conditional probabilities the other way round are as follows.

P (x | A) = 1 for all x. If your bag has been lost, it will not arrive!

P (x | Ac ) = (200 − x)/200 if we assume that bags come out in a completely


random order.


Using Bayes’ theorem, we get:

P (A | x) = P (x | A) P (A) / (P (x | A) P (A) + P (x | Ac) P (Ac))
          = P (A) / (P (A) + ((200 − x)/200)(1 − P (A))).

Obviously, P (A | 200) = 1. If the bag has not arrived when all 200 have come out, it
has been lost!
For other values of x we need P (A). This is the general probability that a bag gets
lost, before you start observing the arrival of the bags from your particular flight.
This kind of probability is known as the prior probability of an event A.
Let us assign values to P (A) based on some empirical data. Statistics by the
Association of European Airlines (AEA) show how many bags were ‘mishandled’ per
1,000 passengers the airlines carried. This is not exactly what we need (since not all
passengers carry bags, and some have several), but we will use it anyway. In
particular, we will compare the results for the best and the worst of the AEA in 2006:

Air Malta: P (A) = 0.0044

British Airways: P (A) = 0.023.

Figure 2.15 (on the next page) shows a plot of P (A | x) as a function of x for these
two airlines.
The probabilities are fairly small, even for large values of x.

For Air Malta, P (A | 199) = 0.469. So even when only 1 bag remains to arrive,
the probability is less than 0.5 that your bag has been lost.

For British Airways, P (A | 199) = 0.825. Also, we see that P (A | 196) = 0.541 is
the first probability over 0.5.

This is because the baseline probability of lost bags, P (A), is low.


So, the moral of the story is that even when nearly everyone else has collected their
bags and left, do not despair!
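
The curve in Figure 2.15 can be reproduced with a few lines of code. The (non-examinable) sketch below just prints a few of the probabilities rather than drawing the plot; the prior values are those quoted in the example:

    def p_lost_given_x(prior, x, n_bags=200):
        # Bayes' theorem with P(x | A) = 1 and P(x | not A) = (n_bags - x)/n_bags
        return prior / (prior + (n_bags - x) / n_bags * (1 - prior))

    for airline, prior in [("Air Malta", 0.0044), ("British Airways", 0.023)]:
        print(airline, [round(p_lost_given_x(prior, x), 3) for x in (190, 196, 199)])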

2.8 Overview of chapter


This chapter introduced some formal terminology related to probability theory. The
axioms of probability were introduced, from which various other probability results were
derived. There followed a brief discussion of counting rules (using permutations and
combinations). The important concepts of independence and conditional probability
were discussed, and Bayes’ theorem was derived.



Figure 2.15: Plot of P (A | x) as a function of x for the two airlines in Example 2.30, Air
Malta and British Airways (BA).

2.9 Key terms and concepts

Axiom Bayes’ theorem


Binomial coefficient Chain rule
Classical probability Collectively exhaustive
Combination Complement
Conditional probability Counting
Disjoint Element
Empty set Experiment
Event Factorial
Independence Intersection
Mutually exclusive Outcome
Pairwise disjoint Partition
Permutation Probability (theory)
Relative frequency Sample space
Set Subset
Total probability Union
Venn diagram With(out) replacement

2.10 Sample examination questions

1. A box contains 12 light bulbs, of which 2 are defective. If a person selects 5 bulbs
at random, without replacement, what is the probability that both defective bulbs
will be selected?


2. A and B are independent events such that:

P ((A ∪ B)c ) = π1

and:
P (A) = π2 .
Determine P (B) as a function of π1 and π2 .

3. A county is made up of three (mutually exclusive) communities A, B and C, with


proportions of people living in them given by the following table:

Community A B C
Proportion 0.20 0.50 0.30

Given a person belongs to a certain community, the probability of that person


being vaccinated is given by the following table:

Community given A B C
Probability of being vaccinated 0.80 0.70 0.60

(a) We choose a person from the county at random. What is the probability that
the person is not vaccinated?
(b) We choose a person from the county at random. Find the probability that the
person is in community A, given the person is vaccinated.
(c) In words, briefly explain how the ‘probability of being vaccinated’ for each
community would be known in practice.

2.11 Solutions to Sample examination questions


1. The sample space consists of all (unordered) subsets of 5 out of the 12 light bulbs in the box. There are \binom{12}{5} such subsets. The number of subsets which contain the 2 defective bulbs is the number of subsets of size 3 out of the other 10 bulbs, \binom{10}{3}, so the probability we want is:

   \binom{10}{3} / \binom{12}{5} = (5 × 4)/(12 × 11) = 0.1515.

2. We are given that P ((A ∪ B)c ) = π1 , P (A) = π2 , and that A and B are
independent. Hence:

P (A ∪ B) = 1 − π1 and P (A ∩ B) = P (A) P (B) = π2 P (B).

Therefore:

P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = π2 + P (B) − π2 P (B) = 1 − π1 .


Solving for P (B), we have:


P (B) = (1 − π1 − π2) / (1 − π2).

3. (a) Denote acceptance of vaccination by V , and not accepting by V c . By the law


of total probability, we have:

P (V c ) = P (V c | A) P (A) + P (V c | B) P (B) + P (V c | C) P (C)


= (1 − 0.8) × 0.2 + (1 − 0.7) × 0.5 + (1 − 0.6) × 0.3
= 0.31.

(b) By Bayes’ theorem, we have:

P (A | V ) = P (V | A) P (A) / P (V ) = (0.8 × 0.2)/(1 − 0.31) = 16/69 = 0.2319.

(c) Any reasonable answer accepted, such as relative frequency estimate, or from
health records.

There are lies, damned lies and statistics.


(Mark Twain)

Chapter 3
Random variables

3.1 Synopsis of chapter


This chapter introduces the concept of random variables and probability distributions.
These distributions are univariate, which means that they are used to model a single
numerical quantity. The concepts of expected value and variance are also discussed.

3.2 Learning outcomes


After completing this chapter, you should be able to:

define a random variable and distinguish it from the values which it takes
explain the difference between discrete and continuous random variables
find the mean and the variance of simple random variables whether discrete or
continuous
demonstrate how to proceed and use simple properties of expected values and
variances.

3.3 Introduction
In Chapter 1, we considered descriptive statistics for a sample of observations of a
variable X. Here we will represent the observations as a sequence of variables, denoted
as:
X1 , X2 , . . . , Xn
where n is the sample size.
In statistical inference, the observations will be treated as a sample drawn at random
from a population. We will then think of each observation Xi of a variable X as an
outcome of an experiment.

The experiment is ‘select a unit at random from the population and record its
value of X’.
The outcome is the observed value Xi of X.

Because variables X in statistical data are recorded as numbers, we can now focus on
experiments where the outcomes are also numbers – random variables.


Random variable

A random variable is an experiment for which the outcomes are numbers.1 This
means that for a random variable:

the sample space, S, is the set of real numbers R, or a subset of R

the outcomes are numbers in this sample space (instead of ‘outcomes’, we often
call them the values of the random variable)

events are sets of numbers (values) in this sample space.

Discrete and continuous random variables

There are two main types of random variables, depending on the nature of S, i.e. the
possible values of the random variable.

A random variable is continuous if S is all of R or some interval(s) of it, for


example [0, 1] or [0, ∞).

A random variable is discrete if it is not continuous.2 More precisely, a discrete


random variable takes a finite or countably infinite number of values.

Notation

A random variable is typically denoted by an upper-case letter, for example X (or Y ,


W etc.). A specific value of a random variable is often denoted by a lower-case letter,
for example x.
Probabilities of values of a random variable are written as follows.

P (X = x) denotes the probability that (the value of) X is x.

P (X > 0) denotes the probability that X is positive.

P (a < X < b) denotes the probability that X is between the numbers a and b.

Random variables versus samples

You will notice that many of the quantities we define for random variables are
analogous to sample quantities defined in Chapter 1.

1
This definition is a bit informal, but it is sufficient for this course.
2
Strictly speaking, a discrete random variable is not just a random variable which is not continuous
as there are many others, such as mixture distributions.


Random variable Sample


Probability distribution Sample distribution
Mean (expected value) Sample mean (average)
Variance Sample variance
Standard deviation Sample standard deviation
Median Sample median

This is no accident. In statistics, the population is represented as following a probability


distribution, and quantities for an observed sample are then used as estimators of the
analogous quantities for the population.

3.4 Discrete random variables

Example 3.1 The following two examples will be used throughout this chapter.

1. The number of people living in a randomly selected household in England.


• For simplicity, we use the value 8 to represent ‘8 or more’ (because 9 and
above are not reported separately in official statistics).
• This is a discrete random variable, with possible values of 1, 2, 3, 4, 5, 6, 7
and 8.

2. A person throws a basketball repeatedly from the free-throw line, trying to


make a basket. Consider the following random variable.
The number of unsuccessful throws before the first successful throw.
• The possible values of this are 0, 1, 2, . . .

3.4.1 Probability distribution of a discrete random variable


The probability distribution (or just distribution) of a discrete random variable X
is specified by:

its possible values, x (i.e. its sample space, S)

the probabilities of the possible values, i.e. P (X = x) for all x ∈ S.

So we first need to develop a convenient way of specifying the probabilities.


Example 3.2 Consider the following probability distribution for the household
size, X.3

Number of people
in the household, x P (X = x)
1 0.3002
2 0.3417
3 0.1551
4 0.1336
5 0.0494
6 0.0145
7 0.0034
8 0.0021

Probability function

The probability function (pf) of a discrete random variable X, denoted by p(x),


is a real-valued function such that for any number x the function is:

p(x) = P (X = x).

We can talk of p(x) both as the pf of the random variable X, and as the pf of the
probability distribution of X. Both mean the same thing.

Alternative terminology: the pf of a discrete random variable is also often called the
probability mass function (pmf).

Alternative notation: instead of p(x), the pf is also often denoted by, for example, pX (x)
– especially when it is necessary to indicate clearly to which random variable the
function corresponds.

Necessary conditions for a probability function

To be a pf of a discrete random variable X with sample space S, a function p(x)


must satisfy the following conditions.

1. p(x) ≥ 0 for all real numbers x.

2. Σ_{xi ∈ S} p(xi) = 1, i.e. the sum of probabilities of all possible values of X is 1.

The pf is defined for all real numbers x, but p(x) = 0 for any x ∉ S, i.e. for any value x which is not one of the possible values of X.

3
Source: ONS, National report for the 2001 Census, England and Wales. Table UV51.


Example 3.3 Continuing Example 3.2, here we can simply list all the values:



p(x) = 0.3002   for x = 1
       0.3417   for x = 2
       0.1551   for x = 3
       0.1336   for x = 4
       0.0494   for x = 5
       0.0145   for x = 6
       0.0034   for x = 7
       0.0021   for x = 8
       0        otherwise.

These are clearly all non-negative, and their sum is Σ_{x=1}^{8} p(x) = 1.
A graphical representation of the pf is shown in Figure 3.1.

Figure 3.1: Probability function for Example 3.3.
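
A pf given as a finite list of values is naturally stored as a dictionary, which makes it easy to check the two conditions above. A small (non-examinable) Python sketch, purely for illustration:

    pf = {1: 0.3002, 2: 0.3417, 3: 0.1551, 4: 0.1336,
          5: 0.0494, 6: 0.0145, 7: 0.0034, 8: 0.0021}

    assert all(p >= 0 for p in pf.values())   # condition 1: p(x) >= 0
    print(sum(pf.values()))                   # condition 2: probabilities sum to 1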

For the next example, we need to remember the following results from mathematics,
concerning sums of geometric series. If r ≠ 1, then:

Σ_{x=0}^{n−1} a r^x = a(1 − r^n) / (1 − r)

and if |r| < 1, then:

Σ_{x=0}^{∞} a r^x = a / (1 − r).


Example 3.4 In the basketball example, the number of possible values is infinite,
so we cannot simply list the values of the pf. So we try to express it as a formula.
Suppose that:

the probability of a successful throw is π at each throw and, therefore, the


probability of an unsuccessful throw is 1 − π

outcomes of different throws are independent.

Hence the probability that the first success occurs after x failures is the probability
of a sequence of x failures followed by a success, i.e. the probability is:

(1 − π)x π.

So the pf of the random variable X (the number of failures before the first success) is:

p(x) = (1 − π)^x π   for x = 0, 1, 2, . . .          (3.1)
       0             otherwise
where 0 ≤ π ≤ 1. Let us check that (3.1) satisfies the conditions for a pf.

Clearly, p(x) ≥ 0 for all x, since π ≥ 0 and 1 − π ≥ 0.

Using the sum to infinity of a geometric series, we get:

Σ_{x=0}^{∞} p(x) = Σ_{x=0}^{∞} (1 − π)^x π = π Σ_{x=0}^{∞} (1 − π)^x = π × 1/(1 − (1 − π)) = π/π = 1.

The expression of the pf involves a parameter π (the probability of a successful


throw), a number for which we can choose different values. This defines a whole
‘family’ of individual distributions, one for each value of π. For example, Figure 3.2
(on the next page) shows values of p(x) for two values of π reflecting fairly good and
fairly poor free-throw shooters, respectively.
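
The pf in (3.1) has infinitely many possible values, so it cannot be listed, but it can be evaluated and its sum approximated numerically. A short (non-examinable) sketch for the two shooters; the truncation point of 1,000 terms is an arbitrary choice:

    def p(x, pi):
        # pf of the number of failures before the first success
        return (1 - pi)**x * pi

    for pi in (0.7, 0.3):
        print([round(p(x, pi), 4) for x in range(5)])
        print(sum(p(x, pi) for x in range(1000)))   # numerically equal to 1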

3.4.2 The cumulative distribution function (cdf)


Another way to specify a probability distribution is to give its cumulative
distribution function (cdf) (or simply distribution function).



Figure 3.2: Probability function for Example 3.4. π = 0.7 indicates a fairly good free-throw
shooter. π = 0.3 indicates a fairly poor free-throw shooter.

Cumulative distribution function (cdf)

The cdf is denoted F (x) (or FX (x)) and defined as:

F (x) = P (X ≤ x) for all real numbers x.

For a discrete random variable it is given by:


F (x) = Σ_{xi ∈ S, xi ≤ x} p(xi)

i.e. the sum of the probabilities of the possible values of X which are less than or
equal to x.

Example 3.5 Continuing with the household size example, values of F (x) at all
possible values of X are:

Number of people
in the household, x p(x) F (x)
1 0.3002 0.3002
2 0.3417 0.6419
3 0.1551 0.7970
4 0.1336 0.9306
5 0.0494 0.9800
6 0.0145 0.9945
7 0.0034 0.9979
8 0.0021 1.0000


These are shown in graphical form in Figure 3.3.


Figure 3.3: Cumulative distribution function for Example 3.5.

Example 3.6 In the basketball example, where p(x) = (1 − π)^x π for x = 0, 1, 2, . . ., we can calculate a simple formula for the cdf, using the sum of a geometric series. Since, for any non-negative integer y, we obtain:

Σ_{x=0}^{y} p(x) = Σ_{x=0}^{y} (1 − π)^x π = π Σ_{x=0}^{y} (1 − π)^x = π × (1 − (1 − π)^{y+1})/(1 − (1 − π)) = 1 − (1 − π)^{y+1}

we can write:

F (x) = 0                   for x < 0
        1 − (1 − π)^{x+1}   for x = 0, 1, 2, . . . .

The cdf is shown in graphical form in Figure 3.4 (on the next page).



Figure 3.4: Cumulative distribution function for Example 3.6.

3.4.3 Properties of the cdf for discrete distributions


The cdf F (x) of a discrete random variable X is a step function such that:

F (x) remains constant in all intervals between possible values of X

at a possible value xi of X, F (x) jumps up by the amount p(xi ) = P (X = xi )

at such an xi , the value of F (xi ) is the value at the top of the jump (i.e. F (x) is
right-continuous).

3.4.4 General properties of the cdf


These hold for both discrete and continuous random variables.

1. 0 ≤ F (x) ≤ 1 for all x (since F (x) is a probability).

2. F (x) → 0 as x → −∞, and F (x) → 1 as x → ∞.

3. F (x) is a non-decreasing function, i.e. if x1 < x2 , then F (x1 ) ≤ F (x2 ).

4. For any x1 < x2 , P (x1 < X ≤ x2 ) = F (x2 ) − F (x1 ).

Either the pf or the cdf can be used to calculate the probabilities of any events for a
discrete random variable.


Example 3.7 Continuing with the household size example (for the probabilities,
see Example 3.5), then:

P (X = 1) = p(1) = F (1) = 0.3002

P (X = 2) = p(2) = F (2) − F (1) = 0.3417

P (X ≤ 2) = p(1) + p(2) = F (2) = 0.6419

P (X = 3 or 4) = p(3) + p(4) = F (4) − F (2) = 0.2887

P (X > 5) = p(6) + p(7) + p(8) = 1 − F (5) = 0.0200

P (X ≥ 5) = p(5) + p(6) + p(7) + p(8) = 1 − F (4) = 0.0694.
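
All of these probabilities follow mechanically from the pf, so they are easy to reproduce. A (non-examinable) sketch building the cdf by cumulative summation; the names are arbitrary:

    from itertools import accumulate

    pf = {1: 0.3002, 2: 0.3417, 3: 0.1551, 4: 0.1336,
          5: 0.0494, 6: 0.0145, 7: 0.0034, 8: 0.0021}

    cdf = dict(zip(pf, accumulate(pf.values())))     # F(x) at each possible value

    print(round(cdf[2], 4))            # P(X <= 2)     = 0.6419
    print(round(cdf[4] - cdf[2], 4))   # P(X = 3 or 4) = 0.2887
    print(round(1 - cdf[5], 4))        # P(X > 5)      = 0.0200
    print(round(1 - cdf[4], 4))        # P(X >= 5)     = 0.0694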

3.4.5 Properties of a discrete random variable


Let X be a discrete random variable with sample space S and pf p(x).

Expected value of a discrete random variable

The expected value (or mean) of X is denoted E(X), and defined as:
E(X) = Σ_{xi ∈ S} xi p(xi).

This can also be written more concisely as E(X) = Σ_x x p(x), or simply E(X) = Σ x p(x).

We can talk of E(X) as the expected value of both the random variable X, and of the
probability distribution of X.
Alternative notation: instead of E(X), the symbol µ (the lower-case Greek letter ‘mu’),
or µX , is often used.

3.4.6 Expected value versus sample mean


The mean (expected value) E(X) of a probability distribution is analogous to the
sample mean (average) X̄ of a sample distribution.
This is easiest to see when the sample space is finite. Suppose the random variable X can have K different values x1, x2, . . . , xK, and their frequencies in a sample are f1, f2, . . . , fK, respectively. Therefore, the sample mean of X is:

X̄ = (f1 x1 + f2 x2 + · · · + fK xK) / (f1 + f2 + · · · + fK) = x1 p̂(x1) + x2 p̂(x2) + · · · + xK p̂(xK) = Σ_{i=1}^{K} xi p̂(xi)


where:

p̂(xi) = fi / Σ_{i=1}^{K} fi

are the sample proportions of the values xi.

The expected value of the random variable X is:

E(X) = x1 p(x1) + x2 p(x2) + · · · + xK p(xK) = Σ_{i=1}^{K} xi p(xi).

So X̄ uses the sample proportions, p̂(xi), whereas E(X) uses the population probabilities, p(xi).

Example 3.8 Continuing with the household size example:

Number of people
in the household, x p(x) x p(x)
1 0.3002 0.3002
2 0.3417 0.6834
3 0.1551 0.4653
4 0.1336 0.5344
5 0.0494 0.2470
6 0.0145 0.0870
7 0.0034 0.0238
8 0.0021 0.0168
Sum 2.3579
= E(X)

The expected number of people in a randomly selected household is 2.36.

Example 3.9 For the basketball example:



p(x) = (1 − π)^x π   for x = 0, 1, 2, . . .
       0             otherwise.


The expected value of X is then:

E(X) = Σ_{xi ∈ S} xi p(xi) = Σ_{x=0}^{∞} x (1 − π)^x π
     = Σ_{x=1}^{∞} x (1 − π)^x π                        (starting from x = 1)
     = (1 − π) Σ_{x=1}^{∞} x (1 − π)^{x−1} π
     = (1 − π) Σ_{y=0}^{∞} (y + 1)(1 − π)^y π           (using y = x − 1)
     = (1 − π) [ Σ_{y=0}^{∞} y (1 − π)^y π + Σ_{y=0}^{∞} (1 − π)^y π ]
     = (1 − π) (E(X) + 1)
     = (1 − π) E(X) + (1 − π)

where the first sum inside the square brackets equals E(X) and the second equals 1. From this we can solve:

E(X) = (1 − π) / (1 − (1 − π)) = (1 − π)/π.

Hence for example:

E(X) = 0.3/0.7 = 0.42 for π = 0.7

E(X) = 0.7/0.3 = 2.33 for π = 0.3.

So, before scoring a basket, a fairly good free-throw shooter (with π = 0.7) misses on
average about 0.42 shots, and a fairly poor free-throw shooter (with π = 0.3) misses
on average about 2.33 shots.

Example 3.10 To illustrate the use of expected values, let us consider the game of
roulette, from the point of view of the casino (‘The House’).
Suppose a player puts a bet of £1 on ‘red’. If the ball lands on any of the 18 red
numbers, the player gets that £1 back, plus another £1 from The House. If the result
is one of the 18 black numbers or the green 0, the player loses the £1 to The House.
We assume that the roulette wheel is unbiased, i.e. that all 37 numbers have equal
probabilities. What can we say about the probabilities and expected values of wins
and losses?


Define the random variable X = ‘money received by The House’. Its possible values
are −1 (the player wins) and 1 (the player loses). The probability function is:

p(x) = 18/37   for x = −1
       19/37   for x = 1
       0       otherwise.

Therefore, the expected value is:


   
E(X) = (−1 × 18/37) + (1 × 19/37) = +0.027.

On average, The House expects to win 2.7p for every £1 which players bet on red.
This expected gain is known as the house edge. It is positive for all possible bets in
roulette.
The edge is the expected gain from a single bet. Usually, however, players bet again
if they win at first – gambling can be addictive!
Consider a player who starts with £10 and bets £1 on red repeatedly until the
player either has lost all of the £10 or doubled their money to £20.
It can be shown that the probability that such a player reaches £20 before they go
down to £0 is about 0.368. Define X = ‘money received by The House’, with the
probability function:

p(x) = 0.368   for x = −10
       0.632   for x = 10
       0       otherwise.

Therefore, the expected value is:

E(X) = (−10 × 0.368) + (10 × 0.632) = +2.64.

On average, The House can expect to keep about 26.4% of the money which players
like this bring to the table.
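
The 0.368 figure quoted above (the probability of doubling £10 before losing it) can be checked by simulating the bet-£1-on-red strategy directly. A rough (non-examinable) sketch; the seed and number of repetitions are arbitrary:

    import random

    def double_or_bust(start=10, target=20, p_win=18/37):
        money = start
        while 0 < money < target:
            money += 1 if random.random() < p_win else -1
        return money == target

    random.seed(1)
    n = 100_000
    p_double = sum(double_or_bust() for _ in range(n)) / n
    print(p_double)                                  # ≈ 0.368
    print((-10) * p_double + 10 * (1 - p_double))    # The House's expected take, ≈ +2.64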

Expected values of functions of a random variable

Let g(X) be a function (‘transformation’) of a discrete random variable X. This is


also a random variable, and its expected value is:
E(g(X)) = Σ_x g(x) pX(x)

where pX (x) = p(x) is the probability function of X.

Example 3.11 The expected value of the square of X is:


X
E(X 2 ) = x2 p(x).


In general:
E(g(X)) ≠ g(E(X))
when g(X) is a non-linear function of X.

Example 3.12 Note that:


 
E(X^2) ≠ (E(X))^2 and E(1/X) ≠ 1/E(X).

Expected values of linear transformations

Suppose X is a random variable and a and b are constants, i.e. known numbers
which are not random variables. Therefore:

E(aX + b) = a E(X) + b.

Proof: We have:

E(aX + b) = Σ_x (ax + b) p(x) = Σ_x ax p(x) + Σ_x b p(x) = a Σ_x x p(x) + b Σ_x p(x) = a E(X) + b

where the last step follows from:

i. Σ_x x p(x) = E(X), by definition of E(X)

ii. Σ_x p(x) = 1, by definition of the probability function.


A special case of the result:

E(aX + b) = a E(X) + b

is obtained when a = 0, which gives:

E(b) = b.

That is, the expected value of a constant is the constant itself.


Variance and standard deviation of a discrete random variable

The variance of a discrete random variable X is defined as:


Var(X) = E((X − E(X))^2) = Σ_x (x − E(X))^2 p(x).

The standard deviation of X is sd(X) = √Var(X).

Both Var(X) and sd(X) are always ≥ 0. Both are measures of the dispersion (variation)
of the random variable X.
Alternative notation: the variance is often denoted σ 2 (‘sigma squared’) and the
standard deviation by σ (‘sigma’).
An alternative formula: the variance can also be calculated as:

Var(X) = E(X 2 ) − (E(X))2 .

This will be proved later.

Example 3.13 Continuing with the household size example:

x     p(x)     x p(x)   (x − E(X))^2   (x − E(X))^2 p(x)   x^2   x^2 p(x)
1     0.3002   0.3002   1.844          0.554               1     0.300
2     0.3417   0.6834   0.128          0.044               4     1.367
3     0.1551   0.4653   0.412          0.064               9     1.396
4     0.1336   0.5344   2.696          0.360               16    2.138
5     0.0494   0.2470   6.981          0.345               25    1.235
6     0.0145   0.0870   13.265         0.192               36    0.522
7     0.0034   0.0238   21.549         0.073               49    0.167
8     0.0021   0.0168   31.833         0.067               64    0.134
Sum            2.3579                  1.699                     7.259
               = E(X)                  = Var(X)                  = E(X^2)

Var(X) = E((X − E(X))^2) = 1.699 = 7.259 − (2.358)^2 = E(X^2) − (E(X))^2 and sd(X) = √Var(X) = √1.699 = 1.30.
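
The arithmetic in this table is a natural thing to delegate to a short script. A (non-examinable) sketch using the same pf, purely for illustration:

    pf = {1: 0.3002, 2: 0.3417, 3: 0.1551, 4: 0.1336,
          5: 0.0494, 6: 0.0145, 7: 0.0034, 8: 0.0021}

    mean = sum(x * p for x, p in pf.items())
    var = sum((x - mean)**2 * p for x, p in pf.items())
    ex2 = sum(x**2 * p for x, p in pf.items())

    print(mean)                 # E(X) = 2.3579
    print(var, ex2 - mean**2)   # Var(X) = 1.699, computed both ways
    print(var ** 0.5)           # sd(X) ≈ 1.30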

Example 3.14 For the basketball example, p(x) = (1 − π)x π for x = 0, 1, 2, . . ., and
0 otherwise. It can be shown (although the proof is beyond the scope of the course)
that for this distribution:
Var(X) = (1 − π)/π^2.
In the two cases we have used as examples:

Var(X) = 0.3/(0.7)^2 = 0.61 and sd(X) = 0.78 for π = 0.7

Var(X) = 0.7/(0.3)^2 = 7.78 and sd(X) = 2.79 for π = 0.3.


So the variation in how many free throws a fairly poor shooter misses before the first
success is much higher than the variation for a fairly good shooter.

Variances of linear transformations

If X is a random variable and a and b are constants, then:

Var(aX + b) = a2 Var(X).

Proof:

Var(aX + b) = E(((aX + b) − E(aX + b))^2)
            = E((aX + b − a E(X) − b)^2)
            = E((aX − a E(X))^2)
            = E(a^2 (X − E(X))^2)
            = a^2 E((X − E(X))^2)
            = a^2 Var(X).

Therefore, sd(aX + b) = |a| sd(X).


If a = 0, this gives:
Var(b) = 0.
That is, the variance of a constant is 0. The converse also holds – if a random variable
has a variance of 0, it is actually a constant.

Summary of properties of E(X) and Var(X)

If X is a random variable and a and b are constants, then:

E(aX + b) = a E(X) + b
Var(aX + b) = a^2 Var(X) and sd(aX + b) = |a| sd(X)
E(b) = b and Var(b) = sd(b) = 0.

We define Var(X) = E((X − E(X))^2) = E(X^2) − (E(X))^2 and sd(X) = √Var(X).
Also, Var(X) ≥ 0 and sd(X) ≥ 0 always, and Var(X) = sd(X) = 0 only if X is a
constant.

3.4.7 Moments of a random variable


We can also define, for each k = 1, 2, . . ., the following:

the kth moment about zero is µk = E(X^k)

the kth central moment is µ′k = E((X − E(X))^k).


Clearly, µ1 = µ = E(X) and µ′2 = Var(X).


These will be mentioned again in Chapter 7.

Example 3.15 For further practice, let us consider a discrete random variable X
which has possible values 0, 1, 2, . . . , n, where n is a known positive integer, and X
has the following probability function:
p(x) = \binom{n}{x} π^x (1 − π)^{n−x}   for x = 0, 1, 2, . . . , n
       0                                otherwise

where \binom{n}{x} = n!/(x! (n − x)!) denotes the binomial coefficient, and π is a probability parameter such that 0 ≤ π ≤ 1.


A random variable like this follows the binomial distribution. We will discuss its
motivation and uses later in the next chapter.
Here, we consider the following tasks for this distribution.

Show that p(x) satisfies the conditions for a probability function.


Calculate probabilities from p(x).
Write down the cumulative distribution function, F (x).
Derive the expected value, E(X).

Note: the examination may also contain questions like this. The difficulty of such
questions depends partly on the form of p(x), and what kinds of manipulations are
needed to work with it. So questions of this type may be very easy, or quite hard!

To show that p(x) is a probability function, we need to show the following.

1. p(x) ≥ 0 for all x. This is clearly true, since \binom{n}{x} ≥ 0, π ≥ 0 and 1 − π ≥ 0.

2. Σ_{x=0}^{n} p(x) = 1. This is easiest to show by using the binomial theorem, which states that, for any integer n ≥ 0 and any real numbers y and z, then:

   (y + z)^n = Σ_{x=0}^{n} \binom{n}{x} y^x z^{n−x}.          (3.2)

   If we choose y = π and z = 1 − π in (3.2), we get:

   1 = 1^n = (π + (1 − π))^n = Σ_{x=0}^{n} \binom{n}{x} π^x (1 − π)^{n−x} = Σ_{x=0}^{n} p(x).

The cdf does not simplify into a simple formula, so we just calculate its values from the definition, by summation. For the values x = 0, 1, 2, . . . , n, the value of the cdf is:

F (x) = P (X ≤ x) = Σ_{y=0}^{x} \binom{n}{y} π^y (1 − π)^{n−y}.


Since X is a discrete random variable, F (x) is a step function. For E(X), we have:

E(X) = Σ_{x=0}^{n} x \binom{n}{x} π^x (1 − π)^{n−x}
     = Σ_{x=1}^{n} x \binom{n}{x} π^x (1 − π)^{n−x}
     = Σ_{x=1}^{n} [n(n − 1)! / ((x − 1)! ((n − 1) − (x − 1))!)] π π^{x−1} (1 − π)^{n−x}
     = nπ Σ_{x=1}^{n} \binom{n−1}{x−1} π^{x−1} (1 − π)^{n−x}
     = nπ Σ_{y=0}^{n−1} \binom{n−1}{y} π^y (1 − π)^{(n−1)−y}
     = nπ × 1
     = nπ

where y = x − 1, and the last summation is over all the values of the pf of another
binomial distribution, this time with possible values 0, 1, 2, . . . , n − 1 and probability
parameter π.
The variance of the distribution is Var(X) = nπ(1 − π). This is not derived here, but
will be proved in a different way later.
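
The identities just derived are easy to sanity-check numerically for particular values of n and π (chosen arbitrarily below). A short, non-examinable sketch:

    from math import comb

    n, pi = 10, 0.3
    pf = [comb(n, x) * pi**x * (1 - pi)**(n - x) for x in range(n + 1)]

    print(sum(pf))                                                       # 1
    print(sum(x * p for x, p in enumerate(pf)), n * pi)                  # both 3.0
    print(sum((x - n*pi)**2 * p for x, p in enumerate(pf)), n*pi*(1-pi)) # both 2.1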

3.4.8 The moment generating function

Moment generating function

The moment generating function (mgf) of a discrete random variable X is defined


as:

MX(t) = E(e^{tX}) = Σ_x e^{tx} p(x).

MX (t) is a function of real numbers t. It is not a random variable itself.

The form of the mgf is not interesting or informative in itself. Instead, the reason we
define the mgf is that it is a convenient tool for deriving means and variances of
distributions, using the following results:
MX′(0) = E(X) and MX″(0) = E(X^2)

which also gives:

Var(X) = E(X^2) − (E(X))^2 = MX″(0) − (MX′(0))^2.

This is useful if the mgf is easier to derive than E(X) and Var(X) directly.


Other moments about zero are obtained from the mgf similarly:

MX^(k)(0) = E(X^k) for k = 1, 2, . . . .

Example 3.16 In the basketball example, we considered the distribution with p(x) = (1 − π)^x π for x = 0, 1, 2, . . ..

The mgf for this distribution is:

MX(t) = E(e^{tX}) = Σ_{x=0}^{∞} e^{tx} p(x) = Σ_{x=0}^{∞} e^{tx} (1 − π)^x π = π Σ_{x=0}^{∞} (e^t (1 − π))^x = π / (1 − e^t (1 − π))

using the sum to infinity of a geometric series, for t < − ln(1 − π) to ensure convergence of the sum.

From the mgf MX(t) = π/(1 − e^t (1 − π)) we obtain:

MX′(t) = π(1 − π)e^t / (1 − e^t (1 − π))^2

MX″(t) = π(1 − π)e^t (1 − (1 − π)e^t)(1 + (1 − π)e^t) / (1 − e^t (1 − π))^4

and hence (since e^0 = 1):

MX′(0) = (1 − π)/π = E(X)

MX″(0) = (1 − π)(2 − π)/π^2 = E(X^2)

and:

Var(X) = E(X^2) − (E(X))^2 = (1 − π)(2 − π)/π^2 − (1 − π)^2/π^2 = (1 − π)/π^2.

Example 3.17 Consider a discrete random variable X with possible values


0, 1, 2, . . ., a parameter λ > 0, and the following pf:
p(x) = e^{−λ} λ^x / x!   for x = 0, 1, 2, . . .
       0                 otherwise.


The mgf for this distribution is:

MX(t) = Σ_{x=0}^{∞} e^{tx} e^{−λ} λ^x / x! = e^{−λ} Σ_{x=0}^{∞} (e^t λ)^x / x! = e^{−λ} e^{λe^t} = e^{λ(e^t − 1)}.

Note: this uses the series expansion of the exponential function from calculus, i.e. for any number a, we have:

e^a = Σ_{x=0}^{∞} a^x / x! = 1 + a + a^2/2! + a^3/3! + · · · .

From the mgf MX(t) = e^{λ(e^t − 1)} we obtain:

MX′(t) = λe^t e^{λ(e^t − 1)}

and:

MX″(t) = λe^t (1 + λe^t) e^{λ(e^t − 1)}.

Hence:

MX′(0) = λ = E(X)

also:

MX″(0) = λ(1 + λ) = E(X^2)

and:

Var(X) = E(X^2) − (E(X))^2 = λ(1 + λ) − λ^2 = λ.
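
Differentiating an mgf by hand is error-prone, so a computer algebra system is a useful (non-examinable) check. A sketch using the sympy package, assuming it is installed, applied to the mgf just derived:

    import sympy as sp

    t, lam = sp.symbols('t lambda', positive=True)
    M = sp.exp(lam * (sp.exp(t) - 1))      # mgf of the distribution in Example 3.17

    EX = sp.diff(M, t).subs(t, 0)          # first moment
    EX2 = sp.diff(M, t, 2).subs(t, 0)      # second moment

    print(sp.simplify(EX))                 # lambda
    print(sp.simplify(EX2 - EX**2))        # lambda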

Other useful properties of moment generating functions

If the mgfs mentioned in these statements exist, then the following apply.

The mgf uniquely determines a probability distribution. In other words, if for two
random variables X and Y we have MX (t) = MY (t) (for points around t = 0), then
X and Y have the same distribution.

If Y = aX + b where X is a random variable and a and b are constants, then:

MY(t) = e^{bt} MX(at).

Suppose that the random variables X1 , X2 , . . . , Xn are independent (a concept


which will be defined in Chapter 5) and if we also define Y = X1 + X2 + · · · + Xn ,
then:
MY(t) = Π_{i=1}^{n} MXi(t)

and, in particular, if all the Xi s have the same distribution (of X), then MY(t) = (MX(t))^n.


3.5 Continuous random variables


A random variable (and its probability distribution) is continuous if it can have an
uncountably infinite number of possible values.4

In other words, the set of possible values (the sample space) is the real numbers R,
or one or more intervals in R.

Example 3.18 An example of a continuous random variable, used here as an


approximating model, is the size of claim made on an insurance policy (i.e. a claim
by the customer to the insurance company), in £000s.

Suppose the policy has a deductible of £999, so all claims are at least £1,000.

Therefore, the possible values of this random variable are {x | x ≥ 1}.

Most of the concepts introduced for discrete random variables have exact or
approximate analogies for continuous random variables, and many results are the same
for both types. However, there are some differences in the details. The most obvious
difference is that wherever in the discrete case there are sums over the possible values of
the random variable, in the continuous case these are integrals.

Probability density function (pdf)

For a continuous random variable X, the probability function is replaced by the


probability density function (pdf), denoted as f (x) (or fX (x)).

Example 3.19 Continuing the insurance example in Example 3.18, we consider a


pdf of the following form:
f (x) = αk^α / x^{α+1}   for x ≥ k
        0                otherwise

where α > 0 is a parameter, and k > 0 (the smallest possible value of X) is a known
number. In our example, k = 1 (due to the deductible). A probability distribution
with this pdf is known as the Pareto distribution. A graph of this pdf when
α = 2.2 is shown in Figure 3.5 (on the next page).

Unlike for probability functions of discrete random variables, in the continuous case
values of the probability density function are not probabilities of individual values, i.e.
f (x) ≠ P (X = x). In fact, for a continuous random variable:

P (X = x) = 0 for all x. (3.3)


4
Strictly speaking, having an uncountably infinite number of possible values does not necessarily
imply that it is a continuous random variable. For example, the Cantor distribution (not covered in this
course) is neither a discrete nor an absolutely continuous probability distribution, nor is it a mixture of
these. However, we will not consider this matter any further in this course.



Figure 3.5: Probability density function for Example 3.19.

That is, the probability that X has any particular value exactly is always 0.
Because of (3.3), with a continuous random variable we do not need to be very careful
about differences between < and ≤, and between > and ≥. Therefore, the following
probabilities are all equal:

P (a < X < b), P (a ≤ X ≤ b), P (a < X ≤ b) and P (a ≤ X < b).

Probabilities of intervals for continuous random variables

Integrals of the pdf give probabilities of intervals of values such that:


P (a < X ≤ b) = ∫_a^b f (x) dx

for any two numbers a < b.


In other words, the probability that the value of X is between a and b is the area
under f (x) between a and b. Here a can also be −∞, and/or b can be ∞.


Example 3.20 In Figure 3.6, the shaded area is:


P (1.5 < X ≤ 3) = ∫_{1.5}^{3} f (x) dx.


Figure 3.6: Probability density function showing P (1.5 < X ≤ 3).

Properties of pdfs

The pdf f (x) of any continuous random variable must satisfy the following conditions.

1. We require:
f (x) ≥ 0 for all x.

2. We require: ∫_{−∞}^{∞} f (x) dx = 1.

These are analogous to the conditions for probability functions of discrete


distributions.


Example 3.21 Continuing with the insurance example, we check that the
conditions hold for the pdf:
f (x) = αk^α / x^{α+1}   for x ≥ k
        0                otherwise

where α > 0 and k > 0.

1. Clearly, f (x) ≥ 0 for all x, since α > 0, k^α > 0 and x^{α+1} ≥ k^{α+1} > 0.

2. We have:

   ∫_{−∞}^{∞} f (x) dx = ∫_k^{∞} αk^α / x^{α+1} dx = αk^α ∫_k^{∞} x^{−α−1} dx = αk^α [−x^{−α}/α]_k^{∞} = (−k^α)(0 − k^{−α}) = 1.

Cumulative distribution function

The cumulative distribution function (cdf) of a continuous random variable X


is defined exactly as for discrete random variables, i.e. the cdf is:

F (x) = P (X ≤ x) for all real numbers x.

The general properties of the cdf stated previously also hold for continuous
distributions. The cdf of a continuous distribution is not a step function, so results
on discrete-specific properties do not hold in the continuous case. A continuous cdf
is a smooth, continuous function of x.

Relationship between the cdf and pdf

The cdf is obtained from the pdf through integration:


F (x) = P (X ≤ x) = ∫_{−∞}^{x} f (t) dt for all x.

The pdf is obtained from the cdf through differentiation:

f (x) = F′(x).


Example 3.22 Continuing the insurance example:

∫_{−∞}^{x} f (t) dt = ∫_k^x αk^α / t^{α+1} dt = (−k^α) ∫_k^x (−α)t^{−α−1} dt = (−k^α) [t^{−α}]_k^x = (−k^α)(x^{−α} − k^{−α}) = 1 − k^α x^{−α} = 1 − (k/x)^α.

Therefore:

F (x) = 0              for x < k          (3.4)
        1 − (k/x)^α    for x ≥ k.

If we were given (3.4), we could obtain the pdf by differentiation, since F′(x) = 0 when x < k, and:

F′(x) = −k^α (−α)x^{−α−1} = αk^α / x^{α+1}   for x ≥ k.
A plot of the cdf is shown in Figure 3.7.

Figure 3.7: Cumulative distribution function for Example 3.22.


Probabilities from cdfs and pdfs

Since P (X ≤ x) = F (x), it follows that P (X > x) = 1 − F (x). In general, for any


two numbers a < b, we have:
P (a < X ≤ b) = ∫_a^b f (x) dx = F (b) − F (a).

Example 3.23 Continuing with the insurance example (with k = 1 and α = 2.2),
then:

P (X ≤ 1.5) = F (1.5) = 1 − (1/1.5)^{2.2} ≈ 0.59

P (X ≤ 3) = F (3) = 1 − (1/3)^{2.2} ≈ 0.91

P (X > 3) = 1 − F (3) ≈ 1 − 0.91 = 0.09

P (1.5 ≤ X ≤ 3) = F (3) − F (1.5) ≈ 0.91 − 0.59 = 0.32.
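
These probabilities come straight from the cdf in (3.4), so they take only a couple of lines to reproduce. A (non-examinable) sketch with k and α as in the insurance example:

    def pareto_cdf(x, k=1.0, alpha=2.2):
        return 0.0 if x < k else 1.0 - (k / x)**alpha

    print(round(pareto_cdf(1.5), 2))                    # P(X <= 1.5)      ≈ 0.59
    print(round(pareto_cdf(3.0), 2))                    # P(X <= 3)        ≈ 0.91
    print(round(1 - pareto_cdf(3.0), 2))                # P(X > 3)         ≈ 0.09
    print(round(pareto_cdf(3.0) - pareto_cdf(1.5), 2))  # P(1.5 <= X <= 3) ≈ 0.32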

Example 3.24 Consider now a continuous random variable with the following pdf:
f (x) = λe^{−λx}   for x ≥ 0          (3.5)
        0          otherwise

where λ > 0 is a parameter. This is the pdf of the exponential distribution. The uses of this distribution will be discussed in the next chapter.

Since:

∫_0^x λe^{−λt} dt = [−e^{−λt}]_0^x = 1 − e^{−λx}

the cdf of the exponential distribution is:

F (x) = 0              for x < 0
        1 − e^{−λx}    for x ≥ 0.

We now show that (3.5) satisfies the conditions for a pdf.

1. Since λ > 0 and e^a > 0 for any a, f (x) ≥ 0 for all x.

2. Since we have just done the integration to derive the cdf F (x), we can also use it to show that f (x) integrates to 1. This follows from:

   ∫_{−∞}^{∞} f (x) dx = P (−∞ < X < ∞) = lim_{x→∞} F (x) − lim_{x→−∞} F (x)

   which here is lim_{x→∞} (1 − e^{−λx}) − 0 = (1 − 0) − 0 = 1.


Mixed distributions

A random variable can also be a mixture of discrete and continuous parts.


For example, consider the sizes of payments which an insurance company needs to make
on all insurance policies of a particular type. Most policies result in no claims or claims
below the deductible, so the payment for them is 0. For those policies which do result in
a claim, the size of each claim is some number greater than 0.
Consider a random variable X which is a mixture of two components.

P (X = 0) = π for some π ∈ (0, 1). Here π is the probability that a policy results in
no payment.

Among the rest, X follows a continuous distribution with the probabilities


distributed as (1 − π)f (x), where f (x) is a continuous pdf over x > 0. In other
words, this spreads the remaining probability (1 − π) over different non-zero values
of payments. For example, we could use the Pareto distribution for this loss
distribution f (x) (or actually as a distribution of X + k, since the company only
pays the amount above the deductible, k).

Expected value and variance of a continuous distribution

Suppose X is a continuous random variable with pdf f (x). Definitions of its expected
value, the expected value of any transformation g(X), the variance and standard
deviation are the same as for discrete distributions, except that summation is replaced
by integration:
E(X) = ∫_{−∞}^{∞} x f (x) dx

E(g(X)) = ∫_{−∞}^{∞} g(x) f (x) dx

Var(X) = E((X − E(X))^2) = ∫_{−∞}^{∞} (x − E(X))^2 f (x) dx = E(X^2) − (E(X))^2

sd(X) = √Var(X).


Example 3.25 For the Pareto distribution, introduced in Example 3.19, we have:
E(X) = ∫_{−∞}^{∞} x f (x) dx = ∫_k^{∞} x f (x) dx
     = ∫_k^{∞} x (αk^α / x^{α+1}) dx
     = ∫_k^{∞} (αk^α / x^α) dx
     = (αk/(α − 1)) ∫_k^{∞} ((α − 1)k^{α−1} / x^{(α−1)+1}) dx
     = αk/(α − 1)     (for α > 1).

Here the last step follows because the last integrand has the form of the Pareto pdf with parameter α − 1, so its integral from k to ∞ is 1. This integral converges only if α − 1 > 0, i.e. if α > 1.

Similarly:

E(X^2) = ∫_k^{∞} x^2 f (x) dx = ∫_k^{∞} x^2 (αk^α / x^{α+1}) dx
       = ∫_k^{∞} (αk^α / x^{α−1}) dx
       = (αk^2/(α − 2)) ∫_k^{∞} ((α − 2)k^{α−2} / x^{(α−2)+1}) dx
       = αk^2/(α − 2)     (for α > 2)

and hence:

Var(X) = E(X^2) − (E(X))^2 = αk^2/(α − 2) − (αk/(α − 1))^2 = (k/(α − 1))^2 × α/(α − 2).

In our insurance example, where k = 1 and α = 2.2, we have:

E(X) = (2.2 × 1)/(2.2 − 1) ≈ 1.8 and Var(X) = (1/(2.2 − 1))^2 × 2.2/(2.2 − 2) ≈ 7.6.

Means and variances can be ‘infinite’

Expected values and variances are said to be infinite when the corresponding integral
does not exist (i.e. does not have a finite value).
For the Pareto distribution, the distribution is defined for all α > 0, but the mean is


infinite if α < 1 and the variance is infinite if α < 2. This happens because for small
values of α the distribution has very heavy tails, i.e. the probabilities of very large
values of X are non-negligible.
This is actually useful in some insurance applications, for example liability insurance
and medical insurance. There most claims are relatively small, but there is a
non-negligible probability of extremely large claims. The Pareto distribution with a
small α can be a reasonable representation of such situations. Figure 3.8 shows plots of
Pareto cdfs with α = 2.2 and α = 0.8. When α = 0.8, the distribution is so heavy-tailed
that E(X) is infinite.

Figure 3.8: Pareto distribution cdfs.

Example 3.26 Consider the exponential distribution introduced in Example 3.24. To find E(X) we can use integration by parts, by considering x λe^{−λx} as the product of the functions f = x and g′ = λe^{−λx} (so that g = −e^{−λx}). Therefore:

E(X) = ∫_0^{∞} x λe^{−λx} dx = [−x e^{−λx}]_0^{∞} − ∫_0^{∞} −e^{−λx} dx
     = [−x e^{−λx}]_0^{∞} − (1/λ)[e^{−λx}]_0^{∞}
     = [0 − 0] − (1/λ)[0 − 1]
     = 1/λ.


To obtain E(X^2), we choose f = x^2 and g′ = λe^{−λx}, and use integration by parts:

E(X^2) = ∫_0^{∞} x^2 λe^{−λx} dx = [−x^2 e^{−λx}]_0^{∞} + 2 ∫_0^{∞} x e^{−λx} dx
       = 0 + (2/λ) ∫_0^{∞} x λe^{−λx} dx
       = 2/λ^2

where the last step follows because the last integral is simply E(X) = 1/λ again. Finally:

Var(X) = E(X^2) − (E(X))^2 = 2/λ^2 − 1/λ^2 = 1/λ^2.

3.5.1 Moment generating functions


The moment generating function (mgf) of a continuous random variable X is defined as
for discrete random variables, with summation replaced by integration:
MX(t) = E(e^{tX}) = ∫_{−∞}^{∞} e^{tx} f (x) dx.

The properties of the mgf stated in Section 3.4.8 also hold for continuous distributions.
If the expected value E(etX ) is infinite, the random variable X does not have an mgf.
For example, the Pareto distribution does not have an mgf for positive t.

Example 3.27 For the exponential distribution, we have:


Z ∞ Z ∞
−λx
tX
MX (t) = E(e ) = tx
e λe dx = λe−(λ−t)x dx
0 0
Z ∞
λ λ
= (λ − t)e−(λ−t)x dx = (for t < λ)
λ−t 0 λ−t
| {z }
=1

from which we get MX0 (t) = λ/(λ − t)2 and MX00 (t) = 2λ/(λ − t)3 , so:
1 2
E(X) = MX0 (0) = and E(X 2 ) = MX00 (0) =
λ λ2
and Var(X) = E(X 2 ) − (E(X))2 = 2/λ2 − 1/λ2 = 1/λ2 .
These agree with the results derived with a bit more work in Example 3.26.
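If you have access to a computer algebra system, the differentiation of the mgf can also be checked symbolically. The sketch below assumes SymPy is available (an assumption on my part; any symbolic tool would do, and none is needed for the examination).

```python
import sympy as sp

t, lam = sp.symbols("t lambda", positive=True)
M = lam / (lam - t)                  # mgf of the exponential distribution, valid for t < lambda

EX = sp.diff(M, t).subs(t, 0)        # 1/lambda
EX2 = sp.diff(M, t, 2).subs(t, 0)    # 2/lambda**2
print(EX, EX2, sp.simplify(EX2 - EX**2))   # variance 1/lambda**2
```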

3.5.2 Median of a random variable


Recall that the sample median is essentially the observation ‘in the middle’ of a set of
data, i.e. where half of the observations in the sample are smaller than the median and
half of the observations are larger.


The median of a random variable (i.e. of its probability distribution) is similar in spirit.

Median of a random variable

The median, m, of a continuous random variable X is the value which satisfies:

F (m) = 0.5. (3.6)

So once we know F (x), we can find the median by solving (3.6).

A more precise general definition of the median of any probability distribution is as


follows.
Let X be a random variable with the cumulative distribution function F (x). The
median m of X is any number which satisfies:
P (X ≤ m) = F (m) ≥ 0.5
and:
P (X ≥ m) = 1 − F (m) + P (X = m) ≥ 0.5.
For a continuous distribution P (X = m) = 0 for any m, so this reduces to F (m) = 0.5.
If, for a discrete distribution, F (xm ) = 0.5 exactly for some value xm , the median is not
unique. Instead, all values from xm to the next largest observation (these included) are
medians.

Example 3.28 For the Pareto distribution we have:

F(x) = 1 − (k/x)^α for x ≥ k.

So F(m) = 1 − (k/m)^α = 1/2 when:

(k/m)^α = 1/2   ⇔   k/m = 1/2^{1/α}   ⇔   m = k × 2^{1/α}.

For example:

when k = 1 and α = 2.2, the median is m = 2^{1/2.2} = 1.37

when k = 1 and α = 0.8, the median is m = 2^{1/0.8} = 2.38.

Example 3.29 For the exponential distribution we have:

F (x) = 1 − e−λx for x ≥ 0.

So F (m) = 1 − e−λm = 1/2 when:

e^{−λm} = 1/2   ⇔   −λm = −ln 2   ⇔   m = (ln 2)/λ.
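Again purely as a check outside the examination, both medians can be computed numerically. This sketch assumes SciPy's parameterisations (`pareto(b=α, scale=k)` and `expon(scale=1/λ)`).

```python
from scipy.stats import pareto, expon

print(pareto(b=2.2, scale=1).median())   # ~1.37, matching 2**(1/2.2)
print(pareto(b=0.8, scale=1).median())   # ~2.38, matching 2**(1/0.8)
print(expon(scale=1/1.6).median())       # ~0.433, matching ln(2)/1.6
```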


3.6 Overview of chapter


This chapter has formally introduced random variables, making a distinction between
discrete and continuous random variables. Properties of probability distributions were
discussed, including the determination of expected values and variances.

3.7 Key terms and concepts

Binomial distribution; Constant; Continuous; Cumulative distribution function; Discrete; Estimators; Expected value; Experiment; Exponential distribution; Interval; Median; Moment generating function; Outcome; Parameter; Pareto distribution; Probability density function; Probability distribution; Probability (mass) function; Random variable; Standard deviation; Step function; Variance

3.8 Sample examination questions

1. Suppose that X is a discrete random variable for which the moment generating
function is:
M_X(t) = (1/4)(e^{3t} + e^{6t} + e^{9t}) + (1/8)(e^{2t} + e^{4t})
for −∞ < t < ∞. Write down the probability distribution of X.

2. A random variable, X, has the following probability density function:



f(x) = x/5 for 0 ≤ x < 2, f(x) = (20 − 4x)/30 for 2 ≤ x ≤ 5, and f(x) = 0 otherwise.

(a) Sketch the graph of f (x). (The sketch can be drawn on ordinary paper – no
graph paper needed.)

(b) Derive the cumulative distribution function of X.

(c) Find the mean and the standard deviation of X.


3.9 Solutions to Sample examination questions


1. If X can take only a finite number of values x1 , x2 , . . . , xk with probabilities
p1 , p2 , . . . , pk , respectively, then the mgf of X will be:

M_X(t) = p_1 e^{tx_1} + p_2 e^{tx_2} + · · · + p_k e^{tx_k}.

By matching this expression for M_X(t) with:

M_X(t) = (1/4)(e^{3t} + e^{6t} + e^{9t}) + (1/8)(e^{2t} + e^{4t})

it can be seen that X can take only the five values 2, 3, 4, 6 and 9 and hence:

p(x) = 1/4 for x = 3, 6 and 9, p(x) = 1/8 for x = 2 and 4, and p(x) = 0 otherwise.

2. (a) The pdf of X has the following form:


[Sketch of f(x): the pdf rises linearly from 0 at x = 0 to 0.4 at x = 2, then falls linearly to 0 at x = 5, and equals 0 elsewhere.]

(b) We determine the cdf by integrating the pdf over the appropriate range, hence:

F(x) = 0 for x < 0, F(x) = x²/10 for 0 ≤ x < 2, F(x) = (10x − x² − 10)/15 for 2 ≤ x ≤ 5, and F(x) = 1 for x > 5.

This results from the following calculations. Firstly, for x < 0, we have:

F(x) = ∫_{−∞}^{x} f(t) dt = ∫_{−∞}^{x} 0 dt = 0.

For 0 ≤ x < 2, we have:

F(x) = ∫_{−∞}^{x} f(t) dt = ∫_{−∞}^{0} 0 dt + ∫_{0}^{x} (t/5) dt = [t²/10]_{0}^{x} = x²/10.


For 2 ≤ x ≤ 5, we have:

F(x) = ∫_{−∞}^{x} f(t) dt = ∫_{−∞}^{0} 0 dt + ∫_{0}^{2} (t/5) dt + ∫_{2}^{x} (20 − 4t)/30 dt

     = 0 + 4/10 + [2t/3 − t²/15]_{2}^{x}

     = 2/5 + (2x/3 − x²/15) − (4/3 − 4/15)

     = 2x/3 − x²/15 − 2/3

     = (10x − x² − 10)/15.

(c) To find the mean we proceed as follows:

µ = E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_{0}^{2} (x²/5) dx + ∫_{2}^{5} (20x − 4x²)/30 dx

  = [x³/15]_{0}^{2} + [x²/3 − 2x³/45]_{2}^{5}

  = 8/15 + (25/3 − 250/45) − (4/3 − 16/45)

  = 7/3 ≈ 2.3333.

Similarly:

E(X²) = ∫_{−∞}^{∞} x² f(x) dx = ∫_{0}^{2} (x³/5) dx + ∫_{2}^{5} (20x² − 4x³)/30 dx

      = [x⁴/20]_{0}^{2} + [2x³/9 − x⁴/30]_{2}^{5}

      = 16/20 + (250/9 − 625/30) − (16/9 − 16/30)

      = 13/2 = 6.5.

Hence the variance is:

σ² = E(X²) − (E(X))² = 13/2 − (7/3)² = 117/18 − 98/18 = 19/18 ≈ 1.0556.

Therefore, the standard deviation is σ = √(19/18) ≈ 1.0274.
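If you want to verify integrals of this kind numerically when practising (not possible in the examination), a rough sketch using SciPy's numerical integration is given below; the pdf function here is just the one from the question.

```python
import numpy as np
from scipy.integrate import quad

def f(x):
    # pdf from the question above
    if 0 <= x < 2:
        return x / 5
    if 2 <= x <= 5:
        return (20 - 4 * x) / 30
    return 0.0

mean = quad(lambda x: x * f(x), 0, 5, points=[2])[0]     # ~2.3333 = 7/3
ex2 = quad(lambda x: x**2 * f(x), 0, 5, points=[2])[0]   # ~6.5
print(mean, np.sqrt(ex2 - mean**2))                      # sd ~1.0274
```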

The death of one man is a tragedy. The death of millions is a statistic.


(Stalin to Churchill, Potsdam 1945)

Chapter 4
Common distributions of random
variables

4.1 Synopsis of chapter content


This chapter formally introduces common ‘families’ of probability distributions which
can be used to model various real-world phenomena.

4.2 Learning outcomes


After completing this chapter, you should be able to:

summarise basic distributions such as the uniform, Bernoulli, binomial, Poisson,


exponential and normal

calculate probabilities of events for these distributions using the probability


function, probability density function or cumulative distribution function

determine probabilities using statistical tables, where appropriate

state properties of these distributions such as the expected value and variance.

4.3 Introduction
In statistical inference we will treat observations:

X1 , X2 , . . . , Xn

(the sample) as values of a random variable X, which has some probability distribution
(the population distribution).
How to choose the probability distribution?

Usually we do not try to invent new distributions from scratch.

Instead, we use one of many existing standard distributions.

There is a large number of such distributions, such that for most purposes we can
find a suitable standard distribution.


This part of the course introduces some of the most common standard distributions for
discrete and continuous random variables.
Probability distributions may differ from each other in a broader or narrower sense. In
the broader sense, we have different families of distributions which may have quite
different characteristics, for example:

continuous versus discrete

among discrete: a finite versus an infinite number of possible values

among continuous: different sets of possible values (for example, all real numbers x,
x ≥ 0, or x ∈ [0, 1]); symmetric versus skewed distributions.

The ‘distributions’ discussed in this chapter are really families of distributions in this
sense.
In the narrower sense, individual distributions within a family differ in having different
values of the parameters of the distribution. The parameters determine the mean and
variance of the distribution, values of probabilities from it etc.
In the statistical analysis of a random variable X we typically:

select a family of distributions based on the basic characteristics of X

use observed data to choose (estimate) values for the parameters of that
distribution, and perform statistical inference on them.

Example 4.1 An opinion poll on a referendum, where each Xi is an answer to the


question ‘Will you vote ‘Yes’ or ‘No’ to leaving the European Union?’ has answers
recorded as Xi = 0 if ‘No’ and Xi = 1 if ‘Yes’. In a poll of 950 people, 513 answered
‘Yes’.
How do we choose a distribution to represent Xi ?

Here we need a family of discrete distributions with only two possible values (0
and 1). The Bernoulli distribution (discussed in the next section), which has one
parameter π (the probability that Xi = 1) is appropriate.

Within the family of Bernoulli distributions, we use the one where the value of
π is our best estimate based on the observed data. This is π̂ = 513/950 = 0.54.

Distributions in the examination

For the discrete uniform, Bernoulli, binomial, Poisson, continuous uniform, exponential
and normal distributions:

you should memorise their pf/pdf, cdf (if given), mean, variance and median (if
given)

you can use these in any examination question without proof, unless the question
directly asks you to derive them again.


For any other distributions:

you do not need to memorise their pf/pdf or cdf; if needed for a question, these will
be provided
if a question involves means, variances or other properties of these distributions,
these will either be provided, or the question will ask you to derive them.

4.4 Common discrete distributions


For discrete random variables, we will consider the following distributions.

Discrete uniform distribution.


Bernoulli distribution.
Binomial distribution.
Poisson distribution.

4.4.1 Discrete uniform distribution


Suppose a random variable X has k possible values 1, 2, . . . , k. X has a discrete
uniform distribution if all of these values have the same probability, i.e. if:
(
1/k for x = 1, 2, . . . , k
p(x) = P (X = x) =
0 otherwise.

Example 4.2 A simple example of the discrete uniform distribution is the


distribution of the score of a fair die, with k = 6.

The discrete uniform distribution is not very common in applications, but it is useful as
a reference point for more complex distributions.

Mean and variance of a discrete uniform distribution

Calculating directly from the definition,¹ we have:

E(X) = Σ_{x=1}^{k} x p(x) = (1 + 2 + · · · + k)/k = (k + 1)/2     (4.1)

and:

E(X²) = Σ_{x=1}^{k} x² p(x) = (1² + 2² + · · · + k²)/k = (k + 1)(2k + 1)/6.     (4.2)

Therefore:

Var(X) = E(X²) − (E(X))² = (k² − 1)/12.

¹ (4.1) and (4.2) make use, respectively, of Σ_{i=1}^{n} i = n(n + 1)/2 and Σ_{i=1}^{n} i² = n(n + 1)(2n + 1)/6.
i=1 i=1


4.4.2 Bernoulli distribution


A Bernoulli trial is an experiment with only two possible outcomes. We will number
these outcomes 1 and 0, and refer to them as ‘success’ and ‘failure’, respectively.

Example 4.3 Examples of outcomes of Bernoulli trials are:

agree / disagree

male / female

employed / not employed

owns a car / does not own a car

business goes bankrupt / continues trading.

The Bernoulli distribution is the distribution of the outcome of a single Bernoulli


trial. This is the distribution of a random variable X with the following probability
function: (
π x (1 − π)1−x for x = 0, 1
p(x) =
0 otherwise.
Therefore, P (X = 1) = π and P (X = 0) = 1 − P (X = 1) = 1 − π, and no other values
are possible. Such a random variable X has a Bernoulli distribution with (probability)
parameter π. This is often written as:
X ∼ Bernoulli(π).
If X ∼ Bernoulli(π), then:

E(X) = Σ_{x=0}^{1} x p(x) = 0 × (1 − π) + 1 × π = π     (4.3)

E(X²) = Σ_{x=0}^{1} x² p(x) = 0² × (1 − π) + 1² × π = π

and:

Var(X) = E(X²) − (E(X))² = π − π² = π(1 − π).     (4.4)

The moment generating function is:

M_X(t) = Σ_{x=0}^{1} e^{tx} p(x) = e^{0}(1 − π) + e^{t}π = (1 − π) + πe^{t}.

4.4.3 Binomial distribution


Suppose we carry out n Bernoulli trials such that:

at each trial, the probability of success is π


different trials are statistically independent events.


Let X denote the total number of successes in these n trials. X follows a binomial
distribution with parameters n and π, where n ≥ 1 is a known integer and 0 ≤ π ≤ 1.
This is often written as:
X ∼ Bin(n, π).

The binomial distribution was first encountered in Example 3.15.

Example 4.4 A multiple choice test has 4 questions, each with 4 possible answers.
James is taking the test, but has no idea at all about the correct answers. So he
guesses every answer and, therefore, has the probability of 1/4 of getting any
individual question correct.
Let X denote the number of correct answers in James’ test. X follows the binomial
distribution with n = 4 and π = 0.25, i.e. we have:

X ∼ Bin(4, 0.25).

For example, what is the probability that James gets 3 of the 4 questions correct?
Here it is assumed that the guesses are independent, and each has the probability
π = 0.25 of being correct. The probability of any particular sequence of 3 correct
and 1 incorrect answers, for example 1110, is π 3 (1 − π)1 (where ‘1’ denotes a correct
answer and ‘0’ denotes an incorrect answer).
However, we do not care about the order of the 1s and 0s, only about the number of
1s. So 1101 and 1011, for example, also count as 3 correct answers. Each of these
also has the probability π 3 (1 − π)1 .
The total number of sequences with three 1s (and, therefore, one 0) is the number of
locations for the three 1s which can be selected in the sequence of 4 answers. This is
\binom{4}{3} = 4. Therefore, the probability of obtaining three 1s is:

\binom{4}{3} π³ (1 − π)¹ = 4 × (0.25)³ × (0.75)¹ ≈ 0.0469.

Binomial distribution probability function

In general, the probability function of X ∼ Bin(n, π) is:

p(x) = \binom{n}{x} π^x (1 − π)^{n−x} for x = 0, 1, 2, . . . , n, and p(x) = 0 otherwise.     (4.5)

We have shown in the previous chapter that (4.5) satisfies the conditions for being a
probability function (see Example 3.15).


Example 4.5 Continuing Example 4.4, where X ∼ Bin(4, 0.25), we have:


   
p(0) = \binom{4}{0} (0.25)⁰ (0.75)⁴ = 0.3164,   p(1) = \binom{4}{1} (0.25)¹ (0.75)³ = 0.4219,

p(2) = \binom{4}{2} (0.25)² (0.75)² = 0.2109,   p(3) = \binom{4}{3} (0.25)³ (0.75)¹ = 0.0469,

p(4) = \binom{4}{4} (0.25)⁴ (0.75)⁰ = 0.0039.

If X ∼ Bin(n, π), then:


E(X) = nπ
and:
Var(X) = nπ(1 − π).
The expected value E(X) was derived in the previous chapter (see Example 3.15). The
variance will be derived later.
These can also be obtained from the moment generating function:

MX (t) = ((1 − π) + πet )n .

Example 4.6 Suppose a multiple choice examination has 20 questions, each with 4
possible answers. Consider again James who guesses each one of the answers. Let X
denote the number of correct answers by such a student, so that we have
X ∼ Bin(20, 0.25). For such a student, the expected number of correct answers is
E(X) = 20 × 0.25 = 5.
The teacher wants to set the pass mark of the examination so that, for such a
student, the probability of passing is less than 0.05. What should the pass mark be?
In other words, what is the smallest x such that P (X ≥ x) < 0.05, i.e. such that
P (X < x) ≥ 0.95?
Calculating the probabilities of x = 0, 1, 2, . . . , 20 we get (rounded to 2 decimal
places):

x 0 1 2 3 4 5 6 7 8 9 10
p(x) 0.00 0.02 0.07 0.13 0.19 0.20 0.17 0.11 0.06 0.03 0.01
x 11 12 13 14 15 16 17 18 19 20
p(x) 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Calculating the cumulative probabilities, we find that F (7) = P (X < 8) = 0.898 and
F (8) = P (X < 9) = 0.959. Therefore, P (X ≥ 8) = 0.102 > 0.05 and also
P (X ≥ 9) = 0.041 < 0.05. The pass mark should be set at 9.
More generally, consider a student who has the same probability π of the correct
answer for every question, so that X ∼ Bin(20, π). Figure 4.1 (on the next page)
shows plots of the probabilities for π = 0.25, 0.5, 0.7 and 0.9.
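The search for the pass mark is easy to reproduce with a binomial cdf on a computer. A minimal sketch, assuming SciPy is available (it is not needed, or allowed, in the examination):

```python
from scipy.stats import binom

X = binom(20, 0.25)   # number of correct guesses out of 20

# find the smallest pass mark x with P(X >= x) < 0.05
for x in range(21):
    if X.sf(x - 1) < 0.05:        # sf(x - 1) = P(X > x - 1) = P(X >= x)
        print("pass mark:", x)    # prints 9
        break
```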

[Figure: pf of Bin(20, π) for π = 0.25 (E(X) = 5), π = 0.5 (E(X) = 10), π = 0.7 (E(X) = 14) and π = 0.9 (E(X) = 18); probability plotted against the number of correct answers.]

Figure 4.1: Probability plots for Example 4.6.

4.4.4 Poisson distribution

The possible values of the Poisson distribution are the non-negative integers
0, 1, 2, . . ..

Poisson distribution probability function

The probability function of the Poisson distribution is:


(
e−λ λx /x! for x = 0, 1, 2, . . .
p(x) = (4.6)
0 otherwise

where λ > 0 is a parameter.

If a random variable X has a Poisson distribution with parameter λ, this is often


denoted by:
X ∼ Poisson(λ) or X ∼ Pois(λ).

If X ∼ Poisson(λ), then:
E(X) = λ


and:
Var(X) = λ.
These can also be obtained from the moment generating function (see Example 3.17):
M_X(t) = exp(λ(e^t − 1)).

Poisson distributions are used for counts of occurrences of various kinds. To give a
formal motivation, suppose that we consider the number of occurrences of some
phenomenon in time, and that the process which generates the occurrences satisfies the
following conditions:

1. The numbers of occurrences in any two disjoint intervals of time are independent of
each other.

2. The probability of two or more occurrences at the same time is negligibly small.

3. The probability of one occurrence in any short time interval of length t is λt for
some constant λ > 0.

In essence, these state that individual occurrences should be independent, sufficiently


rare, and happen at a constant rate λ per unit of time. A process like this is a Poisson
process.
If occurrences are generated by a Poisson process, then the number of occurrences in a
randomly selected time interval of length t = 1, X, follows a Poisson distribution with
mean λ, i.e. X ∼ Poisson(λ).
The single parameter λ of the Poisson distribution is, therefore, the rate of occurrences
per unit of time.

Example 4.7 Examples of variables for which we might use a Poisson distribution:

The number of telephone calls received at a call centre per minute.

The number of accidents on a stretch of motorway per week.

The number of customers arriving at a checkout per minute.

The number of misprints per page of newsprint.

Because λ is the rate per unit of time, its value also depends on the unit of time (that
is, the length of interval) we consider.

Example 4.8 If X is the number of arrivals per hour and X ∼ Poisson(1.5), then if
Y is the number of arrivals per two hours, Y ∼ Poisson(1.5 × 2) = Poisson(3).

λ is also the mean of the distribution, i.e. E(X) = λ.


Both motivations suggest that distributions with higher values of λ have higher
probabilities of large values of X.


Example 4.9 Figure 4.2 shows the probabilities p(x) for x = 0, 1, 2, . . . , 10 for
X ∼ Poisson(2) and X ∼ Poisson(4).


Figure 4.2: Probability plots for Example 4.9.

Example 4.10 Customers arrive at a bank on weekday afternoons randomly at an


average rate of 1.6 customers per minute. Let X denote the number of arrivals per
minute and Y denote the number of arrivals per 5 minutes.
We assume a Poisson distribution for both, such that:
X ∼ Poisson(1.6)
and:
Y ∼ Poisson(1.6 × 5) = Poisson(8).

1. What is the probability that no customer arrives in a one-minute interval?


For X ∼ Poisson(1.6), the probability P(X = 0) is:

p_X(0) = e^{−λ} λ⁰ / 0! = e^{−1.6} (1.6)⁰ / 0! = e^{−1.6} = 0.2019.

2. What is the probability that more than two customers arrive in a one-minute interval?

P(X > 2) = 1 − P(X ≤ 2) = 1 − (P(X = 0) + P(X = 1) + P(X = 2)), which is:

1 − p_X(0) − p_X(1) − p_X(2) = 1 − e^{−1.6}(1.6)⁰/0! − e^{−1.6}(1.6)¹/1! − e^{−1.6}(1.6)²/2!
                             = 1 − e^{−1.6} − 1.6e^{−1.6} − 1.28e^{−1.6}
                             = 1 − 3.88e^{−1.6}
                             = 0.2167.


3. What is the probability that no more than 1 customer arrives in a five-minute interval?

For Y ∼ Poisson(8), the probability P(Y ≤ 1) is:

p_Y(0) + p_Y(1) = e^{−8} 8⁰/0! + e^{−8} 8¹/1! = e^{−8} + 8e^{−8} = 9e^{−8} = 0.0030.
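These Poisson probabilities can be reproduced directly with statistical software. A brief sketch, assuming SciPy's `poisson` (for checking your working only):

```python
from scipy.stats import poisson

X = poisson(1.6)   # arrivals per minute
Y = poisson(8.0)   # arrivals per five minutes

print(X.pmf(0))    # ~0.2019, P(X = 0)
print(X.sf(2))     # ~0.217,  P(X > 2)
print(Y.cdf(1))    # ~0.0030, P(Y <= 1)
```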

4.4.5 Connections between probability distributions


There are close connections between some probability distributions, even across
different families of them. Some connections are exact, i.e. one distribution is exactly
equal to another, for particular values of the parameters. For example, Bernoulli(π) is
the same distribution as Bin(1, π).
Some connections are approximate (or asymptotic), i.e. one distribution is closely
approximated by another under some limiting conditions. We next discuss one of these,
the Poisson approximation of the binomial distribution.

4.4.6 Poisson approximation of the binomial distribution


Suppose that:

X ∼ Bin(n, π)
n is large and π is small.

Under such circumstances, the distribution of X is well-approximated by a Poisson(λ)


distribution with λ = nπ.
The connection is exact at the limit, i.e. Bin(n, π) → Poisson(λ) if n → ∞ and π → 0 in
such a way that nπ = λ remains constant.
This ‘law of small numbers’ provides another motivation for the Poisson distribution.

Example 4.11 A classic example (from Bortkiewicz (1898) Das Gesetz der kleinen
Zahlen) helps to remember the key elements of the ‘law of small numbers’.
Figure 4.3 (on the next page) shows the numbers of soldiers killed by horsekick in
each of 14 army corps of the Prussian army in each of the years spanning 1875–94.
Suppose that the number of men killed by horsekicks in one corps in one year is
X ∼ Bin(n, π), where:
n is large – the number of men in a corps (perhaps 50,000)
π is small – the probability that a man is killed by a horsekick.

X should be well-approximated by a Poisson distribution with some mean λ. The


sample frequencies and proportions of different counts are as follows:
Number killed 0 1 2 3 4 More
Count 144 91 32 11 2 0
% 51.4 32.5 11.4 3.9 0.7 0


The sample mean of the counts is x̄ = 0.7, which we use as λ for the Poisson
distribution. X ∼ Poisson(0.7) is indeed a good fit to the data, as shown in Figure
4.4.

Figure 4.3: Numbers of soldiers killed by horsekick in each of 14 army corps of the Prussian
army in each of the years spanning 1875–94. Source: Bortkiewicz, L.V. Das Gesetz der
kleinen Zahlen. (Leipzig: B.G. Teubner, 1898).

Figure 4.4: Fit of Poisson distribution to the data in Example 4.11.

Example 4.12 An airline is selling tickets to a flight with 198 seats. It knows that,
on average, about 1% of customers who have bought tickets fail to arrive for the
flight. Because of this, the airline overbooks the flight by selling 200 tickets. What is
the probability that everyone who arrives for the flight will get a seat?


Let X denote the number of people who fail to turn up. Using the binomial
distribution, X ∼ Bin(200, 0.01). We have:

P (X ≥ 2) = 1 − P (X = 0) − P (X = 1) = 1 − 0.1340 − 0.2707 = 0.5953.

Using the Poisson approximation, X ∼ Poisson(200 × 0.01) = Poisson(2).

P (X ≥ 2) = 1 − P (X = 0) − P (X = 1) = 1 − e−2 − 2e−2 = 1 − 3e−2 = 0.5940.
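The two answers can be compared side by side with a few lines of code. A sketch, assuming SciPy:

```python
from scipy.stats import binom, poisson

exact = binom(200, 0.01).sf(1)    # exact P(X >= 2),          ~0.5953
approx = poisson(2.0).sf(1)       # Poisson(2) approximation, ~0.5940
print(exact, approx)
```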

4.4.7 Some other discrete distributions


Just their names and short comments are given here, so that you have an idea of what
else there is. You may meet some of these in future courses such as ST2133 Advanced
statistics: distribution theory.

Geometric(π) distribution.
• Distribution of the number of failures in Bernoulli trials before the first success.
• π is the probability of success at each trial.
• The sample space is 0, 1, 2, . . ..
• See the basketball example in Chapter 3.

Negative binomial(r, π) distribution.


• Distribution of the number of failures in Bernoulli trials before r successes
occur.
• π is the probability of success at each trial.
• The sample space is 0, 1, 2, . . ..
• Negative binomial(1, π) is the same as Geometric(π).

Hypergeometric(n, A, B) distribution.
• Experiment where initially A + B objects are available for selection, and A of
them represent ‘success’.
• n objects are selected at random, without replacement.
• Hypergeometric is then the distribution of the number of successes.
• The sample space is the integers x where max{0, n − B} ≤ x ≤ min{n, A}.
• If the selection was with replacement, the distribution of the number of
successes would be Bin(n, A/(A + B)).

Multinomial(n, π1 , π2 , . . . , πk ) distribution.
• Here π1 + π2 + · · · + πk = 1, and the πi s are the probabilities of the values
1, 2, . . . , k.
• If n = 1, the sample space is 1, 2, . . . , k. This is essentially a generalisation of
the discrete uniform distribution, but with non-equal probabilities πi .


• If n > 1, the sample space is the vectors (n1 , n2 , . . . , nk ) where ni ≥ 0 for all i,
and n1 + n2 + · · · + nk = n. This is essentially a generalisation of the binomial
to the case where each trial has k ≥ 2 possible outcomes, and the random
variable records the numbers of each outcome in n trials. Note that with
k = 2, Multinomial(n, π1 , π2 ) is essentially the same as Bin(n, π) with π = π2
(or with π = π1 ).
• When n > 1, the multinomial distribution is the distribution of a multivariate
random variable, as discussed later in the course.

4.5 Common continuous distributions


For continuous random variables, we will consider the following distributions.

Uniform distribution.
Exponential distribution.
Normal distribution.

4.5.1 The (continuous) uniform distribution


The (continuous) uniform distribution has non-zero probabilities only on an interval
[a, b], where a < b are given numbers. The probability that its value is in an interval
within [a, b] is proportional to the length of the interval. In other words, all intervals
(within [a, b]) which have the same length have the same probability.

Uniform distribution pdf

The pdf of the (continuous) uniform distribution is:


(
1/(b − a) for a ≤ x ≤ b
f (x) =
0 otherwise.

A random variable X with this pdf may be written as X ∼ Uniform[a, b].

The pdf is ‘flat’, as shown in Figure 4.5 (on the next page). Clearly, f(x) ≥ 0 for all x, and:

∫_{−∞}^{∞} f(x) dx = ∫_{a}^{b} 1/(b − a) dx = (1/(b − a)) [x]_{a}^{b} = (b − a)/(b − a) = 1.

The cdf is:

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt = 0 for x < a, (x − a)/(b − a) for a ≤ x ≤ b, and 1 for x > b.

Therefore, the probability of an interval [x₁, x₂], where a ≤ x₁ < x₂ ≤ b, is:

P(x₁ ≤ X ≤ x₂) = F(x₂) − F(x₁) = (x₂ − x₁)/(b − a).


So the probability depends only on the length of the interval, x2 − x1 .


Figure 4.5: Continuous uniform distribution pdf (left) and cdf (right).

If X ∼ Uniform[a, b], we have:

E(X) = (a + b)/2 = median of X

and:

Var(X) = (b − a)²/12.
The mean and median also follow from the fact that the distribution is symmetric
about (a + b)/2, i.e. the midpoint of the interval [a, b].

4.5.2 Exponential distribution

Exponential distribution pdf

A random variable X has the exponential distribution with the parameter λ


(where λ > 0) if its probability density function is:
(
λe−λx for x ≥ 0
f (x) =
0 otherwise.

This is often denoted X ∼ Exponential(λ) or X ∼ Exp(λ).

It was shown in the previous chapter that this satisfies the conditions for a pdf (see
Example 3.24). The general shape of the pdf is that of ‘exponential decay’, as shown in
Figure 4.6 (hence the name).


Figure 4.6: Exponential distribution pdf.

The cdf of the Exp(λ) distribution is:


(
0 for x < 0
F (x) =
1 − e−λx for x ≥ 0.
The cdf is shown in Figure 4.7 for λ = 1.6.

Figure 4.7: Exponential distribution cdf for λ = 1.6.

For X ∼ Exp(λ), we have:


E(X) = 1/λ

and:

Var(X) = 1/λ².

These have been derived in the previous chapter (see Example 3.26). The median of the
distribution, also previously derived (see Example 3.29), is:

m = (ln 2)/λ = (ln 2) × (1/λ) = (ln 2) E(X) ≈ 0.69 × E(X).

Note that the median is always smaller than the mean, because the distribution is skewed to the right.

The moment generating function of the exponential distribution (derived in Example 3.27) is:

M_X(t) = λ/(λ − t) for t < λ.

Uses of the exponential distribution

The exponential is, among other things, a basic distribution of waiting times of various
kinds. This arises from a connection between the Poisson distribution – the simplest
distribution for counts – and the exponential.

If the number of events per unit of time has a Poisson distribution with parameter
λ, the time interval (measured in the same units of time) between two successive
events has an exponential distribution with the same parameter λ.

Note that the expected values of these behave as we would expect.

E(X) = λ for Pois(λ), i.e. a large λ means many events per unit of time, on average.

E(X) = 1/λ for Exp(λ), i.e. a large λ means short waiting times between successive
events, on average.

Example 4.13 Consider Example 4.10.

The number of customers arriving at a bank per minute has a Poisson


distribution with parameter λ = 1.6.

Therefore, the time X, in minutes, between the arrivals of two successive


customers follows an exponential distribution with parameter λ = 1.6.

From this exponential distribution, the expected waiting time between arrivals of
customers is E(X) = 1/1.6 = 0.625 (minutes) and the median is calculated to be
(ln 2) × 0.625 = 0.433.
We can also calculate probabilities of waiting times between arrivals, using the
cumulative distribution function:
(
0 for x < 0
F (x) =
1 − e−1.6x for x ≥ 0.

For example:


P (X ≤ 1) = F (1) = 1 − e−1.6×1 = 1 − e−1.6 = 0.7981.


The probability is about 0.8 that two arrivals are at most a minute apart.

P (X > 3) = 1 − F (3) = e−1.6×3 = e−4.8 = 0.0082.


The probability of a gap of 3 minutes or more between arrivals is very small.
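As a check outside the examination, the same waiting-time probabilities can be computed numerically. A sketch assuming SciPy's `expon`, which is parameterised by `scale = 1/λ`:

```python
from scipy.stats import expon

X = expon(scale=1 / 1.6)     # waiting time between arrivals, Exp(lambda = 1.6)

print(X.cdf(1))              # ~0.798,  P(X <= 1)
print(X.sf(3))               # ~0.0082, P(X > 3)
print(X.mean(), X.median())  # 0.625 and ~0.433 minutes
```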

4.5.3 Two other distributions


These are generalisations of the uniform and exponential distributions. Only their
names and short comments are given here, just so that you know they exist. You may
meet these again in future courses such as ST2133 Advanced statistics:
distribution theory.

Beta(α, β) distribution, shown in Figure 4.8.


• Generalising the uniform distribution, these are distributions for a closed
interval, which is taken to be [0, 1].
• Therefore, the sample space is {x | 0 ≤ x ≤ 1}.
• Unlike for the uniform distribution, the pdf is generally not flat.
• Beta(1, 1) is the same as Uniform[0, 1].

[Panels: Beta pdfs for (α, β) = (0.5, 1), (1, 2), (1, 1), (0.5, 0.5), (2, 2) and (4, 2).]

Figure 4.8: Beta distribution density functions.


Gamma(α, β) distribution, shown in Figure 4.9.


• Generalising the exponential distribution, this is a two-parameter family of
skewed distributions for positive values.
• The sample space is {x | x > 0}.
• Gamma(1, β) is the same as Exp(β).

[Panels: Gamma pdfs for (α, β) = (0.5, 1), (1, 0.5), (2, 1) and (2, 0.25), over x > 0.]

Figure 4.9: Gamma distribution density functions.

4.5.4 Normal (Gaussian) distribution


The normal distribution is by far the most important probability distribution in
statistics. This is for three broad reasons.

Many variables have distributions which are approximately normal, for example
heights of humans or animals, and weights of various products.

The normal distribution has extremely convenient mathematical properties, which


make it a useful default choice of distribution in many contexts.


Even when a variable is not itself even approximately normally distributed,


functions of several observations of the variable (‘sampling distributions’) are often
approximately normal, due to the central limit theorem. Because of this, the
normal distribution has a crucial role in statistical inference. This will be discussed
later in the course.

Normal distribution pdf

The pdf of the normal distribution is:

f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)) for −∞ < x < ∞

where π is the mathematical constant (i.e. π = 3.14159 . . .), and µ and σ 2 are
parameters, with −∞ < µ < ∞ and σ 2 > 0.
A random variable X with this pdf is said to have a normal distribution with mean
µ and variance σ 2 , denoted X ∼ N (µ, σ 2 ).

Clearly, f(x) ≥ 0 for all x. Also, it can be shown that ∫_{−∞}^{∞} f(x) dx = 1 (do not attempt to show this), so f(x) really is a pdf.
If X ∼ N (µ, σ 2 ), then:
E(X) = µ
and:
Var(X) = σ 2
and, therefore, the standard deviation is sd(X) = σ.
The proof of this is non-examinable. It uses the moment generating function of the
normal distribution, which is:
M_X(t) = exp(µt + σ²t²/2) for −∞ < t < ∞.
The mean can also be inferred from the observation that the normal pdf is symmetric
about µ. This also implies that the median of the normal distribution is µ.
The normal density is the so-called ‘bell curve’. The two parameters affect it as follows.

The mean µ determines the location of the curve.

The variance σ 2 determines the dispersion (spread) of the curve.

Example 4.14 Figure 4.10 (on the next page) shows that:

N (0, 1) and N (5, 1) have the same dispersion but different location: the N (5, 1)
curve is identical to the N (0, 1) curve, but shifted 5 units to the right
N (0, 1) and N (0, 9) have the same location but different dispersion: the N (0, 9)
curve is centered at the same value, 0, as the N (0, 1) curve, but spread out more
widely.


Figure 4.10: Various normal distributions.

Linear transformations of the normal distribution

We now consider one of the convenient properties of the normal distribution. Suppose
X is a random variable, and we consider the linear transformation Y = aX + b, where a
and b are constants.
Whatever the distribution of X, it is true that E(Y ) = a E(X) + b and also that
Var(Y ) = a2 Var(X).
Furthermore, if X is normally distributed, then so is Y . In other words, if
X ∼ N (µ, σ 2 ), then:
Y = aX + b ∼ N (aµ + b, a2 σ 2 ). (4.7)
This type of result is not true in general. For other families of distributions, the
distribution of Y = aX + b is not always in the same family as X.
Let us apply (4.7) with a = 1/σ and b = −µ/σ, to get:

Z = (1/σ)X − µ/σ = (X − µ)/σ ∼ N(µ/σ − µ/σ, (1/σ)² σ²) = N(0, 1).

The transformed variable Z = (X − µ)/σ is known as a standardised variable or a


z-score.
The distribution of the z-score is N (0, 1), i.e. the normal distribution with mean µ = 0
and variance σ 2 = 1 (and, therefore, a standard deviation of σ = 1). This is known as
the standard normal distribution. Its density function is:

f(x) = (1/√(2π)) exp(−x²/2) for −∞ < x < ∞.


The cumulative distribution function of the normal distribution is:

F(x) = ∫_{−∞}^{x} (1/√(2πσ²)) exp(−(t − µ)²/(2σ²)) dt.

In the special case of the standard normal distribution, the cdf is:

F(x) = Φ(x) = ∫_{−∞}^{x} (1/√(2π)) exp(−t²/2) dt.
Note, this is often denoted Φ(x).
Such integrals cannot be evaluated in a closed form, so we use statistical tables of them,
specifically a table of Φ(x) (or we could use a computer, but not in the examination).
In the examination, you will have a table of some values of Φ(x), the cdf of Z ∼ N (0, 1)
(Table A.1 of the Dougherty Statistical Tables, and Table 4 of the New Cambridge
Statistical Tables).
Note that Table 4 uses the notation Φ(x). However, we will denote this as Φ(z) (for
z-score). Φ(x) and Φ(z) mean the same thing, of course.
Table 4 shows values of Φ(z) = P (Z ≤ z) for z ≥ 0.00. This table can be used to
calculate probabilities of any intervals for any normal distribution, but how? The table
seems to be incomplete.

1. It is only for N (0, 1), not for N (µ, σ 2 ) for any other µ and σ 2 .
2. Even for N (0, 1), it only shows probabilities for z ≥ 0.

We next show how these are not really limitations, starting with ‘2.’.
The key to using the tables is that the standard normal distribution is symmetric about
0. This means that for an interval in one tail, its ‘mirror image’ in the other tail has the
same probability.
Suppose that z ≥ 0, so that −z ≤ 0. Table 4 shows:

P (Z ≤ z) = Φ(z).

From it, we also get the following probabilities.

P (Z ≤ −z) = Φ(−z) = P (Z > z) = 1 − Φ(z).


P (Z > −z) = 1 − Φ(−z) = P (Z ≤ z) = Φ(z).

In each of these, ≤ can be replaced by <, and ≥ by > (see Section 3.5). Figure 4.11 (on
the next page) shows tail probabilities for the standard normal distribution.
If Z ∼ N (0, 1), for any two numbers z1 < z2 , then:

P (z1 < Z ≤ z2 ) = Φ(z2 ) − Φ(z1 )

where Φ(z2 ) and Φ(z1 ) are obtained using the rules above.
Reality check: remember that:

Φ(0) = P (Z ≤ 0) = 0.5 = P (Z > 0) = 1 − Φ(0).


Figure 4.11: Tail probabilities for the standard normal distribution.

So if you ever end up with results like P (Z ≤ −1) = 0.7 or P (Z ≤ 1) = 0.2 or


P (Z > 2) = 0.95, these must be wrong! (See property 3 of cdfs in Section 3.4.4.)

Example 4.15 Consider the 0.7995 value for x = 0.84 in Table 4 of the New
Cambridge Statistical Tables, which shows that:

Φ(0.84) = P (Z ≤ 0.84) = 0.7995.

Using the results above, we then also have:

P (Z > 0.84) = 1 − Φ(0.84) = 1 − 0.7995 = 0.2005

P (Z ≤ −0.84) = P (Z ≥ 0.84) = 0.2005

P (Z ≥ −0.84) = P (Z ≤ 0.84) = 0.7995

P (−0.84 ≤ Z ≤ 0.84) = P (Z ≤ 0.84) − P (Z ≤ −0.84) = 0.5990.

Probabilities for any normal distribution

How about a normal distribution X ∼ N (µ, σ 2 ), for any other µ and σ 2 ?


What if we want to calculate, for any a < b, P (a < X ≤ b) = F (b) − F (a)?
Remember that (X − µ)/σ = Z ∼ N (0, 1). If we apply this transformation to all parts


of the inequalities, we get:


P(a < X ≤ b) = P((a − µ)/σ < (X − µ)/σ ≤ (b − µ)/σ)

             = P((a − µ)/σ < Z ≤ (b − µ)/σ)

             = Φ((b − µ)/σ) − Φ((a − µ)/σ)
which can be calculated using Table 4 of the New Cambridge Statistical Tables. (Note
that this also covers the cases of the one-sided inequalities P (X ≤ b), with a = −∞,
and P (X > a), with b = ∞.)

Example 4.16 Let X denote the diastolic blood pressure of a randomly selected
person in England. This is approximately distributed as X ∼ N (74.2, 127.87).
Suppose we want to know the probabilities of the following intervals:
X > 90 (high blood pressure)
X < 60 (low blood pressure)
60 ≤ X ≤ 90 (normal blood pressure).

These are calculated using standardisation with µ = 74.2, σ 2 = 127.87 and,


therefore, σ = 11.31. So here:
(X − 74.2)/11.31 = Z ∼ N(0, 1)

and we can refer values of this standardised variable to Table 4 of the New Cambridge Statistical Tables.

P(X > 90) = P((X − 74.2)/11.31 > (90 − 74.2)/11.31) = P(Z > 1.40) = 1 − Φ(1.40) = 1 − 0.9192 = 0.0808

and:

P(X < 60) = P((X − 74.2)/11.31 < (60 − 74.2)/11.31) = P(Z < −1.26) = P(Z > 1.26) = 1 − Φ(1.26) = 1 − 0.8962 = 0.1038.


Finally:
P (60 ≤ X ≤ 90) = P (X ≤ 90) − P (X < 60) = 0.8152.
These probabilities are shown in Figure 4.12.
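Outside the examination (where Table 4 must be used), such probabilities are usually computed directly from the normal cdf. A sketch assuming SciPy's `norm(loc, scale)`, where `scale` is the standard deviation σ; the small discrepancies from the values above arise because the z-values were rounded to two decimal places there.

```python
from scipy.stats import norm

X = norm(loc=74.2, scale=127.87 ** 0.5)   # N(74.2, 127.87); scale is sigma

print(X.sf(90))               # ~0.081, P(X > 90), high blood pressure
print(X.cdf(60))              # ~0.105, P(X < 60), low blood pressure
print(X.cdf(90) - X.cdf(60))  # ~0.814, P(60 <= X <= 90)
```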


Figure 4.12: Distribution of blood pressure for Example 4.16.

Some probabilities around the mean

The following results hold for all normal distributions.

P (µ − σ < X < µ + σ) = 0.683. In other words, about 68.3% of the total


probability is within 1 standard deviation of the mean.

P (µ − 1.96 × σ < X < µ + 1.96 × σ) = 0.950.

P (µ − 2 × σ < X < µ + 2 × σ) = 0.954.

P (µ − 2.58 × σ < X < µ + 2.58 × σ) = 0.99.

P (µ − 3 × σ < X < µ + 3 × σ) = 0.997.

The first two of these are illustrated graphically in Figure 4.13 (on the next page).

4.5.5 Normal approximation of the binomial distribution


For 0 < π < 1, the binomial distribution Bin(n, π) tends to the normal distribution
N (nπ, nπ(1 − π)) as n → ∞.


Figure 4.13: Some probabilities around the mean for the normal distribution.

Less formally, the binomial distribution is well-approximated by the normal distribution


when the number of trials n is reasonably large.
For a given n, the approximation is best when π is not very close to 0 or 1. One
rule-of-thumb is that the approximation is good enough when nπ > 5 and n(1 − π) > 5.
Illustrations of the approximation are shown in Figure 4.14 (on the next page) for
different values of n and π. Each plot shows values of the pf of Bin(n, π), and the pdf of
the normal approximation, N (nπ, nπ(1 − π)).
When the normal approximation is appropriate, we can calculate probabilities for
X ∼ Bin(n, π) using Y ∼ N (nπ, nπ(1 − π)) and Table 4 of the New Cambridge
Statistical Tables.
Unfortunately, there is one small caveat. The binomial distribution is discrete, but the
normal distribution is continuous. To see why this is problematic, consider the following.
Suppose X ∼ Bin(40, 0.4). Since X is discrete, such that x = 0, 1, 2, . . . , 40, then:

P (X ≤ 4) = P (X ≤ 4.5) = P (X < 5)

since P (4 < X ≤ 4.5) = 0 and P (4.5 < X < 5) = 0 due to the ‘gaps’ in the probability
mass for this distribution. In contrast if Y ∼ N (16, 9.6), then:

P (Y ≤ 4) < P (Y ≤ 4.5) < P (Y < 5)

since P (4 < Y < 4.5) > 0 and P (4.5 < Y < 5) > 0 because this is a continuous
distribution.
The accepted way to circumvent this problem is to use a continuity correction which
corrects for the effects of the transition from a discrete Bin(n, π) distribution to a
continuous N (nπ, nπ(1 − π)) distribution.

[Figure: pf of Bin(n, π) with its normal approximation overlaid, for (n, π) = (10, 0.5), (25, 0.5), (25, 0.25), (10, 0.9), (25, 0.9) and (50, 0.9).]

Figure 4.14: Examples of the normal approximation of the binomial distribution.

Continuity correction

This technique involves representing each discrete binomial value x, for 0 ≤ x ≤ n,


by the continuous interval (x − 0.5, x + 0.5). Great care is needed to determine which
x values are included in the required probability. Suppose we are approximating
X ∼ Bin(n, π) with Y ∼ N (nπ, nπ(1 − π)), then:

P (X < 4) = P (X ≤ 3) ⇒ P (Y < 3.5) (since 4 is excluded)


P (X ≤ 4) = P (X < 5) ⇒ P (Y < 4.5) (since 4 is included)
P (1 ≤ X < 6) = P (1 ≤ X ≤ 5) ⇒ P (0.5 < Y < 5.5) (since 1 to 5 are included).

Example 4.17 In the UK general election in May 2010, the Conservative Party
received 36.1% of the votes. We carry out an opinion poll in November 2014, where
we survey 1,000 people who say they voted in 2010, and ask who they would vote for
if a general election was held now. Let X denote the number of people who say they
would now vote for the Conservative Party.
Suppose we assume that X ∼ Bin(1,000, 0.361).

1. What is the probability that X ≥ 400?


Using the normal approximation, noting n = 1,000 and π = 0.361, with


Y ∼ N (1,000 × 0.361, 1,000 × 0.361 × 0.639) = N (361, 230.68), we get:
P(X ≥ 400) ≈ P(Y ≥ 399.5)
           = P((Y − 361)/√230.68 ≥ (399.5 − 361)/√230.68)
           = P(Z ≥ 2.53)
           = 1 − Φ(2.53)
           = 0.0057.
The exact probability from the binomial distribution is P (X ≥ 400) = 0.0059.
Without the continuity correction, the normal approximation would give 0.0051.
2. What is the largest number x for which P (X ≤ x) < 0.01?
We need the largest x which satisfies:

P(X ≤ x) ≈ P(Y ≤ x + 0.5) = P(Z ≤ (x + 0.5 − 361)/√230.68) < 0.01.

According to Table 4 of the New Cambridge Statistical Tables, the smallest z which satisfies P(Z ≥ z) < 0.01 is z = 2.33, so the largest z which satisfies P(Z ≤ z) < 0.01 is z = −2.33. We then need to solve:

(x + 0.5 − 361)/√230.68 ≤ −2.33
which gives x ≤ 325.1. The largest integer value which satisfies this is x = 325.
Therefore, P (X ≤ x) < 0.01 for all x ≤ 325.
The sum of the exact binomial probabilities from 0 to x is 0.0093 for x = 325,
and 0.011 for x = 326. The normal approximation gives exactly the correct
answer in this instance.
3. Suppose that 300 respondents in the actual survey say they would vote for the
Conservative Party now. What do you conclude from this?
From the answer to Question 2, we know that P (X ≤ 300) < 0.01, if π = 0.361.
In other words, if the Conservatives’ support remains 36.1%, we would be very
unlikely to get a random sample where only 300 (or fewer) respondents would
say they would vote for the Conservative Party.
Now X = 300 is actually observed. We can then conclude one of two things (if
we exclude other possibilities, such as a biased sample or lying by the
respondents).
(a) The Conservatives’ true level of support is still 36.1% (or even higher), but
by chance we ended up with an unusual sample with only 300 of their
supporters.
(b) The Conservatives’ true level of support is currently less than 36.1% (in
which case getting 300 in the sample would be more probable).

Here (b) seems a more plausible conclusion than (a). This kind of reasoning is
the basis of statistical significance tests.
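Both the exact binomial probabilities and the continuity-corrected approximation quoted in this example can be reproduced numerically. A sketch, assuming SciPy:

```python
import math
from scipy.stats import binom, norm

n, p = 1000, 0.361
X = binom(n, p)                                          # exact distribution
Y = norm(loc=n * p, scale=math.sqrt(n * p * (1 - p)))    # N(361, 230.68)

print(X.sf(399))               # exact P(X >= 400), ~0.0059
print(Y.sf(399.5))             # continuity-corrected approximation, ~0.0057
print(X.cdf(325), X.cdf(326))  # ~0.0093 and ~0.011, bracketing the 0.01 threshold
```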


4.6 Overview of chapter


This chapter has introduced some common discrete and continuous probability
distributions. Their properties, uses and applications have been discussed. The
relationships between some of these distributions have also been covered.

4.7 Key terms and concepts

Bernoulli distribution; Binomial distribution; Central limit theorem; Continuity correction; Continuous uniform distribution; Discrete uniform distribution; Exponential distribution; Moment; Moment generating function; Normal distribution; Parameter; Poisson distribution; Standardised variable; Standard normal distribution; z-score

4.8 Sample examination questions

1. The random variable X has a Poisson distribution with parameter λ = 3. Calculate


P (X ≥ 2 | X > 0).

2. (a) A coffee machine can be calibrated to produce an average of µ millilitres (ml)


per cup. Suppose the quantity produced is normally distributed with a
standard deviation of 9 ml per cup. Determine the value of µ such that 250 ml
cups will overflow only 1% of the time.

(b) Suppose now that the standard deviation, σ, can be fixed at specified levels.
What is the largest value of σ that will allow the amount of coffee dispensed to
fall within 30 ml of the mean with a probability of at least 95%?

3. A vaccine for Covid-19 is known to be 90% effective, i.e. 90% of vaccine recipients
are successfully immunised against Covid-19. A new (different) vaccine is tested on
100 patients and found to successfully immunise 96 of the 100 patients. Is the new
vaccine better?

Hint: Assume the new vaccine is equally effective as the original vaccine and
consider using an appropriate approximating distribution.


4.9 Solutions to Sample examination questions


1. We have that X ∼ Pois(3). The required probability is:

P(X ≥ 2 | X > 0) = P({X ≥ 2} ∩ {X > 0})/P(X > 0) = P(X ≥ 2)/P(X ≥ 1) = (1 − P(X ≤ 1))/(1 − P(X = 0))

                 = (1 − e^{−3} − 3e^{−3})/(1 − e^{−3})

                 = 0.8428.

2. (a) Let X = volume filled, so that X ∼ N (µ, 81). We require that:

P (X > 250) = 0.01.

Standardising, we have:
P (Z > 2.33) = 0.01
hence:
2.33 = (250 − µ)/9   ⇒   µ = 229.03.

(b) We have that X ∼ N (µ, σ 2 ). It follows that:


 
0.95 ≤ P(|X − µ| < 30) = P(|Z| < 30/σ)

so that 30/σ = 1.96 or σ = 30/1.96 = 15.31.

3. Let X be the number of successful immunisations. If the new vaccine is equally


effective, then:
X ∼ Bin(100, 0.90) ≈ Y ∼ N (90, 9).
Therefore, applying a continuity correction:
 
P(X ≥ 96) ≈ P(Y ≥ 95.5) = P((Y − 90)/√9 ≥ (95.5 − 90)/√9) = P(Z ≥ 1.83) = 0.0336.
Hence even if the new vaccine is no better than the original, the chance of 96 or
more effective cases is quite small, suggesting the new vaccine could be better.
(Any reasonable comment accepted.)

There are two kinds of statistics, the kind you look up and the kind you make
up.
(Rex Stout)

Chapter 5
Multivariate random variables

5.1 Synopsis of chapter


Almost all applications of statistical methods deal with several measurements on the
same, or connected, items. To think statistically about several measurements on a
randomly selected item, you must understand some of the concepts for joint
distributions of random variables.

5.2 Learning outcomes


After completing this chapter, you should be able to:

arrange the probabilities for a discrete bivariate distribution in tabular form


define marginal and conditional distributions, and determine them for a discrete
bivariate distribution
recall how to define and determine independence for two random variables
define and compute expected values for functions of two random variables and
demonstrate how to prove simple properties of expected values
provide the definition of covariance and correlation for two random variables and
calculate these.

5.3 Introduction
So far, we have considered univariate situations, that is one random variable at a time.
Now we will consider multivariate situations, that is two or more random variables at
once, and together.
In particular, we consider two somewhat different types of multivariate situations.

1. Several different variables – such as the height and weight of a person.


2. Several observations of the same variable, considered together – such as the heights
of all n people in a sample.

Suppose that X1 , X2 , . . . , Xn are random variables, then the vector:

X = (X1, X2, . . . , Xn)′


is a multivariate random variable (here n-variate), also known as a random


vector. Its possible values are the vectors:

x = (x1, x2, . . . , xn)′

where each xi is a possible value of the random variable Xi , for i = 1, 2, . . . , n.


The joint probability distribution of a multivariate random variable X is defined by
the possible values x, and their probabilities.
For now, we consider just the simplest multivariate case, a bivariate random variable
where n = 2. This is sufficient for introducing most of the concepts of multivariate
random variables.
For notational simplicity, we will use X and Y instead of X1 and X2 . A bivariate
random variable is then the pair (X, Y ).

Example 5.1 In this chapter, we consider the following examples.


Discrete bivariate example – for a football match:

X = the number of goals scored by the home team

Y = the number of goals scored by the visiting (away) team.

Continuous bivariate example – for a person:

X = the person’s height

Y = the person’s weight.

5.4 Joint probability functions


When the random variables in (X1 , X2 , . . . , Xn ) are either all discrete or all continuous,
we also call the multivariate random variable either discrete or continuous, respectively.
For a discrete multivariate random variable, the joint probability distribution is
described by the joint probability function, defined as:

p(x1 , x2 , . . . , xn ) = P (X1 = x1 , X2 = x2 , . . . , Xn = xn )

for all vectors (x1 , x2 , . . . , xn ) of n real numbers. The value p(x1 , x2 , . . . , xn ) of the joint
probability function is itself a single number, not a vector.
In the bivariate case, this is:

p(x, y) = P (X = x, Y = y)

which we sometimes write as pX,Y (x, y) to make the random variables clear.


Example 5.2 Consider a randomly selected football match in the English Premier
League (EPL), and the two random variables:

X = the number of goals scored by the home team

Y = the number of goals scored by the visiting (away) team.

Suppose both variables have possible values 0, 1, 2 and 3 (to keep this example
simple, we have recorded the small number of scores of 4 or greater also as 3).
Consider the joint distribution of (X, Y ). We use probabilities based on data from
the 2009–10 EPL season.
Suppose the values of pX,Y (x, y) = p(x, y) = P (X = x, Y = y) are the following:

Y =y
X=x 0 1 2 3
0 0.100 0.031 0.039 0.031
1 0.100 0.146 0.092 0.015
2 0.085 0.108 0.092 0.023
3 0.062 0.031 0.039 0.006

and p(x, y) = 0 for all other (x, y).


Note that this satisfies the conditions for a probability function.

1. p(x, y) ≥ 0 for all (x, y).


2. Σ_{x=0}^{3} Σ_{y=0}^{3} p(x, y) = 0.100 + 0.031 + · · · + 0.006 = 1.000.

The joint probability function gives probabilities of values of (X, Y ), for example:

A 1–1 draw, which is the most probable single result, has probability

P (X = 1, Y = 1) = p(1, 1) = 0.146.

The match is a draw with probability:

P (X = Y ) = p(0, 0) + p(1, 1) + p(2, 2) + p(3, 3) = 0.344.

The match is won by the home team with probability:

P (X > Y ) = p(1, 0) + p(2, 0) + p(2, 1) + p(3, 0) + p(3, 1) + p(3, 2) = 0.425.

More than 4 goals are scored in the match with probability:

P (X + Y > 4) = p(2, 3) + p(3, 2) + p(3, 3) = 0.068.
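Probabilities of events defined on (X, Y) are simply sums of entries of the joint probability table, which makes them easy to compute with array software. A small sketch, assuming NumPy, with home goals indexing rows and away goals indexing columns:

```python
import numpy as np

# joint pf: p[x, y] = P(X = x, Y = y); rows = home goals, columns = away goals
p = np.array([[0.100, 0.031, 0.039, 0.031],
              [0.100, 0.146, 0.092, 0.015],
              [0.085, 0.108, 0.092, 0.023],
              [0.062, 0.031, 0.039, 0.006]])

x, y = np.indices(p.shape)
print(p.sum())              # 1.000, so this is a valid joint pf
print(p[x == y].sum())      # 0.344, P(draw)
print(p[x > y].sum())       # 0.425, P(home win)
print(p[x + y > 4].sum())   # 0.068, P(more than 4 goals)
```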


5.5 Marginal distributions

Consider a multivariate discrete random variable X = (X1 , X2 , . . . , Xn ).


The marginal distribution of a subset of the variables in X is the (joint) distribution
of this subset. The joint pf of these variables (the marginal pf) is obtained by
summing the joint pf of X over the variables which are not included in the subset.

Example 5.3 Consider X = (X1 , X2 , X3 , X4 ), and the marginal distribution of the


subset (X1 , X2 ). The marginal pf of (X1 , X2 ) is:
p_{1,2}(x1, x2) = P(X1 = x1, X2 = x2) = Σ_{x3} Σ_{x4} p(x1, x2, x3, x4)

where the sum is of the values of the joint pf of (X1 , X2 , X3 , X4 ) over all possible
values of X3 and X4 .

The simplest marginal distributions are those of individual variables in the multivariate
random variable.
The marginal pf is then obtained by summing the joint pf over all the other variables.
The resulting marginal distribution is univariate, and its pf is a univariate pf.

Marginal distributions for discrete bivariate distributions

For the bivariate distribution of (X, Y ) the univariate marginal distributions are
those of X and Y individually. Their marginal pfs are:
p_X(x) = Σ_y p(x, y)   and   p_Y(y) = Σ_x p(x, y).

Example 5.4 Continuing with the football example introduced in Example 5.2, the
joint and marginal probability functions are:

Y =y
X=x 0 1 2 3 pX (x)
0 0.100 0.031 0.039 0.031 0.201
1 0.100 0.146 0.092 0.015 0.353
2 0.085 0.108 0.092 0.023 0.308
3 0.062 0.031 0.039 0.006 0.138
pY (y) 0.347 0.316 0.262 0.075 1.000

and p(x, y) = pX (x) = pY (y) = 0 for all other (x, y).


For example:
p_X(0) = Σ_{y=0}^{3} p(0, y) = p(0, 0) + p(0, 1) + p(0, 2) + p(0, 3) = 0.100 + 0.031 + 0.039 + 0.031 = 0.201.

Even for a multivariate random variable, expected values E(Xi ), variances Var(Xi ) and
medians of individual variables are obtained from the univariate (marginal)
distributions of Xi , as defined in Chapter 3.

Example 5.5 Consider again the football example.

The expected number of goals scored by the home team is:

E(X) = Σ_x x p_X(x) = 0 × 0.201 + 1 × 0.353 + 2 × 0.308 + 3 × 0.138 = 1.383.

The expected number of goals scored by the visiting team is:

E(Y) = Σ_y y p_Y(y) = 0 × 0.347 + 1 × 0.316 + 2 × 0.262 + 3 × 0.075 = 1.065.
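Continuing the same NumPy sketch as before, the marginal pfs are row and column sums of the joint table, and the expected values are the corresponding weighted averages:

```python
import numpy as np

p = np.array([[0.100, 0.031, 0.039, 0.031],
              [0.100, 0.146, 0.092, 0.015],
              [0.085, 0.108, 0.092, 0.023],
              [0.062, 0.031, 0.039, 0.006]])

p_X = p.sum(axis=1)   # marginal pf of home goals:  [0.201, 0.353, 0.308, 0.138]
p_Y = p.sum(axis=0)   # marginal pf of away goals:  [0.347, 0.316, 0.262, 0.075]

goals = np.arange(4)
print(goals @ p_X)    # E(X) = 1.383
print(goals @ p_Y)    # E(Y) = 1.065
```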

5.6 Continuous multivariate distributions


If all the random variables in X = (X1 , X2 , . . . , Xn ) are continuous, the joint
distribution of X is specified by its joint probability density function
f (x1 , x2 , . . . , xn ).
Marginal distributions are defined as in the discrete case, but with integration instead
of summation.
There will be no questions on continuous multivariate joint probability density
functions in the examination. Only discrete multivariate joint probability functions may
appear in the examination. So just a brief example of the continuous case is given here,
to give you an idea of such distributions. You will meet these if you study ST2133
Advanced statistics: distribution theory.

Example 5.6 For a randomly selected man (aged over 16) in England, let:

X = his height (in cm)


Y = his weight (in kg).

The univariate marginal distributions of X and Y are approximately normal, with:


X ∼ N (174.9, (7.39)2 ) and Y ∼ N (84.2, (15.63)2 )


and the bivariate joint distribution of (X, Y ) is a bivariate normal distribution.


Plots of the univariate and bivariate probability density functions are shown in
Figure 5.1, Figure 5.2 (on the next page) and Figure 5.3 (p.144).


Figure 5.1: Univariate marginal pdfs for Example 5.6.

5.7 Conditional distributions


Consider discrete variables X and Y , with joint pf p(x, y) = pX,Y (x, y) and marginal pfs
pX (x) and pY (y), respectively.

Conditional distributions of discrete bivariate distributions

Let x be one possible value of X, for which pX (x) > 0. The conditional
distribution of Y given that X = x is the discrete probability distribution with
the pf:

p_{Y|X}(y | x) = P(Y = y | X = x) = P(X = x and Y = y)/P(X = x) = p_{X,Y}(x, y)/p_X(x)

for any value y.


This is the conditional probability function of Y given X = x.

Example 5.7 Recall that in the football example the joint and marginal pfs were:
Y =y
X=x 0 1 2 3 pX (x)
0 0.100 0.031 0.039 0.031 0.201
1 0.100 0.146 0.092 0.015 0.353
2 0.085 0.108 0.092 0.023 0.308
3 0.062 0.031 0.039 0.006 0.138
pY (y) 0.347 0.316 0.262 0.075 1.000

Figure 5.2: Bivariate joint pdf (contour plot) for Example 5.6, with height (cm) on the horizontal axis and weight (kg) on the vertical axis.

We can now calculate the conditional pf of Y given X = x for each x, i.e. of away
goals given home goals. For example:

pY|X(y | 0) = pX,Y(0, y) / pX(0) = pX,Y(0, y) / 0.201.

So, for example, pY |X (1 | 0) = pX,Y (0, 1)/0.201 = 0.031/0.201 = 0.154.


Calculating these for each value of x gives:

pY |X (y | x) when y is:
X=x 0 1 2 3 Sum
0 0.498 0.154 0.194 0.154 1.00
1 0.283 0.414 0.261 0.042 1.00
2 0.276 0.351 0.299 0.075 1.00
3 0.449 0.225 0.283 0.043 1.00

So, for example:

if the home team scores 0 goals, the probability that the visiting team scores 1
goal is pY |X (1 | 0) = 0.154

if the home team scores 1 goal, the probability that the visiting team wins the
match is pY |X (2 | 1) + pY |X (3 | 1) = 0.261 + 0.042 = 0.303.


Figure 5.3: Bivariate joint pdf f(x, y) (surface plot) for Example 5.6, with height (cm) and weight (kg) on the horizontal axes.

5.7.1 Properties of conditional distributions


Each different value of x defines a different conditional distribution and conditional pf
pY |X (y | x). Each value of pY |X (y | x) is a conditional probability of the kind previously
defined. Defining events A = {Y = y} and B = {X = x}, then:
P(A | B) = P(A ∩ B) / P(B) = P(Y = y and X = x) / P(X = x)
         = P(Y = y | X = x)
         = pX,Y(x, y) / pX(x)
         = pY|X(y | x).

A conditional distribution is itself a probability distribution, and a conditional pf is a pf. Clearly, pY|X(y | x) ≥ 0 for all y, and:

Σ_y pY|X(y | x) = (Σ_y pX,Y(x, y)) / pX(x) = pX(x) / pX(x) = 1.

The conditional distribution and pf of X given Y = y (for any y such that pY (y) > 0) is
defined similarly, with the roles of X and Y reversed:
pX|Y(x | y) = pX,Y(x, y) / pY(y)
for any value x.


Conditional distributions are general and are not limited to the bivariate case. If X
and/or Y are vectors of random variables, the conditional pf of Y given X = x is:

pY|X(y | x) = pX,Y(x, y) / pX(x)

where pX,Y (x, y) is the joint pf of the random vector (X, Y), and pX (x) is the marginal
pf of the random vector X.

5.7.2 Conditional mean and variance


Since a conditional distribution is a probability distribution, it also has a mean
(expected value) and variance (and median etc.).
These are known as the conditional mean and conditional variance, and are
denoted, respectively, by:

EY |X (Y | x) and VarY |X (Y | x).

Example 5.8 In the football example, we have:


EY|X(Y | 0) = Σ_y y pY|X(y | 0) = 0 × 0.498 + 1 × 0.154 + 2 × 0.194 + 3 × 0.154 = 1.00.

So, if the home team scores 0 goals, the expected number of goals by the visiting
team is EY |X (Y | 0) = 1.00.
EY |X (Y | x) for x = 1, 2 and 3 are obtained similarly.
Here X is the number of goals by the home team, and Y is the number of goals by
the visiting team:

pY |X (y | x) when y is:
X=x 0 1 2 3 EY |X (Y | x)
0 0.498 0.154 0.194 0.154 1.00
1 0.283 0.414 0.261 0.042 1.06
2 0.276 0.351 0.299 0.075 1.17
3 0.449 0.225 0.283 0.043 0.92

Plots of the conditional means are shown in Figure 5.4 (on the next page).
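The conditional pfs and conditional means in Examples 5.7 and 5.8 can be reproduced in the same way. A minimal Python sketch, continuing from the joint pf matrix p used in the earlier sketch (again, the code and its variable names are illustrative only, not part of the subject guide):

    import numpy as np

    p = np.array([[0.100, 0.031, 0.039, 0.031],
                  [0.100, 0.146, 0.092, 0.015],
                  [0.085, 0.108, 0.092, 0.023],
                  [0.062, 0.031, 0.039, 0.006]])
    y = np.arange(4)
    p_X = p.sum(axis=1)

    # Conditional pf of Y given X = x: divide each row of the joint pf by pX(x)
    p_Y_given_X = p / p_X[:, None]
    print(np.round(p_Y_given_X, 3))     # each row sums to 1

    # Conditional means E(Y | X = x) for x = 0, 1, 2, 3
    cond_means = p_Y_given_X @ y
    print(np.round(cond_means, 2))      # approximately [1.00, 1.06, 1.17, 0.92]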

5.7.3 Continuous conditional distributions (non-examinable)


Suppose X and Y are continuous, with joint pdf fX,Y (x, y) and marginal pdfs fX (x)
and fY (y), respectively. The conditional distribution of Y given that X = x is the
continuous probability distribution with the pdf:

fY|X(y | x) = fX,Y(x, y) / fX(x)


Figure 5.4: Conditional means for Example 5.8: the expected away goals E(Y | x) plotted against the home goals x.

which is defined if fX(x) > 0. For a conditional distribution of X given Y = y, fX|Y(x | y) is defined similarly, with the roles of X and Y reversed.
Unlike in the discrete case, this is not a conditional probability. However, fY |X (y | x) is a
pdf of a continuous random variable, so the conditional distribution is itself a
continuous probability distribution. You will consider these if you study ST2133
Advanced statistics: distribution theory.

Example 5.9 For a randomly selected man (aged over 16) in England, consider
X = height (in cm) and Y = weight (in kg). The joint distribution of (X, Y ) is
approximately bivariate normal (see Example 5.6).
The conditional distribution of Y given X = x is then a normal distribution for each
x, with the following parameters:

EY |X (Y | x) = −58.1 + 0.81x and VarY |X (Y | x) = 208.

In other words, the conditional mean depends on x, but the conditional variance
does not. For example:

EY |X (Y | 160) = 71.5 and EY | X (Y | 190) = 95.8.

For women, this conditional distribution is normal with the following parameters:

EY |X (Y | x) = −23.0 + 0.58x and VarY |X (Y | x) = 221.

The conditional means are shown in Figure 5.5 (on the next page).


Figure 5.5: Conditional means for Example 5.9: the conditional mean of weight (kg) against height (cm), shown separately for men and women.

5.8 Covariance and correlation


Suppose that the conditional distributions pY|X(y | x) of a random variable Y given different values x of a random variable X are not all the same, i.e. the conditional distribution of Y 'depends on' the value of X. In this case, there is said to be an association (or dependence) between X and Y.
If two random variables are associated (dependent), knowing the value of one (for
example, X) will help to predict the likely value of the other (for example, Y ).
We next consider two measures of association which are used to summarise the
strength of an association in a single number: covariance and correlation (scaled
covariance).

5.8.1 Covariance

Definition of covariance

The covariance of two random variables X and Y is defined as:

Cov(X, Y ) = Cov(Y, X) = E((X − E(X))(Y − E(Y ))).

This can also be expressed as the more convenient formula:

Cov(X, Y ) = E(XY ) − E(X) E(Y ).

This result will be proved later.


(Note that these involve expected values of products of two random variables, which
have not been defined yet. We will do so later in this chapter.)


Properties of covariance

Suppose X and Y are random variables, and a, b, c and d are constants.

The covariance of a random variable with itself is the variance of the random
variable:
Cov(X, X) = E(XX) − E(X) E(X) = E(X 2 ) − (E(X))2 = Var(X).

The covariance of a random variable and a constant is 0:


Cov(a, X) = E(aX) − E(a) E(X) = a E(X) − a E(X) = 0.

The covariance of linear transformations of random variables is:


Cov(aX + b, cY + d) = ac Cov(X, Y ).

5.8.2 Correlation

Definition of correlation

The correlation of two random variables X and Y is defined as:


Corr(X, Y) = Corr(Y, X) = Cov(X, Y) / √(Var(X) Var(Y)) = Cov(X, Y) / (sd(X) sd(Y)).

When Cov(X, Y ) = 0, then Corr(X, Y ) = 0. When this is the case, we say that X
and Y are uncorrelated.

Correlation and covariance are measures of the strength of the linear (‘straight-line’)
association between X and Y .
The further the correlation is from 0, the stronger is the linear association. The most
extreme possible values of correlation are −1 and +1, which are obtained when Y is an
exact linear function of X.
Corr(X, Y ) = +1 when Y = aX + b with a > 0.
Corr(X, Y ) = −1 when Y = aX + b with a < 0.

If Corr(X, Y ) > 0, we say that X and Y are positively correlated.


If Corr(X, Y ) < 0, we say that X and Y are negatively correlated.

Example 5.10 Recall the joint pf pX,Y(x, y) in the football example. The table below shows, for each combination of x and y, the product xy together with its probability pX,Y(x, y):

                        Y = y
X = x             0          1          2          3
0       xy:       0          0          0          0
        prob:     0.100      0.031      0.039      0.031
1       xy:       0          1          2          3
        prob:     0.100      0.146      0.092      0.015
2       xy:       0          2          4          6
        prob:     0.085      0.108      0.092      0.023
3       xy:       0          3          6          9
        prob:     0.062      0.031      0.039      0.006

From these products and their probabilities, we can derive the probability distribution of XY.
For example:

P (XY = 2) = pX,Y (1, 2) + pX,Y (2, 1) = 0.092 + 0.108 = 0.200.

The pf of the product XY is:

XY = xy 0 1 2 3 4 6 9
P (XY = xy) 0.448 0.146 0.200 0.046 0.092 0.062 0.006

Hence:

E(XY ) = 0 × 0.448 + 1 × 0.146 + 2 × 0.200 + · · · + 9 × 0.006 = 1.478.

From the marginal pfs pX (x) and pY (y) we get:

E(X) = 1.383
E(Y ) = 1.065
E(X 2 ) = 2.827
E(Y 2 ) = 2.039
Var(X) = 2.827 − (1.383)2 = 0.9143
Var(Y ) = 2.039 − (1.065)2 = 0.9048.

Therefore, the covariance of X and Y is:

Cov(X, Y ) = E(XY ) − E(X) E(Y ) = 1.478 − 1.383 × 1.065 = 0.00511

and the correlation is:


Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)) = 0.00511 / √(0.9143 × 0.9048) = 0.00562.

The numbers of goals scored by the home and visiting teams are very nearly
uncorrelated (i.e. not linearly associated).
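As a numerical check on Example 5.10, the sketch below (illustrative Python, not part of the guide; variable names are my own) computes E(XY), the covariance and the correlation directly from the joint pf matrix.

    import numpy as np

    p = np.array([[0.100, 0.031, 0.039, 0.031],
                  [0.100, 0.146, 0.092, 0.015],
                  [0.085, 0.108, 0.092, 0.023],
                  [0.062, 0.031, 0.039, 0.006]])
    x = np.arange(4)
    y = np.arange(4)
    p_X, p_Y = p.sum(axis=1), p.sum(axis=0)

    E_X, E_Y = x @ p_X, y @ p_Y
    Var_X = (x**2) @ p_X - E_X**2           # approximately 0.9143
    Var_Y = (y**2) @ p_Y - E_Y**2           # approximately 0.9048

    # E(XY) is the sum over all (x, y) of x*y*p(x, y); np.outer(x, y) holds the products xy
    E_XY = np.sum(np.outer(x, y) * p)       # approximately 1.478
    cov = E_XY - E_X * E_Y                  # approximately 0.0051
    corr = cov / np.sqrt(Var_X * Var_Y)     # approximately 0.0056
    print(E_XY, cov, corr)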

5.8.3 Sample covariance and correlation

We have just introduced covariance and correlation, two new characteristics of probability distributions (population distributions). We now discuss their sample equivalents.
Let (X1 , Y1 ), (X2 , Y2 ), . . . , (Xn , Yn ) be a sample of n pairs of observed values of two
random variables X and Y .
We can use these observations to calculate sample versions of the covariance and
correlation between X and Y . These are measures of association in the sample, i.e.
descriptive statistics. They are also estimates of the corresponding population quantities


Cov(X, Y ) and Corr(X, Y ). The uses of these sample measures will be discussed in
more detail later in the course.

Sample covariance

The sample covariance of random variables X and Y is calculated as:


Ĉov(X, Y) = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ)

where X̄ and Ȳ are the sample means of X and Y , respectively.

Sample correlation

The sample correlation of random variables X and Y is calculated as:


r = Ĉov(X, Y) / (SX SY) = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / √( Σ_{i=1}^{n} (Xi − X̄)² × Σ_{i=1}^{n} (Yi − Ȳ)² )

where SX and SY are the sample standard deviations of X and Y , respectively.

r is always between −1 and +1, and is equal to −1 or +1 only if X and Y are perfectly linearly related in the sample.

r = 0 if X and Y are uncorrelated (not linearly related) in the sample.
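For observed data, the sample covariance and sample correlation can be computed directly from these formulas. A short Python sketch (the data values below are invented purely for illustration; numpy's built-in functions are used only as a cross-check):

    import numpy as np

    # Hypothetical paired observations (illustrative values only)
    x = np.array([1.2, 2.5, 3.1, 4.8, 5.0, 6.3])
    y = np.array([2.0, 2.9, 3.8, 5.1, 4.7, 6.5])
    n = len(x)

    sample_cov = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
    r = sample_cov / (x.std(ddof=1) * y.std(ddof=1))   # sample standard deviations use n - 1

    # numpy's built-in versions should agree with the formulas above
    print(sample_cov, np.cov(x, y, ddof=1)[0, 1])
    print(r, np.corrcoef(x, y)[0, 1])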

Example 5.11 Figure 5.6 (on the next page) shows different examples of
scatterplots of observations of X and Y , and different values of the sample
correlation, r. The line shown in each plot is the best-fitting (least squares) line for
the scatterplot (which will be introduced later in the course).

In (a), X and Y are perfectly linearly related, and r = 1.

Plots (b), (c) and (e) show relationships of different strengths.

In (c), the variables are negatively correlated.

In (d), there is no linear relationship, and r = 0.

Plot (f) shows that r can be 0 even if two variables are clearly related, if that
relationship is not linear.


Figure 5.6: Scatterplots depicting various sample correlations as discussed in Example 5.11. Panels: (a) r = 1, (b) r = 0.85, (c) r = −0.5, (d) r = 0, (e) r = 0.92, (f) r = 0.

5.9 Independent random variables

Two discrete random variables X and Y are associated if pY |X (y | x) depends on x.


What if it does not, i.e. what if:

pY|X(y | x) = pX,Y(x, y) / pX(x) = pY(y)   for all x and y

so that knowing the value of X does not help to predict Y ?


This implies that:

pX,Y (x, y) = pX (x) pY (y) for all x, y. (5.1)

X and Y are independent of each other if and only if (5.1) is true.


Independent random variables

In general, suppose that X1, X2, . . . , Xn are discrete random variables. These are independent if and only if their joint pf is:

p(x1 , x2 , . . . , xn ) = p1 (x1 ) p2 (x2 ) · · · pn (xn )

for all numbers x1 , x2 , . . . , xn , where p1 (x1 ), p2 (x2 ), . . . , pn (xn ) are the univariate
marginal pfs of X1 , X2 , . . . , Xn , respectively.
Similarly, continuous random variables X1 , X2 , . . . , Xn are independent if and only
if their joint pdf is:

f (x1 , x2 , . . . , xn ) = f1 (x1 ) f2 (x2 ) · · · fn (xn )

for all x1 , x2 , . . . , xn , where f1 (x1 ), f2 (x2 ), . . . , fn (xn ) are the univariate marginal pdfs
of X1 , X2 , . . . , Xn , respectively.

If two random variables are independent, they are also uncorrelated, i.e. we have:
Cov(X, Y ) = 0 and Corr(X, Y ) = 0.
This will be proved later.
The reverse is not true, i.e. two random variables can be dependent even when their
correlation is 0. This can happen when the dependence is non-linear.

Example 5.12 The football example is an instance of this. The conditional


distributions pY |X (y | x) are clearly not all the same, but the correlation is very
nearly 0 (see Example 5.10).
Another example is plot (f) in Figure 5.6 (on the previous page), where the
dependence is not linear, but quadratic.

5.9.1 Joint distribution of independent random variables


When random variables are independent, we can easily derive their joint pf or pdf as
the product of their univariate marginal distributions. This is particularly simple if all
the marginal distributions are the same.

Example 5.13 Suppose that X1, X2, . . . , Xn are independent, and each of them follows the Poisson distribution with the same mean λ. Therefore, the marginal pf of each Xi is:

p(xi) = e^{−λ} λ^{xi} / xi!

and the joint pf of the random variables is:

p(x1, x2, . . . , xn) = p(x1) p(x2) · · · p(xn) = Π_{i=1}^{n} e^{−λ} λ^{xi} / xi! = e^{−nλ} λ^{Σi xi} / Πi xi!.


Example 5.14 For a continuous example, suppose that X1, X2, . . . , Xn are independent, and each of them follows a normal distribution with the same mean µ and the same variance σ². Therefore, the marginal pdf of each Xi is:

f(xi) = (1/√(2πσ²)) exp(−(xi − µ)²/(2σ²))

and the joint pdf of the variables is:

f(x1, x2, . . . , xn) = f(x1) f(x2) · · · f(xn) = Π_{i=1}^{n} f(xi)
                     = Π_{i=1}^{n} (1/√(2πσ²)) exp(−(xi − µ)²/(2σ²))
                     = (1/(√(2πσ²))^n) exp(−(1/(2σ²)) Σ_{i=1}^{n} (xi − µ)²).
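A quick numerical check of Example 5.14: for any set of values, the product of the univariate normal pdfs agrees with the single closed-form expression. The Python sketch below uses scipy purely for illustration; the sample values and parameter values are arbitrary.

    import numpy as np
    from scipy.stats import norm

    mu, sigma = 5.0, 2.0
    x = np.array([4.2, 5.7, 3.9, 6.1, 5.0])     # an arbitrary set of n = 5 values
    n = len(x)

    # Joint pdf as the product of the univariate marginal pdfs
    joint_as_product = np.prod(norm.pdf(x, loc=mu, scale=sigma))

    # Joint pdf from the simplified closed-form expression
    joint_closed_form = (2 * np.pi * sigma**2) ** (-n / 2) * \
                        np.exp(-np.sum((x - mu) ** 2) / (2 * sigma**2))

    print(joint_as_product, joint_closed_form)  # equal up to rounding error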

5.10 Sums and products of random variables


Suppose X1 , X2 , . . . , Xn are random variables. We now go from the multivariate setting
back to the univariate setting, by considering univariate functions of X1 , X2 , . . . , Xn . In
particular, we consider sums and products like:
Σ_{i=1}^{n} ai Xi + b = a1 X1 + a2 X2 + · · · + an Xn + b     (5.2)

and:

Π_{i=1}^{n} ai Xi = (a1 X1)(a2 X2) · · · (an Xn)
where a1 , a2 , . . . , an and b are constants.
Each such sum or product is itself a univariate random variable. The probability
distribution of such a function depends on the joint distribution of X1 , X2 , . . . , Xn .

Example 5.15 In the football example, the sum Z = X + Y is the total number of
goals scored in a match.
Its probability function is obtained from the joint pf pX,Y (x, y), that is:

Z=z 0 1 2 3 4 5 6
pZ (z) 0.100 0.131 0.270 0.293 0.138 0.062 0.006

For example, pZ(1) = pX,Y(0, 1) + pX,Y(1, 0) = 0.031 + 0.100 = 0.131. The mean of Z is then E(Z) = Σ_z z pZ(z) = 2.448.

Another example is the distribution of XY (see Example 5.10).


However, what can we say about such distributions in general, in cases where we cannot
derive them as easily?

5.10.1 Distributions of sums and products


General results for the distributions of sums and products of random variables are
available as follows:

                       Sums                              Products
Mean                   Yes                               Only for independent variables
Variance               Yes                               No
Distributional form    Normal: Yes                       No
                       Some other distributions:
                       only for independent variables

5.10.2 Expected values and variances of sums of random variables
We state, without proof, the following important result.
If X1 , X2 , . . . , Xn are random variables with means E(X1 ), E(X2 ), . . . , E(Xn ),
respectively, and a1 , a2 , . . . , an and b are constants, then:
E(Σ_{i=1}^{n} ai Xi + b) = E(a1 X1 + a2 X2 + · · · + an Xn + b)
                         = a1 E(X1) + a2 E(X2) + · · · + an E(Xn) + b
                         = Σ_{i=1}^{n} ai E(Xi) + b.     (5.3)

Two simple special cases of this, when n = 2, are:

E(X + Y) = E(X) + E(Y), obtained by choosing X1 = X, X2 = Y, a1 = a2 = 1 and b = 0

E(X − Y) = E(X) − E(Y), obtained by choosing X1 = X, X2 = Y, a1 = 1, a2 = −1 and b = 0.

Example 5.16 In the football example, we have previously shown that E(X) = 1.383, E(Y) = 1.065 and E(X + Y) = 2.448. So E(X + Y) = E(X) + E(Y), as the theorem claims.

If X1, X2, . . . , Xn are random variables with variances Var(X1), Var(X2), . . . , Var(Xn), respectively, and covariances Cov(Xi, Xj) for i ≠ j, and a1, a2, . . . , an and b are constants, then:

Var(Σ_{i=1}^{n} ai Xi + b) = Σ_{i=1}^{n} ai² Var(Xi) + 2 Σ_{i<j} ai aj Cov(Xi, Xj).     (5.4)

In particular, for n = 2:

Var(X + Y ) = Var(X) + Var(Y ) + 2 × Cov(X, Y )

Var(X − Y ) = Var(X) + Var(Y ) − 2 × Cov(X, Y ).

If X1, X2, . . . , Xn are independent random variables, then Cov(Xi, Xj) = 0 for all i ≠ j, and so (5.4) simplifies to:

Var(Σ_{i=1}^{n} ai Xi) = Σ_{i=1}^{n} ai² Var(Xi).     (5.5)

In particular, for n = 2, when X and Y are independent:

Var(X + Y ) = Var(X) + Var(Y )

Var(X − Y ) = Var(X) + Var(Y ).

These results also hold whenever Cov(Xi, Xj) = 0 for all i ≠ j, even if the random variables are not independent.

5.10.3 Expected values of products of independent random variables

If X1, X2, . . . , Xn are independent random variables and a1, a2, . . . , an are constants, then:

E(Π_{i=1}^{n} ai Xi) = E((a1 X1)(a2 X2) · · · (an Xn)) = Π_{i=1}^{n} ai E(Xi).

In particular, when X and Y are independent:

E(XY ) = E(X) E(Y ).

There is no corresponding simple result for the means of products of dependent random
variables. There is also no simple result for the variances of products of random
variables, even when they are independent.

5.10.4 Some proofs of previous results


With these new results, we can now prove some results which were stated earlier.
Recall:
Var(X) = E(X 2 ) − (E(X))2 .


Proof:

Var(X) = E((X − E(X))2 )


= E(X 2 − 2E(X)X + (E(X))2 )
= E(X 2 ) − 2 E(X) E(X) + (E(X))2
= E(X 2 ) − 2(E(X))2 + (E(X))2
= E(X 2 ) − (E(X))2

using (5.3), with X1 = X 2 , X2 = X, a1 = 1, a2 = −2E(X) and b = (E(X))2 .




Recall:
Cov(X, Y ) = E(XY ) − E(X) E(Y ).
Proof:

Cov(X, Y ) = E((X − E(X))(Y − E(Y )))


= E(XY − E(Y )X − E(X)Y + E(X) E(Y ))
= E(XY ) − E(Y ) E(X) − E(X) E(Y ) + E(X) E(Y )
= E(XY ) − E(X) E(Y )

using (5.3), with X1 = XY , X2 = X, X3 = Y , a1 = 1, a2 = −E(Y ), a3 = −E(X) and


b = E(X) E(Y ).


Recall that if X and Y are independent, then:

Cov(X, Y ) = Corr(X, Y ) = 0.

Proof:

Cov(X, Y ) = E(XY ) − E(X) E(Y ) = E(X) E(Y ) − E(X) E(Y ) = 0

since E(XY ) = E(X) E(Y ) when X and Y are independent.


Since Corr(X, Y ) = Cov(X, Y )/[sd(X) sd(Y )], Corr(X, Y ) = 0 whenever Cov(X, Y ) = 0.


5.10.5 Distributions of sums of random variables


We now know the expected value and variance of the sum:

a1 X1 + a2 X2 + · · · + an Xn + b

whatever the joint distribution of X1 , X2 , . . . , Xn . This is usually all we can say about
the distribution of this sum.


In particular, the form of the distribution of the sum (i.e. its pf/pdf) depends on the
joint distribution of X1 , X2 , . . . , Xn , and there are no simple general results about that.
For example, even if X and Y have distributions from the same family, the distribution
of X + Y is often not from that same family. However, such results are available for a
few special cases.

Sums of independent binomial and Poisson random variables

Suppose X1 , X2 , . . . , Xn are random variables, and we consider the unweighted sum:


Σ_{i=1}^{n} Xi = X1 + X2 + · · · + Xn.

That is, the general sum given by (5.2), with a1 = a2 = · · · = an = 1 and b = 0.


The following results hold when the random variables X1 , X2 , . . . , Xn are independent,
but not otherwise.
If Xi ∼ Bin(ni, π), then Σi Xi ∼ Bin(Σi ni, π).

If Xi ∼ Pois(λi), then Σi Xi ∼ Pois(Σi λi); a numerical check of this result is sketched below.
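The Poisson result, for instance, can be checked numerically: convolving the two marginal pfs gives the pf of X1 + X2, which matches the Pois(λ1 + λ2) pf. A Python sketch (the values of λ1 and λ2 are arbitrary, and the code is illustrative only):

    import numpy as np
    from scipy.stats import poisson

    lam1, lam2 = 2.0, 3.5
    k = np.arange(60)                          # grid wide enough to hold almost all probability

    p1 = poisson.pmf(k, lam1)
    p2 = poisson.pmf(k, lam2)

    # pf of X1 + X2 for independent X1, X2, by direct convolution of the two pfs
    p_sum = np.convolve(p1, p2)[:len(k)]

    # pf of Pois(lam1 + lam2)
    p_theory = poisson.pmf(k, lam1 + lam2)

    print(np.max(np.abs(p_sum - p_theory)))    # essentially zero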

Application to the binomial distribution

An easy proof that the mean and variance of X ∼ Bin(n, π) are E(X) = nπ and
Var(X) = nπ(1 − π) is as follows.

1. Let Z1, Z2, . . . , Zn be independent random variables, each distributed as Zi ∼ Bernoulli(π) = Bin(1, π).

2. It is easy to show that E(Zi) = π and Var(Zi) = π(1 − π) for each i = 1, 2, . . . , n (see (4.3) and (4.4)).

3. Also, Σ_{i=1}^{n} Zi = X ∼ Bin(n, π) by the result above for sums of independent binomial random variables.

4. Therefore, using the results (5.3) and (5.5), we have:

   E(X) = Σ_{i=1}^{n} E(Zi) = nπ   and   Var(X) = Σ_{i=1}^{n} Var(Zi) = nπ(1 − π).

Sums of normally distributed random variables

All sums (linear combinations) of normally distributed random variables are also
normally distributed.
Suppose X1 , X2 , . . . , Xn are normally distributed random variables, with Xi ∼ N (µi , σi2 )
for i = 1, 2, . . . , n, and a1 , a2 , . . . , an and b are constants, then:
Σ_{i=1}^{n} ai Xi + b ∼ N(µ, σ²)

where:

µ = Σ_{i=1}^{n} ai µi + b   and   σ² = Σ_{i=1}^{n} ai² σi² + 2 Σ_{i<j} ai aj Cov(Xi, Xj).

If the Xi s are independent (or just uncorrelated), i.e. if Cov(Xi, Xj) = 0 for all i ≠ j, the variance simplifies to σ² = Σ_{i=1}^{n} ai² σi².

Example 5.17 Suppose that in the population of English people aged 16 or over:

the heights of men (in cm) follow a normal distribution with mean 174.9 and
standard deviation 7.39

the heights of women (in cm) follow a normal distribution with mean 161.3 and
standard deviation 6.85.
Suppose we select one man and one woman at random and independently of each
other. Denote the man’s height by X and the woman’s height by Y . What is the
probability that the man is at most 10 cm taller than the woman?
In other words, what is the probability that the difference between X and Y is at
most 10?
Since X and Y are independent we have:

D = X − Y ∼ N(µX − µY, σX² + σY²) = N(174.9 − 161.3, (7.39)² + (6.85)²) = N(13.6, (10.08)²).

The probability we need is:


 
P(D ≤ 10) = P((D − 13.6)/10.08 ≤ (10 − 13.6)/10.08)
          = P(Z ≤ −0.36)
          = P(Z ≥ 0.36)
          = 0.3594

using Table 4 of the New Cambridge Statistical Tables.


The probability that a randomly selected man is at most 10 cm taller than a
randomly selected woman is about 0.3594.
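The probability in Example 5.17 can also be obtained with statistical software instead of tables. A minimal Python sketch using scipy (illustrative only):

    from math import sqrt
    from scipy.stats import norm

    mu_D = 174.9 - 161.3                 # mean of D = X - Y
    sd_D = sqrt(7.39**2 + 6.85**2)       # approximately 10.08, since X and Y are independent

    # P(D <= 10), which agrees with the table-based value 0.3594 up to rounding
    print(norm.cdf(10, loc=mu_D, scale=sd_D))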

5.11 Overview of chapter


This chapter has introduced how to deal with more than one random variable at a time.
Focusing mainly on discrete bivariate distributions, the relationships between joint,
marginal and conditional distributions were explored. Sums and products of random
variables concluded the chapter.


5.12 Key terms and concepts


Association Bivariate
Conditional distribution Conditional mean
Conditional variance Correlation
Covariance Dependence
Independence Joint probability distribution
Joint probability (density) function Marginal distribution
Multivariate Random vector
Uncorrelated Univariate

5.13 Sample examination questions


1. Consider two random variables, X and Y . They both take the values −1, 0 and 1.
The joint probabilities for each pair of values, (x, y), are given in the following
table.

X = −1 X=0 X=1
Y = −1 0.09 0.16 0.15
Y =0 0.09 0.08 0.03
Y =1 0.12 0.16 0.12

(a) Determine the marginal distributions and calculate the expected values of X
and Y , respectively.

(b) Calculate the covariance of the random variables X and Y .

(c) Calculate E(X | Y = 0) and E(X | X + Y = 1).

(d) Define U = |X| and V = Y . Calculate E(U ) and the covariance of U and V .
Are U and V correlated?

2. Suppose X and Y are two independent random variables with the following
probability distributions:

X = x        −1     0      1                  Y = y        −1     0      1
P(X = x)     0.30   0.40   0.30      and      P(Y = y)     0.40   0.20   0.40

The random variables S and T are defined as:

S = X2 + Y 2 and T = X + Y.

(a) Construct the table of the joint probability distribution of S and T .

(b) Calculate the following quantities:


i. Var(T ), given that E(T ) = 0.


ii. Cov(S, T ).
iii. E(S | T = 0).

(c) Are S and T uncorrelated? Are S and T independent? Justify your answers.

5.14 Solutions to Sample examination questions


1. (a) The marginal distribution of X is:
X −1 0 1
pX (x) 0.30 0.40 0.30
The marginal distribution of Y is:
Y −1 0 1
pY (y) 0.40 0.20 0.40
Hence:

E(X) = Σ_x x pX(x) = (−1 × 0.30) + (0 × 0.40) + (1 × 0.30) = 0

and:

E(Y) = Σ_y y pY(y) = (−1 × 0.40) + (0 × 0.20) + (1 × 0.40) = 0.

(b) We have:

E(XY ) = (−1 × −1 × 0.09) + (−1 × 1 × 0.12) + (1 × −1 × 0.15) + (1 × 1 × 0.12)


= 0.09 − 0.12 − 0.15 + 0.12
= −0.06.

Therefore:

Cov(X, Y ) = E(XY ) − E(X) E(Y ) = −0.06 − 0 × 0 = −0.06.

(c) We have P (Y = 0) = 0.09 + 0.08 + 0.03 = 0.20, hence:

P(X = −1 | Y = 0) = 0.09/0.20 = 0.45

P(X = 0 | Y = 0) = 0.08/0.20 = 0.40

P(X = 1 | Y = 0) = 0.03/0.20 = 0.15


and therefore:

E(X | Y = 0) = −1 × 0.45 + 0 × 0.4 + 1 × 0.15 = −0.30.

We also have P (X + Y = 1) = 0.16 + 0.03 = 0.19, hence:

P(X = 0 | X + Y = 1) = 0.16/0.19 = 16/19

P(X = 1 | X + Y = 1) = 0.03/0.19 = 3/19

and therefore:

E(X | X + Y = 1) = 0 × 16/19 + 1 × 3/19 = 3/19 = 0.1579.

(d) Here is the table of joint probabilities:

U =0 U =1
V = −1 0.16 0.24
V =0 0.08 0.12
V =1 0.16 0.24

We then have that P (U = 0) = 0.16 + 0.08 + 0.16 = 0.40 and also that
P (U = 1) = 1 − P (U = 0) = 0.60. Also, we have that P (V = −1) = 0.40,
P (V = 0) = 0.20 and P (V = 1) = 0.40. So:

E(U ) = 0 × 0.40 + 1 × 0.60 = 0.60

E(V ) = −1 × 0.40 + 0 × 0.20 + 1 × 0.40 = 0

and:
E(U V ) = −1 × 1 × 0.24 + 1 × 1 × 0.24 = 0.
Hence Cov(U, V ) = E(U V ) − E(U )E(V ) = 0 − 0.60 × 0 = 0. Since the
covariance is zero, so is the correlation coefficient, therefore U and V are
uncorrelated.

2. (a) The joint probability distribution of S and T is:

                    S
               0      1      2
T = −2         0      0      0.12
T = −1         0      0.22   0
T = 0          0.08   0      0.24
T = 1          0      0.22   0
T = 2          0      0      0.12


(b) i. Since E(T ) = 0, we have:

Var(T) = E(T²) = Σ_{t=−2}^{2} t² p(t)
       = (−2)² × 0.12 + (−1)² × 0.22 + 0² × 0.32 + 1² × 0.22 + 2² × 0.12
       = 1.4.

ii. We have that:

E(ST) = Σ_{s=0}^{2} Σ_{t=−2}^{2} st p(s, t) = (−4 × 0.12) + (−1 × 0.22) + (1 × 0.22) + (4 × 0.12) = 0.

Since E(T ) = 0, then:

Cov(S, T ) = E(ST ) − E(S) E(T ) = E(ST ) = 0.

iii. We have:

E(S | T = 0) = Σ_{s=0}^{2} s pS|T(s | t = 0) = 0 × (0.08/0.32) + 2 × (0.24/0.32) = 1.5.

(c) The random variables S and T are uncorrelated, since Cov(S, T ) = 0. However:

P (T = −2) = 0.12 and P (S = 0) = 0.08 ⇒ P (T = −2) P (S = 0) = 0.0096

but:
P({T = −2} ∩ {S = 0}) = 0 ≠ P(T = −2) P(S = 0)
which is sufficient to show that S and T are not independent.

Statistics: the mathematical theory of ignorance.


(Morris Kline)

Chapter 6
Sampling distributions of statistics

6.1 Synopsis of chapter


This chapter considers the idea of sampling and the concept of a sampling distribution
for a statistic (such as a sample mean) which must be understood by all users of
statistics.

6.2 Learning outcomes


After completing this chapter, you should be able to:

demonstrate how sampling from a population results in a sampling distribution for


a statistic

prove and apply the results for the mean and variance of the sampling distribution
of the sample mean when a random sample is drawn with replacement

state the central limit theorem and recall when the limit is likely to provide a good
approximation to the distribution of the sample mean.

6.3 Introduction
Suppose we have a sample of n observations of a random variable X:

{X1 , X2 , . . . , Xn }.

We have already stated that in statistical inference each individual observation Xi is


regarded as a value of a random variable X, with some probability distribution (that is,
the population distribution).
In this chapter we discuss how we define and work with:

the joint distribution of the whole sample {X1 , X2 , . . . , Xn }, treated as a


multivariate random variable

distributions of univariate functions of {X1 , X2 , . . . , Xn } (statistics).


6.4 Random samples


Many of the results discussed here hold for many (or even all) probability distributions,
not just for some specific distributions.
It is then convenient to use generic notation.

We use f (x) to denote both the pdf of a continuous random variable, and the pf of
a discrete random variable.

The parameter(s) of a distribution are generally denoted as θ. For example, for the
Poisson distribution θ stands for λ, and for the normal distribution θ stands for
(µ, σ 2 ).

Parameters are often included in the notation: f (x; θ) denotes the pf/pdf of a
distribution with parameter(s) θ, and F (x; θ) is its cdf.

For simplicity, we may often use phrases like ‘distribution f (x; θ)’ or ‘distribution
F (x; θ)’ when we mean ‘distribution with the pf/pdf f (x; θ)’ and ‘distribution with the
cdf F (x; θ)’, respectively.
The simplest assumptions about the joint distribution of the sample are as follows.

1. {X1 , X2 , . . . , Xn } are independent random variables.

2. {X1 , X2 , . . . , Xn } are identically distributed random variables. Each Xi has the


same distribution f (x; θ), with the same value of the parameter(s) θ.

The random variables {X1 , X2 , . . . , Xn } are then called:

independent and identically distributed (IID) random variables from the


distribution (population) f (x; θ)

a random sample of size n from the distribution (population) f (x; θ).

We will assume this most of the time from now. So you will see many examples and
questions which begin something like:

‘Let {X1 , X2 , . . . , Xn } be a random sample from a normal distribution with


mean µ and variance σ 2 . . .’

6.4.1 Joint distribution of a random sample


The joint probability distribution of the random variables in a random sample is an
important quantity in statistical inference. It is known as the likelihood function.
You will hear more about it in the chapter on point estimation.
For a random sample the joint distribution is easy to derive, because the Xi s are
independent.


The joint pf/pdf of a random sample is:

f(x1, x2, . . . , xn) = f(x1; θ) f(x2; θ) · · · f(xn; θ) = Π_{i=1}^{n} f(xi; θ).

Other assumptions about random samples

Not all problems can be seen as IID random samples of a single random variable. There
are other possibilities, which you will see more of in the future.

IID samples from multivariate population distributions. For example, a sample of (Xi, Yi), with the joint distribution Π_{i=1}^{n} f(xi, yi).

Independent but not identically distributed observations. For example, observations (Xi, Yi) where Yi (the 'response variable') is treated as random, but Xi (the 'explanatory variable') is not. Hence the joint distribution of the Yi s is Π_{i=1}^{n} fY|X(yi | xi; θ), where fY|X(y | x; θ) is the conditional distribution of Y given X. This is the starting point of regression modelling (introduced later in the course).

Non-independent observations. For example, a time series {Y1 , Y2 , . . . , YT } where


i = 1, 2, . . . , T are successive time points. The joint distribution of the series is, in
general:

f (y1 ; θ) f (y2 | y1 ; θ) f (y3 | y1 , y2 ; θ) · · · f (yT | y1 , y2 , . . . , yT −1 ; θ).

Random samples and their observed values

Here we treat {X1 , X2 , . . . , Xn } as random variables. Therefore, we consider what values


{X1 , X2 , . . . , Xn } might have in different samples.
Once a real sample is actually observed, the values of {X1 , X2 , . . . , Xn } in that specific
sample are no longer random variables, but realised values of random variables, i.e.
known numbers.
Sometimes this distinction is emphasised in the notation by using:

X1 , X2 , . . . , Xn for the random variables

x1 , x2 , . . . , xn for the observed values.

6.5 Statistics and their sampling distributions


A statistic is a known function of the random variables {X1 , X2 , . . . , Xn } in a random
sample.


Example 6.1 All of the following are statistics:

the sample mean X̄ = Σ_{i=1}^{n} Xi / n

the sample variance S² = Σ_{i=1}^{n} (Xi − X̄)² / (n − 1) and the standard deviation S = √S²

the sample median, quartiles, minimum, maximum etc.

quantities such as:

Σ_{i=1}^{n} Xi²   and   X̄ / (S/√n).

Here we focus on single (univariate) statistics. More generally, we could also consider
vectors of statistics, i.e. multivariate statistics.

6.5.1 Sampling distribution of a statistic

A (simple) random sample is modelled as a sequence of IID random variables. A


statistic is a function of these random variables, so it is also a random variable, with a
distribution of its own.
In other words, if we collected several random samples from the same population, the
values of a statistic would not be the same from one sample to the next, but would vary
according to some probability distribution.
The sampling distribution is the probability distribution of the values which the
statistic would have in a large number of samples collected (independently) from the
same population.

Example 6.2 Suppose we collect a random sample of size n = 20 from a normal


population (distribution) X ∼ N (5, 1).
Consider the following statistics:

sample mean X̄, sample variance S 2 , and maxX = max(X1 , X2 , . . . , Xn ).

Here is one such random sample (with values rounded to 2 decimal places):
6.28 5.22 4.19 3.56 4.15 4.11 4.03 5.81 5.43 6.09
4.98 4.11 5.55 3.95 4.97 5.68 5.66 3.37 4.98 6.58
For this random sample, the values of our statistics are:

x̄ = 4.94

s2 = 0.90

maxx = 6.58.


Here is another such random sample (with values rounded to 2 decimal places):
5.44 6.14 4.91 5.63 3.89 4.17 5.79 5.33 5.09 3.90
5.47 6.62 6.43 5.84 6.19 5.63 3.61 5.49 4.55 4.27
For this sample, the values of our statistics are:

x̄ = 5.22 (the first sample had x̄ = 4.94)

s2 = 0.80 (the first sample had s2 = 0.90)

maxx = 6.62 (the first sample had maxx = 6.58).

How to derive a sampling distribution?

The sampling distribution of a statistic is the distribution of the values of the statistic
in (infinitely) many repeated samples. However, typically we only have one sample
which was actually observed. Therefore, the sampling distribution seems like an
essentially hypothetical concept.
Nevertheless, it is possible to derive the forms of sampling distributions of statistics
under different assumptions about the sampling schemes and population distribution
f (x; θ).
There are two main ways of doing this.

Exactly or approximately through mathematical derivation. This is the most


convenient way for subsequent use, but is not always easy.
With simulation, i.e. by using a computer to generate (artificial) random samples
from a population distribution of a known form.

Example 6.3 Consider again a random sample of size n = 20 from the population
X ∼ N (5, 1), and the statistics X̄, S 2 and maxX .

We first consider deriving the sampling distributions of these by approximation


through simulation.

Here a computer was used to draw 10,000 independent random samples of


n = 20 from N (5, 1), and the values of X̄, S 2 and maxX for each of these
random samples were recorded.

Figures 6.1, 6.2 and 6.3 (the latter two figures appear on p.169) show
histograms of the statistics for these 10,000 random samples.

We now consider deriving the exact sampling distribution. Here this is possible. For
a random sample of size n from N (µ, σ 2 ) we have:

(a) X̄ ∼ N(µ, σ²/n)

(b) (n − 1)S²/σ² ∼ χ²_{n−1}

(c) the sampling distribution of Y = maxX has the following pdf:

fY(y) = n (FX(y))^{n−1} fX(y)

where FX(x) and fX(x) are the cdf and pdf of X ∼ N(µ, σ²), respectively.

Curves of the densities of these distributions are also shown in Figures 6.1, 6.2 and
6.3.
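A simulation of the kind described here takes only a few lines of code. The sketch below (Python with numpy; illustrative only, as the guide does not prescribe any particular software, and the seed is chosen arbitrarily) draws 10,000 samples of size n = 20 from N(5, 1) and records the three statistics.

    import numpy as np

    rng = np.random.default_rng(seed=1)          # arbitrary seed, for reproducibility
    n, reps = 20, 10_000

    samples = rng.normal(loc=5, scale=1, size=(reps, n))

    xbar = samples.mean(axis=1)                  # 10,000 sample means
    s2 = samples.var(axis=1, ddof=1)             # 10,000 sample variances (divisor n - 1)
    xmax = samples.max(axis=1)                   # 10,000 sample maxima

    # The simulated mean and variance of the sample mean should be close to the theory: 5 and 1/20
    print(xbar.mean(), xbar.var())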

Figure 6.1: Simulation-generated sampling distribution of X̄ (the sample mean) to accompany Example 6.3.

6.6 Sample mean from a normal population


Consider one very common statistic, the sample mean:
X̄ = (1/n) Σ_{i=1}^{n} Xi = (1/n) X1 + (1/n) X2 + · · · + (1/n) Xn.

What is the sampling distribution of X̄?


We know from Section 5.10.2 that for independent {X1, X2, . . . , Xn} from any distribution:

E(Σ_{i=1}^{n} ai Xi) = Σ_{i=1}^{n} ai E(Xi)

and:

Var(Σ_{i=1}^{n} ai Xi) = Σ_{i=1}^{n} ai² Var(Xi).


Figure 6.2: Simulation-generated sampling distribution of S² to accompany Example 6.3.

Figure 6.3: Simulation-generated sampling distribution of maxX to accompany Example 6.3.


For a random sample, all Xi s are independent and E(Xi) = E(X) is the same for all of them, since the Xi s are identically distributed. X̄ = Σi Xi / n is of the form Σi ai Xi, with ai = 1/n for all i = 1, 2, . . . , n.

Therefore:

E(X̄) = Σ_{i=1}^{n} (1/n) E(X) = n × (1/n) E(X) = E(X)

and:

Var(X̄) = Σ_{i=1}^{n} (1/n²) Var(X) = n × (1/n²) Var(X) = Var(X)/n.
So the mean and variance of X̄ are E(X) and Var(X)/n, respectively, for a random
sample from any population distribution of X. What about the form of the sampling
distribution of X̄?
This depends on the distribution of X, and is not generally known. However, when the
distribution of X is normal, we do know that the sampling distribution of X̄ is also
normal.
Suppose that {X1 , X2 , . . . , Xn } is a random sample from a normal distribution with
mean µ and variance σ 2 , then:

X̄ ∼ N(µ, σ²/n).

For example, the pdf drawn on the histogram in Figure 6.1 (p.168) is that of N (5, 1/20).
We have E(X̄) = E(X) = µ.

In an individual sample, x̄ is not usually equal to µ, the expected value of the


population.

However, over repeated samples the values of X̄ are centred at µ.



We also have Var(X̄) = Var(X)/n = σ 2 /n, and hence also sd(X̄) = σ/ n.

The variation of the values of X̄ in different samples (the sampling variance) is


large when the population variance of X is large.

More interestingly, the sampling variance gets smaller when the sample size n
increases.

In other words, when n is large the distribution of X̄ is more tightly concentrated


around µ than when n is small.

Figure 6.4 (on the next page) shows sampling distributions of X̄ from N (5, 1) for
different n.

Example 6.4 Suppose that the heights (in cm) of men (aged over 16) in a
population follow a normal distribution with some unknown mean µ and a known
standard deviation of 7.39.


Figure 6.4: Sampling distributions of X̄ from N(5, 1) for different n (n = 5, 20, 100).

We plan to select a random sample of n men from the population, and measure their
heights. How large should n be so that there is a probability of at least 0.95 that the
sample mean X̄ will be within 1 cm of the population mean µ?

Here X ∼ N (µ, (7.39)2 ), so X̄ ∼ N (µ, (7.39/ n)2 ). What we need is the smallest n
such that:
P (|X̄ − µ| ≤ 1) ≥ 0.95.
So:
P (|X̄ − µ| ≤ 1) ≥ 0.95
P (−1 ≤ X̄ − µ ≤ 1) ≥ 0.95
 
−1 X̄ − µ 1
P √ ≤ √ ≤ √ ≥ 0.95
7.39/ n 7.39/ n 7.39/ n
 √ √ 
n n
P − ≤Z≤ ≥ 0.95
7.39 7.39
 √ 
n 0.05
P Z> < = 0.025
7.39 2
where Z ∼ N (0, 1). From Table 4 of the New Cambridge Statistical Tables, we see
that the smallest z which satisfies P (Z > z) < 0.025 is z = 1.97. Therefore:

n
≥ 1.97 ⇔ n ≥ (7.39 × 1.97)2 = 211.9.
7.39
Therefore, n should be at least 212.


6.7 The central limit theorem


We have discussed the very convenient result that if a random sample comes from a
normally-distributed population, the sampling distribution of X̄ is also normal. How
about sampling distributions of X̄ from other populations?
For this, we can use a remarkable mathematical result, the central limit theorem
(CLT). In essence, the CLT states that the normal sampling distribution of X̄ which
holds exactly for random samples from a normal distribution, also holds approximately
for random samples from nearly any distribution.
The CLT applies to ‘nearly any’ distribution because it requires that the variance of the
population distribution is finite. If it is not (such as for some Pareto distributions,
introduced in Chapter 3), the CLT does not hold. However, such distributions are not
common.
Suppose that {X1 , X2 , . . . , Xn } is a random sample from a population distribution
which has mean E(Xi ) = µ < ∞ and variance Var(Xi ) = σ 2 < ∞, that is with a finite
mean and finite variance. Let X̄n denote the sample mean calculated from a random
sample of size n, then:
lim_{n→∞} P((X̄n − µ)/(σ/√n) ≤ z) = Φ(z)

for any z, where Φ(z) denotes the cdf of the standard normal distribution.

The 'lim_{n→∞}' indicates that this is an asymptotic result, i.e. one which holds increasingly well as n increases, and exactly when the sample size is infinite.
The full proof of the CLT is not straightforward and beyond the scope of this course.
In less formal language, the CLT says that for a random sample from nearly any
distribution with mean µ and variance σ 2 then:
X̄ ∼ N(µ, σ²/n)
approximately, when n is sufficiently large. We can then say that X̄ is asymptotically
normally distributed with mean µ and variance σ 2 /n.

The wide reach of the CLT

It may appear that the CLT is still somewhat limited, in that it applies only to sample
means calculated from random (IID) samples. However, this is not really true, for two
main reasons.

There are more general versions of the CLT which do not require the observations
Xi to be IID.
Even the basic version applies very widely, when we realise that the ‘X’ can also be
a function of the original variables in the data. For example, if X and Y are
random variables in the sample, we can also apply the CLT to:
Σ_{i=1}^{n} ln(Xi)/n   or   Σ_{i=1}^{n} Xi Yi / n.


Therefore, the CLT can also be used to derive sampling distributions for many statistics
which do not initially look at all like X̄ for a single random variable in an IID sample.
You may get to do this in future courses.

How large is ‘large n’?

The larger the sample size n, the better the normal approximation provided by the CLT
is. In practice, we have various rules-of-thumb for what is ‘large enough’ for the
approximation to be ‘accurate enough’. This also depends on the population
distribution of Xi . For example:

for symmetric distributions, even small n is enough

for very skewed distributions, larger n is required.

For many distributions, n > 30 is sufficient for the approximation to be reasonably


accurate.

Example 6.5 In the first case, we simulate random samples of sizes:

n = 1, 5, 10, 30, 100 and 1,000

from the Exp(0.25) distribution (for which µ = 4 and σ 2 = 16). This is clearly a
skewed distribution, as shown by the histogram for n = 1 in Figure 6.5 (on the next
page).
10,000 independent random samples of each size were generated. Histograms of the
values of X̄ in these random samples are shown in Figure 6.5. Each plot also shows
the pdf of the approximating normal distribution, N (4, 16/n). The normal
approximation is reasonably good already for n = 30, very good for n = 100, and
practically perfect for n = 1,000.
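Simulations like those in Example 6.5 can be reproduced along the following lines (an illustrative Python sketch; the seed is arbitrary):

    import numpy as np

    rng = np.random.default_rng(seed=2)
    reps = 10_000

    for n in (1, 5, 10, 30, 100, 1_000):
        # Exp(0.25) has mean 4 and variance 16; numpy's 'scale' is 1/rate = 4
        xbar = rng.exponential(scale=4, size=(reps, n)).mean(axis=1)
        # By the CLT, xbar is approximately N(4, 16/n) for large n
        print(n, round(xbar.mean(), 3), round(xbar.var(), 3), 16 / n)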

Example 6.6 In the second case, we simulate 10,000 independent random samples
of sizes:
n = 1, 10, 30, 50, 100 and 1,000
from the Bernoulli(0.2) distribution (for which µ = 0.2 and σ 2 = 0.16).
Here the distribution of Xi itself is not even continuous, and has only two possible
values, 0 and 1. Nevertheless, the sampling distribution of X̄ can be very
well-approximated by the normal distribution, when n is large enough.
Note that since here Xi = 1 or Xi = 0 for all i, X̄ = Σ_{i=1}^{n} Xi / n = m/n, where m is the
number of observations for which Xi = 1. In other words, X̄ is the sample
proportion of the value X = 1.
The normal approximation is clearly very bad for small n, but reasonably good
already for n = 50, as shown by the histograms in Figure 6.6 (p.175).


Figure 6.5: Sampling distributions of X̄ for various n (n = 1, 5, 10, 30, 100, 1,000) when sampling from the Exp(0.25) distribution.

6.8 Some common sampling distributions


In the remaining chapters, we will make use of results like the following.
Suppose that {X1 , X2 , . . . , Xn } and {Y1 , Y2 , . . . , Ym } are two independent random
samples from N (µ, σ 2 ), then:
(n − 1)SX²/σ² ∼ χ²_{n−1}   and   (m − 1)SY²/σ² ∼ χ²_{m−1}

√((n + m − 2)/(1/n + 1/m)) × (X̄ − Ȳ)/√((n − 1)SX² + (m − 1)SY²) ∼ t_{n+m−2}

and:

SX²/SY² ∼ F_{n−1, m−1}.

174
6.8. Some common sampling distributions

Figure 6.6: Sampling distributions of X̄ for various n (n = 1, 10, 30, 50, 100, 1,000) when sampling from the Bernoulli(0.2) distribution.

Here ‘χ2 ’, ‘t’ and ‘F ’ refer to three new families of probability distributions:

the χ2 (‘chi-squared’) distribution

the t distribution

the F distribution.

These are not often used as distributions of individual variables. Instead, they are used
as sampling distributions for various statistics. Each of them arises from the normal
distribution in a particular way. We will now briefly introduce their main properties.
This is in preparation for statistical inference, where the uses of these distributions will
be discussed at length.


6.8.1 The χ2 distribution

Definition of the χ2 distribution

Let Z1, Z2, . . . , Zk be independent N(0, 1) random variables. If:

X = Z1² + Z2² + · · · + Zk² = Σ_{i=1}^{k} Zi²

the distribution of X is the χ² distribution with k degrees of freedom. This is denoted by X ∼ χ²(k) or X ∼ χ²_k.

The χ2k distribution is a continuous distribution, which can take values of x ≥ 0. Its
mean and variance are:

E(X) = k

Var(X) = 2k.

For reference, the probability density function of X ∼ χ²_k is:

f(x) = (2^{k/2} Γ(k/2))^{−1} x^{k/2−1} e^{−x/2}   for x ≥ 0, and f(x) = 0 otherwise

where:

Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx

is the gamma function, which is defined for all α > 0. (Note the formula of the pdf of
X ∼ χ2k is not examinable.)
The shape of the pdf depends on the degrees of freedom k, as illustrated in Figure 6.7
(on the next page). In most applications of the χ2 distribution the appropriate value of
k is known, in which case it does not need to be estimated from data.
If X1, X2, . . . , Xm are independent random variables and Xi ∼ χ²_{ki}, then their sum is also χ²-distributed where the individual degrees of freedom are added, such that:

X1 + X2 + · · · + Xm ∼ χ²_{k1+k2+···+km}.

The uses of the χ2 distribution will be discussed later. One example though is if
{X1 , X2 , . . . , Xn } is a random sample from the population N (µ, σ 2 ), and S 2 is the
sample variance, then:
(n − 1)S²/σ² ∼ χ²_{n−1}.
This result is used to derive basic tools of statistical inference for both µ and σ 2 for the
normal distribution.


Figure 6.7: χ² pdfs for various degrees of freedom (left panel: k = 1, 2, 4, 6; right panel: k = 10, 20, 30, 40).

Tables of the χ2 distribution

In exercises and the examination, you will need a table of some probabilities for the χ2
distribution. Table 8 of the New Cambridge Statistical Tables shows the following
information.

The rows correspond to different degrees of freedom k (denoted in the table by ν).
The table shows values of k up to 100.

The columns correspond to the right-tail probability P (X > x) = α, where X ∼ χ2k ,


for different values of α. The first page contains α = 0.9995, 0.999, . . . , 0.60, and the
second page contains α = 0.50, 0.25, . . . , 0.0005. (The table presents α in units of
percentage points, P , so, for example, α = 0.60 corresponds to P = 60.)

The numbers in the table are values of x such that P (X > x) = α for the k and α
in that row and column.

Example 6.7 Consider two numbers in the ‘ν = 5’ row, the 9.236 in the ‘α = 0.10
(P = 10)’ column and the 11.07 in the ‘α = 0.05 (P = 5)’ column. These mean that
for X ∼ χ25 we have:

P (X > 9.236) = 0.10 (and hence P (X ≤ 9.236) = 0.90)


P (X > 11.07) = 0.05 (and hence P (X ≤ 11.07) = 0.95).

These also provide bounds for probabilities of other values. For example, since 10.0
is between 9.236 and 11.07, we can conclude that:
0.05 < P (X > 10.0) < 0.10.
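Statistical software reproduces, and refines, the values read from Table 8. For example, with scipy (an illustrative Python sketch):

    from scipy.stats import chi2

    # Right-tail probabilities P(X > x) for the chi-squared distribution with 5 degrees of freedom
    print(chi2.sf(9.236, df=5))    # approximately 0.10
    print(chi2.sf(11.07, df=5))    # approximately 0.05
    print(chi2.sf(10.0, df=5))     # approximately 0.075, indeed between 0.05 and 0.10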


The ways in which this table may be used in statistical inference will be explained in
later chapters.

6.8.2 (Student’s) t distribution

Definition of Student’s t distribution

Suppose Z ∼ N(0, 1), X ∼ χ²_k, and Z and X are independent. The distribution of the random variable:

T = Z / √(X/k)

is the t distribution with k degrees of freedom. This is denoted T ∼ tk or T ∼ t(k). The distribution is also known as 'Student's t distribution'.

The tk distribution is continuous with the pdf:

f(x) = (Γ((k + 1)/2) / (√(kπ) Γ(k/2))) (1 + x²/k)^{−(k+1)/2}

for all −∞ < x < ∞. Examples of f (x) for different k are shown in Figure 6.8. (Note
the formula of the pdf of tk is not examinable.)
Figure 6.8: Student's t pdfs for various degrees of freedom (k = 1, 3, 8, 20), with the N(0, 1) pdf for comparison.

From Figure 6.8, we see the following.

The distribution is symmetric around 0.

As k → ∞, the tk distribution tends to the standard normal distribution, so tk with


large k is very similar to N (0, 1).


For any finite value of k, the tk distribution has heavier tails than the standard
normal distribution, i.e. tk places more probability on values far from 0 than
N (0, 1) does.

For T ∼ tk , the mean and variance of the distribution are:

E(T ) = 0 for k > 1

and:

Var(T) = k/(k − 2)   for k > 2.
This means that for t1 neither E(T ) nor Var(T ) exist, and for t2 , Var(T ) does not exist.

Tables of the t distribution

In exercises and the examination, you will need a table of some probabilities for the t
distribution. Table 10 of the New Cambridge Statistical Tables shows the following
information.

The rows correspond to different degrees of freedom k (denoted in the table by ν).
The table shows values of k up to 120, and then ‘∞’, which is N (0, 1).

If you need a tk distribution for which k is not in the table, use the nearest value or
use interpolation.

The columns correspond to the right-tail probability P (T > t) = α, where T ∼ tk ,


for α = 0.40, 0.05, . . . , 0.0005. (The table presents α in units of percentage points,
P , so, for example, α = 0.40 corresponds to P = 40.)

The numbers in the table are values of t such that P (T > t) = α for the k and α in
that row and column.

Example 6.8 Consider the number 2.132 in the ‘ν = 4’ row, and the ‘α = 0.05
(P = 5)’ column. This means that for T ∼ t4 we have:

P (T > 2.132) = 0.05 (and hence P (T ≤ 2.132) = 0.95).

The table also provides bounds for other probabilities. For example, the number in
the ‘α = 0.025 (P = 2.5)’ column is 2.776, so P (T > 2.776) = 0.025. Since
2.132 < 2.5 < 2.776, we know that 0.025 < P (T > 2.5) < 0.05.
Results for left-tail probabilities P (T < t) = α can also be obtained, because the t
distribution is symmetric around 0. This means that P (T < t) = P (T > −t). For
example:
P (T < −2.132) = P (T > 2.132) = 0.05
and P (T < −2.5) < 0.05 since P (T > 2.5) < 0.05.
This is the same trick we used for the standard normal distribution.
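The same checks can be done for the t distribution (an illustrative Python sketch using scipy):

    from scipy.stats import t

    # Right-tail and left-tail probabilities for Student's t with 4 degrees of freedom
    print(t.sf(2.132, df=4))       # approximately 0.05
    print(t.sf(2.5, df=4))         # approximately 0.033, between 0.025 and 0.05
    print(t.cdf(-2.132, df=4))     # approximately 0.05, equal to P(T > 2.132) by symmetry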


6.8.3 The F distribution

Definition of the F distribution

Let U and V be two independent random variables, where U ∼ χ²_p and V ∼ χ²_k. The distribution of:

F = (U/p) / (V/k)

is the F distribution with degrees of freedom (p, k), denoted F ∼ F_{p,k} or F ∼ F(p, k).

The F distribution is a continuous distribution, with non-zero probabilities for x > 0.


The general shape of its pdf is shown in Figure 6.9.

Figure 6.9: F pdfs for various degrees of freedom ((p, k) = (10, 3), (10, 10) and (10, 50)).

For F ∼ Fp, k , E(F ) = k/(k − 2), for k > 2. If F ∼ Fp, k , then 1/F ∼ Fk, p . If T ∼ tk ,
then T 2 ∼ F1, k .
Tables of F distributions will be needed for some purposes. They will be available in the
examination. We will postpone practice with them until later in the course.


6.9 Prelude to statistical inference


We conclude Chapter 6 with a discussion of the preliminaries of statistical inference
before moving on to point estimation. The discussion below will review some key
concepts introduced previously.
So, just what is ‘Statistics’ ? It is a scientific subject of collecting and ‘making sense’ of
data.

Collection: designing experiments/questionnaires, designing sampling schemes, and


administration of data collection.
Making sense: estimation, testing and forecasting.

So, ‘Statistics’ is an application-oriented subject, particularly useful or helpful in


answering questions such as the following.

Does a certain new drug prolong life for AIDS sufferers?


Is global warming really happening?
Are GCSE and A-level examination standards declining?
Is the gap between rich and poor widening in Britain?
Is there still a housing bubble in London?
Is the Chinese yuan undervalued? If so, by how much?

These questions are difficult to study in a laboratory, and admit no self-evident axioms.
Statistics provides a way of answering these types of questions using data.
What should we learn in ‘Statistics’ ? The basic ideas, methods and theory. Some
guidelines for learning/applying statistics are the following.

Understand what data say in each specific context. All the methods are just tools
to help us to understand data.
Concentrate on what to do and why, rather than on concrete calculations and
graphing.
It may take a while to grasp the basic idea of statistics – keep thinking!

6.9.1 Population versus random sample


Consider the following two practical examples.

Example 6.9 A new type of tyre was designed to increase its lifetime. The
manufacturer tested 120 new tyres and obtained the average lifetime (over these 120
tyres) of 35,391 miles. So the manufacturer claims that the mean lifetime of new
tyres is 35,391 miles.


Example 6.10 A newspaper sampled 1,000 potential voters, and 350 of them were
Labour Party supporters. It claims that the proportion of Labour voters in the
whole country is 350/1,000 = 0.35, i.e. 35%.

In both cases, a conclusion about a population (i.e. all the objects concerned) is drawn
based on the information from a sample (i.e. a subset of the population).
In Example 6.9, it is impossible to measure the whole population. In Example 6.10, it is
not economical to measure the whole population. Therefore, errors are inevitable!
The population is the entire set of objects concerned, and these objects are typically
represented by some numbers. We do not know the entire population in practice.
In Example 6.9, the population consists of the lifetimes of all tyres, including those to
be produced in the future. For the opinion poll in Example 6.10, the population consists
of many ‘1’s and ‘0’s, where each ‘1’ represents a voter for the Labour party, and each
‘0’ represents a voter for other parties.
A sample is a (randomly) selected subset of a population, and is known in practice. The
population is unknown. We represent a population by a probability distribution.
Why do we need a model for the entire population?

Because the questions we ask concern the entire population, not just the data we
have. Having a model for the population tells us that the remaining population is
not much different from our data or, in other words, that the data are
representative of the population.

Why do we need a random model?

Because the process of drawing a sample from a population is a bit like the process
of generating random variables. A different sample would produce different values.
Therefore, the population from which we draw a random sample is represented as a
probability distribution.

6.9.2 Parameter versus statistic


For a given problem, we typically assume a population to be a probability distribution
F (x; θ), where the form of distribution F is known (such as normal or Poisson), and θ
denotes some unknown characteristic (such as the mean or variance) and is called a
parameter.

Example 6.11 Continuing with Example 6.9, the population may be assumed to
be N (µ, σ 2 ) with θ = (µ, σ 2 ), where µ is the ‘true’ lifetime.
Let:
X = the lifetime of a tyre
then we can write X ∼ N (µ, σ 2 ).


Example 6.12 Continuing with Example 6.10, the population is a Bernoulli


distribution such that:

P (X = 1) = P (a Labour voter) = π

and:
P (X = 0) = P (a non-Labour voter) = 1 − π
where:

π = the proportion of Labour supporters in the UK


= the probability of a voter being a Labour supporter.

A sample: a set of data or random variables?

A sample of size n, {X1 , X2 , . . . , Xn }, is also called a random sample. It consists of n


real numbers in a practical problem. The word ‘random’ captures the fact that samples
(of the same size) taken by different people or at different times may be different, as
they are different subsets of a population.
Furthermore, a sample is also viewed as n independent and identically distributed
(IID) random variables, when we assess the performance of a statistical method.

Example 6.13 For the tyre lifetime in Example 6.9, suppose the realised sample
(of size n = 120) gives the sample mean:
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = 35{,}391.$

A different sample may give a different sample mean, such as 36,721.

Is the sample mean X̄ a good estimator of the unknown ‘true’ lifetime µ? Obviously,
we cannot use the real number 35,391 to assess how good this estimator is, as a different
sample may give a different average value, such as 36,721.
By treating {X1 , X2 , . . . , Xn } as random variables, X̄ is also a random variable. If the
distribution of X̄ concentrates closely around (unknown) µ, X̄ is a good estimator of µ.

Definition of a statistic

Any known function of a random sample is called a statistic. Statistics are used for
statistical inference such as estimation and testing.

Example 6.14 Let {X1 , X2 , . . . , Xn } be a random sample from the population


N (µ, σ 2 ), then:
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \quad X_1 + X_n^2 \quad \text{and} \quad \sin(X_3) + 6$


are all statistics, but:


$\frac{X_1 - \mu}{\sigma}$
is not a statistic, as it depends on the unknown quantities µ and σ 2 .

An observed random sample is often denoted as {x1 , x2 , . . . , xn }, indicating that they


are n real numbers. They are seen as a realisation of n IID random variables
{X1 , X2 , . . . , Xn }.
The connection between a population and a sample is shown in Figure 6.10, where θ is
a parameter. A known function of {X1 , X2 , . . . , Xn } is called a statistic.

Figure 6.10: Representation of the connection between a population and a sample.

6.9.3 Difference between ‘Probability’ and ‘Statistics’

‘Probability’ is a mathematical subject, while ‘Statistics’ is an application-oriented


subject (which uses probability heavily).

Example 6.15 Let:

X = the number of lectures attended by a student in a term with 20 lectures

then X ∼ Bin(20, π), i.e. the pf is:

$P(X = x) = \frac{20!}{x!\,(20-x)!}\, \pi^x (1-\pi)^{20-x} \quad \text{for } x = 0, 1, 2, \ldots, 20$

and 0 otherwise.


Some probability questions are as follows. Treating π as known:

what is E(X) (the average number of lectures attended)?

what is P (X ≥ 18) (the proportion of students attending at least 18 lectures)?

what is P (X < 10) (the proportion of students attending fewer than half of the
lectures)?

Some statistics questions are as follows.

What is π (the average attendance rate)?

Is π larger than 0.9?

Is π smaller than 0.5?

6.10 Overview of chapter


This chapter introduced sampling distributions of statistics which are the foundations
to statistical inference. The sampling distribution of the sample mean was derived
exactly when sampling from normal populations and also approximately for more
general distributions using the central limit theorem. Three new families of distributions
(χ2 , t and F ) were defined.

6.11 Key terms and concepts

Central limit theorem; Chi-squared (χ²) distribution; F distribution; IID random variables; Random sample; Sampling distribution; Sampling variance; Statistic; (Student’s) t distribution.

6.12 Sample examination questions

1. Suppose that on a certain statistics examination, students from university X


achieve scores which are normally distributed with a mean of 62 and a variance of
10, while students from university Y achieve scores which are normally distributed
with a mean of 60 and a variance of 15. If two students from university X and three
students from university Y, selected at random, sit this examination, what is the
probability that the average of the scores of the two students from university X will
be greater than the average of the scores of the three students from university Y?


2. Suppose Xi ∼ N (0, 9), for i = 1, 2, 3, 4. Assume all these random variables are
independent. Derive the value of k in each of the following.

(a) P (X1 + 6X2 < k) = 0.3974.


(b) $P\left(\sum_{i=1}^{4} X_i^2 < k\right) = 0.90$.

(c) $P\left(X_1 > (k(X_2^2 + X_3^2))^{1/2}\right) = 0.10$.

6.13 Solutions to Sample examination questions


1. With obvious notation, we have:
$\bar{X} \sim N\left(\mu_X, \frac{\sigma_X^2}{n_X}\right) = N(62, 5) \quad \text{and} \quad \bar{Y} \sim N\left(\mu_Y, \frac{\sigma_Y^2}{n_Y}\right) = N(60, 5).$

Therefore, by independence:

X̄ − Ȳ ∼ N (2, 10)

so:
$P(\bar{X} - \bar{Y} > 0) = P\left(Z > \frac{-2}{\sqrt{10}}\right) = P(Z > -0.63) = 0.7357.$

2. (a) Xi ∼ N (0, 9), for i = 1, 2, 3, 4. We have:

X1 + 6X2 ∼ N (0, 333).

Hence:
$P(X_1 + 6X_2 < k) = P\left(Z < \frac{k}{\sqrt{333}}\right) = 0.3974.$
Since, from tables, Φ(−0.26) = 0.3974, we have:
$\frac{k}{\sqrt{333}} = -0.26 \quad \Rightarrow \quad k = -4.7446.$

(b) $X_i/\sqrt{9} \sim N(0, 1)$, and so $X_i^2/9 \sim \chi^2_1$. Hence we have that $\sum_{i=1}^{4} X_i^2/9 \sim \chi^2_4$.
Therefore, from tables:
$P\left(\sum_{i=1}^{4} X_i^2 < k\right) = P\left(X < \frac{k}{9}\right) = 0.90 \quad \Rightarrow \quad \frac{k}{9} = 7.779 \quad \Rightarrow \quad k = 70.011$
where $X \sim \chi^2_4$.


(c) We have:
$P\left(X_1 > (k(X_2^2 + X_3^2))^{1/2}\right) = P\left(\frac{X_1/\sqrt{9}}{\sqrt{(X_2^2 + X_3^2)/(9 \times 2)}} > \sqrt{2} \times \sqrt{k}\right) = P(T > \sqrt{2} \times \sqrt{k}) = 0.10$
where $T \sim t_2$. From tables, $\sqrt{2} \times \sqrt{k} = 1.886$, hence $k = 1.7785$.
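These values can also be verified in R using the relevant quantile functions (an informal check of the arithmetic above, not part of the formal solution):

sqrt(333) * qnorm(0.3974)    # part (a): k is approximately -4.74
9 * qchisq(0.90, df = 4)     # part (b): k is approximately 70.0
qt(0.90, df = 2)^2 / 2       # part (c): k is approximately 1.78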

Did you hear the one about the statistician? Probably.


(Anon)

Chapter 7
Point estimation

7.1 Synopsis of chapter


This chapter covers point estimation. Specifically, the properties of estimators are
considered and the attributes of a desirable estimator are discussed. Techniques for
deriving estimators are introduced.

7.2 Learning outcomes


After completing this chapter, you should be able to:

summarise the performance of an estimator with reference to its sampling


distribution

use the concepts of bias and variance of an estimator

define mean squared error and calculate it for simple estimators

find estimators using the method of moments, least squares and maximum
likelihood.

7.3 Introduction
The basic setting is that we assume a random sample {X1 , X2 , . . . , Xn } is observed from
a population F (x; θ). The goal is to make inference (i.e. estimation or testing) for the
unknown parameter(s) θ.

Statistical inference is based on two things.


1. A set of data/observations {X1 , X2 , . . . , Xn }.
2. An assumption of F (x; θ) for the joint distribution of {X1 , X2 , . . . , Xn }.

Inference is carried out using a statistic, i.e. a known function of {X1 , X2 , . . . , Xn }.

For estimation, we look for a statistic θ̂ = θ̂(X1 , X2 , . . . , Xn ) such that the value of θ̂ is taken as an estimate (i.e. an estimated value) of θ. Such a θ̂ is called a point estimator of θ.

For testing, we typically use a statistic to test if a hypothesis on θ (such as θ = 3) is


true or not.


Example 7.1 Let {X1 , X2 , . . . , Xn } be a random sample from a population with


mean µ = E(Xi ). Find an estimator of µ.
Since µ is the mean of the population, a natural estimator would be the sample
mean µ̂ = X̄, where:
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{X_1 + X_2 + \cdots + X_n}{n}.$

We call µ̂ = X̄ a point estimator (or simply an estimator) of µ.
For example, if we have an observed sample of 9, 16, 15, 4 and 12, hence of size n = 5, the sample mean is:
$\hat{\mu} = \frac{9 + 16 + 15 + 4 + 12}{5} = 11.2.$
The value 11.2 is a point estimate of µ. For an observed sample of 15, 16, 10, 8 and 9, we obtain µ̂ = 11.6.

7.4 Estimation criteria: bias, variance and mean


squared error
Estimators are random variables and, therefore, have probability distributions, known
as sampling distributions. As we know, two important properties of probability
distributions are the mean and variance. Our objective is to create a formal criterion
which combines both of these properties to assess the relative performance of different
estimators.

Bias of an estimator

Let θ̂ be an estimator of the population parameter θ.1 We define the bias of an estimator as:
$\text{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta. \quad (7.1)$
An estimator is:

positively biased if $E(\hat{\theta}) - \theta > 0$

unbiased if $E(\hat{\theta}) - \theta = 0$

negatively biased if $E(\hat{\theta}) - \theta < 0$.

A positively-biased estimator means the estimator would systematically overestimate


the parameter by the size of the bias, on average. An unbiased estimator means the
estimator would estimate the parameter correctly, on average. A negatively-biased
1 The hat notation is often used by statisticians to denote an estimator of the parameter beneath the hat. So, for example, λ̂ denotes an estimator of the Poisson rate parameter λ.


estimator means the estimator would systematically underestimate the parameter by


the size of the bias, on average.
In words, the bias of an estimator is the difference between the expected (average) value
of the estimator and the true parameter being estimated. Intuitively, it would be
desirable, other things being equal, to have an estimator with zero bias, called an
unbiased estimator. Given the definition of bias in (7.1), an unbiased estimator would
satisfy:
$E(\hat{\theta}) = \theta.$
In words, the expected value of the estimator is the true parameter being estimated, i.e.
on average, under repeated sampling, an unbiased estimator correctly estimates θ.
We view bias as a ‘bad’ thing, so, other things being equal, the smaller an estimator’s
bias the better.

Example 7.2 Since E(X̄) = µ, the sample mean X̄ is an unbiased estimator of µ


because:
E(X̄) − µ = 0.

Variance of an estimator

The variance of an estimator, denoted Var(θ̂), is obtained directly from the estimator’s sampling distribution.

Example 7.3 For the sample mean, X̄, we have:

$\text{Var}(\bar{X}) = \frac{\sigma^2}{n}. \quad (7.2)$

It is clear that in (7.2) increasing the sample size n decreases the estimator’s variance
(and hence the standard error, i.e. the square root of the estimator’s variance), therefore
increasing the precision of the estimator.2 We conclude that variance is also a ‘bad’
thing so, other things being equal, the smaller an estimator’s variance the better.

Estimator properties

Is µ̂ = X̄ a ‘good’ estimator of µ?
Intuitively, X1 or (X1 + X2 + X3)/3 would not be good enough as estimators of µ. However, can we use other estimators such as the sample median:
$\hat{\mu}_1 = \begin{cases} X_{((n+1)/2)} & \text{for odd } n \\ (X_{(n/2)} + X_{(n/2+1)})/2 & \text{for even } n \end{cases}$

2
Remember, however, that this increased precision comes at a cost – namely the increased expenditure
on data collection.


or perhaps a trimmed sample mean:
$\hat{\mu}_2 = \frac{1}{n - k_1 - k_2}\left(X_{(k_1+1)} + X_{(k_1+2)} + \cdots + X_{(n-k_2)}\right)$
or simply $\hat{\mu}_3 = (X_{(1)} + X_{(n)})/2$, where $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ are the order statistics obtained by rearranging X1 , X2 , . . . , Xn into ascending order:
$X_{(1)} \leq X_{(2)} \leq \cdots \leq X_{(n)}$
and k1 and k2 are two small, positive integers?
To highlight the key idea, let θ be a scalar, and θ̂ be a (point) estimator of θ. A good estimator would make |θ̂ − θ| as small as possible. However:

θ is unknown

the value of θ̂ changes with the observed sample.

Mean squared error and mean absolute deviation

The mean squared error (MSE) of θ̂ is defined as:
$\text{MSE}(\hat{\theta}) = E\left[(\hat{\theta} - \theta)^2\right]$
and the mean absolute deviation (MAD) of θ̂ is defined as:
$\text{MAD}(\hat{\theta}) = E\left[|\hat{\theta} - \theta|\right].$

Intuitively, MAD is a more appropriate measure for the error in estimation. However, it
is technically less convenient since the function h(x) = |x| is not differentiable at x = 0.
Therefore, the MSE is used more often.
If $E(\hat{\theta}^2) < \infty$, it holds that:
$\text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + \left(\text{Bias}(\hat{\theta})\right)^2$
where $\text{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$.
Proof:
$\text{MSE}(\hat{\theta}) = E\left[(\hat{\theta} - \theta)^2\right]$
$= E\left[\left((\hat{\theta} - E(\hat{\theta})) + (E(\hat{\theta}) - \theta)\right)^2\right]$
$= E\left[(\hat{\theta} - E(\hat{\theta}))^2\right] + E\left[(E(\hat{\theta}) - \theta)^2\right] + 2E\left[(\hat{\theta} - E(\hat{\theta}))(E(\hat{\theta}) - \theta)\right]$
$= \text{Var}(\hat{\theta}) + \left(\text{Bias}(\hat{\theta})\right)^2 + 2(E(\hat{\theta}) - E(\hat{\theta}))(E(\hat{\theta}) - \theta)$
$= \text{Var}(\hat{\theta}) + \left(\text{Bias}(\hat{\theta})\right)^2 + 0.$


We have already established that both bias and variance of an estimator are ‘bad’
things, so the MSE (being the sum of a bad thing and a bad thing squared) can also be
viewed as a ‘bad’ thing.3 Hence when faced with several competing estimators, we
prefer the estimator with the smallest MSE.
So, although an unbiased estimator is intuitively appealing, it is perfectly possible that
a biased estimator might be preferred if the ‘cost’ of the bias is offset by a substantial
reduction in variance. Hence the MSE provides us with a formal criterion to assess the
trade-off between the bias and variance of different estimators of the same parameter.

Example 7.4 A population is known to be normally distributed, i.e. X ∼ N (µ, σ 2 ).


Suppose we wish to estimate the population mean, µ. We draw a random sample
{X1 , X2 , . . . , Xn } such that these random variables are IID. We have three candidate
estimators of µ, T1 , T2 and T3 , defined as:
$T_1 = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \quad T_2 = \frac{X_1 + X_n}{2} \quad \text{and} \quad T_3 = \bar{X} + 3.$

Which estimator should we choose?


We begin by computing the MSE for T1 , noting:

E(T1 ) = E(X̄) = µ

and:
$\text{Var}(T_1) = \text{Var}(\bar{X}) = \frac{\sigma^2}{n}.$
Hence T1 is an unbiased estimator of µ. So the MSE of T1 is just the variance of T1 ,
since the bias is 0. Therefore, MSE(T1 ) = σ 2 /n.
Moving to T2 , note:
 
$E(T_2) = E\left(\frac{X_1 + X_n}{2}\right) = \frac{E(X_1) + E(X_n)}{2} = \frac{\mu + \mu}{2} = \mu$
and:
$\text{Var}(T_2) = \frac{\text{Var}(X_1) + \text{Var}(X_n)}{2^2} = \frac{2\sigma^2}{4} = \frac{\sigma^2}{2}.$
So T2 is also an unbiased estimator of µ, hence MSE(T2 ) = σ 2 /2.
Finally, consider T3 , noting:

E(T3 ) = E(X̄ + 3) = E(X̄) + 3 = µ + 3

and:
$\text{Var}(T_3) = \text{Var}(\bar{X} + 3) = \text{Var}(\bar{X}) = \frac{\sigma^2}{n}.$
So T3 is a positively-biased estimator of µ, with a bias of 3. Hence we have
MSE(T3 ) = σ 2 /n + 32 = σ 2 /n + 9.
We seek the estimator with the smallest MSE. Clearly, MSE(T1 ) < MSE(T3 ) so we
can eliminate T3 . Now comparing T1 with T2 , we note that:
3
Or, for that matter, a ‘very bad’ thing!


for n = 2, MSE(T1 ) = MSE(T2 ), since the estimators are identical

for n > 2, MSE(T1 ) < MSE(T2 ), so T1 is preferred.

So T1 = X̄ is our preferred estimator of µ. Intuitively this should make sense. Note


for n > 2, T1 uses all the information in the sample (i.e. all observations are used),
unlike T2 which uses the first and last observations only. Of course, for n = 2, these
estimators are identical.
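The comparison of T1, T2 and T3 can also be illustrated by simulation in R, in the spirit of the earlier R illustrations. The sketch below uses arbitrarily chosen values µ = 5, σ = 2 and n = 20 (these are not taken from the example) and estimates each MSE over repeated samples; the estimates should be close to σ²/n, σ²/2 and σ²/n + 9, respectively.

set.seed(1)
mu <- 5; sigma <- 2; n <- 20; R <- 10000
T1 <- T2 <- T3 <- numeric(R)
for (r in 1:R) {
  x <- rnorm(n, mean = mu, sd = sigma)
  T1[r] <- mean(x)             # sample mean
  T2[r] <- (x[1] + x[n]) / 2   # average of the first and last observations
  T3[r] <- mean(x) + 3         # sample mean plus 3
}
c(mean((T1 - mu)^2), mean((T2 - mu)^2), mean((T3 - mu)^2))
# compare with sigma^2/n = 0.2, sigma^2/2 = 2 and sigma^2/n + 9 = 9.2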

Some remarks are the following.

i. µ̂ = X̄ is a better estimator of µ than X1 as:
$\text{MSE}(\hat{\mu}) = \frac{\sigma^2}{n} < \text{MSE}(X_1) = \sigma^2.$

ii. As n → ∞, MSE(X̄) → 0, i.e. when the sample size tends to infinity, the error in
estimation goes to 0. Such an estimator is called a (mean-square) consistent
estimator.
Consistency is a reasonable requirement. It may be used to rule out some silly
estimators.
For µ̃ = (X1 + X4 )/2, MSE(µ̃) = σ 2 /2 which does not converge to 0 as n → ∞.
This is due to the fact that only a small portion of information (i.e. X1 and X4 )
is used in the estimation.
iii. For any random sample {X1 , X2 , . . . , Xn } from a population with mean µ and
variance σ 2 , it holds that E(X̄) = µ and Var(X̄) = σ 2 /n. The derivation of the
expected value and variance of the sample mean was covered in Chapter 6.
iv. For any independent random variables Y1 , Y2 , . . . , Yk and constants a1 , a2 , . . . , ak ,
then:
$E\left(\sum_{i=1}^{k} a_i Y_i\right) = \sum_{i=1}^{k} a_i E(Y_i) \quad \text{and} \quad \text{Var}\left(\sum_{i=1}^{k} a_i Y_i\right) = \sum_{i=1}^{k} a_i^2 \text{Var}(Y_i).$
The proof uses the fact that:
$\text{Var}\left(\sum_{i=1}^{k} a_i Y_i\right) = E\left[\left(\sum_{i=1}^{k} a_i (Y_i - E(Y_i))\right)^2\right].$

Example 7.5 Bias by itself cannot be used to measure the quality of an estimator.
Consider two artificial estimators of θ, θb1 and θb2 , such that θb1 takes only the two
values, θ − 100 and θ + 100, and θb2 takes only the two values θ and θ + 0.2, with the
following probabilities:

P (θb1 = θ − 100) = P (θb1 = θ + 100) = 0.5

and:
P (θb2 = θ) = P (θb2 = θ + 0.2) = 0.5.


Note that θb1 is an unbiased estimator of θ and θb2 is a positively-biased estimator of θ


as:
Bias(θb2 ) = E(θb2 ) − θ = ((θ × 0.5) + ((θ + 0.2) × 0.5)) − θ = 0.1.
However:

MSE(θb1 ) = E((θb1 − θ)2 ) = (−100)2 × 0.5 + (100)2 × 0.5 = 10,000

and:
MSE(θb2 ) = E((θb2 − θ)2 ) = 02 × 0.5 + (0.2)2 × 0.5 = 0.02.
Hence θb2 is a much better (i.e. more accurate) estimator of θ than θb1 .

Example 7.6 Let {X1 , X2 , . . . , Xn } be a random sample from a population with


mean µ = E(Xᵢ) and variance σ² = Var(Xᵢ) < ∞, for i = 1, 2, . . . , n. Let µ̂ = X̄. Find MSE(µ̂).
We compute the bias and variance separately.
$E(\hat{\mu}) = E\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}\sum_{i=1}^{n} \mu = \mu.$
Hence $\text{Bias}(\hat{\mu}) = E(\hat{\mu}) - \mu = 0$. For the variance, we note the useful formula:
$\left(\sum_{i=1}^{k} a_i\right)\left(\sum_{j=1}^{k} b_j\right) = \sum_{i=1}^{k}\sum_{j=1}^{k} a_i b_j = \sum_{i=1}^{k} a_i b_i + \sum_{1 \leq i \neq j \leq k} a_i b_j.$
Especially:
$\left(\sum_{i=1}^{k} a_i\right)^2 = \sum_{i=1}^{k} a_i^2 + \sum_{1 \leq i \neq j \leq k} a_i a_j.$
Hence:
$\text{Var}(\hat{\mu}) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = E\left[\left(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right)^2\right] = E\left[\left(\frac{1}{n}\sum_{i=1}^{n} (X_i - \mu)\right)^2\right]$
$= \frac{1}{n^2}\left(\sum_{i=1}^{n} E\left[(X_i - \mu)^2\right] + \sum_{1 \leq i \neq j \leq n} E\left[(X_i - \mu)(X_j - \mu)\right]\right)$
$= \frac{1}{n^2}\left(n\sigma^2 + \sum_{1 \leq i \neq j \leq n} E(X_i - \mu)\, E(X_j - \mu)\right) = \frac{\sigma^2}{n}.$
Hence $\text{MSE}(\hat{\mu}) = \text{MSE}(\bar{X}) = \sigma^2/n$.


Finding estimators

In general, how should we find an estimator of θ in a practical situation?


There are three conventional methods:

method of moments estimation

least squares estimation

maximum likelihood estimation.

7.5 Method of moments (MM) estimation

Method of moments estimation

Let {X1 , X2 , . . . , Xn } be a random sample from a population F (x; θ). Suppose θ has
p components (for example, for a normal population N (µ, σ 2 ), p = 2; for a Poisson
population with parameter λ, p = 1).
Let:
µk = µk (θ) = E(X k )
denote the kth population moment, for k = 1, 2, . . .. Therefore, µk depends on the
unknown parameter θ, as everything else about the distribution F (x; θ) is known.
Denote the kth sample moment by:
$M_k = \frac{1}{n}\sum_{i=1}^{n} X_i^k = \frac{X_1^k + X_2^k + \cdots + X_n^k}{n}.$

The MM estimator (MME) θ̂ of θ is the solution of the p equations:

$\mu_k(\hat{\theta}) = M_k \quad \text{for } k = 1, 2, \ldots, p.$

Example 7.7 Let {X1 , X2 , . . . , Xn } be a random sample from a population with


mean µ and variance σ 2 < ∞. Find the MM estimator of (µ, σ 2 ).
There are two unknown parameters. Let:
$\hat{\mu}_1 = M_1 \quad \text{and} \quad \hat{\mu}_2 = M_2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2.$
This gives us µ̂ = M1 = X̄.
Since $\sigma^2 = \mu_2 - \mu_1^2 = E(X^2) - (E(X))^2$, we have:
$\hat{\sigma}^2 = M_2 - M_1^2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2.$


Note we have:
$E(\hat{\sigma}^2) = E\left(\frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}^2\right) = \frac{1}{n}\sum_{i=1}^{n} E(X_i^2) - E(\bar{X}^2) = E(X^2) - E(\bar{X}^2) = \sigma^2 + \mu^2 - \left(\frac{\sigma^2}{n} + \mu^2\right) = \frac{(n-1)\sigma^2}{n}.$
Since:
$E(\hat{\sigma}^2) - \sigma^2 = -\frac{\sigma^2}{n} < 0$
σ̂² is a negatively-biased estimator of σ².
The sample variance, defined as:
$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$
is a more frequently-used estimator of σ² as it has zero bias, i.e. it is an unbiased estimator since E(S²) = σ². This is why we use the n − 1 divisor when calculating the sample variance.

A useful formula for computation of the sample variance is:
$S^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right).$
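As an informal check of the bias calculation above, the following R simulation (with arbitrarily chosen σ² = 4 and n = 10) compares the average of the MME σ̂², which uses the divisor n, with the average of the sample variance S², which R's var() computes with the divisor n − 1.

set.seed(2)
n <- 10; sigma2 <- 4; R <- 20000
mme <- s2 <- numeric(R)
for (r in 1:R) {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  mme[r] <- mean((x - mean(x))^2)   # divisor n
  s2[r]  <- var(x)                  # divisor n - 1
}
c(mean(mme), mean(s2))   # roughly (n-1)*sigma2/n = 3.6 and sigma2 = 4, respectively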

Note the MME does not use any information on F (x; θ) beyond the moments.
The idea is that Mk should be pretty close to µk when n is sufficiently large. In fact:
$M_k = \frac{1}{n}\sum_{i=1}^{n} X_i^k$

converges to:
µk = E(X k )
as n → ∞. This is due to the law of large numbers (LLN). We illustrate this
phenomenon by simulation using R.

Example 7.8 For N (2, 4), we have µ1 = 2 and µ2 = 8. We use the sample moments
M1 and M2 as estimators of µ1 and µ2 , respectively. Note how the sample moments
converge to the population moments as the sample size increases.


For a sample of size n = 10, we obtained m1 = 0.5145838 and m2 = 2.171881.

> x <- rnorm(10,2,2)


> x
[1] 0.70709403 -1.38416864 -0.01692815 2.51837989 -0.28518898 1.96998829
[7] -1.53308559 -0.42573724 1.76006933 1.83541490
> mean(x)
[1] 0.5145838
> x2 <- x^2
> mean(x2)
[1] 2.171881

For a sample of size n = 100, we obtained m1 = 2.261542 and m2 = 8.973033.

> x <- rnorm(100,2,2)


> mean(x)
[1] 2.261542
> x2 <- x^2
> mean(x2)
[1] 8.973033

For a sample of size n = 500, we obtained m1 = 1.912112 and m2 = 7.456353.

> x <- rnorm(500,2,2)


> mean(x)
[1] 1.912112
> x2 <- x^2
> mean(x2)
[1] 7.456353

Example 7.9 For a Poisson distribution with λ = 1, we have µ1 = 1 and µ2 = 2.


With a sample of size n = 500, we obtained m1 = 1.09 and m2 = 2.198.

> x <- rpois(500,1)


> mean(x)
[1] 1.09
> x2 <- x^2
> mean(x2)
[1] 2.198
> x
[1] 1 2 2 1 0 0 0 0 0 0 2 2 1 2 1 1 1 2 ...

7.6 Least squares (LS) estimation


Given a random sample {X1 , X2 , . . . , Xn } from a population with mean µ and variance
σ 2 , how can we estimate µ?


The MME of µ is the sample mean $\bar{X} = \sum_{i=1}^{n} X_i / n$.

Least squares estimator of µ

The estimator X̄ is also the least squares estimator (LSE) of µ, i.e. µ̂ = X̄ is the value of a which minimises the sum of squares $\sum_{i=1}^{n} (X_i - a)^2$.

Proof: Given that $S = \sum_{i=1}^{n} (X_i - a)^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 + n(\bar{X} - a)^2$, where all terms are non-negative, the value of a for which S is minimised is that for which $n(\bar{X} - a)^2 = 0$, i.e. a = X̄.


Estimator accuracy

In order to assess the accuracy of µ̂ = X̄ as an estimator of µ we calculate its MSE:
$\text{MSE}(\hat{\mu}) = E\left[(\hat{\mu} - \mu)^2\right] = \frac{\sigma^2}{n}.$
In order to determine the distribution of µ̂ we require knowledge of the underlying distribution. Even if the relevant knowledge is available, one may only compute the exact distribution of µ̂ explicitly for a limited number of cases.
By the central limit theorem, as n → ∞, we have:
$P\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \leq z\right) \to \Phi(z)$
for any z, where Φ(z) is the cdf of N(0, 1), i.e. when n is large, X̄ ∼ N(µ, σ²/n) approximately.
Hence when n is large:
$P\left(|\bar{X} - \mu| \leq 1.96 \times \frac{\sigma}{\sqrt{n}}\right) \approx 0.95.$
In practice, the standard deviation σ is unknown and so we replace it by the sample standard deviation S, where S² is the sample variance, given by:
$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2.$
This gives an approximation of:
$P\left(|\bar{X} - \mu| \leq 1.96 \times \frac{S}{\sqrt{n}}\right) \approx 0.95.$


To be on the safe side, the coefficient 1.96 is often replaced by 2. The estimated
standard error of X̄ is:
$\text{E.S.E.}(\bar{X}) = \frac{S}{\sqrt{n}} = \left(\frac{1}{n(n-1)}\sum_{i=1}^{n} (X_i - \bar{X})^2\right)^{1/2}.$

Some remarks are the following.

i. The LSE is a geometrical solution – it minimises the sum of squared distances


between the estimated value and each observation. It makes no use of any
information about the underlying distribution.
ii. Taking the derivative of $\sum_{i=1}^{n} (X_i - a)^2$ with respect to a, and equating it to 0, we obtain (after dividing through by −2):
$\sum_{i=1}^{n} (X_i - a) = \sum_{i=1}^{n} X_i - na = 0.$
Hence the solution is µ̂ = â = X̄. This is another way to derive the LSE of µ.

7.7 Maximum likelihood (ML) estimation


We begin with an illustrative example. Maximum likelihood (ML) estimation
generalises the reasoning in the following example to arbitrary settings.

Example 7.10 Suppose we toss a coin 10 times, and record the number of ‘heads’
as a random variable X. Therefore:

X ∼ Bin(10, π)

where π = P (heads) ∈ (0, 1) is the unknown parameter.

If x = 8, what is your best guess (i.e. estimate) of π? Obviously 0.8!

Is π = 0.1 possible? Yes, but very unlikely.

Is π = 0.5 possible? Yes, but not very likely.

Is π = 0.7 or 0.9 possible? Yes, very likely.

Nevertheless, π = 0.8 is the most likely, or ‘maximally’ likely value of the parameter.
Why do we think ‘π = 0.8’ is most likely?
Let:
$L(\pi) = P(X = 8) = \frac{10!}{8!\,2!}\, \pi^8 (1 - \pi)^2.$
Since x = 8 is the event which actually occurred in the experiment, a plausible value of π should make this probability large. Figure 7.1 (on the next page) shows a plot of L(π) as a function of π.


The most likely value of π should make this probability as large as possible. This
value is taken as the maximum likelihood estimate of π.
Maximising L(π) is equivalent to maximising:

l(π) = ln(L(π)) = 8 ln π + 2 ln(1 − π) + c

where c is the constant ln(10!/(8! 2!)). Setting dl(π)/dπ = 0, we obtain the maximum likelihood estimate π̂ = 0.8.

Figure 7.1: Plot of the likelihood function in Example 7.10.
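A plot along the lines of Figure 7.1, together with a numerical maximisation of the log-likelihood, can be produced in R as follows (an illustrative sketch of the calculation in Example 7.10; the value returned by optimize() should be approximately 0.8).

loglik <- function(p) 8 * log(p) + 2 * log(1 - p)   # log-likelihood up to a constant
curve(choose(10, 8) * x^8 * (1 - x)^2, from = 0, to = 1,
      xlab = expression(pi), ylab = expression(L(pi)))
optimize(loglik, interval = c(0.01, 0.99), maximum = TRUE)$maximum   # approximately 0.8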

Maximum likelihood definition

Let f (x1 , x2 , . . . , xn ; θ) be the joint probability density function (or probability


function) for random variables (X1 , X2 , . . . , Xn ). The maximum likelihood estimator
(MLE) of θ based on the observations {X1 , X2 , . . . , Xn } is defined as:

$\hat{\theta} = \max_{\theta} f(X_1, X_2, \ldots, X_n; \theta).$

Some remarks are the following.

i. The MLE depends only on the observations {X1 , X2 , . . . , Xn }, such that:
$\hat{\theta} = \hat{\theta}(X_1, X_2, \ldots, X_n).$
Therefore, θ̂ is a statistic (as it must be for an estimator of θ).

ii. If {X1 , X2 , . . . , Xn } is a random sample from a population with probability density function f (x; θ), the joint probability density function for (X1 , X2 , . . . , Xn ) is:
$\prod_{i=1}^{n} f(x_i; \theta).$


The joint pdf is a function of (X1 , X2 , . . . , Xn ), while θ is a parameter.

The joint pdf describes the probability distribution of {X1 , X2 , . . . , Xn }.

The likelihood function is defined as:


$L(\theta) = \prod_{i=1}^{n} f(X_i; \theta). \quad (7.3)$

The likelihood function is a function of θ, while {X1 , X2 , . . . , Xn } are treated as


constants (as given observations).

The likelihood function reflects the information about the unknown parameter θ in
the data {X1 , X2 , . . . , Xn }.

Some remarks are the following.

i. The likelihood function is a function of the parameter. It is defined up to positive


constant factors. A likelihood function is not a probability density function. It
contains all the information about the unknown parameter from the observations.

ii. The MLE is θb = max L(θ).


θ

iii. It is often more convenient to use the log-likelihood function4 denoted as:
$l(\theta) = \ln L(\theta) = \sum_{i=1}^{n} \ln f(X_i; \theta)$

as it transforms the product in (7.3) into a sum. Note that:

$\hat{\theta} = \max_{\theta} l(\theta).$

iv. For a smooth likelihood function, the MLE is often the solution of the equation:

$\frac{d}{d\theta}\, l(\theta) = 0.$

v. If θ̂ is the MLE and φ = g(θ) is a function of θ, then φ̂ = g(θ̂) is the MLE of φ (which is known as the invariance principle of the MLE).

vi. Unlike the MME or LSE, the MLE uses all the information about the population
distribution. It is often more efficient (i.e. more accurate) than the MME or LSE.

vii. In practice, ML estimation should be used whenever possible.

4
Throughout where ‘log’ is used in log-likelihood functions, it will be assumed to be the logarithm to
the base e, i.e. the natural logarithm.


Example 7.11 Let {X1 , X2 , . . . , Xn } be a random sample from a distribution with


pdf:
$f(x; \lambda) = \begin{cases} \lambda^2 x e^{-\lambda x} & \text{for } x > 0 \\ 0 & \text{otherwise} \end{cases}$
where λ > 0 is unknown. Find the MLE of λ.
The joint pdf is $f(x_1, x_2, \ldots, x_n; \lambda) = \prod_{i=1}^{n} (\lambda^2 x_i e^{-\lambda x_i})$ if all $x_i > 0$, and 0 otherwise.
The likelihood function is:
$L(\lambda) = \lambda^{2n} \exp\left(-\lambda \sum_{i=1}^{n} X_i\right) \prod_{i=1}^{n} X_i = \lambda^{2n} \exp(-n\lambda\bar{X}) \prod_{i=1}^{n} X_i.$
The log-likelihood function is $l(\lambda) = 2n \ln\lambda - n\lambda\bar{X} + c$, where $c = \ln\prod_{i=1}^{n} X_i$ is a constant.
Setting:
$\frac{d}{d\lambda}\, l(\lambda) = \frac{2n}{\hat{\lambda}} - n\bar{X} = 0$
we obtain λ̂ = 2/X̄.
Note the MLE λ̂ may be obtained from maximising L(λ) directly. However, it is much easier to work with l(λ) instead.
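The closed-form result λ̂ = 2/X̄ can be checked numerically in R. The sketch below simulates data from f(x; λ), which is the gamma density with shape 2 and rate λ, and maximises l(λ) with optimize(); the chosen λ = 1.5 and sample size 100 are arbitrary.

set.seed(3)
lambda <- 1.5
x <- rgamma(100, shape = 2, rate = lambda)   # draws from f(x; lambda) = lambda^2 * x * exp(-lambda * x)
loglik <- function(l) 2 * length(x) * log(l) - l * sum(x)   # l(lambda) up to a constant
optimize(loglik, interval = c(0.001, 20), maximum = TRUE)$maximum
2 / mean(x)   # the closed-form MLE; the two values should agree closely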

Example 7.12 Let {X1 , X2 , . . . , Xn } be a random sample from N (µ, σ 2 ).


The joint pdf is $(2\pi\sigma^2)^{-n/2} \exp\left(-\sum_{i=1}^{n} (x_i - \mu)^2/(2\sigma^2)\right)$.

Case I: σ² is known.
The likelihood function is:
$L(\mu) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n} (X_i - \mu)^2\right) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n} (X_i - \bar{X})^2\right) \exp\left(-\frac{n}{2\sigma^2}(\bar{X} - \mu)^2\right).$
Hence the log-likelihood function is:
$l(\mu) = \ln\left(\frac{1}{(2\pi\sigma^2)^{n/2}}\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (X_i - \bar{X})^2 - \frac{n}{2\sigma^2}(\bar{X} - \mu)^2.$
Maximising l(µ) with respect to µ gives µ̂ = X̄.


Case II: σ² is unknown.
The likelihood function is:
$L(\mu, \sigma^2) = (2\pi)^{-n/2} (\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n} (X_i - \mu)^2\right).$
Hence the log-likelihood function is:
$l(\mu, \sigma^2) = -\frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (X_i - \mu)^2 + c$
where c = −(n/2) ln(2π). Regardless of the value of σ², $l(\bar{X}, \sigma^2) \geq l(\mu, \sigma^2)$. Hence µ̂ = X̄.
The MLE of σ² should maximise:
$l(\bar{X}, \sigma^2) = -\frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (X_i - \bar{X})^2 + c.$
It follows from the lemma below that $\hat{\sigma}^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2/n$.

Lemma: Let g(x) = −a ln(x) − b/x, where a, b > 0, then:
$g\left(\frac{b}{a}\right) = \max_{x>0} g(x).$
Proof: Letting $g'(x) = -a/x + b/x^2 = 0$ leads to the solution x = b/a.

Now suppose we wanted to find the MLE of γ = σ/µ.
Since γ = γ(µ, σ), by the invariance principle the MLE of γ is:
$\hat{\gamma} = \gamma(\hat{\mu}, \hat{\sigma}) = \frac{\hat{\sigma}}{\hat{\mu}} = \frac{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2/n}}{\sum_{i=1}^{n} X_i/n}.$

Example 7.13 Consider a population with three types of individuals labelled 1, 2


and 3, and occurring according to the Hardy–Weinberg proportions:
p(1; θ) = θ2 , p(2; θ) = 2θ(1 − θ) and p(3; θ) = (1 − θ)2
where 0 < θ < 1. Note that p(1; θ) + p(2; θ) + p(3; θ) = 1.
A random sample of size n is drawn from this population with n1 observed values
equal to 1 and n2 observed values equal to 2 (therefore, there are n − n1 − n2 values
equal to 3). Find the MLE of θ.
Let us assume {X1 , X2 , . . . , Xn } is the sample (i.e. n observed values). Among them,
there are n1 ‘1’s, n2 ‘2’s, and n − n1 − n2 ‘3’s. The likelihood function is (where ∝


means ‘proportional to’):
$L(\theta) = \prod_{i=1}^{n} p(X_i; \theta) = p(1; \theta)^{n_1}\, p(2; \theta)^{n_2}\, p(3; \theta)^{n - n_1 - n_2} = \theta^{2n_1} (2\theta(1-\theta))^{n_2} (1-\theta)^{2(n - n_1 - n_2)} \propto \theta^{2n_1 + n_2} (1-\theta)^{2n - 2n_1 - n_2}.$
The log-likelihood is $l(\theta) \propto (2n_1 + n_2)\ln\theta + (2n - 2n_1 - n_2)\ln(1-\theta)$.
Setting:
$\frac{d}{d\theta}\, l(\theta) = \frac{2n_1 + n_2}{\hat{\theta}} - \frac{2n - 2n_1 - n_2}{1 - \hat{\theta}} = 0$
that is:
$(1 - \hat{\theta})(2n_1 + n_2) = \hat{\theta}(2n - 2n_1 - n_2)$
leads to the MLE:
$\hat{\theta} = \frac{2n_1 + n_2}{2n}.$
For example, for a sample with n = 4, n1 = 1 and n2 = 2, we obtain a point estimate of θ̂ = 0.5.
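As a quick numerical check in R, maximising the log-likelihood for this small sample should return a value very close to the closed-form answer (2n1 + n2)/(2n) = 0.5.

n <- 4; n1 <- 1; n2 <- 2
loglik <- function(theta) (2 * n1 + n2) * log(theta) + (2 * n - 2 * n1 - n2) * log(1 - theta)
optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum   # approximately 0.5
(2 * n1 + n2) / (2 * n)                                                # closed-form MLE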

7.8 Asymptotic distribution of MLEs


Let {X1 , X2 , . . . , Xn } be a random sample from a population with a smooth pdf f (x; θ), and θ is a scalar. Denote as:
$\hat{\theta} = \hat{\theta}(X_1, X_2, \ldots, X_n)$
the MLE of θ. Under some regularity conditions, the distribution of $\sqrt{n}(\hat{\theta} - \theta)$ converges to N(0, 1/I(θ)) as n → ∞, where I(θ) is the Fisher information defined as:
$I(\theta) = -\int_{-\infty}^{\infty} \frac{\partial^2 \ln f(x; \theta)}{\partial\theta^2}\, f(x; \theta)\, dx.$
Some remarks are the following.

i. When n is large, $\hat{\theta} \sim N(\theta, (nI(\theta))^{-1})$ approximately.

ii. For a discrete distribution with probability function p(x; θ), then:
$I(\theta) = -\sum_{x} \frac{\partial^2 \ln p(x; \theta)}{\partial\theta^2}\, p(x; \theta).$

You will use this asymptotic distribution result if you study ST2134 Advanced
statistics: statistical inference.


Example 7.14 For N(µ, σ²) with σ² known, we have:
$f(x; \mu) = (2\pi\sigma^2)^{-1/2} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right).$
Therefore:
$\ln f(x; \mu) = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}(x - \mu)^2.$
Hence:
$\frac{d \ln f(x; \mu)}{d\mu} = \frac{x - \mu}{\sigma^2} \quad \text{and} \quad \frac{d^2 \ln f(x; \mu)}{d\mu^2} = -\frac{1}{\sigma^2}.$
Therefore:
$I(\mu) = -\int_{-\infty}^{\infty} \left(-\frac{1}{\sigma^2}\right) f(x; \mu)\, dx = \frac{1}{\sigma^2}.$
The MLE of µ is X̄, and hence X̄ ∼ N(µ, σ²/n).

Example 7.15 For the Poisson distribution, $p(x; \lambda) = \lambda^x e^{-\lambda}/x!$. Therefore:
$\ln p(x; \lambda) = x\ln\lambda - \lambda - \ln(x!).$
Hence:
$\frac{d \ln p(x; \lambda)}{d\lambda} = \frac{x}{\lambda} - 1 \quad \text{and} \quad \frac{d^2 \ln p(x; \lambda)}{d\lambda^2} = -\frac{x}{\lambda^2}.$
Therefore:
$I(\lambda) = \frac{1}{\lambda^2}\sum_{x=0}^{\infty} x\, p(x; \lambda) = \frac{1}{\lambda^2}\, E(X) = \frac{1}{\lambda}.$
The MLE of λ is X̄. Hence X̄ ∼ N(λ, λ/n) approximately, when n is large.
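The approximate normality of the MLE can be illustrated by simulation in R. For the Poisson case with (arbitrarily chosen) λ = 1 and n = 100, the simulated values of X̄ should have mean and variance close to λ = 1 and λ/n = 0.01, and a roughly normal histogram.

set.seed(4)
lambda <- 1; n <- 100; R <- 5000
xbar <- replicate(R, mean(rpois(n, lambda)))   # the MLE of lambda in each replication
c(mean(xbar), var(xbar))                       # approximately lambda and lambda/n
hist(xbar, breaks = 40, main = "Approximate normality of the MLE")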

7.9 Overview of chapter


This chapter introduced point estimation. Key properties of estimators were explored
and the characteristics of a desirable estimator were studied through the calculation of
the mean squared error. Methods for finding estimators of parameters were also
described, including method of moments, least squares and maximum likelihood
estimation.

7.10 Key terms and concepts


Bias; Consistent estimator; Fisher information; Information; Invariance principle; Law of large numbers (LLN); Least squares estimation; Likelihood function; Log-likelihood function; Maximum likelihood estimation; Mean absolute deviation (MAD); Mean squared error (MSE);


Method of moments estimation; Parameter; Point estimate; Point estimator; Population moment; Random sample; Sample moment; Standard error; Statistic; Unbiased.

7.11 Sample examination questions


1. Let {X1 , X2 , . . . , Xn } be a random sample from the probability distribution with
the probability density function:
$f(x; \theta) = \begin{cases} (1 + \theta x)/2 & \text{for } -1 \leq x \leq 1 \\ 0 & \text{otherwise} \end{cases}$

where −1 ≤ θ ≤ 1 is an unknown parameter.


(a) Derive the method of moments estimator of θ.

(b) Is the estimator of θ derived in part (a) biased or unbiased? Justify your
answer.

(c) Determine the variance of the estimator derived in part (a) and check whether
it is a consistent estimator of θ.

(d) Suppose n = 5, resulting in the sample:

x1 = 0.68, x2 = 0.05, x3 = 0.77, x4 = −0.65 and x5 = 0.35.

Use this sample to calculate the method of moments estimate of θ using the
estimator derived in part (a), and sketch the above probability density
function based on this estimate.

2. Let {X1 , X2 , . . . , Xn } be a random sample of size n from the following probability


density function:
$f(x; \alpha, \theta) = \frac{1}{(\alpha - 1)!\, \theta^{\alpha}}\, x^{\alpha-1} e^{-x/\theta}$
for x > 0, and 0 otherwise, where α > 0 is known, and θ > 0.

(a) Derive the maximum likelihood estimator of θ. (You do not need to verify the
solution is a maximum.)

(b) Show that the estimator derived in part (a) is mean square consistent for θ.

Hint: You may use the fact that E(X) = αθ and Var(X) = αθ2 .


7.12 Solutions to Sample examination questions


1. (a) The first population moment is:
$E(X) = \int_{-\infty}^{\infty} x\, f(x)\, dx = \int_{-1}^{1} x\, \frac{1 + \theta x}{2}\, dx = \int_{-1}^{1} \frac{x + \theta x^2}{2}\, dx = \left[\frac{x^2}{4} + \frac{\theta x^3}{6}\right]_{-1}^{1} = \frac{\theta}{3}.$
Estimating the first population moment using the first sample moment, the method of moments estimator of θ is:
$\frac{\hat{\theta}}{3} = \bar{X} \quad \Rightarrow \quad \hat{\theta} = 3\bar{X}.$
(b) Noting that E(X̄) = E(X) = θ/3, we have:
$E(\hat{\theta}) = E(3\bar{X}) = 3\, E(\bar{X}) = 3 \times \frac{\theta}{3} = \theta$
hence θ̂ is an unbiased estimator of θ.
(c) As θ̂ is an unbiased estimator of θ, we simply check whether $\text{Var}(\hat{\theta}) \to 0$ as n → ∞. We have:
$\text{Var}(\hat{\theta}) = \text{Var}(3\bar{X}) = 9\, \text{Var}(\bar{X}) = \frac{9\, \text{Var}(X)}{n}.$
Now:
$E(X^2) = \int_{-\infty}^{\infty} x^2 f(x)\, dx = \int_{-1}^{1} x^2\, \frac{1 + \theta x}{2}\, dx = \int_{-1}^{1} \frac{x^2 + \theta x^3}{2}\, dx = \left[\frac{x^3}{6} + \frac{\theta x^4}{8}\right]_{-1}^{1} = \frac{1}{3}.$
Hence:
$\text{Var}(X) = E(X^2) - (E(X))^2 = \frac{1}{3} - \frac{\theta^2}{9} \quad \Rightarrow \quad \text{Var}(\hat{\theta}) = \frac{3 - \theta^2}{n}.$
The mean squared error of θ̂ is:
$\text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + (\text{Bias}(\hat{\theta}))^2 = \frac{3 - \theta^2}{n} + 0^2 \to 0$
as n → ∞, hence θ̂ is a consistent estimator of θ.
(d) The sample mean is x̄ = 0.24, hence θ̂ = 3x̄ = 3 × 0.24 = 0.72. Therefore:
$f(x; \hat{\theta}) = \begin{cases} 0.50 + 0.36x & \text{for } -1 \leq x \leq 1 \\ 0 & \text{otherwise.} \end{cases}$
A sketch of f(x; θ̂) is a straight line rising from f(−1) = 0.14 to f(1) = 0.86 over the interval [−1, 1], and 0 outside it.
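The sketch can also be produced in R with a single call to curve():

curve(0.50 + 0.36 * x, from = -1, to = 1, xlab = "x", ylab = "f(x)")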


2. (a) For α > 0 known, due to independence the likelihood function is:
$L(\theta) = \prod_{i=1}^{n} f(x_i; \alpha, \theta) = \frac{1}{((\alpha-1)!)^n\, \theta^{n\alpha}} \left(\prod_{i=1}^{n} x_i\right)^{\alpha-1} \exp\left(-\frac{1}{\theta}\sum_{i=1}^{n} x_i\right).$
Hence the log-likelihood function is:
$l(\theta) = -n\log((\alpha-1)!) - n\alpha\log\theta + (\alpha-1)\log\left(\prod_{i=1}^{n} x_i\right) - \frac{1}{\theta}\sum_{i=1}^{n} x_i$
such that:
$\frac{d}{d\theta}\, l(\theta) = -\frac{n\alpha}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^{n} x_i.$
Equating to zero and solving for θ̂, the maximum likelihood estimator of θ is:
$\hat{\theta} = \frac{1}{n\alpha}\sum_{i=1}^{n} X_i = \frac{\bar{X}}{\alpha}.$

(b) Noting the hint, we have:
$E(\hat{\theta}) = E\left(\frac{\bar{X}}{\alpha}\right) = \frac{1}{\alpha}\, E(\bar{X}) = \frac{1}{\alpha}\, E(X) = \frac{\alpha\theta}{\alpha} = \theta$
hence θ̂ is an unbiased estimator of θ. Also:
$\text{Var}(\hat{\theta}) = \text{Var}\left(\frac{\bar{X}}{\alpha}\right) = \frac{1}{\alpha^2}\, \text{Var}(\bar{X}) = \frac{1}{n\alpha^2}\, \text{Var}(X) = \frac{\alpha\theta^2}{n\alpha^2} = \frac{\theta^2}{n\alpha}.$
Since θ̂ is unbiased and noting that $\text{Var}(\hat{\theta}) \to 0$ as n → ∞, then θ̂ is a mean square consistent estimator of θ.

The group was alarmed to find that if you are a labourer, cleaner or dock
worker, you are twice as likely to die than a member of the professional classes.
(The Sunday Times, 31 August 1980)

Chapter 8
Interval estimation

8.1 Synopsis of chapter


This chapter covers interval estimation – a natural extension of point estimation. Due
to the almost inevitable sampling error, we wish to communicate the level of
uncertainty in our point estimate by constructing confidence intervals.

8.2 Learning outcomes


After completing this chapter, you should be able to:

explain the coverage probability of a confidence interval

construct confidence intervals for means of normal and non-normal populations


when the variance is known and unknown

construct confidence intervals for the variance of a normal population

explain the link between confidence intervals and distribution theory, and critique
the assumptions made to justify the use of various confidence intervals.

8.3 Introduction
Point estimation is simple but not informative enough, since a point estimator is
always subject to errors. A more scientific approach is to find an upper bound
U = U (X1 , X2 , . . . , Xn ) and a lower bound L = L(X1 , X2 , . . . , Xn ), and hope that the
unknown parameter θ lies between the two bounds L and U (life is not always as simple
as that, but it is a good start).
An intuitive guess for estimating the population mean would be:

L = X̄ − k × S.E.(X̄) and U = X̄ + k × S.E.(X̄)

where k > 0 is a constant and S.E.(X̄) is the standard error of the sample mean.
The (random) interval (L, U ) forms an interval estimator of θ. For estimation to be
as precise as possible, intuitively the width of the interval, U − L, should be small.


Typically, the coverage probability:

P (L(X1 , X2 , . . . , Xn ) < θ < U (X1 , X2 , . . . , Xn )) < 1.

Ideally, we should choose L and U such that:

the width of the interval is as small as possible

the coverage probability is as large as possible.

8.4 Interval estimation for means of normal


distributions
Let us consider a simple example. We have a random sample {X1 , X2 , . . . , Xn } from the
distribution N (µ, σ 2 ), with σ 2 known.
From Chapter 7, we have reason to believe that X̄ is a good estimator of µ. We also
know X̄ ∼ N (µ, σ 2 /n), and hence:

$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \sim N(0, 1).$
Therefore, supposing a 95% coverage probability:
$0.95 = P\left(\frac{\sqrt{n}|\bar{X} - \mu|}{\sigma} \leq 1.96\right) = P\left(|\mu - \bar{X}| \leq 1.96 \times \frac{\sigma}{\sqrt{n}}\right)$
$= P\left(-1.96 \times \frac{\sigma}{\sqrt{n}} < \mu - \bar{X} < 1.96 \times \frac{\sigma}{\sqrt{n}}\right) = P\left(\bar{X} - 1.96 \times \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96 \times \frac{\sigma}{\sqrt{n}}\right).$

Therefore, the interval covering µ with probability 0.95 is:


 
$\left(\bar{X} - 1.96 \times \frac{\sigma}{\sqrt{n}},\ \bar{X} + 1.96 \times \frac{\sigma}{\sqrt{n}}\right)$

which is called a 95% confidence interval for µ.

Example 8.1 Suppose σ = 1, n = 4, and x̄ = 2.25, then a 95% confidence interval


for µ is:
$\left(2.25 - 1.96 \times \frac{1}{\sqrt{4}},\ 2.25 + 1.96 \times \frac{1}{\sqrt{4}}\right) = (1.27, 3.23).$
Instead of a simple point estimate of µ̂ = 2.25, we say µ is between 1.27 and 3.23 at the 95% confidence level.


What is P (1.27 < µ < 3.23) = 0.95 in Example 8.1? Well, this probability does not
mean anything, since µ is an unknown constant!
We treat (1.27, 3.23) as one realisation of the random interval (X̄ − 0.98, X̄ + 0.98)
which covers µ with probability 0.95.
What is the meaning of ‘with probability 0.95’ ? If one repeats the interval estimation a
large number of times, about 95% of the time the interval estimator covers the true µ.
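This repeated-sampling interpretation can be illustrated by simulation in R. The sketch below uses the setting of Example 8.1 (σ = 1, n = 4) with an arbitrarily chosen true mean µ = 2, and records how often the interval X̄ ± 1.96 × σ/√n covers µ; the observed proportion should be close to 0.95.

set.seed(5)
mu <- 2; sigma <- 1; n <- 4; R <- 10000
covered <- replicate(R, {
  xbar <- mean(rnorm(n, mu, sigma))
  half <- 1.96 * sigma / sqrt(n)
  (xbar - half < mu) & (mu < xbar + half)
})
mean(covered)   # should be approximately 0.95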
Some remarks are the following.

i. The confidence level is often specified as 90%, 95% or 99%. Obviously the higher
the confidence level, the wider the interval.
For the normal distribution example:
$0.90 = P\left(\frac{\sqrt{n}|\bar{X} - \mu|}{\sigma} \leq 1.645\right) = P\left(\bar{X} - 1.645 \times \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.645 \times \frac{\sigma}{\sqrt{n}}\right)$
$0.95 = P\left(\frac{\sqrt{n}|\bar{X} - \mu|}{\sigma} \leq 1.96\right) = P\left(\bar{X} - 1.96 \times \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96 \times \frac{\sigma}{\sqrt{n}}\right)$
$0.99 = P\left(\frac{\sqrt{n}|\bar{X} - \mu|}{\sigma} \leq 2.576\right) = P\left(\bar{X} - 2.576 \times \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 2.576 \times \frac{\sigma}{\sqrt{n}}\right).$
The widths of the three intervals are $2 \times 1.645 \times \sigma/\sqrt{n}$, $2 \times 1.96 \times \sigma/\sqrt{n}$ and $2 \times 2.576 \times \sigma/\sqrt{n}$, corresponding to the confidence levels of 90%, 95% and 99%, respectively.
To achieve a 100% confidence level in the normal example, the width of the interval
would have to be infinite!
ii. Among all the confidence intervals at the same confidence level, the one with the
smallest width gives the most accurate estimation and is, therefore, optimal.
iii. For a distribution with a symmetric unimodal density function, optimal confidence
intervals are symmetric, as depicted in Figure 8.1 (on the next page).

Dealing with unknown σ

In practice the standard deviation σ is typically unknown, and we replace it with the
sample standard deviation:
$S = \left(\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2\right)^{1/2}$


Figure 8.1: Symmetric unimodal density function showing that a given probability is
represented by the narrowest interval when symmetric about the mean.

leading to a confidence interval for µ of the form:
$\left(\bar{X} - k \times \frac{S}{\sqrt{n}},\ \bar{X} + k \times \frac{S}{\sqrt{n}}\right)$
where k is a constant determined by the confidence level and also by the distribution of the statistic:
$\frac{\bar{X} - \mu}{S/\sqrt{n}}. \quad (8.1)$
However, the distribution of (8.1) is no longer normal – it is the Student’s t distribution.

8.4.1 An important property of normal samples


Let {X1 , X2 , . . . , Xn } be a random sample from N(µ, σ²). Suppose:
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \quad S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2 \quad \text{and} \quad \text{E.S.E.}(\bar{X}) = \frac{S}{\sqrt{n}}$
where E.S.E.(X̄) denotes the estimated standard error of the sample mean.

i. $\bar{X} \sim N(\mu, \sigma^2/n)$ and $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$.

ii. X̄ and S² are independent, therefore:
$\frac{\sqrt{n}(\bar{X} - \mu)/\sigma}{\sqrt{(n-1)S^2/((n-1)\sigma^2)}} = \frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{\bar{X} - \mu}{\text{E.S.E.}(\bar{X})} \sim t_{n-1}.$

An accurate 100(1 − α)% confidence interval for µ, where α ∈ (0, 1), is:
$\left(\bar{X} - c \times \frac{S}{\sqrt{n}},\ \bar{X} + c \times \frac{S}{\sqrt{n}}\right) = \left(\bar{X} - c \times \text{E.S.E.}(\bar{X}),\ \bar{X} + c \times \text{E.S.E.}(\bar{X})\right)$
where c > 0 is a constant such that P(T > c) = α/2, where $T \sim t_{n-1}$.
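In R, the constant c is obtained from the quantile function of the t distribution. A sketch for a 95% interval, using an arbitrary simulated sample x, is given below; t.test() reports the same interval.

set.seed(6)
x <- rnorm(20, mean = 5, sd = 2)               # an arbitrary simulated sample
n <- length(x)
c_val <- qt(0.975, df = n - 1)                 # P(T > c) = 0.025 for T ~ t with n-1 degrees of freedom
mean(x) + c(-1, 1) * c_val * sd(x) / sqrt(n)   # 95% confidence interval for mu
t.test(x, conf.level = 0.95)$conf.int          # the same interval from t.test()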


8.5 Approximate confidence intervals

8.5.1 Means of non-normal distributions


Let {X1 , X2 , . . . , Xn } be a random sample from a non-normal distribution with mean µ
and variance σ 2 < ∞.

When n is large, $\sqrt{n}(\bar{X} - \mu)/\sigma$ is N(0, 1) approximately.
Therefore, we have an approximate 95% confidence interval for µ given by:
$\left(\bar{X} - 1.96 \times \frac{S}{\sqrt{n}},\ \bar{X} + 1.96 \times \frac{S}{\sqrt{n}}\right)$
where S is the sample standard deviation. Note that it is a two-stage approximation.

1. Approximate the distribution of $\sqrt{n}(\bar{X} - \mu)/\sigma$ by N(0, 1).
2. Approximate σ by S.

Example 8.2 The salary data of 253 graduates from a UK business school (in thousands of pounds) yield the following: n = 253, x̄ = 47.126, s = 6.843 and so $s/\sqrt{n} = 0.43$.
A point estimate of the average salary µ is x̄ = 47.126.
An approximate 95% confidence interval for µ is:
47.126 ± 1.96 × 0.43 ⇒ (46.283, 47.969).

8.5.2 MLE-based confidence intervals


Let {X1 , X2 , . . . , Xn } be a random sample from a smooth distribution with unknown parameter θ. Let θ̂ = θ̂(X1 , X2 , . . . , Xn ) be the MLE of θ.
Under some regularity conditions, it holds that $\hat{\theta} \sim N(\theta, (nI(\theta))^{-1})$ approximately, when n is large, where I(θ) is the Fisher information.
This leads to the following approximate 95% confidence interval for θ:
$\left(\hat{\theta} - 1.96 \times (nI(\hat{\theta}))^{-1/2},\ \hat{\theta} + 1.96 \times (nI(\hat{\theta}))^{-1/2}\right).$

You will use these MLE-based confidence intervals if you study ST2134 Advanced
statistics: statistical inference.

8.6 Use of the chi-squared distribution


Let X1 , X2 , . . . , Xn be independent N (µ, σ 2 ) random variables. Therefore:
$\frac{X_i - \mu}{\sigma} \sim N(0, 1).$

Hence:
$\frac{1}{\sigma^2}\sum_{i=1}^{n} (X_i - \mu)^2 \sim \chi^2_n.$
Note that:
$\frac{1}{\sigma^2}\sum_{i=1}^{n} (X_i - \mu)^2 = \frac{1}{\sigma^2}\sum_{i=1}^{n} (X_i - \bar{X})^2 + \frac{n(\bar{X} - \mu)^2}{\sigma^2}. \quad (8.2)$
Proof: We have:
$\sum_{i=1}^{n} (X_i - \mu)^2 = \sum_{i=1}^{n} \left((X_i - \bar{X}) + (\bar{X} - \mu)\right)^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 + \sum_{i=1}^{n} (\bar{X} - \mu)^2 + 2\sum_{i=1}^{n} (X_i - \bar{X})(\bar{X} - \mu)$
$= \sum_{i=1}^{n} (X_i - \bar{X})^2 + n(\bar{X} - \mu)^2 + 2(\bar{X} - \mu)\sum_{i=1}^{n} (X_i - \bar{X}) = \sum_{i=1}^{n} (X_i - \bar{X})^2 + n(\bar{X} - \mu)^2.$
Hence:
$\frac{1}{\sigma^2}\sum_{i=1}^{n} (X_i - \mu)^2 = \frac{1}{\sigma^2}\sum_{i=1}^{n} (X_i - \bar{X})^2 + \frac{n(\bar{X} - \mu)^2}{\sigma^2}.$
Since $\bar{X} \sim N(\mu, \sigma^2/n)$, then $n(\bar{X} - \mu)^2/\sigma^2 \sim \chi^2_1$. It can be proved that:
$\frac{1}{\sigma^2}\sum_{i=1}^{n} (X_i - \bar{X})^2 \sim \chi^2_{n-1}.$
Therefore, decomposition (8.2) is an instance of the relationship:
$\chi^2_n = \chi^2_{n-1} + \chi^2_1.$

8.7 Interval estimation for variances of normal


distributions
Let {X1 , X2 , . . . , Xn } be a random sample from N(µ, σ²).
Let $M = \sum_{i=1}^{n} (X_i - \bar{X})^2 = (n-1)S^2$, then $M/\sigma^2 \sim \chi^2_{n-1}$.
For any given small α ∈ (0, 1), we can find $0 < k_1 < k_2$ such that:
$P(X < k_1) = P(X > k_2) = \frac{\alpha}{2}$

where $X \sim \chi^2_{n-1}$. Therefore:
$1 - \alpha = P\left(k_1 < \frac{M}{\sigma^2} < k_2\right) = P\left(\frac{M}{k_2} < \sigma^2 < \frac{M}{k_1}\right).$
Hence a 100(1 − α)% confidence interval for σ² is:
$\left(\frac{M}{k_2},\ \frac{M}{k_1}\right).$

Example 8.3 Suppose n = 15 and the sample variance is s2 = 24.5. Let α = 0.05.
From Table 8 of the New Cambridge Statistical Tables, we find:

P (X < 5.629) = P (X > 26.12) = 0.025

where $X \sim \chi^2_{14}$.
Hence a 95% confidence interval for σ² is:
$\left(\frac{M}{26.12},\ \frac{M}{5.629}\right) = \left(\frac{14 \times S^2}{26.12},\ \frac{14 \times S^2}{5.629}\right) = (0.536 \times S^2,\ 2.487 \times S^2) = (13.132, 60.934).$
In the above calculation, we have used the formula:
$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{1}{n-1} \times M.$
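The chi-squared percentiles used in Example 8.3 can also be obtained from R rather than from statistical tables; the short check below should reproduce the interval up to rounding.

n <- 15; s2 <- 24.5
k1 <- qchisq(0.025, df = n - 1)   # approximately 5.629
k2 <- qchisq(0.975, df = n - 1)   # approximately 26.12
M <- (n - 1) * s2
c(M / k2, M / k1)                 # approximately (13.1, 60.9)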

8.8 Overview of chapter


This chapter covered interval estimation. A confidence interval converts a point
estimate of an unknown parameter into an interval estimate, reflecting the likely
sampling error. The chapter demonstrated how to construct confidence intervals for
means and variances of normal populations.

8.9 Key terms and concepts


Confidence interval; Coverage probability; Interval estimator; Interval width.

8.10 Sample examination questions


1. Let {X1 , X2 , . . . , Xn } be a random sample from N (µ, σ 2 ), where σ 2 is unknown.
Derive the endpoints of an accurate 100(1 − α)% confidence interval for µ in this


situation, where α ∈ (0, 1).

2. Suppose that the binomial distribution parameter π is to be estimated by


P = X/n, where X is the number of successes in n independent trials, i.e. P is the
sample proportion of successes.

(a) Write down the endpoints of an approximate 100(1 − α)% confidence interval
for π, stating any necessary conditions which should be satisfied for such an
approximate confidence interval to be used. You should also state the
approximate sampling distribution of P = X/n.

(b) Suppose we are willing to assume that π ≤ 0.40. What is the smallest n for
which P will have an approximate 99% probability of being within 0.05 of π?

8.11 Solutions to Sample examination questions


1. We have:
$1 - \alpha = P\left(-t_{\alpha/2, n-1} \leq \frac{\bar{X} - \mu}{S/\sqrt{n}} \leq t_{\alpha/2, n-1}\right) = P\left(-t_{\alpha/2, n-1} \times \frac{S}{\sqrt{n}} \leq \bar{X} - \mu \leq t_{\alpha/2, n-1} \times \frac{S}{\sqrt{n}}\right)$
$= P\left(-t_{\alpha/2, n-1} \times \frac{S}{\sqrt{n}} < \mu - \bar{X} < t_{\alpha/2, n-1} \times \frac{S}{\sqrt{n}}\right) = P\left(\bar{X} - t_{\alpha/2, n-1} \times \frac{S}{\sqrt{n}} < \mu < \bar{X} + t_{\alpha/2, n-1} \times \frac{S}{\sqrt{n}}\right).$
Hence an accurate 100(1 − α)% confidence interval for µ, where α ∈ (0, 1), is:
$\left(\bar{X} - t_{\alpha/2, n-1} \times \frac{S}{\sqrt{n}},\ \bar{X} + t_{\alpha/2, n-1} \times \frac{S}{\sqrt{n}}\right).$

2. (a) An approximate 100(1 − α)% confidence interval for π has endpoints:
$P \pm z_{\alpha/2} \times \sqrt{\frac{P(1-P)}{n}}.$
This is an adequate approximation for large n by the central limit theorem, since P = X/n is a special case of the sample mean under Bernoulli sampling. Hence:
$P = \frac{X}{n} = \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) = N\left(\pi, \frac{\pi(1-\pi)}{n}\right)$
approximately, as n → ∞.


(b) Take n to be the smallest integer such that:
$n \geq \frac{(z_{0.005})^2\, \pi(1-\pi)}{(0.05)^2} = \frac{(2.576)^2 \times 0.40 \times 0.60}{(0.05)^2} = 637.03$
hence n = 638.
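The same sample-size calculation can be reproduced in R as a one-line check, using the tabulated value z₀.₀₀₅ = 2.576 as above:

ceiling(2.576^2 * 0.40 * 0.60 / 0.05^2)   # gives 638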

A statistician took the Dale Carnegie Course, improving his confidence from
95% to 99%.
(Anon)

Chapter 9
Hypothesis testing

9.1 Synopsis of chapter


This chapter discusses hypothesis testing which is used to answer questions about an
unknown parameter. We consider how to perform an appropriate hypothesis test for a
given problem, determine error probabilities and test power, and draw appropriate
conclusions from a hypothesis test.

9.2 Learning outcomes


After completing this chapter, you should be able to:

define and apply the terminology of hypothesis testing

conduct statistical tests of all the types covered in the chapter

calculate the power of some of the simpler tests

explain the construction of rejection regions as a consequence of prior distributional


results, with reference to the significance level and power.

9.3 Introduction
Hypothesis testing and statistical estimation are the two most frequently-used
statistical inference methods. Hypothesis testing addresses a different type of practical
question from statistical estimation.
Based on the data, a (statistical) test is to make a binary decision on a hypothesis,
denoted by H0 :
reject H0 or not reject H0 .

9.4 Introductory examples

Example 9.1 Consider a simple experiment – toss a coin 20 times.


Let {X1 , X2 , . . . , X20 } be the outcomes where ‘heads’ → Xi = 1, and ‘tails’
→ Xi = 0.
Hence the probability distribution is P (Xi = 1) = π = 1 − P (Xi = 0), for π ∈ (0, 1).


Estimation would involve estimating π, using π̂ = X̄ = (X1 + X2 + · · · + X20)/20.
Testing involves assessing if a hypothesis such as ‘the coin is fair’ is true or not. For
example, this particular hypothesis can be formally represented as:

H0 : π = 0.50.

We cannot be sure what the answer is just from the data.

If π̂ = 0.90, H0 is unlikely to be true.

If π̂ = 0.45, H0 may be true (and also may be untrue).

If π̂ = 0.70, what to do then?

Example 9.2 A customer complains that the amount of coffee powder in a coffee
tin is less than the advertised weight of 3 pounds.
A random sample of 20 tins is selected, resulting in an average weight of x̄ = 2.897
pounds. Is this sufficient to substantiate the complaint?
Again statistical estimation cannot provide a firm answer, due to random
fluctuations between different random samples. So we cast the problem into a
hypothesis testing problem as follows.
Let the weight of coffee in a tin be a normal random variable X ∼ N (µ, σ 2 ). We
need to test the hypothesis µ < 3. In fact, we use the data to test the hypothesis:

H0 : µ = 3.

If we could reject H0 , the customer complaint would be vindicated.

Example 9.3 Suppose one is interested in evaluating the mean income (in £000s)
of a community. Suppose income in the population is modelled as N (µ, 25) and a
random sample of n = 25 observations is taken, yielding the sample mean x̄ = 17.
Independently of the data, three expert economists give their own opinions as
follows.

Dr A claims the mean income is µ = 16.

Ms B claims the mean income is µ = 15.

Mr C claims the mean income is µ = 14.

How would you assess these experts’ statements?


X̄ ∼ N (µ, σ 2 /n) = N (µ, 1). We assess the statements based on this distribution.
If Dr A’s claim is correct, X̄ ∼ N (16, 1). The observed value x̄ = 17 is one standard
deviation away from µ, and may be regarded as a typical observation from the
distribution. Hence there is little inconsistency between the claim and the data
evidence. This is shown in Figure 9.1 (on the next page).


If Ms B’s claim is correct, X̄ ∼ N (15, 1). The observed value x̄ = 17 begins to look a
bit ‘extreme’, as it is two standard deviations away from µ. Hence there is some
inconsistency between the claim and the data evidence. This is shown in Figure 9.2.
If Mr C’s claim is correct, X̄ ∼ N (14, 1). The observed value x̄ = 17 is very extreme,
as it is three standard deviations away from µ. Hence there is strong inconsistency
between the claim and the data evidence. This is shown in Figure 9.3.

Figure 9.1: Comparison of claim and data evidence for Dr A in Example 9.3.

Figure 9.2: Comparison of claim and data evidence for Ms B in Example 9.3.

Figure 9.3: Comparison of claim and data evidence for Mr C in Example 9.3.


9.5 Setting p-value, significance level, test statistic


A measure of the discrepancy between the hypothesised (claimed) value of µ and the
observed value X̄ = x̄ is the probability of observing X̄ = x̄ or more extreme values
under the null hypothesis. This probability is called the p-value.

Example 9.4 Continuing Example 9.3:

under H0 : µ = 16, P (X̄ ≥ 17) + P (X̄ ≤ 15) = P (|X̄ − 16| ≥ 1) = 0.317

under H0 : µ = 15, P (X̄ ≥ 17) + P (X̄ ≤ 13) = P (|X̄ − 15| ≥ 2) = 0.046

under H0 : µ = 14, P (X̄ ≥ 17) + P (X̄ ≤ 11) = P (|X̄ − 14| ≥ 3) = 0.003.

In summary, we reject the hypothesis µ = 15 or µ = 14, as, for example, if the


hypothesis µ = 14 is true, the probability of observing x̄ = 17, or more extreme
values, would be as small as 0.003. We are comfortable with this decision, as a small
probability event would be very unlikely to occur in a single experiment.
On the other hand, we cannot reject the hypothesis µ = 16. However, this does not
imply that this hypothesis is necessarily true as, for example, µ = 17 or 18 are at
least as likely as µ = 16. Remember:

not reject 6= accept.

A statistical test is incapable of ‘accepting’ a hypothesis.

Definition of p-values

A p-value is the probability of the event that the test statistic takes the observed
value or more extreme (i.e. more unlikely) values under H0 . It is a measure of the
discrepancy between the hypothesis H0 and the data.

• A ‘small’ p-value indicates that H0 is not supported by the data.

• A ‘large’ p-value indicates that H0 is not inconsistent with the data.

So p-values may be seen as a risk measure of rejecting H0 , as shown in Figure 9.4.

9.5.1 General setting of hypothesis tests


Let {X1 , X2 , . . . , Xn } be a random sample from a distribution with cdf F (x; θ). We are
interested in testing the hypotheses:
H0 : θ = θ0 vs. H1 : θ ∈ Θ1
where θ0 is a fixed value, Θ1 is a set, and θ0 ∉ Θ1.

H0 is called the null hypothesis.


H1 is called the alternative hypothesis.


Figure 9.4: Interpretation of p-values as a risk measure.

The significance level is based on α, which is a small number between 0 and 1


selected subjectively. Often we choose α = 0.10, 0.05 or 0.01, i.e. tests are often
conducted at the significance levels of 10%, 5% or 1%, respectively. So we test at the
100α% significance level.
Our decision is to reject H0 if the p-value is ≤ α.

9.5.2 Statistical testing procedure


1. Find a test statistic T = T (X1 , X2 , . . . , Xn ). Denote by t the value of T for the
given sample of observations under H0 .
2. Compute the p-value:
p = Pθ0 (T = t or more ‘extreme’ values)
where Pθ0 denotes the probability distribution such that θ = θ0 .
3. If p ≤ α we reject H0 . Otherwise, H0 is not rejected.

Our understanding of ‘extremity’ is defined by the alternative hypothesis H1 . This will


become clear in subsequent examples. The significance level determines which p-values
are considered ‘small’.

Example 9.5 Let {X1 , X2 , . . . , X20 }, taking values either 1 or 0, be the outcomes
of an experiment of tossing a coin 20 times, where:

P (Xi = 1) = π = 1 − P (Xi = 0) for π ∈ (0, 1).

We are interested in testing:

H0 : π = 0.50 vs. H1 : π ≠ 0.50.

Suppose there are 17 Xi s taking the value 1, and 3 Xi s taking the value 0. Will you
reject the null hypothesis at the 5% significance level?


Let T = X1 + X2 + · · · + X20 . Therefore, T ∼ Bin(20, π). We use T as the test


statistic. With the given sample, we observe t = 17. What are the more extreme
values of T if H0 is true?
Under H0 , E(T ) = nπ0 = 10. Hence 3 is as extreme as 17, and the more extreme
values are:
0, 1, 2, 18, 19 and 20.
Therefore, the p-value is:
$\sum_{i=0}^{3} P_{H_0}(T = i) + \sum_{i=17}^{20} P_{H_0}(T = i) = \sum_{i=0}^{3} \frac{20!}{i!\,(20-i)!}(0.50)^{i}(1-0.50)^{20-i} + \sum_{i=17}^{20} \frac{20!}{i!\,(20-i)!}(0.50)^{i}(1-0.50)^{20-i}$
$= 2 \times (0.50)^{20} \times \left(1 + 20 + \frac{20 \times 19}{2!} + \frac{20 \times 19 \times 18}{3!}\right) = 0.0026.$
So we reject the null hypothesis of a fair coin at the 5% significance level (indeed, we would also reject it at the 1% significance level).
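The p-value can be confirmed in R, either directly from the binomial cdf or with binom.test(), which carries out this exact two-sided test:

2 * pbinom(3, size = 20, prob = 0.5)      # approximately 0.0026
binom.test(17, n = 20, p = 0.5)$p.value   # the same two-sided p-value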

9.5.3 Two-sided tests for normal means


Let {X1 , X2 , . . . , Xn } be a random sample from N (µ, σ 2 ). Assume σ 2 > 0 is known. We
are interested in testing the hypotheses:
H0 : µ = µ0 vs. H1 : µ ≠ µ0
where µ0 is a given constant.
Intuitively, if H0 is true, $\bar{X} = \sum_i X_i/n$ should be close to µ0. Therefore, large values of |X̄ − µ0| suggest a departure from H0.
Under H0, $\bar{X} \sim N(\mu_0, \sigma^2/n)$, i.e. $\sqrt{n}(\bar{X} - \mu_0)/\sigma \sim N(0, 1)$. Hence the test statistic may be defined as:
$T = \frac{\sqrt{n}(\bar{X} - \mu_0)}{\sigma} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1)$
and we reject H0 for sufficiently ‘large’ values of |T|.
How large is ‘large’? This is determined by the significance level.
Suppose µ0 = 3, σ = 0.148, n = 20 and x̄ = 2.897. Therefore, the observed value of T is $t = \sqrt{20} \times (2.897 - 3)/0.148 = -3.112$. Hence the p-value is:
$P_{\mu_0}(|T| \geq 3.112) = P(|Z| > 3.112) = 0.0019$
where Z ∼ N (0, 1). Therefore, the null hypothesis of µ = 3 will be rejected even at the
1% significance level.
Alternatively, for a given 100α% significance level we may find the critical value cα
such that Pµ0 (|T | > cα ) = α. Therefore, the p-value is ≤ α if and only if the observed
value of |T | ≥ cα .


Using this alternative approach, we do not need to compute the p-value.


For this example, cα = zα/2 , that is the top 100α/2th percentile of N (0, 1), i.e. the
z-value which cuts off α/2 probability in the upper tail of the standard normal
distribution.
For α = 0.10, 0.05 and 0.01, zα/2 = 1.645, 1.96 and 2.576, respectively. Since we observe
|t| = 3.112, the null hypothesis is rejected at all three significance levels.
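For completeness, the test statistic and two-sided p-value above can be computed directly in R (the same numbers as in the example):

xbar <- 2.897; mu0 <- 3; sigma <- 0.148; n <- 20
z <- sqrt(n) * (xbar - mu0) / sigma   # approximately -3.112
2 * pnorm(-abs(z))                    # two-sided p-value, approximately 0.0019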

9.5.4 One-sided tests for normal means


Let {X1 , X2 , . . . , Xn } be a random sample from N (µ, σ 2 ) with σ 2 > 0 known. We are
interested in testing the hypotheses:

H0 : µ = µ0 vs. H1 : µ < µ0

where µ0 is a known constant.



Under H0, $T = \sqrt{n}(\bar{X} - \mu_0)/\sigma \sim N(0, 1)$. We continue to use T as the test statistic. For
H1 : µ < µ0 we should reject H0 when t ≤ c, where c < 0 is a constant.
For a given 100α% significance level, the critical value c should be chosen such that:

α = Pµ0 (T ≤ c) = P (Z ≤ c).

Therefore, c is the 100αth percentile of N (0, 1). Due to the symmetry of N (0, 1),
c = −zα , where zα is the top 100αth percentile of N (0, 1), i.e. P (Z > zα ) = α, where
Z ∼ N (0, 1). For α = 0.05, zα = 1.645. We reject H0 if t ≤ −1.645.

Example 9.6 Suppose µ0 = 3, σ = 0.148, n = 20 and x̄ = 2.897, then:

t = √20 × (2.897 − 3)/0.148 = −3.112 < −1.645.
So the null hypothesis of µ = 3 is rejected at the 5% significance level as there is
significant evidence from the data that the true mean is likely to be smaller than 3.

Some remarks are the following.

i. We use a one-tailed test when we are only interested in the departure from H0 in
one direction.

ii. The distribution of a test statistic under H0 must be known in order to calculate
p-values or critical values.

iii. A test may be carried out by either computing the p-value or determining the
critical value.

iv. The probability of incorrect decisions in hypothesis testing is typically positive. For
example, the significance level is the probability of rejecting a true H0 .


9.6 t tests
t tests are one of the most frequently-used statistical tests.
Let {X1 , X2 , . . . , Xn } be a random sample from N (µ, σ 2 ), where both µ and σ 2 > 0 are
unknown. We are interested in testing the hypotheses:

H0 : µ = µ0 vs. H1 : µ < µ0

where µ0 is known.

Now we cannot use √n(X̄ − µ0)/σ as a test statistic, since σ is unknown. Naturally we replace it by S, where:

S^2 = (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄)^2.

The test statistic is then the famous t statistic:

T = √n(X̄ − µ0)/S = (X̄ − µ0)/(S/√n) = √n(X̄ − µ0) / ( (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄)^2 )^{1/2}.

We reject H0 if t < c, where c is the critical value determined by the significance level:

PH0 (T < c) = α

where PH0 denotes the distribution under H0 (with mean µ0 and unknown σ 2 ).
Under H0 , T ∼ tn−1 . Hence:
α = PH0 (T < c)
i.e. c is the 100αth percentile of the t distribution with n − 1 degrees of freedom. By
symmetry, c = −tα, n−1 , where tα, k denotes the top 100αth percentile of the tk
distribution.

Example 9.7 To deal with the customer complaint that the average amount of
coffee powder in a coffee tin is less than the advertised 3 pounds, 20 tins were
weighed, yielding the following observations:

2.82, 3.01, 3.11, 2.71, 2.93, 2.68, 3.02, 3.01, 2.93, 2.56,
2.78, 3.01, 3.09, 2.94, 2.82, 2.81, 3.05, 3.01, 2.85, 2.79.

The sample mean and standard deviation are, respectively:

x̄ = 2.897 and s = 0.148.

To test H0 : µ = 3 vs. H1 : µ < 3 at the 1% significance level, the critical value is


c = −t0.01, 19 = −2.539.

Since t = 20 × (2.897 − 3)/0.148 = −3.112 < −2.539, we reject the null hypothesis
that µ = 3 at the 1% significance level.
We conclude that there is highly significant evidence which supports the claim that
the mean amount of coffee is less than 3 pounds.


Note the hypotheses tested are in fact:

H0 : µ = µ0, σ^2 > 0 vs. H1 : µ < µ0, σ^2 > 0.

Although H0 does not specify the population distribution completely (σ 2 > 0), the
distribution of the test statistic, T , under H0 is completely known. This enables us
to find the critical value or p-value.
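
For readers working in R, the t test of Example 9.7 can be run directly on the 20 observations; this is a sketch rather than part of the original example:

# One-sample t test for the coffee tin data, lower-tailed alternative.
coffee <- c(2.82, 3.01, 3.11, 2.71, 2.93, 2.68, 3.02, 3.01, 2.93, 2.56,
            2.78, 3.01, 3.09, 2.94, 2.82, 2.81, 3.05, 3.01, 2.85, 2.79)
t.test(coffee, mu = 3, alternative = "less")  # t is about -3.11, p-value below 0.01
qt(0.01, df = 19)                             # critical value -t_{0.01,19} = -2.539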

9.7 General approach to statistical tests


Let {X1 , X2 , . . . , Xn } be a random sample from the distribution F (x; θ). We are
interested in testing:
H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1
where Θ0 and Θ1 are two non-overlapping sets. A general approach to test the above
hypotheses at the 100α% significance level may be described as follows.

1. Find a test statistic T = T (X1 , X2 , . . . , Xn ) such that the distribution of T under


H0 is known.

2. Identify a critical region C such that:

PH0 (T ∈ C) = α.

3. If the observed value of T with the given sample is in the critical region C, H0 is
rejected. Otherwise, H0 is not rejected.

In order to make a test powerful in the sense that the chance of making an incorrect
decision is small, the critical region should consist of those values of T which are least
supportive of H0 (i.e. which lie in the direction of H1 ).

9.8 Two types of error


Statistical tests are often associated with two kinds of decision errors, which are
displayed in the following table:

Decision made
H0 not rejected H0 rejected
True state H0 true Correct decision Type I error
of nature H1 true Type II error Correct decision

Some remarks are the following.

i. Ideally we would like to have a test which minimises the probabilities of making
both types of error, which unfortunately is not feasible.


ii. The probability of making a Type I error is the significance level, which is under
our control.
iii. We do not have explicit control over the probability of a Type II error. For a given
significance level, we try to choose a test statistic such that the probability of a
Type II error is small.
iv. The power function of the test is defined as:
β(θ) = Pθ (H0 is rejected) for θ ∈ Θ1
i.e. β(θ) = 1 − P (Type II error).
v. The null hypothesis H0 and the alternative hypothesis H1 are not treated equally in
a statistical test, i.e. there is an asymmetric treatment. The choice of H0 is based
on the subject matter concerned and/or technical convenience.
vi. It is more conclusive to end a test with H0 rejected, as the decision of ‘not reject
H0 ’ does not imply that H0 is accepted.

9.9 Tests for variances of normal distributions

Example 9.8 A container-filling machine is used to package milk cartons of 1 litre


(= 1,000 cm3 ). Ideally, the amount of milk should only vary slightly. The company
which produced the filling machine claims that the variance of the milk content is
not greater than 1 cm3 . To examine the veracity of the claim, a random sample of 25
cartons is taken, resulting in 25 measurements (in cm3 ) as follows:

1,000.3, 1,001.3, 999.5, 999.7, 999.3,


999.8, 998.3, 1,000.6, 999.7, 999.8,
1,001.0, 999.4, 999.5, 998.5, 1,000.7,
999.6, 999.8, 1,000.0, 998.2, 1,000.1,
998.1, 1,000.7, 999.8, 1,001.3, 1,000.7.

Do these data support the claim of the company?

Turning Example 9.8 into a statistical problem, we assume that the data form a random
sample from N (µ, σ 2 ). We are interested in testing the hypotheses:
H0 : σ^2 = σ_0^2 vs. H1 : σ^2 > σ_0^2.

Let S^2 = Σ_{i=1}^{n} (X_i − X̄)^2/(n − 1); then (n − 1)S^2/σ^2 ∼ χ^2_{n−1}. Under H0 we have:

T = (n − 1)S^2/σ_0^2 = Σ_{i=1}^{n} (X_i − X̄)^2 / σ_0^2 ∼ χ^2_{n−1}.
Since we will reject H0 against an alternative hypothesis σ 2 > σ02 , we should reject H0
for large values of T .


H0 is rejected if t > χ2α, n−1 , where χ2α, n−1 denotes the top 100αth percentile of the χ2n−1
distribution, i.e. we have:
P (T ≥ χ2α, n−1 ) = α.
For any σ^2 > σ_0^2, the power of the test at σ is:

β(σ) = P_σ(H0 is rejected)
     = P_σ(T > χ^2_{α, n−1})
     = P_σ((n − 1)S^2/σ_0^2 > χ^2_{α, n−1})
     = P_σ((n − 1)S^2/σ^2 > (σ_0^2/σ^2) × χ^2_{α, n−1})

which is greater than α, as σ_0^2/σ^2 < 1, where (n − 1)S^2/σ^2 ∼ χ^2_{n−1} when σ^2 is the true variance, instead of σ_0^2. Note that here 1 − β(σ) is the probability of a Type II error.
Suppose we choose α = 0.05. For n = 25, χ2α, n−1 = χ20.05, 24 = 36.415.
With the given sample, s2 = 0.8088 and σ02 = 1, t = 24 × 0.8088 = 19.41 < χ20.05, 24 .
Hence we do not reject H0 at the 5% significance level. There is no significant evidence
from the data against the company's claim that the variance does not exceed 1.
With σ_0^2 = 1, the power function is:

β(σ) = P((n − 1)S^2/σ^2 > χ^2_{0.05, 24}/σ^2) = P((n − 1)S^2/σ^2 > 36.415/σ^2)

where (n − 1)S^2/σ^2 ∼ χ^2_{24}.


For any given values of σ 2 , we may compute β(σ). We list some specific values next.

σ^2                       1        1.5      2        3        4
χ^2_{0.05, 24}/σ^2        36.415   24.277   18.208   12.138   9.104
β(σ)                      0.05     0.446    0.793    0.978    0.997
Approximate β(σ)          0.05     0.40     0.80     0.975    0.995

Clearly, β(σ) increases as σ^2 increases. Intuitively, it is easier to reject H0 : σ^2 = 1 if the true population, which generates the data, has a larger variance σ^2.
Due to the sparsity of the available χ2 tables, we may only obtain some approximate
values for β(σ) – see the entries in the last row in the above table. The more accurate
values of β(σ) were calculated using a computer.
Some remarks are the following.

i. The significance level is selected subjectively by the statistician. To make the


conclusion more convincing in the above example, we may use α = 0.10 instead. As
χ20.10, 24 = 33.196, H0 is not rejected at the 10% significance level. In fact the p-value
is:
PH0 (T ≥ 19.41) = 0.73
where T ∼ χ224 .


ii. As σ 2 increases, the power function β(σ) also increases.


iii. For H1 : σ^2 ≠ σ_0^2, we should reject H0 if:
t ≤ χ^2_{1−α/2, n−1} or t ≥ χ^2_{α/2, n−1}

where χ2α, k denotes the top 100αth percentile of the χ2k distribution.
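
A sketch in R of the computations in Example 9.8, together with the power values tabulated above:

# Chi-squared test for a normal variance (Example 9.8, milk cartons).
milk <- c(1000.3, 1001.3, 999.5, 999.7, 999.3, 999.8, 998.3, 1000.6, 999.7, 999.8,
          1001.0, 999.4, 999.5, 998.5, 1000.7, 999.6, 999.8, 1000.0, 998.2, 1000.1,
          998.1, 1000.7, 999.8, 1001.3, 1000.7)
n <- length(milk)
t <- (n - 1) * var(milk) / 1        # sigma_0^2 = 1; t is about 19.41
qchisq(0.95, df = n - 1)            # critical value 36.415
1 - pchisq(t, df = n - 1)           # p-value, about 0.73
# Power at a true variance sigma^2: P(chi^2_24 > chi^2_{0.05,24} / sigma^2).
power <- function(sigma2) 1 - pchisq(qchisq(0.95, 24) / sigma2, df = 24)
power(c(1, 1.5, 2, 3, 4))           # matches the table of beta(sigma) values above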

9.10 Summary: tests for µ and σ 2 in N (µ, σ 2)

Null hypothesis, H0           µ = µ0 (σ^2 known)    µ = µ0                 σ^2 = σ_0^2

Test statistic, T             (X̄ − µ0)/(σ/√n)      (X̄ − µ0)/(S/√n)       (n − 1)S^2/σ_0^2

Distribution of T under H0    N(0, 1)               t_{n−1}                χ^2_{n−1}

In the above table, X̄ = Σ_{i=1}^{n} X_i/n, S^2 = Σ_{i=1}^{n} (X_i − X̄)^2/(n − 1), and {X1, X2, . . . , Xn} is a random sample from N(µ, σ^2).

9.11 Comparing two normal means with paired


observations
Suppose that the observations are paired:
(X1 , Y1 ), (X2 , Y2 ), . . . , (Xn , Yn )
where all Xi s and Yi s are independent, Xi ∼ N(µX, σ_X^2) and Yi ∼ N(µY, σ_Y^2).
We are interested in testing the hypothesis:
H0 : µX = µY . (9.1)

Example 9.9 The following are some practical examples.

Is the increased marketing budget improving sales?

Are customers willing to pay more for the new product than the old one?

Does TV advertisement A have higher average effectiveness than advertisement


B?

Will promotion method A generate higher sales than method B?


Observations are paired together for good reasons: before-after, A-vs.-B (from the
same subject).

Let Zi = Xi − Yi, for i = 1, 2, . . . , n; then {Z1, Z2, . . . , Zn} is a random sample from the population N(µ, σ^2), where:

µ = µX − µY and σ^2 = σ_X^2 + σ_Y^2.

The hypothesis (9.1) can also be expressed as:

H0 : µ = 0.

Therefore, we should use the test statistic T = √n Z̄/S, where Z̄ and S^2 denote, respectively, the sample mean and the sample variance of {Z1, Z2, . . . , Zn}.
At the 100α% significance level, for α ∈ (0, 1), we reject the hypothesis µX = µY when:

|t| > tα/2, n−1, if the alternative is H1 : µX ≠ µY

t > tα, n−1 , if the alternative is H1 : µX > µY

t < −tα, n−1 , if the alternative is H1 : µX < µY

where P (T > tα, n−1 ) = α, for T ∼ tn−1 .

9.11.1 Power functions of the test


Consider the case of testing H0 : µX = µY vs. H1 : µX > µY only. For µ = µX − µY > 0,
we have:

β(µ) = P_µ(H0 is rejected)
     = P_µ(T > tα, n−1)
     = P_µ(√n Z̄/S > tα, n−1)
     = P_µ(√n(Z̄ − µ)/S > tα, n−1 − √n µ/S)

where √n(Z̄ − µ)/S ∼ t_{n−1} under the distribution represented by P_µ.
Note that for µ > 0, β(µ) > α. Furthermore, β(µ) increases as µ increases.

9.12 Comparing two normal means


Let {X1, X2, . . . , Xn} and {Y1, Y2, . . . , Ym} be two independent random samples drawn from, respectively, N(µX, σ_X^2) and N(µY, σ_Y^2). We seek to test hypotheses on µX − µY.
We cannot pair the two samples together, because of the different sample sizes n and m.


Let the sample means be X̄ = Σ_{i=1}^{n} X_i/n and Ȳ = Σ_{i=1}^{m} Y_i/m, and the sample variances be:

S_X^2 = (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄)^2 and S_Y^2 = (1/(m − 1)) Σ_{i=1}^{m} (Y_i − Ȳ)^2.

Some remarks are the following.

X̄, Ȳ, S_X^2 and S_Y^2 are independent.

X̄ ∼ N(µX, σ_X^2/n) and (n − 1)S_X^2/σ_X^2 ∼ χ^2_{n−1}.

Ȳ ∼ N(µY, σ_Y^2/m) and (m − 1)S_Y^2/σ_Y^2 ∼ χ^2_{m−1}.

Hence X̄ − Ȳ ∼ N(µX − µY, σ_X^2/n + σ_Y^2/m). If σ_X^2 = σ_Y^2, then:

[ (X̄ − Ȳ − (µX − µY)) / √(σ_X^2/n + σ_Y^2/m) ] / √( ((n − 1)S_X^2/σ_X^2 + (m − 1)S_Y^2/σ_Y^2)/(n + m − 2) )

= √((n + m − 2)/(1/n + 1/m)) × (X̄ − Ȳ − (µX − µY)) / √((n − 1)S_X^2 + (m − 1)S_Y^2) ∼ t_{n+m−2}.

9.12.1 Tests on µX − µY with known σ_X^2 and σ_Y^2
Suppose we are interested in testing:

H0 : µX = µY vs. H1 : µX ≠ µY.

Note that:

(X̄ − Ȳ − (µX − µY)) / √(σ_X^2/n + σ_Y^2/m) ∼ N(0, 1).

Under H0, µX − µY = 0, so we have:

T = (X̄ − Ȳ) / √(σ_X^2/n + σ_Y^2/m) ∼ N(0, 1).

At the 100α% significance level, for α ∈ (0, 1), we reject H0 if |t| > zα/2 , where
P (Z > zα/2 ) = α/2, for Z ∼ N (0, 1).
A 100(1 − α)% confidence interval for µX − µY is:

X̄ − Ȳ ± zα/2 × √(σ_X^2/n + σ_Y^2/m).

9.12.2 Tests on µX − µY with σ_X^2 = σ_Y^2 but unknown
This time we consider the following hypotheses:

H0 : µX − µY = δ0 vs. H1 : µX − µY > δ0


where δ0 is a given constant. Under H0, we have:

T = √((n + m − 2)/(1/n + 1/m)) × (X̄ − Ȳ − δ0) / √((n − 1)S_X^2 + (m − 1)S_Y^2) ∼ t_{n+m−2}.
At the 100α% significance level, for α ∈ (0, 1), we reject H0 if t > tα, n+m−2 , where
P (T > tα, n+m−2 ) = α, for T ∼ tn+m−2 .
A 100(1 − α)% confidence interval for µX − µY is:

X̄ − Ȳ ± tα/2, n+m−2 × √( ((1/n + 1/m)/(n + m − 2)) × ((n − 1)S_X^2 + (m − 1)S_Y^2) ).

Example 9.10 Two types of razor, A and B, were compared using 100 men in an
experiment. Each man shaved one side, chosen at random, of his face using one razor
and the other side using the other razor. The times taken to shave, Xi and Yi
minutes, for i = 1, 2, . . . , 100, corresponding to the razors A and B, respectively,
were recorded, yielding:

x̄ = 2.84, s2X = 0.48, ȳ = 3.02 and s2Y = 0.42.

Also available is the sample variance of the differences, Zi = Xi − Yi , which is


s2Z = 0.6.
Test, at the 5% significance level, if the two razors lead to different mean shaving
times. State clearly any assumptions used in the test.

Assumption: Suppose {X1, X2, . . . , Xn} and {Y1, Y2, . . . , Yn} are two independent random samples from, respectively, N(µX, σ_X^2) and N(µY, σ_Y^2).
The problem requires us to test the following hypotheses:

H0 : µX = µY vs. H1 : µX ≠ µY.

There are three approaches – a paired comparison method and two two-sample
comparisons based on different assumptions. Since the data are recorded in pairs,
the paired comparison is most relevant and effective to analyse these data.

Method I: paired comparison


We have Zi = Xi − Yi ∼ N(µZ, σ_Z^2) with µZ = µX − µY and σ_Z^2 = σ_X^2 + σ_Y^2. We want to test:

H0 : µZ = 0 vs. H1 : µZ ≠ 0.

This is the standard one-sample t test, where:

√n(Z̄ − µZ)/S_Z = (X̄ − Ȳ − (µX − µY))/(S_Z/√n) ∼ t_{n−1}.

H0 is rejected if |t| > t0.025, 99 = 1.98, where under H0 we have:

T = √n Z̄/S_Z = √100 × (X̄ − Ȳ)/S_Z.



With the given data, we observe t = 10 × (2.84 − 3.02)/ 0.6 = −2.327. Hence we
reject the hypothesis that the two razors lead to the same mean shaving time at the
5% significance level.
A 95% confidence interval for µX − µY is:
sZ
x̄ − ȳ ± t0.025, n−1 × √ = −0.18 ± 0.154 ⇒ (−0.334, −0.026).
n
Some remarks are the following.

i. Zero is not in the confidence interval for µX − µY .


ii. t0.025, 99 = 1.98 is pretty close to z0.025 = 1.96.

Method II: two-sample comparison with known variances


A further assumption is that σ_X^2 = 0.48 and σ_Y^2 = 0.42.

Note X̄ − Ȳ ∼ N(µX − µY, σ_X^2/100 + σ_Y^2/100), i.e. we have:

(X̄ − Ȳ − (µX − µY)) / √(σ_X^2/100 + σ_Y^2/100) ∼ N(0, 1).

Hence we reject H0 when |t| > 1.96 at the 5% significance level, where:

T = (X̄ − Ȳ) / √(σ_X^2/100 + σ_Y^2/100).

For the given data, t = −0.18/√0.009 = −1.90. Hence we cannot reject H0.

A 95% confidence interval for µX − µY is:

x̄ − ȳ ± 1.96 × √(σ_X^2/100 + σ_Y^2/100) = −0.18 ± 0.186 ⇒ (−0.366, 0.006).
The value 0 is now contained in the confidence interval.

Method III: two-sample comparison with equal but unknown variance


A different additional assumption is that σ_X^2 = σ_Y^2 = σ^2.

Now X̄ − Ȳ ∼ N(µX − µY, σ^2/50) and 99(S_X^2 + S_Y^2)/σ^2 ∼ χ^2_{198}. Hence:

√50 × (X̄ − Ȳ − (µX − µY)) / √(99 × (S_X^2 + S_Y^2)/198) = 10 × (X̄ − Ȳ − (µX − µY)) / √(S_X^2 + S_Y^2) ∼ t_{198}.

Hence we reject H0 if |t| > t0.025, 198 = 1.97 where:

T = 10 × (X̄ − Ȳ) / √(S_X^2 + S_Y^2).

For the given data, t = −1.897. Hence we cannot reject H0 at the 5% significance level.

A 95% confidence interval for µX − µY is:

x̄ − ȳ ± t0.025, 198 × √((s_X^2 + s_Y^2)/100) = −0.18 ± 0.187 ⇒ (−0.367, 0.007)
which contains 0.


Some remarks are the following.

i. Different methods lead to different but not contradictory conclusions; remember that ‘not reject’ ≠ ‘accept’.

ii. The paired comparison is intuitively the most relevant, requires the least
assumptions, and leads to the most conclusive inference (i.e. rejection of H0 ). It
also produces the narrowest confidence interval.

iii. Methods II and III ignore the pairing of the data. Consequently, the inference is
less conclusive and less accurate.

iv. A general observation is that H0 is rejected at the 100α% significance level if and
only if the value hypothesised by H0 is not within the corresponding 100(1 − α)%
confidence interval.

v. It is much more challenging to compare two normal means with unknown and
unequal variances. This will not be discussed in this course.
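
As a quick numerical check of Example 9.10 (a sketch, using only the reported summary figures), the three test statistics can be computed as follows:

# Test statistics for the razor data from the summary statistics alone.
n <- 100; xbar <- 2.84; ybar <- 3.02
s2x <- 0.48; s2y <- 0.42; s2z <- 0.6
t_paired <- sqrt(n) * (xbar - ybar) / sqrt(s2z)    # Method I: about -2.32
t_known  <- (xbar - ybar) / sqrt(s2x/n + s2y/n)    # Method II: about -1.90
t_pooled <- 10 * (xbar - ybar) / sqrt(s2x + s2y)   # Method III: about -1.90
c(t_paired, t_known, t_pooled)
qt(0.975, df = 99)                                 # paired critical value, about 1.98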

9.13 Tests for correlation coefficients


We now consider a test for the correlation coefficient of two random variables X and Y
where:

ρ = Corr(X, Y) = Cov(X, Y) / (Var(X) Var(Y))^{1/2} = E((X − E(X))(Y − E(Y))) / (E((X − E(X))^2) E((Y − E(Y))^2))^{1/2}.

Some remarks are the following.

i. ρ ∈ [−1, 1], and |ρ| = 1 if and only if Y = aX + b for some constants a and b.
Furthermore, a > 0 if ρ = 1, and a < 0 if ρ = −1.

ii. ρ measures only the linear relationship between X and Y . When ρ = 0, X and Y
are linearly independent, that is uncorrelated.

iii. If X and Y are independent (in the sense that the joint pdf is the product of the
two marginal pdfs), ρ = 0. However, if ρ = 0, X and Y are not necessarily
independent, as there may exist some non-linear relationship between X and Y .

iv. If ρ > 0, X and Y tend to increase (or decrease) together. If ρ < 0, X and Y tend
to move in opposite directions.


Sample correlation coefficient

Given paired observations (Xi, Yi), for i = 1, 2, . . . , n, a natural estimator of ρ is defined as:

ρ̂ = Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) / ( Σ_{i=1}^{n} (X_i − X̄)^2 Σ_{j=1}^{n} (Y_j − Ȳ)^2 )^{1/2}

where X̄ = Σ_{i=1}^{n} X_i/n and Ȳ = Σ_{i=1}^{n} Y_i/n.

Example 9.11 The measurements of height, X, and weight, Y , are taken from 69
students in a class. ρ should be positive, intuitively!
In Figure 9.5 (on the next page), the vertical line at x̄ and the horizontal line at ȳ
divide the 69 points into 4 quadrants: northeast (NE), southwest (SW), northwest
(NW) and southeast (SE). Most points are in either NE or SW.

In the NE quadrant, xi > x̄ and yi > ȳ, hence:

Σ_{i∈NE} (x_i − x̄)(y_i − ȳ) > 0.

In the SW quadrant, xi < x̄ and yi < ȳ, hence:

Σ_{i∈SW} (x_i − x̄)(y_i − ȳ) > 0.

In the NW quadrant, xi < x̄ and yi > ȳ, hence:

Σ_{i∈NW} (x_i − x̄)(y_i − ȳ) < 0.

In the SE quadrant, xi > x̄ and yi < ȳ, hence:

Σ_{i∈SE} (x_i − x̄)(y_i − ȳ) < 0.

Overall:

Σ_{i=1}^{69} (x_i − x̄)(y_i − ȳ) > 0

and hence ρ̂ > 0.

Figure 9.6 (p.240) shows examples of different sample correlation coefficients using
scatterplots of bivariate observations.


Figure 9.5: Scatterplot of height and weight in Example 9.11.

9.13.1 Tests for correlation coefficients


Let {(X1 , Y1 ), (X2 , Y2 ), . . . , (Xn , Yn )} be a random sample from a two-dimensional
normal distribution. Let ρ = Corr(Xi , Yi ). We are interested in testing:

H0 : ρ = 0 vs. H1 : ρ ≠ 0.

It can be shown that under H0 the test statistic is:

T = ρ̂ √((n − 2)/(1 − ρ̂^2)) ∼ t_{n−2}.

Hence we reject H0 at the 100α% significance level, for α ∈ (0, 1), if |t| > tα/2, n−2, where:

P(T > tα/2, n−2) = α/2.
Some remarks are the following.
p
ρ| (n − 2)/(1 − ρb2 ) increases as |b
i. |T | = |b ρ| increases.

ii. For H1 : ρ > 0, we reject H0 if t > tα, n−2 .

iii. Two random variables X and Y are jointly normal if aX + bY is normal for any
constants a and b.

iv. For jointly normal random variables X and Y , if Corr(X, Y ) = 0, X and Y are also
independent.


Figure 9.6: Scatterplots of bivariate observations with different sample correlation coefficients.

9.14 Tests for the ratio of two normal variances


Let {X1, X2, . . . , Xn} and {Y1, Y2, . . . , Ym} be two independent random samples from, respectively, N(µX, σ_X^2) and N(µY, σ_Y^2). We are interested in testing:

H0 : σ_Y^2/σ_X^2 = k vs. H1 : σ_Y^2/σ_X^2 ≠ k

where k > 0 is a given constant. The case with k = 1 is of particular interest since this tests for equal variances.
Let the sample means be X̄ = Σ_{i=1}^{n} X_i/n and Ȳ = Σ_{i=1}^{m} Y_i/m, and the sample variances be:

S_X^2 = (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄)^2 and S_Y^2 = (1/(m − 1)) Σ_{i=1}^{m} (Y_i − Ȳ)^2.

We have (n − 1)S_X^2/σ_X^2 ∼ χ^2_{n−1} and (m − 1)S_Y^2/σ_Y^2 ∼ χ^2_{m−1}. Therefore:

(σ_Y^2/σ_X^2) × (S_X^2/S_Y^2) = (S_X^2/σ_X^2) / (S_Y^2/σ_Y^2) ∼ F_{n−1, m−1}.

Under H0, T = k S_X^2/S_Y^2 ∼ F_{n−1, m−1}. Hence H0 is rejected if:

t < F1−α/2, n−1, m−1 or t > Fα/2, n−1, m−1

where Fα, p, k denotes the top 100αth percentile of the Fp, k distribution, that is:

P(T > Fα, p, k) = α


available from Table A.3 of the Dougherty Statistical Tables.


Since:

P( F1−α/2, n−1, m−1 ≤ (σ_Y^2/σ_X^2) × (S_X^2/S_Y^2) ≤ Fα/2, n−1, m−1 ) = 1 − α

a 100(1 − α)% confidence interval for σ_Y^2/σ_X^2 is:

( F1−α/2, n−1, m−1 × S_Y^2/S_X^2 , Fα/2, n−1, m−1 × S_Y^2/S_X^2 ).

Example 9.12 Here we practise use of Table A.3 of the Dougherty Statistical
Tables to obtain critical values for the F distribution.
Table A.3 can be used to find the top 100αth percentile of the Fν1 , ν2 distribution for
α = 0.05, 0.01 and 0.001.
For example, for ν1 = 3 and ν2 = 5, then:

P (F3, 5 > 5.41) = 0.05

P (F3, 5 > 12.06) = 0.01

and:
P (F3, 5 > 33.20) = 0.001.
To find the bottom 100αth percentile, we note that F1−α, ν1, ν2 = 1/Fα, ν2, ν1. So, for ν1 = 3 and ν2 = 5, we have:

P(F3, 5 < 1/F0.05, 5, 3) = P(F3, 5 < 1/9.01) = P(F3, 5 < 0.111) = 0.05

P(F3, 5 < 1/F0.01, 5, 3) = P(F3, 5 < 1/28.24) = P(F3, 5 < 0.035) = 0.01

and:

P(F3, 5 < 1/F0.001, 5, 3) = P(F3, 5 < 1/134.58) = P(F3, 5 < 0.007) = 0.001.
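
The same F percentiles can be obtained without tables; a short R sketch:

# Upper percentiles of the F distribution used in Example 9.12.
qf(0.95,  df1 = 3, df2 = 5)    # 5.41
qf(0.99,  df1 = 3, df2 = 5)    # 12.06
qf(0.999, df1 = 3, df2 = 5)    # 33.20
# Lower percentiles via the reciprocal identity F_{1-alpha, v1, v2} = 1/F_{alpha, v2, v1}.
qf(0.05, df1 = 3, df2 = 5)     # about 0.111
1 / qf(0.95, df1 = 5, df2 = 3) # the same value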

Example 9.13 The daily returns (in percentages) of two assets, X and Y , are
recorded over a period of 100 trading days, yielding average daily returns of x̄ = 3.21
and ȳ = 1.41. Also available from the data are the following quantities:
Σ_{i=1}^{100} x_i^2 = 1,989.24, Σ_{i=1}^{100} y_i^2 = 932.78 and Σ_{i=1}^{100} x_i y_i = 661.11.

Assume the data are normally distributed. Are the two assets positively correlated
with each other, and is asset X riskier than asset Y ?
With n = 100 we have:

s_X^2 = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)^2 = (1/(n − 1)) ( Σ_{i=1}^{n} x_i^2 − n x̄^2 ) = 9.69

and:

s_Y^2 = (1/(n − 1)) Σ_{i=1}^{n} (y_i − ȳ)^2 = (1/(n − 1)) ( Σ_{i=1}^{n} y_i^2 − n ȳ^2 ) = 7.41.

Therefore:

ρ̂ = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / ((n − 1) s_X s_Y) = ( Σ_{i=1}^{n} x_i y_i − n x̄ ȳ ) / ((n − 1) s_X s_Y) = 0.249.
First we test:
H0 : ρ = 0 vs. H1 : ρ > 0.
Under H0, the test statistic is:

T = ρ̂ √((n − 2)/(1 − ρ̂^2)) ∼ t_{98}.
Setting α = 0.01, we reject H0 if t > t0.01, 98 = 2.37. With the given data, t = 2.545
hence we reject the null hypothesis of ρ = 0 at the 1% significance level. We
conclude that there is highly significant evidence indicating that the two assets are
positively correlated.
We measure the risks in terms of variances, and test:

H0 : σ_X^2 = σ_Y^2 vs. H1 : σ_X^2 > σ_Y^2.

Under H0, we have that:

T = S_X^2/S_Y^2 ∼ F_{99, 99}.
Hence we reject H0 if t > F0.05, 99, 99 = 1.39 at the 5% significance level, using Table
A.3 of the Dougherty Statistical Tables.
With the given data, t = 9.69/7.41 = 1.308. Therefore, we cannot reject H0 . As the
test is not significant at the 5% significance level, we may not conclude that the
variances of the two assets are significantly different. Therefore, there is no
significant evidence indicating that asset X is riskier than asset Y .
Strictly speaking, the test is valid only if the two samples are independent of each
other, which is not the case here.
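
A sketch of the same calculations in R, starting from the summary quantities given in Example 9.13:

# Correlation and variance-ratio tests for the two assets (Example 9.13).
n <- 100; xbar <- 3.21; ybar <- 1.41
sum_x2 <- 1989.24; sum_y2 <- 932.78; sum_xy <- 661.11
s2x <- (sum_x2 - n * xbar^2) / (n - 1)                              # about 9.69
s2y <- (sum_y2 - n * ybar^2) / (n - 1)                              # about 7.41
rho_hat <- (sum_xy - n * xbar * ybar) / ((n - 1) * sqrt(s2x * s2y)) # about 0.249
rho_hat * sqrt((n - 2) / (1 - rho_hat^2))                           # t statistic, about 2.55
qt(0.99, df = 98)                                                   # critical value, about 2.37
s2x / s2y                                                           # F statistic, about 1.31
qf(0.95, df1 = 99, df2 = 99)                                        # critical value, about 1.39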

9.15 Summary: tests for two normal distributions


Let (X1, X2, . . . , Xn) ∼ IID N(µX, σ_X^2), (Y1, Y2, . . . , Ym) ∼ IID N(µY, σ_Y^2), and ρ = Corr(X, Y).

A summary table of tests for two normal distributions is:

Null hypothesis, H0:          µX − µY = δ (σ_X^2, σ_Y^2 known)
Test statistic, T:            (X̄ − Ȳ − δ) / √(σ_X^2/n + σ_Y^2/m)
Distribution of T under H0:   N(0, 1)

Null hypothesis, H0:          µX − µY = δ (σ_X^2 = σ_Y^2 unknown)
Test statistic, T:            √((n + m − 2)/(1/n + 1/m)) × (X̄ − Ȳ − δ) / √((n − 1)S_X^2 + (m − 1)S_Y^2)
Distribution of T under H0:   t_{n+m−2}

Null hypothesis, H0:          ρ = 0 (n = m)
Test statistic, T:            ρ̂ √((n − 2)/(1 − ρ̂^2))
Distribution of T under H0:   t_{n−2}

Null hypothesis, H0:          σ_Y^2/σ_X^2 = k
Test statistic, T:            k S_X^2/S_Y^2
Distribution of T under H0:   F_{n−1, m−1}

9.16 Overview of chapter


This chapter has discussed hypothesis tests for parameters of normal distributions –
specifically means and variances. In each case an appropriate test statistic was
constructed whose distribution under the null hypothesis was known. Concepts of
hypothesis testing errors and power were also discussed, as well as how to test
correlation coefficients.

9.17 Key terms and concepts


Alternative hypothesis Critical value
Decision Null hypothesis
p-value Paired comparison
Power function Significance level
t test Test statistic
Type I error Type II error

9.18 Sample examination questions


1. A single observation of X, where X ∼ Bin(6, π), is observed, i.e. X follows a
binomial distribution. We wish to test:

H0 : π = 0.50 vs. H1 : π = 0.75.

The chosen statistical test rejects H0 if the observed value of X is x ≥ 5.

(a) Calculate the probability of a Type I error for this test.

(b) Calculate the power of this test.

(c) Calculate the probability of a Type II error for this test.


2. A random sample of size n = 36 is drawn from N (µ, 64) in order to test:

H0 : µ = 60 vs. H1 : µ 6= 60.

A researcher who is not familiar with hypothesis testing decided to use a 4%


significance level and the decision rule of rejecting H0 if the sample mean, x̄, fell
inside the interval (60 − k, 60 + k). In this question you may use the nearest values
in any statistical tables you consult.

(a) Determine the value of k.

(b) Determine the power of this test if µ = 62.

3. A paired-difference experiment under two conditions used a random sample of


n = 121 adults and reported a sample mean difference of 1.195 and a standard
deviation of 10.2 for the differences. The researchers reported a t statistic value of
1.289, when testing whether the means of the two conditions are the same.

(a) Show how the researchers obtained the t statistic value of 1.289.

(b) Calculate the p-value of the test and use the p-value to draw a conclusion
about the significance of the test. Use a 5% significance level.

9.19 Solutions to Sample examination questions


1. (a) The probability of a Type I error is:

P(reject H0 | π = 0.50) = P(X ≥ 5 | π = 0.50) = (6 + 1) × (0.50)^6 = 0.1094.

(b) The power of the test is:

P(reject H0 | π = 0.75) = P(X ≥ 5 | π = 0.75) = 6 × (0.75)^5 × 0.25 + (0.75)^6 = 0.5339.

(c) The probability of a Type II error is:

1 − P (reject H0 | π = 0.75) = 1 − 0.5339 = 0.4661.
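
These binomial probabilities can be checked numerically (a sketch in R):

# Question 1: error probabilities and power for the test which rejects when X >= 5.
1 - pbinom(4, size = 6, prob = 0.50)  # P(Type I error) = 0.1094
1 - pbinom(4, size = 6, prob = 0.75)  # power = 0.5339
pbinom(4, size = 6, prob = 0.75)      # P(Type II error) = 0.4661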


2. (a) For α = 0.04, we require:

P(60 − k ≤ X̄ ≤ 60 + k | µ = 60) = 0.04

hence:

P((60 − k − 60)/(8/√36) ≤ (X̄ − 60)/(8/√36) ≤ (60 + k − 60)/(8/√36)) = P(−0.75k ≤ Z ≤ 0.75k) = 0.04.

We have that:

P(−0.05 ≤ Z ≤ 0.05) = 0.04

hence:

0.75k = 0.05 ⇒ k = 1/15.

(b) For the power of the test, we require the conditional probability:

P(reject H0 | µ = 62) = P(59.933 ≤ X̄ ≤ 60.067 | µ = 62).

Standardising gives:

P((59.933 − 62)/(8/√36) ≤ (X̄ − 62)/(8/√36) ≤ (60.067 − 62)/(8/√36)) = P(−1.55 ≤ Z ≤ −1.45) = 0.0735 − 0.0606 = 0.0129.
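
A numerical check of parts (a) and (b) (a sketch in R):

# Question 2: significance level and power for the (deliberately odd) decision rule.
k <- 0.05 / 0.75                                        # part (a): k = 1/15
pnorm(0.05) - pnorm(-0.05)                              # about 0.04, the significance level
se <- 8 / sqrt(36)
pnorm((60 + k - 62) / se) - pnorm((60 - k - 62) / se)   # part (b): power, about 0.013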

3. (a) The test statistic value is:

x̄_d/(s_d/√n) = 1.195/(10.2/√121) = 1.289.

(b) Under H0 : µd = 0, the test statistic is:

T = (X̄_d − µd)/(S_d/√n) ∼ t_{n−1}, i.e. T = X̄_d/(S_d/√121) ∼ t_{120}.

The p-value, where T ∼ t_{120}, using Table 10 of the New Cambridge Statistical Tables, is:

2 × P(T ≥ 1.289) = 2 × 0.10 = 0.20.
Since 0.20 > 0.05, the test is not significant at the 5% significance level.

To p, or not to p?
(James Abdey, Ph.D. Thesis, 2009. Available at http://etheses.lse.ac.uk/31)

Chapter 10
Analysis of variance (ANOVA)

10.1 Synopsis of chapter


This chapter introduces analysis of variance (ANOVA) which is a widely-used technique
for detecting differences between groups based on continuous dependent variables.

10.2 Learning outcomes


After completing this chapter, you should be able to:

explain the purpose of analysis of variance

restate and interpret the models for one-way and two-way analysis of variance

conduct small examples of one-way and two-way analysis of variance with a


calculator, reporting the results in an ANOVA table

perform hypothesis tests and construct confidence intervals for one-way and
two-way analysis of variance

explain how to interpret residuals from an analysis of variance.

10.3 Introduction
Analysis of variance (ANOVA) is a popular tool which has an applicability and power
which we can only start to appreciate in this course. The idea of analysis of variance is
to investigate how variation in structured data can be split into pieces associated with
components of that structure. We look only at one-way and two-way classifications,
providing tests and confidence intervals which are widely used in practice.

10.4 Testing for equality of three population means


We begin with an illustrative example to test the hypothesis that three population
means are equal.


Example 10.1 To assess the teaching quality of class teachers, a random sample of
6 examination marks was selected from each of three classes. The examination marks
for each class are listed in the table below.
Can we infer from these data that there is no significant difference in the
examination marks among all three classes?

Class 1 Class 2 Class 3


85 71 59
75 75 64
82 73 62
76 74 69
71 69 75
85 82 67

Suppose examination marks from Class j follow the distribution N (µj , σ 2 ), for
j = 1, 2, 3. So we assume examination marks are normally distributed with the same
variance in each class, but possibly different means.
We need to test the hypothesis:

H0 : µ1 = µ2 = µ3 .

The data form a 6 × 3 array. Denote the data point at the (i, j)th position as Xij .
We compute the column means first where the jth column mean is:
X̄·j = (X_{1j} + X_{2j} + · · · + X_{n_j j}) / n_j

where nj is the sample size of group j (here nj = 6 for all j).


This leads to x̄·1 = 79, x̄·2 = 74 and x̄·3 = 66. Transposing the table, we get:

Observation
1 2 3 4 5 6 Mean
Class 1 85 75 82 76 71 85 79
Class 2 71 75 73 74 69 82 74
Class 3 59 64 62 69 75 67 66

Note that similar problems arise from other practical situations. For example:

comparing the returns of three stocks

comparing sales using three advertising strategies

comparing the effectiveness of three medicines.

If H0 is true, the three observed sample means x̄·1 , x̄·2 and x̄·3 should be very close to
each other, i.e. all of them should be close to the overall sample mean, x̄, which is:
x̄ = (x̄·1 + x̄·2 + x̄·3)/3 = (79 + 74 + 66)/3 = 73


i.e. the mean value of all 18 observations.


So we wish to perform a hypothesis test based on the variation in the sample means
such that the greater the variation, the more likely we are to reject H0 . One possible
measure for the variation in the sample means X̄·j about the overall sample mean X̄,
for j = 1, 2, 3, is:

Σ_{j=1}^{3} (X̄·j − X̄)^2.    (10.1)

However, (10.1) is not scale-invariant, so it would be difficult to judge whether the


realised value is large enough to warrant rejection of H0 due to the magnitude being
dependent on the units of measurement of the data. So we seek a scale-invariant test
statistic.
Just as we scaled the covariance between two random variables to give the
scale-invariant correlation coefficient, we can similarly scale (10.1) to give the
following possible test statistic:

T = Σ_{j=1}^{3} (X̄·j − X̄)^2 / (sum of the three sample variances).

Hence we would reject H0 for large values of T . (Note t = 0 if x̄·1 = x̄·2 = x̄·3 which
would mean that there is no variation at all between the sample means. In this case
all the sample means would equal x̄.)
It remains to determine the distribution of T under H0 .

10.5 One-way analysis of variance


We now extend Example 10.1 to consider a general setting where there are k
independent random samples available from k normal distributions N (µj , σ 2 ), for
j = 1, 2, . . . , k. (Example 10.1 corresponds to k = 3.)
Denote by X1j , X2j , . . . , Xnj j the random sample with sample size nj from N (µj , σ 2 ), for
j = 1, 2, . . . , k.
Our goal is to test:
H0 : µ1 = µ2 = · · · = µk
vs.
H1 : not all µj s are the same.
One-way analysis of variance (one-way ANOVA) involves a continuous dependent
variable and one categorical independent variable (sometimes called a factor, or
treatment), where the k different levels of the categorical variable are the k different
groups.
We now introduce statistics associated with one-way ANOVA.


Statistics associated with one-way ANOVA

The jth sample mean is:

X̄·j = (1/n_j) Σ_{i=1}^{n_j} X_{ij}.

The overall sample mean is:

X̄ = (1/n) Σ_{j=1}^{k} Σ_{i=1}^{n_j} X_{ij} = (1/n) Σ_{j=1}^{k} n_j X̄·j

where n = Σ_{j=1}^{k} n_j is the total number of observations across all k groups.

The total variation is:

Σ_{j=1}^{k} Σ_{i=1}^{n_j} (X_{ij} − X̄)^2

with n − 1 degrees of freedom.

The between-groups variation is:

B = Σ_{j=1}^{k} n_j (X̄·j − X̄)^2

with k − 1 degrees of freedom.

The within-groups variation is:

W = Σ_{j=1}^{k} Σ_{i=1}^{n_j} (X_{ij} − X̄·j)^2

with n − k = Σ_{j=1}^{k} (n_j − 1) degrees of freedom.

The ANOVA decomposition is:

Σ_{j=1}^{k} Σ_{i=1}^{n_j} (X_{ij} − X̄)^2 = Σ_{j=1}^{k} n_j (X̄·j − X̄)^2 + Σ_{j=1}^{k} Σ_{i=1}^{n_j} (X_{ij} − X̄·j)^2.

We have already discussed the jth sample mean and overall sample mean. The total
variation is a measure of the overall (total) variability in the data from all k groups
about the overall sample mean. The ANOVA decomposition decomposes this into two
components: between-groups variation (which is attributable to the factor level) and
within-groups variation (which is attributable to the variation within each group and is
assumed to be the same σ 2 for each group).
Some remarks are the following.

i. B and W are also called, respectively, between-treatments variation and within-treatments variation. In fact W is effectively a residual (error) sum of squares, representing the variation which cannot be explained by the treatment or group factor.

ii. The ANOVA decomposition follows from the identity:

Σ_{i=1}^{m} (a_i − b)^2 = Σ_{i=1}^{m} (a_i − ā)^2 + m(ā − b)^2.

However, the actual derivation is not required for this course.

iii. The following are some useful formulae for manual computations.

• n = Σ_{j=1}^{k} n_j.

• X̄·j = Σ_{i=1}^{n_j} X_{ij}/n_j and X̄ = Σ_{j=1}^{k} n_j X̄·j/n.

• Total variation = Total SS = B + W = Σ_{j=1}^{k} Σ_{i=1}^{n_j} X_{ij}^2 − nX̄^2.

• B = Σ_{j=1}^{k} n_j X̄·j^2 − nX̄^2.

• Residual (Error) SS = W = Σ_{j=1}^{k} Σ_{i=1}^{n_j} X_{ij}^2 − Σ_{j=1}^{k} n_j X̄·j^2 = Σ_{j=1}^{k} (n_j − 1)S_j^2 where S_j^2 is the jth sample variance.

We now note, without proof, the following results.

i. B = Σ_{j=1}^{k} n_j (X̄·j − X̄)^2 and W = Σ_{j=1}^{k} Σ_{i=1}^{n_j} (X_{ij} − X̄·j)^2 are independent of each other.

ii. W/σ^2 = Σ_{j=1}^{k} Σ_{i=1}^{n_j} (X_{ij} − X̄·j)^2/σ^2 ∼ χ^2_{n−k}.

iii. Under H0 : µ1 = · · · = µk, then B/σ^2 = Σ_{j=1}^{k} n_j (X̄·j − X̄)^2/σ^2 ∼ χ^2_{k−1}.

In order to test H0 : µ1 = µ2 = · · · = µk, we define the following test statistic:

F = ( Σ_{j=1}^{k} n_j (X̄·j − X̄)^2/(k − 1) ) / ( Σ_{j=1}^{k} Σ_{i=1}^{n_j} (X_{ij} − X̄·j)^2/(n − k) ) = (B/(k − 1)) / (W/(n − k)).

Under H0 , F ∼ Fk−1, n−k . We reject H0 at the 100α% significance level if:

f > Fα, k−1, n−k

where Fα, k−1, n−k is the top 100αth percentile of the Fk−1, n−k distribution, i.e.
P (F > Fα, k−1, n−k ) = α, and f is the observed test statistic value.


The p-value of the test is:

p-value = P (F > f ).

It is clear that f > Fα, k−1, n−k if and only if the p-value < α, as we must reach the same
conclusion regardless of whether we use the critical value approach or the p-value
approach to hypothesis testing.

One-way ANOVA table

Typically, one-way ANOVA results are presented in a table as follows:

Source   DF      SS     MS           F                          p-value
Factor   k − 1   B      B/(k − 1)    (B/(k − 1))/(W/(n − k))    p
Error    n − k   W      W/(n − k)
Total    n − 1   B + W

Example 10.2 Continuing with Example 10.1, for the given data, k = 3,
n1 = n2 = n3 = 6, n = n1 + n2 + n3 = 18, x̄·1 = 79, x̄·2 = 74, x̄·3 = 66 and x̄ = 73.
The sample variances are calculated to be s_1^2 = 34, s_2^2 = 20 and s_3^2 = 32. Therefore:

b = Σ_{j=1}^{3} 6(x̄·j − x̄)^2 = 6 × ((79 − 73)^2 + (74 − 73)^2 + (66 − 73)^2) = 516

and:

w = Σ_{j=1}^{3} Σ_{i=1}^{6} (x_{ij} − x̄·j)^2 = Σ_{j=1}^{3} Σ_{i=1}^{6} x_{ij}^2 − 6 Σ_{j=1}^{3} x̄·j^2 = Σ_{j=1}^{3} 5s_j^2 = 5 × (34 + 20 + 32) = 430.

Hence:

f = (b/(k − 1)) / (w/(n − k)) = (516/2) / (430/15) = 9.
Under H0 : µ1 = µ2 = µ3 , F ∼ Fk−1, n−k = F2, 15 . Since F0.01, 2, 15 = 6.36 < 9, using
Table A.3 of the Dougherty Statistical Tables, we reject H0 at the 1% significance
level. In fact the p-value (using a computer) is P (F > 9) = 0.003. Therefore, we
conclude that there is a significant difference among the mean examination marks
across the three classes.


The one-way ANOVA table is as follows:

Source DF SS MS F p-value
Class 2 516 258 9 0.003
Error 15 430 28.67
Total 17 946
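
For comparison, a sketch of the same analysis in R (the variable names below are illustrative, not taken from a data file):

# One-way ANOVA for the examination marks of Examples 10.1 and 10.2.
marks <- c(85, 75, 82, 76, 71, 85,   # Class 1
           71, 75, 73, 74, 69, 82,   # Class 2
           59, 64, 62, 69, 75, 67)   # Class 3
group <- factor(rep(c("Class 1", "Class 2", "Class 3"), each = 6))
anova(lm(marks ~ group))   # F = 9 on (2, 15) degrees of freedom, p-value about 0.003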

Example 10.3 A study performed by a Columbia University professor counted the


number of times per minute professors from three different departments said ‘uh’ or
‘ah’ during lectures to fill gaps between words. The data listed in ‘UhAh.csv’
(available on the VLE) were derived from observing 100 minutes from each of the
three departments. If we assume that the more frequent use of ‘uh’ or ‘ah’ results in
more boring lectures, can we conclude that some departments’ professors are more
boring than others?
The counts for English, Mathematics and Political Science departments are stored.
As always in statistical analysis, we first look at the summary (descriptive) statistics
of these data.

> attach(UhAh)
> summary(UhAh)
Frequency Department
Min. : 0.00 English :100
1st Qu.: 4.00 Mathematics :100
Median : 5.00 Political Science:100
Mean : 5.48
3rd Qu.: 7.00
Max. :11.00
> xbar <- tapply(Frequency, Department, mean)
> s <- tapply(Frequency, Department, sd)
> n <- tapply(Frequency, Department, length)
> sem <- s/sqrt(n)
> list(xbar,s,n,sem)
[[1]]
English Mathematics Political Science
5.81 5.30 5.33

[[2]]
English Mathematics Political Science
2.493203 2.012587 1.974867

[[3]]
English Mathematics Political Science
100 100 100

[[4]]
English Mathematics Political Science
0.2493203 0.2012587 0.1974867


Surprisingly, professors in English say ‘uh’ or ‘ah’ more on average than those in
Mathematics and Political Science (compare the sample means of 5.81, 5.30 and
5.33), but the difference seems small. However, we need to formally test whether the
(seemingly small) differences are statistically significant.
Using the data, R produces the following one-way ANOVA table:

> anova(lm(Frequency ~ Department))


Analysis of Variance Table

Response: Frequency
Df Sum Sq Mean Sq F value Pr(>F)
Department 2 16.38 8.1900 1.7344 0.1783
Residuals 297 1402.50 4.7222
Since the p-value for the F test is 0.1783, we cannot reject the following hypothesis:

H0 : µ1 = µ2 = µ3 .

Therefore, there is no evidence of a difference in the mean number of ‘uh’s or ‘ah’s


said by professors across the three departments.

In addition to a one-way ANOVA table, we can also obtain the following.

An estimator of σ is:

σ̂ = S = √(W/(n − k)).

95% confidence intervals for µj are given by:

X̄·j ± t0.025, n−k × S/√n_j    for j = 1, 2, . . . , k

where t0.025, n−k is the top 2.5th percentile of the Student’s tn−k distribution, which
can be obtained from Table 10 of the New Cambridge Statistical Tables.

Example 10.4 Assuming a common variance for each group, from the preceding
output in Example 10.3 we see that:

σ̂ = s = √(1,402.50/297) = √4.72 = 2.173.

Since t0.025, 297 ≈ t0.025, ∞ = 1.96, using Table 10 of the New Cambridge Statistical Tables, we obtain the following 95% confidence intervals for µ1, µ2 and µ3, respectively:

j = 1 :   5.81 ± 1.96 × 2.173/√100   ⇒   (5.38, 6.24)

j = 2 :   5.30 ± 1.96 × 2.173/√100   ⇒   (4.87, 5.73)

j = 3 :   5.33 ± 1.96 × 2.173/√100   ⇒   (4.90, 5.76).

R can produce the following:

> stripchart(Frequency ~ Department,pch=16,vert=T)


> arrows(1:3,xbar+1.96*2.173/sqrt(n),1:3,xbar-1.96*2.173/sqrt(n),
angle=90,code=3,length=0.1)
> lines(1:3,xbar,pch=4,type="b",cex=2)
These 95% confidence intervals can be seen plotted in the R output in Figure 10.1
below. Note that these confidence intervals all overlap, which is consistent with our
failure to reject the null hypothesis that all population means are equal.
[Figure 10.1 shows a stripchart of Frequency by Department (English, Mathematics, Political Science) with the three 95% confidence intervals overlaid.]

Figure 10.1: Overlapping confidence intervals.

Example 10.5 In early 2001, the American economy was slowing down and
companies were laying off workers. A poll conducted during February 2001 asked a
random sample of workers how long (in months) it would be before they faced
significant financial hardship if they lost their jobs, with the data available in the file


‘GallupPoll.csv’ (available on the VLE). They are classified into four groups
according to their incomes. Below is part of the R output of the descriptive statistics
of the classified data. Can we infer that income group has a significant impact on the
mean length of time before facing financial hardship?

Hardship Income.group
Min. : 0.00 $20 to 30K: 81
1st Qu.: 8.00 $30 to 50K:114
Median :15.00 Over $50K : 39
Mean :16.11 Under $20K: 67
3rd Qu.:22.00
Max. :50.00

> xbar <- tapply(Hardship, Income.group, mean)


> s <- tapply(Hardship, Income.group, sd)
> n <- tapply(Hardship, Income.group, length)
> sem <- s/sqrt(n)
> list(xbar,s,n,sem)
[[1]]
$20 to 30K $30 to 50K Over $50K Under $20K
15.493827 18.456140 22.205128 9.313433

[[2]]
$20 to 30K $30 to 50K Over $50K Under $20K
9.233260 9.507464 11.029099 8.087043

[[3]]
$20 to 30K $30 to 50K Over $50K Under $20K
81 114 39 67

[[4]]
$20 to 30K $30 to 50K Over $50K Under $20K
1.0259178 0.8904556 1.7660693 0.9879896
Inspection of the sample means suggests that there is a difference between income
groups, but we need to conduct a one-way ANOVA test to see whether the
differences are statistically significant.
We apply one-way ANOVA to test whether the means in the k = 4 groups are equal,
i.e. H0 : µ1 = µ2 = µ3 = µ4 , from highest to lowest income groups.
We have n1 = 39, n2 = 114, n3 = 81 and n4 = 67, hence:

n = Σ_{j=1}^{k} n_j = 39 + 114 + 81 + 67 = 301.

Also x̄·1 = 22.21, x̄·2 = 18.456, x̄·3 = 15.49, x̄·4 = 9.313 and:

x̄ = (1/n) Σ_{j=1}^{k} n_j x̄·j = (39 × 22.21 + 114 × 18.456 + 81 × 15.49 + 67 × 9.313)/301 = 16.109.


Now:

b = Σ_{j=1}^{k} n_j (x̄·j − x̄)^2
  = 39 × (22.21 − 16.109)^2 + 114 × (18.456 − 16.109)^2 + 81 × (15.49 − 16.109)^2 + 67 × (9.313 − 16.109)^2
  = 5,205.097.

We have s_1^2 = (11.03)^2 = 121.661, s_2^2 = (9.507)^2 = 90.383, s_3^2 = (9.23)^2 = 85.193 and s_4^2 = (8.087)^2 = 65.400, hence:

w = Σ_{j=1}^{k} Σ_{i=1}^{n_j} (x_{ij} − x̄·j)^2 = Σ_{j=1}^{k} (n_j − 1)s_j^2
  = 38 × 121.661 + 113 × 90.383 + 80 × 85.193 + 66 × 65.400
  = 25,968.24.

Consequently:

f = (b/(k − 1)) / (w/(n − k)) = (5,205.097/3) / (25,968.24/(301 − 4)) = 19.84.
Under H0 , F ∼ Fk−1, n−k = F3, 297 . Since F0.01, 3, 297 ≈ 3.85 < 19.84, we reject H0 at
the 1% significance level, i.e. there is strong evidence that income group has a
significant impact on the mean length of time before facing financial hardship.
The pooled estimate of σ is:

s = √(w/(n − k)) = √(25,968.24/(301 − 4)) = 9.351.

A 95% confidence interval for µj is:

x̄·j ± t0.025, 297 × s/√n_j = x̄·j ± 1.96 × 9.351/√n_j = x̄·j ± 18.328/√n_j.

Hence, for example, a 95% confidence interval for µ1 is:

22.21 ± 18.328/√39 ⇒ (19.28, 25.14)

and a 95% confidence interval for µ4 is:

9.313 ± 18.328/√67 ⇒ (7.07, 11.55).

Notice that these two confidence intervals do not overlap, which is consistent with
our conclusion that there is a difference between the group means.


R output for the data is:

> anova(lm(Hardship ~ Income.group))


Analysis of Variance Table

Response: Hardship
Df Sum Sq Mean Sq F value Pr(>F)
Income.group 3 5202.1 1734.03 19.828 9.636e-12 ***
Residuals 297 25973.3 87.45
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that minor differences are due to rounding errors in calculations.

10.6 From one-way to two-way ANOVA


One-way ANOVA: a review
We have independent observations Xij ∼ N (µj , σ 2 ) for i = 1, 2, . . . , nj and
j = 1, 2, . . . , k. We are interested in testing:

H0 : µ1 = µ2 = · · · = µk .

The variation of the Xij s is driven by a factor at different levels µ1 , µ2 , . . . , µk , in


addition to random fluctuations (i.e. random errors). We test whether such a factor
effect exists or not. We can model a one-way ANOVA problem as follows:

Xij = µ + βj + εij for i = 1, 2, . . . , nj and j = 1, 2, . . . , k

where εij ∼ N(0, σ^2) and the εij s are independent. µ is the average effect and βj is the factor (or treatment) effect at the jth level. Note that Σ_{j=1}^{k} βj = 0. The null hypothesis
(i.e. that the group means are all equal) can also be expressed as:

H0 : β1 = β2 = · · · = βk = 0.

10.7 Two-way analysis of variance


Two-way analysis of variance (two-way ANOVA) involves a continuous dependent
variable and two categorical independent variables (factors). Two-way ANOVA models
the observations as:

Xij = µ + γi + βj + εij for i = 1, 2, . . . , r and j = 1, 2, . . . , c


where:

µ represents the average effect

β1 , β2 , . . . , βc represent c different treatment (column) levels

γ1 , γ2 , . . . , γr represent r different block (row) levels

εij ∼ N (0, σ 2 ) and the εij s are independent.

In total, there are n = r × c observations. We now consider the conditions to make the
parameters µ, γi and βj identifiable for i = 1, 2, . . . , r and j = 1, 2, . . . , c. The conditions
are:
γ1 + γ2 + · · · + γr = 0 and β1 + β2 + · · · + βc = 0.

We will be interested in testing the following hypotheses.

The ‘no treatment (column) effect’ hypothesis of H0 : β1 = β2 = · · · = βc = 0.

The ‘no block (row) effect’ hypothesis of H0 : γ1 = γ2 = · · · = γr = 0.

We now introduce statistics associated with two-way ANOVA.

Statistics associated with two-way ANOVA

The sample mean at the ith block level is:

X̄i· = (1/c) Σ_{j=1}^{c} X_{ij}    for i = 1, 2, . . . , r.

The sample mean at the jth treatment level is:

X̄·j = (1/r) Σ_{i=1}^{r} X_{ij}    for j = 1, 2, . . . , c.

The overall sample mean is:

X̄ = X̄·· = (1/n) Σ_{i=1}^{r} Σ_{j=1}^{c} X_{ij}.

The total variation (with rc − 1 degrees of freedom) is:

Total SS = Σ_{i=1}^{r} Σ_{j=1}^{c} (X_{ij} − X̄)^2.

The between-blocks (rows) variation (with r − 1 degrees of freedom) is:

B_row = c Σ_{i=1}^{r} (X̄i· − X̄)^2.

The between-treatments (columns) variation (with c − 1 degrees of freedom) is:

B_col = r Σ_{j=1}^{c} (X̄·j − X̄)^2.

The residual (error) variation (with (r − 1)(c − 1) degrees of freedom) is:

Residual SS = Σ_{i=1}^{r} Σ_{j=1}^{c} (X_{ij} − X̄i· − X̄·j + X̄)^2.

The (two-way) ANOVA decomposition is:

Σ_{i=1}^{r} Σ_{j=1}^{c} (X_{ij} − X̄)^2 = c Σ_{i=1}^{r} (X̄i· − X̄)^2 + r Σ_{j=1}^{c} (X̄·j − X̄)^2 + Σ_{i=1}^{r} Σ_{j=1}^{c} (X_{ij} − X̄i· − X̄·j + X̄)^2.

The total variation is a measure of the overall (total) variability in the data and the
(two-way) ANOVA decomposition decomposes this into three components:
between-blocks variation (which is attributable to the row factor level),
between-treatments variation (which is attributable to the column factor level) and
residual variation (which is attributable to the variation not explained by the row and
column factors).
The following are some useful formulae for manual computations.

Row sample means: X̄i· = Σ_{j=1}^{c} X_{ij}/c, for i = 1, 2, . . . , r.

Column sample means: X̄·j = Σ_{i=1}^{r} X_{ij}/r, for j = 1, 2, . . . , c.

Overall sample mean: X̄ = Σ_{i=1}^{r} Σ_{j=1}^{c} X_{ij}/n = Σ_{i=1}^{r} X̄i·/r = Σ_{j=1}^{c} X̄·j/c.

Total SS = Σ_{i=1}^{r} Σ_{j=1}^{c} X_{ij}^2 − rcX̄^2.

Between-blocks (rows) variation: B_row = c Σ_{i=1}^{r} X̄i·^2 − rcX̄^2.

Between-treatments (columns) variation: B_col = r Σ_{j=1}^{c} X̄·j^2 − rcX̄^2.

Residual SS = (Total SS) − B_row − B_col = Σ_{i=1}^{r} Σ_{j=1}^{c} X_{ij}^2 − c Σ_{i=1}^{r} X̄i·^2 − r Σ_{j=1}^{c} X̄·j^2 + rcX̄^2.

In order to test the ‘no block (row) effect’ hypothesis of H0 : γ1 = γ2 = · · · = γr = 0, the test statistic is defined as:

F = (B_row/(r − 1)) / ((Residual SS)/((r − 1)(c − 1))) = (c − 1)B_row / Residual SS.

Under H0, F ∼ Fr−1, (r−1)(c−1). We reject H0 at the 100α% significance level if:

f > Fα, r−1, (r−1)(c−1)

where Fα, r−1, (r−1)(c−1) is the top 100αth percentile of the Fr−1, (r−1)(c−1) distribution, i.e. P(F > Fα, r−1, (r−1)(c−1)) = α, and f is the observed test statistic value.

The p-value of the test is:

p-value = P(F > f).

In order to test the ‘no treatment (column) effect’ hypothesis of H0 : β1 = β2 = · · · = βc = 0, the test statistic is defined as:

F = (B_col/(c − 1)) / ((Residual SS)/((r − 1)(c − 1))) = (r − 1)B_col / Residual SS.

Under H0, F ∼ Fc−1, (r−1)(c−1). We reject H0 at the 100α% significance level if:

f > Fα, c−1, (r−1)(c−1).

The p-value of the test is defined in the usual way.

Two-way ANOVA table

As with one-way ANOVA, two-way ANOVA results are presented in a table as follows:

Source          DF               SS            MS                             F                            p-value
Row factor      r − 1            B_row         B_row/(r − 1)                  (c − 1)B_row/Residual SS     p
Column factor   c − 1            B_col         B_col/(c − 1)                  (r − 1)B_col/Residual SS     p
Residual        (r − 1)(c − 1)   Residual SS   Residual SS/((r − 1)(c − 1))
Total           rc − 1           Total SS

10.8 Residuals
Before considering an example of two-way ANOVA, we briefly consider residuals.
Recall the original two-way ANOVA model:
Xij = µ + γi + βj + εij .


We now decompose the observations as follows:

Xij = X̄ + (X̄i· − X̄) + (X̄·j − X̄) + (Xij − X̄i· − X̄·j + X̄)

for i = 1, 2, . . . , r and j = 1, 2, . . . , c, where we have the following point estimators.

µ̂ = X̄ is the point estimator of µ.

γ̂i = X̄i· − X̄ is the point estimator of γi, for i = 1, 2, . . . , r.

β̂j = X̄·j − X̄ is the point estimator of βj, for j = 1, 2, . . . , c.

It follows that the residual, i.e. the estimator of εij, is:

ε̂ij = Xij − X̄i· − X̄·j + X̄

for i = 1, 2, . . . r and j = 1, 2, . . . , c.
The two-way ANOVA model assumes εij ∼ N(0, σ^2) and so, if the model structure is correct, then the ε̂ij s should behave like independent N(0, σ^2) random variables.

Example 10.6 The following table lists the percentage annual returns (calculated
four times per annum) of the Common Stock Index at the New York Stock Exchange
during 1981–85, available in the data file ‘NYSE.csv’ (available on the VLE).

1st quarter 2nd quarter 3rd quarter 4th quarter


1981 5.7 6.0 7.1 6.7
1982 7.2 7.0 6.1 5.2
1983 4.9 4.1 4.2 4.4
1984 4.5 4.9 4.5 4.5
1985 4.4 4.2 4.2 3.6

(a) Is the variability in returns from year to year statistically significant?

(b) Are returns affected by the quarter of the year?


Using two-way ANOVA, we test the no row effect hypothesis to answer (a), and test
the no column effect hypothesis to answer (b). We have r = 5 and c = 4.
The row sample means are calculated using X̄i· = Σ_{j=1}^{c} X_{ij}/c, which gives 6.375, 6.375, 4.4, 4.6 and 4.1, for i = 1, 2, . . . , 5, respectively.

The column sample means are calculated using X̄·j = Σ_{i=1}^{r} X_{ij}/r, which gives 5.34, 5.24, 5.22 and 4.88, for j = 1, 2, 3, 4, respectively.

The overall sample mean is x̄ = Σ_{i=1}^{r} x̄i·/r = 5.17.

The sum of the squared observations is Σ_{i=1}^{r} Σ_{j=1}^{c} x_{ij}^2 = 559.06.


Hence we have the following.

Total SS = Σ_{i=1}^{r} Σ_{j=1}^{c} x_{ij}^2 − rcx̄^2 = 559.06 − 20 × (5.17)^2 = 559.06 − 534.578 = 24.482.

b_row = c Σ_{i=1}^{r} x̄i·^2 − rcx̄^2 = 4 × 138.611 − 534.578 = 19.867.

b_col = r Σ_{j=1}^{c} x̄·j^2 − rcx̄^2 = 5 × 107.036 − 534.578 = 0.602.

Residual SS = (Total SS) − b_row − b_col = 24.482 − 19.867 − 0.602 = 4.013.

To test the no row effect hypothesis H0 : γ1 = γ2 = · · · = γ5 = 0, the test statistic value is:

f = (c − 1)b_row / Residual SS = 3 × 19.867/4.013 = 14.852.

Under H0, F ∼ Fr−1, (r−1)(c−1) = F4, 12. Using Table A.3 of the Dougherty Statistical Tables, since F0.01, 4, 12 = 5.41 < 14.852, we reject H0 at the 1% significance level. We conclude that there is strong evidence that the return does depend on the year.

To test the no column effect hypothesis H0 : β1 = β2 = β3 = β4 = 0, the test statistic value is:

f = (r − 1)b_col / Residual SS = 4 × 0.602/4.013 = 0.600.
Residual SS 4.013
Under H0 , F ∼ Fc−1, (r−1)(c−1) = F3, 12 . Since F0.05, 3, 12 = 3.49 > 0.600, we cannot
reject H0 even at the 5% significance level. Therefore, there is no significant evidence
indicating that the return depends on the quarter.

The results may be summarised in a two-way ANOVA table as follows:

Source DF SS MS F p-value
Year 4 19.867 4.967 14.852 < 0.01
Quarter 3 0.602 0.201 0.600 > 0.10
Residual 12 4.013 0.334
Total 19 24.482

We could also provide 95% confidence interval estimates for each block and treatment level by using the pooled estimator of σ^2, which is:

S^2 = Residual SS/((r − 1)(c − 1)) = Residual MS.

For the given data, s^2 = 0.334.
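
The R output shown below can be reproduced by reading ‘NYSE.csv’ or, as a self-contained sketch, by entering the data directly; treating Year and Quarter as factors is an assumption of this construction:

# Setting up the NYSE returns for the two-way ANOVA output shown below.
Return  <- c(5.7, 6.0, 7.1, 6.7,
             7.2, 7.0, 6.1, 5.2,
             4.9, 4.1, 4.2, 4.4,
             4.5, 4.9, 4.5, 4.5,
             4.4, 4.2, 4.2, 3.6)
Year    <- factor(rep(1981:1985, each = 4))
Quarter <- factor(rep(paste0("Q", 1:4), times = 5))
anova(lm(Return ~ Year + Quarter))   # reproduces the two-way ANOVA table above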


R produces the following output:

> anova(lm(Return ~ Year + Quarter))


Analysis of Variance Table

Response: Return
Df Sum Sq Mean Sq F value Pr(>F)
Year 4 19.867 4.9667 14.852 0.0001349 ***
Quarter 3 0.602 0.2007 0.600 0.6271918
Residuals 12 4.013 0.3344
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that the confidence intervals for years 1 and 2 (corresponding to 1981 and
1982) are separated from those for years 3 to 5 (that is, 1983 to 1985), which is
consistent with rejection of H0 in the no row effect test. In contrast, the confidence
intervals for each quarter all overlap, which is consistent with our failure to reject H0
in the no column effect test.
Finally, we may also look at the residuals:

ε̂ij = Xij − µ̂ − γ̂i − β̂j    for i = 1, 2, . . . , r and j = 1, 2, . . . , c.

If the assumed normal model (structure) is correct, the ε̂ij s should behave like independent N(0, σ^2) random variables.
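
Continuing the sketch above (and assuming Return, Year and Quarter are defined as there), the residuals can be extracted and inspected informally:

# Residuals from the two-way fit, as an informal check of the model assumptions.
fit <- lm(Return ~ Year + Quarter)
round(resid(fit), 3)   # the 20 residuals epsilon-hat_ij
qqnorm(resid(fit))     # normal Q-Q plot; a roughly straight line supports normality
qqline(resid(fit))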

10.9 Overview of chapter

This chapter introduced analysis of variance as a statistical tool to detect differences


between group means. One-way and two-way analysis of variance frameworks were
presented depending on whether one or two independent variables were modelled,
respectively. Statistical inference in the form of hypothesis tests and confidence intervals
was conducted.

10.10 Key terms and concepts

ANOVA decomposition Between-blocks variation


Between-groups variation Between-treatments variation
One-way ANOVA Random errors
Residual Sample mean
Total variation Two-way ANOVA
Within-groups variation


10.11 Sample examination questions


1. An indicator of the value of a stock relative to its earnings is its price-earnings
ratio. The following table provides the summary statistics of the price-earnings
ratios for a random sample of 36 stocks, 12 each from the financial, industrial and
pharmaceutical sectors.

Sector Sample mean Sample variance Sample size


Financial 32.50 3.86 12
Industrial 29.91 3.26 12
Pharmaceutical 29.31 3.14 12

You are also given that:

Σ_{j=1}^{3} Σ_{i=1}^{12} x_{ij}^2 = 33,829.70.

Test at the 5% significance level whether the true mean price-earnings ratios for
the three market sectors are the same. Use the ANOVA table format to summarise
your calculations. You may exclude the p-value.

2. The audience shares (in %) of three major television networks’ evening news
broadcasts in four major cities were examined. The average audience share for the
three networks (A, B and C) were 21.35%, 17.28% and 20.18%, respectively. The
following is the calculated ANOVA table with some entries missing.

Source Degrees of freedom Sum of squares Mean square F -value


City 1.95
Network
Error
Total 51.52

(a) Complete the table using the information provided above.

(b) Test, at the 5% significance level, whether there is evidence of a difference in


audience shares between networks.

10.12 Solutions to Sample examination questions


1. For these n = 36 observations and k = 3 groups, we are given that x̄·1 = 32.50, x̄·2 = 29.91 and x̄·3 = 29.31. Hence:

x̄ = (32.50 + 29.91 + 29.31)/3 = 30.57.

Hence the total variation is:

Σ_{j=1}^{3} Σ_{i=1}^{12} x_{ij}^2 − nx̄^2 = 33,829.70 − 36 × (30.57)^2 = 186.80.

The between-groups variation is:

b = Σ_{j=1}^{3} n_j x̄·j^2 − nx̄^2 = 12 × ((32.50)^2 + (29.91)^2 + (29.31)^2) − 36 × (30.57)^2 = 76.31.

Therefore, w = 186.80 − 76.31 = 110.49. Hence the ANOVA table is:
Source DF SS MS F
Sector 2 76.31 38.16 11.39
Error 33 110.49 3.35
Total 35 186.80
We test:
H0 : PE ratio means are equal vs. H1 : PE ratio means are not equal
and we reject H0 if:
f > F0.05, 2, 33 ≈ 3.30.
Since 3.30 < 11.39, we reject H0 and conclude that there is evidence of a difference
in the mean price-earnings ratios across the sectors.
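For readers working with R, the ANOVA table above can be reproduced directly from the summary statistics in the question. The following is a minimal illustrative sketch (the only inputs are the group means, group size and the double sum quoted above; qf() gives the F critical value which the question allows you to omit):

> xbar.j <- c(32.50, 29.91, 29.31)          # group sample means
> n <- 36; k <- 3; nj <- 12                  # total size, number of groups, group size
> xbar <- mean(xbar.j)                       # overall mean (equal group sizes)
> total <- 33829.70 - n * xbar^2             # total variation
> b <- nj * sum(xbar.j^2) - n * xbar^2       # between-groups variation
> w <- total - b                             # within-groups variation
> f <- (b / (k - 1)) / (w / (n - k))         # F test statistic, about 11.4
> qf(0.95, k - 1, n - k)                     # 5% critical value, about 3.3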

2. (a) The average audience share of all networks is:


(21.35 + 17.28 + 20.18)/3 = 19.60.
Hence the sum of squares (SS) due to networks is:
4 × ((21.35 − 19.60)2 + (17.28 − 19.60)2 + (20.18 − 19.60)2 ) = 35.13
and the mean sum of squares (MS) due to networks is 35.13/(3 − 1) = 17.57.
The degrees of freedom are 4 − 1 = 3, 3 − 1 = 2, (4 − 1)(3 − 1) = 6 and
4 × 3 − 1 = 11 for cities, networks, error and total sum of squares, respectively.
The SS for cities is 3 × 1.95 = 5.85. We have that the SS due to residuals is
given by 51.52 − 5.85 − 35.13 = 10.54 and the MS is 10.54/6 = 1.76. The
F -values are 1.95/1.76 = 1.11 and 17.57/1.76 = 9.98 for cities and networks,
respectively.
Source Degrees of freedom Sum of squares Mean square F -value
City 3 5.85 1.95 1.11
Network 2 35.13 17.57 9.98
Error 6 10.54 1.76
Total 11 51.52

(b) We test H0 : There is no difference between networks against H1 : There is a


difference between networks. The F -value is 9.98 and at a 5% significance level
the critical value is F0.05, 2, 6 = 5.14, hence we reject H0 and conclude that there
is evidence of a difference between networks.
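The completed two-way ANOVA table can likewise be checked with a few lines of R. This is an illustrative sketch using only the figures supplied in the question:

> network.means <- c(21.35, 17.28, 20.18)
> network.ss <- 4 * sum((network.means - mean(network.means))^2)   # about 35.13
> city.ss <- 3 * 1.95                                              # df x MS = 5.85
> error.ss <- 51.52 - city.ss - network.ss                         # about 10.54
> f.network <- (network.ss / 2) / (error.ss / 6)                   # about 9.98
> f.network > qf(0.95, 2, 6)                                       # TRUE, so reject H0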

A total of 4,000 cans are opened around the world every second. Ten babies are
conceived around the world every second. Each time you open a can, you stand
a 1-in-400 chance of falling pregnant.
(True or false?)

Chapter 11
Linear regression

11.1 Synopsis of chapter


This chapter covers linear regression whereby the variation in a continuous dependent
variable is modelled as being explained by one or more continuous independent
variables.

11.2 Learning outcomes


After completing this chapter, you should be able to:

derive from first principles the least squares estimators of the intercept and slope in
the simple linear regression model

explain how to construct confidence intervals and perform hypothesis tests for the
intercept and slope in the simple linear regression model

demonstrate how to construct confidence intervals and prediction intervals and


explain the difference between the two

summarise the multiple linear regression model with several explanatory variables,
and explain its interpretation

provide the assumptions on which regression models are based

interpret typical output from a computer package fitting of a regression model.

11.3 Introduction
Regression analysis is one of the most frequently-used statistical techniques. It aims
to model an explicit relationship between one dependent variable, often denoted as y,
and one or more regressors (also called covariates, or independent variables), often
denoted as x1 , x2 , . . . , xp .
The goal of regression analysis is to understand how y depends on x1 , x2 , . . . , xp and to
predict or control the unobserved y based on the observed x1 , x2 , . . . , xp . We start with
some simple examples with p = 1.


11.4 Introductory examples

Example 11.1 In a university town, the sales, y, of 10 Armand’s Pizza Parlour


restaurants are closely related to the student population, x, in their neighbourhoods.
The data file ‘Armand.csv’ (available on the VLE) contains the sales (in thousands
of euros) in a period of three months together with the numbers of students (in
thousands) in their neighbourhoods.
We plot y against x, and draw a straight line through the middle of the data points:

y = β0 + β1 x + ε

where ε stands for a random error term, β0 is the intercept and β1 is the slope of the
straight line.

For a given student population, x, the predicted sales are yb = β0 + β1 x.

Example 11.2 The data file ‘WeightHeight.csv’ (available on the VLE) contains
the heights, x, and weights, y, of 69 students in a class.
We plot y against x, and draw a straight line through the middle of the data cloud:

y = β0 + β1 x + ε

where ε stands for a random error term, β0 is the intercept and β1 is the slope of the
straight line.
For a given height, x, the predicted value yb = β0 + β1 x may be viewed as a kind of
‘standard weight’.


Example 11.3 Some other possible examples of y and x are shown in the following
table.

y x
Sales Price
Weight gain Protein in diet
Present FTSE 100 index Past FTSE 100 index
Consumption Income
Salary Tenure
Daughter’s height Mother’s height

In most cases, there are several x variables involved. We will consider such situations
later in this chapter.

Some questions to consider are the following.

How to draw a line through data clouds, i.e. how to estimate β0 and β1 ?

How accurate is the fitted line?

What is the error in predicting a future y?

11.5 Simple linear regression


We now present the simple linear regression model. Let the paired observations
(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ) be drawn from the model:

yi = β0 + β1 xi + εi


where:
E(εi) = 0 and Var(εi) = E(εi²) = σ² > 0.
Furthermore, suppose Cov(εi, εj) = E(εi εj) = 0 for all i ≠ j. That is, the εi s are
assumed to be uncorrelated (remembering that a zero covariance between two random
variables implies that they are uncorrelated).
So the model has three parameters: β0 , β1 and σ 2 .
For convenience, we will treat x1 , x2 , . . . , xn as constants.1 We have:

E(yi ) = β0 + β1 xi and Var(yi ) = σ 2 .

Since the εi s are uncorrelated (by assumption), it follows that y1 , y2 , . . . , yn are also
uncorrelated with each other.
Sometimes we assume εi ∼ N (0, σ 2 ), in which case yi ∼ N (β0 + β1 xi , σ 2 ), and
y1 , y2 , . . . , yn are independent. (Remember that a linear transformation of a normal
random variable is also normal, and that for jointly normal random variables if they are
uncorrelated then they are also independent.)
Our tasks are two-fold.

Statistical inference for β0 , β1 and σ 2 , i.e. (point) estimation, confidence intervals


and hypothesis testing.

Prediction intervals for future values of y.

We derive estimators of β0 and β1 using least squares estimation (introduced in Chapter


7). The least squares estimators (LSEs) of β0 and β1 are the values of (β0 , β1 ) at which
the function:
L(β0, β1) = Σ_{i=1}^{n} εi² = Σ_{i=1}^{n} (yi − β0 − β1 xi)²

obtains its minimum.


We proceed to partially differentiate L(β0 , β1 ) with respect to β0 and β1 , respectively.
Firstly:
∂L(β0, β1)/∂β0 = −2 Σ_{i=1}^{n} (yi − β0 − β1 xi).

Setting this partial derivative to zero leads to:


Σ_{i=1}^{n} yi − nβ̂0 − β̂1 Σ_{i=1}^{n} xi = 0   or   β̂0 = ȳ − β̂1 x̄.

Secondly:
∂L(β0, β1)/∂β1 = −2 Σ_{i=1}^{n} xi (yi − β0 − β1 xi).

1
If you study EC2020 Elements of econometrics, you will explore regression models in much
more detail than is covered here. For example, x1 , x2 , . . . , xn will be treated as random variables.


Setting this partial derivative to zero leads to:


0 = Σ_{i=1}^{n} xi (yi − β̂0 − β̂1 xi)
  = Σ_{i=1}^{n} xi (yi − ȳ − (β̂1 xi − β̂1 x̄))
  = Σ_{i=1}^{n} xi (yi − ȳ) − β̂1 Σ_{i=1}^{n} xi (xi − x̄).

Hence:

β̂1 = Σ_{i=1}^{n} xi (yi − ȳ) / Σ_{i=1}^{n} xi (xi − x̄) = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)²   and   β̂0 = ȳ − β̂1 x̄.

The estimator β̂1 above is based on the fact that for any constant c, we have:

Σ_{i=1}^{n} xi (yi − ȳ) = Σ_{i=1}^{n} (xi − c)(yi − ȳ)

since:

Σ_{i=1}^{n} c(yi − ȳ) = c Σ_{i=1}^{n} (yi − ȳ) = 0.

Given that Σ_{i=1}^{n} (xi − x̄) = 0, it follows that Σ_{i=1}^{n} c(xi − x̄) = 0 for any constant c.

In order to calculate β̂1 numerically, often the following formula is convenient:

β̂1 = (Σ_{i=1}^{n} xi yi − nx̄ȳ) / (Σ_{i=1}^{n} xi² − nx̄²).
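As an illustration, the formula can be checked against R's built-in lm() function. The data below are made up purely for demonstration (they are not one of the VLE datasets), so this is a sketch rather than part of the worked material:

> x <- c(1, 2, 3, 4, 5, 6, 7, 8)                       # illustrative data only
> y <- c(3.1, 4.9, 7.2, 8.8, 11.1, 12.8, 15.2, 16.9)
> n <- length(x)
> b1 <- (sum(x * y) - n * mean(x) * mean(y)) / (sum(x^2) - n * mean(x)^2)
> b0 <- mean(y) - b1 * mean(x)
> c(b0, b1)            # hand-calculated least squares estimates
> coef(lm(y ~ x))      # lm() returns the same values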
An alternative derivation is as follows. Note L(β0, β1) = Σ_{i=1}^{n} (yi − β0 − β1 xi)². For any β0
and β1 , we have:

L(β0, β1) = Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi + β̂0 − β0 + (β̂1 − β1)xi)²
          = L(β̂0, β̂1) + Σ_{i=1}^{n} (β̂0 − β0 + (β̂1 − β1)xi)² + 2B     (11.1)

where:

B = Σ_{i=1}^{n} (β̂0 − β0 + (β̂1 − β1)xi)(yi − β̂0 − β̂1 xi)
  = (β̂0 − β0) Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi) + (β̂1 − β1) Σ_{i=1}^{n} xi (yi − β̂0 − β̂1 xi).


Now let (β̂0, β̂1) be the solution to the equations:

Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi) = 0   and   Σ_{i=1}^{n} xi (yi − β̂0 − β̂1 xi) = 0     (11.2)

such that B = 0. By (11.1), we have:

L(β0, β1) = L(β̂0, β̂1) + Σ_{i=1}^{n} (β̂0 − β0 + (β̂1 − β1)xi)² ≥ L(β̂0, β̂1).

Hence (β̂0, β̂1) are the least squares estimators (LSEs) of β0 and β1 , respectively.
To find the explicit expression from (11.2), note the first equation can be written as:

nȳ − nβ̂0 − nβ̂1 x̄ = 0.

Hence β̂0 = ȳ − β̂1 x̄. Substituting this into the second equation, we have:

0 = Σ_{i=1}^{n} xi (yi − ȳ − β̂1 (xi − x̄)) = Σ_{i=1}^{n} xi (yi − ȳ) − β̂1 Σ_{i=1}^{n} xi (xi − x̄).

Therefore:

β̂1 = Σ_{i=1}^{n} xi (yi − ȳ) / Σ_{i=1}^{n} xi (xi − x̄) = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)².

This completes the derivation.


Remember Σ_{i=1}^{n} (xi − x̄) = 0. Hence Σ_{i=1}^{n} c(xi − x̄) = 0 for any constant c.

We also note the estimator of σ², which is:

σ̂² = Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi)² / (n − 2).

We now explore the properties of the LSEs β̂0 and β̂1 . We can show that the means and
variances of these LSEs are:

E(β̂0) = β0   and   Var(β̂0) = σ² Σ_{i=1}^{n} xi² / (n Σ_{i=1}^{n} (xi − x̄)²)

for β̂0 , and:

E(β̂1) = β1   and   Var(β̂1) = σ² / Σ_{i=1}^{n} (xi − x̄)²

for β̂1 .


Proof: Recall we treat the xi s as constants, and we have E(yi) = β0 + β1 xi and also
Var(yi) = σ². Hence:

E(ȳ) = E((1/n) Σ_{i=1}^{n} yi) = (1/n) Σ_{i=1}^{n} E(yi) = (1/n) Σ_{i=1}^{n} (β0 + β1 xi) = β0 + β1 x̄.

Therefore:

E(yi − ȳ) = β0 + β1 xi − (β0 + β1 x̄) = β1 (xi − x̄).

Consequently, we have:

E(β̂1) = E(Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)²) = Σ_{i=1}^{n} (xi − x̄) E(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)² = β1 Σ_{i=1}^{n} (xi − x̄)² / Σ_{i=1}^{n} (xi − x̄)² = β1 .

Now:

E(β̂0) = E(ȳ − β̂1 x̄) = β0 + β1 x̄ − β1 x̄ = β0 .

Therefore, the LSEs βb0 and βb1 are unbiased estimators of β0 and β1 , respectively.
To work out the variances, the key is to write β̂1 and β̂0 as linear estimators (i.e.
linear combinations of the yi s):

β̂1 = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)² = Σ_{i=1}^{n} (xi − x̄) yi / Σ_{k=1}^{n} (xk − x̄)² = Σ_{i=1}^{n} ai yi

where ai = (xi − x̄) / Σ_{k=1}^{n} (xk − x̄)², and:

β̂0 = ȳ − β̂1 x̄ = ȳ − Σ_{i=1}^{n} ai x̄ yi = Σ_{i=1}^{n} (1/n − ai x̄) yi .

Note that:

Σ_{i=1}^{n} ai = 0   and   Σ_{i=1}^{n} ai² = 1 / Σ_{k=1}^{n} (xk − x̄)².

Now we note the following lemma, without proof. Let y1 , y2 , . . . , yn be uncorrelated
random variables, and b1 , b2 , . . . , bn be constants, then:

Var(Σ_{i=1}^{n} bi yi) = Σ_{i=1}^{n} bi² Var(yi).


By this lemma:

Var(β̂1) = Var(Σ_{i=1}^{n} ai yi) = σ² Σ_{i=1}^{n} ai² = σ² / Σ_{k=1}^{n} (xk − x̄)²

and:

Var(β̂0) = σ² Σ_{i=1}^{n} (1/n − ai x̄)² = σ² (1/n + Σ_{i=1}^{n} ai² x̄²) = (σ²/n) (1 + nx̄² / Σ_{k=1}^{n} (xk − x̄)²)
         = σ² Σ_{k=1}^{n} xk² / (n Σ_{k=1}^{n} (xk − x̄)²).

The last equality uses the fact that:

Σ_{k=1}^{n} xk² = Σ_{k=1}^{n} (xk − x̄)² + nx̄².

11.6 Inference for parameters in normal regression models

The normal simple linear regression model is yi = β0 + β1 xi + εi , where:

ε1 , ε2 , . . . , εn ∼IID N (0, σ 2 ).

y1 , y2 , . . . , yn are independent (but not identically distributed) and:

yi ∼ N (β0 + β1 xi , σ 2 ).

Since any linear combination of normal random variables is also normal, the LSEs of β0
and β1 (as linear estimators) are also normal random variables. In fact:

β̂0 ∼ N(β0 , σ² Σ_{i=1}^{n} xi² / (n Σ_{i=1}^{n} (xi − x̄)²))   and   β̂1 ∼ N(β1 , σ² / Σ_{i=1}^{n} (xi − x̄)²).

Since σ² is unknown in practice, we replace σ² by its estimator:

σ̂² = Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi)² / (n − 2)


and use the estimated standard errors:

E.S.E.(β̂0) = (σ̂/√n) × (Σ_{i=1}^{n} xi² / Σ_{i=1}^{n} (xi − x̄)²)^{1/2}

and:

E.S.E.(β̂1) = σ̂ / (Σ_{i=1}^{n} (xi − x̄)²)^{1/2} .

The following results all make use of distributional results introduced earlier in the
course. Statistical inference (confidence intervals and hypothesis testing) for the normal
simple linear regression model can then be performed.

i. We have:

   (n − 2)σ̂²/σ² = Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi)² / σ² ∼ χ²_{n−2} .

ii. β̂0 and σ̂² are independent, hence:

   (β̂0 − β0) / E.S.E.(β̂0) ∼ t_{n−2} .

iii. β̂1 and σ̂² are independent, hence:

   (β̂1 − β1) / E.S.E.(β̂1) ∼ t_{n−2} .

Confidence intervals for the simple linear regression model parameters

A 100(1 − α)% confidence interval for β0 is:

(β̂0 − t_{α/2, n−2} × E.S.E.(β̂0),  β̂0 + t_{α/2, n−2} × E.S.E.(β̂0))

and a 100(1 − α)% confidence interval for β1 is:

(β̂1 − t_{α/2, n−2} × E.S.E.(β̂1),  β̂1 + t_{α/2, n−2} × E.S.E.(β̂1))

where t_{α, k} denotes the top 100αth percentile of the Student's t_k distribution, obtained
from Table 10 of the New Cambridge Statistical Tables.
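In R, these confidence intervals can be obtained from a fitted lm object with confint(), avoiding the hand calculation of the estimated standard errors. The following is a minimal sketch using made-up data (any small dataset would do):

> x <- c(1, 2, 3, 4, 5, 6, 7, 8)                       # illustrative data only
> y <- c(3.1, 4.9, 7.2, 8.8, 11.1, 12.8, 15.2, 16.9)
> reg <- lm(y ~ x)
> confint(reg, level = 0.95)          # 95% confidence intervals for beta0 and beta1
> qt(0.975, df = df.residual(reg))    # the multiplier t_{0.025, n-2} used above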


Tests for the regression slope

The relationship between y and x in the regression model hinges on β1 . If β1 = 0,
then y ∼ N(β0 , σ²).

To validate the use of the regression model, we need to make sure that β1 ≠ 0, or
more practically that β̂1 is significantly non-zero. This amounts to testing:

H0 : β1 = 0  vs.  H1 : β1 ≠ 0.

Under H0 , the test statistic is:

T = β̂1 / E.S.E.(β̂1) ∼ t_{n−2} .

At the 100α% significance level, we reject H0 if |t| > tα/2, n−2 , where t is the observed
test statistic value.
Alternatively, we could use H1 : β1 < 0 or H1 : β1 > 0 if there was a rationale for
doing so. In such cases, we would reject H0 if t < −tα, n−2 and t > tα, n−2 for the
lower-tailed and upper-tailed t tests, respectively.

Some remarks are the following.

i. For testing H0 : β1 = b for a given constant b, the above test still applies, but now
with the following test statistic:

   T = (β̂1 − b) / E.S.E.(β̂1).

ii. Tests for the regression intercept β0 may be constructed in a similar manner,
replacing β1 and βb1 with β0 and βb0 , respectively.

In the normal regression model, the LSEs βb0 and βb1 are also the MLEs of β0 and β1 ,
respectively.
Since εi = yi − β0 − β1 xi ∼ IID N(0, σ²), the likelihood function is:

L(β0, β1, σ²) = Π_{i=1}^{n} (2πσ²)^{−1/2} exp(−(yi − β0 − β1 xi)²/(2σ²))
             ∝ (1/σ²)^{n/2} exp(−(1/(2σ²)) Σ_{i=1}^{n} (yi − β0 − β1 xi)²).

Hence the log-likelihood function is:

l(β0, β1, σ²) = (n/2) ln(1/σ²) − (1/(2σ²)) Σ_{i=1}^{n} (yi − β0 − β1 xi)² + c.

Therefore, for any β0 , β1 and σ² > 0, we have:

l(β0, β1, σ²) ≤ l(β̂0, β̂1, σ²).


Hence (β̂0, β̂1) are the MLEs of (β0, β1).

To find the MLE of σ², we need to maximise:

l(σ²) = l(β̂0, β̂1, σ²) = (n/2) ln(1/σ²) − (1/(2σ²)) Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi)².

Setting u = 1/σ², it is equivalent to maximising:

g(u) = n ln u − ub

where b = Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi)².

Setting dg(u)/du = n/u − b = 0 gives û = n/b, i.e. g(u) attains its maximum at u = û.
Hence the MLE of σ² is:

σ̃² = 1/û = b/n = (1/n) Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi)².

Note the MLE σ̃² is a biased estimator of σ². In practice, we often use the unbiased
estimator:

σ̂² = (1/(n − 2)) Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi)².
We now consider an empirical example of the normal simple linear regression model.

Example 11.4 The dataset ‘Cigarette.csv’ (available on the VLE) contains the
annual cigarette consumption, x, and the corresponding mortality rate, y, due to
coronary heart disease (CHD) of 21 countries. Some useful summary statistics
calculated from the data are:
Σ_{i=1}^{21} xi = 45,110,   Σ_{i=1}^{21} yi = 3,042.2,   Σ_{i=1}^{21} xi² = 109,957,100,

Σ_{i=1}^{21} yi² = 529,321.58   and   Σ_{i=1}^{21} xi yi = 7,319,602.

Do these data support the suspicion that smoking contributes to CHD mortality?
(Note the assertion ‘smoking is harmful for health’ is largely based on statistical,
rather than laboratory, evidence.)
We fit the regression model y = β0 + β1 x + ε. Our least squares estimates of β1 and
β0 are, respectively:
β̂1 = Σ_i (xi − x̄)(yi − ȳ) / Σ_i (xi − x̄)² = (Σ_i xi yi − nx̄ȳ) / (Σ_i xi² − nx̄²) = (Σ_i xi yi − Σ_i xi Σ_j yj / n) / (Σ_i xi² − (Σ_i xi)²/n)

   = (7,319,602 − 45,110 × 3,042.2/21) / (109,957,100 − (45,110)²/21)

   = 0.06


and:

β̂0 = ȳ − β̂1 x̄ = (3,042.2 − 0.06 × 45,110)/21 = 15.77.

Also:

σ̂² = Σ_i (yi − β̂0 − β̂1 xi)² / (n − 2)
   = (Σ_i yi² + nβ̂0² + β̂1² Σ_i xi² − 2β̂0 Σ_i yi − 2β̂1 Σ_i xi yi + 2β̂0 β̂1 Σ_i xi) / (n − 2)
   = 2,181.66.

We now proceed to test H0 : β1 = 0 vs. H1 : β1 > 0. (If indeed smoking contributes


to CHD mortality, then β1 > 0.)
We have calculated βb1 = 0.06. However, is this deviation from zero due to sampling
error, or is it significantly different from zero? (The magnitude of βb1 itself is not
important in determining if β1 = 0 or not – changing the scale of x may make βb1
arbitrarily small.)
Under H0 , the test statistic is:

T = β̂1 / E.S.E.(β̂1) ∼ t_{n−2} = t_{19}

where E.S.E.(β̂1) = σ̂ / (Σ_i (xi − x̄)²)^{1/2} = 0.01293.
Since t = 0.06/0.01293 = 4.64 > 2.54 = t0.01, 19 , we reject the hypothesis β1 = 0 at
the 1% significance level and we conclude that there is strong evidence that smoking
contributes to CHD mortality.
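The whole of Example 11.4 can be reproduced in R from the five summary statistics alone, without access to the raw data in 'Cigarette.csv'. The following is a minimal illustrative sketch:

> n <- 21
> sx <- 45110; sy <- 3042.2; sxx <- 109957100; syy <- 529321.58; sxy <- 7319602
> xbar <- sx / n; ybar <- sy / n
> b1 <- (sxy - n * xbar * ybar) / (sxx - n * xbar^2)               # about 0.06
> b0 <- ybar - b1 * xbar                                           # about 15.77
> s2 <- (syy - n * ybar^2 - b1^2 * (sxx - n * xbar^2)) / (n - 2)   # about 2181.66
> ese <- sqrt(s2 / (sxx - n * xbar^2))                             # about 0.01293
> tstat <- b1 / ese                                                # about 4.64
> tstat > qt(0.99, n - 2)                                          # TRUE, so reject H0 at the 1% level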

11.7 Regression ANOVA


In Chapter 10 we discussed ANOVA, whereby we decomposed the total variation of a
continuous dependent variable. In a similar way we can decompose the total variation of
y in the simple linear regression model. It can be shown that the regression ANOVA
decomposition is:
Σ_{i=1}^{n} (yi − ȳ)² = β̂1² Σ_{i=1}^{n} (xi − x̄)² + Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi)²

where, denoting sum of squares by 'SS', we have the following.

Total SS is Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} yi² − nȳ².

Regression (explained) SS is β̂1² Σ_{i=1}^{n} (xi − x̄)² = β̂1² (Σ_{i=1}^{n} xi² − nx̄²).

Residual (error) SS is Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi)² = Total SS − Regression SS.


If εi ∼ N(0, σ²) and β1 = 0, then it can be shown that:

Σ_{i=1}^{n} (yi − ȳ)² / σ² ∼ χ²_{n−1}

β̂1² Σ_{i=1}^{n} (xi − x̄)² / σ² ∼ χ²_1

Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi)² / σ² ∼ χ²_{n−2} .

Therefore, under H0 : β1 = 0, we have:

F = [(Regression SS)/1] / [(Residual SS)/(n − 2)] = (n − 2) β̂1² Σ_{i=1}^{n} (xi − x̄)² / Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi)² = (β̂1 / E.S.E.(β̂1))² ∼ F_{1, n−2} .

We reject H0 at the 100α% significance level if f > F_{α, 1, n−2} , where f is the observed
test statistic value and F_{α, 1, n−2} is the top 100αth percentile of the F_{1, n−2} distribution,
obtained from Table A.3 of the Dougherty Statistical Tables.

A useful statistic is the coefficient of determination, denoted as R², defined as:

R² = Regression SS / Total SS = 1 − Residual SS / Total SS.
If we view Total SS as the total variation (or energy) of y, then R2 is the proportion of
the total variation of y explained by x. Note that R2 ∈ [0, 1]. The closer R2 is to 1, the
better the explanatory power of the regression model.
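In R the regression ANOVA table and R² are produced automatically once a model has been fitted. The following illustrative sketch (again with made-up data) shows both the built-in output and R² computed directly from its definition:

> x <- c(1, 2, 3, 4, 5, 6, 7, 8)                       # illustrative data only
> y <- c(3.1, 4.9, 7.2, 8.8, 11.1, 12.8, 15.2, 16.9)
> reg <- lm(y ~ x)
> anova(reg)                                     # Regression SS, Residual SS and the F test
> summary(reg)$r.squared                         # R^2 as reported by R
> 1 - sum(resid(reg)^2) / sum((y - mean(y))^2)   # R^2 = 1 - Residual SS / Total SS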

11.8 Confidence intervals for E(y)


Based on the observations (xi , yi ), for i = 1, 2, . . . , n, we fit a regression model:

yb = βb0 + βb1 x.

Our goal is to predict the unobserved y corresponding to a known x. The point


prediction is:
yb = βb0 + βb1 x.
For the analysis to be more informative, we would like to have some ‘error bars’ for our
prediction. We introduce two methods as follows.

A confidence interval for µ(x) = E(y) = β0 + β1 x.

A prediction interval for y.

A confidence interval is an interval estimator of an unknown parameter (i.e. for a


constant) while a prediction interval is for a random variable. They are different and
serve different purposes.


We assume the model is normal, i.e. ε = y − β0 − β1 x ∼ N(0, σ²), and let
μ̂(x) = β̂0 + β̂1 x, such that μ̂(x) is an unbiased estimator of μ(x). We note without
proof that:

μ̂(x) ∼ N(μ(x), (σ²/n) Σ_{i=1}^{n} (xi − x)² / Σ_{j=1}^{n} (xj − x̄)²).

Standardising gives:

(μ̂(x) − μ(x)) / ((σ²/n) Σ_{i=1}^{n} (xi − x)² / Σ_{j=1}^{n} (xj − x̄)²)^{1/2} ∼ N(0, 1).

In practice σ² is unknown, but it can be shown that (n − 2)σ̂²/σ² ∼ χ²_{n−2} , where
σ̂² = Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi)² / (n − 2). Furthermore, μ̂(x) and σ̂² are independent. Hence:

(μ̂(x) − μ(x)) / ((σ̂²/n) Σ_{i=1}^{n} (xi − x)² / Σ_{j=1}^{n} (xj − x̄)²)^{1/2} ∼ t_{n−2} .

Confidence interval for μ(x)

A 100(1 − α)% confidence interval for μ(x) is:

μ̂(x) ± t_{α/2, n−2} × σ̂ × (Σ_{i=1}^{n} (xi − x)² / (n Σ_{j=1}^{n} (xj − x̄)²))^{1/2} .

Such a confidence interval contains the true expectation E(y) = μ(x) with probability
1 − α over repeated samples. It does not cover y with probability 1 − α.

11.9 Prediction intervals for y


A 100(1 − α)% prediction interval is an interval which contains y with probability
1 − α.
We may assume that the y to be predicted is independent of y1 , y2 , . . . , yn used in the
estimation of the regression model.
Hence y − μ̂(x) is normal with mean 0 and variance:

Var(y) + Var(μ̂(x)) = σ² + (σ²/n) Σ_{i=1}^{n} (xi − x)² / Σ_{j=1}^{n} (xj − x̄)².


Therefore:

(y − μ̂(x)) / (σ̂² (1 + Σ_{i=1}^{n} (xi − x)² / (n Σ_{j=1}^{n} (xj − x̄)²)))^{1/2} ∼ t_{n−2} .

Prediction interval for y

A 100(1 − α)% prediction interval covering y with probability 1 − α is:

μ̂(x) ± t_{α/2, n−2} × σ̂ × (1 + Σ_{i=1}^{n} (xi − x)² / (n Σ_{j=1}^{n} (xj − x̄)²))^{1/2} .

Some remarks are the following.

i. It holds that:

   P(y ∈ μ̂(x) ± t_{α/2, n−2} × σ̂ × (1 + Σ_{i=1}^{n} (xi − x)² / (n Σ_{j=1}^{n} (xj − x̄)²))^{1/2}) = 1 − α.

ii. The prediction interval for y is wider than the confidence interval for E(y). The
former contains the unobserved random variable y with probability 1 − α, the
latter contains the unknown constant E(y) with probability 1 − α over repeated
samples.

Example 11.5 The dataset ‘UsedFord.csv’ (available on the VLE) contains the
prices (y, in $000s) of 100 three-year-old Ford Tauruses together with their mileages
(x, in thousands of miles) when they were sold at auction. Based on these data, a
car dealer needs to make two decisions.

1. To prepare cash for bidding on one three-year-old Ford Taurus with a mileage of
x = 40.

2. To prepare cash for buying several three-year-old Ford Tauruses with mileages
close to x = 40 from a rental company.


For the first task, a prediction interval would be more appropriate. For the second
task, the car dealer needs to know the average price and, therefore, a confidence
interval is appropriate. This can be easily done using R.

> reg <- lm(Price~ Mileage)


> summary(reg)

Call:
lm(formula = Price ~ Mileage)

Residuals:
Min 1Q Median 3Q Max
-0.68679 -0.27263 0.00521 0.23210 0.70071

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.248727 0.182093 94.72 <2e-16 ***
Mileage -0.066861 0.004975 -13.44 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3265 on 98 degrees of freedom


Multiple R-squared: 0.6483, Adjusted R-squared: 0.6447
F-statistic: 180.6 on 1 and 98 DF, p-value: < 2.2e-16
> new.Mileage <- data.frame(Mileage = c(40))
> predict(reg, newdata = new.Mileage, int = "c")
fit lwr upr
1 14.57429 14.49847 14.65011
> predict(reg, newdata = new.Mileage, int = "p")
fit lwr upr
1 14.57429 13.92196 15.22662

We predict that a Ford Taurus will sell for between $13,922 and $15,227. The
average selling price of several three-year-old Ford Tauruses is estimated to be
between $14,498 and $14,650. Because predicting the selling price for one car is more
difficult, the corresponding prediction interval is wider than the confidence interval.
To produce the plots with confidence intervals for E(y) and prediction intervals for
y, we proceed as follows:

> pc <- predict(reg,int="c")


> pp <- predict(reg,int="p")
> plot(Mileage,Price,pch=16)
> matlines(Mileage,pc)
> matlines(Mileage,pp)


(Figure: scatterplot of Price against Mileage, with the fitted line, confidence interval bands for E(y) and prediction interval bands for y.)

11.10 Multiple linear regression models


For most practical problems, the variable of interest, y, typically depends on several
explanatory variables, say x1 , x2 , . . . , xp , leading to the multiple linear regression
model. In this course we only provide a brief overview of the multiple linear regression
model. Subsequent econometrics courses would explore this model in much greater
depth.
Let (yi , xi1 , xi2 , . . . , xip ), for i = 1, 2, . . . , n, be observations from the model:
yi = β0 + β1 xi1 + β2 xi2 + · · · + βp xip + εi
where:
E(εi) = 0,  Var(εi) = σ² > 0  and  Cov(εi, εj) = 0 for all i ≠ j.
The multiple linear regression model is a natural extension of the simple linear
regression model, just with more parameters: β0 , β1 , β2 , . . . , βp and σ 2 .
Treating all of the xij s as constants as before, we have:
E(yi ) = β0 + β1 xi1 + β2 xi2 + · · · + βp xip and Var(yi ) = σ 2 .
y1 , y2 , . . . , yn are uncorrelated with each other, again as before.
If in addition εi ∼ N(0, σ²), then:

yi ∼ N(β0 + Σ_{j=1}^{p} βj xij , σ²).

Estimation of the intercept and slope parameters is still performed using least squares
estimation. The LSEs βb0 , βb1 , βb2 , . . . , βbp are obtained by minimising:
Σ_{i=1}^{n} (yi − β0 − Σ_{j=1}^{p} βj xij)²


leading to the fitted regression model:

yb = βb0 + βb1 x1 + βb2 x2 + · · · + βbp xp .

The residuals are expressed as:

ε̂i = yi − β̂0 − Σ_{j=1}^{p} β̂j xij .

Just as with the simple linear regression model, we can decompose the total variation of
y such that:

Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} (ŷi − ȳ)² + Σ_{i=1}^{n} ε̂i²

or, in words:

Total SS = Regression SS + Residual SS.

An unbiased estimator of σ² is:

σ̂² = (1/(n − p − 1)) Σ_{i=1}^{n} (yi − β̂0 − Σ_{j=1}^{p} β̂j xij)² = Residual SS / (n − p − 1).

We can test a single slope coefficient by testing:

H0 : βi = 0  vs.  H1 : βi ≠ 0.

Under H0 , the test statistic is:

T = β̂i / E.S.E.(β̂i) ∼ t_{n−p−1}

and we reject H0 if |t| > tα/2, n−p−1 . However, note the slight difference in the
interpretation of the slope coefficient βj . In the multiple regression setting, βj is the
effect of xj on y, holding all other independent variables fixed – this is unfortunately
not always practical.

It is also possible to test whether all the regression coefficients are equal to zero. This is
known as a joint test of significance and can be used to test the overall significance
of the regression model, i.e. whether there is at least one significant explanatory
(independent) variable, by testing:

H0 : β1 = β2 = · · · = βp = 0  vs.  H1 : At least one βi ≠ 0.

Indeed, it is preferable to perform this joint test of significance before conducting t tests
of individual slope coefficients. Failure to reject H0 would render the model useless and
hence the model would not warrant any further statistical investigation.


Provided εi ∼ N(0, σ²), under H0 : β1 = β2 = · · · = βp = 0, the test statistic is:

F = [(Regression SS)/p] / [(Residual SS)/(n − p − 1)] ∼ F_{p, n−p−1} .

We reject H0 at the 100α% significance level if f > F_{α, p, n−p−1} .

It may be shown that:

Regression SS = Σ_{i=1}^{n} (ŷi − ȳ)² = Σ_{i=1}^{n} (β̂1 (xi1 − x̄1) + β̂2 (xi2 − x̄2) + · · · + β̂p (xip − x̄p))².

Hence, under H0 , f should be very small.

We now conclude the chapter with worked examples of linear regression using R.

11.11 Regression using R


To solve practical regression problems, we need to use statistical computing packages, all
of which include linear regression analysis. Packages such as R make fitting and assessing
regression models straightforward.

Example 11.6 We illustrate the use of linear regression in R using the dataset
‘Armand.csv’, introduced in Example 11.1.

> reg <- lm(Sales ~ Student.population)


> summary(reg)

Call:
lm(formula = Sales ~ Student.population)

Residuals:
Min 1Q Median 3Q Max
-21.00 -9.75 -3.00 11.25 18.00

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.0000 9.2260 6.503 0.000187 ***
Student.population 5.0000 0.5803 8.617 2.55e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.83 on 8 degrees of freedom


Multiple R-squared: 0.9027, Adjusted R-squared: 0.8906
F-statistic: 74.25 on 1 and 8 DF, p-value: 2.549e-05


The fitted line is yb = 60 + 5x. We have σb2 = (13.83)2 . Also, βb0 = 60 and
E.S.E.(βb0 ) = 9.2260. βb1 = 5 and E.S.E.(βb1 ) = 0.5803.
For testing H0 : β0 = 0 we have t = βb0 /E.S.E.(βb0 ) = 6.503. The p-value is
P (|T | > 6.503) = 0.000187, where T ∼ tn−2 .
For testing H0 : β1 = 0 we have t = βb1 /E.S.E.(βb1 ) = 8.617. The p-value is
P (|T | > 8.617) = 0.0000255, where T ∼ tn−2 .
The F test statistic value is 74.25 with a corresponding p-value of:

P(F > 74.25) = 0.00002549

where F ∼ F_{1, 8} .

Example 11.7 We apply the simple linear regression model to study the
relationship between two series of financial returns – a regression of Cisco Systems
stock returns, y, on S&P500 Index returns, x. This regression model is an example of
the capital asset pricing model (CAPM).
Stock returns are defined as:

return = (current price − previous price) / previous price ≈ ln(current price / previous price)

when the difference between the two prices is small.
The data file ‘Returns.csv’ (available on the VLE) contains daily returns over the
period 3 January – 29 December 2000 (i.e. n = 252 observations). The dataset has 5
columns: Day, S&P500 return, Cisco return, Intel return and Sprint return.
Daily prices are definitely not independent. However, daily returns may be seen as a
sequence of uncorrelated random variables.

> summary(S.P500)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.00451 -0.85028 -0.03791 -0.04242 0.79869 4.65458

> summary(Cisco)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-13.4387 -3.0819 -0.1150 -0.1336 2.6363 15.4151
For the S&P500, the average daily return is −0.04%, the maximum daily return is
4.65%, the minimum daily return is −6.01% and the standard deviation is 1.40%.
For Cisco, the average daily return is −0.13%, the maximum daily return is 15.42%,
the minimum daily return is −13.44% and the standard deviation is 4.23%.
We see that Cisco is much more volatile than the S&P500.

> sandpts <- ts(S.P500)


> ciscots <- ts(Cisco)
> ts.plot(sandpts,ciscots,col=c(1:2))


(Figure: time series plot of the daily returns on the S&P500 and Cisco over the sample period.)

There is clear synchronisation between the movements of the two series of returns,
as evident from examining the sample correlation coefficient.

> cor.test(S.P500,Cisco)

Pearson’s product-moment correlation

data: S.P500 and Cisco


t = 14.943, df = 250, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6155530 0.7470423
sample estimates:
cor
0.686878
We fit the regression model: Cisco = β0 + β1 S&P500 + ε.
Our rationale is that part of the fluctuation in Cisco returns was driven by the
fluctuation in the S&P500 returns.
R produces the following regression output.

> reg <- lm(Cisco ~ S.P500)


> summary(reg)

Call:
lm(formula = Cisco ~ S.P500)


Residuals:
Min 1Q Median 3Q Max
-13.1175 -2.0238 0.0091 2.0614 9.9491

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.04547 0.19433 -0.234 0.815
S.P500 2.07715 0.13900 14.943 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.083 on 250 degrees of freedom


Multiple R-squared: 0.4718, Adjusted R-squared: 0.4697
F-statistic: 223.3 on 1 and 250 DF, p-value: < 2.2e-16
The estimated slope is βb1 = 2.07715. The null hypothesis H0 : β1 = 0 is rejected with
a p-value of 0.000 (to three decimal places). Therefore, the test is extremely
significant.
Our interpretation is that when the market index goes up by 1%, Cisco stock goes
up by 2.07715%, on average. However, the error term ε in the model is large, with an
estimated σ̂ = 3.083%.
The p-value for testing H0 : β0 = 0 is 0.815, so we cannot reject the hypothesis that
β0 = 0. Recall βb0 = ȳ − βb1 x̄ and both ȳ and x̄ are very close to 0.
R2 = 47.18%, hence 47.18% of the variation of Cisco stock may be explained by the
variation of the S&P500 index, or, in other words, 47.18% of the risk in Cisco stock
is the market-related risk.
The capital asset pricing model (CAPM) is a simple asset pricing model in finance
given by:
yi = β0 + β1 xi + εi
where yi is a stock return and xi is a market return at time i.
The total risk of the stock is:

(1/n) Σ_{i=1}^{n} (yi − ȳ)² = (1/n) Σ_{i=1}^{n} (ŷi − ȳ)² + (1/n) Σ_{i=1}^{n} (yi − ŷi)².

The market-related (or systematic) risk is:

(1/n) Σ_{i=1}^{n} (ŷi − ȳ)² = β̂1² (1/n) Σ_{i=1}^{n} (xi − x̄)².

The firm-specific risk is:

(1/n) Σ_{i=1}^{n} (yi − ŷi)².

Some remarks are the following.

i. β1 measures the market-related (or systematic) risk of the stock.


ii. Market-related risk is unavoidable, while firm-specific risk may be ‘diversified


away’ through hedging.

iii. Variance is a simple measure (and one of the most frequently-used) of risk in
finance.

Example 11.8 The data in the file ‘Foods.csv’ (available on the VLE) illustrate
the effects of marketing instruments on the weekly sales volume of a certain food
product over a three-year period. Data are real but transformed to protect the
innocent!
There are observations on the following four variables:
y = LVOL: logarithms of weekly sales volume
x1 = PROMP : promotion price
x2 = FEAT : feature advertising
x3 = DISP : display measure.

R produces the following descriptive statistics.

> summary(Foods)
LVOL PROMP FEAT DISP
Min. :13.83 Min. :3.075 Min. : 2.84 Min. :12.42
1st Qu.:14.08 1st Qu.:3.330 1st Qu.:15.95 1st Qu.:20.59
Median :14.24 Median :3.460 Median :22.99 Median :25.11
Mean :14.28 Mean :3.451 Mean :24.84 Mean :25.31
3rd Qu.:14.43 3rd Qu.:3.560 3rd Qu.:33.49 3rd Qu.:29.34
Max. :15.07 Max. :3.865 Max. :57.10 Max. :45.94
n = 156. The values of FEAT and DISP are much larger than LVOL.
As always, first we plot the data to ascertain basic characteristics.

> LVOLts <- ts(LVOL)


> ts.plot(LVOLts)
(Figure: time series plot of LVOL.)


The time series plot indicates momentum in the data.


Next we show scatterplots between y and each xi .

> plot(PROMP,LVOL,pch=16)

(Figure: scatterplot of LVOL against PROMP.)

> plot(FEAT,LVOL,pch=16)
(Figure: scatterplot of LVOL against FEAT.)


> plot(DISP,LVOL,pch=16)

(Figure: scatterplot of LVOL against DISP.)

What can we observe from these pairwise plots?

There is a negative correlation between LVOL and PROMP.

There is a positive correlation between LVOL and FEAT.

There is little or no correlation between LVOL and DISP, but this might have
been blurred by the other input variables.

Therefore, we should regress LVOL on PROMP and FEAT first.


We run a multiple linear regression model using x1 and x2 as explanatory variables:

y = β0 + β1 x1 + β2 x2 + ε.

> reg <- lm(LVOL~PROMP + FEAT)


> summary(reg)

Call:
lm(formula = LVOL ~ PROMP + FEAT)

Residuals:
Min 1Q Median 3Q Max
-0.32734 -0.08519 -0.01011 0.08471 0.30804


Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.1500102 0.2487489 68.94 <2e-16 ***
PROMP -0.9042636 0.0694338 -13.02 <2e-16 ***
FEAT 0.0100666 0.0008827 11.40 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1268 on 153 degrees of freedom


Multiple R-squared: 0.756, Adjusted R-squared: 0.7528
F-statistic: 237 on 2 and 153 DF, p-value: < 2.2e-16
We begin by performing a joint test of significance by testing H0 : β1 = β2 = 0. The
test statistic value is given in the regression ANOVA table as f = 237, with a
corresponding p-value of 0.000 (to three decimal places). Hence H0 is rejected and we
have strong evidence that at least one slope coefficient is not equal to zero.
Next we consider individual t tests of H0 : β1 = 0 and H0 : β2 = 0. The respective
test statistic values are −13.02 and 11.40, both with p-values of 0.000 (to three
decimal places) indicating that both slope coefficients are non-zero.
Turning to the estimated coefficients, βb1 = −0.904 (to three decimal places) which
indicates that LVOL decreases as PROMP increases controlling for FEAT. Also,
βb2 = 0.010 (to three decimal places) which indicates that LVOL increases as FEAT
increases, controlling for PROMP.
We could also compute 95% confidence intervals, given by:

βbi ± t0.025, n−3 × E.S.E.(βbi ).

Since n − 3 = 153 is large, t0.025, n−3 ≈ z0.025 = 1.96.

R2 = 0.756. Therefore, 75.6% of the variation of LVOL can be explained (jointly)


with PROMP and FEAT. However, a large R2 does not necessarily mean that the
fitted model is useful. For the estimation of coefficients and predicting y, the
absolute measure ‘Residual SS’ (or σ b2 ) plays a critical role in determining the
accuracy of the model.

Consider now introducing DISP into the regression model to give three explanatory
variables:
y = β0 + β1 x1 + β2 x2 + β3 x3 + ε.

The reason for adding the third variable is that one would expect DISP to have an
impact on sales and we may wish to estimate its magnitude.

> reg <- lm(LVOL~PROMP + FEAT + DISP)


> summary(reg)

Call:
lm(formula = LVOL ~ PROMP + FEAT + DISP)


Residuals:
Min 1Q Median 3Q Max
-0.33363 -0.08203 -0.00272 0.07927 0.33812

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.2372251 0.2490226 69.220 <2e-16 ***
PROMP -0.9564415 0.0726777 -13.160 <2e-16 ***
FEAT 0.0101421 0.0008728 11.620 <2e-16 ***
DISP 0.0035945 0.0016529 2.175 0.0312 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1253 on 152 degrees of freedom


Multiple R-squared: 0.7633, Adjusted R-squared: 0.7587
F-statistic: 163.4 on 3 and 152 DF, p-value: < 2.2e-16

All the estimated coefficients have the right sign (according to commercial common
sense!) and are statistically significant. In particular, the relationship with DISP
seems real when the other inputs are taken into account. On the other hand, the
addition of DISP to the model has resulted in a very small reduction in σ̂, from
√0.0161 = 0.1268 to √0.0157 = 0.1253, and correspondingly a slightly higher R²
(0.7633, i.e. 76.33% of the variation of LVOL is explained by the model). Therefore,
DISP contributes very little to ‘explaining’ the variation of LVOL after the other
two explanatory variables, PROMP and FEAT, are taken into account.
Intuitively, we would expect a higher R2 if we add a further explanatory variable to
the model. However, the model has become more complex as a result – there is an
additional parameter to estimate. Therefore, strictly speaking, we should consider
the ‘adjusted R2 ’ statistic, although this will not be considered in this course.
Special care should be exercised when predicting with x out of the range of the
observations used to fit the model, which is called extrapolation.

11.12 Overview of chapter


This chapter has covered the linear regression model with one or more explanatory
variables. Least squares estimators were derived for the simple linear regression model,
and statistical inference procedures were also covered. The multiple linear regression
model and applications using R concluded the chapter.

11.13 Key terms and concepts


ANOVA decomposition Coefficient of determination
Confidence interval Dependent variable
Independent variable Intercept


Least squares estimation Linear estimators


Multiple linear regression Prediction interval
Regression analysis Regressor
Residual Simple linear regression
Slope coefficient

11.14 Sample examination questions


1. Consider the regression model:

   yi = βxi + εi

   where E(εi) = 0 and Var(εi) = σ² for i = 1, 2, . . . , n, and Cov(εi, εj) = 0 for all
   i ≠ j. x1 , x2 , . . . , xn are assumed to be constants.

   (a) Show that the least squares estimator of β is:

       β̂ = Σ_{i=1}^{n} xi yi / Σ_{i=1}^{n} xi² .

   (b) Show that β̂ is an unbiased estimator of β.

   (c) Derive an expression for Var(β̂).

   (d) Show that:

       Σ_{i=1}^{n} (yi − β̂xi)² = Σ_{i=1}^{n} (yi − βxi)² − (β̂ − β)² Σ_{i=1}^{n} xi² .

   (e) Hence show that:

       σ̂² = (1/(n − 1)) Σ_{i=1}^{n} (yi − β̂xi)²

       is an unbiased estimator of σ².

2. To consider the effect of a marketing instrument, x, on the weekly sales volume, y,


of a certain product, data over 20 weeks {(xi , yi ), for i = 1, 2, . . . , 20} are collected.
The following quantities are calculated:
Σ_{i=1}^{20} xi = 400,   Σ_{i=1}^{20} yi = 220,   Σ_{i=1}^{20} xi² = 8,800,   Σ_{i=1}^{20} xi yi = 4,700,   Σ_{i=1}^{20} yi² = 2,620.

Assume the linear regression model:

yi = β0 + β1 xi + εi

where ε1 , ε2 , . . . , ε20 are independent and N (0, σ 2 ).


(a) Find the least squares estimates of β0 and β1 to three decimal places, and
write down the fitted regression model.

(b) Compute the estimated standard error of the least squares estimator of β1 .

(c) Perform a test of H0 : β1 = 0 vs. H1 : β1 6= 0 at the 1% significance level.

(d) Perform a test of H0 : β1 = 0.37 vs. H1 : β1 > 0.37 at the 10% significance level.

(e) For x = 18, determine a prediction interval which covers y with probability
0.95.

11.15 Solutions to Sample examination questions


1. (a) We minimise:

       S = Σ_{i=1}^{n} εi² = Σ_{i=1}^{n} (yi − βxi)².

       Differentiating, we have:

       dS/dβ = −2 Σ_{i=1}^{n} (yi − βxi) xi .

       Setting to zero, and dividing by −2, we have:

       Σ_{i=1}^{n} (yi − β̂xi) xi = Σ_{i=1}^{n} xi yi − β̂ Σ_{i=1}^{n} xi² = 0

       hence:

       β̂ = Σ_{i=1}^{n} xi yi / Σ_{i=1}^{n} xi² .

   (b) Since E(yi) = E(βxi + εi) = βxi , because E(εi) = 0, we have:

       E(β̂) = E(Σ_{i=1}^{n} xi yi / Σ_{i=1}^{n} xi²) = Σ_{i=1}^{n} xi E(yi) / Σ_{i=1}^{n} xi² = β Σ_{i=1}^{n} xi² / Σ_{i=1}^{n} xi² = β

       hence β̂ is an unbiased estimator of β.

   (c) Since Var(yi) = Var(βxi + εi) = Var(εi) = σ², we have:

       Var(β̂) = Var(Σ_{i=1}^{n} xi yi / Σ_{i=1}^{n} xi²) = (1/(Σ_{i=1}^{n} xi²)²) Var(Σ_{i=1}^{n} xi yi) = Σ_{i=1}^{n} xi² Var(yi) / (Σ_{i=1}^{n} xi²)² = σ² / Σ_{i=1}^{n} xi² .


   (d) We have:

       Σ_{i=1}^{n} (yi − β̂xi)² = Σ_{i=1}^{n} ((yi − βxi) − (β̂ − β)xi)²
                              = Σ_{i=1}^{n} (yi − βxi)² − 2(β̂ − β) Σ_{i=1}^{n} (yi − βxi) xi + (β̂ − β)² Σ_{i=1}^{n} xi² .

       From the least squares estimator β̂ we have:

       Σ_{i=1}^{n} xi yi = β̂ Σ_{i=1}^{n} xi²   ⇒   Σ_{i=1}^{n} (yi − βxi) xi = (β̂ − β) Σ_{i=1}^{n} xi² .

       Therefore:

       Σ_{i=1}^{n} (yi − β̂xi)² = Σ_{i=1}^{n} (yi − βxi)² − 2(β̂ − β)² Σ_{i=1}^{n} xi² + (β̂ − β)² Σ_{i=1}^{n} xi²
                              = Σ_{i=1}^{n} (yi − βxi)² − (β̂ − β)² Σ_{i=1}^{n} xi² .

   (e) We have:

       E(σ̂²) = E((1/(n − 1)) Σ_{i=1}^{n} (yi − β̂xi)²)
             = (1/(n − 1)) E(Σ_{i=1}^{n} (yi − β̂xi)²)
             = (1/(n − 1)) E(Σ_{i=1}^{n} (yi − βxi)² − (β̂ − β)² Σ_{i=1}^{n} xi²)
             = (1/(n − 1)) (Σ_{i=1}^{n} E((yi − βxi)²) − Σ_{i=1}^{n} xi² E((β̂ − β)²))
             = (1/(n − 1)) (Σ_{i=1}^{n} E(εi²) − Σ_{i=1}^{n} xi² Var(β̂))
             = (1/(n − 1)) (nσ² − Σ_{i=1}^{n} xi² × σ² / Σ_{i=1}^{n} xi²)
             = σ²

       hence σ̂² is an unbiased estimator of σ².


2. (a) We have the following:

       ȳ = 220/20 = 11   and   x̄ = 400/20 = 20

       also:

       Σ_i (xi − x̄)² = Σ_i xi² − nx̄² = 8,800 − 20 × 20² = 800

       and:

       Σ_i (xi − x̄)(yi − ȳ) = Σ_i xi yi − nx̄ȳ = 4,700 − 20 × 20 × 11 = 300.

       Hence β̂1 = 300/800 = 0.375 and β̂0 = ȳ − β̂1 x̄ = 11 − 0.375 × 20 = 3.500.
       Therefore, the fitted model is:

       ŷ = 3.500 + 0.375x.

   (b) We have the following:

       Total SS = Σ_i (yi − ȳ)² = Σ_i yi² − nȳ² = 2,620 − 20 × 11² = 200

       Regression SS = β̂1² Σ_i (xi − x̄)² = (0.375)² × 800 = 112.5

       Residual SS = Total SS − Regression SS = 200 − 112.5 = 87.5.

       Hence σ̂² = 87.5/18 = 4.861. Therefore:

       E.S.E.(β̂1) = (σ̂² / Σ_i (xi − x̄)²)^{1/2} = (4.861/800)^{1/2} = 0.078.

   (c) We have:

       (β̂1 − β1) / E.S.E.(β̂1) ∼ t_{n−2} = t_{18} .

       Under H0 : β1 = 0, we have:

       T = β̂1 / E.S.E.(β̂1) ∼ t_{18}

       and we reject H0 if |t| > t_{0.005, 18} = 2.878. Since t = 0.375/0.078 = 4.808, we
       reject H0 : β1 = 0 at the 1% significance level.

   (d) Under H0 : β1 = 0.37, we have:

       T = (β̂1 − 0.37) / E.S.E.(β̂1) ∼ t_{18} .

       We reject H0 if t > t_{0.10, 18} = 1.330. Since t = (0.375 − 0.37)/0.078 = 0.064, we
       cannot reject H0 : β1 = 0.37 at the 10% significance level.


   (e) The required prediction interval is:

       β̂0 + β̂1 x ± t_{0.025, 18} × σ̂ × (1 + Σ_i (xi − x)² / (n Σ_i (xi − x̄)²))^{1/2}

       = 3.5 + 0.375 × 18 ± 2.101 × √4.861 × √(1 + (8,800 − 2 × 18 × 400 + 20 × (18)²)/(20 × 800))

       = 10.25 ± 4.632 × √1.055

       = 10.25 ± 4.758.

       Hence the prediction interval is (5.492, 15.008).
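As a numerical check, the arithmetic in this solution can be reproduced in R from the given summary statistics. This is an illustrative sketch (qt() supplies the t quantile used above):

> n <- 20
> sx <- 400; sy <- 220; sxx <- 8800; sxy <- 4700; syy <- 2620
> xbar <- sx / n; ybar <- sy / n
> b1 <- (sxy - n * xbar * ybar) / (sxx - n * xbar^2)               # 0.375
> b0 <- ybar - b1 * xbar                                           # 3.5
> s2 <- (syy - n * ybar^2 - b1^2 * (sxx - n * xbar^2)) / (n - 2)   # 4.861
> ese <- sqrt(s2 / (sxx - n * xbar^2))                             # 0.078
> x0 <- 18; fit <- b0 + b1 * x0                                    # 10.25
> half <- qt(0.975, n - 2) * sqrt(s2 * (1 + (sxx - 2 * x0 * sx + n * x0^2) / (n * (sxx - n * xbar^2))))
> c(fit - half, fit + half)                                        # about (5.49, 15.01)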

Facts are stubborn, but statistics are more pliable.


(Mark Twain)

Appendix A
Data visualisation and descriptive
statistics

A.1 (Re)vision of fundamentals


Properties of the summation operator

Let Xi and Yi , for i = 1, 2, . . . , n, be sets of n numbers. Let a denote a constant, i.e. a


number with the same value for all i.
All of the following results follow simply from the properties of addition (if you are still
not convinced, try them with n = 3).

1. Σ_{i=1}^{n} a = n × a.

   • Proof: Σ_{i=1}^{n} a = (a + a + · · · + a) (n times) = n × a.

2. Σ_{i=1}^{n} aXi = a Σ_{i=1}^{n} Xi .

   • Proof: Σ_{i=1}^{n} aXi = (aX1 + aX2 + · · · + aXn) = a(X1 + X2 + · · · + Xn) = a Σ_{i=1}^{n} Xi .

3. Σ_{i=1}^{n} (Xi + Yi) = Σ_{i=1}^{n} Xi + Σ_{i=1}^{n} Yi .

   • Proof: Rearranging the elements of the summation, we get:

     Σ_{i=1}^{n} (Xi + Yi) = ((X1 + Y1) + (X2 + Y2) + · · · + (Xn + Yn))
                          = (X1 + X2 + · · · + Xn) + (Y1 + Y2 + · · · + Yn)
                          = Σ_{i=1}^{n} Xi + Σ_{i=1}^{n} Yi .


Extension: double (triple etc.) summation

Sometimes sets of numbers may be indexed with two (or even more) subscripts, for
example as Xij , for i = 1, 2, . . . , n and j = 1, 2, . . . , m.
Summation over both indices is written as:

Σ_{i=1}^{n} Σ_{j=1}^{m} Xij = Σ_{i=1}^{n} (Xi1 + Xi2 + · · · + Xim)
                           = (X11 + X12 + · · · + X1m) + (X21 + X22 + · · · + X2m) + · · · + (Xn1 + Xn2 + · · · + Xnm).

The order of summation can be changed, that is:

Σ_{i=1}^{n} Σ_{j=1}^{m} Xij = Σ_{j=1}^{m} Σ_{i=1}^{n} Xij .

Product notation

The analogous notation for the product of a set of numbers is:

Π_{i=1}^{n} Xi = X1 × X2 × · · · × Xn .

The following results follow from the properties of multiplication.

1. Π_{i=1}^{n} aXi = aⁿ Π_{i=1}^{n} Xi .

2. Π_{i=1}^{n} a = aⁿ .

3. Π_{i=1}^{n} Xi Yi = (Π_{i=1}^{n} Xi) (Π_{i=1}^{n} Yi).

The sum of deviations from the mean is 0

The mean is ‘in the middle’ of the observations X1 , X2 , . . . , Xn , in the sense that
positive and negative values of the deviations Xi − X̄ cancel out, when summed over
all the observations, that is:
Σ_{i=1}^{n} (Xi − X̄) = 0.

Proof: (The proof uses the definition of X̄ and the properties of summation introduced
earlier. Note that X̄ is a constant in the summation, because it has the same value for


all i.)

Σ_{i=1}^{n} (Xi − X̄) = Σ_{i=1}^{n} Xi − Σ_{i=1}^{n} X̄ = Σ_{i=1}^{n} Xi − nX̄ = Σ_{i=1}^{n} Xi − n (Σ_{i=1}^{n} Xi)/n = Σ_{i=1}^{n} Xi − Σ_{i=1}^{n} Xi = 0.
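A quick numerical illustration in R (any vector of numbers will do; the answer differs from zero only by floating-point rounding error):

> x <- c(2.5, 7.1, 3.8, 9.4, 5.2)    # arbitrary illustrative data
> sum(x - mean(x))                   # essentially zero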

The mean minimises the sum of squared deviations

The smallest possible value of the sum of squared deviations Σ_{i=1}^{n} (Xi − C)², for any
constant C, is obtained when C = X̄.

Proof: (all sums are over i = 1, 2, . . . , n)

Σ (Xi − C)² = Σ (Xi − X̄ + X̄ − C)²
            = Σ ((Xi − X̄) + (X̄ − C))²
            = Σ ((Xi − X̄)² + 2(Xi − X̄)(X̄ − C) + (X̄ − C)²)
            = Σ (Xi − X̄)² + 2(X̄ − C) Σ (Xi − X̄) + n(X̄ − C)²
            = Σ (Xi − X̄)² + n(X̄ − C)²          (since Σ (Xi − X̄) = 0)
            ≥ Σ (Xi − X̄)²

since n(X̄ − C)² ≥ 0 for any choice of C. Equality is obtained only when C = X̄, so
that n(X̄ − C)² = 0.


An alternative formula for the variance

The sum of squares in S 2 can also be expressed as:

Σ_{i=1}^{n} (Xi − X̄)² = Σ_{i=1}^{n} Xi² − nX̄².


Proof: We have:

Σ_{i=1}^{n} (Xi − X̄)² = Σ_{i=1}^{n} (Xi² − 2Xi X̄ + X̄²)
                     = Σ_{i=1}^{n} Xi² − 2X̄ Σ_{i=1}^{n} Xi + Σ_{i=1}^{n} X̄²     (where Σ_{i=1}^{n} Xi = nX̄ and Σ_{i=1}^{n} X̄² = nX̄²)
                     = Σ_{i=1}^{n} Xi² − nX̄².

Therefore, the sample variance can also be calculated as:

S² = (1/(n − 1)) (Σ_{i=1}^{n} Xi² − nX̄²)

(and the standard deviation S = √S² again).

This formula is most convenient for calculations done by hand when summary statistics
such as Σ_i Xi and Σ_i Xi² are provided.
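The equivalence of the two formulae is easy to verify numerically in R (an illustrative sketch with arbitrary data; var() uses the n − 1 divisor, so all three lines agree):

> x <- c(2.5, 7.1, 3.8, 9.4, 5.2)           # arbitrary illustrative data
> n <- length(x)
> sum((x - mean(x))^2) / (n - 1)            # definition of S^2
> (sum(x^2) - n * mean(x)^2) / (n - 1)      # alternative formula
> var(x)                                    # built-in sample variance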

Sample moment

Sample moments are formally introduced in Chapter 7.


Let us define, for a variable X and for each k = 1, 2, . . ., the following:

the kth sample moment about zero is:

mk = Σ_{i=1}^{n} Xiᵏ / n

the kth central sample moment is:

m′k = Σ_{i=1}^{n} (Xi − X̄)ᵏ / n.

In other words, these are sample averages of the powers Xiᵏ and (Xi − X̄)ᵏ, respectively.
Clearly:

X̄ = m1   and   S² = (n/(n − 1)) m′2 = (1/(n − 1)) (nm2 − n(m1)²).

Moments of powers 3 and 4 are used in two more summary statistics which are
described next, for reference only.
These are used much less often than measures of central tendency and dispersion.


Sample skewness (non-examinable)

A measure of the skewness of the distribution of a variable X is:

g1 = m′3 / s³ = (Σ_i (Xi − X̄)³ / n) / (Σ_i (Xi − X̄)² / (n − 1))^{3/2} .
For this measure, g1 = 0 for a symmetric distribution, g1 > 0 for a positively-skewed
distribution, and g1 < 0 for a negatively-skewed distribution.
For example, g1 = 1.24 for the (positively skewed) GDP per capita distribution shown
in Chapter 1 of the main course notes, and g1 = 0.006 for the (fairly symmetric)
diastolic blood pressure distribution.

Sample kurtosis (non-examinable)

Kurtosis refers to yet another characteristic of a sample distribution. This has to do


with the relative sizes of the ‘peak’ and tails of the distribution (think about shapes of
histograms).

A distribution with high kurtosis (i.e. leptokurtic) has a sharp peak and a high
proportion of observations in the tails far from the peak.
A distribution with low kurtosis (i.e. platykurtic) is ‘flat’, with no pronounced peak
with most of the observations spread evenly around the middle and weak tails.
A sample measure of kurtosis is:

g2 = m′4 / (m′2)² − 3 = (Σ_i (Xi − X̄)⁴ / n) / (Σ_i (Xi − X̄)² / n)² − 3.
g2 > 0 for leptokurtic and g2 < 0 for platykurtic distributions, and g2 = 0 for the normal
distribution (introduced in Chapter 4). Some software packages define a measure of
kurtosis without the −3, i.e. ‘excess kurtosis’.

Calculation of sample quantiles (non-examinable)

This is how computer software calculates general sample quantiles (or how you can do
so by hand, if you ever needed to).
Suppose we need to calculate the cth sample quantile, qc , where 0 < c < 100. Let
R = (n + 1)c/100, and define r as the integer part of R and f = R − r as the fractional
part (if R is an integer, r = R and f = 0). It follows that:
qc = X(r) + f (X(r+1) − X(r) ) = (1 − f )X(r) + f X(r+1) .
For example, if n = 10:

for q50 (the median): R = 5.5, r = 5, f = 0.5, and so we have:


q50 = X(5) + 0.5(X(6) − X(5) ) = 0.5(X(5) + X(6) )
as before
for q25 (the first quartile): R = 2.75, r = 2, f = 0.75, and so:
q25 = X(2) + 0.75(X(3) − X(2) ) = 0.25X(2) + 0.75X(3) .
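For those using R, the rule described here appears to correspond to type = 6 in the quantile() function (R's default, type = 7, uses a slightly different interpolation, so small discrepancies with the hand rule are possible). An illustrative check with n = 10 made-up observations:

> x <- c(3, 7, 8, 5, 12, 14, 21, 13, 18, 2)   # n = 10 illustrative observations
> xs <- sort(x)
> (10 + 1) * 25 / 100                         # R = 2.75, so r = 2 and f = 0.75
> 0.25 * xs[2] + 0.75 * xs[3]                 # q25 from the rule above: 4.5
> quantile(x, 0.25, type = 6)                 # same value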


A.2 Worked example


1. Show that:

   Σ_{i=1}^{n} Σ_{j=1}^{n} (xi − xj)² = 2n [Σ_{i=1}^{n} (xi − x̄)²].

   Solution:
   Begin with the left-hand side and proceed as follows:

   Σ_{i=1}^{n} Σ_{j=1}^{n} (xi − xj)² = Σ_{i=1}^{n} [Σ_{j=1}^{n} (xi − xj)²].

   Now, expand the square:

   = Σ_{i=1}^{n} [Σ_{j=1}^{n} (xi² − 2xi xj + xj²)].

   Next, sum separately inside [ ] so we have:

   = Σ_{i=1}^{n} [Σ_{j=1}^{n} xi² + Σ_{j=1}^{n} (−2xi xj) + Σ_{j=1}^{n} xj²].

   Now, factor out xi terms inside [ ] to give:

   = Σ_{i=1}^{n} [xi² Σ_{j=1}^{n} 1 − 2xi Σ_{j=1}^{n} xj + Σ_{j=1}^{n} xj²].

   Now, recall that x̄ = Σ_{i=1}^{n} xi / n, so re-write as:

   = Σ_{i=1}^{n} [nxi² − 2xi nx̄ + Σ_{j=1}^{n} xj²].

   Next, expand the bracket:

   = Σ_{i=1}^{n} nxi² + Σ_{i=1}^{n} (−2xi nx̄) + Σ_{i=1}^{n} (Σ_{j=1}^{n} xj²).

   Re-arrange again:

   = n Σ_{i=1}^{n} xi² − 2nx̄ Σ_{i=1}^{n} xi + (Σ_{j=1}^{n} xj²) Σ_{i=1}^{n} 1.

   Apply the 'x̄ trick' once more:

   = n Σ_{i=1}^{n} xi² − 2nx̄ × nx̄ + (Σ_{j=1}^{n} xj²) n.

   Factor out the common n to give:

   = n [Σ_{i=1}^{n} xi² − 2nx̄² + Σ_{j=1}^{n} xj²].

   Without loss of generality, we can re-define the index j as index i so:

   = n [Σ_{i=1}^{n} xi² − 2nx̄² + Σ_{i=1}^{n} xi²].

   Finally, add terms, factor out 2n, apply the 'x̄ trick' . . . and you're done!

   = 2n [Σ_{i=1}^{n} xi² − nx̄²] = 2n [Σ_{i=1}^{n} (xi − x̄)²].

A.3 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix L.

1. Let Y1 , Y2 and Y3 be real numbers with Ȳ = (Y1 + Y2 + Y3 )/3. Show that:


(a) Σ_{j=1}^{3} (Yj − Ȳ) = 0

(b) Σ_{j=1}^{3} Σ_{k=1}^{3} (Yj − Ȳ)(Yk − Ȳ) = 0

(c) Σ_{j=1}^{3} Σ_{k=1, k≠j}^{3} (Yj − Ȳ)(Yk − Ȳ) = − Σ_{j=1}^{3} (Yj − Ȳ)².

Hint: there are three terms in the expression of (a), nine terms in (b) and six terms
in (c). Write out the terms, and try to find ways to simplify them which avoid the
need for a lot of messy algebra!

2. For constants a and b, show that:


(a) ȳ = ax̄ + b, where yi = axi + b for i = 1, 2, . . . , n
(b) Σ_{i=1}^{n} (xi − x̄)² = Σ_{i=1}^{n} xi² − nx̄²

(c) s.d.y = |a| s.d.x , where s.d.y is the standard deviation of y etc.
What are the mean and standard deviation of the set {x1 + k, x2 + k, . . . , xn + k}
where k is a constant? What are the mean and standard deviation of the set
{cx1 , cx2 , . . . , cxn } where c is a constant? Justify your answers with reference to the
above results.

Appendix B
Probability theory

B.1 Worked examples


1. A and B are independent events. Suppose that P (A) = 2π, P (B) = π and
P (A ∪ B) = 0.8. Evaluate π.
Solution:
We have:

P (A ∪ B) = 0.8 = P (A) + P (B) − P (A ∩ B)


= P (A) + P (B) − P (A) P (B)
= 2π + π − 2π 2 .

Therefore:

2π² − 3π + 0.8 = 0   ⇒   π = (3 ± √(9 − 6.4))/4.
Hence π = 0.346887, since the other root is > 1!

2. A and B are events such that P (A | B) > P (A). Prove that:

P (Ac | B c ) > P (Ac )

where Ac and B c are the complements of A and B, respectively, and P (B c ) > 0.


Solution:
From the definition of conditional probability:

P(Aᶜ | Bᶜ) = P(Aᶜ ∩ Bᶜ) / P(Bᶜ) = P((A ∪ B)ᶜ) / P(Bᶜ) = (1 − P(A) − P(B) + P(A ∩ B)) / (1 − P(B)).

However:

P(A | B) = P(A ∩ B) / P(B) > P(A)   i.e.   P(A ∩ B) > P(A) P(B).

Hence:

P(Aᶜ | Bᶜ) > (1 − P(A) − P(B) + P(A) P(B)) / (1 − P(B)) = 1 − P(A) = P(Aᶜ).


3. A, B and C are independent events. Prove that A and (B ∪ C) are independent.


Solution:
Using the distributive law:
P (A ∩ (B ∪ C)) = P ((A ∩ B) ∪ (A ∩ C))
= P (A ∩ B) + P (A ∩ C) − P (A ∩ B ∩ C)
= P (A) P (B) + P (A) P (C) − P (A) P (B) P (C)
= P (A) (P (B) + P (C) − P (B) P (C))
= P (A) P (B ∪ C).

4. A and B are any two events in the sample space S. The binary set operator ∨
denotes an exclusive union, such that:
A ∨ B = (A ∪ B) ∩ (A ∩ B)ᶜ = {s | s ∈ A or B, and s ∉ (A ∩ B)}.
Show, from the axioms of probability, that:
(a) P (A ∨ B) = P (A) + P (B) − 2 × P (A ∩ B)
(b) P (A ∨ B | A) = 1 − P (B | A).

Solution:
(a) We have:
A ∨ B = (A ∩ B c ) ∪ (B ∩ Ac ).
By axiom 3, noting that (A ∩ B c ) and (B ∩ Ac ) are disjoint:
P (A ∨ B) = P (A ∩ B c ) + P (B ∩ Ac ).
We can write A = (A ∩ B) ∪ (A ∩ B c ), hence (using axiom 3):
P (A ∩ B c ) = P (A) − P (A ∩ B).
Similarly, P (B ∩ Ac ) = P (B) − P (A ∩ B), hence:
P (A ∨ B) = P (A) + P (B) − 2 × P (A ∩ B).

(b) We have:

    P(A ∨ B | A) = P((A ∨ B) ∩ A) / P(A)
                 = P(A ∩ Bᶜ) / P(A)
                 = (P(A) − P(A ∩ B)) / P(A)
                 = P(A)/P(A) − P(A ∩ B)/P(A)
                 = 1 − P(B | A).


5. State and prove Bayes’ theorem.

Solution:
Bayes' theorem is:

P(Bj | A) = P(A | Bj) P(Bj) / Σ_{i=1}^{K} P(A | Bi) P(Bi).

By definition:

P(Bj | A) = P(Bj ∩ A) / P(A) = P(A | Bj) P(Bj) / P(A).

If {Bi}, for i = 1, 2, . . . , K, is a partition of the sample space S, then:

P(A) = Σ_{i=1}^{K} P(A ∩ Bi) = Σ_{i=1}^{K} P(A | Bi) P(Bi).

Hence the result.

6. A man has two bags. Bag A contains five keys and bag B contains seven keys. Only
one of the twelve keys fits the lock which he is trying to open. The man selects a
bag at random, picks out a key from the bag at random and tries that key in the
lock. What is the probability that the key he has chosen fits the lock?

Solution:
Define a partition {Ci}, such that:

C1 = key in bag A and bag A chosen  ⇒  P(C1) = 5/12 × 1/2 = 5/24
C2 = key in bag B and bag A chosen  ⇒  P(C2) = 7/12 × 1/2 = 7/24
C3 = key in bag A and bag B chosen  ⇒  P(C3) = 5/12 × 1/2 = 5/24
C4 = key in bag B and bag B chosen  ⇒  P(C4) = 7/12 × 1/2 = 7/24.

Hence we require, defining the event F = 'key fits':

P(F) = 1/5 × P(C1) + 1/7 × P(C4) = 1/5 × 5/24 + 1/7 × 7/24 = 1/12.
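The answer of 1/12 can also be checked by Monte Carlo simulation in R. The sketch below is purely illustrative: the fitting key is placed in bag A with probability 5/12 (matching the partition above), the man picks a bag at random, and then picks a key at random from that bag:

> set.seed(1)
> N <- 100000
> fits.in.A <- runif(N) < 5/12       # whether the fitting key happens to be in bag A
> pick.A <- runif(N) < 1/2           # whether the man chooses bag A
> success <- ifelse(pick.A, fits.in.A & (runif(N) < 1/5),
+                          !fits.in.A & (runif(N) < 1/7))
> mean(success)                      # close to 1/12 = 0.0833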

7. Continuing with Question 6, suppose the first key chosen does not fit the lock.
What is the probability that the bag chosen:
(a) is bag A?
(b) contains the required key?


Solution:

(a) We require P(bag A | Fᶜ) which is:

    P(bag A | Fᶜ) = (P(Fᶜ | C1) P(C1) + P(Fᶜ | C2) P(C2)) / Σ_{i=1}^{4} P(Fᶜ | Ci) P(Ci).

    The conditional probabilities are:

    P(Fᶜ | C1) = 4/5,   P(Fᶜ | C2) = 1,   P(Fᶜ | C3) = 1   and   P(Fᶜ | C4) = 6/7.

    Hence:

    P(bag A | Fᶜ) = (4/5 × 5/24 + 1 × 7/24) / (4/5 × 5/24 + 1 × 7/24 + 1 × 5/24 + 6/7 × 7/24) = 1/2.

(b) We require P(right bag | Fᶜ) which is:

    P(right bag | Fᶜ) = (P(Fᶜ | C1) P(C1) + P(Fᶜ | C4) P(C4)) / Σ_{i=1}^{4} P(Fᶜ | Ci) P(Ci)
                      = (4/5 × 5/24 + 6/7 × 7/24) / (4/5 × 5/24 + 1 × 7/24 + 1 × 5/24 + 6/7 × 7/24)
                      = 5/11.

8. Assume that a calculator has a ‘random number’ key and that when the key is
pressed an integer between 0 and 999 inclusive is generated at random, all numbers
being generated independently of one another.
(a) What is the probability that the number generated is less than 300?
(b) If two numbers are generated, what is the probability that both are less than
300?
(c) If two numbers are generated, what is the probability that the first number
exceeds the second number?
(d) If two numbers are generated, what is the probability that the first number
exceeds the second number, and their sum is exactly 300?
(e) If five numbers are generated, what is the probability that at least one number
occurs more than once?

Solution:

(a) Simply 300/1,000 = 0.3.


(b) Simply 0.3 × 0.3 = 0.09.


(c) Suppose P (first greater) = x, then by symmetry we have that


P (second greater) = x. However, the probability that both are equal is (by
counting):
#{{0, 0}, {1, 1}, . . . , {999, 999}}/1,000,000 = 1,000/1,000,000 = 0.001.
Hence x + x + 0.001 = 1, so x = 0.4995.
(d) The following cases apply: {300, 0}, {299, 1}, . . . , {151, 149}, i.e. there are 150
possibilities out of the 10⁶ equally likely ordered pairs. So the required probability is:
150/1,000,000 = 0.00015.

(e) The probability that they are all different is:


1 × 999/1,000 × 998/1,000 × 997/1,000 × 996/1,000.
Note that the first number can be any number (with probability 1).
Subtracting from 1 gives the required probability, i.e. 0.009965.
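
All five answers can be confirmed by direct computation; the short Python sketch below is an illustration of our own, not part of the solution itself.

# Direct computation of the probabilities in Question 8.
p_a = 300 / 1000                    # (a) one number below 300
p_b = p_a ** 2                      # (b) both numbers below 300
p_c = (1 - 1000 / 1000**2) / 2      # (c) first exceeds second, by symmetry
p_d = 150 / 1000**2                 # (d) first exceeds second and sum equals 300
p_all_diff = 1.0
for k in range(5):                  # (e) five generated numbers all different
    p_all_diff *= (1000 - k) / 1000
print(p_a, p_b, p_c, p_d, round(1 - p_all_diff, 6))
# 0.3  0.09  0.4995  0.00015  0.009965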

9. If C1 , C2 , . . . are events in S which are pairwise mutually exclusive (i.e. Ci ∩ Cj = ∅
for all i ≠ j), then, by the axioms of probability:

P (∪_{i=1}^{∞} Ci ) = ∑_{i=1}^{∞} P (Ci ).     (B.1)

Suppose that A1 , A2 , . . . are pairwise mutually exclusive events in S. Prove that a


property like (B.1) also holds for conditional probabilities given some event B, i.e.
prove that:

P (∪_{i=1}^{∞} Ai | B) = ∑_{i=1}^{∞} P (Ai | B).

You can assume that all unions and intersections of Ai and B are also events in S.
Solution:
We have:

P (∪_{i=1}^{∞} Ai | B) = P ((∪_{i=1}^{∞} Ai ) ∩ B)/P (B)
                      = P (∪_{i=1}^{∞} (Ai ∩ B))/P (B)
                      = ∑_{i=1}^{∞} P (Ai ∩ B)/P (B)
                      = ∑_{i=1}^{∞} P (Ai | B)


where the third equality follows from (B.1) in the question, since the events Ai ∩ B are
also events in S, and they are pairwise mutually exclusive (i.e. (Ai ∩ B) ∩ (Aj ∩ B) = ∅
for all i ≠ j).

10. Suppose that three components numbered 1, 2 and 3 have probabilities of failure
π1 , π2 and π3 , respectively. Determine the probability of a system failure in each of
the following cases where component failures are assumed to be independent.
(a) Parallel system – the system fails if all components fail.
(b) Series system – the system fails unless all components do not fail.
(c) Mixed system – the system fails if component 1 fails or if both component 2
and component 3 fail.

Solution:
(a) Since the component failures are independent, the probability of system failure
is π1 π2 π3 .
(b) The probability that component i does not fail is 1 − πi , hence the probability
that the system does not fail is (1 − π1 )(1 − π2 )(1 − π3 ), and so the probability
that the system fails is:
1 − (1 − π1 )(1 − π2 )(1 − π3 ).

(c) Components 2 and 3 may be combined to form a notional component 4 with


failure probability π2 π3 . So the system is equivalent to a component with
failure probability π1 and another component with failure probability π2 π3 ,
these being connected in series. Therefore, the failure probability is:
1 − (1 − π1 )(1 − π2 π3 ) = π1 + π2 π3 − π1 π2 π3 .

11. Why is S = {1, 1, 2}, not a sensible way to try to define a sample space?
Solution:
Because there is no need to list the elementary outcome ‘1’ twice. It is much clearer
to write S = {1, 2}.

12. Write out all the events for the sample space S = {a, b, c}. (There are eight of
them.)
Solution:
The possible events are {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c} (the sample
space S) and ∅.

13. For an event A, work out a simpler way to express the events A ∩ S, A ∪ S, A ∩ ∅
and A ∪ ∅.
Solution:
We have:
A ∩ S = A, A ∪ S = S, A ∩ ∅ = ∅ and A ∪ ∅ = A.


14. If all elementary outcomes are equally likely, S = {a, b, c, d}, A = {a, b, c} and
B = {c, d}, find P (A | B) and P (B | A).
Solution:
S has 4 elementary outcomes which are equally likely, so each elementary outcome
has probability 1/4.
We have:
P (A ∩ B) P ({c}) 1/4 1
P (A | B) = = = =
P (B) P ({c, d}) 1/4 + 1/4 2
and:
P (B ∩ A) P ({c}) 1/4 1
P (B | A) = = = = .
P (A) P ({a, b, c}) 1/4 + 1/4 + 1/4 3

15. Suppose that we toss a fair coin twice. The sample space is given by
S = {HH, HT, T H, T T }, where the elementary outcomes are defined in the
obvious way – for instance HT is heads on the first toss and tails on the second
toss. Show that if all four elementary outcomes are equally likely, then the events
‘heads on the first toss’ and ‘heads on the second toss’ are independent.
Solution:
Note carefully here that we have equally likely elementary outcomes (due to the
coin being fair), so that each has probability 1/4, and the independence follows.
The event ‘heads on the first toss’ is A = {HH, HT } and has probability 1/2,
because it is specified by two elementary outcomes. The event ‘heads on the second
toss’ is B = {HH, T H} and has probability 1/2. The event ‘heads on the first toss
and the second toss’ is A ∩ B = {HH} and has probability 1/4. So the
multiplication property P (A ∩ B) = 1/4 = 1/2 × 1/2 = P (A) P (B) is satisfied, and
the two events are independent.

16. Show that if A and B are disjoint events, and are also independent, then P (A) = 0
or P (B) = 0.¹
Solution:
It is important to get the logical flow in the right direction here. We are told that
A and B are disjoint events, that is:

A ∩ B = ∅.

So:
P (A ∩ B) = 0.
We are also told that A and B are independent, that is:

P (A ∩ B) = P (A) P (B).

It follows that:
0 = P (A) P (B)
and so either P (A) = 0 or P (B) = 0.
¹ Note that independence and disjointness are not similar ideas.


17. Write down the condition for three events A, B and C to be independent.

Solution:
Applying the product rule, we must have:

P (A ∩ B ∩ C) = P (A) P (B) P (C).

Therefore, since all subsets of two events from A, B and C must be independent,
we must also have:

P (A ∩ B) = P (A) P (B)
P (A ∩ C) = P (A) P (C)

and:
P (B ∩ C) = P (B) P (C).

One must check that all four conditions hold to verify independence of A, B and C.

18. Prove the simplest version of Bayes’ theorem from first principles.

Solution:
Applying the definition of conditional probability, we have:

P (B ∩ A) P (A ∩ B) P (A | B) P (B)
P (B | A) = = = .
P (A) P (A) P (A)

19. A statistics teacher knows from past experience that a student who does their
homework consistently has a probability of 0.95 of passing the examination,
whereas a student who does not do their homework has a probability of 0.30 of
passing.
(a) If 25% of students do their homework consistently, what percentage of all
students can expect to pass?
(b) If a student chosen at random from the group gets a pass, what is the
probability that the student has done their homework consistently?

Solution:
Here the random experiment is to choose a student at random, and to record
whether the student passes (P ) or fails (F ), and whether the student has done
their homework consistently (C) or has not (N ).² The sample space is
S = {P C, P N, F C, F N }. We use the events Pass = {P C, P N }, and Fail
= {F C, F N }. We consider the sample space partitioned by Homework
= {P C, F C}, and No Homework = {P N, F N }.
² Notice that F = P c and N = C c .


(a) The first part of the example asks for the denominator of Bayes’ theorem:

P (Pass) = P (Pass | Homework) P (Homework)


+ P (Pass | No Homework) P (No Homework)
= 0.95 × 0.25 + 0.30 × (1 − 0.25)
= 0.2375 + 0.225
= 0.4625.

(b) Now applying Bayes’ theorem:

P (Homework | Pass) = P (Homework ∩ Pass)/P (Pass)
                    = P (Pass | Homework) P (Homework)/P (Pass)
                    = (0.95 × 0.25)/0.4625
                    = 0.5135.

Alternatively, we could arrange the calculations in a tree diagram.
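
The same arithmetic takes only a couple of lines of Python; the sketch below is an illustration of our own, not part of the official solution.

# Bayes' theorem for Question 19.
p_hw = 0.25                                   # P(Homework)
p_pass_hw, p_pass_nohw = 0.95, 0.30           # conditional pass probabilities
p_pass = p_pass_hw * p_hw + p_pass_nohw * (1 - p_hw)
print(p_pass)                                 # 0.4625
print(round(p_pass_hw * p_hw / p_pass, 4))    # P(Homework | Pass) = 0.5135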

20. Plagiarism is a serious problem for assessors of coursework. One check on


plagiarism is to compare the coursework with a standard text. If the coursework
has plagiarised the text, then there will be a 95% chance of finding exactly two
phrases which are the same in both coursework and text, and a 5% chance of
finding three or more phrases. If the work is not plagiarised, then these
probabilities are both 50%.


Suppose that 5% of coursework is plagiarised. An assessor chooses some coursework


at random. What is the probability that it has been plagiarised if it has exactly two
phrases in the text?³
What if there are three or more phrases? Did you manage to get a roughly correct
guess of these results before calculating?
Solution:
Suppose that two phrases are the same. We use Bayes’ theorem:
P (plagiarised | two the same) = (0.95 × 0.05)/(0.95 × 0.05 + 0.5 × 0.95) = 0.0909.
Finding two phrases has increased the chance the work is plagiarised from 5% to
9.1%. Did you get anywhere near 9% when guessing? Now suppose that we find
three or more phrases:
P (plagiarised | three or more the same) = (0.05 × 0.05)/(0.05 × 0.05 + 0.5 × 0.95) = 0.0052.
It seems that no plagiariser is silly enough to keep three or more phrases the same,
so if we find three or more, the chance of the work being plagiarised falls from 5%
to 0.5%! How close did you get by guessing?

21. A, B and C throw a die in that order until a six appears. The person who throws
the first six wins. What are their respective chances of winning?
Solution:
We must assume that the game finishes with probability one (it would be proved in
a more advanced subject). If A, B and C all throw and fail to get a six, then their
respective chances of winning are as at the start of the game. We can call each
completed set of three throws a round. Let us denote the probabilities of winning
by P (A), P (B) and P (C) for A, B and C, respectively. Therefore:

P (A) = P (A wins on the 1st throw) + P (A wins in some round after the 1st round)
      = 1/6 + P (A, B and C fail on the 1st throw and A wins after the 1st round)
      = 1/6 + P (A, B and C fail in the 1st round)
              × P (A wins after the 1st round | A, B and C fail in the 1st round)
      = 1/6 + P (No six in first 3 throws) P (A)
      = 1/6 + (5/6)³ P (A)
      = 1/6 + (125/216) P (A).
³ Try making a guess before doing the calculation!


So (1 − 125/216)P (A) = 1/6, and P (A) = 216/(91 × 6) = 36/91.


Similarly:

P (B) = P (B wins in the 1st round)


+ P (B wins after the 1st round)
= P (A fails with the 1st throw and B throws a six on the 1st throw)
+ P (All fail in the 1st round and B wins after the 1st round)
= P (A fails with the 1st throw) P (B throws a six with the 1st throw)
+ P (All fail in the 1st round) P (B wins after the 1st | All fail in the 1st)
= (5/6)(1/6) + (5/6)³ P (B).

So, (1 − 125/216)P (B) = 5/36, and P (B) = 5(216)/(91 × 36) = 30/91.


In the same way, P (C) = (5/6)(5/6)(1/6)(216/91) = 25/91.
Notice that P (A) + P (B) + P (C) = 1. You may, on reflection, think that this
rather long solution could be shortened, by considering the relative winning
chances of A, B and C.
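
A simulation gives a quick, if less elegant, check of these three probabilities; the Python sketch below is an illustration of our own (the function name and number of games are arbitrary choices).

import random

def first_six_winner(n_games=300_000, seed=2):
    # A, B and C throw a fair die in turn; record who throws the first six.
    random.seed(seed)
    wins = {'A': 0, 'B': 0, 'C': 0}
    players = ['A', 'B', 'C']
    for _ in range(n_games):
        turn = 0
        while random.randint(1, 6) != 6:
            turn += 1
        wins[players[turn % 3]] += 1
    return {p: w / n_games for p, w in wins.items()}

print(first_six_winner())   # close to 36/91 = 0.396, 30/91 = 0.330, 25/91 = 0.275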

22. In men’s singles tennis, matches are played on the best-of-five-sets principle.
Therefore, the first player to win three sets wins the match, and a match may
consist of three, four or five sets. Assuming that two players are perfectly evenly
matched, and that sets are independent events, calculate the probabilities that a
match lasts three sets, four sets and five sets, respectively.
Solution:
Suppose that the two players are A and B. We calculate the probability that A
wins a three-, four- or five-set match, and then, since the players are evenly
matched, double these probabilities for the final answer.

P (‘A wins in 3 sets’) = P (‘A wins 1st set’ ∩ ‘A wins 2nd set’ ∩ ‘A wins 3rd set’).

Since the sets are independent, we have:

P (‘A wins in 3 sets’) = P (‘A wins 1st set’) P (‘A wins 2nd set’) P (‘A wins 3rd set’)
= 1/2 × 1/2 × 1/2 = 1/8.
Therefore, the total probability that the game lasts three sets is:

2 × 1/8 = 1/4.
If A wins in four sets, the possible winning patterns are:

BAAA, ABAA and AABA.


Each of these patterns has probability (1/2)4 by using the same argument as in the
case of 3 sets. So the probability that A wins in four sets is 3 × (1/16) = 3/16.
Therefore, the total probability of a match lasting four sets is 2 × (3/16) = 3/8.
The probability of a five-set match should be 1 − 3/8 − 1/4 = 3/8, but let us check
this directly. The winning patterns for A in a five-set match are:

BBAAA, BABAA, BAABA, ABBAA, ABABA and AABBA.

Each of these has probability (1/2)5 because of the independence of the sets. So the
probability that A wins in five sets is 6 × (1/32) = 3/16. Therefore, the total
probability of a five-set match is 3/8, as before.
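
The same probabilities follow from enumerating all 2⁵ equally likely sequences of set winners; the Python sketch below is our own check of the answer, not part of the solution.

from itertools import product

# Count, over all 32 sequences of set winners, when the match would finish.
counts = {3: 0, 4: 0, 5: 0}
for seq in product('AB', repeat=5):
    a = b = 0
    for sets_played, winner in enumerate(seq, start=1):
        a += (winner == 'A')
        b += (winner == 'B')
        if a == 3 or b == 3:            # first player to three sets wins
            counts[sets_played] += 1
            break
print({k: v / 2**5 for k, v in counts.items()})   # {3: 0.25, 4: 0.375, 5: 0.375}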

B.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix L.

1. (a) A, B and C are any three events in the sample space S. Prove that:

P (A∪B∪C) = P (A)+P (B)+P (C)−P (A∩B)−P (B∩C)−P (A∩C)+P (A∩B∩C).

(b) A and B are events in a sample space S. Show that:

P (A ∩ B) ≤ (P (A) + P (B))/2 ≤ P (A ∪ B).

2. Suppose A and B are events with P (A) = p, P (B) = 2p and P (A ∪ B) = 0.75.


(a) Evaluate p and P (A | B) if A and B are independent events.
(b) Evaluate p and P (A | B) if A and B are mutually exclusive events.

3. (a) Show that if A and B are independent events in a sample space, then Ac and
B c are also independent.
(b) Show that if X and Y are mutually exclusive events in a sample space, then
X c and Y c are not in general mutually exclusive.

4. In a game of tennis, each point is won by one of the two players A and B. The
usual rules of scoring for tennis apply. That is, the winner of the game is the player
who first scores four points, unless each player has won three points, when deuce is
called and play proceeds until one player is two points ahead of the other and
hence wins the game.
A is serving and has a probability of winning any point of 2/3. The result of each
point is assumed to be independent of every other point.


(a) Show that the probability of A winning the game without deuce being called is
496/729.
(b) Find the probability of deuce being called.
(c) If deuce is called, show that A’s subsequent probability of winning the game is
4/5.
(d) Hence determine A’s overall chance of winning the game.

Appendix C
Random variables

C.1 Worked examples


1. Toward the end of the financial year, James is considering whether to accept an
offer to buy his stock option now, rather than wait until the normal exercise time.
If he sells now, his profit will be £120,000. If he waits until the exercise time, his
profit will be £200,000, provided that there is no crisis in the markets before that
time; if there is a crisis, the option will be worthless and he would expect a net loss
of £50,000. What action should he take to maximise his expected profit if the
probability of crisis is:
(a) 0.5?
(b) 0.1?
For what probability of a crisis would James be indifferent between the two courses
of action if he wishes to maximise his expected profit?
Solution:
Let π = probability of crisis, then:

S = E(profit given James sells) = £120,000

and:

W = E(profit given James waits) = £200,000(1 − π) + (−£50,000)π.

(a) If π = 0.5, then S = £120,000 and W = £75,000, so S > W , hence James


should sell now.
(b) If π = 0.1, then S = £120,000 and W = £175,000, so S < W , hence James
should wait until the exercise time.
To be indifferent, we require S = W , i.e. we have:

£200,000 − £250,000 π = £120,000

so π = 8/25 = 0.32.

2. Suppose the random variable X has a geometric distribution with parameter π,


which has the following probability function:
p(x) = (1 − π)^(x−1) π for x = 1, 2, . . ., and 0 otherwise.


(a) Show that its moment generating function is:

πe^t / (1 − e^t (1 − π)).

(b) Hence show that the mean of the distribution is 1/π.

Solution:
(a) Working from the definition:

MX (t) = E(e^(tX)) = ∑_{x∈S} e^(tx) p(x) = ∑_{x=1}^{∞} e^(tx) (1 − π)^(x−1) π
       = πe^t ∑_{x=1}^{∞} (e^t (1 − π))^(x−1)
       = πe^t / (1 − e^t (1 − π))

using the sum to infinity of a geometric series.


(b) Differentiating:

M′X (t) = ((1 − e^t (1 − π))πe^t + πe^t (e^t (1 − π))) / (1 − e^t (1 − π))² = πe^t / (1 − e^t (1 − π))².

Therefore:
E(X) = M′X (0) = π/(1 − (1 − π))² = π/π² = 1/π.

3. A continuous random variable, X, has a probability density function, f (x), defined


by:
f (x) = ax + bx² for 0 ≤ x ≤ 1, and 0 otherwise
and E(X) = 1/2. Determine:
(a) the constants a and b
(b) the cumulative distribution function, F (x), of X
(c) the variance, Var(X).

Solution:
(a) We have:
∫_0^1 f (x) dx = 1 ⇒ ∫_0^1 (ax + bx²) dx = [ax²/2 + bx³/3]_0^1 = 1

i.e. we have a/2 + b/3 = 1.


Also, we know E(X) = 1/2, hence:


∫_0^1 x(ax + bx²) dx = [ax³/3 + bx⁴/4]_0^1 = 1/2
i.e. we have:
a/3 + b/4 = 1/2 ⇒ a = 6 and b = −6.
Hence f (x) = 6x(1 − x) for 0 ≤ x ≤ 1, and 0 otherwise.
(b) We have:
F (x) = 0 for x < 0, 3x² − 2x³ for 0 ≤ x ≤ 1, and 1 for x > 1.

(c) Finally:
E(X²) = ∫_0^1 x²(6x(1 − x)) dx = ∫_0^1 (6x³ − 6x⁴) dx = [6x⁴/4 − 6x⁵/5]_0^1 = 0.3
and so the variance is:
Var(X) = E(X²) − (E(X))² = 0.3 − 0.25 = 0.05.
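
The constants and the variance can also be verified symbolically; the sketch below is an illustration of our own and assumes the sympy package is available.

import sympy as sp

x, a, b = sp.symbols('x a b')
f = a*x + b*x**2
# Impose total probability 1 and E(X) = 1/2 on the interval [0, 1].
sol = sp.solve([sp.integrate(f, (x, 0, 1)) - 1,
                sp.integrate(x*f, (x, 0, 1)) - sp.Rational(1, 2)], [a, b])
f = f.subs(sol)                                  # 6*x - 6*x**2, i.e. 6x(1 - x)
var = sp.integrate(x**2 * f, (x, 0, 1)) - sp.Rational(1, 2)**2
print(sol, var)                                  # {a: 6, b: -6} and 1/20 = 0.05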

4. The waiting time, W , of a traveller queueing at a taxi rank is distributed according


to the cumulative distribution function, G(w), defined by:

G(w) = 0 for w < 0, 1 − (2/3) exp(−w/2) for 0 ≤ w < 2, and 1 for w ≥ 2.

(a) Sketch the cumulative distribution function.


(b) Is the random variable W discrete, continuous or mixed?
(c) Evaluate P (W > 1), P (W = 2), P (W ≤ 1.5 | W > 0.5) and E(W ).

Solution:
(a) A sketch of the cumulative distribution function is:

[Sketch: G(w) jumps from 0 to 1/3 at w = 0, follows the curve 1 − (2/3)e^(−w/2) for 0 ≤ w < 2, and jumps from 1 − (2/3)e⁻¹ to 1 at w = 2.]

(b) We see the distribution is mixed, with discrete ‘atoms’ at 0 and 2.


(c) We have:

P (W > 1) = 1 − G(1) = (2/3)e^(−1/2)
P (W = 2) = (2/3)e⁻¹
P (W ≤ 1.5 | W > 0.5) = P (0.5 < W ≤ 1.5)/P (W > 0.5)
                      = (G(1.5) − G(0.5))/(1 − G(0.5))
                      = ((1 − (2/3)e^(−1.5/2)) − (1 − (2/3)e^(−0.5/2)))/((2/3)e^(−0.5/2))
                      = 1 − e^(−1/2).

Finally, the mean is:


E(W ) = 1/3 × 0 + (2/3)e⁻¹ × 2 + ∫_0^2 w (1/3)e^(−w/2) dw
      = (4/3)e⁻¹ + [−(2/3)we^(−w/2)]_0^2 + ∫_0^2 (2/3)e^(−w/2) dw
      = (4/3)e⁻¹ − (4/3)e⁻¹ + [−(4/3)e^(−w/2)]_0^2
      = (4/3)(1 − e⁻¹).

5. A random variable X has the following pdf:



1/4 for 0 ≤ x ≤ 1

f (x) = 3/4 for 1 < x ≤ 2

0 otherwise.

(a) Explain why f (x) can serve as a pdf.


(b) Find the mean and median of the distribution.
(c) Find the variance, Var(X).
(d) Write down the cdf of X.
(e) Find P (X = 1) and P (X > 1.5 | X > 0.5).
(f) Derive the moment generating function of X.


Solution:
(a) Clearly, f (x) ≥ 0 for all x and ∫_{−∞}^{∞} f (x) dx = 1. This can be seen
geometrically, since f (x) defines two rectangles, one with base 1 and height
1/4, the other with base 1 and height 3/4, giving a total area of 1/4 + 3/4 = 1.
(b) We have:
E(X) = ∫_{−∞}^{∞} x f (x) dx = ∫_0^1 (x/4) dx + ∫_1^2 (3x/4) dx = [x²/8]_0^1 + [3x²/8]_1^2 = 1/8 + 3/2 − 3/8 = 5/4.
The median is most simply found geometrically. The area to the right of the
point x = 4/3 is 0.5, i.e. the rectangle with base 2 − 4/3 = 2/3 and height 3/4,
giving an area of 2/3 × 3/4 = 1/2. Hence the median is 4/3.
(c) For the variance, we proceed as follows:
E(X²) = ∫_{−∞}^{∞} x² f (x) dx = ∫_0^1 (x²/4) dx + ∫_1^2 (3x²/4) dx = [x³/12]_0^1 + [x³/4]_1^2 = 1/12 + 2 − 1/4 = 11/6.
Hence the variance is:
Var(X) = E(X²) − (E(X))² = 11/6 − 25/16 = 88/48 − 75/48 = 13/48 ≈ 0.2708.
(d) The cdf is:
F (x) = 0 for x < 0, x/4 for 0 ≤ x ≤ 1, 3x/4 − 1/2 for 1 < x ≤ 2, and 1 for x > 2.
(e) P (X = 1) = 0, since the cdf is continuous, and:
P (X > 1.5 | X > 0.5) = P ({X > 1.5} ∩ {X > 0.5})/P (X > 0.5) = P (X > 1.5)/P (X > 0.5)
                      = (0.5 × 0.75)/(1 − 0.5 × 0.25)
                      = 0.375/0.875
                      = 3/7 ≈ 0.4286.
(f) The moment generating function is:
MX (t) = E(e^(tX)) = ∫_{−∞}^{∞} e^(tx) f (x) dx = ∫_0^1 (e^(tx)/4) dx + ∫_1^2 (3e^(tx)/4) dx
       = [e^(tx)/4t]_0^1 + [3e^(tx)/4t]_1^2
       = (1/4t)(e^t − 1) + (3/4t)(e^(2t) − e^t)
       = (1/4t)(3e^(2t) − 2e^t − 1).
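
Numerical integration offers a quick sanity check on these results; the sketch below is an illustration of our own and assumes scipy is available.

from scipy import integrate

f = lambda x: 0.25 if 0 <= x <= 1 else (0.75 if 1 < x <= 2 else 0.0)
mean = integrate.quad(lambda t: t * f(t), 0, 2, points=[1])[0]
ex2 = integrate.quad(lambda t: t**2 * f(t), 0, 2, points=[1])[0]

def tail(c):
    # P(X > c), integrating each constant piece separately to respect the jump at 1
    below = integrate.quad(f, c, 1)[0] if c < 1 else 0.0
    return below + integrate.quad(f, max(c, 1.0), 2)[0]

print(mean, ex2 - mean**2)        # 1.25 and about 0.2708
print(tail(1.5) / tail(0.5))      # P(X > 1.5 | X > 0.5) = about 0.4286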


6. A continuous random variable X has the following pdf:


f (x) = x³/4 for 0 ≤ x ≤ 2, and 0 otherwise.

(a) Explain why f (x) can serve as a pdf.


(b) Find the mean and mode of the distribution.
(c) Determine the cdf, F (x), of X.
(d) Find the variance, Var(X).
(e) Find the skewness of X, given by:

E((X − E(X))³)/σ³.
(f) If a sample of five observations is drawn at random from the distribution, find
the probability that all the observations exceed 1.5.

Solution:
(a) Clearly, f (x) ≥ 0 for all x and:
2  4 2
x3
Z
x
dx = = 1.
0 4 16 0

(b) The mean is:


∞ 2  5 2
x4
Z Z
x 32
E(X) = x f (x) dx = dx = = = 1.6
−∞ 0 4 20 0 20

and the mode is 2 (where the density reaches a maximum).


(c) The cdf is:
F (x) = 0 for x < 0, x⁴/16 for 0 ≤ x ≤ 2, and 1 for x > 2.

(d) For the variance, we first find E(X 2 ), given by:


E(X²) = ∫_0^2 x² f (x) dx = ∫_0^2 (x⁵/4) dx = [x⁶/24]_0^2 = 64/24 = 8/3
⇒ Var(X) = E(X²) − (E(X))² = 8/3 − 64/25 = 8/75 ≈ 0.1067.

(e) The third moment about zero is:


E(X³) = ∫_0^2 x³ f (x) dx = ∫_0^2 (x⁶/4) dx = [x⁷/28]_0^2 = 128/28 ≈ 4.5714.


Letting E(X) = µ, the numerator is:


E((X − E(X))³) = E(X³) − 3µ E(X²) + 3µ² E(X) − µ³
               = 4.5714 − (3 × 1.6 × 2.6667) + (3 × (1.6)³) − (1.6)³
which is −0.0368, and the denominator is (0.1067)^(3/2) = 0.0349, hence the
skewness is −1.0544.
(f) The probability of a single observation exceeding 1.5 is:
∫_{1.5}^2 f (x) dx = ∫_{1.5}^2 (x³/4) dx = [x⁴/16]_{1.5}^2 = 1 − 0.3164 = 0.6836.
So the probability of all five exceeding 1.5 is, by independence:
(0.6836)⁵ = 0.1493.

7. Consider the function:


f (x) = λ²xe^(−λx) for x ≥ 0, and 0 otherwise.
(a) Show that this function has the characteristics of a probability density
function.
(b) Evaluate E(X) and Var(X).

Solution:
(a) Clearly, f (x) ≥ 0 for all x since λ² > 0, x ≥ 0 and e^(−λx) ≥ 0.
To show ∫_{−∞}^{∞} f (x) dx = 1, we have:
∫_{−∞}^{∞} f (x) dx = ∫_0^∞ λ²xe^(−λx) dx
                    = [λ²x e^(−λx)/(−λ)]_0^∞ + ∫_0^∞ λ² e^(−λx)/λ dx
                    = 0 + ∫_0^∞ λe^(−λx) dx
                    = 1 (provided λ > 0).

(b) For the mean:


E(X) = ∫_0^∞ x λ²xe^(−λx) dx
     = [−x²λe^(−λx)]_0^∞ + ∫_0^∞ 2xλe^(−λx) dx
     = 0 + 2/λ (from the exponential distribution).
For the variance:
E(X²) = ∫_0^∞ x²λ²xe^(−λx) dx = [−x³λe^(−λx)]_0^∞ + ∫_0^∞ 3x²λe^(−λx) dx = 6/λ².
So, Var(X) = 6/λ² − (2/λ)² = 2/λ².


8. A random variable, X, has a cumulative distribution function, F (x), defined by:

F (x) = 0 for x < 0, 1 − ae^(−x) for 0 ≤ x < 1, and 1 for x ≥ 1.

(a) Derive expressions for:


i. P (X = 0)
ii. P (X = 1)
iii. the pdf of X (where it is continuous)
iv. E(X).
(b) Suppose that E(X) = 0.75(1 − e−1 ). Evaluate the median of X and Var(X).

Solution:
(a) We have:
i. P (X = 0) = F (0) = 1 − a.
ii. P (X = 1) = lim_{x→1} (F (1) − F (x)) = 1 − (1 − ae⁻¹) = ae⁻¹.
iii. f (x) = ae^(−x), for 0 ≤ x < 1, and 0 otherwise.
iv. The mean is:
E(X) = 0 × (1 − a) + 1 × ae⁻¹ + ∫_0^1 x ae^(−x) dx
     = ae⁻¹ + [−xae^(−x)]_0^1 + ∫_0^1 ae^(−x) dx
     = ae⁻¹ − ae⁻¹ + [−ae^(−x)]_0^1
     = a(1 − e⁻¹).

(b) The median, m, satisfies:
F (m) = 0.5 = 1 − 0.75e^(−m) ⇒ m = −ln(2/3) = 0.4055.
Recall Var(X) = E(X²) − (E(X))², so:
E(X²) = 0² × (1 − a) + 1² × ae⁻¹ + ∫_0^1 x² ae^(−x) dx
      = ae⁻¹ + [−x²ae^(−x)]_0^1 + 2 ∫_0^1 xae^(−x) dx
      = ae⁻¹ − ae⁻¹ + 2(a − 2ae⁻¹)
      = 2a − 4ae⁻¹.
Hence:
Var(X) = 2a − 4ae⁻¹ − a²(1 + e⁻² − 2e⁻¹) = 0.1716.


9. A continuous random variable, X, has a probability density function, f (x), defined


by:
f (x) = k sin(x) for 0 ≤ x ≤ π, and 0 otherwise.

(a) Determine the constant k and derive the cumulative distribution function,
F (x), of X.
(b) Find E(X) and Var(X).

Solution:
(a) We have:
∫_{−∞}^{∞} f (x) dx = ∫_0^π k sin(x) dx = 1.
Therefore:
[k(− cos(x))]_0^π = 2k = 1 ⇒ k = 1/2.
The cdf is hence:
F (x) = 0 for x < 0, (1 − cos(x))/2 for 0 ≤ x ≤ π, and 1 for x > π.

(b) By symmetry, E(X) = π/2. Alternatively:


E(X) = ∫_0^π x (1/2) sin(x) dx = (1/2)[x(− cos(x))]_0^π + (1/2) ∫_0^π cos(x) dx = π/2 + (1/2)[sin(x)]_0^π = π/2.
Next:
E(X²) = ∫_0^π x² (1/2) sin(x) dx = (1/2)[x²(− cos(x))]_0^π + ∫_0^π x cos(x) dx
      = π²/2 + [x sin(x)]_0^π − ∫_0^π sin(x) dx
      = π²/2 − [− cos(x)]_0^π
      = π²/2 − 2.
Therefore, the variance is:
Var(X) = E(X²) − (E(X))² = π²/2 − 2 − π²/4 = π²/4 − 2.

10. (a) Define the cumulative distribution function (cdf) of a random variable and
state the principal properties of such a function.


(b) Identify which, if any, of the following functions could be a cdf under suitable
choices of the constants a and b. Explain why (or why not) each function
satisfies the properties required of a cdf and the constraints which may be
required in respect of the constants a and b.
i. F (x) = a(b − x)2 for −1 ≤ x ≤ 1.
ii. F (x) = a(1 − xb ) for −1 ≤ x ≤ 1.
iii. F (x) = a − b exp(−x/2) for 0 ≤ x ≤ 2.

Solution:
(a) We defined the cdf to be F (x) = P (X ≤ x) where:
• 0 ≤ F (x) ≤ 1
• F (x) is non-decreasing
• dF (x)/dx = f (x) and F (x) = ∫_{−∞}^{x} f (t) dt for continuous X
• F (x) → 0 as x → −∞ and F (x) → 1 as x → ∞.
(b) i. Okay. a = 0.25 and b = −1.
ii. Not okay. At x = 1, F (x) = 0, which would mean a decreasing function.
iii. Okay. a = b > 0 and b = (1 − e⁻¹)⁻¹.

11. Suppose that random variable X has the range {x1 , x2 , . . .}, where x1 < x2 < · · · .
Prove the following results:

∑_{i=1}^{∞} p(xi ) = 1

p(xk ) = F (xk ) − F (xk−1 )

F (xk ) = ∑_{i=1}^{k} p(xi ).

Solution:
The events X = x1 , X = x2 , . . . are disjoint, so we can write:

∑_{i=1}^{∞} p(xi ) = ∑_{i=1}^{∞} P (X = xi ) = P (X = x1 ∪ X = x2 ∪ · · · ) = P (S) = 1.

In words, this result states that the sum of the probabilities of all the possible
values X can take is equal to 1.
For the second equation, we have:

F (xk ) = P (X ≤ xk ) = P (X = xk ∪ X ≤ xk−1 ).

The two events on the right-hand side are disjoint, so:

F (xk ) = P (X = xk ) + P (X ≤ xk−1 ) = p(xk ) + F (xk−1 )


which immediately gives the required result.


For the final result, we can write:

F (xk ) = P (X ≤ xk ) = P (X = x1 ∪ X = x2 ∪ · · · ∪ X = xk ) = ∑_{i=1}^{k} p(xi ).

12. At a charity event, the organisers sell 100 tickets to a raffle. At the end of the
event, one of the tickets is selected at random and the person with that number
wins a prize. Carol buys ticket number 22. Janet buys tickets numbered 1–5. What
is the probability for each of them to win the prize?

Solution:
Let X denote the number on the winning ticket. Since all values between 1 and 100
are equally likely, X has a discrete ‘uniform’ distribution such that:

P (‘Carol wins’) = P (X = 22) = p(22) = 1/100 = 0.01
and:
P (‘Janet wins’) = P (X ≤ 5) = F (5) = 5/100 = 0.05.

13. What is the expectation of the random variable X if the only possible value it can
take is c?

Solution:
We have p(c) = 1, so X is effectively a constant, even though it is called a random
variable. Its expectation is:
E(X) = ∑_{∀x} x p(x) = c p(c) = c × 1 = c.     (C.1)

This is intuitively correct; on average, a constant must be equal to itself!

14. Show that E(X − E(X)) = 0.

Solution:
We have:
E(X − E(X)) = E(X) − E(E(X))
Since E(X) is just a number, as opposed to a random variable, (C.1) tells us that
its expectation is equal to itself. Therefore, we can write:

E(X − E(X)) = E(X) − E(X) = 0.


15. Show that if Var(X) = 0 then p(µ) = 1. (We say in this case that X is almost
surely equal to its mean.)
Solution:
From the definition of variance, we have:
Var(X) = E((X − µ)²) = ∑_{∀x} (x − µ)² p(x) ≥ 0
because the squared term (x − µ)² is non-negative (as is p(x)). The only case where
it is equal to 0 is when x − µ = 0, that is, when x = µ. Therefore, the random
variable X can only take the value µ, and we have p(µ) = P (X = µ) = 1.

C.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix L.

1. Construct suitable examples to show that for a random variable X:


(a) E(X²) ≠ (E(X))² in general
(b) E(1/X) ≠ 1/E(X) in general.

2. (a) Let X be a random variable. Show that:


Var(X) = E(X(X − 1)) − E(X)(E(X) − 1).

(b) Let X1 , X2 , . . . , Xn be independent random variables. Assume that all have a


mean of µ and a variance of σ 2 . Find expressions for the mean and variance of
the random variable (X1 + X2 + · · · + Xn )/n.

3. A doctor wishes to procure subjects possessing a certain chromosome abnormality


which is present in 4% of the population. How many randomly chosen independent
subjects should be procured if the doctor wishes to be 95% confident that at least
one subject has the abnormality?

4. In an investigation of animal behaviour, rats have to choose between four doors.


One of them, behind which is food, is ‘correct’. If an incorrect choice is made, the
rat is returned to the starting point and chooses again, continuing as long as
necessary until the correct choice is made. The random variable X is the serial
number of the trial on which the correct choice is made.
Find the probability function and expectation of X under each of the following
hypotheses:
(a) each door is equally likely to be chosen on each trial, and all trials are
mutually independent
(b) at each trial, the rat chooses with equal probability between the doors which it
has not so far tried
(c) the rat never chooses the same door on two successive trials, but otherwise
chooses at random with equal probabilities.

Appendix D
Common distributions of random
variables

D.1 Worked examples


1. The random variable X has a binomial distribution with parameters n and π.
Derive expressions for:
(a) E(X)
(b) E(X(X − 1))
(c) E(X(X − 1) · · · (X − r)).

Solution:
(a) We have:
E(X) = ∑_{x=0}^{n} x (n choose x) π^x (1 − π)^(n−x)
     = ∑_{x=1}^{n} x (n choose x) π^x (1 − π)^(n−x)
     = ∑_{x=1}^{n} (n(n − 1)!/((x − 1)! ((n − 1) − (x − 1))!)) π π^(x−1) (1 − π)^(n−x)
     = nπ ∑_{x=1}^{n} (n−1 choose x−1) π^(x−1) (1 − π)^((n−1)−(x−1))
     = nπ ∑_{y=0}^{n−1} (n−1 choose y) π^y (1 − π)^((n−1)−y)
     = nπ.

(b) We have:
E(X(X − 1)) = ∑_{x=0}^{n} x(x − 1) (n choose x) π^x (1 − π)^(n−x)
            = ∑_{x=2}^{n} x(x − 1) (n choose x) π^x (1 − π)^(n−x)


E(X(X − 1)) = ∑_{x=2}^{n} (n(n − 1)(n − 2)!/((x − 2)! ((n − 2) − (x − 2))!)) π² π^(x−2) (1 − π)^(n−x)
            = n(n − 1)π² ∑_{x=2}^{n} (n−2 choose x−2) π^(x−2) (1 − π)^((n−2)−(x−2))
            = n(n − 1)π² ∑_{y=0}^{n−2} (n−2 choose y) π^y (1 − π)^((n−2)−y)
            = n(n − 1)π².

(c) We have:
E(X(X − 1) · · · (X − r))
= ∑_{x=0}^{n} x(x − 1) · · · (x − r) (n choose x) π^x (1 − π)^(n−x)     (if r < n)
= ∑_{x=r+1}^{n} x(x − 1) · · · (x − r) (n choose x) π^x (1 − π)^(n−x)
= n(n − 1) · · · (n − r)π^(r+1) ∑_{x=r+1}^{n} (n−(r+1) choose x−(r+1)) π^(x−(r+1)) (1 − π)^((n−(r+1))−(x−(r+1)))
= n(n − 1) · · · (n − r)π^(r+1).

2. Suppose {Bi } is an infinite sequence of independent Bernoulli trials with:


P (Bi = 0) = 1 − π and P (Bi = 1) = π
for all i.
(a) Derive the distribution of Xn = ∑_{i=1}^{n} Bi and the expected value and variance of Xn .
(b) Let Y = min{i : Bi = 1}. Derive the distribution of Y and obtain an
expression for P (Y > y).

Solution:
(a) Xn = ∑_{i=1}^{n} Bi takes the values 0, 1, 2, . . . , n. Any sequence consisting of x 1s and
n − x 0s has a probability π^x (1 − π)^(n−x) and gives a value Xn = x. There are
(n choose x) such sequences, so:
P (Xn = x) = (n choose x) π^x (1 − π)^(n−x)
and 0 otherwise. Hence E(Bi ) = π and Var(Bi ) = π(1 − π) which means
E(Xn ) = nπ and Var(Xn ) = nπ(1 − π).


(b) Y = min{i : Bi = 1} takes the values 1, 2, . . ., hence:

P (Y = y) = (1 − π)^(y−1) π

and 0 otherwise. It follows that P (Y > y) = (1 − π)^y .

3. A continuous random variable X has the gamma distribution, denoted


X ∼ Gamma(α, β), if its probability density function (pdf) is of the form:

f (x) = (β^α/Γ(α)) x^(α−1) e^(−βx) for x > 0     (D.1)

and 0 otherwise, where α > 0 and β > 0 are parameters, and Γ(α) is the value of
the gamma function such that:
Γ(α) = ∫_0^∞ x^(α−1) e^(−x) dx.

The gamma function has a finite value for all α > 0. Two of its properties are that:
• Γ(1) = 1
• Γ(α) = (α − 1) Γ(α − 1) for all α > 1.
(a) The function f (x) defined by (D.1) satisfies all the conditions for being a pdf.
Show that this implies the following result about an integral:
∫_0^∞ x^(α−1) e^(−βx) dx = Γ(α)/β^α for any α > 0, β > 0.

(b) The Gamma(1, β) distribution is the same as another distribution with a


different name. What is this other distribution? Justify your answer.
(c) Show that if X ∼ Gamma(α, β), the moment generating function of X is:
MX (t) = (β/(β − t))^α

which is defined when t < β.


(d) Suppose that X ∼ Gamma(α, β). Derive the expected value of X:
i. using the pdf and the definition of the expected value
ii. using the moment generating function.
(e) If X1 , X2 , . . . , Xk are independent random variables such that
Xi ∼ Gamma(αi , β) for i = 1, 2, . . . , k, then:
∑_{i=1}^{k} Xi ∼ Gamma(∑_{i=1}^{k} αi , β).

Using this result and the known properties of the exponential distribution,
derive the expected value of X ∼ Gamma(α, β) when α is a positive integer
(i.e. α = 1, 2, . . .).


Solution:
(a) This follows immediately from the general property of pdfs that
∫_{−∞}^{∞} f (x) dx = 1, applied to the specific pdf here. We have:
Γ(α)/β^α = (Γ(α)/β^α) ∫_0^∞ (β^α/Γ(α)) x^(α−1) e^(−βx) dx = ∫_0^∞ x^(α−1) e^(−βx) dx.

(b) With α = 1, the pdf becomes f (x) = βe−βx for x ≥ 0, and 0 otherwise. This is
the pdf of the exponential distribution with parameter β, i.e. X ∼ Exp(β).
(c) We have:
MX (t) = E(e^(tX)) = ∫_0^∞ e^(tx) f (x) dx = ∫_0^∞ e^(tx) (β^α/Γ(α)) x^(α−1) e^(−βx) dx
       = (β^α/Γ(α)) ∫_0^∞ e^(tx) x^(α−1) e^(−βx) dx
       = (β^α/Γ(α)) ∫_0^∞ x^(α−1) e^(−(β−t)x) dx
       = (β^α/Γ(α)) × Γ(α)/(β − t)^α
       = (β/(β − t))^α
which is finite when β − t > 0, i.e. when t < β. The second-to-last step follows
by substituting β − t for β in the result in (a).
(d) i. We have:
E(X) = ∫_{−∞}^{∞} x f (x) dx = ∫_0^∞ x (β^α/Γ(α)) x^(α−1) e^(−βx) dx
     = (β^α/Γ(α)) ∫_0^∞ x^((α+1)−1) e^(−βx) dx
     = (β^α/Γ(α)) × Γ(α + 1)/β^(α+1)
     = (β^α/Γ(α)) × αΓ(α)/β^(α+1)
     = α/β
using (a) and the gamma function property stated in the question.
ii. The first derivative of MX (t) is:
M′X (t) = α (β/(β − t))^(α−1) × β/(β − t)².
Therefore:
E(X) = M′X (0) = α/β.


(e) When α is a positive integer, by the result stated in the question, we have
X = ∑_{i=1}^{α} Yi , where Y1 , Y2 , . . . , Yα are independent random variables each
distributed as Gamma(1, β), i.e. as exponential with parameter β as concluded
in (b). The expected value of the exponential distribution can be taken as
given from the lectures, so E(Yi ) = 1/β for each i = 1, 2, . . . , α. Therefore,
using the general result on expected values of sums:
E(X) = E(∑_{i=1}^{α} Yi ) = ∑_{i=1}^{α} E(Yi ) = α × 1/β = α/β.
β β

4. James enjoys playing Solitaire on his laptop. One day, he plays the game
repeatedly. He has found, from experience, that the probability of success in any
game is 1/3 and is independent of the outcomes of other games.
(a) What is the probability that his first success occurs in the fourth game he
plays? What is the expected number of games he needs to play to achieve his
first success?
(b) What is the probability of three successes in ten games? What is the expected
number of successes in ten games?
(c) Use a suitable approximation to find the probability of less than 25 successes
in 100 games. You should justify the use of the approximation.
(d) What is the probability that his third success occurs in the tenth game he
plays?

Solution:
(a) P (first success in 4th game) = (2/3)3 × (1/3) = 8/81 ≈ 0.1. This is a
geometric distribution, for which E(X) = 1/π = 1/(1/3) = 3.
(b) Use X ∼ Bin(10, 1/3), such that E(X) = 10 × 1/3 = 3.33, and:
P (X = 3) = (10 choose 3) (1/3)³ (2/3)⁷ ≈ 0.2601.

(c) Approximate Bin(100, 1/3) by:
N (100 × 1/3, 100 × 1/3 × 2/3) = N (33.3, 200/9).
The approximation seems reasonable since n = 100 is ‘large’, π = 1/3 is quite
close to 0.5, nπ > 5 and n(1 − π) > 5. Using a continuity correction:
P (X ≤ 24.5) = P (Z ≤ (24.5 − 33.3)/√(200/9)) = P (Z ≤ −1.87) ≈ 0.0307.

(d) This is a negative binomial distribution (used for the trial number of the kth
success) with a pf given by:
p(x) = (x−1 choose k−1) π^k (1 − π)^(x−k) for x = k, k + 1, k + 2, . . .


and 0 otherwise. Hence we require:
P (X = 10) = (9 choose 2) (1/3)³ (2/3)⁷ ≈ 0.0780.

Alternatively, you could calculate the probability of 2 successes in 9 trials,


followed by a further success.
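
The four answers can be checked against scipy.stats; the lines below are an illustration of our own (note that scipy's nbinom counts the number of failures before the kth success, so the tenth game corresponds to seven failures).

from scipy import stats

p = 1/3
print(stats.geom.pmf(4, p))          # (a) first success in the 4th game: about 0.0988
print(stats.binom.pmf(3, 10, p))     # (b) three successes in ten games: about 0.2601
print(stats.norm.cdf(24.5, 100*p, (100*p*(1 - p))**0.5))   # (c) approximation: about 0.03
print(stats.binom.cdf(24, 100, p))   # (c) exact binomial value for comparison
print(stats.nbinom.pmf(7, 3, p))     # (d) third success in the 10th game: about 0.0780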

5. You may assume that 15% of individuals in a large population are left-handed.
(a) If a random sample of 40 individuals is taken, find the probability that exactly
6 are left-handed.
(b) If a random sample of 400 individuals is taken, find the probability that
exactly 60 are left-handed by using a suitable approximation. Briefly discuss
the appropriateness of the approximation.
(c) What is the smallest possible size of a randomly chosen sample if we wish to
be 99% sure of finding at least one left-handed individual in the sample?

Solution:

(a) Let X ∼ Bin(40, 0.15), hence:
P (X = 6) = (40 choose 6) × (0.15)⁶ × (0.85)³⁴ = 0.1742.

(b) Use a normal approximation with a continuity correction. We require


P (59.5 < X < 60.5), where X ∼ N (60, 51) since X has mean nπ and variance
nπ(1 − π) with n = 400 and π = 0.15. Standardising, this is
2 × P (0 < Z ≤ 0.07) = 0.0558, approximately.
Rules-of-thumb for use of the approximation are that n is ‘large’, π is close to
0.5, and nπ and n(1 − π) are both at least 5. The first and last of these
definitely hold. There is some doubt whether a value of 0.15 can be considered
close to 0.5, so use with caution!
(c) Given a sample of size n, P (no left-handers) = (0.85)^n . Therefore:
P (at least 1 left-hander) = 1 − (0.85)^n .
We require 1 − (0.85)^n > 0.99, or (0.85)^n < 0.01. This gives:
100 < (1/0.85)^n
or:
n > ln(100)/ln(1.1765) = 28.34.
Rounding up, this gives a sample size of 29.
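
All three parts can be reproduced with a few lines of Python; this is an illustration of our own and assumes scipy is available.

from math import ceil, log
from scipy import stats

print(stats.binom.pmf(6, 40, 0.15))     # (a) exactly 6 left-handed out of 40: 0.1742
exact = stats.binom.pmf(60, 400, 0.15)  # (b) exact P(X = 60)
approx = stats.norm.cdf(60.5, 60, 51**0.5) - stats.norm.cdf(59.5, 60, 51**0.5)
print(exact, approx)                    # both approximately 0.056
print(ceil(log(0.01) / log(0.85)))      # (c) smallest sample size: 29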


6. Show that the moment generating function (mgf) of a Poisson distribution with
parameter λ is given by:
MX (t) = exp(λ(exp(t) − 1)), writing exp(θ) ≡ e^θ .
Hence show that the mean and variance of the distribution are both λ.
Solution:
We have:

MX (t) = E(exp(Xt)) = ∑_{x=0}^{∞} exp(xt) (λ^x/x!) exp(−λ)
       = ∑_{x=0}^{∞} (exp(−λ)/x!) (λ exp(t))^x
       = exp(−λ) ∑_{x=0}^{∞} (λ exp(t))^x /x!
       = exp(−λ) exp(λ exp(t))
       = exp(λ(exp(t) − 1)).
We have that MX (0) = exp(0) = 1. Now, taking logs:
ln MX (t) = λ(exp(t) − 1).
Now differentiate:
M′X (t)/MX (t) = λ exp(t) ⇒ M′X (t) = MX (t)λ exp(t).
Differentiating again, we get:
M″X (t) = M′X (t)λ exp(t) + MX (t)λ exp(t).
We note E(X) = M′X (0) = MX (0)λ exp(0) = λ, also:
Var(X) = M″X (0) − (M′X (0))² = λ² + λ − λ² = λ.

7. In javelin throwing competitions, the throws of athlete A are normally distributed.


It has been found that 15% of her throws exceed 43 metres, while 3% exceed 45
metres. What distance will be exceeded by 90% of her throws?
Solution:
Suppose X ∼ N (µ, σ 2 ) is the random variable for throws. P (X > 43) = 0.15 leads
to µ = 43 − 1.035 × σ (using Table 4 of the New Cambridge Statistical Tables).
Similarly, P (X > 45) = 0.03 leads to µ = 45 − 1.88 × σ. Solving yields µ = 40.55
and σ = 2.367, hence X ∼ N (40.55, (2.367)2 ). So:
P (X > x) = 0.90 ⇒ (x − 40.55)/2.367 = −1.28.
Hence x = 37.52 metres.
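
The simultaneous equations and the final percentile can be handled directly with scipy.stats; the sketch below is an illustration of our own, not part of the solution.

from scipy import stats

z15 = stats.norm.ppf(0.85)       # z-value with 15% in the upper tail
z3 = stats.norm.ppf(0.97)        # z-value with 3% in the upper tail
sigma = (45 - 43) / (z3 - z15)   # from mu + z15*sigma = 43 and mu + z3*sigma = 45
mu = 43 - z15 * sigma
print(mu, sigma)                           # about 40.5 and 2.37
print(stats.norm.ppf(0.10, mu, sigma))     # distance exceeded by 90% of throws: about 37.5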


8. People entering an art gallery are counted by the attendant at the door. Assume
that people arrive in accordance with a Poisson distribution, with one person
arriving every 2 minutes. The attendant leaves the door unattended for 5 minutes.
(a) Calculate the probability that:
i. nobody will enter the gallery in this time
ii. 3 or more people will enter the gallery in this time.
(b) Find, to the nearest second, the length of time for which the attendant could
leave the door unattended for there to be a probability of 0.90 of no arrivals in
that time.
(c) Comment briefly on the assumption of a Poisson distribution in this context.

Solution:
(a) λ = 1 for a two-minute interval, so λ = 2.5 for a five-minute interval. Therefore:
P (no arrivals) = e−2.5 = 0.0821
and:
P (≥ 3 arrivals) = 1−pX (0)−pX (1)−pX (2) = 1−e−2.5 (1+2.5+3.125) = 0.4562.

(b) For an interval of N minutes, the parameter is N/2. We need p(0) = 0.90, so
e−N/2 = 0.90 giving N/2 = − ln(0.90) and N = 0.21 minutes, or 13 seconds.
(c) The rate is unlikely to be constant: more people at lunchtimes or early
evenings etc. Likely to be several arrivals in a small period – couples, groups
etc. Quite unlikely the Poisson will provide a good model.

9. The random variable Y , representing the life-span of an electronic component, is


distributed according to a probability density function f (y), where y > 0. The
survivor function, =, is defined as =(y) = P (Y > y) and the age-specific failure
rate, φ(y), is defined as f (y)/=(y). Suppose f (y) = λe−λy , i.e. Y ∼ Exp(λ).
(a) Derive expressions for =(y) and φ(y).
(b) Comment briefly on the implications of the age-specific failure rate you have
derived in the context of the exponentially-distributed component life-spans.

Solution:
(a) The survivor function is:
Z ∞ h i∞
=(y) = P (Y > y) = λe−λx dx = − e−λx = e−λy .
y y

The age-specific failure rate is:


f (y) λe−λy
φ(y) = = −λy = λ.
=(y) e

(b) The age-specific failure rate is constant, indicating it does not vary with age.
This is unlikely to be true in practice!


10. For the binomial distribution with a probability of success of 0.25 in an individual
trial, calculate the probability that, in 50 trials, there are at least 8 successes:
(a) using the normal approximation without a continuity correction
(b) using the normal approximation with a continuity correction.
Compare these results with the exact probability of 0.9547 and comment.
Solution:
We seek P (X ≥ 8) using the normal approximation Y ∼ N (12.5, 9.375).
(a) So, without a continuity correction:
P (Y ≥ 8) = P (Z ≥ (8 − 12.5)/√9.375) = P (Z ≥ −1.47) = 0.9292.
The required probability could have been expressed as P (X > 7), or indeed
any number in [7, 8), for example:
P (Y > 7) = P (Z ≥ (7 − 12.5)/√9.375) = P (Z ≥ −1.80) = 0.9641.

(b) With a continuity correction:
P (Y > 7.5) = P (Z ≥ (7.5 − 12.5)/√9.375) = P (Z ≥ −1.63) = 0.9484.

Compared to 0.9547, using the continuity correction yields the closer


approximation.
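
The three probabilities, and the exact value, can be reproduced with scipy.stats; the sketch below is an illustration of our own, and small differences from the figures above reflect the rounding of z-values in statistical tables.

from scipy import stats

n, p = 50, 0.25
mu, sd = n*p, (n*p*(1 - p))**0.5
print(1 - stats.binom.cdf(7, n, p))      # exact P(X >= 8): 0.9547
print(1 - stats.norm.cdf(8, mu, sd))     # without continuity correction: about 0.929
print(1 - stats.norm.cdf(7.5, mu, sd))   # with continuity correction: about 0.949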

11. A greengrocer has a very large pile of oranges on his stall. The pile of fruit is a
mixture of 50% old fruit with 50% new fruit; one cannot tell which are old and
which are new. However, 20% of old oranges are mouldy inside, but only 10% of
new oranges are mouldy. Suppose that you choose 5 oranges at random. What is
the distribution of the number of mouldy oranges in your sample?
Solution:
For an orange chosen at random, the event ‘mouldy’ is the union of the disjoint
events ‘mouldy’ ∩ ‘new’ and ‘mouldy’ ∩ ‘old’. So:

P (‘mouldy’) = P (‘mouldy’ ∩ ‘new’) + P (‘mouldy’ ∩ ‘old’)


= P (‘mouldy’ | ‘new’) P (‘new’) + P (‘mouldy’ | ‘old’) P (‘old’)
= 0.1 × 0.5 + 0.2 × 0.5
= 0.15.

As the pile of oranges is very large, we can assume that the results for the five
oranges will be independent, so we have 5 independent trials each with probability
of ‘mouldy’ equal to 0.15. The distribution of the number of mouldy oranges will be
a binomial distribution with n = 5 and π = 0.15.


12. Underground trains on the Northern line have a probability 0.05 of failure between
Golders Green and King’s Cross. Supposing that the failures are all independent,
what is the probability that out of 10 journeys between Golders Green and King’s
Cross more than 8 do not have a breakdown?
Solution:
The probability of no breakdown on one journey is π = 1 − 0.05 = 0.95, so the
number of journeys without a breakdown, X, has a Bin(10, 0.95) distribution. We
want P (X > 8), which is:
P (X > 8) = p(9) + p(10)
          = (10 choose 9) × (0.95)⁹ × (0.05)¹ + (10 choose 10) × (0.95)¹⁰ × (0.05)⁰
= 0.3151 + 0.5987
= 0.9138.

13. Suppose that the normal rate of infection for a certain disease in cattle is 25%. To
test a new serum which may prevent infection, three experiments are carried out.
The test for infection is not always valid for some particular cattle, so the
experimental results are incomplete – we cannot always tell whether a cow is
infected or not. The results of the three experiments are:
(a) 10 animals are injected; all 10 remain free from infection
(b) 17 animals are injected; more than 15 remain free from infection and there are
2 doubtful cases
(c) 23 animals are infected; more than 20 remain free from infection and there are
three doubtful cases.
Which experiment provides the strongest evidence in favour of the serum?
Solution:
These experiments involve tests on different cattle, which one might expect to
behave independently of one another. The probability of infection without injection
with the serum might also reasonably be assumed to be the same for all cattle. So
the distribution which we need here is the binomial distribution. If the serum has
no effect, then the probability of infection for each of the cattle is 0.25.
One way to assess the evidence of the three experiments is to calculate the
probability of the result of the experiment if the serum had no effect at all. If it has
an effect, then one would expect larger numbers of cattle to remain free from
infection, so the experimental results as given do provide some clue as to whether
the serum has an effect, in spite of their incompleteness.
Let X(n) be the number of cattle infected, out of a sample of n. We are assuming
that X(n) ∼ Bin(n, 0.25).
(a) With 10 trials, the probability of 0 infected if the serum has no effect is:
P (X(10) = 0) = (10 choose 0) × (0.75)¹⁰ = (0.75)¹⁰ = 0.0563.


(b) With 17 trials, the probability of more than 15 remaining uninfected if the
serum has no effect is:

P (X(17) < 2) = P (X(17) = 0) + P (X(17) = 1)
              = (17 choose 0) × (0.75)¹⁷ + (17 choose 1) × (0.25)¹ × (0.75)¹⁶
              = (0.75)¹⁷ + 17 × (0.25)¹ × (0.75)¹⁶
= 0.0075 + 0.0426
= 0.0501.

(c) With 23 trials, the probability of more than 20 remaining free from infection if
the serum has no effect is:

P (X(23) < 3) = P (X(23) = 0) + P (X(23) = 1) + P (X(23) = 2)
              = (23 choose 0) × (0.75)²³ + (23 choose 1) × (0.25)¹ × (0.75)²²
                + (23 choose 2) × (0.25)² × (0.75)²¹
              = (0.75)²³ + 23 × 0.25 × (0.75)²² + ((23 × 22)/2) × (0.25)² × (0.75)²¹
= 0.0013 + 0.0103 + 0.0376
= 0.0492.

The most surprising-looking event in these three experiments is that of experiment


3, and so we can say that this experiment offered the most support for the use of
the serum.

14. In a large industrial plant there is an accident on average every two days.
(a) What is the chance that there will be exactly two accidents in a given week?
(b) What is the chance that there will be two or more accidents in a given week?
(c) If James goes to work there for a four-week period, what is the probability
that no accidents occur while he is there?

Solution:
Here we have counts of random events over time, which is a typical application for
the Poisson distribution. We are assuming that accidents are equally likely to occur
at any time and are independent. The mean for the Poisson distribution is 0.5 per
day.
Let X be the number of accidents in a week. The probability of exactly two
accidents in a given week is found by using the parameter λ = 5 × 0.5 = 2.5 (5
working days a week assumed).


(a) The probability of exactly two accidents in a week is:
p(2) = (e^(−2.5) (2.5)²)/2! = 0.2565.
(b) The probability of two or more accidents in a given week is:
P (X ≥ 2) = 1 − p(0) − p(1) = 0.7127.

(c) If James goes to the industrial plant and does not change the probability of an
accident simply by being there (he might bring bad luck, or be superbly
safety-conscious!), then over 4 weeks there are 20 working days, and the
probability of no accident comes from a Poisson random variable with mean
10. If Y is the number of accidents while James is there, the probability of no
accidents is:
pY (0) = (e^(−10) (10)⁰)/0! = 0.0000454.
James is very likely to be there when there is an accident!

15. The chance that a lottery ticket has a winning number is 0.0000001.
(a) If 10,000,000 people buy tickets which are independently numbered, what is
the probability there is no winner?
(b) What is the probability that there is exactly 1 winner?
(c) What is the probability that there are exactly 2 winners?

Solution:
The number of winning tickets, X, will be distributed as:
X ∼ Bin(10,000,000, 0.0000001).
Since n is large and π is small, the Poisson distribution should provide a good
approximation. The Poisson parameter is:
λ = nπ = 10,000,000 × 0.0000001 = 1
and so we set X ∼ Pois(1). We have:
p(0) = (e⁻¹ 1⁰)/0! = 0.3679, p(1) = (e⁻¹ 1¹)/1! = 0.3679 and p(2) = (e⁻¹ 1²)/2! = 0.1839.
Using the exact binomial distribution of X, the results are:
p(0) = (10⁷ choose 0) × (10⁻⁷)⁰ × (1 − 10⁻⁷)^(10⁷) = 0.3679
p(1) = (10⁷ choose 1) × (10⁻⁷)¹ × (1 − 10⁻⁷)^(10⁷ − 1) = 0.3679
p(2) = (10⁷ choose 2) × (10⁻⁷)² × (1 − 10⁻⁷)^(10⁷ − 2) = 0.1839.
2
Notice that, in this case, the Poisson approximation is correct to at least 4 decimal
places.
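
The agreement can also be confirmed numerically; the sketch below is an illustration of our own using scipy.stats.

from scipy import stats

n, p = 10_000_000, 1e-7
for k in range(3):
    exact = stats.binom.pmf(k, n, p)
    approx = stats.poisson.pmf(k, n * p)          # Poisson approximation with lambda = 1
    print(k, round(exact, 6), round(approx, 6))   # agreement to at least 4 decimal places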


16. Suppose that X ∼ Uniform[0, 1]. Compute P (X > 0.2), P (X ≥ 0.2) and
P (X 2 > 0.04).
Solution:
We have a = 0 and b = 1, and can use the formula for P (c < X ≤ d), for constants
c and d. Hence:
P (X > 0.2) = P (0.2 < X ≤ 1) = (1 − 0.2)/(1 − 0) = 0.8.
Also:
P (X ≥ 0.2) = P (X = 0.2) + P (X > 0.2) = 0 + P (X > 0.2) = 0.8.
Finally:
P (X 2 > 0.04) = P (X < −0.2) + P (X > 0.2) = 0 + P (X > 0.2) = 0.8.

17. Suppose that the service time for a customer at a fast food outlet has an
exponential distribution with parameter 1/3 (customers per minute). What is the
probability that a customer waits more than 4 minutes?
Solution:
The distribution of X is Exp(1/3), so the probability is:
P (X > 4) = 1 − F (4) = 1 − (1 − e−(1/3)×4 ) = 1 − 0.7364 = 0.2636.

18. Suppose that the distribution of men’s heights in London, measured in cm, is
N (175, 62 ). Find the proportion of men whose height is:
(a) under 169 cm
(b) over 190 cm
(c) between 169 cm and 190 cm.

Solution:
The values of interest are 169 and 190. The corresponding z-values are:
z1 = (169 − 175)/6 = −1 and z2 = (190 − 175)/6 = 2.5.
Using values from Table 4 of the New Cambridge Statistical Tables, we have:
P (X < 169) = P (Z < −1) = Φ(−1)
= 1 − Φ(1) = 1 − 0.8413 = 0.1587

P (X > 190) = P (Z > 2.5) = 1 − Φ(2.5)


= 1 − 0.9938 = 0.0062
and:
P (169 < X < 190) = P (−1 < Z < 2.5) = Φ(2.5) − Φ(−1)
= 0.9938 − 0.1587 = 0.8351.


19. Two statisticians disagree about the distribution of IQ scores for a population
under study. Both agree that the distribution is normal, and that σ = 15, but A
says that 5% of the population have IQ scores greater than 134.6735, whereas B
says that 10% of the population have IQ scores greater than 109.224. What is the
difference between the mean IQ score as assessed by A and that as assessed by B?

Solution:
The standardised z-value giving 5% in the upper tail is 1.6449, and for 10% it is
1.2816. So, converting to the scale for IQ scores, the values are:

1.6449 × 15 = 24.6735 and 1.2816 × 15 = 19.224.

Write the means according to A and B as µA and µB , respectively. Therefore:

µA + 24.6735 = 134.6735

so:
µA = 110
whereas:
µB + 19.224 = 109.224
so µB = 90. The difference µA − µB = 110 − 90 = 20.

D.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix L.

1. At one stage in the manufacture of an article a piston of circular cross-section has


to fit into a similarly-shaped cylinder. The distributions of diameters of pistons and
cylinders are known to be normal with parameters as follows.
• Piston diameters: mean 10.42 cm, standard deviation 0.03 cm.
• Cylinder diameters: mean 10.52 cm, standard deviation 0.04 cm.

(a) If pairs of pistons and cylinders are selected at random for assembly, for what
proportion will the piston not fit into the cylinder (i.e. for which the piston
diameter exceeds the cylinder diameter)?

(b) Calculate exactly the chance that in 100 pairs, selected at random:
i. every piston will fit
ii. not more than two of the pistons will fail to fit.

(c) Now calculate the same probabilities using a Poisson approximation. Discuss
the appropriateness of using this approximation.


2. If X has the discrete uniform distribution such that P (X = i) = 1/k for


i = 1, 2, . . . , k, show that its moment generating function is:

MX (t) = (e^t (1 − e^(kt)))/(k(1 − e^t )).

(Do not attempt to find the mean and variance using the mgf.)

3. Let f (z) be defined as:


f (z) = (1/2)e^(−|z|) for all real values of z.

(a) Sketch f (z) and explain why it can serve as the pdf for a random variable Z.
(b) Determine the moment generating function of Z.
(c) Use the mgf to find E(Z), Var(Z), E(Z 3 ) and E(Z 4 ).
(You may assume that −1 < t < 1, for the mgf, which will ensure convergence.)

4. Show that for a binomial random variable X ∼ Bin(n, π), then:


E(X) = nπ ∑_{x=1}^{n} ((n − 1)!/((x − 1)! (n − x)!)) π^(x−1) (1 − π)^(n−x) .

Hence find E(X) and Var(X). (The wording of the question implies that you use
the result which you have just proved. Other methods of derivation will not be
accepted!)

5. Cars independently pass a point on a busy road at an average rate of 150 per hour.
(a) Assuming a Poisson distribution, find the probability that none passes in a
given minute.
(b) What is the expected number passing in two minutes?
(c) Find the probability that the expected number actually passes in a given
two-minute period.

6. James goes fishing every Saturday. The number of fish he catches follows a Poisson
distribution. On a proportion π of the days he goes fishing, he does not catch
anything. He makes it a rule to take home the first, and then every other, fish
which he catches, i.e. the first, third, fifth fish etc.
(a) Using a Poisson distribution, find the mean number of fish he catches.
(b) Show that the probability that he takes home the last fish he catches is
(1 − π 2 )/2.

Appendix E
Multivariate random variables

E.1 Worked examples


1. X and Y are independent random variables with distributions as follows:
X=x 0 1 2 Y =y 1 2
pX (x) 0.4 0.2 0.4 pY (y) 0.4 0.6
The random variables W and Z are defined by W = 2X and Z = Y − X,
respectively.
(a) Compute the joint distribution of W and Z.
(b) Evaluate P (W = 2 | Z = 1), E(W | Z = 0) and Cov(W, Z).

Solution:
(a) The joint distribution (with marginal probabilities) is:
W =w
0 2 4 pZ (z)
−1 0.00 0.00 0.16 0.16
Z=z 0 0.00 0.08 0.24 0.32
1 0.16 0.12 0.00 0.28
2 0.24 0.00 0.00 0.24
pW (w) 0.40 0.20 0.40 1.00
(b) It is straightforward to see that:
P (W = 2 | Z = 1) = P (W = 2 ∩ Z = 1)/P (Z = 1) = 0.12/0.28 = 3/7.
For E(W | Z = 0), we have:
E(W | Z = 0) = ∑_w w P (W = w | Z = 0) = 0 × 0/0.32 + 2 × 0.08/0.32 + 4 × 0.24/0.32 = 3.5.
We see E(W ) = 2 (by symmetry), and:
E(Z) = −1 × 0.16 + 0 × 0.32 + 1 × 0.28 + 2 × 0.24 = 0.6.
Also:
E(W Z) = ∑_w ∑_z wz p(w, z) = −4 × 0.16 + 2 × 0.12 = −0.4

hence:
Cov(W, Z) = E(W Z) − E(W ) E(Z) = −0.4 − 2 × 0.6 = −1.6.
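
The joint table and its summaries can also be generated programmatically; the Python sketch below is an illustration of our own, building the joint distribution from the two marginals and independence.

from collections import defaultdict
from itertools import product

pX = {0: 0.4, 1: 0.2, 2: 0.4}
pY = {1: 0.4, 2: 0.6}
joint = defaultdict(float)
for (x, px), (y, py) in product(pX.items(), pY.items()):
    joint[(2*x, y - x)] += px * py       # W = 2X and Z = Y - X, with X, Y independent
EW = sum(w * p for (w, z), p in joint.items())
EZ = sum(z * p for (w, z), p in joint.items())
EWZ = sum(w * z * p for (w, z), p in joint.items())
pZ1 = sum(p for (w, z), p in joint.items() if z == 1)
print(joint[(2, 1)] / pZ1)     # P(W = 2 | Z = 1) = 3/7
print(EWZ - EW * EZ)           # Cov(W, Z) = -1.6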


2. The joint probability distribution of the random variables X and Y is:

X=x
−1 0 1
−1 0.05 0.15 0.10
Y =y 0 0.10 0.05 0.25
1 0.10 0.05 0.15

(a) Identify the marginal distributions of X and Y and the conditional


distribution of X given Y = 1.
(b) Evaluate E(X | Y = 1) and the correlation coefficient of X and Y .
(c) Are X and Y independent random variables?

Solution:

(a) The marginal and conditional distributions are, respectively:


X=x −1 0 1 Y =y −1 0 1
pX (x) 0.25 0.25 0.50 pY (y) 0.30 0.40 0.30

X = x|Y = 1 −1 0 1
pX|Y =1 (x | Y = 1) 1/3 1/6 1/2
(b) From the conditional distribution we see:

E(X | Y = 1) = −1 × 1/3 + 0 × 1/6 + 1 × 1/2 = 1/6.
E(Y ) = 0 (by symmetry), and so Var(Y ) = E(Y 2 ) = 0.6.
E(X) = 0.25 and:

Var(X) = E(X 2 ) − (E(X))2 = 0.75 − (0.25)2 = 0.6875.

(Note that Var(X) and Var(Y ) are not strictly necessary here!)
Next:
E(XY ) = ∑_x ∑_y xy p(x, y)

= (−1)(−1)(0.05) + (1)(−1)(0.1) + (−1)(1)(0.1) + (1)(1)(0.15)


= 0.

So:
Cov(X, Y ) = E(XY ) − E(X) E(Y ) = 0 ⇒ Corr(X, Y ) = 0.

(c) X and Y are not independent random variables since, for example:

P (X = 1, Y = −1) = 0.1 ≠ P (X = 1) P (Y = −1) = 0.5 × 0.3 = 0.15.


3. X1 , X2 , . . . , Xn are independent Bernoulli random variables. The probability


function of Xi is given by:
p(xi ) = (1 − πi )^(1−xi ) πi^(xi ) for xi = 0, 1, and 0 otherwise

where:
πi = e^(iθ)/(1 + e^(iθ))
for i = 1, 2, . . . , n. Derive the joint probability function, p(x1 , x2 , . . . , xn ).
Solution:
Since the Xi s are independent (but not identically distributed) random variables,
we have:
p(x1 , x2 , . . . , xn ) = ∏_{i=1}^{n} p(xi ).

So, the joint probability function is:


p(x1 , x2 , . . . , xn ) = ∏_{i=1}^{n} (1/(1 + e^(iθ)))^(1−xi ) (e^(iθ)/(1 + e^(iθ)))^(xi )
                       = ∏_{i=1}^{n} e^(iθxi )/(1 + e^(iθ))
                       = e^(θ ∑_{i=1}^{n} i xi ) / ∏_{i=1}^{n} (1 + e^(iθ)).

4. X1 , X2 , . . . , Xn are independent random variables with the common probability


density function:
f (x) = λ²xe^(−λx) for x ≥ 0, and 0 otherwise.
Derive the joint probability density function, f (x1 , x2 , . . . , xn ).
Solution:
Since the Xi s are independent (and identically distributed) random variables, we have:

f(x1, x2, . . . , xn) = Π_{i=1}^{n} f(xi).

So, the joint probability density function is:

f(x1, x2, . . . , xn) = Π_{i=1}^{n} λ²xi e^(−λxi) = λ^(2n) (Π_{i=1}^{n} xi) e^(−λx1 − λx2 − ··· − λxn) = λ^(2n) (Π_{i=1}^{n} xi) e^(−λ Σ_{i=1}^{n} xi).

5. X1 , X2 , . . . , Xn are independent random variables with the common probability


function:

p(x) = (m!/(x! (m − x)!)) θ^x/(1 + θ)^m   for x = 0, 1, 2, . . . , m

and 0 otherwise. Derive the joint probability function, p(x1, x2, . . . , xn).


Solution:
Since the Xi s are independent (and identically distributed) random variables, we have:

p(x1, x2, . . . , xn) = Π_{i=1}^{n} p(xi).

So, the joint probability function is:

p(x1, x2, . . . , xn) = Π_{i=1}^{n} (m!/(xi! (m − xi)!)) θ^(xi)/(1 + θ)^m = (Π_{i=1}^{n} m!/(xi! (m − xi)!)) θ^(x1) θ^(x2) ··· θ^(xn)/(1 + θ)^(nm) = (Π_{i=1}^{n} m!/(xi! (m − xi)!)) θ^(Σ_{i=1}^{n} xi)/(1 + θ)^(nm).

6. The random variables X1 and X2 are independent and have the common
distribution given in the table below:

X=x 0 1 2 3
pX (x) 0.2 0.4 0.3 0.1

The random variables W and Y are defined by W = max(X1 , X2 ) and


Y = min(X1 , X2 ).
(a) Calculate the table of probabilities which defines the joint distribution of W
and Y .
(b) Find:
i. the marginal distribution of W
ii. the conditional distribution of Y given W = 2
iii. E(Y | W = 2) and Var(Y | W = 2)
iv. Cov(W, Y ).

Solution:
(a) The joint distribution of W and Y is:
W =w
0 1 2 3
0 (0.2)2 2(0.2)(0.4) 2(0.2)(0.3) 2(0.2)(0.1)
Y =y 1 0 (0.4)(0.4) 2(0.4)(0.3) 2(0.4)(0.1)
2 0 0 (0.3)(0.3) 2(0.3)(0.1)
3 0 0 0 (0.1)(0.1)
(0.2)2 (0.8)(0.4) (1.5)(0.3) (1.9)(0.1)
which is:
W =w
0 1 2 3
0 0.04 0.16 0.12 0.04
Y =y 1 0.00 0.16 0.24 0.08
2 0.00 0.00 0.09 0.06
3 0.00 0.00 0.00 0.01
0.04 0.32 0.45 0.19


(b) i. Hence the marginal distribution of W is:


W =w 0 1 2 3
pW (w) 0.04 0.32 0.45 0.19
ii. The conditional distribution of Y | W = 2 is:

Y = y | W = 2         0      1      2     3
pY|W=2 (y | W = 2)   4/15   8/15   2/10   0
                    ≈ 0.267 ≈ 0.533  0.2   0

iii. We have:

E(Y | W = 2) = 0 × 4/15 + 1 × 8/15 + 2 × 2/10 + 3 × 0 = 14/15 ≈ 0.9333

and:

Var(Y | W = 2) = E(Y² | W = 2) − (E(Y | W = 2))² = 4/3 − (14/15)² = 0.4622.

iv. E(W Y ) = 1.69, E(W ) = 1.79 and E(Y ) = 0.81, therefore:

Cov(W, Y ) = E(W Y ) − E(W ) E(Y ) = 1.69 − 1.79 × 0.81 = 0.2401.
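Because W and Y are deterministic functions of the pair (X1, X2), the whole calculation can be checked by brute-force enumeration. The sketch below is an optional Python check of part (b)iv only, not part of the course material.

```python
# Optional numerical check of Worked example 6: W = max(X1, X2), Y = min(X1, X2).
from itertools import product

pX = {0: 0.2, 1: 0.4, 2: 0.3, 3: 0.1}

joint = {}
for (x1, p1), (x2, p2) in product(pX.items(), pX.items()):
    w, y = max(x1, x2), min(x1, x2)
    joint[(w, y)] = joint.get((w, y), 0) + p1 * p2

EW  = sum(w * p for (w, y), p in joint.items())
EY  = sum(y * p for (w, y), p in joint.items())
EWY = sum(w * y * p for (w, y), p in joint.items())
print(EW, EY, EWY - EW * EY)   # 1.79, 0.81 and Cov(W, Y) = 0.2401
```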

7. Consider two random variables X and Y . X can take the values −1, 0 and 1, and
Y can take the values 0, 1 and 2. The joint probabilities for each pair are given by
the following table:
X = −1 X = 0 X = 1
Y =0 0.10 0.20 0.10
Y =1 0.10 0.05 0.10
Y =2 0.10 0.05 0.20

(a) Calculate the marginal distributions and expected values of X and Y .


(b) Calculate the covariance of the random variables U and V , where U = X + Y
and V = X − Y .
(c) Calculate E(V | U = 1).

Solution:
(a) The marginal distribution of X is:
X=x −1 0 1
pX (x) 0.3 0.3 0.4
The marginal distribution of Y is:
Y =y 0 1 2
pY (y) 0.40 0.25 0.35
Hence:
E(X) = −1 × 0.3 + 0 × 0.3 + 1 × 0.4 = 0.1
and:
E(Y ) = 0 × 0.40 + 1 × 0.25 + 2 × 0.35 = 0.95.


(b) We have:

Cov(U, V) = Cov(X + Y, X − Y)
          = E((X + Y)(X − Y)) − E(X + Y) E(X − Y)
          = E(X² − Y²) − (E(X) + E(Y))(E(X) − E(Y)).

E(X²) = ((−1)² × 0.3) + (0² × 0.3) + (1² × 0.4) = 0.7

E(Y²) = (0² × 0.4) + (1² × 0.25) + (2² × 0.35) = 1.65

hence:

Cov(U, V) = (0.7 − 1.65) − (0.1 + 0.95)(0.1 − 0.95) = −0.0575.

(c) U = 1 is achieved for (X, Y ) pairs (−1, 2), (0, 1) or (1, 0). The corresponding
values of V are −3, −1 and 1. We have:

P(U = 1) = 0.1 + 0.05 + 0.1 = 0.25

P(V = −3 | U = 1) = 0.1/0.25 = 2/5
P(V = −1 | U = 1) = 0.05/0.25 = 1/5
P(V = 1 | U = 1) = 0.1/0.25 = 2/5

hence:

E(V | U = 1) = (−3 × 2/5) + (−1 × 1/5) + (1 × 2/5) = −1.
5 5 5

8. Two refills for a ballpoint pen are selected at random from a box containing three
blue refills, two red refills and three green refills. Define the following random
variables:
X = the number of blue refills selected
Y = the number of red refills selected.

(a) Show that P (X = 1, Y = 1) = 3/14.


(b) Form the table showing the joint probability distribution of X and Y .
(c) Calculate E(X), E(Y ) and E(X | Y = 1).
(d) Find the covariance between X and Y .
(e) Are X and Y independent random variables? Give a reason for your answer.


Solution:

(a) With the obvious notation B = blue and R = red:

P(X = 1, Y = 1) = P(BR) + P(RB) = (3/8 × 2/7) + (2/8 × 3/7) = 3/14.

(b) We have:
X=x
0 1 2
0 3/28 9/28 3/28
Y =y 1 3/14 3/14 0
2 1/28 0 0
(c) The marginal distribution of X is:

X = x      0      1      2
pX (x)   10/28  15/28  3/28

Hence:

E(X) = 0 × 10/28 + 1 × 15/28 + 2 × 3/28 = 3/4.

The marginal distribution of Y is:

Y = y      0      1      2
pY (y)   15/28  12/28  1/28

Hence:

E(Y) = 0 × 15/28 + 1 × 12/28 + 2 × 1/28 = 1/2.

The conditional distribution of X given Y = 1 is:

X = x | Y = 1            0     1
pX|Y=1 (x | y = 1)      1/2   1/2

Hence:

E(X | Y = 1) = 0 × 1/2 + 1 × 1/2 = 1/2.

(d) The distribution of XY is:

XY = xy       0      1
pXY (xy)    22/28   6/28

Hence:

E(XY) = 0 × 22/28 + 1 × 6/28 = 3/14

and:

Cov(X, Y) = E(XY) − E(X) E(Y) = 3/14 − (3/4 × 1/2) = −9/56.

(e) Since Cov(X, Y) ≠ 0, a necessary condition for independence fails to hold. The random variables are not independent.
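Since only two refills are drawn without replacement from eight, the joint distribution can also be obtained by enumerating all equally likely ordered draws. The following Python sketch is an optional illustrative check (not part of the course); the helper function and names are my own.

```python
# Optional check of Worked example 8: two refills drawn without replacement
# from 3 blue (B), 2 red (R) and 3 green (G).
from itertools import permutations

box = ['B'] * 3 + ['R'] * 2 + ['G'] * 3
pairs = list(permutations(range(8), 2))      # 56 ordered draws, all equally likely

def prob(event):
    return sum(1 for i, j in pairs if event(box[i], box[j])) / len(pairs)

p11 = prob(lambda a, b: (a, b).count('B') == 1 and (a, b).count('R') == 1)
print(p11)                                   # 3/14 = 0.2142...

EX  = prob(lambda a, b: (a, b).count('B') == 1) + 2 * prob(lambda a, b: (a, b).count('B') == 2)
EY  = prob(lambda a, b: (a, b).count('R') == 1) + 2 * prob(lambda a, b: (a, b).count('R') == 2)
EXY = p11                                    # XY = 1 only when X = Y = 1
print(EX, EY, EXY - EX * EY)                 # 0.75, 0.5 and Cov(X, Y) = -9/56
```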


9. Show that the marginal distributions of a bivariate distribution are not enough to
define the bivariate distribution itself.
Solution:
Here we must show that there are two distinct bivariate distributions with the
same marginal distributions. It is easiest to think of the simplest case where X and
Y each take only two values, say 0 and 1.
Suppose the marginal distributions of X and Y are the same, with
p(0) = p(1) = 0.5. One possible bivariate distribution with these marginal
distributions is the one for which there is independence between X and Y . This has
pX,Y (x, y) = pX (x) pY (y) for all x, y. Writing it in full:

pX,Y (0, 0) = pX,Y (1, 0) = pX,Y (0, 1) = pX,Y (1, 1) = 0.5 × 0.5 = 0.25.

The table of probabilities for this choice of independence is shown in the first table
below.
Trying some other value for pX,Y (0, 0), like 0.2, gives the second table below.

X/Y 0 1 X/Y 0 1
0 0.25 0.25 0 0.2 0.3
1 0.25 0.25 1 0.3 0.2

The construction of these probabilities is done by making sure the row and column
totals are equal to 0.5, and so we now have a second distribution with the same
marginal distributions as the first.
This example is very simple, but one can almost always construct many bivariate
distributions with the same marginal distributions even for continuous random
variables.

10. Show that if:


P(X ≤ x ∩ Y ≤ y) = (1 − e^(−x))(1 − e^(−2y))
for all x, y ≥ 0, then X and Y are independent random variables, each with an
exponential distribution.
Solution:
The right-hand side of the result given is the product of the cdf of an exponential random variable X with mean 1 and the cdf of an exponential random variable Y with parameter 2 (and hence mean 1/2). So the result follows from the definition of independent random variables.

11. There are different ways to write the covariance. Show that:

Cov(X, Y ) = E(XY ) − E(X) E(Y )

and:
Cov(X, Y ) = E((X − E(X))Y ) = E(X(Y − E(Y ))).


Solution:
Working directly from the definition:

Cov(X, Y ) = E((X − E(X))(Y − E(Y )))


= E(XY − X E(Y ) − E(X)Y + E(X) E(Y ))
= E(XY ) + E(−X E(Y )) + E(−E(X)Y ) + E(E(X) E(Y ))
= E(XY ) − E(X) E(Y ) − E(X) E(Y ) + E(X) E(Y )
= E(XY ) − E(X) E(Y ).

For the second part, we begin with the right-hand side:

E((X − E(X))Y ) = E(XY − E(X)Y )


= E(XY ) + E(−E(X)Y )
= E(XY ) − E(X) E(Y )
= Cov(X, Y ).

The remaining result follows by an argument symmetric with the last one.

12. Suppose that Var(X) = Var(Y ) = 1, and that X and Y have correlation coefficient
ρ. Show that it follows from Var(X − ρY ) ≥ 0 that ρ2 ≤ 1.
Solution:
We have:

0 ≤ Var(X − ρY) = Var(X) − 2ρ Cov(X, Y) + ρ² Var(Y) = 1 − 2ρ² + ρ² = 1 − ρ².

Hence 1 − ρ² ≥ 0, and so ρ² ≤ 1.

13. The distribution of a random variable X is:

X=x −1 0 1
P (X = x) a b a

Show that X and X 2 are uncorrelated.


Solution:
This is an example of two random variables X and Y = X 2 which are uncorrelated,
but obviously dependent. The bivariate distribution of (X, Y ) in this case is
singular because of the complete functional dependence between them.
We have:

E(X) = −1 × a + 0 × b + 1 × a = 0
E(X²) = 1 × a + 0 × b + 1 × a = 2a
E(X³) = −1 × a + 0 × b + 1 × a = 0

and we must show that the covariance is zero:

Cov(X, Y) = E(XY) − E(X) E(Y) = E(X³) − E(X) E(X²) = 0 − 0 × 2a = 0.

There are many possible choices for a and b which give a valid probability
distribution, for instance a = 0.25 and b = 0.5.

14. A fair coin is thrown n times, each throw being independent of the ones before. Let
R = ‘the number of heads’, and S = ‘the number of tails’. Find the covariance of R
and S. What is the correlation of R and S?
Solution:
One can go about this in a straightforward way. If Xi is the number of heads and
Yi is the number of tails on the ith throw, then the distribution of Xi and Yi is
given by:

X/Y 0 1
0 0 0.5
1 0.5 0

From this table, we compute the following:

E(Xi ) = E(Yi ) = 0 × 0.5 + 1 × 0.5 = 0.5

E(Xi2 ) = E(Yi2 ) = 0 × 0.5 + 1 × 0.5 = 0.5

Var(Xi ) = Var(Yi ) = 0.5 − (0.5)2 = 0.25

E(Xi Yi ) = 0 × 0.5 + 0 × 0.5 = 0

Cov(Xi , Yi ) = E(Xi Yi ) − E(Xi ) E(Yi ) = 0 − 0.25 = −0.25.


Now, since R = Σ_i Xi and S = Σ_i Yi, we can add covariances of the independent pairs (Xi, Yi), just like means and variances, then:

Cov(R, S) = −0.25n.

Since R + S = n is a fixed quantity, there is a complete linear dependence between


R and S. We have R = n − S, so the correlation between R and S should be −1.
This can be checked directly since:

Var(R) = Var(S) = 0.25n

(add the variances of the Xi s or Yi s). The correlation between R and S works out
as −0.25n/0.25n = −1.
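A quick simulation makes this result very concrete. The Python sketch below is an optional check (not course material); the sample covariance should be close to −0.25n and the sample correlation essentially −1.

```python
# Optional simulation check of Worked example 14 for a fair coin thrown n times.
import numpy as np

rng = np.random.default_rng(seed=1)
n, reps = 20, 100_000
R = rng.binomial(n, 0.5, size=reps)   # number of heads in each experiment
S = n - R                             # number of tails
print(np.cov(R, S)[0, 1])             # close to -0.25 * 20 = -5
print(np.corrcoef(R, S)[0, 1])        # -1 (up to rounding error)
```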

15. Suppose that X and Y have a bivariate distribution. Find the covariance of the
new random variables W = aX + bY and V = cX + dY where a, b, c and d are
constants.


Solution:
The covariance of W and V is:

E(WV) − E(W) E(V) = E(acX² + bdY² + (ad + bc)XY) − (ac E(X)² + bd E(Y)² + (ad + bc) E(X) E(Y))
                  = ac(E(X²) − E(X)²) + bd(E(Y²) − E(Y)²) + (ad + bc)(E(XY) − E(X) E(Y))
                  = ac σX² + bd σY² + (ad + bc) σXY.

16. Following on from Question 15, show that, if the variances of X and Y are the
same, then W = X + Y and V = X − Y are uncorrelated.
Solution:
Here we have a = b = c = 1 and d = −1. Substituting into the formula found above:

σWV = σX² − σY² = 0.

There is no assumption here that X and Y are independent. It is not true that W
and V are independent without further restrictions on X and Y .

E.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix L.

1. (a) For random variables X and Y , show that:

Cov(X + Y, X − Y ) = Var(X) − Var(Y ).

(b) For random variables X and Y , and constants a, b, c and d, show that:

Cov(a + bX, c + dY ) = bd Cov(X, Y ).

2. Let X1 , X2 , . . . , Xk be independent random variables, and a1 , a2 , . . . , ak be


constants. Show that:
(a) E(Σ_{i=1}^{k} ai Xi) = Σ_{i=1}^{k} ai E(Xi)

(b) Var(Σ_{i=1}^{k} ai Xi) = Σ_{i=1}^{k} ai² Var(Xi).

3. X and Y are discrete random variables which can assume values 0, 1 and 2 only.

P (X = x, Y = y) = A(x + y) for some constant A and x, y ∈ {0, 1, 2}.


(a) Draw up a table to describe the joint distribution of X and Y and find the
value of the constant A.
(b) Describe the marginal distributions of X and Y .
(c) Give the conditional distribution of X | Y = 1 and find E(X | Y = 1).
(d) Are X and Y independent? Give a reason for your answer.

Appendix F
Sampling distributions of statistics

F.1 Worked examples


1. Suppose A, B and C are independent chi-squared random variables with 5, 7 and
10 degrees of freedom, respectively. Calculate:
(a) P (B < 12)
(b) P (A + B + C < 14)
(c) P (A − B − C < 0)
(d) P (A3 + B 3 + C 3 < 0).
In this question, you should use the closest value given in the New Cambridge
Statistical Tables or the Dougherty Statistical Tables. Further approximation is not
required.
Solution:

(a) P(B < 12) ≈ 0.90, directly from Table 8, where B ∼ χ²_7.

(b) A + B + C ∼ χ²_{5+7+10} = χ²_22, so P(A + B + C < 14) is the probability that such a random variable is less than 14, which is approximately 0.10 from Table 8.

(c) Transforming and rearranging the probability, we need:

P(A < B + C) = P(A/5 < ((B + C)/17) × (17/5)) = P((A/5)/((B + C)/17) < 3.4) = P(F < 3.4) ≈ 0.975

where F ∼ F_{5, 17}, using Table A.3 of the Dougherty Statistical Tables (practice of which will be covered later in the course¹).

(d) A chi-squared random variable only assumes non-negative values. Hence each of A, B and C is non-negative, so A³ + B³ + C³ ≥ 0, and:

P(A³ + B³ + C³ < 0) = 0.

2. Suppose {Zi }, for i = 1, 2, . . . , k, are independent and identically distributed


standard normal random variables, i.e. Zi ∼ N (0, 1), for i = 1, 2, . . . , k.
¹ Although we have yet to ‘formally’ introduce Table A.3 of the Dougherty Statistical Tables, you should be able to see how this works.


State the distribution of:

(a) Z1²

(b) Z1²/Z2²

(c) Z1/√(Z2²)

(d) Σ_{i=1}^{k} Zi/k

(e) Σ_{i=1}^{k} Zi²

(f) (3/2) × (Z1² + Z2²)/(Z3² + Z4² + Z5²).

Solution:

(a) Z1² ∼ χ²_1

(b) Z1²/Z2² ∼ F_{1, 1}

(c) Z1/√(Z2²) ∼ t_1

(d) Σ_{i=1}^{k} Zi/k ∼ N(0, 1/k)

(e) Σ_{i=1}^{k} Zi² ∼ χ²_k

(f) (3/2) × (Z1² + Z2²)/(Z3² + Z4² + Z5²) ∼ F_{2, 3}.
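Results like (f) can be made tangible by simulation. The sketch below is an optional Python check (not part of the course): it simulates the ratio in (f) and compares its upper 5% point with the tabulated F(2, 3) value.

```python
# Optional simulation check of Worked example 2(f): the ratio
# (3/2)(Z1^2 + Z2^2)/(Z3^2 + Z4^2 + Z5^2) should behave like an F(2, 3) variable.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
Z = rng.standard_normal(size=(100_000, 5))
ratio = 1.5 * (Z[:, :2] ** 2).sum(axis=1) / (Z[:, 2:] ** 2).sum(axis=1)

print(np.quantile(ratio, 0.95))        # close to the table value
print(stats.f.ppf(0.95, 2, 3))         # 9.5521...
```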

3. X1 , X2 , X3 and X4 are independent normally distributed random variables each


with a mean of 0 and a standard deviation of 3. Find:
(a) P (X1 + 2X2 > 9)
(b) P (X12 + X22 > 54)
(c) P ((X12 + X22 ) > 99(X32 + X42 )).

Solution:

(a) We have X1 ∼ N(0, 9) and X2 ∼ N(0, 9). Hence 2X2 ∼ N(0, 36) and X1 + 2X2 ∼ N(0, 45). So:

P(X1 + 2X2 > 9) = P(Z > 9/√45) = P(Z > 1.34) = 0.0901.

(b) We have X1/3 ∼ N(0, 1) and X2/3 ∼ N(0, 1). Hence X1²/9 ∼ χ²_1 and X2²/9 ∼ χ²_1. Therefore, X1²/9 + X2²/9 ∼ χ²_2. So:

P(X1² + X2² > 54) = P(Y > 6) = 0.05

where Y ∼ χ²_2.

(c) We have X1²/9 + X2²/9 ∼ χ²_2 and also X3²/9 + X4²/9 ∼ χ²_2. So:

(X1² + X2²)/(X3² + X4²) = ((X1² + X2²)/18)/((X3² + X4²)/18) ∼ F_{2, 2}.

Hence:

P((X1² + X2²) > 99(X3² + X4²)) = P(Y > 99) = 0.01

where Y ∼ F_{2, 2}.

4. The independent random variables X1 , X2 and X3 are each normally distributed


with a mean of 0 and a variance of 4. Find:
(a) P (X1 > X2 + X3 )
(b) P (X12 > 9.25(X22 + X32 ))
(c) P (X1 > 5(X22 + X32 )1/2 ).

Solution:
(a) We have Xi ∼ N (0, 4), for i = 1, 2, 3, hence:

X1 − X2 − X3 ∼ N (0, 12).

So:
P (X1 > X2 + X3 ) = P (X1 − X2 − X3 > 0) = P (Z > 0) = 0.5.

(b) We have Xi/2 ∼ N(0, 1), so Xi²/4 ∼ χ²_1 for i = 1, 2, 3. Hence:

2X1²/(X2² + X3²) = ((X1²/4)/1)/(((X2² + X3²)/4)/2) ∼ F_{1, 2}.

So:

P(X1² > 9.25(X2² + X3²)) = P(2X1²/(X2² + X3²) > 9.25 × 2) = P(Y > 18.5) = 0.05

where Y ∼ F_{1, 2}.

(c) We have:

P(X1 > 5(X2² + X3²)^(1/2)) = P(X1/2 > 5(X2²/4 + X3²/4)^(1/2)) = P(X1/2 > 5√2 ((X2²/4 + X3²/4)/2)^(1/2))

i.e. P(Y1 > 5√2 √Y2), where Y1 ∼ N(0, 1) and Y2 ∼ χ²_2/2, or P(Y3 > 7.07), where Y3 ∼ t_2. From Table 10 of the New Cambridge Statistical Tables, this is approximately 0.01.


5. The independent random variables X1 , X2 , X3 and X4 are each normally


distributed with a mean of 0 and a variance of 4. Using the New Cambridge
Statistical Tables or the Dougherty Statistical Tables, derive values for k in each of
the following cases:
(a) P(3X1 + 4X2 > 5) = k

(b) P(X1 > k√(X3² + X4²)) = 0.025

(c) P(X1² + X2² + X3² < k) = 0.9

(d) P(X2² + X3² + X4² > 19X1² + 20X3²) = k.

Solution:
(a) We have Xi ∼ N(0, 4), for i = 1, 2, 3, 4, hence 3X1 ∼ N(0, 36) and 4X2 ∼ N(0, 64). Therefore:

(3X1 + 4X2)/10 = Z ∼ N(0, 1).

So, P(3X1 + 4X2 > 5) = k = P(Z > 0.5) = 0.3085.

(b) We have Xi/2 ∼ N(0, 1), for i = 1, 2, 3, 4, hence (X3² + X4²)/4 ∼ χ²_2. So:

P(X1 > k√(X3² + X4²)) = 0.025 = P(T > k√2)

where T ∼ t_2 and hence k√2 = 4.303, so k = 3.04268.

(c) We have (X1² + X2² + X3²)/4 ∼ χ²_3, so:

P(X1² + X2² + X3² < k) = 0.9 = P(X < k/4)

where X ∼ χ²_3. Therefore, k/4 = 6.251. Hence k = 25.004.

(d) P(X2² + X3² + X4² > 19X1² + 20X3²) = k simplifies to:

P(X2² + X4² > 19(X1² + X3²)) = k

and:

(X2² + X4²)/(X1² + X3²) ∼ F_{2, 2}.

So, from Table A.3 of the Dougherty Statistical Tables, k = 0.05.
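If you have access to software, each of these table look-ups can be reproduced directly. The sketch below is an optional check with scipy (the course itself works from statistical tables, so this is purely illustrative).

```python
# Optional check of Worked example 5 with scipy.
from scipy import stats

print(1 - stats.norm.cdf(0.5))               # (a) k = P(Z > 0.5) = 0.3085
print(stats.t.ppf(0.975, 2) / 2 ** 0.5)      # (b) k = t_{0.025,2}/sqrt(2), about 3.043
print(4 * stats.chi2.ppf(0.9, 3))            # (c) k = 4 * 6.251 = 25.004
print(1 - stats.f.cdf(19, 2, 2))             # (d) k = P(F(2,2) > 19) = 0.05
```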

6. Suppose that the heights of students are normally distributed with a mean of 68.5
inches and a standard deviation of 2.7 inches. If 200 random samples of size 25 are
drawn from this population with means recorded to the nearest 0.1 inch, find:
(a) the expected mean and standard deviation of the sampling distribution of the
mean
(b) the expected number of recorded sample means which fall between 67.9 and
69.2 inclusive
(c) the expected number of recorded sample means falling below 67.0.


Solution:

(a) The sampling distribution of the mean of 25 observations has the same mean as the population, which is 68.5 inches. The standard deviation (standard error) of the sample mean is 2.7/√25 = 0.54.
(b) Notice that the samples are random, so we cannot be sure exactly how many
will have means between 67.9 and 69.2 inches. We can work out the probability
that the sample mean will lie in this interval using the sampling distribution:

X̄ ∼ N (68.5, (0.54)2 ).

We need to make a continuity correction, to account for the fact that the
recorded means are rounded to the nearest 0.1 inch. For example, the
probability that the recorded mean is ≥ 67.9 inches is the same as the
probability that the sample mean is > 67.85. Therefore, the probability we
want is:
 
P(67.85 < X̄ < 69.25) = P((67.85 − 68.5)/0.54 < Z < (69.25 − 68.5)/0.54)
                      = P(−1.20 < Z < 1.39)
                      = Φ(1.39) − Φ(−1.20)
                      = 0.9177 − 0.1151
                      = 0.8026.

As usual, the values of Φ(1.39) and Φ(−1.20) can be found from Table 4 of the
New Cambridge Statistical Tables. Since there are 200 independent random
samples drawn, we can now think of each as a single trial. The recorded mean
lies between 67.9 and 69.2 with probability 0.8026 at each trial. We are dealing
with a binomial distribution with n = 200 trials and probability of success
π = 0.8026. The expected number of successes is:

nπ = 200 × 0.8026 = 160.52.

(c) The probability that the recorded mean is < 67.0 inches is:

P(X̄ < 66.95) = P(Z < (66.95 − 68.5)/0.54) = P(Z < −2.87) = Φ(−2.87) = 0.00205

so the expected number of recorded means below 67.0 out of a sample of 200 is:

200 × 0.00205 = 0.41.
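As an optional check (not part of the course), the same two probabilities can be computed without rounding the z-values, using scipy; the small differences from the hand calculation are due only to rounding z to two decimal places.

```python
# Optional numerical check of Worked example 6.
from scipy import stats

se = 2.7 / 25 ** 0.5                                             # standard error = 0.54
p_mid = stats.norm.cdf(69.25, 68.5, se) - stats.norm.cdf(67.85, 68.5, se)
print(p_mid, 200 * p_mid)                                        # about 0.803 and 160.6
p_low = stats.norm.cdf(66.95, 68.5, se)
print(p_low, 200 * p_low)                                        # about 0.0021 and 0.41
```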

7. If Z is a random variable with a standard normal distribution, what is


P (Z 2 < 3.841)?


Solution:
We can compute the probability in two different ways. Working with the standard
normal distribution, we have:
P(Z² < 3.841) = P(−√3.841 < Z < √3.841)
              = P(−1.96 < Z < 1.96)
              = Φ(1.96) − Φ(−1.96)
              = 0.9750 − (1 − 0.9750) = 0.95.

Alternatively, we can use the fact that Z² follows a χ²_1 distribution. From Table 8 of the New Cambridge Statistical Tables we can see that 3.841 is the 5% right-tail value for this distribution, and so P(Z² < 3.841) = 0.95, as before.

8. Suppose that X1 and X2 are independent N (0, 4) random variables. Compute


P (X12 < 36.84 − X22 ).
Solution:
Rearrange the inequality to obtain:

P(X1² < 36.84 − X2²) = P(X1² + X2² < 36.84)
                     = P((X1² + X2²)/4 < 36.84/4)
                     = P((X1/2)² + (X2/2)² < 9.21).

Since X1 /2 and X2 /2 are independent N (0, 1) random variables, the sum of their
squares will follow a χ22 distribution. Using Table 8 of the New Cambridge
Statistical Tables, we see that 9.210 is the 1% right-tail value, so the probability we
are looking for is 0.99.

9. Suppose that X1 , X2 and X3 are independent N (0, 1) random variables, while Y


(independently) follows a χ25 distribution. Compute P (X12 + X22 < 7.236Y − X32 ).
Solution:
Rearranging the inequality gives:

P(X1² + X2² < 7.236Y − X3²) = P(X1² + X2² + X3² < 7.236Y)
                            = P((X1² + X2² + X3²)/Y < 7.236)
                            = P(((X1² + X2² + X3²)/3)/(Y/5) < (5/3) × 7.236)
                            = P(((X1² + X2² + X3²)/3)/(Y/5) < 12.060).

Since X1² + X2² + X3² ∼ χ²_3, we have a ratio of independent χ²_3 and χ²_5 random variables, each divided by its degrees of freedom. By definition, this follows an F_{3, 5} distribution. From Table A.3 of the Dougherty Statistical Tables, we see that 12.06 is the 1% upper-tail value for this distribution, so the probability we want is equal to 0.99.

10. Compare the normal distribution approximation to the exact values for the
upper-tail probabilities for the binomial distribution with 100 trials and probability
of success 0.1.

Solution:
Let R ∼ Bin(100, 0.1) denote the exact number of successes. It has mean and variance:

E(R) = nπ = 100 × 0.1 = 10

and:

Var(R) = nπ(1 − π) = 100 × 0.1 × 0.9 = 9

so we use the approximation R ∼̇ N(10, 9) or, equivalently:

(R − 10)/√9 = (R − 10)/3 ∼̇ N(0, 1).

Applying a continuity correction of 0.5 (for example, 7.8 successes are rounded up to 8) gives:

P(R ≥ r) ≈ P(Z > (r − 0.5 − 10)/3).
The results are summarised in the following table. The first column is the number
of successes; the second gives the exact binomial probabilities; the third column
lists the corresponding z-values (with the continuity correction); and the fourth
gives the probabilities for the normal approximation.
Although the agreement between columns two and four is not too bad, you may
think it is not as close as you would like for some applications.
r P (R ≥ r) z = (r − 0.5 − 10)/3 P (Z > z)
1 0.999973 −3.1667 0.999229
2 0.999678 −2.8333 0.997697
3 0.998055 −2.5000 0.993790
4 0.992164 −2.1667 0.984870
5 0.976289 −1.8333 0.966624
6 0.942423 −1.5000 0.933193
7 0.882844 −1.1667 0.878327
8 0.793949 −0.8333 0.797672
9 0.679126 −0.5000 0.691462
10 0.548710 −0.1667 0.566184
11 0.416844 0.1667 0.433816
12 0.296967 0.5000 0.308538
13 0.198179 0.8333 0.202328
14 0.123877 1.1667 0.121673
15 0.072573 1.5000 0.066807
16 0.039891 1.8333 0.033376

17 0.020599 2.1667 0.015130
18 0.010007 2.5000 0.006210
19 0.004581 2.8333 0.002303
20 0.001979 3.1667 0.000771
21 0.000808 3.5000 0.000233
22 0.000312 3.8333 0.000063
23 0.000114 4.1667 0.000015
24 0.000040 4.5000 0.000003
25 0.000013 4.8333 0.000001
26 0.000004 5.1667 0.000000
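If you want to reproduce a few rows of this table yourself, the following short Python sketch is an optional illustration (not course material); the printed values match the corresponding rows above.

```python
# Optional reproduction of a few rows of the table above with scipy.
from scipy import stats

n, pi = 100, 0.1
for r in (5, 10, 15, 20):
    exact  = 1 - stats.binom.cdf(r - 1, n, pi)          # P(R >= r), exact binomial
    approx = 1 - stats.norm.cdf((r - 0.5 - 10) / 3)     # normal approximation
    print(r, round(exact, 6), round(approx, 6))
```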

F.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix L.

1. (a) Suppose {X1, X2, X3, X4} is a random sample of size n = 4 from the Bernoulli(0.2) distribution. What is the distribution of Σ_{i=1}^{n} Xi in this case?

(b) Write down the sampling distribution of X̄ = Σ_{i=1}^{n} Xi/n for the sample considered in (a). In other words, write down the possible values of X̄ and their probabilities.

Hint: what are the possible values of Σ_i Xi, and their probabilities?
(c) Suppose we have a random sample of size n = 100 from the Bernoulli(0.2)
distribution. What is the approximate sampling distribution of X̄ suggested by
the central limit theorem in this case? Use this distribution to calculate an
approximate value for the probability that X̄ > 0.3. (The true value of this
probability is 0.0061.)

2. Suppose that we plan to take a random sample of size n from a normal distribution
with mean µ and standard deviation σ = 2.
(a) Suppose µ = 4 and n = 20.
i. What is the probability that the mean X̄ of the sample is greater than 5?
ii. What is the probability that X̄ is smaller than 3?
iii. What is P (|X̄ − µ| ≤ 1) in this case?
(b) How large should n be in order that P (|X̄ − µ| ≤ 0.5) ≥ 0.95 for every possible
value of µ?
(c) It is claimed that the true value of µ is 5 in a population. A random sample of
size n = 100 is collected from this population, and the mean for this sample is
x̄ = 5.8. Based on the result in (b), what would you conclude from this value
of X̄?


3. A random sample of 25 audits is to be taken from a company’s total audits, and


the average value of these audits is to be calculated.
(a) Explain what you understand by the sampling distribution of this average and
discuss its relationship to the population mean.
(b) Is it reasonable to assume that this sampling distribution is normal?
(c) If the population of all audits has a mean of £54 and a standard deviation of
£10, find the probability that:
i. the sample mean will be greater than £60
ii. the sample mean will be within 5% of the population mean.

Appendix G
Point estimation

G.1 Worked examples


1. Let X1 and X2 be two independent random variables with the same mean, µ, and the same variance, σ² < ∞. Let µ̂ = aX1 + bX2 be an estimator of µ, where a and b are two non-zero constants.

(a) Identify the condition on a and b to ensure that µ̂ is an unbiased estimator of µ.

(b) Find the minimum mean squared error (MSE) among all unbiased estimators of µ.
Solution:
(a) We have E(µ̂) = E(aX1 + bX2) = a E(X1) + b E(X2) = (a + b)µ. Hence a + b = 1 is the condition for µ̂ to be an unbiased estimator of µ.

(b) Under this condition, noting that b = 1 − a, we have:

MSE(µ̂) = Var(µ̂) = a² Var(X1) + b² Var(X2) = (a² + b²)σ² = (2a² − 2a + 1)σ².

Setting d MSE(µ̂)/da = (4a − 2)σ² = 0, we have a = 0.5, and hence b = 0.5. Therefore, among all unbiased linear estimators, the sample mean (X1 + X2)/2 has the minimum variance.

Remark: Let {X1, X2, . . . , Xn} be a random sample from a population with finite variance. The sample mean X̄ has the minimum variance among all unbiased linear estimators of the form Σ_{i=1}^{n} ai Xi, hence it is the best linear unbiased estimator (known by the acronym BLUE(!)).

2. Let {X1 , X2 , . . . , Xn } be a random sample from the (continuous) uniform


distribution such that X ∼ Uniform[0, θ], where θ > 0. Find the method of
moments estimator (MME) of θ.
Solution:
The pdf of Xi is:

f(xi; θ) = 1/θ   for 0 ≤ xi ≤ θ, and 0 otherwise.

Therefore:

E(Xi) = ∫_0^θ xi (1/θ) dxi = (1/θ)[xi²/2]_0^θ = θ/2.

Therefore, setting µ̂1 = M1, we have:

θ̂/2 = X̄  ⇒  θ̂ = 2X̄ = 2 Σ_{i=1}^{n} Xi/n.

3. Let X ∼ Bin(n, π), where n is known. Find the method of moments estimator
(MME) of π.
Solution:
The pf of the binomial distribution is:

P(X = x) = (n!/(x! (n − x)!)) π^x (1 − π)^(n−x)   for x = 0, 1, 2, . . . , n

and 0 otherwise. Therefore:

E(X) = Σ_{x=0}^{n} x P(X = x) = Σ_{x=1}^{n} x (n!/(x! (n − x)!)) π^x (1 − π)^(n−x) = Σ_{x=1}^{n} (n!/((x − 1)! (n − x)!)) π^x (1 − π)^(n−x).

Let m = n − 1 and write j = x − 1, then (n − x) = (m − j), and:

E(X) = Σ_{j=0}^{m} (n m!/(j! (m − j)!)) π π^j (1 − π)^(m−j) = nπ Σ_{j=0}^{m} (m!/(j! (m − j)!)) π^j (1 − π)^(m−j).

Therefore, E(X) = nπ, and hence π̂ = X/n.

4. Let {X1 , X2 , . . . , Xn } be a random sample from the distribution with pdf:


f(x) = λ exp(−λ(x − a))   for x ≥ a, and 0 otherwise

where λ > 0. Find the method of moments estimators (MMEs) of λ and a.


Solution:
We have:

E(X) = ∫_a^∞ x λ exp(−λ(x − a)) dx = (1/λ) ∫_0^∞ (y + λa)e^(−y) dy = 1/λ + a

and:

E(X²) = ∫_a^∞ x² λ exp(−λ(x − a)) dx = ∫_0^∞ (y/λ + a)² e^(−y) dy = 2/λ² + 2a/λ + a².

Therefore, the MMEs are the solutions to the equations:

X̄ = 1/λ̂ + â   and   (1/n) Σ_{i=1}^{n} Xi² = 2/λ̂² + 2â/λ̂ + â².

Actually, the explicit solutions may be obtained as follows:

(1/n) Σ_{i=1}^{n} Xi² − X̄² = 2/λ̂² + 2â/λ̂ + â² − (1/λ̂ + â)² = 1/λ̂².

Hence:

λ̂ = ((1/n) Σ_{i=1}^{n} Xi² − X̄²)^(−1/2) = ((1/n) Σ_{i=1}^{n} (Xi − X̄)²)^(−1/2).

Consequently:

â = X̄ − 1/λ̂.

5. Let {X1 , X2 , . . . , Xn } be a random sample from the distribution N (µ, 1). Find the
maximum likelihood estimator (MLE) of µ.
Solution:
The joint pdf of the observations is:

f(x1, x2, . . . , xn; µ) = Π_{i=1}^{n} (1/√(2π)) exp(−(xi − µ)²/2) = (2π)^(−n/2) exp(−(1/2) Σ_{i=1}^{n} (xi − µ)²).

We write the above as a function of µ only:

L(µ) = C exp(−(1/2) Σ_{i=1}^{n} (Xi − µ)²)

where C > 0 is a constant. The MLE µ̂ maximises this function, and also maximises the function:

l(µ) = ln L(µ) = −(1/2) Σ_{i=1}^{n} (Xi − µ)² + ln(C).

Therefore, the MLE effectively minimises Σ_{i=1}^{n} (Xi − µ)², i.e. the MLE is also the least squares estimator (LSE), i.e. µ̂ = X̄.

6. Let {X1 , X2 , . . . , Xn } be a random sample from a Poisson distribution with mean


λ > 0. Find the maximum likelihood estimator (MLE) of λ.
Solution:
The probability function is:

P(X = x) = e^(−λ) λ^x/x!.

The likelihood and log-likelihood functions are, respectively:

L(λ) = Π_{i=1}^{n} e^(−λ) λ^(Xi)/Xi! = e^(−nλ) λ^(nX̄)/Π_{i=1}^{n} Xi!

and:

l(λ) = ln L(λ) = nX̄ ln(λ) − nλ + C = n(X̄ ln(λ) − λ) + C

where C is a constant (i.e. it may depend on the Xi s but cannot depend on the parameter). Setting:

(d/dλ) l(λ) = n(X̄/λ̂ − 1) = 0

we obtain the MLE λ̂ = X̄, which is also the MME.

7. Let {X1 , X2 , . . . , Xn } be a random sample from the (continuous) uniform


distribution Uniform[0, θ], where θ > 0 is unknown.
(a) Find the maximum likelihood estimator (MLE) of θ.
(b) If n = 3, x1 = 0.2, x2 = 3.6 and x3 = 1.1, what is the maximum likelihood
estimate of θ?

Solution:
(a) The pdf of Uniform[0, θ] is:

f(x; θ) = 1/θ   for 0 ≤ x ≤ θ, and 0 otherwise.

The joint pdf is:

f(x1, x2, . . . , xn; θ) = θ^(−n)   for 0 ≤ x1, x2, . . . , xn ≤ θ, and 0 otherwise.

In fact f(x1, x2, . . . , xn; θ), as a function of θ, is the likelihood function, L(θ). The maximum likelihood estimator of θ is the value at which the likelihood function L(θ) achieves its maximum. Note:

L(θ) = θ^(−n)   for X(n) ≤ θ, and 0 otherwise

where:

X(n) = max_i Xi.

Hence the MLE is θ̂ = X(n), which is different from the MME: L(θ) is zero for θ < x(n) and strictly decreasing for θ ≥ x(n), so it is maximised at the sample maximum.


(b) For the given data, the maximum observation is x(3) = 3.6. Therefore, the
maximum likelihood estimate is θb = 3.6.

8. Use the observed random sample x1 = 8.2, x2 = 10.6, x3 = 9.1 and x4 = 4.9 to
calculate the maximum likelihood estimate of λ in the exponential pdf:
f(x; λ) = λe^(−λx)   for x ≥ 0, and 0 otherwise.

Solution:
We derive a general formula with a random sample {X1, X2, . . . , Xn} first. The joint pdf is:

f(x1, x2, . . . , xn; λ) = λ^n e^(−λnx̄)   for x1, x2, . . . , xn ≥ 0, and 0 otherwise.

With all xi ≥ 0, L(λ) = λ^n e^(−λnX̄), hence the log-likelihood function is:

l(λ) = ln L(λ) = n ln(λ) − λnX̄.

Setting:

(d/dλ) l(λ) = n/λ̂ − nX̄ = 0  ⇒  λ̂ = 1/X̄.

For the given sample, x̄ = (8.2 + 10.6 + 9.1 + 4.9)/4 = 8.2. Therefore, λ̂ = 0.1220.
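The closed-form answer λ̂ = 1/x̄ can also be confirmed by maximising the log-likelihood numerically. The following Python sketch is an optional illustration only (scipy is not part of the course, and the function names are my own).

```python
# Optional numerical check of Worked example 8: maximise the exponential
# log-likelihood directly and compare with lambda-hat = 1/x-bar.
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([8.2, 10.6, 9.1, 4.9])
neg_loglik = lambda lam: -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_loglik, bounds=(1e-6, 10), method='bounded')
print(res.x, 1 / x.mean())     # both approximately 0.1220
```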

9. The following data show the number of occupants in passenger cars observed
during one hour at a busy junction. It is assumed that these data follow a
geometric distribution with pf:
p(x; π) = (1 − π)^(x−1) π   for x = 1, 2, . . ., and 0 otherwise.

Number of occupants 1 2 3 4 5 ≥6 Total


Frequency 678 227 56 28 8 14 1,011

Find the maximum likelihood estimate of π.


Solution:
The sample size is n = 1,011. If we know all the 1,011 observations, the joint probability function for x1, x2, . . . , x1,011 is:

L(π) = Π_{i=1}^{1,011} p(xi; π).

However, we only know that there are 678 xi s equal to 1, 227 xi s equal to 2, . . ., and 14 xi s equal to some integers not smaller than 6.
Note that:

P(Xi ≥ 6) = Σ_{x=6}^{∞} p(x; π) = π(1 − π)^5 (1 + (1 − π) + (1 − π)² + ···) = π(1 − π)^5 × 1/π = (1 − π)^5.

Hence we may only use:

L(π) = p(1; π)^678 p(2; π)^227 p(3; π)^56 p(4; π)^28 p(5; π)^8 ((1 − π)^5)^14
     = π^(1,011−14) (1 − π)^(227+56×2+28×3+8×4+14×5)
     = π^997 (1 − π)^525

hence:

l(π) = ln L(π) = 997 ln(π) + 525 ln(1 − π).

Setting:

(d/dπ) l(π) = 997/π̂ − 525/(1 − π̂) = 0  ⇒  π̂ = 997/(997 + 525) = 0.655.

Remark: Since P(Xi = 1) = π, π̂ = 0.655 indicates that about 2/3 of cars have only one occupant. Note E(Xi) = 1/π. In order to ensure that the average number of occupants is not smaller than k, we require π < 1/k.
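As an optional check (not part of the course), the grouped-data log-likelihood derived above can be maximised numerically; the result agrees with the closed-form solution 997/(997 + 525).

```python
# Optional numerical check of Worked example 9: maximise
# l(pi) = 997 ln(pi) + 525 ln(1 - pi) directly.
import numpy as np
from scipy.optimize import minimize_scalar

neg_loglik = lambda p: -(997 * np.log(p) + 525 * np.log(1 - p))
res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method='bounded')
print(res.x, 997 / (997 + 525))    # both approximately 0.655
```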

10. Let {X1 , X2 , . . . , Xn }, where n > 2, be a random sample from an unknown


population with mean θ and variance σ 2 . We want to choose between two
estimators of θ, θb1 = X̄ and θb2 = (X1 + X2 )/2. Which is the better estimator of θ?
Solution:
Let us consider the bias first. The estimator θb1 is just the sample mean, so we know
that it is unbiased. The estimator θb2 has expectation:
 
X 1 + X 2 E(X1 ) + E(X2 ) θ+θ
E(θb2 ) = E = = =θ
2 2 2
so it is also an unbiased estimator of θ.
Next, we consider the variances of the two estimators. We have:
σ2
Var(θ1 ) = Var(X̄) =
b
n
and:
σ2 + σ2 σ2
 
X1 + X2 Var(X1 ) + Var(X2 )
Var(θb2 ) = Var = = = .
2 4 4 2

Since n > 2, we can see that θb1 has a lower variance than θb2 , so it is a better
estimator. Unsurprisingly, we obtain a better estimator of θ by considering the
whole sample, rather than just the first two values.


11. Show that the MSE of an estimator θ̂ can be written as:

MSE(θ̂) = Var(θ̂) + (Bias(θ̂))².

Solution:
We need to introduce the term E(θ̂) inside the expectation, so we add and subtract it to obtain:

MSE(θ̂) = E((θ̂ − θ)²)
        = E(((θ̂ − E(θ̂)) − (θ − E(θ̂)))²)
        = E((θ̂ − E(θ̂))² − 2(θ̂ − E(θ̂))(θ − E(θ̂)) + (θ − E(θ̂))²)
        = E((θ̂ − E(θ̂))²) − 2 E((θ̂ − E(θ̂))(θ − E(θ̂))) + E((θ − E(θ̂))²).

The first term in this expression is, by definition, the variance of θ̂. The final term is:

E((θ − E(θ̂))²) = (θ − E(θ̂))² = (E(θ̂) − θ)² = (Bias(θ̂))²

because θ and E(θ̂) are both constants, and are not affected by the expectation operator. It remains to be shown that the middle term is equal to zero. We have:

E((θ̂ − E(θ̂))(θ − E(θ̂))) = (θ − E(θ̂)) E(θ̂ − E(θ̂)) = (θ − E(θ̂))(E(θ̂) − E(θ̂)) = 0

which concludes our proof.

12. Find the MSEs of the estimators in Question 10.


Solution:
The MSEs are:

MSE(θ̂1) = Var(θ̂1) + (Bias(θ̂1))² = σ²/n + 0 = σ²/n

and:

MSE(θ̂2) = Var(θ̂2) + (Bias(θ̂2))² = σ²/2 + 0 = σ²/2.
Note that the MSE of an unbiased estimator is equal to its variance.

13. Are the estimators in Question 10 (mean-square) consistent?


Solution:
The estimator θ̂1 has MSE equal to σ²/n, which converges to 0 as n → ∞. The estimator θ̂2 has MSE equal to σ²/2, which stays constant as n → ∞. Therefore, θ̂1 is a (mean-square) consistent estimator of θ, whereas θ̂2 is not.


14. Suppose that we have a random sample {X1 , X2 , . . . , Xn } from a Uniform[−θ, θ]


distribution. Find the method of moments estimator of θ.
Solution:
The mean of the Uniform[a, b] distribution is (a + b)/2. In our case, this gives
E(X) = (−θ + θ)/2 = 0. The first population moment does not depend on θ, so we
need to move to the next (i.e. second) population moment.
Recall that the variance of the Uniform[a, b] distribution is (b − a)²/12. Hence the second population moment is:

E(X²) = Var(X) + E(X)² = (θ − (−θ))²/12 + 0² = θ²/3.

We set this equal to the second sample moment to obtain:

(1/n) Σ_{i=1}^{n} Xi² = θ̂²/3.

Therefore, the method of moments estimator of θ is:

θ̂MM = √((3/n) Σ_{i=1}^{n} Xi²).

15. Let {X1 , X2 , . . . , Xn } be a random sample from a Bin(m, π) distribution, with both
m and π unknown. Find the method of moments estimators of m, the number of
trials, and π, the probability of success.
Solution:
There are two unknown parameters, so we need two equations. The expectation and variance of a Bin(m, π) distribution are mπ and mπ(1 − π), respectively, so we have:

µ1 = E(X) = mπ

and:

µ2 = Var(X) + E(X)² = mπ(1 − π) + (mπ)².

Setting the first two sample and population moments equal gives:

(1/n) Σ_{i=1}^{n} Xi = m̂π̂   and   (1/n) Σ_{i=1}^{n} Xi² = m̂π̂(1 − π̂) + (m̂π̂)².

The two equations need to be solved simultaneously. Solving the first equation for π̂ gives:

π̂ = (Σ_{i=1}^{n} Xi/n)/m̂ = X̄/m̂.

Now we can substitute π̂ into the second moment equation to obtain:

(1/n) Σ_{i=1}^{n} Xi² = m̂ (X̄/m̂)(1 − X̄/m̂) + (m̂ X̄/m̂)²

which we now solve for m̂ to find the method of moments estimator:

m̂MM = X̄²/(X̄² − ((1/n) Σ_{i=1}^{n} Xi² − X̄)).

16. Consider again the Uniform[−θ, θ] distribution from Question 14. Suppose that we
observe the following data:

1.8, 0.7, −0.2, −1.8, 2.8, 0.6, −1.3 and − 0.1.

Estimate θ using the method of moments.


Solution:
The point estimate is: v
u 8
u3 X
θbM M =t x2 ≈ 2.518
8 i=1 i

which implies that the data came from a Uniform[−2.518, 2.518] distribution.
However, this clearly cannot be true since the observation x5 = 2.8 falls outside this
range! The method of moments does not take into account that all of the
observations need to lie in the interval [−θ, θ], and so it fails to produce a useful
estimate.
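The failure is easy to see numerically. The short Python sketch below is an optional illustration only (not part of the course): it computes the method of moments estimate for these data and compares it with the largest observation in absolute value.

```python
# Optional check of the point estimate in Worked example 16.
import numpy as np

x = np.array([1.8, 0.7, -0.2, -1.8, 2.8, 0.6, -1.3, -0.1])
theta_mm = np.sqrt(3 * np.mean(x ** 2))
print(theta_mm)        # about 2.518
print(np.abs(x).max()) # 2.8, which lies outside [-2.518, 2.518]
```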

17. Let {X1 , X2 , . . . , Xn } be a random sample from an Exp(λ) distribution. Find the
MLE of λ.
Solution:
The likelihood function is:

L(λ) = Π_{i=1}^{n} f(xi; λ) = Π_{i=1}^{n} λe^(−λXi) = λ^n e^(−λ Σ_i Xi) = λ^n e^(−λnX̄)

so the log-likelihood function is:

l(λ) = ln(λ^n e^(−λnX̄)) = n ln(λ) − λnX̄.

Differentiating and setting equal to zero gives:

(d/dλ) l(λ) = n/λ̂ − nX̄ = 0  ⇒  λ̂ = 1/X̄.

The second derivative of the log-likelihood function is:

(d²/dλ²) l(λ) = −n/λ²

which is always negative, hence the MLE λ̂ = 1/X̄ is indeed a maximum. This happens to be the same as the method of moments estimator of λ.


18. Let {X1 , X2 , . . . , Xn } be a random sample from a N (µ, σ 2 ) distribution. Find the
MLE of σ 2 if:
(a) µ is known
(b) µ is unknown.
In each case, work out if the MLE is an unbiased estimator of σ 2 .
Solution:
The likelihood function is:

L(µ, σ²) = Π_{i=1}^{n} f(xi; µ, σ²) = Π_{i=1}^{n} (1/√(2πσ²)) exp(−(Xi − µ)²/(2σ²)) = (2πσ²)^(−n/2) exp(−(1/(2σ²)) Σ_{i=1}^{n} (Xi − µ)²)

so the log-likelihood function is:

l(µ, σ²) = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{i=1}^{n} (Xi − µ)².

Differentiating with respect to σ² and setting the derivative equal to zero gives:

(d/dσ²) l(µ, σ²) = −n/(2σ̂²) + (1/(2σ̂⁴)) Σ_{i=1}^{n} (Xi − µ)² = 0.

If µ is known, we can solve this equation for σ̂²:

n/(2σ̂²) = (1/(2σ̂⁴)) Σ_{i=1}^{n} (Xi − µ)²  ⇒  (n/2)σ̂² = (1/2) Σ_{i=1}^{n} (Xi − µ)²  ⇒  σ̂² = (1/n) Σ_{i=1}^{n} (Xi − µ)².

The second derivative is always negative, so we conclude that the MLE:

σ̂² = (1/n) Σ_{i=1}^{n} (Xi − µ)²

is indeed a maximum. We can work out the bias of this estimator directly:

E(σ̂²) = E((1/n) Σ_{i=1}^{n} (Xi − µ)²) = σ² E((1/n) Σ_{i=1}^{n} (Xi − µ)²/σ²) = (σ²/n) Σ_{i=1}^{n} E(((Xi − µ)/σ)²) = (σ²/n) Σ_{i=1}^{n} E(Zi²) = (σ²/n) × n = σ²

where Zi = (Xi − µ)/σ, for i = 1, 2, . . . , n. Therefore, the MLE of σ² is an unbiased estimator in this case.

If µ is unknown, we also need to maximise the likelihood function with respect to µ. Here, we consider an alternative method. The likelihood function is:

L(µ, σ²) = (2πσ²)^(−n/2) exp(−(1/(2σ²)) Σ_{i=1}^{n} (Xi − µ)²)

so, whatever the value of σ², we need to ensure that Σ_{i=1}^{n} (Xi − µ)² is minimised. However, we have:

Σ_{i=1}^{n} (Xi − µ)² = Σ_{i=1}^{n} (Xi − X̄)² + n(X̄ − µ)².

Only the second term on the right-hand side depends on µ and, because of the square, its minimum value is zero. It is minimised when µ is equal to the sample mean, so this is the MLE of µ:

µ̂ = X̄.

The resulting MLE of σ² is:

σ̂² = (1/n) Σ_{i=1}^{n} (Xi − X̄)².

This is not the same as the sample variance S², where we divide by n − 1 instead of n. The expectation of the MLE of σ² is:

E(σ̂²) = E((1/n) Σ_{i=1}^{n} (Xi − X̄)²) = (1/n) E((n − 1) × (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²) = (1/n) E((n − 1)S²) = (σ²/n) E((n − 1)S²/σ²).

The term inside the expectation, (n − 1)S²/σ², follows a χ²_{n−1} distribution, and so:

E(σ̂²) = (σ²/n)(n − 1).

This is not equal to σ², so the MLE of σ² is a biased estimator in this case. (Note that the estimator S² is an unbiased estimator of σ².) The bias of the MLE is:

Bias(σ̂²) = E(σ̂²) − σ² = (σ²/n)(n − 1) − σ² = −σ²/n

which tends to zero as n → ∞. In such cases, we say that the estimator is asymptotically unbiased.


G.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix L.

1. Based on a random sample of two independent observations from a population with


mean µ and standard deviation σ, consider two estimators of µ, X and Y , defined
as:
X1 X2 X1 2X2
X= + and Y = + .
2 2 3 3
Are X and Y unbiased estimators of µ?

2. Prove that, for normally distributed data, S 2 is an unbiased estimator of σ 2 , but


that S is a biased estimator of σ.
Hint: if X̄ is the sample mean for a random sample of size n, the fact that the
observations {X1 , X2 , . . . , Xn } are independent can be used to prove that (in the
standard notation):
σ2
E(X̄ 2 ) = µ2 + .
n

3. A random sample of n independent Bernoulli trials with success probability π


results in R successes. Derive an unbiased estimator of π(1 − π).

4. Given a random sample of n values from a normal distribution with unknown mean
2
and variance, consider the
P following2 two estimators of σ (the unknown population
variance), where Sxx = (Xi − X̄) :

Sxx Sxx
T1 = and T2 = .
n−1 n
For each of these determine its bias, its variance and its mean squared error. Which
has the smaller mean squared error?
Hint: use the fact that Var(S 2 ) = 2σ 4 /(n − 1) for a random sample of size n, or
some equivalent formula.

5. Suppose that you are given observations y1 , y2 , y3 and y4 such that:

y 1 = α + β + ε1

y2 = −α + β + ε2

y 3 = α − β + ε3

y4 = −α − β + ε4 .

The random variables εi , for i = 1, 2, 3, 4, are independent and normally distributed


with mean 0 and variance σ 2 .


(a) Find the least squares estimators of the parameters α and β.


(b) Verify that the least squares estimators in (a) are unbiased estimators of their
respective parameters.
(c) Find the variance of the least squares estimator of α.

Appendix H
Interval estimation

H.1 Worked examples


1. (a) Find the length of a 95% confidence interval for the mean of a normal
distribution with known variance σ 2 .
(b) Find the minimum sample size such that the width of a 95% confidence
interval is not wider than d, where d > 0 is a prescribed constant.

Solution:
(a) With an available random sample {X1, X2, . . . , Xn} from the normal distribution N(µ, σ²) with σ² known, a 95% confidence interval for µ is of the form:

(X̄ − 1.96 × σ/√n, X̄ + 1.96 × σ/√n).

Hence the width of the confidence interval is:

(X̄ + 1.96 × σ/√n) − (X̄ − 1.96 × σ/√n) = 2 × 1.96 × σ/√n = 3.92 × σ/√n.

(b) Let 3.92 × σ/√n ≤ d, and so we obtain the condition for the required sample size:

n ≥ (3.92 × σ/d)² = 15.37 × σ²/d².

Therefore, in order to achieve the required accuracy, the sample size n should be at least as large as 15.37 × σ²/d².
Note that as the variance σ² increases, the confidence interval width increases, and as the sample size n increases, the confidence interval width decreases. Also, note that when σ² is unknown, the width of a confidence interval for µ depends on S. Therefore, the width is a random variable.

2. The data below are from a random sample of size n = 9 taken from the distribution
N (µ, σ 2 ):
3.75, 5.67, 3.14, 7.89, 3.40, 9.32, 2.80, 10.34 and 14.31.

(a) Assume σ 2 = 16. Find a 95% confidence interval for µ. If the width of such a
confidence interval must not exceed 2.5, at least how many observations do we
need?
(b) Suppose σ 2 is now unknown. Find a 95% confidence interval for µ. Compare
the result with that obtained in (a) and comment.
(c) Obtain a 95% confidence interval for σ 2 .


Solution:
(a) We have x̄ = 6.74. For a 95% confidence interval, α = 0.05 so we need to find the top 100α/2 = 2.5th percentile of N(0, 1), which is 1.96. Since σ = 4 and n = 9, a 95% confidence interval for µ is:

x̄ ± 1.96 × σ/√n  ⇒  (6.74 − 1.96 × 4/3, 6.74 + 1.96 × 4/3) = (4.13, 9.35).

In general, a 100(1 − α)% confidence interval for µ is:

(X̄ − z_{α/2} × σ/√n, X̄ + z_{α/2} × σ/√n)

where z_α denotes the top 100αth percentile of the standard normal distribution, i.e. such that:

P(Z > z_α) = α

where Z ∼ N(0, 1). Hence the width of the confidence interval is:

2 × z_{α/2} × σ/√n.

For this example, α = 0.05, z_{0.025} = 1.96 and σ = 4. Setting the width of the confidence interval to be at most 2.5, we have:

2 × 1.96 × σ/√n = 15.68/√n ≤ 2.5.

Hence:

n ≥ (15.68/2.5)² = 39.34.

So we need a sample of at least 40 observations in order to obtain a 95% confidence interval with a width not greater than 2.5.

(b) When σ² is unknown, a 95% confidence interval for µ is:

(X̄ − t_{α/2, n−1} × S/√n, X̄ + t_{α/2, n−1} × S/√n)

where S² = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1), and t_{α, k} denotes the top 100αth percentile of the Student's t_k distribution, i.e. such that:

P(T > t_{α, k}) = α

for T ∼ t_k. For this example, s² = 16, s = 4, n = 9 and t_{0.025, 8} = 2.306. Hence a 95% confidence interval for µ is:

6.74 ± 2.306 × 4/3  ⇒  (3.67, 9.81).
This confidence interval is much wider than the one obtained in (a). Since we
do not know σ 2 , we have less information available for our estimation. It is
only natural that our estimation becomes less accurate.
Note that although the sample size is n, the Student’s t distribution used has
only n − 1 degrees of freedom. The loss of 1 degree of freedom in the sample
variance is due to not knowing µ. Hence we estimate µ using the data, for
which we effectively pay a ‘price’ of one degree of freedom.


(c) Note (n − 1)S²/σ² ∼ χ²_{n−1} = χ²_8. From Table 8 of the New Cambridge Statistical Tables, for X ∼ χ²_8, we find that:

P(X < 2.180) = P(X > 17.53) = 0.025.

Hence:

P(2.180 < 8 × S²/σ² < 17.53) = 0.95.

Therefore, the lower bound for σ² is 8 × s²/17.53, and the upper bound is 8 × s²/2.180. Therefore, a 95% confidence interval for σ², noting s² = 16, is:

(7.302, 58.716).
Note that the estimation in this example is rather inaccurate. This is due to
two reasons.
i. The sample size is small.
ii. The population variance, σ 2 , is large.
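For readers with access to software, the three intervals in this example can be reproduced as follows. This Python sketch is an optional check only (the course uses statistical tables); the scipy calls and variable names are my own choice.

```python
# Optional check of the three confidence intervals in Worked example 2.
import numpy as np
from scipy import stats

x = np.array([3.75, 5.67, 3.14, 7.89, 3.40, 9.32, 2.80, 10.34, 14.31])
n, xbar, s2 = len(x), x.mean(), x.var(ddof=1)

# (a) sigma^2 = 16 known
print(xbar - 1.96 * 4 / 3, xbar + 1.96 * 4 / 3)                   # about (4.13, 9.35)

# (b) sigma^2 unknown: t distribution with n - 1 = 8 degrees of freedom
t = stats.t.ppf(0.975, n - 1)
print(xbar - t * np.sqrt(s2 / n), xbar + t * np.sqrt(s2 / n))     # about (3.67, 9.81)

# (c) interval for sigma^2 based on (n-1)S^2/sigma^2 ~ chi-squared_8
print((n - 1) * s2 / stats.chi2.ppf(0.975, n - 1),
      (n - 1) * s2 / stats.chi2.ppf(0.025, n - 1))                # about (7.30, 58.7)
```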

3. Assume that the random variable X is normally distributed and that σ 2 is known.
What confidence level would be associated with each of the following intervals?
(a) (x̄ − 1.645 × σ/√n, x̄ + 2.326 × σ/√n).

(b) (−∞, x̄ + 2.576 × σ/√n).

(c) (x̄ − 1.645 × σ/√n, x̄).

Solution:
We have X̄ ∼ N(µ, σ²/n), hence √n(X̄ − µ)/σ ∼ N(0, 1).

(a) P(−1.645 < Z < 2.326) = 0.94, hence a 94% confidence level.
(b) P(−∞ < Z < 2.576) = 0.995, hence a 99.5% confidence level.
(c) P(−1.645 < Z < 0) = 0.45, hence a 45% confidence level.

4. Five independent samples, each of size n, are to be drawn from a normal


distribution where σ² is known. For each sample, the interval:

(x̄ − 0.96 × σ/√n, x̄ + 1.06 × σ/√n)

will be constructed. What is the probability that at least four of the intervals will contain the unknown µ?

Solution:
The probability that the given interval will contain µ is:

P(−0.96 < Z < 1.06) = 0.6869.

The probability of four or five such intervals is binomial with n = 5 and π = 0.6869, so let the number of such intervals be Y ∼ Bin(5, 0.6869). The required probability is:

P(Y ≥ 4) = 5 × (0.6869)⁴ × (0.3131) + (0.6869)⁵ = 0.5014.


5. A personnel manager has found that historically the scores on aptitude tests given
to applicants for entry-level positions are normally distributed with σ = 32.4
points. A random sample of nine test scores from the current group of applicants
had a mean score of 187.9 points.
(a) Find an 80% confidence interval for the population mean score of the current
group of applicants.
(b) Based on these sample results, a statistician found for the population mean a
confidence interval extending from 165.8 to 210.0 points. Find the confidence
level of this interval.

Solution:
(a) We have n = 9, x̄ = 187.9, σ = 32.4 and 1 − α = 0.80, hence α/2 = 0.10 and, from Table 5 of the New Cambridge Statistical Tables, P(Z > 1.2816) = 1 − Φ(1.2816) = 0.10. So an 80% confidence interval is:

187.9 ± 1.2816 × 32.4/√9  ⇒  (174.06, 201.74).

(b) The half-width of the confidence interval is 210.0 − 187.9 = 22.1, which is equal to the margin of error, i.e. we have:

22.1 = k × σ/√n = k × 32.4/√9  ⇒  k = 2.05.

P(Z > 2.05) = 1 − Φ(2.05) = 0.02018 = α/2  ⇒  α = 0.04036. Hence we have a 100(1 − α)% = 100(1 − 0.04036)% ≈ 96% confidence interval.

6. A manufacturer is concerned about the variability of the levels of impurity


contained in consignments of raw materials from a supplier. A random sample of 10
consignments showed a standard deviation of 2.36 in the concentration of impurity
levels. Assume normality.
(a) Find a 95% confidence interval for the population variance.
(b) Would a 99% confidence interval for this variance be wider or narrower than
that found in (a)?

Solution:
(a) We have n = 10, s² = (2.36)² = 5.5696, χ²_{0.975, 9} = 2.700 and χ²_{0.025, 9} = 19.02. Hence a 95% confidence interval for σ² is:

((n − 1)s²/χ²_{0.025, n−1}, (n − 1)s²/χ²_{0.975, n−1}) = (9 × 5.5696/19.02, 9 × 5.5696/2.700) = (2.64, 18.57).

(b) A 99% confidence interval would be wider since:

χ²_{0.995, n−1} < χ²_{0.975, n−1}   and   χ²_{0.005, n−1} > χ²_{0.025, n−1}.


7. Why do we not always choose a very high confidence level for a confidence interval?

Solution:
We do not always want to use a very high confidence level because the confidence
interval would be very wide. We have a trade-off between the width of the
confidence interval and the coverage probability.

8. Suppose that 9 bags of sugar are selected from the supermarket shelf at random
and weighed. The weights in grammes are 812.0, 786.7, 794.1, 791.6, 811.1, 797.4,
797.8, 800.8 and 793.2. Construct a 95% confidence interval for the mean weight of
all the bags on the shelf. Assume the population is normal.

Solution:
Here we have a random sample of size n = 9. The mean is 798.30. The sample
variance is s2 = 72.76, which gives a sample standard deviation s = 8.53. From
Table 10 of the New Cambridge Statistical Tables, the top 2.5th percentile of the t
distribution with n − 1 = 8 degrees of freedom is 2.306. Therefore, a 95%
confidence interval is:

(798.30 − 2.306 × 8.53/√9, 798.30 + 2.306 × 8.53/√9) = (798.30 − 6.56, 798.30 + 6.56) = (791.74, 804.86).

It is sometimes more useful to write this as 798.30 ± 6.56.

9. Continuing Question 8, suppose we are now told that σ, the population standard
deviation, is known to be 8.5 g. Construct a 95% confidence interval using this
information.

Solution:
From Table 10 of the New Cambridge Statistical Tables, the top 2.5th percentile of
the standard normal distribution z0.025 = 1.96 (recall t∞ = N (0, 1)) so a 95%
confidence interval for the population mean is:

(798.30 − 1.96 × 8.5/√9, 798.30 + 1.96 × 8.5/√9) = (798.30 − 5.55, 798.30 + 5.55) = (792.75, 803.85).

Again, it may be more useful to write this as 798.30 ± 5.55. Note that this
confidence interval is less wide than the one in Question 8, even though our initial
estimate s turned out to be very close to the true value of σ.

10. Construct a 90% confidence interval for the variance of the bags of sugar in
Question 8. Does the given value of 8.5 g for the population standard deviation
seem plausible?


Solution:
We have n = 9 and s2 = 72.76. For a 90% confidence interval, we need the bottom
and top 5th percentiles of the chi-squared distribution on n − 1 = 8 degrees of
freedom. These are:

χ²_{0.95, 8} = 2.733   and   χ²_{0.05, 8} = 15.51.

A 90% confidence interval is:

((n − 1)S²/χ²_{α/2, n−1}, (n − 1)S²/χ²_{1−α/2, n−1}) = ((9 − 1) × 72.76/15.51, (9 − 1) × 72.76/2.733) = (37.529, 213.010).

The corresponding values for the standard deviation are:

(√37.529, √213.010) = (6.126, 14.595).

The given value falls well within this confidence interval, so we have no reason to
doubt it.

H.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix L.

1. A business requires an inexpensive check on the value of stock in its warehouse. In


order to do this, a random sample of 50 items is taken and valued. The average
value of these is computed to be £320.41 with a (sample) standard deviation of
£40.60. It is known that there are 9,875 items in the total stock.
(a) Estimate the total value of the stock to the nearest £10,000.
(b) Construct a 95% confidence interval for the mean value of all items and hence
construct a 95% confidence interval for the total value of the stock.
(c) You are told that the confidence interval in (b) is too wide for decision-making
purposes and you are asked to assess how many more items would need to be
sampled to obtain a confidence interval with the same level of confidence, but
with half the width.

2. (a) A sample of 954 adults in early 1987 found that 23% of them held shares.
Given a UK adult population of 41 million and assuming a proper random
sample was taken, construct a 95% confidence interval estimate for the number
of shareholders in the UK.
(b) A ‘similar’ survey the previous year had found a total of 7 million shareholders.
Assuming ‘similar’ means the same sample size, construct a 95% confidence
interval estimate of the increase in shareholders between the two years.

Appendix I
Hypothesis testing

I.1 Worked examples

1. A manufacturer has developed a new fishing line which is claimed to have an


average breaking strength of 7 kg, with a standard deviation of 0.25 kg. Assume
that the standard deviation figure is correct and that the breaking strength is
normally distributed. Suppose that we carry out a test, at the 5% significance level,
of H0 : µ = 7 vs. H1 : µ < 7. Find the sample size which is necessary for the test to
have 90% power if the true breaking strength is 6.95 kg.

Solution:
The critical value for the test is z_{0.95} = −1.645 and the probability of rejecting H0 with this test is:

P((X̄ − 7)/(0.25/√n) < −1.645)

which we rewrite as:

P((X̄ − 6.95)/(0.25/√n) < (7 − 6.95)/(0.25/√n) − 1.645)

because X̄ ∼ N(6.95, (0.25)²/n).
To ensure power of 90% we need z_{0.10} = 1.282 since:

P(Z < 1.282) = 0.90.

Therefore:

(7 − 6.95)/(0.25/√n) − 1.645 = 1.282
0.2 × √n = 2.927
√n = 14.635
n = 214.1832.

So to ensure that the test power is at least 90%, we should use a sample size of 215.
Remark: We see a rather large sample size is required. Hence investigators are
encouraged to use sample sizes large enough to come to rational decisions.
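The sample size calculation can be double-checked by evaluating the power function directly for candidate values of n. The Python sketch below is an optional illustration only (not part of the course); the function name and defaults are my own.

```python
# Optional check of Worked example 1: power of the lower-tailed z test at mu = 6.95.
from scipy import stats

def power(n, mu_true=6.95, mu0=7.0, sigma=0.25, alpha=0.05):
    se = sigma / n ** 0.5
    crit = mu0 - stats.norm.ppf(1 - alpha) * se      # reject H0 if xbar < crit
    return stats.norm.cdf((crit - mu_true) / se)

print(power(214), power(215))   # n = 214 gives power just below 0.90; n = 215 just above
```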


2. A doctor claims that the average European is more than 8.5 kg overweight. To test
this claim, a random sample of 12 Europeans were weighed, and the difference
between their actual weight and their ideal weight was calculated. The data are:

14, 12, 8, 13, −1, 10, 11, 15, 13, 20, 7 and 14.

Assuming the data follow a normal distribution, conduct a t test to infer at the 5%
significance level whether or not the doctor’s claim is true.
Solution:
We have a random sample of size n = 12 from N(µ, σ²), and we test H0: µ = 8.5 vs. H1: µ > 8.5. The test statistic, under H0, is:

T = (X̄ − 8.5)/(S/√n) = (X̄ − 8.5)/(S/√12) ∼ t_11.

We reject H0 if t > t_{0.05, 11} = 1.796. For the given data:

x̄ = (1/12) Σ_{i=1}^{12} xi = 11.333   and   s² = (1/11)(Σ_{i=1}^{12} xi² − 12x̄²) = 26.606.

Hence:

t = (11.333 − 8.5)/√(26.606/12) = 1.903 > 1.796 = t_{0.05, 11}
so we reject H0 at the 5% significance level. There is significant evidence to support
the doctor’s claim.
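As an optional check (not part of the course), the test statistic can be reproduced with scipy; note that scipy reports a two-sided p-value, so it is halved here for the one-sided alternative.

```python
# Optional check of Worked example 2: one-sample t test of H0: mu = 8.5 vs. H1: mu > 8.5.
import numpy as np
from scipy import stats

x = np.array([14, 12, 8, 13, -1, 10, 11, 15, 13, 20, 7, 14])
t_stat, p_two_sided = stats.ttest_1samp(x, popmean=8.5)
print(t_stat, p_two_sided / 2)   # t = 1.903; one-sided p-value about 0.04 < 0.05
```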

3. {X1 , X2 , . . . , X21 } represents a random sample of size 21 from a normal population


with mean µ and variance σ 2 .
(a) Construct a test procedure with a 5% significance level to test the null
hypothesis that σ 2 = 8 against the alternative that σ 2 > 8.
(b) Evaluate the power of the test for the values of σ 2 given below.
σ2 = 8.84 10.04 10.55 11.03 12.99 15.45 17.24

Solution:
(a) We test:

H0: σ² = 8 vs. H1: σ² > 8.

The test statistic, under H0, is:

T = (n − 1)S²/σ0² = 20 × S²/8 ∼ χ²_20.

With a 5% significance level, we reject the null hypothesis if:

t ≥ 31.41

since χ²_{0.05, 20} = 31.41.

(b) To evaluate the power, we need the probability of rejecting H0 (which happens if t ≥ 31.41) conditional on the actual value of σ², that is:

P(T ≥ 31.41 | σ² = k) = P(T × 8/k ≥ 31.41 × 8/k)

where k is the true value of σ², noting that:

T × 8/k ∼ χ²_20.

σ² = k        8.84   10.04   10.55   11.03   12.99   15.45   17.24
31.41 × 8/k   28.4   25.0    23.8    22.8    19.3    16.3    14.6
β(σ²)         0.10   0.20    0.25    0.30    0.50    0.70    0.80

4. The weights (in grammes) of a group of five-week-old chickens reared on a


high-protein diet are 336, 421, 310, 446, 390 and 434. The weights of a second
group of chickens similarly reared, except for their low-protein diet, are 224, 275,
393, 282 and 365. Is there evidence that the additional protein has increased the
average weight of the chickens? Assume normality.
Solution:
Assuming normally-distributed populations with possibly different means, but the
same variance, we test:

H0 : µX = µY vs. H1 : µX > µY .

The sample means and standard deviations are x̄ = 389.5, ȳ = 307.8, sX = 55.40 and sY = 69.45. The test statistic and its distribution under H0 are:

T = √((n + m − 2)/(1/n + 1/m)) × (X̄ − Ȳ)/√((n − 1)SX² + (m − 1)SY²) ∼ t_{n+m−2}

and we obtain, for the given data, t = 2.175 > 1.833 = t0.05, 9 hence we reject H0
that the mean weights are equal and conclude that the mean weight for the
high-protein diet is greater at the 5% significance level.
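
For readers who wish to verify the arithmetic, the pooled two-sample t test can be reproduced with scipy (an optional sketch; the data are those given in the question).

# Pooled-variance two-sample t test; one-sided p-value taken as half the two-sided one.
import numpy as np
from scipy import stats

high = np.array([336, 421, 310, 446, 390, 434])
low = np.array([224, 275, 393, 282, 365])
t_stat, p_two_sided = stats.ttest_ind(high, low, equal_var=True)
print(t_stat, p_two_sided / 2)           # t about 2.18, one-sided p-value about 0.03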

5. Suppose that we have two independent samples from normal populations with
known variances. We want to test the H0 that the two population means are equal
against the alternative that they are different. We could use each sample by itself
to write down 95% confidence intervals and reject H0 if these intervals did not
overlap. What would be the significance level of this test?
Solution:
Let us assume H0 : µX = µY is true, then the two 95% confidence intervals do not
overlap if and only if:
X̄ − 1.96 × σX/√n ≥ Ȳ + 1.96 × σY/√m   or   Ȳ − 1.96 × σY/√m ≥ X̄ + 1.96 × σX/√n.


So we want the probability:


  
P( |X̄ − Ȳ| ≥ 1.96 × (σX/√n + σY/√m) )

which is:

P( |X̄ − Ȳ|/√(σ²X/n + σ²Y/m) ≥ 1.96 × (σX/√n + σY/√m)/√(σ²X/n + σ²Y/m) ).

So we have:

P( |Z| ≥ 1.96 × (σX/√n + σY/√m)/√(σ²X/n + σ²Y/m) )

where Z ∼ N(0, 1). This does not reduce in general, but if we assume n = m and σ²X = σ²Y, then it reduces to:

P(|Z| ≥ 1.96 × √2) = 0.0056.

The significance level is about 0.6%, which is much smaller than the usual
conventions of 5% and 1%. Putting variability into two confidence intervals makes
them more likely to overlap than you might think, and so your chance of
incorrectly rejecting the null hypothesis is smaller than you might expect!
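
The quoted significance level is easy to confirm numerically. The sketch below assumes scipy is available.

# P(|Z| >= 1.96 * sqrt(2)) for the special case n = m with equal variances.
import numpy as np
from scipy.stats import norm

alpha = 2 * norm.sf(1.96 * np.sqrt(2))
print(alpha)                             # approximately 0.0056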

6. The following table shows the number of salespeople employed by a company and
the corresponding value of sales (in £000s):

Number of salespeople (x) 210 209 219 225 232 221


Sales (y) 206 200 204 215 222 216
Number of salespeople (x) 220 233 200 215 205 227
Sales (y) 210 218 201 212 204 212

Compute the sample correlation coefficient for these data and carry out a formal
test for a (linear) relationship between the number of salespeople and sales.
Note that: Σ xi = 2,616,  Σ yi = 2,520,  Σ xi² = 571,500,  Σ yi² = 529,746  and  Σ xi yi = 550,069.

Solution:
We test:
H0 : ρ = 0 vs. H1 : ρ > 0.
The corresponding test statistic and its distribution under H0 are:

T = ρ̂√(n − 2)/√(1 − ρ̂²) ∼ tn−2.
We find ρ̂ = 0.8716 and obtain t = 5.62 > 2.764 = t0.01, 10 and so we reject H0 at the
1% significance level. Since the test is highly significant, there is overwhelming
evidence of a (linear) relationship between the number of salespeople and the value
of sales.
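
The correlation and test statistic can be recomputed from the summary sums alone, as the optional Python sketch below illustrates (numpy and scipy assumed available).

# Sample correlation and its t test from the summary statistics.
import numpy as np
from scipy.stats import t as t_dist

n = 12
sum_x, sum_y = 2616, 2520
sum_x2, sum_y2, sum_xy = 571500, 529746, 550069

sxy = sum_xy - sum_x * sum_y / n
sxx = sum_x2 - sum_x**2 / n
syy = sum_y2 - sum_y**2 / n
r = sxy / np.sqrt(sxx * syy)                      # about 0.8716
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)   # about 5.62
print(r, t_stat, t_dist.ppf(0.99, df=n - 2))      # critical value about 2.764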


7. Two independent samples from normal populations yield the following results:

Sample 1    n = 5    Σ(xi − x̄)² = 4.8
Sample 2    m = 7    Σ(yi − ȳ)² = 37.2

Test at the 10% significance level whether the population variances are the same
based on the above data.
Solution:
We test:
H0: σ₁² = σ₂²  vs.  H1: σ₁² ≠ σ₂².
Under H0 , the test statistic is:
T = S₁²/S₂² ∼ Fn−1, m−1 = F4, 6.
Critical values are F0.95, 4, 6 = 1/F0.05, 6, 4 = 1/6.16 = 0.16 and F0.05, 4, 6 = 4.53, using
Table A.3 of the Dougherty Statistical Tables. The test statistic value is:
t = (4.8/4)/(37.2/6) = 0.1935
and since 0.16 < 0.1935 < 4.53 we do not reject H0 , which means there is no
evidence of a difference in the variances.
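
The critical values and the test statistic can also be obtained from scipy rather than from the printed tables; the following sketch is purely illustrative.

# Two-sided F test for equality of variances at the 10% significance level.
from scipy.stats import f

t_stat = (4.8 / 4) / (37.2 / 6)              # 0.1935
lower = f.ppf(0.05, dfn=4, dfd=6)            # about 0.16
upper = f.ppf(0.95, dfn=4, dfd=6)            # about 4.53
print(lower < t_stat < upper)                # True, so do not reject H0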

8. Why does it make no sense to use a hypothesis like x̄ = 2?


Solution:
We can see immediately if x̄ = 2 by calculating the sample mean. Inference is
concerned with the population from which the sample was taken. We are not very
interested in the sample mean in its own right.

9. (a) Of 100 clinical trials, 5 have shown that wonder-drug ‘Zap2’ is better than the
standard treatment (aspirin). Should we be excited by these results?
(b) Of the 1,000 clinical trials of 1,000 different drugs this year, 30 trials found
drugs which seem better than the standard treatments with which they were
compared. The television news reports only the results of those 30 ‘successful’
trials. Should we believe these reports?
Solution:
(a) If 5 clinical trials out of 100 report that Zap2 is better, this is consistent with
there being no difference whatsoever between Zap2 and aspirin if a 5% Type I
error probability is being used for tests in these clinical trials. With a 5%
significance level we expect 5 trials in 100 to show spurious significant results.
(b) If the television news reports the 30 successful trials out of 1,000, and those
trials use tests with a significance level of 5%, we may well choose to be very
cautious about believing the results. We would expect 50 spuriously significant
results in the 1,000 trial results.


10. A machine is designed to fill bags of sugar. The weight of the bags is normally
distributed with standard deviation σ. If the machine is correctly calibrated, σ
should be no greater than 20 g. We collect a random sample of 18 bags and weigh
them. The sample standard deviation is found to be equal to 32.48 g. Is there any
evidence that the machine is incorrectly calibrated?
Solution:
This is a hypothesis test for the variance of a normal population, so we will use the
chi-squared distribution. Let:

X1 , X2 , . . . , X18 ∼ N (µ, σ 2 )

be the weights of the bags in the sample. An appropriate test has hypotheses:

H0 : σ 2 = 400 vs. H1 : σ 2 > 400.

This is a one-sided test, because we are interested in detecting an increase in


variance. We compute the value of the test statistic:
t = (n − 1)s²/σ0² = (18 − 1) × (32.48)²/(20)² = 44.835.
At the 5% significance level, the upper-tail value of the chi-squared distribution on
ν = 18 − 1 degrees of freedom is χ20.05, 17 = 27.59. Our test statistic exceeds this
value, so we reject the null hypothesis.
We now move to the 1% significance level. The upper-tail value is χ20.01, 17 = 33.41,
so we reject H0 again. We conclude that there is very strong evidence that the
machine is incorrectly calibrated.

11. After the machine in Question 10 is calibrated, we collect a new sample of 21 bags.
The sample standard deviation of their weights is 23.72 g. Based on this sample,
can you conclude that the calibration has reduced the variance of the weights of the
bags?
Solution:
Let:
Y1, Y2, . . . , Y21 ∼ N(µY, σ²Y)

be the weights of the bags in the new sample, and use σ²X to denote the variance of the distribution of the previous sample, to avoid confusion. We want to test for a reduction in variance, so we set:

H0: σ²X/σ²Y = 1   vs.   H1: σ²X/σ²Y > 1.

The value of the test statistic in this case is:

s²X/s²Y = (32.48)²/(23.72)² = 1.875.
If the null hypothesis is true, the test statistic will follow an F18−1, 21−1 = F17, 20
distribution.


At the 5% significance level, the upper-tail critical value of the F17, 20 distribution is
F0.05, 17, 20 = 2.17. Our test statistic does not exceed this value, so we cannot reject
the null hypothesis.
We move to the 10% significance level. The upper-tail critical value is
F0.10, 17, 20 = 1.821 (using a computer), so we can now reject the null hypothesis (if
only barely). We conclude that there is some evidence that the variance is reduced,
but it is not very strong evidence.
Notice the difference between the conclusions of these two tests. We have a much
more powerful test when we compare our sample standard deviation of 32.48 g to a fixed
standard deviation of 20 g, than when we compare it to an estimated standard
deviation of 23.72 g, even though the values are similar.

I.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix L.

1. A random sample of fibres is known to come from one of two environments, A or B.


It is known from past experience that the lengths of fibres from A have a log-normal
distribution so that the log-length of an A-type fibre is normally distributed about
a mean of 0.80 with a standard deviation of 1.00. (Original units are in microns.)
The log-lengths of B-type fibres are normally distributed about a mean of 0.65
with a standard deviation of 1.00. In order to identify the environment from which
the given sample was taken a subsample of n fibres are to be measured and the
classification is to be made on the evidence of these measurements.
Do not be put off by the log-normal distribution. This simply means that it is the
logs of the data, rather than the original data, which have a normal distribution. If
X represents the log of a fibre length for fibres from A, then X ∼ N (0.8, 1).
(a) If n = 50 and the sample is attributed to type A if the sample mean of
log-lengths exceeds 0.75, determine the error probabilities.
(b) What sample size and decision procedures should be used to give error
probabilities such that the chance of misclassifying as A is 5% and the chance
of misclassifying as B is 10%?
(c) If the sample is classified as A if the sample mean of log-lengths exceeds 0.75,
and the misclassification as A is to have a probability of 2%, what sample size
should be used and what is the probability of a B-type misclassification?
(d) If the sample comes from neither A nor B but from an environment with a
mean log-length of 0.70, what is the probability of classifying it as type A if
the decision procedure determined in (b) is applied?

2. In a wire-based nail manufacturing process the target length for cut wire is 22 cm.
It is known that cut lengths vary with a standard deviation equal to 0.08 cm. In order
to monitor this process, a random sample of 50 separate wires is accurately
measured and the process is regarded as operating satisfactorily (the null


hypothesis) if the sample mean length lies between 21.97 cm and 22.03 cm so that
this is the decision procedure used (i.e. if the sample mean falls within this range
then the null hypothesis is not rejected, otherwise the null hypothesis is rejected).
(a) Determine the probability of a Type I error for this test.
(b) Determine the probability of making a Type II error when the process is
actually cutting to a length of 22.05 cm.
(c) Find the probability of rejecting the null hypothesis when the true cutting
length is 22.01 cm. (This is the power of the test when the true mean is 22.01
cm.)

3. A sample of seven is taken at random from a large batch of (nominally 12-volt)


batteries. These are tested and their true voltages are shown below:

12.9, 11.6, 13.5, 13.9, 12.1, 11.9 and 13.0.

(a) Test if the mean voltage of the whole batch is 12 volts.


(b) Test if the mean batch voltage is less than 12 volts.
Which test do you think is the more appropriate?

4. To instil customer loyalty, airlines, hotels, rental car companies, and credit card
companies (among others) have initiated frequency marketing programmes which
reward their regular customers. In the United States alone, millions of people are
members of the frequent-flier programmes of the airline industry. A large fast food
restaurant chain wished to explore the profitability of such a programme. They
randomly selected 12 of their 1,200 restaurants nationwide and instituted a
frequency programme which rewarded customers with a $5.00 gift certificate after
every 10 meals purchased at full price.
They ran the trial programme for three months. The restaurants not in the sample
had an average increase in profits of $1,047.34 over the previous three months,
whereas the restaurants in the sample had the following changes in profit:

$2,232.90 $545.47 $3,440.70 $1,809.10


$6,552.70 $4,798.70 $2,965.00 $2,610.70
$3,381.30 $1,591.40 $2,376.20 −$2,191.00

Note that the last number is negative, representing a decrease in profits. Specify
the appropriate null and alternative hypotheses for determining whether the mean
profit change for restaurants with frequency programmes is significantly greater (in
a statistical sense which you should make clear) than $1,047.34.

5. Two companies supplying a television repair service are compared by their repair
times (in days). Random samples of recent repair times for these companies gave
the following statistics:


Sample size Sample mean Sample variance


Company A 44 11.9 7.3
Company B 52 10.8 6.2

(a) Is there evidence that the companies differ in their true mean repair times?
Give an appropriate hypothesis test to support your conclusions.
(b) What is the p-value of your test?
(c) What difference would it have made if the sample sizes had each been smaller
by 35 (i.e. sizes 9 and 17, respectively)?

6. A museum conducts a survey of its visitors in order to assess the popularity of a


device which is used to provide information on the museum exhibits. The device
will be withdrawn if less than 30% of all of the museum’s visitors make use of it. Of
a random sample of 80 visitors, 20 chose to use the device.
(a) Carry out a hypothesis test at the 5% significance level to see if the device
should be withdrawn or not and state your conclusions.
(b) Determine the p-value of the test.
(c) What is the power of this test if the actual percentage of all visitors who would
use this device is only 20%?

Appendix J
Analysis of variance (ANOVA)

J.1 Worked examples


1. Three trainee salespeople were working on a trial basis. Salesperson A went in the
field for 5 days and made a total of 440 sales. Salesperson B was tried for 7 days
and made a total of 630 sales. Salesperson C was tried for 10 days and made a total
of 690 sales. Note that these figures are total sales, not daily averages. The sum of
the squares of all 22 daily sales (Σ xi²) is 146,840.
(a) Construct a one-way analysis of variance table.
(b) Would you say there is a difference between the mean daily sales of the three
salespeople? Justify your answer.
(c) Construct a 95% confidence interval for the mean difference between
salesperson B and salesperson C. Would you say there is a difference?

Solution:

(a) The means are 440/5 = 88, 630/7 = 90 and 690/10 = 69. We will perform a
one-way ANOVA. First, we calculate the overall mean. This is:

(440 + 630 + 690)/22 = 80.
We can now calculate the sum of squares between salespeople. This is:

5 × (88 − 80)2 + 7 × (90 − 80)2 + 10 × (69 − 80)2 = 2,230.

The total sum of squares is:

146,840 − 22 × (80)2 = 6,040.

Here is the one-way ANOVA table:


Source DF SS MS F p-value
Salesperson 2 2,230 1,115 5.56 ≈ 0.01
Error 19 3,810 200.53
Total 21 6,040

(b) As 5.56 > 3.52 = F0.05, 2, 19 , which is the top 5th percentile of the F2, 19
distribution (interpolated from Table A.3 of the Dougherty Statistical Tables),
we reject H0 : µ1 = µ2 = µ3 and conclude that there is evidence that the means
are not equal.


(c) We have:
90 − 69 ± 2.093 × √(200.53 × (1/7 + 1/10)) = 21 ± 14.61.

Here 2.093 is the top 2.5th percentile point of the t distribution with 19
degrees of freedom. A 95% confidence interval is (6.39, 35.61). As zero is not
included, there is evidence of a difference.
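
The ANOVA quantities above follow directly from the totals given in the question, as the optional Python sketch below shows (plain Python only; the dictionary layout is illustrative).

# One-way ANOVA computed from the group totals and the sum of squared observations.
totals = {'A': (440, 5), 'B': (630, 7), 'C': (690, 10)}   # (total sales, days observed)
sum_sq_all = 146840
n = sum(days for _, days in totals.values())
grand_mean = sum(total for total, _ in totals.values()) / n          # 80

between = sum(days * (total / days - grand_mean)**2 for total, days in totals.values())
total_ss = sum_sq_all - n * grand_mean**2
within = total_ss - between
f_stat = (between / 2) / (within / (n - 3))
print(between, within, f_stat)                                       # 2230, 3810, about 5.56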

2. The total times spent by three basketball players on court were recorded. Player A
was recorded on three occasions and the times were 29, 25 and 33 minutes. Player
B was recorded twice and the times were 16 and 30 minutes. Player C was recorded
on three occasions and the times were 12, 14 and 16 minutes. Use analysis of
variance to test whether there is any difference in the average times the three
players spend on court.
Solution:
We have x̄·A = 29, x̄·B = 23, x̄·C = 14 and x̄ = 21.875. Hence:

3 × (29 − 21.875)2 + 2 × (23 − 21.875)2 + 3 × (14 − 21.875)2 = 340.875.

The total sum of squares is:

4,307 − 8 × (21.875)2 = 478.875.

Here is the one-way ANOVA table:

Source DF SS MS F p-value
Players 2 340.875 170.4375 6.175 ≈ 0.045
Error 5 138 27.6
Total 7 478.875

We test H0 : µ1 = µ2 = µ3 (i.e. the average times they play are the same) vs. H1 :
The average times they play are not the same.

As 6.175 > 5.79 = F0.05, 2, 5 , which is the top 5th percentile of the F2, 5 distribution,
we reject H0 and conclude that there is evidence of a difference between the means.
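
Because the raw observations are given here, the calculation can also be checked with scipy's built-in one-way ANOVA routine (assumed available in this sketch).

# Direct check of the F statistic and p-value.
from scipy.stats import f_oneway

a = [29, 25, 33]
b = [16, 30]
c = [12, 14, 16]
f_stat, p_value = f_oneway(a, b, c)
print(f_stat, p_value)                   # about 6.18 and 0.045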

3. Three independent random samples were taken. Sample A consists of 4


observations taken from a normal distribution with mean µA and variance σ 2 ,
sample B consists of 6 observations taken from a normal distribution with mean µB
and variance σ 2 , and sample C consists of 5 observations taken from a normal
distribution with mean µC and variance σ 2 .
The average value of the first sample was 24, the average value of the second
sample was 20, and the average value of the third sample was 18. The sum of the
squared observations (all of them) was 6,722.4. Test the hypothesis:

H 0 : µA = µB = µC

against the alternative that this is not so.


Solution:
We will perform a one-way ANOVA. First we calculate the overall mean:
(4 × 24 + 6 × 20 + 5 × 18)/15 = 20.4.
We can now calculate the sum of squares between groups:

4 × (24 − 20.4)2 + 6 × (20 − 20.4)2 + 5 × (18 − 20.4)2 = 81.6.

The total sum of squares is:

6,722.4 − 15 × (20.4)2 = 480.

Here is the one-way ANOVA table:

Source DF SS MS F p-value
Sample 2 81.6 40.8 1.229 ≈ 0.327
Error 12 398.4 33.2
Total 14 480

As 1.229 < 3.89 = F0.05, 2, 12 , which is the top 5th percentile of the F2, 12
distribution, we see that there is no evidence that the means are not equal.

4. Four suppliers were asked to quote prices for seven different building materials. The
average quote of supplier A was 1,315.8. The average quotes of suppliers B, C and
D were 1,238.4, 1,225.8 and 1,200.0, respectively. The following is the calculated
two-way ANOVA table with some entries missing.

Source DF SS MS F p-value
Materials 17,800
Suppliers
Error
Total 358,700
(a) Complete the table using the information provided above.
(b) Is there a significant difference between the quotes of different suppliers?
Explain your answer.
(c) Construct a 90% confidence interval for the difference between suppliers A and
D. Would you say there is a difference?
Solution:
(a) The average quote of all suppliers is:
(1,315.8 + 1,238.4 + 1,225.8 + 1,200.0)/4 = 1,245.
Hence the sum of squares (SS) due to suppliers is:

7 × ((1,315.8 − 1,245)² + (1,238.4 − 1,245)² + (1,225.8 − 1,245)² + (1,200.0 − 1,245)²) = 52,148.88


and the MS due to suppliers is 52,148.88/(4 − 1) = 17,382.96.


The degrees of freedom are 7 − 1 = 6, 4 − 1 = 3, (7 − 1)(4 − 1) = 18 and
7 × 4 − 1 = 27 for materials, suppliers, error and total sum of squares,
respectively.
The SS for materials is 6 × 17,800 = 106,800. We have that the SS due to the
error is given by 358,700 − 52,148.88 − 106,800 = 199,751.12 and the MS is
199,751.12/18 = 11,097.28. The F values are:

17,800/11,097.28 = 1.604   and   17,382.96/11,097.28 = 1.567

for materials and suppliers, respectively. The two-way ANOVA table is:
Source DF SS MS F p-value
Materials 6 106,800 17,800 1.604 ≈ 0.203
Suppliers 3 52,148.88 17,382.96 1.567 ≈ 0.232
Error 18 199,751.12 11,097.28
Total 27 358,700
(b) We test H0 : µ1 = µ2 = µ3 = µ4 (i.e. there is no difference between suppliers)
vs. H1 : There is a difference between suppliers. The F value is 1.567 and at a
5% significance level the critical value from Table A.3 of the Dougherty
Statistical Tables (degrees of freedom 3 and 18) is 3.16, hence we do not reject
H0 and conclude that there is not enough evidence that there is a difference.
(c) The top 5th percentile of the t distribution with 18 degrees of freedom is 1.734
and the MS value is 11,097.28. So a 90% confidence interval is:
1,315.8 − 1,200 ± 1.734 × √(11,097.28 × (1/7 + 1/7)) = 115.8 ± 97.64

giving (18.16, 213.44). Since zero is not in the interval, there appears to be a
difference between suppliers A and D.

5. Blood alcohol content (BAC) is measured in milligrams per decilitre of blood


(mg/dL). A researcher is looking into the effects of alcoholic drinks. Four different
individuals tried five different brands of strong beer (A, B, C, D and E) on different
days, of course! Each individual consumed 1L of beer over a 30-minute period and
their BAC was measured one hour later. The average BAC for beers A, C, D and E
were 83.25, 95.75, 79.25 and 99.25, respectively. The value for beer B is not given.
The following information is provided as well.

Source DF SS MS F p-value
Drinker 1.56
Beer 303.5
Error 695.6
Total


(a) Complete the table using the information provided above.


(b) Is there a significant difference between the effects of different beers? What
about different drinkers?
(c) Construct a 90% confidence interval for the difference between the effects of
beers C and D. Would you say there is a difference?
Solution:
(a) We have:
Source DF SS MS F p-value
Drinker 3 271.284 90.428 1.56 ≈ 0.250
Beer 4 1214 303.5 5.236 ≈ 0.011
Error 12 695.6 57.967
Total 19 2,180.884
(b) We test the hypothesis H0 : µ1 = µ2 = · · · = µ5 (i.e. there is no difference
between the effects of different beers) vs. the alternative H1 : There is a
difference between the effects of different beers. The F value is 5.236 and at a
5% significance level the critical value from Table A.3 of the Dougherty
Statistical Tables is F0.05, 4, 12 = 3.26, so since 5.236 > 3.26 we reject H0 and
conclude that there is evidence of a difference.
For drinkers, we test the hypothesis H0 : µ1 = µ2 = µ3 = µ4 (i.e. there is no
difference between the effects on different drinkers) vs. the alternative H1 :
There is a difference between the effects on different drinkers. The F value is
1.56 and at a 5% significance level the critical value from Table A.3 of the
Dougherty Statistical Tables is F0.05, 3, 12 = 3.49, so since 1.56 < 3.49 we fail to
reject H0 and conclude that there is no evidence of a difference.
(c) The top 5th percentile of the t distribution with 12 degrees of freedom is 1.782.
So a 90% confidence interval is:
95.75 − 79.25 ± 1.782 × √(57.967 × (1/4 + 1/4)) = 16.5 ± 9.59

giving (6.91, 26.09). As the interval does not contain zero, there is evidence of
a difference between the effects of beers C and D.

6. A motor manufacturer operates five continuous-production plants: A, B, C, D and


E. The average rate of production has been calculated for the three shifts of each
plant and recorded in the table below. Does there appear to be a difference in
production rates in different plants or by different shifts?

A B C D E
Early shift 102 93 85 110 72
Late shift 85 87 71 92 73
Night shift 75 80 75 77 76

Solution:
Here r = 3 and c = 5. We may obtain the two-way ANOVA table as follows:


Source DF SS MS F
Shift 2 652.13 326.07 5.62
Plant 4 761.73 190.43 3.28
Error 8 463.87 57.98
Total 14 1,877.73

Under the null hypothesis of no shift effect, F ∼ F2, 8 . Since F0.05, 2, 8 = 4.46 < 5.62,
we can reject the null hypothesis at the 5% significance level. (Note the p-value
= 0.030.)
Under the null hypothesis of no plant effect, F ∼ F4, 8 . Since F0.05, 4, 8 = 3.84 > 3.28,
we cannot reject the null hypothesis at the 5% significance level. (Note the p-value
= 0.072.)
Overall, the data collected show some evidence of a shift effect but little evidence
of a plant effect.
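
The sums of squares in the two-way table can be reproduced directly from the data, for example with numpy as in the sketch below (an optional check; the array layout mirrors the table in the question).

# Two-way ANOVA sums of squares with one observation per cell.
import numpy as np

x = np.array([[102, 93, 85, 110, 72],     # early shift
              [85, 87, 71, 92, 73],       # late shift
              [75, 80, 75, 77, 76]])      # night shift
r, c = x.shape
grand = x.mean()
ss_shift = c * ((x.mean(axis=1) - grand)**2).sum()
ss_plant = r * ((x.mean(axis=0) - grand)**2).sum()
ss_total = ((x - grand)**2).sum()
ss_error = ss_total - ss_shift - ss_plant
print(ss_shift, ss_plant, ss_error, ss_total)   # about 652.1, 761.7, 463.9, 1877.7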

7. Complete the two-way ANOVA table below. In the places of p-values, indicate in
the form such as ‘< 0.01’ appropriately and use the closest value which you may
find from the Dougherty Statistical Tables.

Source DF SS MS F p-value
Row factor 4 ? 234.23 ? ?
Column factor 6 270.84 45.14 1.53 ?
Error ? 708.00 ?
Total 34 1,915.76

Solution:
First, C2 SS = (C2 MS)×4 = 936.92.
The degrees of freedom for Error is 34 − 4 − 6 = 24. Therefore, Error MS
= 708.00/24 = 29.5.
Hence the F statistic for testing no C2 effect is 234.23/29.5 = 7.94. From Table A.3
of the Dougherty Statistical Tables, F0.001, 4, 24 = 6.59 < 7.94. Therefore, the
corresponding p-value is smaller than 0.001.
Since F0.05, 6, 24 = 2.51 > 1.53, the p-value for testing the C3 effect is greater than
0.05.
The complete ANOVA table is as follows:

Two-way ANOVA: C1 versus C2, C3

Source DF SS MS F P
C2 4 936.92 234.23 7.94 <0.001
C3 6 270.84 45.14 1.53 >0.05
Error 24 708.00 29.5
Total 34 1,915.76


J.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix L.

1. An executive of a prepared frozen meals company is interested in the amounts of


money spent on such products by families in different income ranges. The table
below lists the monthly expenditures (in dollars) on prepared frozen meals from 15
randomly selected families divided into three groups according to their incomes.

Under $15,000 $15,000 – $30,000 Over $30,000


45.2 53.2 52.7
60.1 56.6 73.6
52.8 68.7 63.3
31.7 51.8 51.8
33.6 54.2
39.4

(a) Based on these data, can we infer at the 5% significance level that the
population mean expenditures on prepared frozen meals are the same for the
three different income groups?
(b) Produce a one-way ANOVA table.
(c) Construct 95% confidence intervals for the mean expenditures of the first
(under $15,000) and the third (over $30,000) income groups.

2. Does the level of success of publicly-traded companies affect the way their board
members are paid? The annual payments (in $000s) of randomly selected
publicly-traded companies to their board members were recorded. The companies
were divided into four quarters according to the returns in their stocks, and the
payments from each quarter were grouped together. Some summary statistics are
provided below.

Descriptive Statistics: 1st quarter, 2nd quarter, 3rd quarter, 4th quarter

Variable N Mean SE Mean StDev


1st quarter 30 74.10 2.89 15.81
2nd quarter 30 75.67 2.48 13.57
3rd quarter 30 78.50 2.79 15.28
4th quarter 30 81.30 2.85 15.59
(a) Can we infer that the amount of payment differs significantly across the four
groups of companies?
(b) Construct 95% confidence intervals for the mean payment of the 1st quarter
companies and the 4th quarter companies.

Appendix K
Linear regression

K.1 Worked examples


1. Consider the simple linear regression model representing the linear relationship
between two variables, y and x:
y i = β 0 + β 1 xi + εi
for i = 1, 2, . . . , n, where εi are independent and identically distributed random
variables with mean 0 and variance σ 2 . Prove that the least squares straight line
must necessarily pass through the point (x̄, ȳ).
Solution:
The estimated regression line is:
ybi = βb0 + βb1 xi
where βb0 = ȳ − βb1 x̄. When x̄ is substituted for xi , we obtain:
yb = βb0 + βb1 x̄ = ȳ − βb1 x̄ + βb1 x̄ = ȳ.
Therefore, the least squares straight line must necessarily pass through the point
(x̄, ȳ).

2. The following linear regression model is proposed to represent the linear


relationship between two variables, y and x:
yi = βxi + εi
for i = 1, 2, . . . , n, where εi are independent and identically distributed random
variables with mean 0 and variance σ 2 , and β is an unknown parameter to be
estimated.
(a) A proposed estimator of β is:
β̂ = min_β Σ_{i=1}^{n} (yi − βxi)².

Explain why this estimator is sensible.


(b) Another proposed estimator of β is:
β̃ = min_β Σ_{i=1}^{n} |yi − βxi|.

Explain why β̂ would be preferred to β̃.


(c) Express β̂ explicitly as a function of yi and xi only.

(d) Using the estimator β̂:

i. what is the value of β̂ if yi = xi for all i? What if they are the exact opposites of each other, i.e. yi = −xi for all i?
ii. is it always the case that −1 ≤ β̂ ≤ 1?
Solution:
(a) The estimator βb is sensible because it is the least squares estimator of β, which
provides the ‘best’ fit to the data in terms of minimising the sum of squared
residuals.
(b) The estimator βb is preferred to β̃ because the estimator β̃ is the least absolute
deviations estimator of β, which is also an option, but unlike βb it cannot be
computed explicitly via differentiation as the function f (x) = |x| is not
differentiable at zero. Therefore, β̃ is harder to compute than β̂.
(c) We need to minimise a convex quadratic, so we can do that by differentiating
it and equating the derivative to zero. We obtain:
−2 Σ_{i=1}^{n} (yi − β̂xi)xi = 0

which yields:

β̂ = Σ_{i=1}^{n} xi yi / Σ_{i=1}^{n} xi².

(d) i. If xi = yi, then β̂ = 1. If xi = −yi, then β̂ = −1.


ii. Not true. A counterexample is to take n = 1, x1 = 1 and y1 = 2.

3. Let {(xi , yi )}, for i = 1, 2, . . . , n, be observations from the linear regression model:

y i = β 0 + β 1 xi + εi .

(a) Suppose that the slope, β1 , is known. Find the least squares estimator (LSE) of
the intercept, β0 .
(b) Suppose that the intercept, β0 , is known. Find the LSE of the slope, β1 .
Solution:
(a) When β1 is known, let zi = yi − β1 xi . The model then reduces to zi = β0 + εi .
The LSE β̂0 minimises Σ_{i=1}^{n} (zi − β0)², hence:

β̂0 = z̄ = (1/n) Σ_{i=1}^{n} (yi − β1 xi).


(b) When β0 is known, we may write zi = yi − β0. The model is reduced to zi = β1 xi + εi. Note that:

Σ_{i=1}^{n} (zi − β1 xi)² = Σ_{i=1}^{n} (zi − β̂1 xi + (β̂1 − β1)xi)²
                          = Σ_{i=1}^{n} (zi − β̂1 xi)² + (β̂1 − β1)² Σ_{i=1}^{n} xi² + 2D

where D = (β̂1 − β1) Σ_{i=1}^{n} xi(zi − β̂1 xi). Suppose we choose β̂1 such that:

Σ_{i=1}^{n} xi(zi − β̂1 xi) = 0,   i.e.   Σ_{i=1}^{n} xi zi − β̂1 Σ_{i=1}^{n} xi² = 0.

Hence:

Σ_{i=1}^{n} (zi − β1 xi)² = Σ_{i=1}^{n} (zi − β̂1 xi)² + (β̂1 − β1)² Σ_{i=1}^{n} xi² ≥ Σ_{i=1}^{n} (zi − β̂1 xi)².

Therefore, β̂1 is the LSE of β1. Note now:

β̂1 = Σ_{i=1}^{n} xi zi / Σ_{i=1}^{n} xi² = Σ_{i=1}^{n} xi(yi − β0) / Σ_{i=1}^{n} xi².

4. Suppose an experimenter intends to perform a regression analysis by taking a total


of 2n data points, where the xi s are restricted to the interval [0, 5]. If the
xy-relationship is assumed to be linear and if the objective is to estimate the slope
with the greatest possible precision, what values should be assigned to the xi s?
Solution:
Since:
Var(β̂1) = σ² / Σ_{i=1}^{n} (xi − x̄)²

in order to minimise the variance of the sampling distribution of β̂1, we must maximise:

Σ_{i=1}^{n} (xi − x̄)².

To accomplish this, take half of the xi s to be 0, and the other half to be 5.

5. Suppose a total of n = 9 observations are to be taken on a simple linear regression


model, where the xi s will be set equal to 1, 2, . . . , 9. If the variance associated with
the xy-relationship is known to be 45, what is the probability that the estimated
slope will be within 1.5 units of the true slope?


Solution:
Since x̄ = (1 + 2 + · · · + 9)/9 = 5, then Σ_{i=1}^{n} (xi − x̄)² = 60 and so:

Var(β̂1) = σ² / Σ_{i=1}^{n} (xi − x̄)² = 45/60 = 0.75.

Therefore:

β̂1 ∼ N(β1, 0.75).
We require:
 
P(|β̂1 − β1| < 1.5) = P(|Z| < 1.5/√0.75) = P(|Z| < 1.73) = 1 − 2 × 0.0418 = 0.9164.

6. A researcher wants to investigate whether there is a significant link between GDP


per capita and average life expectancy in major cities. Data have been collected in
30 major cities, yielding average GDPs per capita x1 , x2 , . . . , x30 (in $000s) and
average life expectancies y1 , y2 , . . . , y30 (in years). The following linear regression
model has been proposed:
y i = β 0 + β 1 xi + εi
where the εi s are independent and N (0, σ 2 ). Some summary statistics are:
Σ xi = 620.35,  Σ yi = 2,123.00,  Σ xi yi = 44,585.1,  Σ xi² = 13,495.62  and  Σ yi² = 151,577.3  (all sums over i = 1, 2, . . . , 30).

(a) Find the least-squares estimates of β0 and β1 and write down the fitted
regression model.
(b) Compute a 95% confidence interval for the slope coefficient β1 . What can be
concluded?
(c) Compute R2 . What can be said about how ‘good’ the model is?
(d) With x = 30, find a prediction interval which covers y with probability 0.95.
With 97.5% confidence, what minimum average life expectancy can a city
expect once its GDP per capita reaches $30,000?
Solution:
(a) We have:
β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)² = (Σ xi yi − nx̄ȳ) / (Σ xi² − nx̄²) = 1.026

and:

β̂0 = ȳ − β̂1 x̄ = 49.55.

Hence the fitted model is:

ŷ = 49.55 + 1.026x.


(b) We first need E.S.E.(β̂1), for which we need σ̂². For σ̂², we need the Residual SS (from the Total SS and the Regression SS). We compute:

Total SS = Σ yi² − nȳ² = 1,339.67

Regression SS = β̂1² (Σ xi² − nx̄²) = 702.99

Residual SS = Total SS − Regression SS = 636.68

σ̂² = 636.68/28 = 22.74

E.S.E.(β̂1) = (σ̂² / (Σ xi² − nx̄²))^{1/2} = 0.184.

Hence a 95% confidence interval for β1 is:


 
(β̂1 − t0.025, 28 × E.S.E.(β̂1),  β̂1 + t0.025, 28 × E.S.E.(β̂1))

which gives:

1.026 ± 2.05 × 0.184   ⇒   (0.65, 1.40).

The confidence interval does not contain zero. Therefore, we would reject the
hypothesis of β1 being zero at the 5% significance level. Hence there does
appear to be a significant link.

(c) The model can explain 52% of the variation of y, since:

R² = Regression SS / Total SS = 702.99/1,339.67 = 0.52.

Whether or not the model is ‘good’ is subjective. It is not necessarily ‘bad’,


although we may be able to determine a ‘better’ model with better
explanatory power, possibly using multiple linear regression.

(d) The prediction interval has the form:

β̂0 + β̂1 x ± t0.025, n−2 × σ̂ × (1 + (Σ xi² − 2x Σ xi + nx²) / (n(Σ xi² − nx̄²)))^{1/2}

which gives:

(69.79, 90.87).

Therefore, we can be 97.5% confident that the average life expectancy lies
above 69.79 years once GDP per capita reaches $30,000.
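
All of the quantities in parts (a) to (c) can be recovered from the five summary sums, as the optional Python sketch below illustrates (numpy assumed available; variable names are illustrative).

# Least squares fit and R^2 from summary statistics only.
import numpy as np

n = 30
sum_x, sum_y = 620.35, 2123.00
sum_xy, sum_x2, sum_y2 = 44585.1, 13495.62, 151577.3

xbar, ybar = sum_x / n, sum_y / n
b1 = (sum_xy - n * xbar * ybar) / (sum_x2 - n * xbar**2)     # about 1.026
b0 = ybar - b1 * xbar                                        # about 49.55
total_ss = sum_y2 - n * ybar**2                              # about 1,339.67
reg_ss = b1**2 * (sum_x2 - n * xbar**2)                      # about 702.99
print(b0, b1, reg_ss / total_ss)                             # R^2 about 0.52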


7. The following is partial regression output:

The regression equation is


y = 2.1071 + 1.1263x

Predictor Coef SE Coef


Constant 2.1071 0.2321
x 1.1263 0.0911

Analysis of Variance

SOURCE DF SS
Regression 1 2011.12
Residual Error 40 539.17

In addition, x̄ = 1.56.
(a) Find an estimate of the error term variance, σ 2 .
(b) Calculate and interpret R2 .
(c) Test at the 5% significance level whether or not the slope in the regression
model is equal to 1.
(d) For x = 0.8, find a 95% confidence interval for the expectation of y.
Solution:
(a) Noting n = 40 + 1 + 1 = 42, we have:

σ̂² = Residual SS/(n − 2) = 539.17/40 = 13.479.

(b) We have Total SS = Regression SS + Residual SS = 2,550.29. Hence:

R² = Regression SS / Total SS = 2,011.12/2,550.29 = 0.7886.

Therefore, 78.86% of the variation of y is explained by x.


(c) Under H0 : β1 = 1, the test statistic is:

T = (β̂1 − 1)/E.S.E.(β̂1) ∼ tn−2 = t40.

We reject H0 if |t| > 2.021 = t0.025, 40 . As t = 0.1263/0.0911 = 1.386, we cannot


reject the null hypothesis that β1 = 1 at the 5% significance level.
(d) Note Σ_{i=1}^{n} (xi − x̄)² = (Regression SS)/(β̂1)² = 2,011.12/(1.1263)² = 1,585.367. Also:

Σ_{i=1}^{n} (xi − x)² = Σ_{i=1}^{n} (xi − x̄)² + n(x̄ − x)² = 1,585.367 + 42 × (1.56 − 0.8)² = 1,609.626.


Hence a 95% confidence interval for E(y) given x = 0.8 is:


 n 1/2
2
P
(x
 i=1 i − x) 
βb0 + βb1 x ± t0.025, n−2 × σ
b×
 P n


n (xj − x̄) 2
j=1
r
13.479 × 1,609.626
= 2.1071 + 1.1263 × 0.8 ± 2.021 ×
42 × 1,585.367
= 3.0081 ± 1.1536 ⇒ (1.854, 4.162).

8. Why is the squared sample correlation coefficient between the yi s and xi s the same as the squared sample correlation coefficient between the yi s and ŷi s? No algebra is needed for this.
Solution:
The only difference between the xi s and ŷi s is a rescaling by multiplying by β̂1, followed by a relocation by adding β̂0. Correlation coefficients are not affected by a change of scale or location, so it will be the same whether we use the xi s or the ŷi s.

9. If the model fits, then the fitted values and the residuals from the model are
independent of each other. What do you expect to see if the model fits when you
plot residuals against fitted values?
Solution:
If the model fits, one would expect to see a random scatter with no particular
pattern.

K.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix L.

1. The table below shows the cost of fire damage for ten fires together with the
corresponding distances of the fires to the nearest fire station:

Distance in miles (x) 4.9 4.5 6.3 3.2 5.0


Cost in £000s (y) 31.1 31.1 43.1 22.1 36.2
Distance in miles (x) 5.7 4.0 4.3 2.5 5.2
Cost in £000s (y) 35.8 25.9 28.0 22.9 33.5

(a) Fit a straight line to these data and construct a 95% confidence interval for
the increase in cost of a fire for each mile from the nearest fire station.
(b) Test the hypothesis that the ‘true line’ passes through the origin.


2. The yearly profits made by a company, over a period of eight consecutive years are
shown below:

Year 1 2 3 4 5 6 7 8
Profit (in £000s) 18 21 34 31 44 46 60 75

(a) Fit a straight line to these data and compute a 95% confidence interval for the
‘true’ yearly increase in profits.
(b) The company accountant forecasts the profits for year 9 to be £90,000. Is this
forecast reasonable if it is based on the above data?

3. The data table below shows the yearly expenditure (in £000s) by a cosmetics
company in advertising a particular brand of perfume:

Year (x) 1 2 3 4 5 6 7 8
Expenditure (y) 170 170 275 340 435 510 740 832

(a) Fit a regression line to these data and construct a 95% confidence interval for
its slope.
(b) Construct an analysis of variance table and compute the R2 statistic for the fit.
(c) Comment on the goodness of fit of the linear regression model.
(d) Predict the expenditure for Year 9 and construct a 95% prediction interval for
the actual expenditure.

4. Let X and ε be two independent random variables, and E(ε) = 0. Let


Y = β0 + β1 X + ε. Show that:
β1 = Cov(X, Y)/Var(X) = Corr(X, Y) × √(Var(Y)/Var(X)).

Appendix L
Solutions to Practice questions

L.1 Appendix A – Data visualisation and descriptive statistics

1. (a) We have:
Σ_{j=1}^{3} (Yj − Ȳ) = (Y1 − Ȳ) + (Y2 − Ȳ) + (Y3 − Ȳ).

However:

3Ȳ = Y1 + Y2 + Y3

hence:

Σ_{j=1}^{3} (Yj − Ȳ) = 3Ȳ − 3Ȳ = 0.

(b) We have:

Σ_{j=1}^{3} Σ_{k=1}^{3} (Yj − Ȳ)(Yk − Ȳ) = (Y1 − Ȳ)((Y1 − Ȳ) + (Y2 − Ȳ) + (Y3 − Ȳ))
                                          + (Y2 − Ȳ)((Y1 − Ȳ) + (Y2 − Ȳ) + (Y3 − Ȳ))
                                          + (Y3 − Ȳ)((Y1 − Ȳ) + (Y2 − Ȳ) + (Y3 − Ȳ))
                                          = ((Y1 − Ȳ) + (Y2 − Ȳ) + (Y3 − Ȳ))²

and 3Ȳ = Y1 + Y2 + Y3 as above, so:

Σ_{j=1}^{3} Σ_{k=1}^{3} (Yj − Ȳ)(Yk − Ȳ) = 0² = 0.

(c) We have:

Σ_{j=1}^{3} Σ_{k=1}^{3} (Yj − Ȳ)(Yk − Ȳ) = Σ_{j≠k} (Yj − Ȳ)(Yk − Ȳ) + Σ_{j=1}^{3} (Yj − Ȳ)².

We have written the nine terms in the left-hand expression as the sum of the six terms for which j ≠ k, and the three terms for which j = k.

However, we showed in (b) that the left-hand expression is in fact 0, so:

0 = Σ_{j≠k} (Yj − Ȳ)(Yk − Ȳ) + Σ_{j=1}^{3} (Yj − Ȳ)²

from which the result follows.

2. (a) We have:
ȳ = Σ_{i=1}^{n} yi / n = Σ_{i=1}^{n} (axi + b) / n = (a Σ_{i=1}^{n} xi + nb) / n = ax̄ + b.
(b) Multiply out the square within the summation sign and then evaluate the
three expressions, remembering that x̄ is a constant with respect to summation
and can be taken outside the summation sign as a common factor, i.e. we have:
Σ_{i=1}^{n} (xi − x̄)² = Σ_{i=1}^{n} (xi² − 2xi x̄ + x̄²)
                      = Σ_{i=1}^{n} xi² − 2x̄ Σ_{i=1}^{n} xi + Σ_{i=1}^{n} x̄²
                      = Σ_{i=1}^{n} xi² − 2nx̄² + nx̄²

hence the result. Recall that Σ_{i=1}^{n} xi = nx̄.

(c) It is probably best to work with variances to avoid the square roots. The
variance of the y values, say s²y, is given by:

s²y = (1/n) Σ_{i=1}^{n} (yi − ȳ)²
    = (1/n) Σ_{i=1}^{n} (axi + b − (ax̄ + b))²
    = a² (1/n) Σ_{i=1}^{n} (xi − x̄)²
    = a² s²x.

The result follows on taking the square root, observing that the standard
deviation cannot be a negative quantity.
Adding a constant k to each value of a dataset adds k to the mean and leaves the
standard deviation unchanged. This corresponds to a transformation yi = axi + b
with a = 1 and b = k. Apply (a) and (c) with these values.
Multiplying each value of a dataset by a constant c multiplies the mean by c and
also the standard deviation by |c|. This corresponds to a transformation yi = cxi
with a = c and b = 0. Apply (a) and (c) with these values.


L.2 Appendix B – Probability theory


1. (a) We know P (E ∪ F ) = P (E) + P (F ) − P (E ∩ F ).
Consider A ∪ B ∪ C as (A ∪ B) ∪ C (i.e. as the union of the two sets A ∪ B and
C) and then apply the result above to obtain:

P (A ∪ B ∪ C) = P ((A ∪ B) ∪ C) = P (A ∪ B) + P (C) − P ((A ∪ B) ∩ C).

Now (A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C) – a Venn diagram can be drawn to


check this.
So:

P (A∪B ∪C) = P (A∪B)+P (C)−(P (A∩C)+P (B ∩C)−P ((A∩C)∩(B ∩C)))

using the earlier result again for A ∩ C and B ∩ C.


Now (A ∩ C) ∩ (B ∩ C) = A ∩ B ∩ C and if we apply the earlier result once
more for A and B, we obtain:

P (A∪B∪C) = P (A)+P (B)−P (A∩B)+P (C)−P (A∩C)−P (B∩C)+P (A∩B∩C)

which is the required result.


(b) Use the result that if X ⊂ Y then P (X) ≤ P (Y ) for events X and Y .
Since A ⊂ A ∪ B and B ⊂ A ∪ B, we have P (A) ≤ P (A ∪ B) and
P (B) ≤ P (A ∪ B).
Adding these inequalities, P (A) + P (B) ≤ 2P (A ∪ B) so:

P (A) + P (B)
≤ P (A ∪ B).
2
Similarly, A ∩ B ⊂ A and A ∩ B ⊂ B, so P (A ∩ B) ≤ P (A) and
P (A ∩ B) ≤ P (B).
Adding, 2P (A ∩ B) ≤ P (A) + P (B) so:

P (A) + P (B)
P (A ∩ B) ≤ .
2

2. (a) We know that P (A ∪ B) = P (A) + P (B) − P (A ∩ B). For independent events


A and B, P (A ∩ B) = P (A) P (B), so P (A ∪ B) = P (A) + P (B) − P (A) P (B)
gives 0.75 = p + 2p − 2p2 , or 2p2 − 3p + 0.75 = 0.
Solving the quadratic equation gives:

p = (3 − √3)/4 ≈ 0.317
suppressing the irrelevant case for which p > 1.
Since A and B are independent, P (A | B) = P (A) = p = 0.317.


(b) For mutually exclusive events, P (A ∪ B) = P (A) + P (B), so 0.75 = p + 2p,


leading to p = 0.25.
Here P (A ∩ B) = 0, so P (A | B) = P (A ∩ B)/P (B) = 0.

3. (a) We are given that A and B are independent, so P (A ∩ B) = P (A) P (B). We


need to show a similar result for Ac and B c , namely we need to show that
P (Ac ∩ B c ) = P (Ac ) P (B c ).
Now Ac ∩ B c = (A ∪ B)c from basic set theory (draw a Venn diagram), hence:

P (Ac ∩ B c ) = P ((A ∪ B)c )


= 1 − P (A ∪ B)
= 1 − (P (A) + P (B) − P (A ∩ B))
= 1 − P (A) − P (B) + P (A ∩ B)
= 1 − P (A) − P (B) + P (A) P (B) (independence assumption)
= (1 − P (A))(1 − P (B)) (factorising)
= P (Ac ) P (B c ) (as required).

(b) To show that X c and Y c are not necessarily mutually exclusive when X and Y
are mutually exclusive, the best approach is to find a counterexample.
Attempts to ‘prove’ the result directly are likely to be logically flawed.
Look for a simple example. Suppose we roll a die. Let X = {6} be the event of
obtaining a 6, and let Y = {5} be the event of obtaining a 5. Obviously X and
Y are mutually exclusive, but X c = {1, 2, 3, 4, 5} and Y c = {1, 2, 3, 4, 6} have
X c ∩ Y c 6= ∅, so X c and Y c are not mutually exclusive.

4. (a) A will win the game without deuce if he or she wins four points, including the
last point, before B wins three points. This can occur in three ways.
• A wins four straight points, i.e. AAAA with probability (2/3)4 = 16/81.
• B wins just one point in the game. There are 4 C1 ways for this to happen,
namely BAAAA, ABAAA, AABAA and AAABA. Each has probability
(1/3)(2/3)4 , so the probability of one of these outcomes is given by
4(1/3)(2/3)4 = 64/243.
• B wins just two points in the game. There are 5 C2 ways for this to
happen, namely BBAAAA, BABAAA, BAABAA, BAAABA,
ABBAAA, ABABAA, ABAABA, AABBAA, AABABA and
AAABBA. Each has probability (1/3)2 (2/3)4 , so the probability of one of
these outcomes is given by 10(1/3)2 (2/3)4 = 160/729.
Therefore, the probability that A wins without a deuce must be the sum of
these, namely:
16/81 + 64/243 + 160/729 = (144 + 192 + 160)/729 = 496/729.

(b) We can mimic the above argument to find the probability that B wins the
game without a deuce. That is, the probability of four straight points to B is
(1/3)4 = 1/81, the probability that A wins just one point in the game is
4(2/3)(1/3)4 = 8/243, and the probability that A wins just two points is
10(2/3)2 (1/3)4 = 40/729. So the probability of B winning without a deuce is
1/81 + 8/243 + 40/729 = 73/729 and so the probability of deuce is
1 − 496/729 − 73/729 = 160/729.
(c) Either: suppose deuce has been called. The probability that A wins the set
without further deuces is the probability that the next two points go AA –
with probability (2/3)2 .
The probability of exactly one further deuce is that the next four points go
ABAA or BAAA – with probability (2/3)3 (1/3) + (2/3)3 (1/3) = (2/3)4 .
The probability of exactly two further deuces is that the next six points go
ABABAA, ABBAAA, BAABAA or BABAAA – with probability
4(2/3)4 (1/3)2 = (2/3)6 .
Continuing this way, the probability that A wins after three further deuces is
(2/3)8 and the overall probability that A wins after deuce has been called is
(2/3)2 + (2/3)4 + (2/3)6 + (2/3)8 + · · · .
This is a geometric progression (GP) with first term a = (2/3)2 and common
ratio (2/3)2 , so the overall probability that A wins after deuce has been called
is a/(1 − r) (sum to infinity of a GP) which is:

(2/3)²/(1 − (2/3)²) = (4/9)/(5/9) = 4/5.

Or (quicker!): given a deuce, the next 2 balls can yield the following results.
A wins with probability (2/3)2 , B wins with probability (1/3)2 , and deuce
with probability 4/9.
Hence P (A wins | deuce) = (2/3)2 + (4/9) P (A wins | deuce) and solving
immediately gives P (A wins | deuce) = 4/5.
(d) We have:

P(A wins the game) = P(A wins without deuce being called)
                     + P(deuce is called) × P(A wins | deuce is called)
                   = 496/729 + (160/729) × (4/5)
                   = 496/729 + 128/729
                   = 624/729.

Aside: so the probability of B winning the game is 1 − 624/729 = 105/729. It


follows that A is about six times as likely as B to win the game although the
probability of winning any point is only twice that of B. Another example of the
counterintuitive nature of probability.


L.3 Appendix C – Random variables

1. We require a counterexample. A simple one will suffice – there is no merit in


complexity. Let the discrete random variable X assume values 1 and 2 with
probabilities 1/3 and 2/3, respectively. (Obviously, there are many other examples
we could have chosen.) Therefore:

E(X) = 1 × 1/3 + 2 × 2/3 = 5/3

E(X²) = 1 × 1/3 + 4 × 2/3 = 3

E(1/X) = 1 × 1/3 + (1/2) × 2/3 = 2/3

and, clearly, E(X²) ≠ (E(X))² and E(1/X) ≠ 1/E(X) in this case. So the result has been shown in general.

2. (a) Recall that Var(X) = E(X 2 ) − (E(X)2 ). Now, working backwards:

E(X(X − 1)) − E(X)(E(X) − 1) = E(X 2 − X) − (E(X))2 + E(X)


= E(X 2 ) − E(X) − E(X)2 + E(X)
(using standard properties of expectation) = E(X 2 ) − (E(X))2
= Var(X).

(b) We have:

 
E((X1 + X2 + · · · + Xn)/n) = E(X1 + X2 + · · · + Xn)/n
                            = (E(X1) + E(X2) + · · · + E(Xn))/n
                            = (µ + µ + · · · + µ)/n
                            = nµ/n
                            = µ.

Var((X1 + X2 + · · · + Xn)/n) = Var(X1 + X2 + · · · + Xn)/n²
            (by independence) = (Var(X1) + Var(X2) + · · · + Var(Xn))/n²
                              = (σ² + σ² + · · · + σ²)/n²
                              = nσ²/n²
                              = σ²/n.

3. Suppose n subjects are procured. The probability that a single subject does not
have the abnormality is 0.96. Using independence, the probability that none of the
subjects has the abnormality is (0.96)n .
The probability that at least one subject has the abnormality is 1 − (0.96)n . We
require the smallest whole number n for which 1 − (0.96)n > 0.95, i.e. we have
(0.96)n < 0.05.
We can solve the inequality by ‘trial and error’, but it is neater to take logs.
n ln(0.96) < ln(0.05), so n > ln(0.05)/ ln(0.96), or n > 73.39. Rounding up, 74
subjects should be procured.

4. (a) For the ‘stupid’ rat:


P(X = 1) = 1/4
P(X = 2) = (3/4) × (1/4)
. . .
P(X = r) = (3/4)^{r−1} × (1/4).
This is a ‘geometric distribution’ with π = 1/4, which gives E(X) = 1/π = 4.
(b) For the ‘intelligent’ rat:
P(X = 1) = 1/4
P(X = 2) = (3/4) × (1/3) = 1/4
P(X = 3) = (3/4) × (2/3) × (1/2) = 1/4
P(X = 4) = (3/4) × (2/3) × (1/2) × 1 = 1/4.
Hence E(X) = (1 + 2 + 3 + 4)/4 = 10/4 = 2.5.


(c) For the ‘forgetful’ rat (short-term, but not long-term, memory):
P(X = 1) = 1/4
P(X = 2) = (3/4) × (1/3)
P(X = 3) = (3/4) × (2/3) × (1/3)
. . .
P(X = r) = (3/4) × (2/3)^{r−2} × (1/3)   (for r ≥ 2).

Therefore:

E(X) = 1/4 + (3/4) × (2 × (1/3) + 3 × (2/3) × (1/3) + 4 × (2/3)² × (1/3) + · · ·)
     = 1/4 + (1/4) × (2 + 3 × (2/3) + 4 × (2/3)² + · · ·).

There is more than one way to evaluate this sum.

E(X) = 1/4 + (1/4) × ((1 + 2/3 + (2/3)² + · · ·) + (1 + 2 × (2/3) + 3 × (2/3)² + · · ·))
     = 1/4 + (1/4) × (3 + 9)
     = 3.25.

Note that 2.5 < 3.25 < 4, so the intelligent rat needs the least trials on average,
while the stupid rat needs the most, as we would expect!

L.4 Appendix D – Common distributions of random variables
1. (a) Let P ∼ N (10.42, (0.03)2 ) for the pistons, and C ∼ N (10.52, (0.04)2 ) for the
cylinders. It follows that D ∼ N (0.1, (0.05)2 ) for the difference (adding the
variances, assuming independence). The piston will fit if D > 0. We require:
 
P(D > 0) = P(Z > (0 − 0.1)/0.05) = P(Z > −2) = 0.9772

so the proportion of 1 − 0.9772 = 0.0228 will not fit.


The number of pistons, N , failing to fit out of 100 will be a binomial random
variable such that N ∼ Bin(100, 0.0228).


(b) Calculating directly, we have the following.


i. P(N = 0) = (0.9772)^100 = 0.0996.

ii. P(N ≤ 2) = (0.9772)^100 + 100 × (0.9772)^99 × (0.0228) + 100 C2 × (0.9772)^98 × (0.0228)² = 0.6005.

(c) Using the Poisson approximation with λ = 100 × 0.0228 = 2.28, we have the
following.
i. P (N = 0) ≈ e−2.28 = 0.1023.
ii. P (N ≤ 2) ≈ e−2.28 + e−2.28 × 2.28 + e−2.28 × (2.28)2 /2! = 0.6013.
The approximations are good (note there will be some rounding error, but the
values are close with the two methods). It is not surprising that there is close
agreement since n is large, π is small and nπ < 5.
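
The closeness of the two sets of probabilities can be confirmed with scipy's binomial and Poisson distributions (an optional numerical check).

# Exact binomial probabilities versus the Poisson approximation.
from scipy.stats import binom, poisson

n, pi = 100, 0.0228
print(binom.pmf(0, n, pi), poisson.pmf(0, n * pi))     # about 0.0996 vs. 0.1023
print(binom.cdf(2, n, pi), poisson.cdf(2, n * pi))     # about 0.6005 vs. 0.6013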

2. We have P (X = 1) = P (X = 2) = · · · = P (X = k) = 1/k. Therefore:

MX(t) = E(e^{Xt}) = e^t × 1/k + e^{2t} × 1/k + · · · + e^{kt} × 1/k = (1/k)(e^t + e^{2t} + · · · + e^{kt}).

The bracketed part of this expression is a geometric progression where the first
term is et and the common ratio is et .
Using the well-known result for the sum of k terms of a geometric progression, we
obtain:
MX(t) = (1/k) × e^t(1 − (e^t)^k)/(1 − e^t) = e^t(1 − e^{kt})/(k(1 − e^t)).

3. (a) For f(z) to serve as a pdf, we require (i.) f(z) ≥ 0 for all z, and (ii.) ∫_{−∞}^{∞} f(z) dz = 1. The first condition certainly holds for f(z). The second also holds since:

∫_{−∞}^{∞} f(z) dz = ∫_{−∞}^{0} (1/2)e^{−|z|} dz + ∫_{0}^{∞} (1/2)e^{−|z|} dz
                   = ∫_{−∞}^{0} (1/2)e^{z} dz + ∫_{0}^{∞} (1/2)e^{−z} dz
                   = [e^{z}/2]_{−∞}^{0} − [e^{−z}/2]_{0}^{∞}
                   = 1/2 + 1/2
                   = 1.

A sketch of f(z) shows a symmetric density, peaked at z = 0 and decaying exponentially in each tail.


(b) The moment generating function is:

MZ(t) = E(e^{Zt}) = ∫_{−∞}^{0} (1/2)e^{−|z|}e^{zt} dz + ∫_{0}^{∞} (1/2)e^{−|z|}e^{zt} dz
      = ∫_{−∞}^{0} (1/2)e^{z}e^{zt} dz + ∫_{0}^{∞} (1/2)e^{−z}e^{zt} dz
      = ∫_{−∞}^{0} (1/2)e^{z(1+t)} dz + ∫_{0}^{∞} (1/2)e^{z(t−1)} dz
      = [e^{z(1+t)}/(2(1 + t))]_{−∞}^{0} + [e^{z(t−1)}/(2(t − 1))]_{0}^{∞}
      = 1/(2(1 + t)) − 1/(2(t − 1))
      = (1 − t²)^{−1}

where the condition −1 < t < 1 ensures the integrands are 0 at the infinite
limits.

(c) We can find the various moments by differentiating MZ (t), but it is simpler to
expand it:
MZ (t) = (1 − t2 )−1 = 1 + t2 + t4 + · · · .

Now the coefficient of t is 0, so E(Z) = 0. The coefficient of t2 is 1, so


E(Z 2 )/2 = 1, and Var(Z) = E(Z 2 ) − (E(Z))2 = 2.

The coefficient of t3 is 0, so E(Z 3 ) = 0. The coefficient of t4 is 1, so


E(Z 4 )/4! = 1, and so E(Z 4 ) = 24.

Note that the first and third of these results follow directly from the fact
(illustrated in the sketch) that the distribution is symmetric about z = 0.


4. For X ∼ Bin(n, π), P(X = x) = C(n, x) π^x (1 − π)^{n−x}, where C(n, x) = n!/(x!(n − x)!) denotes the binomial coefficient. So, for E(X), we have:

E(X) = Σ_{x=0}^{n} x C(n, x) π^x (1 − π)^{n−x}
     = Σ_{x=1}^{n} x C(n, x) π^x (1 − π)^{n−x}
     = Σ_{x=1}^{n} [n(n − 1)! / ((x − 1)!((n − 1) − (x − 1))!)] π π^{x−1} (1 − π)^{n−x}
     = nπ Σ_{x=1}^{n} C(n − 1, x − 1) π^{x−1} (1 − π)^{n−x}
     = nπ Σ_{y=0}^{n−1} C(n − 1, y) π^y (1 − π)^{(n−1)−y}
     = nπ × 1
     = nπ

where y = x − 1, and the last summation is over all the values of the pf of another binomial distribution, this time with possible values 0, 1, 2, . . . , n − 1 and probability parameter π.

Similarly:

E(X(X − 1)) = Σ_{x=0}^{n} x(x − 1) C(n, x) π^x (1 − π)^{n−x}
            = Σ_{x=2}^{n} [x(x − 1)n! / (x!(n − x)!)] π^x (1 − π)^{n−x}
            = n(n − 1)π² Σ_{x=2}^{n} [(n − 2)! / ((x − 2)!(n − x)!)] π^{x−2} (1 − π)^{n−x}
            = n(n − 1)π² Σ_{y=0}^{n−2} [(n − 2)! / (y!(n − y − 2)!)] π^y (1 − π)^{n−y−2}

with y = x − 2. Now let m = n − 2, so:

E(X(X − 1)) = n(n − 1)π² Σ_{y=0}^{m} [m! / (y!(m − y)!)] π^y (1 − π)^{m−y} = n(n − 1)π²

since the summation is 1, as before.


Finally, noting Practice question 2 in Appendix C:

Var(X) = E(X(X − 1)) − E(X)(E(X) − 1)
       = n(n − 1)π² − nπ(nπ − 1)
       = −nπ² + nπ
       = nπ(1 − π).

5. (a) A rate of 150 cars per hour is a rate of 2.5 per minute. Using a Poisson
distribution with λ = 2.5, P (none passes) = e−2.5 × (2.5)0 /0! = e−2.5 = 0.0821.

(b) The expected number of cars passing in two minutes is 2 × 2.5 = 5.

(c) The probability of 5 cars passing in two minutes is e^{−5} × 5⁵/5! = 0.1755.

6. (a) Let X denote the number of fish caught, such that X ∼ Pois(λ).
P(X = x) = e^{−λ}λ^x/x! where the parameter λ is as yet unknown, so
P(X = 0) = e^{−λ}λ⁰/0! = e^{−λ}.
However, we know P (X = 0) = π. So e−λ = π giving −λ = ln(π) and
λ = ln(1/π).

(b) James will take home the last fish caught if he catches 1, 3, 5, . . . fish. So we
require:

P(X = 1) + P(X = 3) + P(X = 5) + · · · = e^{−λ}λ/1! + e^{−λ}λ³/3! + e^{−λ}λ⁵/5! + · · ·
                                       = e^{−λ}(λ/1! + λ³/3! + λ⁵/5! + · · ·).

Now we know:

e^{λ} = 1 + λ + λ²/2! + λ³/3! + · · ·

and:

e^{−λ} = 1 − λ + λ²/2! − λ³/3! + · · · .

Subtracting gives:

e^{λ} − e^{−λ} = 2(λ + λ³/3! + λ⁵/5! + · · ·).

Hence the required probability is:

e^{−λ}(e^{λ} − e^{−λ})/2 = (1 − e^{−2λ})/2 = (1 − π²)/2

since e^{−λ} = π above gives e^{−2λ} = π².


L.5 Appendix E – Multivariate random variables


1. (a) Recall that for any random variable U , we have Var(U ) = E(U 2 ) − (E(U ))2 ,
E(kU ) = k E(U ), where k is a constant, and for random variables U and V ,
E(U + V ) = E(U ) + E(V ), and also Cov(U, V ) = E(U V ) − E(U ) E(V ). We
have:

Cov(X + Y, X − Y ) = E((X + Y )(X − Y )) − E(X + Y ) E(X − Y )


= E(X 2 − XY + Y X − Y 2 ) − (E(X) + E(Y ))(E(X) − E(Y ))
= E(X 2 ) − E(Y 2 ) − (E(X))2 + E(X) E(Y ) − E(Y ) E(X) + (E(Y ))2
= E(X 2 ) − (E(X))2 − (E(Y 2 ) − (E(Y ))2 )
= Var(X) − Var(Y )

as required.
(b) We have:

Cov(a + bX, c + dY ) = E((a + bX)(c + dY )) − E(a + bX) E(c + dY )


= E(ac + adY + bcX + bdXY ) − (a + b E(X))(c + d E(Y ))
= ac + ad E(Y ) + bc E(X) + bd E(XY )
− ac − ad E(Y ) − bc E(X) − bd E(X) E(Y )
= bd E(XY ) − bd E(X) E(Y )
= bd(E(XY ) − E(X) E(Y ))
= bd Cov(X, Y )

as required.

2. (a) We have:

E(Σ_{i=1}^{k} ai Xi) = Σ_{i=1}^{k} E(ai Xi) = Σ_{i=1}^{k} ai E(Xi).

(b) We have:
Var(Σ_{i=1}^{k} ai Xi) = E((Σ_{i=1}^{k} ai Xi − Σ_{i=1}^{k} ai E(Xi))²) = E((Σ_{i=1}^{k} ai(Xi − E(Xi)))²)

                       = Σ_{i=1}^{k} ai² E((Xi − E(Xi))²) + Σ_{1≤i≠j≤k} ai aj E((Xi − E(Xi))(Xj − E(Xj)))

                       = Σ_{i=1}^{k} ai² Var(Xi) + Σ_{1≤i≠j≤k} ai aj E(Xi − E(Xi)) E(Xj − E(Xj)) = Σ_{i=1}^{k} ai² Var(Xi).


Additional note: remember there are two ways to compute the variance:
Var(X) = E((X − µ)2 ) and Var(X) = E(X 2 ) − (E(X))2 . The former is more
convenient for analytical derivations/proofs (see above), while the latter should be
used to compute variances for common distributions such as Poisson or exponential
distributions. Actually it is rather difficult to compute the variance for a Poisson
distribution using the formula Var(X) = E((X − µ)2 ) directly.

3. (a) The joint distribution table is:


X=x
0 1 2
0 0 A 2A
Y =y 1 A 2A 3A
2 2A 3A 4A
Since Σ_{∀x} Σ_{∀y} pX,Y(x, y) = 1, we have A = 1/18.

(b) The marginal distribution of X (similarly of Y ) is:


X=x 0 1 2
P (X = x) 3A = 1/6 6A = 1/3 9A = 1/2
(c) The distribution of X | Y = 1 is:
X = x|y = 1 0 1 2
PX|Y =1 (X = x | y = 1) A/6A = 1/6 2A/6A = 1/3 3A/6A = 1/2
Hence E(X | Y = 1) = (0 × 1/6) + (1 × 1/3) + (2 × 1/2) = 4/3.
(d) Even though the distributions of X and X | Y = 1 are the same, X and Y are
not independent. For example, P(X = 0, Y = 0) = 0 although P(X = 0) ≠ 0 and P(Y = 0) ≠ 0.

L.6 Appendix F – Sampling distributions of statistics


1. (a) The sum of n independent Bernoulli random variables, each with success
probability π, is Bin(n, π). Here n = 4 and π = 0.2, so Σ_{i=1}^{4} Xi ∼ Bin(4, 0.2).
(b) The possible values of Σ Xi are 0, 1, 2, 3 and 4, and their probabilities can be calculated from the binomial distribution. For example:

P(Σ_{i=1}^{4} Xi = 1) = 4 C1 × (0.2)¹(0.8)³ = 4 × 0.2 × 0.512 = 0.4096.

The other probabilities are shown in the table below.

Since X̄ = Σ Xi/4, the possible values of X̄ are 0, 0.25, 0.5, 0.75 and 1. Their probabilities are the same as those of the corresponding values of Σ Xi. For example, P(X̄ = 0.25) = P(Σ Xi = 1) = 0.4096. The values and their probabilities are:

X̄ = x̄        0.0      0.25     0.50     0.75     1.0
P(X̄ = x̄)     0.4096   0.4096   0.1536   0.0256   0.0016
(c) For Xi ∼ Bernoulli(π), E(Xi ) = π and Var(Xi ) = π(1 − π). Therefore, the
approximate normal sampling distribution of X̄, derived from the central limit
theorem, is N (π, π(1 − π)/n). Here this is:
 
N(0.2, 0.2 × 0.8/100) = N(0.2, 0.0016) = N(0.2, (0.04)²).

Therefore, the probability requested by the question is approximately:

P(X̄ > 0.3) = P((X̄ − 0.2)/0.04 > (0.3 − 0.2)/0.04) = P(Z > 2.5) = 0.0062
using Table 4 of the New Cambridge Statistical Tables. This is very close to the
probability obtained from the exact sampling distribution, which is about
0.0061.
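As a quick numerical check, the following sketch (assuming the question's sample size of n = 100, and assuming scipy is available; it is not part of the course materials) compares the exact binomial tail probability with the normal approximation used above.

from scipy.stats import binom, norm

n, pi = 100, 0.2

# Exact: P(X-bar > 0.3) = P(sum of the X_i > 30) under Bin(100, 0.2)
exact = 1 - binom.cdf(30, n, pi)

# CLT approximation: X-bar ~ N(0.2, 0.2*0.8/100) = N(0.2, 0.04^2)
approx = norm.sf(0.3, loc=pi, scale=(pi * (1 - pi) / n) ** 0.5)

print(exact, approx)   # the guide quotes about 0.0061 (exact) and 0.0062 (approximation)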

2. (a) Let {X1 , X2 , . . . , Xn } denote the random sample. We know that the sampling
distribution of X̄ is N (µ, σ²/n), here N (4, 2²/20) = N (4, 0.2).
i. The probability we need is:
$$P(\bar{X} > 5) = P\left(\frac{\bar{X} - 4}{\sqrt{0.2}} > \frac{5 - 4}{\sqrt{0.2}}\right) = P(Z > 2.24) = 0.0126$$
where, as usual, Z ∼ N (0, 1).
ii. P (X̄ < 3) is obtained similarly. Note that this leads to
P (Z < −2.24) = 0.0126, which is equal to the P (X̄ > 5) = P (Z > 2.24)
result obtained above. This is because 5 is one unit above the mean µ = 4,
and 3 is one unit below the mean, and because the normal distribution is
symmetric around its mean.
iii. One way of expressing this is:
P (X̄ − µ > 1) = P (X̄ − µ < −1) = 0.0126
for µ = 4. This also shows that:
P (X̄ − µ > 1) + P (X̄ − µ < −1) = P (|X̄ − µ| > 1) = 2 × 0.0126 = 0.0252
and hence:
P (|X̄ − µ| ≤ 1) = 1 − 2 × 0.0126 = 0.9748.
In other words, the probability is 0.9748 that the sample mean is within
one unit of the true population mean, µ = 4.
(b) We can use the same ideas as in (a). Since X̄ ∼ N(µ, 4/n) we have:
$$P(|\bar{X} - \mu| \le 0.5) = 1 - 2 \times P(\bar{X} - \mu > 0.5) = 1 - 2 \times P\left(\frac{\bar{X} - \mu}{\sqrt{4/n}} > \frac{0.5}{\sqrt{4/n}}\right) = 1 - 2 \times P(Z > 0.25\sqrt{n}) \ge 0.95$$
which holds if:
$$P(Z > 0.25\sqrt{n}) \le \frac{0.05}{2} = 0.025.$$
From Table 4 of the New Cambridge Statistical Tables, we see that this is true
when $0.25\sqrt{n} \ge 1.96$, i.e. when $n \ge (1.96/0.25)^2 = 61.5$. Rounding up to the
nearest integer, we get n ≥ 62. The sample size should be at least 62 for us to
be 95% confident that the sample mean will be within 0.5 units of the true
mean, µ.
(c) Here n > 62, yet x̄ is further than 0.5 units from the claimed mean of µ = 5.
Based on the result in (b), this would be quite unlikely if µ is really 5. One
explanation of this apparent contradiction is that µ is not really equal to 5.
This kind of reasoning will be the basis of statistical hypothesis testing, which
will be discussed later in the course.
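A short sketch of the sample-size calculation in part (b), assuming scipy is available, is given below; it simply searches for the smallest n satisfying the coverage requirement.

from scipy.stats import norm

sigma = 2.0          # population standard deviation
half_width = 0.5     # required |x-bar - mu| tolerance
target = 0.95        # required coverage probability

n = 1
while True:
    # P(|X-bar - mu| <= 0.5) when X-bar ~ N(mu, sigma^2 / n)
    coverage = 1 - 2 * norm.sf(half_width / (sigma / n ** 0.5))
    if coverage >= target:
        break
    n += 1

print(n)   # the guide's answer is n = 62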

3. (a) The sample average is composed of 25 randomly sampled observations, which are
subject to sampling variability, hence the average is also subject to this
variability. Its sampling distribution describes its probability properties. If a
large number of such averages were independently sampled, then their
histogram would approximate the sampling distribution.
(b) It is reasonable to assume that this sampling distribution is normal due to the
CLT, although the sample size is rather small. If n = 25, µ = 54 and
σ = 10, then the CLT says that:
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) = N\left(54, \frac{100}{25}\right).$$
(c) i. We have:
$$P(\bar{X} > 60) = P\left(Z > \frac{60 - 54}{\sqrt{100/25}}\right) = P(Z > 3) = 0.0013$$
using Table 4 of the New Cambridge Statistical Tables.
ii. We are asked for:
$$P(0.95 \times 54 < \bar{X} < 1.05 \times 54) = P\left(\frac{-0.05 \times 54}{2} < Z < \frac{0.05 \times 54}{2}\right) = P(-1.35 < Z < 1.35) = 0.8230$$
using Table 4 of the New Cambridge Statistical Tables.
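The probabilities in part (c) can be reproduced directly from the normal distribution; a minimal sketch (assuming scipy) is:

from scipy.stats import norm

mu, sigma, n = 54, 10, 25
se = sigma / n ** 0.5   # standard error of the sample mean, here 2

print(norm.sf(60, loc=mu, scale=se))                # P(X-bar > 60), about 0.0013
print(norm.cdf(1.05 * mu, loc=mu, scale=se)
      - norm.cdf(0.95 * mu, loc=mu, scale=se))      # about 0.823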

L.7 Appendix G – Point estimation


1. We have:
$$\mathrm{E}(X) = \mathrm{E}\left(\frac{X_1}{2} + \frac{X_2}{2}\right) = \frac{1}{2}\,\mathrm{E}(X_1) + \frac{1}{2}\,\mathrm{E}(X_2) = \frac{1}{2}\mu + \frac{1}{2}\mu = \mu$$

and:
$$\mathrm{E}(Y) = \mathrm{E}\left(\frac{X_1}{3} + \frac{2X_2}{3}\right) = \frac{1}{3}\,\mathrm{E}(X_1) + \frac{2}{3}\,\mathrm{E}(X_2) = \frac{1}{3}\mu + \frac{2}{3}\mu = \mu.$$
It follows that both estimators are unbiased estimators of µ.

2. The formula for S² is:
$$S^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right).$$
This means that $(n-1)S^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2$, hence:
$$\mathrm{E}((n-1)S^2) = (n-1)\,\mathrm{E}(S^2) = \mathrm{E}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right) = n\,\mathrm{E}(X_i^2) - n\,\mathrm{E}(\bar{X}^2).$$
Because the sample is random, E(Xi²) = E(X²) for all i = 1, 2, . . . , n as all the
variables are identically distributed. From the standard formula
Var(X) = σ² = E(X²) − µ², so (using the hint):
$$\mathrm{E}(X^2) = \sigma^2 + \mu^2 \quad \text{and} \quad \mathrm{E}(\bar{X}^2) = \mu^2 + \frac{\sigma^2}{n}.$$
Hence:
$$(n-1)\,\mathrm{E}(S^2) = n(\sigma^2 + \mu^2) - n\left(\mu^2 + \frac{\sigma^2}{n}\right) = (n-1)\sigma^2$$
so E(S²) = σ², which means that S² is an unbiased estimator of σ², as stated.
The standard formula for Var(X), applied to S, states that:
$$\mathrm{E}(S^2) = \operatorname{Var}(S) + (\mathrm{E}(S))^2$$
which means that:
$$\mathrm{E}(S) = \sqrt{\mathrm{E}(S^2) - \operatorname{Var}(S)} = \sqrt{\sigma^2 - \operatorname{Var}(S)} < \sigma = \sqrt{\sigma^2}$$
since all variances are strictly positive. It follows that S is a biased estimator of σ
(with its average value lower than the true value σ).
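A simulation sketch of this result (purely illustrative, using numpy with arbitrary choices of µ, σ and n) is:

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 10, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))
s2 = samples.var(axis=1, ddof=1)   # sample variance S^2 with divisor n - 1
s = np.sqrt(s2)                    # sample standard deviation S

print(s2.mean())   # close to sigma^2 = 4: S^2 is unbiased
print(s.mean())    # noticeably below sigma = 2: S is biased downwards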

3. As defined, R is a random variable, and R ∼ Bin(n, π), so that E(R) = nπ and
hence E(R/n) = π. It also follows that:
$$1 - \mathrm{E}\left(\frac{R}{n}\right) = \mathrm{E}\left(1 - \frac{R}{n}\right) = \mathrm{E}\left(\frac{n - R}{n}\right) = 1 - \pi.$$
So the first obvious guess is that we should try R/n × (1 − R/n) = R/n − (R/n)². Now:
$$n\pi(1 - \pi) = \operatorname{Var}(R) = \mathrm{E}(R^2) - (\mathrm{E}(R))^2 = \mathrm{E}(R^2) - (n\pi)^2.$$


So:
$$\mathrm{E}\left(\left(\frac{R}{n}\right)^{2}\right) = \frac{1}{n^2}\,\mathrm{E}(R^2) = \frac{1}{n^2}\left(n\pi(1 - \pi) + n^2\pi^2\right)$$
$$\Rightarrow \quad \mathrm{E}\left(\frac{R}{n} - \left(\frac{R}{n}\right)^{2}\right) = \frac{1}{n}\,\mathrm{E}(R) - \frac{1}{n^2}\left(n\pi(1 - \pi) + n^2\pi^2\right) = \frac{n\pi}{n} - \frac{n^2\pi^2}{n^2} - \frac{\pi(1 - \pi)}{n} = \pi - \pi^2 - \frac{\pi(1 - \pi)}{n}.$$
However, π(1 − π) = π − π², so:
$$\mathrm{E}\left(\frac{R}{n} - \left(\frac{R}{n}\right)^{2}\right) = \pi(1 - \pi) - \frac{\pi(1 - \pi)}{n} = \pi(1 - \pi) \times \frac{n - 1}{n}.$$
It follows that:
$$\pi(1 - \pi) = \frac{n}{n-1} \times \mathrm{E}\left(\frac{R}{n} - \left(\frac{R}{n}\right)^{2}\right) = \mathrm{E}\left(\frac{R}{n-1} - \frac{R^2}{n(n-1)}\right).$$
So we have found the unbiased estimator of π(1 − π), but it could do with tidying
up! When this is done, we see that:
$$\frac{R(n - R)}{n(n-1)}$$
is the required unbiased estimator of π(1 − π).
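A quick simulation check of this estimator (illustrative only, using numpy with an arbitrary n and π) is sketched below.

import numpy as np

rng = np.random.default_rng(1)
n, pi, reps = 20, 0.3, 500_000

R = rng.binomial(n, pi, size=reps)
estimates = R * (n - R) / (n * (n - 1))

print(estimates.mean())   # close to pi * (1 - pi)
print(pi * (1 - pi))      # 0.21 here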

4. For T1 :
$$\mathrm{E}(T_1) = \mathrm{E}\left(\frac{S_{xx}}{n-1}\right) = \frac{1}{n-1}\,\mathrm{E}(S_{xx}) = \frac{1}{n-1} \times (n-1)\sigma^2 = \sigma^2.$$
Hence T1 is an unbiased estimator of σ². Turning to the variance:
$$\operatorname{Var}(T_1) = \operatorname{Var}\left(\frac{S_{xx}}{n-1}\right) = \left(\frac{1}{n-1}\right)^{2} \times \operatorname{Var}(S_{xx}) = \left(\frac{1}{n-1}\right)^{2} \times 2\sigma^4(n-1) = \frac{2\sigma^4}{n-1}.$$
By definition, $\mathrm{MSE}(T_1) = \operatorname{Var}(T_1) + (\mathrm{Bias}(T_1))^2 = 2\sigma^4/(n-1) + 0^2 = 2\sigma^4/(n-1)$.
For T2 :
$$\mathrm{E}(T_2) = \mathrm{E}\left(\frac{S_{xx}}{n}\right) = \frac{1}{n}\,\mathrm{E}(S_{xx}) = \frac{1}{n} \times (n-1)\sigma^2 = \left(1 - \frac{1}{n}\right)\sigma^2.$$
It follows that Bias(T2 ) = −σ²/n, hence T2 is a biased estimator of σ².
$$\operatorname{Var}(T_2) = \operatorname{Var}\left(\frac{S_{xx}}{n}\right) = \left(\frac{1}{n}\right)^{2} \times \operatorname{Var}(S_{xx}) = \left(\frac{1}{n}\right)^{2} \times 2\sigma^4(n-1) = \frac{2(n-1)\sigma^4}{n^2}.$$


By definition, $\mathrm{MSE}(T_2) = 2(n-1)\sigma^4/n^2 + (-\sigma^2/n)^2 = (2n-1)\sigma^4/n^2$.
It can be seen that MSE(T1 ) > MSE(T2 ) since:
$$\frac{2}{n-1} - \frac{2n-1}{n^2} = \frac{2n^2 - (n-1)(2n-1)}{n^2(n-1)} = \frac{2n^2 - (2n^2 - 3n + 1)}{n^2(n-1)} = \frac{3n-1}{n^2(n-1)} > 0.$$
So, although T2 is a biased estimator of σ², it is preferable to T1 due to the
dominating effect of its smaller variance.
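A short simulation comparing the two estimators' mean squared errors (illustrative only, assuming normally distributed data and arbitrary values of σ and n) is sketched below.

import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 0.0, 3.0, 8, 200_000

x = rng.normal(mu, sigma, size=(reps, n))
sxx = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

t1 = sxx / (n - 1)   # unbiased estimator of sigma^2
t2 = sxx / n         # biased but lower-variance estimator

mse1 = ((t1 - sigma ** 2) ** 2).mean()
mse2 = ((t2 - sigma ** 2) ** 2).mean()
print(mse1, mse2)    # mse2 should be the smaller of the two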

5. (a) We start off with the sum of squares function:
$$S = \sum_{i=1}^{4} \varepsilon_i^2 = (y_1 - \alpha - \beta)^2 + (y_2 + \alpha - \beta)^2 + (y_3 - \alpha + \beta)^2 + (y_4 + \alpha + \beta)^2.$$
Now take the partial derivatives:
$$\frac{\partial S}{\partial \alpha} = -2(y_1 - \alpha - \beta) + 2(y_2 + \alpha - \beta) - 2(y_3 - \alpha + \beta) + 2(y_4 + \alpha + \beta) = -2(y_1 - y_2 + y_3 - y_4) + 8\alpha$$
and:
$$\frac{\partial S}{\partial \beta} = -2(y_1 - \alpha - \beta) - 2(y_2 + \alpha - \beta) + 2(y_3 - \alpha + \beta) + 2(y_4 + \alpha + \beta) = -2(y_1 + y_2 - y_3 - y_4) + 8\beta.$$
The least squares estimators $\hat{\alpha}$ and $\hat{\beta}$ are the solutions to ∂S/∂α = 0 and ∂S/∂β = 0. Hence:
$$\hat{\alpha} = \frac{y_1 - y_2 + y_3 - y_4}{4} \quad \text{and} \quad \hat{\beta} = \frac{y_1 + y_2 - y_3 - y_4}{4}.$$
(b) $\hat{\alpha}$ is an unbiased estimator of α since:
$$\mathrm{E}(\hat{\alpha}) = \mathrm{E}\left(\frac{y_1 - y_2 + y_3 - y_4}{4}\right) = \frac{\alpha + \beta + \alpha - \beta + \alpha - \beta + \alpha + \beta}{4} = \alpha.$$
$\hat{\beta}$ is an unbiased estimator of β since:
$$\mathrm{E}(\hat{\beta}) = \mathrm{E}\left(\frac{y_1 + y_2 - y_3 - y_4}{4}\right) = \frac{\alpha + \beta - \alpha + \beta - \alpha + \beta + \alpha + \beta}{4} = \beta.$$
(c) We have:
$$\operatorname{Var}(\hat{\alpha}) = \operatorname{Var}\left(\frac{y_1 - y_2 + y_3 - y_4}{4}\right) = \frac{4\sigma^2}{16} = \frac{\sigma^2}{4}.$$
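The estimators in part (a) can also be obtained by writing the model in matrix form and using an off-the-shelf least-squares routine; the sketch below (illustrative only, using numpy and made-up values of y) reproduces the formulas for α̂ and β̂ above.

import numpy as np

# Design matrix implied by E(y1) = alpha + beta, E(y2) = -alpha + beta,
# E(y3) = alpha - beta, E(y4) = -alpha - beta.
X = np.array([[1.0, 1.0],
              [-1.0, 1.0],
              [1.0, -1.0],
              [-1.0, -1.0]])
y = np.array([2.3, 1.1, 0.4, -1.8])   # hypothetical observations

alpha_hat, beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(alpha_hat, (y[0] - y[1] + y[2] - y[3]) / 4)   # identical
print(beta_hat, (y[0] + y[1] - y[2] - y[3]) / 4)    # identical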


L.8 Appendix H – Interval estimation


1. (a) The total value of the stock is 9,875µ, where µ is the mean value of an item of
stock. From Chapter 6, X̄ is the obvious estimator of µ, so 9,875X̄ is the
obvious estimator of 9,875µ. Therefore, an estimate for the total value of the
stock is 9,875 × 320.41 = £3,160,000 (to the nearest £10,000).
(b) In this question n = 50 is large, and σ 2 is unknown so a 95% confidence
interval for µ is:
$$\bar{x} \pm 1.96 \times \frac{s}{\sqrt{n}} = 320.41 \pm 1.96 \times \frac{40.6}{\sqrt{50}} = 320.41 \pm 11.25 \ \Rightarrow\ (£309.16,\ £331.66).$$
Note that because n is large we have used the standard normal distribution. It
is more accurate to use a t distribution with 49 degrees of freedom. This gives
an interval of (£308.87, £331.95) – not much of a difference.
To obtain a 95% confidence interval for the total value of the stock, 9,875µ,
multiply the interval by 9,875. This gives (to the nearest £10,000):

(£3,050,000, £3,280,000).

(c) Increasing the sample size by a factor of k reduces the width of the confidence
interval by a factor of √k. Therefore, increasing the sample size by a factor of
4 will reduce the width of the confidence interval by a factor of 2 (= √4).
Hence we need to increase the sample size from 50 to 4 × 50 = 200. So we
should collect another 150 observations.
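A sketch of the interval calculations in parts (a) and (b), assuming scipy for the t-based version mentioned in the note, is:

from scipy.stats import norm, t

n, xbar, s, items = 50, 320.41, 40.6, 9875
se = s / n ** 0.5

# Normal-based 95% confidence interval for the mean value per item
z = norm.ppf(0.975)
lo, hi = xbar - z * se, xbar + z * se
print(lo, hi)                          # about (309.16, 331.66)

# t-based version with 49 degrees of freedom, as noted above
tq = t.ppf(0.975, df=n - 1)
print(xbar - tq * se, xbar + tq * se)  # about (308.87, 331.95)

# Scaling up to the total stock value
print(items * lo, items * hi)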

2. (a) Let π be the proportion of shareholders in the population. Start by estimating
π. We are estimating a proportion and n is large, so an approximate 95%
confidence interval for π is, using the central limit theorem:
$$\hat{\pi} \pm 1.96 \times \sqrt{\frac{\hat{\pi}(1 - \hat{\pi})}{n}} \ \Rightarrow\ 0.23 \pm 1.96 \times \sqrt{\frac{0.23 \times 0.77}{954}} = 0.23 \pm 0.027 \ \Rightarrow\ (0.203,\ 0.257).$$
Therefore, a 95% confidence interval for the number (rather than the
proportion) of shareholders in the UK is obtained by multiplying the above
interval endpoints by 41 million and getting the answer 8.3 million to 10.5
million. An alternative way of expressing this is:

9,400,000 ± 1,100,000 ⇒ (8,300,000, 10,500,000).

Therefore, we estimate there are about 9.4 million shareholders in the UK,
with a margin of error of 1.1 million.
(b) Let us start by finding a 95% confidence interval for the difference in the two
proportions. We use the formula:
$$\hat{\pi}_1 - \hat{\pi}_2 \pm 1.96 \times \sqrt{\frac{\hat{\pi}_1(1 - \hat{\pi}_1)}{n_1} + \frac{\hat{\pi}_2(1 - \hat{\pi}_2)}{n_2}}.$$


The estimates of the proportions π1 and π2 are 0.23 and 0.171, respectively.
We know n1 = 954 and although n2 is unknown we can assume it is
approximately equal to 954 (noting the ‘similar’ in the question), so an
approximate 95% confidence interval is:
$$0.23 - 0.171 \pm 1.96 \times \sqrt{\frac{0.23 \times 0.77}{954} + \frac{0.171 \times 0.829}{954}} = 0.059 \pm 0.036 \ \Rightarrow\ (0.023,\ 0.094).$$
By multiplying by 41 million, we get a confidence interval of:

2,400,000 ± 1,500,000 ⇒ (900,000, 3,900,000).

We estimate that the number of shareholders has increased by about 2.4


million in the two years. There is quite a large margin of error, i.e. 1.5 million,
especially when compared with a point estimate (i.e. interval midpoint) of 2.4
million.
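Both intervals can be reproduced with a few lines of code; a sketch (assuming scipy and the question's figures) is:

from scipy.stats import norm

z = norm.ppf(0.975)
p1, n1 = 0.23, 954       # more recent survey
p2, n2 = 0.171, 954      # earlier survey, sample size assumed to be similar

# 95% CI for the current proportion of shareholders
se1 = (p1 * (1 - p1) / n1) ** 0.5
print(p1 - z * se1, p1 + z * se1)                      # about (0.203, 0.257)

# 95% CI for the difference in proportions
se_diff = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
print(p1 - p2 - z * se_diff, p1 - p2 + z * se_diff)    # about (0.023, 0.094)

# Convert to numbers of shareholders (multiplying by 41 million, as above)
print(41e6 * (p1 - p2 - z * se_diff), 41e6 * (p1 - p2 + z * se_diff))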

L.9 Appendix I – Hypothesis testing


1. (a) We have n = 50 and σ = 1. We wish to test:
H0 : µ = 0.65 (sample is from ‘B’) vs. H1 : µ = 0.80 (sample is from ‘A’).
The decision rule is that we reject H0 if x̄ > 0.75.
The probability of a Type I error is:
$$P(\bar{X} > 0.75 \mid H_0) = P\left(Z > \frac{0.75 - 0.65}{1/\sqrt{50}}\right) = P(Z > 0.71) = 0.2389.$$
The probability of a Type II error is:
$$P(\bar{X} < 0.75 \mid H_1) = P\left(Z < \frac{0.75 - 0.80}{1/\sqrt{50}}\right) = P(Z < -0.35) = 0.3632.$$

(b) To find the sample size n and the value a, we need to solve two conditions:
• $\alpha = P(\bar{X} > a \mid H_0) = P\left(Z > \dfrac{a - 0.65}{1/\sqrt{n}}\right) = 0.05 \ \Rightarrow\ \dfrac{a - 0.65}{1/\sqrt{n}} = 1.645$.
• $\beta = P(\bar{X} < a \mid H_1) = P\left(Z < \dfrac{a - 0.80}{1/\sqrt{n}}\right) = 0.10 \ \Rightarrow\ \dfrac{a - 0.80}{1/\sqrt{n}} = -1.28$.
Solving these equations gives a = 0.734 and n = 381, remembering to round up!
(c) A sample is classified as being from A if x̄ > 0.75. We have:
$$\alpha = P(\bar{X} > 0.75 \mid H_0) = P\left(Z > \frac{0.75 - 0.65}{1/\sqrt{n}}\right) = 0.02 \ \Rightarrow\ \frac{0.75 - 0.65}{1/\sqrt{n}} = 2.05.$$
Solving this equation gives n = 421, remembering to round up! Therefore:
$$\beta = P(\bar{X} < 0.75 \mid H_1) = P\left(Z < \frac{0.75 - 0.80}{1/\sqrt{421}}\right) = P(Z < -1.026) = 0.1515.$$


(d) The rule in (b) is ‘take n = 381 and reject H0 if x̄ > 0.734’. So:
$$P(\bar{X} > 0.734 \mid \mu = 0.7) = P\left(Z > \frac{0.734 - 0.7}{1/\sqrt{381}}\right) = P(Z > 0.66) = 0.2546.$$
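The simultaneous equations in part (b) are easy to solve numerically; a sketch (assuming scipy) is:

import math
from scipy.stats import norm

z_alpha = norm.ppf(0.95)   # 1.645, for alpha = 0.05
z_beta = norm.ppf(0.90)    # 1.28, for beta = 0.10
mu0, mu1, sigma = 0.65, 0.80, 1.0

# From (a - mu0)*sqrt(n) = z_alpha*sigma and (mu1 - a)*sqrt(n) = z_beta*sigma:
root_n = (z_alpha + z_beta) * sigma / (mu1 - mu0)
n = math.ceil(root_n ** 2)
a = mu0 + z_alpha * sigma / root_n

print(n, a)   # close to the guide's n = 381 and a = 0.734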

2. (a) We have:
$$\alpha = 1 - P(21.97 < \bar{X} < 22.03 \mid \mu = 22)$$
$$= 1 - P\left(\frac{21.97 - 22}{0.08/\sqrt{50}} < Z < \frac{22.03 - 22}{0.08/\sqrt{50}}\right)$$
$$= 1 - P(-2.65 < Z < 2.65) = 1 - 0.992 = 0.008.$$
(b) We have:
$$\beta = P(21.97 < \bar{X} < 22.03 \mid \mu = 22.05)$$
$$= P\left(\frac{21.97 - 22.05}{0.08/\sqrt{50}} < Z < \frac{22.03 - 22.05}{0.08/\sqrt{50}}\right)$$
$$= P(-7.07 < Z < -1.77) = P(Z < -1.77) - P(Z < -7.07) = 0.0384.$$
(c) We have:
$$P(\text{rejecting } H_0 \mid \mu = 22.01) = 1 - P(21.97 < \bar{X} < 22.03 \mid \mu = 22.01)$$
$$= 1 - P\left(\frac{21.97 - 22.01}{0.08/\sqrt{50}} < Z < \frac{22.03 - 22.01}{0.08/\sqrt{50}}\right)$$
$$= 1 - P(-3.53 < Z < 1.77) = 1 - (P(Z < 1.77) - P(Z < -3.53)) = 1 - (0.9616 - 0.00023) = 0.0386.$$

3. (a) We are to test H0 : µ = 12 vs. H1 : µ ≠ 12. The key points here are that n is
small and that σ² is unknown. We can use the t test and this is valid provided
the data are normally distributed. The test statistic value is:
$$t = \frac{\bar{x} - 12}{s/\sqrt{7}} = \frac{12.7 - 12}{0.858/\sqrt{7}} = 2.16.$$


This is compared to a Student’s t distribution on 6 degrees of freedom. The


critical value corresponding to a 5% significance level is 2.447. Hence we
cannot reject the null hypothesis at the 5% significance level. (We can reject at
the 10% significance level, but the convention on this course is to regard such
evidence merely as casting doubt on H0 , rather than justifying rejection as
such, i.e. such a result would be ‘weakly significant’.)

(b) We are to test H0 : µ = 12 vs. H1 : µ < 12. There is no need to do a formal


statistical test. As the sample mean is 12.7, which is greater than 12, there is
no evidence whatsoever for the alternative hypothesis.

In (a) you are asked to do a two-sided test and in (b) it is a one-sided test. Which
is more appropriate will depend on the purpose of the experiment, and your
suspicions before you conduct it.
• If you suspected before collecting the data that the mean voltage was less than
12 volts, the one-sided test would be appropriate.
• If you had no prior reason to believe that the mean was less than 12 volts you
would perform a two-sided test.
• General rule: decide on whether it is a one- or two-sided test before performing
the statistical test!

4. It is useful to discuss the issues about this question before giving the solution.
• We want to know whether a loyalty programme such as that at the 12 selected
restaurants would result in an increase in mean profits greater than that
observed (during the three-month test) at the other sites within the chain.
• So we can model the profits across the chain as $1,047.34 + x, where $x is the
supposed effect of the promotion, and if the true mean value of x is µ, then we
wish to test:
H0 : µ = 0 vs. H1 : µ > 0
which is a one-tailed test since, clearly, there are (preliminary) grounds for
thinking that there is an increase due to the loyalty programme.
• We know nothing about the variability of profits across the rest of the chain,
so we will have to use the sample data, i.e. to calculate the sample variance
and to employ the t distribution with ν = 12 − 1 = 11 degrees of freedom.
• Although we shall want the variance of the data ‘sample value − 1,047.34’,
this will be the same as the variance of the sample data, since for any random
variable X and constant k we have:

Var(X + k) = Var(X)

because in calculating the variance every value (xi − x̄) is ‘replaced’ by


((xi + k) − (x̄ + k)), which is in fact the same value.
• So we need to calculate $\bar{x}$, $\sum_{i=1}^{12} x_i^2$, $S_{xx} = \sum_{i=1}^{12} x_i^2 - n\bar{x}^2$ and $s^2$.
i=1 i=1


The total change in profit for restaurants in the programme is $\sum_{i=1}^{12} x_i = 30{,}113.17$.
Since n = 12, the mean change in profit for restaurants in the programme is:
$$\frac{30{,}113.17}{12} = 2{,}509.431 = 1{,}047.34 + 1{,}462.091$$
hence use x̄ = 1,462.091.
The raw sum of squares is $\sum_{i=1}^{12} x_i^2 = 126{,}379{,}568.8$. So, the ‘corrected’ sum of squares
(computed from the raw data, since the shift by 1,047.34 does not affect it, as noted above) is:
$$S_{xx} = \sum_{i=1}^{12} x_i^2 - n\bar{x}^2 = 126{,}379{,}568.8 - 12 \times (2{,}509.431)^2 = 50{,}812{,}651.51.$$
Therefore:
$$s^2 = \frac{S_{xx}}{n-1} = \frac{50{,}812{,}651.51}{11} = 4{,}619{,}331.956.$$
Hence the estimated standard error is:
$$\frac{s}{\sqrt{n}} = \sqrt{\frac{4{,}619{,}331.956}{12}} = \sqrt{384{,}944.3296} = 620.439.$$
So, the test statistic value is:
$$\frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{1{,}462.091 - 0}{620.439} = 2.3565.$$

The relevant critical values for t11 in this one-tailed test are:

5%: 1.796 and 1%: 2.718.

So we see that the test is significant at the 5% significance level, but not at the 1%
significance level, so reject H0 and conclude that the loyalty programme does have
an effect. (In fact, this means the result is moderately significant that the
programme has had a beneficial effect for the company.)
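The same test can be carried out from the summary statistics alone; a sketch (assuming scipy) is:

from scipy.stats import t

n = 12
sum_x, sum_x2 = 30_113.17, 126_379_568.8
baseline = 1_047.34                      # mean profit change elsewhere in the chain

xbar = sum_x / n - baseline              # mean extra change at the 12 sites
sxx = sum_x2 - n * (sum_x / n) ** 2
s = (sxx / (n - 1)) ** 0.5
t_stat = xbar / (s / n ** 0.5)
p_value = t.sf(t_stat, df=n - 1)         # one-tailed p-value

print(t_stat, p_value)                   # t is about 2.36, significant at 5% but not 1%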

5. (a) We test H0 : µA = µB vs. H1 : µA ≠ µB , where we use a two-tailed test since
there is no prior reason to suggest the direction of the difference, if any. The
test statistic value is:
$$\frac{11.9 - 10.8}{\sqrt{7.3/44 + 6.3/52}} = 2.06$$
where we assume the sample variances are equal to the population variances
due to the large sample sizes (and hence we would expect accurate variance
estimates). For a two-tailed test, this is significant at the 5% significance level
(1.96 < 2.06), but not at the 1% significance level (2.06 < 2.576). We reject H0
and conclude that company A is slower in repair times on average than
company B, with a moderately significant result.
(b) The p-value for this two-tailed test is 2 × P (Z > 2.06) = 0.0394.


(c) For small samples, we should use a pooled estimate of the population standard
deviation:
$$s = \sqrt{\frac{(9 - 1) \times 7.3 + (17 - 1) \times 6.2}{(9 - 1) + (17 - 1)}} = 2.5626 \quad \text{on 24 degrees of freedom.}$$
Hence the test statistic value in this case is:
$$\frac{11.9 - 10.8}{2.5626 \times \sqrt{1/9 + 1/17}} = 1.04.$$

This should be compared with the t24 distribution and is clearly not
significant, even at the 10% significance level. With the smaller samples we fail
to detect the difference.
Comparing the two test statistic calculations shows that the different results
flow from differences in the estimated standard errors, hence ultimately (and
unsurprisingly) from the differences in the sample sizes used in the two
situations.

6. (a) Let π be the population proportion of visitors who would use the device. We
test H0 : π = 0.3 vs. H1 : π < 0.3. The sample proportion is p = 20/80 = 0.25.
The standard error of the sample proportion is $\sqrt{0.3 \times 0.7/80} = 0.0512$. The test
statistic value is:
$$z = \frac{0.25 - 0.30}{0.0512} = -0.976.$$
For a one-sided (lower-tailed) test at the 5% significance level, the critical
value is −1.645, so the test is not significant – and not even at the 10%
significance level (the critical value is −1.282). On the basis of the data, there
is no reason to withdraw the device.
The critical region for the above test is to reject H0 if the sample proportion is
less than 0.3 − 1.645 × 0.0512, i.e. if the sample proportion, p, is less than
0.2157.
(b) The p-value of the test is the probability of the test statistic value or a more
extreme value conditional on H0 being true. Hence the p-value is:

P (Z ≤ −0.976) = 0.1645.

So for any α < 0.1645 we would fail to reject H0 .


(c) The power of the test when π = 0.2 is the conditional probability:
$$P(P < 0.2157 \mid \pi = 0.2).$$
When π = 0.2, the standard error of the sample proportion is
$\sqrt{0.2 \times 0.8/80} = 0.0447$. Therefore, the power when π = 0.2 is:
$$P\left(Z < \frac{0.2157 - 0.2}{0.0447}\right) = P(Z < 0.35) = 0.6368.$$
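A sketch reproducing the test statistic, p-value and power calculation (assuming scipy) is:

from scipy.stats import norm

n, pi0, p_hat = 80, 0.3, 20 / 80

se0 = (pi0 * (1 - pi0) / n) ** 0.5          # standard error under H0
z = (p_hat - pi0) / se0
print(z, norm.cdf(z))                        # about -0.98 and p-value 0.16

# Critical region: reject H0 if the sample proportion is below this cut-off
cutoff = pi0 - norm.ppf(0.95) * se0
print(cutoff)                                # about 0.216

# Power when the true proportion is 0.2
pi1 = 0.2
se1 = (pi1 * (1 - pi1) / n) ** 0.5
print(norm.cdf((cutoff - pi1) / se1))        # about 0.64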


L.10 Appendix J – Analysis of variance


1. (a) For this example, k = 3, n1 = 6, n2 = 5, n3 = 4 and n = n1 + n2 + n3 = 15.
We have x̄·1 = 43.8, x̄·2 = 56.9, x̄·3 = 60.35 and x̄ = 52.58.
Also, $\sum_{j=1}^{3}\sum_{i=1}^{n_j} x_{ij}^2 = 43{,}387.85$.
Total SS $= \sum_{j=1}^{3}\sum_{i=1}^{n_j} x_{ij}^2 - n\bar{x}^2 = 43{,}387.85 - 41{,}469.85 = 1{,}918$.
$w = \sum_{j=1}^{3}\sum_{i=1}^{n_j} x_{ij}^2 - \sum_{j=1}^{3} n_j \bar{x}_{\cdot j}^2 = 43{,}387.85 - 42{,}267.18 = 1{,}120.67$.
Therefore, b = Total SS − w = 1,918 − 1,120.67 = 797.33.
To test H0 : µ1 = µ2 = µ3 , the test statistic value is:
$$f = \frac{b/(k-1)}{w/(n-k)} = \frac{797.33/2}{1{,}120.67/12} = 4.269.$$
Under H0 , F ∼ F2, 12 . Since F0.05, 2, 12 = 3.89 < 4.269, we reject H0 at the 5%
significance level, i.e. there exists evidence indicating that the population mean
expenditures on frozen meals are not the same for the three different income
groups.
(b) The ANOVA table is as follows:
Source DF SS MS F P
Income 2 797.33 398.67 4.269 <0.05
Error 12 1,120.67 93.39
Total 14 1,918.00
(c) A 95% confidence interval for µj is of the form:
$$\bar{X}_{\cdot j} \pm t_{0.025,\, n-k} \times \frac{S}{\sqrt{n_j}} = \bar{X}_{\cdot j} \pm t_{0.025,\, 12} \times \frac{\sqrt{93.39}}{\sqrt{n_j}} = \bar{X}_{\cdot j} \pm \frac{21.056}{\sqrt{n_j}}.$$
For j = 1, a 95% confidence interval is $43.8 \pm 21.056/\sqrt{6} \ \Rightarrow\ (35.20,\ 52.40)$.
For j = 3, a 95% confidence interval is $60.35 \pm 21.056/\sqrt{4} \ \Rightarrow\ (49.82,\ 70.88)$.
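The F statistic and its p-value can be checked directly from the sums of squares given above; a sketch (assuming scipy) is:

from scipy.stats import f

k, n = 3, 15
b, w = 797.33, 1120.67           # between- and within-groups sums of squares

f_stat = (b / (k - 1)) / (w / (n - k))
p_value = f.sf(f_stat, k - 1, n - k)
print(f_stat, p_value)           # about 4.27 and a p-value just below 0.05
print(f.ppf(0.95, k - 1, n - k)) # 5% critical value, about 3.89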

2. (a) Here k = 4 and n1 = n2 = n3 = n4 = 30. We have x̄·1 = 74.10, x̄·2 = 75.67,
x̄·3 = 78.50, x̄·4 = 81.30, b = 909, w = 26,408 and the pooled estimate of σ is
s = 15.09.
Hence the test statistic value is:
$$f = \frac{b/(k-1)}{w/(n-k)} = 1.33.$$
Under H0 : µ1 = µ2 = µ3 = µ4 , F ∼ Fk−1, n−k = F3, 116 . Since
F0.05, 3, 116 = 2.68 > 1.33, we cannot reject H0 at the 5% significance level.
Hence there is no evidence to support the claim that payments among the four
groups are significantly different.


(b) A 95% confidence interval for µj is of the form:
$$\bar{X}_{\cdot j} \pm t_{0.025,\, n-k} \times \frac{S}{\sqrt{n_j}} = \bar{X}_{\cdot j} \pm t_{0.025,\, 116} \times \frac{15.09}{\sqrt{30}} = \bar{X}_{\cdot j} \pm 5.46.$$
For j = 1, a 95% confidence interval is 74.10 ± 5.46 ⇒ (68.64, 79.56).
For j = 4, a 95% confidence interval is 81.30 ± 5.46 ⇒ (75.84, 86.76).

L.11 Appendix K – Linear regression


1. (a) We first calculate x̄ = 4.56, $\sum x_i^2 = 219.46$, ȳ = 30.97, $\sum y_i^2 = 9{,}973.99$ and
$\sum x_i y_i = 1{,}475.1$. The estimated regression coefficients are:
$$\hat{\beta}_1 = \frac{1{,}475.1 - 10 \times 4.56 \times 30.97}{219.46 - 10 \times (4.56)^2} = 5.46 \quad \text{and} \quad \hat{\beta}_0 = 30.97 - 5.46 \times 4.56 = 6.07.$$
The fitted line is:
$$\widehat{\text{Cost}} = 6.07 + 5.46 \times \text{Distance}.$$
In order to perform statistical inference, we need to find:
$$\hat{\sigma}^2 = \frac{\sum_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2}{n - 2} = \left(\sum y_i^2 + n\hat{\beta}_0^2 + \hat{\beta}_1^2 \sum x_i^2 - 2\hat{\beta}_0 \sum y_i - 2\hat{\beta}_1 \sum x_i y_i + 2\hat{\beta}_0 \hat{\beta}_1 \sum x_i\right)\Big/(n - 2)$$
$$= (9{,}973.99 + 10 \times (6.07)^2 + (5.46)^2 \times 219.46 - 2 \times 6.07 \times 309.7 - 2 \times 5.46 \times 1{,}475.1 + 2 \times 6.07 \times 5.46 \times 45.6)/(10 - 2) = 4.95.$$
The estimated standard error of $\hat{\beta}_1$ is:
$$\frac{\sqrt{4.95}}{\sqrt{219.46 - 10 \times (4.56)^2}} = 0.66.$$
Hence a 95% confidence interval for β1 is 5.46 ± 2.306 × 0.66 ⇒ (3.94, 6.98).
(b) To test H0 : β0 = 0 vs. H1 : β0 ≠ 0, we first determine the estimated standard
error of $\hat{\beta}_0$, which is:
$$\frac{\sqrt{4.95}}{\sqrt{10}} \times \left(\frac{219.46}{219.46 - 10 \times (4.56)^2}\right)^{1/2} = 3.07.$$
Therefore, the test statistic value is:
$$\frac{6.07}{3.07} = 1.98.$$
Comparing with the t8 distribution, this is not significant at the 5%
significance level (1.98 < 2.306), but it is significant at the 10% significance
level (1.860 < 1.98).


There is only weak evidence against the null hypothesis. Note though that in
practice this hypothesis is not really of interest. A line through the origin
implies that there is zero cost of a fire which takes place right next to a fire
station. This hypothesis does not seem sensible!
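The coefficient estimates and standard errors can be reproduced from the summary statistics; a sketch (plain Python, no extra libraries) is:

n = 10
sum_x, sum_x2 = 45.6, 219.46       # distances
sum_y, sum_y2 = 309.7, 9973.99     # costs
sum_xy = 1475.1

xbar, ybar = sum_x / n, sum_y / n
sxx = sum_x2 - n * xbar ** 2
beta1 = (sum_xy - n * xbar * ybar) / sxx
beta0 = ybar - beta1 * xbar

# Residual sum of squares via Syy - beta1^2 * Sxx, then sigma-hat^2
syy = sum_y2 - n * ybar ** 2
sigma2_hat = (syy - beta1 ** 2 * sxx) / (n - 2)

se_beta1 = (sigma2_hat / sxx) ** 0.5
se_beta0 = (sigma2_hat * sum_x2 / (n * sxx)) ** 0.5
print(beta0, beta1)          # close to the guide's 6.07 and 5.46 (the guide rounds
                             # beta1 to 5.46 before computing beta0, hence small differences)
print(se_beta0, se_beta1)    # about 3.07 and 0.66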

2. The question implies that we want to explain changes in profitability by the


passage of time. If we let x represent years and y represent profits (in £000s) then
we need to perform a regression of y on x.
(a) We first calculate x̄ = 4.5, $\sum x_i^2 = 204$, ȳ = 41.125, $\sum y_i^2 = 16{,}159$ and
$\sum x_i y_i = 1{,}802$. The estimated regression coefficients are:
$$\hat{\beta}_1 = \frac{1{,}802 - 8 \times 4.5 \times 41.125}{204 - 8 \times (4.5)^2} = 7.65 \quad \text{and} \quad \hat{\beta}_0 = 41.125 - 7.65 \times 4.5 = 6.70.$$
The fitted line is:
$$\widehat{\text{Profit}} = 6.70 + 7.65 \times \text{Year}.$$
In order to perform statistical inference, we need to find:
$$\hat{\sigma}^2 = \sum_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2/(n - 2) = \left(\sum y_i^2 + n\hat{\beta}_0^2 + \hat{\beta}_1^2 \sum x_i^2 - 2\hat{\beta}_0 \sum y_i - 2\hat{\beta}_1 \sum x_i y_i + 2\hat{\beta}_0 \hat{\beta}_1 \sum x_i\right)\Big/(n - 2)$$
$$= (16{,}159 + 8 \times (6.70)^2 + (7.65)^2 \times 204 - 2 \times 6.70 \times 329 - 2 \times 7.65 \times 1{,}802 + 2 \times 6.70 \times 7.65 \times 36)/(8 - 2) = 27.98.$$
The estimated standard error of $\hat{\beta}_1$ is:
$$\frac{\sqrt{27.98}}{\sqrt{204 - 8 \times (4.5)^2}} = 0.82.$$
Hence a 95% confidence interval for β1 is 7.65 ± 2.447 × 0.82 ⇒ (5.64, 9.66).
(b) Substituting x = 9 we find the predicted year 9 profit (in £000s) is 75.55. The
estimated standard error of this prediction is:
$$\sqrt{27.98} \times \left(1 + \frac{204 - 2 \times 9 \times 36 + 8 \times 9^2}{8 \times (204 - 8 \times (4.5)^2)}\right)^{1/2} = 6.71.$$
It follows that (using tn−2 = t6 ) a 95% prediction interval for the predicted
profit (in £000s) is:
$$75.55 \pm 2.447 \times 6.71 \ \Rightarrow\ (59.13,\ 91.97).$$

As 90 is in this prediction interval, we cannot reject the accountant’s forecast


out of hand. However, it is right at the top end of the prediction interval, and
hence seems rather optimistic.


3. (a) We first calculate x̄ = 4.5, $\sum x_i^2 = 204$, ȳ = 434, $\sum y_i^2 = 1{,}938{,}174$ and
$\sum x_i y_i = 19{,}766$. The estimated regression coefficients are:
$$\hat{\beta}_1 = \frac{19{,}766 - 8 \times 4.5 \times 434}{204 - 8 \times (4.5)^2} = 98.62 \quad \text{and} \quad \hat{\beta}_0 = 434 - 98.62 \times 4.5 = -9.79.$$
The fitted line is:
$$\widehat{\text{Expenditure}} = -9.79 + 98.62 \times \text{Year}.$$
In order to perform statistical inference, we need to find:
$$\hat{\sigma}^2 = \sum_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2/(n - 2) = \left(\sum y_i^2 + n\hat{\beta}_0^2 + \hat{\beta}_1^2 \sum x_i^2 - 2\hat{\beta}_0 \sum y_i - 2\hat{\beta}_1 \sum x_i y_i + 2\hat{\beta}_0 \hat{\beta}_1 \sum x_i\right)\Big/(n - 2)$$
$$= (1{,}938{,}174 + 8 \times (-9.79)^2 + (98.62)^2 \times 204 - 2 \times (-9.79) \times 3{,}472 - 2 \times 98.62 \times 19{,}766 + 2 \times (-9.79) \times 98.62 \times 36)/(8 - 2) = 3{,}807.65.$$
The estimated standard error of $\hat{\beta}_1$ is:
$$\frac{\sqrt{3{,}807.65}}{\sqrt{204 - 8 \times (4.5)^2}} = 9.52.$$
Hence a 95% confidence interval for β1 is:
$$98.62 \pm 2.447 \times 9.52 \ \Rightarrow\ (75.32,\ 121.92).$$

(b) The ANOVA table is:


Source DF SS MS F
Regression 1 408,480 408,480 107.269
Residual Error 6 22,846 3,808
Total 7 431,326
Hence R2 = 408,480/431,326 = 0.947.
(c) As R2 is very close to 1, the linear regression model provides a very good fit.
(d) Substituting x = 9 we find the predicted year 9 expenditure (in £000s) is 877.79. The
estimated standard error of this prediction is:
$$\sqrt{3{,}807.65} \times \left(1 + \frac{204 - 2 \times 9 \times 36 + 8 \times 9^2}{8 \times (204 - 8 \times (4.5)^2)}\right)^{1/2} = 78.23.$$
It follows that (using tn−2 = t6 ) a 95% prediction interval for the predicted
expenditure (in £000s) is:
$$877.79 \pm 2.447 \times 78.23 \ \Rightarrow\ (686.36,\ 1{,}069.22).$$


4. We first note E(Y ) = β0 + β1 E(X) and Y − E(Y ) = (X − E(X))β1 + ε. Hence:
$$\operatorname{Cov}(X, Y) = \mathrm{E}\big((X - \mathrm{E}(X))(Y - \mathrm{E}(Y))\big) = \mathrm{E}\big((X - \mathrm{E}(X))(X - \mathrm{E}(X))\beta_1\big) + \mathrm{E}\big((X - \mathrm{E}(X))\varepsilon\big) = \beta_1 \operatorname{Var}(X)$$
since X and ε are uncorrelated, so the second term is zero.

Therefore, β1 = Cov(X, Y )/Var(X). The second equality follows from the fact that
Corr(X, Y ) = Cov(X, Y )/(Var(X) Var(Y ))1/2 .
Also, note that the first equality resembles the estimator:
$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$
although in the simple linear regression model y = β0 + β1 x + ε, x is assumed to be
fixed (to make the inference easier). Otherwise $\hat{\beta}_0$ and $\hat{\beta}_1$ are no longer linear
estimators, for example. The second equality reinforces the fact that β1 > 0 if and
only if x and y are positively correlated.

Appendix M
Formula sheet in the examination

Simple linear regression

Model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$.

LSEs: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, $\hat{\beta}_1 = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \Big/ \sum_{j=1}^{n}(x_j - \bar{x})^2$ and:
$$\operatorname{Var}(\hat{\beta}_0) = \frac{\sigma^2 \sum_{j=1}^{n} x_j^2}{n \sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \operatorname{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \operatorname{Cov}(\hat{\beta}_0, \hat{\beta}_1) = \frac{-\sigma^2 \bar{x}}{\sum_{i=1}^{n}(x_i - \bar{x})^2}.$$

Estimator for the variance of $\varepsilon_i$: $\hat{\sigma}^2 = \sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2/(n-2)$.

Regression ANOVA:
$$\text{Total SS} = \sum_{i=1}^{n}(y_i - \bar{y})^2, \quad \text{Regression SS} = \hat{\beta}_1^2 \sum_{i=1}^{n}(x_i - \bar{x})^2 \quad \text{and} \quad \text{Residual SS} = \sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2.$$

Squared regression correlation coefficients:
$$R^2 = \frac{\text{Regression SS}}{\text{Total SS}} \quad \text{and} \quad R^2_{\text{adj}} = 1 - \frac{(\text{Residual SS})/(n-2)}{(\text{Total SS})/(n-1)}.$$

For a given x, the expectation of y is µ(x) = β0 + β1 x. A 100(1 − α)% confidence
interval for µ(x) is:
$$\hat{\beta}_0 + \hat{\beta}_1 x \pm t_{\alpha/2,\, n-2} \times \hat{\sigma} \times \left(\frac{\sum_{i=1}^{n}(x_i - x)^2}{n \sum_{j=1}^{n}(x_j - \bar{x})^2}\right)^{1/2}$$
and a 100(1 − α)% prediction interval covering y with probability (1 − α) is:
$$\hat{\beta}_0 + \hat{\beta}_1 x \pm t_{\alpha/2,\, n-2} \times \hat{\sigma} \times \left(1 + \frac{\sum_{i=1}^{n}(x_i - x)^2}{n \sum_{j=1}^{n}(x_j - \bar{x})^2}\right)^{1/2}.$$


One-way ANOVA:

Total variation: $\sum_{j=1}^{k}\sum_{i=1}^{n_j}(X_{ij} - \bar{X})^2 = \sum_{j=1}^{k}\sum_{i=1}^{n_j} X_{ij}^2 - n\bar{X}^2$.

Between-treatments variation: $B = \sum_{j=1}^{k} n_j(\bar{X}_{\cdot j} - \bar{X})^2 = \sum_{j=1}^{k} n_j \bar{X}_{\cdot j}^2 - n\bar{X}^2$.

Within-treatments variation: $W = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(X_{ij} - \bar{X}_{\cdot j})^2 = \sum_{j=1}^{k}\sum_{i=1}^{n_j} X_{ij}^2 - \sum_{j=1}^{k} n_j \bar{X}_{\cdot j}^2$.

Two-way ANOVA:

Total variation: $\sum_{i=1}^{r}\sum_{j=1}^{c}(X_{ij} - \bar{X})^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} X_{ij}^2 - rc\bar{X}^2$.

Between-blocks (rows) variation: $B_{\text{row}} = c\sum_{i=1}^{r}(\bar{X}_{i\cdot} - \bar{X})^2 = c\sum_{i=1}^{r}\bar{X}_{i\cdot}^2 - rc\bar{X}^2$.

Between-treatments (columns) variation: $B_{\text{col}} = r\sum_{j=1}^{c}(\bar{X}_{\cdot j} - \bar{X})^2 = r\sum_{j=1}^{c}\bar{X}_{\cdot j}^2 - rc\bar{X}^2$.

Residual (error) variation:
$$\sum_{i=1}^{r}\sum_{j=1}^{c}(X_{ij} - \bar{X}_{i\cdot} - \bar{X}_{\cdot j} + \bar{X})^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} X_{ij}^2 - c\sum_{i=1}^{r}\bar{X}_{i\cdot}^2 - r\sum_{j=1}^{c}\bar{X}_{\cdot j}^2 + rc\bar{X}^2.$$

Appendix N
Sample examination paper

Time allowed: 3 hours.


Candidates should answer all FIVE questions. All questions carry equal marks.

1. (a) Let X be a discrete random variable with probability function defined by:
$$p(x) = \begin{cases} 2k^x & \text{for } x = 2, 3, \ldots \\ 0 & \text{otherwise} \end{cases}$$

where k is a constant with 0 < k < 1.

i. Show that k = 1/2.


(4 marks)
ii. Derive the cumulative distribution function of X for x = 2, 3, . . ..
(4 marks)
iii. Derive the moment generating function of X and use it to find E(X).
(8 marks)

(b) For the empty set, ∅, prove that:

P (∅) = 0.

(4 marks)


2. (a) Suppose we have a biased coin which comes up heads with probability π. An
experiment is carried out so that X is the number of independent flips of the
coin required until r heads show up, where r ≥ 1 is known. Determine the
probability function of X.
(5 marks)

(b) The random variable X has the following probability function:
$$p(x) = \begin{cases} (1 - \pi)^x \pi & \text{for } x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases}$$

where 0 < π < 1 is a parameter.


i. Derive the moment generating function of X.
(4 marks)
ii. Using the moment generating function, derive the mean and variance of X.
(7 marks)

(c) Suppose X ∼ N (−5, 16).

i. Determine P (−8 < X < −2).


(2 marks)
ii. Find the value a such that P (−5 − a < X < −5 + a) = 0.95.
(2 marks)

3. (a) X1 , X2 , . . . , Xn are independent random variables with the common
probability density function:
$$f(x) = \begin{cases} \lambda^2 x e^{-\lambda x} & \text{for } x \ge 0 \\ 0 & \text{otherwise.} \end{cases}$$
Derive the joint probability density function, f (x1 , x2 , . . . , xn ).


(4 marks)

(b) X and Y are random variables with a joint distribution given below:
X=x
0 2 4
Y =y 0 0.05 0.10 0.25
2 0.20 0.15 0.25
i. Obtain the marginal distributions of X and Y .
(2 marks)
ii. Evaluate E(X), Var(X), E(Y ) and Var(Y ).
(4 marks)
iii. Obtain the conditional distributions of Y | X = 2 and X | Y = 2.
(4 marks)
iv. Evaluate E(XY ), Cov(X, Y ) and Corr(X, Y ).
(4 marks)
v. Are X and Y independent? Justify your answer.
(2 marks)


4. (a) The mean squared error (MSE) of an estimator is the average squared error,
defined as:
$$\mathrm{MSE}(\hat{\theta}) = \mathrm{E}\big((\hat{\theta} - \theta)^2\big).$$
Show how this can be decomposed into variance and bias components such
that:
$$\mathrm{MSE}(\hat{\theta}) = \operatorname{Var}(\hat{\theta}) + \big(\mathrm{Bias}(\hat{\theta})\big)^2.$$

(5 marks)

(b) Let {X1 , X2 , . . . , Xn } be a random sample of size n from the following
probability density function:
$$f(x) = \frac{1}{2(\theta + 1)}$$
for 0 ≤ x ≤ 2θ + 2, and 0 otherwise, where θ > 0 is unknown.

i. Derive the maximum likelihood estimator of θ.


(5 marks)
ii. Obtain the maximum likelihood estimator of the standard deviation of the
above distribution.
(3 marks)

(c) Suppose that you are given observations y1 and y2 such that:

y1 = α + 3β + ε1 and y2 = 3α + β + ε2 .

Here the variables ε1 and ε2 are independent and normally distributed with
mean 0 and variance σ 2 .
Find the least squares estimators α
b and βb of the parameters α and β and
verify that they are unbiased estimators.
(7 marks)

5. (a) Let {X1 , X2 , . . . , Xn } be a random sample from a normally-distributed
population with mean µ and variance σ² < ∞. Let
$M = \sum_{i=1}^{n}(X_i - \bar{X})^2 = (n-1)S^2$, such that $M/\sigma^2 \sim \chi^2_{n-1}$ (you do not need to
derive this result for this question).
i. Show that a 100(1 − α)% confidence interval for σ², for any α ∈ (0, 1) and
constants k1 and k2 such that 0 < k1 < k2 is:
$$\left(\frac{M}{k_2}, \frac{M}{k_1}\right).$$

(4 marks)
ii. Suppose the sample size is n = 20 and the sample variance is s2 = 17.3.
Compute a 99% confidence interval for σ 2 .
(3 marks)
iii. A researcher decides to test:

H0 : σ 2 = 15 vs. H1 : σ 2 > 15

at a 5% significance level. If n = 30, calculate the power of the test if


σ 2 = 18.2.
(You may compute the power using the closest value available in the New
Cambridge Statistical Tables.)
(6 marks)
(b) Consider the simple linear regression model:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
where xi is treated as a constant, Var(yi ) = σ² and the least squares estimator
of β0 is:
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \bar{y} - \sum_{i=1}^{n} a_i \bar{x} y_i$$
where $a_i = (x_i - \bar{x}) \Big/ \sum_{k=1}^{n}(x_k - \bar{x})^2$. Show that:
$$\operatorname{Var}(\hat{\beta}_0) = \frac{\sigma^2 \sum_{k=1}^{n} x_k^2}{n \sum_{k=1}^{n}(x_k - \bar{x})^2}.$$

(7 marks)

[END OF PAPER]

Appendix O
Sample examination paper –
Solutions

1. (a) i. We must have:
$$1 = \sum_{x=2}^{\infty} 2k^x = \frac{2k^2}{1 - k}.$$
Solving, k = 1/2 or k = −1 (rejected, since k > 0).
ii. For x = 2, 3, . . ., we have:
$$F(x) = P(X \le x) = \sum_{t=2}^{x} 2\left(\frac{1}{2}\right)^{t} = \frac{(2/2^2)(1 - (1/2)^{x-1})}{1 - 1/2} = 1 - \left(\frac{1}{2}\right)^{x-1}.$$

iii. We have:
$$M_X(t) = \mathrm{E}(e^{tX}) = \sum_{x=2}^{\infty} 2k^x e^{tx} = 2\sum_{x=2}^{\infty}(ke^t)^x = \frac{2k^2 e^{2t}}{1 - ke^t} = \frac{e^{2t}}{2 - e^t}.$$
For the above to be valid, the sum to infinity has to be valid. That is,
$ke^t < 1$, meaning $t < \log(2)$. We then have:
$$M_X'(t) = \frac{4e^{2t} - e^{3t}}{(2 - e^t)^2}$$
so that $\mathrm{E}(X) = M_X'(0) = 3$.

(b) Since ∅ ∩ ∅ = ∅ and ∅ ∪ ∅ = ∅, Axiom 3 gives:
$$P(\emptyset) = P(\emptyset \cup \emptyset \cup \cdots) = \sum_{i=1}^{\infty} P(\emptyset).$$
However, the only real number for P (∅) which satisfies this is P (∅) = 0.

2. (a) To wait for r heads to show up, suppose x flips are required. The last flip must
be a head, with r − 1 heads randomly appearing in the first x − 1 flips. In each
particular combination of heads and tails, there must be r heads by definition
of the experiment, as well as x − r tails (so adding together, x flips in total),
with probability due to independence of:

$$\pi^r (1 - \pi)^{x-r}.$$


x−1

There are r−1
combinations of outcomes with this probability. Hence we
have: (
x−1
 r
r−1
π (1 − π)x−r for x = r, r + 1, . . .
p(x) =
0 otherwise.

(b) i. The mgf for this distribution is:
$$M_X(t) = \mathrm{E}(e^{tX}) = \sum_{x=0}^{\infty} e^{tx} p(x) = \sum_{x=0}^{\infty} e^{tx}(1 - \pi)^x \pi = \pi \sum_{x=0}^{\infty}\big(e^t(1 - \pi)\big)^x = \frac{\pi}{1 - e^t(1 - \pi)}$$
using the sum to infinity of a geometric series, for t < − ln(1 − π) to
ensure convergence of the sum.
ii. From the mgf $M_X(t) = \pi/(1 - e^t(1 - \pi))$ we obtain:
$$M_X'(t) = \frac{\pi(1 - \pi)e^t}{(1 - e^t(1 - \pi))^2} \quad \text{and} \quad M_X''(t) = \frac{\pi(1 - \pi)e^t(1 - (1 - \pi)e^t)(1 + (1 - \pi)e^t)}{(1 - e^t(1 - \pi))^4}$$
and hence (since e⁰ = 1):
$$M_X'(0) = \frac{1 - \pi}{\pi} = \mathrm{E}(X) \quad \text{and} \quad M_X''(0) = \frac{(1 - \pi)(2 - \pi)}{\pi^2} = \mathrm{E}(X^2)$$
and:
$$\operatorname{Var}(X) = \mathrm{E}(X^2) - (\mathrm{E}(X))^2 = \frac{(1 - \pi)(2 - \pi)}{\pi^2} - \frac{(1 - \pi)^2}{\pi^2} = \frac{1 - \pi}{\pi^2}.$$

(c) i. We have:
$$P(-8 < X < -2) = P\left(\frac{-8 + 5}{4} < Z < \frac{-2 + 5}{4}\right) = P(-0.75 < Z < 0.75) = \Phi(0.75) - \Phi(-0.75) = 0.7734 - (1 - 0.7734) = 0.5468.$$
ii. We want to find the value a such that P (−5 − a < X < −5 + a) = 0.95, that is:
$$0.95 = P\left(\frac{(-5 - a) + 5}{4} < Z < \frac{(-5 + a) + 5}{4}\right) = P\left(-\frac{a}{4} < Z < \frac{a}{4}\right) = 1 - 2 \times P\left(Z > \frac{a}{4}\right).$$
This is the same as 2 × P (Z > a/4) = 0.05, i.e. P (Z > a/4) = 0.025.
Hence a/4 = 1.96, and so a = 7.84.
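Both parts of (c) can be verified numerically; a brief sketch (assuming scipy) is:

from scipy.stats import norm

X = norm(loc=-5, scale=4)   # X ~ N(-5, 16), so the standard deviation is 4

print(X.cdf(-2) - X.cdf(-8))   # P(-8 < X < -2), about 0.5468
print(4 * norm.ppf(0.975))     # the value a with P(-5-a < X < -5+a) = 0.95, about 7.84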

3. (a) Since the Xi s are independent (and identically distributed) random variables,
we have:
$$f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f(x_i).$$
So, the joint probability density function is:
$$f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} \lambda^2 x_i e^{-\lambda x_i} = \lambda^{2n}\left(\prod_{i=1}^{n} x_i\right) e^{-\lambda x_1 - \lambda x_2 - \cdots - \lambda x_n} = \lambda^{2n}\left(\prod_{i=1}^{n} x_i\right) e^{-\lambda \sum_{i=1}^{n} x_i}.$$

(b) i. The marginal distributions are found by adding across rows and columns:

X=x 0 2 4 Y =y 0 2
P (X = x) 0.25 0.25 0.50 P (Y = y) 0.40 0.60

ii. E(X) = 0 × 0.25 + 2 × 0.25 + 4 × 0.50 = 2.5.


E(X 2 ) = 02 × 0.25 + 22 × 0.25 + 42 × 0.50 = 9, so:

Var(X) = 9 − (2.5)2 = 2.75.

E(Y ) = 0 × 0.40 + 2 × 0.60 = 1.2.


E(Y 2 ) = 02 × 0.40 + 22 × 0.60 = 2.4, so:

Var(Y ) = 2.4 − (1.2)2 = 0.96.

iii. pY |X=2 (y | x = 2) and pX|Y =2 (x | y = 2) are given by, respectively:


Y = y|X = 2 0 2
P (Y = y | X = 2) 0.10/0.25 = 0.4 0.15/0.25 = 0.6
and:

X = x|Y = 2 0 2 4
P (X = x | Y = 2) 0.20/0.60 = 4/12 0.15/0.60 = 3/12 0.25/0.60 = 5/12
iv. We have:
$$\mathrm{E}(XY) = 0 \times 0.60 + 4 \times 0.15 + 8 \times 0.25 = 2.6.$$
Hence:
$$\operatorname{Cov}(X, Y) = \mathrm{E}(XY) - \mathrm{E}(X)\,\mathrm{E}(Y) = 2.6 - (2.5)(1.2) = -0.4.$$
Also:
$$\operatorname{Corr}(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}} = \frac{-0.4}{\sqrt{2.75 \times 0.96}} = -0.2462.$$
v. Since Corr(X, Y ) ≠ 0, then X and Y are not independent.

4. (a) We have:
$$\operatorname{Var}(X) = \mathrm{E}(X^2) - (\mathrm{E}(X))^2 \quad \Leftrightarrow \quad \mathrm{E}(X^2) = \operatorname{Var}(X) + (\mathrm{E}(X))^2.$$
By setting $X = (\hat{\theta} - \theta)$, we have:
$$\mathrm{MSE}(\hat{\theta}) = \mathrm{E}\big((\hat{\theta} - \theta)^2\big) = \operatorname{Var}(\hat{\theta} - \theta) + \big(\mathrm{E}(\hat{\theta} - \theta)\big)^2 = \operatorname{Var}(\hat{\theta}) + \big(\mathrm{Bias}(\hat{\theta})\big)^2.$$

(b) i. Due to independence, the likelihood function is:
$$L(\theta) = \prod_{i=1}^{n} \frac{1}{2(\theta + 1)} = (2\theta + 2)^{-n} = \gamma^{-n}$$
where γ = 2θ + 2. The likelihood is maximised for small values of γ. The smallest value that
can safely maximise the likelihood without violating the support is:
$$\hat{\gamma} = X_{(n)}.$$
Hence by the invariance principle of MLEs, we have:
$$\hat{\theta} = \frac{X_{(n)} - 2}{2} = \frac{X_{(n)}}{2} - 1.$$

ii. The probability density function is that of a continuous uniform
distribution, hence the variance is:
$$\operatorname{Var}(X) = \sigma^2 = \frac{(2\theta + 2)^2}{12}.$$
By the invariance principle, the MLE of the standard deviation is:
$$\hat{\sigma} = \sqrt{\frac{(X_{(n)})^2}{12}} = \frac{X_{(n)}}{\sqrt{12}}.$$

(c) Given:
$$y_1 = \alpha + 3\beta + \varepsilon_1 \quad \text{and} \quad y_2 = 3\alpha + \beta + \varepsilon_2$$
we have:
$$S = \varepsilon_1^2 + \varepsilon_2^2 = (y_1 - \alpha - 3\beta)^2 + (y_2 - 3\alpha - \beta)^2.$$
Taking partial derivatives with respect to α and β, respectively, and equating
to zero leads to the equations:
$$\frac{\partial S}{\partial \alpha} = -2(y_1 - \hat{\alpha} - 3\hat{\beta}) - 6(y_2 - 3\hat{\alpha} - \hat{\beta}) = 0$$
and:
$$\frac{\partial S}{\partial \beta} = -6(y_1 - \hat{\alpha} - 3\hat{\beta}) - 2(y_2 - 3\hat{\alpha} - \hat{\beta}) = 0.$$
Solving this system yields:
$$\hat{\alpha} = \frac{-y_1 + 3y_2}{8} \quad \text{and} \quad \hat{\beta} = \frac{3y_1 - y_2}{8}.$$
These are unbiased estimators since:
$$\mathrm{E}(\hat{\alpha}) = \frac{-\alpha - 3\beta + 9\alpha + 3\beta}{8} = \alpha \quad \text{and} \quad \mathrm{E}(\hat{\beta}) = \frac{3\alpha + 9\beta - 3\alpha - \beta}{8} = \beta.$$

5. (a) i. For any given small α ∈ (0, 1), we can find 0 < k1 < k2 such that:
$$P(X < k_1) = P(X > k_2) = \frac{\alpha}{2}$$
where $X \sim \chi^2_{n-1}$. Alternatively, $k_1 = \chi^2_{1-\alpha/2,\, n-1}$ and $k_2 = \chi^2_{\alpha/2,\, n-1}$.
Therefore:
$$1 - \alpha = P\left(k_1 < \frac{M}{\sigma^2} < k_2\right) = P\left(\frac{M}{k_2} < \sigma^2 < \frac{M}{k_1}\right).$$
Hence a 100(1 − α)% confidence interval for σ² is:
$$\left(\frac{M}{k_2}, \frac{M}{k_1}\right).$$


ii. A 99% confidence interval for σ² is:
$$\left(\frac{M}{38.58}, \frac{M}{6.844}\right) \ \Rightarrow\ \left(\frac{19 \times s^2}{38.58}, \frac{19 \times s^2}{6.844}\right) = (0.4925 \times s^2,\ 2.7762 \times s^2) = (8.5203,\ 48.0283).$$
iii. Under H0 , the test statistic is $(n-1)S^2/\sigma^2 = 29 \times S^2/15 \sim \chi^2_{29}$. The
critical value is $\chi^2_{0.05,\, 29} = 42.56$. Hence the power of the test if σ² = 18.2 is:
$$P_{\sigma}(H_0 \text{ is rejected}) = P_{\sigma}\left(\frac{29 \times S^2}{15} > 42.56\right) = P_{\sigma}\left(\frac{29 \times S^2}{18.2} > \frac{15}{18.2} \times 42.56\right) = P_{\sigma}(T > 35.08)$$
where $T \sim \chi^2_{29}$. Since $\chi^2_{0.20,\, 29} = 35.14$, the power is 0.20, approximately.

(b) Note that:
$$\sum_{i=1}^{n} a_i = 0 \quad \text{and} \quad \sum_{i=1}^{n} a_i^2 = \frac{1}{\sum_{k=1}^{n}(x_k - \bar{x})^2}$$
and:
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \bar{y} - \sum_{i=1}^{n} a_i \bar{x} y_i = \sum_{i=1}^{n}\left(\frac{1}{n} - a_i \bar{x}\right) y_i.$$
Hence:
$$\operatorname{Var}(\hat{\beta}_0) = \sigma^2 \sum_{i=1}^{n}\left(\frac{1}{n} - a_i \bar{x}\right)^2 = \sigma^2\left(\frac{1}{n} + \bar{x}^2 \sum_{i=1}^{n} a_i^2\right) = \frac{\sigma^2}{n}\left(1 + \frac{n\bar{x}^2}{\sum_{k=1}^{n}(x_k - \bar{x})^2}\right) = \frac{\sigma^2 \sum_{k=1}^{n} x_k^2}{n \sum_{k=1}^{n}(x_k - \bar{x})^2}.$$
The last equality uses the fact that:
$$\sum_{k=1}^{n} x_k^2 = \sum_{k=1}^{n}(x_k - \bar{x})^2 + n\bar{x}^2.$$

STATISTICAL TABLES
Cumulative normal distribution
Critical values of the t distribution
Critical values of the F distribution
Critical values of the chi-squared distribution

New Cambridge Statistical Tables pages 17-29

© C. Dougherty 2001, 2002 ([email protected]). These tables have been computed to accompany the text C. Dougherty Introduction to
Econometrics (second edition 2002, Oxford University Press, Oxford). They may be reproduced freely provided that this attribution is retained.

TABLE A.1

Cumulative Standardized Normal Distribution

A(z) is the integral of the standardized normal
distribution from −∞ to z (in other words, the
area under the curve to the left of z). It gives the
probability of a normal random variable not
being more than z standard deviations above its
mean. Values of z of particular importance:

z A(z)
1.645 0.9500 Lower limit of right 5% tail
1.960 0.9750 Lower limit of right 2.5% tail
2.326 0.9900 Lower limit of right 1% tail
2.576 0.9950 Lower limit of right 0.5% tail
3.090 0.9990 Lower limit of right 0.1% tail
3.291 0.9995 Lower limit of right 0.05% tail

z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
3.5 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998
3.6 0.9998 0.9998 0.9999


TABLE A.2
t Distribution: Critical Values of t

Significance level
Degrees of Two-tailed test: 10% 5% 2% 1% 0.2% 0.1%
freedom One-tailed test: 5% 2.5% 1% 0.5% 0.1% 0.05%
1 6.314 12.706 31.821 63.657 318.309 636.619
2 2.920 4.303 6.965 9.925 22.327 31.599
3 2.353 3.182 4.541 5.841 10.215 12.924
4 2.132 2.776 3.747 4.604 7.173 8.610
5 2.015 2.571 3.365 4.032 5.893 6.869
6 1.943 2.447 3.143 3.707 5.208 5.959
7 1.894 2.365 2.998 3.499 4.785 5.408
8 1.860 2.306 2.896 3.355 4.501 5.041
9 1.833 2.262 2.821 3.250 4.297 4.781
10 1.812 2.228 2.764 3.169 4.144 4.587
11 1.796 2.201 2.718 3.106 4.025 4.437
12 1.782 2.179 2.681 3.055 3.930 4.318
13 1.771 2.160 2.650 3.012 3.852 4.221
14 1.761 2.145 2.624 2.977 3.787 4.140
15 1.753 2.131 2.602 2.947 3.733 4.073
16 1.746 2.120 2.583 2.921 3.686 4.015
17 1.740 2.110 2.567 2.898 3.646 3.965
18 1.734 2.101 2.552 2.878 3.610 3.922
19 1.729 2.093 2.539 2.861 3.579 3.883
20 1.725 2.086 2.528 2.845 3.552 3.850
21 1.721 2.080 2.518 2.831 3.527 3.819
22 1.717 2.074 2.508 2.819 3.505 3.792
23 1.714 2.069 2.500 2.807 3.485 3.768
24 1.711 2.064 2.492 2.797 3.467 3.745
25 1.708 2.060 2.485 2.787 3.450 3.725
26 1.706 2.056 2.479 2.779 3.435 3.707
27 1.703 2.052 2.473 2.771 3.421 3.690
28 1.701 2.048 2.467 2.763 3.408 3.674
29 1.699 2.045 2.462 2.756 3.396 3.659
30 1.697 2.042 2.457 2.750 3.385 3.646
32 1.694 2.037 2.449 2.738 3.365 3.622
34 1.691 2.032 2.441 2.728 3.348 3.601
36 1.688 2.028 2.434 2.719 3.333 3.582
38 1.686 2.024 2.429 2.712 3.319 3.566
40 1.684 2.021 2.423 2.704 3.307 3.551
42 1.682 2.018 2.418 2.698 3.296 3.538
44 1.680 2.015 2.414 2.692 3.286 3.526
46 1.679 2.013 2.410 2.687 3.277 3.515
48 1.677 2.011 2.407 2.682 3.269 3.505
50 1.676 2.009 2.403 2.678 3.261 3.496
60 1.671 2.000 2.390 2.660 3.232 3.460
70 1.667 1.994 2.381 2.648 3.211 3.435
80 1.664 1.990 2.374 2.639 3.195 3.416
90 1.662 1.987 2.368 2.632 3.183 3.402
100 1.660 1.984 2.364 2.626 3.174 3.390
120 1.658 1.980 2.358 2.617 3.160 3.373
150 1.655 1.976 2.351 2.609 3.145 3.357
200 1.653 1.972 2.345 2.601 3.131 3.340
300 1.650 1.968 2.339 2.592 3.118 3.323
400 1.649 1.966 2.336 2.588 3.111 3.315
500 1.648 1.965 2.334 2.586 3.107 3.310
600 1.647 1.964 2.333 2.584 3.104 3.307
∞ 1.645 1.960 2.326 2.576 3.090 3.291


TABLE A.3

F Distribution: Critical Values of F (5% significance level)

v1 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20
v2
1 161.45 199.50 215.71 224.58 230.16 233.99 236.77 238.88 240.54 241.88 243.91 245.36 246.46 247.32 248.01
2 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.38 19.40 19.41 19.42 19.43 19.44 19.45
3 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 8.74 8.71 8.69 8.67 8.66
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 5.91 5.87 5.84 5.82 5.80
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.68 4.64 4.60 4.58 4.56
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 4.00 3.96 3.92 3.90 3.87
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.57 3.53 3.49 3.47 3.44
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 3.28 3.24 3.20 3.17 3.15
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.07 3.03 2.99 2.96 2.94
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.91 2.86 2.83 2.80 2.77
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 2.79 2.74 2.70 2.67 2.65
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.69 2.64 2.60 2.57 2.54
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.60 2.55 2.51 2.48 2.46
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 2.53 2.48 2.44 2.41 2.39
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.48 2.42 2.38 2.35 2.33
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 2.42 2.37 2.33 2.30 2.28
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45 2.38 2.33 2.29 2.26 2.23
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41 2.34 2.29 2.25 2.22 2.19
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38 2.31 2.26 2.21 2.18 2.16
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35 2.28 2.22 2.18 2.15 2.12
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32 2.25 2.20 2.16 2.12 2.10
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30 2.23 2.17 2.13 2.10 2.07
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27 2.20 2.15 2.11 2.08 2.05
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25 2.18 2.13 2.09 2.05 2.03
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24 2.16 2.11 2.07 2.04 2.01
26 4.22 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22 2.15 2.09 2.05 2.02 1.99
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20 2.13 2.08 2.04 2.00 1.97
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19 2.12 2.06 2.02 1.99 1.96
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18 2.10 2.05 2.01 1.97 1.94
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 2.09 2.04 1.99 1.96 1.93
35 4.12 3.27 2.87 2.64 2.49 2.37 2.29 2.22 2.16 2.11 2.04 1.99 1.94 1.91 1.88
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 2.00 1.95 1.90 1.87 1.84
50 4.03 3.18 2.79 2.56 2.40 2.29 2.20 2.13 2.07 2.03 1.95 1.89 1.85 1.81 1.78
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99 1.92 1.86 1.82 1.78 1.75
70 3.98 3.13 2.74 2.50 2.35 2.23 2.14 2.07 2.02 1.97 1.89 1.84 1.79 1.75 1.72
80 3.96 3.11 2.72 2.49 2.33 2.21 2.13 2.06 2.00 1.95 1.88 1.82 1.77 1.73 1.70
90 3.95 3.10 2.71 2.47 2.32 2.20 2.11 2.04 1.99 1.94 1.86 1.80 1.76 1.72 1.69
100 3.94 3.09 2.70 2.46 2.31 2.19 2.10 2.03 1.97 1.93 1.85 1.79 1.75 1.71 1.68
120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91 1.83 1.78 1.73 1.69 1.66
150 3.90 3.06 2.66 2.43 2.27 2.16 2.07 2.00 1.94 1.89 1.82 1.76 1.71 1.67 1.64
200 3.89 3.04 2.65 2.42 2.26 2.14 2.06 1.98 1.93 1.88 1.80 1.74 1.69 1.66 1.62
250 3.88 3.03 2.64 2.41 2.25 2.13 2.05 1.98 1.92 1.87 1.79 1.73 1.68 1.65 1.61
300 3.87 3.03 2.63 2.40 2.24 2.13 2.04 1.97 1.91 1.86 1.78 1.72 1.68 1.64 1.61
400 3.86 3.02 2.63 2.39 2.24 2.12 2.03 1.96 1.90 1.85 1.78 1.72 1.67 1.63 1.60
500 3.86 3.01 2.62 2.39 2.23 2.12 2.03 1.96 1.90 1.85 1.77 1.71 1.66 1.62 1.59
600 3.86 3.01 2.62 2.39 2.23 2.11 2.02 1.95 1.90 1.85 1.77 1.71 1.66 1.62 1.59
750 3.85 3.01 2.62 2.38 2.23 2.11 2.02 1.95 1.89 1.84 1.77 1.70 1.66 1.62 1.58
1000 3.85 3.00 2.61 2.38 2.22 2.11 2.02 1.95 1.89 1.84 1.76 1.70 1.65 1.61 1.58


TABLE A.3 (continued)

F Distribution: Critical Values of F (5% significance level)

v1 25 30 35 40 50 60 75 100 150 200


v2
1 249.26 250.10 250.69 251.14 251.77 252.20 252.62 253.04 253.46 253.68
2 19.46 19.46 19.47 19.47 19.48 19.48 19.48 19.49 19.49 19.49
3 8.63 8.62 8.60 8.59 8.58 8.57 8.56 8.55 8.54 8.54
4 5.77 5.75 5.73 5.72 5.70 5.69 5.68 5.66 5.65 5.65
5 4.52 4.50 4.48 4.46 4.44 4.43 4.42 4.41 4.39 4.39
6 3.83 3.81 3.79 3.77 3.75 3.74 3.73 3.71 3.70 3.69
7 3.40 3.38 3.36 3.34 3.32 3.30 3.29 3.27 3.26 3.25
8 3.11 3.08 3.06 3.04 3.02 3.01 2.99 2.97 2.96 2.95
9 2.89 2.86 2.84 2.83 2.80 2.79 2.77 2.76 2.74 2.73
10 2.73 2.70 2.68 2.66 2.64 2.62 2.60 2.59 2.57 2.56
11 2.60 2.57 2.55 2.53 2.51 2.49 2.47 2.46 2.44 2.43
12 2.50 2.47 2.44 2.43 2.40 2.38 2.37 2.35 2.33 2.32
13 2.41 2.38 2.36 2.34 2.31 2.30 2.28 2.26 2.24 2.23
14 2.34 2.31 2.28 2.27 2.24 2.22 2.21 2.19 2.17 2.16
15 2.28 2.25 2.22 2.20 2.18 2.16 2.14 2.12 2.10 2.10
16 2.23 2.19 2.17 2.15 2.12 2.11 2.09 2.07 2.05 2.04
17 2.18 2.15 2.12 2.10 2.08 2.06 2.04 2.02 2.00 1.99
18 2.14 2.11 2.08 2.06 2.04 2.02 2.00 1.98 1.96 1.95
19 2.11 2.07 2.05 2.03 2.00 1.98 1.96 1.94 1.92 1.91
20 2.07 2.04 2.01 1.99 1.97 1.95 1.93 1.91 1.89 1.88
21 2.05 2.01 1.98 1.96 1.94 1.92 1.90 1.88 1.86 1.84
22 2.02 1.98 1.96 1.94 1.91 1.89 1.87 1.85 1.83 1.82
23 2.00 1.96 1.93 1.91 1.88 1.86 1.84 1.82 1.80 1.79
24 1.97 1.94 1.91 1.89 1.86 1.84 1.82 1.80 1.78 1.77
25 1.96 1.92 1.89 1.87 1.84 1.82 1.80 1.78 1.76 1.75
26 1.94 1.90 1.87 1.85 1.82 1.80 1.78 1.76 1.74 1.73
27 1.92 1.88 1.86 1.84 1.81 1.79 1.76 1.74 1.72 1.71
28 1.91 1.87 1.84 1.82 1.79 1.77 1.75 1.73 1.70 1.69
29 1.89 1.85 1.83 1.81 1.77 1.75 1.73 1.71 1.69 1.67
30 1.88 1.84 1.81 1.79 1.76 1.74 1.72 1.70 1.67 1.66
35 1.82 1.79 1.76 1.74 1.70 1.68 1.66 1.63 1.61 1.60
40 1.78 1.74 1.72 1.69 1.66 1.64 1.61 1.59 1.56 1.55
50 1.73 1.69 1.66 1.63 1.60 1.58 1.55 1.52 1.50 1.48
60 1.69 1.65 1.62 1.59 1.56 1.53 1.51 1.48 1.45 1.44
70 1.66 1.62 1.59 1.57 1.53 1.50 1.48 1.45 1.42 1.40
80 1.64 1.60 1.57 1.54 1.51 1.48 1.45 1.43 1.39 1.38
90 1.63 1.59 1.55 1.53 1.49 1.46 1.44 1.41 1.38 1.36
100 1.62 1.57 1.54 1.52 1.48 1.45 1.42 1.39 1.36 1.34
120 1.60 1.55 1.52 1.50 1.46 1.43 1.40 1.37 1.33 1.32
150 1.58 1.54 1.50 1.48 1.44 1.41 1.38 1.34 1.31 1.29
200 1.56 1.52 1.48 1.46 1.41 1.39 1.35 1.32 1.28 1.26
250 1.55 1.50 1.47 1.44 1.40 1.37 1.34 1.31 1.27 1.25
300 1.54 1.50 1.46 1.43 1.39 1.36 1.33 1.30 1.26 1.23
400 1.53 1.49 1.45 1.42 1.38 1.35 1.32 1.28 1.24 1.22
500 1.53 1.48 1.45 1.42 1.38 1.35 1.31 1.28 1.23 1.21
600 1.52 1.48 1.44 1.41 1.37 1.34 1.31 1.27 1.23 1.20
750 1.52 1.47 1.44 1.41 1.37 1.34 1.30 1.26 1.22 1.20
1000 1.52 1.47 1.43 1.41 1.36 1.33 1.30 1.26 1.22 1.19


TABLE A.3 (continued)

F Distribution: Critical Values of F (1% significance level)

v1 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20
v2
1 4052.18 4999.50 5403.35 5624.58 5763.65 5858.99 5928.36 5981.07 6022.47 6055.85 6106.32 6142.67 6170.10 6191.53 6208.73
2 98.50 99.00 99.17 99.25 99.30 99.33 99.36 99.37 99.39 99.40 99.42 99.43 99.44 99.44 99.45
3 34.12 30.82 29.46 28.71 28.24 27.91 27.67 27.49 27.35 27.23 27.05 26.92 26.83 26.75 26.69
4 21.20 18.00 16.69 15.98 15.52 15.21 14.98 14.80 14.66 14.55 14.37 14.25 14.15 14.08 14.02
5 16.26 13.27 12.06 11.39 10.97 10.67 10.46 10.29 10.16 10.05 9.89 9.77 9.68 9.61 9.55
6 13.75 10.92 9.78 9.15 8.75 8.47 8.26 8.10 7.98 7.87 7.72 7.60 7.52 7.45 7.40
7 12.25 9.55 8.45 7.85 7.46 7.19 6.99 6.84 6.72 6.62 6.47 6.36 6.28 6.21 6.16
8 11.26 8.65 7.59 7.01 6.63 6.37 6.18 6.03 5.91 5.81 5.67 5.56 5.48 5.41 5.36
9 10.56 8.02 6.99 6.42 6.06 5.80 5.61 5.47 5.35 5.26 5.11 5.01 4.92 4.86 4.81
10 10.04 7.56 6.55 5.99 5.64 5.39 5.20 5.06 4.94 4.85 4.71 4.60 4.52 4.46 4.41
11 9.65 7.21 6.22 5.67 5.32 5.07 4.89 4.74 4.63 4.54 4.40 4.29 4.21 4.15 4.10
12 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.50 4.39 4.30 4.16 4.05 3.97 3.91 3.86
13 9.07 6.70 5.74 5.21 4.86 4.62 4.44 4.30 4.19 4.10 3.96 3.86 3.78 3.72 3.66
14 8.86 6.51 5.56 5.04 4.69 4.46 4.28 4.14 4.03 3.94 3.80 3.70 3.62 3.56 3.51
15 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 3.89 3.80 3.67 3.56 3.49 3.42 3.37
16 8.53 6.23 5.29 4.77 4.44 4.20 4.03 3.89 3.78 3.69 3.55 3.45 3.37 3.31 3.26
17 8.40 6.11 5.18 4.67 4.34 4.10 3.93 3.79 3.68 3.59 3.46 3.35 3.27 3.21 3.16
18 8.29 6.01 5.09 4.58 4.25 4.01 3.84 3.71 3.60 3.51 3.37 3.27 3.19 3.13 3.08
19 8.18 5.93 5.01 4.50 4.17 3.94 3.77 3.63 3.52 3.43 3.30 3.19 3.12 3.05 3.00
20 8.10 5.85 4.94 4.43 4.10 3.87 3.70 3.56 3.46 3.37 3.23 3.13 3.05 2.99 2.94
21 8.02 5.78 4.87 4.37 4.04 3.81 3.64 3.51 3.40 3.31 3.17 3.07 2.99 2.93 2.88
22 7.95 5.72 4.82 4.31 3.99 3.76 3.59 3.45 3.35 3.26 3.12 3.02 2.94 2.88 2.83
23 7.88 5.66 4.76 4.26 3.94 3.71 3.54 3.41 3.30 3.21 3.07 2.97 2.89 2.83 2.78
24 7.82 5.61 4.72 4.22 3.90 3.67 3.50 3.36 3.26 3.17 3.03 2.93 2.85 2.79 2.74
25 7.77 5.57 4.68 4.18 3.85 3.63 3.46 3.32 3.22 3.13 2.99 2.89 2.81 2.75 2.70
26 7.72 5.53 4.64 4.14 3.82 3.59 3.42 3.29 3.18 3.09 2.96 2.86 2.78 2.72 2.66
27 7.68 5.49 4.60 4.11 3.78 3.56 3.39 3.26 3.15 3.06 2.93 2.82 2.75 2.68 2.63
28 7.64 5.45 4.57 4.07 3.75 3.53 3.36 3.23 3.12 3.03 2.90 2.79 2.72 2.65 2.60
29 7.60 5.42 4.54 4.04 3.73 3.50 3.33 3.20 3.09 3.00 2.87 2.77 2.69 2.63 2.57
30 7.56 5.39 4.51 4.02 3.70 3.47 3.30 3.17 3.07 2.98 2.84 2.74 2.66 2.60 2.55
35 7.42 5.27 4.40 3.91 3.59 3.37 3.20 3.07 2.96 2.88 2.74 2.64 2.56 2.50 2.44
40 7.31 5.18 4.31 3.83 3.51 3.29 3.12 2.99 2.89 2.80 2.66 2.56 2.48 2.42 2.37
50 7.17 5.06 4.20 3.72 3.41 3.19 3.02 2.89 2.78 2.70 2.56 2.46 2.38 2.32 2.27
60 7.08 4.98 4.13 3.65 3.34 3.12 2.95 2.82 2.72 2.63 2.50 2.39 2.31 2.25 2.20
70 7.01 4.92 4.07 3.60 3.29 3.07 2.91 2.78 2.67 2.59 2.45 2.35 2.27 2.20 2.15
80 6.96 4.88 4.04 3.56 3.26 3.04 2.87 2.74 2.64 2.55 2.42 2.31 2.23 2.17 2.12
90 6.93 4.85 4.01 3.53 3.23 3.01 2.84 2.72 2.61 2.52 2.39 2.29 2.21 2.14 2.09
100 6.90 4.82 3.98 3.51 3.21 2.99 2.82 2.69 2.59 2.50 2.37 2.27 2.19 2.12 2.07
120 6.85 4.79 3.95 3.48 3.17 2.96 2.79 2.66 2.56 2.47 2.34 2.23 2.15 2.09 2.03
150 6.81 4.75 3.91 3.45 3.14 2.92 2.76 2.63 2.53 2.44 2.31 2.20 2.12 2.06 2.00
200 6.76 4.71 3.88 3.41 3.11 2.89 2.73 2.60 2.50 2.41 2.27 2.17 2.09 2.03 1.97
250 6.74 4.69 3.86 3.40 3.09 2.87 2.71 2.58 2.48 2.39 2.26 2.15 2.07 2.01 1.95
300 6.72 4.68 3.85 3.38 3.08 2.86 2.70 2.57 2.47 2.38 2.24 2.14 2.06 1.99 1.94
400 6.70 4.66 3.83 3.37 3.06 2.85 2.68 2.56 2.45 2.37 2.23 2.13 2.05 1.98 1.92
500 6.69 4.65 3.82 3.36 3.05 2.84 2.68 2.55 2.44 2.36 2.22 2.12 2.04 1.97 1.92
600 6.68 4.64 3.81 3.35 3.05 2.83 2.67 2.54 2.44 2.35 2.21 2.11 2.03 1.96 1.91
750 6.67 4.63 3.81 3.34 3.04 2.83 2.66 2.53 2.43 2.34 2.21 2.11 2.02 1.96 1.90
1000 6.66 4.63 3.80 3.34 3.04 2.82 2.66 2.53 2.43 2.34 2.20 2.10 2.02 1.95 1.90


TABLE A.3 (continued)

F Distribution: Critical Values of F (1% significance level)

v1 25 30 35 40 50 60 75 100 150 200


v2
1 6239.83 6260.65 6275.57 6286.78 6302.52 6313.03 6323.56 6334.11 6344.68 6349.97
2 99.46 99.47 99.47 99.47 99.48 99.48 99.49 99.49 99.49 99.49
3 26.58 26.50 26.45 26.41 26.35 26.32 26.28 26.24 26.20 26.18
4 13.91 13.84 13.79 13.75 13.69 13.65 13.61 13.58 13.54 13.52
5 9.45 9.38 9.33 9.29 9.24 9.20 9.17 9.13 9.09 9.08
6 7.30 7.23 7.18 7.14 7.09 7.06 7.02 6.99 6.95 6.93
7 6.06 5.99 5.94 5.91 5.86 5.82 5.79 5.75 5.72 5.70
8 5.26 5.20 5.15 5.12 5.07 5.03 5.00 4.96 4.93 4.91
9 4.71 4.65 4.60 4.57 4.52 4.48 4.45 4.41 4.38 4.36
10 4.31 4.25 4.20 4.17 4.12 4.08 4.05 4.01 3.98 3.96
11 4.01 3.94 3.89 3.86 3.81 3.78 3.74 3.71 3.67 3.66
12 3.76 3.70 3.65 3.62 3.57 3.54 3.50 3.47 3.43 3.41
13 3.57 3.51 3.46 3.43 3.38 3.34 3.31 3.27 3.24 3.22
14 3.41 3.35 3.30 3.27 3.22 3.18 3.15 3.11 3.08 3.06
15 3.28 3.21 3.17 3.13 3.08 3.05 3.01 2.98 2.94 2.92
16 3.16 3.10 3.05 3.02 2.97 2.93 2.90 2.86 2.83 2.81
17 3.07 3.00 2.96 2.92 2.87 2.83 2.80 2.76 2.73 2.71
18 2.98 2.92 2.87 2.84 2.78 2.75 2.71 2.68 2.64 2.62
19 2.91 2.84 2.80 2.76 2.71 2.67 2.64 2.60 2.57 2.55
20 2.84 2.78 2.73 2.69 2.64 2.61 2.57 2.54 2.50 2.48
21 2.79 2.72 2.67 2.64 2.58 2.55 2.51 2.48 2.44 2.42
22 2.73 2.67 2.62 2.58 2.53 2.50 2.46 2.42 2.38 2.36
23 2.69 2.62 2.57 2.54 2.48 2.45 2.41 2.37 2.34 2.32
24 2.64 2.58 2.53 2.49 2.44 2.40 2.37 2.33 2.29 2.27
25 2.60 2.54 2.49 2.45 2.40 2.36 2.33 2.29 2.25 2.23
26 2.57 2.50 2.45 2.42 2.36 2.33 2.29 2.25 2.21 2.19
27 2.54 2.47 2.42 2.38 2.33 2.29 2.26 2.22 2.18 2.16
28 2.51 2.44 2.39 2.35 2.30 2.26 2.23 2.19 2.15 2.13
29 2.48 2.41 2.36 2.33 2.27 2.23 2.20 2.16 2.12 2.10
30 2.45 2.39 2.34 2.30 2.25 2.21 2.17 2.13 2.09 2.07
35 2.35 2.28 2.23 2.19 2.14 2.10 2.06 2.02 1.98 1.96
40 2.27 2.20 2.15 2.11 2.06 2.02 1.98 1.94 1.90 1.87
50 2.17 2.10 2.05 2.01 1.95 1.91 1.87 1.82 1.78 1.76
60 2.10 2.03 1.98 1.94 1.88 1.84 1.79 1.75 1.70 1.68
70 2.05 1.98 1.93 1.89 1.83 1.78 1.74 1.70 1.65 1.62
80 2.01 1.94 1.89 1.85 1.79 1.75 1.70 1.65 1.61 1.58
90 1.99 1.92 1.86 1.82 1.76 1.72 1.67 1.62 1.57 1.55
100 1.97 1.89 1.84 1.80 1.74 1.69 1.65 1.60 1.55 1.52
120 1.93 1.86 1.81 1.76 1.70 1.66 1.61 1.56 1.51 1.48
150 1.90 1.83 1.77 1.73 1.66 1.62 1.57 1.52 1.46 1.43
200 1.87 1.79 1.74 1.69 1.63 1.58 1.53 1.48 1.42 1.39
250 1.85 1.77 1.72 1.67 1.61 1.56 1.51 1.46 1.40 1.36
300 1.84 1.76 1.70 1.66 1.59 1.55 1.50 1.44 1.38 1.35
400 1.82 1.75 1.69 1.64 1.58 1.53 1.48 1.42 1.36 1.32
500 1.81 1.74 1.68 1.63 1.57 1.52 1.47 1.41 1.34 1.31
600 1.80 1.73 1.67 1.63 1.56 1.51 1.46 1.40 1.34 1.30
750 1.80 1.72 1.66 1.62 1.55 1.50 1.45 1.39 1.33 1.29
1000 1.79 1.72 1.66 1.61 1.54 1.50 1.44 1.38 1.32 1.28
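
For degrees of freedom not tabulated above, critical values can be obtained with statistical software rather than by interpolation. The following is a minimal sketch in Python (assuming the scipy library is available; the particular degrees of freedom chosen are purely illustrative), which reproduces an entry of the 1% table.

from scipy.stats import f

# Upper-tail critical value at the 1% significance level: the point x such that
# P(F > x) = 0.01 when F has v1 (numerator) and v2 (denominator) degrees of freedom.
v1, v2 = 25, 30                        # illustrative degrees of freedom
critical_value = f.ppf(0.99, v1, v2)   # equivalently: f.isf(0.01, v1, v2)

print(round(critical_value, 2))        # about 2.45, matching the tabulated entry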


TABLE A.3 (continued)

F Distribution: Critical Values of F (0.1% significance level)

v1 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20
v2
1 4.05e05 5.00e05 5.40e05 5.62e05 5.76e05 5.86e05 5.93e05 5.98e05 6.02e05 6.06e05 6.11e05 6.14e05 6.17e05 6.19e05 6.21e05
2 998.50 999.00 999.17 999.25 999.30 999.33 999.36 999.37 999.39 999.40 999.42 999.43 999.44 999.44 999.45
3 167.03 148.50 141.11 137.10 134.58 132.85 131.58 130.62 129.86 129.25 128.32 127.64 127.14 126.74 126.42
4 74.14 61.25 56.18 53.44 51.71 50.53 49.66 49.00 48.47 48.05 47.41 46.95 46.60 46.32 46.10
5 47.18 37.12 33.20 31.09 29.75 28.83 28.16 27.65 27.24 26.92 26.42 26.06 25.78 25.57 25.39
6 35.51 27.00 23.70 21.92 20.80 20.03 19.46 19.03 18.69 18.41 17.99 17.68 17.45 17.27 17.12
7 29.25 21.69 18.77 17.20 16.21 15.52 15.02 14.63 14.33 14.08 13.71 13.43 13.23 13.06 12.93
8 25.41 18.49 15.83 14.39 13.48 12.86 12.40 12.05 11.77 11.54 11.19 10.94 10.75 10.60 10.48
9 22.86 16.39 13.90 12.56 11.71 11.13 10.70 10.37 10.11 9.89 9.57 9.33 9.15 9.01 8.90
10 21.04 14.91 12.55 11.28 10.48 9.93 9.52 9.20 8.96 8.75 8.45 8.22 8.05 7.91 7.80
11 19.69 13.81 11.56 10.35 9.58 9.05 8.66 8.35 8.12 7.92 7.63 7.41 7.24 7.11 7.01
12 18.64 12.97 10.80 9.63 8.89 8.38 8.00 7.71 7.48 7.29 7.00 6.79 6.63 6.51 6.40
13 17.82 12.31 10.21 9.07 8.35 7.86 7.49 7.21 6.98 6.80 6.52 6.31 6.16 6.03 5.93
14 17.14 11.78 9.73 8.62 7.92 7.44 7.08 6.80 6.58 6.40 6.13 5.93 5.78 5.66 5.56
15 16.59 11.34 9.34 8.25 7.57 7.09 6.74 6.47 6.26 6.08 5.81 5.62 5.46 5.35 5.25
16 16.12 10.97 9.01 7.94 7.27 6.80 6.46 6.19 5.98 5.81 5.55 5.35 5.20 5.09 4.99
17 15.72 10.66 8.73 7.68 7.02 6.56 6.22 5.96 5.75 5.58 5.32 5.13 4.99 4.87 4.78
18 15.38 10.39 8.49 7.46 6.81 6.35 6.02 5.76 5.56 5.39 5.13 4.94 4.80 4.68 4.59
19 15.08 10.16 8.28 7.27 6.62 6.18 5.85 5.59 5.39 5.22 4.97 4.78 4.64 4.52 4.43
20 14.82 9.95 8.10 7.10 6.46 6.02 5.69 5.44 5.24 5.08 4.82 4.64 4.49 4.38 4.29
21 14.59 9.77 7.94 6.95 6.32 5.88 5.56 5.31 5.11 4.95 4.70 4.51 4.37 4.26 4.17
22 14.38 9.61 7.80 6.81 6.19 5.76 5.44 5.19 4.99 4.83 4.58 4.40 4.26 4.15 4.06
23 14.20 9.47 7.67 6.70 6.08 5.65 5.33 5.09 4.89 4.73 4.48 4.30 4.16 4.05 3.96
24 14.03 9.34 7.55 6.59 5.98 5.55 5.23 4.99 4.80 4.64 4.39 4.21 4.07 3.96 3.87
25 13.88 9.22 7.45 6.49 5.89 5.46 5.15 4.91 4.71 4.56 4.31 4.13 3.99 3.88 3.79
26 13.74 9.12 7.36 6.41 5.80 5.38 5.07 4.83 4.64 4.48 4.24 4.06 3.92 3.81 3.72
27 13.61 9.02 7.27 6.33 5.73 5.31 5.00 4.76 4.57 4.41 4.17 3.99 3.86 3.75 3.66
28 13.50 8.93 7.19 6.25 5.66 5.24 4.93 4.69 4.50 4.35 4.11 3.93 3.80 3.69 3.60
29 13.39 8.85 7.12 6.19 5.59 5.18 4.87 4.64 4.45 4.29 4.05 3.88 3.74 3.63 3.54
30 13.29 8.77 7.05 6.12 5.53 5.12 4.82 4.58 4.39 4.24 4.00 3.82 3.69 3.58 3.49
35 12.90 8.47 6.79 5.88 5.30 4.89 4.59 4.36 4.18 4.03 3.79 3.62 3.48 3.38 3.29
40 12.61 8.25 6.59 5.70 5.13 4.73 4.44 4.21 4.02 3.87 3.64 3.47 3.34 3.23 3.14
50 12.22 7.96 6.34 5.46 4.90 4.51 4.22 4.00 3.82 3.67 3.44 3.27 3.14 3.04 2.95
60 11.97 7.77 6.17 5.31 4.76 4.37 4.09 3.86 3.69 3.54 3.32 3.15 3.02 2.91 2.83
70 11.80 7.64 6.06 5.20 4.66 4.28 3.99 3.77 3.60 3.45 3.23 3.06 2.93 2.83 2.74
80 11.67 7.54 5.97 5.12 4.58 4.20 3.92 3.70 3.53 3.39 3.16 3.00 2.87 2.76 2.68
90 11.57 7.47 5.91 5.06 4.53 4.15 3.87 3.65 3.48 3.34 3.11 2.95 2.82 2.71 2.63
100 11.50 7.41 5.86 5.02 4.48 4.11 3.83 3.61 3.44 3.30 3.07 2.91 2.78 2.68 2.59
120 11.38 7.32 5.78 4.95 4.42 4.04 3.77 3.55 3.38 3.24 3.02 2.85 2.72 2.62 2.53
150 11.27 7.24 5.71 4.88 4.35 3.98 3.71 3.49 3.32 3.18 2.96 2.80 2.67 2.56 2.48
200 11.15 7.15 5.63 4.81 4.29 3.92 3.65 3.43 3.26 3.12 2.90 2.74 2.61 2.51 2.42
250 11.09 7.10 5.59 4.77 4.25 3.88 3.61 3.40 3.23 3.09 2.87 2.71 2.58 2.48 2.39
300 11.04 7.07 5.56 4.75 4.22 3.86 3.59 3.38 3.21 3.07 2.85 2.69 2.56 2.46 2.37
400 10.99 7.03 5.53 4.71 4.19 3.83 3.56 3.35 3.18 3.04 2.82 2.66 2.53 2.43 2.34
500 10.96 7.00 5.51 4.69 4.18 3.81 3.54 3.33 3.16 3.02 2.81 2.64 2.52 2.41 2.33
600 10.94 6.99 5.49 4.68 4.16 3.80 3.53 3.32 3.15 3.01 2.80 2.63 2.51 2.40 2.32
750 10.91 6.97 5.48 4.67 4.15 3.79 3.52 3.31 3.14 3.00 2.78 2.62 2.49 2.39 2.31
1000 10.89 6.96 5.46 4.65 4.14 3.78 3.51 3.30 3.13 2.99 2.77 2.61 2.48 2.38 2.30


TABLE A.3 (continued)

F Distribution: Critical Values of F (0.1% significance level)

v1 25 30 35 40 50 60 75 100 150 200
v2
1 6.24e05 6.26e05 6.28e05 6.29e05 6.30e05 6.31e05 6.32e05 6.33e05 6.35e05 6.35e05
2 999.46 999.47 999.47 999.47 999.48 999.48 999.49 999.49 999.49 999.49
3 125.84 125.45 125.17 124.96 124.66 124.47 124.27 124.07 123.87 123.77
4 45.70 45.43 45.23 45.09 44.88 44.75 44.61 44.47 44.33 44.26
5 25.08 24.87 24.72 24.60 24.44 24.33 24.22 24.12 24.01 23.95
6 16.85 16.67 16.54 16.44 16.31 16.21 16.12 16.03 15.93 15.89
7 12.69 12.53 12.41 12.33 12.20 12.12 12.04 11.95 11.87 11.82
8 10.26 10.11 10.00 9.92 9.80 9.73 9.65 9.57 9.49 9.45
9 8.69 8.55 8.46 8.37 8.26 8.19 8.11 8.04 7.96 7.93
10 7.60 7.47 7.37 7.30 7.19 7.12 7.05 6.98 6.91 6.87
11 6.81 6.68 6.59 6.52 6.42 6.35 6.28 6.21 6.14 6.10
12 6.22 6.09 6.00 5.93 5.83 5.76 5.70 5.63 5.56 5.52
13 5.75 5.63 5.54 5.47 5.37 5.30 5.24 5.17 5.10 5.07
14 5.38 5.25 5.17 5.10 5.00 4.94 4.87 4.81 4.74 4.71
15 5.07 4.95 4.86 4.80 4.70 4.64 4.57 4.51 4.44 4.41
16 4.82 4.70 4.61 4.54 4.45 4.39 4.32 4.26 4.19 4.16
17 4.60 4.48 4.40 4.33 4.24 4.18 4.11 4.05 3.98 3.95
18 4.42 4.30 4.22 4.15 4.06 4.00 3.93 3.87 3.80 3.77
19 4.26 4.14 4.06 3.99 3.90 3.84 3.78 3.71 3.65 3.61
20 4.12 4.00 3.92 3.86 3.77 3.70 3.64 3.58 3.51 3.48
21 4.00 3.88 3.80 3.74 3.64 3.58 3.52 3.46 3.39 3.36
22 3.89 3.78 3.70 3.63 3.54 3.48 3.41 3.35 3.28 3.25
23 3.79 3.68 3.60 3.53 3.44 3.38 3.32 3.25 3.19 3.16
24 3.71 3.59 3.51 3.45 3.36 3.29 3.23 3.17 3.10 3.07
25 3.63 3.52 3.43 3.37 3.28 3.22 3.15 3.09 3.03 2.99
26 3.56 3.44 3.36 3.30 3.21 3.15 3.08 3.02 2.95 2.92
27 3.49 3.38 3.30 3.23 3.14 3.08 3.02 2.96 2.89 2.86
28 3.43 3.32 3.24 3.18 3.09 3.02 2.96 2.90 2.83 2.80
29 3.38 3.27 3.18 3.12 3.03 2.97 2.91 2.84 2.78 2.74
30 3.33 3.22 3.13 3.07 2.98 2.92 2.86 2.79 2.73 2.69
35 3.13 3.02 2.93 2.87 2.78 2.72 2.66 2.59 2.52 2.49
40 2.98 2.87 2.79 2.73 2.64 2.57 2.51 2.44 2.38 2.34
50 2.79 2.68 2.60 2.53 2.44 2.38 2.31 2.25 2.18 2.14
60 2.67 2.55 2.47 2.41 2.32 2.25 2.19 2.12 2.05 2.01
70 2.58 2.47 2.39 2.32 2.23 2.16 2.10 2.03 1.95 1.92
80 2.52 2.41 2.32 2.26 2.16 2.10 2.03 1.96 1.89 1.85
90 2.47 2.36 2.27 2.21 2.11 2.05 1.98 1.91 1.83 1.79
100 2.43 2.32 2.24 2.17 2.08 2.01 1.94 1.87 1.79 1.75
120 2.37 2.26 2.18 2.11 2.02 1.95 1.88 1.81 1.73 1.68
150 2.32 2.21 2.12 2.06 1.96 1.89 1.82 1.74 1.66 1.62
200 2.26 2.15 2.07 2.00 1.90 1.83 1.76 1.68 1.60 1.55
250 2.23 2.12 2.03 1.97 1.87 1.80 1.72 1.65 1.56 1.51
300 2.21 2.10 2.01 1.94 1.85 1.78 1.70 1.62 1.53 1.48
400 2.18 2.07 1.98 1.92 1.82 1.75 1.67 1.59 1.50 1.45
500 2.17 2.05 1.97 1.90 1.80 1.73 1.65 1.57 1.48 1.43
600 2.16 2.04 1.96 1.89 1.79 1.72 1.64 1.56 1.46 1.41
750 2.15 2.03 1.95 1.88 1.78 1.71 1.63 1.55 1.45 1.40
1000 2.14 2.02 1.94 1.87 1.77 1.69 1.62 1.53 1.44 1.38
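
Software can also be used the other way round: instead of comparing an observed test statistic with a tabulated critical value, one can compute the exact p-value directly. A minimal sketch in Python follows (again assuming scipy is available; the observed statistic and degrees of freedom are hypothetical, chosen only to illustrate the calls).

from scipy.stats import f

# Hypothetical observed F statistic with v1 = 10 and v2 = 50 degrees of freedom.
f_obs, v1, v2 = 4.20, 10, 50

# Exact upper-tail p-value: P(F >= f_obs) under the null hypothesis.
p_value = f.sf(f_obs, v1, v2)

# Critical value at the 0.1% significance level, for comparison with the
# tabulated entry (about 3.67 for v1 = 10, v2 = 50).
critical_001 = f.isf(0.001, v1, v2)

print(round(p_value, 4), round(critical_001, 2), f_obs > critical_001)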

Dennis V. Lindley, William F. Scott, New Cambridge Statistical Tables (1995) © Cambridge University Press, reproduced with permission.
