Introduction to mathematical statistics
J.S. Abdey
ST1215
2023
Undergraduate study in
Economics, Management,
Finance and the Social Sciences
This subject guide is for a 100 course offered as part of the University of London’s
undergraduate study in Economics, Management, Finance and the Social
Sciences. This is equivalent to Level 4 within the Framework for Higher Education
Qualifications in England, Wales and Northern Ireland (FHEQ).
For more information about the University of London, see: london.ac.uk
This guide was prepared for the University of London by:
James S. Abdey, BA (Hons), MSc, PGCertHE, PhD, Department of Statistics, London
School of Economics and Political Science.
This is one of a series of subject guides published by the University. We regret that
due to pressure of work the author is unable to enter into any correspondence
relating to, or arising from, the guide. If you have any comments on this subject
guide, please communicate these through the discussion forum on the virtual
learning environment.
University of London
Publications office
Stewart House
32 Russell Square
London WC1B 5DN
United Kingdom
london.ac.uk
Contents
0 Preface
0.1 Route map to the subject guide
0.2 Introduction to the subject area
0.3 The role of statistics in the research process
0.4 Aims and objectives
0.5 Learning outcomes
0.6 Employability outcomes
0.7 Overview of the learning resources
0.7.1 The subject guide
0.7.2 Essential reading
0.7.3 Further reading
0.8 Examination advice
0.8.1 Online study resources
0.8.2 The VLE
0.8.3 Making use of the Online Library
2 Probability theory
2.1 Synopsis of chapter
2.2 Learning outcomes
2.3 Introduction
2.4 Set theory: the basics
2.5 Axiomatic definition of probability
2.5.1 Basic properties of probability
2.6 Classical probability and counting rules
2.6.1 Brute force: listing and counting
2.6.2 Combinatorial counting methods
2.6.3 Combining counts: rules of sum and product
2.7 Conditional probability and Bayes’ theorem
2.7.1 Independence of multiple events
2.7.2 Independent versus mutually exclusive events
2.7.3 Conditional probability of independent events
2.7.4 Chain rule of conditional probabilities
2.7.5 Total probability formula
2.7.6 Bayes’ theorem
2.8 Overview of chapter
3 Random variables
3.1 Synopsis of chapter
3.2 Learning outcomes
3.3 Introduction
3.4 Discrete random variables
3.4.1 Probability distribution of a discrete random variable
3.4.2 The cumulative distribution function (cdf)
3.4.3 Properties of the cdf for discrete distributions
3.4.4 General properties of the cdf
3.4.5 Properties of a discrete random variable
3.4.6 Expected value versus sample mean
3.4.7 Moments of a random variable
3.4.8 The moment generating function
3.5 Continuous random variables
3.5.1 Moment generating functions
3.5.2 Median of a random variable
3.6 Overview of chapter
3.7 Key terms and concepts
3.8 Sample examination questions
3.9 Solutions to Sample examination questions
Chapter 0
Preface
By successfully completing this course, you will understand the ideas of randomness and
variability, and the way in which they link to probability theory. This will allow the use
of a systematic and logical collection of statistical techniques of great practical
importance in many applied areas. The examples in this subject guide will concentrate
on the social sciences, but the methods are important for the physical sciences too. This
subject aims to provide a grounding in probability theory and some of the most
common statistical methods.
The material in ST1215 Introduction to mathematical statistics is necessary as
preparation for other subjects you may study later on in your degree. The full details of
the ideas discussed in this subject guide will not always be required in these other
subjects, but you will need to have a solid understanding of the main concepts. This
can only be achieved by seeing how the ideas emerge in detail.
For statistics, you need some familiarity with abstract mathematical ideas, as well as
the ability and common sense to apply these to real-life problems. The concepts you will
encounter in probability and statistical inference are hard to absorb by just reading
about them in a book. You need to read, then think a little, then try some problems,
and then read and think some more. This procedure should be repeated until the
problems are easy to do; you should not spend a long time reading and forget about
solving problems.
We begin with an illustrative example of how statistics can be applied in a research
context.
Research may be about almost any topic: physics, biology, medicine, economics, history,
literature etc. Most of our examples will be from the social sciences: economics,
management, finance, sociology, political science, psychology etc. Research in this sense
is not just what universities do. Governments, businesses, and all of us as individuals do
it too. Statistics is used in essentially the same way for all of these.
Two examples of such research questions:
Understanding the gender pay gap: what has competition got to do with it?
Heeding the push from below: how do social movements persuade the rich to listen to the poor?
We can think of the empirical research process as having five key stages.
2. Research design: deciding what kinds of data to collect, how and from where.
The main job of statistics is the analysis of data, although it also informs other stages
of the research process. Statistics are used when the data are quantitative, i.e. in the
form of numbers.
Statistical analysis of quantitative data has the following features.
It can cope with large volumes of data, in which case the first task is to provide an
understandable summary of the data. This is the job of descriptive statistics.
It can deal with situations where the observed data are regarded as only a part (a
sample) from all the data which could have been obtained (the population). There
is then uncertainty in the conclusions. Measuring this uncertainty is the job of
statistical inference.
We continue with an example of how statistics can be used to help answer a research
question.
Gill, M. and A. Spriggs ‘Assessing the impact of CCTV’, Home Office Research
Study 292.
Intervention: CCTV cameras installed in the target area but not in the
control area.
Compare measures of crime and the fear of crime in the target and control
areas in the 12 months before and 12 months after the intervention.
Level of crime: the number of crimes recorded by the police, in the 12 months
before and 12 months after the intervention.
• Question considered here: ‘In general, how much, if at all, do you worry
that you or other people in your household will be victims of crime?’ (from
1 = ‘all the time’ to 5 = ‘never’).
Statistical analysis of the data.
It is possible to calculate various statistics, for example the relative effect size, RES = ([d]/[c])/([b]/[a]) = 0.98, which is a summary measure comparing the changes in the two areas.
RES < 1, which means that the observed change in the reported fear of crime has been slightly less favourable in the target area than in the control area.
However, there is uncertainty because of sampling: only 168 and 242 individuals
were actually interviewed at each time in each area, respectively.
The confidence interval for RES includes 1, which means that changes in the
self-reported fear of crime in the two areas are ‘not statistically significantly
different’ from each other.
Now the RES > 1, which means that the observed change in the number of
crimes has been worse in the control area than in the target area.
However, the numbers of crimes in each area are fairly small, which means that
these estimates of the changes in crime rates are fairly uncertain.
The confidence interval for RES again includes 1, which means that the changes
in crime rates in the two areas are not statistically significantly different from
each other.
In summary, this study did not support the claim that the introduction of CCTV
reduces crime or the fear of crime.
If you want to read more about research of this question, see Welsh, B.C. and
D.P. Farrington ‘Effects of closed circuit television surveillance on crime’,
Campbell Systematic Reviews 17 2008.
Many of the statistical terms and concepts mentioned above have not been explained yet
– that is what the rest of the course is for! However, it serves as an interesting example
of how statistics can be employed in the social sciences to investigate research questions.
The course provides a precise and accurate treatment of introductory probability and
distribution theory, statistical ideas, methods and techniques. Topics covered are data
visualisation and descriptive statistics, probability theory, random variables, common
distributions of random variables, multivariate random variables, sampling distributions
of statistics, point estimation, interval estimation, hypothesis testing, analysis of
variance (ANOVA) and linear regression.
apply and be competent users of standard statistical operators and be able to recall
a variety of well-known probability distributions and their respective moments
Employability skills developed in this course include:
1. complex problem-solving
2. decision making
3. communication.
The subject guide provides a range of activities that will enable you to test your
understanding of the basic ideas and concepts. We want to encourage you to try the
exercises you encounter throughout the material before working through the solutions.
With statistics, the motto has to be ‘practise, practise, practise. . .’. It is the best way to
learn the material and prepare for examinations. The course is rigorous and demanding,
but the skills you will be developing will be rewarding and well recognised by future
employers.
A suggested approach for students studying ST1215 Introduction to mathematical statistics is to split the material into 10 two-week blocks.
The last step is the most important. It is easy to think that you have understood the
material after reading it, but working through problems is the crucial test of
understanding. Problem-solving should take up most of your study time.
To prepare for the examination, you will only need to read the material in the subject
guide, but it may be helpful from time to time to look at the suggested ‘Further
reading’ below.
Basic notation
We often use the symbol □ to denote the end of a proof, where we have finished
explaining why a particular result is true. This is just to make it clear where the proof
ends and the following text begins.
Calculators
A calculator may be used when answering questions on the examination paper for
ST1215 Introduction to mathematical statistics. It must comply in all respects
with the specification given in the Programme regulations. You should also refer to the
admission notice you will receive when entering the examination and the ‘Notice on
permitted materials’.
Computers
If you are aiming to carry out serious statistical analysis (which is beyond the level of
this course) you will probably want to use some statistical software package, such as R.
It is not necessary for this course to have such software available, but if you do have
access to it you may benefit from using it in your study of the material. On a few
occasions in this subject guide R will be used for illustrative purposes only. You will not
be examined on R.
This subject guide is ‘self-contained’, meaning that this is the only resource which is
essential reading for ST1215 Introduction to mathematical statistics. Throughout
the subject guide there are many worked examples, practice problems and sample
examination questions replicating resources typically provided in statistical textbooks.
You may, however, feel you could benefit from reading textbooks, and a suggested list of
these is provided below.
Statistical tables
Lindley, D.V. and W.F. Scott New Cambridge Statistical Tables. (Cambridge:
Cambridge University Press, 1995) second edition [ISBN 9780521484855].
As relevant extracts of these statistical tables are the same as those distributed for use
in the examination, it is advisable that you become familiar with them, rather than
those at the end of a textbook.
Freedman, D., R. Pisani and R. Purves Statistics. (New York: W.W. Norton &
Company, 2007) fourth edition [ISBN 9780393930436].
Johnson, R.A. and G.K. Bhattacharyya Statistics: Principles and Methods. (New
York: John Wiley and Sons, 2010) sixth edition [ISBN 9780470505779].
Larsen, R.J. and M.J. Marx An Introduction to Mathematical Statistics and Its
Applications. (London: Pearson, 2017) sixth edition [ISBN 9780134114217].
Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and
Economics. (London: Pearson, 2012) eighth edition [ISBN 9780273767060].
where available, past examination papers and Examiners’ commentaries for the
course which give advice on how each question might best be answered.
Course materials: Subject guides and other course materials available for
download. In some courses, the content of the subject guide is transferred into the
VLE and additional resources and activities are integrated with the text.
Discussion forums: A space where you can share your thoughts and questions
with fellow students. Many forums will be supported by a ‘course moderator’, a
subject expert employed by LSE to facilitate the discussion and clarify difficult
topics.
Study skills: Expert advice on getting started with your studies, preparing for
examinations and developing your digital literacy skills.
Note: Students registered for Laws courses also receive access to the dedicated Laws
VLE.
Some of these resources are available for certain courses only, but we are expanding our
provision all the time and you should check the VLE regularly for updates.
Chapter 1
Data visualisation and descriptive
statistics
1.3 Introduction
Starting point: a collection of numerical data (a sample) has been collected in order to
answer some questions. Statistical analysis may have two broad aims.
1. Description: summarise the observed data in order to make them understandable.
2. Statistical inference: use the observed data to draw conclusions about some broader population.
Sometimes ‘1.’ is the only aim. Even when ‘2.’ is the main aim, ‘1.’ is still an essential
first step.
Data do not speak for themselves. There are usually simply too many numbers to make
sense of just by staring at them. Descriptive statistics attempt to summarise some
key features of the data to make them understandable and easy to communicate.
These summaries may be graphical or numerical (tables or individual summary
statistics).
Example 1.1 We consider data for 155 countries and territories on three variables
from around 2002. The data can be found in the file ‘Countries.csv’ (available on
the VLE). The variables are the following.
The region of the country, coded as 1 = Africa, 2 = Asia, 3 = Europe, 4 = South America, 5 = North America and 6 = Oceania.
A measure of the level of democracy, on a scale from 0 (least democratic) to 10 (most democratic).
Gross domestic product per capita (GDP per capita) (i.e. per person, in $000s), which is a ratio scale.
The statistical data in a sample are typically stored in a data matrix, as shown in
Figure 1.1 (on the next page).
Rows of the data matrix correspond to different units (subjects/observations), and columns correspond to different variables. Here, region, the level of democracy, and GDP per capita are the variables.
A continuous variable can, in principle, take any real values within some interval.
In Example 1.1, GDP per capita is continuous, taking any non-negative value.
A variable is discrete if it is not continuous, i.e. if it can only take certain values,
but not any others.
In Example 1.1, region and the level of democracy are discrete, with possible
values of 1, 2, . . . , 6, and 0, 1, 2, . . . , 10, respectively.
Many discrete variables have only a finite number of possible values. In Example 1.1, the
region variable has 6 possible values, and the level of democracy has 11 possible values.
The simplest possibility is a binary, or dichotomous, variable, with just two possible
values. For example, a person’s sex could be recorded as 1 = female and 2 = male.1
A discrete variable can also have an unlimited number of possible values.
Example 1.2 In Example 1.1, the levels of democracy have a meaningful ordering,
from less democratic to more democratic countries. The numbers assigned to the
different levels must also be in this order, i.e. a larger number = more democratic.
In contrast, different regions (Africa, Asia, Europe, South America, North America
and Oceania) do not have such an ordering. The numbers used for the region
variable are just labels for different regions. A different numbering (such as
6 = Africa, 5 = Asia, 1 = Europe, 3 = South America, 2 = North America and
4 = Oceania) would be just as acceptable as the one we originally used. Some
statistical methods are appropriate for variables with both ordered and unordered
values, some only in the ordered case. Unordered categories are nominal data;
ordered categories are ordinal data.
1. Note that because sex is a nominal variable, the coding is arbitrary. We could also have, for example,
0 = male and 1 = female, or 0 = female and 1 = male. However, it is important to remember which
coding has been used!
2. In practice, of course, there is a finite number of internet users in the world. However, it is reasonable
to treat this variable as taking an unlimited number of possible values.
The sample distribution of a variable consists of:
a list of the values of the variable which are observed in the sample
the number of times each value occurs (the counts or frequencies of the observed values).
When the number of different observed values is small, we can show the whole sample
distribution as a frequency table of all the values and their frequencies.
Example 1.3 Continuing with Example 1.1, the observations of the region variable
in the sample are:
3 5 3 3 3 5 3 3 6 3 2 3 3 3 3
3 3 2 2 2 3 6 2 3 2 2 2 3 3 2
2 3 3 3 2 4 3 2 3 1 4 3 1 3 3
4 4 4 1 2 4 3 4 3 2 1 2 3 1 3
2 1 4 2 4 3 1 4 6 2 1 3 4 2 1
4 4 4 2 3 2 4 1 4 1 4 2 2 2 4
2 2 1 4 2 1 4 2 2 4 4 1 6 3 1
2 1 2 2 1 1 2 1 1 3 2 2 1 2 4
2 1 2 1 1 2 1 2 1 2 1 1 1 1 1
1 1 1 2 1 1 1 1 1 2 1 1 1 1 1
1 1 1 2 1
Region                 Frequency (count)    Relative frequency (%)
(1) Africa                     48                31.0  (= 100 × (48/155))
(2) Asia                       44                28.4
(3) Europe                     34                21.9
(4) South America              23                14.8
(5) North America               2                 1.3
(6) Oceania                     4                 2.6
Total                         155               100
Here ‘%’ is the percentage of countries in a region, out of the 155 countries in the
sample. This is a measure of proportion (that is, relative frequency).
Similarly, for the level of democracy, the frequency table is:
Level of democracy    Frequency      %      Cumulative %
0                          35        22.6        22.6
1                          12         7.7        30.3
2                           4         2.6        32.9
3                           6         3.9        36.8
4                           5         3.2        40.0
5                           5         3.2        43.2
6                          12         7.7        50.9
7                          13         8.4        59.3
8                          16        10.3        69.6
9                          15         9.7        79.3
10                         32        20.6       100
Total                     155       100
‘Cumulative %’ for a value of the variable is the sum of the percentages for that
value and all lower-numbered values.
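Although R is not examinable in this course, it can be helpful to see how such frequency tables are produced. Below is a minimal sketch; the data file name is taken from Example 1.1, but the column names region and democracy are illustrative assumptions rather than names confirmed by the file itself.

    # Read the countries data (file available on the VLE); column names are assumed
    countries <- read.csv("Countries.csv")

    # Frequency and relative frequency (%) of region
    freq <- table(countries$region)
    cbind(Frequency = freq,
          "Relative frequency (%)" = round(100 * freq / sum(freq), 1))

    # Frequencies, percentages and cumulative percentages for the level of democracy
    f <- table(countries$democracy)
    pct <- 100 * f / sum(f)
    cbind(Frequency = f, "%" = round(pct, 1), "Cumulative %" = round(cumsum(pct), 1))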
Example 1.4 Continuing with Example 1.1, values of GDP per capita can be grouped into non-overlapping intervals and summarised in a table of frequencies. Figure 1.3 shows a histogram of GDP per capita with a greater number of intervals to better display the sample distribution.
Example 1.5 Figure 1.4 shows a (more-or-less) symmetric sample distribution for
diastolic blood pressure.
Figure 1.4: Diastolic blood pressures of 4,489 respondents aged 25 or over, Health Survey
for England, 2002.
Example 1.6 Figure 1.5 (on the next page) shows a (slightly) negatively-skewed
distribution of marks in an examination. Note the data relate to all candidates
sitting the examination. Therefore, the histogram shows the population distribution,
not a sample distribution.
Figure 1.5: Histogram of examination marks.
We begin with measures of central tendency. These answer the question: where is
the ‘centre’ or ‘average’ of the distribution?
We consider the following measures of central tendency:
mean
median
mode.
Example 1.7 We use Xi to denote the value of X for unit i, where i can take
values 1, 2, . . . , n, and n is the sample size.
Therefore, the n observations of X in the dataset (the sample) are X1 , X2 , . . . , Xn .
These can also be written as Xi , for i = 1, 2, . . . , n.
This may be written as $\sum_i X_i$, or just $\sum X_i$. Other versions of the same idea are:

infinite sums: $\sum_{i=1}^{\infty} X_i = X_1 + X_2 + \cdots$.

The (sample) mean of the observations X1, X2, . . . , Xn is $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$. For example, the mean of the three observations 1, 4 and 7 is:
$$\frac{1+4+7}{3} = \frac{12}{3} = 4.$$
If a variable has a small number of distinct values, X̄ is easy to calculate from the
frequency table. For example, the level of democracy has just 11 different values
which occur in the sample 35, 12, . . . , 32 times each, respectively.
Suppose X has K different values X1, X2, . . . , XK, with corresponding frequencies f1, f2, . . . , fK. Therefore, $\sum_{j=1}^{K} f_j = n$ and:
$$\bar{X} = \frac{\sum_{j=1}^{K} f_j X_j}{\sum_{j=1}^{K} f_j} = \frac{f_1 X_1 + f_2 X_2 + \cdots + f_K X_K}{f_1 + f_2 + \cdots + f_K} = \frac{f_1 X_1 + f_2 X_2 + \cdots + f_K X_K}{n}.$$
In our example, the mean of the level of democracy (where K = 11) is:
$$\bar{X} = \frac{35 \times 0 + 12 \times 1 + \cdots + 32 \times 10}{35 + 12 + \cdots + 32} = \frac{0 + 12 + \cdots + 320}{155} \approx 5.3.$$
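As a quick check, the same mean can be computed in R directly from the values and frequencies in the table above:

    x <- 0:10                                        # values of the level of democracy
    f <- c(35, 12, 4, 6, 5, 5, 12, 13, 16, 15, 32)   # frequencies from the table

    sum(f * x) / sum(f)        # 5.348..., i.e. approximately 5.3
    weighted.mean(x, w = f)    # the same calculation using a built-in function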
For example, consider a sample of n = 5 observations: 1, 2, 3, 5 and 9, for which X̄ = 20/5 = 4 and the median is 3. Deviations from X̄ (= 4) and from the median (= 3) are:

 i     Xi    Xi − X̄    (Xi − X̄)²    Xi − 3    (Xi − 3)²
 1      1      −3           9          −2          4
 2      2      −2           4          −1          1
 3      3      −1           1           0          0
 4      5      +1           1          +2          4
 5      9      +5          25          +6         36
Sum    20       0          40          +5         45
We see that the sum of deviations from the mean is 0, i.e. we have:
$$\sum_{i=1}^{n} (X_i - \bar{X}) = 0.$$
The mean is ‘in the middle’ of the observations X1 , X2 , . . . , Xn , in the sense that
positive and negative values of the deviations Xi − X̄ cancel out, when summed over all
the observations.
Also, the smallest possible value of the sum of squared deviations $\sum_{i=1}^{n} (X_i - C)^2$ for any constant C is obtained when C = X̄.
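Both facts are easy to verify numerically. A short R illustration, using the five observations from the example above:

    x <- c(1, 2, 3, 5, 9)
    xbar <- mean(x)                    # 4

    sum(x - xbar)                      # 0: deviations from the mean cancel out

    ssd <- function(C) sum((x - C)^2)  # sum of squared deviations about C
    ssd(xbar)                          # 40
    ssd(3)                             # 45: larger than at C = xbar
    optimize(ssd, interval = c(0, 10))$minimum   # approximately 4, i.e. xbar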
Median
The (sample) median, q50 , of a variable X is the value which is ‘in the middle’ of
the ordered sample.
For example, if n = 4, then q50 = (X(2) + X(3))/2, i.e. the average of the two middle observations in the ordered sample.
Example 1.10 Continuing with Example 1.1, n = 155, so q50 = X(78) . For the level
of democracy, the median is 6.
From a table of frequencies, the median is the value for which the cumulative
percentage first reaches 50% (or, if a cumulative % is exactly 50%, the average of the
corresponding value of X and the next highest value).
The ordered values of the level of democracy are:
(.0) (.1) (.2) (.3) (.4) (.5) (.6) (.7) (.8) (.9)
(0.) 0 0 0 0 0 0 0 0 0
(1.) 0 0 0 0 0 0 0 0 0 0
(2.) 0 0 0 0 0 0 0 0 0 0
(3.) 0 0 0 0 0 0 1 1 1 1
(4.) 1 1 1 1 1 1 1 1 2 2
(5.) 2 2 3 3 3 3 3 3 4 4
(6.) 4 4 4 5 5 5 5 5 6 6
(7.) 6 6 6 6 6 6 6 6 6 6
(8.) 7 7 7 7 7 7 7 7 7 7
(9.) 7 7 7 8 8 8 8 8 8 8
(10.) 8 8 8 8 8 8 8 8 8 9
(11.) 9 9 9 9 9 9 9 9 9 9
(12.) 9 9 9 9 10 10 10 10 10 10
(13.) 10 10 10 10 10 10 10 10 10 10
(14.) 10 10 10 10 10 10 10 10 10 10
(15.) 10 10 10 10 10 10
The median can also be determined from the frequency table of the level of democracy: it is the value at which the cumulative percentage first reaches 50%, i.e. 6.
To see the effect of outliers, consider the following small sample:
1, 2, 4, 5, 8.
Both the mean and the median are 4. Now add one unusually large observation:
1, 2, 4, 5, 8, 100.
The median is now 4.5, and the mean is 20. In general, the mean is affected much more
than the median by outliers, i.e. unusually small or large observations. Therefore, you
should identify outliers early on and investigate them – perhaps there has been a data
entry error, which can simply be corrected. If deemed genuine outliers, a decision has to
be made about whether or not to remove them.
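This sensitivity of the mean to outliers is easy to see in R:

    x <- c(1, 2, 4, 5, 8)
    mean(x); median(x)        # both are 4

    y <- c(x, 100)            # add an unusually large observation
    mean(y); median(y)        # 20 and 4.5: the mean moves a lot, the median only a little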
For an exactly symmetric distribution, the mean and median are equal.
When summarising variables with skewed distributions, it is useful to report both the
mean and the median.
                             Mean    Median
Level of democracy            5.3      6
GDP per capita                8.6      4.7
Diastolic blood pressure     74.2     73.5
Examination marks            56.6     57.0
1.6.7 Mode
The (sample) mode of a variable is the value which has the highest frequency (i.e.
appears most often) in the data.
Example 1.12 For Example 1.1, the modal region is 1 (Africa) and the mode of
the level of democracy is 0.
The mode is not very useful for continuous variables which have many different values,
such as GDP per capita in Example 1.1. A variable can have several modes (i.e. be
multimodal). For example, GDP per capita has modes 0.8 and 1.9, both with 5
countries out of the 155.
The mode is the only measure of central tendency which can be used even when the
values of a variable have no ordering, such as for the (nominal) region variable in
Example 1.1.
Example 1.13 A small example determining the sum of the squared deviations
from the (sample) mean, used to calculate common measures of dispersion.
Deviations from X̄ (= 4):

 i     Xi    Xi²    Xi − X̄    (Xi − X̄)²
 1      1      1       −3          9
 2      2      4       −2          4
 3      3      9       −1          1
 4      5     25       +1          1
 5      9     81       +5         25
Sum    20    120        0         40

so that X̄ = 20/5 = 4, Σ Xi² = 120 and Σ (Xi − X̄)² = 40.
The (sample) variance of X is:
$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$
and the (sample) standard deviation is $S = \sqrt{S^2}$.
These are the most commonly-used measures of dispersion. The standard deviation is more understandable than the variance, because the standard deviation is expressed in the same units as X (rather than the variance, which is expressed in squared units).
A useful rule-of-thumb for interpretation is that for many symmetric distributions, such
as the ‘normal’ distribution:
about 2/3 of the observations are between X̄ − S and X̄ + S, that is, within one
(sample) standard deviation about the (sample) mean
Remember that standard deviations (and variances) are never negative, and they are
zero only if all the Xi observations are the same (that is, there is no variation in the
data).
If we are using a frequency table, we can also calculate:
$$S^2 = \frac{1}{n-1} \left( \sum_{j=1}^{K} f_j X_j^2 - n\bar{X}^2 \right).$$
Returning to the data of Example 1.13, we have:
$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{40}{4} = 10 = \frac{1}{n-1} \left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right) = \frac{120 - 5 \times 4^2}{4}$$
and $S = \sqrt{S^2} = \sqrt{10} = 3.16$.
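The built-in R functions use the same n − 1 divisor, so they reproduce these values:

    x <- c(1, 2, 3, 5, 9)
    sum((x - mean(x))^2) / (length(x) - 1)   # 10, from the definition
    var(x)                                   # 10
    sd(x)                                    # 3.162278, the square root of the variance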
The median, q50 , is basically the value which divides the sample into the smallest 50%
of observations and the largest 50%. If we consider other percentage splits, we get other
(sample) quantiles (percentiles), qc .
The first quartile, q25 or Q1 , is the value which divides the sample into the
smallest 25% of observations and the largest 75%.
The extremes in this spirit are the minimum, X(1) (the ‘0% quantile’, so to
speak), and the maximum, X(n) (the ‘100% quantile’).
These are no longer ‘in the middle’ of the sample, but they are more general
measures of location of the sample distribution.
The range is the difference between the largest and smallest observations, i.e. X(n) − X(1). The interquartile range (IQR) is the difference between the third and first quartiles, i.e. IQR = Q3 − Q1.
The range is, clearly, extremely sensitive to outliers, since it depends on nothing but the extremes of the distribution, i.e. the minimum and maximum observations. The IQR focuses on the middle 50% of the distribution, so it is completely insensitive to outliers.
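In R, these measures can be obtained as follows (note that quantile() has several conventions for computing sample quantiles, so results for small samples may differ slightly from hand calculations):

    x <- c(1, 2, 3, 5, 9)
    quantile(x, probs = c(0.25, 0.5, 0.75))   # Q1, the median and Q3
    diff(range(x))                            # the range: maximum minus minimum
    IQR(x)                                    # the interquartile range, Q3 - Q1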
1.7.4 Boxplots
A boxplot (in full, a box-and-whiskers plot) summarises some key features of a sample
distribution using quantiles. The plot comprises the following.
The box, whose edges are the first and third quartiles (Q1 and Q3 ). Hence the box
captures the middle 50% of the data. Therefore, the length of the box is the
interquartile range.
The bottom whisker extends either to the minimum or up to a length of 1.5 times
the interquartile range below the first quartile, whichever is closer to the first
quartile.
The top whisker extends either to the maximum or up to a length of 1.5 times the
interquartile range above the third quartile, whichever is closer to the third quartile.
Points beyond 1.5 times the interquartile range below the first quartile or above the
third quartile are regarded as outliers, and plotted as individual points.
A much longer whisker (and/or outliers) in one direction relative to the other indicates
a skewed distribution, as does a median line not in the middle of the box.
Example 1.16 Figure 1.7 (on the next page) displays a boxplot of GDP per capita
using the sample of 155 countries introduced in Example 1.1. Some summary
statistics for this variable are reported below.
                   Mean   Median   Standard deviation   IQR   Range
GDP per capita      8.6     4.7           9.5            9.7    37.3
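A sketch of how these summary statistics and a boxplot like Figure 1.7 could be produced in R, again assuming the data frame countries with a column gdp (an illustrative name, not confirmed by the data file):

    countries <- read.csv("Countries.csv")   # column name 'gdp' is assumed
    gdp <- countries$gdp

    c(mean = mean(gdp), median = median(gdp), sd = sd(gdp),
      IQR = IQR(gdp), range = diff(range(gdp)))

    # Box-and-whiskers plot; points beyond 1.5 times the IQR from the box appear as outliers
    boxplot(gdp, horizontal = TRUE, xlab = "GDP per capita ($000s)")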
1.8.1 Scatterplots
A scatterplot shows the values of two continuous variables against each other, plotted
as points in a two-dimensional coordinate system.
Example 1.17 A plot of data for 164 countries is shown in Figure 1.8 (on the next page), which plots GDP per capita against a measure of the perceived level of corruption in each country.
Interpretation: it appears that virtually all countries with high levels of corruption
have relatively low GDP per capita. At lower levels of corruption there is a positive
association, where countries with very low levels of corruption also tend to have high
GDP per capita.
Example 1.18 Figure 1.9 (on the next page) is a time series of an index of prices
of consumer goods and services in the UK for the period 1800–2009 (Office for
National Statistics; scaled so that the price level in 1974 = 100). This shows the
price inflation over this period.
Example 1.19 Figure 1.10 (on the next page) shows side-by-side boxplots of GDP
per capita for the different regions in Example 1.1.
GDP per capita in African countries tends to be very low. There is a handful of
countries with somewhat higher GDPs per capita (shown as outliers in the plot).
The median for Asia is not much higher than for Africa. However, the
distribution in Asia is very much skewed to the right, with a tail of countries
with very high GDPs per capita.
The boxplots for North America and Oceania are not very useful, because they
are based on very few countries (two and three countries, respectively).
Example 1.20 The table on the next page reports the results from a survey of 972 private investors.3 The two variables are the age group of the respondent and how much importance the respondent attaches to short-term gains from their investments.
Interpretation: look at the row percentages. For example, 17.8% of those aged under
45, but only 5.2% of those aged 65 and over, think that short-term gains are ‘very
important’. Among the respondents, the older age groups seem to be less concerned
with quick profits than the younger age groups.
3. Lewellen, W.G., R.C. Lease and G.G. Schlarbaum (1977). ‘Patterns of investment strategy and
behavior among individual investors’. The Journal of Business, 50(3), pp. 296–333.
Chapter 2
Probability theory
2.2 Learning outcomes
By the end of this chapter, you should be able to:
explain the fundamental ideas of random experiments, sample spaces and events
list the axioms of probability and be able to derive all the common probability
rules from them
list the formulae for the number of combinations and permutations of k objects out
of n, and be able to routinely use such results in problems
explain conditional probability and the concept of independent events
prove the law of total probability and apply it to problems where there is a
partition of the sample space
prove Bayes’ theorem and apply it to find conditional probabilities.
2.3 Introduction
Consider the following hypothetical example. A country will soon hold a referendum
about whether it should leave the European Union (EU). An opinion poll of a random
sample of people in the country is carried out.
950 respondents say that they plan to vote in the referendum. They answer the question
‘Will you vote ‘Yes’ or ‘No’ to leaving the EU?’ as follows:
              Answer
           Yes      No     Total
Count      513     437       950
%          54%     46%      100%
However, we are not interested in just this sample of 950 respondents, but in the
population which they represent, that is, all likely voters.
Statistical inference will allow us to say things like the following about the
population.
‘The null hypothesis that π = 0.50, against the alternative hypothesis that
π > 0.50, is rejected at the 5% significance level.’
In short, the opinion poll gives statistically significant evidence that ‘Yes’ voters are in
the majority among likely voters. Such methods of statistical inference will be discussed
later in the course.
The inferential statements about the opinion poll rely on the following assumptions and
results.
In the next few chapters, we will learn about the terms in bold, among others.
In statistical inference, the data we have observed are regarded as a sample from a
broader population, selected with a random process.
Values in a sample are also random. We cannot predict the precise values which
will be observed before we actually collect the sample.
A preview of probability
Experiment: for example, rolling a single die and recording the outcome.
Outcome of the experiment: for example, rolling a 3.
Sample space S: the set of all possible outcomes, here {1, 2, 3, 4, 5, 6}.
Event: any subset A of the sample space, for example A = {4, 5, 6}.1
B = {1, 2, 3, 4, 5}.
1 ∈ A and 2 ∈ A
6 ∉ A and 1.5 ∉ A.
The familiar Venn diagrams help to visualise statements about sets. However, Venn
diagrams are not formal proofs of results in set theory.
1. Strictly speaking not all subsets are events, as discussed later.
Example 2.3 In Figure 2.1, the darkest area in the middle is A ∩ B, the total
shaded area is A ∪ B, and the white area is (A ∪ B)c = Ac ∩ B c .
A ⊂ B when x ∈ A ⇒ x ∈ B.
Example 2.4 An example of the distinction between subsets and non-subsets is:
{1, 2, 3} ⊂ {1, 2, 3, 4}, because all elements appear in the larger set
{1, 2, 5} ⊄ {1, 2, 3, 4}, because the element 5 does not appear in the larger set.
Two sets A and B are equal (A = B) if they have exactly the same elements. This
implies that A ⊂ B and B ⊂ A.
A ∪ B = {x | x ∈ A or x ∈ B}.
That is, the set of those elements which belong to A or B (or both). An example is
shown in Figure 2.3.
For example, if A = {1, 2, 3, 4}, B = {2, 3} and C = {4, 5, 6}, then:
A ∪ B = {1, 2, 3, 4}
A ∪ C = {1, 2, 3, 4, 5, 6}
B ∪ C = {2, 3, 4, 5, 6}.
A ∩ B = {x | x ∈ A and x ∈ B}.
That is, the set of those elements which belong to both A and B. An example is
shown in Figure 2.4.
A ∩ B = {2, 3}
A ∩ C = {4}
B ∩ C = ∅.
Both set operators can also be applied to more than two sets, such as A ∩ B ∩ C.
Concise notation for the unions and intersections of sets A1 , A2 , . . . , An is:
$$\bigcup_{i=1}^{n} A_i = A_1 \cup A_2 \cup \cdots \cup A_n$$
and:
$$\bigcap_{i=1}^{n} A_i = A_1 \cap A_2 \cap \cdots \cap A_n.$$
These can also be used for an infinite number of sets, i.e. when n is replaced by ∞.
Complement (‘not’)
Suppose S is the set of all possible elements which are under consideration. In
probability, S will be referred to as the sample space.
It follows that A ⊂ S for every set A we may consider. The complement of A with
respect to S is:
Ac = {x | x ∈ S and x ∉ A}.
That is, the set of those elements of S that are not in A. An example is shown in
Figure 2.5.
We now consider some useful properties of set operators. In proofs and derivations
about sets, you can use the following results without proof.
Commutativity:
A ∩ B = B ∩ A and A ∪ B = B ∪ A.
Associativity:
A ∩ (B ∩ C) = (A ∩ B) ∩ C and A ∪ (B ∪ C) = (A ∪ B) ∪ C.
Distributive laws:
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
and:
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).
De Morgan’s laws:
(A ∩ B)c = Ac ∪ Bc and (A ∪ B)c = Ac ∩ Bc.
If S is the sample space and A and B are any sets in S, you can also use the following
results without proof:
∅c = S.
∅ ⊂ A, A ⊂ A and A ⊂ S.
A ∩ A = A and A ∪ A = A.
A ∩ Ac = ∅ and A ∪ Ac = S.
If B ⊂ A, A ∩ B = B and A ∪ B = A.
A ∩ ∅ = ∅ and A ∪ ∅ = A.
A ∩ S = A and A ∪ S = S.
∅ ∩ ∅ = ∅ and ∅ ∪ ∅ = ∅.
Two sets A and B are disjoint (mutually exclusive) if they have no elements in common, i.e. if:
A ∩ B = ∅.
Sets A1, A2, . . . , An are pairwise disjoint if all pairs of sets from them are disjoint, i.e. Ai ∩ Aj = ∅ for all i ≠ j.
Partition
The sets A1, A2, . . . , An form a partition of the set A if they are pairwise disjoint and if $\bigcup_{i=1}^{n} A_i = A$, that is, A1, A2, . . . , An are collectively exhaustive of A.
Therefore, a partition divides the entire set A into non-overlapping pieces Ai, as shown in Figure 2.6 for n = 3. Similarly, an infinite collection of sets A1, A2, . . . form a partition of A if they are pairwise disjoint and $\bigcup_{i=1}^{\infty} A_i = A$.
For example, suppose A ⊂ B, so that A ∪ B = B. We have:
A ∩ (B ∩ Ac) = (A ∩ Ac) ∩ B = ∅ ∩ B = ∅
and:
A ∪ (B ∩ Ac) = (A ∪ B) ∩ (A ∪ Ac) = B ∩ S = B.
Hence A and B ∩ Ac are mutually exclusive and collectively exhaustive of B, and so
they form a partition of B.
Example 2.8 If the experiment is ‘select a trading day at random and record the
% change in the FTSE 100 index from the previous trading day’, then the outcome
is the % change in the FTSE 100 index.
S = [−100, +∞) for the % change in the FTSE 100 index (in principle).
An event of interest might be A = {x | x > 0} – the event that the daily change is
positive, i.e. the FTSE 100 index gains value from the previous trading day.
The sample space and events are represented as sets. For two events A and B, set
operations are then interpreted as follows.
2. The precise definition also requires a careful statement of which subsets of S are allowed as events,
which we can skip on this course.
Axioms of probability
‘Probability’ is formally defined as a function P (·) from subsets (events) of the sample
space S onto real numbers.2 Such a function is a probability function if it satisfies
the following axioms (‘self-evident truths’).
Axiom 1: P (A) ≥ 0 for all events A.
Axiom 2: P (S) = 1.
Axiom 3: if A1, A2, . . . are pairwise disjoint (mutually exclusive) events, then:
P (A1 ∪ A2 ∪ · · ·) = P (A1) + P (A2) + · · · .
The axioms require that a probability function must always satisfy these requirements.
Axiom 2 requires that the outcome is some element from the sample space with
certainty (that is, with probability 1). In other words, the experiment must have
some outcome.
All other properties of the probability function can be derived from the axioms. We
begin by showing that a result like Axiom 3 also holds for finite collections of mutually
exclusive sets.
Probability property
The empty set has probability zero, i.e. P (∅) = 0.
Proof: The events ∅, ∅, . . . are pairwise disjoint (since ∅ ∩ ∅ = ∅) and their union is ∅, so Axiom 3 gives P (∅) = P (∅) + P (∅) + · · · . However, the only real number for P (∅) which satisfies this is P (∅) = 0.
Figure 2.7: Venn diagram depicting three mutually exclusive sets, A1 , A2 and A3 . Note
although A2 and A3 have touching boundaries, there is no actual intersection and hence
they are (pairwise) mutually exclusive.
Probability property
If A1, A2, . . . , An are pairwise mutually exclusive (disjoint) events, then:
P (A1 ∪ A2 ∪ · · · ∪ An) = P (A1) + P (A2) + · · · + P (An).
This follows from Axiom 3 by taking An+1 = An+2 = · · · = ∅.
Probability property
For any event A, we have P (A) ≤ 1.
Proof (by contradiction): If it was true that P (A) > 1 for some A, then we would have:
P (Ac) = P (S) − P (A) = 1 − P (A) < 0.
This violates Axiom 1, so cannot be true. Therefore, it must be that P (A) ≤ 1 for all A.
Putting this and Axiom 1 together, we get:
0 ≤ P (A) ≤ 1
for all events A.
Probability property
If A ⊂ B, then P (A) ≤ P (B).
Proof: When A ⊂ B, the events A and B ∩ Ac form a partition of B (as shown earlier), so:
P (B) = P (A) + P (B ∩ Ac) ≥ P (A)
since P (B ∩ Ac) ≥ 0.
Probability property
For any two events A and B:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
Proof: Using finite additivity on the relevant partitions:
P (A ∪ B) = P (A ∩ Bc) + P (A ∩ B) + P (Ac ∩ B)
P (A) = P (A ∩ Bc) + P (A ∩ B)
P (B) = P (Ac ∩ B) + P (A ∩ B)
and hence:
P (A) + P (B) − P (A ∩ B) = P (A ∩ Bc) + P (A ∩ B) + P (Ac ∩ B) = P (A ∪ B).
These show that the probability function has the kinds of values we expect of something
called a ‘probability’.
In particular, for any event A:
P (Ac) = 1 − P (A).
86% spend at least 1 hour watching television (event A, with P (A) = 0.86)
19% spend at least 1 hour reading newspapers (event B, with P (B) = 0.19)
15% spend at least 1 hour watching television and at least 1 hour reading
newspapers (P (A ∩ B) = 0.15).
Probability theory tells us how to work with the probability function and derive
‘probabilities of events’ from it. However, it does not tell us what ‘probability’ really
means.
There are several alternative interpretations of the real-world meaning of ‘probability’
in this sense. One of them is outlined on the next page. The mathematical theory of
probability and calculations on probabilities are the same whichever interpretation we
assign to ‘probability’. So, in this course, we do not need to discuss the matter further.
Example 2.10 How should we interpret the following, as statements about the real
world of coins and babies?
‘The probability that a tossed coin comes up heads is 0.5.’ If we tossed a coin a
large number of times, and the proportion of heads out of those tosses was 0.5,
the ‘probability of heads’ could be said to be 0.5, for that coin.
‘The probability is 0.51 that a child born in the UK today is a boy.’ If the
proportion of boys among a large number of live births was 0.51, the
‘probability of a boy’ could be said to be 0.51.
A key question is how to determine appropriate numerical values of P (A) for the
probabilities of particular events.
This is usually done empirically, by observing actual realisations of the experiment and
using them to estimate probabilities. In the simplest cases, this basically applies the
frequency definition to observed data.
If I toss a coin 10,000 times, and 5,023 of the tosses come up heads, it seems
that, approximately, P (heads) = 0.5, for that coin.
Of the 7,098,667 live births in England and Wales in the period 1999–2009,
51.26% were boys. So we could assign the value of about 0.51 to the probability
of a boy in this population.
Standard illustrations of classical probability are devices used in games of chance, such as tossing coins, rolling dice and drawing cards from a well-shuffled deck.
We will use these often, not because they are particularly important but because they
provide simple examples for illustrating various results in probability.
Suppose that the sample space, S, contains m equally likely outcomes, and that event A
consists of k ≤ m of these outcomes. Therefore:
$$P(A) = \frac{k}{m} = \frac{\text{number of outcomes in } A}{\text{total number of outcomes in the sample space } S}.$$
That is, the probability of A is the proportion of outcomes which belong to A out of all
possible outcomes.
In the classical case, the probability of any event can be determined by counting the
number of outcomes which belong to the event, and the total number of possible
outcomes.
Example 2.12 Rolling two dice, what is the probability that the sum of the two
scores is 5?
S = {(1, 1), (1, 2), (1, 3), (1, 4) , (1, 5), (1, 6),
(2, 1), (2, 2), (2, 3) , (2, 4), (2, 5), (2, 6),
(3, 1), (3, 2) , (3, 3), (3, 4), (3, 5), (3, 6),
(4, 1) , (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
(5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}.
The event of interest is A = {(1, 4), (2, 3), (3, 2), (4, 1)}, and so P (A) = 4/36 = 1/9 ≈ 0.11.
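Classical probabilities of this kind can also be found by enumerating the sample space. A small R illustration:

    # All 36 equally likely outcomes of rolling two dice
    S <- expand.grid(die1 = 1:6, die2 = 1:6)

    # Event A: the sum of the two scores is 5
    A <- (S$die1 + S$die2 == 5)
    sum(A)     # 4 outcomes in A
    mean(A)    # P(A) = 4/36 = 0.111...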
Now that we have a way of obtaining probabilities for events in the classical case, we
can use it together with the rules of probability.
The formula P (A) = 1 − P (Ac ) is convenient when we want P (A) but the probability of
the complementary event Ac , i.e. P (Ac ), is easier to find.
Example 2.13 When rolling two fair dice, what is the probability that the sum of
the dice is greater than 3?
The complement is that the sum is at most 3, i.e. the complementary event is
Ac = {(1, 1), (1, 2), (2, 1)}. Therefore, P (A) = 1 − P (Ac) = 1 − 3/36 = 33/36 = 11/12.
The formula:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
says that the probability that A or B happens (or both happen) is the sum of the
probabilities of A and B, minus the probability that both A and B happen.
Example 2.14 When rolling two fair dice, what is the probability that the two
scores are equal (event A) or that the total score is greater than 10 (event B)?
Here A = {(1, 1), (2, 2), . . . , (6, 6)} has 6 outcomes, B = {(5, 6), (6, 5), (6, 6)} has 3 outcomes, and A ∩ B = {(6, 6)}. Therefore:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = 6/36 + 3/36 − 1/36 = 8/36 = 2/9.
Example 2.15 Consider a group of four people, where each pair of people is either
connected (= friends) or not. How many different patterns of connections are there
(ignoring the identities of who is friends with whom)?
The answer is 11. See the patterns in Figure 2.8 (on the next page).
Suppose we select k objects out of n distinct objects. The number of possible selections depends on:
whether the selection is with replacement (an object can be selected more than
once) or without replacement (an object can be selected only once)
whether the selected set is treated as ordered or unordered.
Suppose first that the selection is ordered and with replacement. Therefore:
n objects are available for selection for the 1st object in the sequence
n objects are available for selection for the 2nd object in the sequence
. . . and so on, until n objects are available for selection for the kth object in the sequence.
The number of possible ordered sequences of k objects selected with replacement from n objects is therefore:
n × n × · · · × n = n^k.
Now suppose the selection is ordered but without replacement. Then:
n objects are available for selection for the 1st object in the sequence
n − 1 objects are available for selection for the 2nd object in the sequence
. . . and so on, until n − k + 1 objects are available for selection for the kth object.
The number of possible ordered sequences of k objects selected without replacement from n objects is therefore:
n × (n − 1) × · · · × (n − k + 1). (2.2)
Factorials
The number of ordered sets of n objects, selected without replacement from n objects,
is:
n! = n × (n − 1) × · · · × 2 × 1.
The number n! (read ‘n factorial’) is the total number of different ways in which
n objects can be arranged in an ordered sequence. This is known as the number of
permutations of n objects.
We also define 0! = 1.
Using factorials, the number of ordered selections without replacement in (2.2) can be written as:
$$n \times (n-1) \times \cdots \times (n-k+1) = \frac{n!}{(n-k)!}.$$
Suppose now that the identities of the objects in the selection matter, but the order
does not.
For example, the sequences (1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1) are
now all treated as the same, because they all contain the elements 1, 2 and 3.
The number of possible unordered selections of k objects out of n, without replacement, is:
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}.$$
The number $\binom{n}{k}$ is known as the binomial coefficient. Note that because 0! = 1, $\binom{n}{0} = \binom{n}{n} = 1$, so there is only 1 way of selecting 0 or n out of n objects.
                 With replacement          Without replacement
Ordered               $n^k$                    $n!/(n-k)!$
Unordered       $\binom{n+k-1}{k}$       $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$
We have not discussed the unordered, with replacement case which is non-examinable.
It is provided here only for completeness with an illustration given in Example 2.16.
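The counting rules in the table map directly onto built-in R functions, for example with n = 10 and k = 3:

    n <- 10; k <- 3

    n^k                               # ordered, with replacement: 1000
    factorial(n) / factorial(n - k)   # ordered, without replacement: 720
    choose(n, k)                      # unordered, without replacement: 120
    choose(n + k - 1, k)              # unordered, with replacement: 220 (not examinable)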
Example 2.17 Suppose we have k = 3 people (Amy, Bob and Sam). How many
different sets of birthdays can they have (day and month, ignoring the year, and
pretending 29 February does not exist, so that n = 365) in the following cases?
1. It makes a difference who has which birthday (ordered), i.e. Amy (1 January),
Bob (5 May) and Sam (5 December) is different from Amy (5 May), Bob (5
December) and Sam (1 January), and different people can have the same
birthday (with replacement). The number of different sets of birthdays is:
(365)³ = 48,627,125.
2. It makes a difference who has which birthday (ordered), and different people
must have different birthdays (without replacement). The number of different
sets of birthdays is:
365!/(365 − 3)! = 365 × 364 × 363 = 48,228,180.
3. Only the dates matter, but not who has which one (unordered), i.e. Amy (1
January), Bob (5 May) and Sam (5 December) is treated as the same as Amy (5
May), Bob (5 December) and Sam (1 January), and different people must have
different birthdays (without replacement). The number of different sets of
birthdays is:
$$\binom{365}{3} = \frac{365!}{3!\,(365-3)!} = \frac{365 \times 364 \times 363}{3 \times 2 \times 1} = 8{,}038{,}030.$$
Example 2.18 Consider a room with r people in it. What is the probability that
at least two of them have the same birthday (call this event A)? In particular, what
is the smallest r for which P (A) > 1/2?
Assume that all days are equally likely.
Label the people 1 to r, so that we can treat them as an ordered list and talk about
person 1, person 2 etc. We want to know how many ways there are to assign
birthdays to this list of people. We note the following.
1. The number of all possible sequences of birthdays, allowing repeats (i.e. with
replacement) is (365)r .
2. The number of sequences where all birthdays are different (i.e. without
replacement) is 365!/(365 − r)!.
Here ‘1.’ is the size of the sample space, and ‘2.’ is the number of outcomes which
satisfy Ac , the complement of the case in which we are interested.
Therefore:
$$P(A^c) = \frac{365!/(365-r)!}{(365)^r} = \frac{365 \times 364 \times \cdots \times (365-r+1)}{(365)^r}$$
and:
$$P(A) = 1 - P(A^c) = 1 - \frac{365 \times 364 \times \cdots \times (365-r+1)}{(365)^r}.$$
Probabilities, P (A), of at least two people sharing a birthday can be computed in this way for different values of the number of people r. The smallest r for which P (A) > 1/2 is r = 23, where P (A) ≈ 0.51.
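A short R function makes it easy to compute these probabilities for any r, and confirms that r = 23 is the smallest group size for which P(A) exceeds 1/2:

    # P(at least two of r people share a birthday), assuming 365 equally likely days
    birthday <- function(r) 1 - prod((365:(365 - r + 1)) / 365)

    sapply(c(10, 20, 23, 30, 40, 50), birthday)
    # approximately 0.12, 0.41, 0.51, 0.71, 0.89, 0.97

    min(which(sapply(1:60, birthday) > 0.5))   # 23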
Rule of sum
Suppose a selection can be made in one of K mutually exclusive ways, where the kth way has mk possible choices. Then the total number of possible selections is:
m1 + m2 + · · · + mK .
Rule of product
Suppose a selection consists of K successive steps, where the kth step has mk possible choices, whatever the choices made at the other steps. Then the total number of possible selections is:
m1 × m2 × · · · × mK .
Example 2.19 Five playing cards are drawn from a well-shuffled deck of 52 playing
cards. What is the probability that the cards form a hand which is higher than ‘a
flush’? The cards in a hand are treated as an unordered set.
First, we determine the size of the sample space which is all unordered subsets of 5
cards selected from 52. So the size of the sample space is:
$$\binom{52}{5} = \frac{52!}{5! \times 47!} = \frac{52 \times 51 \times 50 \times 49 \times 48}{5 \times 4 \times 3 \times 2 \times 1} = 2{,}598{,}960.$$
Next, we count the number of ‘full house’ hands, i.e. hands with three cards of one rank and two cards of another rank, for example:
♦2 ♠2 ♣2 ♦4 ♠4.
We can break the number of ways of choosing these into two steps.
The total number of ways of selecting the three: the rank of these can be any of the 13 ranks. There are four cards of this rank, so the three of that rank can be chosen in $\binom{4}{3} = 4$ ways. So the total number of different triplets is 13 × 4 = 52.
The total number of ways of selecting the two: the rank of these can be any of the remaining 12 ranks, and the two cards of that rank can be chosen in $\binom{4}{2} = 6$ ways. So the total number of different pairs (with a different rank than the triplet) is 12 × 6 = 72.
The rule of product then says that the total number of full houses is:
52 × 72 = 3,744.
(You do not need to memorise the different types of hands for the examination!)
A summary of the numbers of all types of 5-card hands, and their probabilities, can be constructed in the same way (to reiterate, you will not need to know these for the examination).
We now consider three key concepts:
independence
conditional probability
Bayes’ theorem.
Independence
Two events A and B are (statistically) independent if:
P (A ∩ B) = P (A) P (B).
Intuitively, independence means that:
if A happens, this does not affect the probability of B happening (and vice versa)
if you are told that A has happened, this does not give you any new information about the value of P (B) (and vice versa).
Example 2.20 Suppose we roll two dice. We assume that all combinations of the
values of them are equally likely. Define the events:
Therefore:
Example 2.21 It can be cold in London. Four impoverished teachers dress to feel
warm. Teacher A has a hat and a scarf and gloves, Teacher B only has a hat, Teacher
C only has a scarf and Teacher D only has gloves. One teacher out of the four is
selected at random. It is shown that although each pair of events H = ‘the teacher
selected has a hat’, S = ‘the teacher selected has a scarf’, and G = ‘the teacher
selected has gloves’ are independent, all three of these events are not independent.
Two teachers have a hat, two teachers have a scarf, and two teachers have gloves, so:
P (H) = 2/4 = 1/2, P (S) = 2/4 = 1/2 and P (G) = 2/4 = 1/2.
Only one teacher has both a hat and a scarf, so:
P (H ∩ S) = 1/4
and similarly:
P (H ∩ G) = 1/4 and P (S ∩ G) = 1/4.
From these results, we can verify that:
P (H ∩ S) = P (H) P (S)
P (H ∩ G) = P (H) P (G)
P (S ∩ G) = P (S) P (G)
and so the events are pairwise independent. However, one teacher has a hat, a scarf
and gloves, so:
P (H ∩ S ∩ G) = 1/4 ≠ P (H) P (S) P (G).
Hence the three events are not independent. If the selected teacher has a hat and a
scarf, then we know that the teacher has gloves. There is no independence for all
three events together.
The idea of independent events is quite different from that of mutually exclusive (disjoint) events, as shown in Figure 2.9. If A and B are mutually exclusive, then A ∩ B = ∅. Since:
P (A ∩ B) = 0 ≠ P (A) P (B)
in general (except in the uninteresting case when P (A) = 0 or P (B) = 0), then mutually exclusive events and independent events are different.
In fact, mutually exclusive events are extremely non-independent (i.e. dependent). For
example, if you know that A has happened, you know for certain that B has not
happened. There is no particularly helpful way to represent independent events using a
Venn diagram.
Conditional probability
Consider two events A and B. Suppose you are told that B has occurred. How does
this affect the probability of event A?
The answer is given by the conditional probability of A given that B has occurred,
or the conditional probability of A given B for short, defined as:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
assuming that P (B) > 0. The conditional probability is not defined if P (B) = 0.
Example 2.22 Suppose we roll two independent fair dice again. Consider the
following events.
These are shown in Figure 2.10 (on the next page). Now P (A) = 11/36 ≈ 0.31,
P (B) = 15/36 and P (A ∩ B) = 2/36. Therefore, the conditional probability of A
given B is:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{2/36}{15/36} = \frac{2}{15} \approx 0.13.$$
Learning that B has occurred causes us to revise (update) the probability of A
downward, from 0.31 to 0.13.
Example 2.23 In Example 2.22, when we are told that the conditioning event B
has occurred, we know we are within the solid green line in Figure 2.10 (on the next
page). So the 15 outcomes within it become the new sample space. There are 2
outcomes which satisfy A and which are inside this new sample space, so:
$$P(A \mid B) = \frac{2}{15} = \frac{\text{number of cases of } A \text{ within } B}{\text{number of cases of } B}.$$
If A and B are independent, so that P (A ∩ B) = P (A) P (B), then:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)\,P(B)}{P(B)} = P(A)$$
and:
$$P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{P(A)\,P(B)}{P(A)} = P(B).$$
In other words, if A and B are independent, learning that B has occurred does not
change the probability of A, and learning that A has occurred does not change the
probability of B. This is exactly what we would expect under independence.
For any events A1, A2, . . . , An, the chain rule of conditional probabilities states that:
$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1, A_2) \cdots P(A_n \mid A_1, \ldots, A_{n-1})$$
where, for example, P (A3 | A1, A2) is shorthand for P (A3 | A1 ∩ A2). The events can be taken in any order, as shown in Example 2.24.
Example 2.25 Suppose you draw 4 cards from a deck of 52 playing cards. What is
the probability of A = ‘the cards are the 4 aces (cards of rank 1)’ ?
We could calculate this using counting rules. There are $\binom{52}{4}$ = 270,725 possible
subsets of 4 different cards, and only 1 of these consists of the 4 aces. Therefore,
P (A) = 1/270,725.
Let us try with conditional probabilities. Define Ai as ‘the ith card is an ace’, so
that A = A1 ∩ A2 ∩ A3 ∩ A4 . The necessary probabilities are:
P (A1 ) = 4/52 since there are initially 4 aces in the deck of 52 playing cards
P (A2 | A1 ) = 3/51. If the first card is an ace, 3 aces remain in the deck of 51
playing cards from which the second card will be drawn
P (A3 | A1 , A2 ) = 2/50
P (A4 | A1 , A2 , A3 ) = 1/49.
Therefore, using the chain rule:
P (A) = (4/52) × (3/51) × (2/50) × (1/49) = 1/270,725
which agrees with the answer from the counting rules.
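The two routes to the answer can be checked against each other in R:

    prod(4:1 / 52:49)    # chain rule: (4/52) * (3/51) * (2/50) * (1/49)
    1 / choose(52, 4)    # counting rules: one favourable subset out of C(52, 4)
    # both equal 1/270,725, i.e. approximately 3.69e-06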
We now return to probabilities of partitions like the situation shown in Figure 2.11 (on
the next page).
Figure 2.11: On the left, a Venn diagram depicting A = A1 ∪ A2 ∪ A3 , and on the right
the ‘paths’ to A.
Both diagrams in Figure 2.11 represent the partition A = A1 ∪ A2 ∪ A3 . For the next
results, it will be convenient to use diagrams like the one on the right in Figure 2.11,
where A1 , A2 and A3 are symbolised as different ‘paths’ to A.
We now develop powerful methods of calculating sums like:
P (A) = P (A1 ) + P (A2 ) + P (A3 ).
Figure 2.12: On the left, a Venn diagram depicting the set A and the partition of S, and
on the right the ‘paths’ to A.
In other words, think of event A as the union of all the A ∩ Bi s, i.e. of ‘all the paths to
A via different intervening events Bi ’.
To get the probability of A, we use the formula of total probability: P(A) = Σ_{i=1}^{K} P(A | Bi) P(Bi), summing the probabilities of all the 'paths' to A.
Example 2.26 Any event B has the property that B and its complement B c
partition the sample space. So if we take K = 2, B1 = B and B2 = B c in the formula
of total probability, we get:
P (A) = P (A | B) P (B) + P (A | B c ) P (B c )
= P (A | B) P (B) + P (A | B c )(1 − P (B)).
Example 2.27 Suppose that 1 in 10,000 people (0.01%) has a particular disease. A diagnostic test for the disease has 99% sensitivity: if a person has the disease, the test will give a positive result with a probability of 0.99. The test also has 99% specificity: if a person does not have the disease, the test will give a negative result with a probability of 0.99.
Let B denote the presence of the disease, and B c denote no disease. Let A denote a
positive test result. We want to calculate P (A).
The probabilities we need are P (B) = 0.0001, P (B c ) = 0.9999, P (A | B) = 0.99 and
P (A | B c ) = 0.01. Therefore:
P (A) = P (A | B) P (B) + P (A | B c ) P (B c )
= 0.99 × 0.0001 + 0.01 × 0.9999
= 0.010098.
So far we have considered how to calculate P (A) for an event A which can happen in
different ways, ‘via’ different events B1 , B2 , . . . , BK .
Now we reverse the question. Suppose we know that A has occurred, as shown in Figure
2.13 (on the next page).
What is the probability that we got there via, say, B1 ? In other words, what is the
conditional probability P (B1 | A)? This situation is depicted in Figure 2.14 (on the next
page).
So we need:
P(Bj | A) = P(A ∩ Bj) / P(A)
and we already know how to get this.
Bayes’ theorem
Using the chain rule and the total probability formula, we have:
P(Bj | A) = P(A | Bj) P(Bj) / Σ_{i=1}^{K} P(A | Bi) P(Bi).
Example 2.28 Continuing with Example 2.27, let B denote the presence of the
disease, B c denote no disease, and A denote a positive test result.
We want to calculate P (B | A), i.e. the probability that a person has the disease,
given that the person has received a positive test result.
The probabilities we need are those from Example 2.27: P(B) = 0.0001, P(B^c) = 0.9999, P(A | B) = 0.99, P(A | B^c) = 0.01 and, from the total probability formula, P(A) = 0.010098.
Therefore:
P(B | A) = P(A | B) P(B) / [P(A | B) P(B) + P(A | B^c) P(B^c)] = (0.99 × 0.0001) / 0.010098 ≈ 0.0098.
Why is this so small? The reason is because most people do not have the disease and
the test has a small, but non-zero, false positive rate P (A | B c ). Therefore, most
positive test results are actually false positives.
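As an illustrative aside, the calculation in Examples 2.27 and 2.28 can be reproduced numerically. The following is a minimal Python sketch (the variable names are our own; the probability values are those given in the examples).

# Sketch of Examples 2.27 and 2.28: total probability and Bayes' theorem
# for the diagnostic test (probability values taken from the text).
p_B = 0.0001          # P(B): prevalence of the disease
p_A_given_B = 0.99    # sensitivity, P(A | B)
p_A_given_Bc = 0.01   # false positive rate, P(A | B^c)

# Total probability formula: P(A) = P(A|B)P(B) + P(A|B^c)P(B^c)
p_A = p_A_given_B * p_B + p_A_given_Bc * (1 - p_B)

# Bayes' theorem: P(B | A) = P(A|B)P(B) / P(A)
p_B_given_A = p_A_given_B * p_B / p_A

print(p_A)          # 0.010098
print(p_B_given_A)  # approximately 0.0098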
Example 2.29 You are taking part in a gameshow. The host of the show, who is
known as Monty, shows you three outwardly identical boxes. In one of them is a
prize, and the other two are empty.
You are asked to select, but not open, one of the boxes. After you have done so,
Monty, who knows where the prize is, opens one of the two remaining boxes.
He always opens a box he knows to be empty, and randomly chooses which box to
open when he has more than one option (which happens when your initial choice
contains the prize).
After opening the empty box, Monty gives you the choice of either switching to the
other unopened box or sticking with your original choice. You then receive whatever
is in the box you choose.
What should you do, assuming you want to win the prize?
Suppose the three boxes are numbered 1, 2 and 3. Let us define the events Bj = 'the prize is in Box j' and Mj = 'Monty opens Box j', for j = 1, 2, 3.
Suppose you choose Box 1 first, and then Monty opens Box 3 (the answer works the
same way for all combinations of these). So Boxes 1 and 2 remain unopened.
What we want to know now are the conditional probabilities P (B1 | M3 ) and
P (B2 | M3 ).
You should switch boxes if P (B2 | M3 ) > P (B1 | M3 ), and stick with your original
choice otherwise. (You would be indifferent about switching if it was the case that
P (B2 | M3 ) = P (B1 | M3 ).)
Suppose that you first choose Box 1, and then Monty opens Box 3. Bayes’ theorem
tells us that:
P(B2 | M3) = P(M3 | B2) P(B2) / [P(M3 | B1) P(B1) + P(M3 | B2) P(B2) + P(M3 | B3) P(B3)].
If the prize is in Box 1 (which you choose), Monty chooses at random between
the two remaining boxes, i.e. Boxes 2 and 3. Hence P (M3 | B1 ) = 1/2.
If the prize is in one of the two boxes you did not choose, Monty cannot open
that box, and must open the other one. Hence P (M3 | B2 ) = 1 and so
P (M3 | B3 ) = 0.
Since each box is equally likely to contain the prize before Monty acts, P(B1) = P(B2) = P(B3) = 1/3, and so:
P(B2 | M3) = (1 × 1/3) / (1/2 × 1/3 + 1 × 1/3 + 0 × 1/3) = 2/3.
Since P(B2 | M3) = 2/3 > 1/3 = P(B1 | M3), you should switch to the other unopened box.
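As a check on this result, the game can also be simulated. The following Python sketch is an illustrative Monte Carlo simulation (not part of the original example); it estimates the probability of winning under each strategy.

import random

# Monte Carlo sketch of Example 2.29 (Monty Hall). The Bayes calculation gives
# P(win | switch) = 2/3; this simulation is an independent, approximate check.
def play(switch, rng):
    prize = rng.randrange(3)          # box containing the prize
    choice = 0                        # you always pick Box 1 (index 0)
    # Monty opens an empty box you did not choose, at random if he has a choice
    options = [b for b in range(3) if b != choice and b != prize]
    opened = rng.choice(options)
    if switch:
        choice = next(b for b in range(3) if b != choice and b != opened)
    return choice == prize

rng = random.Random(1)
n = 100_000
print(sum(play(True, rng) for _ in range(n)) / n)   # close to 2/3
print(sum(play(False, rng) for _ in range(n)) / n)  # close to 1/3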
Example 2.30 You are waiting for your bag at the baggage reclaim carousel of an
airport. Suppose that you know that there are 200 bags to come from your flight,
and you are counting the distinct bags which come out. Suppose that x bags have
arrived, and your bag is not among them. What is the probability that your bag will
not arrive at all, i.e. that it has been lost (or at least delayed)?
Define A = ‘your bag has been lost’ and x = ‘your bag is not among the first x bags
to arrive’. What we want to know is the conditional probability P (A | x) for any
x = 0, 1, 2, . . . , 200. The conditional probabilities the other way round are as follows.
P(x | A) = 1 for all x. If your bag has been lost, it will not arrive!
P(x | A^c) = (200 − x)/200. If your bag has not been lost, it is equally likely to be any one of the 200 bags, so the probability that it is not among the first x to arrive is (200 − x)/200.
P(A | x) = P(x | A) P(A) / [P(x | A) P(A) + P(x | A^c) P(A^c)] = P(A) / [P(A) + ((200 − x)/200)(1 − P(A))].
Obviously, P (A | 200) = 1. If the bag has not arrived when all 200 have come out, it
has been lost!
For other values of x we need P (A). This is the general probability that a bag gets
lost, before you start observing the arrival of the bags from your particular flight.
This kind of probability is known as the prior probability of an event A.
Let us assign values to P (A) based on some empirical data. Statistics by the
Association of European Airlines (AEA) show how many bags were ‘mishandled’ per
1,000 passengers the airlines carried. This is not exactly what we need (since not all
passengers carry bags, and some have several), but we will use it anyway. In
particular, we will compare the results for the best and the worst of the AEA in 2006:
Figure 2.15 (on the next page) shows a plot of P (A | x) as a function of x for these
two airlines.
The probabilities are fairly small, even for large values of x.
For Air Malta, P (A | 199) = 0.469. So even when only 1 bag remains to arrive,
the probability is less than 0.5 that your bag has been lost.
For British Airways, P (A | 199) = 0.825. Also, we see that P (A | 197) = 0.541 is
the first probability over 0.5.
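The curve in Figure 2.15 can be sketched numerically from the formula for P(A | x). The Python sketch below uses a hypothetical prior probability (p_lost = 0.005), purely for illustration; the airline-specific priors used in the example come from the AEA data described above and are not reproduced here.

# Sketch of Example 2.30 with a hypothetical prior p_lost (assumed value, for
# illustration only; not the AEA figures used in the text).
def p_lost_given_x(x, p_lost, n_bags=200):
    # P(A | x) = P(A) / (P(A) + ((n - x)/n) * (1 - P(A)))
    return p_lost / (p_lost + ((n_bags - x) / n_bags) * (1 - p_lost))

p_lost = 0.005
for x in (0, 100, 190, 199):
    print(x, round(p_lost_given_x(x, p_lost), 3))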
Figure 2.15: Plot of P (A | x) as a function of x for the two airlines in Example 2.30, Air
Malta and British Airways (BA).
1. A box contains 12 light bulbs, of which 2 are defective. If a person selects 5 bulbs
at random, without replacement, what is the probability that both defective bulbs
will be selected?
2. Events A and B are independent, with:
P((A ∪ B)^c) = π1
and:
P(A) = π2.
Determine P(B) as a function of π1 and π2.
3. A county consists of three communities, A, B and C. The proportion of the county's population in each community, and the probability that a person in each community is vaccinated, are as follows:

Community                         A      B      C
Proportion of population          0.20   0.50   0.30
Probability of being vaccinated   0.80   0.70   0.60
(a) We choose a person from the county at random. What is the probability that
the person is not vaccinated?
(b) We choose a person from the county at random. Find the probability that the
person is in community A, given the person is vaccinated.
(c) In words, briefly explain how the ‘probability of being vaccinated’ for each
community would be known in practice.
2. We are given that P((A ∪ B)^c) = π1, P(A) = π2, and that A and B are independent. Hence A^c and B^c are also independent, so:
π1 = P((A ∪ B)^c) = P(A^c ∩ B^c) = P(A^c) P(B^c) = (1 − π2)(1 − P(B)).
Therefore:
P(B) = 1 − π1/(1 − π2).
(c) Any reasonable answer accepted, such as relative frequency estimate, or from
health records.
Chapter 3
Random variables
By the end of this chapter, you should be able to:

define a random variable and distinguish it from the values which it takes

explain the difference between discrete and continuous random variables

find the mean and the variance of simple random variables, whether discrete or continuous

derive and use simple properties of expected values and variances.
3.3 Introduction
In Chapter 1, we considered descriptive statistics for a sample of observations of a
variable X. Here we will represent the observations as a sequence of variables, denoted
as:
X1 , X2 , . . . , Xn
where n is the sample size.
In statistical inference, the observations will be treated as a sample drawn at random
from a population. We will then think of each observation Xi of a variable X as an
outcome of an experiment.
The experiment is ‘select a unit at random from the population and record its
value of X’.
The outcome is the observed value Xi of X.
Because variables X in statistical data are recorded as numbers, we can now focus on
experiments where the outcomes are also numbers – random variables.
Random variable
A random variable is an experiment for which the outcomes are numbers.¹ This
means that for a random variable:
the sample space S is a set of real numbers
the outcomes are numbers in this sample space (instead of 'outcomes', we often
call them the values of the random variable).
There are two main types of random variables, depending on the nature of S, i.e. the
possible values of the random variable.
Notation
P (a < X < b) denotes the probability that X is between the numbers a and b.
You will notice that many of the quantities we define for random variables are
analogous to sample quantities defined in Chapter 1.
¹ This definition is a bit informal, but it is sufficient for this course.
² Strictly speaking, a discrete random variable is not just a random variable which is not continuous, as there are many others, such as mixture distributions.
Example 3.1 The following two examples will be used throughout this chapter.
Example 3.2 Consider the following probability distribution for the household
size, X.3
Number of people
in the household, x P (X = x)
1 0.3002
2 0.3417
3 0.1551
4 0.1336
5 0.0494
6 0.0145
7 0.0034
8 0.0021
Probability function
The probability function (pf) of a discrete random variable X is the function
p(x) = P(X = x).
We can talk of p(x) both as the pf of the random variable X, and as the pf of the
probability distribution of X. Both mean the same thing.
Alternative terminology: the pf of a discrete random variable is also often called the
probability mass function (pmf).
Alternative notation: instead of p(x), the pf is also often denoted by, for example, pX (x)
– especially when it is necessary to indicate clearly to which random variable the
function corresponds.
³ Source: ONS, National report for the 2001 Census, England and Wales. Table UV51.
Example 3.3 Continuing Example 3.2, here we can simply list all the values:
p(x) = 0.3002 for x = 1
       0.3417 for x = 2
       0.1551 for x = 3
       0.1336 for x = 4
       0.0494 for x = 5
       0.0145 for x = 6
       0.0034 for x = 7
       0.0021 for x = 8
       0 otherwise.
These are clearly all non-negative, and their sum is Σ_{x=1}^{8} p(x) = 1.
A graphical representation of the pf is shown in Figure 3.1.
For the next example, we need to remember the following results from mathematics, concerning sums of geometric series. If r ≠ 1, then:
Σ_{x=0}^{n−1} a r^x = a(1 − r^n)/(1 − r)
and if |r| < 1, then:
Σ_{x=0}^{∞} a r^x = a/(1 − r).
Example 3.4 In the basketball example, the number of possible values is infinite,
so we cannot simply list the values of the pf. So we try to express it as a formula.
Suppose that the probability of success (scoring) in each single throw is π, and that the throws are independent of each other.
Hence the probability that the first success occurs after x failures is the probability
of a sequence of x failures followed by a success, i.e. the probability is:
(1 − π)x π.
So the pf of the random variable X (the number of failures before the first success)
is: (
(1 − π)x π for x = 0, 1, 2, . . .
p(x) = (3.1)
0 otherwise
where 0 ≤ π ≤ 1. Let us check that (3.1) satisfies the conditions for a pf.
Σ_{x=0}^{∞} (1 − π)^x π = π × 1/(1 − (1 − π)) = π/π = 1.
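A quick numerical check of (3.1) can also be done in Python. The sketch below is illustrative only; it confirms that the probabilities sum to (approximately) 1 and tabulates the first few values for π = 0.7 and π = 0.3.

# Numerical sketch of the geometric pf p(x) = (1 - pi)^x * pi from (3.1).
def p(x, pi):
    return (1 - pi) ** x * pi

for pi in (0.7, 0.3):
    total = sum(p(x, pi) for x in range(1000))   # truncated sum, approximately 1
    print(pi, round(total, 6), [round(p(x, pi), 3) for x in range(5)])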
Figure 3.2: Probability function for Example 3.4. π = 0.7 indicates a fairly good free-throw
shooter. π = 0.3 indicates a fairly poor free-throw shooter.
The cumulative distribution function (cdf) of a discrete random variable X is F(x) = P(X ≤ x) = Σ_{xi ≤ x} p(xi), i.e. the sum of the probabilities of the possible values of X which are less than or equal to x.
Example 3.5 Continuing with the household size example, values of F (x) at all
possible values of X are:
Number of people
in the household, x p(x) F (x)
1 0.3002 0.3002
2 0.3417 0.6419
3 0.1551 0.7970
4 0.1336 0.9306
5 0.0494 0.9800
6 0.0145 0.9945
7 0.0034 0.9979
8 0.0021 1.0000
For the basketball example, summing the geometric series gives, for x = 0, 1, 2, . . .:
Σ_{y=0}^{x} (1 − π)^y π = π × (1 − (1 − π)^{x+1})/(1 − (1 − π)) = 1 − (1 − π)^{x+1}
so we can write:
F(x) = 0 for x < 0, and F(x) = 1 − (1 − π)^{x+1} for x = 0, 1, 2, . . . .
The cdf is shown in graphical form in Figure 3.4 (on the next page).
For a discrete random variable, F(x) is a step function which jumps at each possible value xi; at such an xi, the value of F(xi) is the value at the top of the jump (i.e. F(x) is right-continuous).
Either the pf or the cdf can be used to calculate the probabilities of any events for a
discrete random variable.
Example 3.7 Continuing with the household size example (for the probabilities,
see Example 3.5), then:
The expected value (or mean) of X is denoted E(X), and defined as:
E(X) = Σ_{xi ∈ S} xi p(xi).
This can also be written more concisely as E(X) = Σ_x x p(x) or E(X) = Σ x p(x).
We can talk of E(X) as the expected value of both the random variable X, and of the
probability distribution of X.
Alternative notation: instead of E(X), the symbol µ (the lower-case Greek letter ‘mu’),
or µX , is often used.
where:
p̂(xi) = fi / Σ_{i=1}^{K} fi.
So X̄ uses the sample proportions, pb(xi ), whereas E(X) uses the population
probabilities, p(xi ).
Number of people
in the household, x p(x) x p(x)
1 0.3002 0.3002
2 0.3417 0.6834
3 0.1551 0.4653
4 0.1336 0.5344
5 0.0494 0.2470
6 0.0145 0.0870
7 0.0034 0.0238
8 0.0021 0.0168
Sum 2.3579
= E(X)
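The same calculation can be written as a short Python sketch (illustrative only; the probabilities are those tabulated above).

# Illustrative sketch: E(X) for the household-size distribution,
# using the probability function tabulated above.
pf = {1: 0.3002, 2: 0.3417, 3: 0.1551, 4: 0.1336,
      5: 0.0494, 6: 0.0145, 7: 0.0034, 8: 0.0021}
mean = sum(x * p for x, p in pf.items())
print(mean)   # 2.3579 (up to floating-point rounding), as in the table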
For the basketball example, using the substitution y = x − 1:
E(X) = Σ_{x=0}^{∞} x (1 − π)^x π = (1 − π) [Σ_{y=0}^{∞} y (1 − π)^y π + Σ_{y=0}^{∞} (1 − π)^y π]
= (1 − π)(E(X) + 1)
= (1 − π) E(X) + (1 − π)
where the first sum in the brackets is E(X) and the second is 1. Solving for E(X) gives E(X) = (1 − π)/π.
So, before scoring a basket, a fairly good free-throw shooter (with π = 0.7) misses on
average about 0.42 shots, and a fairly poor free-throw shooter (with π = 0.3) misses
on average about 2.33 shots.
Example 3.10 To illustrate the use of expected values, let us consider the game of
roulette, from the point of view of the casino (‘The House’).
Suppose a player puts a bet of £1 on ‘red’. If the ball lands on any of the 18 red
numbers, the player gets that £1 back, plus another £1 from The House. If the result
is one of the 18 black numbers or the green 0, the player loses the £1 to The House.
We assume that the roulette wheel is unbiased, i.e. that all 37 numbers have equal
probabilities. What can we say about the probabilities and expected values of wins
and losses?
Define the random variable X = ‘money received by The House’. Its possible values
are −1 (the player wins) and 1 (the player loses). The probability function is:
p(x) = 18/37 for x = −1, p(x) = 19/37 for x = 1, and p(x) = 0 otherwise.
The expected value is E(X) = (−1) × 18/37 + 1 × 19/37 = 1/37 ≈ 0.027. On average, The House expects to win 2.7p for every £1 which players bet on red.
This expected gain is known as the house edge. It is positive for all possible bets in
roulette.
The edge is the expected gain from a single bet. Usually, however, players bet again
if they win at first – gambling can be addictive!
Consider a player who starts with £10 and bets £1 on red repeatedly until the
player either has lost all of the £10 or doubled their money to £20.
It can be shown that the probability that such a player reaches £20 before they go
down to £0 is about 0.368. Define X = ‘money received by The House’, with the
probability function:
p(x) = 0.368 for x = −10, p(x) = 0.632 for x = 10, and p(x) = 0 otherwise.
The expected value is E(X) = (−10) × 0.368 + 10 × 0.632 = 2.64. On average, The House can expect to keep about 26.4% of the money which players like this bring to the table.
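The probability 0.368 quoted above can itself be estimated by simulation. The following Python sketch is illustrative only (the function names are our own); it simulates many players of this type.

import random

# Simulation sketch of Example 3.10: a player starts with £10 and bets £1 on
# red until reaching £0 or £20. The text states P(reach £20 first) ≈ 0.368.
def play_until_done(rng):
    money = 10
    while 0 < money < 20:
        money += 1 if rng.random() < 18 / 37 else -1
    return money == 20

rng = random.Random(2)
n = 100_000
p_double = sum(play_until_done(rng) for _ in range(n)) / n
house_take = 10 * (1 - p_double) - 10 * p_double   # expected amount kept per player
print(round(p_double, 3), round(house_take, 2))     # about 0.368 and about 2.64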
In general:
E(g(X)) 6= g(E(X))
when g(X) is a non-linear function of X.
Suppose X is a random variable and a and b are constants, i.e. known numbers
which are not random variables. Therefore:
E(aX + b) = a E(X) + b.
Proof: We have:
E(aX + b) = Σ_x (ax + b) p(x)
= Σ_x ax p(x) + Σ_x b p(x)
= a Σ_x x p(x) + b Σ_x p(x)
= a E(X) + b
using the facts that:
i. Σ_x x p(x) = E(X), by definition of the expected value
ii. Σ_x p(x) = 1, by definition of the probability function.
In summary, E(aX + b) = a E(X) + b and, setting a = 0, E(b) = b: the expected value of a constant is the constant itself.
Both Var(X) and sd(X) are always ≥ 0. Both are measures of the dispersion (variation)
of the random variable X.
Alternative notation: the variance is often denoted σ 2 (‘sigma squared’) and the
standard deviation by σ (‘sigma’).
An alternative formula: the variance can also be calculated as:
Var(X) = E((X − E(X))²) = E(X²) − (E(X))².
For the household size example, Var(X) = E(X²) − (E(X))² = 7.259 − (2.358)² = 1.699 and
sd(X) = √Var(X) = √1.699 = 1.30.
Example 3.14 For the basketball example, p(x) = (1 − π)x π for x = 0, 1, 2, . . ., and
0 otherwise. It can be shown (although the proof is beyond the scope of the course)
that for this distribution:
Var(X) = (1 − π)/π².
In the two cases we have used as examples, π = 0.7 gives Var(X) = 0.3/(0.7)² ≈ 0.61, while π = 0.3 gives Var(X) = 0.7/(0.3)² ≈ 7.78.
So the variation in how many free throws a fairly poor shooter misses before the first
success is much higher than the variation for a fairly good shooter.
Var(aX + b) = a2 Var(X).
Proof:
Var(aX + b) = E[((aX + b) − E(aX + b))²]
= E[(aX − a E(X))²]
= E[a²(X − E(X))²]
= a² E[(X − E(X))²]
= a² Var(X).
Therefore, sd(aX + b) = |a| sd(X).
If a = 0, this gives:
Var(b) = 0.
That is, the variance of a constant is 0. The converse also holds – if a random variable
has a variance of 0, it is actually a constant.
E(aX + b) = a E(X) + b
Var(aX + b) = a2 Var(X) and sd(aX + b) = |a| sd(X)
E(b) = b and Var(b) = sd(b) = 0.
We define Var(X) = E((X − E(X))²) = E(X²) − (E(X))² and sd(X) = √Var(X).
Also, Var(X) ≥ 0 and sd(X) ≥ 0 always, and Var(X) = sd(X) = 0 only if X is a constant.
Example 3.15 For further practice, let us consider a discrete random variable X
which has possible values 0, 1, 2, . . . , n, where n is a known positive integer, and X
has the following probability function:
p(x) = \binom{n}{x} π^x (1 − π)^{n−x} for x = 0, 1, 2, . . . , n, and p(x) = 0 otherwise.
Note: the examination may also contain questions like this. The difficulty of such
questions depends partly on the form of p(x), and what kinds of manipulations are
needed to work with it. So questions of this type may be very easy, or quite hard!
This does not simplify into a simple formula, so we just calculate the values
from the definition, by summation.
For the values x = 0, 1, 2, . . . , n, the value of the cdf is:
F(x) = P(X ≤ x) = Σ_{y=0}^{x} \binom{n}{y} π^y (1 − π)^{n−y}.
Since X is a discrete random variable, F(x) is a step function. For E(X), we have:
E(X) = Σ_{x=0}^{n} x \binom{n}{x} π^x (1 − π)^{n−x}
= Σ_{x=1}^{n} x \binom{n}{x} π^x (1 − π)^{n−x}
= Σ_{x=1}^{n} [n(n − 1)! / ((x − 1)! ((n − 1) − (x − 1))!)] π π^{x−1} (1 − π)^{n−x}
= nπ Σ_{x=1}^{n} \binom{n−1}{x−1} π^{x−1} (1 − π)^{n−x}
= nπ Σ_{y=0}^{n−1} \binom{n−1}{y} π^y (1 − π)^{(n−1)−y}
= nπ × 1
= nπ
where y = x − 1, and the last summation is over all the values of the pf of another
binomial distribution, this time with possible values 0, 1, 2, . . . , n − 1 and probability
parameter π.
The variance of the distribution is Var(X) = nπ(1 − π). This is not derived here, but
will be proved in a different way later.
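A numerical sketch (with illustrative values n = 10 and π = 0.3, chosen by us) confirms these results for the binomial distribution.

import math

# Illustrative check of Example 3.15: the pf sums to 1, E(X) = n*pi and
# Var(X) = n*pi*(1 - pi), here for n = 10 and pi = 0.3 (assumed values).
n, pi = 10, 0.3
pf = {x: math.comb(n, x) * pi ** x * (1 - pi) ** (n - x) for x in range(n + 1)}

total = sum(pf.values())                                    # 1.0
mean = sum(x * p for x, p in pf.items())                    # 3.0  (= n*pi)
var = sum(x ** 2 * p for x, p in pf.items()) - mean ** 2    # 2.1  (= n*pi*(1-pi))
print(total, mean, var)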
The moment generating function (mgf) of a random variable X is defined as MX(t) = E(e^{tX}). The form of the mgf is not interesting or informative in itself. Instead, the reason we define the mgf is that it is a convenient tool for deriving means and variances of distributions, using the following results:
M′_X(0) = E(X) and M″_X(0) = E(X²).
This is useful if the mgf is easier to derive than E(X) and Var(X) directly.
Other moments about zero are obtained from the mgf similarly:
M_X^{(k)}(0) = E(X^k) for k = 1, 2, . . . .
For the geometric pf of Example 3.4, the mgf is MX(t) = Σ_{x=0}^{∞} e^{tx} (1 − π)^x π = π/(1 − e^t(1 − π)), using the sum to infinity of a geometric series, for t < −ln(1 − π) to ensure convergence of the sum.
From the mgf MX(t) = π/(1 − e^t(1 − π)) we obtain:
M′_X(t) = π(1 − π)e^t / (1 − e^t(1 − π))²
and:
M″_X(t) = π(1 − π)e^t (1 − (1 − π)e^t)(1 + (1 − π)e^t) / (1 − e^t(1 − π))⁴.
Note: this uses the series expansion of the exponential function from calculus, i.e. for any number a, we have:
e^a = Σ_{x=0}^{∞} a^x/x! = 1 + a + a²/2! + a³/3! + · · · .
From the mgf MX(t) = e^{λ(e^t − 1)} we obtain:
M′_X(t) = λe^t e^{λ(e^t − 1)}
and:
M″_X(t) = λe^t (1 + λe^t) e^{λ(e^t − 1)}.
Hence:
M′_X(0) = λ = E(X)
also:
M″_X(0) = λ(1 + λ) = E(X²)
and:
Var(X) = E(X²) − (E(X))² = λ(1 + λ) − λ² = λ.
If the mgfs mentioned in these statements exist, then the following apply.
The mgf uniquely determines a probability distribution. In other words, if for two
random variables X and Y we have MX (t) = MY (t) (for points around t = 0), then
X and Y have the same distribution.
If X1, X2, . . . , Xn are independent random variables and Y = X1 + X2 + · · · + Xn, then MY(t) = M_{X1}(t) M_{X2}(t) · · · M_{Xn}(t) and, in particular, if all the Xi s have the same distribution (that of X), then MY(t) = (MX(t))^n.
In other words, the set of possible values (the sample space) is the real numbers R,
or one or more intervals in R.
In the insurance example used throughout this section, X is the size of a claim. Suppose the policy has a deductible of £999, so all claims are at least £1,000.
Most of the concepts introduced for discrete random variables have exact or
approximate analogies for continuous random variables, and many results are the same
for both types. However, there are some differences in the details. The most obvious
difference is that wherever in the discrete case there are sums over the possible values of
the random variable, in the continuous case these are integrals.
A suitable model for such claim sizes is the distribution with pdf:
f(x) = αk^α/x^{α+1} for x ≥ k, and f(x) = 0 otherwise
where α > 0 is a parameter, and k > 0 (the smallest possible value of X) is a known number. In our example, k = 1 (due to the deductible). A probability distribution with this pdf is known as the Pareto distribution. A graph of this pdf when α = 2.2 is shown in Figure 3.5 (on the next page).
Unlike for probability functions of discrete random variables, in the continuous case values of the probability density function are not probabilities of individual values, i.e. f(x) ≠ P(X = x). In fact, for a continuous random variable:
P(X = x) = 0 for all x.    (3.3)
That is, the probability that X has any particular value exactly is always 0.
Because of (3.3), with a continuous random variable we do not need to be very careful about differences between < and ≤, and between > and ≥. Therefore, the following probabilities are all equal: P(a < X < b), P(a ≤ X < b), P(a < X ≤ b) and P(a ≤ X ≤ b).
Properties of pdfs
The pdf f (x) of any continuous random variable must satisfy the following conditions.
1. We require:
f (x) ≥ 0 for all x.
2. We require:
∫_{−∞}^{∞} f(x) dx = 1.
Example 3.21 Continuing with the insurance example, we check that the
conditions hold for the pdf:
f(x) = αk^α/x^{α+1} for x ≥ k, and f(x) = 0 otherwise.
1. Clearly, f (x) ≥ 0 for all x, since α > 0, k α > 0 and xα+1 ≥ k α+1 > 0.
2. We have:
∫_{−∞}^{∞} f(x) dx = ∫_k^{∞} αk^α/x^{α+1} dx
= αk^α ∫_k^{∞} x^{−α−1} dx
= αk^α [−(1/α) x^{−α}]_k^{∞}
= (−k^α)(0 − k^{−α})
= 1.
The general properties of the cdf stated previously also hold for continuous
distributions. The cdf of a continuous distribution is not a step function, so results
on discrete-specific properties do not hold in the continuous case. A continuous cdf
is a smooth, continuous function of x.
The pdf of a continuous random variable is obtained from its cdf by differentiation: f(x) = F′(x).
Therefore:
F(x) = 0 for x < k, and F(x) = 1 − (k/x)^α for x ≥ k.    (3.4)
If we were given (3.4), we could obtain the pdf by differentiation, since F′(x) = 0 when x < k, and:
F′(x) = −k^α(−α)x^{−α−1} = αk^α/x^{α+1} for x ≥ k.
A plot of the cdf is shown in Figure 3.7.
Example 3.23 Continuing with the insurance example (with k = 1 and α = 2.2),
then:
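Using the cdf (3.4) with k = 1 and α = 2.2, probabilities of this kind are easy to compute. The Python sketch below is illustrative only; the specific probabilities evaluated in the original example are not reproduced here.

# Illustrative sketch: probabilities from the Pareto cdf F(x) = 1 - (k/x)^alpha,
# with k = 1 and alpha = 2.2 as in the insurance example.
k, alpha = 1.0, 2.2

def F(x):
    return 0.0 if x < k else 1.0 - (k / x) ** alpha

print(F(2))            # P(X <= 2)
print(1 - F(3))        # P(X > 3)
print(F(3) - F(2))     # P(2 < X <= 3)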
Example 3.24 Consider now a continuous random variable with the following pdf:
f(x) = λe^{−λx} for x ≥ 0, and f(x) = 0 otherwise    (3.5)
where λ > 0 is a parameter. This is the pdf of the exponential distribution. The
uses of this distribution will be discussed in the next chapter.
Since:
∫_0^x λe^{−λt} dt = [−e^{−λt}]_0^x = 1 − e^{−λx}
the cdf is F(x) = 0 for x < 0, and F(x) = 1 − e^{−λx} for x ≥ 0.
2. Since we have just done the integration to derive the cdf F (x), we can also use
it to show that f (x) integrates to 1. This follows from:
∫_{−∞}^{∞} f(x) dx = P(−∞ < X < ∞) = lim_{x→∞} F(x) − lim_{x→−∞} F(x) = 1 − 0 = 1.
Mixed distributions
P (X = 0) = π for some π ∈ (0, 1). Here π is the probability that a policy results in
no payment.
Suppose X is a continuous random variable with pdf f (x). Definitions of its expected
value, the expected value of any transformation g(X), the variance and standard
deviation are the same as for discrete distributions, except that summation is replaced
by integration:
E(X) = ∫_{−∞}^{∞} x f(x) dx
E(g(X)) = ∫_{−∞}^{∞} g(x) f(x) dx
Var(X) = ∫_{−∞}^{∞} (x − E(X))² f(x) dx = E(X²) − (E(X))²
sd(X) = √Var(X).
Example 3.25 For the Pareto distribution, introduced in Example 3.19, we have:
E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_k^{∞} x f(x) dx
= ∫_k^{∞} x × αk^α/x^{α+1} dx
= ∫_k^{∞} αk^α/x^α dx
= [αk/(α − 1)] ∫_k^{∞} (α − 1)k^{α−1}/x^{(α−1)+1} dx
= αk/(α − 1)   (for α > 1).
Here the last step follows because the last integrand has the form of the Pareto pdf
with parameter α − 1, so its integral from k to ∞ is 1. This integral converges only if
α − 1 > 0, i.e. if α > 1.
Similarly:
E(X²) = ∫_k^{∞} x² f(x) dx = ∫_k^{∞} x² × αk^α/x^{α+1} dx
= ∫_k^{∞} αk^α/x^{α−1} dx
= [αk²/(α − 2)] ∫_k^{∞} (α − 2)k^{α−2}/x^{(α−2)+1} dx
= αk²/(α − 2)   (for α > 2)
where again the last integral equals 1, and hence:
Var(X) = E(X²) − (E(X))² = αk²/(α − 2) − (αk/(α − 1))² = (k/(α − 1))² × α/(α − 2).
Expected values and variances are said to be infinite when the corresponding integral
does not exist (i.e. does not have a finite value).
For the Pareto distribution, the distribution is defined for all α > 0, but the mean is
infinite if α < 1 and the variance is infinite if α < 2. This happens because for small
values of α the distribution has very heavy tails, i.e. the probabilities of very large
values of X are non-negligible.
This is actually useful in some insurance applications, for example liability insurance
and medical insurance. There most claims are relatively small, but there is a
non-negligible probability of extremely large claims. The Pareto distribution with a
small α can be a reasonable representation of such situations. Figure 3.8 shows plots of
Pareto cdfs with α = 2.2 and α = 0.8. When α = 0.8, the distribution is so heavy-tailed
that E(X) is infinite.
which gives E(X²) = 2/λ², where the last step follows because the last integral is simply E(X) = 1/λ again. Finally:
Var(X) = E(X²) − (E(X))² = 2/λ² − 1/λ² = 1/λ².
The properties of the mgf stated in Section 3.4.8 also hold for continuous distributions.
If the expected value E(etX ) is infinite, the random variable X does not have an mgf.
For example, the Pareto distribution does not have an mgf for positive t.
The mgf of the exponential distribution is MX(t) = λ/(λ − t) for t < λ, from which we get M′_X(t) = λ/(λ − t)² and M″_X(t) = 2λ/(λ − t)³, so:
E(X) = M′_X(0) = 1/λ and E(X²) = M″_X(0) = 2/λ²
and Var(X) = E(X²) − (E(X))² = 2/λ² − 1/λ² = 1/λ².
These agree with the results derived with a bit more work in Example 3.26.
The median of a random variable (i.e. of its probability distribution) is similar in spirit to the sample median: it is the value m which satisfies F(m) = 0.5. For the exponential distribution, F(m) = 1 − e^{−λm} = 1/2 gives:
e^{−λm} = 1/2  ⇔  −λm = −ln 2  ⇔  m = (ln 2)/λ.
1. Suppose that X is a discrete random variable for which the moment generating
function is:
MX(t) = (1/4)(e^{3t} + e^{6t} + e^{9t}) + (1/8)(e^{2t} + e^{4t})
for −∞ < t < ∞. Write down the probability distribution of X.
2. A continuous random variable X has pdf f(x) = x/5 for 0 ≤ x < 2, f(x) = (20 − 4x)/30 for 2 ≤ x ≤ 5, and f(x) = 0 otherwise.
(a) Sketch the graph of f(x). (The sketch can be drawn on ordinary paper – no graph paper needed.)
3.9. Solutions to Sample examination questions
(a) The sketch of f(x) shows a function which increases linearly on [0, 2] and then decreases linearly to 0 at x = 5 (graph omitted here).
(b) We determine the cdf by integrating the pdf over the appropriate range, hence:
F(x) = 0 for x < 0
F(x) = x²/10 for 0 ≤ x < 2
F(x) = (10x − x² − 10)/15 for 2 ≤ x ≤ 5
F(x) = 1 for x > 5.
This results from the following calculations. Firstly, for x < 0, we have:
F(x) = ∫_{−∞}^{x} f(t) dt = ∫_{−∞}^{x} 0 dt = 0.
For 0 ≤ x < 2, we have:
F(x) = ∫_0^x (t/5) dt = x²/10.
For 2 ≤ x ≤ 5, we have:
F(x) = ∫_{−∞}^{x} f(t) dt = ∫_{−∞}^{0} 0 dt + ∫_0^2 (t/5) dt + ∫_2^x (20 − 4t)/30 dt
= 0 + 4/10 + [2t/3 − t²/15]_2^x
= 4/10 + (2x/3 − x²/15) − (4/3 − 4/15)
= 2x/3 − x²/15 − 2/3
= (10x − x² − 10)/15.
Similarly:
E(X²) = ∫_{−∞}^{∞} x² f(x) dx = ∫_0^2 (x³/5) dx + ∫_2^5 (20x² − 4x³)/30 dx
= [x⁴/20]_0^2 + [2x³/9 − x⁴/30]_2^5
= 16/20 + (250/9 − 625/30) − (16/9 − 16/30)
= 13/2 or 6.5.
Hence the variance is:
σ² = E(X²) − (E(X))² = 13/2 − (7/3)² = 117/18 − 98/18 = 19/18 ≈ 1.0556.
Therefore, the standard deviation is σ = √1.0556 = 1.0274.
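As an optional check, the moments in this solution can be verified by numerical integration. The sketch below is illustrative (it uses a simple midpoint rule rather than any particular library) and takes the pdf implied by the integrands in the solution above.

# Illustrative numerical check of the solution: integrate the pdf
# f(x) = x/5 on [0, 2) and (20 - 4x)/30 on [2, 5], and compare with
# E(X) = 7/3, E(X^2) = 6.5 and Var(X) = 19/18.
def f(x):
    if 0 <= x < 2:
        return x / 5
    if 2 <= x <= 5:
        return (20 - 4 * x) / 30
    return 0.0

def integrate(g, a, b, n=100_000):
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h   # midpoint rule

total = integrate(f, 0, 5)                          # approximately 1
mean = integrate(lambda x: x * f(x), 0, 5)          # approximately 7/3
mean_sq = integrate(lambda x: x * x * f(x), 0, 5)   # approximately 6.5
print(total, mean, mean_sq, mean_sq - mean ** 2)    # variance about 1.056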
Chapter 4
Common distributions of random
variables
state properties of these distributions such as the expected value and variance.
4.3 Introduction
In statistical inference we will treat observations:
X1 , X2 , . . . , Xn
(the sample) as values of a random variable X, which has some probability distribution
(the population distribution).
How to choose the probability distribution?
There is a large number of such distributions, such that for most purposes we can
find a suitable standard distribution.
This part of the course introduces some of the most common standard distributions for
discrete and continuous random variables.
Probability distributions may differ from each other in a broader or narrower sense. In
the broader sense, we have different families of distributions which may have quite
different characteristics, for example:
among continuous: different sets of possible values (for example, all real numbers x,
x ≥ 0, or x ∈ [0, 1]); symmetric versus skewed distributions.
The ‘distributions’ discussed in this chapter are really families of distributions in this
sense.
In the narrower sense, individual distributions within a family differ in having different
values of the parameters of the distribution. The parameters determine the mean and
variance of the distribution, values of probabilities from it etc.
In the statistical analysis of a random variable X we typically:
use observed data to choose (estimate) values for the parameters of that
distribution, and perform statistical inference on them.
Here we need a family of discrete distributions with only two possible values (0
and 1). The Bernoulli distribution (discussed in the next section), which has one
parameter π (the probability that Xi = 1) is appropriate.
Within the family of Bernoulli distributions, we use the one where the value of π is our best estimate based on the observed data. This is π̂ = 513/950 = 0.54.
For the discrete uniform, Bernoulli, binomial, Poisson, continuous uniform, exponential
and normal distributions:
you should memorise their pf/pdf, cdf (if given), mean, variance and median (if
given)
you can use these in any examination question without proof, unless the question
directly asks you to derive them again.
For any other distributions:

you do not need to memorise their pf/pdf or cdf; if needed for a question, these will be provided
if a question involves means, variances or other properties of these distributions,
these will either be provided, or the question will ask you to derive them.
The discrete uniform distribution is not very common in applications, but it is useful as
a reference point for more complex distributions.
agree / disagree
male / female
E(X²) = Σ_{x=0}^{1} x² p(x) = 0² × (1 − π) + 1² × π = π
and:
Var(X) = E(X²) − (E(X))² = π − π² = π(1 − π).    (4.4)
The moment generating function is:
MX(t) = Σ_{x=0}^{1} e^{tx} p(x) = e^0 (1 − π) + e^t π = (1 − π) + πe^t.
Consider n independent Bernoulli trials, each with the same probability of success π. Let X denote the total number of successes in these n trials. X follows a binomial distribution with parameters n and π, where n ≥ 1 is a known integer and 0 ≤ π ≤ 1.
This is often written as:
X ∼ Bin(n, π).
Example 4.4 A multiple choice test has 4 questions, each with 4 possible answers.
James is taking the test, but has no idea at all about the correct answers. So he
guesses every answer and, therefore, has the probability of 1/4 of getting any
individual question correct.
Let X denote the number of correct answers in James’ test. X follows the binomial
distribution with n = 4 and π = 0.25, i.e. we have:
X ∼ Bin(4, 0.25).
For example, what is the probability that James gets 3 of the 4 questions correct?
Here it is assumed that the guesses are independent, and each has the probability
π = 0.25 of being correct. The probability of any particular sequence of 3 correct
and 1 incorrect answers, for example 1110, is π 3 (1 − π)1 (where ‘1’ denotes a correct
answer and ‘0’ denotes an incorrect answer).
However, we do not care about the order of the 1s and 0s, only about the number of
1s. So 1101 and 1011, for example, also count as 3 correct answers. Each of these
also has the probability π 3 (1 − π)1 .
The total number of sequences with three 1s (and, therefore, one 0) is the number of locations for the three 1s which can be selected in the sequence of 4 answers. This is \binom{4}{3} = 4. Therefore, the probability of obtaining three 1s is:
\binom{4}{3} π³(1 − π)¹ = 4 × (0.25)³ × (0.75)¹ ≈ 0.0469.
We have shown in the previous chapter that (4.5) satisfies the conditions for being a
probability function (see Example 3.15).
Example 4.6 Suppose a multiple choice examination has 20 questions, each with 4
possible answers. Consider again James who guesses each one of the answers. Let X
denote the number of correct answers by such a student, so that we have
X ∼ Bin(20, 0.25). For such a student, the expected number of correct answers is
E(X) = 20 × 0.25 = 5.
The teacher wants to set the pass mark of the examination so that, for such a
student, the probability of passing is less than 0.05. What should the pass mark be?
In other words, what is the smallest x such that P (X ≥ x) < 0.05, i.e. such that
P (X < x) ≥ 0.95?
Calculating the probabilities of x = 0, 1, 2, . . . , 20 we get (rounded to 2 decimal
places):
x 0 1 2 3 4 5 6 7 8 9 10
p(x) 0.00 0.02 0.07 0.13 0.19 0.20 0.17 0.11 0.06 0.03 0.01
x 11 12 13 14 15 16 17 18 19 20
p(x) 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Calculating the cumulative probabilities, we find that F (7) = P (X < 8) = 0.898 and
F (8) = P (X < 9) = 0.959. Therefore, P (X ≥ 8) = 0.102 > 0.05 and also
P (X ≥ 9) = 0.041 < 0.05. The pass mark should be set at 9.
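The cumulative probabilities in this example are easily computed, for instance with the short Python sketch below (illustrative only; it uses just the binomial pf formula stated earlier).

import math

# Sketch of Example 4.6: find the smallest pass mark x with P(X >= x) < 0.05
# for X ~ Bin(20, 0.25).
n, pi = 20, 0.25
pmf = [math.comb(n, x) * pi ** x * (1 - pi) ** (n - x) for x in range(n + 1)]

cum = 0.0
for x, p in enumerate(pmf):
    cum += p
    if 1 - cum < 0.05:                  # 1 - cum is P(X >= x + 1)
        print("pass mark:", x + 1)      # 9, as in the text
        break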
More generally, consider a student who has the same probability π of the correct
answer for every question, so that X ∼ Bin(20, π). Figure 4.1 (on the next page)
shows plots of the probabilities for π = 0.25, 0.5, 0.7 and 0.9.
A random variable X follows a Poisson distribution with parameter λ > 0, written X ∼ Poisson(λ), if its pf is p(x) = e^{−λ} λ^x / x! for x = 0, 1, 2, . . ., and p(x) = 0 otherwise. The possible values of the Poisson distribution are therefore the non-negative integers 0, 1, 2, . . ..
If X ∼ Poisson(λ), then:
E(X) = λ
and:
Var(X) = λ.
These can also be obtained from the moment generating function (see Example 3.17):
MX(t) = e^{λ(e^t − 1)}.
Poisson distributions are used for counts of occurrences of various kinds. To give a
formal motivation, suppose that we consider the number of occurrences of some
phenomenon in time, and that the process which generates the occurrences satisfies the
following conditions:
1. The numbers of occurrences in any two disjoint intervals of time are independent of
each other.
2. The probability of two or more occurrences at the same time is negligibly small.
3. The probability of one occurrence in any short time interval of length t is λt for
some constant λ > 0.
Example 4.7 Examples of variables for which we might use a Poisson distribution:
Because λ is the rate per unit of time, its value also depends on the unit of time (that
is, the length of interval) we consider.
Example 4.8 If X is the number of arrivals per hour and X ∼ Poisson(1.5), then if
Y is the number of arrivals per two hours, Y ∼ Poisson(1.5 × 2) = Poisson(3).
Example 4.9 Figure 4.2 shows the probabilities p(x) for x = 0, 1, 2, . . . , 10 for
X ∼ Poisson(2) and X ∼ Poisson(4).
pY(0) + pY(1) = e^{−8} 8⁰/0! + e^{−8} 8¹/1! = e^{−8} + 8e^{−8} = 9e^{−8} ≈ 0.0030.
The Poisson distribution can also be used as an approximation of the binomial distribution: if X ∼ Bin(n, π) where n is large and π is small, the distribution of X is well approximated by Poisson(nπ).
Example 4.11 A classic example (from Bortkiewicz (1898) Das Gesetz der kleinen
Zahlen) helps to remember the key elements of the ‘law of small numbers’.
Figure 4.3 (on the next page) shows the numbers of soldiers killed by horsekick in
each of 14 army corps of the Prussian army in each of the years spanning 1875–94.
Suppose that the number of men killed by horsekicks in one corps in one year is
X ∼ Bin(n, π), where:
n is large – the number of men in a corps (perhaps 50,000)
π is small – the probability that a man is killed by a horsekick.
The sample mean of the counts is x̄ = 0.7, which we use as λ for the Poisson
distribution. X ∼ Poisson(0.7) is indeed a good fit to the data, as shown in Figure
4.4.
Figure 4.3: Numbers of soldiers killed by horsekick in each of 14 army corps of the Prussian
army in each of the years spanning 1875–94. Source: Bortkiewicz, L.V. Das Gesetz der
kleinen Zahlen. (Leipzig: B.G. Teubner, 1898).
Figure 4.4: Poisson(0.7) probabilities compared with the sample proportions of the numbers of men killed, for the data in Figure 4.3.
Example 4.12 An airline is selling tickets to a flight with 198 seats. It knows that,
on average, about 1% of customers who have bought tickets fail to arrive for the
flight. Because of this, the airline overbooks the flight by selling 200 tickets. What is
the probability that everyone who arrives for the flight will get a seat?
Let X denote the number of people who fail to turn up. Using the binomial
distribution, X ∼ Bin(200, 0.01). We have:
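Everyone who arrives gets a seat precisely when at least 2 of the 200 ticket holders fail to turn up. A minimal Python sketch of this calculation, together with its Poisson approximation (with λ = nπ = 2, as suggested by the 'law of small numbers' above), is given below for illustration.

import math

# Sketch of Example 4.12: P(X >= 2) for X ~ Bin(200, 0.01), plus the
# Poisson(2) approximation (n large, pi small, lambda = n*pi).
n, pi = 200, 0.01
p0 = (1 - pi) ** n
p1 = n * pi * (1 - pi) ** (n - 1)
print(1 - p0 - p1)                       # binomial: about 0.596

lam = n * pi                              # lambda = 2
print(1 - math.exp(-lam) * (1 + lam))     # Poisson approximation: about 0.594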
Geometric(π) distribution.
• Distribution of the number of failures in Bernoulli trials before the first success.
• π is the probability of success at each trial.
• The sample space is 0, 1, 2, . . ..
• See the basketball example in Chapter 3.
Hypergeometric(n, A, B) distribution.
• Experiment where initially A + B objects are available for selection, and A of
them represent ‘success’.
• n objects are selected at random, without replacement.
• Hypergeometric is then the distribution of the number of successes.
• The sample space is the integers x where max{0, n − B} ≤ x ≤ min{n, A}.
• If the selection was with replacement, the distribution of the number of
successes would be Bin(n, A/(A + B)).
Multinomial(n, π1 , π2 , . . . , πk ) distribution.
• Here π1 + π2 + · · · + πk = 1, and the πi s are the probabilities of the values
1, 2, . . . , k.
• If n = 1, the sample space is 1, 2, . . . , k. This is essentially a generalisation of
the discrete uniform distribution, but with non-equal probabilities πi .
• If n > 1, the sample space is the vectors (n1 , n2 , . . . , nk ) where ni ≥ 0 for all i,
and n1 + n2 + · · · + nk = n. This is essentially a generalisation of the binomial
to the case where each trial has k ≥ 2 possible outcomes, and the random
variable records the numbers of each outcome in n trials. Note that with
k = 2, Multinomial(n, π1 , π2 ) is essentially the same as Bin(n, π) with π = π2
(or with π = π1 ).
• When n > 1, the multinomial distribution is the distribution of a multivariate
random variable, as discussed later in the course.
Uniform distribution.
Exponential distribution.
Normal distribution.
The pdf of the continuous uniform distribution on the interval [a, b], written X ∼ Uniform[a, b], is f(x) = 1/(b − a) for a ≤ x ≤ b, and f(x) = 0 otherwise. The pdf is 'flat', as shown in Figure 4.5 (on the next page). Clearly, f(x) ≥ 0 for all x, and:
∫_{−∞}^{∞} f(x) dx = ∫_a^b 1/(b − a) dx = [x/(b − a)]_a^b = (b − a)/(b − a) = 1.
The cdf is:
F(x) = P(X ≤ x) = ∫_a^x f(t) dt, giving F(x) = 0 for x < a, F(x) = (x − a)/(b − a) for a ≤ x ≤ b, and F(x) = 1 for x > b.
Figure 4.5: Continuous uniform distribution pdf (left) and cdf (right).
E(X) = (a + b)/2 = median of X
and:
Var(X) = (b − a)²/12.
The mean and median also follow from the fact that the distribution is symmetric
about (a + b)/2, i.e. the midpoint of the interval [a, b].
It was shown in the previous chapter that this satisfies the conditions for a pdf (see
Example 3.24). The general shape of the pdf is that of ‘exponential decay’, as shown in
Figure 4.6 (hence the name).
If X ∼ Exp(λ), then E(X) = 1/λ and Var(X) = 1/λ². These have been derived in the previous chapter (see Example 3.26). The median of the distribution, also previously derived (see Example 3.29), is:
m = (ln 2)/λ = (ln 2) × (1/λ) = (ln 2) E(X) ≈ 0.69 × E(X).
Note that the median is always smaller than the mean, because the distribution is
skewed to the right.
The moment generating function of the exponential distribution (derived in Example 3.27) is:
MX(t) = λ/(λ − t) for t < λ.
The exponential is, among other things, a basic distribution of waiting times of various
kinds. This arises from a connection between the Poisson distribution – the simplest
distribution for counts – and the exponential.
If the number of events per unit of time has a Poisson distribution with parameter
λ, the time interval (measured in the same units of time) between two successive
events has an exponential distribution with the same parameter λ.
E(X) = λ for Pois(λ), i.e. a large λ means many events per unit of time, on average.
E(X) = 1/λ for Exp(λ), i.e. a large λ means short waiting times between successive
events, on average.
Example: suppose customers arrive at a rate of λ = 1.6 per minute, so that the waiting time X (in minutes) between arrivals follows an Exp(1.6) distribution. From this exponential distribution, the expected waiting time between arrivals of customers is E(X) = 1/1.6 = 0.625 (minutes) and the median is calculated to be (ln 2) × 0.625 = 0.433.
We can also calculate probabilities of waiting times between arrivals, using the
cumulative distribution function:
F(x) = 0 for x < 0, and F(x) = 1 − e^{−1.6x} for x ≥ 0.
For example:
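For illustration, probabilities of this kind can be computed directly from the cdf, as in the Python sketch below (the particular probabilities evaluated in the original text are not reproduced here, so the values chosen are our own).

import math

# Illustrative sketch: waiting-time probabilities for X ~ Exp(1.6),
# using F(x) = 1 - exp(-1.6 x) for x >= 0.
lam = 1.6

def F(x):
    return 0.0 if x < 0 else 1.0 - math.exp(-lam * x)

print(F(1))                 # P(X <= 1 minute)
print(1 - F(2))             # P(X > 2 minutes)
print(math.log(2) / lam)    # median, about 0.433 minutes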
[Figure: pdfs for four combinations of parameters, (α, β) = (0.5, 1), (1, 0.5), (2, 1) and (2, 0.25).]
Many variables have distributions which are approximately normal, for example
heights of humans or animals, and weights of various products.
f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)) for −∞ < x < ∞
where π is the mathematical constant (i.e. π = 3.14159 . . .), and µ and σ 2 are
parameters, with −∞ < µ < ∞ and σ 2 > 0.
A random variable X with this pdf is said to have a normal distribution with mean
µ and variance σ 2 , denoted X ∼ N (µ, σ 2 ).
Clearly, f(x) ≥ 0 for all x. Also, it can be shown that ∫_{−∞}^{∞} f(x) dx = 1 (do not attempt to show this), so f(x) really is a pdf.
If X ∼ N (µ, σ 2 ), then:
E(X) = µ
and:
Var(X) = σ 2
and, therefore, the standard deviation is sd(X) = σ.
The proof of this is non-examinable. It uses the moment generating function of the
normal distribution, which is:
MX(t) = exp(µt + σ²t²/2) for −∞ < t < ∞.
The mean can also be inferred from the observation that the normal pdf is symmetric
about µ. This also implies that the median of the normal distribution is µ.
The normal density is the so-called ‘bell curve’. The two parameters affect it as follows.
Example 4.14 Figure 4.10 (on the next page) shows that:
N (0, 1) and N (5, 1) have the same dispersion but different location: the N (5, 1)
curve is identical to the N (0, 1) curve, but shifted 5 units to the right
N (0, 1) and N (0, 9) have the same location but different dispersion: the N (0, 9)
curve is centered at the same value, 0, as the N (0, 1) curve, but spread out more
widely.
We now consider one of the convenient properties of the normal distribution. Suppose
X is a random variable, and we consider the linear transformation Y = aX + b, where a
and b are constants.
Whatever the distribution of X, it is true that E(Y ) = a E(X) + b and also that
Var(Y ) = a2 Var(X).
Furthermore, if X is normally distributed, then so is Y . In other words, if
X ∼ N (µ, σ 2 ), then:
Y = aX + b ∼ N (aµ + b, a2 σ 2 ). (4.7)
This type of result is not true in general. For other families of distributions, the
distribution of Y = aX + b is not always in the same family as X.
Let us apply (4.7) with a = 1/σ and b = −µ/σ, to get:
Z = (1/σ)X − µ/σ = (X − µ)/σ ∼ N(µ/σ − µ/σ, (1/σ)²σ²) = N(0, 1).
The distribution N(0, 1) is called the standard normal distribution, and its pdf is:
f(x) = (1/√(2π)) exp(−x²/2) for −∞ < x < ∞.
Probabilities for the standard normal distribution are given in Table 4 of the New Cambridge Statistical Tables. At first sight the table appears limited in two ways.
1. It is only for N(0, 1), not for N(µ, σ²) for any other µ and σ².
2. Even for N(0, 1), it only shows probabilities for z ≥ 0.
We next show how these are not really limitations, starting with ‘2.’.
The key to using the tables is that the standard normal distribution is symmetric about
0. This means that for an interval in one tail, its ‘mirror image’ in the other tail has the
same probability.
Suppose that z ≥ 0, so that −z ≤ 0. Table 4 shows:
P (Z ≤ z) = Φ(z).
In each of these, ≤ can be replaced by <, and ≥ by > (see Section 3.5). Figure 4.11 (on
the next page) shows tail probabilities for the standard normal distribution.
If Z ∼ N(0, 1), for any two numbers z1 < z2, then:
P(z1 < Z ≤ z2) = Φ(z2) − Φ(z1)
where Φ(z2) and Φ(z1) are obtained using the rules above.
Reality check: remember that:
Example 4.15 Consider the 0.7995 value for x = 0.84 in Table 4 of the New Cambridge Statistical Tables, which shows that P(Z ≤ 0.84) = Φ(0.84) = 0.7995.
Example 4.16 Let X denote the diastolic blood pressure of a randomly selected
person in England. This is approximately distributed as X ∼ N (74.2, 127.87).
Suppose we want to know the probabilities of the following intervals:
X > 90 (high blood pressure)
X < 60 (low blood pressure)
60 ≤ X ≤ 90 (normal blood pressure).
Finally:
P (60 ≤ X ≤ 90) = P (X ≤ 90) − P (X < 60) = 0.8152.
These probabilities are shown in Figure 4.12.
[Figure 4.12: the pdf of X, with the regions corresponding to low (probability 0.10), normal (0.82) and high (0.08) blood pressure marked.]
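These probabilities can also be computed without tables, using the relationship between Φ and the error function. The Python sketch below is illustrative; small differences from the values quoted above arise from rounding z-values when using the tables.

import math

# Sketch of Example 4.16: X ~ N(74.2, 127.87), with
# Phi(z) = 0.5 * (1 + erf(z / sqrt(2))).
mu, sigma = 74.2, math.sqrt(127.87)

def Phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

p_high = 1 - Phi((90 - mu) / sigma)    # P(X > 90), about 0.08
p_low = Phi((60 - mu) / sigma)         # P(X < 60), about 0.10
print(p_high, p_low, 1 - p_high - p_low)   # the last is close to 0.8152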
The first two of these are illustrated graphically in Figure 4.13 (on the next page).
Figure 4.13: Some probabilities around the mean for the normal distribution.
P(X ≤ 4) = P(X ≤ 4.5) = P(X < 5)
since P(4 < X ≤ 4.5) = 0 and P(4.5 < X < 5) = 0 due to the 'gaps' in the probability mass for this distribution. In contrast, if Y ∼ N(16, 9.6), then:
P(Y ≤ 4) < P(Y ≤ 4.5) < P(Y < 5)
since P(4 < Y < 4.5) > 0 and P(4.5 < Y < 5) > 0 because this is a continuous distribution.
The accepted way to circumvent this problem is to use a continuity correction which
corrects for the effects of the transition from a discrete Bin(n, π) distribution to a
continuous N (nπ, nπ(1 − π)) distribution.
Continuity correction
Example 4.17 In the UK general election in May 2010, the Conservative Party
received 36.1% of the votes. We carry out an opinion poll in November 2014, where
we survey 1,000 people who say they voted in 2010, and ask who they would vote for
if a general election was held now. Let X denote the number of people who say they
would now vote for the Conservative Party.
Suppose we assume that X ∼ Bin(1,000, 0.361).
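To illustrate the continuity correction numerically, the sketch below approximates a probability for X ∼ Bin(1,000, 0.361) by the N(361, 230.679) distribution. The threshold 330 is an illustrative value of our own choosing, not the one used in the original example.

import math

# Illustrative sketch: normal approximation with continuity correction for
# X ~ Bin(1000, 0.361), approximated by N(n*pi, n*pi*(1 - pi)).
n, pi = 1000, 0.361
mu, sigma = n * pi, math.sqrt(n * pi * (1 - pi))

def Phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# P(X <= 330): with the continuity correction, use 330.5 for the normal variable
print(Phi((330.5 - mu) / sigma))   # about 0.022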
Here (b) seems a more plausible conclusion than (a). This kind of reasoning is
the basis of statistical significance tests.
(b) Suppose now that the standard deviation, σ, can be fixed at specified levels.
What is the largest value of σ that will allow the amount of coffee dispensed to
fall within 30 ml of the mean with a probability of at least 95%?
3. A vaccine for Covid-19 is known to be 90% effective, i.e. 90% of vaccine recipients
are successfully immunised against Covid-19. A new (different) vaccine is tested on
100 patients and found to successfully immunise 96 of the 100 patients. Is the new
vaccine better?
Hint: Assume the new vaccine is equally effective as the original vaccine and
consider using an appropriate approximating distribution.
4.9. Solutions to Sample examination questions
Standardising, we have:
P(Z > 2.33) = 0.01
hence:
2.33 = (250 − µ)/9  ⇒  µ = 229.03.
There are two kinds of statistics, the kind you look up and the kind you make
up.
(Rex Stout)
Chapter 5
Multivariate random variables
5.3 Introduction
So far, we have considered univariate situations, that is, one random variable at a time. Now we will consider multivariate situations, that is, two or more random variables considered together.
In particular, we consider two somewhat different types of multivariate situations.
In particular, we consider two somewhat different types of multivariate situations.
X = (X1, X2, . . . , Xn)′
x = (x1, x2, . . . , xn)′
The joint probability function (joint pf) of a discrete multivariate random variable X is:
p(x1, x2, . . . , xn) = P(X1 = x1, X2 = x2, . . . , Xn = xn)
for all vectors (x1, x2, . . . , xn) of n real numbers. The value p(x1, x2, . . . , xn) of the joint probability function is itself a single number, not a vector.
In the bivariate case, this is:
p(x, y) = P (X = x, Y = y)
which we sometimes write as pX,Y (x, y) to make the random variables clear.
Example 5.2 Consider a randomly selected football match in the English Premier
League (EPL), and the two random variables X = the number of goals scored by the home team, and Y = the number of goals scored by the visiting (away) team.
Suppose both variables have possible values 0, 1, 2 and 3 (to keep this example
simple, we have recorded the small number of scores of 4 or greater also as 3).
Consider the joint distribution of (X, Y ). We use probabilities based on data from
the 2009–10 EPL season.
Suppose the values of pX,Y (x, y) = p(x, y) = P (X = x, Y = y) are the following:
Y =y
X=x 0 1 2 3
0 0.100 0.031 0.039 0.031
1 0.100 0.146 0.092 0.015
2 0.085 0.108 0.092 0.023
3 0.062 0.031 0.039 0.006
The joint probability function gives probabilities of values of (X, Y ), for example:
A 1–1 draw, which is the most probable single result, has probability
P (X = 1, Y = 1) = p(1, 1) = 0.146.
where the sum is of the values of the joint pf of (X1 , X2 , X3 , X4 ) over all possible
values of X3 and X4 .
The simplest marginal distributions are those of individual variables in the multivariate
random variable.
The marginal pf is then obtained by summing the joint pf over all the other variables.
The resulting marginal distribution is univariate, and its pf is a univariate pf.
For the bivariate distribution of (X, Y ) the univariate marginal distributions are
those of X and Y individually. Their marginal pfs are:
pX(x) = Σ_y p(x, y) and pY(y) = Σ_x p(x, y).
Example 5.4 Continuing with the football example introduced in Example 5.2, the
joint and marginal probability functions are:
Y =y
X=x 0 1 2 3 pX (x)
0 0.100 0.031 0.039 0.031 0.201
1 0.100 0.146 0.092 0.015 0.353
2 0.085 0.108 0.092 0.023 0.308
3 0.062 0.031 0.039 0.006 0.138
pY (y) 0.347 0.316 0.262 0.075 1.000
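As a computational sketch (illustrative only; the probabilities are those in the table above), the marginal pfs can be obtained by summing the joint pf over the other variable.

# Sketch of Example 5.4: marginal pfs of X (home goals) and Y (away goals)
# from the joint pf tabulated above.
joint = {
    (0, 0): 0.100, (0, 1): 0.031, (0, 2): 0.039, (0, 3): 0.031,
    (1, 0): 0.100, (1, 1): 0.146, (1, 2): 0.092, (1, 3): 0.015,
    (2, 0): 0.085, (2, 1): 0.108, (2, 2): 0.092, (2, 3): 0.023,
    (3, 0): 0.062, (3, 1): 0.031, (3, 2): 0.039, (3, 3): 0.006,
}

p_X = {x: round(sum(p for (a, b), p in joint.items() if a == x), 3) for x in range(4)}
p_Y = {y: round(sum(p for (a, b), p in joint.items() if b == y), 3) for y in range(4)}
print(p_X)   # {0: 0.201, 1: 0.353, 2: 0.308, 3: 0.138}
print(p_Y)   # {0: 0.347, 1: 0.316, 2: 0.262, 3: 0.075}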
For example:
pX(0) = Σ_{y=0}^{3} p(0, y) = p(0, 0) + p(0, 1) + p(0, 2) + p(0, 3) = 0.100 + 0.031 + 0.039 + 0.031 = 0.201.
Even for a multivariate random variable, expected values E(Xi ), variances Var(Xi ) and
medians of individual variables are obtained from the univariate (marginal)
distributions of Xi , as defined in Chapter 3.
Example 5.6 For a randomly selected man (aged over 16) in England, let X = height (in cm) and Y = weight (in kg).
Let x be one possible value of X, for which pX(x) > 0. The conditional distribution of Y given that X = x is the discrete probability distribution with the pf:
pY|X(y | x) = pX,Y(x, y) / pX(x)
for any value y.
Example 5.7 Recall that in the football example the joint and marginal pfs were:
Y =y
X=x 0 1 2 3 pX (x)
0 0.100 0.031 0.039 0.031 0.201
1 0.100 0.146 0.092 0.015 0.353
2 0.085 0.108 0.092 0.023 0.308
3 0.062 0.031 0.039 0.006 0.138
pY (y) 0.347 0.316 0.262 0.075 1.000
Figure 5.2: Bivariate joint pdf (contour plot) for Example 5.6.
We can now calculate the conditional pf of Y given X = x for each x, i.e. of away
goals given home goals. For example:
pY |X (y | x) when y is:
X=x 0 1 2 3 Sum
0 0.498 0.154 0.194 0.154 1.00
1 0.283 0.414 0.261 0.042 1.00
2 0.276 0.351 0.299 0.075 1.00
3 0.449 0.225 0.283 0.043 1.00
if the home team scores 0 goals, the probability that the visiting team scores 1
goal is pY |X (1 | 0) = 0.154
if the home team scores 1 goal, the probability that the visiting team wins the
match is pY |X (2 | 1) + pY |X (3 | 1) = 0.261 + 0.042 = 0.303.
The conditional distribution and pf of X given Y = y (for any y such that pY (y) > 0) is
defined similarly, with the roles of X and Y reversed:
pX|Y(x | y) = pX,Y(x, y) / pY(y)
for any value x.
Conditional distributions are general and are not limited to the bivariate case. If X
and/or Y are vectors of random variables, the conditional pf of Y given X = x is:
pY|X(y | x) = pX,Y(x, y) / pX(x)
where pX,Y(x, y) is the joint pf of the random vector (X, Y), and pX(x) is the marginal pf of the random vector X.
The conditional mean of Y given X = x is EY|X(Y | x) = Σ_y y pY|X(y | x), i.e. the expected value of the conditional distribution of Y given X = x. So, if the home team scores 0 goals, the expected number of goals by the visiting team is EY|X(Y | 0) = 1.00.
EY|X(Y | x) for x = 1, 2 and 3 are obtained similarly.
Here X is the number of goals by the home team, and Y is the number of goals by
the visiting team:
pY |X (y | x) when y is:
X=x 0 1 2 3 EY |X (Y | x)
0 0.498 0.154 0.194 0.154 1.00
1 0.283 0.414 0.261 0.042 1.06
2 0.276 0.351 0.299 0.075 1.17
3 0.449 0.225 0.283 0.043 0.92
Plots of the conditional means are shown in Figure 5.4 (on the next page).
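The conditional pfs and conditional means in this table can be reproduced with a short Python sketch (illustrative only; the joint probabilities are those of the football example above).

# Illustrative sketch: conditional pf p_{Y|X}(y | x) = p(x, y) / p_X(x) and
# conditional means E(Y | X = x) for the football joint pf.
rows = {  # joint probabilities p(x, y); x = home goals (rows), y = away goals
    0: [0.100, 0.031, 0.039, 0.031],
    1: [0.100, 0.146, 0.092, 0.015],
    2: [0.085, 0.108, 0.092, 0.023],
    3: [0.062, 0.031, 0.039, 0.006],
}

for x, probs in rows.items():
    p_x = sum(probs)                                   # marginal p_X(x)
    cond = [p / p_x for p in probs]                    # p_{Y|X}(y | x)
    e_y_given_x = sum(y * c for y, c in enumerate(cond))
    print(x, [round(c, 3) for c in cond], round(e_y_given_x, 2))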
For continuous random variables, the conditional pdf of Y given X = x is defined analogously as:
fY|X(y | x) = fX,Y(x, y) / fX(x)
for any x such that fX(x) > 0.
Figure 5.4: Conditional means E(Y | x) of away goals, plotted against home goals x.
Example 5.9 For a randomly selected man (aged over 16) in England, consider
X = height (in cm) and Y = weight (in kg). The joint distribution of (X, Y ) is
approximately bivariate normal (see Example 5.6).
The conditional distribution of Y given X = x is then a normal distribution for each
x, with the following parameters:
In other words, the conditional mean depends on x, but the conditional variance
does not. For example:
For women, this conditional distribution is normal with the following parameters:
The conditional means are shown in Figure 5.5 (on the next page).
Figure 5.5: Conditional mean of weight (in kg) given height (in cm), for men and women.
5.8.1 Covariance
Definition of covariance
The covariance of two random variables X and Y is defined as:
Cov(X, Y) = E[(X − E(X))(Y − E(Y))] = E(XY) − E(X) E(Y).
Properties of covariance
The covariance of a random variable with itself is the variance of the random
variable:
Cov(X, X) = E(XX) − E(X) E(X) = E(X 2 ) − (E(X))2 = Var(X).
5.8.2 Correlation
Definition of correlation
The correlation of X and Y is defined as:
Corr(X, Y) = Cov(X, Y) / (sd(X) sd(Y)).
When Cov(X, Y) = 0, then Corr(X, Y) = 0. When this is the case, we say that X and Y are uncorrelated.
Correlation and covariance are measures of the strength of the linear (‘straight-line’)
association between X and Y .
The further the correlation is from 0, the stronger is the linear association. The most
extreme possible values of correlation are −1 and +1, which are obtained when Y is an
exact linear function of X.
Corr(X, Y ) = +1 when Y = aX + b with a > 0.
Corr(X, Y ) = −1 when Y = aX + b with a < 0.
Example 5.10 Recall the joint pf pX,Y (x, y) in the football example:
                         Y = y
X = x            0        1        2        3
0      xy:       0        0        0        0
       prob:     0.100    0.031    0.039    0.031
1      xy:       0        1        2        3
       prob:     0.100    0.146    0.092    0.015
2      xy:       0        2        4        6
       prob:     0.085    0.108    0.092    0.023
3      xy:       0        3        6        9
       prob:     0.062    0.031    0.039    0.006
Here, the numbers in bold are the values of xy for each combination of x and y.
From these and their probabilities, we can derive the probability distribution of XY .
For example:
XY = xy 0 1 2 3 4 6 9
P (XY = xy) 0.448 0.146 0.200 0.046 0.092 0.062 0.006
Hence:
E(X) = 1.383
E(Y) = 1.065
E(XY) = 0 × 0.448 + 1 × 0.146 + 2 × 0.200 + 3 × 0.046 + 4 × 0.092 + 6 × 0.062 + 9 × 0.006 = 1.478
E(X²) = 2.827
E(Y²) = 2.039
Var(X) = 2.827 − (1.383)² = 0.9143
Var(Y) = 2.039 − (1.065)² = 0.9048
and therefore:
Cov(X, Y) = E(XY) − E(X) E(Y) = 1.478 − 1.383 × 1.065 ≈ 0.005 and Corr(X, Y) = Cov(X, Y)/(sd(X) sd(Y)) ≈ 0.006.
The numbers of goals scored by the home and visiting teams are very nearly
uncorrelated (i.e. not linearly associated).
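A minimal R sketch (not part of the guide) of how these quantities can be computed directly from the joint pf of Example 5.10; the matrix of probabilities below is simply the table above typed in by hand.

x <- 0:3; y <- 0:3
p_xy <- matrix(c(0.100, 0.031, 0.039, 0.031,
                 0.100, 0.146, 0.092, 0.015,
                 0.085, 0.108, 0.092, 0.023,
                 0.062, 0.031, 0.039, 0.006),
               nrow = 4, byrow = TRUE)          # rows: x = 0..3, columns: y = 0..3
p_x <- rowSums(p_xy); p_y <- colSums(p_xy)      # marginal pfs
E_x <- sum(x * p_x); E_y <- sum(y * p_y)
V_x <- sum(x^2 * p_x) - E_x^2
V_y <- sum(y^2 * p_y) - E_y^2
E_xy <- sum(outer(x, y) * p_xy)                 # E(XY) = sum of xy * p(x, y)
cov_xy <- E_xy - E_x * E_y
corr_xy <- cov_xy / sqrt(V_x * V_y)
c(E_x = E_x, E_y = E_y, Cov = cov_xy, Corr = corr_xy)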
These are the sample analogues of Cov(X, Y) and Corr(X, Y). The uses of these sample measures will be discussed in more detail later in the course.
Sample covariance: for observed pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$, the sample covariance is $s_{XY} = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})$.
Sample correlation: the sample correlation coefficient is $r = s_{XY}/(s_X s_Y)$, where $s_X$ and $s_Y$ are the sample standard deviations of the $X_i$s and $Y_i$s, respectively.
Example 5.11 Figure 5.6 (on the next page) shows different examples of
scatterplots of observations of X and Y , and different values of the sample
correlation, r. The line shown in each plot is the best-fitting (least squares) line for
the scatterplot (which will be introduced later in the course).
Plot (f) shows that r can be 0 even if two variables are clearly related, if that
relationship is not linear.
5.9. Independent random variables
$$p_{Y|X}(y \mid x) = \frac{p_{X,Y}(x, y)}{p_X(x)} = p_Y(y) \quad\text{for all } x \text{ and } y$$
Discrete random variables $X_1, X_2, \ldots, X_n$ are independent if and only if their joint pf is:
$$p(x_1, x_2, \ldots, x_n) = p_1(x_1)\, p_2(x_2) \cdots p_n(x_n)$$
for all numbers $x_1, x_2, \ldots, x_n$, where $p_1(x_1), p_2(x_2), \ldots, p_n(x_n)$ are the univariate marginal pfs of $X_1, X_2, \ldots, X_n$, respectively.
Similarly, continuous random variables $X_1, X_2, \ldots, X_n$ are independent if and only if their joint pdf is:
$$f(x_1, x_2, \ldots, x_n) = f_1(x_1)\, f_2(x_2) \cdots f_n(x_n)$$
for all $x_1, x_2, \ldots, x_n$, where $f_1(x_1), f_2(x_2), \ldots, f_n(x_n)$ are the univariate marginal pdfs of $X_1, X_2, \ldots, X_n$, respectively.
If two random variables are independent, they are also uncorrelated, i.e. we have:
Cov(X, Y ) = 0 and Corr(X, Y ) = 0.
This will be proved later.
The reverse is not true, i.e. two random variables can be dependent even when their
correlation is 0. This can happen when the dependence is non-linear.
5.10. Sums and products of random variables
$$f(x_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
We consider sums and products of the form:
$$\sum_{i=1}^n a_i X_i + b = a_1X_1 + a_2X_2 + \cdots + a_nX_n + b$$
and:
$$\prod_{i=1}^n a_i X_i = (a_1 X_1)(a_2 X_2)\cdots(a_n X_n)$$
where $a_1, a_2, \ldots, a_n$ and $b$ are constants.
Each such sum or product is itself a univariate random variable. The probability
distribution of such a function depends on the joint distribution of X1 , X2 , . . . , Xn .
Example 5.15 In the football example, the sum Z = X + Y is the total number of
goals scored in a match.
Its probability function is obtained from the joint pf pX,Y (x, y), that is:
Z=z 0 1 2 3 4 5 6
pZ (z) 0.100 0.131 0.270 0.293 0.138 0.062 0.006
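A minimal R sketch (not from the guide) of how the pf of Z = X + Y in Example 5.15 can be obtained from the joint pf: probabilities are summed over each diagonal x + y = z of the joint table.

x <- 0:3; y <- 0:3
p_xy <- matrix(c(0.100, 0.031, 0.039, 0.031,
                 0.100, 0.146, 0.092, 0.015,
                 0.085, 0.108, 0.092, 0.023,
                 0.062, 0.031, 0.039, 0.006),
               nrow = 4, byrow = TRUE)
z <- outer(x, y, "+")                              # value of x + y for each cell
p_z <- tapply(as.vector(p_xy), as.vector(z), sum)  # sum probabilities with equal z
round(p_z, 3)                                      # should match the table above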
However, what can we say about such distributions in general, in cases where we cannot
derive them as easily?
                      Sums                                 Products
Mean                  Yes                                  Only for independent variables
Variance              Yes                                  No
Distributional form   Normal: yes                          No
                      Some other distributions: only
                      for independent variables
If $X_1, X_2, \ldots, X_n$ are random variables and $a_1, a_2, \ldots, a_n$ and $b$ are constants, then:
$$\text{Var}\left(\sum_{i=1}^n a_i X_i + b\right) = \sum_{i=1}^n a_i^2\,\text{Var}(X_i) + 2\sum_{i<j}\sum a_i a_j\,\text{Cov}(X_i, X_j). \tag{5.4}$$
In particular, for n = 2: $\text{Var}(a_1X_1 + a_2X_2 + b) = a_1^2\text{Var}(X_1) + a_2^2\text{Var}(X_2) + 2a_1a_2\text{Cov}(X_1, X_2)$.
These results also hold whenever $\text{Cov}(X_i, X_j) = 0$ for all $i \neq j$, even if the random
variables are not independent.
There is no corresponding simple result for the means of products of dependent random
variables. There is also no simple result for the variances of products of random
variables, even when they are independent.
Proof:
Recall:
Cov(X, Y ) = E(XY ) − E(X) E(Y ).
Proof:
Cov(X, Y ) = Corr(X, Y ) = 0.
Proof:
a1 X1 + a2 X2 + · · · + an Xn + b
whatever the joint distribution of X1 , X2 , . . . , Xn . This is usually all we can say about
the distribution of this sum.
In particular, the form of the distribution of the sum (i.e. its pf/pdf) depends on the
joint distribution of X1 , X2 , . . . , Xn , and there are no simple general results about that.
For example, even if X and Y have distributions from the same family, the distribution
of X + Y is often not from that same family. However, such results are available for a
few special cases.
An easy proof that the mean and variance of $X \sim \text{Bin}(n, \pi)$ are $\mathrm{E}(X) = n\pi$ and $\text{Var}(X) = n\pi(1-\pi)$ is as follows: write $X = \sum_{i=1}^n X_i$, where the $X_i$ are independent Bernoulli(π) random variables with $\mathrm{E}(X_i) = \pi$ and $\text{Var}(X_i) = \pi(1-\pi)$; the results for sums of independent random variables then give $\mathrm{E}(X) = n\pi$ and $\text{Var}(X) = n\pi(1-\pi)$.
All sums (linear combinations) of normally distributed random variables are also
normally distributed.
Suppose $X_1, X_2, \ldots, X_n$ are normally distributed random variables, with $X_i \sim N(\mu_i, \sigma_i^2)$ for $i = 1, 2, \ldots, n$, and $a_1, a_2, \ldots, a_n$ and $b$ are constants. Then:
$$\sum_{i=1}^n a_i X_i + b \sim N(\mu, \sigma^2)$$
where:
$$\mu = \sum_{i=1}^n a_i\mu_i + b \quad\text{and}\quad \sigma^2 = \sum_{i=1}^n a_i^2\sigma_i^2 + 2\sum_{i<j}\sum a_i a_j\,\text{Cov}(X_i, X_j).$$
If the $X_i$s are independent (or just uncorrelated), i.e. if $\text{Cov}(X_i, X_j) = 0$ for all $i \neq j$, the variance simplifies to $\sigma^2 = \sum_{i=1}^n a_i^2\sigma_i^2$.
Example 5.17 Suppose that in the population of English people aged 16 or over:
the heights of men (in cm) follow a normal distribution with mean 174.9 and
standard deviation 7.39
the heights of women (in cm) follow a normal distribution with mean 161.3 and
standard deviation 6.85.
Suppose we select one man and one woman at random and independently of each
other. Denote the man’s height by X and the woman’s height by Y . What is the
probability that the man is at most 10 cm taller than the woman?
In other words, what is the probability that the difference between X and Y is at
most 10?
Since X and Y are independent we have:
$$D = X - Y \sim N(\mu_X - \mu_Y,\ \sigma_X^2 + \sigma_Y^2) = N(174.9 - 161.3,\ (7.39)^2 + (6.85)^2) = N(13.6,\ (10.08)^2).$$
Hence $P(D \le 10) = P\left(Z \le \frac{10 - 13.6}{10.08}\right) = P(Z \le -0.36) \approx 0.36$.
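A minimal R sketch (not from the guide) of the same probability, using the distribution of D derived above:

mu_D <- 174.9 - 161.3              # mean of D = X - Y
sd_D <- sqrt(7.39^2 + 6.85^2)      # standard deviation of D (by independence)
pnorm(10, mean = mu_D, sd = sd_D)  # P(D <= 10), roughly 0.36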
5.13 Sample examination questions
X = −1 X=0 X=1
Y = −1 0.09 0.16 0.15
Y =0 0.09 0.08 0.03
Y =1 0.12 0.16 0.12
(a) Determine the marginal distributions and calculate the expected values of X
and Y , respectively.
(d) Define U = |X| and V = Y . Calculate E(U ) and the covariance of U and V .
Are U and V correlated?
2. Suppose X and Y are two independent random variables with the following
probability distributions:
X = x        −1     0      1              Y = y        −1     0      1
P(X = x)    0.30  0.40   0.30     and     P(Y = y)    0.40  0.20   0.40
Let S = X² + Y² and T = X + Y.
(c) Are S and T uncorrelated? Are S and T independent? Justify your answers.
and:
$$\mathrm{E}(Y) = \sum_y y\, p_Y(y) = (-1 \times 0.40) + (0 \times 0.20) + (1 \times 0.40) = 0.$$
(b) We have P(Y = 0) = 0.09 + 0.08 + 0.03 = 0.20. Therefore:
$$P(X = -1 \mid Y = 0) = \frac{0.09}{0.20} = 0.45, \quad P(X = 0 \mid Y = 0) = \frac{0.08}{0.20} = 0.40, \quad P(X = 1 \mid Y = 0) = \frac{0.03}{0.20} = 0.15.$$
and therefore:
$$P(X = 0 \mid X + Y = 1) = \frac{0.16}{0.19} = \frac{16}{19} \quad\text{and}\quad P(X = 1 \mid X + Y = 1) = \frac{0.03}{0.19} = \frac{3}{19}$$
and therefore:
$$\mathrm{E}(X \mid X + Y = 1) = 0 \times \frac{16}{19} + 1 \times \frac{3}{19} = \frac{3}{19} = 0.1579.$$
U =0 U =1
V = −1 0.16 0.24
V =0 0.08 0.12
V =1 0.16 0.24
We then have that P (U = 0) = 0.16 + 0.08 + 0.16 = 0.40 and also that
P (U = 1) = 1 − P (U = 0) = 0.60. Also, we have that P (V = −1) = 0.40,
P(V = 0) = 0.20 and P(V = 1) = 0.40. So E(U) = 0 × 0.40 + 1 × 0.60 = 0.60 and E(V) = (−1 × 0.40) + (0 × 0.20) + (1 × 0.40) = 0, and:
E(UV) = (−1 × 1 × 0.24) + (1 × 1 × 0.24) = 0.
Hence Cov(U, V ) = E(U V ) − E(U )E(V ) = 0 − 0.60 × 0 = 0. Since the
covariance is zero, so is the correlation coefficient, therefore U and V are
uncorrelated.
            S = 0    S = 1    S = 2
T = −2        0        0       0.12
T = −1        0       0.22       0
T = 0       0.08        0       0.24
T = 1         0       0.22       0
T = 2         0        0       0.12
$$\text{Var}(T) = \mathrm{E}(T^2) = \sum_{t=-2}^{2} t^2\, p(t) = 4(0.12) + 1(0.22) + 0 + 1(0.22) + 4(0.12) = 1.40.$$
iii. We have:
$$\mathrm{E}(S \mid T = 0) = \sum_{s=0}^{2} s\, p_{S|T=0}(s \mid t = 0) = 0 \times \frac{0.08}{0.32} + 2 \times \frac{0.24}{0.32} = 1.5.$$
(c) The random variables S and T are uncorrelated, since Cov(S, T) = 0. However, P(T = −2) > 0 and P(S = 0) > 0, but:
$$P(\{T = -2\} \cap \{S = 0\}) = 0 \neq P(T = -2)\,P(S = 0)$$
which is sufficient to show that S and T are not independent.
Chapter 6
Sampling distributions of statistics
prove and apply the results for the mean and variance of the sampling distribution
of the sample mean when a random sample is drawn with replacement
state the central limit theorem and recall when the limit is likely to provide a good
approximation to the distribution of the sample mean.
6.3 Introduction
Suppose we have a sample of n observations of a random variable X:
{X1 , X2 , . . . , Xn }.
We use f (x) to denote both the pdf of a continuous random variable, and the pf of
a discrete random variable.
The parameter(s) of a distribution are generally denoted as θ. For example, for the
Poisson distribution θ stands for λ, and for the normal distribution θ stands for
(µ, σ 2 ).
Parameters are often included in the notation: f (x; θ) denotes the pf/pdf of a
distribution with parameter(s) θ, and F (x; θ) is its cdf.
For simplicity, we may often use phrases like ‘distribution f (x; θ)’ or ‘distribution
F (x; θ)’ when we mean ‘distribution with the pf/pdf f (x; θ)’ and ‘distribution with the
cdf F (x; θ)’, respectively.
The simplest assumptions about the joint distribution of the sample are as follows.
We will assume this most of the time from now. So you will see many examples and
questions which begin something like:
Not all problems can be seen as IID random samples of a single random variable. There
are other possibilities, which you will see more of in the future.
the sample variance $S^2 = \sum_{i=1}^n (X_i - \bar{X})^2/(n-1)$ and standard deviation $S = \sqrt{S^2}$
Here we focus on single (univariate) statistics. More generally, we could also consider
vectors of statistics, i.e. multivariate statistics.
Here is one such random sample (with values rounded to 2 decimal places):
6.28 5.22 4.19 3.56 4.15 4.11 4.03 5.81 5.43 6.09
4.98 4.11 5.55 3.95 4.97 5.68 5.66 3.37 4.98 6.58
For this random sample, the values of our statistics are $\bar{x} = 4.94$, $s^2 = 0.90$ and $\max x_i = 6.58$.
Here is another such random sample (with values rounded to 2 decimal places):
5.44 6.14 4.91 5.63 3.89 4.17 5.79 5.33 5.09 3.90
5.47 6.62 6.43 5.84 6.19 5.63 3.61 5.49 4.55 4.27
For this sample, the values of our statistics are:
The sampling distribution of a statistic is the distribution of the values of the statistic
in (infinitely) many repeated samples. However, typically we only have one sample
which was actually observed. Therefore, the sampling distribution seems like an
essentially hypothetical concept.
Nevertheless, it is possible to derive the forms of sampling distributions of statistics
under different assumptions about the sampling schemes and population distribution
f (x; θ).
There are two main ways of doing this.
Example 6.3 Consider again a random sample of size n = 20 from the population
X ∼ N (5, 1), and the statistics X̄, S 2 and maxX .
Figures 6.1, 6.2 and 6.3 (the latter two figures appear on p.169) show
histograms of the statistics for these 10,000 random samples.
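A minimal R sketch (not from the guide) of the simulation described here: 10,000 random samples of size n = 20 from N(5, 1), with the three statistics computed for each sample and a histogram of the sample means drawn.

set.seed(1)
n <- 20; R <- 10000
samples <- matrix(rnorm(n * R, mean = 5, sd = 1), nrow = R)  # one sample per row
xbar <- rowMeans(samples)
s2   <- apply(samples, 1, var)
mx   <- apply(samples, 1, max)
hist(xbar, freq = FALSE, main = "Sample mean")
curve(dnorm(x, mean = 5, sd = 1 / sqrt(n)), add = TRUE)      # exact pdf of the sample mean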
We now consider deriving the exact sampling distribution. Here this is possible. For
a random sample of size n from N (µ, σ 2 ) we have:
where FX (x) and fX (x) are the cdf and pdf of X ∼ N (µ, σ 2 ), respectively.
Curves of the densities of these distributions are also shown in Figures 6.1, 6.2 and
6.3.
Sample mean
If $X_1, X_2, \ldots, X_n$ are independent random variables and $a_1, a_2, \ldots, a_n$ are constants, then:
$$\mathrm{E}\left(\sum_{i=1}^n a_i X_i\right) = \sum_{i=1}^n a_i\,\mathrm{E}(X_i) \quad\text{and}\quad \text{Var}\left(\sum_{i=1}^n a_i X_i\right) = \sum_{i=1}^n a_i^2\,\text{Var}(X_i).$$
[Figures 6.2 and 6.3: histograms of the sample variance and of the maximum value over the 10,000 simulated samples.]
For a random sample, all $X_i$s are independent and $\mathrm{E}(X_i) = \mathrm{E}(X)$ is the same for all of them, since the $X_i$s are identically distributed. $\bar{X} = \sum_i X_i/n$ is of the form $\sum_i a_i X_i$, with $a_i = 1/n$ for all $i = 1, 2, \ldots, n$. Therefore:
$$\mathrm{E}(\bar{X}) = \sum_{i=1}^n \frac{1}{n}\mathrm{E}(X) = n \times \frac{1}{n}\mathrm{E}(X) = \mathrm{E}(X)$$
and:
$$\text{Var}(\bar{X}) = \sum_{i=1}^n \frac{1}{n^2}\text{Var}(X) = n \times \frac{1}{n^2}\text{Var}(X) = \frac{\text{Var}(X)}{n}.$$
So the mean and variance of X̄ are E(X) and Var(X)/n, respectively, for a random
sample from any population distribution of X. What about the form of the sampling
distribution of X̄?
This depends on the distribution of X, and is not generally known. However, when the
distribution of X is normal, we do know that the sampling distribution of X̄ is also
normal.
Suppose that $\{X_1, X_2, \ldots, X_n\}$ is a random sample from a normal distribution with mean µ and variance σ², then:
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right).$$
For example, the pdf drawn on the histogram in Figure 6.1 (p.168) is that of N (5, 1/20).
We have E(X̄) = E(X) = µ.
More interestingly, the sampling variance gets smaller when the sample size n
increases.
Figure 6.4 (on the next page) shows sampling distributions of X̄ from N (5, 1) for
different n.
Example 6.4 Suppose that the heights (in cm) of men (aged over 16) in a
population follow a normal distribution with some unknown mean µ and a known
standard deviation of 7.39.
[Figure 6.4: sampling distributions of X̄ from N(5, 1) for n = 5, 20 and 100.]
We plan to select a random sample of n men from the population, and measure their
heights. How large should n be so that there is a probability of at least 0.95 that the
sample mean X̄ will be within 1 cm of the population mean µ?
Here $X \sim N(\mu, (7.39)^2)$, so $\bar{X} \sim N(\mu, (7.39/\sqrt{n})^2)$. What we need is the smallest $n$ such that:
$$P(|\bar{X} - \mu| \le 1) \ge 0.95.$$
So:
$$P(|\bar{X} - \mu| \le 1) \ge 0.95$$
$$P(-1 \le \bar{X} - \mu \le 1) \ge 0.95$$
$$P\left(\frac{-1}{7.39/\sqrt{n}} \le \frac{\bar{X} - \mu}{7.39/\sqrt{n}} \le \frac{1}{7.39/\sqrt{n}}\right) \ge 0.95$$
$$P\left(-\frac{\sqrt{n}}{7.39} \le Z \le \frac{\sqrt{n}}{7.39}\right) \ge 0.95$$
$$P\left(Z > \frac{\sqrt{n}}{7.39}\right) < \frac{0.05}{2} = 0.025$$
where $Z \sim N(0, 1)$. From Table 4 of the New Cambridge Statistical Tables, we see that the smallest z which satisfies $P(Z > z) < 0.025$ is z = 1.97. Therefore:
$$\frac{\sqrt{n}}{7.39} \ge 1.97 \quad\Leftrightarrow\quad n \ge (7.39 \times 1.97)^2 = 211.9.$$
Therefore, n should be at least 212.
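A minimal R sketch (not from the guide) of the same calculation; qnorm() gives the exact 97.5th percentile, whereas the tables give 1.97 at that level of accuracy, which is why the two answers differ slightly.

sigma <- 7.39
z <- qnorm(0.975)        # exact 97.5th percentile of N(0, 1), 1.959964
n_min <- (sigma * z)^2   # smallest n with sqrt(n)/sigma >= z
ceiling(n_min)           # 210 with the exact z; 212 with the tabulated value 1.97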
It may appear that the CLT is still somewhat limited, in that it applies only to sample
means calculated from random (IID) samples. However, this is not really true, for two
main reasons.
There are more general versions of the CLT which do not require the observations
Xi to be IID.
Even the basic version applies very widely, when we realise that the ‘X’ can also be
a function of the original variables in the data. For example, if X and Y are
random variables in the sample, we can also apply the CLT to:
$$\sum_{i=1}^n \frac{\ln(X_i)}{n} \quad\text{or}\quad \sum_{i=1}^n \frac{X_i Y_i}{n}.$$
Therefore, the CLT can also be used to derive sampling distributions for many statistics
which do not initially look at all like X̄ for a single random variable in an IID sample.
You may get to do this in future courses.
The larger the sample size n, the better the normal approximation provided by the CLT
is. In practice, we have various rules-of-thumb for what is ‘large enough’ for the
approximation to be ‘accurate enough’. This also depends on the population
distribution of $X_i$. For example, in the first case we simulate 10,000 independent random samples of sizes n = 1, 5, 10, 30, 100 and 1,000 from the Exp(0.25) distribution (for which µ = 4 and σ² = 16). This is clearly a skewed distribution, as shown by the histogram for n = 1 in Figure 6.5 (on the next page).
10,000 independent random samples of each size were generated. Histograms of the
values of X̄ in these random samples are shown in Figure 6.5. Each plot also shows
the pdf of the approximating normal distribution, N (4, 16/n). The normal
approximation is reasonably good already for n = 30, very good for n = 100, and
practically perfect for n = 1,000.
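A minimal R sketch (not from the guide) of the simulation behind Figure 6.5: for each sample size, 10,000 sample means from Exp(0.25) are simulated and compared with the CLT's normal approximation.

set.seed(1)
R <- 10000
for (n in c(1, 5, 10, 30, 100, 1000)) {
  xbar <- replicate(R, mean(rexp(n, rate = 0.25)))
  hist(xbar, freq = FALSE, main = paste("n =", n))
  curve(dnorm(x, mean = 4, sd = sqrt(16 / n)), add = TRUE)  # N(4, 16/n) approximation
}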
Example 6.6 In the second case, we simulate 10,000 independent random samples
of sizes:
n = 1, 10, 30, 50, 100 and 1,000
from the Bernoulli(0.2) distribution (for which µ = 0.2 and σ 2 = 0.16).
Here the distribution of Xi itself is not even continuous, and has only two possible
values, 0 and 1. Nevertheless, the sampling distribution of X̄ can be very
well-approximated by the normal distribution, when n is large enough.
Note that since here $X_i = 1$ or $X_i = 0$ for all i, $\bar{X} = \sum_{i=1}^n X_i/n = m/n$, where m is the number of observations for which $X_i = 1$. In other words, $\bar{X}$ is the sample proportion of the value X = 1.
The normal approximation is clearly very bad for small n, but reasonably good
already for n = 50, as shown by the histograms in Figure 6.6 (p.175).
Figure 6.5: Sampling distributions of X̄ for n = 1, 5, 10, 30, 100 and 1,000 when sampling from the Exp(0.25) distribution.
and:
$$\frac{S_X^2}{S_Y^2} \sim F_{n-1,\, m-1}.$$
Figure 6.6: Sampling distributions of X̄ for n = 1, 10, 30, 50, 100 and 1,000 when sampling from the Bernoulli(0.2) distribution.
Here ‘χ²’, ‘t’ and ‘F’ refer to three new families of probability distributions:
the χ² distribution
the t distribution
the F distribution.
These are not often used as distributions of individual variables. Instead, they are used
as sampling distributions for various statistics. Each of them arises from the normal
distribution in a particular way. We will now briefly introduce their main properties.
This is in preparation for statistical inference, where the uses of these distributions will
be discussed at length.
The χ2k distribution is a continuous distribution, which can take values of x ≥ 0. Its
mean and variance are:
E(X) = k
Var(X) = 2k.
where:
$$\Gamma(\alpha) = \int_0^\infty x^{\alpha-1} e^{-x}\, dx$$
is the gamma function, which is defined for all α > 0. (Note the formula of the pdf of
X ∼ χ2k is not examinable.)
The shape of the pdf depends on the degrees of freedom k, as illustrated in Figure 6.7
(on the next page). In most applications of the χ2 distribution the appropriate value of
k is known, in which case it does not need to be estimated from data.
If $X_1, X_2, \ldots, X_m$ are independent random variables and $X_i \sim \chi^2_{k_i}$, then their sum is also χ²-distributed, with the individual degrees of freedom added, such that:
$$\sum_{i=1}^m X_i \sim \chi^2_{k_1 + k_2 + \cdots + k_m}.$$
The uses of the χ2 distribution will be discussed later. One example though is if
{X1 , X2 , . . . , Xn } is a random sample from the population N (µ, σ 2 ), and S 2 is the
sample variance, then:
$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}.$$
This result is used to derive basic tools of statistical inference for both µ and σ 2 for the
normal distribution.
[Figure 6.7: pdfs of the χ²_k distribution for k = 1, 2, 4, 6 (left panel) and k = 10, 20, 30, 40 (right panel).]
In exercises and the examination, you will need a table of some probabilities for the χ2
distribution. Table 8 of the New Cambridge Statistical Tables shows the following
information.
The rows correspond to different degrees of freedom k (denoted in the table by ν).
The table shows values of k up to 100.
The numbers in the table are values of x such that P (X > x) = α for the k and α
in that row and column.
Example 6.7 Consider two numbers in the ‘ν = 5’ row, the 9.236 in the ‘α = 0.10 (P = 10)’ column and the 11.07 in the ‘α = 0.05 (P = 5)’ column. These mean that for $X \sim \chi^2_5$ we have P(X > 9.236) = 0.10 and P(X > 11.07) = 0.05.
These also provide bounds for probabilities of other values. For example, since 10.0
is between 9.236 and 11.07, we can conclude that:
0.05 < P (X > 10.0) < 0.10.
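A minimal R sketch (not from the guide) checking these table values with qchisq() and pchisq() instead of Table 8:

qchisq(0.90, df = 5)      # 9.236, i.e. P(X > 9.236) = 0.10 for X ~ chi-squared with 5 df
qchisq(0.95, df = 5)      # 11.07, i.e. P(X > 11.07) = 0.05
1 - pchisq(10.0, df = 5)  # P(X > 10.0), which indeed lies between 0.05 and 0.10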
The ways in which this table may be used in statistical inference will be explained in
later chapters.
Suppose $Z \sim N(0, 1)$, $X \sim \chi^2_k$, and Z and X are independent. The distribution of the random variable:
$$T = \frac{Z}{\sqrt{X/k}}$$
is the t distribution with k degrees of freedom. This is denoted T ∼ tk or
T ∼ t(k). The distribution is also known as ‘Student’s t distribution’.
for all −∞ < x < ∞. Examples of f (x) for different k are shown in Figure 6.8. (Note
the formula of the pdf of tk is not examinable.)
[Figure 6.8: pdfs of the t_k distribution for k = 1, 3, 8 and 20, together with the N(0, 1) pdf.]
For any finite value of k, the tk distribution has heavier tails than the standard
normal distribution, i.e. tk places more probability on values far from 0 than
N (0, 1) does.
$\mathrm{E}(T) = 0$ for $k > 1$, and:
$$\text{Var}(T) = \frac{k}{k-2} \quad\text{for } k > 2.$$
This means that for t1 neither E(T ) nor Var(T ) exist, and for t2 , Var(T ) does not exist.
In exercises and the examination, you will need a table of some probabilities for the t
distribution. Table 10 of the New Cambridge Statistical Tables shows the following
information.
The rows correspond to different degrees of freedom k (denoted in the table by ν).
The table shows values of k up to 120, and then ‘∞’, which is N (0, 1).
If you need a tk distribution for which k is not in the table, use the nearest value or
use interpolation.
The numbers in the table are values of t such that P (T > t) = α for the k and α in
that row and column.
Example 6.8 Consider the number 2.132 in the ‘ν = 4’ row, and the ‘α = 0.05 (P = 5)’ column. This means that for $T \sim t_4$ we have P(T > 2.132) = 0.05.
The table also provides bounds for other probabilities. For example, the number in
the ‘α = 0.025 (P = 2.5)’ column is 2.776, so P (T > 2.776) = 0.025. Since
2.132 < 2.5 < 2.776, we know that 0.025 < P (T > 2.5) < 0.05.
Results for left-tail probabilities P (T < t) = α can also be obtained, because the t
distribution is symmetric around 0. This means that P (T < t) = P (T > −t). For
example:
P (T < −2.132) = P (T > 2.132) = 0.05
and P (T < −2.5) < 0.05 since P (T > 2.5) < 0.05.
This is the same trick we used for the standard normal distribution.
Let U and V be two independent random variables, where U ∼ χ2p and V ∼ χ2k .
The distribution of:
$$F = \frac{U/p}{V/k}$$
is the F distribution with degrees of freedom (p, k), denoted F ∼ Fp, k or
F ∼ F (p, k).
[Figure: pdfs of the F distribution with degrees of freedom (10, 3), (10, 10) and (10, 50).]
For F ∼ Fp, k , E(F ) = k/(k − 2), for k > 2. If F ∼ Fp, k , then 1/F ∼ Fk, p . If T ∼ tk ,
then T 2 ∼ F1, k .
Tables of F distributions will be needed for some purposes. They will be available in the
examination. We will postpone practice with them until later in the course.
These questions are difficult to study in a laboratory, and admit no self-evident axioms.
Statistics provides a way of answering these types of questions using data.
What should we learn in ‘Statistics’ ? The basic ideas, methods and theory. Some
guidelines for learning/applying statistics are the following.
Understand what data say in each specific context. All the methods are just tools
to help us to understand data.
Concentrate on what to do and why, rather than on concrete calculations and
graphing.
It may take a while to grasp the basic idea of statistics – keep thinking!
Example 6.9 A new type of tyre was designed to increase its lifetime. The
manufacturer tested 120 new tyres and obtained the average lifetime (over these 120
tyres) of 35,391 miles. So the manufacturer claims that the mean lifetime of new
tyres is 35,391 miles.
Example 6.10 A newspaper sampled 1,000 potential voters, and 350 of them were
Labour Party supporters. It claims that the proportion of Labour voters in the
whole country is 350/1,000 = 0.35, i.e. 35%.
In both cases, a conclusion about a population (i.e. all the objects concerned) is drawn
based on the information from a sample (i.e. a subset of the population).
In Example 6.9, it is impossible to measure the whole population. In Example 6.10, it is
not economical to measure the whole population. Therefore, errors are inevitable!
The population is the entire set of objects concerned, and these objects are typically
represented by some numbers. We do not know the entire population in practice.
In Example 6.9, the population consists of the lifetimes of all tyres, including those to
be produced in the future. For the opinion poll in Example 6.10, the population consists
of many ‘1’s and ‘0’s, where each ‘1’ represents a voter for the Labour party, and each
‘0’ represents a voter for other parties.
A sample is a (randomly) selected subset of a population, and is known in practice. The
population is unknown. We represent a population by a probability distribution.
Why do we need a model for the entire population?
Because the questions we ask concern the entire population, not just the data we
have. Having a model for the population tells us that the remaining population is
not much different from our data or, in other words, that the data are
representative of the population.
Because the process of drawing a sample from a population is a bit like the process
of generating random variables. A different sample would produce different values.
Therefore, the population from which we draw a random sample is represented as a
probability distribution.
Example 6.11 Continuing with Example 6.9, the population may be assumed to
be N (µ, σ 2 ) with θ = (µ, σ 2 ), where µ is the ‘true’ lifetime.
Let:
X = the lifetime of a tyre
then we can write X ∼ N (µ, σ 2 ).
$$P(X = 1) = P(\text{a Labour voter}) = \pi$$
and:
$$P(X = 0) = P(\text{a non-Labour voter}) = 1 - \pi$$
where π is the (unknown) proportion of Labour supporters in the population.
Example 6.13 For the tyre lifetime in Example 6.9, suppose the realised sample (of size n = 120) gives the sample mean:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i = 35{,}391.$$
Is the sample mean X̄ a good estimator of the unknown ‘true’ lifetime µ? Obviously,
we cannot use the real number 35,391 to assess how good this estimator is, as a different
sample may give a different average value, such as 36,721.
By treating {X1 , X2 , . . . , Xn } as random variables, X̄ is also a random variable. If the
distribution of X̄ concentrates closely around (unknown) µ, X̄ is a good estimator of µ.
Definition of a statistic
Any known function of a random sample is called a statistic. Statistics are used for
statistical inference such as estimation and testing.
$$P(X = x) = \frac{20!}{x!\,(20-x)!}\,\pi^x (1-\pi)^{20-x} \quad\text{for } x = 0, 1, 2, \ldots, 20$$
and 0 otherwise.
what is P (X < 10) (the proportion of students attending fewer than half of the
lectures)?
2. Suppose Xi ∼ N (0, 9), for i = 1, 2, 3, 4. Assume all these random variables are
independent. Derive the value of k in each of the following.
(c) $P\left(X_1 > (k(X_2^2 + X_3^2))^{1/2}\right) = 0.10.$
Therefore, by independence:
$$\bar{X} - \bar{Y} \sim N(2, 10)$$
so:
$$P(\bar{X} - \bar{Y} > 0) = P\left(Z > \frac{-2}{\sqrt{10}}\right) = P(Z > -0.63) = 0.7357.$$
Hence:
$$P(X_1 + 6X_2 < k) = P\left(Z < \frac{k}{\sqrt{333}}\right) = 0.3974.$$
Since, from tables, $\Phi(-0.26) = 0.3974$, we have:
$$\frac{k}{\sqrt{333}} = -0.26 \quad\Rightarrow\quad k = -4.7446.$$
(b) $X_i/\sqrt{9} \sim N(0, 1)$, and so $X_i^2/9 \sim \chi^2_1$. Hence $\sum_{i=1}^4 X_i^2/9 \sim \chi^2_4$. Therefore, from tables:
$$P\left(\sum_{i=1}^4 X_i^2 < k\right) = P\left(X < \frac{k}{9}\right) = 0.90 \quad\Rightarrow\quad \frac{k}{9} = 7.779 \quad\Rightarrow\quad k = 70.011$$
where $X \sim \chi^2_4$.
(c) We have:
$$P\left(X_1 > (k(X_2^2 + X_3^2))^{1/2}\right) = P\left(\frac{X_1/\sqrt{9}}{\sqrt{(X_2^2 + X_3^2)/(9 \times 2)}} > \sqrt{2}\times\sqrt{k}\right) = P(T > \sqrt{2}\times\sqrt{k}) = 0.10$$
where $T \sim t_2$. From tables, $\sqrt{2}\times\sqrt{k} = 1.886$, hence $k = 1.7785$.
Chapter 7
Point estimation
find estimators using the method of moments, least squares and maximum
likelihood.
7.3 Introduction
The basic setting is that we assume a random sample {X1 , X2 , . . . , Xn } is observed from
a population F (x; θ). The goal is to make inference (i.e. estimation or testing) for the
unknown parameter(s) θ.
We call $\hat\mu = \bar{X}$ a point estimator (or simply an estimator) of µ.
For example, if we have an observed sample of 9, 16, 15, 4 and 12, hence of size n = 5, the sample mean is:
$$\hat\mu = \frac{9 + 16 + 15 + 4 + 12}{5} = 11.2.$$
The value 11.2 is a point estimate of µ. For an observed sample of 15, 16, 10, 8 and 9, we obtain $\hat\mu = 11.6$.
Bias of an estimator
An estimator $\hat\theta$ of θ is unbiased if $\mathrm{E}(\hat\theta) - \theta = 0$.
Variance of an estimator
$$\text{Var}(\bar{X}) = \frac{\sigma^2}{n}. \tag{7.2}$$
It is clear that in (7.2) increasing the sample size n decreases the estimator’s variance
(and hence the standard error, i.e. the square root of the estimator’s variance), therefore
increasing the precision of the estimator.2 We conclude that variance is also a ‘bad’
thing so, other things being equal, the smaller an estimator’s variance the better.
Estimator properties
Is $\hat\mu = \bar{X}$ a ‘good’ estimator of µ?
Intuitively, $X_1$ or $(X_1 + X_2 + X_3)/3$ would not be good enough as estimators of µ. However, can we use other estimators such as the sample median:
$$\hat\mu_1 = \begin{cases} X_{((n+1)/2)} & \text{for odd } n \\ (X_{(n/2)} + X_{(n/2+1)})/2 & \text{for even } n \end{cases}$$
2
Remember, however, that this increased precision comes at a cost – namely the increased expenditure
on data collection.
θ is unknown
the value of θb changes with the observed sample.
Intuitively, MAD is a more appropriate measure for the error in estimation. However, it
is technically less convenient since the function h(x) = |x| is not differentiable at x = 0.
Therefore, the MSE is used more often.
If $\mathrm{E}(\hat\theta^2) < \infty$, it holds that:
$$\text{MSE}(\hat\theta) = \text{Var}(\hat\theta) + \left(\text{Bias}(\hat\theta)\right)^2$$
where $\text{Bias}(\hat\theta) = \mathrm{E}(\hat\theta) - \theta$.
Proof:
$$\begin{aligned}
\text{MSE}(\hat\theta) &= \mathrm{E}\left((\hat\theta - \theta)^2\right) \\
&= \mathrm{E}\left(\left((\hat\theta - \mathrm{E}(\hat\theta)) + (\mathrm{E}(\hat\theta) - \theta)\right)^2\right) \\
&= \mathrm{E}\left((\hat\theta - \mathrm{E}(\hat\theta))^2\right) + \mathrm{E}\left((\mathrm{E}(\hat\theta) - \theta)^2\right) + 2\,\mathrm{E}\left((\hat\theta - \mathrm{E}(\hat\theta))(\mathrm{E}(\hat\theta) - \theta)\right) \\
&= \text{Var}(\hat\theta) + \left(\text{Bias}(\hat\theta)\right)^2 + 2\,(\mathrm{E}(\hat\theta) - \mathrm{E}(\hat\theta))(\mathrm{E}(\hat\theta) - \theta) \\
&= \text{Var}(\hat\theta) + \left(\text{Bias}(\hat\theta)\right)^2 + 0.
\end{aligned}$$
We have already established that both bias and variance of an estimator are ‘bad’
things, so the MSE (being the sum of a bad thing and a bad thing squared) can also be
viewed as a ‘bad’ thing.3 Hence when faced with several competing estimators, we
prefer the estimator with the smallest MSE.
So, although an unbiased estimator is intuitively appealing, it is perfectly possible that
a biased estimator might be preferred if the ‘cost’ of the bias is offset by a substantial
reduction in variance. Hence the MSE provides us with a formal criterion to assess the
trade-off between the bias and variance of different estimators of the same parameter.
$$\mathrm{E}(T_1) = \mathrm{E}(\bar{X}) = \mu \quad\text{and}\quad \text{Var}(T_1) = \text{Var}(\bar{X}) = \frac{\sigma^2}{n}.$$
Hence T1 is an unbiased estimator of µ. So the MSE of T1 is just the variance of T1 ,
since the bias is 0. Therefore, MSE(T1 ) = σ 2 /n.
Moving to $T_2$, note:
$$\mathrm{E}(T_2) = \mathrm{E}\left(\frac{X_1 + X_n}{2}\right) = \frac{\mathrm{E}(X_1) + \mathrm{E}(X_n)}{2} = \frac{\mu + \mu}{2} = \mu$$
and:
$$\text{Var}(T_2) = \frac{\text{Var}(X_1) + \text{Var}(X_n)}{2^2} = \frac{2\sigma^2}{4} = \frac{\sigma^2}{2}.$$
So T2 is also an unbiased estimator of µ, hence MSE(T2 ) = σ 2 /2.
Finally, consider $T_3$, noting $\mathrm{E}(T_3) = \mathrm{E}(\bar{X} + 3) = \mu + 3$ and:
$$\text{Var}(T_3) = \text{Var}(\bar{X} + 3) = \text{Var}(\bar{X}) = \frac{\sigma^2}{n}.$$
So $T_3$ is a positively-biased estimator of µ, with a bias of 3. Hence we have $\text{MSE}(T_3) = \sigma^2/n + 3^2 = \sigma^2/n + 9$.
We seek the estimator with the smallest MSE. Clearly, MSE(T₁) < MSE(T₃) so we can eliminate T₃. Now comparing T₁ with T₂, we note that $\text{MSE}(T_1) = \sigma^2/n \le \sigma^2/2 = \text{MSE}(T_2)$ for $n \ge 2$, so T₁ is the preferred estimator.
3
Or, for that matter, a ‘very bad’ thing!
i. $\hat\mu = \bar{X}$ is a better estimator of µ than $X_1$ as:
$$\text{MSE}(\hat\mu) = \frac{\sigma^2}{n} < \text{MSE}(X_1) = \sigma^2.$$
ii. As n → ∞, MSE(X̄) → 0, i.e. when the sample size tends to infinity, the error in
estimation goes to 0. Such an estimator is called a (mean-square) consistent
estimator.
Consistency is a reasonable requirement. It may be used to rule out some silly
estimators.
For µ̃ = (X1 + X4 )/2, MSE(µ̃) = σ 2 /2 which does not converge to 0 as n → ∞.
This is due to the fact that only a small portion of information (i.e. X1 and X4 )
is used in the estimation.
iii. For any random sample {X1 , X2 , . . . , Xn } from a population with mean µ and
variance σ 2 , it holds that E(X̄) = µ and Var(X̄) = σ 2 /n. The derivation of the
expected value and variance of the sample mean was covered in Chapter 6.
iv. For any independent random variables $Y_1, Y_2, \ldots, Y_k$ and constants $a_1, a_2, \ldots, a_k$:
$$\mathrm{E}\left(\sum_{i=1}^k a_i Y_i\right) = \sum_{i=1}^k a_i\,\mathrm{E}(Y_i) \quad\text{and}\quad \text{Var}\left(\sum_{i=1}^k a_i Y_i\right) = \sum_{i=1}^k a_i^2\,\text{Var}(Y_i).$$
Example 7.5 Bias by itself cannot be used to measure the quality of an estimator. Consider two artificial estimators of θ, $\hat\theta_1$ and $\hat\theta_2$, such that $\hat\theta_1$ takes only the two values θ − 100 and θ + 100, and $\hat\theta_2$ takes only the two values θ and θ + 0.2, with the following probabilities:
$$P(\hat\theta_1 = \theta - 100) = P(\hat\theta_1 = \theta + 100) = 0.5$$
and:
$$P(\hat\theta_2 = \theta) = P(\hat\theta_2 = \theta + 0.2) = 0.5.$$
$$\text{MSE}(\hat\theta_1) = \mathrm{E}((\hat\theta_1 - \theta)^2) = (-100)^2 \times 0.5 + (100)^2 \times 0.5 = 10{,}000$$
and:
$$\text{MSE}(\hat\theta_2) = \mathrm{E}((\hat\theta_2 - \theta)^2) = 0^2 \times 0.5 + (0.2)^2 \times 0.5 = 0.02.$$
Hence $\hat\theta_2$ is a much better (i.e. more accurate) estimator of θ than $\hat\theta_1$.
Especially:
$$\left(\sum_{i=1}^k a_i\right)^2 = \sum_{i=1}^k a_i^2 + \sum_{1 \le i \ne j \le k} a_i a_j.$$
Hence:
$$\begin{aligned}
\text{Var}(\hat\mu) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right) &= \mathrm{E}\left(\left(\frac{1}{n}\sum_{i=1}^n X_i - \mu\right)^2\right) \\
&= \mathrm{E}\left(\left(\frac{1}{n}\sum_{i=1}^n (X_i - \mu)\right)^2\right) \\
&= \frac{1}{n^2}\left(\sum_{i=1}^n \mathrm{E}((X_i - \mu)^2) + \sum_{1\le i \ne j \le n} \mathrm{E}\left((X_i - \mu)(X_j - \mu)\right)\right) \\
&= \frac{1}{n^2}\left(n\sigma^2 + \sum_{1\le i\ne j\le n} \mathrm{E}(X_i - \mu)\,\mathrm{E}(X_j - \mu)\right) = \frac{\sigma^2}{n}.
\end{aligned}$$
Hence $\text{MSE}(\hat\mu) = \text{MSE}(\bar{X}) = \sigma^2/n$.
Finding estimators
Let {X1 , X2 , . . . , Xn } be a random sample from a population F (x; θ). Suppose θ has
p components (for example, for a normal population N (µ, σ 2 ), p = 2; for a Poisson
population with parameter λ, p = 1).
Let:
µk = µk (θ) = E(X k )
denote the kth population moment, for k = 1, 2, . . .. Therefore, µk depends on the
unknown parameter θ, as everything else about the distribution F (x; θ) is known.
Denote the kth sample moment by:
$$M_k = \frac{1}{n}\sum_{i=1}^n X_i^k = \frac{X_1^k + X_2^k + \cdots + X_n^k}{n}.$$
The method of moments estimator (MME) of θ is obtained by solving:
$$\mu_k(\hat\theta) = M_k \quad\text{for } k = 1, 2, \ldots, p.$$
This gives us $\hat\mu = M_1 = \bar{X}$.
Since $\sigma^2 = \mu_2 - \mu_1^2 = \mathrm{E}(X^2) - (\mathrm{E}(X))^2$, we have:
$$\hat\sigma^2 = M_2 - M_1^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2.$$
Note we have:
$$\begin{aligned}
\mathrm{E}(\hat\sigma^2) &= \mathrm{E}\left(\frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}^2\right) \\
&= \frac{1}{n}\sum_{i=1}^n \mathrm{E}(X_i^2) - \mathrm{E}(\bar{X}^2) \\
&= \mathrm{E}(X^2) - \mathrm{E}(\bar{X}^2) \\
&= \sigma^2 + \mu^2 - \left(\frac{\sigma^2}{n} + \mu^2\right) \\
&= \frac{(n-1)\sigma^2}{n}.
\end{aligned}$$
Since:
$$\mathrm{E}(\hat\sigma^2) - \sigma^2 = -\frac{\sigma^2}{n} < 0$$
$\hat\sigma^2$ is a negatively-biased estimator of $\sigma^2$.
The sample variance, defined as:
$$S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$$
is an unbiased estimator of $\sigma^2$.
Note the MME does not use any information on F (x; θ) beyond the moments.
The idea is that $M_k$ should be pretty close to $\mu_k$ when n is sufficiently large. In fact:
$$M_k = \frac{1}{n}\sum_{i=1}^n X_i^k$$
converges to:
$$\mu_k = \mathrm{E}(X^k)$$
as n → ∞. This is due to the law of large numbers (LLN). We illustrate this
phenomenon by simulation using R.
Example 7.8 For N (2, 4), we have µ1 = 2 and µ2 = 8. We use the sample moments
M1 and M2 as estimators of µ1 and µ2 , respectively. Note how the sample moments
converge to the population moments as the sample size increases.
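A minimal R sketch (not from the guide) of the simulation described in Example 7.8: the sample moments from N(2, 4) approach µ₁ = 2 and µ₂ = 8 as the sample size grows.

set.seed(1)
for (n in c(10, 100, 1000, 100000)) {
  x <- rnorm(n, mean = 2, sd = 2)
  cat("n =", n, " M1 =", round(mean(x), 3), " M2 =", round(mean(x^2), 3), "\n")
}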
7.6. Least squares (LS) estimation
The MME of µ is the sample mean $\bar{X} = \sum_{i=1}^n X_i/n$.
The estimator $\bar{X}$ is also the least squares estimator (LSE) of µ, defined as the value of a minimising $\sum_{i=1}^n (X_i - a)^2$, i.e.
$$\hat\mu = \bar{X} = \arg\min_a \sum_{i=1}^n (X_i - a)^2.$$
Proof: Given that $S = \sum_{i=1}^n (X_i - a)^2 = \sum_{i=1}^n (X_i - \bar{X})^2 + n(\bar{X} - a)^2$, where all terms are non-negative, the value of a for which S is minimised is when $n(\bar{X} - a)^2 = 0$, i.e. $a = \bar{X}$.
Estimator accuracy
$$\text{MSE}(\hat\mu) = \mathrm{E}((\hat\mu - \mu)^2) = \frac{\sigma^2}{n}.$$
In order to determine the distribution of $\hat\mu$ we require knowledge of the underlying distribution. Even if the relevant knowledge is available, one may only compute the exact distribution of $\hat\mu$ explicitly for a limited number of cases.
By the central limit theorem, as $n \to \infty$, we have:
$$P\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z\right) \to \Phi(z)$$
for any z, where $\Phi(z)$ is the cdf of N(0, 1), i.e. when n is large, $\bar{X} \sim N(\mu, \sigma^2/n)$ approximately.
Hence when n is large:
$$P\left(|\bar{X} - \mu| \le 1.96 \times \frac{\sigma}{\sqrt{n}}\right) \approx 0.95.$$
To be on the safe side, the coefficient 1.96 is often replaced by 2. The estimated standard error of $\bar{X}$ is:
$$\text{E.S.E.}(\bar{X}) = \frac{S}{\sqrt{n}} = \left(\frac{1}{n(n-1)}\sum_{i=1}^n (X_i - \bar{X})^2\right)^{1/2}.$$
Example 7.10 Suppose we toss a coin 10 times, and record the number of ‘heads’ as a random variable X. Therefore:
$$X \sim \text{Bin}(10, \pi).$$
Suppose we observe x = 8 heads, suggesting the estimate 8/10 = 0.8 for π. Of course, we cannot claim that the true π equals 0.8. Nevertheless, π = 0.8 is the most likely, or ‘maximally’ likely, value of the parameter. Why do we think ‘π = 0.8’ is most likely?
Let:
$$L(\pi) = P(X = 8) = \frac{10!}{8!\,2!}\,\pi^8 (1 - \pi)^2.$$
Since x = 8 is the event which occurred in the experiment, we would expect this probability to be relatively large. Figure 7.1 (on the next page) shows a plot of L(π) as a function of π.
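A minimal R sketch (not from the guide) reproducing a plot like Figure 7.1 and locating the maximising value of π numerically:

L <- function(p) dbinom(8, size = 10, prob = p)            # L(pi) = P(X = 8)
curve(L, from = 0, to = 1, xlab = expression(pi), ylab = "likelihood")
optimize(L, interval = c(0, 1), maximum = TRUE)$maximum    # approximately 0.8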
The most likely value of π should make this probability as large as possible. This value is taken as the maximum likelihood estimate of π.
Maximising L(π) is equivalent to maximising the log-likelihood $l(\pi) = \ln L(\pi) = c + 8\ln\pi + 2\ln(1-\pi)$, where c is a constant; setting $dl(\pi)/d\pi = 8/\pi - 2/(1-\pi) = 0$ gives $\hat\pi = 0.8$.
The likelihood function reflects the information about the unknown parameter θ in
the data {X1 , X2 , . . . , Xn }.
iii. It is often more convenient to use the log-likelihood function⁴, denoted as:
$$l(\theta) = \ln L(\theta) = \sum_{i=1}^n \ln f(X_i; \theta).$$
The MLE maximises the log-likelihood: $\hat\theta = \arg\max_\theta\, l(\theta)$.
iv. For a smooth likelihood function, the MLE is often the solution of the equation:
$$\frac{d}{d\theta}\, l(\theta) = 0.$$
vi. Unlike the MME or LSE, the MLE uses all the information about the population
distribution. It is often more efficient (i.e. more accurate) than the MME or LSE.
4
Throughout where ‘log’ is used in log-likelihood functions, it will be assumed to be the logarithm to
the base e, i.e. the natural logarithm.
The log-likelihood function is $l(\lambda) = 2n\ln\lambda - n\lambda\bar{X} + c$, where $c = \ln\prod_{i=1}^n X_i$ is a constant.
Setting:
$$\frac{d}{d\lambda}\, l(\lambda) = \frac{2n}{\lambda} - n\bar{X} = 0$$
we obtain $\hat\lambda = 2/\bar{X}$.
Note the MLE $\hat\lambda$ may be obtained from maximising L(λ) directly. However, it is much easier to work with l(λ) instead.
Case I: σ² is known.
The likelihood function is:
$$L(\mu) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2\right) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \bar{X})^2\right)\exp\left(-\frac{n}{2\sigma^2}(\bar{X} - \mu)^2\right).$$
It follows from the lemma below that $\hat\sigma^2 = \sum_{i=1}^n (X_i - \bar{X})^2/n$.
7.8. Asymptotic distribution of MLEs
Let $\hat\theta = \hat\theta(X_1, X_2, \ldots, X_n)$ be the MLE of θ. Under some regularity conditions, the distribution of $\sqrt{n}(\hat\theta - \theta)$ converges to $N(0, 1/I(\theta))$ as $n \to \infty$, where $I(\theta)$ is the Fisher information defined as:
$$I(\theta) = -\int_{-\infty}^{\infty} \frac{\partial^2 \ln f(x;\theta)}{\partial\theta^2}\, f(x;\theta)\, dx.$$
ii. For a discrete distribution with probability function $p(x;\theta)$:
$$I(\theta) = -\sum_x \frac{\partial^2 \ln p(x;\theta)}{\partial\theta^2}\, p(x;\theta).$$
You will use this asymptotic distribution result if you study ST2134 Advanced
statistics: statistical inference.
Therefore:
$$\ln f(x;\mu) = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}(x - \mu)^2.$$
Hence:
$$\frac{d\ln f(x;\mu)}{d\mu} = \frac{x - \mu}{\sigma^2} \quad\text{and}\quad \frac{d^2\ln f(x;\mu)}{d\mu^2} = -\frac{1}{\sigma^2}.$$
Therefore:
$$I(\mu) = -\int_{-\infty}^{\infty}\left(-\frac{1}{\sigma^2}\right) f(x;\mu)\, dx = \frac{1}{\sigma^2}.$$
The MLE of µ is $\bar{X}$, and hence $\bar{X} \sim N(\mu, \sigma^2/n)$.
Example 7.15 For the Poisson distribution, $p(x;\lambda) = \lambda^x e^{-\lambda}/x!$. Therefore:
$$\ln p(x;\lambda) = x\ln\lambda - \lambda - \ln(x!).$$
Hence:
$$\frac{d\ln p(x;\lambda)}{d\lambda} = \frac{x}{\lambda} - 1 \quad\text{and}\quad \frac{d^2\ln p(x;\lambda)}{d\lambda^2} = -\frac{x}{\lambda^2}.$$
Therefore:
$$I(\lambda) = \frac{1}{\lambda^2}\sum_{x=0}^{\infty} x\, p(x;\lambda) = \frac{1}{\lambda^2}\,\mathrm{E}(X) = \frac{1}{\lambda}.$$
7.11. Sample examination questions
(b) Is the estimator of θ derived in part (a) biased or unbiased? Justify your
answer.
(c) Determine the variance of the estimator derived in part (a) and check whether
it is a consistent estimator of θ.
Use this sample to calculate the method of moments estimate of θ using the
estimator derived in part (a), and sketch the above probability density
function based on this estimate.
(a) Derive the maximum likelihood estimator of θ. (You do not need to verify the
solution is a maximum.)
(b) Show that the estimator derived in part (a) is mean square consistent for θ.
Hint: You may use the fact that E(X) = αθ and Var(X) = αθ2 .
A sketch of $f(x; \hat\theta)$ is: [figure not reproduced here]
7.12. Solutions to Sample examination questions
2. (a) For α > 0 known, due to independence the likelihood function is:
$$L(\theta) = \prod_{i=1}^n f(x_i; \alpha, \theta) = \frac{1}{((\alpha-1)!)^n\,\theta^{n\alpha}}\left(\prod_{i=1}^n x_i\right)^{\alpha-1}\exp\left(-\frac{1}{\theta}\sum_{i=1}^n x_i\right)$$
such that:
$$\frac{d}{d\theta}\, l(\theta) = -\frac{n\alpha}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^n x_i.$$
Setting this equal to zero gives $\hat\theta = \bar{X}/\alpha$.
Since $\hat\theta$ is unbiased and noting that $\text{Var}(\hat\theta) \to 0$ as $n \to \infty$, then $\hat\theta$ is a consistent estimator of θ.
The group was alarmed to find that if you are a labourer, cleaner or dock
worker, you are twice as likely to die than a member of the professional classes.
(The Sunday Times, 31 August 1980)
Chapter 8
Interval estimation
explain the link between confidence intervals and distribution theory, and critique
the assumptions made to justify the use of various confidence intervals.
8.3 Introduction
Point estimation is simple but not informative enough, since a point estimator is
always subject to errors. A more scientific approach is to find an upper bound
U = U (X1 , X2 , . . . , Xn ) and a lower bound L = L(X1 , X2 , . . . , Xn ), and hope that the
unknown parameter θ lies between the two bounds L and U (life is not always as simple
as that, but it is a good start).
An intuitive guess for estimating the population mean would be an interval of the form $\bar{X} \pm k \times \text{S.E.}(\bar{X})$, where k > 0 is a constant and S.E.(X̄) is the standard error of the sample mean.
The (random) interval (L, U ) forms an interval estimator of θ. For estimation to be
as precise as possible, intuitively the width of the interval, U − L, should be small.
What is P (1.27 < µ < 3.23) = 0.95 in Example 8.1? Well, this probability does not
mean anything, since µ is an unknown constant!
We treat (1.27, 3.23) as one realisation of the random interval (X̄ − 0.98, X̄ + 0.98)
which covers µ with probability 0.95.
What is the meaning of ‘with probability 0.95’ ? If one repeats the interval estimation a
large number of times, about 95% of the time the interval estimator covers the true µ.
Some remarks are the following.
i. The confidence level is often specified as 90%, 95% or 99%. Obviously the higher
the confidence level, the wider the interval.
For the normal distribution example:
$$0.90 = P\left(\frac{\sqrt{n}\,|\bar{X} - \mu|}{\sigma} \le 1.645\right) = P\left(\bar{X} - 1.645 \times \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.645 \times \frac{\sigma}{\sqrt{n}}\right)$$
$$0.95 = P\left(\frac{\sqrt{n}\,|\bar{X} - \mu|}{\sigma} \le 1.96\right) = P\left(\bar{X} - 1.96 \times \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96 \times \frac{\sigma}{\sqrt{n}}\right)$$
$$0.99 = P\left(\frac{\sqrt{n}\,|\bar{X} - \mu|}{\sigma} \le 2.576\right) = P\left(\bar{X} - 2.576 \times \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 2.576 \times \frac{\sigma}{\sqrt{n}}\right).$$
The widths of the three intervals are $2 \times 1.645 \times \sigma/\sqrt{n}$, $2 \times 1.96 \times \sigma/\sqrt{n}$ and $2 \times 2.576 \times \sigma/\sqrt{n}$, corresponding to the confidence levels of 90%, 95% and 99%, respectively.
To achieve a 100% confidence level in the normal example, the width of the interval
would have to be infinite!
ii. Among all the confidence intervals at the same confidence level, the one with the
smallest width gives the most accurate estimation and is, therefore, optimal.
iii. For a distribution with a symmetric unimodal density function, optimal confidence
intervals are symmetric, as depicted in Figure 8.1 (on the next page).
In practice the standard deviation σ is typically unknown, and we replace it with the sample standard deviation:
$$S = \left(\frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2\right)^{1/2}$$
Figure 8.1: Symmetric unimodal density function showing that a given probability is
represented by the narrowest interval when symmetric about the mean.
where k is a constant determined by the confidence level and also by the distribution of
the statistic:
$$\frac{\bar{X} - \mu}{S/\sqrt{n}}. \tag{8.1}$$
However, the distribution of (8.1) is no longer normal – it is the Student’s t distribution.
An accurate 100(1 − α)% confidence interval for µ, where α ∈ (0, 1), is:
$$\left(\bar{X} - c \times \frac{S}{\sqrt{n}},\ \bar{X} + c \times \frac{S}{\sqrt{n}}\right) = \left(\bar{X} - c \times \text{E.S.E.}(\bar{X}),\ \bar{X} + c \times \text{E.S.E.}(\bar{X})\right)$$
where E.S.E.(X̄) denotes the estimated standard error of the sample mean.
8.5. Approximate confidence intervals
Example 8.2 The salary data of 253 graduates from a UK business school (in thousands of pounds) yield the following: n = 253, $\bar{x} = 47.126$, s = 6.843 and so $s/\sqrt{n} = 0.43$.
A point estimate of the average salary µ is $\bar{x} = 47.126$.
An approximate 95% confidence interval for µ is $47.126 \pm 1.96 \times 0.43 = (46.28,\ 47.97)$, i.e. roughly £46,280 to £47,970.
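A minimal R sketch (not from the guide) computing this interval from the summary statistics, using both the normal and the t critical values:

n <- 253; xbar <- 47.126; s <- 6.843
se <- s / sqrt(n)                              # approximately 0.43
xbar + c(-1, 1) * qnorm(0.975) * se            # approximate 95% CI
xbar + c(-1, 1) * qt(0.975, df = n - 1) * se   # accurate 95% CI (t distribution)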
You will use these MLE-based confidence intervals if you study ST2134 Advanced
statistics: statistical inference.
Hence:
$$\frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \mu)^2 \sim \chi^2_n.$$
Note that:
$$\frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \mu)^2 = \frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \bar{X})^2 + \frac{n(\bar{X} - \mu)^2}{\sigma^2}. \tag{8.2}$$
Proof: We have:
$$\begin{aligned}
\sum_{i=1}^n (X_i - \mu)^2 &= \sum_{i=1}^n \left((X_i - \bar{X}) + (\bar{X} - \mu)\right)^2 \\
&= \sum_{i=1}^n (X_i - \bar{X})^2 + \sum_{i=1}^n (\bar{X} - \mu)^2 + 2\sum_{i=1}^n (X_i - \bar{X})(\bar{X} - \mu) \\
&= \sum_{i=1}^n (X_i - \bar{X})^2 + n(\bar{X} - \mu)^2 + 2(\bar{X} - \mu)\sum_{i=1}^n (X_i - \bar{X}) \\
&= \sum_{i=1}^n (X_i - \bar{X})^2 + n(\bar{X} - \mu)^2.
\end{aligned}$$
Hence:
$$\frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \mu)^2 = \frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \bar{X})^2 + \frac{n(\bar{X} - \mu)^2}{\sigma^2}.$$
Since $\bar{X} \sim N(\mu, \sigma^2/n)$, then $n(\bar{X} - \mu)^2/\sigma^2 \sim \chi^2_1$. It can be proved that:
$$\frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \bar{X})^2 \sim \chi^2_{n-1}.$$
For any given small α ∈ (0, 1), we can find $0 < k_1 < k_2$ such that:
$$P(X < k_1) = P(X > k_2) = \frac{\alpha}{2}$$
Example 8.3 Suppose n = 15 and the sample variance is s² = 24.5. Let α = 0.05. From Table 8 of the New Cambridge Statistical Tables, we find $k_1 = 5.629$ and $k_2 = 26.12$, where $X \sim \chi^2_{14}$.
Hence a 95% confidence interval for σ² is:
$$\left(\frac{14 \times S^2}{26.12},\ \frac{14 \times S^2}{5.629}\right) = (0.536 \times S^2,\ 2.487 \times S^2) = (13.132,\ 60.934).$$
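A minimal R sketch (not from the guide) of the same interval, with the χ² percentiles taken from qchisq() rather than Table 8:

n <- 15; s2 <- 24.5; alpha <- 0.05
lower <- (n - 1) * s2 / qchisq(1 - alpha / 2, df = n - 1)  # 14 * 24.5 / 26.12
upper <- (n - 1) * s2 / qchisq(alpha / 2, df = n - 1)      # 14 * 24.5 / 5.629
c(lower, upper)                                            # roughly (13.1, 60.9)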
(a) Write down the endpoints of an approximate 100(1 − α)% confidence interval
for π, stating any necessary conditions which should be satisfied for such an
approximate confidence interval to be used. You should also state the
approximate sampling distribution of P = X/n.
(b) Suppose we are willing to assume that π ≤ 0.40. What is the smallest n for
which P will have an approximate 99% probability of being within 0.05 of π?
Hence an accurate 100(1 − α)% confidence interval for µ, where α ∈ (0, 1), is:
$$\left(\bar{X} - t_{\alpha/2,\, n-1} \times \frac{S}{\sqrt{n}},\ \bar{X} + t_{\alpha/2,\, n-1} \times \frac{S}{\sqrt{n}}\right).$$
8.11. Solutions to Sample examination questions
hence n = 638.
A statistician took the Dale Carnegie Course, improving his confidence from
95% to 99%.
(Anon)
Chapter 9
Hypothesis testing
9.3 Introduction
Hypothesis testing and statistical estimation are the two most frequently-used
statistical inference methods. Hypothesis testing addresses a different type of practical
question from statistical estimation.
Based on the data, a (statistical) test is to make a binary decision on a hypothesis,
denoted by H0 :
reject H0 or not reject H0 .
H₀: π = 0.50.
If $\hat\pi = 0.90$, H₀ is unlikely to be true.
If $\hat\pi = 0.45$, H₀ may be true (and also may be untrue).
If $\hat\pi = 0.70$, what to do then?
Example 9.2 A customer complains that the amount of coffee powder in a coffee
tin is less than the advertised weight of 3 pounds.
A random sample of 20 tins is selected, resulting in an average weight of x̄ = 2.897
pounds. Is this sufficient to substantiate the complaint?
Again statistical estimation cannot provide a firm answer, due to random
fluctuations between different random samples. So we cast the problem into a
hypothesis testing problem as follows.
Let the weight of coffee in a tin be a normal random variable X ∼ N (µ, σ 2 ). We
need to test the hypothesis µ < 3. In fact, we use the data to test the hypothesis:
H0 : µ = 3.
Example 9.3 Suppose one is interested in evaluating the mean income (in £000s)
of a community. Suppose income in the population is modelled as N (µ, 25) and a
random sample of n = 25 observations is taken, yielding the sample mean x̄ = 17.
Independently of the data, three expert economists give their own opinions as
follows.
If Ms B’s claim is correct, X̄ ∼ N (15, 1). The observed value x̄ = 17 begins to look a
bit ‘extreme’, as it is two standard deviations away from µ. Hence there is some
inconsistency between the claim and the data evidence. This is shown in Figure 9.2.
If Mr C’s claim is correct, X̄ ∼ N (14, 1). The observed value x̄ = 17 is very extreme,
as it is three standard deviations away from µ. Hence there is strong inconsistency
between the claim and the data evidence. This is shown in Figure 9.3.
Figure 9.1: Comparison of claim and data evidence for Dr A in Example 9.3.
Figure 9.2: Comparison of claim and data evidence for Ms B in Example 9.3.
Figure 9.3: Comparison of claim and data evidence for Mr C in Example 9.3.
Definition of p-values
A p-value is the probability of the event that the test statistic takes the observed
value or more extreme (i.e. more unlikely) values under H0 . It is a measure of the
discrepancy between the hypothesis H0 and the data.
Example 9.5 Let $\{X_1, X_2, \ldots, X_{20}\}$, taking values either 1 or 0, be the outcomes of an experiment of tossing a coin 20 times, where $P(X_i = 1) = \pi$ and $P(X_i = 0) = 1 - \pi$, and we test H₀: π = 0.50 (i.e. the coin is fair).
Suppose there are 17 $X_i$s taking the value 1, and 3 $X_i$s taking the value 0. Will you reject the null hypothesis at the 5% significance level?
H0 : µ = µ0 vs. H1 : µ < µ0
α = Pµ0 (T ≤ c) = P (Z ≤ c).
Therefore, c is the 100αth percentile of N (0, 1). Due to the symmetry of N (0, 1),
c = −zα , where zα is the top 100αth percentile of N (0, 1), i.e. P (Z > zα ) = α, where
Z ∼ N (0, 1). For α = 0.05, zα = 1.645. We reject H0 if t ≤ −1.645.
i. We use a one-tailed test when we are only interested in the departure from H0 in
one direction.
ii. The distribution of a test statistic under H0 must be known in order to calculate
p-values or critical values.
iii. A test may be carried out by either computing the p-value or determining the
critical value.
iv. The probability of incorrect decisions in hypothesis testing is typically positive. For
example, the significance level is the probability of rejecting a true H0 .
9.6 t tests
t tests are one of the most frequently-used statistical tests.
Let {X1 , X2 , . . . , Xn } be a random sample from N (µ, σ 2 ), where both µ and σ 2 > 0 are
unknown. We are interested in testing the hypotheses:
H0 : µ = µ0 vs. H1 : µ < µ0
where µ0 is known.
Now we cannot use $\sqrt{n}(\bar{X} - \mu_0)/\sigma$ as a statistic, since σ is unknown. Naturally we replace it by S, where:
$$S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2.$$
The test statistic is then the famous t statistic:
$$T = \frac{\sqrt{n}(\bar{X} - \mu_0)}{S} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} = \sqrt{n}(\bar{X} - \mu_0)\bigg/\left(\frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2\right)^{1/2}.$$
We reject H0 if t < c, where c is the critical value determined by the significance level:
PH0 (T < c) = α
where PH0 denotes the distribution under H0 (with mean µ0 and unknown σ 2 ).
Under H0 , T ∼ tn−1 . Hence:
α = PH0 (T < c)
i.e. c is the 100αth percentile of the t distribution with n − 1 degrees of freedom. By
symmetry, c = −tα, n−1 , where tα, k denotes the top 100αth percentile of the tk
distribution.
Example 9.7 To deal with the customer complaint that the average amount of
coffee powder in a coffee tin is less than the advertised 3 pounds, 20 tins were
weighed, yielding the following observations:
2.82, 3.01, 3.11, 2.71, 2.93, 2.68, 3.02, 3.01, 2.93, 2.56,
2.78, 3.01, 3.09, 2.94, 2.82, 2.81, 3.05, 3.01, 2.85, 2.79.
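A minimal R sketch (not from the guide) of how this one-sided t test can be carried out on the 20 recorded weights, testing H₀: µ = 3 against H₁: µ < 3:

weights <- c(2.82, 3.01, 3.11, 2.71, 2.93, 2.68, 3.02, 3.01, 2.93, 2.56,
             2.78, 3.01, 3.09, 2.94, 2.82, 2.81, 3.05, 3.01, 2.85, 2.79)
t.test(weights, mu = 3, alternative = "less")
# Equivalently, by hand: t = sqrt(20) * (mean(weights) - 3) / sd(weights),
# compared with the critical value -qt(0.95, df = 19).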
Although H0 does not specify the population distribution completely (σ 2 > 0), the
distribution of the test statistic, T , under H0 is completely known. This enables us
to find the critical value or p-value.
PH0 (T ∈ C) = α.
3. If the observed value of T with the given sample is in the critical region C, H0 is
rejected. Otherwise, H0 is not rejected.
In order to make a test powerful in the sense that the chance of making an incorrect
decision is small, the critical region should consist of those values of T which are least
supportive of H0 (i.e. which lie in the direction of H1 ).
Decision made
H0 not rejected H0 rejected
True state H0 true Correct decision Type I error
of nature H1 true Type II error Correct decision
i. Ideally we would like to have a test which minimises the probabilities of making
both types of error, which unfortunately is not feasible.
ii. The probability of making a Type I error is the significance level, which is under
our control.
iii. We do not have explicit control over the probability of a Type II error. For a given
significance level, we try to choose a test statistic such that the probability of a
Type II error is small.
iv. The power function of the test is defined as:
β(θ) = Pθ (H0 is rejected) for θ ∈ Θ1
i.e. β(θ) = 1 − P (Type II error).
v. The null hypothesis H0 and the alternative hypothesis H1 are not treated equally in
a statistical test, i.e. there is an asymmetric treatment. The choice of H0 is based
on the subject matter concerned and/or technical convenience.
vi. It is more conclusive to end a test with H0 rejected, as the decision of ‘not reject
H0 ’ does not imply that H0 is accepted.
Turning Example 9.8 into a statistical problem, we assume that the data form a random
sample from N (µ, σ 2 ). We are interested in testing the hypotheses:
$$H_0: \sigma^2 = \sigma_0^2 \quad\text{vs.}\quad H_1: \sigma^2 > \sigma_0^2.$$
Let $S^2 = \sum_{i=1}^n (X_i - \bar{X})^2/(n-1)$, then $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$. Under H₀ we have:
$$T = \frac{(n-1)S^2}{\sigma_0^2} = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{\sigma_0^2} \sim \chi^2_{n-1}.$$
Since we will reject H0 against an alternative hypothesis σ 2 > σ02 , we should reject H0
for large values of T .
H₀ is rejected if $t > \chi^2_{\alpha,\, n-1}$, where $\chi^2_{\alpha,\, n-1}$ denotes the top 100α-th percentile of the $\chi^2_{n-1}$ distribution, i.e. we have $P(T \ge \chi^2_{\alpha,\, n-1}) = \alpha$.
For any $\sigma^2 > \sigma_0^2$, the power of the test at σ is:

σ²                          1        1.5      2        3        4
χ²_{0.05, 24}/σ²         36.415   24.277   18.208   12.138    9.104
β(σ)                      0.05     0.446    0.793    0.978    0.997
Approximate β(σ)          0.05     0.40     0.80     0.975    0.995
where $\chi^2_{\alpha,\, k}$ denotes the top 100α-th percentile of the $\chi^2_k$ distribution.
In the above table, $\bar{X} = \sum_{i=1}^n X_i/n$, $S^2 = \sum_{i=1}^n (X_i - \bar{X})^2/(n-1)$, and $\{X_1, X_2, \ldots, X_n\}$ is a random sample from $N(\mu, \sigma^2)$.
Are customers willing to pay more for the new product than the old one?
Observations are paired together for good reasons: before-after, A-vs.-B (from the
same subject).
$\mu = \mu_X - \mu_Y$ and $\sigma^2 = \sigma_X^2 + \sigma_Y^2$.
$$H_0: \mu = 0.$$
Therefore, we should use the test statistic $T = \sqrt{n}\,\bar{Z}/S$, where $\bar{Z}$ and $S^2$ denote, respectively, the sample mean and the sample variance of $\{Z_1, Z_2, \ldots, Z_n\}$.
At the 100α% significance level, for α ∈ (0, 1), we reject the hypothesis $\mu_X = \mu_Y$ when $|t| > t_{\alpha/2,\, n-1}$.
Let the sample means be $\bar{X} = \sum_{i=1}^n X_i/n$ and $\bar{Y} = \sum_{i=1}^m Y_i/m$, and the sample variances be:
$$S_X^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2 \quad\text{and}\quad S_Y^2 = \frac{1}{m-1}\sum_{i=1}^m (Y_i - \bar{Y})^2.$$
$\bar{X}$, $\bar{Y}$, $S_X^2$ and $S_Y^2$ are independent.
$\bar{X} \sim N(\mu_X, \sigma_X^2/n)$ and $(n-1)S_X^2/\sigma_X^2 \sim \chi^2_{n-1}$.
Hence $\bar{X} - \bar{Y} \sim N(\mu_X - \mu_Y,\ \sigma_X^2/n + \sigma_Y^2/m)$. If $\sigma_X^2 = \sigma_Y^2$, then:
$$\frac{\left(\bar{X} - \bar{Y} - (\mu_X - \mu_Y)\right)\big/\sqrt{\sigma_X^2/n + \sigma_Y^2/m}}{\sqrt{\left((n-1)S_X^2/\sigma_X^2 + (m-1)S_Y^2/\sigma_Y^2\right)/(n+m-2)}} = \sqrt{\frac{n+m-2}{1/n + 1/m}} \times \frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sqrt{(n-1)S_X^2 + (m-1)S_Y^2}} \sim t_{n+m-2}.$$
9.12.1 Tests on µ_X − µ_Y with known σ²_X and σ²_Y
Suppose we are interested in testing:
$$H_0: \mu_X = \mu_Y \quad\text{vs.}\quad H_1: \mu_X \neq \mu_Y.$$
Note that:
$$\frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sqrt{\sigma_X^2/n + \sigma_Y^2/m}} \sim N(0, 1).$$
Under H₀, $\mu_X - \mu_Y = 0$, so we have:
$$T = \frac{\bar{X} - \bar{Y}}{\sqrt{\sigma_X^2/n + \sigma_Y^2/m}} \sim N(0, 1).$$
At the 100α% significance level, for α ∈ (0, 1), we reject H₀ if $|t| > z_{\alpha/2}$, where $P(Z > z_{\alpha/2}) = \alpha/2$, for $Z \sim N(0, 1)$.
A 100(1 − α)% confidence interval for $\mu_X - \mu_Y$ is:
$$\bar{X} - \bar{Y} \pm z_{\alpha/2} \times \sqrt{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}}.$$
9.12.2 Tests on µ_X − µ_Y with σ²_X = σ²_Y but unknown
This time we consider the following hypotheses:
$$H_0: \mu_X - \mu_Y = \delta_0 \quad\text{vs.}\quad H_1: \mu_X - \mu_Y > \delta_0$$
Example 9.10 Two types of razor, A and B, were compared using 100 men in an
experiment. Each man shaved one side, chosen at random, of his face using one razor
and the other side using the other razor. The times taken to shave, Xi and Yi
minutes, for i = 1, 2, . . . , 100, corresponding to the razors A and B, respectively,
were recorded, yielding:
$$H_0: \mu_X = \mu_Y \quad\text{vs.}\quad H_1: \mu_X \neq \mu_Y.$$
There are three approaches – a paired comparison method and two two-sample
comparisons based on different assumptions. Since the data are recorded in pairs,
the paired comparison is most relevant and effective to analyse these data.
With the given data, we observe $t = \sqrt{100}\times(2.84 - 3.02)/\sqrt{0.6} = -2.327$. Hence we reject the hypothesis that the two razors lead to the same mean shaving time at the 5% significance level.
A 95% confidence interval for $\mu_X - \mu_Y$ is:
$$\bar{x} - \bar{y} \pm t_{0.025,\, n-1} \times \frac{s_Z}{\sqrt{n}} = -0.18 \pm 0.154 \quad\Rightarrow\quad (-0.334,\ -0.026).$$
Some remarks are the following.
ii. The paired comparison is intuitively the most relevant, requires the least
assumptions, and leads to the most conclusive inference (i.e. rejection of H0 ). It
also produces the narrowest confidence interval.
iii. Methods II and III ignore the pairing of the data. Consequently, the inference is
less conclusive and less accurate.
iv. A general observation is that H0 is rejected at the 100α% significance level if and
only if the value hypothesised by H0 is not within the corresponding 100(1 − α)%
confidence interval.
v. It is much more challenging to compare two normal means with unknown and
unequal variances. This will not be discussed in this course.
$$\rho = \text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{(\text{Var}(X)\,\text{Var}(Y))^{1/2}} = \frac{\mathrm{E}\left((X - \mathrm{E}(X))(Y - \mathrm{E}(Y))\right)}{\left(\mathrm{E}((X - \mathrm{E}(X))^2)\,\mathrm{E}((Y - \mathrm{E}(Y))^2)\right)^{1/2}}.$$
i. ρ ∈ [−1, 1], and |ρ| = 1 if and only if Y = aX + b for some constants a and b.
Furthermore, a > 0 if ρ = 1, and a < 0 if ρ = −1.
ii. ρ measures only the linear relationship between X and Y . When ρ = 0, X and Y
are linearly independent, that is uncorrelated.
iii. If X and Y are independent (in the sense that the joint pdf is the product of the
two marginal pdfs), ρ = 0. However, if ρ = 0, X and Y are not necessarily
independent, as there may exist some non-linear relationship between X and Y .
iv. If ρ > 0, X and Y tend to increase (or decrease) together. If ρ < 0, X and Y tend
to move in opposite directions.
The sample correlation coefficient is:
$$\hat\rho = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\left(\sum_{i=1}^n (X_i - \bar{X})^2 \sum_{i=1}^n (Y_i - \bar{Y})^2\right)^{1/2}}$$
where $\bar{X} = \sum_{i=1}^n X_i/n$ and $\bar{Y} = \sum_{i=1}^n Y_i/n$.
Example 9.11 The measurements of height, X, and weight, Y , are taken from 69
students in a class. ρ should be positive, intuitively!
In Figure 9.5 (on the next page), the vertical line at x̄ and the horizontal line at ȳ
divide the 69 points into 4 quadrants: northeast (NE), southwest (SW), northwest
(NW) and southeast (SE). Most points are in either NE or SW.
Overall:
$$\sum_{i=1}^{69} (x_i - \bar{x})(y_i - \bar{y}) > 0$$
Figure 9.6 (p.240) shows examples of different sample correlation coefficients using
scatterplots of bivariate observations.
$$H_0: \rho = 0 \quad\text{vs.}\quad H_1: \rho \neq 0.$$
Under H₀, the test statistic is $T = \hat\rho\sqrt{(n-2)/(1-\hat\rho^2)} \sim t_{n-2}$. Hence we reject H₀ at the 100α% significance level, for α ∈ (0, 1), if $|t| > t_{\alpha/2,\, n-2}$, where:
$$P(T > t_{\alpha/2,\, n-2}) = \frac{\alpha}{2}.$$
Some remarks are the following.
i. $|T| = |\hat\rho|\sqrt{(n-2)/(1-\hat\rho^2)}$ increases as $|\hat\rho|$ increases.
iii. Two random variables X and Y are jointly normal if aX + bY is normal for any
constants a and b.
iv. For jointly normal random variables X and Y , if Corr(X, Y ) = 0, X and Y are also
independent.
\[ S_X^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar X)^2 \quad\text{and}\quad S_Y^2 = \frac{1}{m-1}\sum_{i=1}^{m} (Y_i - \bar Y)^2. \]
We have (n − 1)S²X/σ²X ∼ χ²_{n−1} and (m − 1)S²Y/σ²Y ∼ χ²_{m−1}. Therefore:
\[ \frac{\sigma_Y^2}{\sigma_X^2} \times \frac{S_X^2}{S_Y^2} = \frac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2} \sim F_{n-1,\, m-1}. \]
Under H0, T = kS²X/S²Y ∼ Fn−1, m−1. Hence H0 is rejected if:
t < F1−α/2, n−1, m−1 or t > Fα/2, n−1, m−1
where Fα, p, k denotes the top 100αth percentile of the Fp, k distribution, that is:
P (T > Fα, p, k ) = α
Example 9.12 Here we practise use of Table A.3 of the Dougherty Statistical
Tables to obtain critical values for the F distribution.
Table A.3 can be used to find the top 100αth percentile of the Fν1 , ν2 distribution for
α = 0.05, 0.01 and 0.001.
For example, for ν1 = 3 and ν2 = 5, we have:
\[ P(F_{3,5} > 5.41) = 0.05, \qquad P(F_{3,5} > 12.06) = 0.01 \]
and:
\[ P(F_{3,5} > 33.20) = 0.001. \]
To find the bottom 100αth percentile, we note that F1−α, ν1 , ν2 = 1/Fα, ν2 , ν1 . So, for
ν1 = 3 and ν2 = 5, we have:
\[ P\left(F_{3,5} < \frac{1}{F_{0.05,\,5,\,3}} = \frac{1}{9.01} = 0.111\right) = 0.05 \]
\[ P\left(F_{3,5} < \frac{1}{F_{0.01,\,5,\,3}} = \frac{1}{28.24} = 0.035\right) = 0.01 \]
and:
\[ P\left(F_{3,5} < \frac{1}{F_{0.001,\,5,\,3}} = \frac{1}{134.58} = 0.007\right) = 0.001. \]
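When statistical tables are not to hand, the same percentiles can be obtained in R with qf(); a brief sketch:

qf(0.95, df1 = 3, df2 = 5)       # F_{0.05, 3, 5}, about 5.41
qf(0.999, df1 = 3, df2 = 5)      # F_{0.001, 3, 5}, about 33.20
1 / qf(0.95, df1 = 5, df2 = 3)   # bottom 5th percentile of F_{3, 5}, about 0.111
qf(0.05, df1 = 3, df2 = 5)       # the same value, computed directly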
Example 9.13 The daily returns (in percentages) of two assets, X and Y , are
recorded over a period of 100 trading days, yielding average daily returns of x̄ = 3.21
and ȳ = 1.41. Also available from the data are the following quantities:
\[ \sum_{i=1}^{100} x_i^2 = 1{,}989.24, \quad \sum_{i=1}^{100} y_i^2 = 932.78 \quad\text{and}\quad \sum_{i=1}^{100} x_i y_i = 661.11. \]
Assume the data are normally distributed. Are the two assets positively correlated
with each other, and is asset X riskier than asset Y ?
With n = 100 we have:
\[ s_X^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar x)^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar x^2\right) = 9.69 \]
and:
\[ s_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar y)^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} y_i^2 - n\bar y^2\right) = 7.41. \]
Therefore:
\[ \hat\rho = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{(n-1)s_X s_Y} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar x\bar y}{(n-1)s_X s_Y} = 0.249. \]
First we test:
H0 : ρ = 0 vs. H1 : ρ > 0.
Under H0 , the test statistic is:
\[ T = \hat\rho\,\sqrt{\frac{n-2}{1-\hat\rho^2}} \sim t_{98}. \]
Setting α = 0.01, we reject H0 if t > t0.01, 98 = 2.37. With the given data, t = 2.545
hence we reject the null hypothesis of ρ = 0 at the 1% significance level. We
conclude that there is highly significant evidence indicating that the two assets are
positively correlated.
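If the raw returns were available as numeric vectors x and y (they are not listed in the guide), this test could be reproduced in R as sketched below:

rho.hat <- cor(x, y)
t.stat  <- rho.hat * sqrt((length(x) - 2) / (1 - rho.hat^2))
t.stat                                    # about 2.545 for these data
cor.test(x, y, alternative = "greater")   # built-in equivalent, with p-value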
We measure the risks in terms of variances, and test:
H0 : σ²X = σ²Y vs. H1 : σ²X > σ²Y.
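A sketch in R of this variance-ratio test, using the sample variances obtained above (the 5% critical value quoted in the comment is approximate):

s2x <- 9.69;  s2y <- 7.41
f <- s2x / s2y                                  # observed value of T = s2x/s2y, about 1.31
qf(0.95, df1 = 99, df2 = 99)                    # 5% critical value of F(99, 99), about 1.39
pf(f, df1 = 99, df2 = 99, lower.tail = FALSE)   # p-value of the one-sided test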
In summary, the two-sample and association tests of this chapter are as follows.

Null hypothesis H0 : µX − µY = δ (σ²X, σ²Y known). Test statistic: \( T = (\bar X - \bar Y - \delta)\big/\sqrt{\sigma_X^2/n + \sigma_Y^2/m} \).

Null hypothesis H0 : µX − µY = δ (σ²X = σ²Y, unknown). Test statistic: \( T = \dfrac{\bar X - \bar Y - \delta}{\sqrt{1/n + 1/m}} \times \dfrac{\sqrt{n+m-2}}{\sqrt{(n-1)S_X^2 + (m-1)S_Y^2}} \).

Null hypothesis H0 : ρ = 0. Test statistic: \( T = \hat\rho\sqrt{(n-2)/(1-\hat\rho^2)} \).

Null hypothesis H0 : σ²Y/σ²X = k (with n = m). Test statistic: \( T = kS_X^2/S_Y^2 \).
H0 : µ = 60 vs. H1 : µ ≠ 60.
(a) Show how the researchers obtained the t statistic value of 1.289.
(b) Calculate the p-value of the test and use the p-value to draw a conclusion
about the significance of the test. Use a 5% significance level.
hence:
\[ P\left(\frac{60 - k - 60}{8/\sqrt{36}} \le \frac{\bar X - 60}{8/\sqrt{36}} \le \frac{60 + k - 60}{8/\sqrt{36}}\right) = P(-0.75k \le Z \le 0.75k) = 0.04. \]
We have that:
P (−0.05 ≤ Z ≤ 0.05) = 0.04
hence:
\[ 0.75k = 0.05 \quad\Rightarrow\quad k = \frac{1}{15}. \]
(b) For the power of the test, we require the conditional probability:
Standardising gives:
\[ P\left(\frac{59.933 - 62}{8/\sqrt{36}} \le \frac{\bar X - 62}{8/\sqrt{36}} \le \frac{60.067 - 62}{8/\sqrt{36}}\right) = P(-1.55 \le Z \le -1.45) = 0.0735 - 0.0606 = 0.0129. \]
\[ T = \frac{\bar X_d - \mu_d}{S_d/\sqrt{n}} \sim t_{n-1} \quad\Rightarrow\quad T = \frac{\bar X_d}{S_d/\sqrt{121}} \sim t_{120}. \]
Using Table 10 of the New Cambridge Statistical Tables, where T ∼ t120, the p-value is:
2 × P (T ≥ 1.289) = 2 × 0.10 = 0.20.
Since 0.20 > 0.05, the test is not significant at the 5% significance level.
To p, or not to p?
(James Abdey, Ph.D. Thesis 2009.1 )
1
Available at http://etheses.lse.ac.uk/31
Chapter 10
Analysis of variance (ANOVA)
restate and interpret the models for one-way and two-way analysis of variance
perform hypothesis tests and construct confidence intervals for one-way and
two-way analysis of variance
10.3 Introduction
Analysis of variance (ANOVA) is a popular tool which has an applicability and power
which we can only start to appreciate in this course. The idea of analysis of variance is
to investigate how variation in structured data can be split into pieces associated with
components of that structure. We look only at one-way and two-way classifications,
providing tests and confidence intervals which are widely used in practice.
Example 10.1 To assess the teaching quality of class teachers, a random sample of
6 examination marks was selected from each of three classes. The examination marks
for each class are listed in the table below.
Can we infer from these data that there is no significant difference in the
examination marks among all three classes?
Suppose examination marks from Class j follow the distribution N (µj , σ 2 ), for
j = 1, 2, 3. So we assume examination marks are normally distributed with the same
variance in each class, but possibly different means.
We need to test the hypothesis:
H0 : µ1 = µ2 = µ3 .
The data form a 6 × 3 array. Denote the data point at the (i, j)th position as Xij .
We compute the column means first where the jth column mean is:
\[ \bar X_{\cdot j} = \frac{X_{1j} + X_{2j} + \cdots + X_{n_j j}}{n_j}. \]
Observation
1 2 3 4 5 6 Mean
Class 1 85 75 82 76 71 85 79
Class 2 71 75 73 74 69 82 74
Class 3 59 64 62 69 75 67 66
Note that similar problems arise from other practical situations. For example:
If H0 is true, the three observed sample means x̄·1 , x̄·2 and x̄·3 should be very close to
each other, i.e. all of them should be close to the overall sample mean, x̄, which is:
\[ \bar x = \frac{\bar x_{\cdot 1} + \bar x_{\cdot 2} + \bar x_{\cdot 3}}{3} = \frac{79 + 74 + 66}{3} = 73. \]
Hence we would reject H0 for large values of T . (Note t = 0 if x̄·1 = x̄·2 = x̄·3 which
would mean that there is no variation at all between the sample means. In this case
all the sample means would equal x̄.)
It remains to determine the distribution of T under H0 .
where n = ∑_{j=1}^{k} n_j is the total number of observations across all k groups.
The total variation is:
\[ \sum_{j=1}^{k}\sum_{i=1}^{n_j} (X_{ij} - \bar X)^2 \]
with n − 1 degrees of freedom, and the within-groups variation is:
\[ \sum_{j=1}^{k}\sum_{i=1}^{n_j} (X_{ij} - \bar X_{\cdot j})^2 \]
with n − k = ∑_{j=1}^{k} (n_j − 1) degrees of freedom.
The ANOVA decomposition is:
\[ \sum_{j=1}^{k}\sum_{i=1}^{n_j} (X_{ij} - \bar X)^2 = \sum_{j=1}^{k} n_j(\bar X_{\cdot j} - \bar X)^2 + \sum_{j=1}^{k}\sum_{i=1}^{n_j} (X_{ij} - \bar X_{\cdot j})^2. \]
We have already discussed the jth sample mean and overall sample mean. The total
variation is a measure of the overall (total) variability in the data from all k groups
about the overall sample mean. The ANOVA decomposition decomposes this into two
components: between-groups variation (which is attributable to the factor level) and
within-groups variation (which is attributable to the variation within each group and is
assumed to be the same σ 2 for each group).
Some remarks are the following.
ii. \( W/\sigma^2 = \sum_{j=1}^{k}\sum_{i=1}^{n_j} (X_{ij} - \bar X_{\cdot j})^2/\sigma^2 \sim \chi^2_{n-k} \).

iii. Under H0 : µ1 = · · · = µk, \( B/\sigma^2 = \sum_{j=1}^{k} n_j(\bar X_{\cdot j} - \bar X)^2/\sigma^2 \sim \chi^2_{k-1} \).

iv. Hence, under H0, the test statistic \( F = \dfrac{B/(k-1)}{W/(n-k)} \sim F_{k-1,\, n-k} \), and we reject H0 at the 100α% significance level if f > Fα, k−1, n−k,
where Fα, k−1, n−k is the top 100αth percentile of the Fk−1, n−k distribution, i.e.
P (F > Fα, k−1, n−k ) = α, and f is the observed test statistic value.
p-value = P (F > f ).
It is clear that f > Fα, k−1, n−k if and only if the p-value < α, as we must reach the same
conclusion regardless of whether we use the critical value approach or the p-value
approach to hypothesis testing.
Example 10.2 Continuing with Example 10.1, for the given data, k = 3,
n1 = n2 = n3 = 6, n = n1 + n2 + n3 = 18, x̄·1 = 79, x̄·2 = 74, x̄·3 = 66 and x̄ = 73.
The sample variances are calculated to be s21 = 34, s22 = 20 and s23 = 32. Therefore:
\[ b = \sum_{j=1}^{3} 6(\bar x_{\cdot j} - \bar x)^2 = 6 \times ((79-73)^2 + (74-73)^2 + (66-73)^2) = 516 \]
and:
\[ w = \sum_{j=1}^{3}\sum_{i=1}^{6} (x_{ij} - \bar x_{\cdot j})^2 = \sum_{j=1}^{3}\sum_{i=1}^{6} x_{ij}^2 - 6\sum_{j=1}^{3} \bar x_{\cdot j}^2 = \sum_{j=1}^{3} 5s_j^2 = 5 \times (34 + 20 + 32) = 430. \]
Hence:
\[ f = \frac{b/(k-1)}{w/(n-k)} = \frac{516/2}{430/15} = 9. \]
Under H0 : µ1 = µ2 = µ3 , F ∼ Fk−1, n−k = F2, 15 . Since F0.01, 2, 15 = 6.36 < 9, using
Table A.3 of the Dougherty Statistical Tables, we reject H0 at the 1% significance
level. In fact the p-value (using a computer) is P (F > 9) = 0.003. Therefore, we
conclude that there is a significant difference among the mean examination marks
across the three classes.
Source DF SS MS F p-value
Class 2 516 258 9 0.003
Error 15 430 28.67
Total 17 946
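The same one-way ANOVA is easily reproduced in R from the raw marks; the vectors below simply re-enter the data from the table in Example 10.1 (a sketch):

marks <- c(85, 75, 82, 76, 71, 85,    # Class 1
           71, 75, 73, 74, 69, 82,    # Class 2
           59, 64, 62, 69, 75, 67)    # Class 3
class <- factor(rep(c("Class 1", "Class 2", "Class 3"), each = 6))
anova(lm(marks ~ class))              # reproduces b = 516, w = 430, f = 9, p-value = 0.003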
> attach(UhAh)
> summary(UhAh)
Frequency Department
Min. : 0.00 English :100
1st Qu.: 4.00 Mathematics :100
Median : 5.00 Political Science:100
Mean : 5.48
3rd Qu.: 7.00
Max. :11.00
> xbar <- tapply(Frequency, Department, mean)
> s <- tapply(Frequency, Department, sd)
> n <- tapply(Frequency, Department, length)
> sem <- s/sqrt(n)
> list(xbar,s,n,sem)
[[1]]
English Mathematics Political Science
5.81 5.30 5.33
[[2]]
English Mathematics Political Science
2.493203 2.012587 1.974867
[[3]]
English Mathematics Political Science
100 100 100
[[4]]
English Mathematics Political Science
0.2493203 0.2012587 0.1974867
Surprisingly, professors in English say ‘uh’ or ‘ah’ more on average than those in
Mathematics and Political Science (compare the sample means of 5.81, 5.30 and
5.33), but the difference seems small. However, we need to formally test whether the
(seemingly small) differences are statistically significant.
Using the data, R produces the following one-way ANOVA table:
Response: Frequency
Df Sum Sq Mean Sq F value Pr(>F)
Department 2 16.38 8.1900 1.7344 0.1783
Residuals 297 1402.50 4.7222
Since the p-value for the F test is 0.1783, we cannot reject the following hypothesis:
H0 : µ1 = µ2 = µ3 .
An estimator of σ is:
\[ \hat\sigma = S = \sqrt{\frac{W}{n-k}}. \]
A 95% confidence interval for µj is:
\[ \bar X_{\cdot j} \pm t_{0.025,\, n-k} \times \frac{S}{\sqrt{n_j}} \quad\text{for } j = 1, 2, \ldots, k \]
where t0.025, n−k is the top 2.5th percentile of the Student’s tn−k distribution, which
can be obtained from Table 10 of the New Cambridge Statistical Tables.
Example 10.4 Assuming a common variance for each group, from the preceding
output in Example 10.3 we see that:
\[ \hat\sigma = s = \sqrt{\frac{1{,}402.50}{297}} = \sqrt{4.72} = 2.173. \]
Since t0.025, 297 ≈ t0.025, ∞ = 1.96, using Table 10 of the New Cambridge Statistical
Tables, we obtain the following 95% confidence intervals for µ1 , µ2 and µ3 ,
respectively:
\[ j = 1: \quad 5.81 \pm 1.96 \times \frac{2.173}{\sqrt{100}} \;\Rightarrow\; (5.38, 6.24) \]
\[ j = 2: \quad 5.30 \pm 1.96 \times \frac{2.173}{\sqrt{100}} \;\Rightarrow\; (4.87, 5.73) \]
\[ j = 3: \quad 5.33 \pm 1.96 \times \frac{2.173}{\sqrt{100}} \;\Rightarrow\; (4.90, 5.76). \]
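These intervals can also be computed directly in R; the sketch below assumes the group means xbar and group sizes n created by tapply() in the earlier output:

s.pooled <- sqrt(1402.50 / 297)      # pooled estimate of sigma, about 2.173
t.crit   <- qt(0.975, df = 297)      # about 1.97 (1.96 is used in the guide)
cbind(lower = xbar - t.crit * s.pooled / sqrt(n),
      upper = xbar + t.crit * s.pooled / sqrt(n))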
Example 10.5 In early 2001, the American economy was slowing down and
companies were laying off workers. A poll conducted during February 2001 asked a
random sample of workers how long (in months) it would be before they faced
significant financial hardship if they lost their jobs, with the data available in the file
‘GallupPoll.csv’ (available on the VLE). They are classified into four groups
according to their incomes. Below is part of the R output of the descriptive statistics
of the classified data. Can we infer that income group has a significant impact on the
mean length of time before facing financial hardship?
Hardship Income.group
Min. : 0.00 $20 to 30K: 81
1st Qu.: 8.00 $30 to 50K:114
Median :15.00 Over $50K : 39
Mean :16.11 Under $20K: 67
3rd Qu.:22.00
Max. :50.00
[[2]]
$20 to 30K $30 to 50K Over $50K Under $20K
9.233260 9.507464 11.029099 8.087043
[[3]]
$20 to 30K $30 to 50K Over $50K Under $20K
81 114 39 67
[[4]]
$20 to 30K $30 to 50K Over $50K Under $20K
1.0259178 0.8904556 1.7660693 0.9879896
Inspection of the sample means suggests that there is a difference between income
groups, but we need to conduct a one-way ANOVA test to see whether the
differences are statistically significant.
We apply one-way ANOVA to test whether the means in the k = 4 groups are equal,
i.e. H0 : µ1 = µ2 = µ3 = µ4 , from highest to lowest income groups.
We have n1 = 39, n2 = 114, n3 = 81 and n4 = 67, hence:
\[ n = \sum_{j=1}^{k} n_j = 39 + 114 + 81 + 67 = 301. \]
Also x̄·1 = 22.21, x̄·2 = 18.456, x̄·3 = 15.49, x̄·4 = 9.313 and:
\[ \bar x = \frac{1}{n}\sum_{j=1}^{k} n_j\bar x_{\cdot j} = \frac{39 \times 22.21 + 114 \times 18.456 + 81 \times 15.49 + 67 \times 9.313}{301} = 16.109. \]
Now:
\[ b = \sum_{j=1}^{k} n_j(\bar x_{\cdot j} - \bar x)^2 = 39(22.21 - 16.109)^2 + \cdots + 67(9.313 - 16.109)^2 = 5{,}205.097. \]
We have s21 = (11.03)2 = 121.661, s22 = (9.507)2 = 90.383, s23 = (9.23)2 = 85.193 and
s24 = (8.087)2 = 65.400, hence:
\[ w = \sum_{j=1}^{k}\sum_{i=1}^{n_j} (x_{ij} - \bar x_{\cdot j})^2 = \sum_{j=1}^{k} (n_j - 1)s_j^2 = 38 \times 121.661 + 113 \times 90.383 + 80 \times 85.193 + 66 \times 65.400 = 25{,}968.24. \]
Consequently:
\[ f = \frac{b/(k-1)}{w/(n-k)} = \frac{5{,}205.097/3}{25{,}968.24/(301-4)} = 19.84. \]
Under H0 , F ∼ Fk−1, n−k = F3, 297 . Since F0.01, 3, 297 ≈ 3.85 < 19.84, we reject H0 at
the 1% significance level, i.e. there is strong evidence that income group has a
significant impact on the mean length of time before facing financial hardship.
The pooled estimate of σ is:
\[ s = \sqrt{w/(n-k)} = \sqrt{25{,}968.24/(301-4)} = 9.351. \]
A 95% confidence interval for each group mean µj is:
\[ \bar x_{\cdot j} \pm t_{0.025,\,297} \times \frac{s}{\sqrt{n_j}} = \bar x_{\cdot j} \pm 1.96 \times \frac{9.351}{\sqrt{n_j}} = \bar x_{\cdot j} \pm \frac{18.328}{\sqrt{n_j}}. \]
For the highest and lowest income groups, respectively, this gives:
\[ 22.21 \pm \frac{18.328}{\sqrt{39}} \;\Rightarrow\; (19.28, 25.14) \quad\text{and}\quad 9.313 \pm \frac{18.328}{\sqrt{67}} \;\Rightarrow\; (7.07, 11.55). \]
Notice that these two confidence intervals do not overlap, which is consistent with
our conclusion that there is a difference between the group means.
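The whole analysis can be reproduced in R; a sketch, assuming 'GallupPoll.csv' has been read into a data frame with columns Hardship and Income.group as in the output above:

GallupPoll <- read.csv("GallupPoll.csv")
fit <- lm(Hardship ~ Income.group, data = GallupPoll)
anova(fit)                                  # one-way ANOVA table (shown below)
s <- summary(fit)$sigma                     # pooled estimate of sigma, about 9.35
xbar <- tapply(GallupPoll$Hardship, GallupPoll$Income.group, mean)
nj   <- tapply(GallupPoll$Hardship, GallupPoll$Income.group, length)
cbind(xbar - qt(0.975, fit$df.residual) * s / sqrt(nj),
      xbar + qt(0.975, fit$df.residual) * s / sqrt(nj))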
Response: Hardship
Df Sum Sq Mean Sq F value Pr(>F)
Income.group 3 5202.1 1734.03 19.828 9.636e-12 ***
Residuals 297 25973.3 87.45
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that minor differences are due to rounding errors in calculations.
H0 : µ1 = µ2 = · · · = µk .
The one-way ANOVA model can be written as:
\[ X_{ij} = \mu + \beta_j + \varepsilon_{ij} \quad\text{for } i = 1, 2, \ldots, n_j \text{ and } j = 1, 2, \ldots, k \]
where εij ∼ N(0, σ²) and the εij s are independent. µ is the average effect and βj is the factor (or treatment) effect at the jth level. Note that ∑_{j=1}^{k} βj = 0. The null hypothesis
(i.e. that the group means are all equal) can also be expressed as:
H0 : β1 = β2 = · · · = βk = 0.
The two-way ANOVA model is:
\[ X_{ij} = \mu + \gamma_i + \beta_j + \varepsilon_{ij} \quad\text{for } i = 1, 2, \ldots, r \text{ and } j = 1, 2, \ldots, c \]
where µ is the average effect, γi is the ith row (block) effect, βj is the jth column (treatment) effect, and the εij ∼ N(0, σ²) are independent error terms.
In total, there are n = r × c observations. We now consider the conditions to make the
parameters µ, γi and βj identifiable for i = 1, 2, . . . , r and j = 1, 2, . . . , c. The conditions
are:
γ1 + γ2 + · · · + γr = 0 and β1 + β2 + · · · + βc = 0.
The two-way ANOVA decomposition is:
\[ \sum_{i=1}^{r}\sum_{j=1}^{c} (X_{ij} - \bar X)^2 = c\sum_{i=1}^{r} (\bar X_{i\cdot} - \bar X)^2 + r\sum_{j=1}^{c} (\bar X_{\cdot j} - \bar X)^2 + \sum_{i=1}^{r}\sum_{j=1}^{c} (X_{ij} - \bar X_{i\cdot} - \bar X_{\cdot j} + \bar X)^2. \]
The total variation is a measure of the overall (total) variability in the data and the
(two-way) ANOVA decomposition decomposes this into three components:
between-blocks variation (which is attributable to the row factor level),
between-treatments variation (which is attributable to the column factor level) and
residual variation (which is attributable to the variation not explained by the row and
column factors).
The following are some useful formulae for manual computations.
Row sample means: \( \bar X_{i\cdot} = \sum_{j=1}^{c} X_{ij}/c \), for i = 1, 2, …, r.

Column sample means: \( \bar X_{\cdot j} = \sum_{i=1}^{r} X_{ij}/r \), for j = 1, 2, …, c.

Overall sample mean: \( \bar X = \sum_{i=1}^{r}\sum_{j=1}^{c} X_{ij}/n = \sum_{i=1}^{r} \bar X_{i\cdot}/r = \sum_{j=1}^{c} \bar X_{\cdot j}/c \).

Total SS = \( \sum_{i=1}^{r}\sum_{j=1}^{c} X_{ij}^2 - rc\bar X^2 \).

Between-blocks (rows) variation: \( B_{\text{row}} = c\sum_{i=1}^{r} \bar X_{i\cdot}^2 - rc\bar X^2 \).
Between-treatments (columns) variation: \( B_{\text{col}} = r\sum_{j=1}^{c} \bar X_{\cdot j}^2 - rc\bar X^2 \).

Residual SS = (Total SS) − Brow − Bcol = \( \sum_{i=1}^{r}\sum_{j=1}^{c} X_{ij}^2 - c\sum_{i=1}^{r} \bar X_{i\cdot}^2 - r\sum_{j=1}^{c} \bar X_{\cdot j}^2 + rc\bar X^2 \).
As with one-way ANOVA, two-way ANOVA results are presented in a table as follows:
Source          DF               SS            MS                             F                               p-value
Row factor      r − 1            Brow          Brow/(r − 1)                   (c − 1)Brow/(Residual SS)       p
Column factor   c − 1            Bcol          Bcol/(c − 1)                   (r − 1)Bcol/(Residual SS)       p
Residual        (r − 1)(c − 1)   Residual SS   Residual SS/((r − 1)(c − 1))
Total           rc − 1           Total SS
10.8 Residuals
Before considering an example of two-way ANOVA, we briefly consider residuals.
Recall the original two-way ANOVA model:
Xij = µ + γi + βj + εij .
µ̂ = X̄ is the point estimator of µ, while γ̂i = X̄i· − X̄ and β̂j = X̄·j − X̄ are the point estimators of γi and βj. The residuals are:
\[ \hat\varepsilon_{ij} = X_{ij} - \hat\mu - \hat\gamma_i - \hat\beta_j = X_{ij} - \bar X_{i\cdot} - \bar X_{\cdot j} + \bar X \]
for i = 1, 2, . . . , r and j = 1, 2, . . . , c.
The two-way ANOVA model assumes εij ∼ N (0, σ 2 ) and so, if the model structure is
correct, then the εbij s should behave like independent N (0, σ 2 ) random variables.
Example 10.6 The following table lists the percentage annual returns (calculated
four times per annum) of the Common Stock Index at the New York Stock Exchange
during 1981–85, available in the data file ‘NYSE.csv’ (available on the VLE).
\[ \text{Total SS} = \sum_{i=1}^{r}\sum_{j=1}^{c} x_{ij}^2 - rc\bar x^2 = 559.06 - 20 \times (5.17)^2 = 559.06 - 534.578 = 24.482. \]
\[ b_{\text{row}} = c\sum_{i=1}^{r} \bar x_{i\cdot}^2 - rc\bar x^2 = 4 \times 138.6112 - 534.578 = 19.867. \]
\[ b_{\text{col}} = r\sum_{j=1}^{c} \bar x_{\cdot j}^2 - rc\bar x^2 = 5 \times 107.036 - 534.578 = 0.602. \]
Source DF SS MS F p-value
Year 4 19.867 4.967 14.852 < 0.01
Quarter 3 0.602 0.201 0.600 > 0.10
Residual 12 4.013 0.334
Total 19 24.482
We could also provide 95% confidence interval estimates for each block and
treatment level by using the pooled estimator of σ 2 , which is:
\[ S^2 = \frac{\text{Residual SS}}{(r-1)(c-1)} = \text{Residual MS}. \]
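A sketch of the corresponding analysis in R, assuming 'NYSE.csv' contains columns Return, Year and Quarter (the variable names match the output shown further below):

NYSE <- read.csv("NYSE.csv")
NYSE$Year    <- factor(NYSE$Year)
NYSE$Quarter <- factor(NYSE$Quarter)
fit <- lm(Return ~ Year + Quarter, data = NYSE)
anova(fit)              # two-way ANOVA table
residuals(fit)          # estimated errors, for checking the model assumptions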
Response: Return
Df Sum Sq Mean Sq F value Pr(>F)
Year 4 19.867 4.9667 14.852 0.0001349 ***
Quarter 3 0.602 0.2007 0.600 0.6271918
Residuals 12 4.013 0.3344
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that the confidence intervals for years 1 and 2 (corresponding to 1981 and
1982) are separated from those for years 3 to 5 (that is, 1983 to 1985), which is
consistent with rejection of H0 in the no row effect test. In contrast, the confidence
intervals for each quarter all overlap, which is consistent with our failure to reject H0
in the no column effect test.
Finally, we may also look at the residuals:
\[ \hat\varepsilon_{ij} = X_{ij} - \hat\mu - \hat\gamma_i - \hat\beta_j \quad\text{for } i = 1, 2, \ldots, r \text{ and } j = 1, 2, \ldots, c. \]
If the assumed normal model (structure) is correct, the εbij s should behave like
independent N (0, σ 2 ) random variables.
Test at the 5% significance level whether the true mean price-earnings ratios for
the three market sectors are the same. Use the ANOVA table format to summarise
your calculations. You may exclude the p-value.
2. The audience shares (in %) of three major television networks’ evening news
broadcasts in four major cities were examined. The average audience share for the
three networks (A, B and C) were 21.35%, 17.28% and 20.18%, respectively. The
following is the calculated ANOVA table with some entries missing.
For the given data, the between-sectors sum of squares is b = 76.31.
Therefore, w = 186.80 − 76.31 = 110.49. Hence the ANOVA table is:
Source DF SS MS F
Sector 2 76.31 38.16 11.39
Error 33 110.49 3.35
Total 35 186.80
We test:
H0 : PE ratio means are equal vs. H1 : PE ratio means are not equal
and we reject H0 if:
f > F0.05, 2, 33 ≈ 3.30.
Since 3.30 < 11.39, we reject H0 and conclude that there is evidence of a difference
in the mean price-earnings ratios across the sectors.
A total of 4,000 cans are opened around the world every second. Ten babies are
conceived around the world every second. Each time you open a can, you stand
a 1-in-400 chance of falling pregnant.
(True or false?)
Chapter 11
Linear regression
derive from first principles the least squares estimators of the intercept and slope in
the simple linear regression model
explain how to construct confidence intervals and perform hypothesis tests for the
intercept and slope in the simple linear regression model
summarise the multiple linear regression model with several explanatory variables,
and explain its interpretation
11.3 Introduction
Regression analysis is one of the most frequently-used statistical techniques. It aims
to model an explicit relationship between one dependent variable, often denoted as y,
and one or more regressors (also called covariates, or independent variables), often
denoted as x1 , x2 , . . . , xp .
The goal of regression analysis is to understand how y depends on x1 , x2 , . . . , xp and to
predict or control the unobserved y based on the observed x1 , x2 , . . . , xp . We start with
some simple examples with p = 1.
y = β0 + β1 x + ε
where ε stands for a random error term, β0 is the intercept and β1 is the slope of the
straight line.
Example 11.2 The data file ‘WeightHeight.csv’ (available on the VLE) contains
the heights, x, and weights, y, of 69 students in a class.
We plot y against x, and draw a straight line through the middle of the data cloud:
y = β0 + β1 x + ε
where ε stands for a random error term, β0 is the intercept and β1 is the slope of the
straight line.
For a given height, x, the predicted value yb = β0 + β1 x may be viewed as a kind of
‘standard weight’.
Example 11.3 Some other possible examples of y and x are shown in the following
table.
y x
Sales Price
Weight gain Protein in diet
Present FTSE 100 index Past FTSE 100 index
Consumption Income
Salary Tenure
Daughter’s height Mother’s height
In most cases, there are several x variables involved. We will consider such situations
later in this chapter.
How to draw a line through data clouds, i.e. how to estimate β0 and β1 ?
yi = β0 + β1 xi + εi
where:
E(εi ) = 0 and Var(εi ) = E(ε2i ) = σ 2 > 0.
Furthermore, suppose Cov(εi , εj ) = E(εi εj ) = 0 for all i 6= j. That is, the εi s are
assumed to be uncorrelated (remembering that a zero covariance between two random
variables implies that they are uncorrelated).
So the model has three parameters: β0 , β1 and σ 2 .
For convenience, we will treat x1 , x2 , . . . , xn as constants.1 We have:
Since the εi s are uncorrelated (by assumption), it follows that y1 , y2 , . . . , yn are also
uncorrelated with each other.
Sometimes we assume εi ∼ N (0, σ 2 ), in which case yi ∼ N (β0 + β1 xi , σ 2 ), and
y1 , y2 , . . . , yn are independent. (Remember that a linear transformation of a normal
random variable is also normal, and that for jointly normal random variables if they are
uncorrelated then they are also independent.)
Our tasks are two-fold.
The least squares estimators of β0 and β1 minimise:
\[ L(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2. \]
Firstly:
\[ \frac{\partial}{\partial\beta_0} L(\beta_0, \beta_1) = -2\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i). \]
Secondly:
\[ \frac{\partial}{\partial\beta_1} L(\beta_0, \beta_1) = -2\sum_{i=1}^{n} x_i(y_i - \beta_0 - \beta_1 x_i). \]
1
If you study EC2020 Elements of econometrics, you will explore regression models in much
more detail than is covered here. For example, x1 , x2 , . . . , xn will be treated as random variables.
Hence:
\[ \hat\beta_1 = \frac{\sum_{i=1}^{n} x_i(y_i - \bar y)}{\sum_{i=1}^{n} x_i(x_i - \bar x)} = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2} \quad\text{and}\quad \hat\beta_0 = \bar y - \hat\beta_1\bar x. \]
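As a quick check of these formulae, the least squares estimates can be computed directly in R and compared with lm(). The data below are made up purely for illustration:

x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)            # 2.2 and 0.6 for these illustrative data
coef(lm(y ~ x))      # agrees with the closed-form expressions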
The estimator βb1 above is based on the fact that for any constant c, we have:
\[ \sum_{i=1}^{n} x_i(y_i - \bar y) = \sum_{i=1}^{n} (x_i - c)(y_i - \bar y) \]
since:
\[ \sum_{i=1}^{n} c(y_i - \bar y) = c\sum_{i=1}^{n} (y_i - \bar y) = 0. \]
Given that ∑_{i=1}^{n} (xi − x̄) = 0, it follows that ∑_{i=1}^{n} c(xi − x̄) = 0 for any constant c.
where:
\[ B = \sum_{i=1}^{n} (\hat\beta_0 - \beta_0 + (\hat\beta_1 - \beta_1)x_i)(y_i - \hat\beta_0 - \hat\beta_1 x_i) = (\hat\beta_0 - \beta_0)\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) + (\hat\beta_1 - \beta_1)\sum_{i=1}^{n} x_i(y_i - \hat\beta_0 - \hat\beta_1 x_i). \]
Hence (βb0 , βb1 ) are the least squares estimators (LSEs) of β0 and β1 , respectively.
To find the explicit expression from (11.2), note the first equation can be written as:
\[ \sum_{i=1}^{n} y_i - n\hat\beta_0 - \hat\beta_1\sum_{i=1}^{n} x_i = 0 \quad\Leftrightarrow\quad \bar y = \hat\beta_0 + \hat\beta_1\bar x. \]
Hence β̂0 = ȳ − β̂1 x̄. Substituting this into the second equation, we have:
\[ 0 = \sum_{i=1}^{n} x_i(y_i - \bar y - \hat\beta_1(x_i - \bar x)) = \sum_{i=1}^{n} x_i(y_i - \bar y) - \hat\beta_1\sum_{i=1}^{n} x_i(x_i - \bar x). \]
Therefore:
\[ \hat\beta_1 = \frac{\sum_{i=1}^{n} x_i(y_i - \bar y)}{\sum_{i=1}^{n} x_i(x_i - \bar x)} = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2}. \]
We now explore the properties of the LSEs β̂0 and β̂1. We can show that the means and variances of these LSEs are:
\[ E(\hat\beta_0) = \beta_0 \quad\text{and}\quad \mathrm{Var}(\hat\beta_0) = \frac{\sigma^2\sum_{i=1}^{n} x_i^2}{n\sum_{i=1}^{n} (x_i - \bar x)^2} \]
for β̂0, and:
\[ E(\hat\beta_1) = \beta_1 \quad\text{and}\quad \mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar x)^2} \]
for β̂1.
Proof: Recall we treat the xi s as constants, and we have E(yi ) = β0 + β1 xi and also
Var(yi ) = σ 2 . Hence:
\[ E(\bar y) = E\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(y_i) = \frac{1}{n}\sum_{i=1}^{n} (\beta_0 + \beta_1 x_i) = \beta_0 + \beta_1\bar x. \]
Therefore:
E(yi − ȳ) = β0 + β1 xi − (β0 + β1 x̄) = β1 (xi − x̄).
Consequently, we have:
\[ E(\hat\beta_1) = E\left(\frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2}\right) = \frac{\sum_{i=1}^{n} (x_i - \bar x)E(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2} = \frac{\beta_1\sum_{i=1}^{n} (x_i - \bar x)^2}{\sum_{i=1}^{n} (x_i - \bar x)^2} = \beta_1. \]
Now:
E(βb0 ) = E(ȳ − βb1 x̄) = β0 + β1 x̄ − β1 x̄ = β0 .
Therefore, the LSEs βb0 and βb1 are unbiased estimators of β0 and β1 , respectively.
To work out the variances, the key is to write βb1 and βb0 as linear estimators (i.e.
linear combinations of the yi s):
\[ \hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2} = \frac{\sum_{i=1}^{n} (x_i - \bar x)y_i}{\sum_{k=1}^{n} (x_k - \bar x)^2} = \sum_{i=1}^{n} a_i y_i \]
where \( a_i = (x_i - \bar x)\big/\sum_{k=1}^{n} (x_k - \bar x)^2 \), and:
\[ \hat\beta_0 = \bar y - \hat\beta_1\bar x = \bar y - \sum_{i=1}^{n} a_i\bar x\, y_i = \sum_{i=1}^{n} \left(\frac{1}{n} - a_i\bar x\right) y_i. \]
Note that:
\[ \sum_{i=1}^{n} a_i = 0 \quad\text{and}\quad \sum_{i=1}^{n} a_i^2 = \frac{1}{\sum_{k=1}^{n} (x_k - \bar x)^2}. \]
Since the yi s are uncorrelated, for any constants b1, b2, …, bn we have the lemma:
\[ \mathrm{Var}\left(\sum_{i=1}^{n} b_i y_i\right) = \sum_{i=1}^{n} b_i^2\,\mathrm{Var}(y_i). \]
By this lemma:
\[ \mathrm{Var}(\hat\beta_1) = \mathrm{Var}\left(\sum_{i=1}^{n} a_i y_i\right) = \sigma^2\sum_{i=1}^{n} a_i^2 = \frac{\sigma^2}{\sum_{k=1}^{n} (x_k - \bar x)^2} \]
and:
\[ \mathrm{Var}(\hat\beta_0) = \sigma^2\sum_{i=1}^{n} \left(\frac{1}{n} - a_i\bar x\right)^2 = \sigma^2\left(\frac{1}{n} + \sum_{i=1}^{n} a_i^2\bar x^2\right) = \frac{\sigma^2}{n}\left(1 + \frac{n\bar x^2}{\sum_{k=1}^{n} (x_k - \bar x)^2}\right) = \frac{\sigma^2\sum_{k=1}^{n} x_k^2}{n\sum_{k=1}^{n} (x_k - \bar x)^2}. \]
Suppose now that we also assume normality of the error terms, i.e. we have:
\[ \varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n \sim_{\text{IID}} N(0, \sigma^2). \]
It follows that:
\[ y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2). \]
Since any linear combination of normal random variables is also normal, the LSEs of β0
and β1 (as linear estimators) are also normal random variables. In fact:
\[ \hat\beta_0 \sim N\left(\beta_0,\; \frac{\sigma^2\sum_{i=1}^{n} x_i^2}{n\sum_{i=1}^{n} (x_i - \bar x)^2}\right) \quad\text{and}\quad \hat\beta_1 \sim N\left(\beta_1,\; \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar x)^2}\right). \]
The error variance σ² is estimated by:
\[ \hat\sigma^2 = \frac{1}{n-2}\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2. \]
Replacing σ² by σ̂² in the variances of the LSEs gives their estimated standard errors:
\[ \mathrm{E.S.E.}(\hat\beta_0) = \hat\sigma\left(\frac{\sum_{i=1}^{n} x_i^2}{n\sum_{i=1}^{n} (x_i - \bar x)^2}\right)^{1/2} \]
and:
\[ \mathrm{E.S.E.}(\hat\beta_1) = \frac{\hat\sigma}{\left(\sum_{i=1}^{n} (x_i - \bar x)^2\right)^{1/2}}. \]
The following results all make use of distributional results introduced earlier in the
course. Statistical inference (confidence intervals and hypothesis testing) for the normal
simple linear regression model can then be performed.
i. We have:
\[ \frac{(n-2)\hat\sigma^2}{\sigma^2} = \frac{\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2}{\sigma^2} \sim \chi^2_{n-2}. \]
ii. Also:
\[ \frac{\hat\beta_0 - \beta_0}{\mathrm{E.S.E.}(\hat\beta_0)} \sim t_{n-2}. \]
iii. Similarly:
\[ \frac{\hat\beta_1 - \beta_1}{\mathrm{E.S.E.}(\hat\beta_1)} \sim t_{n-2}. \]
Hence a 100(1 − α)% confidence interval for β1 is:
\[ \hat\beta_1 \pm t_{\alpha/2,\, n-2} \times \mathrm{E.S.E.}(\hat\beta_1) \]
and similarly for β0, where tα, k denotes the top 100αth percentile of the Student's tk distribution, obtained from Table 10 of the New Cambridge Statistical Tables.
A test of particular interest is:
H0 : β1 = 0 vs. H1 : β1 ≠ 0.
Under H0, the test statistic is:
\[ T = \frac{\hat\beta_1}{\mathrm{E.S.E.}(\hat\beta_1)} \sim t_{n-2}. \]
At the 100α% significance level, we reject H0 if |t| > tα/2, n−2 , where t is the observed
test statistic value.
Alternatively, we could use H1 : β1 < 0 or H1 : β1 > 0 if there was a rationale for
doing so. In such cases, we would reject H0 if t < −tα, n−2 and t > tα, n−2 for the
lower-tailed and upper-tailed t tests, respectively.
i. For testing H0 : β1 = b for a given constant b, the above test still applies, but now
with the following test statistic:
\[ T = \frac{\hat\beta_1 - b}{\mathrm{E.S.E.}(\hat\beta_1)}. \]
ii. Tests for the regression intercept β0 may be constructed in a similar manner,
replacing β1 and βb1 with β0 and βb0 , respectively.
In the normal regression model, the LSEs βb0 and βb1 are also the MLEs of β0 and β1 ,
respectively.
Since εi = yi − β0 − β1 xi ∼IID N (0, σ 2 ), the likelihood function is:
\[ L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(y_i - \beta_0 - \beta_1 x_i)^2\right) \propto \left(\frac{1}{\sigma^2}\right)^{n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2\right). \]
Hence the log-likelihood function is:
\[ l(\beta_0, \beta_1, \sigma^2) = \frac{n}{2}\ln\frac{1}{\sigma^2} - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 + c. \]
Maximising over σ², and writing u = 1/σ², is equivalent to maximising:
\[ g(u) = n\ln u - ub \quad\text{where } b = \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2. \]
Setting g′(u) = n/u − b = 0 gives u = n/b, so the MLE of σ² is b/n (which differs from the unbiased estimator σ̂², whose divisor is n − 2).
Example 11.4 The dataset ‘Cigarette.csv’ (available on the VLE) contains the
annual cigarette consumption, x, and the corresponding mortality rate, y, due to
coronary heart disease (CHD) of 21 countries. Some useful summary statistics
calculated from the data are:
\[ \sum_{i=1}^{21} x_i = 45{,}110, \quad \sum_{i=1}^{21} y_i = 3{,}042.2, \quad \sum_{i=1}^{21} x_i^2 = 109{,}957{,}100, \]
\[ \sum_{i=1}^{21} y_i^2 = 529{,}321.58 \quad\text{and}\quad \sum_{i=1}^{21} x_i y_i = 7{,}319{,}602. \]
Do these data support the suspicion that smoking contributes to CHD mortality?
(Note the assertion ‘smoking is harmful for health’ is largely based on statistical,
rather than laboratory, evidence.)
We fit the regression model y = β0 + β1 x + ε. Our least squares estimates of β1 and
β0 are, respectively:
\[ \hat\beta_1 = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2} = \frac{\sum_i x_i y_i - n\bar x\bar y}{\sum_i x_i^2 - n\bar x^2} = \frac{\sum_i x_i y_i - \sum_i x_i\sum_j y_j/n}{\sum_i x_i^2 - (\sum_i x_i)^2/n} = \frac{7{,}319{,}602 - 45{,}110 \times 3{,}042.2/21}{109{,}957{,}100 - (45{,}110)^2/21} = 0.06 \]
and:
\[ \hat\beta_0 = \bar y - \hat\beta_1\bar x = \frac{3{,}042.2 - 0.06 \times 45{,}110}{21} = 15.77. \]
Also:
\[ \hat\sigma^2 = \sum_i (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2/(n-2) = \left(\sum_i y_i^2 + n\hat\beta_0^2 + \hat\beta_1^2\sum_i x_i^2 - 2\hat\beta_0\sum_i y_i - 2\hat\beta_1\sum_i x_i y_i + 2\hat\beta_0\hat\beta_1\sum_i x_i\right)\Big/(n-2) = 2{,}181.66. \]
To test H0 : β1 = 0 (no linear association between smoking and CHD mortality), the test statistic is:
\[ T = \frac{\hat\beta_1}{\mathrm{E.S.E.}(\hat\beta_1)} \sim t_{n-2} = t_{19} \]
under H0.
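A sketch of the remaining arithmetic in R, working only from the summary statistics listed above (the variable names are ours):

n <- 21
sx <- 45110; sy <- 3042.2
sxx <- 109957100; syy <- 529321.58; sxy <- 7319602
b1 <- (sxy - sx * sy / n) / (sxx - sx^2 / n)   # about 0.06
b0 <- sy / n - b1 * sx / n                     # about 15.77
rss <- syy - b0 * sy - b1 * sxy                # residual sum of squares
sigma2.hat <- rss / (n - 2)                    # about 2,181.66
t.stat <- b1 / sqrt(sigma2.hat / (sxx - sx^2 / n))
t.stat                                         # compare with percentiles of t_19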
Total SS is \( \sum_{i=1}^{n} (y_i - \bar y)^2 \).

Regression (explained) SS is \( \hat\beta_1^2\sum_{i=1}^{n} (x_i - \bar x)^2 = \hat\beta_1^2\left(\sum_{i=1}^{n} x_i^2 - n\bar x^2\right) \).

Residual (error) SS is \( \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2 \) = Total SS − Regression SS.
Under H0 : β1 = 0, it can be shown that:
\[ \hat\beta_1^2\sum_{i=1}^{n} (x_i - \bar x)^2/\sigma^2 \sim \chi^2_1 \quad\text{and}\quad \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2/\sigma^2 \sim \chi^2_{n-2} \]
and that the two are independent. Hence the test statistic:
\[ F = \frac{\text{Regression SS}}{\text{Residual SS}/(n-2)} \sim F_{1,\, n-2}. \]
We reject H0 at the 100α% significance level if f > Fα, 1, n−2 , where f is the observed
test statistic value and Fα, 1, n−2 is the top 100αth percentile of the F1, n−2 distribution,
obtained from Table A.3 of the Dougherty Statistical Tables.
A useful statistic is the coefficient of determination, denoted as R2 , defined as:
\[ R^2 = \frac{\text{Regression SS}}{\text{Total SS}} = 1 - \frac{\text{Residual SS}}{\text{Total SS}}. \]
If we view Total SS as the total variation (or energy) of y, then R2 is the proportion of
the total variation of y explained by x. Note that R2 ∈ [0, 1]. The closer R2 is to 1, the
better the explanatory power of the regression model.
For a given value of x, the fitted (predicted) value of y is:
\[ \hat y = \hat\beta_0 + \hat\beta_1 x. \]
Write µ(x) = E(y) = β0 + β1x for the mean of y at a given value of x, and let µ̂(x) = β̂0 + β̂1x be its estimator. Then:
\[ \frac{\hat\mu(x) - \mu(x)}{\sqrt{(\hat\sigma^2/n)\sum_{i=1}^{n} (x_i - x)^2\big/\sum_{j=1}^{n} (x_j - \bar x)^2}} \sim t_{n-2}. \]
Hence a 100(1 − α)% confidence interval for E(y) = µ(x) is:
\[ \hat\mu(x) \pm t_{\alpha/2,\, n-2} \times \sqrt{(\hat\sigma^2/n)\sum_{i=1}^{n} (x_i - x)^2\big/\sum_{j=1}^{n} (x_j - \bar x)^2}. \]
Such a confidence interval contains the true expectation E(y) = µ(x) with probability
1 − α over repeated samples. It does not cover y with probability 1 − α.
Therefore:
\[ (y - \hat\mu(x))\Bigg/\left(\hat\sigma^2\left(1 + \frac{\sum_{i=1}^{n} (x_i - x)^2}{n\sum_{j=1}^{n} (x_j - \bar x)^2}\right)\right)^{1/2} \sim t_{n-2}. \]
i. It holds that:
\[ P\left(y \in \hat\mu(x) \pm t_{\alpha/2,\, n-2} \times \hat\sigma \times \left(1 + \frac{\sum_{i=1}^{n} (x_i - x)^2}{n\sum_{j=1}^{n} (x_j - \bar x)^2}\right)^{1/2}\right) = 1 - \alpha. \]
ii. The prediction interval for y is wider than the confidence interval for E(y). The
former contains the unobserved random variable y with probability 1 − α, the
latter contains the unknown constant E(y) with probability 1 − α over repeated
samples.
Example 11.5 The dataset ‘UsedFord.csv’ (available on the VLE) contains the
prices (y, in $000s) of 100 three-year-old Ford Tauruses together with their mileages
(x, in thousands of miles) when they were sold at auction. Based on these data, a
car dealer needs to make two decisions.
1. To prepare cash for bidding on one three-year-old Ford Taurus with a mileage of
x = 40.
2. To prepare cash for buying several three-year-old Ford Tauruses with mileages
close to x = 40 from a rental company.
For the first task, a prediction interval would be more appropriate. For the second
task, the car dealer needs to know the average price and, therefore, a confidence
interval is appropriate. This can be easily done using R.
Call:
lm(formula = Price ~ Mileage)
Residuals:
Min 1Q Median 3Q Max
-0.68679 -0.27263 0.00521 0.23210 0.70071
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.248727 0.182093 94.72 <2e-16 ***
Mileage -0.066861 0.004975 -13.44 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We predict that a Ford Taurus will sell for between $13,922 and $15,227. The
average selling price of several three-year-old Ford Tauruses is estimated to be
between $14,498 and $14,650. Because predicting the selling price for one car is more
difficult, the corresponding prediction interval is wider than the confidence interval.
To produce the plots with confidence intervals for E(y) and prediction intervals for
y, we proceed as follows:
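One way to do this, sketched below, uses predict() with interval = "confidence" and interval = "prediction" on the fitted model (here stored as fit; the Price and Mileage variables are assumed attached, as in the output above):

fit <- lm(Price ~ Mileage)
new.x <- data.frame(Mileage = seq(min(Mileage), max(Mileage), length.out = 100))
ci <- predict(fit, new.x, interval = "confidence")
pi <- predict(fit, new.x, interval = "prediction")
plot(Mileage, Price, pch = 16)
lines(new.x$Mileage, ci[, "fit"])
lines(new.x$Mileage, ci[, "lwr"], lty = 2); lines(new.x$Mileage, ci[, "upr"], lty = 2)
lines(new.x$Mileage, pi[, "lwr"], lty = 3); lines(new.x$Mileage, pi[, "upr"], lty = 3)
predict(fit, data.frame(Mileage = 40), interval = "prediction")   # intervals quoted in the text
predict(fit, data.frame(Mileage = 40), interval = "confidence")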
(Figure: scatter plot of Price against Mileage, with the fitted regression line and the 95% confidence and prediction bands.)
The multiple linear regression model with p explanatory variables is:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon. \]
Estimation of the intercept and slope parameters is still performed using least squares estimation. The LSEs β̂0, β̂1, β̂2, …, β̂p are obtained by minimising:
\[ \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 \]
The residuals are:
\[ \hat\varepsilon_i = y_i - \hat\beta_0 - \sum_{j=1}^{p} \hat\beta_j x_{ij}. \]
Just as with the simple linear regression model, we can decompose the total variation of
y such that:
\[ \sum_{i=1}^{n} (y_i - \bar y)^2 = \sum_{i=1}^{n} (\hat y_i - \bar y)^2 + \sum_{i=1}^{n} \hat\varepsilon_i^2 \]
or, in words:
Total SS = Regression SS + Residual SS.
The error variance σ² is estimated by:
\[ \hat\sigma^2 = \frac{1}{n-p-1}\sum_{i=1}^{n} \left(y_i - \hat\beta_0 - \sum_{j=1}^{p} \hat\beta_j x_{ij}\right)^2 = \frac{\text{Residual SS}}{n-p-1}. \]
To test the significance of an individual regression coefficient, we test:
H0 : βi = 0 vs. H1 : βi ≠ 0.
Under H0, the test statistic is:
\[ T = \frac{\hat\beta_i}{\mathrm{E.S.E.}(\hat\beta_i)} \sim t_{n-p-1} \]
and we reject H0 if |t| > tα/2, n−p−1 . However, note the slight difference in the
interpretation of the slope coefficient βj . In the multiple regression setting, βj is the
effect of xj on y, holding all other independent variables fixed – this is unfortunately
not always practical.
It is also possible to test whether all the regression coefficients are equal to zero. This is
known as a joint test of significance and can be used to test the overall significance
of the regression model, i.e. whether there is at least one significant explanatory
(independent) variable, by testing:

H0 : β1 = β2 = · · · = βp = 0 vs. H1 : at least one βj ≠ 0.
Indeed, it is preferable to perform this joint test of significance before conducting t tests
of individual slope coefficients. Failure to reject H0 would render the model useless and
hence the model would not warrant any further statistical investigation.
Under H0, the test statistic is:
\[ F = \frac{(\text{Regression SS})/p}{(\text{Residual SS})/(n-p-1)} \sim F_{p,\, n-p-1} \]
where:
\[ \text{Regression SS} = \sum_{i=1}^{n} (\hat y_i - \bar y)^2 = \sum_{i=1}^{n} (\hat\beta_1(x_{i1} - \bar x_1) + \hat\beta_2(x_{i2} - \bar x_2) + \cdots + \hat\beta_p(x_{ip} - \bar x_p))^2. \]
We now conclude the chapter with worked examples of linear regression using R.
Example 11.6 We illustrate the use of linear regression in R using the dataset
‘Armand.csv’, introduced in Example 11.1.
Call:
lm(formula = Sales ~ Student.population)
Residuals:
Min 1Q Median 3Q Max
-21.00 -9.75 -3.00 11.25 18.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.0000 9.2260 6.503 0.000187 ***
Student.population 5.0000 0.5803 8.617 2.55e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The fitted line is yb = 60 + 5x. We have σb2 = (13.83)2 . Also, βb0 = 60 and
E.S.E.(βb0 ) = 9.2260. βb1 = 5 and E.S.E.(βb1 ) = 0.5803.
For testing H0 : β0 = 0 we have t = βb0 /E.S.E.(βb0 ) = 6.503. The p-value is
P (|T | > 6.503) = 0.000187, where T ∼ tn−2 .
For testing H0 : β1 = 0 we have t = βb1 /E.S.E.(βb1 ) = 8.617. The p-value is
P (|T | > 8.617) = 0.0000255, where T ∼ tn−2 .
The F test statistic value is 74.25 with a corresponding p-value of P(F > 74.25) = 0.0000255, where F ∼ F1, 8. (For simple linear regression, the F test is equivalent to the two-sided t test of H0 : β1 = 0, so the p-values coincide.)
Example 11.7 We apply the simple linear regression model to study the
relationship between two series of financial returns – a regression of Cisco Systems
stock returns, y, on S&P500 Index returns, x. This regression model is an example of
the capital asset pricing model (CAPM).
Stock returns are defined as:
\[ \text{return} = \frac{\text{current price} - \text{previous price}}{\text{previous price}} \approx \ln\left(\frac{\text{current price}}{\text{previous price}}\right) \]
when the difference between the two prices is small.
The data file ‘Returns.csv’ (available on the VLE) contains daily returns over the
period 3 January – 29 December 2000 (i.e. n = 252 observations). The dataset has 5
columns: Day, S&P500 return, Cisco return, Intel return and Sprint return.
Daily prices are definitely not independent. However, daily returns may be seen as a
sequence of uncorrelated random variables.
> summary(S.P500)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.00451 -0.85028 -0.03791 -0.04242 0.79869 4.65458
> summary(Cisco)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-13.4387 -3.0819 -0.1150 -0.1336 2.6363 15.4151
For the S&P500, the average daily return is −0.04%, the maximum daily return is 4.65%, the minimum daily return is −6.00% and the standard deviation is 1.40%.
For Cisco, the average daily return is −0.13%, the maximum daily return is 15.42%,
the minimum daily return is −13.44% and the standard deviation is 4.23%.
We see that Cisco is much more volatile than the S&P500.
(Figure: time series plot of the daily returns on the S&P500 and Cisco over the sample period.)
There is clear synchronisation between the movements of the two series of returns,
as evident from examining the sample correlation coefficient.
> cor.test(S.P500,Cisco)
Call:
lm(formula = Cisco ~ S.P500)
Residuals:
Min 1Q Median 3Q Max
-13.1175 -2.0238 0.0091 2.0614 9.9491
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.04547 0.19433 -0.234 0.815
S.P500 2.07715 0.13900 14.943 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
iii. Variance is a simple measure (and one of the most frequently-used) of risk in
finance.
Example 11.8 The data in the file ‘Foods.csv’ (available on the VLE) illustrate
the effects of marketing instruments on the weekly sales volume of a certain food
product over a three-year period. Data are real but transformed to protect the
innocent!
There are observations on the following four variables:
y = LVOL: logarithms of weekly sales volume
x1 = PROMP : promotion price
x2 = FEAT : feature advertising
x3 = DISP : display measure.
> summary(Foods)
LVOL PROMP FEAT DISP
Min. :13.83 Min. :3.075 Min. : 2.84 Min. :12.42
1st Qu.:14.08 1st Qu.:3.330 1st Qu.:15.95 1st Qu.:20.59
Median :14.24 Median :3.460 Median :22.99 Median :25.11
Mean :14.28 Mean :3.451 Mean :24.84 Mean :25.31
3rd Qu.:14.43 3rd Qu.:3.560 3rd Qu.:33.49 3rd Qu.:29.34
Max. :15.07 Max. :3.865 Max. :57.10 Max. :45.94
n = 156. The values of FEAT and DISP are much larger than LVOL.
As always, first we plot the data to ascertain basic characteristics.
(Figure: time series plot of LVOL over the 156 weeks.)
> plot(PROMP,LVOL,pch=16)
(Figure: scatter plot of LVOL against PROMP.)
> plot(FEAT,LVOL,pch=16)
(Figure: scatter plot of LVOL against FEAT.)
> plot(DISP,LVOL,pch=16)
(Figure: scatter plot of LVOL against DISP.)
There is little or no correlation between LVOL and DISP, but this might have
been blurred by the other input variables.
We first fit a regression model with the two explanatory variables PROMP and FEAT:
y = β0 + β1x1 + β2x2 + ε.
Call:
lm(formula = LVOL ~ PROMP + FEAT)
Residuals:
Min 1Q Median 3Q Max
-0.32734 -0.08519 -0.01011 0.08471 0.30804
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.1500102 0.2487489 68.94 <2e-16 ***
PROMP -0.9042636 0.0694338 -13.02 <2e-16 ***
FEAT 0.0100666 0.0008827 11.40 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Consider now introducing DISP into the regression model to give three explanatory
variables:
y = β0 + β1 x1 + β2 x2 + β3 x3 + ε.
The reason for adding the third variable is that one would expect DISP to have an
impact on sales and we may wish to estimate its magnitude.
Call:
lm(formula = LVOL ~ PROMP + FEAT + DISP)
Residuals:
Min 1Q Median 3Q Max
-0.33363 -0.08203 -0.00272 0.07927 0.33812
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.2372251 0.2490226 69.220 <2e-16 ***
PROMP -0.9564415 0.0726777 -13.160 <2e-16 ***
FEAT 0.0101421 0.0008728 11.620 <2e-16 ***
DISP 0.0035945 0.0016529 2.175 0.0312 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
All the estimated coefficients have the right sign (according to commercial common
sense!) and are statistically significant. In particular, the relationship with DISP
seems real when the other inputs are taken into account. On the other hand, the addition of DISP to the model has resulted in a very small reduction in σ̂, from √0.0161 = 0.1268 to √0.0157 = 0.1253, and correspondingly a slightly higher R²
(0.7633, i.e. 76.33% of the variation of LVOL is explained by the model). Therefore,
DISP contributes very little to ‘explaining’ the variation of LVOL after the other
two explanatory variables, PROMP and FEAT, are taken into account.
Intuitively, we would expect a higher R2 if we add a further explanatory variable to
the model. However, the model has become more complex as a result – there is an
additional parameter to estimate. Therefore, strictly speaking, we should consider
the ‘adjusted R2 ’ statistic, although this will not be considered in this course.
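A sketch of this comparison in R, assuming the Foods data are attached as in the earlier output:

fit2 <- lm(LVOL ~ PROMP + FEAT)
fit3 <- lm(LVOL ~ PROMP + FEAT + DISP)
summary(fit2)$sigma^2        # about 0.0161
summary(fit3)$sigma^2        # about 0.0157
summary(fit3)$r.squared      # about 0.7633
summary(fit3)$adj.r.squared  # the adjusted R^2 mentioned above
anova(fit2, fit3)            # F test of whether DISP adds explanatory power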
Special care should be exercised when predicting with x out of the range of the
observations used to fit the model, which is called extrapolation.
yi = β0 + β1 xi + εi
(a) Find the least squares estimates of β0 and β1 to three decimal places, and
write down the fitted regression model.
(b) Compute the estimated standard error of the least squares estimator of β1 .
(d) Perform a test of H0 : β1 = 0.37 vs. H1 : β1 > 0.37 at the 10% significance level.
(e) For x = 18, determine a prediction interval which covers y with probability
0.95.
hence:
\[ \hat\beta = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}. \]
(d) We have:
\[ \sum_{i=1}^{n} (y_i - \hat\beta x_i)^2 = \sum_{i=1}^{n} ((y_i - \beta x_i) - (\hat\beta - \beta)x_i)^2 = \sum_{i=1}^{n} (y_i - \beta x_i)^2 - 2\sum_{i=1}^{n} (y_i - \beta x_i)x_i(\hat\beta - \beta) + (\hat\beta - \beta)^2\sum_{i=1}^{n} x_i^2. \]
Therefore:
\[ \sum_{i=1}^{n} (y_i - \hat\beta x_i)^2 = \sum_{i=1}^{n} (y_i - \beta x_i)^2 - 2(\hat\beta - \beta)^2\sum_{i=1}^{n} x_i^2 + (\hat\beta - \beta)^2\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} (y_i - \beta x_i)^2 - (\hat\beta - \beta)^2\sum_{i=1}^{n} x_i^2. \]
(e) We have:
\[ E(\hat\sigma^2) = E\left(\frac{1}{n-1}\sum_{i=1}^{n} (y_i - \hat\beta x_i)^2\right) = \frac{1}{n-1}E\left(\sum_{i=1}^{n} (y_i - \beta x_i)^2 - (\hat\beta - \beta)^2\sum_{i=1}^{n} x_i^2\right) \]
\[ = \frac{1}{n-1}\left(\sum_{i=1}^{n} E((y_i - \beta x_i)^2) - \sum_{i=1}^{n} x_i^2\, E((\hat\beta - \beta)^2)\right) = \frac{1}{n-1}\left(\sum_{i=1}^{n} E(\varepsilon_i^2) - \sum_{i=1}^{n} x_i^2\,\mathrm{Var}(\hat\beta)\right) \]
\[ = \frac{1}{n-1}\left(n\sigma^2 - \sum_{i=1}^{n} x_i^2 \times \frac{\sigma^2}{\sum_{i=1}^{n} x_i^2}\right) = \sigma^2 \]
hence σ̂² is an unbiased estimator of σ².
\[ \text{Regression SS} = \hat\beta_1^2\sum_i (x_i - \bar x)^2 = 0.375^2 \times 800 = 112.5 \]
(c) We have:
\[ \frac{\hat\beta_1 - \beta_1}{\mathrm{E.S.E.}(\hat\beta_1)} \sim t_{n-2} = t_{18}. \]
Under H0 : β1 = 0, we have:
\[ T = \frac{\hat\beta_1}{\mathrm{E.S.E.}(\hat\beta_1)} \sim t_{18} \]
and we reject H0 if |t| > t0.005, 18 = 2.878. Since t = 0.375/0.078 = 4.808, we
reject H0 : β1 = 0 at the 1% significance level.
(d) Under H0 : β1 = 0.37, the test statistic is:
\[ T = \frac{\hat\beta_1 - 0.37}{\mathrm{E.S.E.}(\hat\beta_1)} \sim t_{18}. \]
We reject H0 if t > t0.10, 18 = 1.330. Since t = (0.375 − 0.37)/0.078 = 0.064, we
cannot reject H0 : β1 = 0.37 at the 10% significance level.
Appendix A
Data visualisation and descriptive
statistics
A.1 (Re)vision of fundamentals

The summation operator has the following properties.

1. \( \sum_{i=1}^{n} a = n \times a \).

• Proof: \( \sum_{i=1}^{n} a = (a + a + \cdots + a) \) (n times) \( = n \times a \).

2. \( \sum_{i=1}^{n} aX_i = a\sum_{i=1}^{n} X_i \).

• Proof: \( \sum_{i=1}^{n} aX_i = (aX_1 + aX_2 + \cdots + aX_n) = a(X_1 + X_2 + \cdots + X_n) = a\sum_{i=1}^{n} X_i \).

3. \( \sum_{i=1}^{n} (X_i + Y_i) = \sum_{i=1}^{n} X_i + \sum_{i=1}^{n} Y_i \).

• Proof:
\[ \sum_{i=1}^{n} (X_i + Y_i) = ((X_1 + Y_1) + \cdots + (X_n + Y_n)) = (X_1 + \cdots + X_n) + (Y_1 + \cdots + Y_n) = \sum_{i=1}^{n} X_i + \sum_{i=1}^{n} Y_i. \]
Sometimes sets of numbers may be indexed with two (or even more) subscripts, for
example as Xij , for i = 1, 2, . . . , n and j = 1, 2, . . . , m.
Summation over both indices is written as:
\[ \sum_{i=1}^{n}\sum_{j=1}^{m} X_{ij} = \sum_{i=1}^{n} (X_{i1} + X_{i2} + \cdots + X_{im}) \]
Product notation
1. \( \prod_{i=1}^{n} aX_i = a^n\prod_{i=1}^{n} X_i \).

2. \( \prod_{i=1}^{n} a = a^n \).

3. \( \prod_{i=1}^{n} X_i Y_i = \prod_{i=1}^{n} X_i\prod_{i=1}^{n} Y_i \).
The mean is ‘in the middle’ of the observations X1 , X2 , . . . , Xn , in the sense that
positive and negative values of the deviations Xi − X̄ cancel out, when summed over
all the observations, that is:
\[ \sum_{i=1}^{n} (X_i - \bar X) = 0. \]
Proof: (The proof uses the definition of X̄ and the properties of summation introduced
earlier. Note that X̄ is a constant in the summation, because it has the same value for
all i.)
\[ \sum_{i=1}^{n} (X_i - \bar X) = \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} \bar X = \sum_{i=1}^{n} X_i - n\bar X = \sum_{i=1}^{n} X_i - n\frac{\sum_{i=1}^{n} X_i}{n} = \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} X_i = 0. \]
The smallest possible value of the sum of squared deviations ∑_{i=1}^{n} (Xi − C)², for any constant C, is obtained when C = X̄.
Proof:
\[ \sum (X_i - C)^2 = \sum (X_i - \bar X + \bar X - C)^2 = \sum ((X_i - \bar X) + (\bar X - C))^2 \]
\[ = \sum ((X_i - \bar X)^2 + 2(X_i - \bar X)(\bar X - C) + (\bar X - C)^2) \]
\[ = \sum (X_i - \bar X)^2 + 2(\bar X - C)\underbrace{\sum (X_i - \bar X)}_{=0} + n(\bar X - C)^2 \]
\[ = \sum (X_i - \bar X)^2 + n(\bar X - C)^2 \ge \sum (X_i - \bar X)^2 \]
since n(X̄ − C)2 ≥ 0 for any choice of C. Equality is obtained only when C = X̄, so
that n(X̄ − C)2 = 0.
Another useful result is:
\[ \sum_{i=1}^{n} (X_i - \bar X)^2 = \sum_{i=1}^{n} X_i^2 - n\bar X^2. \]
Proof: We have:
\[ \sum_{i=1}^{n} (X_i - \bar X)^2 = \sum_{i=1}^{n} (X_i^2 - 2X_i\bar X + \bar X^2) = \sum_{i=1}^{n} X_i^2 - 2\bar X\underbrace{\sum_{i=1}^{n} X_i}_{=n\bar X} + \underbrace{\sum_{i=1}^{n} \bar X^2}_{=n\bar X^2} = \sum_{i=1}^{n} X_i^2 - n\bar X^2. \]
Therefore, the sample variance can also be calculated as:
\[ S^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar X^2\right) \]
(and the standard deviation S = √S² again).
This formula is most convenient for calculations done by hand when summary statistics such as ∑_i Xi and ∑_i Xi² are provided.
Sample moment
The kth sample moment and the kth central sample moment of X1, X2, …, Xn are defined, respectively, as:
\[ m_k = \frac{1}{n}\sum_{i=1}^{n} X_i^k \quad\text{and}\quad m_k' = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar X)^k. \]
In other words, these are sample averages of the powers Xᵢᵏ and (Xi − X̄)ᵏ, respectively. Clearly:
\[ \bar X = m_1 \quad\text{and}\quad S^2 = \frac{n}{n-1}m_2' = \frac{1}{n-1}(nm_2 - n(m_1)^2). \]
Moments of powers 3 and 4 are used in two more summary statistics which are
described next, for reference only.
These are used much less often than measures of central tendency and dispersion.
A distribution with high kurtosis (i.e. leptokurtic) has a sharp peak and a high
proportion of observations in the tails far from the peak.
A distribution with low kurtosis (i.e. platykurtic) is ‘flat’, with no pronounced peak
with most of the observations spread evenly around the middle and weak tails.
A sample measure of kurtosis is:
\[ g_2 = \frac{m_4'}{(m_2')^2} - 3 = \frac{\sum_i (X_i - \bar X)^4/n}{\left(\sum_i (X_i - \bar X)^2/n\right)^2} - 3. \]
g2 > 0 for leptokurtic and g2 < 0 for platykurtic distributions, and g2 = 0 for the normal
distribution (introduced in Chapter 4). Some software packages define a measure of
kurtosis without the −3, i.e. ‘excess kurtosis’.
This is how computer software calculates general sample quantiles (or how you can do
so by hand, if you ever needed to).
Suppose we need to calculate the cth sample quantile, qc , where 0 < c < 100. Let
R = (n + 1)c/100, and define r as the integer part of R and f = R − r as the fractional
part (if R is an integer, r = R and f = 0). It follows that:
qc = X(r) + f (X(r+1) − X(r) ) = (1 − f )X(r) + f X(r+1) .
For example, if n = 10:
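A short R sketch of this interpolation rule for an illustrative ordered sample of n = 10 observations (R's quantile() with type = 6 uses exactly the (n + 1)c/100 definition):

x <- c(3, 7, 8, 12, 13, 15, 18, 21, 24, 30)    # made-up ordered data, n = 10
R <- (length(x) + 1) * 25 / 100                # lower quartile, c = 25: R = 2.75
r <- floor(R); f <- R - r
x[r] + f * (x[r + 1] - x[r])                   # 7 + 0.75*(8 - 7) = 7.75
quantile(x, probs = 0.25, type = 6)            # same rule, built in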
Solution:
Begin with the left-hand side and proceed as follows:
\[ \sum_{i=1}^{n}\sum_{j=1}^{n} (x_i - x_j)^2 = \sum_{i=1}^{n}\left[\sum_{j=1}^{n} (x_i - x_j)^2\right]. \]
Now, recall that x̄ = ∑_{i=1}^{n} xi/n, so re-write as:
\[ = \sum_{i=1}^{n}\left[n x_i^2 - 2x_i n\bar x + \sum_{j=1}^{n} x_j^2\right]. \]
Re-arrange again:
\[ = n\sum_{i=1}^{n} x_i^2 - 2n\bar x\sum_{i=1}^{n} x_i + \sum_{j=1}^{n} x_j^2\sum_{i=1}^{n} 1. \]
Finally, add terms, factor out 2n, apply the ‘x̄ trick’ . . . and you’re done!
" n # " n #
X X
= 2n x2i − nx̄2 = 2n (xi − x̄)2 .
i=1 i=1
Hint: there are three terms in the expression of (a), nine terms in (b) and six terms
in (c). Write out the terms, and try to find ways to simplify them which avoid the
need for a lot of messy algebra!
(c) s.d.y = |a| s.d.x , where s.d.y is the standard deviation of y etc.
What are the mean and standard deviation of the set {x1 + k, x2 + k, . . . , xn + k}
where k is a constant? What are the mean and standard deviation of the set
{cx1 , cx2 , . . . , cxn } where c is a constant? Justify your answers with reference to the
above results.
Appendix B
Probability theory
Therefore:
\[ 2\pi^2 - 3\pi + 0.8 = 0 \quad\Rightarrow\quad \pi = \frac{3 \pm \sqrt{9 - 6.4}}{4}. \]
Hence π = 0.346887, since the other root is > 1!
However:
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} > P(A) \quad\text{i.e.}\quad P(A \cap B) > P(A)\,P(B). \]
Hence:
\[ P(A^c \mid B^c) > \frac{1 - P(A) - P(B) + P(A)\,P(B)}{1 - P(B)} = 1 - P(A) = P(A^c). \]
4. A and B are any two events in the sample space S. The binary set operator ∨
denotes an exclusive union, such that:
A ∨ B = (A ∪ B) ∩ (A ∩ B)c = {s | s ∈ A or B, and s ∉ (A ∩ B)}.
Show, from the axioms of probability, that:
(a) P (A ∨ B) = P (A) + P (B) − 2 × P (A ∩ B)
(b) P (A ∨ B | A) = 1 − P (B | A).
Solution:
(a) We have:
A ∨ B = (A ∩ B c ) ∪ (B ∩ Ac ).
By axiom 3, noting that (A ∩ B c ) and (B ∩ Ac ) are disjoint:
P (A ∨ B) = P (A ∩ B c ) + P (B ∩ Ac ).
We can write A = (A ∩ B) ∪ (A ∩ B c ), hence (using axiom 3):
P (A ∩ B c ) = P (A) − P (A ∩ B).
Similarly, P (B ∩ Ac ) = P (B) − P (A ∩ B), hence:
P (A ∨ B) = P (A) + P (B) − 2 × P (A ∩ B).
(b) We have:
\[ P(A \vee B \mid A) = \frac{P((A \vee B) \cap A)}{P(A)} = \frac{P(A \cap B^c)}{P(A)} = \frac{P(A) - P(A \cap B)}{P(A)} = \frac{P(A)}{P(A)} - \frac{P(A \cap B)}{P(A)} = 1 - P(B \mid A). \]
Solution:
Bayes’ theorem is:
\[ P(B_j \mid A) = \frac{P(A \mid B_j)\,P(B_j)}{\sum_{i=1}^{K} P(A \mid B_i)\,P(B_i)}. \]
By definition:
\[ P(B_j \mid A) = \frac{P(B_j \cap A)}{P(A)} = \frac{P(A \mid B_j)\,P(B_j)}{P(A)}. \]
If {Bi}, for i = 1, 2, …, K, is a partition of the sample space S, then:
\[ P(A) = \sum_{i=1}^{K} P(A \cap B_i) = \sum_{i=1}^{K} P(A \mid B_i)\,P(B_i). \]
6. A man has two bags. Bag A contains five keys and bag B contains seven keys. Only
one of the twelve keys fits the lock which he is trying to open. The man selects a
bag at random, picks out a key from the bag at random and tries that key in the
lock. What is the probability that the key he has chosen fits the lock?
Solution:
Define a partition {Ci }, such that:
C1 = key in bag A and bag A chosen ⇒ P(C1) = 5/12 × 1/2 = 5/24
C2 = key in bag B and bag A chosen ⇒ P(C2) = 7/12 × 1/2 = 7/24
C3 = key in bag A and bag B chosen ⇒ P(C3) = 5/12 × 1/2 = 5/24
C4 = key in bag B and bag B chosen ⇒ P(C4) = 7/12 × 1/2 = 7/24.
Hence we require, defining the event F = 'key fits':
\[ P(F) = \frac{1}{5} \times P(C_1) + \frac{1}{7} \times P(C_4) = \frac{1}{5} \times \frac{5}{24} + \frac{1}{7} \times \frac{7}{24} = \frac{1}{12}. \]
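A quick Monte Carlo check of this answer in R (a sketch; the estimate should be close to 1/12 ≈ 0.083):

set.seed(1)
fits <- replicate(1e5, {
  key.bag    <- sample(c("A", "B"), 1, prob = c(5/12, 7/12))   # bag containing the key
  chosen.bag <- sample(c("A", "B"), 1)                          # bag picked at random
  n.keys <- if (chosen.bag == "A") 5 else 7
  key.bag == chosen.bag && runif(1) < 1 / n.keys                # chosen key is the right one
})
mean(fits)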
7. Continuing with Question 6, suppose the first key chosen does not fit the lock.
What is the probability that the bag chosen:
(a) is bag A?
(b) contains the required key?
309
B. Probability theory
Solution:
P (F c | C1 ) P (C1 ) + P (F c | C2 ) P (C2 )
P (bag A | F c ) = 4
.
P c
P (F | Ci ) P (Ci )
i=1
4 6
P (F c | C1 ) = , P (F c | C2 ) = 1, P (F c | C3 ) = 1 and P (F c | C4 ) = .
5 7
Hence:
4/5 × 5/24 + 1 × 7/24 1
P (bag A | F c ) = = .
4/5 × 5/24 + 1 × 7/24 + 1 × 5/24 + 6/7 × 7/24 2
P (F c | C1 ) P (C1 ) + P (F c | C4 ) P (C4 )
P (right bag | F c ) = 4
P
P (F c | Ci ) P (Ci )
i=1
8. Assume that a calculator has a ‘random number’ key and that when the key is
pressed an integer between 0 and 999 inclusive is generated at random, all numbers
being generated independently of one another.
(a) What is the probability that the number generated is less than 300?
(b) If two numbers are generated, what is the probability that both are less than
300?
(c) If two numbers are generated, what is the probability that the first number
exceeds the second number?
(d) If two numbers are generated, what is the probability that the first number
exceeds the second number, and their sum is exactly 300?
(e) If five numbers are generated, what is the probability that at least one number
occurs more than once?
Solution:
You can assume that all unions and intersections of Ai and B are also events in S.
Solution:
We have:
\[ P\left(\bigcup_{i=1}^{\infty} A_i \,\Big|\, B\right) = \frac{P\left(\left(\bigcup_{i=1}^{\infty} A_i\right) \cap B\right)}{P(B)} = \frac{P\left(\bigcup_{i=1}^{\infty} (A_i \cap B)\right)}{P(B)} = \sum_{i=1}^{\infty} \frac{P(A_i \cap B)}{P(B)} = \sum_{i=1}^{\infty} P(A_i \mid B) \]
where the equation on the second line follows from (B.1) in the question, since
Ai ∩ B are also events in S, and they are pairwise mutually exclusive (i.e.
(Ai ∩ B) ∩ (Aj ∩ B) = ∅ for all i ≠ j).
10. Suppose that three components numbered 1, 2 and 3 have probabilities of failure
π1 , π2 and π3 , respectively. Determine the probability of a system failure in each of
the following cases where component failures are assumed to be independent.
(a) Parallel system – the system fails if all components fail.
(b) Series system – the system fails unless all components do not fail.
(c) Mixed system – the system fails if component 1 fails or if both component 2
and component 3 fail.
Solution:
(a) Since the component failures are independent, the probability of system failure
is π1 π2 π3 .
(b) The probability that component i does not fail is 1 − πi , hence the probability
that the system does not fail is (1 − π1 )(1 − π2 )(1 − π3 ), and so the probability
that the system fails is:
1 − (1 − π1 )(1 − π2 )(1 − π3 ).
(c) The system fails if component 1 fails or if both components 2 and 3 fail, hence the probability of system failure is π1 + π2 π3 − π1 π2 π3.
11. Why is S = {1, 1, 2}, not a sensible way to try to define a sample space?
Solution:
Because there is no need to list the elementary outcome ‘1’ twice. It is much clearer
to write S = {1, 2}.
12. Write out all the events for the sample space S = {a, b, c}. (There are eight of
them.)
Solution:
The possible events are {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c} (the sample
space S) and ∅.
13. For an event A, work out a simpler way to express the events A ∩ S, A ∪ S, A ∩ ∅
and A ∪ ∅.
Solution:
We have:
A ∩ S = A, A ∪ S = S, A ∩ ∅ = ∅ and A ∪ ∅ = A.
14. If all elementary outcomes are equally likely, S = {a, b, c, d}, A = {a, b, c} and
B = {c, d}, find P (A | B) and P (B | A).
Solution:
S has 4 elementary outcomes which are equally likely, so each elementary outcome
has probability 1/4.
We have:
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(\{c\})}{P(\{c, d\})} = \frac{1/4}{1/4 + 1/4} = \frac{1}{2} \]
and:
\[ P(B \mid A) = \frac{P(B \cap A)}{P(A)} = \frac{P(\{c\})}{P(\{a, b, c\})} = \frac{1/4}{1/4 + 1/4 + 1/4} = \frac{1}{3}. \]
15. Suppose that we toss a fair coin twice. The sample space is given by
S = {HH, HT, T H, T T }, where the elementary outcomes are defined in the
obvious way – for instance HT is heads on the first toss and tails on the second
toss. Show that if all four elementary outcomes are equally likely, then the events
‘heads on the first toss’ and ‘heads on the second toss’ are independent.
Solution:
Note carefully here that we have equally likely elementary outcomes (due to the
coin being fair), so that each has probability 1/4, and the independence follows.
The event ‘heads on the first toss’ is A = {HH, HT } and has probability 1/2,
because it is specified by two elementary outcomes. The event ‘heads on the second
toss’ is B = {HH, T H} and has probability 1/2. The event ‘heads on the first toss
and the second toss’ is A ∩ B = {HH} and has probability 1/4. So the
multiplication property P (A ∩ B) = 1/4 = 1/2 × 1/2 = P (A) P (B) is satisfied, and
the two events are independent.
16. Show that if A and B are disjoint events, and are also independent, then P (A) = 0
or P (B) = 0.1
Solution:
It is important to get the logical flow in the right direction here. We are told that
A and B are disjoint events, that is:
A ∩ B = ∅.
So:
P (A ∩ B) = 0.
We are also told that A and B are independent, that is:
P (A ∩ B) = P (A) P (B).
It follows that:
0 = P (A) P (B)
and so either P (A) = 0 or P (B) = 0.
1
Note that independence and disjointness are not similar ideas.
17. Write down the condition for three events A, B and C to be independent.
Solution:
Applying the product rule, we must have:
P (A ∩ B ∩ C) = P (A) P (B) P (C).
Therefore, since all subsets of two events from A, B and C must be independent,
we must also have:
P (A ∩ B) = P (A) P (B)
P (A ∩ C) = P (A) P (C)
and:
P (B ∩ C) = P (B) P (C).
One must check that all four conditions hold to verify independence of A, B and C.
18. Prove the simplest version of Bayes’ theorem from first principles.
Solution:
Applying the definition of conditional probability, we have:
\[ P(B \mid A) = \frac{P(B \cap A)}{P(A)} = \frac{P(A \cap B)}{P(A)} = \frac{P(A \mid B)\,P(B)}{P(A)}. \]
19. A statistics teacher knows from past experience that a student who does their
homework consistently has a probability of 0.95 of passing the examination,
whereas a student who does not do their homework has a probability of 0.30 of
passing.
(a) If 25% of students do their homework consistently, what percentage of all
students can expect to pass?
(b) If a student chosen at random from the group gets a pass, what is the
probability that the student has done their homework consistently?
Solution:
Here the random experiment is to choose a student at random, and to record
whether the student passes (P ) or fails (F ), and whether the student has done
their homework consistently (C) or has not (N ).2 The sample space is
S = {P C, P N, F C, F N }. We use the events Pass = {P C, P N }, and Fail
= {F C, F N }. We consider the sample space partitioned by Homework
= {P C, F C}, and No Homework = {P N, F N }.
2
Notice that F = P c and N = C c .
(a) The first part of the example asks for the denominator of Bayes' theorem:
\[ P(\text{Pass}) = P(\text{Pass} \mid \text{Homework})\,P(\text{Homework}) + P(\text{Pass} \mid \text{No Homework})\,P(\text{No Homework}) = 0.95 \times 0.25 + 0.30 \times 0.75 = 0.4625 \]
so 46.25% of all students can expect to pass.
(b) We require:
\[ P(\text{Homework} \mid \text{Pass}) = \frac{P(\text{Homework} \cap \text{Pass})}{P(\text{Pass})} = \frac{P(\text{Pass} \mid \text{Homework})\,P(\text{Homework})}{P(\text{Pass})} = \frac{0.95 \times 0.25}{0.4625} = 0.5135. \]
21. A, B and C throw a die in that order until a six appears. The person who throws
the first six wins. What are their respective chances of winning?
Solution:
We must assume that the game finishes with probability one (it would be proved in
a more advanced subject). If A, B and C all throw and fail to get a six, then their
respective chances of winning are as at the start of the game. We can call each
completed set of three throws a round. Let us denote the probabilities of winning
by P (A), P (B) and P (C) for A, B and C, respectively. Therefore:
\[ P(A) = \frac{1}{6} + \left(\frac{5}{6}\right)^3 P(A) \quad\Rightarrow\quad P(A) = \frac{1/6}{1 - (5/6)^3} = \frac{36}{91} \]
and similarly:
\[ P(B) = \frac{5}{6} \times \frac{1}{6} + \left(\frac{5}{6}\right)^3 P(B) = \frac{30}{91} \quad\text{and}\quad P(C) = \left(\frac{5}{6}\right)^2 \times \frac{1}{6} + \left(\frac{5}{6}\right)^3 P(C) = \frac{25}{91}. \]
22. In men’s singles tennis, matches are played on the best-of-five-sets principle.
Therefore, the first player to win three sets wins the match, and a match may
consist of three, four or five sets. Assuming that two players are perfectly evenly
matched, and that sets are independent events, calculate the probabilities that a
match lasts three sets, four sets and five sets, respectively.
Solution:
Suppose that the two players are A and B. We calculate the probability that A
wins a three-, four- or five-set match, and then, since the players are evenly
matched, double these probabilities for the final answer.
P('A wins in 3 sets') = P('A wins 1st set' ∩ 'A wins 2nd set' ∩ 'A wins 3rd set')
= P('A wins 1st set') P('A wins 2nd set') P('A wins 3rd set')
= 1/2 × 1/2 × 1/2 = 1/8.
Therefore, the total probability that the game lasts three sets is:
2 × 1/8 = 1/4.
If A wins in four sets, the possible winning patterns (writing the winner of each set in order) are BAAA, ABAA and AABA.
Each of these patterns has probability (1/2)⁴ by using the same argument as in the three-set case. So the probability that A wins in four sets is 3 × (1/16) = 3/16.
Therefore, the total probability of a match lasting four sets is 2 × (3/16) = 3/8.
The probability of a five-set match should be 1 − 3/8 − 1/4 = 3/8, but let us check
this directly. The winning patterns for A in a five-set match are BBAAA, BABAA, BAABA, ABBAA, ABABA and AABBA.
Each of these has probability (1/2)⁵ because of the independence of the sets. So the
probability that A wins in five sets is 6 × (1/32) = 3/16. Therefore, the total
probability of a five-set match is 3/8, as before.
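The set-by-set enumeration above can be verified with a brute-force sketch in Python (the enumeration approach is ours; each set is treated as an independent fair coin flip):

# Brute-force check of the match-length probabilities for evenly matched players.
from itertools import product

probs = {3: 0.0, 4: 0.0, 5: 0.0}
for sets in product("AB", repeat=5):
    # Play sets in order until one player has won three of them.
    a = b = 0
    for i, winner in enumerate(sets, start=1):
        if winner == "A":
            a += 1
        else:
            b += 1
        if a == 3 or b == 3:
            probs[i] += 0.5 ** 5  # each full sequence of five flips has probability 1/32
            break

print(probs)  # {3: 0.25, 4: 0.375, 5: 0.375}, i.e. 1/4, 3/8 and 3/8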
1. (a) A, B and C are any three events in the sample space S. Prove that:
P(A ∩ B) ≤ (P(A) + P(B))/2 ≤ P(A ∪ B).
3. (a) Show that if A and B are independent events in a sample space, then Ac and
B c are also independent.
(b) Show that if X and Y are mutually exclusive events in a sample space, then
X c and Y c are not in general mutually exclusive.
4. In a game of tennis, each point is won by one of the two players A and B. The
usual rules of scoring for tennis apply. That is, the winner of the game is the player
who first scores four points, unless each player has won three points, when deuce is
called and play proceeds until one player is two points ahead of the other and
hence wins the game.
A is serving and has a probability of winning any point of 2/3. The result of each
point is assumed to be independent of every other point.
(a) Show that the probability of A winning the game without deuce being called is
496/729.
(b) Find the probability of deuce being called.
(c) If deuce is called, show that A’s subsequent probability of winning the game is
4/5.
(d) Hence determine A’s overall chance of winning the game.
Appendix C
Random variables
and:
so π = 8/25 = 0.32.
πe^t / (1 − e^t(1 − π)).
Solution:
(a) Working from the definition:
MX(t) = E(e^{tX}) = Σ_{x∈S} e^{tx} p(x) = Σ_{x=1}^{∞} e^{tx} (1 − π)^{x−1} π = πe^t Σ_{x=1}^{∞} (e^t(1 − π))^{x−1} = πe^t / (1 − e^t(1 − π))
on summing the geometric series (which converges provided e^t(1 − π) < 1).
Therefore:
E(X) = M′_X(0) = π/(1 − (1 − π))² = π/π² = 1/π.
Solution:
(a) We have:
∫₀¹ f(x) dx = 1  ⇒  ∫₀¹ (ax + bx²) dx = [ax²/2 + bx³/3]₀¹ = a/2 + b/3 = 1.
(c) Finally:
E(X²) = ∫₀¹ x²(6x(1 − x)) dx = ∫₀¹ (6x³ − 6x⁴) dx = [6x⁴/4 − 6x⁵/5]₀¹ = 0.3
and so the variance is:
Var(X) = E(X 2 ) − (E(X))2 = 0.3 − 0.25 = 0.05.
Solution:
(a) A sketch of the cumulative distribution function is:
[Sketch of the cdf G(w): G(0) = 1/3, G(w) increases towards 1 − (2/3)e⁻¹ as w approaches 2, and G(w) = 1 for w ≥ 2.]
(c) We have:
P(W > 1) = 1 − G(1) = (2/3)e^{−1/2}
P(W = 2) = (2/3)e^{−1}
P(W ≤ 1.5 | W > 0.5) = P(0.5 < W ≤ 1.5)/P(W > 0.5)
= (G(1.5) − G(0.5))/(1 − G(0.5))
= [(1 − (2/3)e^{−1.5/2}) − (1 − (2/3)e^{−0.5/2})]/[(2/3)e^{−0.5/2}]
= 1 − e^{−1/2}.
Solution:
(a) Clearly, f(x) ≥ 0 for all x and ∫_{−∞}^{∞} f(x) dx = 1. This can be seen geometrically, since f(x) defines two rectangles, one with base 1 and height 1/4, the other with base 1 and height 3/4, giving a total area of 1/4 + 3/4 = 1.
(b) We have:
E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫₀¹ x/4 dx + ∫₁² 3x/4 dx = [x²/8]₀¹ + [3x²/8]₁² = 1/8 + 3/2 − 3/8 = 5/4.
The median is most simply found geometrically. The area to the right of the
point x = 4/3 is 0.5, i.e. the rectangle with base 2 − 4/3 = 2/3 and height 3/4,
giving an area of 2/3 × 3/4 = 1/2. Hence the median is 4/3.
(c) For the variance, we proceed as follows:
E(X²) = ∫_{−∞}^{∞} x² f(x) dx = ∫₀¹ x²/4 dx + ∫₁² 3x²/4 dx = [x³/12]₀¹ + [x³/4]₁² = 1/12 + 2 − 1/4 = 11/6.
Hence the variance is:
Var(X) = E(X²) − (E(X))² = 11/6 − 25/16 = 88/48 − 75/48 = 13/48 ≈ 0.2708.
(d) The cdf is:
F(x) = 0 for x < 0,  x/4 for 0 ≤ x ≤ 1,  3x/4 − 1/2 for 1 < x ≤ 2,  and 1 for x > 2.
(e) P (X = 1) = 0, since the cdf is continuous, and:
P(X > 1.5 | X > 0.5) = P({X > 1.5} ∩ {X > 0.5})/P(X > 0.5) = P(X > 1.5)/P(X > 0.5)
= (0.5 × 0.75)/(1 − 0.5 × 0.25)
= 0.375/0.875
= 3/7 ≈ 0.4286.
(f) The moment generating function is:
MX(t) = E(e^{tX}) = ∫_{−∞}^{∞} e^{tx} f(x) dx = ∫₀¹ e^{tx}/4 dx + ∫₁² 3e^{tx}/4 dx
= [e^{tx}/4t]₀¹ + [3e^{tx}/4t]₁²
= (1/4t)(e^t − 1) + (3/4t)(e^{2t} − e^t)
= (1/4t)(3e^{2t} − 2e^t − 1).
E((X − E(X))³)/σ³.
(f) If a sample of five observations is drawn at random from the distribution, find
the probability that all the observations exceed 1.5.
Solution:
(a) Clearly, f (x) ≥ 0 for all x and:
∫₀² x³/4 dx = [x⁴/16]₀² = 1.
Solution:
(a) Clearly, f(x) ≥ 0 for all x since λ² > 0, x ≥ 0 and e^{−λx} ≥ 0.
To show that ∫_{−∞}^{∞} f(x) dx = 1, integrate by parts:
∫_{−∞}^{∞} f(x) dx = ∫₀^∞ λ²xe^{−λx} dx = [λ²x × e^{−λx}/(−λ)]₀^∞ + ∫₀^∞ λ² e^{−λx}/λ dx = 0 + ∫₀^∞ λe^{−λx} dx = 0 + 1 = 1
since the final integrand is the pdf of the exponential distribution. Similarly, integrating by parts and using the mean of the exponential distribution:
E(X) = ∫₀^∞ x λ²xe^{−λx} dx = [−λx²e^{−λx}]₀^∞ + ∫₀^∞ 2x λe^{−λx} dx = 0 + 2/λ (from the exponential distribution).
For the variance:
E(X²) = ∫₀^∞ x² λ²xe^{−λx} dx = [−x³λe^{−λx}]₀^∞ + ∫₀^∞ 3x² λe^{−λx} dx = 6/λ²
using E(Y²) = 2/λ² for Y ∼ Exp(λ). So, Var(X) = 6/λ² − (2/λ)² = 2/λ².
Solution:
(a) We have:
i. P (X = 0) = F (0) = 1 − a.
ii. P(X = 1) = lim_{x→1} (F(1) − F(x)) = 1 − (1 − ae^{−1}) = ae^{−1}.
iii. f(x) = ae^{−x}, for 0 ≤ x < 1, and 0 otherwise.
iv. The mean is:
E(X) = 0 × (1 − a) + 1 × (ae^{−1}) + ∫₀¹ x ae^{−x} dx
= ae^{−1} + [−xae^{−x}]₀¹ + ∫₀¹ ae^{−x} dx
= ae^{−1} − ae^{−1} + [−ae^{−x}]₀¹
= a(1 − e^{−1}).
(a) Determine the constant k and derive the cumulative distribution function,
F (x), of X.
(b) Find E(X) and Var(X).
Solution:
(a) We have:
∫_{−∞}^{∞} f(x) dx = ∫₀^π k sin(x) dx = 1.
Therefore:
[k(−cos(x))]₀^π = 2k = 1  ⇒  k = 1/2.
The cdf is hence:
F(x) = 0 for x < 0,  (1 − cos(x))/2 for 0 ≤ x ≤ π,  and 1 for x > π.
Next:
E(X²) = ∫₀^π x² (1/2) sin(x) dx = [x²(−cos(x))/2]₀^π + ∫₀^π x cos(x) dx
= π²/2 + [x sin(x)]₀^π − ∫₀^π sin(x) dx
= π²/2 − [−cos(x)]₀^π
= π²/2 − 2.
Therefore, the variance is:
Var(X) = E(X²) − (E(X))² = π²/2 − 2 − π²/4 = π²/4 − 2 (since E(X) = π/2, by symmetry).
10. (a) Define the cumulative distribution function (cdf) of a random variable and
state the principal properties of such a function.
(b) Identify which, if any, of the following functions could be a cdf under suitable
choices of the constants a and b. Explain why (or why not) each function
satisfies the properties required of a cdf and the constraints which may be
required in respect of the constants a and b.
i. F(x) = a(b − x)² for −1 ≤ x ≤ 1.
ii. F(x) = a(1 − x^b) for −1 ≤ x ≤ 1.
iii. F(x) = a − b exp(−x/2) for 0 ≤ x ≤ 2.
Solution:
(a) We defined the cdf to be F (x) = P (X ≤ x) where:
• 0 ≤ F (x) ≤ 1
• F (x) is non-decreasing
• dF(x)/dx = f(x) and F(x) = ∫_{−∞}^{x} f(t) dt for continuous X
• F (x) → 0 as x → −∞ and F (x) → 1 as x → ∞.
(b) i. Okay. a = 0.25 and b = −1.
ii. Not okay. At x = 1, F(x) = 0, which would mean a decreasing function.
iii. Okay. a = b > 0 and b = (1 − e^{−1})^{−1}.
11. Suppose that random variable X has the range {x1 , x2 , . . .}, where x1 < x2 < · · · .
Prove the following results:
Σ_{i=1}^{∞} p(xᵢ) = 1  and  F(x_k) = Σ_{i=1}^{k} p(xᵢ).
Solution:
The events X = x1 , X = x2 , . . . are disjoint, so we can write:
Σ_{i=1}^{∞} p(xᵢ) = Σ_{i=1}^{∞} P(X = xᵢ) = P(X = x₁ ∪ X = x₂ ∪ · · ·) = P(S) = 1.
In words, this result states that the sum of the probabilities of all the possible
values X can take is equal to 1.
For the second equation, we have:
F (xk ) = P (X ≤ xk ) = P (X = xk ∪ X ≤ xk−1 ).
Since the events X = x₁, X = x₂, . . . , X = x_k are disjoint, this gives:
F(x_k) = P(X ≤ x_k) = P(X = x₁ ∪ X = x₂ ∪ · · · ∪ X = x_k) = Σ_{i=1}^{k} p(xᵢ).
12. At a charity event, the organisers sell 100 tickets to a raffle. At the end of the
event, one of the tickets is selected at random and the person with that number
wins a prize. Carol buys ticket number 22. Janet buys tickets numbered 1–5. What
is the probability for each of them to win the prize?
Solution:
Let X denote the number on the winning ticket. Since all values between 1 and 100
are equally likely, X has a discrete ‘uniform’ distribution such that:
P('Carol wins') = P(X = 22) = p(22) = 1/100 = 0.01
and:
P('Janet wins') = P(X ≤ 5) = F(5) = 5/100 = 0.05.
13. What is the expectation of the random variable X if the only possible value it can
take is c?
Solution:
We have p(c) = 1, so X is effectively a constant, even though it is called a random
variable. Its expectation is:
E(X) = Σ_{∀x} x p(x) = c p(c) = c × 1 = c.   (C.1)
Solution:
We have:
E(X − E(X)) = E(X) − E(E(X)).
Since E(X) is just a number, as opposed to a random variable, (C.1) tells us that its expectation is equal to itself. Therefore, we can write:
E(X − E(X)) = E(X) − E(X) = 0.
15. Show that if Var(X) = 0 then p(µ) = 1. (We say in this case that X is almost
surely equal to its mean.)
Solution:
From the definition of variance, we have:
Var(X) = E((X − µ)²) = Σ_{∀x} (x − µ)² p(x) ≥ 0
because each squared term (x − µ)² is non-negative (as is p(x)). The only case where
it is equal to 0 is when x − µ = 0, that is, when x = µ. Therefore, the random
variable X can only take the value µ, and we have p(µ) = P (X = µ) = 1.
Appendix D
Common distributions of random
variables
Solution:
(a) We have (writing C(n, x) for the binomial coefficient n!/(x! (n − x)!)):
E(X) = Σ_{x=0}^{n} x C(n, x) π^x (1 − π)^{n−x}
= Σ_{x=1}^{n} x C(n, x) π^x (1 − π)^{n−x}
= Σ_{x=1}^{n} [n(n − 1)!/((x − 1)! ((n − 1) − (x − 1))!)] π π^{x−1} (1 − π)^{n−x}
= nπ Σ_{x=1}^{n} C(n − 1, x − 1) π^{x−1} (1 − π)^{(n−1)−(x−1)}
= nπ Σ_{y=0}^{n−1} C(n − 1, y) π^y (1 − π)^{(n−1)−y}
= nπ.
(b) We have:
E(X(X − 1)) = Σ_{x=0}^{n} x(x − 1) C(n, x) π^x (1 − π)^{n−x}
= Σ_{x=2}^{n} x(x − 1) C(n, x) π^x (1 − π)^{n−x}
= Σ_{x=2}^{n} [n(n − 1)(n − 2)!/((x − 2)! ((n − 2) − (x − 2))!)] π² π^{x−2} (1 − π)^{n−x}
= n(n − 1)π² Σ_{x=2}^{n} C(n − 2, x − 2) π^{x−2} (1 − π)^{(n−2)−(x−2)}
= n(n − 1)π² Σ_{y=0}^{n−2} C(n − 2, y) π^y (1 − π)^{(n−2)−y}
= n(n − 1)π².
(c) We have:
E(X(X − 1) · · · (X − r)) = Σ_{x=0}^{n} x(x − 1) · · · (x − r) C(n, x) π^x (1 − π)^{n−x}  (if r < n)
= Σ_{x=r+1}^{n} x(x − 1) · · · (x − r) C(n, x) π^x (1 − π)^{n−x}
and, proceeding exactly as in (a) and (b), this equals n(n − 1) · · · (n − r) π^{r+1}.
Solution:
(a) Xₙ = Σ_{i=1}^{n} Bᵢ takes the values 0, 1, 2, . . . , n. Any sequence consisting of x 1s and n − x 0s has a probability π^x (1 − π)^{n−x} and gives a value Xₙ = x. There are C(n, x) such sequences, so:
P(Xₙ = x) = C(n, x) π^x (1 − π)^{n−x}
and 0 otherwise. Hence E(Bᵢ) = π and Var(Bᵢ) = π(1 − π), which means E(Xₙ) = nπ and Var(Xₙ) = nπ(1 − π).
P(Y = y) = (1 − π)^{y−1} π.
f(x) = (β^α/Γ(α)) x^{α−1} e^{−βx}  for x > 0   (D.1)
and 0 otherwise, where α > 0 and β > 0 are parameters, and Γ(α) is the value of the gamma function such that:
Γ(α) = ∫₀^∞ x^{α−1} e^{−x} dx.
The gamma function has a finite value for all α > 0. Two of its properties are that:
• Γ(1) = 1
• Γ(α) = (α − 1) Γ(α − 1) for all α > 1.
(a) The function f(x) defined by (D.1) satisfies all the conditions for being a pdf. Show that this implies the following result about an integral:
∫₀^∞ x^{α−1} e^{−βx} dx = Γ(α)/β^α   for any α > 0, β > 0.
Using this result and the known properties of the exponential distribution,
derive the expected value of X ∼ Gamma(α, β) when α is a positive integer
(i.e. α = 1, 2, . . .).
Solution:
(a) This follows immediately from the general property of pdfs that ∫_{−∞}^{∞} f(x) dx = 1, applied to the specific pdf here. We have:
∫₀^∞ (β^α/Γ(α)) x^{α−1} e^{−βx} dx = 1  ⇒  ∫₀^∞ x^{α−1} e^{−βx} dx = Γ(α)/β^α.
(b) With α = 1, the pdf becomes f (x) = βe−βx for x ≥ 0, and 0 otherwise. This is
the pdf of the exponential distribution with parameter β, i.e. X ∼ Exp(β).
(c) We have:
MX(t) = E(e^{tX}) = ∫₀^∞ e^{tx} f(x) dx = ∫₀^∞ e^{tx} (β^α/Γ(α)) x^{α−1} e^{−βx} dx
= (β^α/Γ(α)) ∫₀^∞ e^{tx} x^{α−1} e^{−βx} dx
= (β^α/Γ(α)) ∫₀^∞ x^{α−1} e^{−(β−t)x} dx
= (β^α/Γ(α)) × Γ(α)/(β − t)^α
= (β/(β − t))^α
which is finite when β − t > 0, i.e. when t < β. The second-to-last step follows by substituting β − t for β in the result in (a).
(d) i. We have:
E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫₀^∞ x (β^α/Γ(α)) x^{α−1} e^{−βx} dx
= (β^α/Γ(α)) ∫₀^∞ x^{(α+1)−1} e^{−βx} dx
= (β^α/Γ(α)) × Γ(α + 1)/β^{α+1}
= (β^α/Γ(α)) × αΓ(α)/β^{α+1}
= α/β
using (a) and the gamma function property stated in the question.
ii. The first derivative of MX(t) is:
M′_X(t) = α (β/(β − t))^{α−1} × β/(β − t)².
Therefore:
E(X) = M′_X(0) = α/β.
(e) When α is a positive integer, by the result stated in the question, we have X = Σ_{i=1}^{α} Yᵢ, where Y₁, Y₂, . . . , Y_α are independent random variables each distributed as Gamma(1, β), i.e. as exponential with parameter β as concluded in (b). The expected value of the exponential distribution can be taken as given from the lectures, so E(Yᵢ) = 1/β for each i = 1, 2, . . . , α. Therefore, using the general result on expected values of sums:
E(X) = E(Σ_{i=1}^{α} Yᵢ) = Σ_{i=1}^{α} E(Yᵢ) = α × 1/β = α/β.
4. James enjoys playing Solitaire on his laptop. One day, he plays the game
repeatedly. He has found, from experience, that the probability of success in any
game is 1/3 and is independent of the outcomes of other games.
(a) What is the probability that his first success occurs in the fourth game he
plays? What is the expected number of games he needs to play to achieve his
first success?
(b) What is the probability of three successes in ten games? What is the expected
number of successes in ten games?
(c) Use a suitable approximation to find the probability of less than 25 successes
in 100 games. You should justify the use of the approximation.
(d) What is the probability that his third success occurs in the tenth game he
plays?
Solution:
(a) P(first success in 4th game) = (2/3)³ × (1/3) = 8/81 ≈ 0.1. This is a geometric distribution, for which E(X) = 1/π = 1/(1/3) = 3.
(b) Use X ∼ Bin(10, 1/3), such that E(X) = 10 × 1/3 = 3.33, and:
P(X = 3) = C(10, 3) (1/3)³ (2/3)⁷ ≈ 0.2601.
(d) This is a negative binomial distribution (used for the trial number of the kth success) with a pf given by:
p(x) = C(x − 1, k − 1) π^k (1 − π)^{x−k}  for x = k, k + 1, k + 2, . . .
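The four calculations in this question, including the normal approximation in (c) whose worked values are not reproduced above, can be sketched in Python with scipy (the continuity correction in (c) and the library calls are our choices, not the original's):

# Sketch of the Solitaire calculations using scipy.stats.
from scipy.stats import geom, binom, norm, nbinom

p = 1 / 3

# (a) First success in the 4th game; expected number of games to first success.
print(geom.pmf(4, p))          # (2/3)^3 * (1/3), about 0.0988
print(geom.mean(p))            # 1/p = 3

# (b) Three successes in ten games; expected number of successes in ten games.
print(binom.pmf(3, 10, p))     # about 0.2601
print(binom.mean(10, p))       # 10/3, about 3.33

# (c) Fewer than 25 successes in 100 games, via N(100p, 100p(1-p)).
mu, var = 100 * p, 100 * p * (1 - p)
print(norm.cdf(24.5, loc=mu, scale=var ** 0.5))   # approximation
print(binom.cdf(24, 100, p))                      # exact value, for comparison

# (d) Third success in the tenth game: negative binomial.
# scipy counts failures before the k-th success, so the tenth game means 7 failures.
print(nbinom.pmf(7, 3, p))     # C(9,2) p^3 (1-p)^7, about 0.078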
5. You may assume that 15% of individuals in a large population are left-handed.
(a) If a random sample of 40 individuals is taken, find the probability that exactly
6 are left-handed.
(b) If a random sample of 400 individuals is taken, find the probability that
exactly 60 are left-handed by using a suitable approximation. Briefly discuss
the appropriateness of the approximation.
(c) What is the smallest possible size of a randomly chosen sample if we wish to
be 99% sure of finding at least one left-handed individual in the sample?
Solution:
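A numerical sketch of the calculations involved, assuming a Bin(n, 0.15) count of left-handed people, a continuity-corrected normal approximation in (b), and the complement of 'no left-handers' in (c) (the scipy calls are ours, not from the original solution):

from math import ceil, log
from scipy.stats import binom, norm

p = 0.15

# (a) Exactly 6 left-handed people in a sample of 40.
print(binom.pmf(6, 40, p))

# (b) Exactly 60 in a sample of 400, via N(60, 51) with a continuity correction.
mu, sd = 400 * p, (400 * p * (1 - p)) ** 0.5
print(norm.cdf(60.5, mu, sd) - norm.cdf(59.5, mu, sd))
print(binom.pmf(60, 400, p))   # exact value, for comparison

# (c) Smallest n with P(at least one left-hander) >= 0.99, i.e. 1 - 0.85^n >= 0.99.
print(ceil(log(0.01) / log(0.85)))   # 29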
6. Show that the moment generating function (mgf) of a Poisson distribution with
parameter λ is given by:
MX(t) = exp(λ(exp(t) − 1)), writing exp(θ) ≡ e^θ.
Hence show that the mean and variance of the distribution are both λ.
Solution:
We have:
MX(t) = E(exp(Xt)) = Σ_{x=0}^{∞} exp(xt) (λ^x/x!) exp(−λ)
= Σ_{x=0}^{∞} (exp(−λ)/x!) (λ exp(t))^x
= exp(−λ) Σ_{x=0}^{∞} (λ exp(t))^x/x!
= exp(−λ) exp(λ exp(t))
= exp(λ(exp(t) − 1))
using the series expansion of the exponential function. Differentiating, M′_X(t) = λ exp(t) MX(t) and M″_X(t) = λ exp(t) MX(t) + (λ exp(t))² MX(t), so E(X) = M′_X(0) = λ, E(X²) = M″_X(0) = λ + λ² and hence Var(X) = λ + λ² − λ² = λ.
8. People entering an art gallery are counted by the attendant at the door. Assume
that people arrive in accordance with a Poisson distribution, with one person
arriving every 2 minutes. The attendant leaves the door unattended for 5 minutes.
(a) Calculate the probability that:
i. nobody will enter the gallery in this time
ii. 3 or more people will enter the gallery in this time.
(b) Find, to the nearest second, the length of time for which the attendant could
leave the door unattended for there to be a probability of 0.90 of no arrivals in
that time.
(c) Comment briefly on the assumption of a Poisson distribution in this context.
Solution:
(a) λ = 1 for a two-minute interval, so λ = 2.5 for a five-minute interval. Therefore:
P(no arrivals) = e^{−2.5} = 0.0821
and:
P(≥ 3 arrivals) = 1 − p_X(0) − p_X(1) − p_X(2) = 1 − e^{−2.5}(1 + 2.5 + 3.125) = 0.4562.
(b) For an interval of N minutes, the parameter is N/2. We need p(0) = 0.90, so e^{−N/2} = 0.90, giving N/2 = −ln(0.90) and N = 0.21 minutes, or about 13 seconds.
(c) The rate is unlikely to be constant: more people at lunchtimes or early
evenings etc. Likely to be several arrivals in a small period – couples, groups
etc. Quite unlikely the Poisson will provide a good model.
Solution:
(a) The survivor function is:
P(Y > y) = ∫_y^∞ λe^{−λx} dx = [−e^{−λx}]_y^∞ = e^{−λy}.
(b) The age-specific failure rate is constant, indicating it does not vary with age.
This is unlikely to be true in practice!
10. For the binomial distribution with a probability of success of 0.25 in an individual
trial, calculate the probability that, in 50 trials, there are at least 8 successes:
(a) using the normal approximation without a continuity correction
(b) using the normal approximation with a continuity correction.
Compare these results with the exact probability of 0.9547 and comment.
Solution:
We seek P (X ≥ 8) using the normal approximation Y ∼ N (12.5, 9.375).
(a) So, without a continuity correction:
P(Y ≥ 8) = P(Z ≥ (8 − 12.5)/√9.375) = P(Z ≥ −1.47) = 0.9292.
The required probability could have been expressed as P(X > 7), or indeed any number in [7, 8), for example:
P(Y > 7) = P(Z ≥ (7 − 12.5)/√9.375) = P(Z ≥ −1.80) = 0.9641.
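The continuity-corrected version in (b), whose worked values are not reproduced above, and the exact probability quoted in the question can be checked with a short scipy sketch (our own, for comparison only):

from scipy.stats import binom, norm

n, p = 50, 0.25
mu, sd = n * p, (n * p * (1 - p)) ** 0.5     # 12.5 and sqrt(9.375)

print(1 - norm.cdf((8 - mu) / sd))     # (a) no continuity correction
print(1 - norm.cdf((7.5 - mu) / sd))   # (b) with continuity correction
print(1 - binom.cdf(7, n, p))          # exact P(X >= 8) = 0.9547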
11. A greengrocer has a very large pile of oranges on his stall. The pile of fruit is a
mixture of 50% old fruit with 50% new fruit; one cannot tell which are old and
which are new. However, 20% of old oranges are mouldy inside, but only 10% of
new oranges are mouldy. Suppose that you choose 5 oranges at random. What is
the distribution of the number of mouldy oranges in your sample?
Solution:
For an orange chosen at random, the event 'mouldy' is the union of the disjoint events 'mouldy' ∩ 'new' and 'mouldy' ∩ 'old'. So:
P('mouldy') = P('mouldy' | 'old') P('old') + P('mouldy' | 'new') P('new') = 0.2 × 0.5 + 0.1 × 0.5 = 0.15.
As the pile of oranges is very large, we can assume that the results for the five
oranges will be independent, so we have 5 independent trials each with probability
of ‘mouldy’ equal to 0.15. The distribution of the number of mouldy oranges will be
a binomial distribution with n = 5 and π = 0.15.
12. Underground trains on the Northern line have a probability 0.05 of failure between
Golders Green and King’s Cross. Supposing that the failures are all independent,
what is the probability that out of 10 journeys between Golders Green and King’s
Cross more than 8 do not have a breakdown?
Solution:
The probability of no breakdown on one journey is π = 1 − 0.05 = 0.95, so the
number of journeys without a breakdown, X, has a Bin(10, 0.95) distribution. We
want P (X > 8), which is:
P(X > 8) = p(9) + p(10)
= C(10, 9) × (0.95)⁹ × (0.05)¹ + C(10, 10) × (0.95)^{10} × (0.05)⁰
= 0.3151 + 0.5987
= 0.9138.
13. Suppose that the normal rate of infection for a certain disease in cattle is 25%. To
test a new serum which may prevent infection, three experiments are carried out.
The test for infection is not always valid for some particular cattle, so the
experimental results are incomplete – we cannot always tell whether a cow is
infected or not. The results of the three experiments are:
(a) 10 animals are injected; all 10 remain free from infection
(b) 17 animals are injected; more than 15 remain free from infection and there are
2 doubtful cases
(c) 23 animals are injected; more than 20 remain free from infection and there are
three doubtful cases.
Which experiment provides the strongest evidence in favour of the serum?
Solution:
These experiments involve tests on different cattle, which one might expect to
behave independently of one another. The probability of infection without injection
with the serum might also reasonably be assumed to be the same for all cattle. So
the distribution which we need here is the binomial distribution. If the serum has
no effect, then the probability of infection for each of the cattle is 0.25.
One way to assess the evidence of the three experiments is to calculate the
probability of the result of the experiment if the serum had no effect at all. If it has
an effect, then one would expect larger numbers of cattle to remain free from
infection, so the experimental results as given do provide some clue as to whether
the serum has an effect, in spite of their incompleteness.
Let X(n) be the number of cattle infected, out of a sample of n. We are assuming
that X(n) ∼ Bin(n, 0.25).
(a) With 10 trials, the probability of 0 infected if the serum has no effect is:
P(X(10) = 0) = C(10, 0) × (0.75)^{10} = (0.75)^{10} = 0.0563.
(b) With 17 trials, the probability of more than 15 remaining uninfected if the serum has no effect is P(X(17) ≤ 1), i.e. the probability of at most one infection.
(c) With 23 trials, the probability of more than 20 remaining free from infection if the serum has no effect is P(X(23) ≤ 2), i.e. the probability of at most two infections. The smaller this probability, the stronger the evidence the experiment provides in favour of the serum.
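Under the interpretations just stated (which are our reading of the incomplete counts), the three probabilities can be evaluated with a short scipy sketch:

from scipy.stats import binom

print(binom.pmf(0, 10, 0.25))   # (a) all 10 uninfected: 0.75^10, about 0.056
print(binom.cdf(1, 17, 0.25))   # (b) at most 1 infected out of 17
print(binom.cdf(2, 23, 0.25))   # (c) at most 2 infected out of 23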
14. In a large industrial plant there is an accident on average every two days.
(a) What is the chance that there will be exactly two accidents in a given week?
(b) What is the chance that there will be two or more accidents in a given week?
(c) If James goes to work there for a four-week period, what is the probability
that no accidents occur while he is there?
Solution:
Here we have counts of random events over time, which is a typical application for
the Poisson distribution. We are assuming that accidents are equally likely to occur
at any time and are independent. The mean for the Poisson distribution is 0.5 per
day.
Let X be the number of accidents in a week. The probability of exactly two
accidents in a given week is found by using the parameter λ = 5 × 0.5 = 2.5 (5
working days a week assumed).
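The numerical values for (a) and (b), and the four-week calculation in (c), can be sketched with scipy (our own check, using λ = 2.5 per week and λ = 10 for four weeks):

from scipy.stats import poisson

print(poisson.pmf(2, 2.5))        # (a) exactly two accidents in a week
print(1 - poisson.cdf(1, 2.5))    # (b) two or more accidents in a week
print(poisson.pmf(0, 10))         # (c) no accidents in 20 working days, about 0.0000454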
(c) If James goes to the industrial plant and does not change the probability of an
accident simply by being there (he might bring bad luck, or be superbly
safety-conscious!), then over 4 weeks there are 20 working days, and the
probability of no accident comes from a Poisson random variable with mean
10. If Y is the number of accidents while James is there, the probability of no
accidents is:
p_Y(0) = e^{−10}(10)⁰/0! = 0.0000454.
James is very likely to be there when there is an accident!
15. The chance that a lottery ticket has a winning number is 0.0000001.
(a) If 10,000,000 people buy tickets which are independently numbered, what is
the probability there is no winner?
(b) What is the probability that there is exactly 1 winner?
(c) What is the probability that there are exactly 2 winners?
Solution:
The number of winning tickets, X, will be distributed as:
X ∼ Bin(10,000,000, 0.0000001).
Since n is large and π is small, the Poisson distribution should provide a good
approximation. The Poisson parameter is:
λ = nπ = 10,000,000 × 0.0000001 = 1
and so we set X ∼ Pois(1). We have:
p(0) = e^{−1}1⁰/0! = 0.3679,  p(1) = e^{−1}1¹/1! = 0.3679  and  p(2) = e^{−1}1²/2! = 0.1839.
Using the exact binomial distribution of X, the results are:
p(0) = C(10⁷, 0) × (10^{−7})⁰ × (1 − 10^{−7})^{10⁷} = 0.3679
p(1) = C(10⁷, 1) × (10^{−7})¹ × (1 − 10^{−7})^{10⁷−1} = 0.3679
p(2) = C(10⁷, 2) × (10^{−7})² × (1 − 10^{−7})^{10⁷−2} = 0.1839.
Notice that, in this case, the Poisson approximation is correct to at least 4 decimal
places.
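The agreement between the Poisson approximation and the exact binomial probabilities can be confirmed directly with a short scipy sketch (our own check):

from scipy.stats import binom, poisson

n, p = 10_000_000, 1e-7
for k in range(3):
    # Poisson(1) approximation versus exact Bin(10^7, 10^-7) probability.
    print(k, poisson.pmf(k, n * p), binom.pmf(k, n, p))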
16. Suppose that X ∼ Uniform[0, 1]. Compute P (X > 0.2), P (X ≥ 0.2) and
P (X 2 > 0.04).
Solution:
We have a = 0 and b = 1, and can use the formula for P (c < X ≤ d), for constants
c and d. Hence:
P(X > 0.2) = P(0.2 < X ≤ 1) = (1 − 0.2)/(1 − 0) = 0.8.
Also:
P (X ≥ 0.2) = P (X = 0.2) + P (X > 0.2) = 0 + P (X > 0.2) = 0.8.
Finally:
P(X² > 0.04) = P(X < −0.2) + P(X > 0.2) = 0 + P(X > 0.2) = 0.8.
17. Suppose that the service time for a customer at a fast food outlet has an
exponential distribution with parameter 1/3 (customers per minute). What is the
probability that a customer waits more than 4 minutes?
Solution:
The distribution of X is Exp(1/3), so the probability is:
P(X > 4) = 1 − F(4) = 1 − (1 − e^{−(1/3)×4}) = 1 − 0.7364 = 0.2636.
18. Suppose that the distribution of men’s heights in London, measured in cm, is
N (175, 62 ). Find the proportion of men whose height is:
(a) under 169 cm
(b) over 190 cm
(c) between 169 cm and 190 cm.
Solution:
The values of interest are 169 and 190. The corresponding z-values are:
z₁ = (169 − 175)/6 = −1  and  z₂ = (190 − 175)/6 = 2.5.
Using values from Table 4 of the New Cambridge Statistical Tables, we have:
P (X < 169) = P (Z < −1) = Φ(−1)
= 1 − Φ(1) = 1 − 0.8413 = 0.1587
345
D. Common distributions of random variables
19. Two statisticians disagree about the distribution of IQ scores for a population
under study. Both agree that the distribution is normal, and that σ = 15, but A
says that 5% of the population have IQ scores greater than 134.6735, whereas B
says that 10% of the population have IQ scores greater than 109.224. What is the
difference between the mean IQ score as assessed by A and that as assessed by B?
Solution:
The standardised z-value giving 5% in the upper tail is 1.6449, and for 10% it is
1.2816. So, converting to the scale for IQ scores, the values are:
µA + 24.6735 = 134.6735
so:
µA = 110
whereas:
µB + 19.224 = 109.224
so µB = 90. The difference µA − µB = 110 − 90 = 20.
(a) If pairs of pistons and cylinders are selected at random for assembly, for what
proportion will the piston not fit into the cylinder (i.e. for which the piston
diameter exceeds the cylinder diameter)?
(b) Calculate exactly the chance that in 100 pairs, selected at random:
i. every piston will fit
ii. not more than two of the pistons will fail to fit.
(c) Now calculate the same probabilities using a Poisson approximation. Discuss
the appropriateness of using this approximation.
MX(t) = e^t(1 − e^{kt})/(k(1 − e^t)).
(Do not attempt to find the mean and variance using the mgf.)
(a) Sketch f (z) and explain why it can serve as the pdf for a random variable Z.
(b) Determine the moment generating function of Z.
(c) Use the mgf to find E(Z), Var(Z), E(Z 3 ) and E(Z 4 ).
(You may assume that −1 < t < 1, for the mgf, which will ensure convergence.)
Hence find E(X) and Var(X). (The wording of the question implies that you use
the result which you have just proved. Other methods of derivation will not be
accepted!)
5. Cars independently pass a point on a busy road at an average rate of 150 per hour.
(a) Assuming a Poisson distribution, find the probability that none passes in a
given minute.
(b) What is the expected number passing in two minutes?
(c) Find the probability that the expected number actually passes in a given
two-minute period.
6. James goes fishing every Saturday. The number of fish he catches follows a Poisson
distribution. On a proportion π of the days he goes fishing, he does not catch
anything. He makes it a rule to take home the first, and then every other, fish
which he catches, i.e. the first, third, fifth fish etc.
(a) Using a Poisson distribution, find the mean number of fish he catches.
(b) Show that the probability that he takes home the last fish he catches is
(1 − π 2 )/2.
Appendix E
Multivariate random variables
Solution:
(a) The joint distribution (with marginal probabilities) is:
W =w
0 2 4 pZ (z)
−1 0.00 0.00 0.16 0.16
Z=z 0 0.00 0.08 0.24 0.32
1 0.16 0.12 0.00 0.28
2 0.24 0.00 0.00 0.24
pW (w) 0.40 0.20 0.40 1.00
(b) It is straightforward to see that:
P(W = 2 | Z = 1) = P(W = 2 ∩ Z = 1)/P(Z = 1) = 0.12/0.28 = 3/7.
For E(W | Z = 0), we have:
E(W | Z = 0) = Σ_w w P(W = w | Z = 0) = 0 × (0/0.32) + 2 × (0.08/0.32) + 4 × (0.24/0.32) = 3.5.
We see E(W ) = 2 (by symmetry), and:
E(Z) = −1 × 0.16 + 0 × 0.32 + 1 × 0.28 + 2 × 0.24 = 0.6.
Also:
E(WZ) = Σ_w Σ_z wz p(w, z) = −4 × 0.16 + 2 × 0.12 = −0.4
hence:
Cov(W, Z) = E(WZ) − E(W) E(Z) = −0.4 − 2 × 0.6 = −1.6.
X=x
−1 0 1
−1 0.05 0.15 0.10
Y =y 0 0.10 0.05 0.25
1 0.10 0.05 0.15
Solution:
(a) The conditional distribution of X given Y = 1 is:
X = x | Y = 1          −1     0     1
pX|Y=1(x | Y = 1)      1/3    1/6   1/2
(b) From the conditional distribution we see:
E(X | Y = 1) = −1 × 1/3 + 0 × 1/6 + 1 × 1/2 = 1/6.
E(Y) = 0 (by symmetry), and so Var(Y) = E(Y²) = 0.6.
E(X) = 0.25 and E(X²) = 0.75, so Var(X) = 0.75 − (0.25)² = 0.6875.
(Note that Var(X) and Var(Y) are not strictly necessary here!)
Next:
E(XY) = Σ_x Σ_y xy p(x, y) = (−1)(−1)(0.05) + (1)(−1)(0.10) + (−1)(1)(0.10) + (1)(1)(0.15) = 0.
So:
Cov(X, Y ) = E(XY ) − E(X) E(Y ) = 0 ⇒ Corr(X, Y ) = 0.
(c) X and Y are not independent random variables since, for example, P(X = −1, Y = −1) = 0.05 ≠ 0.25 × 0.30 = 0.075 = P(X = −1) P(Y = −1).
where:
πᵢ = e^{iθ}/(1 + e^{iθ})
for i = 1, 2, . . . , n. Derive the joint probability function, p(x1 , x2 , . . . , xn ).
Solution:
Since the Xi s are independent (but not identically distributed) random variables,
we have:
p(x₁, x₂, . . . , xₙ) = Π_{i=1}^{n} p(xᵢ).
Solution:
Since the Xi s are independent (and identically distributed) random variables, we
have:
p(x₁, x₂, . . . , xₙ) = Π_{i=1}^{n} p(xᵢ).
6. The random variables X1 and X2 are independent and have the common
distribution given in the table below:
X=x 0 1 2 3
pX (x) 0.2 0.4 0.3 0.1
Solution:
(a) The joint distribution of W and Y is:
W =w
0 1 2 3
0 (0.2)2 2(0.2)(0.4) 2(0.2)(0.3) 2(0.2)(0.1)
Y =y 1 0 (0.4)(0.4) 2(0.4)(0.3) 2(0.4)(0.1)
2 0 0 (0.3)(0.3) 2(0.3)(0.1)
3 0 0 0 (0.1)(0.1)
(0.2)2 (0.8)(0.4) (1.5)(0.3) (1.9)(0.1)
which is:
W =w
0 1 2 3
0 0.04 0.16 0.12 0.04
Y =y 1 0.00 0.16 0.24 0.08
2 0.00 0.00 0.09 0.06
3 0.00 0.00 0.00 0.01
0.04 0.32 0.45 0.19
7. Consider two random variables X and Y . X can take the values −1, 0 and 1, and
Y can take the values 0, 1 and 2. The joint probabilities for each pair are given by
the following table:
X = −1 X = 0 X = 1
Y =0 0.10 0.20 0.10
Y =1 0.10 0.05 0.10
Y =2 0.10 0.05 0.20
Solution:
(a) The marginal distribution of X is:
X=x −1 0 1
pX (x) 0.3 0.3 0.4
The marginal distribution of Y is:
Y =y 0 1 2
pY (y) 0.40 0.25 0.35
Hence:
E(X) = −1 × 0.3 + 0 × 0.3 + 1 × 0.4 = 0.1
and:
E(Y ) = 0 × 0.40 + 1 × 0.25 + 2 × 0.35 = 0.95.
(b) We have:
Cov(U, V ) = Cov(X + Y, X − Y )
= E((X + Y )(X − Y )) − E(X + Y ) E(X − Y )
= E(X 2 − Y 2 ) − (E(X) + E(Y ))(E(X) − E(Y ))
hence:
Cov(U, V) = E(X²) − E(Y²) − ((E(X))² − (E(Y))²) = Var(X) − Var(Y) = 0.69 − 0.7475 = −0.0575.
(c) U = 1 is achieved for (X, Y ) pairs (−1, 2), (0, 1) or (1, 0). The corresponding
values of V are −3, −1 and 1. We have:
P(V = −3 | U = 1) = 0.10/0.25 = 2/5
P(V = −1 | U = 1) = 0.05/0.25 = 1/5
P(V = 1 | U = 1) = 0.10/0.25 = 2/5
hence:
E(V | U = 1) = −3 × 2/5 + (−1) × 1/5 + 1 × 2/5 = −1.
8. Two refills for a ballpoint pen are selected at random from a box containing three
blue refills, two red refills and three green refills. Define the following random
variables:
X = the number of blue refills selected
Y = the number of red refills selected.
Solution:
(a) We have:
P(X = 1, Y = 1) = P(BR) + P(RB) = 3/8 × 2/7 + 2/8 × 3/7 = 3/14.
(b) We have:
X=x
0 1 2
0 3/28 9/28 3/28
Y =y 1 3/14 3/14 0
2 1/28 0 0
(c) The marginal distribution of X is:
X=x 0 1 2
pX (x) 10/28 15/28 3/28
Hence:
E(X) = 0 × 10/28 + 1 × 15/28 + 2 × 3/28 = 3/4.
The marginal distribution of Y is:
Y =y 0 1 2
pY (y) 15/28 12/28 1/28
Hence:
E(Y) = 0 × 15/28 + 1 × 12/28 + 2 × 1/28 = 1/2.
The conditional distribution of X given Y = 1 is:
X = x|Y = 1 0 1
pX|Y =1 (x | y = 1) 1/2 1/2
Hence:
E(X | Y = 1) = 0 × 1/2 + 1 × 1/2 = 1/2.
(e) Since Cov(X, Y) ≠ 0, a necessary condition for independence fails to hold. The
random variables are not independent.
9. Show that the marginal distributions of a bivariate distribution are not enough to
define the bivariate distribution itself.
Solution:
Here we must show that there are two distinct bivariate distributions with the
same marginal distributions. It is easiest to think of the simplest case where X and
Y each take only two values, say 0 and 1.
Suppose the marginal distributions of X and Y are the same, with
p(0) = p(1) = 0.5. One possible bivariate distribution with these marginal
distributions is the one for which there is independence between X and Y . This has
pX,Y (x, y) = pX (x) pY (y) for all x, y. Writing it in full:
pX,Y (0, 0) = pX,Y (1, 0) = pX,Y (0, 1) = pX,Y (1, 1) = 0.5 × 0.5 = 0.25.
The table of probabilities for this choice of independence is shown in the first table
below.
Trying some other value for pX,Y (0, 0), like 0.2, gives the second table below.
X/Y 0 1 X/Y 0 1
0 0.25 0.25 0 0.2 0.3
1 0.25 0.25 1 0.3 0.2
The construction of these probabilities is done by making sure the row and column
totals are equal to 0.5, and so we now have a second distribution with the same
marginal distributions as the first.
This example is very simple, but one can almost always construct many bivariate
distributions with the same marginal distributions even for continuous random
variables.
11. There are different ways to write the covariance. Show that:
Cov(X, Y) = E(XY) − E(X) E(Y)
and:
Cov(X, Y) = E((X − E(X))Y) = E(X(Y − E(Y))).
Solution:
Working directly from the definition:
Cov(X, Y) = E((X − E(X))(Y − E(Y))) = E(XY − X E(Y) − E(X) Y + E(X) E(Y)) = E(XY) − E(X) E(Y) − E(X) E(Y) + E(X) E(Y) = E(XY) − E(X) E(Y).
Similarly, E((X − E(X))Y) = E(XY) − E(X) E(Y) = Cov(X, Y). The remaining result follows by an argument symmetric with the last one.
12. Suppose that Var(X) = Var(Y ) = 1, and that X and Y have correlation coefficient
ρ. Show that it follows from Var(X − ρY ) ≥ 0 that ρ2 ≤ 1.
Solution:
We have, since Cov(X, Y) = ρ σ_X σ_Y = ρ here:
Var(X − ρY) = Var(X) + ρ² Var(Y) − 2ρ Cov(X, Y) = 1 + ρ² − 2ρ² = 1 − ρ².
Hence 1 − ρ² ≥ 0, and so ρ² ≤ 1.
X=x −1 0 1
P (X = x) a b a
E(X) = −1 × a + 0 × b + 1 × a = 0
E(X²) = 1 × a + 0 × b + 1 × a = 2a
E(X³) = −1 × a + 0 × b + 1 × a = 0
There are many possible choices for a and b which give a valid probability
distribution, for instance a = 0.25 and b = 0.5.
14. A fair coin is thrown n times, each throw being independent of the ones before. Let
R = ‘the number of heads’, and S = ‘the number of tails’. Find the covariance of R
and S. What is the correlation of R and S?
Solution:
One can go about this in a straightforward way. If Xi is the number of heads and
Yi is the number of tails on the ith throw, then the distribution of Xi and Yi is
given by:
X/Y 0 1
0 0 0.5
1 0.5 0
For each throw, E(XᵢYᵢ) = 0 (since Xᵢ and Yᵢ cannot both equal 1) while E(Xᵢ) = E(Yᵢ) = 0.5, so Cov(Xᵢ, Yᵢ) = 0 − 0.25 = −0.25. Summing over the n independent throws gives Cov(R, S) = −0.25n. Also, Var(R) = Var(S) = 0.25n (add the variances of the Xᵢs or Yᵢs). The correlation between R and S therefore works out as −0.25n/0.25n = −1.
15. Suppose that X and Y have a bivariate distribution. Find the covariance of the
new random variables W = aX + bY and V = cX + dY where a, b, c and d are
constants.
Solution:
The covariance of W and V is:
Cov(W, V) = Cov(aX + bY, cX + dY) = ac Var(X) + bd Var(Y) + (ad + bc) Cov(X, Y).
16. Following on from Question 15, show that, if the variances of X and Y are the
same, then W = X + Y and V = X − Y are uncorrelated.
Solution:
Here we have a = b = c = 1 and d = −1. Substituting into the formula found above:
σ_{WV} = σ²_X − σ²_Y = 0.
There is no assumption here that X and Y are independent. It is not true that W
and V are independent without further restrictions on X and Y .
(b) For random variables X and Y , and constants a, b, c and d, show that:
3. X and Y are discrete random variables which can assume values 0, 1 and 2 only.
(a) Draw up a table to describe the joint distribution of X and Y and find the
value of the constant A.
(b) Describe the marginal distributions of X and Y .
(c) Give the conditional distribution of X | Y = 1 and find E(X | Y = 1).
(d) Are X and Y independent? Give a reason for your answer.
Appendix F
Sampling distributions of statistics
where F ∼ F_{5,17}, using Table A.3 of the Dougherty Statistical Tables (practice of which will be covered later in the course).
(d) A chi-squared random variable only assumes non-negative values. Hence each of A, B and C is non-negative, so A³ + B³ + C³ ≥ 0, and:
P(A³ + B³ + C³ < 0) = 0.
Solution:
(a) We have X₁ ∼ N(0, 9) and X₂ ∼ N(0, 9). Hence 2X₂ ∼ N(0, 36) and X₁ + 2X₂ ∼ N(0, 45). So:
P(X₁ + 2X₂ > 9) = P(Z > 9/√45) = P(Z > 1.34) = 0.0901.
(b) We have X₁/3 ∼ N(0, 1) and X₂/3 ∼ N(0, 1). Hence X₁²/9 ∼ χ²_1 and X₂²/9 ∼ χ²_1. Therefore, X₁²/9 + X₂²/9 ∼ χ²_2, so the required probability can be found from the distribution of Y ∼ χ²_2.
(c) We have X₁²/9 + X₂²/9 ∼ χ²_2 and also X₃²/9 + X₄²/9 ∼ χ²_2, so the ratio (X₁² + X₂²)/(X₃² + X₄²) follows an F_{2,2} distribution. Hence:
P(X₁² + X₂² > 99(X₃² + X₄²)) = P(Y > 99) = 0.01
where Y ∼ F_{2,2}.
Solution:
(a) We have Xᵢ ∼ N(0, 4), for i = 1, 2, 3, hence:
X₁ − X₂ − X₃ ∼ N(0, 12).
So:
P(X₁ > X₂ + X₃) = P(X₁ − X₂ − X₃ > 0) = P(Z > 0) = 0.5.
(b) Since X₁²/4 ∼ χ²_1 and (X₂² + X₃²)/4 ∼ χ²_2, the ratio 2X₁²/(X₂² + X₃²) follows an F_{1,2} distribution. So:
P(X₁² > 9.25(X₂² + X₃²)) = P(2X₁²/(X₂² + X₃²) > 9.25 × 2) = P(Y > 18.5) = 0.05
where Y ∼ F_{1,2}.
(c) We have:
P(X₁ > 5(X₂² + X₃²)^{1/2}) = P(X₁/2 > 5(X₂²/4 + X₃²/4)^{1/2})
= P(X₁/2 > 5√2 ((X₂²/4 + X₃²/4)/2)^{1/2})
i.e. P(Y₁ > 5√2 √Y₂), where Y₁ ∼ N(0, 1) and Y₂ ∼ χ²_2/2, or P(Y₃ > 7.07), where Y₃ ∼ t_2. From Table 10 of the New Cambridge Statistical Tables, this is approximately 0.01.
Solution:
(a) We have Xᵢ ∼ N(0, 4), for i = 1, 2, 3, 4, hence 3X₁ ∼ N(0, 36) and 4X₂ ∼ N(0, 64). Therefore:
(3X₁ + 4X₂)/10 = Z ∼ N(0, 1).
So, P(3X₁ + 4X₂ > 5) = k = P(Z > 0.5) = 0.3085.
(b) We have Xᵢ/2 ∼ N(0, 1), for i = 1, 2, 3, 4, hence (X₃² + X₄²)/4 ∼ χ²_2. So:
P(X₁ > k√(X₃² + X₄²)) = 0.025 = P(T > k√2)
where T ∼ t_2, and hence k√2 = 4.303, so k = 3.04268.
(c) We have (X₁² + X₂² + X₃²)/4 ∼ χ²_3, so:
P(X₁² + X₂² + X₃² < k) = 0.9 = P(X < k/4)
where X ∼ χ²_3. Therefore, k/4 = 6.251. Hence k = 25.004.
(d) P(X₂² + X₃² + X₄² > 19X₁² + 20X₃²) = k simplifies to:
P(X₂² + X₄² > 19(X₁² + X₃²)) = k
and:
(X₂² + X₄²)/(X₁² + X₃²) ∼ F_{2,2}.
So, from Table A.3 of the Dougherty Statistical Tables, k = 0.05.
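The tabulated percentage points used throughout this question can be recovered with scipy's inverse-cdf functions, as a quick check (our own sketch, not part of the original solution):

from scipy.stats import chi2, f, t

print(f.ppf(0.99, 2, 2))     # upper 1% point of F(2,2), about 99
print(f.ppf(0.95, 1, 2))     # upper 5% point of F(1,2), about 18.5
print(t.ppf(0.975, 2))       # upper 2.5% point of t(2), about 4.303
print(chi2.ppf(0.90, 3))     # upper 10% point of chi-squared(3), about 6.251
print(f.ppf(0.95, 2, 2))     # upper 5% point of F(2,2), about 19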
6. Suppose that the heights of students are normally distributed with a mean of 68.5
inches and a standard deviation of 2.7 inches. If 200 random samples of size 25 are
drawn from this population with means recorded to the nearest 0.1 inch, find:
(a) the expected mean and standard deviation of the sampling distribution of the
mean
(b) the expected number of recorded sample means which fall between 67.9 and
69.2 inclusive
(c) the expected number of recorded sample means falling below 67.0.
Solution:
(a) The sampling distribution of the mean of 25 observations has the same mean as the population, which is 68.5 inches. The standard deviation (standard error) of the sample mean is 2.7/√25 = 0.54.
(b) Notice that the samples are random, so we cannot be sure exactly how many
will have means between 67.9 and 69.2 inches. We can work out the probability
that the sample mean will lie in this interval using the sampling distribution:
X̄ ∼ N(68.5, (0.54)²).
We need to make a continuity correction, to account for the fact that the
recorded means are rounded to the nearest 0.1 inch. For example, the
probability that the recorded mean is ≥ 67.9 inches is the same as the
probability that the sample mean is > 67.85. Therefore, the probability we
want is:
P(67.85 < X̄ < 69.25) = P((67.85 − 68.5)/0.54 < Z < (69.25 − 68.5)/0.54)
= P(−1.20 < Z < 1.39)
= Φ(1.39) − Φ(−1.20)
= 0.9177 − 0.1151
= 0.8026.
As usual, the values of Φ(1.39) and Φ(−1.20) can be found from Table 4 of the
New Cambridge Statistical Tables. Since there are 200 independent random
samples drawn, we can now think of each as a single trial. The recorded mean
lies between 67.9 and 69.2 with probability 0.8026 at each trial. We are dealing
with a binomial distribution with n = 200 trials and probability of success
π = 0.8026. The expected number of successes is therefore 200 × 0.8026 = 160.52, i.e. we expect about 161 of the 200 recorded means to fall in this range.
(c) The probability that the recorded mean is < 67.0 inches is:
P(X̄ < 66.95) = P(Z < (66.95 − 68.5)/0.54) = P(Z < −2.87) = Φ(−2.87) = 0.00205
so the expected number of recorded means below 67.0 out of a sample of 200 is 200 × 0.00205 = 0.41.
Solution:
We can compute the probability in two different ways. Working with the standard
normal distribution, we have:
P(Z² < 3.841) = P(−√3.841 < Z < √3.841) = P(−1.96 < Z < 1.96) = 0.95.
Alternatively, we can use the fact that Z 2 follows a χ21 distribution. From Table 8
of the New Cambridge Statistical Tables we can see that 3.841 is the 5% right-tail
value for this distribution, and so P (Z 2 < 3.84) = 0.95, as before.
Since X₁/2 and X₂/2 are independent N(0, 1) random variables, the sum of their squares will follow a χ²_2 distribution. Using Table 8 of the New Cambridge
Statistical Tables, we see that 9.210 is the 1% right-tail value, so the probability we
are looking for is 0.99.
P(X₁² + X₂² < 7.236Y − X₃²) = P(X₁² + X₂² + X₃² < 7.236Y)
= P((X₁² + X₂² + X₃²)/Y < 7.236)
= P([(X₁² + X₂² + X₃²)/3]/[Y/5] < (5/3) × 7.236)
= P([(X₁² + X₂² + X₃²)/3]/[Y/5] < 12.060).
Since X₁² + X₂² + X₃² ∼ χ²_3, we have a ratio of independent χ²_3 and χ²_5 random variables, each divided by its degrees of freedom. By definition, this follows an F_{3,5} distribution. From Table A.3 of the Dougherty Statistical Tables, we see that 12.06 is the 1% upper-tail value for this distribution, so the probability we want is equal to 0.99.
10. Compare the normal distribution approximation to the exact values for the
upper-tail probabilities for the binomial distribution with 100 trials and probability
of success 0.1.
Solution:
Let R ∼ Bin(100, 0.1) denote the exact number of successes. It has mean and
variance:
E(R) = nπ = 100 × 0.1 = 10
and:
Var(R) = nπ(1 − π) = 100 × 0.1 × 0.9 = 9
so we use the approximation R ∼ N(10, 9), approximately, or, equivalently:
(R − 10)/√9 = (R − 10)/3 ∼ N(0, 1), approximately.
Applying a continuity correction of 0.5 (for example, 7.8 successes are rounded up
to 8) gives:
P(R ≥ r) ≈ P(Z > (r − 0.5 − 10)/3).
The results are summarised in the following table. The first column is the number
of successes; the second gives the exact binomial probabilities; the third column
lists the corresponding z-values (with the continuity correction); and the fourth
gives the probabilities for the normal approximation.
Although the agreement between columns two and four is not too bad, you may
think it is not as close as you would like for some applications.
r P (R ≥ r) z = (r − 0.5 − 10)/3 P (Z > z)
1 0.999973 −3.1667 0.999229
2 0.999678 −2.8333 0.997697
3 0.998055 −2.5000 0.993790
4 0.992164 −2.1667 0.984870
5 0.976289 −1.8333 0.966624
6 0.942423 −1.5000 0.933193
7 0.882844 −1.1667 0.878327
8 0.793949 −0.8333 0.797672
9 0.679126 −0.5000 0.691462
10 0.548710 −0.1667 0.566184
11 0.416844 0.1667 0.433816
12 0.296967 0.5000 0.308538
13 0.198179 0.8333 0.202328
14 0.123877 1.1667 0.121673
15 0.072573 1.5000 0.066807
16 0.039891 1.8333 0.033376
2. Suppose that we plan to take a random sample of size n from a normal distribution
with mean µ and standard deviation σ = 2.
(a) Suppose µ = 4 and n = 20.
i. What is the probability that the mean X̄ of the sample is greater than 5?
ii. What is the probability that X̄ is smaller than 3?
iii. What is P (|X̄ − µ| ≤ 1) in this case?
(b) How large should n be in order that P (|X̄ − µ| ≤ 0.5) ≥ 0.95 for every possible
value of µ?
(c) It is claimed that the true value of µ is 5 in a population. A random sample of
size n = 100 is collected from this population, and the mean for this sample is
x̄ = 5.8. Based on the result in (b), what would you conclude from this value
of X̄?
Appendix G
Point estimation
Therefore, setting µ̂₁ = M₁, we have:
θ̂/2 = X̄  ⇒  θ̂ = 2X̄ = (2/n) Σ_{i=1}^{n} Xᵢ.
3. Let X ∼ Bin(n, π), where n is known. Find the method of moments estimator
(MME) of π.
Solution:
The pf of the binomial distribution is:
P(X = x) = [n!/(x! (n − x)!)] π^x (1 − π)^{n−x}  for x = 0, 1, 2, . . . , n
and:
E(X²) = ∫_a^∞ x² λ exp(−λ(x − a)) dx = ∫₀^∞ (y/λ + a)² e^{−y} dy = 2/λ² + 2a/λ + a².
5. Let {X1 , X2 , . . . , Xn } be a random sample from the distribution N (µ, 1). Find the
maximum likelihood estimator (MLE) of µ.
Solution:
The joint pdf of the observations is:
f(x₁, x₂, . . . , xₙ; µ) = Π_{i=1}^{n} (1/√(2π)) exp(−(xᵢ − µ)²/2) = (2π)^{−n/2} exp(−(1/2) Σ_{i=1}^{n} (xᵢ − µ)²).
and:
l(λ) = ln L(λ) = nX̄ ln(λ) − nλ + C = n(X̄ ln(λ) − λ) + C
where C is a constant (i.e. it may depend on the Xᵢ but cannot depend on the parameter). Setting:
(d/dλ) l(λ) = n(X̄/λ − 1) = 0
we obtain the MLE λ̂ = X̄, which is also the MME.
Solution:
(a) The pdf of Uniform[0, θ] is:
f(x; θ) = 1/θ for 0 ≤ x ≤ θ, and 0 otherwise.
The joint pdf is:
f(x₁, x₂, . . . , xₙ; θ) = θ^{−n} for 0 ≤ x₁, x₂, . . . , xₙ ≤ θ, and 0 otherwise.
In fact f(x₁, x₂, . . . , xₙ; θ), as a function of θ, is the likelihood function, L(θ). The maximum likelihood estimator of θ is the value at which the likelihood function L(θ) achieves its maximum. Note:
L(θ) = θ^{−n} for X₍ₙ₎ ≤ θ, and 0 otherwise
where X₍ₙ₎ = maxᵢ Xᵢ. Hence the MLE is θ̂ = X₍ₙ₎, which is different from the MME. For example, if x₍ₙ₎ = 1.16, then L(θ) is zero for θ < 1.16 and equal to θ^{−n}, which is decreasing in θ, for θ ≥ 1.16, so the likelihood is maximised at θ̂ = 1.16.
(b) For the given data, the maximum observation is x(3) = 3.6. Therefore, the
maximum likelihood estimate is θb = 3.6.
8. Use the observed random sample x1 = 8.2, x2 = 10.6, x3 = 9.1 and x4 = 4.9 to
calculate the maximum likelihood estimate of λ in the exponential pdf:
f(x; λ) = λe^{−λx} for x ≥ 0, and 0 otherwise.
Solution:
We derive a general formula with a random sample {X₁, X₂, . . . , Xₙ} first. The joint pdf is:
f(x₁, x₂, . . . , xₙ; λ) = λⁿ e^{−λnx̄} for x₁, x₂, . . . , xₙ ≥ 0, and 0 otherwise.
This is the likelihood function L(λ), with log-likelihood l(λ) = n ln(λ) − λnX̄. Setting:
(d/dλ) l(λ) = n/λ − nX̄ = 0  ⇒  λ̂ = 1/X̄.
For the given sample, x̄ = (8.2 + 10.6 + 9.1 + 4.9)/4 = 8.2. Therefore, λ̂ = 0.1220.
9. The following data show the number of occupants in passenger cars observed
during one hour at a busy junction. It is assumed that these data follow a
geometric distribution with pf:
p(x; π) = (1 − π)^{x−1} π for x = 1, 2, . . ., and 0 otherwise.
However, we only know that there are 678 xi s equal to 1, 227 xi s equal to 2, . . .,
and 14 xi s equal to some integers not smaller than 6.
Note that:
P(Xᵢ ≥ 6) = Σ_{x=6}^{∞} p(x; π) = π(1 − π)⁵ (1 + (1 − π) + (1 − π)² + · · ·) = π(1 − π)⁵ × 1/π = (1 − π)⁵.
Hence the likelihood is:
L(π) = p(1; π)^{678} p(2; π)^{227} p(3; π)^{56} p(4; π)^{28} p(5; π)^{8} ((1 − π)⁵)^{14}
= π^{1,011−14} (1 − π)^{227+56×2+28×3+8×4+14×5}
= π^{997} (1 − π)^{525}
hence:
l(π) = ln L(π) = 997 ln(π) + 525 ln(1 − π).
Setting:
(d/dπ) l(π) = 997/π − 525/(1 − π) = 0  ⇒  π̂ = 997/(997 + 525) = 0.655.
Remark: Since P(Xᵢ = 1) = π, π̂ = 0.655 indicates that about 2/3 of cars have only one occupant. Note E(Xᵢ) = 1/π. In order to ensure that the average number of occupants is not smaller than k, we require π ≤ 1/k.
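The closed-form estimate π̂ = 997/(997 + 525) can be checked by maximising the grouped-data log-likelihood numerically; the sketch below uses scipy's bounded scalar optimiser (our own choice of tool):

import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_lik(pi):
    # Negative of l(pi) = 997 ln(pi) + 525 ln(1 - pi).
    return -(997 * np.log(pi) + 525 * np.log(1 - pi))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, 997 / (997 + 525))   # both about 0.655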
Since n > 2, we can see that θb1 has a lower variance than θb2 , so it is a better
estimator. Unsurprisingly, we obtain a better estimator of θ by considering the
whole sample, rather than just the first two values.
Solution:
We need to introduce the term E(θ)
b inside the expectation, so we add and subtract
it to obtain:
b = E((θb − θ)2 )
MSE(θ)
= E ((θb − E(θ)) b 2
b − (θ − E(θ)))
2 2
= E (θ − E(θ)) − 2(θ − E(θ))(θ − E(θ)) + (θ − E(θ))
b b b b b b
b 2 ) − 2E((θb − E(θ))(θ
= E((θb − E(θ)) b − E(θ))) b 2 ).
b + E((θ − E(θ))
15. Let {X1 , X2 , . . . , Xn } be a random sample from a Bin(m, π) distribution, with both
m and π unknown. Find the method of moments estimators of m, the number of
trials, and π, the probability of success.
Solution:
There are two unknown parameters, so we need two equations. The expectation
and variance of a Bin(m, π) distribution are mπ and mπ(1 − π), respectively, so we
have:
µ₁ = E(X) = mπ
and:
µ₂ = Var(X) + (E(X))² = mπ(1 − π) + (mπ)².
Setting the first two sample and population moments equal gives:
(1/n) Σ_{i=1}^{n} Xᵢ = m̂π̂  and  (1/n) Σ_{i=1}^{n} Xᵢ² = m̂π̂(1 − π̂) + (m̂π̂)².
n i=1 n i=1 i
The two equations need to be solved simultaneously. Solving the first equation for
π̂ gives:
π̂ = (Σ_{i=1}^{n} Xᵢ/n)/m̂ = X̄/m̂.
Now we can substitute π̂ into the second moment equation to obtain:
(1/n) Σ_{i=1}^{n} Xᵢ² = m̂ (X̄/m̂)(1 − X̄/m̂) + (m̂ X̄/m̂)² = X̄(1 − X̄/m̂) + X̄²
which can be solved for m̂, giving m̂ = X̄²/(X̄ + X̄² − (1/n) Σ_{i=1}^{n} Xᵢ²), and then π̂ = X̄/m̂.
16. Consider again the Uniform[−θ, θ] distribution from Question 14. Suppose that we
observe the following data:
which implies that the data came from a Uniform[−2.518, 2.518] distribution.
However, this clearly cannot be true since the observation x5 = 2.8 falls outside this
range! The method of moments does not take into account that all of the
observations need to lie in the interval [−θ, θ], and so it fails to produce a useful
estimate.
17. Let {X1 , X2 , . . . , Xn } be a random sample from an Exp(λ) distribution. Find the
MLE of λ.
Solution:
The likelihood function is:
L(λ) = Π_{i=1}^{n} f(xᵢ; λ) = Π_{i=1}^{n} λe^{−λXᵢ} = λⁿ e^{−λ Σᵢ Xᵢ} = λⁿ e^{−λnX̄}
so the log-likelihood is l(λ) = n ln(λ) − λnX̄. Setting (d/dλ) l(λ) = n/λ − nX̄ = 0 gives the MLE λ̂ = 1/X̄, and since:
(d²/dλ²) l(λ) = −n/λ² < 0
this is indeed a maximum.
18. Let {X1 , X2 , . . . , Xn } be a random sample from a N (µ, σ 2 ) distribution. Find the
MLE of σ 2 if:
(a) µ is known
(b) µ is unknown.
In each case, work out if the MLE is an unbiased estimator of σ 2 .
Solution:
The likelihood function is:
L(µ, σ²) = Π_{i=1}^{n} f(xᵢ; µ, σ²) = Π_{i=1}^{n} (1/√(2πσ²)) exp(−(Xᵢ − µ)²/(2σ²)) = (2πσ²)^{−n/2} exp(−(1/(2σ²)) Σ_{i=1}^{n} (Xᵢ − µ)²)
so the log-likelihood is:
l(µ, σ²) = −(n/2) ln(2πσ²) − (1/(2σ²)) Σ_{i=1}^{n} (Xᵢ − µ)².
Differentiating with respect to σ² and setting the derivative equal to zero gives:
(d/dσ²) l(µ, σ²) = −n/(2σ̂²) + (1/(2σ̂⁴)) Σ_{i=1}^{n} (Xᵢ − µ)² = 0.
If µ is known, we can solve this equation for σ̂²:
n/(2σ̂²) = (1/(2σ̂⁴)) Σ_{i=1}^{n} (Xᵢ − µ)²  ⇒  (n/2) σ̂² = (1/2) Σ_{i=1}^{n} (Xᵢ − µ)²  ⇒  σ̂² = (1/n) Σ_{i=1}^{n} (Xᵢ − µ)².
One can check that this stationary point is indeed a maximum. We can work out the bias of this estimator directly:
E(σ̂²) = E((1/n) Σ_{i=1}^{n} (Xᵢ − µ)²) = (σ²/n) E(Σ_{i=1}^{n} (Xᵢ − µ)²/σ²) = (σ²/n) Σ_{i=1}^{n} E(((Xᵢ − µ)/σ)²) = (σ²/n) Σ_{i=1}^{n} E(Zᵢ²) = (σ²/n) × n = σ²
where each Zᵢ = (Xᵢ − µ)/σ ∼ N(0, 1), so E(Zᵢ²) = 1. Hence the MLE is unbiased when µ is known.
so, whatever the value of σ², we need to ensure that Σ_{i=1}^{n} (Xᵢ − µ)² is minimised. However, we have:
Σ_{i=1}^{n} (Xᵢ − µ)² = Σ_{i=1}^{n} (Xᵢ − X̄)² + n(X̄ − µ)².
Only the second term on the right-hand side depends on µ and, because of the
square, its minimum value is zero. It is minimised when µ is equal to the sample
mean, so this is the MLE of µ:
µ
b = X̄.
The resulting MLE of σ² is:
σ̂² = (1/n) Σ_{i=1}^{n} (Xᵢ − X̄)².
This is not the same as the sample variance S², where we divide by n − 1 instead of n. The expectation of the MLE of σ² is:
E(σ̂²) = E((1/n) Σ_{i=1}^{n} (Xᵢ − X̄)²) = (1/n) E((n − 1) × (1/(n − 1)) Σ_{i=1}^{n} (Xᵢ − X̄)²) = (1/n) E((n − 1)S²) = (σ²/n) E((n − 1)S²/σ²).
The term inside the expectation, (n − 1)S²/σ², follows a χ²_{n−1} distribution, and so:
E(σ̂²) = (σ²/n)(n − 1).
This is not equal to σ², so the MLE of σ² is a biased estimator in this case. (Note that the estimator σ̂² = S² is an unbiased estimator of σ².) The bias of the MLE is:
Bias(σ̂²) = E(σ̂²) − σ² = (σ²/n)(n − 1) − σ² = −σ²/n
which tends to zero as n → ∞. In such cases, we say that the estimator is asymptotically unbiased.
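The downward bias of −σ²/n (and the unbiasedness of S²) can be illustrated by a small simulation; the sample size, parameter values and number of replications below are illustrative choices of ours:

import numpy as np

rng = np.random.default_rng(0)
n, mu, sigma2, reps = 10, 5.0, 4.0, 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
mle = ss / n          # MLE of sigma^2 (divide by n)
s2 = ss / (n - 1)     # sample variance S^2 (divide by n - 1)

print(mle.mean())     # close to sigma^2 * (n - 1)/n = 3.6
print(s2.mean())      # close to sigma^2 = 4.0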
4. Given a random sample of n values from a normal distribution with unknown mean and variance, consider the following two estimators of σ² (the unknown population variance), where Sxx = Σ(Xᵢ − X̄)²:
T₁ = Sxx/(n − 1)  and  T₂ = Sxx/n.
For each of these determine its bias, its variance and its mean squared error. Which
has the smaller mean squared error?
Hint: use the fact that Var(S 2 ) = 2σ 4 /(n − 1) for a random sample of size n, or
some equivalent formula.
y 1 = α + β + ε1
y2 = −α + β + ε2
y 3 = α − β + ε3
y4 = −α − β + ε4 .
Appendix H
Interval estimation
Solution:
(a) With an available random sample {X1 , X2 , . . . , Xn } from the normal
distribution N (µ, σ 2 ) with σ 2 known, a 95% confidence interval for µ is of the
form:
(X̄ − 1.96 × σ/√n, X̄ + 1.96 × σ/√n).
Hence the width of the confidence interval is:
(X̄ + 1.96 × σ/√n) − (X̄ − 1.96 × σ/√n) = 2 × 1.96 × σ/√n = 3.92 × σ/√n.
(b) Let 3.92 × σ/√n ≤ d, and so we obtain the condition for the required sample size:
n ≥ (3.92 × σ/d)² = 15.37 × σ²/d².
Therefore, in order to achieve the required accuracy, the sample size n should be at least as large as 15.37 × σ²/d².
Note that as the variance σ² increases, the width of the confidence interval increases, while as the sample size n increases, the width decreases. Also, note that when σ² is unknown, the width of a confidence interval for µ depends on S. Therefore, the width is a random variable.
2. The data below are from a random sample of size n = 9 taken from the distribution
N (µ, σ 2 ):
3.75, 5.67, 3.14, 7.89, 3.40, 9.32, 2.80, 10.34 and 14.31.
(a) Assume σ 2 = 16. Find a 95% confidence interval for µ. If the width of such a
confidence interval must not exceed 2.5, at least how many observations do we
need?
(b) Suppose σ 2 is now unknown. Find a 95% confidence interval for µ. Compare
the result with that obtained in (a) and comment.
(c) Obtain a 95% confidence interval for σ 2 .
Solution:
(a) We have x̄ = 6.74. For a 95% confidence interval, α = 0.05 so we need to find
the top 100α/2 = 2.5th percentile of N (0, 1), which is 1.96. Since σ = 4 and
n = 9, a 95% confidence interval for µ is:
x̄ ± 1.96 × σ/√n  ⇒  (6.74 − 1.96 × 4/3, 6.74 + 1.96 × 4/3) = (4.13, 9.35).
In general, a 100(1 − α)% confidence interval for µ is:
(X̄ − z_{α/2} × σ/√n, X̄ + z_{α/2} × σ/√n)
where z_α denotes the top 100αth percentile of the standard normal distribution, i.e. such that:
P(Z > z_α) = α
where Z ∼ N(0, 1). Hence the width of the confidence interval is:
2 × z_{α/2} × σ/√n.
For this example, α = 0.05, z0.025 = 1.96 and σ = 4. Setting the width of the
confidence interval to be at most 2.5, we have:
2 × 1.96 × σ/√n = 15.68/√n ≤ 2.5.
Hence:
n ≥ (15.68/2.5)² = 39.34.
So we need a sample of at least 40 observations in order to obtain a 95%
confidence interval with a width not greater than 2.5.
(b) When σ² is unknown, a 95% confidence interval for µ is:
(X̄ − t_{α/2, n−1} × S/√n, X̄ + t_{α/2, n−1} × S/√n)
where S² = Σ_{i=1}^{n} (Xᵢ − X̄)²/(n − 1), and t_{α, k} denotes the top 100αth percentile of the Student's t_k distribution, i.e. such that:
P(T > t_{α, k}) = α
for T ∼ t_k. For this example, s² = 16, s = 4, n = 9 and t_{0.025, 8} = 2.306. Hence a 95% confidence interval for µ is:
6.74 ± 2.306 × 4/3  ⇒  (3.67, 9.81).
This confidence interval is much wider than the one obtained in (a). Since we
do not know σ 2 , we have less information available for our estimation. It is
only natural that our estimation becomes less accurate.
Note that although the sample size is n, the Student’s t distribution used has
only n − 1 degrees of freedom. The loss of 1 degree of freedom in the sample
variance is due to not knowing µ. Hence we estimate µ using the data, for
which we effectively pay a ‘price’ of one degree of freedom.
(c) Note (n − 1)S²/σ² ∼ χ²_{n−1} = χ²_8. From Table 8 of the New Cambridge Statistical Tables, for X ∼ χ²_8, we find that:
P(X < 2.180) = P(X > 17.53) = 0.025.
Hence:
P(2.180 < 8S²/σ² < 17.53) = 0.95.
Therefore, the lower bound for σ² is 8s²/17.53, and the upper bound is 8s²/2.180. Therefore, a 95% confidence interval for σ², noting s² = 16, is:
(7.302, 58.716).
Note that the estimation in this example is rather inaccurate. This is due to
two reasons.
i. The sample size is small.
ii. The population variance, σ 2 , is large.
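All three interval calculations for these data can be reproduced with a short scipy sketch (our own check; scipy supplies the normal, t and chi-squared percentage points used above):

import numpy as np
from scipy.stats import norm, t, chi2

x = np.array([3.75, 5.67, 3.14, 7.89, 3.40, 9.32, 2.80, 10.34, 14.31])
n, xbar, s2 = len(x), x.mean(), x.var(ddof=1)

# (a) sigma^2 = 16 known.
z = norm.ppf(0.975)
print(xbar - z * 4 / np.sqrt(n), xbar + z * 4 / np.sqrt(n))

# (b) sigma^2 unknown: t distribution with n - 1 degrees of freedom.
tc = t.ppf(0.975, n - 1)
print(xbar - tc * np.sqrt(s2 / n), xbar + tc * np.sqrt(s2 / n))

# (c) 95% confidence interval for sigma^2.
print((n - 1) * s2 / chi2.ppf(0.975, n - 1), (n - 1) * s2 / chi2.ppf(0.025, n - 1))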
3. Assume that the random variable X is normally distributed and that σ 2 is known.
What confidence level would be associated with each of the following intervals?
(a) (x̄ − 1.645 × σ/√n, x̄ + 2.326 × σ/√n).
(b) (−∞, x̄ + 2.576 × σ/√n).
(c) (x̄ − 1.645 × σ/√n, x̄).
Solution:
We have X̄ ∼ N(µ, σ²/n), hence √n(X̄ − µ)/σ ∼ N(0, 1).
(a) P (−1.645 < Z < 2.326) = 0.94, hence a 94% confidence level.
(b) P (−∞ < Z < 2.576) = 0.995, hence a 99.5% confidence level.
(c) P (−1.645 < Z < 0) = 0.45, hence a 45% confidence level.
5. A personnel manager has found that historically the scores on aptitude tests given
to applicants for entry-level positions are normally distributed with σ = 32.4
points. A random sample of nine test scores from the current group of applicants
had a mean score of 187.9 points.
(a) Find an 80% confidence interval for the population mean score of the current
group of applicants.
(b) Based on these sample results, a statistician found for the population mean a
confidence interval extending from 165.8 to 210.0 points. Find the confidence
level of this interval.
Solution:
(a) We have n = 9, x̄ = 187.9, σ = 32.4 and 1 − α = 0.80, hence α/2 = 0.10 and,
from Table 5 of the New Cambridge Statistical Tables, P (Z > 1.2816) =
1 − Φ(1.2816) = 0.10. So an 80% confidence interval is:
187.9 ± 1.2816 × 32.4/√9  ⇒  (174.06, 201.74).
(b) The half-width of the confidence interval is 210.0 − 187.9 = 22.1, which is
equal to the margin of error, i.e. we have:
σ 32.4
22.1 = k × √ = k × √ ⇒ k = 2.05.
n 9
P (Z > 2.05) = 1 − Φ(2.05) = 0.02018 = α/2 ⇒ α = 0.04036. Hence we have
a 100(1 − α)% = 100(1 − 0.04036)% ≈ 96% confidence interval.
Solution:
(a) We have n = 10, s² = (2.36)² = 5.5696, χ²_{0.975, 9} = 2.700 and χ²_{0.025, 9} = 19.02. Hence a 95% confidence interval for σ² is:
((n − 1)s²/χ²_{0.025, n−1}, (n − 1)s²/χ²_{0.975, n−1}) = (9 × 5.5696/19.02, 9 × 5.5696/2.700) = (2.64, 18.57).
Note that χ²_{0.995, n−1} < χ²_{0.975, n−1} and χ²_{0.005, n−1} > χ²_{0.025, n−1}.
7. Why do we not always choose a very high confidence level for a confidence interval?
Solution:
We do not always want to use a very high confidence level because the confidence
interval would be very wide. We have a trade-off between the width of the
confidence interval and the coverage probability.
8. Suppose that 9 bags of sugar are selected from the supermarket shelf at random
and weighed. The weights in grammes are 812.0, 786.7, 794.1, 791.6, 811.1, 797.4,
797.8, 800.8 and 793.2. Construct a 95% confidence interval for the mean weight of
all the bags on the shelf. Assume the population is normal.
Solution:
Here we have a random sample of size n = 9. The mean is 798.30. The sample
variance is s2 = 72.76, which gives a sample standard deviation s = 8.53. From
Table 10 of the New Cambridge Statistical Tables, the top 2.5th percentile of the t
distribution with n − 1 = 8 degrees of freedom is 2.306. Therefore, a 95%
confidence interval is:
(798.30 − 2.306 × 8.53/√9, 798.30 + 2.306 × 8.53/√9) = (798.30 − 6.56, 798.30 + 6.56)
= (791.74, 804.86).
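For those working with software, the same interval can be reproduced directly from the raw data; the sketch below uses numpy and scipy and should agree with the hand calculation up to rounding.

import numpy as np
from scipy import stats

weights = np.array([812.0, 786.7, 794.1, 791.6, 811.1, 797.4, 797.8, 800.8, 793.2])
n = len(weights)
xbar = weights.mean()
s = weights.std(ddof=1)                       # sample standard deviation, about 8.53
t_crit = stats.t.ppf(0.975, df=n - 1)         # 2.306
half_width = t_crit * s / np.sqrt(n)
print(xbar - half_width, xbar + half_width)   # approximately (791.74, 804.86)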
9. Continuing Question 8, suppose we are now told that σ, the population standard
deviation, is known to be 8.5 g. Construct a 95% confidence interval using this
information.
Solution:
From Table 10 of the New Cambridge Statistical Tables, the top 2.5th percentile of
the standard normal distribution is z_{0.025} = 1.96 (recall t_∞ = N(0, 1)), so a 95%
confidence interval for the population mean is:
(798.30 − 1.96 × 8.5/√9, 798.30 + 1.96 × 8.5/√9) = (798.30 − 5.55, 798.30 + 5.55)
= (792.75, 803.85).
Again, it may be more useful to write this as 798.30 ± 5.55. Note that this
confidence interval is narrower than the one in Question 8, even though our initial
estimate s turned out to be very close to the true value of σ.
10. Construct a 90% confidence interval for the variance of the bags of sugar in
Question 8. Does the given value of 8.5 g for the population standard deviation
seem plausible?
Solution:
We have n = 9 and s² = 72.76. For a 90% confidence interval, we need the bottom
and top 5th percentiles of the chi-squared distribution on n − 1 = 8 degrees of
freedom. These are χ²_{0.95,8} = 2.733 and χ²_{0.05,8} = 15.51, so the confidence interval is:
(8 × 72.76/15.51, 8 × 72.76/2.733) = (37.529, 213.010).
The given value of 8.5 g corresponds to a variance of (8.5)² = 72.25, which falls well
within this confidence interval, so we have no reason to doubt it.
2. (a) A sample of 954 adults in early 1987 found that 23% of them held shares.
Given a UK adult population of 41 million and assuming a proper random
sample was taken, construct a 95% confidence interval estimate for the number
of shareholders in the UK.
(b) A ‘similar’ survey the previous year had found a total of 7 million shareholders.
Assuming ‘similar’ means the same sample size, construct a 95% confidence
interval estimate of the increase in shareholders between the two years.
Appendix I
Hypothesis testing
Solution:
The critical value for the test is z_{0.95} = −1.645 and the probability of rejecting H0
with this test is:
P((X̄ − 7)/(0.25/√n) < −1.645)
which we rewrite as:
P((X̄ − 6.95)/(0.25/√n) < (7 − 6.95)/(0.25/√n) − 1.645).
Therefore:
(7 − 6.95)/(0.25/√n) − 1.645 = 1.282
0.2 × √n = 2.927
√n = 14.635
n = 214.1832.
So to ensure that the test power is at least 90%, we should use a sample size of 215.
Remark: We see a rather large sample size is required. Hence investigators are
encouraged to use sample sizes large enough to come to rational decisions.
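If you want to check such sample size calculations numerically, the following Python sketch rearranges the equation solved above; the values σ = 0.25, µ0 = 7, µ1 = 6.95, a 5% one-sided significance level and 90% power are taken from the solution.

import math
from scipy import stats

sigma = 0.25
delta = 7 - 6.95                   # difference from the hypothesised mean
z_alpha = stats.norm.ppf(0.95)     # about 1.645
z_power = stats.norm.ppf(0.90)     # about 1.282
n = ((z_alpha + z_power) * sigma / delta) ** 2
print(n, math.ceil(n))             # about 214.2, so use n = 215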
2. A doctor claims that the average European is more than 8.5 kg overweight. To test
this claim, a random sample of 12 Europeans were weighed, and the difference
between their actual weight and their ideal weight was calculated. The data are:
14, 12, 8, 13, −1, 10, 11, 15, 13, 20, 7 and 14.
Assuming the data follow a normal distribution, conduct a t test to infer at the 5%
significance level whether or not the doctor’s claim is true.
Solution:
We have a random sample of size n = 12 from N (µ, σ 2 ), and we test H0 : µ = 8.5
vs. H1 : µ > 8.5. The test statistic, under H0 , is:
T = (X̄ − 8.5)/(S/√n) = (X̄ − 8.5)/(S/√12) ~ t_11.
Hence:
t = (11.333 − 8.5)/√(26.606/12) = 1.903 > 1.796 = t_{0.05,11}
so we reject H0 at the 5% significance level. There is significant evidence to support
the doctor’s claim.
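The test statistic can also be checked with software; a short numpy/scipy sketch for this one-sided t test is given below.

import numpy as np
from scipy import stats

diffs = np.array([14, 12, 8, 13, -1, 10, 11, 15, 13, 20, 7, 14])
n = len(diffs)
t_stat = (diffs.mean() - 8.5) / (diffs.std(ddof=1) / np.sqrt(n))
t_crit = stats.t.ppf(0.95, df=n - 1)        # 1.796
print(round(t_stat, 3), round(t_crit, 3))   # about 1.903 > 1.796, so reject H0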
Solution:
(a) We test:
H0 : σ 2 = 8 vs. H1 : σ 2 > 8.
The test statistic, under H0, is:
T = (n − 1)S²/σ₀² = 20 × S²/8 ~ χ²_20.
We reject H0 at the 5% significance level if t ≥ 31.41, since χ²_{0.05,20} = 31.41.
(b) To evaluate the power, we need the probability of rejecting H0 (which happens
if t ≥ 31.41) conditional on the actual value of σ², that is:
P(T ≥ 31.41 | σ² = k) = P(T × 8/k ≥ 31.41 × 8/k)
where T × 8/k = 20S²/k ~ χ²_20 when σ² = k.
H0 : µX = µY vs. H1 : µX > µY .
The sample means and standard deviations are x̄ = 389.5, ȳ = 307.8, sX = 55.40
and sY = 69.45. The test statistic and its distribution under H0 are:
T = √((n + m − 2)/(1/n + 1/m)) × (X̄ − Ȳ)/√((n − 1)S²_X + (m − 1)S²_Y) ~ t_{n+m−2}
and we obtain, for the given data, t = 2.175 > 1.833 = t0.05, 9 hence we reject H0
that the mean weights are equal and conclude that the mean weight for the
high-protein diet is greater at the 5% significance level.
5. Suppose that we have two independent samples from normal populations with
known variances. We want to test the H0 that the two population means are equal
against the alternative that they are different. We could use each sample by itself
to write down 95% confidence intervals and reject H0 if these intervals did not
overlap. What would be the significance level of this test?
Solution:
Let us assume H0 : µX = µY is true, then the two 95% confidence intervals do not
overlap if and only if:
X̄ − 1.96 × σ_X/√n ≥ Ȳ + 1.96 × σ_Y/√m   or   Ȳ − 1.96 × σ_Y/√m ≥ X̄ + 1.96 × σ_X/√n.
For instance, if the two standard errors are equal, say σ_X/√n = σ_Y/√m = s, the intervals fail to
overlap only if |X̄ − Ȳ| ≥ 2 × 1.96 × s; since X̄ − Ȳ has standard deviation √2 × s under H0, this
has probability P(|Z| ≥ 3.92/√2) = P(|Z| ≥ 2.77) ≈ 0.006. So the significance level is about 0.6%,
which is much smaller than the usual conventions of 5% and 1%. Putting variability into two
confidence intervals makes them more likely to overlap than you might think, and so your chance of
incorrectly rejecting the null hypothesis is smaller than you might expect!
6. The following table shows the number of salespeople employed by a company and
the corresponding value of sales (in £000s):
Compute the sample correlation coefficient for these data and carry out a formal
test for a (linear) relationship between the number of salespeople and sales.
Note that:
Σxᵢ = 2,616,  Σyᵢ = 2,520,  Σxᵢ² = 571,500,  Σyᵢ² = 529,746  and  Σxᵢyᵢ = 550,069.
Solution:
We test:
H0 : ρ = 0 vs. H1 : ρ > 0.
The corresponding test statistic and its distribution under H0 are:
T = ρ̂√(n − 2)/√(1 − ρ̂²) ~ t_{n−2}.
We find ρ̂ = 0.8716 and obtain t = 5.62 > 2.764 = t_{0.01,10} and so we reject H0 at the
1% significance level. Since the test is highly significant, there is overwhelming
evidence of a (linear) relationship between the number of salespeople and the value
of sales.
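If you wish to reproduce this with software, the sketch below computes ρ̂ and the test statistic from the summary sums quoted above; note that n = 12 pairs is assumed here, as implied by the t_{0.01,10} critical value used in the solution.

import math

n = 12                      # assumed from the n - 2 = 10 degrees of freedom
sx, sy = 2616, 2520
sxx, syy, sxy = 571500, 529746, 550069
Sxx = sxx - sx ** 2 / n
Syy = syy - sy ** 2 / n
Sxy = sxy - sx * sy / n
r = Sxy / math.sqrt(Sxx * Syy)
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(r, 4), round(t, 2))   # about 0.8716 and 5.62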
7. Two independent samples from normal populations yield the following results:
Sample 1:  n = 5,  Σ(xᵢ − x̄)² = 4.8
Sample 2:  m = 7,  Σ(yᵢ − ȳ)² = 37.2
Test at the 10% significance level whether the population variances are the same
based on the above data.
Solution:
We test:
H0: σ₁² = σ₂²  vs.  H1: σ₁² ≠ σ₂².
Under H0 , the test statistic is:
T = S₁²/S₂² ~ F_{n−1,m−1} = F_{4,6}.
Critical values are F_{0.95,4,6} = 1/F_{0.05,6,4} = 1/6.16 = 0.16 and F_{0.05,4,6} = 4.53, using
Table A.3 of the Dougherty Statistical Tables. The test statistic value is:
t = (4.8/4)/(37.2/6) = 0.1935
and since 0.16 < 0.1935 < 4.53 we do not reject H0 , which means there is no
evidence of a difference in the variances.
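A software check of this variance-ratio test, with the critical values taken from scipy rather than the Dougherty tables, could look like the sketch below.

from scipy import stats

s1_sq = 4.8 / 4                            # sample 1 variance, n = 5
s2_sq = 37.2 / 6                           # sample 2 variance, m = 7
f_stat = s1_sq / s2_sq                     # about 0.1935
lower = stats.f.ppf(0.05, dfn=4, dfd=6)    # about 0.16
upper = stats.f.ppf(0.95, dfn=4, dfd=6)    # about 4.53
print(round(f_stat, 4), round(lower, 2), round(upper, 2))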
9. (a) Of 100 clinical trials, 5 have shown that wonder-drug ‘Zap2’ is better than the
standard treatment (aspirin). Should we be excited by these results?
(b) Of the 1,000 clinical trials of 1,000 different drugs this year, 30 trials found
drugs which seem better than the standard treatments with which they were
compared. The television news reports only the results of those 30 ‘successful’
trials. Should we believe these reports?
Solution:
(a) If 5 clinical trials out of 100 report that Zap2 is better, this is consistent with
there being no difference whatsoever between Zap2 and aspirin if a 5% Type I
error probability is being used for tests in these clinical trials. With a 5%
significance level we expect 5 trials in 100 to show spurious significant results.
(b) If the television news reports the 30 successful trials out of 1,000, and those
trials use tests with a significance level of 5%, we may well choose to be very
cautious about believing the results. We would expect 50 spuriously significant
results in the 1,000 trial results.
10. A machine is designed to fill bags of sugar. The weight of the bags is normally
distributed with standard deviation σ. If the machine is correctly calibrated, σ
should be no greater than 20 g. We collect a random sample of 18 bags and weigh
them. The sample standard deviation is found to be equal to 32.48 g. Is there any
evidence that the machine is incorrectly calibrated?
Solution:
This is a hypothesis test for the variance of a normal population, so we will use the
chi-squared distribution. Let:
X1 , X2 , . . . , X18 ∼ N (µ, σ 2 )
be the weights of the bags in the sample. An appropriate test has hypotheses H0: σ² = (20)² = 400 vs. H1: σ² > 400.
11. After the machine in Question 10 is calibrated, we collect a new sample of 21 bags.
The sample standard deviation of their weights is 23.72 g. Based on this sample,
can you conclude that the calibration has reduced the variance of the weights of the
bags?
Solution:
Let:
Y₁, Y₂, . . . , Y₂₁ ~ N(µ_Y, σ²_Y)
be the weights of the bags in the new sample, and use σ²_X to denote the variance of
the distribution of the previous sample, to avoid confusion. We want to test for a
reduction in variance, so we set:
H0: σ²_X/σ²_Y = 1  vs.  H1: σ²_X/σ²_Y > 1.
The value of the test statistic in this case is:
s²_X/s²_Y = (32.48)²/(23.72)² = 1.875.
If the null hypothesis is true, the test statistic will follow an F18−1, 21−1 = F17, 20
distribution.
At the 5% significance level, the upper-tail critical value of the F17, 20 distribution is
F0.05, 17, 20 = 2.17. Our test statistic does not exceed this value, so we cannot reject
the null hypothesis.
We move to the 10% significance level. The upper-tail critical value is
F0.10, 17, 20 = 1.821 (using a computer), so we can now reject the null hypothesis (if
only barely). We conclude that there is some evidence that the variance is reduced,
but it is not very strong evidence.
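The two critical values quoted above ('using a computer') can be obtained, for example, with scipy.

from scipy import stats

print(stats.f.ppf(0.95, dfn=17, dfd=20))   # about 2.17 (upper 5% point)
print(stats.f.ppf(0.90, dfn=17, dfd=20))   # about 1.82 (upper 10% point)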
Notice the difference between the conclusions of these two tests. We have a much
more powerful test when we compare our standard deviation of 32.48 g to a fixed
standard deviation of 25 g, than when we compare it to an estimated standard
deviation of 23.72 g, even though the values are similar.
2. In a wire-based nail manufacturing process the target length for cut wire is 22 cm.
It is known that lengths vary with a standard deviation equal to 0.08 cm. In order
to monitor this process, a random sample of 50 separate wires is accurately
measured and the process is regarded as operating satisfactorily (the null
hypothesis) if the sample mean length lies between 21.97 cm and 22.03 cm so that
this is the decision procedure used (i.e. if the sample mean falls within this range
then the null hypothesis is not rejected, otherwise the null hypothesis is rejected).
(a) Determine the probability of a Type I error for this test.
(b) Determine the probability of making a Type II error when the process is
actually cutting to a length of 22.05 cm.
(c) Find the probability of rejecting the null hypothesis when the true cutting
length is 22.01 cm. (This is the power of the test when the true mean is 22.01
cm.)
4. To instil customer loyalty, airlines, hotels, rental car companies, and credit card
companies (among others) have initiated frequency marketing programmes which
reward their regular customers. In the United States alone, millions of people are
members of the frequent-flier programmes of the airline industry. A large fast food
restaurant chain wished to explore the profitability of such a programme. They
randomly selected 12 of their 1,200 restaurants nationwide and instituted a
frequency programme which rewarded customers with a $5.00 gift certificate after
every 10 meals purchased at full price.
They ran the trial programme for three months. The restaurants not in the sample
had an average increase in profits of $1,047.34 over the previous three months,
whereas the restaurants in the sample had the following changes in profit:
Note that the last number is negative, representing a decrease in profits. Specify
the appropriate null and alternative hypotheses for determining whether the mean
profit change for restaurants with frequency programmes is significantly greater (in
a statistical sense which you should make clear) than $1,047.34.
5. Two companies supplying a television repair service are compared by their repair
times (in days). Random samples of recent repair times for these companies gave
the following statistics:
(a) Is there evidence that the companies differ in their true mean repair times?
Give an appropriate hypothesis test to support your conclusions.
(b) What is the p-value of your test?
(c) What difference would it have made if the sample sizes had each been smaller
by 35 (i.e. sizes 9 and 17, respectively)?
Appendix J
Analysis of variance (ANOVA)
Solution:
(a) The means are 440/5 = 88, 630/7 = 90 and 690/10 = 69. We will perform a
one-way ANOVA. First, we calculate the overall mean. This is (440 + 630 + 690)/22 = 1,760/22 = 80.
(b) As 5.56 > 3.52 = F0.05, 2, 19 , which is the top 5th percentile of the F2, 19
distribution (interpolated from Table A.3 of the Dougherty Statistical Tables),
we reject H0 : µ1 = µ2 = µ3 and conclude that there is evidence that the means
are not equal.
(c) We have:
90 − 69 ± 2.093 × √(200.53 × (1/7 + 1/10)) = 21 ± 14.61.
Here 2.093 is the top 2.5th percentile point of the t distribution with 19
degrees of freedom. A 95% confidence interval is (6.39, 35.61). As zero is not
included, there is evidence of a difference.
2. The total times spent by three basketball players on court were recorded. Player A
was recorded on three occasions and the times were 29, 25 and 33 minutes. Player
B was recorded twice and the times were 16 and 30 minutes. Player C was recorded
on three occasions and the times were 12, 14 and 16 minutes. Use analysis of
variance to test whether there is any difference in the average times the three
players spend on court.
Solution:
We have x̄·A = 29, x̄·B = 23, x̄·C = 14 and x̄ = 21.875. Hence:
Source DF SS MS F p-value
Players 2 340.875 170.4375 6.175 ≈ 0.045
Error 5 138 27.6
Total 7 478.875
We test H0 : µ1 = µ2 = µ3 (i.e. the average times they play are the same) vs. H1 :
The average times they play are not the same.
As 6.175 > 5.79 = F0.05, 2, 5 , which is the top 5th percentile of the F2, 5 distribution,
we reject H0 and conclude that there is evidence of a difference between the means.
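For reference, the same one-way ANOVA can be carried out in Python; the sketch below applies scipy's f_oneway to the three players' times and should reproduce the F statistic and p-value in the table above.

from scipy import stats

player_a = [29, 25, 33]
player_b = [16, 30]
player_c = [12, 14, 16]
f_stat, p_value = stats.f_oneway(player_a, player_b, player_c)
print(round(f_stat, 3), round(p_value, 3))   # about 6.175 and 0.045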
Solution:
We will perform a one-way ANOVA. First we calculate the overall mean:
(4 × 24 + 6 × 20 + 5 × 18)/15 = 20.4.
We can now calculate the sum of squares between groups:
4 × (24 − 20.4)² + 6 × (20 − 20.4)² + 5 × (18 − 20.4)² = 81.6.
Source DF SS MS F p-value
Sample 2 81.6 40.8 1.229 ≈ 0.327
Error 12 398.4 33.2
Total 14 480
As 1.229 < 3.89 = F0.05, 2, 12 , which is the top 5th percentile of the F2, 12
distribution, we see that there is no evidence that the means are not equal.
4. Four suppliers were asked to quote prices for seven different building materials. The
average quote of supplier A was 1,315.8. The average quotes of suppliers B, C and
D were 1,238.4, 1,225.8 and 1,200.0, respectively. The following is the calculated
two-way ANOVA table with some entries missing.
Source DF SS MS F p-value
Materials 17,800
Suppliers
Error
Total 358,700
(a) Complete the table using the information provided above.
(b) Is there a significant difference between the quotes of different suppliers?
Explain your answer.
(c) Construct a 90% confidence interval for the difference between suppliers A and
D. Would you say there is a difference?
Solution:
(a) The average quote of all suppliers is:
1,315.8 + 1,238.4 + 1,225.8 + 1,200.0
= 1,245.
4
Hence the sum of squares (SS) due to suppliers is:
7 × [(1,315.8 − 1,245)² + (1,238.4 − 1,245)² + (1,225.8 − 1,245)² + (1,200.0 − 1,245)²] = 52,148.88.
The materials SS is 17,800 × 6 = 106,800, the error SS follows by subtraction, and the F statistics are:
17,800/11,097.28 = 1.604  and  17,382.96/11,097.28 = 1.567
for materials and suppliers, respectively. The two-way ANOVA table is:
Source DF SS MS F p-value
Materials 6 106,800 17,800 1.604 ≈ 0.203
Suppliers 3 52,148.88 17,382.96 1.567 ≈ 0.232
Error 18 199,751.12 11,097.28
Total 27 358,700
(b) We test H0 : µ1 = µ2 = µ3 = µ4 (i.e. there is no difference between suppliers)
vs. H1 : There is a difference between suppliers. The F value is 1.567 and at a
5% significance level the critical value from Table A.3 of the Dougherty
Statistical Tables (degrees of freedom 3 and 18) is 3.16, hence we do not reject
H0 and conclude that there is not enough evidence that there is a difference.
(c) The top 5th percentile of the t distribution with 18 degrees of freedom is 1.734
and the MS value is 11,097.28. So a 90% confidence interval is:
1,315.8 − 1,200 ± 1.734 × √(11,097.28 × (1/7 + 1/7)) = 115.8 ± 97.64
giving (18.16, 213.44). Since zero is not in the interval, there appears to be a
difference between suppliers A and D.
Source DF SS MS F p-value
Drinker 1.56
Beer 303.5
Error 695.6
Total
giving (6.91, 26.09). As the interval does not contain zero, there is evidence of
a difference between the effects of beers C and D.
A B C D E
Early shift 102 93 85 110 72
Late shift 85 87 71 92 73
Night shift 75 80 75 77 76
Solution:
Here r = 3 and c = 5. We may obtain the two-way ANOVA table as follows:
Source DF SS MS F
Shift 2 652.13 326.07 5.62
Plant 4 761.73 190.43 3.28
Error 8 463.87 57.98
Total 14 1,877.73
Under the null hypothesis of no shift effect, F ∼ F2, 8 . Since F0.05, 2, 8 = 4.46 < 5.62,
we can reject the null hypothesis at the 5% significance level. (Note the p-value
= 0.030.)
Under the null hypothesis of no plant effect, F ∼ F4, 8 . Since F0.05, 4, 8 = 3.84 > 3.28,
we cannot reject the null hypothesis at the 5% significance level. (Note the p-value
= 0.072.)
Overall, the data collected show some evidence of a shift effect but little evidence
of a plant effect.
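If you wish to check the decomposition with software, the following numpy sketch reproduces the sums of squares and F statistics for these data (one observation per shift-plant combination, no interaction term).

import numpy as np

x = np.array([[102, 93, 85, 110, 72],
              [ 85, 87, 71,  92, 73],
              [ 75, 80, 75,  77, 76]], dtype=float)
r, c = x.shape
grand = x.mean()
ss_shift = c * ((x.mean(axis=1) - grand) ** 2).sum()   # about 652.13
ss_plant = r * ((x.mean(axis=0) - grand) ** 2).sum()   # about 761.73
ss_total = ((x - grand) ** 2).sum()                    # about 1,877.73
ss_error = ss_total - ss_shift - ss_plant              # about 463.87
ms_error = ss_error / ((r - 1) * (c - 1))
print(ss_shift / (r - 1) / ms_error)                   # F for shifts, about 5.62
print(ss_plant / (c - 1) / ms_error)                   # F for plants, about 3.28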
7. Complete the two-way ANOVA table below. In the places of p-values, indicate in
the form such as ‘< 0.01’ appropriately and use the closest value which you may
find from the Dougherty Statistical Tables.
Source DF SS MS F p-value
Row factor 4 ? 234.23 ? ?
Column factor 6 270.84 45.14 1.53 ?
Error ? 708.00 ?
Total 34 1,915.76
Solution:
(Here C2 denotes the row factor and C3 the column factor.) First, C2 SS = (C2 MS) × 4 = 936.92.
The degrees of freedom for Error is 34 − 4 − 6 = 24. Therefore, Error MS
= 708.00/24 = 29.5.
Hence the F statistic for testing no C2 effect is 234.23/29.5 = 7.94. From Table A.3
of the Dougherty Statistical Tables, F_{0.001,4,24} = 6.59 < 7.94. Therefore, the
corresponding p-value is smaller than 0.001.
Since F_{0.05,6,24} = 2.51 > 1.53, the p-value for testing the C3 effect is greater than
0.05.
The complete ANOVA table is as follows:
Source DF SS MS F P
C2 4 936.92 234.23 7.94 <0.001
C3 6 270.84 45.14 1.53 >0.05
Error 24 708.00 29.5
Total 34 1,915.76
J.2. Practice questions
(a) Based on these data, can we infer at the 5% significance level that the
population mean expenditures on prepared frozen meals are the same for the
three different income groups?
(b) Produce a one-way ANOVA table.
(c) Construct 95% confidence intervals for the mean expenditures of the first
(under $15,000) and the third (over $30,000) income groups.
2. Does the level of success of publicly-traded companies affect the way their board
members are paid? The annual payments (in $000s) of randomly selected
publicly-traded companies to their board members were recorded. The companies
were divided into four quarters according to the returns in their stocks, and the
payments from each quarter were grouped together. Some summary statistics are
provided below.
Descriptive Statistics: 1st quarter, 2nd quarter, 3rd quarter, 4th quarter
Appendix K
Linear regression
i. what is the value of β̂ if yᵢ = xᵢ for all i? What if they are the exact
opposites of each other, i.e. yᵢ = −xᵢ for all i?
ii. is it always the case that −1 ≤ β̂ ≤ 1?
Solution:
(a) The estimator β̂ is sensible because it is the least squares estimator of β, which
provides the 'best' fit to the data in terms of minimising the sum of squared
residuals.
(b) The estimator β̂ is preferred to β̃ because the estimator β̃ is the least absolute
deviations estimator of β, which is also an option, but unlike β̂ it cannot be
computed explicitly via differentiation as the function f(x) = |x| is not
differentiable at zero. Therefore, β̃ is harder to compute than β̂.
(c) We need to minimise a convex quadratic, so we can do that by differentiating
it and equating the derivative to zero. We obtain:
−2 Σ_{i=1}^{n} (yᵢ − β̂xᵢ)xᵢ = 0
which yields:
β̂ = Σ_{i=1}^{n} xᵢyᵢ / Σ_{i=1}^{n} xᵢ².
3. Let {(xᵢ, yᵢ)}, for i = 1, 2, . . . , n, be observations from the linear regression model:
yᵢ = β₀ + β₁xᵢ + εᵢ.
(a) Suppose that the slope, β1 , is known. Find the least squares estimator (LSE) of
the intercept, β0 .
(b) Suppose that the intercept, β0 , is known. Find the LSE of the slope, β1 .
Solution:
(a) When β₁ is known, let zᵢ = yᵢ − β₁xᵢ. The model then reduces to zᵢ = β₀ + εᵢ.
The LSE β̂₀ minimises Σ_{i=1}^{n} (zᵢ − β₀)², hence:
β̂₀ = z̄ = (1/n) Σ_{i=1}^{n} (yᵢ − β₁xᵢ).
where D = (β̂₁ − β₁) Σ_{i=1}^{n} xᵢ(zᵢ − β̂₁xᵢ). Suppose we choose β̂₁ such that:
Σ_{i=1}^{n} xᵢ(zᵢ − β̂₁xᵢ) = 0,  i.e.  Σ_{i=1}^{n} xᵢzᵢ − β̂₁ Σ_{i=1}^{n} xᵢ² = 0.
Hence:
Σ_{i=1}^{n} (zᵢ − β₁xᵢ)² = Σ_{i=1}^{n} (zᵢ − β̂₁xᵢ)² + (β̂₁ − β₁)² Σ_{i=1}^{n} xᵢ² ≥ Σ_{i=1}^{n} (zᵢ − β̂₁xᵢ)².
Solution:
Since x̄ = (1 + 2 + · · · + 9)/9 = 5, then Σ_{i=1}^{n} (xᵢ − x̄)² = 60 and so:
Var(β̂₁) = σ²/Σ_{i=1}^{n} (xᵢ − x̄)² = 45/60 = 0.75.
Therefore:
β̂₁ ~ N(β₁, 0.75).
We require:
P(|β̂₁ − β₁| < 1.5) = P(|Z| < 1.5/√0.75) = P(|Z| < 1.73) = 1 − 2 × 0.0418 = 0.9164.
(a) Find the least-squares estimates of β0 and β1 and write down the fitted
regression model.
(b) Compute a 95% confidence interval for the slope coefficient β1 . What can be
concluded?
(c) Compute R2 . What can be said about how ‘good’ the model is?
(d) With x = 30, find a prediction interval which covers y with probability 0.95.
With 97.5% confidence, what minimum average life expectancy can a city
expect once its GDP per capita reaches $30,000?
Solution:
(a) We have:
β̂₁ = Σ_{i=1}^{n} (xᵢ − x̄)(yᵢ − ȳ) / Σ_{i=1}^{n} (xᵢ − x̄)² = (Σ_{i=1}^{n} xᵢyᵢ − nx̄ȳ)/(Σ_{i=1}^{n} xᵢ² − nx̄²) = 1.026
and:
β̂₀ = ȳ − β̂₁x̄ = 49.55.
Hence the fitted model is:
ŷ = 49.55 + 1.026x.
(b) We first need E.S.E.(β̂₁), for which we need σ̂². For σ̂², we need the Residual
SS (from the Total SS and the Regression SS). We compute:
Total SS = Σᵢ yᵢ² − nȳ² = 1,339.67
Regression SS = β̂₁²(Σᵢ xᵢ² − nx̄²) = 702.99
σ̂² = Residual SS/(n − 2) = 636.68/28 = 22.74
E.S.E.(β̂₁) = (σ̂²/(Σᵢ xᵢ² − nx̄²))^{1/2} = 0.184
which gives:
1.026 ± 2.05 × 0.184  ⇒  (0.65, 1.40).
The confidence interval does not contain zero. Therefore, we would reject the
hypothesis of β₁ being zero at the 5% significance level. Hence there does
appear to be a significant link.
(c) We have:
R² = Regression SS/Total SS = 702.99/1,339.67 = 0.52.
(d) A 95% prediction interval for y at a given value of x is:
β̂₀ + β̂₁x ± t_{0.025,n−2} × σ̂ × (1 + (Σᵢ xᵢ² − 2x Σᵢ xᵢ + nx²)/(n(Σᵢ xᵢ² − nx̄²)))^{1/2}
which gives:
(69.79, 90.87).
Therefore, we can be 97.5% confident that the average life expectancy lies
above 69.79 years once GDP per capita reaches $30,000.
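The interval in part (b) and the R² in part (c) can be checked from the summary quantities alone; in the sketch below, n = 30 is inferred from the n − 2 = 28 divisor used for σ̂², and Σ(xᵢ − x̄)² is backed out of the quoted slope and Regression SS.

import math
from scipy import stats

n, beta1 = 30, 1.026
total_ss, reg_ss = 1339.67, 702.99
sxx = reg_ss / beta1 ** 2                     # implied sum of (x_i - xbar)^2
sigma2_hat = (total_ss - reg_ss) / (n - 2)    # about 22.74
ese_beta1 = math.sqrt(sigma2_hat / sxx)       # about 0.184
t_crit = stats.t.ppf(0.975, df=n - 2)         # about 2.05
print(beta1 - t_crit * ese_beta1, beta1 + t_crit * ese_beta1)  # about (0.65, 1.40)
print(reg_ss / total_ss)                                       # R^2, about 0.52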
Analysis of Variance
SOURCE DF SS
Regression 1 2011.12
Residual Error 40 539.17
In addition, x̄ = 1.56.
(a) Find an estimate of the error term variance, σ 2 .
(b) Calculate and interpret R2 .
(c) Test at the 5% significance level whether or not the slope in the regression
model is equal to 1.
(d) For x = 0.8, find a 95% confidence interval for the expectation of y.
Solution:
(a) Noting n = 40 + 1 + 1 = 42, we have:
σ̂² = Residual SS/(n − 2) = 539.17/40 = 13.479.
(b) R² = Regression SS/Total SS = 2,011.12/2,550.29 = 0.7886, i.e. about 79% of the
sample variation in y is explained by the regression on x.
(c) The test statistic, under H0: β₁ = 1, is:
T = (β̂₁ − 1)/E.S.E.(β̂₁) ~ t_{n−2} = t_40.
8. Why is the squared sample correlation coefficient between the yᵢs and xᵢs the same
as the squared sample correlation coefficient between the yᵢs and ŷᵢs? No algebra is
needed for this.
Solution:
The only difference between the xᵢs and ŷᵢs is a rescaling by multiplying by β̂₁,
followed by a relocation by adding β̂₀. Correlation coefficients are not affected by a
change of scale or location, so it will be the same whether we use the xᵢs or the ŷᵢs.
9. If the model fits, then the fitted values and the residuals from the model are
independent of each other. What do you expect to see if the model fits when you
plot residuals against fitted values?
Solution:
If the model fits, one would expect to see a random scatter with no particular
pattern.
1. The table below shows the cost of fire damage for ten fires together with the
corresponding distances of the fires to the nearest fire station:
(a) Fit a straight line to these data and construct a 95% confidence interval for
the increase in cost of a fire for each mile from the nearest fire station.
(b) Test the hypothesis that the ‘true line’ passes through the origin.
2. The yearly profits made by a company, over a period of eight consecutive years are
shown below:
Year 1 2 3 4 5 6 7 8
Profit (in £000s) 18 21 34 31 44 46 60 75
(a) Fit a straight line to these data and compute a 95% confidence interval for the
‘true’ yearly increase in profits.
(b) The company accountant forecasts the profits for year 9 to be £90,000. Is this
forecast reasonable if it is based on the above data?
3. The data table below shows the yearly expenditure (in £000s) by a cosmetics
company in advertising a particular brand of perfume:
Year (x) 1 2 3 4 5 6 7 8
Expenditure (y) 170 170 275 340 435 510 740 832
(a) Fit a regression line to these data and construct a 95% confidence interval for
its slope.
(b) Construct an analysis of variance table and compute the R2 statistic for the fit.
(c) Comment on the goodness of fit of the linear regression model.
(d) Predict the expenditure for Year 9 and construct a 95% prediction interval for
the actual expenditure.
Appendix L
Solutions to Practice questions
1. (a) We have:
Σ_{j=1}^{3} (Yⱼ − Ȳ) = (Y₁ − Ȳ) + (Y₂ − Ȳ) + (Y₃ − Ȳ).
However:
3Ȳ = Y₁ + Y₂ + Y₃
hence:
Σ_{j=1}^{3} (Yⱼ − Ȳ) = 3Ȳ − 3Ȳ = 0.
(b) We have:
Σ_{j=1}^{3} Σ_{k=1}^{3} (Yⱼ − Ȳ)(Yₖ − Ȳ) = (Y₁ − Ȳ)((Y₁ − Ȳ) + (Y₂ − Ȳ) + (Y₃ − Ȳ))
+ (Y₂ − Ȳ)((Y₁ − Ȳ) + (Y₂ − Ȳ) + (Y₃ − Ȳ)) + (Y₃ − Ȳ)((Y₁ − Ȳ) + (Y₂ − Ȳ) + (Y₃ − Ȳ))
so, using the result in (a):
Σ_{j=1}^{3} Σ_{k=1}^{3} (Yⱼ − Ȳ)(Yₖ − Ȳ) = 0² = 0.
(c) We have:
Σ_{j=1}^{3} Σ_{k=1}^{3} (Yⱼ − Ȳ)(Yₖ − Ȳ) = Σ_{j≠k} (Yⱼ − Ȳ)(Yₖ − Ȳ) + Σ_{j=1}^{3} (Yⱼ − Ȳ)².
We have written the nine terms in the left-hand expression as the sum of the
six terms for which j 6= k, and the three terms for which j = k.
2. (a) We have:
ȳ = Σ_{i=1}^{n} yᵢ / n = Σ_{i=1}^{n} (axᵢ + b) / n = (a Σ_{i=1}^{n} xᵢ + nb)/n = ax̄ + b.
(b) Multiply out the square within the summation sign and then evaluate the
three expressions, remembering that x̄ is a constant with respect to summation
and can be taken outside the summation sign as a common factor, i.e. we have:
Σ_{i=1}^{n} (xᵢ − x̄)² = Σ_{i=1}^{n} (xᵢ² − 2xᵢx̄ + x̄²)
= Σ_{i=1}^{n} xᵢ² − 2x̄ Σ_{i=1}^{n} xᵢ + Σ_{i=1}^{n} x̄²
= Σ_{i=1}^{n} xᵢ² − 2nx̄² + nx̄²
hence the result. Recall that Σ_{i=1}^{n} xᵢ = nx̄.
(c) It is probably best to work with variances to avoid the square roots. The
variance of the y values, say s²_y, is given by:
s²_y = (1/n) Σ_{i=1}^{n} (yᵢ − ȳ)²
= (1/n) Σ_{i=1}^{n} (axᵢ + b − (ax̄ + b))²
= a² (1/n) Σ_{i=1}^{n} (xᵢ − x̄)²
= a² s²_x.
The result follows on taking the square root, observing that the standard
deviation cannot be a negative quantity.
Adding a constant k to each value of a dataset adds k to the mean and leaves the
standard deviation unchanged. This corresponds to a transformation yi = axi + b
with a = 1 and b = k. Apply (a) and (c) with these values.
Multiplying each value of a dataset by a constant c multiplies the mean by c and
also the standard deviation by |c|. This corresponds to a transformation yi = cxi
with a = c and b = 0. Apply (a) and (c) with these values.
L.2. Appendix B – Probability theory
(P(A) + P(B))/2 ≤ P(A ∪ B).
Similarly, A ∩ B ⊂ A and A ∩ B ⊂ B, so P(A ∩ B) ≤ P(A) and P(A ∩ B) ≤ P(B).
Adding, 2P(A ∩ B) ≤ P(A) + P(B), so:
P(A ∩ B) ≤ (P(A) + P(B))/2.
(b) To show that X c and Y c are not necessarily mutually exclusive when X and Y
are mutually exclusive, the best approach is to find a counterexample.
Attempts to ‘prove’ the result directly are likely to be logically flawed.
Look for a simple example. Suppose we roll a die. Let X = {6} be the event of
obtaining a 6, and let Y = {5} be the event of obtaining a 5. Obviously X and
Y are mutually exclusive, but Xᶜ = {1, 2, 3, 4, 5} and Yᶜ = {1, 2, 3, 4, 6} have
Xᶜ ∩ Yᶜ ≠ ∅, so Xᶜ and Yᶜ are not mutually exclusive.
4. (a) A will win the game without deuce if he or she wins four points, including the
last point, before B wins three points. This can occur in three ways.
• A wins four straight points, i.e. AAAA with probability (2/3)4 = 16/81.
• B wins just one point in the game. There are 4 C1 ways for this to happen,
namely BAAAA, ABAAA, AABAA and AAABA. Each has probability
(1/3)(2/3)4 , so the probability of one of these outcomes is given by
4(1/3)(2/3)4 = 64/243.
• B wins just two points in the game. There are 5 C2 ways for this to
happen, namely BBAAAA, BABAAA, BAABAA, BAAABA,
ABBAAA, ABABAA, ABAABA, AABBAA, AABABA and
AAABBA. Each has probability (1/3)2 (2/3)4 , so the probability of one of
these outcomes is given by 10(1/3)2 (2/3)4 = 160/729.
Therefore, the probability that A wins without a deuce must be the sum of
these, namely:
16 64 160 144 + 192 + 160 496
+ + = = .
81 243 729 729 729
(b) We can mimic the above argument to find the probability that B wins the
game without a deuce. That is, the probability of four straight points to B is
(1/3)4 = 1/81, the probability that A wins just one point in the game is
4(2/3)(1/3)4 = 8/243, and the probability that A wins just two points is
10(2/3)2 (1/3)4 = 40/729. So the probability of B winning without a deuce is
1/81 + 8/243 + 40/729 = 73/729 and so the probability of deuce is
1 − 496/729 − 73/729 = 160/729.
(c) Either: suppose deuce has been called. The probability that A wins the set
without further deuces is the probability that the next two points go AA –
with probability (2/3)2 .
The probability of exactly one further deuce is that the next four points go
ABAA or BAAA – with probability (2/3)3 (1/3) + (2/3)3 (1/3) = (2/3)4 .
The probability of exactly two further deuces is that the next six points go
ABABAA, ABBAAA, BAABAA or BABAAA – with probability
4(2/3)4 (1/3)2 = (2/3)6 .
Continuing this way, the probability that A wins after three further deuces is
(2/3)8 and the overall probability that A wins after deuce has been called is
(2/3)2 + (2/3)4 + (2/3)6 + (2/3)8 + · · · .
This is a geometric progression (GP) with first term a = (2/3)2 and common
ratio (2/3)2 , so the overall probability that A wins after deuce has been called
is a/(1 − r) (sum to infinity of a GP) which is:
(2/3)²/(1 − (2/3)²) = (4/9)/(5/9) = 4/5.
Or (quicker!): given a deuce, the next 2 balls can yield the following results.
A wins with probability (2/3)2 , B wins with probability (1/3)2 , and deuce
with probability 4/9.
Hence P (A wins | deuce) = (2/3)2 + (4/9) P (A wins | deuce) and solving
immediately gives P (A wins | deuce) = 4/5.
(d) We have:
E(X) = 1 × (1/3) + 2 × (2/3) = 5/3
E(X²) = 1 × (1/3) + 4 × (2/3) = 3
E(1/X) = 1 × (1/3) + (1/2) × (2/3) = 2/3
and, clearly, E(X²) ≠ (E(X))² and E(1/X) ≠ 1/E(X) in this case. This counterexample
shows that, in general, E(g(X)) is not equal to g(E(X)).
(b) We have:
E((X₁ + X₂ + · · · + Xₙ)/n) = E(X₁ + X₂ + · · · + Xₙ)/n
= (E(X₁) + E(X₂) + · · · + E(Xₙ))/n
= (µ + µ + · · · + µ)/n
= nµ/n
= µ.
Var((X₁ + X₂ + · · · + Xₙ)/n) = Var(X₁ + X₂ + · · · + Xₙ)/n²
= (Var(X₁) + Var(X₂) + · · · + Var(Xₙ))/n²   (by independence)
= (σ² + σ² + · · · + σ²)/n²
= nσ²/n²
= σ²/n.
3. Suppose n subjects are procured. The probability that a single subject does not
have the abnormality is 0.96. Using independence, the probability that none of the
subjects has the abnormality is (0.96)n .
The probability that at least one subject has the abnormality is 1 − (0.96)n . We
require the smallest whole number n for which 1 − (0.96)n > 0.95, i.e. we have
(0.96)n < 0.05.
We can solve the inequality by ‘trial and error’, but it is neater to take logs.
n ln(0.96) < ln(0.05), so n > ln(0.05)/ ln(0.96), or n > 73.39. Rounding up, 74
subjects should be procured.
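The same calculation in Python, for those who prefer to check it with software:

import math

n = math.ceil(math.log(0.05) / math.log(0.96))
print(n)   # 74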
(c) For the ‘forgetful’ rat (short-term, but not long-term, memory):
P(X = 1) = 1/4
P(X = 2) = (3/4) × (1/3)
P(X = 3) = (3/4) × (2/3) × (1/3)
...
P(X = r) = (3/4) × (2/3)^{r−2} × (1/3)   (for r ≥ 2).
Therefore:
E(X) = 1/4 + (3/4) × (2 × (1/3) + 3 × (2/3) × (1/3) + 4 × (2/3)² × (1/3) + · · ·)
= 1/4 + (1/4) × (2 + 3 × (2/3) + 4 × (2/3)² + · · ·)
= 1/4 + (1/4) × 12 = 3.25.
Note that 2.5 < 3.25 < 4, so the intelligent rat needs the least trials on average,
while the stupid rat needs the most, as we would expect!
L.4. Appendix D – Common distributions of random variables
(c) Using the Poisson approximation with λ = 100 × 0.0228 = 2.28, we have the
following.
i. P (N = 0) ≈ e−2.28 = 0.1023.
ii. P (N ≤ 2) ≈ e−2.28 + e−2.28 × 2.28 + e−2.28 × (2.28)2 /2! = 0.6013.
The approximations are good (note there will be some rounding error, but the
values are close with the two methods). It is not surprising that there is close
agreement since n is large, π is small and nπ < 5.
M_X(t) = E(e^{Xt}) = e^t × (1/k) + e^{2t} × (1/k) + · · · + e^{kt} × (1/k) = (1/k)(e^t + e^{2t} + · · · + e^{kt}).
The bracketed part of this expression is a geometric progression where the first
term is e^t and the common ratio is e^t.
Using the well-known result for the sum of k terms of a geometric progression, we
obtain:
M_X(t) = (1/k) × e^t(1 − (e^t)^k)/(1 − e^t) = e^t(1 − e^{kt})/(k(1 − e^t)).
3. (a) For f(z) to serve as a pdf, we require (i.) f(z) ≥ 0 for all z, and (ii.)
∫_{−∞}^{∞} f(z) dz = 1. The first condition certainly holds for f(z). The second also
holds since:
∫_{−∞}^{∞} f(z) dz = ∫_{−∞}^{0} (1/2)e^{−|z|} dz + ∫_{0}^{∞} (1/2)e^{−|z|} dz
= ∫_{−∞}^{0} (1/2)e^{z} dz + ∫_{0}^{∞} (1/2)e^{−z} dz
= [e^z/2]_{−∞}^{0} − [e^{−z}/2]_{0}^{∞}
= 1/2 + 1/2
= 1.
M_Z(t) = E(e^{Zt}) = ∫_{−∞}^{0} (1/2)e^{−|z|}e^{zt} dz + ∫_{0}^{∞} (1/2)e^{−|z|}e^{zt} dz
= ∫_{−∞}^{0} (1/2)e^{z}e^{zt} dz + ∫_{0}^{∞} (1/2)e^{−z}e^{zt} dz
= ∫_{−∞}^{0} (1/2)e^{z(1+t)} dz + ∫_{0}^{∞} (1/2)e^{z(t−1)} dz
= [e^{z(1+t)}/(2(1 + t))]_{−∞}^{0} + [e^{z(t−1)}/(2(t − 1))]_{0}^{∞}
= 1/(2(1 + t)) − 1/(2(t − 1))
= (1 − t²)^{−1}
where the condition −1 < t < 1 ensures the integrands are 0 at the infinite
limits.
(c) We can find the various moments by differentiating M_Z(t), but it is simpler to
expand it:
M_Z(t) = (1 − t²)^{−1} = 1 + t² + t⁴ + · · · .
Matching coefficients of t^r/r! gives E(Z) = 0, E(Z²) = 2, E(Z³) = 0 and E(Z⁴) = 24.
Note that the first and third of these results follow directly from the fact
(illustrated in the sketch) that the distribution is symmetric about z = 0.
4. For X ~ Bin(n, π), P(X = x) = C(n, x) π^x (1 − π)^{n−x}. So, for E(X), we have:
E(X) = Σ_{x=0}^{n} x C(n, x) π^x (1 − π)^{n−x}
= Σ_{x=1}^{n} x C(n, x) π^x (1 − π)^{n−x}
= Σ_{x=1}^{n} [n(n − 1)!/((x − 1)!((n − 1) − (x − 1))!)] π π^{x−1} (1 − π)^{n−x}
= nπ Σ_{x=1}^{n} C(n − 1, x − 1) π^{x−1} (1 − π)^{n−x}
= nπ Σ_{y=0}^{n−1} C(n − 1, y) π^y (1 − π)^{(n−1)−y}
= nπ × 1
= nπ
where y = x − 1, and the last summation is over all the values of the pf of another
binomial distribution, this time with possible values 0, 1, 2, . . . , n − 1 and
probability parameter π.
Similarly:
E(X(X − 1)) = Σ_{x=0}^{n} x(x − 1) C(n, x) π^x (1 − π)^{n−x}
= Σ_{x=2}^{n} [x(x − 1)n!/(x!(n − x)!)] π^x (1 − π)^{n−x}
= n(n − 1)π² Σ_{x=2}^{n} [(n − 2)!/((x − 2)!(n − x)!)] π^{x−2} (1 − π)^{n−x}
= n(n − 1)π² Σ_{y=0}^{n−2} [(n − 2)!/(y!(n − y − 2)!)] π^y (1 − π)^{n−y−2}
= n(n − 1)π² Σ_{y=0}^{m} [m!/(y!(m − y)!)] π^y (1 − π)^{m−y}   (writing m = n − 2)
= n(n − 1)π²
since the final summation is again over all the values of a binomial pf, this time Bin(m, π).
5. (a) A rate of 150 cars per hour is a rate of 2.5 per minute. Using a Poisson
distribution with λ = 2.5, P(none passes) = e^{−2.5} × (2.5)⁰/0! = e^{−2.5} = 0.0821.
(c) The probability of 5 cars passing in two minutes is e^{−5} × 5⁵/5! = 0.1755.
6. (a) Let X denote the number of fish caught, such that X ~ Pois(λ).
P(X = x) = e^{−λ}λ^x/x! where the parameter λ is as yet unknown, so
P(X = 0) = e^{−λ}λ⁰/0! = e^{−λ}.
However, we know P(X = 0) = π. So e^{−λ} = π giving −λ = ln(π) and
λ = ln(1/π).
(b) James will take home the last fish caught if he catches 1, 3, 5, . . . fish. So we
require:
P(X = 1) + P(X = 3) + P(X = 5) + · · · = e^{−λ}(λ + λ³/3! + λ⁵/5! + · · ·).
Now we know:
e^λ = 1 + λ + λ²/2! + λ³/3! + · · ·
and:
e^{−λ} = 1 − λ + λ²/2! − λ³/3! + · · · .
Subtracting gives:
e^λ − e^{−λ} = 2(λ + λ³/3! + λ⁵/5! + · · ·).
Therefore, the required probability is:
e^{−λ}(e^λ − e^{−λ})/2 = (1 − e^{−2λ})/2 = (1 − π²)/2
as required.
(b) We have:
as required.
2. (a) We have:
E(Σ_{i=1}^{k} aᵢXᵢ) = Σ_{i=1}^{k} E(aᵢXᵢ) = Σ_{i=1}^{k} aᵢE(Xᵢ).
(b) We have:
Var(Σ_{i=1}^{k} aᵢXᵢ) = E((Σ_{i=1}^{k} aᵢXᵢ − Σ_{i=1}^{k} aᵢE(Xᵢ))²) = E((Σ_{i=1}^{k} aᵢ(Xᵢ − E(Xᵢ)))²)
= Σ_{i=1}^{k} aᵢ²E((Xᵢ − E(Xᵢ))²) + Σ_{1≤i≠j≤k} aᵢaⱼE((Xᵢ − E(Xᵢ))(Xⱼ − E(Xⱼ)))
= Σ_{i=1}^{k} aᵢ²Var(Xᵢ) + Σ_{1≤i≠j≤k} aᵢaⱼE(Xᵢ − E(Xᵢ))E(Xⱼ − E(Xⱼ)) = Σ_{i=1}^{k} aᵢ²Var(Xᵢ).
Additional note: remember there are two ways to compute the variance:
Var(X) = E((X − µ)2 ) and Var(X) = E(X 2 ) − (E(X))2 . The former is more
convenient for analytical derivations/proofs (see above), while the latter should be
used to compute variances for common distributions such as Poisson or exponential
distributions. Actually it is rather difficult to compute the variance for a Poisson
distribution using the formula Var(X) = E((X − µ)2 ) directly.
L.6. Appendix F – Sampling distributions of statistics
2. (a) Let {X1 , X2 , . . . , Xn } denote the random sample. We know that the sampling
distribution of X̄ is N (µ, σ 2 /n), here N (4, 22 /20) = N (4, 0.2).
i. The probability we need is:
P(X̄ > 5) = P((X̄ − 4)/√0.2 > (5 − 4)/√0.2) = P(Z > 2.24) = 0.0126
where, as usual, Z ∼ N (0, 1).
ii. P (X̄ < 3) is obtained similarly. Note that this leads to
P (Z < −2.24) = 0.0126, which is equal to the P (X̄ > 5) = P (Z > 2.24)
result obtained above. This is because 5 is one unit above the mean µ = 4,
and 3 is one unit below the mean, and because the normal distribution is
symmetric around its mean.
iii. One way of expressing this is:
P (X̄ − µ > 1) = P (X̄ − µ < −1) = 0.0126
for µ = 4. This also shows that:
P (X̄ − µ > 1) + P (X̄ − µ < −1) = P (|X̄ − µ| > 1) = 2 × 0.0126 = 0.0252
and hence:
P (|X̄ − µ| ≤ 1) = 1 − 2 × 0.0126 = 0.9748.
In other words, the probability is 0.9748 that the sample mean is within
one unit of the true population mean, µ = 4.
(b) We can use the same ideas as in (a). Since X̄ ~ N(µ, 4/n) we have:
P(|X̄ − µ| ≤ 0.5) = 1 − 2 × P(X̄ − µ > 0.5)
= 1 − 2 × P((X̄ − µ)/√(4/n) > 0.5/√(4/n))
= 1 − 2 × P(Z > 0.25√n)
≥ 0.95
which requires P(Z > 0.25√n) ≤ 0.025, i.e. 0.25√n ≥ 1.96. Hence n ≥ (7.84)² = 61.5, so a
sample size of at least 62 is needed.
3. (a) The sample average is composed of 25 randomly sampled data which are
subject to sampling variability, hence the average is also subject to this
variability. Its sampling distribution describes its probability properties. If a
large number of such averages were independently sampled, then their
histogram would be the sampling distribution.
(b) It is reasonable to assume that this sampling distribution is normal due to the
CLT, although the sample size is rather small. If n = 25 and µ = 54 and
σ = 10, then the CLT says that:
X̄ ~ N(µ, σ²/n) = N(54, 100/25).
(c) i. We have:
P(X̄ > 60) = P(Z > (60 − 54)/√(100/25)) = P(Z > 3) = 0.0013
using Table 4 of the New Cambridge Statistical Tables.
ii. We are asked for:
P(0.95 × 54 < X̄ < 1.05 × 54) = P((−0.05 × 54)/2 < Z < (0.05 × 54)/2)
= P(−1.35 < Z < 1.35)
= 0.8230
using Table 4 of the New Cambridge Statistical Tables.
L.7. Appendix G – Point estimation
and:
E(Y) = E(X₁/3 + 2X₂/3) = (1/3) × E(X₁) + (2/3) × E(X₂) = (1/3) × µ + (2/3) × µ = µ.
This means that (n − 1)S² = Σ_{i=1}^{n} Xᵢ² − nX̄², hence:
E((n − 1)S²) = (n − 1)E(S²) = E(Σ_{i=1}^{n} Xᵢ² − nX̄²) = nE(Xᵢ²) − nE(X̄²).
Because the sample is random, E(Xᵢ²) = E(X²) for all i = 1, 2, . . . , n as all the
variables are identically distributed. From the standard formula,
Var(X) = σ² = E(X²) − µ², so (using the hint):
E(X²) = σ² + µ²  and  E(X̄²) = µ² + σ²/n.
Hence:
(n − 1)E(S²) = n(σ² + µ²) − n(µ² + σ²/n) = (n − 1)σ²
so E(S²) = σ², which means that S² is an unbiased estimator of σ², as stated.
The standard formula for Var(X), applied to S, states that:
Var(S) = E(S²) − (E(S))² > 0, and hence (E(S))² < E(S²) = σ²
since all variances are strictly positive. It follows that S is a biased estimator of σ
(with its average value lower than the true value σ).
So the first obvious guess is that we should try R/n × (1 − R/n) = R/n − (R/n)2 .
Now:
nπ(1 − π) = Var(R) = E(R2 ) − (E(R))2 = E(R2 ) − (nπ)2 .
So:
E((R/n)²) = (1/n²)E(R²) = (1/n²)(nπ(1 − π) + n²π²)
⇒ E(R/n − (R/n)²) = (1/n)E(R) − (1/n²)(nπ(1 − π) + n²π²)
= nπ/n − n²π²/n² − π(1 − π)/n
= π − π² − π(1 − π)/n.
However, π(1 − π) = π − π², so:
E(R/n − (R/n)²) = π(1 − π) − π(1 − π)/n = π(1 − π) × (n − 1)/n.
It follows that:
π(1 − π) = (n/(n − 1)) × E(R/n − (R/n)²) = E(R/(n − 1) − R²/(n(n − 1))).
So we have found an unbiased estimator of π(1 − π), but it could do with tidying
up! When this is done, we see that:
R(n − R)/(n(n − 1))
is the required unbiased estimator of π(1 − π).
4. For T₁:
E(T₁) = E(Sxx/(n − 1)) = E(Sxx)/(n − 1) = (n − 1)σ²/(n − 1) = σ².
By definition, MSE(T₂) = 2(n − 1)σ⁴/n² + (−σ²/n)² = (2n − 1)σ⁴/n².
It can be seen that MSE(T₁) > MSE(T₂) since 2σ⁴/(n − 1) > (2n − 1)σ⁴/n², which holds
because 2n² > (2n − 1)(n − 1) = 2n² − 3n + 1 for all n ≥ 1.
S = Σ_{i=1}^{4} εᵢ² = (y₁ − α − β)² + (y₂ + α − β)² + (y₃ − α + β)² + (y₄ + α + β)².
∂S/∂α = −2(y₁ − α − β) + 2(y₂ + α − β) − 2(y₃ − α + β) + 2(y₄ + α + β) = −2(y₁ − y₂ + y₃ − y₄) + 8α
and:
∂S/∂β = −2(y₁ − α − β) − 2(y₂ + α − β) + 2(y₃ − α + β) + 2(y₄ + α + β) = −2(y₁ + y₂ − y₃ − y₄) + 8β.
Setting these derivatives equal to zero gives:
α̂ = (y₁ − y₂ + y₃ − y₄)/4  and  β̂ = (y₁ + y₂ − y₃ − y₄)/4.
(b) α̂ is an unbiased estimator of α since:
E(α̂) = E((y₁ − y₂ + y₃ − y₄)/4) = (α + β + α − β + α − β + α + β)/4 = α.
(c) We have:
Var(α̂) = Var((y₁ − y₂ + y₃ − y₄)/4) = 4σ²/16 = σ²/4.
(£3,050,000, £3,280,000).
Therefore, we estimate there are about 9.4 million shareholders in the UK,
with a margin of error of 1.1 million.
(b) Let us start by finding a 95% confidence interval for the difference in the two
proportions. We use the formula:
π̂₁ − π̂₂ ± 1.96 × √(π̂₁(1 − π̂₁)/n₁ + π̂₂(1 − π̂₂)/n₂).
The estimates of the proportions π₁ and π₂ are 0.23 and 0.171, respectively.
We know n₁ = 954 and although n₂ is unknown we can assume it is
approximately equal to 954 (noting the 'similar' in the question), so an
approximate 95% confidence interval is:
0.23 − 0.171 ± 1.96 × √(0.23 × 0.77/954 + 0.171 × 0.829/954) = 0.059 ± 0.036  ⇒  (0.023, 0.094).
By multiplying by 41 million, we get a confidence interval of approximately
(0.9 million, 3.9 million) for the increase in the number of shareholders.
(b) To find the sample size n and the value a, we need to solve two conditions:
• α = P(X̄ > a | H0) = P(Z > (a − 0.65)/(1/√n)) = 0.05  ⇒  (a − 0.65)/(1/√n) = 1.645.
• β = P(X̄ < a | H1) = P(Z < (a − 0.80)/(1/√n)) = 0.10  ⇒  (a − 0.80)/(1/√n) = −1.28.
Solving these equations gives a = 0.734 and n = 381, remembering to round
up!
(c) A sample is classified as being from A if x̄ > 0.75. We have:
α = P(X̄ > 0.75 | H0) = P(Z > (0.75 − 0.65)/(1/√n)) = 0.02  ⇒  (0.75 − 0.65)/(1/√n) = 2.05.
Solving this equation gives n = 421, remembering to round up! Therefore:
β = P(X̄ < 0.75 | H1) = P(Z < (0.75 − 0.80)/(1/√421)) = P(Z < −1.026) = 0.1515.
(d) The rule in (b) is 'take n = 381 and reject H0 if x̄ > 0.734'. So:
P(X̄ > 0.734 | µ = 0.7) = P(Z > (0.734 − 0.7)/(1/√381)) = P(Z > 0.66) = 0.2546.
2. (a) We have:
(b) We have:
(c) We have:
3. (a) We are to test H0 : µ = 12 vs. H1 : µ 6= 12. The key points here are that n is
small and that σ 2 is unknown. We can use the t test and this is valid provided
the data are normally distributed. The test statistic value is:
t = (x̄ − 12)/(s/√7) = (12.7 − 12)/(0.858/√7) = 2.16.
In (a) you are asked to do a two-sided test and in (b) it is a one-sided test. Which
is more appropriate will depend on the purpose of the experiment, and your
suspicions before you conduct it.
• If you suspected before collecting the data that the mean voltage was less than
12 volts, the one-sided test would be appropriate.
• If you had no prior reason to believe that the mean was less than 12 volts you
would perform a two-sided test.
• General rule: decide on whether it is a one- or two-sided test before performing
the statistical test!
4. It is useful to discuss the issues about this question before giving the solution.
• We want to know whether a loyalty programme such as that at the 12 selected
restaurants would result in an increase in mean profits greater than that
observed (during the three-month test) at the other sites within the chain.
• So we can model the profits across the chain as $1,047.34 + x, where $x is the
supposed effect of the promotion, and if the true mean value of x is µ, then we
wish to test:
H0 : µ = 0 vs. H1 : µ > 0
which is a one-tailed test since, clearly, there are (preliminary) grounds for
thinking that there is an increase due to the loyalty programme.
• We know nothing about the variability of profits across the rest of the chain,
so we will have to use the sample data, i.e. to calculate the sample variance
and to employ the t distribution with ν = 12 − 1 = 11 degrees of freedom.
• Although we shall want the variance of the data ‘sample value − 1,047.34’,
this will be the same as the variance of the sample data, since for any random
variable X and constant k we have:
Var(X + k) = Var(X)
The total change in profit for restaurants in the programme is Σ_{i=1}^{12} xᵢ = 30,113.17.
Since n = 12, the mean change in profit for restaurants in the programme is:
30,113.17/12 = 2,509.431 = 1,047.34 + 1,462.091
hence use x̄ = 1,462.091.
The raw sum of squares is Σ_{i=1}^{12} xᵢ² = 126,379,568.8. So, the 'corrected' sum of squares
is:
Sxx = Σ_{i=1}^{12} xᵢ² − nx̄² = 126,379,568.8 − 12 × (2,509.431)² = 50,812,651.51.
Therefore:
s² = Sxx/(n − 1) = 50,812,651.51/11 = 4,619,331.956.
Hence the estimated standard error is:
s/√n = √(4,619,331.956/12) = √384,944.3296 = 620.439.
The relevant critical values for t₁₁ in this one-tailed test are t_{0.05,11} = 1.796 and
t_{0.01,11} = 2.718, and the test statistic value is t = 1,462.091/620.439 = 2.36.
So we see that the test is significant at the 5% significance level, but not at the 1%
significance level, so reject H0 and conclude that the loyalty programme does have
an effect. (In fact, this means the result is moderately significant that the
programme has had a beneficial effect for the company.)
(c) For small samples, we should use a pooled estimate of the population standard
deviation:
s = √(((9 − 1) × 7.3 + (17 − 1) × 6.2)/((9 − 1) + (17 − 1))) = 2.5626 on 24 degrees of freedom.
This should be compared with the t24 distribution and is clearly not
significant, even at the 10% significance level. With the smaller samples we fail
to detect the difference.
Comparing the two test statistic calculations shows that the different results
flow from differences in the estimated standard errors, hence ultimately (and
unsurprisingly) from the differences in the sample sizes used in the two
situations.
6. (a) Let π be the population proportion of visitors who would use the device. We
test H0: π = 0.3 vs. H1: π < 0.3. The sample proportion is p = 20/80 = 0.25.
The standard error of the sample proportion is √(0.3 × 0.7/80) = 0.0512. The test
statistic value is:
z = (0.25 − 0.30)/0.0512 = −0.976.
For a one-sided (lower-tailed) test at the 5% significance level, the critical
value is −1.645, so the test is not significant – and not even at the 10%
significance level (the critical value is −1.282). On the basis of the data, there
is no reason to withdraw the device.
The critical region for the above test is to reject H0 if the sample proportion is
less than 0.3 − 1.645 × 0.0512, i.e. if the sample proportion, p, is less than
0.2157.
(b) The p-value of the test is the probability of the test statistic value or a more
extreme value conditional on H0 being true. Hence the p-value is:
P(Z ≤ −0.976) = 0.1645.
When π = 0.2, the standard error of the sample proportion is
√(0.2 × 0.8/80) = 0.0447. Therefore, the power when π = 0.2 is:
P(Z < (0.2157 − 0.2)/0.0447) = P(Z < 0.35) = 0.6368.
L.11. Appendix K – Linear regression
Hence a 95% confidence interval for β1 is 5.46 ± 2.306 × 0.66 ⇒ (3.94, 6.98).
(b) To test H0: β₀ = 0 vs. H1: β₀ ≠ 0, we first determine the estimated standard
error of β̂₀, which is:
√(4.95/10) × (219.46/(219.46 − 10 × (4.56)²))^{1/2} = 3.07.
Therefore, the test statistic value is:
6.07/3.07 = 1.98.
Comparing with the t8 distribution, this is not significant at the 5%
significance level (1.98 < 2.306), but it is significant at the 10% significance
level (1.860 < 1.98).
There is only weak evidence against the null hypothesis. Note though that in
practice this hypothesis is not really of interest. A line through the origin
implies that there is zero cost of a fire which takes place right next to a fire
station. This hypothesis does not seem sensible!
Hence a 95% confidence interval for β1 is 7.65 ± 2.447 × 0.82 ⇒ (5.64, 9.66).
(b) Substituting x = 9 we find the predicted year 9 profit (in £000s) is 75.55. The
estimated standard error of this prediction is:
1/2
√ 204 − 2 × 9 × 36 + 8 × 92
27.98 × 1 + = 6.71.
8 × (204 − 8 × (4.5)2 )
It follows that (using tn−2 = t6 ) a 95% prediction interval for the predicted
profit (in £000s) is:
3. (a) We first calculate x̄ = 4.5, Σxᵢ² = 204, ȳ = 434, Σyᵢ² = 1,938,174 and
Σxᵢyᵢ = 19,766. The estimated regression coefficients are:
β̂₁ = (19,766 − 8 × 4.5 × 434)/(204 − 8 × (4.5)²) = 98.62  and  β̂₀ = 434 − 98.62 × 4.5 = −9.79.
The fitted line is:
Predicted Expenditure = −9.79 + 98.62 × Year.
Therefore, β₁ = Cov(X, Y)/Var(X). The second equality follows from the fact that
Corr(X, Y) = Cov(X, Y)/(Var(X) Var(Y))^{1/2}.
Also, note that the first equality resembles the estimator:
β̂₁ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)².
Appendix M
Formula sheet in the examination
Var(β̂₀) = σ² Σ_{j=1}^{n} xⱼ² / (n Σ_{i=1}^{n} (xᵢ − x̄)²),   Var(β̂₁) = σ² / Σ_{i=1}^{n} (xᵢ − x̄)²,   Cov(β̂₀, β̂₁) = −σ²x̄ / Σ_{i=1}^{n} (xᵢ − x̄)².
Estimator for the variance of εᵢ: σ̂² = Σ_{i=1}^{n} (yᵢ − β̂₀ − β̂₁xᵢ)²/(n − 2).
Regression ANOVA:
Total SS = Σ_{i=1}^{n} (yᵢ − ȳ)²,   Regression SS = β̂₁² Σ_{i=1}^{n} (xᵢ − x̄)²   and   Residual SS = Σ_{i=1}^{n} (yᵢ − β̂₀ − β̂₁xᵢ)².
Squared regression correlation coefficient: R² = Regression SS/Total SS = 1 − Residual SS/Total SS.
One-way ANOVA:
Total variation: Σ_{j=1}^{k} Σ_{i=1}^{n_j} (Xᵢⱼ − X̄)² = Σ_{j=1}^{k} Σ_{i=1}^{n_j} Xᵢⱼ² − nX̄².
Between-treatments variation: B = Σ_{j=1}^{k} n_j(X̄·ⱼ − X̄)² = Σ_{j=1}^{k} n_jX̄·ⱼ² − nX̄².
Within-treatments variation: W = Σ_{j=1}^{k} Σ_{i=1}^{n_j} (Xᵢⱼ − X̄·ⱼ)² = Σ_{j=1}^{k} Σ_{i=1}^{n_j} Xᵢⱼ² − Σ_{j=1}^{k} n_jX̄·ⱼ².
Two-way ANOVA:
Total variation: Σ_{i=1}^{r} Σ_{j=1}^{c} (Xᵢⱼ − X̄)² = Σ_{i=1}^{r} Σ_{j=1}^{c} Xᵢⱼ² − rcX̄².
Between-blocks (rows) variation: B_row = c Σ_{i=1}^{r} (X̄ᵢ· − X̄)² = c Σ_{i=1}^{r} X̄ᵢ·² − rcX̄².
Between-treatments (columns) variation: B_col = r Σ_{j=1}^{c} (X̄·ⱼ − X̄)² = r Σ_{j=1}^{c} X̄·ⱼ² − rcX̄².
Appendix N
Sample examination paper
1. (a) Let X be a discrete random variable with probability function defined by:
p(x) = 2k^x for x = 2, 3, . . . , and p(x) = 0 otherwise.
P (∅) = 0.
(4 marks)
2. (a) Suppose we have a biased coin which comes up heads with probability π. An
experiment is carried out so that X is the number of independent flips of the
coin required until r heads show up, where r ≥ 1 is known. Determine the
probability function of X.
(5 marks)
3. (a) X₁, X₂, . . . , Xₙ are independent random variables with the common
probability density function:
f(x) = λ²xe^{−λx} for x ≥ 0, and f(x) = 0 otherwise.
(b) X and Y are random variables with a joint distribution given below:
X=x
0 2 4
Y =y 0 0.05 0.10 0.25
2 0.20 0.15 0.25
i. Obtain the marginal distributions of X and Y .
(2 marks)
ii. Evaluate E(X), Var(X), E(Y ) and Var(Y ).
(4 marks)
iii. Obtain the conditional distributions of Y | X = 2 and X | Y = 2.
(4 marks)
iv. Evaluate E(XY ), Cov(X, Y ) and Corr(X, Y ).
(4 marks)
v. Are X and Y independent? Justify your answer.
(2 marks)
4. (a) The mean squared error (MSE) of an estimator is the average squared error,
defined as:
MSE(θ̂) = E((θ̂ − θ)²).
Show how this can be decomposed into variance and bias components such
that:
MSE(θ̂) = Var(θ̂) + (Bias(θ̂))².
(5 marks)
(c) Suppose that you are given observations y1 and y2 such that:
y1 = α + 3β + ε1 and y2 = 3α + β + ε2 .
Here the variables ε1 and ε2 are independent and normally distributed with
mean 0 and variance σ 2 .
Find the least squares estimators α̂ and β̂ of the parameters α and β and
verify that they are unbiased estimators.
(7 marks)
5. (a) Let {X₁, X₂, . . . , Xₙ} be a random sample from a normally-distributed
population with mean µ and variance σ² < ∞. Let
M = Σ_{i=1}^{n} (Xᵢ − X̄)² = (n − 1)S², such that M/σ² ~ χ²_{n−1} (you do not need to
derive this result for this question).
i. Show that a 100(1 − α)% confidence interval for σ², for any α ∈ (0, 1) and
constants k₁ and k₂ such that 0 < k₁ < k₂, is:
(M/k₂, M/k₁).
(4 marks)
ii. Suppose the sample size is n = 20 and the sample variance is s2 = 17.3.
Compute a 99% confidence interval for σ 2 .
(3 marks)
iii. A researcher decides to test:
H0: σ² = 15  vs.  H1: σ² > 15
yᵢ = β₀ + β₁xᵢ + εᵢ
Var(β̂₀) = σ² Σ_{k=1}^{n} xₖ² / (n Σ_{k=1}^{n} (xₖ − x̄)²).
(7 marks)
[END OF PAPER]
Appendix O
Sample examination paper –
Solutions
iii. We have:
M_X(t) = E(e^{tX}) = Σ_{x=2}^{∞} 2k^x e^{tx} = 2 Σ_{x=2}^{∞} (ke^t)^x = 2k²e^{2t}/(1 − ke^t) = e^{2t}/(2 − e^t).
For the above to be valid, the sum to infinity has to be valid. That is,
ke^t < 1, meaning t < log(2). We then have:
M′_X(t) = (4e^{2t} − e^{3t})/(2 − e^t)²
However, the only real number for P (∅) which satisfies this is P (∅) = 0.
2. (a) To wait for r heads to show up, suppose x flips are required. The last flip must
be a head, with r − 1 heads randomly appearing in the first x − 1 flips. In each
particular combination of heads and tails, there must be r heads by definition
of the experiment, as well as x − r tails (so adding together, x flips in total),
with probability due to independence of:
π^r (1 − π)^{x−r}.
There are C(x − 1, r − 1) combinations of outcomes with this probability. Hence we
have:
p(x) = C(x − 1, r − 1) π^r (1 − π)^{x−r} for x = r, r + 1, . . . , and p(x) = 0 otherwise.
(c) i. We have:
P(−8 < X < −2) = P((−8 + 5)/4 < Z < (−2 + 5)/4)
= P(−0.75 < Z < 0.75)
= P(Z < 0.75) − P(Z < −0.75)
= Φ(0.75) − Φ(−0.75)
= 0.7734 − (1 − 0.7734)
= 0.5468.
ii. We want to find the value a such that P(−5 − a < X < −5 + a) = 0.95,
that is:
0.95 = P(((−5 − a) + 5)/4 < Z < ((−5 + a) + 5)/4)
= P(−a/4 < Z < a/4)
= 1 − P(Z > a/4) − P(Z < −a/4)
= 1 − 2 × P(Z > a/4).
This is the same as 2 × P(Z > a/4) = 0.05, i.e. P(Z > a/4) = 0.025.
Hence a/4 = 1.96, and so a = 7.84.
3. (a) Since the Xᵢs are independent (and identically distributed) random variables,
we have:
f(x₁, x₂, . . . , xₙ) = Π_{i=1}^{n} f(xᵢ).
(b) i. The marginal distributions are found by adding across rows and columns:
P(X = 0) = 0.25, P(X = 2) = 0.25, P(X = 4) = 0.50;  P(Y = 0) = 0.40, P(Y = 2) = 0.60.
Y = y|X = 2 0 2
P (Y = y | X = 2) 0.10/0.25 = 0.4 0.15/0.25 = 0.6
and:
X = x|Y = 2 0 2 4
P (X = x | Y = 2) 0.20/0.60 = 4/12 0.15/0.60 = 3/12 0.25/0.60 = 5/12
iv. We have:
E(XY) = 0 × 0.60 + 4 × 0.15 + 8 × 0.25 = 2.6.
Hence:
Cov(X, Y) = E(XY) − E(X)E(Y) = 2.6 − 2.5 × 1.2 = −0.4.
Also:
Corr(X, Y) = Cov(X, Y)/√(Var(X) Var(Y)) = −0.4/√(2.75 × 0.96) = −0.2462.
4. (a) We have:
The likelihood is maximised for small values of γ. The smallest value that
can safely maximise the likelihood without violating the support is:
γ
b = X(n) .
θ̂ = (X(n) − 2)/2 = X(n)/2 − 1.
ii. The probability density function is that of a continuous uniform
distribution, hence the variance is:
Var(X) = σ² = (2θ + 2)²/12.
By the invariance principle, the MLE of the standard deviation is:
σ̂ = √((X(n))²/12) = X(n)/√12.
(c) Given:
y₁ = α + 3β + ε₁  and  y₂ = 3α + β + ε₂
we have:
S = ε₁² + ε₂² = (y₁ − α − 3β)² + (y₂ − 3α − β)².
Taking partial derivatives with respect to α and β, respectively, and equating
to zero leads to the equations:
∂S/∂α = −2(y₁ − α̂ − 3β̂) − 6(y₂ − 3α̂ − β̂) = 0
and:
∂S/∂β = −6(y₁ − α̂ − 3β̂) − 2(y₂ − 3α̂ − β̂) = 0.
Solving this system yields:
α̂ = (−y₁ + 3y₂)/8  and  β̂ = (3y₁ − y₂)/8.
These are unbiased estimators since:
E(α̂) = (−α − 3β + 9α + 3β)/8 = α
and:
E(β̂) = (3α + 9β − 3α − β)/8 = β.
5. (a) i. For any given small α ∈ (0, 1), we can find 0 < k₁ < k₂ such that:
P(X < k₁) = P(X > k₂) = α/2
where X ~ χ²_{n−1}. Alternatively, k₁ = χ²_{1−α/2, n−1} and k₂ = χ²_{α/2, n−1}.
Therefore:
1 − α = P(k₁ < M/σ² < k₂) = P(M/k₂ < σ² < M/k₁).
P_σ(H0 is rejected) = P_σ(29 × S²/15 > 42.56)
= P_σ(29 × S²/18.2 > (15/18.2) × 42.56)
= P_σ(T > 35.08)
and:
$$\widehat{\beta}_0 = \bar{y} - \widehat{\beta}_1 \bar{x} = \bar{y} - \sum_{i=1}^{n} a_i \bar{x}\, y_i = \sum_{i=1}^{n}\left(\frac{1}{n} - a_i \bar{x}\right) y_i.$$
Hence (noting that $\sum_{i=1}^{n} a_i = 0$, so the cross-product term vanishes):
$$\mathrm{Var}(\widehat{\beta}_0) = \sigma^2 \sum_{i=1}^{n}\left(\frac{1}{n} - a_i \bar{x}\right)^2 = \sigma^2\left(\frac{1}{n} + \sum_{i=1}^{n} a_i^2 \bar{x}^2\right) = \frac{\sigma^2}{n}\left(1 + \frac{n\bar{x}^2}{\sum_{k=1}^{n}(x_k - \bar{x})^2}\right) = \frac{\sigma^2 \sum_{k=1}^{n} x_k^2}{n \sum_{k=1}^{n}(x_k - \bar{x})^2}.$$
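As a check, the closed-form expression agrees with the intercept entry of $\sigma^2 (X'X)^{-1}$ computed directly; the $x$ values and $\sigma^2$ below are hypothetical illustrative choices.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 7.0])   # hypothetical regressor values
n, sigma2 = len(x), 2.0                    # hypothetical error variance

# Closed-form expression derived above.
var_b0_formula = sigma2 * np.sum(x**2) / (n * np.sum((x - x.mean())**2))

# Matrix form: Var(beta_hat) = sigma^2 * (X'X)^{-1}; the [0, 0] entry is Var(beta0_hat).
X = np.column_stack([np.ones(n), x])
var_b0_matrix = sigma2 * np.linalg.inv(X.T @ X)[0, 0]
print(var_b0_formula, var_b0_matrix)       # the two values agree
```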
STATISTICAL TABLES
Cumulative normal distribution
Critical values of the t distribution
Critical values of the F distribution
Critical values of the chi-squared distribution
© C. Dougherty 2001, 2002 ([email protected]). These tables have been computed to accompany the text C. Dougherty, Introduction to Econometrics (second edition 2002, Oxford University Press, Oxford). They may be reproduced freely provided that this attribution is retained.
TABLE A.1
Cumulative normal distribution: A(z) = P(Z ≤ z) for Z ∼ N(0, 1)
z        A(z)
1.645 0.9500 Lower limit of right 5% tail
1.960 0.9750 Lower limit of right 2.5% tail
2.326 0.9900 Lower limit of right 1% tail
2.576 0.9950 Lower limit of right 0.5% tail
3.090 0.9990 Lower limit of right 0.1% tail
3.291 0.9995 Lower limit of right 0.05% tail
[Figure: standard normal density curve, z axis from −4 to 4]
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
3.5 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998
3.6 0.9998 0.9998 0.9999
TABLE A.2
t Distribution: Critical Values of t
Significance level
Degrees of Two-tailed test: 10% 5% 2% 1% 0.2% 0.1%
freedom One-tailed test: 5% 2.5% 1% 0.5% 0.1% 0.05%
1 6.314 12.706 31.821 63.657 318.309 636.619
2 2.920 4.303 6.965 9.925 22.327 31.599
3 2.353 3.182 4.541 5.841 10.215 12.924
4 2.132 2.776 3.747 4.604 7.173 8.610
5 2.015 2.571 3.365 4.032 5.893 6.869
6 1.943 2.447 3.143 3.707 5.208 5.959
7 1.894 2.365 2.998 3.499 4.785 5.408
8 1.860 2.306 2.896 3.355 4.501 5.041
9 1.833 2.262 2.821 3.250 4.297 4.781
10 1.812 2.228 2.764 3.169 4.144 4.587
11 1.796 2.201 2.718 3.106 4.025 4.437
12 1.782 2.179 2.681 3.055 3.930 4.318
13 1.771 2.160 2.650 3.012 3.852 4.221
14 1.761 2.145 2.624 2.977 3.787 4.140
15 1.753 2.131 2.602 2.947 3.733 4.073
16 1.746 2.120 2.583 2.921 3.686 4.015
17 1.740 2.110 2.567 2.898 3.646 3.965
18 1.734 2.101 2.552 2.878 3.610 3.922
19 1.729 2.093 2.539 2.861 3.579 3.883
20 1.725 2.086 2.528 2.845 3.552 3.850
21 1.721 2.080 2.518 2.831 3.527 3.819
22 1.717 2.074 2.508 2.819 3.505 3.792
23 1.714 2.069 2.500 2.807 3.485 3.768
24 1.711 2.064 2.492 2.797 3.467 3.745
25 1.708 2.060 2.485 2.787 3.450 3.725
26 1.706 2.056 2.479 2.779 3.435 3.707
27 1.703 2.052 2.473 2.771 3.421 3.690
28 1.701 2.048 2.467 2.763 3.408 3.674
29 1.699 2.045 2.462 2.756 3.396 3.659
30 1.697 2.042 2.457 2.750 3.385 3.646
32 1.694 2.037 2.449 2.738 3.365 3.622
34 1.691 2.032 2.441 2.728 3.348 3.601
36 1.688 2.028 2.434 2.719 3.333 3.582
38 1.686 2.024 2.429 2.712 3.319 3.566
40 1.684 2.021 2.423 2.704 3.307 3.551
42 1.682 2.018 2.418 2.698 3.296 3.538
44 1.680 2.015 2.414 2.692 3.286 3.526
46 1.679 2.013 2.410 2.687 3.277 3.515
48 1.677 2.011 2.407 2.682 3.269 3.505
50 1.676 2.009 2.403 2.678 3.261 3.496
60 1.671 2.000 2.390 2.660 3.232 3.460
70 1.667 1.994 2.381 2.648 3.211 3.435
80 1.664 1.990 2.374 2.639 3.195 3.416
90 1.662 1.987 2.368 2.632 3.183 3.402
100 1.660 1.984 2.364 2.626 3.174 3.390
120 1.658 1.980 2.358 2.617 3.160 3.373
150 1.655 1.976 2.351 2.609 3.145 3.357
200 1.653 1.972 2.345 2.601 3.131 3.340
300 1.650 1.968 2.339 2.592 3.118 3.323
400 1.649 1.966 2.336 2.588 3.111 3.315
500 1.648 1.965 2.334 2.586 3.107 3.310
600 1.647 1.964 2.333 2.584 3.104 3.307
∞ 1.645 1.960 2.326 2.576 3.090 3.291
TABLE A.3
F Distribution: Critical Values of F (5% significance level)
(v1 = numerator degrees of freedom, across; v2 = denominator degrees of freedom, down)
v1 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20
v2
1 161.45 199.50 215.71 224.58 230.16 233.99 236.77 238.88 240.54 241.88 243.91 245.36 246.46 247.32 248.01
2 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.38 19.40 19.41 19.42 19.43 19.44 19.45
3 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 8.74 8.71 8.69 8.67 8.66
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 5.91 5.87 5.84 5.82 5.80
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.68 4.64 4.60 4.58 4.56
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 4.00 3.96 3.92 3.90 3.87
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.57 3.53 3.49 3.47 3.44
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 3.28 3.24 3.20 3.17 3.15
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.07 3.03 2.99 2.96 2.94
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.91 2.86 2.83 2.80 2.77
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 2.79 2.74 2.70 2.67 2.65
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.69 2.64 2.60 2.57 2.54
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.60 2.55 2.51 2.48 2.46
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 2.53 2.48 2.44 2.41 2.39
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.48 2.42 2.38 2.35 2.33
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 2.42 2.37 2.33 2.30 2.28
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45 2.38 2.33 2.29 2.26 2.23
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41 2.34 2.29 2.25 2.22 2.19
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38 2.31 2.26 2.21 2.18 2.16
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35 2.28 2.22 2.18 2.15 2.12
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32 2.25 2.20 2.16 2.12 2.10
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30 2.23 2.17 2.13 2.10 2.07
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27 2.20 2.15 2.11 2.08 2.05
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25 2.18 2.13 2.09 2.05 2.03
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24 2.16 2.11 2.07 2.04 2.01
26 4.22 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22 2.15 2.09 2.05 2.02 1.99
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20 2.13 2.08 2.04 2.00 1.97
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19 2.12 2.06 2.02 1.99 1.96
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18 2.10 2.05 2.01 1.97 1.94
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 2.09 2.04 1.99 1.96 1.93
35 4.12 3.27 2.87 2.64 2.49 2.37 2.29 2.22 2.16 2.11 2.04 1.99 1.94 1.91 1.88
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 2.00 1.95 1.90 1.87 1.84
50 4.03 3.18 2.79 2.56 2.40 2.29 2.20 2.13 2.07 2.03 1.95 1.89 1.85 1.81 1.78
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99 1.92 1.86 1.82 1.78 1.75
70 3.98 3.13 2.74 2.50 2.35 2.23 2.14 2.07 2.02 1.97 1.89 1.84 1.79 1.75 1.72
80 3.96 3.11 2.72 2.49 2.33 2.21 2.13 2.06 2.00 1.95 1.88 1.82 1.77 1.73 1.70
90 3.95 3.10 2.71 2.47 2.32 2.20 2.11 2.04 1.99 1.94 1.86 1.80 1.76 1.72 1.69
100 3.94 3.09 2.70 2.46 2.31 2.19 2.10 2.03 1.97 1.93 1.85 1.79 1.75 1.71 1.68
120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91 1.83 1.78 1.73 1.69 1.66
150 3.90 3.06 2.66 2.43 2.27 2.16 2.07 2.00 1.94 1.89 1.82 1.76 1.71 1.67 1.64
200 3.89 3.04 2.65 2.42 2.26 2.14 2.06 1.98 1.93 1.88 1.80 1.74 1.69 1.66 1.62
250 3.88 3.03 2.64 2.41 2.25 2.13 2.05 1.98 1.92 1.87 1.79 1.73 1.68 1.65 1.61
300 3.87 3.03 2.63 2.40 2.24 2.13 2.04 1.97 1.91 1.86 1.78 1.72 1.68 1.64 1.61
400 3.86 3.02 2.63 2.39 2.24 2.12 2.03 1.96 1.90 1.85 1.78 1.72 1.67 1.63 1.60
500 3.86 3.01 2.62 2.39 2.23 2.12 2.03 1.96 1.90 1.85 1.77 1.71 1.66 1.62 1.59
600 3.86 3.01 2.62 2.39 2.23 2.11 2.02 1.95 1.90 1.85 1.77 1.71 1.66 1.62 1.59
750 3.85 3.01 2.62 2.38 2.23 2.11 2.02 1.95 1.89 1.84 1.77 1.70 1.66 1.62 1.58
1000 3.85 3.00 2.61 2.38 2.22 2.11 2.02 1.95 1.89 1.84 1.76 1.70 1.65 1.61 1.58
F Distribution: Critical Values of F (1% significance level)
v1 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20
v2
1 4052.18 4999.50 5403.35 5624.58 5763.65 5858.99 5928.36 5981.07 6022.47 6055.85 6106.32 6142.67 6170.10 6191.53 6208.73
2 98.50 99.00 99.17 99.25 99.30 99.33 99.36 99.37 99.39 99.40 99.42 99.43 99.44 99.44 99.45
3 34.12 30.82 29.46 28.71 28.24 27.91 27.67 27.49 27.35 27.23 27.05 26.92 26.83 26.75 26.69
4 21.20 18.00 16.69 15.98 15.52 15.21 14.98 14.80 14.66 14.55 14.37 14.25 14.15 14.08 14.02
5 16.26 13.27 12.06 11.39 10.97 10.67 10.46 10.29 10.16 10.05 9.89 9.77 9.68 9.61 9.55
6 13.75 10.92 9.78 9.15 8.75 8.47 8.26 8.10 7.98 7.87 7.72 7.60 7.52 7.45 7.40
7 12.25 9.55 8.45 7.85 7.46 7.19 6.99 6.84 6.72 6.62 6.47 6.36 6.28 6.21 6.16
8 11.26 8.65 7.59 7.01 6.63 6.37 6.18 6.03 5.91 5.81 5.67 5.56 5.48 5.41 5.36
9 10.56 8.02 6.99 6.42 6.06 5.80 5.61 5.47 5.35 5.26 5.11 5.01 4.92 4.86 4.81
10 10.04 7.56 6.55 5.99 5.64 5.39 5.20 5.06 4.94 4.85 4.71 4.60 4.52 4.46 4.41
11 9.65 7.21 6.22 5.67 5.32 5.07 4.89 4.74 4.63 4.54 4.40 4.29 4.21 4.15 4.10
12 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.50 4.39 4.30 4.16 4.05 3.97 3.91 3.86
13 9.07 6.70 5.74 5.21 4.86 4.62 4.44 4.30 4.19 4.10 3.96 3.86 3.78 3.72 3.66
14 8.86 6.51 5.56 5.04 4.69 4.46 4.28 4.14 4.03 3.94 3.80 3.70 3.62 3.56 3.51
15 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 3.89 3.80 3.67 3.56 3.49 3.42 3.37
16 8.53 6.23 5.29 4.77 4.44 4.20 4.03 3.89 3.78 3.69 3.55 3.45 3.37 3.31 3.26
17 8.40 6.11 5.18 4.67 4.34 4.10 3.93 3.79 3.68 3.59 3.46 3.35 3.27 3.21 3.16
18 8.29 6.01 5.09 4.58 4.25 4.01 3.84 3.71 3.60 3.51 3.37 3.27 3.19 3.13 3.08
19 8.18 5.93 5.01 4.50 4.17 3.94 3.77 3.63 3.52 3.43 3.30 3.19 3.12 3.05 3.00
20 8.10 5.85 4.94 4.43 4.10 3.87 3.70 3.56 3.46 3.37 3.23 3.13 3.05 2.99 2.94
21 8.02 5.78 4.87 4.37 4.04 3.81 3.64 3.51 3.40 3.31 3.17 3.07 2.99 2.93 2.88
22 7.95 5.72 4.82 4.31 3.99 3.76 3.59 3.45 3.35 3.26 3.12 3.02 2.94 2.88 2.83
23 7.88 5.66 4.76 4.26 3.94 3.71 3.54 3.41 3.30 3.21 3.07 2.97 2.89 2.83 2.78
24 7.82 5.61 4.72 4.22 3.90 3.67 3.50 3.36 3.26 3.17 3.03 2.93 2.85 2.79 2.74
25 7.77 5.57 4.68 4.18 3.85 3.63 3.46 3.32 3.22 3.13 2.99 2.89 2.81 2.75 2.70
26 7.72 5.53 4.64 4.14 3.82 3.59 3.42 3.29 3.18 3.09 2.96 2.86 2.78 2.72 2.66
27 7.68 5.49 4.60 4.11 3.78 3.56 3.39 3.26 3.15 3.06 2.93 2.82 2.75 2.68 2.63
28 7.64 5.45 4.57 4.07 3.75 3.53 3.36 3.23 3.12 3.03 2.90 2.79 2.72 2.65 2.60
29 7.60 5.42 4.54 4.04 3.73 3.50 3.33 3.20 3.09 3.00 2.87 2.77 2.69 2.63 2.57
30 7.56 5.39 4.51 4.02 3.70 3.47 3.30 3.17 3.07 2.98 2.84 2.74 2.66 2.60 2.55
35 7.42 5.27 4.40 3.91 3.59 3.37 3.20 3.07 2.96 2.88 2.74 2.64 2.56 2.50 2.44
40 7.31 5.18 4.31 3.83 3.51 3.29 3.12 2.99 2.89 2.80 2.66 2.56 2.48 2.42 2.37
50 7.17 5.06 4.20 3.72 3.41 3.19 3.02 2.89 2.78 2.70 2.56 2.46 2.38 2.32 2.27
60 7.08 4.98 4.13 3.65 3.34 3.12 2.95 2.82 2.72 2.63 2.50 2.39 2.31 2.25 2.20
70 7.01 4.92 4.07 3.60 3.29 3.07 2.91 2.78 2.67 2.59 2.45 2.35 2.27 2.20 2.15
80 6.96 4.88 4.04 3.56 3.26 3.04 2.87 2.74 2.64 2.55 2.42 2.31 2.23 2.17 2.12
90 6.93 4.85 4.01 3.53 3.23 3.01 2.84 2.72 2.61 2.52 2.39 2.29 2.21 2.14 2.09
100 6.90 4.82 3.98 3.51 3.21 2.99 2.82 2.69 2.59 2.50 2.37 2.27 2.19 2.12 2.07
120 6.85 4.79 3.95 3.48 3.17 2.96 2.79 2.66 2.56 2.47 2.34 2.23 2.15 2.09 2.03
150 6.81 4.75 3.91 3.45 3.14 2.92 2.76 2.63 2.53 2.44 2.31 2.20 2.12 2.06 2.00
200 6.76 4.71 3.88 3.41 3.11 2.89 2.73 2.60 2.50 2.41 2.27 2.17 2.09 2.03 1.97
250 6.74 4.69 3.86 3.40 3.09 2.87 2.71 2.58 2.48 2.39 2.26 2.15 2.07 2.01 1.95
300 6.72 4.68 3.85 3.38 3.08 2.86 2.70 2.57 2.47 2.38 2.24 2.14 2.06 1.99 1.94
400 6.70 4.66 3.83 3.37 3.06 2.85 2.68 2.56 2.45 2.37 2.23 2.13 2.05 1.98 1.92
500 6.69 4.65 3.82 3.36 3.05 2.84 2.68 2.55 2.44 2.36 2.22 2.12 2.04 1.97 1.92
600 6.68 4.64 3.81 3.35 3.05 2.83 2.67 2.54 2.44 2.35 2.21 2.11 2.03 1.96 1.91
750 6.67 4.63 3.81 3.34 3.04 2.83 2.66 2.53 2.43 2.34 2.21 2.11 2.02 1.96 1.90
1000 6.66 4.63 3.80 3.34 3.04 2.82 2.66 2.53 2.43 2.34 2.20 2.10 2.02 1.95 1.90
F Distribution: Critical Values of F (0.1% significance level)
v1 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20
v2
1 4.05e05 5.00e05 5.40e05 5.62e05 5.76e05 5.86e05 5.93e05 5.98e05 6.02e05 6.06e05 6.11e05 6.14e05 6.17e05 6.19e05 6.21e05
2 998.50 999.00 999.17 999.25 999.30 999.33 999.36 999.37 999.39 999.40 999.42 999.43 999.44 999.44 999.45
3 167.03 148.50 141.11 137.10 134.58 132.85 131.58 130.62 129.86 129.25 128.32 127.64 127.14 126.74 126.42
4 74.14 61.25 56.18 53.44 51.71 50.53 49.66 49.00 48.47 48.05 47.41 46.95 46.60 46.32 46.10
5 47.18 37.12 33.20 31.09 29.75 28.83 28.16 27.65 27.24 26.92 26.42 26.06 25.78 25.57 25.39
6 35.51 27.00 23.70 21.92 20.80 20.03 19.46 19.03 18.69 18.41 17.99 17.68 17.45 17.27 17.12
7 29.25 21.69 18.77 17.20 16.21 15.52 15.02 14.63 14.33 14.08 13.71 13.43 13.23 13.06 12.93
8 25.41 18.49 15.83 14.39 13.48 12.86 12.40 12.05 11.77 11.54 11.19 10.94 10.75 10.60 10.48
9 22.86 16.39 13.90 12.56 11.71 11.13 10.70 10.37 10.11 9.89 9.57 9.33 9.15 9.01 8.90
10 21.04 14.91 12.55 11.28 10.48 9.93 9.52 9.20 8.96 8.75 8.45 8.22 8.05 7.91 7.80
11 19.69 13.81 11.56 10.35 9.58 9.05 8.66 8.35 8.12 7.92 7.63 7.41 7.24 7.11 7.01
12 18.64 12.97 10.80 9.63 8.89 8.38 8.00 7.71 7.48 7.29 7.00 6.79 6.63 6.51 6.40
13 17.82 12.31 10.21 9.07 8.35 7.86 7.49 7.21 6.98 6.80 6.52 6.31 6.16 6.03 5.93
14 17.14 11.78 9.73 8.62 7.92 7.44 7.08 6.80 6.58 6.40 6.13 5.93 5.78 5.66 5.56
15 16.59 11.34 9.34 8.25 7.57 7.09 6.74 6.47 6.26 6.08 5.81 5.62 5.46 5.35 5.25
16 16.12 10.97 9.01 7.94 7.27 6.80 6.46 6.19 5.98 5.81 5.55 5.35 5.20 5.09 4.99
17 15.72 10.66 8.73 7.68 7.02 6.56 6.22 5.96 5.75 5.58 5.32 5.13 4.99 4.87 4.78
18 15.38 10.39 8.49 7.46 6.81 6.35 6.02 5.76 5.56 5.39 5.13 4.94 4.80 4.68 4.59
19 15.08 10.16 8.28 7.27 6.62 6.18 5.85 5.59 5.39 5.22 4.97 4.78 4.64 4.52 4.43
20 14.82 9.95 8.10 7.10 6.46 6.02 5.69 5.44 5.24 5.08 4.82 4.64 4.49 4.38 4.29
21 14.59 9.77 7.94 6.95 6.32 5.88 5.56 5.31 5.11 4.95 4.70 4.51 4.37 4.26 4.17
22 14.38 9.61 7.80 6.81 6.19 5.76 5.44 5.19 4.99 4.83 4.58 4.40 4.26 4.15 4.06
23 14.20 9.47 7.67 6.70 6.08 5.65 5.33 5.09 4.89 4.73 4.48 4.30 4.16 4.05 3.96
24 14.03 9.34 7.55 6.59 5.98 5.55 5.23 4.99 4.80 4.64 4.39 4.21 4.07 3.96 3.87
25 13.88 9.22 7.45 6.49 5.89 5.46 5.15 4.91 4.71 4.56 4.31 4.13 3.99 3.88 3.79
26 13.74 9.12 7.36 6.41 5.80 5.38 5.07 4.83 4.64 4.48 4.24 4.06 3.92 3.81 3.72
27 13.61 9.02 7.27 6.33 5.73 5.31 5.00 4.76 4.57 4.41 4.17 3.99 3.86 3.75 3.66
28 13.50 8.93 7.19 6.25 5.66 5.24 4.93 4.69 4.50 4.35 4.11 3.93 3.80 3.69 3.60
29 13.39 8.85 7.12 6.19 5.59 5.18 4.87 4.64 4.45 4.29 4.05 3.88 3.74 3.63 3.54
30 13.29 8.77 7.05 6.12 5.53 5.12 4.82 4.58 4.39 4.24 4.00 3.82 3.69 3.58 3.49
35 12.90 8.47 6.79 5.88 5.30 4.89 4.59 4.36 4.18 4.03 3.79 3.62 3.48 3.38 3.29
40 12.61 8.25 6.59 5.70 5.13 4.73 4.44 4.21 4.02 3.87 3.64 3.47 3.34 3.23 3.14
50 12.22 7.96 6.34 5.46 4.90 4.51 4.22 4.00 3.82 3.67 3.44 3.27 3.41 3.04 2.95
60 11.97 7.77 6.17 5.31 4.76 4.37 4.09 3.86 3.69 3.54 3.32 3.15 3.02 2.91 2.83
70 11.80 7.64 6.06 5.20 4.66 4.28 3.99 3.77 3.60 3.45 3.23 3.06 2.93 2.83 2.74
80 11.67 7.54 5.97 5.12 4.58 4.20 3.92 3.70 3.53 3.39 3.16 3.00 2.87 2.76 2.68
90 11.57 7.47 5.91 5.06 4.53 4.15 3.87 3.65 3.48 3.34 3.11 2.95 2.82 2.71 2.63
100 11.50 7.41 5.86 5.02 4.48 4.11 3.83 3.61 3.44 3.30 3.07 2.91 2.78 2.68 2.59
120 11.38 7.32 5.78 4.95 4.42 4.04 3.77 3.55 3.38 3.24 3.02 2.85 2.72 2.62 2.53
150 11.27 7.24 5.71 4.88 4.35 3.98 3.71 3.49 3.32 3.18 2.96 2.80 2.67 2.56 2.48
200 11.15 7.15 5.63 4.81 4.29 3.92 3.65 3.43 3.26 3.12 2.90 2.74 2.61 2.51 2.42
250 11.09 7.10 5.59 4.77 4.25 3.88 3.61 3.40 3.23 3.09 2.87 2.71 2.58 2.48 2.39
300 11.04 7.07 5.56 4.75 4.22 3.86 3.59 3.38 3.21 3.07 2.85 2.69 2.56 2.46 2.37
400 10.99 7.03 5.53 4.71 4.19 3.83 3.56 3.35 3.18 3.04 2.82 2.66 2.53 2.43 2.34
500 10.96 7.00 5.51 4.69 4.18 3.81 3.54 3.33 3.16 3.02 2.81 2.64 2.52 2.41 2.33
600 10.94 6.99 5.49 4.68 4.16 3.80 3.53 3.32 3.15 3.01 2.80 2.63 2.51 2.40 2.32
750 10.91 6.97 5.48 4.67 4.15 3.79 3.52 3.31 3.14 3.00 2.78 2.62 2.49 2.39 2.31
1000 10.89 6.96 5.46 4.65 4.14 3.78 3.51 3.30 3.13 2.99 2.77 2.61 2.48 2.38 2.30
Dennis V. Lindley, William F. Scott, New Cambridge Statistical Tables, (1995) © Cambridge University Press, reproduced with permission.