Notes For Lectures 1 To 10 - 2024
Christl Donnelly
In some places these notes will contain more than the lectures and in other places they
will contain less. I strongly encourage you to attend all of the lectures.
These notes are largely based on previous versions of the course, and I would like to make
grateful acknowledgement to Neil Laws, Peter Donnelly and Bernard Silverman. Please
send any comments or corrections to [email protected].
TT 2024 updates:
Introduction
Probability: in probability we start with a probability model 𝑃, and we deduce properties
of 𝑃.
E.g. Imagine flipping a coin 5 times. Assuming that flips are fair and independent, what is
the probability of getting 2 heads?
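This probability can be checked with a couple of lines of Python (a quick sketch, not part of the notes; any language would do):

```python
from math import comb

# P(exactly 2 heads in 5 fair, independent flips) = C(5, 2) * (1/2)^5
n, k = 5, 2
prob = comb(n, k) * 0.5**n
print(prob)  # 10/32 = 0.3125
```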
Statistics: in statistics we have data, which we regard as having been generated from some
unknown probability model 𝑃. We want to be able to say some useful things about 𝑃.
E.g. We flip a coin 1000 times and observe 530 heads. Is the coin fair? i.e. is the probability of heads greater than 1/2? Or could it be equal to 1/2?
Precision of estimation
[Histogram: Temperature (degrees C) on the x-axis, Density on the y-axis.]
Figure. Histogram of body temperatures of 130 individuals. A description of the dataset is available
here: https://rdrr.io/cran/UsingR/man/normtemp.html
Relationships between observations
[Scatterplot: Month on the x-axis, Price on the y-axis.]
Lower case letters 𝑥 1 , . . . , 𝑥 𝑛 denote observations: we regard these as the observed values
of random variables 𝑋1 , . . . , 𝑋𝑛 .
Let x = (𝑥 1 , . . . , 𝑥 𝑛 ) and X = (𝑋1 , . . . , 𝑋𝑛 ).
It is convenient to think of 𝑥 𝑖 as
• the observed value of 𝑋𝑖
• or sometimes as a possible value that 𝑋𝑖 can take.
Since 𝑥 𝑖 is a possible value for 𝑋𝑖 we can calculate probabilities, e.g. if 𝑋𝑖 ∼ Poisson(𝜆)
then
P(X_i = x_i) = e^{−λ} λ^{x_i} / x_i!,   x_i = 0, 1, . . . .
1 Random Samples
Definition. A random sample of size 𝑛 is a set of random variables 𝑋1 , . . . , 𝑋𝑛 which are
independent and identically distributed (i.i.d.).
Example. If X_1, . . . , X_n are a random sample with X_i ∼ Poisson(λ), the joint p.m.f. is
f(x) = P(X_1 = x_1) · · · P(X_n = x_n)
     = (e^{−λ} λ^{x_1}/x_1!) · (e^{−λ} λ^{x_2}/x_2!) · · · (e^{−λ} λ^{x_n}/x_n!)   since the X_i are Poisson
     = e^{−nλ} λ^{Σ_{i=1}^n x_i} / Π_{i=1}^n x_i!.
Example. If X_1, . . . , X_n are a random sample from the exponential distribution with mean μ, each X_i has p.d.f.
f(x) = (1/μ) e^{−x/μ},   x ⩾ 0,
and the joint p.d.f. is
f(x) = Π_{i=1}^n (1/μ) e^{−x_i/μ} = (1/μ^n) exp(−(1/μ) Σ_{i=1}^n x_i).
Note.
1. We use 𝑓 to denote a p.m.f. in the first example and to denote a p.d.f. in the second
example. It is convenient to use the same letter (i.e. 𝑓 ) in both the discrete and
continuous cases. (In introductory probability you may often see 𝑝 for p.m.f. and 𝑓
for p.d.f.)
We could write f_{X_i}(x_i) for the p.m.f./p.d.f. of X_i, and f_{X_1,...,X_n}(x) for the joint p.m.f./p.d.f. of X_1, . . . , X_n. However it is convenient to keep things simpler by omitting subscripts on f.
2. In the second example 𝐸(𝑋𝑖 ) = 𝜇 and we say “𝑋𝑖 has an exponential distribution
with mean 𝜇” (i.e. expectation 𝜇).
Sometimes, and often in probability, we work with “an exponential distribution with
parameter 𝜆” where 𝜆 = 1/𝜇. To change the parameter from 𝜇 to 𝜆 all we do is
replace 𝜇 by 1/𝜆 to get
f(x) = λ e^{−λx},   x ⩾ 0.
Sometimes (often in statistics) we parametrise the distribution using 𝜇, sometimes
(often in probability) we parametrise it using 𝜆.
In probability we assume that the parameters 𝜆 and 𝜇 in our two examples are known. In
statistics we wish to estimate 𝜆 and 𝜇 from data.
• What is the best way to estimate them? And what does “best” mean?
• For a given method, how precise is the estimation?
2 Summary Statistics
The expected value of 𝑋, E(𝑋), is also called its mean. This is often denoted 𝜇.
The variance of 𝑋, var(𝑋), is var(𝑋) = 𝐸[(𝑋 − 𝜇)2 ]. This is often denoted 𝜎2 . The standard
deviation of 𝑋 is 𝜎.
Definition. Given a random sample X_1, . . . , X_n, the sample mean is X̄ = (1/n) Σ_{i=1}^n X_i and the sample variance is S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)². The sample standard deviation is S = √(S²).
Note.
1. The denominator in the definition of 𝑆 2 is 𝑛 − 1, not 𝑛.
2. 𝑋 and 𝑆 2 are random variables. Their distributions are called the sampling distributions
of 𝑋 and 𝑆 2 .
3. Given observations x_1, . . . , x_n we can compute the observed values x̄ and s².
4. The sample mean 𝑥 is a summary of the location of the sample.
5. The sample variance 𝑠 2 (or the sample standard deviation 𝑠) is a summary of the
spread of the sample about 𝑥.
The random variable 𝑋 has a normal distribution with mean 𝜇 and variance 𝜎2 , written
𝑋 ∼ 𝑁(𝜇, 𝜎2 ), if the p.d.f. of 𝑋 is
f(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)),   −∞ < x < ∞.
• The cumulative distribution function (c.d.f.) of the standard normal distribution is
Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−u²/2} du.
This cannot be written in a closed form, but can be found by numerical integration
to an arbitrary degree of accuracy.
[Two panels: the N(0, 1) p.d.f. f(x) and the c.d.f. Φ(x), each plotted over −4 ⩽ x ⩽ 4.]
3 Maximum Likelihood Estimation
Maximum likelihood estimation is a general method for estimating unknown parameters
from data. This turns out to be the method of choice in many contexts, though this isn’t
obvious at this stage.
Example. Suppose e.g. that 𝑥 1 , . . . , 𝑥 𝑛 are time intervals between major earthquakes.
Assume these are observations of 𝑋1 , . . . , 𝑋𝑛 independently drawn from an exponential
distribution with mean 𝜇, so that each 𝑋𝑖 has p.d.f.
f(x; μ) = (1/μ) e^{−x/μ},   x ⩾ 0.
Figure 3.1. Histogram of time intervals between 62 major earthquakes 1902–77: an exponential
density looks plausible. (There are more effective ways than a histogram to check if a particular
distribution is appropriate, i.e. to check the exponential assumption in this case – see Part A
Statistics.)
Definition. Let 𝑋1 , . . . , 𝑋𝑛 have joint p.d.f./p.m.f. 𝑓 (x; 𝜃). Given observed values
𝑥1 , . . . , 𝑥 𝑛 of 𝑋1 , . . . , 𝑋𝑛 , the likelihood of 𝜃 is the function
L(θ) = f(x; θ) = Π_{i=1}^n f(x_i; θ),   since the X_i are independent.
Sometimes we have independent 𝑋𝑖 whose distributions differ, say 𝑋𝑖 is from 𝑓𝑖 (𝑥; 𝜃).
Then the likelihood is
L(θ) = Π_{i=1}^n f_i(x_i; θ).
So the MLE is μ̂(x) = x̄. Often we'll just write μ̂ = x̄.
In this case the maximum likelihood estimator of μ is μ̂(X) = X̄, which is a random variable. (More on the difference between estimates and estimators later.)
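To see numerically that maximising the log-likelihood recovers μ̂ = x̄, here is a Python sketch on synthetic data (the sample size 62 echoes the earthquake dataset, but the data below are simulated, not the real intervals):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=400.0, size=62)  # synthetic "waiting times", assumed true mean 400

def log_lik(mu):
    # l(mu) = -n log(mu) - (1/mu) * sum(x_i) for the exponential-with-mean-mu model
    return -len(x) * np.log(mu) - x.sum() / mu

# Evaluate l(mu) on a fine grid and take the maximiser
grid = np.linspace(50, 1500, 20_000)
mu_hat_grid = grid[np.argmax(log_lik(grid))]

print(x.mean(), mu_hat_grid)  # the grid maximiser is essentially the sample mean
```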
Figure 3.2. Likelihood 𝐿(𝜇) and log-likelihood ℓ (𝜇) for exponential (earthquake) example.
Definition. If θ̂(x) is the maximum likelihood estimate of θ, then the maximum likelihood estimator (MLE) is defined by θ̂(X).
Note: both maximum likelihood estimate and maximum likelihood estimator are often abbreviated to MLE.
Example (Opinion poll). Suppose 𝑛 individuals are drawn independently from a large
population. Let
X_i = 1 if individual i is a Labour voter, and X_i = 0 otherwise.
Let 𝑝 be the proportion of Labour voters, so that
𝑃(𝑋𝑖 = 1) = 𝑝, 𝑃(𝑋𝑖 = 0) = 1 − 𝑝.
The likelihood is
L(p) = Π_{i=1}^n p^{x_i} (1 − p)^{1−x_i} = p^r (1 − p)^{n−r}
where r = Σ_{i=1}^n x_i. So the log-likelihood is
ℓ(p) = r log p + (n − r) log(1 − p).
For a maximum, differentiate and set to zero:
r/p − (n − r)/(1 − p) = 0  ⟺  r/p = (n − r)/(1 − p)  ⟺  r − rp = np − rp,
giving the MLE p̂ = r/n, the observed proportion of Labour voters.

Example. Suppose the counts x_1, x_2, x_3, with x_1 + x_2 + x_3 = n, are the observed numbers of individuals of three types, where the types have probabilities p_1 = θ², p_2 = 2θ(1 − θ) and p_3 = (1 − θ)². The likelihood is
L(θ) = P(X_1 = x_1, X_2 = x_2, X_3 = x_3)
     = (n!/(x_1! x_2! x_3!)) p_1^{x_1} p_2^{x_2} p_3^{x_3}
     = (n!/(x_1! x_2! x_3!)) (θ²)^{x_1} [2θ(1 − θ)]^{x_2} [(1 − θ)²]^{x_3}.
This is a multinomial distribution.
So
ℓ(θ) = constant + (2x_1 + x_2) log θ + (x_2 + 2x_3) log(1 − θ).
Then ℓ′(θ̂) = 0 gives
(2x_1 + x_2)/θ̂ − (x_2 + 2x_3)/(1 − θ̂) = 0
or
(2x_1 + x_2)(1 − θ̂) = (x_2 + 2x_3)θ̂.
So
θ̂ = (2x_1 + x_2)/(2(x_1 + x_2 + x_3)) = (2x_1 + x_2)/(2n).
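A quick numerical sanity check of θ̂ = (2x_1 + x_2)/(2n), as a Python sketch (the counts below are hypothetical, chosen only for illustration):

```python
import numpy as np

x1, x2, x3 = 30, 50, 20  # hypothetical counts of the three types
n = x1 + x2 + x3

def log_lik(theta):
    # l(theta) = const + (2 x1 + x2) log(theta) + (x2 + 2 x3) log(1 - theta)
    return (2 * x1 + x2) * np.log(theta) + (2 * x3 + x2) * np.log(1 - theta)

# Maximise on a fine grid over (0, 1)
grid = np.linspace(0.001, 0.999, 99_901)
theta_grid = grid[np.argmax(log_lik(grid))]

theta_hat = (2 * x1 + x2) / (2 * n)  # closed-form MLE from the notes
print(theta_hat, theta_grid)  # both about 0.55
```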
Steps:
• Write down the (log) likelihood
• Find the maximum (usually by differentiation, but not quite always)
• Rearrange to give the parameter estimate in terms of the data.
Example (Estimating multiple parameters). Let X_1, . . . , X_n ∼iid N(μ, σ²) where both μ and σ² are unknown.
[Here ∼iid means "are independent and identically distributed as."]
The likelihood is
L(μ, σ²) = Π_{i=1}^n (1/√(2πσ²)) exp(−(x_i − μ)²/(2σ²))
         = (2πσ²)^{−n/2} exp(−(1/(2σ²)) Σ_{i=1}^n (x_i − μ)²)
with log-likelihood
ℓ(μ, σ²) = −(n/2) log 2π − (n/2) log(σ²) − (1/(2σ²)) Σ_{i=1}^n (x_i − μ)².
Solving ∂ℓ/∂μ = 0 and ∂ℓ/∂(σ²) = 0 simultaneously we obtain
μ̂ = (1/n) Σ_{i=1}^n x_i = x̄
σ̂² = (1/n) Σ_{i=1}^n (x_i − μ̂)² = (1/n) Σ_{i=1}^n (x_i − x̄)².
Hence: the MLE of 𝜇 is the sample mean, but the MLE of 𝜎2 is (𝑛 − 1)𝑠 2 /𝑛. (More later.)
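A small Python sketch on synthetic data (the true μ = 10, σ = 2 are hypothetical) showing the relationship σ̂² = (n − 1)s²/n between the MLE and the sample variance:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(10.0, 2.0, size=200)  # synthetic data, hypothetical mu = 10, sigma = 2
n = len(x)

mu_hat = x.mean()                        # MLE of mu: the sample mean
sigma2_hat = ((x - x.mean())**2).mean()  # MLE of sigma^2: divide by n
s2 = x.var(ddof=1)                       # sample variance: divide by n - 1

print(sigma2_hat, (n - 1) * s2 / n)  # these agree: sigma2_hat = (n-1) s^2 / n
```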
Estimator:
• A rule for constructing an estimate.
• A function of the random variables X involved in the random sample.
• Itself a random variable.
Estimate:
• The numerical value of the estimator for the particular data set.
• The value of the function evaluated at the data 𝑥 1 , . . . , 𝑥 𝑛 .
4 Parameter Estimation
Recall the earthquake example:
X_1, . . . , X_n ∼iid exponential distribution, mean μ.
Possible estimators of 𝜇:
• X̄
• (1/3)X_1 + (2/3)X_2
• X_1 + X_2 − X_3
• (2/(n(n + 1)))(X_1 + 2X_2 + · · · + nX_n).
How should we choose?
In general, suppose 𝑋1 , . . . , 𝑋𝑛 is a random sample from a distribution with p.d.f./p.m.f.
𝑓 (𝑥; 𝜃). We want to estimate 𝜃 from observations 𝑥 1 , . . . , 𝑥 𝑛 .
We can choose between estimators by studying their properties. A good estimator should
take values close to 𝜃.
Definition. The estimator 𝑇 = 𝑇(X) is said to be unbiased for 𝜃 if, whatever the true value
of 𝜃, we have 𝐸(𝑇) = 𝜃.
This means that “on average” 𝑇 is correct.
For example, (1/3)X_1 + (2/3)X_2 is unbiased for μ in the earthquake example since
E((1/3)X_1 + (2/3)X_2) = (1/3)μ + (2/3)μ = μ.
Similar calculations show that X_1 + X_2 − X_3 and (2/(n(n + 1))) Σ_{j=1}^n jX_j are also unbiased.
Example (Normal variance). Suppose X_1, . . . , X_n ∼iid N(μ, σ²), with μ and σ² unknown, and let T = (1/n) Σ (X_i − X̄)². Then T is the MLE of σ². Is T unbiased?
Let 𝑍 𝑖 = (𝑋𝑖 − 𝜇)/𝜎. So the 𝑍 𝑖 are independent and 𝑁(0, 1), 𝐸(𝑍 𝑖 ) = 0, var(𝑍 𝑖 ) = 𝐸(𝑍 2𝑖 ) = 1.
E[(X_i − X̄)²]
= E[σ²(Z_i − Z̄)²]
= σ² var(Z_i − Z̄)   since E(Z_i − Z̄) = 0
= σ² var(−(1/n)Z_1 − (1/n)Z_2 − · · · + ((n − 1)/n)Z_i + · · · − (1/n)Z_n)
= σ²((1/n²) var(Z_1) + (1/n²) var(Z_2) + · · · + ((n − 1)²/n²) var(Z_i) + · · · + (1/n²) var(Z_n))
   since var(Σ a_i U_i) = Σ a_i² var(U_i) for independent U_i
= σ²((n − 1) · 1/n² + (n − 1)²/n²)
= ((n − 1)/n) σ².
So
E(T) = (1/n) Σ_{i=1}^n E[(X_i − X̄)²] = ((n − 1)/n) σ² < σ².
Hence T is not unbiased: it will underestimate σ² on average.
But
S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)² = (n/(n − 1)) T
so the sample variance satisfies
E(S²) = (n/(n − 1)) E(T) = σ².
So 𝑆 2 is unbiased for 𝜎2 .
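This can be checked by simulation, as in the Python sketch below (the values n = 5, σ² = 4 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2, reps = 5, 4.0, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
devs = samples - samples.mean(axis=1, keepdims=True)

T = (devs**2).mean(axis=1)            # MLE of sigma^2: divide by n
S2 = (devs**2).sum(axis=1) / (n - 1)  # sample variance: divide by n - 1

# E(T) should be near (n-1) sigma^2 / n = 3.2, and E(S^2) near sigma^2 = 4
print(T.mean(), S2.mean())
```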
Example (Uniform distribution – some unusual features!). Suppose X_1, . . . , X_n ∼iid Uniform[0, θ], where θ > 0, i.e.
f(x; θ) = 1/θ if 0 ⩽ x ⩽ θ, and 0 otherwise.
What is the MLE for 𝜃? Is the MLE unbiased?
Calculate the likelihood:
L(θ) = Π_{i=1}^n f(x_i; θ)
     = 1/θ^n if 0 ⩽ x_i ⩽ θ for all i, and 0 otherwise
     = 0 if 0 < θ < max x_i, and 1/θ^n if θ ⩾ max x_i.
Note: θ ⩾ x_i for all i ⟺ θ ⩾ max x_i. (And max x_i means max_{1⩽i⩽n} x_i.)
[Sketch: the likelihood L(θ) is 0 for θ < max x_i, then jumps to its maximum at θ = max x_i and decreases like 1/θ^n thereafter.]
So the maximum likelihood estimate is θ̂(x) = max x_i. Is the MLE θ̂ = max X_i unbiased? First find the c.d.f. of θ̂:
F(y) = P(θ̂ ⩽ y)
     = P(max X_i ⩽ y)
     = P(X_1 ⩽ y, X_2 ⩽ y, . . . , X_n ⩽ y)
     = P(X_1 ⩽ y) P(X_2 ⩽ y) · · · P(X_n ⩽ y)   since the X_i are independent
     = (y/θ)^n if 0 ⩽ y ⩽ θ, and 1 if y > θ.
So θ̂ has p.d.f.
f(y) = n y^{n−1}/θ^n,   0 ⩽ y ⩽ θ.
So
E(θ̂) = ∫_0^θ y · (n y^{n−1}/θ^n) dy = (n/θ^n) ∫_0^θ y^n dy = nθ/(n + 1).
So θ̂ is not unbiased. But note that it is asymptotically unbiased: E(θ̂) → θ as n → ∞.

Definition. The bias of an estimator T = T(X) of θ is b(T) = E(T) − θ, and the mean squared error (MSE) of T is MSE(T) = E[(T − θ)²] = var(T) + [b(T)]².
Note.
1. Both MSE(𝑇) and 𝑏(𝑇) may depend on 𝜃.
2. MSE is a measure of the “distance” between 𝑇 and 𝜃, so is a good overall measure of
performance.
3. 𝑇 is unbiased if 𝑏(𝑇) = 0 for all 𝜃.
So an estimator with small MSE needs to have small variance and small bias. Unbiasedness
alone is not particularly desirable – it is the combination of small variance and small bias
which is important.
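A simulation sketch in Python (θ = 1, n = 10, both illustrative) showing the bias of θ̂ = max X_i and the decomposition MSE = var + bias²:

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 1.0, 10, 200_000

samples = rng.uniform(0.0, theta, size=(reps, n))
theta_hat = samples.max(axis=1)  # the MLE, max X_i, for each simulated sample

bias = theta_hat.mean() - theta        # should be near -theta/(n+1) = -1/11
mse = ((theta_hat - theta)**2).mean()  # Monte Carlo estimate of the MSE
print(bias, mse, theta_hat.var() + bias**2)  # mse equals var + bias^2
```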
Reminder. If X̄ is the mean of a random sample X_1, . . . , X_n with E(X_i) = μ and var(X_i) = σ², then
E(X̄) = μ and var(X̄) = σ²/n.
Example (Uniform distribution). Suppose X_1, . . . , X_n ∼iid Uniform[0, θ], i.e.
f(x; θ) = 1/θ if 0 ⩽ x ⩽ θ, and 0 otherwise.
We will consider two estimators of 𝜃:
• 𝑇 = 2𝑋, the natural estimator based on the sample mean (because the mean of the
distribution is 𝜃/2)
• θ̂ = max X_i, the MLE.
Since E(T) = 2E(X̄) = 2 · θ/2 = θ, the estimator T is unbiased, so
MSE(T) = var(T) = 4 var(X̄) = 4 var(X_1)/n.
We have E(X_1) = θ/2 and
E(X_1²) = ∫_0^θ x² · (1/θ) dx = θ²/3
so
var(X_1) = θ²/3 − (θ/2)² = θ²/12
hence
MSE(T) = 4 var(X_1)/n = θ²/(3n).
Turning to θ̂ = max X_i: its p.d.f. is
f(y) = n y^{n−1}/θ^n,   0 ⩽ y ⩽ θ
and E(θ̂) = nθ/(n + 1). So b(θ̂) = nθ/(n + 1) − θ = −θ/(n + 1). Also,
E(θ̂²) = ∫_0^θ y² · (n y^{n−1}/θ^n) dy = nθ²/(n + 2)
so
var(θ̂) = θ²(n/(n + 2) − n²/(n + 1)²) = nθ²/((n + 1)²(n + 2))
hence
MSE(θ̂) = var(θ̂) + [b(θ̂)]²
        = 2θ²/((n + 1)(n + 2))
        < θ²/(3n) = MSE(T)   for n ⩾ 3.
Consider also the bias-corrected estimator ((n + 1)/n)θ̂, which is unbiased since E(θ̂) = nθ/(n + 1). Its MSE is
MSE(((n + 1)/n)θ̂) = var(((n + 1)/n)θ̂)
                  = ((n + 1)²/n²) var(θ̂)
                  = θ²/(n(n + 2))
                  < MSE(θ̂) for n ⩾ 2.
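The three MSE formulas can be compared by simulation; a Python sketch with θ = 1, n = 10 (illustrative values):

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 1.0, 10, 400_000

samples = rng.uniform(0.0, theta, size=(reps, n))
max_x = samples.max(axis=1)

T = 2 * samples.mean(axis=1)     # 2 * sample mean, unbiased
mle = max_x                      # the MLE
adj = (n + 1) / n * max_x        # bias-corrected MLE

mse_T = ((T - theta)**2).mean()      # theory: theta^2/(3n)           ~ 0.0333
mse_mle = ((mle - theta)**2).mean()  # theory: 2 theta^2/((n+1)(n+2)) ~ 0.0152
mse_adj = ((adj - theta)**2).mean()  # theory: theta^2/(n(n+2))       ~ 0.0083
print(mse_T, mse_mle, mse_adj)
```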
Estimation so far
So far we've considered getting an estimate which is a single number – a point estimate – for the parameter of interest: e.g. x̄, max x_i, s², . . . .
Maximum likelihood is a good way (usually) of producing an estimate (but we did better
when the range of the distribution depends on 𝜃 – fairly unusual).
MLEs are usually asymptotically unbiased, and have MSE decreasing like 1/𝑛 for
large 𝑛.
MLEs can be found in quite general situations.
5 Accuracy of estimation: Confidence Intervals
A crucial aspect of statistics is not just to estimate a quantity of interest, but to assess how
accurate or precise that estimate is. One approach is to find an interval, called a confidence interval (CI), within which we think the true parameter falls.
Theorem 5.1. Suppose a_1, . . . , a_n are constants and that X_1, . . . , X_n are independent with X_i ∼ N(μ_i, σ_i²). Let Y = Σ_{i=1}^n a_i X_i. Then
Y ∼ N(Σ_{i=1}^n a_i μ_i, Σ_{i=1}^n a_i² σ_i²).
Suppose X_1, . . . , X_n ∼iid N(μ, σ_0²) where σ_0² is known. Then, by Theorem 5.1, X̄ ∼ N(μ, σ_0²/n). So, standardising X̄,
X̄ − μ ∼ N(0, σ_0²/n)
(X̄ − μ)/(σ_0/√n) ∼ N(0, 1).
Figure 5.1. Standard normal p.d.f.: the shaded area, i.e. the area under the curve from 𝑥 = −1.96 to
𝑥 = 1.96, is 0.95.
So
P(−1.96 < (X̄ − μ)/(σ_0/√n) < 1.96) = 0.95
P(−1.96 σ_0/√n < X̄ − μ < 1.96 σ_0/√n) = 0.95
P(X̄ − 1.96 σ_0/√n < μ < X̄ + 1.96 σ_0/√n) = 0.95
P(the interval X̄ ± 1.96 σ_0/√n contains μ) = 0.95.
That interval is
(X̄ − 1.96 σ_0/√n, X̄ + 1.96 σ_0/√n).
This interval is a random interval since its endpoints involve 𝑋 (and 𝑋 is a random variable).
It is an example of a confidence interval.
Definition. If a(X) and b(X) are two statistics, and 0 < α < 1, the interval (a(X), b(X)) is called a confidence interval for θ with confidence level 1 − α if, for all θ,
P(a(X) < θ < b(X)) = 1 − α.
The interval (a(X), b(X)) is also called a 100(1 − α)% CI, e.g. a "95% confidence interval" if α = 0.05.
Usually we are interested in small values of 𝛼: the most commonly used values are 0.05
and 0.01 (i.e. confidence levels of 95% and 99%) but there is nothing special about any
confidence level.
The interval (a(x), b(x)) is called an interval estimate and the random interval (a(X), b(X)) is called an interval estimator.
For any 𝛼 with 0 < 𝛼 < 1, let 𝑧 𝛼 be the constant such that Φ(𝑧 𝛼 ) = 1 − 𝛼, where Φ is
the 𝑁(0, 1) c.d.f. (i.e. if 𝑍 ∼ 𝑁(0, 1) then 𝑃(𝑍 > 𝑧 𝛼 ) = 𝛼).
Figure 5.2. Standard normal p.d.f.: the shaded area, i.e. the area under the curve to the right of 𝑧 𝛼 ,
is 𝛼.
By the same argument as before, if X_1, . . . , X_n ∼iid N(μ, σ_0²) with σ_0² known, then a level 1 − α confidence interval for μ is
X̄ ± z_{α/2} σ_0/√n.
Example (See slides on rainfall example). Measure annual rainfall amounts for 20 years
(i.e. 𝑛 = 20). Suppose 𝑥 = 688.4 and 𝜎0 = 130.
Then the endpoints of a 95% CI for μ are
x̄ ± 1.96 σ_0/√n = 688.4 ± 57.0
so a 95% CI for μ is (631.4, 745.4).
The endpoints of a 99% CI are x̄ ± 2.58 σ_0/√n, giving a 99% CI for μ of (613.4, 763.4).
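The arithmetic of the rainfall intervals, as a Python sketch (1.96 and 2.58 are the approximate normal quantiles used in the text):

```python
from math import sqrt

xbar, sigma0, n = 688.4, 130.0, 20
half95 = 1.96 * sigma0 / sqrt(n)  # half-width of the 95% CI
half99 = 2.58 * sigma0 / sqrt(n)  # half-width of the 99% CI

print((xbar - half95, xbar + half95))  # about (631.4, 745.4)
print((xbar - half99, xbar + half99))  # about (613.4, 763.4)
```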
The symmetric confidence interval for 𝜇 above is called a central confidence interval
for 𝜇.
Figure 5.3. Standard normal p.d.f.: the shaded area, i.e. the area under the curve from −𝑐 to 𝑑, is
1 − 𝛼.
More generally, suppose we choose constants c and d with P(−c < (X̄ − μ)/(σ_0/√n) < d) = 1 − α. Then
P(X̄ − dσ_0/√n < μ < X̄ + cσ_0/√n) = 1 − α.
The choice c = d = z_{α/2} gives the shortest such interval.
Since
P((X̄ − μ)/(σ_0/√n) > −z_α) = 1 − α
we have
P(μ < X̄ + z_α σ_0/√n) = 1 − α
and so (−∞, X̄ + z_α σ_0/√n) is a "one-sided" confidence interval. We call X̄ + z_α σ_0/√n an upper 1 − α confidence limit for μ.
Similarly
P(μ > X̄ − z_α σ_0/√n) = 1 − α
and X̄ − z_α σ_0/√n is a lower 1 − α confidence limit for μ.
• Interpretation: if we imagine repeating the whole procedure many times, each time collecting new data and constructing a new interval, then using this procedure repeatedly we would "catch" the true parameter value about 95% of the time, for a 95% confidence interval: i.e. about 95% of our intervals would contain θ.
• The confidence level is a coverage probability, the probability that the random confidence interval (a(X), b(X)) covers the true θ. (It's a random interval because the endpoints a(X) and b(X) are random variables.)
• Once the data are observed, the interval estimate (a(x), b(x)) is a fixed interval, e.g. (631.4, 745.4) in the rainfall example. So it is wrong to say that (a(x), b(x)) contains θ with probability 1 − α: this interval either definitely does or definitely does not contain θ, but we can't say which of these two possibilities is true as θ is unknown.
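The coverage interpretation can be demonstrated by simulation; in this Python sketch the true μ and σ_0 echo the rainfall example (any values would do):

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma0, n, reps = 688.4, 130.0, 20, 100_000

samples = rng.normal(mu, sigma0, size=(reps, n))
xbar = samples.mean(axis=1)
half = 1.96 * sigma0 / np.sqrt(n)

# Fraction of the random intervals xbar +/- half that contain the true mu
covered = (xbar - half < mu) & (mu < xbar + half)
print(covered.mean())  # close to 0.95
```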
For normal samples we have, exactly,
(X̄ − μ)/(σ/√n) ∼ N(0, 1).
The central limit theorem (CLT) says this is approximately true for large n, whatever the distribution of the X_i.
Theorem 5.2. Let X_1, . . . , X_n be i.i.d. from any distribution with mean μ and variance σ² ∈ (0, ∞). Then, for all x,
P((X̄ − μ)/(σ/√n) ⩽ x) → Φ(x) as n → ∞.
Here, as usual, Φ is the N(0, 1) c.d.f. So for large n, whatever the distribution of the X_i,
(X̄ − μ)/(σ/√n) ≈ N(0, 1)
where ≈ means "has approximately the same distribution as." Usually n > 30 is ok for a reasonable approximation.
With X_1, X_2, . . . i.i.d. from any distribution with E(X_i) = μ, var(X_i) = σ²:
• the weak law of large numbers (Prelims Probability) tells us that the distribution of X̄ concentrates around μ as n becomes large, i.e. for ε > 0, we have P(|X̄ − μ| > ε) → 0 as n → ∞
• the CLT adds to this:
  – the fluctuations of X̄ around μ are of order 1/√n
  – the asymptotic distribution of these fluctuations is normal.
Example. Suppose X_1, . . . , X_n ∼iid exponential with mean μ, e.g. X_i = survival time of patient i. So
f(x; μ) = (1/μ) e^{−x/μ},   x ⩾ 0
and E(X_i) = μ, var(X_i) = μ².
For large n, by the CLT,
(X̄ − μ)/(μ/√n) ≈ N(0, 1).
So
P(−z_{α/2} < (X̄ − μ)/(μ/√n) < z_{α/2}) ≈ 1 − α
P(μ(1 − z_{α/2}/√n) < X̄ < μ(1 + z_{α/2}/√n)) ≈ 1 − α
P(X̄/(1 + z_{α/2}/√n) < μ < X̄/(1 − z_{α/2}/√n)) ≈ 1 − α.
So an approximate 1 − α confidence interval for μ is
(X̄/(1 + z_{α/2}/√n), X̄/(1 − z_{α/2}/√n)).
Numerically, if we have n = 100 patients and α = 0.05 (so z_{α/2} = 1.96), then the interval is
(0.84x̄, 1.24x̄).
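The endpoints 0.84 and 1.24 come directly out of the formula; a Python sketch of the arithmetic:

```python
from math import sqrt

z, n = 1.96, 100
xbar = 1.0  # work in units of the observed sample mean

lower = xbar / (1 + z / sqrt(n))
upper = xbar / (1 - z / sqrt(n))
print(lower, upper)  # about (0.84, 1.24) times xbar
```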
Example (Opinion poll). In an opinion poll, suppose 321 of 1003 voters said they would vote for Party X. What's the underlying level of support for Party X?
Let X_1, . . . , X_n be a random sample from the Bernoulli(p) distribution, i.e.
P(X_i = 1) = p,  P(X_i = 0) = 1 − p.
Then E(X_i) = p and var(X_i) = p(1 − p); write σ(p) = √(p(1 − p)). For large n, by the CLT,
(X̄ − p)/(σ(p)/√n) ≈ N(0, 1).
So
1 − α ≈ P(−z_{α/2} < (X̄ − p)/(σ(p)/√n) < z_{α/2})
      = P(X̄ − z_{α/2} σ(p)/√n < p < X̄ + z_{α/2} σ(p)/√n).
The interval X̄ ± z_{α/2} σ(p)/√n has approximate probability 1 − α of containing the true p, but it is not a confidence interval since its endpoints depend on p via σ(p).
To get an approximate confidence interval:
• either, solve the inequality to get 𝑃 𝑎(X) < 𝑝 < 𝑏(X) ≈ 1 − 𝛼 where 𝑎(X), 𝑏(X) don’t
depend on 𝑝
• or, estimate σ(p) by σ(p̂) = √(p̂(1 − p̂)) where p̂ = x̄ is the MLE. This gives endpoints
p̂ ± z_{α/2} √(p̂(1 − p̂)/n).
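Applying the plug-in interval to the poll numbers from the example, as a Python sketch:

```python
from math import sqrt

r, n = 321, 1003  # 321 of 1003 voters support Party X
p_hat = r / n
se = sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error

z = 1.96
ci = (p_hat - z * se, p_hat + z * se)
print(p_hat, ci)  # support about 0.32, CI roughly (0.29, 0.35)
```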
Example (Opinion polls continued). Opinion polls often mention “±3% error.”
Note that for any p,
σ²(p) = p(1 − p) ⩽ 1/4
since p(1 − p) is maximised at p = 1/2. Then we have
P(p̂ − 0.03 < p < p̂ + 0.03) = P(−0.03/(σ(p)/√n) < (p̂ − p)/(σ(p)/√n) < 0.03/(σ(p)/√n))
≈ Φ(0.03/(σ(p)/√n)) − Φ(−0.03/(σ(p)/√n))
⩾ Φ(0.03√(4n)) − Φ(−0.03√(4n)).
For this probability to be at least 0.95 we need 0.03√(4n) ⩾ 1.96, or n ⩾ 1068. Opinion polls typically use n ≈ 1100.
Standard errors
Definition. Let θ̂ be an estimator of θ based on X. The standard error SE(θ̂) of θ̂ is defined by
SE(θ̂) = √(var(θ̂)).
Example.
• Let X_1, . . . , X_n ∼iid N(μ, σ²). Then μ̂ = X̄ and var(μ̂) = σ²/n. So SE(μ̂) = σ/√n.
• Let X_1, . . . , X_n ∼iid Bernoulli(p). Then p̂ = X̄ and var(p̂) = p(1 − p)/n. So SE(p̂) = √(p(1 − p)/n).
Sometimes SE(θ̂) depends on θ itself, meaning that SE(θ̂) is unknown. In such cases we have to plug in parameter estimates to get the estimated standard error. E.g. plugging in gives the estimated standard errors SE(X̄) = σ̂/√n and SE(p̂) = √(p̂(1 − p̂)/n).
The values plugged in (σ̂ and p̂ above) could be maximum likelihood, or other, estimates.
(We could write ŜE(p̂), i.e. with a hat on the SE, to denote that p̂ has been plugged in, but this is ugly so we won't; we'll just write SE(p̂).)
If θ̂ is unbiased, then MSE(θ̂) = var(θ̂) = [SE(θ̂)]². So the standard error (or estimated standard error) gives some quantification of the accuracy of estimation.
If in addition θ̂ is approximately N(θ, SE(θ̂)²) then, by the arguments used above, an approximate 1 − α CI for θ is given by θ̂ ± z_{α/2} SE(θ̂), where again we might need to plug in to obtain the estimated standard error. Since, roughly, z_{0.025} = 2 and z_{0.001} = 3, the intervals θ̂ ± 2 SE(θ̂) and θ̂ ± 3 SE(θ̂) are approximate 95% and 99.8% CIs for θ.
Example. Suppose X_1, . . . , X_n ∼iid N(μ, σ²) with μ and σ² unknown.
The MLEs are μ̂ = X̄ and σ̂² = (1/n) Σ (X_i − X̄)², and SE(μ̂) = σ/√n is unknown because σ is unknown. So to use μ̂ ± z_{α/2} SE(μ̂) as the basis for a confidence interval we need to estimate σ. One possibility is to use σ̂ and so get the interval
μ̂ ± z_{α/2} σ̂/√n.
6 Linear Regression
Suppose we measure two variables in the same population:
𝑥 the explanatory variable
𝑦 the response variable.
Other possible names for 𝑥 are the predictor or feature or input variable or independent variable.
Other possible names for 𝑦 are the output variable or dependent variable.
Example. Let 𝑥 measure the GDP per head, and 𝑦 the CO2 emissions per head, for various
countries.
[Scatterplot: GDP per head on the x-axis, CO2 emissions per head on the y-axis.]
Figure 6.1. Scatterplot of measures of CO2 emissions per head against GDP per head, for 178
countries.
Questions of interest:
For fixed 𝑥, what is the average value of 𝑦?
How does that average value change with 𝑥?
A simple model for the dependence of 𝑦 on 𝑥 is
𝑦 = 𝛼 + 𝛽𝑥 + "error".
Note: a linear relationship like this does not necessarily imply that 𝑥 causes 𝑦.
We regard the values of 𝑥 as being fixed and known, and we regard the values of 𝑦 as
being the observed values of random variables.
We suppose that
𝑌𝑖 = 𝛼 + 𝛽𝑥 𝑖 + 𝜖 𝑖 , 𝑖 = 1, . . . , 𝑛 (6.1)
where the ε_i are i.i.d. N(0, σ²) random errors. The "random errors" ε_1, . . . , ε_n represent random scatter of the points (x_i, y_i) about the line y = α + βx: we do not expect these points to lie on a perfect straight line.
Sometimes we will refer to the above model as being 𝑌 = 𝛼 + 𝛽𝑥 + 𝜖.
Note.
1. The values 𝑦1 , . . . , 𝑦𝑛 (e.g. the CO2 emissions per head in the various countries) are
the observed values of random variables 𝑌1 , . . . , 𝑌𝑛 .
2. The values 𝑥 1 , . . . , 𝑥 𝑛 (e.g. the GDP per head in the various countries) do not
correspond to random variables. They are fixed and known constants.
Questions:
• How do we estimate 𝛼 and 𝛽?
• Does the mean of 𝑌 actually depend on the value of 𝑥? i.e. is 𝛽 ≠ 0?
We now find the MLEs of 𝛼 and 𝛽, and we regard 𝜎2 as being known. The MLEs of 𝛼 and 𝛽
are the same if 𝜎2 is unknown. If 𝜎2 is unknown, then we simply maximise over 𝜎2 as well
to obtain its MLE – this is no harder than what we do here (try it!). However, working out
all of the properties of this MLE is harder and beyond what we can do in this course.
From (6.1) we have Y_i ∼ N(α + βx_i, σ²). So Y_i has p.d.f.
f_{Y_i}(y_i) = (1/√(2πσ²)) exp(−(y_i − α − βx_i)²/(2σ²)),   −∞ < y_i < ∞.
Maximising the likelihood over (α, β) is therefore equivalent to minimising the sum of squared vertical deviations S(α, β) = Σ_{i=1}^n (y_i − α − βx_i)². For this reason the MLEs of α and β are also called least squares estimators.
Figure 6.2. Vertical distances from the points (𝑥 𝑖 , 𝑦 𝑖 ) to the line 𝑦 = 𝛼 + 𝛽𝑥. 𝑆(𝛼, 𝛽) is the sum of
these squared distances. The MLEs/least squares estimates of 𝛼 and 𝛽 minimise this sum.
Theorem 6.1. The MLEs (or, equivalently, the least squares estimates) of α and β are given by
α̂ = ((Σ x_i²)(Σ y_i) − (Σ x_i)(Σ x_i y_i)) / (n Σ x_i² − (Σ x_i)²)
β̂ = (n Σ x_i y_i − (Σ x_i)(Σ y_i)) / (n Σ x_i² − (Σ x_i)²).
The sums in Theorem 6.1, and similar sums below, are from 𝑖 = 1 to 𝑛.
Proof. To find α̂ and β̂ we calculate
∂S/∂α = −2 Σ (y_i − α − βx_i)
∂S/∂β = −2 Σ x_i (y_i − α − βx_i).
Setting these to zero and solving the resulting pair of linear equations in α and β gives the stated expressions.
Consider instead the model
Y_i = a + b(x_i − x̄) + ε_i,   i = 1, . . . , n
and find the MLEs of a and b by minimising Σ (y_i − a − b(x_i − x̄))². This model is just an
alternative parametrisation of our original model: the first model is
𝑌 = 𝛼 + 𝛽𝑥 + 𝜖
and the second is
Y = a + b(x − x̄) + ε = (a − bx̄) + bx + ε.
Here Y, x denote general values of Y, x (and x̄ = (1/n) Σ x_i is the mean of the n data values of x). Comparing the two model equations, b = β and a − bx̄ = α.
The interpretation of the parameters is that β = b is the increase in E(Y) when x increases by 1. The parameter α is the value of E(Y) when x is 0; whereas a is the value of E(Y) when x is x̄.
• Alternative expressions for α̂ and β̂ are
β̂ = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)²   (6.2)
  = Σ (x_i − x̄)y_i / Σ (x_i − x̄)²   (6.3)
α̂ = ȳ − β̂x̄.
The above alternative for α̂ follows directly from ∂S/∂α = 0.
• Dividing the numerator and denominator of β̂ in Theorem 6.1 by n,
β̂ = (Σ x_i y_i − (1/n)(Σ x_i)(Σ y_i)) / (Σ x_i² − (1/n)(Σ x_i)²)
  = (Σ x_i y_i − n x̄ ȳ) / (Σ x_i² − n x̄²).   (6.4)
Now check that the numerators and denominators in (6.2) and (6.4) are the same.
Then observe that the numerators of (6.2) and (6.3) differ by Σ (x_i − x̄) ȳ, which is 0.
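A numerical check, as a Python sketch on synthetic data (the true values α = 2, β = 1.5 are hypothetical), that formula (6.2) and the Theorem 6.1 expressions agree:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=n)  # hypothetical alpha = 2, beta = 1.5

xbar, ybar = x.mean(), y.mean()
# Formula (6.2), with alpha_hat = ybar - beta_hat * xbar
beta_hat = ((x - xbar) * (y - ybar)).sum() / ((x - xbar)**2).sum()
alpha_hat = ybar - beta_hat * xbar

# Theorem 6.1 expressions
denom = n * (x**2).sum() - x.sum()**2
beta_thm = (n * (x * y).sum() - x.sum() * y.sum()) / denom
alpha_thm = ((x**2).sum() * y.sum() - x.sum() * (x * y).sum()) / denom
print(alpha_hat, beta_hat)
```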
Unbiasedness. Write w_i = x_i − x̄, so that Σ w_i = 0. From (6.3), regarded as an estimator,
β̂ = (1/Σ w_i²) Σ w_i Y_i
so
E(β̂) = (1/Σ w_i²) E(Σ w_i Y_i)
     = (1/Σ w_i²) Σ w_i E(Y_i)
     = (1/Σ w_i²) Σ w_i (α + βx_i)
     = β   since Σ w_i = 0 and Σ w_i x_i = Σ w_i².
Also α̂ = Ȳ − β̂x̄ and so
E(α̂) = E(Ȳ) − x̄ E(β̂)
     = (1/n) Σ E(Y_i) − βx̄   since E(β̂) = β
     = (1/n) Σ (α + βx_i) − βx̄
     = (1/n) · n(α + βx̄) − βx̄
     = α.
So α̂ and β̂ are unbiased.
Note the unbiasedness of α̂, β̂ does not depend on the assumptions that the ε_i are independent, normal and have the same variance, only on the assumptions that the errors are additive and E(ε_i) = 0.
Variance
Using var(Σ a_i U_i) = Σ a_i² var(U_i) for independent U_i,
var(β̂) = var((1/Σ w_i²) Σ w_i Y_i)
       = (1/(Σ w_i²)²) Σ w_i² var(Y_i)
       = (1/(Σ w_i²)²) Σ w_i² σ²
       = σ²/Σ w_i².
Since β̂ is a linear combination of the independent normal random variables Y_i, the estimator β̂ is itself normal: β̂ ∼ N(β, σ_β²) where σ_β² = σ²/Σ w_i². So a 95% confidence interval for β is
(β̂ ± 1.96 σ_β).
However, this is only a valid CI when 𝜎2 is known and, in practice, 𝜎2 is rarely known.
For σ² unknown we need to plug in an estimate of σ², i.e. use σ̂_β² = σ̂²/Σ w_i² where σ̂² is some estimate of σ². For example we could use the MLE, which is σ̂² = (1/n) Σ (y_i − α̂ − β̂x_i)².
Using the θ̂ ± 2 SE(θ̂) approximation for a 95% confidence interval, we have that (β̂ ± 2σ̂_β) is an approximate 95% confidence interval for β.
[A better approach here, but beyond the scope of this course, is to estimate σ² using
s² = (1/(n − 2)) Σ (y_i − α̂ − β̂x_i)²
and to base the confidence interval on a t-distribution rather than a normal distribution. This estimator S² is unbiased for σ² (see Sheet 5), but details about its distribution and the t-distribution are beyond this course – see Parts A/B.]
7 Multiple Linear Regression
For the material in the last two sections of these notes (this section and the next), Chapter 3
of the book An Introduction to Statistical Learning by James, Witten, Hastie and Tibshirani
(2013) is useful – we cover some, but not all, of the material in that chapter. The whole
book is freely available online at:
http://www-bcf.usc.edu/~gareth/ISL/
We’ll refer to this book as JWHT (2013).
Example (hill races). Suppose we model the time of a hill race in terms of the race's distance and its total climb:
Y = β_0 + β_1 x_1 + β_2 x_2 + ε.
This model has one response variable Y (as usual), but now we have two explanatory variables x_1 and x_2, and three regression parameters β_0, β_1, β_2.
Let the 𝑖th race have time 𝑦 𝑖 , distance 𝑥 𝑖1 and climb 𝑥 𝑖2 . Then in more detail our model
is
Y_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + ε_i,   i = 1, . . . , n
where the ε_i are i.i.d. N(0, σ²), and (as usual) y_i denotes the observed value of the random variable Y_i.
As for simple linear regression we obtain the MLEs/least squares estimates of 𝛽0 , 𝛽 1 , 𝛽 2 by
minimising
S(β) = Σ_{i=1}^n (y_i − β_0 − β_1 x_{i1} − β_2 x_{i2})²
with respect to β_0, β_1, β_2, i.e. by solving ∂S/∂β_k = 0 for k = 0, 1, 2.
As before, the only property of the 𝜖 𝑖 needed to define the least squares estimates is
𝐸(𝜖 𝑖 ) = 0.
A general multiple regression model has p explanatory variables (x_1, . . . , x_p):
Y = β_0 + β_1 x_1 + · · · + β_p x_p + ε.
Example (Quadratic regression). The relationship between Y and x may be approximately quadratic, in which case we can consider the model Y = β_0 + β_1 x + β_2 x² + ε. This is the case p = 2 with x_1 = x and x_2 = x².
Example. For convenience write the two explanatory variables as x and z. So suppose
Y_i = β_0 + β_1 x_i + β_2 z_i + ε_i,   i = 1, . . . , n
where ε_i ∼iid N(0, σ²), and assume Σ x_i = Σ z_i = 0. The least squares estimates work out to be
β̂_0 = (Σ y_i)/n
β̂_1 = (1/Δ)(Σ z_i² Σ x_i y_i − Σ x_i z_i Σ z_i y_i)
β̂_2 = (1/Δ)(Σ x_i² Σ z_i y_i − Σ x_i z_i Σ x_i y_i)
where
Δ = Σ x_i² Σ z_i² − (Σ x_i z_i)².
The method (solving ∂S/∂β_k = 0 for k = 0, 1, 2, i.e. 3 equations in 3 unknowns) is straightforward, the algebra less so – there are more elegant ways to do some of this (using matrices, in 3rd year).
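A numerical check of these formulas on synthetic centred data, as a Python sketch (the coefficients 1.0, 2.0, −0.5 and the noise level are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n); x -= x.mean()  # centre so that sum(x_i) = 0
z = rng.normal(size=n); z -= z.mean()  # centre so that sum(z_i) = 0
y = 1.0 + 2.0 * x - 0.5 * z + rng.normal(0, 0.3, size=n)

Sxx, Szz, Sxz = (x**2).sum(), (z**2).sum(), (x * z).sum()
Sxy, Szy = (x * y).sum(), (z * y).sum()
Delta = Sxx * Szz - Sxz**2

b0 = y.mean()
b1 = (Szz * Sxy - Sxz * Szy) / Delta
b2 = (Sxx * Szy - Sxz * Sxy) / Delta

# Cross-check by solving the three normal equations directly
X = np.column_stack([np.ones(n), x, z])
coef = np.linalg.solve(X.T @ X, X.T @ y)
print(b0, b1, b2, coef)
```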
Interpretation. Consider again the model
Y = β_0 + β_1 x_1 + β_2 x_2 + ε.
We interpret β_1 as the average effect on Y of a one unit increase in x_1, holding the other regressor x_2 fixed.
Similarly we interpret 𝛽2 as the average effect on 𝑌 of a one unit increase in 𝑥2 , holding the
other regressor 𝑥 1 fixed.
In these interpretations “average” means we are talking about the change in 𝐸(𝑌) when
changing 𝑥 1 (or 𝑥2 ).
One important thing to note here is that x_1 and x_2 often change together. E.g. in the hill races example, a race whose distance is one mile greater will usually have an increased value of climb as well. This makes interpretation more difficult.
8 Assessing the fit of a model
Having fitted a model, we should consider how well it fits the data. A model is normally
an approximation to reality: is the approximation sufficiently good that the model is
useful? This question applies to mathematical models in general. In this course we will
approach the question by considering the fit of a simple linear regression (generalisations
are possible).
For the model 𝑌 = 𝛼 + 𝛽𝑥 + 𝜖 let b 𝛼, b
𝛽 be the usual estimates of 𝛼, 𝛽 based on the observation
pairs (𝑥 1 , 𝑦1 ), . . . , (𝑥 𝑛 , 𝑦𝑛 ).
From now on we consider this model, with the usual assumptions about 𝜖, unless otherwise
stated.
Write ŷ_i = α̂ + β̂x_i for the i-th fitted value and e_i = y_i − ŷ_i for the i-th residual; the residual standard error (RSE) is an estimate, formed from the residuals, of the standard deviation σ. If the fitted values are close to the observed values, i.e. ŷ_i ≈ y_i for all i (so that the e_i are small), then the RSE will be small. Alternatively, if one or more of the e_i is large then the RSE will be higher.
We have E(e_i) = 0. In taking this expectation, we treat y_i as the random variable Y_i, and we treat ŷ_i as the random variable α̂ + β̂x_i (in particular, α̂ and β̂ are estimators, not estimates). Hence
E(e_i) = E(Y_i − α̂ − β̂x_i)
       = E(Y_i) − E(α̂) − E(β̂)x_i
       = E(α + βx_i + ε_i) − α − βx_i   since α̂, β̂ are unbiased
       = α + βx_i + E(ε_i) − α − βx_i
       = 0   since E(ε_i) = 0.
8.2 Potential problem: non-constant variance of errors
We have assumed that the errors have a constant variance, i.e. var(𝑌𝑖 ) = var(𝜖 𝑖 ) = 𝜎2 . That
is, the same variance for all 𝑖. Unfortunately, this is often not true. E.g. the variance of the
error may increase as 𝑌 increases.
Non-constant variance is also called heteroscedasticity. We can identify this from the
presence of a funnel-type shape in the residual plot.
How might we deal with non-constant variance of the errors?
(i) One possibility is to transform the response Y using a transformation such as log Y or √Y (which shrinks larger responses more), leading to a reduction in heteroscedasticity.
(ii) Sometimes we might have a good idea about the variance of Y_i: we might think var(Y_i) = var(ε_i) = σ²/w_i where σ² is unknown but where the w_i are known. E.g. if Y_i is actually the mean of n_i observations, where each of these n_i observations is made at x = x_i, then var(Y_i) = σ²/n_i. So w_i = n_i in this case.
It is straightforward to show (exercise) that the MLEs of 𝛼, 𝛽 are obtained by
minimising
Σ_{i=1}^n w_i (y_i − α − βx_i)².   (8.1)
The form of (8.1) is intuitively correct: if 𝑤 𝑖 is small then var(𝑌𝑖 ) is large, so there is a
lot of uncertainty about observation 𝑖, so this observation shouldn’t affect the fit too
much – this is achieved in (8.1) by observation 𝑖 being weighted by the small value
of 𝑤 𝑖 . Hence this approach is called weighted least squares.
[Here we follow the terminology in the synopses and JWHT (2013) and call the 𝑟 𝑖
“studentized” residuals. Some authors call the 𝑟 𝑖 “standardized” residuals and save the
word “studentized” to mean something similar but different.]
So the r_i are all on a comparable scale, each having a standard deviation of about 1. Following JWHT (2013) we will say that observations with |r_i| > 3 are possible outliers.
If we believe an outlier is due to an error in data collection, then one solution is to simply
remove the observation from the data and re-fit the model. However, an outlier may
instead indicate a problem with the model, e.g. a nonlinear relationship between 𝑌 and 𝑥,
so care must be taken.
Similarly, this kind of problem could arise if we have a missing regressor, i.e. we could be
using 𝑌 = 𝛽0 + 𝛽 1 𝑥 1 + 𝜖 when we should really be using 𝑌 = 𝛽 0 + 𝛽1 𝑥 1 + 𝛽 2 𝑥 2 + 𝜖.
since the 𝑌𝑗 are independent with var(𝑌𝑗 ) = 𝜎2 for all 𝑗. Here and below, all sums are from
1 to n.
First recall
β̂ = Σ_j (x_j − x̄)Y_j / S_{xx}   where S_{xx} = Σ_k (x_k − x̄)²
and
α̂ = Ȳ − β̂x̄
  = (1/n) Σ_j Y_j − x̄ Σ_j (x_j − x̄)Y_j / S_{xx}
  = Σ_j (1/n − x̄(x_j − x̄)/S_{xx}) Y_j.
So
ŷ_i = α̂ + β̂x_i
    = Σ_j (1/n − x̄(x_j − x̄)/S_{xx}) Y_j + x_i Σ_j (x_j − x̄)Y_j / S_{xx}
    = Σ_j (1/n + (x_i − x̄)(x_j − x̄)/S_{xx}) Y_j.   (8.3)
We can write
Y_i = Σ_j δ_{ij} Y_j   (8.4)
where δ_{ij} = 1 if i = j, and δ_{ij} = 0 otherwise.
Then e_i = Y_i − α̂ − β̂x_i and, using (8.3) and (8.4),
e_i = Σ_j (δ_{ij} − 1/n − (x_i − x̄)(x_j − x̄)/S_{xx}) Y_j.
Hence, since the Y_j are independent,
var(e_i) = σ² Σ_j (δ_{ij} − 1/n − (x_i − x̄)(x_j − x̄)/S_{xx})²
= σ² Σ_j [δ_{ij}² + 1/n² + (x_i − x̄)²(x_j − x̄)²/S_{xx}² − (2/n)δ_{ij} − 2δ_{ij}(x_i − x̄)(x_j − x̄)/S_{xx} + (2/n)(x_i − x̄)(x_j − x̄)/S_{xx}]
= σ²(1 + 1/n + (x_i − x̄)²/S_{xx} − 2/n − 2(x_i − x̄)²/S_{xx} + (2/n)((x_i − x̄)/S_{xx}) Σ_j (x_j − x̄))
= σ²(1 − 1/n − (x_i − x̄)²/S_{xx})   since Σ_j (x_j − x̄) = 0
= σ²(1 − h_i). □
Here h_i is the leverage of observation i:
h_i = 1/n + (x_i − x̄)²/Σ_{j=1}^n (x_j − x̄)².
Clearly h_i depends only on the values of the x_i (it doesn't depend on the y_i values at all). We see that h_i increases with the distance of x_i from x̄.
High leverage points tend to have a sizeable impact on the regression line. Since
var(𝑒 𝑖 ) = 𝜎2 (1 − ℎ 𝑖 ), a large leverage ℎ 𝑖 will make var(𝑒 𝑖 ) small. And then since 𝐸(𝑒 𝑖 ) = 0,
this means that the regression line will be “pulled” close to the point (𝑥 𝑖 , 𝑦 𝑖 ).
Note that Σ h_i = 2, so the average leverage is h̄ = 2/n. One rule-of-thumb is to regard points with a leverage more than double this as high leverage points, i.e. points with h_i > 4/n.
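A small Python sketch computing leverages for a hypothetical set of x values with one isolated point:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])  # last point is far from the others
n = len(x)
xbar = x.mean()
Sxx = ((x - xbar)**2).sum()

h = 1 / n + (x - xbar)**2 / Sxx  # leverage of each observation
print(h.sum())                   # leverages always sum to 2

high = h > 4 / n                 # rule-of-thumb threshold
print(high)                      # only the isolated point is flagged
```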
Why does this matter? We should be concerned if the regression line is heavily affected by
just a couple of points, because any problems with these points might invalidate the entire
fit. Hence it is important to identify high leverage observations.