
Prelims Statistics and Data Analysis, Lectures 1–10

Christl Donnelly

Trinity Term 2024 (version of 11-04-2024)

In some places these notes will contain more than the lectures and in other places they
will contain less. I strongly encourage you to attend all of the lectures.
These notes are largely based on previous versions of the course, and I would like to make
grateful acknowledgement to Neil Laws, Peter Donnelly and Bernard Silverman. Please
send any comments or corrections to [email protected].

TT 2024 updates:

None so far. If you spot any errors please let me know.

Introduction
Probability: in probability we start with a probability model 𝑃, and we deduce properties
of 𝑃.
E.g. Imagine flipping a coin 5 times. Assuming that flips are fair and independent, what is
the probability of getting 2 heads?

Statistics: in statistics we have data, which we regard as having been generated from some
unknown probability model 𝑃. We want to be able to say some useful things about 𝑃.
E.g. We flip a coin 1000 times and observe 530 heads. Is the coin fair? i.e. is the probability
of heads greater than 1/2? or could it be equal to 1/2?

Precision of estimation

Question: What is the average body temperature in humans?


Collect data: let 𝑥 1 , . . . , 𝑥 𝑛 be the body temperatures of 𝑛 randomly chosen individuals, i.e.
our model is that individuals are chosen independently from the general population.
Then estimate: 𝑥̄ = (1/𝑛) Σᵢ₌₁ⁿ 𝑥ᵢ could be our estimate, i.e. the sample average.
• How precise is 𝑥̄ as an estimate of the population mean?


Figure. Histogram of body temperatures of 130 individuals. A description of the dataset is available
here: https://rdrr.io/cran/UsingR/man/normtemp.html

Relationships between observations

Let 𝑦 𝑖 = price of house in month 𝑥 𝑖 , for 𝑖 = 1, . . . , 𝑛.


• Is it reasonable that 𝑦 𝑖 = 𝛼 + 𝛽𝑥 𝑖 + "error"?
• Is 𝛽 > 0?
• How accurately can we estimate 𝛼 and 𝛽, and how do we estimate them?



Figure. Scatterplot of some Oxford house price data.

Notation and conventions

Lower case letters 𝑥 1 , . . . , 𝑥 𝑛 denote observations: we regard these as the observed values
of random variables 𝑋1 , . . . , 𝑋𝑛 .
Let x = (𝑥 1 , . . . , 𝑥 𝑛 ) and X = (𝑋1 , . . . , 𝑋𝑛 ).
It is convenient to think of 𝑥 𝑖 as
• the observed value of 𝑋𝑖
• or sometimes as a possible value that 𝑋𝑖 can take.
Since 𝑥 𝑖 is a possible value for 𝑋𝑖 we can calculate probabilities, e.g. if 𝑋𝑖 ∼ Poisson(𝜆)
then
𝑃(𝑋ᵢ = 𝑥ᵢ) = 𝑒^{−𝜆} 𝜆^{𝑥ᵢ} / 𝑥ᵢ!,  𝑥ᵢ = 0, 1, . . . .

1 Random Samples
Definition. A random sample of size 𝑛 is a set of random variables 𝑋1 , . . . , 𝑋𝑛 which are
independent and identically distributed (i.i.d.).

Example. Let 𝑋1 , . . . , 𝑋𝑛 be a random sample from a Poisson(𝜆) distribution.


E.g. 𝑋𝑖 = # traffic accidents on St Giles’ in year 𝑖.
We’ll write 𝑓 (x) to denote the joint probability mass function (p.m.f.) of 𝑋1 , . . . , 𝑋𝑛 . Then

𝑓(x) = 𝑃(𝑋₁ = 𝑥₁, 𝑋₂ = 𝑥₂, . . . , 𝑋ₙ = 𝑥ₙ)   (definition of joint p.m.f.)
    = 𝑃(𝑋₁ = 𝑥₁) 𝑃(𝑋₂ = 𝑥₂) · · · 𝑃(𝑋ₙ = 𝑥ₙ)   (by independence)
    = (𝑒^{−𝜆} 𝜆^{𝑥₁}/𝑥₁!) · (𝑒^{−𝜆} 𝜆^{𝑥₂}/𝑥₂!) · · · (𝑒^{−𝜆} 𝜆^{𝑥ₙ}/𝑥ₙ!)   (since each 𝑋ᵢ is Poisson(𝜆))
    = 𝑒^{−𝑛𝜆} 𝜆^{Σᵢ₌₁ⁿ 𝑥ᵢ} / ∏ᵢ₌₁ⁿ 𝑥ᵢ!.
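The product form and the simplified form of this joint p.m.f. can be checked against each other numerically. A Python sketch (the sample values and 𝜆 below are made up for illustration):

```python
import math

def poisson_pmf(x, lam):
    # P(X = x) for X ~ Poisson(lam)
    return math.exp(-lam) * lam ** x / math.factorial(x)

def joint_pmf_product(xs, lam):
    # f(x) = prod_i P(X_i = x_i), using independence
    p = 1.0
    for x in xs:
        p *= poisson_pmf(x, lam)
    return p

def joint_pmf_simplified(xs, lam):
    # f(x) = e^{-n*lam} * lam^{sum x_i} / prod_i x_i!
    n = len(xs)
    return (math.exp(-n * lam) * lam ** sum(xs)
            / math.prod(math.factorial(x) for x in xs))

xs, lam = [2, 0, 3, 1], 1.5   # made-up data and rate
```

The two functions agree to floating-point accuracy, which is exactly the algebraic identity derived above.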

Example. Let 𝑋1 , . . . , 𝑋𝑛 be a random sample from an exponential distribution with


probability density function (p.d.f.) given by

𝑓(𝑥) = (1/𝜇) 𝑒^{−𝑥/𝜇},  𝑥 ⩾ 0.

E.g. 𝑋𝑖 = time until the first breakdown of machine 𝑖 in a factory.


We’ll write 𝑓 (x) to denote the joint p.d.f. of 𝑋1 , . . . , 𝑋𝑛 . Then

𝑓(x) = 𝑓(𝑥₁) · · · 𝑓(𝑥ₙ)   (since the 𝑋ᵢ are independent)
    = ∏ᵢ₌₁ⁿ (1/𝜇) 𝑒^{−𝑥ᵢ/𝜇}
    = 𝜇^{−𝑛} exp(−(1/𝜇) Σᵢ₌₁ⁿ 𝑥ᵢ).

Note.
1. We use 𝑓 to denote a p.m.f. in the first example and to denote a p.d.f. in the second
example. It is convenient to use the same letter (i.e. 𝑓 ) in both the discrete and
continuous cases. (In introductory probability you may often see 𝑝 for p.m.f. and 𝑓
for p.d.f.)
We could write 𝑓_{𝑋ᵢ}(𝑥ᵢ) for the p.m.f./p.d.f. of 𝑋ᵢ, and 𝑓_{𝑋₁,...,𝑋ₙ}(x) for the joint
p.m.f./p.d.f. of 𝑋₁, . . . , 𝑋ₙ. However it is convenient to keep things simpler by
omitting subscripts on 𝑓 .
2. In the second example 𝐸(𝑋𝑖 ) = 𝜇 and we say “𝑋𝑖 has an exponential distribution
with mean 𝜇” (i.e. expectation 𝜇).

Sometimes, and often in probability, we work with “an exponential distribution with
parameter 𝜆” where 𝜆 = 1/𝜇. To change the parameter from 𝜇 to 𝜆 all we do is
replace 𝜇 by 1/𝜆 to get
𝑓 (𝑥) = 𝜆𝑒 −𝜆𝑥 , 𝑥 ⩾ 0.
Sometimes (often in statistics) we parametrise the distribution using 𝜇, sometimes
(often in probability) we parametrise it using 𝜆.

In probability we assume that the parameters 𝜆 and 𝜇 in our two examples are known. In
statistics we wish to estimate 𝜆 and 𝜇 from data.
• What is the best way to estimate them? And what does “best” mean?
• For a given method, how precise is the estimation?

2 Summary Statistics
The expected value of 𝑋, E(𝑋), is also called its mean. This is often denoted 𝜇.
The variance of 𝑋, var(𝑋), is var(𝑋) = 𝐸[(𝑋 − 𝜇)2 ]. This is often denoted 𝜎2 . The standard
deviation of 𝑋 is 𝜎.

Definition. Let 𝑋₁, . . . , 𝑋ₙ be a random sample. The sample mean 𝑋̄ is defined by

𝑋̄ = (1/𝑛) Σᵢ₌₁ⁿ 𝑋ᵢ.

The sample variance 𝑆² is defined by

𝑆² = (1/(𝑛 − 1)) Σᵢ₌₁ⁿ (𝑋ᵢ − 𝑋̄)².

The sample standard deviation is 𝑆 = √(sample variance).

Note.
1. The denominator in the definition of 𝑆 2 is 𝑛 − 1, not 𝑛.
2. 𝑋̄ and 𝑆² are random variables. Their distributions are called the sampling distributions
of 𝑋̄ and 𝑆².
3. Given observations 𝑥₁, . . . , 𝑥ₙ we can compute the observed values 𝑥̄ and 𝑠².
4. The sample mean 𝑥̄ is a summary of the location of the sample.
5. The sample variance 𝑠² (or the sample standard deviation 𝑠) is a summary of the
spread of the sample about 𝑥̄.

The Normal Distribution

The random variable 𝑋 has a normal distribution with mean 𝜇 and variance 𝜎2 , written
𝑋 ∼ 𝑁(𝜇, 𝜎2 ), if the p.d.f. of 𝑋 is

𝑓(𝑥) = (1/√(2𝜋𝜎²)) exp(−(𝑥 − 𝜇)²/(2𝜎²)),  −∞ < 𝑥 < ∞.

Properties: 𝐸(𝑋) = 𝜇, var(𝑋) = 𝜎2 .

Standard Normal Distribution

If 𝜇 = 0 and 𝜎2 = 1, then 𝑋 ∼ 𝑁(0, 1) is said to have a standard normal distribution.


Properties:
• If 𝑍 ∼ 𝑁(0, 1) and 𝑋 = 𝜇 + 𝜎𝑍, then 𝑋 ∼ 𝑁(𝜇, 𝜎2 ).
• If 𝑋 ∼ 𝑁(𝜇, 𝜎2 ) and 𝑍 = (𝑋 − 𝜇)/𝜎, then 𝑍 ∼ 𝑁(0, 1).

• The cumulative distribution function (c.d.f.) of the standard normal distribution is
Φ(𝑥) = ∫_{−∞}^{𝑥} (1/√(2𝜋)) 𝑒^{−𝑢²/2} 𝑑𝑢.
This cannot be written in a closed form, but can be found by numerical integration
to an arbitrary degree of accuracy.
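For instance, Φ can be approximated with a simple trapezoidal rule, truncating the lower limit of the integral at a point where the integrand is negligible. A Python sketch (the step count and truncation point are arbitrary choices):

```python
import math

def phi_numeric(x, lower=-10.0, steps=100_000):
    # Approximate Phi(x) = integral from -infinity to x of (1/sqrt(2*pi)) e^{-u^2/2} du
    # by the trapezoidal rule, truncating the lower limit at `lower`.
    h = (x - lower) / steps
    f = lambda u: math.exp(-u * u / 2) / math.sqrt(2 * math.pi)
    total = 0.5 * (f(lower) + f(x))
    for i in range(1, steps):
        total += f(lower + i * h)
    return total * h

def phi_erf(x):
    # closed-form check via the error function: Phi(x) = (1 + erf(x/sqrt(2)))/2
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))
```

Evaluating `phi_numeric(1.96)` recovers the familiar 0.975 to several decimal places, matching the 1.96 critical value used throughout the notes.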


Figure 2.1. Standard normal p.d.f. and c.d.f.

3 Maximum Likelihood Estimation
Maximum likelihood estimation is a general method for estimating unknown parameters
from data. This turns out to be the method of choice in many contexts, though this isn’t
obvious at this stage.

Example. Suppose e.g. that 𝑥 1 , . . . , 𝑥 𝑛 are time intervals between major earthquakes.
Assume these are observations of 𝑋1 , . . . , 𝑋𝑛 independently drawn from an exponential
distribution with mean 𝜇, so that each 𝑋𝑖 has p.d.f.

𝑓(𝑥; 𝜇) = (1/𝜇) 𝑒^{−𝑥/𝜇},  𝑥 ⩾ 0.

We have written 𝑓 (𝑥; 𝜇) to indicate that the p.d.f. 𝑓 depends on 𝜇.


• How do we estimate 𝜇?
• Is 𝑋 a good estimator for 𝜇?
• Is there a general principle we can use?


Figure 3.1. Histogram of time intervals between 62 major earthquakes 1902–77: an exponential
density looks plausible. (There are more effective ways than a histogram to check if a particular
distribution is appropriate, i.e. to check the exponential assumption in this case – see Part A
Statistics.)

In general we write 𝑓 (𝑥; 𝜃) to indicate that the p.d.f./p.m.f. 𝑓 , which is a function of 𝑥,


depends on a parameter 𝜃. Similarly, 𝑓 (x; 𝜃) denotes the joint p.d.f./p.m.f. of X. (Sometimes
𝑓 (𝑥; 𝜃) is written 𝑓 (𝑥 | 𝜃).)
The parameter 𝜃 may be a vector, e.g. 𝜃 = (𝜇, 𝜎2 ) in the earlier 𝑁(𝜇, 𝜎2 ) example.
If we regard 𝜃 as unknown, then we need to estimate it using 𝑥 1 , . . . , 𝑥 𝑛 .

Definition. Let 𝑋1 , . . . , 𝑋𝑛 have joint p.d.f./p.m.f. 𝑓 (x; 𝜃). Given observed values
𝑥1 , . . . , 𝑥 𝑛 of 𝑋1 , . . . , 𝑋𝑛 , the likelihood of 𝜃 is the function

𝐿(𝜃) = 𝑓 (x; 𝜃). (3.1)

The log-likelihood is ℓ (𝜃) = log 𝐿(𝜃).


So 𝐿(𝜃) is the joint p.d.f./p.m.f. of the observed data regarded as a function of 𝜃, for
fixed x.
In the definition of ℓ (𝜃), log means log to base 𝑒, i.e. log = log𝑒 = ln.
Often we assume that 𝑋₁, . . . , 𝑋ₙ are a random sample from 𝑓(𝑥; 𝜃), so that

𝐿(𝜃) = 𝑓(x; 𝜃) = ∏ᵢ₌₁ⁿ 𝑓(𝑥ᵢ; 𝜃)   (since the 𝑋ᵢ are independent).

Sometimes we have independent 𝑋𝑖 whose distributions differ, say 𝑋𝑖 is from 𝑓𝑖 (𝑥; 𝜃).
Then the likelihood is

𝐿(𝜃) = ∏ᵢ₌₁ⁿ 𝑓ᵢ(𝑥ᵢ; 𝜃).

Definition. The maximum likelihood estimate (MLE) 𝜃̂(x) is the value of 𝜃 that maximises
𝐿(𝜃) for the given x.
The idea of maximum likelihood is to estimate the parameter by the value of 𝜃 that gives
the greatest likelihood to observations 𝑥 1 , . . . , 𝑥 𝑛 . That is, the 𝜃 for which the probability
or probability density (3.1) is maximised.

Since the log function is strictly increasing, 𝜃̂(x) also maximises ℓ(𝜃). Finding the MLE by
maximising ℓ(𝜃) is often more convenient.

Example (continued). In our exponential mean 𝜇 example, the parameter is 𝜇 and


𝐿(𝜇) = ∏ᵢ₌₁ⁿ (1/𝜇) 𝑒^{−𝑥ᵢ/𝜇} = 𝜇^{−𝑛} exp(−(1/𝜇) Σᵢ₌₁ⁿ 𝑥ᵢ)

ℓ(𝜇) = log 𝐿(𝜇) = −𝑛 log 𝜇 − (1/𝜇) Σᵢ₌₁ⁿ 𝑥ᵢ.

To find the maximum:

𝑑ℓ/𝑑𝜇 = −𝑛/𝜇 + (Σᵢ₌₁ⁿ 𝑥ᵢ)/𝜇².

So

𝑑ℓ/𝑑𝜇 = 0 ⟺ 𝑛/𝜇 = (Σᵢ₌₁ⁿ 𝑥ᵢ)/𝜇² ⟺ 𝜇 = 𝑥̄.

This is a maximum since

𝑑²ℓ/𝑑𝜇² |_{𝜇=𝑥̄} = 𝑛/𝑥̄² − (2 Σᵢ₌₁ⁿ 𝑥ᵢ)/𝑥̄³ = −𝑛/𝑥̄² < 0.

So the MLE is 𝜇̂(x) = 𝑥̄. Often we'll just write 𝜇̂ = 𝑥̄.
In this case the maximum likelihood estimator of 𝜇 is 𝜇̂(X) = 𝑋̄, which is a random variable.
(More on the difference between estimates and estimators later.)
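The fact that ℓ(𝜇) is maximised at the sample mean is easy to confirm numerically. A sketch with simulated data (the seed, sample size and "true" mean below are arbitrary):

```python
import math
import random

random.seed(1)
n = 200
mu_true = 437.0  # made-up mean interval, in days
xs = [random.expovariate(1 / mu_true) for _ in range(n)]

def loglik(mu, xs):
    # l(mu) = -n log(mu) - (1/mu) * sum(x_i)
    return -len(xs) * math.log(mu) - sum(xs) / mu

xbar = sum(xs) / n

# crude grid search over mu; the maximum should sit at the sample mean
grid = [xbar * (0.5 + 0.001 * k) for k in range(1001)]
mu_hat = max(grid, key=lambda m: loglik(m, xs))
```

The grid maximiser lands (up to grid resolution) on 𝑥̄, agreeing with the closed-form result 𝜇̂ = 𝑥̄.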


Figure 3.2. Likelihood 𝐿(𝜇) and log-likelihood ℓ (𝜇) for exponential (earthquake) example.

Definition. If 𝜃̂(x) is the maximum likelihood estimate of 𝜃, then the maximum likelihood
estimator (MLE) is defined by 𝜃̂(X).

Note: both maximum likelihood estimate and maximum likelihood estimator are often
abbreviated to MLE.

Example (Opinion poll). Suppose 𝑛 individuals are drawn independently from a large
population. Let

𝑋ᵢ = 1 if individual 𝑖 is a Labour voter, and 𝑋ᵢ = 0 otherwise.
Let 𝑝 be the proportion of Labour voters, so that

𝑃(𝑋𝑖 = 1) = 𝑝, 𝑃(𝑋𝑖 = 0) = 1 − 𝑝.

This is a Bernoulli distribution, for which the p.m.f. can be written

𝑓 (𝑥; 𝑝) = 𝑃(𝑋 = 𝑥) = 𝑝 𝑥 (1 − 𝑝)1−𝑥 , 𝑥 = 0, 1.

The likelihood is

𝐿(𝑝) = ∏ᵢ₌₁ⁿ 𝑝^{𝑥ᵢ} (1 − 𝑝)^{1−𝑥ᵢ} = 𝑝^𝑟 (1 − 𝑝)^{𝑛−𝑟}

where 𝑟 = Σᵢ₌₁ⁿ 𝑥ᵢ. So the log-likelihood is

ℓ(𝑝) = log 𝐿(𝑝) = 𝑟 log 𝑝 + (𝑛 − 𝑟) log(1 − 𝑝).

For a maximum, differentiate and set to zero:

𝑟/𝑝 − (𝑛 − 𝑟)/(1 − 𝑝) = 0 ⟺ 𝑟/𝑝 = (𝑛 − 𝑟)/(1 − 𝑝) ⟺ 𝑟 − 𝑟𝑝 = 𝑛𝑝 − 𝑟𝑝

and so 𝑝 = 𝑟/𝑛. This is a maximum since ℓ″(𝑝) < 0.

So 𝑝̂ = 𝑟/𝑛, i.e. the MLE is 𝑝̂ = Σᵢ₌₁ⁿ 𝑋ᵢ/𝑛, which is the proportion of Labour voters in the
sample.

Example (Genetics). Suppose we test randomly chosen individuals at a particular locus on


the genome. Each chromosome can be type 𝐴 or 𝑎. Every individual has two chromosomes
(one from each parent), so the genotype can be 𝐴𝐴, 𝐴𝑎 or 𝑎𝑎. (Note that order is not
relevant, there is no distinction between 𝐴𝑎 and 𝑎𝐴.)
Hardy-Weinberg law: under plausible assumptions,

𝑃(𝐴𝐴) = 𝑝 1 = 𝜃2 , 𝑃(𝐴𝑎) = 𝑝 2 = 2𝜃(1 − 𝜃), 𝑃(𝑎𝑎) = 𝑝 3 = (1 − 𝜃)2

for some 𝜃 with 0 ⩽ 𝜃 ⩽ 1.


Now suppose the random sample of 𝑛 individuals contains:

𝑥 1 of type 𝐴𝐴, 𝑥 2 of type 𝐴𝑎, 𝑥 3 of type 𝑎𝑎

where 𝑥 1 + 𝑥 2 + 𝑥3 = 𝑛 and these are observations of 𝑋1 , 𝑋2 , 𝑋3 . Then

𝐿(𝜃) = 𝑃(𝑋₁ = 𝑥₁, 𝑋₂ = 𝑥₂, 𝑋₃ = 𝑥₃)
    = (𝑛!/(𝑥₁! 𝑥₂! 𝑥₃!)) 𝑝₁^{𝑥₁} 𝑝₂^{𝑥₂} 𝑝₃^{𝑥₃}
    = (𝑛!/(𝑥₁! 𝑥₂! 𝑥₃!)) (𝜃²)^{𝑥₁} [2𝜃(1 − 𝜃)]^{𝑥₂} [(1 − 𝜃)²]^{𝑥₃}.
This is a multinomial distribution.
So

ℓ (𝜃) = constant + 2𝑥 1 log 𝜃 + 𝑥 2 [log 2 + log 𝜃 + log(1 − 𝜃)] + 2𝑥3 log(1 − 𝜃)


= constant + (2𝑥 1 + 𝑥 2 ) log 𝜃 + (𝑥 2 + 2𝑥3 ) log(1 − 𝜃)

(where the constants depend on 𝑥1 , 𝑥2 , 𝑥3 but not 𝜃).

Then ℓ′(𝜃̂) = 0 gives

(2𝑥₁ + 𝑥₂)/𝜃̂ − (𝑥₂ + 2𝑥₃)/(1 − 𝜃̂) = 0

or

(2𝑥₁ + 𝑥₂)(1 − 𝜃̂) = (𝑥₂ + 2𝑥₃)𝜃̂.

So

𝜃̂ = (2𝑥₁ + 𝑥₂)/(2(𝑥₁ + 𝑥₂ + 𝑥₃)) = (2𝑥₁ + 𝑥₂)/(2𝑛).

Steps:
• Write down the (log) likelihood

• Find the maximum (usually by differentiation, but not quite always)
• Rearrange to give the parameter estimate in terms of the data.
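These steps can be sketched numerically for the Hardy-Weinberg example. In the Python sketch below the genotype counts are hypothetical, chosen only to exercise the formula 𝜃̂ = (2𝑥₁ + 𝑥₂)/(2𝑛):

```python
import math

# hypothetical genotype counts: x1 of type AA, x2 of Aa, x3 of aa
x1, x2, x3 = 10, 68, 112
n = x1 + x2 + x3

def loglik(theta):
    # l(theta) = constant + (2*x1 + x2)*log(theta) + (x2 + 2*x3)*log(1 - theta)
    # (the constant is dropped: it does not affect the maximiser)
    return (2 * x1 + x2) * math.log(theta) + (x2 + 2 * x3) * math.log(1 - theta)

theta_hat = (2 * x1 + x2) / (2 * n)   # closed-form MLE from the derivation
```

A quick check that ℓ(𝜃̂) exceeds ℓ at neighbouring values of 𝜃 confirms the stationary point is a maximum.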

Example (Estimating multiple parameters). Let 𝑋₁, . . . , 𝑋ₙ ∼iid 𝑁(𝜇, 𝜎²) where both 𝜇 and
𝜎² are unknown.
[Here ∼iid means "are independent and identically distributed as."]
The likelihood is

𝐿(𝜇, 𝜎²) = ∏ᵢ₌₁ⁿ (1/√(2𝜋𝜎²)) exp(−(𝑥ᵢ − 𝜇)²/(2𝜎²))
       = (2𝜋𝜎²)^{−𝑛/2} exp(−(1/(2𝜎²)) Σᵢ₌₁ⁿ (𝑥ᵢ − 𝜇)²)

with log-likelihood

ℓ(𝜇, 𝜎²) = −(𝑛/2) log 2𝜋 − (𝑛/2) log(𝜎²) − (1/(2𝜎²)) Σᵢ₌₁ⁿ (𝑥ᵢ − 𝜇)².

We maximise ℓ jointly over 𝜇 and 𝜎²:

∂ℓ/∂𝜇 = (1/𝜎²) Σᵢ₌₁ⁿ (𝑥ᵢ − 𝜇)

∂ℓ/∂(𝜎²) = −𝑛/(2𝜎²) + (1/(2(𝜎²)²)) Σᵢ₌₁ⁿ (𝑥ᵢ − 𝜇)²

and solving ∂ℓ/∂𝜇 = 0 and ∂ℓ/∂(𝜎²) = 0 simultaneously we obtain

𝜇̂ = (1/𝑛) Σᵢ₌₁ⁿ 𝑥ᵢ = 𝑥̄

𝜎̂² = (1/𝑛) Σᵢ₌₁ⁿ (𝑥ᵢ − 𝜇̂)² = (1/𝑛) Σᵢ₌₁ⁿ (𝑥ᵢ − 𝑥̄)².

Hence: the MLE of 𝜇 is the sample mean, but the MLE of 𝜎² is (𝑛 − 1)𝑠²/𝑛. (More later.)
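A small numerical check confirms the two MLEs and the relation 𝜎̂² = (𝑛 − 1)𝑠²/𝑛; the simulated "data" below are arbitrary normal draws:

```python
import random

random.seed(2)
xs = [random.gauss(36.8, 0.4) for _ in range(130)]  # arbitrary simulated sample
n = len(xs)

mu_hat = sum(xs) / n                                   # MLE of mu: the sample mean
sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n    # MLE of sigma^2: divisor n
s2 = sum((x - mu_hat) ** 2 for x in xs) / (n - 1)      # sample variance: divisor n - 1
```

As expected, `sigma2_hat` equals `(n - 1) * s2 / n` and is strictly smaller than the sample variance `s2`.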

“Estimate” and “estimator”

Estimator:
• A rule for constructing an estimate.
• A function of the random variables X involved in the random sample.
• Itself a random variable.
Estimate:
• The numerical value of the estimator for the particular data set.
• The value of the function evaluated at the data 𝑥 1 , . . . , 𝑥 𝑛 .

4 Parameter Estimation
Recall the earthquake example:

iid
𝑋1 , . . . , 𝑋𝑛 ∼ exponential distribution, mean 𝜇.

Possible estimators of 𝜇:
• 𝑋̄
• (1/3)𝑋₁ + (2/3)𝑋₂
• 𝑋₁ + 𝑋₂ − 𝑋₃
• (2/(𝑛(𝑛 + 1)))(𝑋₁ + 2𝑋₂ + · · · + 𝑛𝑋ₙ).
How should we choose?
In general, suppose 𝑋1 , . . . , 𝑋𝑛 is a random sample from a distribution with p.d.f./p.m.f.
𝑓 (𝑥; 𝜃). We want to estimate 𝜃 from observations 𝑥 1 , . . . , 𝑥 𝑛 .

Definition. A statistic is any function 𝑇(X) of 𝑋1 , . . . , 𝑋𝑛 that does not depend on 𝜃.


An estimator of 𝜃 is any statistic 𝑇(X) that we might use to estimate 𝜃.
𝑇(x) is the estimate of 𝜃 obtained via 𝑇 from observed values x.

Note. 𝑇(X) is a random variable, e.g. 𝑋̄.
𝑇(x) is a fixed number, based on data, e.g. 𝑥̄.

We can choose between estimators by studying their properties. A good estimator should
take values close to 𝜃.

Definition. The estimator 𝑇 = 𝑇(X) is said to be unbiased for 𝜃 if, whatever the true value
of 𝜃, we have 𝐸(𝑇) = 𝜃.
This means that “on average” 𝑇 is correct.

Example (Earthquakes). Possible estimators 𝑋̄, (1/3)𝑋₁ + (2/3)𝑋₂, etc.
Since 𝐸(𝑋ᵢ) = 𝜇, we have

𝐸(𝑋̄) = (1/𝑛) Σᵢ₌₁ⁿ 𝜇 = 𝜇

𝐸((1/3)𝑋₁ + (2/3)𝑋₂) = (1/3)𝜇 + (2/3)𝜇 = 𝜇.

Similar calculations show that 𝑋₁ + 𝑋₂ − 𝑋₃ and (2/(𝑛(𝑛 + 1))) Σⱼ₌₁ⁿ 𝑗𝑋ⱼ are also unbiased.

Example (Normal variance). Suppose 𝑋₁, . . . , 𝑋ₙ ∼iid 𝑁(𝜇, 𝜎²), with 𝜇 and 𝜎² unknown,
and let 𝑇 = (1/𝑛) Σᵢ₌₁ⁿ (𝑋ᵢ − 𝑋̄)². Then 𝑇 is the MLE of 𝜎². Is 𝑇 unbiased?

Let 𝑍ᵢ = (𝑋ᵢ − 𝜇)/𝜎. So the 𝑍ᵢ are independent and 𝑁(0, 1), with 𝐸(𝑍ᵢ) = 0 and var(𝑍ᵢ) = 𝐸(𝑍ᵢ²) = 1. Then

𝐸[(𝑋ᵢ − 𝑋̄)²] = 𝐸[𝜎²(𝑍ᵢ − 𝑍̄)²]
    = 𝜎² var(𝑍ᵢ − 𝑍̄)   (since 𝐸(𝑍ᵢ − 𝑍̄) = 0)
    = 𝜎² var(−(1/𝑛)𝑍₁ − (1/𝑛)𝑍₂ − · · · + ((𝑛 − 1)/𝑛)𝑍ᵢ + · · · − (1/𝑛)𝑍ₙ)
    = 𝜎² [(1/𝑛²) var(𝑍₁) + (1/𝑛²) var(𝑍₂) + · · · + ((𝑛 − 1)²/𝑛²) var(𝑍ᵢ) + · · · + (1/𝑛²) var(𝑍ₙ)]
         (since var(Σ 𝑎ᵢ𝑈ᵢ) = Σ 𝑎ᵢ² var(𝑈ᵢ) for independent 𝑈ᵢ)
    = 𝜎² [(𝑛 − 1) × (1/𝑛²) + (𝑛 − 1)²/𝑛²]
    = ((𝑛 − 1)/𝑛) 𝜎².

So

𝐸(𝑇) = (1/𝑛) Σᵢ₌₁ⁿ 𝐸[(𝑋ᵢ − 𝑋̄)²] = ((𝑛 − 1)/𝑛) 𝜎² < 𝜎².

Hence 𝑇 is not unbiased: 𝑇 will underestimate 𝜎² on average.
But

𝑆² = (1/(𝑛 − 1)) Σᵢ₌₁ⁿ (𝑋ᵢ − 𝑋̄)² = (𝑛/(𝑛 − 1)) 𝑇

so the sample variance satisfies

𝐸(𝑆²) = (𝑛/(𝑛 − 1)) 𝐸(𝑇) = 𝜎².

So 𝑆² is unbiased for 𝜎².
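The bias of 𝑇 and the unbiasedness of 𝑆² show up clearly in simulation. A sketch (the sample size, variance and replication count below are arbitrary):

```python
import random

random.seed(3)
mu, sigma2 = 0.0, 4.0          # arbitrary true mean and variance
n, reps = 5, 20_000

avg_T = avg_S2 = 0.0
for _ in range(reps):
    xs = [random.gauss(mu, sigma2 ** 0.5) for _ in range(n)]
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    avg_T += (ss / n) / reps          # divisor n: the MLE of sigma^2
    avg_S2 += (ss / (n - 1)) / reps   # divisor n - 1: the sample variance
```

Theory predicts `avg_T` ≈ (𝑛 − 1)𝜎²/𝑛 = 3.2 and `avg_S2` ≈ 𝜎² = 4, and the simulation agrees to Monte Carlo accuracy.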
Example (Uniform distribution – some unusual features!). Suppose 𝑋₁, . . . , 𝑋ₙ ∼iid Uniform[0, 𝜃],
where 𝜃 > 0, i.e.

𝑓(𝑥; 𝜃) = 1/𝜃 if 0 ⩽ 𝑥 ⩽ 𝜃, and 0 otherwise.

What is the MLE for 𝜃? Is the MLE unbiased?
Calculate the likelihood:

𝐿(𝜃) = ∏ᵢ₌₁ⁿ 𝑓(𝑥ᵢ; 𝜃)
    = 1/𝜃ⁿ if 0 ⩽ 𝑥ᵢ ⩽ 𝜃 for all 𝑖, and 0 otherwise
    = { 0 if 0 < 𝜃 < max 𝑥ᵢ;  1/𝜃ⁿ if 𝜃 ⩾ max 𝑥ᵢ }.

Note: 𝜃 ⩾ 𝑥ᵢ for all 𝑖 ⟺ 𝜃 ⩾ max 𝑥ᵢ. (And max 𝑥ᵢ means max₁⩽ᵢ⩽ₙ 𝑥ᵢ.)


Figure 4.1. Likelihood 𝐿(𝜃) for Uniform[0, 𝜃] example.

From the diagram:
• the max occurs at 𝜃̂ = max 𝑥ᵢ
• this is not a point where ℓ′(𝜃) = 0
• taking logs doesn't help.

Consider the range of values of 𝑥 for which 𝑓(𝑥; 𝜃) > 0, i.e. 0 ⩽ 𝑥 ⩽ 𝜃. The thing that
makes this example different from our previous ones is that this range depends on 𝜃 (and we
must take this into account because the likelihood is a function of 𝜃).

The MLE of 𝜃 is 𝜃̂ = max 𝑋ᵢ. What is 𝐸(𝜃̂)?

Find the c.d.f. of 𝜃̂:

𝐹(𝑦) = 𝑃(𝜃̂ ⩽ 𝑦)
    = 𝑃(max 𝑋ᵢ ⩽ 𝑦)
    = 𝑃(𝑋₁ ⩽ 𝑦, 𝑋₂ ⩽ 𝑦, . . . , 𝑋ₙ ⩽ 𝑦)
    = 𝑃(𝑋₁ ⩽ 𝑦) 𝑃(𝑋₂ ⩽ 𝑦) · · · 𝑃(𝑋ₙ ⩽ 𝑦)   (since the 𝑋ᵢ are independent)
    = (𝑦/𝜃)ⁿ if 0 ⩽ 𝑦 ⩽ 𝜃, and 1 if 𝑦 > 𝜃.

So, differentiating the c.d.f., the p.d.f. is

𝑓(𝑦) = 𝑛𝑦^{𝑛−1}/𝜃ⁿ,  0 ⩽ 𝑦 ⩽ 𝜃.

So

𝐸(𝜃̂) = ∫₀^𝜃 𝑦 · (𝑛𝑦^{𝑛−1}/𝜃ⁿ) 𝑑𝑦 = (𝑛/𝜃ⁿ) ∫₀^𝜃 𝑦ⁿ 𝑑𝑦 = 𝑛𝜃/(𝑛 + 1).

So 𝜃̂ is not unbiased. But note that it is asymptotically unbiased: 𝐸(𝜃̂) → 𝜃 as 𝑛 → ∞.

In fact under mild assumptions MLEs are always asymptotically unbiased.
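The bias formula 𝐸(𝜃̂) = 𝑛𝜃/(𝑛 + 1) is easy to confirm by simulation (the choices of 𝜃, 𝑛 and replication count below are arbitrary):

```python
import random

random.seed(4)
theta, n, reps = 1.0, 10, 20_000

# average of max(X_1, ..., X_n) over many replications
avg_max = sum(max(random.uniform(0, theta) for _ in range(n))
              for _ in range(reps)) / reps

expected = n * theta / (n + 1)   # theory: E(max X_i) = n*theta/(n+1)
```

With 𝑛 = 10 the theoretical mean is 10/11 ≈ 0.909, noticeably below 𝜃 = 1, and the simulated average matches it closely.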

4.1 Further Properties of Estimators


Definition. The mean squared error (MSE) of an estimator 𝑇 is defined by

MSE(𝑇) = 𝐸[(𝑇 − 𝜃)2 ].

The bias 𝑏(𝑇) of 𝑇 is defined by


𝑏(𝑇) = 𝐸(𝑇) − 𝜃.

Note.
1. Both MSE(𝑇) and 𝑏(𝑇) may depend on 𝜃.
2. MSE is a measure of the “distance” between 𝑇 and 𝜃, so is a good overall measure of
performance.
3. 𝑇 is unbiased if 𝑏(𝑇) = 0 for all 𝜃.

Theorem 4.1. MSE(𝑇) = var(𝑇) + [𝑏(𝑇)]2 .

Proof. Let 𝜇 = 𝐸(𝑇). Then

MSE(𝑇) = 𝐸[{(𝑇 − 𝜇) + (𝜇 − 𝜃)}2 ]


= 𝐸[(𝑇 − 𝜇)2 + 2(𝜇 − 𝜃)(𝑇 − 𝜇) + (𝜇 − 𝜃)2 ]
= 𝐸[(𝑇 − 𝜇)2 ] + 2(𝜇 − 𝜃)𝐸[𝑇 − 𝜇] + (𝜇 − 𝜃)2
= var(𝑇) + 2(𝜇 − 𝜃) × 0 + (𝜇 − 𝜃)2
= var(𝑇) + [𝑏(𝑇)]2 . □

So an estimator with small MSE needs to have small variance and small bias. Unbiasedness
alone is not particularly desirable – it is the combination of small variance and small bias
which is important.

Reminder

Suppose 𝑎₁, . . . , 𝑎ₙ are constants. It is always the case that

𝐸(𝑎₁𝑋₁ + · · · + 𝑎ₙ𝑋ₙ) = 𝑎₁𝐸(𝑋₁) + · · · + 𝑎ₙ𝐸(𝑋ₙ).

If 𝑋₁, . . . , 𝑋ₙ are independent then

var(𝑎₁𝑋₁ + · · · + 𝑎ₙ𝑋ₙ) = 𝑎₁² var(𝑋₁) + · · · + 𝑎ₙ² var(𝑋ₙ).

In particular, if 𝑋₁, . . . , 𝑋ₙ is a random sample with 𝐸(𝑋ᵢ) = 𝜇 and var(𝑋ᵢ) = 𝜎², then

𝐸(𝑋̄) = 𝜇 and var(𝑋̄) = 𝜎²/𝑛.

Example (Uniform distribution). Suppose 𝑋₁, . . . , 𝑋ₙ ∼iid Uniform[0, 𝜃], i.e.

𝑓(𝑥; 𝜃) = 1/𝜃 if 0 ⩽ 𝑥 ⩽ 𝜃, and 0 otherwise.

We will consider two estimators of 𝜃:
• 𝑇 = 2𝑋̄, the natural estimator based on the sample mean (because the mean of the
distribution is 𝜃/2)
• 𝜃̂ = max 𝑋ᵢ, the MLE.

Now 𝐸(𝑇) = 2𝐸(𝑋̄) = 𝜃, so 𝑇 is unbiased. Hence

MSE(𝑇) = var(𝑇) = 4 var(𝑋̄) = 4 var(𝑋₁)/𝑛.

We have 𝐸(𝑋₁) = 𝜃/2 and

𝐸(𝑋₁²) = ∫₀^𝜃 𝑥² · (1/𝜃) 𝑑𝑥 = 𝜃²/3

so

var(𝑋₁) = 𝜃²/3 − (𝜃/2)² = 𝜃²/12

hence

MSE(𝑇) = 4 var(𝑋₁)/𝑛 = 𝜃²/(3𝑛).

Previously we showed that 𝜃̂ has p.d.f.

𝑓(𝑦) = 𝑛𝑦^{𝑛−1}/𝜃ⁿ,  0 ⩽ 𝑦 ⩽ 𝜃

and 𝐸(𝜃̂) = 𝑛𝜃/(𝑛 + 1). So 𝑏(𝜃̂) = 𝑛𝜃/(𝑛 + 1) − 𝜃 = −𝜃/(𝑛 + 1). Also,

𝐸(𝜃̂²) = ∫₀^𝜃 𝑦² · (𝑛𝑦^{𝑛−1}/𝜃ⁿ) 𝑑𝑦 = 𝑛𝜃²/(𝑛 + 2)

so

var(𝜃̂) = 𝜃² (𝑛/(𝑛 + 2) − 𝑛²/(𝑛 + 1)²) = 𝑛𝜃²/((𝑛 + 1)²(𝑛 + 2))

hence

MSE(𝜃̂) = var(𝜃̂) + [𝑏(𝜃̂)]² = 2𝜃²/((𝑛 + 1)(𝑛 + 2)),

which is less than 𝜃²/(3𝑛) = MSE(𝑇) for 𝑛 ⩾ 3.

• MSE(𝜃̂) ≪ MSE(𝑇) for large 𝑛, so 𝜃̂ is much better – its MSE decreases like 1/𝑛²
rather than 1/𝑛.
• Note that ((𝑛 + 1)/𝑛)𝜃̂ is unbiased and

MSE(((𝑛 + 1)/𝑛)𝜃̂) = var(((𝑛 + 1)/𝑛)𝜃̂) = ((𝑛 + 1)²/𝑛²) var(𝜃̂) = 𝜃²/(𝑛(𝑛 + 2)) < MSE(𝜃̂) for 𝑛 ⩾ 2.

However, among all estimators of the form 𝜆𝜃̂, the MSE is minimised by ((𝑛 + 2)/(𝑛 + 1))𝜃̂.
[To show this: note var(𝜆𝜃̂) = 𝜆² var(𝜃̂) and 𝑏(𝜆𝜃̂) = 𝜆𝑛𝜃/(𝑛 + 1) − 𝜃. Now plug in these formulae
and minimise over 𝜆.]
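These MSE comparisons can be reproduced by simulation. The sketch below (arbitrary 𝜃, 𝑛 and replication count) estimates MSE(𝑇) and MSE(𝜃̂) empirically:

```python
import random

random.seed(5)
theta, n, reps = 1.0, 20, 20_000   # arbitrary choices

mse_T = mse_mle = 0.0
for _ in range(reps):
    xs = [random.uniform(0, theta) for _ in range(n)]
    T = 2 * sum(xs) / n            # unbiased estimator based on the sample mean
    mle = max(xs)                  # the MLE
    mse_T += (T - theta) ** 2 / reps
    mse_mle += (mle - theta) ** 2 / reps
```

Theory gives MSE(𝑇) = 𝜃²/(3𝑛) ≈ 0.0167 and MSE(𝜃̂) = 2𝜃²/((𝑛 + 1)(𝑛 + 2)) ≈ 0.0043 here, and the empirical values agree to Monte Carlo accuracy.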

Estimation so far

So far we've considered getting an estimate which is a single number – a point estimate –
for the parameter of interest: e.g. 𝑥̄, max 𝑥ᵢ, 𝑠², . . . .
Maximum likelihood is a good way (usually) of producing an estimate (but we did better
when the range of the distribution depends on 𝜃 – fairly unusual).
MLEs are usually asymptotically unbiased, and have MSE decreasing like 1/𝑛 for
large 𝑛.
MLEs can be found in quite general situations.

5 Accuracy of estimation: Confidence Intervals
A crucial aspect of statistics is not just to estimate a quantity of interest, but to assess how
accurate or precise that estimate is. One approach is to find an interval, called a confidence
interval (CI) within which we think the true parameter falls.

Theorem 5.1. Suppose 𝑎₁, . . . , 𝑎ₙ are constants and that 𝑋₁, . . . , 𝑋ₙ are independent with
𝑋ᵢ ∼ 𝑁(𝜇ᵢ, 𝜎ᵢ²). Let 𝑌 = Σᵢ₌₁ⁿ 𝑎ᵢ𝑋ᵢ. Then

𝑌 ∼ 𝑁(Σᵢ₌₁ⁿ 𝑎ᵢ𝜇ᵢ, Σᵢ₌₁ⁿ 𝑎ᵢ²𝜎ᵢ²).

Proof. See Sheet 1.


We know from Prelims Probability how to calculate the mean and variance of 𝑌. The
additional information here is that 𝑌 has a normal distribution, i.e. “a linear combination
of normals is itself normal.” □
Example. Let 𝑋₁, . . . , 𝑋ₙ ∼iid 𝑁(𝜇, 𝜎₀²) where 𝜇 is unknown and 𝜎₀² is known. What can we
say about 𝜇?
By Theorem 5.1,

Σᵢ₌₁ⁿ 𝑋ᵢ ∼ 𝑁(𝑛𝜇, 𝑛𝜎₀²)

𝑋̄ ∼ 𝑁(𝜇, 𝜎₀²/𝑛).

So, standardising 𝑋̄,

𝑋̄ − 𝜇 ∼ 𝑁(0, 𝜎₀²/𝑛)

(𝑋̄ − 𝜇)/(𝜎₀/√𝑛) ∼ 𝑁(0, 1).

Now if 𝑍 ∼ 𝑁(0, 1), then 𝑃(−1.96 < 𝑍 < 1.96) = 0.95.


Figure 5.1. Standard normal p.d.f.: the shaded area, i.e. the area under the curve from 𝑥 = −1.96 to
𝑥 = 1.96, is 0.95.

So

𝑃(−1.96 < (𝑋̄ − 𝜇)/(𝜎₀/√𝑛) < 1.96) = 0.95
𝑃(−1.96 𝜎₀/√𝑛 < 𝑋̄ − 𝜇 < 1.96 𝜎₀/√𝑛) = 0.95
𝑃(𝑋̄ − 1.96 𝜎₀/√𝑛 < 𝜇 < 𝑋̄ + 1.96 𝜎₀/√𝑛) = 0.95
𝑃(the interval 𝑋̄ ± 1.96 𝜎₀/√𝑛 contains 𝜇) = 0.95.

Note that we write 𝑋̄ ± 1.96 𝜎₀/√𝑛 to mean the interval

(𝑋̄ − 1.96 𝜎₀/√𝑛, 𝑋̄ + 1.96 𝜎₀/√𝑛).

This interval is a random interval since its endpoints involve 𝑋̄ (and 𝑋̄ is a random variable).
It is an example of a confidence interval.

Confidence Intervals: general definition

Definition. If 𝑎(X) and 𝑏(X) are two statistics, and 0 < 𝛼 < 1, the interval (𝑎(X), 𝑏(X)) is
called a confidence interval for 𝜃 with confidence level 1 − 𝛼 if, for all 𝜃,

𝑃(𝑎(X) < 𝜃 < 𝑏(X)) = 1 − 𝛼.

The interval (𝑎(X), 𝑏(X)) is also called a 100(1 − 𝛼)% CI, e.g. a "95% confidence interval" if
𝛼 = 0.05.

Usually we are interested in small values of 𝛼: the most commonly used values are 0.05
and 0.01 (i.e. confidence levels of 95% and 99%) but there is nothing special about any
confidence level.
The interval (𝑎(x), 𝑏(x)) is called an interval estimate and the random interval (𝑎(X), 𝑏(X)) is
called an interval estimator.


Note: 𝑎(X) and 𝑏(X) do not depend on 𝜃.
We would like to construct 𝑎(X) and 𝑏(X) so that:
• the width of the interval (𝑎(X), 𝑏(X)) is small
• the probability 𝑃(𝑎(X) < 𝜃 < 𝑏(X)) is large.
Percentage points of normal distribution

For any 𝛼 with 0 < 𝛼 < 1, let 𝑧 𝛼 be the constant such that Φ(𝑧 𝛼 ) = 1 − 𝛼, where Φ is
the 𝑁(0, 1) c.d.f. (i.e. if 𝑍 ∼ 𝑁(0, 1) then 𝑃(𝑍 > 𝑧 𝛼 ) = 𝛼).

Figure 5.2. Standard normal p.d.f.: the shaded area, i.e. the area under the curve to the right of 𝑧 𝛼 ,
is 𝛼.

We call 𝑧 𝛼 the “1 − 𝛼 quantile of 𝑁(0, 1).”

𝛼:   0.1   0.05   0.025   0.005
𝑧_𝛼: 1.28  1.64   1.96    2.58

By the same argument as before, if 𝑋₁, . . . , 𝑋ₙ ∼iid 𝑁(𝜇, 𝜎₀²) with 𝜎₀² known, then a level
1 − 𝛼 confidence interval for 𝜇 is

(𝑋̄ ± 𝑧_{𝛼/2} 𝜎₀/√𝑛).

Example (See slides on rainfall example). Measure annual rainfall amounts for 20 years
(i.e. 𝑛 = 20). Suppose 𝑥̄ = 688.4 and 𝜎₀ = 130.
Then the endpoints of a 95% CI for 𝜇 are

𝑥̄ ± 1.96 𝜎₀/√𝑛 = 688.4 ± 57.0.

So a 95% CI for 𝜇 is (631.4, 745.4).
The endpoints of a 99% CI are 𝑥̄ ± 2.58 𝜎₀/√𝑛, giving a 99% CI for 𝜇 of (613.4, 763.4).
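The rainfall intervals can be reproduced in a couple of lines; the numbers below are those given in the text:

```python
import math

xbar, sigma0, n = 688.4, 130.0, 20   # values from the rainfall example

def central_ci(xbar, sigma0, n, z):
    # endpoints: xbar -/+ z * sigma0 / sqrt(n)
    half = z * sigma0 / math.sqrt(n)
    return (xbar - half, xbar + half)

ci95 = central_ci(xbar, sigma0, n, 1.96)
ci99 = central_ci(xbar, sigma0, n, 2.58)
```

Both computed intervals match (631.4, 745.4) and (613.4, 763.4) to the quoted one-decimal precision.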

The symmetric confidence interval for 𝜇 above is called a central confidence interval
for 𝜇.

Suppose now that 𝑐 and 𝑑 are any constants such that

𝑃(−𝑐 < 𝑍 < 𝑑) = 1 − 𝛼

for 𝑍 ∼ 𝑁(0, 1).


Figure 5.3. Standard normal p.d.f.: the shaded area, i.e. the area under the curve from −𝑐 to 𝑑, is
1 − 𝛼.

Then

𝑃(𝑋̄ − 𝑑𝜎₀/√𝑛 < 𝜇 < 𝑋̄ + 𝑐𝜎₀/√𝑛) = 1 − 𝛼.

The choice 𝑐 = 𝑑 = 𝑧_{𝛼/2} gives the shortest such interval.

One-sided confidence limits

Continuing our normal example we have

𝑃((𝑋̄ − 𝜇)/(𝜎₀/√𝑛) > −𝑧_𝛼) = 1 − 𝛼

so

𝑃(𝜇 < 𝑋̄ + 𝑧_𝛼 𝜎₀/√𝑛) = 1 − 𝛼

and so (−∞, 𝑋̄ + 𝑧_𝛼 𝜎₀/√𝑛) is a "one-sided" confidence interval. We call 𝑋̄ + 𝑧_𝛼 𝜎₀/√𝑛 an upper 1 − 𝛼
confidence limit for 𝜇.
Similarly

𝑃(𝜇 > 𝑋̄ − 𝑧_𝛼 𝜎₀/√𝑛) = 1 − 𝛼

and 𝑋̄ − 𝑧_𝛼 𝜎₀/√𝑛 is a lower 1 − 𝛼 confidence limit for 𝜇.

Interpretation of a Confidence Interval

• The parameter 𝜃 is fixed but unknown.


• If we imagine repeating our experiment then we'd get new data, x′ = (𝑥₁′, . . . , 𝑥ₙ′)
say, and hence we'd get a new confidence interval (𝑎(x′), 𝑏(x′)). If we did this
repeatedly we would "catch" the true parameter value about 95% of the time, for a
95% confidence interval: i.e. about 95% of our intervals would contain 𝜃.
• The confidence level is a coverage probability, the probability that the random confidence
interval (𝑎(X), 𝑏(X)) covers the true 𝜃. (It's a random interval because the endpoints
𝑎(X), 𝑏(X) are random variables.)

But note that the interval (𝑎(x), 𝑏(x)) is not a random interval, e.g. (𝑎(x), 𝑏(x)) = (631.4, 745.4)
in the rainfall example. So it is wrong to say that (𝑎(x), 𝑏(x)) contains 𝜃 with probability
1 − 𝛼: this interval, e.g. (631.4, 745.4), either definitely does or definitely does not contain 𝜃,
but we can't say which of these two possibilities is true as 𝜃 is unknown.

5.1 Confidence Intervals using the CLT


The Central Limit Theorem (CLT)

We know that if 𝑋₁, . . . , 𝑋ₙ ∼iid 𝑁(𝜇, 𝜎²) then

(𝑋̄ − 𝜇)/(𝜎/√𝑛) ∼ 𝑁(0, 1).

Theorem 5.2. Let 𝑋₁, . . . , 𝑋ₙ be i.i.d. from any distribution with mean 𝜇 and variance 𝜎² ∈ (0, ∞).
Then, for all 𝑥,

𝑃((𝑋̄ − 𝜇)/(𝜎/√𝑛) ⩽ 𝑥) → Φ(𝑥) as 𝑛 → ∞.

Here, as usual, Φ is the 𝑁(0, 1) c.d.f. So for large 𝑛, whatever the distribution of the 𝑋ᵢ,

(𝑋̄ − 𝜇)/(𝜎/√𝑛) ≈ 𝑁(0, 1)

where ≈ means "has approximately the same distribution as." Usually 𝑛 > 30 is ok for a
reasonable approximation.

With 𝑋₁, 𝑋₂, . . . i.i.d. from any distribution with 𝐸(𝑋ᵢ) = 𝜇, var(𝑋ᵢ) = 𝜎²:
• the weak law of large numbers (Prelims Probability) tells us that the distribution of
𝑋̄ concentrates around 𝜇 as 𝑛 becomes large,
i.e. for 𝜖 > 0, we have 𝑃(|𝑋̄ − 𝜇| > 𝜖) → 0 as 𝑛 → ∞
• the CLT adds to this
– the fluctuations of 𝑋̄ around 𝜇 are of order 1/√𝑛
– the asymptotic distribution of these fluctuations is normal.

Example. Suppose 𝑋₁, . . . , 𝑋ₙ ∼iid exponential with mean 𝜇, e.g. 𝑋ᵢ = survival time of patient 𝑖.
So

𝑓(𝑥; 𝜇) = (1/𝜇) 𝑒^{−𝑥/𝜇},  𝑥 ⩾ 0

and 𝐸(𝑋ᵢ) = 𝜇, var(𝑋ᵢ) = 𝜇².
For large 𝑛, by the CLT,

(𝑋̄ − 𝜇)/(𝜇/√𝑛) ≈ 𝑁(0, 1).

So

𝑃(−𝑧_{𝛼/2} < (𝑋̄ − 𝜇)/(𝜇/√𝑛) < 𝑧_{𝛼/2}) ≈ 1 − 𝛼
𝑃(𝜇(1 − 𝑧_{𝛼/2}/√𝑛) < 𝑋̄ < 𝜇(1 + 𝑧_{𝛼/2}/√𝑛)) ≈ 1 − 𝛼
𝑃(𝑋̄/(1 + 𝑧_{𝛼/2}/√𝑛) < 𝜇 < 𝑋̄/(1 − 𝑧_{𝛼/2}/√𝑛)) ≈ 1 − 𝛼.

Hence an approximate 1 − 𝛼 CI for 𝜇 is

(𝑋̄/(1 + 𝑧_{𝛼/2}/√𝑛), 𝑋̄/(1 − 𝑧_{𝛼/2}/√𝑛)).

Numerically, if we have 𝑛 = 100 patients and 𝛼 = 0.05 (so 𝑧_{𝛼/2} = 1.96), then

(0.84𝑥̄, 1.24𝑥̄)

is an approximate 95% CI for 𝜇.
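The endpoints of this approximate CI, as multiples of 𝑥̄, can be computed directly (𝑛 = 100 and 𝛼 = 0.05 as in the text):

```python
import math

def exp_mean_ci(xbar, n, z=1.96):
    # approximate CI for mu: ( xbar/(1 + z/sqrt(n)), xbar/(1 - z/sqrt(n)) )
    r = z / math.sqrt(n)
    return (xbar / (1 + r), xbar / (1 - r))

lo, hi = exp_mean_ci(xbar=1.0, n=100)   # endpoints as multiples of xbar
```

This reproduces the (0.84𝑥̄, 1.24𝑥̄) interval quoted above (to two decimal places).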

Example (Opinion poll). In an opinion poll, suppose 321 of 1003 voters said they would
vote for Party X. What's the underlying level of support for Party X?
Let 𝑋₁, . . . , 𝑋ₙ be a random sample from the Bernoulli(𝑝) distribution, i.e.

𝑃(𝑋ᵢ = 1) = 𝑝,  𝑃(𝑋ᵢ = 0) = 1 − 𝑝.

The MLE of 𝑝 is 𝑋̄. Also 𝐸(𝑋ᵢ) = 𝑝 and var(𝑋ᵢ) = 𝑝(1 − 𝑝) = 𝜎²(𝑝), say.

For large 𝑛, by the CLT,

(𝑋̄ − 𝑝)/(𝜎(𝑝)/√𝑛) ≈ 𝑁(0, 1).

So

1 − 𝛼 ≈ 𝑃(−𝑧_{𝛼/2} < (𝑋̄ − 𝑝)/(𝜎(𝑝)/√𝑛) < 𝑧_{𝛼/2})
    = 𝑃(𝑋̄ − 𝑧_{𝛼/2} 𝜎(𝑝)/√𝑛 < 𝑝 < 𝑋̄ + 𝑧_{𝛼/2} 𝜎(𝑝)/√𝑛).

The interval (𝑋̄ ± 𝑧_{𝛼/2} 𝜎(𝑝)/√𝑛) has approximate probability 1 − 𝛼 of containing the true 𝑝, but
it is not a confidence interval since its endpoints depend on 𝑝 via 𝜎(𝑝).
To get an approximate confidence interval:
• either, solve the inequality to get 𝑃(𝑎(X) < 𝑝 < 𝑏(X)) ≈ 1 − 𝛼 where 𝑎(X), 𝑏(X) don't
depend on 𝑝
• or, estimate 𝜎(𝑝) by 𝜎(𝑝̂) = √(𝑝̂(1 − 𝑝̂)) where 𝑝̂ = 𝑥̄ is the MLE. This gives endpoints

𝑝̂ ± 𝑧_{𝛼/2} √(𝑝̂(1 − 𝑝̂)/𝑛).

For 𝑛 = 1003 and 𝑝̂ = 𝑥̄ = 321/1003, this gives a 95% CI for 𝑝 of (0.29, 0.35) using the
two approximations: (i) the CLT and (ii) 𝜎(𝑝) approximated by 𝜎(𝑝̂).
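The plug-in interval for the poll is a one-liner; the counts are those from the text:

```python
import math

r, n = 321, 1003                               # poll counts from the text
p_hat = r / n                                  # MLE of p
se_hat = math.sqrt(p_hat * (1 - p_hat) / n)    # plug-in standard error
ci = (p_hat - 1.96 * se_hat, p_hat + 1.96 * se_hat)
```

This gives 𝑝̂ ≈ 0.320 and the interval (0.29, 0.35) quoted above (to two decimal places).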

Example (Opinion polls continued). Opinion polls often mention "±3% error."
Note that for any 𝑝,

𝜎²(𝑝) = 𝑝(1 − 𝑝) ⩽ 1/4

since 𝑝(1 − 𝑝) is maximised at 𝑝 = 1/2. Then we have

𝑃(𝑝̂ − 0.03 < 𝑝 < 𝑝̂ + 0.03) = 𝑃(−0.03/(𝜎(𝑝)/√𝑛) < (𝑝̂ − 𝑝)/(𝜎(𝑝)/√𝑛) < 0.03/(𝜎(𝑝)/√𝑛))
    ≈ Φ(0.03/(𝜎(𝑝)/√𝑛)) − Φ(−0.03/(𝜎(𝑝)/√𝑛))
    ⩾ Φ(0.03 √(4𝑛)) − Φ(−0.03 √(4𝑛)).

For this probability to be at least 0.95 we need 0.03 √(4𝑛) ⩾ 1.96, or 𝑛 ⩾ 1068. Opinion polls
typically use 𝑛 ≈ 1100.
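The sample-size bound can be wrapped in a small helper; the ±3% margin and 95% level are those from the text:

```python
import math

def min_sample_size(margin=0.03, z=1.96):
    # worst case sigma(p) <= 1/2, so we need margin * sqrt(4n) >= z,
    # i.e. n >= (z / (2 * margin))**2
    return math.ceil((z / (2 * margin)) ** 2)
```

Here `min_sample_size()` returns 1068, matching the bound derived above; a smaller margin requires a larger 𝑛.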

Standard errors

Definition. Let 𝜃̂ be an estimator of 𝜃 based on X. The standard error SE(𝜃̂) of 𝜃̂ is defined
by

SE(𝜃̂) = √var(𝜃̂).

Example.
• Let 𝑋₁, . . . , 𝑋ₙ ∼iid 𝑁(𝜇, 𝜎²). Then 𝜇̂ = 𝑋̄ and var(𝜇̂) = 𝜎²/𝑛. So SE(𝜇̂) = 𝜎/√𝑛.
• Let 𝑋₁, . . . , 𝑋ₙ ∼iid Bernoulli(𝑝). Then 𝑝̂ = 𝑋̄ and var(𝑝̂) = 𝑝(1 − 𝑝)/𝑛. So SE(𝑝̂) =
√(𝑝(1 − 𝑝)/𝑛).

Sometimes SE(𝜃̂) depends on 𝜃 itself, meaning that SE(𝜃̂) is unknown. In such cases we
have to plug in parameter estimates to get the estimated standard error. E.g. plug in to get
estimated standard errors SE(𝑋̄) = 𝜎̂/√𝑛 and SE(𝑝̂) = √(𝑝̂(1 − 𝑝̂)/𝑛).
The values plugged in (𝜎̂ and 𝑝̂ above) could be maximum likelihood, or other, estimates.
(We could write ŜE(𝑝̂), i.e. with a hat on the SE, to denote that 𝑝̂ has been plugged in, but
this is ugly so we won't; we'll just write SE(𝑝̂).)

If 𝜃̂ is unbiased, then MSE(𝜃̂) = var(𝜃̂) = [SE(𝜃̂)]². So the standard error (or estimated
standard error) gives some quantification of the accuracy of estimation.

If in addition 𝜃̂ is approximately 𝑁(𝜃, SE(𝜃̂)²) then, by the arguments used above, an
approximate 1 − 𝛼 CI for 𝜃 is given by 𝜃̂ ± 𝑧_{𝛼/2} SE(𝜃̂), where again we might need to plug in
to obtain the estimated standard error. Since, roughly, 𝑧_{0.025} = 2 and 𝑧_{0.001} = 3,

(estimate ± 2 estimated std errors) is an approximate 95% CI
(estimate ± 3 estimated std errors) is an approximate 99.8% CI.

The CLT justifies the normal approximation for 𝜃̂ = 𝑋̄, but 𝜃̂ ≈ 𝑁(𝜃, SE(𝜃̂)²) is also
appropriate for more general MLEs by other theory (see Part A).

Example. Suppose 𝑋₁, . . . , 𝑋ₙ ∼iid 𝑁(𝜇, 𝜎²) with 𝜇 and 𝜎² unknown.

The MLEs are 𝜇̂ = 𝑋̄ and 𝜎̂² = (1/𝑛) Σᵢ (𝑋ᵢ − 𝑋̄)², and SE(𝜇̂) = 𝜎/√𝑛 is unknown because 𝜎 is
unknown. So to use 𝜇̂ ± 𝑧_{𝛼/2} SE(𝜇̂) as the basis for a confidence interval we need to estimate
𝜎. One possibility is to use 𝜎̂ and so get the interval

(𝜇̂ ± 𝑧_{𝛼/2} 𝜎̂/√𝑛).

However we can improve on this:

(i) use 𝑠 (the sample standard deviation) instead of 𝜎̂ (better, as 𝐸(𝑆²) = 𝜎² whereas
𝐸(𝜎̂²) < 𝜎²)
(ii) use a critical value from a 𝑡-distribution instead of 𝑧_{𝛼/2} – see Part A Statistics
(better, as (𝑋̄ − 𝜇)/(𝑆/√𝑛) has a 𝑡-distribution, exactly, whereas using the CLT is an
approximation).

6 Linear Regression
Suppose we measure two variables in the same population:
𝑥 the explanatory variable
𝑦 the response variable.
Other possible names for 𝑥 are the predictor or feature or input variable or independent variable.
Other possible names for 𝑦 are the output variable or dependent variable.

Example. Let 𝑥 measure the GDP per head, and 𝑦 the CO2 emissions per head, for various
countries.




Figure 6.1. Scatterplot of measures of CO2 emissions per head against GDP per head, for 178
countries.

Questions of interest:
For fixed 𝑥, what is the average value of 𝑦?
How does that average value change with 𝑥?
A simple model for the dependence of 𝑦 on 𝑥 is

𝑦 = 𝛼 + 𝛽𝑥 + "error".

Note: a linear relationship like this does not necessarily imply that 𝑥 causes 𝑦.

More precise model

We regard the values of 𝑥 as being fixed and known, and we regard the values of 𝑦 as
being the observed values of random variables.
We suppose that
𝑌𝑖 = 𝛼 + 𝛽𝑥 𝑖 + 𝜖 𝑖 , 𝑖 = 1, . . . , 𝑛 (6.1)

27
where

𝑥1 , . . . , 𝑥 𝑛 are known constants


𝜖1 , . . . , 𝜖 𝑛 are i.i.d. 𝑁(0, 𝜎2 ) "random errors"
𝛼, 𝛽 are unknown parameters.

The “random errors” 𝜖₁, . . . , 𝜖ₙ represent random scatter of the points (𝑥ᵢ, 𝑦ᵢ) about the
line 𝑦 = 𝛼 + 𝛽𝑥: we do not expect these points to lie on a perfect straight line.
Sometimes we will refer to the above model as being 𝑌 = 𝛼 + 𝛽𝑥 + 𝜖.

Note.
1. The values 𝑦1 , . . . , 𝑦𝑛 (e.g. the CO2 emissions per head in the various countries) are
the observed values of random variables 𝑌1 , . . . , 𝑌𝑛 .
2. The values 𝑥 1 , . . . , 𝑥 𝑛 (e.g. the GDP per head in the various countries) do not
correspond to random variables. They are fixed and known constants.
Questions:
• How do we estimate 𝛼 and 𝛽?
• Does the mean of 𝑌 actually depend on the value of 𝑥? i.e. is 𝛽 ≠ 0?
We now find the MLEs of 𝛼 and 𝛽, and we regard 𝜎2 as being known. The MLEs of 𝛼 and 𝛽
are the same if 𝜎2 is unknown. If 𝜎2 is unknown, then we simply maximise over 𝜎2 as well
to obtain its MLE – this is no harder than what we do here (try it!). However, working out
all of the properties of this MLE is harder and beyond what we can do in this course.
From (6.1) we have 𝑌ᵢ ∼ 𝑁(𝛼 + 𝛽𝑥ᵢ, 𝜎²). So 𝑌ᵢ has p.d.f.

    𝑓(𝑦ᵢ) = (1/√(2𝜋𝜎²)) exp(−(𝑦ᵢ − 𝛼 − 𝛽𝑥ᵢ)²/(2𝜎²)),   −∞ < 𝑦ᵢ < ∞.

So the likelihood 𝐿(𝛼, 𝛽) is

    𝐿(𝛼, 𝛽) = ∏ᵢ₌₁ⁿ (1/√(2𝜋𝜎²)) exp(−(𝑦ᵢ − 𝛼 − 𝛽𝑥ᵢ)²/(2𝜎²))
            = (2𝜋𝜎²)^(−𝑛/2) exp(−(1/(2𝜎²)) Σᵢ₌₁ⁿ (𝑦ᵢ − 𝛼 − 𝛽𝑥ᵢ)²)

with log-likelihood

    ℓ(𝛼, 𝛽) = −(𝑛/2) log(2𝜋𝜎²) − (1/(2𝜎²)) Σᵢ₌₁ⁿ (𝑦ᵢ − 𝛼 − 𝛽𝑥ᵢ)².

So maximising ℓ(𝛼, 𝛽) over 𝛼 and 𝛽 is equivalent to minimising the sum of squares

    𝑆(𝛼, 𝛽) = Σᵢ₌₁ⁿ (𝑦ᵢ − 𝛼 − 𝛽𝑥ᵢ)².

For this reason the MLEs of 𝛼 and 𝛽 are also called least squares estimators.


[Scatterplot: data points and a fitted straight line, with vertical segments joining each
point to the line.]
Figure 6.2. Vertical distances from the points (𝑥 𝑖 , 𝑦 𝑖 ) to the line 𝑦 = 𝛼 + 𝛽𝑥. 𝑆(𝛼, 𝛽) is the sum of
these squared distances. The MLEs/least squares estimates of 𝛼 and 𝛽 minimise this sum.

Theorem 6.1. The MLEs (or, equivalently, the least squares estimates) of 𝛼 and 𝛽 are given by

    𝛼̂ = ((Σ𝑥ᵢ²)(Σ𝑦ᵢ) − (Σ𝑥ᵢ)(Σ𝑥ᵢ𝑦ᵢ)) / (𝑛Σ𝑥ᵢ² − (Σ𝑥ᵢ)²)

    𝛽̂ = (𝑛Σ𝑥ᵢ𝑦ᵢ − (Σ𝑥ᵢ)(Σ𝑦ᵢ)) / (𝑛Σ𝑥ᵢ² − (Σ𝑥ᵢ)²).

The sums in Theorem 6.1, and similar sums below, are from 𝑖 = 1 to 𝑛.

Proof. To find 𝛼̂ and 𝛽̂ we calculate

    𝜕𝑆/𝜕𝛼 = −2 Σ(𝑦ᵢ − 𝛼 − 𝛽𝑥ᵢ)
    𝜕𝑆/𝜕𝛽 = −2 Σ𝑥ᵢ(𝑦ᵢ − 𝛼 − 𝛽𝑥ᵢ).

Setting these partial derivatives equal to zero, the minimisers 𝛼̂ and 𝛽̂ satisfy

    𝑛𝛼̂ + 𝛽̂ Σ𝑥ᵢ = Σ𝑦ᵢ
    𝛼̂ Σ𝑥ᵢ + 𝛽̂ Σ𝑥ᵢ² = Σ𝑥ᵢ𝑦ᵢ.

Solving this pair of simultaneous equations for 𝛼̂ and 𝛽̂ gives the required MLEs. □

Note. Sometimes we consider the model

    𝑌ᵢ = 𝑎 + 𝑏(𝑥ᵢ − 𝑥̄) + 𝜖ᵢ,   𝑖 = 1, . . . , 𝑛

and find the MLEs of 𝑎 and 𝑏 by minimising Σ(𝑦ᵢ − 𝑎 − 𝑏(𝑥ᵢ − 𝑥̄))². This model is just an
alternative parametrisation of our original model: the first model is

    𝑌 = 𝛼 + 𝛽𝑥 + 𝜖

and the second is

    𝑌 = 𝑎 + 𝑏(𝑥 − 𝑥̄) + 𝜖 = (𝑎 − 𝑏𝑥̄) + 𝑏𝑥 + 𝜖.

Here 𝑌, 𝑥 denote general values of 𝑌, 𝑥 (and 𝑥̄ = (1/𝑛)Σ𝑥ᵢ is the mean of the 𝑛 data values
of 𝑥). Comparing the two model equations, 𝑏 = 𝛽 and 𝑎 − 𝑏𝑥̄ = 𝛼.

The interpretation of the parameters is that 𝛽 = 𝑏 is the increase in 𝐸(𝑌) when 𝑥 increases
by 1. The parameter 𝛼 is the value of 𝐸(𝑌) when 𝑥 is 0, whereas 𝑎 is the value of 𝐸(𝑌)
when 𝑥 is 𝑥̄.
• Alternative expressions for 𝛼̂ and 𝛽̂ are

    𝛽̂ = Σ(𝑥ᵢ − 𝑥̄)(𝑦ᵢ − 𝑦̄) / Σ(𝑥ᵢ − 𝑥̄)²   (6.2)
       = Σ(𝑥ᵢ − 𝑥̄)𝑦ᵢ / Σ(𝑥ᵢ − 𝑥̄)²   (6.3)

    𝛼̂ = 𝑦̄ − 𝛽̂𝑥̄.

The above alternative for 𝛼̂ follows directly from 𝜕𝑆/𝜕𝛼 = 0.

To obtain the alternatives for 𝛽̂: Theorem 6.1 gives

    𝛽̂ = (Σ𝑥ᵢ𝑦ᵢ − (1/𝑛)(Σ𝑥ᵢ)(Σ𝑦ᵢ)) / (Σ𝑥ᵢ² − (1/𝑛)(Σ𝑥ᵢ)²)
       = (Σ𝑥ᵢ𝑦ᵢ − 𝑛𝑥̄𝑦̄) / (Σ𝑥ᵢ² − 𝑛𝑥̄²).   (6.4)

Now check that the numerators and denominators in (6.2) and (6.4) are the same.
Then observe that the numerators of (6.2) and (6.3) differ by Σ(𝑥ᵢ − 𝑥̄)𝑦̄, which is 0.

• The fitted regression line is the line 𝑦 = 𝛼̂ + 𝛽̂𝑥. The point (𝑥̄, 𝑦̄) always lies on this
line.
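A minimal sketch of computing 𝛼̂ and 𝛽̂ via (6.2), on made-up data (the numbers are assumptions for illustration):

```python
# Least squares estimates via (6.2) and alpha_hat = ybar - beta_hat * xbar,
# on small made-up data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

sxx = sum((xi - xbar) ** 2 for xi in x)                       # sum (x_i - xbar)^2
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))  # sum (x_i - xbar)(y_i - ybar)

beta_hat = sxy / sxx                 # slope, formula (6.2)
alpha_hat = ybar - beta_hat * xbar   # intercept
print(alpha_hat, beta_hat)
```

As a check, the fitted line 𝑦 = 𝛼̂ + 𝛽̂𝑥 passes through (𝑥̄, 𝑦̄), as noted above.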

6.1 Properties of regression parameter estimators


Bias

Let 𝑤ᵢ = 𝑥ᵢ − 𝑥̄ and note Σ𝑤ᵢ = 0.

From (6.3) the maximum likelihood estimator of 𝛽 is

    𝛽̂ = (1/Σ𝑤ᵢ²) Σ𝑤ᵢ𝑌ᵢ

so

    𝐸(𝛽̂) = (1/Σ𝑤ᵢ²) 𝐸(Σ𝑤ᵢ𝑌ᵢ) = (1/Σ𝑤ᵢ²) Σ𝑤ᵢ𝐸(𝑌ᵢ).

Note 𝐸(𝑌ᵢ) = 𝛼 + 𝛽𝑥ᵢ = 𝛼 + 𝛽𝑥̄ + 𝛽𝑤ᵢ (using 𝑥ᵢ = 𝑤ᵢ + 𝑥̄), and so

    𝐸(𝛽̂) = (1/Σ𝑤ᵢ²) Σ𝑤ᵢ(𝛼 + 𝛽𝑥̄ + 𝛽𝑤ᵢ)
         = (1/Σ𝑤ᵢ²) ((𝛼 + 𝛽𝑥̄) Σ𝑤ᵢ + 𝛽 Σ𝑤ᵢ²)
         = 𝛽   since Σ𝑤ᵢ = 0.

Also 𝛼̂ = 𝑌̄ − 𝛽̂𝑥̄ and so

    𝐸(𝛼̂) = 𝐸(𝑌̄) − 𝑥̄𝐸(𝛽̂)
         = (1/𝑛) Σ𝐸(𝑌ᵢ) − 𝛽𝑥̄   since 𝐸(𝛽̂) = 𝛽
         = (1/𝑛) Σ(𝛼 + 𝛽𝑥ᵢ) − 𝛽𝑥̄
         = (1/𝑛) · 𝑛(𝛼 + 𝛽𝑥̄) − 𝛽𝑥̄
         = 𝛼.

So 𝛼̂ and 𝛽̂ are unbiased.

Note the unbiasedness of 𝛼̂, 𝛽̂ does not depend on the assumptions that the 𝜖ᵢ are
independent, normal and have the same variance, only on the assumptions that the errors
are additive and 𝐸(𝜖ᵢ) = 0.

Variance

We are usually only interested in the variance of 𝛽̂:

    var(𝛽̂) = var(Σ𝑤ᵢ𝑌ᵢ / Σ𝑤ᵢ²)
           = (1/(Σ𝑤ᵢ²)²) var(Σ𝑤ᵢ𝑌ᵢ)
           = (1/(Σ𝑤ᵢ²)²) Σ𝑤ᵢ² var(𝑌ᵢ)   since the 𝑌ᵢ are independent
           = (1/(Σ𝑤ᵢ²)²) Σ𝑤ᵢ² 𝜎²
           = 𝜎² / Σ𝑤ᵢ².

Since 𝛽̂ is a linear combination of the independent normal random variables 𝑌ᵢ, the
estimator 𝛽̂ is itself normal: 𝛽̂ ∼ 𝑁(𝛽, 𝜎𝛽²) where 𝜎𝛽² = 𝜎²/Σ𝑤ᵢ².

So the standard error of 𝛽̂ is 𝜎𝛽 and if 𝜎² is known, then a 95% CI for 𝛽 is

    (𝛽̂ ± 1.96𝜎𝛽).

However, this is only a valid CI when 𝜎² is known and, in practice, 𝜎² is rarely known.
For 𝜎² unknown we need to plug-in an estimate of 𝜎², i.e. use 𝜎̂𝛽² = 𝜎̂²/Σ𝑤ᵢ² where 𝜎̂² is
some estimate of 𝜎². For example we could use the MLE, which is 𝜎̂² = (1/𝑛) Σ(𝑦ᵢ − 𝛼̂ − 𝛽̂𝑥ᵢ)².

Using the 𝜃̂ ± 2 SE(𝜃̂) approximation for a 95% confidence interval, we have that (𝛽̂ ± 2𝜎̂𝛽)
is an approximate 95% confidence interval for 𝛽.

[A better approach here, but beyond the scope of this course, is to estimate 𝜎² using

    𝑠² = (1/(𝑛 − 2)) Σ(𝑦ᵢ − 𝛼̂ − 𝛽̂𝑥ᵢ)²

and to base the confidence interval on a 𝑡-distribution rather than a normal distribution.
This estimator 𝑆² is unbiased for 𝜎² (see Sheet 5), but details about its distribution and the
𝑡-distribution are beyond this course – see Parts A/B.]
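Putting the pieces together on made-up data (the 𝑥, 𝑦 values are assumptions): this sketch computes 𝛽̂, the plug-in estimate 𝜎̂𝛽 using the MLE 𝜎̂², and the rough interval 𝛽̂ ± 2𝜎̂𝛽.

```python
import math

# Assumed data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 2.1, 2.8, 4.3, 4.9, 6.2]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
w = [xi - xbar for xi in x]            # w_i = x_i - xbar
sww = sum(wi ** 2 for wi in w)         # sum w_i^2

beta_hat = sum(wi * yi for wi, yi in zip(w, y)) / sww   # formula (6.3)
alpha_hat = ybar - beta_hat * xbar

resid = [yi - alpha_hat - beta_hat * xi for xi, yi in zip(x, y)]
sigma2_hat = sum(e ** 2 for e in resid) / n             # MLE of sigma^2
se_beta = math.sqrt(sigma2_hat / sww)                   # estimated SE of beta_hat

ci = (beta_hat - 2 * se_beta, beta_hat + 2 * se_beta)   # approx 95% CI for beta
print(beta_hat, se_beta, ci)
```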

7 Multiple Linear Regression
For the material in the last two sections of these notes (this section and the next), Chapter 3
of the book An Introduction to Statistical Learning by James, Witten, Hastie and Tibshirani
(2013) is useful – we cover some, but not all, of the material in that chapter. The whole
book is freely available online at:
http://www-bcf.usc.edu/~gareth/ISL/
We’ll refer to this book as JWHT (2013).

Example (Hill races). Let 𝑌 = time, 𝑥 1 = distance, 𝑥 2 = climb.


We can consider a model in which 𝑌 depends on both 𝑥 1 and 𝑥 2 , of the form

𝑌 = 𝛽 0 + 𝛽1 𝑥 1 + 𝛽2 𝑥2 + 𝜖.

This model has one response variable 𝑌 (as usual), but now we have two explanatory
variables 𝑥1 and 𝑥2 , and three regression parameters 𝛽 0 , 𝛽 1 , 𝛽2 .
Let the 𝑖th race have time 𝑦 𝑖 , distance 𝑥 𝑖1 and climb 𝑥 𝑖2 . Then in more detail our model
is

𝑌𝑖 = 𝛽0 + 𝛽 1 𝑥 𝑖1 + 𝛽 2 𝑥 𝑖2 + 𝜖 𝑖 , 𝑖 = 1, . . . , 𝑛

where

𝑥ᵢ₁, 𝑥ᵢ₂, for 𝑖 = 1, . . . , 𝑛, are known constants
𝜖₁, . . . , 𝜖ₙ are i.i.d. 𝑁(0, 𝜎²)
𝛽₀, 𝛽₁, 𝛽₂ are unknown parameters

and (as usual) 𝑦 𝑖 denotes the observed value of the random variable 𝑌𝑖 .
As for simple linear regression we obtain the MLEs/least squares estimates of 𝛽0 , 𝛽 1 , 𝛽 2 by
minimising

    𝑆(𝛽) = Σᵢ₌₁ⁿ (𝑦ᵢ − 𝛽₀ − 𝛽₁𝑥ᵢ₁ − 𝛽₂𝑥ᵢ₂)²

with respect to 𝛽₀, 𝛽₁, 𝛽₂, i.e. by solving 𝜕𝑆/𝜕𝛽ₖ = 0 for 𝑘 = 0, 1, 2.

As before, the only property of the 𝜖 𝑖 needed to define the least squares estimates is
𝐸(𝜖 𝑖 ) = 0.
A general multiple regression model has 𝑝 explanatory variables (𝑥 1 , . . . , 𝑥 𝑝 ),

𝑌 = 𝛽0 + 𝛽1 𝑥1 + · · · + 𝛽 𝑝 𝑥 𝑝 + 𝜖

and the MLEs/least squares estimates are obtained by minimising

    Σᵢ₌₁ⁿ (𝑦ᵢ − 𝛽₀ − 𝛽₁𝑥ᵢ₁ − · · · − 𝛽ₚ𝑥ᵢₚ)²

with respect to 𝛽₀, 𝛽₁, . . . , 𝛽ₚ. In this course we will focus on 𝑝 = 1 or 2.

Example (Quadratic regression). The relationship between 𝑌 and 𝑥 may be approximately
quadratic in which case we can consider the model 𝑌 = 𝛽 0 + 𝛽1 𝑥 + 𝛽2 𝑥 2 + 𝜖. This is the
case 𝑝 = 2 with 𝑥 1 = 𝑥 and 𝑥 2 = 𝑥 2 .

Example. For convenience write the two explanatory variables as 𝑥 and 𝑧. So suppose

    𝑌ᵢ = 𝛽₀ + 𝛽₁𝑥ᵢ + 𝛽₂𝑧ᵢ + 𝜖ᵢ,   𝑖 = 1, . . . , 𝑛

where the 𝜖ᵢ are i.i.d. 𝑁(0, 𝜎²), and assume Σ𝑥ᵢ = Σ𝑧ᵢ = 0.

Then minimising 𝑆(𝛽) gives

    𝛽̂₀ = Σ𝑦ᵢ / 𝑛
    𝛽̂₁ = (1/Δ) (Σ𝑧ᵢ² Σ𝑥ᵢ𝑦ᵢ − Σ𝑥ᵢ𝑧ᵢ Σ𝑧ᵢ𝑦ᵢ)
    𝛽̂₂ = (1/Δ) (Σ𝑥ᵢ² Σ𝑧ᵢ𝑦ᵢ − Σ𝑥ᵢ𝑧ᵢ Σ𝑥ᵢ𝑦ᵢ)

where

    Δ = Σ𝑥ᵢ² Σ𝑧ᵢ² − (Σ𝑥ᵢ𝑧ᵢ)².

The method (solving 𝜕𝑆/𝜕𝛽ₖ = 0 for 𝑘 = 0, 1, 2, i.e. 3 equations in 3 unknowns) is
straightforward, the algebra less so – there are more elegant ways to do some of this
(using matrices, in the 3rd year).
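A sketch of these closed-form estimates on made-up centred data (the 𝑥, 𝑧, 𝑦 values are assumptions, chosen so that Σ𝑥ᵢ = Σ𝑧ᵢ = 0):

```python
# Two-regressor least squares via the closed-form expressions above,
# on assumed data with centred regressors (sum(x) = sum(z) = 0).
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
z = [1.0, -1.0, 0.0, -1.0, 1.0]
y = [0.5, 1.8, 3.1, 4.2, 5.9]

sxx = sum(xi ** 2 for xi in x)
szz = sum(zi ** 2 for zi in z)
sxz = sum(xi * zi for xi, zi in zip(x, z))
sxy = sum(xi * yi for xi, yi in zip(x, y))
szy = sum(zi * yi for zi, yi in zip(z, y))
delta = sxx * szz - sxz ** 2          # the quantity Delta

beta0 = sum(y) / len(y)
beta1 = (szz * sxy - sxz * szy) / delta
beta2 = (sxx * szy - sxz * sxy) / delta
print(beta0, beta1, beta2)
```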

Interpretation of regression coefficients

Consider the model with 𝑝 = 2, so two regressors 𝑥 1 and 𝑥 2 :

𝑌 = 𝛽 0 + 𝛽1 𝑥 1 + 𝛽2 𝑥2 + 𝜖.

We interpret 𝛽 1 as the average effect on 𝑌 of a one unit increase in 𝑥1 , holding the other
regressor 𝑥 2 fixed.
Similarly we interpret 𝛽2 as the average effect on 𝑌 of a one unit increase in 𝑥2 , holding the
other regressor 𝑥 1 fixed.
In these interpretations “average” means we are talking about the change in 𝐸(𝑌) when
changing 𝑥 1 (or 𝑥2 ).
One important thing to note here is that 𝑥₁ and 𝑥₂ often change together. E.g. in the hill
races example, a race whose distance is one mile greater will usually have an increased
value of climb as well. This makes interpretation more difficult.

8 Assessing the fit of a model
Having fitted a model, we should consider how well it fits the data. A model is normally
an approximation to reality: is the approximation sufficiently good that the model is
useful? This question applies to mathematical models in general. In this course we will
approach the question by considering the fit of a simple linear regression (generalisations
are possible).
For the model 𝑌 = 𝛼 + 𝛽𝑥 + 𝜖 let b 𝛼, b
𝛽 be the usual estimates of 𝛼, 𝛽 based on the observation
pairs (𝑥 1 , 𝑦1 ), . . . , (𝑥 𝑛 , 𝑦𝑛 ).
From now on we consider this model, with the usual assumptions about 𝜖, unless otherwise
stated.

Definition. The 𝑖th fitted value 𝑦̂ᵢ of 𝑌 is defined by 𝑦̂ᵢ = 𝛼̂ + 𝛽̂𝑥ᵢ, for 𝑖 = 1, . . . , 𝑛.
The 𝑖th residual 𝑒ᵢ is defined by 𝑒ᵢ = 𝑦ᵢ − 𝑦̂ᵢ, for 𝑖 = 1, . . . , 𝑛.
The residual sum of squares RSS is defined by RSS = Σ𝑒ᵢ².
The residual standard error RSE is defined by RSE = √(RSS/(𝑛 − 2)).

The RSE is an estimate of the standard deviation 𝜎. If the fitted values are close to the
observed values, i.e. 𝑦̂ᵢ ≈ 𝑦ᵢ for all 𝑖 (so that the 𝑒ᵢ are small), then the RSE will be small.
Conversely, if one or more of the 𝑒ᵢ are large then the RSE will be larger.
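A short sketch computing fitted values, residuals, RSS and RSE for a simple linear fit (the data are made up):

```python
import math

# Assumed data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.1, 5.9, 8.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
beta_hat = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
            / sum((xi - xbar) ** 2 for xi in x))
alpha_hat = ybar - beta_hat * xbar

fitted = [alpha_hat + beta_hat * xi for xi in x]    # fitted values y_hat_i
resid = [yi - fi for yi, fi in zip(y, fitted)]      # residuals e_i
rss = sum(e ** 2 for e in resid)                    # residual sum of squares
rse = math.sqrt(rss / (n - 2))                      # residual standard error
print(rss, rse)
```

Note that the residuals sum to zero for a least squares fit with an intercept, consistent with 𝐸(𝑒ᵢ) = 0.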
We have 𝐸(𝑒ᵢ) = 0. In taking this expectation, we treat 𝑦ᵢ as the random variable 𝑌ᵢ, and we
treat 𝑦̂ᵢ as the random variable 𝛼̂ + 𝛽̂𝑥ᵢ (in particular, 𝛼̂ and 𝛽̂ are estimators, not estimates).
Hence

    𝐸(𝑒ᵢ) = 𝐸(𝑌ᵢ − 𝛼̂ − 𝛽̂𝑥ᵢ)
          = 𝐸(𝑌ᵢ) − 𝐸(𝛼̂) − 𝐸(𝛽̂)𝑥ᵢ
          = 𝐸(𝛼 + 𝛽𝑥ᵢ + 𝜖ᵢ) − 𝛼 − 𝛽𝑥ᵢ   since 𝛼̂, 𝛽̂ are unbiased
          = 𝛼 + 𝛽𝑥ᵢ + 𝐸(𝜖ᵢ) − 𝛼 − 𝛽𝑥ᵢ
          = 0   since 𝐸(𝜖ᵢ) = 0.

8.1 Potential problem: non-linearity


The model 𝑌 = 𝛼 + 𝛽𝑥 + 𝜖 assumes a straight-line relationship between 𝑌 (the response)
and 𝑥 (the predictor). If the true relationship is far from linear then any conclusions (e.g.
predictions) we draw from the fit will be suspect.
A residual plot is a useful graphical tool for identifying non-linearity: for simple linear
regression we can plot the residuals 𝑒ᵢ against the fitted values 𝑦̂ᵢ. Ideally the plot will
show no pattern. The existence of a pattern may indicate a problem with some aspect of
the linear model.

Note that for simple linear regression (i.e. the case 𝑝 = 1) plotting 𝑒ᵢ against 𝑥ᵢ gives an
equivalent plot, just with a different horizontal scale, since there is an exact linear relation
between the 𝑥ᵢ and 𝑦̂ᵢ (i.e. 𝑦̂ᵢ = 𝛼̂ + 𝛽̂𝑥ᵢ).

[Plotting 𝑒ᵢ against 𝑦̂ᵢ generalises better to multiple regression.]

8.2 Potential problem: non-constant variance of errors
We have assumed that the errors have a constant variance, i.e. var(𝑌𝑖 ) = var(𝜖 𝑖 ) = 𝜎2 . That
is, the same variance for all 𝑖. Unfortunately, this is often not true. E.g. the variance of the
error may increase as 𝑌 increases.
Non-constant variance is also called heteroscedasticity. We can identify this from the
presence of a funnel-type shape in the residual plot.
How might we deal with non-constant variance of the errors?
(i) One possibility is to transform the response 𝑌 using a transformation such as log 𝑌 or
√𝑌 (which shrinks larger responses more), leading to a reduction in heteroscedasticity.

(ii) Sometimes we might have a good idea about the variance of 𝑌ᵢ: we might think
var(𝑌ᵢ) = var(𝜖ᵢ) = 𝜎²/𝑤ᵢ where 𝜎² is unknown but the 𝑤ᵢ are known. E.g. if 𝑌ᵢ is
actually the mean of 𝑛ᵢ observations, where each of these 𝑛ᵢ observations is made
at 𝑥 = 𝑥ᵢ, then var(𝑌ᵢ) = 𝜎²/𝑛ᵢ. So 𝑤ᵢ = 𝑛ᵢ in this case.

It is straightforward to show (exercise) that the MLEs of 𝛼, 𝛽 are obtained by
minimising

    Σᵢ₌₁ⁿ 𝑤ᵢ(𝑦ᵢ − 𝛼 − 𝛽𝑥ᵢ)².   (8.1)

The form of (8.1) is intuitively correct: if 𝑤ᵢ is small then var(𝑌ᵢ) is large, so there is a
lot of uncertainty about observation 𝑖, so this observation shouldn't affect the fit too
much – this is achieved in (8.1) by observation 𝑖 being weighted by the small value
of 𝑤ᵢ. Hence this approach is called weighted least squares.
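A sketch of weighted least squares. Setting the derivatives of (8.1) to zero gives the ordinary formulas with weighted means in place of ordinary means; this derivation is the exercise mentioned above, not spelled out in the notes, and the data and weights here are assumptions.

```python
# Weighted least squares: minimise sum w_i (y_i - alpha - beta*x_i)^2.
# Here w_i = n_i, as if y_i were the mean of n_i observations at x_i (assumed data).
x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.0, 2.9, 4.2]
w = [5, 2, 8, 1]   # known weights

W = sum(w)
xw = sum(wi * xi for wi, xi in zip(w, x)) / W   # weighted mean of x
yw = sum(wi * yi for wi, yi in zip(w, y)) / W   # weighted mean of y

beta_hat = (sum(wi * (xi - xw) * (yi - yw) for wi, xi, yi in zip(w, x, y))
            / sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x)))
alpha_hat = yw - beta_hat * xw
print(alpha_hat, beta_hat)
```

The estimates satisfy the weighted normal equations: the weighted residuals Σ𝑤ᵢ𝑒ᵢ and Σ𝑤ᵢ𝑥ᵢ𝑒ᵢ are both zero.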

8.3 Potential problem: outliers


An outlier is a point for which 𝑦ᵢ is far from the value 𝑦̂ᵢ predicted by the model. Outliers
can arise for a variety of reasons, e.g. incorrect recording of an observation during data
collection.

Residual plots can be used to identify outliers. Recall that 𝐸(𝑒ᵢ) = 0. But in practice it can
be difficult to decide how large (i.e. how far from the expected value of zero) a residual
needs to be before we consider it a possible outlier. To address this we can plot studentized
residuals instead of residuals, where

    𝑖th studentized residual = 𝑒ᵢ / SE(𝑒ᵢ).

Theorem 8.1. var(𝑒ᵢ) = 𝜎²(1 − ℎᵢ) where

    ℎᵢ = 1/𝑛 + (𝑥ᵢ − 𝑥̄)² / Σⱼ₌₁ⁿ (𝑥ⱼ − 𝑥̄)².

Definition. The 𝑖th studentized residual 𝑟ᵢ is defined by

    𝑟ᵢ = 𝑒ᵢ / (𝑠 √(1 − ℎᵢ))

where 𝑠 = RSE is the residual standard error.

[Here we follow the terminology in the synopses and JWHT (2013) and call the 𝑟 𝑖
“studentized” residuals. Some authors call the 𝑟 𝑖 “standardized” residuals and save the
word “studentized” to mean something similar but different.]
So the 𝑟ᵢ are all on a comparable scale, each having a standard deviation of about 1.
Following JWHT (2013) we will say that observations with |𝑟ᵢ| > 3 are possible outliers.
If we believe an outlier is due to an error in data collection, then one solution is to simply
remove the observation from the data and re-fit the model. However, an outlier may
instead indicate a problem with the model, e.g. a nonlinear relationship between 𝑌 and 𝑥,
so care must be taken.
Similarly, this kind of problem could arise if we have a missing regressor, i.e. we could be
using 𝑌 = 𝛽0 + 𝛽 1 𝑥 1 + 𝜖 when we should really be using 𝑌 = 𝛽 0 + 𝛽1 𝑥 1 + 𝛽 2 𝑥 2 + 𝜖.

Proof of Theorem 8.1

Idea of the proof: write 𝑒ᵢ = Σⱼ 𝑎ⱼ𝑌ⱼ and use

    var(Σⱼ 𝑎ⱼ𝑌ⱼ) = Σⱼ 𝑎ⱼ² var(𝑌ⱼ) = 𝜎² Σⱼ 𝑎ⱼ²   (8.2)

since the 𝑌ⱼ are independent with var(𝑌ⱼ) = 𝜎² for all 𝑗. Here and below, all sums are from
1 to 𝑛.

First recall

    𝛽̂ = Σⱼ(𝑥ⱼ − 𝑥̄)𝑌ⱼ / 𝑆ₓₓ   where 𝑆ₓₓ = Σₖ(𝑥ₖ − 𝑥̄)²

and

    𝛼̂ = 𝑌̄ − 𝛽̂𝑥̄
      = (1/𝑛) Σⱼ𝑌ⱼ − (Σⱼ(𝑥ⱼ − 𝑥̄)𝑌ⱼ / 𝑆ₓₓ) 𝑥̄
      = Σⱼ (1/𝑛 − 𝑥̄(𝑥ⱼ − 𝑥̄)/𝑆ₓₓ) 𝑌ⱼ.

So

    𝑦̂ᵢ = 𝛼̂ + 𝛽̂𝑥ᵢ
       = Σⱼ (1/𝑛 − 𝑥̄(𝑥ⱼ − 𝑥̄)/𝑆ₓₓ) 𝑌ⱼ + (Σⱼ(𝑥ⱼ − 𝑥̄)𝑌ⱼ / 𝑆ₓₓ) 𝑥ᵢ
       = Σⱼ (1/𝑛 + (𝑥ᵢ − 𝑥̄)(𝑥ⱼ − 𝑥̄)/𝑆ₓₓ) 𝑌ⱼ.   (8.3)

We can write

    𝑌ᵢ = Σⱼ 𝛿ᵢⱼ𝑌ⱼ   (8.4)

where 𝛿ᵢⱼ = 1 if 𝑖 = 𝑗 and 𝛿ᵢⱼ = 0 otherwise.

Note 𝛿ᵢⱼ² = 𝛿ᵢⱼ and Σⱼ𝛿ᵢⱼ² = Σⱼ𝛿ᵢⱼ = 1. So, using (8.3) and (8.4),

    𝑒ᵢ = 𝑌ᵢ − 𝛼̂ − 𝛽̂𝑥ᵢ
       = Σⱼ (𝛿ᵢⱼ − 1/𝑛 − (𝑥ᵢ − 𝑥̄)(𝑥ⱼ − 𝑥̄)/𝑆ₓₓ) 𝑌ⱼ.

As the 𝑌ⱼ are independent, as at (8.2),

    var(𝑒ᵢ) = Σⱼ (𝛿ᵢⱼ − 1/𝑛 − (𝑥ᵢ − 𝑥̄)(𝑥ⱼ − 𝑥̄)/𝑆ₓₓ)² var(𝑌ⱼ)
            = 𝜎² Σⱼ (𝛿ᵢⱼ² + 1/𝑛² + (𝑥ᵢ − 𝑥̄)²(𝑥ⱼ − 𝑥̄)²/𝑆ₓₓ² − (2/𝑛)𝛿ᵢⱼ
                      − 2𝛿ᵢⱼ(𝑥ᵢ − 𝑥̄)(𝑥ⱼ − 𝑥̄)/𝑆ₓₓ + (2/𝑛)(𝑥ᵢ − 𝑥̄)(𝑥ⱼ − 𝑥̄)/𝑆ₓₓ)
            = 𝜎² (1 + 1/𝑛 + (𝑥ᵢ − 𝑥̄)²/𝑆ₓₓ − 2/𝑛 − 2(𝑥ᵢ − 𝑥̄)²/𝑆ₓₓ + (2/𝑛)((𝑥ᵢ − 𝑥̄)/𝑆ₓₓ) Σⱼ(𝑥ⱼ − 𝑥̄))
            = 𝜎² (1 − 1/𝑛 − (𝑥ᵢ − 𝑥̄)²/𝑆ₓₓ)
            = 𝜎²(1 − ℎᵢ),

since Σⱼ(𝑥ⱼ − 𝑥̄) = 0. □

8.4 Potential problem: high leverage points


Outliers are observations for which the response 𝑦ᵢ is unusual given the value of 𝑥ᵢ. On
the other hand, observations with high leverage have an unusual value of 𝑥ᵢ.

Definition. The leverage of the 𝑖th observation is ℎᵢ, where

    ℎᵢ = 1/𝑛 + (𝑥ᵢ − 𝑥̄)² / Σⱼ₌₁ⁿ (𝑥ⱼ − 𝑥̄)².

Clearly ℎᵢ depends only on the values of the 𝑥ᵢ (it doesn't depend on the 𝑦ᵢ values at all).
We see that ℎᵢ increases with the distance of 𝑥ᵢ from 𝑥̄.

High leverage points tend to have a sizeable impact on the regression line. Since
var(𝑒ᵢ) = 𝜎²(1 − ℎᵢ), a large leverage ℎᵢ will make var(𝑒ᵢ) small. And then since 𝐸(𝑒ᵢ) = 0,
this means that the regression line will be “pulled” close to the point (𝑥ᵢ, 𝑦ᵢ).

Note that Σᵢℎᵢ = 2, so the average leverage is ℎ̄ = 2/𝑛. One rule-of-thumb is to regard
points with a leverage more than double this as high leverage points, i.e. points with
ℎᵢ > 4/𝑛.
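A sketch computing leverages and studentized residuals on made-up data, with one deliberately extreme 𝑥 value; it applies the two rules of thumb (|𝑟ᵢ| > 3 for possible outliers, ℎᵢ > 4/𝑛 for high leverage):

```python
import math

# Assumed data; the last point has x = 10, far from the rest in x.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.0, 2.2, 2.9, 4.1, 10.5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
beta_hat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
alpha_hat = ybar - beta_hat * xbar

resid = [yi - alpha_hat - beta_hat * xi for xi, yi in zip(x, y)]
rse = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))          # s = RSE

h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]               # leverages h_i
r = [e / (rse * math.sqrt(1 - hi)) for e, hi in zip(resid, h)] # studentized residuals r_i

high_leverage = [i for i, hi in enumerate(h) if hi > 4 / n]    # rule of thumb h_i > 4/n
outliers = [i for i, ri in enumerate(r) if abs(ri) > 3]        # rule of thumb |r_i| > 3
print(high_leverage, outliers)
```

Here only the extreme-𝑥 point is flagged as high leverage; since it sits close to the fitted line, it is not flagged as an outlier. Note also that the leverages sum to 2, as stated above.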

Why does this matter? We should be concerned if the regression line is heavily affected by
just a couple of points, because any problems with these points might invalidate the entire
fit. Hence it is important to identify high leverage observations.

SLIDES. Slide with example involving points 𝐴 and 𝐵 goes here.


Removing point 𝐵 – which has high leverage and a high residual – has a much more
substantial impact on the regression line than removing the outlier had at the end of
Section 8.3.
In the plot of studentized residuals against leverage:
• point B has high leverage and a high residual: it is having a substantial effect on the
regression line, yet it still has a high residual
• in contrast, point A has a high residual (the model isn't fitting well at A), but we saw
in Section 8.3 that A isn't affecting the fit much at all – the reason A has little
effect on the fit is that it has low leverage.
