Notes
Regression
Korbinian Strimmer
20 January 2023
Contents

Welcome
  License

Preface
  About the author
  About the module
  Acknowledgements

II Bayesian Statistics

7 Conditioning and Bayes' rule
  7.1 Conditional probability
  7.2 Bayes' theorem
  7.3 Conditional mean and variance
  7.4 Conditional entropy and entropy chain rules
  7.5 Entropy bounds for the marginal variables

Appendix

A Refresher
  A.1 Basic mathematical notation
  A.2 Vectors and matrices

Bibliography
Welcome
These are the lecture notes for MATH20802, a course in Statistical Methods
for second year mathematics students at the Department of Mathematics of the
University of Manchester.
The course text was written by Korbinian Strimmer from 2019–2023. This version
is from 20 January 2023.
The notes will be updated from time to time. To view the current version visit
the online MATH20802 lecture notes.
You may also download the MATH20802 lecture notes as PDF. For a paper copy
it is recommended to print two pages per sheet.
License
These notes are licensed to you under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Preface
Prerequisites
For this module it is important that you refresh your knowledge in:
• Introduction to statistics
• Probability
• R data analysis and programming
In addition you will need some elements of matrix algebra and to know how to compute the gradient and the curvature of a function of several variables.
Check the Appendix of these notes for a brief refresher of the essential material.
Acknowledgements
Many thanks to Beatriz Costa Gomes for her help in creating the 2019 version
of the lecture notes when I was teaching this module for the first time and to
Kristijonas Raudys for his extensive feedback on the 2020 version.
Part I
Chapter 1

Overview of statistical learning
• It was also in the 1950s that the concept of the artificial neural network arose, essentially a nonlinear input-output map that works in a non-probabilistic way. This field saw another leap in the 1980s and further progress from 2010 onwards with the development of deep learning. It is now one of the most popular (and most effective) methods for the analysis of imaging data; even your mobile phone most likely has a dedicated computer chip with specialised neural network hardware.
• Further advanced theories of information were developed in the 1960s under the term of computational learning, most notably the Vapnik–Chervonenkis theory, with the “support vector machine” (another non-probabilistic model) as its most prominent example.
• With the advent of large-scale genomic and other high-dimensional data there has been a surge of new and exciting developments in the field of high-dimensional data (large dimension) and big data (large dimension and large sample size), both in statistics and in machine learning.
The connections between the various fields of information are still not perfectly understood, but it is clear that an overarching theory will need to be based on probabilistic learning.
processes (on purpose or not), but not because the underlying process is actually
random. The success of statistics is based on the fact that we can mathematically
model the uncertainty without knowing any specifics of the underlying processes,
and we still have procedures for optimal inference under uncertainty.
In short, statistics is about describing the state of knowledge of the world, which may be uncertain and incomplete, and about making decisions and predictions in the face of uncertainty. This uncertainty sometimes derives from randomness but most often from our ignorance (and sometimes this ignorance even helps to create a simple yet effective model)!
Model world (“how the world works”):
  hypotheses  M₁, θ₁
              M₂, θ₂
              …

Real world:
  unknown true model (M_true, θ_true)  −→  data x₁, …, x_n
The aim of statistical learning is to identify the model(s) that explain the current data and also predict future data (i.e. predict the outcomes of experiments that have not been conducted yet).
Thus a good model provides a good fit to the current data (i.e. it explains current
observations well) and also to future data (i.e. it generalises well).
1.4 Likelihood
In statistics and machine learning most models that are being used are prob-
abilistic to take account of both randomness and uncertainty. A core task in
statistical learning is to identify those models that explain the existing data well
and that also generalise well to unseen data.
For this we need, among other things, a measure of how well a candidate
model approximates the (typically unknown) true data generating model and
an approach to choose the best model(s). One such approach is provided by
the method of maximum likelihood that enables us to estimate parameters of
models and to find the particular model that is the best fit to the data.
Given a probability distribution P_θ with density or mass function p(x|θ), where θ is a parameter vector, and observed iid (independent and identically distributed) data D = {x₁, …, x_n}, the likelihood function is defined as

L_n(θ|D) = ∏_{i=1}^n p(x_i|θ)

and the corresponding log-likelihood function is

l_n(θ|D) = log L_n(θ|D) = ∑_{i=1}^n log p(x_i|θ)
Reasons for preferring the log-likelihood (rather than likelihood) include that
• the log-density is in fact the more “natural” and relevant quantity (this
will become clear in the upcoming chapters) and that
• addition is numerically more stable than multiplication on a computer.
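The numerical point can be demonstrated directly. Below is a minimal Python sketch (the N(0, 1) sample is a hypothetical illustration, not data from the text): the product of a thousand density values underflows double precision to zero, while the sum of log-densities remains stable.

```python
import math
import random

random.seed(1)

# Hypothetical iid sample from N(0, 1), purely for illustration
n = 1000
data = [random.gauss(0.0, 1.0) for _ in range(n)]

def norm_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Likelihood as a direct product: underflows to exactly 0.0 in double precision
L = 1.0
for x in data:
    L *= norm_pdf(x)

# Log-likelihood as a sum of log-densities: numerically stable
log_L = sum(math.log(norm_pdf(x)) for x in data)

print(L)      # 0.0 (underflow)
print(log_L)  # a finite value around -1400
```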
For discrete random variables for which 𝑝(𝑥|𝜽) is a probability mass function
the likelihood is often interpreted as the probability to observe the data given
the model with specified parameters θ. In fact, this was indeed how the likelihood was historically introduced. However, this view is not strictly
correct. First, given that the samples are iid and thus the ordering of the 𝑥 𝑖 is
not important, an additional factor accounting for the possible permutations is
needed in the likelihood to obtain the actual probability of the data. Moreover,
for continuous random variables this interpretation breaks down due to the use
of densities rather than probability mass functions in the likelihood. Thus, the
view of the likelihood being the probability of the data is in fact too simplistic.
In the next chapter we will see that the justification for using likelihood rather
stems from its close link to the Kullback-Leibler information and cross-entropy.
This also helps to understand why using likelihood for estimation is only optimal
in the limit of large sample size.
In the first part of the MATH20802 “Statistical Methods” module we will study
likelihood estimation and inference in much detail. We will provide links to
related methods of inference and discuss its information-theoretic foundations.
We will also discuss the optimality properties as well as the limitations of likelihood inference. Extensions of likelihood analysis, in particular Bayesian learning, will be discussed in the second part of the module. In the third part of the module we will apply statistical learning to linear regression.
Chapter 2

From entropy to maximum likelihood
2.1 Entropy
2.1.1 Overview
In this chapter we discuss various information criteria and their connection to
maximum likelihood.
log( p/(1 − p) ) = −log(1 − p) − ( −log(p) )
In this module we always use the natural logarithm by default, and will explicitly write log₂ and log₁₀ for logarithms with respect to base 2 and 10, respectively. Surprise and entropy computed with the natural logarithm (log) are given in “nats” (= natural information units). Using log₂ leads to “bits” and using log₁₀ to “ban” or “Hartley”.
H(P) = −∑_{k=1}^K p_k log(p_k)
log K ≥ H(P) ≥ 0

Note that log K is the largest value the Shannon entropy can assume with K classes.
Example 2.2. Concentrated probability mass: let p₁ = 1 and p₂ = p₃ = … = p_K = 0. Using 0 × log(0) = 0 we obtain for the Shannon entropy H(P) = 0.

Note that 0 is the smallest value that Shannon entropy can assume, and corresponds to maximum concentration.
Thus, large entropy implies that the distribution is spread out whereas small
entropy means the distribution is concentrated.
Correspondingly, maximum entropy distributions can be considered minimally
informative about a random variable.
This interpretation is also supported by the close link of Shannon entropy with
multinomial coefficients counting the permutations of 𝑛 items (samples) of 𝐾
distinct types (classes).
Example 2.3. Large sample asymptotics of the log-multinomial coefficient and
link to Shannon entropy:
The number of possible permutations of n items of K distinct types, with n₁ of type 1, n₂ of type 2 and so on, is given by the multinomial coefficient

W = (n choose n₁, …, n_K) = n! / ( n₁! × n₂! × … × n_K! )

with ∑_{k=1}^K n_k = n and K ≤ n.
Now recall the Moivre–Stirling formula, which for large n allows us to approximate the factorial by

log(n!) ≈ n log(n) − n
With this

log W = log (n choose n₁, …, n_K)
      = log(n!) − ∑_{k=1}^K log(n_k!)
      ≈ n log(n) − n − ∑_{k=1}^K ( n_k log(n_k) − n_k )
      = n log(n) − ∑_{k=1}^K n_k log(n_k)
      = ∑_{k=1}^K n_k log(n) − ∑_{k=1}^K n_k log(n_k)
      = −n ∑_{k=1}^K (n_k/n) log(n_k/n)

and thus

(1/n) log (n choose n₁, …, n_K) ≈ −∑_{k=1}^K p̂_k log(p̂_k) = H(P̂)

where P̂ is the empirical categorical distribution with p̂_k = n_k/n.
The combinatorial derivation of Shannon entropy is now credited to Wallis (1962) but was already used nearly a century earlier by Boltzmann (1877), who discovered it in his work in statistical mechanics (recall that S = k_B log W is the Boltzmann entropy).
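The asymptotic link between the log-multinomial coefficient and the Shannon entropy can be checked numerically. A small Python sketch, with hypothetical counts n₁ = 500, n₂ = 300, n₃ = 200 chosen purely for illustration, using the log-gamma function for the exact log-factorials:

```python
import math

counts = [500, 300, 200]   # hypothetical class counts, for illustration
n = sum(counts)

# Exact log multinomial coefficient, using log(n!) = lgamma(n + 1)
log_W = math.lgamma(n + 1) - sum(math.lgamma(nk + 1) for nk in counts)

# Shannon entropy (in nats) of the empirical distribution p_k = n_k / n
H = -sum((nk / n) * math.log(nk / n) for nk in counts)

print(log_W / n)  # ≈ 1.023
print(H)          # ≈ 1.030, close to log_W / n as the approximation predicts
```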
Despite having essentially the same formula, the different name is justified because differential entropy exhibits different properties compared to Shannon entropy: the logarithm is taken of a density which, in contrast to a probability, can assume values larger than one. As a consequence, differential entropy is not bounded below by zero and can be negative.
2.1.6 Cross-entropy
If in the definition of Shannon entropy (and differential entropy) the expectation over the log-density (say g(x) of distribution G) is taken with regard to a different distribution F over the same state space we arrive at the cross-entropy

H(F, G) = −E_F( log g(x) )

The difference between cross-entropy and entropy is the relative entropy, also known as the Kullback–Leibler (KL) divergence, and it is non-negative:

D_KL(F, G) = H(F, G) − H(F) ≥ 0
Note that in the KL divergence the expectation is taken over a ratio of densities
(or ratio of probabilities for discrete random variables). This is what creates the
transformation invariance.
For more details and proofs of properties 3 and 4 see Worksheet E1.
D_KL(Ber(p), Ber(q)) = p log(p/q) + (1 − p) log( (1 − p)/(1 − q) )
D_KL( N(μ_ref, σ²), N(μ, σ²) ) = (μ − μ_ref)² / (2σ²)
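As a quick numerical illustration (the probabilities used are arbitrary choices), the Bernoulli KL divergence above can be evaluated directly; note that it vanishes when the two distributions agree and that it is not symmetric in its arguments:

```python
import math

def kl_bernoulli(p, q):
    """KL divergence D_KL(Ber(p), Ber(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

print(kl_bernoulli(0.3, 0.3))  # 0.0: the divergence vanishes iff the distributions agree
print(kl_bernoulli(0.3, 0.5))  # ≈ 0.0823
print(kl_bernoulli(0.5, 0.3))  # ≈ 0.0872: D_KL is asymmetric, hence not a metric
```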
D_KL(F_θ₀, F_{θ₀+ε}) = h(θ₀ + ε) ≈ (1/2) εᵀ ∇∇ᵀ h(θ₀) ε
                     = (1/2) εᵀ ( −E_{F_θ₀}( ∇∇ᵀ log f(x|θ₀) ) ) ε
                     = (1/2) εᵀ I_Fisher(θ₀) ε

where I_Fisher(θ₀) = −E_{F_θ₀}( ∇∇ᵀ log f(x|θ₀) ) is the expected Fisher information. Similarly, with the two arguments interchanged,

D_KL(F_{θ₀+ε}, F_θ₀) ≈ (1/2) εᵀ I_Fisher(θ₀ + ε) ε
                     ≈ (1/2) εᵀ I_Fisher(θ₀) ε + εᵀ Δ_ε ε
                     ≈ (1/2) εᵀ I_Fisher(θ₀) ε

keeping only terms quadratic in ε.
Note that there is no data involved in computing the expected Fisher information,
hence it is purely a property of the model, or more precisely of the space of the
models indexed by 𝜽. In the next Chapter we will study a related quantity, the
observed Fisher information that in contrast is a function of the observed data.
2.3.2 Examples
Example 2.11. Expected Fisher information for the Bernoulli distribution:
The log-probability mass function of the Bernoulli Ber(p) distribution is

log f(x|p) = x log(p) + (1 − x) log(1 − p)

where p is the proportion of “success”. The second derivative with regard to the parameter p is
d²/dp² log f(x|p) = −x/p² − (1 − x)/(1 − p)²

Since E(x) = p we get as Fisher information

I_Fisher(p) = −E( d²/dp² log f(x|p) ) = p/p² + (1 − p)/(1 − p)² = 1/( p(1 − p) )
D_KL(Ber(p₁), Ber(p₂)) = p₁ log(p₁/p₂) + (1 − p₁) log( (1 − p₁)/(1 − p₂) )

and hence, for a small perturbation ε,

D_KL(Ber(p), Ber(p + ε)) ≈ (ε²/2) I_Fisher(p) = ε² / ( 2p(1 − p) )
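The quadratic approximation above is easy to verify numerically. A Python sketch (p = 0.3 and ε = 0.01 are arbitrary illustration values) compares the exact KL divergence with ε²/(2p(1 − p)):

```python
import math

p, eps = 0.3, 0.01   # arbitrary values for illustration

# Exact KL divergence between Ber(p) and Ber(p + eps)
exact = p * math.log(p / (p + eps)) + (1 - p) * math.log((1 - p) / (1 - p - eps))

# Quadratic approximation via the expected Fisher information 1 / (p (1 - p))
approx = eps ** 2 / (2 * p * (1 - p))

print(exact)   # ≈ 0.000235
print(approx)  # ≈ 0.000238, matching the exact value to about one percent
```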
Hint for calculating the gradient: replace 𝜎 2 by 𝑣 and then take the partial
derivative with regard to 𝑣, then substitute back.
The Hessian matrix is

∇∇ᵀ log f(x|μ, σ²) =
  [ −1/σ²            −(x − μ)/σ⁴               ]
  [ −(x − μ)/σ⁴      1/(2σ⁴) − (x − μ)²/σ⁶     ]

As E(x) = μ we have E(x − μ) = 0. Furthermore, with E((x − μ)²) = σ² we see that E( (x − μ)²/σ⁶ ) = 1/σ⁴. Therefore the expected Fisher information matrix, as the negative expected Hessian matrix, is

I_Fisher(μ, σ²) =
  [ 1/σ²     0         ]
  [ 0        1/(2σ⁴)   ]
The relative entropy 𝐷KL (𝐹, 𝐺𝜽 ) then measures the divergence of the approxima-
tion 𝐺𝜽 from the unknown true model 𝐹. It can be written as:
D_KL(F, G_θ) = H(F, G_θ) − H(F)
             = −E_F( log g_θ(x) ) − ( −E_F( log f(x) ) )

where the first term is the cross-entropy and the second term, the entropy of F, does not depend on θ.
Hence, for large sample size 𝑛 we can approximate cross-entropy and as a result
the KL divergence. The cross-entropy 𝐻(𝐹, 𝐺𝜽 ) is approximated by the empirical
cross-entropy where the expectation is taken with regard to 𝐹ˆ 𝑛 rather than 𝐹:
H(F, G_θ) ≈ H(F̂_n, G_θ)
          = −E_{F̂_n}( log g(x|θ) )
          = −(1/n) ∑_{i=1}^n log g(x_i|θ)
          = −(1/n) l_n(θ|D)
This turns out to be equal to the negative log-likelihood standardised by the
sample size 𝑛! Or in other words, the log-likelihood is the negative empirical
cross-entropy multiplied by sample size 𝑛.
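This identity can be illustrated in a few lines of Python (the Bernoulli data with n₁ = 21 successes out of n = 30 trials and the candidate parameter θ = 0.6 are hypothetical choices):

```python
import math

n, n1 = 30, 21       # hypothetical data: n1 successes in n Bernoulli trials
x_bar = n1 / n
theta = 0.6          # an arbitrary candidate parameter value

# Log-likelihood under Ber(theta)
log_lik = n1 * math.log(theta) + (n - n1) * math.log(1 - theta)

# Empirical cross-entropy H(F_hat_n, G_theta)
cross_entropy = -(x_bar * math.log(theta) + (1 - x_bar) * math.log(1 - theta))

print(log_lik)             # ≈ -18.97
print(-n * cross_entropy)  # the same number: l_n = -n * H(F_hat_n, G_theta)
```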
From the link of the multinomial coefficient with Shannon entropy (Example
2.3) we already know that for large sample size
H(F̂) ≈ (1/n) log (n choose n₁, …, n_K)

and therefore

D_KL(F, G_θ) ≈ −(1/n) ( log (n choose n₁, …, n_K) + l_n(θ|D) )
Thus, with the KL divergence we obtain not just the log-likelihood (the cross-
entropy part) but also the multiplicity factor taking account of the possible
orderings of the data (the entropy part).
However, for large sample size 𝑛 when the empirical distribution 𝐹ˆ 𝑛 is a good ap-
proximation for 𝐹, we can use the results from the previous section. Thus, instead
of minimising the KL divergence 𝐷KL (𝐹, 𝐺𝜽 ) we simply minimise 𝐻(𝐹ˆ 𝑛 , 𝐺𝜽 )
which is the same as maximising the log-likelihood l_n(θ|D). Note that the entropy of the true distribution F (and of the corresponding empirical distribution F̂) does not depend on the parameters θ and hence does not matter when minimising the divergence.
Conversely, this implies that maximising the likelihood with regard to θ is equivalent (asymptotically for large n) to minimising the KL divergence between the approximating model and the unknown true model!
θ̂_ML = arg max_θ l_n(θ|D)
Chapter 3

Maximum likelihood estimation
between the unknown true data generating model and the approximating model
𝐹𝜽 . Note that the log-likelihood is additive over the samples 𝑥 𝑖 .
The maximum likelihood point estimate θ̂_ML is then given by maximising the (log-)likelihood

θ̂_ML = arg max_θ l_n(θ|D)
Thus, finding the MLE is an optimisation problem that in practice is most often solved numerically on the computer, using approaches such as gradient ascent (or, for the negative log-likelihood, gradient descent) and related algorithms. Depending on the complexity of the likelihood function finding the maximum can be very difficult.
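As a toy illustration of such a numerical approach, the following Python sketch maximises a Bernoulli log-likelihood by plain gradient ascent on its first derivative (the data, n₁ = 21 successes out of n = 30 trials, and the step size are hypothetical choices); the iteration recovers the analytic maximum at the observed frequency 0.7:

```python
# Hypothetical data: n1 successes in n Bernoulli trials
n, n1 = 30, 21

def score(p):
    """First derivative of the Bernoulli log-likelihood l_n(p)."""
    return n1 / p - (n - n1) / (1 - p)

# Plain gradient ascent: repeatedly step uphill along the derivative
p = 0.5                    # starting value
for _ in range(2000):
    p += 1e-4 * score(p)   # small step in the ascent direction

print(round(p, 4))   # 0.7, the observed frequency n1 / n
```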
For a scalar parameter θ the score function is the first derivative of the log-likelihood function:

S_n(θ) = d l_n(θ|D) / dθ
At the MLE the score function vanishes,

S_n(θ̂_ML) = 0

and for a scalar parameter the second derivative at the MLE is negative:

d² l_n(θ̂_ML|D) / dθ² < 0
In the case of a parameter vector (multivariate θ) you need to compute the Hessian matrix (the matrix of second order derivatives) at the MLE,

∇∇ᵀ l_n(θ̂_ML|D)

and then verify that this matrix is negative definite (i.e. all its eigenvalues must be negative).
As we will see later the second order derivatives of the log-likelihood function
also play an important role for assessing the uncertainty of the MLE.
If the true model is contained among the specified candidate models, i.e.

F_true ⊂ {F_θ}

then for large n

θ̂_ML −→ θ_true
• the average of the data points is x̄ = (1/n) ∑_{i=1}^n x_i = n₁/n.
• the probability mass function (PMF) of the Bernoulli distribution Ber(p) is:

  f(x|p) = p^x (1 − p)^(1−x), i.e. f(x|p) = p if x = 1 and f(x|p) = 1 − p if x = 0
• log-PMF:
log 𝑓 (𝑥|𝑝) = 𝑥 log(𝑝) + (1 − 𝑥) log(1 − 𝑝)
• log-likelihood function:

  l_n(p|D) = ∑_{i=1}^n log f(x_i|p)
           = n₁ log(p) + (n − n₁) log(1 − p)
           = n ( x̄ log(p) + (1 − x̄) log(1 − p) )
Note how the log-likelihood depends on the data only through x̄! This is an example of a sufficient statistic for the parameter p (in fact it is also a minimally sufficient statistic). This will be discussed in more detail later.
• Score function:

  S_n(p) = d l_n(p|D) / dp = n ( x̄/p − (1 − x̄)/(1 − p) )

  Setting S_n(p̂_ML) = 0 yields the maximum likelihood estimate p̂_ML = x̄.
With

d S_n(p) / dp = −n ( x̄/p² + (1 − x̄)/(1 − p)² ) < 0

the optimum corresponds indeed to the maximum of the (log-)likelihood function, as this derivative is negative for p̂_ML (and indeed for any p).
The maximum likelihood estimator of 𝑝 is therefore identical to the fre-
quency of the successes among all observations.
Note that to analyse the coin tossing experiment and to estimate p we may equally well use the binomial distribution Bin(n, p) as the model for the number of successes. In this case we then have only a single observation, namely the observed k. This results in the same MLE for p, but the likelihood function based on the binomial PMF includes the binomial coefficient (n choose k). However, as this factor does not depend on p it disappears in the score function and has no influence on the derivation of the MLE.
• Log-density:

  log f(x|μ) = −(1/2) log(2πσ²) − (x − μ)²/(2σ²)
• Log-likelihood function:

  l_n(μ|D) = ∑_{i=1}^n log f(x_i|μ)
           = −(1/(2σ²)) ∑_{i=1}^n (x_i − μ)² − (n/2) log(2πσ²)

  The last term is a constant that does not depend on μ and can be removed (such constants are denoted C below). Expanding the square,

  l_n(μ|D) = −(1/(2σ²)) ∑_{i=1}^n (x_i² − 2 x_i μ + μ²) + C
           = (n/σ²) ( x̄ μ − μ²/2 ) − (1/(2σ²)) ∑_{i=1}^n x_i² + C

  where the remaining sum is another constant term. Note how the non-constant terms of the log-likelihood depend on the data only through x̄!
• Score function:

  S_n(μ) = (n/σ²) ( x̄ − μ )

  S_n(μ̂_ML) = 0 ⇒ μ̂_ML = x̄
• With d S_n(μ) / dμ = −n/σ² < 0 the optimum is indeed the maximum.
The constant term 𝐶 in the log-likelihood function collects all terms that do not
depend on the parameter. After taking the first derivative with regard to the
parameter this term disappears thus 𝐶 is not relevant for finding the MLE of
the parameter. In the future we will often omit such constant terms from the
log-likelihood function without further mention.
Example 3.3. Normal distribution with mean and variance both unknown:
• 𝑥 ∼ 𝑁(𝜇, 𝜎2 ) with E(𝑥) = 𝜇 and Var(𝑥) = 𝜎 2
• both 𝜇 and 𝜎2 need to be estimated.
What’s the MLE of the parameter vector 𝜽 = (𝜇, 𝜎2 )𝑇 ?
• the data D = {x₁, …, x_n} are all real, with x_i ∈ (−∞, ∞).
• the average x̄ = (1/n) ∑_{i=1}^n x_i is real as well.
• the average of the squared data, x²̄ = (1/n) ∑_{i=1}^n x_i² ≥ 0, is non-negative.
• Density:

  f(x|μ, σ²) = (2πσ²)^(−1/2) exp( −(x − μ)²/(2σ²) )
• Log-density:

  log f(x|μ, σ²) = −(1/2) log(2πσ²) − (x − μ)²/(2σ²)
• Log-likelihood function:

  l_n(θ|D) = l_n(μ, σ²|D) = ∑_{i=1}^n log f(x_i|μ, σ²)
           = −(n/2) log(σ²) − (1/(2σ²)) ∑_{i=1}^n (x_i − μ)² − (n/2) log(2π)
           = −(n/2) log(σ²) − (n/(2σ²)) ( x²̄ − 2 x̄ μ + μ² ) + C

  where C = −(n/2) log(2π) is a constant not depending on μ or σ². Note how the log-likelihood function depends on the data only through x̄ and x²̄!
• Score function S_n, the gradient of l_n(θ|D):

  S_n(θ) = ∇ l_n(μ, σ²|D) = ( (n/σ²)(x̄ − μ),  −n/(2σ²) + (n/(2σ⁴)) (x²̄ − 2x̄μ + μ²) )ᵀ

  Note that to obtain the second component of the score function the partial derivative needs to be taken with regard to the variance parameter σ², not with regard to σ! Hint: replace σ² = v in the log-likelihood function, then take the partial derivative with regard to v, then backsubstitute v = σ² in the result.
• Maximum likelihood estimate:

  S(θ̂_ML) = 0 ⇒ θ̂_ML = ( μ̂_ML, σ̂²_ML )ᵀ = ( x̄, x²̄ − x̄² )ᵀ
μ̂_ML = x̄ ∼ N( μ, σ²/n )

with E(σ̂²_ML) = ((n − 1)/n) σ².
Therefore, the MLE of 𝜇 is unbiased as
Bias(𝜇ˆ 𝑀𝐿 ) = E(𝜇ˆ 𝑀𝐿 ) − 𝜇 = 0
In contrast, however, the MLE of 𝜎2 is negatively biased because
Bias(σ̂²_ML) = E(σ̂²_ML) − σ² = −σ²/n
Thus, in the case of the variance parameter of the normal distribution the MLE
is not recovering the well-known unbiased estimator of the variance
σ̂²_UB = (1/(n − 1)) ∑_{i=1}^n (x_i − x̄)² = (n/(n − 1)) σ̂²_ML
Conversely, the unbiased estimator is not a maximum likelihood estimate!
Therefore it is worth keeping in mind that maximum likelihood can result in
biased estimates for finite 𝑛. For large 𝑛, however, the bias disappears as MLEs
are consistent.
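The negative bias is easy to see by simulation. A Python sketch, assuming a hypothetical setting with n = 5 observations drawn from N(0, 1) (so σ² = 1): averaged over many repetitions the variance MLE comes out near (n − 1)/n · σ² = 0.8 rather than 1.

```python
import random

random.seed(42)

n, reps = 5, 50000   # hypothetical sample size and number of repetitions
total = 0.0
for _ in range(reps):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    x_bar = sum(xs) / n
    total += sum((x - x_bar) ** 2 for x in xs) / n   # variance MLE (divide by n)

avg = total / reps
print(avg)   # ≈ 0.8 = (n - 1)/n * sigma^2, demonstrating the negative bias
```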
[Figure: two likelihood functions plotted over θ ∈ [0, 10], one flat around its maximum (low curvature) and one sharply peaked (strong curvature).]

If the likelihood surface is flat around the maximum (low curvature) then it is more difficult to find the optimal parameter (also numerically!). Conversely, if the likelihood surface is peaked (strong curvature) then the maximum point is clearly defined.
The curvature is described by the second-order derivatives (Hessian matrix) of
the log-likelihood function.
For univariate θ the Hessian is a scalar:

d² l_n(θ|D) / dθ²
By construction the Hessian is negative definite at the MLE (i.e. its eigenvalues are all negative), which ensures that the function is concave at the MLE (i.e. peak shaped).
The observed Fisher information (matrix) is defined as the negative curvature
at the MLE 𝜽ˆ 𝑀𝐿 :
𝑱 𝑛 (𝜽ˆ 𝑀𝐿 ) = −∇∇𝑇 𝑙𝑛 (𝜽ˆ 𝑀𝐿 |𝐷)
Compare this with Var(x/n) = p(1 − p)/n for x ∼ Bin(n, p).
Example 3.5. Normal distribution with unknown mean and known variance:

This is the continuation of Example 3.2. Recall the MLE for the mean, μ̂_ML = (1/n) ∑_{i=1}^n x_i = x̄, and the score function S_n(μ) = (n/σ²)(x̄ − μ). The negative derivative of the score function gives the observed Fisher information

J_n(μ̂_ML) = −d S_n(μ)/dμ = n/σ²

with inverse

J_n(μ̂_ML)⁻¹ = σ²/n

For x_i ∼ N(μ, σ²) we have Var(x_i) = σ² and hence Var(x̄) = σ²/n, which is equal to the inverse observed Fisher information.
Example 3.6. Normal distribution with mean and variance parameter:
This is the continuation of Example 3.3. Recall the MLE for the mean and
variance:
μ̂_ML = (1/n) ∑_{i=1}^n x_i = x̄

σ̂²_ML = (1/n) ∑_{i=1}^n (x_i − x̄)² = x²̄ − x̄²
∇∇ᵀ l_n(μ, σ²|D) =
  [ −n/σ²              −(n/σ⁴)(x̄ − μ)                    ]
  [ −(n/σ⁴)(x̄ − μ)     n/(2σ⁴) − (n/σ⁶)(x²̄ − 2μx̄ + μ²) ]
The negative Hessian at the MLE, i.e. at 𝜇ˆ 𝑀𝐿 = 𝑥¯ and 𝜎b2 𝑀𝐿 = 𝑥 2 − 𝑥¯ 2 yields the
observed Fisher information matrix:
J_n(μ̂_ML, σ̂²_ML) =
  [ n/σ̂²_ML     0               ]
  [ 0            n/(2(σ̂²_ML)²) ]
Note that the observed Fisher information matrix is diagonal with positive entries. Therefore its eigenvalues are all positive, as required for a maximum, because for a diagonal matrix the eigenvalues are simply the entries on the diagonal.
The inverse of the observed Fisher information matrix is

J_n(μ̂_ML, σ̂²_ML)⁻¹ =
  [ σ̂²_ML/n     0               ]
  [ 0            2(σ̂²_ML)²/n   ]
μ̂_ML = x̄ ∼ N( μ, σ²/n )

Hence Var(μ̂_ML) = σ²/n. If you compare this with the first diagonal entry of the inverse observed Fisher information matrix you see that this is essentially the same expression (apart from the “hat”).
σ̂²_ML ∼ (σ²/n) χ²_{n−1}

with variance Var(σ̂²_ML) = ((n − 1)/n) · (2σ⁴/n). For large n this becomes Var(σ̂²_ML) = 2σ⁴/n, which is essentially (apart from the “hat”) the second diagonal entry of the inverse observed Fisher information matrix.
Chapter 4

Quadratic approximation and normal asymptotics

The covariance matrix of a random vector x with mean μ = E(x) is

Σ = Var(x) = E( (x − μ)(x − μ)ᵀ ) = E(xxᵀ) − μμᵀ
The entries of the covariance matrix 𝜎𝑖𝑗 = Cov(𝑥 𝑖 , 𝑥 𝑗 ) describe the covariance
between the random variables 𝑥 𝑖 and 𝑥 𝑗 . The covariance matrix is symmetric,
hence 𝜎𝑖𝑗 = 𝜎 𝑗𝑖 . The diagonal entries 𝜎𝑖𝑖 = Cov(𝑥 𝑖 , 𝑥 𝑖 ) = Var(𝑥 𝑖 ) = 𝜎𝑖2 correspond
to the variances of the components of x. The covariance matrix is positive semi-definite, i.e. the eigenvalues of Σ are all positive or equal to zero. However, in practice one aims to use non-singular covariance matrices, with all eigenvalues positive, so that they are invertible.
A covariance matrix can be factorised into the product

Σ = V^(1/2) P V^(1/2)

where V = diag(σ₁₁, …, σ_dd) is the diagonal matrix containing the variances and P (“upper case rho”) is the symmetric correlation matrix

P = (ρ_ij) = V^(−1/2) Σ V^(−1/2)

with entries

ρ_ij = Cor(x_i, x_j) = σ_ij / √(σ_ii σ_jj)
f(x|μ, σ²) = ( 1/√(2πσ²) ) exp( −(x − μ)²/(2σ²) )
f(x|μ, Σ) = (2π)^(−d/2) det(Σ)^(−1/2) exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )

Note that the argument of the exponential is a scalar: (x − μ)ᵀ is 1×d, Σ⁻¹ is d×d and (x − μ) is d×1.
Σ̂_ML = (1/n) ∑_{k=1}^n x_k x_kᵀ − x̄ x̄ᵀ

and the components of the mean estimate are

μ̂_i = (1/n) ∑_{k=1}^n x_ki with μ̂ = (μ̂₁, …, μ̂_d)ᵀ
σ̂_ij = (1/n) ∑_{k=1}^n (x_ki − μ̂_i)(x_kj − μ̂_j) with Σ̂ = (σ̂_ij)
μ̂ = (1/n) Xᵀ 1_n

Here 1_n is a vector of length n containing 1 at each component. Similarly

Σ̂ = (1/n) Xᵀ X − μ̂ μ̂ᵀ
To simplify the expression for the estimate of the covariance matrix one often
assumes that the data matrix is centered, i.e. that 𝝁ˆ = 0.
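A sketch of these matrix formulas with NumPy (assuming it is available; the random 1000 × 3 data matrix is purely illustrative), checking the result against the built-in empirical covariance with the biased 1/n convention:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data matrix X with n rows (observations) and d columns (variables)
n, d = 1000, 3
X = rng.normal(size=(n, d))

# Mean vector: mu_hat = (1/n) X^T 1_n
mu_hat = X.T @ np.ones(n) / n

# Covariance MLE: Sigma_hat = (1/n) X^T X - mu_hat mu_hat^T
Sigma_hat = X.T @ X / n - np.outer(mu_hat, mu_hat)

# Same result as NumPy's empirical covariance with the 1/n convention
print(np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True)))  # True
```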
l₁(μ|x) = C − (1/2) (x − μ)ᵀ Σ⁻¹ (x − μ)

where C is a constant that does not depend on μ. Note that the log-likelihood is exactly quadratic and the maximum lies at (x, C).
4.2 Approximate distribution of maximum likelihood estimates
We assume the underlying model is regular and that ∇𝑙𝑛 (𝜽ˆ 𝑀𝐿 |𝐷) = 0.
The Taylor series approximation of a scalar-valued function f(x) around x₀ is

f(x) = f(x₀) + ∇f(x₀)ᵀ (x − x₀) + (1/2) (x − x₀)ᵀ ∇∇ᵀ f(x₀) (x − x₀) + …
Applied to the log-likelihood function this yields

l_n(θ|D) ≈ l_n(θ̂_ML|D) − (1/2) (θ̂_ML − θ)ᵀ J_n(θ̂_ML) (θ̂_ML − θ)
This is a quadratic function with maximum at (𝜽ˆ 𝑀𝐿 , 𝑙𝑛 (𝜽ˆ 𝑀𝐿 |𝐷)). Note the natural
appearance of the observed Fisher information 𝐽𝑛 (𝜽ˆ 𝑀𝐿 ) in the quadratic term.
There is no linear term because of the vanishing gradient at the MLE.
Crucially, we realise that the approximation has the same form as if 𝜽ˆ 𝑀𝐿 was a
sample from a multivariate normal distribution with mean 𝜽 and with covariance
given by the inverse observed Fisher information! Note that this requires a positive
definite observed Fisher information matrix so that 𝐽𝑛 (𝜽ˆ 𝑀𝐿 ) is actually invertible!
Example 4.2. Quadratic approximation of the log-likelihood for a proportion:
From Example 3.1 we have the log-likelihood
l_n(p|D) = n ( x̄ log(p) + (1 − x̄) log(1 − p) )

with maximum value

l_n(p̂_ML|D) = n ( x̄ log(x̄) + (1 − x̄) log(1 − x̄) )
Using the observed Fisher information J_n(p̂_ML) = n/(x̄(1 − x̄)) (Example 3.4), the quadratic approximation around the MLE p̂_ML = x̄ is

l_n(p|D) ≈ C − ( n/(2 x̄(1 − x̄)) ) (p − x̄)²

The constant C does not depend on p; its function is to match the approximate log-likelihood at the MLE with that of the corresponding original log-likelihood. The approximate log-likelihood takes on the form of a normal log-likelihood (Example 3.2) for one observation of p̂_ML = x̄ from N(p, x̄(1 − x̄)/n).
The following figure shows the above log-likelihood function and its quadratic
approximation for example data with 𝑛 = 30 and 𝑥¯ = 0.7:
[Figure: the log-likelihood l_n(p) and its quadratic approximation, both attaining the maximum log-likelihood at the MLE p̂_ML = 0.7.]
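The comparison shown in the figure can be reproduced numerically. A Python sketch for the same example data (n = 30, x̄ = 0.7), using J_n(p̂_ML) = n/(x̄(1 − x̄)) for the quadratic term: near the MLE the two curves agree closely, while further away the quadratic approximation deteriorates.

```python
import math

n, x_bar = 30, 0.7   # example data from the text

def log_lik(p):
    """Bernoulli log-likelihood l_n(p | D)."""
    return n * (x_bar * math.log(p) + (1 - x_bar) * math.log(1 - p))

def quad_approx(p):
    """Quadratic approximation l_n(p_hat) - (1/2) J_n(p_hat) (p - p_hat)^2."""
    J = n / (x_bar * (1 - x_bar))
    return log_lik(x_bar) - 0.5 * J * (p - x_bar) ** 2

for p in (0.5, 0.65, 0.7, 0.75, 0.9):
    print(p, round(log_lik(p), 3), round(quad_approx(p), 3))
# near p = 0.7 the two values coincide; at p = 0.5 or p = 0.9 they drift apart
```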
θ̂_ML ∼ᵃ N_d( θ, J_n(θ̂_ML)⁻¹ )

i.e. the MLE is asymptotically multivariate normally distributed, with mean vector θ and covariance matrix given by the inverse observed Fisher information.
This theorem about the distributional properties of MLEs greatly enhances the usefulness of the method of maximum likelihood. It implies that in regular settings maximum likelihood is not just a method for obtaining point estimates but also provides estimates of their uncertainty.
3) However, 𝑛 going to infinity is in fact not always required for the normal
approximation to hold! Depending on the particular model a good local fit
to a quadratic log-likelihood may be available also for finite 𝑛. As a trivial
example, for the normal log-likelihood it is valid for any 𝑛.
Remarks:
• The technical details of the above considerations are worked out in the
theory of locally asymptotically normal (LAN) models pioneered in 1960 by Lucien Le Cam (1924–2000).
• There are also methods to obtain higher-order (higher than quadratic and
thus non-normal) asymptotic approximations. These relate to so-called
saddle point approximations.
Var(θ̂) ≥ (1/n) I_Fisher(θ)⁻¹

which puts a lower bound on the variance of an estimator of θ. (Note for d > 1 this is a matrix inequality, meaning that the difference matrix is positive semidefinite.)
For large sample size, with n → ∞ and θ̂_ML → θ, the observed Fisher information becomes J_n(θ̂) → n I_Fisher(θ) and therefore we can write the asymptotic distribution of θ̂_ML as

θ̂_ML ∼ᵃ N_d( θ, (1/n) I_Fisher(θ)⁻¹ )
This means that for large 𝑛 in regular models 𝜽ˆ 𝑀𝐿 achieves the lowest variance
possible according to the Cramér-Rao information inequality. In other words, for
large sample size maximum likelihood is optimally efficient and thus the best
available estimator will in fact be the MLE!
However, as we will see later this does not hold for small sample size where it
is indeed possible (and necessary) to improve over the MLE (e.g. via Bayesian
estimation or regularisation).
V̂ar(θ̂_ML) = J_n(θ̂_ML)⁻¹

ŜD(θ̂_ML) = J_n(θ̂_ML)^(−1/2)
Note that in the above we use matrix inversion and the (inverse) matrix square root.
The reasons for preferring observed Fisher information are made mathematically
precise in a classic paper by Efron and Hinkley (1978)1 .
Example 4.3. Estimated variance and distribution of the MLE of a proportion:

From Examples 3.1 and 3.4 we know the MLE

p̂_ML = x̄ = k/n

and the corresponding observed Fisher information

J_n(p̂_ML) = n / ( p̂_ML (1 − p̂_ML) )

Hence the asymptotic distribution of the MLE is

p̂_ML ∼ᵃ N( p, p̂_ML (1 − p̂_ML) / n )
¹ Efron, B., and D. V. Hinkley. 1978. Assessing the accuracy of the maximum likelihood estimator: observed versus expected Fisher information. Biometrika 65:457–482. https://doi.org/10.1093/biomet/65.3.457
Example 4.4. Estimated variance and distribution of the MLE of the mean
parameter for the normal distribution with known variance:
From Examples 3.2 and 3.5 we know that
μ̂_ML = x̄

with estimated variance

V̂ar(μ̂_ML) = σ²/n

and the corresponding asymptotic normal distribution is

μ̂_ML ∼ N( μ, σ²/n )
Note that in this case the distribution is not asymptotic but is exact, i.e. valid also
for small 𝑛 (as long as the data 𝑥 𝑖 are actually from 𝑁(𝜇, 𝜎2 )!).
t(θ₀) = ŜD(θ̂_ML)⁻¹ (θ̂_ML − θ₀) = J_n(θ̂_ML)^(1/2) (θ̂_ML − θ₀)
Note that in the literature both 𝒕(𝜽0 ) and 𝑡(𝜽 0 )2 are commonly referred to as
Wald statistics. In this text we use the qualifier “squared” if we refer to the latter.
We now assume that the true underlying parameter is 𝜽 0 . Since the MLE is
asymptotically normal the Wald statistic is asymptotically standard normal
distributed:
t(θ₀) ∼ᵃ N_d(0_d, I_d) for a vector parameter θ

t(θ₀) ∼ᵃ N(0, 1) for a scalar parameter θ
Example 4.5. Wald statistic for a proportion:

Continuing from Example 4.3, with p̂_ML = x̄ and ŜD(p̂_ML) = √(x̄(1 − x̄)/n) we get as Wald statistic

t(p₀) = (x̄ − p₀) / √( x̄(1 − x̄)/n ) ∼ᵃ N(0, 1)

and as squared Wald statistic

t(p₀)² = n (x̄ − p₀)² / ( x̄(1 − x̄) ) ∼ᵃ χ²₁
Example 4.6. Wald statistic for the mean parameter of a normal distribution
with known variance:
We continue from Example 4.4. With μ̂_ML = x̄ and V̂ar(μ̂_ML) = σ²/n, and thus ŜD(μ̂_ML) = σ/√n, we get as Wald statistic:

t(μ₀) = (x̄ − μ₀) / (σ/√n) ∼ N(0, 1)

Note this is the one-sample t-statistic with given σ. The squared Wald statistic is:

t(μ₀)² = (x̄ − μ₀)² / (σ²/n) ∼ χ²₁

Again, in this instance this is the exact distribution, not just the asymptotic one.
Using the Wald statistic or the squared Wald statistic we can test whether a
particular 𝜇0 can be rejected as underlying true parameter, and we can also
construct corresponding confidence intervals.
For example, to construct the asymptotic normal CI for the MLE of a scalar parameter θ we use the MLE θ̂_ML as estimate of the mean and its standard deviation ŜD(θ̂_ML) computed from the observed Fisher information:

CI = [ θ̂_ML − c ŜD(θ̂_ML), θ̂_ML + c ŜD(θ̂_ML) ]

For example, for a CI with 95% coverage one uses the factor c = 1.96.
A value θ₀ that falls outside the confidence interval can indeed be rejected; in other words, data plus model exclude such a value as statistically implausible.
This can be verified more directly by computing the corresponding (squared)
Wald statistics (see Example 4.5) and comparing them with the relevant critical
value (3.84 from chi-squared distribution for 5% significance level):
• $t(0.5)^2 = \frac{(0.7-0.5)^2}{0.084^2} = 5.71 > 3.84$, hence $p_0 = 0.5$ can be rejected.
• $t(0.8)^2 = \frac{(0.7-0.8)^2}{0.084^2} = 1.43 < 3.84$, hence $p_0 = 0.8$ cannot be rejected.
Note that the squared Wald statistic at the boundaries of the normal confidence
interval is equal to the critical value.
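The squared Wald statistics above can be checked numerically. The following is a short Python sketch; the values $\bar{x} = 0.7$ and $n = 30$ are an assumption inferred from the quoted statistics (they reproduce both 5.71 and 1.43), not stated explicitly in this passage.

```python
# Squared Wald statistic for a proportion: t(p0)^2 = n (xbar - p0)^2 / (xbar (1 - xbar))
# xbar = 0.7 and n = 30 are assumed values consistent with the quoted numbers.
def squared_wald_prop(xbar, p0, n):
    return n * (xbar - p0) ** 2 / (xbar * (1 - xbar))

print(round(squared_wald_prop(0.7, 0.5, 30), 2))  # 5.71 > 3.84: reject p0 = 0.5
print(round(squared_wald_prop(0.7, 0.8, 30), 2))  # 1.43 < 3.84: cannot reject p0 = 0.8
```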
Example 4.10. Normal confidence interval and test for the mean:
We continue from Example 4.8.
We now consider two possible values (𝜇0 = 9.5 and 𝜇0 = 11) as potentially true
underlying mean parameter.
The value $\mu_0 = 9.5$ lies inside the 95% confidence interval $[9.216, 10.784]$. This implies we cannot reject the hypothesis that this is the true underlying parameter at the 5% significance level. In contrast, $\mu_0 = 11$ is outside the confidence interval, so we can indeed reject this value. In other words, data plus model exclude this value as statistically implausible.
This can be verified more directly by computing the corresponding (squared)
Wald statistics (see Example 4.6) and comparing them with the relevant critical
values:
• $t(9.5)^2 = \frac{(10-9.5)^2}{4/25} = 1.56 < 3.84$, hence $\mu_0 = 9.5$ cannot be rejected.
• $t(11)^2 = \frac{(10-11)^2}{4/25} = 6.25 > 3.84$, hence $\mu_0 = 11$ can be rejected.
The squared Wald statistic at the boundaries of the confidence interval equals
the critical value.
Note that this is the standard one-sample test of the mean, and that it is exact,
not an approximation.
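The two test decisions above can be reproduced directly. This Python sketch uses $\bar{x} = 10$, $\sigma^2 = 4$ and $n = 25$, as implied by the quoted fractions.

```python
# Squared Wald statistic for a normal mean with known variance:
# t(mu0)^2 = (xbar - mu0)^2 / (sigma^2 / n)
def squared_wald(xbar, mu0, sigma2, n):
    return (xbar - mu0) ** 2 / (sigma2 / n)

print(round(squared_wald(10, 9.5, 4, 25), 2))   # 1.56 < 3.84: cannot reject
print(round(squared_wald(10, 11.0, 4, 25), 2))  # 6.25 > 3.84: reject
```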
𝑥1 , . . . , 𝑥 𝑛 ∼ 𝑈(0, 𝜃)
With $x_{[i]}$ we denote the ordered observations, with $0 \le x_{[1]} < x_{[2]} < \ldots < x_{[n]} \le \theta$ and $x_{[n]} = \max(x_1, \ldots, x_n)$.
We would like to obtain both the maximum likelihood estimator 𝜃ˆ 𝑀𝐿 and its
distribution.
The probability density function of $U(0, \theta)$ is
$$f(x|\theta) = \begin{cases} \frac{1}{\theta} & \text{if } x \in [0, \theta] \\ 0 & \text{otherwise.} \end{cases}$$
$$\text{E}(x_{[n]}) = \frac{n}{n+1}\theta, \qquad \text{Var}(x_{[n]}) = \frac{n}{(n+1)^2(n+2)}\theta^2 \approx \frac{\theta^2}{n^2}$$
Note that the variance decreases with $\frac{1}{n^2}$, which is much faster than the usual $\frac{1}{n}$ of an "efficient" estimator. Correspondingly, $\hat{\theta}_{ML}$ is a so-called "super-efficient"
estimator.
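The $\theta^2/n^2$ decay of the variance can be checked by simulation. A minimal Monte Carlo sketch (sample sizes, seed and replicate count are arbitrary choices):

```python
import random

# Monte Carlo check that Var(max x_i) for x_i ~ U(0, theta) matches the
# exact formula n theta^2 / ((n+1)^2 (n+2)), i.e. decays roughly like 1/n^2.
random.seed(1)
theta, reps = 1.0, 10000

for n in (10, 20, 40):
    maxima = [max(random.uniform(0, theta) for _ in range(n)) for _ in range(reps)]
    mean = sum(maxima) / reps
    var = sum((m - mean) ** 2 for m in maxima) / reps
    exact = n * theta ** 2 / ((n + 1) ** 2 * (n + 2))
    print(n, round(var, 6), round(exact, 6))  # simulated vs exact variance
```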
Chapter 5
Likelihood-based confidence
interval and likelihood ratio
Idea: find all 𝜽 0 that have a log-likelihood that is almost as good as 𝑙𝑛 (𝜽ˆ 𝑀𝐿 |𝐷).
Here Δ is our tolerated deviation from the maximum log-likelihood. We will see
below how to determine a suitable Δ.
The above leads naturally to the Wilks log likelihood ratio statistic 𝑊(𝜽0 )
defined as:
$$W(\boldsymbol{\theta}_0) = 2 \log \frac{L(\hat{\boldsymbol{\theta}}_{ML}|D)}{L(\boldsymbol{\theta}_0|D)} = 2\left(l_n(\hat{\boldsymbol{\theta}}_{ML}|D) - l_n(\boldsymbol{\theta}_0|D)\right)$$
CI = {𝜽 0 : 𝑊(𝜽 0 ) ≤ 2Δ}
$$l_n(p|D) = n\left(\bar{x} \log p + (1 - \bar{x}) \log(1 - p)\right)$$
Comparing with Example 2.8 we see that in this case the Wilks statistic is
essentially (apart from a scale factor 2𝑛) the KL divergence between two Bernoulli
distributions:
$$W(p_0) = 2n\,D_{\text{KL}}(\text{Ber}(\hat{p}_{ML}), \text{Ber}(p_0))$$
Example 5.2. Wilks statistic for the mean parameter of a normal model:
The Wilks statistic is
$$W(\mu_0) = \frac{(\bar{x} - \mu_0)^2}{\sigma^2/n}$$
See Worksheet L3 for a derivation of the Wilks statistic directly from the log-
likelihood function.
Note this is the same as the squared Wald statistic discussed in Example 4.6.
Comparing with Example 2.10 we see that in this case the Wilks statistic is essentially (apart from a scale factor $2n$) the KL divergence between two normal distributions with different means and variance equal to $\sigma^2$:
$$W(\mu_0) = 2n\,D_{\text{KL}}(N(\hat{\mu}_{ML}, \sigma^2), N(\mu_0, \sigma^2))$$
$$l_n(\boldsymbol{\theta}_0|D) \approx l_n(\hat{\boldsymbol{\theta}}_{ML}|D) - \frac{1}{2}(\boldsymbol{\theta}_0 - \hat{\boldsymbol{\theta}}_{ML})^T \boldsymbol{J}_n(\hat{\boldsymbol{\theta}}_{ML})(\boldsymbol{\theta}_0 - \hat{\boldsymbol{\theta}}_{ML})$$
With this we can then approximate the Wilks statistic:
$$\begin{aligned} W(\boldsymbol{\theta}_0) &= 2\left(l_n(\hat{\boldsymbol{\theta}}_{ML}|D) - l_n(\boldsymbol{\theta}_0|D)\right) \\ &\approx (\boldsymbol{\theta}_0 - \hat{\boldsymbol{\theta}}_{ML})^T \boldsymbol{J}_n(\hat{\boldsymbol{\theta}}_{ML})(\boldsymbol{\theta}_0 - \hat{\boldsymbol{\theta}}_{ML}) \\ &= t(\boldsymbol{\theta}_0)^2 \end{aligned}$$
Thus the quadratic approximation of the Wilks statistic yields the squared Wald
statistic!
Conversely, the Wilks statistic can be understood as a generalisation of the squared Wald statistic.
Example 5.3. Quadratic approximation of the Wilks statistic for a proportion
(continued from Example 5.1):
A Taylor series of second order (for $p_0$ around $\bar{x}$) yields
$$\log \frac{\bar{x}}{p_0} \approx -\frac{p_0 - \bar{x}}{\bar{x}} + \frac{(p_0 - \bar{x})^2}{2\bar{x}^2}$$
and
$$\log \frac{1 - \bar{x}}{1 - p_0} \approx \frac{p_0 - \bar{x}}{1 - \bar{x}} + \frac{(p_0 - \bar{x})^2}{2(1 - \bar{x})^2}$$
With this we can approximate the Wilks statistic of the proportion as
$$\begin{aligned} W(p_0) &\approx 2n\left(-(p_0 - \bar{x}) + \frac{(p_0 - \bar{x})^2}{2\bar{x}} + (p_0 - \bar{x}) + \frac{(p_0 - \bar{x})^2}{2(1 - \bar{x})}\right) \\ &= n\left(\frac{(p_0 - \bar{x})^2}{\bar{x}} + \frac{(p_0 - \bar{x})^2}{1 - \bar{x}}\right) \\ &= n\,\frac{(p_0 - \bar{x})^2}{\bar{x}(1 - \bar{x})} \\ &= t(p_0)^2. \end{aligned}$$
This verifies that the quadratic approximation of the Wilks statistic leads back to
the squared Wald statistic of Example 4.5.
Example 5.4. Quadratic approximation of the Wilks statistic for the mean
parameter of a normal model (continued from Example 5.2):
The normal log-likelihood is already quadratic in the mean parameter (cf. Exam-
ple 3.2). Correspondingly, the Wilks statistic is quadratic in the mean parameter
as well. Hence in this particular case the quadratic “approximation” is in fact
exact and the Wilks statistic and the squared Wald statistic are identical!
Correspondingly, confidence intervals and tests based on the Wilks statistic are
identical to those obtained using the Wald statistic.
the interval [0.524, 0.843]. It is similar but not identical to the corresponding
asymptotic normal interval [0.536, 0.864] obtained in Example 4.7.
The following figure illustrates the relationship between the normal CI and the likelihood CI, and also shows the role of the quadratic approximation (see also Example 4.2). Note that:
• the normal CI is symmetric around the MLE whereas the likelihood CI is
not symmetric
• the normal CI is identical to the likelihood CI when using the quadratic
approximation!
[Figure: log-likelihood $l_n(p)$ and its quadratic approximation, with horizontal lines at the maximum log-likelihood and at maximum − 1.92, marking the MLE, the normal CI and the likelihood CI.]
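The likelihood CI endpoints can be found numerically by solving $W(p_0) = 3.84$. In this Python sketch, $\bar{x} = 0.7$ and $n = 30$ (i.e. 21 successes in 30 trials) are assumed values inferred from the quoted interval $[0.524, 0.843]$, not stated in this passage.

```python
import math

# Likelihood CI for a proportion: solve W(p) = 3.84 by bisection on each
# side of the MLE. xbar = 0.7, n = 30 are assumed from the quoted interval.
n, xbar = 30, 0.7

def wilks(p0):
    return 2 * n * (xbar * math.log(xbar / p0)
                    + (1 - xbar) * math.log((1 - xbar) / (1 - p0)))

def solve(lo, hi, target=3.84):
    # bisection on W(p) - target, assuming a sign change on [lo, hi]
    for _ in range(80):
        mid = (lo + hi) / 2
        if (wilks(lo) - target) * (wilks(mid) - target) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

lower = solve(1e-6, xbar)      # W decreases towards the MLE
upper = solve(xbar, 1 - 1e-6)  # W increases again beyond the MLE
print(round(lower, 3), round(upper, 3))  # approx 0.524 0.843
```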
As test statistic we use the Wilks log likelihood ratio 𝑊(𝜽0 ). Extreme values of
this test statistic imply evidence against 𝐻0 .
Note that the null model is “simple” (= a single parameter value) whereas the
alternative model is “composite” (= a set of parameter values).
Remarks:
• The composite alternative 𝐻1 is represented by a single point (the MLE).
• Reject 𝐻0 for large values of 𝑊(𝜽 0 )
• Under $H_0$ and for large $n$ the statistic $W(\boldsymbol{\theta}_0)$ is chi-squared distributed, i.e. $W(\boldsymbol{\theta}_0) \overset{a}{\sim} \chi^2_d$. This allows us to compute critical values (i.e. thresholds used to declare rejection at a given significance level) and also $p$-values corresponding to the observed test statistics.
• Models outside the CI are rejected
• Models inside the CI cannot be rejected, i.e. they can’t be statistically
distinguished from the best alternative model.
A statistic equivalent to 𝑊(𝜽0 ) is the likelihood ratio
$$\Lambda(\boldsymbol{\theta}_0) = \frac{L(\boldsymbol{\theta}_0|D)}{L(\hat{\boldsymbol{\theta}}_{ML}|D)}$$
The two statistics can be transformed into each other by 𝑊(𝜽 0 ) = −2 log Λ(𝜽 0 )
and Λ(𝜽 0 ) = 𝑒 −𝑊(𝜽0 )/2 . We reject 𝐻0 for small values of Λ.
It can be shown that the likelihood ratio test to compare two simple models is optimal in the sense that for any given specified type I error (= probability of wrongly rejecting $H_0$, i.e. the significance level) it will maximise the power (= 1 − type II error, the probability of correctly accepting $H_1$). This is known as the Neyman-Pearson lemma.
Example 5.6. Likelihood test for a proportion:
We continue from Example 5.5 with 95% likelihood confidence interval
[0.524, 0.843].
The value $p_0 = 0.5$ is outside the CI and hence can be rejected, whereas $p_0 = 0.8$ is inside the CI and hence cannot be rejected at the 5% significance level.
The Wilks statistic for 𝑝0 = 0.5 and 𝑝0 = 0.8 take on the following values:
• $W(0.5) = 4.94 > 3.84$, hence $p_0 = 0.5$ can be rejected.
• $W(0.8) = 1.69 < 3.84$, hence $p_0 = 0.8$ cannot be rejected.
Note that the Wilks statistic at the boundaries of the likelihood confidence
interval is equal to the critical value (3.84 corresponding to 5% significance level
for a chi-squared distribution with 1 degree of freedom).
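The two Wilks values can be reproduced from the Bernoulli log-likelihood. As before, $\bar{x} = 0.7$ and $n = 30$ are assumed values consistent with the quoted numbers.

```python
import math

# Wilks statistic W(p0) = 2n [ xbar log(xbar/p0) + (1-xbar) log((1-xbar)/(1-p0)) ]
# with the assumed summary xbar = 0.7, n = 30.
n, xbar = 30, 0.7

def wilks(p0):
    return 2 * n * (xbar * math.log(xbar / p0)
                    + (1 - xbar) * math.log((1 - xbar) / (1 - p0)))

print(round(wilks(0.5), 2))  # 4.94 > 3.84: reject
print(round(wilks(0.8), 2))  # 1.69 < 3.84: cannot reject
```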
𝐷 = {𝑥1 , . . . , 𝑥 𝑛 }. The KL divergences 𝐷KL (𝐹, 𝐺 𝐴 ) and 𝐷KL (𝐹, 𝐺 𝐵 ) indicate how
close each of the models 𝐺 𝐴 and 𝐺 𝐵 fit the true 𝐹. The difference of the two
divergences is a way to measure the relative fit of the two models, and can be
computed as
$$D_{\text{KL}}(F, G_B) - D_{\text{KL}}(F, G_A) = \text{E}_F\left(\log \frac{g(x|\boldsymbol{\theta}_A)}{g(x|\boldsymbol{\theta}_B)}\right)$$
Replacing 𝐹 by the empirical distribution 𝐹ˆ 𝑛 leads to the large sample approxi-
mation
$$2n\left(D_{\text{KL}}(F, G_B) - D_{\text{KL}}(F, G_A)\right) \approx 2\left(l_n(\boldsymbol{\theta}_A|D) - l_n(\boldsymbol{\theta}_B|D)\right)$$
Hence, the difference in the log-likelihoods provides an estimate of the difference
in the KL divergence of the two models involved.
The Wilks log likelihood ratio statistic
$$W(\boldsymbol{\theta}_0) = 2\left(l_n(\hat{\boldsymbol{\theta}}_{ML}|D) - l_n(\boldsymbol{\theta}_0|D)\right)$$
thus compares the best-fit distribution with $\hat{\boldsymbol{\theta}}_{ML}$ as the parameter to the distribution with parameter $\boldsymbol{\theta}_0$.
For some specific models the Wilks statistic can also be written in the form of
the KL divergence:
𝑊(𝜽 0 ) = 2𝑛𝐷KL (𝐹𝜽ˆ 𝑀𝐿 , 𝐹𝜽0 )
This is the case for the examples 5.1 and 5.2 and also more generally for
exponential family models, but it is not true in general.
$$W = 2 \log \frac{L(\hat{\theta}_{ML}|D)}{L(\hat{\theta}_{ML}^0|D)} \qquad \text{and} \qquad \Lambda = \frac{\max_{\theta \in \omega_0} L(\theta|D)}{\max_{\theta \in \Omega} L(\theta|D)}$$
where $L(\hat{\theta}_{ML}|D)$ is the maximised likelihood assuming the full model (with parameter space $\Omega$) and $L(\hat{\theta}_{ML}^0|D)$ is the maximised likelihood for the restricted model (with parameter space $\omega_0$). Hence, to compute the GLRT test statistic we need to perform two optimisations, one for the full and another for the restricted model.
Remarks:
• MLE in the restricted model space 𝜔0 is taken as a representative of 𝐻0 .
• The likelihood is maximised in both numerator and denominator.
• The restricted model is a special case of the full model (i.e. the two models
are nested).
• The asymptotic distribution of $W$ is chi-squared with degrees of freedom depending on both $d$ and $d_0$:
$$W \overset{a}{\sim} \chi^2_{d - d_0}$$
• This result is due to Wilks (1938).1 Note that it assumes that the true model
is contained among the investigated models.
• If 𝐻0 is a simple hypothesis (i.e. 𝑑0 = 0) then the standard LRT (and
corresponding CI) is recovered as special case of the GLRT.
Example 5.7. GLRT example:
Case-control study (e.g. "healthy" vs. "disease"): we observe normal data $D = \{x_1, \ldots, x_n\}$ from two groups with sample sizes $n_1$ and $n_2$ (and $n = n_1 + n_2$):
𝑥1 , . . . , 𝑥 𝑛1 ∼ 𝑁(𝜇1 , 𝜎2 )
and
𝑥 𝑛1 +1 , . . . , 𝑥 𝑛 ∼ 𝑁(𝜇2 , 𝜎 2 )
Question: are the two means 𝜇1 and 𝜇2 the same in the two groups?
Full model Ω:
$$\begin{aligned} \log L(\mu_1, \mu_2, \sigma^2|D) &= -\frac{n_1}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n_1} (x_i - \mu_1)^2 \\ &\quad - \frac{n_2}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=n_1+1}^{n} (x_i - \mu_2)^2 \\ &= -\frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \left( \sum_{i=1}^{n_1} (x_i - \mu_1)^2 + \sum_{i=n_1+1}^{n} (x_i - \mu_2)^2 \right) \end{aligned}$$
Corresponding MLEs:
$$\omega_0: \quad \hat{\mu}_0 = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \hat{\sigma}_0^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu}_0)^2$$
$$\Omega: \quad \hat{\mu}_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} x_i, \qquad \hat{\mu}_2 = \frac{1}{n_2} \sum_{i=n_1+1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n} \left( \sum_{i=1}^{n_1} (x_i - \hat{\mu}_1)^2 + \sum_{i=n_1+1}^{n} (x_i - \hat{\mu}_2)^2 \right)$$
with
$$t_{ML} = \frac{\hat{\mu}_1 - \hat{\mu}_2}{\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right) \hat{\sigma}^2}}$$
Restricted model:
$$\log L(\hat{\mu}_0, \hat{\sigma}_0^2|D) = -\frac{n}{2} \log(\hat{\sigma}_0^2) - \frac{n}{2}$$
Full model:
$$\log L(\hat{\mu}_1, \hat{\mu}_2, \hat{\sigma}^2|D) = -\frac{n}{2} \log(\hat{\sigma}^2) - \frac{n}{2}$$
Likelihood ratio statistic:
$$W = n \log\left(\frac{\hat{\sigma}_0^2}{\hat{\sigma}^2}\right) = n \log\left(1 + \frac{t_{ML}^2}{n}\right)$$
The last step uses the decomposition of the total variance $\hat{\sigma}_0^2$. If an unbiased estimate of the variance is used ($\hat{\sigma}^2_{UB} = \frac{n}{n-2} \hat{\sigma}^2$) rather than the MLE, then
$$W = n \log\left(1 + \frac{t^2}{n-2}\right)$$
with
$$t = \frac{\hat{\mu}_1 - \hat{\mu}_2}{\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right) \hat{\sigma}^2_{UB}}}$$
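The identity between the two forms of the GLRT statistic can be checked numerically on simulated data. A Python sketch (all sample sizes, means and the seed are arbitrary illustration values):

```python
import math
import random

# Check that W = n log(sigma0^2 / sigma^2) equals n log(1 + t_ML^2 / n)
# for the two-group normal model, using the MLE variance estimates.
random.seed(0)
n1, n2 = 12, 15
x1 = [random.gauss(0.0, 1.0) for _ in range(n1)]
x2 = [random.gauss(0.5, 1.0) for _ in range(n2)]
x, n = x1 + x2, n1 + n2

mu0 = sum(x) / n
s2_0 = sum((xi - mu0) ** 2 for xi in x) / n          # restricted model
mu1, mu2 = sum(x1) / n1, sum(x2) / n2
s2 = (sum((xi - mu1) ** 2 for xi in x1)
      + sum((xi - mu2) ** 2 for xi in x2)) / n       # full model

W_direct = n * math.log(s2_0 / s2)
t_ml = (mu1 - mu2) / math.sqrt((1 / n1 + 1 / n2) * s2)
W_via_t = n * math.log(1 + t_ml ** 2 / n)
print(abs(W_direct - W_via_t) < 1e-10)  # True: the two forms agree
```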
Kullback-Leibler 1951: entropy learning, minimise $D_{\text{KL}}(F_{\text{true}}, F_{\boldsymbol{\theta}})$
↓ (large $n$)
Fisher 1922: maximise likelihood $L(\boldsymbol{\theta}|D)$
↓ (normal model)
Gauss 1805: minimise squared error $\sum_i (x_i - \theta)^2$
where $h()$ and $k()$ are positive-valued functions, or equivalently on log-scale
require the original data 𝐷. Instead, the sufficient statistic 𝑇(𝐷) contains all the
information in 𝐷 required to learn about the parameter 𝜽.
Therefore, if the MLE 𝜽ˆ 𝑀𝐿 of 𝜽 exists and is unique then the MLE is a unique
function of the sufficient statistic $T(D)$. If the MLE is not unique then it can be chosen to be a function of $T(D)$. Note that a sufficient statistic always exists since
the data 𝐷 are themselves sufficient statistics, with 𝑇(𝐷) = 𝐷. Furthermore,
sufficient statistics are not unique since applying a one-to-one transformation to
𝑇(𝐷) yields another sufficient statistic.
and thus is constant with regard to 𝜽. As a result, all data sets in 𝒳𝑡 are
likelihood equivalent. However, the converse is not true: depending on the
sufficient statistics there usually will be many likelihood equivalent data sets
that are not part of the same set 𝒳𝑡 .
𝐷2 it also follows that 𝑇(𝐷1 ) = 𝑇(𝐷2 ). If this holds true then 𝑇 is a minimally
sufficient statistic.
An equivalent non-operational definition is that a minimal sufficient statistic
𝑇(𝐷) is a sufficient statistic that can be computed from any other sufficient
statistic 𝑆(𝐷). This follows from the above directly: assume any sufficient
statistic 𝑆(𝐷), this defines a corresponding set 𝒳𝑠 of likelihood equivalent data
sets. By implication any 𝐷1 , 𝐷2 ∈ 𝒳𝑠 will necessarily also be in 𝒳𝑡 , thus whenever
𝑆(𝐷1 ) = 𝑆(𝐷2 ) we also have 𝑇(𝐷1 ) = 𝑇(𝐷2 ), and therefore 𝑇(𝐷1 ) is a function of
𝑆(𝐷1 ).
A trivial but important example of a minimal sufficient statistic is the likelihood
function itself since by definition it can be computed from any set of sufficient
statistics. Thus the likelihood function 𝐿(𝜽) captures all information about 𝜽
that is available in the data. In other words, it provides an optimal summary
of the observed data with regard to a model. Note that in Bayesian statistics
(to be discussed in Part 2 of the module) the likelihood function is used as
proxy/summary of the data.
One possible set of minimal sufficient statistics for $\boldsymbol{\theta}$ are $\bar{x}$ and $\overline{x^2}$, and with these we can rewrite the log-likelihood function without any reference to the original data $x_1, \ldots, x_n$ as follows
$$l_n(\boldsymbol{\theta}) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{n}{2\sigma^2} \left(\overline{x^2} - 2\bar{x}\mu + \mu^2\right)$$
An alternative set of minimal sufficient statistics for $\boldsymbol{\theta}$ consists of $s^2 = \overline{x^2} - \bar{x}^2 = \hat{\sigma}^2_{ML}$ and $\bar{x} = \hat{\mu}_{ML}$. The log-likelihood written in terms of $s^2$ and $\bar{x}$ is
$$l_n(\boldsymbol{\theta}) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{n}{2\sigma^2} \left(s^2 + (\bar{x} - \mu)^2\right)$$
Note that in this example the dimension of the parameter vector 𝜽 equals the
dimension of the minimal sufficient statistic, and furthermore, that the MLEs of
the parameters are in fact minimal sufficient!
the MLEs of the parameters are minimal sufficient statistics. Thus, there will
typically be substantial dimension reduction from the raw data to the sufficient
statistics.
In summary, the likelihood function acts as perfect data summariser (i.e. as mini-
mally sufficient statistic), and in exponential families (e.g. Normal distribution)
the MLEs of the parameters 𝜽ˆ 𝑀𝐿 are minimal sufficient.
Finally, while sufficiency is clearly a useful concept for data reduction one needs
to keep in mind that this is always in reference to a specific model. Therefore,
unless one strongly believes in a certain model it is generally a good idea to keep
(and not discard!) the original data.
However, since the KL divergence is not symmetric there are in fact two ways
to minimise the divergence between a fixed 𝐹0 and the family 𝐹𝜽 , each with
different properties:
Note that here we keep the first argument fixed and minimise KL by
changing the second argument.
This procedure is mean-seeking and inclusive, i.e. when there are multiple
modes in the density of 𝐹0 a fitted unimodal density 𝐹𝜽ˆ will seek to cover
all modes.
Note that here we keep the second argument fixed and minimise KL by
changing the first argument.
This procedure is mode-seeking and exclusive, i.e. when there are multiple
modes in the density of 𝐹0 a fitted unimodal density 𝐹𝜽ˆ will seek out one
mode to the exclusion of the others.
However, for small sample size it is indeed possible (and necessary) to improve
over the MLE (e.g. via Bayesian estimation or regularisation). Some of these
ideas will be discussed in Part II.
• regularised/penalised likelihood
• Bayesian methods
Classic example of a simple non-ML estimator that is better than the MLE: Stein’s
example / Stein paradox (C. Stein, 1955):
• Problem setting: estimation of the mean in multivariate case
• Maximum likelihood estimation breaks down! → average (=MLE) is worse
in terms of MSE than Stein estimator.
• For small 𝑛 the asymptotic distributions for the MLE and for the LRT are
not accurate, so for inference in these situations the distributions may need
to be obtained by simulation (e.g. parametric or nonparametric bootstrap).
• Note that, by construction, the model with more parameters always has a
higher likelihood, implying likelihood favours complex models
• Complex model may overfit!
• Instead, the aim is model building, i.e. to find a model that explains the
data well and that predicts well!
• Typically, this will not be the best-fit ML model, but rather a simpler model
that is close enough to the best / most complex model.
Part II
Bayesian Statistics
Chapter 7
$$p(x|y) = p(y|x)\,\frac{p(x)}{p(y)}$$
This rule relates the two possible conditional densities (or conditional probability mass functions) for two random variables $x$ and $y$. It thus allows us to reverse the order of conditioning.
Bayes’s theorem was published in 1763 only after his death by Richard Price
(1723-1791):
Pierre-Simon Laplace independently published Bayes’ theorem in 1774 and he
was in fact the first to routinely apply it to statistical calculations.
E(E(𝑥|𝑦)) = E(𝑥)
The first term is the “explained” or “between-group” variance, and the second
the “unexplained” or “mean within group” variance (also known as “pooled”
variance).
Example 7.1. Mean and variance of a mixture model:
Assume 𝐾 groups indicated by a discrete variable 𝑦 = 1, 2, . . . , 𝐾 with probability
𝑝(𝑦) = 𝜋 𝑦 . In each group the observations 𝑥 follow a density 𝑝(𝑥|𝑦) with
conditional mean 𝐸(𝑥|𝑦) = 𝜇 𝑦 and conditional variance Var(𝑥|𝑦) = 𝜎2𝑦 . The
joint density for $x$ and $y$ is $p(x, y) = \pi_y\,p(x|y)$. The marginal density for $x$ is $p(x) = \sum_{y=1}^{K} \pi_y\,p(x|y)$. This is called a mixture model.
The total mean $\text{E}(x) = \mu_0$ is equal to $\sum_{y=1}^{K} \pi_y \mu_y$.
$$D_{\text{KL}}(Q_{y|x}, P_{y|x}) = \text{E}_{Q_x}\left(\text{E}_{Q_{y|x}}\left(\log \frac{q(y|x)}{p(y|x)}\right)\right)$$
The above decompositions for the entropy, the cross-entropy and relative entropy
are known as entropy chain rules.
≥ 𝐷KL (𝑄 𝑥 , 𝑃𝑥 )
This means that the KL divergence between the joint distributions forms an
upper bound for the KL divergence between the marginal distributions, with
the difference given by the conditional KL divergence 𝐷KL (𝑄 𝑦|𝑥 , 𝑃𝑦|𝑥 ).
Equivalently, we can state an upper bound for the marginal cross-entropy:
≥ 𝐻(𝑄 𝑥 , 𝑃𝑥 )
Instead of an upper bound we may as well express this as lower bound for the
negative marginal cross-entropy
Since entropy and KL divergence are closely linked with maximum likelihood the
above bounds play a major role in statistical learning of models with unobserved
latent variables (here 𝑦). They form the basis of important methods such as the
EM algorithm as well as of variational Bayes.
Chapter 8
𝑝(𝑥, 𝑦|𝜽)
where 𝑄ˆ 𝑥,𝑦 is the empirical joint distribution based on both 𝐷𝑥 and 𝐷 𝑦 and 𝑃𝑥,𝑦|𝜽
the joint model, so maximising the complete data log-likelihood minimises the
cross-entropy 𝐻(𝑄ˆ 𝑥,𝑦 , 𝑃𝑥,𝑦|𝜽 ).
Now assume that 𝑦 is not observable and hence is a so-called latent variable.
Then we don’t have observations 𝐷 𝑦 and therefore cannot use the complete data
likelihood. Instead, for maximum likelihood estimation with missing data we
need to use the observed data log-likelihood.
From the joint density we obtain the marginal density for $x$ by integrating out the unobserved variable $y$:
$$p(x|\boldsymbol{\theta}) = \int_y p(x, y|\boldsymbol{\theta})\,dy$$
Using the marginal model we then compute the observed data log-likelihood
$$l_n(\boldsymbol{\theta}|D_x) = \sum_{i=1}^{n} \log p(x_i|\boldsymbol{\theta}) = \sum_{i=1}^{n} \log \int_y p(x_i, y|\boldsymbol{\theta})\,dy$$
where 𝑄ˆ 𝑥 is the empirical distribution based only on 𝐷𝑥 and 𝑃𝑥|𝜽 is the model
family. Hence, maximising the observed data log-likelihood minimises the
cross-entropy 𝐻(𝑄ˆ 𝑥 , 𝑃𝑥|𝜽 ).
Example 8.1. Two group normal mixture model:
Assume we have two groups labelled by 𝑦 = 1 and 𝑦 = 2 (thus the variable 𝑦 is
discrete). The data 𝑥 observed in each group are normal with means 𝜇1 and 𝜇2
and variances 𝜎12 and 𝜎22 , respectively. The probability of group 1 is 𝜋1 = 𝑝 and
the probability of group 2 is 𝜋2 = 1 − 𝑝. The density of the joint model for 𝑥 and
𝑦 is
$$p(x, y|\boldsymbol{\theta}) = \pi_y\,N(x|\mu_y, \sigma_y^2)$$
The model parameters are 𝜽 = (𝑝, 𝜇1 , 𝜇2 , 𝜎12 , 𝜎22 )𝑇 and they can be inferred from
the complete data comprised of 𝐷𝑥 = {𝑥1 , . . . , 𝑥 𝑛 } and the group allocations
𝐷 𝑦 = {𝑦1 , . . . , 𝑦𝑛 } of each sample using the complete data log-likelihood
$$l_n(\boldsymbol{\theta}|D_x, D_y) = \sum_{i=1}^{n} \log \pi_{y_i} + \sum_{i=1}^{n} \log N(x_i|\mu_{y_i}, \sigma_{y_i}^2)$$
However, typically we do not know the class allocation 𝑦 and thus we need to
use the marginal model for 𝑥 alone which has density
$$p(x|\boldsymbol{\theta}) = \sum_{y=1}^{2} \pi_y\,N(x|\mu_y, \sigma_y^2)$$
Note that the form of the observed data log-likelihood is more complex than that
of the complete data log-likelihood because it contains the logarithm of a sum
that cannot be simplified. It is used to estimate the model parameters 𝜽 from 𝐷𝑥
without requiring knowledge of the class allocations 𝐷 𝑦 .
Example 8.2. Alternative computation of the observed data likelihood:
An alternative way to arrive at the observed data likelihood is to marginalise the
complete data likelihood.
$$L_n(\boldsymbol{\theta}|D_x, D_y) = \prod_{i=1}^{n} p(x_i, y_i|\boldsymbol{\theta})$$
and
$$L_n(\boldsymbol{\theta}|D_x) = \int_{y_1, \ldots, y_n} \prod_{i=1}^{n} p(x_i, y_i|\boldsymbol{\theta})\,dy_1 \ldots dy_n$$
The integration (sum) and the multiplication can be interchanged as per the Generalised Distributive Law, leading to
$$L_n(\boldsymbol{\theta}|D_x) = \prod_{i=1}^{n} \int_y p(x_i, y|\boldsymbol{\theta})\,dy$$
which is the same as constructing the likelihood from the marginal density.
$$p(y_i|x_i, \hat{\boldsymbol{\theta}}) = \frac{p(x_i, y_i|\hat{\boldsymbol{\theta}})}{p(x_i|\hat{\boldsymbol{\theta}})} = \frac{p(y_i|\hat{\boldsymbol{\theta}})\,p(x_i|y_i, \hat{\boldsymbol{\theta}})}{p(x_i|\hat{\boldsymbol{\theta}})}$$
the probabilities / densities of all states of $y_i$ (note this is an application of Bayes' theorem).
$$p(y_i|x_i, \hat{\boldsymbol{\theta}}) = \frac{\hat{\pi}_{y_i}\,N(x_i|\hat{\mu}_{y_i}, \hat{\sigma}_{y_i}^2)}{\hat{p}\,N(x_i|\hat{\mu}_1, \hat{\sigma}_1^2) + (1 - \hat{p})\,N(x_i|\hat{\mu}_2, \hat{\sigma}_2^2)}$$
8.3 EM Algorithm
Computing and maximising the observed data log-likelihood can be difficult
because of the integration over the unobserved variable (or summation in case of
a discrete latent variable). In contrast, the complete data log-likelihood function
may be easier to compute.
The widely used EM algorithm, formally described by Dempster and others
(1977) but also used before, addresses this problem and maximises the observed
data log-likelihood indirectly in an iterative procedure comprising two steps:
1) First (“E” step), the missing data 𝐷 𝑦 is imputed using Bayes’ theorem. This
provides probabilities (“soft allocations”) for each possible state of the
latent variable.
2) Subsequently (“M” step), the expected complete data log-likelihood func-
tion is computed, where the expectation is taken with regard to the
distribution over the latent states, and it is maximised with regard to 𝜽 to
estimate the model parameters.
The EM algorithm leads to the exact same estimates as if the observed data
log-likelihood would be optimised directly. Therefore the EM algorithm is in
fact not an approximation, it is just a different way to find the MLEs.
The EM algorithm and application to clustering is discussed in more detail in
the module MATH38161 Multivariate Statistics and Machine Learning.
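The E and M steps for the two-group normal mixture of Example 8.1 can be sketched in a few lines of Python. This is a minimal illustration, not a robust implementation; the simulated data, starting values and iteration count are all arbitrary choices.

```python
import math
import random

# Minimal EM sketch for a two-group normal mixture (illustration only).
random.seed(2)
x = ([random.gauss(0.0, 1.0) for _ in range(150)]
     + [random.gauss(4.0, 1.0) for _ in range(100)])

def ndens(xi, mu, s2):
    # normal density N(xi | mu, s2)
    return math.exp(-(xi - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

p, mu1, mu2, s21, s22 = 0.5, -1.0, 5.0, 1.0, 1.0
for _ in range(50):
    # E step: soft allocations p(y_i = 1 | x_i, theta) via Bayes' theorem
    r = [p * ndens(xi, mu1, s21)
         / (p * ndens(xi, mu1, s21) + (1 - p) * ndens(xi, mu2, s22))
         for xi in x]
    # M step: weighted MLEs given the soft allocations
    w1 = sum(r)
    w2 = len(x) - w1
    p = w1 / len(x)
    mu1 = sum(ri * xi for ri, xi in zip(r, x)) / w1
    mu2 = sum((1 - ri) * xi for ri, xi in zip(r, x)) / w2
    s21 = sum(ri * (xi - mu1) ** 2 for ri, xi in zip(r, x)) / w1
    s22 = sum((1 - ri) * (xi - mu2) ** 2 for ri, xi in zip(r, x)) / w2

print(round(p, 2), round(mu1, 1), round(mu2, 1))  # close to 0.6, 0.0 and 4.0
```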
In a nutshell, the justification for the EM algorithm follows from the entropy chain rules and the corresponding bounds, such as $D_{\text{KL}}(Q_{x,y}, P_{x,y}) \ge D_{\text{KL}}(Q_x, P_x)$ (see previous chapter). Given observed data for $x$ we know the empirical distribution $\hat{Q}_x$. Hence, by minimising $D_{\text{KL}}(\hat{Q}_x Q_{y|x}, P_{x,y}^{\boldsymbol{\theta}})$ iteratively
1) with regard to $Q_{y|x}$ ("E" step) and
2) with regard to the parameters $\boldsymbol{\theta}$ of $P_{x,y}^{\boldsymbol{\theta}}$ ("M" step)
one minimises $D_{\text{KL}}(\hat{Q}_x, P_x^{\boldsymbol{\theta}})$ with regard to the parameters of $P_x^{\boldsymbol{\theta}}$.
Interestingly, in the “E” step the first argument of the KL divergence is optimised
(“I” projection) and in the “M” step the second argument (“M” projection).
Ingredients:
Note the model underlying the Bayesian approach is the joint distribution
𝑝(𝜽, 𝑥) = 𝑝(𝜽)𝑝(𝑥|𝜽)
Question: new information in the form of new observation 𝑥 arrives - how does
the uncertainty about 𝜽 change?
Answer: use Bayes’ theorem to update the prior density to the posterior density.
$$\underbrace{p(\boldsymbol{\theta}|x)}_{\text{posterior}} = \underbrace{p(\boldsymbol{\theta})}_{\text{prior}}\,\frac{p(x|\boldsymbol{\theta})}{p(x)}$$
For the denominator in Bayes' formula we need to compute $p(x)$. This is obtained by
$$p(x) = \int_{\boldsymbol{\theta}} p(x, \boldsymbol{\theta})\,d\boldsymbol{\theta} = \int_{\boldsymbol{\theta}} p(x|\boldsymbol{\theta})\,p(\boldsymbol{\theta})\,d\boldsymbol{\theta}$$
Note that this implies that assigning prior probability 1 should be avoided, too.
$$\underbrace{p(\boldsymbol{\theta}|D)}_{\text{posterior}} = \underbrace{p(\boldsymbol{\theta})}_{\text{prior}}\,\frac{L(\boldsymbol{\theta}|D)}{p(D)}$$
involving the likelihood $L(\boldsymbol{\theta}|D) = \prod_{i=1}^{n} p(x_i|\boldsymbol{\theta})$ and the marginal likelihood $p(D) = \int_{\boldsymbol{\theta}} p(\boldsymbol{\theta})\,L(\boldsymbol{\theta}|D)\,d\boldsymbol{\theta}$ with $\boldsymbol{\theta}$ integrated out.
The marginal likelihood serves as a standardising factor so that the posterior
density for 𝜽 integrates to 1:
$$\int_{\boldsymbol{\theta}} p(\boldsymbol{\theta}|D)\,d\boldsymbol{\theta} = \frac{1}{p(D)} \int_{\boldsymbol{\theta}} p(\boldsymbol{\theta})\,L(\boldsymbol{\theta}|D)\,d\boldsymbol{\theta} = 1$$
$$p(D) = \prod_{i=1}^{n} p(x_i|x_{<i})$$
with
$$p(x_i|x_{<i}) = \int_{\boldsymbol{\theta}} p(x_i|\boldsymbol{\theta})\,p(\boldsymbol{\theta}|x_{<i})\,d\boldsymbol{\theta}$$
The last factor is the posterior predictive density of the new data 𝑥 𝑖 after seeing
data 𝑥 1 , . . . , 𝑥 𝑖−1 (given the model class 𝑀). It is straightforward to understand
why the probability of the new 𝑥 𝑖 depends on the previously observed data
points — because the uncertainty about the model parameter 𝜽 depends on how
much data we have already observed. Therefore the marginal likelihood 𝑝(𝐷) is
not simply the product of the marginal densities 𝑝(𝑥 𝑖 ) at each 𝑥 𝑖 but instead the
product of the conditional densities 𝑝(𝑥 𝑖 |𝑥 <𝑖 ).
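This decomposition can be verified exactly in the conjugate Beta-Bernoulli case, where both the marginal likelihood and the posterior predictives are available in closed form. A Python sketch (the prior parameters and data are arbitrary illustration values):

```python
import math

# Check of p(D) = prod_i p(x_i | x_<i) for the Beta-Bernoulli model.
alpha, beta = 2.0, 3.0
data = [1, 0, 1, 1, 0, 1]

def betafn(a, b):
    # Euler Beta function via the Gamma function
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

# Direct marginal likelihood: p(D) = B(alpha + x, beta + n - x) / B(alpha, beta)
x, n = sum(data), len(data)
p_direct = betafn(alpha + x, beta + n - x) / betafn(alpha, beta)

# Sequential product of posterior predictive probabilities
p_seq, a, b = 1.0, alpha, beta
for xi in data:
    pred1 = a / (a + b)                 # p(x_i = 1 | x_<i)
    p_seq *= pred1 if xi == 1 else 1 - pred1
    a, b = a + xi, b + (1 - xi)         # conjugate posterior update

print(abs(p_direct - p_seq) < 1e-12)  # True: the two computations agree
```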
Only when the parameter is fully known and there is no uncertainty about $\boldsymbol{\theta}$ are the observations $x_i$ independent. This leads back to the standard likelihood, where we condition on a particular $\boldsymbol{\theta}$ and the likelihood is the product $p(D|\boldsymbol{\theta}) = \prod_{i=1}^{n} p(x_i|\boldsymbol{\theta})$.
Note that there are typically many credible intervals with the given specified
coverage 𝛼 (say 95%). Therefore, we may need further criteria to construct these
intervals.
In Worksheet B1, examples of both types of credible intervals are given and compared visually.
→ The Kolmogorov framework is the basis for both the frequentist and the
Bayesian interpretation of probability.
A: Frequentist interpretation
This is the ontological view of probability (i.e. probability “exists” and is identical
to something that can be observed.).
B: Bayesian probability
Note that this does not require any repeated experiments. The Bayesian interpretation of probability is valid regardless of sample size or the number of repetitions of an experiment.
Hence, the key difference between frequentist and Bayesian approaches is not
the use of Bayes’ theorem. Rather it is whether you consider probability as
ontological (frequentist) or epistemological entity (Bayesian).
In this chapter we discuss three basic problems: how to estimate a proportion, a mean and a variance in a Bayesian framework.
Mean: $\text{E}\left(\frac{x}{n}\right) = p$
Variance: $\text{Var}\left(\frac{x}{n}\right) = \frac{p(1-p)}{n}$
From part I (likelihood theory) we know that the maximum likelihood estimate
of the proportion is the frequency 𝑝ˆ 𝑀𝐿 = 𝑛𝑥 given 𝑥 (number of “heads”) is
observed in 𝑛 repeats.
The mean is $\text{E}(x) = \mu = \frac{\alpha}{\alpha+\beta}$ and the variance $\text{Var}(x) = \frac{\mu(1-\mu)}{\alpha+\beta+1}$.
The density depends on the Beta function $B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$, which in turn is defined via Euler's Gamma function
$$\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\,dt$$
The Beta distribution is very flexible and can assume a number of different
shapes, depending on the value of 𝛼 and 𝛽:
𝑝 ∼ Beta(𝛼, 𝛽)
Note this does not actually mean that 𝑝 is random! It only means that we model
the uncertainty about 𝑝 using a Beta random variable!
The flexibility of the Beta distribution allows us to accommodate a large variety of possible scenarios for our prior knowledge.
The prior mean is
$$\text{E}(p) = \frac{\alpha}{m} = \mu_{\text{prior}}$$
and the prior variance
$$\text{Var}(p) = \frac{\mu_{\text{prior}}(1 - \mu_{\text{prior}})}{m + 1}$$
where $m = \alpha + \beta$.
Note the similarity to the moments of the standardised binomial above!
$$f(p|x) = \frac{f(x|p)\,f(p)}{\int_0^1 f(x|p')\,f(p')\,dp'}$$
$$f(x|p) = \binom{n}{x}\,p^x\,(1-p)^{n-x}$$
Applying Bayes’ theorem results in
$$p|x \sim \text{Beta}(\alpha + x, \beta + n - x)$$
$$f(p|x) = \frac{1}{B(\alpha + x, \beta + n - x)}\,p^{\alpha+x-1}\,(1-p)^{\beta+n-x-1}$$
The posterior can be summarised by its first two moments (mean and variance):
Posterior mean:
$$\mu_{\text{posterior}} = \text{E}(p|x) = \frac{x + \alpha}{n + m}$$
Posterior variance:
$$\sigma^2_{\text{posterior}} = \text{Var}(p|x) = \frac{\mu_{\text{posterior}}(1 - \mu_{\text{posterior}})}{n + m + 1}$$
Specifically, 𝛼 and 𝛽 act as pseudocounts that influence both the posterior mean
and the posterior variance, exactly in the same way as conventional data.
For example, the larger 𝑚 (and thus larger 𝛼 and 𝛽) the smaller is the posterior
variance, with variance decreasing proportional to the inverse of 𝑚. If the prior
is highly concentrated, i.e. if it has low variance and large precision (=inverse
variance) then the implicit data size 𝑚 is large. Conversely, if the prior has a
large variance, then the prior is vague and the implicit data size 𝑚 is small.
Hence, a prior has the same effect as if one would add data, but without actually adding data! This is precisely why a prior acts as a regulariser and prevents overfitting: it increases the effective sample size.
Another interpretation is that any prior summarises data that may have been
available previously as observations.
$$p^{\alpha-1}(1-p)^{\beta-1}$$
Since the posterior is proportional to the product of prior and likelihood, the posterior will have exactly the same form as the prior:
$$p^{\alpha+x-1}(1-p)^{\beta+n-x-1}$$
Choosing the prior distribution from a family conjugate to the likelihood greatly simplifies Bayesian analysis, since Bayes' formula can then be written in the form of an update rule for the parameters of the Beta distribution:
$$\alpha \to \alpha + x, \qquad \beta \to \beta + n - x$$
Thus, conjugate prior distributions are very convenient choices. However, in their
application it must be ensured that the prior distribution is flexible enough to
encapsulate all prior information that may be available. In cases where this is not
the case alternative priors should be used (and most likely this will then require
to compute the posterior distribution numerically rather than analytically).
$$\mu_{\text{posterior}} \overset{a}{=} \frac{x}{n} = \hat{\mu}_{ML}$$
³ Goldstein, M., and D. Wooff. 2007. Bayes Linear Statistics: Theory and Methods. Wiley. https://doi.org/10.1002/9780470065662
and
$$\sigma^2_{\text{posterior}} \overset{a}{=} \frac{\hat{\mu}_{ML}(1 - \hat{\mu}_{ML})}{n}$$
Thus, if the sample size is large the Bayes estimator turns into the ML estimator!
Specifically, the posterior mean becomes the ML point estimate, and the posterior
variance is equal to the asymptotic variance computed via the observed Fisher
information!
Thus, for large $n$ the data dominate and any details about the prior (such as the values of $\alpha$ and $\beta$) become irrelevant!
So not only are the posterior mean and variance converging to the MLE and the
variance of the MLE for large sample size, but also the posterior distribution
itself converges to the sampling distribution!
This holds generally in many regular cases, not just in our example of the
Beta-Bernoulli model.
The Bayesian CLT is generally known as the Bernstein-von Mises theorem (discovered around 1920-1930), but special cases were already known to Laplace.
In Worksheet B1 the asymptotic convergence of the posterior distribution to a normal distribution is demonstrated graphically.
$$\text{Var}(p \mid x) = \frac{\hat{p}_{\text{Bayes}}(1-\hat{p}_{\text{Bayes}})}{n+m+1}$$
Asymptotically, we have seen that for large 𝑛 the posterior mean becomes the
maximum likelihood estimate (MLE), and the posterior variance becomes the
asymptotic variance of the MLE. Thus, for large 𝑛 the Bayesian estimate will be
indistinguishable from the MLE and shares its favourable properties.
In addition, for finite sample size the posterior variance will typically be smaller
than both the asymptotic posterior variance (for large 𝑛) and the prior variance,
showing that combining the information in the prior and in the data leads to a
more efficient estimate.
$$x \mid \mu \sim N(\mu, \sigma^2)$$
$$\mu \sim N(\mu_0, \sigma^2/m)$$
with prior mean $E(\mu) = \mu_0$ and prior variance $\text{Var}(\mu) = \frac{\sigma^2}{m}$, where $m$ is the implied sample size of the prior. Note that $m$ does not need to be an integer value!
$$E(\mu \mid x_1, \ldots, x_n) = \mu_1 = \frac{m\mu_0 + n\bar{x}}{n+m} = \lambda\mu_0 + (1-\lambda)\hat{\mu}_{ML}$$
with $\lambda = \frac{m}{n+m}$. Note the linear shrinkage of $\hat{\mu}_{ML}$ towards $\mu_0$!
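The shrinkage form of the posterior mean is easy to check numerically. A Python sketch with made-up prior and data values (the module's own code is in R):

```python
def posterior_mean(mu0, m, xbar, n):
    """Normal-normal posterior mean: lambda*mu0 + (1-lambda)*xbar with lambda = m/(n+m)."""
    lam = m / (n + m)
    return lam * mu0 + (1 - lam) * xbar

# prior mean 0 with implied prior sample size m = 5; data mean 10 from n = 20 observations
print(posterior_mean(0.0, 5, 10.0, 20))   # 8.0: pulled from 10 towards the prior mean 0
```

As $n$ grows relative to $m$, $\lambda \to 0$ and the posterior mean approaches the data mean $\bar{x}$.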
$$\text{Var}(\mu \mid x_1, \ldots, x_n) = \frac{\sigma^2}{n+m}$$
Thus, the normal distribution is the conjugate distribution to the mean
parameter in the normal likelihood.
For large $n$ we obtain
$$E(\mu \mid x_1, \ldots, x_n) \overset{a}{=} \hat{\mu}_{ML} \quad\text{and}\quad \text{Var}(\mu \mid x_1, \ldots, x_n) \overset{a}{=} \frac{\sigma^2}{n}$$
i.e. the MLE and its asymptotic variance!
Note that the posterior variance $\frac{\sigma^2}{n+m}$ is smaller than both the asymptotic variance $\frac{\sigma^2}{n}$ and the prior variance $\frac{\sigma^2}{m}$.
𝑥 ∼ Inv-Gam(𝛼, 𝛽)
This distribution is closely linked with the Gamma distribution — the inverse of
𝑥 is Gamma-distributed with inverted scale parameter:
$$\frac{1}{x} \sim \text{Gam}(\alpha, \beta^{-1})$$
$$x \sim \text{Inv-Gam}\left(\alpha = \frac{k+2}{2},\; \beta = \frac{k\mu}{2}\right) = \text{Inv-Gam}(\mu, k)$$
The reason for choosing the mean parameterisation using 𝜇 and 𝑘 instead of 𝛼
and 𝛽 is that this parameterisation simplifies the Bayesian update rule for the
mean.
The inverse Gamma distribution is also known under two further alternative
names: 1) inverse scaled chi-squared distribution and 2) one-dimensional inverse
Wishart distribution.
$$E(\sigma^2) = \sigma_0^2$$
The posterior mean is
$$E(\sigma^2 \mid x_1, \ldots, x_n) = \sigma_1^2$$
with
$$\sigma_1^2 = \frac{m\sigma_0^2 + n\hat{\sigma}^2_{ML}}{m+n}\,.$$
$$\text{Var}(\sigma^2 \mid x_1, \ldots, x_n) = \frac{2\sigma_1^4}{m+n-2}$$
The update formula for the posterior mean of the variance follows the usual
linear shrinkage rule:
$$\sigma_1^2 = \lambda\sigma_0^2 + (1-\lambda)\hat{\sigma}^2_{ML}$$
with $\lambda = \frac{m}{m+n}$.
For large $n$ we get
$$\text{Var}(\sigma^2 \mid x_1, \ldots, x_n) \overset{a}{=} \frac{2\sigma^4}{n}$$
and the posterior mean converges to $\hat{\sigma}^2_{ML}$, i.e. we recover the MLE of $\sigma^2$ and its asymptotic variance!
$$\log p(D \mid M) = \log p(D \mid \boldsymbol\theta, M) - \log \frac{p(\boldsymbol\theta \mid D, M)}{p(\boldsymbol\theta \mid M)}$$
$$\log p(D \mid M) = \underbrace{\log p(D \mid \hat{\boldsymbol\theta}_{ML}, M)}_{\text{maximum log-likelihood}} - \underbrace{\log \frac{p(\hat{\boldsymbol\theta}_{ML} \mid D, M)}{p(\hat{\boldsymbol\theta}_{ML} \mid M)}}_{\text{penalty}\,>\,0}$$
$$B_{12} = \frac{p(D \mid M_1)}{p(D \mid M_2)}$$
The log-Bayes factor log 𝐵12 is also called the weight of evidence for 𝑀1 over
𝑀2 .
The prior odds are
$$\frac{\Pr(M_1)}{\Pr(M_2)}$$
Using Bayes' theorem $\Pr(M_i \mid D) = \Pr(M_i)\,\frac{p(D \mid M_i)}{p(D)}$ we can rewrite the posterior odds as
$$\underbrace{\frac{\Pr(M_1 \mid D)}{\Pr(M_2 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{p(D \mid M_1)}{p(D \mid M_2)}}_{\text{Bayes factor } B_{12}} \;\; \underbrace{\frac{\Pr(M_1)}{\Pr(M_2)}}_{\text{prior odds}}$$
The Bayes factor is the multiplicative factor that updates the prior odds to the
posterior odds.
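Numerically the update is a single multiplication. A Python sketch with illustrative marginal likelihood values (the names and numbers are our own, not from the text):

```python
def posterior_odds(p_d_m1, p_d_m2, prior_odds=1.0):
    """Posterior odds = Bayes factor B12 times prior odds."""
    bayes_factor = p_d_m1 / p_d_m2
    return bayes_factor * prior_odds

# illustrative marginal likelihoods p(D|M1) = 0.02 and p(D|M2) = 0.005
print(posterior_odds(0.02, 0.005))              # 4.0, with equal prior odds
print(posterior_odds(0.02, 0.005, prior_odds=0.5))  # 2.0: prior against M1 halves the odds
```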
More recently, Kass and Raftery (1995)2 proposed to use the following slightly
modified scale:
$$\log p(D \mid M) \approx l_n^M(\hat{\boldsymbol\theta}_{ML}^M) - \frac{d_M}{2}\log n$$
where $d_M$ is the dimension of the model $M$ (the number of parameters in $\boldsymbol\theta^M$ belonging to $M$), $n$ is the sample size and $\hat{\boldsymbol\theta}_{ML}^M$ is the MLE. For a simple model $d_M = 0$, so then there is no approximation, as in this case the marginal likelihood equals the likelihood.
The above formula can be obtained by quadratic approximation of the likelihood
assuming large 𝑛 and assuming that the prior is locally uniform around the
MLE. The Schwarz (1978) approximation is therefore a special case of a Laplace
approximation.
Note that the approximation is the maximum log-likelihood minus a penalty
that depends on the model complexity (as measured by dimension 𝑑), hence
this is an example of penalised ML! Also note that the distribution over the
parameter 𝜽 is not required in the approximation.
$$BIC(M) = -2\,l_n^M(\hat{\boldsymbol\theta}_{ML}^M) + d_M \log n$$
Thus, when comparing models one aims to maximise the marginal likelihood or, as an approximation, to minimise the BIC.
The reason for the factor “-2” is simply to have a quantity that is on the same
scale as the Wilks log likelihood ratio. Some people / software packages also
use the factor “2”.
3Schwarz, G. 1978. Estimating the dimension of a model. Ann. Statist. 6:461–464. https:
//doi.org/10.1214/aos/1176344136
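The BIC approximation is trivial to compute once the maximum log-likelihood is available. A hedged Python sketch with hypothetical fit values (the module's own computations use R):

```python
import math

def bic(max_log_lik, d, n):
    """Schwarz BIC: -2 * maximum log-likelihood + d * log(n)."""
    return -2.0 * max_log_lik + d * math.log(n)

# hypothetical fits on n = 100 observations:
# model 1: 2 parameters, log-likelihood -120; model 2: 5 parameters, log-likelihood -118
print(bic(-120.0, 2, 100))                        # smaller BIC ...
print(bic(-120.0, 2, 100) < bic(-118.0, 5, 100))  # True: the simpler model is preferred
```

Note how the complexity penalty $d_M \log n$ outweighs the small gain in likelihood of the larger model.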
Pr(𝐻0 |𝑥 𝑖 ),
$$\Pr(H_0 \mid x_i) = \frac{\pi_0 f_0(x_i)}{f(x_i)} = LFDR(x_i)$$
This quantity is also known as the local FDR or local False Discovery Rate.
In the given one-sided setup the local FDR is large (close to 1) for small $x$ and becomes close to 0 for large $x$. A common decision rule is obtained by thresholding local false discovery rates: if $LFDR(x_i) < 0.1$ then $x_i$ is called significant.
11.4.3.3 q-values
In correspondence to 𝑝-values one can also define tail-area based false discovery
rates:
$$Fdr(x_i) = \Pr(H_0 \mid X > x_i) = \frac{\pi_0 F_0(x_i)}{F(x_i)}$$
These are called q-values, or simply False Discovery Rates (FDR). Intrigu-
ingly, these also have a frequentist interpretation as adjusted p-values (using a
Benjamini-Hochberg adjustment procedure).
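The local FDR itself only requires evaluating the null and marginal densities. A self-contained Python sketch for a two-component normal mixture (the mixture weight $\pi_0 = 0.8$ and alternative mean $\mu_1 = 3$ are illustrative assumptions, not values from the text):

```python
import math

def norm_pdf(x, mu=0.0):
    """Standard-deviation-1 normal density."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def lfdr(x, pi0=0.8, mu1=3.0):
    """Local FDR pi0*f0(x)/f(x): null N(0,1) with weight pi0, alternative N(mu1,1)."""
    f0 = norm_pdf(x)
    f = pi0 * f0 + (1.0 - pi0) * norm_pdf(x, mu=mu1)
    return pi0 * f0 / f

print(lfdr(0.0) > 0.9)   # True: small x, almost surely from the null component
print(lfdr(4.0) < 0.1)   # True: large x, significant at the 0.1 threshold
```

In practice the mixture density $f$ and the weight $\pi_0$ are estimated from the data, which is what the R packages listed below do.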
11.4.4 Software
There are a number of R packages to compute (local) FDR values, for example:
• locfdr
• qvalue
• fdrtool
For large sample size 𝑛 the posterior mean converges to the maximum likelihood
estimate (and the posterior distribution to normal distribution centered around
the MLE), so for large 𝑛 we may ignore specifying a prior.
1. Use a weakly informative prior. This means that you do have an idea (even if only vague) about the suitable values of the parameter of interest, and you use a corresponding prior (for example with moderate variance) to model the uncertainty. This acknowledges that there are no uninformative priors, but it also aims to ensure that the prior does not dominate the likelihood (i.e. the data). The result is a weakly regularised estimator. Note that it is often desirable that the prior adds information (if only a little) so that it can act as a regulariser.
2. Empirical Bayes methods can often be used to determine one or all of the
hyperparameters (i.e. the parameters in the prior) from the observed data.
There are several ways to do this, one of them is to tune the shrinkage
parameter 𝜆 to achieve minimum MSE. We discuss this further below.
Furthermore, there also exist many proposals advocating so-called "uninformative priors" or "objective priors". However, there are no actually uninformative priors, since a prior distribution that looks uninformative (i.e. "flat") in one coordinate system can be informative in another; this is a simple consequence of the rule for the transformation of probability densities. As a result, the suggested objective priors are often in fact improper, i.e. not actually probability distributions!
1Jeffreys, H. 1946. An invariant form for the prior probability in estimation problems. Proc. Roy. Soc. A
186:453–461. https://doi.org/10.1098/rspa.1946.0056.
For the Inverse-Gamma-Normal model the Jeffreys prior is the improper prior $p(\sigma^2) = \frac{1}{\sigma^2}$.
This already illustrates the main problem with this type of prior – namely
that it often is improper, i.e. the prior distribution is not actually a probability
distribution (i.e. the density does not integrate to 1).
Another issue is that Jeffreys priors are usually not conjugate which complicates
the update from the prior to the posterior.
Furthermore, if there are multiple parameters (𝜽 is a vector) then Jeffreys priors
do not usually lead to sensible priors.
2Bernardo, J. M. 1979. Reference posterior distributions for Bayesian inference (with discussion). JRSS B
41:113–147. https://doi.org/10.1111/j.2517-6161.1979.tb01066.x
It turns out that this can be minimised without knowing the actual true value of $\theta$, and the result for an unbiased $\hat\theta_{ML}$ is
$$\lambda^\star = \frac{\text{Var}(\hat\theta_{ML})}{E\big((\hat\theta_{ML} - \theta_0)^2\big)}$$
Hence, the shrinkage intensity will be small if the variance of the MLE is small
and/or if the target and the MLE differ substantially. On the other hand, if the
variance of the MLE is large and/or the target is close to the MLE the shrinkage
intensity will be large.
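The two regimes described above can be checked directly. A Python sketch with illustrative variance and mean-squared-discrepancy values (names and numbers are our own):

```python
def optimal_shrinkage(var_ml, mse_target):
    """lambda* = Var(theta_ML) / E((theta_ML - theta0)^2), both assumed known or estimated."""
    return var_ml / mse_target

print(optimal_shrinkage(0.1, 2.0))   # 0.05: precise MLE, target far away -> little shrinkage
print(optimal_shrinkage(1.0, 1.25))  # 0.8:  noisy MLE, target close      -> strong shrinkage
```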
$$\hat{\boldsymbol\mu}_{JS} = \left(1 - \frac{d-2}{\|\boldsymbol x\|^2}\right)\boldsymbol x$$
Here we recognise $\hat{\boldsymbol\mu}_{ML} = \boldsymbol x$, $\boldsymbol\mu_0 = 0$ and the shrinkage intensity $\lambda^\star = \frac{d-2}{\|\boldsymbol x\|^2}$.
Efron and Morris (1972) and Lindley and Smith (1972) later generalised the
James-Stein estimator to the case of multiple observations 𝒙 1 , . . . 𝒙 𝑛 and target 𝝁0 ,
yielding an empirical Bayes estimate of 𝜇 based on the Normal-Normal model.
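For a single observation the James-Stein estimator is a one-liner. A minimal Python sketch (the vector below is an arbitrary illustration):

```python
def james_stein(x):
    """James-Stein shrinkage of one d-dimensional observation towards 0 (requires d >= 3)."""
    d = len(x)
    norm2 = sum(xi * xi for xi in x)     # ||x||^2
    lam = (d - 2) / norm2                # shrinkage intensity lambda*
    return [(1.0 - lam) * xi for xi in x]

x = [2.0, -1.0, 3.0, 0.5]                # d = 4, ||x||^2 = 14.25
print(james_stein(x))                    # every component shrunk towards zero
```

Each component is multiplied by the same factor $1 - \lambda^\star$, which is why this is called linear shrinkage.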
Chapter 13
Optimality properties and summary
$$\theta_1 = \lambda\theta_0 + (1-\lambda)\hat\theta_{ML}$$
with shrinkage intensity $\lambda = \frac{m}{n+m}$
• 𝑚 can be interpreted as prior sample size
13.1.1 Remarks
• If posterior in same family as prior → conjugate prior.
• In an exponential family the Bayesian update of the mean is always
expressible as linear shrinkage of the MLE.
• Note that the Bayesian estimator is biased for finite 𝑛 by construction (but
asymptotically unbiased like the MLE).
13.1.2 Advantages
• Adding prior information has regularisation properties. This is very im-
portant in more complex models with many parameters, e.g., in estimation
of a covariance matrix (to avoid singularity).
• Bayesian credible intervals are conceptually much simpler than frequentist confidence intervals.
As a result, Bayesian estimators may have smaller MSE (=squared bias + variance)
than the ML estimator for finite 𝑛.
Unfortunately, this theorem does not tell us which prior is needed to achieve optimality; however, an optimal estimator can often be found by tuning the hyper-parameter $\lambda$.
principle (1957).
This shows (again) how fundamentally important KL divergence is in statistics.
It not only leads to likelihood inference (via forward KL) but also to Bayesian
learning, as well as to other forms of information updating (via reverse KL).
Furthermore, in Bayesian statistics relative entropy is useful to choose priors
(e.g. reference priors) and it also helps in (Bayesian) experimental design to
quantify the information provided by an experiment.
13.4 Conclusion
Bayesian statistics offers a coherent framework for statistical learning from data,
with methods for
• estimation
• testing
• model building
There are a number of theorems that show that “optimal” estimators (defined in
various ways) are all Bayesian.
It is conceptually very simple — but can be computationally very involved!
It provides a coherent generalisation of classical TRUE/FALSE logic (and there-
fore does not suffer from some of the inconsistencies prevalent in frequentist
statistics).
Bayesian statistics is a non-asymptotic theory; it works for any sample size. Asymptotically (large $n$) it is consistent and converges to the true model (like ML!). But Bayesian reasoning can also be applied to events that take place only once; no assumption of hypothetical infinitely many repetitions, as in frequentist statistics, is needed.
Moreover, many classical (frequentist) procedures may be viewed as approxima-
tions to Bayesian methods and estimators, so using classical approaches in the
correct application domain is perfectly in line with the Bayesian framework.
Bayesian estimation and inference also automatically regularises (via the prior)
which is important for complex models and when there is the problem of
overfitting.
Part III
Regression
Chapter 14
Overview over regression modelling
𝑦 = 𝑓 (𝑥1 , 𝑥2 , . . . , 𝑥 𝑑 ) + 𝜀
14.2 Objectives
1. Understand the relationship between the response 𝑦 and the predictor
variables 𝑥 𝑖 by learning the regression function 𝑓 from observed data
(training data). The estimated regression function is 𝑓ˆ.
2. Prediction of outcomes
$$\hat{y} = \hat{f}(x_1, x_2, \ldots, x_d) \quad \text{(predicted response using the fitted } \hat{f}\text{)}$$
$$y^\star = f(x_1, x_2, \ldots, x_d) \quad \text{(predicted response using the known } f\text{)}$$
3. Variable importance
• which covariates are most relevant in predicting the outcome?
• allows us to better understand the data and model
→ variable selection (to build simpler model with same predictive
capability)
𝑦 = 𝑓 (𝑥1 , . . . , 𝑥 𝑑 ) + "noise"
In R the linear model is implemented in the function lm(), and generalised linear
models in the function glm(). Generalised additive models are available in the
package “mgcv”.
In the following we focus on the linear regression model with continuous
response.
Chapter 15
Linear Regression
$$f(x_1, \ldots, x_d) = \beta_0 + \sum_{j=1}^{d}\beta_j x_j = y^\star$$
In vector notation:
$$f(\boldsymbol x) = \beta_0 + \boldsymbol\beta^T\boldsymbol x = y^\star$$
with $\boldsymbol\beta = (\beta_1, \ldots, \beta_d)^T$ and $\boldsymbol x = (x_1, \ldots, x_d)^T$.
Therefore, the linear regression model is
$$y = \beta_0 + \sum_{j=1}^{d}\beta_j x_j + \varepsilon = \beta_0 + \boldsymbol\beta^T\boldsymbol x + \varepsilon = y^\star + \varepsilon$$
where:
• 𝛽 0 is the intercept
• 𝜷 = (𝛽 1 , . . . , 𝛽 𝑑 )𝑇 are the regression coefficients
• 𝒙 = (𝑥1 , . . . , 𝑥 𝑑 )𝑇 is the predictor vector containing the predictor variables
ii. Predictors: $E(x_i) = \mu_{x_i}$ (or $E(\boldsymbol x) = \boldsymbol\mu_{\boldsymbol x}$), $\text{Var}(x_i) = \sigma^2_{x_i}$ and $\text{Cor}(x_i, x_j) = \rho_{ij}$ (or $\text{Var}(\boldsymbol x) = \boldsymbol\Sigma_{\boldsymbol x}$).
The signal variance $\text{Var}(y^\star) = \text{Var}(\beta_0 + \boldsymbol\beta^T\boldsymbol x) = \boldsymbol\beta^T\boldsymbol\Sigma_{\boldsymbol x}\boldsymbol\beta$ is also called the explained variation.
• We assume that 𝑦 and 𝒙 are jointly distributed with correlation Cor(𝑦, 𝑥 𝑗 ) =
𝜌 𝑦,𝑥 𝑗 between each predictor variable 𝑥 𝑗 and the response 𝑦.
• In contrast to 𝑦 and 𝒙 the noise variable 𝜀 is only indirectly observed via
the difference 𝜀 = 𝑦 − 𝑦★. We denote the mean and variance of the noise
by E(𝜀) and Var(𝜀).
The noise variance Var(𝜀) is also called the unexplained variation or the
residual variance. The residual standard error is SD(𝜀).
Identifiability assumptions:
In a statistical analysis we would like to be able to separate signal (𝑦★) from
noise (𝜀). To achieve this we require some distributional assumptions to ensure
identifiability and avoid confounding:
1) Assumption 1: $\varepsilon$ and $y^\star$ are independent. This implies $\text{Var}(y) = \text{Var}(y^\star) + \text{Var}(\varepsilon)$, or equivalently $\text{Var}(\varepsilon) = \text{Var}(y) - \text{Var}(y^\star)$.
Thus, this assumption implies the decomposition of variance, i.e. that the total variation $\text{Var}(y)$ equals the sum of the explained variation $\text{Var}(y^\star)$ and the unexplained variation $\text{Var}(\varepsilon)$.
2) Assumption 2: $E(\varepsilon) = 0$. This allows us to identify the intercept $\beta_0$ and implies $E(y) = E(y^\star)$.
Optional assumptions (often but not always):
• The noise 𝜀 is normally distributed
• The response $y$ and the predictor variables $x_i$ are continuous variables
• The response and predictor variables are jointly normally distributed
Further properties:
• As a result of the independence assumption 1) we can only choose two out
of the three variances freely:
i. in a generative perspective we will choose signal variance Var(𝑦★)
(or equivalently the variances Var(𝑥 𝑗 )) and the noise variance Var(𝜀),
then the variance of the response Var(𝑦) follows.
ii. in an observational perspective we will observe the variance of the response $\text{Var}(y)$ and the variances $\text{Var}(x_j)$, and then the error variance $\text{Var}(\varepsilon)$ follows.
• As we will see later, the regression coefficients $\beta_j$ depend on the correlations between the response $y$ and the predictor variables $x_j$. Thus, the choice of regression coefficients implies a specific correlation pattern, and vice versa (in fact, we will use this correlation pattern to infer the regression coefficients from data!).
$$\boldsymbol X = \begin{pmatrix} x_{11} & \ldots & x_{1d} \\ \vdots & \ddots & \vdots \\ x_{n1} & \ldots & x_{nd} \end{pmatrix}$$
Note the statistics convention: the 𝑛 rows of 𝑿 contain the samples, and the 𝑑
columns contain variables.
Response data vector: (𝑦1 , . . . , 𝑦𝑛 )𝑇 = 𝒚
Then the regression equation is written in data matrix notation:
$$\underbrace{\boldsymbol y}_{n\times 1} = \underbrace{1_n}_{n\times 1}\beta_0 + \underbrace{\boldsymbol X}_{n\times d}\underbrace{\boldsymbol\beta}_{d\times 1} + \underbrace{\boldsymbol\varepsilon}_{n\times 1}$$
where $1_n = (1, \ldots, 1)^T$ is a column vector of length $n$ (size $n\times 1$) and $\boldsymbol\varepsilon$ contains the residuals.
Note that here the regression coefficients are multiplied after the data matrix (compare with the original vector notation, where the transpose of the regression coefficients comes before the vector of predictors).
The observed noise values (i.e. realisations of the random variable 𝜀) are called
the residuals.
Chapter 16
Estimating regression coefficients
Idea: choose regression coefficients such as to minimise the squared error between
observations and the prediction.
In data matrix notation (note we assume 𝛽0 = 0 and thus centered data 𝑿 and 𝒚):
$$RSS(\boldsymbol\beta) = (\boldsymbol y - \boldsymbol X\boldsymbol\beta)^T(\boldsymbol y - \boldsymbol X\boldsymbol\beta)$$
The ordinary least squares (OLS) estimator is
$$\hat{\boldsymbol\beta}_{OLS} = \arg\min_{\boldsymbol\beta} RSS(\boldsymbol\beta)$$
Expanding the residual sum of squares gives
$$RSS(\boldsymbol\beta) = \boldsymbol y^T\boldsymbol y - 2\boldsymbol\beta^T\boldsymbol X^T\boldsymbol y + \boldsymbol\beta^T\boldsymbol X^T\boldsymbol X\boldsymbol\beta$$
Gradient:
$$\nabla RSS(\boldsymbol\beta) = -2\boldsymbol X^T\boldsymbol y + 2\boldsymbol X^T\boldsymbol X\boldsymbol\beta$$
Setting the gradient to zero yields the normal equations
$$\nabla RSS(\hat{\boldsymbol\beta}) = 0 \;\longrightarrow\; \boldsymbol X^T\boldsymbol y = \boldsymbol X^T\boldsymbol X\,\hat{\boldsymbol\beta}$$
$$\Longrightarrow \hat{\boldsymbol\beta}_{OLS} = \left(\boldsymbol X^T\boldsymbol X\right)^{-1}\boldsymbol X^T\boldsymbol y$$
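For a single centered predictor the normal equations collapse to a ratio of sums, which can be checked directly. A Python sketch with made-up data (the module's own examples use R):

```python
def ols_single(x, y):
    """OLS slope for one centered predictor:
    (X^T X)^{-1} X^T y reduces to sum(x_i*y_i) / sum(x_i^2)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / sxx

x = [-2.0, -1.0, 0.0, 1.0, 2.0]          # centered predictor
y = [-4.1, -1.9, 0.2, 2.1, 3.9]          # approximately y = 2x
print(ols_single(x, y))                   # close to 2
```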
𝑦 = 𝛽 0 + 𝜷𝑇 𝒙 + 𝜀
𝛽 0 = 𝜇 𝑦 − 𝜷𝑇 𝝁 𝒙
Combining the two equations above we see that the noise variable equals
𝜀 = (𝑦 − 𝜇 𝑦 ) − 𝜷𝑇 (𝒙 − 𝝁𝒙 )
Assuming joint (multivariate) normality for the observed data, the response 𝑦
and predictors 𝒙, we get as the MLEs for the respective means and (co)variances:
• $\hat\mu_y = \hat{E}(y) = \frac{1}{n}\sum_{i=1}^n y_i$
• $\hat\sigma^2_y = \widehat{\text{Var}}(y) = \frac{1}{n}\sum_{i=1}^n (y_i - \hat\mu_y)^2$
• $\hat{\boldsymbol\mu}_{\boldsymbol x} = \hat{E}(\boldsymbol x) = \frac{1}{n}\sum_{i=1}^n \boldsymbol x_i$
• $\hat{\boldsymbol\Sigma}_{\boldsymbol x\boldsymbol x} = \widehat{\text{Var}}(\boldsymbol x) = \frac{1}{n}\sum_{i=1}^n (\boldsymbol x_i - \hat{\boldsymbol\mu}_{\boldsymbol x})(\boldsymbol x_i - \hat{\boldsymbol\mu}_{\boldsymbol x})^T$
• $\hat{\boldsymbol\Sigma}_{\boldsymbol x y} = \widehat{\text{Cov}}(\boldsymbol x, y) = \frac{1}{n}\sum_{i=1}^n (\boldsymbol x_i - \hat{\boldsymbol\mu}_{\boldsymbol x})(y_i - \hat\mu_y)$
Note that these are sufficient statistics and hence summarise the observed data for $\boldsymbol x$ and $y$ perfectly under the normal assumption.
Consequently, the residuals (indirect observations of the noise variable) for a
given choice of regression coefficients 𝜷 and the observed data for 𝒙 and 𝑦 are
𝜀𝑖 = (𝑦 𝑖 − 𝜇ˆ 𝑦 ) − 𝜷𝑇 (𝒙 𝑖 − 𝝁ˆ 𝒙 )
Assuming that the noise $\varepsilon \sim N(0, \sigma^2_\varepsilon)$ is normally distributed with mean 0 and variance $\text{Var}(\varepsilon) = \sigma^2_\varepsilon$, we can write down the normal log-likelihood function for $\sigma^2_\varepsilon$ and $\boldsymbol\beta$:
$$\log L(\boldsymbol\beta, \sigma^2_\varepsilon) = -\frac{n}{2}\log\sigma^2_\varepsilon - \frac{1}{2\sigma^2_\varepsilon}\sum_{i=1}^n \left((y_i - \hat\mu_y) - \boldsymbol\beta^T(\boldsymbol x_i - \hat{\boldsymbol\mu}_{\boldsymbol x})\right)^2$$
$$\hat{\boldsymbol\beta} = \hat{\boldsymbol\Sigma}_{\boldsymbol x\boldsymbol x}^{-1}\hat{\boldsymbol\Sigma}_{\boldsymbol x y}$$
with $\hat y_i = \hat\beta_0 + \hat{\boldsymbol\beta}^T\boldsymbol x_i$ and the residuals $y_i - \hat y_i$ resulting from the fitted linear model. This leads to the MLE of the noise variance
$$\hat\sigma^2_\varepsilon = \frac{1}{n}\sum_{i=1}^n (y_i - \hat y_i)^2$$
Note that the MLE $\hat\sigma^2_\varepsilon$ is a biased estimate of $\sigma^2_\varepsilon$. The unbiased estimate is $\frac{1}{n-d-1}\sum_{i=1}^n (y_i - \hat y_i)^2$, where $d$ is the dimension of $\boldsymbol\beta$ (i.e. the number of predictors).
16.2.3 Asymptotics
16.2.3 Asymptotics
The advantage of using maximum likelihood is that we also get the (asymptotic) variance associated with each estimator and typically can also assume asymptotic normality.
Specifically, for $\hat{\boldsymbol\beta}$ we get via the observed Fisher information at the MLE an asymptotic estimator of its variance
$$\widehat{\text{Var}}(\hat{\boldsymbol\beta}) = \frac{1}{n}\hat\sigma^2_\varepsilon\,\hat{\boldsymbol\Sigma}_{\boldsymbol x\boldsymbol x}^{-1}$$
and likewise for the intercept
$$\widehat{\text{Var}}(\hat\beta_0) = \frac{1}{n}\hat\sigma^2_\varepsilon\left(1 + \hat{\boldsymbol\mu}_{\boldsymbol x}^T\hat{\boldsymbol\Sigma}_{\boldsymbol x\boldsymbol x}^{-1}\hat{\boldsymbol\mu}_{\boldsymbol x}\right)$$
For finite sample size $n$ with known $\text{Var}(\varepsilon)$ one can show that the variances are
$$\text{Var}(\hat{\boldsymbol\beta}) = \frac{1}{n}\sigma^2_\varepsilon\,\hat{\boldsymbol\Sigma}_{\boldsymbol x\boldsymbol x}^{-1}$$
and
$$\text{Var}(\hat\beta_0) = \frac{1}{n}\sigma^2_\varepsilon\left(1 + \hat{\boldsymbol\mu}_{\boldsymbol x}^T\hat{\boldsymbol\Sigma}_{\boldsymbol x\boldsymbol x}^{-1}\hat{\boldsymbol\mu}_{\boldsymbol x}\right)$$
and that the regression coefficients and the intercept are normally distributed according to
$$\hat{\boldsymbol\beta} \sim N_d(\boldsymbol\beta, \text{Var}(\hat{\boldsymbol\beta}))$$
and
$$\hat\beta_0 \sim N(\beta_0, \text{Var}(\hat\beta_0))$$
$$\hat{\boldsymbol\beta} = \left(\boldsymbol X^T\boldsymbol X\right)^{-1}\boldsymbol X^T\boldsymbol y$$
with the data matrix
$$\boldsymbol X = \begin{pmatrix} x_{11} & \ldots & x_{1d} \\ \vdots & \ddots & \vdots \\ x_{n1} & \ldots & x_{nd} \end{pmatrix}$$
Note that we assume that the data matrix $\boldsymbol X$ is centered (i.e. the column sums $\boldsymbol X^T 1_n = 0$ are zero).
Likewise 𝒚 = (𝑦1 , . . . , 𝑦𝑛 )𝑇 is the response data vector (also centered with
𝒚𝑇 1𝑛 = 0).
Noting that
$$\hat{\boldsymbol\Sigma}_{\boldsymbol x\boldsymbol x} = \frac{1}{n}\boldsymbol X^T\boldsymbol X$$
is the MLE of the covariance matrix of $\boldsymbol x$ and
$$\hat{\boldsymbol\Sigma}_{\boldsymbol x y} = \frac{1}{n}\boldsymbol X^T\boldsymbol y$$
is the MLE of the covariance between $\boldsymbol x$ and $y$, we see that the OLS estimate of the regression coefficients can be expressed as
$$\hat{\boldsymbol\beta} = \hat{\boldsymbol\Sigma}_{\boldsymbol x\boldsymbol x}^{-1}\hat{\boldsymbol\Sigma}_{\boldsymbol x y}$$
This is the plug-in version of the population quantity
$$\boldsymbol\beta = \boldsymbol\Sigma_{\boldsymbol x\boldsymbol x}^{-1}\boldsymbol\Sigma_{\boldsymbol x y}$$
16.5.1.1 Result:
The mean squared prediction error $MSPE$ in dependence of $(b_0, \boldsymbol b)$ is
$$MSPE(b_0, \boldsymbol b) = E\left((y - b_0 - \boldsymbol b^T\boldsymbol x)^2\right)$$
We look for
$$(\beta_0, \boldsymbol\beta) = \arg\min_{b_0, \boldsymbol b} MSPE(b_0, \boldsymbol b)$$
The gradient is
$$\nabla MSPE = \begin{pmatrix} -2(\mu_y - b_0 - \boldsymbol b^T\boldsymbol\mu_{\boldsymbol x}) \\ 2\boldsymbol\Sigma_{\boldsymbol x\boldsymbol x}\boldsymbol b - 2\boldsymbol\Sigma_{\boldsymbol x y} - 2\boldsymbol\mu_{\boldsymbol x}(\mu_y - b_0 - \boldsymbol b^T\boldsymbol\mu_{\boldsymbol x}) \end{pmatrix}$$
Setting the gradient to zero gives
$$\begin{pmatrix} \beta_0 \\ \boldsymbol\beta \end{pmatrix} = \begin{pmatrix} \mu_y - \boldsymbol\beta^T\boldsymbol\mu_{\boldsymbol x} \\ \boldsymbol\Sigma_{\boldsymbol x\boldsymbol x}^{-1}\boldsymbol\Sigma_{\boldsymbol x y} \end{pmatrix}$$
Thus, the optimal values for $b_0$ and $\boldsymbol b$ in the best linear predictor correspond to the previously derived coefficients $\beta_0$ and $\boldsymbol\beta$!
$$E(y \mid \boldsymbol x) = y^\star = \beta_0 + \boldsymbol\beta^T\boldsymbol x$$
with coefficients $\boldsymbol\beta = \boldsymbol\Sigma_{\boldsymbol x\boldsymbol x}^{-1}\boldsymbol\Sigma_{\boldsymbol x y}$ and intercept $\beta_0 = \mu_y - \boldsymbol\beta^T\boldsymbol\mu_{\boldsymbol x}$.
a) The conditional mean $y^\star$ has expectation
$$E(y^\star) = \beta_0 + \boldsymbol\beta^T\boldsymbol\mu_{\boldsymbol x} = \mu_y$$
and variance
$$\text{Var}(y^\star) = \text{Var}(E(y \mid \boldsymbol x)) = \boldsymbol\beta^T\boldsymbol\Sigma_{\boldsymbol x\boldsymbol x}\boldsymbol\beta = \boldsymbol\Sigma_{y\boldsymbol x}\boldsymbol\Sigma_{\boldsymbol x\boldsymbol x}^{-1}\boldsymbol\Sigma_{\boldsymbol x y} = \sigma^2_y\,\boldsymbol P_{y\boldsymbol x}\boldsymbol P_{\boldsymbol x\boldsymbol x}^{-1}\boldsymbol P_{\boldsymbol x y} = \sigma^2_y\,\Omega^2$$
b) Conditional variance:
$$\text{Var}(y \mid \boldsymbol x) = \sigma^2_y(1 - \Omega^2)$$
In this chapter we first introduce the (squared) multiple correlation and the
multiple and adjusted 𝑅 2 coefficients as estimators. Subsequently we discuss
variance decomposition.
$$\Omega^2 = \boldsymbol P_{y\boldsymbol x}\boldsymbol P_{\boldsymbol x\boldsymbol x}^{-1}\boldsymbol P_{\boldsymbol x y} = \sigma_y^{-2}\,\boldsymbol\Sigma_{y\boldsymbol x}\boldsymbol\Sigma_{\boldsymbol x\boldsymbol x}^{-1}\boldsymbol\Sigma_{\boldsymbol x y}$$
With $\boldsymbol\beta = \boldsymbol\Sigma_{\boldsymbol x\boldsymbol x}^{-1}\boldsymbol\Sigma_{\boldsymbol x y}$ and $\beta_0 = \mu_y - \boldsymbol\beta^T\boldsymbol\mu_{\boldsymbol x}$ it is straightforward to verify that $\text{Cov}(y, y^\star) = \text{Var}(y^\star)$, hence the correlation
$$\text{Cor}(y, y^\star) = \frac{\text{Cov}(y, y^\star)}{\text{SD}(y)\,\text{SD}(y^\star)} = \Omega$$
with $\Omega \geq 0$.
Var(𝜀) = 𝜎2𝑦 (1 − Ω2 ) .
Ω2 = 1 − Var(𝜀)/𝜎2𝑦
The maximum likelihood estimate of the noise variance $\text{Var}(\varepsilon)$ (also called the residual variance) can be computed from the residual sum of squares $RSS = \sum_{i=1}^n (y_i - \hat y_i)^2$ as follows:
$$\widehat{\text{Var}}(\varepsilon)_{ML} = \frac{RSS}{n}$$
whereas the unbiased estimate is obtained by
$$\widehat{\text{Var}}(\varepsilon)_{UB} = \frac{RSS}{n-d-1} = \frac{RSS}{df}$$
$$1 - R^2 = (1 - R^2_{\text{adj}})\,\frac{df}{n-1}$$
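Both coefficients can be computed directly from the observations and fitted values. A Python sketch with illustrative numbers (the module's own computations use R's lm()):

```python
def r2_and_adjusted(y, yhat, d):
    """Multiple R^2 and adjusted R^2 from observations, fitted values and
    number of predictors d, using 1 - R^2 = (1 - R^2_adj) * df / (n - 1)."""
    n = len(y)
    ybar = sum(y) / n
    tss = sum((yi - ybar) ** 2 for yi in y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))
    r2 = 1.0 - rss / tss
    df = n - d - 1
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / df
    return r2, r2_adj

y = [1.0, 2.0, 3.0, 4.0, 5.0]
yhat = [1.1, 1.9, 3.2, 3.8, 5.0]          # illustrative fitted values from d = 1 predictor
print(r2_and_adjusted(y, yhat, 1))        # R^2 = 0.99, adjusted R^2 slightly lower
```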
17.1.2 R commands
In R the command lm() fits the linear regression model.
In addition to the regression coefficients (and derived quantities) the R function lm() also lists
• the multiple R-squared $R^2$,
• the adjusted R-squared $R^2_{\text{adj}}$,
• the degrees of freedom $df$, and
• the residual standard error $\sqrt{\widehat{\text{Var}}(\varepsilon)_{UB}}$ (computed from the unbiased variance estimate).
See also Worksheet R3 which provides R code to reproduce the exact output of
the native lm() R function.
The unexplained variance measures the fit after introducing predictors into
the model (smaller means better fit). The total variance measures the fit of the
model without any predictors. The explained variance is the difference between
total and unexplained variance, it indicates the increase in model fit due to the
predictors.
1) coefficient of determination:
$$\frac{\text{explained var}}{\text{total var}} = \frac{\sigma^2_y\,\Omega^2}{\sigma^2_y} = \Omega^2$$
(range 0 to 1, with 1 indicating perfect fit)
2) coefficient of non-determination, coefficient of alienation:
$$\frac{\text{unexplained var}}{\text{total var}} = \frac{\sigma^2_y(1 - \Omega^2)}{\sigma^2_y} = 1 - \Omega^2$$
(range 0 to 1, with 0 indicating perfect fit)
3) $F$ score, $t^2$ score:
$$\underbrace{\sum_{i=1}^n (y_i - \bar y)^2}_{\text{total sum of squares (TSS)}} = \underbrace{\sum_{i=1}^n (\hat y_i - \bar y)^2}_{\text{explained sum of squares (ESS)}} + \underbrace{\sum_{i=1}^n (y_i - \hat y_i)^2}_{\text{residual sum of squares (RSS)}}$$
Note that TSS, ESS and RSS all scale with $n$. Using data vector notation the sample-based variance decomposition can be written in the form of the Pythagorean theorem:
$$\underbrace{\|\boldsymbol y - \bar y 1_n\|^2}_{\text{TSS}} = \underbrace{\|\hat{\boldsymbol y} - \bar y 1_n\|^2}_{\text{ESS}} + \underbrace{\|\boldsymbol y - \hat{\boldsymbol y}\|^2}_{\text{RSS}}$$
For centered response ($\bar y = 0$) this simplifies to
$$\|\boldsymbol y\|^2 = \|\hat{\boldsymbol y}\|^2 + \underbrace{\|\boldsymbol y - \hat{\boldsymbol y}\|^2}_{\text{RSS}}$$
𝑦★ = E(𝑦|𝒙) = 𝛽 0 + 𝜷𝑇 𝒙
We know that the mean squared prediction error for 𝑦★ is E((𝑦 − 𝑦★)2 ) = Var(𝜀)
and that this is the minimal irreducible error. Hence, we may use Var(𝜀) as the
minimum variability for the prediction.
The corresponding prediction interval is
$$\left[\,y^\star - c\,\text{SD}(\varepsilon),\; y^\star + c\,\text{SD}(\varepsilon)\,\right]$$
where $c$ is some suitable constant (e.g. 1.96 for symmetric 95% normal intervals).
However, please note that the prediction interval constructed in this fashion will be an underestimate. The reason is that it assumes we employ $y^\star = \beta_0 + \boldsymbol\beta^T\boldsymbol x$, but in reality we actually use $\hat y = \hat\beta_0 + \hat{\boldsymbol\beta}^T\boldsymbol x$ for prediction (note the estimated coefficients!). We recall from an earlier chapter (best linear predictor) that this leads to an increase in MSPE compared with using the optimal $\beta_0$ and $\boldsymbol\beta$.
Thus, for better prediction intervals we would need to consider the mean squared prediction error of $\hat y$, which can be written as $E((y - \hat y)^2) = \text{Var}(\varepsilon) + \delta$, where $\delta$ is an additional error term due to using an estimated rather than the true regression function. $\delta$ typically declines with $1/n$ but can be substantial for small $n$ (in particular as it usually depends on the number of predictors $d$).
For more details on this we refer to later modules on regression.
For uncorrelated predictors ($\boldsymbol P_{\boldsymbol x\boldsymbol x} = \boldsymbol I_d$) this simplifies to
$$\Omega^2 = \boldsymbol P_{y\boldsymbol x}\boldsymbol P_{\boldsymbol x y} = \sum_{i=1}^{d}\rho^2_{y,x_i}$$
$$t_i = \frac{\hat\beta_i}{\widehat{\text{SD}}(\hat\beta_i)}$$
To simplify a model, predictors with the largest $p$-values (and thus smallest absolute $t$-scores) may be removed from the model. However, note that having a $p$-value, say, larger than 0.05 is by itself not sufficient to declare a regression coefficient to be zero (because in classical statistical testing you can only reject the null hypothesis, not accept it!).
Note that by construction the regression $t$-scores do not depend on the scale, so rescaling the original data will not affect the corresponding regression $t$-scores. Furthermore, if $\widehat{\text{SD}}(\hat\beta_i)$ is small, then the regression $t$-score $t_i$ can still be large even if $\hat\beta_i$ is small!
18.3.2 Computing
When you perform regression analysis in R (or another statistical software
package) the computer will return the following:
• the estimated regression coefficient $\hat\beta_i$,
• its estimated error $\widehat{\text{SD}}(\hat\beta_i)$,
• the $t$-score $t_i = \hat\beta_i / \widehat{\text{SD}}(\hat\beta_i)$, computed from the first two columns,
• the corresponding $p$-value for $t_i$, based on the $t$-distribution, and
• an indicator of significance (* 0.9, ** 0.95, *** 0.99).
In the lm() function in R the standard deviation is the square root of the unbiased
estimate of the variance (but note that it itself is not unbiased!).
In particular, the (squared) regression 𝑡-score can be 1:1 transformed into the
(estimated) (squared) partial correlation
$$\hat\rho^2_{y,x_i \mid x_{j\neq i}} = \frac{t_i^2}{t_i^2 + df}$$
with 𝑑 𝑓 = 𝑛 − 𝑑 − 1, and it can be shown that the 𝑝-values for testing that 𝛽 𝑖 = 0
are exactly the same as the 𝑝-values for testing that the partial correlation 𝜌 𝑦,𝑥 𝑖 |𝑥 𝑗≠𝑖
vanishes!
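The 1:1 transformation above is a one-line computation. A Python sketch with an illustrative $t$-score (the module's own computations use R):

```python
def partial_cor2(t, n, d):
    """Squared partial correlation from a regression t-score:
    t^2 / (t^2 + df) with df = n - d - 1."""
    df = n - d - 1
    return t * t / (t * t + df)

# t-score 2 from a regression with n = 20 observations and d = 3 predictors
print(partial_cor2(2.0, 20, 3))   # 4 / (4 + 16) = 0.2
```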
$$\frac{t^2}{n} = \frac{R^2}{1 - R^2} = F$$
which is a function of $R^2$. If $R^2 = 0$ then $F = 0$. If $R^2$ is large (close to 1) then $F$ is large as well, and the null hypothesis $\boldsymbol\beta = 0$ can be rejected, which implies that at least one regression coefficient is non-zero. Note that the squared Wald statistic $t^2$ is asymptotically $\chi^2_d$ distributed, which is useful to find critical values and to compute $p$-values.
Appendix A
Refresher
Multiplication:
$$\prod_{i=1}^{n} x_i = x_1 \times x_2 \times \ldots \times x_n$$
Inverse of a matrix.
A.3 Functions
A.3.1 Gradient
The gradient of a scalar-valued function $h(\boldsymbol x)$ with vector argument $\boldsymbol x = (x_1, \ldots, x_d)^T$ is the vector containing the first order partial derivatives of $h(\boldsymbol x)$ with regard to each $x_1, \ldots, x_d$:
$$\nabla h(\boldsymbol x) = \begin{pmatrix} \frac{\partial h(\boldsymbol x)}{\partial x_1} \\ \vdots \\ \frac{\partial h(\boldsymbol x)}{\partial x_d} \end{pmatrix} = \frac{\partial h(\boldsymbol x)}{\partial \boldsymbol x} = \text{grad}\,h(\boldsymbol x)$$
The symbol ∇ is called the nabla operator (also known as del operator).
Note that we write the gradient as a column vector. This is called the denominator
layout convention, see https://en.wikipedia.org/wiki/Matrix_calculus for
details. In contrast, many textbooks (and also earlier versions of these lecture
notes) assume that gradients are row vectors, following the so-called numerator
layout convention.
• $h(\boldsymbol x) = \boldsymbol a^T\boldsymbol x + b$. Then $\nabla h(\boldsymbol x) = \frac{\partial h(\boldsymbol x)}{\partial \boldsymbol x} = \boldsymbol a$.
• $h(\boldsymbol x) = \boldsymbol x^T\boldsymbol x$. Then $\nabla h(\boldsymbol x) = 2\boldsymbol x$.
• $h(\boldsymbol x) = \boldsymbol x^T\boldsymbol A\boldsymbol x$. Then $\nabla h(\boldsymbol x) = (\boldsymbol A + \boldsymbol A^T)\boldsymbol x$.
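The quadratic-form gradient can be verified numerically with central differences. A Python sketch (matrix and evaluation point chosen arbitrarily for illustration):

```python
def num_grad(h, x, eps=1e-6):
    """Central-difference approximation of the gradient of h at x."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        g.append((h(xp) - h(xm)) / (2.0 * eps))
    return g

A = [[1.0, 2.0], [0.0, 3.0]]
h = lambda x: sum(x[i] * A[i][j] * x[j] for i in range(2) for j in range(2))

x0 = [1.0, 1.0]
exact = [sum((A[i][j] + A[j][i]) * x0[j] for j in range(2)) for i in range(2)]
print(exact)               # (A + A^T) x0 = [4.0, 8.0]
print(num_grad(h, x0))     # numerically close to the exact gradient
```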
The Hessian matrix contains the second order partial derivatives of $h(\boldsymbol x)$, with entries
$$\left(\nabla\nabla^T h(\boldsymbol x)\right)_{ij} = \frac{\partial^2 h(\boldsymbol x)}{\partial x_i \partial x_j}$$
so that
$$\nabla\nabla^T h(\boldsymbol x) = \frac{\partial^2 h(\boldsymbol x)}{\partial \boldsymbol x \partial \boldsymbol x^T}$$
By construction the Hessian matrix is square and symmetric.
Example A.2. $h(\boldsymbol x) = \boldsymbol x^T\boldsymbol A\boldsymbol x$. Then $\nabla\nabla^T h(\boldsymbol x) = \frac{\partial^2 h(\boldsymbol x)}{\partial \boldsymbol x \partial \boldsymbol x^T} = \boldsymbol A + \boldsymbol A^T$.
$$h(\boldsymbol x) \approx h(\boldsymbol x_0) + \nabla h(\boldsymbol x_0)^T(\boldsymbol x - \boldsymbol x_0) + \frac{1}{2}(\boldsymbol x - \boldsymbol x_0)^T\nabla\nabla^T h(\boldsymbol x_0)(\boldsymbol x - \boldsymbol x_0)$$
With $\boldsymbol x = \boldsymbol x_0 + \boldsymbol\varepsilon$ this can be written as
$$h(\boldsymbol x_0 + \boldsymbol\varepsilon) \approx h(\boldsymbol x_0) + \nabla h(\boldsymbol x_0)^T\boldsymbol\varepsilon + \frac{1}{2}\boldsymbol\varepsilon^T\nabla\nabla^T h(\boldsymbol x_0)\boldsymbol\varepsilon$$
$$\log(x_0 + \varepsilon) \approx \log(x_0) + \frac{\varepsilon}{x_0} - \frac{\varepsilon^2}{2x_0^2}$$
and
$$\frac{x_0}{x_0 + \varepsilon} \approx 1 - \frac{\varepsilon}{x_0} + \frac{\varepsilon^2}{x_0^2}$$
A.4 Combinatorics
A.4.1 Number of permutations
The number of possible orderings, or permutations, of 𝑛 distinct items is the
number of ways to put 𝑛 items in 𝑛 bins with exactly one item in each bin. It is
given by the factorial
$$n! = \prod_{i=1}^{n} i = 1 \times 2 \times \ldots \times n$$
0! = 1
𝑛! = Γ(𝑛 + 1)
$$\binom{n}{n_1, \ldots, n_K} = \frac{n!}{n_1! \times n_2! \times \ldots \times n_K!}$$
If there are only two bins / types ($K = 2$) the multinomial coefficient becomes the binomial coefficient
$$\binom{n}{n_1} = \binom{n}{n_1, n - n_1} = \frac{n!}{n_1!(n - n_1)!}$$
which counts the number of ways to choose 𝑛1 elements from a set of 𝑛 elements.
The approximation is good for small 𝑛 (but fails for 𝑛 = 0) and becomes more
and more accurate with increasing 𝑛. For large 𝑛 the approximation can be
simplified to
log 𝑛! ≈ 𝑛 log 𝑛 − 𝑛
A.5 Probability
A.5.1 Random variables
A random variable describes a random experiment. The set of possible outcomes
is the sample space or state space and is denoted by Ω = {𝜔1 , 𝜔2 , . . .}. The
outcomes 𝜔 𝑖 are the elementary events. The sample space Ω can be finite or
infinite. Depending on type of outcomes the random variable is discrete or
continuous.
An event $A \subseteq \Omega$ is a subset of $\Omega$ and thus itself a set of elementary events
𝐴 = {𝑎1 , 𝑎2 , . . .}. This includes as special cases the full set 𝐴 = Ω, the empty set
𝐴 = ∅, and the elementary events 𝐴 = 𝜔 𝑖 . The complementary event 𝐴𝐶 is the
complement of the set 𝐴 in the set Ω so that 𝐴𝐶 = Ω \ 𝐴 = {𝜔 𝑖 ∈ Ω : 𝜔 𝑖 ∉ 𝐴}.
The probability of an event is denoted by $\Pr(A)$. We assume that
• $\Pr(A) \geq 0$, probabilities are non-negative,
• $\Pr(\Omega) = 1$, the certain event has probability 1, and
• $\Pr(A) = \sum_{a_i \in A} \Pr(a_i)$, the probability of an event equals the sum of the probabilities of its constituting elementary events $a_i$.
This implies
• Pr(𝐴) ≤ 1, i.e. probabilities all lie in the interval [0, 1]
• Pr(𝐴𝐶 ) = 1 − Pr(𝐴), and
• Pr(∅) = 0
Assume now we have two events 𝐴 and 𝐵. The probability of the event “𝐴 and
𝐵” is then given by the probability of the set intersection Pr(𝐴 ∩ 𝐵). Likewise
the probability of the event “𝐴 or 𝐵” is given by the probability of the set union
Pr(𝐴 ∪ 𝐵).
From the above it is clear that probability theory is closely linked to set theory, and in particular to measure theory. This allows for a unified treatment of discrete and continuous random variables (an elegant framework but not needed for this module).
This is called the “law of the unconscious statistician”, or short LOTUS. Again,
to highlight that the random variable 𝑥 has distribution 𝐹 we write E𝐹 (ℎ(𝑥)).
For an event $A$ we can define a corresponding indicator function
$$1_A(x) = \begin{cases} 1 & x \in A \\ 0 & x \notin A \end{cases}$$
Intriguingly,
$$E(1_A(x)) = \Pr(A)$$
i.e. the expectation of the indicator variable for $A$ is the probability of $A$.
The moments of random variables are also defined by expectation:
• Zeroth moment: $E(x^0) = 1$ by definition of PDF and PMF,
• First moment: $E(x^1) = E(x) = \mu$, the mean,
• Second moment: $E(x^2)$,
• The variance is the second moment centered about the mean $\mu$: $\text{Var}(x) = E\big((x - \mu)^2\big)$.
The transformation of the density is $f_y(y) = \left|\frac{dx}{dy}\right| f_x(x(y))$. Note that $\frac{dx}{dy} = \left(\frac{dy}{dx}\right)^{-1}$.
• Correspondingly, for $n \to \infty$ the average $E_{\hat F_n}(h(x)) = \frac{1}{n}\sum_{i=1}^n h(x_i)$ converges to the expectation $E_F(h(x))$.
A.6 Distributions
A.6.1 Bernoulli distribution and binomial distribution
The Bernoulli distribution Ber($p$) is the simplest distribution possible. It is named after Jacob Bernoulli (1655–1705), who also discovered the law of large numbers.
It describes a discrete binary random variable with two states 𝑥 = 0 (“failure”)
and 𝑥 = 1 (“success”), where the parameter 𝑝 ∈ [0, 1] is the probability of
“success”. Often the Bernoulli distribution is also referred to as “coin tossing”
model with the two outcomes “heads” and “tails”.
Correspondingly, the probability mass function of Ber($p$) is
$$f(x = 0) = \Pr(\text{"failure"}) = 1 - p$$
and
$$f(x = 1) = \Pr(\text{"success"}) = p$$
A compact way to write the PMF of the Bernoulli distribution is
$$f(x \mid p) = p^x (1-p)^{1-x}$$
and we write
$$x \sim \text{Ber}(p)\,.$$
The expected value is E(𝑥) = 𝑝 and the variance is Var(𝑥) = 𝑝(1 − 𝑝).
Closely related to the Bernoulli distribution is the binomial distribution Bin(𝑚, 𝑝)
which results from repeating a Bernoulli experiment 𝑚 times and counting the
number of successes among the 𝑚 trials (without keeping track of the ordering
of the experiments).
Its probability mass function is
\[
f(x | p) = \binom{m}{x} p^x (1-p)^{m-x}, \qquad x \sim \operatorname{Bin}(m, p) .
\]
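A short sketch (with arbitrary illustrative values \(m = 10\), \(p = 0.3\); the binomial coefficient computed via Python's `math.comb`) confirming that the PMF sums to one and that the mean is \(mp\):

```python
import math

# Sketch (illustrative values m = 10, p = 0.3): the Bin(m, p) PMF.
m, p = 10, 0.3

def binom_pmf(x):
    # binomial coefficient times success/failure probabilities
    return math.comb(m, x) * p ** x * (1 - p) ** (m - x)

total = sum(binom_pmf(x) for x in range(m + 1))
mean = sum(x * binom_pmf(x) for x in range(m + 1))
print(total, mean)  # 1 and m * p = 3
```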
The normal distribution \(N(\mu, \sigma^2)\) has probability density function
\[
f(x | \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) .
\]
[Figure: the density and the cumulative distribution function Φ(𝑥) of the standard normal.]
The inverse Φ−1 (𝑝) is called the quantile function of the standard normal. In R
the function is called qnorm().
[Figure: the standard normal quantile function Φ⁻¹(𝑝).]
The sum of two independent normal random variables is also normal, with mean the sum of the means and variance the sum of the variances.
The sum of squares \( x = \sum_{i=1}^m z_i^2 \) of independent normal variables
\[
z_1, z_2, \ldots, z_m \sim N(0, \sigma_z^2)
\]
with
\[
\mu_x = \text{E}(x) = m \sigma_z^2
\]
follows a scaled chi-squared distribution
\[
x \sim \frac{\mu_x}{m} \chi^2_m = W_1\!\left( \frac{\mu_x}{m}, m \right) = \operatorname{Gam}\!\left( \underbrace{\frac{m}{2}}_{\text{shape}}, \underbrace{\frac{2\mu_x}{m}}_{\text{scale}} \right)
\]
with variance \( \operatorname{Var}(x) = \alpha \beta^2 = \frac{2\mu_x^2}{m} \).
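A simulation sketch (with assumed illustrative values \(m = 3\), \(\sigma_z = 2\)) checking the mean \(m\sigma_z^2\) and the variance \(2\mu_x^2/m\):

```python
import random

# Simulation sketch (illustrative values m = 3, sigma_z = 2, not from the notes):
# the sum of m squared N(0, sigma_z^2) variables has mean mu_x = m * sigma_z^2
# and variance 2 * mu_x^2 / m.
random.seed(3)
m, sigma_z = 3, 2.0
mu_x = m * sigma_z ** 2                  # theoretical mean, = 12

reps = 50_000
xs = [sum(random.gauss(0, sigma_z) ** 2 for _ in range(m)) for _ in range(reps)]
mean = sum(xs) / reps
var = sum((x - mean) ** 2 for x in xs) / reps
print(mean, var)  # mean close to 12, variance close to 2 * mu_x**2 / m = 96
```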
Here is a plot of the density of the chi-squared distribution for degrees of freedom
𝑚 = 1 and 𝑚 = 3:
[Figure: chi-squared densities for 𝑚 = 1 (left panel) and 𝑚 = 3 (right panel).]
A.7 Statistics
A.7.1 Statistical learning
The aim in statistics, data science and machine learning is to learn from data (from experiments, observations, measurements) in order to understand the world.
Specifically, we wish to identify the best model(s) for the data in order to
• explain the current data, and
• enable good prediction of future data.
Note that it is easy to get models that only explain the data but do not predict
well!
This is called overfitting the data and happens in particular if the model is
overparameterized for the amount of data available.
Specifically, we have data 𝑥1 , . . . , 𝑥 𝑛 and models 𝑓 (𝑥|𝜃) that are indexed by the parameter 𝜃.
Often (but not always) 𝜃 can be interpreted and/or is associated with some
property of the model.
Often the parameter(s) of interest are related to moments (such as mean and
variance) or to quantiles of the distribution representing the model.
Estimation:
• An estimator for \(\theta\) is a function \(\hat{\theta}(x_1, \ldots, x_n)\) that maps the data (input) to a “guess” (output) about \(\theta\).
• A point estimator provides a single number for each parameter.
• An interval estimator provides a set of possible values for each parameter.
Thus \(\hat{\theta}\) can be seen as a random variable, and its distribution is called the sampling distribution (across different experiments).
Properties of this distribution can be used to evaluate how far the estimator
deviates (on average across different experiments) from the true value:
\[
\begin{aligned}
\text{Bias:} \quad & \operatorname{Bias}(\hat{\theta}) = \text{E}(\hat{\theta}) - \theta \\
\text{Variance:} \quad & \operatorname{Var}(\hat{\theta}) = \text{E}\bigl( (\hat{\theta} - \text{E}(\hat{\theta}))^2 \bigr) \\
\text{Mean squared error:} \quad & \operatorname{MSE}(\hat{\theta}) = \text{E}\bigl( (\hat{\theta} - \theta)^2 \bigr) = \operatorname{Var}(\hat{\theta}) + \operatorname{Bias}(\hat{\theta})^2
\end{aligned}
\]
The last identity about MSE follows from E(𝑥 2 ) = Var(𝑥) + E(𝑥)2 .
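Spelled out, apply \( \text{E}(x^2) = \operatorname{Var}(x) + \text{E}(x)^2 \) to \( x = \hat{\theta} - \theta \), noting that \( \operatorname{Var}(\hat{\theta} - \theta) = \operatorname{Var}(\hat{\theta}) \) because \(\theta\) is a constant:

```latex
\operatorname{MSE}(\hat{\theta})
  = \operatorname{E}\bigl((\hat{\theta} - \theta)^2\bigr)
  = \operatorname{Var}(\hat{\theta} - \theta) + \operatorname{E}(\hat{\theta} - \theta)^2
  = \operatorname{Var}(\hat{\theta}) + \operatorname{Bias}(\hat{\theta})^2
```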
In many situations it is better to allow for some small bias in order to achieve a smaller variance and a smaller overall MSE. This is called the bias-variance tradeoff, as more bias is traded for smaller variance (or, conversely, less bias is traded for higher variance).
• The empirical mean \(\hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i\) for normal data follows a normal distribution
\[
\hat{\mu} \sim N\!\left( \mu, \frac{\sigma^2}{n} \right) .
\]
• The empirical (maximum likelihood) variance \(\widehat{\sigma^2}_{ML} = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu})^2\) for normal data follows a scaled chi-squared distribution, or equivalently a Gamma distribution,
\[
\widehat{\sigma^2}_{ML} \sim \frac{\sigma^2}{n} \chi^2_{n-1} = \operatorname{Gam}\!\left( \underbrace{\frac{n-1}{2}}_{\text{shape}}, \underbrace{\frac{2\sigma^2}{n}}_{\text{scale}} \right) .
\]
Thus \( \text{E}(\widehat{\sigma^2}_{ML}) = \frac{n-1}{n} \sigma^2 \) and \( \operatorname{Var}(\widehat{\sigma^2}_{ML}) = \frac{2(n-1)}{n^2} \sigma^4 \). The estimate \(\widehat{\sigma^2}_{ML}\) is biased since \( \text{E}(\widehat{\sigma^2}_{ML}) - \sigma^2 = -\frac{1}{n} \sigma^2 \). The mean squared error is \( \operatorname{MSE}(\widehat{\sigma^2}_{ML}) = \frac{2(n-1)}{n^2} \sigma^4 + \frac{1}{n^2} \sigma^4 = \frac{2n-1}{n^2} \sigma^4 \).
• The unbiased variance \(\widehat{\sigma^2}_{UB} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \hat{\mu})^2\) for normal data follows a scaled chi-squared distribution, or equivalently a Gamma distribution,
\[
\widehat{\sigma^2}_{UB} \sim \frac{\sigma^2}{n-1} \chi^2_{n-1} = \operatorname{Gam}\!\left( \underbrace{\frac{n-1}{2}}_{\text{shape}}, \underbrace{\frac{2\sigma^2}{n-1}}_{\text{scale}} \right) .
\]
Thus \( \text{E}(\widehat{\sigma^2}_{UB}) = \sigma^2 \) and \( \operatorname{Var}(\widehat{\sigma^2}_{UB}) = \frac{2}{n-1} \sigma^4 \). The estimate \(\widehat{\sigma^2}_{UB}\) is unbiased since \( \text{E}(\widehat{\sigma^2}_{UB}) - \sigma^2 = 0 \). The mean squared error is \( \operatorname{MSE}(\widehat{\sigma^2}_{UB}) = \frac{2}{n-1} \sigma^4 \).
Note that for any 𝑛 > 1 we find that Var( 𝜎b2 UB ) > Var( 𝜎b2 ML ) and
MSE( 𝜎b2 UB ) > MSE( 𝜎b2 ML ) so that the biased empirical estimator has both
lower variance and lower mean squared error than the unbiased estimator.
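This comparison can be checked by simulation; a Python sketch with arbitrary illustrative choices \(n = 5\) and \(\sigma^2 = 1\), for which the theory above gives \(\operatorname{MSE}(\widehat{\sigma^2}_{ML}) = 9/25 = 0.36\) and \(\operatorname{MSE}(\widehat{\sigma^2}_{UB}) = 2/4 = 0.5\):

```python
import random

# Simulation sketch (not from the notes): compare the ML (divide by n) and
# unbiased (divide by n - 1) variance estimators for normal data.
# Theory for n = 5, sigma^2 = 1: MSE_ML = (2n-1)/n^2 = 0.36, MSE_UB = 2/(n-1) = 0.5.
random.seed(4)
n, sigma2 = 5, 1.0

ml, ub = [], []
for _ in range(20_000):
    x = [random.gauss(0, 1) for _ in range(n)]
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)
    ml.append(ss / n)
    ub.append(ss / (n - 1))

def mse(estimates, truth):
    # Monte Carlo estimate of the mean squared error
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)

print(mse(ml, sigma2), mse(ub, sigma2))  # the biased ML estimator has smaller MSE
```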
A.7.5 Asymptotics
Typically, Bias, Var and MSE all decrease with increasing sample size so that
with more data 𝑛 → ∞ the errors become smaller and smaller.
The typical rate of decrease of the variance of a good estimator is \(\frac{1}{n}\). Thus, when the sample size is doubled the variance is divided by 2 (and the standard deviation is divided by \(\sqrt{2}\)).
Consistency: \(\hat{\theta}\) is called consistent if
\[
\operatorname{MSE}(\hat{\theta}) \longrightarrow 0 \quad \text{with } n \to \infty
\]
The three estimators discussed above (empirical mean, empirical variance,
unbiased variance) are all consistent as their MSE goes to zero with large sample
size 𝑛.
Consistency is a minimum essential requirement for any reasonable estimator!
Of all consistent estimators we typically prefer the one that is most efficient (i.e. with the fastest decrease in MSE) and that therefore has the smallest variance and/or MSE for given finite 𝑛.
Consistency implies we recover the true model in the limit of infinite data if the
model class contains the true data generating model.
If the model class does not contain the true model then strict consistency cannot be achieved, but we still wish to get as close as possible to the true model when choosing model parameters.
• Note that a CI is actually an estimate: \(\widehat{\text{CI}}(x_1, \ldots, x_n)\), i.e. it depends on the data and has random (sampling) variation.
Note: the coverage probability is not the probability that the true value is contained in a given estimated interval (that would be a Bayesian credible interval).
For a normal random variable \(x \sim N(\mu, \sigma^2)\) with mean \(\mu\) and variance \(\sigma^2\) and density function \(f(x)\) we can compute the probability
\[
\Pr(x \leq \mu + c\sigma) = \int_{-\infty}^{\mu + c\sigma} f(x) \, dx = \Phi(c) = \frac{1+\kappa}{2} .
\]
Note Φ(𝑐) is the cumulative distribution function (CDF) of the standard normal
𝑁(0, 1):
From the above we obtain the critical point \(c\) from the quantile function, i.e. by inversion of \(\Phi\):
\[
c = \Phi^{-1}\!\left( \frac{1+\kappa}{2} \right)
\]
The following table lists 𝑐 for the three most commonly used values of 𝜅; it is useful to memorise these values!

𝜅:  0.90   0.95   0.99
𝑐:  1.64   1.96   2.58
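These critical values come straight from the quantile function; in R this is qnorm((1 + 𝜅)/2), and a Python equivalent (for illustration) is:

```python
from statistics import NormalDist

# The critical value c = Phi^{-1}((1 + kappa)/2) for common coverage levels;
# in R: qnorm((1 + kappa) / 2).
for kappa in (0.90, 0.95, 0.99):
    c = NormalDist().inv_cdf((1 + kappa) / 2)
    print(kappa, round(c, 2))  # 1.64, 1.96, 2.58
```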
The (symmetric) normal confidence interval for
• a scalar parameter \(\theta\),
• with normally distributed estimate \(\hat{\theta}\), and
• with estimated standard deviation \(\widehat{\text{SD}}(\hat{\theta}) = \hat{\sigma}\),
is then given by
\[
\widehat{\text{CI}} = [\hat{\theta} \pm c \hat{\sigma}]
\]
where \(c\) is chosen for the desired coverage level \(\kappa\).
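The coverage can be illustrated by simulation (a Python sketch with arbitrary values, simplified to a known standard deviation \(\sigma\) rather than an estimated one):

```python
import random
from statistics import NormalDist

# Coverage sketch (not from the notes; sigma is treated as known here,
# whereas the text plugs in an estimated standard deviation).
random.seed(5)
mu, sigma, n = 2.0, 1.0, 10
c = NormalDist().inv_cdf((1 + 0.95) / 2)   # critical value for kappa = 0.95
half = c * sigma / n ** 0.5                # half-width of the interval

reps = 5_000
hits = sum(
    1
    for _ in range(reps)
    if abs(sum(random.gauss(mu, sigma) for _ in range(n)) / n - mu) <= half
)
print(hits / reps)  # close to the nominal coverage 0.95
```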
As for the normal CI we can compute critical values but for the chi-squared
distribution we use a one-sided interval:
Pr(𝑥 ≤ 𝑐) = 𝜅
As before we get 𝑐 by the quantile function, i.e. by inverting the CDF of the
chi-squared distribution.
The following table lists the critical values for the three most common choices of 𝜅 for 𝑚 = 1 (one degree of freedom):

𝜅:  0.90   0.95   0.99
𝑐:  2.71   3.84   6.63
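For \(m = 1\) we have \(\Pr(x \leq c) = \Pr(z^2 \leq c) = 2\Phi(\sqrt{c}) - 1\), so the chi-squared critical values are simply the squares of the normal ones; a quick Python check (for illustration):

```python
from statistics import NormalDist

# For m = 1, Pr(z^2 <= c) = 2 * Phi(sqrt(c)) - 1 = kappa, so the chi-squared
# critical value is the square of the normal critical value.
for kappa in (0.90, 0.95, 0.99):
    c = NormalDist().inv_cdf((1 + kappa) / 2) ** 2
    print(kappa, round(c, 2))  # 2.71, 3.84, 6.63
```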
Further study
In this module we can only touch the surface of likelihood and Bayesian inference. As a starting point for further reading the following textbooks are recommended.
Bibliography
Agresti, A., and M. Kateri. 2022. Foundations of Statistics for Data Scientists. Chapman & Hall/CRC.
Domingos, P. 2015. The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. Basic Books.
Faraway, J. J. 2015. Linear Models with R. 2nd ed. Chapman & Hall/CRC.
Gelman, A., J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. 2014. Bayesian Data Analysis. 3rd ed. CRC Press.
Heard, N. 2021. An Introduction to Bayesian Inference, Methods and Computation. Springer.
Held, L., and D. S. Bové. 2020. Applied Statistical Inference: Likelihood and Bayes. 2nd ed. Springer.
Wood, S. 2015. Core Statistics. Cambridge University Press.