
Statistical Methods: Likelihood, Bayes and Regression

Korbinian Strimmer

20 January 2023
Contents

Welcome
  License

Preface
  About the author
  About the module
  Acknowledgements

I Likelihood estimation and inference

1 Overview of statistical learning
  1.1 How to learn from data?
  1.2 Probability theory versus statistical learning
  1.3 Cartoon of statistical learning
  1.4 Likelihood

2 From entropy to maximum likelihood
  2.1 Entropy
  2.2 Kullback-Leibler divergence
  2.3 Local quadratic approximation and expected Fisher information
  2.4 Entropy learning and maximum likelihood

3 Maximum likelihood estimation
  3.1 Principle of maximum likelihood estimation
  3.2 Maximum likelihood estimation in practice
  3.3 Observed Fisher information

4 Quadratic approximation and normal asymptotics
  4.1 Multivariate statistics for random vectors
  4.2 Approximate distribution of maximum likelihood estimates
  4.3 Quantifying the uncertainty of maximum likelihood estimates
  4.4 Example of a non-regular model

5 Likelihood-based confidence interval and likelihood ratio
  5.1 Likelihood-based confidence intervals and Wilks statistic
  5.2 Generalised likelihood ratio test (GLRT)

6 Optimality properties and conclusion
  6.1 Properties of maximum likelihood encountered so far
  6.2 Summarising data and the concept of (minimal) sufficiency
  6.3 Concluding remarks on maximum likelihood

II Bayesian Statistics

7 Conditioning and Bayes rule
  7.1 Conditional probability
  7.2 Bayes' theorem
  7.3 Conditional mean and variance
  7.4 Conditional entropy and entropy chain rules
  7.5 Entropy bounds for the marginal variables

8 Models with latent variables and missing data
  8.1 Complete data log-likelihood versus observed data log-likelihood
  8.2 Estimation of the unobservable latent states using Bayes theorem
  8.3 EM Algorithm

9 Essentials of Bayesian statistics
  9.1 Principle of Bayesian learning
  9.2 Some background on Bayesian statistics

10 Bayesian learning in practice
  10.1 Estimating a proportion using the Beta-Binomial model
  10.2 Properties of Bayesian learning
  10.3 Estimating the mean using the Normal-Normal model
  10.4 Estimating the variance using the inverse-Gamma-Normal model

11 Bayesian model comparison
  11.1 Marginal likelihood as model likelihood
  11.2 The Bayes factor for comparing two models
  11.3 Approximate computations
  11.4 Bayesian testing using false discovery rates

12 Choosing priors in Bayesian analysis
  12.1 Choosing a prior
  12.2 Default priors or uninformative priors
  12.3 Empirical Bayes

13 Optimality properties and summary
  13.1 Bayesian statistics in a nutshell
  13.2 Optimality of Bayesian inference
  13.3 Connection with entropy learning
  13.4 Conclusion

III Regression

14 Overview of regression modelling
  14.1 General setup
  14.2 Objectives
  14.3 Regression as a form of supervised learning
  14.4 Various regression models used in statistics

15 Linear Regression
  15.1 The linear regression model
  15.2 Interpretation of regression coefficients and intercept
  15.3 Different types of linear regression
  15.4 Distributional assumptions and properties
  15.5 Regression in data matrix notation
  15.6 Centering and vanishing of the intercept β0
  15.7 Objectives in data analysis using linear regression

16 Estimating regression coefficients
  16.1 Ordinary Least Squares (OLS) estimator of regression coefficients
  16.2 Maximum likelihood estimation of regression coefficients
  16.3 Covariance plug-in estimator of regression coefficients
  16.4 Standardised regression coefficients and their relationship to correlation
  16.5 Further ways to obtain regression coefficients

17 Squared multiple correlation and variance decomposition in linear regression
  17.1 Squared multiple correlation Ω² and the R² coefficient
  17.2 Variance decomposition in regression
  17.3 Sample version of variance decomposition

18 Prediction and variable selection
  18.1 Prediction and prediction intervals
  18.2 Variable importance and prediction
  18.3 Regression t-scores
  18.4 Further approaches for variable selection

Appendix

A Refresher
  A.1 Basic mathematical notation
  A.2 Vectors and matrices
  A.3 Functions
  A.4 Combinatorics
  A.5 Probability
  A.6 Distributions
  A.7 Statistics

B Further study
  B.1 Recommended reading
  B.2 Additional references

Bibliography
Welcome

These are the lecture notes for MATH20802, a course in Statistical Methods
for second year mathematics students at the Department of Mathematics of the
University of Manchester.
The course text was written by Korbinian Strimmer from 2019–2023. This version
is from 20 January 2023.
The notes will be updated from time to time. To view the current version visit
the online MATH20802 lecture notes.
You may also download the MATH20802 lecture notes as PDF. For a paper copy
it is recommended to print two pages per sheet.

License
These notes are licensed to you under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Preface

About the author


Hello! My name is Korbinian Strimmer and I am a Professor in Statistics. I
am a member of the Statistics group at the Department of Mathematics of the
University of Manchester. You can find more information about me on my home
page.
The notes are for the version of MATH20802 taught in spring 2023 at the
University of Manchester.
I hope that you enjoy the course and that you will find the notes useful! If you
have any questions, comments, or corrections please email me at korbinian.stri
[email protected].

About the module


Topics covered
The MATH20802 module is designed to run over the course of 11 weeks. It has
three main parts:
1. Likelihood estimation and inference (W1–W4)
2. Bayesian learning and inference (W5–W8)
3. Linear regression (W9–W11)
This module focuses on conceptual understanding and methods, not on theory.
Specifically, you will learn about the foundations of statistical learning using
likelihood and Bayesian approaches and also how these are underpinned by
entropy.
As such, the presentation in this course is non-technical. The aim is to offer insights into how diverse statistical approaches are linked, and to demonstrate that statistics offers a concise and coherent theory of information rather than being an ad hoc collection of "recipes" for data analysis (a common but wrong perception of statistics).


Prerequisites
For this module it is important that you refresh your knowledge in:
• Introduction to statistics
• Probability
• R data analysis and programming
In addition you will need some elements of matrix algebra and to know how to compute the gradient and the curvature of a function of several variables.
Check the Appendix of these notes for a brief refresher of the essential material.

Additional support material


If you are a University of Manchester student and enrolled in this module you
will find on Blackboard:
• a weekly learning plan for an 11 week study period (plus one additional
week for revision),
• weekly worksheets with examples and solutions and R code, and
• exam papers of previous years.
Furthermore, there is also a MATH20802 online reading list hosted by the
University of Manchester library.

Acknowledgements
Many thanks to Beatriz Costa Gomes for her help in creating the 2019 version
of the lecture notes when I was teaching this module for the first time and to
Kristijonas Raudys for his extensive feedback on the 2020 version.
Part I

Likelihood estimation and inference
Chapter 1

Overview of statistical learning

1.1 How to learn from data?


A fundamental question is how to extract information from data in an optimal
way, and to make predictions based on this information.
For this purpose, a number of competing theories of information have been
developed. Statistics is the oldest science of information and is concerned with
offering principled ways to learn from data and to extract and process information
using probabilistic models. However, there are other theories of information (e.g. the Vapnik–Chervonenkis theory of learning, computational learning) that are more algorithmic than analytic and sometimes not even based on probability theory. Furthermore, there are other disciplines, such as computer science and machine learning, that are closely linked with and also have substantial overlap with statistics. The field of "data science" today comprises both statistics and machine learning and brings together mathematics, statistics and computer science. Also the growing field of so-called "artificial intelligence" makes substantial use of statistical and machine learning techniques.
The recent popular science book "The Master Algorithm" by Domingos (2015) provides an accessible, informal overview of the various schools of the science of information. It discusses the main algorithms used in machine learning and statistics:
• The Bayesian school of learning started as early as 1763; it later turned out to be closely linked with likelihood inference, established in 1922 by R.A. Fisher (1890–1962) and generalised in 1951 to entropy learning by Kullback and Leibler.


• It was also in the 1950s that the concept of the artificial neural network arose, essentially a nonlinear input-output map that works in a non-probabilistic way. This field saw another leap in the 1980s and further progress from 2010 onwards with the development of deep learning. It is now one of the most popular (and most effective) methods for the analysis of imaging data. Even your mobile phone most likely has a dedicated computer chip with special neural network hardware, for example.
• Further advanced theories of information were developed in the 1960s under the term computational learning, most notably the Vapnik–Chervonenkis theory, whose most prominent example is the "support vector machine" (another non-probabilistic model).
• With the advent of large-scale genomic and other high-dimensional data
there has been a surge of new and exciting developments in the field of
high-dimensional (large dimension) and also big data (large dimension
and large sample size), both in statistics and in machine learning.
The connections between the various fields of information are still not perfectly understood, but it is clear that an overarching theory will need to be based on probabilistic learning.

1.2 Probability theory versus statistical learning


When you study statistics (or any other information theory) you need to be aware that there is a fundamental difference between probability theory and statistics, and that it relates to the distinction between "randomness" and "uncertainty".
Probability theory studies randomness, by developing mathematical models for randomness (such as probability distributions) and studying the corresponding mathematical properties (including asymptotics etc.). Probability theory may in fact be viewed as a branch of measure theory, and thus it belongs to the domain of pure mathematics.
Probability theory provides probabilistic generative models for data, for simulation of data or for use in learning from data, i.e. inference about the model from observations. Methods and theory for how best to learn from data are the domain of applied mathematics, specifically statistics and the related areas of machine learning and data science.
Note that statistics, in contrast to probability, is in fact not at all concerned with randomness. Instead, the focus is on measuring and elucidating the uncertainty of events, predictions, outcomes and parameters, and this uncertainty measures the state of knowledge. Note that if new data or information becomes available, the state of knowledge and thus the uncertainty changes! Thus, uncertainty is an epistemological property.
The uncertainty is most often due to our ignorance of the true underlying processes (on purpose or not), but not because the underlying process is actually random. The success of statistics is based on the fact that we can mathematically model the uncertainty without knowing any specifics of the underlying processes, and we still have procedures for optimal inference under uncertainty.
In short, statistics is about describing the state of knowledge of the world, which may be uncertain and incomplete, and about making decisions and predictions in the face of uncertainty. This uncertainty sometimes derives from randomness but most often from our ignorance (and sometimes this ignorance even helps to create a simple yet effective model)!

1.3 Cartoon of statistical learning


We observe data 𝐷 = {𝑥1, . . . , 𝑥𝑛} assumed to have been generated by an underlying true model 𝑀true with true parameters 𝜽true.
To explain the data, and to make predictions, we make hypotheses in the form of candidate models 𝑀1, 𝑀2, . . . and corresponding parameters 𝜽1, 𝜽2, . . .. The true model itself is unknown and cannot be observed. However, what we can observe is data 𝐷 from the true model, obtained by measuring properties of objects of interest (our observations from experiments). Sometimes we can also perturb the model and see what the effect is (an interventional study).
The various candidate models 𝑀1 , 𝑀2 , . . . in the model world will never be
perfect or correct as the true model 𝑀true will only be among the candidate
models in an idealised situation. However, even an imperfect candidate model
will often provide a useful mathematical approximation and capture some
important characteristics of the true model and thus will help to interpret
observed data.

Hypothesis −→ Model world: 𝑀1, 𝜽1; 𝑀2, 𝜽2; . . .

How the world works −→ Real world: unknown true model 𝑀true, 𝜽true −→ Data 𝑥1, . . . , 𝑥𝑛

The aim of statistical learning is to identify the model(s) that explain the
current data and also predict future data (i.e. predict outcome of experiments
that have not been conducted yet).
Thus a good model provides a good fit to the current data (i.e. it explains current
observations well) and also to future data (i.e. it generalises well).

A large proportion of statistical theory is devoted to finding these "good" models that avoid both overfitting (models that are too complex and don't generalise well) and underfitting (models that are too simplistic and hence also don't predict well). Typically the aim is to find a model whose complexity matches the complexity of the unknown true model and also the complexity of the data observed from the unknown true model.

1.4 Likelihood
In statistics and machine learning most models that are being used are probabilistic, to take account of both randomness and uncertainty. A core task in statistical learning is to identify those models that explain the existing data well and that also generalise well to unseen data.
For this we need, among other things, a measure of how well a candidate
model approximates the (typically unknown) true data generating model and
an approach to choose the best model(s). One such approach is provided by
the method of maximum likelihood that enables us to estimate parameters of
models and to find the particular model that is the best fit to the data.
Given a probability distribution 𝑃𝜽 with density or mass function 𝑝(𝑥|𝜽), where 𝜽 is a parameter vector, and 𝐷 = {𝑥1, . . . , 𝑥𝑛} are the observed iid data (i.e. independent and identically distributed), the likelihood function is defined as
$$L_n(\boldsymbol{\theta}|D) = \prod_{i=1}^{n} p(x_i|\boldsymbol{\theta})$$

Typically, instead of the likelihood one uses the log-likelihood function:
$$l_n(\boldsymbol{\theta}|D) = \log L_n(\boldsymbol{\theta}|D) = \sum_{i=1}^{n} \log p(x_i|\boldsymbol{\theta})$$

Reasons for preferring the log-likelihood (rather than the likelihood) include that
• the log-density is in fact the more "natural" and relevant quantity (this will become clear in the upcoming chapters) and that
• addition is numerically more stable than multiplication on a computer (see the sketch below).
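The second point is easy to demonstrate; here is a minimal R sketch (my own illustration, not part of the original notes) contrasting the product of densities with the sum of log-densities for simulated normal data:

```r
set.seed(1)
x <- rnorm(1000)  # simulated iid standard normal data

# likelihood as a product of densities: underflows to 0 for large n
prod(dnorm(x))

# log-likelihood as a sum of log-densities: numerically stable
sum(dnorm(x, log = TRUE))
```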
For discrete random variables, for which 𝑝(𝑥|𝜽) is a probability mass function, the likelihood is often interpreted as the probability of observing the data given the model with specified parameters 𝜽. In fact, this was indeed how the likelihood was historically introduced. However, this view is not strictly correct. First, given that the samples are iid and thus the ordering of the 𝑥𝑖 is not important, an additional factor accounting for the possible permutations is needed in the likelihood to obtain the actual probability of the data. Moreover, for continuous random variables this interpretation breaks down due to the use of densities rather than probability mass functions in the likelihood. Thus, the view of the likelihood being the probability of the data is in fact too simplistic.
In the next chapter we will see that the justification for using likelihood rather stems from its close link to the Kullback-Leibler information and cross-entropy. This also helps to understand why using likelihood for estimation is only optimal in the limit of large sample size.
In the first part of the MATH20802 "Statistical Methods" module we will study likelihood estimation and inference in much detail. We will provide links to related methods of inference and discuss its information-theoretic foundations. We will also discuss the optimality properties as well as the limitations of likelihood inference. Extensions of likelihood analysis, in particular Bayesian learning, will be discussed in the second part of the module. In the third part of the module we will apply statistical learning to linear regression.
Chapter 2

From entropy to maximum likelihood

2.1 Entropy
2.1.1 Overview
In this chapter we discuss various information criteria and their connection to
maximum likelihood.

The modern definition of (relative) entropy, or "disorder", was first discovered in the 1870s by physicist L. Boltzmann (1844–1906) in the context of thermodynamics. The probabilistic interpretation of statistical mechanics and entropy was further developed by J. W. Gibbs (1839–1903).

In the 1940s–1950s the notion of entropy turned out to be central in information theory, a field pioneered by mathematicians such as R. Hartley (1888–1970), S. Kullback (1907–1994), A. Turing (1912–1954), R. Leibler (1914–2003), I. J. Good (1916–2009), C. Shannon (1916–2001), and E. T. Jaynes (1922–1998), and later further explored by S. Amari (1936–), I. Csiszár (1938–), B. Efron (1938–), A. P. Dawid (1946–) and many others.

Entropy splits into two main branches:
• Shannon entropy (Shannon 1948) → Mutual information → Information theory (Shannon 1948, Lindley 1953)
• Relative entropy (Kullback-Leibler 1951) → Fisher information → Likelihood theory (Fisher 1922)

2.1.2 Surprise, surprisal or Shannon information


The surprise to observe an event of probability 𝑝 is defined as − log(𝑝). This is
also called surprisal or Shannon information.
Thus, the surprise to observe a certain event (with 𝑝 = 1) is zero, and conversely
the surprise to observe an event that is certain not to happen (with 𝑝 = 0) is
infinite.
The log-odds ratio can be viewed as the difference between the surprise of the complementary event and the surprise of the event itself:
$$\log\left(\frac{p}{1-p}\right) = -\log(1-p) - \left(-\log(p)\right)$$

In this module we always use the natural logarithm by default, and will explicitly write log2 and log10 for logarithms with respect to base 2 and 10, respectively. Surprise and entropy computed with the natural logarithm (log) are given in "nats" (natural information units). Using log2 leads to "bits" and using log10 to "bans" or "Hartleys".
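As a quick illustration (my own sketch, not from the notes), the surprise of an event with probability 1/4 in the three units:

```r
p <- 1/4
-log(p)    # surprise in nats: 1.386
-log2(p)   # surprise in bits: 2
-log10(p)  # surprise in bans/Hartleys: 0.602
```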

2.1.3 Shannon entropy


Assume we have a categorical distribution 𝑃 with 𝐾 classes/categories. The corresponding class probabilities are 𝑝1, . . . , 𝑝𝐾 with Pr("class k") = 𝑝𝑘 and $\sum_{k=1}^{K} p_k = 1$. The probability mass function (PMF) is 𝑝(𝑥 = "class k") = 𝑝𝑘.

As the random variable 𝑥 is discrete, the categorical distribution 𝑃 is a discrete distribution; indeed, 𝑃 is generally also known simply as the discrete distribution.
The Shannon entropy (1948)¹ of the distribution 𝑃 is defined as the expected surprise, i.e. the negative expected log-probability
$$H(P) = -\text{E}_P\left(\log p(x)\right) = -\sum_{k=1}^{K} p_k \log(p_k)$$
As all 𝑝𝑘 ∈ [0, 1] by construction, Shannon entropy must be greater than or equal to 0.


1Shannon, C. E. 1948. A mathematical theory of communication. Bell System Technical Journal
27:379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x

Furthermore, it is bounded above by log 𝐾. This can be seen by maximising Shannon entropy as a function of the 𝑝𝑘 under the constraint $\sum_{k=1}^{K} p_k = 1$, e.g., by constrained optimisation using Lagrange multipliers. The maximum is achieved for 𝑃 being the discrete uniform distribution - see Example 2.1.
Hence for any categorical distribution 𝑃 with 𝐾 categories we have

log 𝐾 ≥ 𝐻(𝑃) ≥ 0

In statistical physics, the Shannon entropy is known as Gibbs entropy (1878).


Example 2.1. Discrete uniform distribution 𝑈𝐾: let 𝑝1 = 𝑝2 = . . . = 𝑝𝐾 = 1/𝐾. Then
$$H(U_K) = -\sum_{k=1}^{K} \frac{1}{K} \log\left(\frac{1}{K}\right) = \log K$$

Note this is the largest value the Shannon entropy can assume with 𝐾 classes.
Example 2.2. Concentrated probability mass: let 𝑝1 = 1 and 𝑝2 = 𝑝3 = . . . = 𝑝𝐾 = 0. Using 0 × log(0) = 0 we obtain for the Shannon entropy
$$H(P) = 1 \times \log(1) + 0 \times \log(0) + \dots = 0$$

Note that 0 is the smallest value that Shannon entropy can assume, and corresponds to maximum concentration.
Thus, large entropy implies that the distribution is spread out whereas small
entropy means the distribution is concentrated.
Correspondingly, maximum entropy distributions can be considered minimally
informative about a random variable.
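This spread-versus-concentration interpretation can be checked directly; below is a minimal R sketch (my own illustration, using the convention 0 × log(0) = 0):

```r
# Shannon entropy in nats, with the convention 0 * log(0) = 0
entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0))

entropy(rep(1/4, 4))             # uniform over K = 4 classes: log(4) = 1.386
entropy(c(1, 0, 0, 0))           # fully concentrated: 0
entropy(c(0.7, 0.1, 0.1, 0.1))   # in between: about 0.94
```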
This interpretation is also supported by the close link of Shannon entropy with
multinomial coefficients counting the permutations of 𝑛 items (samples) of 𝐾
distinct types (classes).
Example 2.3. Large sample asymptotics of the log-multinomial coefficient and link to Shannon entropy:
The number of possible permutations of 𝑛 items of 𝐾 distinct types, with 𝑛1 of type 1, 𝑛2 of type 2 and so on, is given by the multinomial coefficient
$$W = \binom{n}{n_1, \dots, n_K} = \frac{n!}{n_1! \times n_2! \times \dots \times n_K!}$$
with $\sum_{k=1}^{K} n_k = n$ and $K \leq n$.
Now recall the Moivre–Stirling formula, which for large 𝑛 allows us to approximate the factorial by
$$\log n! \approx n \log n - n$$
With this
$$\begin{aligned}
\log W &= \log \binom{n}{n_1, \dots, n_K} \\
&= \log n! - \sum_{k=1}^{K} \log n_k! \\
&\approx n \log n - n - \sum_{k=1}^{K} \left(n_k \log n_k - n_k\right) \\
&= n \log n - \sum_{k=1}^{K} n_k \log n_k \\
&= \sum_{k=1}^{K} n_k \log n - \sum_{k=1}^{K} n_k \log n_k \\
&= -n \sum_{k=1}^{K} \frac{n_k}{n} \log\left(\frac{n_k}{n}\right)
\end{aligned}$$
and thus
$$\frac{1}{n} \log \binom{n}{n_1, \dots, n_K} \approx -\sum_{k=1}^{K} \hat{p}_k \log \hat{p}_k = H(\hat{P})$$
where $\hat{P}$ is the empirical categorical distribution with $\hat{p}_k = \frac{n_k}{n}$.
The combinatorial derivation of Shannon entropy is now credited to Wallis (1962) but had already been used a century earlier by Boltzmann (1877), who discovered it in his work in statistical mechanics (recall that $S = k_B \log W$ is the Boltzmann entropy).
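The approximation in Example 2.3 is easy to verify numerically; a small R sketch (my own check) using `lfactorial` for the log-multinomial coefficient:

```r
n_k <- c(500, 300, 200)      # counts for K = 3 classes
n <- sum(n_k)
p_hat <- n_k / n             # empirical class probabilities

# log multinomial coefficient: log W = log n! - sum(log n_k!)
log_W <- lfactorial(n) - sum(lfactorial(n_k))

log_W / n                    # about 1.02
-sum(p_hat * log(p_hat))     # Shannon entropy H(P_hat): about 1.03
```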

2.1.4 Differential entropy


Shannon entropy is only defined for discrete random variables.
Differential entropy results from applying the definition of Shannon entropy to a continuous random variable 𝑥 with density 𝑝(𝑥):
$$H(P) = -\text{E}_P\left(\log p(x)\right) = -\int_x p(x) \log p(x) \, dx$$

Despite having essentially the same formula, the different name is justified because differential entropy exhibits different properties compared to Shannon entropy: the logarithm is taken of a density, which in contrast to a probability can assume values larger than one. As a consequence, differential entropy is not bounded below by zero and can be negative.

Example 2.4. Consider the uniform distribution 𝑈(0, 𝑎) with 𝑎 > 0, support from 0 to 𝑎 and density 𝑝(𝑥) = 1/𝑎. As $-\int_0^a p(x) \log p(x) \, dx = -\int_0^a \frac{1}{a} \log\left(\frac{1}{a}\right) dx = \log a$ the differential entropy is
$$H(U(0, a)) = \log a$$
Note that for 𝑎 < 1 the differential entropy is negative.
Example 2.5. The log-density of the univariate normal 𝑁(𝜇, 𝜎²) distribution is $\log p(x|\mu, \sigma^2) = -\frac{1}{2}\left(\log(2\pi\sigma^2) + \frac{(x-\mu)^2}{\sigma^2}\right)$ with 𝜎² > 0. The corresponding differential entropy is, with E((𝑥 − 𝜇)²) = 𝜎²,
$$H(P) = -\text{E}\left(\log p(x|\mu, \sigma^2)\right) = \frac{1}{2}\left(\log(2\pi\sigma^2) + 1\right)$$
Interestingly, 𝐻(𝑃) only depends on the variance and not on the mean, and the entropy grows with the variance. Note that for 𝜎² < 1/(2𝜋𝑒) ≈ 0.0585 the differential entropy is negative.
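Both closed forms can be verified by numerical integration; a sketch in R (my own check, not from the notes):

```r
mu <- 2; s2 <- 0.04   # sigma^2 < 1/(2*pi*e) ~ 0.0585, so entropy is negative

# differential entropy -E(log p(x)) by numerical integration
h <- integrate(function(x)
  -dnorm(x, mu, sqrt(s2)) * dnorm(x, mu, sqrt(s2), log = TRUE),
  lower = -Inf, upper = Inf)
h$value                          # about -0.19

0.5 * (log(2 * pi * s2) + 1)     # closed form from Example 2.5: -0.19
```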

2.1.5 Maximum entropy principle to characterise distributions


Both maximum Shannon entropy and maximum differential entropy are useful to characterise distributions:
1) The discrete uniform distribution is the maximum entropy distribution among all discrete distributions.
2) The maximum entropy distribution of a continuous random variable with support [−∞, ∞] with a specific mean and variance is the normal distribution.
3) The maximum entropy distribution among all continuous distributions supported in [0, ∞] with a specified mean is the exponential distribution.
The higher the entropy the more spread out (and more uninformative) is a distribution.
Using maximum entropy to characterise maximally uninformative distributions was advocated by E.T. Jaynes (who also proposed to use maximum entropy in the context of finding Bayesian priors). The maximum entropy principle in statistical physics goes back to Boltzmann.
A list of maximum entropy distributions is given here: https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution
Many distributions commonly used in statistical modelling are exponential families. Intriguingly, these distributions are all maximum entropy distributions, so there is a very close link between the principle of maximum entropy and common model choices in statistics and machine learning.

2.1.6 Cross-entropy
If in the definition of Shannon entropy (and differential entropy) the expectation over the log-density (say 𝑔(𝑥) of distribution 𝐺) is taken with regard to a different distribution 𝐹 over the same state space we arrive at the cross-entropy
$$H(F, G) = -\text{E}_F\left(\log g(x)\right)$$
For discrete distributions 𝐹 and 𝐺 with class probabilities 𝑓1, . . . , 𝑓𝐾 and 𝑔1, . . . , 𝑔𝐾 the cross-entropy is computed as the weighted sum $H(F, G) = -\sum_{k=1}^{K} f_k \log g_k$. For continuous distributions 𝐹 and 𝐺 with densities 𝑓(𝑥) and 𝑔(𝑥) we compute the integral $H(F, G) = -\int_x f(x) \log g(x) \, dx$.
Therefore, cross-entropy is a measure linking two distributions 𝐹 and 𝐺.
Therefore, cross-entropy is a measure linking two distributions 𝐹 and 𝐺.
Note that
• cross-entropy is not symmetric with regard to 𝐹 and 𝐺, because the
expectation is taken with reference to 𝐹.
• By construction 𝐻(𝐹, 𝐹) = 𝐻(𝐹). Thus if both distributions are identical
cross-entropy reduces to Shannon and differential entropy, respectively.
A crucial property of the cross-entropy 𝐻(𝐹, 𝐺) is that it is bounded below by
the entropy of 𝐹, therefore
𝐻(𝐹, 𝐺) ≥ 𝐻(𝐹)
with equality for 𝐹 = 𝐺. This is known as Gibbs’ inequality.
Equivalently we can write
$$\underbrace{H(F, G) - H(F)}_{\text{relative entropy}} \geq 0$$

In fact, this recalibrated cross-entropy (known as KL divergence or relative entropy) turns out to be more fundamental than both cross-entropy and entropy. It will be studied in detail in the next section.
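A small R sketch (my own illustration) computing entropy, cross-entropy and the resulting relative entropy for two discrete distributions, confirming Gibbs' inequality:

```r
f <- c(0.5, 0.3, 0.2)   # reference distribution F
g <- c(0.2, 0.3, 0.5)   # approximating distribution G

H  <- function(f)    -sum(f * log(f))   # (Shannon) entropy
CE <- function(f, g) -sum(f * log(g))   # cross-entropy H(F, G)

CE(f, g) - H(f)   # relative entropy: about 0.275, positive
CE(f, f) - H(f)   # exactly zero when G = F
```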
Example 2.6. Cross-entropy between two normals:
Assume $F_{\text{ref}} = N(\mu_{\text{ref}}, \sigma^2_{\text{ref}})$ and $F = N(\mu, \sigma^2)$. The cross-entropy $H(F_{\text{ref}}, F)$ is
$$\begin{aligned}
H(F_{\text{ref}}, F) &= -\text{E}_{F_{\text{ref}}}\left(\log p(x|\mu, \sigma^2)\right) \\
&= \frac{1}{2} \text{E}_{F_{\text{ref}}}\left(\log(2\pi\sigma^2) + \frac{(x-\mu)^2}{\sigma^2}\right) \\
&= \frac{1}{2}\left(\frac{(\mu - \mu_{\text{ref}})^2}{\sigma^2} + \frac{\sigma^2_{\text{ref}}}{\sigma^2} + \log(2\pi\sigma^2)\right)
\end{aligned}$$
using $\text{E}_{F_{\text{ref}}}\left((x - \mu)^2\right) = (\mu_{\text{ref}} - \mu)^2 + \sigma^2_{\text{ref}}$.

Example 2.7. If $\mu_{\text{ref}} = \mu$ and $\sigma^2_{\text{ref}} = \sigma^2$ then the cross-entropy $H(F_{\text{ref}}, F)$ in Example 2.6 degenerates to the differential entropy $H(F_{\text{ref}}) = \frac{1}{2}\left(\log(2\pi\sigma^2_{\text{ref}}) + 1\right)$.

2.2 Kullback-Leibler divergence


2.2.1 Definition
Also known as relative entropy and discrimination information.
The relative entropy measures the divergence of a distribution 𝐺 from the distribution 𝐹 and is defined as
$$\begin{aligned}
D_{\text{KL}}(F, G) &= \text{E}_F\left(\log \frac{dF}{dG}\right) = \text{E}_F\left(\log \frac{f(x)}{g(x)}\right) \\
&= \underbrace{-\text{E}_F\left(\log g(x)\right)}_{\text{cross-entropy}} - \underbrace{\left(-\text{E}_F\left(\log f(x)\right)\right)}_{\text{(differential) entropy}} \\
&= H(F, G) - H(F)
\end{aligned}$$

• $D_{\text{KL}}(F, G)$ measures the amount of information lost if 𝐺 is used to approximate 𝐹.
• If 𝐹 and 𝐺 are identical (and no information is lost) then $D_{\text{KL}}(F, G) = 0$.
(Note: here “divergence” measures the dissimilarity between probability distri-
butions. This type of divergence is not related and should not be confused with
divergence (div) as used in vector analysis.)
The use of the term “divergence” rather than “distance” serves to emphasise
that the distributions 𝐹 and 𝐺 are not interchangeable in 𝐷KL (𝐹, 𝐺).
There exist various notations for KL divergence in the literature. Here we use
𝐷KL (𝐹, 𝐺) but you often find as well KL(𝐹||𝐺) and 𝐼 𝐾𝐿 (𝐹; 𝐺).
Some authors (e.g. Efron) call twice the KL divergence 2𝐷KL (𝐹, 𝐺) = 𝐷(𝐹, 𝐺) the
deviance of 𝐺 from 𝐹.

2.2.2 Properties of KL divergence


1. 𝐷KL (𝐹, 𝐺) ≠ 𝐷KL (𝐺, 𝐹), i.e. the KL divergence is not symmetric, 𝐹 and 𝐺
cannot be interchanged.
2. 𝐷KL (𝐹, 𝐺) = 0 if and only if 𝐹 = 𝐺, i.e., the KL divergence is zero if and
only if 𝐹 and 𝐺 are identical.
3. 𝐷KL (𝐹, 𝐺) ≥ 0, proof via Jensen’s inequality.
4. 𝐷KL (𝐹, 𝐺) remains invariant under coordinate transformations, i.e. it is an
invariant geometric quantity.

Note that in the KL divergence the expectation is taken over a ratio of densities
(or ratio of probabilities for discrete random variables). This is what creates the
transformation invariance.
For more details and proofs of properties 3 and 4 see Worksheet E1.

2.2.3 Origin of KL divergence and application in statistics


Historically, in physics, (negative) relative entropy was discovered by Boltzmann (1878).² In statistics and information theory it was introduced by Kullback and Leibler (1951).³
In statistics the typical roles of the distribution 𝐹 and 𝐺 in 𝐷KL (𝐹, 𝐺) are:
• 𝐹 is the (unknown) underlying true model for the data generating process
• 𝐺 is the approximating model (typically a distribution family indexed by
parameters)
Optimising (i.e. minimising) the KL divergence with regard to 𝐺 amounts to
approximation and optimising with regard to 𝐹 to imputation. Later we will see
how this leads to the method of maximum likelihood and to Bayesian learning,
respectively.

2.2.4 KL divergence examples


Example 2.8. KL divergence between two Bernoulli distributions Ber(𝑝) and Ber(𝑞):
The "success" probabilities for the two distributions are 𝑝 and 𝑞, respectively, and the complementary "failure" probabilities are 1 − 𝑝 and 1 − 𝑞. With this we get for the KL divergence
$$D_{\text{KL}}(\text{Ber}(p), \text{Ber}(q)) = p \log\left(\frac{p}{q}\right) + (1-p) \log\left(\frac{1-p}{1-q}\right)$$

Example 2.9. KL divergence between two univariate normals with different means and variances:
Assume $F_{\text{ref}} = N(\mu_{\text{ref}}, \sigma^2_{\text{ref}})$ and $F = N(\mu, \sigma^2)$. Then
$$\begin{aligned}
D_{\text{KL}}(F_{\text{ref}}, F) &= H(F_{\text{ref}}, F) - H(F_{\text{ref}}) \\
&= \frac{1}{2}\left(\frac{(\mu - \mu_{\text{ref}})^2}{\sigma^2} + \frac{\sigma^2_{\text{ref}}}{\sigma^2} - \log\left(\frac{\sigma^2_{\text{ref}}}{\sigma^2}\right) - 1\right)
\end{aligned}$$
² Boltzmann, L. 1878. Weitere Bemerkungen über einige Probleme der mechanischen Wärmetheorie. Wien Ber. 78:7–46. https://doi.org/10.1017/CBO9781139381437.013
³ Kullback, S., and R. A. Leibler. 1951. On information and sufficiency. Ann. Math. Statist. 22:79–86. https://doi.org/10.1214/aoms/1177729694

Example 2.10. KL divergence between two univariate normals with different means and common variance:
An important special case of the previous Example 2.9 occurs if the variances are equal. Then we get
$$D_{\text{KL}}\left(N(\mu_{\text{ref}}, \sigma^2), N(\mu, \sigma^2)\right) = \frac{1}{2} \frac{(\mu - \mu_{\text{ref}})^2}{\sigma^2}$$

2.3 Local quadratic approximation and expected Fisher information

2.3.1 Definition of expected Fisher information
KL information measures the divergence of two distributions. We may thus use relative entropy to measure the divergence between two distributions in the same family, separated in parameter space only by some small 𝜺.
Let $h(\boldsymbol{\theta}) = D_{\text{KL}}(F_{\boldsymbol{\theta}_0}, F_{\boldsymbol{\theta}}) = \text{E}_{F_{\boldsymbol{\theta}_0}}\left(\log f(\boldsymbol{x}|\boldsymbol{\theta}_0) - \log f(\boldsymbol{x}|\boldsymbol{\theta})\right)$. Note that the first distribution in the KL divergence is fixed at $F_{\boldsymbol{\theta}_0}$ and the second distribution is varied. Then $h(\boldsymbol{\theta}_0 + \boldsymbol{\varepsilon}) = D_{\text{KL}}(F_{\boldsymbol{\theta}_0}, F_{\boldsymbol{\theta}_0 + \boldsymbol{\varepsilon}})$. Since the KL divergence vanishes only when the two arguments are identical, $h(\boldsymbol{\theta})$ reaches a minimum at $\boldsymbol{\theta}_0$ with $h(\boldsymbol{\theta}_0) = 0$ and flat gradient $\nabla h(\boldsymbol{\theta}_0) = 0$.
We can therefore approximate $h(\boldsymbol{\theta}_0 + \boldsymbol{\varepsilon})$ by a quadratic function around $\boldsymbol{\theta}_0$:
$$\begin{aligned}
D_{\text{KL}}(F_{\boldsymbol{\theta}_0}, F_{\boldsymbol{\theta}_0 + \boldsymbol{\varepsilon}}) = h(\boldsymbol{\theta}_0 + \boldsymbol{\varepsilon}) &\approx \frac{1}{2} \boldsymbol{\varepsilon}^T \nabla \nabla^T h(\boldsymbol{\theta}_0) \boldsymbol{\varepsilon} \\
&= \frac{1}{2} \boldsymbol{\varepsilon}^T \left(-\text{E}_{F_{\boldsymbol{\theta}_0}}\left(\nabla \nabla^T \log f(\boldsymbol{x}|\boldsymbol{\theta}_0)\right)\right) \boldsymbol{\varepsilon} \\
&= \frac{1}{2} \boldsymbol{\varepsilon}^T \underbrace{\boldsymbol{I}^{\text{Fisher}}(\boldsymbol{\theta}_0)}_{\text{expected Fisher information}} \boldsymbol{\varepsilon}
\end{aligned}$$

This yields the expected Fisher information at $\boldsymbol{\theta}_0$ as the negative expected Hessian matrix of the log-density at $\boldsymbol{\theta}_0$. Since $\boldsymbol{\theta}_0$ is a minimum the expected Fisher information matrix must be positive definite!
We can use the above approximation also to compute the divergence $D_{\text{KL}}(F_{\boldsymbol{\theta}_0 + \boldsymbol{\varepsilon}}, F_{\boldsymbol{\theta}_0})$ where the first argument varies and the second is kept fixed:
$$D_{\text{KL}}(F_{\boldsymbol{\theta}_0 + \boldsymbol{\varepsilon}}, F_{\boldsymbol{\theta}_0}) \approx \frac{1}{2} \boldsymbol{\varepsilon}^T \boldsymbol{I}^{\text{Fisher}}(\boldsymbol{\theta}_0 + \boldsymbol{\varepsilon}) \boldsymbol{\varepsilon}$$
In a linear approximation $\boldsymbol{I}^{\text{Fisher}}(\boldsymbol{\theta}_0 + \boldsymbol{\varepsilon}) \approx \boldsymbol{I}^{\text{Fisher}}(\boldsymbol{\theta}_0) + \boldsymbol{\Delta}_{\boldsymbol{\varepsilon}}$ each element of the matrix $\boldsymbol{\Delta}_{\boldsymbol{\varepsilon}}$ is the scalar product of $\boldsymbol{\varepsilon}$ and the gradient of the corresponding element in $\boldsymbol{I}^{\text{Fisher}}(\boldsymbol{\theta}_0)$ evaluated at $\boldsymbol{\theta}_0$. Therefore $\boldsymbol{\varepsilon}^T \boldsymbol{\Delta}_{\boldsymbol{\varepsilon}} \boldsymbol{\varepsilon}$ is of cubic order in $\boldsymbol{\varepsilon}$ and hence
$$\begin{aligned}
D_{\text{KL}}(F_{\boldsymbol{\theta}_0 + \boldsymbol{\varepsilon}}, F_{\boldsymbol{\theta}_0}) &\approx \frac{1}{2} \boldsymbol{\varepsilon}^T \boldsymbol{I}^{\text{Fisher}}(\boldsymbol{\theta}_0 + \boldsymbol{\varepsilon}) \boldsymbol{\varepsilon} \\
&\approx \frac{1}{2} \boldsymbol{\varepsilon}^T \boldsymbol{I}^{\text{Fisher}}(\boldsymbol{\theta}_0) \boldsymbol{\varepsilon} + \frac{1}{2} \boldsymbol{\varepsilon}^T \boldsymbol{\Delta}_{\boldsymbol{\varepsilon}} \boldsymbol{\varepsilon} \\
&\approx \frac{1}{2} \boldsymbol{\varepsilon}^T \boldsymbol{I}^{\text{Fisher}}(\boldsymbol{\theta}_0) \boldsymbol{\varepsilon}
\end{aligned}$$
keeping only terms quadratic in $\boldsymbol{\varepsilon}$.
Note that there is no data involved in computing the expected Fisher information,
hence it is purely a property of the model, or more precisely of the space of the
models indexed by 𝜽. In the next Chapter we will study a related quantity, the
observed Fisher information that in contrast is a function of the observed data.

2.3.2 Examples
Example 2.11. Expected Fisher information for the Bernoulli distribution:
The log-probability mass function of the Bernoulli Ber(𝑝) distribution is
$$\log f(x|p) = x \log(p) + (1-x) \log(1-p)$$
where 𝑝 is the proportion of "success". The second derivative with regard to the parameter 𝑝 is
$$\frac{d^2}{dp^2} \log f(x|p) = -\frac{x}{p^2} - \frac{1-x}{(1-p)^2}$$
Since E(𝑥) = 𝑝 we get as Fisher information
$$I^{\text{Fisher}}(p) = -\text{E}\left(\frac{d^2}{dp^2} \log f(x|p)\right) = \frac{p}{p^2} + \frac{1-p}{(1-p)^2} = \frac{1}{p(1-p)}$$

Example 2.12. Quadratic approximations of the KL divergence between two Bernoulli distributions:
From Example 2.8 we have as KL divergence
$$D_{\text{KL}}(\text{Ber}(p_1), \text{Ber}(p_2)) = p_1 \log\left(\frac{p_1}{p_2}\right) + (1-p_1) \log\left(\frac{1-p_1}{1-p_2}\right)$$
and from Example 2.11 the corresponding expected Fisher information.
The quadratic approximation implies that
$$D_{\text{KL}}(\text{Ber}(p), \text{Ber}(p+\varepsilon)) \approx \frac{\varepsilon^2}{2} I^{\text{Fisher}}(p) = \frac{\varepsilon^2}{2p(1-p)}$$
and also that
$$D_{\text{KL}}(\text{Ber}(p+\varepsilon), \text{Ber}(p)) \approx \frac{\varepsilon^2}{2} I^{\text{Fisher}}(p) = \frac{\varepsilon^2}{2p(1-p)}$$
In Worksheet E1 this is verified by using a second order Taylor series applied to the KL divergence.
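A quick numerical comparison of the exact KL divergence with its quadratic approximation; a sketch in R (my own check):

```r
kl_ber <- function(p, q) p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

p <- 0.3; eps <- 0.01
kl_ber(p, p + eps)          # exact: about 2.35e-04
eps^2 / (2 * p * (1 - p))   # quadratic approximation: about 2.38e-04
```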
Example 2.13. Expected Fisher information for the normal distribution 𝑁(𝜇, 𝜎²).
The log-density is
$$\log f(x|\mu, \sigma^2) = -\frac{1}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} (x-\mu)^2 - \frac{1}{2} \log(2\pi)$$
The gradient with respect to 𝜇 and 𝜎² (!) is the vector
$$\nabla \log f(x|\mu, \sigma^2) = \begin{pmatrix} \frac{1}{\sigma^2}(x-\mu) \\ -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}(x-\mu)^2 \end{pmatrix}$$
Hint for calculating the gradient: replace 𝜎² by 𝑣 and then take the partial derivative with regard to 𝑣, then substitute back.
The Hessian matrix is
$$\nabla \nabla^T \log f(x|\mu, \sigma^2) = \begin{pmatrix} -\frac{1}{\sigma^2} & -\frac{1}{\sigma^4}(x-\mu) \\ -\frac{1}{\sigma^4}(x-\mu) & \frac{1}{2\sigma^4} - \frac{1}{\sigma^6}(x-\mu)^2 \end{pmatrix}$$
As E(𝑥) = 𝜇 we have E(𝑥 − 𝜇) = 0. Furthermore, with E((𝑥 − 𝜇)²) = 𝜎² we see that $\text{E}\left(\frac{1}{\sigma^6}(x-\mu)^2\right) = \frac{1}{\sigma^4}$. Therefore the expected Fisher information matrix, as the negative expected Hessian matrix, is
$$\boldsymbol{I}^{\text{Fisher}}\left(\mu, \sigma^2\right) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix}$$
2.4 Entropy learning and maximum likelihood


2.4.1 The relative entropy between true model and approximating model
Assume we have observations 𝑥1, . . . , 𝑥𝑛. The data is sampled from 𝐹, the true but unknown data generating distribution. We also specify a family of distributions $G_{\boldsymbol{\theta}}$ indexed by 𝜽 to approximate 𝐹.
The relative entropy $D_{\text{KL}}(F, G_{\boldsymbol{\theta}})$ then measures the divergence of the approximation $G_{\boldsymbol{\theta}}$ from the unknown true model 𝐹. It can be written as:
$$D_{\text{KL}}(F, G_{\boldsymbol{\theta}}) = H(F, G_{\boldsymbol{\theta}}) - H(F) = \underbrace{-\text{E}_F\left(\log g_{\boldsymbol{\theta}}(x)\right)}_{\text{cross-entropy}} - \underbrace{\left(-\text{E}_F\left(\log f(x)\right)\right)}_{\text{entropy of } F \text{, does not depend on } \boldsymbol{\theta}}$$
However, since we do not know 𝐹 we cannot actually compute this divergence. Nonetheless, we may use the empirical distribution $\hat{F}_n$ (a function of the observed data) as an approximation for 𝐹, and in this way we arrive at an approximation for $D_{\text{KL}}(F, G_{\boldsymbol{\theta}})$ that becomes more and more accurate with growing sample size.

Recall the "Law of Large Numbers":
• By the strong law of large numbers the empirical distribution $\hat{F}_n$ based on observed data 𝐷 = {𝑥1, . . . , 𝑥𝑛} converges to the true underlying distribution 𝐹 as 𝑛 → ∞ almost surely:
$$\hat{F}_n \overset{a.s.}{\longrightarrow} F$$
• For 𝑛 → ∞ the average $\text{E}_{\hat{F}_n}(h(x)) = \frac{1}{n} \sum_{i=1}^{n} h(x_i)$ converges to the expectation $\text{E}_F(h(x))$.

Hence, for large sample size 𝑛 we can approximate the cross-entropy and as a result the KL divergence. The cross-entropy $H(F, G_{\boldsymbol{\theta}})$ is approximated by the empirical cross-entropy where the expectation is taken with regard to $\hat{F}_n$ rather than 𝐹:
$$H(F, G_{\boldsymbol{\theta}}) \approx H(\hat{F}_n, G_{\boldsymbol{\theta}}) = -\text{E}_{\hat{F}_n}\left(\log g(x|\boldsymbol{\theta})\right) = -\frac{1}{n} \sum_{i=1}^{n} \log g(x_i|\boldsymbol{\theta}) = -\frac{1}{n} l_n(\boldsymbol{\theta}|D)$$
This turns out to be equal to the negative log-likelihood standardised by the sample size 𝑛! Or in other words, the log-likelihood is the negative empirical cross-entropy multiplied by the sample size 𝑛.
From the link of the multinomial coefficient with Shannon entropy (Example 2.3) we already know that for large sample size
$$H(\hat{F}_n) \approx \frac{1}{n} \log \binom{n}{n_1, \dots, n_K}$$
The KL divergence $D_{\text{KL}}(F, G_{\boldsymbol{\theta}})$ can therefore be approximated by
$$D_{\text{KL}}(F, G_{\boldsymbol{\theta}}) \approx -\frac{1}{n}\left(\log \binom{n}{n_1, \dots, n_K} + l_n(\boldsymbol{\theta}|D)\right)$$
Thus, with the KL divergence we obtain not just the log-likelihood (the cross-
entropy part) but also the multiplicity factor taking account of the possible
orderings of the data (the entropy part).
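The convergence of the empirical cross-entropy to the true cross-entropy can be seen in a small simulation; a sketch in R (my own illustration) for the Bernoulli model:

```r
set.seed(42)
p_true <- 0.3    # true success probability (model F)
theta  <- 0.4    # parameter of the candidate model G_theta

# true cross-entropy H(F, G_theta): about 0.632
-(p_true * log(theta) + (1 - p_true) * log(1 - theta))

# empirical cross-entropy -l_n(theta|D)/n approaches it as n grows
for (n in c(100, 100000)) {
  x <- rbinom(n, size = 1, prob = p_true)
  loglik <- sum(x * log(theta) + (1 - x) * log(1 - theta))
  print(-loglik / n)
}
```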

2.4.2 Minimum KL divergence and maximum likelihood


If we knew 𝐹 we would simply minimise 𝐷KL (𝐹, 𝐺𝜽 ) to find the particular model
𝐺𝜽 that is closest to the true model. Equivalently, we would minimise the
cross-entropy 𝐻(𝐹, 𝐺𝜽 ). However, since we actually don’t know 𝐹 this is not
possible.

However, for large sample size 𝑛 when the empirical distribution 𝐹ˆ 𝑛 is a good ap-
proximation for 𝐹, we can use the results from the previous section. Thus, instead
of minimising the KL divergence 𝐷KL (𝐹, 𝐺𝜽 ) we simply minimise 𝐻(𝐹ˆ 𝑛 , 𝐺𝜽 )
which is the same as maximising the log-likelihood 𝑙𝑛 (𝜽|𝐷). Note that the
entropy of the true distribution 𝐹 (and the corresponding empirical distribution
ˆ that does not depend on the parameters 𝜽 and hence it does not matter when
𝐹)
minimising the divergence.

Conversely, this implies that maximising the likelihood with regard to 𝜽 is equivalent (asymptotically for large 𝑛) to minimising the KL divergence of the approximating model from the unknown true model!
$$\hat{\boldsymbol{\theta}}^{ML} = \arg\max_{\boldsymbol{\theta}} l_n(\boldsymbol{\theta}|D) = \arg\min_{\boldsymbol{\theta}} H(\hat{F}_n, G_{\boldsymbol{\theta}}) \approx \arg\min_{\boldsymbol{\theta}} D_{\text{KL}}(F, G_{\boldsymbol{\theta}})$$

Therefore, the reasoning behind the method of maximum likelihood is that it minimises a large sample approximation of the KL divergence of the candidate model $G_{\boldsymbol{\theta}}$ from the unknown true model 𝐹.
As a consequence of the close link of maximum likelihood and relative entropy, maximum likelihood inherits for large 𝑛 (and only then!) all the optimality properties from KL divergence. These will be discussed in more detail later in the course.

2.4.3 Further connections


Since minimising KL divergence contains ML estimation as a special case, you may wonder whether there is a broader justification of relative entropy in the context of statistical data analysis.
Indeed, KL divergence has strong geometrical interpretation that forms the basis
of information geometry. In this field the manifold of distributions is studied
using tools from differential geometry. The expected Fisher information plays
an important role as metric tensor in the space of distributions.
Furthermore, it is also linked to probabilistic forecasting. In the framework of so-called scoring rules, the only local proper scoring rule is the negative log-probability ("surprise"). The expected "surprise" is the cross-entropy, and relative entropy is the corresponding natural divergence connected with the log scoring rule.
Furthermore, another intriguing property of KL divergence is that the relative
entropy 𝐷KL (𝐹, 𝐺) is the only divergence measure that is both a Bregman and
an 𝑓 -divergence. Note that 𝑓 -divergences and Bregman-divergences (in turn
related to proper scoring rules) are two large classes of measures of similarity
and divergence between two probability distributions.
Finally, not only likelihood estimation but also the Bayesian update rule (as discussed later in this module) is a special case of entropy learning.
Chapter 3

Maximum likelihood estimation

3.1 Principle of maximum likelihood estimation


3.1.1 Outline
The starting points in an ML analysis are
• the observed data 𝐷 = {𝑥1, . . . , 𝑥𝑛} with 𝑛 independent and identically distributed (iid) samples, with the ordering irrelevant, and
• a model $F_{\boldsymbol{\theta}}$ with corresponding probability density or probability mass function $f(x|\boldsymbol{\theta})$ with parameters 𝜽.
From this we construct the likelihood function:
$$L_n(\boldsymbol{\theta}|D) = \prod_{i=1}^{n} f(x_i|\boldsymbol{\theta})$$
Historically, the likelihood is also often interpreted as the probability of the data
given the model. However, this is not strictly correct. First, this interpretation
only applies to discrete random variables. Second, since the samples are iid even
in this case one would still need to add a factor accounting for the multiplicity
of possible orderings of the samples to obtain the correct probability of the
data. Third, the interpretation of likelihood as probability of the data completely
breaks down for continuous random variables because then 𝑓 (𝑥|𝜽) is a density,
not a probability.
As we have seen in the previous chapter the origin of the likelihood function lies in its connection to relative entropy. Specifically, the log-likelihood function
$$l_n(\boldsymbol{\theta}|D) = \sum_{i=1}^{n} \log f(x_i|\boldsymbol{\theta})$$
divided by the sample size 𝑛 is a large sample approximation of the cross-entropy between the unknown true data generating model and the approximating model $F_{\boldsymbol{\theta}}$. Note that the log-likelihood is additive over the samples 𝑥𝑖.
The maximum likelihood point estimate $\hat{\boldsymbol{\theta}}^{ML}$ is then given by maximising the (log-)likelihood:
$$\hat{\boldsymbol{\theta}}^{ML} = \arg\max_{\boldsymbol{\theta}} l_n(\boldsymbol{\theta}|D)$$
Thus, finding the MLE is an optimisation problem that in practice is most often solved numerically on the computer, using approaches such as gradient ascent (or, for the negative log-likelihood, gradient descent) and related algorithms. Depending on the complexity of the likelihood function finding the maximum can be very difficult.

3.1.2 Obtaining MLEs for a regular model


In regular situations, i.e. when
• the log-likelihood function is twice differentiable with regard to the pa-
rameters,
• the maximum (peak) of the likelihood function lies inside the parameter
space and not at a boundary,
• the parameters of the model are all identifiable (in particular the model is
not overparameterised), and
• the second derivative of the log-likelihood at the maximum is negative
and not zero (for more than one parameter: the Hessian matrix at the
maximum is negative definite and not singular)
then in order to maximise $l_n(\boldsymbol{\theta}|D)$ one may use the score function, which is the first derivative of the log-likelihood function:
$$S_n(\theta) = \frac{d l_n(\theta|D)}{d\theta}$$
for a scalar parameter 𝜃, and the gradient
$$\boldsymbol{S}_n(\boldsymbol{\theta}) = \nabla l_n(\boldsymbol{\theta}|D)$$
if 𝜽 is a vector (i.e. if there is more than one parameter).

A necessary (but not sufficient) condition for the MLE is that
$$\boldsymbol{S}_n(\hat{\boldsymbol{\theta}}^{ML}) = 0$$
To demonstrate that the log-likelihood function actually achieves a maximum at $\hat{\boldsymbol{\theta}}^{ML}$ the curvature at the MLE must be negative, i.e. the log-likelihood must be locally concave at the MLE.
In the case of a single parameter (scalar 𝜃) this requires checking that the second derivative of the log-likelihood function is negative:
$$\frac{d^2 l_n(\hat{\theta}^{ML}|D)}{d\theta^2} < 0$$
In the case of a parameter vector (multivariate 𝜽) you need to compute the Hessian matrix (matrix of second order derivatives) at the MLE:
$$\nabla \nabla^T l_n(\hat{\boldsymbol{\theta}}^{ML}|D)$$
and then verify that this matrix is negative definite (i.e. all its eigenvalues must be negative).
As we will see later the second order derivatives of the log-likelihood function
also play an important role for assessing the uncertainty of the MLE.

3.1.3 Invariance property of the maximum likelihood


The invariance principle states that the maximum likelihood is invariant against
reparameterisation.
Assume we transform a parameter 𝜃 into another parameter 𝜆 using some
invertible function 𝑔() so that 𝜆 = 𝑔(𝜃). Then the maximum likelihood estimate
𝜆ˆ 𝑀𝐿 of the new parameter 𝜆 is simply the transformation of the maximum
likelihood estimate 𝜃ˆ 𝑀𝐿 of the original parameter 𝜃 with 𝜆ˆ 𝑀𝐿 = 𝑔(𝜃ˆ 𝑀𝐿 ). The
achieved maximum likelihood is the same in both cases.
The reason why this works is that maximisation is a procedure that is invariant against transformations of the argument of the function that is maximised. Consider a function ℎ(𝑥) with a maximum at 𝑥max = arg max ℎ(𝑥). Now we relabel the argument using 𝑦 = 𝑔(𝑥) where 𝑔 is an invertible function. Then the function in terms of 𝑦 is ℎ(𝑔⁻¹(𝑦)), and clearly this function has a maximum at 𝑦max = 𝑔(𝑥max) since ℎ(𝑔⁻¹(𝑦max)) = ℎ(𝑥max).
The invariance property can be very useful in practice because it is often easier (and sometimes numerically more stable) to maximise the likelihood for a different set of parameters.
See Worksheet L1 for an example application of the invariance principle.
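As a further simple illustration (my own, not the worksheet example): for the Bernoulli model with MLE $\hat{p}^{ML} = \bar{x}$ (derived in Example 3.1 below), reparameterising to the odds $\lambda = g(p) = \frac{p}{1-p}$ immediately yields $\hat{\lambda}^{ML} = g(\hat{p}^{ML}) = \frac{\bar{x}}{1-\bar{x}}$, without having to maximise the likelihood again in terms of $\lambda$.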

3.1.4 Consistency of maximum likelihood estimates


One important property of maximum likelihood is that it produces consistent
estimates.
Specifically, if the true underlying model $F_{\text{true}}$ with parameter $\boldsymbol{\theta}_{\text{true}}$ is contained in the set of specified candidate models $F_{\boldsymbol{\theta}}$,
$$\underbrace{F_{\text{true}}}_{\text{true model}} \subset \underbrace{F_{\boldsymbol{\theta}}}_{\text{specified models}}$$
then
$$\hat{\boldsymbol{\theta}}^{ML} \overset{\text{large } n}{\longrightarrow} \boldsymbol{\theta}_{\text{true}}$$

This is a consequence of $D_{\text{KL}}(F_{\text{true}}, F_{\boldsymbol{\theta}}) \to 0$ for $F_{\boldsymbol{\theta}} \to F_{\text{true}}$, and of the fact that maximisation of the likelihood function is for large 𝑛 equivalent to minimising the relative entropy.
Thus given sufficient data the MLE will converge to the true value. As a consequence, MLEs are asymptotically unbiased. As we will see in the examples they can still be biased in finite samples.
Note that even if the candidate model $F_{\boldsymbol{\theta}}$ is misspecified (i.e. it does not contain the actual true model) the MLE is still optimal in the sense that it will find the closest possible model.
It is possible to find inconsistent MLEs, but this occurs only in situations where
the dimension of the model / number of parameters increases with sample
size, or when the MLE is at a boundary or when there are singularities in the
likelihood function.

3.2 Maximum likelihood estimation in practice


3.2.1 Estimation of a proportion
Example 3.1. Maximum likelihood estimation for the Bernoulli model:
We aim to estimate the true proportion 𝑝 in a Bernoulli experiment with binary
outcomes, say the proportion of “successes” vs. “failures” or of “heads” vs. “tails”
in a coin tossing experiment.
• Bernoulli model Ber(𝑝): Pr("success") = 𝑝 and Pr("failure") = 1 − 𝑝.
• The “success” is indicated by outcome 𝑥 = 1 and the “failure” by 𝑥 = 0.
• We conduct 𝑛 trials and record 𝑛1 successes and 𝑛 − 𝑛1 failures.
• Parameter 𝑝: the probability of "success".
What is the MLE of 𝑝?
• the observations 𝐷 = {𝑥1 , . . . , 𝑥 𝑛 } take on values 0 or 1.

• the average of the data points is $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{n_1}{n}$.
• the probability mass function (PMF) of the Bernoulli distribution Ber(𝑝) is:
$$f(x|p) = p^x (1-p)^{1-x} = \begin{cases} p & \text{if } x = 1 \\ 1-p & \text{if } x = 0 \end{cases}$$
• log-PMF:
$$\log f(x|p) = x \log(p) + (1-x) \log(1-p)$$
• log-likelihood function:
$$l_n(p|D) = \sum_{i=1}^{n} \log f(x_i) = n_1 \log p + (n - n_1) \log(1-p) = n\left(\bar{x} \log p + (1-\bar{x}) \log(1-p)\right)$$
Note how the log-likelihood depends on the data only through $\bar{x}$! This is an example of a sufficient statistic for the parameter 𝑝 (in fact it is also a minimally sufficient statistic). This will be discussed in more detail later.
• Score function:
$$S_n(p) = \frac{d l_n(p|D)}{dp} = n\left(\frac{\bar{x}}{p} - \frac{1-\bar{x}}{1-p}\right)$$
• Maximum likelihood estimate: setting $S_n(\hat{p}^{ML}) = 0$ yields as solution
$$\hat{p}^{ML} = \bar{x} = \frac{n_1}{n}$$
With $\frac{dS_n(p)}{dp} = -n\left(\frac{\bar{x}}{p^2} + \frac{1-\bar{x}}{(1-p)^2}\right) < 0$ the optimum corresponds indeed to the maximum of the (log-)likelihood function, as this derivative is negative for $\hat{p}^{ML}$ (and indeed for any 𝑝).
The maximum likelihood estimator of 𝑝 is therefore identical to the frequency of the successes among all observations.
Note that to analyse the coin tossing experiment and to estimate 𝑝 we may equally well use the binomial distribution Bin(𝑛, 𝑝) as model for the number of successes. In this case we then have only a single observation, namely the observed number of successes 𝑘. This results in the same MLE for 𝑝, but the likelihood function based on the binomial PMF includes the binomial coefficient $\binom{n}{k}$. However, as this factor does not depend on 𝑝 it disappears in the score function and has no influence on the derivation of the MLE.
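A minimal R sketch (my own illustration) confirming the closed-form MLE against direct numerical maximisation of the log-likelihood using `optimize`:

```r
set.seed(7)
x <- rbinom(100, size = 1, prob = 0.3)   # n = 100 Bernoulli trials

mean(x)   # analytic MLE: p_hat = n1 / n

loglik <- function(p) sum(x * log(p) + (1 - x) * log(1 - p))
optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum
```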

3.2.2 Estimation of the mean and variance of a normal distribution
Example 3.2. Normal distribution with unknown mean and known variance:
• 𝑥 ∼ 𝑁(𝜇, 𝜎²) with E(𝑥) = 𝜇 and Var(𝑥) = 𝜎²
• the parameter to be estimated is 𝜇 whereas 𝜎² is known.
What's the MLE of parameter 𝜇?
• the data 𝐷 = {𝑥1, . . . , 𝑥𝑛} are all real in the range 𝑥𝑖 ∈ [−∞, ∞].
• the average $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is real as well.
• Density:
$$f(x|\mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
• Log-density:
$$\log f(x|\mu) = -\frac{1}{2} \log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}$$
• Log-likelihood function:
$$\begin{aligned}
l_n(\mu|D) &= \sum_{i=1}^{n} \log f(x_i) \\
&= -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 - \underbrace{\frac{n}{2} \log(2\pi\sigma^2)}_{\text{constant term, does not depend on } \mu \text{, can be removed}} \\
&= -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(x_i^2 - 2 x_i \mu + \mu^2\right) + C \\
&= \frac{n}{\sigma^2} \left(\bar{x}\mu - \frac{1}{2}\mu^2\right) - \underbrace{\frac{1}{2\sigma^2} \sum_{i=1}^{n} x_i^2}_{\text{another constant term}} + C
\end{aligned}$$
Note how the non-constant terms of the log-likelihood depend on the data only through $\bar{x}$!
• Score function:
$$S_n(\mu) = \frac{n}{\sigma^2} (\bar{x} - \mu)$$
• Maximum likelihood estimate:
$$S_n(\hat{\mu}^{ML}) = 0 \Rightarrow \hat{\mu}^{ML} = \bar{x}$$
• With $\frac{dS_n(\mu)}{d\mu} = -\frac{n}{\sigma^2} < 0$ the optimum is indeed the maximum.
The constant term 𝐶 in the log-likelihood function collects all terms that do not depend on the parameter. After taking the first derivative with regard to the parameter this term disappears, thus 𝐶 is not relevant for finding the MLE of the parameter. In the future we will often omit such constant terms from the log-likelihood function without further mention.
Example 3.3. Normal distribution with mean and variance both unknown:
• 𝑥 ∼ 𝑁(𝜇, 𝜎²) with E(𝑥) = 𝜇 and Var(𝑥) = 𝜎²
• both 𝜇 and 𝜎² need to be estimated.
What's the MLE of the parameter vector 𝜽 = (𝜇, 𝜎²)ᵀ?
• the data 𝐷 = {𝑥1, . . . , 𝑥𝑛} are all real in the range 𝑥𝑖 ∈ [−∞, ∞].
• the average $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is real as well.
• the average of the squared data $\overline{x^2} = \frac{1}{n} \sum_{i=1}^{n} x_i^2 \geq 0$ is non-negative.
• Density:
$$f(x|\mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
• Log-density:
$$\log f(x|\mu, \sigma^2) = -\frac{1}{2} \log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}$$
• Log-likelihood function:
$$\begin{aligned}
l_n(\boldsymbol{\theta}|D) = l_n(\mu, \sigma^2|D) &= \sum_{i=1}^{n} \log f(x_i) \\
&= -\frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 - \underbrace{\frac{n}{2} \log(2\pi)}_{\text{constant not depending on } \mu \text{ or } \sigma^2} \\
&= -\frac{n}{2} \log(\sigma^2) - \frac{n}{2\sigma^2} \left(\overline{x^2} - 2\bar{x}\mu + \mu^2\right) + C
\end{aligned}$$
Note how the log-likelihood function depends on the data only through $\bar{x}$ and $\overline{x^2}$!
• Score function $\boldsymbol{S}_n$, the gradient of $l_n(\boldsymbol{\theta}|D)$:
$$\boldsymbol{S}_n(\boldsymbol{\theta}) = \nabla l_n(\boldsymbol{\theta}|D) = \begin{pmatrix} \frac{n}{\sigma^2}(\bar{x} - \mu) \\ -\frac{n}{2\sigma^2} + \frac{n}{2\sigma^4}\left(\overline{x^2} - 2\bar{x}\mu + \mu^2\right) \end{pmatrix}$$
Note that to obtain the second component of the score function the partial derivative needs to be taken with regard to the variance parameter 𝜎², not with regard to 𝜎! Hint: replace 𝜎² = 𝑣 in the log-likelihood function, then take the partial derivative with regard to 𝑣, then backsubstitute 𝑣 = 𝜎² in the result.
• Maximum likelihood estimate:
$$\boldsymbol{S}_n(\hat{\boldsymbol{\theta}}^{ML}) = 0 \Rightarrow \hat{\boldsymbol{\theta}}^{ML} = \begin{pmatrix} \hat{\mu}^{ML} \\ \widehat{\sigma^2}^{ML} \end{pmatrix} = \begin{pmatrix} \bar{x} \\ \overline{x^2} - \bar{x}^2 \end{pmatrix}$$
The ML estimate of the variance we can also write as $\widehat{\sigma^2}^{ML} = \overline{x^2} - \bar{x}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$.
• To confirm that we actually have a maximum we need to verify that the eigenvalues of the Hessian matrix are all negative. This is indeed the case; for details see Example 3.6.
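A sketch in R (my own illustration) comparing the closed-form MLEs with direct numerical optimisation of the log-likelihood via `optim`:

```r
set.seed(1)
x <- rnorm(200, mean = 5, sd = 2)

c(mean(x), mean(x^2) - mean(x)^2)   # closed-form MLEs of mu and sigma^2

# negative log-likelihood with theta = (mu, sigma^2)
negloglik <- function(theta)
  -sum(dnorm(x, mean = theta[1], sd = sqrt(theta[2]), log = TRUE))

# numerical MLE, keeping sigma^2 positive via a box constraint
optim(c(0, 1), negloglik, method = "L-BFGS-B",
      lower = c(-Inf, 1e-6))$par
```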

3.2.3 Relationship of maximum likelihood with least squares estimation
In Example 3.2 the form of the log-likelihood function is a function of the sum of squared differences. Maximising $l_n(\mu|D) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 + C$ is equivalent to minimising $\sum_{i=1}^{n} (x_i - \mu)^2$. Hence, finding the mean by maximum likelihood assuming a normal model is equivalent to least-squares estimation!
Note that least-squares estimation has been in use at least since the early 1800s¹ and thus predates maximum likelihood (1922). Due to its simplicity it is still very popular, in particular in regression, and the link with maximum likelihood and normality allows us to understand why it usually works well!
¹ Stigler, S. M. 1981. Gauss and the invention of least squares. Ann. Statist. 9:465–474. https://doi.org/10.1214/aos/1176345451

3.2.4 Bias and maximum likelihood estimates


Example 3.3 is interesting because it shows that maximum likelihood can result in both biased as well as unbiased estimators.
Recall that 𝑥 ∼ 𝑁(𝜇, 𝜎2 ). As a result
$$\hat{\mu}_{ML} = \bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
with $\text{E}(\hat{\mu}_{ML}) = \mu$, and
$$\hat{\sigma}^2_{ML} \sim \frac{\sigma^2}{n}\chi^2_{n-1}$$
1Stigler, S. M. 1981. Gauss and the invention of least squares. Ann. Statist. 9:465–474. https://doi.org/10.1214/aos/1176345451
with $\text{E}(\hat{\sigma}^2_{ML}) = \frac{n-1}{n}\sigma^2$.
Therefore, the MLE of 𝜇 is unbiased as
$$\text{Bias}(\hat{\mu}_{ML}) = \text{E}(\hat{\mu}_{ML}) - \mu = 0$$
In contrast, however, the MLE of 𝜎2 is negatively biased because
$$\text{Bias}(\hat{\sigma}^2_{ML}) = \text{E}(\hat{\sigma}^2_{ML}) - \sigma^2 = -\frac{1}{n}\sigma^2$$

Thus, in the case of the variance parameter of the normal distribution the MLE does not recover the well-known unbiased estimator of the variance
$$\hat{\sigma}^2_{UB} = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2 = \frac{n}{n-1}\hat{\sigma}^2_{ML}$$
Conversely, the unbiased estimator is not a maximum likelihood estimate!
Therefore it is worth keeping in mind that maximum likelihood can result in
biased estimates for finite 𝑛. For large 𝑛, however, the bias disappears as MLEs
are consistent.
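The finite-sample bias of $\hat{\sigma}^2_{ML}$ is easy to demonstrate empirically. A small simulation sketch (Python/numpy; illustrative values, not from the original text):

```python
# Simulation sketch: bias of the ML variance estimate vs the unbiased estimate.
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, reps = 10, 4.0, 100_000
x = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))

var_ml = x.var(axis=1, ddof=0)   # factor 1/n      (MLE)
var_ub = x.var(axis=1, ddof=1)   # factor 1/(n-1)  (unbiased)

print(var_ml.mean())   # approx (n-1)/n * sigma2 = 3.6
print(var_ub.mean())   # approx sigma2 = 4.0
```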

3.3 Observed Fisher information


3.3.1 Motivation and definition

[Figure: likelihood curves plotted over 𝜃 from 0 to 10 — a flat likelihood (left) versus a concentrated likelihood (right).]

By inspection of some log-likelihood curves it is apparent that the log-likelihood function contains more information about the parameter 𝜽 than just the maximum point 𝜽ˆ 𝑀𝐿 .
In particular, the curvature of the log-likelihood function at the MLE must be somehow related to the accuracy of 𝜽ˆ 𝑀𝐿 : if the likelihood surface is flat near the maximum (low curvature) then it is more difficult to find the optimal parameter (also numerically!). Conversely, if the likelihood surface is peaked (strong curvature) then the maximum point is clearly defined.
The curvature is described by the second-order derivatives (Hessian matrix) of the log-likelihood function.
For univariate 𝜃 the Hessian is a scalar:
$$\frac{d^2 l_n(\theta|D)}{d\theta^2}$$
For a multivariate parameter vector 𝜽 of dimension 𝑑 the Hessian is a matrix of size 𝑑 × 𝑑:
$$\nabla\nabla^T l_n(\boldsymbol{\theta}|D)$$

By construction the Hessian is negative definite at the MLE (i.e. its eigenvalues are all negative) to ensure that the function is concave at the MLE (i.e. peak shaped).
The observed Fisher information (matrix) is defined as the negative curvature
at the MLE 𝜽ˆ 𝑀𝐿 :
𝑱 𝑛 (𝜽ˆ 𝑀𝐿 ) = −∇∇𝑇 𝑙𝑛 (𝜽ˆ 𝑀𝐿 |𝐷)

Sometimes this is simply called the “observed information”. To avoid confusion


with the expected Fisher information introduced earlier
 
𝑰 Fisher (𝜽) = −E𝐹𝜽 ∇∇𝑇 log 𝑓 (𝑥|𝜽)

it is necessary to always use the qualifier “observed” when referring to 𝑱 𝑛 (𝜽ˆ 𝑀𝐿 ).
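Since the observed Fisher information is just the negative curvature of the log-likelihood at the MLE, it can also be approximated numerically by a finite difference when no closed form is at hand. A sketch for a scalar parameter (Python; illustrative Bernoulli data, not from the original text):

```python
# Sketch: observed Fisher information via a central finite difference
# of the log-likelihood around the MLE (scalar parameter).
import numpy as np

x = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])  # Bernoulli data

def loglik(p):
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

p_ml = x.mean()
h = 1e-5
# J_n = - d^2 l / dp^2 at the MLE, approximated numerically
J_obs = -(loglik(p_ml + h) - 2 * loglik(p_ml) + loglik(p_ml - h)) / h**2

n = len(x)
print(J_obs, n / (p_ml * (1 - p_ml)))   # matches the closed form n / (p(1-p))
```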

3.3.2 Examples of observed Fisher information


Example 3.4. Bernoulli model Ber(𝑝):
We continue Example 3.1. Recall that $\hat{p}_{ML} = \bar{x} = \frac{n_1}{n}$ and the score function $S_n(p) = n\left(\frac{\bar{x}}{p} - \frac{1-\bar{x}}{1-p}\right)$. The negative second derivative of the log-likelihood function is
$$-\frac{dS_n(p)}{dp} = n\left(\frac{\bar{x}}{p^2} + \frac{1-\bar{x}}{(1-p)^2}\right)$$
The observed Fisher information is therefore
$$
\begin{aligned}
J_n(\hat{p}_{ML}) &= n\left(\frac{\bar{x}}{\hat{p}_{ML}^2} + \frac{1-\bar{x}}{(1-\hat{p}_{ML})^2}\right) \\
&= n\left(\frac{1}{\hat{p}_{ML}} + \frac{1}{1-\hat{p}_{ML}}\right) \\
&= \frac{n}{\hat{p}_{ML}(1-\hat{p}_{ML})}
\end{aligned}
$$

The inverse of the observed Fisher information is:
$$J_n(\hat{p}_{ML})^{-1} = \frac{\hat{p}_{ML}(1-\hat{p}_{ML})}{n}$$
Compare this with $\text{Var}\left(\frac{x}{n}\right) = \frac{p(1-p)}{n}$ for 𝑥 ∼ Bin(𝑛, 𝑝).

Example 3.5. Normal distribution with unknown mean and known variance:
This is the continuation of Example 3.2. Recall the MLE for the mean $\hat{\mu}_{ML} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}$ and the score function $S_n(\mu) = \frac{n}{\sigma^2}(\bar{x}-\mu)$. The negative derivative of the score function is
$$-\frac{dS_n(\mu)}{d\mu} = \frac{n}{\sigma^2}$$
The observed Fisher information at the MLE is therefore
$$J_n(\hat{\mu}_{ML}) = \frac{n}{\sigma^2}$$
and the inverse of the observed Fisher information is
$$J_n(\hat{\mu}_{ML})^{-1} = \frac{\sigma^2}{n}$$
For $x_i \sim N(\mu, \sigma^2)$ we have $\text{Var}(x_i) = \sigma^2$ and hence $\text{Var}(\bar{x}) = \frac{\sigma^2}{n}$, which is equal to the inverse observed Fisher information.
Example 3.6. Normal distribution with mean and variance parameter:
This is the continuation of Example 3.3. Recall the MLE for the mean and
variance:
$$\hat{\mu}_{ML} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}$$
$$\hat{\sigma}^2_{ML} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2 = \overline{x^2} - \bar{x}^2$$
with score function
$$\boldsymbol{S}_n(\mu, \sigma^2) = \nabla l_n(\mu, \sigma^2|D) = \begin{pmatrix} \frac{n}{\sigma^2}(\bar{x}-\mu) \\ -\frac{n}{2\sigma^2} + \frac{n}{2\sigma^4}\left(\overline{x^2} - 2\mu\bar{x} + \mu^2\right) \end{pmatrix}$$
The Hessian matrix of the log-likelihood function is
$$\nabla\nabla^T l_n(\mu, \sigma^2|D) = \begin{pmatrix} -\frac{n}{\sigma^2} & -\frac{n}{\sigma^4}(\bar{x}-\mu) \\ -\frac{n}{\sigma^4}(\bar{x}-\mu) & \frac{n}{2\sigma^4} - \frac{n}{\sigma^6}\left(\overline{x^2} - 2\mu\bar{x} + \mu^2\right) \end{pmatrix}$$

The negative Hessian at the MLE, i.e. at $\hat{\mu}_{ML} = \bar{x}$ and $\hat{\sigma}^2_{ML} = \overline{x^2} - \bar{x}^2$, yields the observed Fisher information matrix:
$$\boldsymbol{J}_n(\hat{\mu}_{ML}, \hat{\sigma}^2_{ML}) = \begin{pmatrix} \frac{n}{\hat{\sigma}^2_{ML}} & 0 \\ 0 & \frac{n}{2(\hat{\sigma}^2_{ML})^2} \end{pmatrix}$$
Note that the observed Fisher information matrix is diagonal with positive entries. Therefore its eigenvalues are all positive as required for a maximum, because for a diagonal matrix the eigenvalues are simply the entries on the diagonal.
The inverse of the observed Fisher information matrix is
$$\boldsymbol{J}_n(\hat{\mu}_{ML}, \hat{\sigma}^2_{ML})^{-1} = \begin{pmatrix} \frac{\hat{\sigma}^2_{ML}}{n} & 0 \\ 0 & \frac{2(\hat{\sigma}^2_{ML})^2}{n} \end{pmatrix}$$

Recall that 𝑥 ∼ 𝑁(𝜇, 𝜎2 ) and therefore
$$\hat{\mu}_{ML} = \bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
Hence $\text{Var}(\hat{\mu}_{ML}) = \frac{\sigma^2}{n}$. If you compare this with the first diagonal entry of the inverse observed Fisher information matrix you see that this is essentially the same expression (apart from the “hat”).

The empirical variance $\hat{\sigma}^2_{ML}$ follows a scaled chi-squared distribution
$$\hat{\sigma}^2_{ML} \sim \frac{\sigma^2}{n}\chi^2_{n-1}$$
with variance $\text{Var}(\hat{\sigma}^2_{ML}) = \frac{n-1}{n^2}\,2\sigma^4$. For large 𝑛 this becomes $\text{Var}(\hat{\sigma}^2_{ML}) = \frac{2\sigma^4}{n}$, which is essentially (apart from the “hat”) the second diagonal entry of the inverse observed Fisher information matrix.
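These correspondences can be checked by simulation: over repeated data sets the sampling variances of $\hat{\mu}_{ML}$ and $\hat{\sigma}^2_{ML}$ should approximately match the diagonal entries of the inverse Fisher information (with the estimates replaced by the true values). A rough sketch, assuming numpy is available:

```python
# Sketch: sampling variance of the normal MLEs vs inverse Fisher information.
import numpy as np

rng = np.random.default_rng(1)
n, mu, sigma2, reps = 50, 0.0, 2.0, 200_000
x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))

mu_hat = x.mean(axis=1)
v_hat = x.var(axis=1, ddof=0)

print(mu_hat.var(), sigma2 / n)          # first diagonal entry: sigma^2 / n
print(v_hat.var(), 2 * sigma2**2 / n)    # second entry (approx): 2 sigma^4 / n
```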

3.3.3 Relationship between observed and expected Fisher information
The observed Fisher information 𝑱 𝑛 (𝜽ˆ 𝑀𝐿 ) and the expected Fisher information
𝑰 Fisher (𝜽) are related but also two clearly different entities:
• Both types of Fisher information are based on computing the second order
derivative (Hessian matrix), thus are based on the curvature of a function.
• The observed Fisher information is computed from the log-likelihood function. Therefore it takes the observed data 𝐷 into account and explicitly depends on the sample size 𝑛. It contains estimates of the parameters but not the parameters themselves. While the curvature of the log-likelihood function may be computed at any point, the observed Fisher information specifically refers to the curvature at the MLE 𝜽ˆ 𝑀𝐿 . It is linked to the (asymptotic) variance of the MLE as we have seen in the examples and will discuss in more detail later.
• In contrast, the expected Fisher information is derived directly from the
log-density. It does not depend on the observed data, and thus does not
have dependency on sample size. It can be computed for any value of the
parameters. It describes the geometry of the space of the models, and is
the local approximation of relative entropy.
• Asymptotically, for large sample size 𝑛 the MLE converges to the true parameter, 𝜽ˆ 𝑀𝐿 → 𝜽 0 . It follows from the construction of the observed Fisher information and the law of large numbers that correspondingly 𝑱 𝑛 (𝜽ˆ 𝑀𝐿 ) → 𝑛𝑰 Fisher (𝜽 0 ).
• In a very important class of models, namely in an exponential family
model, we find that 𝑱 𝑛 (𝜽ˆ 𝑀𝐿 ) = 𝑛𝑰 Fisher (𝜽ˆ 𝑀𝐿 ) also for finite sample size
𝑛. This is in fact the case in all the examples discussed above (e.g. see
Examples 2.11 and 3.4 for the Bernoulli and Examples 2.13 and 3.6 for the
normal distribution).
• However, this is an exception. In a general model 𝑱 𝑛 (𝜽ˆ 𝑀𝐿 ) ≠ 𝑛𝑰 Fisher (𝜽ˆ 𝑀𝐿 ) for finite sample size 𝑛. An example is provided by the Cauchy distribution with median parameter 𝜃. It is not an exponential family model and has expected Fisher information $I^{\text{Fisher}}(\theta) = \frac{1}{2}$ regardless of the choice of the median parameter, whereas the observed Fisher information $J_n(\hat{\theta}_{ML})$ depends on the MLE $\hat{\theta}_{ML}$ of the median parameter and is not simply $\frac{n}{2}$.
Chapter 4
Quadratic approximation and normal asymptotics

4.1 Multivariate statistics for random vectors


4.1.1 Covariance and correlation
Assume a scalar random variable 𝑥 with mean E(𝑥) = 𝜇. The corresponding variance is given by
$$\text{Var}(x) = \text{E}\left((x-\mu)^2\right) = \text{E}\left((x-\mu)(x-\mu)\right) = \text{E}(x^2) - \mu^2$$

For a random vector $\boldsymbol{x} = (x_1, x_2, \ldots, x_d)^T$ the mean $\text{E}(\boldsymbol{x}) = \boldsymbol{\mu}$ is simply comprised of the means of its components, i.e. $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_d)^T$. Thus, the mean of a random vector of dimension 𝑑 is a vector of the same length.
The variance of a random vector of length 𝑑, however, is not a vector but a matrix of size 𝑑 × 𝑑. This matrix is called the covariance matrix:
$$\text{Var}(\boldsymbol{x}) = \boldsymbol{\Sigma} = (\sigma_{ij}) = \begin{pmatrix} \sigma_{11} & \ldots & \sigma_{1d} \\ \vdots & \ddots & \vdots \\ \sigma_{d1} & \ldots & \sigma_{dd} \end{pmatrix} = \text{E}\left((\boldsymbol{x}-\boldsymbol{\mu})(\boldsymbol{x}-\boldsymbol{\mu})^T\right) = \text{E}(\boldsymbol{x}\boldsymbol{x}^T) - \boldsymbol{\mu}\boldsymbol{\mu}^T$$


The entries of the covariance matrix 𝜎𝑖𝑗 = Cov(𝑥 𝑖 , 𝑥 𝑗 ) describe the covariance
between the random variables 𝑥 𝑖 and 𝑥 𝑗 . The covariance matrix is symmetric,
hence 𝜎𝑖𝑗 = 𝜎 𝑗𝑖 . The diagonal entries 𝜎𝑖𝑖 = Cov(𝑥 𝑖 , 𝑥 𝑖 ) = Var(𝑥 𝑖 ) = 𝜎𝑖2 correspond
to the variances of the components of 𝒙. The covariance matrix is positive
semi-definite, i.e. the eigenvalues of 𝚺 are all positive or equal to zero. However,
in practise one aims to use non-singular covariance matrices, with all eigenvalues
positive, so that they are invertible.
A covariance matrix can be factorised into the product
$$\boldsymbol{\Sigma} = \boldsymbol{V}^{\frac{1}{2}} \boldsymbol{P} \boldsymbol{V}^{\frac{1}{2}}$$
where 𝑽 is a diagonal matrix containing the variances
$$\boldsymbol{V} = \begin{pmatrix} \sigma_{11} & \ldots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \ldots & \sigma_{dd} \end{pmatrix}$$
and the matrix 𝑷 (“upper case rho”) is the symmetric correlation matrix
$$\boldsymbol{P} = (\rho_{ij}) = \begin{pmatrix} 1 & \ldots & \rho_{1d} \\ \vdots & \ddots & \vdots \\ \rho_{d1} & \ldots & 1 \end{pmatrix} = \boldsymbol{V}^{-\frac{1}{2}} \boldsymbol{\Sigma} \boldsymbol{V}^{-\frac{1}{2}}$$
Thus, the correlation between $x_i$ and $x_j$ is defined as
$$\rho_{ij} = \text{Cor}(x_i, x_j) = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}}$$

For univariate 𝑥 and scalar constant 𝑎 the variance of 𝑎𝑥 equals $\text{Var}(ax) = a^2\,\text{Var}(x)$. For a random vector 𝒙 of dimension 𝑑 and matrix 𝑨 of dimension 𝑚 × 𝑑 this generalises to $\text{Var}(\boldsymbol{A}\boldsymbol{x}) = \boldsymbol{A}\,\text{Var}(\boldsymbol{x})\,\boldsymbol{A}^T$.

4.1.2 Multivariate normal distribution


The density of a normally distributed scalar variable $x \sim N(\mu, \sigma^2)$ with mean E(𝑥) = 𝜇 and variance Var(𝑥) = 𝜎2 is
$$f(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

The univariate normal distribution for a scalar 𝑥 generalises to the multivariate normal distribution for a vector $\boldsymbol{x} = (x_1, x_2, \ldots, x_d)^T \sim N_d(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with mean
E(𝒙) = 𝝁 and covariance matrix Var(𝒙) = 𝚺. The corresponding density is
$$f(\boldsymbol{x}|\boldsymbol{\mu},\boldsymbol{\Sigma}) = (2\pi)^{-\frac{d}{2}} \det(\boldsymbol{\Sigma})^{-\frac{1}{2}} \exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}-\boldsymbol{\mu})\right)$$
Note that the exponent is a scalar: $(\boldsymbol{x}-\boldsymbol{\mu})^T$ is of dimension $1 \times d$, $\boldsymbol{\Sigma}^{-1}$ is $d \times d$ and $(\boldsymbol{x}-\boldsymbol{\mu})$ is $d \times 1$.
For 𝑑 = 1 we have 𝒙 = 𝑥, 𝝁 = 𝜇 and 𝚺 = 𝜎2 so that the multivariate normal density reduces to the univariate normal density.
Example 4.1. Maximum likelihood estimates of the parameters of the multivariate
normal distribution:
Maximising the log-likelihood based on the multivariate normal density yields
the MLEs for 𝝁 and 𝚺. These are generalisations of the MLEs for the mean 𝜇 and
variance 𝜎2 of the univariate normal as encountered in Example 3.3.
The estimates can be written in three different ways:
a) data vector notation
with $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n$ the 𝑛 vector-valued observations from the multivariate normal:
MLE for the mean:
$$\hat{\boldsymbol{\mu}}_{ML} = \frac{1}{n}\sum_{k=1}^n \boldsymbol{x}_k = \bar{\boldsymbol{x}}$$
MLE for the covariance:
$$\hat{\boldsymbol{\Sigma}}_{ML} = \frac{1}{n}\sum_{k=1}^n (\boldsymbol{x}_k - \bar{\boldsymbol{x}})(\boldsymbol{x}_k - \bar{\boldsymbol{x}})^T$$
Note the factor $\frac{1}{n}$ in the estimator of the covariance matrix.
With $\overline{\boldsymbol{x}\boldsymbol{x}^T} = \frac{1}{n}\sum_{k=1}^n \boldsymbol{x}_k \boldsymbol{x}_k^T$ we can also write
$$\hat{\boldsymbol{\Sigma}}_{ML} = \overline{\boldsymbol{x}\boldsymbol{x}^T} - \bar{\boldsymbol{x}}\bar{\boldsymbol{x}}^T$$

b) data component notation
with $x_{ki}$ the 𝑖-th component of the 𝑘-th sample $\boldsymbol{x}_k$:
$$\hat{\mu}_i = \frac{1}{n}\sum_{k=1}^n x_{ki} \quad \text{with} \quad \hat{\boldsymbol{\mu}} = \begin{pmatrix} \hat{\mu}_1 \\ \vdots \\ \hat{\mu}_d \end{pmatrix}$$
$$\hat{\sigma}_{ij} = \frac{1}{n}\sum_{k=1}^n (x_{ki} - \hat{\mu}_i)(x_{kj} - \hat{\mu}_j) \quad \text{with} \quad \hat{\boldsymbol{\Sigma}} = (\hat{\sigma}_{ij})$$

c) data matrix notation
with $\boldsymbol{X} = \begin{pmatrix} \boldsymbol{x}_1^T \\ \vdots \\ \boldsymbol{x}_n^T \end{pmatrix}$ as a data matrix containing the samples in its rows. Note that this is the statistics convention — in much of the engineering and computer science literature the data matrix is often transposed and samples are stored in the columns. Thus, the formulas below are only correct assuming the statistics convention.
$$\hat{\boldsymbol{\mu}} = \frac{1}{n}\boldsymbol{X}^T \boldsymbol{1}_n$$
Here $\boldsymbol{1}_n$ is a vector of length 𝑛 containing 1 at each component.
$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n}\boldsymbol{X}^T \boldsymbol{X} - \hat{\boldsymbol{\mu}}\hat{\boldsymbol{\mu}}^T$$
To simplify the expression for the estimate of the covariance matrix one often assumes that the data matrix is centered, i.e. that $\hat{\boldsymbol{\mu}} = \boldsymbol{0}$.

Because of the ambiguity in convention (machine learning versus statistics


convention) and the often implicit use of centered data matrices the matrix
notation is often a source of confusion. Hence, using the other two notations is
generally preferable.
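The agreement of the data matrix notation with the other two notations is easy to verify numerically. A sketch (Python/numpy, simulated data, samples stored in rows as per the statistics convention; illustrative only):

```python
# Sketch: MLEs of the multivariate normal in data matrix notation
# (statistics convention: samples in rows), cross-checked against numpy.
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 3
X = rng.normal(size=(n, d)) @ np.diag([1.0, 2.0, 0.5]) + np.array([1.0, -1.0, 0.0])

ones = np.ones(n)
mu_hat = X.T @ ones / n                          # (1/n) X^T 1_n
Sigma_hat = X.T @ X / n - np.outer(mu_hat, mu_hat)

# Cross-check: numpy's cov with bias=True also uses the ML factor 1/n
print(np.allclose(mu_hat, X.mean(axis=0)))
print(np.allclose(Sigma_hat, np.cov(X.T, bias=True)))
```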

4.2 Approximate distribution of maximum likelihood estimates
4.2.1 Quadratic log-likelihood resulting from normal model
Assume we observe a single sample $\boldsymbol{x} \sim N_d(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with known covariance. The corresponding log-likelihood for 𝝁 is
$$l_1(\boldsymbol{\mu}|\boldsymbol{x}) = C - \frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}-\boldsymbol{\mu})$$
where 𝐶 is a constant that does not depend on 𝝁. Note that the log-likelihood is exactly quadratic and the maximum lies at (𝒙, 𝐶).

4.2.2 Quadratic approximation of a log-likelihood function


Now consider the quadratic approximation of the log-likelihood function 𝑙𝑛 (𝜽|𝐷) for 𝜽 around the MLE 𝜽ˆ 𝑀𝐿 .

We assume the underlying model is regular and that ∇𝑙𝑛 (𝜽ˆ 𝑀𝐿 |𝐷) = 0.
The Taylor series approximation of a scalar-valued function 𝑓 (𝒙) around $\boldsymbol{x}_0$ is
$$f(\boldsymbol{x}) = f(\boldsymbol{x}_0) + \nabla f(\boldsymbol{x}_0)^T (\boldsymbol{x}-\boldsymbol{x}_0) + \frac{1}{2}(\boldsymbol{x}-\boldsymbol{x}_0)^T \nabla\nabla^T f(\boldsymbol{x}_0)(\boldsymbol{x}-\boldsymbol{x}_0) + \ldots$$
Applied to the log-likelihood function this yields
$$l_n(\boldsymbol{\theta}|D) \approx l_n(\hat{\boldsymbol{\theta}}_{ML}|D) - \frac{1}{2}(\hat{\boldsymbol{\theta}}_{ML}-\boldsymbol{\theta})^T \boldsymbol{J}_n(\hat{\boldsymbol{\theta}}_{ML})(\hat{\boldsymbol{\theta}}_{ML}-\boldsymbol{\theta})$$

This is a quadratic function with maximum at (𝜽ˆ 𝑀𝐿 , 𝑙𝑛 (𝜽ˆ 𝑀𝐿 |𝐷)). Note the natural
appearance of the observed Fisher information 𝐽𝑛 (𝜽ˆ 𝑀𝐿 ) in the quadratic term.
There is no linear term because of the vanishing gradient at the MLE.
Crucially, we realise that the approximation has the same form as if 𝜽ˆ 𝑀𝐿 was a
sample from a multivariate normal distribution with mean 𝜽 and with covariance
given by the inverse observed Fisher information! Note that this requires a positive
definite observed Fisher information matrix so that 𝐽𝑛 (𝜽ˆ 𝑀𝐿 ) is actually invertible!
Example 4.2. Quadratic approximation of the log-likelihood for a proportion:
From Example 3.1 we have the log-likelihood

$$l_n(p|D) = n\left(\bar{x}\log p + (1-\bar{x})\log(1-p)\right)$$
and the MLE
$$\hat{p}_{ML} = \bar{x}$$
and from Example 3.4 the observed Fisher information
$$J_n(\hat{p}_{ML}) = \frac{n}{\bar{x}(1-\bar{x})}$$
The log-likelihood at the MLE is
$$l_n(\hat{p}_{ML}|D) = n\left(\bar{x}\log\bar{x} + (1-\bar{x})\log(1-\bar{x})\right)$$

This allows us to construct the quadratic approximation of the log-likelihood around the MLE as
$$
\begin{aligned}
l_n(p|D) &\approx l_n(\hat{p}_{ML}|D) - \frac{1}{2} J_n(\hat{p}_{ML})(p - \hat{p}_{ML})^2 \\
&= n\left(\bar{x}\log\bar{x} + (1-\bar{x})\log(1-\bar{x}) - \frac{(p-\bar{x})^2}{2\bar{x}(1-\bar{x})}\right) \\
&= C + \frac{\bar{x}p - \frac{1}{2}p^2}{\bar{x}(1-\bar{x})/n}
\end{aligned}
$$

The constant 𝐶 does not depend on 𝑝; its function is to match the approximate log-likelihood at the MLE with that of the corresponding original log-likelihood. The approximate log-likelihood takes on the form of a normal log-likelihood (Example 3.2) for one observation of $\hat{p}_{ML} = \bar{x}$ from $N\left(p, \frac{\bar{x}(1-\bar{x})}{n}\right)$.

The following figure shows the above log-likelihood function and its quadratic
approximation for example data with 𝑛 = 30 and 𝑥¯ = 0.7:

[Figure: log-likelihood $l_n(p)$ and its quadratic approximation plotted over 𝑝, with the maximum log-likelihood and the MLE marked.]
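A figure of this kind is straightforward to reproduce. The following sketch (Python with numpy/matplotlib — not the original module code) evaluates the exact log-likelihood and its quadratic approximation for 𝑛 = 30 and 𝑥¯ = 0.7:

```python
# Sketch: Bernoulli log-likelihood and its quadratic approximation (n=30, xbar=0.7).
import numpy as np
import matplotlib.pyplot as plt

n, xbar = 30, 0.7
p = np.linspace(0.3, 0.95, 400)

loglik = n * (xbar * np.log(p) + (1 - xbar) * np.log(1 - p))
l_max = n * (xbar * np.log(xbar) + (1 - xbar) * np.log(1 - xbar))
J = n / (xbar * (1 - xbar))                 # observed Fisher information
quad = l_max - 0.5 * J * (p - xbar)**2      # quadratic approximation

plt.plot(p, loglik, label="log-likelihood")
plt.plot(p, quad, "--", label="quadratic approx.")
plt.axvline(xbar, color="grey", lw=0.5)     # MLE
plt.xlabel("p"); plt.ylabel("ln(p)"); plt.legend()
plt.show()
```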

4.2.3 Asymptotic normality of maximum likelihood estimates


Intuitively, it makes sense to associate a large amount of curvature of the log-likelihood at the MLE with low variance of the MLE (and conversely, a low amount of curvature with high variance).
From the above we see that
• normality implies a quadratic log-likelihood,
• conversely, taking a quadratic approximation of the log-likelihood implies approximate normality, and
• in the quadratic approximation the inverse observed Fisher information plays the role of the covariance of the MLE.

This suggests the following theorem: Asymptotically, the MLE is normally distributed around the true parameter 𝜽 with covariance equal to the inverse of the observed Fisher information:
$$\hat{\boldsymbol{\theta}}_{ML} \overset{a}{\sim} N_d\left(\boldsymbol{\theta}, \boldsymbol{J}_n(\hat{\boldsymbol{\theta}}_{ML})^{-1}\right)$$

This theorem about the distributional properties of MLEs greatly enhances the usefulness of the method of maximum likelihood. It implies that in regular settings maximum likelihood is not just a method for obtaining point estimates but also provides estimates of their uncertainty.

However, we need to clarify what “asymptotic” actually means in the context of


the above theorem:

1) Primarily, it means to have sufficient sample size so that the log-likelihood


𝑙𝑛 (𝜽) is sufficiently well approximated by a quadratic function around
𝜽ˆ 𝑀𝐿 . The better the local quadratic approximation the better the normal
approximation!

2) In a regular model with positive definite observed Fisher information matrix this is guaranteed for large sample size 𝑛 → ∞ (thanks to the central limit theorem).

3) However, 𝑛 going to infinity is in fact not always required for the normal
approximation to hold! Depending on the particular model a good local fit
to a quadratic log-likelihood may be available also for finite 𝑛. As a trivial
example, for the normal log-likelihood it is valid for any 𝑛.

4) On the other hand, in non-regular models (with non-differentiable log-likelihood at the MLE and/or a singular Fisher information matrix) no amount of data, not even 𝑛 → ∞, will make the quadratic approximation work.

Remarks:

• The technical details of the above considerations are worked out in the
theory of locally asymptotically normal (LAN) models pioneered in 1960
by Lucien LeCam (1924–2000).

• There are also methods to obtain higher-order (higher than quadratic and
thus non-normal) asymptotic approximations. These relate to so-called
saddle point approximations.

4.2.4 Asymptotic optimal efficiency


Assume now that 𝜽ˆ is an arbitrary and unbiased estimator for 𝜽 and the
underlying data generating model is regular with density 𝑓 (𝒙|𝜽).
H. Cramér (1893–1985), C. R. Rao (1920–) and others demonstrated in 1945 the so-called information inequality,
$$\text{Var}(\hat{\boldsymbol{\theta}}) \geq \frac{1}{n}\boldsymbol{I}^{\text{Fisher}}(\boldsymbol{\theta})^{-1}$$
which puts a lower bound on the variance of an estimator for 𝜽. (Note that for 𝑑 > 1 this is a matrix inequality, meaning that the difference matrix is positive semidefinite.)
For large sample size with 𝑛 → ∞ and $\hat{\boldsymbol{\theta}}_{ML} \to \boldsymbol{\theta}$ the observed Fisher information becomes $\boldsymbol{J}_n(\hat{\boldsymbol{\theta}}_{ML}) \to n\boldsymbol{I}^{\text{Fisher}}(\boldsymbol{\theta})$ and therefore we can write the asymptotic distribution of $\hat{\boldsymbol{\theta}}_{ML}$ as
$$\hat{\boldsymbol{\theta}}_{ML} \overset{a}{\sim} N_d\left(\boldsymbol{\theta}, \frac{1}{n}\boldsymbol{I}^{\text{Fisher}}(\boldsymbol{\theta})^{-1}\right)$$

This means that for large 𝑛 in regular models 𝜽ˆ 𝑀𝐿 achieves the lowest variance
possible according to the Cramér-Rao information inequality. In other words, for
large sample size maximum likelihood is optimally efficient and thus the best
available estimator will in fact be the MLE!
However, as we will see later this does not hold for small sample size where it
is indeed possible (and necessary) to improve over the MLE (e.g. via Bayesian
estimation or regularisation).

4.3 Quantifying the uncertainty of maximum likelihood estimates
4.3.1 Estimating the variance of MLEs
In the previous section we saw that MLEs are asymptotically normally distributed,
with the inverse Fisher information (both expected and observed) linked to the
asymptotic variance.
This leads to the question whether to use the observed Fisher information
𝐽𝑛 (𝜽ˆ 𝑀𝐿 ) or the expected Fisher information at the MLE 𝑛𝑰 Fisher (𝜽ˆ 𝑀𝐿 ) to estimate
the variance of the MLE?

• Clearly, for 𝑛 → ∞ both can be used interchangeably.


• However, they can be very different for finite 𝑛 in particular for models
that are not exponential families.
• Also normality may occur well before 𝑛 goes to ∞.
Therefore one needs to choose between the two, considering also that
• the expected Fisher information at the MLE is the average curvature at
the MLE, whereas the observed Fisher information is the actual observed
curvature, and
• the observed Fisher information naturally occurs in the quadratic approxi-
mation of the log-likelihood.
All in all, the observed Fisher information as estimator of the variance is more
appropriate as it is based on the actual observed data and also works for large 𝑛
(in which case it yields the same result as using expected Fisher information):

$$\widehat{\text{Var}}(\hat{\boldsymbol{\theta}}_{ML}) = \boldsymbol{J}_n(\hat{\boldsymbol{\theta}}_{ML})^{-1}$$
and its square root as the estimate of the standard deviation
$$\widehat{\text{SD}}(\hat{\boldsymbol{\theta}}_{ML}) = \boldsymbol{J}_n(\hat{\boldsymbol{\theta}}_{ML})^{-1/2}$$
Note that in the above we use matrix inversion and the (inverse) matrix square root.
The reasons for preferring observed Fisher information are made mathematically
precise in a classic paper by Efron and Hinkley (1978)1 .
Example 4.3. Estimated variance and distribution of the MLE of a proportion:
From Examples 3.1 and 3.4 we know the MLE

$$\hat{p}_{ML} = \bar{x} = \frac{k}{n}$$
and the corresponding observed Fisher information
$$J_n(\hat{p}_{ML}) = \frac{n}{\hat{p}_{ML}(1-\hat{p}_{ML})}$$
The estimated variance of the MLE is therefore
$$\widehat{\text{Var}}(\hat{p}_{ML}) = \frac{\hat{p}_{ML}(1-\hat{p}_{ML})}{n}$$
and the corresponding asymptotic normal distribution is
$$\hat{p}_{ML} \overset{a}{\sim} N\left(p, \frac{\hat{p}_{ML}(1-\hat{p}_{ML})}{n}\right)$$
1Efron, B., and D. V. Hinkley. 1978. Assessing the accuracy of the maximum likelihood estimator: observed versus expected Fisher information. Biometrika 65:457–482. https://doi.org/10.1093/biomet/65.3.457

Example 4.4. Estimated variance and distribution of the MLE of the mean
parameter for the normal distribution with known variance:
From Examples 3.2 and 3.5 we know that

$$\hat{\mu}_{ML} = \bar{x}$$
and that the corresponding observed Fisher information at $\hat{\mu}_{ML}$ is
$$J_n(\hat{\mu}_{ML}) = \frac{n}{\sigma^2}$$
The estimated variance of the MLE is therefore
$$\widehat{\text{Var}}(\hat{\mu}_{ML}) = \frac{\sigma^2}{n}$$
and the corresponding asymptotic normal distribution is
$$\hat{\mu}_{ML} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
Note that in this case the distribution is not asymptotic but exact, i.e. valid also for small 𝑛 (as long as the data $x_i$ are actually from $N(\mu, \sigma^2)$!).

4.3.2 Wald statistic


Centering the MLE $\hat{\boldsymbol{\theta}}_{ML}$ at $\boldsymbol{\theta}_0$ followed by standardising with $\widehat{\text{SD}}(\hat{\boldsymbol{\theta}}_{ML})$ yields the Wald statistic (named after Abraham Wald, 1902–1950):
$$\boldsymbol{t}(\boldsymbol{\theta}_0) = \widehat{\text{SD}}(\hat{\boldsymbol{\theta}}_{ML})^{-1}(\hat{\boldsymbol{\theta}}_{ML} - \boldsymbol{\theta}_0) = \boldsymbol{J}_n(\hat{\boldsymbol{\theta}}_{ML})^{1/2}(\hat{\boldsymbol{\theta}}_{ML} - \boldsymbol{\theta}_0)$$
The squared Wald statistic is a scalar defined as
$$t(\boldsymbol{\theta}_0)^2 = \boldsymbol{t}(\boldsymbol{\theta}_0)^T \boldsymbol{t}(\boldsymbol{\theta}_0) = (\hat{\boldsymbol{\theta}}_{ML} - \boldsymbol{\theta}_0)^T \boldsymbol{J}_n(\hat{\boldsymbol{\theta}}_{ML})(\hat{\boldsymbol{\theta}}_{ML} - \boldsymbol{\theta}_0)$$
Note that in the literature both $\boldsymbol{t}(\boldsymbol{\theta}_0)$ and $t(\boldsymbol{\theta}_0)^2$ are commonly referred to as Wald statistics. In this text we use the qualifier “squared” when we refer to the latter.
We now assume that the true underlying parameter is $\boldsymbol{\theta}_0$. Since the MLE is asymptotically normal, the Wald statistic is asymptotically standard normally distributed:
$$\boldsymbol{t}(\boldsymbol{\theta}_0) \overset{a}{\sim} N_d(\boldsymbol{0}_d, \boldsymbol{I}_d) \quad \text{for vector } \boldsymbol{\theta}$$
$$t(\theta_0) \overset{a}{\sim} N(0, 1) \quad \text{for scalar } \theta$$

Correspondingly, the squared Wald statistic is chi-squared distributed:
$$t(\boldsymbol{\theta}_0)^2 \overset{a}{\sim} \chi^2_d \quad \text{for vector } \boldsymbol{\theta}$$
$$t(\theta_0)^2 \overset{a}{\sim} \chi^2_1 \quad \text{for scalar } \theta$$
The degree of freedom of the chi-squared distribution is the dimension 𝑑 of the parameter vector 𝜽.
Example 4.5. Wald statistic for a proportion:
We continue from Example 4.3. With $\hat{p}_{ML} = \bar{x}$ and $\widehat{\text{Var}}(\hat{p}_{ML}) = \frac{\hat{p}_{ML}(1-\hat{p}_{ML})}{n}$ and thus $\widehat{\text{SD}}(\hat{p}_{ML}) = \sqrt{\frac{\hat{p}_{ML}(1-\hat{p}_{ML})}{n}}$ we get as Wald statistic:
$$t(p_0) = \frac{\bar{x} - p_0}{\sqrt{\bar{x}(1-\bar{x})/n}} \overset{a}{\sim} N(0, 1)$$
The squared Wald statistic is:
$$t(p_0)^2 = n\,\frac{(\bar{x} - p_0)^2}{\bar{x}(1-\bar{x})} \overset{a}{\sim} \chi^2_1$$

Example 4.6. Wald statistic for the mean parameter of a normal distribution with known variance:
We continue from Example 4.4. With $\hat{\mu}_{ML} = \bar{x}$ and $\widehat{\text{Var}}(\hat{\mu}_{ML}) = \frac{\sigma^2}{n}$ and thus $\widehat{\text{SD}}(\hat{\mu}_{ML}) = \frac{\sigma}{\sqrt{n}}$ we get as Wald statistic:
$$t(\mu_0) = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1)$$
Note this is the one-sample 𝑡-statistic with given 𝜎. The squared Wald statistic is:
$$t(\mu_0)^2 = \frac{(\bar{x} - \mu_0)^2}{\sigma^2/n} \sim \chi^2_1$$
Again, in this instance this is the exact distribution, not just the asymptotic one. Using the Wald statistic or the squared Wald statistic we can test whether a particular 𝜇0 can be rejected as the underlying true parameter, and we can also construct corresponding confidence intervals.

4.3.3 Normal confidence intervals using the Wald statistic


The asymptotic normality of MLEs derived from regular models enables us to
construct a corresponding normal confidence interval (CI):
For example, to construct the asymptotic normal CI for the MLE of a scalar parameter 𝜃 we use the MLE $\hat{\theta}_{ML}$ as estimate of the mean and its standard deviation $\widehat{\text{SD}}(\hat{\theta}_{ML})$ computed from the observed Fisher information:
$$\text{CI} = [\hat{\theta}_{ML} \pm c_{\text{normal}}\,\widehat{\text{SD}}(\hat{\theta}_{ML})]$$
Here $c_{\text{normal}}$ is a critical value for the symmetric standard-normal confidence interval, chosen to achieve the desired nominal coverage 𝜅. The critical values are computed using the inverse standard normal distribution function via $c_{\text{normal}} = \Phi^{-1}\left(\frac{1+\kappa}{2}\right)$ (cf. the refresher section in the Appendix).

coverage 𝜅    critical value $c_{\text{normal}}$
0.9           1.64
0.95          1.96
0.99          2.58

For example, for a CI with 95% coverage one uses the factor 1.96 so that
$$\text{CI} = [\hat{\theta}_{ML} \pm 1.96\,\widehat{\text{SD}}(\hat{\theta}_{ML})]$$
The normal CI can be expressed using the Wald statistic as follows:
$$\text{CI} = \{\theta_0 : |t(\theta_0)| < c_{\text{normal}}\}$$
Similarly, it can also be expressed using the squared Wald statistic:
$$\text{CI} = \{\theta_0 : t(\theta_0)^2 < c_{\text{chisq}}\}$$
Note that this form facilitates the construction of normal confidence intervals for a parameter vector 𝜽 0 .
The following table lists the critical values resulting from the chi-squared distribution with degree of freedom 𝑚 = 1 for the three most common choices of coverage 𝜅 for a normal CI of a univariate parameter:

coverage 𝜅    critical value $c_{\text{chisq}}$ (𝑚 = 1)
0.9           2.71
0.95          3.84
0.99          6.63

Example 4.7. Asymptotic normal confidence interval for a proportion:
We continue from Examples 4.3 and 4.5. Assume we observe 𝑛 = 30 measurements with average $\bar{x} = 0.7$. Then $\hat{p}_{ML} = \bar{x} = 0.7$ and $\widehat{\text{SD}}(\hat{p}_{ML}) = \sqrt{\frac{\bar{x}(1-\bar{x})}{n}} \approx 0.084$.
The symmetric asymptotic normal CI for 𝑝 with 95% coverage is given by $\hat{p}_{ML} \pm 1.96\,\widehat{\text{SD}}(\hat{p}_{ML})$, which for the present data results in the interval [0.536, 0.864].
Example 4.8. Normal confidence interval for the mean:
We continue from Examples 4.4 and 4.6. Assume that we observe 𝑛 = 25 measurements with average $\bar{x} = 10$ from a normal with unknown mean and known variance $\sigma^2 = 4$.
Then $\hat{\mu}_{ML} = \bar{x} = 10$ and $\widehat{\text{SD}}(\hat{\mu}_{ML}) = \sqrt{\frac{\sigma^2}{n}} = \frac{2}{5}$.
The symmetric normal CI for 𝜇 with 95% coverage is given by $\hat{\mu}_{ML} \pm 1.96\,\widehat{\text{SD}}(\hat{\mu}_{ML})$, which for the present data results in the interval [9.216, 10.784].
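Both confidence intervals require only a few lines of arithmetic. A sketch reproducing the numbers of Examples 4.7 and 4.8 (Python/numpy; illustrative only):

```python
# Sketch: asymptotic normal (Wald) confidence intervals, Examples 4.7 and 4.8.
import numpy as np

c = 1.96  # critical value for 95% coverage

# Example 4.7: proportion, n = 30, xbar = 0.7
n, xbar = 30, 0.7
sd_p = np.sqrt(xbar * (1 - xbar) / n)
print(xbar - c * sd_p, xbar + c * sd_p)    # approx [0.536, 0.864]

# Example 4.8: normal mean, n = 25, xbar = 10, known sigma^2 = 4
n, xbar, sigma2 = 25, 10.0, 4.0
sd_mu = np.sqrt(sigma2 / n)
print(xbar - c * sd_mu, xbar + c * sd_mu)  # [9.216, 10.784]
```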

4.3.4 Normal tests using the Wald statistic


Finally, recall the duality between confidence intervals and statistical tests. Specifically, a confidence interval with coverage 𝜅 can also be used for testing as follows:
• For every 𝜃0 inside the CI the data do not allow us to reject the hypothesis that 𝜃0 is the true parameter at significance level 1 − 𝜅.
• Conversely, all values 𝜃0 outside the CI can be rejected as the true parameter at significance level 1 − 𝜅.
Hence, in order to test whether 𝜽 0 is the true underlying parameter value we
can compute the corresponding (squared) Wald statistic, find the desired critical
value and then decide on rejection.
Example 4.9. Asymptotic normal test for a proportion:
We continue from Example 4.7.
We now consider two possible values (𝑝0 = 0.5 and 𝑝0 = 0.8) as potentially true underlying proportion.
The value 𝑝0 = 0.8 lies inside the 95% confidence interval [0.536, 0.864]. This implies we cannot reject the hypothesis that this is the true underlying parameter at the 5% significance level. In contrast, 𝑝0 = 0.5 is outside the confidence interval, so we can indeed reject this value. In other words, data plus model exclude this value as statistically implausible.
This can be verified more directly by computing the corresponding squared Wald statistics (see Example 4.5) and comparing them with the relevant critical value (3.84 from the chi-squared distribution for 5% significance level):
• $t(0.5)^2 = \frac{(0.7-0.5)^2}{0.084^2} = 5.71 > 3.84$, hence 𝑝0 = 0.5 can be rejected.
• $t(0.8)^2 = \frac{(0.7-0.8)^2}{0.084^2} = 1.43 < 3.84$, hence 𝑝0 = 0.8 cannot be rejected.
Note that the squared Wald statistic at the boundaries of the normal confidence
interval is equal to the critical value.
Example 4.10. Normal confidence interval and test for the mean:
We continue from Example 4.8.
We now consider two possible values (𝜇0 = 9.5 and 𝜇0 = 11) as potentially true underlying mean parameter.
The value 𝜇0 = 9.5 lies inside the 95% confidence interval [9.216, 10.784]. This implies we cannot reject the hypothesis that this is the true underlying parameter at the 5% significance level. In contrast, 𝜇0 = 11 is outside the confidence interval, so we can indeed reject this value. In other words, data plus model exclude this value as statistically implausible.
This can be verified more directly by computing the corresponding squared Wald statistics (see Example 4.6) and comparing them with the relevant critical value:
• $t(9.5)^2 = \frac{(10-9.5)^2}{4/25} = 1.56 < 3.84$, hence 𝜇0 = 9.5 cannot be rejected.
• $t(11)^2 = \frac{(10-11)^2}{4/25} = 6.25 > 3.84$, hence 𝜇0 = 11 can be rejected.

The squared Wald statistic at the boundaries of the confidence interval equals
the critical value.
Note that this is the standard one-sample test of the mean, and that it is exact,
not an approximation.
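The test decisions in Examples 4.9 and 4.10 amount to comparing squared Wald statistics against the chi-squared critical value. A sketch (Python/scipy; illustrative only):

```python
# Sketch: squared Wald statistics for Examples 4.9 and 4.10
# (reject when t^2 exceeds the 5% critical value 3.84).
from scipy.stats import chi2

crit = chi2.ppf(0.95, df=1)   # = 3.84

def wald_squared(est, theta0, var_est):
    return (est - theta0)**2 / var_est

# proportion (Example 4.9): n = 30, xbar = 0.7
v = 0.7 * 0.3 / 30
print(wald_squared(0.7, 0.5, v), wald_squared(0.7, 0.5, v) > crit)  # 5.71, True
print(wald_squared(0.7, 0.8, v), wald_squared(0.7, 0.8, v) > crit)  # 1.43, False

# normal mean (Example 4.10): n = 25, sigma^2 = 4
v = 4 / 25
print(wald_squared(10, 9.5, v), wald_squared(10, 9.5, v) > crit)    # 1.56, False
print(wald_squared(10, 11, v), wald_squared(10, 11, v) > crit)      # 6.25, True
```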

4.4 Example of a non-regular model


Not all models allow a quadratic approximation of the log-likelihood function
around the MLE. This is the case when the log-likelihood function is not
differentiable at the MLE. These models are called non-regular and for those
models the normal approximation is not available.
Example 4.11. Uniform distribution with upper bound 𝜃:

𝑥1 , . . . , 𝑥 𝑛 ∼ 𝑈(0, 𝜃)

With 𝑥[𝑖] we denote the ordered observations with 0 ≤ 𝑥[1] < 𝑥 [2] < . . . < 𝑥[𝑛] ≤ 𝜃
and 𝑥[𝑛] = max(𝑥1 , . . . , 𝑥 𝑛 ).

We would like to obtain both the maximum likelihood estimator 𝜃ˆ 𝑀𝐿 and its
distribution.
The probability density function of 𝑈(0, 𝜃) is
$$f(x|\theta) = \begin{cases} \frac{1}{\theta} & \text{if } x \in [0, \theta] \\ 0 & \text{otherwise} \end{cases}$$
and on the log-scale
$$\log f(x|\theta) = \begin{cases} -\log\theta & \text{if } x \in [0, \theta] \\ -\infty & \text{otherwise} \end{cases}$$

Since all observed data 𝐷 = {𝑥 1 , . . . , 𝑥 𝑛 } lie in the interval [0, 𝜃] we get as log-likelihood function
$$l_n(\theta|D) = \begin{cases} -n\log\theta & \text{for } x_{[n]} \leq \theta \\ -\infty & \text{otherwise} \end{cases}$$

Obtaining the MLE of 𝜃 is straightforward: −𝑛 log 𝜃 is monotonically decreasing


with 𝜃 and 𝜃 ≥ 𝑥[𝑛] hence the log-likelihood function has a maximum at
𝜃ˆ 𝑀𝐿 = 𝑥[𝑛] .
However, there is a discontinuity in 𝑙𝑛 (𝜃|𝐷) at 𝑥 [𝑛] and therefore 𝑙𝑛 (𝜃|𝐷) is not
differentiable at 𝜃ˆ 𝑀𝐿 . Thus, there is no quadratic approximation around 𝜃ˆ 𝑀𝐿
and the observed Fisher information cannot be computed. Hence, the normal
approximation for the distribution of 𝜃ˆ 𝑀𝐿 is not valid regardless of sample size,
i.e. not even asymptotically for 𝑛 → ∞.

Nonetheless, we can in fact still obtain the sampling distribution of 𝜃ˆ 𝑀𝐿 = 𝑥 [𝑛] .


However, not via asymptotic arguments but instead by understanding that 𝑥[𝑛]
is an order statistic (see https://en.wikipedia.org/wiki/Order_statistic ) with
the following properties:
$$x_{[n]} \sim \theta\,\text{Beta}(n, 1) \quad \text{(“$n$-th order statistic”)}$$
$$\text{E}(x_{[n]}) = \frac{n}{n+1}\theta$$
$$\text{Var}(x_{[n]}) = \frac{n}{(n+1)^2(n+2)}\theta^2 \approx \frac{\theta^2}{n^2}$$
Note that the variance decreases with $\frac{1}{n^2}$, which is much faster than the usual $\frac{1}{n}$ of an “efficient” estimator. Correspondingly, $\hat{\theta}_{ML}$ is a so-called “super-efficient” estimator.
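The unusual $1/n^2$ rate can be confirmed by simulation. A sketch (Python/numpy; the settings are illustrative):

```python
# Sketch: sampling distribution of the MLE max(x_i) for U(0, theta)
# and its "super-efficient" variance decay ~ theta^2 / n^2.
import numpy as np

rng = np.random.default_rng(7)
theta, reps = 2.0, 100_000

for n in (10, 100, 1000):
    x_max = rng.uniform(0, theta, size=(reps, n)).max(axis=1)
    print(n,
          x_max.mean(), n * theta / (n + 1),                    # E = n/(n+1) theta
          x_max.var(), n * theta**2 / ((n + 1)**2 * (n + 2)))   # Var ~ theta^2/n^2
```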
Chapter 5
Likelihood-based confidence interval and likelihood ratio

5.1 Likelihood-based confidence intervals and Wilks statistic
5.1.1 General idea and definition of Wilks statistic
Instead of relying on the normal / quadratic approximation, we can also use the log-likelihood directly to find so-called likelihood confidence intervals:
Idea: find all 𝜽 0 that have a log-likelihood that is almost as good as 𝑙𝑛 (𝜽ˆ 𝑀𝐿 |𝐷).

CI = {𝜽 0 : 𝑙𝑛 (𝜽ˆ 𝑀𝐿 |𝐷) − 𝑙𝑛 (𝜽 0 |𝐷) ≤ Δ}

Here Δ is our tolerated deviation from the maximum log-likelihood. We will see
below how to determine a suitable Δ.
The above leads naturally to the Wilks log-likelihood ratio statistic $W(\boldsymbol{\theta}_0)$, defined as:
$$W(\boldsymbol{\theta}_0) = 2\log\left(\frac{L(\hat{\boldsymbol{\theta}}_{ML}|D)}{L(\boldsymbol{\theta}_0|D)}\right) = 2\left(l_n(\hat{\boldsymbol{\theta}}_{ML}|D) - l_n(\boldsymbol{\theta}_0|D)\right)$$
With its help we can write the likelihood CI as follows:
$$\text{CI} = \{\boldsymbol{\theta}_0 : W(\boldsymbol{\theta}_0) \leq 2\Delta\}$$

The Wilks statistic is named after Samuel S. Wilks (1906–1964).
Advantages of using a likelihood-based CI:
• not restricted to be symmetric
• enables easy construction of multivariate CIs for a parameter vector, even in non-normal cases
• contains the normal CI as a special case
Question: how to choose Δ, i.e. how to calibrate the likelihood interval? Essentially, by comparing with a normal CI!
Example 5.1. Wilks statistic for the proportion:
The log-likelihood for the parameter 𝑝 is (cf. Example 3.1)
$$l_n(p|D) = n\left(\bar{x}\log p + (1-\bar{x})\log(1-p)\right)$$
Hence the Wilks statistic is
$$W(p_0) = 2\left(l_n(\hat{p}_{ML}|D) - l_n(p_0|D)\right) = 2n\left(\bar{x}\log\left(\frac{\bar{x}}{p_0}\right) + (1-\bar{x})\log\left(\frac{1-\bar{x}}{1-p_0}\right)\right)$$
Comparing with Example 2.8 we see that in this case the Wilks statistic is essentially (apart from a scale factor 2𝑛) the KL divergence between two Bernoulli distributions:
$$W(p_0) = 2n D_{\text{KL}}(\text{Ber}(\hat{p}_{ML}), \text{Ber}(p_0))$$

Example 5.2. Wilks statistic for the mean parameter of a normal model:
The Wilks statistic is
$$W(\mu_0) = \frac{(\bar{x} - \mu_0)^2}{\sigma^2/n}$$
See Worksheet L3 for a derivation of the Wilks statistic directly from the log-likelihood function.
Note this is the same as the squared Wald statistic discussed in Example 4.6.
Comparing with Example 2.10 we see that in this case the Wilks statistic is essentially (apart from a scale factor 2𝑛) the KL divergence between two normal distributions with different means and variance equal to $\sigma^2$:
$$W(\mu_0) = 2n D_{\text{KL}}(N(\hat{\mu}_{ML}, \sigma^2), N(\mu_0, \sigma^2))$$

5.1.2 Quadratic approximation of Wilks statistic and squared Wald statistic
Recall the quadratic approximation of the log-likelihood function $l_n(\boldsymbol{\theta}_0|D)$ (= second-order Taylor series around the MLE $\hat{\boldsymbol{\theta}}_{ML}$):
$$l_n(\boldsymbol{\theta}_0|D) \approx l_n(\hat{\boldsymbol{\theta}}_{ML}|D) - \frac{1}{2}(\boldsymbol{\theta}_0 - \hat{\boldsymbol{\theta}}_{ML})^T \boldsymbol{J}_n(\hat{\boldsymbol{\theta}}_{ML})(\boldsymbol{\theta}_0 - \hat{\boldsymbol{\theta}}_{ML})$$
With this we can then approximate the Wilks statistic:
$$W(\boldsymbol{\theta}_0) = 2\left(l_n(\hat{\boldsymbol{\theta}}_{ML}|D) - l_n(\boldsymbol{\theta}_0|D)\right) \approx (\boldsymbol{\theta}_0 - \hat{\boldsymbol{\theta}}_{ML})^T \boldsymbol{J}_n(\hat{\boldsymbol{\theta}}_{ML})(\boldsymbol{\theta}_0 - \hat{\boldsymbol{\theta}}_{ML}) = t(\boldsymbol{\theta}_0)^2$$
Thus the quadratic approximation of the Wilks statistic yields the squared Wald statistic! Conversely, the Wilks statistic can be understood as a generalisation of the squared Wald statistic.
Example 5.3. Quadratic approximation of the Wilks statistic for a proportion (continued from Example 5.1):
A Taylor series of second order (for $p_0$ around $\bar{x}$) yields
$$\log\left(\frac{\bar{x}}{p_0}\right) \approx -\frac{p_0 - \bar{x}}{\bar{x}} + \frac{(p_0 - \bar{x})^2}{2\bar{x}^2}$$
and
$$\log\left(\frac{1-\bar{x}}{1-p_0}\right) \approx \frac{p_0 - \bar{x}}{1-\bar{x}} + \frac{(p_0 - \bar{x})^2}{2(1-\bar{x})^2}$$
With this we can approximate the Wilks statistic of the proportion as
$$
\begin{aligned}
W(p_0) &\approx 2n\left(-(p_0 - \bar{x}) + \frac{(p_0 - \bar{x})^2}{2\bar{x}} + (p_0 - \bar{x}) + \frac{(p_0 - \bar{x})^2}{2(1-\bar{x})}\right) \\
&= n\left(\frac{(p_0 - \bar{x})^2}{\bar{x}} + \frac{(p_0 - \bar{x})^2}{1-\bar{x}}\right) \\
&= n\,\frac{(p_0 - \bar{x})^2}{\bar{x}(1-\bar{x})} \\
&= t(p_0)^2 .
\end{aligned}
$$

This verifies that the quadratic approximation of the Wilks statistic leads back to
the squared Wald statistic of Example 4.5.
Example 5.4. Quadratic approximation of the Wilks statistic for the mean
parameter of a normal model (continued from Example 5.2):
The normal log-likelihood is already quadratic in the mean parameter (cf. Exam-
ple 3.2). Correspondingly, the Wilks statistic is quadratic in the mean parameter
as well. Hence in this particular case the quadratic “approximation” is in fact
exact and the Wilks statistic and the squared Wald statistic are identical!
Correspondingly, confidence intervals and tests based on the Wilks statistic are
identical to those obtained using the Wald statistic.

5.1.3 Distribution of the Wilks statistic


The connection with the squared Wald statistic implies that both have asymptotically the same distribution.
Hence, under $\boldsymbol{\theta}_0$ the Wilks statistic is distributed asymptotically as
$$W(\boldsymbol{\theta}_0) \overset{a}{\sim} \chi^2_d$$
where 𝑑 is the number of parameters in 𝜽, i.e. the dimension of the model.
For scalar 𝜃 (i.e. a single parameter and 𝑑 = 1) this becomes
$$W(\theta_0) \overset{a}{\sim} \chi^2_1$$
This fact is known as Wilks’ theorem.

5.1.4 Cutoff values for the likelihood CI
The asymptotic distribution for 𝑊 is useful to choose a suitable Δ for the likelihood CI — note that $2\Delta = c_{\text{chisq}}$ where $c_{\text{chisq}}$ is the critical value for a specified coverage 𝜅. This yields the following table for a scalar parameter:

coverage 𝜅    Δ = $c_{\text{chisq}}/2$ (𝑚 = 1)
0.9           1.35
0.95          1.92
0.99          3.32
Example 5.5. Likelihood confidence interval for a proportion:
We continue from Example 5.1, and as in Example 4.7 we assume we have data with 𝑛 = 30 and 𝑥¯ = 0.7.
This yields (via numerical root finding) as the 95% likelihood confidence interval the interval [0.524, 0.843]. It is similar but not identical to the corresponding asymptotic normal interval [0.536, 0.864] obtained in Example 4.7.
The following figure illustrates the relationship between the normal CI and the likelihood CI and also shows the role of the quadratic approximation (see also Example 4.2). Note that:
• the normal CI is symmetric around the MLE whereas the likelihood CI is not symmetric
• the normal CI is identical to the likelihood CI when using the quadratic approximation!

[Figure: log-likelihood $l_n(p)$ with its quadratic approximation, showing the maximum log-likelihood, the cutoff at maximum − 1.92, the MLE, and the resulting normal CI and likelihood CI.]
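The likelihood CI of Example 5.5 is obtained by solving $W(p_0) = 3.84$ numerically on either side of the MLE. A sketch using a standard root finder (Python/scipy; illustrative only, not the original code):

```python
# Sketch: 95% likelihood CI for a proportion via root finding on the Wilks statistic.
import numpy as np
from scipy.optimize import brentq

n, xbar, crit = 30, 0.7, 3.84   # crit = chi-squared 95% quantile, df = 1

def wilks(p0):
    return 2 * n * (xbar * np.log(xbar / p0)
                    + (1 - xbar) * np.log((1 - xbar) / (1 - p0)))

# W(p0) - crit changes sign between each bracket endpoint and the MLE
lower = brentq(lambda p: wilks(p) - crit, 1e-6, xbar)
upper = brentq(lambda p: wilks(p) - crit, xbar, 1 - 1e-6)
print(lower, upper)   # approx [0.524, 0.843]
```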

5.1.5 Likelihood ratio test (LRT) using Wilks statistic


As in the normal case (with Wald statistic and normal CIs) one can also construct
a test using the Wilks statistic:

𝐻0 : 𝜽 = 𝜽 0   (true model is 𝜽 0 ; null hypothesis)
𝐻1 : 𝜽 ≠ 𝜽 0   (true model is not 𝜽 0 ; alternative hypothesis)

As test statistic we use the Wilks log likelihood ratio 𝑊(𝜽0 ). Extreme values of
this test statistic imply evidence against 𝐻0 .
Note that the null model is “simple” (= a single parameter value) whereas the
alternative model is “composite” (= a set of parameter values).

Remarks:
• The composite alternative 𝐻1 is represented by a single point (the MLE).
• Reject 𝐻0 for large values of 𝑊(𝜽 0 )
• Under 𝐻0 and for large 𝑛 the statistic 𝑊(𝜽 0 ) is chi-squared distributed, i.e. $W(\boldsymbol{\theta}_0) \overset{a}{\sim} \chi^2_d$. This allows us to compute critical values (i.e. thresholds to declare rejection at a given significance level) and also 𝑝-values corresponding to the observed test statistics.
• Models outside the CI are rejected
• Models inside the CI cannot be rejected, i.e. they can’t be statistically
distinguished from the best alternative model.
A statistic equivalent to $W(\boldsymbol{\theta}_0)$ is the likelihood ratio
$$\Lambda(\boldsymbol{\theta}_0) = \frac{L(\boldsymbol{\theta}_0|D)}{L(\hat{\boldsymbol{\theta}}_{ML}|D)}$$
The two statistics can be transformed into each other by $W(\boldsymbol{\theta}_0) = -2\log\Lambda(\boldsymbol{\theta}_0)$ and $\Lambda(\boldsymbol{\theta}_0) = e^{-W(\boldsymbol{\theta}_0)/2}$. We reject 𝐻0 for small values of Λ.
It can be shown that the likelihood ratio test to compare two simple models is optimal in the sense that for any given specified type I error (= probability of wrongly rejecting 𝐻0 , i.e. the significance level) it will maximise the power (= 1 − type II error, the probability of correctly accepting 𝐻1 ). This is known as the Neyman-Pearson theorem.
Example 5.6. Likelihood test for a proportion:
We continue from Example 5.5 with the 95% likelihood confidence interval [0.524, 0.843].
The value 𝑝0 = 0.5 is outside the CI and hence can be rejected, whereas 𝑝0 = 0.8 is inside the CI and hence cannot be rejected at the 5% significance level.
The Wilks statistic for 𝑝0 = 0.5 and 𝑝0 = 0.8 takes on the following values:
• 𝑊(0.5) = 4.94 > 3.84, hence 𝑝0 = 0.5 can be rejected.
• 𝑊(0.8) = 1.69 < 3.84, hence 𝑝0 = 0.8 cannot be rejected.
Note that the Wilks statistic at the boundaries of the likelihood confidence interval is equal to the critical value (3.84, corresponding to 5% significance level for a chi-squared distribution with 1 degree of freedom).

5.1.6 Origin of likelihood ratio statistic


The likelihood ratio statistic is asymptotically linked to differences in the KL
divergences of the two compared models with the underlying true model.
Assume that 𝐹 is the true (and unknown) data generating model and that 𝐺𝜽
is a family of models and we would like to compare two candidate models 𝐺 𝐴
and 𝐺 𝐵 corresponding to parameters 𝜽 𝐴 and 𝜽 𝐵 on the basis of observed data

𝐷 = {𝑥1 , . . . , 𝑥 𝑛 }. The KL divergences $D_{\text{KL}}(F, G_A)$ and $D_{\text{KL}}(F, G_B)$ indicate how closely each of the models $G_A$ and $G_B$ fits the true 𝐹. The difference of the two divergences is a way to measure the relative fit of the two models, and can be computed as
$$D_{\text{KL}}(F, G_B) - D_{\text{KL}}(F, G_A) = \text{E}_F \log\frac{g(x|\boldsymbol{\theta}_A)}{g(x|\boldsymbol{\theta}_B)}$$
Replacing 𝐹 by the empirical distribution $\hat{F}_n$ leads to the large sample approximation
$$2n\left(D_{\text{KL}}(F, G_B) - D_{\text{KL}}(F, G_A)\right) \approx 2\left(l_n(\boldsymbol{\theta}_A|D) - l_n(\boldsymbol{\theta}_B|D)\right)$$
Hence, the difference in the log-likelihoods provides an estimate of the difference in the KL divergence of the two models involved.
The Wilks log-likelihood ratio statistic
$$W(\boldsymbol{\theta}_0) = 2\left(l_n(\hat{\boldsymbol{\theta}}_{ML}|D) - l_n(\boldsymbol{\theta}_0|D)\right)$$
thus compares the best-fit distribution with $\hat{\boldsymbol{\theta}}_{ML}$ as the parameter to the distribution with parameter $\boldsymbol{\theta}_0$.
For some specific models the Wilks statistic can also be written in the form of the KL divergence:
$$W(\boldsymbol{\theta}_0) = 2n D_{\text{KL}}(F_{\hat{\boldsymbol{\theta}}_{ML}}, F_{\boldsymbol{\theta}_0})$$
This is the case for Examples 5.1 and 5.2 and also more generally for exponential family models, but it is not true in general.

5.2 Generalised likelihood ratio test (GLRT)


The GLRT is also known as the maximum likelihood ratio test (MLRT). It works just like the standard likelihood ratio test, with the difference that now the null model 𝐻0 is also a composite model.
𝐻0 : 𝜽 ∈ 𝜔 0 ⊂ Ω   (true model lies in the restricted model space)
𝐻1 : 𝜽 ∈ 𝜔 1 = Ω \ 𝜔 0   (true model does not lie in the restricted model space)

Both 𝐻0 and 𝐻1 are now composite hypotheses. Ω represents the unrestricted


model space with dimension (=number of free parameters) 𝑑 = |Ω|. The
constrained space 𝜔0 has degree of freedom 𝑑0 = |𝜔0 | with 𝑑0 < 𝑑. Note that in
the standard LRT the set 𝜔0 is a simple point with 𝑑0 = 0 as the null model is a
simple distribution. Thus, the LRT is contained in the GLRT as a special case!
The corresponding generalised (log) likelihood ratio statistic is given by
$$W = 2\log\left(\frac{L(\hat{\theta}_{ML}|D)}{L(\hat{\theta}^0_{ML}|D)}\right) \quad \text{and} \quad \Lambda = \frac{\max_{\theta\in\omega_0} L(\theta|D)}{\max_{\theta\in\Omega} L(\theta|D)}$$
where $L(\hat{\theta}_{ML}|D)$ is the maximised likelihood assuming the full model (with parameter space Ω) and $L(\hat{\theta}^0_{ML}|D)$ is the maximised likelihood for the restricted model (with parameter space $\omega_0$). Hence, to compute the GLRT test statistic we need to perform two optimisations, one for the full and another for the restricted model.
Remarks:
• MLE in the restricted model space 𝜔0 is taken as a representative of 𝐻0 .
• The likelihood is maximised in both numerator and denominator.
• The restricted model is a special case of the full model (i.e. the two models
are nested).
• The asymptotic distribution of 𝑊 is chi-squared with degree of freedom depending on both 𝑑 and 𝑑0 :
$$W \overset{a}{\sim} \chi^2_{d-d_0}$$

• This result is due to Wilks (1938).1 Note that it assumes that the true model
is contained among the investigated models.
• If 𝐻0 is a simple hypothesis (i.e. 𝑑0 = 0) then the standard LRT (and
corresponding CI) is recovered as special case of the GLRT.
Example 5.7. GLRT example:
Case-control study: (e.g. “healthy” vs. “disease”)
we observe normal data 𝐷 = {𝑥1 , . . . , 𝑥 𝑛 } from two groups with sample size 𝑛1
and 𝑛2 (and 𝑛 = 𝑛1 + 𝑛2 ):

𝑥1 , . . . , 𝑥 𝑛1 ∼ 𝑁(𝜇1 , 𝜎2 )
and
𝑥 𝑛1 +1 , . . . , 𝑥 𝑛 ∼ 𝑁(𝜇2 , 𝜎 2 )

Question: are the two means 𝜇1 and 𝜇2 the same in the two groups?

𝐻0 : 𝜇1 = 𝜇2 (with variance unknown, i.e. treated as nuisance parameter)


𝐻1 : 𝜇1 ≠ 𝜇2

Restricted and full models:
𝜔0 : restricted model with two parameters $\mu_0$ and $\sigma_0^2$ (so that $x_1, \ldots, x_n \sim N(\mu_0, \sigma_0^2)$).
Ω: full model with three parameters $\mu_1$, $\mu_2$, $\sigma^2$.
1Wilks, S. S. 1938. The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann. Math. Statist. 9:60–62. https://doi.org/10.1214/aoms/1177732360

Corresponding log-likelihood functions:
Restricted model 𝜔0 :
$$\log L(\mu_0, \sigma_0^2|D) = -\frac{n}{2}\log(\sigma_0^2) - \frac{1}{2\sigma_0^2}\sum_{i=1}^n (x_i - \mu_0)^2$$
Full model Ω:
$$
\begin{aligned}
\log L(\mu_1, \mu_2, \sigma^2|D) &= \left(-\frac{n_1}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n_1} (x_i - \mu_1)^2\right) + \left(-\frac{n_2}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=n_1+1}^{n} (x_i - \mu_2)^2\right) \\
&= -\frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\left(\sum_{i=1}^{n_1} (x_i - \mu_1)^2 + \sum_{i=n_1+1}^{n} (x_i - \mu_2)^2\right)
\end{aligned}
$$

Corresponding MLEs:
For 𝜔0 :
$$\hat{\mu}_0 = \frac{1}{n}\sum_{i=1}^n x_i \qquad \hat{\sigma}_0^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \hat{\mu}_0)^2$$
For Ω:
$$\hat{\mu}_1 = \frac{1}{n_1}\sum_{i=1}^{n_1} x_i \qquad \hat{\mu}_2 = \frac{1}{n_2}\sum_{i=n_1+1}^{n} x_i \qquad \hat{\sigma}^2 = \frac{1}{n}\left(\sum_{i=1}^{n_1} (x_i - \hat{\mu}_1)^2 + \sum_{i=n_1+1}^{n} (x_i - \hat{\mu}_2)^2\right)$$

Note that the estimated means are related by
$$\hat{\mu}_0 = \frac{n_1}{n}\hat{\mu}_1 + \frac{n_2}{n}\hat{\mu}_2$$
so the overall mean is the weighted average of the two individual group means.
Moreover, the two estimated variances are related by
$$
\begin{aligned}
\hat{\sigma}_0^2 &= \hat{\sigma}^2 + \frac{n_1 n_2}{n^2}(\hat{\mu}_1 - \hat{\mu}_2)^2 \\
&= \hat{\sigma}^2\left(1 + \frac{1}{n}\,\frac{(\hat{\mu}_1 - \hat{\mu}_2)^2}{\hat{\sigma}^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}\right) \\
&= \hat{\sigma}^2\left(1 + \frac{t_{ML}^2}{n}\right)
\end{aligned}
$$
with
$$t_{ML} = \frac{\hat{\mu}_1 - \hat{\mu}_2}{\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\hat{\sigma}^2}}$$

This is an example of a variance decomposition, with
• $\hat{\sigma}_0^2$ the estimated total variance,
• $\hat{\sigma}^2$ the estimated within-group variance, and
• $\hat{\sigma}^2\,\frac{t_{ML}^2}{n} = \frac{n_1 n_2}{n^2}(\hat{\mu}_1 - \hat{\mu}_2)^2$ the estimated between-group variance,
and
$$\frac{\hat{\sigma}_0^2}{\hat{\sigma}^2} = 1 + \frac{t_{ML}^2}{n}$$

Corresponding maximised log-likelihoods:
Restricted model:
$$\log L(\hat{\mu}_0, \hat{\sigma}_0^2|D) = -\frac{n}{2}\log(\hat{\sigma}_0^2) - \frac{n}{2}$$
Full model:
$$\log L(\hat{\mu}_1, \hat{\mu}_2, \hat{\sigma}^2|D) = -\frac{n}{2}\log(\hat{\sigma}^2) - \frac{n}{2}$$
Likelihood ratio statistic:
$$
\begin{aligned}
W &= 2\log\left(\frac{L(\hat{\mu}_1, \hat{\mu}_2, \hat{\sigma}^2|D)}{L(\hat{\mu}_0, \hat{\sigma}_0^2|D)}\right) \\
&= 2\log L(\hat{\mu}_1, \hat{\mu}_2, \hat{\sigma}^2|D) - 2\log L(\hat{\mu}_0, \hat{\sigma}_0^2|D) \\
&= n\log\left(\frac{\hat{\sigma}_0^2}{\hat{\sigma}^2}\right) \\
&= n\log\left(1 + \frac{t_{ML}^2}{n}\right)
\end{aligned}
$$
The last step uses the decomposition of the total variance $\hat{\sigma}_0^2$. If an unbiased estimate of the variance is used ($\hat{\sigma}^2_{UB} = \frac{n}{n-2}\hat{\sigma}^2$) rather than the MLE then
$$W = n\log\left(1 + \frac{1}{n-2}t^2\right)$$
with
$$t = \frac{\hat{\mu}_1 - \hat{\mu}_2}{\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\hat{\sigma}^2_{UB}}}$$

−→ the GRLT is a monotone function of the (squared) two-sample 𝑡-statistic!


Asymptotic distribution:
The degree of freedom of the full model is 𝑑 = 3 and that of the constrained
model 𝑑0 = 2 so the generalised log likelihood ratio statistic 𝑊 is distributed
asymptotically as 𝜒12 . Hence, we reject the null model on 5% significance level
for all 𝑊 > 3.84.
More generally, it turns out that not just the two-sample 𝑡-test but many other
commonly used familiar statistical tests can be interpreted as GLRTs!
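The complete GLRT computation for this case-control example condenses into a few lines. A sketch on simulated data (Python/numpy/scipy; group sizes and effect size are illustrative choices):

```python
# Sketch: GLRT for equality of two normal means (common unknown variance).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(11)
x1 = rng.normal(0.0, 1.0, size=40)   # group 1
x2 = rng.normal(0.5, 1.0, size=60)   # group 2
x = np.concatenate([x1, x2]); n = len(x)

# restricted model (H0): common mean and variance
s2_0 = x.var(ddof=0)
# full model: separate means, pooled ML variance
s2 = (np.sum((x1 - x1.mean())**2) + np.sum((x2 - x2.mean())**2)) / n

W = n * np.log(s2_0 / s2)            # generalised log likelihood ratio
pval = chi2.sf(W, df=1)              # d - d0 = 3 - 2 = 1
print(W, pval, W > 3.84)
```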
Chapter 6
Optimality properties and conclusion

6.1 Properties of maximum likelihood encountered so far
1. MLE is a special case of relative entropy minimisation valid for large samples.
2. MLE can be seen as a generalisation of least squares (and conversely, least squares is a special case of ML).

Kullback-Leibler 1951: entropy learning — minimise $D_{\text{KL}}(F_{\text{true}}, F_{\boldsymbol{\theta}})$
↓ (large 𝑛)
Fisher 1922: maximise likelihood 𝐿(𝜽|𝐷)
↓ (normal model)
Gauss 1805: minimise squared error $\sum_i (x_i - \theta)^2$

3. Given a model, derivation of the MLE is basically automatic (only optimisation required)!
4. MLEs are consistent, i.e. if the true underlying model 𝐹true with parameter 𝜽 true is contained in the set of specified candidate models 𝐹𝜽 then the MLE will converge to the true model.
5. Correspondingly, MLEs are asymptotically unbiased.
6. However, MLEs are not necessarily unbiased in finite samples (e.g. the MLE of the variance parameter in the normal distribution).
7. The maximum likelihood estimate is invariant against parameter transformations.
8. In regular situations (when a local quadratic approximation is possible) MLEs are asymptotically normally distributed, with the asymptotic variance determined by the observed Fisher information.
9. In regular situations and for large sample size MLEs are asymptotically optimally efficient (Cramér-Rao theorem): for large samples the MLE achieves the lowest variance possible for an estimator — the so-called Cramér-Rao lower bound. The variance decreases to zero with 𝑛 → ∞, typically with rate 1/𝑛.
10. The likelihood ratio can be used to construct optimal tests (in the sense of the Neyman-Pearson theorem).

6.2 Summarising data and the concept of (minimal) sufficiency
6.2.1 Sufficient statistic
Another important concept in statistics and likelihood theory are so-called
sufficient statistics to summarise the information available in the data about a
parameter in a model.
Generally, a statistic 𝑇(𝐷) is a function of the observed data 𝐷 = {𝑥 1 , . . . , 𝑥 𝑛 }. The statistic 𝑇(𝐷) can be of any type and value (scalar, vector, matrix etc. — even a function). 𝑇(𝐷) is called a summary statistic if it describes important aspects of the data such as location (e.g. the average avg(𝐷) = 𝑥¯ or the median) or scale (e.g. the standard deviation or the interquartile range).
A statistic 𝑇(𝐷) is said to be sufficient for a parameter 𝜽 in a model if the
corresponding likelihood function can be written using only 𝑇(𝐷) in the terms
that involve 𝜽 such that

$$L(\boldsymbol{\theta}|D) = h(T(D), \boldsymbol{\theta})\,k(D),$$
where ℎ() and 𝑘() are positive-valued functions, or equivalently on the log-scale
$$l_n(\boldsymbol{\theta}) = \log h(T(D), \boldsymbol{\theta}) + \log k(D).$$

This is known as the Fisher-Pearson factorisation.


By construction, estimation and inference about 𝜽 based on the factorised
likelihood 𝐿(𝜽) is mediated through the sufficient statistic 𝑇(𝐷) and does not

require the original data 𝐷. Instead, the sufficient statistic 𝑇(𝐷) contains all the
information in 𝐷 required to learn about the parameter 𝜽.
Therefore, if the MLE 𝜽ˆ 𝑀𝐿 of 𝜽 exists and is unique then the MLE is a unique
function of the sufficient statistic 𝑇(𝐷). If the MLE is not unique then it can be
chosen to be function of 𝑇(𝐷). Note that a sufficient statistic always exists since
the data 𝐷 are themselves sufficient statistics, with 𝑇(𝐷) = 𝐷. Furthermore,
sufficient statistics are not unique since applying a one-to-one transformation to
𝑇(𝐷) yields another sufficient statistic.

6.2.2 Induced partitioning of data space and likelihood equivalence
Every sufficient statistic 𝑇(𝐷) induces a partitioning of the space of data sets by
clustering all hypothetical outcomes for which the statistic 𝑇(𝐷) assumes the
same value 𝑡:
𝒳𝑡 = {𝐷 : 𝑇(𝐷) = 𝑡}
The data sets in 𝒳𝑡 are equivalent in terms of the sufficient statistic 𝑇(𝐷). Note
that this implies that 𝑇(𝐷) is not a 1:1 transformation of 𝐷. Instead of 𝑛 data
points 𝑥1 , . . . , 𝑥 𝑛 as few as one or two summaries (such as mean and variance)
may be sufficient to fully convey all the information in the data about the model
parameters. Thus, transforming data 𝐷 using a sufficient statistic 𝑇(𝐷) may
result in substantial data reduction.
Two data sets 𝐷1 and 𝐷2 for which the ratio of the corresponding likelihoods
𝐿(𝜽|𝐷1 )/𝐿(𝜽|𝐷2 ) does not depend on 𝜽 (so the two likelihoods are proportional
to each other by a constant) are called likelihood equivalent because a likelihood-
based procedure to learn about 𝜽 will draw identical conclusions from 𝐷1 and
𝐷2 . For data sets 𝐷1 , 𝐷2 ∈ 𝒳𝑡 which are equivalent with respect to a sufficient
statistic 𝑇 it follows directly from the Fisher-Pearson factorisation that the ratio

𝐿(𝜽|𝐷1 )/𝐿(𝜽|𝐷2 ) = 𝑘(𝐷1 )/𝑘(𝐷2 )

and thus is constant with regard to 𝜽. As a result, all data sets in 𝒳𝑡 are
likelihood equivalent. However, the converse is not true: depending on the
sufficient statistics there usually will be many likelihood equivalent data sets
that are not part of the same set 𝒳𝑡 .

6.2.3 Minimal sufficient statistics


Of particular interest is therefore to find those sufficient statistics that achieve the
coarsest partitioning of the sample space and thus may allow the highest data
reduction. Specifically, a minimal sufficient statistic is a sufficient statistic for
which all likelihood equivalent data sets also are equivalent under this statistic.
Therefore, to check whether a sufficient statistic 𝑇(𝐷) is minimally sufficient we need to verify whether for any two likelihood equivalent data sets 𝐷1 and 𝐷2 it also follows that 𝑇(𝐷1 ) = 𝑇(𝐷2 ). If this holds true then 𝑇 is a minimally sufficient statistic.
An equivalent non-operational definition is that a minimal sufficient statistic
𝑇(𝐷) is a sufficient statistic that can be computed from any other sufficient
statistic 𝑆(𝐷). This follows from the above directly: assume any sufficient
statistic 𝑆(𝐷), this defines a corresponding set 𝒳𝑠 of likelihood equivalent data
sets. By implication any 𝐷1 , 𝐷2 ∈ 𝒳𝑠 will necessarily also be in 𝒳𝑡 , thus whenever
𝑆(𝐷1 ) = 𝑆(𝐷2 ) we also have 𝑇(𝐷1 ) = 𝑇(𝐷2 ), and therefore 𝑇(𝐷1 ) is a function of
𝑆(𝐷1 ).
A trivial but important example of a minimal sufficient statistic is the likelihood
function itself since by definition it can be computed from any set of sufficient
statistics. Thus the likelihood function 𝐿(𝜽) captures all information about 𝜽
that is available in the data. In other words, it provides an optimal summary
of the observed data with regard to a model. Note that in Bayesian statistics
(to be discussed in Part 2 of the module) the likelihood function is used as
proxy/summary of the data.

6.2.4 Example: normal distribution


Example 6.1. Sufficient statistics for the parameters of the normal distribution:
The normal model $N(\mu, \sigma^2)$ with parameter vector $\boldsymbol{\theta} = (\mu, \sigma^2)^T$ has log-likelihood
$$l_n(\boldsymbol{\theta}) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2$$
One possible set of minimal sufficient statistics for 𝜽 are $\bar{x}$ and $\overline{x^2}$, and with these we can rewrite the log-likelihood function without any reference to the original data $x_1, \ldots, x_n$ as follows:
$$l_n(\boldsymbol{\theta}) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{n}{2\sigma^2}\left(\overline{x^2} - 2\bar{x}\mu + \mu^2\right)$$
An alternative set of minimal sufficient statistics for 𝜽 consists of $s^2 = \overline{x^2} - \bar{x}^2 = \hat{\sigma}^2_{ML}$ and $\bar{x} = \hat{\mu}_{ML}$. The log-likelihood written in terms of $s^2$ and $\bar{x}$ is
$$l_n(\boldsymbol{\theta}) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{n}{2\sigma^2}\left(s^2 + (\bar{x} - \mu)^2\right)$$
Note that in this example the dimension of the parameter vector 𝜽 equals the
dimension of the minimal sufficient statistic, and furthermore, that the MLEs of
the parameters are in fact minimal sufficient!
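That the pair $(\bar{x}, \overline{x^2})$ really carries all the information about 𝜽 can be checked numerically: the log-likelihood computed from the two summaries agrees exactly with the one computed from the raw data. A sketch (Python/numpy; illustrative only):

```python
# Sketch: the normal log-likelihood depends on the data only through
# the sufficient statistics (xbar, x2bar).
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(2.0, 1.5, size=50)
n, xbar, x2bar = len(x), x.mean(), np.mean(x**2)

def loglik_raw(mu, v):
    return -0.5 * n * np.log(2 * np.pi * v) - np.sum((x - mu)**2) / (2 * v)

def loglik_suff(mu, v):
    return -0.5 * n * np.log(2 * np.pi * v) - n * (x2bar - 2 * xbar * mu + mu**2) / (2 * v)

for mu, v in [(0.0, 1.0), (2.0, 2.0), (1.0, 0.5)]:
    print(np.isclose(loglik_raw(mu, v), loglik_suff(mu, v)))  # True for all
```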

6.2.5 MLEs of parameters in exponential families are minimal sufficient statistics
The conclusion from Example 6.1 holds true more generally: in an exponential family model (such as the normal distribution, a particularly important case) the MLEs of the parameters are minimal sufficient statistics. Thus, there will typically be substantial dimension reduction from the raw data to the sufficient statistics.

However, outside exponential families the MLE is not necessarily a minimal sufficient statistic, and may not even be a sufficient statistic. This is because a (minimal) sufficient statistic of the same dimension as the parameters does not always exist. A classic example is the Cauchy distribution, for which the minimal sufficient statistics are the ordered observations; thus the MLEs of the parameters do not constitute sufficient statistics, let alone minimal sufficient statistics. However, the MLE is of course still a function of the minimal sufficient statistic.

In summary, the likelihood function acts as perfect data summariser (i.e. as mini-
mally sufficient statistic), and in exponential families (e.g. Normal distribution)
the MLEs of the parameters 𝜽ˆ 𝑀𝐿 are minimal sufficient.

Finally, while sufficiency is clearly a useful concept for data reduction one needs
to keep in mind that this is always in reference to a specific model. Therefore,
unless one strongly believes in a certain model it is generally a good idea to keep
(and not discard!) the original data.

6.3 Concluding remarks on maximum likelihood


6.3.1 Remark on KL divergence
Finding the model 𝐹𝜽 that best approximates the underlying true model 𝐹0 is
done by minimising the relative entropy 𝐷KL (𝐹0 , 𝐹𝜽 ). For large sample size
𝑛 we may approximate 𝐹0 by the empirical distribution 𝐹ˆ0 , and minimising
𝐷KL (𝐹ˆ0 , 𝐹𝜽 ) then yields the method of maximum likelihood, as discussed earlier.

However, since the KL divergence is not symmetric there are in fact two ways
to minimise the divergence between a fixed 𝐹0 and the family 𝐹𝜽 , each with
different properties:

a) forward KL, approximation KL: min𝜽 𝐷KL (𝐹0 , 𝐹𝜽 )

Note that here we keep the first argument fixed and minimise KL by
changing the second argument.

This is also called an “M (Moment) projection”. It has a zero avoiding
property: 𝑓𝜽 (𝑥) > 0 whenever 𝑓0 (𝑥) > 0.

This procedure is mean-seeking and inclusive, i.e. when there are multiple
modes in the density of 𝐹0 a fitted unimodal density 𝐹𝜽ˆ will seek to cover
all modes.

b) reverse KL, inference KL: min𝜽 𝐷KL (𝐹𝜽 , 𝐹0 )

Note that here we keep the second argument fixed and minimise KL by
changing the first argument.

This is also called an “I (Information) projection”. It has a zero forcing
property: 𝑓𝜽 (𝑥) = 0 whenever 𝑓0 (𝑥) = 0.

This procedure is mode-seeking and exclusive, i.e. when there are multiple
modes in the density of 𝐹0 a fitted unimodal density 𝐹𝜽ˆ will seek out one
mode to the exclusion of the others.

Maximum likelihood is based on “forward KL”, whereas Bayesian updating and
variational Bayes approximations use “reverse KL”.
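The different behaviour of the two projections can be illustrated numerically. The following R sketch fits a normal density to a bimodal mixture by minimising either divergence; the mixture density, the integration grid and the starting values are illustrative choices, not part of the theory above.

# true model F0: a bimodal mixture density
f0 <- function(x) 0.5*dnorm(x, -2, 0.5) + 0.5*dnorm(x, 2, 0.5)

kl <- function(par, reverse = FALSE) {
  mu <- par[1]; sigma <- exp(par[2])     # log-parameterise sigma > 0
  x  <- seq(-8, 8, length.out = 4001)    # integration grid
  lf <- log(f0(x))
  lg <- dnorm(x, mu, sigma, log = TRUE)
  h  <- if (reverse) exp(lg)*(lg - lf) else f0(x)*(lf - lg)
  sum(h)*(x[2] - x[1])                   # Riemann sum approximation
}

optim(c(0, 0), kl)$par    # forward KL: mu = 0, large sigma, covers both modes
optim(c(1.9, log(0.5)), kl, reverse = TRUE)$par  # reverse KL: locks onto one mode

Note that the reverse KL fit depends on the starting value, which is exactly its mode-seeking, exclusive behaviour.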

6.3.2 What happens if 𝑛 is small?


From the long list of optimality properties of ML it is clear that for large sample
size 𝑛 the best estimator will typically be the MLE.

However, for small sample size it is indeed possible (and necessary) to improve
over the MLE (e.g. via Bayesian estimation or regularisation). Some of these
ideas will be discussed in Part II.

• For small 𝑛 the likelihood will overfit!

Alternative methods need to be used:

• regularised/penalised likelihood
• Bayesian methods

which are essentially two sides of the same coin.



A classic example of a simple non-ML estimator that is better than the MLE is Stein’s
example / Stein paradox (C. Stein, 1955):
• Problem setting: estimation of the mean in the multivariate case.
• Maximum likelihood estimation breaks down! → the average (= MLE) is worse
in terms of MSE than the Stein estimator.
• For small 𝑛 the asymptotic distributions for the MLE and for the LRT are
not accurate, so for inference in these situations the distributions may need
to be obtained by simulation (e.g. parametric or nonparametric bootstrap).

6.3.3 Model selection


• Confidence intervals are sets of models that are not statistically distinguishable from the best
ML model.
• If in doubt, choose the simplest model compatible with the data.
• This gives better prediction and avoids overfitting.
• Useful for model exploration and model building.

• Note that, by construction, the model with more parameters always has a
higher likelihood, implying likelihood favours complex models
• Complex model may overfit!

• For comparison of models penalised likelihood or Bayesian approaches
may be necessary.
• Model selection in small samples and high dimension is challenging
• Recall that the aim in statistics is not to reject models (this is easy, as
for large sample size any model will be rejected!)

• Instead, the aim is model building, i.e. to find a model that explains the
data well and that predicts well!

• Typically, this will not be the best-fit ML model, but rather a simpler model
that is close enough to the best / most complex model.
Part II

Bayesian Statistics

Chapter 7

Conditioning and Bayes rule

In this chapter we review conditional probabilities. Conditional probability is
essential for Bayesian statistical modelling.

7.1 Conditional probability


Assume we have two random variables 𝑥 and 𝑦 with a joint density (or joint
PMF) 𝑝(𝑥, 𝑦). By definition $\int_{x,y} p(x,y)\,dx\,dy = 1$.

The marginal densities for the individual 𝑥 and 𝑦 are given by $p(x) = \int_y p(x,y)\,dy$
and $p(y) = \int_x p(x,y)\,dx$. Thus, when computing the marginal densities a variable
is removed from the joint density by integrating over all possible states of that
variable. It follows also that $\int_x p(x)\,dx = 1$ and $\int_y p(y)\,dy = 1$, i.e. the marginal
densities also integrate to 1.

As an alternative to integrating out a random variable in the joint density 𝑝(𝑥, 𝑦)
we may wish to keep it fixed at some value, say keep 𝑦 fixed at 𝑦0 . In this case
𝑝(𝑥, 𝑦 = 𝑦0 ) is proportional to the conditional density (or PMF) given by the
ratio

$$p(x|y=y_0) = \frac{p(x, y=y_0)}{p(y=y_0)}$$

The denominator $p(y=y_0) = \int_x p(x, y=y_0)\,dx$ is needed to ensure that
$\int_x p(x|y=y_0)\,dx = 1$; thus it renormalises 𝑝(𝑥, 𝑦 = 𝑦0 ) so that it is a proper
density.

To simplify notation, the specific value on which a variable is conditioned is
often left out, so we just write 𝑝(𝑥|𝑦).


7.2 Bayes’ theorem


Thomas Bayes (1701-1761) was the first to state Bayes’ theorem on conditional
probabilities.
Using the definition of conditional probabilities we see that the joint density can
be written as the product of marginal and conditional density in two different
ways:
𝑝(𝑥, 𝑦) = 𝑝(𝑥|𝑦)𝑝(𝑦) = 𝑝(𝑦|𝑥)𝑝(𝑥)

This directly leads to Bayes’ theorem:

$$p(x|y) = p(y|x)\,\frac{p(x)}{p(y)}$$

This rule relates the two possible conditional densities (or conditional probability
mass functions) for two random variables 𝑥 and 𝑦. It thus allows one to reverse the
order of conditioning.
Bayes’ theorem was published only in 1763, after his death, by Richard Price
(1723–1791). Pierre-Simon Laplace independently published Bayes’ theorem in 1774 and he
was in fact the first to routinely apply it to statistical calculations.

7.3 Conditional mean and variance


The mean E(𝑥|𝑦) and variance Var(𝑥|𝑦) of the conditional distribution with
density 𝑝(𝑥|𝑦) are called conditional mean and conditional variance.
The law of total expectation states that

E(E(𝑥|𝑦)) = E(𝑥)

The law of total variance states that

Var(𝑥) = Var(E(𝑥|𝑦)) + E(Var(𝑥|𝑦))

The first term is the “explained” or “between-group” variance, and the second
the “unexplained” or “mean within group” variance (also known as “pooled”
variance).
Example 7.1. Mean and variance of a mixture model:
Assume 𝐾 groups indicated by a discrete variable 𝑦 = 1, 2, . . . , 𝐾 with probability
𝑝(𝑦) = 𝜋 𝑦 . In each group the observations 𝑥 follow a density 𝑝(𝑥|𝑦) with
conditional mean 𝐸(𝑥|𝑦) = 𝜇 𝑦 and conditional variance Var(𝑥|𝑦) = 𝜎2𝑦 . The
joint density for 𝑥 and 𝑦 is 𝑝(𝑥, 𝑦) = 𝜋 𝑦 𝑝(𝑥|𝑦). The marginal density for 𝑥 is
$p(x) = \sum_{y=1}^{K} \pi_y\, p(x|y)$. This is called a mixture model.

The total mean E(𝑥) = 𝜇0 is equal to $\sum_{y=1}^{K} \pi_y \mu_y$.

The total variance Var(𝑥) = $\sigma_0^2$ is equal to

$$\sigma_0^2 = \sum_{y=1}^{K} \pi_y (\mu_y - \mu_0)^2 + \sum_{y=1}^{K} \pi_y \sigma_y^2$$
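These formulas are easy to check by simulation. A short R sketch follows, where the weights, means and standard deviations are arbitrary example values:

pi_y <- c(0.3, 0.7); mu_y <- c(0, 3); sd_y <- c(1, 2)  # example mixture
mu0 <- sum(pi_y*mu_y)                                  # total mean
v0  <- sum(pi_y*(mu_y - mu0)^2) + sum(pi_y*sd_y^2)     # total variance
# Monte Carlo confirmation
n <- 1e6
y <- sample(1:2, n, replace = TRUE, prob = pi_y)
x <- rnorm(n, mu_y[y], sd_y[y])
c(mu0, mean(x)); c(v0, var(x))                         # close agreement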

7.4 Conditional entropy and entropy chain rules


For the entropy of the joint distribution we find that

$$\begin{aligned}
H(P_{x,y}) &= -\mathrm{E}_{P_{x,y}} \log p(x,y) \\
&= -\mathrm{E}_{P_x} \mathrm{E}_{P_{y|x}} \left(\log p(x) + \log p(y|x)\right) \\
&= -\mathrm{E}_{P_x} \log p(x) - \mathrm{E}_{P_x} \mathrm{E}_{P_{y|x}} \log p(y|x) \\
&= H(P_x) + H(P_{y|x})
\end{aligned}$$

thus it decomposes into the entropy of the marginal distribution and the
conditional entropy defined as

$$H(P_{y|x}) = -\mathrm{E}_{P_x} \mathrm{E}_{P_{y|x}} \log p(y|x)$$

Note that, by convention and to simplify notation, the expectation $\mathrm{E}_{P_x}$ over the
conditioning variable 𝑥 is implicitly assumed in the notation $H(P_{y|x})$.
Similarly, for the cross-entropy we get

$$\begin{aligned}
H(Q_{x,y}, P_{x,y}) &= -\mathrm{E}_{Q_{x,y}} \log p(x,y) \\
&= -\mathrm{E}_{Q_x} \mathrm{E}_{Q_{y|x}} \log\left(p(x)\, p(y|x)\right) \\
&= -\mathrm{E}_{Q_x} \log p(x) - \mathrm{E}_{Q_x} \mathrm{E}_{Q_{y|x}} \log p(y|x) \\
&= H(Q_x, P_x) + H(Q_{y|x}, P_{y|x})
\end{aligned}$$

where the conditional cross-entropy is defined as

$$H(Q_{y|x}, P_{y|x}) = -\mathrm{E}_{Q_x} \mathrm{E}_{Q_{y|x}} \log p(y|x)$$

Note again the implicit expectation $\mathrm{E}_{Q_x}$ over 𝑥 implied in this notation.
The KL divergence between the joint distributions can be decomposed as follows:

$$\begin{aligned}
D_{\mathrm{KL}}(Q_{x,y}, P_{x,y}) &= \mathrm{E}_{Q_{x,y}} \log\left(\frac{q(x,y)}{p(x,y)}\right) \\
&= \mathrm{E}_{Q_x} \mathrm{E}_{Q_{y|x}} \log\left(\frac{q(x)\,q(y|x)}{p(x)\,p(y|x)}\right) \\
&= \mathrm{E}_{Q_x} \log\left(\frac{q(x)}{p(x)}\right) + \mathrm{E}_{Q_x} \mathrm{E}_{Q_{y|x}} \log\left(\frac{q(y|x)}{p(y|x)}\right) \\
&= D_{\mathrm{KL}}(Q_x, P_x) + D_{\mathrm{KL}}(Q_{y|x}, P_{y|x})
\end{aligned}$$

with the conditional KL divergence or conditional relative entropy defined as

$$D_{\mathrm{KL}}(Q_{y|x}, P_{y|x}) = \mathrm{E}_{Q_x} \mathrm{E}_{Q_{y|x}} \log\left(\frac{q(y|x)}{p(y|x)}\right)$$

(again the expectation $\mathrm{E}_{Q_x}$ is usually dropped for convenience). The conditional
relative entropy can also be computed from the conditional (cross-)entropies by

$$D_{\mathrm{KL}}(Q_{y|x}, P_{y|x}) = H(Q_{y|x}, P_{y|x}) - H(Q_{y|x})$$

The above decompositions for the entropy, the cross-entropy and relative entropy
are known as entropy chain rules.

7.5 Entropy bounds for the marginal variables


The chain rule for KL divergence directly shows that

$$\underbrace{D_{\mathrm{KL}}(Q_{x,y}, P_{x,y})}_{\text{upper bound}} = D_{\mathrm{KL}}(Q_x, P_x) + \underbrace{D_{\mathrm{KL}}(Q_{y|x}, P_{y|x})}_{\geq 0} \;\geq\; D_{\mathrm{KL}}(Q_x, P_x)$$

This means that the KL divergence between the joint distributions forms an
upper bound for the KL divergence between the marginal distributions, with
the difference given by the conditional KL divergence 𝐷KL (𝑄 𝑦|𝑥 , 𝑃𝑦|𝑥 ).
Equivalently, we can state an upper bound for the marginal cross-entropy:

$$\underbrace{H(Q_{x,y}, P_{x,y}) - H(Q_{y|x})}_{\text{upper bound}} = H(Q_x, P_x) + \underbrace{D_{\mathrm{KL}}(Q_{y|x}, P_{y|x})}_{\geq 0} \;\geq\; H(Q_x, P_x)$$

Instead of an upper bound we may as well express this as a lower bound for the
negative marginal cross-entropy

$$-H(Q_x, P_x) = \underbrace{-H(Q_x Q_{y|x}, P_{x,y}) + H(Q_{y|x})}_{\text{lower bound } F\left(Q_x, Q_{y|x}, P_{x,y}\right)} + \underbrace{D_{\mathrm{KL}}(Q_{y|x}, P_{y|x})}_{\geq 0} \;\geq\; F\left(Q_x, Q_{y|x}, P_{x,y}\right)$$

Since entropy and KL divergence are closely linked with maximum likelihood, the
above bounds play a major role in statistical learning of models with unobserved
latent variables (here 𝑦). They form the basis of important methods such as the
EM algorithm as well as variational Bayes.
Chapter 8

Models with latent variables and missing data

8.1 Complete data log-likelihood versus observed data log-likelihood
It is frequently the case that we need to employ models where not all variables
are observable and the corresponding data are missing.
For example consider two random variables 𝑥 and 𝑦 with a joint density

𝑝(𝑥, 𝑦|𝜽)

and parameters 𝜽. If we observe data 𝐷𝑥 = {𝑥1 , . . . , 𝑥 𝑛 } and 𝐷 𝑦 = {𝑦1 , . . . , 𝑦𝑛 }


for 𝑛 samples we can use the complete data log-likelihood
$$l_n(\boldsymbol{\theta}|D_x, D_y) = \sum_{i=1}^{n} \log p(x_i, y_i|\boldsymbol{\theta})$$

to estimate 𝜽. Recall that

𝑙𝑛 (𝜽|𝐷𝑥 , 𝐷 𝑦 ) = −𝑛𝐻(𝑄ˆ 𝑥,𝑦 , 𝑃𝑥,𝑦|𝜽 )

where 𝑄ˆ 𝑥,𝑦 is the empirical joint distribution based on both 𝐷𝑥 and 𝐷 𝑦 and 𝑃𝑥,𝑦|𝜽
the joint model, so maximising the complete data log-likelihood minimises the
cross-entropy 𝐻(𝑄ˆ 𝑥,𝑦 , 𝑃𝑥,𝑦|𝜽 ).
Now assume that 𝑦 is not observable and hence is a so-called latent variable.
Then we don’t have observations 𝐷 𝑦 and therefore cannot use the complete data
likelihood. Instead, for maximum likelihood estimation with missing data we
need to use the observed data log-likelihood.


From the joint density we obtain the marginal density for 𝑥 by integrating out
the unobserved variable 𝑦:

$$p(x|\boldsymbol{\theta}) = \int_y p(x, y|\boldsymbol{\theta})\,dy$$

Using the marginal model we then compute the observed data log-likelihood
$$l_n(\boldsymbol{\theta}|D_x) = \sum_{i=1}^{n} \log p(x_i|\boldsymbol{\theta}) = \sum_{i=1}^{n} \log \int_y p(x_i, y|\boldsymbol{\theta})\,dy$$

Note that only the data 𝐷𝑥 are used.


Maximum likelihood estimation based on the marginal model proceeds as usual
by maximising the corresponding observed data likelihood function which is

𝑙𝑛 (𝜽|𝐷𝑥 ) = −𝑛𝐻(𝑄ˆ 𝑥 , 𝑃𝑥|𝜽 )

where 𝑄ˆ 𝑥 is the empirical distribution based only on 𝐷𝑥 and 𝑃𝑥|𝜽 is the model
family. Hence, maximising the observed data log-likelihood minimises the
cross-entropy 𝐻(𝑄ˆ 𝑥 , 𝑃𝑥|𝜽 ).
Example 8.1. Two group normal mixture model:
Assume we have two groups labelled by 𝑦 = 1 and 𝑦 = 2 (thus the variable 𝑦 is
discrete). The data 𝑥 observed in each group are normal with means 𝜇1 and 𝜇2
and variances 𝜎12 and 𝜎22 , respectively. The probability of group 1 is 𝜋1 = 𝑝 and
the probability of group 2 is 𝜋2 = 1 − 𝑝. The density of the joint model for 𝑥 and
𝑦 is
$$p(x, y|\boldsymbol{\theta}) = \pi_y\, N(x|\mu_y, \sigma^2_y)$$

The model parameters are 𝜽 = (𝑝, 𝜇1 , 𝜇2 , 𝜎12 , 𝜎22 )𝑇 and they can be inferred from
the complete data comprised of 𝐷𝑥 = {𝑥1 , . . . , 𝑥 𝑛 } and the group allocations
𝐷 𝑦 = {𝑦1 , . . . , 𝑦𝑛 } of each sample using the complete data log-likelihood

$$l_n(\boldsymbol{\theta}|D_x, D_y) = \sum_{i=1}^{n} \log \pi_{y_i} + \sum_{i=1}^{n} \log N(x_i|\mu_{y_i}, \sigma^2_{y_i})$$

However, typically we do not know the class allocation 𝑦 and thus we need to
use the marginal model for 𝑥 alone which has density

$$p(x|\boldsymbol{\theta}) = \sum_{y=1}^{2} \pi_y\, N(x|\mu_y, \sigma^2_y) = p\, N(x|\mu_1, \sigma_1^2) + (1-p)\, N(x|\mu_2, \sigma_2^2)$$



This is an example of a two-component mixture model. The corresponding
observed data log-likelihood is

$$l_n(\boldsymbol{\theta}|D_x) = \sum_{i=1}^{n} \log \sum_{y=1}^{2} \pi_y\, N(x_i|\mu_y, \sigma^2_y)$$

Note that the form of the observed data log-likelihood is more complex than that
of the complete data log-likelihood because it contains the logarithm of a sum
that cannot be simplified. It is used to estimate the model parameters 𝜽 from 𝐷𝑥
without requiring knowledge of the class allocations 𝐷 𝑦 .
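The observed data log-likelihood of this example is straightforward to code and can also be maximised numerically. A sketch in R, where the simulated data, the starting values and the box constraints are illustrative choices:

loglik_obs <- function(theta, x) {   # theta = (p, mu1, mu2, s1, s2)
  p <- theta[1]; mu <- theta[2:3]; s <- theta[4:5]
  sum(log(p*dnorm(x, mu[1], s[1]) + (1 - p)*dnorm(x, mu[2], s[2])))
}
set.seed(1)
x <- c(rnorm(70, 0, 1), rnorm(30, 4, 1))   # example data from two groups
fit <- optim(c(0.5, -1, 5, 1, 1), loglik_obs, x = x, method = "L-BFGS-B",
             lower = c(0.01, -Inf, -Inf, 0.01, 0.01),
             upper = c(0.99, Inf, Inf, Inf, Inf),
             control = list(fnscale = -1))  # fnscale = -1 turns optim into a maximiser
fit$par                                     # estimates of p, mu1, mu2, s1, s2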
Example 8.2. Alternative computation of the observed data likelihood:
An alternative way to arrive at the observed data likelihood is to marginalise the
complete data likelihood.
$$L_n(\boldsymbol{\theta}|D_x, D_y) = \prod_{i=1}^{n} p(x_i, y_i|\boldsymbol{\theta})$$

and

$$L_n(\boldsymbol{\theta}|D_x) = \int_{y_1,\dots,y_n} \prod_{i=1}^{n} p(x_i, y_i|\boldsymbol{\theta})\,dy_1 \cdots dy_n$$

The integration (or summation) and the multiplication can be interchanged as per the
Generalised Distributive Law, leading to

$$L_n(\boldsymbol{\theta}|D_x) = \prod_{i=1}^{n} \int_y p(x_i, y|\boldsymbol{\theta})\,dy$$

which is the same as constructing the likelihood from the marginal density.

8.2 Estimation of the unobservable latent states using Bayes’ theorem
After estimating the marginal model it is straightforward to obtain a probabilistic
prediction about the state of the latent variables 𝑦1 , . . . , 𝑦𝑛 . Since

𝑝(𝑥, 𝑦|𝜽) = 𝑝(𝑥|𝜽) 𝑝(𝑦|𝑥, 𝜽) = 𝑝(𝑦|𝜽) 𝑝(𝑥|𝑦, 𝜽)

given an estimate 𝜽ˆ we are able to compute for each observation 𝑥 𝑖

$$p(y_i|x_i, \hat{\boldsymbol{\theta}}) = \frac{p(x_i, y_i|\hat{\boldsymbol{\theta}})}{p(x_i|\hat{\boldsymbol{\theta}})} = \frac{p(y_i|\hat{\boldsymbol{\theta}})\, p(x_i|y_i, \hat{\boldsymbol{\theta}})}{p(x_i|\hat{\boldsymbol{\theta}})}$$

the probabilities / densities of all states of 𝑦 𝑖 (note this is an application of Bayes’
theorem).

Example 8.3. Latent states of the two-group normal mixture model:

Continuing from Example 8.1 above we assume the marginal model has been
fitted with parameter values $\hat{\boldsymbol{\theta}} = (\hat{p}, \hat{\mu}_1, \hat{\mu}_2, \hat{\sigma}_1^2, \hat{\sigma}_2^2)^T$. Then for each sample $x_i$ we
can get a probabilistic prediction of its group association via

$$p(y_i|x_i, \hat{\boldsymbol{\theta}}) = \frac{\hat{\pi}_{y_i}\, N(x_i|\hat{\mu}_{y_i}, \hat{\sigma}^2_{y_i})}{\hat{p}\, N(x_i|\hat{\mu}_1, \hat{\sigma}_1^2) + (1-\hat{p})\, N(x_i|\hat{\mu}_2, \hat{\sigma}_2^2)}$$
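This computation translates directly into code. A minimal R sketch, where the fitted parameter values passed in are assumed to come from a previous model fit and the example call uses hypothetical values:

posterior_y <- function(x, p, mu, s) {   # soft group allocations
  d1 <- p*dnorm(x, mu[1], s[1])          # pihat_1 N(x | muhat_1, sigmahat_1^2)
  d2 <- (1 - p)*dnorm(x, mu[2], s[2])
  cbind("Pr(y=1|x)" = d1/(d1 + d2), "Pr(y=2|x)" = d2/(d1 + d2))
}
# example call with hypothetical fitted values
posterior_y(c(-0.5, 2, 4.5), p = 0.7, mu = c(0, 4), s = c(1, 1))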

8.3 EM Algorithm
Computing and maximising the observed data log-likelihood can be difficult
because of the integration over the unobserved variable (or summation in case of
a discrete latent variable). In contrast, the complete data log-likelihood function
may be easier to compute.
The widely used EM algorithm, formally described by Dempster and others
(1977) but also used before, addresses this problem and maximises the observed
data log-likelihood indirectly in an iterative procedure comprising two steps:
1) First (“E” step), the missing data 𝐷 𝑦 is imputed using Bayes’ theorem. This
provides probabilities (“soft allocations”) for each possible state of the
latent variable.
2) Subsequently (“M” step), the expected complete data log-likelihood func-
tion is computed, where the expectation is taken with regard to the
distribution over the latent states, and it is maximised with regard to 𝜽 to
estimate the model parameters.
The EM algorithm leads to the exact same estimates as if the observed data
log-likelihood would be optimised directly. Therefore the EM algorithm is in
fact not an approximation, it is just a different way to find the MLEs.
The EM algorithm and application to clustering is discussed in more detail in
the module MATH38161 Multivariate Statistics and Machine Learning.
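For concreteness, here is a minimal EM sketch in R for the two-group normal mixture of Example 8.1; the initialisation heuristics and the stopping rule are illustrative choices, not part of the formal algorithm.

em_mix2 <- function(x, p = 0.5, mu = range(x), s = c(sd(x), sd(x)),
                    tol = 1e-8, maxit = 1000) {
  for (it in 1:maxit) {
    # E step: soft allocations via Bayes' theorem
    d1 <- p*dnorm(x, mu[1], s[1])
    d2 <- (1 - p)*dnorm(x, mu[2], s[2])
    w  <- d1/(d1 + d2)
    # M step: maximise the expected complete data log-likelihood
    p.new  <- mean(w)
    mu.new <- c(sum(w*x)/sum(w), sum((1 - w)*x)/sum(1 - w))
    s.new  <- sqrt(c(sum(w*(x - mu.new[1])^2)/sum(w),
                     sum((1 - w)*(x - mu.new[2])^2)/sum(1 - w)))
    if (max(abs(c(p.new - p, mu.new - mu, s.new - s))) < tol) break
    p <- p.new; mu <- mu.new; s <- s.new
  }
  list(p = p, mu = mu, s = s, iterations = it)
}

Applied to the example data of the previous sketch this recovers essentially the same estimates as direct numerical maximisation of the observed data log-likelihood.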
In a nutshell, the justification for the EM algorithm follows from the entropy chain
rules and the corresponding bounds, such as $D_{\mathrm{KL}}(Q_{x,y}, P_{x,y}) \geq D_{\mathrm{KL}}(Q_x, P_x)$ (see
previous chapter). Given observed data for 𝑥 we know the empirical distribution
$\hat{Q}_x$. Hence, by iteratively minimising $D_{\mathrm{KL}}(\hat{Q}_x Q_{y|x}, P^{\boldsymbol{\theta}}_{x,y})$

1) with regard to $Q_{y|x}$ (“E” step) and
2) with regard to the parameters 𝜽 of $P^{\boldsymbol{\theta}}_{x,y}$ (“M” step)

one minimises $D_{\mathrm{KL}}(\hat{Q}_x, P^{\boldsymbol{\theta}}_x)$ with regard to the parameters of $P^{\boldsymbol{\theta}}_x$.

Interestingly, in the “E” step the first argument of the KL divergence is optimised
(“I” projection) and in the “M” step the second argument (“M” projection).

Alternatively, instead of bounding the marginal KL divergence one can also
either minimise the upper bound of the cross-entropy or maximise the lower
bound of the negative cross-entropy. All three procedures yield the
same EM algorithm.
Note that the optimisation of the entropy bound in the “E” step requires
variational calculus since the argument is a distribution! The EM algorithm is
therefore in fact a special case of a variational Bayes algorithm since it not only
provides estimates of 𝜽 but also yields the distribution of the latent states by
means of the calculus of variations.
Finally, in the above we see that we can learn about unobservable states by means
of Bayes’ theorem. By extending this same principle to learning about parameters
and models we arrive at Bayesian learning.
Chapter 9

Essentials of Bayesian statistics

9.1 Principle of Bayesian learning


9.1.1 From prior to posterior distribution
Bayesian statistical learning applies Bayes’ theorem to update our state of
knowledge about a parameter in the light of data.

Ingredients:

• 𝜽 parameter(s) of interest, unknown and fixed.


• prior distribution with density 𝑝(𝜽) describing the uncertainty (not
randomness!) about 𝜽
• data generating process 𝑝(𝑥|𝜽)

Note the model underlying the Bayesian approach is the joint distribution

𝑝(𝜽, 𝑥) = 𝑝(𝜽)𝑝(𝑥|𝜽)

as both a prior distribution over the parameters as well as a data generating


process have to be specified.

Question: new information in the form of a new observation 𝑥 arrives - how does
the uncertainty about 𝜽 change?

Answer: use Bayes’ theorem to update the prior density to the posterior density.

$$\underbrace{p(\boldsymbol{\theta}|x)}_{\text{posterior}} = \underbrace{p(\boldsymbol{\theta})}_{\text{prior}}\; \frac{p(x|\boldsymbol{\theta})}{p(x)}$$


For the denominator in the Bayes formula we need to compute 𝑝(𝑥). This is obtained
by

$$p(x) = \int_{\boldsymbol{\theta}} p(x, \boldsymbol{\theta})\,d\boldsymbol{\theta} = \int_{\boldsymbol{\theta}} p(x|\boldsymbol{\theta})\, p(\boldsymbol{\theta})\,d\boldsymbol{\theta}$$

i.e. by marginalisation of the parameter 𝜽 from the joint distribution of 𝜽 and 𝑥.
(For discrete 𝜽 replace the integral by a sum.) Depending on the context this
quantity is either called the

• normalisation constant as it ensures that the posterior density 𝑝(𝜽|𝑥)
integrates to one.
• prior predictive density of the data 𝑥 given the model 𝑀 before seeing
any data. To emphasise the implicit conditioning on a model we may write
𝑝(𝑥|𝑀). Since all parameters have been integrated out 𝑀 in fact refers to a
model class.
• marginal likelihood of the underlying model (class) 𝑀 given data 𝑥.
To emphasise this we may write 𝐿(𝑀 |𝑥). Sometimes it is also called the model
likelihood.

9.1.2 Zero forcing property


It is easy to see that if in Bayes rule the prior density/probability is zero for some
parameter value 𝜽 then the posterior density/probability will remain at zero for
that 𝜽, regardless of any data collected. This zero-forcing property of the Bayes
update rule has been called Cromwell’s rule by Dennis Lindley (1923–2013).
Therefore, assigning prior density/probability 0 to an event should be avoided.

Note that this implies that assigning prior probability 1 should be avoided, too.

9.1.3 Bayesian update and likelihood


After independent and identically distributed data 𝐷 = {𝑥 1 , . . . , 𝑥 𝑛 } have been
observed the Bayesian posterior is computed by

$$\underbrace{p(\boldsymbol{\theta}|D)}_{\text{posterior}} = \underbrace{p(\boldsymbol{\theta})}_{\text{prior}}\; \frac{L(\boldsymbol{\theta}|D)}{p(D)}$$

involving the likelihood $L(\boldsymbol{\theta}|D) = \prod_{i=1}^{n} p(x_i|\boldsymbol{\theta})$ and the marginal likelihood
$p(D) = \int_{\boldsymbol{\theta}} p(\boldsymbol{\theta})\, L(\boldsymbol{\theta}|D)\,d\boldsymbol{\theta}$ with 𝜽 integrated out.
The marginal likelihood serves as a standardising factor so that the posterior
density for 𝜽 integrates to 1:

$$\int_{\boldsymbol{\theta}} p(\boldsymbol{\theta}|D)\,d\boldsymbol{\theta} = \frac{1}{p(D)} \int_{\boldsymbol{\theta}} p(\boldsymbol{\theta})\, L(\boldsymbol{\theta}|D)\,d\boldsymbol{\theta} = 1$$

Unfortunately, the integral to compute the marginal likelihood is typically
analytically intractable and requires numerical integration and/or approximation.

Comparing likelihood and Bayes procedures, note that

• conducting a Bayesian statistical analysis requires integration, respectively
averaging (to compute the marginal likelihood),
• in contrast to a likelihood analysis, which requires optimisation (to find the
maximum likelihood).

9.1.4 Sequential updates


Note that the Bayesian update procedure can be repeated again and again: we
can use the posterior as our new prior and then update it with further data.
Thus, we may also update the posterior density sequentially, with the data
points 𝑥1 , . . . , 𝑥 𝑛 arriving one after the other, by computing first 𝑝(𝜽|𝑥1 ), then
𝑝(𝜽|𝑥1 , 𝑥2 ) and so on until we reach 𝑝(𝜽|𝑥 1 , . . . , 𝑥 𝑛 ) = 𝑝(𝜽|𝐷).
For example, for the first update we have

$$p(\boldsymbol{\theta}|x_1) = p(\boldsymbol{\theta})\,\frac{p(x_1|\boldsymbol{\theta})}{p(x_1)}$$

with $p(x_1) = \int_{\boldsymbol{\theta}} p(x_1|\boldsymbol{\theta})\, p(\boldsymbol{\theta})\,d\boldsymbol{\theta}$. The second update yields

$$\begin{aligned}
p(\boldsymbol{\theta}|x_1, x_2) &= p(\boldsymbol{\theta}|x_1)\,\frac{p(x_2|\boldsymbol{\theta}, x_1)}{p(x_2|x_1)} \\
&= p(\boldsymbol{\theta}|x_1)\,\frac{p(x_2|\boldsymbol{\theta})}{p(x_2|x_1)} \\
&= p(\boldsymbol{\theta})\,\frac{p(x_1|\boldsymbol{\theta})\, p(x_2|\boldsymbol{\theta})}{p(x_1)\, p(x_2|x_1)}
\end{aligned}$$

with $p(x_2|x_1) = \int_{\boldsymbol{\theta}} p(x_2|\boldsymbol{\theta})\, p(\boldsymbol{\theta}|x_1)\,d\boldsymbol{\theta}$. The final step is

$$p(\boldsymbol{\theta}|D) = p(\boldsymbol{\theta}|x_1, \dots, x_n) = p(\boldsymbol{\theta})\,\frac{\prod_{i=1}^{n} p(x_i|\boldsymbol{\theta})}{p(D)}$$

with the marginal likelihood factorising into

$$p(D) = \prod_{i=1}^{n} p(x_i|x_{<i})$$

with

$$p(x_i|x_{<i}) = \int_{\boldsymbol{\theta}} p(x_i|\boldsymbol{\theta})\, p(\boldsymbol{\theta}|x_{<i})\,d\boldsymbol{\theta}$$

The last factor is the posterior predictive density of the new data 𝑥 𝑖 after seeing
data 𝑥 1 , . . . , 𝑥 𝑖−1 (given the model class 𝑀). It is straightforward to understand
why the probability of the new 𝑥 𝑖 depends on the previously observed data
points — because the uncertainty about the model parameter 𝜽 depends on how
much data we have already observed. Therefore the marginal likelihood 𝑝(𝐷) is
not simply the product of the marginal densities 𝑝(𝑥 𝑖 ) at each 𝑥 𝑖 but instead the
product of the conditional densities 𝑝(𝑥 𝑖 |𝑥 <𝑖 ).

Only when the parameter is fully known, so that there is no uncertainty about 𝜽,
are the observations 𝑥 𝑖 independent. This leads back to the standard likelihood
where we condition on a particular 𝜽 and the likelihood is the product
$p(D|\boldsymbol{\theta}) = \prod_{i=1}^{n} p(x_i|\boldsymbol{\theta})$.
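The equality of sequential and batch updating can be verified numerically. A small R sketch using a grid over the parameter of a Bernoulli model; the uniform prior and the 0/1 data are arbitrary example choices.

theta  <- seq(0.001, 0.999, length.out = 999)  # parameter grid
dtheta <- theta[2] - theta[1]
prior  <- rep(1, length(theta))                # uniform prior density
x <- c(1, 0, 1, 1, 0)                          # example observations

post <- prior
for (xi in x) {                                # sequential updates
  post <- post*dbinom(xi, 1, theta)
  post <- post/(sum(post)*dtheta)              # renormalise numerically
}
batch <- prior*sapply(theta, function(th) prod(dbinom(x, 1, th)))
batch <- batch/(sum(batch)*dtheta)
max(abs(post - batch))                         # zero up to numerical error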

9.1.5 Summaries of posterior distributions and credible intervals
The Bayesian estimate is the full complete posterior distribution!

However, it is useful to summarise aspects of the posterior distribution:

• Posterior mean E(𝜽|𝐷)
• Posterior variance Var(𝜽|𝐷)
• Posterior mode etc.

In particular the mean of the posterior distribution is often taken as a Bayesian
point estimate.

The posterior distribution also allows one to define credible regions or credible
intervals. These are the Bayesian equivalent of confidence intervals and are
constructed by finding the areas of highest probability mass (say 95%) in the
posterior distribution.

Bayesian credible intervals (unlike their frequentist confidence counterparts) are
thus very easy to interpret: they simply correspond to the area in the parameter
space in which we can find the parameter with a given specified probability.
In contrast, in frequentist statistics it does not make sense to assign a probability
to a parameter value!

Note that there are typically many credible intervals with the given specified
coverage 𝛼 (say 95%). Therefore, we may need further criteria to construct these
intervals.

For a univariate parameter 𝜃 a two-sided equal-tail credible interval is obtained
by finding the corresponding lower (1 − 𝛼)/2 and upper (1 + 𝛼)/2 quantiles. Typically
this type of credible interval is easy to compute. However, note that the density
values at the left and right boundary points of such an interval are typically
different. Also, this does not generalise well to a multivariate parameter 𝜽.

As an alternative, a highest posterior density (HPD) credible interval of coverage
𝛼 is found by identifying the shortest interval (i.e. with smallest support) for the
given 𝛼 probability mass. Any point within an HPD credible interval has higher
density than any point outside the HPD credible interval. Correspondingly, the
density at the boundary of an HPD credible interval is constant, taking on the
same value everywhere along the boundary.

A Bayesian HPD credible interval is constructed in a similar fashion as a


likelihood-based confidence interval, starting from the mode of the posterior
density and then looking for a common threshold value for the density to define
the boundary of the credible interval. When the posterior density has multiple
modes the HPD interval may be disjoint. HPD intervals are also well defined for
multivariate 𝜽 with the boundaries given by the contour lines of the posterior
density resulting from the threshold value.
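For a unimodal univariate posterior the HPD interval is simply the shortest interval with the required coverage, which suggests the following R sketch; the Beta(3, 9) posterior and the 95% coverage are example values.

a <- 3; b <- 9; alpha <- 0.95              # hypothetical posterior and coverage
# equal-tail interval from the quantiles
et <- qbeta(c((1 - alpha)/2, (1 + alpha)/2), a, b)
# HPD interval: minimise the width over the lower tail probability
width <- function(p_lo) qbeta(p_lo + alpha, a, b) - qbeta(p_lo, a, b)
p_lo  <- optimize(width, c(0, 1 - alpha))$minimum
hpd   <- qbeta(c(p_lo, p_lo + alpha), a, b)
rbind(et, hpd)
dbeta(hpd, a, b)   # the density at the two HPD endpoints is (nearly) equal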

In Worksheet B1 examples for both types of credible intervals are given and
compared visually.

9.1.6 Practical application of Bayes statistics on the computer


As we have seen Bayesian learning is conceptually straightforward:
1) Specify prior uncertainty 𝑝(𝜽) about the parameters of interest 𝜽.
2) Specify the data generating process for a specified parameter: 𝑝(𝑥|𝜽).
3) Apply Bayes’ theorem to update prior uncertainty in the light of the new
data.
In practise, however, computing the posterior distribution can be computationally
very demanding, especially for complex models.
For this reason specialised software packages have been developed for
computational Bayesian modelling, for example:
• Bayesian statistics in R: https://cran.r-project.org/web/views/Bayesian.html
• Stan probabilistic programming language (interfaces with R, Python, Julia
and other languages) — https://mc-stan.org/
• Bayesian statistics in Python: PyMC using Aesara/JAX as backend,
NumPyro using JAX as backend, TensorFlow Probability on JAX using
JAX as backend, PyMC3 using Theano as backend, Pyro using PyTorch as
backend, TensorFlow Probability using TensorFlow as backend.
• Bayesian statistics in Julia: Turing.jl
• Bayesian hierarchical modelling with BUGS, JAGS and NIMBLE.
In addition to numerical procedures to sample from the posterior distribution
there are also many procedures aiming to approximate the Bayesian posterior,
employing the Laplace approximation, integrated nested Laplace approximation
(INLA), variational Bayes etc.

9.2 Some background on Bayesian statistics


9.2.1 Bayesian interpretation of probability
9.2.1.1 What makes you “Bayesian”?
If you use Bayes’ theorem are you therefore automatically a Bayesian? No!!
Bayes’ theorem is a mathematical fact from probability theory. Hence, Bayes’
theorem is valid for everyone, whichever school of statistical learning you
subscribe to (such as frequentist ideas, likelihood methods, entropy learning,
Bayesian learning).
As we discuss now the key difference between Bayesian and frequentist sta-
tistical learning lies in the differences in interpretation of probability, not in the
mathematical formalism for probability (which includes Bayes’ theorem).

9.2.1.2 Mathematics of probability

The mathematics of probability in its modern foundation was developed by
Andrey Kolmogorov (1903–1987). In his book Foundations of the Theory of
Probability (1933) he establishes probability in terms of set theory/measure
theory. This theory provides a coherent mathematical framework to work with
probabilities.

However, Kolmogorov’s theory does not provide an interpretation of probability!

→ The Kolmogorov framework is the basis for both the frequentist and the
Bayesian interpretation of probability.

9.2.1.3 Interpretations of probability

Essentially, there are two major commonly used interpretations of probability in
statistics: the frequentist interpretation and the Bayesian interpretation.

A: Frequentist interpretation

probability = frequency (of an event in a long-running series of identically


repeated experiments)

This is the ontological view of probability (i.e. probability “exists” and is identical
to something that can be observed).

It is also a very restrictive view of probability. For example, frequentist probability


cannot be used to describe events that occur only a single time. Frequentist
probability thus can only be applied asymptotically, for large samples!

B: Bayesian probability

“Probability does not exist” — famous quote by Bruno de Finetti (1906–1985), a


Bayesian statistician.

What does this mean?

Probability is a description of the state of knowledge and of uncertainty.

Probability is thus an epistemological quantity that is assigned and that changes


rather than something that is an inherent property of an object.

Note that this does not require any repeated experiments. The Bayesian in-
terpretation of probability is valid regardless of sample size or the number or
repetitions of an experiment.

Hence, the key difference between frequentist and Bayesian approaches is not
the use of Bayes’ theorem. Rather it is whether you consider probability as
ontological (frequentist) or epistemological entity (Bayesian).

9.2.2 Historical developments


• Bayesian statistics is named after Thomas Bayes (1701-1761). His paper1
introducing the famous theorem was published only after his death (1763).
• Pierre-Simon Laplace (1749-1827) was the first to practically use Bayes’
theorem for statistical calculations, and he also independently discovered
Bayes’ theorem in 17742
• This activity was then called “inverse probability” and not “Bayesian
statistics”.
• Between 1900 and 1940 classical mathematical statistics was developed
and the field was heavily influenced and dominated by R.A. Fisher (who
invented likelihood theory and ANOVA, among other things - he was also
working in biology and was professor of genetics). Fisher was very much
opposed to Bayesian statistics.
• 1931 Bruno de Finetti publishes his “representation theorem”. This shows
that the joint distribution of a sequence of exchangeable events (i.e. where
the ordering can be permuted) can be represented by a mixture distribution
that can be constructed via Bayes’ theorem. (Note that exchangeability is a
weaker condition than i.i.d.) This theorem is often used as a justification of
Bayesian statistics (along with the so-called Dutch book argument, also by
de Finetti).
• 1933 publication of Andrey Kolmogorov’s book on probability theory.
• 1946 Cox theorem by Richard T. Cox (1898–1991): the aim to generalise
classical logic from TRUE/FALSE statements to continuous measures of
uncertainty inevitably leads to probability theory and Bayesian learning!
This justification of Bayesian statistics was later popularised by Edwin T.
Jaynes (1922–1998) in various books (1959, 2003).
• 1955 Stein Paradox - Charles M. Stein (1920–2016) publishes paper on the
Stein estimator — an estimator of the mean that dominates ML estimator.
His estimator is always better in terms of MSE than the ML estimator, and
this was very puzzling at that time!
• Only from the 1950s did the use of the term “Bayesian statistics” become
prevalent — see Fienberg (2006)3
Due to advances in personal computing from 1970 onwards Bayesian learning
has become more pervasive!
1Bayes, T. 1763. An essay towards solving a problem in the doctrine of chances. The Philosophical
Transactions 53:370–418. https://doi.org/10.1098/rstl.1763.0053
2Laplace, P.-S. 1774. Mémoire sur la probabilité de causes par les évenements. Mémoires de
mathématique et de physique, présentés à l’Académie Royale des sciences par divers savants et lus
dans ses assemblées. Paris, Imprimerie Royale, pp. 621–657.
3Fienberg, S. E. 2006. When did Bayesian inference become “Bayesian”? Bayesian Analysis 1:1–40.
https://doi.org/10.1214/06-BA101

• Computers allow one to do the complex (numerical) calculations needed in
Bayesian statistics.
• Metropolis-Hastings algorithm published in 1970 (which allows to sample
from a posterior distribution without explicitly computing the marginal
likelihood).
• Development of regularised estimation techniques such as penalised
likelihood in regression (e.g. ridge regression 1970).
• penalised likelihood via KL divergence for model selection (Akaike 1973).
• A lot of work on interpreting Stein estimators as empirical Bayes estimators
(Efron and Morris 1975)
• regularisation originally was only meant to make singular sys-
tems/matrices invertible, but it turned out that regularisation also has a
Bayesian interpretation.
• work on reference priors (Bernado 1979).
• EM algorithm published in 1977 which uses Bayes theorem for imputing
the distribution of the latent variables.
Another boost came in the 1990s/2000s when in science (e.g. genomics) many
complex and high-dimensional data sets were becoming the norm, not the
exception.
• Classical statistical methods cannot be used in this setting (overfitting!) so
new methods were developed for high-dimensional data analysis, many
with a direct link to Bayesian statistics
• 1996 lasso (L1 regularised) regression invented by Robert Tibshirani.
• Machine learning methods for non-parametric and extremely highly para-
metric models (neural network) require either explicit or implicit regulari-
sation.
• There are many Bayesians in this field, many using variational Bayes techniques
that arose as a generalisation of the EM algorithm and are also linked to
models and methods from statistical physics.
Chapter 10

Bayesian learning in practise

In this chapter we discuss three basic problems, namely how to estimate a
proportion, a mean and a variance in a Bayesian framework.

10.1 Estimating a proportion using the Beta-Binomial model
10.1.1 Binomial likelihood
In order to apply Bayes’ theorem we first need to find a suitable likelihood. We
use the Bernoulli/ binomial model as in the analogous example in Part I:
Repeated Bernoulli experiment (binomial model):
𝑥 ∈ {0, 1} (e.g. “tails” vs. “heads”)
probability mass function (pmf): Pr(𝑥 = 1) = 𝑝, Pr(𝑥 = 0) = 1 − 𝑝
Mean: E(𝑥) = 𝑝
Variance Var(𝑥) = 𝑝(1 − 𝑝)
Bin(𝑛, 𝑝) (sum of 𝑛 Bernoulli experiments)
𝑥 ∈ {0, 1, . . . , 𝑛}
Mean: E(𝑥) = 𝑛𝑝
Variance: Var(𝑥) = 𝑛𝑝(1 − 𝑝)
Standardised binomial (average of 𝑛 Bernoulli experiments):
𝑥/𝑛 ∈ {0, 1/𝑛, . . . , 1}
Mean: E(𝑥/𝑛) = 𝑝
Variance: Var(𝑥/𝑛) = 𝑝(1 − 𝑝)/𝑛

From Part I (likelihood theory) we know that the maximum likelihood estimate
of the proportion is the frequency 𝑝ˆ 𝑀𝐿 = 𝑥/𝑛, where 𝑥 (the number of “heads”) is
observed in 𝑛 repeats.

10.1.2 Excursion: Properties of the Beta distribution


The density of the Beta distribution Beta(𝛼, 𝛽) for 𝑥 ∈ [0, 1] and 𝛼 > 0 and 𝛽 > 0
is

$$f(x|\alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, x^{\alpha-1}(1-x)^{\beta-1}$$

The mean is $\mathrm{E}(x) = \mu = \frac{\alpha}{\alpha+\beta}$ and the variance $\mathrm{Var}(x) = \frac{\mu(1-\mu)}{\alpha+\beta+1}$.

The density depends on the Beta function $B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$, which in turn is
defined via Euler’s Gamma function

$$\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\,dt$$

Note that $\Gamma(x) = (x-1)!$ for any positive integer 𝑥.

A useful reparameterisation of the Beta distribution is in terms of the parameters
𝜇 ∈ [0, 1] and 𝑚 > 0, yielding the original parameters via 𝛼 = 𝜇𝑚 and 𝛽 = (1−𝜇)𝑚.
Conversely, $m = \alpha + \beta$ and $\mu = \frac{\alpha}{\alpha+\beta}$.

The Beta distribution is very flexible and can assume a number of different
shapes, depending on the values of 𝛼 and 𝛽.

10.1.3 Beta prior distribution


In Bayesian learning we need to make explicit our uncertainty about 𝑝.
𝑝 has support [0, 1] → we use the Beta distribution Beta(𝛼, 𝛽) as prior for 𝑝
with parameters 𝛼 ≥ 0 and 𝛽 ≥ 0:

𝑝 ∼ Beta(𝛼, 𝛽)

Note this does not actually mean that 𝑝 is random! It only means that we model
the uncertainty about 𝑝 using a Beta random variable!
The flexibility of the Beta distribution allows to accomodate a large variety of
possible scenarios for our prior knowledge.
The prior mean is

$$\mathrm{E}(p) = \frac{\alpha}{m} = \mu_{\text{prior}}$$

and the prior variance

$$\mathrm{Var}(p) = \frac{\mu_{\text{prior}}(1-\mu_{\text{prior}})}{m+1}$$

where 𝑚 = 𝛼 + 𝛽.
Note the similarity to the moments of the standardised binomial above!

10.1.4 Computing the posterior distribution


We use Bayes’ theorem for continuous random variables to compute the posterior density:

$$f(p|x) = \frac{f(x|p)\, f(p)}{\int_{p'} f(x|p')\, f(p')\,dp'}$$

We use in our analysis the Beta-Binomial model:

a) Beta prior:

$$p \sim \text{Beta}(\alpha, \beta)$$

$$f(p) = \frac{1}{B(\alpha, \beta)}\, p^{\alpha-1}(1-p)^{\beta-1}$$

b) Binomial likelihood:

$$x|p \sim \text{Bin}(n, p)$$

$$f(x|p) = \binom{n}{x}\, p^x (1-p)^{n-x}$$
Applying Bayes’ theorem results in

c) Beta posterior distribution:

$$p|x \sim \text{Beta}(\alpha + x, \beta + n - x)$$

$$f(p|x) = \frac{1}{B(\alpha+x, \beta+n-x)}\, p^{\alpha+x-1}(1-p)^{\beta+n-x-1}$$

(for a proof see Worksheet B1!)

The posterior can be summarised by its first two moments (mean and variance):

Posterior mean:

$$\mu_{\text{posterior}} = \mathrm{E}(p|x) = \frac{x+\alpha}{n+m}$$

Posterior variance:

$$\sigma^2_{\text{posterior}} = \mathrm{Var}(p|x) = \frac{\mu_{\text{posterior}}(1-\mu_{\text{posterior}})}{n+m+1}$$

10.2 Properties of Bayesian learning


The Beta-Binomial model allows us to observe a number of intriguing features and
properties of Bayesian learning. Many of these extend also to other models, as
we will see later.

10.2.1 Prior acting as pseudodata


In the expression for the posterior mean and variance you can see that 𝑚 = 𝛼 + 𝛽
behaves like an implicit sample size connected with prior information!

Specifically, 𝛼 and 𝛽 act as pseudocounts that influence both the posterior mean
and the posterior variance, exactly in the same way as conventional data.

For example, the larger 𝑚 (and thus larger 𝛼 and 𝛽) the smaller is the posterior
variance, with variance decreasing proportional to the inverse of 𝑚. If the prior
is highly concentrated, i.e. if it has low variance and large precision (=inverse
variance) then the implicit data size 𝑚 is large. Conversely, if the prior has a
large variance, then the prior is vague and the implicit data size 𝑚 is small.

Hence, a prior has the same effect as if one would add data – but without actually
adding data! This is precisely why a prior acts as a regulariser and prevents
overfitting: it increases the effective sample size.

Another interpretation is that any prior summarises data that may have been
available previously as observations.
10.2. PROPERTIES OF BAYESIAN LEARNING 109

10.2.2 Linear shrinkage of mean


The posterior mean $\mu_{\text{posterior}}$ is a linearly adjusted $\hat{\mu}_{ML}$. This becomes evident by
writing $\mu_{\text{posterior}}$ as

$$\mu_{\text{posterior}} = \lambda \mu_{\text{prior}} + (1-\lambda)\hat{\mu}_{ML}$$

with weight 𝜆 ∈ [0, 1]

$$\lambda = \frac{m}{m+n}$$

The posterior mean is a convex combination (i.e. the weighted average) of the
ML estimate and the prior mean. The factor 𝜆 is called the shrinkage intensity
— note that it is the ratio of the “prior sample size” (𝑚) and the “effective overall
sample size” (𝑚 + 𝑛).
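A quick numerical check of this identity in R, reusing the example prior parameters and data from the previous sketch:

m <- a + b; lambda <- m/(m + n)        # shrinkage intensity
mu_prior <- a/m; mu_ml <- x/n
lambda*mu_prior + (1 - lambda)*mu_ml   # equals post_mean computed above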
1. This is called shrinkage, because the ML estimator is “shrunk” towards the
prior mean (which is often called the “target”, and sometimes the target is
zero, and then the terminology “shrinking” makes most sense).
2. If the shrinkage intensity is zero (𝜆 = 0) then the ML point estimator is
recovered. This implies 𝛼 = 0 and 𝛽 = 0, or 𝑛 → ∞.
Note that using maximum likelihood to estimate the proportion 𝑝 (for
moderate or small 𝑛) is the same as Bayesian posterior mean estimation
using the Beta-Binomial model with prior 𝛼 = 0 and 𝛽 = 0. This prior is
extremely “u-shaped” and is the implicit prior of ML estimation. (Would
you use such a prior intentionally?)
3. If the shrinkage intensity is large (𝜆 → 1) then the posterior mean corre-
sponds to the prior. This happens if 𝑛 = 0 or if 𝑚 is very large (implying
that the prior is sharply concentrated around the prior mean).
4. Since the ML estimate 𝜇ˆ 𝑀𝐿 is unbiased the Bayesian point estimate is biased
(for finite n)! And the bias is induced by the prior mean deviating from
the true mean. This is also true more generally, Bayesian learning typically
produces biased estimators (but asymptotically they will be unbiased like
in ML).
5. That the posterior mean is a linear combination of the ML estimate and
the prior mean is not a coincidence. In fact, this is true for all distributions
that are exponential families, see e.g. Diaconis and Ylvisaker (1979)1.
6. Furthermore, it is possible (and indeed quite useful for computational
reasons!) to formulate Bayes learning in terms of linear shrinkage and
using only second moments, see e.g. Hartigan (1969)2. The resulting theory
1Diaconis, P., and D. Ylvisaker. 1979. Conjugate priors for exponential families. Ann. Statist.
7:269–281. https://doi.org/10.1214/aos/1176344611
2Hartigan, J. A. 1969. Linear Bayesian methods. J. Roy. Statist. Soc. B 31:446–454.
https://doi.org/10.1111/j.2517-6161.1969.tb00804.x

is called “Bayes linear statistics” (Goldstein and Wooff, 2007)3.

10.2.3 Conjugacy of prior and posterior distribution


In the Beta-Binomial model for estimating the proportion 𝑝 the choice of the Beta
distribution as prior distribution along with the binomial likelihood resulted
in having the Beta distribution as posterior distribution as well.
If the prior and posterior belong to the same distributional family the prior is
called a conjugate prior. This will be the case if the prior has the same functional
form as the likelihood.
In the Beta-Binomial model the likelihood is based on the binomial distribution
and has the following form (only terms depending on the parameter 𝑝 are
shown):
$$p^x (1-p)^{n-x}$$
The form of the Beta prior is (again, only showing terms depending on 𝑝):

$$p^{\alpha-1}(1-p)^{\beta-1}$$

Since the posterior is proportional to the product of prior and likelihood the
posterior will have exactly the same form as the prior:

$$p^{\alpha+x-1}(1-p)^{\beta+n-x-1}$$

Choosing the prior distribution from a family conjugate to the likelihood greatly
simplifies Bayesian analysis since the Bayes formula can then be written in form
of an update formula for the parameters of the Beta distribution:

𝛼→𝛼+𝑥

𝛽→𝛽+𝑛−𝑥

Thus, conjugate prior distributions are very convenient choices. However, in their
application it must be ensured that the prior distribution is flexible enough to
encapsulate all prior information that may be available. In cases where this is not
the case alternative priors should be used (and most likely this will then require
to compute the posterior distribution numerically rather than analytically).

10.2.4 Large sample limits of mean and variance


If 𝑛 is large and 𝑛 ≫ 𝛼, 𝛽 the posterior mean and variance asymptotically become

$$\mu_{\text{posterior}} \overset{a}{=} \frac{x}{n} = \hat{\mu}_{ML}$$

and

$$\sigma^2_{\text{posterior}} \overset{a}{=} \frac{\hat{\mu}_{ML}(1-\hat{\mu}_{ML})}{n}$$

Thus, if the sample size is large the Bayes estimator turns into the ML estimator!
Specifically, the posterior mean becomes the ML point estimate, and the posterior
variance is equal to the asymptotic variance computed via the observed Fisher
information!

Thus, for large 𝑛 the data dominate and any details about the prior (such as the
values of 𝛼 and 𝛽) become irrelevant!

3Goldstein, M., and D. Wooff. 2007. Bayes Linear Statistics: Theory and Methods. Wiley.
https://doi.org/10.1002/9780470065662

10.2.5 Asymptotic Normality of the Posterior distribution


Also known as Bayesian Central Limit Theorem (CLT).
Under some regularity conditions (such as regular likelihood and positive prior
probability for all parameter values, finite number of parameters, etc.) for
large sample size the Bayesian posterior distribution converges to a Normal
distribution centered around the MLE and with the variance of the MLE:

for large 𝑛: 𝑝(𝜽|𝒙 1 , 𝒙 2 , . . . , 𝒙 𝑛 ) → 𝑁(𝜽ˆ 𝑀𝐿 , Var(𝜽ˆ 𝑀𝐿 ))

So not only are the posterior mean and variance converging to the MLE and the
variance of the MLE for large sample size, but also the posterior distribution
itself converges to the sampling distribution!
This holds generally in many regular cases, not just in our example of the
Beta-Bernoulli model.
The Bayesian CLT is generally known as the Bernstein-von Mises theorem (discovered
around 1920–30), but special cases were already known to Laplace.
In Worksheet B1 the asymptotic convergence of the posterior distribution to
a normal distribution is demonstrated graphically.

10.2.6 Posterior variance for finite 𝑛


Previously we have derived a Bayesian point estimate for the proportion 𝑝 as
the posterior mean

$$\mathrm{E}(p|x) = \frac{x+\alpha}{n+m} = \hat{p}_{\text{Bayes}}$$

with posterior variance

$$\mathrm{Var}(p|x) = \frac{\hat{p}_{\text{Bayes}}(1-\hat{p}_{\text{Bayes}})}{n+m+1}$$

Asymptotically, we have seen that for large 𝑛 the posterior mean becomes the
maximum likelihood estimate (MLE), and the posterior variance becomes the
asymptotic variance of the MLE. Thus, for large 𝑛 the Bayesian estimate will be
indistinguishable from the MLE and shares its favourable properties.
In addition, for finite sample size the posterior variance will typically be smaller
than both the asymptotic posterior variance (for large 𝑛) and the prior variance,
showing that combining the information in the prior and in the data leads to a
more efficient estimate.

10.3 Estimating the mean using the Normal-Normal model
10.3.1 Normal likelihood
For the likelihood we assume as data-generating model the normal distribution
with known fixed variance 𝜎2

𝑥|𝜇 ∼ 𝑁(𝜇, 𝜎2 )

This yields the MLE $\hat{\mu}_{ML} = \bar{x}$.

10.3.2 Normal prior distribution


To model the uncertainty about 𝜇 we use the normal distribution 𝑁(𝜇, 𝜎2 /𝑘)
parameterised by the two parameters 𝜇 and 𝑘 (remember 𝜎2 is fixed).
With 𝜇 = 𝜇0 and 𝑘 = 𝑚 we get the normal prior

𝜇 ∼ 𝑁(𝜇0 , 𝜎2 /𝑚)

with prior mean $\mathrm{E}(\mu) = \mu_0$ and prior variance $\mathrm{Var}(\mu) = \frac{\sigma^2}{m}$, where 𝑚 is the implied
sample size from the prior. Note that 𝑚 does not need to be an integer value!

10.3.3 Normal posterior distribution


The posterior distribution after observing 𝑛 samples 𝑥1 , . . . , 𝑥 𝑛 is normal with
𝜇 = 𝜇1 and 𝑘 = 𝑚 + 𝑛:

$$\mu|x_1, \dots, x_n \sim N\left(\mu_1, \sigma^2/(m+n)\right)$$

with posterior mean

$$\mathrm{E}(\mu|x_1, \dots, x_n) = \mu_1 = \frac{m\mu_0 + n\bar{x}}{n+m} = \lambda\mu_0 + (1-\lambda)\hat{\mu}_{ML}$$

with $\lambda = \frac{m}{n+m}$. Note the linear shrinkage of $\hat{\mu}_{ML}$ towards 𝜇0 !

The corresponding posterior variance is

$$\mathrm{Var}(\mu|x_1, \dots, x_n) = \frac{\sigma^2}{n+m}$$

Thus, the normal distribution is the conjugate distribution for the mean
parameter of the normal likelihood.
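A short R sketch of the Normal-Normal update; the known variance, the prior settings and the simulated data are example values.

sigma <- 1; m <- 5; mu0 <- 0               # prior N(mu0, sigma^2/m)
set.seed(1)
xs <- rnorm(20, 2, sigma)                  # example data with true mean 2
n <- length(xs); lambda <- m/(m + n)
mu1 <- lambda*mu0 + (1 - lambda)*mean(xs)  # posterior mean (shrunken)
v1  <- sigma^2/(m + n)                     # posterior variance
c(mu1, mean(xs), v1)                       # posterior mean vs MLE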

10.3.4 Large sample asymptotics and Stein paradox


For 𝑛 large and 𝑛 ≫ 𝑚 we get

$$\mathrm{E}(\mu|x_1, \dots, x_n) \overset{a}{=} \hat{\mu}_{ML}$$

$$\mathrm{Var}(\mu|x_1, \dots, x_n) \overset{a}{=} \frac{\sigma^2}{n}$$

i.e. the MLE and its asymptotic variance!

Note that the posterior variance $\frac{\sigma^2}{n+m}$ is smaller than both the asymptotic variance $\frac{\sigma^2}{n}$
and the prior variance $\frac{\sigma^2}{m}$.

10.4 Estimating the variance using the inverse-Gamma-Normal model
10.4.1 Inverse Gamma distribution
Next, we study a common Bayesian model for estimating the variance parameter
of the normal distribution. For this we use the inverse Gamma distribution:

𝑥 ∼ Inv-Gam(𝛼, 𝛽)

This distribution is closely linked with the Gamma distribution — the inverse of
𝑥 is Gamma-distributed with inverted scale parameter:

$$\frac{1}{x} \sim \text{Gam}(\alpha, \beta^{-1})$$

For use as prior and posterior we employ a different parameterisation with
𝜇 = 𝛽/(𝛼 − 1) and 𝑘 = 2(𝛼 − 1):

$$x \sim \text{Inv-Gam}\left(\alpha = \frac{k+2}{2},\; \beta = \frac{k\mu}{2}\right) = \text{Inv-Gam}(\mu, k)$$

The reason for choosing the mean parameterisation using 𝜇 and 𝑘 instead of 𝛼
and 𝛽 is that this parameterisation simplifies the Bayesian update rule for the
mean.

The first two moments of the inverse Gamma distribution are

$$\mathrm{E}(x) = \frac{\beta}{\alpha-1} = \mu$$

and

$$\mathrm{Var}(x) = \frac{\beta^2}{(\alpha-1)^2(\alpha-2)} = \frac{2\mu^2}{k-2}$$

The inverse Gamma distribution is also known under two further alternative
names: 1) inverse scaled chi-squared distribution and 2) one-dimensional inverse
Wishart distribution.

10.4.2 Normal likelihoood


As data likelihood / generating model we use normal distribution 𝑁(𝜇, 𝜎2 ) with
given fixed mean 𝜇.
Í𝑛
𝜎𝑀𝐿
This yields as MLE b2
= 1
𝑛 𝑖=1 (𝑥 𝑖 − 𝜇)2

10.4.3 Inverse Gamma prior distribution


For the prior distribution we use the inverse Gamma distribution with 𝑘 = 𝑚
and 𝜇 = 𝜎02
𝜎 2 ∼ Inv-Gam(𝜇 = 𝜎02 , 𝑘 = 𝑚)
The corresponding prior mean is

E(𝜎2 ) = 𝜎02

and the prior variance is


2𝜎04
Var(𝜎2 ) =
𝑚−2
(note that 𝑚 > 2)

10.4.4 Inverse Gamma posterior distribution


As the inverse Gamma distribution is conjugate to the normal likelihood the
posterior distribution is inverse Gamma as well:

$$\sigma^2|x_1, \dots, x_n \sim \text{Inv-Gam}(\mu = \sigma_1^2, k = m+n)$$

with

$$\sigma_1^2 = \frac{m\sigma_0^2 + n\hat{\sigma}^2_{ML}}{m+n}$$

The posterior mean is

$$\mathrm{E}(\sigma^2|x_1, \dots, x_n) = \sigma_1^2$$

and the posterior variance

$$\mathrm{Var}(\sigma^2|x_1, \dots, x_n) = \frac{2\sigma_1^4}{m+n-2}$$

The update formula for the posterior mean of the variance follows the usual
linear shrinkage rule:

$$\sigma_1^2 = \lambda\sigma_0^2 + (1-\lambda)\hat{\sigma}^2_{ML}$$

with $\lambda = \frac{m}{m+n}$.
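Again the update is a one-line shrinkage computation. An R sketch with example prior settings and data:

mu <- 0                                   # known fixed mean
set.seed(1)
xs <- rnorm(30, mu, 1.5)                  # example data
n <- length(xs)
s2_ml <- mean((xs - mu)^2)                # MLE of the variance
m <- 4; s2_0 <- 1                         # prior Inv-Gam(mu = s2_0, k = m)
lambda <- m/(m + n)
s2_1 <- lambda*s2_0 + (1 - lambda)*s2_ml  # posterior mean of the variance
post_var <- 2*s2_1^2/(m + n - 2)          # posterior variance
c(s2_1, s2_ml, post_var)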

10.4.5 Large sample asymptotics


For 𝑛 large and 𝑛 ≫ 𝑚 we get

$$\mathrm{E}(\sigma^2|x_1, \dots, x_n) \overset{a}{=} \hat{\sigma}^2_{ML}$$

$$\mathrm{Var}(\sigma^2|x_1, \dots, x_n) \overset{a}{=} \frac{2\sigma^4}{n}$$

which is indeed the MLE of 𝜎2 and its asymptotic variance!

10.4.6 Gamma-Normal model for the precision


Instead of estimating the variance we may wish to estimate the precision, i.e. the
inverse variance. In the above we have used an inverse Gamma distribution for
the prior and posterior of the variance. Thus, to model the precision we may
therefore use a Gamma prior distribution and a normal likelihood, resulting in a
Gamma posterior distribution.

10.4.7 Joint estimation of mean and variance


It is possible to combine the Normal-Normal model for the mean and the inverse-
Gamma-Normal model into a joint model for the mean and variance.
This implies having a joint prior and a joint posterior for 𝜇 and 𝜎 2 .
The resulting joint point estimates are identical to the above individual estimates.
Chapter 11

Bayesian model comparison

11.1 Marginal likelihood as model likelihood


11.1.1 Simple and composite models
In the introduction to Bayesian learning we already encountered the marginal
likelihood 𝑝(𝐷|𝑀) of a model class 𝑀 in the denominator of Bayes’ rule:

$$p(\boldsymbol{\theta}|D, M) = \frac{p(\boldsymbol{\theta}|M)\, p(D|\boldsymbol{\theta}, M)}{p(D|M)}$$
Computing this marginal likelihood is different for simple and composite models.
A model is called “simple” if it directly corresponds to a specific distribution,
say, a normal with fixed mean and variance, or a binomial distribution with a
set probability for the two classes. Thus, a simple model is a point in the model
space described by the parameters of a distribution family (e.g. 𝜇 and 𝜎2 for the
normal family 𝑁(𝜇, 𝜎2 )). For a simple model 𝑀 the density 𝑝(𝐷|𝑀) corresponds
to the standard likelihood of 𝑀 and there are no free parameters.
On the other hand, a model is “composite” if it is composed of simple models.
This can be a finite set, or it can be comprised of an infinite number of simple
models. Thus a composite model represents a model class. For example, a
normal model with a given mean but unspecified variance, or a binomial model with
unspecified parameter 𝑝, is a composite model.
If 𝑀 is a composite model, with the underlying simple models indexed by a
parameter 𝜽, the likelihood of the model is obtained by marginalisation over 𝜽:

$$p(D|M) = \int_{\boldsymbol{\theta}} p(D|\boldsymbol{\theta}, M)\, p(\boldsymbol{\theta}|M)\,d\boldsymbol{\theta} = \int_{\boldsymbol{\theta}} p(D, \boldsymbol{\theta}|M)\,d\boldsymbol{\theta}$$


i.e. we integrate over all parameter values 𝜽.


If the distribution over the parameter 𝜽 of a model is strongly concentrated
around a specific value 𝜽0 then the composite model degenerates to a simple
point model, and the marginal likelihood becomes the likelihood of the parameter
𝜽0 under that model.
Example 11.1. Beta-Binomial distribution:
Assume that likelihood is binomial with mean parameter 𝑝. If 𝑝 follows a
Beta distribution then the marginal likelihood with 𝑝 integrated out is the
Beta-Binomial distribution (see also Worksheet B2). This is an example of a
compound probability distribution.
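For the Beta-Binomial model the marginal likelihood is available in closed form and is easy to compute. A small R sketch, where the prior parameters and the data are example values (for large 𝑛 one would work on the log scale with lbeta and lchoose):

marg_lik <- function(x, n, a, b) choose(n, x)*beta(a + x, b + n - x)/beta(a, b)
marg_lik(14, 20, 2, 2)   # marginal likelihood of the composite model class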

11.1.2 Log-marginal likelihood as penalised maximum log-likelihood
By rearranging Bayes’ rule we see that

$$\log p(D|M) = \log p(D|\boldsymbol{\theta}, M) - \log \frac{p(\boldsymbol{\theta}|D, M)}{p(\boldsymbol{\theta}|M)}$$

The above is valid for all 𝜽.


Assuming concentration of the posterior around the MLE 𝜽ˆ ML we will have
𝑝(𝜽ˆ ML |𝐷, 𝑀) > 𝑝(𝜽ˆ ML |𝑀) and thus

𝑝(𝜽ˆ ML |𝐷, 𝑀)
log 𝑝(𝐷|𝑀) = log 𝑝(𝐷| 𝜽ˆ ML , 𝑀) − log
| {z } 𝑝(𝜽ˆ ML |𝑀)
maximum log-likelihood | {z }
penalty > 0

Therefore, the log-marginal likelihood is essentially a penalised version of the
maximum log-likelihood, where the penalty depends on the concentration of the
posterior around the MLE.

11.1.3 Model complexity and Occam’s razor

Intriguingly, the penalty implicit in the log-marginal likelihood is linked to the
complexity of the model, in particular to the number of parameters of 𝑀. We
will see this directly in the Schwarz approximation of the log-marginal likelihood
discussed below.
Thus, the averaging over 𝜽 in the marginal likelihood has the effect of automati-
cally penalising complex models. Therefore, when comparing models using the
marginal likelihood a complex model may be ranked below simpler models. In
contrast, when selecting a model by comparing maximum likelihood directly the
model with the highest number of parameters always wins over simpler models.

Hence, the penalisation implicit in the marginal likelihood prevents the overfitting
that occurs with maximum likelihood.

The principle of preferring a less complex model is called Occam’s razor or the
law of parsimony.
When choosing models a simpler model is often preferable over a more complex
model, because the simpler model is typically better suited to both explaining
the currently observed data as well as future data, whereas a complex model
will typically only excel in fitting the current data but will perform poorly in
prediction.

11.2 The Bayes factor for comparing two models


11.2.1 Definition of the Bayes factor
The Bayes factor is the ratio of the likelihoods of the two models:

$$B_{12} = \frac{p(D|M_1)}{p(D|M_2)}$$

The log-Bayes factor log 𝐵12 is also called the weight of evidence for 𝑀1 over
𝑀2 .

11.2.2 Bayes theorem in terms of the Bayes factor


We would like to compare two models 𝑀1 and 𝑀2 . Before seeing data 𝐷 we
can check their prior odds (= ratio of prior probabilities of the models 𝑀1 and 𝑀2 ):

$$\frac{\Pr(M_1)}{\Pr(M_2)}$$

After seeing data 𝐷 = {𝑥1 , . . . , 𝑥 𝑛 } we arrive at the posterior odds (= ratio of
posterior probabilities):

$$\frac{\Pr(M_1|D)}{\Pr(M_2|D)}$$

Using Bayes’ theorem $\Pr(M_i|D) = \Pr(M_i)\,\frac{p(D|M_i)}{p(D)}$ we can rewrite the posterior
odds as

$$\underbrace{\frac{\Pr(M_1|D)}{\Pr(M_2|D)}}_{\text{posterior odds}} = \underbrace{\frac{p(D|M_1)}{p(D|M_2)}}_{\text{Bayes factor } B_{12}} \times \underbrace{\frac{\Pr(M_1)}{\Pr(M_2)}}_{\text{prior odds}}$$

The Bayes factor is the multiplicative factor that updates the prior odds to the
posterior odds.

On the log scale we see that

log-posterior odds = weight of evidence + log-prior odds
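As a small illustration, the following R sketch computes the Bayes factor for a composite Beta-Binomial model 𝑀1 (with a uniform Beta(1, 1) prior on 𝑝) against the simple null model 𝑀2 with 𝑝 = 0.5; the data are example values.

x <- 14; n <- 20
pD_M1 <- choose(n, x)*beta(1 + x, 1 + n - x)/beta(1, 1)  # marginal likelihood of M1
pD_M2 <- dbinom(x, n, 0.5)                               # likelihood of simple M2
B12 <- pD_M1/pD_M2
c(B12, log(B12))   # weight of evidence; here only weak support for M1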

11.2.3 Scale for the Bayes factor


Following Harold Jeffreys (1961)1 one may interpret the strength of the Bayes
factor as follows:

𝐵12          log 𝐵12       evidence in favour of 𝑀1 versus 𝑀2
> 100        > 4.6         decisive
10 to 100    2.3 to 4.6    strong
3.2 to 10    1.16 to 2.3   substantial
1 to 3.2     0 to 1.16     not worth more than a bare mention

More recently, Kass and Raftery (1995)2 proposed the following slightly
modified scale:

𝐵12          log 𝐵12    evidence in favour of 𝑀1 versus 𝑀2
> 150        > 5        very strong
20 to 150    3 to 5     strong
3 to 20      1 to 3     positive
1 to 3       0 to 1     not worth more than a bare mention

11.2.4 Bayes factor versus likelihood ratio


If both 𝑀1 and 𝑀2 are simple models then the Bayes factor is identical to the
likelihood ratio of the two models.
However, if one of the two models is composite then the Bayes factor and the
generalised likelihood ratio differ: In the Bayes factor the representative of a
composite model is the model average of the simple models indexed by 𝜽, with
weights taken from the prior distribution over the simple models contained in
𝑀. In contrast, in the generalised likelihood ratio statistic the representative of a
composite model is chosen by maximisation.
Thus, for composite models, the Bayes factor does not equal the corresponding
generalised likelihood ratio statistic. In fact, the key difference is that the Bayes
factor is a penalised version of the likelihood ratio, with the penalty depending
on the difference in complexity (number of parameters) of the two models.
1Jeffreys, H. 1961. Theory of Probability. 3rd ed. Oxford University Press.
2Kass, R. E., and A. E. Raftery. 1995. Bayes factors. JASA 90:773–795.
https://doi.org/10.1080/01621459.1995.10476572

11.3 Approximate computations


The marginal likelihood and the Bayes factor can be difficult to compute in practice. Therefore, a number of approximations have been developed. The most important is the so-called Schwarz (1978) approximation of the log-marginal likelihood. It is used to approximate the log-Bayes factor and also yields the BIC (Bayesian information criterion), which can be interpreted as penalised maximum likelihood.

11.3.1 Schwarz (1978) approximation of log-marginal likelihood


The logarithm of the marginal likelihood of a model can be approximated following Schwarz (1978)³ as follows:

$$\log p(D|M) \approx l_n^M(\hat{\boldsymbol\theta}_{ML}^M) - \frac{d_M}{2} \log n$$

where $d_M$ is the dimension of the model $M$ (the number of parameters in $\boldsymbol\theta$ belonging to $M$), $n$ is the sample size and $\hat{\boldsymbol\theta}_{ML}^M$ is the MLE. For a simple model $d_M = 0$, so there is no approximation error, as in this case the marginal likelihood equals the likelihood.

The above formula can be obtained by quadratic approximation of the likelihood, assuming large $n$ and assuming that the prior is locally uniform around the MLE. The Schwarz (1978) approximation is therefore a special case of a Laplace approximation.

Note that the approximation is the maximum log-likelihood minus a penalty that depends on the model complexity (as measured by the dimension $d_M$), hence this is an example of penalised ML! Also note that the prior distribution over the parameter $\boldsymbol\theta$ is not required in the approximation.

11.3.2 Bayesian information criterion (BIC)


The BIC (Bayesian information criterion) of the model $M$ is the approximated log-marginal likelihood times the factor $-2$:

$$BIC(M) = -2\, l_n^M(\hat{\boldsymbol\theta}_{ML}^M) + d_M \log n$$

Thus, when comparing models one aims to maximise the marginal likelihood or, as an approximation, minimise the BIC.

The reason for the factor “$-2$” is simply to have a quantity that is on the same scale as the Wilks log-likelihood ratio. Some people / software packages also use the factor “2”.
³ Schwarz, G. 1978. Estimating the dimension of a model. Ann. Statist. 6:461–464. https://doi.org/10.1214/aos/1176344136

11.3.3 Approximating the weight of evidence (log-Bayes factor) with BIC

Using the BIC (twice) the log-Bayes factor can be approximated as

$$2 \log B_{12} \approx -BIC(M_1) + BIC(M_2) = 2\left( l_n^{M_1}(\hat{\boldsymbol\theta}_{ML}^{M_1}) - l_n^{M_2}(\hat{\boldsymbol\theta}_{ML}^{M_2}) \right) - \log(n)\,(d_{M_1} - d_{M_2})$$

i.e. it is the penalised log-likelihood ratio of model $M_1$ vs. $M_2$.
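As a brief illustration (a minimal sketch, not part of the original notes; the data set and model formulas are chosen purely for demonstration), the BIC approximation of the weight of evidence can be computed in R. Note that R's built-in BIC() also counts the error variance as a parameter, so its penalty differs slightly from the convention above; this extra term cancels in the difference of two models fitted to the same data.

```r
# Minimal sketch: BIC-based approximation of 2*log B12 for two nested
# linear models, using the built-in 'trees' data set.
m1 <- lm(Volume ~ Girth + Height, data = trees)  # larger model M1
m2 <- lm(Volume ~ Girth, data = trees)           # smaller model M2

# BIC(M) = -2 * maximised log-likelihood + d_M * log(n)
bic1 <- BIC(m1)
bic2 <- BIC(m2)

# 2 log B12 ~ -BIC(M1) + BIC(M2); positive values favour M1
two.log.B12 <- bic2 - bic1
two.log.B12
```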

11.4 Bayesian testing using false discovery rates


We introduce False Discovery Rates (FDR) as a Bayesian method to distinguish
a null model from an alternative model. This is closely linked with classical
frequentist multiple testing procedures.

11.4.1 Setup for testing a null model H₀ versus an alternative model H_A

We consider two models:

• $H_0$: the null model, with density $f_0(x)$ and distribution $F_0(x)$
• $H_A$: the alternative model, with density $f_A(x)$ and distribution $F_A(x)$

Aim: given observations $x_1, \ldots, x_n$ we would like to decide for each $x_i$ whether it belongs to $H_0$ or $H_A$.

This is done via a critical decision threshold $x_c$: if $x_i > x_c$ then $x_i$ is called “significant”, and otherwise “not significant”.
In classical statistics one of the most widely used approaches to find the decision threshold is to compute $p$-values from the $x_i$ (this uses only the null model but not the alternative model), and then to threshold the $p$-values at a certain level (say 5%). If $n$ is large then often the test is modified by adjusting the $p$-values or the threshold (e.g. by a Bonferroni correction).
Note that this procedure ignores any information we may have about the
alternative model!

11.4.2 Test errors


11.4.2.1 True and false positives and negatives
For any decision threshold 𝑥 𝑐 we can distinguish the following errors:
• False positives (FP), “false alarm”, type I error: $x_i$ belongs to the null but is called “significant”
• False negatives (FN), “miss”, type II error: $x_i$ belongs to the alternative but is called “not significant”
In addition we have:
• True positives (TP), “hits”: belongs to alternative and is called “significant”
• True negatives (TN), “correct rejections”: belongs to null and is called “not
significant”

11.4.2.2 Specificity and Sensitivity


From the counts of TP, TN, FN, FP we can derive further quantities:

• True Negative Rate (TNR), specificity: $TNR = \frac{TN}{TN+FP} = 1 - FPR$, where FPR is the False Positive Rate (the type I error rate)
• True Positive Rate (TPR), sensitivity, power, recall: $TPR = \frac{TP}{TP+FN} = 1 - FNR$, where FNR is the False Negative Rate (the type II error rate)
• Accuracy: $ACC = \frac{TP+TN}{TP+TN+FP+FN}$

Another common way to choose the decision threshold $x_c$ in classical statistics is to balance sensitivity/power vs. specificity (maximising both power and specificity, or equivalently, minimising both the false positive and the false negative rates). ROC curves plot TPR/sensitivity vs. FPR = 1 − specificity.

11.4.2.3 FDR and FNDR


It is possible to link the above with the observed counts of TP, FP, TN, FN:

• False Discovery Rate (FDR): $FDR = \frac{FP}{FP+TP}$
• False Nondiscovery Rate (FNDR): $FNDR = \frac{FN}{TN+FN}$
• Positive predictive value (PPV), True Discovery Rate (TDR), precision: $PPV = \frac{TP}{FP+TP} = 1 - FDR$
• Negative predictive value (NPV): $NPV = \frac{TN}{TN+FN} = 1 - FNDR$

In order to choose the decision threshold it is natural to balance FDR and FNDR (or PPV and NPV), by minimising both FDR and FNDR or maximising both PPV and NPV.
In machine learning it is common to use “precision-recall plots” that plot
precision (=PPV, TDR) vs. recall (=power, sensitivity).

11.4.3 Bayesian perspective


11.4.3.1 Two-component mixture model

In the Bayesian perspective the problem of choosing the decision threshold is related to computing the posterior probability

$$\Pr(H_0 | x_i),$$

i.e. the probability of the null model given the observation $x_i$, or equivalently computing

$$\Pr(H_A | x_i) = 1 - \Pr(H_0 | x_i),$$

the probability of the alternative model given the observation $x_i$.
This is done by assuming a mixture model

𝑓 (𝑥) = 𝜋0 𝑓0 (𝑥) + (1 − 𝜋0 ) 𝑓𝐴 (𝑥)

where $\pi_0 = \Pr(H_0)$ is the prior probability of $H_0$ and $\pi_A = 1 - \pi_0 = \Pr(H_A)$ the prior probability of $H_A$.

Note that the weight $\pi_0$ can in fact be estimated from the observations by fitting the mixture distribution to the observations $x_1, \ldots, x_n$ (so it is effectively an empirical Bayes method where the prior is informed by the data).

11.4.3.2 Local FDR


The posterior probability of the null model given a data point is then given by

$$\Pr(H_0 | x_i) = \frac{\pi_0 f_0(x_i)}{f(x_i)} = LFDR(x_i)$$

This quantity is also known as the local FDR or local False Discovery Rate.

In the given one-sided setup the local FDR is large (close to 1) for small $x$ and becomes close to 0 for large $x$. A common decision rule is given by thresholding local false discovery rates: if $LFDR(x_i) < 0.1$ then $x_i$ is called significant.

11.4.3.3 q-values

In correspondence to $p$-values one can also define tail-area based false discovery rates:

$$Fdr(x_i) = \Pr(H_0 | X > x_i) = \frac{\pi_0 F_0(x_i)}{F(x_i)}$$

These are called q-values, or simply False Discovery Rates (FDR). Intriguingly, these also have a frequentist interpretation as adjusted $p$-values (using the Benjamini-Hochberg adjustment procedure).

11.4.4 Software

There are a number of R packages to compute (local) FDR values, for example:

• locfdr
• qvalue
• fdrtool

and many more.

Using FDR values for screening is especially useful in high-dimensional settings (e.g. when analysing genomic and other high-throughput data).

FDR values have both a Bayesian as well as a frequentist interpretation, providing further evidence that good classical statistical methods do have a Bayesian interpretation.
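As a minimal sketch (not part of the original notes, and assuming the interface of the fdrtool package, i.e. the fdrtool() function with $lfdr and $qval components in its return value), local FDR values and q-values can be computed from a set of test statistics as follows:

```r
# Minimal sketch using 'fdrtool': z-scores from a two-component mixture
# (90% null N(0,1), 10% alternative shifted to mean 3).
library(fdrtool)
set.seed(1)
z <- c(rnorm(900), rnorm(100, mean = 3))

fdr <- fdrtool(z, statistic = "normal", plot = FALSE)

fdr$lfdr[1:5]         # local FDR = Pr(H0 | z_i)
fdr$qval[1:5]         # tail-area FDR (q-values)
sum(fdr$lfdr < 0.1)   # cases called "significant" at the 0.1 lfdr cutoff
```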
Chapter 12

Choosing priors in Bayesian analysis

12.1 Choosing a prior


12.1.1 Prior as part of the model
It is essential in a Bayesian analysis to specify your prior uncertainty about
the model parameters. Note that this is simply part of the modelling process!
Thus in a Bayesian approach the data analyst needs to be more explicit about all
modelling assumptions.

Typically, when choosing a suitable prior distribution we consider the overall form (shape and domain) of the distribution as well as its key characteristics such as the mean and variance. As we have learned, the precision (inverse variance) of the prior may often be viewed as an implied sample size.

For large sample size $n$ the posterior mean converges to the maximum likelihood estimate (and the posterior distribution to a normal distribution centered around the MLE), so for large $n$ we may ignore specifying a prior.

However, for small $n$ it is essential that a prior is specified. In non-Bayesian approaches this prior is still there, but it is either implicit (maximum likelihood estimation) or specified via a penalty (penalised maximum likelihood estimation).

12.1.2 Some guidelines


So the question remains what are good ways to choose a prior? Two useful ways
are:


1. Use a weakly informative prior. This means that you do have an idea (even if only vague) about the suitable values of the parameter of interest, and you use a corresponding prior (for example with moderate variance) to model the uncertainty. This acknowledges that there are no uninformative priors, but it also aims to ensure that the prior does not dominate the likelihood (i.e. the data). The result is a weakly regularised estimator. Note that it is often desirable that the prior adds information (if only a little) so that it can act as a regulariser.
2. Empirical Bayes methods can often be used to determine one or all of the hyperparameters (i.e. the parameters in the prior) from the observed data. There are several ways to do this; one of them is to tune the shrinkage parameter $\lambda$ to achieve minimum MSE. We discuss this further below.
Furthermore, there also exist many proposals advocating so-called “uninformative priors” or “objective priors”. However, there are no actually uninformative priors, since a prior distribution that looks uninformative (i.e. “flat”) in one coordinate system can be informative in another: this is a simple consequence of the rule for transformation of probability densities. As a result, the suggested objective priors are often in fact improper, i.e. not actually probability distributions!

12.2 Default priors or uninformative priors


Objective or default priors are attempts 1) to automate the specification of a prior and 2) to find uninformative priors.

12.2.1 Jeffreys prior


The most well-known non-informative prior is given by a proposal by Harold Jeffreys (1891–1989) in 1946.¹

Specifically, this prior is constructed from the expected Fisher information and thus promises automatic construction of objective uninformative priors using the likelihood:

$$p(\boldsymbol\theta) \propto \sqrt{\det \boldsymbol{I}^{\text{Fisher}}(\boldsymbol\theta)}$$

The reasoning underlying this prior is invariance against transformation of the coordinate system of the parameters.

For the Beta-Binomial model the Jeffreys prior corresponds to Beta(1/2, 1/2). Note this is not the uniform distribution but a U-shaped prior.

For the Normal-Normal model it corresponds to the flat improper prior $p(\mu) = 1$.

¹ Jeffreys, H. 1946. An invariant form for the prior probability in estimation problems. Proc. Roy. Soc. A 186:453–461. https://doi.org/10.1098/rspa.1946.0056

For the Inverse-Gamma-Normal model the Jeffreys prior is the improper prior $p(\sigma^2) = \frac{1}{\sigma^2}$.

This already illustrates the main problem with this type of prior, namely that it often is improper, i.e. the prior distribution is not actually a probability distribution (the density does not integrate to 1).
Another issue is that Jeffreys priors are usually not conjugate which complicates
the update from the prior to the posterior.
Furthermore, if there are multiple parameters (𝜽 is a vector) then Jeffreys priors
do not usually lead to sensible priors.

12.2.2 Reference priors


An alternative to Jeffreys priors are the so-called reference priors developed by Bernardo (1979).² This type of prior aims to choose the prior such that there is maximal “correlation” between the data and the parameter. More precisely, the mutual information between $\theta$ and $x$ is maximised (i.e. the expected KL divergence between the posterior and prior distribution). The underlying motivation is that the data and the parameters should be maximally linked (thereby minimising the influence of the prior).

For univariate settings the reference priors are identical to Jeffreys priors. However, reference priors also provide reasonable priors in multivariate settings.

In both Jeffreys' and the reference prior approach the choice of prior is made by expectation over the data, i.e. not for the specific data set at hand (this can be seen both as a positive and a negative!).

12.3 Empirical Bayes


In empirical Bayes the data analyst specifies a family of prior distributions (say a Beta distribution with free parameters), and then the data at hand are used to find an optimal choice for the hyper-parameters (hence the name “empirical”). Thus the hyper-parameters are not specified but are themselves estimated.

12.3.1 Type II maximum likelihood


In particular, assuming data $D$, a likelihood $p(D|\boldsymbol\theta)$ for some model with parameters $\boldsymbol\theta$ as well as a prior $p(\boldsymbol\theta|\lambda)$ for $\boldsymbol\theta$ with hyper-parameter $\lambda$, the marginal likelihood now depends on $\lambda$:

$$p(D|\lambda) = \int_{\boldsymbol\theta} p(D|\boldsymbol\theta)\, p(\boldsymbol\theta|\lambda)\, d\boldsymbol\theta$$

² Bernardo, J. M. 1979. Reference posterior distributions for Bayesian inference (with discussion). JRSS B 41:113–147. https://doi.org/10.1111/j.2517-6161.1979.tb01066.x

We can therefore use maximum (marginal) likelihood to find optimal values of $\lambda$ given the data.

Since maximum likelihood is used in a second-level step (for the hyper-parameters), this type of empirical Bayes is also often called “type II maximum likelihood”.

12.3.2 Shrinkage estimation using empirical risk minimisation


An alternative (but related) way to estimate hyper-parameters is by minimising
the empirical risk.
In the examples for Bayesian estimation that we have considered so far the posterior mean of the parameter of interest was obtained by linear shrinkage

$$\hat\theta_{\text{shrink}} = \text{E}(\theta | x_1, \ldots, x_n) = \lambda \theta_0 + (1 - \lambda)\, \hat\theta_{ML}$$

of the MLE $\hat\theta_{ML}$ towards the prior mean $\theta_0$, with shrinkage intensity $\lambda = \frac{m}{m+n}$ determined by the parameter $m$ (the implicit sample size) and the sample size $n$. The resulting point estimate $\hat\theta_{\text{shrink}}$ is called a shrinkage estimate and is a convex combination of $\theta_0$ and $\hat\theta_{ML}$. The prior mean $\theta_0$ is also called the “target”.

The hyper-parameter in this setting is $m$ (linked to the precision of the prior), or equivalently the shrinkage intensity $\lambda$.
An optimal value for $\lambda$ can be obtained by minimising the mean squared error of the estimator $\hat\theta_{\text{shrink}}$.

In particular, by construction, the target $\theta_0$ has low or even zero variance but non-vanishing and potentially large bias, whereas the MLE $\hat\theta_{ML}$ will have low or zero bias but a substantial variance. By combining these two estimators with opposite properties the aim is to achieve a bias-variance tradeoff so that the resulting estimator $\hat\theta_{\text{shrink}}$ has lower MSE than either $\theta_0$ or $\hat\theta_{ML}$.

Specifically, the aim is to find

$$\lambda^\star = \arg\min_\lambda \text{E}\left( (\theta - \hat\theta_{\text{shrink}})^2 \right)$$

It turns out that this can be minimised without knowing the actual true value of $\theta$, and the result for an unbiased $\hat\theta_{ML}$ is

$$\lambda^\star = \frac{\text{Var}(\hat\theta_{ML})}{\text{E}\left((\hat\theta_{ML} - \theta_0)^2\right)}$$
Hence, the shrinkage intensity will be small if the variance of the MLE is small and/or if the target and the MLE differ substantially. On the other hand, if the variance of the MLE is large and/or the target is close to the MLE, the shrinkage intensity will be large.

Choosing the shrinkage parameter by optimising expected risk (here the mean squared error) is also a form of empirical Bayes.
Example 12.1. James-Stein estimator:

Empirical risk minimisation to estimate the shrinkage parameter of the Normal-Normal model for a single observation yields the James-Stein estimator. Specifically, James and Stein (1961) proposed the following estimate for the multivariate mean $\boldsymbol\mu$ using a single sample $\boldsymbol{x}$ drawn from the multivariate normal $N_d(\boldsymbol\mu, \boldsymbol{I})$:

$$\hat{\boldsymbol\mu}_{JS} = \left(1 - \frac{d-2}{||\boldsymbol{x}||^2}\right) \boldsymbol{x}$$

Here, we recognise $\hat{\boldsymbol\mu}_{ML} = \boldsymbol{x}$, the target $\boldsymbol\mu_0 = \boldsymbol{0}$ and the shrinkage intensity $\lambda^\star = \frac{d-2}{||\boldsymbol{x}||^2}$.

Efron and Morris (1972) and Lindley and Smith (1972) later generalised the James-Stein estimator to the case of multiple observations $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n$ and general target $\boldsymbol\mu_0$, yielding an empirical Bayes estimate of $\boldsymbol\mu$ based on the Normal-Normal model.
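A minimal numerical sketch of the James-Stein estimator (not part of the original notes; the true mean and dimension are arbitrary illustrative choices):

```r
# James-Stein estimator for a single draw x ~ N_d(mu, I).
set.seed(1)
d  <- 10
mu <- rep(1, d)                  # true mean vector (unknown in practice)
x  <- rnorm(d, mean = mu)        # single observation = MLE of mu

lambda <- (d - 2) / sum(x^2)     # shrinkage intensity (d-2)/||x||^2
mu.js  <- (1 - lambda) * x       # shrink towards the target mu0 = 0

# compare squared error losses of the MLE and the James-Stein estimate
sum((x - mu)^2)
sum((mu.js - mu)^2)
```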
Chapter 13

Optimality properties and summary

13.1 Bayesian statistics in a nutshell


• Bayesian statistics explicitly models the uncertainty about the parameters of interest using probability
• In the light of new evidence (observed data) the uncertainty is updated, i.e. the prior distribution is combined with the likelihood to form the posterior distribution
Example: Beta-Binomial model

• Binomial likelihood
• $n$ observations: $x$ “heads”, $n - x$ “tails”
• Frequency $\hat\theta_{ML} = \frac{x}{n}$
• Beta prior $\theta \sim \text{Beta}(\alpha_0, \beta_0)$ with mean $\theta_0 = \frac{\alpha_0}{m}$ and $m = \alpha_0 + \beta_0$
• Beta posterior $\theta | x, n \sim \text{Beta}(\alpha_1, \beta_1)$ with mean $\theta_1 = \frac{\alpha_1}{\alpha_1 + \beta_1}$, where $\alpha_1 = \alpha_0 + x$ and $\beta_1 = \beta_0 + n - x$
• Update of prior mean to posterior mean by shrinkage of the MLE:

$$\theta_1 = \lambda \theta_0 + (1 - \lambda)\, \hat\theta_{ML}$$

with shrinkage intensity $\lambda = \frac{m}{n+m}$
• $m$ can be interpreted as prior sample size

13.1.1 Remarks
• If the posterior is in the same family as the prior → conjugate prior.
• In an exponential family the Bayesian update of the mean is always
expressible as linear shrinkage of the MLE.


• For sample size $n \to \infty$ we have $\lambda \to 0$ and $\theta_1 \to \hat\theta_{ML}$ (for large samples the posterior mean equals the maximum likelihood estimate).

• For $n \to 0$ we have $\lambda \to 1$ and $\theta_1 \to \theta_0$ (if no data is available, fall back to the prior).

• Note that the Bayesian estimator is biased for finite $n$ by construction (but asymptotically unbiased like the MLE).

13.1.2 Advantages
• Adding prior information has regularisation properties. This is very im-
portant in more complex models with many parameters, e.g., in estimation
of a covariance matrix (to avoid singularity).

• Improves small-sample accuracy (e.g. MSE).

• That Bayesian estimators tend to perform better than MLEs is not surprising: they use the observed data plus extra information!

• Bayesian credible intervals are conceptually much simpler than frequentist confidence intervals.

13.1.3 Frequentist properties of Bayesian estimators


A Bayesian point estimator (e.g. the posterior mean) can also be assessed by its
frequentist properties.

• First, by construction, due to the introduction of a prior the Bayesian estimator will be biased for finite $n$ even if the MLE is unbiased.
• Second, intriguingly it turns out that the sampling variance of the Bayes
point estimator (not to be confused with the posterior variance!) can be
smaller than the variance of the MLE. This depends on the choice of the
shrinkage parameter 𝜆 that also determines the posterior variance.

As a result, Bayesian estimators may have smaller MSE (=squared bias + variance)
than the ML estimator for finite 𝑛.

In statistical decision theory this is called the theorem of admissibility of Bayes rules. It states that under mild conditions every admissible estimation rule (i.e. one that is not dominated by any other estimator with regard to some expected loss, such as the MSE) is in fact a Bayes estimator with some prior.

Unfortunately, this theorem does not tell us which prior is needed to achieve optimality; however, an optimal estimator can often be found by tuning the hyper-parameter $\lambda$.

13.1.4 Specifying the prior — problem or advantage?

In Bayesian statistics the analyst needs to be very explicit about the modelling assumptions:

Model = data generating process (likelihood) + prior uncertainty (prior distribution)

Note that alternative statistical methods can often be interpreted as Bayesian methods assuming a specific implicit prior!

For example, likelihood estimation for the binomial model is equivalent to Bayes estimation using the Beta-Binomial model with a Beta(0, 0) prior (= Haldane prior).

However, when choosing a prior explicitly for this model, interestingly most analysts would rather use a flat prior Beta(1, 1) (= Laplace prior) with implicit sample size $m = 2$, or a transformation-invariant prior Beta(1/2, 1/2) (= Jeffreys prior) with implicit sample size $m = 1$, than the Haldane prior!

→ be aware of the implicit priors!

It is better to acknowledge that a prior is being used (even if implicitly!). Being specific about all your assumptions is enforced by the Bayesian approach. Specifying a prior is thus best understood as an intrinsic part of model specification. It helps to improve inference, and it may only be ignored if there is lots of data.

13.2 Optimality of Bayesian inference


The optimality of Bayesian modelling making use of the full model specification (likelihood plus prior) can be shown from a number of different perspectives. Correspondingly, there are many theorems that prove (or at least indicate) this optimality:
1) Richard Cox’s theorem: generalising classical logic invariably leads to
Bayesian inference.
2) de Finetti’s representation theorem: joint distribution of exchangeable
observations can always be expressed as weighted mixture over a prior
distribution for the parameter of the model. This implies the existence of
the prior distribution and the requirement of a Bayesian approach.
3) Frequentist decision theory: all admissible decision rules are Bayes rules!
4) Entropy perspective: The posterior density (a function!) is obtained as a
result of optimising an entropy criterion. Bayesian updating may thus be
viewed as a variational optimisation problem. Specifically, Bayes theorem is
the minimal update when new information arrives in form of observations
(see below).

Remark: there exist a number of further (often somewhat esoteric) suggestions for propagating uncertainty such as “fuzzy logic”, imprecise probabilities, etc. These contradict Bayesian learning and are thus in direct violation of the above theorems.

13.3 Connection with entropy learning


The Bayesian update rule is a very general form of learning when the new information
arrives in the form of data. But actually there is an even more general principle
of which the Bayesian update rule is just a special case: the principle of
minimal information update (e.g. Jaynes 1959, 2003) or principle of minimum
information discrimination (MDI) (Kullback 1959).
It can be summarised as follows: Change your beliefs only as much as necessary
to be coherent with new evidence!
Under this principle of “inertia of beliefs” when new information arrives the
uncertainty about a parameter is only minimally adjusted, only as much as
needed to account for the new information. To implement this principle KL
divergence is a natural measure to quantify the change of the underlying beliefs.
This is known as entropy learning.
The Bayes rule emerges as a special case of entropy learning:
• The KL divergence between the joint posterior 𝑄 𝑥,𝜽 and joint prior dis-
tribution 𝑃𝑥,𝜽 is computed, with the posterior distribution 𝑄 𝜽|𝑥 as free
parameter.
• The conditional distribution 𝑄 𝜽|𝑥 is found by minimising the KL divergence
𝐷KL (𝑄 𝑥,𝜽 , 𝑃𝑥,𝜽 ).
• The optimal solution to this variational optimisation problem is given by
Bayes’ rule!
This application of the KL divergence is an example of reverse KL optimisation (aka $I$-projection, see Part I of the notes). Intriguingly, this explains the zero-forcing property of Bayes' rule (because this is a general property of an $I$-projection).
Applying entropy learning therefore includes Bayesian learning as a special case:
1) If information arrives in form of data → update prior by Bayes’ theorem
(Bayesian learning).
Interestingly, entropy learning will lead to other update rules for other types of
information:
2) If information arrives in the form of another distribution → update using
R. Jeffrey’s rule of conditioning (1965).
3) If the information is presented in the form of constraints → Kullback's principle of minimum MDI (1959), E. T. Jaynes' maximum entropy (MaxEnt) principle (1957).
This shows (again) how fundamentally important KL divergence is in statistics.
It not only leads to likelihood inference (via forward KL) but also to Bayesian
learning, as well as to other forms of information updating (via reverse KL).
Furthermore, in Bayesian statistics relative entropy is useful to choose priors
(e.g. reference priors) and it also helps in (Bayesian) experimental design to
quantify the information provided by an experiment.

13.4 Conclusion
Bayesian statistics offers a coherent framework for statistical learning from data,
with methods for
• estimation
• testing
• model building
There are a number of theorems that show that “optimal” estimators (defined in
various ways) are all Bayesian.
It is conceptually very simple — but can be computationally very involved!
It provides a coherent generalisation of classical TRUE/FALSE logic (and there-
fore does not suffer from some of the inconsistencies prevalent in frequentist
statistics).
Bayesian statistics is a non-asymptotic theory; it works for any sample size. Asymptotically (large $n$) it is consistent and converges to the true model (like ML!). But Bayesian reasoning can also be applied to events that take place only once: no assumption of hypothetically infinitely many repetitions as in frequentist statistics is needed.
Moreover, many classical (frequentist) procedures may be viewed as approxima-
tions to Bayesian methods and estimators, so using classical approaches in the
correct application domain is perfectly in line with the Bayesian framework.
Bayesian estimation and inference also automatically regularises (via the prior)
which is important for complex models and when there is the problem of
overfitting.
Part III

Regression

Chapter 14

Overview of regression modelling

14.1 General setup

• $y$: response variable, also known as outcome or label

• $x_1, x_2, x_3, \ldots, x_d$: predictor variables, also known as covariates or covariables

• The relationship between the outcome and the predictor variables is assumed to follow

$$y = f(x_1, x_2, \ldots, x_d) + \varepsilon$$

where $f$ is the regression function (not a density) and $\varepsilon$ represents noise.


14.2 Objectives
1. Understand the relationship between the response 𝑦 and the predictor
variables 𝑥 𝑖 by learning the regression function 𝑓 from observed data
(training data). The estimated regression function is 𝑓ˆ.
2. Prediction of outcomes

$$\hat{y} = \hat{f}(x_1, x_2, \ldots, x_d)$$

i.e. the predicted response using the fitted $\hat{f}$.

If instead of the fitted function $\hat{f}$ the known regression function $f$ is used we denote this by

$$y^\star = f(x_1, x_2, \ldots, x_d)$$

i.e. the predicted response using the known $f$.

3. Variable importance
• which covariates are most relevant in predicting the outcome?
• allows us to better understand the data and the model
→ variable selection (to build a simpler model with the same predictive capability)

14.3 Regression as a form of supervised learning


Regression modelling is a special case of supervised learning.

In supervised learning we make use of labelled data, i.e. each $\boldsymbol{x}_i$ has an associated label $y_i$. Thus, the data consists of pairs $(\boldsymbol{x}_1, y_1), (\boldsymbol{x}_2, y_2), \ldots, (\boldsymbol{x}_n, y_n)$.

The supervision part of supervised learning refers to the fact that the labels are given.

In regression the label $y_i$ is typically continuous and called the response.

On the other hand, if the label $y_i$ is discrete/categorical then supervised learning is called classification.

Supervised Learning −→ Discrete $y$ −→ Classification Methods
Supervised Learning −→ Continuous $y$ −→ Regression Methods

Another important type of statistical learning is unsupervised learning, where labels $y$ are inferred from the data $\boldsymbol{x}$ (this is also known as clustering). Furthermore, there is also semi-supervised learning, with labels only partly known.

Note that there are regression models (e.g. logistic regression) with discrete response that perform classification, so one may argue that “supervised learning” = “generalised regression”.

14.4 Various regression models used in statistics


In this course we only study linear multiple regression. However, you should be aware that the linear model is in fact just a special case of some much more general regression approaches.
General regression model:

$$y = f(x_1, \ldots, x_d) + \text{"noise"}$$

• Nonparametric regression: the function $f$ is estimated nonparametrically, e.g. using splines or Gaussian processes
• Generalised Additive Models (GAM): the function $f$ is assumed to be the sum of individual functions $f_i(x_i)$
• Generalised Linear Models (GLM): $f$ is a transformed linear predictor $h(\sum_i b_i x_i)$, and the noise is assumed to come from an exponential family
• Linear Model (LM): linear predictor $\sum_i b_i x_i$, normal noise

In R the linear model is implemented in the function lm(), and generalised linear
models in the function glm(). Generalised additive models are available in the
package “mgcv”.
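For illustration (a minimal sketch with simulated data, not part of the original notes; the variable names are arbitrary), the corresponding fitting functions can be called as follows:

```r
# Simulated data for demonstration purposes.
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(100)

fit.lm  <- lm(y ~ x1 + x2, data = d)                       # linear model
fit.glm <- glm(y ~ x1 + x2, family = gaussian, data = d)   # GLM (here identical to LM)
library(mgcv)                                              # for gam()
fit.gam <- gam(y ~ s(x1) + s(x2), data = d)                # GAM with spline terms
```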
In the following we focus on the linear regression model with continuous
response.
Chapter 15

Linear Regression

15.1 The linear regression model


In this module we assume that $f$ is a linear function:

$$f(x_1, \ldots, x_d) = \beta_0 + \sum_{j=1}^d \beta_j x_j = y^\star$$

In vector notation:

$$f(\boldsymbol{x}) = \beta_0 + \boldsymbol\beta^T \boldsymbol{x} = y^\star$$

with $\boldsymbol\beta = (\beta_1, \ldots, \beta_d)^T$ and $\boldsymbol{x} = (x_1, \ldots, x_d)^T$.

Therefore, the linear regression model is

$$y = \beta_0 + \sum_{j=1}^d \beta_j x_j + \varepsilon = \beta_0 + \boldsymbol\beta^T \boldsymbol{x} + \varepsilon = y^\star + \varepsilon$$

where:
• 𝛽 0 is the intercept
• 𝜷 = (𝛽 1 , . . . , 𝛽 𝑑 )𝑇 are the regression coefficients
• 𝒙 = (𝑥1 , . . . , 𝑥 𝑑 )𝑇 is the predictor vector containing the predictor variables


15.2 Interpretation of regression coefficients and intercept

• The regression coefficient $\beta_i$ corresponds to the slope (first partial derivative) of the regression function in the direction of $x_i$. In other words, the gradient of $f(\boldsymbol{x})$ consists of the regression coefficients: $\nabla f(\boldsymbol{x}) = \boldsymbol\beta$
• The intercept $\beta_0$ is the offset at the origin ($x_1 = x_2 = \ldots = x_d = 0$).

15.3 Different types of linear regression

• Simple linear regression: $y = \beta_0 + \beta x + \varepsilon$ (a single predictor)
• Multiple linear regression: $y = \beta_0 + \sum_{j=1}^d \beta_j x_j + \varepsilon$ (multiple predictor variables)
• Multivariate regression: multivariate response $\boldsymbol{y}$

15.4 Distributional assumptions and properties


General assumptions:

• We treat $y$ and $x_1, \ldots, x_d$ as the primary observables that can be described by random variables.
• $\beta_0, \boldsymbol\beta$ are parameters to be inferred from the observations on $y$ and $x_1, \ldots, x_d$.
• Specifically, we will assume that the response and the predictors have a mean and a (co)variance:
  i. Response: $\text{E}(y) = \mu_y$ and $\text{Var}(y) = \sigma^2_y$. The variance of the response $\text{Var}(y)$ is also called the total variation.

  ii. Predictors: $\text{E}(x_i) = \mu_{x_i}$ (or $\text{E}(\boldsymbol{x}) = \boldsymbol\mu_{\boldsymbol{x}}$), $\text{Var}(x_i) = \sigma^2_{x_i}$ and $\text{Cor}(x_i, x_j) = \rho_{ij}$ (or $\text{Var}(\boldsymbol{x}) = \boldsymbol\Sigma_{\boldsymbol{x}}$). The signal variance $\text{Var}(y^\star) = \text{Var}(\beta_0 + \boldsymbol\beta^T \boldsymbol{x}) = \boldsymbol\beta^T \boldsymbol\Sigma_{\boldsymbol{x}} \boldsymbol\beta$ is also called the explained variation.
• We assume that $y$ and $\boldsymbol{x}$ are jointly distributed with correlation $\text{Cor}(y, x_j) = \rho_{y,x_j}$ between each predictor variable $x_j$ and the response $y$.
• In contrast to $y$ and $\boldsymbol{x}$, the noise variable $\varepsilon$ is only indirectly observed via the difference $\varepsilon = y - y^\star$. We denote the mean and variance of the noise by $\text{E}(\varepsilon)$ and $\text{Var}(\varepsilon)$. The noise variance $\text{Var}(\varepsilon)$ is also called the unexplained variation or the residual variance. The residual standard error is $\text{SD}(\varepsilon)$.
Identifiability assumptions:

In a statistical analysis we would like to be able to separate the signal ($y^\star$) from the noise ($\varepsilon$). To achieve this we require some distributional assumptions to ensure identifiability and avoid confounding:

1) Assumption 1: $\varepsilon$ and $y^\star$ are independent. This implies $\text{Var}(y) = \text{Var}(y^\star) + \text{Var}(\varepsilon)$, or equivalently $\text{Var}(\varepsilon) = \text{Var}(y) - \text{Var}(y^\star)$. Thus, this assumption implies the decomposition of variance, i.e. that the total variation $\text{Var}(y)$ equals the sum of the explained variation $\text{Var}(y^\star)$ and the unexplained variation $\text{Var}(\varepsilon)$.
2) Assumption 2: $\text{E}(\varepsilon) = 0$. This allows us to identify the intercept $\beta_0$ and implies $\text{E}(y) = \text{E}(y^\star)$.
Optional assumptions (often but not always made):

• The noise $\varepsilon$ is normally distributed
• The response $y$ and the predictor variables $x_i$ are continuous variables
• The response and predictor variables are jointly normally distributed
Further properties:

• As a result of the independence assumption 1) we can only choose two out of the three variances freely:
  i. in a generative perspective we will choose the signal variance $\text{Var}(y^\star)$ (or equivalently the variances $\text{Var}(x_j)$) and the noise variance $\text{Var}(\varepsilon)$; the variance of the response $\text{Var}(y)$ then follows.
  ii. in an observational perspective we will observe the variance of the response $\text{Var}(y)$ and the variances $\text{Var}(x_j)$, and then the error variance $\text{Var}(\varepsilon)$ follows.
• As we will see later, the regression coefficients $\beta_j$ depend on the correlations between the response $y$ and the predictor variables $x_j$. Thus, the choice of regression coefficients implies a specific correlation pattern, and vice versa (in fact, we will use this correlation pattern to infer the regression coefficients from data!).

15.5 Regression in data matrix notation


We can also write the regression in terms of actual observed data (rather than in terms of random variables):

Data matrix for the predictors:

$$\boldsymbol{X} = \begin{pmatrix} x_{11} & \ldots & x_{1d} \\ \vdots & \ddots & \vdots \\ x_{n1} & \ldots & x_{nd} \end{pmatrix}$$

Note the statistics convention: the $n$ rows of $\boldsymbol{X}$ contain the samples, and the $d$ columns contain the variables.

Response data vector: $\boldsymbol{y} = (y_1, \ldots, y_n)^T$

Then the regression equation in data matrix notation is

$$\underbrace{\boldsymbol{y}}_{n \times 1} = \underbrace{1_n}_{n \times 1} \beta_0 + \underbrace{\boldsymbol{X}}_{n \times d} \underbrace{\boldsymbol\beta}_{d \times 1} + \underbrace{\boldsymbol\varepsilon}_{n \times 1,\ \text{residuals}}$$

where $1_n = (1, \ldots, 1)^T$ is a column vector of length $n$ (size $n \times 1$).

Note that here the regression coefficients are multiplied after the data matrix (compare with the original vector notation, where the transpose of the regression coefficients comes before the vector of predictors).

The observed noise values (i.e. the realisations of the random variable $\varepsilon$) are called the residuals.

15.6 Centering and vanishing of the intercept 𝛽 0


If $\boldsymbol{x}$ and $y$ are centered, i.e. if $\text{E}(\boldsymbol{x}) = \boldsymbol\mu_{\boldsymbol{x}} = 0$ and $\text{E}(y) = \mu_y = 0$, then the intercept $\beta_0$ disappears:

The regression equation is

$$y = \beta_0 + \boldsymbol\beta^T \boldsymbol{x} + \varepsilon$$

with $\text{E}(\varepsilon) = 0$. Taking the expectation on both sides we get $\mu_y = \beta_0 + \boldsymbol\beta^T \boldsymbol\mu_{\boldsymbol{x}}$ and therefore

$$\beta_0 = \mu_y - \boldsymbol\beta^T \boldsymbol\mu_{\boldsymbol{x}}$$

This is zero if the mean of the response $\mu_y$ and the mean of the predictors $\boldsymbol\mu_{\boldsymbol{x}}$ vanish. Conversely, if we assume that the intercept vanishes ($\beta_0 = 0$), this is only possible for general $\boldsymbol\beta$ if both $\boldsymbol\mu_{\boldsymbol{x}} = 0$ and $\mu_y = 0$.

Thus, in the linear model it is always possible to transform $y$ and $\boldsymbol{x}$ (or data $\boldsymbol{y}$ and $\boldsymbol{X}$) so that the intercept vanishes. To simplify equations we will therefore often set $\beta_0 = 0$.

15.7 Objectives in data analysis using linear regression

1. Understand the functional relationship: find estimates of the intercept ($\hat\beta_0$) and the regression coefficients ($\hat\beta_j$), as well as the associated errors.
2. Prediction:
   • Known coefficients $\beta_0$ and $\boldsymbol\beta$: $y^\star = \beta_0 + \boldsymbol\beta^T \boldsymbol{x}$
   • Estimated coefficients $\hat\beta_0$ and $\hat{\boldsymbol\beta}$ (note the “hat”!): $\hat{y} = \hat\beta_0 + \sum_{j=1}^d \hat\beta_j x_j = \hat\beta_0 + \hat{\boldsymbol\beta}^T \boldsymbol{x}$
   For each point prediction find the corresponding prediction error!
3. Variable importance: Which predictors $x_j$ are most relevant?
   → test whether $\beta_j = 0$
   → find measures of variable importance

Remark: as we will see, $\beta_j$ or $\hat\beta_j$ itself is not a measure of variable importance!
Chapter 16

Estimating regression coefficients

In this chapter we discuss various ways to estimate the regression coefficients. First, we discuss estimation by Ordinary Least Squares (OLS), i.e. by minimising the residual sum of squares. This yields the famous Gauss estimator. Second, we derive estimates of the regression coefficients using the method of maximum likelihood assuming normal errors. This also leads to the Gauss estimator. Third, we show that the coefficients in linear regression can be written and interpreted in terms of two covariance matrices, and that the Gauss estimator of the regression coefficients is a plug-in estimator using the MLEs of these covariance matrices. Furthermore, we show that the (population version of the) Gauss estimator can also be derived by finding the best linear predictor and by conditioning. Finally, we discuss special cases of regression coefficients and their relationship to marginal correlation.

16.1 Ordinary Least Squares (OLS) estimator of regression coefficients

Now we show the classic way (Gauss 1809; Legendre 1805) to estimate the regression coefficients by the method of ordinary least squares (OLS).

Idea: choose the regression coefficients so as to minimise the squared error between the observations and the prediction.


In data matrix notation (note we assume $\beta_0 = 0$ and thus centered data $\boldsymbol{X}$ and $\boldsymbol{y}$):

$$\text{RSS}(\boldsymbol\beta) = (\boldsymbol{y} - \boldsymbol{X}\boldsymbol\beta)^T (\boldsymbol{y} - \boldsymbol{X}\boldsymbol\beta)$$

RSS is an abbreviation for “Residual Sum of Squares”, which is a function of $\boldsymbol\beta$. Minimising the RSS yields the OLS estimate:

$$\hat{\boldsymbol\beta}_{OLS} = \arg\min_{\boldsymbol\beta} \text{RSS}(\boldsymbol\beta)$$

Expanding the RSS gives

$$\text{RSS}(\boldsymbol\beta) = \boldsymbol{y}^T\boldsymbol{y} - 2\boldsymbol\beta^T \boldsymbol{X}^T \boldsymbol{y} + \boldsymbol\beta^T \boldsymbol{X}^T \boldsymbol{X} \boldsymbol\beta$$

with gradient

$$\nabla \text{RSS}(\boldsymbol\beta) = -2\boldsymbol{X}^T \boldsymbol{y} + 2\boldsymbol{X}^T \boldsymbol{X} \boldsymbol\beta$$

Setting the gradient to zero, $\nabla \text{RSS}(\hat{\boldsymbol\beta}) = 0$, gives $\boldsymbol{X}^T \boldsymbol{y} = \boldsymbol{X}^T \boldsymbol{X} \hat{\boldsymbol\beta}$ and hence

$$\hat{\boldsymbol\beta}_{OLS} = \left( \boldsymbol{X}^T \boldsymbol{X} \right)^{-1} \boldsymbol{X}^T \boldsymbol{y}$$

Note the similarities of this procedure to maximum likelihood (ML) estimation (with minimisation instead of maximisation)! In fact, as we will see next, this is not by chance, as OLS is indeed a special case of ML. This also implies that OLS is generally a good method — but only if the sample size $n$ is large!

The above Gauss estimator is fundamental in statistics, so it is worthwhile to memorise it!
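A minimal sketch (simulated data, not part of the original notes) verifying that the Gauss estimator computed directly agrees with lm():

```r
# OLS estimator computed explicitly and via lm().
set.seed(1)
n <- 50; d <- 3
X <- matrix(rnorm(n * d), n, d)        # centered predictors (mean zero)
beta.true <- c(1, -2, 0.5)
y <- drop(X %*% beta.true) + rnorm(n)  # response with normal noise

beta.ols <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'y
cbind(beta.ols, coef(lm(y ~ X - 1)))        # agrees with lm() without intercept
```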

16.2 Maximum likelihood estimation of regression coefficients

16.2.1 Normal log-likelihood function for regression coefficients and noise variance

We now show how to estimate the regression coefficients using the method of maximum likelihood. This is a second method to derive $\hat{\boldsymbol\beta}$.
We recall the basic regression equation

𝑦 = 𝛽 0 + 𝜷𝑇 𝒙 + 𝜀

with independent noise 𝜀 and observed data 𝑦1 , . . . , 𝑦𝑛 and 𝒙 1 , . . . , 𝒙 𝑛 .


Assuming E(𝜀) = 0 the intercept is identified as

𝛽 0 = 𝜇 𝑦 − 𝜷𝑇 𝝁 𝒙

Combining the two above equations we see that the noise variable equals

𝜀 = (𝑦 − 𝜇 𝑦 ) − 𝜷𝑇 (𝒙 − 𝝁𝒙 )

Assuming joint (multivariate) normality for the observed data, the response 𝑦
and predictors 𝒙, we get as the MLEs for the respective means and (co)variances:
• $\hat\mu_y = \hat{\text{E}}(y) = \frac{1}{n} \sum_{i=1}^n y_i$
• $\hat\sigma^2_y = \widehat{\text{Var}}(y) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat\mu_y)^2$
• $\hat{\boldsymbol\mu}_{\boldsymbol{x}} = \hat{\text{E}}(\boldsymbol{x}) = \frac{1}{n} \sum_{i=1}^n \boldsymbol{x}_i$
• $\hat{\boldsymbol\Sigma}_{\boldsymbol{x}\boldsymbol{x}} = \widehat{\text{Var}}(\boldsymbol{x}) = \frac{1}{n} \sum_{i=1}^n (\boldsymbol{x}_i - \hat{\boldsymbol\mu}_{\boldsymbol{x}})(\boldsymbol{x}_i - \hat{\boldsymbol\mu}_{\boldsymbol{x}})^T$
• $\hat{\boldsymbol\Sigma}_{\boldsymbol{x}y} = \widehat{\text{Cov}}(\boldsymbol{x}, y) = \frac{1}{n} \sum_{i=1}^n (\boldsymbol{x}_i - \hat{\boldsymbol\mu}_{\boldsymbol{x}})(y_i - \hat\mu_y)$
Note that these are sufficient statistics and hence summarise the observed data for $\boldsymbol{x}$ and $y$ perfectly under the normal assumption.
Consequently, the residuals (indirect observations of the noise variable) for a
given choice of regression coefficients 𝜷 and the observed data for 𝒙 and 𝑦 are

𝜀𝑖 = (𝑦 𝑖 − 𝜇ˆ 𝑦 ) − 𝜷𝑇 (𝒙 𝑖 − 𝝁ˆ 𝒙 )

Assuming that the noise $\varepsilon \sim N(0, \sigma^2_\varepsilon)$ is normally distributed with mean 0 and variance $\text{Var}(\varepsilon) = \sigma^2_\varepsilon$, we can write down the normal log-likelihood function for $\sigma^2_\varepsilon$ and $\boldsymbol\beta$:

$$\log L(\boldsymbol\beta, \sigma^2_\varepsilon) = -\frac{n}{2} \log \sigma^2_\varepsilon - \frac{1}{2\sigma^2_\varepsilon} \sum_{i=1}^n \left( (y_i - \hat\mu_y) - \boldsymbol\beta^T (\boldsymbol{x}_i - \hat{\boldsymbol\mu}_{\boldsymbol{x}}) \right)^2$$

Maximising this function leads to the MLEs of $\sigma^2_\varepsilon$ and $\boldsymbol\beta$!

Note that the residual sum of squares appears in the log-likelihood function (with a minus sign), which implies that ML assuming a normal distribution will recover the OLS estimator for the regression coefficients! So OLS is a special case of ML!

16.2.2 Detailed derivation of the MLEs


The gradient with regard to $\boldsymbol\beta$ is

$$\nabla_{\boldsymbol\beta} \log L(\boldsymbol\beta, \sigma^2_\varepsilon) = \frac{1}{\sigma^2_\varepsilon} \sum_{i=1}^n \left( (\boldsymbol{x}_i - \hat{\boldsymbol\mu}_{\boldsymbol{x}})(y_i - \hat\mu_y) - (\boldsymbol{x}_i - \hat{\boldsymbol\mu}_{\boldsymbol{x}})(\boldsymbol{x}_i - \hat{\boldsymbol\mu}_{\boldsymbol{x}})^T \boldsymbol\beta \right) = \frac{n}{\sigma^2_\varepsilon} \left( \hat{\boldsymbol\Sigma}_{\boldsymbol{x}y} - \hat{\boldsymbol\Sigma}_{\boldsymbol{x}\boldsymbol{x}} \boldsymbol\beta \right)$$

Setting this equal to zero yields the Gauss estimator

$$\hat{\boldsymbol\beta} = \hat{\boldsymbol\Sigma}_{\boldsymbol{x}\boldsymbol{x}}^{-1} \hat{\boldsymbol\Sigma}_{\boldsymbol{x}y}$$

By plug-in we then get the MLE of $\beta_0$ as

$$\hat\beta_0 = \hat\mu_y - \hat{\boldsymbol\beta}^T \hat{\boldsymbol\mu}_{\boldsymbol{x}}$$

Taking the derivative of $\log L(\hat{\boldsymbol\beta}, \sigma^2_\varepsilon)$ with regard to $\sigma^2_\varepsilon$ yields

$$\frac{\partial}{\partial \sigma^2_\varepsilon} \log L(\hat{\boldsymbol\beta}, \sigma^2_\varepsilon) = -\frac{n}{2\sigma^2_\varepsilon} + \frac{1}{2\sigma^4_\varepsilon} \sum_{i=1}^n (y_i - \hat{y}_i)^2$$

with $\hat{y}_i = \hat\beta_0 + \hat{\boldsymbol\beta}^T \boldsymbol{x}_i$ and the residuals $y_i - \hat{y}_i$ resulting from the fitted linear model. This leads to the MLE of the noise variance

$$\hat\sigma^2_\varepsilon = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$$

Note that the MLE $\hat\sigma^2_\varepsilon$ is a biased estimate of $\sigma^2_\varepsilon$. The unbiased estimate is $\frac{1}{n-d-1} \sum_{i=1}^n (y_i - \hat{y}_i)^2$, where $d$ is the dimension of $\boldsymbol\beta$ (i.e. the number of predictors).

16.2.3 Asymptotics
The advantage of using maximum likelihood is that we also get the (asymptotic) variance associated with each estimator and typically can also assume asymptotic normality.

Specifically, for $\hat{\boldsymbol\beta}$ we get via the observed Fisher information at the MLE an asymptotic estimator of its variance

$$\widehat{\text{Var}}(\hat{\boldsymbol\beta}) = \frac{1}{n}\, \hat\sigma^2_\varepsilon\, \hat{\boldsymbol\Sigma}_{\boldsymbol{x}\boldsymbol{x}}^{-1}$$

Similarly, for $\hat\beta_0$ we have

$$\widehat{\text{Var}}(\hat\beta_0) = \frac{1}{n}\, \hat\sigma^2_\varepsilon \left( 1 + \hat{\boldsymbol\mu}_{\boldsymbol{x}}^T \hat{\boldsymbol\Sigma}_{\boldsymbol{x}\boldsymbol{x}}^{-1} \hat{\boldsymbol\mu}_{\boldsymbol{x}} \right)$$

For finite sample size $n$ with known $\text{Var}(\varepsilon)$ one can show that the variances are

$$\text{Var}(\hat{\boldsymbol\beta}) = \frac{1}{n}\, \sigma^2_\varepsilon\, \hat{\boldsymbol\Sigma}_{\boldsymbol{x}\boldsymbol{x}}^{-1} \quad \text{and} \quad \text{Var}(\hat\beta_0) = \frac{1}{n}\, \sigma^2_\varepsilon \left( 1 + \hat{\boldsymbol\mu}_{\boldsymbol{x}}^T \hat{\boldsymbol\Sigma}_{\boldsymbol{x}\boldsymbol{x}}^{-1} \hat{\boldsymbol\mu}_{\boldsymbol{x}} \right)$$

and that the regression coefficients and the intercept are normally distributed according to

$$\hat{\boldsymbol\beta} \sim N_d(\boldsymbol\beta, \text{Var}(\hat{\boldsymbol\beta})) \quad \text{and} \quad \hat\beta_0 \sim N(\beta_0, \text{Var}(\hat\beta_0))$$

We may use this to test whether $\beta_j = 0$ and $\beta_0 = 0$.

16.3 Covariance plug-in estimator of regression coefficients

16.3.1 Regression coefficients in terms of covariances

We now try to understand the regression coefficients in terms of covariances (thus obtaining a third way to compute and estimate them).
We recall that the Gauss regression coefficients are given by

$$\hat{\boldsymbol\beta} = \left( \boldsymbol{X}^T \boldsymbol{X} \right)^{-1} \boldsymbol{X}^T \boldsymbol{y}$$

where $\boldsymbol{X}$ is the $n \times d$ data matrix (in statistics convention)

$$\boldsymbol{X} = \begin{pmatrix} x_{11} & \ldots & x_{1d} \\ \vdots & \ddots & \vdots \\ x_{n1} & \ldots & x_{nd} \end{pmatrix}$$

Note that we assume that the data matrix $\boldsymbol{X}$ is centered (i.e. the column sums $\boldsymbol{X}^T 1_n = 0$ are zero). Likewise $\boldsymbol{y} = (y_1, \ldots, y_n)^T$ is the response data vector (also centered, with $\boldsymbol{y}^T 1_n = 0$).

Noting that

$$\hat{\boldsymbol\Sigma}_{\boldsymbol{x}\boldsymbol{x}} = \frac{1}{n} (\boldsymbol{X}^T \boldsymbol{X})$$

is the MLE of the covariance matrix among $\boldsymbol{x}$, and

$$\hat{\boldsymbol\Sigma}_{\boldsymbol{x}y} = \frac{1}{n} (\boldsymbol{X}^T \boldsymbol{y})$$

is the MLE of the covariance between $\boldsymbol{x}$ and $y$, we see that the OLS estimate of the regression coefficients can be expressed as

$$\hat{\boldsymbol\beta} = \left( \hat{\boldsymbol\Sigma}_{\boldsymbol{x}\boldsymbol{x}} \right)^{-1} \hat{\boldsymbol\Sigma}_{\boldsymbol{x}y}$$

We can also write down a population version (with no hats!):

$$\boldsymbol\beta = \boldsymbol\Sigma_{\boldsymbol{x}\boldsymbol{x}}^{-1} \boldsymbol\Sigma_{\boldsymbol{x}y}$$

Thus, the OLS regression coefficients can be interpreted as a plug-in estimator using MLEs of covariances! In fact, we may also use the unbiased estimates, since the scale factor ($1/n$ or $1/(n-1)$) cancels out, so it does not matter which one you use.
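A minimal sketch (simulated data, not part of the original notes) illustrating the covariance plug-in form; as noted above, using the unbiased cov() estimates is fine since the scale factors cancel:

```r
# Regression coefficients as a covariance plug-in estimator.
set.seed(1)
n <- 100
X <- matrix(rnorm(3 * n), n, 3)
y <- drop(X %*% c(1, 0, -1)) + rnorm(n)

Sxx <- cov(X)                    # estimate of Sigma_xx (factor 1/(n-1))
Sxy <- cov(X, y)                 # estimate of Sigma_xy
beta.plugin <- solve(Sxx, Sxy)   # plug-in regression coefficients

coef(lm(y ~ X))[-1]              # slope estimates from lm() agree exactly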

16.3.2 Importance of positive definiteness of the estimated covariance matrix

Note that $\hat{\boldsymbol\Sigma}_{\boldsymbol{x}\boldsymbol{x}}$ is inverted in $\hat{\boldsymbol\beta} = \left( \hat{\boldsymbol\Sigma}_{\boldsymbol{x}\boldsymbol{x}} \right)^{-1} \hat{\boldsymbol\Sigma}_{\boldsymbol{x}y}$.

• Hence, the estimate $\hat{\boldsymbol\Sigma}_{\boldsymbol{x}\boldsymbol{x}}$ needs to be positive definite!
• But the MLE $\hat{\boldsymbol\Sigma}_{\boldsymbol{x}\boldsymbol{x}}$ is only positive definite if $n > d$!

Therefore we can use the ML estimate (empirical estimator) only for large $n > d$; otherwise we need to employ a different (regularised) estimation approach (e.g. Bayes or penalised ML)!

Remark: writing $\hat{\boldsymbol\beta}$ explicitly based on covariance estimates has the advantage that we can construct plug-in estimators of regression coefficients based on regularised covariance estimators that improve over ML for small sample size. This leads to the so-called SCOUT method (= covariance-regularized regression, Witten and Tibshirani, 2008).

16.4 Standardised regression coefficients and their relationship to correlation

We recall the relationship between the regression coefficients $\boldsymbol\beta$, the marginal covariance $\boldsymbol\Sigma_{\boldsymbol{x}y}$ and the covariances among the predictors $\boldsymbol\Sigma_{\boldsymbol{x}\boldsymbol{x}}$:

$$\boldsymbol\beta = \boldsymbol\Sigma_{\boldsymbol{x}\boldsymbol{x}}^{-1} \boldsymbol\Sigma_{\boldsymbol{x}y}$$

We can rewrite the regression coefficients in terms of the marginal correlations $\boldsymbol{P}_{\boldsymbol{x}y}$ and the correlations $\boldsymbol{P}_{\boldsymbol{x}\boldsymbol{x}}$ among the predictors using the variance-correlation decompositions $\boldsymbol\Sigma_{\boldsymbol{x}\boldsymbol{x}} = \boldsymbol{V}_{\boldsymbol{x}}^{1/2} \boldsymbol{P}_{\boldsymbol{x}\boldsymbol{x}} \boldsymbol{V}_{\boldsymbol{x}}^{1/2}$ and $\boldsymbol\Sigma_{\boldsymbol{x}y} = \boldsymbol{V}_{\boldsymbol{x}}^{1/2} \boldsymbol{P}_{\boldsymbol{x}y}\, \sigma_y$:

$$\boldsymbol\beta = \underbrace{\boldsymbol{V}_{\boldsymbol{x}}^{-1/2}}_{\text{(inverse) scale of } x_i} \boldsymbol{P}_{\boldsymbol{x}\boldsymbol{x}}^{-1} \boldsymbol{P}_{\boldsymbol{x}y} \underbrace{\sigma_y}_{\text{scale of } y} = \boldsymbol{V}_{\boldsymbol{x}}^{-1/2}\, \boldsymbol\beta_{\text{std}}\, \sigma_y$$

Thus the regression coefficients $\boldsymbol\beta$ contain the scale of the variables, and take into account the correlations among the predictors ($\boldsymbol{P}_{\boldsymbol{x}\boldsymbol{x}}$) in addition to the marginal correlations between the response $y$ and the predictors $x_i$ ($\boldsymbol{P}_{\boldsymbol{x}y}$).
This decomposition allows us to understand a number of special cases for which the regression coefficients simplify further:

a) If the response and the predictors are standardised to have variance one, i.e. $\text{Var}(y) = 1$ and $\text{Var}(x_i) = 1$, then $\boldsymbol\beta$ becomes equal to the standardised regression coefficients

$$\boldsymbol\beta_{\text{std}} = \boldsymbol{P}_{\boldsymbol{x}\boldsymbol{x}}^{-1} \boldsymbol{P}_{\boldsymbol{x}y}$$

Note that standardised regression coefficients do not make use of variances and thus are scale-independent.

b) If there is no correlation among the predictors, i.e. $\boldsymbol{P}_{\boldsymbol{x}\boldsymbol{x}} = \boldsymbol{I}$, then the regression coefficients reduce to

$$\boldsymbol\beta = \boldsymbol{V}_{\boldsymbol{x}}^{-1} \boldsymbol\Sigma_{\boldsymbol{x}y}$$

where $\boldsymbol{V}_{\boldsymbol{x}}$ is a diagonal matrix containing the variances of the predictors. This is also called marginal regression. Note that the inversion of $\boldsymbol{V}_{\boldsymbol{x}}$ is trivial since you only need to invert each diagonal element individually.

c) If both a) and b) apply simultaneously (i.e. there is no correlation among the predictors, and the response and the predictors are standardised) then the regression coefficients simplify even further to

$$\boldsymbol\beta = \boldsymbol{P}_{\boldsymbol{x}y}$$

Thus, in this very special case the regression coefficients are identical to the correlations between the response and the predictors!
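A minimal sketch (simulated data, not part of the original notes) illustrating standardised regression coefficients and their relation to the ordinary coefficients:

```r
# Standardised regression coefficients via rescaling.
set.seed(1)
n <- 100
X <- matrix(rnorm(3 * n), n, 3) %*% diag(c(1, 5, 10))  # predictors on different scales
y <- drop(X %*% c(1, 0.2, 0.05)) + rnorm(n)

b <- coef(lm(y ~ X))[-1]           # ordinary regression coefficients

# standardised coefficients: fit on standardised response and predictors
y.std <- as.vector(scale(y))
X.std <- scale(X)
b.std <- coef(lm(y.std ~ X.std))[-1]

# equivalently: b.std = V_x^{1/2} b / sigma_y
b * apply(X, 2, sd) / sd(y)
```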

16.5 Further ways to obtain regression coefficients


16.5.1 Best linear predictor

The best linear predictor is a fourth way to arrive at the linear model. This is closely related to OLS and minimising the squared residual error.

Without assuming normality, the regression coefficients above can be shown to yield the optimal linear predictor under the minimum mean squared prediction error:

Assumptions:
• $y$ and $\boldsymbol{x}$ are random variables
• we construct a new variable (the linear predictor) $y^{\star\star} = b_0 + \boldsymbol{b}^T \boldsymbol{x}$ to optimally approximate $y$

Aim:
• choose $b_0$ and $\boldsymbol{b}$ so as to minimise the mean squared prediction error $\text{E}((y - y^{\star\star})^2)$

16.5.1.1 Result

The mean squared prediction error MSPE as a function of $(b_0, \boldsymbol{b})$ is

$$\begin{aligned} \text{E}((y - y^{\star\star})^2) &= \text{Var}(y - y^{\star\star}) + \left( \text{E}(y - y^{\star\star}) \right)^2 \\ &= \text{Var}(y - b_0 - \boldsymbol{b}^T \boldsymbol{x}) + \left( \text{E}(y) - b_0 - \boldsymbol{b}^T \text{E}(\boldsymbol{x}) \right)^2 \\ &= \sigma^2_y + \text{Var}(\boldsymbol{b}^T \boldsymbol{x}) + 2\, \text{Cov}(y, -\boldsymbol{b}^T \boldsymbol{x}) + (\mu_y - b_0 - \boldsymbol{b}^T \boldsymbol\mu_{\boldsymbol{x}})^2 \\ &= \sigma^2_y + \boldsymbol{b}^T \boldsymbol\Sigma_{\boldsymbol{x}\boldsymbol{x}} \boldsymbol{b} - 2\, \boldsymbol{b}^T \boldsymbol\Sigma_{\boldsymbol{x}y} + (\mu_y - b_0 - \boldsymbol{b}^T \boldsymbol\mu_{\boldsymbol{x}})^2 \\ &= \text{MSPE}(b_0, \boldsymbol{b}) \end{aligned}$$

We look for

$$(\beta_0, \boldsymbol\beta) = \arg\min_{b_0, \boldsymbol{b}} \text{MSPE}(b_0, \boldsymbol{b})$$

In order to find the minimum we compute the gradient with regard to $(b_0, \boldsymbol{b})$,

$$\nabla \text{MSPE} = \begin{pmatrix} -2(\mu_y - b_0 - \boldsymbol{b}^T \boldsymbol\mu_{\boldsymbol{x}}) \\ 2\, \boldsymbol\Sigma_{\boldsymbol{x}\boldsymbol{x}} \boldsymbol{b} - 2\, \boldsymbol\Sigma_{\boldsymbol{x}y} - 2 \boldsymbol\mu_{\boldsymbol{x}} (\mu_y - b_0 - \boldsymbol{b}^T \boldsymbol\mu_{\boldsymbol{x}}) \end{pmatrix}$$

and setting this equal to zero yields

$$\begin{pmatrix} \beta_0 \\ \boldsymbol\beta \end{pmatrix} = \begin{pmatrix} \mu_y - \boldsymbol\beta^T \boldsymbol\mu_{\boldsymbol{x}} \\ \boldsymbol\Sigma_{\boldsymbol{x}\boldsymbol{x}}^{-1} \boldsymbol\Sigma_{\boldsymbol{x}y} \end{pmatrix}$$

Thus, the optimal values for $b_0$ and $\boldsymbol{b}$ in the best linear predictor correspond to the previously derived coefficients $\beta_0$ and $\boldsymbol\beta$!

16.5.1.2 Irreducible error

The minimum achieved MSPE (= irreducible error) is

$$\text{MSPE}(\beta_0, \boldsymbol\beta) = \sigma^2_y - \boldsymbol\beta^T \boldsymbol\Sigma_{\boldsymbol{x}\boldsymbol{x}} \boldsymbol\beta = \sigma^2_y - \boldsymbol\Sigma_{y\boldsymbol{x}} \boldsymbol\Sigma_{\boldsymbol{x}\boldsymbol{x}}^{-1} \boldsymbol\Sigma_{\boldsymbol{x}y}$$

With the abbreviation $\Omega^2 = \boldsymbol{P}_{y\boldsymbol{x}} \boldsymbol{P}_{\boldsymbol{x}\boldsymbol{x}}^{-1} \boldsymbol{P}_{\boldsymbol{x}y} = \sigma_y^{-2} \boldsymbol\Sigma_{y\boldsymbol{x}} \boldsymbol\Sigma_{\boldsymbol{x}\boldsymbol{x}}^{-1} \boldsymbol\Sigma_{\boldsymbol{x}y}$ we can simplify this to

$$\text{MSPE}(\beta_0, \boldsymbol\beta) = \sigma^2_y (1 - \Omega^2) = \text{Var}(\varepsilon)$$

Writing $b_0 = \beta_0 + \Delta_0$ and $\boldsymbol{b} = \boldsymbol\beta + \boldsymbol\Delta$ it is easy to see that the mean squared prediction error is a quadratic function around the minimum:

$$\text{MSPE}(\beta_0 + \Delta_0, \boldsymbol\beta + \boldsymbol\Delta) = \text{Var}(\varepsilon) + \Delta_0^2 + \boldsymbol\Delta^T \boldsymbol\Sigma_{\boldsymbol{x}\boldsymbol{x}} \boldsymbol\Delta$$

Note that usually $y^\star = \beta_0 + \boldsymbol\beta^T \boldsymbol{x}$ does not perfectly approximate $y$, so there will be an irreducible error (= noise variance)

$$\text{Var}(\varepsilon) = \sigma^2_y (1 - \Omega^2) > 0$$

which implies $\Omega^2 < 1$.

The quantity $\Omega^2$ has a further interpretation as the population version of the squared multiple correlation coefficient between the response and the predictors, and it plays a vital role in the decomposition of variance, as discussed later.

16.5.2 Regression by conditioning


Conditioning is a fifth way to arrive at the linear model. This is also the most
general way and can be used to derive many other regression models (not just
the simple linear model).

16.5.2.1 General idea

• two random variables: $y$ (response, scalar) and $\boldsymbol{x}$ (predictor variables, vector)
• we assume that $y$ and $\boldsymbol{x}$ have a joint distribution $F_{y,\boldsymbol{x}}$
• compute the conditional random variable $y|\boldsymbol{x}$ and the corresponding distribution $F_{y|\boldsymbol{x}}$

16.5.2.2 Multivariate normal assumption


Now we assume that 𝑦 and 𝒙 are (jointly) multivariate normal. Then the
conditional distribution 𝐹 𝑦|𝒙 is a univariate normal with the following moments
(you can verify this by looking up the general conditional multivariate normal
distribution):
a) Conditional expectation:

$$\text{E}(y|\boldsymbol{x}) = y^\star = \beta_0 + \boldsymbol\beta^T \boldsymbol{x}$$

with coefficients $\boldsymbol\beta = \boldsymbol\Sigma_{\boldsymbol{x}\boldsymbol{x}}^{-1} \boldsymbol\Sigma_{\boldsymbol{x}y}$ and intercept $\beta_0 = \mu_y - \boldsymbol\beta^T \boldsymbol\mu_{\boldsymbol{x}}$.

Note that as $y^\star$ depends on $\boldsymbol{x}$ it is a random variable itself, with mean

$$\text{E}(y^\star) = \beta_0 + \boldsymbol\beta^T \boldsymbol\mu_{\boldsymbol{x}} = \mu_y$$

and variance

$$\text{Var}(y^\star) = \text{Var}(\text{E}(y|\boldsymbol{x})) = \boldsymbol\beta^T \boldsymbol\Sigma_{\boldsymbol{x}\boldsymbol{x}} \boldsymbol\beta = \boldsymbol\Sigma_{y\boldsymbol{x}} \boldsymbol\Sigma_{\boldsymbol{x}\boldsymbol{x}}^{-1} \boldsymbol\Sigma_{\boldsymbol{x}y} = \sigma^2_y\, \boldsymbol{P}_{y\boldsymbol{x}} \boldsymbol{P}_{\boldsymbol{x}\boldsymbol{x}}^{-1} \boldsymbol{P}_{\boldsymbol{x}y} = \sigma^2_y\, \Omega^2$$

b) Conditional variance:

$$\text{Var}(y|\boldsymbol{x}) = \sigma^2_y - \boldsymbol\beta^T \boldsymbol\Sigma_{\boldsymbol{x}\boldsymbol{x}} \boldsymbol\beta = \sigma^2_y - \boldsymbol\Sigma_{y\boldsymbol{x}} \boldsymbol\Sigma_{\boldsymbol{x}\boldsymbol{x}}^{-1} \boldsymbol\Sigma_{\boldsymbol{x}y} = \sigma^2_y (1 - \Omega^2)$$

Note this is a constant, so $\text{E}(\text{Var}(y|\boldsymbol{x})) = \sigma^2_y (1 - \Omega^2)$ as well.


Chapter 17

Squared multiple correlation


and variance decomposition in
linear regression

In this chapter we first introduce the (squared) multiple correlation and the
multiple and adjusted 𝑅 2 coefficients as estimators. Subsequently we discuss
variance decomposition.

17.1 Squared multiple correlation Ω² and the R² coefficient

In the previous chapter we encountered the following quantity:

$$\Omega^2 = \boldsymbol{P}_{y\boldsymbol{x}} \boldsymbol{P}_{\boldsymbol{x}\boldsymbol{x}}^{-1} \boldsymbol{P}_{\boldsymbol{x}y} = \sigma_y^{-2}\, \boldsymbol\Sigma_{y\boldsymbol{x}} \boldsymbol\Sigma_{\boldsymbol{x}\boldsymbol{x}}^{-1} \boldsymbol\Sigma_{\boldsymbol{x}y}$$

With $\boldsymbol\beta = \boldsymbol\Sigma_{\boldsymbol{x}\boldsymbol{x}}^{-1} \boldsymbol\Sigma_{\boldsymbol{x}y}$ and $\beta_0 = \mu_y - \boldsymbol\beta^T \boldsymbol\mu_{\boldsymbol{x}}$ it is straightforward to verify the following:

• the cross-covariance between $y$ and $y^\star$ is

$$\text{Cov}(y, y^\star) = \boldsymbol\Sigma_{y\boldsymbol{x}} \boldsymbol\beta = \boldsymbol\Sigma_{y\boldsymbol{x}} \boldsymbol\Sigma_{\boldsymbol{x}\boldsymbol{x}}^{-1} \boldsymbol\Sigma_{\boldsymbol{x}y} = \sigma^2_y\, \boldsymbol{P}_{y\boldsymbol{x}} \boldsymbol{P}_{\boldsymbol{x}\boldsymbol{x}}^{-1} \boldsymbol{P}_{\boldsymbol{x}y} = \sigma^2_y\, \Omega^2$$

• the (signal) variance of $y^\star$ is

$$\text{Var}(y^\star) = \boldsymbol\beta^T \boldsymbol\Sigma_{\boldsymbol{x}\boldsymbol{x}} \boldsymbol\beta = \boldsymbol\Sigma_{y\boldsymbol{x}} \boldsymbol\Sigma_{\boldsymbol{x}\boldsymbol{x}}^{-1} \boldsymbol\Sigma_{\boldsymbol{x}y} = \sigma^2_y\, \Omega^2$$

hence the correlation $\text{Cor}(y, y^\star) = \frac{\text{Cov}(y, y^\star)}{\text{SD}(y)\, \text{SD}(y^\star)} = \Omega$, with $\Omega \geq 0$.

This helps to understand the Ω and Ω2 coefficients:


• $\Omega$ is the linear correlation between the response $y$ and the prediction $y^\star$.
• $\Omega^2$ is called the squared multiple correlation between the scalar $y$ and the vector $\boldsymbol{x}$.
• Note that if we have only one predictor (if $x$ is a scalar) then $\boldsymbol{P}_{xx} = 1$ and $\boldsymbol{P}_{yx} = \rho_{yx}$, so the squared multiple correlation coefficient reduces to the squared correlation $\Omega^2 = \rho^2_{yx}$ between two scalar random variables $y$ and $x$.

17.1.1 Estimation of Ω² and the multiple R² coefficient

The squared multiple correlation coefficient $\Omega^2$ can be estimated by plug-in of empirical estimates of the corresponding correlation matrices:

$$R^2 = \hat{\boldsymbol{P}}_{y\boldsymbol{x}} \hat{\boldsymbol{P}}_{\boldsymbol{x}\boldsymbol{x}}^{-1} \hat{\boldsymbol{P}}_{\boldsymbol{x}y} = \hat\sigma_y^{-2}\, \hat{\boldsymbol\Sigma}_{y\boldsymbol{x}} \hat{\boldsymbol\Sigma}_{\boldsymbol{x}\boldsymbol{x}}^{-1} \hat{\boldsymbol\Sigma}_{\boldsymbol{x}y}$$

This estimator of $\Omega^2$ is called the multiple $R^2$ coefficient.

If the same scale factor ($1/n$ or $1/(n-1)$) is used in estimating the variance $\sigma^2_y$ and the covariances $\boldsymbol\Sigma_{\boldsymbol{x}\boldsymbol{x}}$ and $\boldsymbol\Sigma_{y\boldsymbol{x}}$ then this factor will cancel out.
Above we have seen that $\Omega^2$ is directly linked with the noise variance via

$$\text{Var}(\varepsilon) = \sigma^2_y (1 - \Omega^2),$$

so we can express the squared multiple correlation as

$$\Omega^2 = 1 - \text{Var}(\varepsilon)/\sigma^2_y$$

The maximum likelihood estimate of the noise variance $\text{Var}(\varepsilon)$ (also called the residual variance) can be computed from the residual sum of squares $RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2$ as follows:

$$\widehat{\text{Var}}(\varepsilon)_{ML} = \frac{RSS}{n}$$

whereas the unbiased estimate is obtained by

$$\widehat{\text{Var}}(\varepsilon)_{UB} = \frac{RSS}{n-d-1} = \frac{RSS}{df}$$

where the degrees of freedom is $df = n - d - 1$ and $d$ is the number of predictors.

Similarly, we can find the maximum likelihood estimate $v_y^{ML}$ of $\sigma^2_y$ (with scale factor $1/n$) as well as an unbiased estimate $v_y^{UB}$ (with scale factor $1/(n-1)$).

The multiple $R^2$ coefficient can then be written as

$$R^2 = 1 - \widehat{\text{Var}}(\varepsilon)_{ML} / v_y^{ML}$$

Note we use MLEs.

In contrast, the so-called adjusted multiple $R^2$ coefficient is given by

$$R^2_{\text{adj}} = 1 - \widehat{\text{Var}}(\varepsilon)_{UB} / v_y^{UB}$$

where the unbiased variance estimates are used.

Both $R^2$ and $R^2_{\text{adj}}$ are estimates of $\Omega^2$ and are related by

$$1 - R^2 = (1 - R^2_{\text{adj}})\, \frac{df}{n-1}$$

17.1.2 R commands

In R the command lm() fits the linear regression model.

In addition to the regression coefficients (and derived quantities) the R function lm() also lists

• the multiple R-squared $R^2$,
• the adjusted R-squared $R^2_{\text{adj}}$,
• the degrees of freedom $df$ and
• the residual standard error $\sqrt{\widehat{\text{Var}}(\varepsilon)_{UB}}$ (computed from the unbiased variance estimate).

See also Worksheet R3 which provides R code to reproduce the exact output of the native lm() R function.
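A minimal sketch (using the built-in trees data, not part of the original notes) recomputing these quantities from first principles; compare with the output of summary(fit):

```r
# Recompute R^2, adjusted R^2 and residual standard error by hand.
fit <- lm(Volume ~ Girth + Height, data = trees)
n  <- nobs(fit)
d  <- length(coef(fit)) - 1          # number of predictors
df <- n - d - 1                      # degrees of freedom

rss <- sum(residuals(fit)^2)
tss <- sum((trees$Volume - mean(trees$Volume))^2)

R2     <- 1 - rss / tss                        # multiple R-squared
R2.adj <- 1 - (rss / df) / (tss / (n - 1))     # adjusted R-squared
rse    <- sqrt(rss / df)                       # residual standard error
c(R2, R2.adj, rse)
```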

17.2 Variance decomposition in regression


The squared multiple correlation coefficient is useful also because it plays an important role in the decomposition of the total variance:

• total variance: $\text{Var}(y) = \sigma^2_y$
• unexplained variance (irreducible error): $\sigma^2_y (1 - \Omega^2) = \text{Var}(\varepsilon)$
• the explained variance is the complement: $\sigma^2_y \Omega^2 = \text{Var}(y^\star)$

In summary,

$$\text{Var}(y) = \text{Var}(y^\star) + \text{Var}(\varepsilon)$$

becomes

$$\underbrace{\sigma^2_y}_{\text{total variance}} = \underbrace{\sigma^2_y \Omega^2}_{\text{explained variance}} + \underbrace{\sigma^2_y (1 - \Omega^2)}_{\text{unexplained variance}}$$

The unexplained variance measures the fit after introducing predictors into
the model (smaller means better fit). The total variance measures the fit of the
model without any predictors. The explained variance is the difference between
total and unexplained variance, it indicates the increase in model fit due to the
predictors.

17.2.1 Law of total variance and variance decomposition

The law of total variance

$$\underbrace{\text{Var}(y)}_{\text{total variance}} = \underbrace{\text{Var}(\text{E}(y|\boldsymbol{x}))}_{\text{explained variance}} + \underbrace{\text{E}(\text{Var}(y|\boldsymbol{x}))}_{\text{unexplained variance}}$$

provides a very general decomposition into explained and unexplained parts of the variance that is valid regardless of the form of the distributions $F_{y,\boldsymbol{x}}$ and $F_{y|\boldsymbol{x}}$. In regression it connects variance decomposition and conditioning. If you plug in the conditional expectations for the multivariate normal model (cf. the previous chapter) we recover

$$\underbrace{\sigma^2_y}_{\text{total variance}} = \underbrace{\sigma^2_y \Omega^2}_{\text{explained variance}} + \underbrace{\sigma^2_y (1 - \Omega^2)}_{\text{unexplained variance}}$$

17.2.2 Related quantities

Using the above three quantities (total variance, explained variance, and unexplained variance) we can construct a number of scores:

1) coefficient of determination, squared multiple correlation:

$$\frac{\text{explained var}}{\text{total var}} = \frac{\sigma^2_y \Omega^2}{\sigma^2_y} = \Omega^2$$

(range 0 to 1, with 1 indicating perfect fit)

2) coefficient of non-determination, coefficient of alienation:

$$\frac{\text{unexplained var}}{\text{total var}} = \frac{\sigma^2_y (1 - \Omega^2)}{\sigma^2_y} = 1 - \Omega^2$$

(range 0 to 1, with 0 indicating perfect fit)

3) $F$ score, $t^2$ score:

$$\frac{\text{explained var}}{\text{unexplained var}} = \frac{\sigma^2_y \Omega^2}{\sigma^2_y (1 - \Omega^2)} = \frac{\Omega^2}{1 - \Omega^2} = \mathcal{F} = \frac{\tau^2}{n}$$

(range 0 to ∞, with ∞ indicating perfect fit)

Note that the $\mathcal{F}$ and $\tau^2$ scores are population versions of the $F$ and $t^2$ statistics. Also note that $\Omega^2 = \frac{\tau^2}{\tau^2 + n} = \frac{\mathcal{F}}{\mathcal{F} + 1}$ links squared correlation with squared $t$-scores and $F$-scores.

17.3 Sample version of variance decomposition

If $\Omega^2$ and $\sigma^2_y$ are replaced by their MLEs this can be written in a sample version as follows, using data points $y_i$, predictions $\hat{y}_i$ and $\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i$:

$$\underbrace{\sum_{i=1}^n (y_i - \bar{y})^2}_{\text{total sum of squares (TSS)}} = \underbrace{\sum_{i=1}^n (\hat{y}_i - \bar{y})^2}_{\text{explained sum of squares (ESS)}} + \underbrace{\sum_{i=1}^n (y_i - \hat{y}_i)^2}_{\text{residual sum of squares (RSS)}}$$

Note that TSS, ESS and RSS all scale with $n$. Using data vector notation the sample-based variance decomposition can be written in the form of the Pythagorean theorem:

$$\underbrace{||\boldsymbol{y} - \bar{y} 1_n||^2}_{\text{TSS}} = \underbrace{||\hat{\boldsymbol{y}} - \bar{y} 1_n||^2}_{\text{ESS}} + \underbrace{||\boldsymbol{y} - \hat{\boldsymbol{y}}||^2}_{\text{RSS}}$$
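A minimal numerical check of this decomposition (using the built-in trees data, not part of the original notes):

```r
# Verify TSS = ESS + RSS for a fitted linear model.
fit  <- lm(Volume ~ Girth + Height, data = trees)
y    <- trees$Volume
yhat <- fitted(fit)

TSS <- sum((y - mean(y))^2)
ESS <- sum((yhat - mean(y))^2)
RSS <- sum((y - yhat)^2)

all.equal(TSS, ESS + RSS)   # TRUE
ESS / TSS                   # equals the multiple R-squared
```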

17.3.1 Geometric interpretation of regression as orthogonal projection

The above equation can be further simplified to

$$||\boldsymbol{y}||^2 = ||\hat{\boldsymbol{y}}||^2 + \underbrace{||\boldsymbol{y} - \hat{\boldsymbol{y}}||^2}_{\text{RSS}}$$

Geometrically speaking, this implies that $\hat{\boldsymbol{y}}$ is an orthogonal projection of $\boldsymbol{y}$, since the residuals $\boldsymbol{y} - \hat{\boldsymbol{y}}$ and the predictions $\hat{\boldsymbol{y}}$ are orthogonal (by construction!).

This is also valid for the centered versions of the vectors, i.e. $\hat{\boldsymbol{y}} - \bar{y} 1_n$ is an orthogonal projection of $\boldsymbol{y} - \bar{y} 1_n$ (see Figure).

Also note that the angle $\theta$ between the two centered vectors is directly related to the (estimated) multiple correlation, with

$$R = \cos(\theta) = \frac{||\hat{\boldsymbol{y}} - \bar{y} 1_n||}{||\boldsymbol{y} - \bar{y} 1_n||}, \quad \text{or} \quad R^2 = \cos(\theta)^2 = \frac{||\hat{\boldsymbol{y}} - \bar{y} 1_n||^2}{||\boldsymbol{y} - \bar{y} 1_n||^2} = \frac{\text{ESS}}{\text{TSS}}$$

[Figure: regression as orthogonal projection of the centered response onto the prediction. Source of Figure: Stack Exchange]


Chapter 18

Prediction and variable selection

In this chapter we discuss how to compute (lower bounds of) the prediction error and how to select the variables relevant for prediction.

18.1 Prediction and prediction intervals


Learning the regression function from (training) data is only the first step in
the application of regression models.

The next step is to actually make predictions of future outcomes 𝑦_test given test
data 𝒙_test:

\[
\hat{y}_{\text{test}} = \hat{y}(\boldsymbol{x}_{\text{test}}) = \hat{f}_{\hat{\beta}_0, \hat{\boldsymbol{\beta}}}(\boldsymbol{x}_{\text{test}})
\]

Note that 𝑦̂_test is a point estimator. Is it also possible to construct a corresponding
interval estimate?


The answer is yes, and leads back to the conditioning approach:

\[
y^\star = \text{E}(y|\boldsymbol{x}) = \beta_0 + \boldsymbol{\beta}^T \boldsymbol{x}
\]

\[
\text{Var}(\varepsilon) = \text{Var}(y|\boldsymbol{x}) = \sigma^2_y (1 - \Omega^2)
\]

We know that the mean squared prediction error for 𝑦★ is E((𝑦 − 𝑦★)2 ) = Var(𝜀)
and that this is the minimal irreducible error. Hence, we may use Var(𝜀) as the
minimum variability for the prediction.
The corresponding prediction interval is

𝑦★(𝒙 test ) ± 𝑐 × SD(𝜀)


 

where 𝑐 is some suitable constant (e.g. 1.96 for symmetric 95% normal intervals).
However, please note that the prediction interval constructed in this fashion will
be an underestimate. The reason is that it assumes that we employ 𝑦* = 𝛽₀ + 𝜷ᵀ𝒙,
but in reality we actually use 𝑦̂ = 𝛽̂₀ + 𝜷̂ᵀ𝒙 for prediction — note the estimated
coefficients! We recall from an earlier chapter (best linear predictor) that this
leads to an increase of the MSPE compared with using the optimal 𝛽₀ and 𝜷.

Thus, for better prediction intervals we would need to consider the mean squared
prediction error of 𝑦̂, which can be written as E((𝑦 − 𝑦̂)²) = Var(𝜀) + 𝛿, where 𝛿 is an
additional error term due to using an estimated rather than the true regression
function. 𝛿 typically declines with 1/𝑛 but can be substantial for small 𝑛 (in
particular as it usually depends on the number of predictors 𝑑).

For more details on this we refer to later modules on regression.
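In R the two kinds of intervals can be compared directly. Below is a minimal sketch (using the built-in trees data; the chosen values are illustrative): the first interval uses SD(𝜀) alone and is too narrow, while predict() with interval = "prediction" also accounts for the estimation error 𝛿.

fit  <- lm(Volume ~ Girth + Height, data = trees)
xnew <- data.frame(Girth = 10, Height = 75)
sd_eps <- summary(fit)$sigma                      # estimate of SD(eps)
predict(fit, xnew) + c(-1, 1) * 1.96 * sd_eps     # naive interval (underestimate)
predict(fit, xnew, interval = "prediction")       # interval accounting for delta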

18.2 Variable importance and prediction


Another key question in regression modelling is to find out which predictor variables
𝑥₁, 𝑥₂, ..., 𝑥_d are actually important for predicting the outcome 𝑦.
→ We need to study variable importance measures (VIMs).

18.2.1 How to quantify variable importance?


A variable 𝑥 𝑖 is important if it improves prediction of the response 𝑦.
Recall the variance decomposition:

\[
\text{Var}(y) = \sigma^2_y = \underbrace{\sigma^2_y \Omega^2}_{\text{explained variance}} + \underbrace{\sigma^2_y (1 - \Omega^2)}_{\text{unexplained/residual variance} = \text{Var}(\varepsilon)}
\]

• Ω², the squared multiple correlation, lies in [0, 1].
• Ω² large (→ 1): the predictor variables explain most of 𝜎²_𝑦.
• Ω² small (→ 0): the linear model fails and the predictors do not explain the variability.
• ⇒ If a predictor helps to increase the explained variance (and thus to decrease the unexplained variance) then it is important!
• Ω² = 𝑷_{𝑦𝒙} 𝑷_{𝒙𝒙}^{-1} 𝑷_{𝒙𝑦}, i.e. a function of the 𝒙!

VIM: which predictors contribute most to Ω²?

18.2.2 Some candidates for VIMs


1. The regression coefficients 𝜷
   • 𝜷 = 𝚺_{𝒙𝒙}^{-1} 𝚺_{𝒙𝑦} = 𝑽_𝒙^{-1/2} 𝑷_{𝒙𝒙}^{-1} 𝑷_{𝒙𝑦} 𝜎_𝑦
   • Not a good VIM since 𝜷 contains the scale!
   • A large 𝛽̂ᵢ does not indicate that 𝑥ᵢ is important.
   • A small 𝛽̂ᵢ does not indicate that 𝑥ᵢ is not important.
2. Standardised regression coefficients 𝜷_std
   • 𝜷_std = 𝑷_{𝒙𝒙}^{-1} 𝑷_{𝒙𝑦}
   • implies Var(𝑦) = 1 and Var(𝑥ᵢ) = 1
   • These do not contain the scale (so better than 𝜷̂).
   • But it is still unclear how these relate to the decomposition of variance.
3. Squared marginal correlations 𝜌²_{𝑦,𝑥ᵢ}
   Consider the case of uncorrelated predictors, 𝑷_{𝒙𝒙} = 𝑰 (no correlation among the 𝑥ᵢ):

\[
\Rightarrow \Omega^2 = \boldsymbol{P}_{y\boldsymbol{x}} \boldsymbol{P}_{\boldsymbol{x}y} = \sum_{i=1}^{d} \rho^2_{y,x_i}
\]

   Here 𝜌_{𝑦,𝑥ᵢ} = Cor(𝑦, 𝑥ᵢ) is the marginal correlation between 𝑦 and 𝑥ᵢ, and Ω² is
   (for uncorrelated predictors) the sum of the squared marginal correlations.
   • If 𝑷_{𝒙𝒙} = 𝑰, then ranking predictors by 𝜌²_{𝑦,𝑥ᵢ} is optimal!
   • The predictor with the largest marginal correlation reduces the unexplained variance most!
   • Good news: even if there is weak correlation among predictors the marginal correlations are still good as VIMs (but then they will not perfectly add up to Ω²).
   • Advantage: very simple but often also very effective.
   • Caution! If there is strong correlation in 𝑷_{𝒙𝒙}, then there is collinearity (in this case it is often best to remove one of the strongly correlated variables, or to merge the correlated variables).

Often, ranking predictors by their squared marginal correlations is done as a


prefiltering step (independence screening).
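A minimal R sketch of such a prefiltering step (data set and variable names are our illustration): compute the squared marginal correlation of each predictor with the response and rank.

X  <- as.matrix(trees[, c("Girth", "Height")])
y  <- trees$Volume
r2 <- apply(X, 2, function(xi) cor(y, xi)^2)   # squared marginal correlations
sort(r2, decreasing = TRUE)                    # ranking used for screening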

18.3 Regression 𝑡-scores.


18.3.1 Wald statistic for regression coefficients

So far, we discussed three obvious candidates for variable importance
measures (regression coefficients, standardised regression coefficients, marginal
correlations).

In this section we consider a further quantity, the regression 𝑡-score.

Recall that ML estimation of the regression coefficients yields
• a point estimate 𝜷̂,
• the (asymptotic) variance V̂ar(𝜷̂), and
• the asymptotic normal distribution 𝜷̂ ∼ 𝑁_d(𝜷, V̂ar(𝜷̂)).

Corresponding to each predictor 𝑥ᵢ we can construct from the above a 𝑡-score

\[
t_i = \frac{\hat{\beta}_i}{\widehat{\text{SD}}(\hat{\beta}_i)}
\]

where the standard deviations are computed as the square roots of the diagonal
elements of the estimated variance, ŜD(𝛽̂ᵢ) = √(Diag(V̂ar(𝜷̂))ᵢ). This
corresponds to the Wald statistic to test that the underlying true regression
coefficient is zero (𝛽ᵢ = 0).
Correspondingly, under the null hypothesis that 𝛽ᵢ = 0, asymptotically for large 𝑛
the regression 𝑡-score is standard normally distributed:

\[
t_i \overset{a}{\sim} N(0, 1)
\]

This allows us to compute (symmetric) 𝑝-values 𝑝 = 2Φ(−|𝑡ᵢ|), where Φ is the
standard normal distribution function.
For finite 𝑛, assuming normality of the observations and using the unbiased
estimate of the variance when computing 𝑡ᵢ, the exact distribution of 𝑡ᵢ is given by
the Student-𝑡 distribution:

\[
t_i \sim t_{n-d-1}
\]

Regression 𝑡-scores can thus be used to test whether a regression coefficient
is zero. A large magnitude of 𝑡ᵢ indicates that the hypothesis 𝛽ᵢ = 0
can be rejected. Thus, a small 𝑝-value (say smaller than 0.05) signals that the
regression coefficient is non-zero and hence that the corresponding predictor
variable should be included in the model.

This allows us to rank predictor variables by |𝑡ᵢ| or the corresponding 𝑝-values with
regard to their relevance in the linear model. Typically, in order to simplify a
model, predictors with the largest 𝑝-values (and thus smallest absolute 𝑡-scores)
may be removed from the model. However, note that having a 𝑝-value, say, larger
than 0.05 by itself is not sufficient to declare a regression coefficient to be zero
(because in classical statistical testing you can only reject the null hypothesis, but
not accept it!).

Note that by construction the regression 𝑡-scores do not depend on the scale,
so rescaling the original data does not affect the corresponding
regression 𝑡-scores. Furthermore, if ŜD(𝛽̂ᵢ) is small, then the regression 𝑡-score
𝑡ᵢ can still be large even if 𝛽̂ᵢ is small!

18.3.2 Computing

When you perform regression analysis in R (or another statistical software
package) the computer will return the following for each coefficient:

• 𝛽̂ᵢ — the estimated regression coefficient,
• ŜD(𝛽̂ᵢ) — the (estimated) error of 𝛽̂ᵢ,
• 𝑡ᵢ = 𝛽̂ᵢ / ŜD(𝛽̂ᵢ) — the 𝑡-score, computed from the first two columns,
• the 𝑝-value for 𝑡ᵢ, based on the 𝑡-distribution, and
• an indicator of significance (* for 0.9, ** for 0.95, *** for 0.99).

In the lm() function in R the standard deviation is the square root of the unbiased
estimate of the variance (but note that it itself is not unbiased!).
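For example (a sketch using the built-in trees data), the columns described above correspond to the coefficient matrix returned by summary():

fit <- lm(Volume ~ Girth + Height, data = trees)
summary(fit)$coefficients
# columns: "Estimate"   = beta-hat_i
#          "Std. Error" = SD-hat(beta-hat_i)
#          "t value"    = t_i
#          "Pr(>|t|)"   = p-value based on the t-distribution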

18.3.3 Connection with partial correlation


The deeper reason why ranking predictors by regression 𝑡-scores and associated
𝑝-values is useful is their link with partial correlation.

In particular, the (squared) regression 𝑡-score can be 1:1 transformed into the
(estimated) (squared) partial correlation

\[
\hat{\rho}^2_{y, x_i | x_{j \neq i}} = \frac{t_i^2}{t_i^2 + df}
\]

with 𝑑𝑓 = 𝑛 − 𝑑 − 1, and it can be shown that the 𝑝-values for testing that 𝛽ᵢ = 0
are exactly the same as the 𝑝-values for testing that the partial correlation 𝜌_{𝑦,𝑥ᵢ|𝑥_{𝑗≠𝑖}}
vanishes!

Therefore, ranking the predictors 𝑥 𝑖 by regression 𝑡-scores leads to exactly the


same ranking and 𝑝-values as partial correlation!
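This transformation is easy to apply in practice. A minimal R sketch (illustrative data):

fit  <- lm(Volume ~ Girth + Height, data = trees)
tval <- summary(fit)$coefficients[-1, "t value"]  # t-scores, intercept dropped
df   <- fit$df.residual                           # equals n - d - 1
tval^2 / (tval^2 + df)    # estimated squared partial correlations rho^2_{y,xi|rest}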

18.3.4 Squared Wald statistic and the 𝐹 statistic


In the above we looked at individual regression coefficients. However, we can
also construct a Wald test using the complete vector 𝜷̂. The squared Wald statistic
to test that 𝜷 = 0 is given by

\[
\begin{aligned}
t^2 &= \hat{\boldsymbol{\beta}}^T \widehat{\text{Var}}(\hat{\boldsymbol{\beta}})^{-1} \hat{\boldsymbol{\beta}} \\
    &= \hat{\boldsymbol{\Sigma}}_{y\boldsymbol{x}} \hat{\boldsymbol{\Sigma}}_{\boldsymbol{x}\boldsymbol{x}}^{-1} \left( \frac{n}{\hat{\sigma}^2_{\varepsilon}} \hat{\boldsymbol{\Sigma}}_{\boldsymbol{x}\boldsymbol{x}} \right) \hat{\boldsymbol{\Sigma}}_{\boldsymbol{x}\boldsymbol{x}}^{-1} \hat{\boldsymbol{\Sigma}}_{\boldsymbol{x}y} \\
    &= \frac{n}{\hat{\sigma}^2_{\varepsilon}} \hat{\boldsymbol{\Sigma}}_{y\boldsymbol{x}} \hat{\boldsymbol{\Sigma}}_{\boldsymbol{x}\boldsymbol{x}}^{-1} \hat{\boldsymbol{\Sigma}}_{\boldsymbol{x}y} \\
    &= \frac{n}{\hat{\sigma}^2_{\varepsilon}} \hat{\sigma}^2_y R^2
\end{aligned}
\]

With 𝜎̂²_𝜀 / 𝜎̂²_𝑦 = 1 − 𝑅² we finally get the related 𝐹 statistic

\[
\frac{t^2}{n} = \frac{R^2}{1 - R^2} = F
\]

which is a function of 𝑅². If 𝑅² = 0 then 𝐹 = 0. If 𝑅² is large (close to 1) then 𝐹 is large
as well, and the null hypothesis 𝜷 = 0 can be rejected, which implies that at
least one regression coefficient is non-zero. Note that the squared Wald statistic
𝑡² is asymptotically 𝜒²_𝑑 distributed, which is useful to find critical values and to
compute 𝑝-values.
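A quick numerical illustration of the link between 𝑅² and the squared Wald statistic (a sketch; note that the score below follows the definition in the text and is not the degree-of-freedom-adjusted 𝐹 statistic printed by summary()):

fit <- lm(Volume ~ Girth + Height, data = trees)
R2  <- summary(fit)$r.squared
n   <- nrow(trees)
t2  <- n * R2 / (1 - R2)   # squared Wald statistic, t^2 = n R^2 / (1 - R^2)
t2 / n                     # the score F = R^2 / (1 - R^2) from above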

18.4 Further approaches for variable selection


In addition to ranking by marginal and partial correlation, there are many other
approaches for variable selection in regression!
a) Search-based methods:
   • search through subsets of linear models for the 𝑑 variables, ranging from the full model (including all predictors) to the empty model (including no predictor) and everything in between.
   • Problem: exhaustive search is not possible even for relatively small 𝑑 as the space of models is very large (there are 2^𝑑 subsets)!
   • Therefore heuristic approaches such as forward selection (adding predictors), backward selection (removing predictors), or Monte Carlo random search are employed.
   • Problem: maximum likelihood cannot be used for choosing among the models, since ML will always pick the best-fitting model, i.e. the most complex one. Therefore, penalised ML criteria such as AIC or Bayesian criteria are often employed instead.

b) Integrative estimation and variable selection:
   • There are methods that fit the regression model and perform variable selection simultaneously.
   • The most well-known approach of this type is "lasso" regression (Tibshirani 1996).
   • This fits a (generalised) linear model by ML plus an L1 penalty.
   • Alternative: Bayesian variable selection and estimation procedures.
c) Entropy-based variable selection:
   As seen above, two of the most popular approaches in linear models are
   based on correlation, either marginal correlation or partial correlation (via
   regression 𝑡-scores).
   Correlation measures can be generalised to non-linear settings. One very
   popular measure is the mutual information, which is computed using
   the KL divergence. In the case of two variables 𝑥 and 𝑦 with joint normal
   distribution and correlation 𝜌 the mutual information is a function of the
   correlation:

\[
\text{MI}(x, y) = -\frac{1}{2} \log(1 - \rho^2)
\]

   In regression the mutual information between the response 𝑦 and predictor
   𝑥ᵢ is MI(𝑦, 𝑥ᵢ), and this is widely used for feature selection, in particular in
   machine learning (see the sketch after this list).
d) FDR-based variable selection in regression:
   Feature selection procedures controlling the false discovery rate (FDR) among the
   selected features are becoming more popular, in particular a number of
   procedures using so-called "knockoffs", see
   https://web.stanford.edu/group/candes/knockoffs/ .
e) Variable importance using Shapley values:
   Borrowing a concept from game theory, Shapley values have recently
   become popular in machine learning to evaluate the variable importance
   of predictors in nonlinear models. Their relationship to other statistical
   methods for measuring variable importance is the focus of current research.
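As promised in item c), here is a minimal R sketch of the mutual information between the response and a predictor under joint normality (illustrative data; the formula only holds exactly in the normal case):

rho <- cor(trees$Volume, trees$Girth)   # estimated marginal correlation
MI  <- -0.5 * log(1 - rho^2)            # mutual information under normality
MI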
Appendix

Appendix A

Refresher

Statistics is a mathematical science that requires the practical use of tools from
probability, vectors and matrices, analysis, etc.
Here we briefly list some essentials that are needed for "Statistical Methods".
Please familiarise yourself (again) with these topics.

A.1 Basic mathematical notation


Summation:
\[
\sum_{i=1}^{n} x_i = x_1 + x_2 + \ldots + x_n
\]

Multiplication:
\[
\prod_{i=1}^{n} x_i = x_1 \times x_2 \times \ldots \times x_n
\]

A.2 Vectors and matrices


Vector and matrix notation.
Vector algebra.
Eigenvectors and eigenvalues for a real symmetric matrix.
Eigenvalue (spectral) decomposition of a real symmetric matrix.
Positive and negative definiteness of a real symmetric matrix (containing only
positive or only negative eigenvalues).


Singularity of a real symmetric matrix (containing one or more eigenvalues


identical to zero).

Singular value decomposition of a real matrix.

Inverse of a matrix.

Trace and determinant of a square matrix.

Connection with eigenvalues (trace = sum of eigenvalues, determinant = product


of eigenvalues).

A.3 Functions
A.3.1 Gradient
The gradient of a scalar-valued function ℎ(𝒙) with vector argument 𝒙 =
(𝑥₁, ..., 𝑥_d)ᵀ is the vector containing the first order partial derivatives of ℎ(𝒙)
with regard to each 𝑥₁, ..., 𝑥_d:

\[
\nabla h(\boldsymbol{x}) = \begin{pmatrix} \frac{\partial h(\boldsymbol{x})}{\partial x_1} \\ \vdots \\ \frac{\partial h(\boldsymbol{x})}{\partial x_d} \end{pmatrix} = \frac{\partial h(\boldsymbol{x})}{\partial \boldsymbol{x}} = \operatorname{grad} h(\boldsymbol{x})
\]

The symbol ∇ is called the nabla operator (also known as del operator).

Note that we write the gradient as a column vector. This is called the denominator
layout convention, see https://en.wikipedia.org/wiki/Matrix_calculus for
details. In contrast, many textbooks (and also earlier versions of these lecture
notes) assume that gradients are row vectors, following the so-called numerator
layout convention.

Example A.1. Examples for the gradient:

• ℎ(𝒙) = 𝒂ᵀ𝒙 + 𝑏. Then ∇ℎ(𝒙) = ∂ℎ(𝒙)/∂𝒙 = 𝒂.
• ℎ(𝒙) = 𝒙ᵀ𝒙. Then ∇ℎ(𝒙) = ∂ℎ(𝒙)/∂𝒙 = 2𝒙.
• ℎ(𝒙) = 𝒙ᵀ𝑨𝒙. Then ∇ℎ(𝒙) = ∂ℎ(𝒙)/∂𝒙 = (𝑨 + 𝑨ᵀ)𝒙.

A.3.2 Hessian matrix


The matrix of all second order partial derivatives of a scalar-valued function with
vector-valued argument is called the Hessian matrix:

\[
\nabla \nabla^T h(\boldsymbol{x}) =
\begin{pmatrix}
\frac{\partial^2 h(\boldsymbol{x})}{\partial x_1^2} & \frac{\partial^2 h(\boldsymbol{x})}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 h(\boldsymbol{x})}{\partial x_1 \partial x_d} \\
\frac{\partial^2 h(\boldsymbol{x})}{\partial x_2 \partial x_1} & \frac{\partial^2 h(\boldsymbol{x})}{\partial x_2^2} & \cdots & \frac{\partial^2 h(\boldsymbol{x})}{\partial x_2 \partial x_d} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 h(\boldsymbol{x})}{\partial x_d \partial x_1} & \frac{\partial^2 h(\boldsymbol{x})}{\partial x_d \partial x_2} & \cdots & \frac{\partial^2 h(\boldsymbol{x})}{\partial x_d^2}
\end{pmatrix}
= \left( \frac{\partial^2 h(\boldsymbol{x})}{\partial x_i \partial x_j} \right)
= \frac{\partial^2 h(\boldsymbol{x})}{\partial \boldsymbol{x} \partial \boldsymbol{x}^T}
\]

By construction the Hessian matrix is square and symmetric.

Example A.2. ℎ(𝒙) = 𝒙ᵀ𝑨𝒙. Then ∇∇ᵀℎ(𝒙) = ∂²ℎ(𝒙)/(∂𝒙∂𝒙ᵀ) = 𝑨 + 𝑨ᵀ.

A.3.3 Convex and concave functions


A function ℎ(𝑥) is convex if the second derivative ℎ″(𝑥) ≥ 0 for all 𝑥. More
generally, a function ℎ(𝒙), where 𝒙 is a vector, is convex if the Hessian matrix
∇∇ᵀℎ(𝒙) is positive definite, i.e. if the eigenvalues of the Hessian matrix are all
positive.

If ℎ(𝒙) is convex, then −ℎ(𝒙) is concave. A function is concave if the Hessian
matrix is negative definite, i.e. if the eigenvalues of the Hessian matrix are all
negative.

Example A.3. The logarithm log(𝑥) is an example of a concave function whereas
𝑥² is a convex function.
To memorise, a valley is convex.

A.3.4 Linear and quadratic approximation


A linear and quadratic approximation of a function is given by a Taylor series of
first and second order, respectively.
Applied to a scalar-valued function of a scalar:
\[
h(x) \approx h(x_0) + h'(x_0)(x - x_0) + \frac{1}{2} h''(x_0)(x - x_0)^2
\]
With 𝑥 = 𝑥₀ + 𝜀 this can be written as
\[
h(x_0 + \varepsilon) \approx h(x_0) + h'(x_0)\,\varepsilon + \frac{1}{2} h''(x_0)\,\varepsilon^2
\]

Applied to a scalar-valued function of a vector:
\[
h(\boldsymbol{x}) \approx h(\boldsymbol{x}_0) + \nabla h(\boldsymbol{x}_0)^T (\boldsymbol{x} - \boldsymbol{x}_0) + \frac{1}{2} (\boldsymbol{x} - \boldsymbol{x}_0)^T \nabla\nabla^T h(\boldsymbol{x}_0) (\boldsymbol{x} - \boldsymbol{x}_0)
\]
With 𝒙 = 𝒙₀ + 𝜺 this can be written as
\[
h(\boldsymbol{x}_0 + \boldsymbol{\varepsilon}) \approx h(\boldsymbol{x}_0) + \nabla h(\boldsymbol{x}_0)^T \boldsymbol{\varepsilon} + \frac{1}{2} \boldsymbol{\varepsilon}^T \nabla\nabla^T h(\boldsymbol{x}_0) \boldsymbol{\varepsilon}
\]
2

Example A.4. Commonly occurring Taylor series approximations of second
order are for example

\[
\log(x_0 + \varepsilon) \approx \log(x_0) + \frac{\varepsilon}{x_0} - \frac{\varepsilon^2}{2 x_0^2}
\]

and

\[
\frac{x_0}{x_0 + \varepsilon} \approx 1 - \frac{\varepsilon}{x_0} + \frac{\varepsilon^2}{x_0^2}
\]

A.3.5 Conditions for local optimum of a function


To check if 𝑥₀ or 𝒙₀ is a local maximum or minimum we can use the following
conditions:

For a function of a single variable:
i) The first derivative is zero at the optimum: ℎ′(𝑥₀) = 0.
ii) If the second derivative ℎ″(𝑥₀) < 0 at the optimum the function is locally concave and the optimum is a maximum.
iii) If the second derivative ℎ″(𝑥₀) > 0 at the optimum the function is locally convex and the optimum is a minimum.

For a function of several variables:
i) The gradient vanishes at the optimum: ∇ℎ(𝒙₀) = 0.
ii) If the Hessian ∇∇ᵀℎ(𝒙₀) is negative definite (= all eigenvalues of the Hessian matrix are negative) then the function is locally concave and the optimum is a maximum.
iii) If the Hessian is positive definite (= all eigenvalues of the Hessian matrix are positive) then the function is locally convex and the optimum is a minimum.

Around a local optimum 𝒙₀ we can approximate the function quadratically using
\[
h(\boldsymbol{x}_0 + \boldsymbol{\varepsilon}) \approx h(\boldsymbol{x}_0) + \frac{1}{2} \boldsymbol{\varepsilon}^T \nabla\nabla^T h(\boldsymbol{x}_0) \boldsymbol{\varepsilon}
\]
Note the linear term is missing due to the gradient being zero at 𝒙₀.

A.4 Combinatorics
A.4.1 Number of permutations
The number of possible orderings, or permutations, of 𝑛 distinct items is the
number of ways to put 𝑛 items in 𝑛 bins with exactly one item in each bin. It is
given by the factorial
\[
n! = \prod_{i=1}^{n} i = 1 \times 2 \times \ldots \times n
\]

where 𝑛 is a positive integer. For 𝑛 = 0 the factorial is defined as

0! = 1

as there is exactly one permutation of zero objects.

The factorial can also be obtained using the Gamma function

𝑛! = Γ(𝑛 + 1)

which can be viewed as a continuous version of the factorial.

A.4.2 Multinomial and binomial coefficient


The number of possible permutations of 𝑛 items of 𝐾 distinct types, with 𝑛₁ of
type 1, 𝑛₂ of type 2 and so on, equals the number of ways to put 𝑛 items into 𝐾
bins with 𝑛₁ items in the first bin, 𝑛₂ in the second and so on. It is given by the
multinomial coefficient

\[
\binom{n}{n_1, \ldots, n_K} = \frac{n!}{n_1! \times n_2! \times \ldots \times n_K!}
\]

with ∑_{𝑘=1}^{𝐾} 𝑛ₖ = 𝑛 and 𝐾 ≤ 𝑛. Note that it equals the number of permutations of
all items divided by the number of permutations of the items in each bin (or of
each type).

If all 𝑛ₖ = 1 and hence 𝐾 = 𝑛 the multinomial coefficient reduces to the factorial.

If there are only two bins / types (𝐾 = 2) the multinomial coefficient becomes
the binomial coefficient

\[
\binom{n}{n_1} = \binom{n}{n_1, n - n_1} = \frac{n!}{n_1!(n - n_1)!}
\]

which counts the number of ways to choose 𝑛₁ elements from a set of 𝑛 elements.

A.4.3 De Moivre-Sterling approximation of the factorial


The factorial is frequently approximated by the following formula derived by
Abraham de Moivre (1667–1754) and James Stirling (1692–1770):

\[
n! \approx \sqrt{2\pi}\, n^{n + \frac{1}{2}} e^{-n}
\]

or equivalently on the logarithmic scale

\[
\log n! \approx \left( n + \frac{1}{2} \right) \log n - n + \frac{1}{2} \log(2\pi)
\]

The approximation is good for small 𝑛 (but fails for 𝑛 = 0) and becomes more
and more accurate with increasing 𝑛. For large 𝑛 the approximation can be
simplified to
\[
\log n! \approx n \log n - n
\]
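The quality of the approximation can be inspected in R (a minimal sketch):

n      <- c(1, 5, 10, 100)
exact  <- lfactorial(n)                               # log n! (exact)
stirl  <- (n + 1/2) * log(n) - n + 0.5 * log(2 * pi)  # De Moivre-Stirling
simple <- n * log(n) - n                              # large-n simplification
cbind(n, exact, stirl, simple)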

A.5 Probability
A.5.1 Random variables
A random variable describes a random experiment. The set of possible outcomes
is the sample space or state space and is denoted by Ω = {𝜔1 , 𝜔2 , . . .}. The
outcomes 𝜔 𝑖 are the elementary events. The sample space Ω can be finite or
infinite. Depending on type of outcomes the random variable is discrete or
continuous.
An event 𝐴 ⊆ Ω is a subset of Ω and thus itself a set of elementary events
𝐴 = {𝑎₁, 𝑎₂, ...}. This includes as special cases the full set 𝐴 = Ω, the empty set
𝐴 = ∅, and the elementary events 𝐴 = {𝜔ᵢ}. The complementary event 𝐴ᶜ is the
complement of the set 𝐴 in the set Ω so that 𝐴ᶜ = Ω \ 𝐴 = {𝜔ᵢ ∈ Ω : 𝜔ᵢ ∉ 𝐴}.
The probability of an event is denoted by Pr(𝐴). We assume that
• Pr(𝐴) ≥ 0, probabilities are positive,
• Pr(Ω) = 1, the certain event has probability 1, and
• Pr(𝐴) = ∑_{𝑎ᵢ∈𝐴} Pr(𝑎ᵢ), the probability of an event equals the sum of its constituting elementary events 𝑎ᵢ.
This implies
• Pr(𝐴) ≤ 1, i.e. probabilities all lie in the interval [0, 1]
• Pr(𝐴𝐶 ) = 1 − Pr(𝐴), and
• Pr(∅) = 0
Assume now we have two events 𝐴 and 𝐵. The probability of the event “𝐴 and
𝐵” is then given by the probability of the set intersection Pr(𝐴 ∩ 𝐵). Likewise
the probability of the event “𝐴 or 𝐵” is given by the probability of the set union
Pr(𝐴 ∪ 𝐵).

From the above it is clear that probability theory is closely linked to set theory,
and in particular to measure theory. This allows for a unified treatment of
discrete and continuous random variables (an elegant framework but not needed
for this module).

A.5.2 Probability mass and density function, distribution function and quantile function
To describe a random variable 𝑥 we need to assign probabilities to the corresponding
elementary outcomes 𝑥 ∈ Ω. For convenience we use the same name
to denote the random variable and the elementary outcomes.

For a discrete random variable we employ a probability mass function (PMF). We
denote it by a lower case 𝑓 but occasionally we also use 𝑝 or 𝑞. In the discrete
case we can define the event 𝐴 = {𝑥 : 𝑥 = 𝑎} = {𝑎} and obtain the probability
directly from the PMF:

\[
\Pr(A) = \Pr(x = a) = f(a) .
\]

The PMF has the property that ∑_{𝑥∈Ω} 𝑓(𝑥) = 1 and that 𝑓(𝑥) ∈ [0, 1].

For continuous random variables we need to use a probability density function
(PDF) instead. We define the event 𝐴 = {𝑥 : 𝑎 < 𝑥 ≤ 𝑎 + 𝑑𝑎} as an infinitesimal
interval and then assign the probability

\[
\Pr(A) = \Pr(a < x \leq a + da) = f(a)\,da .
\]

The PDF has the property that ∫_{𝑥∈Ω} 𝑓(𝑥)𝑑𝑥 = 1, but in contrast to a PMF the
density 𝑓(𝑥) ≥ 0 may take on values larger than 1.

Assuming an ordering we can define the event 𝐴 = {𝑥 : 𝑥 ≤ 𝑎} and compute its
probability

\[
F(a) = \Pr(A) = \Pr(x \leq a) =
\begin{cases}
\sum_{x \in A} f(x) & \text{discrete case} \\
\int_{x \in A} f(x)\,dx & \text{continuous case}
\end{cases}
\]

This is known as the distribution function, or cumulative distribution function


(CDF) and is denoted by upper case 𝐹 if the corresponding PDF/PMF is 𝑓 (or
𝑃 and 𝑄 if the corresponding PDF/PMF are 𝑝 and 𝑞). By construction the
distribution function is monotonically increasing and its value ranges from 0 to
1. With its help we can compute the probability of general interval sets such as

Pr(𝑎 < 𝑥 ≤ 𝑏) = 𝐹(𝑏) − 𝐹(𝑎) .

The inverse of the distribution function 𝑦 = 𝐹(𝑥) is the quantile function
𝑥 = 𝐹⁻¹(𝑦). The 50% quantile 𝐹⁻¹(1/2) is the median.

If the random variable 𝑥 has distribution function 𝐹 we write 𝑥 ∼ 𝐹.



A.5.3 Expectation and variance of a random variable

The expected value E(𝑥) of a random variable is defined as the weighted average
over all possible outcomes, with the weight given by the PMF / PDF 𝑓(𝑥):

\[
\operatorname{E}(x) =
\begin{cases}
\sum_{x \in \Omega} f(x)\, x & \text{discrete case} \\
\int_{x \in \Omega} f(x)\, x \, dx & \text{continuous case}
\end{cases}
\]

To emphasise that the expectation is taken with regard to the distribution 𝐹 we
write E_𝐹(𝑥) with the distribution 𝐹 as subscript. The expectation is not necessarily
always defined for a continuous random variable as the integral may diverge.

The expected value of a function ℎ(𝑥) of a random variable is obtained similarly:

\[
\operatorname{E}(h(x)) =
\begin{cases}
\sum_{x \in \Omega} f(x)\, h(x) & \text{discrete case} \\
\int_{x \in \Omega} f(x)\, h(x) \, dx & \text{continuous case}
\end{cases}
\]

This is called the “law of the unconscious statistician”, or short LOTUS. Again,
to highlight that the random variable 𝑥 has distribution 𝐹 we write E𝐹 (ℎ(𝑥)).
For an event 𝐴 we can define a corresponding indicator function

\[
1_A(x) =
\begin{cases}
1 & x \in A \\
0 & x \notin A
\end{cases}
\]

Intriguingly,
\[
\operatorname{E}(1_A(x)) = \Pr(A)
\]
i.e. the expectation of the indicator variable for 𝐴 is the probability of 𝐴.
The moments of random variables are also defined by expectation:
• Zeroth moment: E(𝑥⁰) = 1 by definition of the PDF and PMF,
• First moment: E(𝑥¹) = E(𝑥) = 𝜇, the mean,
• Second moment: E(𝑥²),
• The variance is the second moment centered about the mean 𝜇:
\[
\operatorname{Var}(x) = \operatorname{E}((x - \mu)^2) = \sigma^2
\]
• The variance can also be computed by Var(𝑥) = E(𝑥²) − E(𝑥)².


A distribution does not necessarily need to have any finite first or higher moments.
An example is the Cauchy distribution that does not have a mean or variance (or
any other higher moment).

A.5.4 Transformation of random variables


Linear transformation of random variables: if 𝑎 and 𝑏 are constants and 𝑥 is a
random variable, then the random variable 𝑦 = 𝑎 + 𝑏𝑥 has mean E(𝑦) = 𝑎 + 𝑏E(𝑥)
and variance Var(𝑦) = 𝑏 2 Var(𝑥).
For a general invertible coordinate transformation 𝑦 = ℎ(𝑥) = 𝑦(𝑥) the backtransformation
is 𝑥 = ℎ⁻¹(𝑦) = 𝑥(𝑦).

The transformation of the infinitesimal volume element is 𝑑𝑦 = |𝑑𝑦/𝑑𝑥| 𝑑𝑥.

The transformation of the density is 𝑓_𝑦(𝑦) = |𝑑𝑥/𝑑𝑦| 𝑓_𝑥(𝑥(𝑦)).

Note that 𝑑𝑥/𝑑𝑦 = (𝑑𝑦/𝑑𝑥)⁻¹.

A.5.5 Law of large numbers

Suppose we observe data 𝐷 = {𝑥₁, ..., 𝑥ₙ} with each 𝑥ᵢ ∼ 𝐹.

• By the strong law of large numbers the empirical distribution 𝐹̂ₙ based on the
data 𝐷 = {𝑥₁, ..., 𝑥ₙ} converges to the true underlying distribution 𝐹 as
𝑛 → ∞ almost surely:
\[
\hat{F}_n \overset{a.s.}{\longrightarrow} F
\]
The Glivenko–Cantelli theorem asserts that the convergence is uniform.
Since the strong law implies the weak law we also have convergence in
probability:
\[
\hat{F}_n \overset{P}{\longrightarrow} F
\]

• Correspondingly, for 𝑛 → ∞ the average E_{𝐹̂ₙ}(ℎ(𝑥)) = (1/𝑛) ∑ᵢ₌₁ⁿ ℎ(𝑥ᵢ) converges
to the expectation E_𝐹(ℎ(𝑥)).

A.5.6 Jensen’s inequality


E(ℎ(𝒙)) ≥ ℎ(E(𝒙))

for a convex function ℎ(𝒙).


Recall: a convex function (such as 𝑥 2 ) has the shape of a “valley”.

A.6 Distributions
A.6.1 Bernoulli distribution and binomial distribution
The Bernoulli distribution Ber(𝑝) is the simplest distribution possible. It is named
after Jacob Bernoulli (1655–1705) who also invented the law of large numbers.
It describes a discrete binary random variable with two states 𝑥 = 0 (“failure”)
and 𝑥 = 1 (“success”), where the parameter 𝑝 ∈ [0, 1] is the probability of
“success”. Often the Bernoulli distribution is also referred to as “coin tossing”
model with the two outcomes “heads” and “tails”.
Correspondingly, the probability mass function of Ber(𝑝) is

𝑓 (𝑥 = 0) = Pr("failure") = 1 − 𝑝

and
𝑓 (𝑥 = 1) = Pr("success") = 𝑝
A compact way to write the PMF of the Bernoulli distribution is

𝑓 (𝑥|𝑝) = 𝑝 𝑥 (1 − 𝑝)1−𝑥

If a random variable 𝑥 follows the Bernoulli distribution we write

𝑥 ∼ Ber(𝑝) .

The expected value is E(𝑥) = 𝑝 and the variance is Var(𝑥) = 𝑝(1 − 𝑝).
Closely related to the Bernoulli distribution is the binomial distribution Bin(𝑚, 𝑝)
which results from repeating a Bernoulli experiment 𝑚 times and counting the
number of successes among the 𝑚 trials (without keeping track of the ordering
of the experiments).
Its probability mass function is:

\[
f(x|p) = \binom{m}{x} p^x (1-p)^{m-x}
\]

for 𝑥 = 0, 1, 2, ..., 𝑚. The binomial coefficient $\binom{m}{x}$ is needed to account for the
multiplicity of ways (orderings of samples) in which we can observe 𝑥 successes.
The expected value is E(𝑥) = 𝑚𝑝 and the variance is Var(𝑥) = 𝑚𝑝(1 − 𝑝).
If a random variable 𝑥 follows the binomial distribution we write

𝑥 ∼ Bin(𝑚, 𝑝)

For 𝑚 = 1 it reduces to the Bernoulli distribution Ber(𝑝).


In R the PMF of the binomial distribution is called dbinom(). The binomial
coefficient itself is computed by choose().

A.6.2 Normal distribution


Univariate normal distribution:

𝑥 ∼ 𝑁(𝜇, 𝜎2 ) with E(𝑥) = 𝜇 and Var(𝑥) = 𝜎2 .

Probability density function (PDF):

\[
f(x|\mu, \sigma^2) = (2\pi\sigma^2)^{-\frac{1}{2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
\]

In R the density function is called dnorm().

The standard normal distribution is 𝑁(0, 1) with mean 0 and variance 1.

Plot of the PDF of the standard normal:

[Figure: bell-shaped density 𝑓(𝑥) of 𝑁(0, 1) for 𝑥 ∈ [−6, 6]]

The cumulative distribution function (CDF) of the standard normal 𝑁(0, 1) is

\[
\Phi(x) = \int_{-\infty}^{x} f(x'|\mu = 0, \sigma^2 = 1)\, dx'
\]

There is no analytic expression for Φ(𝑥). In R the function is called pnorm().

Plot of the CDF of the standard normal:

[Figure: S-shaped curve Φ(𝑥) rising from 0 to 1 for 𝑥 ∈ [−6, 6]]

The inverse Φ⁻¹(𝑝) is called the quantile function of the standard normal. In R
the function is called qnorm().

[Figure: quantile function Φ⁻¹(𝑝) for 𝑝 ∈ (0, 1)]

The sum of two normal random variables is also normal (with the appropriate
mean and variance).

A.6.3 Gamma distribution aka scaled chi-squared and one-dimensional Wishart distribution

Assume 𝑚 independent normal random variables

\[
z_1, z_2, \ldots, z_m \sim N(0, \sigma_z^2)
\]

Then the sum of the squares

\[
x = \sum_{i=1}^{m} z_i^2
\]

with

\[
\mu_x = m \sigma_z^2
\]

follows a scaled chi-squared distribution

\[
x \sim \frac{\mu_x}{m} \chi^2_m = W_1\!\left( \frac{\mu_x}{m}, m \right) = \text{Gam}\!\left( \frac{1}{2} m, \frac{2 \mu_x}{m} \right)
\]

with degree of freedom 𝑚 and 𝑥 ≥ 0. The mean and variance of a scaled
chi-squared distributed variable are E(𝑥) = 𝜇ₓ and Var(𝑥) = 2𝜇ₓ²/𝑚.

Another name for the scaled chi-squared distribution is the one-dimensional Wishart
distribution 𝑊₁(𝜇ₓ/𝑚, 𝑚).

The gamma distribution Gam(𝛼, 𝛽) is another name for the scaled chi-squared
distribution with a different parameterisation in terms of a shape parameter
𝛼 and a scale parameter 𝛽. The scaled chi-squared distribution (𝜇ₓ/𝑚) 𝜒²ₘ equals
Gam(𝛼 = 𝑚/2, 𝛽 = 2𝜇ₓ/𝑚). The mean of Gam(𝛼, 𝛽) is 𝛼𝛽 = 𝜇ₓ and its variance is
𝛼𝛽² = 2𝜇ₓ²/𝑚.

The density of the gamma distribution (aka scaled chi-squared distribution)
is available in the R function dgamma(). The cumulative distribution function is
pgamma() and the quantile function is qgamma().

A.6.4 Special cases of the gamma distribution: exponential distribution and chi-squared distribution

The chi-squared distribution is a special case with 𝜎²_z = 1 and hence 𝜇ₓ/𝑚 = 1. It
has mean E(𝑥) = 𝑚 and variance Var(𝑥) = 2𝑚. The chi-squared distribution 𝜒²ₘ
equals Gam(𝑚/2, 2).

The exponential distribution Exp(𝛽) with scale parameter 𝛽 is another special
case of the gamma distribution with shape parameter 𝛼 = 1 (or 𝑚 = 2) and thus
equals Gam(1, 𝛽). It has mean 𝛽 and variance 𝛽².

Instead of the scale parameter the exponential distribution is also often specified
using a rate parameter 𝜆 = 1/𝛽.

Here is a plot of the density of the chi-squared distribution for degrees of freedom
𝑚 = 1 and 𝑚 = 3:

[Figure: two panels showing the chi-squared density for 𝑚 = 1 (left) and 𝑚 = 3 (right), for 𝑥 ∈ [0, 10]]

In R the density of the chi-squared distribution is given by dchisq(). The
cumulative distribution function is pchisq() and the quantile function is qchisq().
Likewise, in R dexp() gives the density of the exponential distribution, and
pexp() and qexp() are the corresponding cumulative distribution and quantile
functions.

A.7 Statistics
A.7.1 Statistical learning
The aim in statistics / data science / machine learning is to learn from data (from
experiments, observations, measurements) in order to understand the world.

Specifically, we want to identify the best model(s) for the data in order
• to explain the current data, and
• to enable good prediction of future data.

Note that it is easy to get models that only explain the data but do not predict
well! This is called overfitting the data and happens in particular if the model is
overparameterized for the amount of data available.

Specifically, we have data 𝑥₁, ..., 𝑥ₙ and models 𝑓(𝑥|𝜃) that are indexed by the
parameter 𝜃.

Often (but not always) 𝜃 can be interpreted and/or is associated with some
property of the model.

If there is only a single parameter we write 𝜃 (scalar parameter). For a parameter


vector we write 𝜽 (in bold type).

A.7.2 Point and interval estimation


• There is a parameter 𝜃 of interest in a model
• we are uncertain about this parameter (i.e. we don’t know the exact value)
• we would like to learn about this parameter by observing data 𝑥1 , . . . , 𝑥 𝑛
from the model

Often the parameter(s) of interest are related to moments (such as mean and
variance) or to quantiles of the distribution representing the model.

Estimation:
• An estimator for 𝜃 is a function 𝜃̂(𝑥₁, ..., 𝑥ₙ) that maps the data (input) to a "guess" (output) about 𝜃.
• A point estimator provides a single number for each parameter
• An interval estimator provides a set of possible values for each parameter.

Simple estimators of mean and variance:

Suppose we have data 𝑥 1 , . . . , 𝑥 𝑛 all sampled independently from a distribution


𝐹.

• The average (also known as the empirical mean) 𝜇̂ = (1/𝑛) ∑ᵢ₌₁ⁿ 𝑥ᵢ is an estimate of the mean of 𝐹.
• The empirical variance 𝜎̂²_ML = (1/𝑛) ∑ᵢ₌₁ⁿ (𝑥ᵢ − 𝜇̂)² is an estimate of the variance of 𝐹. Note the factor 1/𝑛. It is the maximum likelihood estimate assuming a normal model.
• The unbiased sample variance 𝜎̂²_UB = (1/(𝑛−1)) ∑ᵢ₌₁ⁿ (𝑥ᵢ − 𝜇̂)² is another estimate of the variance of 𝐹. Note the factor 1/(𝑛−1); therefore 𝑛 ≥ 2 is required for this estimator.

A.7.3 Sampling properties of a point estimator 𝜃̂

A point estimator 𝜃̂ depends on the data, hence it has sampling variation
(i.e. the estimate will be different for a new set of observations).

Thus 𝜃̂ can be seen as a random variable, and its distribution is called the sampling
distribution (across different experiments).

Properties of this distribution can be used to evaluate how far the estimator
deviates (on average across different experiments) from the true value:

Bias: Bias(𝜃̂) = E(𝜃̂) − 𝜃
Variance: Var(𝜃̂) = E((𝜃̂ − E(𝜃̂))²)
Mean squared error: MSE(𝜃̂) = E((𝜃̂ − 𝜃)²) = Var(𝜃̂) + Bias(𝜃̂)²

The last identity about the MSE follows from E(𝑥²) = Var(𝑥) + E(𝑥)².

At first sight it seems desirable to focus on unbiased (for finite 𝑛) estimators.


However, requiring strict unbiasedness is not always a good idea!

In many situations it is better to allow for some small bias in order to achieve
a smaller variance and an overall smaller MSE. This is called the bias-variance
tradeoff: more bias is traded for smaller variance (or, conversely, less bias
for higher variance).

A.7.4 Sampling distribution of mean and variance estimators for normal data

Suppose we have data 𝑥₁, ..., 𝑥ₙ all sampled from a normal distribution 𝑁(𝜇, 𝜎²).

• The empirical estimator of the mean parameter 𝜇 is given by 𝜇̂ = (1/𝑛) ∑ᵢ₌₁ⁿ 𝑥ᵢ.
Under the normal assumption the distribution of 𝜇̂ is

\[
\hat{\mu} \sim N\left( \mu, \frac{\sigma^2}{n} \right)
\]

Thus E(𝜇̂) = 𝜇 and Var(𝜇̂) = 𝜎²/𝑛. The estimate 𝜇̂ is unbiased since E(𝜇̂) − 𝜇 = 0.
The mean squared error of 𝜇̂ is MSE(𝜇̂) = 𝜎²/𝑛.

• The empirical variance 𝜎̂²_ML = (1/𝑛) ∑ᵢ₌₁ⁿ (𝑥ᵢ − 𝜇̂)² for normal data follows a
scaled chi-squared distribution or equivalently a Gamma distribution

\[
\hat{\sigma}^2_{\text{ML}} \sim \frac{\sigma^2}{n} \chi^2_{n-1} = \text{Gam}\Biggl( \underbrace{\frac{n-1}{2}}_{\text{shape}}, \underbrace{\frac{2\sigma^2}{n}}_{\text{scale}} \Biggr)
\]

Thus, E(𝜎̂²_ML) = ((𝑛−1)/𝑛) 𝜎² and Var(𝜎̂²_ML) = (2(𝑛−1)/𝑛²) 𝜎⁴. The estimate 𝜎̂²_ML is
biased since E(𝜎̂²_ML) − 𝜎² = −(1/𝑛) 𝜎². The mean squared error is MSE(𝜎̂²_ML) =
(2(𝑛−1)/𝑛²) 𝜎⁴ + (1/𝑛²) 𝜎⁴ = ((2𝑛−1)/𝑛²) 𝜎⁴.

• The unbiased variance 𝜎̂²_UB = (1/(𝑛−1)) ∑ᵢ₌₁ⁿ (𝑥ᵢ − 𝜇̂)² for normal data follows a
scaled chi-squared distribution or equivalently a Gamma distribution

\[
\hat{\sigma}^2_{\text{UB}} \sim \frac{\sigma^2}{n-1} \chi^2_{n-1} = \text{Gam}\Biggl( \underbrace{\frac{n-1}{2}}_{\text{shape}}, \underbrace{\frac{2\sigma^2}{n-1}}_{\text{scale}} \Biggr)
\]

Thus, E(𝜎̂²_UB) = 𝜎² and Var(𝜎̂²_UB) = (2/(𝑛−1)) 𝜎⁴. The estimate 𝜎̂²_UB is unbiased
since E(𝜎̂²_UB) − 𝜎² = 0. The mean squared error is MSE(𝜎̂²_UB) = (2/(𝑛−1)) 𝜎⁴.

Note that for any 𝑛 > 1 we find that Var(𝜎̂²_UB) > Var(𝜎̂²_ML) and
MSE(𝜎̂²_UB) > MSE(𝜎̂²_ML), so that the biased empirical estimator has both
lower variance and lower mean squared error than the unbiased estimator.

A.7.5 Asymptotics
Typically, Bias, Var and MSE all decrease with increasing sample size so that
with more data 𝑛 → ∞ the errors become smaller and smaller.
The typical rate of decrease of the variance of a good estimator is 1/𝑛. Thus, when
the sample size is doubled the variance is divided by 2 (and the standard deviation
is divided by √2).

Consistency: 𝜃̂ is called consistent if

MSE(𝜃̂) → 0 as 𝑛 → ∞

The three estimators discussed above (empirical mean, empirical variance,
unbiased variance) are all consistent as their MSE goes to zero with large sample
size 𝑛.

Consistency is a minimum essential requirement for any reasonable estimator!

Of all consistent estimators we typically prefer the estimator that is most efficient
(i.e. with the fastest decrease in MSE) and that therefore has the smallest variance and/or
MSE for given finite 𝑛.

Consistency implies that we recover the true model in the limit of infinite data if the
model class contains the true data generating model. If the model class does not
contain the true model then strict consistency cannot be achieved, but we still wish
to get as close as possible to the true model when choosing model parameters.

A.7.6 Confidence intervals


• A confidence interval (CI) is an interval estimate with a frequentist
interpretation.

• Definition of the coverage 𝜅 of a CI: how often (in repeated identical experiments)
does the estimated CI overlap the true parameter value 𝜃?
  – E.g.: coverage 𝜅 = 0.95 (95%) means that in 95 out of 100 cases the
    estimated CI will contain the (unknown) true value (i.e. it will "cover" 𝜃).

Illustration of the repeated construction of a CI for 𝜃: [figure omitted]

• Note that a CI is actually an estimate: ĈI(𝑥₁, ..., 𝑥ₙ), i.e. it depends on the
data and has a random (sampling) variation.

• A good CI has high coverage and is compact.

Note: the coverage probability is not the probability that the true value is
contained in a given estimated interval (that would be the Bayesian Credible
Interval).

A.7.7 Symmetric normal confidence interval

For a normally distributed univariate random variable it is straightforward to


construct a symmetric two-sided CI with a given desired coverage 𝜅.

For a normal random variable 𝑥 ∼ 𝑁(𝜇, 𝜎²) with mean 𝜇 and variance 𝜎² and
density function 𝑓(𝑥) we can compute the probability

\[
\Pr(x \leq \mu + c\sigma) = \int_{-\infty}^{\mu + c\sigma} f(x)\, dx = \Phi(c) = \frac{1 + \kappa}{2}
\]

where Φ is the cumulative distribution function (CDF) of the standard normal 𝑁(0, 1).

From the above we obtain the critical point 𝑐 from the quantile function, i.e. by
inversion of Φ:

\[
c = \Phi^{-1}\left( \frac{1 + \kappa}{2} \right)
\]

The following table lists 𝑐 for the three most commonly used values of 𝜅 - it is
useful to memorise these values!

Coverage 𝜅 Critical value 𝑐


0.9 1.64
0.95 1.96
0.99 2.58

A symmetric standard normal CI with nominal coverage 𝜅 for
• a scalar parameter 𝜃,
• with normally distributed estimate 𝜃̂, and
• with estimated standard deviation ŜD(𝜃̂) = 𝜎̂
is then given by

\[
\widehat{\text{CI}} = [\hat{\theta} \pm c\, \hat{\sigma}]
\]

where 𝑐 is chosen for the desired coverage level 𝜅.
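In R the construction reduces to a few lines (a sketch with made-up numbers for 𝜃̂ and 𝜎̂):

kappa <- 0.95
crit  <- qnorm((1 + kappa) / 2)        # critical value c, here 1.96
theta_hat <- 2.1                       # hypothetical point estimate
sd_hat    <- 0.5                       # hypothetical standard deviation
theta_hat + c(-1, 1) * crit * sd_hat   # the CI [theta-hat +/- c * sigma-hat]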

A.7.8 Confidence interval for chi-squared distribution

As for the normal CI we can compute critical values but for the chi-squared
distribution we use a one-sided interval:

Pr(𝑥 ≤ 𝑐) = 𝜅

As before we get 𝑐 by the quantile function, i.e. by inverting the CDF of the
chi-squared distribution.
The following table lists the critical values for the three most common choices of 𝜅 for
𝑚 = 1 (one degree of freedom):

Coverage 𝜅 Critical value 𝑐 (𝑚 = 1)


0.9 2.71
0.95 3.84
0.99 6.63

A one-sided CI with nominal coverage 𝜅 is then given by [0, 𝑐].


Appendix B

Further study

In this module we can only touch the surface of likelihood and Bayes inference.
As a starting point for further reading the following text books are recommended.

B.1 Recommended reading


• Faraway (2015) Linear Models with R (second edition). Chapman and
Hall/CRC.
• Held and Bové (2020) Applied Statistical Inference: Likelihood and Bayes (2nd
edition). Springer.
• Agresti and Kateri (2022) Foundations of Statistics for Data Scientists. Chap-
man and Hall/CRC.

B.2 Additional references


• Heard (2021) An Introduction to Bayesian Inference, Methods and Computation.
Springer.
• Gelman et al. (2014) Bayesian data analysis (3rd edition). CRC Press.
• Wood (2015) Core Statistics. Cambridge University Press. PDF available
from https://www.maths.ed.ac.uk/~swood34/core-statistics-nup.pdf

Bibliography

Agresti, A., and M. Kateri. 2022. Foundations of Statistics for Data Scientists.
Chapman & Hall/CRC.
Domingos, P. 2015. The Master Algorithm: How the Quest for the Ultimate Learning
Machine Will Remake Our World. Basic Books.
Faraway, J. J. 2015. Linear Models with R. 2nd ed. Chapman & Hall/CRC.
Gelman, A., J. B. Carlin, H. A. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin.
2014. Bayesian Data Analysis. 3rd ed. CRC Press.
Heard, N. 2021. An Introduction to Bayesian Inference, Methods and Computation.
Springer.
Held, L., and D. S. Bové. 2020. Applied Statistical Inference: Likelihood and Bayes.
2nd ed. Springer.
Wood, S. 2015. Core Statistics. Cambridge University Press.

