The Book of Statistical Proofs
DOI: 10.5281/zenodo.4305950
https://statproofbook.github.io/
[email protected]
2022-10-22, 07:22
Contents
I General Theorems 1
1 Probability theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1 Random experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Random experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Sample space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3 Event space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.4 Probability space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Random event . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Random variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.3 Random vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.4 Random matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.5 Constant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.6 Discrete vs. continuous . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.7 Univariate vs. multivariate . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.1 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.2 Joint probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.3 Marginal probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.4 Conditional probability . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.5 Exceedance probability . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.6 Statistical independence . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.7 Conditional independence . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.8 Probability under independence . . . . . . . . . . . . . . . . . . 9
1.3.9 Mutual exclusivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.10 Probability under exclusivity . . . . . . . . . . . . . . . . . . . . 10
1.4 Probability axioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.1 Axioms of probability . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.2 Monotonicity of probability . . . . . . . . . . . . . . . . . . . . 11
1.4.3 Probability of the empty set . . . . . . . . . . . . . . . . . . . . 12
1.4.4 Probability of the complement . . . . . . . . . . . . . . . . . . . 13
1.4.5 Range of probability . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.6 Addition law of probability . . . . . . . . . . . . . . . . . . . . . 14
1.4.7 Law of total probability . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.8 Probability of exhaustive events . . . . . . . . . . . . . . . . . . 16
1.4.9 Probability of exhaustive events . . . . . . . . . . . . . . . . . . 16
1.5 Probability distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.5.1 Probability distribution . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.7.4 Non-negativity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
1.7.5 Linearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
1.7.6 Monotonicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
1.7.7 (Non-)Multiplicativity . . . . . . . . . . . . . . . . . . . . . . . . 49
1.7.8 Expectation of a trace . . . . . . . . . . . . . . . . . . . . . . . . 51
1.7.9 Expectation of a quadratic form . . . . . . . . . . . . . . . . . . 52
1.7.10 Squared expectation of a product . . . . . . . . . . . . . . . . . 53
1.7.11 Law of total expectation . . . . . . . . . . . . . . . . . . . . . . . 54
1.7.12 Law of the unconscious statistician . . . . . . . . . . . . . . . . 55
1.7.13 Expected value of a random vector . . . . . . . . . . . . . . . . . . . 57
1.7.14 Expected value of a random matrix . . . . . . . . . . . . . . . . . . . 57
1.8 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
1.8.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
1.8.2 Sample variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
1.8.3 Partition into expected values . . . . . . . . . . . . . . . . . . . 58
1.8.4 Non-negativity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
1.8.5 Variance of a constant . . . . . . . . . . . . . . . . . . . . . . . . 60
1.8.6 Invariance under addition . . . . . . . . . . . . . . . . . . . . . . 61
1.8.7 Scaling upon multiplication . . . . . . . . . . . . . . . . . . . . . 61
1.8.8 Variance of a sum . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
1.8.9 Variance of linear combination . . . . . . . . . . . . . . . . . . . 63
1.8.10 Additivity under independence . . . . . . . . . . . . . . . . . . 63
1.8.11 Law of total variance . . . . . . . . . . . . . . . . . . . . . . . . . 64
1.8.12 Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
1.9 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
1.9.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
1.9.2 Sample covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
1.9.3 Partition into expected values . . . . . . . . . . . . . . . . . . . 66
1.9.4 Symmetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
1.9.5 Self-covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
1.9.6 Covariance under independence . . . . . . . . . . . . . . . . . . 67
1.9.7 Relationship to correlation . . . . . . . . . . . . . . . . . . . . . 68
1.9.8 Law of total covariance . . . . . . . . . . . . . . . . . . . . . . . 68
1.9.9 Covariance matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
1.9.10 Sample covariance matrix . . . . . . . . . . . . . . . . . . . . . . . . 70
1.9.11 Covariance matrix and expected values . . . . . . . . . . . . . 70
1.9.12 Symmetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
1.9.13 Positive semi-definiteness . . . . . . . . . . . . . . . . . . . . . . 71
1.9.14 Invariance under addition of vector . . . . . . . . . . . . . . . . 72
1.9.15 Scaling upon multiplication with matrix . . . . . . . . . . . . . 73
1.9.16 Cross-covariance matrix . . . . . . . . . . . . . . . . . . . . . . . . . 74
1.9.17 Covariance matrix of a sum . . . . . . . . . . . . . . . . . . . . 74
1.9.18 Covariance matrix and correlation matrix . . . . . . . . . . . 75
1.9.19 Precision matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
1.9.20 Precision matrix and correlation matrix . . . . . . . . . . . . . 77
1.10 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
1.10.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
1.10.2 Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
V Appendix 517
1 Proof by Number . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518
2 Definition by Number . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539
3 Proof by Topic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546
General Theorems
1 Probability theory
1.1 Random experiments
1.1.1 Random experiment
Definition: A random experiment is any repeatable procedure that results in one (→ Definition
I/1.2.2) out of a well-defined set of possible outcomes.
• The set of possible outcomes is called sample space (→ Definition I/1.1.2).
• A set of zero or more outcomes is called a random event (→ Definition I/1.2.1).
• A function that maps from events to probabilities is called a probability function (→ Definition
I/1.5.1).
Together, sample space (→ Definition I/1.1.2), event space (→ Definition I/1.1.3) and probability
function (→ Definition I/1.1.4) characterize a random experiment.
Sources:
• Wikipedia (2020): “Experiment (probability theory)”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-11-19; URL: https://en.wikipedia.org/wiki/Experiment_(probability_theory).
Metadata: ID: D109 | shortcut: rexp | author: JoramSoch | date: 2020-11-19, 04:10.
Sources:
• Wikipedia (2021): “Sample space”; in: Wikipedia, the free encyclopedia, retrieved on 2021-11-26;
URL: https://en.wikipedia.org/wiki/Sample_space.
Metadata: ID: D165 | shortcut: samp-spc | author: JoramSoch | date: 2021-11-26, 14:13.
Sources:
• Wikipedia (2021): “Event (probability theory)”; in: Wikipedia, the free encyclopedia, retrieved on
2021-11-26; URL: https://en.wikipedia.org/wiki/Event_(probability_theory).
Metadata: ID: D166 | shortcut: eve-spc | author: JoramSoch | date: 2021-11-26, 14:26.
Sources:
• Wikipedia (2021): “Probability space”; in: Wikipedia, the free encyclopedia, retrieved on 2021-11-
26; URL: https://en.wikipedia.org/wiki/Probability_space#Definition.
Metadata: ID: D167 | shortcut: prob-spc | author: JoramSoch | date: 2021-11-26, 14:30.
Sources:
• Wikipedia (2020): “Event (probability theory)”; in: Wikipedia, the free encyclopedia, retrieved on
2020-11-19; URL: https://en.wikipedia.org/wiki/Event_(probability_theory).
Metadata: ID: D110 | shortcut: reve | author: JoramSoch | date: 2020-11-19, 04:33.
Sources:
• Wikipedia (2020): “Random variable”; in: Wikipedia, the free encyclopedia, retrieved on 2020-05-
27; URL: https://en.wikipedia.org/wiki/Random_variable#Definition.
Metadata: ID: D65 | shortcut: rvar | author: JoramSoch | date: 2020-05-27, 22:36.
Sources:
• Wikipedia (2020): “Multivariate random variable”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-05-27; URL: https://en.wikipedia.org/wiki/Multivariate_random_variable.
Metadata: ID: D66 | shortcut: rvec | author: JoramSoch | date: 2020-05-27, 22:44.
Sources:
• Wikipedia (2020): “Random matrix”; in: Wikipedia, the free encyclopedia, retrieved on 2020-05-27;
URL: https://en.wikipedia.org/wiki/Random_matrix.
Metadata: ID: D67 | shortcut: rmat | author: JoramSoch | date: 2020-05-27, 22:48.
1.2.5 Constant
Definition: A constant is a quantity which does not change and thus always has the same value.
From a statistical perspective, a constant is a random variable (→ Definition I/1.2.2) which is equal
to its expected value (→ Definition I/1.7.1)
X = E(X) (1)
or equivalently, whose variance (→ Definition I/1.8.1) is zero
Var(X) = 0 . (2)
Sources:
• ProofWiki (2020): “Definition: Constant”; in: ProofWiki, retrieved on 2020-09-09; URL: https:
//proofwiki.org/wiki/Definition:Constant#Definition.
Metadata: ID: D96 | shortcut: const | author: JoramSoch | date: 2020-09-09, 01:30.
Sources:
• Wikipedia (2020): “Random variable”; in: Wikipedia, the free encyclopedia, retrieved on 2020-10-
29; URL: https://en.wikipedia.org/wiki/Random_variable#Standard_case.
Metadata: ID: D105 | shortcut: rvar-disc | author: JoramSoch | date: 2020-10-29, 04:44.
Sources:
• Wikipedia (2020): “Multivariate random variable”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-11-06; URL: https://en.wikipedia.org/wiki/Multivariate_random_variable.
Metadata: ID: D106 | shortcut: rvar-uni | author: JoramSoch | date: 2020-11-06, 03:47.
1.3 Probability
1.3.1 Probability
Definition: Let E be a statement about an arbitrary event such as the outcome of a random
experiment (→ Definition I/1.1.1). Then, p(E) is called the probability of E and may be interpreted
as
• (objectivist interpretation of probability:) some physical state of affairs, e.g. the relative frequency
of occurrence of E, when repeating the experiment (“Frequentist probability”); or
• (subjectivist interpretation of probability:) a degree of belief in E, e.g. the price at which someone
would buy or sell a bet that pays 1 unit of utility if E and 0 if not E (“Bayesian probability”).
Sources:
• Wikipedia (2020): “Probability”; in: Wikipedia, the free encyclopedia, retrieved on 2020-05-10;
URL: https://en.wikipedia.org/wiki/Probability#Interpretations.
Metadata: ID: D48 | shortcut: prob | author: JoramSoch | date: 2020-05-10, 19:41.
Sources:
• Wikipedia (2020): “Joint probability distribution”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-05-10; URL: https://en.wikipedia.org/wiki/Joint_probability_distribution.
• Jason Brownlee (2019): “A Gentle Introduction to Joint, Marginal, and Conditional Probability”;
in: Machine Learning Mastery, retrieved on 2021-08-01; URL: https://machinelearningmastery.
com/joint-marginal-and-conditional-probability-for-machine-learning/.
Metadata: ID: D49 | shortcut: prob-joint | author: JoramSoch | date: 2020-05-10, 19:49.
Sources:
• Wikipedia (2020): “Marginal distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
05-10; URL: https://en.wikipedia.org/wiki/Marginal_distribution#Definition.
• Jason Brownlee (2019): “A Gentle Introduction to Joint, Marginal, and Conditional Probability”;
in: Machine Learning Mastery, retrieved on 2021-08-01; URL: https://machinelearningmastery.
com/joint-marginal-and-conditional-probability-for-machine-learning/.
Metadata: ID: D50 | shortcut: prob-marg | author: JoramSoch | date: 2020-05-10, 20:01.
joint probability (→ Definition I/1.3.2) distribution p(A, B). Then, p(A|B) is called the conditional
probability that A is true, given that B is true, and is given by
p(A|B) = p(A, B) / p(B) (1)
where p(B) is the marginal probability (→ Definition I/1.3.3) of B.
Sources:
• Wikipedia (2020): “Conditional probability”; in: Wikipedia, the free encyclopedia, retrieved on
2020-05-10; URL: https://en.wikipedia.org/wiki/Conditional_probability#Definition.
• Jason Brownlee (2019): “A Gentle Introduction to Joint, Marginal, and Conditional Probability”;
in: Machine Learning Mastery, retrieved on 2021-08-01; URL: https://machinelearningmastery.
com/joint-marginal-and-conditional-probability-for-machine-learning/.
Metadata: ID: D51 | shortcut: prob-cond | author: JoramSoch | date: 2020-05-10, 20:06.
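The ratio definition above can be checked numerically on any finite sample space. The following sketch (the two-dice setup is an illustrative assumption, not from the text) computes p(A|B) as the ratio of joint and marginal probability:

```python
from fractions import Fraction
from itertools import product

# Finite sample space: all ordered outcomes of two fair dice, equally likely.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event (a predicate on outcomes) under the uniform measure."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] + w[1] == 7   # statement A: the dice sum to 7
B = lambda w: w[0] == 3          # statement B: the first die shows 3

p_joint = prob(lambda w: A(w) and B(w))   # joint probability p(A, B)
p_B = prob(B)                             # marginal probability p(B)
p_A_given_B = p_joint / p_B               # p(A|B) = p(A, B) / p(B)

print(p_A_given_B)  # 1/6
```

Exact rational arithmetic via `fractions` avoids floating-point noise, so the identity holds exactly.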
Sources:
• Stephan KE, Penny WD, Daunizeau J, Moran RJ, Friston KJ (2009): “Bayesian model selection for
group studies”; in: NeuroImage, vol. 46, pp. 1004–1017, eq. 16; URL: https://www.sciencedirect.
com/science/article/abs/pii/S1053811909002638; DOI: 10.1016/j.neuroimage.2009.03.025.
• Soch J, Allefeld C (2016): “Exceedance Probabilities for the Dirichlet Distribution”; in: arXiv
stat.AP, 1611.01439; URL: https://arxiv.org/abs/1611.01439.
Metadata: ID: D103 | shortcut: prob-exc | author: JoramSoch | date: 2020-10-22, 04:36.
where p(x1 , . . . , xn ) are the joint probabilities (→ Definition I/1.3.2) of X1 , . . . , Xn and p(xi ) are the
marginal probabilities (→ Definition I/1.3.3) of Xi .
where F are the joint (→ Definition I/1.5.2) or marginal (→ Definition I/1.5.3) cumulative distri-
bution functions (→ Definition I/1.6.13) and f are the respective probability density functions (→
Definition I/1.6.6).
Sources:
• Wikipedia (2020): “Independence (probability theory)”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-06-06; URL: https://en.wikipedia.org/wiki/Independence_(probability_theory)
#Definition.
Metadata: ID: D75 | shortcut: ind | author: JoramSoch | date: 2020-06-06, 07:16.
where p(x1 , . . . , xn |y) are the joint (conditional) probabilities (→ Definition I/1.3.2) of X1 , . . . , Xn
given Y and p(xi ) are the marginal (conditional) probabilities (→ Definition I/1.3.3) of Xi given Y .
F_X1,...,Xn|Y=y(x1, . . . , xn) = ∏_{i=1}^n F_Xi|Y=y(xi) for all xi ∈ Xi and all y ∈ Y (2)
where F are the joint (conditional) (→ Definition I/1.5.2) or marginal (conditional) (→ Definition
I/1.5.3) cumulative distribution functions (→ Definition I/1.6.13) and f are the respective probability
density functions (→ Definition I/1.6.6).
Sources:
• Wikipedia (2020): “Conditional independence”; in: Wikipedia, the free encyclopedia, retrieved on
2020-11-19; URL: https://en.wikipedia.org/wiki/Conditional_independence#Conditional_independence_
of_random_variables.
Metadata: ID: D112 | shortcut: ind-cond | author: JoramSoch | date: 2020-11-19, 05:40.
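Conditional independence without marginal independence can be illustrated with a small mixture model (an assumed example, not from the text): two coin tosses are independent given which of two biased coins Y was chosen, yet dependent marginally:

```python
from fractions import Fraction

half = Fraction(1, 2)
# Conditional success probability of a single toss given the chosen coin Y.
p_x_given_y = {0: Fraction(1, 4), 1: Fraction(3, 4)}

def p_cond(x, y):
    """p(X_i = x | Y = y) for a single toss."""
    q = p_x_given_y[y]
    return q if x == 1 else 1 - q

def p_joint(x1, x2):
    """Marginal joint probability p(x1, x2), mixing over Y with p(y) = 1/2."""
    return sum(half * p_cond(x1, y) * p_cond(x2, y) for y in (0, 1))

def p_marg(x):
    """Marginal probability p(x) of a single toss."""
    return sum(half * p_cond(x, y) for y in (0, 1))

# By construction, p(x1, x2 | y) = p(x1|y) p(x2|y) for every y (conditional
# independence), yet the marginal joint does not factorize:
print(p_joint(1, 1), p_marg(1) * p_marg(1))  # prints: 5/16 1/4
```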
p(A) = p(A|B)
p(B) = p(B|A) . (1)
Proof: If A and B are independent (→ Definition I/1.3.6), then the joint probability (→ Definition
I/1.3.2) is equal to the product of the marginal probabilities (→ Definition I/1.3.3):

p(A, B) = p(A) · p(B) . (2)

The conditional probability (→ Definition I/1.3.4) is defined as the ratio of joint and marginal
probability:

p(A|B) = p(A, B) / p(B) . (3)

Combining (2) and (3), we have:

p(A|B) = p(A) · p(B) / p(B) = p(A) . (4)

Equivalently, we can write:

p(B|A) = p(A, B) / p(A) = p(A) · p(B) / p(A) = p(B) . (5)
Sources:
• Wikipedia (2021): “Independence (probability theory)”; in: Wikipedia, the free encyclopedia, re-
trieved on 2021-07-23; URL: https://en.wikipedia.org/wiki/Independence_(probability_theory)
#Definition.
Metadata: ID: P241 | shortcut: prob-ind | author: JoramSoch | date: 2021-07-23, 16:05.
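A quick numeric sanity check (a hypothetical two-dice example): for events that depend on different dice, p(A|B) = p(A) and p(B|A) = p(B) hold exactly:

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # ordered outcomes of two fair dice

def prob(event):
    """Probability of a predicate on outcomes under the uniform measure."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] % 2 == 0   # A depends only on the first die
B = lambda w: w[1] > 4        # B depends only on the second die

p_A, p_B = prob(A), prob(B)
p_AB = prob(lambda w: A(w) and B(w))

assert p_AB == p_A * p_B      # independence: p(A, B) = p(A) * p(B)
assert p_AB / p_B == p_A      # hence p(A|B) = p(A)
assert p_AB / p_A == p_B      # and p(B|A) = p(B)
```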
p(A1 , . . . , An ) = 0 (1)
where p(A1 , . . . , An ) is the joint probability (→ Definition I/1.3.2) of the statements A1 , . . . , An .
Sources:
• Wikipedia (2021): “Mutual exclusivity”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
07-23; URL: https://en.wikipedia.org/wiki/Mutual_exclusivity#Probability.
Metadata: ID: D156 | shortcut: exc | author: JoramSoch | date: 2021-07-23, 16:32.
Proof: If A and B are mutually exclusive (→ Definition I/1.3.9), then their joint probability (→
Definition I/1.3.2) is zero:
p(A, B) = 0 . (2)
The addition law of probability (→ Proof I/1.4.6) states that

p(A ∪ B) = p(A) + p(B) − p(A, B) . (3)

Combining (2) and (3), we obtain:

p(A ∪ B) = p(A) + p(B) . (4)
Sources:
• Wikipedia (2021): “Mutual exclusivity”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
07-23; URL: https://en.wikipedia.org/wiki/Mutual_exclusivity#Probability.
Metadata: ID: P242 | shortcut: prob-exc | author: JoramSoch | date: 2021-07-23, 17:19.
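The result can be verified on a single die (an assumed example): for two events that cannot co-occur, the probability of their union is the sum of their probabilities:

```python
from fractions import Fraction

omega = range(1, 7)  # one fair die

def prob(event):
    """Probability of a predicate on die faces under the uniform measure."""
    return Fraction(sum(1 for w in omega if event(w)), 6)

A = lambda w: w <= 2   # die shows 1 or 2
B = lambda w: w >= 5   # die shows 5 or 6 -- cannot co-occur with A

p_AB = prob(lambda w: A(w) and B(w))
p_union = prob(lambda w: A(w) or B(w))

assert p_AB == 0                      # mutual exclusivity: p(A, B) = 0
assert p_union == prob(A) + prob(B)   # hence p(A ∪ B) = p(A) + p(B)
```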
P (Ω) = 1 . (2)
• Third axiom: The probability of any countable sequence of disjoint (i.e. mutually exclusive (→
Definition I/1.3.9)) events E1 , E2 , E3 , . . . is equal to the sum of the probabilities of the individual
events:
P(⋃_{i=1}^∞ Ei) = ∑_{i=1}^∞ P(Ei) . (3)
Sources:
• A.N. Kolmogorov (1950): “Elementary Theory of Probability”; in: Foundations of the Theory of
Probability, p. 2; URL: https://archive.org/details/foundationsofthe00kolm/page/2/mode/2up.
• Alan Stuart & J. Keith Ord (1994): “Probability and Statistical Inference”; in: Kendall’s Advanced
Theory of Statistics, Vol. 1: Distribution Theory, ch. 8.6, p. 288, eqs. 8.2-8.4; URL: https://www.
wiley.com/en-us/Kendall%27s+Advanced+Theory+of+Statistics%2C+3+Volumes%2C+Set%2C+
6th+Edition-p-9780470669549.
• Wikipedia (2021): “Probability axioms”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
07-30; URL: https://en.wikipedia.org/wiki/Probability_axioms#Axioms.
Metadata: ID: D158 | shortcut: prob-ax | author: JoramSoch | date: 2021-07-30, 11:11.
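The three axioms can be checked directly on any finite probability space; the toy distribution below is an assumption for illustration:

```python
from fractions import Fraction

# A finite probability space: outcome -> probability (an assumed toy example).
P = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
omega = set(P)

def prob(event):
    """P(E) for an event given as a set of outcomes."""
    return sum(P[w] for w in event)

# First axiom: non-negativity of every probability.
assert all(p >= 0 for p in P.values())
# Second axiom: unit measure, P(Omega) = 1.
assert prob(omega) == 1
# Third axiom (finite case): additivity over disjoint events.
E1, E2 = {"a"}, {"b", "c"}
assert E1.isdisjoint(E2)
assert prob(E1 | E2) == prob(E1) + prob(E2)
```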
Proof: Set E1 = A, E2 = B \ A and Ei = ∅ for i ≥ 3. Then, the sets Ei are pairwise disjoint and
E1 ∪ E2 ∪ . . . = B, because A ⊆ B. Thus, from the third axiom of probability (→ Definition I/1.4.1),
we have:
P(B) = P(A) + P(B \ A) + ∑_{i=3}^∞ P(Ei) . (2)
Since, by the first axiom of probability (→ Definition I/1.4.1), the right-hand side is a series of
non-negative numbers converging to P(B) on the left-hand side, it follows that

P(A) ≤ P(B) . (3)
Sources:
• A.N. Kolmogorov (1950): “Elementary Theory of Probability”; in: Foundations of the Theory of
Probability, p. 6; URL: https://archive.org/details/foundationsofthe00kolm/page/6/mode/2up.
• Alan Stuart & J. Keith Ord (1994): “Probability and Statistical Inference”; in: Kendall’s Advanced
Theory of Statistics, Vol. 1: Distribution Theory, pp. 288-289; URL: https://www.wiley.com/
en-us/Kendall%27s+Advanced+Theory+of+Statistics%2C+3+Volumes%2C+Set%2C+6th+Edition-p-9780470669549.
• Wikipedia (2021): “Probability axioms”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
07-30; URL: https://en.wikipedia.org/wiki/Probability_axioms#Monotonicity.
Metadata: ID: P243 | shortcut: prob-mon | author: JoramSoch | date: 2021-07-30, 11:37.
P (∅) = 0 . (1)
Proof: Set Ei = ∅ for all i ≥ 1. These events are pairwise disjoint, so the third axiom of probability
(→ Definition I/1.4.1) implies

P(∅) = ∑_{i=1}^∞ P(∅) . (2)

Assume that the probability of the empty set is not zero, i.e. P(∅) > 0. Then, the right-hand side of
(2) would be infinite. However, by the first axiom of probability (→ Definition I/1.4.1), the left-hand
side must be finite. This is a contradiction. Therefore, P(∅) = 0.
Sources:
• A.N. Kolmogorov (1950): “Elementary Theory of Probability”; in: Foundations of the Theory of
Probability, p. 6, eq. 3; URL: https://archive.org/details/foundationsofthe00kolm/page/6/mode/
2up.
• Alan Stuart & J. Keith Ord (1994): “Probability and Statistical Inference”; in: Kendall’s Advanced
Theory of Statistics, Vol. 1: Distribution Theory, ch. 8.6, p. 288, eq. (b); URL: https://www.wiley.
com/en-us/Kendall%27s+Advanced+Theory+of+Statistics%2C+3+Volumes%2C+Set%2C+6th+
Edition-p-9780470669549.
• Wikipedia (2021): “Probability axioms”; in: Wikipedia, the free encyclopedia, retrieved on 2021-07-
30; URL: https://en.wikipedia.org/wiki/Probability_axioms#The_probability_of_the_empty_
set.
Metadata: ID: P244 | shortcut: prob-emp | author: JoramSoch | date: 2021-07-30, 11:58.
Proof: Since A and Ac are mutually exclusive (→ Definition I/1.3.9) and A ∪ Ac = Ω, the third
axiom of probability (→ Definition I/1.4.1) implies:
P (A ∪ Ac ) = P (A) + P (Ac )
P (Ω) = P (A) + P (Ac ) (2)
P (Ac ) = P (Ω) − P (A) .
The second axiom of probability (→ Definition I/1.4.1) states that P(Ω) = 1, such that we obtain:

P(Ac) = 1 − P(A) . (3)
Sources:
• A.N. Kolmogorov (1950): “Elementary Theory of Probability”; in: Foundations of the Theory of
Probability, p. 6, eq. 2; URL: https://archive.org/details/foundationsofthe00kolm/page/6/mode/
2up.
• Alan Stuart & J. Keith Ord (1994): “Probability and Statistical Inference”; in: Kendall’s Advanced
Theory of Statistics, Vol. 1: Distribution Theory, ch. 8.6, p. 288, eq. (c); URL: https://www.wiley.
com/en-us/Kendall%27s+Advanced+Theory+of+Statistics%2C+3+Volumes%2C+Set%2C+6th+
Edition-p-9780470669549.
• Wikipedia (2021): “Probability axioms”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
07-30; URL: https://en.wikipedia.org/wiki/Probability_axioms#The_complement_rule.
Metadata: ID: P245 | shortcut: prob-comp | author: JoramSoch | date: 2021-07-30, 12:14.
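A minimal numeric illustration of the complement rule (the die example is an assumption, not from the text):

```python
from fractions import Fraction

omega = set(range(1, 7))                 # one fair die
prob = lambda E: Fraction(len(E), 6)     # uniform measure: P(E) = |E| / 6

A = {1, 2}
A_c = omega - A                          # complement of A within Omega

assert prob(A_c) == 1 - prob(A)          # P(A^c) = 1 - P(A)
```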
0 ≤ P (E) ≤ 1 . (1)
Proof: The first axiom of probability (→ Definition I/1.4.1) states that

P(E) ≥ 0 . (2)
By combining the first axiom of probability (→ Definition I/1.4.1) and the probability of the com-
plement (→ Proof I/1.4.4), we obtain:
1 − P (E) = P (E c ) ≥ 0
1 − P (E) ≥ 0 (3)
P (E) ≤ 1 .
Combining (2) and (3) yields

0 ≤ P(E) ≤ 1 . (4)
Sources:
• A.N. Kolmogorov (1950): “Elementary Theory of Probability”; in: Foundations of the Theory of
Probability, p. 6; URL: https://archive.org/details/foundationsofthe00kolm/page/6/mode/2up.
• Alan Stuart & J. Keith Ord (1994): “Probability and Statistical Inference”; in: Kendall’s Advanced
Theory of Statistics, Vol. 1: Distribution Theory, pp. 288-289; URL: https://www.wiley.com/
en-us/Kendall%27s+Advanced+Theory+of+Statistics%2C+3+Volumes%2C+Set%2C+6th+Edition-p-9780470669549.
• Wikipedia (2021): “Probability axioms”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
07-30; URL: https://en.wikipedia.org/wiki/Probability_axioms#The_numeric_bound.
Metadata: ID: P246 | shortcut: prob-range | author: JoramSoch | date: 2021-07-30, 12:25.
P(A ∪ B) = P(A) + P(B \ A)
P(A ∪ B) = P(A) + P(B \ [A ∩ B]) . (2)

Then, let E1 = B \ [A ∩ B] and E2 = A ∩ B, such that E1 ∪ E2 = B. Again, from the third axiom of
probability (→ Definition I/1.4.1), we obtain:

P(B) = P(B \ [A ∩ B]) + P(A ∩ B)
P(B \ [A ∩ B]) = P(B) − P(A ∩ B) . (3)

Plugging (3) into (2), we finally obtain:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B) . (4)
Sources:
• A.N. Kolmogorov (1950): “Elementary Theory of Probability”; in: Foundations of the Theory of
Probability, p. 2; URL: https://archive.org/details/foundationsofthe00kolm/page/2/mode/2up.
• Alan Stuart & J. Keith Ord (1994): “Probability and Statistical Inference”; in: Kendall’s Advanced
Theory of Statistics, Vol. 1: Distribution Theory, ch. 8.6, p. 288, eq. (a); URL: https://www.wiley.
com/en-us/Kendall%27s+Advanced+Theory+of+Statistics%2C+3+Volumes%2C+Set%2C+6th+
Edition-p-9780470669549.
• Wikipedia (2021): “Probability axioms”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
07-30; URL: https://en.wikipedia.org/wiki/Probability_axioms#Further_consequences.
Metadata: ID: P247 | shortcut: prob-add | author: JoramSoch | date: 2021-07-30, 12:45.
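The addition law can be confirmed by counting outcomes for two overlapping events (an assumed two-dice example):

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))      # two fair dice
prob = lambda E: Fraction(len(E), len(omega))    # uniform measure on outcomes

A = {w for w in omega if w[0] == 6}              # first die shows 6
B = {w for w in omega if w[0] + w[1] >= 10}      # sum is at least 10 (overlaps A)

# P(A or B) = P(A) + P(B) - P(A and B)
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)
```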
Bi ∩ Bj = ∅ ⇒ (A ∩ Bi ) ∩ (A ∩ Bj ) = A ∩ (Bi ∩ Bj ) = A ∩ ∅ = ∅ . (2)
Because the Bi are exhaustive, the sets (A ∩ Bi ) are also exhaustive:
∪i Bi = Ω ⇒ ∪i (A ∩ Bi ) = A ∩ (∪i Bi ) = A ∩ Ω = A . (3)
Thus, the third axiom of probability (→ Definition I/1.4.1) implies that
P(A) = ∑_i P(A ∩ Bi) . (4)
Sources:
• Alan Stuart & J. Keith Ord (1994): “Probability and Statistical Inference”; in: Kendall’s Advanced
Theory of Statistics, Vol. 1: Distribution Theory, p. 288, eq. (d); p. 289, eq. 8.7; URL: https://www.
wiley.com/en-us/Kendall%27s+Advanced+Theory+of+Statistics%2C+3+Volumes%2C+Set%2C+
6th+Edition-p-9780470669549.
• Wikipedia (2021): “Law of total probability”; in: Wikipedia, the free encyclopedia, retrieved on
2021-08-08; URL: https://en.wikipedia.org/wiki/Law_of_total_probability#Statement.
Metadata: ID: P248 | shortcut: prob-tot | author: JoramSoch | date: 2021-08-08, 03:56.
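The law of total probability can be verified by partitioning the two-dice sample space by the value of the first die (an illustrative assumption, not from the text):

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))      # two fair dice
prob = lambda E: Fraction(len(E), len(omega))    # uniform measure on outcomes

A = {w for w in omega if w[0] + w[1] == 7}       # event A: the dice sum to 7
# B_1, ..., B_6: partition of Omega by the value of the first die.
B = {k: {w for w in omega if w[0] == k} for k in range(1, 7)}

# P(A) equals the sum of P(A and B_i) over the partition.
assert prob(A) == sum(prob(A & B[k]) for k in range(1, 7))
```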
∪i Bi = Ω . (3)
Thus, the third axiom of probability (→ Definition I/1.4.1) implies that
∑_i P(Bi) = P(Ω) . (4)

Since the second axiom of probability (→ Definition I/1.4.1) states that P(Ω) = 1, the probabilities
of the exhaustive events sum to one.
Sources:
• Alan Stuart & J. Keith Ord (1994): “Probability and Statistical Inference”; in: Kendall’s Advanced
Theory of Statistics, Vol. 1: Distribution Theory, pp. 288-289; URL: https://www.wiley.com/
en-us/Kendall%27s+Advanced+Theory+of+Statistics%2C+3+Volumes%2C+Set%2C+6th+Edition-p-9780470669549.
• Wikipedia (2021): “Probability axioms”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
08-08; URL: https://en.wikipedia.org/wiki/Probability_axioms#Axioms.
Metadata: ID: P249 | shortcut: prob-exh | author: JoramSoch | date: 2021-08-08, 04:10.
Proof: The addition law of probability (→ Proof I/1.4.6) states that for two events (→ Definition
I/1.2.1) A and B, the probability (→ Definition I/1.3.1) of at least one of them occurring is:
∪i Bi = Ω . (6)
Since the probability of the sample space is one (→ Definition I/1.4.1), this means that the left-hand
side of (5) becomes equal to one:
1 = ∑_{i=1}^n P(Bi) . (7)
Sources:
• Alan Stuart & J. Keith Ord (1994): “Probability and Statistical Inference”; in: Kendall’s Advanced
Theory of Statistics, Vol. 1: Distribution Theory, pp. 288-289; URL: https://www.wiley.com/
en-us/Kendall%27s+Advanced+Theory+of+Statistics%2C+3+Volumes%2C+Set%2C+6th+Edition-p-9780470669549.
• Wikipedia (2022): “Probability axioms”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
03-27; URL: https://en.wikipedia.org/wiki/Probability_axioms#Consequences.
Metadata: ID: P319 | shortcut: prob-exh2 | author: JoramSoch | date: 2022-03-27, 23:14.
Sources:
• Wikipedia (2020): “Probability distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2020-05-17; URL: https://en.wikipedia.org/wiki/Probability_distribution.
Metadata: ID: D55 | shortcut: dist | author: JoramSoch | date: 2020-05-17, 20:23.
Sources:
• Wikipedia (2020): “Joint probability distribution”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-05-17; URL: https://en.wikipedia.org/wiki/Joint_probability_distribution.
Metadata: ID: D56 | shortcut: dist-joint | author: JoramSoch | date: 2020-05-17, 20:43.
Sources:
• Wikipedia (2020): “Marginal distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
05-17; URL: https://en.wikipedia.org/wiki/Marginal_distribution.
Metadata: ID: D57 | shortcut: dist-marg | author: JoramSoch | date: 2020-05-17, 21:02.
Sources:
• Wikipedia (2020): “Conditional probability distribution”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-05-17; URL: https://en.wikipedia.org/wiki/Conditional_probability_distribution.
Metadata: ID: D58 | shortcut: dist-cond | author: JoramSoch | date: 2020-05-17, 21:25.
Sources:
• Wikipedia (2021): “Sampling distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
03-31; URL: https://en.wikipedia.org/wiki/Sampling_distribution.
Metadata: ID: D140 | shortcut: dist-samp | author: JoramSoch | date: 2021-03-31, 09:43.
fX(x) = 0 for all x ∉ X , (1)
Sources:
• Wikipedia (2020): “Probability mass function”; in: Wikipedia, the free encyclopedia, retrieved on
2020-02-13; URL: https://en.wikipedia.org/wiki/Probability_mass_function.
fZ(z) = ∑_{y∈Y} fX(z − y) fY(y)
or fZ(z) = ∑_{x∈X} fY(z − x) fX(x) (1)
where fX (x), fY (y) and fZ (z) are the probability mass functions (→ Definition I/1.6.1) of X, Y and
Z.
Proof: Using the definition of the probability mass function (→ Definition I/1.6.1) and the expected
value (→ Definition I/1.7.1), the first equation can be derived as follows:
fZ(z) = Pr(Z = z)
      = Pr(X + Y = z)
      = Pr(X = z − Y)
      = E[Pr(X = z − Y |Y = y)]
      = E[Pr(X = z − Y)]
      = E[fX(z − Y)]
      = ∑_{y∈Y} fX(z − y) fY(y) . (2)
Note that the third-last transition is justified by the fact that X and Y are independent (→ Definition
I/1.3.6), such that conditional probabilities are equal to marginal probabilities (→ Proof I/1.3.8).
The second equation can be derived by switching X and Y .
Sources:
• Taboga, Marco (2017): “Sums of independent random variables”; in: Lectures on probability and
mathematical statistics, retrieved on 2021-08-30; URL: https://www.statlect.com/fundamentals-of-probability/sums-of-independent-random-variables.
Metadata: ID: P257 | shortcut: pmf-sumind | author: JoramSoch | date: 2021-08-30, 09:14.
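The convolution formula translates directly into code. The sketch below (fair-dice PMFs are an assumed example) computes fZ via the first line of equation (1) and cross-checks it against brute-force enumeration of the joint sample space:

```python
from fractions import Fraction
from itertools import product

# PMFs of two independent fair dice, as dictionaries value -> probability.
f_X = {x: Fraction(1, 6) for x in range(1, 7)}
f_Y = {y: Fraction(1, 6) for y in range(1, 7)}

def pmf_sum(f_X, f_Y):
    """Discrete convolution: accumulate f_X(x) * f_Y(y) at z = x + y,
    which equals f_Z(z) = sum over y of f_X(z - y) * f_Y(y)."""
    f_Z = {}
    for y, p_y in f_Y.items():
        for x, p_x in f_X.items():
            f_Z[x + y] = f_Z.get(x + y, Fraction(0)) + p_x * p_y
    return f_Z

f_Z = pmf_sum(f_X, f_Y)

# Cross-check against direct enumeration of the joint sample space.
for z in range(2, 13):
    direct = sum(f_X[x] * f_Y[y] for x, y in product(f_X, f_Y) if x + y == z)
    assert f_Z[z] == direct

print(f_Z[7])  # 1/6
```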
fY(y) = fX(g^-1(y)) , if y ∈ Y
        0 ,           if y ∉ Y (1)

where g^-1(y) is the inverse function of g(x) and Y is the set of possible outcomes of Y :
Y = {y = g(x) : x ∈ X } . (2)
Proof: Because a strictly increasing function is invertible, the probability mass function (→ Defini-
tion I/1.6.1) of Y can be derived as follows:
fY(y) = Pr(Y = y)
      = Pr(g(X) = y)
      = Pr(X = g^-1(y))
      = fX(g^-1(y)) . (3)
Sources:
• Taboga, Marco (2017): “Functions of random variables and their distribution”; in: Lectures on
probability and mathematical statistics, retrieved on 2020-10-29; URL: https://www.statlect.com/
fundamentals-of-probability/functions-of-random-variables-and-their-distribution#hid3.
Metadata: ID: P184 | shortcut: pmf-sifct | author: JoramSoch | date: 2020-10-29, 05:55.
Y = {y = g(x) : x ∈ X } . (2)
Proof: Because a strictly decreasing function is invertible, the probability mass function (→ Defini-
tion I/1.6.1) of Y can be derived as follows:
f_Y(y) = Pr(Y = y)
       = Pr(g(X) = y)
       = Pr(X = g^{-1}(y))   (3)
       = f_X(g^{-1}(y)) .
Sources:
• Taboga, Marco (2017): “Functions of random variables and their distribution”; in: Lectures on
probability and mathematical statistics, retrieved on 2020-11-06; URL: https://www.statlect.com/
fundamentals-of-probability/functions-of-random-variables-and-their-distribution#hid6.
Metadata: ID: P187 | shortcut: pmf-sdfct | author: JoramSoch | date: 2020-11-06, 04:21.
Y = {y = g(x) : x ∈ X } . (2)
Proof: Because an invertible function is a one-to-one mapping, the probability mass function (→
Definition I/1.6.1) of Y can be derived as follows:
f_Y(y) = Pr(Y = y)
       = Pr(g(X) = y)
       = Pr(X = g^{-1}(y))   (3)
       = f_X(g^{-1}(y)) .
Sources:
• Taboga, Marco (2017): “Functions of random vectors and their distribution”; in: Lectures on
probability and mathematical statistics, retrieved on 2021-08-30; URL: https://www.statlect.com/
fundamentals-of-probability/functions-of-random-vectors.
Metadata: ID: P253 | shortcut: pmf-invfct | author: JoramSoch | date: 2021-08-30, 05:13.
f_X(x) ≥ 0   (1)

for all x ∈ ℝ,

Pr(X ∈ A) = ∫_A f_X(x) dx   (2)

for any A ⊂ X and
1. PROBABILITY THEORY 23
∫_X f_X(x) dx = 1 .   (3)
Sources:
• Wikipedia (2020): “Probability density function”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-02-13; URL: https://en.wikipedia.org/wiki/Probability_density_function.
Metadata: ID: D10 | shortcut: pdf | author: JoramSoch | date: 2020-02-13, 19:26.
f_Z(z) = ∫_{−∞}^{+∞} f_X(z − y) f_Y(y) dy
or f_Z(z) = ∫_{−∞}^{+∞} f_Y(z − x) f_X(x) dx   (1)
where f_X(x), f_Y(y) and f_Z(z) are the probability density functions (→ Definition I/1.6.6) of X, Y
and Z.
Proof: The cumulative distribution function of a sum of independent random variables (→ Proof
I/1.6.14) is

F_Z(z) = E[F_X(z − Y)] .   (2)

Because the probability density function is the first derivative of the cumulative distribution function (→ Proof I/1.6.12), the first equation can be derived as follows:

f_Z(z) = d/dz F_Z(z)
       = d/dz E[F_X(z − Y)]
       = E[d/dz F_X(z − Y)]   (3)
       = E[f_X(z − Y)]
       = ∫_{−∞}^{+∞} f_X(z − y) f_Y(y) dy .
Sources:
• Taboga, Marco (2017): “Sums of independent random variables”; in: Lectures on probability and
mathematical statistics, retrieved on 2021-08-30; URL: https://www.statlect.com/fundamentals-of-probability/sums-of-independent-random-variables.
Metadata: ID: P258 | shortcut: pdf-sumind | author: JoramSoch | date: 2021-08-30, 09:31.
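The convolution integral can also be checked numerically; the sketch below approximates f_Z for X, Y ~ U(0, 1) with a midpoint Riemann sum. The uniform example and the grid size are illustrative assumptions, not part of the proof:

```python
# Midpoint-rule approximation of f_Z(z) = ∫ f_X(z − y) f_Y(y) dy
# for two independent standard uniform random variables.

def f_uniform(x):
    """Density of U(0,1)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def f_sum(z, n=10000):
    """Riemann-sum approximation of the convolution at the point z."""
    dy = 1.0 / n    # f_Y vanishes outside [0, 1], so integrate over [0, 1]
    return sum(f_uniform(z - (i + 0.5) * dy) * f_uniform((i + 0.5) * dy) * dy
               for i in range(n))
```

The sum of two standard uniforms has the triangular density f_Z(z) = z on [0, 1] and 2 − z on [1, 2], which the approximation reproduces closely.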
Y = {y = g(x) : x ∈ X } . (2)
Proof: The cumulative distribution function of a strictly increasing function (→ Proof I/1.6.15) is
F_Y(y) = { 0 ,              if y < min(Y)
         { F_X(g^{-1}(y)) , if y ∈ Y        (3)
         { 1 ,              if y > max(Y)
Because the probability density function is the first derivative of the cumulative distribution function
(→ Proof I/1.6.12)
f_X(x) = dF_X(x)/dx ,   (4)
the probability density function (→ Definition I/1.6.6) of Y can be derived as follows:
1) If y does not belong to the support of Y , FY (y) is constant, such that
f_Y(y) = 0 , if y ∉ Y .   (5)
2) If y belongs to the support of Y , then fY (y) can be derived using the chain rule:
f_Y(y) (4)= d/dy F_Y(y)
       (3)= d/dy F_X(g^{-1}(y))   (6)
          = f_X(g^{-1}(y)) · dg^{-1}(y)/dy .
Sources:
• Taboga, Marco (2017): “Functions of random variables and their distribution”; in: Lectures on
probability and mathematical statistics, retrieved on 2020-10-29; URL: https://www.statlect.com/
fundamentals-of-probability/functions-of-random-variables-and-their-distribution#hid4.
Metadata: ID: P185 | shortcut: pdf-sifct | author: JoramSoch | date: 2020-10-29, 06:21.
Y = {y = g(x) : x ∈ X } . (2)
Proof: The cumulative distribution function of a strictly decreasing function (→ Proof I/1.6.15) is
F_Y(y) = { 1 ,                                      if y > max(Y)
         { 1 − F_X(g^{-1}(y)) + Pr(X = g^{-1}(y)) , if y ∈ Y        (3)
         { 0 ,                                      if y < min(Y)
Note that for continuous random variables, the probability (→ Definition I/1.6.6) of point events is
Pr(X = a) = ∫_a^a f_X(x) dx = 0 .   (4)
Because the probability density function is the first derivative of the cumulative distribution function
(→ Proof I/1.6.12)
f_X(x) = dF_X(x)/dx ,   (5)
the probability density function (→ Definition I/1.6.6) of Y can be derived as follows:
1) If y does not belong to the support of Y , FY (y) is constant, such that
f_Y(y) = 0 , if y ∉ Y .   (6)
2) If y belongs to the support of Y , then fY (y) can be derived using the chain rule:
f_Y(y) (5)= d/dy F_Y(y)
       (3)= d/dy [1 − F_X(g^{-1}(y)) + Pr(X = g^{-1}(y))]
       (4)= d/dy [1 − F_X(g^{-1}(y))]   (7)
          = − d/dy F_X(g^{-1}(y))
          = − f_X(g^{-1}(y)) · dg^{-1}(y)/dy .
Sources:
• Taboga, Marco (2017): “Functions of random variables and their distribution”; in: Lectures on
probability and mathematical statistics, retrieved on 2020-11-06; URL: https://www.statlect.com/
fundamentals-of-probability/functions-of-random-variables-and-their-distribution#hid7.
Metadata: ID: P188 | shortcut: pdf-sdfct | author: JoramSoch | date: 2020-11-06, 05:30.
Y = {y = g(x) : x ∈ X } . (4)
Proof:
1) First, we obtain the cumulative distribution function (→ Definition I/1.6.13) of Y = g(X). The
joint CDF (→ Definition I/1.6.22) is given by
F_Y(y) = Pr(Y_1 ≤ y_1, . . . , Y_n ≤ y_n)
       = Pr(g_1(X) ≤ y_1, . . . , g_n(X) ≤ y_n)   (5)
       = ∫_{A(y)} f_X(x) dx

F_Y(z) = ∫_{B(z)} f_X(g^{-1}(y)) dg^{-1}(y)
       = ∫_{−∞}^{z_n} · · · ∫_{−∞}^{z_1} f_X(g^{-1}(y)) dg^{-1}(y) ,   (7)

where the integration region has been changed to B(z). Changing variables with the Jacobian determinant of the inverse function, the integral becomes
F_Y(z) = ∫_{−∞}^{z_n} · · · ∫_{−∞}^{z_1} f_X(g^{-1}(y)) |J_{g^{-1}}(y)| dy
       = ∫_{−∞}^{z_n} · · · ∫_{−∞}^{z_1} f_X(g^{-1}(y)) |J_{g^{-1}}(y)| dy_1 . . . dy_n .   (10)
4) Finally, we obtain the probability density function (→ Definition I/1.6.6) of Y = g(X). Because
the PDF is the derivative of the CDF (→ Proof I/1.6.12), we can differentiate the joint CDF to get
f_Y(z) = d^n/(dz_1 . . . dz_n) F_Y(z)
       = d^n/(dz_1 . . . dz_n) ∫_{−∞}^{z_n} · · · ∫_{−∞}^{z_1} f_X(g^{-1}(y)) |J_{g^{-1}}(y)| dy_1 . . . dy_n   (11)
       = f_X(g^{-1}(z)) |J_{g^{-1}}(z)|
which can also be written as

f_Y(y) = { f_X(g^{-1}(y)) |J_{g^{-1}}(y)| , if y ∈ Y
         { 0 ,                              if y ∉ Y .
Sources:
• Taboga, Marco (2017): “Functions of random vectors and their distribution”; in: Lectures on
probability and mathematical statistics, retrieved on 2021-08-30; URL: https://www.statlect.com/
fundamentals-of-probability/functions-of-random-vectors.
• Lebanon, Guy (2017): “Functions of a Random Vector”; in: Probability: The Analysis of Data,
Vol. 1, retrieved on 2021-08-30; URL: http://theanalysisofdata.com/probability/4_4.html.
• Poirier, Dale J. (1995): “Distributions of Functions of Random Variables”; in: Intermediate Statis-
tics and Econometrics: A Comparative Approach, ch. 4, pp. 149ff.; URL: https://books.google.de/
books?id=K52_YvD1YNwC&hl=de&source=gbs_navlinks_s.
• Devore, Jay L.; Berk, Kenneth N. (2011): “Conditional Distributions”; in: Modern Mathemat-
ical Statistics with Applications, ch. 5.2, pp. 253ff.; URL: https://books.google.de/books?id=
5PRLUho-YYgC&hl=de&source=gbs_navlinks_s.
• peek-a-boo (2019): “How to come up with the Jacobian in the change of variables formula”; in:
StackExchange Mathematics, retrieved on 2021-08-30; URL: https://math.stackexchange.com/a/
3239222.
• Bazett, Trefor (2019): “Change of Variables & The Jacobian | Multi-variable Integration”; in:
YouTube, retrieved on 2021-08-30; URL: https://www.youtube.com/watch?v=wUF-lyyWpUc.
Metadata: ID: P254 | shortcut: pdf-invfct | author: JoramSoch | date: 2021-08-30, 07:05.
Y = {y = Σx + µ : x ∈ X } . (2)
Proof: Because the linear function g(X) = ΣX + µ is invertible and differentiable, we can determine
the probability density function of an invertible function of a continuous random vector (→ Proof
I/1.6.10) using the relation
f_Y(y) = { f_X(g^{-1}(y)) |J_{g^{-1}}(y)| , if y ∈ Y
         { 0 ,                              if y ∉ Y .   (3)
The inverse function is

g^{-1}(y) = Σ^{-1}(y − µ)   (4)

and the Jacobian matrix of this inverse function is

J_{g^{-1}}(y) = Σ^{-1} .   (5)

Plugging (4) and (5) into (3) and applying the determinant property |A^{-1}| = |A|^{-1}, we obtain

f_Y(y) = 1/|Σ| · f_X(Σ^{-1}(y − µ)) .   (6)
Sources:
• Taboga, Marco (2017): “Functions of random vectors and their distribution”; in: Lectures on
probability and mathematical statistics, retrieved on 2021-08-30; URL: https://www.statlect.com/
fundamentals-of-probability/functions-of-random-vectors.
Metadata: ID: P255 | shortcut: pdf-linfct | author: JoramSoch | date: 2021-08-30, 07:46.
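A numerical illustration of this result (pure standard library): for X a standard bivariate normal and Y = ΣX + µ, the theorem's density must agree with the closed-form density of Y ~ N(µ, ΣΣᵀ). The matrix, vector, and evaluation point below are arbitrary illustrative choices:

```python
import math

Sigma = [[2.0, 0.5], [0.0, 1.5]]   # invertible 2x2 matrix of the map y = Σx + µ
mu = (1.0, -1.0)
det = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0]

def solve2(A, b):
    """Solve the 2x2 linear system A x = b by Cramer's rule."""
    d = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return ((b[0] * A[1][1] - A[0][1] * b[1]) / d,
            (A[0][0] * b[1] - b[0] * A[1][0]) / d)

def f_X(x):
    """Standard bivariate normal density."""
    return math.exp(-0.5 * (x[0] ** 2 + x[1] ** 2)) / (2.0 * math.pi)

def f_Y(y):
    """Density of Y = ΣX + µ via the theorem: f_X(Σ⁻¹(y − µ)) / |Σ|."""
    x = solve2(Sigma, (y[0] - mu[0], y[1] - mu[1]))
    return f_X(x) / abs(det)

# Closed form: Y ~ N(µ, C) with covariance C = Σ Σᵀ.
C = [[sum(Sigma[i][k] * Sigma[j][k] for k in range(2)) for j in range(2)]
     for i in range(2)]
detC = C[0][0] * C[1][1] - C[0][1] * C[1][0]

def f_Y_closed(y):
    """Multivariate normal density of N(µ, C) evaluated at y."""
    d = (y[0] - mu[0], y[1] - mu[1])
    u = solve2(C, d)
    return math.exp(-0.5 * (d[0] * u[0] + d[1] * u[1])) / (2.0 * math.pi * math.sqrt(detC))
```

The two functions agree at any point, and det C = (det Σ)², consistent with the factor 1/|Σ| in the theorem.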
f_X(x) = dF_X(x)/dx .   (1)
Proof: The cumulative distribution function in terms of the probability density function of a con-
tinuous random variable (→ Proof I/1.6.18) is given by:
F_X(x) = ∫_{−∞}^{x} f_X(t) dt, x ∈ ℝ .   (2)
The fundamental theorem of calculus states that, if f (x) is a continuous real-valued function defined
on the interval [a, b], then it holds that
F(x) = ∫_a^x f(t) dt  ⇒  F′(x) = f(x) for all x ∈ (a, b) .   (4)

Applying this theorem to (2) directly yields (1).
Sources:
• Wikipedia (2020): “Fundamental theorem of calculus”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-11-12; URL: https://en.wikipedia.org/wiki/Fundamental_theorem_of_calculus#
Formal_statements.
Metadata: ID: P191 | shortcut: pdf-cdf | author: JoramSoch | date: 2020-11-12, 07:19.
Sources:
• Wikipedia (2020): “Cumulative distribution function”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-02-17; URL: https://en.wikipedia.org/wiki/Cumulative_distribution_function#
Definition.
Metadata: ID: D13 | shortcut: cdf | author: JoramSoch | date: 2020-02-17, 22:07.
F_Z(z) = E[F_X(z − Y)]
or F_Z(z) = E[F_Y(z − X)]   (1)
where F_X(x), F_Y(y) and F_Z(z) are the cumulative distribution functions (→ Definition I/1.6.13) of
X, Y and Z and E[·] denotes the expected value (→ Definition I/1.7.1).
Proof: Using the definition of the cumulative distribution function (→ Definition I/1.6.13), the first
equation can be derived as follows:
F_Z(z) = Pr(Z ≤ z)
       = Pr(X + Y ≤ z)
       = Pr(X ≤ z − Y)
       = E[Pr(X ≤ z − Y | Y = y)]   (2)
       = E[Pr(X ≤ z − Y)]
       = E[F_X(z − Y)] .
Note that the second-last transition is justified by the fact that X and Y are independent (→
Definition I/1.3.6), such that conditional probabilities are equal to marginal probabilities (→ Proof
I/1.3.8). The second equation can be derived by switching X and Y .
Sources:
• Taboga, Marco (2017): “Sums of independent random variables”; in: Lectures on probability and
mathematical statistics, retrieved on 2021-08-30; URL: https://www.statlect.com/fundamentals-of-probability/sums-of-independent-random-variables.
Metadata: ID: P256 | shortcut: cdf-sumind | author: JoramSoch | date: 2021-08-30, 08:53.
Y = {y = g(x) : x ∈ X } . (2)
Proof: The support of Y is determined by g(x) and by the set of possible outcomes of X. More-
over, if g(x) is strictly increasing, then g −1 (y) is also strictly increasing. Therefore, the cumulative
distribution function (→ Definition I/1.6.13) of Y can be derived as follows:
1) If y is lower than the lowest value (→ Definition I/1.13.1) Y can take, then Pr(Y ≤ y) = 0, so

F_Y(y) = 0 .   (3)

2) If y belongs to the support of Y, then

F_Y(y) = Pr(Y ≤ y)
       = Pr(g(X) ≤ y)
       = Pr(X ≤ g^{-1}(y))   (4)
       = F_X(g^{-1}(y)) .

3) If y is higher than the highest value (→ Definition I/1.13.2) Y can take, then Pr(Y ≤ y) = 1, so

F_Y(y) = 1 .   (5)
Sources:
• Taboga, Marco (2017): “Functions of random variables and their distribution”; in: Lectures on
probability and mathematical statistics, retrieved on 2020-10-29; URL: https://www.statlect.com/
fundamentals-of-probability/functions-of-random-variables-and-their-distribution#hid2.
Metadata: ID: P183 | shortcut: cdf-sifct | author: JoramSoch | date: 2020-10-29, 05:35.
Y = {y = g(x) : x ∈ X } . (2)
Proof: The support of Y is determined by g(x) and by the set of possible outcomes of X. More-
over, if g(x) is strictly decreasing, then g −1 (y) is also strictly decreasing. Therefore, the cumulative
distribution function (→ Definition I/1.6.13) of Y can be derived as follows:
1) If y is higher than the highest value (→ Definition I/1.13.2) Y can take, then Pr(Y ≤ y) = 1, so

F_Y(y) = 1 .   (3)

2) If y belongs to the support of Y, then

F_Y(y) = Pr(Y ≤ y)
       = 1 − Pr(Y > y)
       = 1 − Pr(g(X) > y)
       = 1 − Pr(X < g^{-1}(y))
       = 1 − Pr(X < g^{-1}(y)) − Pr(X = g^{-1}(y)) + Pr(X = g^{-1}(y))   (4)
       = 1 − [Pr(X < g^{-1}(y)) + Pr(X = g^{-1}(y))] + Pr(X = g^{-1}(y))
       = 1 − Pr(X ≤ g^{-1}(y)) + Pr(X = g^{-1}(y))
       = 1 − F_X(g^{-1}(y)) + Pr(X = g^{-1}(y)) .

3) If y is lower than the lowest value (→ Definition I/1.13.1) Y can take, then Pr(Y ≤ y) = 0, so

F_Y(y) = 0 .   (5)
Sources:
• Taboga, Marco (2017): “Functions of random variables and their distribution”; in: Lectures on
probability and mathematical statistics, retrieved on 2020-11-06; URL: https://www.statlect.com/
fundamentals-of-probability/functions-of-random-variables-and-their-distribution#hid5.
Metadata: ID: P186 | shortcut: cdf-sdfct | author: JoramSoch | date: 2020-11-06, 04:12.
Proof: The cumulative distribution function (→ Definition I/1.6.13) of a random variable (→ Defi-
nition I/1.2.2) X is defined as the probability that X is smaller than x:
F_X(x) (2)= ∑_{t ∈ X, t ≤ x} Pr(X = t)
       (3)= ∑_{t ∈ X, t ≤ x} f_X(t) .   (4)
Sources:
• original work
Metadata: ID: P189 | shortcut: cdf-pmf | author: JoramSoch | date: 2020-11-12, 06:03.
Proof: The cumulative distribution function (→ Definition I/1.6.13) of a random variable (→ Defi-
nition I/1.2.2) X is defined as the probability that X is smaller than x:
The probability density function (→ Definition I/1.6.6) of a continuous (→ Definition I/1.2.6) random
variable (→ Definition I/1.2.2) X can be used to calculate the probability that X falls into a particular
interval A:
Pr(X ∈ A) = ∫_A f_X(x) dx .   (3)
Taking these two definitions together, we have:
F_X(x) (2)= Pr(X ∈ (−∞, x])
       (3)= ∫_{−∞}^{x} f_X(t) dt .   (4)
Sources:
• original work
Metadata: ID: P190 | shortcut: cdf-pdf | author: JoramSoch | date: 2020-11-12, 06:33.
Y = FX (X) (1)
has a standard uniform distribution (→ Definition II/3.1.2).
Proof: The cumulative distribution function (→ Definition I/1.6.13) of Y = FX (X) can be derived
as
F_Y(y) = Pr(Y ≤ y)
       = Pr(F_X(X) ≤ y)
       = Pr(X ≤ F_X^{-1}(y))   (2)
       = F_X(F_X^{-1}(y))
       = y
which is the cumulative distribution function of a continuous uniform distribution (→ Proof II/3.1.4)
with a = 0 and b = 1, i.e. the cumulative distribution function (→ Definition I/1.6.13) of the standard
uniform distribution (→ Definition II/3.1.2) U(0, 1).
Sources:
• Wikipedia (2021): “Probability integral transform”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-04-07; URL: https://en.wikipedia.org/wiki/Probability_integral_transform#Proof.
Metadata: ID: P220 | shortcut: cdf-pit | author: JoramSoch | date: 2021-04-07, 08:47.
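An empirical illustration of the transform: applying the Exp(1) CDF F_X(x) = 1 − e^{−x} to Exp(1) samples should yield approximately U(0, 1) draws. The distribution, seed and sample size below are illustrative choices, not part of the proof:

```python
import math
import random

random.seed(42)
xs = [random.expovariate(1.0) for _ in range(100000)]   # X ~ Exp(1)
ys = [1.0 - math.exp(-x) for x in xs]                   # Y = F_X(X)

mean_y = sum(ys) / len(ys)                       # U(0,1) has mean 1/2
below = sum(y < 0.25 for y in ys) / len(ys)      # Pr(Y < 0.25) should be ≈ 1/4
```

Both summary statistics land close to their uniform-distribution values, as the theorem predicts.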
1. PROBABILITY THEORY 35
X = F_X^{-1}(U)   (1)
has a probability distribution (→ Definition I/1.5.1) characterized by the invertible (→ Definition
I/1.6.23) cumulative distribution function (→ Definition I/1.6.13) F_X(x).
Proof: The cumulative distribution function (→ Definition I/1.6.13) of X = F_X^{-1}(U) is

Pr(X ≤ x) = Pr(F_X^{-1}(U) ≤ x)
          = Pr(U ≤ F_X(x))   (2)
          = F_X(x) ,
because the cumulative distribution function (→ Definition I/1.6.13) of the standard uniform distribution (→ Definition II/3.1.2) U(0, 1) is

F_U(u) = Pr(U ≤ u) = u for all u ∈ [0, 1] .   (3)
Sources:
• Wikipedia (2021): “Inverse transform sampling”; in: Wikipedia, the free encyclopedia, retrieved on
2021-04-07; URL: https://en.wikipedia.org/wiki/Inverse_transform_sampling#Proof_of_correctness.
Metadata: ID: P221 | shortcut: cdf-itm | author: JoramSoch | date: 2021-04-07, 08:47.
F_X̃(y) = Pr(X̃ ≤ y)
        = Pr(F_Y^{-1}(F_X(X)) ≤ y)
        = Pr(F_X(X) ≤ F_Y(y))   (2)
        = Pr(X ≤ F_X^{-1}(F_Y(y)))
        = F_X(F_X^{-1}(F_Y(y)))
        = F_Y(y)
which shows that X̃ and Y have the same cumulative distribution function (→ Definition I/1.6.13)
and are thus identically distributed (→ Definition I/1.5.1).
Sources:
• Soch, Joram (2020): “Distributional Transformation Improves Decoding Accuracy When Predict-
ing Chronological Age From Structural MRI”; in: Frontiers in Psychiatry, vol. 11, art. 604268;
URL: https://www.frontiersin.org/articles/10.3389/fpsyt.2020.604268/full; DOI: 10.3389/fpsyt.2020.60426
Metadata: ID: P222 | shortcut: cdf-dt | author: JoramSoch | date: 2021-04-07, 09:19.
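A small simulation of the distributional transformation X̃ = F_Y^{-1}(F_X(X)), mapping Exp(1) samples onto a U(0, 2) target; the source/target pair, seed and sample size are illustrative assumptions:

```python
import math
import random

random.seed(7)
xs = [random.expovariate(1.0) for _ in range(100000)]   # X ~ Exp(1)
# F_X(x) = 1 − e^{−x}; for the target Y ~ U(0,2), F_Y⁻¹(p) = 2p
x_tilde = [2.0 * (1.0 - math.exp(-x)) for x in xs]

mean_t = sum(x_tilde) / len(x_tilde)                    # U(0,2) has mean 1
```

The transformed samples lie in [0, 2] and their mean is close to 1, matching the target distribution.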
Sources:
• Wikipedia (2021): “Cumulative distribution function”; in: Wikipedia, the free encyclopedia, re-
trieved on 2021-04-07; URL: https://en.wikipedia.org/wiki/Cumulative_distribution_function#
Definition_for_more_than_two_random_variables.
Metadata: ID: D141 | shortcut: cdf-joint | author: JoramSoch | date: 2020-04-07, 08:17.
Sources:
• Wikipedia (2020): “Quantile function”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-02-17; URL: https://en.wikipedia.org/wiki/Quantile_function#Definition.
Proof: The quantile function (→ Definition I/1.6.23) Q_X(p) is defined as the function that, for a
given quantile p ∈ [0, 1], returns the smallest x for which F_X(x) = p:
Sources:
• Wikipedia (2020): “Quantile function”; in: Wikipedia, the free encyclopedia, retrieved on 2020-11-
12; URL: https://en.wikipedia.org/wiki/Quantile_function#Definition.
Metadata: ID: P192 | shortcut: qf-cdf | author: JoramSoch | date: 2020-11-12, 07:48.
Sources:
• Wikipedia (2021): “Characteristic function (probability theory)”; in: Wikipedia, the free ency-
clopedia, retrieved on 2021-09-22; URL: https://en.wikipedia.org/wiki/Characteristic_function_
(probability_theory)#Definition.
• Taboga, Marco (2017): “Joint characteristic function”; in: Lectures on probability and mathematical
statistics, retrieved on 2021-10-07; URL: https://www.statlect.com/fundamentals-of-probability/
joint-characteristic-function.
E[g(X)] = ∑_{x ∈ X} g(x) f_X(x)
E[g(X)] = ∫_X g(x) f_X(x) dx ,   (3)
Sources:
• Taboga, Marco (2017): “Functions of random vectors and their distribution”; in: Lectures on
probability and mathematical statistics, retrieved on 2021-09-22; URL: https://www.statlect.com/
fundamentals-of-probability/functions-of-random-vectors.
Metadata: ID: P259 | shortcut: cf-fct | author: JoramSoch | date: 2021-09-22, 09:12.
M_X(t) = E[e^{t^T X}] , t ∈ ℝ^n .   (2)
Sources:
• Wikipedia (2020): “Moment-generating function”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-01-22; URL: https://en.wikipedia.org/wiki/Moment-generating_function#Definition.
• Taboga, Marco (2017): “Joint moment generating function”; in: Lectures on probability and mathe-
matical statistics, retrieved on 2021-10-07; URL: https://www.statlect.com/fundamentals-of-probability/
joint-moment-generating-function.
E[g(X)] = ∑_{x ∈ X} g(x) f_X(x)
E[g(X)] = ∫_X g(x) f_X(x) dx ,   (3)
Sources:
• Taboga, Marco (2017): “Functions of random vectors and their distribution”; in: Lectures on
probability and mathematical statistics, retrieved on 2021-09-22; URL: https://www.statlect.com/
fundamentals-of-probability/functions-of-random-vectors.
Metadata: ID: P260 | shortcut: mgf-fct | author: JoramSoch | date: 2021-09-22, 09:00.
M_Y(t) = E[exp(t^T (AX + b))]
       = E[exp(t^T AX) · exp(t^T b)]
       = exp(t^T b) · E[exp((A^T t)^T X)]   (3)
       = exp(t^T b) · M_X(A^T t) .
Sources:
• ProofWiki (2020): “Moment Generating Function of Linear Transformation of Random Variable”;
in: ProofWiki, retrieved on 2020-08-19; URL: https://proofwiki.org/wiki/Moment_Generating_
Function_of_Linear_Transformation_of_Random_Variable.
Metadata: ID: P154 | shortcut: mgf-ltt | author: JoramSoch | date: 2020-08-19, 08:09.
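In one dimension (where A and b are scalars a and b, so that Aᵀt reduces to a·t), the identity can be verified against the known MGF of a Gaussian; the concrete numbers below are arbitrary illustrative choices:

```python
import math

a, b, t = 1.7, -0.4, 0.8

def M_X(s):
    """MGF of the standard normal distribution, M_X(s) = exp(s²/2)."""
    return math.exp(s * s / 2.0)

# Y = aX + b is N(b, a²), whose MGF at t is exp(tb + a²t²/2).
M_Y_direct = math.exp(t * b + (a * t) ** 2 / 2.0)
M_Y_theorem = math.exp(t * b) * M_X(a * t)   # right-hand side of the theorem
```

Both expressions evaluate to the same number, as the derivation requires.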
Because the expected value is multiplicative for independent random variables (→ Proof I/1.7.7), we
have
M_X(t) = ∏_{i=1}^{n} E(exp[(a_i t) X_i])
       = ∏_{i=1}^{n} M_{X_i}(a_i t) .   (4)
Sources:
• ProofWiki (2020): “Moment Generating Function of Linear Combination of Independent Random
Variables”; in: ProofWiki, retrieved on 2020-08-19; URL: https://proofwiki.org/wiki/Moment_
Generating_Function_of_Linear_Combination_of_Independent_Random_Variables.
Metadata: ID: P155 | shortcut: mgf-lincomb | author: JoramSoch | date: 2020-08-19, 08:36.
Sources:
• Wikipedia (2020): “Probability-generating function”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-05-31; URL: https://en.wikipedia.org/wiki/Probability-generating_function#Definition.
Metadata: ID: D69 | shortcut: pgf | author: JoramSoch | date: 2020-05-31, 23:59.
Proof: The law of the unconscious statistician (→ Proof I/1.7.12) states that
E[g(X)] = ∑_{x ∈ X} g(x) f_X(x)   (2)
where f_X(x) is the probability mass function (→ Definition I/1.6.1) of X. Here, we have g(X) = z^X,
such that
E[z^X] = ∑_{x ∈ X} z^x f_X(x) .   (3)

Because the set of possible values of X is X = {0, 1, 2, . . .}, this can be written as

E[z^X] = ∑_{x=0}^{∞} f_X(x) z^x .   (4)
Sources:
• ProofWiki (2022): “Probability Generating Function as Expectation”; in: ProofWiki, retrieved on
2022-10-11; URL: https://proofwiki.org/wiki/Probability_Generating_Function_as_Expectation.
Metadata: ID: P360 | shortcut: pgf-mean | author: JoramSoch | date: 2022-10-11, 02:01.
where f_X(x) is the probability mass function (→ Definition I/1.6.1) of X. Setting z = 0, we obtain:
G_X(0) = ∑_{x=0}^{∞} f_X(x) · 0^x
       = f_X(0) + 0^1 · f_X(1) + 0^2 · f_X(2) + . . .   (3)
       = f_X(0) + 0 + 0 + . . .
       = f_X(0) .
Sources:
• ProofWiki (2022): “Probability Generating Function of Zero”; in: ProofWiki, retrieved on 2022-
10-11; URL: https://proofwiki.org/wiki/Probability_Generating_Function_of_Zero.
Metadata: ID: P361 | shortcut: pgf-zero | author: JoramSoch | date: 2022-10-11, 08:06.
GX (1) = 1 . (1)
where fX (x) is the probability mass function (→ Definition I/1.6.1) of X. Setting z = 1, we obtain:
X
∞
GX (1) = fX (x) · 1x
x=0
X∞
= fX (x) · 1 (3)
x=0
X∞
= fX (x) .
x=0
Because the probability mass function (→ Definition I/1.6.1) sums up to one, this becomes:
G_X(1) = ∑_{x ∈ X} f_X(x)
       = 1 .   (4)
Sources:
• ProofWiki (2022): “Probability Generating Function of One”; in: ProofWiki, retrieved on 2022-
10-11; URL: https://proofwiki.org/wiki/Probability_Generating_Function_of_One.
Metadata: ID: P362 | shortcut: pgf-one | author: JoramSoch | date: 2022-10-11, 08:17.
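Both endpoint identities, G_X(0) = f_X(0) and G_X(1) = 1, can be checked numerically for a Poisson-distributed variable; the rate λ = 3 and the truncation of the infinite series at 100 terms are illustrative numerical shortcuts:

```python
import math

lam = 3.0

def f(x):
    """Poisson(λ) probability mass function."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def G(z):
    """Probability-generating function, truncated to 100 terms."""
    return sum(f(x) * z ** x for x in range(100))

g0 = G(0.0)    # should equal f(0) = e^{−λ}
g1 = G(1.0)    # should equal 1
```

The truncation error is astronomically small for λ = 3, so both checks hold to near machine precision.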
Sources:
• Wikipedia (2020): “Cumulant”; in: Wikipedia, the free encyclopedia, retrieved on 2020-05-31; URL:
https://en.wikipedia.org/wiki/Cumulant#Definition.
Metadata: ID: D68 | shortcut: cgf | author: JoramSoch | date: 2020-05-31, 23:46.
2) The expected value (or, mean) of a continuous random variable (→ Definition I/1.2.2) X with
domain X is
E(X) = ∫_X x · f_X(x) dx   (2)
Sources:
• Wikipedia (2020): “Expected value”; in: Wikipedia, the free encyclopedia, retrieved on 2020-02-13;
URL: https://en.wikipedia.org/wiki/Expected_value#Definition.
Metadata: ID: D11 | shortcut: mean | author: JoramSoch | date: 2020-02-13, 19:38.
x̄ = (1/n) ∑_{i=1}^{n} x_i .   (1)
Sources:
• Wikipedia (2021): “Sample mean and covariance”; in: Wikipedia, the free encyclopedia, retrieved on
2020-04-16; URL: https://en.wikipedia.org/wiki/Sample_mean_and_covariance#Definition_of_
the_sample_mean.
Metadata: ID: D142 | shortcut: mean-samp | author: JoramSoch | date: 2021-04-16, 11:53.
Proof: Because the cumulative distribution function gives the probability of a random variable being
smaller than a given value (→ Definition I/1.6.13),
= ∫_0^∞ ∫_0^z f_X(z) · 1 dx dz
= ∫_0^∞ [x]_0^z · f_X(z) dz   (5)
= ∫_0^∞ z · f_X(z) dz
and by applying the definition of the expected value (→ Definition I/1.7.1), we see that
∫_0^∞ (1 − F_X(x)) dx = ∫_0^∞ z · f_X(z) dz = E(X)   (6)
which proves the identity given above.
Sources:
• Kemp, Graham (2014): “Expected value of a non-negative random variable”; in: StackExchange
Mathematics, retrieved on 2020-05-18; URL: https://math.stackexchange.com/questions/958472/
expected-value-of-a-non-negative-random-variable.
Metadata: ID: P103 | shortcut: mean-nnrvar | author: JoramSoch | date: 2020-05-18, 23:54.
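For X ~ Exp(λ), 1 − F_X(x) = e^{−λx}, so the identity predicts ∫₀^∞ e^{−λx} dx = 1/λ. A midpoint-rule check in Python; λ, the truncation point and the grid size are illustrative numerical choices:

```python
import math

lam = 0.5                   # Exp(0.5) has mean 1/λ = 2
upper, n = 60.0, 100000     # truncate the integral at 60 (tail ≈ e^{−30})
dx = upper / n

# ∫₀^∞ (1 − F_X(x)) dx with 1 − F_X(x) = exp(−λx), midpoint rule
integral = sum(math.exp(-lam * (i + 0.5) * dx) * dx for i in range(n))
```

The numerical integral lands on the Exp(λ) mean 1/λ = 2 to within the discretisation error.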
1.7.4 Non-negativity
Theorem: If a random variable (→ Definition I/1.2.2) is strictly non-negative, its expected value
(→ Definition I/1.7.1) is also non-negative, i.e.

X ≥ 0  ⇒  E(X) ≥ 0 .   (1)
Proof:
1) If X ≥ 0 is a discrete random variable, then, because the probability mass function (→ Definition
I/1.6.1) is always non-negative, all the addends in

E(X) = ∑_{x ∈ X} x · f_X(x)   (2)

are non-negative, such that the entire sum must also be non-negative.
Sources:
• Wikipedia (2020): “Expected value”; in: Wikipedia, the free encyclopedia, retrieved on 2020-02-13;
URL: https://en.wikipedia.org/wiki/Expected_value#Basic_properties.
Metadata: ID: P52 | shortcut: mean-nonneg | author: JoramSoch | date: 2020-02-13, 20:14.
1.7.5 Linearity
Theorem: The expected value (→ Definition I/1.7.1) is a linear operator, i.e.

E(X + Y) = E(X) + E(Y)
E(a X) = a E(X)   (1)
for random variables (→ Definition I/1.2.2) X and Y and a constant (→ Definition I/1.2.5) a.
Proof:
1) If X and Y are discrete random variables (→ Definition I/1.2.6), the expected value (→ Definition
I/1.7.1) is
E(X) = ∑_{x ∈ X} x · f_X(x)   (2)

and the probability mass function of X can be obtained from the joint probability mass function by marginalization:

f_X(x) = ∑_{y ∈ Y} f_{X,Y}(x, y) .   (3)

Thus, we have:
E(X + Y) = ∑_{x ∈ X} ∑_{y ∈ Y} (x + y) · f_{X,Y}(x, y)
         = ∑_{x ∈ X} ∑_{y ∈ Y} x · f_{X,Y}(x, y) + ∑_{x ∈ X} ∑_{y ∈ Y} y · f_{X,Y}(x, y)
         = ∑_{x ∈ X} x ∑_{y ∈ Y} f_{X,Y}(x, y) + ∑_{y ∈ Y} y ∑_{x ∈ X} f_{X,Y}(x, y)   (4)
      (3)= ∑_{x ∈ X} x · f_X(x) + ∑_{y ∈ Y} y · f_Y(y)
      (2)= E(X) + E(Y)
as well as
E(a X) = ∑_{x ∈ X} a x · f_X(x)
       = a ∑_{x ∈ X} x · f_X(x)   (5)
    (2)= a E(X) .
2) If X and Y are continuous random variables (→ Definition I/1.2.6), the expected value (→
Definition I/1.7.1) is
E(X) = ∫_X x · f_X(x) dx   (6)

and the probability density function of X can be obtained from the joint probability density function by marginalization:

f_X(x) = ∫_Y f_{X,Y}(x, y) dy .   (7)

Thus, we have:
E(X + Y) = ∫_X ∫_Y (x + y) · f_{X,Y}(x, y) dy dx
         = ∫_X ∫_Y x · f_{X,Y}(x, y) dy dx + ∫_X ∫_Y y · f_{X,Y}(x, y) dy dx
         = ∫_X x ∫_Y f_{X,Y}(x, y) dy dx + ∫_Y y ∫_X f_{X,Y}(x, y) dx dy   (8)
      (7)= ∫_X x · f_X(x) dx + ∫_Y y · f_Y(y) dy
      (6)= E(X) + E(Y)
as well as
E(a X) = ∫_X a x · f_X(x) dx
       = a ∫_X x · f_X(x) dx   (9)
    (6)= a E(X) .
Collectively, this shows that both requirements for linearity are fulfilled for the expected value (→
Definition I/1.7.1), for discrete (→ Definition I/1.2.6) as well as for continuous (→ Definition I/1.2.6)
random variables.
Sources:
• Wikipedia (2020): “Expected value”; in: Wikipedia, the free encyclopedia, retrieved on 2020-02-13;
URL: https://en.wikipedia.org/wiki/Expected_value#Basic_properties.
• Michael B, Kuldeep Guha Mazumder, Geoff Pilling et al. (2020): “Linearity of Expectation”; in:
brilliant.org, retrieved on 2020-02-13; URL: https://brilliant.org/wiki/linearity-of-expectation/.
Metadata: ID: P53 | shortcut: mean-lin | author: JoramSoch | date: 2020-02-13, 21:08.
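As a quick empirical illustration of linearity (not a substitute for the proof), a Monte Carlo check with two dependent random variables; the distributions, seed and sample size are arbitrary illustrative choices:

```python
import random

random.seed(5)
a = 3.0
n = 100000

# X ~ U(0,1); Y = X² + noise is deliberately dependent on X —
# linearity of expectation does not require independence.
xs = [random.random() for _ in range(n)]
ys = [x * x + random.gauss(0.0, 0.1) for x in xs]

E_X = sum(xs) / n                           # ≈ 1/2
E_Y = sum(ys) / n                           # ≈ 1/3
E_sum = sum(x + y for x, y in zip(xs, ys)) / n
E_aX = sum(a * x for x in xs) / n
```

The sample versions of E(X + Y) − [E(X) + E(Y)] and E(aX) − aE(X) vanish up to floating-point error, for any degree of dependence between X and Y.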
1.7.6 Monotonicity
Theorem: The expected value (→ Definition I/1.7.1) is monotonic, i.e.

X ≤ Y  ⇒  E(X) ≤ E(Y) .   (1)

Proof: Let Z = Y − X. Due to the linearity of the expected value (→ Proof I/1.7.5), we have

E(Z) = E(Y − X) = E(Y) − E(X) .   (2)

With the non-negativity property of the expected value (→ Proof I/1.7.4), it also holds that

E(Z) ≥ 0  ⇒  E(Y) − E(X) ≥ 0  ⇒  E(X) ≤ E(Y) .   (3)
Sources:
• Wikipedia (2020): “Expected value”; in: Wikipedia, the free encyclopedia, retrieved on 2020-02-17;
URL: https://en.wikipedia.org/wiki/Expected_value#Basic_properties.
Metadata: ID: P54 | shortcut: mean-mono | author: JoramSoch | date: 2020-02-17, 21:00.
1.7.7 (Non-)Multiplicativity
Theorem:
1) If two random variables (→ Definition I/1.2.2) X and Y are independent (→ Definition I/1.3.6),
the expected value (→ Definition I/1.7.1) is multiplicative, i.e.

E(X Y) = E(X) E(Y) .   (1)

Proof:
1) If X and Y are independent (→ Definition I/1.3.6), it holds that

f_{X,Y}(x, y) = f_X(x) · f_Y(y) .   (3)

Applying this to the expected value for discrete random variables (→ Definition I/1.7.1), we have
E(X Y) = ∑_{x ∈ X} ∑_{y ∈ Y} (x · y) · f_{X,Y}(x, y)
      (3)= ∑_{x ∈ X} ∑_{y ∈ Y} (x · y) · (f_X(x) · f_Y(y))
         = ∑_{x ∈ X} x · f_X(x) ∑_{y ∈ Y} y · f_Y(y)   (4)
         = ∑_{x ∈ X} x · f_X(x) · E(Y)
         = E(X) E(Y) .
And applying it to the expected value for continuous random variables (→ Definition I/1.7.1), we
have
E(X Y) = ∫_X ∫_Y (x · y) · f_{X,Y}(x, y) dy dx
      (3)= ∫_X ∫_Y (x · y) · (f_X(x) · f_Y(y)) dy dx
         = ∫_X x · f_X(x) ∫_Y y · f_Y(y) dy dx   (5)
         = ∫_X x · f_X(x) · E(Y) dx
         = E(X) E(Y) .
2) Let X and Y be Bernoulli random variables (→ Definition II/1.2.1) with the following joint
probability (→ Definition I/1.3.2) mass function (→ Definition I/1.6.1)
p(X = 0, Y = 0) = 1/2
p(X = 0, Y = 1) = 0
p(X = 1, Y = 0) = 0   (6)
p(X = 1, Y = 1) = 1/2 ,

such that the marginal probabilities (→ Definition I/1.3.3) are

p(X = 1) = p(Y = 1) = 1/2 .   (7)

Then, the expectation of the product is
E(X Y) = ∑_{x ∈ {0,1}} ∑_{y ∈ {0,1}} (x · y) · p(x, y)
       = (1 · 1) · p(X = 1, Y = 1)   (9)
    (6)= 1/2
while the product of their expected values is
E(X) E(Y) = (∑_{x ∈ {0,1}} x · p(x)) · (∑_{y ∈ {0,1}} y · p(y))
          = (1 · p(X = 1)) · (1 · p(Y = 1))   (10)
       (7)= 1/4
and thus,

E(X Y) = 1/2 ≠ 1/4 = E(X) E(Y) .   (11)
Sources:
• Wikipedia (2020): “Expected value”; in: Wikipedia, the free encyclopedia, retrieved on 2020-02-17;
URL: https://en.wikipedia.org/wiki/Expected_value#Basic_properties.
Metadata: ID: P55 | shortcut: mean-mult | author: JoramSoch | date: 2020-02-17, 21:51.
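The counterexample in part 2) can be computed mechanically; the dictionary below just encodes the joint probability mass function given in equation (6):

```python
# Joint PMF of the dependent Bernoulli pair from equation (6):
# mass 1/2 on (0,0) and (1,1), zero elsewhere.
joint = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}

E_XY = sum(x * y * p for (x, y), p in joint.items())   # expectation of the product
E_X = sum(x * p for (x, y), p in joint.items())        # marginal expectation of X
E_Y = sum(y * p for (x, y), p in joint.items())        # marginal expectation of Y
```

Here E(XY) = 1/2 while E(X)E(Y) = 1/4, confirming that multiplicativity fails without independence.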
Using this definition of the trace, the linearity of the expected value (→ Proof I/1.7.5) and the
expected value of a random matrix (→ Definition I/1.7.14), we have:
" #
X
n
E [tr(A)] = E aii
i=1
X
n
= E [aii ]
i=1
(3)
E[a ] . . . E[a1n ]
11
.. . . ..
= tr . . .
E[an1 ] . . . E[ann ]
= tr (E[A]) .
Sources:
• drerD (2018): “’Trace trick’ for expectations of quadratic forms”; in: StackExchange Mathematics,
retrieved on 2021-12-07; URL: https://math.stackexchange.com/a/3004034/480910.
Metadata: ID: P298 | shortcut: mean-tr | author: JoramSoch | date: 2021-12-07, 09:03.
E[X^T AX] = tr(A(Σ + µµ^T))
          = tr(AΣ + Aµµ^T)
          = tr(AΣ) + tr(Aµµ^T)   (7)
          = tr(AΣ) + tr(µ^T Aµ)
          = µ^T Aµ + tr(AΣ) .
Sources:
• Kendrick, David (1981): “Expectation of a quadratic form”; in: Stochastic Control for Economic
Models, pp. 170-171.
• Wikipedia (2020): “Multivariate random variable”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-07-13; URL: https://en.wikipedia.org/wiki/Multivariate_random_variable#Expectation_
of_a_quadratic_form.
• Halvorsen, Kjetil B. (2012): “Expected value and variance of trace function”; in: StackExchange
CrossValidated, retrieved on 2020-07-13; URL: https://stats.stackexchange.com/questions/34477/
expected-value-and-variance-of-trace-function.
• Sarwate, Dilip (2013): “Expected Value of Quadratic Form”; in: StackExchange CrossValidated, re-
trieved on 2020-07-13; URL: https://stats.stackexchange.com/questions/48066/expected-value-of-quadrat
Metadata: ID: P131 | shortcut: mean-qf | author: JoramSoch | date: 2020-07-13, 21:59.
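A Monte Carlo check of E[XᵀAX] = µᵀAµ + tr(AΣ) for a two-dimensional Gaussian with diagonal covariance (so samples can be drawn with the standard library alone); all numbers and the seed are illustrative choices:

```python
import random

random.seed(3)
mu = (1.0, 2.0)
var = (0.5, 1.5)                 # Σ = diag(0.5, 1.5)
A = [[2.0, 1.0], [1.0, 3.0]]

def quad(x):
    """Quadratic form xᵀAx."""
    return sum(A[i][j] * x[i] * x[j] for i in range(2) for j in range(2))

n = 100000
mc = 0.0
for _ in range(n):
    # independent components: X_i = µ_i + σ_i · Z_i with Z_i ~ N(0,1)
    x = tuple(mu[i] + var[i] ** 0.5 * random.gauss(0.0, 1.0) for i in range(2))
    mc += quad(x)
mc /= n                                                       # Monte Carlo estimate

closed = quad(mu) + sum(A[i][i] * var[i] for i in range(2))   # µᵀAµ + tr(AΣ)
```

With these numbers, µᵀAµ = 18 and tr(AΣ) = 5.5, so the closed form is 23.5, which the simulated average approaches.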
Proof: Note that Y^2 is a non-negative random variable (→ Definition I/1.2.2) whose expected value
is also non-negative (→ Proof I/1.7.4):

E(Y^2) ≥ 0 .   (2)
1) First, consider the case that E(Y^2) > 0. Define a new random variable Z as

Z = X − Y · E(XY)/E(Y^2) .   (3)
Once again, because Z^2 is always non-negative, its expected value is non-negative:

E(Z^2) ≥ 0 .   (4)
Thus, using the linearity of the expected value (→ Proof I/1.7.5), we have
0 ≤ E(Z^2)
  = E[(X − Y · E(XY)/E(Y^2))^2]
  = E[X^2 − 2XY · E(XY)/E(Y^2) + Y^2 · [E(XY)]^2/[E(Y^2)]^2]
  = E(X^2) − 2 E(XY) · E(XY)/E(Y^2) + E(Y^2) · [E(XY)]^2/[E(Y^2)]^2   (5)
  = E(X^2) − 2 [E(XY)]^2/E(Y^2) + [E(XY)]^2/E(Y^2)
  = E(X^2) − [E(XY)]^2/E(Y^2) ,
giving

[E(XY)]^2 ≤ E(X^2) E(Y^2)   (6)
as required.
2) Next, consider the case that E(Y^2) = 0. In this case, Y must be a constant (→ Definition I/1.2.5)
with mean (→ Definition I/1.7.1) E(Y ) = 0 and variance (→ Definition I/1.8.1) Var(Y ) = 0, thus
we have
Pr(Y = 0) = 1 . (7)
This implies
Pr(XY = 0) = 1 , (8)
such that

\[ \mathrm{E}(XY) = 0 . \tag{9} \]

Thus, we can write

\[ 0 = [\mathrm{E}(XY)]^2 = \mathrm{E}\left(X^2\right) \mathrm{E}\left(Y^2\right) = 0 , \tag{10} \]

giving

\[ [\mathrm{E}(XY)]^2 \leq \mathrm{E}\left(X^2\right) \mathrm{E}\left(Y^2\right) \tag{11} \]

as required.
Sources:
• ProofWiki (2022): “Square of Expectation of Product is Less Than or Equal to Product of Ex-
pectation of Squares”; in: ProofWiki, retrieved on 2022-10-11; URL: https://proofwiki.org/wiki/
Square_of_Expectation_of_Product_is_Less_Than_or_Equal_to_Product_of_Expectation_
of_Squares.
Metadata: ID: P359 | shortcut: mean-prodsqr | author: JoramSoch | date: 2022-10-11, 01:39.
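This inequality is easy to sanity-check numerically. The following Python sketch (not part of the book; the joint distribution of (X, Y) is made up for illustration) computes both sides exactly for a small discrete distribution:

```python
import numpy as np

# Hypothetical discrete joint distribution of (X, Y): outcome pairs and probabilities.
xy = np.array([(1.0, 2.0), (2.0, -1.0), (3.0, 1.0)])
p = np.array([0.2, 0.5, 0.3])

E_XY = np.sum(p * xy[:, 0] * xy[:, 1])   # E(XY)
E_X2 = np.sum(p * xy[:, 0] ** 2)         # E(X^2)
E_Y2 = np.sum(p * xy[:, 1] ** 2)         # E(Y^2)

# The inequality [E(XY)]^2 <= E(X^2) E(Y^2) proven above.
assert E_XY ** 2 <= E_X2 * E_Y2
```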
Proof: Let X and Y be discrete random variables (→ Definition I/1.2.6) with sets of possible outcomes $\mathcal{X}$ and $\mathcal{Y}$. Then, the expectation of the conditional expectation can be rewritten as:

\[ \begin{split}
\mathrm{E}[\mathrm{E}(X|Y)] &= \mathrm{E}\left[ \sum_{x \in \mathcal{X}} x \cdot \mathrm{Pr}(X=x|Y) \right] \\
&= \sum_{y \in \mathcal{Y}} \left[ \sum_{x \in \mathcal{X}} x \cdot \mathrm{Pr}(X=x|Y=y) \right] \cdot \mathrm{Pr}(Y=y) \\
&= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} x \cdot \mathrm{Pr}(X=x|Y=y) \cdot \mathrm{Pr}(Y=y) .
\end{split} \tag{2} \]

Because Pr(X = x | Y = y) · Pr(Y = y) is the joint probability (→ Definition I/1.3.2) Pr(X = x, Y = y), this becomes:

\[ \begin{split}
\mathrm{E}[\mathrm{E}(X|Y)] &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} x \cdot \mathrm{Pr}(X=x, Y=y) \\
&= \sum_{x \in \mathcal{X}} x \sum_{y \in \mathcal{Y}} \mathrm{Pr}(X=x, Y=y) .
\end{split} \tag{3} \]

Finally, summing out the marginal probability (→ Definition I/1.3.3) of Y, we get:

\[ \begin{split}
\mathrm{E}[\mathrm{E}(X|Y)] &= \sum_{x \in \mathcal{X}} x \cdot \mathrm{Pr}(X=x) \\
&= \mathrm{E}(X) .
\end{split} \tag{4} \]
Sources:
• Wikipedia (2021): “Law of total expectation”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-11-26; URL: https://en.wikipedia.org/wiki/Law_of_total_expectation#Proof_in_the_
finite_and_countable_cases.
Metadata: ID: P291 | shortcut: mean-tot | author: JoramSoch | date: 2021-11-26, 10:57.
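The law of total expectation can likewise be verified on a small example. A Python sketch with a hypothetical joint pmf (all numbers are illustrative, not from the text):

```python
import numpy as np

# Hypothetical joint pmf Pr(X = x, Y = y); rows index x, columns index y.
x_vals = np.array([0.0, 1.0, 2.0])
joint = np.array([[0.10, 0.15],
                  [0.20, 0.25],
                  [0.05, 0.25]])          # entries sum to 1

p_y = joint.sum(axis=0)                   # marginal Pr(Y = y)
# Conditional expectation E(X | Y = y) for each value y:
E_X_given_y = (x_vals[:, None] * joint).sum(axis=0) / p_y
lhs = np.sum(E_X_given_y * p_y)           # E[E(X|Y)]
rhs = np.sum(x_vals * joint.sum(axis=1))  # E(X)
assert np.isclose(lhs, rhs)
```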
2) If X is a continuous random variable with possible outcomes $\mathcal{X}$ and probability density function (→ Definition I/1.6.6) $f_X(x)$, the expected value (→ Definition I/1.7.1) of g(X) is

\[ \mathrm{E}[g(X)] = \int_{\mathcal{X}} g(x) \, f_X(x) \, \mathrm{d}x . \tag{2} \]

\[ \begin{split}
\mathrm{E}[g(X)] &= \sum_{y \in \mathcal{Y}} y \, \mathrm{Pr}(g(x) = y) \\
&= \sum_{y \in \mathcal{Y}} y \, \mathrm{Pr}(x = g^{-1}(y)) \\
&= \sum_{y \in \mathcal{Y}} y \sum_{x = g^{-1}(y)} f_X(x) \\
&= \sum_{y \in \mathcal{Y}} \sum_{x = g^{-1}(y)} y \, f_X(x) \\
&= \sum_{y \in \mathcal{Y}} \sum_{x = g^{-1}(y)} g(x) \, f_X(x) .
\end{split} \tag{4} \]

Finally, noting that “for all y, then for all x = g⁻¹(y)” is equivalent to “for all x” if g⁻¹ is a monotonic function, we can conclude that

\[ \mathrm{E}[g(X)] = \sum_{x \in \mathcal{X}} g(x) \, f_X(x) . \tag{5} \]
\[ \begin{split}
F_Y(y) &= \mathrm{Pr}(Y \leq y) \\
&= \mathrm{Pr}(g(X) \leq y) \\
&= \mathrm{Pr}(X \leq g^{-1}(y)) \\
&= F_X(g^{-1}(y)) .
\end{split} \tag{9} \]

Differentiating to get the probability density function (→ Definition I/1.6.6) of Y, the result is:

\[ \begin{split}
f_Y(y) &= \frac{\mathrm{d}}{\mathrm{d}y} F_Y(y) \\
&\overset{(9)}{=} \frac{\mathrm{d}}{\mathrm{d}y} F_X(g^{-1}(y)) \\
&= f_X(g^{-1}(y)) \, \frac{\mathrm{d}}{\mathrm{d}y} \left( g^{-1}(y) \right) \\
&\overset{(6)}{=} f_X(g^{-1}(y)) \, \frac{1}{g'(g^{-1}(y))} .
\end{split} \tag{10} \]
Sources:
• Wikipedia (2020): “Law of the unconscious statistician”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-07-22; URL: https://en.wikipedia.org/wiki/Law_of_the_unconscious_statistician#
Proof.
• Taboga, Marco (2017): “Transformation theorem”; in: Lectures on probability and mathematical
statistics, retrieved on 2021-09-22; URL: https://www.statlect.com/glossary/transformation-theorem.
Metadata: ID: P138 | shortcut: mean-lotus | author: JoramSoch | date: 2020-07-22, 08:30.
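A quick numerical illustration of the discrete law of the unconscious statistician, with a made-up pmf and transformation g (not taken from the text):

```python
import numpy as np

# Discrete LOTUS check with a hypothetical pmf and g(x) = x**2 + 1.
x = np.array([-1.0, 0.0, 2.0])
f = np.array([0.3, 0.4, 0.3])
g = lambda v: v ** 2 + 1.0

# E[g(X)] computed directly from the pmf of X, without deriving the pmf of g(X):
lotus = np.sum(g(x) * f)

# Same value from the distribution of Y = g(X) itself:
y_vals, idx = np.unique(g(x), return_inverse=True)
p_y = np.bincount(idx, weights=f)          # pmf of Y
direct = np.sum(y_vals * p_y)
assert np.isclose(lotus, direct)
```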
Sources:
• Taboga, Marco (2017): “Expected value”; in: Lectures on probability theory and mathematical
statistics, retrieved on 2021-07-08; URL: https://www.statlect.com/fundamentals-of-probability/
expected-value#hid12.
• Wikipedia (2021): “Multivariate random variable”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-07-08; URL: https://en.wikipedia.org/wiki/Multivariate_random_variable#Expected_
value.
Metadata: ID: D154 | shortcut: mean-rvec | author: JoramSoch | date: 2021-07-08, 08:34.
Sources:
• Taboga, Marco (2017): “Expected value”; in: Lectures on probability theory and mathematical
statistics, retrieved on 2021-07-08; URL: https://www.statlect.com/fundamentals-of-probability/
expected-value#hid13.
Metadata: ID: D155 | shortcut: mean-rmat | author: JoramSoch | date: 2021-07-08, 08:42.
1.8 Variance
1.8.1 Definition
Definition: The variance of a random variable (→ Definition I/1.2.2) X is defined as the expected
value (→ Definition I/1.7.1) of the squared deviation from its expected value (→ Definition I/1.7.1):
\[ \mathrm{Var}(X) = \mathrm{E}\left[ (X - \mathrm{E}(X))^2 \right] . \tag{1} \]
Sources:
• Wikipedia (2020): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-02-13; URL:
https://en.wikipedia.org/wiki/Variance#Definition.
Metadata: ID: D12 | shortcut: var | author: JoramSoch | date: 2020-02-13, 19:55.
\[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \tag{1} \]

and the unbiased sample variance of x is given by

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \tag{2} \]

where x̄ is the sample mean (→ Definition I/1.7.2).
Sources:
• Wikipedia (2021): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-04-16; URL:
https://en.wikipedia.org/wiki/Variance#Sample_variance.
Metadata: ID: D143 | shortcut: var-samp | author: JoramSoch | date: 2021-04-16, 12:04.
\[ \begin{split}
\mathrm{Var}(X) &= \mathrm{E}\left[ (X - \mathrm{E}[X])^2 \right] \\
&= \mathrm{E}\left[ X^2 - 2 \, X \, \mathrm{E}(X) + \mathrm{E}(X)^2 \right] \\
&= \mathrm{E}(X^2) - 2 \, \mathrm{E}(X) \, \mathrm{E}(X) + \mathrm{E}(X)^2 \\
&= \mathrm{E}(X^2) - \mathrm{E}(X)^2 .
\end{split} \tag{3} \]
Sources:
• Wikipedia (2020): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-05-19; URL:
https://en.wikipedia.org/wiki/Variance#Definition.
Metadata: ID: P104 | shortcut: var-mean | author: JoramSoch | date: 2020-05-19, 00:17.
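The partition of variance into expected values can be checked directly, for instance in Python with an illustrative discrete pmf (numbers invented for the example):

```python
import numpy as np

# Check Var(X) = E(X^2) - E(X)^2 for a hypothetical discrete pmf.
x = np.array([1.0, 3.0, 4.0])
f = np.array([0.5, 0.25, 0.25])

mean = np.sum(x * f)                           # E(X)
var_def = np.sum((x - mean) ** 2 * f)          # E[(X - E(X))^2]
var_moments = np.sum(x ** 2 * f) - mean ** 2   # E(X^2) - E(X)^2
assert np.isclose(var_def, var_moments)
```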
1.8.4 Non-negativity
Theorem: The variance (→ Definition I/1.8.1) is always non-negative, i.e.
Var(X) ≥ 0 . (1)
Proof: The variance (→ Definition I/1.8.1) of a random variable (→ Definition I/1.2.2) is defined as

\[ \mathrm{Var}(X) = \mathrm{E}\left[ (X - \mathrm{E}(X))^2 \right] . \tag{2} \]

1) If X is a discrete random variable (→ Definition I/1.2.6), then, because squares and probabilities are strictly non-negative, all the addends in

\[ \mathrm{Var}(X) = \sum_{x \in \mathcal{X}} (x - \mathrm{E}(X))^2 \cdot f_X(x) \tag{3} \]

are non-negative, such that the sum on the right-hand side, and with it the variance on the left-hand side, must be non-negative.

2) If X is a continuous random variable (→ Definition I/1.2.6), then, because squares and probability densities are strictly non-negative, the integrand in

\[ \mathrm{Var}(X) = \int_{\mathcal{X}} (x - \mathrm{E}(X))^2 \cdot f_X(x) \, \mathrm{d}x \tag{4} \]

is always non-negative, thus the Lebesgue integral on the right-hand side, and with it the result on the left-hand side, must be non-negative.
Sources:
• Wikipedia (2020): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-06-06; URL:
https://en.wikipedia.org/wiki/Variance#Basic_properties.
Metadata: ID: P123 | shortcut: var-nonneg | author: JoramSoch | date: 2020-06-06, 07:29.
Proof:

1) A constant (→ Definition I/1.2.5) is defined as a quantity that always has the same value. Thus, if understood as a random variable (→ Definition I/1.2.2), the expected value (→ Definition I/1.7.1) of a constant is equal to itself:

\[ \mathrm{E}(a) = a . \tag{3} \]

Plugged into the formula of the variance (→ Definition I/1.8.1), we have

\[ \begin{split}
\mathrm{Var}(a) &= \mathrm{E}\left[ (a - \mathrm{E}(a))^2 \right] \\
&= \mathrm{E}\left[ (a - a)^2 \right] \\
&= \mathrm{E}(0) .
\end{split} \tag{4} \]

Applied to the formula of the expected value (→ Definition I/1.7.1), this gives

\[ \mathrm{E}(0) = \sum_{x = 0} x \cdot f_X(x) = 0 \cdot 1 = 0 . \tag{5} \]

2) Conversely, if the variance (→ Definition I/1.8.1) of X is zero, the expected value of the non-negative random variable (X − E(X))² is zero, which requires that

\[ (X - \mathrm{E}(X))^2 = 0 . \tag{7} \]

This, in turn, requires that X is equal to its expected value (→ Definition I/1.7.1)

\[ X = \mathrm{E}(X) \tag{8} \]

which can only be the case if X always has the same value (→ Definition I/1.2.5):

\[ X = \mathrm{const.} \tag{9} \]
Sources:
• Wikipedia (2020): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-06-27; URL:
https://en.wikipedia.org/wiki/Variance#Basic_properties.
Metadata: ID: P124 | shortcut: var-const | author: JoramSoch | date: 2020-06-27, 06:44.
Proof: The variance (→ Definition I/1.8.1) is defined in terms of the expected value (→ Definition I/1.7.1) as

\[ \mathrm{Var}(X) = \mathrm{E}\left[ (X - \mathrm{E}(X))^2 \right] . \tag{2} \]

Using this and the linearity of the expected value (→ Proof I/1.7.5), we can derive (1) as follows:

\[ \begin{split}
\mathrm{Var}(X+a) &\overset{(2)}{=} \mathrm{E}\left[ ((X+a) - \mathrm{E}(X+a))^2 \right] \\
&= \mathrm{E}\left[ (X + a - \mathrm{E}(X) - a)^2 \right] \\
&= \mathrm{E}\left[ (X - \mathrm{E}(X))^2 \right] \\
&\overset{(2)}{=} \mathrm{Var}(X) .
\end{split} \tag{3} \]
Sources:
• Wikipedia (2020): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-07-07; URL:
https://en.wikipedia.org/wiki/Variance#Basic_properties.
Metadata: ID: P126 | shortcut: var-inv | author: JoramSoch | date: 2020-07-07, 05:23.
Proof: The variance (→ Definition I/1.8.1) is defined in terms of the expected value (→ Definition I/1.7.1) as

\[ \mathrm{Var}(X) = \mathrm{E}\left[ (X - \mathrm{E}(X))^2 \right] . \tag{2} \]

Using this and the linearity of the expected value (→ Proof I/1.7.5), we can derive (1) as follows:

\[ \begin{split}
\mathrm{Var}(aX) &\overset{(2)}{=} \mathrm{E}\left[ ((aX) - \mathrm{E}(aX))^2 \right] \\
&= \mathrm{E}\left[ (aX - a\,\mathrm{E}(X))^2 \right] \\
&= \mathrm{E}\left[ (a[X - \mathrm{E}(X)])^2 \right] \\
&= \mathrm{E}\left[ a^2 (X - \mathrm{E}(X))^2 \right] \\
&= a^2 \, \mathrm{E}\left[ (X - \mathrm{E}(X))^2 \right] \\
&\overset{(2)}{=} a^2 \, \mathrm{Var}(X) .
\end{split} \tag{3} \]
Sources:
• Wikipedia (2020): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-07-07; URL:
https://en.wikipedia.org/wiki/Variance#Basic_properties.
Metadata: ID: P127 | shortcut: var-scal | author: JoramSoch | date: 2020-07-07, 05:38.
Proof: The variance (→ Definition I/1.8.1) is defined in terms of the expected value (→ Definition I/1.7.1) as

\[ \mathrm{Var}(X) = \mathrm{E}\left[ (X - \mathrm{E}(X))^2 \right] . \tag{2} \]

Using this and the linearity of the expected value (→ Proof I/1.7.5), we can derive (1) as follows:

\[ \begin{split}
\mathrm{Var}(X+Y) &\overset{(2)}{=} \mathrm{E}\left[ ((X+Y) - \mathrm{E}(X+Y))^2 \right] \\
&= \mathrm{E}\left[ ([X - \mathrm{E}(X)] + [Y - \mathrm{E}(Y)])^2 \right] \\
&= \mathrm{E}\left[ (X - \mathrm{E}(X))^2 + (Y - \mathrm{E}(Y))^2 + 2 \, (X - \mathrm{E}(X))(Y - \mathrm{E}(Y)) \right] \\
&= \mathrm{E}\left[ (X - \mathrm{E}(X))^2 \right] + \mathrm{E}\left[ (Y - \mathrm{E}(Y))^2 \right] + \mathrm{E}\left[ 2 \, (X - \mathrm{E}(X))(Y - \mathrm{E}(Y)) \right] \\
&\overset{(2)}{=} \mathrm{Var}(X) + \mathrm{Var}(Y) + 2 \, \mathrm{Cov}(X,Y) .
\end{split} \tag{3} \]
Sources:
• Wikipedia (2020): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-07-07; URL:
https://en.wikipedia.org/wiki/Variance#Basic_properties.
Metadata: ID: P128 | shortcut: var-sum | author: JoramSoch | date: 2020-07-07, 06:10.
Proof: The variance (→ Definition I/1.8.1) is defined in terms of the expected value (→ Definition I/1.7.1) as

\[ \mathrm{Var}(X) = \mathrm{E}\left[ (X - \mathrm{E}(X))^2 \right] . \tag{2} \]

Using this and the linearity of the expected value (→ Proof I/1.7.5), we can derive (1) as follows:

\[ \begin{split}
\mathrm{Var}(aX + bY) &\overset{(2)}{=} \mathrm{E}\left[ ((aX+bY) - \mathrm{E}(aX+bY))^2 \right] \\
&= \mathrm{E}\left[ (a[X - \mathrm{E}(X)] + b[Y - \mathrm{E}(Y)])^2 \right] \\
&= \mathrm{E}\left[ a^2 (X - \mathrm{E}(X))^2 + b^2 (Y - \mathrm{E}(Y))^2 + 2ab \, (X - \mathrm{E}(X))(Y - \mathrm{E}(Y)) \right] \\
&= a^2 \, \mathrm{E}\left[ (X - \mathrm{E}(X))^2 \right] + b^2 \, \mathrm{E}\left[ (Y - \mathrm{E}(Y))^2 \right] + 2ab \, \mathrm{E}\left[ (X - \mathrm{E}(X))(Y - \mathrm{E}(Y)) \right] \\
&\overset{(2)}{=} a^2 \, \mathrm{Var}(X) + b^2 \, \mathrm{Var}(Y) + 2ab \, \mathrm{Cov}(X,Y) .
\end{split} \tag{3} \]
Sources:
• Wikipedia (2020): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-07-07; URL:
https://en.wikipedia.org/wiki/Variance#Basic_properties.
Metadata: ID: P129 | shortcut: var-lincomb | author: JoramSoch | date: 2020-07-07, 06:21.
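The linear-combination formula holds exactly for empirical (population-style, ddof = 0) moments as well, which gives a simple numerical check. The data and coefficients below are simulated and purely illustrative:

```python
import numpy as np

# Check Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y)
# using population-style formulas on a finite equiprobable sample (ddof=0).
rng = np.random.default_rng(0)
X = rng.normal(size=10_000)
Y = 0.5 * X + rng.normal(size=10_000)
a, b = 2.0, -3.0

lhs = np.var(a * X + b * Y)
cov = np.mean((X - X.mean()) * (Y - Y.mean()))
rhs = a ** 2 * np.var(X) + b ** 2 * np.var(Y) + 2 * a * b * cov
assert np.isclose(lhs, rhs)
```

Because both sides are computed from the same sample moments, the identity holds exactly (up to floating-point rounding), not just asymptotically.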
Proof: The variance of the sum of two random variables (→ Proof I/1.8.8) is given by
Sources:
• Wikipedia (2020): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-07-07; URL:
https://en.wikipedia.org/wiki/Variance#Basic_properties.
Metadata: ID: P130 | shortcut: var-add | author: JoramSoch | date: 2020-07-07, 06:52.
Proof: The variance can be decomposed into expected values (→ Proof I/1.8.3) as follows:
Sources:
• Wikipedia (2021): “Law of total variance”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
11-26; URL: https://en.wikipedia.org/wiki/Law_of_total_variance#Proof.
Metadata: ID: P292 | shortcut: var-tot | author: JoramSoch | date: 2021-11-26, 11:20.
1.8.12 Precision
Definition: The precision of a random variable (→ Definition I/1.2.2) X is defined as the inverse of
the variance (→ Definition I/1.8.1), i.e. one divided by the expected value (→ Definition I/1.7.1) of
the squared deviation from its expected value (→ Definition I/1.7.1):
\[ \mathrm{Prec}(X) = \mathrm{Var}(X)^{-1} = \frac{1}{\mathrm{E}\left[ (X - \mathrm{E}(X))^2 \right]} . \tag{1} \]
Sources:
• Wikipedia (2020): “Precision (statistics)”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
04-21; URL: https://en.wikipedia.org/wiki/Precision_(statistics).
Metadata: ID: D145 | shortcut: prec | author: JoramSoch | date: 2020-04-21, 07:04.
1.9 Covariance
1.9.1 Definition
Definition: The covariance of two random variables (→ Definition I/1.2.2) X and Y is defined as
the expected value (→ Definition I/1.7.1) of the product of their deviations from their individual
expected values (→ Definition I/1.7.1):
\[ \mathrm{Cov}(X,Y) = \mathrm{E}\left[ (X - \mathrm{E}[X])(Y - \mathrm{E}[Y]) \right] . \tag{1} \]
Sources:
• Wikipedia (2020): “Covariance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-02-06;
URL: https://en.wikipedia.org/wiki/Covariance#Definition.
Metadata: ID: D70 | shortcut: cov | author: JoramSoch | date: 2020-06-02, 20:20.
\[ \hat{\sigma}_{xy} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \tag{1} \]

and the unbiased sample covariance of x and y is given by

\[ s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \tag{2} \]
where x̄ and ȳ are the sample means (→ Definition I/1.7.2).
Sources:
• Wikipedia (2021): “Covariance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-05-20;
URL: https://en.wikipedia.org/wiki/Covariance#Calculating_the_sample_covariance.
Metadata: ID: D144 | shortcut: cov-samp | author: ciaranmci | date: 2021-04-21, 06:53.
Sources:
• Wikipedia (2020): “Covariance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-06-02;
URL: https://en.wikipedia.org/wiki/Covariance#Definition.
Metadata: ID: P118 | shortcut: cov-mean | author: JoramSoch | date: 2020-06-02, 20:50.
1.9.4 Symmetry
Theorem: The covariance (→ Definition I/1.9.1) of two random variables (→ Definition I/1.2.2) is a symmetric function:

\[ \mathrm{Cov}(X,Y) = \mathrm{Cov}(Y,X) . \tag{1} \]

Proof: The covariance (→ Definition I/1.9.1) of random variables (→ Definition I/1.2.2) X and Y is defined as:

\[ \mathrm{Cov}(X,Y) = \mathrm{E}\left[ (X - \mathrm{E}[X])(Y - \mathrm{E}[Y]) \right] . \tag{2} \]

Applying this to Cov(Y, X), we get:

\[ \begin{split}
\mathrm{Cov}(Y,X) &\overset{(2)}{=} \mathrm{E}\left[ (Y - \mathrm{E}[Y])(X - \mathrm{E}[X]) \right] \\
&= \mathrm{E}\left[ (X - \mathrm{E}[X])(Y - \mathrm{E}[Y]) \right] \\
&= \mathrm{Cov}(X,Y) .
\end{split} \tag{3} \]
Sources:
• Wikipedia (2022): “Covariance”; in: Wikipedia, the free encyclopedia, retrieved on 2022-09-26;
URL: https://en.wikipedia.org/wiki/Covariance#Covariance_of_linear_combinations.
Metadata: ID: P353 | shortcut: cov-symm | author: JoramSoch | date: 2022-09-26, 12:14.
1.9.5 Self-covariance
Theorem: The covariance (→ Definition I/1.9.1) of a random variable (→ Definition I/1.2.2) with itself is equal to the variance (→ Definition I/1.8.1):

\[ \mathrm{Cov}(X,X) = \mathrm{Var}(X) . \tag{1} \]

Proof: The covariance (→ Definition I/1.9.1) of random variables (→ Definition I/1.2.2) X and Y is defined as:

\[ \mathrm{Cov}(X,Y) = \mathrm{E}\left[ (X - \mathrm{E}[X])(Y - \mathrm{E}[Y]) \right] . \tag{2} \]

Setting Y = X, we get:

\[ \begin{split}
\mathrm{Cov}(X,X) &\overset{(2)}{=} \mathrm{E}\left[ (X - \mathrm{E}[X])(X - \mathrm{E}[X]) \right] \\
&= \mathrm{E}\left[ (X - \mathrm{E}[X])^2 \right] \\
&= \mathrm{Var}(X) .
\end{split} \tag{3} \]
Sources:
• Wikipedia (2022): “Covariance”; in: Wikipedia, the free encyclopedia, retrieved on 2022-09-26;
URL: https://en.wikipedia.org/wiki/Covariance#Covariance_with_itself.
Metadata: ID: P352 | shortcut: cov-var | author: JoramSoch | date: 2022-09-26, 12:08.
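A one-line empirical check that the covariance of a variable with itself equals its variance (simulated data, illustrative only):

```python
import numpy as np

# Cov(X, X) = Var(X), checked with empirical (population-style) moments.
rng = np.random.default_rng(1)
X = rng.exponential(size=1_000)

cov_xx = np.mean((X - X.mean()) * (X - X.mean()))  # empirical Cov(X, X)
assert np.isclose(cov_xx, np.var(X))               # equals empirical Var(X)
```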
Proof: The covariance can be expressed in terms of expected values (→ Proof I/1.9.3) as

\[ \begin{split}
\mathrm{Cov}(X,Y) &\overset{(2)}{=} \mathrm{E}(XY) - \mathrm{E}(X) \, \mathrm{E}(Y) \\
&\overset{(3)}{=} \mathrm{E}(X) \, \mathrm{E}(Y) - \mathrm{E}(X) \, \mathrm{E}(Y) \\
&= 0 .
\end{split} \tag{4} \]
Sources:
• Wikipedia (2020): “Covariance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-09-03;
URL: https://en.wikipedia.org/wiki/Covariance#Uncorrelatedness_and_independence.
Metadata: ID: P158 | shortcut: cov-ind | author: JoramSoch | date: 2020-09-03, 06:05.
\[ \mathrm{Corr}(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \, \sigma_Y} , \tag{2} \]

which can be rearranged for the covariance (→ Definition I/1.9.1) to give

\[ \mathrm{Cov}(X,Y) = \sigma_X \, \sigma_Y \, \mathrm{Corr}(X,Y) . \tag{3} \]
Sources:
• original work
Metadata: ID: P119 | shortcut: cov-corr | author: JoramSoch | date: 2020-06-02, 21:00.
Proof: The covariance can be decomposed into expected values (→ Proof I/1.9.3) as follows:
Sources:
• Wikipedia (2021): “Law of total covariance”; in: Wikipedia, the free encyclopedia, retrieved on
2021-11-26; URL: https://en.wikipedia.org/wiki/Law_of_total_covariance#Proof.
Metadata: ID: P293 | shortcut: cov-tot | author: JoramSoch | date: 2021-11-26, 11:38.
\[ \Sigma_{XX} = \begin{bmatrix} \mathrm{Cov}(X_1,X_1) & \cdots & \mathrm{Cov}(X_1,X_n) \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_n,X_1) & \cdots & \mathrm{Cov}(X_n,X_n) \end{bmatrix} = \begin{bmatrix} \mathrm{E}\left[ (X_1-\mathrm{E}[X_1])(X_1-\mathrm{E}[X_1]) \right] & \cdots & \mathrm{E}\left[ (X_1-\mathrm{E}[X_1])(X_n-\mathrm{E}[X_n]) \right] \\ \vdots & \ddots & \vdots \\ \mathrm{E}\left[ (X_n-\mathrm{E}[X_n])(X_1-\mathrm{E}[X_1]) \right] & \cdots & \mathrm{E}\left[ (X_n-\mathrm{E}[X_n])(X_n-\mathrm{E}[X_n]) \right] \end{bmatrix} \tag{1} \]
Sources:
• Wikipedia (2020): “Covariance matrix”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
06-06; URL: https://en.wikipedia.org/wiki/Covariance_matrix#Definition.
Metadata: ID: D72 | shortcut: covmat | author: JoramSoch | date: 2020-06-06, 04:24.
\[ \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\mathrm{T}} \tag{1} \]

and the unbiased sample covariance matrix of x is given by

\[ S = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\mathrm{T}} \tag{2} \]

where x̄ is the sample mean (→ Definition I/1.7.2).
Sources:
• Wikipedia (2021): “Sample mean and covariance”; in: Wikipedia, the free encyclopedia, retrieved on
2020-05-20; URL: https://en.wikipedia.org/wiki/Sample_mean_and_covariance#Definition_of_
sample_covariance.
Metadata: ID: D153 | shortcut: covmat-samp | author: JoramSoch | date: 2021-05-20, 07:46.
\[ \begin{split}
\Sigma_{XX} &= \mathrm{E}\left[ (X - \mathrm{E}[X])(X - \mathrm{E}[X])^{\mathrm{T}} \right] \\
&= \mathrm{E}\left[ X X^{\mathrm{T}} - X \, \mathrm{E}(X)^{\mathrm{T}} - \mathrm{E}(X) \, X^{\mathrm{T}} + \mathrm{E}(X) \, \mathrm{E}(X)^{\mathrm{T}} \right] \\
&= \mathrm{E}(X X^{\mathrm{T}}) - \mathrm{E}(X) \, \mathrm{E}(X)^{\mathrm{T}} - \mathrm{E}(X) \, \mathrm{E}(X)^{\mathrm{T}} + \mathrm{E}(X) \, \mathrm{E}(X)^{\mathrm{T}} \\
&= \mathrm{E}(X X^{\mathrm{T}}) - \mathrm{E}(X) \, \mathrm{E}(X)^{\mathrm{T}} .
\end{split} \tag{4} \]
Sources:
• Taboga, Marco (2010): “Covariance matrix”; in: Lectures on probability and statistics, retrieved
on 2020-06-06; URL: https://www.statlect.com/fundamentals-of-probability/covariance-matrix.
Metadata: ID: P120 | shortcut: covmat-mean | author: JoramSoch | date: 2020-06-06, 05:31.
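Both expressions for the covariance matrix agree exactly for empirical moments too. A Python sketch with simulated data (dimensions, seed and sample size are illustrative):

```python
import numpy as np

# Check Sigma_XX = E(X X^T) - E(X) E(X)^T with empirical moments over n samples
# of a 3-dimensional random vector (columns of the data matrix are samples).
rng = np.random.default_rng(2)
X = rng.normal(size=(3, 500))              # each column is one realization of X

mu = X.mean(axis=1, keepdims=True)         # E(X) as a column vector
n = X.shape[1]
sigma_def = (X - mu) @ (X - mu).T / n      # E[(X - E X)(X - E X)^T]
sigma_mom = X @ X.T / n - mu @ mu.T        # E(X X^T) - E(X) E(X)^T
assert np.allclose(sigma_def, sigma_mom)
```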
1.9.12 Symmetry
Theorem: Each covariance matrix (→ Definition I/1.9.9) is symmetric:

\[ \Sigma_{XX}^{\mathrm{T}} = \Sigma_{XX} . \tag{1} \]

Proof: The covariance matrix (→ Definition I/1.9.9) of a random vector (→ Definition I/1.2.3) X is defined as

\[ \Sigma_{XX} = \begin{bmatrix} \mathrm{Cov}(X_1,X_1) & \cdots & \mathrm{Cov}(X_1,X_n) \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_n,X_1) & \cdots & \mathrm{Cov}(X_n,X_n) \end{bmatrix} . \tag{2} \]

A symmetric matrix is a matrix whose transpose is equal to itself. The transpose of Σ_XX is

\[ \Sigma_{XX}^{\mathrm{T}} = \begin{bmatrix} \mathrm{Cov}(X_1,X_1) & \cdots & \mathrm{Cov}(X_n,X_1) \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_1,X_n) & \cdots & \mathrm{Cov}(X_n,X_n) \end{bmatrix} . \tag{3} \]

Because the covariance is a symmetric function (→ Proof I/1.9.4), i.e. Cov(X, Y) = Cov(Y, X), this matrix is equal to

\[ \Sigma_{XX}^{\mathrm{T}} = \begin{bmatrix} \mathrm{Cov}(X_1,X_1) & \cdots & \mathrm{Cov}(X_1,X_n) \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_n,X_1) & \cdots & \mathrm{Cov}(X_n,X_n) \end{bmatrix} \tag{4} \]

which is equivalent to our original definition in (2).
Sources:
• Wikipedia (2022): “Covariance matrix”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
09-26; URL: https://en.wikipedia.org/wiki/Covariance_matrix#Basic_properties.
Metadata: ID: P350 | shortcut: covmat-symm | author: JoramSoch | date: 2022-09-26, 10:54.
Proof: The covariance matrix (→ Definition I/1.9.9) of X can be expressed (→ Proof I/1.9.11) in terms of expected values (→ Definition I/1.7.1) as follows:

\[ \Sigma_{XX} = \Sigma(X) = \mathrm{E}\left[ (X - \mathrm{E}[X])(X - \mathrm{E}[X])^{\mathrm{T}} \right] . \tag{2} \]

A positive semi-definite matrix is a matrix whose eigenvalues are all non-negative or, equivalently, a matrix M for which aᵀ M a ≥ 0 holds for every a ∈ ℝⁿ. Thus, consider the quadratic form

\[ a^{\mathrm{T}} \Sigma_{XX} \, a = \mathrm{E}\left[ a^{\mathrm{T}} (X - \mu_X)(X - \mu_X)^{\mathrm{T}} a \right] \tag{5} \]

and define the scalar random variable

\[ Y = a^{\mathrm{T}} (X - \mu_X) \tag{6} \]

where µ_X = E[X], and note that

\[ a^{\mathrm{T}} (X - \mu_X) = (X - \mu_X)^{\mathrm{T}} a . \tag{7} \]

Thus, combining (5) with (6), we have:

\[ a^{\mathrm{T}} \Sigma_{XX} \, a = \mathrm{E}\left( Y^2 \right) . \tag{8} \]

Because Y² is a random variable that cannot become negative and the expected value of a strictly non-negative random variable is also non-negative (→ Proof I/1.7.4), we finally have

\[ a^{\mathrm{T}} \Sigma_{XX} \, a \geq 0 \tag{9} \]

for any a ∈ ℝⁿ.
Sources:
• hkBattousai (2013): “What is the proof that covariance matrices are always semi-definite?”; in:
StackExchange Mathematics, retrieved on 2022-09-26; URL: https://math.stackexchange.com/a/
327872.
• Wikipedia (2022): “Covariance matrix”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
09-26; URL: https://en.wikipedia.org/wiki/Covariance_matrix#Basic_properties.
Metadata: ID: P351 | shortcut: covmat-psd | author: JoramSoch | date: 2022-09-26, 11:26.
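Positive semi-definiteness can be observed numerically: the eigenvalues of a sample covariance matrix are non-negative up to rounding error, and every quadratic form in it is non-negative (simulated data, illustrative):

```python
import numpy as np

# All eigenvalues of a covariance matrix are >= 0, equivalently
# a^T Sigma a >= 0 for every vector a.
rng = np.random.default_rng(3)
X = rng.normal(size=(4, 200))              # rows are variables, columns samples
sigma = np.cov(X)                          # sample covariance matrix

eigvals = np.linalg.eigvalsh(sigma)        # eigenvalues of the symmetric matrix
assert np.all(eigvals >= -1e-12)           # non-negative up to rounding error

a = rng.normal(size=4)
assert a @ sigma @ a >= -1e-12             # quadratic form is non-negative
```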
Proof: The covariance matrix (→ Definition I/1.9.9) of X can be expressed (→ Proof I/1.9.11) in terms of expected values (→ Definition I/1.7.1) as follows:

\[ \Sigma_{XX} = \Sigma(X) = \mathrm{E}\left[ (X - \mathrm{E}[X])(X - \mathrm{E}[X])^{\mathrm{T}} \right] . \tag{2} \]

Using this and the linearity of the expected value (→ Proof I/1.7.5), we can derive (1) as follows:

\[ \begin{split}
\Sigma(X+a) &\overset{(2)}{=} \mathrm{E}\left[ ([X+a] - \mathrm{E}[X+a])([X+a] - \mathrm{E}[X+a])^{\mathrm{T}} \right] \\
&= \mathrm{E}\left[ (X + a - \mathrm{E}[X] - a)(X + a - \mathrm{E}[X] - a)^{\mathrm{T}} \right] \\
&= \mathrm{E}\left[ (X - \mathrm{E}[X])(X - \mathrm{E}[X])^{\mathrm{T}} \right] \\
&\overset{(2)}{=} \Sigma(X) .
\end{split} \tag{3} \]
Sources:
• Wikipedia (2022): “Covariance matrix”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
09-22; URL: https://en.wikipedia.org/wiki/Covariance_matrix#Basic_properties.
Metadata: ID: P347 | shortcut: covmat-inv | author: JoramSoch | date: 2022-09-22, 11:29.
Proof: The covariance matrix (→ Definition I/1.9.9) of X can be expressed (→ Proof I/1.9.11) in terms of expected values (→ Definition I/1.7.1) as follows:

\[ \Sigma_{XX} = \Sigma(X) = \mathrm{E}\left[ (X - \mathrm{E}[X])(X - \mathrm{E}[X])^{\mathrm{T}} \right] . \tag{2} \]

Using this and the linearity of the expected value (→ Proof I/1.7.5), we can derive (1) as follows:

\[ \begin{split}
\Sigma(AX) &\overset{(2)}{=} \mathrm{E}\left[ ([AX] - \mathrm{E}[AX])([AX] - \mathrm{E}[AX])^{\mathrm{T}} \right] \\
&= \mathrm{E}\left[ (A[X - \mathrm{E}(X)])(A[X - \mathrm{E}(X)])^{\mathrm{T}} \right] \\
&= \mathrm{E}\left[ A (X - \mathrm{E}[X])(X - \mathrm{E}[X])^{\mathrm{T}} A^{\mathrm{T}} \right] \\
&= A \, \mathrm{E}\left[ (X - \mathrm{E}[X])(X - \mathrm{E}[X])^{\mathrm{T}} \right] A^{\mathrm{T}} \\
&\overset{(2)}{=} A \, \Sigma(X) \, A^{\mathrm{T}} .
\end{split} \tag{3} \]
Sources:
• Wikipedia (2022): “Covariance matrix”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
09-22; URL: https://en.wikipedia.org/wiki/Covariance_matrix#Basic_properties.
Metadata: ID: P348 | shortcut: covmat-scal | author: JoramSoch | date: 2022-09-22, 11:45.
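The transformation rule Σ(AX) = A Σ(X) Aᵀ also holds for sample covariance matrices, which allows a direct check. The matrix A and the data below are made up for illustration:

```python
import numpy as np

# Check Sigma(AX) = A Sigma(X) A^T with empirical covariance matrices.
rng = np.random.default_rng(4)
X = rng.normal(size=(3, 1_000))            # columns are realizations of X
A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])           # linear map from R^3 to R^2

sigma_X = np.cov(X)
sigma_AX = np.cov(A @ X)
assert np.allclose(sigma_AX, A @ sigma_X @ A.T)
```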
\[ \Sigma_{XY} = \begin{bmatrix} \mathrm{Cov}(X_1,Y_1) & \cdots & \mathrm{Cov}(X_1,Y_m) \\ \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_n,Y_1) & \cdots & \mathrm{Cov}(X_n,Y_m) \end{bmatrix} = \begin{bmatrix} \mathrm{E}\left[ (X_1-\mathrm{E}[X_1])(Y_1-\mathrm{E}[Y_1]) \right] & \cdots & \mathrm{E}\left[ (X_1-\mathrm{E}[X_1])(Y_m-\mathrm{E}[Y_m]) \right] \\ \vdots & \ddots & \vdots \\ \mathrm{E}\left[ (X_n-\mathrm{E}[X_n])(Y_1-\mathrm{E}[Y_1]) \right] & \cdots & \mathrm{E}\left[ (X_n-\mathrm{E}[X_n])(Y_m-\mathrm{E}[Y_m]) \right] \end{bmatrix} \tag{1} \]
Sources:
• Wikipedia (2022): “Cross-covariance matrix”; in: Wikipedia, the free encyclopedia, retrieved on
2022-09-26; URL: https://en.wikipedia.org/wiki/Cross-covariance_matrix#Definition.
Metadata: ID: D176 | shortcut: covmat-cross | author: JoramSoch | date: 2022-09-26, 09:45.
Proof: The covariance matrix (→ Definition I/1.9.9) of X can be expressed (→ Proof I/1.9.11) in terms of expected values (→ Definition I/1.7.1) as follows

\[ \Sigma_{XX} = \Sigma(X) = \mathrm{E}\left[ (X - \mathrm{E}[X])(X - \mathrm{E}[X])^{\mathrm{T}} \right] \tag{2} \]

and the cross-covariance matrix (→ Definition I/1.9.16) of X and Y can similarly be written as

\[ \Sigma_{XY} = \Sigma(X,Y) = \mathrm{E}\left[ (X - \mathrm{E}[X])(Y - \mathrm{E}[Y])^{\mathrm{T}} \right] . \tag{3} \]

Using this and the linearity of the expected value (→ Proof I/1.7.5) as well as the definitions of covariance matrix (→ Definition I/1.9.9) and cross-covariance matrix (→ Definition I/1.9.16), we can derive (1) as follows:

\[ \begin{split}
\Sigma(X+Y) &\overset{(2)}{=} \mathrm{E}\left[ ([X+Y] - \mathrm{E}[X+Y])([X+Y] - \mathrm{E}[X+Y])^{\mathrm{T}} \right] \\
&= \mathrm{E}\left[ ([X - \mathrm{E}(X)] + [Y - \mathrm{E}(Y)])([X - \mathrm{E}(X)] + [Y - \mathrm{E}(Y)])^{\mathrm{T}} \right] \\
&= \mathrm{E}\left[ (X - \mathrm{E}[X])(X - \mathrm{E}[X])^{\mathrm{T}} + (X - \mathrm{E}[X])(Y - \mathrm{E}[Y])^{\mathrm{T}} + (Y - \mathrm{E}[Y])(X - \mathrm{E}[X])^{\mathrm{T}} + (Y - \mathrm{E}[Y])(Y - \mathrm{E}[Y])^{\mathrm{T}} \right] \\
&= \mathrm{E}\left[ (X - \mathrm{E}[X])(X - \mathrm{E}[X])^{\mathrm{T}} \right] + \mathrm{E}\left[ (X - \mathrm{E}[X])(Y - \mathrm{E}[Y])^{\mathrm{T}} \right] + \mathrm{E}\left[ (Y - \mathrm{E}[Y])(X - \mathrm{E}[X])^{\mathrm{T}} \right] + \mathrm{E}\left[ (Y - \mathrm{E}[Y])(Y - \mathrm{E}[Y])^{\mathrm{T}} \right] \\
&\overset{(2)}{=} \Sigma_{XX} + \Sigma_{YY} + \mathrm{E}\left[ (X - \mathrm{E}[X])(Y - \mathrm{E}[Y])^{\mathrm{T}} \right] + \mathrm{E}\left[ (Y - \mathrm{E}[Y])(X - \mathrm{E}[X])^{\mathrm{T}} \right] \\
&\overset{(3)}{=} \Sigma_{XX} + \Sigma_{YY} + \Sigma_{XY} + \Sigma_{YX} .
\end{split} \tag{4} \]
Sources:
• Wikipedia (2022): “Covariance matrix”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
09-26; URL: https://en.wikipedia.org/wiki/Covariance_matrix#Basic_properties.
Metadata: ID: P349 | shortcut: covmat-sum | author: JoramSoch | date: 2022-09-26, 10:37.
Writing out the matrix product $D_X \, C_{XX} \, D_X$ entry-wise, the standard deviations cancel:

\[ (D_X \, C_{XX} \, D_X)_{ij} = \sigma_{X_i} \cdot \frac{\mathrm{E}\left[ (X_i - \mathrm{E}[X_i])(X_j - \mathrm{E}[X_j]) \right]}{\sigma_{X_i} \sigma_{X_j}} \cdot \sigma_{X_j} = \mathrm{E}\left[ (X_i - \mathrm{E}[X_i])(X_j - \mathrm{E}[X_j]) \right] , \]

such that

\[ D_X \, C_{XX} \, D_X = \begin{bmatrix} \mathrm{E}\left[ (X_1 - \mathrm{E}[X_1])(X_1 - \mathrm{E}[X_1]) \right] & \cdots & \mathrm{E}\left[ (X_1 - \mathrm{E}[X_1])(X_n - \mathrm{E}[X_n]) \right] \\ \vdots & \ddots & \vdots \\ \mathrm{E}\left[ (X_n - \mathrm{E}[X_n])(X_1 - \mathrm{E}[X_1]) \right] & \cdots & \mathrm{E}\left[ (X_n - \mathrm{E}[X_n])(X_n - \mathrm{E}[X_n]) \right] \end{bmatrix} \tag{4} \]

which is nothing else than the definition of the covariance matrix (→ Definition I/1.9.9).
Sources:
• Penny, William (2006): “The correlation matrix”; in: Mathematics for Brain Imaging, ch. 1.4.5,
p. 28, eq. 1.60; URL: https://ueapsylabs.co.uk/sites/wpenny/mbi/mbi_course.pdf.
Metadata: ID: P121 | shortcut: covmat-corrmat | author: JoramSoch | date: 2020-06-06, 06:02.
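The decomposition Σ_XX = D_X C_XX D_X can be checked with NumPy's sample covariance and correlation matrices (simulated data, illustrative):

```python
import numpy as np

# Check Sigma_XX = D_X C_XX D_X, where D_X = diag(sigma_1, ..., sigma_n).
rng = np.random.default_rng(5)
X = rng.normal(size=(3, 1_000))            # rows are variables, columns samples

sigma = np.cov(X)                          # sample covariance matrix
corr = np.corrcoef(X)                      # sample correlation matrix
D = np.diag(np.sqrt(np.diag(sigma)))       # standard deviations on the diagonal
assert np.allclose(sigma, D @ corr @ D)
```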
Sources:
• Wikipedia (2020): “Precision (statistics)”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
06-06; URL: https://en.wikipedia.org/wiki/Precision_(statistics).
Metadata: ID: D74 | shortcut: precmat | author: JoramSoch | date: 2020-06-06, 05:08.
\[ \Lambda_{XX} = D_X^{-1} \cdot C_{XX}^{-1} \cdot D_X^{-1} , \tag{1} \]

where $D_X^{-1}$ is a diagonal matrix with the inverse standard deviations (→ Definition I/1.12.1) of X₁, …, Xₙ as entries on the diagonal:

\[ D_X^{-1} = \mathrm{diag}(1/\sigma_{X_1}, \ldots, 1/\sigma_{X_n}) = \begin{bmatrix} \frac{1}{\sigma_{X_1}} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{1}{\sigma_{X_n}} \end{bmatrix} . \tag{2} \]
Proof: The precision matrix (→ Definition I/1.9.19) is defined as the inverse of the covariance matrix (→ Definition I/1.9.9)

\[ \Lambda_{XX} = \Sigma_{XX}^{-1} \tag{3} \]

and the relation between covariance matrix and correlation matrix (→ Proof I/1.9.18) is given by

\[ \Sigma_{XX} = D_X \cdot C_{XX} \cdot D_X . \tag{4} \]

Combining these, we obtain

\[ \begin{split}
\Lambda_{XX} &\overset{(3)}{=} \Sigma_{XX}^{-1} \\
&\overset{(4)}{=} (D_X \cdot C_{XX} \cdot D_X)^{-1} \\
&= D_X^{-1} \cdot C_{XX}^{-1} \cdot D_X^{-1} \\
&= \begin{bmatrix} \frac{1}{\sigma_{X_1}} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{1}{\sigma_{X_n}} \end{bmatrix} \cdot C_{XX}^{-1} \cdot \begin{bmatrix} \frac{1}{\sigma_{X_1}} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{1}{\sigma_{X_n}} \end{bmatrix} .
\end{split} \tag{8} \]
Sources:
• original work
Metadata: ID: P122 | shortcut: precmat-corrmat | author: JoramSoch | date: 2020-06-06, 06:28.
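The analogous relation for the precision matrix can be verified numerically (simulated data; this assumes the sample covariance matrix is invertible, which holds here since there are far more samples than variables):

```python
import numpy as np

# Check Lambda_XX = D^-1 C^-1 D^-1, with D = diag(sigma_1, ..., sigma_n).
rng = np.random.default_rng(6)
X = rng.normal(size=(3, 1_000))

sigma = np.cov(X)
corr = np.corrcoef(X)
D_inv = np.diag(1.0 / np.sqrt(np.diag(sigma)))
prec = np.linalg.inv(sigma)                # precision matrix
assert np.allclose(prec, D_inv @ np.linalg.inv(corr) @ D_inv)
```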
1.10 Correlation
1.10.1 Definition
Definition: The correlation of two random variables (→ Definition I/1.2.2) X and Y , also called
Pearson product-moment correlation coefficient (PPMCC), is defined as the ratio of the covariance
(→ Definition I/1.9.1) of X and Y relative to the product of their standard deviations (→ Definition
I/1.12.1):
\[ \mathrm{Corr}(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \, \sigma_Y} . \tag{1} \]
Sources:
• Wikipedia (2020): “Correlation and dependence”; in: Wikipedia, the free encyclopedia, retrieved on
2020-02-06; URL: https://en.wikipedia.org/wiki/Correlation_and_dependence#Pearson’s_product-mom
coefficient.
Metadata: ID: D71 | shortcut: corr | author: JoramSoch | date: 2020-06-02, 20:34.
1.10.2 Range
Theorem: Let X and Y be two random variables (→ Definition I/1.2.2). Then, the correlation of
X and Y is between and including −1 and +1:
−1 ≤ Corr(X, Y ) ≤ +1 . (1)
Proof: Consider the variance (→ Definition I/1.8.1) of X plus or minus Y , each divided by their
standard deviations (→ Definition I/1.12.1):
\[ \mathrm{Var}\left( \frac{X}{\sigma_X} \pm \frac{Y}{\sigma_Y} \right) . \tag{2} \]

Because the variance is non-negative (→ Proof I/1.8.4), this term is larger than or equal to zero:

\[ 0 \leq \mathrm{Var}\left( \frac{X}{\sigma_X} \pm \frac{Y}{\sigma_Y} \right) . \tag{3} \]

Using the variance of a linear combination (→ Proof I/1.8.9), it can also be written as:

\[ \begin{split}
\mathrm{Var}\left( \frac{X}{\sigma_X} \pm \frac{Y}{\sigma_Y} \right) &= \mathrm{Var}\left( \frac{X}{\sigma_X} \right) + \mathrm{Var}\left( \frac{Y}{\sigma_Y} \right) \pm 2 \, \mathrm{Cov}\left( \frac{X}{\sigma_X}, \frac{Y}{\sigma_Y} \right) \\
&= \frac{1}{\sigma_X^2} \mathrm{Var}(X) + \frac{1}{\sigma_Y^2} \mathrm{Var}(Y) \pm \frac{2}{\sigma_X \sigma_Y} \mathrm{Cov}(X,Y) \\
&= \frac{\sigma_X^2}{\sigma_X^2} + \frac{\sigma_Y^2}{\sigma_Y^2} \pm \frac{2 \, \sigma_{XY}}{\sigma_X \sigma_Y} .
\end{split} \tag{4} \]

Using the relationship between covariance and correlation (→ Proof I/1.9.7), we have:

\[ \mathrm{Var}\left( \frac{X}{\sigma_X} \pm \frac{Y}{\sigma_Y} \right) = 1 + 1 \pm 2 \, \mathrm{Corr}(X,Y) = 2 \pm 2 \, \mathrm{Corr}(X,Y) . \tag{5} \]

Thus, the combination of (3) with (5) yields

\[ 0 \leq 2 \pm 2 \, \mathrm{Corr}(X,Y) \tag{6} \]

which is equivalent to

\[ -1 \leq \mathrm{Corr}(X,Y) \leq +1 . \tag{7} \]
Sources:
• Dor Leventer (2021): “How can I simply prove that the pearson correlation coefficient is be-
tween -1 and 1?”; in: StackExchange Mathematics, retrieved on 2021-12-14; URL: https://math.
stackexchange.com/a/4260655/480910.
Metadata: ID: P300 | shortcut: corr-range | author: JoramSoch | date: 2021-12-14, 02:08.
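The range result can be probed empirically: sample correlation coefficients computed by NumPy always land in [−1, +1] (random data, illustrative only):

```python
import numpy as np

# The correlation coefficient always lies in [-1, +1]; check on random data.
rng = np.random.default_rng(7)
for _ in range(100):
    x = rng.normal(size=50)
    y = rng.normal(size=50) + rng.uniform(-2, 2) * x
    r = np.corrcoef(x, y)[0, 1]            # sample Pearson correlation
    assert -1.0 <= r <= 1.0
```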
Sources:
• Wikipedia (2021): “Pearson correlation coefficient”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-12-14; URL: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#For_a_sample.
Metadata: ID: D168 | shortcut: corr-samp | author: JoramSoch | date: 2021-12-14, 07:23.
\[ r_{xy} = \frac{1}{(n-1) \, s_x \, s_y} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) . \tag{4} \]

Further simplifying, the result is:

\[ r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) . \tag{5} \]
Sources:
• Wikipedia (2021): “Pearson correlation coefficient”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-12-14; URL: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#For_a_sample.
Metadata: ID: P299 | shortcut: corr-z | author: JoramSoch | date: 2021-12-14, 02:31.
\[ C_{XX} = \begin{bmatrix} \mathrm{Corr}(X_1,X_1) & \cdots & \mathrm{Corr}(X_1,X_n) \\ \vdots & \ddots & \vdots \\ \mathrm{Corr}(X_n,X_1) & \cdots & \mathrm{Corr}(X_n,X_n) \end{bmatrix} = \begin{bmatrix} \frac{\mathrm{E}[(X_1-\mathrm{E}[X_1])(X_1-\mathrm{E}[X_1])]}{\sigma_{X_1}\sigma_{X_1}} & \cdots & \frac{\mathrm{E}[(X_1-\mathrm{E}[X_1])(X_n-\mathrm{E}[X_n])]}{\sigma_{X_1}\sigma_{X_n}} \\ \vdots & \ddots & \vdots \\ \frac{\mathrm{E}[(X_n-\mathrm{E}[X_n])(X_1-\mathrm{E}[X_1])]}{\sigma_{X_n}\sigma_{X_1}} & \cdots & \frac{\mathrm{E}[(X_n-\mathrm{E}[X_n])(X_n-\mathrm{E}[X_n])]}{\sigma_{X_n}\sigma_{X_n}} \end{bmatrix} \tag{1} \]
Sources:
• Wikipedia (2020): “Correlation and dependence”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-06-06; URL: https://en.wikipedia.org/wiki/Correlation_and_dependence#Correlation_
matrices.
Metadata: ID: D73 | shortcut: corrmat | author: JoramSoch | date: 2020-06-06, 04:56.
in which $\bar{x}^{(j)}$ and $\bar{x}^{(k)}$ are the sample means (→ Definition I/1.7.2)

\[ \bar{x}^{(j)} = \frac{1}{n} \sum_{i=1}^{n} x_{ij} \quad \text{and} \quad \bar{x}^{(k)} = \frac{1}{n} \sum_{i=1}^{n} x_{ik} . \tag{3} \]
Sources:
• original work
Metadata: ID: D169 | shortcut: corrmat-samp | author: JoramSoch | date: 2021-12-14, 07:45.
i.e. the median is the “middle” number when all numbers are sorted from smallest to largest.
2) Let X be a continuous random variable (→ Definition I/1.2.2) with cumulative distribution function (→ Definition I/1.6.13) $F_X(x)$. Then, the median of X is

\[ \mathrm{median}(X) = x, \ \text{s.t.} \ F_X(x) = \frac{1}{2} , \tag{2} \]

i.e. the median is the value at which the CDF is 1/2.
Sources:
• Wikipedia (2020): “Median”; in: Wikipedia, the free encyclopedia, retrieved on 2020-10-15; URL:
https://en.wikipedia.org/wiki/Median.
Metadata: ID: D101 | shortcut: med | author: JoramSoch | date: 2020-10-15, 10:53.
1.11.2 Mode
Definition: The mode of a sample or random variable is the value which occurs most often or with
largest probability among all its values.
2) Let X be a random variable (→ Definition I/1.2.2) with probability mass function (→ Definition I/1.6.1) or probability density function (→ Definition I/1.6.6) $f_X(x)$. Then, the mode of X is the value which maximizes the PMF or PDF:

\[ \mathrm{mode}(X) = \operatorname*{arg\,max}_x f_X(x) . \tag{2} \]
Sources:
• Wikipedia (2020): “Mode (statistics)”; in: Wikipedia, the free encyclopedia, retrieved on 2020-10-
15; URL: https://en.wikipedia.org/wiki/Mode_(statistics).
Metadata: ID: D102 | shortcut: mode | author: JoramSoch | date: 2020-10-15, 11:10.
Sources:
• Wikipedia (2020): “Standard deviation”; in: Wikipedia, the free encyclopedia, retrieved on 2020-09-
03; URL: https://en.wikipedia.org/wiki/Standard_deviation#Definition_of_population_values.
Metadata: ID: D94 | shortcut: std | author: JoramSoch | date: 2020-09-03, 05:43.
\[ \mathrm{FWHM}(X) = \Delta x = x_2 - x_1 \tag{1} \]

where $x_1$ and $x_2$ are specified, such that

\[ f_X(x_1) = f_X(x_2) = \frac{1}{2} f_X(x_M) \quad \text{and} \quad x_1 < x_M < x_2 . \tag{2} \]
Sources:
• Wikipedia (2020): “Full width at half maximum”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-08-19; URL: https://en.wikipedia.org/wiki/Full_width_at_half_maximum.
Metadata: ID: D91 | shortcut: fwhm | author: JoramSoch | date: 2020-08-19, 05:40.
2) Let X be a random variable (→ Definition I/1.2.2) with possible values $\mathcal{X}$. Then, the minimum of X is

\[ \min(X) = \min_{x \in \mathcal{X}} x . \tag{2} \]
Sources:
• Wikipedia (2020): “Sample maximum and minimum”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-11-12; URL: https://en.wikipedia.org/wiki/Sample_maximum_and_minimum.
Metadata: ID: D107 | shortcut: min | author: JoramSoch | date: 2020-11-12, 05:25.
1.13.2 Maximum
Definition: The maximum of a sample or random variable is its highest observed or possible value.
2) Let X be a random variable (→ Definition I/1.2.2) with possible values $\mathcal{X}$. Then, the maximum of X is

\[ \max(X) = \max_{x \in \mathcal{X}} x . \tag{2} \]
Sources:
• Wikipedia (2020): “Sample maximum and minimum”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-11-12; URL: https://en.wikipedia.org/wiki/Sample_maximum_and_minimum.
Metadata: ID: D108 | shortcut: max | author: JoramSoch | date: 2020-11-12, 05:33.
Sources:
• Wikipedia (2020): “Moment (mathematics)”; in: Wikipedia, the free encyclopedia, retrieved on
2020-08-19; URL: https://en.wikipedia.org/wiki/Moment_(mathematics)#Significance_of_the_
moments.
Metadata: ID: D90 | shortcut: mom | author: JoramSoch | date: 2020-08-19, 05:24.
Proof: Using the definition of the moment-generating function (→ Definition I/1.6.27), we can write:
MX^(n)(t) = d^n/dt^n E(e^{tX}) . (2)
Using the power series expansion of the exponential function
e^x = Σ_{n=0}^∞ x^n/n! , (3)
equation (2) becomes
MX^(n)(t) = d^n/dt^n E( Σ_{m=0}^∞ t^m X^m / m! ) . (4)
Because the expected value is a linear operator (→ Proof I/1.7.5), we have:
MX^(n)(t) = d^n/dt^n Σ_{m=0}^∞ t^m/m! · E(X^m)
= Σ_{m=0}^∞ d^n/dt^n t^m/m! · E(X^m) . (5)
Since the n-th derivative of t^m with respect to t is zero for m < n and equals m_(n) t^(m−n) for m ≥ n, where the falling factorial m_(n) is given by
m_(n) = ∏_{i=0}^{n−1} (m − i) = m!/(m − n)! , (7)
equation (5) becomes

MX^(n)(t) = Σ_{m=n}^∞ m_(n) t^(m−n)/m! · E(X^m)
(7) = Σ_{m=n}^∞ m!/(m−n)! · t^(m−n)/m! · E(X^m)
= Σ_{m=n}^∞ t^(m−n)/(m−n)! · E(X^m)
= t^(n−n)/(n−n)! · E(X^n) + Σ_{m=n+1}^∞ t^(m−n)/(m−n)! · E(X^m)
= t^0/0! · E(X^n) + Σ_{m=n+1}^∞ t^(m−n)/(m−n)! · E(X^m)
= E(X^n) + Σ_{m=n+1}^∞ t^(m−n)/(m−n)! · E(X^m) . (8)
Setting t = 0, all terms of the remaining sum vanish and we obtain:

MX^(n)(0) = E(X^n) + Σ_{m=n+1}^∞ 0^(m−n)/(m−n)! · E(X^m)
= E(X^n) (9)
which conforms to equation (1).
Sources:
• ProofWiki (2020): “Moment in terms of Moment Generating Function”; in: ProofWiki, retrieved
on 2020-08-19; URL: https://proofwiki.org/wiki/Moment_in_terms_of_Moment_Generating_
Function.
Metadata: ID: P153 | shortcut: mom-mgf | author: JoramSoch | date: 2020-08-19, 07:51.
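The theorem can be checked numerically; this example is ours, not the book's. For a fair six-sided die, the second derivative of the moment-generating function at t = 0 should equal the raw moment E(X²) = 91/6:

```python
import math

# Fair six-sided die X with values 1..6; its MGF is M(t) = (1/6) * sum exp(t*k).
def mgf(t):
    return sum(math.exp(t * k) for k in range(1, 7)) / 6.0

# A second-order central difference approximates the 2nd derivative of M at t = 0,
# which by the theorem equals the raw moment E(X^2) = 91/6.
h = 1e-3
d2 = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h ** 2
exact = sum(k ** 2 for k in range(1, 7)) / 6.0  # 91/6 ≈ 15.1667
```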
Sources:
• Wikipedia (2020): “Moment (mathematics)”; in: Wikipedia, the free encyclopedia, retrieved on
2020-10-08; URL: https://en.wikipedia.org/wiki/Moment_(mathematics)#Significance_of_the_
moments.
Metadata: ID: D97 | shortcut: mom-raw | author: JoramSoch | date: 2020-10-08, 03:31.
µ′1 = µ . (1)
Proof: The first raw moment (→ Definition I/1.14.3) of a random variable (→ Definition I/1.2.2)
X is defined as
µ′1 = E[(X − 0)^1] (2)
which is equal to the expected value (→ Definition I/1.7.1) of X:
µ′1 = E(X) = µ . (3)
Sources:
• original work
Metadata: ID: P171 | shortcut: momraw-1st | author: JoramSoch | date: 2020-10-08, 04:19.
Proof: The second raw moment (→ Definition I/1.14.3) of a random variable (→ Definition I/1.2.2)
X is defined as
µ′2 = E[(X − 0)^2] . (2)
Using the partition of variance into expected values (→ Proof I/1.8.3), i.e. Var(X) = E(X²) − E(X)², we have:
µ′2 = E(X²) = Var(X) + E(X)² = σ² + µ² . (3)
Sources:
• original work
Metadata: ID: P172 | shortcut: momraw-2nd | author: JoramSoch | date: 2020-10-08, 05:05.
Sources:
• Wikipedia (2020): “Moment (mathematics)”; in: Wikipedia, the free encyclopedia, retrieved on
2020-10-08; URL: https://en.wikipedia.org/wiki/Moment_(mathematics)#Significance_of_the_
moments.
Metadata: ID: D98 | shortcut: mom-cent | author: JoramSoch | date: 2020-10-08, 03:37.
µ1 = 0 . (1)
Proof: The first central moment (→ Definition I/1.14.6) of a random variable (→ Definition I/1.2.2)
X with mean (→ Definition I/1.7.1) µ is defined as
µ1 = E[(X − µ)^1] . (2)
Due to the linearity of the expected value (→ Proof I/1.7.5) and by plugging in µ = E(X), we have
µ1 = E[X − µ]
= E(X) − µ
= E(X) − E(X) (3)
= 0 .
Sources:
• ProofWiki (2020): “First Central Moment is Zero”; in: ProofWiki, retrieved on 2020-09-09; URL:
https://proofwiki.org/wiki/First_Central_Moment_is_Zero.
Metadata: ID: P167 | shortcut: momcent-1st | author: JoramSoch | date: 2020-09-09, 07:51.
µ2 = Var(X) . (1)
Proof: The second central moment (→ Definition I/1.14.6) of a random variable (→ Definition
I/1.2.2) X with mean (→ Definition I/1.7.1) µ is defined as
µ2 = E[(X − µ)²] (2)
which is equivalent to the definition of the variance (→ Definition I/1.8.1):
µ2 = E[(X − E(X))²] = Var(X) . (3)
Sources:
• Wikipedia (2020): “Moment (mathematics)”; in: Wikipedia, the free encyclopedia, retrieved on
2020-10-08; URL: https://en.wikipedia.org/wiki/Moment_(mathematics)#Significance_of_the_
moments.
Metadata: ID: P173 | shortcut: momcent-2nd | author: JoramSoch | date: 2020-10-08, 05:13.
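Both results on central moments can be illustrated empirically; this sketch is ours, not the book's, and uses a small arbitrary sample as the distribution:

```python
# Empirical check on a small sample (uniform weights): the first central moment
# is zero and the second central moment equals the population variance.
xs = [2.0, 3.0, 5.0, 7.0, 11.0]
mu = sum(xs) / len(xs)
mu1 = sum((x - mu) ** 1 for x in xs) / len(xs)      # first central moment
mu2 = sum((x - mu) ** 2 for x in xs) / len(xs)      # second central moment
var = sum(x ** 2 for x in xs) / len(xs) - mu ** 2   # Var(X) = E(X^2) - E(X)^2
```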
µ*_n = µn/σ^n = E[(X − µ)^n]/σ^n . (1)
Sources:
• Wikipedia (2020): “Moment (mathematics)”; in: Wikipedia, the free encyclopedia, retrieved on
2020-10-08; URL: https://en.wikipedia.org/wiki/Moment_(mathematics)#Standardized_moments.
Metadata: ID: D99 | shortcut: mom-stand | author: JoramSoch | date: 2020-10-08, 03:47.
2 Information theory
2.1 Shannon entropy
2.1.1 Definition
Definition: Let X be a discrete random variable (→ Definition I/1.2.2) with possible outcomes X
and the (observed or assumed) probability mass function (→ Definition I/1.6.1) p(x) = fX (x). Then,
the entropy (also referred to as “Shannon entropy”) of X is defined as
H(X) = − Σ_{x∈X} p(x) · log_b p(x) (1)
where b is the base of the logarithm specifying in which unit the entropy is determined.
Sources:
• Shannon CE (1948): “A Mathematical Theory of Communication”; in: Bell System Technical
Journal, vol. 27, iss. 3, pp. 379-423; URL: https://ieeexplore.ieee.org/document/6773024; DOI:
10.1002/j.1538-7305.1948.tb01338.x.
Metadata: ID: D15 | shortcut: ent | author: JoramSoch | date: 2020-02-19, 17:36.
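Definition (1) is directly computable. The following sketch (ours, not the book's) evaluates the entropy in bits (b = 2) for a few standard cases:

```python
import math

def entropy(p, b=2):
    """Shannon entropy H(X) = -sum p(x) log_b p(x), with 0*log(0) := 0."""
    return -sum(px * math.log(px, b) for px in p if px > 0)

h_coin = entropy([0.5, 0.5])   # fair coin: 1 bit
h_det = entropy([1.0, 0.0])    # deterministic outcome: 0 bits
h_die = entropy([1/6] * 6)     # fair die: log2(6) bits
```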
2.1.2 Non-negativity
Theorem: The entropy of a discrete random variable (→ Definition I/1.2.2) is a non-negative num-
ber:
H(X) ≥ 0 . (1)
Proof: The entropy (→ Definition I/2.1.1) of X is defined as
H(X) = − Σ_{x∈X} p(x) · log_b p(x) (2)
which can be rewritten as
H(X) = Σ_{x∈X} p(x) · (− log_b p(x)) . (3)
Because the co-domain of probability mass functions (→ Definition I/1.6.1) is [0, 1], we can deduce:
0 ≤ p(x) ≤ 1
−∞ ≤ log_b p(x) ≤ 0
0 ≤ − log_b p(x) ≤ +∞ (4)
0 ≤ p(x) · (− log_b p(x)) ≤ +∞ .
By convention, 0 · log_b(0) is taken to be 0 when calculating entropy, consistent with the limit lim_{p→0+} p · log_b p = 0.
Taking this together, each addend in (3) is positive or zero and thus, the entire sum must also be
non-negative.
Sources:
• Cover TM, Thomas JA (1991): “Elements of Information Theory”, p. 15; URL: https://www.
wiley.com/en-us/Elements+of+Information+Theory%2C+2nd+Edition-p-9780471241959.
Metadata: ID: P57 | shortcut: ent-nonneg | author: JoramSoch | date: 2020-02-19, 19:10.
2.1.3 Concavity
Theorem: The entropy (→ Definition I/2.1.1) is concave in the probability mass function (→ Definition I/1.6.1) p, i.e.
H[λ p1 + (1 − λ) p2] ≥ λ H[p1] + (1 − λ) H[p2] (1)
where p1 and p2 are probability mass functions (→ Definition I/1.6.1) and 0 ≤ λ ≤ 1.
Proof: Let X be a discrete random variable (→ Definition I/1.2.2) with possible outcomes X and let
u(x) be the probability mass function (→ Definition I/1.6.1) of a discrete uniform distribution (→
Definition II/1.1.1) on X ∈ X . Then, the entropy (→ Definition I/2.1.1) of an arbitrary probability
mass function (→ Definition I/1.6.1) p(x) can be rewritten as
H[p] = − Σ_{x∈X} p(x) · log p(x)
= − Σ_{x∈X} p(x) · log( p(x)/u(x) · u(x) )
= − Σ_{x∈X} p(x) · log( p(x)/u(x) ) − Σ_{x∈X} p(x) · log u(x) (2)
= − KL[p||u] − log(1/|X|) · Σ_{x∈X} p(x)
= log |X| − KL[p||u] ,
such that
log |X| − H[p] = KL[p||u] ,
where we have applied the definition of the Kullback-Leibler divergence (→ Definition I/2.5.1), the
probability mass function of the discrete uniform distribution (→ Proof II/1.1.2) and the total sum
over the probability mass function (→ Definition I/1.6.1).
Note that the KL divergence is convex (→ Proof I/2.5.5) in the pair of probability distributions (→ Definition I/1.5.1) (p, q). Thus, log |X| − H[p] is convex in p, which implies that the entropy H[p] is concave in p.
Sources:
• Wikipedia (2020): “Entropy (information theory)”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-08-11; URL: https://en.wikipedia.org/wiki/Entropy_(information_theory)#Further_properties.
• Cover TM, Thomas JA (1991): “Elements of Information Theory”, p. 30; URL: https://www.
wiley.com/en-us/Elements+of+Information+Theory%2C+2nd+Edition-p-9780471241959.
• Xie, Yao (2012): “Chain Rules and Inequalities”; in: ECE587: Information Theory, Lecture 3,
Slide 25; URL: https://www2.isye.gatech.edu/~yxie77/ece587/Lecture3.pdf.
• Goh, Siong Thye (2016): “Understanding the proof of the concavity of entropy”; in: StackEx-
change Mathematics, retrieved on 2020-11-08; URL: https://math.stackexchange.com/questions/
2000194/understanding-the-proof-of-the-concavity-of-entropy.
Metadata: ID: P149 | shortcut: ent-conc | author: JoramSoch | date: 2020-08-11, 08:29.
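The concavity inequality can be spot-checked numerically; this sketch (ours, not the book's) uses two arbitrary probability mass functions over three outcomes:

```python
import math

def entropy(p):
    # Shannon entropy in nats, with 0*log(0) := 0
    return -sum(px * math.log(px) for px in p if px > 0)

p1 = [0.9, 0.05, 0.05]
p2 = [0.2, 0.3, 0.5]

# concavity: H(lam*p1 + (1-lam)*p2) >= lam*H(p1) + (1-lam)*H(p2)
lam = 0.3
mix = [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]
lhs = entropy(mix)
rhs = lam * entropy(p1) + (1 - lam) * entropy(p2)
```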
Sources:
• Cover TM, Thomas JA (1991): “Joint Entropy and Conditional Entropy”; in: Elements of Infor-
mation Theory, ch. 2.2, p. 15; URL: https://www.wiley.com/en-us/Elements+of+Information+
Theory%2C+2nd+Edition-p-9780471241959.
Metadata: ID: D17 | shortcut: ent-cond | author: JoramSoch | date: 2020-02-19, 18:08.
where b is the base of the logarithm specifying in which unit the entropy is determined.
Sources:
• Cover TM, Thomas JA (1991): “Joint Entropy and Conditional Entropy”; in: Elements of Infor-
mation Theory, ch. 2.2, p. 16; URL: https://www.wiley.com/en-us/Elements+of+Information+
Theory%2C+2nd+Edition-p-9780471241959.
Metadata: ID: D18 | shortcut: ent-joint | author: JoramSoch | date: 2020-02-19, 18:18.
2.1.6 Cross-entropy
Definition: Let X be a discrete random variable (→ Definition I/1.2.2) with possible outcomes X
and let P and Q be two probability distributions (→ Definition I/1.5.1) on X with the probability
mass functions (→ Definition I/1.6.1) p(x) and q(x). Then, the cross-entropy of Q relative to P is
defined as
X
H(P, Q) = − p(x) · logb q(x) (1)
x∈X
where b is the base of the logarithm specifying in which unit the cross-entropy is determined.
Sources:
• Wikipedia (2020): “Cross entropy”; in: Wikipedia, the free encyclopedia, retrieved on 2020-07-28;
URL: https://en.wikipedia.org/wiki/Cross_entropy#Definition.
Metadata: ID: D85 | shortcut: ent-cross | author: JoramSoch | date: 2020-07-28, 02:51.
Proof: The relationship between Kullback-Leibler divergence, entropy and cross-entropy (→ Proof
I/2.5.8) is:
Sources:
• Wikipedia (2020): “Cross entropy”; in: Wikipedia, the free encyclopedia, retrieved on 2020-08-11;
URL: https://en.wikipedia.org/wiki/Cross_entropy#Definition.
• gunes (2019): “Convexity of cross entropy”; in: StackExchange CrossValidated, retrieved on 2020-
11-08; URL: https://stats.stackexchange.com/questions/394463/convexity-of-cross-entropy.
Metadata: ID: P150 | shortcut: entcross-conv | author: JoramSoch | date: 2020-08-11, 09:16.
Proof: Without loss of generality, we will use the natural logarithm, because a change in the base
of the logarithm only implies multiplication by a constant:
log_b a = ln a / ln b . (2)
Let I be the set of all x for which p(x) is non-zero. Then, proving (1) requires to show that
Σ_{x∈I} p(x) ln( p(x)/q(x) ) ≥ 0 . (3)
For all x > 0, it holds that ln x ≤ x − 1, with equality only if x = 1. Multiplying this with −1, we
have ln(1/x) ≥ 1 − x. Applying this to (3), we can say about the left-hand side that
Σ_{x∈I} p(x) ln( p(x)/q(x) ) ≥ Σ_{x∈I} p(x) ( 1 − q(x)/p(x) )
= Σ_{x∈I} p(x) − Σ_{x∈I} q(x) . (4)
Finally, since p(x) and q(x) are probability mass functions (→ Definition I/1.6.1), we have
0 ≤ p(x) ≤ 1 , Σ_{x∈I} p(x) = 1 and
0 ≤ q(x) ≤ 1 , Σ_{x∈I} q(x) ≤ 1 , (5)
such that, continuing from (4),
Σ_{x∈I} p(x) ln( p(x)/q(x) ) ≥ Σ_{x∈I} p(x) − Σ_{x∈I} q(x)
= 1 − Σ_{x∈I} q(x) ≥ 0 . (6)
Sources:
• Wikipedia (2020): “Gibbs’ inequality”; in: Wikipedia, the free encyclopedia, retrieved on 2020-09-
09; URL: https://en.wikipedia.org/wiki/Gibbs%27_inequality#Proof.
Metadata: ID: P164 | shortcut: gibbs-ineq | author: JoramSoch | date: 2020-09-09, 02:18.
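Gibbs' inequality can be verified for a concrete pair of distributions; this example (ours, not the book's) uses two arbitrary probability mass functions:

```python
import math

p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]

# Gibbs' inequality: -sum p log p <= -sum p log q, with equality only if p == q
h_p = -sum(pi * math.log(pi) for pi in p)                    # entropy of p
h_pq = -sum(pi * math.log(qi) for pi, qi in zip(p, q))       # cross-entropy of q relative to p
h_pp = -sum(pi * math.log(pi) for pi in p)                   # equality case: q = p
```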
Proof: Without loss of generality, we will use the natural logarithm, because a change in the base
of the logarithm only implies multiplication by a constant:
log_c a = ln a / ln c . (2)
Let f (x) = x ln x. Then, the left-hand side of (1) can be rewritten as
Σ_{i=1}^n a_i ln( a_i/b_i ) = Σ_{i=1}^n b_i f( a_i/b_i )
= b Σ_{i=1}^n (b_i/b) f( a_i/b_i ) . (3)
Since f(x) = x ln x is a convex function and
b_i/b ≥ 0 and Σ_{i=1}^n b_i/b = 1 , (4)
Jensen's inequality implies that

b Σ_{i=1}^n (b_i/b) f( a_i/b_i ) ≥ b f( Σ_{i=1}^n (b_i/b) · (a_i/b_i) )
= b f( (1/b) Σ_{i=1}^n a_i ) (5)
= b f( a/b )
= a ln( a/b ) .
Finally, combining (3) and (5), this demonstrates (1).
Sources:
• Wikipedia (2020): “Log sum inequality”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
09-09; URL: https://en.wikipedia.org/wiki/Log_sum_inequality#Proof.
• Wikipedia (2020): “Jensen’s inequality”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
09-09; URL: https://en.wikipedia.org/wiki/Jensen%27s_inequality#Statements.
Metadata: ID: P165 | shortcut: logsum-ineq | author: JoramSoch | date: 2020-09-09, 02:46.
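The log sum inequality can be checked on arbitrary non-negative numbers; this sketch is ours, not the book's:

```python
import math

a = [1.0, 2.0, 3.0]
b = [2.0, 1.0, 4.0]
A, B = sum(a), sum(b)  # a = sum a_i, b = sum b_i

# log sum inequality: sum a_i ln(a_i/b_i) >= (sum a_i) ln(sum a_i / sum b_i)
lhs = sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))
rhs = A * math.log(A / B)
```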
Sources:
• Cover TM, Thomas JA (1991): “Differential Entropy”; in: Elements of Information Theory, ch.
8.1, p. 243; URL: https://www.wiley.com/en-us/Elements+of+Information+Theory%2C+2nd+
Edition-p-9780471241959.
Metadata: ID: D16 | shortcut: dent | author: JoramSoch | date: 2020-02-19, 17:53.
2.2.2 Negativity
Theorem: Unlike its discrete analogue (→ Proof I/2.1.2), the differential entropy (→ Definition
I/2.2.1) can become negative.
Proof: Let X be a random variable (→ Definition I/1.2.2) following a continuous uniform distribution
(→ Definition II/3.1.1) with minimum 0 and maximum 1/2:
X ∼ U(0, 1/2) ,
such that the probability density function of X is fX(x) = 2 for all x ∈ [0, 1/2]. Then, the differential entropy (→ Definition I/2.2.1) of X is:

h(X) = − ∫_X fX(x) log_b fX(x) dx
= − ∫_0^(1/2) 2 log_b(2) dx
= − log_b(2) ∫_0^(1/2) 2 dx (3)
= − log_b(2) [2x]_0^(1/2)
= − log_b(2)

which is negative for any base b > 1.
Sources:
• Wikipedia (2020): “Differential entropy”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-02; URL: https://en.wikipedia.org/wiki/Differential_entropy#Definition.
Metadata: ID: P68 | shortcut: dent-neg | author: JoramSoch | date: 2020-03-02, 20:32.
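The counterexample above can be reproduced by quadrature; this sketch (ours, not the book's) integrates −f log f over [0, 1/2] with the midpoint rule and recovers −ln 2 < 0:

```python
import math

# Midpoint-rule quadrature of h(X) = -∫ f(x) ln f(x) dx for f(x) = 2 on [0, 1/2]
n = 1000
a, b = 0.0, 0.5
dx = (b - a) / n
h = -sum(2.0 * math.log(2.0) * dx for _ in range(n))
# analytically, h(X) = -ln 2 ≈ -0.6931, a negative differential entropy
```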
Y = g(X) = X + c ⇔ X = g −1 (Y ) = Y − c . (3)
Note that g(X) is a strictly increasing function, such that the probability density function (→ Proof
I/1.6.8) of Y is
fY(y) = fX(g⁻¹(y)) · dg⁻¹(y)/dy = fX(y − c) . (4)
Writing down the differential entropy (→ Definition I/2.2.1) of Y, we have:

h(Y) = − ∫_Y fY(y) log fY(y) dy
(4) = − ∫_Y fX(y − c) log fX(y − c) dy . (5)

Substituting x = y − c, we obtain:

h(Y) = − ∫_{y−c | y∈Y} fX(x + c − c) log fX(x + c − c) d(x + c)
= − ∫_X fX(x) log fX(x) dx (6)
(2) = h(X) .
Sources:
• Wikipedia (2020): “Differential entropy”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
02-12; URL: https://en.wikipedia.org/wiki/Differential_entropy#Properties_of_differential_entropy.
Metadata: ID: P199 | shortcut: dent-inv | author: JoramSoch | date: 2020-12-02, 16:11.
If a > 0, then g(X) is a strictly increasing function, such that the probability density function (→
Proof I/1.6.8) of Y is
fY(y) = fX(g⁻¹(y)) · dg⁻¹(y)/dy = (1/a) fX(y/a) ; (4)
if a < 0, then g(X) is a strictly decreasing function, such that the probability density function (→
Proof I/1.6.9) of Y is
fY(y) = −fX(g⁻¹(y)) · dg⁻¹(y)/dy = −(1/a) fX(y/a) ; (5)
thus, we can write
fY(y) = (1/|a|) fX(y/a) . (6)
Writing down the differential entropy for Y, we have:

h(Y) = − ∫_Y fY(y) log fY(y) dy
(6) = − ∫_Y (1/|a|) fX(y/a) log[ (1/|a|) fX(y/a) ] dy . (7)

Substituting y = ax, we obtain:

h(Y) = − ∫_{y/a | y∈Y} (1/|a|) fX(ax/a) log[ (1/|a|) fX(ax/a) ] d(ax)
= − ∫_X fX(x) log[ (1/|a|) fX(x) ] dx
= − ∫_X fX(x) [ log fX(x) − log |a| ] dx (8)
= − ∫_X fX(x) log fX(x) dx + log |a| ∫_X fX(x) dx
(2) = h(X) + log |a| .
Sources:
• Wikipedia (2020): “Differential entropy”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
02-12; URL: https://en.wikipedia.org/wiki/Differential_entropy#Properties_of_differential_entropy.
Metadata: ID: P200 | shortcut: dent-add | author: JoramSoch | date: 2020-12-02, 16:39.
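Both results, h(X + c) = h(X) and h(aX) = h(X) + log |a|, can be verified with the continuous uniform distribution, whose differential entropy has the closed form ln(width). This sketch is ours, not the book's:

```python
import math

# differential entropy of a continuous uniform distribution on an interval
# of the given width is ln(width) (in nats)
def h_uniform(width):
    return math.log(width)

w, c, a = 2.0, 5.0, -3.0
h_x = h_uniform(w)
h_shift = h_uniform(w)            # X + c is uniform on [c, c + w]: same width
h_scale = h_uniform(abs(a) * w)   # aX is uniform with width |a| * w
```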
where fX (x) is the probability density function (→ Definition I/1.6.6) of X and X is the set of
possible values of X.
The probability density function of a linear function of a continuous random vector (→ Proof I/1.6.11)
Y = g(X) = ΣX + µ is
fY(y) = (1/|Σ|) fX(Σ⁻¹(y − µ)) , if y ∈ Y
fY(y) = 0 , if y ∉ Y (3)
where Y = {y = Σx + µ : x ∈ X } is the set of possible outcomes of Y .
Therefore, with Y = g(X) = AX, i.e. Σ = A and µ = 0n , the probability density function (→
Definition I/1.6.6) of Y is given by
fY(y) = (1/|A|) fX(A⁻¹ y) , if y ∈ Y
fY(y) = 0 , if y ∉ Y (4)
where Y = {y = Ax : x ∈ X }.
Thus, the differential entropy (→ Definition I/2.2.1) of Y is
h(Y) (2) = − ∫_Y fY(y) log fY(y) dy
(4) = − ∫_Y (1/|A|) fX(A⁻¹ y) log[ (1/|A|) fX(A⁻¹ y) ] dy . (5)
Substituting y = Ax into the integral, we obtain
h(Y) = − ∫_X (1/|A|) fX(A⁻¹ A x) log[ (1/|A|) fX(A⁻¹ A x) ] d(Ax)
= − (1/|A|) ∫_X fX(x) log[ (1/|A|) fX(x) ] d(Ax) . (6)
Applying d(Ax) = |A| dx, this becomes:

h(Y) = − (|A|/|A|) ∫_X fX(x) log[ (1/|A|) fX(x) ] dx
= − ∫_X fX(x) log fX(x) dx − ∫_X fX(x) log(1/|A|) dx . (7)
Finally, employing the fact (→ Definition I/1.6.6) that ∫_X fX(x) dx = 1, we can derive the differential
entropy (→ Definition I/2.2.1) of Y as

h(Y) = − ∫_X fX(x) log fX(x) dx + log |A| ∫_X fX(x) dx
(2) = h(X) + log |A| . (8)
Sources:
• Wikipedia (2021): “Differential entropy”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
10-07; URL: https://en.wikipedia.org/wiki/Differential_entropy#Properties_of_differential_entropy.
Metadata: ID: P261 | shortcut: dent-addvec | author: JoramSoch | date: 2021-10-07, 09:10.
where Jg (x) is the Jacobian matrix of the vector-valued function g and X is the set of possible values
of X.
h(Y) (3) = − ∫_Y fY(y) log fY(y) dy
(4) = − ∫_Y fX(g⁻¹(y)) |J_g⁻¹(y)| log[ fX(g⁻¹(y)) |J_g⁻¹(y)| ] dy . (6)
Substituting y = g(x) into the integral and applying J_g⁻¹(y) = J_g(x)⁻¹, we obtain

h(Y) = − ∫_X fX(g⁻¹(g(x))) |J_g⁻¹(g(x))| log[ fX(g⁻¹(g(x))) |J_g⁻¹(g(x))| ] d[g(x)]
= − ∫_X fX(x) |J_g(x)⁻¹| log[ fX(x) |J_g(x)⁻¹| ] d[g(x)] . (7)
Using the relations y = f (x) ⇒ dy = |Jf (x)| dx and |A||B| = |AB|, this becomes
h(Y) = − ∫_X fX(x) |J_g(x)⁻¹| |J_g(x)| log[ fX(x) |J_g(x)⁻¹| ] dx
= − ∫_X fX(x) log fX(x) dx − ∫_X fX(x) log |J_g(x)⁻¹| dx . (8)

Finally, employing the fact (→ Definition I/1.6.6) that ∫_X fX(x) dx = 1 and the determinant property
|A⁻¹| = 1/|A|, we can derive the differential entropy (→ Definition I/2.2.1) of Y as

h(Y) = − ∫_X fX(x) log fX(x) dx − ∫_X fX(x) log( 1/|J_g(x)| ) dx
(3) = h(X) + ∫_X fX(x) log |J_g(x)| dx . (9)
X
Because there exist X and Y , such that the integral term in (9) is non-zero, this also demonstrates
that there exist X and Y , such that (1) is fulfilled.
Sources:
• Wikipedia (2021): “Differential entropy”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
10-07; URL: https://en.wikipedia.org/wiki/Differential_entropy#Properties_of_differential_entropy.
• Bernhard (2016): “proof of upper bound on differential entropy of f(X)”; in: StackExchange Math-
ematics, retrieved on 2021-10-07; URL: https://math.stackexchange.com/a/1759531.
• peek-a-boo (2019): “How to come up with the Jacobian in the change of variables formula”; in:
StackExchange Mathematics, retrieved on 2021-08-30; URL: https://math.stackexchange.com/a/
3239222.
• Wikipedia (2021): “Jacobian matrix and determinant”; in: Wikipedia, the free encyclopedia, re-
trieved on 2021-10-07; URL: https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant#
Inverse.
• Wikipedia (2021): “Inverse function theorem”; in: Wikipedia, the free encyclopedia, retrieved on
2021-10-07; URL: https://en.wikipedia.org/wiki/Inverse_function_theorem#Statement.
• Wikipedia (2021): “Determinant”; in: Wikipedia, the free encyclopedia, retrieved on 2021-10-07;
URL: https://en.wikipedia.org/wiki/Determinant#Properties_of_the_determinant.
Metadata: ID: P262 | shortcut: dent-noninv | author: JoramSoch | date: 2021-10-07, 10:39.
Sources:
• original work
Metadata: ID: D34 | shortcut: dent-cond | author: JoramSoch | date: 2020-03-21, 12:27.
where b is the base of the logarithm specifying in which unit the differential entropy is determined.
Sources:
• original work
Metadata: ID: D35 | shortcut: dent-joint | author: JoramSoch | date: 2020-03-21, 12:37.
Sources:
• Wikipedia (2020): “Cross entropy”; in: Wikipedia, the free encyclopedia, retrieved on 2020-07-28;
URL: https://en.wikipedia.org/wiki/Cross_entropy#Definition.
Metadata: ID: D86 | shortcut: dent-cross | author: JoramSoch | date: 2020-07-28, 03:03.
where p(x) and p(y) are the probability mass functions (→ Definition I/1.6.1) of X and Y and p(x, y)
is the joint probability (→ Definition I/1.3.2) mass function of X and Y .
2) The mutual information of two continuous random variables (→ Definition I/1.2.2) X and Y is
defined as
I(X, Y) = ∫_X ∫_Y p(x, y) · log[ p(x, y) / (p(x) · p(y)) ] dy dx (2)
where p(x) and p(y) are the probability density functions (→ Definition I/1.6.6) of X and Y and
p(x, y) is the joint probability (→ Definition I/1.3.2) density function of X and Y.
Sources:
• Cover TM, Thomas JA (1991): “Relative Entropy and Mutual Information”; in: Elements of
Information Theory, ch. 2.3/8.5, p. 20/251; URL: https://www.wiley.com/en-us/Elements+of+
Information+Theory%2C+2nd+Edition-p-9780471241959.
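For a discrete joint distribution, the mutual information can be computed directly and checked against the identity I(X, Y) = H(X) + H(Y) − H(X, Y) proved below. The joint pmf in this sketch (ours, not the book's) is chosen arbitrarily:

```python
import math

# joint pmf of X in {0,1}, Y in {0,1} (rows: x, columns: y)
pxy = [[0.3, 0.2],
       [0.1, 0.4]]
px = [sum(row) for row in pxy]                                   # marginal of X
py = [sum(pxy[x][y] for x in range(2)) for y in range(2)]        # marginal of Y

# mutual information I(X,Y) = sum p(x,y) log[p(x,y) / (p(x) p(y))]
mi = sum(pxy[x][y] * math.log(pxy[x][y] / (px[x] * py[y]))
         for x in range(2) for y in range(2))

def H(p):
    # Shannon entropy in nats
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

hx, hy = H(px), H(py)
hxy = H([pxy[x][y] for x in range(2) for y in range(2)])
```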
where H(X) and H(Y ) are the marginal entropies (→ Definition I/2.1.1) of X and Y and H(X|Y )
and H(Y |X) are the conditional entropies (→ Definition I/2.1.4).
Applying the law of conditional probability (→ Definition I/1.3.4), i.e. p(x, y) = p(x|y) p(y), we get:
I(X, Y) = Σ_x Σ_y p(x|y) p(y) log p(x|y) − Σ_x Σ_y p(x, y) log p(x) . (4)
Now considering the definitions of marginal (→ Definition I/2.1.1) and conditional (→ Definition
I/2.1.4) entropy
H(X) = − Σ_{x∈X} p(x) log p(x)
H(X|Y) = Σ_{y∈Y} p(y) H(X|Y = y) , (7)
The conditioning of X on Y in this proof is without loss of generality. Thus, the proof for the
expression using the reverse conditional entropy of Y given X is obtained by simply switching x and
y in the derivation.
Sources:
• Wikipedia (2020): “Mutual information”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
01-13; URL: https://en.wikipedia.org/wiki/Mutual_information#Relation_to_conditional_and_
joint_entropy.
Metadata: ID: P19 | shortcut: dmi-mce | author: JoramSoch | date: 2020-01-13, 18:20.
I(X, Y) = Σ_x Σ_y p(x, y) log p(x, y) − Σ_x ( Σ_y p(x, y) ) log p(x) − Σ_y ( Σ_x p(x, y) ) log p(y) . (4)
Applying the law of marginal probability (→ Definition I/1.3.3), i.e. p(x) = Σ_y p(x, y), we get:
I(X, Y) = Σ_x Σ_y p(x, y) log p(x, y) − Σ_x p(x) log p(x) − Σ_y p(y) log p(y) . (5)
Now considering the definitions of marginal (→ Definition I/2.1.1) and joint (→ Definition I/2.1.5)
entropy
H(X) = − Σ_{x∈X} p(x) log p(x)
H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x, y) , (6)
Sources:
• Wikipedia (2020): “Mutual information”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
01-13; URL: https://en.wikipedia.org/wiki/Mutual_information#Relation_to_conditional_and_
joint_entropy.
Metadata: ID: P20 | shortcut: dmi-mje | author: JoramSoch | date: 2020-01-13, 21:53.
where H(X, Y ) is the joint entropy (→ Definition I/2.1.5) of X and Y and H(X|Y ) and H(Y |X) are
the conditional entropies (→ Definition I/2.1.4).
Proof: The existence of the joint probability mass function (→ Definition I/1.6.1) ensures that the
mutual information (→ Definition I/2.4.1) is defined:
I(X, Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log[ p(x, y) / (p(x) p(y)) ] . (2)
Sources:
• Wikipedia (2020): “Mutual information”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
01-13; URL: https://en.wikipedia.org/wiki/Mutual_information#Relation_to_conditional_and_
joint_entropy.
Metadata: ID: P21 | shortcut: dmi-jce | author: JoramSoch | date: 2020-01-13, 22:17.
where p(x) and p(y) are the probability mass functions (→ Definition I/1.6.1) of X and Y and p(x, y)
is the joint probability (→ Definition I/1.3.2) mass function of X and Y .
2) The mutual information of two continuous random variables (→ Definition I/1.2.2) X and Y is
defined as
I(X, Y) = ∫_X ∫_Y p(x, y) · log[ p(x, y) / (p(x) · p(y)) ] dy dx (2)
where p(x) and p(y) are the probability density functions (→ Definition I/1.6.6) of X and Y and
p(x, y) is the joint probability (→ Definition I/1.3.2) density function of X and Y.
Sources:
• Cover TM, Thomas JA (1991): “Relative Entropy and Mutual Information”; in: Elements of
Information Theory, ch. 2.3/8.5, p. 20/251; URL: https://www.wiley.com/en-us/Elements+of+
Information+Theory%2C+2nd+Edition-p-9780471241959.
where h(X) and h(Y ) are the marginal differential entropies (→ Definition I/2.2.1) of X and Y and
h(X|Y ) and h(Y |X) are the conditional differential entropies (→ Definition I/2.2.7).
Applying the law of conditional probability (→ Definition I/1.3.4), i.e. p(x, y) = p(x|y) p(y), we get:
I(X, Y) = ∫_X ∫_Y p(x|y) p(y) log p(x|y) dy dx − ∫_X ∫_Y p(x, y) log p(x) dy dx . (4)
Now considering the definitions of marginal (→ Definition I/2.2.1) and conditional (→ Definition
I/2.2.7) differential entropy
h(X) = − ∫_X p(x) log p(x) dx
h(X|Y) = ∫_Y p(y) h(X|Y = y) dy , (7)
Sources:
• Wikipedia (2020): “Mutual information”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
02-21; URL: https://en.wikipedia.org/wiki/Mutual_information#Relation_to_conditional_and_
joint_entropy.
Metadata: ID: P58 | shortcut: cmi-mcde | author: JoramSoch | date: 2020-02-21, 16:53.
I(X, Y) = ∫_X ∫_Y p(x, y) log p(x, y) dy dx − ∫_X ∫_Y p(x, y) log p(x) dy dx − ∫_X ∫_Y p(x, y) log p(y) dy dx . (3)
Regrouping the variables, this reads:
I(X, Y) = ∫_X ∫_Y p(x, y) log p(x, y) dy dx − ∫_X ( ∫_Y p(x, y) dy ) log p(x) dx − ∫_Y ( ∫_X p(x, y) dx ) log p(y) dy . (4)
Applying the law of marginal probability (→ Definition I/1.3.3), i.e. p(x) = ∫_Y p(x, y) dy, we get:
I(X, Y) = ∫_X ∫_Y p(x, y) log p(x, y) dy dx − ∫_X p(x) log p(x) dx − ∫_Y p(y) log p(y) dy . (5)
Now considering the definitions of marginal (→ Definition I/2.2.1) and joint (→ Definition I/2.2.8)
differential entropy
h(X) = − ∫_X p(x) log p(x) dx
h(X, Y) = − ∫_X ∫_Y p(x, y) log p(x, y) dy dx , (6)
Sources:
• Wikipedia (2020): “Mutual information”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
02-21; URL: https://en.wikipedia.org/wiki/Mutual_information#Relation_to_conditional_and_
joint_entropy.
Metadata: ID: P59 | shortcut: cmi-mjde | author: JoramSoch | date: 2020-02-21, 17:13.
Proof: The existence of the joint probability density function (→ Definition I/1.6.6) ensures that
the mutual information (→ Definition I/2.4.1) is defined:
I(X, Y) = ∫_X ∫_Y p(x, y) log[ p(x, y) / (p(x) p(y)) ] dy dx . (2)
The relation of mutual information to conditional differential entropy (→ Proof I/2.4.2) is:
Sources:
• Wikipedia (2020): “Mutual information”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
02-21; URL: https://en.wikipedia.org/wiki/Mutual_information#Relation_to_conditional_and_
joint_entropy.
Metadata: ID: P60 | shortcut: cmi-jcde | author: JoramSoch | date: 2020-02-21, 17:23.
where p(x) and q(x) are the probability mass functions (→ Definition I/1.6.1) of P and Q.
2) The Kullback-Leibler divergence of P from Q for a continuous random variable X is defined as
KL[P||Q] = ∫_X p(x) · log[ p(x)/q(x) ] dx (2)
where p(x) and q(x) are the probability density functions (→ Definition I/1.6.6) of P and Q.
Sources:
• MacKay, David J.C. (2003): “Probability, Entropy, and Inference”; in: Information Theory, In-
ference, and Learning Algorithms, ch. 2.6, eq. 2.45, p. 34; URL: https://www.inference.org.uk/
itprnn/book.pdf.
2.5.2 Non-negativity
Theorem: The Kullback-Leibler divergence (→ Definition I/2.5.1) is always non-negative
KL[P||Q] ≥ 0 . (1)
Gibbs’ inequality (→ Proof I/2.1.8) states that the entropy (→ Definition I/2.1.1) of a probability
distribution is always less than or equal to the cross-entropy (→ Definition I/2.1.6) with another
probability distribution – with equality only if the distributions are identical –,
− Σ_{i=1}^n p(x_i) log p(x_i) ≤ − Σ_{i=1}^n p(x_i) log q(x_i) (4)
Sources:
• Wikipedia (2020): “Kullback-Leibler divergence”; in: Wikipedia, the free encyclopedia, retrieved on
2020-05-31; URL: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Properties.
Metadata: ID: P117 | shortcut: kl-nonneg | author: JoramSoch | date: 2020-05-31, 23:43.
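Non-negativity is easy to confirm for concrete distributions; this sketch (ours, not the book's) evaluates the discrete KL divergence for an arbitrary pair and for the equality case P = Q:

```python
import math

def kl(p, q):
    """Discrete KL divergence in nats, assuming q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.1, 0.4, 0.5]
d_pq = kl(p, q)   # strictly positive, since p != q
d_pp = kl(p, p)   # zero, since the distributions are identical
```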
2.5.3 Non-negativity
Theorem: The Kullback-Leibler divergence (→ Definition I/2.5.1) is always non-negative
KL[P||Q] ≥ 0 . (1)
Σ_{i=1}^n a_i log_c( a_i/b_i ) ≥ a log_c( a/b ) (3)
where a_1, ..., a_n and b_1, ..., b_n are non-negative real numbers and a = Σ_{i=1}^n a_i and b = Σ_{i=1}^n b_i.
Because p(x) and q(x) are probability mass functions (→ Definition I/1.6.1), such that
p(x) ≥ 0 , Σ_{x∈X} p(x) = 1 and
q(x) ≥ 0 , Σ_{x∈X} q(x) = 1 , (4)
Sources:
• Wikipedia (2020): “Log sum inequality”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
09-09; URL: https://en.wikipedia.org/wiki/Log_sum_inequality#Applications.
Metadata: ID: P166 | shortcut: kl-nonneg2 | author: JoramSoch | date: 2020-09-09, 07:02.
2.5.4 Non-symmetry
Theorem: The Kullback-Leibler divergence (→ Definition I/2.5.1) is non-symmetric, i.e.
Proof: Let X ∈ X = {0, 1, 2} be a discrete random variable (→ Definition I/1.2.2) and consider the
two probability distributions (→ Definition I/1.5.1)
P : X ∼ Bin(2, 0.5)
Q : X ∼ U(0, 2) (2)
where Bin(n, p) indicates a binomial distribution (→ Definition II/1.3.1) and U(a, b) indicates a
discrete uniform distribution (→ Definition II/1.1.1).
Then, the probability mass function of the binomial distribution (→ Proof II/1.3.2) entails that
p(x) = 1/4 , if x = 0
p(x) = 1/2 , if x = 1 (3)
p(x) = 1/4 , if x = 2
and the probability mass function of the discrete uniform distribution (→ Proof II/1.1.2) entails that
q(x) = 1/3 , (4)
such that the Kullback-Leibler divergence (→ Definition I/2.5.1) of P from Q is
KL[P||Q] = Σ_{x∈X} p(x) · log( p(x)/q(x) )
= (1/4) log(3/4) + (1/2) log(3/2) + (1/4) log(3/4)
= (1/2) log(3/4) + (1/2) log(3/2)
= (1/2) [ log(3/4) + log(3/2) ] (5)
= (1/2) log( 3/4 · 3/2 )
= (1/2) log(9/8) = 0.0589
and the Kullback-Leibler divergence (→ Definition I/2.5.1) of Q from P is
KL[Q||P] = Σ_{x∈X} q(x) · log( q(x)/p(x) )
= (1/3) log(4/3) + (1/3) log(2/3) + (1/3) log(4/3)
= (1/3) [ log(4/3) + log(2/3) + log(4/3) ] (6)
= (1/3) log( 4/3 · 2/3 · 4/3 )
= (1/3) log(32/27) = 0.0566
which provides an example for
KL[P||Q] ≠ KL[Q||P] . (7)
Sources:
• Kullback, Solomon (1959): “Divergence”; in: Information Theory and Statistics, ch. 1.3, pp. 6ff.;
URL: http://index-of.co.uk/Information-Theory/Information%20theory%20and%20statistics%20-%
20Solomon%20Kullback.pdf.
• Wikipedia (2020): “Kullback-Leibler divergence”; in: Wikipedia, the free encyclopedia, retrieved on
2020-08-11; URL: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Basic_
example.
Metadata: ID: P147 | shortcut: kl-nonsymm | author: JoramSoch | date: 2020-08-11, 06:57.
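The two values in (5) and (6) can be reproduced directly; this sketch (ours, not the book's) encodes Bin(2, 0.5) and U(0, 2) as explicit probability vectors over {0, 1, 2}:

```python
import math

p = [0.25, 0.5, 0.25]   # Bin(2, 0.5) on {0, 1, 2}
q = [1/3, 1/3, 1/3]     # discrete uniform U(0, 2)

kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))  # (1/2) ln(9/8) ≈ 0.0589
kl_qp = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))  # (1/3) ln(32/27) ≈ 0.0566
```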
2.5.5 Convexity
Theorem: The Kullback-Leibler divergence (→ Definition I/2.5.1) is convex in the pair of probability
distributions (→ Definition I/1.5.1) (p, q), i.e.
KL[λ p1 + (1 − λ) p2 || λ q1 + (1 − λ) q2] ≤ λ KL[p1||q1] + (1 − λ) KL[p2||q2] (1)
where (p1, q1) and (p2, q2) are two pairs of probability distributions and 0 ≤ λ ≤ 1.
Sources:
• Wikipedia (2020): “Kullback-Leibler divergence”; in: Wikipedia, the free encyclopedia, retrieved on
2020-08-11; URL: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Properties.
• Xie, Yao (2012): “Chain Rules and Inequalities”; in: ECE587: Information Theory, Lecture 3,
Slides 22/24; URL: https://www2.isye.gatech.edu/~yxie77/ece587/Lecture3.pdf.
Metadata: ID: P148 | shortcut: kl-conv | author: JoramSoch | date: 2020-08-11, 07:30.
where P1 and P2 are independent (→ Definition I/1.3.6) distributions (→ Definition I/1.5.1) with
the joint distribution (→ Definition I/1.5.2) P , such that p(x, y) = p1 (x) p2 (y), and equivalently for
Q1 , Q2 and Q.
KL[P||Q] = ∫_X ∫_Y p1(x) p2(y) · [ log( p1(x)/q1(x) ) + log( p2(y)/q2(y) ) ] dy dx
= ∫_X ∫_Y p1(x) p2(y) · log( p1(x)/q1(x) ) dy dx + ∫_X ∫_Y p1(x) p2(y) · log( p2(y)/q2(y) ) dy dx
= ∫_X p1(x) · log( p1(x)/q1(x) ) ( ∫_Y p2(y) dy ) dx + ∫_Y p2(y) · log( p2(y)/q2(y) ) ( ∫_X p1(x) dx ) dy (5)
= ∫_X p1(x) · log( p1(x)/q1(x) ) dx + ∫_Y p2(y) · log( p2(y)/q2(y) ) dy
(2) = KL[P1||Q1] + KL[P2||Q2] .
Sources:
• Wikipedia (2020): “Kullback-Leibler divergence”; in: Wikipedia, the free encyclopedia, retrieved on
2020-05-31; URL: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Properties.
Metadata: ID: P116 | shortcut: kl-add | author: JoramSoch | date: 2020-05-31, 23:26.
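The additivity for independent distributions has a discrete analogue that is easy to verify; this sketch (ours, not the book's) forms product distributions over a 2x2 outcome grid:

```python
import math

def kl(p, q):
    # discrete KL divergence in nats
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p1, q1 = [0.6, 0.4], [0.5, 0.5]
p2, q2 = [0.2, 0.8], [0.3, 0.7]

# product (independent) distributions: p(x, y) = p1(x) * p2(y), same for q
p_joint = [a * b for a in p1 for b in p2]
q_joint = [a * b for a in q1 for b in q2]

lhs = kl(p_joint, q_joint)
rhs = kl(p1, q1) + kl(p2, q2)
```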
Proof: The continuous Kullback-Leibler divergence (→ Definition I/2.5.1) (KL divergence) is defined
as
KL[p(x)||q(x)] = ∫_a^b p(x) · log( p(x)/q(x) ) dx (2)
where a = min(X ) and b = max(X ) are the lower and upper bound of the possible outcomes X of
X.
Due to the identity of the differentials
$$p(x)\,\mathrm{d}x = p(y)\,\mathrm{d}y \quad \text{and} \quad q(x)\,\mathrm{d}x = q(y)\,\mathrm{d}y \; , \tag{3}$$

that is,

$$p(x) = p(y)\,\frac{\mathrm{d}y}{\mathrm{d}x} \quad \text{and} \quad q(x) = q(y)\,\frac{\mathrm{d}y}{\mathrm{d}x} \; , \tag{4}$$
the KL divergence can be evaluated as follows:
$$\begin{split}
\mathrm{KL}[p(x)||q(x)] &= \int_a^b p(x) \cdot \log \frac{p(x)}{q(x)} \, \mathrm{d}x \\
&= \int_{y(a)}^{y(b)} p(y)\,\frac{\mathrm{d}y}{\mathrm{d}x} \cdot \log \left( \frac{p(y)\,\frac{\mathrm{d}y}{\mathrm{d}x}}{q(y)\,\frac{\mathrm{d}y}{\mathrm{d}x}} \right) \mathrm{d}x \\
&= \int_{y(a)}^{y(b)} p(y) \cdot \log \frac{p(y)}{q(y)} \, \mathrm{d}y \\
&= \mathrm{KL}[p(y)||q(y)] \; .
\end{split} \tag{5}$$
Sources:
• Wikipedia (2020): “Kullback-Leibler divergence”; in: Wikipedia, the free encyclopedia, retrieved on
2020-05-27; URL: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Properties.
• shimao (2018): “KL divergence invariant to affine transformation?”; in: StackExchange CrossVali-
dated, retrieved on 2020-05-28; URL: https://stats.stackexchange.com/questions/341922/kl-divergence-inv
Metadata: ID: P115 | shortcut: kl-inv | author: JoramSoch | date: 2020-05-28, 00:18.
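This invariance can be checked numerically for an affine transformation of normal random variables (a sketch added here, not from the book; it relies on the standard closed-form KL divergence between two univariate Gaussians, which is not derived in this proof):

```python
import math

def kl_normal(mu1, s1, mu2, s2):
    # KL[N(mu1, s1^2) || N(mu2, s2^2)], standard closed form
    return math.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5

# KL divergence between the original distributions N(0, 1) and N(1, 4)
kl_x = kl_normal(0.0, 1.0, 1.0, 2.0)

# the affine map y = a*x + b sends N(mu, s^2) to N(a*mu + b, (a*s)^2)
a, b = 3.0, -2.0
kl_y = kl_normal(a * 0.0 + b, a * 1.0, a * 1.0 + b, a * 2.0)
```

Since both density pairs are related by the same invertible transformation, `kl_x` and `kl_y` coincide.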
where H(P, Q) is the cross-entropy (→ Definition I/2.1.6) of P and Q and H(P ) is the marginal
entropy (→ Definition I/2.1.1) of P .
where p(x) and q(x) are the probability mass functions (→ Definition I/1.6.1) of P and Q.
Separating the logarithm, we have:
$$\mathrm{KL}[P||Q] = -\sum_{x \in \mathcal{X}} p(x) \log q(x) + \sum_{x \in \mathcal{X}} p(x) \log p(x) \; . \tag{3}$$

Now considering the definitions of marginal entropy (→ Definition I/2.1.1) and cross-entropy (→
Definition I/2.1.6)

$$H(P) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) \quad \text{and} \quad H(P, Q) = -\sum_{x \in \mathcal{X}} p(x) \log q(x) \; , \tag{4}$$
Sources:
• Wikipedia (2020): “Kullback-Leibler divergence”; in: Wikipedia, the free encyclopedia, retrieved on
2020-05-27; URL: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Motivation.
Metadata: ID: P113 | shortcut: kl-ent | author: JoramSoch | date: 2020-05-27, 23:20.
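The relation KL[P||Q] = H(P,Q) − H(P) can be verified for a small discrete example (the following Python sketch is an addition, not part of the book):

```python
import math

# two probability mass functions over the same support (made-up numbers)
p = [0.1, 0.2, 0.7]
q = [0.3, 0.3, 0.4]

kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))   # KL divergence
cross_ent = -sum(pi * math.log(qi) for pi, qi in zip(p, q))  # cross-entropy H(P,Q)
ent = -sum(pi * math.log(pi) for pi in p)                    # marginal entropy H(P)
```

Up to floating-point error, `kl` equals `cross_ent - ent`.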
$$\mathrm{KL}[P||Q] = -\int_{\mathcal{X}} p(x) \log q(x) \, \mathrm{d}x + \int_{\mathcal{X}} p(x) \log p(x) \, \mathrm{d}x \; . \tag{3}$$

Now considering the definitions of marginal differential entropy (→ Definition I/2.2.1) and differential
cross-entropy (→ Definition I/2.2.9)

$$h(P) = -\int_{\mathcal{X}} p(x) \log p(x) \, \mathrm{d}x \quad \text{and} \quad h(P, Q) = -\int_{\mathcal{X}} p(x) \log q(x) \, \mathrm{d}x \; , \tag{4}$$
Sources:
• Wikipedia (2020): “Kullback-Leibler divergence”; in: Wikipedia, the free encyclopedia, retrieved on
2020-05-27; URL: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Motivation.
Metadata: ID: P114 | shortcut: kl-dent | author: JoramSoch | date: 2020-05-27, 23:32.
120 CHAPTER I. GENERAL THEOREMS
3 Estimation theory
3.1 Point estimates
3.1.1 Mean squared error
Definition: Let θ̂ be an estimator (→ Definition “est”) of an unknown parameter (→ Definition
“para”) θ based on measured data (→ Definition “data”) y. Then, the mean squared error is defined
as the expected value (→ Definition I/1.7.1) of the squared difference between the estimated value
and the true value of the parameter:

$$\mathrm{MSE} = \mathrm{E}_{\hat\theta}\!\left[ \left( \hat\theta - \theta \right)^2 \right] \; , \tag{1}$$

where $\mathrm{E}_{\hat\theta}[\cdot]$ denotes the expectation calculated over all possible samples (→ Definition “samp”) y leading to
values of θ̂.
Sources:
• Wikipedia (2022): “Estimator”; in: Wikipedia, the free encyclopedia, retrieved on 2022-03-27; URL:
https://en.wikipedia.org/wiki/Estimator#Mean_squared_error.
Metadata: ID: D173 | shortcut: mse | author: JoramSoch | date: 2022-03-27, 23:41.
3.1.2 Partition of the mean squared error into bias and variance
Theorem: The mean squared error (→ Definition I/3.1.1) can be partitioned into variance and
squared bias:

$$\mathrm{MSE}(\hat\theta) = \mathrm{Var}(\hat\theta) + \mathrm{Bias}(\hat\theta, \theta)^2 \; . \tag{1}$$
Proof: The mean squared error (→ Definition I/3.1.1) (MSE) is defined as the expected value (→
Definition I/1.7.1) of the squared deviation of the estimated value θ̂ from the true value θ of a
parameter, over all values θ̂:
$$\mathrm{MSE}(\hat\theta) = \mathrm{E}_{\hat\theta}\!\left[ \left( \hat\theta - \theta \right)^2 \right] \; . \tag{4}$$

This can be developed as follows:

$$\begin{split}
\mathrm{MSE}(\hat\theta) &= \mathrm{E}_{\hat\theta}\!\left[ \left( \hat\theta - \theta \right)^2 \right] \\
&= \mathrm{E}_{\hat\theta}\!\left[ \left( \hat\theta - \mathrm{E}_{\hat\theta}(\hat\theta) + \mathrm{E}_{\hat\theta}(\hat\theta) - \theta \right)^2 \right] \\
&= \mathrm{E}_{\hat\theta}\!\left[ \left( \hat\theta - \mathrm{E}_{\hat\theta}(\hat\theta) \right)^2 + 2 \left( \hat\theta - \mathrm{E}_{\hat\theta}(\hat\theta) \right) \left( \mathrm{E}_{\hat\theta}(\hat\theta) - \theta \right) + \left( \mathrm{E}_{\hat\theta}(\hat\theta) - \theta \right)^2 \right] \\
&= \mathrm{E}_{\hat\theta}\!\left[ \left( \hat\theta - \mathrm{E}_{\hat\theta}(\hat\theta) \right)^2 \right] + \mathrm{E}_{\hat\theta}\!\left[ 2 \left( \hat\theta - \mathrm{E}_{\hat\theta}(\hat\theta) \right) \left( \mathrm{E}_{\hat\theta}(\hat\theta) - \theta \right) \right] + \mathrm{E}_{\hat\theta}\!\left[ \left( \mathrm{E}_{\hat\theta}(\hat\theta) - \theta \right)^2 \right] \; .
\end{split} \tag{5}$$

Because $\mathrm{E}_{\hat\theta}(\hat\theta) - \theta$ is constant with respect to θ̂, it can be moved out of the expectations:

$$\begin{split}
\mathrm{MSE}(\hat\theta) &= \mathrm{E}_{\hat\theta}\!\left[ \left( \hat\theta - \mathrm{E}_{\hat\theta}(\hat\theta) \right)^2 \right] + 2 \left( \mathrm{E}_{\hat\theta}(\hat\theta) - \theta \right) \mathrm{E}_{\hat\theta}\!\left[ \hat\theta - \mathrm{E}_{\hat\theta}(\hat\theta) \right] + \left( \mathrm{E}_{\hat\theta}(\hat\theta) - \theta \right)^2 \\
&= \mathrm{E}_{\hat\theta}\!\left[ \left( \hat\theta - \mathrm{E}_{\hat\theta}(\hat\theta) \right)^2 \right] + 2 \left( \mathrm{E}_{\hat\theta}(\hat\theta) - \theta \right) \left( \mathrm{E}_{\hat\theta}(\hat\theta) - \mathrm{E}_{\hat\theta}(\hat\theta) \right) + \left( \mathrm{E}_{\hat\theta}(\hat\theta) - \theta \right)^2 \\
&= \mathrm{E}_{\hat\theta}\!\left[ \left( \hat\theta - \mathrm{E}_{\hat\theta}(\hat\theta) \right)^2 \right] + \left( \mathrm{E}_{\hat\theta}(\hat\theta) - \theta \right)^2 \; .
\end{split} \tag{6}$$
Sources:
• Wikipedia (2019): “Mean squared error”; in: Wikipedia, the free encyclopedia, retrieved on 2019-11-
27; URL: https://en.wikipedia.org/wiki/Mean_squared_error#Proof_of_variance_and_bias_
relationship.
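The partition can be checked by simulation (a Python sketch added here, not part of the book; the setup, a deliberately biased estimator of a normal mean, is our own). The identity MSE = Var + Bias² holds exactly for the empirical moments when they are computed with the same divisor:

```python
import random

random.seed(1)

# hypothetical setup: estimate the mean mu of N(mu, 1) from n_obs draws,
# using a deliberately biased estimator theta_hat = sample mean + shift
mu, n_obs, n_sims, shift = 5.0, 20, 10000, 0.3

estimates = []
for _ in range(n_sims):
    sample = [random.gauss(mu, 1.0) for _ in range(n_obs)]
    estimates.append(sum(sample) / n_obs + shift)

mean_est = sum(estimates) / n_sims
mse = sum((e - mu) ** 2 for e in estimates) / n_sims        # empirical MSE
var = sum((e - mean_est) ** 2 for e in estimates) / n_sims  # empirical variance
bias = mean_est - mu                                        # empirical bias (approx. shift)
```

Here `mse` equals `var + bias ** 2` to machine precision, mirroring equation (6).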
Sources:
• Wikipedia (2022): “Confidence interval”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
03-27; URL: https://en.wikipedia.org/wiki/Confidence_interval#Definition.
Proof: The confidence interval (→ Definition I/3.2.1) is defined as the interval that, under infinitely
repeated random experiments (→ Definition I/1.1.1), contains the true parameter value with a certain
probability.
Let us define the likelihood ratio (→ Definition “lr”)
$$\Lambda(\phi) = \frac{p(y|\phi, \hat\lambda)}{p(y|\hat\phi, \hat\lambda)} \tag{4}$$

and compute the log-likelihood ratio (→ Definition “llr”)

$$\log \Lambda(\phi) = \log p(y|\phi, \hat\lambda) - \log p(y|\hat\phi, \hat\lambda) \; . \tag{5}$$

$$H_0: \theta \in \Theta_0 \quad \Rightarrow \quad -2 \log \frac{\max_{\theta \in \Theta_0} p(y|\theta)}{\max_{\theta \in \Theta_1} p(y|\theta)} \sim \chi^2_{\Delta k} \; , \tag{6}$$

where Δk is the difference in dimensionality between Θ0 and Θ1. Applied to our example in (5), we
note that $\Theta_1 = \{\phi, \hat\phi\}$ and $\Theta_0 = \{\phi\}$, such that Δk = 1 and Wilks’ theorem implies:

$$\begin{split}
-2 \left[ \log p(y|\phi, \hat\lambda) - \log p(y|\hat\phi, \hat\lambda) \right] &\leq \chi^2_{1, 1-\alpha} \\
\log p(y|\phi, \hat\lambda) - \log p(y|\hat\phi, \hat\lambda) &\geq -\frac{1}{2}\, \chi^2_{1, 1-\alpha} \\
\log p(y|\phi, \hat\lambda) &\geq \log p(y|\hat\phi, \hat\lambda) - \frac{1}{2}\, \chi^2_{1, 1-\alpha} \; ,
\end{split} \tag{9}$$
which is equivalent to the confidence interval given by (3).
Sources:
• Wikipedia (2020): “Confidence interval”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
02-19; URL: https://en.wikipedia.org/wiki/Confidence_interval#Methods_of_derivation.
• Wikipedia (2020): “Likelihood-ratio test”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
02-19; URL: https://en.wikipedia.org/wiki/Likelihood-ratio_test#Definition.
• Wikipedia (2020): “Wilks’ theorem”; in: Wikipedia, the free encyclopedia, retrieved on 2020-02-19;
URL: https://en.wikipedia.org/wiki/Wilks%27_theorem.
Metadata: ID: P56 | shortcut: ci-wilks | author: JoramSoch | date: 2020-02-19, 17:15.
4 Frequentist statistics
4.1 Likelihood theory
4.1.1 Likelihood function
Definition: Let there be a generative model (→ Definition I/5.1.1) m describing measured data
y using model parameters θ. Then, the probability density function (→ Definition I/1.6.6) of the
distribution of y given θ is called the likelihood function of m:
Sources:
• original work
Sources:
• original work
Metadata: ID: D59 | shortcut: llf | author: JoramSoch | date: 2020-05-17, 22:52.
The process of calculating θ̂ is called “maximum likelihood estimation” and the functional form lead-
ing from y to θ̂ given m is called “maximum likelihood estimator”. Maximum likelihood estimation,
estimator and estimates may all be abbreviated as “MLE”.
Sources:
• original work
Metadata: ID: D60 | shortcut: mle | author: JoramSoch | date: 2020-05-15, 23:05.
Proof: Consider a set of independent and identical (→ Definition “iid”) normally distributed (→
Definition II/3.2.1) observations x = {x1 , . . . , xn } with unknown mean (→ Definition I/1.7.1) µ and
variance (→ Definition I/1.8.1) σ 2 :
$$x_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\mu, \sigma^2), \quad i = 1, \ldots, n \; . \tag{2}$$
Then, we know that the maximum likelihood estimator (→ Definition I/4.1.3) for the variance
(→ Definition I/1.8.1) σ 2 is underestimating the true variance of the data distribution (→ Proof
IV/1.1.2):
$$\mathrm{E}\!\left[ \hat\sigma^2_{\mathrm{MLE}} \right] = \frac{n-1}{n}\, \sigma^2 \neq \sigma^2 \; . \tag{3}$$
n
This proves the existence of cases such as those stated by the theorem.
Sources:
• original work
Metadata: ID: P317 | shortcut: mle-bias | author: JoramSoch | date: 2022-03-18, 17:26.
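The downward bias of the maximum likelihood variance estimator can be seen in simulation (a Python sketch added here, not part of the book; the concrete numbers are our own):

```python
import random

random.seed(0)

# draw many small samples from N(mu, sigma2) and compute the MLE of the variance
mu, sigma2, n, sims = 0.0, 4.0, 5, 100000
mle_vars = []
for _ in range(sims):
    x = [random.gauss(mu, sigma2 ** 0.5) for _ in range(n)]
    xbar = sum(x) / n
    mle_vars.append(sum((xi - xbar) ** 2 for xi in x) / n)  # divides by n, not n - 1

expected_mle = sum(mle_vars) / sims   # close to (n-1)/n * sigma2 = 3.2, below sigma2 = 4
```

The Monte Carlo average sits near (n−1)/n·σ², visibly underestimating the true variance.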
The maximum log-likelihood can be obtained by plugging the maximum likelihood estimates (→
Definition I/4.1.3) into the log-likelihood function (→ Definition I/4.1.2).
Sources:
• original work
Metadata: ID: D61 | shortcut: mll | author: JoramSoch | date: 2020-05-15, 23:13.
$$\begin{split}
\mu_1 &= f_1(\theta_1, \ldots, \theta_k) \\
&\;\;\vdots \\
\mu_k &= f_k(\theta_1, \ldots, \theta_k) \; ,
\end{split} \tag{1}$$
Sources:
• Wikipedia (2021): “Method of moments (statistics)”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-04-29; URL: https://en.wikipedia.org/wiki/Method_of_moments_(statistics)#Method.
Metadata: ID: D151 | shortcut: mome | author: JoramSoch | date: 2021-04-29, 07:51.
H : θ ∈ Θ∗ where Θ∗ ⊂ Θ . (1)
Sources:
• Wikipedia (2021): “Statistical hypothesis testing”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-03-19; URL: https://en.wikipedia.org/wiki/Statistical_hypothesis_testing#Definition_
of_terms.
Metadata: ID: D127 | shortcut: hyp | author: JoramSoch | date: 2021-03-19, 14:18.
Sources:
• Wikipedia (2021): “Exclusion of the null hypothesis”; in: Wikipedia, the free encyclopedia, re-
trieved on 2021-03-19; URL: https://en.wikipedia.org/wiki/Exclusion_of_the_null_hypothesis#
Terminology.
Metadata: ID: D128 | shortcut: hyp-simp | author: JoramSoch | date: 2021-03-19, 14:24.
• H is called a point hypothesis or exact hypothesis, if it specifies a single value for the parameter:

H : θ = θ∗ ; (1)
• H is called a set hypothesis or inexact hypothesis, if it specifies a set of possible values with more
than one element for the parameter value (e.g. a range or an interval):
H : θ ∈ Θ∗ . (2)
Sources:
• Wikipedia (2021): “Exclusion of the null hypothesis”; in: Wikipedia, the free encyclopedia, re-
trieved on 2021-03-19; URL: https://en.wikipedia.org/wiki/Exclusion_of_the_null_hypothesis#
Terminology.
Metadata: ID: D129 | shortcut: hyp-point | author: JoramSoch | date: 2021-03-19, 14:28.
H0 : θ = θ0 (1)
and consider a set (→ Definition I/4.2.3) alternative hypothesis (→ Definition I/4.3.3) H1 . Then,
• H1 is called a left-sided one-tailed hypothesis, if θ is assumed to be smaller than θ0 :
H1 : θ < θ0 ; (2)
• H1 is called a right-sided one-tailed hypothesis, if θ is assumed to be larger than θ0 :
H1 : θ > θ0 ; (3)
• H1 is called a two-tailed hypothesis, if θ is assumed to be unequal to θ0 :
H1 : θ ̸= θ0 . (4)
Sources:
• Wikipedia (2021): “One- and two-tailed tests”; in: Wikipedia, the free encyclopedia, retrieved on
2021-03-31; URL: https://en.wikipedia.org/wiki/One-_and_two-tailed_tests.
Metadata: ID: D138 | shortcut: hyp-tail | author: JoramSoch | date: 2021-03-31, 09:21.
Sources:
• Wikipedia (2021): “Statistical hypothesis testing”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-03-19; URL: https://en.wikipedia.org/wiki/Statistical_hypothesis_testing#The_testing_
process.
Metadata: ID: D130 | shortcut: test | author: JoramSoch | date: 2021-03-19, 14:36.
H0 : θ ∈ Θ0 where Θ0 ⊂ Θ . (1)
Sources:
• Wikipedia (2021): “Exclusion of the null hypothesis”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-03-12; URL: https://en.wikipedia.org/wiki/Exclusion_of_the_null_hypothesis#Basic_
definitions.
$$\begin{split}
H_0&: \theta \in \Theta_0 \quad \text{where} \quad \Theta_0 \subset \Theta \\
H_1&: \theta \in \Theta_1 \quad \text{where} \quad \Theta_1 = \Theta \setminus \Theta_0 \; .
\end{split} \tag{1}$$
Sources:
• Wikipedia (2021): “Exclusion of the null hypothesis”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-03-12; URL: https://en.wikipedia.org/wiki/Exclusion_of_the_null_hypothesis#Basic_
definitions.
Whether a test (→ Definition I/4.3.1) is one-tailed or two-tailed has consequences for the
computation of the critical value (→ Definition I/4.3.9) and the p-value (→ Definition I/4.3.10).
Sources:
• Wikipedia (2021): “One- and two-tailed tests”; in: Wikipedia, the free encyclopedia, retrieved on
2021-03-31; URL: https://en.wikipedia.org/wiki/One-_and_two-tailed_tests.
Metadata: ID: D139 | shortcut: test-tail | author: JoramSoch | date: 2021-03-31, 09:32.
Sources:
• Wikipedia (2021): “Statistical hypothesis testing”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-03-19; URL: https://en.wikipedia.org/wiki/Statistical_hypothesis_testing#The_testing_
process.
Metadata: ID: D131 | shortcut: tstat | author: JoramSoch | date: 2021-03-19, 14:40.
Sources:
• Wikipedia (2021): “Size (statistics)”; in: Wikipedia, the free encyclopedia, retrieved on 2021-03-19;
URL: https://en.wikipedia.org/wiki/Size_(statistics).
Metadata: ID: D132 | shortcut: size | author: JoramSoch | date: 2021-03-19, 14:46.
Sources:
• Wikipedia (2021): “Power of a test”; in: Wikipedia, the free encyclopedia, retrieved on 2021-03-31;
URL: https://en.wikipedia.org/wiki/Power_of_a_test#Description.
Metadata: ID: D137 | shortcut: power | author: JoramSoch | date: 2021-03-31, 09:01.
Sources:
• Wikipedia (2021): “Statistical hypothesis testing”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-03-19; URL: https://en.wikipedia.org/wiki/Statistical_hypothesis_testing#Definition_
of_terms.
Metadata: ID: D133 | shortcut: alpha | author: JoramSoch | date: 2021-03-19, 14:50.
Sources:
• Wikipedia (2021): “Statistical hypothesis testing”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-03-19; URL: https://en.wikipedia.org/wiki/Statistical_hypothesis_testing#Definition_
of_terms.
Metadata: ID: D134 | shortcut: cval | author: JoramSoch | date: 2021-03-19, 14:54.
4.3.10 p-value
Definition: Let there be a statistical test (→ Definition I/4.3.1) of the null hypothesis (→ Definition
I/4.3.2) H0 and the alternative hypothesis (→ Definition I/4.3.3) H1 using the test statistic (→
Definition I/4.3.5) T (Y ). Let y be the measured data (→ Definition “data”) and let tobs = T (y)
be the observed test statistic computed from y. Moreover, assume that FT (t) is the cumulative
distribution function (→ Definition I/1.6.13) (CDF) of the distribution of T (Y ) under H0 .
Then, the p-value is the probability of obtaining a test statistic more extreme than or as extreme as
tobs , given that the null hypothesis H0 is true:
• p = FT (tobs ), if H1 is a left-sided one-tailed hypothesis (→ Definition I/4.2.4);
• p = 1 − FT (tobs ), if H1 is a right-sided one-tailed hypothesis (→ Definition I/4.2.4);
• p = 2 · min ([FT (tobs ), 1 − FT (tobs )]), if H1 is a two-tailed hypothesis (→ Definition I/4.2.4).
Sources:
• Wikipedia (2021): “Statistical hypothesis testing”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-03-19; URL: https://en.wikipedia.org/wiki/Statistical_hypothesis_testing#Definition_
of_terms.
Metadata: ID: D135 | shortcut: pval | author: JoramSoch | date: 2021-03-19, 14:58.
p ∼ U (0, 1) . (1)
Proof: Without loss of generality, consider a left-sided one-tailed hypothesis test (→ Definition
I/4.2.4). Then, the p-value is a function of the test statistic (→ Definition I/4.3.10)
$$P = F_T(T) \quad \text{and} \quad p = F_T(t_{\mathrm{obs}}) \; , \tag{2}$$
where tobs is the observed test statistic (→ Definition I/4.3.5) and FT (t) is the cumulative distribution
function (→ Definition I/1.6.13) of the test statistic (→ Definition I/4.3.5) under the null hypothesis
(→ Definition I/4.3.2).
Then, we can obtain the cumulative distribution function (→ Definition I/1.6.13) of the p-value (→
Definition I/4.3.10) as

$$F_P(x) = \Pr(P \leq x) = \Pr\!\left( F_T(T) \leq x \right) = \Pr\!\left( T \leq F_T^{-1}(x) \right) = F_T\!\left( F_T^{-1}(x) \right) = x \; , \tag{3}$$
which is the cumulative distribution function of a continuous uniform distribution (→ Proof II/3.1.4)
over the interval [0, 1]:
$$F_X(x) = \int_{-\infty}^{x} \mathcal{U}(z; 0, 1) \, \mathrm{d}z = x \quad \text{where} \quad 0 \leq x \leq 1 \; . \tag{4}$$
Sources:
• jll (2018): “Why are p-values uniformly distributed under the null hypothesis?”; in: StackExchange
CrossValidated, retrieved on 2022-03-18; URL: https://stats.stackexchange.com/a/345763/270304.
Metadata: ID: P318 | shortcut: pval-h0 | author: JoramSoch | date: 2022-03-18, 22:37.
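The uniformity of p-values under the null hypothesis can be observed in simulation (a Python sketch added here, not part of the book; the one-sided z-test with known variance is our own hypothetical example):

```python
import math
import random

random.seed(42)

def phi(z):
    # standard normal CDF, expressed via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, sims = 30, 20000
pvals = []
for _ in range(sims):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]  # data generated under H0: mu = 0
    z_obs = sum(x) / n * math.sqrt(n)               # z-statistic (variance known)
    pvals.append(phi(z_obs))                        # left-sided p-value p = F_T(t_obs)

mean_p = sum(pvals) / sims                  # a U(0,1) variable has mean 1/2
frac_below = sum(p < 0.5 for p in pvals) / sims  # and puts half its mass below 1/2
```

Both summaries land close to 1/2, as expected from the uniform distribution of P under H0.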
5 Bayesian statistics
5.1 Probabilistic modeling
5.1.1 Generative model
Definition: Consider measured data (→ Definition “data”) y and some unknown latent parameters
(→ Definition “para”) θ. A statement about the distribution (→ Definition I/1.5.1) of y given θ is
called a generative model m
m : y ∼ D(θ) , (1)
where D denotes an arbitrary probability distribution and θ are the parameters of this distribution.
Sources:
• Friston et al. (2008): “Bayesian decoding of brain images”; in: NeuroImage, vol. 39, pp. 181-205;
URL: https://www.sciencedirect.com/science/article/abs/pii/S1053811907007203; DOI: 10.1016/j.neuroim
Sources:
• original work
θ ∼ D(λ) . (1)
The parameters λ of this distribution are called the prior hyperparameters and the probability density
function (→ Definition I/1.6.6) is called the prior density:
Sources:
• original work
Metadata: ID: D29 | shortcut: prior | author: JoramSoch | date: 2020-03-03, 16:09.
Sources:
• Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB (2014): “Probability and
inference”; in: Bayesian Data Analysis, ch. 1, p. 3; URL: http://www.stat.columbia.edu/~gelman/
book/.
Metadata: ID: D30 | shortcut: fpm | author: JoramSoch | date: 2020-03-03, 16:16.
Sources:
• original work
Proof: The joint likelihood (→ Definition I/5.1.5) is defined as the joint probability (→ Definition
I/1.3.2) density function (→ Definition I/1.6.6) of data y and parameters θ:
$$p(y|\theta, m) = \frac{p(y, \theta|m)}{p(\theta|m)} \quad \Leftrightarrow \quad p(y, \theta|m) = p(y|\theta, m)\, p(\theta|m) \; . \tag{3}$$
Sources:
• original work
Metadata: ID: P89 | shortcut: jl-lfnprior | author: JoramSoch | date: 2020-05-05, 04:21.
Sources:
• original work
Metadata: ID: D32 | shortcut: post | author: JoramSoch | date: 2020-03-03, 16:43.
Proof: In a full probability model (→ Definition I/5.1.4), the posterior distribution (→ Definition
I/5.1.7) can be expressed using Bayes’ theorem (→ Proof I/5.3.1):
$$p(\theta|y, m) = \frac{p(y|\theta, m)\, p(\theta|m)}{p(y|m)} \; . \tag{2}$$

Applying the law of conditional probability (→ Definition I/1.3.4) to the numerator, we have:

$$p(\theta|y, m) = \frac{p(y, \theta|m)}{p(y|m)} \; . \tag{3}$$
Because the denominator does not depend on θ, it is constant in θ and thus acts as a proportionality
factor between the posterior distribution and the joint likelihood:

$$p(\theta|y, m) \propto p(y, \theta|m) \; . \tag{4}$$
Sources:
• original work
Metadata: ID: P90 | shortcut: post-jl | author: JoramSoch | date: 2020-05-05, 04:46.
Sources:
• original work
and related to likelihood function (→ Definition I/5.1.2) and prior distribution (→ Definition I/5.1.7)
as follows:
$$p(y|m) = \int_{\Theta} p(y|\theta, m)\, p(\theta|m) \, \mathrm{d}\theta \; . \tag{2}$$
Proof: In a full probability model (→ Definition I/5.1.4), the marginal likelihood (→ Definition
I/5.1.9) is defined as the marginal probability (→ Definition I/1.3.3) of the data y, given only the
model m:
p(y|m) . (3)
Using the law of marginal probability (→ Definition I/1.3.3), this can be obtained by integrating
the joint likelihood (→ Definition I/5.1.5) function over the entire parameter space:
$$p(y|m) = \int_{\Theta} p(y, \theta|m) \, \mathrm{d}\theta \; . \tag{4}$$
Applying the law of conditional probability (→ Definition I/1.3.4), the integrand can also be written
as the product of likelihood function (→ Definition I/5.1.2) and prior density (→ Definition I/5.1.3):
$$p(y|m) = \int_{\Theta} p(y|\theta, m)\, p(\theta|m) \, \mathrm{d}\theta \; . \tag{5}$$
Sources:
• original work
Metadata: ID: P91 | shortcut: ml-jl | author: JoramSoch | date: 2020-05-05, 04:59.
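Equation (5) can be checked numerically for a conjugate example (a Python sketch added here, not from the book; the Beta-binomial setup and its closed-form marginal likelihood are standard results assumed for the comparison):

```python
import math

n, k = 10, 7          # data: 7 successes in 10 trials
a, b = 2.0, 2.0       # Beta(2, 2) prior on the success probability theta

def beta_fn(x, y):
    # Beta function B(x, y) via the Gamma function
    return math.gamma(x) * math.gamma(y) / math.gamma(x + y)

def integrand(theta):
    lik = math.comb(n, k) * theta ** k * (1 - theta) ** (n - k)      # likelihood
    prior = theta ** (a - 1) * (1 - theta) ** (b - 1) / beta_fn(a, b)  # prior density
    return lik * prior

# numerical integration of likelihood * prior over the parameter space [0, 1]
m = 100000
ml_numeric = sum(integrand((i + 0.5) / m) for i in range(m)) / m

# conjugate closed form for comparison
ml_closed = math.comb(n, k) * beta_fn(a + k, b + n - k) / beta_fn(a, b)
```

The midpoint-rule integral matches the closed-form marginal likelihood to high precision.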
Sources:
• Friston et al. (2002): “Classical and Bayesian Inference in Neuroimaging: Theory”; in: NeuroIm-
age, vol. 16, iss. 2, pp. 465-483, fn. 1; URL: https://www.sciencedirect.com/science/article/pii/
S1053811902910906; DOI: 10.1006/nimg.2002.1090.
• Friston et al. (2002): “Classical and Bayesian Inference in Neuroimaging: Applications”; in: Neu-
roImage, vol. 16, iss. 2, pp. 484-512, fn. 10; URL: https://www.sciencedirect.com/science/article/
pii/S1053811902910918; DOI: 10.1006/nimg.2002.1091.
Metadata: ID: D116 | shortcut: prior-flat | author: JoramSoch | date: 2020-12-02, 17:04.
Sources:
• Wikipedia (2020): “Lindley’s paradox”; in: Wikipedia, the free encyclopedia, retrieved on 2020-11-
25; URL: https://en.wikipedia.org/wiki/Lindley%27s_paradox#Bayesian_approach.
Metadata: ID: D117 | shortcut: prior-uni | author: JoramSoch | date: 2020-12-02, 17:21.
Sources:
• Soch J, Allefeld C, Haynes JD (2016): “How to avoid mismodelling in GLM-based fMRI data
analysis: cross-validated Bayesian model selection”; in: NeuroImage, vol. 141, pp. 469-489, eq.
15, p. 473; URL: https://www.sciencedirect.com/science/article/pii/S1053811916303615; DOI:
10.1016/j.neuroimage.2016.07.047.
Metadata: ID: D118 | shortcut: prior-inf | author: JoramSoch | date: 2020-12-02, 17:28.
Sources:
• Soch J, Allefeld C, Haynes JD (2016): “How to avoid mismodelling in GLM-based fMRI data
analysis: cross-validated Bayesian model selection”; in: NeuroImage, vol. 141, pp. 469-489, eq.
13, p. 473; URL: https://www.sciencedirect.com/science/article/pii/S1053811916303615; DOI:
10.1016/j.neuroimage.2016.07.047.
Metadata: ID: D119 | shortcut: prior-emp | author: JoramSoch | date: 2020-12-02, 17:37.
Sources:
• Wikipedia (2020): “Conjugate prior”; in: Wikipedia, the free encyclopedia, retrieved on 2020-12-02;
URL: https://en.wikipedia.org/wiki/Conjugate_prior.
Metadata: ID: D120 | shortcut: prior-conj | author: JoramSoch | date: 2020-12-02, 17:55.
Sources:
• Wikipedia (2020): “Prior probability”; in: Wikipedia, the free encyclopedia, retrieved on 2020-12-
02; URL: https://en.wikipedia.org/wiki/Prior_probability#Uninformative_priors.
Metadata: ID: D121 | shortcut: prior-maxent | author: JoramSoch | date: 2020-12-02, 18:13.
Sources:
• Wikipedia (2020): “Empirical Bayes method”; in: Wikipedia, the free encyclopedia, retrieved on
2020-12-02; URL: https://en.wikipedia.org/wiki/Empirical_Bayes_method#Introduction.
Metadata: ID: D122 | shortcut: prior-eb | author: JoramSoch | date: 2020-12-02, 18:19.
Sources:
• Wikipedia (2020): “Prior probability”; in: Wikipedia, the free encyclopedia, retrieved on 2020-12-
02; URL: https://en.wikipedia.org/wiki/Prior_probability#Uninformative_priors.
Metadata: ID: D123 | shortcut: prior-ref | author: JoramSoch | date: 2020-12-02, 18:26.
$$p(A|B) = \frac{p(B|A)\, p(A)}{p(B)} \; . \tag{1}$$
Proof: The conditional probability (→ Definition I/1.3.4) is defined as the ratio of joint probability
(→ Definition I/1.3.2), i.e. the probability of both statements being true, and marginal probability
(→ Definition I/1.3.3), i.e. the probability of only the second one being true:
$$p(A|B) = \frac{p(A, B)}{p(B)} \; . \tag{2}$$
It can also be written down for the reverse situation, i.e. to calculate the probability that B is true,
given that A is true:
$$p(B|A) = \frac{p(A, B)}{p(A)} \; . \tag{3}$$
Both equations can be rearranged for the joint probability
$$p(A|B)\, p(B) \overset{(2)}{=} p(A, B) \overset{(3)}{=} p(B|A)\, p(A) \; , \tag{4}$$

from which Bayes’ theorem can be directly derived:

$$p(A|B) = \frac{p(B|A)\, p(A)}{p(B)} \; . \tag{5}$$
Sources:
• Koch, Karl-Rudolf (2007): “Rules of Probability”; in: Introduction to Bayesian Statistics, Springer,
Berlin/Heidelberg, 2007, pp. 6/13, eqs. 2.12/2.38; URL: https://www.springer.com/de/book/
9783540727231; DOI: 10.1007/978-3-540-72726-2.
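Bayes' theorem can be checked on any joint probability table (a Python sketch added here, not part of the book; the joint probabilities are made-up numbers):

```python
# joint probabilities over two binary statements A and B (made-up numbers)
p_joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

p_A = sum(v for (ai, _), v in p_joint.items() if ai == 1)   # marginal p(A)
p_B = sum(v for (_, bi), v in p_joint.items() if bi == 1)   # marginal p(B)
p_A_given_B = p_joint[(1, 1)] / p_B                         # conditional p(A|B)
p_B_given_A = p_joint[(1, 1)] / p_A                         # conditional p(B|A)

# Bayes' theorem: p(A|B) = p(B|A) p(A) / p(B)
bayes_rhs = p_B_given_A * p_A / p_B
```

The directly computed conditional probability and the Bayes-theorem right-hand side coincide.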
Proof: Using Bayes’ theorem (→ Proof I/5.3.1), the conditional probabilities (→ Definition I/1.3.4)
on the left are given by
$$p(A_1|B) = \frac{p(B|A_1) \cdot p(A_1)}{p(B)} \tag{2}$$

$$p(A_2|B) = \frac{p(B|A_2) \cdot p(A_2)}{p(B)} \; . \tag{3}$$

Dividing the two conditional probabilities by each other, the marginal probability p(B) cancels out:

$$\frac{p(A_1|B)}{p(A_2|B)} = \frac{p(B|A_1) \cdot p(A_1)}{p(B|A_2) \cdot p(A_2)} \; . \tag{4}$$
Sources:
• Wikipedia (2019): “Bayes’ theorem”; in: Wikipedia, the free encyclopedia, retrieved on 2020-01-06;
URL: https://en.wikipedia.org/wiki/Bayes%27_theorem#Bayes%E2%80%99_rule.
Metadata: ID: P12 | shortcut: bayes-rule | author: JoramSoch | date: 2020-01-06, 20:55.
$$p(y|\lambda, m) = \int p(y|\theta, \lambda, m)\, p(\theta|\lambda, m) \, \mathrm{d}\theta \; , \tag{1}$$
Sources:
• Wikipedia (2021): “Empirical Bayes method”; in: Wikipedia, the free encyclopedia, retrieved on
2021-04-29; URL: https://en.wikipedia.org/wiki/Empirical_Bayes_method#Introduction.
• Bishop CM (2006): “The Evidence Approximation”; in: Pattern Recognition for Machine Learning,
ch. 3.5, pp. 165-172; URL: https://www.springer.com/gp/book/9780387310732.
for Bayesian inference, i.e. obtaining the posterior distribution (from eq. (3)) and approximating the
marginal likelihood (by plugging eq. (3) into eq. (2)).
Sources:
• Wikipedia (2021): “Variational Bayesian methods”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-04-29; URL: https://en.wikipedia.org/wiki/Variational_Bayesian_methods#Evidence_
lower_bound.
• Penny W, Flandin G, Trujillo-Barreto N (2007): “Bayesian Comparison of Spatially Regularised
General Linear Models”; in: Human Brain Mapping, vol. 28, pp. 275–293, eqs. 2-9; URL: https:
//onlinelibrary.wiley.com/doi/full/10.1002/hbm.20327; DOI: 10.1002/hbm.20327.
Probability Distributions
X ∼ U (a, b) , (1)
if and only if each integer between and including a and b occurs with the same probability.
Sources:
• Wikipedia (2020): “Discrete uniform distribution”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-07-28; URL: https://en.wikipedia.org/wiki/Discrete_uniform_distribution.
Metadata: ID: D88 | shortcut: duni | author: JoramSoch | date: 2020-07-28, 04:05.
X ∼ U (a, b) . (1)
Then, the probability mass function (→ Definition I/1.6.1) of X is
$$f_X(x) = \frac{1}{b - a + 1} \quad \text{where} \quad x \in \{a, a+1, \ldots, b-1, b\} \; . \tag{2}$$
Proof: A discrete uniform variable is defined as (→ Definition II/1.1.1) having the same probability
for each integer between and including a and b. The number of integers between and including a and
b is
n=b−a+1 (3)
and because the sum across all probabilities (→ Definition I/1.6.1) is
$$\sum_{x=a}^{b} f_X(x) = 1 \; , \tag{4}$$

we have

$$f_X(x) = \frac{1}{n} = \frac{1}{b - a + 1} \; . \tag{5}$$
Sources:
• original work
Metadata: ID: P140 | shortcut: duni-pmf | author: JoramSoch | date: 2020-07-28, 04:57.
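The proof can be mirrored in a few lines (a Python sketch added here, not part of the book). Tabulating the probability mass function confirms that the probabilities sum to one, and the resulting mean equals (a + b)/2, the known expected value of the discrete uniform distribution:

```python
# discrete uniform distribution over the integers a, ..., b
a, b = 3, 9
n = b - a + 1                           # number of integers in the support
pmf = {x: 1.0 / n for x in range(a, b + 1)}

total = sum(pmf.values())               # probabilities sum to 1
mean = sum(x * p for x, p in pmf.items())  # equals (a + b) / 2
```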
1. UNIVARIATE DISCRETE DISTRIBUTIONS 147
X ∼ U (a, b) . (1)
Then, the cumulative distribution function (→ Definition I/1.6.13) of X is
$$F_X(x) = \begin{cases} 0 \; , & \text{if } x < a \\ \frac{\lfloor x \rfloor - a + 1}{b - a + 1} \; , & \text{if } a \leq x \leq b \\ 1 \; , & \text{if } x > b \; . \end{cases} \tag{2}$$
Proof: The probability mass function of the discrete uniform distribution (→ Proof II/1.1.2) is
$$\mathcal{U}(x; a, b) = \frac{1}{b - a + 1} \quad \text{where} \quad x \in \{a, a+1, \ldots, b-1, b\} \; . \tag{3}$$
Thus, the cumulative distribution function (→ Definition I/1.6.13) is:
$$F_X(x) = \int_{-\infty}^{x} \mathcal{U}(z; a, b) \, \mathrm{d}z \; . \tag{4}$$
From (3), it follows that the cumulative probability increases step-wise by 1/n at each integer between
and including a and b where
n=b−a+1 (5)
is the number of integers between and including a and b. This can be expressed by noting that
$$F_X(x) \overset{(3)}{=} \frac{\lfloor x \rfloor - a + 1}{n} \; , \quad \text{if } a \leq x \leq b \; . \tag{6}$$
Also, because Pr(X < a) = 0, we have

$$F_X(x) \overset{(4)}{=} \int_{-\infty}^{x} 0 \, \mathrm{d}z = 0 \; , \quad \text{if } x < a \; , \tag{7}$$

and for x > b:

$$\begin{split}
F_X(x) &\overset{(4)}{=} \int_{-\infty}^{x} \mathcal{U}(z; a, b) \, \mathrm{d}z \\
&= \int_{-\infty}^{b} \mathcal{U}(z; a, b) \, \mathrm{d}z + \int_{b}^{x} \mathcal{U}(z; a, b) \, \mathrm{d}z \\
&\overset{(6)}{=} F_X(b) + \int_{b}^{x} 0 \, \mathrm{d}z = 1 + 0 \\
&= 1 \; , \quad \text{if } x > b \; .
\end{split} \tag{8}$$
Sources:
• original work
Metadata: ID: P141 | shortcut: duni-cdf | author: JoramSoch | date: 2020-07-28, 05:34.
X ∼ U (a, b) . (1)
Then, the quantile function (→ Definition I/1.6.23) of X is
$$Q_X(p) = \begin{cases} -\infty \; , & \text{if } p = 0 \\ a(1-p) + (b+1)p - 1 \; , & \text{if } p \in \left\{ \frac{1}{n}, \frac{2}{n}, \ldots, \frac{b-a}{n}, 1 \right\} \end{cases} \tag{2}$$
with n = b − a + 1.
Proof: The cumulative distribution function of the discrete uniform distribution (→ Proof II/1.1.3)
is:
$$F_X(x) = \begin{cases} 0 \; , & \text{if } x < a \\ \frac{\lfloor x \rfloor - a + 1}{b - a + 1} \; , & \text{if } a \leq x \leq b \\ 1 \; , & \text{if } x > b \; . \end{cases} \tag{3}$$
The quantile function QX (p) is defined as (→ Definition I/1.6.23) the smallest x, such that FX (x) = p:
$$\begin{split}
Q_X(p) &= a + p \cdot (b - a + 1) - 1 \\
&= a + pb - pa + p - 1 \\
&= a(1-p) + (b+1)p - 1 \; .
\end{split} \tag{7}$$
Sources:
• original work
Metadata: ID: P142 | shortcut: duni-qf | author: JoramSoch | date: 2020-07-28, 06:17.
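The quantile formula can be verified against the cumulative distribution function (a Python sketch added here, not part of the book). At every attainable probability level p = k/n, the formula returns a, a+1, ..., b in turn, i.e. the smallest x with F(x) ≥ p:

```python
import math

a, b = 2, 6
n = b - a + 1

def cdf(x):
    # CDF of the discrete uniform distribution over a, ..., b
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (math.floor(x) - a + 1) / n

def quantile(p):
    # Q_X(p) = a(1-p) + (b+1)p - 1 for p in {1/n, 2/n, ..., 1}
    return a * (1 - p) + (b + 1) * p - 1

# the quantiles at the attainable levels enumerate the support
results = [quantile(k / n) for k in range(1, n + 1)]
```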
X ∼ Bern(p) , (1)
if X = 1 with probability (→ Definition I/1.3.1) p and X = 0 with probability (→ Definition I/1.3.1)
q = 1 − p.
Sources:
• Wikipedia (2020): “Bernoulli distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-22; URL: https://en.wikipedia.org/wiki/Bernoulli_distribution.
Metadata: ID: D44 | shortcut: bern | author: JoramSoch | date: 2020-03-22, 17:40.
X ∼ Bern(p) . (1)
Then, the probability mass function (→ Definition I/1.6.1) of X is
$$f_X(x) = \begin{cases} p \; , & \text{if } x = 1 \\ 1 - p \; , & \text{if } x = 0 \; . \end{cases} \tag{2}$$
Proof: This follows directly from the definition of the Bernoulli distribution (→ Definition II/1.2.1).
Sources:
• original work
Metadata: ID: P96 | shortcut: bern-pmf | author: JoramSoch | date: 2020-05-11, 22:10.
1.2.3 Mean
Theorem: Let X be a random variable (→ Definition I/1.2.2) following a Bernoulli distribution (→
Definition II/1.2.1):
X ∼ Bern(p) . (1)
Then, the mean or expected value (→ Definition I/1.7.1) of X is

$$\mathrm{E}(X) = p \; . \tag{2}$$
Proof: The expected value (→ Definition I/1.7.1) is the probability-weighted average of all possible
values:
$$\mathrm{E}(X) = \sum_{x \in \mathcal{X}} x \cdot \Pr(X = x) \; . \tag{3}$$
Since there are only two possible outcomes for a Bernoulli random variable (→ Proof II/1.2.2), we
have:

$$\mathrm{E}(X) = 0 \cdot \Pr(X = 0) + 1 \cdot \Pr(X = 1) = 0 \cdot (1 - p) + 1 \cdot p = p \; . \tag{4}$$
Sources:
• Wikipedia (2020): “Bernoulli distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
01-16; URL: https://en.wikipedia.org/wiki/Bernoulli_distribution#Mean.
Metadata: ID: P22 | shortcut: bern-mean | author: JoramSoch | date: 2020-01-16, 10:58.
1.2.4 Variance
Theorem: Let X be a random variable (→ Definition I/1.2.2) following a Bernoulli distribution (→
Definition II/1.2.1):
X ∼ Bern(p) . (1)
Then, the variance (→ Definition I/1.8.1) of X is
Var(X) = p (1 − p) . (2)
Proof: The variance (→ Definition I/1.8.1) is the probability-weighted average of the squared devi-
ation from the expected value (→ Definition I/1.7.1) across all possible values
$$\mathrm{Var}(X) = \sum_{x \in \mathcal{X}} \left( x - \mathrm{E}(X) \right)^2 \cdot \Pr(X = x) \tag{3}$$
and can also be written in terms of expected values (→ Proof I/1.8.3):

$$\mathrm{Var}(X) = \mathrm{E}(X^2) - \mathrm{E}(X)^2 \; . \tag{4}$$

The mean of a Bernoulli random variable (→ Proof II/1.2.3) is

$$\mathrm{E}(X) = p \tag{5}$$

and the expected value of its square is

$$\mathrm{E}(X^2) = 0^2 \cdot \Pr(X = 0) + 1^2 \cdot \Pr(X = 1) = 0 \cdot (1 - p) + 1 \cdot p = p \; . \tag{6}$$
Combining (4), (5) and (6), we have:
$$\mathrm{Var}(X) = p - p^2 = p\,(1 - p) \; . \tag{7}$$
Sources:
• Wikipedia (2022): “Bernoulli distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
01-20; URL: https://en.wikipedia.org/wiki/Bernoulli_distribution#Variance.
Metadata: ID: P301 | shortcut: bern-var | author: JoramSoch | date: 2022-01-20, 15:06.
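The computation can be replicated directly from the probability mass function (a Python sketch added here, not part of the book):

```python
# Bernoulli distribution with success probability p
p = 0.3
support = {0: 1 - p, 1: p}

mean = sum(x * pr for x, pr in support.items())        # E(X) = p
second = sum(x ** 2 * pr for x, pr in support.items())  # E(X^2) = p
var = second - mean ** 2                                # p - p^2 = p (1 - p)
```

Both the mean and the variance agree with the closed-form results of equations (5)-(7).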
X ∼ Bern(p) . (1)
Then, the variance (→ Definition I/1.8.1) of X is necessarily between 0 and 1/4:
$$0 \leq \mathrm{Var}(X) \leq \frac{1}{4} \; . \tag{2}$$
The first derivative of the variance (→ Proof II/1.2.4) Var(p) = p (1 − p) with respect to p is

$$\frac{\mathrm{d}\,\mathrm{Var}(p)}{\mathrm{d}p} = -2p + 1 \tag{5}$$

and setting this derivative to zero

$$\frac{\mathrm{d}\,\mathrm{Var}(p_M)}{\mathrm{d}p} = 0 \quad \Leftrightarrow \quad 0 = -2\, p_M + 1 \quad \Leftrightarrow \quad p_M = \frac{1}{2} \; , \tag{6}$$
we obtain the maximum possible variance

$$\max\left[\mathrm{Var}(X)\right] = \mathrm{Var}(p_M) = -\left(\frac{1}{2}\right)^2 + \frac{1}{2} = \frac{1}{4} \; . \tag{7}$$
The function Var(p) is monotonically increasing for 0 < p < p_M, as dVar(p)/dp > 0 in this interval,
and monotonically decreasing for p_M < p < 1, as dVar(p)/dp < 0 in this interval. Moreover, as the
variance is always non-negative (→ Proof I/1.8.4), the minimum possible variance is

$$\min\left[\mathrm{Var}(X)\right] = 0 \; , \tag{8}$$

attained at p = 0 and p = 1.
Sources:
• Wikipedia (2022): “Bernoulli distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
01-27; URL: https://en.wikipedia.org/wiki/Bernoulli_distribution#Variance.
Metadata: ID: P303 | shortcut: bern-varrange | author: JoramSoch | date: 2022-01-27, 09:03.
X ∼ Bern(p) . (1)
Then, the (Shannon) entropy (→ Definition I/2.1.1) of X in bits is

$$\mathrm{H}(X) = -p \log_2 p - (1 - p) \log_2 (1 - p) \; . \tag{2}$$

Proof: The entropy (→ Definition I/2.1.1) is defined as the probability-weighted average of the
logarithmized probabilities of all possible values:

$$\mathrm{H}(X) = -\sum_{x \in \mathcal{X}} p(x) \cdot \log_b p(x) \; . \tag{3}$$

Entropy is measured in bits by setting b = 2. Since there are only two possible outcomes for a
Bernoulli random variable (→ Proof II/1.2.2), we have:

$$\mathrm{H}(X) = -p \cdot \log_2 p - (1 - p) \cdot \log_2 (1 - p) \; . \tag{4}$$
Sources:
• Wikipedia (2022): “Bernoulli distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
09-02; URL: https://en.wikipedia.org/wiki/Bernoulli_distribution.
• Wikipedia (2022): “Binary entropy function”; in: Wikipedia, the free encyclopedia, retrieved on
2022-09-02; URL: https://en.wikipedia.org/wiki/Binary_entropy_function.
Metadata: ID: P334 | shortcut: bern-ent | author: JoramSoch | date: 2022-09-02, 12:21.
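As a numerical illustration (not part of the original proof), the binary entropy function can be evaluated from the general definition; that a fair coin carries exactly one bit is a standard special case:

```python
import math

# Hedged sketch: entropy of Bern(p) in bits, computed from the general
# definition H(X) = -sum p(x) * log2 p(x) over the two outcomes.

def entropy_bits(probs):
    """Shannon entropy in bits of a finite discrete distribution."""
    return -sum(q * math.log2(q) for q in probs if q > 0)

def bernoulli_entropy(p):
    return entropy_bits([1 - p, p])

assert abs(bernoulli_entropy(0.5) - 1.0) < 1e-12                      # fair coin: 1 bit
assert abs(bernoulli_entropy(0.1) - bernoulli_entropy(0.9)) < 1e-12   # symmetric in p
```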
1. UNIVARIATE DISCRETE DISTRIBUTIONS 153
X ∼ Bin(n, p) , (1)
if X is the number of successes observed in n independent (→ Definition I/1.3.6) trials, where each
trial has two possible outcomes (→ Definition II/1.2.1) (success/failure) and the probability of success
and failure are identical across trials (p and q = 1 − p).
Sources:
• Wikipedia (2020): “Binomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-22; URL: https://en.wikipedia.org/wiki/Binomial_distribution.
Metadata: ID: D45 | shortcut: bin | author: JoramSoch | date: 2020-03-22, 17:52.
X ∼ Bin(n, p) . (1)
Then, the probability mass function (→ Definition I/1.6.1) of X is
f_X(x) = C(n,x) · p^x (1 − p)^(n−x) , (2)
where C(n,x) denotes the binomial coefficient “n choose x”.
Proof: A binomial variable is defined as (→ Definition II/1.3.1) the number of successes observed
in n independent (→ Definition I/1.3.6) trials, where each trial has two possible outcomes (→ Defi-
nition II/1.2.1) (success/failure) and the probability (→ Definition I/1.3.1) of success and failure are
identical across trials (p, q = 1 − p).
If one has obtained x successes in n trials, one has also obtained (n − x) failures. The probability of
a particular series of x successes and (n − x) failures, when order does matter, is
px (1 − p)n−x . (3)
When order does not matter, there is a number of series consisting of x successes and (n − x) failures.
This number is equal to the number of possibilities in which x objects can be chosen from n objects
which is given by the binomial coefficient:
C(n,x) . (4)
In order to obtain the probability of x successes and (n − x) failures, when order does not matter,
the probability in (3) has to be multiplied with the number of possibilities in (4) which gives
p(X = x|n, p) = C(n,x) · p^x (1 − p)^(n−x) . (5)
Sources:
• original work
Metadata: ID: P97 | shortcut: bin-pmf | author: JoramSoch | date: 2020-05-11, 22:35.
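A small numerical sanity check (an illustration, not part of the proof): the pmf in (2), implemented with Python's math.comb, sums to one over x = 0, …, n; n = 10 and p = 0.3 are arbitrary example values:

```python
import math

# Hedged sketch: binomial pmf from equation (2); n and p are example values.

def binom_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.3
total = sum(binom_pmf(x, n, p) for x in range(n + 1))
assert abs(total - 1.0) < 1e-12                       # probabilities sum to one
assert abs(binom_pmf(0, n, p) - (1 - p)**n) < 1e-15   # x = 0: all trials fail
```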
X ∼ Bin(n, p) . (1)
Then, the probability-generating function (→ Definition I/1.6.31) of X is
G_X(z) = (1 − p + pz)^n . (2)

Proof: The probability-generating function (→ Definition I/1.6.31) of a discrete random variable is defined as
G_X(z) = Σ_{x=0}^{∞} f_X(x) z^x . (3)
With the probability mass function of the binomial distribution (→ Proof II/1.3.2)
f_X(x) = C(n,x) · p^x (1 − p)^(n−x) , (4)
we obtain:
G_X(z) = Σ_{x=0}^{n} C(n,x) p^x (1 − p)^(n−x) z^x
= Σ_{x=0}^{n} C(n,x) (pz)^x (1 − p)^(n−x) . (5)
By the binomial theorem, this sum evaluates to
G_X(z) = (pz + (1 − p))^n = (1 − p + pz)^n , (6)
which completes the proof.
Sources:
• ProofWiki (2022): “Probability Generating Function of Binomial Distribution”; in: ProofWiki, re-
trieved on 2022-10-11; URL: https://proofwiki.org/wiki/Probability_Generating_Function_of_
Binomial_Distribution.
Metadata: ID: P363 | shortcut: bin-pgf | author: JoramSoch | date: 2022-10-11, 09:25.
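As a numerical check of equation (5) (illustration only), the series form of the probability-generating function can be compared against the closed form (1 − p + pz)^n given by the binomial theorem; the parameter values are arbitrary:

```python
import math

# Hedged sketch: compare the series in (5) with the closed form from the
# binomial theorem; n, p and the z values are arbitrary examples.

def pgf_series(z, n, p):
    return sum(math.comb(n, x) * (p * z)**x * (1 - p)**(n - x)
               for x in range(n + 1))

n, p = 7, 0.4
for z in (0.0, 0.5, 1.0, 2.0):
    assert abs(pgf_series(z, n, p) - (1 - p + p * z)**n) < 1e-10
```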
1.3.4 Mean
Theorem: Let X be a random variable (→ Definition I/1.2.2) following a binomial distribution (→
Definition II/1.3.1):
X ∼ Bin(n, p) . (1)
Then, the mean or expected value (→ Definition I/1.7.1) of X is
E(X) = np . (2)
Proof: By definition, a binomial random variable (→ Definition II/1.3.1) is the sum of n independent
and identical (→ Definition “iid”) Bernoulli trials (→ Definition II/1.2.1) with success probability p.
Therefore, writing X = Σ_{i=1}^{n} X_i with X_i ∼ Bern(p) and using the linearity of the expected value, the expected value is
E(X) = Σ_{i=1}^{n} E(X_i) . (4)
With the expected value of the Bernoulli distribution (→ Proof II/1.2.3), we have:
E(X) = Σ_{i=1}^{n} p = np . (5)
Sources:
• Wikipedia (2020): “Binomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2020-01-16; URL: https://en.wikipedia.org/wiki/Binomial_distribution#Expected_value_and_
variance.
Metadata: ID: P23 | shortcut: bin-mean | author: JoramSoch | date: 2020-01-16, 11:06.
1.3.5 Variance
Theorem: Let X be a random variable (→ Definition I/1.2.2) following a binomial distribution (→
Definition II/1.3.1):
X ∼ Bin(n, p) . (1)
Then, the variance (→ Definition I/1.8.1) of X is
Var(X) = np (1 − p) . (2)
Proof: By definition, a binomial random variable (→ Definition II/1.3.1) is the sum of n independent
and identical (→ Definition “iid”) Bernoulli trials (→ Definition II/1.2.1) with success probability p.
Therefore, writing X = Σ_{i=1}^{n} X_i with independent X_i ∼ Bern(p), the variance is the sum of the individual variances. With the variance of the Bernoulli distribution (→ Proof II/1.2.4), Var(X_i) = p (1 − p), we have:
Var(X) = Σ_{i=1}^{n} p (1 − p) = np (1 − p) . (3)
Sources:
• Wikipedia (2022): “Binomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-01-20; URL: https://en.wikipedia.org/wiki/Binomial_distribution#Expected_value_and_
variance.
Metadata: ID: P302 | shortcut: bin-var | author: JoramSoch | date: 2022-01-20, 15:19.
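The closed forms E(X) = np and Var(X) = np (1 − p) can be checked numerically against a direct pmf-based computation (illustration only; n = 12 and p = 0.25 are example values):

```python
import math

# Hedged sketch: compute mean and variance of Bin(n, p) directly from the
# pmf and compare with np and np(1 - p); parameter values are examples.

def binom_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 12, 0.25
mean = sum(x * binom_pmf(x, n, p) for x in range(n + 1))
var = sum((x - mean)**2 * binom_pmf(x, n, p) for x in range(n + 1))
assert abs(mean - n * p) < 1e-10
assert abs(var - n * p * (1 - p)) < 1e-10
```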
X ∼ Bin(n, p) . (1)
Then, the variance (→ Definition I/1.8.1) of X is necessarily between 0 and n/4:
0 ≤ Var(X) ≤ n/4 . (2)
Proof: By definition, a binomial random variable (→ Definition II/1.3.1) is the sum of n independent
and identical (→ Definition “iid”) Bernoulli trials (→ Definition II/1.2.1) with success probability p.
Therefore, writing X = Σ_{i=1}^{n} X_i with independent X_i ∼ Bern(p), the variance is the sum of the individual variances:
Var(X) = Σ_{i=1}^{n} Var(X_i) . (3)
As the variance of a Bernoulli random variable is always between 0 and 1/4 (→ Proof II/1.2.5)
0 ≤ Var(X_i) ≤ 1/4 for all i = 1, . . . , n , (5)
the minimum variance of X is min[Var(X)] = n · 0 = 0 and the maximum variance of X is max[Var(X)] = n · 1/4 = n/4, which completes the proof.
Sources:
• original work
Metadata: ID: P304 | shortcut: bin-varrange | author: JoramSoch | date: 2022-01-27, 09:20.
X ∼ Bin(n, p) . (1)
Then, the (Shannon) entropy (→ Definition I/2.1.1) of X in bits is
H(X) = − ⟨log₂ C(n,x)⟩_{p(x)} + n · [−p · log₂ p − (1 − p) · log₂(1 − p)] . (2)
Proof: The entropy (→ Definition I/2.1.1) is defined as the probability-weighted average of the
logarithmized probabilities for all possible values:
H(X) = − Σ_{x∈X} p(x) · log_b p(x) . (5)
Entropy is measured in bits by setting b = 2. Then, with the probability mass function of the binomial
distribution (→ Proof II/1.3.2), we have:
H(X) = − Σ_{x∈X} f_X(x) · log₂ f_X(x)
= − Σ_{x=0}^{n} C(n,x) p^x (1 − p)^(n−x) · log₂ [ C(n,x) p^x (1 − p)^(n−x) ]
= − Σ_{x=0}^{n} C(n,x) p^x (1 − p)^(n−x) · [ log₂ C(n,x) + x · log₂ p + (n − x) · log₂(1 − p) ] (6)
= − Σ_{x=0}^{n} C(n,x) p^x (1 − p)^(n−x) · [ log₂ C(n,x) + x · log₂ p + n · log₂(1 − p) − x · log₂(1 − p) ] .
Since the first factor in the sum corresponds to the probability mass (→ Definition I/1.6.1) of X = x,
we can rewrite this as the sum of the expected values (→ Definition I/1.7.1) of the functions (→
Proof I/1.7.12) of the discrete random variable (→ Definition I/1.2.6) x in the square bracket:
H(X) = − ⟨log₂ C(n,x)⟩_{p(x)} − ⟨x · log₂ p⟩_{p(x)} − ⟨n · log₂(1 − p)⟩_{p(x)} + ⟨x · log₂(1 − p)⟩_{p(x)}
= − ⟨log₂ C(n,x)⟩_{p(x)} − log₂ p · ⟨x⟩_{p(x)} − n · log₂(1 − p) + log₂(1 − p) · ⟨x⟩_{p(x)} . (7)
Using the expected value of the binomial distribution (→ Proof II/1.3.4), i.e. X ∼ Bin(n, p) ⇒ ⟨x⟩ =
np, this gives:
H(X) = − ⟨log₂ C(n,x)⟩_{p(x)} − np · log₂ p − n · log₂(1 − p) + np · log₂(1 − p)
= − ⟨log₂ C(n,x)⟩_{p(x)} + n · [−p · log₂ p − (1 − p) · log₂(1 − p)] . (8)
Finally, we note that the first term is the negative expected value (→ Definition I/1.7.1) of the
logarithm of a binomial coefficient (→ Proof II/1.3.2) and that the term in square brackets is the
entropy of the Bernoulli distribution (→ Proof II/1.2.6), such that we finally get:
H(X) = − ⟨log₂ C(n,x)⟩_{p(x)} + n · H(Bern(p)) . (9)
Sources:
• original work
Metadata: ID: P335 | shortcut: bin-ent | author: JoramSoch | date: 2022-09-02, 13:52.
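The decomposition in (8) can be verified numerically (illustration only): the entropy of Bin(n, p) computed directly from the pmf equals the negative expected log-binomial-coefficient plus n times the Bernoulli entropy; n = 9 and p = 0.35 are example values:

```python
import math

# Hedged sketch: check H(Bin(n, p)) = -E[log2 C(n, x)] + n * H(Bern(p)).

def binom_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 9, 0.35
direct = -sum(binom_pmf(x, n, p) * math.log2(binom_pmf(x, n, p))
              for x in range(n + 1))
e_log_coeff = sum(binom_pmf(x, n, p) * math.log2(math.comb(n, x))
                  for x in range(n + 1))
h_bern = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
assert abs(direct - (-e_log_coeff + n * h_bern)) < 1e-10
```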
Y |X ∼ Bin(X, q) (1)
and X also follows a binomial distribution (→ Definition II/1.3.1), but with a different success probability (→ Definition II/1.3.1):
X ∼ Bin(n, p) . (2)
Then, the marginal distribution (→ Definition I/1.5.3) of Y unconditional on X is again a binomial distribution (→ Definition II/1.3.1):
Y ∼ Bin(n, p·q) . (3)
Proof: We are interested in the probability that Y equals a number m. According to the law of
marginal probability (→ Definition I/1.3.3) or the law of total probability (→ Proof I/1.4.7), this
probability can be expressed as:
Pr(Y = m) = Σ_{k=0}^{∞} Pr(Y = m|X = k) · Pr(X = k) . (4)
Since, by definitions (2) and (1), Pr(X = k) = 0 when k > n and Pr(Y = m|X = k) = 0 when
k < m, we have:
Pr(Y = m) = Σ_{k=m}^{n} Pr(Y = m|X = k) · Pr(X = k) . (5)
Now we can take the probability mass function of the binomial distribution (→ Proof II/1.3.2) and
plug it in for the terms in the sum of (5) to get:
Pr(Y = m) = Σ_{k=m}^{n} C(k,m) q^m (1 − q)^(k−m) · C(n,k) p^k (1 − p)^(n−k) . (6)
Applying the binomial coefficient identity C(n,k) · C(k,m) = C(n,m) · C(n−m, k−m) and rearranging the terms, we have:
Pr(Y = m) = Σ_{k=m}^{n} C(n,m) C(n−m, k−m) p^k q^m (1 − p)^(n−k) (1 − q)^(k−m) . (7)
Now we partition p^k = p^m · p^(k−m) and pull all terms that do not depend on k out of the sum:
Pr(Y = m) = C(n,m) p^m q^m Σ_{k=m}^{n} C(n−m, k−m) p^(k−m) (1 − p)^(n−k) (1 − q)^(k−m)
= C(n,m) (pq)^m Σ_{k=m}^{n} C(n−m, k−m) (p(1 − q))^(k−m) (1 − p)^(n−k) . (8)
Substituting i = k − m, the sum index runs from 0 to n − m:
Pr(Y = m) = C(n,m) (pq)^m Σ_{i=0}^{n−m} C(n−m, i) (p(1 − q))^i (1 − p)^(n−m−i) . (9)
By the binomial theorem, the remaining sum has a closed form:
Σ_{i=0}^{n−m} C(n−m, i) (p − pq)^i (1 − p)^(n−m−i) = ((p − pq) + (1 − p))^(n−m) . (11)
Thus, (9) develops into
Pr(Y = m) = C(n,m) (pq)^m (p − pq + 1 − p)^(n−m)
= C(n,m) (pq)^m (1 − pq)^(n−m) (12)
which is the probability mass function of the binomial distribution (→ Proof II/1.3.2) with parameters
n and pq, such that
Y ∼ Bin(n, pq) . (13)
Sources:
• Wikipedia (2022): “Binomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
10-07; URL: https://en.wikipedia.org/wiki/Binomial_distribution#Conditional_binomials.
Metadata: ID: P358 | shortcut: bin-margcond | author: JoramSoch | date: 2022-10-07, 21:03.
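The result can be checked numerically (illustration only) by evaluating the sum in equation (5) directly and comparing with the pmf of Bin(n, pq); n = 8, p = 0.6 and q = 0.5 are arbitrary example values:

```python
import math

# Hedged sketch: if X ~ Bin(n, p) and Y | X ~ Bin(X, q), the marginal of Y
# should match Bin(n, p*q); checked by direct summation over k.

def binom_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

n, p, q = 8, 0.6, 0.5
for m in range(n + 1):
    marginal = sum(binom_pmf(m, k, q) * binom_pmf(k, n, p)
                   for k in range(m, n + 1))
    assert abs(marginal - binom_pmf(m, n, p * q)) < 1e-12
```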
p ∼ Bet(α, β) (1)
and let X be a random variable (→ Definition I/1.2.2) following a binomial distribution (→ Definition
II/1.3.1) conditional on p
X | p ∼ Bin(n, p) . (2)
Then, the marginal distribution (→ Definition I/1.5.3) of X is called a beta-binomial distribution
X ∼ BetBin(n, α, β) (3)
with number of trials (→ Definition II/1.3.1) n and shape parameters (→ Definition II/3.9.1) α and
β.
Sources:
• Wikipedia (2022): “Beta-binomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-10-20; URL: https://en.wikipedia.org/wiki/Beta-binomial_distribution#Motivation_and_
derivation.
Metadata: ID: D177 | shortcut: betabin | author: JoramSoch | date: 2022-10-20, 08:09.
X ∼ BetBin(n, α, β) . (1)
Then, the probability mass function (→ Definition I/1.6.1) of X is
f_X(x) = C(n,x) · B(α + x, β + n − x) / B(α, β) (2)
where B(x, y) is the beta function.
X | p ∼ Bin(n, p)
(3)
p ∼ Bet(α, β) .
Thus, we can combine the law of marginal probability (→ Definition I/1.3.3) and the law of con-
ditional probability (→ Definition I/1.3.4) to derive the probability (→ Definition I/1.3.1) of X
as
p(x) = ∫_P p(x, p) dp
= ∫_P p(x|p) p(p) dp . (4)
Now, we can plug in the probability mass function of the binomial distribution (→ Proof II/1.3.2)
and the probability density function of the beta distribution (→ Proof II/3.9.3) to get
p(x) = ∫₀¹ C(n,x) p^x (1 − p)^(n−x) · 1/B(α, β) · p^(α−1) (1 − p)^(β−1) dp
= C(n,x) · 1/B(α, β) · ∫₀¹ p^(α+x−1) (1 − p)^(β+n−x−1) dp (5)
= C(n,x) · B(α + x, β + n − x)/B(α, β) · ∫₀¹ 1/B(α + x, β + n − x) · p^(α+x−1) (1 − p)^(β+n−x−1) dp .
Finally, we recognize that the integrand is equal to the probability density function of a beta distri-
bution (→ Proof II/3.9.3) and because probability density integrates to one (→ Definition I/1.6.6),
we have
p(x) = C(n,x) · B(α + x, β + n − x) / B(α, β) = f_X(x) . (6)
This completes the proof.
Sources:
• Wikipedia (2022): “Beta-binomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-10-20; URL: https://en.wikipedia.org/wiki/Beta-binomial_distribution#As_a_compound_
distribution.
Metadata: ID: P364 | shortcut: betabin-pmf | author: JoramSoch | date: 2022-10-20, 08:56.
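As a numerical illustration (not part of the proof), the pmf in (2) can be implemented with the beta function expressed through log-gamma values for numerical stability; it should sum to one over x = 0, …, n; the parameter values are arbitrary examples:

```python
import math

# Hedged sketch: beta-binomial pmf from equation (2); the beta function is
# evaluated via math.lgamma; n, alpha, beta are arbitrary example values.

def beta_fn(a, b):
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

def betabin_pmf(x, n, alpha, beta):
    return math.comb(n, x) * beta_fn(alpha + x, beta + n - x) / beta_fn(alpha, beta)

n, alpha, beta = 10, 2.0, 3.0
total = sum(betabin_pmf(x, n, alpha, beta) for x in range(n + 1))
assert abs(total - 1.0) < 1e-10
```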
X ∼ BetBin(n, α, β) . (1)
Then, the probability mass function (→ Definition I/1.6.1) of X can be expressed in terms of the gamma function as
f_X(x) = Γ(n + 1) / [Γ(x + 1) Γ(n − x + 1)] · Γ(α + x) Γ(β + n − x) / Γ(α + β + n) · Γ(α + β) / [Γ(α) Γ(β)] . (2)
Proof: The probability mass function of the beta-binomial distribution (→ Proof II/1.4.2) is given
by
f_X(x) = C(n,x) · B(α + x, β + n − x) / B(α, β) . (3)
Note that the binomial coefficient can be expressed in terms of factorials
C(n,x) = n! / [x! (n − x)!] , (4)
that factorials are related to the gamma function via n! = Γ(n + 1)
n! / [x! (n − x)!] = Γ(n + 1) / [Γ(x + 1) Γ(n − x + 1)] (5)
and that the beta function is related to the gamma function via
B(α, β) = Γ(α) Γ(β) / Γ(α + β) . (6)
Applying (4), (5) and (6) to (3), we get
f_X(x) = Γ(n + 1) / [Γ(x + 1) Γ(n − x + 1)] · Γ(α + x) Γ(β + n − x) / Γ(α + β + n) · Γ(α + β) / [Γ(α) Γ(β)] , (7)
which completes the proof.
Sources:
• Wikipedia (2022): “Beta-binomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-10-20; URL: https://en.wikipedia.org/wiki/Beta-binomial_distribution#As_a_compound_
distribution.
Metadata: ID: P365 | shortcut: betabin-pmfitogf | author: JoramSoch | date: 2022-10-20, 08:56.
X ∼ BetBin(n, α, β) . (1)
Then, the cumulative distribution function (→ Definition I/1.6.13) of X is
F_X(x) = 1/B(α, β) · Γ(n + 1)/Γ(α + β + n) · Σ_{i=0}^{x} [Γ(α + i) · Γ(β + n − i)] / [Γ(i + 1) · Γ(n − i + 1)] (2)
where B(x, y) is the beta function and Γ(x) is the gamma function.
With the probability mass function of the beta-binomial distribution (→ Proof II/1.4.2), this becomes
F_X(x) = Σ_{i=0}^{x} C(n,i) · B(α + i, β + n − i) / B(α, β) . (5)
Using the expression of binomial coefficients in terms of factorials
C(n,k) = n! / [k! (n − k)!] , (6)
the relationship between factorials and the gamma function
n! = Γ(n + 1) (7)
and the link between gamma function and beta function
B(α, β) = Γ(α) Γ(β) / Γ(α + β) , (8)
equation (5) can be further developed as follows:
F_X(x) (6)= 1/B(α, β) · Σ_{i=0}^{x} n! / [i! (n − i)!] · B(α + i, β + n − i)
(8)= 1/B(α, β) · Σ_{i=0}^{x} n! / [i! (n − i)!] · Γ(α + i) · Γ(β + n − i) / Γ(α + β + n)
= 1/B(α, β) · n!/Γ(α + β + n) · Σ_{i=0}^{x} Γ(α + i) · Γ(β + n − i) / [i! (n − i)!] (9)
(7)= 1/B(α, β) · Γ(n + 1)/Γ(α + β + n) · Σ_{i=0}^{x} Γ(α + i) · Γ(β + n − i) / [Γ(i + 1) · Γ(n − i + 1)] .
Sources:
• original work
Metadata: ID: P366 | shortcut: betabin-cdf | author: JoramSoch | date: 2022-10-22, 05:28.
X ∼ Poiss(λ) , (1)
if and only if its probability mass function (→ Definition I/1.6.1) is given by
Poiss(x; λ) = λ^x e^(−λ) / x! (2)
where x ∈ N0 and λ > 0.
Sources:
• Wikipedia (2020): “Poisson distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
05-25; URL: https://en.wikipedia.org/wiki/Poisson_distribution#Definitions.
Metadata: ID: D62 | shortcut: poiss | author: JoramSoch | date: 2020-05-25, 23:34.
X ∼ Poiss(λ) . (1)
Then, the probability mass function (→ Definition I/1.6.1) of X is
f_X(x) = λ^x e^(−λ) / x! , x ∈ N0 . (2)
Proof: This follows directly from the definition of the Poisson distribution (→ Definition II/1.5.1).
Sources:
• original work
Metadata: ID: P102 | shortcut: poiss-pmf | author: JoramSoch | date: 2020-05-14, 20:39.
1.5.3 Mean
Theorem: Let X be a random variable (→ Definition I/1.2.2) following a Poisson distribution (→
Definition II/1.5.1):
X ∼ Poiss(λ) . (1)
Then, the mean or expected value (→ Definition I/1.7.1) of X is
E(X) = λ . (2)
Proof: The expected value of a discrete random variable (→ Definition I/1.7.1) is defined as
E(X) = Σ_{x∈X} x · f_X(x) , (3)
such that, with the probability mass function of the Poisson distribution (→ Proof II/1.5.2), we have:
E(X) = Σ_{x=0}^{∞} x · λ^x e^(−λ) / x!
= Σ_{x=1}^{∞} x · λ^x e^(−λ) / x!
= e^(−λ) · Σ_{x=1}^{∞} x · λ^x / x! (4)
= λ e^(−λ) · Σ_{x=1}^{∞} λ^(x−1) / (x − 1)! .
Substituting m = x − 1 and using the power series of the exponential function, Σ_{m=0}^{∞} λ^m / m! = e^λ, we finally obtain
E(X) = λ e^(−λ) · e^λ = λ . (7)
Sources:
• ProofWiki (2020): “Expectation of Poisson Distribution”; in: ProofWiki, retrieved on 2020-08-19;
URL: https://proofwiki.org/wiki/Expectation_of_Poisson_Distribution.
Metadata: ID: P151 | shortcut: poiss-mean | author: JoramSoch | date: 2020-08-19, 06:09.
1.5.4 Variance
Theorem: Let X be a random variable (→ Definition I/1.2.2) following a Poisson distribution (→
Definition II/1.5.1):
X ∼ Poiss(λ) . (1)
Then, the variance (→ Definition I/1.8.1) of X is
Var(X) = λ . (2)
Proof: The variance (→ Definition I/1.8.1) can be expressed in terms of expected values (→ Proof
I/1.8.3) as
Var(X) = E(X²) − E(X)² . (3)
The mean of the Poisson distribution (→ Proof II/1.5.3) is
E(X) = λ . (4)
Let us now consider the expectation (→ Definition I/1.7.1) of X (X − 1) which is defined as
E[X (X − 1)] = Σ_{x∈X} x (x − 1) · f_X(x) , (5)
such that, with the probability mass function of the Poisson distribution (→ Proof II/1.5.2), we have:
E[X (X − 1)] = Σ_{x=0}^{∞} x (x − 1) · λ^x e^(−λ) / x!
= Σ_{x=2}^{∞} x (x − 1) · λ^x e^(−λ) / x!
= e^(−λ) · Σ_{x=2}^{∞} x (x − 1) · λ^x / [x · (x − 1) · (x − 2)!] (6)
= λ² e^(−λ) · Σ_{x=2}^{∞} λ^(x−2) / (x − 2)! .
Substituting m = x − 2 and using Σ_{m=0}^{∞} λ^m / m! = e^λ, this gives E[X (X − 1)] = λ² e^(−λ) · e^λ = λ², such that E(X²) = E[X (X − 1)] + E(X) = λ² + λ. Plugging this into (3), we obtain
Var(X) = λ² + λ − λ² = λ . (12)
Sources:
• jbstatistics (2013): “The Poisson Distribution: Mathematically Deriving the Mean and Variance”;
in: YouTube, retrieved on 2021-04-29; URL: https://www.youtube.com/watch?v=65n_v92JZeE.
Metadata: ID: P230 | shortcut: poiss-var | author: JoramSoch | date: 2021-04-29, 09:59.
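Both results, E(X) = λ and Var(X) = λ, can be checked with a truncated series (illustration only; λ = 3.5 and the cutoff are arbitrary choices, the truncated tail mass being negligible):

```python
import math

# Hedged sketch: truncated-series check of E(X) = Var(X) = lambda for
# X ~ Poiss(lambda); the cutoff leaves only negligible tail mass.

def poiss_pmf(x, lam):
    return lam**x * math.exp(-lam) / math.factorial(x)

lam, cutoff = 3.5, 100
mean = sum(x * poiss_pmf(x, lam) for x in range(cutoff))
var = sum((x - mean)**2 * poiss_pmf(x, lam) for x in range(cutoff))
assert abs(mean - lam) < 1e-9
assert abs(var - lam) < 1e-9
```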
X ∼ Cat([p1 , . . . , pk ]) , (1)
if X = ei with probability (→ Definition I/1.3.1) pi for all i = 1, . . . , k, where ei is the i-th elementary
row vector, i.e. a 1 × k vector of zeros with a one in i-th position.
Sources:
• Wikipedia (2020): “Categorical distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2020-03-22; URL: https://en.wikipedia.org/wiki/Categorical_distribution.
Metadata: ID: D46 | shortcut: cat | author: JoramSoch | date: 2020-03-22, 18:09.
X ∼ Cat([p1 , . . . , pk ]) . (1)
Then, the probability mass function (→ Definition I/1.6.1) of X is
f_X(x) = p_i , if x = e_i (i = 1, . . . , k) . (2)
Proof: This follows directly from the definition of the categorical distribution (→ Definition II/2.1.1).
Sources:
• original work
Metadata: ID: P98 | shortcut: cat-pmf | author: JoramSoch | date: 2020-05-11, 22:58.
2.1.3 Mean
Theorem: Let X be a random vector (→ Definition I/1.2.3) following a categorical distribution (→
Definition II/2.1.1):
X ∼ Cat([p1 , . . . , pk ]) . (1)
Then, the mean or expected value (→ Definition I/1.7.1) of X is
E(X) = [p_1 , . . . , p_k] . (2)

Proof: Directly computing the probability-weighted average over all possible values, we have:
2. MULTIVARIATE DISCRETE DISTRIBUTIONS 169
E(X) = Σ_{x∈X} x · Pr(X = x)
= Σ_{i=1}^{k} e_i · Pr(X = e_i)
= Σ_{i=1}^{k} e_i · p_i (3)
= [p_1 , . . . , p_k] .
Sources:
• original work
Metadata: ID: P24 | shortcut: cat-mean | author: JoramSoch | date: 2020-01-16, 11:17.
2.1.4 Covariance
Theorem: Let X be a random vector (→ Definition I/1.2.3) following a categorical distribution (→
Definition II/2.1.1):
X ∼ Cat(p) . (1)
Then, the covariance matrix (→ Definition I/1.9.9) of X has the entries
Cov(X)_ii = p_i (1 − p_i) and Cov(X)_ij = −p_i p_j for i ≠ j . (2)
Proof: The categorical distribution (→ Definition II/2.1.1) is a special case of the multinomial
distribution (→ Definition II/2.2.1) in which n = 1. Applying the covariance of the multinomial distribution (→ Proof II/2.2.4) with n = 1 yields the stated result.
Sources:
• original work
Metadata: ID: P338 | shortcut: cat-cov | author: JoramSoch | date: 2022-09-09, 16:57.
X ∼ Cat(p) . (1)
Then, the (Shannon) entropy (→ Definition I/2.1.1) of X is
H(X) = − Σ_{i=1}^{k} p_i · log p_i . (2)
Proof: The entropy (→ Definition I/2.1.1) is defined as the probability-weighted average of the
logarithmized probabilities for all possible values:
H(X) = − Σ_{x∈X} p(x) · log_b p(x) . (3)
Since there are k possible values for a categorical random vector (→ Definition II/2.1.1) with probabilities given by the entries (→ Proof II/2.1.2) of the 1 × k vector p, we have:
H(X) = − Σ_{i=1}^{k} Pr(X = e_i) · log Pr(X = e_i) = − Σ_{i=1}^{k} p_i · log p_i . (4)
Sources:
• original work
Metadata: ID: P336 | shortcut: cat-ent | author: JoramSoch | date: 2022-09-09, 15:41.
Sources:
• Wikipedia (2020): “Multinomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-22; URL: https://en.wikipedia.org/wiki/Multinomial_distribution.
Metadata: ID: D47 | shortcut: mult | author: JoramSoch | date: 2020-03-22, 17:52.
Proof: A multinomial variable is defined as (→ Definition II/2.2.1) a vector of the numbers of obser-
vations belonging to k distinct categories in n independent (→ Definition I/1.3.6) trials, where each
trial has k possible outcomes (→ Definition II/2.1.1) and the category probabilities (→ Definition
I/1.3.1) are identical across trials.
The probability of a particular series of x1 observations for category 1, x2 observations for category
2 etc., when order does matter, is
∏_{i=1}^{k} p_i^(x_i) . (3)
When order does not matter, there is a number of series consisting of x1 observations for category
1, ..., xk observations for category k. This number is equal to the number of possibilities in which x1
category 1 objects, ..., xk category k objects can be distributed in a sequence of n objects which is
given by the multinomial coefficient that can be expressed in terms of factorials:
C(n; x_1, . . . , x_k) = n! / [x_1! · . . . · x_k!] . (4)
In order to obtain the probability of x1 observations for category 1, ..., xk observations for category
k, when order does not matter, the probability in (3) has to be multiplied with the number of
possibilities in (4) which gives
p(X = x|n, [p_1, . . . , p_k]) = C(n; x_1, . . . , x_k) · ∏_{i=1}^{k} p_i^(x_i) . (5)
Sources:
• original work
Metadata: ID: P99 | shortcut: mult-pmf | author: JoramSoch | date: 2020-05-11, 23:30.
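The pmf in (5) can be checked for normalization by enumerating all count vectors with x_1 + … + x_k = n (illustration only; the parameter values are arbitrary examples):

```python
import math
from itertools import product

# Hedged sketch: multinomial pmf from equation (5); probabilities over all
# count vectors summing to n should add up to one.

def mult_pmf(x, n, p):
    coeff = math.factorial(n)
    for xi in x:
        coeff //= math.factorial(xi)   # multinomial coefficient n!/(x1!...xk!)
    return coeff * math.prod(pi**xi for pi, xi in zip(p, x))

n, p = 4, [0.2, 0.3, 0.5]
total = sum(mult_pmf(x, n, p)
            for x in product(range(n + 1), repeat=len(p))
            if sum(x) == n)
assert abs(total - 1.0) < 1e-12
```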
2.2.3 Mean
Theorem: Let X be a random vector (→ Definition I/1.2.3) following a multinomial distribution
(→ Definition II/2.2.1):
X ∼ Mult(n, [p_1, . . . , p_k]) . (1)
Then, the mean or expected value (→ Definition I/1.7.1) of X is
E(X) = [np_1, . . . , np_k] . (2)
Proof: By definition, a multinomial random variable (→ Definition II/2.2.1) is the sum of n inde-
pendent and identical categorical trials (→ Definition II/2.1.1) with category probabilities p1 , . . . , pk .
Therefore, writing X = Σ_{i=1}^{n} X_i with X_i ∼ Cat([p_1, . . . , p_k]) and using the linearity of the expected value, the expected value is
E(X) = Σ_{i=1}^{n} E(X_i) . (4)
With the expected value of the categorical distribution (→ Proof II/2.1.3), we have:
E(X) = Σ_{i=1}^{n} [p_1, . . . , p_k] = n · [p_1, . . . , p_k] = [np_1, . . . , np_k] . (5)
Sources:
• original work
Metadata: ID: P25 | shortcut: mult-mean | author: JoramSoch | date: 2020-01-16, 11:26.
2.2.4 Covariance
Theorem: Let X be a random vector (→ Definition I/1.2.3) following a multinomial distribution
(→ Definition II/2.2.1):
X ∼ Mult(n, [p_1, . . . , p_k]) . (1)
Then, the covariance matrix (→ Definition I/1.9.9) of X has the entries
Cov(X)_ii = np_i (1 − p_i) and Cov(X)_ij = −np_i p_j for i ≠ j . (2)
Proof: We first observe that the sample space (→ Definition I/1.1.2) of each coordinate Xi is
{0, 1, . . . , n} and Xi is the sum of independent draws of category i, which is drawn with probability
pi . Thus each coordinate follows a binomial distribution (→ Definition II/1.3.1):
i.i.d.
Xi ∼ Bin(n, pi ), i = 1, . . . , k , (3)
which has the variance (→ Proof II/1.3.5) Var(Xi ) = npi (1 − pi ) = n(pi − p2i ), constituting the
elements of the main diagonal in Cov(X) in (2). To prove Cov(Xi , Xj ) = −npi pj for i ̸= j (which
constitutes the off-diagonal elements of the covariance matrix), we first recognize that
X_i = Σ_{k=1}^{n} I_i(k) , with I_i(k) = 1 if the k-th draw was of category i, and 0 otherwise , (4)
where the indicator function Ii is a Bernoulli-distributed (→ Definition II/1.2.1) random variable
with the expected value (→ Proof II/1.2.3) pi . Then, we have
Cov(X_i, X_j) = Cov( Σ_{k=1}^{n} I_i(k), Σ_{l=1}^{n} I_j(l) )
= Σ_{k=1}^{n} Σ_{l=1}^{n} Cov(I_i(k), I_j(l))
= Σ_{k=1}^{n} Cov(I_i(k), I_j(k)) + Σ_{k=1}^{n} Σ_{l≠k} Cov(I_i(k), I_j(l)) , where the second term is zero because different draws are independent (5)
= Σ_{k=1}^{n} [ E(I_i(k) I_j(k)) − E(I_i(k)) E(I_j(k)) ] , where E(I_i(k) I_j(k)) = 0 for i ≠ j, since a single draw cannot belong to two categories
= − Σ_{k=1}^{n} E(I_i(k)) E(I_j(k))
= −n p_i p_j ,
as desired.
Sources:
• Tutz (2012): “Regression for Categorical Data”, pp. 209ff..
Metadata: ID: P322 | shortcut: mult-cov | author: adkipnis | date: 2022-05-11, 16:40.
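The covariance formulas can be verified by exact enumeration of all outcomes of a small multinomial distribution (illustration only; n = 3 and p = [0.2, 0.3, 0.5] are example values):

```python
import math
from itertools import product

# Hedged sketch: check Var(X_i) = n p_i (1 - p_i) and
# Cov(X_i, X_j) = -n p_i p_j by enumerating all outcomes of Mult(n, p).

def mult_pmf(x, n, p):
    coeff = math.factorial(n)
    for xi in x:
        coeff //= math.factorial(xi)
    return coeff * math.prod(pi**xi for pi, xi in zip(p, x))

n, p = 3, [0.2, 0.3, 0.5]
k = len(p)
outcomes = [x for x in product(range(n + 1), repeat=k) if sum(x) == n]
mean = [sum(x[i] * mult_pmf(x, n, p) for x in outcomes) for i in range(k)]

def cov(i, j):
    return sum((x[i] - mean[i]) * (x[j] - mean[j]) * mult_pmf(x, n, p)
               for x in outcomes)

assert abs(cov(0, 0) - n * p[0] * (1 - p[0])) < 1e-12   # diagonal entries
assert abs(cov(0, 1) - (-n * p[0] * p[1])) < 1e-12      # off-diagonal entries
```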
X ∼ Mult(n, p) . (1)
Then, the (Shannon) entropy (→ Definition I/2.1.1) of X is
H(X) = n · Hcat(p) − Elmc(n, p) , (2)
where Hcat(p) is the entropy (→ Definition I/2.1.1) of the categorical distribution (→ Proof II/2.1.5)
Hcat(p) = − Σ_{i=1}^{k} p_i · log p_i (3)
and Elmc (n, p) is the expected value (→ Definition I/1.7.1) of the logarithmized multinomial coefficient
(→ Proof II/2.2.2) with superset size n
Elmc(n, p) = E[ log C(n; X_1, . . . , X_k) ] where X ∼ Mult(n, p) . (4)
Proof: The entropy (→ Definition I/2.1.1) is defined as the probability-weighted average of the
logarithmized probabilities for all possible values:
H(X) = − Σ_{x∈X} p(x) · log_b p(x) . (5)
H(X) = − Σ_{x∈X_{n,k}} f_X(x) · log f_X(x)
= − Σ_{x∈X_{n,k}} f_X(x) · log [ C(n; x_1, . . . , x_k) ∏_{i=1}^{k} p_i^(x_i) ] (7)
= − Σ_{x∈X_{n,k}} f_X(x) · [ log C(n; x_1, . . . , x_k) + Σ_{i=1}^{k} x_i · log p_i ] .
Since the first factor in the sum corresponds to the probability mass (→ Definition I/1.6.1) of X = x,
we can rewrite this as the sum of the expected values (→ Definition I/1.7.1) of the functions (→
Proof I/1.7.12) of the discrete random variable (→ Definition I/1.2.6) x in the square bracket:
H(X) = − ⟨ log C(n; x_1, . . . , x_k) ⟩_{p(x)} − ⟨ Σ_{i=1}^{k} x_i · log p_i ⟩_{p(x)}
= − ⟨ log C(n; x_1, . . . , x_k) ⟩_{p(x)} − Σ_{i=1}^{k} ⟨ x_i · log p_i ⟩_{p(x)} . (8)
Using the expected value of the multinomial distribution (→ Proof II/2.2.3), i.e. X ∼ Mult(n, p) ⇒
⟨xi ⟩ = npi , this gives:
H(X) = − ⟨ log C(n; x_1, . . . , x_k) ⟩_{p(x)} − Σ_{i=1}^{k} np_i · log p_i
= − ⟨ log C(n; x_1, . . . , x_k) ⟩_{p(x)} − n Σ_{i=1}^{k} p_i · log p_i . (9)
Finally, we note that the first term is the negative expected value (→ Definition I/1.7.1) of the
logarithm of a multinomial coefficient (→ Proof II/2.2.2) and that the second term is the entropy of
the categorical distribution (→ Proof II/2.1.5), such that we finally get:
H(X) = n · Hcat(p) − Elmc(n, p) . (10)
Sources:
• original work
Metadata: ID: P337 | shortcut: mult-ent | author: JoramSoch | date: 2022-09-09, 16:33.
X ∼ U (a, b) , (1)
if and only if each value between and including a and b occurs with the same probability.
Sources:
• Wikipedia (2020): “Uniform distribution (continuous)”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-01-27; URL: https://en.wikipedia.org/wiki/Uniform_distribution_(continuous).
X ∼ U (0, 1) . (1)
Sources:
• Wikipedia (2021): “Continuous uniform distribution”; in: Wikipedia, the free encyclopedia, re-
trieved on 2021-07-23; URL: https://en.wikipedia.org/wiki/Continuous_uniform_distribution#
Standard_uniform.
Metadata: ID: D157 | shortcut: suni | author: JoramSoch | date: 2021-07-23, 17:32.
X ∼ U (a, b) . (1)
Then, the probability density function (→ Definition I/1.6.6) of X is
f_X(x) = 1/(b − a) , if a ≤ x ≤ b , and 0 otherwise . (2)
Proof: A continuous uniform variable is defined as (→ Definition II/3.1.1) having a constant prob-
ability density between minimum a and maximum b. Therefore,
3. UNIVARIATE CONTINUOUS DISTRIBUTIONS 177
To ensure that fX (x) is a proper probability density function (→ Definition I/1.6.6), the integral
over all non-zero probabilities has to sum to 1. Therefore,
f_X(x) = 1/c(a, b) for all x ∈ [a, b] , (4)
where the normalization factor c(a, b) is specified, such that
1/c(a, b) · ∫_a^b 1 dx = 1 . (5)
Solving this for c(a, b), we obtain:
∫_a^b 1 dx = c(a, b)
[x]_a^b = c(a, b) (6)
c(a, b) = b − a .
Sources:
• original work
Metadata: ID: P37 | shortcut: cuni-pdf | author: JoramSoch | date: 2020-01-31, 15:41.
X ∼ U (a, b) . (1)
Then, the cumulative distribution function (→ Definition I/1.6.13) of X is
F_X(x) = 0 , if x < a ; (x − a)/(b − a) , if a ≤ x ≤ b ; 1 , if x > b . (2)
Proof: The probability density function of the continuous uniform distribution (→ Proof II/3.1.3)
is:
U(x; a, b) = 1/(b − a) , if a ≤ x ≤ b , and 0 otherwise . (3)
Thus, for a ≤ x ≤ b, we have:
F_X(x) = ∫_{−∞}^{a} U(z; a, b) dz + ∫_a^x U(z; a, b) dz
= ∫_{−∞}^{a} 0 dz + ∫_a^x 1/(b − a) dz
= 0 + 1/(b − a) · [z]_a^x (6)
= (x − a)/(b − a) .
Finally, if x > b, we have
F_X(x) = ∫_{−∞}^{b} U(z; a, b) dz + ∫_b^x U(z; a, b) dz
= F_X(b) + ∫_b^x 0 dz
= (b − a)/(b − a) + 0 (7)
= 1 .
This completes the proof.
Sources:
• original work
Metadata: ID: P38 | shortcut: cuni-cdf | author: JoramSoch | date: 2020-01-02, 18:05.
X ∼ U (a, b) . (1)
Then, the quantile function (→ Definition I/1.6.23) of X is
Q_X(p) = −∞ , if p = 0 ; bp + a(1 − p) , if p > 0 . (2)
Proof: The cumulative distribution function of the continuous uniform distribution (→ Proof II/3.1.4)
is:
F_X(x) = 0 , if x < a ; (x − a)/(b − a) , if a ≤ x ≤ b ; 1 , if x > b . (3)
The quantile function QX (p) is defined as (→ Definition I/1.6.23) the smallest x, such that FX (x) = p:
p = (x − a)/(b − a)
x = p(b − a) + a (6)
x = bp + a(1 − p) .
Sources:
• original work
Metadata: ID: P39 | shortcut: cuni-qf | author: JoramSoch | date: 2020-01-02, 18:27.
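As a small numerical illustration (not part of the proof), the quantile function can be checked to invert the cumulative distribution function; a = −2 and b = 5 are arbitrary example bounds:

```python
# Hedged sketch: Q_X(p) = bp + a(1 - p) should invert
# F_X(x) = (x - a)/(b - a) on (0, 1]; a and b are example values.

def cuni_cdf(x, a, b):
    return 0.0 if x < a else 1.0 if x > b else (x - a) / (b - a)

def cuni_qf(p, a, b):
    return b * p + a * (1 - p)

a, b = -2.0, 5.0
for p in (0.1, 0.25, 0.5, 0.9, 1.0):
    assert abs(cuni_cdf(cuni_qf(p, a, b), a, b) - p) < 1e-12
```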
3.1.6 Mean
Theorem: Let X be a random variable (→ Definition I/1.2.2) following a continuous uniform dis-
tribution (→ Definition II/3.1.1):
X ∼ U (a, b) . (1)
Then, the mean or expected value (→ Definition I/1.7.1) of X is
E(X) = 1/2 (a + b) . (2)
Proof: The expected value (→ Definition I/1.7.1) is the probability-weighted average over all possible
values:
E(X) = ∫_X x · f_X(x) dx . (3)
With the probability density function of the continuous uniform distribution (→ Proof II/3.1.3), this
becomes:
E(X) = ∫_a^b x · 1/(b − a) dx
= 1/2 [x²/(b − a)]_a^b
= 1/2 · (b² − a²)/(b − a) (4)
= 1/2 · (b + a)(b − a)/(b − a)
= 1/2 (a + b) .
Sources:
• original work
Metadata: ID: P82 | shortcut: cuni-mean | author: JoramSoch | date: 2020-03-16, 16:12.
3.1.7 Median
Theorem: Let X be a random variable (→ Definition I/1.2.2) following a continuous uniform dis-
tribution (→ Definition II/3.1.1):
X ∼ U (a, b) . (1)
Then, the median (→ Definition I/1.11.1) of X is
median(X) = 1/2 (a + b) . (2)
Proof: The median (→ Definition I/1.11.1) is the value at which the cumulative distribution function
(→ Definition I/1.6.13) is 1/2:
F_X(median(X)) = 1/2 . (3)
The cumulative distribution function of the continuous uniform distribution (→ Proof II/3.1.4) is
F_X(x) = 0 , if x < a ; (x − a)/(b − a) , if a ≤ x ≤ b ; 1 , if x > b . (4)
and the corresponding quantile function (→ Proof II/3.1.5) is
x = bp + a(1 − p) . (5)
Setting p = 1/2, we obtain:
median(X) = b · 1/2 + a · (1 − 1/2) = 1/2 (a + b) . (6)
Sources:
• original work
Metadata: ID: P83 | shortcut: cuni-med | author: JoramSoch | date: 2020-03-16, 16:19.
3.1.8 Mode
Theorem: Let X be a random variable (→ Definition I/1.2.2) following a continuous uniform dis-
tribution (→ Definition II/3.1.1):
X ∼ U (a, b) . (1)
Then, the mode (→ Definition I/1.11.2) of X is
mode(X) ∈ [a, b] . (2)
Proof: The mode (→ Definition I/1.11.2) is the value which maximizes the probability density
function (→ Definition I/1.6.6):
mode(X) = arg max_x f_X(x) . (3)
The probability density function of the continuous uniform distribution (→ Proof II/3.1.3) is:
f_X(x) = 1/(b − a) , if a ≤ x ≤ b , and 0 otherwise . (4)
Since the density is constant and equal to its maximum value 1/(b − a) at every x ∈ [a, b], any value in [a, b] maximizes f_X(x) and is thus a mode of X.
Sources:
• original work
Metadata: ID: P84 | shortcut: cuni-mode | author: JoramSoch | date: 2020-03-16, 16:29.
X ∼ N (µ, σ 2 ) , (1)
if and only if its probability density function (→ Definition I/1.6.6) is given by
N(x; µ, σ²) = 1/(√(2π) σ) · exp[ −1/2 ((x − µ)/σ)² ] (2)
where µ ∈ R and σ 2 > 0.
Sources:
• Wikipedia (2020): “Normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
01-27; URL: https://en.wikipedia.org/wiki/Normal_distribution.
X ∼ N (0, 1) . (1)
Sources:
• Wikipedia (2020): “Normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
05-26; URL: https://en.wikipedia.org/wiki/Normal_distribution#Standard_normal_distribution.
Metadata: ID: D63 | shortcut: snorm | author: JoramSoch | date: 2020-05-26, 23:32.
X ∼ N (µ, σ 2 ) . (1)
Then, the quantity Z = (X − µ)/σ will have a standard normal distribution (→ Definition II/3.2.2)
with mean 0 and variance 1:
X −µ
Z= ∼ N (0, 1) . (2)
σ
X = g −1 (Z) = σZ + µ . (4)
Because σ is positive, g(X) is strictly increasing and we can calculate the cumulative distribution
function of a strictly increasing function (→ Proof I/1.6.15) as
F_Y(y) = 0 , if y < min(Y) ; F_X(g⁻¹(y)) , if y ∈ Y ; 1 , if y > max(Y) . (5)
The cumulative distribution function of the normally distributed (→ Proof II/3.2.12) X is
F_X(x) = ∫_{−∞}^{x} 1/(√(2π) σ) · exp[ −1/2 ((t − µ)/σ)² ] dt . (6)
Applying (5) to (6), we have:
F_Z(z) (5)= F_X(g⁻¹(z))
(6)= ∫_{−∞}^{σz+µ} 1/(√(2π) σ) · exp[ −1/2 ((t − µ)/σ)² ] dt . (7)
Substituting t = σs + µ, such that s = (t − µ)/σ, we get:
F_Z(z) = ∫_{(−∞−µ)/σ}^{([σz+µ]−µ)/σ} 1/(√(2π) σ) · exp[ −1/2 (((σs + µ) − µ)/σ)² ] d(σs + µ)
= ∫_{−∞}^{z} σ/(√(2π) σ) · exp[ −1/2 s² ] ds (8)
= ∫_{−∞}^{z} 1/√(2π) · exp[ −1/2 s² ] ds ,
which is the cumulative distribution function (→ Definition I/1.6.13) of the standard normal distri-
bution (→ Definition II/3.2.2).
Sources:
• original work
Metadata: ID: P111 | shortcut: norm-snorm | author: JoramSoch | date: 2020-05-26, 23:01.
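The z-transformation can be illustrated numerically via the closed-form normal CDF expressed through the error function (illustration only; µ = 1.5 and σ = 2 are example values): F_X(σz + µ) should equal the standard normal CDF at z.

```python
import math

# Hedged sketch: check F_Z(z) = F_X(sigma*z + mu) for X ~ N(mu, sigma^2),
# using the error-function form of the normal CDF.

def norm_cdf(x, mu, sigma):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma = 1.5, 2.0
for z in (-2.0, -0.5, 0.0, 1.0, 3.0):
    assert abs(norm_cdf(sigma * z + mu, mu, sigma) - norm_cdf(z, 0.0, 1.0)) < 1e-12
```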
X ∼ N (µ, σ 2 ) . (1)
Then, the quantity Z = (X − µ)/σ will have a standard normal distribution (→ Definition II/3.2.2)
with mean 0 and variance 1:
X −µ
Z= ∼ N (0, 1) . (2)
σ
X = g −1 (Z) = σZ + µ . (4)
Because σ is positive, g(X) is strictly increasing and we can calculate the probability density function
of a strictly increasing function (→ Proof I/1.6.8) as
f_Y(y) = f_X(g⁻¹(y)) · dg⁻¹(y)/dy , if y ∈ Y , and 0 , if y ∉ Y , (5)
where Y = {y = g(x) : x ∈ X }. With the probability density function of the normal distribution (→
Proof II/3.2.10), we have
" 2 #
−1
1 1 g (z) − µ dg −1 (z)
fZ (z) = √ · exp − ·
2πσ 2 σ dz
" 2 #
1 1 (σz + µ) − µ d(σz + µ)
=√ · exp − ·
2πσ 2 σ dz (6)
1 1
=√ · exp − z 2 · σ
2πσ 2
1 1 2
= √ · exp − z
2π 2
which is the probability density function (→ Definition I/1.6.6) of the standard normal distribution
(→ Definition II/3.2.2).
Sources:
• original work
Metadata: ID: P176 | shortcut: norm-snorm2 | author: JoramSoch | date: 2020-10-15, 11:42.
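The density version of this standardization can likewise be spot-checked numerically; the helper `norm_pdf` and the example parameters below are assumptions of this sketch, not part of the book:

```python
# Check the change-of-variables density relation: f_Z(z) = f_X(σz + µ) · σ
# equals the standard normal density φ(z).
from math import exp, pi, sqrt

def norm_pdf(x, mu, sigma):
    """Probability density of N(mu, sigma^2)."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sqrt(2 * pi) * sigma)

mu, sigma = -1.5, 0.7
for z in [-3.0, -1.0, 0.0, 0.5, 2.0]:
    lhs = norm_pdf(sigma * z + mu, mu, sigma) * sigma  # transformed density
    rhs = norm_pdf(z, 0.0, 1.0)                        # standard normal density
    assert abs(lhs - rhs) < 1e-12
```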
X ∼ N(µ, σ²) . (1)
Then, the quantity Z = (X − µ)/σ will have a standard normal distribution (→ Definition II/3.2.2)
with mean 0 and variance 1:
Z = (X − µ)/σ ∼ N(0, 1) . (2)
Proof: The linear transformation theorem for the multivariate normal distribution (→ Proof II/4.1.8)
states that, if x ∼ N(µ, Σ), then any linear transformation y = Ax + b also follows a multivariate
normal distribution, y ∼ N(Aµ + b, AΣAᵀ). Applying this theorem with A = 1/σ and b = −µ/σ, we
obtain Z = (X − µ)/σ ∼ N(µ/σ − µ/σ, σ²/σ²), such that
Z ∼ N(0, 1) . (6)
Sources:
• original work
Metadata: ID: P180 | shortcut: norm-snorm3 | author: JoramSoch | date: 2020-10-22, 06:34.
Theorem: Let X₁, …, Xₙ be independent (→ Definition I/1.3.6) random variables (→ Definition I/1.2.2),
each following a normal distribution (→ Definition II/3.2.1) with mean µ and variance σ²,
Xᵢ ∼ N(µ, σ²), i = 1, …, n . (1)
Consider the sample mean (→ Definition I/1.7.1)
X̄ = (1/n) ∑_{i=1}^{n} Xᵢ (2)
and the unbiased sample variance (→ Definition I/1.8.2)
s² = 1/(n − 1) ∑_{i=1}^{n} (Xᵢ − X̄)² . (3)
Then, the sampling distribution (→ Definition I/1.5.5) of the sample variance is given by a chi-
squared distribution (→ Definition II/3.7.1) with n − 1 degrees of freedom:
V = (n − 1) s²/σ² ∼ χ²(n − 1) . (4)
Proof: Define the standardized variables
Uᵢ = (Xᵢ − µ)/σ (5)
which follows a standard normal distribution (→ Proof II/3.2.3)
Uᵢ ∼ N(0, 1) . (6)
Then, the sum of squared random variables Ui can be rewritten as
∑_{i=1}^{n} Uᵢ² = ∑_{i=1}^{n} ((Xᵢ − µ)/σ)²
                = ∑_{i=1}^{n} ( [(Xᵢ − X̄) + (X̄ − µ)]/σ )²
                = ∑_{i=1}^{n} (Xᵢ − X̄)²/σ² + ∑_{i=1}^{n} (X̄ − µ)²/σ² + 2 ∑_{i=1}^{n} (Xᵢ − X̄)(X̄ − µ)/σ²     (7)
                = ∑_{i=1}^{n} (Xᵢ − X̄)²/σ² + ∑_{i=1}^{n} (X̄ − µ)²/σ² + 2 (X̄ − µ)/σ² · ∑_{i=1}^{n} (Xᵢ − X̄) .
∑_{i=1}^{n} (Xᵢ − X̄) = ∑_{i=1}^{n} Xᵢ − n X̄
                     = ∑_{i=1}^{n} Xᵢ − n · (1/n) ∑_{i=1}^{n} Xᵢ     (8)
                     = ∑_{i=1}^{n} Xᵢ − ∑_{i=1}^{n} Xᵢ
                     = 0 ,
∑_{i=1}^{n} Uᵢ² = ∑_{i=1}^{n} (Xᵢ − X̄)²/σ² + ∑_{i=1}^{n} (X̄ − µ)²/σ² . (9)
Cochran’s theorem (→ Proof “snorm-cochran”) states that, if a sum of squared standard normal
(→ Definition II/3.2.2) random variables (→ Definition I/1.2.2) can be written as a sum of squared
forms
∑_{i=1}^{n} Uᵢ² = ∑_{j=1}^{m} Qⱼ   where   Qⱼ = ∑_{k=1}^{n} ∑_{l=1}^{n} Uₖ B⁽ʲ⁾_{kl} Uₗ   with   ∑_{j=1}^{m} B⁽ʲ⁾ = Iₙ ,     (10)
then the terms Qⱼ are independent (→ Definition I/1.3.6) and each term Qⱼ follows a chi-squared
distribution (→ Definition II/3.7.1) with rⱼ degrees of freedom:
Qⱼ ∼ χ²(rⱼ) . (11)
We observe that (9) can be represented as
∑_{i=1}^{n} Uᵢ² = ∑_{i=1}^{n} (Xᵢ − X̄)²/σ² + ∑_{i=1}^{n} (X̄ − µ)²/σ²
                = Q₁ + Q₂ = ∑_{i=1}^{n} ( Uᵢ − (1/n) ∑_{j=1}^{n} Uⱼ )² + (1/n) ( ∑_{i=1}^{n} Uᵢ )²     (12)
(n − 1) s²/σ² ∼ χ²(n − 1) . (15)
Sources:
• Glen-b (2014): “Why is the sampling distribution of variance a chi-squared distribution?”; in:
StackExchange CrossValidated, retrieved on 2021-05-20; URL: https://stats.stackexchange.com/
questions/121662/why-is-the-sampling-distribution-of-variance-a-chi-squared-distribution.
• Wikipedia (2021): “Cochran’s theorem”; in: Wikipedia, the free encyclopedia, retrieved on 2020-05-
20; URL: https://en.wikipedia.org/wiki/Cochran%27s_theorem#Sample_mean_and_sample_
variance.
Metadata: ID: P233 | shortcut: norm-chi2 | author: JoramSoch | date: 2021-05-20, 10:18.
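The algebraic decomposition in equation (9) holds for any data vector, not only for normal samples, so it can be verified deterministically; the sample values below are arbitrary assumptions of this sketch:

```python
# Deterministic check of the sum-of-squares decomposition used in this proof:
# Σ((x_i − µ)/σ)² = Σ(x_i − x̄)²/σ² + n(x̄ − µ)²/σ²  (the cross term vanishes).
mu, sigma = 2.0, 1.5
x = [1.2, 3.4, 2.2, 0.7, 4.1]          # an arbitrary fixed sample
n = len(x)
xbar = sum(x) / n

lhs = sum(((xi - mu) / sigma) ** 2 for xi in x)
rhs = sum((xi - xbar) ** 2 for xi in x) / sigma ** 2 + n * (xbar - mu) ** 2 / sigma ** 2
assert abs(lhs - rhs) < 1e-12
```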
Theorem: Let X₁, …, Xₙ be independent (→ Definition I/1.3.6) random variables (→ Definition I/1.2.2),
each following a normal distribution (→ Definition II/3.2.1) with mean µ and variance σ²,
Xᵢ ∼ N(µ, σ²), i = 1, …, n . (1)
Consider the sample mean (→ Definition I/1.7.1)
X̄ = (1/n) ∑_{i=1}^{n} Xᵢ (2)
and the unbiased sample variance (→ Definition I/1.8.2)
s² = 1/(n − 1) ∑_{i=1}^{n} (Xᵢ − X̄)² . (3)
Then, subtracting µ from the sample mean (→ Definition I/1.7.1), dividing by the sample standard
deviation (→ Definition I/1.12.1) and multiplying with √n results in a quantity that follows a t-
distribution (→ Definition II/3.3.1) with n − 1 degrees of freedom:
t = √n (X̄ − µ)/s ∼ t(n − 1) . (4)
V = (n − 1) s²/σ² ∼ χ²(n − 1) . (8)
Observe that t is the ratio of a standard normal random variable (→ Definition II/3.2.2) and the
square root of a chi-squared random variable (→ Definition II/3.7.1), divided by its degrees of
freedom:
t = √n (X̄ − µ)/s = [ √n (X̄ − µ)/σ ] / √(s²/σ²) = Z / √(V/(n − 1)) . (9)
Thus, by definition of the t-distribution (→ Definition II/3.3.1), this ratio follows a t-distribution
with n − 1 degrees of freedom:
t ∼ t(n − 1) . (10)
Sources:
• Wikipedia (2021): “Student’s t-distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2021-05-27; URL: https://en.wikipedia.org/wiki/Student%27s_t-distribution#Characterization.
• Wikipedia (2021): “Normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
05-27; URL: https://en.wikipedia.org/wiki/Normal_distribution#Operations_on_multiple_independent_
normal_variables.
Metadata: ID: P234 | shortcut: norm-t | author: JoramSoch | date: 2021-05-27, 08:10.
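The ratio identity (9) is purely algebraic and can be verified for a fixed sample; the data values below are arbitrary assumptions of this sketch:

```python
# Deterministic check of the ratio identity (9): t = Z / sqrt(V/(n-1))
# for one fixed sample, with Z and V as defined in the proof.
from math import sqrt

mu, sigma = 1.0, 2.0
x = [0.3, 2.8, 1.1, -0.4, 3.6, 1.9]
n = len(x)
xbar = sum(x) / n
s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)   # unbiased sample variance
s = sqrt(s2)

t = sqrt(n) * (xbar - mu) / s                      # the t statistic
Z = sqrt(n) * (xbar - mu) / sigma                  # standard normal numerator
V = (n - 1) * s2 / sigma ** 2                      # chi-squared denominator variable
assert abs(t - Z / sqrt(V / (n - 1))) < 1e-12
```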
Proof: The probability density function of the multivariate normal distribution (→ Proof II/4.1.3)
is
N(x; µ, Σ) = 1/√((2π)ⁿ |Σ|) · exp[ −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) ] . (1)
Setting n = 1, such that x, µ ∈ R, and Σ = σ², we obtain
N(x; µ, σ²) = 1/√((2π)¹ |σ²|) · exp[ −(1/2) (x − µ)ᵀ (σ²)⁻¹ (x − µ) ]
            = 1/√(2πσ²) · exp[ −(x − µ)²/(2σ²) ]                           (2)
            = 1/(√(2π) σ) · exp[ −(1/2) ((x − µ)/σ)² ]
which is equivalent to the probability density function of the normal distribution (→ Proof II/3.2.10).
Sources:
• Wikipedia (2022): “Multivariate normal distribution”; in: Wikipedia, the free encyclopedia, re-
trieved on 2022-08-19; URL: https://en.wikipedia.org/wiki/Multivariate_normal_distribution.
Metadata: ID: P331 | shortcut: norm-mvn | author: JoramSoch | date: 2022-08-19, 19:41.
Theorem: The Gaussian integral is equal to the square root of π:
∫_{−∞}^{+∞} exp(−x²) dx = √π . (1)
Proof: Let
I = ∫_{0}^{∞} exp(−x²) dx (2)
and
I_P = ∫_{0}^{P} exp(−x²) dx = ∫_{0}^{P} exp(−y²) dy . (3)
Then, we have
lim_{P→∞} I_P = I (4)
and
I_P² = ∫_{0}^{P} exp(−x²) dx ∫_{0}^{P} exp(−y²) dy
     = ∫_{0}^{P} ∫_{0}^{P} exp[ −(x² + y²) ] dx dy       (6)
     = ∫∫_{S_P} exp[ −(x² + y²) ] dx dy
where SP is the square with corners (0, 0), (0, P ), (P, P ) and (P, 0). For this integral, we can write
down the following inequality
∫∫_{C₁} exp[ −(x² + y²) ] dx dy ≤ I_P² ≤ ∫∫_{C₂} exp[ −(x² + y²) ] dx dy (7)
where C₁ and C₂ are the regions in the first quadrant bounded by circles with center at (0, 0) and
going through the points (0, P) and (P, P), respectively. The radii of these two circles are
r₁ = √(P²) = P and r₂ = √(2P²) = P√2, such that we can rewrite equation (7) using polar coordinates as
∫_{0}^{π/2} ∫_{0}^{r₁} exp(−r²) r dr dθ ≤ I_P² ≤ ∫_{0}^{π/2} ∫_{0}^{r₂} exp(−r²) r dr dθ . (8)
Solving the definite integrals yields:
∫_{0}^{π/2} ∫_{0}^{r₁} exp(−r²) r dr dθ ≤ I_P² ≤ ∫_{0}^{π/2} ∫_{0}^{r₂} exp(−r²) r dr dθ
∫_{0}^{π/2} [ −(1/2) exp(−r²) ]_{0}^{r₁} dθ ≤ I_P² ≤ ∫_{0}^{π/2} [ −(1/2) exp(−r²) ]_{0}^{r₂} dθ
−(1/2) ∫_{0}^{π/2} [ exp(−r₁²) − 1 ] dθ ≤ I_P² ≤ −(1/2) ∫_{0}^{π/2} [ exp(−r₂²) − 1 ] dθ        (9)
−(1/2) [ exp(−r₁²) − 1 ] [θ]_{0}^{π/2} ≤ I_P² ≤ −(1/2) [ exp(−r₂²) − 1 ] [θ]_{0}^{π/2}
(π/4) [ 1 − exp(−r₁²) ] ≤ I_P² ≤ (π/4) [ 1 − exp(−r₂²) ]
(π/4) [ 1 − exp(−P²) ] ≤ I_P² ≤ (π/4) [ 1 − exp(−2P²) ]
lim_{P→∞} (π/4) [ 1 − exp(−P²) ] ≤ lim_{P→∞} I_P² ≤ lim_{P→∞} (π/4) [ 1 − exp(−2P²) ]
π/4 ≤ I² ≤ π/4 ,                                                                        (10)
such that we have a preliminary result for I:
I² = π/4   ⇒   I = √π/2 . (11)
Because the integrand in (1) is an even function, we can calculate the final result as follows:
∫_{−∞}^{+∞} exp(−x²) dx = 2 ∫_{0}^{∞} exp(−x²) dx
                        = 2 · √π/2                    (12)
                        = √π .
Sources:
• ProofWiki (2020): “Gaussian Integral”; in: ProofWiki, retrieved on 2020-11-25; URL: https://
proofwiki.org/wiki/Gaussian_Integral.
• ProofWiki (2020): “Integral to Infinity of Exponential of minus t squared”; in: ProofWiki, retrieved
on 2020-11-25; URL: https://proofwiki.org/wiki/Integral_to_Infinity_of_Exponential_of_-t%
5E2.
Metadata: ID: P196 | shortcut: norm-gi | author: JoramSoch | date: 2020-11-25, 04:47.
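The Gaussian integral can also be checked numerically; the helper `trapezoid`, the truncation range and the step count are assumptions of this sketch:

```python
# Numerical sanity check of the Gaussian integral: ∫ exp(−x²) dx over ℝ ≈ √π.
from math import exp, pi, sqrt

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule for f on [a, b] with n subintervals."""
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))

# exp(−x²) is negligible outside [-10, 10], so a truncated range suffices
integral = trapezoid(lambda x: exp(-x * x), -10.0, 10.0, 20_000)
assert abs(integral - sqrt(pi)) < 1e-9
```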
X ∼ N(µ, σ²) . (1)
Then, the probability density function (→ Definition I/1.6.6) of X is
" 2 #
1 1 x−µ
fX (x) = √ · exp − . (2)
2πσ 2 σ
Proof: This follows directly from the definition of the normal distribution (→ Definition II/3.2.1).
Sources:
• original work
Metadata: ID: P33 | shortcut: norm-pdf | author: JoramSoch | date: 2020-01-27, 15:15.
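For completeness, the closed-form density (2) can be compared against an independent implementation; `statistics.NormalDist` and the example parameters are assumptions of this sketch:

```python
# Cross-check of the normal density formula against the stdlib implementation.
from math import exp, pi, sqrt
from statistics import NormalDist

mu, sigma = 0.5, 1.3
X = NormalDist(mu, sigma)

def f(x):
    # probability density function (2) of N(µ, σ²)
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sqrt(2 * pi) * sigma)

for x in [-2.0, 0.0, 0.5, 3.1]:
    assert abs(f(x) - X.pdf(x)) < 1e-12
```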
X ∼ N(µ, σ²) . (1)
Then, the moment-generating function (→ Definition I/1.6.27) of X is
MX(t) = exp( µt + (1/2) σ²t² ) . (2)
Proof: The probability density function of the normal distribution (→ Proof II/3.2.10) is
" 2 #
1 1 x−µ
fX (x) = √ · exp − (3)
2πσ 2 σ
and the moment-generating function (→ Definition I/1.6.27) is defined as
MX(t) = E(e^{tX}) . (4)
Using the expected value for continuous random variables (→ Definition I/1.7.1), the moment-
generating function of X therefore is
Z " 2 #
+∞
1 1 x−µ
MX (t) = exp[tx] · √ · exp − dx
−∞ 2πσ 2 σ
Z +∞ " 2 # (5)
1 1 x−µ
=√ exp tx − dx .
2πσ −∞ 2 σ
Substituting u = (x − µ)/(√2 σ), i.e. x = √2 σu + µ, we have
MX(t) = 1/(√(2π) σ) ∫_{(−∞−µ)/(√2σ)}^{(+∞−µ)/(√2σ)} exp[ t(√2 σu + µ) − (1/2) ((√2 σu + µ − µ)/σ)² ] d(√2 σu + µ)
      = (√2 σ)/(√(2π) σ) ∫_{−∞}^{+∞} exp[ (√2 σu + µ) t − u² ] du
      = exp(µt)/√π ∫_{−∞}^{+∞} exp[ √2 σut − u² ] du
      = exp(µt)/√π ∫_{−∞}^{+∞} exp[ −(u² − √2 σut) ] du                        (6)
      = exp(µt)/√π ∫_{−∞}^{+∞} exp[ −(u − (√2/2) σt)² + (1/2) σ²t² ] du
      = exp( µt + (1/2) σ²t² )/√π ∫_{−∞}^{+∞} exp[ −(u − (√2/2) σt)² ] du
Now substituting v = u − (√2/2) σt, i.e. u = v + (√2/2) σt, we have
MX(t) = exp( µt + (1/2) σ²t² )/√π ∫_{−∞−(√2/2)σt}^{+∞−(√2/2)σt} exp(−v²) d(v + (√2/2) σt)
      = exp( µt + (1/2) σ²t² )/√π ∫_{−∞}^{+∞} exp(−v²) dv .                    (7)
Finally, using the Gaussian integral ∫_{−∞}^{+∞} exp(−v²) dv = √π, we conclude that
MX(t) = exp( µt + (1/2) σ²t² ) .
Sources:
• ProofWiki (2020): “Moment Generating Function of Gaussian Distribution”; in: ProofWiki, re-
trieved on 2020-03-03; URL: https://proofwiki.org/wiki/Moment_Generating_Function_of_Gaussian_
Distribution.
Metadata: ID: P71 | shortcut: norm-mgf | author: JoramSoch | date: 2020-03-03, 11:29.
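The closed-form MGF (2) can be verified by direct numerical integration of E(e^{tX}); the truncation range, step count and example parameters are assumptions of this sketch:

```python
# Numerical check of the normal MGF: ∫ e^{tx} f(x) dx ≈ exp(µt + σ²t²/2).
from math import exp, pi, sqrt

mu, sigma, t = 0.4, 1.2, 0.7

def integrand(x):
    fx = exp(-0.5 * ((x - mu) / sigma) ** 2) / (sqrt(2 * pi) * sigma)
    return exp(t * x) * fx

# trapezoidal rule over a range where the integrand is negligible outside
a, b, n = mu - 15 * sigma, mu + 15 * sigma, 100_000
h = (b - a) / n
mgf_numeric = h * (0.5 * (integrand(a) + integrand(b))
                   + sum(integrand(a + i * h) for i in range(1, n)))
mgf_closed = exp(mu * t + 0.5 * sigma ** 2 * t ** 2)
assert abs(mgf_numeric - mgf_closed) < 1e-6
```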
X ∼ N(µ, σ²) . (1)
Then, the cumulative distribution function (→ Definition I/1.6.13) of X is
FX(x) = (1/2) [ 1 + erf( (x − µ)/(√2 σ) ) ] (2)
where erf(x) is the error function defined as
erf(x) = 2/√π ∫_{0}^{x} exp(−t²) dt . (3)
Proof: The probability density function of the normal distribution (→ Proof II/3.2.10) is:
" 2 #
1 1 x−µ
fX (x) = √ · exp − . (4)
2πσ 2 σ
Thus, the cumulative distribution function (→ Definition I/1.6.13) is:
FX(x) = ∫_{−∞}^{x} N(z; µ, σ²) dz
      = ∫_{−∞}^{x} 1/(√(2π) σ) · exp[ −(1/2) ((z − µ)/σ)² ] dz        (5)
      = 1/(√(2π) σ) ∫_{−∞}^{x} exp[ −((z − µ)/(√2 σ))² ] dz .
Substituting t = (z − µ)/(√2 σ), i.e. z = √2 σt + µ, this becomes:
FX(x) = 1/(√(2π) σ) ∫_{(−∞−µ)/(√2σ)}^{(x−µ)/(√2σ)} exp(−t²) d(√2 σt + µ)
      = (√2 σ)/(√(2π) σ) ∫_{−∞}^{(x−µ)/(√2σ)} exp(−t²) dt
      = 1/√π ∫_{−∞}^{(x−µ)/(√2σ)} exp(−t²) dt                                             (6)
      = 1/√π ∫_{−∞}^{0} exp(−t²) dt + 1/√π ∫_{0}^{(x−µ)/(√2σ)} exp(−t²) dt
      = 1/√π ∫_{0}^{∞} exp(−t²) dt + 1/√π ∫_{0}^{(x−µ)/(√2σ)} exp(−t²) dt .
Applying (3) to (6), we have:
FX(x) = (1/2) lim_{x→∞} erf(x) + (1/2) erf( (x − µ)/(√2 σ) )
      = 1/2 + (1/2) erf( (x − µ)/(√2 σ) )                          (7)
      = (1/2) [ 1 + erf( (x − µ)/(√2 σ) ) ] .
Sources:
• Wikipedia (2020): “Normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-20; URL: https://en.wikipedia.org/wiki/Normal_distribution#Cumulative_distribution_function.
• Wikipedia (2020): “Error function”; in: Wikipedia, the free encyclopedia, retrieved on 2020-03-20;
URL: https://en.wikipedia.org/wiki/Error_function.
Metadata: ID: P85 | shortcut: norm-cdf | author: JoramSoch | date: 2020-03-20, 01:33.
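The erf-based CDF (2) can be compared against an independent implementation; `statistics.NormalDist` and the example parameters are assumptions of this sketch:

```python
# Check of the closed-form CDF (2) against the stdlib normal CDF.
from math import erf, sqrt
from statistics import NormalDist

mu, sigma = 1.0, 2.5

def F(x):
    # closed-form CDF via the error function
    return 0.5 * (1.0 + erf((x - mu) / (sqrt(2.0) * sigma)))

for x in [-4.0, 0.0, 1.0, 2.0, 6.5]:
    assert abs(F(x) - NormalDist(mu, sigma).cdf(x)) < 1e-12
```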
X ∼ N(µ, σ²) . (1)
Then, the cumulative distribution function (→ Definition I/1.6.13) of X can be expressed as
FX(x) = Φ_{µ,σ}(x) = φ( (x − µ)/σ ) · ∑_{i=1}^{∞} ((x − µ)/σ)^{2i−1}/(2i − 1)!! + 1/2 (2)
where φ(x) is the probability density function (→ Definition I/1.6.6) of the standard normal distri-
bution (→ Definition II/3.2.2) and n!! is a double factorial.
Proof:
1) First, consider the standard normal distribution (→ Definition II/3.2.2) N (0, 1) which has the
probability density function (→ Proof II/3.2.10)
φ(x) = 1/√(2π) · e^{−x²/2} . (3)
Let T (x) be the indefinite integral of this function. It can be obtained using infinitely repeated
integration by parts as follows:
T(x) = ∫ φ(x) dx
     = ∫ 1/√(2π) · e^{−x²/2} dx
     = 1/√(2π) ∫ 1 · e^{−x²/2} dx
     = 1/√(2π) · [ x · e^{−x²/2} + ∫ x² · e^{−x²/2} dx ]
     = 1/√(2π) · [ x · e^{−x²/2} + (1/3) x³ · e^{−x²/2} + (1/3) ∫ x⁴ · e^{−x²/2} dx ]                (4)
     = 1/√(2π) · [ x · e^{−x²/2} + (1/3) x³ · e^{−x²/2} + (1/15) x⁵ · e^{−x²/2} + (1/15) ∫ x⁶ · e^{−x²/2} dx ]
     = …
     = 1/√(2π) · [ ∑_{i=1}^{n} x^{2i−1}/(2i − 1)!! · e^{−x²/2} + 1/(2n − 1)!! ∫ x^{2n} · e^{−x²/2} dx ]
     = 1/√(2π) · [ ∑_{i=1}^{∞} x^{2i−1}/(2i − 1)!! · e^{−x²/2} + lim_{n→∞} 1/(2n − 1)!! ∫ x^{2n} · e^{−x²/2} dx ] .
Because the limit term vanishes, the indefinite integral is
T(x) = 1/√(2π) · ∑_{i=1}^{∞} x^{2i−1}/(2i − 1)!! · e^{−x²/2} + c
     = 1/√(2π) · e^{−x²/2} · ∑_{i=1}^{∞} x^{2i−1}/(2i − 1)!! + c        (6)
     = φ(x) · ∑_{i=1}^{∞} x^{2i−1}/(2i − 1)!! + c .
2) Next, let Φ(x) be the cumulative distribution function (→ Definition I/1.6.13) of the standard
normal distribution (→ Definition II/3.2.2):
Φ(x) = ∫_{−∞}^{x} φ(x) dx . (7)
It can be obtained by matching T (0) to Φ(0) which is 1/2, because the standard normal distribution
is symmetric around zero:
T(0) = φ(0) · ∑_{i=1}^{∞} 0^{2i−1}/(2i − 1)!! + c = 1/2 = Φ(0)
⇔ c = 1/2                                                         (8)
⇒ Φ(x) = φ(x) · ∑_{i=1}^{∞} x^{2i−1}/(2i − 1)!! + 1/2 .
3) Finally, the cumulative distribution functions (→ Definition I/1.6.13) of the standard normal
distribution (→ Definition II/3.2.2) and the general normal distribution (→ Definition II/3.2.1) are
related to each other (→ Proof II/3.2.3) as
Φ_{µ,σ}(x) = Φ( (x − µ)/σ ) . (9)
Combining (9) with (8), we have:
Φ_{µ,σ}(x) = φ( (x − µ)/σ ) · ∑_{i=1}^{∞} ((x − µ)/σ)^{2i−1}/(2i − 1)!! + 1/2 . (10)
Sources:
• Soch J (2015): “Solution for the Indefinite Integral of the Standard Normal Probability Density
Function”; in: arXiv stat.OT, arXiv:1512.04858; URL: https://arxiv.org/abs/1512.04858.
• Wikipedia (2020): “Normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-20; URL: https://en.wikipedia.org/wiki/Normal_distribution#Cumulative_distribution_function.
Metadata: ID: P86 | shortcut: norm-cdfwerf | author: JoramSoch | date: 2020-03-20, 04:26.
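The series representation (10) can be validated numerically against the erf-based CDF; the truncation at 60 terms and the example arguments are assumptions of this sketch:

```python
# Check of the series representation (10) of the normal CDF against erf.
from math import erf, exp, pi, sqrt

def phi(x):
    # standard normal probability density function
    return exp(-0.5 * x * x) / sqrt(2 * pi)

def cdf_series(x, mu, sigma, terms=60):
    # Φ_{µ,σ}(x) = φ(z) · Σ z^(2i−1)/(2i−1)!! + 1/2 with z = (x − µ)/σ
    z = (x - mu) / sigma
    total, dfact, zpow = 0.0, 1.0, z
    for i in range(1, terms + 1):
        dfact *= 2 * i - 1          # double factorial (2i−1)!!
        total += zpow / dfact
        zpow *= z * z               # advance z^(2i−1) -> z^(2i+1)
    return phi(z) * total + 0.5

mu, sigma = 1.0, 2.0
for x in [-3.0, 0.0, 1.0, 2.0, 5.0]:
    ref = 0.5 * (1.0 + erf((x - mu) / (sqrt(2.0) * sigma)))
    assert abs(cdf_series(x, mu, sigma) - ref) < 1e-12
```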
Proof: The cumulative distribution function of a normally distributed (→ Proof II/3.2.12) random
variable X is
FX(x) = (1/2) [ 1 + erf( (x − µ)/(√2 σ) ) ] (2)
where erf(x) is the error function defined as
erf(x) = 2/√π ∫_{0}^{x} exp(−t²) dt (3)
which exhibits a point-symmetry property: erf(−x) = −erf(x).
Sources:
• Wikipedia (2022): “68-95-99.7 rule”; in: Wikipedia, the free encyclopedia, retrieved on 2022-05-08;
URL: https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule.
Metadata: ID: P321 | shortcut: norm-probstd | author: JoramSoch | date: 2022-05-08, 18:56.
X ∼ N(µ, σ²) . (1)
Then, the quantile function (→ Definition I/1.6.23) of X is
QX(p) = √2 σ · erf⁻¹(2p − 1) + µ (2)
where erf −1 (x) is the inverse error function.
Proof: The cumulative distribution function of the normal distribution (→ Proof II/3.2.12) is:
FX(x) = (1/2) [ 1 + erf( (x − µ)/(√2 σ) ) ] . (3)
Because the cumulative distribution function (CDF) is strictly monotonically increasing, the quantile
function is equal to the inverse of the CDF (→ Proof I/1.6.24):
p = (1/2) [ 1 + erf( (x − µ)/(√2 σ) ) ]
2p − 1 = erf( (x − µ)/(√2 σ) )
erf⁻¹(2p − 1) = (x − µ)/(√2 σ)                      (5)
x = √2 σ · erf⁻¹(2p − 1) + µ .
Sources:
• Wikipedia (2020): “Normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-20; URL: https://en.wikipedia.org/wiki/Normal_distribution#Quantile_function.
Metadata: ID: P87 | shortcut: norm-qf | author: JoramSoch | date: 2020-03-20, 04:47.
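Since the quantile function is the inverse of the CDF, a roundtrip check is possible with the standard library; `statistics.NormalDist.inv_cdf` and the example parameters are assumptions of this sketch:

```python
# Roundtrip check: the quantile function inverts the CDF, Q_X(F_X(x)) = x.
from statistics import NormalDist

mu, sigma = 2.0, 0.8
X = NormalDist(mu, sigma)
for x in [-1.0, 1.5, 2.0, 3.7]:
    assert abs(X.inv_cdf(X.cdf(x)) - x) < 1e-9
```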
3.2.16 Mean
Theorem: Let X be a random variable (→ Definition I/1.2.2) following a normal distribution (→
Definition II/3.2.1):
X ∼ N(µ, σ²) . (1)
Then, the mean or expected value (→ Definition I/1.7.1) of X is
E(X) = µ . (2)
Proof: The expected value (→ Definition I/1.7.1) is the probability-weighted average over all possible
values:
E(X) = ∫_X x · fX(x) dx . (3)
With the probability density function of the normal distribution (→ Proof II/3.2.10), this reads:
Z " 2 #
+∞
1 1 x−µ
E(X) = x· √· exp − dx
−∞ 2πσ 2 σ
Z +∞ " 2 # (4)
1 1 x−µ
=√ x · exp − dx .
2πσ −∞ 2 σ
Substituting z = x − µ, we have:
E(X) = 1/(√(2π) σ) ∫_{−∞−µ}^{+∞−µ} (z + µ) · exp[ −(1/2) (z/σ)² ] d(z + µ)
     = 1/(√(2π) σ) ∫_{−∞}^{+∞} (z + µ) · exp[ −(1/2) (z/σ)² ] dz
     = 1/(√(2π) σ) [ ∫_{−∞}^{+∞} z · exp[ −(1/2) (z/σ)² ] dz + µ ∫_{−∞}^{+∞} exp[ −(1/2) (z/σ)² ] dz ]     (5)
     = 1/(√(2π) σ) [ ∫_{−∞}^{+∞} z · exp( −z²/(2σ²) ) dz + µ ∫_{−∞}^{+∞} exp( −z²/(2σ²) ) dz ] .
The antiderivatives of these integrands are
∫ x · exp(−ax²) dx = −1/(2a) · exp(−ax²)
∫ exp(−ax²) dx = (1/2) √(π/a) · erf(√a x)        (6)
where erf(x) is the error function. Using this, the integrals can be calculated as:
E(X) = 1/(√(2π) σ) ( [ −σ² · exp( −z²/(2σ²) ) ]_{−∞}^{+∞} + µ [ σ √(π/2) · erf( z/(√2 σ) ) ]_{−∞}^{+∞} )
     = 1/(√(2π) σ) ( lim_{z→∞} [ −σ² · exp( −z²/(2σ²) ) ] − lim_{z→−∞} [ −σ² · exp( −z²/(2σ²) ) ]
       + µ ( lim_{z→∞} [ σ √(π/2) · erf( z/(√2 σ) ) ] − lim_{z→−∞} [ σ √(π/2) · erf( z/(√2 σ) ) ] ) )
     = 1/(√(2π) σ) ( [0 − 0] + µ [ σ √(π/2) − ( −σ √(π/2) ) ] )                      (7)
     = 1/(√(2π) σ) · µ · 2σ √(π/2)
     = µ .
Sources:
• Papadopoulos, Alecos (2013): “How to derive the mean and variance of Gaussian random vari-
able?”; in: StackExchange Mathematics, retrieved on 2020-01-09; URL: https://math.stackexchange.
com/questions/518281/how-to-derive-the-mean-and-variance-of-a-gaussian-random-variable.
Metadata: ID: P15 | shortcut: norm-mean | author: JoramSoch | date: 2020-01-09, 15:04.
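The result E(X) = µ can be verified by direct numerical integration; the truncation range, step count and example parameters are assumptions of this sketch:

```python
# Numerical check that ∫ x f(x) dx = µ for the normal density.
from math import exp, pi, sqrt

mu, sigma = -0.7, 1.9

def f(x):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sqrt(2 * pi) * sigma)

a, b, n = mu - 12 * sigma, mu + 12 * sigma, 50_000
h = (b - a) / n
g = lambda x: x * f(x)
mean = h * (0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, n)))
assert abs(mean - mu) < 1e-8
```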
3.2.17 Median
Theorem: Let X be a random variable (→ Definition I/1.2.2) following a normal distribution (→
Definition II/3.2.1):
X ∼ N(µ, σ²) . (1)
Then, the median (→ Definition I/1.11.1) of X is
median(X) = µ . (2)
Proof: The median (→ Definition I/1.11.1) is the value at which the cumulative distribution function
(→ Definition I/1.6.13) is 1/2:
FX(median(X)) = 1/2 . (3)
The cumulative distribution function of the normal distribution (→ Proof II/3.2.12) is
FX(x) = (1/2) [ 1 + erf( (x − µ)/(√2 σ) ) ] (4)
where erf(x) is the error function. Thus, the inverse CDF is
x = √2 σ · erf⁻¹(2p − 1) + µ (5)
where erf −1 (x) is the inverse error function. Setting p = 1/2, we obtain:
median(X) = √2 σ · erf⁻¹(0) + µ = µ . (6)
Sources:
• original work
Metadata: ID: P16 | shortcut: norm-med | author: JoramSoch | date: 2020-01-09, 15:33.
3.2.18 Mode
Theorem: Let X be a random variable (→ Definition I/1.2.2) following a normal distribution (→
Definition II/3.2.1):
X ∼ N(µ, σ²) . (1)
Then, the mode (→ Definition I/1.11.2) of X is
mode(X) = µ . (2)
Proof: The mode (→ Definition I/1.11.2) is the value which maximizes the probability density
function (→ Definition I/1.6.6):
mode(X) = arg max_x fX(x) . (3)
The probability density function of the normal distribution (→ Proof II/3.2.10) is:
" 2 #
1 1 x−µ
fX (x) = √ · exp − . (4)
2πσ 2 σ
The first two derivatives of this function are:
" 2 #
df (x) 1 1 x − µ
fX′ (x) =
X
=√ · (−x + µ) · exp − (5)
dx 2πσ 3 2 σ
" 2 # " 2 #
d2
f (x) 1 1 x − µ 1 1 x − µ
fX′′ (x) =
X
2
= −√ ·exp − +√ ·(−x+µ)2 ·exp − . (6)
dx 2πσ 3 2 σ 2πσ 5 2 σ
" 2 #
1 1 x − µ
fX′ (x) = 0 = √ · (−x + µ) · exp −
2πσ 3 2 σ
(7)
0 = −x + µ
x=µ.
Plugging x = µ into the second derivative,
f″X(µ) = −1/(√(2π) σ³) · exp(0) + 1/(√(2π) σ⁵) · 0² · exp(0)
       = −1/(√(2π) σ³) < 0 ,                                         (8)
we confirm that it is in fact a maximum which shows that
mode(X) = µ . (9)
Sources:
• original work
Metadata: ID: P17 | shortcut: norm-mode | author: JoramSoch | date: 2020-01-09, 15:58.
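The maximum at x = µ can be spot-checked numerically; the example parameters and test offsets are assumptions of this sketch:

```python
# Check that the normal density is maximal at x = µ (the mode).
from math import exp, pi, sqrt

mu, sigma = 1.5, 0.9

def f(x):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sqrt(2 * pi) * sigma)

for dx in [0.001, 0.1, 1.0, 5.0]:
    assert f(mu) > f(mu + dx)
    assert f(mu) > f(mu - dx)
```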
3.2.19 Variance
Theorem: Let X be a random variable (→ Definition I/1.2.2) following a normal distribution (→
Definition II/3.2.1):
X ∼ N(µ, σ²) . (1)
Then, the variance (→ Definition I/1.8.1) of X is
Var(X) = σ² . (2)
Proof: The variance (→ Definition I/1.8.1) is the probability-weighted average of the squared devi-
ation from the mean (→ Definition I/1.7.1):
Var(X) = ∫_R (x − E(X))² · fX(x) dx . (3)
With the expected value (→ Proof II/3.2.16) and probability density function (→ Proof II/3.2.10)
of the normal distribution, this reads:
Z " 2 #
+∞
1 1 x−µ
Var(X) = (x − µ)2 · √
· exp − dx
−∞ 2πσ 2 σ
Z +∞ " 2 # (4)
1 1 x − µ
=√ (x − µ)2 · exp − dx .
2πσ −∞ 2 σ
Substituting z = x − µ, we have:
Var(X) = 1/(√(2π) σ) ∫_{−∞−µ}^{+∞−µ} z² · exp[ −(1/2) (z/σ)² ] d(z + µ)
       = 1/(√(2π) σ) ∫_{−∞}^{+∞} z² · exp[ −(1/2) (z/σ)² ] dz .                  (5)
Now substituting z = √2 σx, we have:
Var(X) = 1/(√(2π) σ) ∫_{−∞}^{+∞} (√2 σx)² · exp[ −(1/2) (√2 σx/σ)² ] d(√2 σx)
       = 1/(√(2π) σ) · √2 σ · 2σ² ∫_{−∞}^{+∞} x² · exp(−x²) dx                   (6)
       = (2σ²/√π) ∫_{−∞}^{+∞} x² · e^{−x²} dx .
Since ∫_{−∞}^{+∞} x² · e^{−x²} dx = √π/2, it finally follows that Var(X) = (2σ²/√π) · (√π/2) = σ².
Sources:
• Papadopoulos, Alecos (2013): “How to derive the mean and variance of Gaussian random vari-
able?”; in: StackExchange Mathematics, retrieved on 2020-01-09; URL: https://math.stackexchange.
com/questions/518281/how-to-derive-the-mean-and-variance-of-a-gaussian-random-variable.
Metadata: ID: P18 | shortcut: norm-var | author: JoramSoch | date: 2020-01-09, 22:47.
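The result Var(X) = σ² can be verified by direct numerical integration; the truncation range, step count and example parameters are assumptions of this sketch:

```python
# Numerical check that ∫ (x − µ)² f(x) dx = σ² for the normal density.
from math import exp, pi, sqrt

mu, sigma = 0.3, 1.4

def f(x):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sqrt(2 * pi) * sigma)

a, b, n = mu - 12 * sigma, mu + 12 * sigma, 50_000
h = (b - a) / n
g = lambda x: (x - mu) ** 2 * f(x)
var = h * (0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, n)))
assert abs(var - sigma ** 2) < 1e-7
```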
X ∼ N(µ, σ²) . (1)
Then, the full width at half maximum (→ Definition I/1.12.2) (FWHM) of X is
FWHM(X) = 2 √(2 ln 2) σ . (2)
Proof: The probability density function of the normal distribution (→ Proof II/3.2.10) is
" 2 #
1 1 x−µ
fX (x) = √ · exp − (3)
2πσ 2 σ
and the mode of the normal distribution (→ Proof II/3.2.18) is
mode(X) = µ , (4)
such that
f_max = fX(mode(X)) = fX(µ) = 1/(√(2π) σ) . (5)
The FWHM bounds satisfy the equation (→ Definition I/1.12.2)
fX(x_FWHM) = (1/2) f_max = 1/(2 √(2π) σ) . (6)
Using (3), we can develop this equation as follows:
" 2 #
1 1 xFWHM − µ 1
√ · exp − = √
2πσ 2 σ 2 2πσ
" 2 #
1 xFWHM − µ 1
exp − =
2 σ 2
2
1 xFWHM − µ 1
− = ln
2 σ 2 (7)
2
xFWHM − µ 1
= −2 ln
σ 2
xFWHM − µ √
= ± 2 ln 2
σ √
xFWHM − µ = ± 2 ln 2σ
√
xFWHM = ± 2 ln 2σ + µ .
This gives the two solutions
x₁ = µ − √(2 ln 2) σ
x₂ = µ + √(2 ln 2) σ ,        (8)
FWHM(X) = ∆x = x₂ − x₁
        = ( µ + √(2 ln 2) σ ) − ( µ − √(2 ln 2) σ )       (9)
        = 2 √(2 ln 2) σ .
Sources:
• Wikipedia (2020): “Full width at half maximum”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-08-19; URL: https://en.wikipedia.org/wiki/Full_width_at_half_maximum.
Metadata: ID: P152 | shortcut: norm-fwhm | author: JoramSoch | date: 2020-08-19, 06:39.
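The FWHM bounds can be checked by evaluating the density at µ ± √(2 ln 2) σ; the example parameters are assumptions of this sketch:

```python
# Check: the density at µ ± √(2 ln 2) σ equals half of its maximum value.
from math import exp, log, pi, sqrt

mu, sigma = 2.0, 1.1

def f(x):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sqrt(2 * pi) * sigma)

half_width = sqrt(2 * log(2)) * sigma
fmax = f(mu)
assert abs(f(mu + half_width) - 0.5 * fmax) < 1e-12
assert abs(f(mu - half_width) - 0.5 * fmax) < 1e-12
```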
Proof: The probability density function of the normal distribution (→ Proof II/3.2.10) is:
" 2 #
1 1 x−µ
fX (x) = √ · exp − . (1)
2πσ 2 σ
The first two derivatives of this function (→ Proof II/3.2.18) are:
" 2 #
df (x) 1 1 x − µ
fX′ (x) =
X
=√ · (−x + µ) · exp − (2)
dx 2πσ 3 2 σ
" 2 # " 2 #
d2
f (x) 1 1 x − µ 1 1 x − µ
fX′′ (x) =
X
= −√ ·exp − +√ ·(−x+µ)2 ·exp − . (3)
dx2 2πσ 3 2 σ 2πσ 5 2 σ
The first derivative is zero, if and only if
−x + µ = 0   ⇔   x = µ . (4)
Since the second derivative is negative at this value
f″X(µ) = −1/(√(2π) σ³) < 0 , (5)
there is a maximum at x = µ. From (2), it can be seen that f′X(x) is positive for x < µ and negative
for x > µ. Thus, there are no further extrema and N(µ, σ²) is unimodal (→ Proof II/3.2.18).
Sources:
• Wikipedia (2021): “Normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
08-25; URL: https://en.wikipedia.org/wiki/Normal_distribution#Symmetries_and_derivatives.
Metadata: ID: P251 | shortcut: norm-extr | author: JoramSoch | date: 2020-08-25, 21:11.
Proof: The probability density function of the normal distribution (→ Proof II/3.2.10) is:
" 2 #
1 1 x−µ
fX (x) = √ · exp − . (1)
2πσ 2 σ
The first three derivatives of this function are:
" 2 #
df (x) 1 x − µ 1 x − µ
fX′ (x) =
X
=√ · − 2 · exp − (2)
dx 2πσ σ 2 σ
" 2 # 2 " 2 #
d 2
f (x) 1 1 1 x − µ 1 x − µ 1 x − µ
fX′′ (x) =
X
=√ · − 2 · exp − +√ · · exp −
dx2 2πσ σ 2 σ 2πσ σ2 2 σ
" 2 # " 2 #
1 x−µ 1 1 x−µ
=√ · − 2 · exp −
2πσ σ2 σ 2 σ
(3)
" 2 # " 2 #
′′′ d3
f X (x) 1 2 x − µ 1 x − µ 1 x − µ 1 x−µ
fX (x) = =√ · · exp − −√ · − 2 · ·
dx3 2πσ σ 2 σ2 2 σ 2πσ σ2 σ σ2
" 3 # " 2 #
1 x−µ x−µ 1 x−µ
=√ · − +3 · exp − .
2πσ σ2 σ4 2 σ
(4)
" 2 #
x−µ 1
0= − 2
σ2 σ
x2 2µx µ2 1
0= − + −
σ4 σ4 σ4 σ2
0 = x − 2µx + (µ2 − σ 2 )
2
(5)
s 2
−2µ −2µ
x1/2 = − ± − (µ2 − σ 2 )
2 2
p
x1/2 = µ ± µ2 − µ2 + σ 2
x1/2 = µ ± σ .
" 3 # " 2 #
1 ±σ ±σ 1 ±σ
fX′′′ (µ ± σ) = √ · − +3 · exp −
2πσ σ2 σ4 2 σ
(6)
1 2 1
=√ · ± 3 · exp − ̸= 0 ,
2πσ σ 2
there are inflection points at x1/2 = µ ± σ. Because µ is the mean and σ 2 is the variance of a normal
distribution (→ Definition II/3.2.1), these points are exactly one standard deviation (→ Definition
I/1.12.1) away from the mean.
Sources:
• Wikipedia (2021): “Normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
08-25; URL: https://en.wikipedia.org/wiki/Normal_distribution#Symmetries_and_derivatives.
Metadata: ID: P252 | shortcut: norm-infl | author: JoramSoch | date: 2020-08-26, 12:26.
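The inflection points can be confirmed by a finite-difference approximation of the second derivative; the step size, tolerances and example parameters are assumptions of this sketch:

```python
# Finite-difference check that the normal density has f''(µ ± σ) ≈ 0.
from math import exp, pi, sqrt

mu, sigma = 0.0, 2.0

def f(x):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sqrt(2 * pi) * sigma)

def second_derivative(x, h=1e-5):
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

assert abs(second_derivative(mu + sigma)) < 1e-5
assert abs(second_derivative(mu - sigma)) < 1e-5
# away from µ ± σ the second derivative is clearly non-zero
assert abs(second_derivative(mu)) > 1e-3
```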
X ∼ N(µ, σ²) . (1)
Then, the differential entropy (→ Definition I/2.2.1) of X is
h(X) = (1/2) ln(2πσ²e) . (2)
Sources:
• Wang, Peng-Hua (2012): “Differential Entropy”; in: National Taipei University; URL: https://
web.ntpu.edu.tw/~phwang/teaching/2012s/IT/slides/chap08.pdf.
Metadata: ID: P101 | shortcut: norm-dent | author: JoramSoch | date: 2020-05-14, 20:09.
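The differential entropy formula (2) can be verified by numerical integration of −∫ f ln f; the truncation range, step count and example parameters are assumptions of this sketch:

```python
# Numerical check of the differential entropy: −∫ f ln f dx ≈ ½ ln(2πσ²e).
from math import e, exp, log, pi, sqrt

mu, sigma = 0.0, 1.7

def f(x):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sqrt(2 * pi) * sigma)

a, b, n = mu - 12 * sigma, mu + 12 * sigma, 50_000
h = (b - a) / n
g = lambda x: -f(x) * log(f(x))
entropy = h * (0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, n)))
assert abs(entropy - 0.5 * log(2 * pi * sigma ** 2 * e)) < 1e-6
```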
P : X ∼ N(µ₁, σ₁²)
Q : X ∼ N(µ₂, σ₂²) .        (1)
Then, the Kullback-Leibler divergence (→ Definition I/2.5.1) of P from Q is given by
KL[P || Q] = (1/2) [ (µ₂ − µ₁)²/σ₂² + σ₁²/σ₂² − ln(σ₁²/σ₂²) − 1 ] . (2)
Proof: The KL divergence for a continuous random variable (→ Definition I/2.5.1) is given by
KL[P || Q] = ∫_X p(x) ln( p(x)/q(x) ) dx (3)
which, applied to the normal distributions (→ Definition II/3.2.1) in (1), yields
KL[P || Q] = ∫_{−∞}^{+∞} N(x; µ₁, σ₁²) ln( N(x; µ₁, σ₁²) / N(x; µ₂, σ₂²) ) dx
           = ⟨ ln( N(x; µ₁, σ₁²) / N(x; µ₂, σ₂²) ) ⟩_{p(x)} .                    (4)
Using the probability density function of the normal distribution (→ Proof II/3.2.10), this becomes:
KL[P || Q] = ⟨ ln( [ 1/(√(2π) σ₁) · exp( −(1/2) ((x − µ₁)/σ₁)² ) ] / [ 1/(√(2π) σ₂) · exp( −(1/2) ((x − µ₂)/σ₂)² ) ] ) ⟩_{p(x)}
           = ⟨ ln( √(σ₂²/σ₁²) · exp[ −(1/2) ((x − µ₁)/σ₁)² + (1/2) ((x − µ₂)/σ₂)² ] ) ⟩_{p(x)}
           = ⟨ (1/2) ln(σ₂²/σ₁²) − (1/2) ((x − µ₁)/σ₁)² + (1/2) ((x − µ₂)/σ₂)² ⟩_{p(x)}            (5)
           = (1/2) ⟨ −((x − µ₁)/σ₁)² + ((x − µ₂)/σ₂)² − ln(σ₁²/σ₂²) ⟩_{p(x)}
           = (1/2) ⟨ −(x − µ₁)²/σ₁² + (x² − 2µ₂x + µ₂²)/σ₂² − ln(σ₁²/σ₂²) ⟩_{p(x)} .
Because the expected value (→ Definition I/1.7.1) is a linear operator (→ Proof I/1.7.5), the expec-
tation can be moved into the sum:
KL[P || Q] = (1/2) [ −⟨(x − µ₁)²⟩/σ₁² + ⟨x² − 2µ₂x + µ₂²⟩/σ₂² − ln(σ₁²/σ₂²) ]
           = (1/2) [ −⟨(x − µ₁)²⟩/σ₁² + ( ⟨x²⟩ − ⟨2µ₂x⟩ + ⟨µ₂²⟩ )/σ₂² − ln(σ₁²/σ₂²) ] .     (6)
The first expectation corresponds to the variance (→ Definition I/1.8.1)
⟨(X − µ)²⟩ = E[(X − E(X))²] = Var(X) (7)
and the variance of a normally distributed random variable (→ Proof II/3.2.19) is
KL[P || Q] = (1/2) [ −σ₁²/σ₁² + (µ₁² + σ₁² − 2µ₂µ₁ + µ₂²)/σ₂² − ln(σ₁²/σ₂²) ]
           = (1/2) [ (µ₁² − 2µ₁µ₂ + µ₂²)/σ₂² + σ₁²/σ₂² − ln(σ₁²/σ₂²) − 1 ]       (10)
           = (1/2) [ (µ₁ − µ₂)²/σ₂² + σ₁²/σ₂² − ln(σ₁²/σ₂²) − 1 ]
which is equivalent to (2).
Sources:
• original work
Metadata: ID: P193 | shortcut: norm-kl | author: JoramSoch | date: 2020-11-19, 07:08.
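The closed-form KL divergence (2) can be verified against direct numerical integration of (3); the truncation range, step count and example parameters are assumptions of this sketch:

```python
# Numerical check of the Gaussian KL divergence against the closed form (2).
from math import exp, log, pi, sqrt

m1, s1, m2, s2 = 0.5, 1.0, -0.3, 1.8

def pdf(x, m, s):
    return exp(-0.5 * ((x - m) / s) ** 2) / (sqrt(2 * pi) * s)

a, b, n = -20.0, 20.0, 50_000
h = (b - a) / n
g = lambda x: pdf(x, m1, s1) * log(pdf(x, m1, s1) / pdf(x, m2, s2))
kl_numeric = h * (0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, n)))

kl_closed = 0.5 * ((m2 - m1) ** 2 / s2 ** 2 + s1 ** 2 / s2 ** 2
                   - log(s1 ** 2 / s2 ** 2) - 1.0)
assert abs(kl_numeric - kl_closed) < 1e-8
```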
Proof: For a random variable (→ Definition I/1.2.2) X with set of possible values X and probability
density function (→ Definition I/1.6.6) p(x), the differential entropy (→ Definition I/2.2.1) is defined
as:
h(X) = −∫_X p(x) log p(x) dx (1)
Let g(x) be the probability density function (→ Definition I/1.6.6) of a normal distribution (→
Definition II/3.2.1) with mean (→ Definition I/1.7.1) µ and variance (→ Definition I/1.8.1) σ² and
let f(x) be an arbitrary probability density function (→ Definition I/1.6.6) with the same variance
(→ Definition I/1.8.1). Since differential entropy (→ Definition I/2.2.1) is translation-invariant (→
Proof I/2.2.3), we can assume that f(x) has the same mean as g(x).
Consider the Kullback-Leibler divergence (→ Definition I/2.5.1) of distribution f (x) from distribution
g(x) which is non-negative (→ Proof I/2.5.2):
0 ≤ KL[f||g] = ∫_X f(x) log( f(x)/g(x) ) dx
             = ∫_X f(x) log f(x) dx − ∫_X f(x) log g(x) dx        (2)
             = −h[f(x)] − ∫_X f(x) log g(x) dx .
X
By plugging the probability density function of the normal distribution (→ Proof II/3.2.10) into the
second term, we obtain:
∫_X f(x) log g(x) dx = ∫_X f(x) log( 1/(√(2π) σ) · exp[ −(1/2) ((x − µ)/σ)² ] ) dx
                     = ∫_X f(x) log( 1/√(2πσ²) ) dx + ∫_X f(x) log exp( −(x − µ)²/(2σ²) ) dx     (3)
                     = −(1/2) log(2πσ²) ∫_X f(x) dx − log(e)/(2σ²) ∫_X f(x) (x − µ)² dx .
Because the entire integral over a probability density function is one (→ Definition I/1.6.6) and the
second central moment is equal to the variance (→ Proof I/1.14.8), we have:
∫_X f(x) log g(x) dx = −(1/2) log(2πσ²) − log(e) σ²/(2σ²)
                     = −(1/2) [ log(2πσ²) + log(e) ]              (4)
                     = −(1/2) log(2πσ²e) .
2
This is actually the negative of the differential entropy of the normal distribution (→ Proof II/3.2.23),
such that:
∫_X f(x) log g(x) dx = −h[g(x)] . (5)
Combining (2) with (5), we can show that
0 ≤ KL[f||g] = −h[f(x)] + h[g(x)]   ⇔   h[g(x)] ≥ h[f(x)] ,
i.e. the differential entropy of the normal distribution is at least as large as that of any other
distribution with the same variance.
Sources:
• Wikipedia (2021): “Differential entropy”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
08-25; URL: https://en.wikipedia.org/wiki/Differential_entropy#Maximization_in_the_normal_
distribution.
Metadata: ID: P250 | shortcut: norm-maxent | author: JoramSoch | date: 2020-08-25, 08:31.
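A spot check of the maximum-entropy property: a Laplace distribution matched to the same variance has lower differential entropy. The Laplace variance 2b² and entropy 1 + ln(2b) are known facts used as inputs here, and the example σ is an assumption of this sketch:

```python
# Spot check: a Laplace density with the same variance has lower differential
# entropy than the normal density.
from math import e, log, pi, sqrt

sigma = 1.3
h_normal = 0.5 * log(2 * pi * sigma ** 2 * e)
# Laplace(0, b) has variance 2b² and entropy 1 + ln(2b); match 2b² = σ²
b = sigma / sqrt(2.0)
h_laplace = 1.0 + log(2.0 * b)
assert h_laplace < h_normal
```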
with mean and variance which are functions of the individual means and variances.
Thus, we can apply the linear transformation theorem for the multivariate normal distribution (→
Proof II/4.1.8)
This implies the following distribution for the linear combination given by equation (2):
Aµ = [a₁, …, aₙ] · [µ₁, …, µₙ]ᵀ = ∑_{i=1}^{n} aᵢµᵢ   and
AΣAᵀ = [a₁, …, aₙ] · diag(σ₁², …, σₙ²) · [a₁, …, aₙ]ᵀ = ∑_{i=1}^{n} aᵢ²σᵢ² .     (9)
Sources:
• original work
Metadata: ID: P235 | shortcut: norm-lincomb | author: JoramSoch | date: 2021-06-02, 08:24.
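Equation (9) can be verified with a small numerical example; the coefficient, mean and variance values below are arbitrary assumptions of this sketch:

```python
# Deterministic check of equation (9): for a diagonal covariance, Aµ and AΣAᵀ
# reduce to the weighted sums of the individual means and variances.
a = [0.5, -1.0, 2.0]
mus = [1.0, 2.0, -0.5]
variances = [0.25, 1.0, 4.0]
n = len(a)
# diagonal covariance matrix Σ = diag(σ₁², …, σₙ²)
Sigma = [[variances[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

# full matrix computation of the quadratic form AΣAᵀ for the row vector A
quad = sum(a[i] * Sigma[i][j] * a[j] for i in range(n) for j in range(n))
assert abs(quad - sum(a[i] ** 2 * variances[i] for i in range(n))) < 1e-12

# Aµ reduces to the weighted sum of the means
lin = sum(a[i] * mus[i] for i in range(n))
assert abs(lin - (-2.5)) < 1e-12
```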
3.3 t-distribution
3.3.1 Definition
Definition: Let Z and V be independent (→ Definition I/1.3.6) random variables (→ Definition
I/1.2.2) following a standard normal distribution (→ Definition II/3.2.2) and a chi-squared distribu-
tion (→ Definition II/3.7.1) with ν degrees of freedom (→ Definition “dof”), respectively:
Z ∼ N(0, 1)
V ∼ χ²(ν) .      (1)
Then, the ratio of Z to the square root of V , divided by the respective degrees of freedom, is said to
be t-distributed with degrees of freedom ν:
Y = Z / √(V/ν) ∼ t(ν) . (2)
The t-distribution is also called “Student’s t-distribution”, after William S. Gosset a.k.a. “Student”.
Sources:
• Wikipedia (2021): “Student’s t-distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2021-04-21; URL: https://en.wikipedia.org/wiki/Student%27s_t-distribution#Characterization.
Y = σX + µ (1)
is said to follow a non-standardized t-distribution with non-centrality µ, scale σ² and degrees of
freedom ν:
Y ∼ nst(µ, σ 2 , ν) . (2)
Sources:
• Wikipedia (2021): “Student’s t-distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2021-05-20; URL: https://en.wikipedia.org/wiki/Student%27s_t-distribution#Generalized_Student’
s_t-distribution.
Metadata: ID: D152 | shortcut: nst | author: JoramSoch | date: 2021-05-20, 07:35.
X ∼ nst(µ, σ², ν) . (1)
Then, subtracting the mean and dividing by the square root of the scale results in a random variable
(→ Definition I/1.2.2) following a t-distribution (→ Definition II/3.3.1) with degrees of freedom ν:
Y = (X − µ)/σ ∼ t(ν) . (2)
Proof: The non-standardized t-distribution is a special case (→ Proof “nst-mvt”) of the multivariate
t-distribution (→ Definition II/4.2.1) in which the mean vector and scale matrix are scalars:
Y = (X − µ)/σ = X/σ − µ/σ
  ∼ t( µ/σ − µ/σ, (1/σ)² σ², ν )      (5)
  = t(0, 1, ν) .
Sources:
• original work
Metadata: ID: P232 | shortcut: nst-t | author: JoramSoch | date: 2021-05-11, 15:46.
Proof: The probability density function of the multivariate t-distribution (→ Proof II/4.2.2) is
t(x; µ, Σ, ν) = √( 1/((νπ)ⁿ |Σ|) ) · Γ([ν + n]/2)/Γ(ν/2) · [ 1 + (1/ν) (x − µ)ᵀ Σ⁻¹ (x − µ) ]^{−(ν+n)/2} . (1)
Setting n = 1, such that x ∈ R, as well as µ = 0 and Σ = 1, we obtain
t(x; 0, 1, ν) = √( 1/((νπ)¹ |1|) ) · Γ([ν + 1]/2)/Γ(ν/2) · [ 1 + (1/ν) (x − 0)ᵀ 1⁻¹ (x − 0) ]^{−(ν+1)/2}
              = √(1/(νπ)) · Γ([ν + 1]/2)/Γ(ν/2) · [ 1 + x²/ν ]^{−(ν+1)/2}                                (2)
              = 1/√(νπ) · Γ([ν + 1]/2)/Γ(ν/2) · ( 1 + x²/ν )^{−(ν+1)/2} ,
which is equivalent to the probability density function of the t-distribution (→ Proof II/3.3.5).
Sources:
• Wikipedia (2022): “Multivariate t-distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-08-25; URL: https://en.wikipedia.org/wiki/Multivariate_t-distribution#Derivation.
Metadata: ID: P332 | shortcut: t-mvt | author: JoramSoch | date: 2022-08-25, 12:38.
3. UNIVARIATE CONTINUOUS DISTRIBUTIONS 215
T ∼ t(ν) . (1)
Then, the probability density function (→ Definition I/1.6.6) of T is
fT(t) = Γ((ν+1)/2) / ( Γ(ν/2) · √(νπ) ) · ( t²/ν + 1 )^(−(ν+1)/2) . (2)
Proof: A t-distributed random variable (→ Definition II/3.3.1) is defined as the ratio of a standard
normal random variable (→ Definition II/3.2.2) and the square root of a chi-squared random variable
(→ Definition II/3.7.1), divided by its degrees of freedom (→ Definition “dof”)
X ∼ N(0, 1), Y ∼ χ²(ν)   ⇒   T = X / √(Y/ν) ∼ t(ν) (3)
where X and Y are independent of each other (→ Definition I/1.3.6).
The probability density function (→ Proof II/3.2.10) of the standard normal distribution (→ Defi-
nition II/3.2.2) is
fX(x) = 1/√(2π) · e^(−x²/2) (4)
and the probability density function of the chi-squared distribution (→ Proof II/3.7.3) is
fY(y) = 1/( Γ(ν/2) · 2^(ν/2) ) · y^(ν/2 − 1) · e^(−y/2) . (5)
Now consider the bivariate transformation
T = X · √(ν/Y)
W = Y , (6)
such that the inverse functions X and Y in terms of T and W are
X = T · √(W/ν)
Y = W . (7)
This implies the following Jacobian matrix and determinant:
J = [ dX/dT , dX/dW ; dY/dT , dY/dW ] = [ √(W/ν) , T/(2√(νW)) ; 0 , 1 ]
|J| = √(W/ν) . (8)
Because X and Y are independent (→ Definition I/1.3.6), the joint density (→ Definition I/1.5.2) of
X and Y is equal to the product (→ Proof I/1.3.8) of the marginal densities (→ Definition I/1.5.3):
fT,W(t, w) = fX( t · √(w/ν) ) · fY(w) · |J|
  = 1/√(2π) · e^(−(t·√(w/ν))²/2) · 1/( Γ(ν/2) · 2^(ν/2) ) · w^(ν/2 − 1) · e^(−w/2) · √(w/ν) (11)
  = 1/( √(2πν) · Γ(ν/2) · 2^(ν/2) ) · w^((ν+1)/2 − 1) · e^(−(w/2)·(t²/ν + 1)) .
The marginal density (→ Definition I/1.5.3) of T can now be obtained by integrating out (→ Defi-
nition I/1.3.3) W :
fT(t) = ∫₀^∞ fT,W(t, w) dw
  = 1/( √(2πν) · Γ(ν/2) · 2^(ν/2) ) · ∫₀^∞ w^((ν+1)/2 − 1) · exp[ −(1/2)·(t²/ν + 1)·w ] dw
  = 1/( √(2πν) · Γ(ν/2) · 2^(ν/2) ) · Γ((ν+1)/2) / [ (1/2)·(t²/ν + 1) ]^((ν+1)/2)
    · ∫₀^∞ [ (1/2)·(t²/ν + 1) ]^((ν+1)/2) / Γ((ν+1)/2) · w^((ν+1)/2 − 1) · exp[ −(1/2)·(t²/ν + 1)·w ] dw (12)
At this point, we can recognize that the integrand is equal to the probability density function of a
gamma distribution (→ Proof II/3.4.6) with
a = (ν+1)/2   and   b = (1/2)·(t²/ν + 1) , (13)
and because a probability density function integrates to one (→ Definition I/1.6.6), we finally have:
fT(t) = 1/( √(2πν) · Γ(ν/2) · 2^(ν/2) ) · Γ((ν+1)/2) / [ (1/2)·(t²/ν + 1) ]^((ν+1)/2)
  = Γ((ν+1)/2) / ( Γ(ν/2) · √(νπ) ) · ( t²/ν + 1 )^(−(ν+1)/2) . (14)
Sources:
• Computation Empire (2021): “Student’s t Distribution: Derivation of PDF”; in: YouTube, retrieved
on 2021-10-11; URL: https://www.youtube.com/watch?v=6BraaGEVRY8.
Metadata: ID: P263 | shortcut: t-pdf | author: JoramSoch | date: 2021-10-12, 08:15.
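The marginalization step in the proof can be checked numerically. The following Python sketch (not part of the original text; helper names are ad hoc) integrates the joint density of equation (11) over w at a fixed t and compares the result with the closed-form density of equation (2):

```python
import math

def t_pdf(t, nu):
    # closed form from equation (2)/(14)
    c = math.exp(math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)) / math.sqrt(nu * math.pi)
    return c * (1.0 + t * t / nu) ** (-(nu + 1) / 2)

def joint_integrand(t, w, nu):
    # f_X(t*sqrt(w/nu)) * f_Y(w) * |J|, mirroring equation (11)
    f_x = math.exp(-(t * t * w / nu) / 2) / math.sqrt(2 * math.pi)
    log_f_y = (nu / 2 - 1) * math.log(w) - w / 2 - math.lgamma(nu / 2) - (nu / 2) * math.log(2)
    return f_x * math.exp(log_f_y) * math.sqrt(w / nu)

nu, t0 = 4, 1.3
n, hi = 40000, 120.0          # midpoint rule over w in (0, 120]
h = hi / n
f_num = sum(joint_integrand(t0, (i + 0.5) * h, nu) for i in range(n)) * h
```

The chi-squared factor is evaluated in log space via `math.lgamma` to avoid overflow for larger ν.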
X ∼ Gam(a, b) , (1)
if and only if its probability density function (→ Definition I/1.6.6) is given by
Gam(x; a, b) = b^a/Γ(a) · x^(a−1) · exp[−bx] ,  x > 0 (2)
where a > 0 and b > 0, and the density is zero, if x ≤ 0.
Sources:
• Koch, Karl-Rudolf (2007): “Gamma Distribution”; in: Introduction to Bayesian Statistics, Springer,
Berlin/Heidelberg, 2007, p. 47, eq. 2.172; URL: https://www.springer.com/de/book/9783540727231;
DOI: 10.1007/978-3-540-72726-2.
X ∼ Gam(a, 1) . (1)
Sources:
• JoramSoch (2017): “Gamma-distributed random numbers”; in: MACS – a new SPM toolbox for
model assessment, comparison and selection, retrieved on 2020-05-26; URL: https://github.com/
JoramSoch/MACS/blob/master/MD_gamrnd.m; DOI: 10.5281/zenodo.845404.
• NIST/SEMATECH (2012): “Gamma distribution”; in: e-Handbook of Statistical Methods, ch.
1.3.6.6.11; URL: https://www.itl.nist.gov/div898/handbook/eda/section3/eda366b.htm; DOI: 10.18434/M
Metadata: ID: D64 | shortcut: sgam | author: JoramSoch | date: 2020-05-26, 23:36.
X ∼ Gam(a, b) . (1)
Then, the quantity Y = bX will have a standard gamma distribution (→ Definition II/3.4.2) with
shape a and rate 1:
Y = bX ∼ Gam(a, 1) . (2)
Y = g(X) = bX (3)
with the inverse function
X = g⁻¹(Y) = (1/b) · Y . (4)
Because b is positive, g(X) is strictly increasing and we can calculate the cumulative distribution
function of a strictly increasing function (→ Proof I/1.6.15) as
FY(y) = { 0 , if y < min(𝒴); FX(g⁻¹(y)) , if y ∈ 𝒴; 1 , if y > max(𝒴) } . (5)
The cumulative distribution function of the gamma-distributed (→ Proof II/3.4.7) X is
FX(x) = ∫_(−∞)^x b^a/Γ(a) · t^(a−1) · exp[−bt] dt . (6)
Combining (5) and (6), we have:
FY(y) = FX(g⁻¹(y)) = ∫_(−∞)^(y/b) b^a/Γ(a) · t^(a−1) · exp[−bt] dt . (7)
Substituting s = bt, i.e. t = s/b, this becomes:
FY(y) = ∫_(−b∞)^(b·(y/b)) b^a/Γ(a) · (s/b)^(a−1) · exp[ −b·(s/b) ] d(s/b)
  = b^a/Γ(a) · 1/(b^(a−1) · b) · ∫_(−∞)^y s^(a−1) · exp[−s] ds (8)
  = ∫_(−∞)^y 1/Γ(a) · s^(a−1) · exp[−s] ds ,
which is the cumulative distribution function (→ Definition I/1.6.13) of the standard gamma distri-
bution (→ Definition II/3.4.2).
Sources:
• original work
Metadata: ID: P112 | shortcut: gam-sgam | author: JoramSoch | date: 2020-05-26, 23:14.
X ∼ Gam(a, b) . (1)
Then, the quantity Y = bX will have a standard gamma distribution (→ Definition II/3.4.2) with
shape a and rate 1:
Y = bX ∼ Gam(a, 1) . (2)
Y = g(X) = bX (3)
with the inverse function
X = g⁻¹(Y) = (1/b) · Y . (4)
Because b is positive, g(X) is strictly increasing and we can calculate the probability density function
of a strictly increasing function (→ Proof I/1.6.8) as
fY(y) = { fX(g⁻¹(y)) · dg⁻¹(y)/dy , if y ∈ 𝒴; 0 , if y ∉ 𝒴 } (5)
where Y = {y = g(x) : x ∈ X }. With the probability density function of the gamma distribution (→
Proof II/3.4.6), we have
fY(y) = b^a/Γ(a) · [g⁻¹(y)]^(a−1) · exp[ −b · g⁻¹(y) ] · dg⁻¹(y)/dy
  = b^a/Γ(a) · ((1/b)·y)^(a−1) · exp[ −b · (1/b)·y ] · d((1/b)·y)/dy
  = b^a/Γ(a) · 1/b^(a−1) · y^(a−1) · exp[−y] · 1/b (6)
  = 1/Γ(a) · y^(a−1) · exp[−y]
which is the probability density function (→ Definition I/1.6.6) of the standard gamma distribution
(→ Definition II/3.4.2).
Sources:
• original work
Metadata: ID: P177 | shortcut: gam-sgam2 | author: JoramSoch | date: 2020-10-15, 12:04.
Proof: Let X be a p×p positive-definite symmetric random matrix, such that X follows a Wishart distribution (→ Definition II/5.2.1):
X ∼ W(V, n) . (1)
Then, X is described by the probability density function (→ Proof “wish-pdf”)
p(X) = 1/Γₚ(n/2) · √( 1/(2^(np) · |V|^n) ) · |X|^((n−p−1)/2) · exp[ −(1/2) · tr(V⁻¹ X) ] (2)
where |A| is a matrix determinant, A−1 is a matrix inverse and Γp (x) is the multivariate gamma
function of order p. If p = 1, then Γp (x) = Γ(x) is the ordinary gamma function, x = X and v = V
are real numbers. Thus, the probability density function (→ Definition I/1.6.6) of x can be developed
as
p(x) = 1/Γ(n/2) · √( 1/(2^n · v^n) ) · x^((n−2)/2) · exp[ −(1/2) · v⁻¹ x ]
  = (2v)^(−n/2)/Γ(n/2) · x^(n/2 − 1) · exp[ −x/(2v) ] . (3)
Finally, substituting a = n/2 and b = 1/(2v), we get
p(x) = b^a/Γ(a) · x^(a−1) · exp[−bx] (4)
which is the probability density function of the gamma distribution (→ Proof II/3.4.6).
Sources:
• original work
Metadata: ID: P328 | shortcut: gam-wish | author: JoramSoch | date: 2022-07-14, 07:45.
X ∼ Gam(a, b) . (1)
Then, the probability density function (→ Definition I/1.6.6) of X is
fX(x) = b^a/Γ(a) · x^(a−1) · exp[−bx] . (2)
Proof: This follows directly from the definition of the gamma distribution (→ Definition II/3.4.1).
Sources:
• original work
Metadata: ID: P45 | shortcut: gam-pdf | author: JoramSoch | date: 2020-02-08, 23:41.
X ∼ Gam(a, b) . (1)
Then, the cumulative distribution function (→ Definition I/1.6.13) of X is
FX(x) = γ(a, bx)/Γ(a) (2)
where Γ(x) is the gamma function and γ(s, x) is the lower incomplete gamma function.
Proof: The probability density function of the gamma distribution (→ Proof II/3.4.6) is:
fX(x) = b^a/Γ(a) · x^(a−1) · exp[−bx] . (3)
Thus, the cumulative distribution function (→ Definition I/1.6.13) is:
FX(x) = ∫₀^x Gam(z; a, b) dz
  = ∫₀^x b^a/Γ(a) · z^(a−1) · exp[−bz] dz (4)
  = b^a/Γ(a) · ∫₀^x z^(a−1) · exp[−bz] dz .
Substituting t = bz, i.e. z = t/b, this becomes:
FX(x) = b^a/Γ(a) · ∫_(b·0)^(bx) (t/b)^(a−1) · exp[ −b·(t/b) ] d(t/b)
  = b^a/Γ(a) · 1/b^(a−1) · 1/b · ∫₀^(bx) t^(a−1) · exp[−t] dt (5)
  = 1/Γ(a) · ∫₀^(bx) t^(a−1) · exp[−t] dt .
With the definition of the lower incomplete gamma function, γ(s, x) = ∫₀^x t^(s−1) · exp[−t] dt, (6) we finally have:
FX(x) = γ(a, bx)/Γ(a) . (7)
Sources:
• Wikipedia (2020): “Incomplete gamma function”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-10-29; URL: https://en.wikipedia.org/wiki/Incomplete_gamma_function#Definition.
Metadata: ID: P178 | shortcut: gam-cdf | author: JoramSoch | date: 2020-10-15, 12:34.
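As a numerical illustration of equation (2) — not part of the original text — the following Python sketch implements the regularized lower incomplete gamma function γ(a, bx)/Γ(a) via its standard power series and compares it with a direct midpoint-rule integral of the gamma density (all function names are ad hoc):

```python
import math

def gam_pdf(x, a, b):
    # b^a/Γ(a) * x^(a-1) * exp(-b*x), evaluated in log space
    return math.exp(a * math.log(b) + (a - 1) * math.log(x) - b * x - math.lgamma(a))

def reg_lower_gamma(s, x):
    # regularized lower incomplete gamma γ(s,x)/Γ(s), via its power series
    term = total = 1.0 / s
    k = 1
    while term > total * 1e-16:
        term *= x / (s + k)
        total += term
        k += 1
    return total * math.exp(s * math.log(x) - x - math.lgamma(s))

a, b, x = 2.5, 1.7, 1.2
n = 20000                             # midpoint-rule integral of the pdf over (0, x)
h = x / n
F_num = sum(gam_pdf((i + 0.5) * h, a, b) for i in range(n)) * h
F_cls = reg_lower_gamma(a, b * x)     # γ(a, bx)/Γ(a), as in equation (2)
```

The series γ(s, x) = x^s e^(−x) Σ_k x^k / (s(s+1)···(s+k)) converges quickly for moderate x; the sketch is not meant as a production-quality special-function implementation.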
X ∼ Gam(a, b) . (1)
Then, the quantile function (→ Definition I/1.6.23) of X is
QX(p) = { −∞ , if p = 0; γ⁻¹(a, Γ(a)·p)/b , if p > 0 } (2)
where γ⁻¹(s, y) is the inverse of the lower incomplete gamma function γ(s, x).
Proof: The cumulative distribution function of the gamma distribution (→ Proof II/3.4.7) is:
FX(x) = { 0 , if x < 0; γ(a, bx)/Γ(a) , if x ≥ 0 } . (3)
The quantile function QX (p) is defined as (→ Definition I/1.6.23) the smallest x, such that FX (x) = p:
p = γ(a, bx)/Γ(a)
Γ(a) · p = γ(a, bx)
γ⁻¹(a, Γ(a) · p) = bx (6)
x = γ⁻¹(a, Γ(a) · p)/b .
Sources:
• Wikipedia (2020): “Incomplete gamma function”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-11-19; URL: https://en.wikipedia.org/wiki/Incomplete_gamma_function#Definition.
Metadata: ID: P194 | shortcut: gam-qf | author: JoramSoch | date: 2020-11-19, 07:31.
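The defining property of the quantile function — the smallest x with FX(x) = p — can be realized directly by bisection. The following Python sketch (not part of the original text; helper names are ad hoc) approximates the gamma CDF by numerical integration and inverts it:

```python
import math

def gam_pdf(x, a, b):
    return math.exp(a * math.log(b) + (a - 1) * math.log(x) - b * x - math.lgamma(a))

def gam_cdf(x, a, b, n=4000):
    # midpoint-rule integral of the density from 0 to x
    h = x / n
    return sum(gam_pdf((i + 0.5) * h, a, b) for i in range(n)) * h

def gam_qf(p, a, b):
    # smallest x with F(x) = p (→ Definition I/1.6.23), found by bisection
    lo, hi = 0.0, 1.0
    while gam_cdf(hi, a, b) < p:   # grow the bracket until it contains the quantile
        hi *= 2.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if gam_cdf(mid, a, b) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

a, b, p = 3.0, 2.0, 0.7
q = gam_qf(p, a, b)
```

Bisection is valid here because the CDF is continuous and strictly increasing on (0, ∞), so the quantile is the unique root of FX(x) − p.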
3.4.9 Mean
Theorem: Let X be a random variable (→ Definition I/1.2.2) following a gamma distribution (→
Definition II/3.4.1):
X ∼ Gam(a, b) . (1)
Then, the mean or expected value (→ Definition I/1.7.1) of X is
E(X) = a/b . (2)
Proof: The expected value (→ Definition I/1.7.1) is the probability-weighted average over all possible
values:
E(X) = ∫_𝒳 x · fX(x) dx . (3)
With the probability density function of the gamma distribution (→ Proof II/3.4.6), this reads:
E(X) = ∫₀^∞ x · b^a/Γ(a) · x^(a−1) · exp[−bx] dx
  = ∫₀^∞ b^a/Γ(a) · x^((a+1)−1) · exp[−bx] dx (4)
  = ∫₀^∞ (1/b) · b^(a+1)/Γ(a) · x^((a+1)−1) · exp[−bx] dx .
Employing the relation Γ(x + 1) = Γ(x) · x, we have
E(X) = ∫₀^∞ (a/b) · b^(a+1)/Γ(a+1) · x^((a+1)−1) · exp[−bx] dx (5)
and again using the density of the gamma distribution (→ Proof II/3.4.6), we get
E(X) = (a/b) · ∫₀^∞ Gam(x; a+1, b) dx
  = a/b . (6)
Sources:
• Turlapaty, Anish (2013): “Gamma random variable: mean & variance”; in: YouTube, retrieved on
2020-05-19; URL: https://www.youtube.com/watch?v=Sy4wP-Y2dmA.
Metadata: ID: P108 | shortcut: gam-mean | author: JoramSoch | date: 2020-05-19, 06:54.
3.4.10 Variance
Theorem: Let X be a random variable (→ Definition I/1.2.2) following a gamma distribution (→
Definition II/3.4.1):
X ∼ Gam(a, b) . (1)
Then, the variance (→ Definition I/1.8.1) of X is
Var(X) = a/b² . (2)
Proof: The variance (→ Definition I/1.8.1) can be expressed in terms of expected values (→ Proof I/1.8.3) as
Var(X) = E(X²) − E(X)² . (3)
The mean of the gamma distribution (→ Proof II/3.4.9) is
E(X) = a/b . (4)
With the probability density function of the gamma distribution (→ Proof II/3.4.6), the second moment becomes:
E(X²) = ∫₀^∞ x² · b^a/Γ(a) · x^(a−1) · exp[−bx] dx
  = ∫₀^∞ b^a/Γ(a) · x^((a+2)−1) · exp[−bx] dx (5)
  = ∫₀^∞ (1/b²) · b^(a+2)/Γ(a) · x^((a+2)−1) · exp[−bx] dx .
Twice-applying the relation Γ(x + 1) = Γ(x) · x, we have
E(X²) = ∫₀^∞ ( a·(a+1)/b² ) · b^(a+2)/Γ(a+2) · x^((a+2)−1) · exp[−bx] dx (6)
and again using the density of the gamma distribution (→ Proof II/3.4.6), we get
E(X²) = ( a·(a+1)/b² ) · ∫₀^∞ Gam(x; a+2, b) dx
  = (a² + a)/b² . (7)
Plugging (7) and (4) into (3), the variance of a gamma random variable finally becomes
Var(X) = (a² + a)/b² − (a/b)²
  = a/b² . (8)
Sources:
• Turlapaty, Anish (2013): “Gamma random variable: mean & variance”; in: YouTube, retrieved on
2020-05-19; URL: https://www.youtube.com/watch?v=Sy4wP-Y2dmA.
Metadata: ID: P109 | shortcut: gam-var | author: JoramSoch | date: 2020-05-19, 07:20.
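The mean and variance results of the last two proofs can be cross-checked by numerical integration. The following Python sketch (not part of the original text; helper names are ad hoc) computes the first two moments of a gamma density with the midpoint rule:

```python
import math

def gam_pdf(x, a, b):
    return math.exp(a * math.log(b) + (a - 1) * math.log(x) - b * x - math.lgamma(a))

def moment(k, a, b, hi=50.0, n=50000):
    # E[X^k] by the midpoint rule (the exp(-b*x) tail beyond hi is negligible here)
    h = hi / n
    return sum(((i + 0.5) * h) ** k * gam_pdf((i + 0.5) * h, a, b) for i in range(n)) * h

a, b = 2.5, 1.7
m1 = moment(1, a, b)               # should be close to a/b
var = moment(2, a, b) - m1 ** 2    # should be close to a/b^2
```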
X ∼ Gam(a, b) . (1)
Then, the expectation (→ Definition I/1.7.1) of the natural logarithm of X is
E(ln X) = ψ(a) − ln(b) (2)
where ψ(x) is the digamma function.
Proof: Let Y = ln(X), such that E(Y) = E(ln X) and consider the special case that b = 1. In this
case, the probability density function of the gamma distribution (→ Proof II/3.4.6) is
fX(x) = 1/Γ(a) · x^(a−1) · exp[−x] . (3)
Multiplying this function with dx, we obtain
fX(x) dx = 1/Γ(a) · x^a · exp[−x] · dx/x . (4)
Substituting y = ln x, i.e. x = e^y, such that dx/dy = x, i.e. dx/x = dy, we get
fY(y) dy = 1/Γ(a) · (e^y)^a · exp[−e^y] dy
  = 1/Γ(a) · exp[ a·y − e^y ] dy . (5)
Because fY (y) integrates to one, we have
1 = ∫_ℝ fY(y) dy
1 = ∫_ℝ 1/Γ(a) · exp[ a·y − e^y ] dy (6)
Γ(a) = ∫_ℝ exp[ a·y − e^y ] dy .
Differentiating the integrand with respect to a and using (5), we have:
d/da exp[ a·y − e^y ] dy = y · exp[ a·y − e^y ] dy = Γ(a) · y · fY(y) dy . (7)
Thus, using (7) and (6), the expected value of Y becomes:
E(Y) = ∫_ℝ y · fY(y) dy
  = 1/Γ(a) · ∫_ℝ d/da exp[ a·y − e^y ] dy
  = 1/Γ(a) · d/da ∫_ℝ exp[ a·y − e^y ] dy (8)
  = 1/Γ(a) · d/da Γ(a)
  = Γ′(a)/Γ(a) .
With the derivative of the logarithm
d/dx ln f(x) = f′(x)/f(x) (9)
and the definition of the digamma function
ψ(x) = d/dx ln Γ(x) , (10)
we have E(ln X) = Γ′(a)/Γ(a) = ψ(a) for b = 1. For general b, note that bX ∼ Gam(a, 1) (→ Proof “gam-sgam”), such that E(ln X) = E(ln(bX)) − ln(b) = ψ(a) − ln(b).
Sources:
• whuber (2018): “What is the expected value of the logarithm of Gamma distribution?”; in:
StackExchange CrossValidated, retrieved on 2020-05-25; URL: https://stats.stackexchange.com/
questions/370880/what-is-the-expected-value-of-the-logarithm-of-gamma-distribution.
Metadata: ID: P110 | shortcut: gam-logmean | author: JoramSoch | date: 2020-05-25, 21:28.
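The logarithmic expectation can be verified numerically. The following Python sketch (not part of the original text; the finite-difference digamma is an ad hoc stand-in for a proper special-function routine) integrates ln(x) against the gamma density and compares with ψ(a) − ln(b):

```python
import math

def digamma(x, h=1e-5):
    # central-difference approximation of ψ(x) = d/dx ln Γ(x)
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def gam_pdf(x, a, b):
    return math.exp(a * math.log(b) + (a - 1) * math.log(x) - b * x - math.lgamma(a))

a, b = 3.0, 2.0
hi, n = 50.0, 50000
h = hi / n
# E(ln X) by the midpoint rule
e_lnx = sum(math.log((i + 0.5) * h) * gam_pdf((i + 0.5) * h, a, b) for i in range(n)) * h
```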
3. UNIVARIATE CONTINUOUS DISTRIBUTIONS 227
3.4.12 Expectation of x ln x
Theorem: Let X be a random variable (→ Definition I/1.2.2) following a gamma distribution (→
Definition II/3.4.1):
X ∼ Gam(a, b) . (1)
Then, the mean or expected value (→ Definition I/1.7.1) of (X · ln X) is
E(X ln X) = (a/b) · [ ψ(a + 1) − ln(b) ] . (2)
Proof: With the definition of the expected value (→ Definition I/1.7.1), the law of the unconscious
statistician (→ Proof I/1.7.12) and the probability density function of the gamma distribution (→
Proof II/3.4.6), we have:
E(X ln X) = ∫₀^∞ x ln x · b^a/Γ(a) · x^(a−1) · exp[−bx] dx
  = 1/Γ(a) · ∫₀^∞ ln x · (b^(a+1)/b) · x^a · exp[−bx] dx (3)
  = Γ(a+1)/(Γ(a) · b) · ∫₀^∞ ln x · b^(a+1)/Γ(a+1) · x^((a+1)−1) · exp[−bx] dx
The integral now corresponds to the logarithmic expectation of a gamma distribution (→ Proof II/3.4.11) with shape a + 1 and rate b:
∫₀^∞ ln x · Gam(x; a+1, b) dx = ψ(a + 1) − ln(b) . (4)
Together with the relation
Γ(x + 1) = Γ(x) · x ⇔ Γ(x + 1)/Γ(x) = x , (6)
the expression in equation (3) develops into:
E(X ln X) = (a/b) · [ ψ(a + 1) − ln(b) ] . (7)
Sources:
• gunes (2020): “What is the expected value of x log(x) of the gamma distribution?”; in: StackEx-
change CrossValidated, retrieved on 2020-10-15; URL: https://stats.stackexchange.com/questions/
457357/what-is-the-expected-value-of-x-logx-of-the-gamma-distribution.
Metadata: ID: P179 | shortcut: gam-xlogx | author: JoramSoch | date: 2020-10-15, 13:02.
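This expectation, too, admits a quick numerical cross-check. The following Python sketch (not part of the original text; the finite-difference digamma is an ad hoc approximation) integrates x·ln(x) against the gamma density and compares with (a/b)·[ψ(a+1) − ln(b)]:

```python
import math

def digamma(x, h=1e-5):
    # central-difference approximation of ψ(x)
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def gam_pdf(x, a, b):
    return math.exp(a * math.log(b) + (a - 1) * math.log(x) - b * x - math.lgamma(a))

a, b = 3.0, 2.0
hi, n = 50.0, 50000
h = hi / n
# E(X ln X) by the midpoint rule
e_xlnx = sum((i + 0.5) * h * math.log((i + 0.5) * h) * gam_pdf((i + 0.5) * h, a, b)
             for i in range(n)) * h
expected = (a / b) * (digamma(a + 1) - math.log(b))
```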
X ∼ Gam(a, b) (1)
Then, the differential entropy (→ Definition I/2.2.1) of X in nats is
h(X) = a + ln Γ(a) + (1 − a) · ψ(a) − ln(b) . (2)
Proof: The differential entropy is the negative expected value of the log-density; with the probability density function of the gamma distribution (→ Proof II/3.4.6), this gives:
h(X) = −E[ ln( b^a/Γ(a) · x^(a−1) · exp[−bx] ) ]
  = −E[ a · ln b − ln Γ(a) + (a − 1) · ln x − b·x ] (5)
  = −a · ln b + ln Γ(a) − (a − 1) · E(ln x) + b · E(x) .
Using the mean (→ Proof II/3.4.9) and logarithmic expectation (→ Proof II/3.4.11) of the gamma
distribution (→ Definition II/3.4.1)
X ∼ Gam(a, b) ⇒ E(X) = a/b and E(ln X) = ψ(a) − ln(b) , (6)
the differential entropy (→ Definition I/2.2.1) of X becomes:
h(X) = −a · ln b + ln Γ(a) − (a − 1) · (ψ(a) − ln b) + b · (a/b)
  = −a · ln b + ln Γ(a) + (1 − a) · ψ(a) + a · ln b − ln b + a (7)
  = a + ln Γ(a) + (1 − a) · ψ(a) − ln b .
Sources:
• Wikipedia (2021): “Gamma distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
07-14; URL: https://en.wikipedia.org/wiki/Gamma_distribution#Information_entropy.
Metadata: ID: P239 | shortcut: gam-dent | author: JoramSoch | date: 2021-07-14, 07:37.
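The closed-form entropy can be compared against a direct numerical evaluation of −∫ f(x) ln f(x) dx. The following Python sketch is not part of the original text and uses an ad hoc finite-difference digamma:

```python
import math

def digamma(x, h=1e-5):
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def gam_pdf(x, a, b):
    return math.exp(a * math.log(b) + (a - 1) * math.log(x) - b * x - math.lgamma(a))

a, b = 3.0, 2.0
hi, n = 50.0, 50000
h = hi / n
# h(X) = -∫ f(x) ln f(x) dx by the midpoint rule
h_num = -sum(gam_pdf((i + 0.5) * h, a, b) * math.log(gam_pdf((i + 0.5) * h, a, b))
             for i in range(n)) * h
h_cls = a + math.lgamma(a) + (1 - a) * digamma(a) - math.log(b)
```

Note that the integrand f·ln f tends to zero at both ends of the range for these parameter values, so the truncated midpoint rule is adequate for a sanity check.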
P: X ∼ Gam(a₁, b₁)
Q: X ∼ Gam(a₂, b₂) . (1)
Then, the Kullback-Leibler divergence (→ Definition I/2.5.1) of P from Q is
KL[P || Q] = a₂ · ln(b₁/b₂) − ln( Γ(a₁)/Γ(a₂) ) + (a₁ − a₂) · ψ(a₁) − (b₁ − b₂) · a₁/b₁ . (2)
Proof: The KL divergence for a continuous random variable (→ Definition I/2.5.1) is given by
KL[P || Q] = ∫_𝒳 p(x) · ln( p(x)/q(x) ) dx (3)
which, applied to the gamma distributions (→ Definition II/3.4.1) in (1), yields
KL[P || Q] = ∫_(−∞)^(+∞) Gam(x; a₁, b₁) · ln( Gam(x; a₁, b₁)/Gam(x; a₂, b₂) ) dx
  = ⟨ ln( Gam(x; a₁, b₁)/Gam(x; a₂, b₂) ) ⟩_p(x) . (4)
Using the probability density function of the gamma distribution (→ Proof II/3.4.6), this becomes:
KL[P || Q] = ⟨ ln( (b₁^a₁/Γ(a₁) · x^(a₁−1) · exp[−b₁x]) / (b₂^a₂/Γ(a₂) · x^(a₂−1) · exp[−b₂x]) ) ⟩_p(x)
  = ⟨ ln( b₁^a₁/b₂^a₂ · Γ(a₂)/Γ(a₁) · x^(a₁−a₂) · exp[ −(b₁ − b₂)·x ] ) ⟩_p(x) (5)
Using the mean of the gamma distribution (→ Proof II/3.4.9) and the expected value of a logarith-
mized gamma variate (→ Proof II/3.4.11)
x ∼ Gam(a, b) ⇒ ⟨x⟩ = a/b and ⟨ln x⟩ = ψ(a) − ln(b) , (6)
the KL divergence in (5) becomes:
KL[P || Q] = a₁ · ln b₁ − a₂ · ln b₂ − ln Γ(a₁) + ln Γ(a₂) + (a₁ − a₂) · (ψ(a₁) − ln(b₁)) − (b₁ − b₂) · a₁/b₁
  = a₂ · ln b₁ − a₂ · ln b₂ − ln Γ(a₁) + ln Γ(a₂) + (a₁ − a₂) · ψ(a₁) − (b₁ − b₂) · a₁/b₁ . (7)
Rewriting the log terms yields
KL[P || Q] = a₂ · ln(b₁/b₂) − ln( Γ(a₁)/Γ(a₂) ) + (a₁ − a₂) · ψ(a₁) − (b₁ − b₂) · a₁/b₁ . (8)
Sources:
• Penny, William D. (2001): “KL-Divergences of Normal, Gamma, Dirichlet and Wishart densi-
ties”; in: University College, London; URL: https://www.fil.ion.ucl.ac.uk/~wpenny/publications/
densities.ps.
Metadata: ID: P93 | shortcut: gam-kl | author: JoramSoch | date: 2020-05-05, 08:41.
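The closed-form KL divergence can be cross-checked against a direct numerical evaluation of ∫ p(x) ln(p(x)/q(x)) dx. The following Python sketch is not part of the original text; the finite-difference digamma is an ad hoc approximation:

```python
import math

def digamma(x, h=1e-5):
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def gam_pdf(x, a, b):
    return math.exp(a * math.log(b) + (a - 1) * math.log(x) - b * x - math.lgamma(a))

a1, b1, a2, b2 = 3.0, 2.0, 2.0, 1.0
hi, n = 50.0, 50000
h = hi / n
# KL[P||Q] = ∫ p(x) ln(p(x)/q(x)) dx by the midpoint rule
kl_num = sum(gam_pdf(x, a1, b1) * math.log(gam_pdf(x, a1, b1) / gam_pdf(x, a2, b2))
             for x in ((i + 0.5) * h for i in range(n))) * h
kl_cls = (a2 * math.log(b1 / b2) - (math.lgamma(a1) - math.lgamma(a2))
          + (a1 - a2) * digamma(a1) - (b1 - b2) * a1 / b1)
```

The numeric value should also come out non-negative, as any KL divergence must.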
X ∼ Exp(λ) , (1)
if and only if its probability density function (→ Definition I/1.6.6) is given by
Exp(x; λ) = λ · exp[−λx] ,  x ≥ 0
where λ > 0, and the density is zero, if x < 0.
Sources:
• Wikipedia (2020): “Exponential distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2020-02-08; URL: https://en.wikipedia.org/wiki/Exponential_distribution#Definitions.
Proof: The probability density function of the gamma distribution (→ Proof II/3.4.6) is
Gam(x; a, b) = b^a/Γ(a) · x^(a−1) · exp[−bx] . (1)
Setting a = 1 and b = λ, we obtain
Gam(x; 1, λ) = λ¹/Γ(1) · x^(1−1) · exp[−λx]
  = λ · x⁰/Γ(1) · exp[−λx] (2)
  = λ · exp[−λx] ,
which is equivalent to the probability density function of the exponential distribution (→ Proof II/3.5.3).
Sources:
• original work
Metadata: ID: P69 | shortcut: exp-gam | author: JoramSoch | date: 2020-03-02, 20:49.
X ∼ Exp(λ) . (1)
Then, the probability density function (→ Definition I/1.6.6) of X is
fX(x) = λ · exp[−λx] . (2)
Proof: This follows directly from the definition of the exponential distribution (→ Definition II/3.5.1).
Sources:
• original work
Metadata: ID: P46 | shortcut: exp-pdf | author: JoramSoch | date: 2020-02-08, 23:53.
X ∼ Exp(λ) . (1)
Then, the cumulative distribution function (→ Definition I/1.6.13) of X is
FX(x) = { 0 , if x < 0; 1 − exp[−λx] , if x ≥ 0 } . (2)
Proof: The probability density function of the exponential distribution (→ Proof II/3.5.3) is:
Exp(x; λ) = { 0 , if x < 0; λ · exp[−λx] , if x ≥ 0 } . (3)
If x < 0, we have:
FX(x) = ∫_(−∞)^x 0 dz = 0 . (5)
If x ≥ 0, we have:
FX(x) = ∫_(−∞)^0 Exp(z; λ) dz + ∫₀^x Exp(z; λ) dz
  = ∫_(−∞)^0 0 dz + ∫₀^x λ · exp[−λz] dz
  = 0 + λ · [ −(1/λ) · exp[−λz] ]₀^x (6)
  = λ · ( −(1/λ) · exp[−λx] − ( −(1/λ) · exp[−λ·0] ) )
  = 1 − exp[−λx] .
Sources:
• original work
Metadata: ID: P48 | shortcut: exp-cdf | author: JoramSoch | date: 2020-02-11, 14:48.
X ∼ Exp(λ) . (1)
Then, the quantile function (→ Definition I/1.6.23) of X is
QX(p) = { −∞ , if p = 0; −ln(1 − p)/λ , if p > 0 } . (2)
Proof: The cumulative distribution function of the exponential distribution (→ Proof II/3.5.4) is:
FX(x) = { 0 , if x < 0; 1 − exp[−λx] , if x ≥ 0 } . (3)
The quantile function QX (p) is defined as (→ Definition I/1.6.23) the smallest x, such that FX (x) = p:
p = 1 − exp[−λx]
exp[−λx] = 1 − p
−λx = ln(1 − p) (6)
x = −ln(1 − p)/λ .
Sources:
• original work
Metadata: ID: P50 | shortcut: exp-qf | author: JoramSoch | date: 2020-02-12, 15:48.
3.5.6 Mean
Theorem: Let X be a random variable (→ Definition I/1.2.2) following an exponential distribution
(→ Definition II/3.5.1):
X ∼ Exp(λ) . (1)
Then, the mean or expected value (→ Definition I/1.7.1) of X is
E(X) = 1/λ . (2)
Proof: The expected value (→ Definition I/1.7.1) is the probability-weighted average over all possible
values:
E(X) = ∫_𝒳 x · fX(x) dx . (3)
With the probability density function of the exponential distribution (→ Proof II/3.5.3), this reads:
E(X) = ∫₀^(+∞) x · λ · exp(−λx) dx
  = λ · ∫₀^(+∞) x · exp(−λx) dx . (4)
Using the antiderivative ∫ x · exp(−λx) dx = (−x/λ − 1/λ²) · exp(−λx), this becomes:
E(X) = λ · [ (−x/λ − 1/λ²) · exp(−λx) ]₀^(+∞)
  = λ · ( lim_(x→∞) (−x/λ − 1/λ²) · exp(−λx) − (−0/λ − 1/λ²) · exp(−λ·0) ) (6)
  = λ · ( 0 + 1/λ² )
  = 1/λ .
Sources:
• Koch, Karl-Rudolf (2007): “Expected Value”; in: Introduction to Bayesian Statistics, Springer,
Berlin/Heidelberg, 2007, p. 39, eq. 2.142a; URL: https://www.springer.com/de/book/9783540727231;
DOI: 10.1007/978-3-540-72726-2.
Metadata: ID: P47 | shortcut: exp-mean | author: JoramSoch | date: 2020-02-10, 21:57.
3.5.7 Median
Theorem: Let X be a random variable (→ Definition I/1.2.2) following an exponential distribution
(→ Definition II/3.5.1):
X ∼ Exp(λ) . (1)
Then, the median (→ Definition I/1.11.1) of X is
median(X) = ln 2/λ . (2)
Proof: The median (→ Definition I/1.11.1) is the value at which the cumulative distribution function
(→ Definition I/1.6.13) is 1/2:
FX(median(X)) = 1/2 . (3)
The quantile function of the exponential distribution (→ Proof II/3.5.5) is
x = QX(p) = −ln(1 − p)/λ (5)
and setting p = 1/2, we obtain:
median(X) = −ln(1 − 1/2)/λ = ln 2/λ . (6)
Sources:
• original work
Metadata: ID: P49 | shortcut: exp-med | author: JoramSoch | date: 2020-02-11, 15:03.
3.5.8 Mode
Theorem: Let X be a random variable (→ Definition I/1.2.2) following an exponential distribution
(→ Definition II/3.5.1):
X ∼ Exp(λ) . (1)
Then, the mode (→ Definition I/1.11.2) of X is
mode(X) = 0 . (2)
Proof: The mode (→ Definition I/1.11.2) is the value which maximizes the probability density function (→ Definition I/1.6.6):
mode(X) = arg max_x fX(x) . (3)
The probability density function of the exponential distribution (→ Proof II/3.5.3) is:
fX(x) = { 0 , if x < 0; λ·e^(−λx) , if x ≥ 0 } . (4)
Since
fX(0) = λ (5)
and
fX(x) = λ·e^(−λx) < λ for all x > 0 , (6)
the density is maximal at zero:
mode(X) = 0 . (7)
Sources:
• original work
Metadata: ID: P51 | shortcut: exp-mode | author: kantundpeterpan | date: 2020-02-12, 15:53.
Sources:
• Wikipedia (2022): “Log-normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-02-07; URL: https://en.wikipedia.org/wiki/Log-normal_distribution.
Metadata: ID: D170 | shortcut: lognorm | author: majapavlo | date: 2022-02-07, 22:33.
X ∼ ln N(µ, σ²) . (1)
Then, the probability density function (→ Definition I/1.6.6) of X is given by:
" #
1 (ln x − µ)2
fX (x) = √ · exp − . (2)
xσ 2π 2σ 2
Proof: A log-normally distributed random variable (→ Definition II/3.6.1) is defined as the exponential function of a normal random variable (→ Definition II/3.2.1):
X = exp(Y) = g(Y) with Y ∼ N(µ, σ²) ,
with inverse Y = g⁻¹(X) = ln(X). Because g is strictly increasing, we can calculate the probability density function of a strictly increasing function (→ Proof I/1.6.8) as
fX(x) = { fY(g⁻¹(x)) · dg⁻¹(x)/dx , if x ∈ 𝒳; 0 , if x ∉ 𝒳 } (7)
where X = {x = g(y) : y ∈ Y}. With the probability density function of the normal distribution (→
Proof II/3.2.10), we have
fX(x) = fY(g⁻¹(x)) · dg⁻¹(x)/dx
  = 1/√(2πσ²) · exp[ −(1/2)·( (g⁻¹(x) − µ)/σ )² ] · dg⁻¹(x)/dx
  = 1/√(2πσ²) · exp[ −(1/2)·( (ln x − µ)/σ )² ] · d(ln x)/dx (8)
  = 1/√(2πσ²) · exp[ −(1/2)·( (ln x − µ)/σ )² ] · 1/x
  = 1/( x · σ · √(2π) ) · exp[ −(ln x − µ)²/(2σ²) ]
which is the probability density function (→ Definition I/1.6.6) of the log-normal distribution (→
Definition II/3.6.1).
Sources:
• Taboga, Marco (2021): “Log-normal distribution”; in: Lectures on probability and statistics, re-
trieved on 2022-02-13; URL: https://www.statlect.com/probability-distributions/log-normal-distribution.
Metadata: ID: P310 | shortcut: lognorm-pdf | author: majapavlo | date: 2022-02-13, 10:05.
X ∼ ln N(µ, σ²) . (1)
Then, the cumulative distribution function (→ Definition “lognorm-cdf”) of X is
FX(x) = (1/2) · [ 1 + erf( (ln x − µ)/(√2 · σ) ) ] (2)
where erf(x) is the error function defined as
erf(x) = 2/√π · ∫₀^x exp(−t²) dt . (3)
Proof: The probability density function of the log-normal distribution (→ Proof II/3.6.2) is:
" 2 #
1 ln x − µ
fX (x) = √ · exp − √ . (4)
xσ 2π 2σ
Thus, the cumulative distribution function (→ Definition “lognorm-cdf”) is:
FX(x) = ∫_(−∞)^x ln N(z; µ, σ²) dz
  = ∫_(−∞)^x 1/( z · σ · √(2π) ) · exp[ −( (ln z − µ)/(√2 · σ) )² ] dz (5)
  = 1/( σ · √(2π) ) · ∫_(−∞)^x (1/z) · exp[ −( (ln z − µ)/(√2 · σ) )² ] dz .
From this point forward, the proof is similar to the derivation of the cumulative distribution function for the normal distribution (→ Proof II/3.2.12). Substituting t = (ln z − µ)/(√2 σ), i.e. ln z = √2 σt + µ, z = exp(√2 σt + µ), this becomes:
FX(x) = 1/( σ · √(2π) ) · ∫_((−∞−µ)/(√2σ))^((ln x−µ)/(√2σ)) 1/exp(√2 σt + µ) · exp(−t²) d[ exp(√2 σt + µ) ]
  = (√2 σ)/( σ · √(2π) ) · ∫_(−∞)^((ln x−µ)/(√2σ)) exp(−t²) · 1/exp(√2 σt + µ) · exp(√2 σt + µ) dt
  = 1/√π · ∫_(−∞)^((ln x−µ)/(√2σ)) exp(−t²) dt (6)
  = 1/√π · ∫_(−∞)^0 exp(−t²) dt + 1/√π · ∫₀^((ln x−µ)/(√2σ)) exp(−t²) dt
  = 1/√π · ∫₀^∞ exp(−t²) dt + 1/√π · ∫₀^((ln x−µ)/(√2σ)) exp(−t²) dt .
Applying (3) to (6), we have:
FX(x) = (1/2) · lim_(x→∞) erf(x) + (1/2) · erf( (ln x − µ)/(√2 σ) )
  = 1/2 + (1/2) · erf( (ln x − µ)/(√2 σ) ) (7)
  = (1/2) · [ 1 + erf( (ln x − µ)/(√2 σ) ) ] .
Sources:
• skdhfgeq2134 (2015): “How to derive the cdf of a lognormal distribution from its pdf”; in: StackEx-
change, retrieved on 2022-06-29; URL: https://stats.stackexchange.com/questions/151398/how-to-derive-t
151404#151404.
Metadata: ID: P325 | shortcut: lognorm-cdf | author: majapavlo | date: 2022-06-29, 22:20.
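Python's standard library exposes the error function directly, so the closed form of equation (2) can be compared with a brute-force integral of the log-normal density. The following sketch is not part of the original text; helper names are ad hoc:

```python
import math

def lognorm_pdf(x, mu, sigma):
    # 1/(x σ √(2π)) · exp(-(ln x - µ)²/(2σ²))
    return (math.exp(-((math.log(x) - mu) ** 2) / (2 * sigma ** 2))
            / (x * sigma * math.sqrt(2 * math.pi)))

def lognorm_cdf(x, mu, sigma):
    # closed form from equation (2), via math.erf
    return 0.5 * (1 + math.erf((math.log(x) - mu) / (math.sqrt(2) * sigma)))

mu, sigma, x = 0.5, 0.4, 2.0
n = 50000
h = x / n
F_num = sum(lognorm_pdf((i + 0.5) * h, mu, sigma) for i in range(n)) * h
F_cls = lognorm_cdf(x, mu, sigma)
```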
X ∼ ln N(µ, σ²) . (1)
Then, the quantile function (→ Definition I/1.6.23) of X is
QX(p) = exp( µ + √2 σ · erf⁻¹(2p − 1) ) (2)
where erf⁻¹(x) is the inverse error function.
Proof: The cumulative distribution function of the log-normal distribution (→ Proof II/3.6.3) is:
FX(x) = (1/2) · [ 1 + erf( (ln x − µ)/(√2 σ) ) ] . (3)
From this point forward, the proof is similar to the derivation of the quantile function for the normal
distribution (→ Proof II/3.2.15). Because the cumulative distribution function (CDF) is strictly
monotonically increasing, the quantile function is equal to the inverse of the CDF (→ Proof I/1.6.24):
p = (1/2) · [ 1 + erf( (ln x − µ)/(√2 σ) ) ]
2p − 1 = erf( (ln x − µ)/(√2 σ) )
erf⁻¹(2p − 1) = (ln x − µ)/(√2 σ) (5)
x = exp( µ + √2 σ · erf⁻¹(2p − 1) ) .
Sources:
• Wikipedia (2022): “Log-normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-07-08; URL: https://en.wikipedia.org/wiki/Log-normal_distribution#Mode,_median,_quantiles.
Metadata: ID: P326 | shortcut: lognorm-qf | author: majapavlo | date: 2022-07-09, 11:05.
3.6.5 Mean
Theorem: Let X be a random variable (→ Definition I/1.2.2) following a log-normal distribution
(→ Definition II/3.6.1):
X ∼ ln N(µ, σ²) (1)
Then, the mean or expected value (→ Definition I/1.7.1) of X is
E(X) = exp( µ + (1/2)·σ² ) (2)
Proof: The expected value (→ Definition I/1.7.1) is the probability-weighted average over all possible
values:
E(X) = ∫_𝒳 x · fX(x) dx (3)
With the probability density function of the log-normal distribution (→ Proof II/3.6.2), this is:
E(X) = ∫₀^(+∞) x · 1/( x · √(2πσ²) ) · exp[ −(1/2)·(ln x − µ)²/σ² ] dx
  = 1/√(2πσ²) · ∫₀^(+∞) exp[ −(1/2)·(ln x − µ)²/σ² ] dx . (4)
Substituting z = (ln x − µ)/σ, i.e. x = exp(µ + σz), we have:
E(X) = 1/√(2πσ²) · ∫_((−∞−µ)/σ)^((ln x−µ)/σ) exp( −(1/2)·z² ) d[ exp(µ + σz) ]
  = 1/√(2πσ²) · ∫_(−∞)^(+∞) exp( −(1/2)·z² ) · σ · exp(µ + σz) dz
  = 1/√(2π) · ∫_(−∞)^(+∞) exp( −(1/2)·z² + σz + µ ) dz (5)
  = 1/√(2π) · ∫_(−∞)^(+∞) exp( −(1/2)·(z² − 2σz − 2µ) ) dz .
Now multiplying with exp( (1/2)·σ² ) and exp( −(1/2)·σ² ), we have:
E(X) = 1/√(2π) · ∫_(−∞)^(+∞) exp( −(1/2)·(z² − 2σz + σ² − 2µ − σ²) ) dz
  = 1/√(2π) · ∫_(−∞)^(+∞) exp( −(1/2)·(z² − 2σz + σ²) ) · exp( µ + (1/2)·σ² ) dz (6)
  = exp( µ + (1/2)·σ² ) · ∫_(−∞)^(+∞) 1/√(2π) · exp( −(1/2)·(z − σ)² ) dz .
The probability density function of a normal distribution (→ Proof II/3.2.10) is given by
" 2 #
1 1 x−µ
fX (x) = √ · exp − (7)
2πσ 2 σ
and, with unit variance σ² = 1, this reads:
fX(x) = 1/√(2π) · exp[ −(1/2)·(x − µ)² ] . (8)
Using the definition of the probability density function (→ Definition I/1.6.6), we get
∫_(−∞)^(+∞) 1/√(2π) · exp[ −(1/2)·(x − µ)² ] dx = 1 (9)
and applying (9) to (6), we have:
E(X) = exp( µ + (1/2)·σ² ) . (10)
Sources:
• Taboga, Marco (2022): “Log-normal distribution”; in: Lectures on probability theory and mathe-
matical statistics, retrieved on 2022-10-01; URL: https://www.statlect.com/probability-distributions/
log-normal-distribution.
Metadata: ID: P354 | shortcut: lognorm-mean | author: majapavlo | date: 2022-10-02, 09:46.
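The substitution z = (ln x − µ)/σ used in the proof also gives a convenient route for a numerical cross-check, since it turns E(X) into an integral of exp(µ + σz) against the standard normal density. The following Python sketch is not part of the original text:

```python
import math

mu, sigma = 0.5, 0.4
# E(X) = ∫ exp(µ + σz) φ(z) dz, after the substitution z = (ln x - µ)/σ
lo, hi, n = -12.0, 12.0, 40000
h = (hi - lo) / n
mean_num = sum(math.exp(mu + sigma * (lo + (i + 0.5) * h))
               * math.exp(-(lo + (i + 0.5) * h) ** 2 / 2) / math.sqrt(2 * math.pi)
               for i in range(n)) * h
mean_cls = math.exp(mu + sigma ** 2 / 2)
```

Truncating the z-range at ±12 standard deviations leaves an error far below the comparison tolerance.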
3.6.6 Median
Theorem: Let X be a random variable (→ Definition I/1.2.2) following a log-normal distribution
(→ Definition II/3.6.1):
X ∼ ln N(µ, σ²) . (1)
Then, the median (→ Definition I/1.11.1) of X is
median(X) = e^µ . (2)
Proof: The median (→ Definition I/1.11.1) is the value at which the cumulative distribution function
is 1/2:
1
FX (median(X)) = . (3)
2
The cumulative distribution function of the log-normal distribution (→ Proof II/3.6.3) is
FX(x) = (1/2) · [ 1 + erf( (ln(x) − µ)/(σ√2) ) ] (4)
where erf(x) is the error function. Thus, the inverse CDF is
ln(x) = σ√2 · erf⁻¹(2p − 1) + µ
x = exp[ σ√2 · erf⁻¹(2p − 1) + µ ] (5)
where erf⁻¹(x) is the inverse error function. Setting p = 1/2, we obtain:
ln[median(X)] = σ√2 · erf⁻¹(0) + µ
median(X) = e^µ . (6)
Sources:
• original work
Metadata: ID: P306 | shortcut: lognorm-med | author: majapavlo | date: 2022-02-07, 22:33.
3.6.7 Mode
Theorem: Let X be a random variable (→ Definition I/1.2.2) following a log-normal distribution
(→ Definition II/3.6.1):
X ∼ ln N(µ, σ²) . (1)
Then, the mode (→ Definition I/1.11.2) of X is
mode(X) = e^(µ−σ²) . (2)
Proof: The mode (→ Definition I/1.11.2) is the value which maximizes the probability density function (→ Definition I/1.6.6):
mode(X) = arg max_x fX(x) . (3)
The probability density function of the log-normal distribution (→ Proof II/3.6.2) is:
" #
1 (ln x − µ)2
fX (x) = √ · exp − . (4)
xσ 2π 2σ 2
The first two derivatives of this function are:
f′X(x) = −1/( x² · σ · √(2π) ) · exp[ −(ln x − µ)²/(2σ²) ] · ( 1 + (ln x − µ)/σ² ) (5)
f″X(x) = 1/( √(2π) · σ³ · x³ ) · exp[ −(ln x − µ)²/(2σ²) ] · (ln x − µ) · ( 1 + (ln x − µ)/σ² )
  + √2/( √π · σ · x³ ) · exp[ −(ln x − µ)²/(2σ²) ] · ( 1 + (ln x − µ)/σ² ) (6)
  − 1/( √(2π) · σ³ · x³ ) · exp[ −(ln x − µ)²/(2σ²) ] .
" #
′ 1 (ln x − µ) 2
ln x − µ
fX (x) = 0 = − √ · exp − · 1+
x2 σ 2π 2σ 2 σ2
ln x − µ (7)
−1 =
σ2
x = e(µ−σ ) .
2
Plugging x = e^(µ−σ²) into the second derivative, the first two terms vanish (because 1 + (ln x − µ)/σ² = 0) and we get:
f″X(e^(µ−σ²)) = 1/( √(2π) · σ³ · (e^(µ−σ²))³ ) · exp( −σ²/2 ) · (−σ²) · (0)
  + √2/( √π · σ · (e^(µ−σ²))³ ) · exp( −σ²/2 ) · (0)
  − 1/( √(2π) · σ³ · (e^(µ−σ²))³ ) · exp( −σ²/2 ) (8)
  = −1/( √(2π) · σ³ · (e^(µ−σ²))³ ) · exp( −σ²/2 ) < 0 ,
such that the density indeed attains its maximum at this point:
mode(X) = e^(µ−σ²) . (9)
Sources:
• Wikipedia (2022): “Log-normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-02-12; URL: https://en.wikipedia.org/wiki/Log-normal_distribution#Mode.
• Mdoc (2015): “Mode of lognormal distribution”; in: Mathematics Stack Exchange, retrieved on
2022-02-12; URL: https://math.stackexchange.com/questions/1321221/mode-of-lognormal-distribution/
1321626.
Metadata: ID: P311 | shortcut: lognorm-mode | author: majapavlo | date: 2022-02-13, 10:15.
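The claimed maximizer e^(µ−σ²) can be sanity-checked without any calculus by a dense grid search over the density. The following Python sketch is not part of the original text; names are ad hoc:

```python
import math

def lognorm_pdf(x, mu, sigma):
    # log-normal density 1/(x σ √(2π)) · exp(-(ln x - µ)²/(2σ²))
    return (math.exp(-((math.log(x) - mu) ** 2) / (2 * sigma ** 2))
            / (x * sigma * math.sqrt(2 * math.pi)))

mu, sigma = 0.5, 0.4
mode_cls = math.exp(mu - sigma ** 2)               # claimed mode e^(µ-σ²)
grid = [0.01 + 0.0005 * i for i in range(12000)]   # dense grid on (0, 6]
mode_num = max(grid, key=lambda x: lognorm_pdf(x, mu, sigma))
```

With a grid step of 5·10⁻⁴ the grid argmax should land within one step of the analytical mode.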
3.6.8 Variance
Theorem: Let X be a random variable (→ Definition I/1.2.2) following a log-normal distribution
(→ Definition II/3.6.1):
X ∼ ln N(µ, σ²) . (1)
Then, the variance (→ Definition I/1.8.1) of X is
Var(X) = exp( 2µ + 2σ² ) − exp( 2µ + σ² ) . (2)
Proof: The variance (→ Definition I/1.8.1) can be expressed in terms of expected values (→ Proof I/1.8.3) as
Var(X) = E(X²) − E(X)² , (3)
and the mean of the log-normal distribution (→ Proof II/3.6.5) is
E(X) = exp( µ + (1/2)·σ² ) . (4)
With the probability density function of the log-normal distribution (→ Proof II/3.6.2), the second moment is
E(X²) = ∫_(−∞)^(+∞) x² · fX(x) dx
  = ∫₀^(+∞) x² · 1/( x · √(2πσ²) ) · exp[ −(1/2)·(ln x − µ)²/σ² ] dx (6)
  = 1/√(2πσ²) · ∫₀^(+∞) x · exp[ −(1/2)·(ln x − µ)²/σ² ] dx .
Substituting z = (ln x − µ)/σ, i.e. x = exp(µ + σz), we have:
E(X²) = 1/√(2πσ²) · ∫_((−∞−µ)/σ)^((ln x−µ)/σ) exp(µ + σz) · exp( −(1/2)·z² ) d[ exp(µ + σz) ]
  = 1/√(2πσ²) · ∫_(−∞)^(+∞) exp( −(1/2)·z² ) · σ · exp(2µ + 2σz) dz (7)
  = 1/√(2π) · ∫_(−∞)^(+∞) exp( −(1/2)·(z² − 4σz − 4µ) ) dz .
Now multiplying with exp(2σ²) and exp(−2σ²), we have:
E(X²) = 1/√(2π) · ∫_(−∞)^(+∞) exp( −(1/2)·(z² − 4σz + 4σ² − 4σ² − 4µ) ) dz
  = 1/√(2π) · ∫_(−∞)^(+∞) exp( −(1/2)·(z² − 4σz + 4σ²) ) · exp( 2σ² + 2µ ) dz (8)
  = exp( 2σ² + 2µ ) · ∫_(−∞)^(+∞) 1/√(2π) · exp( −(1/2)·(z − 2σ)² ) dz .
Because the integrand is the probability density function of N(2σ, 1) and thus integrates to one, we have E(X²) = exp(2σ² + 2µ). Plugging this and (4) into (3), the variance becomes:
Var(X) = E(X²) − E(X)²
  = exp( 2σ² + 2µ ) − [ exp( µ + (1/2)·σ² ) ]² (13)
  = exp( 2σ² + 2µ ) − exp( 2µ + σ² ) .
Sources:
• Taboga, Marco (2022): “Log-normal distribution”; in: Lectures on probability theory and mathe-
matical statistics, retrieved on 2022-10-01; URL: https://www.statlect.com/probability-distributions/
log-normal-distribution.
• Wikipedia (2022): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2022-10-01; URL:
https://en.wikipedia.org/wiki/Variance#Definition.
Metadata: ID: P355 | shortcut: lognorm-var | author: majapavlo | date: 2022-10-02, 09:02.
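As a numerical sanity check of this result (not part of the book; the parameter values µ = 0.5, σ = 0.75 are arbitrary), the closed-form variance can be compared against a Monte Carlo estimate using only Python's standard library:

```python
import math, random

random.seed(1)
mu, sigma = 0.5, 0.75  # arbitrary parameter choice

# A log-normal sample is the exponential of a normal sample
samples = [math.exp(random.gauss(mu, sigma)) for _ in range(200_000)]
mean = sum(samples) / len(samples)
var_mc = sum((s - mean)**2 for s in samples) / len(samples)

# Closed-form variance from the theorem
var_theory = math.exp(2*mu + 2*sigma**2) - math.exp(2*mu + sigma**2)
print(var_mc, var_theory)
```

The sample mean should also approach exp(µ + σ²/2), in line with the mean used in the last step of the proof.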
Y = Σ_{i=1}^k Xᵢ² ∼ χ²(k)   where k > 0 .   (2)

The probability density function of the chi-squared distribution (→ Proof II/3.7.3) with k degrees of freedom is

χ²(x; k) = 1/(2^(k/2)·Γ(k/2)) · x^(k/2−1) · e^(−x/2)   (3)

where k > 0 and the density is zero if x ≤ 0.
Sources:
• Wikipedia (2020): “Chi-square distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2020-10-12; URL: https://en.wikipedia.org/wiki/Chi-square_distribution#Definitions.
• Robert V. Hogg, Joseph W. McKean, Allen T. Craig (2018): “The Chi-Squared-Distribution”;
in: Introduction to Mathematical Statistics, Pearson, Boston, 2019, p. 178, eq. 3.3.7; URL: https:
//www.pearson.com/store/p/introduction-to-mathematical-statistics/P100000843744.
Metadata: ID: D100 | shortcut: chi2 | author: kjpetrykowski | date: 2020-10-13, 01:20.
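The defining construction can be checked by simulation (a sketch, not from the book; k = 4 is an arbitrary choice): the sum of k squared standard normal draws should have mean k and variance 2k, the known moments of χ²(k).

```python
import random

random.seed(0)
k, n = 4, 100_000

# Sum of k squared standard normal draws, repeated n times
draws = [sum(random.gauss(0, 1)**2 for _ in range(k)) for _ in range(n)]
mean = sum(draws) / n
var = sum((d - mean)**2 for d in draws) / n
print(mean, var)  # close to k and 2k
```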
246 CHAPTER II. PROBABILITY DISTRIBUTIONS
Proof: The probability density function of the gamma distribution (→ Proof II/3.4.6) for x > 0, where α is the shape parameter and β is the rate parameter, is as follows:

Gam(x; α, β) = βᵅ/Γ(α) · x^(α−1) · e^(−βx) .   (2)

If we let α = k/2 and β = 1/2, we obtain

Gam(x; k/2, 1/2) = (1/2)^(k/2)/Γ(k/2) · x^(k/2−1) · e^(−x/2) = 1/(2^(k/2)·Γ(k/2)) · x^(k/2−1) · e^(−x/2)   (3)
which is equivalent to the probability density function of the chi-squared distribution (→ Proof
II/3.7.3).
Sources:
• original work
Metadata: ID: P174 | shortcut: chi2-gam | author: kjpetrykowski | date: 2020-10-12, 22:15.
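The algebraic identity in (3) can be verified pointwise (a quick check, not part of the proof; k = 5 and the evaluation points are arbitrary):

```python
import math

def gamma_pdf(x, a, b):
    # Gam(x; α, β) = β^α/Γ(α) · x^(α−1) · e^(−βx)
    return b**a / math.gamma(a) * x**(a - 1) * math.exp(-b*x)

def chi2_pdf(x, k):
    # χ²(x; k) = 1/(2^(k/2) Γ(k/2)) · x^(k/2−1) · e^(−x/2)
    return x**(k/2 - 1) * math.exp(-x/2) / (2**(k/2) * math.gamma(k/2))

# With α = k/2 and β = 1/2, the gamma density reproduces the chi-squared density
vals = [(gamma_pdf(x, 2.5, 0.5), chi2_pdf(x, 5)) for x in (0.5, 1.0, 3.0, 8.0)]
print(vals)
```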
Y ∼ χ²(k) .   (1)

Then, the probability density function (→ Definition I/1.6.6) of Y is

f_Y(y) = 1/(2^(k/2)·Γ(k/2)) · y^(k/2−1) · e^(−y/2) .   (2)

Proof: By definition (→ Definition II/3.7.1), Y is the sum of k squared standard normal random variables:

X₁, …, X_k ∼ N(0, 1)  ⇒  Y = Σ_{i=1}^k Xᵢ² ∼ χ²(k) .   (3)

Let

y = Σ_{i=1}^k xᵢ²   (4)
3. UNIVARIATE CONTINUOUS DISTRIBUTIONS 247
and let fY (y) and FY (y) be the probability density function (→ Definition I/1.6.6) and cumulative
distribution function (→ Definition I/1.6.13) of Y . Because the PDF is the first derivative of the
CDF (→ Proof I/1.6.12), we can write:
dF_Y(y) = (dF_Y(y)/dy)·dy = f_Y(y)·dy .   (5)
Then, the cumulative distribution function (→ Definition I/1.6.13) of Y can be expressed as
f_Y(y)·dy = ∫_V ∏_{i=1}^k (N(xᵢ; 0, 1) dxᵢ)   (6)
where N (xi ; 0, 1) is the probability density function (→ Definition I/1.6.6) of the standard normal
distribution (→ Definition II/3.2.2) and V is the elemental shell volume at y(x), which is proportional
to the (k − 1)-dimensional surface in k-space for which equation (4) is fulfilled. Using the probability
density function of the normal distribution (→ Proof II/3.2.10), equation (6) can be developed as
follows:
f_Y(y)·dy = ∫_V ∏_{i=1}^k [1/√(2π) · exp(−½·xᵢ²)] dxᵢ
          = ∫_V exp[−½·(x₁² + … + x_k²)]/(2π)^(k/2) dx₁ … dx_k   (7)
          = 1/(2π)^(k/2) · ∫_V exp[−y/2] dx₁ … dx_k .
Because y is constant within the set V , it can be moved out of the integral:
f_Y(y)·dy = exp[−y/2]/(2π)^(k/2) · ∫_V dx₁ … dx_k .   (8)
Now, the integral is simply the surface area of the (k−1)-dimensional sphere with radius r = √y, which is

A = 2·r^(k−1) · π^(k/2)/Γ(k/2) ,   (9)

times the infinitesimal thickness of the sphere, which is

dr/dy = ½·y^(−1/2)  ⇔  dr = dy/(2·y^(1/2)) .   (10)
Substituting (9) and (10) into (8), we have:

f_Y(y)·dy = exp[−y/2]/(2π)^(k/2) · A·dr
          = exp[−y/2]/(2π)^(k/2) · 2·r^(k−1) · π^(k/2)/Γ(k/2) · dy/(2·y^(1/2))
          = 1/(2^(k/2)·Γ(k/2)) · (√y)^(k−1)/√y · exp[−y/2] dy   (11)
          = 1/(2^(k/2)·Γ(k/2)) · y^(k/2−1) · exp[−y/2] dy .
Sources:
• Wikipedia (2020): “Proofs related to chi-squared distribution”; in: Wikipedia, the free encyclopedia,
retrieved on 2020-11-25; URL: https://en.wikipedia.org/wiki/Proofs_related_to_chi-squared_
distribution#Derivation_of_the_pdf_for_k_degrees_of_freedom.
• Wikipedia (2020): “n-sphere”; in: Wikipedia, the free encyclopedia, retrieved on 2020-11-25; URL:
https://en.wikipedia.org/wiki/N-sphere#Volume_and_surface_area.
Metadata: ID: P197 | shortcut: chi2-pdf | author: JoramSoch | date: 2020-11-25, 05:56.
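The derived density can be checked numerically (a sketch, not from the book; k = 6, the step size and the integration cutoff are arbitrary choices): it should integrate to one and yield the known mean k.

```python
import math

def chi2_pdf(x, k):
    # The density derived above: 1/(2^(k/2) Γ(k/2)) · x^(k/2−1) · e^(−x/2)
    return x**(k/2 - 1) * math.exp(-x/2) / (2**(k/2) * math.gamma(k/2))

k, h, upper = 6, 0.001, 80.0
total = mean = 0.0
for i in range(int(upper / h)):
    x = h * (i + 0.5)  # midpoint rule
    p = chi2_pdf(x, k)
    total += p * h
    mean += x * p * h
print(total, mean)  # ≈ 1 and ≈ k
```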
3.7.4 Moments
Theorem: Let X be a random variable (→ Definition I/1.2.2) following a chi-squared distribution
(→ Definition II/3.7.1):
X ∼ χ²(k) .   (1)

If m > −k/2, then E(Xᵐ) exists and is equal to

E(Xᵐ) = 2ᵐ·Γ((k/2) + m)/Γ(k/2) .   (2)
Proof: Combining the definition of the m-th raw moment (→ Definition I/1.14.3) with the probability
density function of the chi-squared distribution (→ Proof II/3.7.3), we have:
E(Xᵐ) = 1/(Γ(k/2)·2^(k/2)) · ∫_0^∞ x^((k/2)+m−1) · e^(−x/2) dx .   (3)

Now define a new variable u = x/2. As a result, we obtain:

E(Xᵐ) = 1/(Γ(k/2)·2^((k/2)−1)) · ∫_0^∞ 2^((k/2)+m−1) · u^((k/2)+m−1) · e^(−u) du = 2ᵐ/Γ(k/2) · ∫_0^∞ u^((k/2)+m−1) · e^(−u) du .   (4)
The remaining integral is, by definition of the gamma function, equal to Γ((k/2) + m), which is finite whenever (k/2) + m > 0. This leads to the desired result when m > −k/2. Observe that, if m is a nonnegative integer, then m > −k/2 is always true. Therefore, all moments (→ Definition I/1.14.1) of a chi-squared distribution (→ Definition II/3.7.1) exist and the m-th raw moment is given by the foregoing equation.
Sources:
• Robert V. Hogg, Joseph W. McKean, Allen T. Craig (2018): “The χ²-Distribution”; in: Introduction
to Mathematical Statistics, Pearson, Boston, 2019, p. 179, eq. 3.3.8; URL: https://www.pearson.
com/store/p/introduction-to-mathematical-statistics/P100000843744.
Metadata: ID: P175 | shortcut: chi2-mom | author: kjpetrykowski | date: 2020-10-13, 01:30.
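The moment formula can be checked against direct numerical integration of the density (a sketch, not from the book; k = 5 and the tested exponents, including the fractional m = 0.5, are arbitrary):

```python
import math

def chi2_pdf(x, k):
    # χ²(x; k) = 1/(2^(k/2) Γ(k/2)) · x^(k/2−1) · e^(−x/2)
    return x**(k/2 - 1) * math.exp(-x/2) / (2**(k/2) * math.gamma(k/2))

def raw_moment(k, m, h=0.001, upper=120.0):
    # Midpoint-rule approximation of E(X^m)
    return sum((h*(i + 0.5))**m * chi2_pdf(h*(i + 0.5), k) for i in range(int(upper/h))) * h

k = 5
checks = [(m, raw_moment(k, m), 2**m * math.gamma(k/2 + m) / math.gamma(k/2)) for m in (0.5, 1, 2)]
print(checks)
```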
3.8 F-distribution
3.8.1 Definition
Definition: Let X1 and X2 be independent (→ Definition I/1.3.6) random variables (→ Definition
I/1.2.2) following a chi-squared distribution (→ Definition II/3.7.1) with d1 and d2 degrees of freedom
(→ Definition “dof”), respectively:
X₁ ∼ χ²(d₁)
X₂ ∼ χ²(d₂) .   (1)
Then, the ratio of X1 to X2 , divided by their respective degrees of freedom, is said to be F -distributed
with numerator degrees of freedom d1 and denominator degrees of freedom d2 :
Y = (X₁/d₁)/(X₂/d₂) ∼ F(d₁, d₂)   where d₁, d₂ > 0 .   (2)
The F -distribution is also called “Snedecor’s F -distribution” or “Fisher–Snedecor distribution”, after
Ronald A. Fisher and George W. Snedecor.
Sources:
• Wikipedia (2021): “F-distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2021-04-21;
URL: https://en.wikipedia.org/wiki/F-distribution#Characterization.
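The definition can be exercised by simulation (a sketch, not from the book; d₁ = 5, d₂ = 10 are arbitrary): the ratio of scaled chi-squared draws should reproduce the known mean E(F) = d₂/(d₂ − 2) for d₂ > 2.

```python
import random

random.seed(2)
d1, d2, n = 5, 10, 100_000

def chi2_draw(k):
    # Chi-squared draw as sum of k squared standard normals
    return sum(random.gauss(0, 1)**2 for _ in range(k))

# Ratio of the two chi-squared variables, each divided by its degrees of freedom
f_draws = [(chi2_draw(d1) / d1) / (chi2_draw(d2) / d2) for _ in range(n)]
mean = sum(f_draws) / n
print(mean, d2 / (d2 - 2))
```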
F ∼ F(u, v) .   (1)

Then, the probability density function (→ Definition I/1.6.6) of F is

f_F(f) = Γ((u+v)/2)/(Γ(u/2)·Γ(v/2)) · (u/v)^(u/2) · f^(u/2−1) · ((u/v)·f + 1)^(−(u+v)/2) .   (2)
Proof: An F-distributed random variable (→ Definition II/3.8.1) is defined as the ratio of two chi-
squared random variables (→ Definition II/3.7.1), divided by their degrees of freedom (→ Definition
“dof”)
X ∼ χ²(u),  Y ∼ χ²(v)  ⇒  F = (X/u)/(Y/v) ∼ F(u, v)   (3)
where X and Y are independent of each other (→ Definition I/1.3.6).
The probability density function of the chi-squared distribution (→ Proof II/3.7.3) is

f_X(x) = 1/(Γ(u/2)·2^(u/2)) · x^(u/2−1) · e^(−x/2) .   (4)
Define the random variables F and W as functions of X and Y

F = (X/u)/(Y/v)
W = Y ,   (5)

such that the inverse functions X and Y in terms of F and W are

X = (u/v)·F·W
Y = W .   (6)

This implies the following Jacobian matrix and determinant:

J = [dX/dF, dX/dW; dY/dF, dY/dW] = [(u/v)·W, (u/v)·F; 0, 1]   (7)

|J| = (u/v)·W .
Because X and Y are independent (→ Definition I/1.3.6), the joint density (→ Definition I/1.5.2) of X and Y is equal to the product (→ Proof I/1.3.8) of the marginal densities (→ Definition I/1.5.3), such that the joint density of F and W follows as

f_{F,W}(f, w) = f_X((u/v)·f·w) · f_Y(w) · |J|
             = (u/v)^(u/2) · f^(u/2−1)/(Γ(u/2)·Γ(v/2)·2^((u+v)/2)) · w^((u+v)/2−1) · e^(−(w/2)·((u/v)·f + 1)) .
The marginal density (→ Definition I/1.5.3) of F can now be obtained by integrating out (→ Defi-
nition I/1.3.3) W :
f_F(f) = ∫_0^∞ f_{F,W}(f, w) dw
       = (u/v)^(u/2)·f^(u/2−1)/(Γ(u/2)·Γ(v/2)·2^((u+v)/2)) · ∫_0^∞ w^((u+v)/2−1) · exp[−½·((u/v)·f + 1)·w] dw
       = (u/v)^(u/2)·f^(u/2−1)/(Γ(u/2)·Γ(v/2)·2^((u+v)/2)) · Γ((u+v)/2)/[½·((u/v)·f + 1)]^((u+v)/2)   (11)
         · ∫_0^∞ [½·((u/v)·f + 1)]^((u+v)/2)/Γ((u+v)/2) · w^((u+v)/2−1) · exp[−½·((u/v)·f + 1)·w] dw .
At this point, we can recognize that the integrand is equal to the probability density function of a
gamma distribution (→ Proof II/3.4.6) with
a = (u+v)/2  and  b = ½·((u/v)·f + 1) ,   (12)
and because a probability density function integrates to one (→ Definition I/1.6.6), we finally have:
f_F(f) = (u/v)^(u/2)·f^(u/2−1)/(Γ(u/2)·Γ(v/2)·2^((u+v)/2)) · Γ((u+v)/2)/[½·((u/v)·f + 1)]^((u+v)/2)
       = Γ((u+v)/2)/(Γ(u/2)·Γ(v/2)) · (u/v)^(u/2) · f^(u/2−1) · ((u/v)·f + 1)^(−(u+v)/2) .   (13)
Sources:
• statisticsmatt (2018): “Statistical Distributions: Derive the F Distribution”; in: YouTube, retrieved
on 2021-10-11; URL: https://www.youtube.com/watch?v=AmHiOKYmHkI.
Metadata: ID: P264 | shortcut: f-pdf | author: JoramSoch | date: 2021-10-12, 09:00.
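As a numerical check of the derived density (not part of the book; u = 4, v = 7 and the integration settings are arbitrary), it should integrate to one:

```python
import math

u, v = 4, 7
c = math.gamma((u+v)/2) / (math.gamma(u/2) * math.gamma(v/2)) * (u/v)**(u/2)

def f_pdf(f):
    # f_F(f) = Γ((u+v)/2)/(Γ(u/2)Γ(v/2)) · (u/v)^(u/2) · f^(u/2−1) · (1 + uf/v)^(−(u+v)/2)
    return c * f**(u/2 - 1) * (1 + u*f/v)**(-(u+v)/2)

h, upper = 0.0005, 400.0
total = sum(f_pdf(h*(i + 0.5)) for i in range(int(upper/h))) * h  # midpoint rule
print(total)  # ≈ 1
```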
X ∼ Bet(α, β) , (1)
if and only if its probability density function (→ Definition I/1.6.6) is given by
Bet(x; α, β) = 1/B(α, β) · x^(α−1) · (1−x)^(β−1)   (2)

where α > 0 and β > 0, and the density is zero if x ∉ [0, 1].
Sources:
• Wikipedia (2020): “Beta distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-05-
10; URL: https://en.wikipedia.org/wiki/Beta_distribution#Definitions.
Metadata: ID: D53 | shortcut: beta | author: JoramSoch | date: 2020-05-10, 20:29.
X/(X + Y) ∼ Bet(m/2, n/2) .   (2)

Proof: The probability density function of the chi-squared distribution (→ Proof II/3.7.3) is

X ∼ χ²(u)  ⇒  f_X(x) = 1/(Γ(u/2)·2^(u/2)) · x^(u/2−1) · e^(−x/2) .   (3)
Define the random variables Z and W as functions of X and Y

Z = X/(X + Y)
W = Y ,   (4)

such that the inverse functions X and Y in terms of Z and W are

X = Z·W/(1 − Z)
Y = W .   (5)
This implies the following Jacobian matrix and determinant:
dX dX W Z
J= dZ dW
= (1−Z)2 1−Z
dY dY
dZ dW
0 1 (6)
W
|J| = .
(1 − Z)2
Because X and Y are independent (→ Definition I/1.3.6), the joint density (→ Definition I/1.5.2) of
X and Y is equal to the product (→ Proof I/1.3.8) of the marginal densities (→ Definition I/1.5.3):
f_{Z,W}(z, w) = f_X(zw/(1−z)) · f_Y(w) · |J|
             = 1/(Γ(m/2)·2^(m/2)) · (zw/(1−z))^(m/2−1) · e^(−(1/2)·(zw/(1−z))) · 1/(Γ(n/2)·2^(n/2)) · w^(n/2−1) · e^(−w/2) · w/(1−z)²
             = 1/(Γ(m/2)·Γ(n/2)·2^(m/2)·2^(n/2)) · (z/(1−z))^(m/2−1) · w^(m/2+n/2−1) · e^(−(1/2)·(zw/(1−z) + w)) · 1/(1−z)²   (9)
             = 1/(Γ(m/2)·Γ(n/2)·2^((m+n)/2)) · z^(m/2−1) · (1−z)^(−m/2−1) · w^((m+n)/2−1) · e^(−(1/2)·(w/(1−z))) .
The marginal density (→ Definition I/1.5.3) of Z can now be obtained by integrating out (→ Defi-
nition I/1.5.3) W :
f_Z(z) = ∫_0^∞ f_{Z,W}(z, w) dw
       = 1/(Γ(m/2)·Γ(n/2)·2^((m+n)/2)) · z^(m/2−1) · (1−z)^(−m/2−1) · ∫_0^∞ w^((m+n)/2−1) · e^(−(1/(2(1−z)))·w) dw
       = 1/(Γ(m/2)·Γ(n/2)·2^((m+n)/2)) · z^(m/2−1) · (1−z)^(−m/2−1) · Γ((m+n)/2)/[1/(2(1−z))]^((m+n)/2)   (10)
         · ∫_0^∞ [1/(2(1−z))]^((m+n)/2)/Γ((m+n)/2) · w^((m+n)/2−1) · e^(−(1/(2(1−z)))·w) dw .
At this point, we can recognize that the integrand is equal to the probability density function of a
gamma distribution (→ Proof II/3.4.6) with
a = (m+n)/2  and  b = 1/(2(1−z)) ,   (11)
and because a probability density function integrates to one (→ Definition I/1.6.6), we have:
f_Z(z) = 1/(Γ(m/2)·Γ(n/2)·2^((m+n)/2)) · z^(m/2−1) · (1−z)^(−m/2−1) · Γ((m+n)/2) · (2(1−z))^((m+n)/2)
       = (Γ((m+n)/2)·2^((m+n)/2))/(Γ(m/2)·Γ(n/2)·2^((m+n)/2)) · z^(m/2−1) · (1−z)^(−m/2−1+(m+n)/2)   (12)
       = Γ((m+n)/2)/(Γ(m/2)·Γ(n/2)) · z^(m/2−1) · (1−z)^(n/2−1) .
With the definition of the beta function (→ Proof II/3.9.6), this becomes
f_Z(z) = 1/B(m/2, n/2) · z^(m/2−1) · (1−z)^(n/2−1)   (13)

which is the probability density function of the beta distribution (→ Proof II/3.9.3) with parameters

α = m/2  and  β = n/2 ,   (14)

such that

Z ∼ Bet(m/2, n/2) .   (15)
Sources:
• Probability Fact (2021): “If X chisq(m) and Y chisq(n) are independent”; in: Twitter, retrieved
on 2022-10-17; URL: https://twitter.com/ProbFact/status/1450492787854647300.
Metadata: ID: P356 | shortcut: beta-chi2 | author: JoramSoch | date: 2022-10-07, 13:20.
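This relation can also be seen by simulation (a sketch, not from the book; m = 3, n = 5 are arbitrary): the ratio X/(X+Y) should have the mean of Bet(m/2, n/2), which is m/(m+n).

```python
import random

random.seed(3)
m, n, N = 3, 5, 100_000

def chi2_draw(k):
    # Chi-squared draw as sum of k squared standard normals
    return sum(random.gauss(0, 1)**2 for _ in range(k))

total = 0.0
for _ in range(N):
    x, y = chi2_draw(m), chi2_draw(n)
    total += x / (x + y)
mean = total / N
print(mean, m / (m + n))  # Bet(m/2, n/2) has mean (m/2)/((m+n)/2) = m/(m+n)
```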
X ∼ Bet(α, β) . (1)
Then, the probability density function (→ Definition I/1.6.6) of X is
f_X(x) = 1/B(α, β) · x^(α−1) · (1−x)^(β−1) .   (2)
Proof: This follows directly from the definition of the beta distribution (→ Definition II/3.9.1).
Sources:
• original work
Metadata: ID: P94 | shortcut: beta-pdf | author: JoramSoch | date: 2020-05-05, 21:03.
X ∼ Bet(α, β) . (1)
Then, the moment-generating function (→ Definition I/1.6.27) of X is

M_X(t) = 1 + Σ_{n=1}^∞ (∏_{m=0}^{n−1} (α + m)/(α + β + m)) · tⁿ/n! .   (2)
Proof: The probability density function of the beta distribution (→ Proof II/3.9.3) is
f_X(x) = 1/B(α, β) · x^(α−1) · (1−x)^(β−1)   (3)

and the moment-generating function (→ Definition I/1.6.27) is defined as

M_X(t) = E[e^(tX)] .   (4)

Using the expected value for continuous random variables (→ Definition I/1.7.1), the moment-generating function of X therefore is

M_X(t) = ∫_0^1 exp[tx] · 1/B(α, β) · x^(α−1) · (1−x)^(β−1) dx
       = 1/B(α, β) · ∫_0^1 e^(tx) · x^(α−1) · (1−x)^(β−1) dx .   (5)
Using the representation of the beta function in terms of the gamma function

B(α, β) = Γ(α)·Γ(β)/Γ(α + β)   (6)

and the integral representation of the confluent hypergeometric function (Kummer's function of the first kind)

₁F₁(a, b, z) = Γ(b)/(Γ(a)·Γ(b − a)) · ∫_0^1 e^(zu) · u^(a−1) · (1−u)^((b−a)−1) du ,   (7)

the moment-generating function can be written as

M_X(t) = ₁F₁(α, α + β, t) ,

since the integral in (5) matches (7) with a = α, b = α + β and z = t. The confluent hypergeometric function has the power series expansion

₁F₁(a, b, z) = Σ_{n=0}^∞ (a⁽ⁿ⁾/b⁽ⁿ⁾) · zⁿ/n!

where m⁽ⁿ⁾ denotes the rising factorial

m⁽ⁿ⁾ = ∏_{i=0}^{n−1} (m + i) .   (10)
Applying the rising factorial equation (10) and using m⁽⁰⁾ = z⁰ = 0! = 1, we finally have:

M_X(t) = 1 + Σ_{n=1}^∞ (∏_{m=0}^{n−1} (α + m)/(α + β + m)) · tⁿ/n! .   (12)
Sources:
• Wikipedia (2020): “Beta distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-11-
25; URL: https://en.wikipedia.org/wiki/Beta_distribution#Moment_generating_function.
• Wikipedia (2020): “Confluent hypergeometric function”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-11-25; URL: https://en.wikipedia.org/wiki/Confluent_hypergeometric_function#
Kummer’s_equation.
Metadata: ID: P198 | shortcut: beta-mgf | author: JoramSoch | date: 2020-11-25, 06:55.
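The series in (12) can be checked against a direct quadrature of E[e^(tX)] (a sketch, not from the book; α = 2, β = 3, t = 1.2 and the truncation at 40 terms are arbitrary choices):

```python
import math

def beta_pdf(x, a, b):
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x**(a - 1) * (1 - x)**(b - 1) / B

a, b, t = 2.0, 3.0, 1.2
steps = 50_000
h = 1.0 / steps
# Quadrature for E[e^(tX)]
mgf_numeric = sum(math.exp(t*x) * beta_pdf(x, a, b) for x in (h*(i + 0.5) for i in range(steps))) * h

# Truncated series from the theorem
mgf_series, prod = 1.0, 1.0
for n in range(1, 40):
    prod *= (a + n - 1) / (a + b + n - 1)   # ∏_{m=0}^{n−1} (α+m)/(α+β+m)
    mgf_series += prod * t**n / math.factorial(n)
print(mgf_numeric, mgf_series)
```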
X ∼ Bet(α, β) .   (1)

Then, the cumulative distribution function (→ Definition I/1.6.13) of X is

F_X(x) = B(x; α, β)/B(α, β)   (2)

where B(a, b) is the beta function and B(x; a, b) is the incomplete beta function.
Proof: The probability density function of the beta distribution (→ Proof II/3.9.3) is:
f_X(x) = 1/B(α, β) · x^(α−1) · (1−x)^(β−1) .   (3)
Thus, the cumulative distribution function (→ Definition I/1.6.13) is:

F_X(x) = ∫_0^x Bet(z; α, β) dz
       = ∫_0^x 1/B(α, β) · z^(α−1) · (1−z)^(β−1) dz   (4)
       = 1/B(α, β) · ∫_0^x z^(α−1) · (1−z)^(β−1) dz .

The remaining integral is precisely the incomplete beta function

B(x; α, β) = ∫_0^x z^(α−1) · (1−z)^(β−1) dz ,

such that

F_X(x) = B(x; α, β)/B(α, β) .   (6)
Sources:
• Wikipedia (2020): “Beta distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-11-
19; URL: https://en.wikipedia.org/wiki/Beta_distribution#Cumulative_distribution_function.
• Wikipedia (2020): “Beta function”; in: Wikipedia, the free encyclopedia, retrieved on 2020-11-19;
URL: https://en.wikipedia.org/wiki/Beta_function#Incomplete_beta_function.
Metadata: ID: P195 | shortcut: beta-cdf | author: JoramSoch | date: 2020-11-19, 08:01.
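For α = 2, β = 3 the ratio B(x; α, β)/B(α, β) works out to the polynomial 6x² − 8x³ + 3x⁴, which gives a convenient closed form to check the CDF against numerically (a sketch, not from the book; the parameter choice is arbitrary):

```python
import math

def beta_pdf(x, a, b):
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x**(a - 1) * (1 - x)**(b - 1) / B

def beta_cdf_numeric(x, a, b, steps=20_000):
    # Midpoint-rule integral of the density from 0 to x
    h = x / steps
    return sum(beta_pdf(h*(i + 0.5), a, b) for i in range(steps)) * h

# For α = 2, β = 3: F(x) = 6x² − 8x³ + 3x⁴
checks = [(beta_cdf_numeric(x, 2, 3), 6*x**2 - 8*x**3 + 3*x**4) for x in (0.2, 0.5, 0.9)]
print(checks)
```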
3.9.6 Mean
Theorem: Let X be a random variable (→ Definition I/1.2.2) following a beta distribution (→
Definition II/3.9.1):
X ∼ Bet(α, β) . (1)
Then, the mean or expected value (→ Definition I/1.7.1) of X is
E(X) = α/(α + β) .   (2)
Proof: The expected value (→ Definition I/1.7.1) is the probability-weighted average over all possible
values:
E(X) = ∫_X x · f_X(x) dx .   (3)

With the probability density function of the beta distribution (→ Proof II/3.9.3)

f_X(x) = 1/B(α, β) · x^(α−1) · (1−x)^(β−1)   (4)

and the representation of the beta function in terms of the gamma function

B(α, β) = Γ(α)·Γ(β)/Γ(α + β) ,   (5)
Combining (3), (4) and (5), we have:

E(X) = ∫_0^1 x · Γ(α+β)/(Γ(α)·Γ(β)) · x^(α−1) · (1−x)^(β−1) dx
     = Γ(α+β)/Γ(α) · Γ(α+1)/Γ(α+1+β) · ∫_0^1 Γ(α+1+β)/(Γ(α+1)·Γ(β)) · x^((α+1)−1) · (1−x)^(β−1) dx .   (6)

Using the relation Γ(x+1) = x·Γ(x), this becomes

E(X) = Γ(α+β)/Γ(α) · (α·Γ(α))/((α+β)·Γ(α+β)) · ∫_0^1 Γ(α+1+β)/(Γ(α+1)·Γ(β)) · x^((α+1)−1) · (1−x)^(β−1) dx
     = α/(α+β) · ∫_0^1 Γ(α+1+β)/(Γ(α+1)·Γ(β)) · x^((α+1)−1) · (1−x)^(β−1) dx   (7)
and again using the density of the beta distribution (→ Proof II/3.9.3), we get

E(X) = α/(α+β) · ∫_0^1 Bet(x; α+1, β) dx
     = α/(α+β) .   (8)
Sources:
• Boer Commander (2020): “Beta Distribution Mean and Variance Proof”; in: YouTube, retrieved
on 2021-04-29; URL: https://www.youtube.com/watch?v=3OgCcnpZtZ8.
Metadata: ID: P228 | shortcut: beta-mean | author: JoramSoch | date: 2021-04-29, 09:12.
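The mean formula can be checked by numerical integration (a sketch, not from the book; α = 2.5, β = 4.0 are arbitrary):

```python
import math

def beta_pdf(x, a, b):
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x**(a - 1) * (1 - x)**(b - 1) / B

a, b, steps = 2.5, 4.0, 50_000
h = 1.0 / steps
# Midpoint-rule approximation of E(X) = ∫ x·f_X(x) dx
mean = sum(x * beta_pdf(x, a, b) for x in (h*(i + 0.5) for i in range(steps))) * h
print(mean, a / (a + b))
```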
3.9.7 Variance
Theorem: Let X be a random variable (→ Definition I/1.2.2) following a beta distribution (→
Definition II/3.9.1):
X ∼ Bet(α, β) . (1)
Then, the variance (→ Definition I/1.8.1) of X is
Var(X) = αβ/((α+β+1)·(α+β)²) .   (2)

Proof: The variance (→ Definition I/1.8.1) can be expressed in terms of expected values (→ Proof I/1.8.3) as

Var(X) = E(X²) − E(X)² ,   (3)

where the mean of the beta distribution (→ Proof II/3.9.6) is

E(X) = α/(α+β) .   (4)

Moreover, the beta function can be represented in terms of the gamma function as

B(α, β) = Γ(α)·Γ(β)/Γ(α+β) .   (6)
Therefore, the expected value of a squared beta random variable becomes

E(X²) = ∫_0^1 x² · Γ(α+β)/(Γ(α)·Γ(β)) · x^(α−1) · (1−x)^(β−1) dx
      = Γ(α+β)/Γ(α) · Γ(α+2)/Γ(α+2+β) · ∫_0^1 Γ(α+2+β)/(Γ(α+2)·Γ(β)) · x^((α+2)−1) · (1−x)^(β−1) dx .   (7)

Applying the relation Γ(x+1) = x·Γ(x) twice, this becomes

E(X²) = Γ(α+β)/Γ(α) · ((α+1)·α·Γ(α))/((α+β+1)·(α+β)·Γ(α+β)) · ∫_0^1 Γ(α+2+β)/(Γ(α+2)·Γ(β)) · x^((α+2)−1) · (1−x)^(β−1) dx
      = ((α+1)·α)/((α+β+1)·(α+β)) · ∫_0^1 Γ(α+2+β)/(Γ(α+2)·Γ(β)) · x^((α+2)−1) · (1−x)^(β−1) dx   (8)

and again using the density of the beta distribution (→ Proof II/3.9.3), we get

E(X²) = ((α+1)·α)/((α+β+1)·(α+β)) · ∫_0^1 Bet(x; α+2, β) dx
      = ((α+1)·α)/((α+β+1)·(α+β)) .   (9)
Plugging (9) and (4) into (3), the variance of a beta random variable finally becomes

Var(X) = ((α+1)·α)/((α+β+1)·(α+β)) − (α/(α+β))²
       = ((α+1)·α·(α+β))/((α+β+1)·(α+β)²) − (α²·(α+β+1))/((α+β+1)·(α+β)²)
       = ((α³ + α²β + α² + αβ) − (α³ + α²β + α²))/((α+β+1)·(α+β)²)   (10)
       = αβ/((α+β+1)·(α+β)²) .
Sources:
• Boer Commander (2020): “Beta Distribution Mean and Variance Proof”; in: YouTube, retrieved
on 2021-04-29; URL: https://www.youtube.com/watch?v=3OgCcnpZtZ8.
Metadata: ID: P229 | shortcut: beta-var | author: JoramSoch | date: 2021-04-29, 09:31.
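The variance formula can be checked numerically in the same way as the mean (a sketch, not from the book; α = 3, β = 2 are arbitrary):

```python
import math

def beta_pdf(x, a, b):
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x**(a - 1) * (1 - x)**(b - 1) / B

a, b, steps = 3.0, 2.0, 50_000
h = 1.0 / steps
m1 = m2 = 0.0
for i in range(steps):
    x = h * (i + 0.5)  # midpoint rule
    p = beta_pdf(x, a, b)
    m1 += x * p * h
    m2 += x * x * p * h
var_numeric = m2 - m1**2
var_theory = a * b / ((a + b + 1) * (a + b)**2)
print(var_numeric, var_theory)
```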
X ∼ Wald(γ, α) , (1)
if and only if its probability density function (→ Definition I/1.6.6) is given by
Wald(x; γ, α) = α/√(2πx³) · exp(−(α − γx)²/(2x))   (2)
where γ > 0, α > 0, and the density is zero if x ≤ 0.
Sources:
• Anders, R., Alario, F.-X., and van Maanen, L. (2016): “The Shifted Wald Distribution for Response
Time Data Analysis”; in: Psychological Methods, vol. 21, no. 3, pp. 309-327; URL: https://dx.doi.
org/10.1037/met0000066; DOI: 10.1037/met0000066.
Metadata: ID: D95 | shortcut: wald | author: tomfaulkenberry | date: 2020-09-04, 12:00.
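That (2) is a valid density can be checked by numerical integration (a sketch, not from the book; γ = 1, α = 2 and the integration settings are arbitrary); the first moment anticipates the mean α/γ derived below.

```python
import math

def wald_pdf(x, g, a):
    # Wald(x; γ, α) = α/√(2πx³) · exp(−(α − γx)²/(2x))
    return a / math.sqrt(2 * math.pi * x**3) * math.exp(-(a - g*x)**2 / (2*x))

g, a, h, upper = 1.0, 2.0, 0.0005, 60.0
total = mean = 0.0
for i in range(int(upper / h)):
    x = h * (i + 0.5)  # midpoint rule
    p = wald_pdf(x, g, a)
    total += p * h
    mean += x * p * h
print(total, mean)  # ≈ 1 and ≈ α/γ
```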
X ∼ Wald(γ, α) . (1)
Then, the probability density function (→ Definition I/1.6.6) of X is

f_X(x) = α/√(2πx³) · exp(−(α − γx)²/(2x)) .   (2)
Proof: This follows directly from the definition of the Wald distribution (→ Definition II/3.10.1).
Sources:
• original work
Metadata: ID: P162 | shortcut: wald-pdf | author: tomfaulkenberry | date: 2020-09-04, 12:00.
X ∼ Wald(γ, α) . (1)
Then, the moment-generating function (→ Definition I/1.6.27) of X is

M_X(t) = exp[αγ − √(α²(γ² − 2t))] .   (2)

Proof: The probability density function of the Wald distribution (→ Proof II/3.10.2) is

f_X(x) = α/√(2πx³) · exp(−(α − γx)²/(2x))   (3)
and the moment-generating function (→ Definition I/1.6.27) is defined as
M_X(t) = E[e^(tX)] .   (4)
Using the definition of expected value for continuous random variables (→ Definition I/1.7.1), the
moment-generating function of X therefore is
M_X(t) = ∫_0^∞ e^(tx) · α/√(2πx³) · exp(−(α − γx)²/(2x)) dx
       = α/√(2π) · ∫_0^∞ x^(−3/2) · exp(tx − (α − γx)²/(2x)) dx .   (5)
To evaluate this integral, we will need two identities about modified Bessel functions of the second kind¹, denoted K_p. The function K_p (for p ∈ ℝ) is one of the two linearly independent solutions of the differential equation

x²·d²y/dx² + x·dy/dx − (x² + p²)·y = 0 .   (6)

The first of these identities² gives an explicit solution for K₋₁/₂:

K₋₁/₂(x) = √(π/(2x)) · e^(−x) .   (7)

The second of these identities³ gives an integral representation of K_p:

K_p(√(ab)) = ½ · (a/b)^(p/2) · ∫_0^∞ x^(p−1) · exp[−½·(ax + b/x)] dx .   (8)

¹ https://dlmf.nist.gov/10.25
² https://dlmf.nist.gov/10.39.2
Starting from (5), we can expand the binomial term and rearrange the moment generating function
into the following form:
M_X(t) = α/√(2π) · ∫_0^∞ x^(−3/2) · exp(tx − α²/(2x) + αγ − γ²x/2) dx
       = α/√(2π) · e^(αγ) · ∫_0^∞ x^(−3/2) · exp((t − γ²/2)·x − α²/(2x)) dx   (9)
       = α/√(2π) · e^(αγ) · ∫_0^∞ x^(−3/2) · exp(−½·(γ² − 2t)·x − ½·α²/x) dx .

The integral now has the form of the integral in (8) with p = −1/2, a = γ² − 2t, and b = α². This allows us to write the moment-generating function in terms of the modified Bessel function K₋₁/₂:

M_X(t) = α/√(2π) · e^(αγ) · 2 · ((γ² − 2t)/α²)^(1/4) · K₋₁/₂(√(α²(γ² − 2t))) .   (10)

Combining with (7) and simplifying gives

M_X(t) = α/√(2π) · e^(αγ) · 2 · ((γ² − 2t)/α²)^(1/4) · √(π/(2·√(α²(γ² − 2t)))) · exp[−√(α²(γ² − 2t))]
       = α/(√2·√π) · e^(αγ) · 2 · (γ² − 2t)^(1/4)/√α · √π/(√2·√α·(γ² − 2t)^(1/4)) · exp[−√(α²(γ² − 2t))]   (11)
       = e^(αγ) · exp[−√(α²(γ² − 2t))]
       = exp[αγ − √(α²(γ² − 2t))] .
Sources:
• Siegrist, K. (2020): “The Wald Distribution”; in: Random: Probability, Mathematical Statistics,
Stochastic Processes, retrieved on 2020-09-13; URL: https://www.randomservices.org/random/
special/Wald.html.
• National Institute of Standards and Technology (2020): “NIST Digital Library of Mathematical
Functions”, retrieved on 2020-09-13; URL: https://dlmf.nist.gov.
Metadata: ID: P168 | shortcut: wald-mgf | author: tomfaulkenberry | date: 2020-09-13, 12:00.
³ https://dlmf.nist.gov/10.32.10
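The closed-form MGF can be checked against direct numerical integration of e^(tx) times the density (a sketch, not from the book; γ = 1, α = 2, t = 0.2 are arbitrary, with t < γ²/2 so the MGF exists):

```python
import math

def wald_pdf(x, g, a):
    # Wald(x; γ, α) = α/√(2πx³) · exp(−(α − γx)²/(2x))
    return a / math.sqrt(2 * math.pi * x**3) * math.exp(-(a - g*x)**2 / (2*x))

g, a, t = 1.0, 2.0, 0.2   # requires t < γ²/2
h, upper = 0.0005, 80.0
mgf_numeric = 0.0
for i in range(int(upper / h)):
    x = h * (i + 0.5)  # midpoint rule
    mgf_numeric += math.exp(t*x) * wald_pdf(x, g, a) * h
mgf_closed = math.exp(a*g - math.sqrt(a**2 * (g**2 - 2*t)))
print(mgf_numeric, mgf_closed)
```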
3.10.4 Mean
Theorem: Let X be a positive random variable (→ Definition I/1.2.2) following a Wald distribution
(→ Definition II/3.10.1):
X ∼ Wald(γ, α) . (1)
Then, the mean or expected value (→ Definition I/1.7.1) of X is
α
E(X) = . (2)
γ
Proof: The mean or expected value E(X) is the first moment (→ Definition I/1.14.1) of X, so we can
use (→ Proof I/1.14.2) the moment-generating function of the Wald distribution (→ Proof II/3.10.3)
to calculate
M′_X(t) = exp[αγ − √(α²(γ² − 2t))] · (−½)·(α²(γ² − 2t))^(−1/2) · (−2α²)
        = exp[αγ − √(α²(γ² − 2t))] · α²/√(α²(γ² − 2t)) .   (5)

Evaluating this first derivative at t = 0 gives the mean:

M′_X(0) = exp[αγ − √(α²(γ² − 2·0))] · α²/√(α²(γ² − 2·0))
        = exp[αγ − √(α²·γ²)] · α²/√(α²·γ²)
        = exp[0] · α²/(αγ)   (6)
        = α/γ .
Sources:
• original work
Metadata: ID: P169 | shortcut: wald-mean | author: tomfaulkenberry | date: 2020-09-13, 12:00.
3.10.5 Variance
Theorem: Let X be a positive random variable (→ Definition I/1.2.2) following a Wald distribution
(→ Definition II/3.10.1):
X ∼ Wald(γ, α) . (1)
Then, the variance (→ Definition I/1.8.1) of X is
α
Var(X) = . (2)
γ3
Proof: To compute the variance of X, we partition the variance into expected values (→ Proof I/1.8.3):

Var(X) = E(X²) − E(X)² .

The first derivative of the moment-generating function of the Wald distribution (→ Proof II/3.10.3) is

M′_X(t) = exp[αγ − √(α²(γ² − 2t))] · (−½)·(α²(γ² − 2t))^(−1/2) · (−2α²)
        = exp[αγ − √(α²(γ² − 2t))] · α²/√(α²(γ² − 2t))   (6)
        = α · exp[αγ − √(α²(γ² − 2t))] · (γ² − 2t)^(−1/2) ,

and the second derivative follows with the product rule:

M″_X(t) = α · exp[αγ − √(α²(γ² − 2t))] · (γ² − 2t)^(−1/2) · α·(γ² − 2t)^(−1/2)
        + α · exp[αγ − √(α²(γ² − 2t))] · (−½)·(γ² − 2t)^(−3/2) · (−2)
        = α² · exp[αγ − √(α²(γ² − 2t))] · (γ² − 2t)^(−1)
        + α · exp[αγ − √(α²(γ² − 2t))] · (γ² − 2t)^(−3/2)   (7)
        = α · exp[αγ − √(α²(γ² − 2t))] · [α/(γ² − 2t) + 1/√((γ² − 2t)³)] .

Since the mean of a Wald distribution (→ Proof II/3.10.4) is given by E(X) = α/γ = M′_X(0), evaluating the second derivative at t = 0 yields the variance:

Var(X) = M″_X(0) − (M′_X(0))²
       = α · exp[0] · (α/γ² + 1/γ³) − (α/γ)²
       = α²/γ² + α/γ³ − α²/γ²
       = α/γ³ .
Sources:
• original work
Metadata: ID: P170 | shortcut: wald-var | author: tomfaulkenberry | date: 2020-09-13, 12:00.
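The variance result can be checked by numerical integration of the first two moments (a sketch, not from the book; γ = 1.5, α = 2 are arbitrary):

```python
import math

def wald_pdf(x, g, a):
    # Wald(x; γ, α) = α/√(2πx³) · exp(−(α − γx)²/(2x))
    return a / math.sqrt(2 * math.pi * x**3) * math.exp(-(a - g*x)**2 / (2*x))

g, a, h, upper = 1.5, 2.0, 0.0005, 60.0
m1 = m2 = 0.0
for i in range(int(upper / h)):
    x = h * (i + 0.5)  # midpoint rule
    p = wald_pdf(x, g, a)
    m1 += x * p * h
    m2 += x * x * p * h
var_numeric = m2 - m1**2
print(var_numeric, a / g**3)  # variance vs α/γ³
```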
4. MULTIVARIATE CONTINUOUS DISTRIBUTIONS 265
X ∼ N (µ, Σ) , (1)
if and only if its probability density function (→ Definition I/1.6.6) is given by
N(x; µ, Σ) = 1/√((2π)ⁿ·|Σ|) · exp[−½·(x − µ)ᵀΣ⁻¹(x − µ)]   (2)
where µ is an n × 1 real vector and Σ is an n × n positive definite matrix.
Sources:
• Koch KR (2007): “Multivariate Normal Distribution”; in: Introduction to Bayesian Statistics,
ch. 2.5.1, pp. 51-53, eq. 2.195; URL: https://www.springer.com/gp/book/9783540727231; DOI:
10.1007/978-3-540-72726-2.
Proof: The probability density function of the matrix-normal distribution (→ Proof II/5.1.3) is
MN(X; M, U, V) = 1/√((2π)ⁿᵖ·|V|ⁿ·|U|ᵖ) · exp[−½·tr(V⁻¹(X − M)ᵀU⁻¹(X − M))] .   (1)

Setting p = 1, X = x, M = µ, U = Σ and V = 1, we obtain

MN(x; µ, Σ, 1) = 1/√((2π)ⁿ·|1|ⁿ·|Σ|¹) · exp[−½·tr(1⁻¹(x − µ)ᵀΣ⁻¹(x − µ))]
               = 1/√((2π)ⁿ·|Σ|) · exp[−½·(x − µ)ᵀΣ⁻¹(x − µ)]   (2)
which is equivalent to the probability density function of the multivariate normal distribution (→
Proof II/4.1.3).
Sources:
• Wikipedia (2022): “Matrix normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-07-31; URL: https://en.wikipedia.org/wiki/Matrix_normal_distribution#Definition.
Metadata: ID: P330 | shortcut: mvn-matn | author: JoramSoch | date: 2022-07-31, 11:00.
X ∼ N (µ, Σ) . (1)
Then, the probability density function (→ Definition I/1.6.6) of X is
f_X(x) = 1/√((2π)ⁿ·|Σ|) · exp[−½·(x − µ)ᵀΣ⁻¹(x − µ)] .   (2)
Proof: This follows directly from the definition of the multivariate normal distribution (→ Definition
II/4.1.1).
Sources:
• original work
Metadata: ID: P34 | shortcut: mvn-pdf | author: JoramSoch | date: 2020-01-27, 15:23.
4.1.4 Mean
Theorem: Let x follow a multivariate normal distribution (→ Definition II/4.1.1):
x ∼ N (µ, Σ) . (1)
Then, the mean or expected value (→ Definition I/1.7.1) of x is
E(x) = µ . (2)
Proof:
1) First, consider a set of independent (→ Definition I/1.3.6) and standard normally (→ Definition
II/3.2.2) distributed random variables (→ Definition I/1.2.2):
i.i.d.
zi ∼ N (0, 1), i = 1, . . . , n . (3)
Then, these variables together form a multivariate normally (→ Proof II/4.1.11) distributed random
vector (→ Definition I/1.2.3):
z ∼ N (0n , In ) . (4)
By definition, the expected value of a random vector is equal to the vector of all expected values (→
Definition I/1.7.13):
E(z) = E([z₁, …, zₙ]ᵀ) = [E(z₁), …, E(zₙ)]ᵀ .   (5)

Because the expected value of all its entries is zero (→ Proof II/3.2.16), the expected value of the random vector is

E(z) = [E(z₁), …, E(zₙ)]ᵀ = [0, …, 0]ᵀ = 0ₙ .   (6)
2) Next, consider an n × n matrix A solving the equation AAT = Σ. Such a matrix exists, because
Σ is defined to be positive definite (→ Definition II/4.1.1). Then, x can be represented as a linear
transformation of (→ Proof II/4.1.8) z:
x = Az + µ .

Thus, using linearity of the expected value, the mean of x becomes

E(x) = E(Az + µ)
     = E(Az) + E(µ)
     = A·E(z) + µ   (9)
     = A·0ₙ + µ
     = µ .
Sources:
• Taboga, Marco (2021): “Multivariate normal distribution”; in: Lectures on probability theory and
mathematical statistics, retrieved on 2022-09-15; URL: https://www.statlect.com/probability-distributions/
multivariate-normal-distribution.
Metadata: ID: P339 | shortcut: mvn-mean | author: JoramSoch | date: 2022-09-15, 02:22.
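The construction used in the proof can be simulated directly (a sketch, not from the book; the values of µ and A are arbitrary): the sample mean of x = Az + µ should approach µ.

```python
import random

random.seed(4)
mu = [1.0, -2.0]
A = [[2.0, 0.0], [0.6, 0.8]]  # any matrix A works; here AAᵀ is positive definite
n = 100_000

sums = [0.0, 0.0]
for _ in range(n):
    z0, z1 = random.gauss(0, 1), random.gauss(0, 1)
    x = [A[0][0]*z0 + A[0][1]*z1 + mu[0], A[1][0]*z0 + A[1][1]*z1 + mu[1]]
    sums[0] += x[0]
    sums[1] += x[1]
means = [s / n for s in sums]
print(means)  # ≈ µ
```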
4.1.5 Covariance
Theorem: Let x follow a multivariate normal distribution (→ Definition II/4.1.1):
x ∼ N (µ, Σ) . (1)
Then, the covariance matrix (→ Definition I/1.9.9) of x is
Cov(x) = Σ . (2)
Proof:
1) First, consider a set of independent (→ Definition I/1.3.6) and standard normally (→ Definition
II/3.2.2) distributed random variables (→ Definition I/1.2.2):
i.i.d.
zi ∼ N (0, 1), i = 1, . . . , n . (3)
Then, these variables together form a multivariate normally (→ Proof II/4.1.11) distributed random
vector (→ Definition I/1.2.3):
z ∼ N (0n , In ) . (4)
Because the covariance is zero for independent random variables (→ Proof I/1.9.6) and the variance of each entry is one, the covariance matrix of z is Cov(z) = Iₙ. With a matrix A satisfying AAᵀ = Σ, x can again be written as the linear transformation x = Az + µ, such that

Cov(x) = Cov(Az + µ)
       = Cov(Az)
       = A·Cov(z)·Aᵀ
       = A·Iₙ·Aᵀ
       = A·Aᵀ
       = Σ .
Sources:
• Rosenfeld, Meni (2016): “Deriving the Covariance of Multivariate Gaussian”; in: StackExchange
Mathematics, retrieved on 2022-09-15; URL: https://math.stackexchange.com/questions/1905977/
deriving-the-covariance-of-multivariate-gaussian.
Metadata: ID: P340 | shortcut: mvn-cov | author: JoramSoch | date: 2022-09-15, 08:41.
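The identity Cov(Az + µ) = AAᵀ can likewise be checked by simulation (a sketch, not from the book; the matrix A is arbitrary, with AAᵀ = Σ computed by hand):

```python
import random

random.seed(5)
A = [[2.0, 0.0], [0.6, 0.8]]       # arbitrary choice
Sigma = [[4.0, 1.2], [1.2, 1.0]]   # AAᵀ, worked out by hand
n = 200_000

xs = []
for _ in range(n):
    z0, z1 = random.gauss(0, 1), random.gauss(0, 1)
    xs.append((A[0][0]*z0 + A[0][1]*z1, A[1][0]*z0 + A[1][1]*z1))
m0 = sum(x[0] for x in xs) / n
m1 = sum(x[1] for x in xs) / n
c00 = sum((x[0] - m0)**2 for x in xs) / n
c11 = sum((x[1] - m1)**2 for x in xs) / n
c01 = sum((x[0] - m0)*(x[1] - m1) for x in xs) / n
print(c00, c01, c11)  # ≈ entries of Σ
```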
x ∼ N (µ, Σ) . (1)
Then, the differential entropy (→ Definition I/2.2.1) of x in nats is
h(x) = (n/2)·ln(2π) + ½·ln|Σ| + ½·n .   (2)
Proof: The differential entropy is the negative expected value of the logarithmized probability density. With the probability density function of the multivariate normal distribution (→ Proof II/4.1.3), it follows that:

h(x) = −E[ln(1/√((2π)ⁿ|Σ|) · exp(−½·(x−µ)ᵀΣ⁻¹(x−µ)))]
     = −E[−(n/2)·ln(2π) − ½·ln|Σ| − ½·(x−µ)ᵀΣ⁻¹(x−µ)]   (5)
     = (n/2)·ln(2π) + ½·ln|Σ| + ½·E[(x−µ)ᵀΣ⁻¹(x−µ)] .
The last term can be evaluated using the trace of a scalar and the cyclic property of the trace:

E[(x−µ)ᵀΣ⁻¹(x−µ)] = E[tr((x−µ)ᵀΣ⁻¹(x−µ))]
                  = E[tr(Σ⁻¹(x−µ)(x−µ)ᵀ)]
                  = tr(Σ⁻¹·E[(x−µ)(x−µ)ᵀ])   (6)
                  = tr(Σ⁻¹Σ)
                  = tr(Iₙ)
                  = n ,
such that the differential entropy is

h(x) = (n/2)·ln(2π) + ½·ln|Σ| + ½·n .   (7)
Sources:
• Kiuhnm (2018): “Entropy of the multivariate Gaussian”; in: StackExchange Mathematics, retrieved
on 2020-05-14; URL: https://math.stackexchange.com/questions/2029707/entropy-of-the-multivariate-gau
Metadata: ID: P100 | shortcut: mvn-dent | author: JoramSoch | date: 2020-05-14, 19:49.
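For the one-dimensional special case (n = 1, Σ = σ²), the formula reduces to ½·ln(2πeσ²); this can be checked against a direct quadrature of −∫ p(x) ln p(x) dx (a sketch, not from the book; σ² = 2.5 is arbitrary):

```python
import math

sigma2 = 2.5  # arbitrary variance for the n = 1 special case

def pdf(x):
    return math.exp(-x*x / (2*sigma2)) / math.sqrt(2*math.pi*sigma2)

h, L = 0.001, 30.0
h_numeric = 0.0
for i in range(int(2*L / h)):
    x = -L + h*(i + 0.5)  # midpoint rule over [−L, L]
    p = pdf(x)
    h_numeric -= p * math.log(p) * h
h_closed = 0.5*math.log(2*math.pi) + 0.5*math.log(sigma2) + 0.5
print(h_numeric, h_closed)
```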
P: x ∼ N(µ₁, Σ₁)
Q: x ∼ N(µ₂, Σ₂) .   (1)
Proof: The KL divergence for a continuous random variable (→ Definition I/2.5.1) is given by
KL[P || Q] = ∫_X p(x) · ln(p(x)/q(x)) dx   (3)

which, applied to the multivariate normal distributions (→ Definition II/4.1.1) in (1), yields

KL[P || Q] = ∫_ℝⁿ N(x; µ₁, Σ₁) · ln(N(x; µ₁, Σ₁)/N(x; µ₂, Σ₂)) dx
           = ⟨ln(N(x; µ₁, Σ₁)/N(x; µ₂, Σ₂))⟩_p(x) .   (4)
Using the probability density function of the multivariate normal distribution (→ Proof II/4.1.3),
this becomes:
KL[P || Q] = ⟨ln[(1/√((2π)ⁿ|Σ₁|) · exp(−½·(x−µ₁)ᵀΣ₁⁻¹(x−µ₁))) / (1/√((2π)ⁿ|Σ₂|) · exp(−½·(x−µ₂)ᵀΣ₂⁻¹(x−µ₂)))]⟩_p(x)
           = ⟨½·ln(|Σ₂|/|Σ₁|) − ½·(x−µ₁)ᵀΣ₁⁻¹(x−µ₁) + ½·(x−µ₂)ᵀΣ₂⁻¹(x−µ₂)⟩_p(x)   (5)
           = ½·⟨ln(|Σ₂|/|Σ₁|) − (x−µ₁)ᵀΣ₁⁻¹(x−µ₁) + (x−µ₂)ᵀΣ₂⁻¹(x−µ₂)⟩_p(x) .
Now, using the fact that x = tr(x), if a is scalar, and the trace property tr(ABC) = tr(BCA), we
have:
KL[P || Q] = ½·⟨ln(|Σ₂|/|Σ₁|) − tr(Σ₁⁻¹(x−µ₁)(x−µ₁)ᵀ) + tr(Σ₂⁻¹(x−µ₂)(x−µ₂)ᵀ)⟩_p(x)
           = ½·⟨ln(|Σ₂|/|Σ₁|) − tr(Σ₁⁻¹(x−µ₁)(x−µ₁)ᵀ) + tr(Σ₂⁻¹(xxᵀ − 2µ₂xᵀ + µ₂µ₂ᵀ))⟩_p(x) .   (6)

Because trace function and expected value are both linear operators (→ Proof I/1.7.8), the expectation can be moved inside the trace:

KL[P || Q] = ½·[ln(|Σ₂|/|Σ₁|) − tr(Σ₁⁻¹·⟨(x−µ₁)(x−µ₁)ᵀ⟩_p(x)) + tr(Σ₂⁻¹·(⟨xxᵀ⟩_p(x) − 2µ₂·⟨xᵀ⟩_p(x) + µ₂µ₂ᵀ))] .   (7)
(7)
Using the expectation of a linear form for the multivariate normal distribution (→ Proof II/4.1.8), i.e. ⟨x⟩_p(x) = µ₁ and ⟨xxᵀ⟩_p(x) = Σ₁ + µ₁µ₁ᵀ, we obtain:

KL[P || Q] = ½·[ln(|Σ₂|/|Σ₁|) − tr(Σ₁⁻¹Σ₁) + tr(Σ₂⁻¹(Σ₁ + µ₁µ₁ᵀ − 2µ₂µ₁ᵀ + µ₂µ₂ᵀ))]
           = ½·[ln(|Σ₂|/|Σ₁|) − tr(Iₙ) + tr(Σ₂⁻¹Σ₁) + tr(Σ₂⁻¹(µ₁µ₁ᵀ − 2µ₂µ₁ᵀ + µ₂µ₂ᵀ))]
           = ½·[ln(|Σ₂|/|Σ₁|) − n + tr(Σ₂⁻¹Σ₁) + µ₁ᵀΣ₂⁻¹µ₁ − 2µ₁ᵀΣ₂⁻¹µ₂ + µ₂ᵀΣ₂⁻¹µ₂]   (10)
           = ½·[ln(|Σ₂|/|Σ₁|) − n + tr(Σ₂⁻¹Σ₁) + (µ₂ − µ₁)ᵀΣ₂⁻¹(µ₂ − µ₁)] .
Sources:
• Duchi, John (2014): “Derivations for Linear Algebra and Optimization”; in: University of Cali-
fornia, Berkeley; URL: http://www.eecs.berkeley.edu/~jduchi/projects/general_notes.pdf.
Metadata: ID: P92 | shortcut: mvn-kl | author: JoramSoch | date: 2020-05-05, 06:57.
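For n = 1 the result reduces to KL = ½·[ln(σ₂²/σ₁²) − 1 + σ₁²/σ₂² + (µ₂ − µ₁)²/σ₂²], which can be checked against a direct quadrature of ∫ p ln(p/q) dx (a sketch, not from the book; all parameter values are arbitrary):

```python
import math

def npdf(x, m, v):
    return math.exp(-(x - m)**2 / (2*v)) / math.sqrt(2*math.pi*v)

m1, v1, m2, v2 = 0.5, 1.5, -0.3, 2.0
h, L = 0.001, 25.0
kl_numeric = 0.0
for i in range(int(2*L / h)):
    x = -L + h*(i + 0.5)  # midpoint rule over [−L, L]
    p, q = npdf(x, m1, v1), npdf(x, m2, v2)
    kl_numeric += p * math.log(p/q) * h
kl_closed = 0.5*(math.log(v2/v1) - 1 + v1/v2 + (m2 - m1)**2/v2)
print(kl_numeric, kl_closed)
```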
x ∼ N (µ, Σ) . (1)
Then, any linear transformation of x is also multivariate normally distributed:

y = Ax + b ∼ N(Aµ + b, AΣAᵀ) .   (2)
Proof: The moment-generating function (→ Definition I/1.6.27) of y = Ax + b is

M_y(t) = E[exp(tᵀ(Ax + b))]
       = E[exp(tᵀAx) · exp(tᵀb)]   (4)
       = exp(tᵀb) · E[exp(tᵀAx)]
       = exp(tᵀb) · M_x(At) .

Together with the moment-generating function of the multivariate normal distribution, M_x(t) = exp(tᵀµ + ½·tᵀΣt), this becomes

M_y(t) = exp(tᵀb) · M_x(At)
       = exp(tᵀb) · exp(tᵀAµ + ½·tᵀAΣAᵀt)   (6)
       = exp(tᵀ(Aµ + b) + ½·tᵀ(AΣAᵀ)t) .
2
Because moment-generating function and probability density function of a random variable are equiv-
alent, this demonstrates that y is following a multivariate normal distribution with mean Aµ + b and
covariance AΣAT .
Sources:
• Taboga, Marco (2010): “Linear combinations of normal random variables”; in: Lectures on probabil-
ity and statistics, retrieved on 2019-08-27; URL: https://www.statlect.com/probability-distributions/
normal-distribution-linear-combinations.
x ∼ N (µ, Σ) . (1)
Then, the marginal distribution (→ Definition I/1.5.3) of any subset vector xs is also a multivariate
normal distribution
xs ∼ N (µs , Σs ) (2)
where µs drops the irrelevant variables (the ones not in the subset, i.e. marginalized out) from the
mean vector µ and Σs drops the corresponding rows and columns from the covariance matrix Σ.
Proof: Define an m × n subset matrix S such that sij = 1, if the j-th element in xs corresponds to
the i-th element in x, and sij = 0 otherwise. Then,
xs = Sx (3)
and we can apply the linear transformation theorem (→ Proof II/4.1.8) to give

$$x_s = Sx \sim \mathcal{N}\!\left( S\mu, \, S \Sigma S^\mathrm{T} \right) = \mathcal{N}(\mu_s, \Sigma_s) \; ,$$

because multiplying with S simply extracts the relevant entries of µ and the relevant rows and columns of Σ.
Sources:
• original work
Metadata: ID: P35 | shortcut: mvn-marg | author: JoramSoch | date: 2020-01-29, 15:12.
$$x \sim \mathcal{N}(\mu, \Sigma) \; . \quad (1)$$

Then, the conditional distribution (→ Definition I/1.5.4) of any subset vector x1, given the complement vector x2, is also a multivariate normal distribution

$$x_1 | x_2 \sim \mathcal{N}(\mu_{1|2}, \Sigma_{1|2}) \quad (2)$$

with conditional mean and covariance

$$\mu_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2) \; , \quad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \; . \quad (3)$$

Proof: Without loss of generality, partition x into x1 and x2 and partition the mean vector and covariance matrix accordingly:

$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \; , \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \; . \quad (4)$$

By assumption, the joint distribution of x1 and x2 is

$$x_1, x_2 \sim \mathcal{N}(\mu, \Sigma) \; . \quad (6)$$

Moreover, the marginal distribution (→ Definition I/1.5.3) of x2 follows from (→ Proof II/4.1.9) (1) and (4) as

$$x_2 \sim \mathcal{N}(\mu_2, \Sigma_{22}) \; . \quad (7)$$

By the law of conditional probability, the conditional density is

$$p(x_1|x_2) = \frac{p(x_1, x_2)}{p(x_2)} \; . \quad (8)$$

Applying (6) and (7) to (8), we have:

$$p(x_1|x_2) = \frac{\mathcal{N}(x; \mu, \Sigma)}{\mathcal{N}(x_2; \mu_2, \Sigma_{22})} \; . \quad (9)$$

Using the probability density function of the multivariate normal distribution (→ Proof II/4.1.3), this becomes:

$$\begin{split}
p(x_1|x_2) &= \frac{ 1/\sqrt{(2\pi)^n |\Sigma|} \cdot \exp\left[ -\frac{1}{2} (x-\mu)^\mathrm{T} \Sigma^{-1} (x-\mu) \right] }{ 1/\sqrt{(2\pi)^{n_2} |\Sigma_{22}|} \cdot \exp\left[ -\frac{1}{2} (x_2-\mu_2)^\mathrm{T} \Sigma_{22}^{-1} (x_2-\mu_2) \right] } \\
&= \frac{1}{\sqrt{(2\pi)^{n-n_2}}} \cdot \sqrt{\frac{|\Sigma_{22}|}{|\Sigma|}} \cdot \exp\left[ -\frac{1}{2} (x-\mu)^\mathrm{T} \Sigma^{-1} (x-\mu) + \frac{1}{2} (x_2-\mu_2)^\mathrm{T} \Sigma_{22}^{-1} (x_2-\mu_2) \right] \; . \quad (10)
\end{split}$$

Writing the precision matrix $\Sigma^{-1}$ in block form with blocks $\Sigma^{11}, \Sigma^{12}, \Sigma^{21}, \Sigma^{22}$, the conditional density becomes

$$p(x_1|x_2) = \frac{1}{\sqrt{(2\pi)^{n-n_2}}} \cdot \sqrt{\frac{|\Sigma_{22}|}{|\Sigma|}} \cdot \exp\left[ -\frac{1}{2} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}^\mathrm{T} \begin{bmatrix} \Sigma^{11} & \Sigma^{12} \\ \Sigma^{21} & \Sigma^{22} \end{bmatrix} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix} + \frac{1}{2} (x_2-\mu_2)^\mathrm{T} \Sigma_{22}^{-1} (x_2-\mu_2) \right] \; . \quad (12)$$

Multiplying out the quadratic form, we obtain

$$\begin{split}
p(x_1|x_2) = \; & \frac{1}{\sqrt{(2\pi)^{n-n_2}}} \cdot \sqrt{\frac{|\Sigma_{22}|}{|\Sigma|}} \cdot \\
& \exp\left[ -\frac{1}{2} \left( (x_1-\mu_1)^\mathrm{T} \Sigma^{11} (x_1-\mu_1) + 2 (x_1-\mu_1)^\mathrm{T} \Sigma^{12} (x_2-\mu_2) + (x_2-\mu_2)^\mathrm{T} \Sigma^{22} (x_2-\mu_2) \right) \right. \\
& \quad \left. + \frac{1}{2} (x_2-\mu_2)^\mathrm{T} \Sigma_{22}^{-1} (x_2-\mu_2) \right] \quad (13)
\end{split}$$

where we have used the fact that $\Sigma^{21} = (\Sigma^{12})^\mathrm{T}$, because $\Sigma^{-1}$ is a symmetric matrix.

The blocks of the inverse of the partitioned covariance matrix are given by blockwise inversion as

$$\begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}^{-1} = \begin{bmatrix} (\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})^{-1} & -(\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})^{-1} \Sigma_{12} \Sigma_{22}^{-1} \\ -\Sigma_{22}^{-1} \Sigma_{21} (\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})^{-1} & \Sigma_{22}^{-1} + \Sigma_{22}^{-1} \Sigma_{21} (\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})^{-1} \Sigma_{12} \Sigma_{22}^{-1} \end{bmatrix} \; . \quad (15)$$

Plugging this into (13), we have:

$$\begin{split}
p(x_1|x_2) = \; & \frac{1}{\sqrt{(2\pi)^{n-n_2}}} \cdot \sqrt{\frac{|\Sigma_{22}|}{|\Sigma|}} \cdot \\
& \exp\left[ -\frac{1}{2} \left( (x_1-\mu_1)^\mathrm{T} (\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})^{-1} (x_1-\mu_1) \right. \right. \\
& \qquad - 2 (x_1-\mu_1)^\mathrm{T} (\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})^{-1} \Sigma_{12} \Sigma_{22}^{-1} (x_2-\mu_2) \\
& \qquad \left. + (x_2-\mu_2)^\mathrm{T} \left[ \Sigma_{22}^{-1} + \Sigma_{22}^{-1} \Sigma_{21} (\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})^{-1} \Sigma_{12} \Sigma_{22}^{-1} \right] (x_2-\mu_2) \right) \\
& \quad \left. + \frac{1}{2} (x_2-\mu_2)^\mathrm{T} \Sigma_{22}^{-1} (x_2-\mu_2) \right] \; . \quad (16)
\end{split}$$
Sources:
• Wang, Ruye (2006): “Marginal and conditional distributions of multivariate normal distribution”;
in: Computer Image Processing and Analysis; URL: http://fourier.eng.hmc.edu/e161/lectures/
gaussianprocess/node7.html.
• Wikipedia (2020): “Multivariate normal distribution”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-03-20; URL: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#
Conditional_distributions.
Metadata: ID: P88 | shortcut: mvn-cond | author: JoramSoch | date: 2020-03-20, 08:44.
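The blockwise inversion formula in (15) can be checked numerically against a direct matrix inverse; the positive definite matrix below is an arbitrary random example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n1 = 5, 2
B = rng.standard_normal((n, n))
Sigma = B @ B.T + n * np.eye(n)   # random symmetric positive definite matrix

S11, S12 = Sigma[:n1, :n1], Sigma[:n1, n1:]
S21, S22 = Sigma[n1:, :n1], Sigma[n1:, n1:]
S22i = np.linalg.inv(S22)
schur = S11 - S12 @ S22i @ S21    # Schur complement = conditional covariance
schur_i = np.linalg.inv(schur)

# Assemble the blockwise inverse of equation (15)
top = np.hstack([schur_i, -schur_i @ S12 @ S22i])
bottom = np.hstack([-S22i @ S21 @ schur_i,
                    S22i + S22i @ S21 @ schur_i @ S12 @ S22i])
block_inverse = np.vstack([top, bottom])

assert np.allclose(block_inverse, np.linalg.inv(Sigma))
# The top-left block of Sigma^{-1} is the inverse conditional covariance
assert np.allclose(np.linalg.inv(Sigma)[:n1, :n1], schur_i)
```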
x ∼ N (µ, Σ) . (1)
Then, the components of x are statistically independent (→ Definition I/1.3.6), if and only if the
covariance matrix (→ Definition I/1.9.9) is a diagonal matrix:
$$p(x) = p(x_1) \cdot \ldots \cdot p(x_n) \quad \Leftrightarrow \quad \Sigma = \mathrm{diag}\left( \sigma_1^2, \ldots, \sigma_n^2 \right) \; . \quad (2)$$
Proof: The marginal distribution of one entry from a multivariate normal random vector is a uni-
variate normal distribution (→ Proof II/4.1.9) where mean (→ Definition I/1.7.1) and variance (→
Definition I/1.8.1) are equal to the corresponding entries of the mean vector and covariance matrix:
$$x_i \sim \mathcal{N}(\mu_i, \sigma_i^2) \; .$$

1) Let

$$p(x) = p(x_1) \cdot \ldots \cdot p(x_n) \; . \quad (6)$$

Using the probability density functions of the multivariate (4) and univariate (5) normal distributions, this means

$$\frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \cdot \exp\left[ -\frac{1}{2} (x-\mu)^\mathrm{T} \Sigma^{-1} (x-\mu) \right] \overset{(4),(5)}{=} \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma_i^2}} \cdot \exp\left[ -\frac{1}{2} \left( \frac{x_i - \mu_i}{\sigma_i} \right)^2 \right]$$

$$\frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \cdot \exp\left[ -\frac{1}{2} (x-\mu)^\mathrm{T} \Sigma^{-1} (x-\mu) \right] = \frac{1}{\sqrt{(2\pi)^n \prod_{i=1}^n \sigma_i^2}} \cdot \exp\left[ -\frac{1}{2} \sum_{i=1}^n (x_i - \mu_i) \frac{1}{\sigma_i^2} (x_i - \mu_i) \right]$$

and, taking the logarithm on both sides,

$$-\frac{1}{2} \ln|\Sigma| - \frac{1}{2} (x-\mu)^\mathrm{T} \Sigma^{-1} (x-\mu) = -\frac{1}{2} \sum_{i=1}^n \ln \sigma_i^2 - \frac{1}{2} \sum_{i=1}^n (x_i - \mu_i) \frac{1}{\sigma_i^2} (x_i - \mu_i) \quad (7)$$

which is only fulfilled by a diagonal covariance matrix

$$\Sigma = \mathrm{diag}\left( \sigma_1^2, \ldots, \sigma_n^2 \right) \; , \quad (8)$$

because the determinant of a diagonal matrix is a product

$$|\mathrm{diag}\left( [a_1, \ldots, a_n] \right)| = \prod_{i=1}^n a_i \; , \quad (9)$$

and the inverse of a diagonal matrix is a diagonal matrix of the inverse entries

$$\mathrm{diag}\left( [a_1, \ldots, a_n] \right)^{-1} = \mathrm{diag}\left( [1/a_1, \ldots, 1/a_n] \right) \; .$$

2) Let

$$\Sigma = \mathrm{diag}\left( \sigma_1^2, \ldots, \sigma_n^2 \right) \; . \quad (12)$$

Then, we have

$$\begin{split}
p(x) &\overset{(4)}{=} \frac{1}{\sqrt{(2\pi)^n |\mathrm{diag}\left( [\sigma_1^2, \ldots, \sigma_n^2] \right)|}} \cdot \exp\left[ -\frac{1}{2} (x-\mu)^\mathrm{T} \, \mathrm{diag}\left( \sigma_1^2, \ldots, \sigma_n^2 \right)^{-1} (x-\mu) \right] \\
&= \frac{1}{\sqrt{(2\pi)^n \prod_{i=1}^n \sigma_i^2}} \cdot \exp\left[ -\frac{1}{2} (x-\mu)^\mathrm{T} \, \mathrm{diag}\left( 1/\sigma_1^2, \ldots, 1/\sigma_n^2 \right) (x-\mu) \right] \\
&= \frac{1}{\sqrt{(2\pi)^n \prod_{i=1}^n \sigma_i^2}} \cdot \exp\left[ -\frac{1}{2} \sum_{i=1}^n \frac{(x_i - \mu_i)^2}{\sigma_i^2} \right] \\
&= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma_i^2}} \cdot \exp\left[ -\frac{1}{2} \left( \frac{x_i - \mu_i}{\sigma_i} \right)^2 \right] \quad (13)
\end{split}$$

which implies that

$$p(x) = p(x_1) \cdot \ldots \cdot p(x_n) \; . \quad (14)$$
Sources:
• original work
Metadata: ID: P236 | shortcut: mvn-ind | author: JoramSoch | date: 2021-06-02, 09:22.
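Direction 2) of the theorem is easy to confirm numerically: with a diagonal covariance matrix, the joint density equals the product of the univariate normal densities. A minimal SciPy sketch with arbitrary example values:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mu = np.array([0.5, -1.0, 2.0])
sigma = np.array([1.0, 0.5, 2.0])
x = np.array([0.2, -0.7, 3.1])

joint = multivariate_normal.pdf(x, mean=mu, cov=np.diag(sigma**2))
product = np.prod(norm.pdf(x, loc=mu, scale=sigma))
assert np.isclose(joint, product)  # diagonal covariance => independence
```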
X ∼ t(µ, Σ, ν) , (1)
if and only if its probability density function (→ Definition I/1.6.6) is given by
$$t(x; \mu, \Sigma, \nu) = \sqrt{\frac{1}{(\nu\pi)^n |\Sigma|}} \cdot \frac{\Gamma([\nu+n]/2)}{\Gamma(\nu/2)} \cdot \left[ 1 + \frac{1}{\nu} (x-\mu)^\mathrm{T} \Sigma^{-1} (x-\mu) \right]^{-(\nu+n)/2} \quad (2)$$
where µ is an n × 1 real vector, Σ is an n × n positive definite matrix and ν > 0.
Sources:
• Koch KR (2007): “Multivariate t-Distribution”; in: Introduction to Bayesian Statistics, ch. 2.5.2,
pp. 53-55; URL: https://www.springer.com/gp/book/9783540727231; DOI: 10.1007/978-3-540-
72726-2.
Metadata: ID: D148 | shortcut: mvt | author: JoramSoch | date: 2020-04-21, 08:16.
X ∼ t(µ, Σ, ν) . (1)
Then, the probability density function (→ Definition I/1.6.6) of X is
$$f_X(x) = \sqrt{\frac{1}{(\nu\pi)^n |\Sigma|}} \cdot \frac{\Gamma([\nu+n]/2)}{\Gamma(\nu/2)} \cdot \left[ 1 + \frac{1}{\nu} (x-\mu)^\mathrm{T} \Sigma^{-1} (x-\mu) \right]^{-(\nu+n)/2} \; . \quad (2)$$
Proof: This follows directly from the definition of the multivariate t-distribution (→ Definition
II/4.2.1).
Sources:
• original work
Metadata: ID: P333 | shortcut: mvt-pdf | author: JoramSoch | date: 2022-09-02, 11:50.
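For n = 1, µ = 0 and Σ = 1, the density in (2) reduces to the ordinary Student t density, which gives a quick numerical check against SciPy's implementation (the degrees of freedom below are an arbitrary choice):

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import t as student_t

def mvt_pdf(x, mu, Sigma, nu):
    """Multivariate t density, equation (2)."""
    n = len(mu)
    d = x - mu
    q = d @ np.linalg.inv(Sigma) @ d
    log_f = (-0.5 * (n * np.log(nu * np.pi) + np.log(np.linalg.det(Sigma)))
             + gammaln((nu + n) / 2) - gammaln(nu / 2)
             - (nu + n) / 2 * np.log1p(q / nu))
    return np.exp(log_f)

# For n = 1, mu = 0, Sigma = 1, the density reduces to the Student t pdf
nu = 7.0
for x in [-2.0, 0.0, 1.3]:
    f = mvt_pdf(np.array([x]), np.array([0.0]), np.array([[1.0]]), nu)
    assert np.isclose(f, student_t.pdf(x, df=nu))
```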
$$X \sim t(\mu, \Sigma, \nu) \; . \quad (1)$$

Then, the centered, weighted and standardized quadratic form of X follows an F-distribution (→ Definition II/3.8.1) with degrees of freedom n and ν:

$$\frac{1}{n} (X-\mu)^\mathrm{T} \Sigma^{-1} (X-\mu) \sim F(n, \nu) \; . \quad (2)$$

Proof: The linear transformation theorem for the multivariate t-distribution (→ Proof “mvt-ltt”) states

$$x \sim t(\mu, \Sigma, \nu) \quad \Rightarrow \quad y = Ax + b \sim t\!\left( A\mu + b, \, A \Sigma A^\mathrm{T}, \, \nu \right) \; . \quad (3)$$

Define the random vector $Y = \Sigma^{-1/2} (X - \mu)$ and the quadratic form $Z = Y^\mathrm{T} Y / n$, (4) where Σ^{−1/2} is a matrix square root of the inverse of Σ. Then, applying (3) to (4) with (1), one obtains the distribution of Y as

$$Y = \Sigma^{-1/2} (X - \mu) \sim t\!\left( \Sigma^{-1/2}(\mu - \mu), \, \Sigma^{-1/2} \Sigma \Sigma^{-1/2}, \, \nu \right) = t(0_n, I_n, \nu) \; , \quad (5)$$

such that each entry of Y marginally follows a standard t-distribution:

$$Y_i \sim t(\nu), \quad i = 1, \ldots, n \; . \quad (6)$$

Note that, when X follows a t-distribution with n degrees of freedom, this is equivalent to (→ Definition II/3.3.1) an expression of X in terms of a standard normal (→ Definition II/3.2.2) random variable Z and a chi-squared (→ Definition II/3.7.1) random variable V:

$$X \sim t(n) \quad \Leftrightarrow \quad X = \frac{Z}{\sqrt{V/n}} \quad \text{with independent} \quad Z \sim \mathcal{N}(0,1) \quad \text{and} \quad V \sim \chi^2(n) \; . \quad (7)$$

With that, Z from (4) can be rewritten as follows:

$$Z \overset{(4)}{=} Y^\mathrm{T} Y / n = \frac{1}{n} \sum_{i=1}^n Y_i^2 \overset{(7)}{=} \frac{1}{n} \sum_{i=1}^n \left( \frac{Z_i}{\sqrt{V/\nu}} \right)^2 = \frac{\left( \sum_{i=1}^n Z_i^2 \right)/n}{V/\nu} \; . \quad (8)$$

Because, by definition, the sum of squared standard normal random variables follows a chi-squared distribution (→ Definition II/3.7.1),

$$X_i \sim \mathcal{N}(0,1), \; i = 1, \ldots, n \quad \Rightarrow \quad \sum_{i=1}^n X_i^2 \sim \chi^2(n) \; , \quad (9)$$

we can write

$$Z = \frac{W/n}{V/\nu} \quad \text{with} \quad W \sim \chi^2(n) \quad \text{and} \quad V \sim \chi^2(\nu) \; , \quad (10)$$

such that Z, by definition, follows an F-distribution (→ Definition II/3.8.1):

$$Z = \frac{W/n}{V/\nu} \sim F(n, \nu) \; . \quad (11)$$
Sources:
• Lin, Pi-Erh (1972): “Some Characterizations of the Multivariate t Distribution”; in: Journal of
Multivariate Analysis, vol. 2, pp. 339-344, Lemma 2; URL: https://core.ac.uk/download/pdf/
81139018.pdf; DOI: 10.1016/0047-259X(72)90021-8.
• Nadarajah, Saralees; Kotz, Samuel (2005): “Mathematical Properties of the Multivariate t Dis-
tribution”; in: Acta Applicandae Mathematicae, vol. 89, pp. 53-84, page 56; URL: https://link.
springer.com/content/pdf/10.1007/s10440-005-9003-4.pdf; DOI: 10.1007/s10440-005-9003-4.
Metadata: ID: P231 | shortcut: mvt-f | author: JoramSoch | date: 2021-05-04, 10:29.
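The result can be illustrated by simulation: generating multivariate t samples via the normal/chi-squared construction in (7) and comparing the distribution of the quadratic form with F(n, ν). Dimensions and degrees of freedom below are arbitrary illustrations.

```python
import numpy as np
from scipy.stats import f as f_dist, kstest

rng = np.random.default_rng(42)
n, nu, N = 3, 10.0, 100_000

# Sample Y ~ t(0, I_n, nu) via Y = Z / sqrt(V / nu), with one shared V per draw
Z = rng.standard_normal((N, n))
V = rng.chisquare(nu, size=N)
Y = Z / np.sqrt(V / nu)[:, None]

# Quadratic form Q = Y^T Y / n should follow F(n, nu)
Q = (Y**2).sum(axis=1) / n
assert kstest(Q, f_dist(n, nu).cdf).pvalue > 0.001
assert abs(Q.mean() - nu / (nu - 2)) < 0.05  # F(n, nu) mean is nu/(nu-2)
```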
$$X, Y \sim \mathrm{NG}(\mu, \Lambda, a, b) \; , \quad (1)$$

if the distribution of X conditional on Y is a multivariate normal distribution (→ Definition II/4.1.1) with mean vector µ and covariance matrix (yΛ)^{−1} and Y follows a gamma distribution (→ Definition II/3.4.1) with shape parameter a and rate parameter b:

$$X | Y = y \sim \mathcal{N}\!\left( \mu, (y\Lambda)^{-1} \right), \quad Y \sim \mathrm{Gam}(a, b) \; . \quad (2)$$

The n × n matrix Λ is referred to as the precision matrix (→ Definition I/1.9.19) of the normal-gamma distribution.
Sources:
• Koch KR (2007): “Normal-Gamma Distribution”; in: Introduction to Bayesian Statistics, ch. 2.5.3,
pp. 55-56, eq. 2.212; URL: https://www.springer.com/gp/book/9783540727231; DOI: 10.1007/978-
3-540-72726-2.
Proof: Let X be an n × p real matrix and let Y be a p × p positive-definite symmetric matrix, such
that X and Y jointly follow a normal-Wishart distribution (→ Definition II/5.3.1):
X, Y ∼ NW(M, U, V, ν) . (1)
Then, X and Y are described by the probability density function (→ Proof II/5.3.2)

$$p(X,Y) = \frac{1}{\sqrt{(2\pi)^{np} |U|^p |V|^\nu}} \cdot \frac{\sqrt{2^{-\nu p}}}{\Gamma_p\!\left(\frac{\nu}{2}\right)} \cdot |Y|^{(\nu+n-p-1)/2} \cdot \exp\left[ -\frac{1}{2} \mathrm{tr}\left( Y \left[ (X-M)^\mathrm{T} U^{-1} (X-M) + V^{-1} \right] \right) \right] \quad (2)$$

where |A| is a matrix determinant, A^{−1} is a matrix inverse and Γ_p(x) is the multivariate gamma function of order p. If p = 1, then Γ_p(x) = Γ(x) is the ordinary gamma function, x = X is an n × 1 column vector and y = Y is a positive real number. Thus, the probability density function (→ Definition I/1.6.6) of x and y can be developed as

$$\begin{split}
p(x,y) &= \frac{1}{\sqrt{(2\pi)^n |U| |V|^\nu}} \cdot \frac{\sqrt{2^{-\nu}}}{\Gamma\!\left(\frac{\nu}{2}\right)} \cdot y^{(\nu+n-2)/2} \cdot \exp\left[ -\frac{y}{2} \left( (x-M)^\mathrm{T} U^{-1} (x-M) + V^{-1} \right) \right] \\
&= \sqrt{\frac{|U|^{-1}}{(2\pi)^n}} \cdot \frac{\sqrt{(2|V|)^{-\nu}}}{\Gamma\!\left(\frac{\nu}{2}\right)} \cdot y^{\frac{\nu}{2} + \frac{n}{2} - 1} \cdot \exp\left[ -\frac{y}{2} \left( (x-M)^\mathrm{T} U^{-1} (x-M) + 2 \, (2V)^{-1} \right) \right] \\
&= \sqrt{\frac{|U^{-1}|}{(2\pi)^n}} \cdot \frac{\left( \frac{1}{2|V|} \right)^{\frac{\nu}{2}}}{\Gamma\!\left(\frac{\nu}{2}\right)} \cdot y^{\frac{\nu}{2} + \frac{n}{2} - 1} \cdot \exp\left[ -\frac{y}{2} \left( (x-M)^\mathrm{T} U^{-1} (x-M) + 2 \cdot \frac{1}{2V} \right) \right] \; . \quad (3)
\end{split}$$

This matches the probability density function of the normal-gamma distribution (→ Proof II/4.3.3) with parameters µ = M, Λ = U^{−1}, a = ν/2 and b = 1/(2V), such that

$$x, y \sim \mathrm{NG}\!\left( M, \, U^{-1}, \, \frac{\nu}{2}, \, \frac{1}{2V} \right) \; .$$
Sources:
• original work
Metadata: ID: P324 | shortcut: ng-nw | author: JoramSoch | date: 2022-05-20, 18:23.
$$x, y \sim \mathrm{NG}(\mu, \Lambda, a, b) \; . \quad (1)$$

Then, the joint probability (→ Definition I/1.3.2) density function (→ Definition I/1.6.6) of x and y is

$$p(x,y) = \sqrt{\frac{|\Lambda|}{(2\pi)^n}} \cdot \frac{b^a}{\Gamma(a)} \cdot y^{a + \frac{n}{2} - 1} \cdot \exp\left[ -\frac{y}{2} \left( (x-\mu)^\mathrm{T} \Lambda (x-\mu) + 2b \right) \right] \; . \quad (2)$$

Proof: By definition, x conditional on y follows a multivariate normal distribution and y follows a gamma distribution, such that p(x,y) = p(x|y) · p(y). Thus, using the probability density function of the multivariate normal distribution (→ Proof II/4.1.3) and the probability density function of the gamma distribution (→ Proof II/3.4.6), we have:

$$\begin{split}
p(x,y) &= \mathcal{N}(x; \mu, (y\Lambda)^{-1}) \cdot \mathrm{Gam}(y; a, b) \\
&= \sqrt{\frac{|y\Lambda|}{(2\pi)^n}} \exp\left[ -\frac{1}{2} (x-\mu)^\mathrm{T} (y\Lambda) (x-\mu) \right] \cdot \frac{b^a}{\Gamma(a)} y^{a-1} \exp[-by] \\
&= \sqrt{\frac{|\Lambda|}{(2\pi)^n}} \cdot \frac{b^a}{\Gamma(a)} \cdot y^{a + \frac{n}{2} - 1} \cdot \exp\left[ -\frac{y}{2} \left( (x-\mu)^\mathrm{T} \Lambda (x-\mu) + 2b \right) \right] \; ,
\end{split}$$

as stated in equation (2).
Sources:
• Koch KR (2007): “Normal-Gamma Distribution”; in: Introduction to Bayesian Statistics, ch. 2.5.3,
pp. 55-56, eq. 2.212; URL: https://www.springer.com/gp/book/9783540727231; DOI: 10.1007/978-
3-540-72726-2.
Metadata: ID: P44 | shortcut: ng-pdf | author: JoramSoch | date: 2020-02-07, 20:44.
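Equation (2) can be verified pointwise against the product of SciPy's multivariate normal and gamma densities; the parameter values below are arbitrary examples.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import multivariate_normal, gamma

def ng_logpdf(x, y, mu, Lam, a, b):
    """Log of the normal-gamma density, equation (2)."""
    n = len(mu)
    q = (x - mu) @ Lam @ (x - mu)
    return (0.5 * np.log(np.linalg.det(Lam)) - 0.5 * n * np.log(2 * np.pi)
            + a * np.log(b) - gammaln(a)
            + (a + n / 2 - 1) * np.log(y)
            - 0.5 * y * (q + 2 * b))

mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 0.3], [0.3, 1.0]])
a, b = 3.0, 2.0
x, y = np.array([0.5, -0.2]), 1.7

# Equation (2) should equal log N(x; mu, (y*Lam)^{-1}) + log Gam(y; a, b)
direct = (multivariate_normal.logpdf(x, mean=mu, cov=np.linalg.inv(y * Lam))
          + gamma.logpdf(y, a, scale=1 / b))
assert np.isclose(ng_logpdf(x, y, mu, Lam, a, b), direct)
```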
4.3.4 Mean
Theorem: Let x ∈ Rn and y > 0 follow a normal-gamma distribution (→ Definition II/4.3.1):
x, y ∼ NG(µ, Λ, a, b) . (1)
Then, the expected value (→ Definition I/1.7.1) of x and y is

$$\mathrm{E}[(x, y)] = \left( \mu, \frac{a}{b} \right) \; . \quad (2)$$

Proof: Consider the random vector

$$\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} x_1 \\ \vdots \\ x_n \\ y \end{bmatrix} \; . \quad (3)$$

According to the expected value of a random vector (→ Definition I/1.7.13), its expected value is

$$\mathrm{E}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} \mathrm{E}(x_1) \\ \vdots \\ \mathrm{E}(x_n) \\ \mathrm{E}(y) \end{bmatrix} = \begin{bmatrix} \mathrm{E}(x) \\ \mathrm{E}(y) \end{bmatrix} \; . \quad (4)$$

When x and y are jointly normal-gamma distributed, then (→ Definition II/4.3.1) by definition x follows a multivariate normal distribution (→ Definition II/4.1.1) conditional on y and y follows a univariate gamma distribution (→ Definition II/3.4.1):

$$x | y \sim \mathcal{N}\!\left( \mu, (y\Lambda)^{-1} \right), \quad y \sim \mathrm{Gam}(a, b) \; . \quad (5)$$

Thus, with the expected value of the multivariate normal distribution, E(x) becomes

$$\begin{split}
\mathrm{E}(x) &= \iint x \cdot p(x,y) \, \mathrm{d}x \, \mathrm{d}y \\
&= \iint x \cdot p(x|y) \cdot p(y) \, \mathrm{d}x \, \mathrm{d}y \\
&= \int p(y) \left[ \int x \cdot p(x|y) \, \mathrm{d}x \right] \mathrm{d}y \\
&= \int p(y) \, \langle x \rangle_{\mathcal{N}(\mu, (y\Lambda)^{-1})} \, \mathrm{d}y \\
&= \int p(y) \cdot \mu \, \mathrm{d}y \\
&= \mu \int p(y) \, \mathrm{d}y \\
&= \mu \; , \quad (6)
\end{split}$$

and with the expected value of the gamma distribution (→ Proof II/3.4.9), E(y) becomes

$$\mathrm{E}(y) = \int y \cdot p(y) \, \mathrm{d}y = \langle y \rangle_{\mathrm{Gam}(a,b)} = \frac{a}{b} \; . \quad (7)$$

Thus, the expectation of the random vector in equations (3) and (4) is

$$\mathrm{E}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} \mu \\ a/b \end{bmatrix} \; , \quad (8)$$

as indicated by equation (2).
Sources:
• original work
Metadata: ID: P237 | shortcut: ng-mean | author: JoramSoch | date: 2021-07-08, 09:40.
4.3.5 Covariance
Theorem: Let x ∈ Rn and y > 0 follow a normal-gamma distribution (→ Definition II/4.3.1):
x, y ∼ NG(µ, Λ, a, b) . (1)
Then,
1) the covariance (→ Definition I/1.9.1) of x, conditional (→ Definition I/1.5.4) on y is

$$\mathrm{Cov}(x|y) = \frac{1}{y} \Lambda^{-1} \; ; \quad (2)$$

2) the covariance (→ Definition I/1.9.1) of x, unconditional (→ Definition I/1.5.3) on y is

$$\mathrm{Cov}(x) = \frac{b}{a-1} \Lambda^{-1} \; ; \quad (3)$$

3) the variance (→ Definition I/1.8.1) of y is

$$\mathrm{Var}(y) = \frac{a}{b^2} \; . \quad (4)$$

Proof:
1) According to the definition of the normal-gamma distribution (→ Definition II/4.3.1), the distribution of x given y is a multivariate normal distribution (→ Definition II/4.1.1),

$$x | y \sim \mathcal{N}\!\left( \mu, (y\Lambda)^{-1} \right) \; ,$$

such that the covariance of the multivariate normal distribution gives

$$\mathrm{Cov}(x|y) = (y\Lambda)^{-1} = \frac{1}{y} \Lambda^{-1} \; .$$

2) By the law of total covariance, the unconditional covariance of x is

$$\mathrm{Cov}(x) = \left\langle \mathrm{Cov}(x|y) \right\rangle_{p(y)} + \mathrm{Cov}\!\left( \langle x|y \rangle \right) = \left\langle \frac{1}{y} \right\rangle_{\mathrm{Gam}(a,b)} \Lambda^{-1} + \mathrm{Cov}(\mu) = \frac{b}{a-1} \Lambda^{-1} \; ,$$

because the conditional mean µ does not depend on y and 1/y has expected value b/(a − 1) when y follows a gamma distribution with shape a and rate b.

3) According to the definition of the normal-gamma distribution (→ Definition II/4.3.1), the marginal distribution of y is a gamma distribution:

$$y \sim \mathrm{Gam}(a, b) \; . \quad (11)$$

The variance of the gamma distribution (→ Proof II/3.4.10) is

$$x \sim \mathrm{Gam}(a, b) \quad \Rightarrow \quad \mathrm{Var}(x) = \frac{a}{b^2} \; , \quad (12)$$

such that we have:

$$\mathrm{Var}(y) = \frac{a}{b^2} \; . \quad (13)$$
Sources:
• original work
Metadata: ID: P345 | shortcut: ng-cov | author: JoramSoch | date: 2022-09-22, 09:17.
$$h(x,y) = \frac{n}{2} \ln(2\pi) - \frac{1}{2} \ln|\Lambda| + \frac{n}{2} + a + \ln \Gamma(a) - \frac{n - 2 + 2a}{2} \psi(a) + \frac{n-2}{2} \ln b \; . \quad (2)$$

Proof: The probability density function of the normal-gamma distribution (→ Proof II/4.3.3) factorizes as p(x,y) = p(x|y) · p(y) with x|y ∼ N(µ, (yΛ)^{−1}) and y ∼ Gam(a, b), and the differential entropy of the gamma distribution is

$$h(y) = a + \ln \Gamma(a) + (1-a) \, \psi(a) - \ln b \; . \quad (5)$$

The differential entropy of a continuous random variable (→ Definition I/2.2.1) in nats is given by

$$h(Z) = -\int_{\mathcal{Z}} p(z) \ln p(z) \, \mathrm{d}z \quad (6)$$

which, applied to the normal-gamma distribution (→ Definition II/4.3.1) over x and y, yields

$$h(x,y) = -\int_0^\infty \int_{\mathbb{R}^n} p(x,y) \ln p(x,y) \, \mathrm{d}x \, \mathrm{d}y \; . \quad (7)$$

Using the law of conditional probability (→ Definition I/1.3.4), this can be evaluated as follows:

$$\begin{split}
h(x,y) &= -\int_0^\infty \int_{\mathbb{R}^n} p(x|y) \, p(y) \ln \left[ p(x|y) \, p(y) \right] \mathrm{d}x \, \mathrm{d}y \\
&= -\int_0^\infty \int_{\mathbb{R}^n} p(x|y) \, p(y) \ln p(x|y) \, \mathrm{d}x \, \mathrm{d}y - \int_0^\infty \int_{\mathbb{R}^n} p(x|y) \, p(y) \ln p(y) \, \mathrm{d}x \, \mathrm{d}y \\
&= -\int_0^\infty p(y) \left[ \int_{\mathbb{R}^n} p(x|y) \ln p(x|y) \, \mathrm{d}x \right] \mathrm{d}y - \int_0^\infty p(y) \ln p(y) \left[ \int_{\mathbb{R}^n} p(x|y) \, \mathrm{d}x \right] \mathrm{d}y \\
&= \langle h(x|y) \rangle_{p(y)} + h(y) \; . \quad (8)
\end{split}$$

In other words, the differential entropy of the normal-gamma distribution over x and y is equal to the sum of a multivariate normal entropy regarding x conditional on y, expected over y, and a univariate gamma entropy regarding y.

The differential entropy of the multivariate normal distribution gives

$$h(x|y) = \frac{n}{2} \ln(2\pi) + \frac{1}{2} \ln|(y\Lambda)^{-1}| + \frac{n}{2} = \frac{n}{2} \ln(2\pi) - \frac{1}{2} \ln|\Lambda| - \frac{n}{2} \ln y + \frac{n}{2} \; , \quad (9)$$

and using the relation (→ Proof II/3.4.11) y ∼ Gam(a, b) ⇒ ⟨ln y⟩ = ψ(a) − ln(b), we have

$$\langle h(x|y) \rangle_{p(y)} = \frac{n}{2} \ln(2\pi) - \frac{1}{2} \ln|\Lambda| + \frac{n}{2} - \frac{n}{2} \psi(a) + \frac{n}{2} \ln b \; . \quad (10)$$
By plugging (10) and (5) into (8), one arrives at the differential entropy given by (2).
Sources:
• original work
Metadata: ID: P238 | shortcut: ng-dent | author: JoramSoch | date: 2021-07-08, 10:51.
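A numerical consistency check of equation (2): the closed-form entropy should equal the decomposition (8), with the gamma entropy taken from SciPy as an independent reference. The parameter values are arbitrary.

```python
import numpy as np
from scipy.special import gammaln, digamma
from scipy.stats import gamma

n = 3
Lam = np.array([[2.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 1.5]])
a, b = 4.0, 1.5

# Closed-form differential entropy, equation (2)
h_total = (n / 2 * np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(Lam)) + n / 2
           + a + gammaln(a) - (n - 2 + 2 * a) / 2 * digamma(a)
           + (n - 2) / 2 * np.log(b))

# Decomposition (8): expected conditional normal entropy, equation (10), ...
h_x_given_y = (n / 2 * np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(Lam))
               + n / 2 - n / 2 * (digamma(a) - np.log(b)))
# ... plus the gamma entropy, computed independently by SciPy (in nats)
h_y = gamma(a, scale=1 / b).entropy()
assert np.isclose(h_total, h_x_given_y + h_y)
```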
$$\begin{split}
\mathrm{KL}[P \,||\, Q] = \; & \frac{1}{2} \frac{a_1}{b_1} (\mu_2 - \mu_1)^\mathrm{T} \Lambda_2 (\mu_2 - \mu_1) + \frac{1}{2} \mathrm{tr}(\Lambda_2 \Lambda_1^{-1}) - \frac{1}{2} \ln \frac{|\Lambda_2|}{|\Lambda_1|} - \frac{n}{2} \\
& + a_2 \ln \frac{b_1}{b_2} - \ln \frac{\Gamma(a_1)}{\Gamma(a_2)} + (a_1 - a_2) \, \psi(a_1) - (b_1 - b_2) \, \frac{a_1}{b_1} \; . \quad (2)
\end{split}$$

Proof: The probability density function of the normal-gamma distribution (→ Proof II/4.3.3) factorizes as p(x,y) = p(x|y) · p(y) with x|y multivariate normal and y gamma, and the Kullback-Leibler divergence between the two gamma distributions is

$$\mathrm{KL}[p(y) \,||\, q(y)] = a_2 \ln \frac{b_1}{b_2} - \ln \frac{\Gamma(a_1)}{\Gamma(a_2)} + (a_1 - a_2) \, \psi(a_1) - (b_1 - b_2) \, \frac{a_1}{b_1} \quad (5)$$

where Γ(x) is the gamma function and ψ(x) is the digamma function.

Applying the KL divergence for continuous random variables (→ Definition I/2.5.1) to the joint distributions and using the law of conditional probability, we obtain:

$$\begin{split}
\mathrm{KL}[P \,||\, Q] &= \int_0^\infty \int_{\mathbb{R}^n} p(x|y) \, p(y) \ln \frac{p(x|y) \, p(y)}{q(x|y) \, q(y)} \, \mathrm{d}x \, \mathrm{d}y \\
&= \int_0^\infty \int_{\mathbb{R}^n} p(x|y) \, p(y) \ln \frac{p(x|y)}{q(x|y)} \, \mathrm{d}x \, \mathrm{d}y + \int_0^\infty \int_{\mathbb{R}^n} p(x|y) \, p(y) \ln \frac{p(y)}{q(y)} \, \mathrm{d}x \, \mathrm{d}y \\
&= \int_0^\infty p(y) \left[ \int_{\mathbb{R}^n} p(x|y) \ln \frac{p(x|y)}{q(x|y)} \, \mathrm{d}x \right] \mathrm{d}y + \int_0^\infty p(y) \ln \frac{p(y)}{q(y)} \left[ \int_{\mathbb{R}^n} p(x|y) \, \mathrm{d}x \right] \mathrm{d}y \\
&= \langle \mathrm{KL}[p(x|y) \,||\, q(x|y)] \rangle_{p(y)} + \mathrm{KL}[p(y) \,||\, q(y)] \; . \quad (8)
\end{split}$$

In other words, the KL divergence between two normal-gamma distributions over x and y is equal to the sum of a multivariate normal KL divergence regarding x conditional on y, expected over y, and a univariate gamma KL divergence regarding y.

Using the KL divergence of the multivariate normal distribution, the first term evaluates to

$$\begin{split}
\langle \mathrm{KL}[p(x|y) \,||\, q(x|y)] \rangle_{p(y)} &= \left\langle \frac{1}{2} \left[ (\mu_2 - \mu_1)^\mathrm{T} (y\Lambda_2) (\mu_2 - \mu_1) + \mathrm{tr}\left( (y\Lambda_2)(y\Lambda_1)^{-1} \right) - \ln \frac{|(y\Lambda_1)^{-1}|}{|(y\Lambda_2)^{-1}|} - n \right] \right\rangle_{p(y)} \\
&= \left\langle \frac{y}{2} (\mu_2 - \mu_1)^\mathrm{T} \Lambda_2 (\mu_2 - \mu_1) + \frac{1}{2} \mathrm{tr}(\Lambda_2 \Lambda_1^{-1}) - \frac{1}{2} \ln \frac{|\Lambda_2|}{|\Lambda_1|} - \frac{n}{2} \right\rangle_{p(y)} \quad (9)
\end{split}$$

and using the relation (→ Proof II/3.4.9) y ∼ Gam(a, b) ⇒ ⟨y⟩ = a/b, we have

$$\langle \mathrm{KL}[p(x|y) \,||\, q(x|y)] \rangle_{p(y)} = \frac{1}{2} \frac{a_1}{b_1} (\mu_2 - \mu_1)^\mathrm{T} \Lambda_2 (\mu_2 - \mu_1) + \frac{1}{2} \mathrm{tr}(\Lambda_2 \Lambda_1^{-1}) - \frac{1}{2} \ln \frac{|\Lambda_2|}{|\Lambda_1|} - \frac{n}{2} \; . \quad (10)$$
By plugging (10) and (5) into (8), one arrives at the KL divergence given by (2).
Sources:
• Soch J, Allefeld A (2016): “Kullback-Leibler Divergence for the Normal-Gamma Distribution”;
in: arXiv math.ST, 1611.01437; URL: https://arxiv.org/abs/1611.01437.
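Equation (2) can be cross-checked by a Monte Carlo estimate of E_P[ln p − ln q], sampling from P and evaluating the normal-gamma log-density (→ Proof II/4.3.3) directly. All parameter values below are arbitrary illustrations.

```python
import numpy as np
from scipy.special import gammaln, digamma

rng = np.random.default_rng(7)
n, N = 2, 200_000
m1, m2 = np.array([0.0, 0.0]), np.array([0.3, -0.2])
L1, L2 = np.eye(n), np.array([[1.2, 0.1], [0.1, 0.9]])
a1, b1, a2, b2 = 3.0, 2.0, 3.5, 2.2

def kl_ng(m1, L1, a1, b1, m2, L2, a2, b2):
    """Closed-form KL divergence, equation (2)."""
    d = m2 - m1
    return (0.5 * a1 / b1 * d @ L2 @ d
            + 0.5 * np.trace(L2 @ np.linalg.inv(L1))
            - 0.5 * np.log(np.linalg.det(L2) / np.linalg.det(L1)) - n / 2
            + a2 * np.log(b1 / b2) - (gammaln(a1) - gammaln(a2))
            + (a1 - a2) * digamma(a1) - (b1 - b2) * a1 / b1)

def ng_logpdf(x, y, m, L, a, b):
    """Normal-gamma log-density, vectorized over samples."""
    q = np.einsum('ij,jk,ik->i', x - m, L, x - m)
    return (0.5 * np.log(np.linalg.det(L)) - n / 2 * np.log(2 * np.pi)
            + a * np.log(b) - gammaln(a)
            + (a + n / 2 - 1) * np.log(y) - 0.5 * y * (q + 2 * b))

# Sample (x, y) from P: y ~ Gam(a1, b1), then x | y ~ N(m1, (y*L1)^{-1})
y = rng.gamma(a1, 1 / b1, size=N)
C = np.linalg.cholesky(np.linalg.inv(L1))
x = m1 + (rng.standard_normal((N, n)) @ C.T) / np.sqrt(y)[:, None]
kl_mc = np.mean(ng_logpdf(x, y, m1, L1, a1, b1) - ng_logpdf(x, y, m2, L2, a2, b2))
assert abs(kl_ng(m1, L1, a1, b1, m2, L2, a2, b2) - kl_mc) < 0.02
```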
$$x, y \sim \mathrm{NG}(\mu, \Lambda, a, b) \; . \quad (1)$$

Then, the marginal distribution (→ Definition I/1.5.3) of y is a gamma distribution (→ Definition II/3.4.1)

$$y \sim \mathrm{Gam}(a, b) \quad (2)$$

and the marginal distribution (→ Definition I/1.5.3) of x is a multivariate t-distribution (→ Definition II/4.2.1)

$$x \sim t\!\left( \mu, \left( \frac{a}{b} \Lambda \right)^{-1}, 2a \right) \; . \quad (3)$$
Proof: The probability density function of the normal-gamma distribution (→ Proof II/4.3.3) is given by

$$p(x,y) = \mathcal{N}(x; \mu, (y\Lambda)^{-1}) \cdot \mathrm{Gam}(y; a, b) \; . \quad (4)$$

Using the law of marginal probability (→ Definition I/1.3.3), the marginal distribution of y can be derived as

$$\begin{split}
p(y) &= \int p(x,y) \, \mathrm{d}x \\
&= \int \mathcal{N}(x; \mu, (y\Lambda)^{-1}) \, \mathrm{Gam}(y; a, b) \, \mathrm{d}x \\
&= \mathrm{Gam}(y; a, b) \int \mathcal{N}(x; \mu, (y\Lambda)^{-1}) \, \mathrm{d}x \\
&= \mathrm{Gam}(y; a, b) \quad (5)
\end{split}$$

which is the probability density function of the gamma distribution (→ Proof II/3.4.6) with shape parameter a and rate parameter b.

Using the law of marginal probability (→ Definition I/1.3.3), the marginal distribution of x can be derived as

$$\begin{split}
p(x) &= \int p(x,y) \, \mathrm{d}y \\
&= \int \mathcal{N}(x; \mu, (y\Lambda)^{-1}) \, \mathrm{Gam}(y; a, b) \, \mathrm{d}y \\
&= \int \sqrt{\frac{|y\Lambda|}{(2\pi)^n}} \exp\left[ -\frac{1}{2} (x-\mu)^\mathrm{T} (y\Lambda) (x-\mu) \right] \cdot \frac{b^a}{\Gamma(a)} y^{a-1} \exp[-by] \, \mathrm{d}y \\
&= \sqrt{\frac{|\Lambda|}{(2\pi)^n}} \cdot \frac{b^a}{\Gamma(a)} \int y^{a + \frac{n}{2} - 1} \cdot \exp\left[ -\left( b + \frac{1}{2} (x-\mu)^\mathrm{T} \Lambda (x-\mu) \right) y \right] \mathrm{d}y \\
&= \sqrt{\frac{|\Lambda|}{(2\pi)^n}} \cdot \frac{b^a}{\Gamma(a)} \cdot \frac{\Gamma\!\left(a + \frac{n}{2}\right)}{\left( b + \frac{1}{2} (x-\mu)^\mathrm{T} \Lambda (x-\mu) \right)^{a + \frac{n}{2}}} \int \mathrm{Gam}\!\left( y; \, a + \frac{n}{2}, \, b + \frac{1}{2} (x-\mu)^\mathrm{T} \Lambda (x-\mu) \right) \mathrm{d}y \\
&= \sqrt{\frac{|\Lambda|}{(2\pi)^n}} \cdot \frac{\Gamma\!\left(\frac{2a+n}{2}\right)}{\Gamma\!\left(\frac{2a}{2}\right)} \cdot b^a \cdot \left( b + \frac{1}{2} (x-\mu)^\mathrm{T} \Lambda (x-\mu) \right)^{-\left(a + \frac{n}{2}\right)} \; ,
\end{split}$$

because the gamma density integrates to one. Rewriting the last factor as

$$b^a \left( b + \frac{1}{2} (x-\mu)^\mathrm{T} \Lambda (x-\mu) \right)^{-\left(a + \frac{n}{2}\right)} = b^{-\frac{n}{2}} \left[ 1 + \frac{1}{2a} (x-\mu)^\mathrm{T} \left( \frac{a}{b} \Lambda \right) (x-\mu) \right]^{-\frac{2a+n}{2}}$$

and noting that $\sqrt{|\Lambda| / \left( (2\pi)^n b^n \right)} = \sqrt{\left| \frac{a}{b} \Lambda \right| / (2a\pi)^n}$, we finally obtain

$$p(x) = \sqrt{\frac{\left| \frac{a}{b} \Lambda \right|}{(2a\pi)^n}} \cdot \frac{\Gamma\!\left(\frac{2a+n}{2}\right)}{\Gamma\!\left(\frac{2a}{2}\right)} \cdot \left[ 1 + \frac{1}{2a} (x-\mu)^\mathrm{T} \left( \frac{a}{b} \Lambda \right) (x-\mu) \right]^{-\frac{2a+n}{2}} \quad (6)$$

which is the probability density function of a multivariate t-distribution (→ Proof II/4.2.2) with mean vector µ, shape matrix $\left( \frac{a}{b} \Lambda \right)^{-1}$ and 2a degrees of freedom.
Sources:
• original work
Metadata: ID: P36 | shortcut: ng-marg | author: JoramSoch | date: 2020-01-29, 21:42.
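The marginal density of x can be validated by numerically integrating the joint density over y and comparing with the multivariate t density; parameter values below are arbitrary illustrations.

```python
import numpy as np
from scipy import integrate
from scipy.special import gammaln
from scipy.stats import multivariate_normal, gamma

mu = np.array([0.5, -1.0])
Lam = np.array([[1.5, 0.2], [0.2, 0.8]])
a, b = 3.0, 2.0
n = len(mu)
x = np.array([1.0, 0.3])

# Marginal density of x by numerical integration over y
integrand = lambda y: (multivariate_normal.pdf(x, mean=mu, cov=np.linalg.inv(y * Lam))
                       * gamma.pdf(y, a, scale=1 / b))
p_numeric, _ = integrate.quad(integrand, 0, np.inf)

# Multivariate t density with shape matrix ((a/b) * Lam)^{-1} and 2a dof
nu, Sigma = 2 * a, np.linalg.inv(a / b * Lam)
d = x - mu
q = d @ np.linalg.inv(Sigma) @ d
log_t = (-0.5 * (n * np.log(nu * np.pi) + np.log(np.linalg.det(Sigma)))
         + gammaln((nu + n) / 2) - gammaln(nu / 2)
         - (nu + n) / 2 * np.log1p(q / nu))
assert np.isclose(p_numeric, np.exp(log_t), rtol=1e-5)
```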
$$x, y \sim \mathrm{NG}(\mu, \Lambda, a, b) \; . \quad (1)$$

Then,
1) the conditional distribution (→ Definition I/1.5.4) of x given y is a multivariate normal distribution (→ Definition II/4.1.1)

$$x | y \sim \mathcal{N}\!\left( \mu, (y\Lambda)^{-1} \right) \; ; \quad (2)$$

2) the conditional distribution (→ Definition I/1.5.4) of any subset vector x1, given the complement vector x2 and y, is also a multivariate normal distribution,

$$x_1 | x_2, y \sim \mathcal{N}\!\left( \mu_{1|2}(y), \, \Sigma_{1|2}(y) \right) \quad (3)$$

$$\mu_{1|2}(y) = \mu_1 + \Sigma_{12}(y) \Sigma_{22}(y)^{-1} (x_2 - \mu_2) \; , \quad \Sigma_{1|2}(y) = \Sigma_{11}(y) - \Sigma_{12}(y) \Sigma_{22}(y)^{-1} \Sigma_{21}(y) \; , \quad (4)$$

where µ1, µ2 and Σ11, Σ12, Σ22, Σ21 are block-wise components (→ Proof II/4.1.10) of µ and Σ(y) = (yΛ)^{−1};

3) the conditional distribution (→ Definition I/1.5.4) of y given x is a gamma distribution (→ Definition II/3.4.1)

$$y | x \sim \mathrm{Gam}\!\left( a + \frac{n}{2}, \, b + \frac{1}{2} (x-\mu)^\mathrm{T} \Lambda (x-\mu) \right) \quad (5)$$

where n is the dimensionality of x.

Proof:
1) This follows from the definition of the normal-gamma distribution (→ Definition II/4.3.1).

2) Since, given y, x is multivariate normally distributed by (2), this follows from the conditional distribution of the multivariate normal distribution (→ Proof II/4.1.10):

$$\begin{split}
x \sim \mathcal{N}(\mu, \Sigma) \quad \Rightarrow \quad & x_1 | x_2 \sim \mathcal{N}(\mu_{1|2}, \Sigma_{1|2}) \\
& \mu_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2) \\
& \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \; . \quad (7)
\end{split}$$
3) The conditional density of y given x follows from Bayes’ theorem (→ Proof I/5.3.1) as

$$p(y|x) = \frac{p(x|y) \cdot p(y)}{p(x)} \; . \quad (8)$$

The conditional distribution (→ Definition I/1.5.4) of x given y is a multivariate normal distribution (→ Proof II/4.3.3)

$$p(x|y) = \mathcal{N}(x; \mu, (y\Lambda)^{-1}) = \sqrt{\frac{|y\Lambda|}{(2\pi)^n}} \exp\left[ -\frac{1}{2} (x-\mu)^\mathrm{T} (y\Lambda) (x-\mu) \right] \; , \quad (9)$$

the marginal distribution (→ Definition I/1.5.3) of y is a gamma distribution (→ Proof II/4.3.8)

$$p(y) = \mathrm{Gam}(y; a, b) = \frac{b^a}{\Gamma(a)} y^{a-1} \exp[-by] \quad (10)$$

and the marginal distribution (→ Definition I/1.5.3) of x is a multivariate t-distribution (→ Proof II/4.3.8)

$$\begin{split}
p(x) &= t\!\left( x; \mu, \left( \frac{a}{b} \Lambda \right)^{-1}, 2a \right) \\
&= \sqrt{\frac{\left| \frac{a}{b} \Lambda \right|}{(2a\pi)^n}} \cdot \frac{\Gamma\!\left(\frac{2a+n}{2}\right)}{\Gamma\!\left(\frac{2a}{2}\right)} \cdot \left[ 1 + \frac{1}{2a} (x-\mu)^\mathrm{T} \left( \frac{a}{b} \Lambda \right) (x-\mu) \right]^{-\frac{2a+n}{2}} \\
&= \sqrt{\frac{|\Lambda|}{(2\pi)^n}} \cdot \frac{\Gamma\!\left(a + \frac{n}{2}\right)}{\Gamma(a)} \cdot b^a \cdot \left( b + \frac{1}{2} (x-\mu)^\mathrm{T} \Lambda (x-\mu) \right)^{-\left(a + \frac{n}{2}\right)} \; . \quad (11)
\end{split}$$

Plugging (9), (10) and (11) into (8), we obtain

$$\begin{split}
p(y|x) &= \frac{ \sqrt{\frac{|y\Lambda|}{(2\pi)^n}} \exp\left[ -\frac{1}{2} (x-\mu)^\mathrm{T} (y\Lambda) (x-\mu) \right] \cdot \frac{b^a}{\Gamma(a)} y^{a-1} \exp[-by] }{ \sqrt{\frac{|\Lambda|}{(2\pi)^n}} \cdot \frac{\Gamma\left(a + \frac{n}{2}\right)}{\Gamma(a)} \cdot b^a \cdot \left( b + \frac{1}{2} (x-\mu)^\mathrm{T} \Lambda (x-\mu) \right)^{-\left(a + \frac{n}{2}\right)} } \\
&= y^{\frac{n}{2}} \cdot \exp\left[ -\frac{1}{2} (x-\mu)^\mathrm{T} (y\Lambda) (x-\mu) \right] \cdot y^{a-1} \cdot \exp[-by] \cdot \frac{1}{\Gamma\left(a + \frac{n}{2}\right)} \cdot \left( b + \frac{1}{2} (x-\mu)^\mathrm{T} \Lambda (x-\mu) \right)^{a + \frac{n}{2}} \\
&= \frac{\left( b + \frac{1}{2} (x-\mu)^\mathrm{T} \Lambda (x-\mu) \right)^{a + \frac{n}{2}}}{\Gamma\left(a + \frac{n}{2}\right)} \cdot y^{a + \frac{n}{2} - 1} \cdot \exp\left[ -\left( b + \frac{1}{2} (x-\mu)^\mathrm{T} \Lambda (x-\mu) \right) y \right] \quad (12)
\end{split}$$

which is the probability density function of a gamma distribution (→ Proof II/3.4.6) with shape and rate parameters

$$a + \frac{n}{2} \quad \text{and} \quad b + \frac{1}{2} (x-\mu)^\mathrm{T} \Lambda (x-\mu) \; , \quad (13)$$

such that

$$p(y|x) = \mathrm{Gam}\!\left( y; \, a + \frac{n}{2}, \, b + \frac{1}{2} (x-\mu)^\mathrm{T} \Lambda (x-\mu) \right) \; . \quad (14)$$
Sources:
• original work
Metadata: ID: P146 | shortcut: ng-cond | author: JoramSoch | date: 2020-08-05, 06:54.
Proof: If all entries of Z1 are independent and standard normally distributed (→ Definition II/3.2.2),

$$z_{1i} \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, 1) \quad \text{for all} \quad i = 1, \ldots, n \; , \quad (2)$$

this implies a multivariate normal distribution with diagonal covariance matrix (→ Proof II/4.1.11):

$$Z_1 \sim \mathcal{N}(0_n, I_n) \quad (3)$$

where 0n is an n × 1 vector of zeros and In is the n × n identity matrix.

If the distribution of Z2 is a standard gamma distribution (→ Definition II/3.4.2)

$$Z_2 \sim \mathrm{Gam}(a, 1) \; , \quad (4)$$

then due to the relationship between gamma and standard gamma distribution (→ Proof II/3.4.3), we have:

$$Y = \frac{Z_2}{b} \sim \mathrm{Gam}(a, b) \; . \quad (5)$$

Moreover, using the linear transformation theorem for the multivariate normal distribution (→ Proof II/4.1.8), it follows that:

$$\begin{split}
Z_1 &\sim \mathcal{N}(0_n, I_n) \\
X = \mu + \frac{1}{\sqrt{Z_2/b}} A Z_1 &\sim \mathcal{N}\!\left( \mu + \frac{1}{\sqrt{Z_2/b}} A \, 0_n, \; \left( \frac{1}{\sqrt{Z_2/b}} A \right) I_n \left( \frac{1}{\sqrt{Z_2/b}} A \right)^\mathrm{T} \right) \\
X &\sim \mathcal{N}\!\left( \mu + 0_n, \; \left( \frac{1}{\sqrt{Y}} \right)^2 A A^\mathrm{T} \right) \\
X &\sim \mathcal{N}\!\left( \mu, (Y\Lambda)^{-1} \right) \; , \quad (6)
\end{split}$$

because $A A^\mathrm{T} = \Lambda^{-1}$. Thus, Y follows a gamma distribution (→ Definition II/3.4.1) and the distribution of X conditional on Y is a multivariate normal distribution (→ Definition II/4.1.1):

$$X | Y \sim \mathcal{N}\!\left( \mu, (Y\Lambda)^{-1} \right), \quad Y \sim \mathrm{Gam}(a, b) \; . \quad (7)$$

This means that, by definition (→ Definition II/4.3.1), X and Y jointly follow a normal-gamma distribution (→ Definition II/4.3.1):

$$X, Y \sim \mathrm{NG}(\mu, \Lambda, a, b) \; . \quad (8)$$

Thus, given Z1 defined by (2) and Z2 defined by (4), X and Y defined by (1) are a sample (→ Definition “samp”) from NG(µ, Λ, a, b).
Sources:
• Wikipedia (2022): “Normal-gamma distribution”; in: Wikipedia, the free encyclopedia, retrieved
on 2022-09-22; URL: https://en.wikipedia.org/wiki/Normal-gamma_distribution#Generating_
normal-gamma_random_variates.
Metadata: ID: P346 | shortcut: ng-samp | author: JoramSoch | date: 2022-09-22, 11:10.
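The sampling scheme of this proof can be implemented directly and checked against the normal-gamma moments derived earlier (mean µ and a/b, covariance b/(a−1)·Λ^{−1}, variance a/b²). All parameter values are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(3)
n, N = 2, 400_000
mu = np.array([1.0, -2.0])
Lam = np.array([[2.0, 0.4], [0.4, 1.0]])
a, b = 5.0, 2.0

# A such that A A^T = Lam^{-1}
A = np.linalg.cholesky(np.linalg.inv(Lam))

Z1 = rng.standard_normal((N, n))      # standard normal entries
Z2 = rng.gamma(a, 1.0, size=N)        # standard gamma variates
Y = Z2 / b
X = mu + (Z1 @ A.T) / np.sqrt(Y)[:, None]

# Check against the moments of the normal-gamma distribution
assert np.allclose(X.mean(axis=0), mu, atol=0.02)
assert abs(Y.mean() - a / b) < 0.02
assert np.allclose(np.cov(X.T), b / (a - 1) * np.linalg.inv(Lam), atol=0.02)
assert abs(Y.var() - a / b**2) < 0.02
```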
X ∼ Dir(α) , (1)
if and only if its probability density function (→ Definition I/1.6.6) is given by
$$\mathrm{Dir}(x; \alpha) = \frac{\Gamma\!\left( \sum_{i=1}^k \alpha_i \right)}{\prod_{i=1}^k \Gamma(\alpha_i)} \prod_{i=1}^k x_i^{\alpha_i - 1} \quad (2)$$

where $\alpha_i > 0$ for all $i = 1, \ldots, k$, and the density is zero, if $x_i \notin [0, 1]$ for any $i = 1, \ldots, k$ or $\sum_{i=1}^k x_i \neq 1$.
Sources:
• Wikipedia (2020): “Dirichlet distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
05-10; URL: https://en.wikipedia.org/wiki/Dirichlet_distribution#Probability_density_function.
Metadata: ID: D54 | shortcut: dir | author: JoramSoch | date: 2020-05-10, 20:36.
$$X \sim \mathrm{Dir}(\alpha) \; . \quad (1)$$

Then, the probability density function (→ Definition I/1.6.6) of X is

$$f_X(x) = \frac{\Gamma\!\left( \sum_{i=1}^k \alpha_i \right)}{\prod_{i=1}^k \Gamma(\alpha_i)} \prod_{i=1}^k x_i^{\alpha_i - 1} \; . \quad (2)$$

Proof: This follows directly from the definition of the Dirichlet distribution (→ Definition II/4.4.1).
Sources:
• original work
Metadata: ID: P95 | shortcut: dir-pdf | author: JoramSoch | date: 2020-05-05, 21:22.
$$\begin{split} P &: \; x \sim \mathrm{Dir}(\alpha_1) \\ Q &: \; x \sim \mathrm{Dir}(\alpha_2) \; . \end{split} \quad (1)$$

Then, the Kullback-Leibler divergence of P from Q is given by

$$\mathrm{KL}[P \,||\, Q] = \ln \frac{\Gamma\!\left( \sum_{i=1}^k \alpha_{1i} \right)}{\Gamma\!\left( \sum_{i=1}^k \alpha_{2i} \right)} + \sum_{i=1}^k \ln \frac{\Gamma(\alpha_{2i})}{\Gamma(\alpha_{1i})} + \sum_{i=1}^k (\alpha_{1i} - \alpha_{2i}) \left[ \psi(\alpha_{1i}) - \psi\!\left( \sum_{i=1}^k \alpha_{1i} \right) \right] \; . \quad (2)$$

Proof: The KL divergence for a continuous random variable (→ Definition I/2.5.1) is given by

$$\mathrm{KL}[P \,||\, Q] = \int_{\mathcal{X}} p(x) \ln \frac{p(x)}{q(x)} \, \mathrm{d}x \quad (3)$$

which, applied to the Dirichlet distributions (→ Definition II/4.4.1) in (1), yields

$$\mathrm{KL}[P \,||\, Q] = \int_{\mathcal{X}^k} \mathrm{Dir}(x; \alpha_1) \ln \frac{\mathrm{Dir}(x; \alpha_1)}{\mathrm{Dir}(x; \alpha_2)} \, \mathrm{d}x = \left\langle \ln \frac{\mathrm{Dir}(x; \alpha_1)}{\mathrm{Dir}(x; \alpha_2)} \right\rangle_{p(x)} \quad (4)$$

where $\mathcal{X}^k$ is the set $\left\{ x \in \mathbb{R}^k \,\middle|\, \sum_{i=1}^k x_i = 1, \; 0 \leq x_i \leq 1, \; i = 1, \ldots, k \right\}$.

Using the probability density function of the Dirichlet distribution (→ Proof II/4.4.2), this becomes:

$$\begin{split}
\mathrm{KL}[P \,||\, Q] &= \left\langle \ln \frac{ \frac{\Gamma\left( \sum_{i=1}^k \alpha_{1i} \right)}{\prod_{i=1}^k \Gamma(\alpha_{1i})} \prod_{i=1}^k x_i^{\alpha_{1i} - 1} }{ \frac{\Gamma\left( \sum_{i=1}^k \alpha_{2i} \right)}{\prod_{i=1}^k \Gamma(\alpha_{2i})} \prod_{i=1}^k x_i^{\alpha_{2i} - 1} } \right\rangle_{p(x)} \\
&= \left\langle \ln \left[ \frac{\Gamma\left( \sum_{i=1}^k \alpha_{1i} \right)}{\Gamma\left( \sum_{i=1}^k \alpha_{2i} \right)} \cdot \prod_{i=1}^k \frac{\Gamma(\alpha_{2i})}{\Gamma(\alpha_{1i})} \cdot \prod_{i=1}^k x_i^{\alpha_{1i} - \alpha_{2i}} \right] \right\rangle_{p(x)} \\
&= \ln \frac{\Gamma\left( \sum_{i=1}^k \alpha_{1i} \right)}{\Gamma\left( \sum_{i=1}^k \alpha_{2i} \right)} + \sum_{i=1}^k \ln \frac{\Gamma(\alpha_{2i})}{\Gamma(\alpha_{1i})} + \sum_{i=1}^k (\alpha_{1i} - \alpha_{2i}) \cdot \langle \ln x_i \rangle_{p(x)} \; . \quad (5)
\end{split}$$

Finally, using the expected value of a logarithmized Dirichlet variate, $\langle \ln x_i \rangle_{p(x)} = \psi(\alpha_{1i}) - \psi\left( \sum_{j=1}^k \alpha_{1j} \right)$, one arrives at the KL divergence given by (2).
Sources:
• Penny, William D. (2001): “KL-Divergences of Normal, Gamma, Dirichlet and Wishart densi-
ties”; in: University College, London, p. 2, eqs. 8-9; URL: https://www.fil.ion.ucl.ac.uk/~wpenny/
publications/densities.ps.
Metadata: ID: P294 | shortcut: dir-kl | author: JoramSoch | date: 2021-12-02, 14:28.
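Equation (2) can be cross-checked by a Monte Carlo estimate of E_P[ln p − ln q], using SciPy's Dirichlet log-density as an independent reference; the concentration parameters are arbitrary examples.

```python
import numpy as np
from scipy.special import gammaln, digamma
from scipy.stats import dirichlet

rng = np.random.default_rng(11)
alpha1 = np.array([2.0, 3.0, 1.5])
alpha2 = np.array([1.0, 4.0, 2.0])

def kl_dirichlet(a1, a2):
    """Closed-form KL divergence, equation (2)."""
    return (gammaln(a1.sum()) - gammaln(a2.sum())
            + np.sum(gammaln(a2) - gammaln(a1))
            + np.sum((a1 - a2) * (digamma(a1) - digamma(a1.sum()))))

# Monte Carlo estimate from samples of P
x = rng.dirichlet(alpha1, size=200_000)
kl_mc = np.mean(dirichlet.logpdf(x.T, alpha1) - dirichlet.logpdf(x.T, alpha2))
assert abs(kl_dirichlet(alpha1, alpha2) - kl_mc) < 0.02
```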
$$r \sim \mathrm{Dir}(\alpha) \; . \quad (1)$$

Then, the exceedance probability (→ Definition I/1.3.10) for a particular $r_i$ is

$$\varphi_i = \int_0^\infty \left[ \prod_{j \neq i} \frac{\gamma(\alpha_j, q)}{\Gamma(\alpha_j)} \right] \cdot \frac{q^{\alpha_i - 1} \exp[-q]}{\Gamma(\alpha_i)} \, \mathrm{d}q \quad (2)$$

where Γ(x) is the gamma function and γ(s, x) is the lower incomplete gamma function.

Proof: In the context of the Dirichlet distribution (→ Definition II/4.4.1), the exceedance probability (→ Definition I/1.3.10) for a particular ri is defined as:

$$\varphi_i = p\left( \forall j \in \{1, \ldots, k\}, \, j \neq i : \; r_i > r_j \,\middle|\, \alpha \right) = p\left( \bigwedge_{j \neq i} r_i > r_j \,\middle|\, \alpha \right) \; . \quad (4)$$

The probability density function of the Dirichlet distribution (→ Proof II/4.4.2) is given by

$$\mathrm{Dir}(r; \alpha) = \frac{\Gamma\!\left( \sum_{i=1}^k \alpha_i \right)}{\prod_{i=1}^k \Gamma(\alpha_i)} \prod_{i=1}^k r_i^{\alpha_i - 1} \; , \quad (5)$$

where the random variables satisfy

$$r_i \in [0, 1] \quad \text{for} \quad i = 1, \ldots, k \quad \text{and} \quad \sum_{i=1}^k r_i = 1 \; . \quad (6)$$
1) If k = 2, the probability density function of the Dirichlet distribution (→ Proof II/4.4.2) reduces to

$$p(r) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1) \, \Gamma(\alpha_2)} \, r_1^{\alpha_1 - 1} r_2^{\alpha_2 - 1} \quad (7)$$

which is equivalent to the probability density function of the beta distribution (→ Proof II/3.9.3)

$$p(r_1) = \frac{r_1^{\alpha_1 - 1} (1 - r_1)^{\alpha_2 - 1}}{B(\alpha_1, \alpha_2)} \quad (8)$$

with the beta function given by

$$B(x, y) = \frac{\Gamma(x) \, \Gamma(y)}{\Gamma(x + y)} \; . \quad (9)$$

With (6), the exceedance probability for this bivariate case simplifies to

$$\varphi_1 = p(r_1 > r_2) = p(r_1 > 1 - r_1) = p(r_1 > 1/2) = \int_{1/2}^1 p(r_1) \, \mathrm{d}r_1 \; . \quad (10)$$

Using the cumulative distribution function of the beta distribution (→ Proof II/3.9.5), it evaluates to

$$\varphi_1 = 1 - \int_0^{1/2} p(r_1) \, \mathrm{d}r_1 = 1 - \frac{B(1/2; \alpha_1, \alpha_2)}{B(\alpha_1, \alpha_2)} \; . \quad (11)$$
2) If k > 2, one can use the construction of a Dirichlet random vector from independent gamma random variables:

$$\begin{split}
& Y_1 \sim \mathrm{Gam}(\alpha_1, \beta), \; \ldots, \; Y_k \sim \mathrm{Gam}(\alpha_k, \beta), \quad Y_s = \sum_{j=1}^k Y_j \\
\Rightarrow \quad & X = (X_1, \ldots, X_k) = \left( \frac{Y_1}{Y_s}, \ldots, \frac{Y_k}{Y_s} \right) \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_k) \; . \quad (14)
\end{split}$$

The probability density function of the gamma distribution (→ Proof II/3.4.6) is given by

$$\mathrm{Gam}(x; a, b) = \frac{b^a}{\Gamma(a)} x^{a-1} \exp[-bx] \quad \text{for} \quad x > 0 \; . \quad (15)$$

Consider the gamma random variables (→ Definition II/3.4.1)

$$q_1 \sim \mathrm{Gam}(\alpha_1, 1), \; \ldots, \; q_k \sim \mathrm{Gam}(\alpha_k, 1), \quad q_s = \sum_{j=1}^k q_j \; . \quad (16)$$

Because, conditional on $q_i$, the events $q_i > q_j$ are independent across $j$, the conditional probability that $q_i$ exceeds all other $q_j$ is a product of gamma cumulative distribution functions:

$$p\left( \forall_{j \neq i} \left[ q_i > q_j \right] \,\middle|\, q_i \right) = \prod_{j \neq i} p(q_i > q_j | q_i) = \prod_{j \neq i} \frac{\gamma(\alpha_j, q_i)}{\Gamma(\alpha_j)} \; . \quad (21)$$

In order to obtain the exceedance probability φi, the dependency on qi in this probability still has to be removed. From equations (4) and (18), it follows that

$$\varphi_i = \int_0^\infty p\left( \forall_{j \neq i} \left[ q_i > q_j \right] \,\middle|\, q_i \right) \cdot p(q_i) \, \mathrm{d}q_i = \int_0^\infty \left[ \prod_{j \neq i} \frac{\gamma(\alpha_j, q_i)}{\Gamma(\alpha_j)} \right] \cdot \frac{q_i^{\alpha_i - 1} \exp[-q_i]}{\Gamma(\alpha_i)} \, \mathrm{d}q_i \; .$$
In other words, the exceedance probability (→ Definition I/1.3.10) for one element from a Dirichlet-
distributed (→ Definition II/4.4.1) random vector (→ Definition I/1.2.3) is an integral from zero
to infinity where the first term in the integrand conforms to a product of gamma (→ Definition
II/3.4.1) cumulative distribution functions (→ Definition I/1.6.13) and the second term is a gamma
(→ Definition II/3.4.1) probability density function (→ Definition I/1.6.6).
Sources:
• Soch J, Allefeld C (2016): “Exceedance Probabilities for the Dirichlet Distribution”; in: arXiv
stat.AP, 1611.01439; URL: https://arxiv.org/abs/1611.01439.
Metadata: ID: P181 | shortcut: dir-ep | author: JoramSoch | date: 2020-10-22, 08:04.
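The bivariate result (11) can be checked by simulation: the fraction of Dirichlet draws with r1 > r2 should match one minus the regularized incomplete beta function at 1/2. The concentration parameters below are arbitrary examples.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(5)
a1, a2 = 3.0, 2.0

# Bivariate case, equation (11): phi_1 = 1 - I(1/2; alpha_1, alpha_2)
phi_formula = 1 - beta.cdf(0.5, a1, a2)

# Monte Carlo: fraction of Dirichlet draws with r_1 > r_2
r = rng.dirichlet([a1, a2], size=400_000)
phi_mc = np.mean(r[:, 0] > r[:, 1])
assert abs(phi_formula - phi_mc) < 0.005
```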
X ∼ MN (M, U, V ) , (1)
if and only if its probability density function (→ Definition I/1.6.6) is given by
$$\mathcal{MN}(X; M, U, V) = \frac{1}{\sqrt{(2\pi)^{np} |V|^n |U|^p}} \cdot \exp\left[ -\frac{1}{2} \mathrm{tr}\left( V^{-1} (X-M)^\mathrm{T} U^{-1} (X-M) \right) \right] \quad (2)$$
where M is an n × p real matrix, U is an n × n positive definite matrix and V is a p × p positive
definite matrix.
Sources:
• Wikipedia (2020): “Matrix normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2020-01-27; URL: https://en.wikipedia.org/wiki/Matrix_normal_distribution#Definition.
$$X \sim \mathcal{MN}(M, U, V) \; , \quad (1)$$

if and only if vec(X) is multivariate normally distributed (→ Definition II/4.1.1):

$$\mathrm{vec}(X) \sim \mathcal{N}(\mathrm{vec}(M), V \otimes U) \; . \quad (2)$$

Proof: The probability density function of the matrix-normal distribution (→ Proof II/5.1.3) with n × p mean M, n × n covariance across rows U and p × p covariance across columns V is

$$\mathcal{MN}(X; M, U, V) = \frac{1}{\sqrt{(2\pi)^{np} |V|^n |U|^p}} \cdot \exp\left[ -\frac{1}{2} \mathrm{tr}\left( V^{-1} (X-M)^\mathrm{T} U^{-1} (X-M) \right) \right] \; . \quad (3)$$

Using the trace property tr(ABC) = tr(BCA), we have:

$$\mathcal{MN}(X; M, U, V) = \frac{1}{\sqrt{(2\pi)^{np} |V|^n |U|^p}} \cdot \exp\left[ -\frac{1}{2} \mathrm{tr}\left( (X-M)^\mathrm{T} U^{-1} (X-M) V^{-1} \right) \right] \; . \quad (4)$$

Using the trace-vectorization relation tr(A^T B) = vec(A)^T vec(B), we have:

$$\mathcal{MN}(X; M, U, V) = \frac{1}{\sqrt{(2\pi)^{np} |V|^n |U|^p}} \cdot \exp\left[ -\frac{1}{2} \mathrm{vec}(X-M)^\mathrm{T} \, \mathrm{vec}\left( U^{-1} (X-M) V^{-1} \right) \right] \; . \quad (5)$$

Using the vectorization-Kronecker relation vec(ABC) = (C^T ⊗ A) vec(B) and the symmetry of V^{−1}, we have:

$$\mathcal{MN}(X; M, U, V) = \frac{1}{\sqrt{(2\pi)^{np} |V|^n |U|^p}} \cdot \exp\left[ -\frac{1}{2} \mathrm{vec}(X-M)^\mathrm{T} \left( V^{-1} \otimes U^{-1} \right) \mathrm{vec}(X-M) \right] \; . \quad (6)$$

Using the Kronecker product inverse relation (A ⊗ B)^{−1} = A^{−1} ⊗ B^{−1}, we have:

$$\mathcal{MN}(X; M, U, V) = \frac{1}{\sqrt{(2\pi)^{np} |V|^n |U|^p}} \cdot \exp\left[ -\frac{1}{2} \mathrm{vec}(X-M)^\mathrm{T} \left( V \otimes U \right)^{-1} \mathrm{vec}(X-M) \right] \; . \quad (7)$$

Because vectorization is a linear operation, vec(X − M) = vec(X) − vec(M), we have:

$$\mathcal{MN}(X; M, U, V) = \frac{1}{\sqrt{(2\pi)^{np} |V|^n |U|^p}} \cdot \exp\left[ -\frac{1}{2} \left[ \mathrm{vec}(X) - \mathrm{vec}(M) \right]^\mathrm{T} \left( V \otimes U \right)^{-1} \left[ \mathrm{vec}(X) - \mathrm{vec}(M) \right] \right] \; . \quad (8)$$

Using the Kronecker-determinant relation |V ⊗ U| = |V|^n |U|^p, we have:

$$\mathcal{MN}(X; M, U, V) = \frac{1}{\sqrt{(2\pi)^{np} |V \otimes U|}} \cdot \exp\left[ -\frac{1}{2} \left[ \mathrm{vec}(X) - \mathrm{vec}(M) \right]^\mathrm{T} \left( V \otimes U \right)^{-1} \left[ \mathrm{vec}(X) - \mathrm{vec}(M) \right] \right] \; . \quad (9)$$

This is the probability density function of the multivariate normal distribution (→ Proof II/4.1.3) with the np × 1 mean vector vec(M) and the np × np covariance matrix V ⊗ U:

$$\mathrm{vec}(X) \sim \mathcal{N}(\mathrm{vec}(M), V \otimes U) \; . \quad (10)$$
Sources:
• Wikipedia (2020): “Matrix normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2020-01-20; URL: https://en.wikipedia.org/wiki/Matrix_normal_distribution#Proof.
Metadata: ID: P26 | shortcut: matn-mvn | author: JoramSoch | date: 2020-01-20, 21:09.
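The equivalence can be verified numerically by comparing the matrix-normal density with the multivariate normal density of the column-stacked vectorization; the matrices below are arbitrary random examples.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
n, p = 3, 2
M = rng.standard_normal((n, p))
BU, BV = rng.standard_normal((n, n)), rng.standard_normal((p, p))
U, V = BU @ BU.T + n * np.eye(n), BV @ BV.T + p * np.eye(p)
X = rng.standard_normal((n, p))

def matn_pdf(X, M, U, V):
    """Matrix-normal density, equation (3)."""
    D = X - M
    norm_const = np.sqrt((2 * np.pi)**(n * p)
                         * np.linalg.det(V)**n * np.linalg.det(U)**p)
    expo = -0.5 * np.trace(np.linalg.inv(V) @ D.T @ np.linalg.inv(U) @ D)
    return np.exp(expo) / norm_const

# vec(X) stacks columns; the equivalent multivariate normal has cov V kron U
vec = lambda A: A.flatten(order='F')
mvn_pdf = multivariate_normal.pdf(vec(X), mean=vec(M), cov=np.kron(V, U))
assert np.isclose(matn_pdf(X, M, U, V), mvn_pdf)
```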
X ∼ MN (M, U, V ) . (1)
Then, the probability density function (→ Definition I/1.6.6) of X is

$$f(X) = \frac{1}{\sqrt{(2\pi)^{np} |V|^n |U|^p}} \cdot \exp\left[ -\frac{1}{2} \mathrm{tr}\left( V^{-1} (X-M)^\mathrm{T} U^{-1} (X-M) \right) \right] \; . \quad (2)$$
Proof: This follows directly from the definition of the matrix-normal distribution (→ Definition
II/5.1.1).
Sources:
• original work
Metadata: ID: P70 | shortcut: matn-pdf | author: JoramSoch | date: 2020-03-02, 21:03.
5.1.4 Mean
Theorem: Let X be a random matrix (→ Definition I/1.2.4) following a matrix-normal distribution
(→ Definition II/5.1.1):
X ∼ MN (M, U, V ) . (1)
Then, the mean or expected value (→ Definition I/1.7.1) of X is
E(X) = M . (2)
Proof: When X follows a matrix-normal distribution (→ Definition II/5.1.1), its vectorized version follows a multivariate normal distribution (→ Proof II/5.1.2),

$$\mathrm{vec}(X) \sim \mathcal{N}(\mathrm{vec}(M), V \otimes U) \; , \quad (3)$$

whose expected value is its mean vector:

$$\mathrm{E}[\mathrm{vec}(X)] = \mathrm{vec}(M) \; . \quad (4)$$

Because the expected value of a random matrix is defined entry-wise, un-vectorizing both sides yields

$$\mathrm{E}[X] = M \; . \quad (5)$$
Sources:
• Wikipedia (2022): “Matrix normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-09-15; URL: https://en.wikipedia.org/wiki/Matrix_normal_distribution#Expected_values.
Metadata: ID: P341 | shortcut: matn-mean | author: JoramSoch | date: 2022-09-15, 12:05.
5.1.5 Covariance
Theorem: Let X be an n × p random matrix (→ Definition I/1.2.4) following a matrix-normal
distribution (→ Definition II/5.1.1):
X ∼ MN (M, U, V ) . (1)
Then,
1) the covariance matrix (→ Definition I/1.9.9) of each row of X is a scalar multiple of V,

$$\mathrm{Cov}(x_{i,\bullet}^\mathrm{T}) \propto V \quad \text{for all} \quad i = 1, \ldots, n \; ; \quad (2)$$

2) the covariance matrix (→ Definition I/1.9.9) of each column of X is a scalar multiple of U,

$$\mathrm{Cov}(x_{\bullet,j}) \propto U \quad \text{for all} \quad j = 1, \ldots, p \; . \quad (3)$$

Proof:
1) The marginal distribution (→ Definition I/1.5.3) of a given row of X is a multivariate normal distribution (→ Proof II/5.1.10),

$$x_{i,\bullet}^\mathrm{T} \sim \mathcal{N}\!\left( (m_{i,\bullet})^\mathrm{T}, \, u_{ii} V \right) \; , \quad (4)$$

such that

$$\mathrm{Cov}(x_{i,\bullet}^\mathrm{T}) = u_{ii} V \propto V \; . \quad (5)$$

2) Likewise, the marginal distribution (→ Definition I/1.5.3) of a given column of X is a multivariate normal distribution (→ Proof II/5.1.10),

$$x_{\bullet,j} \sim \mathcal{N}\!\left( m_{\bullet,j}, \, v_{jj} U \right) \; , \quad (6)$$

such that

$$\mathrm{Cov}(x_{\bullet,j}) = v_{jj} U \propto U \; . \quad (7)$$
Sources:
• Wikipedia (2022): “Matrix normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-09-15; URL: https://en.wikipedia.org/wiki/Matrix_normal_distribution#Expected_values.
Metadata: ID: P342 | shortcut: matn-cov | author: JoramSoch | date: 2022-09-15, 12:23.
X ∼ MN (M, U, V ) . (1)
Then, the differential entropy (→ Definition I/2.2.1) of X in nats is
h(X) = (np/2) ln(2π) + (n/2) ln|V| + (p/2) ln|U| + np/2 .  (2)
Proof: When X follows a matrix-normal distribution (→ Definition II/5.1.1), its vectorized version follows a multivariate normal distribution (→ Proof II/5.1.2)
vec(X) ∼ N(vec(M), V ⊗ U)  (3)
and the differential entropy of the multivariate normal distribution is
X ∼ N(µ, Σ)  ⇒  h(X) = (n/2) ln(2π) + (1/2) ln|Σ| + (1/2) n  (4)
where X is an n × 1 random vector (→ Definition I/1.2.3).
Thus, we can plug the distribution parameters from (1) into the differential entropy in (4) using the relationship given by (3):
h(X) = (np/2) ln(2π) + (1/2) ln|V ⊗ U| + (1/2) np .  (5)
Using the Kronecker product property |V ⊗ U| = |V|^n |U|^p, we finally obtain
h(X) = (np/2) ln(2π) + (1/2) ln(|V|^n |U|^p) + (1/2) np
     = (np/2) ln(2π) + (n/2) ln|V| + (p/2) ln|U| + np/2 .  (7)
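As a numerical sanity check, the entropy formula can be evaluated for a small case. The sketch below (plain Python, no external libraries; the 2 × 2 example matrices are arbitrary choices, not taken from the text) compares eq. (2) against the entropy of the vectorized form, using ln|V ⊗ U| = n ln|V| + p ln|U|:

```python
import math

def logdet2(A):
    # log-determinant of a 2x2 matrix
    return math.log(A[0][0] * A[1][1] - A[0][1] * A[1][0])

def matn_entropy(n, p, logdet_U, logdet_V):
    # h(X) = np/2 ln(2*pi) + n/2 ln|V| + p/2 ln|U| + np/2  (eq. 2)
    return (0.5 * n * p * math.log(2 * math.pi) + 0.5 * n * logdet_V
            + 0.5 * p * logdet_U + 0.5 * n * p)

U = [[2.0, 0.5], [0.5, 1.0]]   # row covariance (n = 2)
V = [[1.0, 0.3], [0.3, 2.0]]   # column covariance (p = 2)
n, p = 2, 2

h = matn_entropy(n, p, logdet2(U), logdet2(V))

# equivalent entropy of vec(X) ~ N(vec(M), V (x) U),
# with ln|V (x) U| = n ln|V| + p ln|U|
h_vec = (0.5 * n * p * math.log(2 * math.pi)
         + 0.5 * (n * logdet2(V) + p * logdet2(U)) + 0.5 * n * p)
assert abs(h - h_vec) < 1e-12
```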
Sources:
• original work
Metadata: ID: P344 | shortcut: matn-dent | author: JoramSoch | date: 2022-09-22, 08:39.
P : X ∼ MN(M1, U1, V1)
Q : X ∼ MN(M2, U2, V2) .  (1)
Then, the Kullback-Leibler divergence (→ Definition I/2.5.1) of P from Q is given by
KL[P || Q] = 1/2 [ vec(M2 − M1)^T vec( U2^{-1} (M2 − M1) V2^{-1} ) + tr( (V2^{-1} V1) ⊗ (U2^{-1} U1) ) − n ln(|V1|/|V2|) − p ln(|U1|/|U2|) − np ] .  (2)
Proof: The KL divergence between two multivariate normal distributions is
KL[P || Q] = 1/2 [ (µ2 − µ1)^T Σ2^{-1} (µ2 − µ1) + tr(Σ2^{-1} Σ1) − ln(|Σ1|/|Σ2|) − n ]  (4)
where X is an n × 1 random vector (→ Definition I/1.2.3).
Thus, we can plug the distribution parameters from (1) into the KL divergence in (4) using the relationship given by (3):
KL[P || Q] = 1/2 [ (vec(M2) − vec(M1))^T (V2 ⊗ U2)^{-1} (vec(M2) − vec(M1)) + tr( (V2 ⊗ U2)^{-1} (V1 ⊗ U1) ) − ln(|V1 ⊗ U1|/|V2 ⊗ U2|) − np ] .  (5)
Using properties of the Kronecker product, this can be developed into
KL[P || Q] = 1/2 [ vec(M2 − M1)^T (V2^{-1} ⊗ U2^{-1}) vec(M2 − M1) + tr( (V2^{-1} V1) ⊗ (U2^{-1} U1) ) − n ln(|V1|/|V2|) − p ln(|U1|/|U2|) − np ]  (10)
and finally, with vec(ABC) = (C^T ⊗ A) vec(B), we arrive at
KL[P || Q] = 1/2 [ vec(M2 − M1)^T vec( U2^{-1} (M2 − M1) V2^{-1} ) + tr( (V2^{-1} V1) ⊗ (U2^{-1} U1) ) − n ln(|V1|/|V2|) − p ln(|U1|/|U2|) − np ] .  (12)
Sources:
• original work
Metadata: ID: P296 | shortcut: matn-kl | author: JoramSoch | date: 2021-12-02, 20:22.
308 CHAPTER II. PROBABILITY DISTRIBUTIONS
5.1.8 Transposition
Theorem: Let X be a random matrix (→ Definition I/1.2.4) following a matrix-normal distribution
(→ Definition II/5.1.1):
X ∼ MN (M, U, V ) . (1)
Then, the transpose of X also has a matrix-normal distribution:
X T ∼ MN (M T , V, U ) . (2)
Proof: The probability density function of the matrix-normal distribution (→ Proof II/5.1.3) is:
f(X) = MN(X; M, U, V) = √( 1 / ((2π)^{np} |V|^n |U|^p) ) · exp[ −(1/2) tr( V^{-1} (X − M)^T U^{-1} (X − M) ) ] .  (3)
Sources:
• original work
Metadata: ID: P144 | shortcut: matn-trans | author: JoramSoch | date: 2020-08-03, 22:21.
X ∼ MN (M, U, V ) . (1)
Then, a linear transformation of X is also matrix-normally distributed
vec(Y ) = vec(AXB + C)
= vec(AXB) + vec(C) (5)
= (B ⊗ A)vec(X) + vec(C) .
Y ∼ MN(AMB + C, A U A^T, B^T V B) .  (7)
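The step from (5) to (7) rests on the vectorization identity vec(AXB) = (B^T ⊗ A) vec(X). A minimal numerical check of that identity, with hand-rolled helpers and arbitrary example matrices (the matrices are assumptions for illustration, not taken from the text):

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def kron(A, B):
    # Kronecker product of A (m x n) and B (p x q)
    m, n, p, q = len(A), len(A[0]), len(B), len(B[0])
    return [[A[i // p][j // q] * B[i % p][j % q]
             for j in range(n * q)] for i in range(m * p)]

def vec(X):
    # column-stacking vectorization
    return [X[i][j] for j in range(len(X[0])) for i in range(len(X))]

A = [[1.0, 2.0], [0.0, 1.0]]            # 2 x 2
X = [[0.5, -1.0], [2.0, 3.0]]           # 2 x 2
B = [[1.0, 0.0, 1.0], [0.0, 2.0, 1.0]]  # 2 x 3

lhs = vec(matmul(matmul(A, X), B))
rhs = matmul(kron(transpose(B), A), [[v] for v in vec(X)])
assert all(abs(l - r[0]) < 1e-12 for l, r in zip(lhs, rhs))
```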
Sources:
• original work
Metadata: ID: P145 | shortcut: matn-ltt | author: JoramSoch | date: 2020-08-03, 22:24.
X ∼ MN (M, U, V ) . (1)
Then,
1) the marginal distribution (→ Definition I/1.5.3) of any subset matrix XI,J , obtained by dropping
some rows and/or columns from X, is also a matrix-normal distribution (→ Definition II/5.1.1)
2) the marginal distribution (→ Definition I/1.5.3) of each row vector is a multivariate normal
distribution (→ Definition II/4.1.1)
Proof:
1) Define a selector matrix A, such that a_ij = 1, if the i-th row in the subset matrix should be the j-th row from the original matrix (and a_ij = 0 otherwise)
A ∈ R^{|I|×n} , s.t.  a_ij = { 1 , if I_i = j ; 0 , otherwise }  (6)
and define a selector matrix B, such that b_ij = 1, if the j-th column in the subset matrix should be the i-th column from the original matrix (and b_ij = 0 otherwise)
B ∈ R^{p×|J|} , s.t.  b_ij = { 1 , if J_j = i ; 0 , otherwise } .  (7)
A = ei , B = Ip , (10)
the i-th row of X can be expressed as
A = I_n , B = e_j^T ,  (14)
the j-th column of X can be expressed as
A = e_i , B = e_j^T ,  (17)
the (i, j)-th entry of X can be expressed as
Sources:
• original work
Metadata: ID: P343 | shortcut: matn-marg | author: JoramSoch | date: 2022-09-15, 11:41.
Proof: If all entries of X are independent and standard normally distributed (→ Definition II/3.2.2)
x_ij ∼ N(0, 1) i.i.d.  for all  i = 1, …, n  and  j = 1, …, p ,  (2)
this implies a multivariate normal distribution with diagonal covariance matrix (→ Proof II/4.1.11):
X ∼ MN (0np , In , Ip ) . (4)
Thus, with the linear transformation theorem for the matrix-normal distribution (→ Proof II/5.1.9),
it follows that
Y = M + AXB ∼ MN( M + A 0_{np} B, A I_n A^T, B^T I_p B )
            ∼ MN( M, A A^T, B^T B )  (5)
            ∼ MN( M, U, V ) .
Thus, given X defined by (2), Y defined by (1) is a sample (→ Definition “samp”) from MN (M, U, V ).
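The sampling recipe can be sketched numerically. Cholesky factors are one possible choice of A and B (any square roots with A A^T = U and B^T B = V work); the matrices below are arbitrary examples, not from the text:

```python
import math, random

def chol2(S):
    # Cholesky factor L (lower triangular) of a 2x2 SPD matrix, S = L L^T
    l11 = math.sqrt(S[0][0])
    l21 = S[1][0] / l11
    l22 = math.sqrt(S[1][1] - l21 * l21)
    return [[l11, 0.0], [l21, l22]]

U = [[2.0, 0.5], [0.5, 1.0]]
V = [[1.0, 0.3], [0.3, 2.0]]
M = [[0.0, 0.0], [0.0, 0.0]]

A = chol2(U)                           # A A^T = U
B = [list(r) for r in zip(*chol2(V))]  # B = L_V^T, so B^T B = V

# draw X with i.i.d. standard normal entries and form Y = M + A X B
random.seed(1)
X = [[random.gauss(0.0, 1.0) for _ in range(2)] for _ in range(2)]
Y = [[M[i][j] + sum(A[i][k] * sum(X[k][l] * B[l][j] for l in range(2))
      for k in range(2)) for j in range(2)] for i in range(2)]

# check the factorizations used in eq. (5)
AAt = [[sum(A[i][k] * A[j][k] for k in range(2)) for j in range(2)] for i in range(2)]
BtB = [[sum(B[k][i] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
for i in range(2):
    for j in range(2):
        assert abs(AAt[i][j] - U[i][j]) < 1e-12
        assert abs(BtB[i][j] - V[i][j]) < 1e-12
```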
Sources:
• Wikipedia (2021): “Matrix normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2021-12-07; URL: https://en.wikipedia.org/wiki/Matrix_normal_distribution#Drawing_values_
from_the_distribution.
Metadata: ID: P297 | shortcut: matn-samp | author: JoramSoch | date: 2021-12-07, 08:43.
X ∼ MN (0, In , V ) . (1)
Define the scatter matrix S as the product of the transpose of X with itself:
S = X^T X = Σ_{i=1}^n x_i^T x_i .  (2)
Then, the matrix S is said to follow a Wishart distribution with scale matrix V and degrees of
freedom n
S ∼ W(V, n) (3)
where n > p − 1 and V is a positive definite symmetric covariance matrix.
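A small numerical illustration of the definition: forming the scatter matrix from a fixed data matrix, checking the two equivalent expressions in eq. (2), and confirming symmetry and positive definiteness (the example data are chosen arbitrarily):

```python
# small fixed 4 x 2 data matrix X (n = 4 rows, p = 2 columns)
X = [[1.0, 0.5], [-0.3, 2.0], [0.7, -1.2], [0.1, 0.4]]
n, p = 4, 2

# scatter matrix S = X^T X  (matrix-product form of eq. 2)
S = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)] for i in range(p)]

# sum-of-outer-products form of eq. 2: S = sum_i x_i^T x_i
S_sum = [[0.0] * p for _ in range(p)]
for k in range(n):
    for i in range(p):
        for j in range(p):
            S_sum[i][j] += X[k][i] * X[k][j]

for i in range(p):
    for j in range(p):
        assert abs(S[i][j] - S_sum[i][j]) < 1e-12
        assert abs(S[i][j] - S[j][i]) < 1e-12       # S is symmetric
# positive definite for this data set (here n > p - 1 holds)
assert S[0][0] > 0 and S[0][0] * S[1][1] - S[0][1] * S[1][0] > 0
```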
Sources:
• Wikipedia (2020): “Wishart distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-22; URL: https://en.wikipedia.org/wiki/Wishart_distribution#Definition.
Metadata: ID: D43 | shortcut: wish | author: JoramSoch | date: 2020-03-22, 17:15.
P : S ∼ W(V1 , n1 )
(1)
Q : S ∼ W(V2 , n2 ) .
" #
1 Γp n2 n
n2 (ln |V2 | − ln |V1 |) + n1 tr(V2−1 V1 ) + 2 ln 2 1
KL[P || Q] = n1
+ (n1 − n2 )ψp − n1 p (2)
2 Γp 2
2
Yk
j−1
Γp (x) = π p(p−1)/4
Γ x− (3)
j=1
2
Proof: The KL divergence for a continuous random variable (→ Definition I/2.5.1) is given by
KL[P || Q] = ∫_X p(x) ln( p(x)/q(x) ) dx  (5)
which, for the distributions from (1), becomes:
KL[P || Q] = ∫_{S^p} W(S; V1, n1) ln[ W(S; V1, n1) / W(S; V2, n2) ] dS = ⟨ ln[ W(S; V1, n1) / W(S; V2, n2) ] ⟩_{p(S)} .  (6)
Plugging in the probability density function of the Wishart distribution, this becomes:
KL[P || Q] = ⟨ ln[ ( √(1/(2^{n1 p} |V1|^{n1})) · (1/Γp(n1/2)) · |S|^{(n1−p−1)/2} · exp(−(1/2) tr(V1^{-1} S)) ) / ( √(1/(2^{n2 p} |V2|^{n2})) · (1/Γp(n2/2)) · |S|^{(n2−p−1)/2} · exp(−(1/2) tr(V2^{-1} S)) ) ] ⟩_{p(S)}
= ⟨ ln[ √( 2^{(n2−n1) p} · |V2|^{n2}/|V1|^{n1} ) · ( Γp(n2/2)/Γp(n1/2) ) · |S|^{(n1−n2)/2} · exp( −(1/2) tr(V1^{-1} S) + (1/2) tr(V2^{-1} S) ) ] ⟩_{p(S)}
= ((n2 − n1)p/2) ln 2 + (n2/2) ln|V2| − (n1/2) ln|V1| + ln( Γp(n2/2)/Γp(n1/2) ) + ((n1 − n2)/2) ⟨ln|S|⟩_{p(S)} − (1/2) ⟨tr(V1^{-1} S)⟩_{p(S)} + (1/2) ⟨tr(V2^{-1} S)⟩_{p(S)} .  (7)
Using the expected values ⟨S⟩_{p(S)} = n1 V1 and ⟨ln|S|⟩_{p(S)} = ψp(n1/2) + p ln 2 + ln|V1| of the Wishart distribution, the KL divergence becomes:
KL[P || Q] = ((n2 − n1)p/2) ln 2 + (n2/2) ln|V2| − (n1/2) ln|V1| + ln( Γp(n2/2)/Γp(n1/2) ) + ((n1 − n2)/2) [ ψp(n1/2) + p ln 2 + ln|V1| ] − (n1/2) tr(V1^{-1} V1) + (n1/2) tr(V2^{-1} V1)
= (n2/2) (ln|V2| − ln|V1|) + ln( Γp(n2/2)/Γp(n1/2) ) + ((n1 − n2)/2) ψp(n1/2) − (n1/2) tr(I_p) + (n1/2) tr(V2^{-1} V1)
= 1/2 [ n2 (ln|V2| − ln|V1|) + n1 tr(V2^{-1} V1) + 2 ln( Γp(n2/2)/Γp(n1/2) ) + (n1 − n2) ψp(n1/2) − n1 p ] .  (11)
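For p = 1, the Wishart distribution reduces to a gamma distribution, which allows a numerical cross-check of the result in (2)/(11). The sketch below (plain Python; the finite-difference digamma is a convenience approximation, not a library routine) compares the Wishart KL with the standard gamma KL under the mapping W(v, n) = Gam(shape n/2, scale 2v):

```python
import math

def digamma(x, h=1e-5):
    # numerical digamma via central difference of lgamma (adequate for a sanity check)
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def kl_wishart_p1(v1, n1, v2, n2):
    # eq. (2) specialized to p = 1, where Gamma_p and psi_p reduce to scalar versions
    return 0.5 * (n2 * (math.log(v2) - math.log(v1)) + n1 * v1 / v2
                  + 2 * (math.lgamma(n2 / 2) - math.lgamma(n1 / 2))
                  + (n1 - n2) * digamma(n1 / 2) - n1)

def kl_gamma(a1, t1, a2, t2):
    # KL between Gamma(shape a, scale t) distributions, for cross-checking
    return ((a1 - a2) * digamma(a1) - math.lgamma(a1) + math.lgamma(a2)
            + a2 * (math.log(t2) - math.log(t1)) + a1 * (t1 - t2) / t2)

v1, n1, v2, n2 = 1.5, 7, 0.8, 5
kl_w = kl_wishart_p1(v1, n1, v2, n2)
kl_g = kl_gamma(n1 / 2, 2 * v1, n2 / 2, 2 * v2)
assert abs(kl_w - kl_g) < 1e-6
assert abs(kl_wishart_p1(v1, n1, v1, n1)) < 1e-9   # KL of a distribution from itself is 0
```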
Sources:
• Penny, William D. (2001): “KL-Divergences of Normal, Gamma, Dirichlet and Wishart densities”;
in: University College, London, pp. 2-3, eqs. 13/15; URL: https://www.fil.ion.ucl.ac.uk/~wpenny/
publications/densities.ps.
• Wikipedia (2021): “Wishart distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
12-02; URL: https://en.wikipedia.org/wiki/Wishart_distribution#KL-divergence.
Metadata: ID: P295 | shortcut: wish-kl | author: JoramSoch | date: 2021-12-02, 15:33.
X, Y ∼ NW(M, U, V, ν) , (1)
if the distribution of X conditional on Y is a matrix-normal distribution (→ Definition II/5.1.1)
with mean M , covariance across rows U , covariance across columns Y −1 and Y follows a Wishart
distribution (→ Definition II/5.2.1) with scale matrix V and degrees of freedom ν:
X|Y ∼ MN (M, U, Y −1 )
(2)
Y ∼ W(V, ν) .
The p × p matrix Y can be seen as the precision matrix (→ Definition I/1.9.19) across the columns
of the n × p matrix X.
Sources:
• original work
X, Y ∼ NW(M, U, V, ν) . (1)
Then, the joint probability (→ Definition I/1.3.2) density function (→ Definition I/1.6.6) of X and
Y is
p(X, Y) = √( 1 / ((2π)^{np} |U|^p |V|^ν) ) · ( √(2^{−νp}) / Γp(ν/2) ) · |Y|^{(ν+n−p−1)/2} · exp[ −(1/2) tr( Y [ (X − M)^T U^{-1} (X − M) + V^{-1} ] ) ] .  (2)
2
X|Y ∼ MN (M, U, Y −1 )
(3)
Y ∼ W(V, ν) .
Thus, using the probability density function of the matrix-normal distribution (→ Proof II/5.1.3)
and the probability density function of the Wishart distribution (→ Proof “wish-pdf”), we have the
following probabilities:
p(X|Y) = MN(X; M, U, Y^{-1}) = √( |Y|^n / ((2π)^{np} |U|^p) ) · exp[ −(1/2) tr( Y (X − M)^T U^{-1} (X − M) ) ]
p(Y) = W(Y; V, ν) = (1/Γp(ν/2)) · √( 1 / (2^{νp} |V|^ν) ) · |Y|^{(ν−p−1)/2} · exp[ −(1/2) tr(V^{-1} Y) ] .  (4)
Sources:
• original work
Metadata: ID: P323 | shortcut: nw-pdf | author: JoramSoch | date: 2022-05-14, 23:58.
5.3.3 Mean
Theorem: Let X ∈ Rn×p and Y ∈ Rp×p follow a normal-Wishart distribution (→ Definition II/5.3.1):
X, Y ∼ NW(M, U, V, ν) . (1)
Then, the expected value (→ Definition I/1.7.1) of X and Y is
E\begin{bmatrix} X \\ Y \end{bmatrix} = \begin{bmatrix} E(x_{11}) & \ldots & E(x_{1p}) \\ \vdots & \ddots & \vdots \\ E(x_{n1}) & \ldots & E(x_{np}) \\ E(y_{11}) & \ldots & E(y_{1p}) \\ \vdots & \ddots & \vdots \\ E(y_{p1}) & \ldots & E(y_{pp}) \end{bmatrix} = \begin{bmatrix} E(X) \\ E(Y) \end{bmatrix} .  (4)
When X and Y are jointly normal-Wishart distributed, then (→ Definition II/5.3.1) by definition
X follows a matrix-normal distribution (→ Definition II/5.1.1) conditional on Y and Y follows a
Wishart distribution (→ Definition II/5.2.1):
E(X) = M ,
and with the expected value of the gamma distribution (→ Proof II/3.4.9), E(Y ) becomes
E(Y) = ∫ Y · p(Y) dY = ⟨Y⟩_{W(V,ν)} = νV .  (7)
Thus, the expectation of the random matrix in equations (3) and (4) is
E\begin{bmatrix} X \\ Y \end{bmatrix} = \begin{bmatrix} M \\ \nu V \end{bmatrix} ,  (8)
as indicated by equation (2).
Sources:
• original work
Metadata: ID: P327 | shortcut: nw-mean | author: JoramSoch | date: 2022-07-14, 07:17.
Chapter III
Statistical Models
321
322 CHAPTER III. STATISTICAL MODELS
yi ∼ N (µ, σ 2 ), i = 1, . . . , n . (1)
Sources:
• Bishop, Christopher M. (2006): “Example: The univariate Gaussian”; in: Pattern Recognition for
Machine Learning, ch. 10.1.3, p. 470, eq. 10.21; URL: http://users.isr.ist.utl.pt/~wurmd/Livros/
school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%
20%202006.pdf.
yi ∼ N (µ, σ 2 ), i = 1, . . . , n . (1)
Then, the maximum likelihood estimates (→ Definition I/4.1.3) for mean µ and variance σ 2 are given
by
µ̂ = (1/n) Σ_{i=1}^n y_i
σ̂² = (1/n) Σ_{i=1}^n (y_i − ȳ)² .  (2)
Proof: The likelihood function (→ Definition I/5.1.2) for each observation is given by the probability
density function of the normal distribution (→ Proof II/3.2.10)
" 2 #
1 1 y i − µ
p(yi |µ, σ 2 ) = N (x; µ, σ 2 ) = √ · exp − (3)
2πσ 2 2 σ
and because observations are independent (→ Definition I/1.3.6), the likelihood function for all
observations is the product of the individual ones:
s " n 2 #
Yn
1 1 X y i − µ
p(y|µ, σ 2 ) = p(yi |µ) = · exp − . (4)
i=1
(2πσ 2 )n 2 i=1 σ
This can be developed into
p(y|µ, σ²) = (1/(2πσ²))^{n/2} · exp[ −(1/2) Σ_{i=1}^n (y_i² − 2y_iµ + µ²)/σ² ]
           = (1/(2πσ²))^{n/2} · exp[ −(1/(2σ²)) ( y^T y − 2nȳµ + nµ² ) ]  (5)
where ȳ = (1/n) Σ_{i=1}^n y_i is the mean of data points and y^T y = Σ_{i=1}^n y_i² is the sum of squared data points.
Thus, the log-likelihood function (→ Definition I/4.1.2) is
LL(µ, σ²) = log p(y|µ, σ²) = −(n/2) log(2πσ²) − (1/(2σ²)) ( y^T y − 2nȳµ + nµ² ) .  (6)
The derivative of the log-likelihood function with respect to µ is
dLL(µ, σ²)/dµ = nȳ/σ² − nµ/σ² = (n/σ²) (ȳ − µ)  (7)
and setting this derivative to zero gives the MLE for µ:
dLL(µ̂, σ²)/dµ = 0
0 = (n/σ²) (ȳ − µ̂)
0 = ȳ − µ̂
µ̂ = ȳ
µ̂ = (1/n) Σ_{i=1}^n y_i .  (8)
The derivative of the log-likelihood function with respect to σ² is
dLL(µ̂, σ²)/dσ² = −n/(2σ²) + (1/(2(σ²)²)) ( y^T y − 2nȳµ̂ + nµ̂² )
              = −n/(2σ²) + (1/(2(σ²)²)) Σ_{i=1}^n ( y_i² − 2y_iµ̂ + µ̂² )
              = −n/(2σ²) + (1/(2(σ²)²)) Σ_{i=1}^n (y_i − µ̂)²  (9)
and setting this derivative to zero gives the MLE for σ²:
dLL(µ̂, σ̂²)/dσ² = 0
0 = −n/(2σ̂²) + (1/(2(σ̂²)²)) Σ_{i=1}^n (y_i − µ̂)²
n/(2σ̂²) = (1/(2(σ̂²)²)) Σ_{i=1}^n (y_i − µ̂)²
σ̂² = (1/n) Σ_{i=1}^n (y_i − µ̂)²
σ̂² = (1/n) Σ_{i=1}^n (y_i − ȳ)² .  (10)
Together, (8) and (10) constitute the MLE for the univariate Gaussian.
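The two estimators can be checked against Python's statistics module; note that pvariance is the 1/n (population) variance matching eq. (10), not the 1/(n−1) sample variance. The data below are an arbitrary example:

```python
import statistics

y = [2.1, 3.4, 1.8, 2.9, 3.3, 2.0]
n = len(y)

mu_hat = sum(y) / n                                   # eq. (8)
sigma2_hat = sum((yi - mu_hat) ** 2 for yi in y) / n  # eq. (10)

assert abs(mu_hat - statistics.fmean(y)) < 1e-12
# statistics.pvariance is the population variance, i.e. the 1/n (biased) estimator
assert abs(sigma2_hat - statistics.pvariance(y, mu_hat)) < 1e-12
```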
Sources:
• Bishop CM (2006): “Bayesian inference for the Gaussian”; in: Pattern Recognition for Machine
Learning, pp. 93-94, eqs. 2.121, 2.122; URL: http://users.isr.ist.utl.pt/~wurmd/Livros/school/
Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%
202006.pdf.
Metadata: ID: P223 | shortcut: ug-mle | author: JoramSoch | date: 2021-04-16, 11:03.
yi ∼ N (µ, σ 2 ), i = 1, . . . , n (1)
be a univariate Gaussian data set (→ Definition III/1.1.1) with unknown mean µ and unknown
variance σ 2 . Then, the test statistic (→ Definition I/4.3.5)
ȳ − µ0
t= √ (2)
s/ n
with sample mean (→ Definition I/1.7.2) ȳ and sample variance (→ Definition I/1.8.2) s2 follows a
Student’s t-distribution (→ Definition II/3.3.1) with n − 1 degrees of freedom (→ Definition “dof”)
t ∼ t(n − 1) (3)
under the null hypothesis (→ Definition I/4.3.2)
H0 : µ = µ0 . (4)
Proof: The sample mean (→ Definition I/1.7.2) is given by
ȳ = (1/n) Σ_{i=1}^n y_i  (5)
and the sample variance (→ Definition I/1.8.2) is given by
s² = (1/(n − 1)) Σ_{i=1}^n (y_i − ȳ)² .  (6)
Using the linear combination formula for normal random variables (→ Proof II/3.2.26), the sample
mean follows a normal distribution (→ Definition II/3.2.1) with the following parameters:
ȳ = (1/n) Σ_{i=1}^n y_i ∼ N( (1/n) nµ, (1/n)² nσ² ) = N(µ, σ²/n) .  (7)
Again employing the linear combination theorem and applying the null hypothesis from (4), the distribution of Z = √n (ȳ − µ0)/σ becomes standard normal (→ Definition II/3.2.2)
Z = √n (ȳ − µ0)/σ ∼ N( (√n/σ)(µ − µ0), (√n/σ)² σ²/n ) =_{H0} N(0, 1) .  (8)
Because sample variances calculated from independent normal random variables follow a chi-squared
distribution (→ Proof II/3.2.6), the distribution of V = (n − 1) s2 /σ 2 is
V = (n − 1) s²/σ² ∼ χ²(n − 1) .  (9)
σ2
Finally, since the ratio of a standard normal random variable and the square root of a chi-squared
random variable follows a t-distribution (→ Definition II/3.3.1), the distribution of the test statistic
(→ Definition I/4.3.5) is given by
t = (ȳ − µ0)/(s/√n) = Z/√( V/(n − 1) ) ∼ t(n − 1) .  (10)
This means that the null hypothesis (→ Definition I/4.3.2) can be rejected when t is as extreme or
more extreme than the critical value (→ Definition I/4.3.9) obtained from the Student’s t-distribution
(→ Definition II/3.3.1) with n − 1 degrees of freedom (→ Definition “dof”) using a significance level
(→ Definition I/4.3.8) α.
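A quick numerical illustration: computing t from eq. (2) and confirming the Z/√(V/(n−1)) decomposition of eq. (10), in which the unknown σ cancels. The data and the placeholder σ are arbitrary example values:

```python
import math, statistics

y = [5.2, 4.8, 6.1, 5.5, 4.9, 5.7, 5.0]
mu0 = 5.0
n = len(y)

y_bar = statistics.fmean(y)
s2 = statistics.variance(y)            # 1/(n-1) sample variance, eq. (6)
t = (y_bar - mu0) / math.sqrt(s2 / n)  # eq. (2)

# same statistic via the Z and V decomposition of eq. (10); sigma cancels
sigma = 1.3  # any positive value; it drops out of Z / sqrt(V / (n-1))
Z = math.sqrt(n) * (y_bar - mu0) / sigma
V = (n - 1) * s2 / sigma ** 2
assert abs(t - Z / math.sqrt(V / (n - 1))) < 1e-12
```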
Sources:
• Wikipedia (2021): “Student’s t-distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2021-03-12; URL: https://en.wikipedia.org/wiki/Student%27s_t-distribution#Derivation.
Metadata: ID: P204 | shortcut: ug-ttest1 | author: JoramSoch | date: 2021-03-12, 08:43.
y1i ∼ N (µ1 , σ 2 ), i = 1, . . . , n1
(1)
y2i ∼ N (µ2 , σ 2 ), i = 1, . . . , n2
be two univariate Gaussian data sets (→ Definition III/1.1.1) representing two groups of unequal size
n1 and n2 with unknown means µ1 and µ2 and equal unknown variance σ 2 . Then, the test statistic
(→ Definition I/4.3.5)
t = [ (ȳ1 − ȳ2) − µ∆ ] / [ s_p · √(1/n1 + 1/n2) ]  (2)
with sample means (→ Definition I/1.7.2) ȳ1 and ȳ2 and pooled standard deviation (→ Definition
“std-pool”) sp follows a Student’s t-distribution (→ Definition II/3.3.1) with n1 + n2 − 2 degrees of
freedom (→ Definition “dof”)
t ∼ t(n1 + n2 − 2) (3)
under the null hypothesis (→ Definition I/4.3.2)
H0 : µ1 − µ2 = µ∆ . (4)
Proof: The sample means (→ Definition I/1.7.2) are given by
ȳ1 = (1/n1) Σ_{i=1}^{n1} y_{1i}
ȳ2 = (1/n2) Σ_{i=1}^{n2} y_{2i}  (5)
and the sample variances (→ Definition I/1.8.2) are given by
s1² = (1/(n1 − 1)) Σ_{i=1}^{n1} (y_{1i} − ȳ1)²
s2² = (1/(n2 − 1)) Σ_{i=1}^{n2} (y_{2i} − ȳ2)² .  (7)
Using the linear combination formula for normal random variables (→ Proof II/3.2.26), the sample means follow normal distributions (→ Definition II/3.2.1) with the following parameters:
ȳ1 = (1/n1) Σ_{i=1}^{n1} y_{1i} ∼ N( (1/n1) n1µ1, (1/n1)² n1σ² ) = N(µ1, σ²/n1)
ȳ2 = (1/n2) Σ_{i=1}^{n2} y_{2i} ∼ N( (1/n2) n2µ2, (1/n2)² n2σ² ) = N(µ2, σ²/n2) .  (8)
Again employing the linear combination theorem and applying the null hypothesis from (4), the distribution of Z = [(ȳ1 − ȳ2) − µ∆] / (σ √(1/n1 + 1/n2)) becomes standard normal (→ Definition II/3.2.2)
Z = [ (ȳ1 − ȳ2) − µ∆ ] / [ σ · √(1/n1 + 1/n2) ] ∼ N( [ (µ1 − µ2) − µ∆ ] / [ σ · √(1/n1 + 1/n2) ], ( σ²/n1 + σ²/n2 ) / ( σ² (1/n1 + 1/n2) ) ) =_{H0} N(0, 1) .  (9)
Because sample variances calculated from independent normal random variables follow a chi-squared
distribution (→ Proof II/3.2.6), the distribution of V = (n1 + n2 − 2) s2p /σ 2 is
V = (n1 + n2 − 2) s_p²/σ² ∼ χ²(n1 + n2 − 2) .  (10)
Finally, since the ratio of a standard normal random variable and the square root of a chi-squared
random variable follows a t-distribution (→ Definition II/3.3.1), the distribution of the test statistic
(→ Definition I/4.3.5) is given by
t = [ (ȳ1 − ȳ2) − µ∆ ] / [ s_p · √(1/n1 + 1/n2) ] = Z/√( V/(n1 + n2 − 2) ) ∼ t(n1 + n2 − 2) .  (11)
This means that the null hypothesis (→ Definition I/4.3.2) can be rejected when t is as extreme or
more extreme than the critical value (→ Definition I/4.3.9) obtained from the Student’s t-distribution
(→ Definition II/3.3.1) with n1 + n2 − 2 degrees of freedom (→ Definition “dof”) using a significance
level (→ Definition I/4.3.8) α.
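A sketch of the computation with arbitrary example data; the pooled variance used here is the standard (n1−1, n2−1)-weighted average of the group variances, consistent with V in eq. (10):

```python
import math, statistics

y1 = [4.1, 5.0, 4.6, 5.3, 4.8]
y2 = [3.6, 4.2, 3.9, 4.5]
mu_delta = 0.0
n1, n2 = len(y1), len(y2)

m1, m2 = statistics.fmean(y1), statistics.fmean(y2)
s1, s2 = statistics.variance(y1), statistics.variance(y2)

# pooled variance: degrees-of-freedom-weighted average of the group variances
sp2 = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
t = ((m1 - m2) - mu_delta) / (math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2))  # eq. (2)

# the pooled variance lies between the two group variances
assert min(s1, s2) <= sp2 <= max(s1, s2)
# the sign of t matches the sign of the mean difference relative to mu_delta
assert (t > 0) == (m1 - m2 > mu_delta)
```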
Sources:
• Wikipedia (2021): “Student’s t-distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2021-03-12; URL: https://en.wikipedia.org/wiki/Student%27s_t-distribution#Derivation.
• Wikipedia (2021): “Student’s t-test”; in: Wikipedia, the free encyclopedia, retrieved on 2021-03-12;
URL: https://en.wikipedia.org/wiki/Student%27s_t-test#Equal_or_unequal_sample_sizes,_similar_
variances_(1/2_%3C_sX1/sX2_%3C_2).
Metadata: ID: P205 | shortcut: ug-ttest2 | author: JoramSoch | date: 2021-03-12, 09:20.
t = (d̄ − µ0)/(s_d/√n)  where  d_i = y_{i1} − y_{i2}  (2)
with sample mean (→ Definition I/1.7.2) d¯ and sample variance (→ Definition I/1.8.2) s2d follows a
Student’s t-distribution (→ Definition II/3.3.1) with n − 1 degrees of freedom (→ Definition “dof”)
t ∼ t(n − 1) (3)
under the null hypothesis (→ Definition I/4.3.2)
H0 : µ = µ0 . (4)
Proof: Define the pair-wise difference di = yi1 − yi2 which is, according to the linearity of the
expected value (→ Proof I/1.7.5) and the invariance of the variance under addition (→ Proof I/1.8.6),
distributed as
Sources:
• Wikipedia (2021): “Student’s t-test”; in: Wikipedia, the free encyclopedia, retrieved on 2021-03-12;
URL: https://en.wikipedia.org/wiki/Student%27s_t-test#Dependent_t-test_for_paired_samples.
Metadata: ID: P206 | shortcut: ug-ttestp | author: JoramSoch | date: 2021-03-12, 09:34.
p(y|µ, σ²) = ∏_{i=1}^n N(y_i; µ, σ²)
           = ∏_{i=1}^n (1/√(2πσ²)) · exp[ −(1/2) ((y_i − µ)/σ)² ]  (3)
           = (1/(√(2π)σ))^n · exp[ −(1/(2σ²)) Σ_{i=1}^n (y_i − µ)² ]
p(y|µ, τ) = ∏_{i=1}^n N(y_i; µ, τ^{-1})
          = ∏_{i=1}^n √(τ/(2π)) · exp[ −(τ/2) (y_i − µ)² ]  (4)
          = (τ/(2π))^{n/2} · exp[ −(τ/2) Σ_{i=1}^n (y_i − µ)² ]
s " #
1 τ Xn
p(y|µ, τ ) = · τ n/2 · exp − y 2 − 2µyi + µ2
(2π)n 2 i=1 i
s " !#
1 τ Xn Xn
= · τ n/2 · exp − y 2 − 2µ yi + nµ2
(2π)n 2 i=1 i i=1
s (6)
1 h τ T i
= · τ n/2
· exp − y y − 2µnȳ + nµ 2
(2π)n 2
s
1 τn 1 T
= ·τ n/2
· exp − y y − 2µȳ + µ 2
(2π)n 2 n
P P
where ȳ = n1 ni=1 yi is the mean of data points and y T y = ni=1 yi2 is the sum of squared data points.
Completing the square over µ finally gives
p(y|µ, τ) = √(1/(2π)^n) · τ^{n/2} · exp[ −(τn/2) ( (µ − ȳ)² − ȳ² + (1/n) y^T y ) ] .  (7)
In other words, the likelihood function (→ Definition I/5.1.2) is proportional to a power of τ times
an exponential of τ and an exponential of a squared form of µ, weighted by τ :
p(y|µ, τ) ∝ τ^{n/2} · exp[ −(τ/2) ( y^T y − nȳ² ) ] · exp[ −(τn/2) (µ − ȳ)² ] .  (8)
The same is true for a normal-gamma distribution (→ Definition II/4.3.1) over µ and τ
Sources:
• Bishop CM (2006): “Bayesian inference for the Gaussian”; in: Pattern Recognition for Machine
Learning, pp. 97-102, eq. 2.154; URL: http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%
20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf.
Metadata: ID: P201 | shortcut: ug-prior | author: JoramSoch | date: 2021-03-03, 08:54.
µn = (λ0µ0 + nȳ)/(λ0 + n)
λn = λ0 + n
an = a0 + n/2  (4)
bn = b0 + (1/2) ( y^T y + λ0µ0² − λnµn² ) .
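The update rule in eq. (4) can be written as a small function; two sanity checks are that an empty data set returns the prior, and that batch and one-observation-at-a-time updating agree. The example hyperparameter values are arbitrary:

```python
def ng_posterior(y, mu0, lam0, a0, b0):
    # posterior hyperparameters of the normal-gamma prior, eq. (4)
    n = len(y)
    y_bar = sum(y) / n if n else 0.0
    yty = sum(yi * yi for yi in y)
    lam_n = lam0 + n
    mu_n = (lam0 * mu0 + n * y_bar) / lam_n
    a_n = a0 + n / 2
    b_n = b0 + 0.5 * (yty + lam0 * mu0 ** 2 - lam_n * mu_n ** 2)
    return mu_n, lam_n, a_n, b_n

# with no data the posterior equals the prior
assert ng_posterior([], 1.0, 2.0, 3.0, 4.0) == (1.0, 2.0, 3.0, 4.0)

# sequential updating (one observation at a time) gives the same posterior
y = [1.2, -0.4, 2.5]
mu_n, lam_n, a_n, b_n = ng_posterior(y, 0.0, 1.0, 2.0, 1.0)
p = (0.0, 1.0, 2.0, 1.0)
for yi in y:
    p = ng_posterior([yi], *p)
assert all(abs(u - v) < 1e-9 for u, v in zip(p, (mu_n, lam_n, a_n, b_n)))
```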
Proof: According to Bayes’ theorem (→ Proof I/5.3.1), the posterior distribution (→ Definition
I/5.1.7) is given by
p(y|µ, τ ) p(µ, τ )
p(µ, τ |y) = . (5)
p(y)
Since p(y) is just a normalization factor, the posterior is proportional (→ Proof I/5.1.8) to the
numerator:
p(µ, τ|y) ∝ p(y|µ, τ) p(µ, τ) .  (6)
Equation (1) implies the following likelihood function (→ Definition I/5.1.2):
p(y|µ, σ²) = ∏_{i=1}^n N(y_i; µ, σ²)
           = ∏_{i=1}^n (1/√(2πσ²)) · exp[ −(1/2) ((y_i − µ)/σ)² ]  (7)
           = (1/(√(2π)σ))^n · exp[ −(1/(2σ²)) Σ_{i=1}^n (y_i − µ)² ]
p(y|µ, τ) = ∏_{i=1}^n N(y_i; µ, τ^{-1})
          = ∏_{i=1}^n √(τ/(2π)) · exp[ −(τ/2) (y_i − µ)² ]  (8)
          = (τ/(2π))^{n/2} · exp[ −(τ/2) Σ_{i=1}^n (y_i − µ)² ]
Combining the likelihood function (→ Definition I/5.1.2) (8) with the prior distribution (→ Definition
I/5.1.3) (2), the joint likelihood (→ Definition I/5.1.5) of the model is given by
p(y, µ, τ) = √( τ^{n+1} λ0 / (2π)^{n+1} ) · (b0^{a0}/Γ(a0)) τ^{a0−1} exp[−b0τ] · exp[ −(τ/2) ( Σ_{i=1}^n (y_i − µ)² + λ0 (µ − µ0)² ) ] .  (10)
p(y, µ, τ) = √( τ^{n+1} λ0 / (2π)^{n+1} ) · (b0^{a0}/Γ(a0)) τ^{a0−1} exp[−b0τ] · exp[ −(τ/2) ( (y^T y − 2µnȳ + nµ²) + λ0 (µ² − 2µµ0 + µ0²) ) ]  (11)
where ȳ = (1/n) Σ_{i=1}^n y_i and y^T y = Σ_{i=1}^n y_i², such that
p(y, µ, τ) = √( τ^{n+1} λ0 / (2π)^{n+1} ) · (b0^{a0}/Γ(a0)) τ^{a0−1} exp[−b0τ] · exp[ −(τ/2) ( µ² (λ0 + n) − 2µ (λ0µ0 + nȳ) + (y^T y + λ0µ0²) ) ] .  (12)
Completing the square over µ, we finally have
p(y, µ, τ) = √( τ^{n+1} λ0 / (2π)^{n+1} ) · (b0^{a0}/Γ(a0)) τ^{a0−1} exp[−b0τ] · exp[ −(τλn/2) (µ − µn)² − (τ/2) ( y^T y + λ0µ0² − λnµn² ) ]  (13)
with the posterior hyperparameters
µn = (λ0µ0 + nȳ)/(λ0 + n)
λn = λ0 + n .  (14)
an = a0 + n/2
bn = b0 + (1/2) ( y^T y + λ0µ0² − λnµn² ) .  (16)
From the term in (13), we can isolate the posterior distribution over µ given τ :
Sources:
• Bishop CM (2006): “Bayesian inference for the Gaussian”; in: Pattern Recognition for Machine
Learning, pp. 97-102, eq. 2.154; URL: http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%
20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf.
Metadata: ID: P202 | shortcut: ug-post | author: JoramSoch | date: 2021-03-03, 09:53.
µn = (λ0µ0 + nȳ)/(λ0 + n)
λn = λ0 + n
an = a0 + n/2  (4)
bn = b0 + (1/2) ( y^T y + λ0µ0² − λnµn² ) .
Proof: According to the law of marginal probability (→ Definition I/1.3.3), the model evidence (→
Definition I/5.1.9) for this model is:
p(y|m) = ∬ p(y|µ, τ) p(µ, τ) dµ dτ .  (5)
According to the law of conditional probability (→ Definition I/1.3.4), the integrand is equivalent to
the joint likelihood (→ Definition I/5.1.5):
p(y|m) = ∬ p(y, µ, τ) dµ dτ .  (6)
The model's likelihood function (→ Definition I/5.1.2) is
p(y|µ, σ²) = ∏_{i=1}^n N(y_i; µ, σ²)
           = ∏_{i=1}^n (1/√(2πσ²)) · exp[ −(1/2) ((y_i − µ)/σ)² ]  (7)
           = (1/(√(2π)σ))^n · exp[ −(1/(2σ²)) Σ_{i=1}^n (y_i − µ)² ]
p(y|µ, τ) = ∏_{i=1}^n N(y_i; µ, τ^{-1})
          = ∏_{i=1}^n √(τ/(2π)) · exp[ −(τ/2) (y_i − µ)² ]  (8)
          = (τ/(2π))^{n/2} · exp[ −(τ/2) Σ_{i=1}^n (y_i − µ)² ]
When deriving the posterior distribution (→ Proof III/1.1.7) p(µ, τ |y), the joint likelihood p(y, µ, τ )
is obtained as
p(y, µ, τ) = √( τ^{n+1} λ0 / (2π)^{n+1} ) · (b0^{a0}/Γ(a0)) τ^{a0−1} exp[−b0τ] · exp[ −(τλn/2) (µ − µn)² − (τ/2) ( y^T y + λ0µ0² − λnµn² ) ] .  (9)
Using the probability density function of the normal distribution (→ Proof II/3.2.10), we can rewrite
this as
p(y, µ, τ) = √( τ^n / (2π)^n ) √( τλ0 / (2π) ) √( 2π / (τλn) ) · (b0^{a0}/Γ(a0)) τ^{a0−1} exp[−b0τ] · N(µ; µn, (τλn)^{-1}) · exp[ −(τ/2) ( y^T y + λ0µ0² − λnµn² ) ] .  (10)
Now, µ can be integrated out easily:
∫ p(y, µ, τ) dµ = √( 1/(2π)^n ) √( λ0/λn ) · (b0^{a0}/Γ(a0)) τ^{a0+n/2−1} exp[−b0τ] · exp[ −(τ/2) ( y^T y + λ0µ0² − λnµn² ) ] .  (11)
Using the probability density function of the gamma distribution (→ Proof II/3.4.6), we can rewrite
this as
∫ p(y, µ, τ) dµ = √( 1/(2π)^n ) √( λ0/λn ) · (b0^{a0}/Γ(a0)) · (Γ(an)/bn^{an}) · Gam(τ; an, bn) .  (12)
Finally, τ can also be integrated out:
∬ p(y, µ, τ) dµ dτ = √( 1/(2π)^n ) √( λ0/λn ) · (Γ(an)/Γ(a0)) · (b0^{a0}/bn^{an}) .  (13)
Thus, the log model evidence (→ Definition IV/3.1.3) of this model is given by
log p(y|m) = −(n/2) log(2π) + (1/2) log(λ0/λn) + log Γ(an) − log Γ(a0) + a0 log b0 − an log bn .  (14)
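Eq. (14) can be cross-checked numerically: by the chain rule of probability, the model evidence also equals the product of one-step-ahead posterior predictive densities, each of which is a Student's t distribution under normal-gamma conjugacy (a standard result, not derived in this section). A sketch with arbitrary example values:

```python
import math

def ng_update(y, mu0, lam0, a0, b0):
    # posterior hyperparameters, as derived for the univariate Gaussian
    n = len(y)
    y_bar = sum(y) / n
    yty = sum(v * v for v in y)
    lam_n = lam0 + n
    mu_n = (lam0 * mu0 + n * y_bar) / lam_n
    a_n = a0 + n / 2
    b_n = b0 + 0.5 * (yty + lam0 * mu0 ** 2 - lam_n * mu_n ** 2)
    return mu_n, lam_n, a_n, b_n

def log_evidence(y, mu0, lam0, a0, b0):
    # closed-form log model evidence, eq. (14)
    n = len(y)
    _, lam_n, a_n, b_n = ng_update(y, mu0, lam0, a0, b0)
    return (-n / 2 * math.log(2 * math.pi) + 0.5 * math.log(lam0 / lam_n)
            + math.lgamma(a_n) - math.lgamma(a0)
            + a0 * math.log(b0) - a_n * math.log(b_n))

def log_t_pdf(x, nu, m, s2):
    # log density of a location-scale Student's t distribution
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi * s2)
            - (nu + 1) / 2 * math.log(1 + (x - m) ** 2 / (nu * s2)))

y = [1.0, 2.0, -0.5]
mu0, lam0, a0, b0 = 0.0, 1.0, 2.0, 1.0

# evidence as a product of one-step-ahead predictive t densities
lme_seq = 0.0
params = (mu0, lam0, a0, b0)
for yi in y:
    m, lam, a, b = params
    lme_seq += log_t_pdf(yi, 2 * a, m, b * (lam + 1) / (a * lam))
    params = ng_update([yi], *params)

lme_closed = log_evidence(y, mu0, lam0, a0, b0)
assert abs(lme_closed - lme_seq) < 1e-9
```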
Sources:
• Bishop CM (2006): “Bayesian linear regression”; in: Pattern Recognition for Machine Learning, pp.
152-161, ex. 3.23, eq. 3.118; URL: http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%
20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf.
Metadata: ID: P203 | shortcut: ug-lme | author: JoramSoch | date: 2021-03-03, 10:25.
Acc(m) = −(1/2)(an/bn) ( y^T y − 2nȳµn + nµn² ) − (1/2) n λn^{-1} + (n/2) ( ψ(an) − log(bn) ) − (n/2) log(2π)
Com(m) = (1/2)(an/bn) [ λ0 (µ0 − µn)² − 2(bn − b0) ] + (1/2)(λ0/λn) − (1/2) log(λ0/λn) − 1/2  (3)
       + a0 · log(bn/b0) − log( Γ(an)/Γ(a0) ) + (an − a0) · ψ(an)
where µn and λn as well as an and bn are the posterior hyperparameters for the univariate Gaussian
(→ Proof III/1.1.7) and ȳ is the sample mean (→ Definition I/1.7.2).
The accuracy term is the expectation (→ Definition I/1.7.1) of the log-likelihood function (→ Defini-
tion I/4.1.2) log p(y|µ, τ ) with respect to the posterior distribution (→ Definition I/5.1.7) p(µ, τ |y).
With the log-likelihood function for the univariate Gaussian (→ Proof III/1.1.2) and the posterior
distribution for the univariate Gaussian (→ Proof III/1.1.7), the model accuracy of m evaluates to:
Acc(m) = ⟨ (n/2) log(τ) − (n/2) log(2π) − (τ/2) ( y^T y − 2nȳµn + nµn² ) − (1/2) n λn^{-1} ⟩_{Gam(an, bn)}
       = (n/2) ( ψ(an) − log(bn) ) − (n/2) log(2π) − (1/2)(an/bn) ( y^T y − 2nȳµn + nµn² ) − (1/2) n λn^{-1}  (5)
       = −(1/2)(an/bn) ( y^T y − 2nȳµn + nµn² ) − (1/2) n λn^{-1} + (n/2) ( ψ(an) − log(bn) ) − (n/2) log(2π)
The complexity penalty is the Kullback-Leibler divergence (→ Definition I/2.5.1) of the posterior
distribution (→ Definition I/5.1.7) p(µ, τ |y) from the prior distribution (→ Definition I/5.1.3) p(µ, τ ).
With the prior distribution (→ Proof III/1.1.6) given by (2), the posterior distribution for the
univariate Gaussian (→ Proof III/1.1.7) and the Kullback-Leibler divergence of the normal-gamma
distribution (→ Proof II/4.3.7), the model complexity of m evaluates to:
Sources:
• original work
Metadata: ID: P240 | shortcut: ug-anc | author: JoramSoch | date: 2021-07-14, 08:26.
yi ∼ N (µ, σ 2 ), i = 1, . . . , n . (1)
Sources:
• Bishop, Christopher M. (2006): “Bayesian inference for the Gaussian”; in: Pattern Recognition
for Machine Learning, ch. 2.3.6, p. 97, eq. 2.137; URL: http://users.isr.ist.utl.pt/~wurmd/Livros/
school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%
20%202006.pdf.
Metadata: ID: D136 | shortcut: ugkv | author: JoramSoch | date: 2021-03-23, 16:12.
yi ∼ N (µ, σ 2 ), i = 1, . . . , n . (1)
Then, the maximum likelihood estimate (→ Definition I/4.1.3) for the mean µ is given by
µ̂ = ȳ (2)
where ȳ is the sample mean (→ Definition I/1.7.2)
ȳ = (1/n) Σ_{i=1}^n y_i .  (3)
n i=1
Proof: The likelihood function (→ Definition I/5.1.2) for each observation is given by the probability
density function of the normal distribution (→ Proof II/3.2.10)
" 2 #
1 1 y i − µ
p(yi |µ) = N (x; µ, σ 2 ) = √ · exp − (4)
2πσ 2 2 σ
and because observations are independent (→ Definition I/1.3.6), the likelihood function for all
observations is the product of the individual ones:
s " n 2 #
Yn
1 1 X yi − µ
p(y|µ) = p(yi |µ) = 2 )n
· exp − . (5)
i=1
(2πσ 2 i=1
σ
This can be developed into
p(y|µ) = (1/(2πσ²))^{n/2} · exp[ −(1/2) Σ_{i=1}^n (y_i² − 2y_iµ + µ²)/σ² ]
       = (1/(2πσ²))^{n/2} · exp[ −(1/(2σ²)) ( y^T y − 2nȳµ + nµ² ) ]  (6)
where ȳ = (1/n) Σ_{i=1}^n y_i is the mean of data points and y^T y = Σ_{i=1}^n y_i² is the sum of squared data points.
Thus, the log-likelihood function (→ Definition I/4.1.2) is
LL(µ) = log p(y|µ) = −(n/2) log(2πσ²) − (1/(2σ²)) ( y^T y − 2nȳµ + nµ² ) .  (7)
The derivatives of the log-likelihood with respect to µ are
dLL(µ)/dµ = nȳ/σ² − nµ/σ² = (n/σ²) (ȳ − µ)
d²LL(µ)/dµ² = −n/σ² .  (8)
Setting the first derivative to zero, we obtain:
dLL(µ̂)/dµ = 0
0 = (n/σ²) (ȳ − µ̂)
0 = ȳ − µ̂
µ̂ = ȳ .  (9)
The second derivative is negative at this value,
d²LL(µ̂)/dµ² = −n/σ² < 0 .  (10)
This demonstrates that the estimate µ̂ = ȳ maximizes the likelihood p(y|µ).
Sources:
• Bishop, Christopher M. (2006): “Bayesian inference for the Gaussian”; in: Pattern Recognition
for Machine Learning, ch. 2.3.6, p. 98, eq. 2.143; URL: http://users.isr.ist.utl.pt/~wurmd/Livros/
school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%
20%202006.pdf.
Metadata: ID: P207 | shortcut: ugkv-mle | author: JoramSoch | date: 2021-03-24, 03:48.
yi ∼ N (µ, σ 2 ), i = 1, . . . , n (1)
be a univariate Gaussian data set (→ Definition III/1.2.1) with unknown mean µ and known variance
σ 2 . Then, the test statistic (→ Definition I/4.3.5)
z = √n (ȳ − µ0)/σ  (2)
with sample mean (→ Definition I/1.7.2) ȳ follows a standard normal distribution (→ Definition
II/3.2.2)
z ∼ N (0, 1) (3)
under the null hypothesis (→ Definition I/4.3.2)
H0 : µ = µ0 . (4)
Proof: The sample mean (→ Definition I/1.7.2) is given by
ȳ = (1/n) Σ_{i=1}^n y_i .  (5)
Using the linear combination formula for normal random variables (→ Proof II/3.2.26), the sample
mean follows a normal distribution (→ Definition II/3.2.1) with the following parameters:
ȳ = (1/n) Σ_{i=1}^n y_i ∼ N( (1/n) nµ, (1/n)² nσ² ) = N(µ, σ²/n) .  (6)
Again employing the linear combination theorem, the distribution of z = √(n/σ²) (ȳ − µ0) becomes
z = √(n/σ²) (ȳ − µ0) ∼ N( √(n/σ²) (µ − µ0), (√(n/σ²))² σ²/n ) = N( √n (µ − µ0)/σ, 1 ) ,  (7)
such that, under the null hypothesis in (4), we have:
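A minimal numerical sketch of the test statistic with arbitrary example data, confirming the equivalent form used in eq. (7):

```python
import math, statistics

y = [10.3, 9.8, 10.6, 10.1, 9.9, 10.4]
mu0, sigma = 10.0, 0.5  # known variance sigma^2 = 0.25
n = len(y)

y_bar = statistics.fmean(y)
z = math.sqrt(n) * (y_bar - mu0) / sigma  # eq. (2)

# equivalent form z = sqrt(n / sigma^2) * (y_bar - mu0) from eq. (7)
assert abs(z - math.sqrt(n / sigma ** 2) * (y_bar - mu0)) < 1e-12
```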
Sources:
• Wikipedia (2021): “Z-test”; in: Wikipedia, the free encyclopedia, retrieved on 2021-03-24; URL:
https://en.wikipedia.org/wiki/Z-test#Use_in_location_testing.
• Wikipedia (2021): “Gauß-Test”; in: Wikipedia – Die freie Enzyklopädie, retrieved on 2021-03-24;
URL: https://de.wikipedia.org/wiki/Gau%C3%9F-Test#Einstichproben-Gau%C3%9F-Test.
Metadata: ID: P208 | shortcut: ugkv-ztest1 | author: JoramSoch | date: 2021-03-24, 04:23.
be two univariate Gaussian data sets (→ Definition III/1.1.1) representing two groups of unequal
size n1 and n2 with unknown means µ1 and µ2 and unknown variances σ12 and σ22 . Then, the test
statistic (→ Definition I/4.3.5)
z = [ (ȳ1 − ȳ2) − µ∆ ] / √( σ1²/n1 + σ2²/n2 )  (2)
with sample means (→ Definition I/1.7.2) ȳ1 and ȳ2 follows a standard normal distribution (→
Definition II/3.2.2)
z ∼ N (0, 1) (3)
under the null hypothesis (→ Definition I/4.3.2)
H0 : µ1 − µ2 = µ∆ . (4)
Proof: The sample means (→ Definition I/1.7.2) are given by
ȳ1 = (1/n1) Σ_{i=1}^{n1} y_{1i}
ȳ2 = (1/n2) Σ_{i=1}^{n2} y_{2i} .  (5)
Using the linear combination formula for normal random variables (→ Proof II/3.2.26), the sample
means follows normal distributions (→ Definition II/3.2.1) with the following parameters:
ȳ1 = (1/n1) Σ_{i=1}^{n1} y_{1i} ∼ N( (1/n1) n1µ1, (1/n1)² n1σ1² ) = N(µ1, σ1²/n1)
ȳ2 = (1/n2) Σ_{i=1}^{n2} y_{2i} ∼ N( (1/n2) n2µ2, (1/n2)² n2σ2² ) = N(µ2, σ2²/n2) .  (6)
Again employing the linear combination theorem, the distribution of z = [(ȳ1 − ȳ2) − µ∆]/σ∆ becomes
z = [ (ȳ1 − ȳ2) − µ∆ ] / σ∆ ∼ N( [ (µ1 − µ2) − µ∆ ] / σ∆, (1/σ∆)² σ∆² ) = N( [ (µ1 − µ2) − µ∆ ] / σ∆, 1 )  (7)
where σ∆ is the pooled standard deviation (→ Definition “std-pool”)
σ∆ = √( σ1²/n1 + σ2²/n2 ) ,  (8)
such that, under the null hypothesis in (4), we have:
Sources:
• Wikipedia (2021): “Z-test”; in: Wikipedia, the free encyclopedia, retrieved on 2021-03-24; URL:
https://en.wikipedia.org/wiki/Z-test#Use_in_location_testing.
• Wikipedia (2021): “Gauß-Test”; in: Wikipedia – Die freie Enzyklopädie, retrieved on 2021-03-24;
URL: https://de.wikipedia.org/wiki/Gau%C3%9F-Test#Zweistichproben-Gau%C3%9F-Test_f%
C3%BCr_unabh%C3%A4ngige_Stichproben.
Metadata: ID: P209 | shortcut: ugkv-ztest2 | author: JoramSoch | date: 2021-03-24, 04:38.
is a univariate Gaussian data set (→ Definition III/1.2.1) with unknown shift µ and known variance
σ 2 . Then, the test statistic (→ Definition I/4.3.5)
z = √n (d̄ − µ0)/σ  where  d_i = y_{i1} − y_{i2}  (2)
σ
with sample mean (→ Definition I/1.7.2) d¯ follows a standard normal distribution (→ Definition
II/3.2.2)
z ∼ N (0, 1) (3)
under the null hypothesis (→ Definition I/4.3.2)
H0 : µ = µ0 . (4)
Proof: Define the pair-wise difference di = yi1 − yi2 which is, according to the linearity of the
expected value (→ Proof I/1.7.5) and the invariance of the variance under addition (→ Proof I/1.8.6),
distributed as
Sources:
• Wikipedia (2021): “Z-test”; in: Wikipedia, the free encyclopedia, retrieved on 2021-03-24; URL:
https://en.wikipedia.org/wiki/Z-test#Use_in_location_testing.
• Wikipedia (2021): “Gauß-Test”; in: Wikipedia – Die freie Enzyklopädie, retrieved on 2021-03-24;
URL: https://de.wikipedia.org/wiki/Gau%C3%9F-Test#Zweistichproben-Gau%C3%9F-Test_f%
C3%BCr_abh%C3%A4ngige_(verbundene)_Stichproben.
Metadata: ID: P210 | shortcut: ugkv-ztestp | author: JoramSoch | date: 2021-03-24, 05:10.
Definition I/1.5.1). This is fulfilled when the prior density and the likelihood function are proportional to the model parameters in the same way, i.e. the model parameters appear in the same functional form in both.
Equation (1) implies the following likelihood function (→ Definition I/5.1.2)
$$p(y|\mu) = \prod_{i=1}^{n} \mathcal{N}(y_i; \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\cdot\exp\left[-\frac{1}{2}\left(\frac{y_i - \mu}{\sigma}\right)^2\right] = \left(\sqrt{\frac{1}{2\pi\sigma^2}}\right)^{n}\cdot\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2\right] \quad (3)$$
$$p(y|\mu) = \prod_{i=1}^{n} \mathcal{N}(y_i; \mu, \tau^{-1}) = \prod_{i=1}^{n} \sqrt{\frac{\tau}{2\pi}}\cdot\exp\left[-\frac{\tau}{2}(y_i - \mu)^2\right] = \left(\sqrt{\frac{\tau}{2\pi}}\right)^{n}\cdot\exp\left[-\frac{\tau}{2}\sum_{i=1}^{n}(y_i - \mu)^2\right] \quad (4)$$
" #
τ n/2 τX 2
n
p(y|µ) = · exp − y − 2µyi + µ2
2π 2 i=1 i
" !#
τ n/2 τ X 2
n Xn
= · exp − y − 2µ yi + nµ2
2π 2 i=1 i i=1 (5)
τ n/2 h τ i
= · exp − y T y − 2µnȳ + nµ2
2π 2
τ n/2 τn 1 T
= · exp − y y − 2µȳ + µ 2
2π 2 n
where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ is the mean of data points and $y^\mathrm{T} y = \sum_{i=1}^{n} y_i^2$ is the sum of squared data points.
Completing the square over µ finally gives
$$p(y|\mu) = \left(\frac{\tau}{2\pi}\right)^{n/2}\cdot\exp\left[-\frac{\tau n}{2}\left((\mu - \bar{y})^2 - \bar{y}^2 + \frac{1}{n}\,y^\mathrm{T} y\right)\right] \quad (6)$$
such that
$$p(y|\mu) \propto \exp\left[-\frac{\tau n}{2}(\mu - \bar{y})^2\right] \; . \quad (7)$$
The same is true for a normal distribution (→ Definition II/3.2.1) over µ
Sources:
• Bishop, Christopher M. (2006): “Bayesian inference for the Gaussian”; in: Pattern Recognition for
Machine Learning, ch. 2.3.6, pp. 97-98, eq. 2.138; URL: http://users.isr.ist.utl.pt/~wurmd/Livros/
school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%
20%202006.pdf.
Metadata: ID: P211 | shortcut: ugkv-prior | author: JoramSoch | date: 2021-03-24, 05:57.
$$\mu_n = \frac{\lambda_0\mu_0 + \tau n\bar{y}}{\lambda_0 + \tau n} \; , \quad \lambda_n = \lambda_0 + \tau n \quad (4)$$
with the sample mean (→ Definition I/1.7.2) ȳ and the inverse variance or precision (→ Definition
I/1.8.12) τ = 1/σ 2 .
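As an illustrative aside (not part of the source text), the posterior update in (4) is a one-liner; the following Python sketch uses our own function names:

```python
def posterior_hyperparams(y, sigma2, mu0, lam0):
    """Posterior mean mu_n and precision lambda_n from eq. (4)."""
    n = len(y)
    tau = 1.0 / sigma2          # precision of the data
    ybar = sum(y) / n           # sample mean
    lam_n = lam0 + tau * n
    mu_n = (lam0 * mu0 + tau * n * ybar) / lam_n
    return mu_n, lam_n

# Example: three observations, unit variance, standard normal prior
mu_n, lam_n = posterior_hyperparams([1.0, 2.0, 3.0], sigma2=1.0, mu0=0.0, lam0=1.0)
```

Here ȳ = 2, τ = 1, so λn = 1 + 3 = 4 and µn = (0 + 3·2)/4 = 1.5: the posterior mean shrinks the sample mean toward the prior mean.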
1. UNIVARIATE NORMAL DATA 345
Proof: According to Bayes’ theorem (→ Proof I/5.3.1), the posterior distribution (→ Definition
I/5.1.7) is given by
$$p(\mu|y) = \frac{p(y|\mu)\,p(\mu)}{p(y)} \; . \quad (5)$$
Since p(y) is just a normalization factor, the posterior is proportional (→ Proof I/5.1.8) to the numerator:
$$p(\mu|y) \propto p(y|\mu)\,p(\mu) \; . \quad (6)$$
$$p(y|\mu) = \prod_{i=1}^{n} \mathcal{N}(y_i; \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\cdot\exp\left[-\frac{1}{2}\left(\frac{y_i - \mu}{\sigma}\right)^2\right] = \left(\sqrt{\frac{1}{2\pi\sigma^2}}\right)^{n}\cdot\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2\right] \quad (7)$$
$$p(y|\mu) = \prod_{i=1}^{n} \mathcal{N}(y_i; \mu, \tau^{-1}) = \prod_{i=1}^{n} \sqrt{\frac{\tau}{2\pi}}\cdot\exp\left[-\frac{\tau}{2}(y_i - \mu)^2\right] = \left(\sqrt{\frac{\tau}{2\pi}}\right)^{n}\cdot\exp\left[-\frac{\tau}{2}\sum_{i=1}^{n}(y_i - \mu)^2\right] \quad (8)$$
Combining the likelihood function (→ Definition I/5.1.2) (8) with the prior distribution (→ Definition
I/5.1.3) (2), the joint likelihood (→ Definition I/5.1.5) of the model is given by
" !#
τ n2 λ 12 1 X
n
0
p(y, µ) = · · exp − τ (yi − µ) + λ0 (µ − µ0 )
2 2
2π 2π 2 i=1
" !#
τ n2 λ 12 1 X
n
0
= · · exp − τ (yi − 2yi µ + µ ) + λ0 (µ − 2µµ0 + µ0 )
2 2 2 2
2π 2π 2 i=1
(11)
τ n2 λ 12
1
0
= · · exp − τ (y y − 2nȳµ + nµ ) + λ0 (µ − 2µµ0 + µ0 )
T 2 2 2
2π 2π 2
τ 2 1
n
λ0 2 1 2
= · · exp − µ (τ n + λ0 ) − 2µ(τ nȳ + λ0 µ0 ) + (τ y y + λ0 µ0 )
T 2
2π 2π 2
where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $y^\mathrm{T} y = \sum_{i=1}^{n} y_i^2$. Completing the square in µ then yields
$$p(y,\mu) = \left(\frac{\tau}{2\pi}\right)^{\frac{n}{2}}\cdot\left(\frac{\lambda_0}{2\pi}\right)^{\frac{1}{2}}\cdot\exp\left[-\frac{\lambda_n}{2}(\mu - \mu_n)^2 + f_n\right] \quad (12)$$
with the posterior hyperparameters (→ Definition I/5.1.7)
$$\mu_n = \frac{\lambda_0\mu_0 + \tau n\bar{y}}{\lambda_0 + \tau n} \; , \quad \lambda_n = \lambda_0 + \tau n \quad (13)$$
Sources:
• Bishop, Christopher M. (2006): “Bayesian inference for the Gaussian”; in: Pattern Recognition
for Machine Learning, ch. 2.3.6, p. 98, eqs. 2.139-2.142; URL: http://users.isr.ist.utl.pt/~wurmd/
Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springe
20%202006.pdf.
Metadata: ID: P212 | shortcut: ugkv-post | author: JoramSoch | date: 2021-03-24, 06:10.
$$\mu_n = \frac{\lambda_0\mu_0 + \tau n\bar{y}}{\lambda_0 + \tau n} \; , \quad \lambda_n = \lambda_0 + \tau n \quad (4)$$
with the sample mean (→ Definition I/1.7.2) ȳ and the inverse variance or precision (→ Definition
I/1.8.12) τ = 1/σ 2 .
Proof: According to the law of marginal probability (→ Definition I/1.3.3), the model evidence (→
Definition I/5.1.9) for this model is:
$$p(y|m) = \int p(y|\mu)\,p(\mu)\,\mathrm{d}\mu \; . \quad (5)$$
According to the law of conditional probability (→ Definition I/1.3.4), the integrand is equivalent to
the joint likelihood (→ Definition I/5.1.5):
$$p(y|m) = \int p(y,\mu)\,\mathrm{d}\mu \; . \quad (6)$$
$$p(y|\mu,\sigma^2) = \prod_{i=1}^{n} \mathcal{N}(y_i; \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\cdot\exp\left[-\frac{1}{2}\left(\frac{y_i - \mu}{\sigma}\right)^2\right] = \left(\sqrt{\frac{1}{2\pi\sigma^2}}\right)^{n}\cdot\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2\right] \quad (7)$$
$$p(y|\mu,\tau) = \prod_{i=1}^{n} \mathcal{N}(y_i; \mu, \tau^{-1}) = \prod_{i=1}^{n} \sqrt{\frac{\tau}{2\pi}}\cdot\exp\left[-\frac{\tau}{2}(y_i - \mu)^2\right] = \left(\sqrt{\frac{\tau}{2\pi}}\right)^{n}\cdot\exp\left[-\frac{\tau}{2}\sum_{i=1}^{n}(y_i - \mu)^2\right] \quad (8)$$
When deriving the posterior distribution (→ Proof III/1.2.7) p(µ|y), the joint likelihood p(y, µ) is
obtained as
$$p(y,\mu) = \left(\frac{\tau}{2\pi}\right)^{\frac{n}{2}}\cdot\sqrt{\frac{\lambda_0}{2\pi}}\cdot\exp\left[-\frac{\lambda_n}{2}(\mu - \mu_n)^2 - \frac{1}{2}\left(\tau y^\mathrm{T} y + \lambda_0\mu_0^2 - \lambda_n\mu_n^2\right)\right] \; . \quad (9)$$
Using the probability density function of the normal distribution (→ Proof II/3.2.10), we can rewrite
this as
$$p(y,\mu) = \left(\frac{\tau}{2\pi}\right)^{\frac{n}{2}}\cdot\sqrt{\frac{\lambda_0}{2\pi}}\cdot\sqrt{\frac{2\pi}{\lambda_n}}\cdot\mathcal{N}\!\left(\mu;\ \mu_n, \lambda_n^{-1}\right)\cdot\exp\left[-\frac{1}{2}\left(\tau y^\mathrm{T} y + \lambda_0\mu_0^2 - \lambda_n\mu_n^2\right)\right] \; . \quad (10)$$
Now, µ can be integrated out using the properties of the probability density function (→ Definition
I/1.6.6):
$$p(y|m) = \int p(y,\mu)\,\mathrm{d}\mu = \left(\frac{\tau}{2\pi}\right)^{\frac{n}{2}}\cdot\sqrt{\frac{\lambda_0}{\lambda_n}}\cdot\exp\left[-\frac{1}{2}\left(\tau y^\mathrm{T} y + \lambda_0\mu_0^2 - \lambda_n\mu_n^2\right)\right] \; . \quad (11)$$
Thus, the log model evidence (→ Definition IV/3.1.3) of this model is given by
$$\log p(y|m) = \frac{n}{2}\log\frac{\tau}{2\pi} + \frac{1}{2}\log\frac{\lambda_0}{\lambda_n} - \frac{1}{2}\left(\tau y^\mathrm{T} y + \lambda_0\mu_0^2 - \lambda_n\mu_n^2\right) \; . \quad (12)$$
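As an illustrative aside (not part of the source text), the closed-form log model evidence in (12) can be validated against brute-force numerical integration of (5). The following Python sketch, with our own function names and assumed example values, does exactly that:

```python
import math

def log_lik(y, mu, tau):
    """Gaussian log-likelihood with known precision tau."""
    n = len(y)
    return n / 2 * math.log(tau / (2 * math.pi)) - tau / 2 * sum((yi - mu) ** 2 for yi in y)

def log_model_evidence(y, tau, mu0, lam0):
    """Closed-form log model evidence from eq. (12)."""
    n = len(y)
    ybar = sum(y) / n
    yty = sum(yi ** 2 for yi in y)
    lam_n = lam0 + tau * n
    mu_n = (lam0 * mu0 + tau * n * ybar) / lam_n
    return (n / 2 * math.log(tau / (2 * math.pi))
            + 0.5 * math.log(lam0 / lam_n)
            - 0.5 * (tau * yty + lam0 * mu0 ** 2 - lam_n * mu_n ** 2))

# Brute-force quadrature of p(y|m) = integral of p(y|mu) p(mu) over mu
y, tau, mu0, lam0 = [0.3, -0.1, 0.8, 0.5], 2.0, 0.1, 1.5
h = 1e-3
total = 0.0
for k in range(40001):
    mu = -20.0 + k * h
    prior = math.sqrt(lam0 / (2 * math.pi)) * math.exp(-lam0 / 2 * (mu - mu0) ** 2)
    total += math.exp(log_lik(y, mu, tau)) * prior
lme_numeric = math.log(total * h)
lme_closed = log_model_evidence(y, tau, mu0, lam0)
```

The quadrature result and the closed-form expression should agree to high precision.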
Sources:
• original work
Metadata: ID: P213 | shortcut: ugkv-lme | author: JoramSoch | date: 2021-03-24, 06:45.
$$\mathrm{Acc}(m) = \frac{n}{2}\log\frac{\tau}{2\pi} - \frac{1}{2}\left(\tau y^\mathrm{T} y - 2\tau n\bar{y}\mu_n + \tau n\mu_n^2 + \frac{\tau n}{\lambda_n}\right)$$
$$\mathrm{Com}(m) = \frac{1}{2}\left(\frac{\lambda_0}{\lambda_n} + \lambda_0(\mu_0 - \mu_n)^2 - 1 + \log\frac{\lambda_n}{\lambda_0}\right) \quad (3)$$
where µn and λn are the posterior hyperparameters for the univariate Gaussian with known variance
(→ Proof III/1.2.7), τ = 1/σ 2 is the inverse variance or precision (→ Definition I/1.8.12) and ȳ is
the sample mean (→ Definition I/1.7.2).
The accuracy term is the expectation (→ Definition I/1.7.1) of the log-likelihood function (→ Defini-
tion I/4.1.2) log p(y|µ) with respect to the posterior distribution (→ Definition I/5.1.7) p(µ|y). With
the log-likelihood function for the univariate Gaussian with known variance (→ Proof III/1.2.2) and
the posterior distribution for the univariate Gaussian with known variance (→ Proof III/1.2.7), the
model accuracy of m evaluates to:
The complexity penalty is the Kullback-Leibler divergence (→ Definition I/2.5.1) of the posterior
distribution (→ Definition I/5.1.7) p(µ|y) from the prior distribution (→ Definition I/5.1.3) p(µ).
With the prior distribution (→ Proof III/1.2.6) given by (2), the posterior distribution for the
univariate Gaussian with known variance (→ Proof III/1.2.7) and the Kullback-Leibler divergence
of the normal distribution (→ Proof II/3.2.24), the model complexity of m evaluates to:
where LME(m) is the log model evidence for the univariate Gaussian with known variance (→ Proof
III/1.2.8).
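As an illustrative aside (not part of the source text), the decomposition of the log model evidence into accuracy and complexity, LME(m) = Acc(m) − Com(m), can be verified numerically. The following Python sketch implements the three expressions with our own function names:

```python
import math

def lme_acc_com(y, tau, mu0, lam0):
    """Return (LME, Acc, Com) according to eqs. (12) and (3)."""
    n = len(y)
    ybar = sum(y) / n
    yty = sum(yi ** 2 for yi in y)
    lam_n = lam0 + tau * n
    mu_n = (lam0 * mu0 + tau * n * ybar) / lam_n
    acc = (n / 2 * math.log(tau / (2 * math.pi))
           - 0.5 * (tau * yty - 2 * tau * n * ybar * mu_n
                    + tau * n * mu_n ** 2 + tau * n / lam_n))
    com = 0.5 * (lam0 / lam_n + lam0 * (mu0 - mu_n) ** 2 - 1
                 + math.log(lam_n / lam0))
    lme = (n / 2 * math.log(tau / (2 * math.pi)) + 0.5 * math.log(lam0 / lam_n)
           - 0.5 * (tau * yty + lam0 * mu0 ** 2 - lam_n * mu_n ** 2))
    return lme, acc, com

lme, acc, com = lme_acc_com([0.2, 1.1, -0.4, 0.9], tau=2.0, mu0=0.5, lam0=1.0)
```

For any data set and hyperparameters, Acc − Com should exactly reproduce the LME.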
Sources:
• original work
Metadata: ID: P214 | shortcut: ugkv-anc | author: JoramSoch | date: 2021-03-24, 07:49.
$$m_0: \; y_i \sim \mathcal{N}(\mu, \sigma^2), \quad \mu = 0$$
$$m_1: \; y_i \sim \mathcal{N}(\mu, \sigma^2), \quad \mu \sim \mathcal{N}\!\left(\mu_0, \lambda_0^{-1}\right) \; . \quad (2)$$
Proof: The log Bayes factor is equal to the difference of two log model evidences (→ Proof IV/3.3.8):
$$\mathrm{LBF}_{10} = \mathrm{LME}(m_1) - \mathrm{LME}(m_0) = \frac{1}{2}\log\frac{\lambda_0}{\lambda_n} - \frac{1}{2}\left(\lambda_0\mu_0^2 - \lambda_n\mu_n^2\right) \quad (7)$$
where the posterior hyperparameters (→ Definition I/5.1.7) are given by (→ Proof III/1.2.7)
$$\mu_n = \frac{\lambda_0\mu_0 + \tau n\bar{y}}{\lambda_0 + \tau n} \; , \quad \lambda_n = \lambda_0 + \tau n \quad (8)$$
with the sample mean (→ Definition I/1.7.2) ȳ and the inverse variance or precision (→ Definition
I/1.8.12) τ = 1/σ 2 .
Sources:
• original work
Metadata: ID: P215 | shortcut: ugkv-lbf | author: JoramSoch | date: 2021-03-24, 09:05.
$$m_0: \; y_i \sim \mathcal{N}(\mu, \sigma^2), \quad \mu = 0$$
$$m_1: \; y_i \sim \mathcal{N}(\mu, \sigma^2), \quad \mu \sim \mathcal{N}\!\left(\mu_0, \lambda_0^{-1}\right) \; . \quad (2)$$
Then, under the null hypothesis (→ Definition I/4.3.2) that m0 generated the data, the expectation
(→ Definition I/1.7.1) of the log Bayes factor (→ Definition IV/3.3.6) in favor of m1 with µ0 = 0
against m0 is
$$\langle\mathrm{LBF}_{10}\rangle = \frac{1}{2}\log\frac{\lambda_0}{\lambda_n} + \frac{1}{2}\,\frac{\lambda_n - \lambda_0}{\lambda_n} \quad (3)$$
where λn is the posterior precision for the univariate Gaussian with known variance (→ Proof
III/1.2.7).
Proof: The log Bayes factor for the univariate Gaussian with known variance (→ Proof III/1.2.10)
is
$$\mathrm{LBF}_{10} = \frac{1}{2}\log\frac{\lambda_0}{\lambda_n} - \frac{1}{2}\left(\lambda_0\mu_0^2 - \lambda_n\mu_n^2\right) \quad (4)$$
where the posterior hyperparameters (→ Definition I/5.1.7) are given by (→ Proof III/1.2.7)
352 CHAPTER III. STATISTICAL MODELS
$$\mu_n = \frac{\lambda_0\mu_0 + \tau n\bar{y}}{\lambda_0 + \tau n} \; , \quad \lambda_n = \lambda_0 + \tau n \quad (5)$$
with the sample mean (→ Definition I/1.7.2) ȳ and the inverse variance or precision (→ Definition
I/1.8.12) τ = 1/σ 2 . Plugging µn from (5) into (4), we obtain:
$$\begin{aligned} \mathrm{LBF}_{10} &= \frac{1}{2}\log\frac{\lambda_0}{\lambda_n} - \frac{1}{2}\left(\lambda_0\mu_0^2 - \lambda_n\,\frac{(\lambda_0\mu_0 + \tau n\bar{y})^2}{\lambda_n^2}\right) \\ &= \frac{1}{2}\log\frac{\lambda_0}{\lambda_n} - \frac{1}{2}\left(\lambda_0\mu_0^2 - \frac{1}{\lambda_n}\left(\lambda_0^2\mu_0^2 + 2\tau n\lambda_0\mu_0\bar{y} + \tau^2(n\bar{y})^2\right)\right) \end{aligned} \quad (6)$$
Because m1 uses a zero-mean prior distribution (→ Definition I/5.1.3) with prior mean (→ Definition
I/1.7.1) µ0 = 0 per construction, the log Bayes factor simplifies to:
$$\mathrm{LBF}_{10} = \frac{1}{2}\log\frac{\lambda_0}{\lambda_n} + \frac{1}{2}\,\frac{\tau^2(n\bar{y})^2}{\lambda_n} \; . \quad (7)$$
From (1), we know that the data are distributed as yi ∼ N (µ, σ 2 ), such that we can derive the
expectation (→ Definition I/1.7.1) of (nȳ)2 as follows:
$$\begin{aligned} \left\langle(n\bar{y})^2\right\rangle &= \left\langle\sum_{i=1}^{n}\sum_{j=1}^{n} y_i y_j\right\rangle = n\left\langle y_i^2\right\rangle + (n^2 - n)\left\langle y_i y_j\right\rangle_{i\neq j} \\ &= n(\mu^2 + \sigma^2) + (n^2 - n)\mu^2 = n^2\mu^2 + n\sigma^2 \; . \end{aligned} \quad (8)$$
Applying this expected value (→ Definition I/1.7.1) to (7), the expected LBF emerges as:
$$\langle\mathrm{LBF}_{10}\rangle = \frac{1}{2}\log\frac{\lambda_0}{\lambda_n} + \frac{1}{2}\,\frac{\tau^2\left(n^2\mu^2 + n\sigma^2\right)}{\lambda_n} = \frac{1}{2}\log\frac{\lambda_0}{\lambda_n} + \frac{1}{2}\,\frac{(\tau n\mu)^2 + \tau n}{\lambda_n} \quad (9)$$
Under the null hypothesis (→ Definition I/4.3.2) that m0 generated the data, the unknown mean is
µ = 0, such that the log Bayes factor further simplifies to:
$$\langle\mathrm{LBF}_{10}\rangle = \frac{1}{2}\log\frac{\lambda_0}{\lambda_n} + \frac{1}{2}\,\frac{\tau n}{\lambda_n} \; . \quad (10)$$
Finally, plugging λn from (5) into (10), we obtain:
$$\langle\mathrm{LBF}_{10}\rangle = \frac{1}{2}\log\frac{\lambda_0}{\lambda_n} + \frac{1}{2}\,\frac{\lambda_n - \lambda_0}{\lambda_n} \; . \quad (11)$$
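As an illustrative aside (not part of the source text), the expected log Bayes factor in (11) can be checked by simulating data under m0 and averaging the LBF from (7). The following Python sketch uses our own variable names and assumed parameter values:

```python
import math
import random

# Setup: H0 true (mu = 0), known variance sigma^2, zero-mean prior (mu0 = 0)
random.seed(0)
n, sigma, lam0 = 20, 1.5, 0.8
tau = 1.0 / sigma ** 2
lam_n = lam0 + tau * n

lbf_sum = 0.0
reps = 40000
for _ in range(reps):
    y = [random.gauss(0.0, sigma) for _ in range(n)]
    ybar = sum(y) / n
    # LBF from eq. (7) with mu0 = 0
    lbf_sum += 0.5 * math.log(lam0 / lam_n) + 0.5 * tau ** 2 * (n * ybar) ** 2 / lam_n
lbf_emp = lbf_sum / reps
lbf_theory = 0.5 * math.log(lam0 / lam_n) + 0.5 * (lam_n - lam0) / lam_n
```

The empirical average should approach the theoretical expectation as the number of repetitions grows.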
Sources:
• original work
Metadata: ID: P216 | shortcut: ugkv-lbfmean | author: JoramSoch | date: 2021-03-24, 10:03.
$$m_0: \; y_i \sim \mathcal{N}(\mu, \sigma^2), \quad \mu = 0$$
$$m_1: \; y_i \sim \mathcal{N}(\mu, \sigma^2), \quad \mu \sim \mathcal{N}\!\left(\mu_0, \lambda_0^{-1}\right) \; . \quad (2)$$
Then, the cross-validated log model evidences (→ Definition IV/3.1.8) of m0 and m1 are
$$\mathrm{cvLME}(m_0) = \frac{n}{2}\log\frac{\tau}{2\pi} - \frac{1}{2}\tau\,y^\mathrm{T} y$$
$$\mathrm{cvLME}(m_1) = \frac{n}{2}\log\frac{\tau}{2\pi} + \frac{S}{2}\log\frac{S-1}{S} - \frac{\tau}{2}\left(y^\mathrm{T} y + \sum_{i=1}^{S}\left(\frac{\big(n_1\bar{y}_1^{(i)}\big)^2}{n_1} - \frac{(n\bar{y})^2}{n}\right)\right) \quad (3)$$
where ȳ is the sample mean (→ Definition I/1.7.2), τ = 1/σ² is the inverse variance or precision (→ Definition I/1.8.12), $y_1^{(i)}$ are the training data in the i-th cross-validation fold and S is the number of data subsets (→ Definition IV/3.1.8).
Proof: For evaluation of the cross-validated log model evidences (→ Definition IV/3.1.8) (cvLME),
we assume that n data points are divided into S | n data subsets without remainder. Then, the
number of training data points n1 and test data points n2 are given by
$$n = n_1 + n_2 \; , \quad n_1 = \frac{S-1}{S}\,n \; , \quad n_2 = \frac{1}{S}\,n \; , \quad (4)$$
such that training data y1 and test data y2 in the i-th cross-validation fold are
$$y = \{y_1, \ldots, y_n\} \; , \quad y_1^{(i)} = \left\{x \in y \,\middle|\, x \notin y_2^{(i)}\right\} = y \setminus y_2^{(i)} \; , \quad y_2^{(i)} = \left\{y_{(i-1)\cdot n_2 + 1}, \ldots, y_{i\cdot n_2}\right\} \; . \quad (5)$$
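As an illustrative aside (not part of the source text), the fold construction in (4) and (5) can be sketched in Python; function names are our own:

```python
def cv_folds(y, S):
    """Partition y (with S | n) into S folds of training data y1 and test data y2 (eq. 5)."""
    n = len(y)
    assert n % S == 0, "S must divide n without remainder"
    n2 = n // S                       # test set size per fold
    folds = []
    for i in range(S):
        y2 = y[i * n2:(i + 1) * n2]   # i-th contiguous block as test data
        y1 = y[:i * n2] + y[(i + 1) * n2:]  # complement as training data
        folds.append((y1, y2))
    return folds

folds = cv_folds([10, 20, 30, 40, 50, 60], S=3)
```

With n = 6 and S = 3, each fold holds n2 = 2 test points and n1 = 4 training points.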
First, we consider the null model m0 assuming µ = 0. Because this model has no free parameter,
nothing is estimated from the training data and the assumed parameter value is applied to the test
data. Consequently, the out-of-sample log model evidence (→ Definition “ooslme”) (oosLME) is equal
to the log-likelihood function (→ Proof III/1.2.2) of the test data at µ = 0:
$$\mathrm{oosLME}_i(m_0) = \log p\!\left(y_2^{(i)} \,\middle|\, \mu = 0\right) = \frac{n_2}{2}\log\frac{\tau}{2\pi} - \frac{1}{2}\tau\,y_2^{(i)\,\mathrm{T}} y_2^{(i)} \; . \quad (6)$$
By definition, the cross-validated log model evidence is the sum of out-of-sample log model evidences
(→ Definition IV/3.1.8) over cross-validation folds, such that the cvLME of m0 is:
$$\mathrm{cvLME}(m_0) = \sum_{i=1}^{S}\mathrm{oosLME}_i(m_0) = \sum_{i=1}^{S}\left(\frac{n_2}{2}\log\frac{\tau}{2\pi} - \frac{1}{2}\tau\,y_2^{(i)\,\mathrm{T}} y_2^{(i)}\right) = \frac{n}{2}\log\frac{\tau}{2\pi} - \frac{1}{2}\tau\,y^\mathrm{T} y \; . \quad (7)$$
Next, we have a look at the alternative m1 assuming µ ≠ 0. First, the training data $y_1^{(i)}$ are analyzed using a non-informative prior distribution (→ Definition I/5.2.3) and applying the posterior distribution for the univariate Gaussian with known variance (→ Proof III/1.2.7):
$$\mu_0^{(1)} = 0 \; , \quad \lambda_0^{(1)} = 0 \; , \quad \mu_n^{(1)} = \frac{\tau n_1\bar{y}_1^{(i)} + \lambda_0^{(1)}\mu_0^{(1)}}{\tau n_1 + \lambda_0^{(1)}} = \bar{y}_1^{(i)} \; , \quad \lambda_n^{(1)} = \tau n_1 + \lambda_0^{(1)} = \tau n_1 \; . \quad (8)$$
This results in a posterior characterized by $\mu_n^{(1)}$ and $\lambda_n^{(1)}$. Then, the test data $y_2^{(i)}$ are analyzed using this posterior as an informative prior distribution (→ Definition I/5.2.3), again applying the posterior distribution for the univariate Gaussian with known variance (→ Proof III/1.2.7):
$$\mu_0^{(2)} = \mu_n^{(1)} = \bar{y}_1^{(i)} \; , \quad \lambda_0^{(2)} = \lambda_n^{(1)} = \tau n_1 \; , \quad \mu_n^{(2)} = \frac{\tau n_2\bar{y}_2^{(i)} + \lambda_0^{(2)}\mu_0^{(2)}}{\tau n_2 + \lambda_0^{(2)}} = \bar{y} \; , \quad \lambda_n^{(2)} = \tau n_2 + \lambda_0^{(2)} = \tau n \; . \quad (9)$$
In the test data, we now have a prior characterized by $\mu_0^{(2)}/\lambda_0^{(2)}$ and a posterior characterized by $\mu_n^{(2)}/\lambda_n^{(2)}$. Applying the log model evidence for the univariate Gaussian with known variance (→ Proof III/1.2.8), the out-of-sample log model evidence (→ Definition “ooslme”) (oosLME) therefore follows as
$$\begin{aligned} \mathrm{oosLME}_i(m_1) &= \frac{n_2}{2}\log\frac{\tau}{2\pi} + \frac{1}{2}\log\frac{\lambda_0^{(2)}}{\lambda_n^{(2)}} - \frac{1}{2}\left(\tau\,y_2^{(i)\,\mathrm{T}} y_2^{(i)} + \lambda_0^{(2)}\big(\mu_0^{(2)}\big)^2 - \lambda_n^{(2)}\big(\mu_n^{(2)}\big)^2\right) \\ &= \frac{n_2}{2}\log\frac{\tau}{2\pi} + \frac{1}{2}\log\frac{n_1}{n} - \frac{1}{2}\left(\tau\,y_2^{(i)\,\mathrm{T}} y_2^{(i)} + \frac{\tau}{n_1}\big(n_1\bar{y}_1^{(i)}\big)^2 - \frac{\tau}{n}(n\bar{y})^2\right) \; . \end{aligned} \quad (10)$$
Again, because the cross-validated log model evidence is the sum of out-of-sample log model evidences
(→ Definition IV/3.1.8) over cross-validation folds, the cvLME of m1 becomes:
$$\begin{aligned} \mathrm{cvLME}(m_1) &= \sum_{i=1}^{S}\mathrm{oosLME}_i(m_1) \\ &= \sum_{i=1}^{S}\left(\frac{n_2}{2}\log\frac{\tau}{2\pi} + \frac{1}{2}\log\frac{n_1}{n} - \frac{1}{2}\left(\tau\,y_2^{(i)\,\mathrm{T}} y_2^{(i)} + \frac{\tau}{n_1}\big(n_1\bar{y}_1^{(i)}\big)^2 - \frac{\tau}{n}(n\bar{y})^2\right)\right) \\ &= \frac{S\cdot n_2}{2}\log\frac{\tau}{2\pi} + \frac{S}{2}\log\frac{n_1}{n} - \frac{\tau}{2}\sum_{i=1}^{S}\left(y_2^{(i)\,\mathrm{T}} y_2^{(i)} + \frac{\big(n_1\bar{y}_1^{(i)}\big)^2}{n_1} - \frac{(n\bar{y})^2}{n}\right) \\ &= \frac{n}{2}\log\frac{\tau}{2\pi} + \frac{S}{2}\log\frac{S-1}{S} - \frac{\tau}{2}\left(y^\mathrm{T} y + \sum_{i=1}^{S}\left(\frac{\big(n_1\bar{y}_1^{(i)}\big)^2}{n_1} - \frac{(n\bar{y})^2}{n}\right)\right) \; . \end{aligned} \quad (11)$$
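As an illustrative aside (not part of the source text), the closed form in (11) can be cross-checked against a direct fold-wise computation: summing general log model evidences with each fold's training-data posterior as prior should give the same number. The following Python sketch uses our own function names:

```python
import math

def lme(y, tau, mu0, lam0):
    """Log model evidence for the univariate Gaussian with known variance."""
    n = len(y)
    ybar = sum(y) / n
    yty = sum(v * v for v in y)
    lam_n = lam0 + tau * n
    mu_n = (lam0 * mu0 + tau * n * ybar) / lam_n
    return (n / 2 * math.log(tau / (2 * math.pi)) + 0.5 * math.log(lam0 / lam_n)
            - 0.5 * (tau * yty + lam0 * mu0 ** 2 - lam_n * mu_n ** 2))

def cvlme_m1_folds(y, tau, S):
    """Sum of out-of-sample LMEs, using each fold's training mean as prior."""
    n = len(y)
    n2 = n // S
    total = 0.0
    for i in range(S):
        y2 = y[i * n2:(i + 1) * n2]
        y1 = y[:i * n2] + y[(i + 1) * n2:]
        total += lme(y2, tau, mu0=sum(y1) / len(y1), lam0=tau * len(y1))
    return total

def cvlme_m1_closed(y, tau, S):
    """Closed-form cvLME(m1)."""
    n = len(y)
    n1, n2 = (S - 1) * n // S, n // S
    ybar = sum(y) / n
    yty = sum(v * v for v in y)
    s = 0.0
    for i in range(S):
        y1 = y[:i * n2] + y[(i + 1) * n2:]
        ybar1 = sum(y1) / n1
        s += (n1 * ybar1) ** 2 / n1 - (n * ybar) ** 2 / n
    return (n / 2 * math.log(tau / (2 * math.pi))
            + S / 2 * math.log((S - 1) / S) - tau / 2 * (yty + s))

y = [0.5, 1.2, -0.3, 0.8, 1.5, 0.1]
a = cvlme_m1_folds(y, tau=2.0, S=3)
b = cvlme_m1_closed(y, tau=2.0, S=3)
```

Both routes should agree up to floating-point error.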
Sources:
• original work
Metadata: ID: P217 | shortcut: ugkv-cvlme | author: JoramSoch | date: 2021-03-24, 10:57.
$$m_0: \; y_i \sim \mathcal{N}(\mu, \sigma^2), \quad \mu = 0$$
$$m_1: \; y_i \sim \mathcal{N}(\mu, \sigma^2), \quad \mu \sim \mathcal{N}\!\left(\mu_0, \lambda_0^{-1}\right) \; . \quad (2)$$
Then, the cross-validated (→ Definition IV/3.1.8) log Bayes factor (→ Definition IV/3.3.6) in favor
of m1 against m0 is
$$\mathrm{cvLBF}_{10} = \frac{S}{2}\log\frac{S-1}{S} - \frac{\tau}{2}\sum_{i=1}^{S}\left(\frac{\big(n_1\bar{y}_1^{(i)}\big)^2}{n_1} - \frac{(n\bar{y})^2}{n}\right) \quad (3)$$
where ȳ is the sample mean (→ Definition I/1.7.2), τ = 1/σ² is the inverse variance or precision (→ Definition I/1.8.12), $y_1^{(i)}$ are the training data in the i-th cross-validation fold and S is the number of data subsets (→ Definition IV/3.1.8).
Proof: The relationship between log Bayes factor and log model evidences (→ Proof IV/3.3.8) also
holds for the cross-validated log Bayes factor (→ Definition IV/3.3.6) (cvLBF) and cross-validated log
model evidences (→ Definition IV/3.1.8) (cvLME):
$$\mathrm{cvLME}(m_0) = \frac{n}{2}\log\frac{\tau}{2\pi} - \frac{1}{2}\tau\,y^\mathrm{T} y$$
$$\mathrm{cvLME}(m_1) = \frac{n}{2}\log\frac{\tau}{2\pi} + \frac{S}{2}\log\frac{S-1}{S} - \frac{\tau}{2}\left(y^\mathrm{T} y + \sum_{i=1}^{S}\left(\frac{\big(n_1\bar{y}_1^{(i)}\big)^2}{n_1} - \frac{(n\bar{y})^2}{n}\right)\right) \; . \quad (5)$$
Subtracting the two cvLMEs from each other, the cvLBF emerges as
$$\mathrm{cvLBF}_{10} = \mathrm{cvLME}(m_1) - \mathrm{cvLME}(m_0) = \frac{S}{2}\log\frac{S-1}{S} - \frac{\tau}{2}\sum_{i=1}^{S}\left(\frac{\big(n_1\bar{y}_1^{(i)}\big)^2}{n_1} - \frac{(n\bar{y})^2}{n}\right) \; .$$
Sources:
• original work
Metadata: ID: P218 | shortcut: ugkv-cvlbf | author: JoramSoch | date: 2021-03-24, 11:13.
$$m_0: \; y_i \sim \mathcal{N}(\mu, \sigma^2), \quad \mu = 0$$
$$m_1: \; y_i \sim \mathcal{N}(\mu, \sigma^2), \quad \mu \sim \mathcal{N}\!\left(\mu_0, \lambda_0^{-1}\right) \; . \quad (2)$$
Then, the expectation (→ Definition I/1.7.1) of the cross-validated (→ Definition IV/3.1.8) log Bayes
factor (→ Definition IV/3.3.6) (cvLBF) in favor of m1 against m0 is
$$\langle\mathrm{cvLBF}_{10}\rangle = \frac{S}{2}\log\frac{S-1}{S} + \frac{1}{2}\tau n\mu^2 \quad (3)$$
where τ = 1/σ 2 is the inverse variance or precision (→ Definition I/1.8.12) and S is the number of
data subsets (→ Definition IV/3.1.8).
Proof: The cross-validated log Bayes factor for the univariate Gaussian with known variance (→
Proof III/1.2.13) is
$$\mathrm{cvLBF}_{10} = \frac{S}{2}\log\frac{S-1}{S} - \frac{\tau}{2}\sum_{i=1}^{S}\left(\frac{\big(n_1\bar{y}_1^{(i)}\big)^2}{n_1} - \frac{(n\bar{y})^2}{n}\right) \; . \quad (4)$$
From (1), we know that the data are distributed as yi ∼ N (µ, σ 2 ), such that we can derive the
expectation (→ Definition I/1.7.1) of $(n\bar{y})^2$ and $\big(n_1\bar{y}_1^{(i)}\big)^2$ as follows:
$$\begin{aligned} \left\langle(n\bar{y})^2\right\rangle &= \left\langle\sum_{i=1}^{n}\sum_{j=1}^{n} y_i y_j\right\rangle = n\left\langle y_i^2\right\rangle + (n^2 - n)\left\langle y_i y_j\right\rangle_{i\neq j} \\ &= n(\mu^2 + \sigma^2) + (n^2 - n)\mu^2 = n^2\mu^2 + n\sigma^2 \end{aligned} \quad (5)$$
and, analogously, $\big\langle\big(n_1\bar{y}_1^{(i)}\big)^2\big\rangle = n_1^2\mu^2 + n_1\sigma^2$.
Applying this expected value (→ Definition I/1.7.1) to (4), the expected cvLBF emerges as:
$$\begin{aligned} \langle\mathrm{cvLBF}_{10}\rangle &= \left\langle\frac{S}{2}\log\frac{S-1}{S} - \frac{\tau}{2}\sum_{i=1}^{S}\left(\frac{\big(n_1\bar{y}_1^{(i)}\big)^2}{n_1} - \frac{(n\bar{y})^2}{n}\right)\right\rangle \\ &= \frac{S}{2}\log\frac{S-1}{S} - \frac{\tau}{2}\sum_{i=1}^{S}\left(\frac{\big\langle\big(n_1\bar{y}_1^{(i)}\big)^2\big\rangle}{n_1} - \frac{\left\langle(n\bar{y})^2\right\rangle}{n}\right) \\ &\overset{(5)}{=} \frac{S}{2}\log\frac{S-1}{S} - \frac{\tau}{2}\sum_{i=1}^{S}\left(\frac{n_1^2\mu^2 + n_1\sigma^2}{n_1} - \frac{n^2\mu^2 + n\sigma^2}{n}\right) \\ &= \frac{S}{2}\log\frac{S-1}{S} - \frac{\tau}{2}\sum_{i=1}^{S}\left[\left(n_1\mu^2 + \sigma^2\right) - \left(n\mu^2 + \sigma^2\right)\right] \\ &= \frac{S}{2}\log\frac{S-1}{S} - \frac{\tau}{2}\sum_{i=1}^{S}(n_1 - n)\mu^2 \end{aligned} \quad (6)$$
$$\begin{aligned} \langle\mathrm{cvLBF}_{10}\rangle &= \frac{S}{2}\log\frac{S-1}{S} - \frac{\tau}{2}\sum_{i=1}^{S}(-n_2)\mu^2 \\ &= \frac{S}{2}\log\frac{S-1}{S} + \frac{1}{2}\tau n\mu^2 \; . \end{aligned} \quad (7)$$
Sources:
• original work
Metadata: ID: P219 | shortcut: ugkv-cvlbfmean | author: JoramSoch | date: 2021-03-24, 12:27.
y = β0 + β1 x + ε , (1)
together with a statement asserting a normal distribution (→ Definition II/4.1.1) for ε
ε ∼ N (0, σ 2 V ) (2)
is called a univariate simple regression model or simply, “simple linear regression”.
• y is called “dependent variable”, “measured data” or “signal”;
• x is called “independent variable”, “predictor” or “covariate”;
• V is called “covariance matrix” or “covariance structure”;
• β1 is called “slope of the regression line (→ Definition III/1.3.10)”;
• β0 is called “intercept of the regression line (→ Definition III/1.3.10)”;
• ε is called “noise”, “errors” or “error terms”;
• σ 2 is called “noise variance” or “error variance”;
• n is the number of observations.
When the covariance structure V is equal to the n × n identity matrix, this is called simple linear
regression with independent and identically distributed (i.i.d.) observations:
$$V = I_n \;\Rightarrow\; \varepsilon \sim \mathcal{N}(0, \sigma^2 I_n) \;\Rightarrow\; \varepsilon_i \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2) \; . \quad (3)$$
In this case, the linear regression model can also be written as
yi = β0 + β1 xi + εi , εi ∼ N (0, σ 2 ) . (4)
Otherwise, it is called simple linear regression with correlated observations.
Sources:
• Wikipedia (2021): “Simple linear regression”; in: Wikipedia, the free encyclopedia, retrieved on
2021-10-27; URL: https://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_
line.
Metadata: ID: D163 | shortcut: slr | author: JoramSoch | date: 2021-10-27, 07:07.
Proof: Without loss of generality, consider the simple linear regression case with uncorrelated errors
(→ Definition III/1.3.1):
yi = β0 + β1 xi + εi , εi ∼ N (0, σ 2 ), i = 1, . . . , n . (2)
In matrix notation and using the multivariate normal distribution (→ Definition II/4.1.1), this can
also be written as
$$y = \beta_0 1_n + \beta_1 x + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$$
$$y = \begin{bmatrix} 1_n & x \end{bmatrix}\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_n) \; . \quad (3)$$
Comparing with the multiple linear regression equations for uncorrelated errors (→ Definition III/1.4.1),
we finally note:
$$y = X\beta + \varepsilon \quad \text{with} \quad X = \begin{bmatrix} 1_n & x \end{bmatrix} \quad \text{and} \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} \; . \quad (4)$$
In the case of correlated observations (→ Definition III/1.3.1), the error distribution changes to (→
Definition III/1.4.1):
ε ∼ N (0, σ 2 V ) . (5)
Sources:
• original work
Metadata: ID: P281 | shortcut: slr-mlr | author: JoramSoch | date: 2021-11-09, 07:57.
y = β0 + β1 x + ε, εi ∼ N (0, σ 2 ), i = 1, . . . , n , (1)
the parameters minimizing the residual sum of squares (→ Definition III/1.4.7) are given by
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} \; , \quad \hat{\beta}_1 = \frac{s_{xy}}{s_x^2} \quad (2)$$
where x̄ and ȳ are the sample means (→ Definition I/1.7.2), s2x is the sample variance (→ Definition
I/1.8.2) of x and sxy is the sample covariance (→ Definition I/1.9.2) between x and y.
Proof: The derivatives of the residual sum of squares $\mathrm{RSS}(\beta_0,\beta_1) = \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2$ with respect to the parameters are
$$\begin{aligned} \frac{\mathrm{d}\,\mathrm{RSS}(\beta_0,\beta_1)}{\mathrm{d}\beta_0} &= \sum_{i=1}^{n} 2(y_i - \beta_0 - \beta_1 x_i)(-1) = -2\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i) \\ \frac{\mathrm{d}\,\mathrm{RSS}(\beta_0,\beta_1)}{\mathrm{d}\beta_1} &= \sum_{i=1}^{n} 2(y_i - \beta_0 - \beta_1 x_i)(-x_i) = -2\sum_{i=1}^{n}\left(x_i y_i - \beta_0 x_i - \beta_1 x_i^2\right) \; . \end{aligned} \quad (4)$$
Setting these derivatives to zero gives
$$0 = -2\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) \; , \quad 0 = -2\sum_{i=1}^{n}\left(x_i y_i - \hat{\beta}_0 x_i - \hat{\beta}_1 x_i^2\right) \quad (5)$$
which can be rearranged into the normal equations
$$\hat{\beta}_1\sum_{i=1}^{n} x_i + \hat{\beta}_0\cdot n = \sum_{i=1}^{n} y_i \; , \quad \hat{\beta}_1\sum_{i=1}^{n} x_i^2 + \hat{\beta}_0\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i y_i \; . \quad (6)$$
From the first equation, we can derive the estimate for the intercept:
$$\hat{\beta}_0 = \frac{1}{n}\sum_{i=1}^{n} y_i - \hat{\beta}_1\cdot\frac{1}{n}\sum_{i=1}^{n} x_i = \bar{y} - \hat{\beta}_1\bar{x} \; . \quad (7)$$
From the second equation, we can derive the estimate for the slope:
$$\begin{aligned} \hat{\beta}_1\sum_{i=1}^{n} x_i^2 + \hat{\beta}_0\sum_{i=1}^{n} x_i &= \sum_{i=1}^{n} x_i y_i \\ \hat{\beta}_1\sum_{i=1}^{n} x_i^2 + \left(\bar{y} - \hat{\beta}_1\bar{x}\right)\sum_{i=1}^{n} x_i &\overset{(7)}{=} \sum_{i=1}^{n} x_i y_i \\ \hat{\beta}_1\left(\sum_{i=1}^{n} x_i^2 - \bar{x}\sum_{i=1}^{n} x_i\right) &= \sum_{i=1}^{n} x_i y_i - \bar{y}\sum_{i=1}^{n} x_i \\ \hat{\beta}_1 &= \frac{\sum_{i=1}^{n} x_i y_i - \bar{y}\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i^2 - \bar{x}\sum_{i=1}^{n} x_i} \; . \end{aligned} \quad (8)$$
Here, the numerator can be rewritten as
$$\begin{aligned} \sum_{i=1}^{n} x_i y_i - \bar{y}\sum_{i=1}^{n} x_i &= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} \\ &= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} - n\bar{x}\bar{y} + n\bar{x}\bar{y} \\ &= \sum_{i=1}^{n} x_i y_i - \bar{y}\sum_{i=1}^{n} x_i - \bar{x}\sum_{i=1}^{n} y_i + \sum_{i=1}^{n}\bar{x}\bar{y} \\ &= \sum_{i=1}^{n}\left(x_i y_i - x_i\bar{y} - \bar{x} y_i + \bar{x}\bar{y}\right) \\ &= \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \end{aligned} \quad (9)$$
and the denominator as
$$\begin{aligned} \sum_{i=1}^{n} x_i^2 - \bar{x}\sum_{i=1}^{n} x_i &= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \\ &= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}\bar{x} + n\bar{x}^2 \\ &= \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + \sum_{i=1}^{n}\bar{x}^2 \\ &= \sum_{i=1}^{n}\left(x_i^2 - 2\bar{x} x_i + \bar{x}^2\right) \\ &= \sum_{i=1}^{n}(x_i - \bar{x})^2 \; . \end{aligned} \quad (10)$$
With (9) and (10), the estimate from (8) can be simplified as follows:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - \bar{y}\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i^2 - \bar{x}\sum_{i=1}^{n} x_i} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \; . \quad (11)$$
Together, (7) and (11) constitute the ordinary least squares parameter estimates for simple linear
regression.
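As an illustrative aside (not part of the source text), the estimates in (7) and (11) can be computed and checked against the normal equations (5). The following Python sketch uses our own function names and hypothetical data (the (n−1) factors in s_xy and s_x² cancel in the ratio, so raw sums suffice):

```python
def ols_simple(x, y):
    """OLS estimates: slope = s_xy / s_x^2, intercept = ybar - slope * xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
b0, b1 = ols_simple(x, y)
# The residuals satisfy the normal equations: they sum to zero
# and are orthogonal to the covariate x.
res = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
```

For this data set the slope evaluates to 1.96 and both normal-equation residual sums vanish.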
Sources:
• Penny, William (2006): “Linear regression”; in: Mathematics for Brain Imaging, ch. 1.2.2, pp.
14-16, eqs. 1.24/1.25; URL: https://ueapsylabs.co.uk/sites/wpenny/mbi/mbi_course.pdf.
• Wikipedia (2021): “Proofs involving ordinary least squares”; in: Wikipedia, the free encyclopedia,
retrieved on 2021-10-27; URL: https://en.wikipedia.org/wiki/Proofs_involving_ordinary_least_
squares#Derivation_of_simple_linear_regression_estimators.
Metadata: ID: P271 | shortcut: slr-ols | author: JoramSoch | date: 2021-10-27, 08:56.
yi = β0 + β1 xi + εi , εi ∼ N (0, σ 2 ), i = 1, . . . , n , (1)
the parameters minimizing the residual sum of squares (→ Definition III/1.4.7) are given by
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} \; , \quad \hat{\beta}_1 = \frac{s_{xy}}{s_x^2} \quad (2)$$
where x̄ and ȳ are the sample means (→ Definition I/1.7.2), s2x is the sample variance (→ Definition
I/1.8.2) of x and sxy is the sample covariance (→ Definition I/1.9.2) between x and y.
Proof: Simple linear regression is a special case of multiple linear regression (→ Proof III/1.3.2)
with
$$X = \begin{bmatrix} 1_n & x \end{bmatrix} \quad \text{and} \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} \quad (3)$$
and ordinary least squares estimates (→ Proof III/1.4.3) are given by
β̂ = (X T X)−1 X T y . (4)
Writing out equation (4), we have
$$\begin{aligned} \hat{\beta} &= \left(\begin{bmatrix} 1_n^\mathrm{T} \\ x^\mathrm{T} \end{bmatrix}\begin{bmatrix} 1_n & x \end{bmatrix}\right)^{-1}\begin{bmatrix} 1_n^\mathrm{T} \\ x^\mathrm{T} \end{bmatrix} y \\ &= \begin{bmatrix} n & n\bar{x} \\ n\bar{x} & x^\mathrm{T} x \end{bmatrix}^{-1}\begin{bmatrix} n\bar{y} \\ x^\mathrm{T} y \end{bmatrix} \\ &= \frac{1}{n\,x^\mathrm{T} x - (n\bar{x})^2}\begin{bmatrix} x^\mathrm{T} x & -n\bar{x} \\ -n\bar{x} & n \end{bmatrix}\begin{bmatrix} n\bar{y} \\ x^\mathrm{T} y \end{bmatrix} \\ &= \frac{1}{n\,x^\mathrm{T} x - (n\bar{x})^2}\begin{bmatrix} n\bar{y}\,x^\mathrm{T} x - n\bar{x}\,x^\mathrm{T} y \\ n\,x^\mathrm{T} y - (n\bar{x})(n\bar{y}) \end{bmatrix} \; . \end{aligned} \quad (5)$$
The second entry yields the slope estimate:
$$\begin{aligned} \hat{\beta}_1 &= \frac{n\,x^\mathrm{T} y - (n\bar{x})(n\bar{y})}{n\,x^\mathrm{T} x - (n\bar{x})^2} = \frac{x^\mathrm{T} y - n\bar{x}\bar{y}}{x^\mathrm{T} x - n\bar{x}^2} \\ &= \frac{\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n}\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n}\bar{x}^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2} \; . \end{aligned} \quad (6)$$
The first entry yields the intercept estimate:
$$\begin{aligned} \hat{\beta}_0 &= \frac{n\bar{y}\,x^\mathrm{T} x - n\bar{x}\,x^\mathrm{T} y}{n\,x^\mathrm{T} x - (n\bar{x})^2} = \frac{\bar{y}\,x^\mathrm{T} x - \bar{x}\,x^\mathrm{T} y}{x^\mathrm{T} x - n\bar{x}^2} \\ &= \frac{\bar{y}\,x^\mathrm{T} x - \bar{x}\,x^\mathrm{T} y + n\bar{x}^2\bar{y} - n\bar{x}^2\bar{y}}{x^\mathrm{T} x - n\bar{x}^2} = \frac{\bar{y}\left(x^\mathrm{T} x - n\bar{x}^2\right) - \bar{x}\left(x^\mathrm{T} y - n\bar{x}\bar{y}\right)}{x^\mathrm{T} x - n\bar{x}^2} \\ &= \bar{y} - \bar{x}\,\frac{x^\mathrm{T} y - n\bar{x}\bar{y}}{x^\mathrm{T} x - n\bar{x}^2} = \bar{y} - \bar{x}\,\frac{\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n}\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n}\bar{x}^2} \\ &= \bar{y} - \hat{\beta}_1\bar{x} \; . \end{aligned} \quad (7)$$
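As an illustrative aside (not part of the source text), the matrix route in (5) and the scalar formulas must give identical numbers. The following Python sketch implements the explicit 2×2 inverse from (5) and compares it against the simple-regression formulas; function names and data are our own:

```python
def ols_matrix(x, y):
    """beta = (X^T X)^{-1} X^T y for X = [1_n, x], via the explicit 2x2 inverse."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)          # x^T x
    sxy = sum(a * b for a, b in zip(x, y))  # x^T y
    det = n * sxx - sx * sx              # n * x^T x - (n * xbar)^2
    b0 = (sy * sxx - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1

x = [0.5, 1.5, 2.0, 3.5, 4.0]
y = [1.2, 2.6, 2.9, 5.1, 5.6]
m0, m1 = ols_matrix(x, y)

# Scalar route from the simple-regression formulas
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sum((a - xbar) ** 2 for a in x)
b0 = ybar - b1 * xbar
```

Both routes should agree up to floating-point rounding.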
Sources:
• original work
Metadata: ID: P288 | shortcut: slr-ols2 | author: JoramSoch | date: 2021-11-16, 09:36.
y = β0 + β1 x + ε, εi ∼ N (0, σ 2 ), i = 1, . . . , n (1)
and consider estimation using ordinary least squares (→ Proof III/1.3.3). Then, the expected values
(→ Definition I/1.7.1) of the estimated parameters are
$$\mathrm{E}(\hat{\beta}_0) = \beta_0 \; , \quad \mathrm{E}(\hat{\beta}_1) = \beta_1 \quad (2)$$
which means that the ordinary least squares solution (→ Proof III/1.3.3) produces unbiased estima-
tors (→ Definition “est-unb”).
Proof: According to the simple linear regression model in (1), the expectation of a single data point
is
E(yi ) = β0 + β1 xi . (3)
The ordinary least squares estimates for simple linear regression (→ Proof III/1.3.3) are given by
$$\hat{\beta}_0 = \frac{1}{n}\sum_{i=1}^{n} y_i - \hat{\beta}_1\cdot\frac{1}{n}\sum_{i=1}^{n} x_i \; , \quad \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \; . \quad (4)$$
Defining
$$c_i = \frac{x_i - \bar{x}}{\sum_{j=1}^{n}(x_j - \bar{x})^2} \; , \quad (5)$$
we note that
$$\sum_{i=1}^{n} c_i = \frac{\sum_{i=1}^{n}(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i - n\bar{x}}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{n\bar{x} - n\bar{x}}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = 0 \; , \quad (6)$$
and
$$\sum_{i=1}^{n} c_i x_i = \frac{\sum_{i=1}^{n}(x_i - \bar{x})x_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n}\left(x_i^2 - \bar{x} x_i\right)}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i^2 - 2n\bar{x}\bar{x} + n\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n}\left(x_i^2 - 2\bar{x} x_i + \bar{x}^2\right)}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = 1 \; . \quad (7)$$
With (5), the estimate for the slope from (4) becomes
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sum_{i=1}^{n} c_i(y_i - \bar{y}) = \sum_{i=1}^{n} c_i y_i - \bar{y}\sum_{i=1}^{n} c_i = \sum_{i=1}^{n} c_i y_i \; , \quad (8)$$
such that, with (3), (6) and (7), the expectation of the slope estimate is
$$\mathrm{E}(\hat{\beta}_1) = \sum_{i=1}^{n} c_i\,\mathrm{E}(y_i) = \sum_{i=1}^{n} c_i(\beta_0 + \beta_1 x_i) = \beta_0\sum_{i=1}^{n} c_i + \beta_1\sum_{i=1}^{n} c_i x_i = \beta_1 \; . \quad (9)$$
Finally, with (3) and (9), the expectation of the intercept estimate from (4) becomes
$$\begin{aligned} \mathrm{E}(\hat{\beta}_0) &= \mathrm{E}\!\left(\frac{1}{n}\sum_{i=1}^{n} y_i - \hat{\beta}_1\cdot\frac{1}{n}\sum_{i=1}^{n} x_i\right) \\ &= \frac{1}{n}\sum_{i=1}^{n}\mathrm{E}(y_i) - \mathrm{E}(\hat{\beta}_1)\cdot\bar{x} \\ &= \frac{1}{n}\sum_{i=1}^{n}(\beta_0 + \beta_1 x_i) - \beta_1\cdot\bar{x} \\ &= \beta_0 + \beta_1\bar{x} - \beta_1\bar{x} = \beta_0 \; . \end{aligned} \quad (10)$$
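As an illustrative aside (not part of the source text), unbiasedness can also be observed empirically: averaging the OLS estimates over many simulated data sets should recover the true parameters. The following Python sketch uses our own variable names and assumed true values:

```python
import random

random.seed(42)
beta0, beta1, sigma = 1.0, 2.0, 1.0
x = [i / 10 for i in range(1, 21)]           # fixed design, n = 20
n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

reps = 20000
sum_b0, sum_b1 = 0.0, 0.0
for _ in range(reps):
    y = [beta0 + beta1 * xi + random.gauss(0.0, sigma) for xi in x]
    ybar = sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    sum_b1 += b1
    sum_b0 += ybar - b1 * xbar
mean_b0, mean_b1 = sum_b0 / reps, sum_b1 / reps
```

With 20,000 repetitions, both averages should sit very close to the true β0 = 1 and β1 = 2.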
Sources:
• Penny, William (2006): “Finding the uncertainty in estimating the slope”; in: Mathematics for
Brain Imaging, ch. 1.2.4, pp. 18-20, eq. 1.37; URL: https://ueapsylabs.co.uk/sites/wpenny/mbi/
mbi_course.pdf.
• Wikipedia (2021): “Proofs involving ordinary least squares”; in: Wikipedia, the free encyclopedia,
retrieved on 2021-10-27; URL: https://en.wikipedia.org/wiki/Proofs_involving_ordinary_least_
squares#Unbiasedness_and_variance_of_%7F’%22%60UNIQ--postMath-00000037-QINU%60%
22’%7F.
Metadata: ID: P272 | shortcut: slr-olsmean | author: JoramSoch | date: 2021-10-27, 09:54.
y = β0 + β1 x + ε, εi ∼ N (0, σ 2 ), i = 1, . . . , n (1)
and consider estimation using ordinary least squares (→ Proof III/1.3.3). Then, the variances (→
Definition I/1.8.1) of the estimated parameters are
$$\mathrm{Var}(\hat{\beta}_0) = \frac{x^\mathrm{T} x}{n}\cdot\frac{\sigma^2}{(n-1)s_x^2} \; , \quad \mathrm{Var}(\hat{\beta}_1) = \frac{\sigma^2}{(n-1)s_x^2} \quad (2)$$
where s2x is the sample variance (→ Definition I/1.8.2) of x and xT x is the sum of squared values of
the covariate.
Proof: According to the simple linear regression model in (1), the variance of a single data point is $\mathrm{Var}(y_i) = \sigma^2$ (3). The ordinary least squares estimates for simple linear regression (→ Proof III/1.3.3) are given by
$$\hat{\beta}_0 = \frac{1}{n}\sum_{i=1}^{n} y_i - \hat{\beta}_1\cdot\frac{1}{n}\sum_{i=1}^{n} x_i \; , \quad \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \; . \quad (4)$$
Defining
$$c_i = \frac{x_i - \bar{x}}{\sum_{j=1}^{n}(x_j - \bar{x})^2} \; , \quad (5)$$
we note that
$$\sum_{i=1}^{n} c_i^2 = \sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{\sum_{j=1}^{n}(x_j - \bar{x})^2}\right)^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{\left[\sum_{i=1}^{n}(x_i - \bar{x})^2\right]^2} = \frac{1}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \; . \quad (6)$$
With (5), the estimate for the slope from (4) becomes
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sum_{i=1}^{n} c_i(y_i - \bar{y}) = \sum_{i=1}^{n} c_i y_i - \bar{y}\sum_{i=1}^{n} c_i \quad (7)$$
and with (3) and (6) as well as invariance (→ Proof I/1.8.6), scaling (→ Proof I/1.8.7) and additivity
(→ Proof I/1.8.10) of the variance, the variance of β̂1 is:
$$\begin{aligned} \mathrm{Var}(\hat{\beta}_1) &= \mathrm{Var}\!\left(\sum_{i=1}^{n} c_i y_i - \bar{y}\sum_{i=1}^{n} c_i\right) = \mathrm{Var}\!\left(\sum_{i=1}^{n} c_i y_i\right) \\ &= \sum_{i=1}^{n} c_i^2\,\mathrm{Var}(y_i) = \sigma^2\sum_{i=1}^{n} c_i^2 \\ &= \sigma^2\,\frac{1}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sigma^2}{(n-1)\,\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} \\ &= \frac{\sigma^2}{(n-1)s_x^2} \; . \end{aligned} \quad (8)$$
Finally, with (3) and (8), the variance of the intercept estimate from (4) becomes:
$$\begin{aligned} \mathrm{Var}(\hat{\beta}_0) &= \mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} y_i - \hat{\beta}_1\cdot\frac{1}{n}\sum_{i=1}^{n} x_i\right) \\ &= \mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right) + \mathrm{Var}\!\left(\hat{\beta}_1\cdot\bar{x}\right) \\ &= \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(y_i) + \bar{x}^2\cdot\mathrm{Var}(\hat{\beta}_1) \\ &= \frac{1}{n^2}\sum_{i=1}^{n}\sigma^2 + \bar{x}^2\,\frac{\sigma^2}{(n-1)s_x^2} \\ &= \frac{\sigma^2}{n} + \frac{\sigma^2\bar{x}^2}{(n-1)s_x^2} \; . \end{aligned} \quad (9)$$
Applying the formula for the sample variance (→ Definition I/1.8.2) s2x , we finally get:
$$\begin{aligned} \mathrm{Var}(\hat{\beta}_0) &= \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right) \\ &= \sigma^2\,\frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 + \bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \\ &= \sigma^2\,\frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_i^2 - 2\bar{x} x_i + \bar{x}^2\right) + \bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \\ &= \sigma^2\,\frac{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - 2\bar{x}\cdot\frac{1}{n}\sum_{i=1}^{n} x_i + \bar{x}^2 + \bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \\ &= \sigma^2\,\frac{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - 2\bar{x}^2 + 2\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \\ &= \sigma^2\,\frac{\frac{1}{n}\sum_{i=1}^{n} x_i^2}{(n-1)\,\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} \\ &= \frac{x^\mathrm{T} x}{n}\cdot\frac{\sigma^2}{(n-1)s_x^2} \; . \end{aligned} \quad (10)$$
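As an illustrative aside (not part of the source text), the variance formulas in (2) can be checked by Monte Carlo simulation: the empirical variances of the OLS estimates across many simulated data sets should match the theoretical values. The following Python sketch uses our own variable names and assumed parameter values:

```python
import random

random.seed(7)
beta0, beta1, sigma = 0.5, -1.5, 2.0
x = [0.2 * i for i in range(1, 16)]          # fixed design, n = 15
n = len(x)
xbar = sum(x) / n
ssq = sum((xi - xbar) ** 2 for xi in x)      # = (n - 1) * s_x^2
xtx = sum(xi * xi for xi in x)               # x^T x
var_b1_theory = sigma ** 2 / ssq
var_b0_theory = xtx / n * sigma ** 2 / ssq

reps = 20000
b0s, b1s = [], []
for _ in range(reps):
    y = [beta0 + beta1 * xi + random.gauss(0.0, sigma) for xi in x]
    ybar = sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ssq
    b1s.append(b1)
    b0s.append(ybar - b1 * xbar)

def emp_var(v):
    m = sum(v) / len(v)
    return sum((u - m) ** 2 for u in v) / len(v)
```

The empirical variances should lie within a few percent of the theoretical ones.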
Sources:
• Penny, William (2006): “Finding the uncertainty in estimating the slope”; in: Mathematics for
Brain Imaging, ch. 1.2.4, pp. 18-20, eq. 1.37; URL: https://ueapsylabs.co.uk/sites/wpenny/mbi/
mbi_course.pdf.
• Wikipedia (2021): “Proofs involving ordinary least squares”; in: Wikipedia, the free encyclopedia,
retrieved on 2021-10-27; URL: https://en.wikipedia.org/wiki/Proofs_involving_ordinary_least_
squares#Unbiasedness_and_variance_of_%7F’%22%60UNIQ--postMath-00000037-QINU%60%
22’%7F.
Metadata: ID: P273 | shortcut: slr-olsvar | author: JoramSoch | date: 2021-10-27, 11:53.
y = β0 + β1 x + ε, εi ∼ N (0, σ 2 ) (1)
and consider estimation using ordinary least squares (→ Proof III/1.3.3). Then, the estimated pa-
rameters are normally distributed (→ Definition II/4.1.1) as
$$\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix} \sim \mathcal{N}\!\left(\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix},\ \frac{\sigma^2}{(n-1)s_x^2}\begin{bmatrix} x^\mathrm{T} x/n & -\bar{x} \\ -\bar{x} & 1 \end{bmatrix}\right) \quad (2)$$
where x̄ is the sample mean (→ Definition I/1.7.2) and s2x is the sample variance (→ Definition
I/1.8.2) of x.
Proof: Simple linear regression is a special case of multiple linear regression (→ Proof III/1.3.2)
with
$$X = \begin{bmatrix} 1_n & x \end{bmatrix} \quad \text{and} \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} \; , \quad (3)$$
such that (1) can also be written as
y = Xβ + ε, ε ∼ N (0, σ 2 In ) (4)
and ordinary least squares estimates (→ Proof III/1.4.3) are given by
β̂ = (X T X)−1 X T y . (5)
From (4) and the linear transformation theorem for the multivariate normal distribution (→ Proof
II/4.1.8), it follows that
$$y \sim \mathcal{N}\!\left(X\beta,\ \sigma^2 I_n\right) \; . \quad (6)$$
From (5), in combination with (6) and the transformation theorem (→ Proof II/4.1.8), it follows
that
$$\hat{\beta} \sim \mathcal{N}\!\left((X^\mathrm{T} X)^{-1} X^\mathrm{T} X\beta,\ \sigma^2(X^\mathrm{T} X)^{-1} X^\mathrm{T} I_n X (X^\mathrm{T} X)^{-1}\right) = \mathcal{N}\!\left(\beta,\ \sigma^2(X^\mathrm{T} X)^{-1}\right) \; . \quad (7)$$
Applying (3), the covariance matrix (→ Definition II/4.1.1) can be further developed as follows:
$$\begin{aligned} \sigma^2(X^\mathrm{T} X)^{-1} &= \sigma^2\left(\begin{bmatrix} 1_n^\mathrm{T} \\ x^\mathrm{T} \end{bmatrix}\begin{bmatrix} 1_n & x \end{bmatrix}\right)^{-1} \\ &= \sigma^2\begin{bmatrix} n & n\bar{x} \\ n\bar{x} & x^\mathrm{T} x \end{bmatrix}^{-1} \\ &= \frac{\sigma^2}{n\,x^\mathrm{T} x - (n\bar{x})^2}\begin{bmatrix} x^\mathrm{T} x & -n\bar{x} \\ -n\bar{x} & n \end{bmatrix} \\ &= \frac{\sigma^2}{x^\mathrm{T} x - n\bar{x}^2}\begin{bmatrix} x^\mathrm{T} x/n & -\bar{x} \\ -\bar{x} & 1 \end{bmatrix} \; . \end{aligned} \quad (8)$$
Finally, the denominator can be rewritten as
$$x^\mathrm{T} x - n\bar{x}^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 = \sum_{i=1}^{n}(x_i - \bar{x})^2 = (n-1)\,s_x^2 \; .$$
Sources:
• Wikipedia (2021): “Proofs involving ordinary least squares”; in: Wikipedia, the free encyclopedia,
retrieved on 2021-11-09; URL: https://en.wikipedia.org/wiki/Proofs_involving_ordinary_least_
squares#Unbiasedness_and_variance_of_%7F’%22%60UNIQ--postMath-00000037-QINU%60%
22’%7F.
Metadata: ID: P282 | shortcut: slr-olsdist | author: JoramSoch | date: 2021-11-09, 09:09.
Proof: The parameter estimates for simple linear regression are bivariate normally distributed under
ordinary least squares (→ Proof III/1.3.7):
$$\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix} \sim \mathcal{N}\!\left(\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix},\ \frac{\sigma^2}{(n-1)s_x^2}\begin{bmatrix} x^\mathrm{T} x/n & -\bar{x} \\ -\bar{x} & 1 \end{bmatrix}\right) \; . \quad (1)$$
−x̄ 1
Because the covariance matrix (→ Definition I/1.9.9) of the multivariate normal distribution (→
Definition II/4.1.1) contains the pairwise covariances of the random variables (→ Definition I/1.2.2),
we can deduce that the covariance (→ Definition I/1.9.1) of β̂0 and β̂1 is:
$$\mathrm{Cov}\!\left(\hat{\beta}_0, \hat{\beta}_1\right) = -\frac{\sigma^2\bar{x}}{(n-1)s_x^2} \quad (2)$$
where σ 2 is the noise variance (→ Definition III/1.3.1), s2x is the sample variance (→ Definition
I/1.8.2) of x and n is the number of observations. When x is mean-centered, we have x̄ = 0, such
that:
Cov β̂0 , β̂1 = 0 . (3)
Sources:
• original work
Metadata: ID: P320 | shortcut: slr-olscorr | author: JoramSoch | date: 2022-04-14, 17:17.
Proof:
1) Under unaltered y and x, ordinary least squares estimates for simple linear regression (→ Proof
III/1.3.3) are
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} \; , \quad \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2} \quad (1)$$
with sample means (→ Definition I/1.7.2) x̄ and ȳ, sample variance (→ Definition I/1.8.2) s2x and
sample covariance (→ Definition I/1.9.2) sxy , such that β0 estimates “the mean y at x = 0”.
and we can see that β̂1 (x̃, y) = β̂1 (x, y), but β̂0 (x̃, y) ̸= β̂0 (x, y), specifically β0 now estimates “the
mean y at the mean x”.
and we can see that β̂1 (x, ỹ) = β̂1 (x, y), but β̂0 (x, ỹ) ̸= β̂0 (x, y), specifically β0 now estimates “the
mean x, multiplied with the negative slope”.
$$\tilde{x}_i = x_i - \bar{x} \;\Rightarrow\; \bar{\tilde{x}} = 0 \; , \quad \tilde{y}_i = y_i - \bar{y} \;\Rightarrow\; \bar{\tilde{y}} = 0 \; . \quad (6)$$
and we can see that β̂1 (x̃, ỹ) = β̂1 (x, y), but β̂0 (x̃, ỹ) ̸= β̂0 (x, y), specifically β0 is now forced to
become zero.
Sources:
• original work
Metadata: ID: P274 | shortcut: slr-meancent | author: JoramSoch | date: 2021-10-27, 12:38.
yi = β0 + β1 xi + εi , εi ∼ N (0, σ 2 ) . (1)
Then, given some parameters β0 , β1 ∈ R, the set
$$L(\beta_0, \beta_1) = \left\{(x, y) \in \mathbb{R}^2 \,\middle|\, y = \beta_0 + \beta_1 x\right\} \quad (2)$$
is called a “regression line” and the set
Sources:
• Wikipedia (2021): “Simple linear regression”; in: Wikipedia, the free encyclopedia, retrieved on
2021-10-27; URL: https://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_
line.
Metadata: ID: D164 | shortcut: regline | author: JoramSoch | date: 2021-10-27, 07:30.
Proof: The fitted regression line (→ Definition III/1.3.10) is described by the equation $y = \hat{\beta}_0 + \hat{\beta}_1 x$. Evaluating it at the point $(\bar{x}, \bar{y})$ and inserting $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$ (→ Proof III/1.3.3) gives
$$\begin{aligned} \bar{y} &= \hat{\beta}_0 + \hat{\beta}_1\bar{x} \\ \bar{y} &= \bar{y} - \hat{\beta}_1\bar{x} + \hat{\beta}_1\bar{x} \\ \bar{y} &= \bar{y} \; , \end{aligned} \quad (2)$$
which is a true statement. Thus, the regression line (→ Definition III/1.3.10) goes through the center
of mass point (x̄, ȳ), if the model (→ Definition III/1.3.1) includes an intercept term β0 .
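This property is easy to confirm numerically. The following sketch (arbitrary example data, assuming NumPy is available) fits the OLS line and checks that it passes through (x̄, ȳ):

```python
import numpy as np

# Small example data (arbitrary values, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# OLS estimates for simple linear regression (Proof III/1.3.3)
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

# The fitted regression line evaluated at the mean of x ...
y_at_xbar = b0 + b1 * x_bar

# ... recovers the mean of y, i.e. the line passes through (x̄, ȳ)
passes_through_center = np.isclose(y_at_xbar, y_bar)
```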
Sources:
• Wikipedia (2021): “Simple linear regression”; in: Wikipedia, the free encyclopedia, retrieved on
2021-10-27; URL: https://en.wikipedia.org/wiki/Simple_linear_regression#Numerical_properties.
Metadata: ID: P275 | shortcut: slr-comp | author: JoramSoch | date: 2021-10-27, 12:52.
Proof: The intersection point of the regression line (→ Definition III/1.3.10) with the y-axis is
S(0|β̂0 ) . (3)
Let a be a vector describing the direction of the regression line, let b be the vector pointing from S
to O and let p be the vector pointing from S to P .
Because β̂1 is the slope of the regression line, we have
a = (1, β̂1)ᵀ .   (4)
Moreover, with the points O and S, we have
b = (xₒ, yₒ)ᵀ − (0, β̂0)ᵀ = (xₒ, yₒ − β̂0)ᵀ .   (5)
Because P is located on the regression line, p is collinear with a and thus a scalar multiple of this
vector:
p=w·a. (6)
Moreover, as P is the point on the regression line which is closest to O, this means that the vector
b − p is orthogonal to a, such that the inner product of these two vectors is equal to zero:
aT (b − p) = 0 . (7)
Rearranging this equation gives
aT (b − p) = 0
aT (b − w · a) = 0
aT b − w · aT a = 0 (8)
w · aT a = aT b
aT b
w= .
aT a
With (4) and (5), w can be calculated as
w = aᵀb / aᵀa
  = (1, β̂1)(xₒ, yₒ − β̂0)ᵀ / (1, β̂1)(1, β̂1)ᵀ   (9)
  = (xₒ + (yₒ − β̂0) β̂1) / (1 + β̂1²) .
Finally, with the point S (3) and the vector p (6), the coordinates of P are obtained as
(xₚ, yₚ)ᵀ = (0, β̂0)ᵀ + w · (1, β̂1)ᵀ = (w, β̂0 + β̂1 w)ᵀ .   (10)
Together, (10) and (9) constitute the proof of equation (2).
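The projection formulas (9) and (10) can be checked numerically. In the sketch below, the intercept, slope and point O are hypothetical values chosen for illustration:

```python
import numpy as np

# Regression line y = b0 + b1*x and an external point O = (xo, yo)
b0, b1 = 1.0, 2.0          # hypothetical intercept and slope
xo, yo = 3.0, 1.5          # hypothetical point to project

# Direction vector of the line (eq. 4) and vector from S = (0, b0) to O (eq. 5)
a = np.array([1.0, b1])
b = np.array([xo, yo - b0])

# Scalar projection (eq. 9)
w = (xo + (yo - b0) * b1) / (1 + b1 ** 2)

# Coordinates of the projected point P (eq. 10)
xp, yp = w, b0 + b1 * w

# P lies on the regression line ...
on_line = np.isclose(yp, b0 + b1 * xp)
# ... and the residual vector b - p is orthogonal to the line direction (eq. 7)
p = w * a
orthogonal = np.isclose(a @ (b - p), 0.0)
```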
Sources:
• Penny, William (2006): “Projections”; in: Mathematics for Brain Imaging, ch. 1.4.10, pp. 34-35,
eqs. 1.87/1.88; URL: https://ueapsylabs.co.uk/sites/wpenny/mbi/mbi_course.pdf.
Metadata: ID: P283 | shortcut: slr-proj | author: JoramSoch | date: 2021-11-09, 10:16.
TSS = (n − 1) s²y
ESS = (n − 1) s²xy / s²x   (1)
RSS = (n − 1) (s²y − s²xy / s²x)
where s2x and s2y are the sample variances (→ Definition I/1.8.2) of x and y and sxy is the sample
covariance (→ Definition I/1.9.2) between x and y.
Proof: The ordinary least squares parameter estimates (→ Proof III/1.3.3) are given by
β̂0 = ȳ − β̂1 x̄   and   β̂1 = sxy / s²x .   (2)
The total sum of squares (→ Definition III/1.4.5) is
TSS = Σⁿᵢ₌₁ (yᵢ − ȳ)²   (3)
which can be rewritten as follows:
TSS = Σⁿᵢ₌₁ (yᵢ − ȳ)²
    = (n − 1) · [1/(n − 1)] Σⁿᵢ₌₁ (yᵢ − ȳ)²   (4)
    = (n − 1) s²y .
The explained sum of squares (→ Definition III/1.4.6) can be rewritten as follows:
ESS = Σⁿᵢ₌₁ (ŷᵢ − ȳ)²
    = Σⁿᵢ₌₁ (β̂0 + β̂1 xᵢ − ȳ)²
    =(2) Σⁿᵢ₌₁ (ȳ − β̂1 x̄ + β̂1 xᵢ − ȳ)²
    = Σⁿᵢ₌₁ (β̂1 (xᵢ − x̄))²   (6)
    =(2) Σⁿᵢ₌₁ (sxy/s²x)² (xᵢ − x̄)²
    = (s²xy/s⁴x) Σⁿᵢ₌₁ (xᵢ − x̄)²
    = (s²xy/s⁴x) (n − 1) s²x
    = (n − 1) s²xy/s²x .
The residual sum of squares (→ Definition III/1.4.7) can be rewritten as follows:
RSS = Σⁿᵢ₌₁ (yᵢ − ŷᵢ)²
    = Σⁿᵢ₌₁ (yᵢ − β̂0 − β̂1 xᵢ)²
    =(2) Σⁿᵢ₌₁ (yᵢ − ȳ + β̂1 x̄ − β̂1 xᵢ)²
    = Σⁿᵢ₌₁ ((yᵢ − ȳ) − β̂1 (xᵢ − x̄))²
    = Σⁿᵢ₌₁ [ (yᵢ − ȳ)² − 2 β̂1 (xᵢ − x̄)(yᵢ − ȳ) + β̂1² (xᵢ − x̄)² ]   (8)
    = Σⁿᵢ₌₁ (yᵢ − ȳ)² − 2 β̂1 Σⁿᵢ₌₁ (xᵢ − x̄)(yᵢ − ȳ) + β̂1² Σⁿᵢ₌₁ (xᵢ − x̄)²
    =(2) (n − 1) s²y − 2 (sxy/s²x)(n − 1) sxy + (sxy/s²x)² (n − 1) s²x
    = (n − 1) s²y − (n − 1) s²xy/s²x
    = (n − 1) (s²y − s²xy/s²x) .
Together, (4), (6) and (8) constitute the proof of equation (1).
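A short numerical sketch (arbitrary example data) confirms that the closed forms in equation (1) agree with the sums of squares computed directly from the fitted values:

```python
import numpy as np

# Example data (arbitrary, for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.4, 3.6, 5.1, 5.8])
n = len(x)

# Sample variances/covariance with the n-1 denominator (Definitions I/1.8.2, I/1.9.2)
sx2 = x.var(ddof=1)
sy2 = y.var(ddof=1)
sxy = np.cov(x, y, ddof=1)[0, 1]

# Closed-form sums of squares from equation (1)
TSS_f = (n - 1) * sy2
ESS_f = (n - 1) * sxy ** 2 / sx2
RSS_f = (n - 1) * (sy2 - sxy ** 2 / sx2)

# Direct computation from the fitted values
b1 = sxy / sx2
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
TSS = np.sum((y - y.mean()) ** 2)
ESS = np.sum((y_hat - y.mean()) ** 2)
RSS = np.sum((y - y_hat) ** 2)
```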
Sources:
• original work
Metadata: ID: P284 | shortcut: slr-sss | author: JoramSoch | date: 2021-11-09, 11:34.
E = 1/((n − 1) s²x) · [ (xᵀx/n) 1ₙᵀ − x̄ xᵀ ; −x̄ 1ₙᵀ + xᵀ ]

P = 1/((n − 1) s²x) · [ pᵢⱼ ]  with entries  pᵢⱼ = (xᵀx/n) − x̄ (xᵢ + xⱼ) + xᵢ xⱼ

R = 1/((n − 1) s²x) · [ rᵢⱼ ]  with entries  rᵢᵢ = (n − 1)(xᵀx/n) + x̄ (2xᵢ − nx̄) − xᵢ²  and  rᵢⱼ = −(xᵀx/n) + x̄ (xᵢ + xⱼ) − xᵢ xⱼ  for i ≠ j   (1)
where 1ₙ is an n × 1 vector of ones, x is the n × 1 single predictor variable, x̄ is the sample mean (→
Definition I/1.7.2) of x and s²x is the sample variance (→ Definition I/1.8.2) of x.
Proof: Simple linear regression is a special case of multiple linear regression (→ Proof III/1.3.2)
with
X = [1ₙ, x]  and  β = (β0, β1)ᵀ ,   (2)
such that the simple linear regression model can also be written as
y = Xβ + ε, ε ∼ N (0, σ 2 In ) . (3)
Moreover, we note the following equality (→ Proof III/1.3.7):
(n − 1) s²x = Σⁿᵢ₌₁ (xᵢ − x̄)² = xᵀx − n x̄² .   (4)
The estimation matrix is
E = (XᵀX)⁻¹ Xᵀ   (5)
which is a 2 × n matrix and can be reformulated as follows:
E = (XᵀX)⁻¹ Xᵀ
  = ( [1ₙᵀ; xᵀ] [1ₙ, x] )⁻¹ [1ₙᵀ; xᵀ]
  = [ n, nx̄ ; nx̄, xᵀx ]⁻¹ [1ₙᵀ; xᵀ]
  = 1/(n xᵀx − (nx̄)²) [ xᵀx, −nx̄ ; −nx̄, n ] [1ₙᵀ; xᵀ]   (6)
  = 1/(xᵀx − nx̄²) [ (xᵀx/n) 1ₙᵀ − x̄ xᵀ ; −x̄ 1ₙᵀ + xᵀ ]
  =(4) 1/((n − 1) s²x) [ (xᵀx/n) 1ₙᵀ − x̄ xᵀ ; −x̄ 1ₙᵀ + xᵀ ] .
The projection matrix is
P = X E = [1ₙ, x] [e₁; e₂]   (7)
where e₁ and e₂ denote the two rows of E, such that
P = 1/((n − 1) s²x) · [1ₙ, x] [ (xᵀx/n) − x̄x₁, …, (xᵀx/n) − x̄xₙ ; −x̄ + x₁, …, −x̄ + xₙ ]   (8)
with entries
pᵢⱼ = 1/((n − 1) s²x) · [ (xᵀx/n) − x̄ (xᵢ + xⱼ) + xᵢ xⱼ ] .
The residual-forming matrix is
R = Iₙ − P ,   (9)
such that, using (4) to write Iₙ = 1/((n − 1) s²x) · diag(xᵀx − nx̄², …, xᵀx − nx̄²), its entries become
rᵢᵢ = 1/((n − 1) s²x) · [ (n − 1)(xᵀx/n) + x̄ (2xᵢ − nx̄) − xᵢ² ]
rᵢⱼ = 1/((n − 1) s²x) · [ −(xᵀx/n) + x̄ (xᵢ + xⱼ) − xᵢ xⱼ ]  for i ≠ j .   (10)
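The closed-form matrices in equation (1) can be compared against their defining expressions. The sketch below (arbitrary predictor values) builds both versions:

```python
import numpy as np

# Single predictor with intercept (arbitrary example values)
x = np.array([0.5, 1.0, 2.0, 3.5, 4.0])
n = len(x)
X = np.column_stack([np.ones(n), x])

# Reference matrices from their definitions
E_ref = np.linalg.inv(X.T @ X) @ X.T
P_ref = X @ E_ref
R_ref = np.eye(n) - P_ref

# Closed forms from equation (1)
x_bar = x.mean()
sx2 = x.var(ddof=1)
c = 1.0 / ((n - 1) * sx2)
E = c * np.vstack([(x @ x / n) * np.ones(n) - x_bar * x,
                   -x_bar * np.ones(n) + x])
P = c * ((x @ x / n) - x_bar * (x[:, None] + x[None, :]) + np.outer(x, x))
R = np.eye(n) - P
```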
Sources:
• original work
Metadata: ID: P285 | shortcut: slr-mat | author: JoramSoch | date: 2021-11-09, 15:19.
y = β0 + β1 x + ε, ε ∼ N (0, σ 2 V ) , (1)
the parameters minimizing the weighted residual sum of squares (→ Definition III/1.4.7) are given
by
β̂0 = [ xᵀV⁻¹x · 1ₙᵀV⁻¹y − 1ₙᵀV⁻¹x · xᵀV⁻¹y ] / [ xᵀV⁻¹x · 1ₙᵀV⁻¹1ₙ − 1ₙᵀV⁻¹x · xᵀV⁻¹1ₙ ]
                                                                                          (2)
β̂1 = [ 1ₙᵀV⁻¹1ₙ · xᵀV⁻¹y − xᵀV⁻¹1ₙ · 1ₙᵀV⁻¹y ] / [ 1ₙᵀV⁻¹1ₙ · xᵀV⁻¹x − xᵀV⁻¹1ₙ · 1ₙᵀV⁻¹x ] .
Proof: Let W be an n × n matrix, such that
W V Wᵀ = Iₙ .   (3)
Since V is a covariance matrix and thus symmetric, W is also symmetric and can be expressed as
the matrix square root of the inverse of V :
W V W = In ⇔ V = W −1 W −1 ⇔ V −1 = W W ⇔ W = V −1/2 . (4)
Because β0 is a scalar, (1) may also be written as
y = β0 1n + β1 x + ε, ε ∼ N (0, σ 2 V ) , (5)
Left-multiplying (5) with W , the linear transformation theorem (→ Proof II/4.1.8) implies that
W y = β0 W 1n + β1 W x + W ε, W ε ∼ N (0, σ 2 W V W T ) . (6)
Applying (3), we see that (6) is actually a linear regression model (→ Definition III/1.4.1) with
independent observations
ỹ = [x̃0, x̃] (β0, β1)ᵀ + ε̃, ε̃ ∼ N(0, σ² Iₙ)   (7)
where ỹ = W y, x̃0 = W 1n , x̃ = W x and ε̃ = W ε, such that we can apply the ordinary least squares
solution (→ Proof III/1.4.3) giving:
β̂ = (X̃ᵀX̃)⁻¹ X̃ᵀ ỹ
  = ( [x̃0ᵀ; x̃ᵀ] [x̃0, x̃] )⁻¹ [x̃0ᵀ; x̃ᵀ] ỹ   (8)
  = [ x̃0ᵀx̃0, x̃0ᵀx̃ ; x̃ᵀx̃0, x̃ᵀx̃ ]⁻¹ [x̃0ᵀ; x̃ᵀ] ỹ .
Applying the inverse of a 2 × 2 matrix, this becomes:
β̂ = 1/(x̃0ᵀx̃0 · x̃ᵀx̃ − x̃0ᵀx̃ · x̃ᵀx̃0) [ x̃ᵀx̃, −x̃0ᵀx̃ ; −x̃ᵀx̃0, x̃0ᵀx̃0 ] [x̃0ᵀ; x̃ᵀ] ỹ
  = 1/(x̃0ᵀx̃0 · x̃ᵀx̃ − x̃0ᵀx̃ · x̃ᵀx̃0) [ x̃ᵀx̃ x̃0ᵀ − x̃0ᵀx̃ x̃ᵀ ; x̃0ᵀx̃0 x̃ᵀ − x̃ᵀx̃0 x̃0ᵀ ] ỹ   (9)
  = 1/(x̃0ᵀx̃0 · x̃ᵀx̃ − x̃0ᵀx̃ · x̃ᵀx̃0) [ x̃ᵀx̃ · x̃0ᵀỹ − x̃0ᵀx̃ · x̃ᵀỹ ; x̃0ᵀx̃0 · x̃ᵀỹ − x̃ᵀx̃0 · x̃0ᵀỹ ] .
Substituting back ỹ = W y, x̃0 = W 1ₙ and x̃ = W x, and applying WᵀW = W W = V⁻¹ from (4), this finally gives:
β̂ = 1/(1ₙᵀWᵀW1ₙ · xᵀWᵀWx − 1ₙᵀWᵀWx · xᵀWᵀW1ₙ) [ xᵀWᵀWx · 1ₙᵀWᵀWy − 1ₙᵀWᵀWx · xᵀWᵀWy ; 1ₙᵀWᵀW1ₙ · xᵀWᵀWy − xᵀWᵀW1ₙ · 1ₙᵀWᵀWy ]
  = 1/(1ₙᵀV⁻¹1ₙ · xᵀV⁻¹x − 1ₙᵀV⁻¹x · xᵀV⁻¹1ₙ) [ xᵀV⁻¹x · 1ₙᵀV⁻¹y − 1ₙᵀV⁻¹x · xᵀV⁻¹y ; 1ₙᵀV⁻¹1ₙ · xᵀV⁻¹y − xᵀV⁻¹1ₙ · 1ₙᵀV⁻¹y ]   (10)
which are exactly the components given in equation (2).
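The scalar formulas in equation (2) can be cross-checked against the matrix-form WLS solution. In this sketch, the diagonal covariance structure V is a hypothetical illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = np.linspace(0.0, 7.0, n)
y = 1.5 + 0.8 * x + rng.normal(size=n)
V = np.diag(rng.uniform(0.5, 2.0, size=n))   # hypothetical diagonal covariance structure
ones = np.ones(n)
Vi = np.linalg.inv(V)

# Scalar WLS formulas from equation (2); both denominators are the same scalar
den = (x @ Vi @ x) * (ones @ Vi @ ones) - (ones @ Vi @ x) * (x @ Vi @ ones)
b0 = ((x @ Vi @ x) * (ones @ Vi @ y) - (ones @ Vi @ x) * (x @ Vi @ y)) / den
b1 = ((ones @ Vi @ ones) * (x @ Vi @ y) - (x @ Vi @ ones) * (ones @ Vi @ y)) / den

# Matrix WLS solution (Proof III/1.4.14) for comparison
X = np.column_stack([ones, x])
beta = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)
```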
Sources:
• original work
Metadata: ID: P286 | shortcut: slr-wls | author: JoramSoch | date: 2021-11-16, 07:16.
y = β0 + β1 x + ε, ε ∼ N (0, σ 2 V ) , (1)
the parameters minimizing the weighted residual sum of squares (→ Definition III/1.4.7) are given
by
β̂0 = [ xᵀV⁻¹x · 1ₙᵀV⁻¹y − 1ₙᵀV⁻¹x · xᵀV⁻¹y ] / [ xᵀV⁻¹x · 1ₙᵀV⁻¹1ₙ − 1ₙᵀV⁻¹x · xᵀV⁻¹1ₙ ]
                                                                                          (2)
β̂1 = [ 1ₙᵀV⁻¹1ₙ · xᵀV⁻¹y − xᵀV⁻¹1ₙ · 1ₙᵀV⁻¹y ] / [ 1ₙᵀV⁻¹1ₙ · xᵀV⁻¹x − xᵀV⁻¹1ₙ · 1ₙᵀV⁻¹x ] .
Proof: Simple linear regression is a special case of multiple linear regression (→ Proof III/1.3.2)
with
X = [1ₙ, x]  and  β = (β0, β1)ᵀ   (3)
and weighted least squares estimates (→ Proof III/1.4.14) are given by
β̂ = (X T V −1 X)−1 X T V −1 y . (4)
Writing out equation (4), we have
β̂ = ( [1ₙᵀ; xᵀ] V⁻¹ [1ₙ, x] )⁻¹ [1ₙᵀ; xᵀ] V⁻¹ y
  = [ 1ₙᵀV⁻¹1ₙ, 1ₙᵀV⁻¹x ; xᵀV⁻¹1ₙ, xᵀV⁻¹x ]⁻¹ [ 1ₙᵀV⁻¹y ; xᵀV⁻¹y ]
  = 1/(xᵀV⁻¹x · 1ₙᵀV⁻¹1ₙ − 1ₙᵀV⁻¹x · xᵀV⁻¹1ₙ) [ xᵀV⁻¹x, −1ₙᵀV⁻¹x ; −xᵀV⁻¹1ₙ, 1ₙᵀV⁻¹1ₙ ] [ 1ₙᵀV⁻¹y ; xᵀV⁻¹y ]   (5)
  = 1/(xᵀV⁻¹x · 1ₙᵀV⁻¹1ₙ − 1ₙᵀV⁻¹x · xᵀV⁻¹1ₙ) [ xᵀV⁻¹x · 1ₙᵀV⁻¹y − 1ₙᵀV⁻¹x · xᵀV⁻¹y ; 1ₙᵀV⁻¹1ₙ · xᵀV⁻¹y − xᵀV⁻¹1ₙ · 1ₙᵀV⁻¹y ] .
Thus, the components of β̂ are
β̂0 = [ xᵀV⁻¹x · 1ₙᵀV⁻¹y − 1ₙᵀV⁻¹x · xᵀV⁻¹y ] / [ xᵀV⁻¹x · 1ₙᵀV⁻¹1ₙ − 1ₙᵀV⁻¹x · xᵀV⁻¹1ₙ ]
                                                                                          (6)
β̂1 = [ 1ₙᵀV⁻¹1ₙ · xᵀV⁻¹y − xᵀV⁻¹1ₙ · 1ₙᵀV⁻¹y ] / [ 1ₙᵀV⁻¹1ₙ · xᵀV⁻¹x − xᵀV⁻¹1ₙ · 1ₙᵀV⁻¹x ]
as given by equation (2).
Sources:
• original work
Metadata: ID: P289 | shortcut: slr-wls2 | author: JoramSoch | date: 2021-11-16, 11:20.
yi = β0 + β1 xi + εi , εi ∼ N (0, σ 2 ), i = 1, . . . , n , (1)
the maximum likelihood estimates (→ Definition I/4.1.3) of β0 , β1 and σ 2 are given by
β̂0 = ȳ − β̂1 x̄
β̂1 = sxy / s²x   (2)
σ̂² = (1/n) Σⁿᵢ₌₁ (yᵢ − β̂0 − β̂1 xᵢ)²
where x̄ and ȳ are the sample means (→ Definition I/1.7.2), s2x is the sample variance (→ Definition
I/1.8.2) of x and sxy is the sample covariance (→ Definition I/1.9.2) between x and y.
Proof: With the probability density function of the normal distribution (→ Proof II/3.2.10) and
probability under independence (→ Definition I/1.3.6), the linear regression equation (1) implies the
following likelihood function (→ Definition I/5.1.2)
p(y | β0, β1, σ²) = Πⁿᵢ₌₁ p(yᵢ | β0, β1, σ²)
                  = Πⁿᵢ₌₁ N(yᵢ; β0 + β1 xᵢ, σ²)
                  = Πⁿᵢ₌₁ 1/√(2πσ²) · exp[ −(yᵢ − β0 − β1 xᵢ)² / (2σ²) ]   (3)
                  = (2πσ²)^(−n/2) · exp[ −1/(2σ²) Σⁿᵢ₌₁ (yᵢ − β0 − β1 xᵢ)² ]
and thus the log-likelihood function (→ Definition I/4.1.2)
LL(β0, β1, σ²) = log p(y | β0, β1, σ²) = −(n/2) log(2πσ²) − 1/(2σ²) Σⁿᵢ₌₁ (yᵢ − β0 − β1 xᵢ)² .   (4)
The derivative of the log-likelihood function (4) with respect to β0 is
dLL(β0, β1, σ²)/dβ0 = 1/σ² Σⁿᵢ₌₁ (yᵢ − β0 − β1 xᵢ)   (5)
and setting this derivative to zero gives the MLE for β0:
dLL(β̂0, β̂1, σ̂²)/dβ0 = 0
0 = 1/σ̂² Σⁿᵢ₌₁ (yᵢ − β̂0 − β̂1 xᵢ)
0 = Σⁿᵢ₌₁ yᵢ − n β̂0 − β̂1 Σⁿᵢ₌₁ xᵢ   (6)
β̂0 = (1/n) Σⁿᵢ₌₁ yᵢ − β̂1 (1/n) Σⁿᵢ₌₁ xᵢ
β̂0 = ȳ − β̂1 x̄ .
The derivative of the log-likelihood function (4) at β̂0 with respect to β1 is
dLL(β̂0, β1, σ²)/dβ1 = 1/σ² Σⁿᵢ₌₁ (xᵢ yᵢ − β̂0 xᵢ − β1 xᵢ²)   (7)
and setting this derivative to zero gives the MLE for β1:
dLL(β̂0, β̂1, σ̂²)/dβ1 = 0
0 = 1/σ̂² Σⁿᵢ₌₁ (xᵢ yᵢ − β̂0 xᵢ − β̂1 xᵢ²)
0 = Σⁿᵢ₌₁ xᵢ yᵢ − β̂0 Σⁿᵢ₌₁ xᵢ − β̂1 Σⁿᵢ₌₁ xᵢ²
=(6) 0 = Σⁿᵢ₌₁ xᵢ yᵢ − (ȳ − β̂1 x̄) Σⁿᵢ₌₁ xᵢ − β̂1 Σⁿᵢ₌₁ xᵢ²
0 = Σⁿᵢ₌₁ xᵢ yᵢ − ȳ Σⁿᵢ₌₁ xᵢ + β̂1 x̄ Σⁿᵢ₌₁ xᵢ − β̂1 Σⁿᵢ₌₁ xᵢ²   (8)
0 = Σⁿᵢ₌₁ xᵢ yᵢ − n x̄ ȳ + β̂1 n x̄² − β̂1 Σⁿᵢ₌₁ xᵢ²
β̂1 = (Σⁿᵢ₌₁ xᵢ yᵢ − n x̄ ȳ) / (Σⁿᵢ₌₁ xᵢ² − n x̄²)
β̂1 = Σⁿᵢ₌₁ (xᵢ − x̄)(yᵢ − ȳ) / Σⁿᵢ₌₁ (xᵢ − x̄)²
β̂1 = sxy / s²x .
The derivative of the log-likelihood function (4) at (β̂0, β̂1) with respect to σ² is
dLL(β̂0, β̂1, σ²)/dσ² = −n/(2σ²) + 1/(2(σ²)²) Σⁿᵢ₌₁ (yᵢ − β̂0 − β̂1 xᵢ)²   (9)
and setting this derivative to zero gives the MLE for σ²:
dLL(β̂0, β̂1, σ̂²)/dσ² = 0
0 = −n/(2σ̂²) + 1/(2(σ̂²)²) Σⁿᵢ₌₁ (yᵢ − β̂0 − β̂1 xᵢ)²
n/(2σ̂²) = 1/(2(σ̂²)²) Σⁿᵢ₌₁ (yᵢ − β̂0 − β̂1 xᵢ)²   (10)
σ̂² = (1/n) Σⁿᵢ₌₁ (yᵢ − β̂0 − β̂1 xᵢ)² .
Together, (6), (8) and (10) constitute the MLE for simple linear regression.
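Since the MLE is the unique maximizer of the likelihood, the closed-form estimates in equation (2) should attain a log-likelihood at least as high as any nearby parameter values. A numerical sketch (arbitrary example data):

```python
import numpy as np

# Example data (arbitrary)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.8, 5.2, 6.9, 9.1, 11.0])
n = len(x)

def log_lik(b0, b1, s2):
    """Log-likelihood of equation (4) for given parameters."""
    r = y - b0 - b1 * x
    return -n / 2 * np.log(2 * np.pi * s2) - np.sum(r ** 2) / (2 * s2)

# Closed-form maximum likelihood estimates (equation 2)
sx2 = x.var(ddof=1)
sxy = np.cov(x, y, ddof=1)[0, 1]
b1_hat = sxy / sx2
b0_hat = y.mean() - b1_hat * x.mean()
s2_hat = np.mean((y - b0_hat - b1_hat * x) ** 2)   # note the 1/n, not 1/(n-1)

# The closed-form estimates attain a higher likelihood than nearby parameters
ll_max = log_lik(b0_hat, b1_hat, s2_hat)
better = all(
    ll_max >= log_lik(b0_hat + d0, b1_hat + d1, s2_hat + d2)
    for d0, d1, d2 in [(0.1, 0, 0), (0, 0.1, 0), (0, 0, 0.05), (-0.1, 0.05, 0.02)]
)
```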
Sources:
• original work
Metadata: ID: P287 | shortcut: slr-mle | author: JoramSoch | date: 2021-11-16, 08:34.
yi = β0 + β1 xi + εi , εi ∼ N (0, σ 2 ), i = 1, . . . , n , (1)
the maximum likelihood estimates (→ Definition I/4.1.3) of β0 , β1 and σ 2 are given by
β̂0 = ȳ − β̂1 x̄
β̂1 = sxy / s²x   (2)
σ̂² = (1/n) Σⁿᵢ₌₁ (yᵢ − β̂0 − β̂1 xᵢ)²
where x̄ and ȳ are the sample means (→ Definition I/1.7.2), s2x is the sample variance (→ Definition
I/1.8.2) of x and sxy is the sample covariance (→ Definition I/1.9.2) between x and y.
Proof: Simple linear regression is a special case of multiple linear regression (→ Proof III/1.3.2)
with
X = [1ₙ, x]  and  β = (β0, β1)ᵀ   (3)
and maximum likelihood estimates (→ Proof III/1.4.16) are given by
β̂ = (XᵀV⁻¹X)⁻¹ XᵀV⁻¹ y
σ̂² = (1/n) (y − Xβ̂)ᵀ V⁻¹ (y − Xβ̂) .   (4)
Under independent observations, the covariance matrix is
V = Iₙ ,   (5)
such that equation (4) becomes
β̂ = ( [1ₙᵀ; xᵀ] V⁻¹ [1ₙ, x] )⁻¹ [1ₙᵀ; xᵀ] V⁻¹ y
  = ( [1ₙᵀ; xᵀ] [1ₙ, x] )⁻¹ [1ₙᵀ; xᵀ] y   (6)
which is equal to the ordinary least squares solution for simple linear regression (→ Proof III/1.3.4):
β̂0 = ȳ − β̂1 x̄
β̂1 = sxy / s²x .   (7)
Additionally, we can write out the estimate of σ²:
σ̂² = (1/n) (y − Xβ̂)ᵀ V⁻¹ (y − Xβ̂)
   = (1/n) ( y − [1ₙ, x] (β̂0, β̂1)ᵀ )ᵀ ( y − [1ₙ, x] (β̂0, β̂1)ᵀ )
   = (1/n) ( y − β̂0 1ₙ − β̂1 x )ᵀ ( y − β̂0 1ₙ − β̂1 x )   (8)
   = (1/n) Σⁿᵢ₌₁ (yᵢ − β̂0 − β̂1 xᵢ)² .
Sources:
• original work
Metadata: ID: P290 | shortcut: slr-mle2 | author: JoramSoch | date: 2021-11-16, 11:53.
Proof: The residuals are defined as the estimated error terms (→ Definition III/1.3.1), ε̂ᵢ = yᵢ − β̂0 − β̂1 xᵢ. Thus, the sum of the residuals is:
Σⁿᵢ₌₁ ε̂ᵢ = Σⁿᵢ₌₁ (yᵢ − β̂0 − β̂1 xᵢ)
        = Σⁿᵢ₌₁ (yᵢ − ȳ + β̂1 x̄ − β̂1 xᵢ)
        = Σⁿᵢ₌₁ yᵢ − n ȳ + β̂1 n x̄ − β̂1 Σⁿᵢ₌₁ xᵢ   (3)
        = n ȳ − n ȳ + β̂1 n x̄ − β̂1 n x̄
        = 0 .
Thus, the sum of the residuals (→ Definition III/1.4.7) is zero under ordinary least squares (→ Proof
III/1.3.3), if the model (→ Definition III/1.3.1) includes an intercept term β0 .
Sources:
• Wikipedia (2021): “Simple linear regression”; in: Wikipedia, the free encyclopedia, retrieved on
2021-10-27; URL: https://en.wikipedia.org/wiki/Simple_linear_regression#Numerical_properties.
Metadata: ID: P276 | shortcut: slr-ressum | author: JoramSoch | date: 2021-10-27, 13:07.
Proof: The residuals are defined as the estimated error terms (→ Definition III/1.3.1), ε̂ᵢ = yᵢ − β̂0 − β̂1 xᵢ. Thus, the inner product of covariate and residuals is:
Σⁿᵢ₌₁ xᵢ ε̂ᵢ = Σⁿᵢ₌₁ xᵢ (yᵢ − β̂0 − β̂1 xᵢ)
  = Σⁿᵢ₌₁ (xᵢ yᵢ − β̂0 xᵢ − β̂1 xᵢ²)
  = Σⁿᵢ₌₁ (xᵢ yᵢ − xᵢ (ȳ − β̂1 x̄) − β̂1 xᵢ²)
  = Σⁿᵢ₌₁ [ xᵢ (yᵢ − ȳ) + β̂1 (x̄ xᵢ − xᵢ²) ]
  = ( Σⁿᵢ₌₁ xᵢ yᵢ − ȳ Σⁿᵢ₌₁ xᵢ ) − β̂1 ( Σⁿᵢ₌₁ xᵢ² − x̄ Σⁿᵢ₌₁ xᵢ )
  = ( Σⁿᵢ₌₁ xᵢ yᵢ − n x̄ ȳ − n x̄ ȳ + n x̄ ȳ ) − β̂1 ( Σⁿᵢ₌₁ xᵢ² − 2 n x̄ x̄ + n x̄² )   (3)
  = ( Σⁿᵢ₌₁ xᵢ yᵢ − ȳ Σⁿᵢ₌₁ xᵢ − x̄ Σⁿᵢ₌₁ yᵢ + n x̄ ȳ ) − β̂1 ( Σⁿᵢ₌₁ xᵢ² − 2 x̄ Σⁿᵢ₌₁ xᵢ + n x̄² )
  = Σⁿᵢ₌₁ (xᵢ yᵢ − ȳ xᵢ − x̄ yᵢ + x̄ ȳ) − β̂1 Σⁿᵢ₌₁ (xᵢ² − 2 x̄ xᵢ + x̄²)
  = Σⁿᵢ₌₁ (xᵢ − x̄)(yᵢ − ȳ) − β̂1 Σⁿᵢ₌₁ (xᵢ − x̄)²
  = (n − 1) sxy − (sxy/s²x) (n − 1) s²x
  = (n − 1) sxy − (n − 1) sxy
  = 0 .
Because an inner product of zero also implies zero correlation (→ Definition I/1.10.1), this demon-
strates that residuals (→ Definition III/1.4.7) and covariate (→ Definition III/1.3.1) values are
uncorrelated under ordinary least squares (→ Proof III/1.3.3).
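The two numerical properties shown above, residuals summing to zero and residuals being uncorrelated with the covariate, can be confirmed in a few lines (arbitrary example data):

```python
import numpy as np

# Example data with an intercept in the model (arbitrary values)
x = np.array([1.0, 2.0, 4.0, 5.0, 7.0])
y = np.array([2.0, 2.5, 5.0, 4.5, 8.0])

# OLS estimates (Proof III/1.3.3)
b1 = np.cov(x, y, ddof=1)[0, 1] / x.var(ddof=1)
b0 = y.mean() - b1 * x.mean()
res = y - b0 - b1 * x

sum_res = res.sum()                 # ~ 0: residuals sum to zero
inner = x @ res                     # ~ 0: residuals are orthogonal to the covariate
corr = np.corrcoef(x, res)[0, 1]    # ~ 0: zero correlation with the covariate
```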
Sources:
• Wikipedia (2021): “Simple linear regression”; in: Wikipedia, the free encyclopedia, retrieved on
2021-10-27; URL: https://en.wikipedia.org/wiki/Simple_linear_regression#Numerical_properties.
Metadata: ID: P277 | shortcut: slr-rescorr | author: JoramSoch | date: 2021-10-27, 13:07.
yᵢ = β0 + β1 xᵢ + εᵢ, εᵢ ∼ N(0, σ²), i = 1, …, n   (1)
and consider estimation using ordinary least squares (→ Proof III/1.3.3). Then, residual variance
(→ Definition IV/1.1.1) and sample variance (→ Definition I/1.8.2) are related to each other via the
correlation coefficient (→ Definition I/1.10.1):
σ̂² = (1 − r²xy) s²y .   (2)
Proof: The residual variance (→ Definition IV/1.1.1) can be expressed in terms of the residual sum
of squares (→ Definition III/1.4.7):
σ̂² = 1/(n − 1) · RSS(β̂0, β̂1)   (3)
and the residual sum of squares for simple linear regression (→ Proof III/1.3.13) is
RSS(β̂0, β̂1) = (n − 1) (s²y − s²xy/s²x) .   (4)
Combining (3) and (4), we obtain:
σ̂² = s²y − s²xy/s²x
   = ( 1 − s²xy/(s²x s²y) ) s²y   (5)
   = ( 1 − (sxy/(sx sy))² ) s²y .
sx sy
Using the relationship between correlation, covariance and standard deviation (→ Definition I/1.10.1)
Corr(X, Y) = Cov(X, Y) / ( √Var(X) · √Var(Y) )   (6)
which also holds for sample correlation, sample covariance (→ Definition I/1.9.2) and sample standard
deviation (→ Definition I/1.12.1)
rxy = sxy / (sx sy) ,   (7)
sx sy
we get the final result:
σ̂² = (1 − r²xy) s²y .   (8)
Sources:
• Penny, William (2006): “Relation to correlation”; in: Mathematics for Brain Imaging, ch. 1.2.3,
p. 18, eq. 1.28; URL: https://ueapsylabs.co.uk/sites/wpenny/mbi/mbi_course.pdf.
• Wikipedia (2021): “Simple linear regression”; in: Wikipedia, the free encyclopedia, retrieved on
2021-10-27; URL: https://en.wikipedia.org/wiki/Simple_linear_regression#Numerical_properties.
Metadata: ID: P278 | shortcut: slr-resvar | author: JoramSoch | date: 2021-10-27, 14:37.
yᵢ = β0 + β1 xᵢ + εᵢ, εᵢ ∼ N(0, σ²), i = 1, …, n   (1)
and consider estimation using ordinary least squares (→ Proof III/1.3.3). Then, correlation coefficient
(→ Definition I/1.10.1) and the estimated value of the slope parameter (→ Definition III/1.3.1) are
related to each other via the sample standard deviations (→ Definition I/1.12.1):
rxy = β̂1 · sx/sy .   (2)
Proof: The ordinary least squares estimate of the slope (→ Proof III/1.3.3) is given by
β̂1 = sxy / s²x .   (3)
Using the relationship between covariance and correlation (→ Proof I/1.9.7), sxy = rxy sx sy, we get:
β̂1 = sxy / s²x
β̂1 = rxy sx sy / s²x
β̂1 = rxy · sy/sx   (6)
⇔ rxy = β̂1 · sx/sy .
Sources:
• Penny, William (2006): “Relation to correlation”; in: Mathematics for Brain Imaging, ch. 1.2.3,
p. 18, eq. 1.27; URL: https://ueapsylabs.co.uk/sites/wpenny/mbi/mbi_course.pdf.
• Wikipedia (2021): “Simple linear regression”; in: Wikipedia, the free encyclopedia, retrieved on
2021-10-27; URL: https://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_
line.
Metadata: ID: P279 | shortcut: slr-corr | author: JoramSoch | date: 2021-10-27, 14:58.
yᵢ = β0 + β1 xᵢ + εᵢ, εᵢ ∼ N(0, σ²), i = 1, …, n   (1)
and consider estimation using ordinary least squares (→ Proof III/1.3.3). Then, the coefficient of
determination (→ Definition IV/1.2.1) is equal to the squared correlation coefficient (→ Definition
I/1.10.1) between x and y:
R² = r²xy .   (2)
Proof: The ordinary least squares estimates for simple linear regression (→ Proof III/1.3.3) are
β̂0 = ȳ − β̂1 x̄
β̂1 = sxy / s²x .   (3)
The coefficient of determination (→ Definition IV/1.2.1) R2 is defined as the proportion of the
variance explained by the independent variables, relative to the total variance in the data. This can
be quantified as the ratio of explained sum of squares (→ Definition III/1.4.6) to total sum of squares
(→ Definition III/1.4.5):
R² = ESS / TSS .   (4)
Using the explained and total sum of squares for simple linear regression (→ Proof III/1.3.13), we
have:
R² = Σⁿᵢ₌₁ (ŷᵢ − ȳ)² / Σⁿᵢ₌₁ (yᵢ − ȳ)²
   = Σⁿᵢ₌₁ (ȳ − β̂1 x̄ + β̂1 xᵢ − ȳ)² / Σⁿᵢ₌₁ (yᵢ − ȳ)²
   = Σⁿᵢ₌₁ (β̂1 (xᵢ − x̄))² / Σⁿᵢ₌₁ (yᵢ − ȳ)²
   = β̂1² · [ 1/(n−1) Σⁿᵢ₌₁ (xᵢ − x̄)² ] / [ 1/(n−1) Σⁿᵢ₌₁ (yᵢ − ȳ)² ]   (6)
   = β̂1² · s²x/s²y
   = ( β̂1 · sx/sy )² .
Using the relationship between correlation coefficient and slope estimate (→ Proof III/1.3.22), we
conclude:
R² = ( β̂1 · sx/sy )² = r²xy .   (7)
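A quick numerical sketch (arbitrary example data) confirms that ESS/TSS equals the squared sample correlation:

```python
import numpy as np

x = np.array([0.5, 1.5, 2.0, 3.0, 4.5, 5.0])
y = np.array([1.0, 2.2, 2.4, 3.9, 5.6, 5.9])

# OLS fit (Proof III/1.3.3)
b1 = np.cov(x, y, ddof=1)[0, 1] / x.var(ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

# Coefficient of determination as ESS/TSS (equation 4)
R2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

# Squared sample correlation coefficient
r2 = np.corrcoef(x, y)[0, 1] ** 2
```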
Sources:
• Wikipedia (2021): “Simple linear regression”; in: Wikipedia, the free encyclopedia, retrieved on
2021-10-27; URL: https://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_
line.
• Wikipedia (2021): “Coefficient of determination”; in: Wikipedia, the free encyclopedia, retrieved on
2021-10-27; URL: https://en.wikipedia.org/wiki/Coefficient_of_determination#As_squared_correlation_
coefficient.
• Wikipedia (2021): “Correlation”; in: Wikipedia, the free encyclopedia, retrieved on 2021-10-27;
URL: https://en.wikipedia.org/wiki/Correlation#Sample_correlation_coefficient.
Metadata: ID: P280 | shortcut: slr-rsq | author: JoramSoch | date: 2021-10-27, 15:31.
y = Xβ + ε , (1)
together with a statement asserting a normal distribution (→ Definition II/4.1.1) for ε
ε ∼ N (0, σ 2 V ) (2)
is called a univariate linear regression model or simply, “multiple linear regression”.
• y is called “measured data”, “dependent variable” or “measurements”;
• X is called “design matrix”, “set of independent variables” or “predictors”;
• V is called “covariance matrix” or “covariance structure”;
• β are called “regression coefficients” or “weights”;
• ε is called “noise”, “errors” or “error terms”;
• σ 2 is called “noise variance” or “error variance”;
• n is the number of observations;
• p is the number of predictors.
Alternatively, the linear combination may also be written as
y = Σᵖᵢ₌₁ βᵢ xᵢ + ε   (3)
or, with the intercept made explicit, as
y = β0 + Σᵖᵢ₌₁ βᵢ xᵢ + ε .   (4)
Sources:
• Wikipedia (2020): “Linear regression”; in: Wikipedia, the free encyclopedia, retrieved on 2020-03-
21; URL: https://en.wikipedia.org/wiki/Linear_regression#Simple_and_multiple_linear_regression.
Metadata: ID: D36 | shortcut: mlr | author: JoramSoch | date: 2020-03-21, 20:09.
Y = y, B = β, E = ε and Σ = σ²   (1)
where y, β, ε and σ 2 are the data vector, regression coefficients, noise vector and noise variance from
multiple linear regression (→ Definition III/1.4.1).
Proof: The linear regression model with correlated errors (→ Definition III/1.4.1) is given by:
y = Xβ + ε, ε ∼ N (0, σ 2 V ) . (2)
Because ε is an n × 1 vector and σ² is a scalar, we have the following identities:
vec(ε) = ε
σ² ⊗ V = σ² V .   (3)
Thus, using the relationship between multivariate normal and matrix normal distribution (→ Proof
II/5.1.2), equation (2) can also be written as
y = Xβ + ε, ε ∼ MN (0, V, σ 2 ) . (4)
Comparing with the general linear model with correlated observations (→ Definition III/2.1.1)
Y = XB + E, E ∼ MN (0, V, Σ) , (5)
we finally note the equivalences given in equation (1).
Sources:
• Wikipedia (2022): “General linear model”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
07-21; URL: https://en.wikipedia.org/wiki/General_linear_model#Comparison_to_multiple_
linear_regression.
Metadata: ID: P329 | shortcut: mlr-glm | author: JoramSoch | date: 2022-07-21, 08:28.
β̂ = (X T X)−1 X T y . (2)
Proof: Let β̂ be the ordinary least squares (OLS) solution and let ε̂ = y − X β̂ be the resulting vector
of residuals. Then, this vector must be orthogonal to the design matrix,
X T ε̂ = 0 , (3)
because if it wasn’t, there would be another solution β̃ giving another vector ε̃ with a smaller residual
sum of squares. From (3), the OLS formula can be directly derived:
Xᵀ ε̂ = 0
Xᵀ (y − X β̂) = 0
Xᵀ y − XᵀX β̂ = 0   (4)
XᵀX β̂ = Xᵀ y
β̂ = (XᵀX)⁻¹ Xᵀ y .
Sources:
• Stephan, Klaas Enno (2010): “The General Linear Model (GLM)”; in: Methods and models for
fMRI data analysis in neuroeconomics, Lecture 3, Slides 10/11; URL: http://www.socialbehavior.
uzh.ch/teaching/methodsspring10.html.
β̂ = (XᵀX)⁻¹ Xᵀ y .   (2)
Proof: The residual sum of squares (→ Definition III/1.4.7) is
RSS(β) = (y − Xβ)ᵀ (y − Xβ)   (3)
which can be expanded as
RSS(β) = yᵀy − yᵀXβ − βᵀXᵀy + βᵀXᵀXβ
       = yᵀy − 2βᵀXᵀy + βᵀXᵀXβ .   (4)
The derivative of RSS(β) with respect to β is
dRSS(β)/dβ = −2Xᵀy + 2XᵀXβ   (5)
and setting this derivative to zero, we obtain:
dRSS(β̂)/dβ = 0
0 = −2Xᵀy + 2XᵀX β̂   (6)
XᵀX β̂ = Xᵀ y
β̂ = (XᵀX)⁻¹ Xᵀ y .
Sources:
• Wikipedia (2020): “Proofs involving ordinary least squares”; in: Wikipedia, the free encyclopedia,
retrieved on 2020-02-03; URL: https://en.wikipedia.org/wiki/Proofs_involving_ordinary_least_
squares#Least_squares_estimator_for_%CE%B2.
• ad (2015): “Derivation of the Least Squares Estimator for Beta in Matrix Notation”; in: Economic
Theory Blog, retrieved on 2021-05-27; URL: https://economictheoryblog.com/2015/02/19/ols_
estimator/.
Metadata: ID: P40 | shortcut: mlr-ols2 | author: JoramSoch | date: 2020-02-03, 18:43.
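The normal-equations solution (2) and the orthogonality property (3) can be illustrated with simulated data (arbitrary seed and coefficients; `np.linalg.solve` is used instead of forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Normal-equations solution of equation (2)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The residuals are orthogonal to the design matrix (equation 3)
res = y - X @ beta_hat
ortho = np.allclose(X.T @ res, 0.0)

# Any perturbed coefficient vector has a larger residual sum of squares
rss = res @ res
rss_perturbed = np.sum((y - X @ (beta_hat + 0.01)) ** 2)
```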
The total sum of squares (→ Definition III/1.4.5) is defined as TSS = Σⁿᵢ₌₁ (yᵢ − ȳ)², the sum of squared deviations of the measured signal from its mean.
Sources:
• Wikipedia (2020): “Total sum of squares”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-21; URL: https://en.wikipedia.org/wiki/Total_sum_of_squares.
Metadata: ID: D37 | shortcut: tss | author: JoramSoch | date: 2020-03-21, 21:44.
The explained sum of squares (→ Definition III/1.4.6) is defined as ESS = Σⁿᵢ₌₁ (ŷᵢ − ȳ)², where ŷ = X β̂ are the fitted values,
with estimated regression coefficients β̂, e.g. obtained via ordinary least squares (→ Proof III/1.4.3).
Sources:
• Wikipedia (2020): “Explained sum of squares”; in: Wikipedia, the free encyclopedia, retrieved on
2020-03-21; URL: https://en.wikipedia.org/wiki/Explained_sum_of_squares.
Metadata: ID: D38 | shortcut: ess | author: JoramSoch | date: 2020-03-21, 21:57.
The residual sum of squares (→ Definition III/1.4.7) is defined as RSS = Σⁿᵢ₌₁ ε̂ᵢ² = Σⁿᵢ₌₁ (yᵢ − ŷᵢ)², where ε̂ = y − X β̂ are the residuals,
with estimated regression coefficients β̂, e.g. obtained via ordinary least squares (→ Proof III/1.4.3).
Sources:
• Wikipedia (2020): “Residual sum of squares”; in: Wikipedia, the free encyclopedia, retrieved on
2020-03-21; URL: https://en.wikipedia.org/wiki/Residual_sum_of_squares.
Metadata: ID: D39 | shortcut: rss | author: JoramSoch | date: 2020-03-21, 22:03.
Proof: The total sum of squares can be rewritten as follows:
TSS = Σⁿᵢ₌₁ (yᵢ − ȳ + ŷᵢ − ŷᵢ)²
    = Σⁿᵢ₌₁ ((ŷᵢ − ȳ) + (yᵢ − ŷᵢ))²
    = Σⁿᵢ₌₁ ((ŷᵢ − ȳ) + ε̂ᵢ)²
    = Σⁿᵢ₌₁ [ (ŷᵢ − ȳ)² + 2 ε̂ᵢ (ŷᵢ − ȳ) + ε̂ᵢ² ]
    = Σⁿᵢ₌₁ (ŷᵢ − ȳ)² + Σⁿᵢ₌₁ ε̂ᵢ² + 2 Σⁿᵢ₌₁ ε̂ᵢ (ŷᵢ − ȳ)   (4)
    = Σⁿᵢ₌₁ (ŷᵢ − ȳ)² + Σⁿᵢ₌₁ ε̂ᵢ² + 2 Σⁿᵢ₌₁ ε̂ᵢ (xᵢ β̂ − ȳ)
    = Σⁿᵢ₌₁ (ŷᵢ − ȳ)² + Σⁿᵢ₌₁ ε̂ᵢ² + 2 Σⁿᵢ₌₁ ε̂ᵢ ( Σᵖⱼ₌₁ xᵢⱼ β̂ⱼ ) − 2 Σⁿᵢ₌₁ ε̂ᵢ ȳ
    = Σⁿᵢ₌₁ (ŷᵢ − ȳ)² + Σⁿᵢ₌₁ ε̂ᵢ² + 2 Σᵖⱼ₌₁ β̂ⱼ Σⁿᵢ₌₁ ε̂ᵢ xᵢⱼ − 2 ȳ Σⁿᵢ₌₁ ε̂ᵢ .
The fact that the design matrix includes a constant regressor ensures that
Σⁿᵢ₌₁ ε̂ᵢ = ε̂ᵀ 1ₙ = 0   (5)
and because the residuals are orthogonal to the design matrix (→ Proof III/1.4.3), we have
Σⁿᵢ₌₁ ε̂ᵢ xᵢⱼ = ε̂ᵀ xⱼ = 0 ,   (6)
such that (4) reduces to
TSS = Σⁿᵢ₌₁ (ŷᵢ − ȳ)² + Σⁿᵢ₌₁ ε̂ᵢ²
and, with the definitions of explained (→ Definition III/1.4.6) and residual sum of squares (→
Definition III/1.4.7), it is
TSS = ESS + RSS .   (7)
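The partition TSS = ESS + RSS can be confirmed for a design matrix that includes a constant regressor, as the proof requires (simulated data, arbitrary seed and coefficients):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 15, 3
# Design matrix with a constant regressor, as required by the proof
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([0.5, 1.0, -1.5]) + rng.normal(size=n)

# OLS fit and fitted values
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

TSS = np.sum((y - y.mean()) ** 2)
ESS = np.sum((y_hat - y.mean()) ** 2)
RSS = np.sum((y - y_hat) ** 2)
```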
Sources:
• Wikipedia (2020): “Partition of sums of squares”; in: Wikipedia, the free encyclopedia, retrieved on
2020-03-09; URL: https://en.wikipedia.org/wiki/Partition_of_sums_of_squares#Partitioning_
the_sum_of_squares_in_linear_regression.
Metadata: ID: P76 | shortcut: mlr-pss | author: JoramSoch | date: 2020-03-09, 22:18.
Ey = β̂ . (1)
Sources:
• original work
Metadata: ID: D81 | shortcut: emat | author: JoramSoch | date: 2020-07-22, 05:17.
P y = ŷ = X β̂ . (1)
Sources:
• Wikipedia (2020): “Projection matrix”; in: Wikipedia, the free encyclopedia, retrieved on 2020-07-
22; URL: https://en.wikipedia.org/wiki/Projection_matrix#Overview.
Metadata: ID: D82 | shortcut: pmat | author: JoramSoch | date: 2020-07-22, 05:25.
Ry = ε̂ = y − ŷ = y − X β̂ . (1)
Sources:
• Wikipedia (2020): “Projection matrix”; in: Wikipedia, the free encyclopedia, retrieved on 2020-07-
22; URL: https://en.wikipedia.org/wiki/Projection_matrix#Properties.
Metadata: ID: D83 | shortcut: rfmat | author: JoramSoch | date: 2020-07-22, 05:35.
β̂ = Ey
ŷ = P y (2)
ε̂ = Ry
where
E = (X T X)−1 X T
P = X(X T X)−1 X T (3)
R = In − X(X T X)−1 X T
are the estimation matrix (→ Definition III/1.4.9), projection matrix (→ Definition III/1.4.10) and
residual-forming matrix (→ Definition III/1.4.11) and n is the number of observations.
Proof:
1) Ordinary least squares parameter estimates of β are defined as minimizing the residual sum of
squares (→ Definition III/1.4.7)
β̂ = arg minᵦ (y − Xβ)ᵀ (y − Xβ) ,   (4)
the solution of which is given by (→ Proof III/1.4.3)
β̂ = (XᵀX)⁻¹ Xᵀ y
  =(3) E y .   (5)
2) The fitted signal is given by multiplying the design matrix with the estimated regression coefficients
ŷ = X β̂ (6)
and using (5), this becomes
ŷ = X (XᵀX)⁻¹ Xᵀ y
  =(3) P y .   (7)
3) The residuals of the model are calculated by subtracting the fitted signal from the measured signal
ε̂ = y − ŷ (8)
and using (7), this becomes
ε̂ = y − X (XᵀX)⁻¹ Xᵀ y
  = (Iₙ − X (XᵀX)⁻¹ Xᵀ) y   (9)
  =(3) R y .
Sources:
• Stephan, Klaas Enno (2010): “The General Linear Model (GLM)”; in: Methods and models for
fMRI data analysis in neuroeconomics, Lecture 3, Slide 10; URL: http://www.socialbehavior.uzh.
ch/teaching/methodsspring10.html.
Metadata: ID: P75 | shortcut: mlr-mat | author: JoramSoch | date: 2020-03-09, 21:18.
P² = P
R² = R .   (1)
Proof:
1) The projection matrix for ordinary least squares is given by (→ Proof III/1.4.12)
P = X (XᵀX)⁻¹ Xᵀ ,   (2)
such that
P² = X (XᵀX)⁻¹ XᵀX (XᵀX)⁻¹ Xᵀ = X (XᵀX)⁻¹ Xᵀ = P .   (3)
2) The residual-forming matrix for ordinary least squares is given by (→ Proof III/1.4.12)
R = Iₙ − P ,   (4)
such that
R² = (Iₙ − P)(Iₙ − P)
   = Iₙ − P − P + P²
   =(3) Iₙ − 2P + P   (5)
   = Iₙ − P
   =(4) R .
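Idempotency of P and R is easy to check on a random design matrix (arbitrary seed; a sketch, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 10, 4
X = rng.normal(size=(n, p))

# Projection and residual-forming matrices (Proof III/1.4.12)
P = X @ np.linalg.inv(X.T @ X) @ X.T
R = np.eye(n) - P

# Both matrices are idempotent (equation 1)
P_idem = np.allclose(P @ P, P)
R_idem = np.allclose(R @ R, R)
```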
Sources:
• Wikipedia (2020): “Projection matrix”; in: Wikipedia, the free encyclopedia, retrieved on 2020-07-
22; URL: https://en.wikipedia.org/wiki/Projection_matrix#Properties.
Metadata: ID: P135 | shortcut: mlr-idem | author: JoramSoch | date: 2020-07-22, 06:28.
y = Xβ + ε, ε ∼ N (0, σ 2 V ) , (1)
the parameters minimizing the weighted residual sum of squares (→ Definition III/1.4.7) are given
by
β̂ = (X T V −1 X)−1 X T V −1 y . (2)
Proof: Let W be an n × n matrix, such that
W V Wᵀ = Iₙ .   (3)
Since V is a covariance matrix and thus symmetric, W is also symmetric and can be expressed as
the matrix square root of the inverse of V :
W V W = In ⇔ V = W −1 W −1 ⇔ V −1 = W W ⇔ W = V −1/2 . (4)
Left-multiplying the linear regression equation (1) with W , the linear transformation theorem (→
Proof II/4.1.8) implies that
W y = W Xβ + W ε, W ε ∼ N (0, σ 2 W V W T ) . (5)
Applying (3), we see that (5) is actually a linear regression model (→ Definition III/1.4.1) with
independent observations,
ỹ = X̃ β + ε̃, ε̃ ∼ N(0, σ² Iₙ) ,   (6)
where ỹ = W y, X̃ = W X and ε̃ = W ε, such that we can apply the ordinary least squares solution
(→ Proof III/1.4.3) giving
β̂ = (X̃ᵀX̃)⁻¹ X̃ᵀ ỹ
  = ( (W X)ᵀ W X )⁻¹ (W X)ᵀ W y
  = ( Xᵀ Wᵀ W X )⁻¹ Xᵀ Wᵀ W y   (7)
  = ( Xᵀ W W X )⁻¹ Xᵀ W W y
  =(4) ( Xᵀ V⁻¹ X )⁻¹ Xᵀ V⁻¹ y .
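The whitening argument can be replayed numerically: computing W = V^(−1/2) from an eigendecomposition (one way to realize equation (4); the covariance structure V below is a hypothetical example) and running OLS on the whitened model reproduces the direct WLS formula:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

# Hypothetical symmetric positive-definite covariance structure V
A = rng.normal(size=(n, n))
V = A @ A.T + n * np.eye(n)

# Whitening matrix W = V^(-1/2) via eigendecomposition (equation 4)
evals, evecs = np.linalg.eigh(V)
W = evecs @ np.diag(evals ** -0.5) @ evecs.T

# OLS on the whitened model (equation 6) ...
Xt, yt = W @ X, W @ y
beta_white = np.linalg.solve(Xt.T @ Xt, Xt.T @ yt)

# ... equals the direct WLS formula (equation 2)
Vi = np.linalg.inv(V)
beta_wls = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)
```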
Sources:
• Stephan, Klaas Enno (2010): “The General Linear Model (GLM)”; in: Methods and models for
fMRI data analysis in neuroeconomics, Lecture 3, Slides 20/23; URL: http://www.socialbehavior.
uzh.ch/teaching/methodsspring10.html.
• Wikipedia (2021): “Weighted least squares”; in: Wikipedia, the free encyclopedia, retrieved on
2021-11-17; URL: https://en.wikipedia.org/wiki/Weighted_least_squares#Motivation.
Metadata: ID: P77 | shortcut: mlr-wls | author: JoramSoch | date: 2020-03-11, 11:22.
y = Xβ + ε, ε ∼ N (0, σ 2 V ) , (1)
the parameters minimizing the weighted residual sum of squares (→ Definition III/1.4.7) are given
by
β̂ = (X T V −1 X)−1 X T V −1 y . (2)
Proof: Let W be an n × n matrix, such that
W V Wᵀ = Iₙ .   (3)
Since V is a covariance matrix and thus symmetric, W is also symmetric and can be expressed as
the matrix square root of the inverse of V :
W V W = In ⇔ V = W −1 W −1 ⇔ V −1 = W W ⇔ W = V −1/2 . (4)
Left-multiplying the linear regression equation (1) with W , the linear transformation theorem (→
Proof II/4.1.8) implies that
W y = W Xβ + W ε, W ε ∼ N (0, σ 2 W V W T ) . (5)
Applying (3), we see that (5) is actually a linear regression model (→ Definition III/1.4.1) with
independent observations
W y = W Xβ + W ε, W ε ∼ N (0, σ 2 In ) . (6)
With this, we can express the weighted residual sum of squares (→ Definition III/1.4.7) as
wRSS(β) = Σⁿᵢ₌₁ (Wε)ᵢ² = (Wε)ᵀ(Wε) = (W y − W Xβ)ᵀ (W y − W Xβ)   (7)
wRSS(β) = yᵀWᵀW y − yᵀWᵀW Xβ − βᵀXᵀWᵀW y + βᵀXᵀWᵀW Xβ
        = yᵀW W y − 2βᵀXᵀW W y + βᵀXᵀW W Xβ   (8)
        =(4) yᵀV⁻¹y − 2βᵀXᵀV⁻¹y + βᵀXᵀV⁻¹Xβ .
The derivative of wRSS(β) with respect to β is
dwRSS(β)/dβ = −2XᵀV⁻¹y + 2XᵀV⁻¹Xβ   (9)
and setting this derivative to zero, we obtain:
dwRSS(β̂)/dβ = 0
0 = −2XᵀV⁻¹y + 2XᵀV⁻¹X β̂   (10)
XᵀV⁻¹X β̂ = XᵀV⁻¹ y
β̂ = (XᵀV⁻¹X)⁻¹ XᵀV⁻¹ y .
Sources:
• original work
Metadata: ID: P136 | shortcut: mlr-wls2 | author: JoramSoch | date: 2020-07-22, 06:48.
y = Xβ + ε, ε ∼ N (0, σ 2 V ) , (1)
the maximum likelihood estimates (→ Definition I/4.1.3) of β and σ 2 are given by
β̂ = (XᵀV⁻¹X)⁻¹ XᵀV⁻¹ y   (2)
σ̂² = (1/n) (y − X β̂)ᵀ V⁻¹ (y − X β̂) .
Proof: With the probability density function of the multivariate normal distribution (→ Proof
II/4.1.3), the linear regression equation (1) implies the following likelihood function (→ Definition
I/5.1.2),
p(y | β, σ²) = N(y; Xβ, σ² V) ,   (3)
such that, writing P = V⁻¹ for the inverse of the covariance structure, the log-likelihood function (→ Definition I/4.1.2) becomes
LL(β, σ²) = −(n/2) log(2π) − (n/2) log(σ²) − (1/2) log |V|
            − 1/(2σ²) ( yᵀP y − 2βᵀXᵀP y + βᵀXᵀP Xβ ) .   (5)
The derivative of the log-likelihood function (5) with respect to β is
dLL(β, σ²)/dβ = d/dβ [ −1/(2σ²) ( yᵀP y − 2βᵀXᵀP y + βᵀXᵀP Xβ ) ]
              = 1/(2σ²) · d/dβ ( 2βᵀXᵀP y − βᵀXᵀP Xβ )   (6)
              = 1/(2σ²) ( 2XᵀP y − 2XᵀP Xβ )
              = 1/σ² ( XᵀP y − XᵀP Xβ )
and setting this derivative to zero gives the MLE for β:
dLL(β̂, σ^2)/dβ = 0
0 = (1/σ^2) [ X^T P y − X^T P X β̂ ]
0 = X^T P y − X^T P X β̂ (7)
X^T P X β̂ = X^T P y
β̂ = (X^T P X)^{-1} X^T P y
dLL(β̂, σ^2)/dσ^2 = d/dσ^2 [ −(n/2) log(σ^2) − (1/(2σ^2)) (y − X β̂)^T V^{-1} (y − X β̂) ]
                 = −(n/2) (1/σ^2) + (1/(2(σ^2)^2)) (y − X β̂)^T V^{-1} (y − X β̂) (8)
                 = −n/(2σ^2) + (1/(2(σ^2)^2)) (y − X β̂)^T V^{-1} (y − X β̂)
dLL(β̂, σ̂^2)/dσ^2 = 0
0 = −n/(2σ̂^2) + (1/(2(σ̂^2)^2)) (y − X β̂)^T V^{-1} (y − X β̂)
n/(2σ̂^2) = (1/(2(σ̂^2)^2)) (y − X β̂)^T V^{-1} (y − X β̂) (9)
(2(σ̂^2)^2/n) · n/(2σ̂^2) = (2(σ̂^2)^2/n) · (1/(2(σ̂^2)^2)) (y − X β̂)^T V^{-1} (y − X β̂)
σ̂^2 = (1/n) (y − X β̂)^T V^{-1} (y − X β̂)
Together, (7) and (9) constitute the MLE for multiple linear regression.
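The two estimators in (7) and (9) translate directly into a few lines of code; a sketch with made-up test data (names and data are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one regressor
V = np.diag(rng.uniform(0.5, 2.0, size=n))              # known covariance structure
y = X @ np.array([2.0, -1.0]) + rng.multivariate_normal(np.zeros(n), 1.5 * V)

P = np.linalg.inv(V)                                    # precision matrix P = V^-1
b_hat = np.linalg.solve(X.T @ P @ X, X.T @ P @ y)       # MLE of beta, eq. (7)
r = y - X @ b_hat
s2_hat = (r @ P @ r) / n                                # MLE of sigma^2, eq. (9)
```

Note that σ̂^2 divides by n, not n − p; the maximum likelihood estimator of the noise variance is biased.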
Sources:
• original work
Metadata: ID: P78 | shortcut: mlr-mle | author: JoramSoch | date: 2020-03-11, 12:27.
m : y = Xβ + ε, ε ∼ N(0, σ^2 V) . (1)
Then, the maximum log-likelihood (→ Definition I/4.1.5) for this model is
MLL(m) = −(n/2) log(RSS/n) − (n/2) [1 + log(2π)] (2)
under uncorrelated observations (→ Definition III/1.4.1), i.e. if V = In , and
MLL(m) = −(n/2) log(wRSS/n) − (n/2) [1 + log(2π)] − (1/2) log |V| , (3)
in the general case, i.e. if V ̸= In , where RSS is the residual sum of squares (→ Definition III/1.4.7)
and wRSS is the weighted residual sum of squares (→ Proof III/1.4.15).
Proof: The likelihood function (→ Definition I/5.1.2) for multiple linear regression is given by (→
Proof III/1.4.16)
such that, with |σ 2 V | = (σ 2 )n |V |, the log-likelihood function (→ Definition I/4.1.2) for this model
becomes (→ Proof III/1.4.16)
(1/n) (y − X β̂)^T V^{-1} (y − X β̂) = (1/n) (W y − W X β̂)^T (W y − W X β̂) = (1/n) \sum_{i=1}^{n} (W ε̂)_i^2 = wRSS/n (7)
where W = V −1/2 . Plugging (6) into (5), we obtain the maximum log-likelihood (→ Definition
I/4.1.5) as
MLL(m) = LL(β̂, σ̂^2)
       = −(n/2) log(2π) − (n/2) log(σ̂^2) − (1/2) log |V| − (1/(2σ̂^2)) (y − X β̂)^T V^{-1} (y − X β̂)
       = −(n/2) log(2π) − (n/2) log(wRSS/n) − (1/2) log |V| − (1/2) · (n/wRSS) · wRSS (8)
       = −(n/2) log(wRSS/n) − (n/2) [1 + log(2π)] − (1/2) log |V|
which proves the result in (3). Assuming V = In , we have
σ̂^2 = (1/n) (y − X β̂)^T (y − X β̂) = (1/n) \sum_{i=1}^{n} ε̂_i^2 = RSS/n (9)
and
(1/2) log |V| = (1/2) log |I_n| = (1/2) log 1 = 0 , (10)
such that
MLL(m) = −(n/2) log(RSS/n) − (n/2) [1 + log(2π)] (11)
which proves the result in (2).
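The closed form (3) can be verified by evaluating the log-likelihood at the MLEs directly; a sketch using scipy (the random test data are ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
A = rng.normal(size=(n, n))
V = A @ A.T / n + np.eye(n)                       # covariance structure
y = X @ np.array([1.0, 0.5]) + rng.multivariate_normal(np.zeros(n), V)

P = np.linalg.inv(V)
b_hat = np.linalg.solve(X.T @ P @ X, X.T @ P @ y)
wRSS = (y - X @ b_hat) @ P @ (y - X @ b_hat)

# maximum log-likelihood via the closed form (3)
mll = (-n / 2 * np.log(wRSS / n) - n / 2 * (1 + np.log(2 * np.pi))
       - 0.5 * np.linalg.slogdet(V)[1])

# maximum log-likelihood via direct evaluation at the MLEs (sigma^2 = wRSS/n)
ll = multivariate_normal(mean=X @ b_hat, cov=(wRSS / n) * V).logpdf(y)
print(np.isclose(mll, ll))
```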
Sources:
• Claeskens G, Hjort NL (2008): “Akaike’s information criterion”; in: Model Selection and Model Av-
eraging, ex. 2.2, p. 66; URL: https://www.cambridge.org/core/books/model-selection-and-model-averaging
E6F1EC77279D1223423BB64FC3A12C37; DOI: 10.1017/CBO9780511790485.
Metadata: ID: P305 | shortcut: mlr-mll | author: JoramSoch | date: 2022-02-04, 07:27.
m : y = Xβ + ε, ε ∼ N(0, σ^2 V) . (1)
Then, the deviance (→ Definition IV/2.3.2) for this model is
D(β, σ^2) = RSS/σ^2 + n · log(σ^2) + n · log(2π) (2)
under uncorrelated observations (→ Definition III/1.4.1), i.e. if V = In , and
D(β, σ^2) = wRSS/σ^2 + n · log(σ^2) + n · log(2π) + log |V| , (3)
in the general case, i.e. if V ̸= In , where RSS is the residual sum of squares (→ Definition III/1.4.7)
and wRSS is the weighted residual sum of squares (→ Proof III/1.4.15).
Proof: The likelihood function (→ Definition I/5.1.2) for multiple linear regression is given by (→
Proof III/1.4.16)
such that, with |σ 2 V | = (σ 2 )n |V |, the log-likelihood function (→ Definition I/4.1.2) for this model
becomes (→ Proof III/1.4.16)
−(1/(2σ^2)) (y − Xβ)^T V^{-1} (y − Xβ) = −(1/(2σ^2)) (W y − W Xβ)^T (W y − W Xβ)
                                        = −(1/(2σ^2)) \sum_{i=1}^{n} (W ε)_i^2 = −wRSS/(2σ^2) (6)
where W = V −1/2 . Plugging (6) into (5) and multiplying with −2, we obtain the deviance (→
Definition IV/2.3.2) as
D(β, σ^2) = −2 LL(β, σ^2)
          = −2 [ −wRSS/(2σ^2) − (n/2) log(σ^2) − (n/2) log(2π) − (1/2) log |V| ] (7)
          = wRSS/σ^2 + n · log(σ^2) + n · log(2π) + log |V|
Assuming V = I_n, we have
−(1/(2σ^2)) (y − Xβ)^T V^{-1} (y − Xβ) = −(1/(2σ^2)) (y − Xβ)^T (y − Xβ)
                                        = −(1/(2σ^2)) \sum_{i=1}^{n} ε_i^2 = −RSS/(2σ^2) (8)
and
(1/2) log |V| = (1/2) log |I_n| = (1/2) log 1 = 0 , (9)
such that
D(β, σ^2) = RSS/σ^2 + n · log(σ^2) + n · log(2π) (10)
which proves the result in (2). This completes the proof.
Sources:
• original work
Metadata: ID: P312 | shortcut: mlr-dev | author: JoramSoch | date: 2022-03-01, 08:42.
m : y = Xβ + ε, ε ∼ N(0, σ^2 V) . (1)
Then, the Akaike information criterion (→ Definition IV/2.1.1) for this model is
AIC(m) = n log(wRSS/n) + n [1 + log(2π)] + log |V| + 2 (p + 1) (2)
where wRSS is the weighted residual sum of squares (→ Definition III/1.4.7), p is the number of
regressors (→ Definition III/1.4.1) in the design matrix X and n is the number of observations (→
Definition III/1.4.1) in the data vector y.
Proof: The Akaike information criterion (→ Definition IV/2.1.1) is defined as
AIC(m) = −2 MLL(m) + 2 k (3)
where MLL(m) is the maximum log-likelihood (→ Definition I/4.1.5) and k is the number of free
parameters in m.
The maximum log-likelihood for multiple linear regression (→ Proof III/1.4.17) is given by
MLL(m) = −(n/2) log(wRSS/n) − (n/2) [1 + log(2π)] − (1/2) log |V| (4)
and the number of free parameters in multiple linear regression (→ Definition III/1.4.1) is k = p + 1,
i.e. one for each regressor in the design matrix (→ Definition III/1.4.1) X, plus one for the noise
variance (→ Definition III/1.4.1) σ 2 .
Thus, the AIC of m follows from (3) and (4) as
AIC(m) = n log(wRSS/n) + n [1 + log(2π)] + log |V| + 2 (p + 1) . (5)
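Equation (5) can be sketched as a function and cross-checked against AIC(m) = −2 MLL(m) + 2k; helper names and test data below are ours:

```python
import numpy as np

def mlr_aic(y, X, V):
    """AIC of multiple linear regression with correlation structure V, eq. (5)."""
    n, p = X.shape
    P = np.linalg.inv(V)
    b = np.linalg.solve(X.T @ P @ X, X.T @ P @ y)
    wRSS = (y - X @ b) @ P @ (y - X @ b)
    return (n * np.log(wRSS / n) + n * (1 + np.log(2 * np.pi))
            + np.linalg.slogdet(V)[1] + 2 * (p + 1))

def mlr_mll(y, X, V):
    """Maximum log-likelihood, eq. (4)."""
    n, p = X.shape
    P = np.linalg.inv(V)
    b = np.linalg.solve(X.T @ P @ X, X.T @ P @ y)
    wRSS = (y - X @ b) @ P @ (y - X @ b)
    return (-n / 2 * np.log(wRSS / n) - n / 2 * (1 + np.log(2 * np.pi))
            - 0.5 * np.linalg.slogdet(V)[1])

rng = np.random.default_rng(3)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
aic = mlr_aic(y, X, np.eye(n))
k = X.shape[1] + 1
assert np.isclose(aic, -2 * mlr_mll(y, X, np.eye(n)) + 2 * k)
```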
Sources:
• Claeskens G, Hjort NL (2008): “Akaike’s information criterion”; in: Model Selection and Model Av-
eraging, ex. 2.2, p. 66; URL: https://www.cambridge.org/core/books/model-selection-and-model-averaging
E6F1EC77279D1223423BB64FC3A12C37; DOI: 10.1017/CBO9780511790485.
Metadata: ID: P307 | shortcut: mlr-aic | author: JoramSoch | date: 2022-02-11, 06:26.
m : y = Xβ + ε, ε ∼ N(0, σ^2 V) . (1)
Then, the Bayesian information criterion (→ Definition IV/2.2.1) for this model is
BIC(m) = n log(wRSS/n) + n [1 + log(2π)] + log |V| + log(n) (p + 1) (2)
where wRSS is the weighted residual sum of squares (→ Definition III/1.4.7), p is the number of
regressors (→ Definition III/1.4.1) in the design matrix X and n is the number of observations (→
Definition III/1.4.1) in the data vector y.
Sources:
• original work
Metadata: ID: P308 | shortcut: mlr-bic | author: JoramSoch | date: 2022-02-11, 06:43.
m : y = Xβ + ε, ε ∼ N(0, σ^2 V) . (1)
Then, the corrected Akaike information criterion (→ Definition IV/2.1.2) for this model is
AICc(m) = n log(wRSS/n) + n [1 + log(2π)] + log |V| + 2 n (p + 1)/(n − p − 2) (2)
where wRSS is the weighted residual sum of squares (→ Definition III/1.4.7), p is the number of
regressors (→ Definition III/1.4.1) in the design matrix X and n is the number of observations (→
Definition III/1.4.1) in the data vector y.
AICc(m) = AIC(m) + (2k^2 + 2k)/(n − k − 1) (3)
where AIC(m) is the Akaike information criterion (→ Definition IV/2.1.1), k is the number of free
parameters in m and n is the number of observations.
The Akaike information criterion for multiple linear regression (→ Proof III/1.4.17) is given by
AIC(m) = n log(wRSS/n) + n [1 + log(2π)] + log |V| + 2 (p + 1) (4)
and the number of free parameters in multiple linear regression (→ Definition III/1.4.1) is k = p + 1,
i.e. one for each regressor in the design matrix (→ Definition III/1.4.1) X, plus one for the noise
variance (→ Definition III/1.4.1) σ 2 .
Thus, the corrected AIC of m follows from (3) and (4) as
AICc(m) = n log(wRSS/n) + n [1 + log(2π)] + log |V| + 2 k + (2k^2 + 2k)/(n − k − 1)
        = n log(wRSS/n) + n [1 + log(2π)] + log |V| + (2nk − 2k^2 − 2k)/(n − k − 1) + (2k^2 + 2k)/(n − k − 1)
        = n log(wRSS/n) + n [1 + log(2π)] + log |V| + 2nk/(n − k − 1) (5)
        = n log(wRSS/n) + n [1 + log(2π)] + log |V| + 2 n (p + 1)/(n − p − 2) .
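The algebra in (5) reduces to a numeric identity in n and k that is easy to spot-check (the values below are arbitrary):

```python
# with k = p + 1 free parameters, 2k + (2k^2 + 2k)/(n - k - 1) collapses
# to 2n(p + 1)/(n - p - 2), the penalty term in (2)
n, p = 25, 3
k = p + 1
lhs = 2 * k + (2 * k**2 + 2 * k) / (n - k - 1)
rhs = 2 * n * (p + 1) / (n - p - 2)
assert abs(lhs - rhs) < 1e-12    # both equal 10.0 here
```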
Sources:
• Claeskens G, Hjort NL (2008): “Akaike’s information criterion”; in: Model Selection and Model Av-
eraging, ex. 2.5, p. 67; URL: https://www.cambridge.org/core/books/model-selection-and-model-averaging
E6F1EC77279D1223423BB64FC3A12C37; DOI: 10.1017/CBO9780511790485.
Metadata: ID: P309 | shortcut: mlr-aicc | author: JoramSoch | date: 2022-02-11, 07:07.
y = Xβ + ε, ε ∼ N(0, σ^2 V) (1)
be a linear regression model (→ Definition III/1.4.1) with measured n × 1 data vector y, known n × p
design matrix X, known n × n covariance structure V as well as unknown p × 1 regression coefficients
β and unknown noise variance σ 2 .
Then, the conjugate prior (→ Definition I/5.2.5) for this model is a normal-gamma distribution (→
Definition II/4.3.1)
p(y|β, τ) = \sqrt{ |P| / (2π)^n } · τ^{n/2} · exp[ −(τ/2) ( y^T P y − y^T P Xβ − β^T X^T P y + β^T X^T P Xβ ) ] . (6)
Completing the square over β finally gives
p(y|β, τ) = \sqrt{ |P| / (2π)^n } · τ^{n/2} · exp[ −(τ/2) ( (β − X̃y)^T X^T P X (β − X̃y) − y^T Q y + y^T P y ) ] (7)
where X̃ = (X^T P X)^{-1} X^T P and Q = X̃^T (X^T P X) X̃.
In other words, the likelihood function (→ Definition I/5.1.2) is proportional to a power of τ , times
an exponential of τ and an exponential of a squared form of β, weighted by τ :
p(y|β, τ) ∝ τ^{n/2} · exp[ −(τ/2) ( y^T P y − y^T Q y ) ] · exp[ −(τ/2) (β − X̃y)^T X^T P X (β − X̃y) ] . (8)
The same is true for a normal-gamma distribution (→ Definition II/4.3.1) over β and τ
Sources:
• Bishop CM (2006): “Bayesian linear regression”; in: Pattern Recognition for Machine Learning,
pp. 152-161, ex. 3.12, eq. 3.112; URL: https://www.springer.com/gp/book/9780387310732.
y = Xβ + ε, ε ∼ N(0, σ^2 V) (1)
be a linear regression model (→ Definition III/1.4.1) with measured n × 1 data vector y, known n × p
design matrix X, known n × n covariance structure V as well as unknown p × 1 regression coefficients
β and unknown noise variance σ 2 . Moreover, assume a normal-gamma prior distribution (→ Proof
III/1.5.1) over the model parameters β and τ = 1/σ 2 :
µ_n = Λ_n^{-1} (X^T P y + Λ_0 µ_0)
Λ_n = X^T P X + Λ_0
a_n = a_0 + n/2 (4)
b_n = b_0 + (1/2) ( y^T P y + µ_0^T Λ_0 µ_0 − µ_n^T Λ_n µ_n ) .
Proof: According to Bayes’ theorem (→ Proof I/5.3.1), the posterior distribution (→ Definition
I/5.1.7) is given by
p(y|β, τ ) p(β, τ )
p(β, τ |y) = . (5)
p(y)
Since p(y) is just a normalization factor, the posterior is proportional (→ Proof I/5.1.8) to the
numerator:
Combining the likelihood function (→ Definition I/5.1.2) (8) with the prior distribution (→ Definition
I/5.1.3) (2), the joint likelihood (→ Definition I/5.1.5) of the model is given by
p(y, β, τ) = \sqrt{ τ^{n+p} / (2π)^{n+p} · |P| |Λ_0| } · ( b_0^{a_0} / Γ(a_0) ) · τ^{a_0 − 1} exp[−b_0 τ] ·
             exp[ −(τ/2) ( (y − Xβ)^T P (y − Xβ) + (β − µ_0)^T Λ_0 (β − µ_0) ) ] . (10)
Expanding the products in the exponent gives:
p(y, β, τ) = \sqrt{ τ^{n+p} / (2π)^{n+p} · |P| |Λ_0| } · ( b_0^{a_0} / Γ(a_0) ) · τ^{a_0 − 1} exp[−b_0 τ] ·
             exp[ −(τ/2) ( y^T P y − y^T P Xβ − β^T X^T P y + β^T X^T P Xβ
                           + β^T Λ_0 β − β^T Λ_0 µ_0 − µ_0^T Λ_0 β + µ_0^T Λ_0 µ_0 ) ] . (11)
Completing the square over β, we finally have
p(y, β, τ) = \sqrt{ τ^{n+p} / (2π)^{n+p} · |P| |Λ_0| } · ( b_0^{a_0} / Γ(a_0) ) · τ^{a_0 − 1} exp[−b_0 τ] ·
             exp[ −(τ/2) ( (β − µ_n)^T Λ_n (β − µ_n) + ( y^T P y + µ_0^T Λ_0 µ_0 − µ_n^T Λ_n µ_n ) ) ] (12)
with the posterior hyperparameters (→ Definition I/5.1.7)
µ_n = Λ_n^{-1} (X^T P y + Λ_0 µ_0)
Λ_n = X^T P X + Λ_0 . (13)
Ergo, the joint likelihood is proportional to
p(y, β, τ) ∝ τ^{p/2} · exp[ −(τ/2) (β − µ_n)^T Λ_n (β − µ_n) ] · τ^{a_n − 1} · exp[−b_n τ] (14)
with the posterior hyperparameters (→ Definition I/5.1.7)
a_n = a_0 + n/2
b_n = b_0 + (1/2) ( y^T P y + µ_0^T Λ_0 µ_0 − µ_n^T Λ_n µ_n ) . (15)
From the terms in (14), we can isolate the posterior distribution over β given τ and the posterior distribution over τ, which together form a normal-gamma distribution with the hyperparameters in (13) and (15).
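The update equations (13) and (15) translate directly into code; a sketch (function name and test data are ours, not from the source):

```python
import numpy as np

def blr_posterior(y, X, V, mu0, Lam0, a0, b0):
    """Posterior hyperparameters of Bayesian linear regression, eqs. (13) and (15)."""
    P = np.linalg.inv(V)                          # precision matrix P = V^-1
    Lam_n = X.T @ P @ X + Lam0
    mu_n = np.linalg.solve(Lam_n, X.T @ P @ y + Lam0 @ mu0)
    a_n = a0 + y.size / 2
    b_n = b0 + 0.5 * (y @ P @ y + mu0 @ Lam0 @ mu0 - mu_n @ Lam_n @ mu_n)
    return mu_n, Lam_n, a_n, b_n

rng = np.random.default_rng(4)
n, p = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -1.0]) + rng.normal(size=n)
mu_n, Lam_n, a_n, b_n = blr_posterior(y, X, np.eye(n),
                                      np.zeros(p), np.eye(p), 1.0, 1.0)
```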
Sources:
• Bishop CM (2006): “Bayesian linear regression”; in: Pattern Recognition for Machine Learning,
pp. 152-161, ex. 3.12, eq. 3.113; URL: https://www.springer.com/gp/book/9780387310732.
Metadata: ID: P10 | shortcut: blr-post | author: JoramSoch | date: 2020-01-03, 17:53.
m : y = Xβ + ε, ε ∼ N(0, σ^2 V) (1)
be a linear regression model (→ Definition III/1.4.1) with measured n × 1 data vector y, known n × p
design matrix X, known n × n covariance structure V as well as unknown p × 1 regression coefficients
β and unknown noise variance σ 2 . Moreover, assume a normal-gamma prior distribution (→ Proof
III/1.5.1) over the model parameters β and τ = 1/σ 2 :
log p(y|m) = (1/2) log |P| − (n/2) log(2π) + (1/2) log |Λ_0| − (1/2) log |Λ_n|
             + log Γ(a_n) − log Γ(a_0) + a_0 log b_0 − a_n log b_n (3)
µ_n = Λ_n^{-1} (X^T P y + Λ_0 µ_0)
Λ_n = X^T P X + Λ_0
a_n = a_0 + n/2 (4)
b_n = b_0 + (1/2) ( y^T P y + µ_0^T Λ_0 µ_0 − µ_n^T Λ_n µ_n ) .
Proof: According to the law of marginal probability (→ Definition I/1.3.3), the model evidence (→
Definition I/5.1.9) for this model is:
p(y|m) = ∬ p(y|β, τ) p(β, τ) dβ dτ . (5)
According to the law of conditional probability (→ Definition I/1.3.4), the integrand is equivalent to
the joint likelihood (→ Definition I/5.1.5):
p(y|m) = ∬ p(y, β, τ) dβ dτ . (6)
When deriving the posterior distribution (→ Proof III/1.5.2) p(β, τ |y), the joint likelihood p(y, β, τ )
is obtained as
p(y, β, τ) = \sqrt{ τ^n |P| / (2π)^n } · \sqrt{ τ^p |Λ_0| / (2π)^p } · ( b_0^{a_0} / Γ(a_0) ) · τ^{a_0 − 1} exp[−b_0 τ] ·
             exp[ −(τ/2) ( (β − µ_n)^T Λ_n (β − µ_n) + ( y^T P y + µ_0^T Λ_0 µ_0 − µ_n^T Λ_n µ_n ) ) ] . (9)
Using the probability density function of the multivariate normal distribution (→ Proof II/4.1.3),
we can rewrite this as
p(y, β, τ) = \sqrt{ τ^n |P| / (2π)^n } · \sqrt{ |Λ_0| / |Λ_n| } · ( b_0^{a_0} / Γ(a_0) ) · τ^{a_0 − 1} exp[−b_0 τ] ·
             N(β; µ_n, (τ Λ_n)^{-1}) · exp[ −(τ/2) ( y^T P y + µ_0^T Λ_0 µ_0 − µ_n^T Λ_n µ_n ) ] . (10)
Now, β can be integrated out easily:
∫ p(y, β, τ) dβ = \sqrt{ τ^n |P| / (2π)^n } · \sqrt{ |Λ_0| / |Λ_n| } · ( b_0^{a_0} / Γ(a_0) ) · τ^{a_0 − 1} exp[−b_0 τ] ·
                  exp[ −(τ/2) ( y^T P y + µ_0^T Λ_0 µ_0 − µ_n^T Λ_n µ_n ) ] . (11)
Using the probability density function of the gamma distribution (→ Proof II/3.4.6), we can rewrite
this as
∫ p(y, β, τ) dβ = \sqrt{ |P| / (2π)^n } · \sqrt{ |Λ_0| / |Λ_n| } · ( b_0^{a_0} / Γ(a_0) ) · ( Γ(a_n) / b_n^{a_n} ) · Gam(τ; a_n, b_n) . (12)
Finally, τ can also be integrated out:
∬ p(y, β, τ) dβ dτ = \sqrt{ |P| / (2π)^n } · \sqrt{ |Λ_0| / |Λ_n| } · ( Γ(a_n) / Γ(a_0) ) · ( b_0^{a_0} / b_n^{a_n} ) = p(y|m) . (13)
Thus, the log model evidence (→ Definition IV/3.1.3) of this model is given by
log p(y|m) = (1/2) log |P| − (n/2) log(2π) + (1/2) log |Λ_0| − (1/2) log |Λ_n|
             + log Γ(a_n) − log Γ(a_0) + a_0 log b_0 − a_n log b_n . (14)
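Equation (14) can be sketched as a function (name and example data are ours); log-determinants and gammaln keep the evaluation numerically stable:

```python
import numpy as np
from scipy.special import gammaln

def blr_log_evidence(y, X, V, mu0, Lam0, a0, b0):
    """Log model evidence of Bayesian linear regression, eq. (14)."""
    n = y.size
    P = np.linalg.inv(V)
    Lam_n = X.T @ P @ X + Lam0
    mu_n = np.linalg.solve(Lam_n, X.T @ P @ y + Lam0 @ mu0)
    a_n = a0 + n / 2
    b_n = b0 + 0.5 * (y @ P @ y + mu0 @ Lam0 @ mu0 - mu_n @ Lam_n @ mu_n)
    return (0.5 * np.linalg.slogdet(P)[1] - n / 2 * np.log(2 * np.pi)
            + 0.5 * np.linalg.slogdet(Lam0)[1] - 0.5 * np.linalg.slogdet(Lam_n)[1]
            + gammaln(a_n) - gammaln(a0) + a0 * np.log(b0) - a_n * np.log(b_n))

rng = np.random.default_rng(5)
n, p = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -1.0]) + rng.normal(size=n)
lme = blr_log_evidence(y, X, np.eye(n), np.zeros(p), np.eye(p), 1.0, 1.0)
```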
Sources:
• Bishop CM (2006): “Bayesian linear regression”; in: Pattern Recognition for Machine Learning,
pp. 152-161, ex. 3.23, eq. 3.118; URL: https://www.springer.com/gp/book/9780387310732.
Metadata: ID: P11 | shortcut: blr-lme | author: JoramSoch | date: 2020-01-03, 22:05.
µ_n = Λ_n^{-1} (X^T P y + Λ_0 µ_0)
Λ_n = X^T P X + Λ_0
a_n = a_0 + n/2 (10)
b_n = b_0 + (1/2) ( y^T P y + µ_0^T Λ_0 µ_0 − µ_n^T Λ_n µ_n ) .
Thus, we have the following posterior expectations:
⟨β⟩_{β,τ|y} = µ_n (11)
⟨τ⟩_{β,τ|y} = a_n / b_n (12)
⟨log τ⟩_{τ|y} = ψ(a_n) − log(b_n) (13)
⟨β^T A β⟩_{β|τ,y} = µ_n^T A µ_n + tr( A (τ Λ_n)^{-1} )
                  = µ_n^T A µ_n + (1/τ) tr( A Λ_n^{-1} ) . (14)
τ
In these identities, we have used the mean of the multivariate normal distribution (→ Proof II/4.1.4),
the mean of the gamma distribution (→ Proof II/3.4.9), the logarithmic expectation of the gamma
distribution (→ Proof II/3.4.11), the expectation of a quadratic form (→ Proof I/1.7.9) and the
covariance of the multivariate normal distribution (→ Proof II/4.1.5).
With that, the deviance at the expectation is:
D(⟨β⟩, ⟨τ⟩) = n · log(2π) − n · log(⟨τ⟩) − log |P| + ⟨τ⟩ · (y − X ⟨β⟩)^T P (y − X ⟨β⟩)   (by 5)
            = n · log(2π) − n · log(⟨τ⟩) − log |P| + ⟨τ⟩ · (y − X µ_n)^T P (y − X µ_n)   (by 11) (15)
            = n · log(2π) − n · log(a_n/b_n) − log |P| + (a_n/b_n) · (y − X µ_n)^T P (y − X µ_n) .   (by 12)
⟨D(β, τ)⟩ = ⟨ n · log(2π) − n · log(τ) − log |P| + τ · (y − Xβ)^T P (y − Xβ) ⟩   (by 5)
          = n · log(2π) − n · ⟨log(τ)⟩ − log |P| + ⟨ τ · (y − Xβ)^T P (y − Xβ) ⟩
          = n · log(2π) − n · [ψ(a_n) − log(b_n)] − log |P|
            + ⟨ τ · ⟨ (y − Xβ)^T P (y − Xβ) ⟩_{β|τ,y} ⟩_{τ|y}   (by 13)
DIC(m) = 2 ⟨D(β, τ)⟩ − D(⟨β⟩, ⟨τ⟩)   (by 8)
       = 2 [ n · log(2π) − n · [ψ(a_n) − log(b_n)] − log |P|
             + (a_n/b_n) · (y − X µ_n)^T P (y − X µ_n) + tr( X^T P X Λ_n^{-1} ) ] (16)
         − [ n · log(2π) − n · log(a_n/b_n) − log |P| + (a_n/b_n) · (y − X µ_n)^T P (y − X µ_n) ]   (by 15)
       = n · log(2π) − 2n ψ(a_n) + 2n log(b_n) + n log(a_n) − n log(b_n) − log |P|
         + (a_n/b_n) (y − X µ_n)^T P (y − X µ_n) + tr( X^T P X Λ_n^{-1} ) (17)
       = n · log(2π) − n [2 ψ(a_n) − log(a_n) − log(b_n)] − log |P|
         + (a_n/b_n) (y − X µ_n)^T P (y − X µ_n) + tr( X^T P X Λ_n^{-1} ) .
This conforms to equation (3).
Sources:
• original work
Metadata: ID: P313 | shortcut: blr-dic | author: JoramSoch | date: 2022-03-01, 12:10.
y = Xβ + ε, ε ∼ N(0, σ^2 V) (1)
and assume a normal-gamma (→ Definition II/4.3.1) prior distribution (→ Definition I/5.1.3) over
the model parameters β and τ = 1/σ 2 :
H_1 : c^T β > 0 (3)
is given by
Pr(H_1 | y) = 1 − T( −c^T µ / \sqrt{c^T Σ c} ; ν ) (4)
where c is a p×1 contrast vector (→ Definition “con”), T(x; ν) is the cumulative distribution function
(→ Definition I/1.6.13) of the t-distribution (→ Definition II/3.3.1) with ν degrees of freedom (→
Definition “dof”) and µ, Σ and ν can be obtained from the posterior hyperparameters (→ Definition
I/5.1.7) of Bayesian linear regression.
Proof: The posterior distribution for Bayesian linear regression (→ Proof III/1.5.2) is given by a
normal-gamma distribution (→ Definition II/4.3.1) over β and τ = 1/σ 2
µ_n = Λ_n^{-1} (X^T P y + Λ_0 µ_0)
Λ_n = X^T P X + Λ_0
a_n = a_0 + n/2 (6)
b_n = b_0 + (1/2) ( y^T P y + µ_0^T Λ_0 µ_0 − µ_n^T Λ_n µ_n ) .
The marginal distribution of a normal-gamma distribution is a multivariate t-distribution (→ Proof
II/4.3.8), such that the marginal (→ Definition I/1.5.3) posterior (→ Definition I/5.1.7) distribution
of β is
µ = µ_n
Σ = ( (a_n/b_n) Λ_n )^{-1} = (b_n/a_n) Λ_n^{-1} (8)
ν = 2 a_n .
Define the quantity γ = cT β. According to the linear transformation theorem for the multivariate t-
distribution (→ Proof “mvt-ltt”), γ also follows a multivariate t-distribution (→ Definition II/4.2.1):
Using the relation between non-standardized t-distribution and standard t-distribution (→ Proof
II/3.3.3), we can finally write:
Pr(H_1 | y) = 1 − T( (0 − c^T µ) / \sqrt{c^T Σ c} ; ν )
            = 1 − T( −c^T µ / \sqrt{c^T Σ c} ; ν ) . (11)
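A sketch of (11) with made-up posterior hyperparameters (all numerical values here are illustrative assumptions, not from the source):

```python
import numpy as np
from scipy.stats import t as t_dist

# illustrative posterior hyperparameters from Bayesian linear regression
mu_n = np.array([0.8, -0.3])
Lam_n = 4.0 * np.eye(2)
a_n, b_n = 12.0, 10.0

# marginal posterior of beta: t-distribution parameters from (8)
mu = mu_n
Sigma = (b_n / a_n) * np.linalg.inv(Lam_n)
nu = 2 * a_n

c = np.array([1.0, 0.0])                          # contrast: test H1: beta_1 > 0
pp = 1 - t_dist.cdf(-(c @ mu) / np.sqrt(c @ Sigma @ c), df=nu)
```

Here `pp` is the posterior probability that the contrast c^T β exceeds zero.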
Sources:
• Koch, Karl-Rudolf (2007): “Multivariate t-distribution”; in: Introduction to Bayesian Statistics,
Springer, Berlin/Heidelberg, 2007, eqs. 2.235, 2.236, 2.213, 2.210, 2.188; URL: https://www.
springer.com/de/book/9783540727231; DOI: 10.1007/978-3-540-72726-2.
Metadata: ID: P133 | shortcut: blr-pp | author: JoramSoch | date: 2020-07-17, 17:03.
y = Xβ + ε, ε ∼ N(0, σ^2 V) (1)
and assume a normal-gamma (→ Definition II/4.3.1) prior distribution (→ Definition I/5.1.3) over
the model parameters β and τ = 1/σ 2 :
H_0 : C^T β = 0 (3)
Proof: The posterior distribution for Bayesian linear regression (→ Proof III/1.5.2) is given by a
normal-gamma distribution (→ Definition II/4.3.1) over β and τ = 1/σ 2
µ_n = Λ_n^{-1} (X^T P y + Λ_0 µ_0)
Λ_n = X^T P X + Λ_0
a_n = a_0 + n/2 (6)
b_n = b_0 + (1/2) ( y^T P y + µ_0^T Λ_0 µ_0 − µ_n^T Λ_n µ_n ) .
The marginal distribution of a normal-gamma distribution is a multivariate t-distribution (→ Proof
II/4.3.8), such that the marginal (→ Definition I/1.5.3) posterior (→ Definition I/5.1.7) distribution
of β is
µ = µ_n
Σ = ( (a_n/b_n) Λ_n )^{-1} = (b_n/a_n) Λ_n^{-1} (8)
ν = 2 a_n .
Define the quantity γ = C T β. According to the linear transformation theorem for the multivariate t-
distribution (→ Proof “mvt-ltt”), γ also follows a multivariate t-distribution (→ Definition II/4.2.1):
(1 − α) = F( QF(0); q, ν )
        = F( [ µ^T C (C^T Σ C)^{-1} C^T µ ] / q ; q, ν ) . (11)
Sources:
• Koch, Karl-Rudolf (2007): “Multivariate t-distribution”; in: Introduction to Bayesian Statistics,
Springer, Berlin/Heidelberg, 2007, eqs. 2.235, 2.236, 2.213, 2.210, 2.211, 2.183; URL: https://
www.springer.com/de/book/9783540727231; DOI: 10.1007/978-3-540-72726-2.
Metadata: ID: P134 | shortcut: blr-pcr | author: JoramSoch | date: 2020-07-17, 17:41.
Y = XB + E, E ∼ MN (0, V, Σ) (1)
is called a multivariate linear regression model or simply, “general linear model”.
• Y is called “data matrix”, “set of dependent variables” or “measurements”;
• X is called “design matrix”, “set of independent variables” or “predictors”;
• B are called “regression coefficients” or “weights”;
• E is called “noise matrix” or “error terms”;
• V is called “covariance across rows”;
• Σ is called “covariance across columns”;
• n is the number of observations;
• v is the number of measurements;
• p is the number of predictors.
When rows of Y correspond to units of time, e.g. subsequent measurements, V is called “temporal
covariance”. When columns of Y correspond to units of space, e.g. measurement channels, Σ is called
“spatial covariance”.
When the covariance matrix V is a scalar multiple of the n×n identity matrix, this is called a general
linear model with independent and identically distributed (i.i.d.) observations:
V = λ I_n ⇒ E ∼ MN(0, λ I_n, Σ) ⇒ ε_i ∼ N(0, λΣ) i.i.d. (2)
Otherwise, it is called a general linear model with correlated observations.
Sources:
• Wikipedia (2020): “General linear model”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-21; URL: https://en.wikipedia.org/wiki/General_linear_model.
Metadata: ID: D40 | shortcut: glm | author: JoramSoch | date: 2020-03-21, 22:24.
Y = XB + E, E ∼ MN(0, σ^2 I_n, Σ) , (1)
the ordinary least squares (→ Proof III/1.4.3) parameter estimates are given by
B̂ = (X^T X)^{-1} X^T Y . (2)
Proof: Let B̂ be the ordinary least squares (→ Proof III/1.4.3) (OLS) solution and let Ê = Y − X B̂
be the resulting matrix of residuals. According to the exogeneity assumption of OLS, the errors have
conditional mean (→ Definition I/1.7.1) zero
2. MULTIVARIATE NORMAL DATA 427
E(E|X) = 0 , (3)
a direct consequence of which is that the regressors are uncorrelated with the errors
E(X^T E) = 0 , (4)
which, in the finite sample, means that the residual matrix must be orthogonal to the design matrix:
X^T Ê = 0 . (5)
From (5), the OLS formula can be directly derived:
X^T Ê = 0
X^T ( Y − X B̂ ) = 0
X^T Y − X^T X B̂ = 0 (6)
X^T X B̂ = X^T Y
B̂ = (X^T X)^{-1} X^T Y .
Sources:
• original work
Metadata: ID: P106 | shortcut: glm-ols | author: JoramSoch | date: 2020-05-19, 06:02.
Y = XB + E, E ∼ MN(0, V, Σ) , (1)
the weighted least squares (→ Proof III/1.4.14) parameter estimates are given by
B̂ = (X^T V^{-1} X)^{-1} X^T V^{-1} Y . (2)
W V W^T = I_n . (3)
Since V is a covariance matrix and thus symmetric, W is also symmetric and can be expressed as
the matrix square root of the inverse of V :
W W = V^{-1} ⇔ W = V^{-1/2} . (4)
Left-multiplying the linear regression equation (1) with W , the linear transformation theorem (→
Proof II/5.1.9) implies that
W Y = W XB + W E, W E ∼ MN(0, W V W^T, Σ) . (5)
Applying (3), we see that (5) is actually a general linear model (→ Definition III/2.1.1) with independent observations, Ỹ = X̃B + Ẽ where Ỹ = W Y, X̃ = W X and Ẽ = W E, such that the ordinary least squares solution (→ Proof III/2.1.2) gives
B̂ = (X̃^T X̃)^{-1} X̃^T Ỹ
  = ( (W X)^T W X )^{-1} (W X)^T W Y
  = ( X^T W^T W X )^{-1} X^T W^T W Y (7)
  = ( X^T W W X )^{-1} X^T W W Y
  = ( X^T V^{-1} X )^{-1} X^T V^{-1} Y   (by 4)
Sources:
• original work
Metadata: ID: P107 | shortcut: glm-wls | author: JoramSoch | date: 2020-05-19, 06:27.
Y = XB + E, E ∼ MN (0, V, Σ) , (1)
maximum likelihood estimates (→ Definition I/4.1.3) for the unknown parameters B and Σ are given
by
B̂ = (X^T V^{-1} X)^{-1} X^T V^{-1} Y
Σ̂ = (1/n) (Y − X B̂)^T V^{-1} (Y − X B̂) . (2)
LL(B, Σ) = −(nv/2) log(2π) − (n/2) log(|Σ|) + (v/2) log(|P|)
           − (1/2) tr[ Σ^{-1} ( Y^T P Y − Y^T P XB − B^T X^T P Y + B^T X^T P XB ) ] (5)
dLL(B, Σ)/dB = d/dB [ −(1/2) tr( Σ^{-1} ( Y^T P Y − Y^T P XB − B^T X^T P Y + B^T X^T P XB ) ) ]
             = d/dB [ −(1/2) tr( −2 Σ^{-1} Y^T P XB ) ] + d/dB [ −(1/2) tr( Σ^{-1} B^T X^T P XB ) ]
             = −(1/2) [ −2 X^T P Y Σ^{-1} ] − (1/2) [ X^T P X B Σ^{-1} + (X^T P X)^T B (Σ^{-1})^T ] (6)
             = X^T P Y Σ^{-1} − X^T P X B Σ^{-1}
dLL(B̂, Σ)/dB = 0
0 = X^T P Y Σ^{-1} − X^T P X B̂ Σ^{-1}
0 = X^T P Y − X^T P X B̂ (7)
X^T P X B̂ = X^T P Y
B̂ = (X^T P X)^{-1} X^T P Y
dLL(B̂, Σ)/dΣ = d/dΣ [ −(n/2) log |Σ| − (1/2) tr( Σ^{-1} (Y − X B̂)^T V^{-1} (Y − X B̂) ) ]
             = −(n/2) (Σ^{-1})^T + (1/2) [ Σ^{-1} (Y − X B̂)^T V^{-1} (Y − X B̂) Σ^{-1} ]^T (8)
             = −(n/2) Σ^{-1} + (1/2) Σ^{-1} (Y − X B̂)^T V^{-1} (Y − X B̂) Σ^{-1}
and setting this derivative to zero gives the MLE for Σ:
dLL(B̂, Σ̂)/dΣ = 0
0 = −(n/2) Σ̂^{-1} + (1/2) Σ̂^{-1} (Y − X B̂)^T V^{-1} (Y − X B̂) Σ̂^{-1}
(n/2) Σ̂^{-1} = (1/2) Σ̂^{-1} (Y − X B̂)^T V^{-1} (Y − X B̂) Σ̂^{-1}
Σ̂^{-1} = (1/n) Σ̂^{-1} (Y − X B̂)^T V^{-1} (Y − X B̂) Σ̂^{-1} (9)
I_v = (1/n) (Y − X B̂)^T V^{-1} (Y − X B̂) Σ̂^{-1}
Σ̂ = (1/n) (Y − X B̂)^T V^{-1} (Y − X B̂)
Together, (7) and (9) constitute the MLE for the GLM.
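A sketch of the GLM estimators (7) and (9); the test data are ours, and V = I_n is chosen for simplicity:

```python
import numpy as np

rng = np.random.default_rng(6)
n, v = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # n x p design matrix
V = np.eye(n)                                           # i.i.d. observations
B = np.array([[1.0, 0.0, -1.0],
              [0.5, 2.0, 0.0]])                         # p x v coefficients
Y = X @ B + rng.normal(size=(n, v))

P = np.linalg.inv(V)
B_hat = np.linalg.solve(X.T @ P @ X, X.T @ P @ Y)       # MLE of B, eq. (7)
R = Y - X @ B_hat
Sigma_hat = R.T @ P @ R / n                             # MLE of Sigma, eq. (9)
```

By construction Σ̂ is symmetric and, for generic data, positive definite.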
Sources:
• original work
Y = XB + E, E ∼ MN (0, V, Σ) (1)
Y = Xt Γ + Et , Et ∼ MN (0, V, Σt ) (2)
and assume that X_t can be transformed into X using a transformation matrix T ∈ R^{t×p}
X = Xt T (3)
where p < t and X, Xt and T have full ranks rk(X) = p, rk(Xt ) = t and rk(T ) = p.
Then, a linear model (→ Definition III/2.1.1) of the parameter estimates from (2), under the as-
sumption of (1), is called a transformed general linear model.
Sources:
• Soch J, Allefeld C, Haynes JD (2020): “Inverse transformed encoding models – a solution to
the problem of correlated trial-by-trial parameter estimates in fMRI decoding”; in: NeuroIm-
age, vol. 209, art. 116449, Appendix A; URL: https://www.sciencedirect.com/science/article/pii/
S1053811919310407; DOI: 10.1016/j.neuroimage.2019.116449.
Metadata: ID: D160 | shortcut: tglm | author: JoramSoch | date: 2021-10-21, 14:43.
Y = XB + E, E ∼ MN (0, V, Σ) (1)
Y = Xt Γ + Et , Et ∼ MN (0, V, Σt ) (2)
and a matrix T transforming Xt into X:
X = Xt T . (3)
Then, the transformed general linear model (→ Definition III/2.2.1) is given by
Γ̂ = T B + H, H ∼ MN (0, U, Σ) (4)
where the covariance across rows (→ Definition II/5.1.1) is U = (X_t^T V^{-1} X_t)^{-1} .
Proof: The linear transformation theorem for the matrix-normal distribution (→ Proof II/5.1.9)
states:
Y ∼ MN (XB, V, Σ) (7)
Combining (6) with (7), the distribution of Γ̂ is
Γ̂ ∼ MN( (X_t^T V^{-1} X_t)^{-1} X_t^T V^{-1} X B, (X_t^T V^{-1} X_t)^{-1} X_t^T V^{-1} V V^{-1} X_t (X_t^T V^{-1} X_t)^{-1}, Σ )
  ∼ MN( (X_t^T V^{-1} X_t)^{-1} X_t^T V^{-1} X_t T B, (X_t^T V^{-1} X_t)^{-1} X_t^T V^{-1} X_t (X_t^T V^{-1} X_t)^{-1}, Σ ) (8)
  ∼ MN( T B, (X_t^T V^{-1} X_t)^{-1}, Σ ) .
Sources:
• Soch J, Allefeld C, Haynes JD (2020): “Inverse transformed encoding models – a solution to the
problem of correlated trial-by-trial parameter estimates in fMRI decoding”; in: NeuroImage, vol.
209, art. 116449, Appendix A, Theorem 1; URL: https://www.sciencedirect.com/science/article/
pii/S1053811919310407; DOI: 10.1016/j.neuroimage.2019.116449.
Metadata: ID: P265 | shortcut: tglm-dist | author: JoramSoch | date: 2021-10-21, 15:03.
Y = XB + E, E ∼ MN (0, V, Σ) (1)
and the transformed general linear model (→ Definition III/2.2.1)
Γ̂ = T B + H, H ∼ MN (0, U, Σ) (2)
which are linked to each other (→ Proof III/2.2.2) via
X = Xt T . (4)
Then, the parameter estimates for B from (1) and (2) are equivalent.
Proof: The weighted least squares parameter estimates (→ Proof III/2.1.3) for (1) are given by
B̂ = (X^T V^{-1} X)^{-1} X^T V^{-1} Y (5)
and the weighted least squares parameter estimates (→ Proof III/2.1.3) for (2) are given by
B̂ = (T^T U^{-1} T)^{-1} T^T U^{-1} Γ̂ . (6)
The covariance across rows for the transformed general linear model (→ Proof III/2.2.2) is equal to U = (X_t^T V^{-1} X_t)^{-1}, such that
B̂ = (T^T U^{-1} T)^{-1} T^T U^{-1} Γ̂   (by 6)
  = (T^T X_t^T V^{-1} X_t T)^{-1} T^T X_t^T V^{-1} X_t Γ̂ (7)
  = (X^T V^{-1} X)^{-1} T^T X_t^T V^{-1} X_t Γ̂   (by 4)
  = (X^T V^{-1} X)^{-1} T^T X_t^T V^{-1} X_t (X_t^T V^{-1} X_t)^{-1} X_t^T V^{-1} Y   (by 3) (8)
  = (X^T V^{-1} X)^{-1} T^T X_t^T V^{-1} Y
  = (X^T V^{-1} X)^{-1} X^T V^{-1} Y   (by 4)
which is equivalent to the estimates in (5).
Sources:
• Soch J, Allefeld C, Haynes JD (2020): “Inverse transformed encoding models – a solution to the
problem of correlated trial-by-trial parameter estimates in fMRI decoding”; in: NeuroImage, vol.
209, art. 116449, Appendix A, Theorem 2; URL: https://www.sciencedirect.com/science/article/
pii/S1053811919310407; DOI: 10.1016/j.neuroimage.2019.116449.
Metadata: ID: P266 | shortcut: tglm-para | author: JoramSoch | date: 2021-10-21, 15:25.
Y = XB + E, E ∼ MN (0, V, Σ) . (1)
Then, a linear model (→ Definition III/2.1.1) of X in terms of Y , under the assumption of (1), is
called an inverse general linear model.
Sources:
• Soch J, Allefeld C, Haynes JD (2020): “Inverse transformed encoding models – a solution to
the problem of correlated trial-by-trial parameter estimates in fMRI decoding”; in: NeuroIm-
age, vol. 209, art. 116449, Appendix C; URL: https://www.sciencedirect.com/science/article/pii/
S1053811919310407; DOI: 10.1016/j.neuroimage.2019.116449.
Metadata: ID: D161 | shortcut: iglm | author: JoramSoch | date: 2021-10-21, 15:31.
Y = XB + E, E ∼ MN (0, V, Σ) . (1)
Then, the inverse general linear model (→ Definition III/2.3.1) of X ∈ R^{n×p} is given by
X = Y W + N, N ∼ MN(0, V, Σ_x) (2)
where W ∈ R^{v×p} is a matrix, such that B W = I_p, and the covariance across columns (→ Definition
II/5.1.1) is Σ_x = W^T Σ W .
Proof: The linear transformation theorem for the matrix-normal distribution (→ Proof II/5.1.9)
states:
X = Y W + N, N ∼ MN(0, V, W^T Σ W) (6)
which is equivalent to (2).
Sources:
• Soch J, Allefeld C, Haynes JD (2020): “Inverse transformed encoding models – a solution to the
problem of correlated trial-by-trial parameter estimates in fMRI decoding”; in: NeuroImage, vol.
209, art. 116449, Appendix C, Theorem 4; URL: https://www.sciencedirect.com/science/article/
pii/S1053811919310407; DOI: 10.1016/j.neuroimage.2019.116449.
Metadata: ID: P267 | shortcut: iglm-dist | author: JoramSoch | date: 2021-10-21, 16:03.
Y = XB + E, E ∼ MN (0, V, Σ) (1)
implying the inverse general linear model (→ Proof III/2.3.2) of X ∈ R^{n×p}
X = Y W + N, N ∼ MN (0, V, Σx ) . (2)
where
B W = I_p and Σ_x = W^T Σ W . (3)
Then, the weighted least squares solution (→ Proof III/2.1.3) for W is the best linear unbiased
estimator (→ Definition “blue”) of W .
Proof: The linear transformation theorem for the matrix-normal distribution (→ Proof II/5.1.9)
states:
Ŵ = (Y^T V^{-1} Y)^{-1} Y^T V^{-1} X . (5)
The best linear unbiased estimator (→ Definition “blue”) θ̂ of a certain quantity θ estimated from
measured data (→ Definition “data”) y is 1) an estimator resulting from a linear operation f (y), 2)
whose expected value is equal to θ and 3) which has, among those satisfying 1) and 2), the minimum
variance (→ Definition I/1.8.1).
W̃ = M X ∼ MN(M Y W, M V M^T, Σ_x) (6)
which requires (→ Proof II/5.1.4) that M Y = I_v . This is fulfilled by any matrix
M = (Y^T V^{-1} Y)^{-1} Y^T V^{-1} + D (7)
where D is a v × n matrix which satisfies DY = 0.
3) Third, the best linear unbiased estimator (→ Definition “blue”) is the one with minimum variance
(→ Definition I/1.8.1), i.e. the one that minimizes the expected Frobenius norm
Var[W̃] = ⟨ tr[ (W̃ − W)^T (W̃ − W) ] ⟩ . (8)
Var[W̃(M)] = ⟨ tr[ (W̃ − W)^T (W̃ − W) ] ⟩   (by 8)
           = ⟨ tr[ (W̃ − W)(W̃ − W)^T ] ⟩
           = tr[ ⟨ (W̃ − W)(W̃ − W)^T ⟩ ] (11)
           = tr[ tr(Σ_x) M V M^T ]   (by 10)
           = tr(Σ_x) tr(M V M^T) .
Var[W̃(D)] = tr(Σ_x) tr[ ( (Y^T V^{-1} Y)^{-1} Y^T V^{-1} + D ) V ( (Y^T V^{-1} Y)^{-1} Y^T V^{-1} + D )^T ]   (by 7)
           = tr(Σ_x) tr[ (Y^T V^{-1} Y)^{-1} Y^T V^{-1} V V^{-1} Y (Y^T V^{-1} Y)^{-1}
                         + (Y^T V^{-1} Y)^{-1} Y^T V^{-1} V D^T + D V V^{-1} Y (Y^T V^{-1} Y)^{-1} + D V D^T ] (12)
           = tr(Σ_x) [ tr( (Y^T V^{-1} Y)^{-1} ) + tr( D V D^T ) ] .
Since D V D^T is a positive-semidefinite matrix, all its eigenvalues are non-negative. Because the trace
of a square matrix is the sum of its eigenvalues, the minimum variance is achieved by D = 0, thus
producing Ŵ as in (5).
Sources:
• Soch J, Allefeld C, Haynes JD (2020): “Inverse transformed encoding models – a solution to the
problem of correlated trial-by-trial parameter estimates in fMRI decoding”; in: NeuroImage, vol.
209, art. 116449, Appendix C, Theorem 5; URL: https://www.sciencedirect.com/science/article/
pii/S1053811919310407; DOI: 10.1016/j.neuroimage.2019.116449.
Metadata: ID: P268 | shortcut: iglm-blue | author: JoramSoch | date: 2021-10-21, 16:46.
X̂ = Y W . (1)
Given that the columns of X̂ are linearly independent, then
Sources:
• Haufe S, Meinecke F, Görgen K, Dähne S, Haynes JD, Blankertz B, Bießmann F (2014): “On
the interpretation of weight vectors of linear models in multivariate neuroimaging”; in: Neu-
roImage, vol. 87, pp. 96–110, eq. 3; URL: https://www.sciencedirect.com/science/article/pii/
S1053811913010914; DOI: 10.1016/j.neuroimage.2013.10.067.
Metadata: ID: D162 | shortcut: cfm | author: JoramSoch | date: 2021-10-21, 17:01.
X̂ = Y W . (1)
Then, the parameter matrix of the corresponding forward model (→ Definition III/2.3.4) is equal to
A = Σy W Σx^−1    (2)

with the "sample covariances" (→ Definition I/1.9.2)

Σx = X̂^T X̂
Σy = Y^T Y .    (3)
Proof: The corresponding forward model (→ Definition III/2.3.4) is given by

Y = X̂ A^T + E ,    (4)
subject to the constraint that predicted X and errors E are uncorrelated:
X̂ T E = 0 . (5)
With that, we can directly derive the parameter matrix A:
Y = X̂ A^T + E                          (4)
X̂ A^T = Y − E
X̂^T X̂ A^T = X̂^T (Y − E)
X̂^T X̂ A^T = X̂^T Y − X̂^T E
X̂^T X̂ A^T = X̂^T Y        using (5)    (6)
X̂^T X̂ A^T = W^T Y^T Y    using (1)
Σx A^T = W^T Σy            using (3)
A^T = Σx^−1 W^T Σy
A = Σy W Σx^−1 .
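A minimal numerical sketch of this result (hypothetical data; variable names are my own): compute the forward-model parameters A = Σy W Σx^−1 from a backward model X̂ = Y W and verify the uncorrelatedness constraint X̂^T E = 0.

```python
import numpy as np

rng = np.random.default_rng(1)
n, v, p = 50, 4, 2                   # samples, measured channels, latent variables
Y = rng.standard_normal((n, v))
W = rng.standard_normal((v, p))      # weight matrix of the backward model

X_hat = Y @ W                        # extracted latent variables, eq. (1)
Sigma_x = X_hat.T @ X_hat            # "sample covariances", eq. (3)
Sigma_y = Y.T @ Y
A = Sigma_y @ W @ np.linalg.inv(Sigma_x)   # forward-model parameters, eq. (2)

# the constraint of the corresponding forward model: X_hat' E = 0, eq. (5)
E = Y - X_hat @ A.T
print(np.allclose(X_hat.T @ E, 0, atol=1e-6))   # True
```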
Sources:
• Haufe S, Meinecke F, Görgen K, Dähne S, Haynes JD, Blankertz B, Bießmann F (2014): “On
the interpretation of weight vectors of linear models in multivariate neuroimaging”; in: NeuroIm-
age, vol. 87, pp. 96–110, Theorem 1; URL: https://www.sciencedirect.com/science/article/pii/
S1053811913010914; DOI: 10.1016/j.neuroimage.2013.10.067.
Metadata: ID: P269 | shortcut: cfm-para | author: JoramSoch | date: 2021-10-21, 17:20.
X̂ = Y W . (1)
Then, there exists a corresponding forward model (→ Definition III/2.3.4).
A = Σy W Σ−1
x where Σx = X̂ T X̂ and Σy = Y T Y . (3)
1) Because the columns of X̂ are assumed to be linearly independent by definition of the corresponding
forward model (→ Definition III/2.3.4), the matrix Σx = X̂ T X̂ is invertible, such that A in (3) is
well-defined.
2) Moreover, the solution for the matrix A satisfies the constraint of the corresponding forward
model (→ Definition III/2.3.4) for predicted X and errors E to be uncorrelated which can be shown
as follows:
438 CHAPTER III. STATISTICAL MODELS
(2)
X̂ T E = X̂ T Y − X̂AT
(3)
−1
= X̂ Y − X̂ Σx W Σy
T T
= X̂ T Y − X̂ T X̂ Σ−1 T
x W Σy
(3)
−1 (4)
= X̂ T Y − X̂ T X̂ X̂ T X̂ W T Y TY
(1)
= (Y W )T Y − W T Y T Y
= W TY TY − W TY TY
=0.
Sources:
• Haufe S, Meinecke F, Görgen K, Dähne S, Haynes JD, Blankertz B, Bießmann F (2014): “On
the interpretation of weight vectors of linear models in multivariate neuroimaging”; in: NeuroIm-
age, vol. 87, pp. 96–110, Appendix B; URL: https://www.sciencedirect.com/science/article/pii/
S1053811913010914; DOI: 10.1016/j.neuroimage.2013.10.067.
Metadata: ID: P270 | shortcut: cfm-exist | author: JoramSoch | date: 2021-10-21, 17:43.
Y = XB + E, E ∼ MN (0, V, Σ) (1)
be a general linear model (→ Definition III/2.1.1) with measured n × v data matrix Y , known n × p
design matrix X, known n × n covariance structure (→ Definition II/5.1.1) V as well as unknown
p × v regression coefficients B and unknown v × v noise covariance (→ Definition II/5.1.1) Σ.
Then, the conjugate prior (→ Definition I/5.2.5) for this model is a normal-Wishart distribution (→
Definition II/5.3.1)
p(B, T) = MN(B; M0, Λ0^−1, T^−1) · W(T; Ω0^−1, ν0) .    (2)

Proof: With the probability density function of the matrix-normal distribution (→ Proof II/5.1.3), the likelihood function (→ Definition I/5.1.2) implied by (1) is given by

p(Y|B, Σ) = MN(Y; XB, V, Σ) = √( 1 / ((2π)^nv |Σ|^n |V|^v) ) · exp[ −1/2 · tr( Σ^−1 (Y − XB)^T V^−1 (Y − XB) ) ]    (3)
which, for mathematical convenience, can also be parametrized as
p(Y|B, T) = MN(Y; XB, P^−1, T^−1) = √( (|T|^n |P|^v) / (2π)^nv ) · exp[ −1/2 · tr( T (Y − XB)^T P (Y − XB) ) ]    (4)
using the v × v precision matrix (→ Definition I/1.9.19) T = Σ−1 and the n × n precision matrix (→
Definition I/1.9.19) P = V −1 .
Expanding the quadratic form in the exponent, the likelihood function can be developed into

p(Y|B, T) = √( |P|^v / (2π)^nv ) · |T|^(n/2) · exp[ −1/2 · tr( T (Y^T P Y − Y^T P X B − B^T X^T P Y + B^T X^T P X B) ) ] .    (6)
Completing the square over B, finally gives
p(Y|B, T) = √( |P|^v / (2π)^nv ) · |T|^(n/2) · exp[ −1/2 · tr( T [ (B − X̃Y)^T X^T P X (B − X̃Y) − Y^T Q Y + Y^T P Y ] ) ]    (7)

where X̃ = (X^T P X)^−1 X^T P and Q = X̃^T (X^T P X) X̃.
In other words, the likelihood function (→ Definition I/5.1.2) is proportional to a power of the
determinant of T , times an exponential of the trace of T and an exponential of the trace of a squared
form of B, weighted by T :
p(Y|B, T) ∝ |T|^(n/2) · exp[ −1/2 · tr( T (Y^T P Y − Y^T Q Y) ) ] · exp[ −1/2 · tr( T (B − X̃Y)^T X^T P X (B − X̃Y) ) ] .    (8)
The same is true for a normal-Wishart distribution (→ Definition II/5.3.1) over B and T
p(B, T) = √( (|T|^p |Λ0|^v) / (2π)^pv ) · exp[ −1/2 · tr( T (B − M0)^T Λ0 (B − M0) ) ] ·
          1/Γv(ν0/2) · √( |Ω0|^ν0 / 2^(ν0·v) ) · |T|^((ν0−v−1)/2) · exp[ −1/2 · tr( Ω0 T ) ]    (10)
which is proportional to

p(B, T) ∝ |T|^((ν0+p−v−1)/2) · exp[ −1/2 · tr( T Ω0 ) ] · exp[ −1/2 · tr( T (B − M0)^T Λ0 (B − M0) ) ] ,    (11)

the same functional form in B and T as the likelihood in (8). Thus, the normal-Wishart distribution is the conjugate prior for this model.
Sources:
• Wikipedia (2020): “Bayesian multivariate linear regression”; in: Wikipedia, the free encyclopedia,
retrieved on 2020-09-03; URL: https://en.wikipedia.org/wiki/Bayesian_multivariate_linear_regression#
Conjugate_prior_distribution.
Metadata: ID: P159 | shortcut: mblr-prior | author: JoramSoch | date: 2020-09-03, 07:33.
Y = XB + E, E ∼ MN (0, V, Σ) (1)
be a general linear model (→ Definition III/2.1.1) with measured n × v data matrix Y , known n × p
design matrix X, known n×n covariance structure (→ Definition II/5.1.1) V as well as unknown p×v
regression coefficients B and unknown v × v noise covariance (→ Definition II/5.1.1) Σ. Moreover,
assume a normal-Wishart prior distribution (→ Proof III/2.4.1) over the model parameters B and
T = Σ−1 :
p(B, T) = MN(B; M0, Λ0^−1, T^−1) · W(T; Ω0^−1, ν0) .    (2)

Then, the posterior distribution (→ Definition I/5.1.7) is also a normal-Wishart distribution (→ Definition II/5.3.1)

p(B, T|Y) = MN(B; Mn, Λn^−1, T^−1) · W(T; Ωn^−1, νn)    (3)

with the posterior hyperparameters (→ Definition I/5.1.7)

Mn = Λn^−1 (X^T P Y + Λ0 M0)
Λn = X^T P X + Λ0
Ωn = Ω0 + Y^T P Y + M0^T Λ0 M0 − Mn^T Λn Mn    (4)
νn = ν0 + n .
Proof: According to Bayes’ theorem (→ Proof I/5.3.1), the posterior distribution (→ Definition
I/5.1.7) is given by
p(B, T|Y) = p(Y|B, T) p(B, T) / p(Y) .    (5)

Since p(Y) is just a normalization factor, the posterior is proportional (→ Proof I/5.1.8) to the joint likelihood (→ Definition I/5.1.5):

p(B, T|Y) ∝ p(Y|B, T) p(B, T) = p(Y, B, T) .    (6)

With the probability density function of the matrix-normal distribution (→ Proof II/5.1.3), the likelihood function (→ Definition I/5.1.2) implied by (1) is given by

p(Y|B, Σ) = MN(Y; XB, V, Σ) = √( 1 / ((2π)^nv |Σ|^n |V|^v) ) · exp[ −1/2 · tr( Σ^−1 (Y − XB)^T V^−1 (Y − XB) ) ]    (7)
which, for mathematical convenience, can also be parametrized as
p(Y|B, T) = MN(Y; XB, P^−1, T^−1) = √( (|T|^n |P|^v) / (2π)^nv ) · exp[ −1/2 · tr( T (Y − XB)^T P (Y − XB) ) ]    (8)
using the v × v precision matrix (→ Definition I/1.9.19) T = Σ−1 and the n × n precision matrix (→
Definition I/1.9.19) P = V −1 .
Combining the likelihood function (→ Definition I/5.1.2) (8) with the prior distribution (→ Definition
I/5.1.3) (2), the joint likelihood (→ Definition I/5.1.5) of the model is given by
p(Y, B, T) = p(Y|B, T) p(B, T)
           = √( (|T|^n |P|^v) / (2π)^nv ) · √( (|T|^p |Λ0|^v) / (2π)^pv ) ·
             1/Γv(ν0/2) · √( |Ω0|^ν0 / 2^(ν0·v) ) · |T|^((ν0−v−1)/2) · exp[ −1/2 · tr( Ω0 T ) ] ·    (10)
             exp[ −1/2 · tr( T [ (Y − XB)^T P (Y − XB) + (B − M0)^T Λ0 (B − M0) ] ) ] .
Expanding the quadratic forms in the exponent, this becomes

p(Y, B, T) = √( (|T|^n |P|^v) / (2π)^nv ) · √( (|T|^p |Λ0|^v) / (2π)^pv ) ·
             1/Γv(ν0/2) · √( |Ω0|^ν0 / 2^(ν0·v) ) · |T|^((ν0−v−1)/2) · exp[ −1/2 · tr( Ω0 T ) ] ·    (11)
             exp[ −1/2 · tr( T [ Y^T P Y − Y^T P X B − B^T X^T P Y + B^T X^T P X B +
                                 B^T Λ0 B − B^T Λ0 M0 − M0^T Λ0 B + M0^T Λ0 M0 ] ) ] .
Completing the square over B, the joint likelihood can be rewritten as

p(Y, B, T) = √( (|T|^n |P|^v) / (2π)^nv ) · √( (|T|^p |Λ0|^v) / (2π)^pv ) ·
             1/Γv(ν0/2) · √( |Ω0|^ν0 / 2^(ν0·v) ) · |T|^((ν0−v−1)/2) · exp[ −1/2 · tr( Ω0 T ) ] ·    (12)
             exp[ −1/2 · tr( T [ (B − Mn)^T Λn (B − Mn) + (Y^T P Y + M0^T Λ0 M0 − Mn^T Λn Mn) ] ) ]
with the posterior hyperparameters (→ Definition I/5.1.7)

Mn = Λn^−1 (X^T P Y + Λ0 M0)
Λn = X^T P X + Λ0 .    (13)
Ergo, the joint likelihood is proportional to

p(Y, B, T) ∝ |T|^(p/2) · exp[ −1/2 · tr( T (B − Mn)^T Λn (B − Mn) ) ] · |T|^((νn−v−1)/2) · exp[ −1/2 · tr( Ωn T ) ]    (14)
with the posterior hyperparameters (→ Definition I/5.1.7)

Ωn = Ω0 + Y^T P Y + M0^T Λ0 M0 − Mn^T Λn Mn
νn = ν0 + n .    (15)
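As a numerical sketch (all dimensions, data and prior choices below are hypothetical, with P taken as the identity for simplicity), the posterior hyperparameter updates in (13) and (15) can be written as:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, v = 30, 3, 2
X = rng.standard_normal((n, p))
B_true = rng.standard_normal((p, v))
Y = X @ B_true + rng.standard_normal((n, v))
P = np.eye(n)                        # precision structure P = V^-1

# prior hyperparameters (arbitrary choices for illustration)
M0 = np.zeros((p, v))
L0 = np.eye(p)                       # Lambda_0
O0 = np.eye(v)                       # Omega_0
nu0 = v + 1

Ln = X.T @ P @ X + L0                                   # Lambda_n
Mn = np.linalg.solve(Ln, X.T @ P @ Y + L0 @ M0)         # M_n
On = O0 + Y.T @ P @ Y + M0.T @ L0 @ M0 - Mn.T @ Ln @ Mn # Omega_n
nun = nu0 + n                                           # nu_n

# Omega_n must remain symmetric positive-definite for the Wishart posterior
assert np.allclose(On, On.T)
assert np.all(np.linalg.eigvalsh(On) > 0)
```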
From the term in (14), we can isolate the posterior distribution over B given T:

p(B|T, Y) = MN(B; Mn, Λn^−1, T^−1) .    (16)

From the remaining term, we can isolate the posterior distribution over T:

p(T|Y) = W(T; Ωn^−1, νn) .    (17)

Together, (16) and (17) constitute the joint posterior distribution (→ Definition I/5.1.7) of B and T.
Sources:
• Wikipedia (2020): “Bayesian multivariate linear regression”; in: Wikipedia, the free encyclopedia,
retrieved on 2020-09-03; URL: https://en.wikipedia.org/wiki/Bayesian_multivariate_linear_regression#
Posterior_distribution.
Metadata: ID: P160 | shortcut: mblr-post | author: JoramSoch | date: 2020-09-03, 08:37.
Y = XB + E, E ∼ MN (0, V, Σ) (1)
be a general linear model (→ Definition III/2.1.1) with measured n × v data matrix Y , known n × p
design matrix X, known n × n covariance structure (→ Definition II/5.1.1) V as well as unknown p × v regression coefficients B and unknown v × v noise covariance (→ Definition II/5.1.1) Σ. Moreover, assume a normal-Wishart prior distribution (→ Proof III/2.4.1) over the model parameters B and T = Σ^−1:

p(B, T) = MN(B; M0, Λ0^−1, T^−1) · W(T; Ω0^−1, ν0) .    (2)

Then, the log model evidence (→ Definition IV/3.1.3) for this model is

log p(Y|m) = v/2 · log|P| − nv/2 · log(2π) + v/2 · log|Λ0| − v/2 · log|Λn| +
             ν0/2 · log|1/2 Ω0| − νn/2 · log|1/2 Ωn| + log Γv(νn/2) − log Γv(ν0/2)    (3)

with the posterior hyperparameters (→ Proof III/2.4.2)
Mn = Λn^−1 (X^T P Y + Λ0 M0)
Λn = X^T P X + Λ0
Ωn = Ω0 + Y^T P Y + M0^T Λ0 M0 − Mn^T Λn Mn    (4)
νn = ν0 + n .
Proof: According to the law of marginal probability (→ Definition I/1.3.3), the model evidence (→
Definition I/5.1.9) for this model is:
p(Y|m) = ∬ p(Y|B, T) p(B, T) dB dT .    (5)
According to the law of conditional probability (→ Definition I/1.3.4), the integrand is equivalent to
the joint likelihood (→ Definition I/5.1.5):
p(Y|m) = ∬ p(Y, B, T) dB dT .    (6)
With the probability density function of the matrix-normal distribution (→ Proof II/5.1.3), the likelihood function (→ Definition I/5.1.2) implied by (1) is given by

p(Y|B, Σ) = MN(Y; XB, V, Σ) = √( 1 / ((2π)^nv |Σ|^n |V|^v) ) · exp[ −1/2 · tr( Σ^−1 (Y − XB)^T V^−1 (Y − XB) ) ]    (7)
which, for mathematical convenience, can also be parametrized as
p(Y|B, T) = MN(Y; XB, P^−1, T^−1) = √( (|T|^n |P|^v) / (2π)^nv ) · exp[ −1/2 · tr( T (Y − XB)^T P (Y − XB) ) ]    (8)
using the v × v precision matrix (→ Definition I/1.9.19) T = Σ−1 and the n × n precision matrix (→
Definition I/1.9.19) P = V −1 .
When deriving the posterior distribution (→ Proof III/2.4.2) p(B, T |Y ), the joint likelihood p(Y, B, T )
is obtained as
p(Y, B, T) = √( (|T|^n |P|^v) / (2π)^nv ) · √( (|T|^p |Λ0|^v) / (2π)^pv ) ·
             1/Γv(ν0/2) · √( |Ω0|^ν0 / 2^(ν0·v) ) · |T|^((ν0−v−1)/2) · exp[ −1/2 · tr( Ω0 T ) ] ·    (9)
             exp[ −1/2 · tr( T [ (B − Mn)^T Λn (B − Mn) + (Y^T P Y + M0^T Λ0 M0 − Mn^T Λn Mn) ] ) ] .
Using the probability density function of the matrix-normal distribution (→ Proof II/5.1.3), we can
rewrite this as
p(Y, B, T) = √( (|T|^n |P|^v) / (2π)^nv ) · √( (2π)^pv / (|T|^p |Λn|^v) ) · √( (|T|^p |Λ0|^v) / (2π)^pv ) ·
             1/Γv(ν0/2) · √( |Ω0|^ν0 / 2^(ν0·v) ) · |T|^((ν0−v−1)/2) · exp[ −1/2 · tr( Ω0 T ) ] ·    (10)
             MN(B; Mn, Λn^−1, T^−1) · exp[ −1/2 · tr( T (Y^T P Y + M0^T Λ0 M0 − Mn^T Λn Mn) ) ] .
Now, B can be integrated out easily:

∫ p(Y, B, T) dB = √( (|T|^n |P|^v) / (2π)^nv ) · √( |Λ0|^v / |Λn|^v ) · 1/Γv(ν0/2) · √( |Ω0|^ν0 / 2^(ν0·v) ) · |T|^((ν0−v−1)/2) ·
                  exp[ −1/2 · tr( T (Ω0 + Y^T P Y + M0^T Λ0 M0 − Mn^T Λn Mn) ) ] .    (11)
Using the probability density function of the Wishart distribution (→ Proof “wish-pdf”), we can
rewrite this as
∫ p(Y, B, T) dB = √( |P|^v / (2π)^nv ) · √( |Λ0|^v / |Λn|^v ) · √( |Ω0|^ν0 / 2^(ν0·v) ) · √( 2^(νn·v) / |Ωn|^νn ) ·
                  Γv(νn/2)/Γv(ν0/2) · W(T; Ωn^−1, νn) .    (12)
Finally, T can also be integrated out:
∬ p(Y, B, T) dB dT = √( |P|^v / (2π)^nv ) · √( |Λ0|^v / |Λn|^v ) · |1/2 Ω0|^(ν0/2) / |1/2 Ωn|^(νn/2) · Γv(νn/2)/Γv(ν0/2) = p(Y|m) .    (13)
Thus, the log model evidence (→ Definition IV/3.1.3) of this model is given by
log p(Y|m) = v/2 · log|P| − nv/2 · log(2π) + v/2 · log|Λ0| − v/2 · log|Λn| +
             ν0/2 · log|1/2 Ω0| − νn/2 · log|1/2 Ωn| + log Γv(νn/2) − log Γv(ν0/2) .    (14)
Sources:
• original work
Metadata: ID: P161 | shortcut: mblr-lme | author: JoramSoch | date: 2020-09-03, 09:23.
3 Count data
3.1 Binomial observations
3.1.1 Definition
Definition: An ordered pair (n, y) with n ∈ ℕ and y ∈ ℕ0, where y is the number of successes in n trials, constitutes a set of binomial observations.
Sources:
• original work
Metadata: ID: D78 | shortcut: bin-data | author: JoramSoch | date: 2020-07-07, 07:04.
y ∼ Bin(n, p) . (1)
Then, the conjugate prior (→ Definition I/5.2.5) for the model parameter p is a beta distribution (→ Definition II/3.9.1):

p(p) = Bet(p; α0, β0) .    (2)
Proof: With the probability mass function of the binomial distribution (→ Proof II/1.3.2), the
likelihood function (→ Definition I/5.1.2) implied by (1) is given by
p(y|p) = (n choose y) p^y (1 − p)^(n−y) .    (3)
In other words, the likelihood function is proportional to a power of p times a power of (1 − p):

p(y|p) ∝ p^y (1 − p)^(n−y) .    (4)

The probability density function of the beta distribution (→ Proof II/3.9.3)

p(p) = 1/B(α0, β0) · p^(α0−1) (1 − p)^(β0−1) = Bet(p; α0, β0)    (5)

exhibits the same proportionality and is therefore conjugate to the binomial likelihood.
Sources:
• Wikipedia (2020): “Binomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
01-23; URL: https://en.wikipedia.org/wiki/Binomial_distribution#Estimation_of_parameters.
Metadata: ID: P29 | shortcut: bin-prior | author: JoramSoch | date: 2020-01-23, 23:38.
y ∼ Bin(n, p) . (1)
Moreover, assume a beta prior distribution (→ Proof III/3.1.2) over the model parameter p:

p(p) = Bet(p; α0, β0) .    (2)

Then, the posterior distribution (→ Definition I/5.1.7) is also a beta distribution (→ Definition II/3.9.1)

p(p|y) = Bet(p; αn, βn)    (3)

with the posterior hyperparameters (→ Definition I/5.1.7)

αn = α0 + y
βn = β0 + (n − y) .    (4)
Proof: With the probability mass function of the binomial distribution (→ Proof II/1.3.2), the
likelihood function (→ Definition I/5.1.2) implied by (1) is given by
p(y|p) = (n choose y) p^y (1 − p)^(n−y) .    (5)
Combining the likelihood function (5) with the prior distribution (2), the joint likelihood (→ Definition I/5.1.5) of the model is given by

p(y, p) = p(y|p) p(p) = (n choose y) p^y (1 − p)^(n−y) · 1/B(α0, β0) · p^(α0−1) (1 − p)^(β0−1) .    (6)

Note that the posterior distribution is proportional to the joint likelihood (→ Proof I/5.1.8):

p(p|y) ∝ p(y, p) ∝ p^(α0+y−1) (1 − p)^(β0+n−y−1) = p^(αn−1) (1 − p)^(βn−1)    (8)

which, when normalized to one, results in the probability density function of the beta distribution (→ Proof II/3.9.3):

p(p|y) = 1/B(αn, βn) · p^(αn−1) (1 − p)^(βn−1) = Bet(p; αn, βn) .    (9)
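The conjugate update (4) can be sketched in a few lines (hypothetical counts; the helper name is my own):

```python
def beta_binomial_update(n, y, alpha0=1.0, beta0=1.0):
    """Posterior hyperparameters of p given y successes in n trials, eq. (4)."""
    return alpha0 + y, beta0 + (n - y)

an, bn = beta_binomial_update(n=10, y=7)      # uniform prior Bet(1, 1)
posterior_mean = an / (an + bn)               # mean of Bet(8, 4) = 8/12
print(an, bn, posterior_mean)
```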
Sources:
• Wikipedia (2020): “Binomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
01-23; URL: https://en.wikipedia.org/wiki/Binomial_distribution#Estimation_of_parameters.
Metadata: ID: P30 | shortcut: bin-post | author: JoramSoch | date: 2020-01-24, 00:20.
y ∼ Bin(n, p) . (1)
Moreover, assume a beta prior distribution (→ Proof III/3.1.2) over the model parameter p:

p(p) = Bet(p; α0, β0) .    (2)

Then, the log model evidence (→ Definition IV/3.1.3) for this model is

log p(y|m) = log (n choose y) + log B(αn, βn) − log B(α0, β0)    (3)

with the posterior hyperparameters (→ Proof III/3.1.3)

αn = α0 + y
βn = β0 + (n − y) .    (4)
Proof: With the probability mass function of the binomial distribution (→ Proof II/1.3.2), the
likelihood function (→ Definition I/5.1.2) implied by (1) is given by
p(y|p) = (n choose y) p^y (1 − p)^(n−y) .    (5)
Combining the likelihood function (5) with the prior distribution (2), the joint likelihood (→ Definition I/5.1.5) of the model is given by

p(y, p) = p(y|p) p(p) = (n choose y) p^y (1 − p)^(n−y) · 1/B(α0, β0) · p^(α0−1) (1 − p)^(β0−1) .    (6)
Note that the model evidence is the marginal density of the joint likelihood (→ Definition I/5.1.9):
p(y) = ∫ p(y, p) dp .    (7)
Using the probability density function of the beta distribution (→ Proof II/3.9.3), p can now be integrated out easily:

p(y) = (n choose y) · B(αn, βn)/B(α0, β0) · ∫ 1/B(αn, βn) · p^(αn−1) (1 − p)^(βn−1) dp
     = (n choose y) · B(αn, βn)/B(α0, β0) · ∫ Bet(p; αn, βn) dp    (9)
     = (n choose y) · B(αn, βn)/B(α0, β0) ,
such that the log model evidence (→ Definition IV/3.1.3) (LME) is shown to be
log p(y|m) = log (n choose y) + log B(αn, βn) − log B(α0, β0) .    (10)
With the definition of the binomial coefficient
(n choose k) = n! / (k! (n − k)!)    (11)
and the definition of the gamma function, Γ(n + 1) = n!, the first term in (10) can equivalently be written as

log (n choose y) = log Γ(n + 1) − log Γ(y + 1) − log Γ(n − y + 1) .    (12)
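The log model evidence (10) can be evaluated numerically; below is a sketch with hypothetical data (function names are my own), using log B(a, b) = log Γ(a) + log Γ(b) − log Γ(a + b):

```python
from math import comb, exp, lgamma, log

def log_beta(a, b):
    # log of the beta function B(a, b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def bin_lme(n, y, alpha0=1.0, beta0=1.0):
    # log model evidence, eq. (10)
    an, bn = alpha0 + y, beta0 + (n - y)
    return log(comb(n, y)) + log_beta(an, bn) - log_beta(alpha0, beta0)

# sanity check: under the uniform prior Bet(1, 1), p(y) = 1/(n+1) for all y
n = 10
evidences = [exp(bin_lme(n, y)) for y in range(n + 1)]
print(all(abs(e - 1 / (n + 1)) < 1e-12 for e in evidences))   # True
```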
Sources:
• Wikipedia (2020): “Beta-binomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2020-01-24; URL: https://en.wikipedia.org/wiki/Beta-binomial_distribution#Motivation_and_
derivation.
Metadata: ID: P31 | shortcut: bin-lme | author: JoramSoch | date: 2020-01-24, 00:44.
Definition: An ordered pair (n, y) with n ∈ ℕ and y = [y1, . . . , yk], yj ∈ ℕ0, Σj yj = n, where yj is the number of occurrences of the j-th out of k categories in n trials, constitutes a set of multinomial observations.

Sources:
• original work
Metadata: ID: D79 | shortcut: mult-data | author: JoramSoch | date: 2020-07-07, 07:12.
y ∼ Mult(n, p) . (1)
Then, the conjugate prior (→ Definition I/5.2.5) for the model parameter p is a Dirichlet distribution (→ Definition II/4.4.1):

p(p) = Dir(p; α0) .    (2)
Proof: With the probability mass function of the multinomial distribution (→ Proof II/2.2.2), the
likelihood function (→ Definition I/5.1.2) implied by (1) is given by
p(y|p) = (n choose y1, . . . , yk) · ∏_{j=1}^k pj^yj .    (3)
In other words, the likelihood function is proportional to a product of powers of the entries of the
vector p:
p(y|p) ∝ ∏_{j=1}^k pj^yj .    (4)
The probability density function of the Dirichlet distribution (→ Proof II/4.4.2) exhibits the same proportionality

p(p) ∝ ∏_{j=1}^k pj^(α0j − 1)    (7)

and is therefore conjugate to the multinomial likelihood.
Sources:
• Wikipedia (2020): “Dirichlet distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-11; URL: https://en.wikipedia.org/wiki/Dirichlet_distribution#Conjugate_to_categorical/multinomi
Metadata: ID: P79 | shortcut: mult-prior | author: JoramSoch | date: 2020-03-11, 14:15.
y ∼ Mult(n, p) . (1)
Moreover, assume a Dirichlet prior distribution (→ Proof III/3.2.2) over the model parameter p:

p(p) = Dir(p; α0) .    (2)

Then, the posterior distribution (→ Definition I/5.1.7) is also a Dirichlet distribution (→ Definition II/4.4.1)

p(p|y) = Dir(p; αn)    (3)

with the posterior hyperparameters (→ Definition I/5.1.7)

αnj = α0j + yj , j = 1, . . . , k .    (4)
Proof: With the probability mass function of the multinomial distribution (→ Proof II/2.2.2), the
likelihood function (→ Definition I/5.1.2) implied by (1) is given by
p(y|p) = (n choose y1, . . . , yk) · ∏_{j=1}^k pj^yj .    (5)
Combining the likelihood function (5) with the prior distribution (2), the joint likelihood (→ Definition I/5.1.5) of the model is given by

p(y, p) = p(y|p) p(p) = (n choose y1, . . . , yk) · Γ( Σ_{j=1}^k α0j ) / ∏_{j=1}^k Γ(α0j) · ∏_{j=1}^k pj^(yj + α0j − 1) .    (6)

Note that the posterior distribution is proportional to the joint likelihood (→ Proof I/5.1.8):

p(p|y) ∝ ∏_{j=1}^k pj^(αnj − 1)    (8)

which, when normalized to one, results in the probability density function of the Dirichlet distribution (→ Proof II/4.4.2):

p(p|y) = Γ( Σ_{j=1}^k αnj ) / ∏_{j=1}^k Γ(αnj) · ∏_{j=1}^k pj^(αnj − 1) = Dir(p; αn) .    (9)
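The elementwise update (4) can be sketched as follows (hypothetical counts; helper names are my own):

```python
def dirichlet_update(alpha0, y):
    """Posterior concentration parameters given category counts y, eq. (4)."""
    return [a + c for a, c in zip(alpha0, y)]

alpha0 = [1.0, 1.0, 1.0]        # symmetric prior over k = 3 categories
y = [5, 3, 2]                   # observed counts, n = 10
alpha_n = dirichlet_update(alpha0, y)
post_mean = [a / sum(alpha_n) for a in alpha_n]   # posterior mean of p
print(alpha_n)   # [6.0, 4.0, 3.0]
```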
Sources:
• Wikipedia (2020): “Dirichlet distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-11; URL: https://en.wikipedia.org/wiki/Dirichlet_distribution#Conjugate_to_categorical/multinomi
Metadata: ID: P80 | shortcut: mult-post | author: JoramSoch | date: 2020-03-11, 14:40.
y ∼ Mult(n, p) . (1)
Moreover, assume a Dirichlet prior distribution (→ Proof III/3.2.2) over the model parameter p:

p(p) = Dir(p; α0) .    (2)

Then, the log model evidence (→ Definition IV/3.1.3) for this model is

log p(y|m) = log Γ(n + 1) − Σ_{j=1}^k log Γ(yj + 1)
           + log Γ( Σ_{j=1}^k α0j ) − log Γ( Σ_{j=1}^k αnj )    (3)
           + Σ_{j=1}^k log Γ(αnj) − Σ_{j=1}^k log Γ(α0j)

with the posterior hyperparameters (→ Proof III/3.2.3) αnj = α0j + yj.
Proof: With the probability mass function of the multinomial distribution (→ Proof II/2.2.2), the
likelihood function (→ Definition I/5.1.2) implied by (1) is given by
p(y|p) = (n choose y1, . . . , yk) · ∏_{j=1}^k pj^yj .    (5)
Combining the likelihood function (5) with the prior distribution (2), the joint likelihood (→ Definition I/5.1.5) of the model is given by

p(y, p) = p(y|p) p(p) = (n choose y1, . . . , yk) · Γ( Σ_{j=1}^k α0j ) / ∏_{j=1}^k Γ(α0j) · ∏_{j=1}^k pj^(yj + α0j − 1) .    (6)

Note that the model evidence is the marginal density of the joint likelihood (→ Definition I/5.1.9):

p(y) = ∫ p(y, p) dp .    (7)
Using the probability density function of the Dirichlet distribution (→ Proof II/4.4.2), p can now be
integrated out easily
p(y) = (n choose y1, . . . , yk) · Γ( Σ_{j=1}^k α0j ) / ∏_{j=1}^k Γ(α0j) · ∏_{j=1}^k Γ(αnj) / Γ( Σ_{j=1}^k αnj ) ·
       ∫ Γ( Σ_{j=1}^k αnj ) / ∏_{j=1}^k Γ(αnj) · ∏_{j=1}^k pj^(αnj − 1) dp
     = (n choose y1, . . . , yk) · Γ( Σ_{j=1}^k α0j ) / Γ( Σ_{j=1}^k αnj ) · ∏_{j=1}^k Γ(αnj) / ∏_{j=1}^k Γ(α0j) · ∫ Dir(p; αn) dp    (9)
     = (n choose y1, . . . , yk) · Γ( Σ_{j=1}^k α0j ) / Γ( Σ_{j=1}^k αnj ) · ∏_{j=1}^k Γ(αnj) / ∏_{j=1}^k Γ(α0j) ,
such that the log model evidence (→ Definition IV/3.1.3) (LME) is shown to be
log p(y|m) = log (n choose y1, . . . , yk) + log Γ( Σ_{j=1}^k α0j ) − log Γ( Σ_{j=1}^k αnj )
           + Σ_{j=1}^k log Γ(αnj) − Σ_{j=1}^k log Γ(α0j) .    (10)
With the definition of the multinomial coefficient, (n choose y1, . . . , yk) = n! / (y1! · . . . · yk!), and the relation Γ(n + 1) = n!, this finally becomes

log p(y|m) = log Γ(n + 1) − Σ_{j=1}^k log Γ(yj + 1)
           + log Γ( Σ_{j=1}^k α0j ) − log Γ( Σ_{j=1}^k αnj )    (13)
           + Σ_{j=1}^k log Γ(αnj) − Σ_{j=1}^k log Γ(α0j) .
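As a numerical sketch (hypothetical data; helper names are my own), the log model evidence (13) can be computed with log-gamma functions; for k = 2 categories it reduces to the beta-binomial evidence, which under a uniform prior equals 1/(n+1) for every split of n:

```python
from math import exp, lgamma

def mult_lme(y, alpha0):
    # log model evidence for multinomial observations, eq. (13)
    n = sum(y)
    alpha_n = [a + c for a, c in zip(alpha0, y)]
    return (lgamma(n + 1) - sum(lgamma(c + 1) for c in y)
            + lgamma(sum(alpha0)) - lgamma(sum(alpha_n))
            + sum(lgamma(a) for a in alpha_n) - sum(lgamma(a) for a in alpha0))

# sanity check via the k = 2 special case with uniform prior Dir([1, 1])
n = 10
evidences = [exp(mult_lme([y, n - y], [1.0, 1.0])) for y in range(n + 1)]
print(all(abs(e - 1 / (n + 1)) < 1e-12 for e in evidences))   # True
```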
Sources:
• original work
Metadata: ID: P81 | shortcut: mult-lme | author: JoramSoch | date: 2020-03-11, 15:17.
Definition: Poisson-distributed data are defined as a set of counts y = {y1, . . . , yn} with yi ∈ ℕ0, independent and identically distributed according to a Poisson distribution (→ Definition II/1.5.1) with rate λ:

yi ∼ Poiss(λ), i = 1, . . . , n .    (1)
Sources:
• Wikipedia (2020): “Poisson distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-22; URL: https://en.wikipedia.org/wiki/Poisson_distribution#Parameter_estimation.
Metadata: ID: D41 | shortcut: poiss-data | author: JoramSoch | date: 2020-03-22, 22:50.
yi ∼ Poiss(λ), i = 1, . . . , n . (1)
Then, the maximum likelihood estimate (→ Definition I/4.1.3) for the rate parameter λ is given by
λ̂ = ȳ (2)
where ȳ is the sample mean (→ Definition I/1.7.2)
ȳ = (1/n) Σ_{i=1}^n yi .    (3)
Proof: The likelihood function (→ Definition I/5.1.2) for each observation is given by the probability
mass function of the Poisson distribution (→ Proof II/1.5.2)
p(yi|λ) = Poiss(yi; λ) = (λ^yi · exp(−λ)) / yi!    (4)
and because observations are independent (→ Definition I/1.3.6), the likelihood function for all
observations is the product of the individual ones:
p(y|λ) = ∏_{i=1}^n p(yi|λ) = ∏_{i=1}^n (λ^yi · exp(−λ)) / yi! .    (5)
Thus, the log-likelihood function (→ Definition I/4.1.2) is
" n #
Y λyi · exp(−λ)
LL(λ) = log p(y|λ) = log (6)
i=1
yi !
which can be developed into
LL(λ) = Σ_{i=1}^n log[ (λ^yi · exp(−λ)) / yi! ]
      = Σ_{i=1}^n [ yi · log(λ) − λ − log(yi!) ]    (7)
      = −nλ + log(λ) Σ_{i=1}^n yi − Σ_{i=1}^n log(yi!) .
The first two derivatives of this function with respect to λ are

dLL(λ)/dλ = (1/λ) Σ_{i=1}^n yi − n
d²LL(λ)/dλ² = −(1/λ²) Σ_{i=1}^n yi .    (8)
Setting the first derivative to zero, we obtain the maximum likelihood estimate:

dLL(λ̂)/dλ = 0
0 = (1/λ̂) Σ_{i=1}^n yi − n    (9)
λ̂ = (1/n) Σ_{i=1}^n yi = ȳ .
Plugging this value into the second derivative, we confirm that it is a maximum:

d²LL(λ̂)/dλ² = −(1/ȳ²) Σ_{i=1}^n yi
             = −(n · ȳ)/ȳ²    (10)
             = −n/ȳ < 0 .

This demonstrates that the estimate λ̂ = ȳ maximizes the likelihood p(y|λ).
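The result λ̂ = ȳ can be checked numerically on hypothetical count data (helper names are my own): the log-likelihood (7) is no larger at nearby rates than at the sample mean.

```python
import math

def poiss_ll(lam, y):
    # Poisson log-likelihood, eqs. (6)-(7); log(yi!) = lgamma(yi + 1)
    return (-len(y) * lam + math.log(lam) * sum(y)
            - sum(math.lgamma(yi + 1) for yi in y))

y = [3, 0, 2, 5, 1, 2, 4]
lam_hat = sum(y) / len(y)            # MLE: the sample mean, eq. (2)
eps = 1e-4
better = all(poiss_ll(lam_hat, y) >= poiss_ll(lam_hat + d, y) for d in (-eps, eps))
print(better)   # True
```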
Sources:
• original work
Metadata: ID: P27 | shortcut: poiss-mle | author: JoramSoch | date: 2020-01-20, 21:53.
yi ∼ Poiss(λ), i = 1, . . . , n . (1)
Then, the conjugate prior (→ Definition I/5.2.5) for the model parameter λ is a gamma distribution (→ Definition II/3.4.1):

p(λ) = Gam(λ; a0, b0) .    (2)
Proof: With the probability mass function of the Poisson distribution (→ Proof II/1.5.2), the likelihood function (→ Definition I/5.1.2) implied by (1) is given by

p(y|λ) = ∏_{i=1}^n (1/yi!) · ∏_{i=1}^n λ^yi · ∏_{i=1}^n exp[−λ]
       = ∏_{i=1}^n (1/yi!) · λ^(nȳ) · exp[−nλ]    (5)

where ȳ is the sample mean (→ Definition I/1.7.2):

ȳ = (1/n) Σ_{i=1}^n yi .    (6)
In other words, the likelihood function is proportional to a power of λ times an exponential of λ:

p(y|λ) ∝ λ^(nȳ) · exp[−nλ] .    (7)

The probability density function of the gamma distribution (→ Proof II/3.4.6)
p(λ) = b0^a0 / Γ(a0) · λ^(a0−1) · exp[−b0 λ]    (9)
exhibits the same proportionality and is therefore conjugate to the Poisson likelihood.
Sources:
• Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB (2014): “Other standard
single-parameter models”; in: Bayesian Data Analysis, 3rd edition, ch. 2.6, p. 45, eq. 2.14ff.; URL:
http://www.stat.columbia.edu/~gelman/book/.
Metadata: ID: P225 | shortcut: poiss-prior | author: JoramSoch | date: 2020-04-21, 08:31.
yi ∼ Poiss(λ), i = 1, . . . , n . (1)
Moreover, assume a gamma prior distribution (→ Proof III/3.3.3) over the model parameter λ:

p(λ) = Gam(λ; a0, b0) .    (2)

Then, the posterior distribution (→ Definition I/5.1.7) is also a gamma distribution (→ Definition II/3.4.1)

p(λ|y) = Gam(λ; an, bn)    (3)

with the posterior hyperparameters (→ Definition I/5.1.7)

an = a0 + nȳ
bn = b0 + n .    (4)
Proof: With the probability mass function of the Poisson distribution (→ Proof II/1.5.2), the likelihood function (→ Definition I/5.1.2) implied by (1) is given by

p(y|λ) = ∏_{i=1}^n (λ^yi · exp(−λ)) / yi! = ∏_{i=1}^n (1/yi!) · λ^(nȳ) · exp[−nλ] .    (5)

Combining the likelihood function with the prior distribution (2), the joint likelihood (→ Definition I/5.1.5) of the model is given by

p(y, λ) = p(y|λ) p(λ)
        = ∏_{i=1}^n (1/yi!) · λ^(nȳ) · exp[−nλ] · b0^a0/Γ(a0) · λ^(a0−1) · exp[−b0 λ]    (8)
        = ∏_{i=1}^n (1/yi!) · b0^a0/Γ(a0) · λ^(a0+nȳ−1) · exp[ −(b0 + n)λ ]

where ȳ is the sample mean (→ Definition I/1.7.2):

ȳ = (1/n) Σ_{i=1}^n yi .    (9)
Note that the posterior distribution is proportional to the joint likelihood (→ Proof I/5.1.8):

p(λ|y) ∝ p(y, λ) ∝ λ^(an−1) · exp[−bn λ]    (11)

with an = a0 + nȳ and bn = b0 + n, which, when normalized to one, results in the probability density function of the gamma distribution (→ Proof II/3.4.6):

p(λ|y) = bn^an / Γ(an) · λ^(an−1) · exp[−bn λ] = Gam(λ; an, bn) .    (12)
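The update (4) is a one-liner in practice (hypothetical counts; the helper name is my own): the shape parameter absorbs the total count and the rate parameter absorbs the number of observations.

```python
def poisson_gamma_update(y, a0=1.0, b0=1.0):
    """Posterior hyperparameters of lambda, eq. (4): a_n = a_0 + n*ybar, b_n = b_0 + n."""
    return a0 + sum(y), b0 + len(y)

y = [3, 0, 2, 5, 1, 2, 4]            # hypothetical counts, sum = 17, n = 7
an, bn = poisson_gamma_update(y)
post_mean = an / bn                   # posterior mean of lambda: 18/8
print(an, bn, post_mean)   # 18.0 8 2.25
```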
Sources:
• Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB (2014): “Other standard
single-parameter models”; in: Bayesian Data Analysis, 3rd edition, ch. 2.6, p. 45, eq. 2.15; URL:
http://www.stat.columbia.edu/~gelman/book/.
Metadata: ID: P226 | shortcut: poiss-post | author: JoramSoch | date: 2020-04-21, 08:48.
yi ∼ Poiss(λ), i = 1, . . . , n . (1)
Moreover, assume a gamma prior distribution (→ Proof III/3.3.3) over the model parameter λ:

p(λ) = Gam(λ; a0, b0) .    (2)

Then, the log model evidence (→ Definition IV/3.1.3) for this model is

log p(y|m) = − Σ_{i=1}^n log(yi!) + log Γ(an) − log Γ(a0) + a0 log b0 − an log bn    (3)

with the posterior hyperparameters (→ Proof III/3.3.4)

an = a0 + nȳ
bn = b0 + n .    (4)
Proof: With the probability mass function of the Poisson distribution (→ Proof II/1.5.2), the likelihood function (→ Definition I/5.1.2) implied by (1) is given by

p(y|λ) = ∏_{i=1}^n (λ^yi · exp(−λ)) / yi! = ∏_{i=1}^n (1/yi!) · λ^(nȳ) · exp[−nλ] .    (5)

Combining the likelihood function with the prior distribution (2), the joint likelihood (→ Definition I/5.1.5) of the model is given by

p(y, λ) = p(y|λ) p(λ)
        = ∏_{i=1}^n (1/yi!) · λ^(nȳ) · exp[−nλ] · b0^a0/Γ(a0) · λ^(a0−1) · exp[−b0 λ]    (8)
        = ∏_{i=1}^n (1/yi!) · b0^a0/Γ(a0) · λ^(a0+nȳ−1) · exp[ −(b0 + n)λ ]

where ȳ is the sample mean (→ Definition I/1.7.2):

ȳ = (1/n) Σ_{i=1}^n yi .    (9)
Note that the model evidence is the marginal density of the joint likelihood (→ Definition I/5.1.9):
p(y) = ∫ p(y, λ) dλ .    (10)
Setting an = a0 + nȳ and bn = b0 + n, the joint likelihood can also be written as

p(y, λ) = ∏_{i=1}^n (1/yi!) · b0^a0/Γ(a0) · Γ(an)/bn^an · bn^an/Γ(an) · λ^(an−1) · exp[−bn λ] .    (11)

Using the probability density function of the gamma distribution (→ Proof II/3.4.6), λ can now be integrated out easily:

p(y) = ∫ ∏_{i=1}^n (1/yi!) · b0^a0/Γ(a0) · Γ(an)/bn^an · bn^an/Γ(an) · λ^(an−1) · exp[−bn λ] dλ
     = ∏_{i=1}^n (1/yi!) · Γ(an)/Γ(a0) · b0^a0/bn^an · ∫ Gam(λ; an, bn) dλ    (12)
     = ∏_{i=1}^n (1/yi!) · Γ(an)/Γ(a0) · b0^a0/bn^an ,

such that the log model evidence (→ Definition IV/3.1.3) (LME) is shown to be

log p(y|m) = − Σ_{i=1}^n log(yi!) + log Γ(an) − log Γ(a0) + a0 log b0 − an log bn .    (13)
Sources:
• original work
Metadata: ID: P227 | shortcut: poiss-lme | author: JoramSoch | date: 2020-04-21, 09:09.
Definition: The Poisson distribution with exposure values is defined as a set of observed counts y = {y1, . . . , yn}, distributed according to a Poisson distribution (→ Definition II/1.5.1) with common rate λ and concurrent exposure values x = {x1, . . . , xn}:

yi ∼ Poiss(λxi), i = 1, . . . , n .    (1)
Sources:
• Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB (2014): “Other standard
single-parameter models”; in: Bayesian Data Analysis, 3rd edition, ch. 2.6, p. 45, eq. 2.14; URL:
http://www.stat.columbia.edu/~gelman/book/.
Metadata: ID: D42 | shortcut: poissexp | author: JoramSoch | date: 2020-03-22, 22:57.
yi ∼ Poiss(λxi ), i = 1, . . . , n . (1)
Then, the maximum likelihood estimate (→ Definition I/4.1.3) for the rate parameter λ is given by
λ̂ = ȳ / x̄    (2)
where ȳ and x̄ are the sample means (→ Definition “mean-sample”)
ȳ = (1/n) Σ_{i=1}^n yi
x̄ = (1/n) Σ_{i=1}^n xi .    (3)
Proof: With the probability mass function of the Poisson distribution (→ Proof II/1.5.2), the like-
lihood function (→ Definition I/5.1.2) for each observation implied by (1) is given by
p(y|λ) = ∏_{i=1}^n p(yi|λ) = ∏_{i=1}^n ((λxi)^yi · exp[−λxi]) / yi! .    (5)
Thus, the log-likelihood function (→ Definition I/4.1.2) is
" n #
Y (λxi )yi · exp[−λxi ]
LL(λ) = log p(y|λ) = log (6)
i=1
yi !
which can be developed into
LL(λ) = Σ_{i=1}^n log[ ((λxi)^yi · exp[−λxi]) / yi! ]
      = Σ_{i=1}^n [ yi · log(λxi) − λxi − log(yi!) ]
      = −λ Σ_{i=1}^n xi + Σ_{i=1}^n yi · [log(λ) + log(xi)] − Σ_{i=1}^n log(yi!)    (7)
      = −λ Σ_{i=1}^n xi + log(λ) Σ_{i=1}^n yi + Σ_{i=1}^n yi log(xi) − Σ_{i=1}^n log(yi!)
      = −nx̄λ + nȳ log(λ) + Σ_{i=1}^n yi log(xi) − Σ_{i=1}^n log(yi!) .
The first two derivatives of this function with respect to λ are

dLL(λ)/dλ = −nx̄ + nȳ/λ
d²LL(λ)/dλ² = −nȳ/λ² .    (8)
Setting the first derivative to zero, we obtain:
dLL(λ̂)/dλ = 0
0 = −nx̄ + nȳ/λ̂    (9)
λ̂ = nȳ/(nx̄) = ȳ/x̄ .
Plugging this value into the second derivative, we confirm:
d²LL(λ̂)/dλ² = −nȳ/λ̂²
             = −nȳ/(ȳ/x̄)²    (10)
             = −(n · x̄²)/ȳ < 0 .
462 CHAPTER III. STATISTICAL MODELS
This demonstrates that the estimate λ̂ = ȳ/x̄ maximizes the likelihood p(y|λ).
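The estimate (2) can be checked numerically on hypothetical counts and exposures (helper names are my own): λ̂ = Σyi / Σxi is a local maximum of the log-likelihood (7).

```python
import math

def poissexp_ll(lam, y, x):
    # log-likelihood (7) of the Poisson-with-exposure model
    return sum(yi * math.log(lam * xi) - lam * xi - math.lgamma(yi + 1)
               for yi, xi in zip(y, x))

y = [4, 1, 6, 2]
x = [2.0, 0.5, 3.0, 1.5]
lam_hat = sum(y) / sum(x)            # ybar/xbar = 13/7, eq. (2)
eps = 1e-4
better = all(poissexp_ll(lam_hat, y, x) >= poissexp_ll(lam_hat + d, y, x)
             for d in (-eps, eps))
print(better)   # True
```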
Sources:
• original work
Metadata: ID: P224 | shortcut: poissexp-mle | author: JoramSoch | date: 2021-04-16, 11:42.
yi ∼ Poiss(λxi ), i = 1, . . . , n . (1)
Then, the conjugate prior (→ Definition I/5.2.5) for the model parameter λ is a gamma distribution (→ Definition II/3.4.1):

p(λ) = Gam(λ; a0, b0) .    (2)
Proof: With the probability mass function of the Poisson distribution (→ Proof II/1.5.2), the likelihood function (→ Definition I/5.1.2) implied by (1) is given by

p(y|λ) = ∏_{i=1}^n (xi^yi / yi!) · ∏_{i=1}^n λ^yi · ∏_{i=1}^n exp[−λxi]
       = ∏_{i=1}^n (xi^yi / yi!) · λ^(Σ_{i=1}^n yi) · exp[ −λ Σ_{i=1}^n xi ]    (5)
       = ∏_{i=1}^n (xi^yi / yi!) · λ^(nȳ) · exp[−nx̄λ]

where ȳ and x̄ are the sample means (→ Definition I/1.7.2):

ȳ = (1/n) Σ_{i=1}^n yi
x̄ = (1/n) Σ_{i=1}^n xi .    (6)
In other words, the likelihood function is proportional to a power of λ times an exponential of λ:

p(y|λ) ∝ λ^(nȳ) · exp[−nx̄λ] .    (7)

The probability density function of the gamma distribution (→ Proof II/3.4.6)

p(λ) = b0^a0 / Γ(a0) · λ^(a0−1) · exp[−b0 λ]    (9)

exhibits the same proportionality and is therefore conjugate to this likelihood.
Sources:
• Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB (2014): “Other standard
single-parameter models”; in: Bayesian Data Analysis, 3rd edition, ch. 2.6, p. 45, eq. 2.14ff.; URL:
http://www.stat.columbia.edu/~gelman/book/.
Metadata: ID: P41 | shortcut: poissexp-prior | author: JoramSoch | date: 2020-02-04, 14:11.
yi ∼ Poiss(λxi ), i = 1, . . . , n . (1)
Moreover, assume a gamma prior distribution (→ Proof III/3.4.3) over the model parameter λ:

p(λ) = Gam(λ; a0, b0) .    (2)

Then, the posterior distribution (→ Definition I/5.1.7) is also a gamma distribution (→ Definition II/3.4.1)

p(λ|y) = Gam(λ; an, bn)    (3)

with the posterior hyperparameters (→ Definition I/5.1.7)

an = a0 + nȳ
bn = b0 + nx̄ .    (4)
Proof: With the probability mass function of the Poisson distribution (→ Proof II/1.5.2), the likelihood function (→ Definition I/5.1.2) for each observation implied by (1) is given by

p(yi|λ) = Poiss(yi; λxi) = ((λxi)^yi · exp[−λxi]) / yi!    (5)
and because observations are independent (→ Definition I/1.3.6), the likelihood function for all
observations is the product of the individual ones:
p(y|λ) = ∏_{i=1}^n p(yi|λ) = ∏_{i=1}^n ((λxi)^yi · exp[−λxi]) / yi! .    (6)
Combining the likelihood function (6) with the prior distribution (2), the joint likelihood (→ Defi-
nition I/5.1.5) of the model is given by
p(y, λ) = ∏_{i=1}^n (xi^yi / yi!) · λ^(Σ yi) · exp[ −λ Σ xi ] · b0^a0/Γ(a0) · λ^(a0−1) · exp[−b0 λ]
        = ∏_{i=1}^n (xi^yi / yi!) · λ^(nȳ) · exp[−nx̄λ] · b0^a0/Γ(a0) · λ^(a0−1) · exp[−b0 λ]    (8)
        = ∏_{i=1}^n (xi^yi / yi!) · b0^a0/Γ(a0) · λ^(a0+nȳ−1) · exp[ −(b0 + nx̄)λ ]

where ȳ and x̄ are the sample means (→ Definition I/1.7.2):

ȳ = (1/n) Σ_{i=1}^n yi
x̄ = (1/n) Σ_{i=1}^n xi .    (9)
Note that the posterior distribution is proportional to the joint likelihood (→ Proof I/5.1.8):

p(λ|y) ∝ p(y, λ) ∝ λ^(an−1) · exp[−bn λ]    (11)

with an = a0 + nȳ and bn = b0 + nx̄, which, when normalized to one, results in the probability density function of the gamma distribution (→ Proof II/3.4.6):

p(λ|y) = bn^an / Γ(an) · λ^(an−1) · exp[−bn λ] = Gam(λ; an, bn) .    (12)
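Compared to the plain Poisson case, the update (4) replaces the number of observations n by the total exposure Σxi (hypothetical data; the helper name is my own):

```python
def poissexp_update(y, x, a0=1.0, b0=1.0):
    """Posterior hyperparameters of lambda, eq. (4): a_n = a_0 + sum(y), b_n = b_0 + sum(x)."""
    return a0 + sum(y), b0 + sum(x)

y = [4, 1, 6, 2]
x = [2.0, 0.5, 3.0, 1.5]
an, bn = poissexp_update(y, x)
print(an, bn, an / bn)   # 14.0 8.0 1.75  (posterior mean of lambda)
```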
Sources:
• Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB (2014): “Other standard
single-parameter models”; in: Bayesian Data Analysis, 3rd edition, ch. 2.6, p. 45, eq. 2.15; URL:
http://www.stat.columbia.edu/~gelman/book/.
Metadata: ID: P42 | shortcut: poissexp-post | author: JoramSoch | date: 2020-02-04, 14:42.
yi ∼ Poiss(λxi ), i = 1, . . . , n . (1)
Moreover, assume a gamma prior distribution (→ Proof III/3.4.3) over the model parameter λ:

p(λ) = Gam(λ; a0, b0) .    (2)

Then, the log model evidence (→ Definition IV/3.1.3) for this model is

log p(y|m) = Σ_{i=1}^n yi log(xi) − Σ_{i=1}^n log(yi!) +
             log Γ(an) − log Γ(a0) + a0 log b0 − an log bn    (3)

with the posterior hyperparameters (→ Proof III/3.4.4)

an = a0 + nȳ
bn = b0 + nx̄ .    (4)
Proof: With the probability mass function of the Poisson distribution (→ Proof II/1.5.2), the likelihood function (→ Definition I/5.1.2) implied by (1) is given by

p(y|λ) = ∏_{i=1}^n ((λxi)^yi · exp[−λxi]) / yi! .    (5)

Combining the likelihood function with the prior distribution (2), the joint likelihood (→ Definition I/5.1.5) of the model is given by

p(y, λ) = ∏_{i=1}^n (xi^yi / yi!) · λ^(Σ yi) · exp[ −λ Σ xi ] · b0^a0/Γ(a0) · λ^(a0−1) · exp[−b0 λ]
        = ∏_{i=1}^n (xi^yi / yi!) · λ^(nȳ) · exp[−nx̄λ] · b0^a0/Γ(a0) · λ^(a0−1) · exp[−b0 λ]    (8)
        = ∏_{i=1}^n (xi^yi / yi!) · b0^a0/Γ(a0) · λ^(a0+nȳ−1) · exp[ −(b0 + nx̄)λ ]

where ȳ and x̄ are the sample means (→ Definition I/1.7.2):

ȳ = (1/n) Σ_{i=1}^n yi
x̄ = (1/n) Σ_{i=1}^n xi .    (9)
Note that the model evidence is the marginal density of the joint likelihood (→ Definition I/5.1.9):
p(y) = ∫ p(y, λ) dλ .    (10)
Setting an = a0 + nȳ and bn = b0 + nx̄, the joint likelihood can also be written as
p(y, λ) = ∏_{i=1}^n (xi^yi / yi!) · b0^a0/Γ(a0) · Γ(an)/bn^an · bn^an/Γ(an) · λ^(an−1) · exp[−bn λ] .    (11)
Using the probability density function of the gamma distribution (→ Proof II/3.4.6), λ can now be
integrated out easily
p(y) = ∫ ∏_{i=1}^n (xi^yi / yi!) · b0^a0/Γ(a0) · Γ(an)/bn^an · bn^an/Γ(an) · λ^(an−1) · exp[−bn λ] dλ
     = ∏_{i=1}^n (xi^yi / yi!) · Γ(an)/Γ(a0) · b0^a0/bn^an · ∫ Gam(λ; an, bn) dλ    (12)
     = ∏_{i=1}^n (xi^yi / yi!) · Γ(an)/Γ(a0) · b0^a0/bn^an ,
such that the log model evidence (→ Definition IV/3.1.3) (LME) is shown to be

log p(y|m) = Σ_{i=1}^n yi log(xi) − Σ_{i=1}^n log(yi!) +
             log Γ(an) − log Γ(a0) + a0 log b0 − an log bn .    (13)
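As a cross-check (hypothetical data; helper names and the integration grid are my own choices), the closed-form expression (13) can be compared against brute-force numerical integration of the joint likelihood over λ:

```python
from math import exp, lgamma, log

def poissexp_lme(y, x, a0, b0):
    # log model evidence, eq. (13)
    an, bn = a0 + sum(y), b0 + sum(x)
    return (sum(yi * log(xi) for yi, xi in zip(y, x))
            - sum(lgamma(yi + 1) for yi in y)
            + lgamma(an) - lgamma(a0) + a0 * log(b0) - an * log(bn))

def joint(lam, y, x, a0, b0):
    # p(y, lam) = Poisson-with-exposure likelihood times Gam(lam; a0, b0) prior
    ll = sum(yi * log(lam * xi) - lam * xi - lgamma(yi + 1) for yi, xi in zip(y, x))
    lp = a0 * log(b0) - lgamma(a0) + (a0 - 1) * log(lam) - b0 * lam
    return exp(ll + lp)

y, x, a0, b0 = [4, 1, 6, 2], [2.0, 0.5, 3.0, 1.5], 2.0, 1.0
h = 0.001
integral = h * sum(joint(0.001 + h * i, y, x, a0, b0) for i in range(20000))
lme = poissexp_lme(y, x, a0, b0)
print(abs(log(integral) - lme) < 1e-3)   # True: the integral equals p(y)
```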
Sources:
• original work
Metadata: ID: P43 | shortcut: poissexp-lme | author: JoramSoch | date: 2020-02-04, 15:12.
4 Frequency data
4.1 Beta-distributed data
4.1.1 Definition
Definition: Beta-distributed data are defined as a set of proportions y = {y1 , . . . , yn } with yi ∈
[0, 1], i = 1, . . . , n, independent and identically distributed according to a beta distribution (→
Definition II/3.9.1) with shapes α and β:

yi ∼ Bet(α, β), i = 1, . . . , n .    (1)
Sources:
• original work
Metadata: ID: D77 | shortcut: beta-data | author: JoramSoch | date: 2020-06-28, 21:16.
α̂ = ȳ ( ȳ(1 − ȳ)/v̄ − 1 )
β̂ = (1 − ȳ) ( ȳ(1 − ȳ)/v̄ − 1 )    (2)
where ȳ is the sample mean (→ Definition I/1.7.2) and v̄ is the unbiased sample variance (→ Defi-
nition I/1.8.2):
ȳ = (1/n) Σ_{i=1}^n yi
v̄ = 1/(n−1) Σ_{i=1}^n (yi − ȳ)² .    (3)
Proof: Mean (→ Proof II/3.9.6) and variance (→ Proof II/3.9.7) of the beta distribution (→ Defi-
nition II/3.9.1) in terms of the parameters α and β are given by
E(X) = α / (α + β)
Var(X) = αβ / ( (α + β)² (α + β + 1) ) .    (4)
Thus, matching the moments (→ Definition I/4.1.6) requires us to solve the following equation system
for α and β:
ȳ = α / (α + β)
v̄ = αβ / ( (α + β)² (α + β + 1) ) .    (5)
From the first equation, we obtain:

ȳ(α + β) = α
αȳ + βȳ = α
βȳ = α − αȳ
β = α/ȳ − α    (6)
β = α (1/ȳ − 1) .
If we define q = 1/ȳ − 1 and plug (6) into the second equation, we have:
v̄ = α · αq / ( (α + αq)² (α + αq + 1) )
  = α²q / ( (α(1 + q))² (α(1 + q) + 1) )
  = q / ( (1 + q)² (α(1 + q) + 1) )    (7)
  = q / ( α(1 + q)³ + (1 + q)² )

⇔ q = v̄ [ α(1 + q)³ + (1 + q)² ] .
With q = (1 − ȳ)/ȳ and 1 + q = 1/ȳ, this becomes:

(1 − ȳ)/ȳ = v̄ [ α/ȳ³ + 1/ȳ² ]
(1 − ȳ)/(ȳ v̄) = α/ȳ³ + 1/ȳ²
ȳ³ (1 − ȳ)/(ȳ v̄) = α + ȳ    (8)
α = ȳ²(1 − ȳ)/v̄ − ȳ
  = ȳ ( ȳ(1 − ȳ)/v̄ − 1 ) .
Plugging (8) into (6), we obtain:

β = ȳ ( ȳ(1 − ȳ)/v̄ − 1 ) · (1 − ȳ)/ȳ
  = (1 − ȳ) ( ȳ(1 − ȳ)/v̄ − 1 ) .    (9)
Together, (8) and (9) constitute the method-of-moment estimates of α and β.
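A direct implementation of (2) and (3) (hypothetical proportions; the helper name is my own). By construction, the mean of the fitted beta distribution reproduces the sample mean exactly:

```python
def beta_mome(y):
    """Method-of-moments estimates (alpha, beta) for beta-distributed data, eq. (2)."""
    n = len(y)
    ybar = sum(y) / n
    vbar = sum((yi - ybar) ** 2 for yi in y) / (n - 1)   # unbiased sample variance
    c = ybar * (1 - ybar) / vbar - 1
    return ybar * c, (1 - ybar) * c

y = [0.2, 0.4, 0.3, 0.6, 0.5, 0.4]
a, b = beta_mome(y)
# the fitted mean alpha/(alpha+beta) equals the sample mean, eq. (5)
print(abs(a / (a + b) - sum(y) / len(y)) < 1e-12)   # True
```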
Sources:
• Wikipedia (2020): “Beta distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-01-
20; URL: https://en.wikipedia.org/wiki/Beta_distribution#Method_of_moments.
Metadata: ID: P28 | shortcut: beta-mome | author: JoramSoch | date: 2020-01-22, 02:53.
Definition: Dirichlet-distributed data are defined as a set of vectors y = {y1, . . . , yn} with

yi = [yi1, . . . , yik], yij ∈ [0, 1] and Σ_{j=1}^k yij = 1    (1)
for all i = 1, . . . , n (and j = 1, . . . , k) and each yi is independent and identically distributed according
to a Dirichlet distribution (→ Definition II/4.4.1) with concentration parameters α = [α1 , . . . , αk ]:
yi ∼ Dir(α), i = 1, . . . , n . (2)
Sources:
• original work
Metadata: ID: D104 | shortcut: dir-data | author: JoramSoch | date: 2020-10-22, 05:06.
yi ∼ Dir(α), i = 1, . . . , n . (1)
Then, the maximum likelihood estimate (→ Definition I/4.1.3) for the concentration parameters α
can be obtained by iteratively computing
" ! #
(new)
X
k
(old)
αj = ψ −1 ψ αj + log ȳj (2)
j=1
where ψ(x) is the digamma function and log ȳj is given by:
log ȳj = (1/n) Σ_{i=1}^n log yij .    (3)
Proof: The likelihood function (→ Definition I/5.1.2) for each observation is given by the probability
density function of the Dirichlet distribution (→ Proof II/4.4.2)
p(yi|α) = Γ( Σ_{j=1}^k αj ) / ∏_{j=1}^k Γ(αj) · ∏_{j=1}^k yij^(αj−1)    (4)
and because observations are independent (→ Definition I/1.3.6), the likelihood function for all
observations is the product of the individual ones:
p(y|α) = ∏_{i=1}^n p(yi|α) = ∏_{i=1}^n [ Γ( Σ_{j=1}^k αj ) / ∏_{j=1}^k Γ(αj) · ∏_{j=1}^k yij^(αj−1) ] .    (5)
$$\log \bar{y}_j = \frac{1}{n} \sum_{i=1}^{n} \log y_{ij} \; . \quad (8)$$
The derivative of the log-likelihood with respect to a particular parameter αj is
" ! #
dLL(α) d Xk X k X k
= n log Γ αj − n log Γ(αj ) + n (αj − 1) log ȳj
dαj dαj j=1 j=1 j=1
" !#
d Xk
d d
= n log Γ αj − [n log Γ(αj )] + [n(αj − 1) log ȳj ] (9)
dαj j=1
dαj dαj
!
Xk
= nψ αj − nψ(αj ) + n log ȳj
j=1
472 CHAPTER III. STATISTICAL MODELS
d log Γ(x)
ψ(x) = . (10)
dx
Setting this derivative to zero, we obtain:
$$\frac{\mathrm{d}LL(\alpha)}{\mathrm{d}\alpha_j} = 0$$

$$0 = n \, \psi\!\left( \sum_{j=1}^{k} \alpha_j \right) - n \, \psi(\alpha_j) + n \log \bar{y}_j$$

$$0 = \psi\!\left( \sum_{j=1}^{k} \alpha_j \right) - \psi(\alpha_j) + \log \bar{y}_j \quad (11)$$

$$\psi(\alpha_j) = \psi\!\left( \sum_{j=1}^{k} \alpha_j \right) + \log \bar{y}_j$$

$$\alpha_j = \psi^{-1}\!\left[ \psi\!\left( \sum_{j=1}^{k} \alpha_j \right) + \log \bar{y}_j \right] \; .$$
In the following, we will use a fixed-point iteration to maximize LL(α). Given an initial guess for α, we construct a lower bound on the likelihood function (7) which is tight at α. The maximum of this bound is computed and it becomes the new guess. Because the Dirichlet distribution (→ Definition II/4.4.1) belongs to the exponential family (→ Definition “dist-expfam”), the log-likelihood function is concave in α and the maximum is its only stationary point, such that the procedure is guaranteed to converge to the maximum.
In our case, we use a bound on the gamma function which is tight at ẑ:

$$\Gamma(z) \geq \Gamma(\hat{z}) \exp\!\left[ (z - \hat{z}) \, \psi(\hat{z}) \right] \; . \quad (12)$$

Applying this bound to the first term of the scaled log-likelihood function

$$\frac{1}{n} LL(\alpha) = \log \Gamma\!\left( \sum_{j=1}^{k} \alpha_j \right) - \sum_{j=1}^{k} \log \Gamma(\alpha_j) + \sum_{j=1}^{k} (\alpha_j - 1) \log \bar{y}_j$$

with z = Σ_{j=1}^k α_j and ẑ = Σ_{j=1}^k α̂_j, we obtain:

$$\frac{1}{n} LL(\alpha) \geq \log \Gamma\!\left( \sum_{j=1}^{k} \hat{\alpha}_j \right) + \left( \sum_{j=1}^{k} \alpha_j - \sum_{j=1}^{k} \hat{\alpha}_j \right) \psi\!\left( \sum_{j=1}^{k} \hat{\alpha}_j \right) - \sum_{j=1}^{k} \log \Gamma(\alpha_j) + \sum_{j=1}^{k} (\alpha_j - 1) \log \bar{y}_j$$

$$\frac{1}{n} LL(\alpha) \geq \left( \sum_{j=1}^{k} \alpha_j \right) \psi\!\left( \sum_{j=1}^{k} \hat{\alpha}_j \right) - \sum_{j=1}^{k} \log \Gamma(\alpha_j) + \sum_{j=1}^{k} (\alpha_j - 1) \log \bar{y}_j + \text{const.} \quad (13)$$
" ! #
(new)
X
k
(old)
−1
αj =ψ ψ αj + log ȳj . (14)
j=1
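The iteration (14) is straightforward to implement. The sketch below is plain Python with hand-rolled digamma, trigamma and inverse-digamma helpers (standard recurrence-plus-asymptotic-series approximations and Newton inversion; the helper names are ours, and in practice one would rather call `scipy.special.digamma`):

```python
import math

def digamma(x):
    """psi(x), via the recurrence psi(x) = psi(x+1) - 1/x and an asymptotic series."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12 - f * (1/120 - f/252))

def trigamma(x):
    """psi'(x), needed for the Newton steps when inverting psi."""
    r = 0.0
    while x < 6.0:
        r += 1.0 / (x * x)
        x += 1.0
    f = 1.0 / (x * x)
    return r + 1.0 / x + 0.5 * f + (1/6 - f * (1/30 - f/42)) * f / x

def inv_digamma(y):
    """psi^{-1}(y), by Newton's method with the initialization from Minka (2012)."""
    x = math.exp(y) + 0.5 if y >= -2.22 else -1.0 / (y + 0.5772156649015329)
    for _ in range(10):
        x -= (digamma(x) - y) / trigamma(x)
    return x

def dirichlet_mle(log_y_bar, n_iter=1000):
    """Fixed-point iteration (14), given the sufficient statistics log y-bar_j."""
    alpha = [1.0] * len(log_y_bar)
    for _ in range(n_iter):
        psi_sum = digamma(sum(alpha))
        alpha = [inv_digamma(psi_sum + lyb) for lyb in log_y_bar]
    return alpha
```

Feeding in the exact expected statistics E[log y_j] = ψ(α_j) − ψ(Σ_j α_j) of a known Dirichlet distribution recovers its concentration parameters, which makes for a simple self-test of the iteration.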
Sources:
• Minka TP (2012): “Estimating a Dirichlet distribution”; in: Papers by Tom Minka, retrieved on
2020-10-22; URL: https://tminka.github.io/papers/dirichlet/minka-dirichlet.pdf.
Metadata: ID: P182 | shortcut: dir-mle | author: JoramSoch | date: 2020-10-22, 09:31.
Sources:
• original work
Metadata: ID: D178 | shortcut: betabin-data | author: JoramSoch | date: 2022-10-20, 08:20.
$$\hat{\alpha} = \frac{n m_1 - m_2}{n \left( \frac{m_2}{m_1} - m_1 - 1 \right) + m_1}$$

$$\hat{\beta} = \frac{(n - m_1) \left( n - \frac{m_2}{m_1} \right)}{n \left( \frac{m_2}{m_1} - m_1 - 1 \right) + m_1} \quad (2)$$
where m1 and m2 are the first two raw sample moments (→ Definition I/1.14.3):
$$m_1 = \frac{1}{N} \sum_{i=1}^{N} y_i \;, \qquad m_2 = \frac{1}{N} \sum_{i=1}^{N} y_i^2 \; . \quad (3)$$
Proof: The first two raw moments of the beta-binomial distribution (→ Definition “betabin-mom”)
in terms of the parameters α and β are given by
$$\mu_1 = \frac{n\alpha}{\alpha + \beta}$$

$$\mu_2 = \frac{n\alpha \, (n\alpha + \beta + n)}{(\alpha + \beta)(\alpha + \beta + 1)} \; . \quad (4)$$
Thus, matching the moments (→ Definition I/4.1.6) requires us to solve the following equation system
for α and β:
$$m_1 = \frac{n\alpha}{\alpha + \beta}$$

$$m_2 = \frac{n\alpha \, (n\alpha + \beta + n)}{(\alpha + \beta)(\alpha + \beta + 1)} \; . \quad (5)$$
From the first equation, we can deduce:
$$m_1 (\alpha + \beta) = n\alpha$$

$$m_1 \alpha + m_1 \beta = n\alpha$$

$$m_1 \beta = n\alpha - m_1 \alpha$$

$$\beta = \frac{n\alpha}{m_1} - \alpha \quad (6)$$

$$\beta = \alpha \left( \frac{n}{m_1} - 1 \right) \; .$$
If we define q = n/m1 − 1 and plug (6) into the second equation, we have:
$$m_2 = \frac{n\alpha \, (n\alpha + \alpha q + n)}{(\alpha + \alpha q)(\alpha + \alpha q + 1)} = \frac{n\alpha \, (\alpha (n + q) + n)}{\alpha (1+q) \, (\alpha(1+q) + 1)} = \frac{n \, (\alpha (n + q) + n)}{(1+q)(\alpha(1+q) + 1)} = \frac{n \, (\alpha (n + q) + n)}{\alpha (1+q)^2 + (1+q)} \; . \quad (7)$$
Noting that 1 + q = n/m1 and expanding the fraction with m1 , one obtains:
$$m_2 = \frac{n \left( \alpha \left( n + \frac{n}{m_1} - 1 \right) + n \right)}{\frac{n}{m_1} \left( \alpha \frac{n}{m_1} + 1 \right)}$$

$$m_2 = \frac{\alpha \, (n + n m_1 - m_1) + n m_1}{\alpha \frac{n}{m_1} + 1}$$

$$m_2 \, \alpha \frac{n}{m_1} + m_2 = \alpha \, (n + n m_1 - m_1) + n m_1$$

$$\alpha \left[ n \frac{m_2}{m_1} - (n + n m_1 - m_1) \right] = n m_1 - m_2 \quad (8)$$

$$\alpha \left[ n \left( \frac{m_2}{m_1} - m_1 - 1 \right) + m_1 \right] = n m_1 - m_2$$

$$\alpha = \frac{n m_1 - m_2}{n \left( \frac{m_2}{m_1} - m_1 - 1 \right) + m_1} \; .$$
$$\beta = \alpha \left( \frac{n}{m_1} - 1 \right)$$

$$\beta = \frac{n m_1 - m_2}{n \left( \frac{m_2}{m_1} - m_1 - 1 \right) + m_1} \left( \frac{n}{m_1} - 1 \right)$$

$$\beta = \frac{n^2 - n m_1 - n \frac{m_2}{m_1} + m_2}{n \left( \frac{m_2}{m_1} - m_1 - 1 \right) + m_1} \quad (9)$$

$$\hat{\beta} = \frac{(n - m_1) \left( n - \frac{m_2}{m_1} \right)}{n \left( \frac{m_2}{m_1} - m_1 - 1 \right) + m_1} \; .$$
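The estimates (2) translate directly into code (a sketch; the function name `betabin_mome` is ours):

```python
def betabin_mome(n, m1, m2):
    """Method-of-moments estimates (2) for the beta-binomial distribution,
    given the number of trials n and the raw sample moments m1, m2."""
    denom = n * (m2 / m1 - m1 - 1.0) + m1
    alpha_hat = (n * m1 - m2) / denom                # equation (8)
    beta_hat = (n - m1) * (n - m2 / m1) / denom      # equation (9)
    return alpha_hat, beta_hat

# With the exact moments mu_1 = 4, mu_2 = 22 of a beta-binomial with
# n = 10, alpha = 2, beta = 3, the true parameters are recovered.
```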
Sources:
• statisticsmatt (2022): “Method of Moments Estimation Beta Binomial Distribution”; in: YouTube,
retrieved on 2022-10-07; URL: https://www.youtube.com/watch?v=18PWnWJsPnA.
• Wikipedia (2022): “Beta-binomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-10-07; URL: https://en.wikipedia.org/wiki/Beta-binomial_distribution#Method_of_moments.
Metadata: ID: P357 | shortcut: betabin-mome | author: JoramSoch | date: 2022-10-07, 15:13.
5 Categorical data
5.1 Logistic regression
5.1.1 Definition
Definition: A logistic regression model is given by a set of binary observations yi ∈ {0, 1} , i =
1, . . . , n, a set of predictors xj ∈ Rn , j = 1, . . . , p, a base b and the assumption that the log-odds are
a linear combination of the predictors:
li = xi β + εi , i = 1, . . . , n (1)
where l_i are the log-odds that y_i = 1:

$$l_i = \log_b \frac{\Pr(y_i = 1)}{\Pr(y_i = 0)} \quad (2)$$
and xi is the i-th row of the n × p matrix
X = [x1 , . . . , xp ] . (3)
Within this model,
• y are called “categorical observations” or “dependent variable”;
• X is called “design matrix” or “set of independent variables”;
• β are called “regression coefficients” or “weights”;
• εi is called “noise” or “error term”;
• n is the number of observations;
• p is the number of predictors.
Sources:
• Wikipedia (2020): “Logistic regression”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
06-28; URL: https://en.wikipedia.org/wiki/Logistic_regression#Logistic_model.
Metadata: ID: D76 | shortcut: logreg | author: JoramSoch | date: 2020-06-28, 20:51.
li = xi β + εi , i = 1, . . . , n (1)
where xi are the predictors corresponding to the i-th observation yi and li are the log-odds that
yi = 1.
Then, the log-odds in favor of y_i = 1 against y_i = 0 can also be expressed as

$$l_i = \log_b \frac{p(x_i | y_i = 1) \, p(y_i = 1)}{p(x_i | y_i = 0) \, p(y_i = 0)} \; . \quad (2)$$
Proof: Using Bayes’ theorem (→ Proof I/5.3.1) and the law of marginal probability (→ Definition
I/1.3.3), the posterior probabilities (→ Definition I/5.1.7) for yi = 1 and yi = 0 are given by
$$l_i = \log_b \frac{p(y_i = 1 | x_i)}{p(y_i = 0 | x_i)} = \log_b \frac{p(x_i | y_i = 1) \, p(y_i = 1)}{p(x_i | y_i = 0) \, p(y_i = 0)} \; . \quad (4)$$
Sources:
• Bishop, Christopher M. (2006): “Linear Models for Classification”; in: Pattern Recognition and Machine Learning, ch. 4, p. 197, eq. 4.58; URL: http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf.
Metadata: ID: P105 | shortcut: logreg-pnlo | author: JoramSoch | date: 2020-05-19, 05:08.
li = xi β + εi , i = 1, . . . , n (1)
where xi are the predictors corresponding to the i-th observation yi and li are the log-odds that
yi = 1.
Then, the probability that yi = 1 is given by
$$\Pr(y_i = 1) = \frac{1}{1 + b^{-(x_i \beta + \varepsilon_i)}} \quad (2)$$
where b is the base used to form the log-odds li .
Proof: Writing p_i = Pr(y_i = 1), such that 1 − p_i = Pr(y_i = 0), equation (1) implies:

$$\log_b \frac{p_i}{1 - p_i} = x_i \beta + \varepsilon_i$$

$$\frac{p_i}{1 - p_i} = b^{x_i \beta + \varepsilon_i}$$

$$p_i = b^{x_i \beta + \varepsilon_i} (1 - p_i)$$

$$p_i \left( 1 + b^{x_i \beta + \varepsilon_i} \right) = b^{x_i \beta + \varepsilon_i} \quad (4)$$

$$p_i = \frac{b^{x_i \beta + \varepsilon_i}}{1 + b^{x_i \beta + \varepsilon_i}}$$

$$p_i = \frac{b^{x_i \beta + \varepsilon_i}}{b^{x_i \beta + \varepsilon_i} \left( 1 + b^{-(x_i \beta + \varepsilon_i)} \right)}$$

$$p_i = \frac{1}{1 + b^{-(x_i \beta + \varepsilon_i)}}$$
which proves the identity given by (2).
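The equivalence of log-odds and probabilities is easy to check numerically (a sketch; base b = e gives the familiar logistic sigmoid, and the function names are ours):

```python
import math

def prob_from_log_odds(l, b=math.e):
    """Equation (2): Pr(y_i = 1) from the log-odds l = x_i beta + eps_i."""
    return 1.0 / (1.0 + b ** (-l))

def log_odds_from_prob(p, b=math.e):
    """Inverse mapping: l = log_b[p / (1 - p)]."""
    return math.log(p / (1.0 - p), b)

# Zero log-odds correspond to probability 1/2, for any base b.
```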
Sources:
• Wikipedia (2020): “Logistic regression”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-03; URL: https://en.wikipedia.org/wiki/Logistic_regression#Logistic_model.
Metadata: ID: P72 | shortcut: logreg-lonp | author: JoramSoch | date: 2020-03-03, 12:01.
Chapter IV
Model Selection
1 Goodness-of-fit measures
1.1 Residual variance
1.1.1 Definition
Definition: Let there be a linear regression model (→ Definition III/1.4.1)
y = Xβ + ε, ε ∼ N (0, σ 2 V ) (1)
with measured data y, known design matrix X and covariance structure V as well as unknown
regression coefficients β and noise variance σ 2 .
Then, an estimate of the noise variance σ 2 is called the “residual variance” σ̂ 2 , e.g. obtained via
maximum likelihood estimation (→ Definition I/4.1.3).
Sources:
• original work
Metadata: ID: D20 | shortcut: resvar | author: JoramSoch | date: 2020-02-25, 11:21.
$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \quad (2)$$

where

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \; ; \quad (3)$$

2) and σ̂² is a biased estimator (→ Definition “est-unb”) of σ²

$$\mathrm{E}\!\left[ \hat{\sigma}^2 \right] \neq \sigma^2 \;, \quad (4)$$

more precisely:

$$\mathrm{E}\!\left[ \hat{\sigma}^2 \right] = \frac{n-1}{n} \, \sigma^2 \; . \quad (5)$$
Proof:
1) This is equivalent to the maximum likelihood estimator for the univariate Gaussian with unknown
variance (→ Proof III/1.1.2) and a special case of the maximum likelihood estimator for multiple
linear regression (→ Proof III/1.4.16) in which y = x, X = 1n and β̂ = x̄:
$$\hat{\sigma}^2 = \frac{1}{n} (y - X\hat{\beta})^{\mathrm{T}} (y - X\hat{\beta}) = \frac{1}{n} (x - 1_n \bar{x})^{\mathrm{T}} (x - 1_n \bar{x}) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \; . \quad (6)$$
2) The expectation (→ Definition I/1.7.1) of the maximum likelihood estimator (→ Definition I/4.1.3)
can be developed as follows:
" n #
2 1X
E σ̂ = E (xi − x̄)2
n i=1
" n #
1 X
= E (xi − x̄)2
n
" i=1 #
1 X n
= E x2i − 2xi x̄ + x̄2
n
" i=1 #
1 X n Xn X n
= E x2i − 2 xi x̄ + x̄2
n
" i=1 i=1
#i=1 (7)
1 X n
= E x2i − 2nx̄2 + nx̄2
n
" i=1 #
1 X n
= E x2i − nx̄2
n i=1
!
1 X 2 2
n
= E xi − nE x̄
n i=1
1 X 2
n
= E xi − E x̄2
n i=1
1X n
E σ̂ 2 = Var(xi ) + E(xi )2 − Var(x̄) + E(x̄)2 . (10)
n i=1
482 CHAPTER IV. MODEL SELECTION
" #
1 Xn
1X
n
E [x̄] = E xi = E [xi ]
n i=1 n i=1
1X
n
(11) 1 (12)
= µ= ·n·µ
n i=1 n
=µ.
" #
1X 1 X
n n
Var [x̄] = Var xi = 2 Var [xi ]
n i=1 n i=1
1 X 2
n
(11) 1 (13)
= 2 σ = 2 · n · σ2
n i=1 n
1 2
= σ .
n
Plugging (11), (12) and (13) into (10), we have
$$\mathrm{E}\!\left[ \hat{\sigma}^2 \right] = \frac{1}{n} \sum_{i=1}^{n} \left( \sigma^2 + \mu^2 \right) - \left( \frac{1}{n} \sigma^2 + \mu^2 \right)$$

$$\mathrm{E}\!\left[ \hat{\sigma}^2 \right] = \frac{1}{n} \cdot n \cdot \left( \sigma^2 + \mu^2 \right) - \frac{1}{n} \sigma^2 - \mu^2 \quad (14)$$

$$\mathrm{E}\!\left[ \hat{\sigma}^2 \right] = \sigma^2 + \mu^2 - \frac{1}{n} \sigma^2 - \mu^2$$

$$\mathrm{E}\!\left[ \hat{\sigma}^2 \right] = \frac{n-1}{n} \, \sigma^2$$
which proves the bias (→ Definition “est-unb”) given by (5).
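A quick simulation makes the bias visible (a sketch; the sample size and repetition count are arbitrary choices):

```python
import random

random.seed(1)
n = 5                  # small sample size, so the bias factor (n-1)/n = 0.8 is large
reps = 100000

total = 0.0
for _ in range(reps):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]      # mu = 0, sigma^2 = 1
    x_bar = sum(xs) / n
    total += sum((x - x_bar) ** 2 for x in xs) / n       # MLE estimate (2)

mean_sigma2_hat = total / reps
# By (5), the average estimate should be close to (n-1)/n * sigma^2 = 0.8, not 1.
```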
Sources:
• Liang, Dawen (????): “Maximum Likelihood Estimator for Variance is Biased: Proof”, retrieved
on 2020-02-24; URL: https://dawenl.github.io/files/mle_biased.pdf.
Metadata: ID: P61 | shortcut: resvar-bias | author: JoramSoch | date: 2020-02-24, 23:44.
$$x_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\mu, \sigma^2), \quad i = 1, \ldots, n \; . \quad (1)$$

An unbiased estimator (→ Definition “est-unb”) of σ² is given by

$$\hat{\sigma}^2_{\text{unb}} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \; . \quad (2)$$
Proof: It can be shown that (→ Proof IV/1.1.2) the maximum likelihood estimator (→ Definition
I/4.1.3) of σ 2
$$\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \quad (3)$$

is a biased estimator (→ Definition “est-unb”) in the sense that

$$\mathrm{E}\!\left[ \hat{\sigma}^2_{\text{MLE}} \right] = \frac{n-1}{n} \, \sigma^2 \; . \quad (4)$$

From (4), it follows that

$$\mathrm{E}\!\left[ \frac{n}{n-1} \, \hat{\sigma}^2_{\text{MLE}} \right] = \frac{n}{n-1} \, \mathrm{E}\!\left[ \hat{\sigma}^2_{\text{MLE}} \right] \overset{(4)}{=} \frac{n}{n-1} \cdot \frac{n-1}{n} \, \sigma^2 = \sigma^2 \;, \quad (5)$$
such that an unbiased estimator (→ Definition “est-unb”) can be constructed as
$$\hat{\sigma}^2_{\text{unb}} = \frac{n}{n-1} \, \hat{\sigma}^2_{\text{MLE}} \overset{(3)}{=} \frac{n}{n-1} \cdot \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \; . \quad (6)$$
Sources:
• Liang, Dawen (????): “Maximum Likelihood Estimator for Variance is Biased: Proof”, retrieved
on 2020-02-25; URL: https://dawenl.github.io/files/mle_biased.pdf.
Metadata: ID: P62 | shortcut: resvar-unb | author: JoramSoch | date: 2020-02-25, 15:38.
1.2 R-squared
1.2.1 Definition
Definition: Let there be a linear regression model (→ Definition III/1.4.1) with independent (→
Definition I/1.3.6) observations
$$y = X\beta + \varepsilon, \quad \varepsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2) \quad (1)$$
with measured data y, known design matrix X as well as unknown regression coefficients β and noise
variance σ 2 .
Then, the proportion of the variance of the dependent variable y (“total variance (→ Definition
III/1.4.5)”) that can be predicted from the independent variables X (“explained variance (→ Defi-
nition III/1.4.6)”) is called “coefficient of determination”, “R-squared” or R2 .
Sources:
• Wikipedia (2020): “Coefficient of determination”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-02-25; URL: https://en.wikipedia.org/wiki/Mean_squared_error#Proof_of_variance_
and_bias_relationship.
Metadata: ID: D21 | shortcut: rsq | author: JoramSoch | date: 2020-02-25, 11:41.
$$R^2_{\text{adj}} = 1 - \frac{\mathrm{RSS}/(n-p)}{\mathrm{TSS}/(n-1)} \quad (3)$$
where the residual (→ Definition III/1.4.7) and total sum of squares (→ Definition III/1.4.5) are
$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \;, \quad \hat{y} = X\hat{\beta}$$

$$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \;, \quad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i \quad (4)$$
where X is the n×p design matrix and β̂ are the ordinary least squares (→ Proof III/1.4.3) estimates.
Then, R² is given by

$$R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} \;, \quad (6)$$

which is equal to

$$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \;, \quad (7)$$
because (→ Proof III/1.4.8) TSS = ESS + RSS.
If we replace the variance estimates by their unbiased estimators (→ Proof IV/1.1.3), we obtain
$$R^2_{\text{adj}} = 1 - \frac{\frac{1}{n-p} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{\mathrm{RSS}/\mathrm{df}_r}{\mathrm{TSS}/\mathrm{df}_t} \quad (9)$$

where df_r = n − p and df_t = n − 1 are the residual and total degrees of freedom (→ Definition “dof”).
This gives the adjusted R2 which adjusts R2 for the number of explanatory variables.
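Both (7) and (9) can be computed directly from the observed and fitted values (a sketch; the function name `r_squared` is ours):

```python
def r_squared(y, y_hat, p=None):
    """R^2 = 1 - RSS/TSS, equation (7); if the number of regressors p is
    given, also return the adjusted R^2 of equation (9)."""
    n = len(y)
    y_bar = sum(y) / n
    rss = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1.0 - rss / tss
    if p is None:
        return r2
    r2_adj = 1.0 - (rss / (n - p)) / (tss / (n - 1))
    return r2, r2_adj
```

Because the residual degrees of freedom n − p shrink faster than the total degrees of freedom n − 1, the adjusted value is always below the plain R² whenever p > 1.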
Sources:
• Wikipedia (2019): “Coefficient of determination”; in: Wikipedia, the free encyclopedia, retrieved on
2019-12-06; URL: https://en.wikipedia.org/wiki/Coefficient_of_determination#Adjusted_R2.
R2 = 1 − (exp[∆MLL])−2/n (2)
where n is the number of observations and ∆MLL is the difference in maximum log-likelihood between
the model given by (1) and a linear regression model with only a constant regressor.
Proof: First, we express the maximum log-likelihood (→ Definition I/4.1.5) (MLL) of a linear re-
gression model in terms of its residual sum of squares (→ Definition III/1.4.7) (RSS). The model in
(1) implies the following log-likelihood function (→ Definition I/4.1.2)
$$LL(\beta, \sigma^2) = \log p(y | \beta, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} (y - X\beta)^{\mathrm{T}} (y - X\beta) \;, \quad (3)$$
such that maximum likelihood estimates are (→ Proof III/1.4.16)
$$\hat{\beta} = (X^{\mathrm{T}} X)^{-1} X^{\mathrm{T}} y \quad (4)$$

$$\hat{\sigma}^2 = \frac{1}{n} (y - X\hat{\beta})^{\mathrm{T}} (y - X\hat{\beta}) \quad (5)$$
and the residual sum of squares (→ Definition III/1.4.7) is
$$\mathrm{RSS} = \sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \hat{\varepsilon}^{\mathrm{T}} \hat{\varepsilon} = (y - X\hat{\beta})^{\mathrm{T}} (y - X\hat{\beta}) = n \cdot \hat{\sigma}^2 \; . \quad (6)$$
Since β̂ and σ̂ 2 are maximum likelihood estimates (→ Definition I/4.1.3), plugging them into the
log-likelihood function gives the maximum log-likelihood:
$$\mathrm{MLL} = LL(\hat{\beta}, \hat{\sigma}^2) = -\frac{n}{2} \log(2\pi\hat{\sigma}^2) - \frac{1}{2\hat{\sigma}^2} (y - X\hat{\beta})^{\mathrm{T}} (y - X\hat{\beta}) \; . \quad (7)$$

With (6) for the first σ̂² and (5) for the second σ̂², the MLL becomes

$$\mathrm{MLL} = -\frac{n}{2} \log(\mathrm{RSS}) - \frac{n}{2} \log\!\left( \frac{2\pi}{n} \right) - \frac{n}{2} \; . \quad (8)$$
Second, we establish the relationship between maximum log-likelihood (MLL) and coefficient of
determination (R²). Consider the two models
$$m_0: \; X_0 = 1_n$$

$$m_1: \; X_1 = X \; . \quad (9)$$
For m1 , the residual sum of squares is given by (6); and for m0 , the residual sum of squares is equal
to the total sum of squares (→ Definition III/1.4.5):
$$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \; . \quad (10)$$
With (8) applied to both models, the exponentiated difference in maximum log-likelihood, ΔMLL = MLL(m₁) − MLL(m₀), becomes:

$$\exp[\Delta\mathrm{MLL}] = \exp\!\left[ -\frac{n}{2} \log(\mathrm{RSS}) + \frac{n}{2} \log(\mathrm{TSS}) \right]$$

$$= \left( \exp\left[ \log(\mathrm{RSS}) - \log(\mathrm{TSS}) \right] \right)^{-n/2}$$

$$= \left( \frac{\exp[\log(\mathrm{RSS})]}{\exp[\log(\mathrm{TSS})]} \right)^{-n/2} \quad (12)$$

$$= \left( \frac{\mathrm{RSS}}{\mathrm{TSS}} \right)^{-n/2} \; .$$
Taking both sides to the power of −2/n and subtracting from 1, we have
$$\left( \exp[\Delta\mathrm{MLL}] \right)^{-2/n} = \frac{\mathrm{RSS}}{\mathrm{TSS}}$$

$$1 - \left( \exp[\Delta\mathrm{MLL}] \right)^{-2/n} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = R^2 \quad (13)$$
which proves the identity given above.
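The identity is easy to check numerically from RSS and TSS alone (a sketch; the values below are arbitrary):

```python
import math

def r2_from_delta_mll(delta_mll, n):
    """Equation (2): R^2 = 1 - (exp[dMLL])^(-2/n)."""
    return 1.0 - math.exp(delta_mll) ** (-2.0 / n)

# By (8), dMLL = -(n/2) log(RSS) + (n/2) log(TSS) for the two models in (9),
# so the result must equal 1 - RSS/TSS.
n, rss, tss = 20, 3.0, 12.0
delta_mll = -(n / 2) * math.log(rss) + (n / 2) * math.log(tss)
```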
Sources:
• original work
Metadata: ID: P14 | shortcut: rsq-mll | author: JoramSoch | date: 2020-01-08, 04:46.
$$\mathrm{SNR} = \frac{\mathrm{Var}(X\hat{\beta})}{\hat{\sigma}^2} \; . \quad (2)$$
Sources:
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eq. 6; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
Metadata: ID: D22 | shortcut: snr | author: JoramSoch | date: 2020-02-25, 12:01.
β̂ = (X T X)−1 X T y . (2)
Then, the signal-to-noise ratio (→ Definition IV/1.3.1) can be expressed in terms of the coefficient of determination (→ Definition IV/1.2.1)
$$\mathrm{SNR} = \frac{R^2}{1 - R^2} \quad (3)$$

and vice versa

$$R^2 = \frac{\mathrm{SNR}}{1 + \mathrm{SNR}} \;, \quad (4)$$
if the predicted signal mean is equal to the actual signal mean.
Note that it is irrelevant whether we use the biased estimator of the variance (→ Proof IV/1.1.2) (dividing by n) or the unbiased estimator of the variance (→ Proof IV/1.1.3) (dividing by n − 1), because the relevant terms cancel out.
If the predicted signal mean is equal to the actual signal mean – which is the case when variable
regressors in X have mean zero, such that they are orthogonal to a constant regressor in X –, this
means that ŷ¯ = ȳ, such that
$$\mathrm{SNR} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \; . \quad (7)$$

Then, the SNR can be written in terms of the explained (→ Definition III/1.4.6), residual (→ Definition III/1.4.7) and total sum of squares (→ Definition III/1.4.5):

$$\mathrm{SNR} = \frac{\mathrm{ESS}}{\mathrm{RSS}} = \frac{\mathrm{ESS}/\mathrm{TSS}}{\mathrm{RSS}/\mathrm{TSS}} \; . \quad (8)$$
With the derivation of the coefficient of determination (→ Proof IV/1.2.2), this becomes
$$\mathrm{SNR} = \frac{R^2}{1 - R^2} \; . \quad (9)$$

Rearranging this equation for the coefficient of determination (→ Definition IV/1.2.1), we have

$$R^2 = \frac{\mathrm{SNR}}{1 + \mathrm{SNR}} \; . \quad (10)$$
Sources:
• original work
Metadata: ID: P63 | shortcut: snr-rsq | author: JoramSoch | date: 2020-02-26, 10:37.
Sources:
• Akaike H (1974): “A New Look at the Statistical Model Identification”; in: IEEE Transactions on
Automatic Control, vol. AC-19, no. 6, pp. 716-723; URL: https://ieeexplore.ieee.org/document/
1100705; DOI: 10.1109/TAC.1974.1100705.
Metadata: ID: D23 | shortcut: aic | author: JoramSoch | date: 2020-02-25, 12:31.
Then, the corrected Akaike information criterion (→ Definition IV/2.1.2) (AICc ) of this model is
defined as
$$\mathrm{AIC}_c(m) = \mathrm{AIC}(m) + \frac{2k^2 + 2k}{n - k - 1} \quad (2)$$
where AIC(m) is the Akaike information criterion (→ Definition IV/2.1.1) and k is the number of
free parameters estimated via (1).
Sources:
• Hurvich CM, Tsai CL (1989): “Regression and time series model selection in small samples”; in:
Biometrika, vol. 76, no. 2, pp. 297-307; URL: https://academic.oup.com/biomet/article-abstract/
76/2/297/265326; DOI: 10.1093/biomet/76.2.297.
Metadata: ID: D171 | shortcut: aicc | author: JoramSoch | date: 2022-02-11, 06:49.
$$\mathrm{AIC}_c(m) = \mathrm{AIC}(m) + \frac{2k^2 + 2k}{n - k - 1} \; . \quad (2)$$
Note that the number of free model parameters k is finite. Thus, we have:
$$\lim_{n \to \infty} \mathrm{AIC}_c(m) = \lim_{n \to \infty} \left[ \mathrm{AIC}(m) + \frac{2k^2 + 2k}{n - k - 1} \right] = \lim_{n \to \infty} \mathrm{AIC}(m) + \lim_{n \to \infty} \frac{2k^2 + 2k}{n - k - 1} = \mathrm{AIC}(m) + 0 = \mathrm{AIC}(m) \; . \quad (3)$$
Sources:
• Wikipedia (2022): “Akaike information criterion”; in: Wikipedia, the free encyclopedia, retrieved on
2022-03-18; URL: https://en.wikipedia.org/wiki/Akaike_information_criterion#Modification_for_
small_sample_size.
Metadata: ID: P316 | shortcut: aicc-aic | author: JoramSoch | date: 2022-03-18, 17:00.
$$\mathrm{AIC}_c(m) = \mathrm{AIC}(m) + \frac{2k^2 + 2k}{n - k - 1} \; . \quad (3)$$
Plugging (2) into (3), we obtain:
$$\mathrm{AIC}_c(m) = -2 \log p(y | \hat{\theta}, m) + 2k + \frac{2k^2 + 2k}{n - k - 1}$$

$$= -2 \log p(y | \hat{\theta}, m) + \frac{2k(n - k - 1)}{n - k - 1} + \frac{2k^2 + 2k}{n - k - 1}$$

$$= -2 \log p(y | \hat{\theta}, m) + \frac{2nk - 2k^2 - 2k}{n - k - 1} + \frac{2k^2 + 2k}{n - k - 1} \quad (4)$$

$$= -2 \log p(y | \hat{\theta}, m) + \frac{2nk}{n - k - 1} \; .$$
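The closed form (4) and the definition (2) can be cross-checked in a few lines (a sketch; the function names are ours and `mll` denotes the maximum log-likelihood):

```python
def aic(mll, k):
    """Akaike information criterion: -2 MLL + 2k."""
    return -2.0 * mll + 2.0 * k

def aicc(mll, k, n):
    """Corrected AIC via the closed form (4): -2 MLL + 2nk/(n - k - 1)."""
    return -2.0 * mll + 2.0 * n * k / (n - k - 1.0)

# The closed form equals AIC plus the small-sample penalty of (2),
# and the penalty vanishes as n grows.
mll, k, n = -42.0, 3, 25
```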
Sources:
• Wikipedia (2022): “Akaike information criterion”; in: Wikipedia, the free encyclopedia, retrieved on
2022-03-11; URL: https://en.wikipedia.org/wiki/Akaike_information_criterion#Modification_for_
small_sample_size.
Metadata: ID: P315 | shortcut: aicc-mll | author: JoramSoch | date: 2022-03-11, 16:53.
Sources:
• Schwarz G (1978): “Estimating the Dimension of a Model”; in: The Annals of Statistics, vol. 6,
no. 2, pp. 461-464; URL: https://www.jstor.org/stable/2958889.
Metadata: ID: D24 | shortcut: bic | author: JoramSoch | date: 2020-02-25, 12:21.
2.2.2 Derivation
Theorem: Let p(y|θ, m) be the likelihood function (→ Definition I/5.1.2) of a generative model
(→ Definition I/5.1.1) m ∈ M with model parameters θ ∈ Θ describing measured data y ∈ Rn .
Let p(θ|m) be a prior distribution (→ Definition I/5.1.3) on the model parameters. Assume that
likelihood function and prior density are twice differentiable.
Then, as the number of data points goes to infinity, an approximation to the log-marginal likelihood
(→ Definition I/5.1.9) log p(y|m), up to constant terms not depending on the model, is given by the
Bayesian information criterion (→ Definition IV/2.2.1) (BIC) as
$$g(\theta) = p(\theta | m)$$

$$h(\theta) = \frac{1}{n} LL(\theta) \; . \quad (3)$$

Then, the marginal likelihood (→ Definition I/5.1.9) can be written as follows:

$$p(y | m) = \int_{\Theta} p(y | \theta, m) \, p(\theta | m) \, \mathrm{d}\theta = \int_{\Theta} \exp[n \, h(\theta)] \, g(\theta) \, \mathrm{d}\theta \; . \quad (4)$$
As n → ∞, the last three terms are O_p(1) and can therefore be ignored when comparing between models M = {m₁, …, m_M} and using p(y|m_j) to compute posterior model probabilities (→ Definition IV/3.4.1) p(m_j|y). With that, the BIC is given as

$$\mathrm{BIC}(m) = -2 \log p(y | \hat{\theta}, m) + k \log n \; .$$
Sources:
• Claeskens G, Hjort NL (2008): “The Bayesian information criterion”; in: Model Selection and Model
Averaging, ch. 3.2, pp. 78-81; URL: https://www.cambridge.org/core/books/model-selection-and-model-av
E6F1EC77279D1223423BB64FC3A12C37; DOI: 10.1017/CBO9780511790485.
Metadata: ID: P32 | shortcut: bic-der | author: JoramSoch | date: 2020-01-26, 23:36.
Sources:
• Spiegelhalter DJ, Best NG, Carlin BP, Van Der Linde A (2002): “Bayesian measures of model com-
plexity and fit”; in: Journal of the Royal Statistical Society, Series B: Statistical Methodology, vol.
64, iss. 4, pp. 583-639; URL: https://rss.onlinelibrary.wiley.com/doi/10.1111/1467-9868.00353;
DOI: 10.1111/1467-9868.00353.
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eqs. 10-12; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
Metadata: ID: D25 | shortcut: dic | author: JoramSoch | date: 2020-02-25, 12:46.
2.3.2 Deviance
Definition: Let there be a generative model (→ Definition I/5.1.1) m describing measured data y using model parameters θ. Then, the deviance of m is a function of θ which multiplies the log-likelihood function (→ Definition I/4.1.2) with −2:

$$D(\theta) = -2 \log p(y | \theta, m) = -2 \, LL(\theta) \; . \quad (1)$$

The deviance function serves the definition of the deviance information criterion (→ Definition IV/2.3.1).
Sources:
• Spiegelhalter DJ, Best NG, Carlin BP, Van Der Linde A (2002): “Bayesian measures of model com-
plexity and fit”; in: Journal of the Royal Statistical Society, Series B: Statistical Methodology, vol.
64, iss. 4, pp. 583-639; URL: https://rss.onlinelibrary.wiley.com/doi/10.1111/1467-9868.00353;
DOI: 10.1111/1467-9868.00353.
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eqs. 10-12; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
• Wikipedia (2022): “Deviance information criterion”; in: Wikipedia, the free encyclopedia, retrieved
on 2022-03-01; URL: https://en.wikipedia.org/wiki/Deviance_information_criterion#Definition.
Metadata: ID: D172 | shortcut: dev | author: JoramSoch | date: 2022-03-01, 07:48.
Sources:
• Penny WD (2012): “Comparing Dynamic Causal Models using AIC, BIC and Free Energy”; in:
NeuroImage, vol. 59, iss. 2, pp. 319-330, eq. 15; URL: https://www.sciencedirect.com/science/
article/pii/S1053811911008160; DOI: 10.1016/j.neuroimage.2011.07.039.
3.1.2 Derivation
Theorem: Let p(y|θ, m) be a likelihood function (→ Definition I/5.1.2) of a generative model (→
Definition I/5.1.1) m for making inferences on model parameters θ given measured data y. Moreover,
let p(θ|m) be a prior distribution (→ Definition I/5.1.3) on model parameters θ in the parameter
space Θ. Then, the model evidence (→ Definition IV/3.1.1) (ME) can be expressed in terms of
likelihood (→ Definition I/5.1.2) and prior (→ Definition I/5.1.3) as
$$\mathrm{ME}(m) = \int_{\Theta} p(y | \theta, m) \, p(\theta | m) \, \mathrm{d}\theta \; . \quad (1)$$
Proof: This is a consequence of the law of marginal probability (→ Definition I/1.3.3) for continuous variables (→ Definition I/1.2.6)

$$p(y | m) = \int_{\Theta} p(y, \theta | m) \, \mathrm{d}\theta \quad (2)$$
Sources:
• original work
Metadata: ID: P367 | shortcut: me-der | author: JoramSoch | date: 2022-10-20, 10:11.
Sources:
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eq. 13; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
Metadata: ID: D26 | shortcut: lme | author: JoramSoch | date: 2020-02-25, 12:56.
Proof: This is a consequence of the law of marginal probability (→ Definition I/1.3.3) for continuous variables (→ Definition I/1.2.6)

$$p(y | m) = \int_{\Theta} p(y, \theta | m) \, \mathrm{d}\theta \quad (3)$$
Sources:
• original work
Metadata: ID: P13 | shortcut: lme-der | author: JoramSoch | date: 2020-01-06, 21:27.
Proof: For a full probability model (→ Definition I/5.1.4), Bayes’ theorem (→ Proof I/5.3.1) makes
a statement about the posterior distribution (→ Definition I/5.1.7):
$$p(\theta | y, m) = \frac{p(y | \theta, m) \, p(\theta | m)}{p(y | m)} \; . \quad (3)$$
Rearranging for p(y|m) and logarithmizing, we have:
$$\mathrm{LME}(m) = \log p(y | m) = \log \frac{p(y | \theta, m) \, p(\theta | m)}{p(\theta | y, m)} = \log p(y | \theta, m) + \log p(\theta | m) - \log p(\theta | y, m) \; . \quad (4)$$
Sources:
• original work
Metadata: ID: P314 | shortcut: lme-pnp | author: JoramSoch | date: 2022-03-11, 16:25.
Proof: We consider Bayesian inference on data (→ Definition “data”) y using model (→ Definition
I/5.1.1) m with parameters θ. Then, Bayes’ theorem (→ Proof I/5.3.1) makes a statement about the
posterior distribution (→ Definition I/5.1.7), i.e. the probability of parameters, given the data and
the model:
$$p(\theta | y, m) = \frac{p(y | \theta, m) \, p(\theta | m)}{p(y | m)} \; . \quad (4)$$
Rearranging this for the model evidence (→ Proof IV/3.1.5), we have:
$$p(y | m) = \frac{p(y | \theta, m) \, p(\theta | m)}{p(\theta | y, m)} \; . \quad (5)$$
Logarithmizing both sides of the equation, we obtain:

$$\log p(y | m) = \log p(y | \theta, m) - \log \frac{p(\theta | y, m)}{p(\theta | m)} \; . \quad (6)$$
Now taking the expectation over the posterior distribution yields:
$$\log p(y | m) = \int p(\theta | y, m) \log p(y | \theta, m) \, \mathrm{d}\theta - \int p(\theta | y, m) \log \frac{p(\theta | y, m)}{p(\theta | m)} \, \mathrm{d}\theta \; . \quad (7)$$
By definition, the left-hand side is the log model evidence and the terms on the right-hand side correspond to the posterior expectation of the log-likelihood function and the Kullback-Leibler divergence of posterior from prior.
Sources:
• Penny et al. (2007): “Bayesian Comparison of Spatially Regularised General Linear Models”; in:
Human Brain Mapping, vol. 28, pp. 275–293; URL: https://onlinelibrary.wiley.com/doi/full/10.
1002/hbm.20327; DOI: 10.1002/hbm.20327.
• Soch et al. (2016): “How to avoid mismodelling in GLM-based fMRI data analysis: cross-validated
Bayesian model selection”; in: NeuroImage, vol. 141, pp. 469–489; URL: https://www.sciencedirect.
com/science/article/pii/S1053811916303615; DOI: 10.1016/j.neuroimage.2016.07.047.
Sources:
• Wikipedia (2020): “Lindley’s paradox”; in: Wikipedia, the free encyclopedia, retrieved on 2020-11-
25; URL: https://en.wikipedia.org/wiki/Lindley%27s_paradox#Bayesian_approach.
Metadata: ID: D113 | shortcut: uplme | author: JoramSoch | date: 2020-11-25, 07:28.
$$\mathrm{cvLME}(m) = \sum_{i=1}^{S} \log \int p(y_i | \theta, m) \, p(\theta | y_{\neg i}, m) \, \mathrm{d}\theta \quad (1)$$

where $y_{\neg i} = \bigcup_{j \neq i} y_j$ is the union of all data subsets except y_i and p(θ|y_{¬i}, m) is the posterior distribution (→ Definition I/5.1.7) obtained from y_{¬i} when using the prior distribution (→ Definition I/5.1.3) p_{ni}(θ|m).
Sources:
• Soch J, Allefeld C, Haynes JD (2016): “How to avoid mismodelling in GLM-based fMRI data anal-
ysis: cross-validated Bayesian model selection”; in: NeuroImage, vol. 141, pp. 469-489, eqs. 13-15;
URL: https://www.sciencedirect.com/science/article/pii/S1053811916303615; DOI: 10.1016/j.neuroimage.
• Soch J, Meyer AP, Allefeld C, Haynes JD (2017): “How to improve parameter estimates in GLM-
based fMRI data analysis: cross-validated Bayesian model averaging”; in: NeuroImage, vol. 158,
pp. 186-195, eq. 6; URL: https://www.sciencedirect.com/science/article/pii/S105381191730527X;
DOI: 10.1016/j.neuroimage.2017.06.056.
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eqs. 14-15; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
• Soch J (2018): “cvBMS and cvBMA: filling in the gaps”; in: arXiv stat.ME, arXiv:1807.01585;
URL: https://arxiv.org/abs/1807.01585.
Metadata: ID: D111 | shortcut: cvlme | author: JoramSoch | date: 2020-11-19, 04:55.
$$p(y | \lambda, m) = \int p(y | \theta, \lambda, m) \, p(\theta | \lambda, m) \, \mathrm{d}\theta \quad (2)$$
Sources:
• Wikipedia (2020): “Empirical Bayes method”; in: Wikipedia, the free encyclopedia, retrieved on
2020-11-25; URL: https://en.wikipedia.org/wiki/Empirical_Bayes_method#Introduction.
• Penny, W.D. and Ridgway, G.R. (2013): “Efficient Posterior Probability Mapping Using Savage-
Dickey Ratios”; in: PLoS ONE, vol. 8, iss. 3, art. e59655, eqs. 7/11; URL: https://journals.plos.
org/plosone/article?id=10.1371/journal.pone.0059655; DOI: 10.1371/journal.pone.0059655.
Metadata: ID: D114 | shortcut: eblme | author: JoramSoch | date: 2020-11-25, 07:43.
and

$$\mathrm{KL}\left[ q(\theta) \,||\, p(\theta | m) \right] = \int q(\theta) \log \frac{q(\theta)}{p(\theta | m)} \, \mathrm{d}\theta \; . \quad (3)$$
Sources:
• Wikipedia (2020): “Variational Bayesian methods”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-11-25; URL: https://en.wikipedia.org/wiki/Variational_Bayesian_methods#Evidence_
lower_bound.
• Penny W, Flandin G, Trujillo-Barreto N (2007): “Bayesian Comparison of Spatially Regularised
General Linear Models”; in: Human Brain Mapping, vol. 28, pp. 275–293, eqs. 2-9; URL: https:
//onlinelibrary.wiley.com/doi/full/10.1002/hbm.20327; DOI: 10.1002/hbm.20327.
Metadata: ID: D115 | shortcut: vblme | author: JoramSoch | date: 2020-11-25, 08:10.
$$f \Leftrightarrow m_1 \lor \ldots \lor m_M \; . \quad (1)$$

Then, the family evidence (FE) of f is defined as the marginal probability (→ Definition I/1.3.3) relative to the model evidences (→ Definition IV/3.1.1) p(y|m_i), conditional only on f:

$$\mathrm{FE}(f) = p(y | f) \; . \quad (2)$$
Sources:
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eq. 16; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
3.2.2 Derivation
Theorem: Let f be a family of M generative models (→ Definition I/5.1.1) m1 , . . . , mM with model
evidences (→ Definition IV/3.1.1) p(y|m1 ), . . . , p(y|mM ). Then, the family evidence (→ Definition
IV/3.2.1) can be expressed in terms of the model evidences as
$$\mathrm{FE}(f) = \sum_{i=1}^{M} p(y | m_i) \, p(m_i | f) \quad (1)$$
where p(mi |f ) are the within-family (→ Definition IV/3.2.3) prior (→ Definition I/5.1.3) model (→
Definition I/5.1.1) probabilities (→ Definition I/1.3.1).
Proof: This is a consequence of the law of marginal probability (→ Definition I/1.3.3) for discrete variables (→ Definition I/1.2.6)

$$p(y | f) = \sum_{i=1}^{M} p(y, m_i | f) \;, \quad (2)$$

such that

$$\mathrm{FE}(f) = p(y | f) = \sum_{i=1}^{M} p(y | m_i) \, p(m_i | f) \; . \quad (5)$$
Sources:
• original work
Metadata: ID: P368 | shortcut: fe-der | author: JoramSoch | date: 2022-10-20, 10:47.
$$f \Leftrightarrow m_1 \lor \ldots \lor m_M \; . \quad (1)$$

Then, the log family evidence is given by the logarithm of the family evidence (→ Definition IV/3.2.1):

$$\mathrm{LFE}(f) = \log p(y | f) = \log \sum_{i=1}^{M} p(y | m_i) \, p(m_i | f) \; . \quad (2)$$
Sources:
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eq. 16; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
Metadata: ID: D80 | shortcut: lfe | author: JoramSoch | date: 2020-07-13, 22:31.
$$\mathrm{LFE}(f) = \log \sum_{i=1}^{M} p(y | m_i) \, p(m_i | f) \quad (2)$$
where p(mi |f ) are the within-family (→ Definition IV/3.2.3) prior (→ Definition I/5.1.3) model (→
Definition I/5.1.1) probabilities (→ Definition I/1.3.1).
$$p(f) = \sum_{i=1}^{M} p(m_i) \quad (3)$$

$$p(f | y) = \sum_{i=1}^{M} p(m_i | y) \quad (4)$$
Applying Bayes’ theorem (→ Proof I/5.3.1) to both the family evidence and the model evidence (→ Definition IV/3.2.3) and combining with (3) and (4), we get:
$$p(y | f) = \frac{\sum_{i=1}^{M} p(y | m_i) \, p(m_i)}{\sum_{i=1}^{M} p(m_i)} = \sum_{i=1}^{M} p(y | m_i) \cdot \frac{p(m_i)}{\sum_{i=1}^{M} p(m_i)} = \sum_{i=1}^{M} p(y | m_i) \cdot \frac{p(m_i, f)}{p(f)} = \sum_{i=1}^{M} p(y | m_i) \cdot p(m_i | f) \; . \quad (9)$$
Sources:
• original work
Metadata: ID: P132 | shortcut: lfe-der | author: JoramSoch | date: 2020-07-13, 22:58.
LFE(f_j) = \log \sum_{m_i \in f_j} \exp[\mathrm{LME}(m_i)] \cdot p(m_i|f_j) , \quad j = 1, \ldots, F,    (1)
where p(mi |fj ) are within-family (→ Definition IV/3.2.3) prior (→ Definition I/5.1.3) model (→
Definition I/5.1.1) probabilities (→ Definition I/1.3.1).
Proof: Let us consider the (unlogarithmized) family evidence p(y|fj ). According to the law of
marginal probability (→ Definition I/1.3.3), this conditional probability is given by
p(y|f_j) = \sum_{m_i \in f_j} \left[ p(y|m_i, f_j) \cdot p(m_i|f_j) \right] .    (2)
Because model families are mutually exclusive, each model m_i ∈ f_j implies f_j, so that p(y|m_i, f_j) = p(y|m_i) and
p(y|f_j) = \sum_{m_i \in f_j} \left[ p(y|m_i) \cdot p(m_i|f_j) \right] .    (3)
Logarithmizing transforms the family evidence p(y|f_j) into the log family evidence LFE(f_j):

LFE(f_j) = \log \sum_{m_i \in f_j} \left[ p(y|m_i) \cdot p(m_i|f_j) \right] .    (4)

Finally, since the log model evidence (→ Definition IV/3.1.3) is LME(m_i) = \log p(y|m_i), substituting p(y|m_i) = \exp[\mathrm{LME}(m_i)] yields equation (1).
Sources:
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eq. 16; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
Metadata: ID: P65 | shortcut: lfe-lme | author: JoramSoch | date: 2020-02-27, 21:16.
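In practice, log model evidences are large negative numbers, so equation (1) is best evaluated with the log-sum-exp trick rather than by exponentiating directly. A minimal sketch with illustrative values:

```python
import numpy as np

# Hypothetical log model evidences LME(m_i) and within-family prior
# probabilities p(m_i|f) for one family of M = 3 models (made-up values).
lme = np.array([-100.0, -102.0, -110.0])
prior = np.array([1.0, 1.0, 1.0]) / 3.0

# LFE(f) = log sum_i exp[LME(m_i)] p(m_i|f), eq. (1), computed stably:
# shifting by max(LME) before exponentiating avoids numerical underflow.
shift = lme.max()
lfe = shift + np.log(np.sum(np.exp(lme - shift) * prior))

# Naive evaluation for comparison (can underflow for very negative LMEs)
lfe_naive = np.log(np.sum(np.exp(lme) * prior))
```

For even more negative LMEs (e.g. around -1000), only the shifted version returns a finite result.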
BF_{12} = \frac{p(y \mid m_1)}{p(y \mid m_2)} .    (1)
Note that by Bayes’ theorem (→ Proof I/5.3.1), the ratio of posterior model probabilities (→ Defi-
nition IV/3.4.1) (i.e., the posterior model odds) can be written as
\frac{p(m_1 \mid y)}{p(m_2 \mid y)} = \frac{p(m_1)}{p(m_2)} \cdot BF_{12} .    (3)
In other words, the Bayes factor can be viewed as the factor by which the prior model odds are
updated (after observing data y) to posterior model odds – which is also expressed by Bayes’ rule
(→ Proof I/5.3.2).
Sources:
• Kass, Robert E. and Raftery, Adrian E. (1995): “Bayes Factors”; in: Journal of the American
Statistical Association, vol. 90, no. 430, pp. 773-795; URL: https://dx.doi.org/10.1080/01621459.
1995.10476572; DOI: 10.1080/01621459.1995.10476572.
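The odds-updating role of the Bayes factor in equation (3) can be illustrated with hypothetical numbers:

```python
# Hypothetical prior model probabilities and Bayes factor (made-up values)
p_m1, p_m2 = 0.5, 0.5  # equal prior model probabilities
bf12 = 4.0             # evidence ratio p(y|m1)/p(y|m2)

# Eq. (3): posterior odds = prior odds * Bayes factor
prior_odds = p_m1 / p_m2
posterior_odds = prior_odds * bf12

# Recover p(m1|y) from the posterior odds, since the two posterior
# model probabilities sum to one
p_m1_post = posterior_odds / (1.0 + posterior_odds)
```

Here the data multiply even prior odds by 4, giving a posterior probability of 0.8 for m1.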
3.3.2 Transitivity
Theorem: Consider three competing models (→ Definition I/5.1.1) m_1, m_2, and m_3 for observed data y. Then the Bayes factor (→ Definition IV/3.3.1) for m_1 over m_3 can be written as:

BF_{13} = BF_{12} \cdot BF_{23} .    (1)
Proof: By definition (→ Definition IV/3.3.1), the Bayes factor BF13 is the ratio of marginal likeli-
hoods of data y over m1 and m3 , respectively. That is,
BF_{13} = \frac{p(y \mid m_1)}{p(y \mid m_3)} .    (2)
We can equivalently write
BF_{13} \overset{(2)}{=} \frac{p(y \mid m_1)}{p(y \mid m_3)}
        = \frac{p(y \mid m_1)}{p(y \mid m_3)} \cdot \frac{p(y \mid m_2)}{p(y \mid m_2)}
        = \frac{p(y \mid m_1)}{p(y \mid m_2)} \cdot \frac{p(y \mid m_2)}{p(y \mid m_3)}    (3)
        \overset{(2)}{=} BF_{12} \cdot BF_{23} .
Sources:
• original work
Metadata: ID: P163 | shortcut: bf-trans | author: tomfaulkenberry | date: 2020-09-07, 12:00.
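The transitivity relation can be checked numerically with made-up model evidences:

```python
# Hypothetical model evidences p(y|m_i) (illustrative values)
p_y_m1, p_y_m2, p_y_m3 = 0.9, 0.3, 0.1

# Pairwise Bayes factors as ratios of model evidences, eq. (2)
bf12 = p_y_m1 / p_y_m2
bf23 = p_y_m2 / p_y_m3
bf13 = p_y_m1 / p_y_m3

# Transitivity: BF13 equals the product BF12 * BF23
product = bf12 * bf23
```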
BF_{01} = \frac{p(\delta = \delta_0 \mid y, m_1)}{p(\delta = \delta_0 \mid m_1)} .    (1)
Proof: By definition (→ Definition IV/3.3.1), the Bayes factor BF01 is the ratio of marginal likeli-
hoods of data y over m0 and m1 , respectively. That is,
BF_{01} = \frac{p(y \mid m_0)}{p(y \mid m_1)} .    (2)
The key idea in the proof is that we can use a “change of variables” technique to express BF01 entirely
in terms of the “encompassing” model m1 . This proceeds by first unpacking the marginal likelihood
(→ Definition I/5.1.9) for m0 over the nuisance parameter φ and then using the fact that m0 is a
sharp hypothesis nested within m1 to rewrite everything in terms of m1 . Specifically,
p(y \mid m_0) = \int p(y \mid \phi, m_0) \, p(\phi \mid m_0) \, d\phi
             = \int p(y \mid \phi, \delta = \delta_0, m_1) \, p(\phi \mid \delta = \delta_0, m_1) \, d\phi    (3)
             = p(y \mid \delta = \delta_0, m_1) .

By Bayes' theorem (→ Proof I/5.3.1), this quantity can be written as

p(y \mid \delta = \delta_0, m_1) = \frac{p(\delta = \delta_0 \mid y, m_1) \, p(y \mid m_1)}{p(\delta = \delta_0 \mid m_1)} .    (4)
Thus, we have

BF_{01} \overset{(2)}{=} \frac{p(y \mid m_0)}{p(y \mid m_1)}
        = p(y \mid m_0) \cdot \frac{1}{p(y \mid m_1)}
        \overset{(3)}{=} p(y \mid \delta = \delta_0, m_1) \cdot \frac{1}{p(y \mid m_1)}    (5)
        \overset{(4)}{=} \frac{p(\delta = \delta_0 \mid y, m_1) \, p(y \mid m_1)}{p(\delta = \delta_0 \mid m_1)} \cdot \frac{1}{p(y \mid m_1)}
        = \frac{p(\delta = \delta_0 \mid y, m_1)}{p(\delta = \delta_0 \mid m_1)} .
Sources:
• Faulkenberry, Thomas J. (2019): “A tutorial on generalizing the default Bayesian t-test via pos-
terior sampling and encompassing priors”; in: Communications for Statistical Applications and
Methods, vol. 26, no. 2, pp. 217-238; URL: https://dx.doi.org/10.29220/CSAM.2019.26.2.217;
DOI: 10.29220/CSAM.2019.26.2.217.
• Penny, W.D. and Ridgway, G.R. (2013): “Efficient Posterior Probability Mapping Using Savage-
Dickey Ratios”; in: PLoS ONE, vol. 8, iss. 3, art. e59655, eq. 16; URL: https://journals.plos.org/
plosone/article?id=10.1371/journal.pone.0059655; DOI: 10.1371/journal.pone.0059655.
Metadata: ID: P156 | shortcut: bf-sddr | author: tomfaulkenberry | date: 2020-08-26, 12:00.
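As a numeric sanity check on equation (1), consider a conjugate normal example (all numbers illustrative and hypothetical): the Savage-Dickey density ratio reproduces the Bayes factor computed directly from the two marginal likelihoods.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return exp(-(x - mean) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)

# Hypothetical nested normal setup (made-up values): under the
# encompassing model m1, the effect delta has prior N(0, tau2); the
# sufficient statistic ybar is N(delta, s2) with s2 known; m0 is the
# sharp hypothesis delta = delta0 = 0 with the same nuisance structure.
tau2, s2, ybar, delta0 = 1.0, 0.25, 0.6, 0.0

# Bayes factor from the marginal likelihoods, eq. (2)
bf_marginal = normal_pdf(ybar, delta0, s2) / normal_pdf(ybar, 0.0, tau2 + s2)

# Savage-Dickey density ratio, eq. (1): posterior density over prior
# density of delta, both evaluated at the sharp null value delta0
post_var = 1.0 / (1.0 / tau2 + 1.0 / s2)
post_mean = post_var * ybar / s2
bf_sd = normal_pdf(delta0, post_mean, post_var) / normal_pdf(delta0, 0.0, tau2)
```

Both routes give the same BF01, as the theorem requires; the Savage-Dickey route needs only quantities computed under the encompassing model m1.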
BF_{1e} = \frac{c}{d} = \frac{1/d}{1/c}    (1)
where 1/d and 1/c represent the proportions of the posterior and prior of the encompassing model,
respectively, that are in agreement with the inequality constraint imposed by the nested model m1 .
Proof: Consider first that for any model m1 on data y with parameter θ, Bayes’ theorem (→ Proof
I/5.3.1) implies
p(\theta \mid y, m_1) = \frac{p(y \mid \theta, m_1) \cdot p(\theta \mid m_1)}{p(y \mid m_1)} .    (2)
Rearranging equation (2) allows us to write the marginal likelihood (→ Definition I/5.1.9) for y
under m1 as
p(y \mid m_1) = \frac{p(y \mid \theta, m_1) \cdot p(\theta \mid m_1)}{p(\theta \mid y, m_1)} .    (3)
Taking the ratio of the marginal likelihoods for m1 and the encompassing model (→ Definition
IV/3.3.5) me yields the following Bayes factor (→ Definition IV/3.3.1):
BF_{1e} = \frac{p(\theta' \mid m_1) / p(\theta' \mid y, m_1)}{p(\theta' \mid m_e) / p(\theta' \mid y, m_e)} .    (5)
Because m1 is nested within me via an inequality constraint, the prior p(θ′ | m1 ) is simply a truncation
of the encompassing prior p(θ′ | me ). Thus, we can express p(θ′ | m1 ) in terms of the encompassing
prior p(θ′ | me ) by multiplying the encompassing prior by an indicator function over m1 and then
normalizing the resulting product. That is,

p(\theta' \mid m_1) = \left( \frac{I_{\theta' \in m_1}}{\int_{\theta \in m_1} p(\theta \mid m_e) \, d\theta} \right) \cdot p(\theta' \mid m_e) ,

where I_{\theta' \in m_1} is an indicator function. For parameters \theta' \in m_1, this indicator function is identically equal to 1, so the expression in parentheses reduces to a constant, say c, allowing us to write the prior as

p(\theta' \mid m_1) = c \cdot p(\theta' \mid m_e) .

By the same argument, the posterior under m_1 is a truncation of the encompassing posterior, p(\theta' \mid y, m_1) = d \cdot p(\theta' \mid y, m_e) for some constant d. Substituting both expressions into equation (5), the encompassing prior and posterior densities cancel, leaving BF_{1e} = c/d, which is equation (1).
Sources:
• Klugkist, I., Kato, B., and Hoijtink, H. (2005): “Bayesian model selection using encompassing
priors”; in: Statistica Neerlandica, vol. 59, no. 1, pp. 57-69; URL: https://dx.doi.org/10.1111/j.
1467-9574.2005.00279.x; DOI: 10.1111/j.1467-9574.2005.00279.x.
3. BAYESIAN MODEL SELECTION 509
• Faulkenberry, Thomas J. (2019): “A tutorial on generalizing the default Bayesian t-test via pos-
terior sampling and encompassing priors”; in: Communications for Statistical Applications and
Methods, vol. 26, no. 2, pp. 217-238; URL: https://dx.doi.org/10.29220/CSAM.2019.26.2.217;
DOI: 10.29220/CSAM.2019.26.2.217.
Metadata: ID: P157 | shortcut: bf-ep | author: tomfaulkenberry | date: 2020-09-02, 12:00.
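A Monte Carlo sketch of equation (1), with a hypothetical encompassing prior and posterior (all distributions and numbers illustrative): 1/c and 1/d are estimated as the proportions of prior and posterior draws satisfying the inequality constraint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encompassing model m_e (illustrative values): prior
# theta ~ N(0, 1) and posterior theta|y ~ N(0.8, 0.4^2); the nested
# model m1 imposes the inequality constraint theta > 0.
prior_draws = rng.normal(0.0, 1.0, size=100_000)
posterior_draws = rng.normal(0.8, 0.4, size=100_000)

# 1/c and 1/d: proportions of prior and posterior draws, respectively,
# that agree with the constraint theta > 0
inv_c = np.mean(prior_draws > 0.0)
inv_d = np.mean(posterior_draws > 0.0)

# Eq. (1): BF_1e = c/d = (1/d)/(1/c)
bf_1e = inv_d / inv_c
```

In this setup the prior places roughly half its mass on theta > 0 while the posterior places nearly all of it there, so the constrained model m1 is favored over the encompassing model.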
Sources:
• Klugkist, I., Kato, B., and Hoijtink, H. (2005): “Bayesian model selection using encompassing
priors”; in: Statistica Neerlandica, vol. 59, no. 1, pp. 57-69; URL: https://dx.doi.org/10.1111/j.
1467-9574.2005.00279.x; DOI: 10.1111/j.1467-9574.2005.00279.x.
Metadata: ID: D93 | shortcut: encm | author: tomfaulkenberry | date: 2020-09-02, 12:00.
¬(m_1 ∧ m_2)    (1)
Then, the Bayes factor in favor of m1 and against m2 is the ratio of the model evidences (→ Definition
I/5.1.9) of m1 and m2 :
BF_{12} = \frac{p(y|m_1)}{p(y|m_2)} .    (2)
The log Bayes factor is given by the logarithm of the Bayes factor:
LBF_{12} = \log BF_{12} = \log \frac{p(y|m_1)}{p(y|m_2)} .    (3)
Sources:
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eq. 18; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
Metadata: ID: D84 | shortcut: lbf | author: JoramSoch | date: 2020-07-22, 07:02.
LBF_{12} = \log \frac{p(y|m_1)}{p(y|m_2)} .    (2)
Proof: The Bayes factor (→ Definition IV/3.3.1) is defined as the posterior (→ Definition I/5.1.7) odds ratio (→ Definition “odds”) when both models (→ Definition I/5.1.1) are equally likely a priori (→ Definition I/5.1.3):
BF_{12} = \frac{p(m_1|y)}{p(m_2|y)} .    (3)
Plugging in the posterior odds ratio according to Bayes’ rule (→ Proof I/5.3.2), we have
BF_{12} = \frac{p(y|m_1)}{p(y|m_2)} \cdot \frac{p(m_1)}{p(m_2)} .    (4)
When both models are equally likely a priori, the prior (→ Definition I/5.1.3) odds ratio (→ Definition “odds”) is one, such that
BF_{12} = \frac{p(y|m_1)}{p(y|m_2)} .    (5)
Equation (2) follows by logarithmizing both sides of (5).
Sources:
• original work
Metadata: ID: P137 | shortcut: lbf-der | author: JoramSoch | date: 2020-07-22, 07:27.
Proof: The Bayes factor (→ Definition IV/3.3.1) is defined as the ratio of the model evidences (→
Definition I/5.1.9) of m1 and m2
BF_{12} = \frac{p(y|m_1)}{p(y|m_2)}    (2)
and the log Bayes factor (→ Definition IV/3.3.6) is defined as the logarithm of the Bayes factor
LBF_{12} = \log BF_{12} = \log \frac{p(y|m_1)}{p(y|m_2)} .    (3)
With the definition of the log model evidence (→ Definition IV/3.1.3), LME(m) = \log p(y|m), equation (3) becomes

LBF_{12} = \log p(y|m_1) - \log p(y|m_2) = \mathrm{LME}(m_1) - \mathrm{LME}(m_2) .    (4)
Sources:
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eq. 18; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
Metadata: ID: P64 | shortcut: lbf-lme | author: JoramSoch | date: 2020-02-27, 20:51.
Sources:
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eq. 23; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
Metadata: ID: D87 | shortcut: pmp | author: JoramSoch | date: 2020-07-28, 03:30.
3.4.2 Derivation
Theorem: Let there be a set of generative models (→ Definition I/5.1.1) m1 , . . . , mM with model
evidences (→ Definition I/5.1.9) p(y|m1 ), . . . , p(y|mM ) and prior probabilities (→ Definition I/5.1.3)
p(m1 ), . . . , p(mM ). Then, the posterior probability (→ Definition IV/3.4.1) of model mi is given by
p(m_i|y) = \frac{p(y|m_i) \, p(m_i)}{\sum_{j=1}^{M} p(y|m_j) \, p(m_j)} , \quad i = 1, \ldots, M .    (1)
Proof: From Bayes’ theorem (→ Proof I/5.3.1), the posterior model probability (→ Definition
IV/3.4.1) of the i-th model can be derived as
p(m_i|y) = \frac{p(y|m_i) \, p(m_i)}{p(y)} .    (2)
Using the law of marginal probability (→ Definition I/1.3.3), the denominator can be rewritten, such
that
p(m_i|y) = \frac{p(y|m_i) \, p(m_i)}{\sum_{j=1}^{M} p(y, m_j)} .    (3)
Finally, using the law of conditional probability (→ Definition I/1.3.4), we have
p(m_i|y) = \frac{p(y|m_i) \, p(m_i)}{\sum_{j=1}^{M} p(y|m_j) \, p(m_j)} .    (4)
Sources:
• original work
Metadata: ID: P139 | shortcut: pmp-der | author: JoramSoch | date: 2020-07-28, 03:58.
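A minimal numeric sketch of equation (1), with made-up model evidences and prior model probabilities:

```python
import numpy as np

# Hypothetical model evidences p(y|m_j) and prior model probabilities
# p(m_j) for M = 3 models (illustrative values)
evidence = np.array([0.8, 0.5, 0.2])
prior = np.array([0.5, 0.3, 0.2])

# Eq. (1): normalize the evidence-times-prior products over all models
posterior = evidence * prior / np.sum(evidence * prior)
```

The posterior model probabilities sum to one by construction, since the denominator is the sum of the numerators.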
p(m_i|y) = \frac{BF_{i,0} \cdot \alpha_i}{\sum_{j=1}^{M} BF_{j,0} \cdot \alpha_j}    (1)

where BF_{i,0} is the Bayes factor (→ Definition IV/3.3.1) comparing model m_i with m_0 and \alpha_i is the prior (→ Definition I/5.1.3) odds ratio (→ Definition “odds”) of model m_i against m_0.

Proof: Consider the Bayes factor of m_i against m_0

BF_{i,0} = \frac{p(y|m_i)}{p(y|m_0)}    (2)
and prior odds ratio of mi against m0
\alpha_i = \frac{p(m_i)}{p(m_0)} .    (3)
The posterior model probability (→ Proof IV/3.4.2) of mi is given by
p(m_i|y) = \frac{p(y|m_i) \cdot p(m_i)}{\sum_{j=1}^{M} p(y|m_j) \cdot p(m_j)} .    (4)
Dividing both the numerator and the denominator by p(y|m_0) \cdot p(m_0) then gives

p(m_i|y) = \frac{BF_{i,0} \cdot \alpha_i}{\sum_{j=1}^{M} BF_{j,0} \cdot \alpha_j} .    (6)
Sources:
• Hoeting JA, Madigan D, Raftery AE, Volinsky CT (1999): “Bayesian Model Averaging: A Tu-
torial”; in: Statistical Science, vol. 14, no. 4, pp. 382–417, eq. 9; URL: https://projecteuclid.org/
euclid.ss/1009212519; DOI: 10.1214/ss/1009212519.
Metadata: ID: P74 | shortcut: pmp-bf | author: JoramSoch | date: 2020-03-03, 13:13.
p(m_1|y) = \frac{\exp(\mathrm{LBF}_{12})}{\exp(\mathrm{LBF}_{12}) + 1} .    (1)
Proof: From Bayes’ rule (→ Proof I/5.3.2), the posterior odds ratio (→ Definition “odds”) is
\frac{p(m_1|y)}{p(m_2|y)} = BF_{12} .    (4)
Because the two posterior model probabilities (→ Definition IV/3.4.1) add up to 1, we have
\frac{p(m_1|y)}{1 - p(m_1|y)} = BF_{12} .    (5)
Now rearranging for the posterior probability (→ Definition IV/3.4.1), this gives
p(m_1|y) = \frac{BF_{12}}{BF_{12} + 1} .    (6)
Because the log Bayes factor is the logarithm of the Bayes factor (→ Definition IV/3.3.6), we finally
have
p(m_1|y) = \frac{\exp(\mathrm{LBF}_{12})}{\exp(\mathrm{LBF}_{12}) + 1} .    (7)
Sources:
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eq. 21; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
Metadata: ID: P73 | shortcut: pmp-lbf | author: JoramSoch | date: 2020-03-03, 12:27.
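Equation (1) is a logistic (sigmoid) function of the log Bayes factor. A sketch with a hypothetical LBF value:

```python
from math import exp

# Hypothetical log Bayes factor in favor of m1 (illustrative value)
lbf12 = 1.5
bf12 = exp(lbf12)

# Posterior probability of m1 under equal prior model probabilities:
# eq. (1) in terms of the log Bayes factor, eq. (6) in terms of BF12
p_m1_from_lbf = exp(lbf12) / (exp(lbf12) + 1.0)
p_m1_from_bf = bf12 / (bf12 + 1.0)
```

An LBF of zero maps to a posterior probability of 0.5, and large positive or negative LBFs saturate towards 1 or 0, respectively.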
p(m_i|y) = \frac{\exp[\mathrm{LME}(m_i)] \, p(m_i)}{\sum_{j=1}^{M} \exp[\mathrm{LME}(m_j)] \, p(m_j)} , \quad i = 1, \ldots, M .    (1)

Proof: The posterior model probability (→ Proof IV/3.4.2) of the i-th model is given by

p(m_i|y) = \frac{p(y|m_i) \, p(m_i)}{\sum_{j=1}^{M} p(y|m_j) \, p(m_j)} .    (2)

The definition of the log model evidence (→ Definition IV/3.1.3), LME(m) = \log p(y|m), implies p(y|m) = \exp[\mathrm{LME}(m)], such that

p(m_i|y) = \frac{\exp[\mathrm{LME}(m_i)] \, p(m_i)}{\sum_{j=1}^{M} \exp[\mathrm{LME}(m_j)] \, p(m_j)} .    (5)
Sources:
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eq. 23; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
Metadata: ID: P66 | shortcut: pmp-lme | author: JoramSoch | date: 2020-02-27, 21:33.
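Equation (1) is a softmax over log model evidences. Since LMEs are typically large negative numbers, the shared maximum should be subtracted before exponentiating; it cancels between numerator and denominator. A sketch with illustrative values:

```python
import numpy as np

# Hypothetical log model evidences and prior model probabilities
# (illustrative values; real LMEs are often large negative numbers
# whose direct exponentials underflow)
lme = np.array([-1000.0, -1002.0, -1010.0])
prior = np.array([1.0, 1.0, 1.0]) / 3.0

# Eq. (1) evaluated stably: subtracting max(LME) inside the exponentials
# cancels between numerator and denominator of the ratio
w = np.exp(lme - lme.max()) * prior
pmp = w / w.sum()
```

Computing np.exp(-1000.0) directly would underflow to zero and make the ratio undefined; the shifted version is exact up to floating-point rounding.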
p(\theta|y) = \sum_{i=1}^{M} p(\theta|y, m_i) \cdot p(m_i|y) .    (1)
Sources:
• Hoeting JA, Madigan D, Raftery AE, Volinsky CT (1999): “Bayesian Model Averaging: A Tu-
torial”; in: Statistical Science, vol. 14, no. 4, pp. 382–417, eq. 1; URL: https://projecteuclid.org/
euclid.ss/1009212519; DOI: 10.1214/ss/1009212519.
Metadata: ID: D89 | shortcut: bma | author: JoramSoch | date: 2020-08-03, 21:34.
3.5.2 Derivation
Theorem: Let m1 , . . . , mM be M statistical models (→ Definition I/5.1.4) with posterior model
probabilities (→ Definition IV/3.4.1) p(m1 |y), . . . , p(mM |y) and posterior distributions (→ Definition
I/5.1.7) p(θ|y, m1 ), . . . , p(θ|y, mM ). Then, the marginal (→ Definition I/1.5.3) posterior (→ Definition
I/5.1.7) density (→ Definition I/1.6.6), conditional (→ Definition I/1.3.4) on the measured data y,
but unconditional (→ Definition I/1.3.3) on the modelling approach m, is given by:
p(\theta|y) = \sum_{i=1}^{M} p(\theta|y, m_i) \cdot p(m_i|y) .    (1)
Proof: Using the law of marginal probability (→ Definition I/1.3.3), the probability distribution of
the shared parameters θ conditional (→ Definition I/1.3.4) on the measured data y can be obtained
by marginalizing (→ Definition I/1.3.3) over the discrete random variable (→ Definition I/1.2.2)
model m:
p(\theta|y) = \sum_{i=1}^{M} p(\theta, m_i|y) .    (2)
Using the law of conditional probability (→ Definition I/1.3.4), the summand can be expanded to give

p(\theta|y) = \sum_{i=1}^{M} p(\theta|y, m_i) \cdot p(m_i|y)    (3)

where p(\theta|y, m_i) is the posterior distribution (→ Definition I/5.1.7) of the i-th model and p(m_i|y) is the posterior probability (→ Definition IV/3.4.1) of the i-th model.
Sources:
• original work
Metadata: ID: P143 | shortcut: bma-der | author: JoramSoch | date: 2020-08-03, 22:05.
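A sketch of equation (1) with hypothetical numbers: since the BMA posterior is a mixture of the per-model posteriors, any expectation under it, such as the posterior mean, is the PMP-weighted average of the per-model expectations.

```python
import numpy as np

# Hypothetical two-model setting (illustrative values): posterior model
# probabilities p(m_i|y) and per-model posterior means E[theta|y,m_i]
pmp = np.array([0.7, 0.3])
post_means = np.array([1.0, 2.5])

# Eq. (1) makes the BMA posterior a mixture of the per-model posteriors,
# so by linearity the BMA posterior mean is the PMP-weighted average
bma_mean = float(np.sum(pmp * post_means))
```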
p(\theta|y) = \sum_{i=1}^{M} p(\theta|m_i, y) \cdot \frac{\exp[\mathrm{LME}(m_i)] \, p(m_i)}{\sum_{j=1}^{M} \exp[\mathrm{LME}(m_j)] \, p(m_j)} ,    (1)
Proof: According to the law of marginal probability (→ Definition I/1.3.3), the probability of the
shared parameters θ conditional on the measured data y can be obtained (→ Proof IV/3.5.2) by
marginalizing over the discrete variable model m:
p(\theta|y) = \sum_{i=1}^{M} p(\theta|m_i, y) \cdot p(m_i|y) ,    (2)
where p(mi |y) is the posterior probability (→ Definition IV/3.4.1) of the i-th model. One can express
posterior model probabilities in terms of log model evidences (→ Proof IV/3.4.5) as
p(m_i|y) = \frac{\exp[\mathrm{LME}(m_i)] \, p(m_i)}{\sum_{j=1}^{M} \exp[\mathrm{LME}(m_j)] \, p(m_j)} .    (3)

Plugging (3) into (2) yields equation (1).
Sources:
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eq. 25; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
Metadata: ID: P67 | shortcut: bma-lme | author: JoramSoch | date: 2020-02-27, 21:58.
Chapter V
Appendix
1 Proof by Number
P216 | ugkv-lbfmean | Expectation of the log Bayes factor for the univariate Gaussian with known variance | JoramSoch | 2021-03-24 | 351
P217 | ugkv-cvlme | Cross-validated log model evidence for the univariate Gaussian with known variance | JoramSoch | 2021-03-24 | 353
P218 | ugkv-cvlbf | Cross-validated log Bayes factor for the univariate Gaussian with known variance | JoramSoch | 2021-03-24 | 355
P219 | ugkv-cvlbfmean | Expectation of the cross-validated log Bayes factor for the univariate Gaussian with known variance | JoramSoch | 2021-03-24 | 356
P220 | cdf-pit | Probability integral transform using cumulative distribution function | JoramSoch | 2021-04-07 | 34
P221 | cdf-itm | Inverse transformation method using cumulative distribution function | JoramSoch | 2021-04-07 | 35
P222 | cdf-dt | Distributional transformation using cumulative distribution function | JoramSoch | 2021-04-07 | 35
P223 | ug-mle | Maximum likelihood estimation for the univariate Gaussian | JoramSoch | 2021-04-16 | 322
P224 | poissexp-mle | Maximum likelihood estimation for the Poisson distribution with exposure values | JoramSoch | 2021-04-16 | 460
P225 | poiss-prior | Conjugate prior distribution for Poisson-distributed data | JoramSoch | 2020-04-21 | 455
P226 | poiss-post | Posterior distribution for Poisson-distributed data | JoramSoch | 2020-04-21 | 456
P227 | poiss-lme | Log model evidence for Poisson-distributed data | JoramSoch | 2020-04-21 | 458
P228 | beta-mean | Mean of the beta distribution | JoramSoch | 2021-04-29 | 256
P229 | beta-var | Variance of the beta distribution | JoramSoch | 2021-04-29 | 257
P230 | poiss-var | Variance of the Poisson distribution | JoramSoch | 2021-04-29 | 166
P231 | mvt-f | Relationship between multivariate t-distribution and F-distribution | JoramSoch | 2021-05-04 | 280
P232 | nst-t | Relationship between non-standardized t-distribution and t-distribution | JoramSoch | 2021-05-11 | 213
P268 | iglm-blue | Best linear unbiased estimator for the inverse general linear model | JoramSoch | 2021-10-21 | 434
P269 | cfm-para | Parameters of the corresponding forward model | JoramSoch | 2021-10-21 | 436
P270 | cfm-exist | Existence of a corresponding forward model | JoramSoch | 2021-10-21 | 437
P271 | slr-ols | Ordinary least squares for simple linear regression | JoramSoch | 2021-10-27 | 360
P272 | slr-olsmean | Expectation of parameter estimates for simple linear regression | JoramSoch | 2021-10-27 | 364
P273 | slr-olsvar | Variance of parameter estimates for simple linear regression | JoramSoch | 2021-10-27 | 366
P274 | slr-meancent | Effects of mean-centering on parameter estimates for simple linear regression | JoramSoch | 2021-10-27 | 372
P275 | slr-comp | The regression line goes through the center of mass point | JoramSoch | 2021-10-27 | 374
P276 | slr-ressum | The sum of residuals is zero in simple linear regression | JoramSoch | 2021-10-27 | 388
P277 | slr-rescorr | The residuals and the covariate are uncorrelated in simple linear regression | JoramSoch | 2021-10-27 | 389
P278 | slr-resvar | Relationship between residual variance and sample variance in simple linear regression | JoramSoch | 2021-10-27 | 390
P279 | slr-corr | Relationship between correlation coefficient and slope estimate in simple linear regression | JoramSoch | 2021-10-27 | 392
P280 | slr-rsq | Relationship between coefficient of determination and correlation coefficient in simple linear regression | JoramSoch | 2021-10-27 | 393
P281 | slr-mlr | Simple linear regression is a special case of multiple linear regression | JoramSoch | 2021-11-09 | 359
P282 | slr-olsdist | Distribution of parameter estimates for simple linear regression | JoramSoch | 2021-11-09 | 369
P283 | slr-proj | Projection of a data point to the regression line | JoramSoch | 2021-11-09 | 375
P284 | slr-sss | Sums of squares for simple linear regression | JoramSoch | 2021-11-09 | 376
P285 | slr-mat | Transformation matrices for simple linear regression | JoramSoch | 2021-11-09 | 378
P286 | slr-wls | Weighted least squares for simple linear regression | JoramSoch | 2021-11-16 | 381
P287 | slr-mle | Maximum likelihood estimation for simple linear regression | JoramSoch | 2021-11-16 | 384
P288 | slr-ols2 | Ordinary least squares for simple linear regression | JoramSoch | 2021-11-16 | 362
P289 | slr-wls2 | Weighted least squares for simple linear regression | JoramSoch | 2021-11-16 | 383
P290 | slr-mle2 | Maximum likelihood estimation for simple linear regression | JoramSoch | 2021-11-16 | 387
P291 | mean-tot | Law of total expectation | JoramSoch | 2021-11-26 | 54
P292 | var-tot | Law of total variance | JoramSoch | 2021-11-26 | 64
P293 | cov-tot | Law of total covariance | JoramSoch | 2021-11-26 | 68
P294 | dir-kl | Kullback-Leibler divergence for the Dirichlet distribution | JoramSoch | 2021-12-02 | 297
P295 | wish-kl | Kullback-Leibler divergence for the Wishart distribution | JoramSoch | 2021-12-02 | 313
P296 | matn-kl | Kullback-Leibler divergence for the matrix-normal distribution | JoramSoch | 2021-12-02 | 306
P297 | matn-samp | Sampling from the matrix-normal distribution | JoramSoch | 2021-12-07 | 312
P298 | mean-tr | Expected value of the trace of a matrix | JoramSoch | 2021-12-07 | 51
P299 | corr-z | Correlation coefficient in terms of standard scores | JoramSoch | 2021-12-14 | 80
P300 | corr-range | Correlation always falls between -1 and +1 | JoramSoch | 2021-12-14 | 78
P301 | bern-var | Variance of the Bernoulli distribution | JoramSoch | 2022-01-20 | 150
P302 | bin-var | Variance of the binomial distribution | JoramSoch | 2022-01-20 | 155
P320 | slr-olscorr | Parameter estimates for simple linear regression are uncorrelated after mean-centering | JoramSoch | 2022-04-14 | 371
P321 | norm-probstd | Probability of normal random variable being within standard deviations from its mean | JoramSoch | 2022-05-08 | 197
P322 | mult-cov | Covariance matrix of the multinomial distribution | adkipnis | 2022-05-11 | 172
P323 | nw-pdf | Probability density function of the normal-Wishart distribution | JoramSoch | 2022-05-14 | 316
P324 | ng-nw | Normal-gamma distribution is a special case of normal-Wishart distribution | JoramSoch | 2022-05-20 | 282
P325 | lognorm-cdf | Cumulative distribution function of the log-normal distribution | majapavlo | 2022-06-29 | 237
P326 | lognorm-qf | Quantile function of the log-normal distribution | majapavlo | 2022-07-09 | 239
P327 | nw-mean | Mean of the normal-Wishart distribution | JoramSoch | 2022-07-14 | 317
P328 | gam-wish | Gamma distribution is a special case of Wishart distribution | JoramSoch | 2022-07-14 | 220
P329 | mlr-glm | Multiple linear regression is a special case of the general linear model | JoramSoch | 2022-07-21 | 395
P330 | mvn-matn | Multivariate normal distribution is a special case of matrix-normal distribution | JoramSoch | 2022-07-31 | 265
P331 | norm-mvn | Normal distribution is a special case of multivariate normal distribution | JoramSoch | 2022-08-19 | 189
P332 | t-mvt | t-distribution is a special case of multivariate t-distribution | JoramSoch | 2022-08-25 | 214
P333 | mvt-pdf | Probability density function of the multivariate t-distribution | JoramSoch | 2022-09-02 | 279
P334 | bern-ent | Entropy of the Bernoulli distribution | JoramSoch | 2022-09-02 | 152
P335 | bin-ent | Entropy of the binomial distribution | JoramSoch | 2022-09-02 | 157
P336 | cat-ent | Entropy of the categorical distribution | JoramSoch | 2022-09-09 | 170
2 Definition by Number
D115 | vblme | Variational Bayesian log model evidence | JoramSoch | 2020-11-25 | 500
D116 | prior-flat | Flat, hard and soft prior distribution | JoramSoch | 2020-12-02 | 138
D117 | prior-uni | Uniform and non-uniform prior distribution | JoramSoch | 2020-12-02 | 138
D118 | prior-inf | Informative and non-informative prior distribution | JoramSoch | 2020-12-02 | 139
D119 | prior-emp | Empirical and theoretical prior distribution | JoramSoch | 2020-12-02 | 139
D120 | prior-conj | Conjugate and non-conjugate prior distribution | JoramSoch | 2020-12-02 | 139
D121 | prior-maxent | Maximum entropy prior distribution | JoramSoch | 2020-12-02 | 140
D122 | prior-eb | Empirical Bayes prior distribution | JoramSoch | 2020-12-02 | 140
D123 | prior-ref | Reference prior distribution | JoramSoch | 2020-12-02 | 141
D124 | ug | Univariate Gaussian | JoramSoch | 2021-03-03 | 322
D125 | h0 | Null hypothesis | JoramSoch | 2021-03-12 | 129
D126 | h1 | Alternative hypothesis | JoramSoch | 2021-03-12 | 129
D127 | hyp | Statistical hypothesis | JoramSoch | 2021-03-19 | 126
D128 | hyp-simp | Simple and composite hypothesis | JoramSoch | 2021-03-19 | 127
D129 | hyp-point | Point and set hypothesis | JoramSoch | 2021-03-19 | 127
D130 | test | Statistical hypothesis test | JoramSoch | 2021-03-19 | 128
D131 | tstat | Test statistic | JoramSoch | 2021-03-19 | 130
D132 | size | Size of a statistical test | JoramSoch | 2021-03-19 | 130
D133 | alpha | Significance level | JoramSoch | 2021-03-19 | 131
D134 | cval | Critical value | JoramSoch | 2021-03-19 | 131
D135 | pval | p-value | JoramSoch | 2021-03-19 | 132
D136 | ugkv | Univariate Gaussian with known variance | JoramSoch | 2021-03-23 | 337
D137 | power | Power of a statistical test | JoramSoch | 2021-03-31 | 131
D138 | hyp-tail | One-tailed and two-tailed hypothesis | JoramSoch | 2021-03-31 | 127
D139 | test-tail | One-tailed and two-tailed test | JoramSoch | 2021-03-31 | 129
3 Proof by Topic
A
• Accuracy and complexity for the univariate Gaussian, 336
• Accuracy and complexity for the univariate Gaussian with known variance, 348
• Addition law of probability, 14
• Addition of the differential entropy upon multiplication with a constant, 98
• Addition of the differential entropy upon multiplication with invertible matrix, 99
• Additivity of the Kullback-Leibler divergence for independent distributions, 115
• Additivity of the variance for independent random variables, 63
• Akaike information criterion for multiple linear regression, 410
B
• Bayes’ rule, 142
• Bayes’ theorem, 141
• Bayesian information criterion for multiple linear regression, 411
• Bayesian model averaging in terms of log model evidences, 516
• Best linear unbiased estimator for the inverse general linear model, 434
C
• Characteristic function of a function of a random variable, 38
• Chi-squared distribution is a special case of gamma distribution, 246
• Concavity of the Shannon entropy, 91
• Conditional distributions of the multivariate normal distribution, 273
• Conditional distributions of the normal-gamma distribution, 293
• Conjugate prior distribution for Bayesian linear regression, 413
• Conjugate prior distribution for binomial observations, 445
• Conjugate prior distribution for multinomial observations, 449
• Conjugate prior distribution for multivariate Bayesian linear regression, 438
• Conjugate prior distribution for Poisson-distributed data, 455
• Conjugate prior distribution for the Poisson distribution with exposure values, 462
• Conjugate prior distribution for the univariate Gaussian, 328
• Conjugate prior distribution for the univariate Gaussian with known variance, 342
• Construction of confidence intervals using Wilks’ theorem, 122
• Construction of unbiased estimator for variance, 482
• Convexity of the cross-entropy, 93
• Convexity of the Kullback-Leibler divergence, 115
• Corrected Akaike information criterion converges to uncorrected Akaike information criterion when infinite data are available, 490
• Corrected Akaike information criterion for multiple linear regression, 412
• Corrected Akaike information criterion in terms of maximum log-likelihood, 490
• Correlation always falls between -1 and +1, 78
• Correlation coefficient in terms of standard scores, 80
• Covariance and variance of the normal-gamma distribution, 286
• Covariance matrices of the matrix-normal distribution, 304
• Covariance matrix of the categorical distribution, 169
• Covariance matrix of the multinomial distribution, 172
• Covariance matrix of the multivariate normal distribution, 267
D
• Derivation of Bayesian model averaging, 515
• Derivation of R² and adjusted R², 484
• Derivation of the Bayesian information criterion, 491
• Derivation of the family evidence, 501
• Derivation of the log Bayes factor, 510
• Derivation of the log family evidence, 502
• Derivation of the log model evidence, 496
• Derivation of the model evidence, 495
• Derivation of the posterior model probability, 511
• Deviance for multiple linear regression, 409
• Deviance information criterion for multiple linear regression, 419
• Differential entropy can be negative, 96
• Differential entropy for the matrix-normal distribution, 305
• Differential entropy of the gamma distribution, 228
• Differential entropy of the multivariate normal distribution, 269
• Differential entropy of the normal distribution, 207
• Differential entropy of the normal-gamma distribution, 287
• Distribution of parameter estimates for simple linear regression, 369
• Distribution of the inverse general linear model, 433
• Distribution of the transformed general linear model, 431
• Distributional transformation using cumulative distribution function, 35
E
• Effects of mean-centering on parameter estimates for simple linear regression, 372
• Encompassing prior method for computing Bayes factors, 507
• Entropy of the Bernoulli distribution, 152
• Entropy of the binomial distribution, 157
F
• First central moment is zero, 88
• First raw moment is mean, 87
• Full width at half maximum for the normal distribution, 203
G
• Gamma distribution is a special case of Wishart distribution, 220
• Gaussian integral, 189
• Gibbs’ inequality, 94
I
• Inflection points of the probability density function of the normal distribution, 205
• Invariance of the covariance matrix under addition of constant vector, 72
• Invariance of the differential entropy under addition of a constant, 97
• Invariance of the Kullback-Leibler divergence under parameter transformation, 116
• Invariance of the variance under addition of a constant, 61
• Inverse transformation method using cumulative distribution function, 35
J
• Joint likelihood is the product of likelihood function and prior density, 135
K
• Kullback-Leibler divergence for the Dirichlet distribution, 297
• Kullback-Leibler divergence for the gamma distribution, 229
• Kullback-Leibler divergence for the matrix-normal distribution, 306
• Kullback-Leibler divergence for the multivariate normal distribution, 270
• Kullback-Leibler divergence for the normal distribution, 208
• Kullback-Leibler divergence for the normal-gamma distribution, 289
L
• Law of the unconscious statistician, 55
• Law of total covariance, 68
• Law of total expectation, 54
• Law of total probability, 15
• Law of total variance, 64
• Linear combination of independent normal random variables, 211
• Linear transformation theorem for the matrix-normal distribution, 308
• Linear transformation theorem for the moment-generating function, 40
• Linear transformation theorem for the multivariate normal distribution, 272
• Linearity of the expected value, 46
• Log Bayes factor for the univariate Gaussian with known variance, 350
• Log Bayes factor in terms of log model evidences, 510
• Log family evidences in terms of log model evidences, 503
• Log model evidence for Bayesian linear regression, 417
• Log model evidence for binomial observations, 447
• Log model evidence for multinomial observations, 451
• Log model evidence for multivariate Bayesian linear regression, 442
• Log model evidence for Poisson-distributed data, 458
• Log model evidence for the Poisson distribution with exposure values, 465
• Log model evidence for the univariate Gaussian, 333
• Log model evidence for the univariate Gaussian with known variance, 347
• Log model evidence in terms of prior and posterior distribution, 497
• Log sum inequality, 95
• Log-odds and probability in logistic regression, 477
• Logarithmic expectation of the gamma distribution, 225
M
• Marginal distribution of a conditional binomial distribution, 158
• Marginal distributions for the matrix-normal distribution, 309
• Marginal distributions of the multivariate normal distribution, 273
• Marginal distributions of the normal-gamma distribution, 290
• Marginal likelihood is a definite integral of joint likelihood, 137
• Maximum likelihood estimation can result in biased estimates, 125
• Maximum likelihood estimation for Dirichlet-distributed data, 470
• Maximum likelihood estimation for multiple linear regression, 405
• Maximum likelihood estimation for Poisson-distributed data, 453
• Maximum likelihood estimation for simple linear regression, 384
• Maximum likelihood estimation for simple linear regression, 387
• Maximum likelihood estimation for the general linear model, 428
• Maximum likelihood estimation for the Poisson distribution with exposure values, 460
• Maximum likelihood estimation for the univariate Gaussian, 322
• Maximum likelihood estimation for the univariate Gaussian with known variance, 338
• Maximum likelihood estimator of variance is biased, 480
• Maximum log-likelihood for multiple linear regression, 407
• Mean of the Bernoulli distribution, 149
550 CHAPTER V. APPENDIX
N
• Necessary and sufficient condition for independence of multivariate normal random variables, 277
• Non-invariance of the differential entropy under change of variables, 101
• (Non-)Multiplicativity of the expected value, 49
• Non-negativity of the expected value, 46
• Non-negativity of the Kullback-Leibler divergence, 112
• Non-negativity of the Shannon entropy, 90
• Non-negativity of the variance, 59
• Non-symmetry of the Kullback-Leibler divergence, 113
• Normal distribution is a special case of multivariate normal distribution, 189
O
• One-sample t-test for independent observations, 324
• One-sample z-test for independent observations, 339
• Ordinary least squares for multiple linear regression, 396
• Ordinary least squares for simple linear regression, 360
• Ordinary least squares for simple linear regression, 362
• Ordinary least squares for the general linear model, 426
P
• Paired t-test for dependent observations, 327
• Paired z-test for dependent observations, 341
• Parameter estimates for simple linear regression are uncorrelated after mean-centering, 371
• Parameters of the corresponding forward model, 436
• Partition of a covariance matrix into expected values, 70
• Partition of covariance into expected values, 66
• Partition of sums of squares in ordinary least squares, 398
• Partition of the log model evidence into accuracy and complexity, 497
• Partition of the mean squared error into bias and variance, 120
• Partition of variance into expected values, 58
• Positive semi-definiteness of the covariance matrix, 71
• Posterior credibility region against the omnibus null hypothesis for Bayesian linear regression, 423
• Posterior density is proportional to joint likelihood, 136
• Posterior distribution for Bayesian linear regression, 414
• Posterior distribution for binomial observations, 446
• Posterior distribution for multinomial observations, 450
• Posterior distribution for multivariate Bayesian linear regression, 440
• Posterior distribution for Poisson-distributed data, 456
• Posterior distribution for the Poisson distribution with exposure values, 463
• Posterior distribution for the univariate Gaussian, 330
• Posterior distribution for the univariate Gaussian with known variance, 344
• Posterior model probabilities in terms of Bayes factors, 512
• Posterior model probabilities in terms of log model evidences, 514
• Posterior model probability in terms of log Bayes factor, 513
• Posterior probability of the alternative hypothesis for Bayesian linear regression, 422
• Probability and log-odds in logistic regression, 476
• Probability density function is first derivative of cumulative distribution function, 29
• Probability density function of a linear function of a continuous random vector, 28
• Probability density function of a strictly decreasing function of a continuous random variable, 25
• Probability density function of a strictly increasing function of a continuous random variable, 24
• Probability density function of a sum of independent discrete random variables, 23
• Probability density function of an invertible function of a continuous random vector, 26
• Probability density function of the beta distribution, 254
• Probability density function of the chi-squared distribution, 246
• Probability density function of the continuous uniform distribution, 176
Q
• Quantile function is inverse of strictly monotonically increasing cumulative distribution function, 37
• Quantile function of the continuous uniform distribution, 178
• Quantile function of the discrete uniform distribution, 148
• Quantile function of the exponential distribution, 232
• Quantile function of the gamma distribution, 222
• Quantile function of the log-normal distribution, 239
• Quantile function of the normal distribution, 198
R
• Range of probability, 13
• Range of the variance of the Bernoulli distribution, 151
• Range of the variance of the binomial distribution, 156
• Relation of continuous Kullback-Leibler divergence to differential entropy, 118
• Relation of continuous mutual information to joint and conditional differential entropy, 110
• Relation of continuous mutual information to marginal and conditional differential entropy, 108
• Relation of continuous mutual information to marginal and joint differential entropy, 109
• Relation of discrete Kullback-Leibler divergence to Shannon entropy, 117
• Relation of mutual information to joint and conditional entropy, 106
• Relation of mutual information to marginal and conditional entropy, 104
• Relation of mutual information to marginal and joint entropy, 105
• Relationship between chi-squared distribution and beta distribution, 251
• Relationship between coefficient of determination and correlation coefficient in simple linear regression, 393
• Relationship between correlation coefficient and slope estimate in simple linear regression, 392
• Relationship between covariance and correlation, 68
• Relationship between covariance matrix and correlation matrix, 75
• Relationship between gamma distribution and standard gamma distribution, 218
• Relationship between gamma distribution and standard gamma distribution, 219
• Relationship between multivariate t-distribution and F-distribution, 280
• Relationship between non-standardized t-distribution and t-distribution, 213
• Relationship between normal distribution and chi-squared distribution, 185
• Relationship between normal distribution and standard normal distribution, 182
• Relationship between normal distribution and standard normal distribution, 183
• Relationship between normal distribution and standard normal distribution, 184
• Relationship between normal distribution and t-distribution, 187
• Relationship between precision matrix and correlation matrix, 77
• Relationship between R² and maximum log-likelihood, 485
• Relationship between residual variance and sample variance in simple linear regression, 390
• Relationship between second raw moment, variance and mean, 87
• Relationship between signal-to-noise ratio and R², 487
S
• Sampling from the matrix-normal distribution, 312
• Sampling from the normal-gamma distribution, 295
• Savage-Dickey density ratio for computing Bayes factors, 506
• Scaling of the covariance matrix upon multiplication with constant matrix, 73
• Scaling of the variance upon multiplication with a constant, 61
• Second central moment is variance, 88
• Self-covariance equals variance, 67
• Simple linear regression is a special case of multiple linear regression, 359
• Square of expectation of product is less than or equal to product of expectation of squares, 53
• Sums of squares for simple linear regression, 376
• Symmetry of the covariance, 66
• Symmetry of the covariance matrix, 71
T
• t-distribution is a special case of multivariate t-distribution, 214
• The p-value follows a uniform distribution under the null hypothesis, 132
• The regression line goes through the center of mass point, 374
• The residuals and the covariate are uncorrelated in simple linear regression, 389
• The sum of residuals is zero in simple linear regression, 388
• Transformation matrices for ordinary least squares, 401
• Transformation matrices for simple linear regression, 378
• Transitivity of Bayes factors, 505
• Transposition of a matrix-normal random variable, 308
• Two-sample t-test for independent observations, 325
• Two-sample z-test for independent observations, 340
V
• Value of the probability-generating function for argument one, 43
• Value of the probability-generating function for argument zero, 42
• Variance of constant is zero, 60
• Variance of parameter estimates for simple linear regression, 366
• Variance of the Bernoulli distribution, 150
• Variance of the beta distribution, 257
• Variance of the binomial distribution, 155
• Variance of the gamma distribution, 224
• Variance of the linear combination of two random variables, 63
• Variance of the log-normal distribution, 243
• Variance of the normal distribution, 202
• Variance of the Poisson distribution, 166
• Variance of the sum of two random variables, 62
• Variance of the Wald distribution, 263
W
• Weighted least squares for multiple linear regression, 403
• Weighted least squares for multiple linear regression, 404
• Weighted least squares for simple linear regression, 381
• Weighted least squares for simple linear regression, 383
• Weighted least squares for the general linear model, 427
4 Definition by Topic
A
• Akaike information criterion, 489
• Alternative hypothesis, 129
B
• Bayes factor, 504
• Bayesian information criterion, 491
• Bayesian model averaging, 515
• Bernoulli distribution, 149
• Beta distribution, 251
• Beta-binomial data, 473
• Beta-binomial distribution, 160
• Beta-distributed data, 468
• Binomial distribution, 153
• Binomial observations, 445
C
• Categorical distribution, 168
• Central moment, 88
• Characteristic function, 37
• Chi-squared distribution, 245
• Coefficient of determination, 483
• Conditional differential entropy, 102
• Conditional entropy, 92
• Conditional independence, 8
• Conditional probability distribution, 19
• Confidence interval, 121
• Conjugate and non-conjugate prior distribution, 139
• Constant, 4
• Continuous uniform distribution, 176
• Corrected Akaike information criterion, 489
• Correlation, 78
• Correlation matrix, 80
• Corresponding forward model, 436
• Covariance, 65
• Covariance matrix, 69
• Critical value, 131
• Cross-covariance matrix, 74
• Cross-entropy, 93
• Cross-validated log model evidence, 499
• Cumulant-generating function, 44
• Cumulative distribution function, 29
D
• Deviance, 493
• Deviance information criterion, 493
E
• Empirical and theoretical prior distribution, 139
• Empirical Bayes, 142
• Empirical Bayes prior distribution, 140
• Empirical Bayesian log model evidence, 499
• Encompassing model, 509
• Estimation matrix, 400
• Event space, 2
• Exceedance probability, 10
• Expected value, 44
• Expected value of a random matrix, 57
• Expected value of a random vector, 57
• Explained sum of squares, 398
• Exponential distribution, 230
F
• F-distribution, 249
• Family evidence, 501
• Flat, hard and soft prior distribution, 138
• Full probability model, 135
• Full width at half maximum, 83
G
• Gamma distribution, 217
• General linear model, 426
• Generative model, 134
I
• Informative and non-informative prior distribution, 139
• Inverse general linear model, 433
J
• Joint cumulative distribution function, 36
• Joint differential entropy, 103
• Joint entropy, 92
• Joint likelihood, 135
• Joint probability, 6
• Joint probability distribution, 18
K
• Kolmogorov axioms of probability, 11
• Kullback-Leibler divergence, 111
L
• Law of conditional probability, 6
• Law of marginal probability, 6
• Likelihood function, 134
• Log Bayes factor, 509
• Log family evidence, 502
• Log model evidence, 496
• Log-likelihood function, 124
• Log-normal distribution, 235
• Logistic regression, 476
M
• Marginal likelihood, 137
• Marginal probability distribution, 18
• Matrix-normal distribution, 302
• Maximum, 84
• Maximum entropy prior distribution, 140
• Maximum likelihood estimation, 124
• Maximum log-likelihood, 125
• Mean squared error, 120
• Median, 82
• Method-of-moments estimation, 126
• Minimum, 83
• Mode, 82
• Model evidence, 495
• Moment, 84
• Moment-generating function, 38
• Multinomial distribution, 170
• Multinomial observations, 449
• Multiple linear regression, 394
• Multivariate normal distribution, 265
• Multivariate t-distribution, 279
• Mutual exclusivity, 10
• Mutual information, 107
N
• Non-standardized t-distribution, 213
• Normal distribution, 181
• Normal-gamma distribution, 281
• Normal-Wishart distribution, 315
• Null hypothesis, 129
O
• One-tailed and two-tailed hypothesis, 127
• One-tailed and two-tailed test, 129
P
• p-value, 132
• Point and set hypothesis, 127
• Poisson distribution, 164
• Poisson distribution with exposure values, 460
• Poisson-distributed data, 453
• Posterior distribution, 136
• Posterior model probability, 511
• Power of a statistical test, 131
• Precision, 65
• Precision matrix, 76
• Prior distribution, 134
• Probability, 5
• Probability density function, 22
• Probability distribution, 18
• Probability mass function, 19
• Probability space, 3
• Probability-generating function, 41
• Projection matrix, 400
Q
• Quantile function, 36
R
• Random event, 3
• Random experiment, 2
• Random matrix, 4
• Random variable, 3
• Random vector, 4
• Raw moment, 86
• Reference prior distribution, 141
• Regression line, 373
• Residual sum of squares, 398
• Residual variance, 480
• Residual-forming matrix, 400
S
• Sample correlation coefficient, 79
• Sample correlation matrix, 81
• Sample covariance, 65
• Sample covariance matrix, 70
• Sample mean, 45
• Sample space, 2
• Sample variance, 58
• Sampling distribution, 19
• Shannon entropy, 90
• Signal-to-noise ratio, 487
• Significance level, 131
T
• t-distribution, 212
• Test statistic, 130
• Total sum of squares, 397
• Transformed general linear model, 430
U
• Uniform and non-uniform prior distribution, 138
• Uniform-prior log model evidence, 498
• Univariate and multivariate random variable, 5
• Univariate Gaussian, 322
• Univariate Gaussian with known variance, 337
V
• Variance, 58
• Variational Bayes, 143
• Variational Bayesian log model evidence, 500
W
• Wald distribution, 259
• Wishart distribution, 312