APPENDIX
$$\begin{split}
\mathrm{LBF}_{12} = \; & \frac{1}{2} \log \frac{|P_1|}{|P_2|} + \frac{1}{2} \log \frac{|\Lambda_0^{(1)}|}{|\Lambda_0^{(2)}|} - \frac{1}{2} \log \frac{|\Lambda_n^{(1)}|}{|\Lambda_n^{(2)}|} \\
& + \log \frac{\Gamma(a_n^{(1)})}{\Gamma(a_0^{(1)})} + a_0^{(1)} \log b_0^{(1)} - a_n^{(1)} \log b_n^{(1)} \\
& - \log \frac{\Gamma(a_n^{(2)})}{\Gamma(a_0^{(2)})} - a_0^{(2)} \log b_0^{(2)} + a_n^{(2)} \log b_n^{(2)} .
\end{split} \tag{9}$$
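To make the formula concrete, here is a minimal Python sketch that transcribes equation (9) term by term; the log-determinants of $P_i$, $\Lambda_0^{(i)}$, $\Lambda_n^{(i)}$ and the Gamma parameters of both models are assumed to be given, and all function and argument names are illustrative rather than taken from the text.

```python
# Direct transcription of equation (9); all argument names are illustrative.
import numpy as np
from scipy.special import gammaln

def lbf12(ld_P1, ld_P2, ld_L0_1, ld_L0_2, ld_Ln_1, ld_Ln_2,
          a0_1, b0_1, an_1, bn_1, a0_2, b0_2, an_2, bn_2):
    """Log Bayes factor of model 1 against model 2, following equation (9).

    ld_* are log-determinants of the respective precision matrices;
    a*/b* are the Gamma prior (index 0) and posterior (index n) parameters.
    """
    return (0.5 * (ld_P1 - ld_P2)
            + 0.5 * (ld_L0_1 - ld_L0_2)
            - 0.5 * (ld_Ln_1 - ld_Ln_2)
            + gammaln(an_1) - gammaln(a0_1) + a0_1 * np.log(b0_1) - an_1 * np.log(bn_1)
            - gammaln(an_2) + gammaln(a0_2) - a0_2 * np.log(b0_2) + an_2 * np.log(bn_2))
```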
Let
$$y = X\beta + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \Sigma) \tag{1}$$
be a linear regression model (→ III/1.5.1) with measured n × 1 data vector y, known n × p design
matrix X and known n × n covariance matrix Σ as well as unknown p × 1 regression coefficients β.
Then, the conjugate prior (→ I/5.2.5) for this model is a multivariate normal distribution (→ II/4.1.1) over the regression coefficients,
$$p(\beta) = \mathcal{N}(\beta; \mu_0, \Sigma_0) . \tag{2}$$
Proof: By definition, a conjugate prior (→ I/5.2.5) is a prior distribution (→ I/5.1.3) that, when
combined with the likelihood function (→ I/5.1.2), leads to a posterior distribution (→ I/5.1.7) that
belongs to the same family of probability distributions (→ I/1.5.1). This is fulfilled when the prior
density and the likelihood function are proportional to each other as functions of the model parameters, i.e.
the model parameters appear in the same functional form in both.
Equation (1) implies the following likelihood function (→ I/5.1.2):
$$p(y|\beta) = \mathcal{N}(y; X\beta, \Sigma) = \sqrt{\frac{1}{(2\pi)^n |\Sigma|}} \cdot \exp\!\left[ -\frac{1}{2} (y - X\beta)^\mathrm{T} \Sigma^{-1} (y - X\beta) \right] . \tag{3}$$
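For illustration only, a short Python sketch that evaluates the likelihood (3) both with SciPy's multivariate normal density and with the explicit formula; the toy values of $X$, $\beta$, $\Sigma$ and $y$ are assumptions made up for the example.

```python
# Minimal sketch: evaluating the GLM likelihood (3) numerically.
# X, beta, Sigma, y are illustrative toy values, not from the text.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n, p = 10, 2
X = rng.normal(size=(n, p))            # known design matrix
beta = np.array([1.5, -0.5])           # regression coefficients
Sigma = 0.5 * np.eye(n)                # known covariance matrix
y = multivariate_normal.rvs(mean=X @ beta, cov=Sigma, random_state=1)

# p(y | beta) = N(y; X beta, Sigma)
lik = multivariate_normal.pdf(y, mean=X @ beta, cov=Sigma)

# the same value from the explicit formula in equation (3)
e = y - X @ beta
lik_manual = np.exp(-0.5 * e @ np.linalg.solve(Sigma, e)) \
             / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
assert np.isclose(lik, lik_manual)
```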
Expanding the product in the exponent, we have:
$$p(y|\beta) = \sqrt{\frac{1}{(2\pi)^n |\Sigma|}} \cdot \exp\!\left[ -\frac{1}{2} \left( y^\mathrm{T} \Sigma^{-1} y - y^\mathrm{T} \Sigma^{-1} X\beta - \beta^\mathrm{T} X^\mathrm{T} \Sigma^{-1} y + \beta^\mathrm{T} X^\mathrm{T} \Sigma^{-1} X\beta \right) \right] . \tag{4}$$
Completing the square over β, one obtains
$$p(y|\beta) = \sqrt{\frac{1}{(2\pi)^n |\Sigma|}} \cdot \exp\!\left[ -\frac{1}{2} \left( (\beta - \tilde{X}y)^\mathrm{T} X^\mathrm{T} \Sigma^{-1} X (\beta - \tilde{X}y) - y^\mathrm{T} Q y + y^\mathrm{T} \Sigma^{-1} y \right) \right] \tag{5}$$
where $\tilde{X} = \left( X^\mathrm{T} \Sigma^{-1} X \right)^{-1} X^\mathrm{T} \Sigma^{-1}$ and $Q = \tilde{X}^\mathrm{T} X^\mathrm{T} \Sigma^{-1} X \tilde{X}$.
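The completing-the-square step can also be checked numerically: under random toy values (illustrative assumptions only), the quadratic forms inside the exponents of (4) and (5) coincide.

```python
# Minimal sketch: numerical check that the exponents of (4) and (5) agree.
import numpy as np

rng = np.random.default_rng(42)
n, p = 8, 3
X = rng.normal(size=(n, p))
Sigma = np.eye(n) + 0.1 * np.ones((n, n))    # known covariance (positive definite)
Si = np.linalg.inv(Sigma)                    # Sigma^{-1}
y = rng.normal(size=n)
beta = rng.normal(size=p)

Xt = np.linalg.inv(X.T @ Si @ X) @ X.T @ Si  # X-tilde as defined below (5)
Q = Xt.T @ (X.T @ Si @ X) @ Xt               # Q as defined below (5)

expo4 = y @ Si @ y - y @ Si @ X @ beta - beta @ X.T @ Si @ y + beta @ X.T @ Si @ X @ beta
d = beta - Xt @ y
expo5 = d @ (X.T @ Si @ X) @ d - y @ Q @ y + y @ Si @ y
assert np.isclose(expo4, expo5)
```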
With the posterior distribution for Bayesian linear regression with known covariance (→ III/1.7.2),
this becomes:
$$\mathrm{Acc}(m) = -\frac{n}{2} \log(2\pi) - \frac{1}{2} \log |\Sigma| - \frac{1}{2} \left\langle y^\mathrm{T} \Sigma^{-1} y - 2 y^\mathrm{T} \Sigma^{-1} X\beta + \beta^\mathrm{T} X^\mathrm{T} \Sigma^{-1} X\beta \right\rangle_{\mathcal{N}(\beta; \mu_n, \Sigma_n)} . \tag{9}$$
The expectation of a multivariate normal random vector $x \sim \mathcal{N}(\mu, \Sigma)$ is
$$\langle x \rangle = \mu \tag{10}$$
and the expectation of a quadratic form is given by (→ I/1.10.9)
$$\langle x^\mathrm{T} A x \rangle = \mu^\mathrm{T} A \mu + \mathrm{tr}(A\Sigma) . \tag{11}$$
Thus, the model accuracy of m evaluates to
$$\begin{split}
\mathrm{Acc}(m) &= -\frac{n}{2} \log(2\pi) - \frac{1}{2} \log |\Sigma| - \frac{1}{2} \left[ y^\mathrm{T} \Sigma^{-1} y - 2 y^\mathrm{T} \Sigma^{-1} X \mu_n + \mu_n^\mathrm{T} X^\mathrm{T} \Sigma^{-1} X \mu_n + \mathrm{tr}(X^\mathrm{T} \Sigma^{-1} X \Sigma_n) \right] \\
&= -\frac{1}{2} (y - X\mu_n)^\mathrm{T} \Sigma^{-1} (y - X\mu_n) - \frac{1}{2} \log |\Sigma| - \frac{n}{2} \log(2\pi) - \frac{1}{2} \mathrm{tr}(X^\mathrm{T} \Sigma^{-1} X \Sigma_n) \\
&\overset{(4)}{=} -\frac{1}{2} e_y^\mathrm{T} \Sigma^{-1} e_y - \frac{1}{2} \log |\Sigma| - \frac{n}{2} \log(2\pi) - \frac{1}{2} \mathrm{tr}(X^\mathrm{T} \Sigma^{-1} X \Sigma_n)
\end{split} \tag{12}$$
which proves the first part of (3).
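A minimal Python sketch of the final line of (12), assuming the posterior parameters $\mu_n$, $\Sigma_n$ are already available; the function name and signature are illustrative only.

```python
# Minimal sketch of the model accuracy (12): expected log-likelihood under the posterior.
import numpy as np

def accuracy(y, X, Sigma, mu_n, Sigma_n):
    """Acc(m) from equation (12), given data y, design X, known covariance Sigma
    and posterior parameters mu_n, Sigma_n (all names illustrative)."""
    n = y.shape[0]
    Si = np.linalg.inv(Sigma)
    e_y = y - X @ mu_n                       # posterior prediction error e_y
    _, logdet = np.linalg.slogdet(Sigma)     # log |Sigma|
    return (-0.5 * e_y @ Si @ e_y
            - 0.5 * logdet
            - 0.5 * n * np.log(2 * np.pi)
            - 0.5 * np.trace(X.T @ Si @ X @ Sigma_n))
```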
2) The complexity penalty is the Kullback-Leibler divergence (→ I/2.5.1) of the posterior distribution
(→ I/5.1.7) p(β|y) from the prior distribution (→ I/5.1.3) p(β):
$$\begin{split}
\mathrm{Com}(m) &= \frac{1}{2} \left[ (\mu_0 - \mu_n)^\mathrm{T} \Sigma_0^{-1} (\mu_0 - \mu_n) + \mathrm{tr}(\Sigma_0^{-1} \Sigma_n) - \log \frac{|\Sigma_n|}{|\Sigma_0|} - p \right] \\
&= \frac{1}{2} (\mu_0 - \mu_n)^\mathrm{T} \Sigma_0^{-1} (\mu_0 - \mu_n) + \frac{1}{2} \log |\Sigma_0| - \frac{1}{2} \log |\Sigma_n| + \frac{1}{2} \mathrm{tr}(\Sigma_0^{-1} \Sigma_n) - \frac{p}{2} \\
&\overset{(4)}{=} \frac{1}{2} e_\beta^\mathrm{T} \Sigma_0^{-1} e_\beta + \frac{1}{2} \log |\Sigma_0| - \frac{1}{2} \log |\Sigma_n| + \frac{1}{2} \mathrm{tr}(\Sigma_0^{-1} \Sigma_n) - \frac{p}{2}
\end{split} \tag{16}$$
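Analogously, a minimal sketch of (16), assuming the prior parameters $\mu_0$, $\Sigma_0$ and posterior parameters $\mu_n$, $\Sigma_n$ are given; names are illustrative only.

```python
# Minimal sketch of the complexity penalty (16),
# i.e. KL[ N(mu_n, Sigma_n) || N(mu_0, Sigma_0) ].
import numpy as np

def complexity(mu_0, Sigma_0, mu_n, Sigma_n):
    """Com(m) from equation (16), given prior and posterior parameters."""
    p = mu_0.shape[0]
    S0i = np.linalg.inv(Sigma_0)
    e_b = mu_0 - mu_n                        # prior-to-posterior shift e_beta
    _, ld0 = np.linalg.slogdet(Sigma_0)      # log |Sigma_0|
    _, ldn = np.linalg.slogdet(Sigma_n)      # log |Sigma_n|
    return 0.5 * (e_b @ S0i @ e_b + np.trace(S0i @ Sigma_n) - (ldn - ld0) - p)
```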
For a discrete random variable $X$ with set of possible values $\mathcal{X}$:
$$F_X(x) \overset{(2)}{=} \sum_{t \in \mathcal{X},\, t \leq x} \mathrm{Pr}(X = t) \overset{(3)}{=} \sum_{t \in \mathcal{X},\, t \leq x} f_X(t) . \tag{4}$$
For a continuous random variable $X$:
$$F_X(x) \overset{(2)}{=} \mathrm{Pr}(X \in (-\infty, x]) \overset{(3)}{=} \int_{-\infty}^{x} f_X(t) \, \mathrm{d}t . \tag{4}$$
■
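As a numerical illustration of the two cases of (4), a short Python sketch that recovers the CDF by summing a probability mass function (Poisson toy example) and by integrating a probability density function (standard normal); the chosen distributions are assumptions made for the example.

```python
# Minimal sketch: CDF as a sum of the PMF (discrete) and as an integral of the PDF (continuous).
import numpy as np
from scipy import stats
from scipy.integrate import quad

# discrete case: F_X(x) = sum_{t <= x} f_X(t), Poisson toy example
lam, x = 3.0, 4
F_sum = sum(stats.poisson.pmf(t, lam) for t in range(0, x + 1))
assert np.isclose(F_sum, stats.poisson.cdf(x, lam))

# continuous case: F_X(x) = integral_{-inf}^{x} f_X(t) dt, standard normal toy example
x = 1.3
F_int, _ = quad(stats.norm.pdf, -np.inf, x)
assert np.isclose(F_int, stats.norm.cdf(x))
```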
Then, the value of θ at which the posterior density (→ I/5.1.7) attains its maximum is called the
“maximum-a-posteriori estimate”, “MAP estimate” or “posterior mode” of $\theta$:
$$\hat{\theta}_{\mathrm{MAP}} = \operatorname*{arg\,max}_{\theta} \; p(\theta|y) .$$
Sources:
• Wikipedia (2023): “Maximum a posteriori estimation”; in: Wikipedia, the free encyclopedia, retrieved on 2023-12-01; URL: https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation#Description.
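For illustration, a minimal Python sketch that finds the MAP estimate by numerically maximizing the log posterior; the conjugate normal-normal toy model is an assumption chosen so the result can be checked against the known posterior mode.

```python
# Minimal sketch: MAP estimation by maximizing the (unnormalized) log posterior.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

y = np.array([1.2, 0.8, 1.5, 1.1])      # observed data (toy values)
sigma, mu0, tau0 = 0.5, 0.0, 1.0         # known noise SD, prior mean and SD

def neg_log_post(theta):
    # -[ log p(y|theta) + log p(theta) ]; the evidence p(y) is a constant and can be dropped
    return -(norm.logpdf(y, loc=theta, scale=sigma).sum()
             + norm.logpdf(theta, loc=mu0, scale=tau0))

theta_map = minimize_scalar(neg_log_post).x

# analytic posterior mode of this conjugate normal-normal model, for comparison
post_prec = len(y) / sigma**2 + 1 / tau0**2
theta_mode = (y.sum() / sigma**2 + mu0 / tau0**2) / post_prec
assert np.isclose(theta_map, theta_mode, atol=1e-6)
```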
Proof: In a full probability model (→ I/5.1.4), the posterior distribution (→ I/5.1.7) can be expressed
using Bayes’ theorem (→ I/5.3.1):
$$p(\theta|y, m) = \frac{p(y|\theta, m) \, p(\theta|m)}{p(y|m)} . \tag{2}$$
Applying the law of conditional probability (→ I/1.3.4) to the numerator, we have:
$$p(\theta|y, m) = \frac{p(y, \theta|m)}{p(y|m)} . \tag{3}$$
Because the denominator does not depend on θ, it is constant in θ and thus acts as a proportionality
factor between the posterior distribution and the joint likelihood:
$$p(\theta|y, m) \propto p(y, \theta|m) . \tag{4}$$
$$p(\theta|y_1, y_2) \propto \frac{p(\theta|y_1) \cdot p(\theta|y_2)}{p(\theta)} . \tag{2}$$
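As an illustration of the proportionality between the posterior and the joint likelihood, a short Python sketch that recovers the posterior by normalizing $p(y|\theta)\,p(\theta)$ over a grid of parameter values; the beta-Bernoulli toy model is an assumption chosen because its exact posterior is known.

```python
# Minimal sketch: p(theta|y) obtained by normalizing the joint likelihood p(y|theta) p(theta).
import numpy as np
from scipy.stats import beta, binom

k, n = 7, 10                               # observed successes out of n trials (toy values)
a0, b0 = 2.0, 2.0                          # Beta prior parameters
theta = np.linspace(1e-4, 1 - 1e-4, 2001)  # grid over the parameter

joint = binom.pmf(k, n, theta) * beta.pdf(theta, a0, b0)   # p(y|theta) p(theta)
d = theta[1] - theta[0]
post_grid = joint / (joint.sum() * d)      # divide by a Riemann-sum estimate of p(y)

post_exact = beta.pdf(theta, a0 + k, b0 + n - k)           # known conjugate posterior
assert np.allclose(post_grid, post_exact, rtol=1e-2)
```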