
CHAPTER V. APPENDIX

D185 tcon t-contrast for contrast-based inference in multiple linear regression JoramSoch 2022-12-16 475
D186 fcon F-contrast for contrast-based inference in multiple linear regression JoramSoch 2022-12-16 475
D187 exg ex-Gaussian distribution tomfaulkenberry 2023-04-18 274
D188 skew Skewness tomfaulkenberry 2023-04-20 61
D189 bvn Bivariate normal distribution JoramSoch 2023-09-22 287
D190 skew-samp Sample skewness tomfaulkenberry 2023-10-30 61
D191 map Maximum-a-posteriori estimation JoramSoch 2023-12-01 126
D192 sr Scoring rule KarahanS 2024-02-28 135
D193 psr Proper scoring rule KarahanS 2024-02-28 135
D194 spsr Strictly proper scoring rule KarahanS 2024-02-28 135
D195 lpsr Log probability scoring rule KarahanS 2024-02-28 136
D196 fstat F-statistic JoramSoch 2024-03-15 610
D197 bsr Brier scoring rule KarahanS 2024-03-23 139
D198 lr Likelihood ratio JoramSoch 2024-06-14 117
D199 llr Log-likelihood ratio JoramSoch 2024-06-14 117
CHAPTER III. STATISTICAL MODELS

\mathrm{LBF}_{12} = \frac{1}{2} \log \frac{|P_1|}{|P_2|} + \frac{1}{2} \log \frac{|\Lambda_0^{(1)}|}{|\Lambda_0^{(2)}|} - \frac{1}{2} \log \frac{|\Lambda_n^{(1)}|}{|\Lambda_n^{(2)}|}
                  + \log \frac{\Gamma(a_n^{(1)})}{\Gamma(a_0^{(1)})} + a_0^{(1)} \log b_0^{(1)} - a_n^{(1)} \log b_n^{(1)}
                  - \log \frac{\Gamma(a_n^{(2)})}{\Gamma(a_0^{(2)})} - a_0^{(2)} \log b_0^{(2)} + a_n^{(2)} \log b_n^{(2)} .    (9)

1.7 Bayesian linear regression with known covariance


1.7.1 Conjugate prior distribution
Theorem: Let

y = Xβ + ε, ε ∼ N (0, Σ) (1)
be a linear regression model (→ III/1.5.1) with measured n × 1 data vector y, known n × p design
matrix X and known n × n covariance matrix Σ as well as unknown p × 1 regression coefficients β.
Then, the conjugate prior (→ I/5.2.5) for this model is a multivariate normal distribution (→ II/4.1.1)

p(β) = N (β; µ0 , Σ0 ) . (2)

Proof: By definition, a conjugate prior (→ I/5.2.5) is a prior distribution (→ I/5.1.3) that, when
combined with the likelihood function (→ I/5.1.2), leads to a posterior distribution (→ I/5.1.7) that
belongs to the same family of probability distributions (→ I/1.5.1). This is fulfilled when the prior
density and the likelihood function are proportional in the model parameters in the same way, i.e.
when the model parameters appear in the same functional form in both.
Equation (1) implies the following likelihood function (→ I/5.1.2):
p(y|\beta) = \mathcal{N}(y; X\beta, \Sigma) = \sqrt{\frac{1}{(2\pi)^n |\Sigma|}} \exp\!\left[ -\frac{1}{2} (y - X\beta)^T \Sigma^{-1} (y - X\beta) \right] .    (3)
Expanding the product in the exponent, we have:
p(y|\beta) = \sqrt{\frac{1}{(2\pi)^n |\Sigma|}} \cdot \exp\!\left[ -\frac{1}{2} \left( y^T \Sigma^{-1} y - y^T \Sigma^{-1} X\beta - \beta^T X^T \Sigma^{-1} y + \beta^T X^T \Sigma^{-1} X\beta \right) \right] .    (4)
Completing the square over β, one obtains
p(y|\beta) = \sqrt{\frac{1}{(2\pi)^n |\Sigma|}} \cdot \exp\!\left[ -\frac{1}{2} \left( (\beta - \tilde{X}y)^T X^T \Sigma^{-1} X (\beta - \tilde{X}y) - y^T Q y + y^T \Sigma^{-1} y \right) \right]    (5)

where \tilde{X} = (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} and Q = \tilde{X}^T (X^T \Sigma^{-1} X) \tilde{X}.
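A minimal numerical sketch of this conjugacy in Python: under the Gaussian prior (2), the posterior over β is again Gaussian, with parameters Σ_n = (X^T Σ^-1 X + Σ_0^-1)^-1 and µ_n = Σ_n (X^T Σ^-1 y + Σ_0^-1 µ_0), the result referenced above as (→ III/1.7.2) and not restated in this excerpt. The function name posterior_params and the toy numbers are illustrative only.

import numpy as np

def posterior_params(y, X, Sigma, mu0, Sigma0):
    """Gaussian posterior N(beta; mu_n, Sigma_n) for y = X beta + eps, eps ~ N(0, Sigma),
    under the conjugate prior N(beta; mu0, Sigma0)."""
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma0_inv = np.linalg.inv(Sigma0)
    # posterior precision = data precision + prior precision
    Lambda_n = X.T @ Sigma_inv @ X + Sigma0_inv
    Sigma_n = np.linalg.inv(Lambda_n)
    # posterior mean weighs the data term against the prior term
    mu_n = Sigma_n @ (X.T @ Sigma_inv @ y + Sigma0_inv @ mu0)
    return mu_n, Sigma_n

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
Sigma = 0.5 * np.eye(n)                      # known n x n covariance
y = X @ beta_true + rng.multivariate_normal(np.zeros(n), Sigma)
mu0, Sigma0 = np.zeros(p), 10.0 * np.eye(p)  # weakly informative Gaussian prior
mu_n, Sigma_n = posterior_params(y, X, Sigma, mu0, Sigma0)
print(mu_n)                                  # should lie close to beta_true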
1. UNIVARIATE NORMAL DATA

With the posterior distribution for Bayesian linear regression with known covariance (→ III/1.7.2),
this becomes:

 
\mathrm{Acc}(m) = -\frac{n}{2} \log(2\pi) - \frac{1}{2} \log|\Sigma| - \frac{1}{2} \left\langle y^T \Sigma^{-1} y - 2 y^T \Sigma^{-1} X\beta + \beta^T X^T \Sigma^{-1} X\beta \right\rangle_{\mathcal{N}(\beta; \mu_n, \Sigma_n)} .    (9)

If x ∼ N(µ, Σ), then its expected value is (→ II/4.1.9)

⟨x⟩ = µ (10)
and the expectation of a quadratic form is given by (→ I/1.10.9)

\left\langle x^T A x \right\rangle = \mu^T A \mu + \mathrm{tr}(A\Sigma) .    (11)
Thus, the model accuracy of m evaluates to

\mathrm{Acc}(m) = -\frac{n}{2} \log(2\pi) - \frac{1}{2} \log|\Sigma| - \frac{1}{2} \left[ y^T \Sigma^{-1} y - 2 y^T \Sigma^{-1} X\mu_n + \mu_n^T X^T \Sigma^{-1} X\mu_n + \mathrm{tr}(X^T \Sigma^{-1} X \Sigma_n) \right]
                = -\frac{1}{2} (y - X\mu_n)^T \Sigma^{-1} (y - X\mu_n) - \frac{1}{2} \log|\Sigma| - \frac{n}{2} \log(2\pi) - \frac{1}{2} \mathrm{tr}(X^T \Sigma^{-1} X \Sigma_n)
                \overset{(4)}{=} -\frac{1}{2} e_y^T \Sigma^{-1} e_y - \frac{1}{2} \log|\Sigma| - \frac{n}{2} \log(2\pi) - \frac{1}{2} \mathrm{tr}(X^T \Sigma^{-1} X \Sigma_n)    (12)
which proves the first part of (3).
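The step from (9) to (12) can be checked numerically: for any Gaussian N(β; µ_n, Σ_n), averaging the log-likelihood log p(y|β) over samples of β must reproduce the closed form (12). The sketch below uses simulated data and an illustrative choice of µ_n and Σ_n (the exact posterior parameters are those of → III/1.7.2, not restated here).

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
n, p = 30, 2
X = rng.normal(size=(n, p))
Sigma = 0.8 * np.eye(n)                      # known covariance
y = X @ np.array([1.0, -1.0]) + rng.multivariate_normal(np.zeros(n), Sigma)

# some Gaussian over beta playing the role of N(beta; mu_n, Sigma_n)
Sigma_inv = np.linalg.inv(Sigma)
Sigma_n = np.linalg.inv(X.T @ Sigma_inv @ X + np.eye(p))
mu_n = Sigma_n @ (X.T @ Sigma_inv @ y)

# closed form (12)
e_y = y - X @ mu_n
acc_closed = (-0.5 * e_y @ Sigma_inv @ e_y
              - 0.5 * np.linalg.slogdet(Sigma)[1]
              - 0.5 * n * np.log(2 * np.pi)
              - 0.5 * np.trace(X.T @ Sigma_inv @ X @ Sigma_n))

# Monte Carlo average of log p(y|beta) over beta ~ N(mu_n, Sigma_n), cf. (9)
betas = rng.multivariate_normal(mu_n, Sigma_n, size=20000)
noise = multivariate_normal(mean=np.zeros(n), cov=Sigma)
acc_mc = noise.logpdf(y - betas @ X.T).mean()

print(acc_closed, acc_mc)                    # the two values should agree closely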

2) The complexity penalty is the Kullback-Leibler divergence (→ I/2.5.1) of the posterior distribution
(→ I/5.1.7) p(β|y) from the prior distribution (→ I/5.1.3) p(β):

Com(m) = KL [p(β|y) || p(β)] . (13)


With the prior distribution (→ III/1.7.1) given by (2) and the posterior distribution for Bayesian
linear regression with known covariance (→ III/1.7.2), this becomes:

Com(m) = KL [N (β; µn , Σn ) || N (β; µ0 , Σ0 )] . (14)


With the Kullback-Leibler divergence for the multivariate normal distribution (→ II/4.1.12)
 
\mathrm{KL}[\mathcal{N}(\mu_1, \Sigma_1) \,||\, \mathcal{N}(\mu_2, \Sigma_2)] = \frac{1}{2} \left[ (\mu_2 - \mu_1)^T \Sigma_2^{-1} (\mu_2 - \mu_1) + \mathrm{tr}(\Sigma_2^{-1} \Sigma_1) - \ln \frac{|\Sigma_1|}{|\Sigma_2|} - n \right]    (15)
the model complexity of m evaluates to

 
\mathrm{Com}(m) = \frac{1}{2} \left[ (\mu_0 - \mu_n)^T \Sigma_0^{-1} (\mu_0 - \mu_n) + \mathrm{tr}(\Sigma_0^{-1} \Sigma_n) - \log \frac{|\Sigma_n|}{|\Sigma_0|} - p \right]
                = \frac{1}{2} (\mu_0 - \mu_n)^T \Sigma_0^{-1} (\mu_0 - \mu_n) + \frac{1}{2} \log|\Sigma_0| - \frac{1}{2} \log|\Sigma_n| + \frac{1}{2} \mathrm{tr}(\Sigma_0^{-1} \Sigma_n) - \frac{p}{2}
                \overset{(4)}{=} \frac{1}{2} e_\beta^T \Sigma_0^{-1} e_\beta + \frac{1}{2} \log|\Sigma_0| - \frac{1}{2} \log|\Sigma_n| + \frac{1}{2} \mathrm{tr}(\Sigma_0^{-1} \Sigma_n) - \frac{p}{2}    (16)
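Similarly, (16) can be checked against a brute-force Monte Carlo estimate of the Kullback-Leibler divergence in (14); the prior and posterior moments in the following sketch are arbitrary illustrative values.

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
p = 3
mu0, Sigma0 = np.zeros(p), 5.0 * np.eye(p)             # prior moments (illustrative)
mu_n = np.array([0.8, -0.3, 1.1])                      # posterior mean (illustrative)
A = rng.normal(size=(p, p))
Sigma_n = 0.1 * (A @ A.T + p * np.eye(p))              # posterior covariance (illustrative)

# closed form (16)
Sigma0_inv = np.linalg.inv(Sigma0)
e_b = mu0 - mu_n
com_closed = 0.5 * (e_b @ Sigma0_inv @ e_b
                    + np.linalg.slogdet(Sigma0)[1] - np.linalg.slogdet(Sigma_n)[1]
                    + np.trace(Sigma0_inv @ Sigma_n) - p)

# Monte Carlo estimate of KL[N(mu_n, Sigma_n) || N(mu0, Sigma0)], cf. (14)
post = multivariate_normal(mu_n, Sigma_n)
prior = multivariate_normal(mu0, Sigma0)
b = post.rvs(size=100000)
com_mc = np.mean(post.logpdf(b) - prior.logpdf(b))

print(com_closed, com_mc)                              # should agree up to Monte Carlo error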
1. PROBABILITY THEORY

Proof: The cumulative distribution function (→ I/1.8.1) of a random variable (→ I/1.2.2) X is
defined as the probability that X is smaller than or equal to x:

FX (x) = Pr(X ≤ x) . (2)


The probability mass function (→ I/1.6.1) of a discrete (→ I/1.2.6) random variable (→ I/1.2.2) X
returns the probability that X takes a particular value x:

fX (x) = Pr(X = x) . (3)


Taking these two definitions together, we have:

F_X(x) \overset{(2)}{=} \sum_{t \in \mathcal{X},\, t \le x} \Pr(X = t) \overset{(3)}{=} \sum_{t \in \mathcal{X},\, t \le x} f_X(t) .    (4)
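As a numerical illustration of (4) (not part of the original text), the CDF of a discrete random variable is simply the cumulative sum of its PMF; a binomial distribution is used here purely as an example.

import numpy as np
from scipy.stats import binom

# PMF of a Binomial(10, 0.3) variable and its CDF obtained by summing the PMF, cf. (4)
x_vals = np.arange(0, 11)
pmf = binom.pmf(x_vals, n=10, p=0.3)
cdf_from_pmf = np.cumsum(pmf)                          # F_X(x) = sum of f_X(t) over t <= x

print(np.allclose(cdf_from_pmf, binom.cdf(x_vals, n=10, p=0.3)))   # True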

1.8.6 Cumulative distribution function of continuous random variable


Theorem: Let X be a continuous (→ I/1.2.6) random variable (→ I/1.2.2) with possible values X
and probability density function (→ I/1.7.1) fX (x). Then, the cumulative distribution function (→
I/1.8.1) of X is
F_X(x) = \int_{-\infty}^{x} f_X(t) \, \mathrm{d}t .    (1)

Proof: The cumulative distribution function (→ I/1.8.1) of a random variable (→ I/1.2.2) X is
defined as the probability that X is smaller than or equal to x:

FX (x) = Pr(X ≤ x) . (2)


The probability density function (→ I/1.7.1) of a continuous (→ I/1.2.6) random variable (→ I/1.2.2)
X can be used to calculate the probability that X falls into a particular interval A:
\Pr(X \in A) = \int_{A} f_X(x) \, \mathrm{d}x .    (3)
Taking these two definitions together, we have:

F_X(x) \overset{(2)}{=} \Pr(X \in (-\infty, x]) \overset{(3)}{=} \int_{-\infty}^{x} f_X(t) \, \mathrm{d}t .    (4)
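Analogously, (4) can be illustrated by numerically integrating a PDF up to x; the standard normal used below is only an example distribution.

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# F_X(x) as the integral of the PDF from -infinity to x, cf. (4), for a standard normal
x = 1.5
cdf_by_integration, _ = quad(norm.pdf, -np.inf, x)
print(cdf_by_integration, norm.cdf(x))                 # both approx. 0.9332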


5. BAYESIAN STATISTICS

Then, the value of θ at which the posterior density (→ I/5.1.7) attains its maximum is called the
“maximum-a-posteriori estimate”, “MAP estimate” or “posterior mode” of θ:

\hat{\theta}_{\mathrm{MAP}} = \operatorname*{arg\,max}_{\theta} D(\theta; \phi) .    (2)

Sources:
• Wikipedia (2023): “Maximum a posteriori estimation”; in: Wikipedia, the free encyclopedia, retrieved on 2023-12-01; URL: https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation#Description.
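As a small illustration (not part of the original text), the MAP estimate can be found by numerically maximizing a posterior density. The Beta posterior assumed below, Beta(9, 5) arising from a Beta(2, 2) prior and 7 successes in 10 Bernoulli trials, is purely an example; its numerical maximizer is compared with the analytic mode.

from scipy.optimize import minimize_scalar
from scipy.stats import beta

# Posterior of a Bernoulli rate theta under a Beta(2, 2) prior after observing
# 7 successes in 10 trials: Beta(2 + 7, 2 + 3); the MAP estimate is its mode.
a_n, b_n = 2 + 7, 2 + 3

# numerical maximisation of the posterior density over theta in (0, 1)
res = minimize_scalar(lambda t: -beta.pdf(t, a_n, b_n), bounds=(0, 1), method="bounded")
theta_map_numeric = res.x

# analytic mode of Beta(a, b) for a, b > 1: (a - 1) / (a + b - 2)
theta_map_analytic = (a_n - 1) / (a_n + b_n - 2)

print(theta_map_numeric, theta_map_analytic)           # both approx. 0.667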

5.1.9 Posterior density is proportional to joint likelihood


Theorem: In a full probability model (→ I/5.1.4) m describing measured data y using model pa-
rameters θ, the posterior density (→ I/5.1.7) over the model parameters is proportional to the joint
likelihood (→ I/5.1.5):

p(θ|y, m) ∝ p(y, θ|m) . (1)

Proof: In a full probability model (→ I/5.1.4), the posterior distribution (→ I/5.1.7) can be expressed
using Bayes’ theorem (→ I/5.3.1):

p(\theta|y, m) = \frac{p(y|\theta, m) \, p(\theta|m)}{p(y|m)} .    (2)
Applying the law of conditional probability (→ I/1.3.4) to the numerator, we have:

p(\theta|y, m) = \frac{p(y, \theta|m)}{p(y|m)} .    (3)
Because the denominator does not depend on θ, it is constant in θ and thus acts as a proportionality
factor between the posterior distribution and the joint likelihood:

p(θ|y, m) ∝ p(y, θ|m) . (4)
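A practical consequence of (4) is that the posterior can be approximated by evaluating the joint likelihood on a grid over θ and renormalizing, without ever computing p(y|m) explicitly. A minimal sketch for a Beta-Bernoulli example (prior, data and grid are illustrative, not from the text):

import numpy as np
from scipy.stats import binom, beta

# Grid approximation: evaluate the joint likelihood p(y, theta|m) = p(y|theta, m) p(theta|m)
# on a grid and renormalise over theta; the denominator p(y|m) never has to be computed.
theta = np.linspace(0.001, 0.999, 999)
prior = beta.pdf(theta, 2, 2)                          # Beta(2, 2) prior (illustrative)
likelihood = binom.pmf(7, n=10, p=theta)               # 7 successes in 10 trials (illustrative)
joint = likelihood * prior
posterior = joint / (joint.sum() * (theta[1] - theta[0]))

# compare with the exact conjugate posterior Beta(2 + 7, 2 + 3)
print(np.max(np.abs(posterior - beta.pdf(theta, 9, 5))))   # close to 0 (up to grid error)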

5.1.10 Combined posterior distribution from independent data


Theorem: Let p(θ|y1 ) and p(θ|y2 ) be posterior distributions (→ I/5.1.7), obtained using the same
prior distribution (→ I/5.1.3) from conditionally independent (→ I/1.3.7) data sets y1 and y2 :

p(y1 , y2 |θ) = p(y1 |θ) · p(y2 |θ) . (1)


Then, the combined posterior distribution (→ I/1.5.1) is proportional to the product of the individual
posterior densities (→ I/1.7.1), divided by the prior density:

p(\theta|y_1, y_2) \propto \frac{p(\theta|y_1) \cdot p(\theta|y_2)}{p(\theta)} .    (2)
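A quick numerical check of (2) for an illustrative Beta-Bernoulli example (the prior and the two data sets are not from the text): combining the two single-data-set posteriors via (2) reproduces the posterior obtained from the pooled data.

import numpy as np
from scipy.stats import beta

# Two conditionally independent Bernoulli data sets under a common Beta(2, 2) prior:
# y1 with 4 successes in 6 trials, y2 with 3 successes in 4 trials (numbers are illustrative).
theta = np.linspace(0.001, 0.999, 999)
prior = beta.pdf(theta, 2, 2)
post1 = beta.pdf(theta, 2 + 4, 2 + 2)                  # p(theta|y1)
post2 = beta.pdf(theta, 2 + 3, 2 + 1)                  # p(theta|y2)

# combined posterior via (2), renormalised on the grid
combined = post1 * post2 / prior
combined /= combined.sum() * (theta[1] - theta[0])

# exact posterior from the pooled data (7 successes in 10 trials): Beta(2 + 7, 2 + 3)
print(np.max(np.abs(combined - beta.pdf(theta, 9, 5))))    # close to 0 (up to grid error)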
