EN007001 Engineering Research Methodology: Statistical Inference: Bayesian Inference
Statistical Inference: Bayesian Inference
Lecture: 3 hours
Text:
Kroese, D. P., & Chan, J. C. (2014). Statistical modeling and computation. New York: Springer.
Koller, D., & Friedman, N. (2009). Probabilistic graphical models: Principles and techniques. MIT Press.
Bolstad, W. M., & Curran, J. M. (2016). Introduction to Bayesian statistics. John Wiley & Sons.
Leon-Garcia, A. (2017). Probability, statistics, and random processes for electrical engineering. Pearson Education.
The conditional pmf of X given an event C with P(C) > 0 is

p_X(x | C) = \frac{P({X = x} ∩ C)}{P(C)}
10 / 72
Definition 6 (Continuous Random Variable)
A continuous random variable is a random variable whose cdf F_X(x) is continuous everywhere and can be written as an integral of some nonnegative function f(x):

F_X(x) = \int_{−∞}^{x} f(t) dt,

where

\int_{−∞}^{∞} f(t) dt = 1.
11 / 72
Definition 7 (Probability Density Function)
The probability density function (pdf) of X, if it exists, is defined as the derivative of the cdf F_X(x):

f_X(x) = \frac{dF_X(x)}{dx}
12 / 72
Definition 9 (Jointly continuous random variables)
Two random variables X and Y are jointly continuous with joint density f_{X,Y}(x, y) if

P((X, Y) ∈ A) = \int\int_A f_{X,Y}(x, y) dx dy,

where

\int_{−∞}^{∞} \int_{−∞}^{∞} f_{X,Y}(x, y) dx dy = 1.

Similarly, the conditional pdf of X given an event C is the derivative of the conditional cdf:

f_X(x | C) = \frac{dF_X(x | C)}{dx}
13 / 72
Fact 1
If X and Y are independent, then the joint cdf factorizes as

F_{X,Y}(x, y) = F_X(x) F_Y(y).
14 / 72
Important Random Variables. The most commonly used random
variables in communications are:
Bernoulli Random Variable.
p_X(x) = p, if x = 1,
p_X(x) = 1 − p, if x = 0.
15 / 72
The binomial r.v. (the number of successes in n independent Bernoulli trials) models, for example, the total number of bits received in error when a sequence of n bits is transmitted over a channel with bit-error probability p.
Uniform Random Variable. The pdf is given by

f(x) = \frac{1}{b − a} for a < x < b, and f(x) = 0 otherwise.
16 / 72
The most important distribution in the study of statistics: the
normal (or Gaussian) distribution.
Definition 11 (Normal Distribution)
A random variable X is said to have a normal distribution with parameters µ and σ^2, written N(µ, σ^2), if its pdf is given by

f_X(x) = \frac{1}{\sqrt{2π} σ} e^{−(x−µ)^2/(2σ^2)},   x ∈ R.
17 / 72
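As an illustrative check (a base-Julia sketch; the values µ = 1 and σ = 2 are assumed only for the example), the normal pdf can be integrated numerically with a simple Riemann sum:

# Riemann-sum check that the normal pdf integrates to (approximately) one
µ, σ = 1.0, 2.0                                  # illustrative parameter values
f(x) = exp(-(x - µ)^2 / (2σ^2)) / (σ * sqrt(2π))
xs = range(µ - 10σ, µ + 10σ; length = 200_001)
println(sum(f, xs) * step(xs))                   # ≈ 1.0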
Example 1
Show that
f_{X,Y}(x, y) = \frac{1}{2π} e^{−(2x^2 − 2xy + y^2)/2}

is a valid joint probability density.
Solution. Since f_{X,Y}(x, y) > 0, all we have to do is show that it integrates to one. By factoring the exponent (2x^2 − 2xy + y^2 = x^2 + (y − x)^2), we obtain

f_{X,Y}(x, y) = \frac{e^{−(y−x)^2/2}}{\sqrt{2π}} · \frac{e^{−x^2/2}}{\sqrt{2π}}.

Then

\int_{−∞}^{∞} \int_{−∞}^{∞} f_{X,Y}(x, y) dx dy = \int_{−∞}^{∞} \frac{e^{−x^2/2}}{\sqrt{2π}} \left( \int_{−∞}^{∞} \frac{e^{−(y−x)^2/2}}{\sqrt{2π}} dy \right) dx = 1,

since the inner integrand is the N(x, 1) pdf in y and the outer integrand is the N(0, 1) pdf in x.
#
18 / 72
Sagemath Program 1
# Proof that f(x, y) is a valid joint density
reset()
x, y = var("x,y")
f(x, y) = 1/2/pi*e^( -(2*x^2 - 2*x*y + y^2)/2 )
# Check positivity: f(x, y) > 0
print(bool(f(x, y) > 0))
19 / 72
Julia Program 1
using SymPy                                 # symbolic computation (SymPy.jl)
x, y = symbols("x y", real = true)
f = 1/(2*PI) * exp(-(2x^2 - 2x*y + y^2)/2)
integrate(f, (x, -oo, oo), (y, -oo, oo))    # evaluates to 1
20 / 72
Julia Program 1 (cont.)
21 / 72
Definition 12 (Random Vector and Random Matrix)
A vector X whose entries are random variables is called a random vector, and a matrix Y whose entries are random variables is called a random matrix, i.e.,

X = [X_1, X_2, . . . , X_n]^T,

Y = \begin{bmatrix} Y_{11} & Y_{12} & \cdots & Y_{1p} \\ Y_{21} & Y_{22} & \cdots & Y_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ Y_{n1} & Y_{n2} & \cdots & Y_{np} \end{bmatrix}.
22 / 72
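As a concrete illustration (a base-Julia sketch; the dimensions n = 4 and p = 3 are arbitrary), one realization of a random vector and a random matrix with independent N(0, 1) entries can be generated as follows:

# one realization of a random vector X ∈ R^n and a random matrix Y ∈ R^(n×p)
n, p = 4, 3
X = randn(n)      # entries X_1, ..., X_n drawn independently from N(0, 1)
Y = randn(n, p)   # entries Y_ij drawn independently from N(0, 1)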
Definition 13 (Statistical inference)
Statistical inference is a collection of methods that deal with drawing conclusions about the model on the basis of the observed data (see Fig. 1).
23 / 72
Definition 14 (Classical statistics)
Let x be the outcome of a random vector X described by a probabilistic model that depends on an unknown parameter θ, where θ is assumed to be fixed. Classical statistics is then the methodology for estimating and drawing inferences about the parameter θ.
24 / 72
Definition 15 (Bayesian statistics)
Let x be the outcome of a random vector X described by a probabilistic model that depends on an unknown parameter θ, where θ is assumed to be random. Bayesian statistics is then the methodology for estimating and drawing inferences about the parameter θ, in which inference about θ is carried out by analyzing the conditional pdf f(θ | x).
25 / 72
Example 2 (Biased coin)
We throw a coin 1000 times and observe 570 Heads. Using this
information, what can we say about the “fairness” of the coin?
The data (or better, datum) here is the number x = 570. Suppose
we view x as the outcome of a random variable X which
describes the number of Heads in 1000 tosses. Our statistical
model (See Fig 1) is then
X ∼ Bin(1000, p)
26 / 72
Remark 1
Any statement about the fairness of the coin is expressed in
terms of p and is assessed via this model.
It is important to understand that p will never be known.
A common-sense estimate of p is simply the proportion of
Heads, x/1000 = 0.570.
27 / 72
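The common-sense estimate is also the value of p that maximizes the Bin(1000, p) likelihood of the observed count; a short base-Julia sketch (the grid resolution is an arbitrary choice) confirms this numerically:

# grid search for the p maximizing the binomial log-likelihood of x = 570 heads in n = 1000 tosses
x, n = 570, 1000
loglik(p) = x * log(p) + (n - x) * log(1 - p)   # log-likelihood up to an additive constant
grid = 0.001:0.001:0.999
println(grid[argmax(loglik.(grid))])             # 0.57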
Bayesian statistics is a branch of statistics that is centered around
Bayes’ formula.
Theorem 1 (Bayes’ Rule)
Let B_1, B_2, . . . , B_n be a partition of the sample space Ω and let A be an event with P(A) > 0. Then

P(B_j | A) = \frac{P(A | B_j) P(B_j)}{\sum_{i=1}^{n} P(A | B_i) P(B_i)}.   (1)
28 / 72
Corollary 1.1
For continuous random variables X and Y, Bayes' theorem is formulated in terms of densities:

f(y | x) = \frac{f(x | y) f(y)}{\int f(x | y) f(y) dy} ∝ f(x | y) · f(y),   (2)

where f(y | x) is the posterior, f(x | y) the likelihood, and f(y) the prior, with the shorthand

f(y) := f_Y(y),   f(x | y) := f_{X|Y}(x | y),   f(y | x) := f_{Y|X}(y | x),

and the likelihood function is l(x | y) = f(x | y).
29 / 72
Definition 16 (Prior, Likelihood, and Posterior)
Let x and θ denote the data and the parameters in a Bayesian statistical model. Then f(θ) is the prior pdf, f(x | θ) is the likelihood, and the posterior pdf f(θ | x) satisfies

f(θ | x) ∝ f(x | θ) f(θ).
30 / 72
Remark 2 (Bayesian Statistical Inference)
In Bayesian inference, the unknown quantity of interest is treated as a random variable Y, for which we choose a prior distribution:
f_Y(y), if Y is continuous,
P_Y(y), if Y is discrete.
31 / 72
Remark 2 (Cont.)
After observing the value of the random variable X, we find the
posterior distribution of Y . This is the conditional PDF (or PMF)
of Y given X = x,
f_{Y|X}(y | x) or P_{Y|X}(y | x).
32 / 72
Fact 2
If X ∼ Uniform[a, b], then

f_X(x) = \frac{1}{b − a} for a ≤ x ≤ b, and f_X(x) = 0 otherwise.

If X ∼ Geometric(p), then

P_X(x) = (1 − p)^{x−1} · p,   x = 1, 2, 3, . . .
33 / 72
Fact 3 (Law of Total Probability)
Let A be an event and let B_1, B_2, . . . , B_n be a partition of Ω. Then

P(A) = \sum_{i=1}^{n} P(A | B_i) P(B_i).
34 / 72
Example 3
Let X ∼ Uniform(0, 1), and suppose that given X = x, Y ∼ Geometric(x). We observe Y = 2 and wish to find the posterior density of X. By Bayes' rule,

f_{X|Y}(x | 2) = \frac{P_{Y|X}(2 | x) f_X(x)}{P_Y(2)},

and, from the geometric pmf,

P_{Y|X}(2 | x) = x(1 − x).
35 / 72
Example 3 (cont.)
To find P_Y(2), we can use the law of total probability:

P_Y(2) = \int_{−∞}^{∞} P_{Y|X}(2 | x) f_X(x) dx = \int_{0}^{1} x(1 − x) · 1 dx = \frac{1}{6}.

Therefore, we obtain

f_{X|Y}(x | 2) = \frac{x(1 − x) · 1}{1/6} = 6x(1 − x),   for 0 ≤ x ≤ 1.
#
36 / 72
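As a numerical check on Example 3 (a base-Julia sketch), normalizing x(1 − x) on [0, 1] reproduces the constant 6 found above:

# numerical normalization of the unnormalized posterior P_{Y|X}(2|x) f_X(x) = x(1 - x) on [0, 1]
unnorm(x) = x * (1 - x)
xs = range(0.0, 1.0; length = 100_001)
Z = sum(unnorm, xs) * step(xs)        # ≈ 1/6
posterior(x) = unnorm(x) / Z          # ≈ 6x(1 - x)
println((Z, posterior(0.5)))          # ≈ (0.1667, 1.5)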
The posterior distribution, f_{X|Y}(x | y) (or P_{X|Y}(x | y)), contains all the knowledge about the unknown quantity X. Therefore, we can use the posterior distribution to find point or interval estimates of X.
Definition 17 (MAP)
Let f_{X|Y}(x | y) be a posterior distribution. Then the Maximum A Posteriori (MAP) estimate of X is defined as

x̂_MAP = arg max_x f_{X|Y}(x | y).
37 / 72
Since f_{X|Y}(x | y) = \frac{f_{Y|X}(y | x) f_X(x)}{f_Y(y)} and f_Y(y) does not depend on x, we have the following.
Finding x̂_MAP: to find the MAP estimate of X given that we have observed Y = y, we find the value of x that maximizes

f_{Y|X}(y | x) f_X(x).
38 / 72
Example 4
Let X be a continuous random variable with prior pdf f_X(x) = 2x for 0 ≤ x ≤ 1, and suppose that given X = x, Y ∼ Geometric(x). Find the MAP estimate of X given that Y = 3 is observed.
39 / 72
Example 4 (cont.)
For Y = 3, it follows that

P_{Y|X}(3 | x) = x(1 − x)^2.
40 / 72
Example 4 (cont.)
The MAP estimate maximizes P_{Y|X}(3 | x) f_X(x) = 2x^2(1 − x)^2, so we set the derivative equal to zero:

\frac{d}{dx} x^2(1 − x)^2 = 2x(1 − x)^2 − 2(1 − x)x^2 = 0.   (6)

Solving for x (and discarding the endpoints x = 0 and x = 1, where the posterior vanishes), one obtains

x̂_MAP = \frac{1}{2}.
#
41 / 72
Julia Program 2 (Find the solution of (6) and x̂_MAP)
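A possible SymPy.jl implementation (the original listing is not reproduced here): solve (6) symbolically and select the stationary point that maximizes the objective.

using SymPy                                  # symbolic computation via SymPy.jl
x = symbols("x", real = true)
g = x^2 * (1 - x)^2                          # proportional to P_{Y|X}(3 | x) f_X(x)
stationary = solve(diff(g, x), x)            # roots of (6): 0, 1/2, 1
vals = [N(subs(g, x => s)) for s in stationary]
println(stationary[argmax(vals)])            # 1/2, i.e. the MAP estimate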
42 / 72
We now consider a classical statistical inference method, called the maximum likelihood method, for finding a point estimator that maximizes the probability of the observed data Y_n = (Y_1, Y_2, . . . , Y_n).
Definition 18 (Likelihood function)
l(y_n; x) = l(y_1, y_2, . . . , y_n; x)
= P_{Y|X}(y_1, y_2, . . . , y_n | x), if Y is a discrete r.v.,   (7)
= f_{Y|X}(y_1, y_2, . . . , y_n | x), if Y is a continuous r.v.,
43 / 72
Definition 18 (cont.)
where P_{Y|X}(y_1, y_2, . . . , y_n | x) and f_{Y|X}(y_1, y_2, . . . , y_n | x) are the joint pmf and the joint pdf, respectively, evaluated at the observed values when the parameter value is x.
44 / 72
Remark 3
45 / 72
Example 5
Suppose that each individual in a population has one of three genotypes, with probabilities

genotype:      AA        Aa             aa
probability:   θ^2       2θ(1 − θ)      (1 − θ)^2

In a random sample we observe k_1 individuals of genotype AA, k_2 of genotype Aa, and k_3 of genotype aa. Find the maximum likelihood estimate (MLE) of θ.
46 / 72
Example 5 (cont.)
Solution. The likelihood function is given by
P(k_1, k_2, k_3 | θ) = \binom{k_1 + k_2 + k_3}{k_1} \binom{k_2 + k_3}{k_2} \binom{k_3}{k_3} θ^{2k_1} (2θ(1 − θ))^{k_2} (1 − θ)^{2k_3}.   (8)
47 / 72
Example 5 (cont.)
Taking the logarithm of (8) and noting that the binomial coefficients do not depend on θ, the log-likelihood is, up to a constant, (2k_1 + k_2) ln θ + (k_2 + 2k_3) ln(1 − θ). We set its derivative equal to zero:

\frac{2k_1 + k_2}{θ} − \frac{k_2 + 2k_3}{1 − θ} = 0.

Solving for θ, we find the MLE

θ̂ = \frac{2k_1 + k_2}{2k_1 + 2k_2 + 2k_3}.
#
48 / 72
Julia Program 3 (Find θ̂, the MLE from (8))
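A possible SymPy.jl implementation (the original listing is not reproduced here): maximize the log-likelihood of (8); the binomial coefficients do not depend on θ and are dropped.

using SymPy
θ, k1, k2, k3 = symbols("theta k_1 k_2 k_3", positive = true)
logL = (2k1 + k2) * log(θ) + (k2 + 2k3) * log(1 - θ)   # log-likelihood of (8) up to a constant
println(solve(diff(logL, θ), θ))                        # θ̂ = (2k1 + k2)/(2k1 + 2k2 + 2k3)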
49 / 72
Example 6
Suppose that the signal X ∼ N(0, σ_X^2) is transmitted over a communication channel. Assume that the received signal is given by

Y = X + W,

where W ∼ N(0, σ_W^2) is independent of X.
1. Find the ML estimate of X, given that Y = y is observed.
2. Find the MAP estimate of X, given that Y = y is observed.
50 / 72
Example 6 (cont.)
Solution. The pdf of X is

f_X(x) = \frac{1}{\sqrt{2π} σ_X} e^{−x^2/(2σ_X^2)}.

Since (Y | X = x) ∼ N(x, σ_W^2), the conditional pdf is

f_{Y|X}(y | x) = \frac{1}{\sqrt{2π} σ_W} e^{−(y−x)^2/(2σ_W^2)}.
51 / 72
Example 6 (cont.)
1. The ML estimate of X, given Y = y, is the value of x that maximizes

f_{Y|X}(y | x) = \frac{1}{\sqrt{2π} σ_W} e^{−(y−x)^2/(2σ_W^2)}.

To maximize this function, we should minimize (y − x)^2. Therefore, we conclude

x̂_ML = y.

2. The MAP estimate of X, given Y = y, is the value of x that maximizes

f_{Y|X}(y | x) f_X(x) = c exp\left( −\left[ \frac{(y − x)^2}{2σ_W^2} + \frac{x^2}{2σ_X^2} \right] \right),
52 / 72
Example 6 (cont.)
where c is a constant. To maximize this function, we should minimize

\frac{(y − x)^2}{2σ_W^2} + \frac{x^2}{2σ_X^2}.   (9)

By differentiation, we obtain the MAP estimate of x as

x̂_MAP = \frac{σ_X^2}{σ_X^2 + σ_W^2} y.
53 / 72
Julia Program 4 (Find the MAP estimate of x from (9))
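A possible SymPy.jl sketch (the original listing is not reproduced here): minimize (9) by solving the first-order condition.

using SymPy
x, y = symbols("x y", real = true)
σX, σW = symbols("sigma_X sigma_W", positive = true)
J = (y - x)^2 / (2σW^2) + x^2 / (2σX^2)    # objective (9)
println(solve(diff(J, x), x))               # x̂_MAP = σ_X^2 y / (σ_X^2 + σ_W^2)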
54 / 72
Example 7 (Bayesian Inference for Coin Toss Experiment)
We revisit the coin toss experiment from a Bayesian viewpoint. Let x = (x_1, . . . , x_n) record the outcomes of n coin tosses, with x_i = 1 for Heads and x_i = 0 for Tails, let s = x_1 + · · · + x_n be the number of Heads, and model the probability of Heads θ as a random variable.
55 / 72
Example 7 (cont.)
Let the prior pdf f(θ) be the uniform prior f(θ) = 1, 0 ≤ θ ≤ 1. We assume that, conditional on θ, the {x_i} are independent and Ber(θ) distributed. Thus, the Bayesian likelihood is

f(x | θ) = \prod_{i=1}^{n} θ^{x_i} (1 − θ)^{1−x_i} = θ^s (1 − θ)^{n−s},

and therefore the posterior pdf is

f(θ | x) = c θ^s (1 − θ)^{n−s},   0 ≤ θ ≤ 1.
56 / 72
Example 7 (cont.)
This is the pdf of the Beta(s + 1, n − s + 1) distribution. The normalization constant is c = (n + 1) \binom{n}{s}. The graph of the posterior pdf for n = 100 and s = 1 is given in Fig. 3.
57 / 72
Example 7 (cont.)
A Bayesian confidence interval, called a credible interval, for θ is
formed by taking the appropriate quantiles of the posterior pdf. As
an example, suppose that n = 100 and s = 1. Then, a left
one-sided 95% credible interval for θ is [0, 0.0461], where 0.0461
is the 0.95 quantile of the Beta(2, 100) distribution.
58 / 72
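The quantile can be checked numerically; the sketch below assumes that the Distributions.jl package is available.

using Distributions
n, s = 100, 1
posterior = Beta(s + 1, n - s + 1)        # Beta(2, 100) posterior under the uniform prior
println(quantile(posterior, 0.95))        # ≈ 0.0461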
Definition 20 (Bayesian Network)
Mathematically, a Bayesian network is a directed acyclic graph,
that is, a collection of vertices (nodes) and arcs (arrows between
nodes) such that arcs, when put head-to-tail, do not create loops.
59 / 72
Remark 4
The directed graphs in (a) and (b) are acyclic. Graph (c) has a (directed) cycle and can therefore not represent a Bayesian network.
60 / 72
A Bayesian network determines a factorization of the joint pdf into conditional pdfs, one per node given its parents P_j:

f(x_1, . . . , x_n) = \prod_{j=1}^{n} f(x_j | P_j).
61 / 72
Example 8
Consider the models shown in Fig. 5.
62 / 72
Example 8 (cont.)
The left panel of Fig. 5 shows a classical statistical model with random variables x_1, . . . , x_5 and fixed parameters θ_1 and θ_2.
63 / 72
Example 9 (Applied Bayesian Networks)
Graph representation: the network has arcs S → F, S → H, F → C, H → C, and F → M, corresponding to the factorization below.
64 / 72
Example 9 (cont.)
Factorization:
P(S, F, H, C, M) = P(S) · P(F | S) · P(H | S) · P(C | F, H) · P(M | F).
65 / 72
Example 10 (Belief Nets)
66 / 72
Example 10 (cont.)
67 / 72
Example 10 (cont.)
Solution.
The belief net in Fig. 8 shows the prior probabilities of
smoking and age, the conditional probabilities of heart disease
given age and smoking, and the conditional probabilities of
chest pains and shortness of breath given heart disease.
Suppose a person experiences chest pains and shortness of breath, but we do not know her/his age or whether she/he smokes. How likely is it that she/he has heart disease?
68 / 72
Example 10 (cont.)
Define the variables s (smoking), a (age), h (heart disease), c (chest pains), and b (shortness of breath). We assume that s and a are independent. Let “Yes” be denoted by “Y” and “No” by “N”. We wish to calculate

P(h = Yes | b = Yes, c = Yes) := P(h = Y | b = Y, c = Y).

From the Bayesian network structure, we see that the joint pdf of s, a, h, c and b can be written as

f(s, a, h, c, b) = f(s) f(a) f(h | s, a) f(c | h) f(b | h).

It follows that

f(h | b, c) ∝ f(c | h) f(b | h) \sum_{a,s} f(h | s, a) f(s) f(a) = f(c | h) f(b | h) f(h),

where the sum over a and s is the marginal pdf f(h).
69 / 72
Example 10 (cont.)
We have
P(h = Y) = P(h = Y | s = Y, a ≤ 50) · P(s = Y) · P(a ≤ 50)
         + P(h = Y | s = Y, a > 50) · P(s = Y) · P(a > 50)
         + P(h = Y | s = N, a ≤ 50) · P(s = N) · P(a ≤ 50)
         + P(h = Y | s = N, a > 50) · P(s = N) · P(a > 50)
       = 0.2 × 0.3 × 0.6 + 0.4 × 0.3 × 0.4 + 0.05 × 0.7 × 0.6 + 0.15 × 0.7 × 0.4 = 0.147.
70 / 72
Example 10 (cont.)
Consequently,
P(h = Y | b = Y, c = Y) = β · P(c = Y | h = Y) · P(b = Y | h = Y) · P(h = Y)
                        = β × 0.2 × 0.3 × 0.147 = 0.00882 β,
and
P(h = N | b = Y, c = Y) = β · P(c = Y | h = N) · P(b = Y | h = N) · P(h = N)
                        = β × 0.01 × 0.1 × (1 − 0.147) = 0.000853 β,
71 / 72
Example 10 (cont.)
for some normalization constant β. Thus,

P(h = Yes | b = Yes, c = Yes) = \frac{0.00882}{0.00882 + 0.000853} = 0.911816 ≈ 0.91.
72 / 72
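As a closing check of Example 10 (a base-Julia sketch; all probability values are taken from the worked solution above):

P_s = Dict(:Y => 0.3, :N => 0.7)                       # P(smoking)
P_a = Dict(:le50 => 0.6, :gt50 => 0.4)                 # P(age ≤ 50), P(age > 50)
P_h = Dict((:Y, :le50) => 0.2, (:Y, :gt50) => 0.4,     # P(h = Y | s, a)
           (:N, :le50) => 0.05, (:N, :gt50) => 0.15)
Pc = Dict(:Y => 0.2, :N => 0.01)                       # P(c = Y | h)
Pb = Dict(:Y => 0.3, :N => 0.1)                        # P(b = Y | h)
# marginal P(h = Y) by the law of total probability
PhY = sum(P_h[(s, a)] * P_s[s] * P_a[a] for s in (:Y, :N), a in (:le50, :gt50))
uY = Pc[:Y] * Pb[:Y] * PhY                             # ∝ P(h = Y | b = Y, c = Y)
uN = Pc[:N] * Pb[:N] * (1 - PhY)                       # ∝ P(h = N | b = Y, c = Y)
println(uY / (uY + uN))                                # ≈ 0.9118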