Estimation and Detection: Lecture 6: The Bayesian Philosophy

This lecture discusses estimating parameters from a Bayesian perspective when the parameter of interest is a random variable rather than a deterministic constant. The Bayesian approach allows incorporating prior knowledge about the parameter, which can be useful when the minimum variance unbiased estimator is difficult to find. Bayesian mean square error is defined differently than classical mean square error by considering the joint probability of the data and the parameter rather than just the probability of the data given the parameter.



Estimation and Detection


Lecture 6: The Bayesian Philosophy

Dr. ir. Richard C. Hendriks – 2/12/2016



Summary of previous lectures


Estimation of a deterministic parameter:

• Typical signal model: x[n] = A + w[n]

• Find the estimator Â.

– Find the MVUE (Ch. 2), using e.g.

∗ the Cramér-Rao bound (Ch. 3),

∗ sufficient statistics (Ch. 5, not covered).

– Find the BLUE: if the data is Gaussian distributed, the BLUE is the MVUE.

– Find the MLE:

∗ For large data records, the MLE is efficient and leads (asymptotically) to the MVUE.
∗ For a linear data model and Gaussian noise, the MLE is efficient and is the MVUE (also for
limited data records).


This lecture

Previous lectures: the parameter of interest (A) was assumed to be a deterministic (unknown)
constant.

This and the next two lectures: the parameter of interest is a realisation of a random variable.

This "Bayesian approach" has two advantages:

• Prior knowledge on A can be incorporated.

• The Bayesian approach can be useful when the MVU estimator is difficult to find.

Example 1 (1)
Example: estimation of the mean

x[n] = A + w[n], \quad n = 0, \ldots, N-1, \quad w[n] \sim \mathcal{N}(0, \sigma^2).

In lecture 2 (Ch. 3) we have seen that the MVU estimator can be derived to be the sample
mean estimator:
\hat{A} = \frac{1}{N} \sum_{n=0}^{N-1} x[n]
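To make the example concrete, here is a minimal numerical sketch (added for illustration, not part of the original slides), assuming NumPy is available; the values A = 1.5, σ = 1 and N = 50 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
A_true, sigma, N = 1.5, 1.0, 50            # arbitrary example values

w = rng.normal(0.0, sigma, size=N)         # w[n] ~ N(0, sigma^2)
x = A_true + w                             # x[n] = A + w[n]

A_hat = x.mean()                           # sample-mean (MVU) estimator
print(f"true A = {A_true}, estimate A_hat = {A_hat:.3f}")
```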


Example 1 (2)
p(x; A) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[ -\frac{1}{2\sigma^2} \sum_{n=0}^{N-1} (x[n] - A)^2 \right]

The CRB:

\frac{\partial \ln p(x; A)}{\partial A}
= \frac{\partial}{\partial A} \left[ -\ln\left[ (2\pi\sigma^2)^{N/2} \right] - \frac{1}{2\sigma^2} \sum_{n=0}^{N-1} (x[n] - A)^2 \right]
= \frac{1}{\sigma^2} \sum_{n=0}^{N-1} (x[n] - A)
= \underbrace{\frac{N}{\sigma^2}}_{I(A)} \left( \frac{1}{N} \sum_{n=0}^{N-1} x[n] - A \right)

\frac{\partial^2 \ln p(x; A)}{\partial A^2} = -\frac{N}{\sigma^2}

Hence var(Â) ≥ σ²/N, and the MVU estimator is \hat{A} = \frac{1}{N} \sum_{n=0}^{N-1} x[n].
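As a quick sanity check of the bound, the following sketch (added, not from the slides; NumPy assumed, parameter values arbitrary) estimates the variance of the sample-mean estimator by Monte Carlo and compares it with the CRB σ²/N.

```python
import numpy as np

rng = np.random.default_rng(1)
A_true, sigma, N, trials = 1.5, 1.0, 50, 20000   # arbitrary example values

# One sample-mean estimate per Monte Carlo trial.
x = A_true + rng.normal(0.0, sigma, size=(trials, N))
A_hat = x.mean(axis=1)

print("empirical var(A_hat):", A_hat.var())
print("CRB sigma^2 / N     :", sigma**2 / N)     # the two should be close
```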

Example 1 (3)
p(x; A) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[ -\frac{1}{2\sigma^2} \sum_{n=0}^{N-1} (x[n] - A)^2 \right]

MVU: \hat{A} = \frac{1}{N} \sum_{n=0}^{N-1} x[n]

The MLE:

\frac{\partial \ln p(x; A)}{\partial A}
= \frac{\partial}{\partial A} \left[ -\ln\left[ (2\pi\sigma^2)^{N/2} \right] - \frac{1}{2\sigma^2} \sum_{n=0}^{N-1} (x[n] - A)^2 \right]
= \frac{1}{\sigma^2} \sum_{n=0}^{N-1} (x[n] - A) = 0

\Rightarrow \quad \hat{A} = \frac{1}{N} \sum_{n=0}^{N-1} x[n]


Example 1 (4)
Example: estimation of the mean

x[n] = A + w[n], \quad n = 0, \ldots, N-1, \quad w[n] \sim \mathcal{N}(0, \sigma^2).

In lecture 2 (Ch. 3) we have seen that the MVU estimator can be derived to be the sample
mean estimator:

\hat{A} = \frac{1}{N} \sum_{n=0}^{N-1} x[n]

Let us now assume that -A_0 \leq A \leq A_0. Without any further information about the pdf of A,
let's assume A is uniformly distributed on this interval.

How to take this information into account? The Bayesian approach!

This lecture
[Block diagram] Classical approach: a fixed A plus noise w[n] produces x[n] for n = 0, 1, 2, ..., N-1.
Bayesian approach: A is first selected according to a prior pdf on [-A0, A0] (knowledge on how A
is chosen), and then x[n] = A + w[n] is observed for n = 0, 1, 2, ..., N-1.

Bayesian versus Classical MSE


The Bayesian MSE is defined as

\mathrm{Bmse}(\hat{A}) = E[(\hat{A}(x) - A)^2] = \int\!\!\int (\hat{A}(x) - A)^2 \, p(x, A) \, dx \, dA

Remember that the classical MSE is defined as

\mathrm{mse}(\hat{A}) = E[(\hat{A}(x) - A)^2] = \int (\hat{A}(x) - A)^2 \, p(x; A) \, dx

So it is a completely different philosophy, although the form of the estimators is sometimes
the same.

Remember that for the classical MSE we also derived:

\mathrm{mse}(\hat{A}) = E\left[ (\hat{A} - E(\hat{A}))^2 \right] + (E(\hat{A}) - A)^2 = \underbrace{\mathrm{var}(\hat{A})}_{\text{variance}} + \underbrace{(E(\hat{A}) - A)^2}_{\text{squared bias}}
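The following sketch (added for illustration; NumPy assumed, all parameter values arbitrary) makes the definitional difference explicit: the classical MSE averages over p(x; A) with A held fixed, while the Bayesian MSE also averages over the prior on A. For the sample-mean estimator both happen to come out near σ²/N, which is exactly the point that the philosophies differ even when the numbers do not.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, N, trials = 1.0, 10, 50000            # arbitrary example values
estimator = lambda x: x.mean(axis=1)         # sample-mean estimator A_hat(x)

# Classical mse: A is a fixed constant, only the noise (and hence x) is random.
A_fixed = 0.8
x = A_fixed + rng.normal(0.0, sigma, size=(trials, N))
mse = np.mean((estimator(x) - A_fixed) ** 2)

# Bayesian Bmse: A is drawn anew in every trial, so the average is over p(x, A).
A_rand = rng.uniform(-2.0, 2.0, size=trials)             # some assumed prior on A
x = A_rand[:, None] + rng.normal(0.0, sigma, size=(trials, N))
bmse = np.mean((estimator(x) - A_rand) ** 2)

print("classical mse (fixed A):", mse)       # about sigma^2 / N
print("Bayesian Bmse          :", bmse)      # also about sigma^2 / N for this estimator
```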

The Bayesian Philosophy


• θ is viewed as a random variable and we must estimate its particular realization. This
allows us to use prior knowledge about θ, i.e., its prior pdf p(θ). Again, we would like
to minimize the MSE

\mathrm{Bmse}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]

but this time both x and θ are random, and the statistics of the estimator depend on the
statistics of both x and θ.

• Note the difference between these two MSEs:

\mathrm{mse}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = \int (\hat{\theta} - \theta)^2 \, p(x; \theta) \, dx

\mathrm{Bmse}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = \int\!\!\int (\hat{\theta} - \theta)^2 \, p(x, \theta) \, dx \, d\theta

Whereas the first MSE depends on the particular value of θ, the second does not; it depends only on the statistics of θ.


Minimum Mean Square Error Estimator


We know that p(x, θ) = p(θ|x)p(x), so that

\mathrm{Bmse}(\hat{\theta}) = \int \left[ \int (\hat{\theta} - \theta)^2 \, p(\theta|x) \, d\theta \right] p(x) \, dx

Since p(x) ≥ 0 for all x, we have to minimize the inner integral for each x.

Problem: \min_{\hat{\theta}} \int (\hat{\theta} - \theta)^2 \, p(\theta|x) \, d\theta

Solution: the mean of the posterior pdf of θ: \hat{\theta} = E(\theta|x) = \int \theta \, p(\theta|x) \, d\theta

Proof: Setting the derivative with respect to θ̂ to zero, we obtain:

\frac{\partial}{\partial \hat{\theta}} \int (\hat{\theta} - \theta)^2 \, p(\theta|x) \, d\theta = \int 2 (\hat{\theta} - \theta) \, p(\theta|x) \, d\theta = 2\hat{\theta} - 2 \int \theta \, p(\theta|x) \, d\theta = 0

The a posteriori pdf p(θ|x): the pdf after observing the data.
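The claim that the posterior mean minimizes the inner integral can be checked numerically. The sketch below (added; NumPy assumed, the Gaussian-shaped posterior is an arbitrary example) evaluates the Bayes risk on a grid of candidate estimates and compares the minimizer with E(θ|x).

```python
import numpy as np

# Grid over theta and an arbitrary example posterior p(theta|x) (Gaussian-shaped here):
theta = np.linspace(-5.0, 5.0, 2001)
dtheta = theta[1] - theta[0]
post = np.exp(-0.5 * (theta - 1.2) ** 2 / 0.3)   # unnormalized shape
post /= post.sum() * dtheta                      # normalize on the grid

def risk(theta_hat):
    """Inner integral: int (theta_hat - theta)^2 p(theta|x) dtheta."""
    return np.sum((theta_hat - theta) ** 2 * post) * dtheta

candidates = np.linspace(-3.0, 3.0, 601)
best = candidates[np.argmin([risk(c) for c in candidates])]
posterior_mean = np.sum(theta * post) * dtheta

print("risk minimizer on the grid:", best)
print("posterior mean E(theta|x) :", posterior_mean)   # the two should agree
```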

Minimum Mean Square Error Estimator


Remarks:

• In contrast to the MVU estimator, the MMSE estimator always exists.

• The MMSE estimator has a smaller average MSE (Bayesian MSE) than the MVU estimator, but the
MMSE estimator is biased, whereas the MVU estimator is unbiased.

Bayesian versus deterministic framework:

• Deterministic: only use the observed data x and/or its distribution.

• Bayesian: use both the observed data x and prior information on θ.


The final estimator will trade off the observed data x and prior information on θ. This
depends on how concentrated the prior pdf is, and on how much data is used.


Effect of conditional information


• The effect of observing data will concentrate (make narrower) the pdf of A: additional
knowledge reduces the uncertainty on A.

• Hence,

p(A|x) = \frac{p(x|A) p(A)}{p(x)} = \frac{p(x|A) p(A)}{\int p(x|A) p(A) \, dA}

where the prior pdf "truncates" p(x|A) and the denominator is a normalization.

• Notice the difference between p(x|A) and p(x; A).

[Figure: three stacked plots of p(A) (the uniform prior), p(x|A) (the likelihood), and the product
p(x|A)p(A), illustrating how the prior truncates the likelihood.]
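The truncation effect can be reproduced numerically. Below is a small grid-based sketch (added; NumPy assumed, all values arbitrary) that builds p(A|x) ∝ p(x|A)p(A) for a uniform prior on [-A0, A0], in the spirit of the plots above.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, N, A0 = 1.0, 5, 3.0                      # arbitrary example values
x = 1.0 + rng.normal(0.0, sigma, size=N)        # data from x[n] = A + w[n]

A = np.linspace(-10.0, 10.0, 4001)              # grid over A
dA = A[1] - A[0]

prior = np.where(np.abs(A) <= A0, 1.0 / (2 * A0), 0.0)                 # uniform prior p(A)
loglik = -0.5 / sigma**2 * ((x[:, None] - A[None, :]) ** 2).sum(axis=0)
lik = np.exp(loglik - loglik.max())                                    # p(x|A) up to a constant

posterior = prior * lik
posterior /= posterior.sum() * dA               # p(A|x) = p(x|A)p(A) / p(x)

print("posterior mass outside [-A0, A0]:", posterior[np.abs(A) > A0].sum() * dA)  # exactly 0
print("posterior mean E(A|x)           :", (A * posterior).sum() * dA)
```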

Minimum Mean Square Error Estimator

[Figures: plots of a joint pdf p_{Y,Z}(y, z), used to illustrate the MMSE estimator as the mean of
the conditional pdf.]


Minimum Mean Square Error Estimator

Since the MMSE estimator is Â = E(A|x), we need to find p(A|x).

Using Bayes' rule, we obtain

p(A|x) = \frac{p(x|A) p(A)}{p(x)} = \frac{p(x|A) p(A)}{\int p(x|A) p(A) \, dA}

where p(A) is the prior pdf and the denominator is a normalization that makes sure p(A|x) integrates to one.

• p(A) follows from prior information.

• p(x|A) follows from one additional assumption and the signal model:

– Assume the noise process w[n] and A are independent.

– Use the signal model: x[n] = A + w[n].

– It then follows that p_x(x[n]|A) = p_w(x[n] - A|A) = p_w(x[n] - A).

The Bayesian approach


Bayesian approach:

• The parameter to be estimated is assumed to be a realisation of a random variable θ.

• Our knowledge about the unknown parameter is summarised in the density

p(\theta|x) = \frac{p(\theta) p(x|\theta)}{p(x)}.

• The Bayesian MSE minimises the MSE over all realisations of θ and x.

• Bayesian MMSE: \hat{\theta} = E(\theta|x) = \int_{\theta} \theta \, p(\theta|x) \, d\theta.

• The MMSE estimator thus depends on both the prior knowledge (via p(θ)) and the data (via
p(x|θ)).

• Remember: classical approaches only use p(x; θ).


Prior Knowledge Versus Data


The posterior pdf in the Bayesian approach summarises the available information:

p(\theta|x) = \frac{p(\theta) p(x|\theta)}{p(x)}.

• If the prior knowledge is rather weak compared to the data (p(θ) much wider than p(x|θ)),
then the estimator will rely primarily on the data and ignore the prior knowledge.

• When the prior knowledge is rather strong compared to the data (p(θ) much narrower
than p(x|θ)), the estimator will be biased towards the mean of the prior pdf.

• The Bayesian estimator thus always compromises between the prior and the data.

• Increasing the amount of data typically makes p(x|θ) more concentrated, resulting in
an estimator that relies more and more on the data.

Example 1 - Bayesian approach (1)


Example: estimation of the mean

x[n] = A + w[n], \quad n = 0, \ldots, N-1, \quad w[n] \sim \mathcal{N}(0, \sigma^2), \quad A \sim \mathcal{U}(-A_0, A_0)

Conditional pdf:

p(x[n]|A) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2} (x[n] - A)^2 \right]

\Rightarrow \quad p(x|A) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[ -\frac{1}{2\sigma^2} \sum_{n=0}^{N-1} (x[n] - A)^2 \right]

and

p(A) = \begin{cases} \frac{1}{2A_0}, & |A| \leq A_0 \\ 0, & |A| > A_0 \end{cases}


Example 1 - Bayesian approach (2)


Altogether, the a posteriori pdf becomes:

p(A|x) = \begin{cases}
\dfrac{ \frac{1}{2A_0} \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[ -\frac{1}{2\sigma^2} \sum_{n=0}^{N-1} (x[n] - A)^2 \right] }
{ \displaystyle\int_{-A_0}^{A_0} \frac{1}{2A_0} \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[ -\frac{1}{2\sigma^2} \sum_{n=0}^{N-1} (x[n] - A)^2 \right] dA }, & |A| \leq A_0 \\
0, & |A| > A_0
\end{cases}

We can write \sum_{n=0}^{N-1} (x[n] - A)^2 as

\sum_{n=0}^{N-1} (x[n] - A)^2 = \sum_{n=0}^{N-1} x^2[n] - 2 N A \bar{x} + N A^2 = N (A - \bar{x})^2 + \sum_{n=0}^{N-1} x^2[n] - N \bar{x}^2,

which can be substituted above to simplify the expression.

Example 1 - Bayesian approach (3)


Leading to the simplified expression
p(A|x) = \begin{cases}
\dfrac{ \frac{1}{\sqrt{2\pi\sigma^2/N}} \exp\left[ -\frac{1}{2\sigma^2/N} (A - \bar{x})^2 \right] }
{ \displaystyle\int_{-A_0}^{A_0} \frac{1}{\sqrt{2\pi\sigma^2/N}} \exp\left[ -\frac{1}{2\sigma^2/N} (A - \bar{x})^2 \right] dA }, & |A| \leq A_0 \\
0, & |A| > A_0
\end{cases}

This looks like a truncated Gaussian, and the final MMSE estimate is given by

\hat{A} = \int_{-\infty}^{\infty} A \, p(A|x) \, dA
= \dfrac{ \displaystyle\int_{-A_0}^{A_0} A \, \frac{1}{\sqrt{2\pi\sigma^2/N}} \exp\left[ -\frac{1}{2\sigma^2/N} (A - \bar{x})^2 \right] dA }
{ \displaystyle\int_{-A_0}^{A_0} \frac{1}{\sqrt{2\pi\sigma^2/N}} \exp\left[ -\frac{1}{2\sigma^2/N} (A - \bar{x})^2 \right] dA }

Truncation occurs if the variance σ²/N of this Gaussian density is large compared to A0.


Example 1 - Bayesian approach (4)

\hat{A} = \int_{-\infty}^{\infty} A \, p(A|x) \, dA
= \dfrac{ \displaystyle\int_{-A_0}^{A_0} A \, \frac{1}{\sqrt{2\pi\sigma^2/N}} \exp\left[ -\frac{1}{2\sigma^2/N} (A - \bar{x})^2 \right] dA }
{ \displaystyle\int_{-A_0}^{A_0} \frac{1}{\sqrt{2\pi\sigma^2/N}} \exp\left[ -\frac{1}{2\sigma^2/N} (A - \bar{x})^2 \right] dA }

In the absence of data (if x gives no information about A), Â is given by:

\hat{A} = \int_{-\infty}^{\infty} A \, p(A|x) \, dA \approx \dfrac{ \displaystyle\int_{-A_0}^{A_0} A \, dA }{ \displaystyle\int_{-A_0}^{A_0} dA } = 0.

If A0 is much larger than √(σ²/N) (effectively no truncation), e.g., because N increases, Â
gets closer to x̄:

\hat{A} = \int_{-\infty}^{\infty} A \, p(A|x) \, dA \approx \lim_{A_0 \to \infty} \dfrac{ \displaystyle\int_{-A_0}^{A_0} A \, \frac{1}{\sqrt{2\pi\sigma^2/N}} \exp\left[ -\frac{1}{2\sigma^2/N} (A - \bar{x})^2 \right] dA }{ \displaystyle\int_{-A_0}^{A_0} \frac{1}{\sqrt{2\pi\sigma^2/N}} \exp\left[ -\frac{1}{2\sigma^2/N} (A - \bar{x})^2 \right] dA } = \bar{x}.
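Since Â has no closed form here, it is natural to evaluate the ratio of integrals numerically. The sketch below (added, not from the slides; NumPy assumed, the helper name and all parameter values are my own choices) uses simple grid integration and reproduces the two limiting cases above.

```python
import numpy as np

def mmse_uniform_prior(xbar, sigma2_over_N, A0, grid=20001):
    """MMSE estimate of A under the prior U(-A0, A0): the mean of a Gaussian
    N(xbar, sigma^2/N) truncated to [-A0, A0], computed by numerical integration."""
    A = np.linspace(-A0, A0, grid)
    g = np.exp(-0.5 * (A - xbar) ** 2 / sigma2_over_N)   # unnormalized Gaussian on the grid
    return np.sum(A * g) / np.sum(g)                     # ratio of the two integrals (dA cancels)

xbar = 0.8
print(mmse_uniform_prior(xbar, sigma2_over_N=100.0, A0=1.0))   # weak data: close to 0
print(mmse_uniform_prior(xbar, sigma2_over_N=0.01,  A0=1.0))   # strong data: close to xbar
print(mmse_uniform_prior(xbar, sigma2_over_N=1.0,   A0=50.0))  # wide prior: close to xbar
```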

Example 2 - Gaussian prior

Example: estimation of the mean

x[n] = A + w[n], \quad n = 0, \ldots, N-1, \quad w[n] \sim \mathcal{N}(0, \sigma^2), \quad A \sim \mathcal{N}(\mu_A, \sigma_A^2)

Conditional pdf:

p(x[n]|A) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2} (x[n] - A)^2 \right]

\Rightarrow \quad p(x|A) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[ -\frac{1}{2\sigma^2} \sum_{n=0}^{N-1} (x[n] - A)^2 \right]

and

p(A) = \frac{1}{\sqrt{2\pi\sigma_A^2}} \exp\left[ -\frac{1}{2\sigma_A^2} (A - \mu_A)^2 \right]

Since both p(x|A) and p(A) are now Gaussian, the a posteriori pdf p(A|x) will also be
Gaussian:

p(A|x) = \frac{1}{\sqrt{2\pi\sigma_{A|x}^2}} \exp\left[ -\frac{1}{2\sigma_{A|x}^2} (A - \mu_{A|x})^2 \right]

with

\sigma_{A|x}^2 = \frac{1}{\frac{N}{\sigma^2} + \frac{1}{\sigma_A^2}} \quad \text{and} \quad \mu_{A|x} = \left( \frac{N}{\sigma^2}\bar{x} + \frac{\mu_A}{\sigma_A^2} \right) \sigma_{A|x}^2


Example 2 - Gaussian prior

\hat{A} = E(A|x) = \mu_{A|x} = \frac{ \frac{N}{\sigma^2}\bar{x} + \frac{\mu_A}{\sigma_A^2} }{ \frac{N}{\sigma^2} + \frac{1}{\sigma_A^2} } = \frac{ \sigma_A^2 \bar{x} + \frac{\sigma^2}{N} \mu_A }{ \sigma_A^2 + \frac{\sigma^2}{N} }

Using \alpha = \frac{\sigma_A^2}{\sigma_A^2 + \sigma^2/N}, this can be rewritten as:

\hat{A} = \alpha \bar{x} + (1 - \alpha) \mu_A

• α expresses the interplay between the prior knowledge (µA) and the data (x̄).

• When N is small or the data is noisy (large σ²), σ_A² ≪ σ²/N and Â ≈ µA.

• The larger N (more data samples) or the less noisy the data (smaller σ²), the closer α gets to
one and Â ≈ x̄.

• Also notice that the larger the data record (increasing N), the narrower the a posteriori
pdf (and the less uncertainty), as \sigma_{A|x}^2 = \frac{1}{\frac{N}{\sigma^2} + \frac{1}{\sigma_A^2}}.
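A direct implementation of this estimator (added for illustration; NumPy assumed, the function name and parameter values are my own) shows the trade-off: for small N the estimate is pulled towards the prior mean µA, and for large N it approaches the sample mean.

```python
import numpy as np

def bayesian_mean_estimate(x, sigma2, mu_A, sigma2_A):
    """MMSE estimate of A for x[n] = A + w[n], w[n] ~ N(0, sigma2), A ~ N(mu_A, sigma2_A)."""
    N = len(x)
    alpha = sigma2_A / (sigma2_A + sigma2 / N)         # weight on the data
    return alpha * np.mean(x) + (1.0 - alpha) * mu_A   # A_hat = alpha*xbar + (1-alpha)*mu_A

rng = np.random.default_rng(4)
sigma2, mu_A, sigma2_A = 1.0, 0.0, 0.25                # arbitrary example values
A = rng.normal(mu_A, np.sqrt(sigma2_A))                # one realisation of A

for N in (2, 10, 100, 10000):
    x = A + rng.normal(0.0, np.sqrt(sigma2), size=N)
    print(N, bayesian_mean_estimate(x, sigma2, mu_A, sigma2_A), np.mean(x))
# For growing N the Bayesian estimate and the plain sample mean converge.
```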

Example 2 - Gaussian prior

\hat{A} = E(A|x) = \mu_{A|x} = \frac{ \frac{N}{\sigma^2}\bar{x} + \frac{\mu_A}{\sigma_A^2} }{ \frac{N}{\sigma^2} + \frac{1}{\sigma_A^2} }

p(A|x) = \frac{1}{\sqrt{2\pi\sigma_{A|x}^2}} \exp\left[ -\frac{1}{2\sigma_{A|x}^2} (A - \mu_{A|x})^2 \right]

with \sigma_{A|x}^2 = \frac{1}{\frac{N}{\sigma^2} + \frac{1}{\sigma_A^2}} and \mu_{A|x} = \left( \frac{N}{\sigma^2}\bar{x} + \frac{\mu_A}{\sigma_A^2} \right) \sigma_{A|x}^2.

• If N → ∞, then Â → x̄.

• When there is no prior knowledge (σ_A² → ∞), then Â → x̄ (= the classical estimator).


Example 2 - Gaussian prior – MSE versus BMSE

Remember the original statement: using prior knowledge we can improve the estimation
accuracy.

\mathrm{Bmse}(\hat{A}) = E[(A - \hat{A})^2] = \int\!\!\int (A - E[A|x])^2 \, p(x, A) \, dx \, dA
= \int \left[ \int (A - E[A|x])^2 \, p(A|x) \, dA \right] p(x) \, dx
= \int \mathrm{var}[A|x] \, p(x) \, dx
= \int \frac{1}{\frac{N}{\sigma^2} + \frac{1}{\sigma_A^2}} \, p(x) \, dx = \frac{1}{\frac{N}{\sigma^2} + \frac{1}{\sigma_A^2}} = \frac{\sigma_A^2 \frac{\sigma^2}{N}}{\sigma_A^2 + \frac{\sigma^2}{N}}

Hence,

\mathrm{Bmse}(\hat{A}) = \left( \frac{\sigma_A^2}{\sigma_A^2 + \frac{\sigma^2}{N}} \right) \frac{\sigma^2}{N} < \underbrace{\frac{\sigma^2}{N}}_{\text{CRB for classical estimators}} = \mathrm{mse}(\hat{A})
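The inequality can be checked by Monte Carlo simulation. The sketch below (added; NumPy assumed, all values arbitrary) draws A from its prior in every trial and compares the empirical Bayesian MSEs of the MMSE estimator and the sample mean with the closed-form expressions.

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2, mu_A, sigma2_A, N, trials = 1.0, 0.0, 0.25, 10, 200000   # arbitrary values
alpha = sigma2_A / (sigma2_A + sigma2 / N)

A = rng.normal(mu_A, np.sqrt(sigma2_A), size=trials)             # A ~ N(mu_A, sigma_A^2)
x = A[:, None] + rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
xbar = x.mean(axis=1)
A_mmse = alpha * xbar + (1.0 - alpha) * mu_A                     # Bayesian MMSE estimator

print("Bmse of MMSE estimator (empirical):", np.mean((A_mmse - A) ** 2))
print("Bmse of MMSE estimator (theory)   :", sigma2_A * (sigma2 / N) / (sigma2_A + sigma2 / N))
print("Bmse of sample mean    (empirical):", np.mean((xbar - A) ** 2))
print("sigma^2 / N                       :", sigma2 / N)
```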

Generalization of results for jointly Gaussian pdfs


If x and y are jointly Gaussian, where x is k × 1 and y is l × 1, with joint mean and covariance
matrix

E\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} E(x) \\ E(y) \end{bmatrix}, \qquad
C = \begin{bmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{bmatrix}

so that

p(x, y) = \frac{1}{\sqrt{(2\pi)^{k+l} \det(C)}} \exp\left[ -\frac{1}{2} \begin{bmatrix} x^T - E(x)^T & y^T - E(y)^T \end{bmatrix} \begin{bmatrix} C_{xx} & C_{xy} \\ C_{yx} & C_{yy} \end{bmatrix}^{-1} \begin{bmatrix} x - E(x) \\ y - E(y) \end{bmatrix} \right]

then the conditional pdf p(y|x) is also Gaussian with mean and covariance matrix

E(y|x) = E(y) + C_{yx} C_{xx}^{-1} (x - E(x))

C_{y|x} = C_{yy} - C_{yx} C_{xx}^{-1} C_{xy}
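These two formulas translate directly into code. Below is a small sketch (added; NumPy assumed, the function name and the numbers in the example are my own) that computes the conditional mean and covariance for a jointly Gaussian pair.

```python
import numpy as np

def gaussian_conditional(mu_x, mu_y, Cxx, Cxy, Cyx, Cyy, x):
    """Mean and covariance of p(y|x) for jointly Gaussian (x, y)."""
    K = Cyx @ np.linalg.inv(Cxx)          # Cyx Cxx^{-1}
    mean = mu_y + K @ (x - mu_x)          # E(y|x) = E(y) + Cyx Cxx^{-1} (x - E(x))
    cov = Cyy - K @ Cxy                   # C_{y|x} = Cyy - Cyx Cxx^{-1} Cxy
    return mean, cov

# Tiny example with k = 2, l = 1 (arbitrary numbers, with Cyx = Cxy^T):
mu_x, mu_y = np.zeros(2), np.zeros(1)
Cxx = np.array([[2.0, 0.5], [0.5, 1.0]])
Cxy = np.array([[0.3], [0.2]])
Cyy = np.array([[1.5]])

mean, cov = gaussian_conditional(mu_x, mu_y, Cxx, Cxy, Cxy.T, Cyy, x=np.array([1.0, -0.5]))
print(mean, cov)
```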


Example: Estimation of the mean (1)


Let us now assume that the prior distribution of A is Gaussian, A \sim \mathcal{N}(0, \sigma_A^2), and that w[n] is
white Gaussian noise, i.e., w[n] \sim \mathcal{N}(0, \sigma^2) for n = 0, ..., N-1, so that

x = \mathbf{1} A + w.

Then we can say that x and A are jointly Gaussian (k = N and l = 1), with zero mean and
covariance matrix

C_{x,A} = E\left[ \begin{bmatrix} x \\ A \end{bmatrix} \begin{bmatrix} x^T, & A \end{bmatrix} \right] = \begin{bmatrix} \sigma_A^2 \mathbf{1}\mathbf{1}^T + \sigma^2 I & \sigma_A^2 \mathbf{1} \\ \sigma_A^2 \mathbf{1}^T & \sigma_A^2 \end{bmatrix}

Example: Estimation of the mean (2)

Remember:

C_{x,A} = E\left[ \begin{bmatrix} x \\ A \end{bmatrix} \begin{bmatrix} x^T, & A \end{bmatrix} \right] = \begin{bmatrix} \sigma_A^2 \mathbf{1}\mathbf{1}^T + \sigma^2 I & \sigma_A^2 \mathbf{1} \\ \sigma_A^2 \mathbf{1}^T & \sigma_A^2 \end{bmatrix}, \qquad E\begin{bmatrix} x \\ A \end{bmatrix} = 0

and, for jointly Gaussian x and y,

E(y|x) = E(y) + C_{yx} C_{xx}^{-1} (x - E(x)), \qquad C_{y|x} = C_{yy} - C_{yx} C_{xx}^{-1} C_{xy}.

As a result we have

E(A|x) = \sigma_A^2 \mathbf{1}^T (\sigma_A^2 \mathbf{1}\mathbf{1}^T + \sigma^2 I)^{-1} x

C_{A|x} = \sigma_A^2 - \sigma_A^4 \mathbf{1}^T (\sigma_A^2 \mathbf{1}\mathbf{1}^T + \sigma^2 I)^{-1} \mathbf{1}

Using the matrix inversion lemma (MIL)

(A + BCD)^{-1} = A^{-1} - A^{-1} B (C^{-1} + D A^{-1} B)^{-1} D A^{-1}

\sigma_A^2 \mathbf{1}^T (\sigma_A^2 \mathbf{1}\mathbf{1}^T + \sigma^2 I)^{-1} = (\sigma_A^{-2} + \sigma^{-2} \mathbf{1}^T \mathbf{1})^{-1} \sigma^{-2} \mathbf{1}^T

we can compute

E(A|x) = \sigma_A^2 \mathbf{1}^T (\sigma_A^2 \mathbf{1}\mathbf{1}^T + \sigma^2 I)^{-1} x = (\sigma_A^{-2} + \sigma^{-2} \mathbf{1}^T \mathbf{1})^{-1} \sigma^{-2} \mathbf{1}^T x


Example: Estimation of the mean (3)


Using the earlier equality from the MIL, we get

E(A|x) = (\sigma_A^{-2} + \sigma^{-2} \mathbf{1}^T \mathbf{1})^{-1} \sigma^{-2} \mathbf{1}^T x = \frac{ \frac{N}{\sigma^2} \bar{x} }{ \frac{1}{\sigma_A^2} + \frac{N}{\sigma^2} }

C_{A|x} = \sigma_A^2 \left[ 1 - (\sigma_A^{-2} + \sigma^{-2} \mathbf{1}^T \mathbf{1})^{-1} \sigma^{-2} \mathbf{1}^T \mathbf{1} \right] = \frac{1}{ \frac{1}{\sigma_A^2} + \frac{N}{\sigma^2} }

So finally, we obtain the following closed-form expressions:

\hat{A} = E(A|x) = \frac{ \frac{N}{\sigma^2} \bar{x} }{ \frac{1}{\sigma_A^2} + \frac{N}{\sigma^2} }

\mathrm{Bmse}(\hat{A}) = \int\!\!\int (A - \hat{A})^2 \, p(x, A) \, dx \, dA = \int \left[ \int (A - E(A|x))^2 \, p(A|x) \, dA \right] p(x) \, dx
= \int C_{A|x} \, p(x) \, dx = \frac{1}{ \frac{1}{\sigma_A^2} + \frac{N}{\sigma^2} } = \frac{ \sigma_A^2 \frac{\sigma^2}{N} }{ \sigma_A^2 + \frac{\sigma^2}{N} }

General Linear Gaussian Model

The above example can be generalized to the linear Gaussian model:

x = H\theta + w, \qquad w \sim \mathcal{N}(0, C)

with θ a random vector with distribution \mathcal{N}(\mu_\theta, C_\theta).

In that case p(θ|x) is also Gaussian, with mean and covariance matrix

E(\theta|x) = \mu_\theta + C_\theta H^T (H C_\theta H^T + C)^{-1} (x - H\mu_\theta)

C_{\theta|x} = C_\theta - C_\theta H^T (H C_\theta H^T + C)^{-1} H C_\theta
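A direct implementation of the general result (added; NumPy assumed, the function name and the example numbers are my own) is given below; with H = 1 (a column of ones), C = σ²I and a scalar θ = A it reduces to the scalar example of the previous slides.

```python
import numpy as np

def linear_gaussian_posterior(x, H, C, mu_theta, C_theta):
    """Posterior mean and covariance of theta for x = H theta + w, w ~ N(0, C),
    theta ~ N(mu_theta, C_theta)."""
    S = H @ C_theta @ H.T + C                     # H C_theta H^T + C
    K = C_theta @ H.T @ np.linalg.inv(S)          # C_theta H^T (H C_theta H^T + C)^{-1}
    mean = mu_theta + K @ (x - H @ mu_theta)      # E(theta|x)
    cov = C_theta - K @ H @ C_theta               # C_{theta|x}
    return mean, cov

# Scalar special case: H = 1 (ones vector), theta = A ~ N(0, sigma_A^2), C = sigma^2 I.
rng = np.random.default_rng(6)
N, sigma2, sigma2_A = 10, 1.0, 0.25
H = np.ones((N, 1))
x = H @ np.array([0.7]) + rng.normal(0.0, np.sqrt(sigma2), size=N)

mean, cov = linear_gaussian_posterior(x, H, sigma2 * np.eye(N), np.zeros(1), np.array([[sigma2_A]]))
print(mean, cov)    # should match (N xbar / sigma^2) / (1/sigma_A^2 + N/sigma^2), etc.
```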


Bayesian Estimator for Deterministic Parameters


The Bayesian approach is meant for estimating the realisation of a random variable.

However, it is sometimes also used to estimate a deterministic parameter, for example
when the MVU estimator does not exist.

• The Bayesian MMSE estimator always exists (although it is sometimes difficult to calculate
analytically).

• The MMSE estimator provides an estimator that "works well on average".

• For one particular parameter θ (when deterministic) it might not perform well.

Bayesian Estimator for Deterministic Parameters


Remember:

Example: estimation of the mean

x[n] = A + w[n], \quad n = 0, \ldots, N-1, \quad w[n] \sim \mathcal{N}(0, \sigma^2), \quad A \sim \mathcal{N}(\mu_A, \sigma_A^2)

p(x|A) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left[ -\frac{1}{2\sigma^2} \sum_{n=0}^{N-1} (x[n] - A)^2 \right]
\quad \text{and} \quad
p(A) = \frac{1}{\sqrt{2\pi\sigma_A^2}} \exp\left[ -\frac{1}{2\sigma_A^2} (A - \mu_A)^2 \right]

Since both p(x|A) and p(A) are now Gaussian, the a posteriori pdf p(A|x) will also be
Gaussian:

p(A|x) = \frac{1}{\sqrt{2\pi\sigma_{A|x}^2}} \exp\left[ -\frac{1}{2\sigma_{A|x}^2} (A - \mu_{A|x})^2 \right]

with

\sigma_{A|x}^2 = \frac{1}{\frac{N}{\sigma^2} + \frac{1}{\sigma_A^2}} \quad \text{and} \quad \mu_{A|x} = \left( \frac{N}{\sigma^2}\bar{x} + \frac{\mu_A}{\sigma_A^2} \right) \sigma_{A|x}^2


Bayesian Estimator for Deterministic Parameters


\hat{A} = E(A|x) = \mu_{A|x} = \frac{ \frac{N}{\sigma^2}\bar{x} + \frac{\mu_A}{\sigma_A^2} }{ \frac{N}{\sigma^2} + \frac{1}{\sigma_A^2} }

Using \alpha = \frac{\sigma_A^2}{\sigma_A^2 + \sigma^2/N}, this can be rewritten as:

\hat{A} = \alpha \bar{x} + (1 - \alpha) \mu_A

Imagine that we have used this Bayesian estimator for A, while A was in fact deterministic.
In that case we can evaluate the (classical) MSE:

\mathrm{MSE}(\hat{A}) = \mathrm{var}(\hat{A}) + b^2(\hat{A}) = \mathrm{var}(\hat{A}) + (E[\hat{A}] - A)^2 = \alpha^2 \mathrm{var}(\bar{x}) + (\alpha A + (1 - \alpha)\mu_A - A)^2
= \alpha^2 \frac{\sigma^2}{N} + (1 - \alpha)^2 (A - \mu_A)^2

Bayesian Estimator for Deterministic Parameters


\hat{A} = E(A|x) = \mu_{A|x} = \frac{ \frac{N}{\sigma^2}\bar{x} + \frac{\mu_A}{\sigma_A^2} }{ \frac{N}{\sigma^2} + \frac{1}{\sigma_A^2} }

Using \alpha = \frac{\sigma_A^2}{\sigma_A^2 + \sigma^2/N}, this can be rewritten as:

\hat{A} = \alpha \bar{x} + (1 - \alpha) \mu_A

Imagine that we have used this Bayesian estimator for A, while A was in fact deterministic.
In that case we can evaluate the (classical) MSE:

\mathrm{MSE}(\hat{A}) = \mathrm{var}(\hat{A}) + b^2(\hat{A}) = \mathrm{var}(\hat{A}) + (E[\hat{A}] - A)^2 = \alpha^2 \mathrm{var}(\bar{x}) + (\alpha A + (1 - \alpha)\mu_A - A)^2
= \alpha^2 \frac{\sigma^2}{N} + (1 - \alpha)^2 (A - \mu_A)^2

Compare this to the classical deterministic (MVU) estimator \hat{A}_C = \bar{x}, for which \mathrm{MSE}(\bar{x}) = \frac{\sigma^2}{N}.
Only if A is close to the prior mean µA is MSE(Â) < MSE(Â_C), as the Bayesian estimator trades off a bias for
a lower variance.
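The comparison is easy to tabulate. The sketch below (added; NumPy assumed, parameter values arbitrary) evaluates both MSE expressions over a range of true (deterministic) values of A.

```python
import numpy as np

sigma2, N, mu_A, sigma2_A = 1.0, 10, 0.0, 0.25       # arbitrary example values
alpha = sigma2_A / (sigma2_A + sigma2 / N)

A = np.linspace(-2.0, 2.0, 9)                        # candidate true deterministic values of A
mse_bayes = alpha**2 * sigma2 / N + (1 - alpha)**2 * (A - mu_A)**2
mse_classical = np.full_like(A, sigma2 / N)          # MSE of the sample mean

for a, mb, mc in zip(A, mse_bayes, mse_classical):
    print(f"A = {a:5.2f}:  MSE_Bayes = {mb:.4f}   MSE_classical = {mc:.4f}")
# The Bayesian estimator wins only when A lies close enough to the prior mean mu_A.
```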


Bayesian Estimator for Deterministic Parameters


• Using the Bayesian estimator for the deterministic parameter reduces the variance, as
0 < α < 1.

• However, it can substantially increase the bias if A is NOT close to µA, in which case it is a
very poor estimator.

• The Bayesian MSE is the smallest on average. It trades off bias for variance in
order to improve the overall Bayesian MSE (using prior knowledge on A). That is,
only if A is really random, we have

\mathrm{Bmse}(\hat{A}) = E_A[\mathrm{mse}(\hat{A})] = \alpha^2 \frac{\sigma^2}{N} + (1 - \alpha)^2 E_A[(A - \mu_A)^2]
= \alpha^2 \frac{\sigma^2}{N} + (1 - \alpha)^2 \sigma_A^2 = \frac{\sigma_A^2}{\sigma_A^2 + \frac{\sigma^2}{N}} \, \frac{\sigma^2}{N} < \frac{\sigma^2}{N}
