
Essential Questions For The Exam 2018, AMCS 308, Stochastic Methods in Engineering

1. The document provides essential questions for an exam on stochastic methods in engineering. 2. It covers key concepts like the Wiener process, the Central Limit Theorem, Monte Carlo methods, acceptance-rejection sampling, and variance reduction techniques. 3. Variance reduction techniques like control variates aim to reduce the variance of Monte Carlo estimates without introducing bias, improving computational efficiency.

Essential Questions for the Exam 2018, AMCS 308,

Stochastic Methods in Engineering

May 31, 2018

1. Formulate the basic properties of a Wiener process.

The one-dimensional Wiener process is a mapping $W : [0, \infty) \times \Omega \to \mathbb{R}$, where $\Omega$ is a probability space with an associated probability measure. This process satisfies the following properties:
• $W_0 = 0$.
• The mapping $t \mapsto W_t$ is almost surely continuous on $[0, \infty)$.
• Consider the time mesh $0 = t_0 < t_1 < t_2 < \dots < t_n = T$. Then the increments $W_{t_n} - W_{t_{n-1}}, \dots, W_{t_1} - W_{t_0}$ are independent and normally distributed random variables: $W_t - W_s \sim N(0, t - s)$ for $0 \le s < t$.
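The three properties translate directly into a sampler on a time mesh. The following is a minimal sketch, not part of the original notes; the function name `sample_wiener_path` and the uniform mesh are illustrative assumptions.

```python
import math
import random

def sample_wiener_path(times, rng):
    """Sample W at the given increasing times; times[0] is assumed to be 0."""
    path = [0.0]  # W_0 = 0
    for t_prev, t_next in zip(times, times[1:]):
        # Independent increment W_t - W_s ~ N(0, t - s); gauss() takes a std dev.
        path.append(path[-1] + rng.gauss(0.0, math.sqrt(t_next - t_prev)))
    return path

rng = random.Random(0)
mesh = [i / 100 for i in range(101)]  # hypothetical mesh on [0, 1]
path = sample_wiener_path(mesh, rng)
```

Refining the mesh only adds more independent Gaussian increments, so the values at the coarse mesh points keep the same distribution.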
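2. State and derive the Central Limit Theorem. State the Berry-Esseen Theorem.
The Central Limit Theorem: Let $X_1, \dots, X_n$ be IID with mean $\mu$ and variance $\sigma^2$. Let $\bar{X}_n = n^{-1} \sum_{i=1}^n X_i$. Then
\[ Z_n = \frac{\bar{X}_n - \mu}{\sqrt{V[\bar{X}_n]}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \to Z \]
in distribution, where $Z \sim N(0, 1)$.
Proof: Let $Y_i = (X_i - \mu)/\sigma$. Then $Z_n = n^{-1/2} \sum_{i=1}^n Y_i$. If $\phi_{Y_i}(t)$ is the moment-generating function (MGF) of $Y_i$, assumed finite near $t = 0$, then $(\phi_{Y_i}(t))^n$ is the MGF of $\sum_{i=1}^n Y_i$ and $\phi_{Z_n}(t) = (\phi_{Y_i}(t/\sqrt{n}))^n$ is the MGF of $Z_n$. We have
\[ \phi'_{Y_i}(0) = E[Y_i] = 0, \qquad \phi''_{Y_i}(0) = E[Y_i^2] = V[Y_i] = 1. \]
Then
\[ \begin{aligned}
\phi_{Y_i}(t) &= \phi_{Y_i}(0) + t \phi'_{Y_i}(0) + \frac{t^2}{2!} \phi''_{Y_i}(0) + \frac{t^3}{3!} \phi'''_{Y_i}(0) + \dots \\
&= 1 + 0 + \frac{t^2}{2!} + \frac{t^3}{3!} \phi'''_{Y_i}(0) + \dots \\
&= 1 + \frac{t^2}{2!} + \frac{t^3}{3!} \phi'''_{Y_i}(0) + \dots,
\end{aligned} \]
and then
\[ \phi_{Z_n}(t) = \left[ \phi_{Y_i}\left( \frac{t}{\sqrt{n}} \right) \right]^n = \left[ 1 + \frac{t^2}{2!\,n} + \frac{t^3}{3!\,n^{3/2}} \phi'''_{Y_i}(0) + \dots \right]^n \to e^{t^2/2}, \]
which is the MGF of a $N(0, 1)$.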
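The Berry-Esseen inequality: Let $F$ be the CDF of the normalized sample mean of $M$ IID samples of $Y$, let $\sigma_Y$ be the standard deviation of $Y$, and define
\[ \lambda = \frac{\left( E[|Y - EY|^3] \right)^{1/3}}{\sigma_Y} < \infty. \tag{1} \]
Then the inequality is given by
\[ |F(x) - \Phi(x)| \le \frac{c^* \lambda^3}{(1 + |x|)^3 \sqrt{M}}, \tag{2} \]
where $\Phi$ is the CDF of a standard normal and $c^*$ is a constant.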

3. Show how to apply the Monte Carlo method to compute an integral and discuss the corresponding error.
Consider
\[ \int_0^1 f(x)\,dx \approx \frac{1}{M} \sum_{i=1}^M f(x_i) =: V_M, \tag{3} \]
where $x_i \sim U[0, 1]$ are IID samples.
The variance is given by
\[ V\left[ \frac{1}{M} \sum_{i=1}^M f(x_i) \right] = \frac{1}{M^2} \sum_{i=1}^M V[f(x_i)] = \frac{V[f(x_i)]}{M}. \tag{4} \]
Then by the CLT, as the samples are IID, asymptotically we have that
\[ \frac{V_M - E[V_M]}{\sqrt{V[V_M]}} = \sqrt{M}\, \frac{V_M - E[V_M]}{\sqrt{V[f(x_i)]}} \sim N(0, 1). \tag{5} \]
Equivalently,
\[ V_M - E[V_M] \sim N\left( 0, \frac{V[f(x_i)]}{M} \right), \tag{6} \]
and so we can build the asymptotic confidence intervals
\[ V_M \pm C_\alpha \sqrt{\frac{V[f(x_i)]}{M}}, \tag{7} \]
where $C_\alpha$ is an appropriate percentile of the standard normal.
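As an illustration of (3) and (7), the sketch below estimates an integral on $[0, 1]$ together with the half-width of its asymptotic confidence interval. It is a sketch under assumptions: the name `mc_integrate`, the default `c_alpha=1.96` (roughly a 95% interval) and the test integrand $x^2$ are choices made here, and the unknown $V[f(x_i)]$ is replaced by the sample variance.

```python
import math
import random

def mc_integrate(f, num_samples, rng, c_alpha=1.96):
    """Return (V_M, half-width of the asymptotic confidence interval)."""
    samples = [f(rng.random()) for _ in range(num_samples)]
    mean = sum(samples) / num_samples
    # Sample variance stands in for the unknown V[f(x_i)] in (7).
    var = sum((s - mean) ** 2 for s in samples) / (num_samples - 1)
    return mean, c_alpha * math.sqrt(var / num_samples)

# Hypothetical integrand: the exact value of the integral of x^2 on [0, 1] is 1/3.
estimate, half_width = mc_integrate(lambda x: x * x, 10_000, random.Random(1))
```

Quadrupling `num_samples` should roughly halve `half_width`, which is the $M^{-1/2}$ rate in (7).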

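4. Describe and give all the proofs on the sampling method of acceptance-rejection.
The method is as follows. We choose a proposal density $\eta$ that shares the support of our target density $f$ and covers it, i.e. there is an $\epsilon \in (0, 1]$ such that $\epsilon f(x) \le \eta(x)$ for all $x$. We sample $x \sim \eta$ and, independently, $U \sim U[0, 1]$. Then
\[ \begin{cases} \epsilon \dfrac{f(x)}{\eta(x)} < U & \text{reject the sample,} \\[2mm] \epsilon \dfrac{f(x)}{\eta(x)} \ge U & \text{accept the sample.} \end{cases} \tag{8} \]
Proof: The probability of accepting a single proposal is
\[ P\left( U_k \le \epsilon \frac{f(x_k)}{\eta(x_k)} \right) = \int \int_0^{\epsilon f(x)/\eta(x)} du\, \eta(x)\, dx = \int \epsilon \frac{f(x)}{\eta(x)} \eta(x)\, dx = \epsilon \int f(x)\, dx = \epsilon. \tag{9} \]
Let $x_k$ be the first accepted sample. We would like to show that $x_k$ is from the density $f$. Consider
\[ \begin{aligned}
P(x_k \in B) &= \sum_{k \ge 1} P\left( x_k \in B,\ U_k \le \epsilon \frac{f(x_k)}{\eta(x_k)} \right) \prod_{m=1}^{k-1} \underbrace{P\left( U_m > \epsilon \frac{f(x_m)}{\eta(x_m)} \right)}_{= 1 - \epsilon} \\
&= P\left( x_k \in B,\ U_k \le \epsilon \frac{f(x_k)}{\eta(x_k)} \right) \underbrace{\sum_{k \ge 1} (1 - \epsilon)^{k-1}}_{= 1/\epsilon} \quad \text{(since the proposals are IID).}
\end{aligned} \tag{10} \]
And we have
\[ P\left( x_k \in B,\ U_k \le \epsilon \frac{f(x_k)}{\eta(x_k)} \right) = \int_B \int_0^{\epsilon f(x)/\eta(x)} du\, \eta(x)\, dx = \epsilon \int_B \frac{f(x)}{\eta(x)} \eta(x)\, dx = \epsilon \int_B f(x)\, dx. \tag{11} \]
Hence,
\[ P\left( x_k \in B \,\Big|\, U_k \le \epsilon \frac{f(x_k)}{\eta(x_k)} \right) = \int_B f(x)\, dx. \tag{12} \]
And so the samples are from the density $f$ given that we have accepted them.

We look at the expected number $K$ of samples drawn from the proposal until one sample is accepted:
\[ E[K] = \sum_{k \ge 1} k P(K = k) = \sum_{k \ge 1} k (1 - \epsilon)^{k-1} \epsilon = \frac{1}{\epsilon}. \tag{13} \]
We see that
\[ E[K] \to \infty \ \text{as}\ \epsilon \to 0, \qquad E[K] \to 1 \ \text{as}\ \epsilon \to 1. \tag{14} \]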
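The acceptance rule (8) and the expected cost (13) can be checked numerically. The sketch below is illustrative only: the target $f(x) = 2x$ on $[0, 1]$, the uniform proposal and the value $\epsilon = 1/2$ (so that $\epsilon f \le \eta$) are assumptions chosen for the example, as is the function name `accept_reject`.

```python
import random

def accept_reject(f, sample_eta, eta, eps, rng):
    """Return (accepted sample, number of proposals used), following rule (8)."""
    trials = 0
    while True:
        trials += 1
        x = sample_eta(rng)          # proposal x ~ eta
        u = rng.random()             # U ~ U[0, 1], independent of x
        if u <= eps * f(x) / eta(x): # accept when eps * f / eta >= U
            return x, trials

rng = random.Random(7)
draws = [
    accept_reject(lambda x: 2.0 * x, lambda r: r.random(), lambda x: 1.0, 0.5, rng)
    for _ in range(5000)
]
mean_x = sum(x for x, _ in draws) / len(draws)       # E[X] = 2/3 under f(x) = 2x
mean_trials = sum(k for _, k in draws) / len(draws)  # E[K] = 1/eps = 2 by (13)
```

A proposal that hugs the target more closely allows a larger $\epsilon$ and hence fewer wasted proposals.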

5. Motivate the use of variance reduction techniques. Describe two techniques and discuss their computational efficiency.

Table 1: Comparison between MC and MC with Control Variates
Method           | #Samples                     | Computational Work
MC               | $(c/TOL)^2 V[Y]$             | $(c/TOL)^2 V[Y]$
Control Variates | $(c/TOL)^2 V[V_m(\beta^*)]$  | $(1 + \rho)(c/TOL)^2 V[V_m(\beta^*)]$
In Table 1, $V[V_m(\beta^*)]$ is to be read as the per-sample variance $V[Y - \beta^* X]$, i.e. with $M = 1$.

As we have seen in question (3), the Monte Carlo error is asymptotically given by the CLT, since the samples are IID:
\[ \left| E[Y] - \frac{1}{M} \sum_{j=1}^M Y_j \right| \approx C_\alpha \sqrt{\frac{V[Y]}{M}}, \tag{15} \]
where $C_\alpha$ is an appropriate percentile of the standard normal. We would like to reduce the variance without introducing bias.
(a) Control Variates: We would like to estimate $EY$. We sample $Y$ and another auxiliary RV $X$, for which $EX$ is given and such that $X$ and $Y$ are highly correlated. For a given $\beta > 0$, we consider the unbiased estimator
\[ V_m = \frac{1}{M} \sum_{j=1}^M Y_j - \frac{\beta}{M} \sum_{j=1}^M (X_j - EX). \tag{16} \]
We observe that
\[ V[V_m](\beta) = \frac{1}{M} V[Y - \beta X] = \frac{1}{M} \left( V[Y] + \beta^2 V[X] - 2 \beta\, Cov[X, Y] \right). \tag{17} \]
We find the minimizing $\beta$,
\[ \beta^* = \frac{Cov(Y, X)}{V[X]}. \tag{18} \]
Then
\[ V[V_m](\beta^*) = \frac{1}{M} V[Y] \left( 1 - \frac{Cov(Y, X)^2}{V[X] V[Y]} \right) = \frac{1}{M} V[Y] \left( 1 - Cor(X, Y)^2 \right) \le \frac{1}{M} V[Y]. \tag{19} \]
Assume that the work to sample $(X, Y)$ is $(1 + \rho)$ times the work of sampling $Y$ alone.
From Table 1, we see that control variates are a good choice only if the increase in computational work is outweighed by the benefit of the variance reduction. From (19), we obtain the following computational work inequality:
\[ (1 + \rho)(1 - Cor(X, Y)^2) \le 1 \iff Cor(X, Y)^2 \ge \frac{\rho}{1 + \rho}. \tag{20} \]
This shows exactly how high the correlation between $Y$ and the auxiliary variable $X$ must be in order to benefit from control variates computationally. With $V[V_m(\beta^*)]$ read as the per-sample variance ($M = 1$), the condition can also be stated as
\[ V[Y] \ge (1 + \rho) V[V_m(\beta^*)]. \tag{21} \]

(b) Antithetic variates: Let $Y = g(X)$, where $X$ has a symmetric distribution and so $EX = 0$. Then $X$ and $-X$ are identically distributed, which means that $E g(X) = E g(-X)$ and
\[ EY = E\left[ \frac{g(X) + g(-X)}{2} \right]. \tag{22} \]
We then have the estimator
\[ V_m = \frac{1}{M} \sum_{j=1}^M \frac{g(X_j) + g(-X_j)}{2}, \tag{23} \]
which has variance
\[ V[V_m] = \frac{V[g(X) + g(-X)]}{4M}. \tag{24} \]
Assume that the extra evaluation of $g(-X)$ has the same cost as evaluating $g(X)$. Then we require that
\[ 2 V[V_1] \le V[g(X)] \implies V[g(X) + g(-X)] \le 2 V[g(X)] \tag{25} \]
to effectively reduce the variance: for the same work, the variance of the antithetic estimator must be smaller than that of plain Monte Carlo on $Y = g(X)$ alone. Note that by the formula for the variance of a sum, it is sufficient that the covariance term is negative,
\[ Cov[g(X), g(-X)] < 0. \tag{26} \]
Note that for linear $g$, for which $g(-X) = -g(X)$, we have $V[g(X) + g(-X)] = 0$, as
\[ V[g(X) + g(-X)] = V[g(X) - g(X)] = 2 V[g(X)] + 2 \underbrace{Cov(g(X), -g(X))}_{= -V[g(X)]} = 0. \tag{27} \]
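Both techniques can be compared on one toy problem. The sketch below is not from the notes: it assumes $X \sim N(0, 1)$ and $g(x) = e^x$ (so $EY = e^{1/2}$), uses $X$ itself as the control variate with known $EX = 0$, and estimates $\beta^*$ in (18) from the same samples, which introduces a small bias that a pilot run would avoid.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def covariance(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

rng = random.Random(2)
xs = [rng.gauss(0.0, 1.0) for _ in range(20_000)]  # X ~ N(0, 1), EX = 0 is known
ys = [math.exp(x) for x in xs]                     # Y = g(X) = exp(X)

# Control variates: beta* = Cov(Y, X) / V[X], estimated from the samples.
beta = covariance(ys, xs) / variance(xs)
cv_samples = [y - beta * (x - 0.0) for x, y in zip(xs, ys)]

# Antithetic variates: average g over the pair (X, -X).
anti_samples = [0.5 * (math.exp(x) + math.exp(-x)) for x in xs]

var_plain = variance(ys)           # per-sample variance of plain MC
var_cv = variance(cv_samples)      # per-sample variance with the control variate
var_anti = variance(anti_samples)  # per-pair variance of the antithetic estimator
```

Since each antithetic sample costs two evaluations of $g$, the fair comparison in (25) is `2 * var_anti` against `var_plain`.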

6. Define the Poisson counting process and explain how to sample it.
It is a collection of random variables $N(t)$, where $N(t)$ counts the events that have occurred up to time $t$, starting from $t_0$. The number of events between time $t$ and time $s > t$ is given by $N(s) - N(t)$, which follows a Poisson distribution.
Def.

(a) $N(0) = 0$.
(b) Consider the time mesh $0 = t_0 < t_1 < t_2 < \dots < t_n = T$. Then the increments $N_{t_n} - N_{t_{n-1}}, \dots, N_{t_1} - N_{t_0}$ are independent.
(c) If $I = (s, t) \subseteq \mathbb{R}$, then for $k = 0, 1, 2, \dots$
\[ P(N(t) - N(s) = k) = P(k \text{ jumps in } I) = e^{-\lambda |I|} \frac{(\lambda |I|)^k}{k!}, \tag{28} \]
where $\lambda > 0$ is the intensity of the process.

Given a sequence of times $t_0 = 0 < t_1 < \dots < t_N = T$ and $\lambda > 0$:

(a) Sample $\Delta N_{t_n} \sim Poisson(\lambda (t_n - t_{n-1}))$ for $n = 1, 2, \dots, N$.
(b) Set $N_{t_n} = \sum_{k=1}^n \Delta N_{t_k}$.
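Steps (a) and (b) can be sketched as follows. Python's standard library has no Poisson sampler, so this sketch draws a Poisson variate by counting unit-rate exponential arrivals in $[0, \text{mean}]$; that helper and both function names are assumptions of the example, not part of the notes.

```python
import random

def sample_poisson(mean, rng):
    """Poisson(mean) variate: count unit-rate exponential arrivals in [0, mean]."""
    count, t = 0, rng.expovariate(1.0)
    while t <= mean:
        count += 1
        t += rng.expovariate(1.0)
    return count

def sample_poisson_process(times, lam, rng):
    """Return N at the mesh points; times[0] is assumed to be 0."""
    path = [0]  # N(0) = 0
    for t_prev, t_next in zip(times, times[1:]):
        # Step (a): independent increment ~ Poisson(lam * (t_n - t_{n-1})).
        # Step (b): cumulative sum of the increments.
        path.append(path[-1] + sample_poisson(lam * (t_next - t_prev), rng))
    return path

rng = random.Random(3)
path = sample_poisson_process([0.0, 0.5, 1.0, 1.5, 2.0], 3.0, rng)
```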

7. Define the compound Poisson process and explain how to sample it.

A compound Poisson process with rate $\lambda > 0$ and jump size distribution $f$ is a continuous-time stochastic process $\{X(t)\}$ given by
\[ X(t) = \sum_{i=1}^{N(t)} Z_i, \]
where $N(t)$ is a Poisson counting process with rate $\lambda > 0$. The random variables $Z_i$ are IID with distribution $f$.

The general algorithm:

1 Set $t = 0$, $N = 0$, $X = 0$.
2 Sample $U \sim U[0, 1]$.
3 Set $t = t + [-(1/\lambda) \ln(U)]$. If $t > T$ (the final time), stop.
4 Sample $Z \sim f$.
5 Set $N = N + 1$ and $X = X + Z$.
6 Go back to step (2).

To sample a compound Poisson process with jump amplitudes $N(0, 1)$ on a time partition with steps $\Delta t_n = t_n - t_{n-1}$:

1 Sample $\Delta N_{t_n} \sim Poisson(\lambda (t_n - t_{n-1}))$ for all $n = 1, 2, \dots, N$.
2 Sample $\Delta X_{t_n} \sim N(0, \Delta N_{t_n})$ for all $n = 1, 2, \dots, N$, since the sum of $\Delta N_{t_n}$ IID $N(0, 1)$ jumps has variance $\Delta N_{t_n}$.
3 Set $X(T) = \sum_{n=1}^N \Delta X_{t_n}$.
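The general algorithm (steps 1 to 6) can be written almost line by line. In this sketch the function name and the choice of standard normal jumps are assumptions; `1.0 - rng.random()` is used so that the uniform variate lies in $(0, 1]$ and the logarithm is always defined.

```python
import math
import random

def sample_compound_poisson(final_time, lam, sample_jump, rng):
    """Return (N(T), X(T)) following steps 1-6 of the general algorithm."""
    t, n, x = 0.0, 0, 0.0                 # step 1
    while True:
        u = 1.0 - rng.random()            # step 2: uniform on (0, 1]
        t += -math.log(u) / lam           # step 3: exponential waiting time
        if t > final_time:
            return n, x
        x += sample_jump(rng)             # steps 4 and 5: add a jump Z ~ f
        n += 1                            # step 6 is the loop itself

rng = random.Random(5)
n_jumps, x_final = sample_compound_poisson(2.0, 3.0, lambda r: r.gauss(0.0, 1.0), rng)
```

With $N(0, 1)$ jumps, $X(T)$ has mean $0$ and variance $\lambda T$, which gives a simple sanity check on the sampler.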

8. What is a Brownian bridge? Discuss an algorithm to sample the mid point of the Brownian bridge.

Consider a Wiener process $W(t)$ on $[t_1 = 0, t_2]$. Let the Brownian bridge be the conditional value $B(s) = W(s) \mid W(t_2)$ for $s \in [t_1 = 0, t_2]$. Since $W(s)$ and $W(t_2)$ are jointly Gaussian, the conditional $B(s) = W(s) \mid W(t_2)$ is also Gaussian. We can find its parameters in the following way:
\[ E[W(s) \mid W(t_2)] = \underbrace{E[W_s]}_{= 0} + Cov(W_s, W_{t_2}) \underbrace{V[W_{t_2}]^{-1}}_{= 1/t_2} \big( W_{t_2} - \underbrace{E[W_{t_2}]}_{= 0} \big) = \frac{s\, W_{t_2}}{t_2} \tag{29} \]
and
\[ V[W(s) \mid W(t_2)] = \underbrace{V[W_s]}_{= s} - Cov[W_s, W_{t_2}] \underbrace{V[W_{t_2}]^{-1}}_{= 1/t_2} Cov[W_{t_2}, W_s] = s - \frac{s^2}{t_2}. \tag{30} \]
Note that
\[ Cov[W_s, W_{t_2}] = E[W_s W_{t_2}] - \underbrace{E[W_s] E[W_{t_2}]}_{= 0} = \min(s, t_2) = s. \tag{31} \]
We conclude that the distribution of $B$ at $s \in [t_1 = 0, t_2]$ is normal with parameters
\[ E[B(s)] = \frac{s\, W_{t_2}}{t_2}, \qquad V[B(s)] = s - \frac{s^2}{t_2}. \tag{32} \]
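For the midpoint $s = t_2/2$, (32) gives mean $W_{t_2}/2$ and variance $t_2/4$, so one Gaussian draw suffices. The sketch below assumes the left endpoint is $W(0) = 0$ as in the derivation; the function names and the numerical values are illustrative.

```python
import math
import random

def bridge_moments(s, t2, w_t2):
    """Mean and variance of B(s) = W(s) | W(t2) = w_t2, from (32)."""
    return s * w_t2 / t2, s - s * s / t2

def sample_bridge_midpoint(t2, w_t2, rng):
    """Sample the bridge at s = t2 / 2 given the endpoint value W(t2) = w_t2."""
    mean, var = bridge_moments(0.5 * t2, t2, w_t2)
    return rng.gauss(mean, math.sqrt(var))

rng = random.Random(6)
# Hypothetical endpoint: W(2) = 1.5, so the midpoint is N(0.75, 0.5).
midpoint = sample_bridge_midpoint(2.0, 1.5, rng)
```

Applying the same step recursively to each half interval refines a path while keeping the values that were already sampled.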
9. Describe kernel density estimates for nonparametric estimation. Discuss the choice of the window parameter h and the resulting error in the estimate of the pdf. How does this discussion depend on the dimension of the random variable?

We want to estimate the density $\rho_y(y)$ given a set $\{Y_l\}_{l=1}^M$ of IID observations from a common density $\rho_y$. The kernel density estimator of $\rho_y$ at the point $y$, with window parameter $k$ (called $h$ in the question), is given by
\[ \hat{\rho}_k(y) = \frac{1}{Mk} \sum_{j=1}^M Ker\left( \frac{y - Y_j}{k} \right). \tag{33} \]
If $Ker(\cdot)$ is a pdf, i.e. $Ker(z) \ge 0$ and normalized, then $\hat{\rho}_k(y)$ is also a candidate pdf estimator.
No unbiased estimator can exist for all continuous $\rho_y(y)$, so we look for a candidate with the least mean square error: an asymptotically unbiased estimator $\hat{\rho}_{k_M}(y)$ such that
\[ E[\hat{\rho}_{k_M}(y)] \xrightarrow{M \to \infty} \rho(y) \quad \forall y \in \mathbb{R}. \tag{34} \]
We choose the window size parameter $k > 0$ to minimize the total error in the estimate of the density of $Y \sim \rho$, given by
\[ \hat{\rho}_k(y) - \rho(y) = \underbrace{E[\hat{\rho}_k(y)] - \rho(y)}_{\text{bias}} + \underbrace{\hat{\rho}_k(y) - E[\hat{\rho}_k(y)]}_{\text{sampling error}}. \tag{35} \]

We look closer at the bias error. Writing $Ker_k(z) = Ker(z/k)/k$,
\[ \begin{aligned}
E[\hat{\rho}_k(y)] - \rho(y) &= \int Ker_k(y - z) \rho(z)\, dz - \rho(y) \\
&= (Ker_k * \rho)(y) - \rho(y) \\
&= \int Ker(z) \rho(y - kz)\, dz - \rho(y) \\
&= \int Ker(z) \Big[ \rho(y) - kz \rho'(y) + \frac{1}{2} (kz)^2 \rho''(y) - \frac{1}{6} (kz)^3 \rho'''(y) + O((kz)^4) \Big] dz - \rho(y) \\
&\approx \frac{1}{2} \sigma_{Ker}^2 k^2 \rho''(y).
\end{aligned} \tag{36} \]
Note that we have expanded $\rho(y - kz)$ around $y$ using a Taylor series. The first term cancels with $\rho(y)$ because the kernel is a pdf ($\int Ker(z)\, dz = 1$), and the terms with odd powers of $z$ integrate to zero because the kernel is symmetric. We write the remaining second-order term using $\sigma_{Ker}^2 = \int z^2 Ker(z)\, dz$, plus higher order terms.

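Next we look at the sampling error.

By Chebyshev's inequality, for any $a > 1$,
\[ P\big( |EX - X| \ge a \sigma(X) \big) \le \frac{1}{a^2}. \tag{37} \]
We can approximate
\[ E[\hat{\rho}_k](y) - \hat{\rho}_k(y) \approx C \sigma_{\hat{\rho}_k}(y). \tag{38} \]
And
\[ \begin{aligned}
V[\hat{\rho}_k](y) &= \frac{1}{M k^2} V\left[ Ker\left( \frac{y - z}{k} \right) \right] \\
&= \frac{\rho(y)}{Mk} \int Ker^2(z)\, dz + \frac{\rho'(y)}{M} \int z\, Ker^2(z)\, dz + \frac{\rho''(y)}{2M} \int z^2 Ker^2(z)\, dz + \dots,
\end{aligned} \tag{39} \]
where we have again expanded $\rho(y - kz)$ around $y$ using a Taylor series.

Next we choose the $k$ that minimizes the total squared error,
\[ \big( E[\hat{\rho}_k](y) - \rho(y) \big)^2 + C\, V[\hat{\rho}_k](y) = \frac{1}{4} \sigma_{Ker}^4 k^4 (\rho''(y))^2 + C \frac{\rho(y)}{Mk} \int Ker^2(z)\, dz + \dots \tag{40} \]
Hence, $k^* \propto M^{-1/5}$, and so the approximation error is $\propto M^{-2/5}$. However, we see that this choice depends on $y$. Next we find an optimal $k$ independent of $y$. This can be done by minimizing the $L^2$ error,
\[ \begin{aligned}
\int E[(\rho(y) - \hat{\rho}_k(y))^2]\, dy &\le 2 \int (\rho(y) - E[\hat{\rho}_k](y))^2\, dy + 2 \int V[\hat{\rho}_k](y)\, dy \\
&\le \frac{1}{2} \sigma_{Ker}^4 k^4 \int |\rho''(y)|^2\, dy + \frac{2c}{Mk} \int Ker^2(z)\, dz + \dots,
\end{aligned} \tag{41} \]
where the second inequality uses Chebyshev's inequality.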

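We consider the Mean Square Error,
\[ \begin{aligned}
MSE(\hat{\rho}_k) &= E \int (\hat{\rho}_k(y) - \rho(y))^2\, dy \\
&= \int E(\hat{\rho}_k(y) - \rho(y))^2\, dy \\
&= \int MSE(\hat{\rho}_k(y))\, dy \\
&= \frac{1}{Mk} \int Ker^2(y)\, dy + \Big( 1 - \frac{1}{M} \Big) \int (Ker_k * \rho)^2(y)\, dy - 2 \int (Ker_k * \rho)(y) \rho(y)\, dy + \int \rho^2(y)\, dy \\
&= \frac{1}{Mk} R(Ker) + \frac{1}{4} k^4 \sigma_{Ker}^4 R(\rho'') + o\Big( \frac{1}{Mk} + k^4 \Big),
\end{aligned} \tag{42} \]
where $R(Ker) = \int Ker^2(y)\, dy$ and $R(\rho'') = \int (\rho'')^2(y)\, dy$. Hence,
\[ MSE(\hat{\rho}_k) \approx \frac{1}{Mk} R(Ker) + \frac{1}{4} k^4 \sigma_{Ker}^4 R(\rho''). \tag{43} \]
Then the optimal choice of $k$ is given by
\[ k_{MSE} = \left[ \frac{R(Ker)}{\sigma_{Ker}^4 R(\rho'')} \right]^{1/5} M^{-1/5}. \tag{44} \]
This yields an approximation error $\propto M^{-2/5}$.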


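The optimal value of $k^*$ for a KDE of an $\mathbb{R}^d$ random variable $X$, based on observing $X_1, X_2, \dots, X_M$, is as follows.

The $d$-dimensional KDE is defined as
\[ \hat{f}(X; \mathbf{H}) = \frac{1}{M} \sum_{i=1}^M K_{\mathbf{H}}(X - X_i), \]
where $\mathbf{H}$ is a $d \times d$ SPD bandwidth matrix and $K_{\mathbf{H}}$ is defined as
\[ K_{\mathbf{H}}(X) = |\mathbf{H}|^{-1/2} K(\mathbf{H}^{-1/2} X). \]
Note that $K_{\mathbf{H}}$ is a normalized pdf. We propose the kernel $K$ to be a $d$-variate Gaussian,
\[ K(X) = \frac{1}{(2\pi)^{d/2}} e^{-\frac{1}{2} X^T X}. \]
To see the convergence rate, we look at the mean square error (MSE), where
\[ MSE = bias^2(\hat{f}) + Var(\hat{f}). \]
Let us first compute the bias, using a Taylor expansion to simplify. Below, $Df$ denotes the gradient and $\mathcal{H}f$ the Hessian of $f$:
\[ \begin{aligned}
E[\hat{f}(X; \mathbf{H})] - f(X) &= \int K_{\mathbf{H}}(X - Y) f(Y)\, dY - f(X) = \int K(Z) f(X - \mathbf{H}^{1/2} Z)\, dZ - f(X) \\
&= \int K(Z) \Big[ f(X) - (\mathbf{H}^{1/2} Z)^T Df(X) + \frac{1}{2} (\mathbf{H}^{1/2} Z)^T \mathcal{H}f(X) (\mathbf{H}^{1/2} Z) \Big] dZ + o(tr(\mathbf{H})) - f(X) \\
&= f(X) - \int Z^T \mathbf{H}^{1/2} Df(X) K(Z)\, dZ + \frac{1}{2} \int Z^T \mathbf{H}^{1/2} \mathcal{H}f(X) \mathbf{H}^{1/2} Z K(Z)\, dZ + o(tr(\mathbf{H})) - f(X) \\
&= \frac{1}{2} tr\Big( \mathbf{H}^{1/2} \mathcal{H}f(X) \mathbf{H}^{1/2} \int Z Z^T K(Z)\, dZ \Big) + o(tr(\mathbf{H})) \sim \frac{1}{2} \sigma^2(K)\, tr(\mathbf{H}\, \mathcal{H}f(X)),
\end{aligned} \]
where $\sigma^2(K) = \int z_i^2 K(Z)\, dZ$.
Next we compute the variance of $\hat{f}$,
\[ \begin{aligned}
V(\hat{f}(X; \mathbf{H})) &= \frac{1}{M} \Big[ |\mathbf{H}|^{-1/2} \int K^2(Z) f(X - \mathbf{H}^{1/2} Z)\, dZ - \Big( \int K(Z) f(X - \mathbf{H}^{1/2} Z)\, dZ \Big)^2 \Big] \\
&= \frac{1}{M} |\mathbf{H}|^{-1/2} R(K) f(X) + o\Big( \frac{1}{M} |\mathbf{H}|^{-1/2} \Big),
\end{aligned} \]
where $R(K) = \int K^2(z)\, dz$. For simplicity, consider $\mathbf{H} = h^2 I$. Then the MSE is given by
\[ MSE = \frac{1}{4} h^4 (\sigma^2(K))^2 (\nabla^2 f(x))^2 + \frac{1}{M} h^{-d} R(K) f(x) + o\Big( \frac{1}{M} h^{-d} \Big). \]
The optimal bandwidth $h^*$ that minimizes the MSE is then given by
\[ h^* = \left[ \frac{d\, R(K) f(x)}{M (\sigma^2(K))^2 (\nabla^2 f(x))^2} \right]^{\frac{1}{d+4}} \propto M^{-\frac{1}{d+4}}. \]
Note that at best, when using $h^*$, the error depends on the dimension as follows:
\[ MSE \propto M^{-\frac{4}{d+4}}, \quad \text{i.e. the approximation error is} \propto M^{-\frac{2}{d+4}}. \]
We see that as the dimension increases, the convergence rate decreases, as expected.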
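A one-dimensional estimator with an $M^{-1/5}$ bandwidth can be sketched as follows. The assumptions here go beyond the notes: a Gaussian kernel is used, and the unknown $R(\rho'')$ in (44) is evaluated as if $\rho$ were normal with the sample standard deviation $s$, which gives $k = (4/3)^{1/5} s M^{-1/5}$ (the normal reference rule). Function names are illustrative.

```python
import math
import random

def gaussian_kernel(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def kde(y, data, k):
    """Kernel density estimate (33) at the point y with window parameter k."""
    return sum(gaussian_kernel((y - yj) / k) for yj in data) / (len(data) * k)

def normal_reference_bandwidth(data):
    """k = (4/3)^(1/5) * s * M^(-1/5): formula (44) under a normal reference density."""
    m = len(data)
    mu = sum(data) / m
    s = math.sqrt(sum((v - mu) ** 2 for v in data) / (m - 1))
    return (4.0 / 3.0) ** 0.2 * s * m ** (-0.2)

rng = random.Random(8)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # hypothetical N(0, 1) observations
k = normal_reference_bandwidth(data)
density_at_zero = kde(0.0, data, k)                # true value is 1/sqrt(2*pi)
```

The normal reference choice tends to oversmooth multimodal densities, so it is best treated as a starting value for $k$.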

10. What is affine prediction? Describe its use for the approximation of the conditional expectation E[Y |Z]? When is this approximation exact?

Affine prediction is a method used to solve prediction problems: given the joint distribution of a random variable $(Y, Z)$ and the observations of $Z$, we want to compute $\hat{V}$, the best prediction of $Y$.
In particular, consider the space
\[ L_p = \{ g : g \text{ is a polynomial of degree at most } p \}. \tag{45} \]
In general, this approach returns an $L_p$ approximation of $E[Y|Z]$.
Affine prediction: We restrict ourselves to the $L_1$ approximation of $E[Y|Z]$, so let us consider $g(Z) = a^t (Z - EZ) + b$. To find $(a^*, b^*)$, we solve, for each component $m = 1, \dots, M$ of $Z$,
\[ E[(a^t (Z - EZ) + b)(Z_m - EZ_m)] = E[Y (Z_m - EZ_m)] \tag{46} \]
and
\[ E[a^t (Z - EZ) + b] = E[Y]. \tag{47} \]
This can be written in matrix form as
\[ \begin{pmatrix} Cov(Z) & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} E[Y (Z - EZ)] \\ EY \end{pmatrix}. \]
So we have that $b^* = EY$ and $a^* = Cov^{-1}(Z) E[Y (Z - EZ)]$, and then
\[ \begin{aligned}
g_1^*(Z) &= E[Y (Z - EZ)]^t\, Cov^{-1}(Z) (Z - EZ) + EY \\
&= \big( E[YZ] - EY\, EZ \big)^t Cov(Z)^{-1} (Z - EZ) + EY.
\end{aligned} \tag{48} \]
The approximation is exact when $E[Y|Z]$ is itself an affine function of $Z$, for example when $(Y, Z)$ is jointly Gaussian.
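For a scalar $Z$, formula (48) reduces to $a^* = Cov(Y, Z)/V[Z]$ and $b^* = EY$. The sketch below replaces the exact moments by sample moments, which is an assumption of the example (the notes work with the joint distribution); the function name and the data are illustrative.

```python
def fit_affine_predictor(zs, ys):
    """Return g(z) = a * (z - EZ) + b with a = Cov(Y, Z) / V[Z] and b = EY."""
    n = len(zs)
    ez = sum(zs) / n
    ey = sum(ys) / n
    var_z = sum((z - ez) ** 2 for z in zs) / n
    cov_yz = sum((y - ey) * (z - ez) for y, z in zip(ys, zs)) / n
    a = cov_yz / var_z
    b = ey
    return lambda z: a * (z - ez) + b

# Hypothetical data with an exactly affine relation Y = 2 Z + 1.
zs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * z + 1.0 for z in zs]
g = fit_affine_predictor(zs, ys)
prediction = g(5.0)
```

Because the relation between $Y$ and $Z$ is affine here, the predictor reproduces it exactly, illustrating the case in which the $L_1$ approximation of $E[Y|Z]$ has no error.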

11. Motivate the use of non-linear weighted least squares starting from a maximum posterior, Gaussian likelihood, Bayesian formulation.
We consider the following setting (see footnote 1),
\[ x = m + e, \tag{49} \]
where $m$ is the model (a deterministic unknown) and $e \sim N(0, \sigma^2 \Delta)$ is a zero-mean multivariate Gaussian noise with covariance $Cov[e] = \sigma^2 \Delta$. In the weighted least squares framework with $M$ samples, we aim to minimize
\[ \sigma_{WLS} = \| W (x - m) \|_F^2, \tag{50} \]
where $W$ is a diagonal matrix of weights and $\| \cdot \|_F$ is the Frobenius norm. Note that by minimizing (50) we are reducing the error according to different weights. These weights can be chosen based on prior information (a prior distribution).
Next we find the log-likelihood function of the residual $e = x - m$,
\[ \ell = -\frac{M}{2} \ln \sigma^2 - \frac{1}{2} \ln |\Delta| - \frac{1}{2 \sigma^2} (x - m)^t \Delta^{-1} (x - m). \tag{51} \]
Footnote 1:
- Bro R, Sidiropoulos ND and Smilde AK. Maximum likelihood fitting using ordinary least squares algorithms. J. Chemom. 2002; 16: 387-400.
- Judge GG, Griffiths WE, Carter Hill R, Luetkepohl H and Lee TC. The Theory and Practice of Econometrics. Wiley: New York, 1985.

Maximizing $\ell$ with respect to $\sigma^2$ yields the estimator
\[ \hat{\sigma}^2 = \frac{(x - m)^t \Delta^{-1} (x - m)}{M} \propto (x - m)^t \Delta^{-1} (x - m). \tag{52} \]
By choosing $W = \Delta^{-1/2}$, we obtain
\[ (x - m)^t \Delta^{-1} (x - m) = (x - m)^t W^t W (x - m) = [W (x - m)]^t [W (x - m)]. \tag{53} \]
This is again equivalent to minimizing
\[ \sigma_{ML} = \| W (x - m) \|_F^2. \tag{54} \]
Comparing (50) and (54), we see that by choosing $W = \Delta^{-1/2}$ the weighted least squares estimates coincide with the maximum likelihood estimates.

Next, it remains to minimize (54). We use an iterative majorization technique to tackle a possibly full matrix $W$. We denote the value at the $c$th iteration by $\sigma_{ML}(m_c) = \| W (x - m_c) \|_F^2$. Since we know that $\sigma_{ML} \ge 0$, we aim to have
\[ \sigma_{ML}(m_{c+1}) \le \sigma_{ML}(m_c). \tag{55} \]
We define a majorization function $\sigma_{maj}$ that satisfies the following:

• It bounds $\sigma_{ML}$ from above, i.e. $\sigma_{ML}(m) \le \sigma_{maj}(m)$ for all $m$, so that minimizing it decreases $\sigma_{ML}$ monotonically.
• It is identical to $\sigma_{ML}$ at the current point $m_c$, i.e. $\sigma_{maj}(m_c) = \sigma_{ML}(m_c)$.

We rewrite (54) using $m = m_c + (m - m_c)$,
\[ \begin{aligned}
\sigma_{ML}(m) &= [(x - m_c) - (m - m_c)]^t W^t W [(x - m_c) - (m - m_c)] \\
&= \underbrace{(x - m_c)^t W^t W (x - m_c)}_{\text{a constant}} + \underbrace{(m - m_c)^t W^t W (m - m_c)}_{\text{non-linear in } m} - \underbrace{2 (m - m_c)^t W^t W (x - m_c)}_{\text{linear in } m}.
\end{aligned} \]
Let $\beta$ be the largest eigenvalue of $W^t W$. Then for any vector $s$,
\[ s^t W^t W s \le \beta s^t s \implies (m - m_c)^t W^t W (m - m_c) \le \beta (m - m_c)^t (m - m_c). \tag{56} \]
Then the majorization function defined by
\[ \sigma_{maj}(m) = (x - m_c)^t W^t W (x - m_c) + \beta (m - m_c)^t (m - m_c) - 2 (m - m_c)^t W^t W (x - m_c) \tag{57} \]
satisfies $\sigma_{ML} \le \sigma_{maj}$, and we see that $\sigma_{maj}(m_c) = \sigma_{ML}(m_c) = (x - m_c)^t W^t W (x - m_c)$.

And so it satisfies the requirements of a majorization function as defined above. Note that we have monotone decrease, as
\[ \sigma_{ML}(m_{c+1}) \le \sigma_{maj}(m_{c+1}) \le \sigma_{maj}(m_c) = \sigma_{ML}(m_c). \tag{58} \]
To find the best solution, we perform the following algorithm:
To find the best solution, we perform the following algorithm:

1 Set $c = 0$ and an initial $m_c = 0$.

2 Find $m_{c+1} = \arg\min_{m \in \xi} \sigma_{maj}(m)$ and set $c = c + 1$.

3 Go to (2) until $\dfrac{\| m_c - m_{c-1} \|_F^2}{\| m_{c-1} \|_F^2} \le \epsilon$,

where $\xi$ is the admissible set of $m$ following the structure of the model and $\epsilon$ is a required tolerance. Hence, in this Bayesian setting, we have used as a prior the covariance $\Delta$ of the noise $e$ and we were able to formulate a maximum posterior estimator of the parameter $\sigma$ using non-linear least squares.

12. Describe Linear Partially Observed State Space Models. What is the filtering problem? What is a Bayesian filter?
An $\mathbb{R}^d$ stochastic sequence $X = \{X_k\}_{k \ge 0}$ is a state space model if it evolves according to a recursion of the form
\[ X_{k+1} = F_k X_k + W_k + U_k, \tag{59} \]
where $\{F_k\}_{k \ge 0}$ is a deterministic sequence of $d \times d$ matrices, $\{U_k\}_{k \ge 0}$ is a deterministic sequence of $d \times 1$ vectors and $\{W_k\}_{k \ge 0}$ is a sequence of independent $\mathbb{R}^d$ random vectors for which $E W_k = 0$ and $E[\| W_k \|^2] < \infty$.
Partially observed states: We are unable to observe the state of the system directly; instead, we observe a corrupted version of the state $X_k$. In this setting, we assume that one can directly observe $\{Z_k\}_{k \ge 0}$, where
\[ Z_k = G_k X_k + V_k \quad \text{for } k \ge 0. \tag{60} \]
Here $\{G_k\}_{k \ge 0}$ is a deterministic sequence of $n \times d$ matrices and $\{V_k\}_{k \ge 0}$ is a stochastic sequence of $n \times 1$ random vectors.
Filtering is concerned with trying to extract the current state $X_k$ from the observed history of the observation process $Z$ up to time $k$.

The Bayesian filter gives the complete evolution of the distribution of the state $X_k$ in terms of the available data in a recursive way. It is done in the following way:
1 Time evolution update:
\[ \pi(X_{k+1} \mid \{Z_j\}_{j=1}^k) = \int \pi(X_{k+1} \mid X_k)\, \pi(X_k \mid \{Z_j\}_{j=1}^k)\, dX_k \]
2 Observation update:
\[ \pi(X_{k+1} \mid \{Z_j\}_{j=1}^{k+1}) = \frac{\pi(Z_{k+1} \mid X_{k+1})\, \pi(X_{k+1} \mid \{Z_j\}_{j=1}^k)}{\pi(Z_{k+1} \mid \{Z_j\}_{j=1}^k)} \]
with
\[ \pi(Z_{k+1} \mid \{Z_j\}_{j=1}^k) = \int \pi(Z_{k+1} \mid X_{k+1})\, \pi(X_{k+1} \mid \{Z_j\}_{j=1}^k)\, dX_{k+1}. \]

13. Describe the Kalman filter, motivate it as a particular case of a Bayesian filter. Give a simple example of a Kalman Filter applied to a Linear Observed State Space Model.

Given a Bayesian filter for linear evolution and observations subject to Gaussian noise, and a Gaussian initial condition $X_0$, the state $X_k$ will be Gaussian. This leads to the Kalman filter, which just tracks the conditional mean and the covariance of $X_k$.

Kalman Filter: Let $(X_k, Z_k)$ follow the linear equations
\[ X_{k+1} = F_k X_k + W_k + U_k, \tag{61} \]
\[ Z_k = G_k X_k + V_k, \]
for $k \ge 0$, where:

(a) $\{F_k\}_{k \ge 0}$ is deterministic
(b) $\{U_k\}_{k \ge 0}$ is deterministic
(c) $\{W_k\}_{k \ge 0}$ is a sequence of independent random vectors
(d) $\{G_k\}_{k \ge 0}$ is deterministic
(e) $\{V_k\}_{k \ge 0}$ is stochastic

We assume that $W$ and $V$ are independent. Then we have:

1 Time evolution update: $\hat{X}_{k+1|k} = F_k \hat{X}_{k|k} + U_k$ and $\Sigma_{k+1|k} = F_k \Sigma_{k|k} F_k^t + \Sigma_{W_k}$, where $\Sigma_{W_k} = Cov(W_k)$.
2 Observation update: $\hat{X}_{k+1|k+1} = \hat{X}_{k+1|k} + K_{k+1} (Z_{k+1} - G_{k+1} \hat{X}_{k+1|k})$ and $\Sigma_{k+1|k+1} = (I - K_{k+1} G_{k+1}) \Sigma_{k+1|k}$,
where $K_{k+1} = \Sigma_{k+1|k} G_{k+1}^t \big( G_{k+1} \Sigma_{k+1|k} G_{k+1}^t + \Sigma_{V_{k+1}} \big)^{-1}$ and $\Sigma_{V_{k+1}} = Cov(V_{k+1})$,
such that for $n \ge j$:
\[ \hat{X}_{n|j} = E[X_n \mid Z_l : 0 \le l \le j]. \tag{62} \]
Example: Consider $(\Omega, \mathcal{F}, P)$. Given the general stochastic signal evolution for $U_n : \Omega \to \mathbb{R}$,
\[ U_{n+1} = U_n e^{-T} + \int_0^T \sigma e^{s - T}\, dW_s, \quad n = 0, \dots, N - 1, \tag{63} \]
for $T = 1$, with
\[ \int_0^1 \sigma e^{s - 1}\, dW_s \sim N\Big( 0, \frac{\sigma^2}{2} (1 - e^{-2}) \Big), \tag{64} \]
and the observations
\[ Y_{n+1} = U_{n+1} + \eta_{n+1}, \quad \eta_n \sim N(0, T), \tag{65} \]
then, given a Gaussian initial condition $U_0$, the filtering distribution
\[ \pi(U_{n+1} \mid \{Y_j\}_{j=1}^{n+1}) \tag{66} \]
is Gaussian. So using the Kalman filter we can track the evolution of the first two moments as follows:

(a) Prediction step:
\[ m_{n+1} = e^{-1} \hat{m}_n, \qquad C_{n+1} = e^{-2} \hat{C}_n + \Sigma, \]
where $\Sigma = \frac{\sigma^2}{2} (1 - e^{-2})$ is the variance in (64).

(b) Update step:
\[ \hat{m}_{n+1} = (I - K_{n+1}) m_{n+1} + K_{n+1} Y_{n+1}, \qquad \hat{C}_{n+1} = (I - K_{n+1}) C_{n+1}, \]
where $K_{n+1} = C_{n+1} S_{n+1}^{-1}$ and $S_{n+1} = T + C_{n+1}$.
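The scalar example can be run directly. The sketch below assumes $\sigma = 1$ and $T = 1$, a prior $U_0 \sim N(0, 1)$, and simulates the signal and observations itself; the function name `kalman_step` and the variable names are illustrative and not taken from the notes.

```python
import math
import random

def kalman_step(m_hat, c_hat, y, decay, noise_var, obs_var):
    """One prediction step (a) followed by one update step (b)."""
    m_pred = decay * m_hat                      # m_{n+1} = e^{-1} m_hat_n
    c_pred = decay ** 2 * c_hat + noise_var     # C_{n+1} = e^{-2} C_hat_n + Sigma
    gain = c_pred / (obs_var + c_pred)          # K = C S^{-1}, with S = T + C
    return (1.0 - gain) * m_pred + gain * y, (1.0 - gain) * c_pred

rng = random.Random(3)
decay = math.exp(-1.0)
noise_var = 0.5 * (1.0 - math.exp(-2.0))  # variance in (64), assuming sigma = 1
obs_var = 1.0                             # variance of eta_n with T = 1

u, m_hat, c_hat = 0.0, 0.0, 1.0           # assumed initial state and N(0, 1) prior
history = []
for _ in range(50):
    u = decay * u + rng.gauss(0.0, math.sqrt(noise_var))  # signal evolution (63)
    y = u + rng.gauss(0.0, math.sqrt(obs_var))            # observation (65)
    m_hat, c_hat = kalman_step(m_hat, c_hat, y, decay, noise_var, obs_var)
    history.append((u, y, m_hat, c_hat))
```

The variance recursion does not depend on the data, so `c_hat` settles to a fixed value after a few steps and always stays below the observation noise variance.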
14. Define Markov chains and give an example. Compute its limit distribution if it exists.

Def. $Y = \{Y_t, t = 0, 1, 2, \dots\}$ is a Markov chain with state space $S$, initial distribution $\pi$ and transition matrix $P$ if:
(a) $Y_0$ has distribution $\pi$.
(b) The conditional distribution of $Y_{t+1}$ given $Y_t = i$ is $p(i, \cdot)$, and it is independent of $Y_0, \dots, Y_{t-1}$.
Example:
Every month Raul travels ($T$) or not ($N$). Given a state, the transition probabilities are
\[ p(T \to T) = 0.9, \quad p(T \to N) = 0.1, \quad p(N \to T) = 0.5, \quad p(N \to N) = 0.5. \tag{67} \]
Then we have the transition matrix
\[ P = \begin{pmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{pmatrix}. \]
All entries of $P$ are positive, so the Markov chain $Y \sim MC(\pi, P)$ is a regular MC, and
\[ \lim_{n \to \infty} P^n = (1, \dots, 1)^t\, \bar{w}, \tag{68} \]
where $\bar{w}$ is the distribution that solves
\[ \begin{cases} \bar{w} = \bar{w} P, \\ \sum_i \bar{w}_i = 1. \end{cases} \tag{69} \]
We find the solution to be $\bar{w} = [5/6, 1/6]$.
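The limit (68) suggests a simple numerical check: multiplying any initial distribution by $P$ repeatedly should converge to $\bar{w}$. This is a sketch using power iteration, which is one possible way to solve (69) and not necessarily how the notes obtained it; the function name is illustrative.

```python
def stationary_distribution(P, num_iter=200):
    """Approximate the solution of w = w P, sum(w) = 1, by power iteration."""
    n = len(P)
    w = [1.0 / n] * n  # any initial distribution works for a regular chain
    for _ in range(num_iter):
        # Row vector times matrix: (w P)_j = sum_i w_i P[i][j].
        w = [sum(w[i] * P[i][j] for i in range(n)) for j in range(n)]
    return w

P = [[0.9, 0.1],
     [0.5, 0.5]]  # states ordered as (T, N), as in (67)
w_bar = stationary_distribution(P)
```

Here the convergence is geometric with rate $0.4$, the second eigenvalue of $P$, so far fewer than 200 iterations are needed in practice.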


15. Let X(t) be a discrete Markov Chain and

u(y, T ) = P (X(T ) = y|X(0) = x).

These probabilities satisfy an evolution equation moving forward in time: state it and derive it.

Define $\mu(x, t) = p(X(t) = x)$, where $x \in S$, the state space. Starting from the numbers $\mu(x, 0) = \lambda_x$ for all $x \in S$ and given the transition matrix $P$, the evolution in time is
\[ \mu(x, t) = (\lambda P^t)_x \quad \forall x \in S. \tag{70} \]
By conditional probability,
\[ \begin{aligned}
\mu(x, t + 1) &= p(X(t + 1) = x) \\
&= \sum_{y \in S} p(X(t + 1) = x \mid X(t) = y)\, p(X(t) = y) \\
&= \sum_{y \in S} p(y, x) \mu(y, t).
\end{aligned} \tag{71} \]
We will write the last equation in matrix form. Suppose that $S = \{x_1, \dots, x_N\}$ with $|S| < \infty$. We define a row vector of order $|S|$:
\[ \mu(t) = \langle \mu(x_1, t), \dots, \mu(x_N, t) \rangle. \tag{72} \]
And so we can write
\[ \mu(t + 1) = \mu(t) P, \tag{73} \]
where $\lambda = \mu(0) = \langle p(X(0) = x_1), \dots, p(X(0) = x_N) \rangle$.
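The forward equation (73) is a row-vector recursion. The sketch below applies it to the two-state travel example with transition matrix $P$ from question 14; the initial distribution (starting in state $N$) and the function name are assumptions of the example.

```python
def forward_step(mu, P):
    """One step of (73): mu(t + 1) = mu(t) P, with mu a row vector."""
    n = len(P)
    return [sum(mu[y] * P[y][x] for y in range(n)) for x in range(n)]

P = [[0.9, 0.1],
     [0.5, 0.5]]       # states ordered as (T, N)
mu = [0.0, 1.0]        # hypothetical start: X(0) = N with probability 1
trajectory = [mu]
for _ in range(2):
    mu = forward_step(mu, P)
    trajectory.append(mu)
# trajectory[t] holds the distribution of X(t) for t = 0, 1, 2.
```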

16. Let X(t) be a discrete Markov Chain and for t < T let

f (x, t) = E(V (X(T ))|X(t) = x).

State and derive a backward equation for the above expected value of a state observable.

Def. $f(x, t) = E[V(X(T)) \mid X(t) = x]$ for $t < T$ and $x \in S$. Given the final condition $f(x, T) = V(x)$ for all $x \in S$ and the transition matrix $P$, we have the following backward evolution:
\[ f(x, t) = (P^{T - t} f(\cdot, T))(x) \quad \forall x \in S. \tag{74} \]
We derive
\[ \begin{aligned}
f(x, t) &= E[V(X(T)) \mid X(t) = x] \\
&= \sum_{y \in S} E[V(X(T)) \mid X(t) = x \text{ and } X(t + 1) = y]\, p(X(t + 1) = y \mid X(t) = x) \\
&= \sum_{y \in S} E[V(X(T)) \mid X(t + 1) = y]\, p(x, y) = \sum_{y \in S} f(y, t + 1)\, p(x, y),
\end{aligned} \tag{75} \]
where the third equality uses the Markov property. The evolution equation of the expectation can be written in matrix form. Again, suppose that $S = \{x_1, \dots, x_N\}$ with $|S| < \infty$. We define a column vector of order $|S|$: $f(t) = \langle f(x_1, t), \dots, f(x_N, t) \rangle^t$. And so we have
\[ f(t) = P f(t + 1) \quad \text{with } f(x, T) = V(x)\ \forall x \in S. \tag{76} \]
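The backward equation (76) is a column-vector recursion that starts from the final condition and steps down to $t = 0$. The sketch below uses the transition matrix of the travel example in question 14; the observable $V$ (indicator of the state $T$), the horizon of two steps and the function name are assumptions of the example.

```python
def backward_step(f_next, P):
    """One step of (76): f(t) = P f(t + 1), with f a column vector."""
    n = len(P)
    return [sum(P[x][y] * f_next[y] for y in range(n)) for x in range(n)]

P = [[0.9, 0.1],
     [0.5, 0.5]]      # states ordered as (T, N)
V = [1.0, 0.0]        # hypothetical observable: 1 if travelling at the final time
f = V                 # final condition f(T) = V
for _ in range(2):    # two backward steps: final time minus 2 is time 0
    f = backward_step(f, P)
# f[x] is now E[V(X(2)) | X(0) = x].
```

Starting in state $N$, the value `f[1]` equals the probability of travelling at time 2, which is the same number the forward equation produces from that initial state.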

17. State and discuss a method for sampling from the invariant measure of a Markov Chain. Mention a possible application of it.
The Metropolis-Hastings algorithm samples from a pdf $f$. Fixing $f = w$, the desired invariant measure, it recursively constructs an ergodic MC such that the given $w$ satisfies the detailed balance condition and is therefore invariant for the chain.
Metropolis-Hastings:
Input: $f$, the target pdf to be sampled, and $q(y|x)$, a proposal density. Let $X_n$ be the current sample. We perform the following:
1 Given $X_n$, we sample $Y_n \sim q(\cdot | X_n)$.
2 Let the acceptance probability be
\[ p(x, y) = \min\left\{ 1, \frac{f(y) q(x|y)}{f(x) q(y|x)} \right\} \tag{77} \]
and take
\[ X_{n+1} = \begin{cases} Y_n & \text{with probability } p(X_n, Y_n), \\ X_n & \text{else.} \end{cases} \tag{78} \]
In particular, if $q(x|y) = q(y|x)$, then $p(x, y) = \min\{1, f(y)/f(x)\}$.
Next, we check the detailed balance property. We have to show that
\[ f(x) t(x, y) = f(y) t(y, x), \tag{79} \]
where $t(x, y)$ are the transition probabilities of the MC defined by Metropolis-Hastings.

The transition pdf is
\[ t(x, y) = p(x, y) q(y|x) + \delta_x(y) \Big( 1 - \int p(x, s) q(s|x)\, ds \Big). \tag{80} \]
Writing $r(x) = \int p(x, s) q(s|x)\, ds$, detailed balance follows from two identities corresponding to the two terms in $f(x) t(x, y)$:

(a) $p(x, y) q(y|x) f(x) = p(y, x) q(x|y) f(y)$.
(b) $(1 - r(x)) \delta_x(y) f(x) = (1 - r(y)) \delta_y(x) f(y)$.

Hence detailed balance follows, and then $f$ is a stationary pdf for the MC.
Application: In the Bayesian framework, we have data $\{X_i\}_{i=1}^n$ and we want to estimate the parameter $\theta$:
\[ p(\theta \mid \{X_i\}) \propto L(\{X_i\} \mid \theta)\, p(\theta). \tag{81} \]
The posterior only has to be known up to a multiplicative constant. Then we can apply the M-H algorithm to draw samples from the posterior $p(\theta \mid \{X_i\})$.
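A random-walk version of the algorithm, where the proposal is symmetric and the acceptance probability reduces to $\min\{1, f(y)/f(x)\}$, can be sketched as follows. The target (an unnormalized standard normal given through its log density), the Gaussian proposal, the step size and the burn-in length are assumptions of the example.

```python
import math
import random

def metropolis_hastings(log_f, x0, step, num_samples, burn_in, rng):
    """Random-walk Metropolis: symmetric proposal, so q cancels in (77)."""
    x = x0
    chain = []
    for i in range(num_samples + burn_in):
        y = x + rng.gauss(0.0, step)  # proposal Y_n ~ q(. | X_n)
        # Acceptance probability min{1, f(y)/f(x)}, computed from log densities.
        accept_prob = math.exp(min(0.0, log_f(y) - log_f(x)))
        if rng.random() < accept_prob:
            x = y                     # accept; otherwise keep X_{n+1} = X_n
        if i >= burn_in:
            chain.append(x)
    return chain

# Target known only up to a constant: f(x) proportional to exp(-x^2 / 2).
chain = metropolis_hastings(lambda x: -0.5 * x * x, 3.0, 1.0, 20_000, 1_000, random.Random(4))
chain_mean = sum(chain) / len(chain)
chain_var = sum((x - chain_mean) ** 2 for x in chain) / (len(chain) - 1)
```

Only differences of `log_f` enter the acceptance step, which is why the normalizing constant of the target is never needed.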
Next we discuss the bias and statistical error of MCMC. For an arbitrary function $\psi$ we would like to estimate the expectation $\theta = E[\psi(X)]$, where $X$ is a discrete random vector. Consider the stationary target probability $\pi$. Then
\[ \theta = E[\psi(X)] = \pi^t \psi. \tag{82} \]
Consider the MCMC estimator
\[ \hat{\theta}_n = \frac{1}{n} \sum_{k=0}^{n-1} \psi(X_k), \tag{83} \]
where $X_k$ is a state in the Markov chain starting from a fixed initial distribution $\alpha$. Consider an irreducible aperiodic transition matrix $P$ whose ergodic limit is $\pi$. Next we check the bias, i.e. whether $E\hat{\theta}_n = \theta$. We proceed:
\[ E\hat{\theta}_n = \frac{1}{n} \sum_{k=0}^{n-1} E[\psi(X_k)] = \frac{1}{n} \sum_{k=0}^{n-1} \alpha^t P^k \psi = \frac{1}{n} \alpha^t \Big( \sum_{k=0}^{n-1} P^k \Big) \psi. \tag{84} \]
Then, using the ergodicity of $P$ and its limit $\pi$, we have
\[ \lim_{n \to \infty} E\hat{\theta}_n = \theta. \tag{85} \]
Hence, MCMC is asymptotically unbiased. However, for finite $n$ it is biased, and the early samples, whose distribution is still close to the initial distribution $\alpha$, can be significantly biased. Therefore, it is customary to allow a burn-in period $r$ (discarding the output of the first $r$ runs) and then calculate our estimate as follows:
\[ \hat{\theta}_{n-r} = \frac{1}{n - r} \sum_{k=r+1}^{n} \psi(X_k). \tag{86} \]
The choice of $r$ can be pre-selected or estimated statistically.
Next we find the variance of the estimator $\hat{\theta}_n$. Let $\Pi$ be a matrix with identical row vectors equal to the target probability $\pi^t$. Then
\[ (P - \Pi)^n = (P - \Pi)^{n-1} (P - \Pi) = (P^{n-1} - \Pi)(P - \Pi) = P^n - \Pi P - P^{n-1} \Pi + \Pi^2 = P^n - \Pi, \]
where we have used the fact that $\Pi P = P \Pi = \Pi = \Pi^2$, since $P$ has $\pi$ as the limit probability and $\Pi$ is idempotent. We can also see that
\[ \lim_{n \to \infty} (P - \Pi)^n = \lim_{n \to \infty} P^n - \Pi = 0. \tag{87} \]
Note that $Q = P - \Pi$ describes the probabilities between transient states. Next we can define the fundamental matrix
\[ Z = (I - (P - \Pi))^{-1}, \tag{88} \]
which we can expand into
\[ Z = (I - (P - \Pi))^{-1} = I + \sum_{k=1}^{\infty} (P^k - \Pi). \tag{89} \]

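We begin by finding the variance of the MCMC estimator,
\[ V[\hat{\theta}_n] = \frac{1}{n^2} \left( E\Big[ \Big( \sum_{k=0}^{n-1} \psi(X_k) \Big)^2 \Big] - \Big( E\Big[ \sum_{k=0}^{n-1} \psi(X_k) \Big] \Big)^2 \right). \tag{90} \]
Rewriting,
\[ n^2 V[\hat{\theta}_n] = E\Big[ \sum_{k=0}^{n-1} \psi(X_k)^2 \Big] + 2 \sum_{k < k'} E[\psi(X_k) \psi(X_{k'})] - \Big( E\Big[ \sum_{k=0}^{n-1} \psi(X_k) \Big] \Big)^2. \tag{91} \]
Consider the case $\alpha = \pi$, where $\pi$ is of fixed length $l$, that is, the case in which the initial probability is the target probability. Then
\[ \Big( E\Big[ \sum_{k=0}^{n-1} \psi(X_k) \Big] \Big)^2 = n^2 \theta^2 \tag{92} \]
and
\[ E\Big[ \sum_{k=0}^{n-1} \psi(X_k)^2 \Big] = \sum_{k=0}^{n-1} E[\psi(X_k)^2] = n \sum_{i=1}^{l} \pi_i \psi_i^2. \tag{93} \]
Also,
\[ \begin{aligned}
\sum_{k < k'} E[\psi(X_k) \psi(X_{k'})] &= \sum_{k=1}^{n-1} (n - k) E[\psi(X_0) \psi(X_k)] \\
&= \sum_{k=1}^{n-1} (n - k) \sum_{i=1}^{l} \sum_{j=1}^{l} \pi_i \psi_i\, p_{i,j}^{(k)} \psi_j = \pi^t diag(\psi) \sum_{k=1}^{n-1} (n - k) P^k \psi.
\end{aligned} \]
Then we have, after substituting and dividing by $n$,
\[ n V[\hat{\theta}_n] = \underbrace{\sum_{i=1}^{l} \pi_i \psi_i^2 - \theta^2}_{\sigma^2} + 2 \pi^t diag(\psi) \sum_{k=1}^{n-1} \frac{(n - k)}{n} P^k \psi - (n - 1) \theta^2. \tag{94} \]
Observe that $\theta^2$ can be written as $\theta^2 = \pi^t diag(\psi) \Pi \psi$. Then we can rewrite
\[ n V[\hat{\theta}_n] = \sigma^2 + 2 \pi^t diag(\psi) \sum_{k=1}^{n-1} \frac{(n - k)}{n} P^k \psi - (n - 1) \pi^t diag(\psi) \Pi \psi. \tag{95} \]
Rewriting, and using $\sum_{k=1}^{n-1} \frac{n - k}{n} = \frac{n - 1}{2}$,
\[ \begin{aligned}
n V[\hat{\theta}_n] &= \sigma^2 + 2 \pi^t diag(\psi) \Big( \sum_{k=1}^{n-1} \frac{(n - k)}{n} P^k \psi - \frac{n - 1}{2} \Pi \psi \Big) \\
&= \sigma^2 + 2 \pi^t diag(\psi) \Big( \sum_{k=1}^{n-1} \frac{(n - k)}{n} (P^k - \Pi) \Big) \psi.
\end{aligned} \tag{96} \]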

Taking the limit and using (89),
\[ \lim_{n \to \infty} n V[\hat{\theta}_n] = \sigma^2 + 2 \pi^t diag(\psi)(Z - I) \psi. \tag{97} \]
In the previous calculation we assumed that we started MCMC with an initial distribution equal to the target distribution, $\alpha = \pi$. It remains to show that the result is valid if we start from any initial distribution $\alpha$, i.e.
\[ \lim_{n \to \infty} \big( n V[\hat{\theta}_n^\alpha] - n V[\hat{\theta}_n^\pi] \big) = 0, \tag{98} \]
where $\hat{\theta}_n^\alpha$ denotes the estimator for a chain started from distribution $\alpha$ and $\hat{\theta}_n^\pi$ denotes the estimator for a chain started from distribution $\pi$.

For the purpose of manipulation, let us define the following summations:
\[ Y_r = \sum_{k=0}^{r-1} \psi(X_k), \qquad Z_n = \sum_{k=r}^{n-1} \psi(X_k). \tag{99} \]
Then we can write
\[ \begin{aligned}
n (V[\hat{\theta}_n^\pi] - V[\hat{\theta}_n^\alpha]) &= \frac{1}{n} \Big[ \big( E(Y_r^\pi + Z_n^\pi)^2 - E(Y_r^\alpha + Z_n^\alpha)^2 \big) - \big( (EY_r^\pi + EZ_n^\pi)^2 - (EY_r^\alpha + EZ_n^\alpha)^2 \big) \Big] \\
&= \frac{1}{n} \big[ A + B + C \big],
\end{aligned} \]
with
\[ \begin{aligned}
A &:= E(Y_r^\pi)^2 - (EY_r^\pi)^2 - E(Y_r^\alpha)^2 + (EY_r^\alpha)^2, \\
B &:= 2 E\big[ (Y_r^\pi - EY_r^\pi)(Z_n^\pi - EZ_n^\pi) \big] - 2 E\big[ (Y_r^\alpha - EY_r^\alpha)(Z_n^\alpha - EZ_n^\alpha) \big], \\
C &:= E(Z_n^\pi)^2 - (EZ_n^\pi)^2 - E(Z_n^\alpha)^2 + (EZ_n^\alpha)^2.
\end{aligned} \]
We look closer at each term. Note that the term $A$ does not depend on $n$. Hence, $\lim_{n \to \infty} \frac{1}{n} A = 0$. Since the state space is finite, $\psi$ is bounded, $\max_{i \in S} |\psi(X_i)| < \infty$. Then we can see that $\lim_{n \to \infty} \frac{1}{n} B = 0$ and $\lim_{n \to \infty} \frac{1}{n} C = 0$. Hence, we have shown (98), and the statement (97) holds for any starting distribution $\alpha$; in particular, $\lim_{n \to \infty} V[\hat{\theta}_n] = 0$. Then we can conclude that the MSE of MCMC is given by
\[ MSE(\hat{\theta}_n) = Bias(\hat{\theta}_n)^2 + V[\hat{\theta}_n] = (E\hat{\theta}_n - \theta)^2 + V[\hat{\theta}_n]. \tag{100} \]
And we conclude that in the limit,
\[ \lim_{n \to \infty} MSE(\hat{\theta}_n) = \lim_{n \to \infty} (E\hat{\theta}_n - \theta)^2 + \lim_{n \to \infty} V[\hat{\theta}_n] = 0. \tag{101} \]
However, each term converges at a different speed. The squared bias converges on the order of $n^{-2}$ while the variance converges on the order of $n^{-1}$. Hence, controlling the variance is much more important. This can be done by selecting an initial distribution $\alpha$ that yields the least variance of the estimator in order to save on computational costs.
