Essential Questions For The Exam 2018, AMCS 308, Stochastic Methods in Engineering
\[ \lambda = \frac{\left(E\big[|Y - EY|^3\big]\right)^{1/3}}{\sigma_y} < \infty \tag{1} \]
then the inequality is given by
\[ |F(x) - \phi(x)| \le \frac{c^* \lambda^3}{(1 + |x|)^3 \sqrt{M}} \tag{2} \]
3. Show how to apply the Monte Carlo method to compute an integral and
discuss the corresponding error.
Consider
\[ \int_0^1 f(x)\,dx \approx \frac{1}{M}\sum_{i=1}^{M} f(x_i) := V_M \tag{3} \]
where the $x_i \sim U[0,1]$ are IID samples.
The variance is given by
\[ V\Big[\frac{1}{M}\sum_{i=1}^{M} f(x_i)\Big] = \frac{1}{M^2}\sum_{i=1}^{M} V[f(x_i)] = \frac{V[f(x_i)]}{M} \tag{4} \]
Then, since the samples are IID, the CLT gives asymptotically
\[ \frac{V_M - E[V_M]}{\sqrt{V[V_M]}} = \sqrt{M}\,\frac{V_M - E[V_M]}{\sqrt{V[f(x_i)]}} \sim N(0,1) \tag{5} \]
Equivalently,
\[ V_M - E[V_M] \sim N\Big(0, \frac{V[f(x_i)]}{M}\Big) \tag{6} \]
and so we can build the asymptotic confidence intervals
\[ V_M \pm C_\alpha \sqrt{\frac{V[f(x_i)]}{M}} \tag{7} \]
where $C_\alpha$ is an appropriate percentile of the standard normal.
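The estimator (3) and the confidence interval (7) can be sketched in a few lines. This is a minimal illustration, not part of the original notes: the integrand $x^2$, the sample size, the seed, and the name `mc_integrate` are assumptions, and the unknown $V[f(x_i)]$ is replaced by the sample variance.

```python
import math
import random

def mc_integrate(f, m, seed=0, c_alpha=1.96):
    """Estimate the integral of f over [0, 1] from m IID uniform samples.

    Returns the estimate V_M and the half-width C_alpha * sqrt(V[f] / m),
    with V[f] replaced by the unbiased sample variance (an assumption).
    """
    rng = random.Random(seed)
    values = [f(rng.random()) for _ in range(m)]
    mean = sum(values) / m
    var = sum((v - mean) ** 2 for v in values) / (m - 1)
    half_width = c_alpha * math.sqrt(var / m)
    return mean, half_width

# Illustrative integrand: the exact integral of x^2 over [0, 1] is 1/3.
estimate, half_width = mc_integrate(lambda x: x * x, 100_000)
print(estimate, "+/-", half_width)
```

Here `c_alpha=1.96` corresponds to an approximate 95% two-sided interval; halving the half-width requires four times as many samples.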
4. Describe and give all the proofs on the sampling method of acceptance-
rejection.
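The method is as follows. We choose a proposal density $\eta$ that shares the support of our target density $f$ and covers it, i.e. $\epsilon f(x) \le \eta(x)$ for some $\epsilon \in (0, 1]$. Then we sample $x \sim \eta$ and, independently, $U \sim U[0,1]$, and decide
\[ \begin{cases} \epsilon \dfrac{f(x)}{\eta(x)} < U & \text{reject the sample} \\[4pt] \epsilon \dfrac{f(x)}{\eta(x)} \ge U & \text{accept the sample} \end{cases} \tag{8} \]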
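Proof: The probability of accepting a proposal is
\[ P\Big(U_k \le \epsilon\frac{f(x_k)}{\eta(x_k)}\Big) = \int\!\!\int_0^{\epsilon f(x)/\eta(x)} d\mu\,\eta(x)\,dx = \int \epsilon\frac{f(x)}{\eta(x)}\,\eta(x)\,dx = \epsilon\int f(x)\,dx = \epsilon \tag{9} \]
Let $x_k$ be the first accepted sample. We would like to show that $x_k$ is from the density $f$. Consider
\[ \begin{aligned}
P(x_k \in B) &= \sum_{k\ge 1} P\Big(x_k \in B,\ U_k \le \epsilon\frac{f(x_k)}{\eta(x_k)}\Big) \prod_{m=1}^{k-1} \underbrace{P\Big(U_m > \epsilon\frac{f(x_m)}{\eta(x_m)}\Big)}_{=1-\epsilon} \\
&= P\Big(x_k \in B,\ U_k \le \epsilon\frac{f(x_k)}{\eta(x_k)}\Big) \underbrace{\sum_{k\ge 1} (1-\epsilon)^{k-1}}_{1/\epsilon} \qquad \because \text{IID}
\end{aligned} \tag{10} \]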
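And we have
\[ P\Big(x_k \in B,\ U_k \le \epsilon\frac{f(x_k)}{\eta(x_k)}\Big) = \int_B\!\int_0^{\epsilon f(x)/\eta(x)} d\mu\,\eta(x)\,dx = \int_B \epsilon\frac{f(x)}{\eta(x)}\,\eta(x)\,dx = \epsilon\int_B f(x)\,dx \tag{11} \]
Hence, dividing by the acceptance probability $\epsilon$ from (9),
\[ P\Big(x_k \in B \,\Big|\, U_k \le \epsilon\frac{f(x_k)}{\eta(x_k)}\Big) = \int_B f(x)\,dx \tag{12} \]
and so the samples are from the density $f$ given that we have accepted them.
We now look at the expected number $K$ of draws from the proposal needed to accept one sample. Since $K$ is geometric with success probability $\epsilon$,
\[ E[K] = \sum_{k\ge 1} k\,P(K = k) = \sum_{k\ge 1} k\,\epsilon\,(1-\epsilon)^{k-1} = \frac{1}{\epsilon} \tag{13} \]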
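The accept/reject rule and the expected number of proposals $1/\epsilon$ can be checked numerically. The sketch below is illustrative only: the target $f(x) = 2x$ on $[0,1]$, the uniform proposal, the value $\epsilon = 1/2$, and the function name `accept_reject` are assumptions chosen so that $\epsilon f \le \eta$ holds.

```python
import random

def accept_reject(f, sample_eta, eta, eps, rng):
    """Return one accepted sample and the number of proposals it took."""
    k = 0
    while True:
        k += 1
        x = sample_eta(rng)          # proposal x ~ eta
        u = rng.random()             # independent U ~ U[0, 1]
        if u <= eps * f(x) / eta(x): # acceptance test from the rule above
            return x, k

rng = random.Random(1)
f = lambda x: 2.0 * x    # assumed target density on [0, 1]
eta = lambda x: 1.0      # assumed proposal density: uniform on [0, 1]
eps = 0.5                # eps * f(x) <= eta(x) for all x in [0, 1]

draws = [accept_reject(f, lambda r: r.random(), eta, eps, rng)
         for _ in range(20_000)]
mean_x = sum(x for x, _ in draws) / len(draws)  # E[X] = 2/3 under f
mean_k = sum(k for _, k in draws) / len(draws)  # E[K] = 1/eps = 2
print(mean_x, mean_k)
```

A larger $\epsilon$ (a proposal that hugs the target more tightly) means fewer wasted proposals per accepted sample.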
Table 1: Comparison between MC and MC with Control Variates

Method            | #Samples                       | Computational Work
MC                | $(c/TOL)^2\, VY$               | $(c/TOL)^2\, VY$
Control Variates  | $(c/TOL)^2\, V[V_m(\beta^*)]$  | $(1+\rho)\,(c/TOL)^2\, V[V_m(\beta^*)]$
As we have seen in question (3), the Monte Carlo error is asymptotically given by the CLT, since the samples are IID:
\[ E[Y] - \frac{1}{M}\sum_{j=1}^{M} Y_j \approx C_\alpha \sqrt{\frac{VY}{M}} \tag{15} \]
where $C_\alpha$ is an appropriate percentile of the standard normal. We would like to reduce the variance without introducing bias.
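(a) Control Variates: We would like to estimate $EY$. We sample $Y$ together with an auxiliary RV $X$ for which $EX$ is known and such that $X$ and $Y$ are highly correlated. For a given $\beta \in \mathbb{R}$, we consider the unbiased estimator
\[ V_m = \frac{1}{M}\sum_{j=1}^{M} Y_j - \frac{\beta}{M}\sum_{j=1}^{M} (X_j - EX) \tag{16} \]
We observe that
\[ V[V_m](\beta) = \frac{1}{M} V[Y - \beta X] = \frac{1}{M}\big(VY + \beta^2\, VX - 2\beta\, \mathrm{Cov}[X, Y]\big) \tag{17} \]
The minimizing $\beta$ is
\[ \beta^* = \frac{\mathrm{Cov}(Y, X)}{VX} \tag{18} \]
Then
\[ V[V_m](\beta^*) = \frac{1}{M} V[Y]\Big(1 - \frac{\mathrm{Cov}(Y, X)^2}{V[X]\,V[Y]}\Big) = \frac{1}{M} V[Y]\big(1 - \mathrm{Cor}(X, Y)^2\big) \le \frac{1}{M} V[Y] \tag{19} \]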
Assume that the work to sample (X, Y ) is (1 + ρ) times the work of sampling Y
alone.
From Table 1, we see that control variates are a good choice only if the increase in computational work is outweighed by the variance reduction. From (19), this gives the computational work inequality
\[ (1 + \rho)\big(1 - \mathrm{Cor}(X, Y)^2\big) \le 1 \iff \mathrm{Cor}(X, Y)^2 \ge \frac{\rho}{1 + \rho} \tag{20} \]
This shows exactly how high the correlation between $Y$ and the auxiliary variable $X$ must be in order to benefit from control variates computationally. The condition can also be stated as
\[ VY \ge (1 + \rho)\, V[V_m(\beta^*)] \tag{21} \]
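A small numerical experiment illustrates (16) to (20). It is a sketch under stated assumptions, not the notes' own example: $Y = e^X$ with control $X \sim U[0,1]$ (so $EX = 1/2$ and $EY = e - 1$) is chosen for illustration, and $\beta^*$ is estimated from the same samples, which introduces a small bias that the fixed-$\beta$ analysis above does not cover.

```python
import math
import random

rng = random.Random(2)
M = 50_000
xs = [rng.random() for _ in range(M)]   # control X ~ U[0, 1], EX = 1/2 known
ys = [math.exp(x) for x in xs]          # Y = exp(X), so EY = e - 1

mean_x = sum(xs) / M
mean_y = sum(ys) / M
var_x = sum((x - mean_x) ** 2 for x in xs) / (M - 1)
var_y = sum((y - mean_y) ** 2 for y in ys) / (M - 1)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (M - 1)

beta = cov_xy / var_x                                   # sample version of (18)
cv_estimate = mean_y - beta * (mean_x - 0.5)            # estimator (16)
variance_ratio = 1.0 - cov_xy ** 2 / (var_x * var_y)    # factor 1 - Cor^2 in (19)
max_rho = 1.0 / variance_ratio - 1.0                    # largest rho satisfying (20)

print(cv_estimate, math.e - 1.0, variance_ratio, max_rho)
```

In this example the correlation is so high that the control variate pays off even if sampling $(X, Y)$ costs many times more than sampling $Y$ alone.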
(b) Antithetic variates: Let $Y = g(X)$, where $X$ has a symmetric distribution (so $EX = 0$). Then $X$ and $-X$ are identically distributed, which means that $Eg(X) = Eg(-X)$ and
\[ EY = E\Big[\frac{g(X) + g(-X)}{2}\Big] \tag{22} \]
This gives the estimator
\[ V_m = \frac{1}{M}\sum_{j=1}^{M} \frac{g(X_j) + g(-X_j)}{2} \tag{23} \]
which reduces the variance whenever the averaged pair $\frac{1}{2}(g(X) + g(-X))$ has a smaller variance than the average of two independent evaluations of $g$. By the variance-of-a-sum law,
\[ V\Big[\frac{g(X) + g(-X)}{2}\Big] = \frac{1}{2} V[g(X)] + \frac{1}{2}\mathrm{Cov}\big(g(X), g(-X)\big) \]
so it is sufficient that the covariance term is negative.
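The effect of the covariance term can be seen in a short experiment. The choices below are assumptions made for illustration: $X \sim N(0,1)$ and the monotone function $g(x) = e^x$, for which $g(X)$ and $g(-X)$ are negatively correlated and $Eg(X) = e^{1/2}$.

```python
import math
import random

rng = random.Random(3)
M = 50_000
g = math.exp                                   # assumed monotone test function
xs = [rng.gauss(0.0, 1.0) for _ in range(M)]   # symmetric X ~ N(0, 1)

pairs = [(g(x) + g(-x)) / 2.0 for x in xs]     # antithetic pair averages
plain = [g(x) for x in xs]                     # single evaluations

def mean(values):
    return sum(values) / len(values)

def sample_var(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

antithetic_estimate = mean(pairs)              # estimates E g(X) = exp(1/2)
var_pair = sample_var(pairs)                   # variance of one antithetic pair
var_two_independent = sample_var(plain) / 2.0  # two independent draws, same cost

print(antithetic_estimate, var_pair, var_two_independent)
```

The comparison is made at equal cost: one antithetic pair and two independent draws both need two evaluations of $g$.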
6. Define the Poisson counting process and explain how to sample it.
It is a collection of random variables $\{N(t)\}$, where $N(t)$ counts the events that have occurred up to time $t$, starting from $t_0$. The number of events between time $t$ and time $s > t$ is $N(s) - N(t)$, which follows a Poisson distribution.
Def.
(a) $N(0) = 0$.
(b) Consider the time mesh $0 = t_0 < t_1 < t_2 < \dots < t_n = T$. Then the increments $N_{t_n} - N_{t_{n-1}}, \dots, N_{t_1} - N_{t_0}$ are independent.
(c) If $I = (s, t) \subseteq \mathbb{R}$, then for $k = 0, 1, 2, \dots$
\[ P(N(t) - N(s) = k) = P(k \text{ jumps in } I) = e^{-\lambda |I|}\,\frac{(\lambda |I|)^k}{k!} \tag{28} \]
where $\lambda > 0$ is the intensity of the process.
Given a time mesh $0 = t_0 < t_1 < \dots < t_n = T$ and an intensity $\lambda > 0$:
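One standard way to sample the process on such a mesh is sketched here; it may differ from the algorithm intended in the notes. It uses the fact that the inter-arrival times of a Poisson process are IID exponential with rate $\lambda$, generates the jump times up to $T$, and then counts the jumps up to each mesh point. The function names, the mesh, and the parameter values are illustrative assumptions.

```python
import bisect
import math
import random

def poisson_jump_times(lam, T, rng):
    """Jump times in (0, T], built from IID Exp(lam) inter-arrival times."""
    times, t = [], 0.0
    while True:
        # 1 - U lies in (0, 1], which avoids log(0)
        t += -math.log(1.0 - rng.random()) / lam
        if t > T:
            return times
        times.append(t)

def counts_on_mesh(times, mesh):
    """N(t_i) for each mesh point: the number of jumps with time <= t_i."""
    return [bisect.bisect_right(times, t) for t in mesh]

rng = random.Random(4)
lam, T = 3.0, 2.0
mesh = [0.0, 0.5, 1.0, 1.5, 2.0]
path = counts_on_mesh(poisson_jump_times(lam, T, rng), mesh)

# Sanity check: E[N(T)] = lam * T
final_counts = [len(poisson_jump_times(lam, T, rng)) for _ in range(5000)]
mean_final = sum(final_counts) / len(final_counts)
print(path, mean_final)
```

An equivalent alternative is to draw each increment $N(t_{i+1}) - N(t_i)$ directly from a Poisson distribution with mean $\lambda (t_{i+1} - t_i)$, using property (b).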
7. Define the compound Poisson process and explain how to sample it.
A compound Poisson process with rate $\lambda > 0$ and jump size distribution $f$ is a continuous time stochastic process $\{X(t)\}$ given by
\[ X(t) = \sum_{i=1}^{N(t)} Z_i \]
where $N(t)$ is a Poisson counting process with rate $\lambda > 0$. The random variables $Z_i$ are IID with distribution $f$ and independent of $N(t)$.
1 Set t = 0, N = 0, X = 0.
2 Sample U ∼ U[0, 1].
3 Set t = t + [−(1/λ) ln(U)]. If t > T (the final time), stop.
4 Sample Z ∼ f.
5 Set N = N + 1 and X = X + Z.
6 Go back to step 2.
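The six steps above translate almost line by line into code. In this sketch the function name `compound_poisson`, the parameter values, and the choice of $N(0,1)$ jump sizes are illustrative assumptions; the check at the end uses $E[N(T)] = \lambda T$ and, for zero-mean unit-variance jumps, $V[X(T)] = \lambda T$.

```python
import math
import random

def compound_poisson(lam, T, sample_z, rng):
    """Return (N(T), X(T)) for one path, following steps 1 to 6."""
    t, n, x = 0.0, 0, 0.0                 # step 1
    while True:
        u = 1.0 - rng.random()            # step 2: U in (0, 1], avoids log(0)
        t += -math.log(u) / lam           # step 3: exponential waiting time
        if t > T:
            return n, x                   # step 3: stop past the final time
        z = sample_z(rng)                 # step 4
        n += 1                            # step 5
        x += z

rng = random.Random(5)
lam, T = 2.0, 3.0
paths = [compound_poisson(lam, T, lambda r: r.gauss(0.0, 1.0), rng)
         for _ in range(5000)]

mean_n = sum(n for n, _ in paths) / len(paths)
mean_x = sum(x for _, x in paths) / len(paths)
var_x = sum((x - mean_x) ** 2 for _, x in paths) / (len(paths) - 1)
print(mean_n, mean_x, var_x)
```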
To sample a compound Poisson process with jump amplitudes $N(0, 1)$ and with partitions in time $\Delta t_n$:
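\[ E[W(s) \mid W(t_2)] = \underbrace{E[W_s]}_{=0} + \mathrm{Cov}(W_s, W_{t_2})\,\underbrace{V[W_{t_2}]^{-1}}_{=1/t_2}\,\big(W_{t_2} - \underbrace{E[W_{t_2}]}_{=0}\big) = \frac{s\,W_{t_2}}{t_2} \tag{29} \]
and
\[ V[W(s) \mid W(t_2)] = \underbrace{V[W_s]}_{=s} - \mathrm{Cov}[W_s, W_{t_2}]\,\underbrace{V[W_{t_2}]^{-1}}_{=1/t_2}\,\mathrm{Cov}[W_{t_2}, W_s] = s - \frac{s^2}{t_2} \tag{30} \]
Note that, for $s \le t_2$,
\[ \mathrm{Cov}[W_s, W_{t_2}] = E[W_s W_{t_2}] - \underbrace{E[W_s]\,E[W_{t_2}]}_{=0} = \min(s, t_2) = s \tag{31} \]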
if $\mathrm{Ker}(\cdot)$ is a pdf (i.e. $\mathrm{Ker}(z) \ge 0$ and normalized), then $\hat\rho_k(y)$ will also be a candidate pdf estimator.
Next, we will show that no unbiased estimator can exist for all continuous $\rho_y(y)$,
We look for a candidate with the least mean square error: an asymptotically unbiased candidate estimator $\hat\rho_{km}(y)$ such that
\[ E[\hat\rho_{km}(y)] \xrightarrow{M \to \infty} \rho(y) \quad \forall y \in \mathbb{R} \tag{34} \]
We choose the window size parameter $k > 0$ to minimize the total error in the estimate of the density of $Y \sim \rho$, given by
\[ \hat\rho_k(y) - \rho(y) = \underbrace{E[\hat\rho_k(y)] - \rho(y)}_{\text{bias}} + \underbrace{\hat\rho_k(y) - E[\hat\rho_k(y)]}_{\text{sampling error}} \tag{35} \]
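We look closer at the bias error:
\[ \begin{aligned}
E[\hat\rho_k(y)] - \rho(y) &= \int \mathrm{Ker}_k(y - z)\,\rho(z)\,dz - \rho(y) \\
&= \int \mathrm{Ker}(z)\Big[\rho(y) - kz\,\rho'(y) + \tfrac{1}{2}(kz)^2 \rho''(y) \underbrace{-\, \tfrac{1}{6}(kz)^3 \rho'''(y)}_{\text{integration is zero by symmetry}} + O(z^4)\Big]\,dz - \rho(y) \\
&\approx \tfrac{1}{2}\,\sigma_{\mathrm{Ker}}^2\, k^2\, \rho''(y)
\end{aligned} \tag{36} \]
Note that we have expanded $\rho(y - kz)$ around $y$ using a Taylor series. The first term cancels with $\rho(y)$ because the kernel is a pdf, and the second term is zero because the kernel is symmetric. We write the third term using $\sigma_{\mathrm{Ker}}^2 = \int z^2\, \mathrm{Ker}(z)\,dz$, plus higher order terms.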
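And,
\[ \begin{aligned}
V[\hat\rho_k](y) &= \frac{1}{M k^2}\, V\Big[\mathrm{Ker}\Big(\frac{y - z}{k}\Big)\Big] \\
&= \frac{\rho(y)}{M k}\int \mathrm{Ker}_k^2(z)\,dz + \frac{\rho'(y)}{M}\int z\,\mathrm{Ker}^2(z)\,dz + \frac{\rho''(y)}{2M}\int z^2\,\mathrm{Ker}^2(z)\,dz + \dots
\end{aligned} \tag{39} \]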
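Hence, $k^* \propto M^{-1/5}$, and so the approximation error is $\propto M^{-2/5}$. However, we see that this choice depends on $y$. Next we find an optimal $k$ independent of $y$. This can be done by minimizing the $L^2$ error,
\[ \begin{aligned}
\int E\big[(\rho(y) - \hat\rho_k(y))^2\big]\,dy &\le 2\int \big(\rho(y) - E[\hat\rho_k](y)\big)^2\,dy + 2\int V[\hat\rho_k](y)\,dy \\
&\le \frac{1}{2}\,\sigma_{\mathrm{Ker}}^4\, k^4 \int |\rho''(y)|^2\,dy + \frac{2c}{M k}\int \mathrm{Ker}_k^2(z)\,dz + \dots \qquad \text{(using the Chebyshev inequality)}
\end{aligned} \tag{41} \]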
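Let us first compute the bias, using a Taylor expansion to simplify. Here $\mathbf{H}$ is the bandwidth matrix, $Df$ the gradient and $\mathcal{H}f$ the Hessian of $f$:
\[ \begin{aligned}
E[\hat f(X; \mathbf{H})] - f(X) &= \int K_{\mathbf{H}}(X - Y) f(Y)\,dY - f(X) = \int K(Z)\, f(X - \mathbf{H}^{1/2} Z)\,dZ - f(X) \\
&= \int K(Z)\Big[f(X) - (\mathbf{H}^{1/2} Z)^T Df(X) + \frac{1}{2}(\mathbf{H}^{1/2} Z)^T \mathcal{H}f(X)\,(\mathbf{H}^{1/2} Z)\Big]\,dZ + o(\mathrm{tr}(\mathbf{H})) - f(X) \\
&= f(X) - \int Z^T \mathbf{H}^{1/2} Df(X)\, K(Z)\,dZ + \frac{1}{2}\int Z^T \mathbf{H}^{1/2}\, \mathcal{H}f(X)\, \mathbf{H}^{1/2} Z\, K(Z)\,dZ + o(\mathrm{tr}(\mathbf{H})) - f(X) \\
&= \frac{1}{2}\,\mathrm{tr}\Big(\mathbf{H}^{1/2}\, \mathcal{H}f(X)\, \mathbf{H}^{1/2} \int Z Z^T K(Z)\,dZ\Big) + o(\mathrm{tr}(\mathbf{H})) \sim \frac{1}{2}\,\sigma^2(K)\,\mathrm{tr}\big(\mathbf{H}\, \mathcal{H}f(X)\big)
\end{aligned} \]
where $\sigma^2 = \int z_i^2\, K(Z)\,dZ$.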
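Next we compute the variance of $\hat f$:
\[ \begin{aligned}
V(\hat f(X, \mathbf{H})) &= \frac{1}{M}\Big[|\mathbf{H}|^{-1/2}\int K^2(Z)\, f(X - \mathbf{H}^{1/2} Z)\,dZ - \Big(\int K(Z)\, f(X - \mathbf{H}^{1/2} Z)\,dZ\Big)^2\Big] \\
&= \frac{1}{M}\,|\mathbf{H}|^{-1/2}\, R(K)\, f(X) + o\Big(\frac{1}{M}\,|\mathbf{H}|^{-1/2}\Big)
\end{aligned} \]
where $R(K) = \int K^2(z)\,dz$. Then the MSE is given by,
For simplicity, consider $\mathbf{H} = h^2 I$; then
\[ MSE = \frac{1}{4}\, h^4\, (\sigma^2(K))^2\, (\nabla^2 f(x))^2 + \frac{1}{M}\, h^{-d}\, R(K)\, f(x) + o\Big(\frac{1}{M}\, h^{-d}\Big) \]
Note that at best, when using the optimal $h^* \propto M^{-1/(d+4)}$, the MSE depends on the dimension as
\[ MSE \propto M^{-\frac{4}{d+4}}, \]
i.e. the root mean square error is $\propto M^{-\frac{2}{d+4}}$, consistent with the one-dimensional rate $M^{-2/5}$ above. We see that as the dimension increases, the convergence rate deteriorates, as expected.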
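A minimal kernel density estimator with $\mathbf{H} = h^2 I$ illustrates the bandwidth scaling $h \propto M^{-1/(d+4)}$. Everything concrete here is an assumption made for the sketch: the standard Gaussian kernel, the proportionality constant 1 in the bandwidth, the standard normal test data in $d = 2$, and the function name `gaussian_kde`.

```python
import math
import random

def gaussian_kde(x, samples, h):
    """Kernel density estimate at point x with H = h^2 I and Gaussian kernel K."""
    d = len(x)
    norm = (2.0 * math.pi) ** (-d / 2.0) * h ** (-d)   # |H|^{-1/2} times the constant of K
    total = 0.0
    for s in samples:
        sq = sum(((xi - si) / h) ** 2 for xi, si in zip(x, s))
        total += math.exp(-0.5 * sq)
    return norm * total / len(samples)

rng = random.Random(6)
d, M = 2, 4000
samples = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(M)]

h = M ** (-1.0 / (d + 4))                  # bandwidth scaling, constant set to 1
estimate = gaussian_kde([0.0] * d, samples, h)
true_value = (2.0 * math.pi) ** (-d / 2.0) # N(0, I) density at the origin
print(estimate, true_value)
```

The estimate at the mode is biased slightly downward, as the bias term predicts for a point where the Hessian of $f$ is negative definite.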
10. What is affine prediction? Describe its use for the approximation of the
conditional expectation E[Y |Z]? When is this approximation exact?
Affine prediction is a method for solving prediction problems: given the joint distribution of a random vector $(Y, Z)$ and observations of $Z$, we want to compute $\hat V$, the best prediction of $Y$.
In particular, consider the space
\[ L_p = \{g : g \text{ is a polynomial of degree at most } p\} \tag{45} \]
The prediction returns an $L_p$ approximation of $E[Y|Z]$.
Affine prediction: We restrict ourselves to the $L_1$ approximation of $E[Y|Z]$, i.e. we consider $Y \approx g(Z) = a^t (Z - EZ) + b$. To find $(a^*, b^*)$, we solve for $m = 1, \dots, M$.
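For scalar $Z$, minimizing the mean square error $E[(Y - a(Z - EZ) - b)^2]$ gives $a^* = \mathrm{Cov}(Z, Y)/V[Z]$ and $b^* = EY$, which can be estimated from $M$ samples. The sketch below is an illustration under assumptions: the data model $Y = 2Z + 1 + \text{noise}$ with $Z \sim N(1, 1)$ is made up so that $E[Y|Z]$ is itself affine, a case in which the affine prediction is exact (jointly Gaussian $(Y, Z)$ is the classical example).

```python
import random

rng = random.Random(7)
M = 20_000
zs = [rng.gauss(1.0, 1.0) for _ in range(M)]              # observations of Z
ys = [2.0 * z + 1.0 + rng.gauss(0.0, 0.5) for z in zs]    # assumed model for Y

mean_z = sum(zs) / M
mean_y = sum(ys) / M
var_z = sum((z - mean_z) ** 2 for z in zs) / (M - 1)
cov_zy = sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys)) / (M - 1)

a = cov_zy / var_z     # sample estimate of a* = Cov(Z, Y) / V[Z]
b = mean_y             # sample estimate of b* = E[Y]

def predict(z):
    """Affine prediction g(z) = a (z - EZ) + b of Y given Z = z."""
    return a * (z - mean_z) + b

print(a, b, predict(1.0))
```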
11. Motivate the use of non-linear weighted least squares starting from a
maximum posterior, Gaussian likelihood, Bayesian formulation.
We consider the following setting (see footnote 1):
\[ x = m + e \tag{49} \]
where $m$ is the model (a deterministic unknown) and $e \sim N(0, \sigma^2 \Delta)$ is a zero-mean multivariate Gaussian noise with covariance $\mathrm{Cov}[e] = \sigma^2 \Delta$. In the weighted least squares framework with $M$ samples, we aim to minimize
\[ \sigma_{WLS} = \|W(x - m)\|_F^2 \tag{50} \]
where $W$ is a diagonal matrix of weights and $\|\cdot\|_F$ is the Frobenius norm. Note that by minimizing (50) we are reducing the error according to different weights. These weights can be chosen based on prior information (a prior distribution).
Next we find the log-likelihood function of the residual $e = x - m$:
\[ \ell = -\frac{M}{2}\ln \sigma^2 - \frac{1}{2}\ln|\Delta| - \frac{1}{2\sigma^2}(x - m)^t \Delta^{-1} (x - m) \tag{51} \]
1
- Bro R, Sidiropoulos ND and Smilde AK. Maximum likelihood fitting using ordinary least squares algorithms. J. Chemom. 2002; 16: 387-400.
- Judge GG, Griffiths WE, Carter Hill R, Luetkepohl H and Lee TC. The Theory and Practice of Econometrics. Wiley: New York, 1985.
Maximizing $\ell$ with respect to $\sigma^2$ yields the estimator
\[ \hat\sigma^2 = \frac{(x - m)^t \Delta^{-1} (x - m)}{M} \propto (x - m)^t \Delta^{-1} (x - m) \tag{52} \]
By choosing W = ∆−1/2 , we obtain
Comparing (50) and (54), we see that by choosing $W = \Delta^{-1/2}$ the weighted least squares estimates coincide with the maximum likelihood estimates.
1 Set c = 0 and an initial mc = 0
where $\xi$ is the admissible set of $m$ following the structure of the model and $\epsilon$ is a required tolerance. Hence, in this Bayesian setting, we have used the covariance $\Delta$ of the noise $e$ as a prior, and we were able to formulate a maximum posterior estimator of the parameter $\sigma$ using non-linear least squares.
12. Describe Linear Partially Observed State Space Models. What is the
filtering problem? What is a Bayesian filter?
An Rd stochastic sequence X = {Xk }k≥0 is a state space model if it evolves according
to a recursion of the form:
Xk+1 = Fk Xk + Wk + Uk (59)
Where {Fk }k≥0 is a deterministic sequence of d×d matrices , {Uk }k≥0 is a deterministic
sequence of d × 1 vectors and {Wk }k≥0 is a sequence of independent Rd random vectors
for which EWk = 0 and E[||Wk ||2 ] < ∞.
Partially observed states: We are unable to observe the state $X_k$ of the system directly; instead, we observe a corrupted version of it. In this setting, we assume that one can directly observe $\{Z_k\}_{k \ge 0}$, where
\[ Z_k = G_k X_k + V_k \quad \text{for } k \ge 0 \tag{60} \]
Here $\{G_k\}_{k \ge 0}$ is a deterministic sequence of $n \times d$ matrices and $\{V_k\}_{k \ge 0}$ is a stochastic sequence of $n \times 1$ random vectors.
Filtering is concerned with trying to extract the current state $X_k$ from the observed history of the observation process $Z$ up to time $k$.
The Bayesian filter gives the complete evolution of the distribution of the state $X_k$ in terms of the available data in a recursive way. It is done in the following way:
1 Time evolution update:
\[ \pi(X_{k+1} \mid \{Z_j\}_{j=1}^{k}) = \int \pi(X_{k+1} \mid X_k)\, \pi(X_k \mid \{Z_j\}_{j=1}^{k})\, dX_k \]
2 Observation update:
\[ \pi(X_{k+1} \mid \{Z_j\}_{j=1}^{k+1}) = \frac{\pi(Z_{k+1} \mid X_{k+1})\, \pi(X_{k+1} \mid \{Z_j\}_{j=1}^{k})}{\pi(Z_{k+1} \mid \{Z_j\}_{j=1}^{k})} \]
with
\[ \pi(Z_{k+1} \mid \{Z_j\}_{j=1}^{k}) = \int \pi(Z_{k+1} \mid X_{k+1})\, \pi(X_{k+1} \mid \{Z_j\}_{j=1}^{k})\, dX_{k+1} \]
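The two updates are easiest to see when the state takes finitely many values, so that the integrals become sums. The sketch below makes that simplifying assumption, which is not the linear Gaussian setting of the notes; the two-state transition matrix and the likelihood values are made-up numbers, and the function names are hypothetical.

```python
def time_update(posterior, P):
    """Prediction: pi(X_{k+1} | Z_1..Z_k) = sum_i pi(X_k = i | Z_1..Z_k) P[i][j]."""
    n = len(posterior)
    return [sum(posterior[i] * P[i][j] for i in range(n)) for j in range(n)]

def observation_update(predicted, likelihood):
    """Bayes step: multiply by pi(Z_{k+1} | X_{k+1}) and normalize."""
    unnormalized = [l * p for l, p in zip(likelihood, predicted)]
    evidence = sum(unnormalized)     # pi(Z_{k+1} | Z_1..Z_k)
    return [u / evidence for u in unnormalized]

# Illustrative two-state example (all numbers are assumptions).
P = [[0.9, 0.1],
     [0.2, 0.8]]                     # P[i][j] = pi(X_{k+1} = j | X_k = i)
posterior_k = [0.5, 0.5]             # pi(X_k | Z_1..Z_k)
likelihood = [0.8, 0.3]              # pi(Z_{k+1} = observed value | X_{k+1} = j)

predicted = time_update(posterior_k, P)
posterior_next = observation_update(predicted, likelihood)
print(predicted, posterior_next)
```

Iterating the two functions over a sequence of observations gives the full recursive filter for a finite state space.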
Given a Bayesian filter for linear evolution and observations subject to Gaussian noise
and initial condition X0 Gaussian, then Xk will be Gaussian. This leads to the Kalman
filter that just tracks the conditional mean and the covariance of Xk .
Xk+1 = Fk Xk + Wk + Uk (61)
For k ≥ 0 :
for $T = 1$ and
\[ \int_0^1 \sigma e^{s-1}\, dW_s \sim N\Big(0, \frac{\sigma^2}{2}\,(1 - e^{-2})\Big) \tag{64} \]
and
\[ Y_{n+1} = U_{n+1} + \eta_{n+1}, \qquad \eta_n \sim N(0, T) \tag{65} \]
then given an initial condition U0 Gaussian, the filtering distribution,
is Gaussian. So using the Kalman filter we can track the evolution of the first two
moments as follows:
(a) Prediction step:
Define $\mu(x, t) = p(X(t) = x)$, where $x \in S$, the sample space. Starting from the numbers $\mu(x, 0) = \lambda_x\ \forall x \in S$ and given the transition matrix $P$, the evolution in time is
\[ \mu(x, t) = (\lambda P^t)_x \quad \forall x \in S \tag{70} \]
By conditional probability,
\[ \begin{aligned}
\mu(x, t+1) &= p(X(t+1) = x) \\
&= \sum_{y \in S} p(X(t+1) = x \mid X(t) = y)\, p(X(t) = y) \\
&= \sum_{y \in S} p(y, x)\, \mu(y, t)
\end{aligned} \tag{71} \]
We will write the last equation in a matrix form. Suppose that S = {x1 , . . . , xN } and
suppose that |S| < ∞. We define a row vector of order |S|:
16. Let X(t) be a discrete Markov Chain and for t < T let
State and derive a backward equation for the above expected value of a
state observable.
We derive
\[ \begin{aligned}
f(x, t) &= E[V(X(T)) \mid X(t) = x] \\
&= \sum_{y \in S} E[V(X(T)) \mid X(t) = x \text{ and } X(t+1) = y]\; p(X(t+1) = y \mid X(t) = x) \\
&= \sum_{y \in S} E[V(X(T)) \mid X(t+1) = y]\, p(x, y) = \sum_{y \in S} f(y, t+1)\, p(x, y)
\end{aligned} \tag{75} \]
The evolution equation of the expectation can be written in a matrix form. Again,
suppose that S = {x1 , . . . , xN } and suppose that |S| < ∞. We define a vector of order
|S|: f (t) =< f1 (t), . . . , fN (t) >. And so we have,
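In matrix form, the backward recursion (75) reads $f(t) = P f(t+1)$ with terminal condition $f(T) = V$, while the forward recursion (71) reads $\mu(t+1) = \mu(t) P$ for the row vector $\mu$. The sketch below checks that both routes give the same expected value $E[V(X(T))]$; the two-state chain, the observable `V`, and the initial distribution are made-up illustrative values.

```python
def mat_vec(P, f):
    """Backward step: (P f)_x = sum_y p(x, y) f_y."""
    return [sum(p_xy * f_y for p_xy, f_y in zip(row, f)) for row in P]

def vec_mat(mu, P):
    """Forward step: (mu P)_x = sum_y mu_y p(y, x)."""
    n = len(mu)
    return [sum(mu[y] * P[y][x] for y in range(n)) for x in range(n)]

# Illustrative two-state chain (all numbers are assumptions).
P = [[0.9, 0.1],
     [0.5, 0.5]]
V = [1.0, 3.0]        # state observable V(x)
lam = [1.0, 0.0]      # initial distribution mu(., 0)
T = 5

f = V[:]              # terminal condition f(., T) = V
for _ in range(T):
    f = mat_vec(P, f) # f(., t) = P f(., t + 1)

mu = lam[:]
for _ in range(T):
    mu = vec_mat(mu, P)   # mu(., t + 1) = mu(., t) P

expected_backward = sum(l * fx for l, fx in zip(lam, f))   # sum_x mu(x, 0) f(x, 0)
expected_forward = sum(m * v for m, v in zip(mu, V))       # sum_x mu(x, T) V(x)
print(expected_backward, expected_forward)
```

The backward route is convenient when the expectation is needed for many initial states at once, since `f` holds the answer for every starting state.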
17. State and discuss a method for sampling from the invariant measure of a
Markov Chain. Mention a possible application of it.
The Metropolis-Hastings algorithm samples from a pdf $f$. Fixing $f = w$ as the target, one constructs an ergodic, recursively defined Markov chain (MC) such that $w$ satisfies the detailed balance condition and hence is its invariant measure.
Metropolis-Hastings:
Input: $f$, the target pdf to be sampled, and $q(y|x)$, a proposal density. Let $X_n$ be the current sample. We perform the following:
1 Given $X_n$, we sample $Y_n \sim q(\cdot|X_n)$.
2 Let the acceptance probability be
\[ p(x, y) = \min\Big\{1, \frac{f(y)\, q(x|y)}{f(x)\, q(y|x)}\Big\} \tag{77} \]
and take
\[ X_{n+1} = \begin{cases} Y_n & \text{with probability } p(X_n, Y_n) \\ X_n & \text{else} \end{cases} \tag{78} \]
Hence the detailed balance follows and then $f$ is a stationary pdf for the MC.
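Steps 1 and 2 can be sketched as follows. The concrete choices are assumptions for illustration only: an unnormalized standard normal target, a Gaussian random walk proposal with unit step size, a burn-in of 1000 steps, and the function name `metropolis_hastings`. Both `f` and `q` are used only through the ratio in (77), so their normalizing constants may be dropped.

```python
import math
import random

def metropolis_hastings(f, sample_q, q, x0, n, rng):
    """Run n steps of the chain; q(a, b) is the proposal density of a given b."""
    x, chain = x0, []
    for _ in range(n):
        y = sample_q(x, rng)                             # step 1: Y_n ~ q(. | X_n)
        ratio = (f(y) * q(x, y)) / (f(x) * q(y, x))      # ratio in (77)
        if rng.random() < min(1.0, ratio):               # accept with prob. p(X_n, Y_n)
            x = y
        chain.append(x)                                  # otherwise keep X_n
    return chain

rng = random.Random(9)
f = lambda x: math.exp(-0.5 * x * x)                 # assumed target, unnormalized N(0, 1)
q = lambda a, b: math.exp(-0.5 * (a - b) ** 2)       # unnormalized N(b, 1) density at a
sample_q = lambda x, r: r.gauss(x, 1.0)              # random walk proposal

chain = metropolis_hastings(f, sample_q, q, 0.0, 60_000, rng)
kept = chain[1000:]                                  # discard an assumed burn-in
mean_kept = sum(kept) / len(kept)
var_kept = sum((x - mean_kept) ** 2 for x in kept) / (len(kept) - 1)
print(mean_kept, var_kept)
```

For a symmetric proposal such as this one the `q` factors cancel and the rule reduces to the Metropolis acceptance ratio $f(y)/f(x)$; they are kept here to match (77).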
Application: In the Bayesian framework, we have data $\{X_i\}_{i=1}^{n}$ and we want to estimate the parameter $\theta$:
\[ p(\theta \mid \{X_i\}) \propto L(\{X_i\} \mid \theta)\, p(\theta) \tag{81} \]
The posterior only needs to be known up to a multiplicative constant. Then we can apply the M-H algorithm to draw samples from the posterior $p(\theta \mid \{X_i\})$.
Next we discuss the bias and statistical error of MCMC. For an arbitrary function $\psi$ we would like to estimate the expectation $\theta = E[\psi(X)]$, where $X$ is a discrete random vector. Consider the stationary target probability $\pi$; then
\[ \theta = E[\psi(X)] = \pi^t \psi \tag{82} \]
Consider the MCMC estimator
\[ \hat\theta_n = \frac{1}{n}\sum_{k=0}^{n-1} \psi(X_k) \tag{83} \]
where $X_k$ is a state in the Markov chain starting from a fixed initial distribution $\alpha$. Consider an irreducible aperiodic transition matrix $P$ whose ergodic limit is $\pi$. Next we check the bias, i.e. whether $E\hat\theta_n = \theta$. We proceed:
\[ E\hat\theta_n = \frac{1}{n}\sum_{k=0}^{n-1} E[\psi(X_k)] = \frac{1}{n}\sum_{k=0}^{n-1} \alpha^t P^k \psi = \frac{1}{n}\,\alpha^t \sum_{k=0}^{n-1} P^k \psi \tag{84} \]
Hence, MCMC is asymptotically unbiased. However, for finite $n$ it is biased, and the bias can be significant for the early iterates, which are still close to the initial distribution $\alpha$. Therefore, it is customary to allow a burn-in period $r$ (discarding the output of the first $r$ steps) and then calculate our estimate as
\[ \hat\theta_{n-r} = \frac{1}{n - r}\sum_{k=r+1}^{n} \psi(X_k) \tag{86} \]
The choice of $r$ can be pre-selected or estimated statistically.
Next we find the variance of the estimator $\hat\theta_n$. Let $\Pi$ be a matrix with identical row vectors equal to the target probability $\pi^t$. Then,
where we have used the fact that $\Pi P = P \Pi = \Pi = \Pi^2$, since $P$ has $\pi$ as the limit probability and $\Pi$ is idempotent. We can also see that
\[ Z = (I - (P - \Pi))^{-1} \tag{88} \]
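\[ V[\hat\theta_n] = \frac{1}{n^2}\Big(E\Big[\Big(\sum_{k=0}^{n-1} \psi(X_k)\Big)^2\Big] - \Big(E\Big[\sum_{k=0}^{n-1} \psi(X_k)\Big]\Big)^2\Big) \tag{90} \]
Rewriting,
\[ n^2\, V[\hat\theta_n] = E\Big[\sum_{k=0}^{n-1} \psi(X_k)^2\Big] + 2\sum_{k < k'} E[\psi(X_k)\psi(X_{k'})] - \Big(E\Big[\sum_{k=0}^{n-1} \psi(X_k)\Big]\Big)^2 \tag{91} \]
Consider the case $\alpha = \pi$, where $\pi$ is of fixed length $l$, that is, when the initial probability is the target probability. Then
\[ \Big(E\Big[\sum_{k=0}^{n-1} \psi(X_k)\Big]\Big)^2 = n^2 \theta^2 \tag{92} \]
and
\[ E\Big[\sum_{k=0}^{n-1} \psi(X_k)^2\Big] = \sum_{k=0}^{n-1} E[\psi(X_k)^2] = n\sum_{i=1}^{l} \pi_i \psi_i^2 \tag{93} \]
Also,
\[ \begin{aligned}
\sum_{k < k'} E[\psi(X_k)\psi(X_{k'})] &= \sum_{k=1}^{n-1} (n - k)\, E[\psi(X_0)\psi(X_k)] \\
&= \sum_{k=1}^{n-1} (n - k) \sum_{i=1}^{l}\sum_{j=1}^{l} \pi_i \psi_i\, p_{i,j}^{(k)}\, \psi_j = \pi^t\, \mathrm{diag}(\psi) \sum_{k=1}^{n-1} (n - k) P^k \psi
\end{aligned} \]
Rewriting, with $\sigma^2 = \sum_{i=1}^{l} \pi_i \psi_i^2 - \theta^2$ the variance of $\psi$ under $\pi$,
\[ \begin{aligned}
n\, V[\hat\theta_n] &= \sigma^2 + 2\pi^t\, \mathrm{diag}(\psi)\Big(\sum_{k=1}^{n-1} \frac{n - k}{n}\, P^k \psi - \frac{n - 1}{2}\, \Pi \psi\Big) \\
&= \sigma^2 + 2\pi^t\, \mathrm{diag}(\psi) \sum_{k=1}^{n-1} \frac{n - k}{n}\, (P^k - \Pi)\, \psi
\end{aligned} \tag{96} \]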
Taking the limit and using (89) ,
In the previous calculation we assumed that we have started MCMC with an initial
distribution equal to the target distribution α = π. It remains to show that the result
is valid if we start from any initial distribution α, i.e.
where θ̂nα denotes the estimator for a chain started from distribution α and θ̂nπ denotes
the estimator for a chain started from distribution π
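\[ \begin{aligned}
n\big(V[\hat\theta_n^\pi] - V[\hat\theta_n^\alpha]\big) &= \frac{1}{n}\Big[E(Y_r^\pi + Z_n^\pi)^2 - (EY_r^\pi + EZ_n^\pi)^2 - E(Y_r^\alpha + Z_n^\alpha)^2 + (EY_r^\alpha + EZ_n^\alpha)^2\Big] \\
&= \frac{1}{n}\Big[\underbrace{E(Y_r^\pi)^2 - (EY_r^\pi)^2 - E(Y_r^\alpha)^2 + (EY_r^\alpha)^2}_{:= A} \\
&\qquad + \underbrace{2E\big[(Y_r^\pi - EY_r^\pi)(Z_n^\pi - EZ_n^\pi)\big] - 2E\big[(Y_r^\alpha - EY_r^\alpha)(Z_n^\alpha - EZ_n^\alpha)\big]}_{:= B} \\
&\qquad + \underbrace{E(Z_n^\pi)^2 - (EZ_n^\pi)^2 - E(Z_n^\alpha)^2 + (EZ_n^\alpha)^2}_{:= C}\Big]
\end{aligned} \]
We look closer at each term. Note that the term $A$ does not depend on $n$. Hence $\lim_{n \to \infty} \frac{1}{n} A = 0$. Since the state space is finite, $\psi$ is bounded: $\max_{i \in S} |\psi(X_i)| < \infty$. Then we can see that $\lim_{n \to \infty} \frac{1}{n} B = 0$ and $\lim_{n \to \infty} \frac{1}{n} C = 0$. Hence, we have shown (98), and the statement (97) holds for any starting distribution $\alpha$, that is, $\lim_{n \to \infty} V[\hat\theta_n] = 0$. Then we can conclude that the MSE of MCMC is given by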
However, each term converges at a different speed. The squared bias converges on the order of $n^{-2}$, while the variance converges on the order of $n^{-1}$. Hence, controlling the variance is much more important. This can be done by selecting an initial distribution $\alpha$ that yields the least variance of the estimator, in order to save on computational costs.