4404 Notes Is

Uploaded by

Sudeep Raja
Copyright © 2013 by Karl Sigman

1 Rare event simulation and importance sampling


Suppose we wish to use Monte Carlo simulation to estimate a probability $p = P(A)$ when the
event $A$ is “rare” (e.g., when $p$ is very small). An example would be $p = P(D_k > b)$ with a
very large $b$ for delay $D_k$ of the $k$th customer in a queue. We could naively simulate $n$ (large)
iid copies of $A$, denoted by $A_1, A_2, \ldots, A_n$, then set $X_i = I\{A_i\}$ and use the crude estimate
$$p \approx p(n) = \frac{1}{n}\sum_{i=1}^{n} X_i. \tag{1}$$

But this is not a good idea: $\mu \stackrel{\text{def}}{=} E(X_i) = P(A) = p$ and $\sigma^2 \stackrel{\text{def}}{=} Var(X_i) = p(1-p)$ and so,
since $p$ is assumed very small, the ratio $\sigma/\mu = \sqrt{p(1-p)}/p \sim 1/\sqrt{p} \to \infty$ as $p \downarrow 0$; relative
to $\mu$, $\sigma$ is of a much larger magnitude. This is very bad since when constructing confidence
intervals,
$$p(n) \pm \frac{z_{\alpha/2}\,\sigma}{\sqrt{n}},$$
the length of the interval is in units of σ: If σ is much larger than what we are trying to
estimate, µ, then the confidence interval will be way too large to be of any use. It would be
like saying “I am 95% confident that he weighs 140 pounds plus or minus 500 pounds”.
To make matters worse, increasing the number n of copies in the Monte Carlo so as to reduce
the interval length, while sounding OK, could be impractical since n would end up having to
be enormous.
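To put numbers on this, the following minimal Python sketch computes the relative half-width of the crude confidence interval and the sample size needed to reach a target relative precision. The function names and the default z = 1.96 (a 95% interval) are illustrative choices, not part of the notes.

```python
import math

def relative_half_width(p, n, z=1.96):
    """Half-width of the crude interval divided by p: z*sigma/(p*sqrt(n))."""
    sigma = math.sqrt(p * (1.0 - p))      # sigma for the indicator X_i
    return z * sigma / (p * math.sqrt(n))

def samples_needed(p, rel_err, z=1.96):
    """Smallest n for which the relative half-width is at most rel_err."""
    return math.ceil((z / rel_err) ** 2 * (1.0 - p) / p)

if __name__ == "__main__":
    for p in (1e-2, 1e-4, 1e-6):
        print(p, samples_needed(p, 0.10))
```

The required n grows like 1/p: for p of order 10^-6 and a 10% relative half-width it is already in the hundreds of millions.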
Importance sampling is a technique that gets around this problem by changing the probability
distributions of the model so as to make the rare event happen often instead of rarely.
To understand the basic idea, suppose we wish to compute $E(h(X)) = \int h(x) f(x)\,dx$ for a
continuous random variable $X$ distributed with density $f(x)$. For example, if $h(x) = I\{x > b\}$
for a given large $b$, then $h(X) = I\{X > b\}$ and $E(h(X)) = P(X > b)$.
Now let g(x) be any other density such that f (x) = 0 whenever g(x) = 0, and observe that
we can re-write

$$
\begin{aligned}
E(h(X)) &= \int h(x) f(x)\,dx \\
&= \int \Big[h(x)\frac{f(x)}{g(x)}\Big] g(x)\,dx \\
&= \tilde{E}\Big[h(X)\frac{f(X)}{g(X)}\Big],
\end{aligned}
$$

where $\tilde{E}$ denotes expected value when $g$ is used as the distribution of $X$ (as opposed to the
original distribution $f$). In other words: If $X$ has distribution $g$, then the expected value of
$h(X)\,f(X)/g(X)$ is the same as what we originally wanted. The ratio $L(X) = f(X)/g(X)$ is called the
likelihood ratio. We can write

E(h(X)) = Ẽ(h(X)L(X)); (2)

the left-hand side uses distribution f for X, while the right-hand side uses distribution g for X.

To make this work in our favor, we would want to choose g so that the variance of h(X)L(X)
(under g) is small relative to its mean.
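As a small illustration of (2), the sketch below estimates P(X > b) for a standard normal X by sampling from a normal density shifted to mean b. The choice of a standard normal f, the shifted normal g, and the function names are assumptions made for this example only; with these choices the likelihood ratio is L(x) = f(x)/g(x) = exp(-bx + b²/2).

```python
import math
import random

def normal_tail_is(b, n, seed=0):
    """Estimate P(X > b), X ~ N(0,1), by sampling X from g = N(b, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(b, 1.0)                        # draw under g
        if x > b:                                    # h(x) = I{x > b}
            total += math.exp(-b * x + 0.5 * b * b)  # weight by L(x)
    return total / n

def normal_tail_exact(b):
    """Exact tail probability of N(0,1), used here only as a check."""
    return 0.5 * math.erfc(b / math.sqrt(2.0))

if __name__ == "__main__":
    print(normal_tail_is(4.0, 20000, seed=1), normal_tail_exact(4.0))
```

Under g the event {X > b} happens about half the time, so every second sample contributes to the estimate, whereas under f almost none would.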
We can easily generalize this idea to multi-dimensions: Suppose $h = h(X_1, \ldots, X_k)$ is real-valued
where $(X_1, \ldots, X_k)$ has joint density $f(x_1, \ldots, x_k)$. Then for an alternative joint density
$g(x_1, \ldots, x_k)$, we once again can write

$$E(h(X_1, \ldots, X_k)) = \tilde{E}(h(X_1, \ldots, X_k)\,L(X_1, \ldots, X_k)), \tag{3}$$

where $L(X_1, \ldots, X_k) = \dfrac{f(X_1, \ldots, X_k)}{g(X_1, \ldots, X_k)}$, and $\tilde{E}$ denotes expected value when $g$ is used as the joint
distribution of $(X_1, \ldots, X_k)$.

1.1 Application to the FIFO GI/GI/1 queue


As a concrete example, let’s consider the FIFO GI/GI/1 queue with iid service times $S_i$ distributed
as $G(x) = P(S \le x)$ and iid interarrival times $T_i$ distributed as $A(x) = P(T \le x)$.
$\lambda \stackrel{\text{def}}{=} E(T)^{-1}$ is the arrival rate, $1/\mu = E(S)$ and $\rho = \lambda/\mu$. We assume the stability condition
$\rho < 1$, which can be equivalently stated as $E(S - T) < 0$.
Let $\Delta_i \stackrel{\text{def}}{=} S_i - T_i$, $i \ge 0$; the FIFO delay recursion is given by $D_{n+1} = (D_n + \Delta_n)^+$, $n \ge 0$,
and we shall assume that $D_0 = 0$. If we form the random walk

$$R_k = \Delta_1 + \cdots + \Delta_k, \quad R_0 = 0,$$

then since $E(\Delta) = E(S - T) < 0$, the random walk has negative drift and hence tends to $-\infty$
as time goes on: $R_k \to -\infty$ as $k \to \infty$ wp1. Before it drifts off to $-\infty$, however, it first reaches
a finite maximum $M \stackrel{\text{def}}{=} \max_{k \ge 0} R_k$ which is a non-negative random variable. It is well known
(duality between queue and risk) that the distribution of $D_k$ (for any fixed $k$) is the same as
the distribution of $M_k \stackrel{\text{def}}{=} \max_{0 \le j \le k} R_j$ (the maximum up to time $k$) and thus taking limits
($k \to \infty$) yields that the distribution of stationary delay $D$ is the same as the distribution of
the all-time maximum $M$.
Thus for any b ≥ 0,

P (Dk > b) = P (Mk > b) (4)


P (D > b) = P (M > b). (5)

For a given fixed $b \ge 0$ let

$$
\tau(b) =
\begin{cases}
\min\{k \ge 0 : R_k > b\} & \text{if } R_k > b \text{ for some } k, \\
\infty & \text{if } R_k \le b \text{ for all } k.
\end{cases}
$$

$\tau(b)$ is the first passage time above $b$; it denotes the first time at which (if ever) the random
walk goes above $b$. The case $\tau(b) = \infty$ must be included because the random walk has negative
drift and thus might never reach a value above $b$ (before eventually drifting to $-\infty$). Noting that
$\{\tau(b) \le k\} = \{M_k > b\}$, and that $\{\tau(b) < \infty\} = \{M > b\}$, we have $P(\tau(b) \le k) = P(M_k > b)$,
and $P(\tau(b) < \infty) = P(M > b)$. Thus (4)-(5) become

P (Dk > b) = P (Mk > b) = P (τ (b) ≤ k) (6)


P (D > b) = P (M > b) = P (τ (b) < ∞). (7)
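The objects in (4)-(7) are easy to compute on a sample path. The sketch below (with illustrative function names) implements the delay recursion, the running maximum and the first passage time for a given list of increments. It also illustrates the mechanism behind the duality: run on the same increments in reverse order, the running maximum at time k coincides with the delay D_k, and since the increments are iid the reversed sequence has the same distribution as the original one.

```python
def delays(increments):
    """D_0 = 0 and D_{n+1} = max(D_n + increments[n], 0)."""
    d = [0.0]
    for x in increments:
        d.append(max(d[-1] + x, 0.0))
    return d

def running_max(increments):
    """M_k = max_{0 <= j <= k} R_j for the walk R built from increments."""
    r, m = 0.0, 0.0
    out = [0.0]
    for x in increments:
        r += x
        m = max(m, r)
        out.append(m)
    return out

def first_passage(increments, b):
    """tau(b): first k with R_k > b, or None if never reached on this path."""
    r = 0.0
    for k, x in enumerate(increments, start=1):
        r += x
        if r > b:
            return k
    return None

if __name__ == "__main__":
    inc = [1.0, -2.0, 3.0, -1.0]
    k = len(inc)
    print(delays(inc)[k], running_max(inc[::-1])[k])   # both equal 2.0
    print(first_passage(inc, 1.5))                      # 3
```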

1.1.1 Importance sampling in the light-tailed service case
Let $F(x) = P(\Delta \le x)$, $x \in \mathbb{R}$, and assume that it has a density function $f(x)$. We shall also
assume that service times are light-tailed: $E(e^{\epsilon S}) < \infty$ for some $\epsilon > 0$ (i.e., $S$ has a finite
moment generating function), which implies that the tail $P(S > x)$ tends to 0 fast like an
exponential tail does. (A Pareto tail, however, such as $x^{-3}$, does not have this property; it is
an example of a heavy-tailed distribution.)
We further shall assume the existence of a $\gamma > 0$ such that
$$E(e^{\gamma\Delta}) = \int_{-\infty}^{\infty} e^{\gamma x} f(x)\,dx = 1. \tag{8}$$
Defining the moment generating function $K(\epsilon) = E(e^{\epsilon\Delta}) = E(e^{\epsilon S})E(e^{-\epsilon T})$ and observing
that $K(0) = 1$, $K'(0) = E(\Delta) < 0$, and $K$ is convex ($K''(\epsilon) > 0$), we see that the condition
(8) would hold under suitable conditions: conditions ensuring that $K$, while moving down below
1 for a while, shoots back upwards and hits 1 as $\epsilon$ increases. The value $\gamma$ at which it hits 1 is
called the Lundberg constant. Furthermore, since $K$ is increasing when it hits 1 at $\gamma$, it must
hold that $K'(\gamma) > 0$.
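When no closed form is available, the Lundberg constant can be found numerically as the positive root of K = 1. The sketch below uses plain bisection and relies on the shape of K described above (below 1 just to the right of 0, above 1 at the supplied upper point). It is illustrated on the M/M/1 case with S exponential at rate mu and T exponential at rate lam, where the moment generating function is mu/(mu - eps) times lam/(lam + eps) for eps < mu. The function names and the bracketing strategy are assumptions made for this example.

```python
def mgf_delta_mm1(eps, lam, mu):
    """K(eps) = E(e^{eps S}) E(e^{-eps T}) for S ~ exp(mu), T ~ exp(lam), eps < mu."""
    return (mu / (mu - eps)) * (lam / (lam + eps))

def lundberg_constant(K, upper, tol=1e-12):
    """Bisection for gamma in (0, upper) with K(gamma) = 1.

    Assumes K is convex with K(0) = 1, K'(0) < 0 and K(upper) > 1.
    """
    lo = upper
    while K(lo) >= 1.0:        # shrink towards 0 until K dips below 1
        lo /= 2.0
    hi = upper
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if K(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    lam, mu = 1.0, 2.0
    print(lundberg_constant(lambda e: mgf_delta_mm1(e, lam, mu), upper=1.9))
```

For lam = 1 and mu = 2 the root is mu - lam = 1, since setting the product equal to 1 reduces to gamma (mu - lam - gamma) = 0.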
Let us change the distribution of $\Delta$ to have density
$$g(x) = e^{\gamma x} f(x). \tag{9}$$
We know that $g$ defines a probability density, $\int g(x)\,dx = 1$, because of the definition of $\gamma$ in
(8). We say that we have exponentially tilted or twisted the distribution $f$ to be that of $g$. In
general we could take any value $\epsilon > 0$ for which $K(\epsilon) < \infty$, and change $f$ to the new twisted
density
$$g_\epsilon(x) = \frac{e^{\epsilon x} f(x)}{K(\epsilon)}. \tag{10}$$
Our $g$ in (9) is the special case when $\epsilon$ is set to be the Lundberg constant in (8); $g = g_\gamma$.
It is easy to show that in fact, $\tilde{E}(\Delta) > 0$, that is, using distribution $g$ for the random
walk increments $\Delta_i$ makes the random walk now have positive drift! (To see this, note that
$\tilde{E}(\Delta) = K'(\gamma)$ and recall that we argued above that $K'(\gamma) > 0$.)¹
Thus for a given large $b > 0$ it is more likely that events such as $\{D_n > b\}$ will occur as
compared to the original case when $f$ is used. In fact $\tilde{P}(M > b) = 1 = \tilde{P}(\tau(b) < \infty)$, where $\tilde{P}$
denotes using $g$ instead of $f$: the random walk will now with certainty tend to $+\infty$ and hence
pass any value $b$ along the way no matter how large; $R_n \to \infty$ wp1 under $\tilde{P}$.
Noting that the likelihood ratio function is $L(x) = f(x)/g(x) = e^{-\gamma x}$ and using $h(x) = I\{x > b\}$,
we can use (2) and conclude, for example, that

$$E(h(\Delta_1)) = P(\Delta_1 > b) = P(M_1 > b) = P(D_1 > b) = \tilde{E}[e^{-\gamma\Delta_1} I\{\Delta_1 > b\}]. \tag{11}$$

In two dimensions, utilizing (3), we can take $h(x_1, x_2) = I\{x_1 > b \text{ or } x_1 + x_2 > b\}$ yielding
$h(\Delta_1, \Delta_2) = I\{M_2 > b\}$. We make the two increments iid, each distributed as $g$ in (9), so
that their joint distribution is the product $g_2(x_1, x_2) = g(x_1)g(x_2) = e^{\gamma(x_1 + x_2)} f(x_1) f(x_2)$. The
original joint distribution is the product $f_2(x_1, x_2) = f(x_1) f(x_2)$ and so $L(x_1, x_2) = f_2/g_2 = e^{-\gamma(x_1 + x_2)}$,
and therefore $L(\Delta_1, \Delta_2) = e^{-\gamma R_2}$. This then yields

$$E(h(\Delta_1, \Delta_2)) = P(M_2 > b) = P(D_2 > b) = \tilde{E}[e^{-\gamma R_2} I\{M_2 > b\}]. \tag{12}$$
¹ In the case when $S \sim \exp(\mu)$ and $T \sim \exp(\lambda)$ (the M/M/1 case) with $\lambda < \mu$, it is easily seen that $\gamma = \mu - \lambda$
and that the twisted density has the effect of swapping the rates: under $g$, the service times $S_i$ become iid with
an exponential distribution at rate $\lambda$, while the interarrival times $T_i$ become iid with an exponential distribution
at rate $\mu$; this yields an unstable queue.

Continuing analogously to higher dimensions then yields for any $k \ge 1$:

$$P(D_k > b) = P(M_k > b) = \tilde{E}[e^{-\gamma R_k} I\{M_k > b\}]. \tag{13}$$

At this point, let us discuss how to use (13) in a simulation. Our objective is to estimate
$P(D_k > b)$ for a very large value of $b$. We thus simulate the first $k$ steps of a random walk,
$R_1, \ldots, R_k$, having iid positive drift increments distributed as $g$ in (9). We compute
$M_k = \max_{0 \le j \le k} R_j$ and obtain a first copy $X_1 = e^{-\gamma R_k} I\{M_k > b\}$. Then, independently, we simulate
a second copy and so on yielding $n$ (large) iid copies to be used in our Monte Carlo estimate

$$P(D_k > b) \approx \tilde{p}(n) = \frac{1}{n}\sum_{i=1}^{n} X_i. \tag{14}$$

(We of course need to be able to simulate from $g$; we assume this is so.)
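The procedure just described can be sketched in Python for the M/M/1 case, where the Lundberg constant is mu - lam and sampling from g amounts to swapping the two exponential rates (see footnote 1). The function names and parameter values are illustrative assumptions; a crude estimator is included only so that the two can be compared at a moderate level b where crude simulation is still feasible.

```python
import math
import random

def tilted_finite_k_estimate(lam, mu, k, b, n, seed=0):
    """Estimate P(D_k > b) via (13)-(14) for M/M/1, simulating under g."""
    rng = random.Random(seed)
    gamma = mu - lam                       # Lundberg constant for M/M/1
    total = 0.0
    for _ in range(n):
        r, m = 0.0, 0.0
        for _ in range(k):
            # Under g: service ~ exp(lam), interarrival ~ exp(mu).
            r += rng.expovariate(lam) - rng.expovariate(mu)
            m = max(m, r)
        if m > b:                          # I{M_k > b}
            total += math.exp(-gamma * r)  # likelihood ratio e^{-gamma R_k}
    return total / n

def crude_finite_k_estimate(lam, mu, k, b, n, seed=0):
    """Crude estimate of P(M_k > b), simulating under the original f."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        r, m = 0.0, 0.0
        for _ in range(k):
            r += rng.expovariate(mu) - rng.expovariate(lam)
            m = max(m, r)
        if m > b:
            hits += 1
    return hits / n

if __name__ == "__main__":
    print(tilted_finite_k_estimate(1.6, 2.0, 10, 3.0, 40000, seed=1))
    print(crude_finite_k_estimate(1.6, 2.0, 10, 3.0, 40000, seed=2))
```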


If we had used the naive approach, we would have simulated iid copies of $X_1 = I\{M_k > b\}$
where the increments of the random walk would be distributed as $f$ and thus have negative
drift. The event in question, $\{M_k > b\}$, would thus rarely happen; we would be in the bad
situation outlined in the beginning of these notes. With our new approach, the random walk is
changed to have positive drift and thus this same event, $\{M_k > b\}$, now is very likely to occur.
The indicator must be multiplied by the likelihood ratio factor $e^{-\gamma R_k}$ before taking expected values so as
to bring the answer down to its true value $P(M_k > b)$ as opposed to the larger (incorrect) value
$\tilde{P}(M_k > b)$.
It turns out (using martingale theory, see Section 1.2 below) that we can re-express the
right-hand side of (13) as
$$\tilde{E}[e^{-\gamma R_{\tau(b)}} I\{M_k > b\}],$$
so that after taking the limit as $k \to \infty$ we obtain

$$P(D > b) = P(M > b) = \tilde{E}[e^{-\gamma R_{\tau(b)}} I\{M > b\}]. \tag{15}$$

But as we know, $\tilde{P}(M > b) = \tilde{P}(\tau(b) < \infty) = 1$ since the random walk has positive drift under
$\tilde{P}$ (said differently, $M = \infty$ wp1 under $\tilde{P}$, so for any $b > 0$, $I\{M > b\} = 1$ wp1 under $\tilde{P}$).
Thus (15) becomes
$$P(D > b) = P(M > b) = \tilde{E}(e^{-\gamma R_{\tau(b)}}). \tag{16}$$
But by definition, at time $\tau(b)$ the random walk has shot past level $b$; $R_{\tau(b)} = b + B$, where
$B = R_{\tau(b)} - b$ denotes the overshoot. We finally arrive at

$$P(D > b) = P(M > b) = e^{-\gamma b}\,\tilde{E}(e^{-\gamma B}). \tag{17}$$

In essence, we have reduced the problem of computing $P(D > b)$ to computing the Laplace
transform evaluated at $\gamma$, $\tilde{E}(e^{-\gamma B})$, of the overshoot $B$ of a positive drift random walk.
To put this to good use, we then use Monte Carlo simulation to estimate $\tilde{E}(e^{-\gamma B})$: Simulate
the positive drift random walk with increments iid distributed as $g$ until it first passes level
$b$, and let $B_1$ denote the overshoot. Set $X_1 = e^{-\gamma B_1}$. Independently repeat the simulation
to obtain another copy of the overshoot $B_2$ and so on for a total of $n$ such iid copies,
$X_i = e^{-\gamma B_i}$, $i = 1, 2, \ldots, n$. Then use as the estimate
$$P(D > b) \approx e^{-\gamma b}\Big[\frac{1}{n}\sum_{i=1}^{n} X_i\Big]. \tag{18}$$
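A minimal Python sketch of the estimate (18), again for the M/M/1 case with the rates swapped under g as in footnote 1. The function name and parameter values are illustrative assumptions. The standard M/M/1 stationary delay tail, rho times exp(-(mu - lam) b), is used only as a check on the output.

```python
import math
import random

def overshoot_estimate(lam, mu, b, n, seed=0):
    """Estimate P(D > b) via (18) for M/M/1, simulating the tilted walk."""
    rng = random.Random(seed)
    gamma = mu - lam                          # Lundberg constant for M/M/1
    total = 0.0
    for _ in range(n):
        r = 0.0
        while r <= b:                         # run until first passage above b
            # Under g: service ~ exp(lam), interarrival ~ exp(mu).
            r += rng.expovariate(lam) - rng.expovariate(mu)
        total += math.exp(-gamma * (r - b))   # X_i = e^{-gamma * overshoot}
    return math.exp(-gamma * b) * total / n

if __name__ == "__main__":
    lam, mu, b = 1.0, 2.0, 10.0
    print(overshoot_estimate(lam, mu, b, 20000, seed=3))
    print((lam / mu) * math.exp(-(mu - lam) * b))   # known M/M/1 value
```

Every replication ends with a passage above b, so each one contributes a value in (0, 1] and the relative error no longer deteriorates as b grows.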

Note in passing that since $\tilde{E}(e^{-\gamma B}) < 1$ we conclude from (17) that $P(D > b) \le e^{-\gamma b}$, an
exponential upper bound on the tail of delay. It turns out that under suitable further conditions,
it can be proved that there exists a constant $C > 0$ such that

$$P(D > b) \sim C e^{-\gamma b}, \quad \text{as } b \to \infty,$$

by which we mean that

$$\lim_{b \to \infty} \frac{P(D > b)}{C e^{-\gamma b}} = 1.$$

This is known as the Lundberg approximation².

1.2 Martingales and the likelihood ratio identity


$L_k = e^{-\gamma R_k}$, $k \ge 0$, is a mean 1 martingale under $\tilde{P}$ because
$\tilde{E}[e^{-\gamma\Delta}] = \int e^{-\gamma x} e^{+\gamma x} f(x)\,dx = \int f(x)\,dx = 1$.
Thus, by optional sampling, for each fixed $k$, $1 = \tilde{E}(L_k) = \tilde{E}(L_{\tau(b) \wedge k})$ since
$\tau(b) \wedge k$ is a bounded stopping time. But $\tilde{E}(L_{\tau(b) \wedge k}) = \tilde{E}(L_{\tau(b)} I\{M_k > b\}) + \tilde{E}(L_k I\{M_k \le b\})$,
while $\tilde{E}(L_k) = \tilde{E}(L_k I\{M_k > b\}) + \tilde{E}(L_k I\{M_k \le b\})$. Equating these two then yields
$\tilde{E}(L_k I\{M_k > b\}) = \tilde{E}(L_{\tau(b)} I\{M_k > b\})$. Lurking in here is the famous likelihood ratio identity:
Given any stopping time $\tau$ ($\tau = \tau(b)$ for example), it holds for any event $A \subseteq \{\tau < \infty\}$ that
$P(A) = \tilde{E}(L_\tau I\{A\})$. (The filtration in our case is $\mathcal{F}_n = \sigma\{\Delta_1, \ldots, \Delta_n\} = \sigma\{R_1, \ldots, R_n\}$.)
Finally note that, meanwhile, $L_k^{-1} = e^{\gamma R_k}$ is a mean 1 martingale under $P$ due to (8).

There is a back and forth between the two probabilities $P$ and $\tilde{P}$:

$$\tilde{P}(A) = E(L_k^{-1} I\{A\}), \quad A \in \mathcal{F}_k. \tag{19}$$

$$P(A) = \tilde{E}(L_k I\{A\}), \quad A \in \mathcal{F}_k. \tag{20}$$


The general framework here: We use the canonical space $\Omega = \mathbb{R}^{\mathbb{N}} = \{(x_0, x_1, x_2, \ldots) : x_i \in \mathbb{R}\}$;
the space of sequences of real numbers (the sample space for discrete-time stochastic
processes) endowed with the standard Borel $\sigma$-field and filtration $\{\mathcal{F}_k : k \ge 0\}$. Each random
element on this space corresponds to a stochastic process, denoted by $R = \{R_n : n \ge 0\}$. We
start with the probability measure $P$ corresponding to $R$ being a random walk
$R_k = \Delta_1 + \cdots + \Delta_k$, $R_0 = 0$, with iid increments $\Delta_i$ having density $f(x)$ with assumed Lundberg constant
$\gamma > 0$ as defined in (8). Define the non-negative mean 1 martingale $L_k = e^{\gamma R_k}$, $k \ge 1$, $L_0 = 1$
(note that in this general framework $L_k$ denotes the reciprocal of the $L_k = e^{-\gamma R_k}$ used at the start of this section),
and then define a new probability on $\Omega$ via

$$\tilde{P}(A) = E(L_k I\{A\}), \quad A \in \mathcal{F}_k. \tag{21}$$

(That (21) really defines a unique probability on $\Omega$ follows by Kolmogorov’s extension theorem
in probability theory: For each $k$, (21) does define a probability on $\mathcal{F}_k$; denote this by $\tilde{P}_k$.
Consistency of these probabilities follows by the martingale property: for $m < k$,
$\tilde{P}_k(A) = \tilde{P}_m(A)$, $A \in \mathcal{F}_m$; that is, $\tilde{P}_k$ restricted to $\mathcal{F}_m$ is the same as $\tilde{P}_m$.)
² $B = B(b)$ depends on $b$. If (under $\tilde{P}$) $B(b)$ converges weakly (in distribution) as $b \to \infty$ to (say) a rv $B^*$,
then $C = \tilde{E}(e^{-\gamma B^*})$. The needed conditions for such weak convergence are that $\tilde{E}(\Delta) < \infty$ and that the first
strictly ascending ladder height $H = R_{\tau(0)}$ have a non-lattice distribution. But it is known that $H$ is non-lattice
if and only if the distribution of $\Delta$ is so, and in our case it has a density $g$, hence is non-lattice. (We already
know that $K'(\gamma) = \tilde{E}(\Delta) > 0$ but it could be infinite.)

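$\tilde{P}$ turns out to be the distribution of a random walk as well, but with new iid increments
distributed with the tilted density $g$ defined in (9):
$\tilde{P}(R_1 \le x) = E(L_1 I\{R_1 \le x\}) = G(x) = \int_{-\infty}^{x} g(y)\,dy$
(here $G$ denotes the cdf of $g$, not the service-time distribution of Section 1.1), and more generally, with
$\Delta_i = R_i - R_{i-1}$, $i \ge 1$,
$\tilde{P}(\Delta_1 \le y_1, \ldots, \Delta_k \le y_k) = E(L_k I\{\Delta_1 \le y_1, \ldots, \Delta_k \le y_k\}) = G(y_1) \times \cdots \times G(y_k)$.
Moreover, we can go the other way by using the non-negative mean 1 martingale $L_k^{-1} = e^{-\gamma R_k}$:

$$P(A) = E(L_k L_k^{-1} I\{A\}) = \tilde{E}(L_k^{-1} I\{A\}), \quad A \in \mathcal{F}_k. \tag{22}$$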
The general change-of-measure approach allows as a starting point a given probability $P$, and
a non-negative mean 1 martingale $\{L_k\}$. Then $\tilde{P}(A) = E(L_k I\{A\})$, $A \in \mathcal{F}_k$, is defined. Then
the likelihood ratio identity is: Given any stopping time $\tau$, it holds for any event $A \subseteq \{\tau < \infty\}$
that $P(A) = \tilde{E}(L_\tau^{-1} I\{A\})$. This works fine in continuous time $t \in [0, \infty)$, but then the canonical
space used is $D[0, \infty)$, the space of functions that are continuous from the right and have left-hand
limits, equipped with the Skorohod topology.

Remark 1.1 The Lundberg constant and the change of measure using it generalize nicely to
continuous-time Lévy processes, Brownian motion for example (and for Brownian motion such
change of measure results are usually known as Girsanov’s Theorem). For example, if
$X(t) = \sigma B(t) + \mu t$ is a Brownian motion with negative drift, $\mu < 0$, then solving
$1 = E(e^{\gamma X(1)}) = e^{\gamma\mu + \gamma^2\sigma^2/2}$ yields $\gamma = 2|\mu|/\sigma^2$. This yields the martingale $L(t) = e^{\gamma X(t)}$ and the new measure
$\tilde{P}(A) = E(e^{\gamma X(t)} I\{A\})$, $A \in \mathcal{F}_t$. It is easily seen that under $\tilde{P}$, the process $X(t)$ remains a
Brownian motion but with positive drift $\tilde{\mu} = |\mu|$, and the variance remains the same as it was.
Using the likelihood ratio identity with $\tau(b) = \inf\{t \ge 0 : X(t) > b\} = \inf\{t \ge 0 : X(t) = b\}$
(via continuity of sample paths) then yields, as in (17),
$P(M > b) = e^{-\gamma b}\,\tilde{E}(e^{-\gamma B})$. But now, $B = 0$ since $X(t)$ has continuous sample paths; there
is no overshoot, $b$ is hit exactly. Thus we get an exact exponential distribution,
$P(M > b) = e^{-\gamma b}$, $b > 0$, for the maximum of a negative drift Brownian motion, a well-known result that
can be derived using more basic principles.
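The two claims in the Brownian example (the value of γ and the new drift) can be checked by a short computation with the moment generating function of X(1); the following worked steps are a sketch that uses only the normal moment generating function quoted above.

```latex
\begin{align*}
1 = E\big(e^{\gamma X(1)}\big) = e^{\gamma\mu + \gamma^2\sigma^2/2}
  &\iff \gamma\mu + \tfrac{1}{2}\gamma^2\sigma^2 = 0
  \iff \gamma = -\frac{2\mu}{\sigma^2} = \frac{2|\mu|}{\sigma^2}
  \quad (\gamma > 0,\ \mu < 0), \\
\tilde{E}\big(e^{sX(1)}\big) = E\big(e^{(\gamma + s)X(1)}\big)
  &= e^{(\gamma + s)\mu + (\gamma + s)^2\sigma^2/2}
  = e^{s(\mu + \gamma\sigma^2) + s^2\sigma^2/2}, \\
\tilde{\mu} = \mu + \gamma\sigma^2 &= \mu + 2|\mu| = |\mu|,
  \qquad \tilde{\sigma}^2 = \sigma^2 .
\end{align*}
```

The second line uses $\gamma\mu + \gamma^2\sigma^2/2 = 0$ to cancel the terms that do not involve $s$.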
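As a second example, we consider the Lévy process $Z(t) = \sigma B(t) + \mu t + Y(t)$, where
independently we have added on a compound Poisson process
$$Y(t) = \sum_{i=1}^{N(t)} J_i,$$

where $\{N(t)\}$ is a Poisson process at rate $\lambda$ and the $J_i$ are iid with a given distribution
$H(x) = P(J \le x)$, $x \in \mathbb{R}$. Also, let $\hat{H}(s) = E(e^{sJ})$ denote the moment generating function of $H$,
assumed to be finite for sufficiently small $s > 0$ (e.g., $H$ is light-tailed). Choosing any $\epsilon$ for
which $K_Z(\epsilon) \stackrel{\text{def}}{=} E(e^{\epsilon Z(1)}) = e^{\epsilon\mu + \epsilon^2\sigma^2/2 + \lambda(\hat{H}(\epsilon) - 1)} < \infty$, we always obtain a martingale
$L(t) = [K_Z(\epsilon)]^{-t} e^{\epsilon Z(t)}$ (since $E(e^{\epsilon Z(t)}) = [K_Z(\epsilon)]^{t}$; at $t = 1$ the normalizing factor is $[K_Z(\epsilon)]^{-1}$), and a new measure
$\tilde{P}(A) = E(L(t) I\{A\})$, $A \in \mathcal{F}_t$.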
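We now show that under $\tilde{P}$, $Z$ remains the same kind of Lévy process but with drift
$\tilde{\mu} = \mu + \epsilon\sigma^2$, variance unchanged $\tilde{\sigma}^2 = \sigma^2$, $\tilde{\lambda} = \lambda\hat{H}(\epsilon)$ and the distribution $H$ exponentially
tilted to be $\tilde{H}(x)$ given by $d\tilde{H}(x) = \dfrac{e^{\epsilon x}\,dH(x)}{\hat{H}(\epsilon)}$.
To this end, we need to confirm that for $s \ge 0$,
$$\tilde{E}(e^{sZ(1)}) = e^{s\tilde{\mu} + s^2\sigma^2/2 + \tilde{\lambda}(\hat{\tilde{H}}(s) - 1)},$$
where $\hat{\tilde{H}}(s) = \tilde{E}(e^{sJ}) = \int e^{sx}\,d\tilde{H}(x) = \dfrac{\hat{H}(\epsilon + s)}{\hat{H}(\epsilon)}$.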

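Direct calculations yield

$$
\begin{aligned}
\tilde{E}(e^{sZ(1)}) &= E(L(1) e^{sZ(1)}) \\
&= [K_Z(\epsilon)]^{-1} E(e^{\epsilon Z(1)} e^{sZ(1)}) \\
&= [K_Z(\epsilon)]^{-1} K_Z(\epsilon + s) \\
&= e^{-\epsilon\mu - \epsilon^2\sigma^2/2}\, e^{-\lambda(\hat{H}(\epsilon) - 1)}\, e^{(\epsilon + s)\mu + (\epsilon + s)^2\sigma^2/2 + \lambda(\hat{H}(s + \epsilon) - 1)} \\
&= e^{s\tilde{\mu} + s^2\sigma^2/2 + \tilde{\lambda}(\hat{\tilde{H}}(s) - 1)},
\end{aligned}
$$

as was to be shown.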
As for the Lundberg constant: We assume a priori that $Z$ has negative drift,
$E(Z(1)) = \mu + \lambda E(J) < 0$, so that $M = \max_{t \ge 0} Z(t)$ defines a finite random variable. Solving for a $\gamma > 0$
such that $1 = K_Z(\gamma) = e^{\gamma\mu + \gamma^2\sigma^2/2 + \lambda(\hat{H}(\gamma) - 1)}$ leads to the equation

$$\gamma\mu + \frac{\gamma^2\sigma^2}{2} + \lambda(\hat{H}(\gamma) - 1) = 0.$$
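For a concrete H this equation can be solved numerically. The sketch below assumes, purely for illustration, that the jumps are exponential at a hypothetical rate eta, so that the moment generating function of J at g is eta/(eta - g) for g < eta; it also assumes a positive jump rate lam, which makes the left-hand side blow up as g approaches eta and so provides an upper bracket. The function names are not from the notes.

```python
def levy_exponent(g, mu, sigma, lam, eta):
    """log K_Z(g) = g*mu + g^2 sigma^2/2 + lam*(eta/(eta - g) - 1), for g < eta.

    Assumes exponential jumps J ~ exp(eta) (an illustrative choice of H).
    """
    return g * mu + 0.5 * g * g * sigma * sigma + lam * (eta / (eta - g) - 1.0)

def levy_lundberg(mu, sigma, lam, eta, tol=1e-12):
    """Bisection for the root gamma in (0, eta) of levy_exponent = 0."""
    if lam <= 0.0 or mu + lam / eta >= 0.0:
        raise ValueError("need lam > 0 and negative drift mu + lam*E(J) < 0")
    hi = eta * (1.0 - 1e-12)          # exponent is large and positive here
    lo = hi / 2.0
    while levy_exponent(lo, mu, sigma, lam, eta) >= 0.0:
        lo /= 2.0                     # exponent is negative just right of 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if levy_exponent(mid, mu, sigma, lam, eta) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    print(levy_lundberg(mu=-1.0, sigma=1.0, lam=1.0, eta=2.0))
```

With mu = -1, sigma = 1, lam = 1 and eta = 2, dividing the equation by g gives (2 - g)² = 2, so the root is 2 - √2; with sigma = 0 and the same remaining values it is eta - lam/|mu| = 1.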

Assuming a solution exists (this depends on $H$), then as the martingale we use $L(t) = e^{\gamma Z(t)}$.
Under $\tilde{P}$, $Z$ now has positive drift, $K_Z'(\gamma) = \tilde{E}(Z(1)) = \tilde{\mu} + \tilde{\lambda}\tilde{E}(J) > 0$.
Just as for the FIFO GI/GI/1 queue, we obtain exactly the same kind of exponential bound
for the tail of $M$: Using the likelihood ratio identity with $\tau(b) = \inf\{t \ge 0 : Z(t) > b\}$ then
yields, as in (17), $P(M > b) = e^{-\gamma b}\,\tilde{E}(e^{-\gamma B})$. Now, because of the “jumps” $J_i$, there is an
overshoot $B$ to deal with (unless the jumps are $\le 0$ wp1). All of this goes through with general
negative drift Lévy processes, the idea being that under $\tilde{P}$ the process remains Lévy, but with
new parameters making it have positive drift.
