4404 Notes: Importance Sampling (IS)
But this is not a good idea: µ = E(X_i) = P(A) = p and σ² = Var(X_i) = p(1 − p), and so, since p is assumed very small, the ratio σ/µ = √(p(1 − p))/p ∼ 1/√p → ∞ as p ↓ 0; relative to µ, σ is of a much larger magnitude. This is very bad since when constructing confidence intervals,

p̄(n) ± z_{α/2} σ/√n,
the length of the interval is in units of σ: If σ is much larger than what we are trying to
estimate, µ, then the confidence interval will be way too large to be of any use. It would be
like saying “I am 95% confident that he weighs 140 pounds plus or minus 500 pounds”.
To make matters worse, increasing the number n of copies in the Monte Carlo so as to reduce
the interval length, while sounding OK, could be impractical since n would end up having to
be enormous.
Importance sampling is a technique that gets around this problem by changing the probability distributions of the model so as to make the rare event happen often instead of rarely. To understand the basic idea, suppose we wish to compute E(h(X)) = ∫ h(x)f(x) dx for a continuous random variable X distributed with density f(x). For example, if h(x) = I{x > b} for a given large b, then h(X) = I{X > b} and E(h(X)) = P(X > b).
Now let g(x) be any other density such that f (x) = 0 whenever g(x) = 0, and observe that
we can re-write
E(h(X)) = ∫ h(x) f(x) dx
        = ∫ h(x) [f(x)/g(x)] g(x) dx
        = Ẽ[ h(X) f(X)/g(X) ],
where Ẽ denotes expected value when g is used as the distribution of X (as opposed to the original distribution f). In other words: if X has distribution g, then the expected value of h(X)f(X)/g(X) is the same as what we originally wanted, E(h(X)). The ratio L(X) = f(X)/g(X) is called the likelihood ratio. We can write

E(h(X)) = Ẽ[h(X)L(X)];    (2)

the left-hand side uses distribution f for X, while the right-hand side uses distribution g for X.
To make this work in our favor, we would want to choose g so that the variance of h(X)L(X)
(under g) is small relative to its mean.
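As a quick illustration (not from the notes), consider estimating p = P(X > b) for X ∼ N(0, 1) and b = 4, so p = 1 − Φ(4) ≈ 3.17 × 10⁻⁵. A minimal sketch: taking g to be the N(b, 1) density makes the event common, and the likelihood ratio works out to L(x) = f(x)/g(x) = e^{−bx + b²/2}.

```python
import numpy as np

rng = np.random.default_rng(0)
b, n = 4.0, 100_000

# Naive Monte Carlo: the event {X > b} almost never happens at this n.
x = rng.standard_normal(n)
naive = np.mean(x > b)

# Importance sampling: draw Y ~ N(b, 1), so {Y > b} happens half the time,
# and reweight by the likelihood ratio L(y) = f(y)/g(y) = exp(-b*y + b^2/2).
y = rng.standard_normal(n) + b
w = np.exp(-b * y + b**2 / 2) * (y > b)
est, se = w.mean(), w.std(ddof=1) / np.sqrt(n)

print(naive, est, se)  # exact value: 1 - Phi(4) ~ 3.17e-5
```

The naive estimate is typically 0 (or wildly noisy) at this sample size, while the importance-sampling estimate has a usably small standard error.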
We can easily generalize this idea to multiple dimensions: suppose h = h(X_1, ..., X_k) is real-valued, where (X_1, ..., X_k) has joint density f(x_1, ..., x_k). Then for an alternative joint density g(x_1, ..., x_k) (with f = 0 whenever g = 0), we once again can write

E[h(X_1, ..., X_k)] = Ẽ[h(X_1, ..., X_k) L(X_1, ..., X_k)],    (3)

where L(x_1, ..., x_k) = f(x_1, ..., x_k)/g(x_1, ..., x_k) is the likelihood ratio.
As an application, consider the FIFO GI/GI/1 queue with iid service times S_j, iid interarrival times T_j, and increments ∆_j = S_j − T_j; let D_k denote the delay of the k-th customer, and define the associated random walk

R_k = ∆_1 + · · · + ∆_k,  R_0 = 0;

then since E(∆) = E(S − T) < 0, the random walk has negative drift and hence tends to −∞ as time goes on: R_k → −∞ as k → ∞ wp1. Before it drifts off to −∞, however, it first reaches a finite maximum M = max_{k≥0} R_k, which is a non-negative random variable. It is well known (duality between queue and risk) that the distribution of D_k (for any fixed k) is the same as the distribution of M_k = max_{0≤j≤k} R_j (the maximum up to time k), and thus taking limits (k → ∞) yields that the distribution of the stationary delay D is the same as the distribution of the all-time maximum M.
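The duality can be checked by simulation. A minimal sketch (with illustrative parameters, S ∼ exp(µ), T ∼ exp(λ), λ < µ), comparing the tail of D_k, computed via the Lindley recursion D_{j+1} = (D_j + ∆_{j+1})⁺, with the tail of M_k:

```python
import numpy as np

rng = np.random.default_rng(1)
lam, mu = 1.0, 2.0     # T ~ exp(lam), S ~ exp(mu); lam < mu gives E(delta) < 0
k, n, b = 50, 20_000, 1.0

def delta():
    return rng.exponential(1/mu) - rng.exponential(1/lam)   # S - T

def D_k():
    # Lindley recursion for customer delays: D_{j+1} = max(D_j + delta, 0).
    D = 0.0
    for _ in range(k):
        D = max(D + delta(), 0.0)
    return D

def M_k():
    # Running maximum of the random walk R_j over 0 <= j <= k.
    R, M = 0.0, 0.0
    for _ in range(k):
        R += delta()
        M = max(M, R)
    return M

print(np.mean([D_k() > b for _ in range(n)]),
      np.mean([M_k() > b for _ in range(n)]))   # should agree up to noise
```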
Thus, by the duality, for any b ≥ 0,

P(D > b) = P(M > b) = P(τ(b) < ∞),

where τ(b) = inf{k ≥ 1 : R_k > b} denotes the first passage time of the walk above level b.
1.1.1 Importance sampling in the light-tailed service case
Let F(x) = P(∆ ≤ x), x ∈ ℝ, and assume that it has a density function f(x). We shall also assume that service times are light-tailed: E(e^{εS}) < ∞ for some ε > 0 (i.e., S has a finite moment generating function in a neighborhood of the origin), which implies that the tail P(S > x) tends to 0 fast, like an exponential tail does. (A Pareto tail, however, such as x^{−3}, does not have this property; it is an example of a heavy-tailed distribution.)
We further shall assume the existence of a γ > 0 such that

E(e^{γ∆}) = ∫_{−∞}^{∞} e^{γx} f(x) dx = 1.    (8)
Defining the moment generating function K(ε) = E(e^{ε∆}) = E(e^{εS})E(e^{−εT}) and observing that K(0) = 1, K′(0) = E(∆) < 0, and that K is convex (K″(ε) > 0), we see that condition (8) holds under suitable conditions: conditions ensuring that K, after dipping below 1 for a while, shoots back upwards and hits 1 as ε increases. The value γ at which it hits 1 is called the Lundberg constant. Furthermore, since K is increasing when it hits 1 at γ, it must hold that K′(γ) > 0.
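Concretely, γ can be found by one-dimensional root finding. A sketch for the case S ∼ exp(µ), T ∼ exp(λ) (illustrative parameters), where K(ε) = [µ/(µ − ε)]·[λ/(λ + ε)] for 0 ≤ ε < µ, and the root is known in closed form to be γ = µ − λ (see footnote 1 below):

```python
from scipy.optimize import brentq

lam, mu = 1.0, 2.0   # T ~ exp(lam), S ~ exp(mu), so E(delta) = 1/mu - 1/lam < 0

def K(eps):
    # K(eps) = E(e^{eps*S}) E(e^{-eps*T}), finite for 0 <= eps < mu.
    return (mu / (mu - eps)) * (lam / (lam + eps))

# K(0) = 1 with K'(0) < 0, and K(eps) -> infinity as eps -> mu, so the
# second root of K = 1 lies strictly between 0 and mu.
gamma = brentq(lambda e: K(e) - 1.0, 1e-9, mu - 1e-9)
print(gamma)   # 1.0 here, matching the closed form gamma = mu - lam
```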
Let us change the distribution of ∆ to have density

g(x) = e^{γx} f(x).    (9)

We know that g defines a probability density, ∫ g(x) dx = 1, because of the definition of γ in (8). We say that we have exponentially tilted (or twisted) the distribution f to obtain g. In general, we could take any value ε > 0 for which K(ε) < ∞ and change f to the new twisted density

g_ε(x) = e^{εx} f(x) / K(ε).    (10)

Our g in (9) is the special case in which ε is set to be the Lundberg constant γ from (8); g = g_γ.
It is easy to show that, in fact, Ẽ(∆) > 0; that is, using distribution g for the random walk increments ∆_i makes the random walk now have positive drift! (To see this, note that Ẽ(∆) = K′(γ) and recall that we argued above that K′(γ) > 0.)¹
Thus, for a given large b > 0, events such as {D_n > b} are more likely to occur than in the original case when f is used. In fact P̃(M > b) = 1 = P̃(τ(b) < ∞), where P̃ denotes probability when g is used instead of f: the random walk will now with certainty tend to +∞, and hence pass any value b along the way, no matter how large; R_n → ∞ wp1 under P̃.
Noting that the likelihood ratio function is L(x) = f(x)/g(x) = e^{−γx} and using h(x) = I{x > b}, we can use (2) and conclude, for example, that

E(h(∆_1)) = P(∆_1 > b) = P(M_1 > b) = P(D_1 > b) = Ẽ[e^{−γ∆_1} I{∆_1 > b}].    (11)
In two dimensions, utilizing (3), we can take h(x_1, x_2) = I{x_1 > b or x_1 + x_2 > b}, yielding h(∆_1, ∆_2) = I{M_2 > b}. We make the two increments iid, each distributed as g in (9), so that their joint density is the product g_2(x_1, x_2) = g(x_1)g(x_2) = e^{γ(x_1+x_2)} f(x_1)f(x_2). The original joint density is the product f_2(x_1, x_2) = f(x_1)f(x_2), and so L(x_1, x_2) = f_2/g_2 = e^{−γ(x_1+x_2)}, and therefore L(∆_1, ∆_2) = e^{−γR_2}. This then yields

E(h(∆_1, ∆_2)) = P(M_2 > b) = P(D_2 > b) = Ẽ[e^{−γR_2} I{M_2 > b}].    (12)
¹ In the case when S ∼ exp(µ) and T ∼ exp(λ) (the M/M/1 case), with λ < µ, it is easily seen that γ = µ − λ and that the twisted density has the effect of swapping the rates: under g, the service times S_i become iid exponential at rate λ, while the interarrival times T_i become iid exponential at rate µ; this yields an unstable queue.
Continuing analogously to higher dimensions then yields, for any k ≥ 1,

E(h(∆_1, ..., ∆_k)) = P(M_k > b) = P(D_k > b) = Ẽ[e^{−γR_k} I{M_k > b}].    (13)

Letting k → ∞ and applying the likelihood ratio identity at the first passage time τ(b) then yields

P(M > b) = Ẽ[e^{−γR_{τ(b)}} I{τ(b) < ∞}].    (15)
But as we know, P̃(M > b) = P̃(τ(b) < ∞) = 1, since the random walk has positive drift under P̃ (said differently, M = ∞ wp1 under P̃, so for any b > 0, I{M > b} = 1 wp1 under P̃). Thus (15) becomes

P(D > b) = P(M > b) = Ẽ(e^{−γR_{τ(b)}}).    (16)
But by definition, at time τ(b) the random walk has shot past level b: R_{τ(b)} = b + B, where B = R_{τ(b)} − b denotes the overshoot. We finally arrive at

P(D > b) = P(M > b) = e^{−γb} Ẽ(e^{−γB}).    (17)
In essence, we have reduced the problem of computing P(D > b) to that of computing the Laplace transform, evaluated at γ, Ẽ(e^{−γB}), of the overshoot B of a positive-drift random walk.
To put this to good use, we then use Monte Carlo simulation to estimate Ẽ(e^{−γB}): simulate the positive-drift random walk, with increments iid distributed as g, until it first passes level b, and let B_1 denote the overshoot. Set X_1 = e^{−γB_1}. Independently repeat the simulation to obtain another copy of the overshoot, B_2, and so on, for a total of n such iid copies, X_i = e^{−γB_i}, i = 1, 2, ..., n. Then use as the estimate

P(D > b) ≈ e^{−γb} [ (1/n) Σ_{i=1}^{n} X_i ].    (18)
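Here is a minimal implementation of this estimator, specialized to the M/M/1 case of footnote 1 (so the tilted increments just swap the rates, γ = µ − λ, and the known answer P(D > b) = (λ/µ)e^{−γb} is available as a check); the parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
lam, mu = 1.0, 2.0            # original rates: T ~ exp(lam), S ~ exp(mu)
gamma = mu - lam              # Lundberg constant in the M/M/1 case
b, n = 5.0, 10_000

def overshoot():
    # Tilted (positive-drift) walk: rates swapped, S ~ exp(lam), T ~ exp(mu),
    # so level b is passed wp1; return B = R_{tau(b)} - b.
    R = 0.0
    while R <= b:
        R += rng.exponential(1/lam) - rng.exponential(1/mu)
    return R - b

X = np.exp(-gamma * np.array([overshoot() for _ in range(n)]))
est = np.exp(-gamma * b) * X.mean()
se = np.exp(-gamma * b) * X.std(ddof=1) / np.sqrt(n)
print(est, "+/-", 1.96 * se, "exact:", (lam / mu) * np.exp(-gamma * b))
```

Note that e^{−γb} ≈ 0.0067 here, yet every replication produces a useful sample; a naive simulation would need an enormous n to see the event {D > b} even once.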
Note in passing that since Ẽ(e^{−γB}) < 1, we conclude from (17) that P(D > b) ≤ e^{−γb}, an exponential upper bound on the tail of delay. It turns out that, under suitable further conditions, it can be proved that there exists a constant C > 0 such that

P(D > b) ∼ C e^{−γb}, as b → ∞.²
More formally, with F_k denoting the σ-field of events determined by ∆_1, ..., ∆_k, and L_k = L(∆_1, ..., ∆_k) = e^{−γR_k} the likelihood ratio, the change of measure is given by

P̃(A) = E(L_k^{−1} I{A}) = E(e^{γR_k} I{A}),  A ∈ F_k.    (19)

(That (19) really defines a unique probability P̃ on Ω follows from Kolmogorov's extension theorem in probability theory: for each k, (19) does define a probability on F_k; denote it by P̃_k. Consistency of these probabilities follows from the martingale property: for m < k, P̃_k(A) = P̃_m(A), A ∈ F_m; that is, P̃_k restricted to F_m is the same as P̃_m.)
² B = B(b) depends on b. If (under P̃) B(b) converges weakly (in distribution) as b → ∞ to (say) a rv B*, then C = Ẽ(e^{−γB*}). The needed conditions for such weak convergence are that Ẽ(∆) < ∞ and that the first strictly ascending ladder height H = R_{τ(0)} have a non-lattice distribution. But it is known that H is non-lattice if and only if the distribution of ∆ is so, and in our case ∆ has a density g, hence is non-lattice. (We already know that K′(γ) = Ẽ(∆) > 0, but it could be infinite.)
P̃ turns out to be the distribution of a random walk as well, but with new iid increments distributed with the tilted density g defined in (9): P̃(R_1 ≤ x) = E(L_1^{−1} I{R_1 ≤ x}) = G(x) = ∫_{−∞}^{x} g(y) dy, and more generally, with ∆_i = R_i − R_{i−1}, i ≥ 1,

P̃(∆_1 ≤ y_1, ..., ∆_k ≤ y_k) = E(L_k^{−1} I{∆_1 ≤ y_1, ..., ∆_k ≤ y_k}) = G(y_1) × · · · × G(y_k).

Moreover, we can go the other way, by using the non-negative mean-1 (under P̃) martingale L_k = e^{−γR_k}:

P(A) = Ẽ(L_k I{A}) = Ẽ(e^{−γR_k} I{A}),  A ∈ F_k.
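As a sanity check on going "the other way", one can estimate P(R_k > 0) both directly under P and via Ẽ(e^{−γR_k} I{R_k > 0}) under P̃. A sketch, again in the illustrative M/M/1 setting:

```python
import numpy as np

rng = np.random.default_rng(4)
lam, mu, k, n = 1.0, 2.0, 10, 200_000
gamma = mu - lam                     # Lundberg constant, as before

# Direct: increments S - T under the original measure (S~exp(mu), T~exp(lam)).
Rk = rng.exponential(1/mu, (n, k)).sum(1) - rng.exponential(1/lam, (n, k)).sum(1)
direct = np.mean(Rk > 0)

# Via P(A) = E~(e^{-gamma R_k} I{A}) with the tilted (rate-swapped) walk.
Rk_t = rng.exponential(1/lam, (n, k)).sum(1) - rng.exponential(1/mu, (n, k)).sum(1)
via_tilt = np.mean(np.exp(-gamma * Rk_t) * (Rk_t > 0))

print(direct, via_tilt)              # should agree up to Monte Carlo error
```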
Remark 1.1 The Lundberg constant, and the change of measure using it, generalizes nicely to continuous-time Lévy processes, Brownian motion for example (and for Brownian motion such change-of-measure results are usually known as Girsanov's Theorem). For example, if X(t) = σB(t) + µt is a Brownian motion with negative drift, µ < 0, then solving 1 = E(e^{γX(1)}) = e^{γµ + γ²σ²/2} yields γ = 2|µ|/σ². This yields the martingale L(t) = e^{γX(t)} and the new measure P̃(A) = E(e^{γX(t)} I{A}), A ∈ F_t. It is easily seen that under P̃ the process X(t) remains a Brownian motion but with positive drift µ̃ = |µ|, while the variance remains the same as it was. Using the likelihood ratio identity with τ(b) = inf{t ≥ 0 : X(t) > b} = inf{t ≥ 0 : X(t) = b} (via continuity of sample paths) then yields, as in (17), P(M > b) = e^{−γb} Ẽ(e^{−γB}). But now B = 0, since X(t) has continuous sample paths; there is no overshoot, b is hit exactly. Thus we get an exact exponential distribution,

P(M > b) = e^{−γb},  b > 0,

for the maximum of a negative-drift Brownian motion, a well-known result that can be derived using more basic principles.
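This can be checked numerically with a crude path simulation (a sketch with illustrative parameters; the time grid both truncates the horizon and misses excursions between grid points, so the empirical tail is biased slightly downward):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, b = -0.5, 1.0, 2.0
gamma = 2 * abs(mu) / sigma**2          # here gamma = 1, so P(M > b) = e^{-2}

T, dt, n = 100.0, 0.01, 4_000           # finite grid approximating max over t >= 0
steps = int(T / dt)
hits = 0
for _ in range(n):
    # Euler path of X(t) = sigma*B(t) + mu*t on the grid.
    path = np.cumsum(mu * dt + sigma * np.sqrt(dt) * rng.standard_normal(steps))
    hits += path.max() > b
print(hits / n, np.exp(-gamma * b))     # empirical tail vs exact e^{-gamma*b}
```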
As a second example, we consider the Lévy process Z(t) = σB(t) + µt + Y(t), where, independently, we have added on a compound Poisson process

Y(t) = Σ_{i=1}^{N(t)} J_i,

where {N(t)} is a Poisson process at rate λ and the J_i are iid with a given distribution H(x) = P(J ≤ x), x ∈ ℝ. Also, let Ĥ(s) = E(e^{sJ}) denote the moment generating function of H, assumed to be finite for sufficiently small s > 0 (e.g., H is light-tailed). Choosing any ε for which

K_Z(ε) = E(e^{εZ(1)}) = e^{εµ + ε²σ²/2 + λ(Ĥ(ε)−1)} < ∞,

we always obtain a martingale L(t) = [K_Z(ε)]^{−t} e^{εZ(t)}, and a new measure P̃(A) = E(L(t) I{A}), A ∈ F_t.
We now show that under P̃, Z remains the same kind of Lévy process, but with drift µ̃ = µ + εσ², variance unchanged (σ̃² = σ²), jump rate λ̃ = λĤ(ε), and the jump distribution H exponentially tilted to H̃, given by dH̃(x) = e^{εx} dH(x)/Ĥ(ε).
To this end, we need to confirm that, for s ≥ 0,

Ẽ(e^{sZ(1)}) = e^{sµ̃ + s²σ²/2 + λ̃(Ĥ̃(s)−1)},

where Ĥ̃(s) = Ẽ(e^{sJ}) = ∫ e^{sx} dH̃(x) = Ĥ(ε + s)/Ĥ(ε) is the moment generating function of H̃.
Direct calculations yield

Ẽ(e^{sZ(1)}) = E(L(1) e^{sZ(1)})
            = [K_Z(ε)]^{−1} E(e^{εZ(1)} e^{sZ(1)})
            = [K_Z(ε)]^{−1} K_Z(ε + s)
            = e^{−εµ − ε²σ²/2 − λ(Ĥ(ε)−1)} × e^{(ε+s)µ + (ε+s)²σ²/2 + λ(Ĥ(ε+s)−1)}
            = e^{sµ̃ + s²σ²/2 + λ̃(Ĥ̃(s)−1)},

as was to be shown.
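The identity can also be verified numerically: simulate Z(1) under the original measure, form the left side as E(L(1)e^{sZ(1)}), and compare with the closed form on the right evaluated at the tilted parameters. A sketch with illustrative parameters and J ∼ exp(β), so that Ĥ(s) = β/(β − s) for s < β:

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, lam, beta = -2.0, 1.0, 1.0, 2.0   # jumps J ~ exp(beta)
eps, s, n = 0.5, 0.3, 200_000                # need eps + s < beta here

Hhat = lambda t: beta / (beta - t)
KZ = lambda t: np.exp(t*mu + t**2 * sigma**2 / 2 + lam * (Hhat(t) - 1))

# Z(1) = sigma*B(1) + mu + sum of N ~ Poisson(lam) jumps.
N = rng.poisson(lam, n)
jumps = np.array([rng.exponential(1/beta, m).sum() for m in N])
Z1 = sigma * rng.standard_normal(n) + mu + jumps

lhs = np.mean(np.exp(eps * Z1) / KZ(eps) * np.exp(s * Z1))  # E(L(1) e^{sZ(1)})

mu_t, lam_t = mu + eps * sigma**2, lam * Hhat(eps)          # tilted parameters
Hhat_t = lambda t: Hhat(eps + t) / Hhat(eps)                # mgf of tilted H
rhs = np.exp(s * mu_t + s**2 * sigma**2 / 2 + lam_t * (Hhat_t(s) - 1))
print(lhs, rhs)   # should agree up to Monte Carlo error
```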
As for the Lundberg constant: we assume a priori that Z has negative drift, E(Z(1)) = µ + λE(J) < 0, so that M = max_{t≥0} Z(t) defines a finite random variable. Solving for a γ > 0 such that 1 = K_Z(γ) = e^{γµ + γ²σ²/2 + λ(Ĥ(γ)−1)} leads to the equation

γµ + γ²σ²/2 + λ(Ĥ(γ) − 1) = 0.
Assuming a solution exists (this depends on H), we then use as the martingale L(t) = e^{γZ(t)}. Under P̃, Z now has positive drift: K_Z′(γ) = Ẽ(Z(1)) = µ̃ + λ̃Ẽ(J) > 0.
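For instance (a sketch, with the same illustrative parameters as above), the Lundberg equation can be solved numerically for J ∼ exp(β):

```python
from scipy.optimize import brentq

mu, sigma, lam, beta = -2.0, 1.0, 1.0, 2.0   # J ~ exp(beta); illustrative
assert mu + lam / beta < 0                   # negative drift, so M < infinity wp1

def lundberg(g):
    Hhat = beta / (beta - g)                 # mgf of exp(beta) jumps, g < beta
    return g * mu + g**2 * sigma**2 / 2 + lam * (Hhat - 1.0)

# The left side dips negative near 0 (slope E(Z(1)) < 0) and blows up as
# g -> beta, so a root gamma exists in (0, beta).
gamma = brentq(lundberg, 1e-9, beta - 1e-9)
mu_t, lam_t = mu + gamma * sigma**2, lam * beta / (beta - gamma)
print(gamma, mu_t + lam_t / (beta - gamma))  # tilted drift is positive
```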
Just as for the FIFO GI/GI/1 queue, we obtain exactly the same kind of exponential bound for the tail of M: using the likelihood ratio identity with τ(b) = inf{t ≥ 0 : Z(t) > b} then yields, as in (17), P(M > b) = e^{−γb} Ẽ(e^{−γB}). Now, because of the jumps J_i, there is an overshoot B to deal with (unless the jumps are ≤ 0 wp1). All of this goes through for general negative-drift Lévy processes, the idea being that under P̃ the process remains Lévy, but with new parameters making it have positive drift.