
Importance Sampling Via a Simulacrum

by ALAN E. WESSEL
Department of Mathematics, Santa Clara University, Santa Clara, CA 95053, U.S.A.
ERIC B. HALL

Department of Electrical Engineering, Southern Methodist University, Dallas, TX 75275, U.S.A.
and GARY L. WISE
Department of Statistics, University of California, Berkeley, CA 94720, U.S.A.
ABSTRACT: A Monte Carlo variance reduction technique known as "importance sampling" has recently been applied to many problems in data communications. This technique holds the promise of offering vast improvements over traditional Monte Carlo methods. An overview of importance sampling applied to the calculation of tail probabilities is presented, as well as examples for which some popular approaches to importance sampling fail to work. New techniques for the calculation of the resulting variances are introduced, as well as a new approach to importance sampling which offers the promise of substantial variance reduction over previous techniques.

I. Introduction

Often the complexity of a problem discourages analytical solutions, and the analytical intractability of such problems may suggest the use of simulation. One example is the calculation of error probabilities for complex systems in communications. In this setting an error probability can often be expressed as a tail probability

P_0 = \int_T^\infty f(x) dx,

where f is a probability density function of concern and T is a positive real number. Of course, actual applications might involve multidimensional integrals over more complicated sets. However, for clarity of exposition, we have chosen to restrict ourselves to the consideration of tail probabilities as given above. Most of our techniques extend straightforwardly to more complicated settings. In an effort to evaluate P_0 based on Monte Carlo techniques, we estimate P_0 by the sample mean

\hat{P}_0 = (1/N) \sum_{i=1}^N h(X_i),
© The Franklin Institute. 0016-0032/90 $3.00+0.00


where h denotes the indicator function of [T, \infty) and X_1, X_2, ..., X_N are mutually independent random variables, each with the density f. If P_0 is small, then we would not expect many of the random variables to contribute to the above sum. In this case an inordinately large number of samples N may have to be used to ensure that \hat{P}_0 is close to P_0 with high probability. Indeed, recalling the standard Chebyshev inequality, for \epsilon > 0, we have that

P(|\hat{P}_0 - P_0| > \epsilon) \le Var(\hat{P}_0) / \epsilon^2.

Since

Var(\hat{P}_0) = (1/N) Var(h(X_1)),

we see that

P(|\hat{P}_0 - P_0| > \epsilon) \le (1/(N\epsilon^2)) Var(h(X_1)) = (1/(N\epsilon^2)) P_0 (1 - P_0).

For such an estimate to have any significance, \epsilon should be chosen to be no greater than some fraction of P_0, say \epsilon = P_0/10. In this case, we see that

P(|\hat{P}_0 - P_0| > P_0/10) \le 100 (1 - P_0) / (N P_0).

Any probability is upper bounded by one. Assuming that P_0 is much less than one, N must be greater than 100/P_0 for this estimate to be meaningful. For sufficiently small values of P_0, the corresponding values for N will result in a problem beyond the scope of simulation. Importance sampling is a technique which introduces an estimator for P_0 which has a smaller variance than the sample mean \hat{P}_0. By this method we can thus reduce either the requisite number of samples N to obtain a fixed variance or the variance associated with a fixed number of samples N. The ratio of variances of two schemes for estimating P_0 is often used as a measure of improvement of one scheme over another. We note that such an improvement factor must be interpreted with caution when the larger variance yields a Chebyshev bound on the probability which is greater than one, since in this case the ratio may exaggerate the improvement. To phrase it differently, large improvement ratios may occur precisely when P_0 is extremely small and one is still obtaining poor results. Let f^* be a probability density function that is positive on [T, \infty). The approach to importance sampling [cf. (1)] begins by noticing that

P_0 = \int_T^\infty f(x) dx = \int_T^\infty [f(x)/f^*(x)] f^*(x) dx.
Journal of the Franklin Institute, Pergamon Press plc


Interpreting this integral as an expectation with respect to f^*, we obtain a new estimate of P_0, denoted by \hat{P}_0^*, given by

\hat{P}_0^* = (1/N) \sum_{i=1}^N h(X_i^*) f(X_i^*) / f^*(X_i^*),

where the X_i^*'s are mutually independent random variables, each having the density f^*. The objective here is to choose f^* in such a manner as to reduce the variance of h(X_1^*) f(X_1^*)/f^*(X_1^*) compared with the variance of h(X_1). Recall that

Var(\hat{P}_0) = (1/N) Var(h(X_1))  and  Var(\hat{P}_0^*) = (1/N) Var(h(X_1^*) f(X_1^*)/f^*(X_1^*)).

The optimal choice for f^* is f^*(x) = h(x) f(x)/P_0, which results in a zero variance for h(X_1^*) f(X_1^*)/f^*(X_1^*). However, this choice of f^* requires knowledge of P_0, the quantity to be determined. The essence of importance sampling consists of choosing an appropriate density f^*. It can easily be seen that any choice for f^* which assigns positive probability to (-\infty, T) can be immediately improved by truncating f^* to the left of T and then renormalizing. This observation appears to have been overlooked by many working in this area. Importance sampling seems to have first been addressed at a conference in 1949 (1). Subsequent studies include efforts in operations research (1), integral equations arising in certain areas of physics (2, pp. 149-156), the evaluation of multidimensional integrals (2, pp. 92-103), numerous simulation texts (2-4), and, more recently, communication theory (5-14). Recently various authors have suggested implementing importance sampling by means of several new choices for the density f^*, and they have analysed the resulting improvements. Two such choices for f^* are stretched and shifted versions of f; these have been referred to as conventional importance sampling (CIS) and improved importance sampling (IIS), respectively (5, 8, 11, 14). Often in calculating tail probabilities such as P_0, f is assumed to be unknown [see discussion in (1)]. Popular choices for f^*, such as f^*(x) = f(x + a), for real nonzero a, which arises in IIS, and such as f^*(x) = \alpha f(\alpha x), for 0 < \alpha < 1, which arises in CIS, then present a serious problem in that the estimator depends on an explicit knowledge of f. This situation is the essence of an often overlooked "Catch 22" associated with this method. In the appendices we present two examples, one for which CIS and one for which IIS techniques do not work. In each case the only useful choice for the biasing density f^* is the original density f; any other choice for the biasing density results in catastrophic performance.
These results should cause some concern in situations where the density under consideration is assumed to be unknown. The next two sections consider the situation where the density f is known. In the second of these sections, we present a new approach to importance sampling
Vol. 327, No. 5, pp. 771-783, 1990. Printed in Great Britain

based on the use of a simulacrum of the tail of the density f. We call this approach Importance Sampling via a Simulacrum (ISS).
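The sample-size barrier described above is easy to see in a short simulation. The following Python sketch (T = 3, N = 2000, and the seed are illustrative choices of ours, not values from the paper) shows naive Monte Carlo failing to resolve a tail probability of roughly 1.35e-3:

```python
import math
import random

# Naive Monte Carlo for P0 = Q(3): the Chebyshev argument requires N > 100/P0.
T = 3.0
P0 = 0.5 * math.erfc(T / math.sqrt(2.0))   # exact tail probability, about 1.35e-3
required_N = 100.0 / P0                    # about 74,000 samples just to be meaningful

random.seed(0)
N = 2000                                   # far below the requirement
hits = sum(1 for _ in range(N) if random.gauss(0.0, 1.0) >= T)
est = hits / N
# With so few samples, 'hits' is typically 0 to 5, so 'est' has almost no resolution.
```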

II. Shifting and Scaling


Unless mentioned otherwise, we assume that the density f is known and that T is a fixed positive real number. We shall need some upper and lower bounds on the integral of the tail of the Gaussian density. Let Q(t) be defined by

Q(t) = \int_t^\infty (1/\sqrt{2\pi}) \exp(-x^2/2) dx

for t > 0. An often-used approximation for Q(t) is given by

Q(t) \approx (1/(t\sqrt{2\pi})) \exp(-t^2/2).

We shall need the following sharper result.

Lemma 1
Q(t) satisfies the following inequalities for t > 0:

(1/\sqrt{2\pi}) (1/t - 1/t^3) \exp(-t^2/2) < Q(t) < (1/(t\sqrt{2\pi})) \exp(-t^2/2).

Proof: This result follows easily from the definition of Q(t) via integration by parts.

As noted earlier, many authors have considered shifted or scaled versions of f as a choice for f^*. In this vein, we consider a transformation on the data which, for each i = 1, ..., N, is given by \nu |X_i| + T, where the parameter \nu is chosen so as to decrease the variance of \hat{P}_0^*. Note that this transformation involves shifting and scaling and ensures that the resulting f^* assigns no probability to the left of T. If X has a symmetric density f(x), then \nu |X| + T has the density given by

f^*(x) = (2/\nu) f((x - T)/\nu) h(x),

which we denote by f^*_\nu. Recall that the variance of h(X_1^*) f(X_1^*)/f^*(X_1^*), which we denote by \sigma_*^2, is given by

\sigma_*^2 = \int_T^\infty [f^2(x)/f^*(x)] dx - P_0^2.

Substituting

f^*(x) = (2/\nu) f((x - T)/\nu)

in the case of a symmetric density f, we see that

\sigma_*^2 = (\nu/2) \int_T^\infty [f^2(x)/f((x - T)/\nu)] dx - P_0^2.


Example 1
Consider the often-analysed case where

f(x) = (1/\sqrt{2\pi}) \exp(-x^2/2)

(8, 11, 14). Recall that

Q(t) = \int_t^\infty f(x) dx

is less than (1/(t\sqrt{2\pi})) \exp(-t^2/2) for t > 0.

Using this inequality and substituting the above choice for f into the above expression for \sigma_*^2, we see, after some algebraic manipulation, that

\sigma_*^2 \le (\nu/(4T\sqrt{2\pi})) \exp(-T^2) - P_0^2,

where \nu must be greater than or equal to 1/\sqrt{2} to retain integrability. Note that P_0^2 = [Q(T)]^2. Thus, using the lower bound to Q(T) given in Lemma 1, we see that

\sigma_*^2 \le (\sqrt{2\pi}\, \nu T/4) (1 - T^{-2})^{-2} P_0^2 - P_0^2.

This upper bound is minimized, subject to our constraints on \nu, when \nu = 1/\sqrt{2}. Of course, only the first term of this upper bound is affected by our choice of \nu. Our choice of a density

f^*(x) = (2/\nu) f((x - T)/\nu) h(x),

which is zero off [T, \infty), has decreased this term by a factor of two over the result obtained using the more conventional choice of

f^*(x) = (1/\nu) f((x - T)/\nu).

Our choice of \nu = 1/\sqrt{2}, which is admittedly nonrobust, then results in an additional rather modest improvement over the result obtained using the commonly considered choice of \nu = 1. However, as we demonstrate below, improvements of

quite another magnitude are achievable. We first consider the case of a Laplace density.
Example 2
Let f(x) = (K/2) \exp(-K|x|), where K is a fixed positive constant. Choosing

f^*(x) = h(x) K \exp(-K(x - T))

straightforwardly results in a zero variance for \hat{P}_0^* when \nu = 1. Indeed, we then have

f^*(x) = h(x) K \exp(-K(x - T)) = h(x) f(x)/P_0,

the optimal choice for f^*. The great advantage of the Laplace density here is that we can trivially "normalize" its tail. This, in turn, suggests the following technique.
Example 3
We now estimate the tail probability of a Gaussian density using a truncated exponential density for f^*. With

f(x) = (1/\sqrt{2\pi}) \exp(-x^2/2)

and

f^*(x) = h(x) \lambda \exp(-\lambda(x - T)),

and using the upper bound for Q(t), we obtain

\sigma_*^2 \le (1/(4\pi\lambda(T - \lambda/2))) \exp(-T^2) - P_0^2  (for 0 < \lambda < 2T).

This upper bound is minimized when \lambda = T, yielding

\sigma_*^2 \le (1/(2\pi T^2)) \exp(-T^2) - P_0^2.



Notice that using the "standard" Q(T) approximation here would yield a zero variance, a result which is obviously false. Using the lower bound for Q(T) given in Lemma 1, we obtain

\sigma_*^2 \le [(1 - T^{-2})^{-2} - 1] P_0^2,

which is on the order of (2/T^2) P_0^2 for T greater than one. We note that this result represents a quite significant improvement over the bound given in Example 1.

III. Importance Sampling via a Simulacrum

We know that choosing for f^* a suitably truncated and normalized version of f results in an estimator with zero variance. As noted before, this normalization
Journal o r h e Franklin Instilute Pergamon Press plc

if
1

and f*(x) = h(x)i.exp(-i.(x-T)),

2nT2

requires knowledge of the quantity to be estimated: hardly a practical choice. This observation does, however, suggest the following approach. First find a function g which mimics the tail behavior of f, yet is simple enough that its integral from T to infinity can be straightforwardly evaluated. (We shall call such a function a simulacrum for f; the precise definition is given below.) Then set f^* equal to a normalized version of g. What the results of the previous section tell us is that a properly chosen exponential density mimics the behavior of the Gaussian tail well enough to produce extraordinary variance reduction. This is not entirely surprising; a glance at the exponential density seems to indicate a closer fit to the Gaussian tail than any shifted or scaled version of the Gaussian density itself. In this section, we examine this approach and describe a more general method for estimating tail probabilities. As we shall see, the family of exponential tails provides a ready supply of potential simulacra; exponential tails mimic the tail behavior of a large family of practical densities and can be easily integrated from T to infinity as well as over more complicated sets. The ISS method is based on the following reasoning. First, choose a nonnegative integrable function g such that the integral

P_g = \int_T^\infty g(x) dx

can be straightforwardly evaluated and such that

|f(x)/g(x) - 1| \le 1 a.e. on [T, \infty).

Call such a function a simulacrum for f. Note that for positive g this last condition is equivalent to g \ge f/2 a.e. on [T, \infty). Setting f^*(x) = h(x) g(x)/P_g, we obtain the following estimate for \sigma_*^2, the variance of h(X_1^*) f(X_1^*)/f^*(X_1^*):

\sigma_*^2 = P_g \int_T^\infty [f^2(x)/g(x)] dx - P_0^2 \le 2 P_g P_0 - P_0^2.

Thus, we see that

Var(\hat{P}_0^*) \le (2 P_g P_0 - P_0^2)/N.

Since Var(\hat{P}_0) = P_0(1 - P_0)/N, we see that if P_g < 1/2, then the method suggested above results in a smaller variance. If g = f/2 a.e. on [T, \infty), then these estimates are sharp and we obtain \sigma_*^2 = Var(\hat{P}_0^*) = 0. Of course, in this case f^* is the optimal, albeit usually impractical, biasing density. Note that in the cases we consider, P_g

compares favorably with P_0, and is thus substantially less than 1/2. (See Example 4 below.) One way to ensure that the function g is a simulacrum is to choose g so that P_g can be straightforwardly evaluated and so that the function r(x) = f(x)/g(x) is differentiable and satisfies:

(i) r(T) = 2,
(ii) r'(x) \le 0 for x \ge T.

These conditions obviously imply that |f(x)/g(x) - 1| \le 1 on [T, \infty).

Example 4
We now apply this method to the case in which f is a generalized Gaussian density and g is chosen from the family of exponential tails. Let

f(x) = C \exp(-|x|^\gamma / \gamma),

where \gamma \ge 1 and C is the appropriate normalization constant. Set

g(x) = (C/2) \exp(-T^\gamma / \gamma) \exp(-\lambda(x - T)),

where \lambda = T^{\gamma - 1}, and note that this choice of g satisfies (i) and (ii). Thus g is a simulacrum for f. In this case,

P_g = (C/(2 T^{\gamma - 1})) \exp(-T^\gamma / \gamma)

and

\sigma_*^2 \le 2 P_g P_0 - P_0^2 = (C/T^{\gamma - 1}) \exp(-T^\gamma / \gamma) P_0 - P_0^2.

Note that when \gamma equals one, f(x) is the Laplace density, g equals f/2 on [T, \infty), and f^*(x) is the optimal biasing density h(x) f(x)/P_0. In this case, \sigma_*^2 = 0. When \gamma equals two, f(x) is the standard Gaussian density, and this method yields

\sigma_*^2 \le (1/(T\sqrt{2\pi})) \exp(-T^2/2) P_0 - P_0^2,

in which the coefficient of P_0 is the upper bound we are using for P_0 = Q(T). Using this upper bound for P_0 in the above inequality, we recover the estimate for \sigma_*^2 obtained in Example 3. Using the tighter inequality

Q(t) < (1/\sqrt{2\pi}) (1/t - 1/t^3 + 3/t^5) \exp(-t^2/2)

(which follows easily along the lines of Lemma 1 via one additional integration by parts), we obtain

\sigma_*^2 \le [(1 - T^{-2})^{-1} - 1] P_0^2 = P_0^2/(T^2 - 1).

For large T, this bound represents an improvement by almost a factor of 2 over the bound given in Example 3. Note that, as in Example 3, setting \lambda = T yields the proper exponential simulacrum and biasing density f^* for the standard Gaussian density.

IV. Final Remarks
Now we turn our attention to the case in which an explicit expression for f is not known. Recall the aforementioned "Catch 22": in order to calculate the estimator we must know the function h(x) f(x)/f^*(x). In specific applications this ratio may be known even when f is not explicitly known. For instance, if f is symmetric, let f^* be the density corresponding to |X_i|; this ratio is then h(x)/2, which results in a modest improvement. In this case, we can base the estimate solely on the mutually independent data X_1, ..., X_N originally drawn from the unknown density f; the resulting estimate is given by

\hat{P}_0^* = (1/(2N)) \sum_{i=1}^N h(|X_i|).

The attempt to implement importance sampling for the case where one knows nothing about the density f seems fraught with difficulties. Aside from the problem mentioned above, given any scheme to estimate P_0, a choice for the unknown density f exists for which the performance is very poor.
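The |X_i| device needs nothing about f beyond its symmetry. A Python sketch with Gaussian data standing in for the unknown symmetric density (T, the sample size, and the seed are illustrative choices of ours):

```python
import math
import random

random.seed(5)
T = 1.0
data = [random.gauss(0.0, 1.0) for _ in range(200_000)]   # stand-in for samples from the unknown f

# Estimator (1/(2N)) * sum of h(|X_i|): only the data and T are needed.
est = sum(1 for x in data if abs(x) >= T) / (2.0 * len(data))
exact = 0.5 * math.erfc(T / math.sqrt(2.0))               # Q(1), about 0.159
```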
Acknowledgements
This research was partially supported by the Office of Naval Research under Grant No. N00014-90-3-1712 and by the Air Force Office of Scientific Research under Grant No. AFOSR-86-0016. Also, the authors gratefully acknowledge helpful conversations on the subject of this paper with Prof. Dana J. Taipale of the University of Texas at Austin.

References
(1) H. Kahn and A. W. Marshall, "Methods of reducing sample size in Monte Carlo computations", J. Operations Res. Soc. Am., Vol. 1, pp. 263-278, 1953.
(2) M. H. Kalos and P. A. Whitlock, "Monte Carlo Methods", Wiley, New York, 1986.
(3) R. Y. Rubinstein, "Simulation and the Monte Carlo Method", Wiley, New York, 1981.
(4) J. M. Hammersley and D. C. Handscomb, "Monte Carlo Methods", Methuen, New York, 1964.
(5) K. S. Shanmugan and P. Balaban, "A modified Monte Carlo simulation technique for the evaluation of error rate in digital communications systems", IEEE Trans. Commun., Vol. COM-28, pp. 1916-1924, 1980.
(6) G. Lank, "Theoretical aspects of importance sampling applied to false alarms", IEEE Trans. Info. Theory, Vol. IT-29, pp. 73-82, 1983.
(7) M. C. Jeruchim, "Techniques for estimating the bit error rate in the simulation of digital communication systems", IEEE J. Selected Areas Commun., Vol. SAC-2, pp. 153-170, 1984.
(8) D. Lu and K. Yao, "Bounds on the variances of importance sampling simulations in digital communication systems", Proc. 25th Annual Allerton Conf. Communication, Control, and Computing, pp. 125-134, 1987.
(9) M. A. Herro and J. M. Nowack, "Simulated Viterbi decoding using importance sampling", Proc. 1987 Conf. Information Sciences and Systems, pp. 718-723, 1987.
(10) P. Hahn and M. Jeruchim, "Developments in the theory and applications of importance sampling", IEEE Trans. Commun., Vol. COM-35, pp. 706-714, 1987.
(11) D. Lu and K. Yao, "New importance sampling technique for the simulation of communication and radar systems", Proc. 1987 Conf. Information Sciences and Systems, pp. 713-717, 1987.
(12) G. Orsak and B. Aazhang, "On the application of importance sampling to the analysis of detection systems", Proc. 25th Annual Allerton Conf. Communication, Control, and Computing, pp. 135-144, 1987.
(13) Q. Wong and V. Bhargava, "On the application of importance sampling to BER estimation in the simulation of digital communication systems", IEEE Trans. Commun., Vol. COM-35, pp. 1231-1233, 1987.
(14) D. Lu and K. Yao, "Improved importance sampling technique for efficient simulation of digital communication systems", IEEE J. Selected Areas Commun., Vol. SAC-6, pp. 67-75, 1988.
Appendix A

In this appendix, we present an example for which conventional importance sampling is useless as a variance reduction technique. Recall that conventional importance sampling is performed using a stretched version of the density f(x), say \alpha f(\alpha x), where 0 < \alpha < 1.
Example A
Let N denote the set of natural numbers. Consider a sequence of real numbers defined as follows:

a_n = 1 if n = 1;
a_n = n(1 + a_{n-1}) if n \in N and n > 1.

We define a probability density function via:

f(x) = f(-x) if x < 0;
f(x) = 2^{-(n+1)} if x \in [a_n, 1 + a_n) for n \in N;
f(x) = 0 otherwise.

Let T be a fixed positive real number. Conventional importance sampling dictates that


our choice for f^* is \alpha f(\alpha x), where \alpha is some fixed element of (0, 1). Ideally, \alpha should be chosen so as to minimize the variance of h(X_1^*) f(X_1^*)/f^*(X_1^*), which is given by

\int_T^\infty [f^2(x)/(\alpha f(\alpha x))] dx - P_0^2.

We shall now show that this variance is infinite. Fix \alpha in (0, 1) and choose m \in N so that m > 1/\alpha and a_m > T. Recall that

f(x) = 2^{-(m+1)} for x \in [a_m, 1 + a_m),

and note that

f(\alpha x) = 0 for x \in [(1 + a_{m-1})/\alpha, a_m/\alpha).

Also, a_m/\alpha > a_m, and, from the definition of a_m, (1 + a_{m-1})/\alpha \le a_m; further, there exists a real number \beta such that a_m < \beta < a_m/\alpha and \beta < 1 + a_m. Thus, we see that f(x) is nonzero on [a_m, \beta) and that f(\alpha x) is zero on

[(1 + a_{m-1})/\alpha, a_m/\alpha),

a proper superset of [a_m, \beta), which implies that

\int_T^\infty [f^2(x)/(\alpha f(\alpha x))] dx = \infty.

Therefore, CIS techniques are useless when applied to the problem of calculating tail probabilities for the density f given above. We note in passing that any density that is nonzero on the support of the above density f and zero off the support of f will exhibit the same phenomenon. Thus, for instance, such a density could be chosen to be infinitely differentiable.
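The gap structure of Example A is easy to exhibit numerically. In this Python sketch (the helper names and the choices \alpha = 0.3, m = 4 are ours, not the paper's), f is the density above, and the chosen point is one where f is positive but the CIS biasing density vanishes:

```python
def a_seq(n):
    """a_1 = 1, a_n = n * (1 + a_{n-1})."""
    a = 1.0
    for k in range(2, n + 1):
        a = k * (1.0 + a)
    return a

def f(x):
    """Density of Example A: 2**-(n+1) on [a_n, 1 + a_n), symmetric, 0 in the gaps."""
    x = abs(x)
    n, a = 1, 1.0
    while a <= x:
        if x < 1.0 + a:
            return 2.0 ** (-(n + 1))
        n += 1
        a = n * (1.0 + a)
    return 0.0

alpha, m = 0.3, 4          # m > 1/alpha, so the overlap argument of the appendix applies
x = a_seq(m)               # a_4 = 64
fx, fax = f(x), f(alpha * x)
# fx > 0 while fax == 0: the weight f(x)/(alpha*f(alpha*x)) blows up, so the variance is infinite.
```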

Appendix B
In this appendix, we present an example for which improved importance sampling is useless as a variance reduction technique. Recall that improved importance sampling is performed using a shifted version of the density f(x), say f(x + a), where a is a nonzero real number.
Example B
Define two real-valued sequences as follows:

a_n = 0 if n = 1; a_n = b_{n-1} + (n-1)^{-2} if n \in N and n > 1;
b_n = 1 if n = 1; b_n = a_n + n if n \in N and n > 1.

Further, define a probability density function f(x) via:

f(x) = f(-x) if x < 0;
f(x) equal to one fixed positive constant on each interval [a_n, b_n) and to another on each interval [b_n, a_{n+1}), n \in N.

To see that f(x) is indeed a probability density function, simply note that the masses assigned to these intervals are summable and normalize to one.

Now, let a be any fixed nonzero real number, and let T be any fixed positive real number. Define M_a = {n \in N : n^{-2} < |a| < n and b_n > T}. Note that M_a is nonempty and, in fact, is an infinite subset of N, since j \in M_a implies that j + 1 \in M_a. Let m_a = min {n \in M_a}. Note that b_n - a_n = n and a_{n+1} - b_n = n^{-2} for all n \in N. Consider the case when a < 0. Then b_n - a_n = n > |a| if n \ge m_a. Thus, a_n - a < b_n if n \ge m_a. Further, a_{n+1} - b_n = n^{-2} < |a| if n \ge m_a. Thus, a_{n+1} < b_n - a if n \ge m_a. Therefore, a_n - a < b_n < a_{n+1} < b_n - a if a < 0 and n \ge m_a. Consider the case when a > 0. Then a_{n+1} - b_n = n^{-2} < a if n \ge m_a. Thus, a_{n+1} - a < b_n if n \ge m_a. Also, b_{n+1} - a_{n+1} = n + 1 > n > a if n \ge m_a. Thus, a_{n+1} < b_{n+1} - a if n \ge m_a. Therefore, a_{n+1} - a < b_n < a_{n+1} < b_{n+1} - a if a > 0 and n \ge m_a. In summary, we have shown that the interval [b_n, a_{n+1}) is a proper subset of [a_n - a, b_n - a) if n \ge m_a and a < 0, and is a proper subset of [a_{n+1} - a, b_{n+1} - a) if n \ge m_a and a > 0.

Further, note that f(x) and f(x + a) are each constant on the intervals in question. Thus, using the previous result, we see that if n \ge m_a, then, for x \in [b_n, a_{n+1}), f(x + a) equals the value taken by f on [a_n, b_n) when a < 0, and equals the value taken by f on [a_{n+1}, b_{n+1}) when a > 0.
Therefore, whether a < 0 or a > 0, the values of f decay quickly enough across these intervals that

\int_T^\infty [f^2(x)/f(x + a)] dx = \infty.

Since this density results in an infinite variance of h(X_1^*) f(X_1^*)/f^*(X_1^*) for any nonzero choice of a, IIS techniques are useless in this situation. Again, we note that, using standard techniques, the above phenomenon could be exhibited by using an infinitely differentiable density.
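The interval ordering established in this appendix can be checked numerically from the sequence definitions alone. A Python sketch (the shift magnitude 0.5 and the index range are illustrative choices of ours):

```python
def seqs(N):
    """a_1 = 0, b_n = a_n + n, a_{n+1} = b_n + n**-2, so b_n - a_n = n and a_{n+1} - b_n = n**-2."""
    A, B = [0.0], [1.0]
    for n in range(2, N + 1):
        A.append(B[-1] + (n - 1) ** -2.0)
        B.append(A[-1] + n)
    return A, B

A, B = seqs(60)
s = 0.5                    # |a| = 0.5, so n**-2 < |a| < n holds for every n >= 2
# a < 0:  a_n - a < b_n < a_{n+1} < b_n - a
ok_neg = all(A[n-1] + s < B[n-1] < A[n] < B[n-1] + s for n in range(2, 60))
# a > 0:  a_{n+1} - a < b_n < a_{n+1} < b_{n+1} - a
ok_pos = all(A[n] - s < B[n-1] < A[n] < B[n] - s for n in range(2, 59))
```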
