Importance Sampling
Xingjie
2023-12-14
Definition
Suppose we want to compute the expectation of a function $f(X)$ with respect to a distribution with density $p(x)$. So

$$I = \int f(x)\,p(x)\,dx.$$

That is,

$$I = E[f(X)],$$

where the expectation is taken over a random variable $X$ with density $p$. We could approximate $I$ by "naive simulation":

$$I \approx \frac{1}{M} \sum_i f(X_i),$$

where $X_1, \dots, X_M \sim p(x)$.
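As a toy sketch of what this looks like in R (the target $E[X^2]$ under $N(0,1)$ is our own illustrative choice, not part of the example later in this note):

set.seed(1)
M = 100000
X = rnorm(M)  # X_1, ..., X_M ~ p = N(0, 1)
mean(X^2)     # naive Monte Carlo estimate of E[f(X)] with f(x) = x^2; true value is 1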
Now let q(x) denote any other density function that is non-zero whenever p(x) is non-zero. (We need this
condition to avoid dividing by 0 in what follows). Then we can rewrite this as
$$I = \int f(x)\,[p(x)/q(x)]\,q(x)\,dx.$$

That is,

$$I = E\left[\frac{f(X)\,p(X)}{q(X)}\right],$$

where the expectation is taken over a random variable $X$ with density $q$. So we could also approximate $I$ by simulation:

$$I \approx \frac{1}{M} \sum_i \frac{p(X_i')}{q(X_i')}\, f(X_i') = \frac{1}{M} \sum_i w(X_i')\, f(X_i'),$$

where $X_1', \dots, X_M' \sim q(x)$ and $w(x) = p(x)/q(x)$.
This is called "Importance Sampling" (IS) and $q$ is called the "importance sampling function". The quantities $w_i = w(X_i') = p(X_i')/q(X_i')$ are known as importance weights, and they correct the bias introduced by sampling from the "wrong" distribution $q$.
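Continuing the toy sketch above, a minimal IS version of the same computation (the choice $q = N(0, 2)$ is again our own, purely for illustration):

set.seed(1)
M = 100000
Xq = rnorm(M, 0, 2)                    # X'_1, ..., X'_M ~ q = N(0, 2)
w = dnorm(Xq, 0, 1) / dnorm(Xq, 0, 2)  # importance weights w = p/q
mean(w * Xq^2)                         # IS estimate of E[X^2]; again close to 1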
(An obvious problem with instead approximating $I$ by deterministic numerical integration is that the number of terms in the summation grows exponentially with the dimensionality of $X$; the Monte Carlo sums above use $M$ samples regardless of dimension.)
The idea behind IS is that if q is well chosen then the approximation to I will be better than the naive
approximation.
Example 1
Suppose $X \sim N(0, 1)$, and we want to compute $\Pr(X > z)$ for $z = 10$. That is, $f(x) = I(x > 10)$ and $p(x) = \phi(x)$ is the density of the standard normal distribution.
Let’s try naive simulation,
and compare it with the “truth”, as ascertained by the R function pnorm.
set.seed(100)
X = rnorm(100000)
mean(1*(X>10))
## [1] 0
pnorm(10,lower.tail=FALSE)
## [1] 7.619853e-24
You can see the problem with naive simulation: all the simulations are less than 10 (where f(x)=0), so you
don’t get any precision.
Now we use IS. Here we code the general case for $z$, using the IS function $q$ to be $N(z, 1)$. Then

$$w(y) = \frac{\phi(y)}{q(y)} = \exp\left[-\frac{y^2}{2} + \frac{(y - z)^2}{2}\right].$$
Note that because of this choice of $q$ many of the samples are $> z$, where $f$ is non-zero, so we hope to get better precision. Of course, we could do this problem in much better ways... this is just a toy illustration of IS.
pnorm.IS = function(z, nsamp = 100000){
  y = rnorm(nsamp, z, 1)  # sample from q = N(z, 1)
  # importance weights w = phi(y)/q(y), computed via log densities for stability
  w = exp(dnorm(y, 0, 1, log = TRUE) - dnorm(y, z, 1, log = TRUE))
  mean(w * (y > z))  # IS estimate of Pr(X > z)
}
pnorm.IS(10)
## [1] 7.673529e-24
pnorm(10,lower.tail=FALSE)
## [1] 7.619853e-24
pnorm.IS(100)
## [1] 0
pnorm(100,lower.tail=FALSE)
## [1] 0
Hmmm... we are having numerical issues: the true probability is so far below the smallest representable double that both the IS estimate and pnorm underflow to 0. The trick to solving this is to do things on the log scale:
$$\log \Pr(X > z) \approx \log\left[\frac{1}{M}\sum_i w(X_i')\,I(X_i' > z)\right] = \log\frac{m}{M} + \log\left[\frac{1}{m}\sum_i w(X_i')\,I(X_i' > z)\right],$$

where $m = \sum_i I(X_i' > z)$ is the number of samples exceeding $z$. Since $z$ is so large, each individual weight $w(X_i')$ underflows to 0 in floating point, so we work with the log weights $\log w(X_i')$ instead. The following function computes the log of the mean of $\exp(lx)$ stably by factoring out the largest term before exponentiating.
# function to find the log of the mean of exp(lx), avoiding underflow
lmean = function(lx){
  m = max(lx)                  # factor out the largest term
  m + log(mean(exp(lx - m)))   # log-sum-exp trick
}
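The call that produced the output below is a log-scale version of pnorm.IS; here is a minimal sketch consistent with the decomposition above (the name pnorm.IS.log and its exact form are our own reconstruction):

pnorm.IS.log = function(z, nsamp = 100000){
  y = rnorm(nsamp, z, 1)  # sample from q = N(z, 1)
  # log importance weights; never exponentiate these individually
  lw = dnorm(y, 0, 1, log = TRUE) - dnorm(y, z, 1, log = TRUE)
  # log(m/M) + log of the mean weight among the samples with y > z
  log(mean(y > z)) + lmean(lw[y > z])
}
pnorm.IS.log(100)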
## [1] -5005.571
pnorm(100, lower.tail = FALSE, log.p = TRUE)
## [1] -5005.524
Acknowledgement: This lecture note is adapted from notes by Matthew Stephens.