Delta Method
In the simplest form of the central limit theorem, Theorem 4.18, we consider a sequence
$X_1, X_2, \ldots$ of independent and identically distributed (univariate) random variables with
finite variance $\sigma^2$. In this case, the central limit theorem states that
$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \sigma Z, \tag{5.1}$$
by a constant multiple. Yet what if the function $g(t)$ is nonlinear? It is in this nonlinear
case that a strong understanding of the argument above, as simple as it may be, pays real
dividends. For if $\bar{X}_n$ is consistent for $\mu$ (say), then we know that, roughly speaking, $\bar{X}_n$
will be very close to $\mu$ for large $n$. Therefore, the only meaningful aspect of the behavior of
$g(t)$ is its behavior in a small neighborhood of $\mu$. And in a small neighborhood of $\mu$, $g(t)$
may be considered to be roughly a linear function if we use a first-order Taylor expansion.
In particular, we may approximate
$$g(t) \approx g(\mu) + g'(\mu)(t - \mu)$$
for $t$ in a small neighborhood of $\mu$. We see that $g'(\mu)$ is the multiple of $t$, and so the logic of
the linear case above suggests
$$\sqrt{n}\{g(\bar{X}_n) - g(\mu)\} \xrightarrow{d} g'(\mu)\sigma Z. \tag{5.2}$$
Indeed, expression (5.2) is a special case of the powerful theorem known as the delta method.
Theorem 5.1 Delta method: If $g'(a)$ exists and $n^b(X_n - a) \xrightarrow{d} X$ for $b > 0$, then
$$n^b\{g(X_n) - g(a)\} \xrightarrow{d} g'(a)X.$$
The proof of the delta method uses Taylor’s theorem, Theorem 1.18: Since $X_n - a \xrightarrow{P} 0$
(because $n^b(X_n - a)$ converges in distribution and $n^{-b} \to 0$),
$$n^b\{g(X_n) - g(a)\} = n^b(X_n - a)\{g'(a) + o_P(1)\},$$
and thus Slutsky’s theorem together with the fact that $n^b(X_n - a) \xrightarrow{d} X$ proves the result.
Expression (5.2) may be reexpressed as a corollary of Theorem 5.1:
Corollary 5.2 The often-used special case of Theorem 5.1 in which $X$ is normally
distributed states that if $g'(\mu)$ exists and $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$, then
$$\sqrt{n}\{g(\bar{X}_n) - g(\mu)\} \xrightarrow{d} N\{0, \sigma^2 g'(\mu)^2\}.$$
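As a rough numerical check of Corollary 5.2, the following sketch simulates $\sqrt{n}\{g(\bar{X}_n) - g(\mu)\}$ many times and compares its empirical variance with $\sigma^2 g'(\mu)^2$. The choice of Exponential(1) data (so that $\mu = 1$ and $\sigma^2 = 1$), the function $g(t) = t^2$, the sample size, the number of replications, and all function names are assumptions made here for illustration only.

```python
import math
import random
import statistics


def simulate_scaled_error(g, mu, sampler, n, reps, rng):
    """Draw `reps` values of sqrt(n) * (g(sample mean) - g(mu))."""
    draws = []
    for _ in range(reps):
        xbar = sum(sampler(rng) for _ in range(n)) / n
        draws.append(math.sqrt(n) * (g(xbar) - g(mu)))
    return draws


def delta_method_variance(sigma2, g_prime, mu):
    """Asymptotic variance sigma^2 * g'(mu)^2 from Corollary 5.2."""
    return sigma2 * g_prime(mu) ** 2


rng = random.Random(0)
# Assumed illustration: Exponential(1) data, so mu = 1 and sigma^2 = 1; g(t) = t^2.
draws = simulate_scaled_error(lambda t: t * t, 1.0, lambda r: r.expovariate(1.0),
                              n=200, reps=2000, rng=rng)
simulated = statistics.pvariance(draws)
predicted = delta_method_variance(1.0, lambda t: 2.0 * t, 1.0)
print(f"simulated variance {simulated:.3f}, delta-method variance {predicted:.3f}")
```

The two printed numbers should be close but not identical, since the corollary is a statement about the limit as $n \to \infty$ and the simulation uses a finite $n$ and a finite number of replications.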
Ultimately, we will extend Theorem 5.1 in two directions: Theorem 5.5 deals with the special
case in which $g'(a) = 0$, and Theorem 5.6 is the multivariate version of the delta method. But
we first apply the delta method to a couple of simple examples that illustrate a frequently
understood but seldom stated principle: When we speak of the “asymptotic distribution” of
a sequence of random variables, we generally refer to a nontrivial (i.e., nonconstant) distribution.
Thus, in the case of an independent and identically distributed sequence $X_1, X_2, \ldots$ of
random variables with finite variance, the phrase “asymptotic distribution of $\bar{X}_n$” generally
refers to the fact that
$$\sqrt{n}(\bar{X}_n - E X_1) \xrightarrow{d} N(0, \operatorname{Var} X_1),$$
not the fact that $\bar{X}_n \xrightarrow{P} E X_1$.
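Example 5.3 Asymptotic distribution of $\bar{X}_n^2$: Suppose $X_1, X_2, \ldots$ are iid with mean
$\mu$ and finite variance $\sigma^2$. Then by the central limit theorem,
$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2).$$
Therefore, the delta method (Corollary 5.2) with $g(t) = t^2$ and $g'(\mu) = 2\mu$ gives
$$\sqrt{n}(\bar{X}_n^2 - \mu^2) \xrightarrow{d} N(0, 4\mu^2\sigma^2). \tag{5.3}$$
However, this is not necessarily the end of the story. If $\mu = 0$, then the normal
limit in (5.3) is degenerate; that is, expression (5.3) merely states that $\sqrt{n}\,\bar{X}_n^2$
converges in probability to the constant 0. This is not what we mean by the
asymptotic distribution! Thus, we must treat the case $\mu = 0$ separately, noting
in that case that $\sqrt{n}\,\bar{X}_n \xrightarrow{d} N(0, \sigma^2)$ by the central limit theorem, which implies
that
$$n\bar{X}_n^2 \xrightarrow{d} \sigma^2\chi^2_1.$$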
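The $\mu = 0$ case can be examined numerically. The sketch below simulates $n\bar{X}_n^2$ and compares two summaries with what the $\sigma^2\chi^2_1$ limit predicts: the mean (which should be near $\sigma^2$, since a $\chi^2_1$ variable has mean 1) and the probability of falling below $\sigma^2$ (which should be near $P(\chi^2_1 \le 1) \approx 0.683$). The Uniform$(-1, 1)$ data, for which $\sigma^2 = 1/3$, and the simulation sizes are assumptions chosen here for illustration.

```python
import random


def scaled_squared_mean(n, rng):
    """Return n * (sample mean)^2 for n iid Uniform(-1, 1) draws (mean 0, variance 1/3)."""
    xbar = sum(rng.uniform(-1.0, 1.0) for _ in range(n)) / n
    return n * xbar * xbar


rng = random.Random(1)
sigma2 = 1.0 / 3.0
n, reps = 100, 4000
draws = [scaled_squared_mean(n, rng) for _ in range(reps)]

mean_draw = sum(draws) / reps                         # E(sigma^2 * chi^2_1) = sigma^2
frac_below = sum(d <= sigma2 for d in draws) / reps   # P(chi^2_1 <= 1) is about 0.683
print(f"mean {mean_draw:.3f} (limit {sigma2:.3f}), "
      f"P(n * xbar^2 <= sigma^2) = {frac_below:.3f}")
```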
Note that in the case $p = 1/2$, this does not give the asymptotic distribution of
$\delta_n$. Exercise 5.1 gives a hint about how to find the asymptotic distribution of $\delta_n$
in this case.
We have seen in the preceding examples that if $g'(a) = 0$, then the delta method gives
something other than the asymptotic distribution we seek. However, by using more terms
in the Taylor expansion, we obtain the following generalization of Theorem 5.1:
Theorem 5.5 If $g(t)$ has $r$ derivatives at the point $a$ and $g'(a) = g''(a) = \cdots = g^{(r-1)}(a) = 0$,
then $n^b(X_n - a) \xrightarrow{d} X$ for $b > 0$ implies that
$$n^{rb}\{g(X_n) - g(a)\} \xrightarrow{d} \frac{1}{r!}\, g^{(r)}(a) X^r.$$
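Theorem 5.6 Multivariate delta method: If $g : \mathbb{R}^k \to \mathbb{R}^\ell$ has a derivative $\nabla g(a)$ at
$a \in \mathbb{R}^k$ and
$$n^b(X_n - a) \xrightarrow{d} Y$$
for some $b > 0$, then
$$n^b\{g(X_n) - g(a)\} \xrightarrow{d} [\nabla g(a)]^T Y.$$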
The proof of Theorem 5.6 involves a simple application of the multivariate Taylor expansion
of Equation (1.18).
Often, if $E(X_i) = \mu$ is the parameter of interest, the central limit theorem gives
$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N\{0, \sigma^2(\mu)\}.$$
In other words, the variance of the limiting distribution is a function of $\mu$. This is a problem
if we wish to do inference for $\mu$, because ideally the limiting distribution should not depend
on the unknown $\mu$. The delta method gives a possible solution: Since
$$\sqrt{n}\{g(\bar{X}_n) - g(\mu)\} \xrightarrow{d} N\{0, \sigma^2(\mu) g'(\mu)^2\},$$
we may search for a transformation $g(x)$ such that $g'(\mu)\sigma(\mu)$ is a constant. Such a transformation
is called a variance stabilizing transformation.
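To make the idea concrete, consider an illustration that is not taken from the text: exponentially distributed data with mean $\mu$, for which $\sigma^2(\mu) = \mu^2$. Requiring $g'(\mu)\sigma(\mu) = 1$ gives $g'(\mu) = 1/\mu$, so $g(t) = \log t$ is a candidate. The sketch below, whose distribution, parameter values, and function names are all assumptions, compares the variance of $\sqrt{n}(\bar{X}_n - \mu)$, which should change with $\mu$, to that of $\sqrt{n}(\log \bar{X}_n - \log \mu)$, which should stay near 1.

```python
import math
import random
import statistics


def scaled_errors(mu, n, reps, rng):
    """Simulate sqrt(n)*(xbar - mu) and sqrt(n)*(log xbar - log mu) for exponential data."""
    raw, stabilized = [], []
    for _ in range(reps):
        # expovariate takes the rate, which is 1 / mean.
        xbar = sum(rng.expovariate(1.0 / mu) for _ in range(n)) / n
        raw.append(math.sqrt(n) * (xbar - mu))
        stabilized.append(math.sqrt(n) * (math.log(xbar) - math.log(mu)))
    return raw, stabilized


rng = random.Random(2)
results = {}
for mu in (0.5, 2.0, 5.0):
    raw, stabilized = scaled_errors(mu, n=200, reps=2000, rng=rng)
    results[mu] = (statistics.pvariance(raw), statistics.pvariance(stabilized))
    print(f"mu={mu}: raw variance {results[mu][0]:.3f} (limit {mu**2:.3f}), "
          f"stabilized variance {results[mu][1]:.3f} (limit 1)")
```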
Example 5.7 Suppose that $X_1, X_2, \ldots$ are independent normal random variables
with mean 0 and variance $\sigma^2$. Let us define $\tau^2 = \operatorname{Var} X_i^2$, which for the normal
distribution may be seen to be $2\sigma^4$. (To verify this, try showing that $E X_i^4 = 3\sigma^4$
by differentiating the normal characteristic function four times and evaluating at
zero.) Thus, Example 4.11 shows that
$$\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n X_i^2 - \sigma^2\right) \xrightarrow{d} N(0, 2\sigma^4).$$
To do inference for $\sigma^2$ when we believe that our data are truly independent and
identically normally distributed, it would be helpful if the limiting distribution
did not depend on the unknown $\sigma^2$. Therefore, it is sensible in light of Corollary
5.2 to search for a function $g(t)$ such that $2[g'(\sigma^2)]^2\sigma^4$ is not a function of $\sigma^2$. In
other words, we want $g'(t)$ to be proportional to $\sqrt{t^{-2}} = |t|^{-1}$. Clearly $g(t) = \log t$
is such a function. Therefore, we call the logarithm function a variance-stabilizing
function in this example, and Corollary 5.2 shows that
$$\sqrt{n}\left\{\log\left(\frac{1}{n}\sum_{i=1}^n X_i^2\right) - \log \sigma^2\right\} \xrightarrow{d} N(0, 2).$$
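One way to use this result for inference is to build an approximate 95% interval for $\log \sigma^2$ as $\log(\frac{1}{n}\sum X_i^2) \pm 1.96\sqrt{2/n}$ and then exponentiate the endpoints. The sketch below estimates the coverage of such an interval by simulation. The interval construction, the quantile 1.96, the values of $\sigma$, and the function names are assumptions introduced here as one possible reading of the example, not a procedure stated in the text.

```python
import math
import random


def log_scale_interval(xs, z=1.96):
    """Approximate 95% interval for sigma^2 from mean-zero normal data via the log transform."""
    n = len(xs)
    mean_sq = sum(x * x for x in xs) / n
    half_width = z * math.sqrt(2.0 / n)   # limiting variance 2 does not involve sigma^2
    center = math.log(mean_sq)
    return math.exp(center - half_width), math.exp(center + half_width)


def coverage(sigma, n, reps, rng):
    """Fraction of simulated intervals that contain the true sigma^2."""
    hits = 0
    for _ in range(reps):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        lo, hi = log_scale_interval(xs)
        hits += lo <= sigma**2 <= hi
    return hits / reps


rng = random.Random(3)
for sigma in (0.5, 3.0):
    print(f"sigma={sigma}: estimated coverage {coverage(sigma, n=200, reps=2000, rng=rng):.3f}")
```

Because the limiting variance is the constant 2, the half-width on the log scale is the same for every $\sigma$, which is exactly what the variance-stabilizing transformation is meant to achieve.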
Exercise 5.5 Let $X_n \sim \text{binomial}(n, p)$, where $p \in (0, 1)$ is unknown. Obtain confidence
intervals for $p$ in two different ways:
(a) Since $\sqrt{n}(X_n/n - p) \xrightarrow{d} N[0, p(1-p)]$, the variance of the limiting distribution
depends only on $p$. Use the fact that $X_n/n \xrightarrow{P} p$ to find a consistent estimator of
the variance and use it to derive a 95% confidence interval for $p$.
(b) Use the result of problem 5.3(b) to derive a 95% confidence interval for p.
(c) Evaluate the two confidence intervals in parts (a) and (b) numerically for
all combinations of n ∈ {10, 100, 1000} and p ∈ {.1, .3, .5} as follows: For 1000
realizations of X ∼ bin(n, p), construct both 95% confidence intervals and keep
track of how many times (out of 1000) that the confidence intervals contain p.
Report the observed proportion of successes for each (n, p) combination. Does
your study reveal any differences in the performance of these two competing
methods?
The weak law of large numbers tells us that if $X_1, X_2, \ldots$ are independent and identically
distributed with $E|X_1|^k < \infty$, then
$$\frac{1}{n}\sum_{i=1}^n X_i^k \xrightarrow{P} E X_1^k.$$
That is, sample moments are (weakly) consistent. For example, the sample variance, which
we define as
$$s_n^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - (\bar{X}_n)^2, \tag{5.4}$$
Letting
$$A_n = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}$$
and $B_n = \sigma/s_n$, we obtain $T_n = A_n B_n$. Therefore, since $A_n \xrightarrow{d} N(0, 1)$ by the
central limit theorem and $B_n \xrightarrow{P} 1$ by the weak law of large numbers, Slutsky’s
theorem implies that $T_n \xrightarrow{d} N(0, 1)$. In other words, T statistics are asymptotically
normal under the null hypothesis.
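The asymptotic normality of $T_n = A_n B_n = \sqrt{n}(\bar{X}_n - \mu)/s_n$ does not require normal data. As a hedged illustration, the sketch below simulates $T_n$ for skewed Exponential(1) data with the true mean $\mu = 1$ and estimates $P(|T_n| \le 1.96)$, which the $N(0, 1)$ limit puts at about 0.95. The data distribution, the sample size, and the function name are assumptions made here; $s_n^2$ is computed with the $1/n$ form.

```python
import math
import random


def t_statistic(xs, mu0):
    """T_n = sqrt(n) * (xbar - mu0) / s_n, with s_n^2 the 1/n sample variance."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum(x * x for x in xs) / n - xbar * xbar
    return math.sqrt(n) * (xbar - mu0) / math.sqrt(s2)


rng = random.Random(5)
n, reps = 200, 2000
# Assumed illustration: Exponential(1) data, a skewed non-normal distribution with mean 1.
t_values = [t_statistic([rng.expovariate(1.0) for _ in range(n)], mu0=1.0)
            for _ in range(reps)]
inside = sum(abs(t) <= 1.96 for t in t_values) / reps
print(f"P(|T_n| <= 1.96) is roughly {inside:.3f}; the N(0, 1) limit gives about 0.95")
```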
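$$\sqrt{n}\left\{\begin{pmatrix} \bar{Y}_n \\ \bar{Z}_n \end{pmatrix} - \begin{pmatrix} 0 \\ \sigma^2 \end{pmatrix}\right\} \xrightarrow{d} N_2\left\{0, \begin{pmatrix} \sigma^2 & \gamma \\ \gamma & \tau^2 \end{pmatrix}\right\}.$$
Note that the above result uses the fact that $\operatorname{Cov}(Y_1, Z_1) = \gamma$.
We may write $S_n^2 = \bar{Z}_n - (\bar{Y}_n)^2$. Therefore, define the function $g(a, b) = b - a^2$
and note that this gives $\dot{g}(a, b) = (-2a, 1)$. To use the delta method, we should
evaluate
$$\dot{g}(0, \sigma^2) \begin{pmatrix} \sigma^2 & \gamma \\ \gamma & \tau^2 \end{pmatrix} \dot{g}(0, \sigma^2)^T = (0, 1) \begin{pmatrix} \sigma^2 & \gamma \\ \gamma & \tau^2 \end{pmatrix} \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \tau^2.$$
We conclude that
$$\sqrt{n}\left\{g\begin{pmatrix} \bar{Y}_n \\ \bar{Z}_n \end{pmatrix} - g\begin{pmatrix} 0 \\ \sigma^2 \end{pmatrix}\right\} = \sqrt{n}(S_n^2 - \sigma^2) \xrightarrow{d} N(0, \tau^2).$$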
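A simulation can make the conclusion $\sqrt{n}(S_n^2 - \sigma^2) \xrightarrow{d} N(0, \tau^2)$ tangible. The sketch below assumes that $\tau^2$ is the variance of the squared centered observation (our reading of the notation above) and uses Uniform(0, 1) data, for which $\sigma^2 = 1/12$ and the fourth central moment is $1/80$, so that $\tau^2 = 1/80 - 1/144 = 1/180$. The data distribution, the simulation sizes, and the function name are illustrative assumptions.

```python
import math
import random
import statistics


def sample_variance(xs):
    """S_n^2 = mean of squares minus squared mean (the 1/n version)."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum(x * x for x in xs) / n - xbar * xbar


# Assumed illustration: Uniform(0, 1) data, for which sigma^2 = 1/12 and the
# fourth central moment is 1/80, so tau^2 = 1/80 - (1/12)^2 = 1/180.
sigma2 = 1.0 / 12.0
tau2 = 1.0 / 80.0 - sigma2**2

rng = random.Random(6)
n, reps = 200, 2000
draws = [math.sqrt(n) * (sample_variance([rng.random() for _ in range(n)]) - sigma2)
         for _ in range(reps)]
simulated = statistics.pvariance(draws)
print(f"simulated variance {simulated:.5f}, tau^2 = {tau2:.5f}")
```

Since the sample variance is unchanged by shifting the data, it does not matter that the simulated observations are not centered at zero.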
(b) For $n = 25$, estimate the true level of the test in part (a) for $\sigma_0^2 = 1$ by
simulating 5000 samples of size $n = 25$ from the null distribution. Report the
proportion of cases in which you reject the null hypothesis according to your test
(ideally, this proportion will be about .05).
Suppose that $(X_1, Y_1), (X_2, Y_2), \ldots$ are iid vectors with $E X_i^4 < \infty$ and $E Y_i^4 < \infty$. For
the sake of simplicity, we will assume without loss of generality that $E X_i = E Y_i = 0$
(alternatively, we could base all of the following derivations on the centered versions of the
random variables).
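We wish to find the asymptotic distribution of the sample correlation coefficient, $r$. If we let
$$\begin{pmatrix} m_x \\ m_y \\ m_{xx} \\ m_{yy} \\ m_{xy} \end{pmatrix} = \frac{1}{n} \begin{pmatrix} \sum_{i=1}^n X_i \\ \sum_{i=1}^n Y_i \\ \sum_{i=1}^n X_i^2 \\ \sum_{i=1}^n Y_i^2 \\ \sum_{i=1}^n X_i Y_i \end{pmatrix} \tag{5.5}$$
and
$$s_x^2 = m_{xx} - m_x^2, \quad s_y^2 = m_{yy} - m_y^2, \quad \text{and} \quad s_{xy} = m_{xy} - m_x m_y, \tag{5.6}$$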
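Let $\Sigma$ denote the covariance matrix in expression (5.7). Define a function $g : \mathbb{R}^5 \to \mathbb{R}^3$ such
that $g$ applied to the vector of moments in Equation (5.5) yields the vector $(s_x^2, s_y^2, s_{xy})$ as
defined in expression (5.6); that is, $g(a, b, c, d, e) = (c - a^2, d - b^2, e - ab)$. Then
$$\nabla g \begin{pmatrix} a \\ b \\ c \\ d \\ e \end{pmatrix} = \begin{pmatrix} -2a & 0 & -b \\ 0 & -2b & -a \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$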
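Therefore, if we let
$$\Sigma^* = \left[\nabla g \begin{pmatrix} 0 \\ 0 \\ \sigma_x^2 \\ \sigma_y^2 \\ \sigma_{xy} \end{pmatrix}\right]^T \Sigma \left[\nabla g \begin{pmatrix} 0 \\ 0 \\ \sigma_x^2 \\ \sigma_y^2 \\ \sigma_{xy} \end{pmatrix}\right] = \begin{pmatrix} \operatorname{Cov}(X_1^2, X_1^2) & \operatorname{Cov}(X_1^2, Y_1^2) & \operatorname{Cov}(X_1^2, X_1 Y_1) \\ \operatorname{Cov}(Y_1^2, X_1^2) & \operatorname{Cov}(Y_1^2, Y_1^2) & \operatorname{Cov}(Y_1^2, X_1 Y_1) \\ \operatorname{Cov}(X_1 Y_1, X_1^2) & \operatorname{Cov}(X_1 Y_1, Y_1^2) & \operatorname{Cov}(X_1 Y_1, X_1 Y_1) \end{pmatrix},$$
then by the delta method,
$$\sqrt{n}\left\{\begin{pmatrix} s_x^2 \\ s_y^2 \\ s_{xy} \end{pmatrix} - \begin{pmatrix} \sigma_x^2 \\ \sigma_y^2 \\ \sigma_{xy} \end{pmatrix}\right\} \xrightarrow{d} N_3(0, \Sigma^*). \tag{5.8}$$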
As an aside, note that expression (5.8) gives the same marginal asymptotic distribution for
$\sqrt{n}(s_x^2 - \sigma_x^2)$ as was derived using a different approach in Example 4.11, since $\operatorname{Cov}(X_1^2, X_1^2)$
is the same as $\tau^2$ in that example.
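Next, define the function $h(a, b, c) = c/\sqrt{ab}$, so that we have $h(s_x^2, s_y^2, s_{xy}) = r$. Then
$$[\nabla h(a, b, c)]^T = \left(\frac{-c}{2\sqrt{a^3 b}}, \frac{-c}{2\sqrt{a b^3}}, \frac{1}{\sqrt{ab}}\right),$$
so that
$$[\nabla h(\sigma_x^2, \sigma_y^2, \sigma_{xy})]^T = \left(\frac{-\sigma_{xy}}{2\sigma_x^3 \sigma_y}, \frac{-\sigma_{xy}}{2\sigma_x \sigma_y^3}, \frac{1}{\sigma_x \sigma_y}\right) = \left(\frac{-\rho}{2\sigma_x^2}, \frac{-\rho}{2\sigma_y^2}, \frac{1}{\sigma_x \sigma_y}\right). \tag{5.9}$$
Therefore, if $A$ denotes the $1 \times 3$ matrix in Equation (5.9), using the delta method once
again yields
$$\sqrt{n}(r - \rho) \xrightarrow{d} N(0, A \Sigma^* A^T).$$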
To recap, we have used the basic tools of the multivariate central limit theorem and the
multivariate delta method to obtain a univariate result. This derivation of univariate facts
via multivariate techniques is common practice in statistical large-sample theory.
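Example 5.10 Consider the special case of bivariate normal $(X_i, Y_i)$. In this case,
we may derive
$$\Sigma^* = \begin{pmatrix} 2\sigma_x^4 & 2\rho^2\sigma_x^2\sigma_y^2 & 2\rho\sigma_x^3\sigma_y \\ 2\rho^2\sigma_x^2\sigma_y^2 & 2\sigma_y^4 & 2\rho\sigma_x\sigma_y^3 \\ 2\rho\sigma_x^3\sigma_y & 2\rho\sigma_x\sigma_y^3 & (1 + \rho^2)\sigma_x^2\sigma_y^2 \end{pmatrix}.$$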
In this case, $A \Sigma^* A^T = (1 - \rho^2)^2$, which implies that
$$\sqrt{n}(r - \rho) \xrightarrow{d} N\{0, (1 - \rho^2)^2\}. \tag{5.11}$$
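Expression (5.11) can be checked by simulation. The following sketch draws bivariate normal samples with a chosen correlation, computes the sample correlation coefficient from $1/n$ sample moments, and compares the empirical variance of $\sqrt{n}(r - \rho)$ with $(1 - \rho^2)^2$. The value $\rho = 0.5$, the unit variances, the simulation sizes, and the function names are assumptions chosen here for illustration.

```python
import math
import random
import statistics


def sample_correlation(pairs):
    """r = s_xy / sqrt(s_x^2 * s_y^2), using 1/n sample moments."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs) / n
    syy = sum((y - my) ** 2 for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs) / n
    return sxy / math.sqrt(sxx * syy)


def bivariate_normal_pairs(rho, n, rng):
    """n iid standard bivariate normal pairs with correlation rho."""
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        pairs.append((z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2))
    return pairs


rng = random.Random(7)
rho, n, reps = 0.5, 200, 2000
draws = [math.sqrt(n) * (sample_correlation(bivariate_normal_pairs(rho, n, rng)) - rho)
         for _ in range(reps)]
simulated = statistics.pvariance(draws)
print(f"simulated variance {simulated:.3f}, (1 - rho^2)^2 = {(1 - rho**2) ** 2:.3f}")
```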
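we integrate to obtain
$$f(x) = \frac{1}{2} \log \frac{1 + x}{1 - x}.$$
This is called Fisher’s transformation; we conclude that
$$\sqrt{n}\left\{\frac{1}{2} \log \frac{1 + r}{1 - r} - \frac{1}{2} \log \frac{1 + \rho}{1 - \rho}\right\} \xrightarrow{d} N(0, 1).$$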
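To see the stabilizing effect of Fisher’s transformation, the sketch below estimates the variance of $\sqrt{n}\{f(r) - f(\rho)\}$ for several values of $\rho$ using bivariate normal data; each estimate should be near 1 regardless of $\rho$. The particular values of $\rho$, the sample size, the number of replications, and the function names are assumptions made for this illustration.

```python
import math
import random
import statistics


def fisher(x):
    """Fisher's transformation f(x) = (1/2) log((1 + x) / (1 - x))."""
    return 0.5 * math.log((1.0 + x) / (1.0 - x))


def sample_r(rho, n, rng):
    """Sample correlation of n iid standard bivariate normal pairs with correlation rho."""
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        xs.append(z1)
        ys.append(rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(8)
n, reps = 100, 1500
variances = {}
for rho in (0.0, 0.5, 0.9):
    draws = [math.sqrt(n) * (fisher(sample_r(rho, n, rng)) - fisher(rho))
             for _ in range(reps)]
    variances[rho] = statistics.pvariance(draws)
    print(f"rho={rho}: variance of transformed statistic {variances[rho]:.3f} (limit 1)")
```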
Exercise 5.8 Assume $(X_1, Y_1), \ldots, (X_n, Y_n)$ are iid from some bivariate normal distribution.
Let $\rho$ denote the population correlation coefficient and $r$ the sample
correlation coefficient.
(a) Describe a test of $H_0 : \rho = 0$ against $H_1 : \rho \ne 0$ based on the fact that
$$\sqrt{n}[f(r) - f(\rho)] \xrightarrow{d} N(0, 1),$$
where $f(x)$ is Fisher’s transformation $f(x) = (1/2) \log[(1 + x)/(1 - x)]$. Use
$\alpha = .05$.
(b) Based on 5000 repetitions each, estimate the actual level for this test in the
case when $E(X_i) = E(Y_i) = 0$, $\operatorname{Var}(X_i) = \operatorname{Var}(Y_i) = 1$, and $n \in \{3, 5, 10, 20\}$.
Exercise 5.9 Suppose that $X$ and $Y$ are jointly distributed such that $X$ and $Y$ are
Bernoulli (1/2) random variables with $P(XY = 1) = \theta$ for $\theta \in (0, 1/2)$. Let
$(X_1, Y_1), (X_2, Y_2), \ldots$ be iid with $(X_i, Y_i)$ distributed as $(X, Y)$.
(a) Find the asymptotic distribution of $\sqrt{n}\{(\bar{X}_n, \bar{Y}_n) - (1/2, 1/2)\}$.
(b) If $r_n$ is the sample correlation coefficient for a sample of size $n$, find the
asymptotic distribution of $\sqrt{n}(r_n - \rho)$.
(c) Find a variance stabilizing transformation for rn .
(d) Based on your answer to part (c), construct a 95% confidence interval for θ.
(e) For each combination of $n \in \{5, 20\}$ and $\theta \in \{.05, .25, .45\}$, estimate the true
coverage probability of the confidence interval in part (d) by simulating 5000
samples and the corresponding confidence intervals. One problem you will face
is that in some samples, the sample correlation coefficient is undefined because,
with positive probability, all of the $X_i$ or all of the $Y_i$ will be the same. In such cases,
consider the confidence interval to be undefined and the true parameter therefore
not contained therein.
Hint: To generate a sample of (X, Y ), first simulate the X’s from their marginal
distribution, then simulate the Y ’s according to the conditional distribution of
Y given X. To obtain this conditional distribution, find P (Y = 1 | X = 1) and
P (Y = 1 | X = 0).
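The two-stage sampling scheme in the hint might be sketched as follows. The conditional probabilities used here, $P(Y = 1 \mid X = 1) = 2\theta$ and $P(Y = 1 \mid X = 0) = 1 - 2\theta$, are our own derivation from $P(X = 1) = P(Y = 1) = 1/2$ and $P(XY = 1) = \theta$ and should be verified as part of the exercise. The function names are hypothetical, and returning `None` for an undefined correlation is just one possible convention.

```python
import random


def draw_pair(theta, rng):
    """Draw (X, Y): X is Bernoulli(1/2), then Y from its conditional distribution given X."""
    x = 1 if rng.random() < 0.5 else 0
    # Assumed derivation (not stated in the text): with P(X = 1) = 1/2 and P(XY = 1) = theta,
    # P(Y = 1 | X = 1) = 2 * theta and P(Y = 1 | X = 0) = 1 - 2 * theta.
    p_y = 2.0 * theta if x == 1 else 1.0 - 2.0 * theta
    y = 1 if rng.random() < p_y else 0
    return x, y


def sample_correlation_or_none(pairs):
    """Sample correlation, or None when all X's or all Y's coincide (undefined)."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(9)
pairs = [draw_pair(0.25, rng) for _ in range(20)]
print(sample_correlation_or_none(pairs))
```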