Large Sample Theory

Large Sample Theory is a name given to the search for approximations to
the behaviour of statistical procedures which are derived by computing limits
as the sample size, n, tends to infinity. Suppose we have a data set with a
fairly large sample size, say n = 100. We imagine our data set is one in a
sequence of possible data sets — one for each possible value of n. If we have
a sequence of statistics which converges to something as n tends to infinity
we approximate the value of a probability, for instance, when n = 100 by
the corresponding value when n = ∞. In this section we will investigate
approximations of this kind for Maximum Likelihood Estimation.
Most large sample theory uses three main technical tools: the Law of
Large Numbers (LLN), the Central Limit Theorem (CLT) and Taylor ex-
pansion. I assume you have heard of all of these but will state versions of
them as we go.
These tools are generally easier to apply to statistics for which we have
explicit formulas than to statistics, like maximum likelihood estimates, where
we do not usually have explicit formulas. In this situation we study the
equations you solve to find the MLEs instead.
We therefore will study the approximate behaviour of the MLE, θ̂, by
studying the score function U(θ) = ∂ℓ/∂θ, the derivative of the log
likelihood ℓ. Notice first that U is a sum of independent random variables:
U(θ) = ∑ ∂ log f(Xi, θ)/∂θ. This will allow us to apply both the LLN and
the CLT to U.

Theorem 1 (LLN) If Y1, Y2, . . . are iid with mean µ then

    ∑ Yi / n → µ.

This is called the law of large numbers but it comes in two forms: Strong
and Weak.

Theorem 2 (SLLN) If Y1, Y2, . . . are iid with mean µ then

    P( lim n→∞ ∑ Yi / n = µ ) = 1.

The strong law is harder to prove than the weak law of large numbers:

Theorem 3 (WLLN) If Y1, Y2, . . . are iid with mean µ then for each positive ε

    lim n→∞ P( |∑ Yi / n − µ| > ε ) = 0.
For iid Yi the stronger conclusion (the SLLN) holds but for our heuristics
we will ignore the differences between these notions.
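As a quick numerical illustration of the law of large numbers, here is a minimal sketch in Python; the exponential distribution, its mean, and the sample sizes are arbitrary choices, not from the notes.

```python
import random

random.seed(1)

def sample_mean(n, mu=2.0):
    # Average of n iid Exponential draws with mean mu; the LLN says
    # this average converges (in probability) to mu as n grows.
    return sum(random.expovariate(1.0 / mu) for _ in range(n)) / n

# With n = 50,000 the average should sit very close to mu = 2.0;
# with n = 50 it will typically wander further away.
m_small, m_large = sample_mean(50), sample_mean(50_000)
```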
Now suppose that θ0 is the true value of θ. Then

    U(θ)/n → µ(θ)

where

    µ(θ) = Eθ0 [ ∂ log f(Xi, θ)/∂θ ] = ∫ [∂ log f(x, θ)/∂θ] f(x, θ0) dx.


Remark: This convergence is pointwise and might not be uniform. That is,
there might be some θ values where the convergence takes much longer than
others. It could be that for every n there is a θ where U (θ)/n and µ(θ) are
not close together.
Consider as an example the case of N(µ, 1) data where

    U(µ)/n = ∑ (Xi − µ)/n = X̄ − µ.

If the true mean is µ0 then X̄ → µ0 and

    U(µ)/n → µ0 − µ.

If we think of a µ < µ0 we see that the derivative of ℓ(µ) is likely to
be positive, so that ℓ increases as we increase µ. For µ more than µ0 the
derivative is probably negative, and so ℓ tends to be decreasing for µ > µ0.
It follows that ℓ is likely to be maximized close to µ0.
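This sign pattern can be checked by simulation; a minimal sketch where the true mean 1.5 and the sample size are arbitrary choices.

```python
import random
import statistics

random.seed(2)
mu0 = 1.5
xs = [random.gauss(mu0, 1.0) for _ in range(10_000)]
xbar = statistics.fmean(xs)

def score_over_n(mu):
    # For N(mu, 1) data, U(mu)/n = xbar - mu.
    return xbar - mu

# Positive for mu < mu0 and negative for mu > mu0, so the log
# likelihood rises then falls, peaking at its root mu = xbar.
left, right = score_over_n(1.0), score_over_n(2.0)
```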
Now we repeat these ideas for a more general case. We study the random
variable log[f(Xi, θ)/f(Xi, θ0)]. You know the inequality

    (E(X))² ≤ E(X²)

(because the difference is Var(X) ≥ 0). This inequality has the following
generalization, called Jensen's inequality: if g is a convex function (non-
negative second derivative, roughly) then

    g(E(X)) ≤ E(g(X)).

The inequality above has g(x) = x². We use g(x) = − log(x), which is convex
because g′′(x) = x⁻² > 0. We get

    − log( Eθ0 [f(Xi, θ)/f(Xi, θ0)] ) ≤ Eθ0 [− log{f(Xi, θ)/f(Xi, θ0)}].
But

    Eθ0 [f(Xi, θ)/f(Xi, θ0)] = ∫ [f(x, θ)/f(x, θ0)] f(x, θ0) dx
                             = ∫ f(x, θ) dx
                             = 1.
We can reassemble the inequality and this calculation to get
Eθ0 [log{f (Xi , θ)/f (Xi , θ0 )}] ≤ 0
It is possible to prove that the inequality is strict unless the θ and θ0 densities
are actually the same. Let µ(θ) < 0 be this expected value. Then for each θ
we find

    n⁻¹ [ℓ(θ) − ℓ(θ0)] = n⁻¹ ∑ log[f(Xi, θ)/f(Xi, θ0)] → µ(θ).
This proves that the likelihood is probably higher at θ0 than at any other
single θ. This idea can often be stretched to prove that the MLE is
consistent.
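For the N(θ, 1) family this expected log likelihood ratio has the closed form Eθ0 log[f(X, θ)/f(X, θ0)] = −(θ − θ0)²/2, which a Monte Carlo sketch can confirm; the parameter values below are arbitrary choices.

```python
import random

random.seed(3)
theta0, theta = 0.0, 1.0

def log_ratio(x):
    # log[f(x, theta)/f(x, theta0)] for N(theta, 1) densities:
    # the normalizing constants cancel, leaving only the quadratics.
    return -(x - theta) ** 2 / 2 + (x - theta0) ** 2 / 2

n = 100_000
avg = sum(log_ratio(random.gauss(theta0, 1.0)) for _ in range(n)) / n
# Theory: the expectation is -(theta - theta0)**2 / 2 = -0.5 < 0.
```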
Definition: A sequence θ̂n of estimators of θ is consistent if θ̂n converges
to θ in probability (weak consistency) or almost surely (strong consistency).
Proto theorem: In regular problems the MLE θ̂ is consistent.
Now let us study the shape of the log likelihood near the true value θ0,
under the assumption that θ̂ is a root of the likelihood equations close to θ0.
We use Taylor expansion to write, for a 1 dimensional parameter θ,

    0 = U(θ̂) = U(θ0) + U′(θ0)(θ̂ − θ0) + U′′(θ̃)(θ̂ − θ0)²/2

for some θ̃ between θ0 and θ̂. (This form of the remainder in Taylor's theorem
is not valid for multivariate θ.) The derivatives of U are each sums of n terms
and so each should be roughly proportional to n in size. The second derivative
term is multiplied by the square of the small number θ̂ − θ0 and so should be
negligible compared to the first derivative term. If we ignore the second
derivative term we get

    −U′(θ0)(θ̂ − θ0) ≈ U(θ0).
Now let’s look at the terms U and U 0 .
In the normal case

    U(µ0) = ∑ (Xi − µ0)

has a normal distribution with mean 0 and variance n (SD √n). The derivative
is simply

    U′(µ) = −n

and the next derivative U′′ is 0. We will analyze the general case by noticing
that both U and U 0 are sums of iid random variables. Let
    Ui = ∂ log f(Xi, θ0)/∂θ

and

    Vi = −∂² log f(Xi, θ0)/∂θ².

In general, U(θ0) = ∑ Ui has mean 0 and approximately a normal distri-
bution. Here is how we check that:
    Eθ0(U(θ0)) = n Eθ0(U1)
               = n ∫ [∂ log f(x, θ0)/∂θ] f(x, θ0) dx
               = n ∫ [(∂f(x, θ0)/∂θ) / f(x, θ0)] f(x, θ0) dx
               = n ∫ ∂f(x, θ0)/∂θ dx
               = n [ (∂/∂θ) ∫ f(x, θ) dx ] at θ = θ0
               = n (∂/∂θ) 1
               = 0.

Notice that I have interchanged the order of differentiation and integra-
tion at one point. This step is usually justified by applying the dominated
convergence theorem to the definition of the derivative. The same tactic can
be applied by differentiating the identity which we just proved
    ∫ [∂ log f(x, θ)/∂θ] f(x, θ) dx = 0.

Taking the derivative of both sides with respect to θ and pulling the derivative
under the integral sign again gives

    ∫ (∂/∂θ){ [∂ log f(x, θ)/∂θ] f(x, θ) } dx = 0.

Do the derivative and get

    −∫ [∂² log f(x, θ)/∂θ²] f(x, θ) dx = ∫ [∂ log f(x, θ)/∂θ] [∂f(x, θ)/∂θ] dx
                                        = ∫ [∂ log f(x, θ)/∂θ]² f(x, θ) dx.


Definition: The Fisher Information is

    I(θ) = −Eθ(U′(θ)) = n Eθ(V1).

We refer to 𝓘(θ) = Eθ(V1) as the information in 1 observation, so that
I(θ) = n 𝓘(θ).

The idea is that I is a measure of how curved the log likelihood tends
to be at the true value of θ. Big curvature means precise estimates. Our
identity above is

    Varθ(U(θ)) = I(θ) = n 𝓘(θ).
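The identities E(U(θ0)) = 0 and Var(U(θ0)) = n Eθ0(V1) can be checked numerically. A sketch for the Exponential family with mean θ, where log f = −log θ − x/θ, so Ui = Xi/θ² − 1/θ, Vi = 2Xi/θ³ − 1/θ², and the information in one observation is 1/θ²; the parameter value and sample size are arbitrary choices.

```python
import random
import statistics

random.seed(4)
theta0, n = 2.0, 50_000
xs = [random.expovariate(1.0 / theta0) for _ in range(n)]

# Per-observation score and negative second derivative for the
# Exponential(mean theta) model, evaluated at the true theta0.
u = [x / theta0**2 - 1 / theta0 for x in xs]
v = [2 * x / theta0**3 - 1 / theta0**2 for x in xs]

mean_u = statistics.fmean(u)     # identity: E(U_1) = 0
var_u = statistics.pvariance(u)  # identity: Var(U_1) = E(V_1)
mean_v = statistics.fmean(v)     # information per observation, 1/4 here
```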
Now we return to our Taylor expansion approximation

    −U′(θ0)(θ̂ − θ0) ≈ U(θ0)

and study the two appearances of U.
We have shown that U(θ0) = ∑ Ui is a sum of iid mean 0 random variables.
The central limit theorem thus proves that

    U(θ0)/√n ⇒ N(0, σ²)

where σ² = Var(Ui) = E(Vi) = 𝓘(θ0), the information in one observation.
Next observe that

    −U′(θ) = ∑ Vi

where again Vi = −∂Ui/∂θ. The law of large numbers can be applied to show

    −U′(θ0)/n → Eθ0[V1] = 𝓘(θ0).

Now manipulate our Taylor expansion as follows:

    √n (θ̂ − θ0) ≈ ( ∑ Vi / n )⁻¹ ( U(θ0)/√n ).

Apply Slutsky's Theorem to conclude that the right hand side converges in
distribution to N(0, σ²/𝓘(θ0)²), which simplifies, because σ² = 𝓘(θ0), to
N(0, 1/𝓘(θ0)).
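A simulation sketch of this conclusion for the Exponential family with mean θ: the MLE is X̄ and the information in one observation is 1/θ0², so √n(θ̂ − θ0) should be approximately N(0, θ0²). The parameter value, sample size, and number of replications are arbitrary choices.

```python
import random
import statistics

random.seed(5)
theta0, n, reps = 2.0, 200, 2000

# Repeatedly compute the MLE (the sample mean) from Exponential
# samples and look at sqrt(n) * (theta_hat - theta0).
zs = []
for _ in range(reps):
    xbar = statistics.fmean(random.expovariate(1.0 / theta0) for _ in range(n))
    zs.append(n ** 0.5 * (xbar - theta0))

sim_mean = statistics.fmean(zs)     # should be near 0
sim_var = statistics.pvariance(zs)  # should be near theta0**2 = 4
```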
Summary
In regular families:

• Under strong regularity conditions Jensen's inequality can be used to
demonstrate that the θ̂ which maximizes ℓ globally is consistent and that
this θ̂ is a root of the likelihood equations.

• It is generally easier to study ℓ only close to θ0. For instance define A
to be the event that ℓ is concave on the set of θ such that |θ − θ0| < δ
and the likelihood equations have a unique root in that set. Under
weaker conditions than the previous case we can prove that there is a
δ > 0 such that

    P(A) → 1.

In that case we can prove that the root θ̂ of the likelihood equations
mentioned in the definition of A is consistent.

• Sometimes we can only get an even weaker conclusion. Define B to be
the event that ℓ(θ) is concave for √n |θ − θ0| < L and the likelihood
equations have a unique root over this range. Again this root is consistent
but there might be other consistent roots of the likelihood equations.

• Under any of these scenarios there is a consistent root θ̂ of the likelihood
equations which is the closest root to the true value θ0. This root has
the property

    √n (θ̂ − θ0) ⇒ N(0, 1/𝓘(θ0)).

We usually simply say that the MLE is consistent and asymptotically
normal with an asymptotic variance which is the inverse of the Fisher infor-
mation. This assertion is actually valid for vector valued θ, where now I is a
matrix with ij-th entry

    Iij = −E[ ∂²ℓ / ∂θi ∂θj ].
Estimating Equations
The same ideas arise in almost any model where estimates are derived
by solving some equation. As an example I sketch large sample theory for
Generalized Linear Models.
Suppose that for i = 1, . . . , n we have observations of the numbers of
cancer cases Yi in some group of people characterized by values xi of some
covariates. You are supposed to think of xi as containing variables like age,
or a dummy for sex or average income or . . . A parametric regression model
for the Yi might postulate that Yi has a Poisson distribution with mean µi
where the mean µi depends somehow on the covariate values. Typically we
might assume that g(µi) = β0 + xi β where g is a so-called link function,
often for this case g(µ) = log(µ), and xi β is a matrix product with xi written
as a row vector and β a column vector. This is supposed to function as a
"linear regression model with Poisson errors". I will do as a special case
log(µi) = βxi where xi is a scalar.
The log likelihood is simply

    ℓ(β) = ∑ (Yi log(µi) − µi),

ignoring irrelevant factorials. The score function is, since log(µi) = βxi,

    U(β) = ∑ (Yi xi − xi µi) = ∑ xi (Yi − µi).

(Notice again that the score has mean 0 when you plug in the true parameter
value.) The key observation, however, is that it is not necessary to believe

that Yi has a Poisson distribution to make solving the equation U = 0 sensible.
Suppose only that log(E(Yi)) = xi β. Then we have assumed that
Eβ (U (β)) = 0
This was the key condition in proving that there was a root of the likelihood
equations which was consistent and here it is what is needed, roughly, to
prove that the equation U (β) = 0 has a consistent root β̂. Ignoring higher
order terms in a Taylor expansion will give
V (β)(β̂ − β) ≈ U (β)
where V = −U 0 . In the MLE case we had identities relating the expectation
of V to the variance of U. In general here we have

    Var(U) = ∑ xi² Var(Yi).

If Yi is Poisson with mean µi (and so Var(Yi) = µi) this is

    Var(U) = ∑ xi² µi.

Moreover we have

    Vi = xi² µi

and so

    V(β) = ∑ xi² µi.

The central limit theorem (the Lyapunov kind) will show that U(β) has an
approximate normal distribution with variance σU² = ∑ xi² Var(Yi) and so

    β̂ − β ≈ N(0, σU² / (∑ xi² µi)²).

If Var(Yi) = µi, as it is for the Poisson case, the asymptotic variance simpli-
fies to 1/∑ xi² µi.
Notice that other estimating equations are possible. People suggest al-
ternatives very often. If wi is any set of deterministic weights (even possibly
depending on µi) then we could define

    U(β) = ∑ wi (Yi − µi)
and still conclude that U = 0 probably has a consistent root which has an
asymptotic normal distribution. This idea is being used all over the place
these days: see, for example, Zeger and Liang's generalized estimating
equations (GEE), which the econometricians call the Generalized Method of
Moments.
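The steps above can be sketched numerically: simulate Poisson counts with log(µi) = βxi, then solve the estimating equation U(β) = 0 by Newton-Raphson using V = −U′. The true β = 0.3, covariate range, and sample size are arbitrary choices.

```python
import math
import random

random.seed(6)
beta_true, n = 0.3, 500
xs = [random.uniform(0.0, 2.0) for _ in range(n)]

def rpois(mu):
    # Knuth's simple Poisson sampler; adequate for the small means here.
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

ys = [rpois(math.exp(beta_true * x)) for x in xs]

def U(b):  # estimating equation: sum x_i (Y_i - mu_i), mu_i = exp(b x_i)
    return sum(x * (y - math.exp(b * x)) for x, y in zip(xs, ys))

def V(b):  # -U'(b) = sum x_i**2 mu_i
    return sum(x * x * math.exp(b * x) for x in xs)

b = 0.0
for _ in range(25):  # Newton-Raphson: b <- b + U(b)/V(b)
    b += U(b) / V(b)
```

Because ℓ is concave in β here (V(β) > 0 everywhere), these iterations settle on the unique root, which should land near the true value 0.3.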

Problems with maximum likelihood

1. In problems with many parameters the approximations don't work very
well and maximum likelihood estimators can be far from the right answer.
See your homework for the Neyman-Scott example where the MLE is not
consistent.

2. When there are multiple roots of the likelihood equation you must
choose the right root. To do so you might start with a different consistent
estimator and then apply some iterative scheme like Newton-Raphson to
the likelihood equations to find the MLE. It turns out that not many steps
of Newton-Raphson are generally required if the starting point is a
reasonable estimate.
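The Cauchy location family is a standard illustration of this strategy (an example of my choosing, not from the notes): its likelihood equation can have several roots, so one starts Newton-Raphson from the sample median, which is consistent. A sketch with an arbitrary true location 1.0 and sample size:

```python
import math
import random
import statistics

random.seed(7)
theta0, n = 1.0, 4001
# Standard Cauchy around theta0 via the inverse-CDF transform.
xs = [theta0 + math.tan(math.pi * (random.random() - 0.5)) for _ in range(n)]

def U(t):
    # Score of the Cauchy location model.
    return sum(2 * (x - t) / (1 + (x - t) ** 2) for x in xs)

def V(t):
    # -U'(t); its expectation per observation is the information, 1/2.
    return sum(2 * (1 - (x - t) ** 2) / (1 + (x - t) ** 2) ** 2 for x in xs)

t = statistics.median(xs)  # consistent starting value
for _ in range(5):         # a few Newton-Raphson steps suffice
    t += U(t) / V(t)
```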

Finding (good) preliminary Point Estimates

Method of Moments
Basic strategy: set sample moments equal to population moments and
solve for the parameters.
Definition: The rth sample moment (about the origin) is

    n⁻¹ ∑ Xi^r  (sum over i = 1, . . . , n).

The rth population moment is

    E(X^r).

Definition: Central moments are

    n⁻¹ ∑ (Xi − X̄)^r

and

    E[(X − µ)^r].
If we have p parameters we can estimate the parameters θ1, . . . , θp by
solving the system of p equations:

    µ′1 = X̄
    µ′2 = n⁻¹ ∑ Xi²

and so on to

    µ′p = n⁻¹ ∑ Xi^p.

You need to remember that the population moments µ′k will be formulas
involving the parameters.
Gamma Example
The Gamma(α, β) density is

    f(x; α, β) = [1/(β Γ(α))] (x/β)^(α−1) exp(−x/β) 1(x > 0)

and has mean

    µ′1 = αβ

and variance

    µ2 = αβ².

(The second moment about the origin is α(α + 1)β², so it is easier to match
the mean and the central second moment.) This gives the equations

    αβ = X̄
    αβ² = σ̂² ≡ n⁻¹ ∑ (Xi − X̄)².

Divide the second by the first to find the method of moments estimate of β:

    β̃ = σ̂² / X̄.

Then from the first equation get

    α̃ = X̄/β̃ = (X̄)² / σ̂².
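A numerical sketch of method of moments for the Gamma family, matching the sample mean and variance to E(X) = αβ and Var(X) = αβ²; the true parameter values and sample size are arbitrary choices.

```python
import random
import statistics

random.seed(8)
alpha0, beta0, n = 3.0, 2.0, 100_000
# random.gammavariate uses a shape/scale parametrization: mean alpha*beta.
xs = [random.gammavariate(alpha0, beta0) for _ in range(n)]

xbar = statistics.fmean(xs)
s2 = statistics.pvariance(xs)

# Match alpha*beta = xbar and alpha*beta**2 = s2, then solve.
beta_mm = s2 / xbar
alpha_mm = xbar / beta_mm
```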

The equations are much easier to solve than the likelihood equations,
which involve the function

    ψ(α) = (d/dα) log(Γ(α)),

called the digamma function.
