Notes

The strong law of large numbers (SLLN): if Y₁, Y₂, . . . are iid with mean µ, then

P( lim_{n→∞} Σ Yᵢ/n = µ ) = 1.
The strong law is harder to prove than the weak law of large numbers:
Theorem 3 (WLLN) If Y₁, Y₂, . . . are iid with mean µ then for each positive ε

lim_{n→∞} P( | Σ Yᵢ/n − µ | > ε ) = 0.
For iid Yi the stronger conclusion (the SLLN) holds but for our heuristics
we will ignore the differences between these notions.
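As a quick numerical illustration of the WLLN (a minimal sketch, not part of the argument; it assumes Python with numpy, and the exponential distribution with mean µ = 2 and tolerance ε = 0.1 are arbitrary choices), we can estimate P(|Σ Yᵢ/n − µ| > ε) by simulation and watch it shrink as n grows:

    import numpy as np

    # Monte Carlo estimate of P(|Ybar_n - mu| > eps) for iid Exponential data.
    rng = np.random.default_rng(0)
    mu, eps, reps = 2.0, 0.1, 1000

    for n in [10, 100, 1000, 10000]:
        samples = rng.exponential(scale=mu, size=(reps, n))
        ybar = samples.mean(axis=1)
        print(n, np.mean(np.abs(ybar - mu) > eps))  # estimated probability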
Now suppose that θ₀ is the true value of θ. Then

U(θ)/n → µ(θ)

where

µ(θ) = Eθ₀[ ∂ log f/∂θ (Xᵢ, θ) ] = ∫ [∂ log f/∂θ](x, θ) f(x, θ₀) dx.
Remark: This convergence is pointwise and might not be uniform. That is,
there might be some θ values where the convergence takes much longer than
others. It could be that for every n there is a θ where U (θ)/n and µ(θ) are
not close together.
Consider as an example the case of N(µ, 1) data, where

U(µ)/n = Σ (Xᵢ − µ)/n = X̄ − µ

so that, with µ₀ the true mean,

U(µ)/n → µ₀ − µ.
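A small numerical check of this pointwise convergence (a sketch assuming numpy; the true mean µ₀ = 1 and the sample sizes are arbitrary example choices):

    import numpy as np

    rng = np.random.default_rng(1)
    mu0 = 1.0                                   # true mean (example value)

    for n in [100, 10_000, 1_000_000]:
        x = rng.normal(loc=mu0, scale=1.0, size=n)
        for mu in [0.0, 2.0]:
            # U(mu)/n = Xbar - mu should approach mu0 - mu as n grows
            print(n, mu, x.mean() - mu, mu0 - mu)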
We have

E(X)² ≤ E(X²)

(because the difference is Var(X) ≥ 0). This inequality has the following generalization, called Jensen's inequality: if g is a convex function (non-negative second derivative, roughly) then

g(E(X)) ≤ E(g(X)).

The inequality above has g(x) = x². We use g(x) = −log(x), which is convex because g″(x) = x⁻² > 0. We get

−log( Eθ₀[ f(Xᵢ, θ)/f(Xᵢ, θ₀) ] ) ≤ Eθ₀[ −log{ f(Xᵢ, θ)/f(Xᵢ, θ₀) } ].
But

Eθ₀[ f(Xᵢ, θ)/f(Xᵢ, θ₀) ] = ∫ [ f(x, θ)/f(x, θ₀) ] f(x, θ₀) dx = ∫ f(x, θ) dx = 1.
We can reassemble the inequality and this calculation to get
Eθ₀[ log{ f(Xᵢ, θ)/f(Xᵢ, θ₀) } ] ≤ 0.
It is possible to prove that the inequality is strict unless the θ and θ₀ densities are actually the same. Let µ(θ) < 0 denote this expected value (note that this is a different function from the expected score µ(θ) used above). Then for each θ we find
n⁻¹[ ℓ(θ) − ℓ(θ₀) ] = n⁻¹ Σ log[ f(Xᵢ, θ)/f(Xᵢ, θ₀) ] → µ(θ).
This proves that the likelihood is probably higher at θ₀ than at any other single θ. This idea can often be stretched to prove that the MLE is consistent.
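Here is a sketch of this comparison (assuming numpy; an Exponential model with rate θ and true value θ₀ = 1 is used purely as an example): the average log likelihood ratio n⁻¹[ℓ(θ) − ℓ(θ₀)] should come out negative for θ ≠ θ₀ once n is large.

    import numpy as np

    rng = np.random.default_rng(2)
    theta0 = 1.0                            # true rate (example value)
    x = rng.exponential(scale=1/theta0, size=50_000)

    def mean_loglik(theta):
        # n^{-1} times sum of log f(X_i, theta) for the Exponential(theta) model
        return np.mean(np.log(theta) - theta * x)

    for theta in [0.5, 0.8, 1.0, 1.5, 3.0]:
        print(theta, mean_loglik(theta) - mean_loglik(theta0))  # <= 0, with equality at theta0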
Definition: A sequence θ̂ₙ of estimators of θ is consistent if θ̂ₙ converges weakly (in probability) or strongly (almost surely) to θ.
Proto theorem: In regular problems the MLE θ̂ is consistent.
Now let us study the shape of the log likelihood near the true value of θ under the assumption that θ̂ is a root of the likelihood equations close to θ₀.
We use Taylor expansion to write, for a 1 dimensional parameter θ,
U(θ̂) = 0 = U(θ₀) + U′(θ₀)(θ̂ − θ₀) + U″(θ̃)(θ̂ − θ₀)²/2

for some θ̃ between θ₀ and θ̂. (This form of the remainder in Taylor's theorem is not valid for multivariate θ.) The derivatives of U are each sums of n terms and so should both be proportional to n in size. The second derivative is multiplied by the square of the small number θ̂ − θ₀, so it should be negligible compared to the first derivative term. If we ignore the second derivative term we get

−U′(θ₀)(θ̂ − θ₀) ≈ U(θ₀).
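To see this linearization in action, here is a sketch (numpy assumed; an Exponential(θ) model with θ₀ = 2 is an arbitrary example, for which U(θ) = n/θ − Σ Xᵢ and U′(θ) = −n/θ²): the one-step value θ₀ + U(θ₀)/(−U′(θ₀)) should be close to the exact root θ̂ = 1/X̄.

    import numpy as np

    rng = np.random.default_rng(3)
    theta0, n = 2.0, 1000                    # example true value and sample size
    x = rng.exponential(scale=1/theta0, size=n)

    U  = n / theta0 - x.sum()                # score at theta0
    dU = -n / theta0**2                      # U'(theta0)
    one_step = theta0 + U / (-dU)            # solve -U'(theta0)(theta - theta0) = U(theta0)
    mle = 1 / x.mean()                       # exact root of the likelihood equation
    print(one_step, mle)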
Now let’s look at the terms U and U′.
In the normal case

U(θ₀) = Σ (Xᵢ − µ₀)

has a normal distribution with mean 0 and variance n (SD √n). The derivative is simply

U′(µ) = −n

and the next derivative U″ is 0. We will analyze the general case by noticing that both U and U′ are sums of iid random variables. Let

Uᵢ = ∂ log f/∂θ (Xᵢ, θ₀)

and

Vᵢ = −∂² log f/∂θ² (Xᵢ, θ).
In general, U(θ₀) = Σ Uᵢ has mean 0 and approximately a normal distribution. Here is how we check that:
Eθ₀(U(θ₀)) = n Eθ₀(U₁)
           = n ∫ [∂ log f/∂θ](x, θ₀) f(x, θ₀) dx
           = n ∫ { [∂f/∂θ](x, θ₀) / f(x, θ₀) } f(x, θ₀) dx
           = n ∫ [∂f/∂θ](x, θ₀) dx
           = n ∂/∂θ [ ∫ f(x, θ) dx ] at θ = θ₀
           = n ∂(1)/∂θ
           = 0.
Notice that I have interchanged the order of differentiation and integration at one point. This step is usually justified by applying the dominated convergence theorem to the definition of the derivative. The same tactic can be applied by differentiating the identity which we just proved
∫ [∂ log f/∂θ](x, θ) f(x, θ) dx = 0.
Taking the derivative of both sides with respect to θ and pulling the derivative
under the integral sign again gives
∫ ∂/∂θ [ (∂ log f/∂θ)(x, θ) f(x, θ) ] dx = 0.

Since ∂f/∂θ = (∂ log f/∂θ) f, expanding the derivative of the product gives

∫ [ ∂² log f/∂θ² (x, θ) + ( ∂ log f/∂θ (x, θ) )² ] f(x, θ) dx = 0,

that is, Eθ(Vᵢ) = Eθ(Uᵢ²) = Varθ(Uᵢ); this common value is the Fisher information I(θ). The central limit theorem now shows that

U(θ₀)/√n ⇒ N(0, σ²)

where σ² = Var(Uᵢ) = E(Vᵢ) = I(θ).
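A quick numerical check of these identities (a sketch assuming numpy; the Poisson(θ) model with θ₀ = 4 is an arbitrary example, for which U₁ = X₁/θ − 1, V₁ = X₁/θ² and I(θ) = 1/θ):

    import numpy as np

    rng = np.random.default_rng(4)
    theta0 = 4.0                                  # true value (example choice)
    x = rng.poisson(lam=theta0, size=200_000)

    U1 = x / theta0 - 1                           # score contributions at theta0
    V1 = x / theta0**2                            # minus second derivative contributions
    # mean of U1 should be near 0; var(U1) and mean(V1) should both be near I(theta0) = 1/theta0
    print(U1.mean(), U1.var(), V1.mean(), 1 / theta0)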
Next observe that

−U′(θ) = Σ Vᵢ

where again

Vᵢ = −∂² log f/∂θ² (Xᵢ, θ).

The law of large numbers can be applied to show

n⁻¹ Σ Vᵢ → Eθ₀[V₁] = I(θ₀).

Combining this with the linearization −U′(θ₀)(θ̂ − θ₀) ≈ U(θ₀) gives

√n (θ̂ − θ₀) ≈ [ U(θ₀)/√n ] / [ n⁻¹ Σ Vᵢ ].

Apply Slutsky’s Theorem to conclude that the right hand side of this converges in distribution to N(0, σ²/I(θ)²), which simplifies, because of the identities, to N(0, 1/I(θ)).
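The following sketch checks this limit by simulation (numpy assumed; the Exponential(θ) model is again an arbitrary example, with θ̂ = 1/X̄ and I(θ) = 1/θ², so 1/I(θ₀) = θ₀²):

    import numpy as np

    rng = np.random.default_rng(5)
    theta0, n, reps = 2.0, 500, 5000          # example settings
    x = rng.exponential(scale=1/theta0, size=(reps, n))
    mle = 1 / x.mean(axis=1)                  # MLE for each simulated sample
    z = np.sqrt(n) * (mle - theta0)

    print(z.var(), theta0**2)                 # variance should be near 1/I(theta0) = theta0**2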
Summary
In regular families:
• Under any of these scenarios there is a consistent root θ̂ of the likelihood equations; it is the root closest to the true value θ₀. This root has the property

√n (θ̂ − θ₀) ⇒ N(0, 1/I(θ)).
Example: suppose Yᵢ has a Poisson distribution with mean µᵢ, where log(µᵢ) = βxᵢ for known covariate values xᵢ. The log likelihood is

ℓ(β) = Σ [ Yᵢ log(µᵢ) − µᵢ ],

ignoring irrelevant factorials. The score function is, since log(µᵢ) = βxᵢ,

U(β) = Σ (Yᵢxᵢ − xᵢµᵢ) = Σ xᵢ(Yᵢ − µᵢ).

(Notice again that the score has mean 0 when you plug in the true parameter value.) The key observation, however, is that it is not necessary to believe that Yᵢ has a Poisson distribution to make solving the equation U = 0 sensible. Suppose only that log(E(Yᵢ)) = xᵢβ. Then we have assumed that
Eβ(U(β)) = 0.
This was the key condition in proving that there was a consistent root of the likelihood equations, and here it is, roughly, what is needed to prove that the equation U(β) = 0 has a consistent root β̂. Ignoring higher order terms in a Taylor expansion will give

V(β)(β̂ − β) ≈ U(β)

where V = −U′. In the MLE case we had identities relating the expectation of V to the variance of U. In general here we have

Var(U) = Σ xᵢ² Var(Yᵢ).

If Var(Yᵢ) = µᵢ, as it is for the Poisson case, the asymptotic variance of β̂ simplifies to 1/Σ xᵢ²µᵢ.
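Here is a sketch of solving U(β) = Σ xᵢ(Yᵢ − µᵢ) = 0 by Newton-Raphson (numpy assumed; the covariates and β = 0.5 are hypothetical example choices), using V(β) = −U′(β) = Σ xᵢ²µᵢ as the derivative:

    import numpy as np

    rng = np.random.default_rng(6)
    n, beta_true = 500, 0.5                   # example settings
    x = rng.uniform(-1, 1, size=n)
    y = rng.poisson(lam=np.exp(beta_true * x))

    beta = 0.0                                # starting value
    for _ in range(20):                       # Newton-Raphson on U(beta) = 0
        mu = np.exp(beta * x)
        U = np.sum(x * (y - mu))
        V = np.sum(x**2 * mu)                 # -U'(beta)
        beta = beta + U / V

    se = 1 / np.sqrt(np.sum(x**2 * np.exp(beta * x)))   # approximate standard error
    print(beta, se)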
Notice that other estimating equations are possible. People suggest alternatives very often. If wᵢ is any set of deterministic weights (even possibly depending on µᵢ) then we could define

U(β) = Σ wᵢ(Yᵢ − µᵢ)
and still conclude that U = 0 probably has a consistent root which is asymptotically normal. This idea is being used all over the place these days: see, for example, Zeger and Liang’s generalized estimating equations (GEE), which the econometricians call the Generalized Method of Moments.
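For instance, here is a sketch (numpy and scipy assumed; the constant weights wᵢ = 1 and the positive covariates are hypothetical choices made only to keep the root unique) that solves Σ wᵢ(Yᵢ − µᵢ) = 0 with a bracketing root finder:

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(7)
    n, beta_true = 500, 0.5                      # example settings
    x = rng.uniform(0.1, 2.0, size=n)            # positive covariates keep the root unique
    y = rng.poisson(lam=np.exp(beta_true * x))

    w = np.ones(n)                               # an alternative (non-MLE) choice of weights
    def estimating_eq(beta):
        return np.sum(w * (y - np.exp(beta * x)))

    beta_hat = brentq(estimating_eq, -5.0, 5.0)  # root of the weighted estimating equation
    print(beta_hat)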
Problems with maximum likelihood
2. When there are multiple roots of the likelihood equation you must choose the right root. To do so you might start with a different consistent estimator and then apply some iterative scheme like Newton-Raphson to the likelihood equations to find the MLE; a sketch of this strategy appears below. It turns out that not many steps of Newton-Raphson are generally required if the starting point is a reasonable estimate.
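A sketch of this strategy (numpy assumed; the Cauchy location model, a standard example with multiple likelihood roots, and the sample median as the consistent starting estimate are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(8)
    theta_true, n = 3.0, 200                      # example settings
    x = theta_true + rng.standard_cauchy(size=n)

    def score(t):        # U(t) for the Cauchy location model
        u = x - t
        return np.sum(2 * u / (1 + u**2))

    def dscore(t):       # U'(t)
        u = x - t
        return np.sum(2 * (u**2 - 1) / (1 + u**2)**2)

    theta = np.median(x)                          # consistent starting estimator
    for _ in range(10):                           # a few Newton-Raphson steps
        theta = theta - score(theta) / dscore(theta)

    print(np.median(x), theta)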
Method of Moments
Basic strategy: set sample moments equal to population moments and
solve for the parameters.
Definition: The rth sample moment (about the origin) is

n⁻¹ Σᵢ Xᵢʳ ,

the rth population moment (about the origin) is

E(Xʳ)

and the rth population central moment is

E[ (X − µ)ʳ ].
If we have p parameters we can estimate the parameters θ₁, . . . , θp by solving the system of p equations:

µ′₁ = X̄,

µ′₂ = n⁻¹ Σ Xᵢ²,

and so on to

µ′p = n⁻¹ Σ Xᵢᵖ.

You need to remember that the population moments µ′ₖ will be formulas involving the parameters.
Gamma Example
The Gamma(α, β) density is

f(x; α, β) = [1/(βΓ(α))] (x/β)^(α−1) exp(−x/β) 1(x > 0)
and has mean

µ₁ = E(X) = αβ

and central second moment (variance)

µ₂ = Var(X) = αβ².
This gives the equations

αβ = X̄
αβ² = n⁻¹ Σ (Xᵢ − X̄)².

Divide the second by the first to find the method of moments estimate of β:

β̃ = n⁻¹ Σ (Xᵢ − X̄)² / X̄.

Then from the first equation get

α̃ = X̄/β̃ = (X̄)² / [ n⁻¹ Σ (Xᵢ − X̄)² ].
The equations are much easier to solve than the likelihood equations, which involve the function

ψ(α) = (d/dα) log Γ(α),

called the digamma function.
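A sketch of the method of moments calculation for simulated Gamma data (numpy and scipy assumed; α = 2 and β = 3 are arbitrary example values); for comparison, the last line evaluates at the moment estimate the residual of the profiled likelihood equation ψ(α) − log(α) = mean(log Xᵢ) − log(X̄), which is where the digamma function enters:

    import numpy as np
    from scipy.special import digamma

    rng = np.random.default_rng(9)
    alpha, beta = 2.0, 3.0                         # example true values
    x = rng.gamma(shape=alpha, scale=beta, size=100_000)

    xbar = x.mean()
    s2 = x.var()                                   # n^{-1} sum (X_i - Xbar)^2
    beta_mom = s2 / xbar                           # method of moments estimates
    alpha_mom = xbar / beta_mom                    # = Xbar^2 / s2

    resid = digamma(alpha_mom) - np.log(alpha_mom) - (np.mean(np.log(x)) - np.log(xbar))
    print(alpha_mom, beta_mom, resid)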