
STA732

Statistical Inference
Lecture 23: Large-Sample Theory for Likelihood Ratio Tests

Yuansi Chen
Spring 2022
Duke University

https://www2.stat.duke.edu/courses/Spring22/sta732.01/

Recap from Lecture 22

1. Canonical linear model

𝑍 = (𝑍0, 𝑍1, 𝑍𝑟)⊤ ∼ 𝒩((𝜇0, 𝜇1, 0)⊤, 𝜎²𝕀𝑛)

• 𝜎² known, 𝑑1 = 1, 𝑍-test: 𝑍1/𝜎
• 𝜎² unknown, 𝑑1 = 1, 𝑡-test: 𝑍1/𝜎̂
• 𝜎² known, 𝑑1 ≥ 1, 𝜒²-test: ∥𝑍1∥²/𝜎²
• 𝜎² unknown, 𝑑1 ≥ 1, 𝐹-test: (∥𝑍1∥²/𝑑1)/𝜎̂²
2. General linear model: find an orthonormal matrix 𝑄 such that
𝑄⊤ 𝑌 follows the canonical linear model

Goal of Lecture 23

1. Wald test
2. Score test
3. Generalized likelihood ratio test

Chap. 17.1–17.3 of Keener or Sec. 12.4 of Lehmann and Romano

Review the asymptotics of MLE
Setup

𝑋1, …, 𝑋𝑛 ∼ 𝑝𝜃(𝑥) i.i.d., where 𝑝𝜃(⋅) is "regular" enough (check the conditions in Thm 9.14 of Keener)

Consistency of MLE on compact Ω

Define

𝑊𝑖(𝜃) = ℓ1(𝜃; 𝑋𝑖) − ℓ1(𝜃0; 𝑋𝑖),  𝑊̄𝑛(𝜃) = (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑊𝑖(𝜃)

We know that

𝔼𝑊𝑖(𝜃) = −𝒟KL(𝜃0 ∥ 𝜃) ≤ 0,

with equality iff 𝑃𝜃 = 𝑃𝜃0.
Consistency result
If the model is identifiable and 𝑊𝑖 is a continuous random function, then

• ∥𝑊̄𝑛 − 𝔼𝑊̄𝑛∥∞ → 0 in probability on compact Ω
• consequently 𝜃̂𝑛 → 𝜃0 in probability (convergence of the argmax requires the uniform convergence result in Thm 9.4 of Keener)
Asymptotic distribution of MLE

MLE satisfies

0 = ∇ℓ𝑛(𝜃̂𝑛) = ∇ℓ𝑛(𝜃0) + ∇²ℓ𝑛(𝜃̃𝑛)(𝜃̂𝑛 − 𝜃0).

Then

√𝑛(𝜃̂𝑛 − 𝜃0) = (−(1/𝑛)∇²ℓ𝑛(𝜃̃𝑛))⁻¹ ((1/√𝑛)∇ℓ𝑛(𝜃0))

• (−(1/𝑛)∇²ℓ𝑛(𝜃̃𝑛))⁻¹ → 𝐼1(𝜃0)⁻¹ in probability (convergence of a random function evaluated at a random point requires the uniform convergence result in Thm 9.4 of Keener!)
• (1/√𝑛)∇ℓ𝑛(𝜃0) ⇒ 𝒩(0, 𝐼1(𝜃0)) (CLT)

By Slutsky's thm, √𝑛(𝜃̂𝑛 − 𝜃0) ⇒ 𝒩(0, 𝐼1(𝜃0)⁻¹)

√𝑛(𝜃̂𝑛 − 𝜃0) ⇒ 𝒩(0, 𝐼1(𝜃0)⁻¹)

We can use the asymptotic distribution to compute confidence regions!
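To make the theorem concrete, here is a minimal simulation sketch (ours, not from the lecture; all names are illustrative), assuming an Exponential model with rate 𝜃0, where the MLE is 1/𝑋̄ and 𝐼1(𝜃0) = 1/𝜃0²:

```python
# Simulation sketch (ours): sqrt(n)(theta_hat - theta0) for the Exponential
# rate MLE should be approximately N(0, I_1(theta0)^{-1}) = N(0, theta0^2).
import numpy as np

rng = np.random.default_rng(0)
theta0, n, reps = 2.0, 500, 2000

samples = rng.exponential(scale=1/theta0, size=(reps, n))  # rate = 1/scale
mle = 1 / samples.mean(axis=1)           # MLE of the rate is 1/xbar
z = np.sqrt(n) * (mle - theta0)          # sqrt(n)(theta_hat - theta0)

print(z.var(), theta0**2)   # empirical vs asymptotic variance, roughly equal
```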
Wald test
Intuition for Wald-type confidence regions (1)

Assume we have an estimator 𝐼̂𝑛 ⪰ 0 such that

(1/𝑛)𝐼̂𝑛 → 𝐼1(𝜃0) in probability

Then we can use it as a plug-in estimate for 𝐼1(𝜃0) in the asymptotic distribution:

Since √𝑛(𝜃̂𝑛 − 𝜃0) ⇒ 𝒩(0, 𝐼1(𝜃0)⁻¹),
then (𝐼1(𝜃0))^{1/2} √𝑛(𝜃̂𝑛 − 𝜃0) ⇒ 𝒩(0, 𝕀𝑑),
by Slutsky's thm,

𝐼̂𝑛^{1/2}(𝜃̂𝑛 − 𝜃0) ⇒ 𝒩(0, 𝕀𝑑)
Intuition for Wald-type confidence regions (2)

Under the null hypothesis 𝐻0 ∶ 𝜃 = 𝜃0, we have

∥𝐼̂𝑛^{1/2}(𝜃̂𝑛 − 𝜃0)∥² ⇒ 𝜒²𝑑

We can construct a test that rejects for large values of ∥𝐼̂𝑛^{1/2}(𝜃̂𝑛 − 𝜃0)∥²:

𝜙 = 1{∥𝐼̂𝑛^{1/2}(𝜃̂𝑛 − 𝜃0)∥² > 𝜒²𝑑(𝛼)}

Remark
• The test might not have the correct level; it only has asymptotic level 𝛼
• The confidence region is an ellipsoid,

𝜃̂𝑛 + 𝐼̂𝑛^{−1/2} 𝔹(0, √(𝜒²𝑑(𝛼)))
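A minimal sketch of the resulting test (ours; it assumes 𝜃̂𝑛 and 𝐼̂𝑛 have already been computed, and all names are illustrative):

```python
# Wald test sketch (ours): reject H0: theta = theta0 when the quadratic form
# (theta_hat - theta0)' I_hat (theta_hat - theta0) exceeds the chi2_d quantile.
import numpy as np
from scipy.stats import chi2

def wald_test(theta_hat, theta0, I_hat, alpha=0.05):
    diff = np.asarray(theta_hat) - np.asarray(theta0)
    stat = diff @ I_hat @ diff      # = ||I_hat^{1/2}(theta_hat - theta0)||^2
    return stat, stat > chi2.ppf(1 - alpha, df=diff.size)
```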
Two options for 𝐼𝑛̂

1. 𝐼𝑛(𝜃̂𝑛), obtained by plugging the MLE into the Fisher information:

𝐼̂𝑛 = 𝐼𝑛(𝜃̂𝑛) = Var𝜃(∇ℓ𝑛(𝜃; 𝑋)) ∣𝜃=𝜃̂𝑛

2. Observed Fisher information:

𝐼̂𝑛 = −∇²ℓ𝑛(𝜃̂𝑛; 𝑋)

Remark:
Both should satisfy (1/𝑛)𝐼̂𝑛 → 𝐼1(𝜃0) in probability in the "regular" i.i.d. model setting
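As a worked example (ours, not in the slides): for the Bernoulli(𝜃) model the two options coincide at the MLE 𝜃̂𝑛 = 𝑋̄, both equal to 𝑛/(𝜃̂𝑛(1 − 𝜃̂𝑛)); in general the two estimates differ.

```python
# Worked check (ours; simulated data): for Bernoulli(theta), the expected
# information n/(theta(1-theta)) and the observed information -l''_n(theta)
# agree when both are evaluated at the MLE theta_hat = xbar.
import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=1000)
n, S = x.size, x.sum()
theta_hat = S / n

expected_info = n / (theta_hat * (1 - theta_hat))                 # I_n(theta_hat)
observed_info = S / theta_hat**2 + (n - S) / (1 - theta_hat)**2   # -Hessian at MLE
print(expected_info, observed_info)   # identical for this model
```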
Wald interval for 𝜃𝑗


Since √𝑛(𝜃̂𝑛 − 𝜃0) ⇒ 𝒩(0, 𝐼1(𝜃0)⁻¹), multiplying by the 𝑗-th standard basis vector 𝑒𝑗⊤ gives

√𝑛(𝜃̂𝑛,𝑗 − 𝜃0,𝑗) ⇒ 𝒩(0, (𝐼1(𝜃0)⁻¹)𝑗𝑗)

Using (1/𝑛)𝐼̂𝑛 as a plug-in estimate for 𝐼1(𝜃0), we obtain the univariate interval

𝐶𝑗 = 𝜃̂𝑛,𝑗 ± √((𝐼̂𝑛⁻¹)𝑗𝑗) ⋅ 𝑧𝛼/2
The glm function in R uses the above intervals, with 𝐼̂𝑛 = −∇²ℓ𝑛(𝜃̂𝑛).
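As an illustration, a minimal sketch in Python's statsmodels (assuming that package; the analogous R calls would be glm followed by confint.default, which is also Wald-based):

```python
# A sketch (ours; simulated data) of the Wald intervals a GLM fit reports.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(500, 2)))     # design with intercept
beta = np.array([0.5, 1.0, -1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta)))   # logistic model

fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
# Each row is beta_hat_j -/+ z_{alpha/2} * se_j, with se_j taken from the
# inverse information at the MLE.
print(fit.conf_int(alpha=0.05))
```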
Confidence ellipsoid for 𝜃0,𝑆

We want a confidence ellipsoid for 𝜃0,𝑆 = (𝜃0,𝑗)𝑗∈𝑆, |𝑆| = 𝑘.

We have

√𝑛(𝜃̂𝑛,𝑆 − 𝜃0,𝑆) ⇒ 𝒩(0, (𝐼1(𝜃0)⁻¹)𝑆𝑆)

Then the confidence ellipsoid is

𝜃̂𝑛,𝑆 + ((𝐼̂𝑛⁻¹)𝑆𝑆)^{1/2} 𝔹(0, 𝜒𝑘(𝛼))
Example: generalized linear model with fixed design

Suppose 𝑥1, …, 𝑥𝑛 ∈ ℝ𝑑 are fixed and, independently,

𝑌𝑖 ∼ 𝑝𝜂𝑖(𝑦𝑖) = 𝑒^{𝜂𝑖𝑦𝑖 − 𝐴(𝜂𝑖)} ℎ(𝑦𝑖)

where 𝜂𝑖 = 𝛽⊤𝑥𝑖.

Link function
Let 𝜇𝑖(𝛽) = 𝔼𝛽𝑌𝑖. If 𝑓(𝜇𝑖) = 𝛽⊤𝑥𝑖, then 𝑓 is called the link function.

Common examples

• Logistic regression: 𝑌𝑖 ∼ Bernoulli(𝑒^{𝑥𝑖⊤𝛽}/(1 + 𝑒^{𝑥𝑖⊤𝛽})), independently
• Poisson log-linear model: 𝑌𝑖 ∼ Poisson(𝑒^{𝑥𝑖⊤𝛽}), independently
Confidence interval in generalized linear model

ℓ𝑛(𝛽; 𝑌) = ∑_{𝑖=1}^{𝑛} [(𝑥𝑖⊤𝛽)𝑦𝑖 − 𝐴(𝑥𝑖⊤𝛽) + log ℎ(𝑦𝑖)]

∇ℓ𝑛(𝛽; 𝑌) = ∑_{𝑖=1}^{𝑛} [𝑦𝑖 − 𝐴′(𝑥𝑖⊤𝛽)]𝑥𝑖 = ∑_{𝑖=1}^{𝑛} (𝑦𝑖 − 𝜇𝑖(𝛽))𝑥𝑖

−∇²ℓ𝑛(𝛽; 𝑌) = ∑_{𝑖=1}^{𝑛} 𝐴″(𝑥𝑖⊤𝛽)𝑥𝑖𝑥𝑖⊤ = ∑_{𝑖=1}^{𝑛} Var𝛽(𝑦𝑖)𝑥𝑖𝑥𝑖⊤ = Var𝛽(∇ℓ𝑛(𝛽; 𝑌))

In this GLM (canonical link), −∇²ℓ𝑛(𝛽; 𝑌) does not depend on 𝑌, so the observed and expected Fisher information coincide.
We can estimate 𝐼̂𝑛 by plugging the MLE into the Fisher information. Applying our asymptotics directly (or redoing the Taylor expansion from scratch):

𝐼̂𝑛^{1/2}(𝜃̂𝑛 − 𝜃0) ⇒ 𝒩(0, 𝕀𝑑)
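A sketch (ours; names illustrative) instantiating these formulas for logistic regression, where 𝐴′ is the logistic function and 𝐴″(𝑥𝑖⊤𝛽) = 𝜇𝑖(1 − 𝜇𝑖):

```python
# Sketch (ours) of the GLM formulas above for logistic regression: score
# sum_i (y_i - mu_i) x_i, information sum_i A''(x_i' beta) x_i x_i', and a
# Wald test of H0: beta = beta0.
import numpy as np
from scipy.stats import chi2

def wald_stat_logistic(X, beta_hat, beta0):
    mu = 1 / (1 + np.exp(-X @ beta_hat))    # mu_i(beta) = A'(x_i' beta)
    w = mu * (1 - mu)                       # Var(y_i) = A''(x_i' beta)
    I_hat = X.T @ (w[:, None] * X)          # -Hessian; non-random given X
    diff = beta_hat - beta0
    stat = diff @ I_hat @ diff              # ||I_hat^{1/2}(beta_hat - beta0)||^2
    return stat, chi2.sf(stat, df=X.shape[1])   # asymptotic p-value under H0
```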
Pros and cons of Wald test

Advantages
• Easy to invert, simple confidence regions
• Asymptotically correct level

Disadvantages
• Have to compute MLE
• Depends on parameterization
• Relies on second order Taylor expansion of ℓ𝑛
• Need MLE to be consistent
• Confidence region might go outside of Ω

Score test
Intuition for score test

Testing 𝐻0 ∶ 𝜃 = 𝜃0 vs. 𝐻1 ∶ 𝜃 ≠ 𝜃0
We can bypass the quadratic approximation by using the score as the test statistic:

(1/√𝑛) ∇ℓ𝑛(𝜃0) ⇒ 𝒩(0, 𝐼1(𝜃0))
Score test

Reject 𝐻0 ∶ 𝜃 = 𝜃0 if

∥𝐼𝑛(𝜃0)^{−1/2} ∇ℓ𝑛(𝜃0)∥² ≥ 𝜒²𝑑(𝛼)

If 𝑑 = 1, we just use the 𝑍-test instead

Advantages of the score test

• No quadratic approximation
• No MLE needed

The disadvantage is that it might not be easy to invert the test
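As a worked example (ours): for the Bernoulli model and 𝐻0 ∶ 𝜃 = 𝜃0, the score statistic is 𝑛(𝑋̄ − 𝜃0)²/(𝜃0(1 − 𝜃0)), the squared one-sample proportion 𝑍-statistic; note that no MLE is needed.

```python
# Score test sketch (ours) for H0: theta = theta0 in a Bernoulli model.
import numpy as np
from scipy.stats import chi2

def bernoulli_score_test(x, theta0, alpha=0.05):
    n, xbar = x.size, x.mean()
    score = n * (xbar - theta0) / (theta0 * (1 - theta0))   # grad l_n at theta0
    info = n / (theta0 * (1 - theta0))                      # I_n(theta0)
    stat = score**2 / info    # = n (xbar - theta0)^2 / (theta0 (1 - theta0))
    return stat, stat >= chi2.ppf(1 - alpha, df=1)          # True = reject
```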
Score test is invariant to reparameterization

Assume 𝑑 = 1, 𝜃 = 𝑔(𝜉) with 𝑔′ (𝜉) > 0,

𝑞𝜉 (𝑥) = 𝑝𝑔(𝜉) (𝑥),

show that the two test statistics are the same a.s.

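A sketch of the argument (ours): by the chain rule, ∇𝜉 log 𝑞𝜉(𝑥) = 𝑔′(𝜉) ∇𝜃 log 𝑝𝜃(𝑥)∣𝜃=𝑔(𝜉), so the Fisher information transforms as 𝐼𝑛^𝑞(𝜉) = 𝑔′(𝜉)² 𝐼𝑛(𝑔(𝜉)). In the score statistic the factors of 𝑔′(𝜉)² cancel:

(𝑔′(𝜉) ∇ℓ𝑛(𝑔(𝜉)))² / (𝑔′(𝜉)² 𝐼𝑛(𝑔(𝜉))) = (∇ℓ𝑛(𝜃))² / 𝐼𝑛(𝜃),

which is exactly the statistic in the original parameterization, almost surely.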
Example 1: 𝑠-parameter exponential family

Suppose 𝑋1, …, 𝑋𝑛 ∼ 𝑝𝜂(𝑥) = exp(𝜂⊤𝑇(𝑥) − 𝐴(𝜂)) ℎ(𝑥) i.i.d. Derive the score test for 𝐻0 ∶ 𝜂 = 𝜂0.

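A sketch of the derivation (ours): ∇ℓ𝑛(𝜂0) = ∑_{𝑖=1}^{𝑛} 𝑇(𝑋𝑖) − 𝑛∇𝐴(𝜂0) = 𝑛(𝑇̄ − ∇𝐴(𝜂0)) and 𝐼𝑛(𝜂0) = 𝑛∇²𝐴(𝜂0), so the score test rejects when

𝑛 (𝑇̄ − ∇𝐴(𝜂0))⊤ (∇²𝐴(𝜂0))⁻¹ (𝑇̄ − ∇𝐴(𝜂0)) ≥ 𝜒²𝑠(𝛼),

i.e., it compares the sample mean of the sufficient statistic with its null mean 𝔼𝜂0 𝑇 = ∇𝐴(𝜂0).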
Example 2: Pearson 𝜒² test

Suppose 𝑁 = (𝑁1, …, 𝑁𝑑) ∼ Multinom(𝑛, (𝜋1, …, 𝜋𝑑)), with density

(𝑛! 𝜋1^{𝑁1} ⋯ 𝜋𝑑^{𝑁𝑑}) / (𝑁1! ⋯ 𝑁𝑑!) ⋅ 1{∑ 𝑁𝑖 = 𝑛}

Note that since ∑_{𝑗=1}^{𝑑} 𝜋𝑗 = 1, this is a full-rank (𝑑 − 1)-parameter exponential family, with the possible parameterization

𝜋𝑗 = 1/(1 + ∑_{𝑘>1} 𝑒^{𝜂𝑘}) for 𝑗 = 1,
𝜋𝑗 = 𝑒^{𝜂𝑗}/(1 + ∑_{𝑘>1} 𝑒^{𝜂𝑘}) for 𝑗 > 1

Derive the score test.
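Carrying out the derivation yields Pearson's statistic ∑𝑗 (𝑁𝑗 − 𝑛𝜋0,𝑗)²/(𝑛𝜋0,𝑗) with 𝑑 − 1 degrees of freedom. A numerical check (ours, with made-up counts):

```python
# Worked check (ours): the score test here yields Pearson's chi-squared
# statistic sum_j (N_j - n pi0_j)^2 / (n pi0_j) with d-1 degrees of freedom;
# scipy.stats.chisquare computes the same quantity.
import numpy as np
from scipy.stats import chisquare, chi2

N = np.array([18, 30, 52])          # observed counts, hypothetical data
pi0 = np.array([0.2, 0.3, 0.5])     # null cell probabilities
n = N.sum()

stat = ((N - n * pi0)**2 / (n * pi0)).sum()
print(stat, chi2.sf(stat, df=len(N) - 1))
print(chisquare(N, f_exp=n * pi0))  # same statistic and p-value
```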
Generalized likelihood ratio test
GLRT in simple vs composite two-sided testing

Testing 𝐻0 ∶ 𝜃 = 𝜃0 vs. 𝐻1 ∶ 𝜃 ≠ 𝜃0
Taylor expansion around 𝜃̂𝑛 gives

ℓ𝑛(𝜃0) − ℓ𝑛(𝜃̂𝑛) = ∇ℓ𝑛(𝜃̂𝑛)⊤(𝜃0 − 𝜃̂𝑛) + (1/2)(𝜃0 − 𝜃̂𝑛)⊤ ∇²ℓ𝑛(𝜃̃𝑛)(𝜃0 − 𝜃̂𝑛)
= 0 − (1/2) ∥(−(1/𝑛)∇²ℓ𝑛(𝜃̃𝑛))^{1/2} (√𝑛(𝜃0 − 𝜃̂𝑛))∥²
⇒ −(1/2) 𝜒²𝑑

Why? Because (−(1/𝑛)∇²ℓ𝑛(𝜃̃𝑛))^{1/2} → 𝐼1(𝜃0)^{1/2} in probability and √𝑛(𝜃̂𝑛 − 𝜃0) ⇒ 𝒩(0, 𝐼1(𝜃0)⁻¹), so by Slutsky's thm the vector inside the norm converges to 𝒩(0, 𝕀𝑑), whose squared norm is 𝜒²𝑑.

Test statistic in GLRT

2 log(𝜆) = 2(ℓ𝑛(𝜃̂𝑛) − ℓ𝑛(𝜃0)) ⇒ 𝜒²𝑑
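A minimal sketch (ours) of this GLRT for a Poisson model with 𝐻0 ∶ 𝜃 = 𝜃0, where the MLE is 𝑋̄ and the log(𝑥𝑖!) terms cancel in the difference:

```python
# GLRT sketch (ours) for H0: theta = theta0, X_i ~ Poisson(theta) i.i.d.
import numpy as np
from scipy.stats import chi2

def poisson_glrt(x, theta0, alpha=0.05):
    n, xbar = x.size, x.mean()              # assumes xbar > 0
    two_log_lam = 2 * (x.sum() * np.log(xbar / theta0) - n * (xbar - theta0))
    return two_log_lam, two_log_lam > chi2.ppf(1 - alpha, df=1)
```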
GLRT in composite vs composite

Testing 𝐻0 ∶ 𝜃 ∈ Ω0 vs. 𝐻1 ∶ 𝜃 ∈ Ω∖Ω0

The generalized likelihood ratio is

𝜆 = sup_{𝜃∈Ω1} 𝐿(𝜃) / sup_{𝜃∈Ω0} 𝐿(𝜃), where Ω1 = Ω∖Ω0

The test statistic is

2 log(𝜆) = 2(ℓ𝑛(𝜃̂𝑛) − ℓ𝑛(𝜃̂0))

where 𝜃̂0 = arg max_{𝜃∈Ω0} ℓ𝑛(𝜃)
Asymptotic distribution of 2 log(𝜆)

Theorem (see 17.2 of Keener)
Assume Ω = ℝ𝑑, Ω0 is a 𝑑0-dimensional subspace, 𝜃0 is in the interior of Ω0, 𝜃̂𝑛 is consistent, and 𝑝𝜃(⋅) is "regular" (as in the asymptotics of the MLE). Then

2 log(𝜆) = 2(ℓ𝑛(𝜃̂𝑛) − ℓ𝑛(𝜃̂0)) ⇒ 𝜒²_{𝑑−𝑑0}

where 𝜃̂0 = arg max_{𝜃∈Ω0} ℓ𝑛(𝜃)
Intuition for the asymptotic distribution

(See the rigorous derivation in 17.2 of Keener)

Assume 𝜃0 = 0 and 𝐼1(0) = 𝕀𝑑 (after reparameterization). Then

• 𝜃̂𝑛 ≈ 𝒩(𝜃0, (1/𝑛)𝕀𝑑)
• locally, ∇²ℓ𝑛(𝜃) ≈ −𝑛𝕀𝑑 near 𝜃0
• ℓ𝑛(𝜃) − ℓ𝑛(𝜃̂𝑛) ≈ −(𝑛/2) ∥𝜃 − 𝜃̂𝑛∥²
• 𝜃̂0 ≈ arg min_{𝜃∈Ω0} ∥𝜃 − 𝜃̂𝑛∥² = Proj_{Ω0}(𝜃̂𝑛)

Hence

2(ℓ𝑛(𝜃̂𝑛) − ℓ𝑛(𝜃̂0)) ≈ 𝑛 ∥𝜃̂𝑛 − Proj_{Ω0}(𝜃̂𝑛)∥² ⇒ 𝜒²_{𝑑−𝑑0}
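A simulation sketch (ours) of this projection picture: 𝑋𝑖 ∼ 𝒩(𝜃, 𝕀2) with 𝐻0 ∶ 𝜃2 = 0, so 𝑑 = 2, 𝑑0 = 1, and 2 log(𝜆) = 𝑛𝑋̄2² should behave like 𝜒²1.

```python
# Simulation sketch (ours): under H0 the GLRT statistic n * xbar_2^2 for the
# Gaussian sub-model theta_2 = 0 matches the chi2_{d - d0} = chi2_1 limit.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
n, reps = 200, 5000
xbar = rng.normal(0.0, 1.0, size=(reps, n, 2)).mean(axis=1)  # theta0 = (0, 0)
two_log_lam = n * xbar[:, 1]**2   # n * ||xbar - Proj_{Omega_0}(xbar)||^2

print((two_log_lam > chi2.ppf(0.95, df=1)).mean())   # rejection rate ~ 0.05
```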
Asymptotic equivalence of the three tests

How close are the three tests asymptotically?

• Wald test: ∥𝐼̂𝑛^{1/2}(𝜃̂𝑛 − 𝜃0)∥²
• Score test: ∥𝐼𝑛(𝜃0)^{−1/2} ∇ℓ𝑛(𝜃0)∥²
• GLRT: 2(ℓ𝑛(𝜃̂𝑛) − ℓ𝑛(𝜃0))

For large 𝑛, all are related to

∥𝐼𝑛(𝜃0)^{1/2}(𝜃̂𝑛 − 𝜃0)∥²
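A numerical illustration (ours, with simulated data): in the Bernoulli model all three statistics are driven by (𝑋̄ − 𝜃0)² and take nearby values for large 𝑛.

```python
# Comparison sketch (ours): Wald, score, and GLRT statistics for
# H0: theta = theta0 in a Bernoulli model are close for large n.
import numpy as np

rng = np.random.default_rng(4)
theta0 = 0.4
x = rng.binomial(1, 0.45, size=2000)
n, xbar = x.size, x.mean()

wald = n * (xbar - theta0)**2 / (xbar * (1 - xbar))       # information at MLE
score = n * (xbar - theta0)**2 / (theta0 * (1 - theta0))  # information at theta0
glrt = 2 * n * (xbar * np.log(xbar / theta0)
                + (1 - xbar) * np.log((1 - xbar) / (1 - theta0)))
print(wald, score, glrt)   # three nearby values
```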
Summary

• Wald test: test statistic based on quadratic approx


• Score test: test statistic using score
• Generalized likelihood ratio test: 2 log(𝜆); we intuitively derived its asymptotic distribution

Read Page 362 of Keener for strengths and weaknesses

What is next?

• Final review

Thank you for attending. See you on Wednesday in Old Chem 025.
