Statistical Inference
Lecture 23: Large-Sample Theory for Likelihood Ratio Tests
Yuansi Chen
Spring 2022
Duke University
https://www2.stat.duke.edu/courses/Spring22/sta732.01/
1
Recap from Lecture 22
$$Z = \begin{pmatrix} Z_0 \\ Z_1 \\ \vdots \\ Z_r \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \mu_0 \\ \mu_1 \\ \vdots \\ 0 \end{pmatrix},\ \sigma^2 \mathbb{I}_n\right)$$
2
Goal of Lecture 23
1. Wald test
2. Score test
3. Generalized likelihood ratio test
3
Review the asymptotics of MLE
Setup
$X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} p_\theta(x)$, where $p_\theta(\cdot)$ is "regular" enough (check the conditions in Thm 9.14 of Keener)
4
Consistency of MLE on compact Ω
Define
$$W_i(\theta) = \ell_1(\theta; X_i) - \ell_1(\theta_0; X_i), \qquad \bar{W}_n(\theta) = \frac{1}{n}\sum_{i=1}^n W_i(\theta)$$
We know that
$$\mathbb{E} W_i(\theta) = -\mathcal{D}_{\mathrm{KL}}(\theta_0 \,\|\, \theta) \le 0,$$
with equality iff $P_\theta = P_{\theta_0}$.
Consistency result
If the model is identifiable and each $W_i$ is a continuous random function, then
• $\|\bar{W}_n - \mathbb{E}\bar{W}_n\|_\infty \overset{p}{\to} 0$ on compact $\Omega$.
• Hence $\hat{\theta}_n \overset{p}{\to} \theta_0$ (convergence of the argmax requires the uniform convergence result in Thm 9.4 of Keener).
5
Asymptotic distribution of MLE
MLE satisfies $\nabla \ell_n(\hat{\theta}_n) = 0$. Then, by a Taylor expansion,
$$\sqrt{n}(\hat{\theta}_n - \theta_0) = \left(-\frac{1}{n}\nabla^2 \ell_n(\tilde{\theta}_n)\right)^{-1}\left(\frac{1}{\sqrt{n}}\nabla \ell_n(\theta_0)\right)$$
• $\left(-\frac{1}{n}\nabla^2 \ell_n(\tilde{\theta}_n)\right)^{-1} \overset{p}{\to} I_1(\theta_0)^{-1}$ (convergence of a random function evaluated at a random point requires the uniform convergence result in Thm 9.4 of Keener!)
• $\frac{1}{\sqrt{n}}\nabla \ell_n(\theta_0) \Rightarrow \mathcal{N}(0, I_1(\theta_0))$ (CLT)
By Slutsky's thm, $\sqrt{n}(\hat{\theta}_n - \theta_0) \Rightarrow \mathcal{N}(0, I_1(\theta_0)^{-1})$
6
$$\sqrt{n}(\hat{\theta}_n - \theta_0) \Rightarrow \mathcal{N}(0, I_1(\theta_0)^{-1})$$
7
Wald test
Intuition for Wald-type confidence regions (1)
Suppose $\frac{1}{n}\hat{I}_n \overset{p}{\to} I_1(\theta_0)$. Then we can use $\hat{I}_n$ as a plug-in estimate for $I_1(\theta_0)$ in the asymptotic distribution.
Since $\sqrt{n}(\hat{\theta}_n - \theta_0) \Rightarrow \mathcal{N}(0, I_1(\theta_0)^{-1})$,
then $(I_1(\theta_0))^{1/2}\sqrt{n}(\hat{\theta}_n - \theta_0) \Rightarrow \mathcal{N}(0, \mathbb{I}_d)$,
and by Slutsky's thm,
$$\hat{I}_n^{1/2}(\hat{\theta}_n - \theta_0) \Rightarrow \mathcal{N}(0, \mathbb{I}_d)$$
8
Intuition for Wald-type confidence regions (2)
$$\phi = 1_{\left\{\left\|\hat{I}_n^{1/2}(\hat{\theta}_n - \theta_0)\right\|_2^2 > \chi_d^2(\alpha)\right\}}$$
Remark
• The test might not have the correct level; it only has asymptotic level $\alpha$
• The confidence region is an ellipsoid
$$\hat{\theta}_n + \hat{I}_n^{-1/2}\, \mathbb{B}\!\left(0, \sqrt{\chi_d^2(\alpha)}\right)$$
9
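As a numerical sanity check (not from the slides): a minimal simulation of the Wald test for a $d$-dimensional Gaussian mean with identity covariance, where the MLE is the sample mean and $I_1(\theta) = \mathbb{I}_d$. The model choice, sample size, and seed are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d, n, alpha = 3, 500, 0.05
theta0 = np.zeros(d)

# Model: X_i ~ N(theta, I_d). The MLE is the sample mean and I_1(theta) = I_d,
# so the Wald statistic reduces to n * ||theta_hat - theta0||^2.
X = rng.normal(loc=theta0, scale=1.0, size=(n, d))
theta_hat = X.mean(axis=0)
wald = n * np.sum((theta_hat - theta0) ** 2)

crit = stats.chi2.ppf(1 - alpha, df=d)   # chi^2_d upper-alpha quantile
reject = wald > crit                     # asymptotic level-alpha decision
```

Under $H_0$ (as simulated here) the rejection probability approaches $\alpha$ as $n$ grows.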
Two options for 𝐼𝑛̂
• Expected (Fisher) information at the MLE: $\hat{I}_n = I_n(\hat{\theta}_n) = \mathrm{Var}_\theta\left(\nabla \ell_n(\theta; X)\right)\big|_{\theta=\hat{\theta}_n}$
• Observed information: $\hat{I}_n = -\nabla^2 \ell_n(\hat{\theta}_n; X)$
Remark: both options satisfy $\frac{1}{n}\hat{I}_n \overset{p}{\to} I_1(\theta_0)$ in the "regular" i.i.d. setting
10
Wald interval for 𝜃𝑗
Since $\sqrt{n}(\hat{\theta}_n - \theta_0) \Rightarrow \mathcal{N}(0, I_1(\theta_0)^{-1})$,
then by multiplying by the $j$-th standard basis vector $e_j^\top$, we obtain
$$\sqrt{n}(\hat{\theta}_{n,j} - \theta_{0,j}) \Rightarrow \mathcal{N}\left(0, \left(I_1(\theta_0)^{-1}\right)_{jj}\right)$$
$$C_j = \hat{\theta}_{n,j} \pm \sqrt{\left(\hat{I}_n^{-1}\right)_{jj}} \cdot z_{\alpha/2}$$
11
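A one-parameter illustration (an assumed example, not from the slides): the Wald interval for a Bernoulli success probability, where $(\hat{I}_n^{-1})_{jj}$ reduces to $\hat{p}(1-\hat{p})/n$. All numbers are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p_true, alpha = 400, 0.3, 0.05

# Bernoulli(p): I_1(p) = 1/(p(1-p)), so the plug-in variance of p_hat
# is p_hat*(1-p_hat)/n in the scalar case.
x = rng.binomial(1, p_true, size=n)
p_hat = x.mean()
se = np.sqrt(p_hat * (1 - p_hat) / n)    # sqrt of (I_n_hat)^{-1}
z = stats.norm.ppf(1 - alpha / 2)        # z_{alpha/2}
ci = (p_hat - z * se, p_hat + z * se)    # Wald interval C_j
```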
Confidence ellipsoid for 𝜃0,𝑆
12
Example: generalized linear model with fixed design
Suppose $x_1, \dots, x_n \in \mathbb{R}^d$ are fixed and
$$Y_i \overset{\text{ind.}}{\sim} p_{\eta_i}(y_i) = e^{\eta_i y_i - A(\eta_i)} h(y_i), \quad \text{where } \eta_i = \beta^\top x_i$$
Link function
Let $\mu_i(\beta) = \mathbb{E}_\beta Y_i$. If $f(\mu_i) = \beta^\top x_i$, then $f$ is called the link function.
Common examples
• Logistic regression: $Y_i \overset{\text{ind.}}{\sim} \mathrm{Bernoulli}\left(\frac{e^{x_i^\top\beta}}{1+e^{x_i^\top\beta}}\right)$
• Poisson log-linear model: $Y_i \overset{\text{ind.}}{\sim} \mathrm{Poisson}\left(e^{x_i^\top\beta}\right)$
13
Confidence interval in generalized linear model
$$\ell_n(\beta; Y) = \sum_{i=1}^n \left[(x_i^\top\beta) y_i - A(x_i^\top\beta) + \log h(y_i)\right]$$
$$\nabla \ell_n(\beta; Y) = \sum_{i=1}^n \left[y_i x_i - A'(x_i^\top\beta) x_i\right] = \sum_{i=1}^n \left(y_i - \mu_i(\beta)\right) x_i$$
$$-\nabla^2 \ell_n(\beta; Y) = \sum_{i=1}^n A''(x_i^\top\beta)\, x_i x_i^\top = \sum_{i=1}^n \mathrm{Var}_\beta(y_i)\, x_i x_i^\top$$
15
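The gradient and Hessian above can be turned into a Newton (Fisher scoring) sketch for logistic regression, with Wald intervals from the inverse information. This is an illustrative implementation under assumed design, coefficients, and seed, not code from the course.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 3
X = rng.normal(size=(n, d))              # fixed design x_1, ..., x_n
beta_true = np.array([0.5, -1.0, 0.25])  # assumed true coefficients
mu = 1 / (1 + np.exp(-X @ beta_true))
y = rng.binomial(1, mu)

# Newton / Fisher scoring using the slide's formulas:
#   grad  = sum_i (y_i - mu_i(beta)) x_i
#   -Hess = sum_i A''(x_i' beta) x_i x_i',  with A'' = mu(1-mu) for logistic
beta = np.zeros(d)
for _ in range(25):
    m = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - m)
    info = X.T @ (X * (m * (1 - m))[:, None])   # observed = expected info here
    beta = beta + np.linalg.solve(info, grad)

# Wald 95% intervals: beta_j +/- 1.96 * sqrt((info^{-1})_{jj})
se = np.sqrt(np.diag(np.linalg.inv(info)))
ci = np.stack([beta - 1.96 * se, beta + 1.96 * se], axis=1)
```

Note that for the canonical link the observed and expected information coincide, which is why a single `info` matrix serves both the Newton step and the Wald interval.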
Pros and cons of Wald test
Advantages
• Easy to invert, simple confidence regions
• Asymptotically correct level
Disadvantages
• Have to compute MLE
• Depends on parameterization
• Relies on second order Taylor expansion of ℓ𝑛
• Need MLE to be consistent
• Confidence region might go outside of Ω
16
Score test
Intuition for score test
Testing 𝐻0 ∶ 𝜃 = 𝜃0 vs. 𝐻1 ∶ 𝜃 ≠ 𝜃0
We can bypass the quadratic approximation by using the score as the test statistic:
$$\frac{1}{\sqrt{n}}\nabla \ell_n(\theta_0) \Rightarrow \mathcal{N}(0, I_1(\theta_0))$$
17
Score test
Reject 𝐻0 ∶ 𝜃 = 𝜃0 if
$$\left\|I_n(\theta_0)^{-1/2}\,\nabla \ell_n(\theta_0)\right\|_2^2 \geq \chi_d^2(\alpha)$$
18
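A minimal sketch of the score test for a Poisson mean (an assumed example, not from the slides): the score and Fisher information are available in closed form at $\theta_0$, so no MLE is needed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, lam0 = 200, 2.0
x = rng.poisson(lam0, size=n)

# Poisson(lam): score = sum(x)/lam - n, Fisher info I_n(lam) = n/lam, so the
# score statistic I_n(lam0)^{-1} * score^2 equals n*(xbar - lam0)^2 / lam0.
xbar = x.mean()
score = x.sum() / lam0 - n
stat = score**2 / (n / lam0)
reject = stat > stats.chi2.ppf(0.95, df=1)
```

Everything is evaluated at the null value $\lambda_0$, which is the practical appeal of the score test.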
Score test is invariant to reparameterization
Show that the two test statistics are the same a.s.
19
Example 1: 𝑠-parameter exponential family
Suppose $X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} p_\eta(x) = \exp\left(\eta^\top T(x) - A(\eta)\right) h(x)$. Derive the score test for $H_0: \eta = \eta_0$.
20
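One way the derivation can be sketched (my reconstruction, not the slide's worked solution): the score and information follow from differentiating $\ell_n(\eta) = \eta^\top \sum_i T(X_i) - nA(\eta) + \sum_i \log h(X_i)$.

```latex
% Score and Fisher information in the s-parameter exponential family:
\nabla \ell_n(\eta) = \sum_{i=1}^n T(X_i) - n\,\nabla A(\eta),
\qquad
I_n(\eta) = n\,\nabla^2 A(\eta).
% Plugging in \eta_0 (no MLE required), reject H_0 when
\left\| I_n(\eta_0)^{-1/2}\,\nabla \ell_n(\eta_0) \right\|_2^2
= \frac{1}{n}
  \Big( \textstyle\sum_i T(X_i) - n\nabla A(\eta_0) \Big)^{\!\top}
  \big( \nabla^2 A(\eta_0) \big)^{-1}
  \Big( \textstyle\sum_i T(X_i) - n\nabla A(\eta_0) \Big)
\;\ge\; \chi_s^2(\alpha).
```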
Example 2: Pearson 𝜒2 test
$$\pi_j = \begin{cases} \dfrac{1}{1+\sum_{k>1} e^{\eta_k}} & j = 1 \\[2ex] \dfrac{e^{\eta_j}}{1+\sum_{k>1} e^{\eta_k}} & j > 1 \end{cases}$$
Derive the score test.
21
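In the multinomial model above, the score test works out to the familiar Pearson statistic $\sum_j (O_j - E_j)^2 / E_j$. As an illustration (the observed counts and null probabilities below are made up), it can be computed directly:

```python
import numpy as np
from scipy import stats

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected,
# compared to chi^2 with (#cells - 1) degrees of freedom.
observed = np.array([48, 35, 17])        # assumed observed cell counts
pi0 = np.array([0.5, 0.3, 0.2])          # hypothesized cell probabilities
expected = observed.sum() * pi0
stat = np.sum((observed - expected) ** 2 / expected)
pval = stats.chi2.sf(stat, df=len(pi0) - 1)
```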
Generalized likelihood ratio test
GLRT in simple vs composite two-sided testing
Testing 𝐻0 ∶ 𝜃 = 𝜃0 vs. 𝐻1 ∶ 𝜃 ≠ 𝜃0
Taylor expansion around $\hat{\theta}_n$ gives
$$\ell_n(\theta_0) - \ell_n(\hat{\theta}_n) = \nabla \ell_n(\hat{\theta}_n)^\top(\theta_0 - \hat{\theta}_n) + \frac{1}{2}(\theta_0 - \hat{\theta}_n)^\top \nabla^2 \ell_n(\tilde{\theta}_n)(\theta_0 - \hat{\theta}_n)$$
$$= 0 - \frac{1}{2}\left\|\left(-\frac{1}{n}\nabla^2 \ell_n(\tilde{\theta}_n)\right)^{1/2}\left(\sqrt{n}(\theta_0 - \hat{\theta}_n)\right)\right\|_2^2 \Rightarrow -\frac{1}{2}\chi_d^2$$
why?
22
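A Monte Carlo sketch of this limit (the Exponential model and all constants are my illustrative assumptions, not from the slides): for i.i.d. Exponential data, $2(\ell_n(\hat{\lambda}) - \ell_n(\lambda_0))$ should behave like $\chi_1^2$ under $H_0$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 100, 2000
lam0 = 1.5

# Exponential(rate lam): ell_n(lam) = n*log(lam) - lam*sum(x); MLE lam_hat = 1/xbar.
# Under H0, 2*(ell_n(lam_hat) - ell_n(lam0)) is approximately chi^2_1.
def loglik(lam, x):
    return n * np.log(lam) - lam * x.sum()

stats_2loglam = np.empty(reps)
for r in range(reps):
    x = rng.exponential(scale=1 / lam0, size=n)
    lam_hat = 1 / x.mean()
    stats_2loglam[r] = 2 * (loglik(lam_hat, x) - loglik(lam0, x))

mean_stat = stats_2loglam.mean()   # chi^2_1 has mean 1
```

Each simulated statistic is nonnegative by construction, since $\hat{\lambda}$ maximizes the likelihood.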
GLRT in composite vs composite
$$\lambda = \frac{\sup_{\Omega_1} L(\theta)}{\sup_{\Omega_0} L(\theta)}$$
23
Asymptotic distribution of 2 log(𝜆)
24
Intuition for the asymptotic distribution
• $\hat{\theta}_n \approx \mathcal{N}\left(\theta_0, \frac{1}{n}\mathbb{I}_d\right)$
• locally, $-\nabla^2 \ell_n(\theta) \approx n\,\mathbb{I}_d$ near $\theta_0$
• $\ell_n(\theta) - \ell_n(\hat{\theta}_n) \approx -\frac{n}{2}\left\|\theta - \hat{\theta}_n\right\|_2^2$
• $\hat{\theta}_0 \approx \arg\min_{\theta \in \Omega_0} \left\|\theta - \hat{\theta}_n\right\|_2^2 = \mathrm{Proj}_{\Omega_0}(\hat{\theta}_n)$
•
$$2\left(\ell_n(\hat{\theta}_n) - \ell_n(\hat{\theta}_0)\right) \approx n\left\|\hat{\theta}_n - \mathrm{Proj}_{\Omega_0}(\hat{\theta}_n)\right\|_2^2 \Rightarrow \chi_{d-d_0}^2$$
25
Asymptotic equivalence of the three tests
26
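A numerical illustration of the equivalence (a Bernoulli example with assumed constants, not from the slides): under $H_0$ the Wald, score, and GLRT statistics agree to first order, and the agreement is visible already at moderate $n$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p0 = 2000, 0.4
x = rng.binomial(1, p0, size=n)
p = x.mean()
k = x.sum()

# Wald: information evaluated at the MLE p_hat.
wald = n * (p - p0) ** 2 / (p * (1 - p))
# Score: information evaluated at the null value p0; no MLE needed.
score = n * (p - p0) ** 2 / (p0 * (1 - p0))
# GLRT: 2*(ell_n(p_hat) - ell_n(p0)) for the Bernoulli log-likelihood.
glrt = 2 * (k * np.log(p / p0) + (n - k) * np.log((1 - p) / (1 - p0)))
```

All three are asymptotically $\chi_1^2$ under $H_0$, so their values should be close to one another on the same data.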
Summary
27
What is next?
• Final review
28
Thank you for attending
See you on Wednesday in Old Chem 025
29