
Chapter 3

The Profile Likelihood

3.1 The Profile Likelihood

3.1.1 The method of profiling


Let us suppose that the unknown parameters θ can be partitioned as θ' = (ψ', λ'), where
ψ are the p-dimensional parameters of interest (eg. mean) and λ are the q-dimensional
nuisance parameters (eg. variance). We will need to estimate both ψ and λ, but our
interest lies only in the parameter ψ. To achieve this one often profiles out the nuisance
parameters. To motivate the profile likelihood, we first describe a method to estimate the
parameters (ψ, λ) in two stages and consider some examples.

Let us suppose that {X_i} are iid random variables with density f(x; ψ, λ), where our
objective is to estimate ψ and λ. In this case the log-likelihood is

L_n(ψ, λ) = Σ_{i=1}^n log f(X_i; ψ, λ).

To estimate ψ and λ one can use (ψ̂_n, λ̂_n) = arg max_{ψ,λ} L_n(ψ, λ). However, this can be
difficult to maximise directly. Instead let us consider a different method, which may,
sometimes, be easier to evaluate. Suppose, for now, that ψ is known; then we rewrite the
likelihood as L_n(ψ, λ) = L_ψ(λ) (to show that ψ is fixed but λ varies). To estimate λ we
maximise L_ψ(λ) with respect to λ, i.e.

λ̂_ψ = arg max_λ L_ψ(λ).

In reality ψ is unknown, hence for each ψ we can evaluate λ̂_ψ. Note that for each ψ we
have a new curve L_ψ(λ) over λ. Now to estimate ψ, we evaluate the maximum L_ψ(λ̂_ψ)
over ψ, and choose the ψ which is the maximum over all these curves. In other words,
we evaluate

ψ̂_n = arg max_ψ L_ψ(λ̂_ψ) = arg max_ψ L_n(ψ, λ̂_ψ).

A bit of logical deduction shows that ψ̂_n and λ̂_{ψ̂_n} are the maximum likelihood estimators,
(ψ̂_n, λ̂_n) = arg max_{ψ,λ} L_n(ψ, λ).

We note that we have profiled out the nuisance parameter λ, and the likelihood L_ψ(λ̂_ψ) =
L_n(ψ, λ̂_ψ) is in terms of the parameter of interest ψ only.

The advantage of this procedure is best illustrated through some examples.

Example 3.1.1 (The Weibull distribution) Let us suppose that {Y_i} are iid random
variables from a Weibull distribution with density f(y; α, θ) = (α y^{α−1}/θ^α) exp(−(y/θ)^α). We
know from Example 2.2.2 that if α were known, an explicit expression for the MLE of θ can
be derived; it is

θ̂_α = arg max_θ L_α(θ)
    = arg max_θ Σ_{i=1}^n ( log α + (α − 1) log Y_i − α log θ − (Y_i/θ)^α )
    = ( (1/n) Σ_{i=1}^n Y_i^α )^{1/α},

where L_α(θ) = Σ_{i=1}^n ( log α + (α − 1) log Y_i − α log θ − (Y_i/θ)^α ). Thus for a given α,
the maximum likelihood estimator of θ can be derived. The maximum likelihood estimator
of α is then

α̂_n = arg max_α Σ_{i=1}^n ( log α + (α − 1) log Y_i − α log( ((1/n) Σ_{j=1}^n Y_j^α)^{1/α} ) − Y_i^α / ( (1/n) Σ_{j=1}^n Y_j^α ) ).

Therefore, the maximum likelihood estimator of θ is ( (1/n) Σ_{i=1}^n Y_i^{α̂_n} )^{1/α̂_n}. We observe that
evaluating α̂_n can be tricky, but it is no worse than maximising the likelihood L_n(α, θ) over both α
and θ.
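
To make the two-stage procedure concrete, here is a minimal R sketch (the simulated data and parameter values below are illustrative, not part of the notes): for each α the estimator θ̂_α has the closed form above, so only the profile likelihood of α needs to be maximised numerically.

set.seed(1)
Y <- rweibull(200, shape = 1.5, scale = 2)        # illustrative simulated data

# Profile log-likelihood of alpha: theta is profiled out in closed form
profile.loglik <- function(alpha, Y) {
  theta.hat <- mean(Y^alpha)^(1 / alpha)          # theta-hat_alpha
  sum(log(alpha) + (alpha - 1) * log(Y) - alpha * log(theta.hat) - (Y / theta.hat)^alpha)
}

alpha.hat <- optimize(profile.loglik, interval = c(0.1, 10), Y = Y, maximum = TRUE)$maximum
theta.hat <- mean(Y^alpha.hat)^(1 / alpha.hat)
c(alpha = alpha.hat, theta = theta.hat)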

As we mentioned above, we are not interested in the nuisance parameters λ and are only
interested in testing and constructing CIs for ψ. In this case, we are interested in the
limiting distribution of the MLE ψ̂_n. Using Theorem 2.6.2(ii) we have

√n ( ψ̂_n − ψ_0 ; λ̂_n − λ_0 ) →D N( 0, ( I_ψψ  I_ψλ ; I_λψ  I_λλ )^{-1} ),

where

( I_ψψ  I_ψλ ; I_λψ  I_λλ ) = ( −E(∂² log f(X_i; ψ, λ)/∂ψ²)  −E(∂² log f(X_i; ψ, λ)/∂ψ∂λ) ; −E(∂² log f(X_i; ψ, λ)/∂λ∂ψ)  −E(∂² log f(X_i; ψ, λ)/∂λ²) ),   (3.1)

with all expectations evaluated at (ψ_0, λ_0).

To derive an exact expression for the limiting variance of √n(ψ̂_n − ψ_0), we use the block
inverse matrix identity.

Remark 3.1.1 (Inverse of a block matrix) Suppose that

( A  B ; C  D )

is a square matrix. Then

( A  B ; C  D )^{-1} = ( (A − BD^{-1}C)^{-1}   −A^{-1}B(D − CA^{-1}B)^{-1} ; −D^{-1}C(A − BD^{-1}C)^{-1}   (D − CA^{-1}B)^{-1} ).   (3.2)

Using (3.2) we have

√n ( ψ̂_n − ψ_0 ) →D N( 0, (I_ψψ − I_ψλ I_λλ^{-1} I_λψ)^{-1} ).   (3.3)

Thus if ψ is a scalar we can use the above to construct confidence intervals for ψ.

Example 3.1.2 (Block diagonal information matrix) If

I(ψ, λ) = ( I_ψψ  0 ; 0  I_λλ ),

then using (3.3) we have

√n ( ψ̂_n − ψ_0 ) →D N( 0, I_ψψ^{-1} ).

3.1.2 The score and the log-likelihood ratio for the profile likelihood
To ease notation, let us suppose that ψ_0 and λ_0 are the true parameters in the distribution.
We now consider the log-likelihood ratio

2 ( max_{ψ,λ} L_n(ψ, λ) − max_λ L_n(ψ_0, λ) ),   (3.4)

where ψ_0 is the true parameter. However, deriving the limiting distribution of this statistic
is a little more complicated than for the log-likelihood ratio test that does not involve
nuisance parameters. This is because directly applying a Taylor expansion does not work,
since the expansion is usually made about the true parameters. We observe that

2 ( max_{ψ,λ} L_n(ψ, λ) − max_λ L_n(ψ_0, λ) )
= 2 ( max_{ψ,λ} L_n(ψ, λ) − L_n(ψ_0, λ_0) ) − 2 ( max_λ L_n(ψ_0, λ) − L_n(ψ_0, λ_0) ),

where the first term on the right is asymptotically χ²_{p+q} and the second is asymptotically χ²_q.
It seems reasonable that the difference may be a χ²_p, but it is not immediately clear why.
Below we show, using a few Taylor expansions, why this is true.

In the theorem below we derive the distribution of the score and the nested log-likelihood ratio.

Theorem 3.1.1 Suppose Assumption 2.6.1 holds. Suppose that (ψ_0, λ_0) are the true
parameters. Then we have

∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ̂_{ψ_0})} ≈ ∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ_0)} − I_ψλ I_λλ^{-1} ∂L_n(ψ, λ)/∂λ |_{(ψ_0, λ_0)}   (3.5)

and

n^{-1/2} ∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ̂_{ψ_0})} →D N( 0, I_ψψ − I_ψλ I_λλ^{-1} I_λψ ),   (3.6)

where I is defined as in (3.1), and

2 ( L_n(ψ̂_n, λ̂_n) − L_n(ψ_0, λ̂_{ψ_0}) ) →D χ²_p,   (3.7)

where p denotes the dimension of ψ. This result is often called Wilks Theorem.

PROOF. We first prove (3.5), which is the basis of the proof of (3.6). To avoid notational
difficulties we will assume that ∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ̂_{ψ_0})} and ∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ_0)} are univariate
random variables.

Our objective is to find an expression for ∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ̂_{ψ_0})} in terms of ∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ_0)}
and ∂L_n(ψ, λ)/∂λ |_{(ψ_0, λ_0)}, which will allow us to obtain its variance and asymptotic distribution.

Making a Taylor expansion of ∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ̂_{ψ_0})} about ∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ_0)} gives

∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ̂_{ψ_0})} ≈ ∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ_0)} + (λ̂_{ψ_0} − λ_0) ∂²L_n(ψ, λ)/∂λ∂ψ |_{(ψ_0, λ_0)}.

Notice that we have used ≈ instead of = because we replace the second derivative
with its value at the true parameters. If the sample size is large enough then
∂²L_n(ψ, λ)/∂λ∂ψ |_{(ψ_0, λ_0)} ≈ E( ∂²L_n(ψ, λ)/∂λ∂ψ |_{(ψ_0, λ_0)} ); eg. in the iid case we have

(1/n) ∂²L_n(ψ, λ)/∂λ∂ψ |_{(ψ_0, λ_0)} = (1/n) Σ_{i=1}^n ∂² log f(X_i; ψ, λ)/∂λ∂ψ |_{(ψ_0, λ_0)}
≈ E( ∂² log f(X_i; ψ, λ)/∂λ∂ψ |_{(ψ_0, λ_0)} ) = −I_ψλ.

Therefore

∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ̂_{ψ_0})} ≈ ∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ_0)} − n ( λ̂_{ψ_0} − λ_0 ) I_ψλ.   (3.8)

Next we make a decomposition of (λ̂_{ψ_0} − λ_0). We recall that since λ̂_{ψ_0} = arg max_λ L_n(ψ_0, λ),
then

∂L_n(ψ_0, λ)/∂λ |_{λ̂_{ψ_0}} = 0

(if the maximum is not on the boundary). Therefore, making a Taylor expansion of
∂L_n(ψ_0, λ)/∂λ |_{λ̂_{ψ_0}} about ∂L_n(ψ_0, λ)/∂λ |_{λ_0} gives

0 = ∂L_n(ψ_0, λ)/∂λ |_{λ̂_{ψ_0}} ≈ ∂L_n(ψ_0, λ)/∂λ |_{λ_0} + ∂²L_n(ψ_0, λ)/∂λ² |_{λ_0} ( λ̂_{ψ_0} − λ_0 ).

Replacing ∂²L_n(ψ_0, λ)/∂λ² |_{λ_0} with −nI_λλ gives

∂L_n(ψ_0, λ)/∂λ |_{λ_0} − nI_λλ ( λ̂_{ψ_0} − λ_0 ) ≈ 0,

and rearranging the above gives

( λ̂_{ψ_0} − λ_0 ) ≈ (I_λλ^{-1}/n) ∂L_n(ψ_0, λ)/∂λ |_{λ_0}.   (3.9)

Therefore substituting (3.9) into (3.8) gives

∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ̂_{ψ_0})} ≈ ∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ_0)} − I_ψλ I_λλ^{-1} ∂L_n(ψ_0, λ)/∂λ |_{λ_0},

and thus we have proved (3.5).
To prove (3.6) we note that

∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ̂_{ψ_0})} ≈ ( 1 , −I_ψλ I_λλ^{-1} ) ∂L_n(ψ, λ)/∂θ |_{(ψ_0, λ_0)}.   (3.10)

We recall that the regular score function satisfies

n^{-1/2} ∂L_n(ψ, λ)/∂θ |_{(ψ_0, λ_0)} = n^{-1/2} ( ∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ_0)} ; ∂L_n(ψ, λ)/∂λ |_{(ψ_0, λ_0)} ) →D N( 0, I(θ_0) ).

Now substituting the above into (3.10) and calculating the variance gives (3.6).
Finally, to prove (3.7) we apply a Taylor expansion to the decomposition

2 ( L_n(ψ̂_n, λ̂_n) − L_n(ψ_0, λ̂_{ψ_0}) ) = 2 ( L_n(ψ̂_n, λ̂_n) − L_n(ψ_0, λ_0) ) − 2 ( L_n(ψ_0, λ̂_{ψ_0}) − L_n(ψ_0, λ_0) )
≈ n ( θ̂_n − θ_0 )' I(θ_0) ( θ̂_n − θ_0 ) − n ( λ̂_{ψ_0} − λ_0 )' I_λλ ( λ̂_{ψ_0} − λ_0 ),   (3.11)

where θ̂_n' = (ψ̂_n', λ̂_n') (the MLE). We now find an approximation of (λ̂_{ψ_0} − λ_0) in terms of
(θ̂_n − θ_0). We recall that (θ̂_n − θ_0) ≈ (nI(θ_0))^{-1} ∇_θ L_n(θ) |_{θ=θ_0}, therefore

n ( I_ψψ  I_ψλ ; I_λψ  I_λλ ) ( ψ̂_n − ψ_0 ; λ̂_n − λ_0 ) ≈ ( ∂L_n(θ)/∂ψ |_{θ_0} ; ∂L_n(θ)/∂λ |_{θ_0} ).   (3.12)

From (3.9) and the expansion of ∂L_n(θ)/∂λ given in (3.12) we have

( λ̂_{ψ_0} − λ_0 ) ≈ (I_λλ^{-1}/n) ∂L_n(ψ_0, λ)/∂λ |_{λ_0} ≈ I_λλ^{-1} ( I_λψ ( ψ̂_n − ψ_0 ) + I_λλ ( λ̂_n − λ_0 ) )
≈ I_λλ^{-1} I_λψ ( ψ̂_n − ψ_0 ) + ( λ̂_n − λ_0 ) = ( I_λλ^{-1} I_λψ , 1 ) ( θ̂_n − θ_0 ).

Substituting the above into (3.11) and making several cancellations we have

2 ( L_n(ψ̂_n, λ̂_n) − L_n(ψ_0, λ̂_{ψ_0}) ) ≈ n ( ψ̂_n − ψ_0 )' ( I_ψψ − I_ψλ I_λλ^{-1} I_λψ ) ( ψ̂_n − ψ_0 ).

Finally, using (3.3), i.e. √n( ψ̂_n − ψ_0 ) →D N( 0, (I_ψψ − I_ψλ I_λλ^{-1} I_λψ)^{-1} ), in the above
gives the desired result. □

Remark 3.1.2 (i) The limiting variance of √n( ψ̂_n − ψ_0 ) if λ were known is I_ψψ^{-1}, whereas
the limiting variance of n^{-1/2} ∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ̂_{ψ_0})} is (I_ψψ − I_ψλ I_λλ^{-1} I_λψ) and the limiting
variance of √n( ψ̂_n − ψ_0 ) is (I_ψψ − I_ψλ I_λλ^{-1} I_λψ)^{-1}. Therefore, if ψ and λ are scalars
and the cross-information I_ψλ is non-zero, the limiting variance of ψ̂_n − ψ_0 is larger than
it would be if λ were known. This makes sense: if we have less information, the variance grows.

(ii) Look again at the expression

∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ̂_{ψ_0})} ≈ ∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ_0)} − I_ψλ I_λλ^{-1} ∂L_n(ψ_0, λ)/∂λ |_{λ_0}.   (3.13)

It is useful to understand where it came from. Consider the problem of linear regression.
Suppose X and Y are random variables and we want to construct the best linear predictor
of X given Y. We know that the best linear predictor is X̂(Y) = E(XY)E(Y²)^{-1} Y, the residual is

X − X̂(Y) = X − E(XY)E(Y²)^{-1} Y,

and the mean squared error is

E( X − E(XY)E(Y²)^{-1} Y )² = E(X²) − E(XY)E(Y²)^{-1}E(XY).

Compare this expression with (3.13). We see that in some sense ∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ̂_{ψ_0})} can
be treated as the residual (error) of the projection of ∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ_0)} onto ∂L_n(ψ, λ)/∂λ |_{(ψ_0, λ_0)}.

3.1.3 The log-likelihood ratio statistic in the presence of nuisance parameters

Theorem 3.1.1 can be used to test H_0 : ψ = ψ_0 against H_A : ψ ≠ ψ_0, since

2 ( max_{ψ,λ} L_n(ψ, λ) − max_λ L_n(ψ_0, λ) ) →D χ²_p.

The same quantity can be used in the construction of confidence intervals. By using (3.7)
we can construct CIs. For example, to construct a 95% CI for ψ we can use the MLE
θ̂_n = (ψ̂_n, λ̂_n) and the profile likelihood (3.7) to give

{ ψ ; 2 ( L_n(ψ̂_n, λ̂_n) − L_n(ψ, λ̂_ψ) ) ≤ χ²_p(0.95) }.
Example 3.1.3 (The normal distribution and confidence intervals) This example
is taken from Davidson (2004), Example 4.31, p129.

We recall that the log-likelihood for {Y_i}, which are iid random variables from a normal
distribution with mean µ and variance σ², is

L_n(µ, σ²) = L_µ(σ²) = −(1/(2σ²)) Σ_{i=1}^n (Y_i − µ)² − (n/2) log σ².

Our aim is to use the log-likelihood ratio statistic, analogous to Section 2.8.1, to construct
a CI for µ. Thus we treat σ² as the nuisance parameter.

Keeping µ fixed, the maximum likelihood estimator of σ² is σ̂²(µ) = (1/n) Σ_{i=1}^n (Y_i − µ)².
Rearranging σ̂²(µ) gives

σ̂²(µ) = s² ((n−1)/n) ( 1 + t_n²(µ)/(n−1) ),

where t_n²(µ) = n(Ȳ − µ)²/s² and s² = (1/(n−1)) Σ_{i=1}^n (Y_i − Ȳ)². Substituting σ̂²(µ) into L_n(µ, σ²)
gives the profile likelihood

L_n(µ, σ̂²(µ)) = −(1/(2σ̂²(µ))) Σ_{i=1}^n (Y_i − µ)² − (n/2) log σ̂²(µ)
             = −n/2 − (n/2) log { s² ((n−1)/n) ( 1 + t_n²(µ)/(n−1) ) },

where the first term equals −n/2 because Σ_{i=1}^n (Y_i − µ)² = n σ̂²(µ). It is clear that L_n(µ, σ̂²(µ))
is maximised at µ̂ = Ȳ. Hence

L_n(µ̂, σ̂²(µ̂)) = −n/2 − (n/2) log ( s² (n−1)/n ).

Thus the log-likelihood ratio is

W_n(µ) = 2 ( L_n(µ̂, σ̂²(µ̂)) − L_n(µ, σ̂²(µ)) ) = n log ( 1 + t_n²(µ)/(n−1) ),

which converges in distribution to χ²_1 for the true µ. Therefore, using the same argument as in
Section 2.8.1, the 95% confidence interval for the mean is

{ µ ; 2 ( L_n(µ̂, σ̂²(µ̂)) − L_n(µ, σ̂²(µ)) ) ≤ χ²_1(0.95) } = { µ ; n log ( 1 + t_n²(µ)/(n−1) ) ≤ χ²_1(0.95) }.

However, this is an asymptotic result. With the normal distribution we can get the exact
distribution. We note that since log is a monotonic function, the log-likelihood ratio interval is
equivalent to

{ µ ; t_n²(µ) ≤ C_α },

where C_α is an appropriately chosen critical value. We recall that t_n(µ) has a t-distribution
with n − 1 degrees of freedom. Thus C_α is the critical value corresponding to a Hotelling
T²-distribution.
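
A short R sketch of this interval (simulated data; all values are illustrative) inverts W_n(µ) numerically over a grid and compares the result with the exact t-interval:

set.seed(1)
Y <- rnorm(50, mean = 1, sd = 2)
n <- length(Y); ybar <- mean(Y); s2 <- var(Y)

# Profile log-likelihood ratio W_n(mu) = n * log(1 + t_n^2(mu)/(n - 1))
Wn <- function(mu) n * log(1 + n * (ybar - mu)^2 / ((n - 1) * s2))

# Asymptotic 95% CI: invert W_n(mu) <= qchisq(0.95, df = 1) over a grid
mu.grid <- seq(ybar - 3 * sqrt(s2 / n), ybar + 3 * sqrt(s2 / n), length.out = 2000)
range(mu.grid[Wn(mu.grid) <= qchisq(0.95, df = 1)])

# Exact 95% CI based on the t(n - 1) distribution, for comparison
ybar + c(-1, 1) * qt(0.975, df = n - 1) * sqrt(s2 / n)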

Exercise 3.1 Derive the test for independence (in the case of two by two tables) using
the log-likelihood ratio test. More precisely, derive the asymptotic distribution of

T = (O_1 − E_1)²/E_1 + (O_2 − E_2)²/E_2 + (O_3 − E_3)²/E_3 + (O_4 − E_4)²/E_4,

under the null that there is no association between the categorical variables C and R,
where E_1 = n_3 × n_1/N, E_2 = n_4 × n_1/N, E_3 = n_3 × n_2/N and E_4 = n_4 × n_2/N, with the
counts laid out as in the table below. State all results you use.

          C1    C2    Subtotal
R1        O1    O2    n1
R2        O3    O4    n2
Subtotal  n3    n4    N

Hint: You may need to use the Taylor approximation x log(x/y) ≈ (x − y) + (1/2)(x − y)²/y.
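
For a quick numerical check, the statistic T can be computed directly from a 2 × 2 table in R (the counts below are made up purely for illustration):

# Observed 2x2 table (illustrative counts)
O <- matrix(c(30, 20, 25, 45), nrow = 2, byrow = TRUE)
N <- sum(O)

# Expected counts under independence: E_ij = (row total) * (column total) / N
E <- outer(rowSums(O), colSums(O)) / N

T.stat <- sum((O - E)^2 / E)
p.value <- 1 - pchisq(T.stat, df = 1)   # compare with the limiting distribution the exercise asks you to derive
c(T = T.stat, p = p.value)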

Pivotal Quantities

Pivotal quantities are statistics whose distribution does not depend on any parameters.
These include the t-ratio t = √n(X̄ − µ)/s_n ∼ t_{n−1} (in the case that the data are normal), the
F-statistic, etc.

In many applications it is not possible to obtain a pivotal quantity, but a quantity can
be asymptotically pivotal. The log-likelihood ratio statistic is one such example (since its
asymptotic distribution is a chi-square).

Pivotal statistics have many advantages. The main one is that they avoid the need to
estimate extra parameters. However, they are also useful in developing bootstrap methods,
etc.

3.1.4 The score statistic in the presence of nuisance parameters
We recall that we used Theorem 3.1.1 to obtain the distribution of 2( max_{ψ,λ} L_n(ψ, λ) −
max_λ L_n(ψ_0, λ) ) under the null; we now consider the score test.

We recall that under the null H_0 : ψ = ψ_0 the derivative ∂L_n(ψ, λ)/∂λ |_{(ψ_0, λ̂_{ψ_0})} = 0, but the
same is not true of ∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ̂_{ψ_0})}. However, if the null were true we would expect λ̂_{ψ_0}
to be close to the true λ_0 and ∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ̂_{ψ_0})} to be close to zero. Indeed this is what we
showed in (3.6), where we showed that under the null

n^{-1/2} ∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ̂_{ψ_0})} →D N( 0, I_ψψ − I_ψλ I_λλ^{-1} I_λψ ),   (3.14)

where λ̂_{ψ_0} = arg max_λ L_n(ψ_0, λ).

Therefore (3.14) suggests an alternative test for H_0 : ψ = ψ_0 against H_A : ψ ≠ ψ_0. We
can use n^{-1/2} ∂L_n(ψ, λ)/∂ψ |_{(ψ_0, λ̂_{ψ_0})} as the test statistic. This is called the score or LM test.

The log-likelihood ratio test and the score test are asymptotically equivalent. There
are advantages and disadvantages to both.

(i) An advantage of the log-likelihood ratio test is that we do not need to calculate the
information matrix.

(ii) An advantage of the score test is that we do not have to evaluate the maximum
likelihood estimates under the alternative model.
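
To make the score test concrete, here is a minimal R sketch for testing H_0 : µ = µ_0 in the normal model of Example 3.1.3, with σ² profiled out (the data and µ_0 are illustrative; for this particular model the cross-information between µ and σ² is zero, so no correction term appears in the variance):

# Score (LM) test of H0: mu = mu0, nuisance sigma^2 estimated under the null
score.test <- function(Y, mu0) {
  n <- length(Y)
  sigma2.hat <- mean((Y - mu0)^2)                 # MLE of sigma^2 with mu fixed at mu0
  U <- sum(Y - mu0) / sigma2.hat                  # dL_n/dmu evaluated at (mu0, sigma2.hat)
  stat <- U^2 * sigma2.hat / n                    # (n^{-1/2} U)^2 divided by I_mumu = 1/sigma^2
  c(statistic = stat, p.value = 1 - pchisq(stat, df = 1))
}

set.seed(1)
score.test(rnorm(100, mean = 0.2), mu0 = 0)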

3.2 Applications

3.2.1 An application of profiling to frequency estimation


Suppose that the observations {X_t; t = 1, . . . , n} satisfy the following nonlinear regression
model

X_t = A cos(ωt) + B sin(ωt) + ε_t,

where {ε_t} are iid standard normal random variables and 0 < ω < π (thus allowing the
case ω = π/2, but not the end points ω = 0 or π). The parameters A, B, and ω are real
and unknown. Full details can be found in the paper http://www.jstor.org/stable/pdf/2334314.pdf (Walker, 1971, Biometrika).

(i) Ignoring constants, obtain the log-likelihood of {X_t}. Denote this likelihood as
L_n(A, B, ω).

(ii) Let

S_n(A, B, ω) = Σ_{t=1}^n X_t² − 2 Σ_{t=1}^n X_t ( A cos(ωt) + B sin(ωt) ) + (1/2) n (A² + B²).

Show that

2 L_n(A, B, ω) + S_n(A, B, ω) = −((A² − B²)/2) Σ_{t=1}^n cos(2tω) − AB Σ_{t=1}^n sin(2tω).

Thus show that |L_n(A, B, ω) + (1/2) S_n(A, B, ω)| = O(1) (ie. the difference does not
grow with n).

Since L_n(A, B, ω) and −(1/2) S_n(A, B, ω) are asymptotically equivalent, for the rest of
this question use −(1/2) S_n(A, B, ω) instead of the likelihood L_n(A, B, ω).

(iii) Obtain the profile likelihood of ω.

(Hint: profile out the parameters A and B to show that ω̂_n = arg max_ω | Σ_{t=1}^n X_t exp(itω) |².)
Suggest a graphical method for evaluating ω̂_n.

(iv) By using the identity

Σ_{t=1}^n exp(iΩt) = { exp((1/2)i(n+1)Ω) sin((1/2)nΩ) / sin((1/2)Ω),   0 < Ω < 2π
                       n,                                              Ω = 0 or 2π,   (3.15)

show that for 0 < Ω < 2π we have

Σ_{t=1}^n t cos(Ωt) = O(n),     Σ_{t=1}^n t sin(Ωt) = O(n),
Σ_{t=1}^n t² cos(Ωt) = O(n²),   Σ_{t=1}^n t² sin(Ωt) = O(n²).

(v) By using the results in part (iv), show that the Fisher Information of L_n(A, B, ω)
(denoted as I(A, B, ω)) is asymptotically equivalent to

2 I(A, B, ω) = E( ∂²S_n / ∂(A, B, ω) ∂(A, B, ω)' )
             = (  n                  0                   (n²/2)B + O(n)
                  0                  n                  −(n²/2)A + O(n)
                  (n²/2)B + O(n)    −(n²/2)A + O(n)      (n³/3)(A² + B²) + O(n²)  ).

(vi) Derive the asymptotic variance of the maximum likelihood estimator ω̂_n derived in
part (iii).

Comment on the rate of convergence of ω̂_n.

Useful information: The following quantities may be useful:

Σ_{t=1}^n exp(iΩt) = { exp((1/2)i(n+1)Ω) sin((1/2)nΩ) / sin((1/2)Ω),   0 < Ω < 2π
                       n,                                              Ω = 0 or 2π,   (3.16)

the trigonometric identities sin(2Ω) = 2 sin Ω cos Ω, cos(2Ω) = 2cos²(Ω) − 1 = 1 − 2sin²Ω,
exp(iΩ) = cos(Ω) + i sin(Ω), and

Σ_{t=1}^n t = n(n+1)/2,     Σ_{t=1}^n t² = n(n+1)(2n+1)/6.

Solution

Since {ε_t} are standard normal iid random variables, the log-likelihood (ignoring constants) is

L_n(A, B, ω) = −(1/2) Σ_{t=1}^n ( X_t − A cos(ωt) − B sin(ωt) )².

If the frequency ω were known, then the least squares estimator of A and B would be

( Â ; B̂ ) = ( (1/n) Σ_{t=1}^n x_t' x_t )^{-1} (1/n) Σ_{t=1}^n X_t ( cos(ωt) ; sin(ωt) ),

where x_t = (cos(ωt), sin(ωt)). However, because the sine and cosine functions are nearly
orthogonal, we have (1/n) Σ_{t=1}^n x_t' x_t ≈ (1/2) I_2 and

( Â ; B̂ ) ≈ (2/n) Σ_{t=1}^n X_t ( cos(ωt) ; sin(ωt) ),

which is simple to evaluate! The above argument is not very precise. To make it precise
we note that

−2 L_n(A, B, ω)
= Σ_{t=1}^n X_t² − 2 Σ_{t=1}^n X_t ( A cos(ωt) + B sin(ωt) )
  + A² Σ_{t=1}^n cos²(ωt) + B² Σ_{t=1}^n sin²(ωt) + 2AB Σ_{t=1}^n sin(ωt) cos(ωt)
= Σ_{t=1}^n X_t² − 2 Σ_{t=1}^n X_t ( A cos(ωt) + B sin(ωt) )
  + (A²/2) Σ_{t=1}^n (1 + cos(2tω)) + (B²/2) Σ_{t=1}^n (1 − cos(2tω)) + AB Σ_{t=1}^n sin(2tω)
= Σ_{t=1}^n X_t² − 2 Σ_{t=1}^n X_t ( A cos(ωt) + B sin(ωt) ) + (n/2)(A² + B²)
  + ((A² − B²)/2) Σ_{t=1}^n cos(2tω) + AB Σ_{t=1}^n sin(2tω)
= S_n(A, B, ω) + ((A² − B²)/2) Σ_{t=1}^n cos(2tω) + AB Σ_{t=1}^n sin(2tω),

where

S_n(A, B, ω) = Σ_{t=1}^n X_t² − 2 Σ_{t=1}^n X_t ( A cos(ωt) + B sin(ωt) ) + (n/2)(A² + B²).

The important point about the above is that n^{-1} S_n(A, B, ω) is bounded away from zero,
whereas n^{-1} Σ_{t=1}^n sin(2ωt) and n^{-1} Σ_{t=1}^n cos(2ωt) both converge to zero (at the rate n^{-1},
though not uniformly over ω); use identity (3.16). Thus S_n(A, B, ω) is the dominant
term in −2 L_n(A, B, ω):

−2 L_n(A, B, ω) = S_n(A, B, ω) + O(1).

Thus, ignoring the O(1) term and differentiating S_n(A, B, ω) with respect to A and B (keeping ω
fixed) gives the estimators

( Â_n(ω) ; B̂_n(ω) ) = (2/n) Σ_{t=1}^n X_t ( cos(ωt) ; sin(ωt) ).

Thus we have "profiled out" the nuisance parameters A and B.

Substituting these into S_n gives S_n(Â_n(ω), B̂_n(ω), ω), and we have

L_n(Â_n(ω), B̂_n(ω), ω) = −(1/2) S_p(ω) + O(1),

where

S_p(ω) = Σ_{t=1}^n X_t² − 2 Σ_{t=1}^n X_t ( Â_n(ω) cos(ωt) + B̂_n(ω) sin(ωt) ) + (n/2)( Â_n(ω)² + B̂_n(ω)² )
       = Σ_{t=1}^n X_t² − (n/2)( Â_n(ω)² + B̂_n(ω)² ).

Thus

arg max_ω L_n(Â_n(ω), B̂_n(ω), ω) ≈ arg max_ω −(1/2) S_p(ω) = arg max_ω ( Â_n(ω)² + B̂_n(ω)² ).

Hence

ω̂_n = arg max_ω −(1/2) S_p(ω) = arg max_ω ( Â_n(ω)² + B̂_n(ω)² ) = arg max_ω | Σ_{t=1}^n X_t exp(itω) |²,

which is easily evaluated (using a basic grid search).

(iv) Differentiating both sides of (3.15) with respect to Ω and considering the real and
imaginary parts gives Σ_{t=1}^n t cos(Ωt) = O(n) and Σ_{t=1}^n t sin(Ωt) = O(n). Differentiating
both sides of (3.15) twice with respect to Ω gives the second pair of bounds.

(v) In order to obtain the rate of convergence of the estimators ω̂, Â(ω̂), B̂(ω̂), we
evaluate the Fisher information of L_n (the inverse of which will give us the limiting rate
of convergence). For convenience, rather than take the second derivative of L_n we
evaluate the second derivative of S_n(A, B, ω) (in the limit the second derivatives of
−2L_n and of S_n(A, B, ω) coincide).

Differentiating S_n(A, B, ω) = Σ_{t=1}^n X_t² − 2 Σ_{t=1}^n X_t ( A cos(ωt) + B sin(ωt) ) + (1/2) n (A² + B²)
twice with respect to A, B and ω gives

∂S_n/∂A = −2 Σ_{t=1}^n X_t cos(ωt) + An
∂S_n/∂B = −2 Σ_{t=1}^n X_t sin(ωt) + Bn
∂S_n/∂ω = 2 Σ_{t=1}^n A X_t t sin(ωt) − 2 Σ_{t=1}^n B X_t t cos(ωt),

and ∂²S_n/∂A² = n, ∂²S_n/∂B² = n, ∂²S_n/∂A∂B = 0,

∂²S_n/∂ω∂A = 2 Σ_{t=1}^n X_t t sin(ωt)
∂²S_n/∂ω∂B = −2 Σ_{t=1}^n X_t t cos(ωt)
∂²S_n/∂ω²  = 2 Σ_{t=1}^n t² X_t ( A cos(ωt) + B sin(ωt) ).

Now, taking expectations of the above and using (iv), we have

E( ∂²S_n/∂ω∂A ) = 2 Σ_{t=1}^n t sin(ωt) ( A cos(ωt) + B sin(ωt) )
= 2B Σ_{t=1}^n t sin²(ωt) + 2A Σ_{t=1}^n t sin(ωt) cos(ωt)
= B Σ_{t=1}^n t (1 − cos(2ωt)) + A Σ_{t=1}^n t sin(2ωt) = B n(n+1)/2 + O(n) = B n²/2 + O(n).

Using a similar argument we can show that E( ∂²S_n/∂ω∂B ) = −A n²/2 + O(n) and

E( ∂²S_n/∂ω² ) = 2 Σ_{t=1}^n t² ( A cos(ωt) + B sin(ωt) )²
= (A² + B²) n(n+1)(2n+1)/6 + O(n²) = (A² + B²) n³/3 + O(n²).

Since E( −∇² L_n ) ≈ (1/2) E( ∇² S_n ), this gives the required result.

(vi) Noting that the asymptotic variance of the profile likelihood estimator ω̂_n is

( I_ωω − I_ω,(A,B) I_{(A,B),(A,B)}^{-1} I_{(A,B),ω} )^{-1},

by substituting the entries from (v) (recalling that I(A, B, ω) ≈ (1/2) E(∇²S_n)) into the above we have

( (A² + B²) n³/6 − (A² + B²) n³/8 + O(n²) )^{-1} = ( (A² + B²) n³/24 + O(n²) )^{-1} ≈ 24 / ( (A² + B²) n³ ).

Thus we observe that the asymptotic variance of ω̂_n is O(n^{-3}).

Typically estimators have a variance of order O(n^{-1}), so we see that the estimator
ω̂_n converges to the true parameter far faster than expected. Thus the estimator
is extremely good compared with the majority of parameter estimators.

Exercise 3.2 Run a simulation study to illustrate the above example.

Evaluate I_n(ω_k) = | Σ_{t=1}^n X_t e^{it 2πk/n} |² for all ω_k = 2πk/n using the fft function in R
(this evaluates { Σ_{t=1}^n X_t e^{it 2πk/n} }_{k=1}^n); then take its absolute square. Find the maximum
over the sequence using the function which.max. This will estimate ω̂_n. From this, estimate
A and B. However, ω̂_n will only estimate ω to O_p(n^{-1}), since we have discretized the
frequencies. To improve on this, one can use one further iteration; see
http://www.jstor.org/stable/pdf/2334314.pdf for the details.

Run the above over several realisations and evaluate the average squared error.
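
A minimal sketch of such a simulation in R (the true parameter values, the number of replications and the restriction of the search to 0 < ω < π are illustrative choices):

set.seed(1)
n <- 512; A <- 2; B <- 1; omega <- 1.3                     # illustrative true values
ests <- replicate(500, {
  X <- A * cos(omega * (1:n)) + B * sin(omega * (1:n)) + rnorm(n)
  I.n <- Mod(fft(X))^2                                     # periodogram |sum_t X_t e^{-it omega_k}|^2
  k <- which.max(I.n[2:floor(n / 2)])                      # search the Fourier frequencies in (0, pi)
  2 * pi * k / n                                           # estimate of omega on the grid 2*pi*k/n
})
mean((ests - omega)^2)                                     # average squared error over realisations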

3.2.2 An application of profiling in survival analysis


This application uses some methods from Survival Analysis, which is covered later in this
course.

Let T_i denote the survival time of an electrical component (we cover survival functions
in Chapter 6.1). Often, for each survival time, there are known regressors x_i which are
believed to influence the survival time T_i. The survival function is defined as

P(T_i > t) = F_i(t),   t ≥ 0.

It is clear from the definition that what defines a survival function is that F_i(t) is positive,
F_i(0) = 1 and F_i(∞) = 0. The density is easily derived from the survival function by taking
the negative derivative: f_i(t) = −dF_i(t)/dt.

To model the influence the regressors have on the survival time, the Cox proportional
hazards model is often used, with the exponential distribution as the baseline distribution
and ψ(x_i; β) a positive "link" function (typically we use ψ(x_i; β) = exp(βx_i) as the link
function). More precisely, the survival function of T_i is

F_i(t) = F_0(t)^{ψ(x_i; β)},

where F_0(t) = exp(−t/θ). Not all the survival times of the electrical components are
observed; censoring can arise. Hence we observe Y_i = min(T_i, c_i), where c_i is the
(non-random) censoring time, and δ_i, where δ_i is the indicator variable: δ_i = 1 denotes
censoring of the ith component and δ_i = 0 denotes that it is not censored. The parameters
β and θ are unknown.

(i) Derive the log-likelihood of {(Y_i, δ_i)}.

(ii) Compute the profile likelihood of the regression parameter β, profiling out the
baseline parameter θ.

Solution

(i) The survival function and the density are

f_i(t) = ψ(x_i; β) F_0(t)^{ψ(x_i; β) − 1} f_0(t)   and   F_i(t) = F_0(t)^{ψ(x_i; β)}.

Thus, for this example, the logarithms of the density and of the survival function are

log f_i(t) = log ψ(x_i; β) + ( ψ(x_i; β) − 1 ) log F_0(t) + log f_0(t)
           = log ψ(x_i; β) − ( ψ(x_i; β) − 1 ) t/θ − log θ − t/θ,
log F_i(t) = ψ(x_i; β) log F_0(t) = −ψ(x_i; β) t/θ.

Since

f_i(y_i, δ_i) = { f_i(y_i) = ψ(x_i; β) F_0(y_i)^{ψ(x_i; β) − 1} f_0(y_i),   δ_i = 0
                  F_i(y_i) = F_0(y_i)^{ψ(x_i; β)},                        δ_i = 1,

the log-likelihood of (β, θ) based on {(Y_i, δ_i)} is

L_n(β, θ) = Σ_{i=1}^n (1 − δ_i) [ log ψ(x_i; β) + log f_0(Y_i) + ( ψ(x_i; β) − 1 ) log F_0(Y_i) ]
            + Σ_{i=1}^n δ_i ψ(x_i; β) log F_0(Y_i)
          = Σ_{i=1}^n (1 − δ_i) [ log ψ(x_i; β) − log θ − ( ψ(x_i; β) − 1 ) Y_i/θ − Y_i/θ ]
            − Σ_{i=1}^n δ_i ψ(x_i; β) Y_i/θ
          = Σ_{i=1}^n (1 − δ_i) [ log ψ(x_i; β) − log θ ] − Σ_{i=1}^n ψ(x_i; β) Y_i/θ.

Differentiating the above with respect to β and θ gives

∂L_n/∂β = Σ_{i=1}^n (1 − δ_i) ∇_β ψ(x_i; β)/ψ(x_i; β) − Σ_{i=1}^n ∇_β ψ(x_i; β) Y_i/θ
∂L_n/∂θ = −Σ_{i=1}^n (1 − δ_i)/θ + Σ_{i=1}^n ψ(x_i; β) Y_i/θ²,

which is not simple to solve.

(ii) Instead we keep β fixed, differentiate the likelihood with respect to θ and equate
to zero; this gives

∂L_n/∂θ = −Σ_{i=1}^n (1 − δ_i)/θ + Σ_{i=1}^n ψ(x_i; β) Y_i/θ² = 0

and

θ̂(β) = Σ_{i=1}^n ψ(x_i; β) Y_i / Σ_{i=1}^n (1 − δ_i).

This gives us the best estimator of θ for a given β. Next we find the best estimator
of β. The profile likelihood (after profiling out θ) is

ℓ_P(β) = L_n(β, θ̂(β)) = Σ_{i=1}^n (1 − δ_i) [ log ψ(x_i; β) − log θ̂(β) ] − Σ_{i=1}^n ψ(x_i; β) Y_i/θ̂(β).

Hence, to obtain the ML estimator of β we maximise the above with respect to β;
this gives us β̂, which in turn gives us the MLE θ̂(β̂).
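
A small R sketch of this profiling step, using the exponential link ψ(x; β) = exp(βx) (the simulated data, censoring mechanism and parameter values are all illustrative):

set.seed(1)
n <- 200; beta0 <- 0.5; theta0 <- 2
x <- rnorm(n)
psi <- function(x, beta) exp(beta * x)                        # link function
surv.time <- rexp(n, rate = psi(x, beta0) / theta0)           # so that F_i(t) = exp(-t/theta)^psi(x_i; beta)
cens <- runif(n, 0, 4)                                        # censoring times
Y <- pmin(surv.time, cens); delta <- as.numeric(surv.time > cens)   # delta = 1 means censored

# Profile log-likelihood of beta: theta is profiled out in closed form
theta.hat <- function(beta) sum(psi(x, beta) * Y) / sum(1 - delta)
profile.loglik <- function(beta) {
  th <- theta.hat(beta)
  sum((1 - delta) * (log(psi(x, beta)) - log(th))) - sum(psi(x, beta) * Y) / th
}

beta.hat <- optimize(profile.loglik, interval = c(-5, 5), maximum = TRUE)$maximum
c(beta = beta.hat, theta = theta.hat(beta.hat))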

3.2.3 An application of profiling in semi-parametric regression


Here we apply the profile "likelihood" (we use inverted commas here because we do not
use the likelihood, but least squares instead) to semi-parametric regression. Recently this
type of method has been used widely in various semi-parametric models. This application
requires a little knowledge of nonparametric regression, which is considered later in this
course. Suppose we observe (Y_i, U_i, X_i) where

Y_i = βX_i + φ(U_i) + ε_i,

(X_i, U_i, ε_i) are iid random variables and φ is an unknown function. Before analyzing the
model we summarize some of its interesting properties:

• When a model does not have a parametric form (i.e. a finite number of parameters
cannot describe the model), then we cannot usually obtain the usual O(n^{-1/2}) rate.
We see in the above model that φ(·) does not have a parametric form, thus we cannot
expect an estimator of it to be √n-consistent.

• The model above contains βX_i, which does have a parametric form. Can we obtain
a √n-consistent estimator of β?

The Nadaraya-Watson estimator

Suppose

Y_i = φ(U_i) + ε_i,

where U_i, ε_i are iid random variables. A classical method for estimating φ(·) is the
Nadaraya-Watson estimator. This is basically a local least squares estimator of φ(u).
The estimator φ̂_n(u) is defined as

φ̂_n(u) = arg min_a Σ_i W_b(u − U_i)( Y_i − a )² = Σ_i W_b(u − U_i) Y_i / Σ_i W_b(u − U_i),

where W(·) is a kernel (think local window function) with ∫ W(x)dx = 1 and W_b(u) =
b^{-1} W(u/b), with b → 0 as n → ∞; thus the window gets narrower and more localized
as the sample size grows. Dividing by Σ_i W_b(u − U_i) "removes" the clustering in the
locations {U_i}.
Note that the above can also be treated as an estimator of

E(Y | U = u) = ∫_R y f_{Y|U}(y|u) dy = ∫_R y f_{Y,U}(y, u)/f_U(u) dy = φ(u),

where we replace f_{Y,U} and f_U with

f̂_{Y,U}(y, u) = (1/n) Σ_{i=1}^n δ_{Y_i}(y) W_b(u − U_i)
f̂_U(u)      = (1/n) Σ_{i=1}^n W_b(u − U_i),

with δ_{Y_i}(y) denoting the Dirac delta function (a point mass at Y_i). Note that the above is true because

∫_R y f̂_{Y,U}(y, u)/f̂_U(u) dy = (1/f̂_U(u)) ∫_R y f̂_{Y,U}(y, u) dy
= (1/f̂_U(u)) ∫_R y (1/n) Σ_{i=1}^n δ_{Y_i}(y) W_b(u − U_i) dy
= (1/f̂_U(u)) (1/n) Σ_{i=1}^n W_b(u − U_i) ∫_R y δ_{Y_i}(y) dy
= Σ_i W_b(u − U_i) Y_i / Σ_i W_b(u − U_i),

since ∫_R y δ_{Y_i}(y) dy = Y_i.

The Nadaraya-Watson estimator is a nonparametric estimator and suffers from a far
slower rate of convergence to the nonparametric function than parametric estimators.
The rate is usually (depending on the smoothness of φ and the density of U)

| φ̂_n(u) − φ(u) |² = O_p( 1/(bn) + b⁴ ).

Since b → 0 and bn → ∞ as n → ∞, we see this is far slower than the corresponding parametric
(squared error) rate O_p(n^{-1}). Heuristically, this is because not all n observations are used to
estimate φ(·) at any particular point u (the number used is about bn).
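
A minimal R sketch of the Nadaraya-Watson estimator with a Gaussian kernel (the choice of kernel, the bandwidth and the simulated data are illustrative):

set.seed(1)
n <- 400
U <- runif(n); Y <- sin(2 * pi * U) + rnorm(n, sd = 0.3)   # phi(u) = sin(2*pi*u), illustrative

nw <- function(u, U, Y, b) {
  w <- dnorm((u - U) / b) / b          # W_b(u - U_i) with W the Gaussian kernel
  sum(w * Y) / sum(w)
}

u.grid <- seq(0, 1, length.out = 101)
phi.hat <- sapply(u.grid, nw, U = U, Y = Y, b = 0.05)      # phi-hat_n(u) over a grid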

Estimating β using the Nadaraya-Watson estimator and profiling

To estimate β, we first profile out φ(·) (this is the nuisance parameter), which we estimate
as if β were known. In other words, we suppose that β were known and let

Y_i(β) = Y_i − βX_i = φ(U_i) + ε_i.

We then estimate φ(·) using the Nadaraya-Watson estimator, in other words the φ(·)
which minimises the criterion

φ̂_β(u) = arg min_a Σ_i W_b(u − U_i)( Y_i(β) − a )² = Σ_i W_b(u − U_i) Y_i(β) / Σ_i W_b(u − U_i)
        = Σ_i W_b(u − U_i) Y_i / Σ_i W_b(u − U_i) − β Σ_i W_b(u − U_i) X_i / Σ_i W_b(u − U_i)
        := G_b(u) − β H_b(u),   (3.17)

where

G_b(u) = Σ_i W_b(u − U_i) Y_i / Σ_i W_b(u − U_i)   and   H_b(u) = Σ_i W_b(u − U_i) X_i / Σ_i W_b(u − U_i).

Thus, given β, the estimator of φ and the residuals ε_i are

φ̂_β(u) = G_b(u) − β H_b(u)

and

ε̂_i = Y_i − βX_i − φ̂_β(U_i).

Given the estimated residuals Y_i − βX_i − φ̂_β(U_i), we can now use least squares to estimate the
coefficient β. We define the least squares criterion

L_n(β) = Σ_i ( Y_i − βX_i − φ̂_β(U_i) )²
       = Σ_i ( Y_i − βX_i − G_b(U_i) + β H_b(U_i) )²
       = Σ_i ( Y_i − G_b(U_i) − β [ X_i − H_b(U_i) ] )².

Therefore, the least squares estimator of β is

β̂_{b,n} = Σ_i [ Y_i − G_b(U_i) ][ X_i − H_b(U_i) ] / Σ_i [ X_i − H_b(U_i) ]².

Using β̂_{b,n} we can then estimate φ(·) via (3.17). We observe how we have used the principle of
profiling to estimate the unknown parameters. There is a large literature on this, including
Wahba, Speckman, Carroll, Fan etc. In particular, it has been shown that under some
conditions on b (as n → ∞), the estimator β̂_{b,n} has the usual √n rate of convergence.
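
Continuing the R sketch above, the profiling estimator of β can be computed directly; the function nw() and the bandwidth are those of the earlier sketch, and the simulated data are again illustrative:

set.seed(2)
n <- 400; beta0 <- 1.5; b <- 0.05
X <- rnorm(n); U <- runif(n)
Y <- beta0 * X + sin(2 * pi * U) + rnorm(n, sd = 0.3)

G <- sapply(U, nw, U = U, Y = Y, b = b)   # G_b(U_i): smooth Y on U
H <- sapply(U, nw, U = U, Y = X, b = b)   # H_b(U_i): smooth X on U

# Least squares on the residual-like quantities Y_i - G_b(U_i) and X_i - H_b(U_i)
beta.hat <- sum((Y - G) * (X - H)) / sum((X - H)^2)
beta.hat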
It should be mentioned that using random regressors U_i is not necessary. It could be
that U_i = i/n (the observations lie on a grid). In this case

n^{-1} Σ_i W_b(u − i/n) = (1/(nb)) Σ_{i=1}^n W( (u − i/n)/b ) = b^{-1} ∫ W( (u − x)/b ) dx + O((bn)^{-1}) = 1 + O((bn)^{-1})

(with a change of variables). This gives

φ̂_β(u) = arg min_a Σ_i W_b(u − i/n)( Y_i(β) − a )² = Σ_i W_b(u − i/n) Y_i(β) / Σ_i W_b(u − i/n)
        ≈ (1/n) Σ_i W_b(u − i/n) Y_i − β (1/n) Σ_i W_b(u − i/n) X_i
        := G_b(u) − β H_b(u),   (3.18)

where

G_b(u) = (1/n) Σ_i W_b(u − i/n) Y_i   and   H_b(u) = (1/n) Σ_i W_b(u − i/n) X_i.

Using the above estimator of φ(·) we continue as before.
