Profile Likelihood Method
To estimate $\psi$ and $\lambda$ one can use $(\hat{\psi}_n, \hat{\lambda}_n) = \arg\max_{\psi,\lambda} L_n(\psi, \lambda)$. However, this can be difficult to maximise directly. Instead let us consider a different method, which may, sometimes, be easier to evaluate. Suppose, for now, that $\psi$ is known; then we rewrite the likelihood as $L_n(\psi, \lambda) = L_\psi(\lambda)$ (to show that $\psi$ is fixed but $\lambda$ varies). To estimate $\lambda$ we maximise $L_\psi(\lambda)$ with respect to $\lambda$, i.e.
\[
\hat{\lambda}_\psi = \arg\max_\lambda L_\psi(\lambda).
\]
In reality $\psi$ is unknown, hence for each $\psi$ we can evaluate $\hat{\lambda}_\psi$. Note that for each $\psi$, we have a new curve $L_\psi(\lambda)$ over $\lambda$. Now to estimate $\psi$, we evaluate the maximum of $L_\psi(\lambda)$ over $\lambda$, and choose the $\psi$ which is the maximum over all these curves. In other words, we evaluate
\[
\hat{\psi}_n = \arg\max_\psi L_\psi(\hat{\lambda}_\psi) = \arg\max_\psi L_n(\psi, \hat{\lambda}_\psi).
\]
A bit of logical deduction shows that $\hat{\psi}_n$ and $\hat{\lambda}_{\hat{\psi}_n}$ are the maximum likelihood estimators, i.e. $(\hat{\psi}_n, \hat{\lambda}_n) = \arg\max_{\psi,\lambda} L_n(\psi, \lambda)$.
We note that we have profiled out the nuisance parameter $\lambda$, and the profile likelihood $L_\psi(\hat{\lambda}_\psi) = L_n(\psi, \hat{\lambda}_\psi)$ is in terms of the parameter of interest $\psi$.
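Schematically, the two-step maximisation can be coded as follows. This is a minimal sketch (not from the notes), using the normal mean/variance model as a stand-in with $\psi$ the mean (parameter of interest) and $\lambda$ the variance (nuisance); all function and variable names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=2.0, size=200)

def neg_loglik(psi, lam, x):
    # -L_n(psi, lam): psi is the mean, lam the variance of a normal sample,
    # standing in for a generic (interest, nuisance) pair.
    return 0.5 * np.sum((x - psi) ** 2) / lam + 0.5 * len(x) * np.log(lam)

def neg_profile_loglik(psi, x):
    # Inner step: maximise L_n(psi, lam) over the nuisance lam with psi held fixed,
    # returning -L_n(psi, lam_hat_psi).
    inner = minimize_scalar(lambda lam: neg_loglik(psi, lam, x),
                            bounds=(1e-6, 100.0), method="bounded")
    return inner.fun

# Outer step: maximise the profile likelihood over psi.
outer = minimize_scalar(lambda psi: neg_profile_loglik(psi, data),
                        bounds=(-10.0, 10.0), method="bounded")
print("profile MLE of psi:", outer.x)   # agrees with the joint MLE (the sample mean)
```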
The advantage of this procedure is best illustrated through some examples.
Example 3.1.1 (The Weibull distribution) Let us suppose that $\{Y_i\}$ are iid random variables from a Weibull distribution with density $f(y; \alpha, \theta) = \frac{\alpha y^{\alpha-1}}{\theta^{\alpha}}\exp(-(y/\theta)^{\alpha})$. We know from Example 2.2.2 that if $\alpha$ were known, an explicit expression for the MLE of $\theta$ can be derived; it is
\[
\hat{\theta}_\alpha = \Big(\frac{1}{n}\sum_{i=1}^n Y_i^{\alpha}\Big)^{1/\alpha}.
\]
Substituting $\hat{\theta}_\alpha$ into the likelihood gives the profile likelihood $L_n(\alpha, \hat{\theta}_\alpha)$, which is a function of $\alpha$ alone; maximising it over $\alpha$ yields the estimator $\hat{\alpha}_n$. Therefore, the maximum likelihood estimator of $\theta$ is $(\frac{1}{n}\sum_{i=1}^n Y_i^{\hat{\alpha}_n})^{1/\hat{\alpha}_n}$. We observe that evaluating $\hat{\alpha}_n$ can be tricky, but it is no worse than maximising the likelihood $L_n(\alpha, \theta)$ over $\alpha$ and $\theta$ jointly.
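As a rough numerical illustration of this profiling step (simulated data, not from the notes): the closed-form $\hat{\theta}_\alpha$ is substituted into the log-likelihood and the resulting one-dimensional profile likelihood is maximised over $\alpha$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
y = rng.weibull(a=1.5, size=300) * 2.0      # simulated data: shape alpha=1.5, scale theta=2.0

def neg_profile_loglik(alpha, y):
    n = len(y)
    theta_alpha = np.mean(y ** alpha) ** (1.0 / alpha)     # closed-form theta_hat for fixed alpha
    # Weibull log-likelihood with theta replaced by theta_hat(alpha)
    ll = (n * np.log(alpha) - n * alpha * np.log(theta_alpha)
          + (alpha - 1) * np.sum(np.log(y)) - np.sum((y / theta_alpha) ** alpha))
    return -ll

res = minimize_scalar(lambda a: neg_profile_loglik(a, y),
                      bounds=(0.1, 10.0), method="bounded")
alpha_hat = res.x
theta_hat = np.mean(y ** alpha_hat) ** (1.0 / alpha_hat)
print("alpha_hat:", alpha_hat, "theta_hat:", theta_hat)
```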
As we mentioned above, we are not interested in the nuisance parameter $\lambda$ and are only interested in testing and constructing CIs for $\psi$. In this case, we are interested in the limiting distribution of the MLE $\hat{\psi}_n$. Using Theorem 2.6.2(ii) we have
\[
\sqrt{n}\begin{pmatrix} \hat{\psi}_n - \psi_0 \\ \hat{\lambda}_n - \lambda_0 \end{pmatrix} \overset{D}{\to} N\left(0, \begin{pmatrix} I_{\psi\psi} & I_{\psi\lambda} \\ I_{\lambda\psi} & I_{\lambda\lambda} \end{pmatrix}^{-1}\right),
\]
where
\[
\begin{pmatrix} I_{\psi\psi} & I_{\psi\lambda} \\ I_{\lambda\psi} & I_{\lambda\lambda} \end{pmatrix}
= \begin{pmatrix} \mathrm{E}\big(-\frac{\partial^2 \log f(X_i;\psi,\lambda)}{\partial \psi^2}\big) & \mathrm{E}\big(-\frac{\partial^2 \log f(X_i;\psi,\lambda)}{\partial \psi \partial \lambda}\big) \\ \mathrm{E}\big(-\frac{\partial^2 \log f(X_i;\psi,\lambda)}{\partial \lambda \partial \psi}\big) & \mathrm{E}\big(-\frac{\partial^2 \log f(X_i;\psi,\lambda)}{\partial \lambda^2}\big) \end{pmatrix}. \qquad (3.1)
\]
To derive an exact expression for the limiting variance of $\sqrt{n}(\hat{\psi}_n - \psi_0)$, we use the block inverse matrix identity, which gives
\[
\sqrt{n}(\hat{\psi}_n - \psi_0) \overset{D}{\to} N\big(0, (I_{\psi\psi} - I_{\psi\lambda} I_{\lambda\lambda}^{-1} I_{\lambda\psi})^{-1}\big). \qquad (3.3)
\]
Thus if $\psi$ is a scalar we can use the above to construct confidence intervals for $\psi$.
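A quick numerical sanity check of the block-inverse step (toy numbers, purely illustrative): the $(\psi, \psi)$ entry of $I(\theta)^{-1}$ obtained by direct inversion agrees with the inverse of the Schur complement $I_{\psi\psi} - I_{\psi\lambda} I_{\lambda\lambda}^{-1} I_{\lambda\psi}$.

```python
import numpy as np

# Illustrative information matrix with scalar psi and scalar lambda
I_pp, I_pl, I_lp, I_ll = 2.0, 0.8, 0.8, 1.5
I_full = np.array([[I_pp, I_pl],
                   [I_lp, I_ll]])

# Asymptotic variance of sqrt(n)(psi_hat - psi_0): the (psi, psi) entry of I^{-1} ...
var_psi_direct = np.linalg.inv(I_full)[0, 0]
# ... which by the block-inverse identity equals the inverse Schur complement
var_psi_schur = 1.0 / (I_pp - I_pl * I_lp / I_ll)

print(var_psi_direct, var_psi_schur)   # identical up to rounding
```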
3.1.2 The score and the log-likelihood ratio for the profile likelihood
To ease notation, let us suppose that $\psi_0$ and $\lambda_0$ are the true parameters in the distribution. We now consider the log-likelihood ratio
\[
2\Big\{\max_{\psi,\lambda} L_n(\psi, \lambda) - \max_\lambda L_n(\psi_0, \lambda)\Big\}, \qquad (3.4)
\]
where $\psi_0$ is the true parameter. However, deriving the limiting distribution of this statistic is a little more complicated than for the log-likelihood ratio test that does not involve nuisance parameters. This is because directly applying a Taylor expansion does not work, since such an expansion is usually made about the true parameters. We observe that
\[
2\Big\{\max_{\psi,\lambda} L_n(\psi,\lambda) - \max_\lambda L_n(\psi_0,\lambda)\Big\}
= \underbrace{2\Big\{\max_{\psi,\lambda} L_n(\psi,\lambda) - L_n(\psi_0,\lambda_0)\Big\}}_{\approx \chi^2_{p+q}} - \underbrace{2\Big\{\max_\lambda L_n(\psi_0,\lambda) - L_n(\psi_0,\lambda_0)\Big\}}_{\approx \chi^2_{q}}.
\]
It seems reasonable that the difference may be a $\chi^2_p$, but it is really not clear why. Below we show, by using a few Taylor expansions, why this is true.
In the theorem below we will derive the distribution of the score and the nested log-likelihood ratio.
Theorem 3.1.1 Suppose Assumption 2.6.1 holds and that $(\psi_0, \lambda_0)$ are the true parameters. Then we have
\[
\frac{\partial L_n(\psi,\lambda)}{\partial \psi}\Big|_{(\psi_0, \hat{\lambda}_{\psi_0})} \approx \frac{\partial L_n(\psi,\lambda)}{\partial \psi}\Big|_{(\psi_0, \lambda_0)} - I_{\psi\lambda} I_{\lambda\lambda}^{-1}\frac{\partial L_n(\psi,\lambda)}{\partial \lambda}\Big|_{(\psi_0, \lambda_0)}, \qquad (3.5)
\]
\[
\frac{1}{\sqrt{n}}\frac{\partial L_n(\psi,\lambda)}{\partial \psi}\Big|_{(\psi_0, \hat{\lambda}_{\psi_0})} \overset{D}{\to} N\big(0, (I_{\psi\psi} - I_{\psi\lambda} I_{\lambda\lambda}^{-1} I_{\lambda\psi})\big) \qquad (3.6)
\]
and
\[
2\Big\{\max_{\psi,\lambda} L_n(\psi,\lambda) - \max_\lambda L_n(\psi_0,\lambda)\Big\} \overset{D}{\to} \chi^2_p, \qquad (3.7)
\]
where $p$ denotes the dimension of $\psi$. This result is often called Wilks Theorem.
PROOF. We first prove (3.5), which is the basis of the proofs of (3.6) and (3.7). To avoid notational difficulties we will assume that $\frac{\partial L_n(\psi,\lambda)}{\partial \psi}\big|_{(\psi_0,\hat{\lambda}_{\psi_0})}$ and $\frac{\partial L_n(\psi,\lambda)}{\partial \psi}\big|_{(\psi_0,\lambda_0)}$ are univariate random variables.
Our objective is to find an expression for $\frac{\partial L_n(\psi,\lambda)}{\partial \psi}\big|_{(\psi_0,\hat{\lambda}_{\psi_0})}$ in terms of $\frac{\partial L_n(\psi,\lambda)}{\partial \psi}\big|_{(\psi_0,\lambda_0)}$ and $\frac{\partial L_n(\psi,\lambda)}{\partial \lambda}\big|_{(\psi_0,\lambda_0)}$, which will allow us to obtain its variance and asymptotic distribution.
Making a Taylor expansion of $\frac{\partial L_n(\psi,\lambda)}{\partial \psi}\big|_{(\psi_0,\hat{\lambda}_{\psi_0})}$ about $\frac{\partial L_n(\psi,\lambda)}{\partial \psi}\big|_{(\psi_0,\lambda_0)}$ gives
\[
\frac{\partial L_n(\psi,\lambda)}{\partial \psi}\Big|_{(\psi_0,\hat{\lambda}_{\psi_0})} \approx \frac{\partial L_n(\psi,\lambda)}{\partial \psi}\Big|_{(\psi_0,\lambda_0)} + (\hat{\lambda}_{\psi_0} - \lambda_0)\frac{\partial^2 L_n(\psi,\lambda)}{\partial \lambda \partial \psi}\Big|_{(\psi_0,\lambda_0)}.
\]
Notice that we have used $\approx$ instead of $=$ because we have replaced the second derivative by its value at the true parameters. If the sample size is large enough then $\frac{\partial^2 L_n(\psi,\lambda)}{\partial \lambda \partial \psi}\big|_{(\psi_0,\lambda_0)} \approx \mathrm{E}\big(\frac{\partial^2 L_n(\psi,\lambda)}{\partial \lambda \partial \psi}\big|_{(\psi_0,\lambda_0)}\big)$; e.g. in the iid case we have
\[
\frac{1}{n}\frac{\partial^2 L_n(\psi,\lambda)}{\partial \lambda \partial \psi}\Big|_{(\psi_0,\lambda_0)} = \frac{1}{n}\sum_{i=1}^n \frac{\partial^2 \log f(X_i;\psi,\lambda)}{\partial \lambda \partial \psi}\Big|_{(\psi_0,\lambda_0)} \approx \mathrm{E}\Big(\frac{\partial^2 \log f(X_i;\psi,\lambda)}{\partial \lambda \partial \psi}\Big|_{(\psi_0,\lambda_0)}\Big) = -I_{\lambda\psi}.
\]
Therefore
\[
\frac{\partial L_n(\psi,\lambda)}{\partial \psi}\Big|_{(\psi_0,\hat{\lambda}_{\psi_0})} \approx \frac{\partial L_n(\psi,\lambda)}{\partial \psi}\Big|_{(\psi_0,\lambda_0)} - n(\hat{\lambda}_{\psi_0} - \lambda_0)I_{\lambda\psi}. \qquad (3.8)
\]
Next we obtain an expression for $(\hat{\lambda}_{\psi_0} - \lambda_0)$. Since $\hat{\lambda}_{\psi_0}$ maximises $L_n(\psi_0, \lambda)$ over $\lambda$, we have $\frac{\partial L_n(\psi_0,\lambda)}{\partial \lambda}\big|_{\hat{\lambda}_{\psi_0}} = 0$ (if the maximum is not on the boundary). Therefore making a Taylor expansion of $\frac{\partial L_n(\psi_0,\lambda)}{\partial \lambda}\big|_{\hat{\lambda}_{\psi_0}}$ about $\frac{\partial L_n(\psi_0,\lambda)}{\partial \lambda}\big|_{\lambda_0}$ gives
\[
\underbrace{\frac{\partial L_n(\psi_0,\lambda)}{\partial \lambda}\Big|_{\hat{\lambda}_{\psi_0}}}_{=0} \approx \frac{\partial L_n(\psi_0,\lambda)}{\partial \lambda}\Big|_{\lambda_0} + \frac{\partial^2 L_n(\psi_0,\lambda)}{\partial \lambda^2}\Big|_{\lambda_0}(\hat{\lambda}_{\psi_0} - \lambda_0).
\]
Replacing $\frac{\partial^2 L_n(\psi_0,\lambda)}{\partial \lambda^2}\big|_{\lambda_0}$ with $-nI_{\lambda\lambda}$ gives
\[
\frac{\partial L_n(\psi_0,\lambda)}{\partial \lambda}\Big|_{\lambda_0} - nI_{\lambda\lambda}(\hat{\lambda}_{\psi_0} - \lambda_0) \approx 0,
\]
and rearranging the above gives
\[
(\hat{\lambda}_{\psi_0} - \lambda_0) \approx \frac{I_{\lambda\lambda}^{-1}}{n}\frac{\partial L_n(\psi_0,\lambda)}{\partial \lambda}\Big|_{\lambda_0}. \qquad (3.9)
\]
Therefore substituting (3.9) into (3.8) gives
\[
\frac{\partial L_n(\psi,\lambda)}{\partial \psi}\Big|_{(\psi_0,\hat{\lambda}_{\psi_0})} \approx \frac{\partial L_n(\psi,\lambda)}{\partial \psi}\Big|_{(\psi_0,\lambda_0)} - I_{\psi\lambda} I_{\lambda\lambda}^{-1}\frac{\partial L_n(\psi,\lambda)}{\partial \lambda}\Big|_{(\psi_0,\lambda_0)},
\]
and thus we have proved (3.5).
To prove (3.6) we note that the above can be written as
\[
\frac{\partial L_n(\psi,\lambda)}{\partial \psi}\Big|_{(\psi_0,\hat{\lambda}_{\psi_0})} \approx \big(I, \; -I_{\psi\lambda} I_{\lambda\lambda}^{-1}\big)\frac{\partial L_n(\psi,\lambda)}{\partial \theta}\Big|_{(\psi_0,\lambda_0)}, \qquad (3.10)
\]
where $\theta = (\psi, \lambda)$. Now calculating the variance of the right-hand side of (3.10), using (3.1) together with the asymptotic normality of the score at the true parameters, gives (3.6).
Finally, to prove (3.7) we apply Taylor expansions to the decomposition
\[
\begin{aligned}
2\Big\{L_n(\hat{\psi}_n, \hat{\lambda}_n) - L_n(\psi_0, \hat{\lambda}_{\psi_0})\Big\}
&= 2\Big\{L_n(\hat{\psi}_n, \hat{\lambda}_n) - L_n(\psi_0, \lambda_0)\Big\} - 2\Big\{L_n(\psi_0, \hat{\lambda}_{\psi_0}) - L_n(\psi_0, \lambda_0)\Big\} \\
&\approx n(\hat{\theta}_n - \theta_0)' I(\theta_0)(\hat{\theta}_n - \theta_0) - n(\hat{\lambda}_{\psi_0} - \lambda_0)' I_{\lambda\lambda}(\hat{\lambda}_{\psi_0} - \lambda_0), \qquad (3.11)
\end{aligned}
\]
where, by (3.9) and the expansion of the score at the true parameters,
\[
(\hat{\lambda}_{\psi_0} - \lambda_0) \approx \frac{I_{\lambda\lambda}^{-1}}{n}\frac{\partial L_n(\psi_0,\lambda)}{\partial \lambda}\Big|_{(\psi_0,\lambda_0)} \approx I_{\lambda\lambda}^{-1}\Big(I_{\lambda\psi}(\hat{\psi}_n - \psi_0) + I_{\lambda\lambda}(\hat{\lambda}_n - \lambda_0)\Big) = I_{\lambda\lambda}^{-1}\big(I_{\lambda\psi}, \; I_{\lambda\lambda}\big)\big(\hat{\theta}_n - \theta_0\big).
\]
Substituting the above into (3.11) and making lots of cancellations we have
\[
2\Big\{L_n(\hat{\psi}_n, \hat{\lambda}_n) - L_n(\psi_0, \hat{\lambda}_{\psi_0})\Big\} \approx n(\hat{\psi}_n - \psi_0)'\big(I_{\psi\psi} - I_{\psi\lambda} I_{\lambda\lambda}^{-1} I_{\lambda\psi}\big)(\hat{\psi}_n - \psi_0).
\]
Finally, by using (3.3) we substitute $\sqrt{n}(\hat{\psi}_n - \psi_0) \overset{D}{\to} N\big(0, (I_{\psi\psi} - I_{\psi\lambda} I_{\lambda\lambda}^{-1} I_{\lambda\psi})^{-1}\big)$ into the above, which gives the desired result. $\Box$
Remark 3.1.2 (i) The limiting variance of $\sqrt{n}(\hat{\lambda}_{\psi_0} - \lambda_0)$ if $\psi_0$ were known is $I_{\lambda\lambda}^{-1}$, whereas the limiting variance of $\frac{1}{\sqrt{n}}\frac{\partial L_n(\psi,\lambda)}{\partial \psi}\big|_{(\psi_0,\hat{\lambda}_{\psi_0})}$ is $(I_{\psi\psi} - I_{\psi\lambda} I_{\lambda\lambda}^{-1} I_{\lambda\psi})$, and the limiting variance of $\sqrt{n}(\hat{\psi}_n - \psi_0)$ is $(I_{\psi\psi} - I_{\psi\lambda} I_{\lambda\lambda}^{-1} I_{\lambda\psi})^{-1}$.

(ii) It is useful to understand where the term $(I_{\psi\psi} - I_{\psi\lambda} I_{\lambda\lambda}^{-1} I_{\lambda\psi})$ comes from. Consider the problem of linear regression. Suppose $X$ and $Y$ are random variables and we want to construct the best linear predictor of $Y$ given $X$. We know that the best linear predictor is $\hat{Y}(X) = \frac{E(XY)}{E(X^2)}X$, and the residual and mean squared error are
\[
Y - \hat{Y}(X) = Y - \frac{E(XY)}{E(X^2)}X \quad\text{and}\quad E\Big(Y - \frac{E(XY)}{E(X^2)}X\Big)^2 = E(Y^2) - E(XY)E(X^2)^{-1}E(XY).
\]
Compare this expression with the variance in (3.6). We see that in some sense $\frac{\partial L_n(\psi,\lambda)}{\partial \psi}\big|_{(\psi_0,\hat{\lambda}_{\psi_0})}$ can be treated as the residual (error) of the projection of $\frac{\partial L_n(\psi,\lambda)}{\partial \psi}\big|_{(\psi_0,\lambda_0)}$ onto $\frac{\partial L_n(\psi,\lambda)}{\partial \lambda}\big|_{(\psi_0,\lambda_0)}$.
The same quantity can be used in the construction of confidence intervals. By using (3.7) we can construct CIs. For example, to construct a 95% CI for $\psi$ we can use the MLE $\hat{\theta}_n = (\hat{\psi}_n, \hat{\lambda}_n)$ and the profile likelihood (3.7) to give
\[
\Big\{\psi;\; 2\Big[L_n(\hat{\psi}_n, \hat{\lambda}_n) - L_n(\psi, \hat{\lambda}_\psi)\Big] \le \chi^2_p(0.95)\Big\}.
\]
Example 3.1.3 (The normal distribution and confidence intervals) This example
is taken from Davidson (2004), Example 4.31, p129.
We recall that the log-likelihood for $\{Y_i\}$, which are iid random variables from a normal distribution with mean $\mu$ and variance $\sigma^2$, is
\[
L_n(\mu, \sigma^2) = L_\mu(\sigma^2) = -\frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - \mu)^2 - \frac{n}{2}\log \sigma^2.
\]
Our aim is to use the log-likelihood ratio statistic, analogous to Section 2.8.1, to construct a CI for $\mu$. Thus we treat $\sigma^2$ as the nuisance parameter.
Keeping $\mu$ fixed, the maximum likelihood estimator of $\sigma^2$ is $\hat{\sigma}^2(\mu) = \frac{1}{n}\sum_{i=1}^n (Y_i - \mu)^2$. Rearranging $\hat{\sigma}^2(\mu)$ gives
\[
\hat{\sigma}^2(\mu) = \frac{n-1}{n}s^2\Big(1 + \frac{t_n^2(\mu)}{n-1}\Big),
\]
where $t_n^2(\mu) = n(\bar{Y} - \mu)^2/s^2$ and $s^2 = \frac{1}{n-1}\sum_{i=1}^n (Y_i - \bar{Y})^2$. Substituting $\hat{\sigma}^2(\mu)$ into $L_n(\mu, \sigma^2)$ gives the profile likelihood
\[
L_n(\mu, \hat{\sigma}^2(\mu)) = \underbrace{-\frac{1}{2\hat{\sigma}^2(\mu)}\sum_{i=1}^n (Y_i - \mu)^2}_{= -n/2} - \frac{n}{2}\log \hat{\sigma}^2(\mu)
= -\frac{n}{2} - \frac{n}{2}\log\Big\{\frac{n-1}{n}s^2\Big(1 + \frac{t_n^2(\mu)}{n-1}\Big)\Big\}.
\]
Therefore, using the same argument as in Section 2.8.1, the 95% confidence interval for the mean is
\[
\Big\{\mu;\; 2\Big[L_n(\hat{\mu}, \hat{\sigma}^2(\hat{\mu})) - L_n(\mu, \hat{\sigma}^2(\mu))\Big] \le \chi^2_1(0.95)\Big\} = \Big\{\mu;\; W_n(\mu) \le \chi^2_1(0.95)\Big\}
= \Big\{\mu;\; n\log\Big(1 + \frac{t_n^2(\mu)}{n-1}\Big) \le \chi^2_1(0.95)\Big\}.
\]
However, this is an asymptotic result. With the normal distribution we can obtain the exact distribution. We note that since $\log$ is a monotonic function, the log-likelihood ratio interval is equivalent to
\[
\big\{\mu;\; t_n^2(\mu) \le C_\alpha\big\},
\]
where $C_\alpha$ is an appropriate constant; since $t_n(\mu)$ has a $t$-distribution with $n-1$ degrees of freedom, choosing $C_\alpha$ to be the square of the 97.5% quantile of $t_{n-1}$ gives an exact 95% interval (the classical $t$-interval).
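A short simulation sketch (hypothetical data, not from the notes) comparing the two intervals: the profile log-likelihood ratio interval $\{\mu : n\log(1 + t_n^2(\mu)/(n-1)) \le \chi^2_1(0.95)\}$ computed on a grid, and the exact $t$-interval.

```python
import numpy as np
from scipy.stats import chi2, t

rng = np.random.default_rng(2)
y = rng.normal(loc=5.0, scale=3.0, size=50)
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

def W(mu):
    t2 = n * (ybar - mu) ** 2 / s2            # t_n(mu)^2
    return n * np.log(1.0 + t2 / (n - 1))     # profile log-likelihood ratio W_n(mu)

# Profile-likelihood 95% CI: all mu with W(mu) <= chi2_1(0.95), found on a grid
grid = np.linspace(ybar - 5, ybar + 5, 20001)
inside = grid[np.array([W(m) for m in grid]) <= chi2.ppf(0.95, df=1)]
print("profile LR CI:", inside.min(), inside.max())

# Exact t-interval for comparison
half = t.ppf(0.975, df=n - 1) * np.sqrt(s2 / n)
print("t interval   :", ybar - half, ybar + half)
```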
Exercise 3.1 Derive the test for independence (in the case of two-by-two tables) using the log-likelihood ratio test. More precisely, derive the asymptotic distribution of
\[
T = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \frac{(O_3 - E_3)^2}{E_3} + \frac{(O_4 - E_4)^2}{E_4}
\]
under the null that there is no association between the categorical variables $C$ and $R$, where $E_1 = n_3 \times n_1/N$, $E_2 = n_4 \times n_1/N$, $E_3 = n_3 \times n_2/N$ and $E_4 = n_4 \times n_2/N$. State any results that you use.

              C1      C2      Subtotal
  R1          O1      O2      n1
  R2          O3      O4      n2
  Subtotal    n3      n4      N
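For concreteness, the statistic $T$ can be computed as follows (made-up counts, purely illustrative; the degrees of freedom of the limiting chi-square are what the exercise asks you to derive, and for a two-by-two table they turn out to be one).

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical 2x2 table of observed counts O1..O4 (rows R1, R2; columns C1, C2)
O = np.array([[25.0, 15.0],
              [20.0, 40.0]])
row = O.sum(axis=1)            # n1, n2
col = O.sum(axis=0)            # n3, n4
N = O.sum()

E = np.outer(row, col) / N     # E_j = row total x column total / N
T = ((O - E) ** 2 / E).sum()   # the statistic in the exercise

# df = 1 for a 2x2 table (this is what the exercise asks you to verify)
print("T =", T, " chi2_1 95% point =", chi2.ppf(0.95, df=1))
```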
Pivotal Quantities
Pivotal quantities are statistics whose distribution does not depend on any unknown parameters. These include the t-ratio $t = \sqrt{n}(\bar{X} - \mu)/s_n \sim t_{n-1}$ (in the case the data is normal), the F-statistic, etc.
In many applications it is not possible to obtain a pivotal quantity, but a quantity can be asymptotically pivotal. The log-likelihood ratio statistic is one such example (since its asymptotic distribution is a chi-square, which is free of unknown parameters).
Pivotal statistics have many advantages. The main one is that they avoid the need to estimate extra parameters. Moreover, they are also useful in developing bootstrap methods, etc.
3.1.4 The score statistic in the presence of nuisance parameters
We recall that we used Theorem 3.1.1 to obtain the distribution of $2\{\max_{\psi,\lambda} L_n(\psi,\lambda) - \max_\lambda L_n(\psi_0,\lambda)\}$ under the null; we now consider the score test.
We recall that under the null $H_0: \psi = \psi_0$ the derivative $\frac{\partial L_n(\psi,\lambda)}{\partial \lambda}\big|_{(\psi_0, \hat{\lambda}_{\psi_0})} = 0$, but the derivative with respect to $\psi$, $\frac{\partial L_n(\psi,\lambda)}{\partial \psi}\big|_{(\psi_0, \hat{\lambda}_{\psi_0})}$, is not necessarily zero. However, by (3.6),
\[
n^{-1/2}\frac{\partial L_n(\psi,\lambda)}{\partial \psi}\Big|_{(\psi_0, \hat{\lambda}_{\psi_0})} \overset{D}{\to} N\big(0, I_{\psi\psi} - I_{\psi\lambda} I_{\lambda\lambda}^{-1} I_{\lambda\psi}\big), \qquad (3.14)
\]
which can be used as the basis of a test for $H_0: \psi = \psi_0$.
The log-likelihood ratio test and the score test are asymptotically equivalent. There
are advantages and disadvantages of both.
(i) An advantage of the log-likelihood ratio test is that we do not need to calculate the
information matrix.
(ii) An advantage of the score test is that we do not have to evaluate the maximum likelihood estimates under the alternative model.
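As a sketch of how the score test based on (3.14) might be organised in code (this is not from the notes; `score_psi` and the information blocks are user-supplied for whichever model is being tested, and the restricted MLE $\hat{\lambda}_{\psi_0}$ must be computed beforehand):

```python
import numpy as np
from scipy.stats import chi2

def score_test(score_psi, info_blocks, psi0, lam_hat_psi0, n, level=0.95):
    """Score test of H0: psi = psi0, with the nuisance lambda estimated under the null.

    score_psi(psi, lam): returns dL_n/dpsi at (psi, lam) as a 1-d array    (user supplied)
    info_blocks: (I_pp, I_pl, I_lp, I_ll), the blocks of (3.1), as 2-d arrays (user supplied)
    lam_hat_psi0: the restricted MLE of lambda when psi is fixed at psi0
    """
    I_pp, I_pl, I_lp, I_ll = info_blocks
    u = np.atleast_1d(score_psi(psi0, lam_hat_psi0)) / np.sqrt(n)   # n^{-1/2} dL_n/dpsi
    V = I_pp - I_pl @ np.linalg.solve(I_ll, I_lp)                   # variance in (3.14)
    stat = float(u @ np.linalg.solve(V, u))                         # ~ chi^2_p under H0
    p = len(u)
    return stat, 1.0 - chi2.cdf(stat, df=p), stat > chi2.ppf(level, df=p)
```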
3.2 Applications
Suppose that the observations $\{X_t;\, t = 1, \ldots, n\}$ satisfy the model
\[
X_t = A\cos(\omega t) + B\sin(\omega t) + \varepsilon_t,
\]
where $\{\varepsilon_t\}$ are iid standard normal random variables and $0 < \omega < \pi$ (thus allowing the case $\omega = \pi/2$, but not the end points $\omega = 0$ or $\pi$). The parameters $A$, $B$, and $\omega$ are real and unknown. Full details can be found in the paper http://www.jstor.org/stable/pdf/2334314.pdf (Walker, 1971, Biometrika).
(i) Ignoring constants, obtain the log-likelihood of $\{X_t\}$. Denote this likelihood as $L_n(A, B, \omega)$.
(ii) Let
\[
S_n(A, B, \omega) = \sum_{t=1}^n X_t^2 - 2\sum_{t=1}^n X_t\big(A\cos(\omega t) + B\sin(\omega t)\big) + \frac{1}{2}n(A^2 + B^2).
\]
Show that
\[
-2L_n(A, B, \omega) - S_n(A, B, \omega) = \frac{(A^2 - B^2)}{2}\sum_{t=1}^n \cos(2t\omega) + AB\sum_{t=1}^n \sin(2t\omega).
\]
Thus show that $|L_n(A, B, \omega) + \frac{1}{2}S_n(A, B, \omega)| = O(1)$ (i.e. the difference does not grow with $n$).
Since $L_n(A, B, \omega)$ and $-\frac{1}{2}S_n(A, B, \omega)$ are asymptotically equivalent, for the rest of this question, use $-\frac{1}{2}S_n(A, B, \omega)$ instead of the likelihood $L_n(A, B, \omega)$.
(v) By using the results in part (iv), show that the Fisher information of $L_n(A, B, \omega)$ (denoted as $I(A, B, \omega)$) is asymptotically equivalent to
\[
2I(A, B, \omega) = \mathrm{E}\big(\nabla^2 S_n\big) = \begin{pmatrix} n & 0 & \frac{n^2}{2}B + O(n) \\ 0 & n & -\frac{n^2}{2}A + O(n) \\ \frac{n^2}{2}B + O(n) & -\frac{n^2}{2}A + O(n) & \frac{n^3}{3}(A^2 + B^2) + O(n^2) \end{pmatrix}.
\]
(vi) Derive the asymptotic variance of the maximum likelihood estimator $\hat{\omega}_n$ derived in part (iv).
You may find the following identities useful:
\[
\sum_{t=1}^n \exp(i\Omega t) = \begin{cases} \frac{\exp(\frac{1}{2}i(n+1)\Omega)\sin(\frac{1}{2}n\Omega)}{\sin(\frac{1}{2}\Omega)} & 0 < \Omega < 2\pi \\ n & \Omega = 0 \text{ or } 2\pi, \end{cases} \qquad (3.16)
\]
the trigonometric identities $\sin(2\Omega) = 2\sin\Omega\cos\Omega$, $\cos(2\Omega) = 2\cos^2(\Omega) - 1 = 1 - 2\sin^2\Omega$, $\exp(i\Omega) = \cos(\Omega) + i\sin(\Omega)$, and
\[
\sum_{t=1}^n t = \frac{n(n+1)}{2}, \qquad \sum_{t=1}^n t^2 = \frac{n(n+1)(2n+1)}{6}.
\]
Solution
Since $\{\varepsilon_t\}$ are standard normal iid random variables, the log-likelihood (ignoring constants) is
\[
L_n(A, B, \omega) = -\frac{1}{2}\sum_{t=1}^n \big(X_t - A\cos(\omega t) - B\sin(\omega t)\big)^2.
\]
If the frequency $\omega$ were known, then the least squares estimator of $A$ and $B$ would be
\[
\begin{pmatrix} \hat{A} \\ \hat{B} \end{pmatrix} = \Big(\frac{1}{n}\sum_{t=1}^n x_t x_t'\Big)^{-1}\frac{1}{n}\sum_{t=1}^n X_t\begin{pmatrix} \cos(\omega t) \\ \sin(\omega t) \end{pmatrix},
\]
where $x_t = (\cos(\omega t), \sin(\omega t))'$. However, because the sine and cosine functions are nearly orthogonal over the sample, we have $n^{-1}\sum_{t=1}^n x_t x_t' \approx \frac{1}{2}I_2$, and therefore
\[
\begin{pmatrix} \hat{A} \\ \hat{B} \end{pmatrix} \approx \frac{2}{n}\sum_{t=1}^n X_t\begin{pmatrix} \cos(\omega t) \\ \sin(\omega t) \end{pmatrix},
\]
which is simple to evaluate! The above argument is not very precise. To make it precise we note that
\[
\begin{aligned}
-2L_n(A, B, \omega) &= \sum_{t=1}^n X_t^2 - 2\sum_{t=1}^n X_t\big(A\cos(\omega t) + B\sin(\omega t)\big) \\
&\quad + A^2\sum_{t=1}^n \cos^2(\omega t) + B^2\sum_{t=1}^n \sin^2(\omega t) + 2AB\sum_{t=1}^n \sin(\omega t)\cos(\omega t) \\
&= \sum_{t=1}^n X_t^2 - 2\sum_{t=1}^n X_t\big(A\cos(\omega t) + B\sin(\omega t)\big) \\
&\quad + \frac{A^2}{2}\sum_{t=1}^n\big(1 + \cos(2t\omega)\big) + \frac{B^2}{2}\sum_{t=1}^n\big(1 - \cos(2t\omega)\big) + AB\sum_{t=1}^n \sin(2t\omega) \\
&= \sum_{t=1}^n X_t^2 - 2\sum_{t=1}^n X_t\big(A\cos(\omega t) + B\sin(\omega t)\big) + \frac{n}{2}(A^2 + B^2) \\
&\quad + \frac{(A^2 - B^2)}{2}\sum_{t=1}^n \cos(2t\omega) + AB\sum_{t=1}^n \sin(2t\omega) \\
&= S_n(A, B, \omega) + \frac{(A^2 - B^2)}{2}\sum_{t=1}^n \cos(2t\omega) + AB\sum_{t=1}^n \sin(2t\omega),
\end{aligned}
\]
where
\[
S_n(A, B, \omega) = \sum_{t=1}^n X_t^2 - 2\sum_{t=1}^n X_t\big(A\cos(\omega t) + B\sin(\omega t)\big) + \frac{n}{2}(A^2 + B^2).
\]
The important point about the above is that $n^{-1}S_n(A, B, \omega)$ is bounded away from zero, however $n^{-1}\sum_{t=1}^n \sin(2\omega t)$ and $n^{-1}\sum_{t=1}^n \cos(2\omega t)$ both converge to zero (at the rate $n^{-1}$, though not uniformly over $\omega$); use identity (3.16). Thus $S_n(A, B, \omega)$ is the dominant term in $-2L_n(A, B, \omega)$.
Thus, ignoring the $O(1)$ term and differentiating $S_n(A, B, \omega)$ with respect to $A$ and $B$ (keeping $\omega$ fixed) gives the estimators
\[
\begin{pmatrix} \hat{A}_n(\omega) \\ \hat{B}_n(\omega) \end{pmatrix} = \frac{2}{n}\sum_{t=1}^n X_t\begin{pmatrix} \cos(\omega t) \\ \sin(\omega t) \end{pmatrix}.
\]
Thus we have "profiled out" the nuisance parameters $A$ and $B$.
Using the approximation $S_n(\hat{A}_n(\omega), \hat{B}_n(\omega), \omega)$ we have
\[
L_n(\hat{A}_n(\omega), \hat{B}_n(\omega), \omega) = -\frac{1}{2}S_p(\omega) + O(1),
\]
where
\[
S_p(\omega) = \sum_{t=1}^n X_t^2 - 2\sum_{t=1}^n X_t\big(\hat{A}_n(\omega)\cos(\omega t) + \hat{B}_n(\omega)\sin(\omega t)\big) + \frac{n}{2}\big(\hat{A}_n(\omega)^2 + \hat{B}_n(\omega)^2\big)
= \sum_{t=1}^n X_t^2 - \frac{n}{2}\big(\hat{A}_n(\omega)^2 + \hat{B}_n(\omega)^2\big).
\]
Thus
\[
\arg\max_\omega L_n(\hat{A}_n(\omega), \hat{B}_n(\omega), \omega) \approx \arg\max_\omega \Big(-\frac{1}{2}S_p(\omega)\Big) = \arg\max_\omega \big(\hat{A}_n(\omega)^2 + \hat{B}_n(\omega)^2\big).
\]
Thus
\[
\hat{\omega}_n = \arg\max_\omega \Big(-\frac{1}{2}S_p(\omega)\Big) = \arg\max_\omega \big(\hat{A}_n(\omega)^2 + \hat{B}_n(\omega)^2\big) = \arg\max_\omega \Big|\sum_{t=1}^n X_t\exp(it\omega)\Big|^2,
\]
which (up to a constant) is the periodogram of the data.
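This arg-max is easy to compute in practice. The sketch below (simulated data, not from the notes) evaluates $|\sum_t X_t e^{it\omega}|^2$ on a grid of frequencies and takes the maximiser; for full accuracy a finer grid or a local optimisation step around the grid maximiser would be needed.

```python
import numpy as np

rng = np.random.default_rng(3)
n, A, B, omega = 500, 1.0, 0.5, 1.3
t = np.arange(1, n + 1)
X = A * np.cos(omega * t) + B * np.sin(omega * t) + rng.normal(size=n)

# Profile objective |sum_t X_t exp(i t w)|^2 evaluated over a grid of frequencies
omegas = np.linspace(0.01, np.pi - 0.01, 10001)
obj = np.array([np.abs(np.sum(X * np.exp(1j * w * t))) ** 2 for w in omegas])
omega_hat = omegas[np.argmax(obj)]

A_hat = 2.0 / n * np.sum(X * np.cos(omega_hat * t))   # profiled A_hat(omega_hat)
B_hat = 2.0 / n * np.sum(X * np.sin(omega_hat * t))   # profiled B_hat(omega_hat)
print(omega_hat, A_hat, B_hat)
```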
(iv) Differentiating both sides of (3.15) with respect to $\Omega$, and considering the real and imaginary parts, gives $\sum_{t=1}^n t\cos(\Omega t) = O(n)$ and $\sum_{t=1}^n t\sin(\Omega t) = O(n)$. Differentiating both sides of (3.15) twice with respect to $\Omega$ gives the analogous bounds for the second-order terms, $\sum_{t=1}^n t^2\cos(\Omega t) = O(n^2)$ and $\sum_{t=1}^n t^2\sin(\Omega t) = O(n^2)$.
(v) In order to obtain the rate of convergence of the estimators $\hat{\omega}_n$, $\hat{A}(\hat{\omega}_n)$, $\hat{B}(\hat{\omega}_n)$, we evaluate the Fisher information of $L_n$ (the inverse of which will give us the limiting rate of convergence). For convenience, rather than take the second derivative of $L_n$ we evaluate the second derivative of $S_n(A, B, \omega)$ (though you will find that, in the limit, the second derivatives of $-2L_n$ and $S_n(A, B, \omega)$ are the same).
Differentiating $S_n(A, B, \omega) = \sum_{t=1}^n X_t^2 - 2\sum_{t=1}^n X_t\big(A\cos(\omega t) + B\sin(\omega t)\big) + \frac{1}{2}n(A^2 + B^2)$ twice with respect to $A$, $B$ and $\omega$ gives
\[
\begin{aligned}
\frac{\partial S_n}{\partial A} &= -2\sum_{t=1}^n X_t\cos(\omega t) + An \\
\frac{\partial S_n}{\partial B} &= -2\sum_{t=1}^n X_t\sin(\omega t) + Bn \\
\frac{\partial S_n}{\partial \omega} &= 2A\sum_{t=1}^n X_t t\sin(\omega t) - 2B\sum_{t=1}^n X_t t\cos(\omega t),
\end{aligned}
\]
and $\frac{\partial^2 S_n}{\partial A^2} = n$, $\frac{\partial^2 S_n}{\partial B^2} = n$, $\frac{\partial^2 S_n}{\partial A \partial B} = 0$,
\[
\begin{aligned}
\frac{\partial^2 S_n}{\partial \omega \partial A} &= 2\sum_{t=1}^n X_t t\sin(\omega t) \\
\frac{\partial^2 S_n}{\partial \omega \partial B} &= -2\sum_{t=1}^n X_t t\cos(\omega t) \\
\frac{\partial^2 S_n}{\partial \omega^2} &= 2\sum_{t=1}^n t^2 X_t\big(A\cos(\omega t) + B\sin(\omega t)\big).
\end{aligned}
\]
Taking expectations (recall $\mathrm{E}(X_t) = A\cos(\omega t) + B\sin(\omega t)$) and using the bounds from part (iv), we have
\[
\mathrm{E}\Big(\frac{\partial^2 S_n}{\partial \omega \partial A}\Big) = 2\sum_{t=1}^n t\sin(\omega t)\big(A\cos(\omega t) + B\sin(\omega t)\big) = B\sum_{t=1}^n t + O(n) = \frac{Bn^2}{2} + O(n).
\]
Using a similar argument we can show that $\mathrm{E}\big(\frac{\partial^2 S_n}{\partial \omega \partial B}\big) = -\frac{An^2}{2} + O(n)$ and
\[
\mathrm{E}\Big(\frac{\partial^2 S_n}{\partial \omega^2}\Big) = 2\sum_{t=1}^n t^2\big(A\cos(\omega t) + B\sin(\omega t)\big)^2 = (A^2 + B^2)\frac{n(n+1)(2n+1)}{6} + O(n^2) = \frac{(A^2 + B^2)n^3}{3} + O(n^2).
\]
Since $\mathrm{E}(\nabla^2 L_n) \approx -\frac{1}{2}\mathrm{E}(\nabla^2 S_n)$, this gives the required result.
(vi) Noting that the asymptotic variance of the profile likelihood estimator $\hat{\omega}_n$ is
\[
\Big(I_{\omega,\omega} - I_{\omega,(A,B)}\, I_{(A,B),(A,B)}^{-1}\, I_{(A,B),\omega}\Big)^{-1},
\]
by substituting the blocks from (v) into the above we have
\[
\Big(\frac{(A^2 + B^2)n^3}{6} - \frac{(A^2 + B^2)n^3}{8} + O(n^2)\Big)^{-1} = \Big(\frac{(A^2 + B^2)n^3}{24} + O(n^2)\Big)^{-1} \approx \frac{24}{(A^2 + B^2)n^3}.
\]
Thus we observe that the asymptotic variance of $\hat{\omega}_n$ is $O(n^{-3})$.
Typically estimators have a variance of order $O(n^{-1})$, so we see that the estimator $\hat{\omega}_n$ converges to the true parameter far faster than expected. Thus the estimator is extremely good compared with the majority of parameter estimators.
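A small Monte Carlo sketch (not from the notes) can be used to see this $n^{-3}$ rate empirically: the periodogram is maximised on a local grid around the true frequency, repeated over replications, and the scaled variance $n^3 \mathrm{var}(\hat{\omega}_n)$ is printed; it should be roughly stable across $n$ if the rate is correct.

```python
import numpy as np

rng = np.random.default_rng(4)
A, B, omega0 = 1.0, 0.5, 1.3
for n in (100, 200, 400):
    t = np.arange(1, n + 1)
    omegas = np.linspace(1.0, 1.6, 4001)            # local grid around the true frequency
    E = np.exp(1j * np.outer(omegas, t))            # grid-by-time matrix of e^{i t w}
    est = []
    for _ in range(200):
        X = A * np.cos(omega0 * t) + B * np.sin(omega0 * t) + rng.normal(size=n)
        est.append(omegas[np.argmax(np.abs(E @ X) ** 2)])
    print(n, np.var(est) * n ** 3)                  # roughly stable if var(omega_hat) = O(n^-3)
```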
It is clear from the definition that what defines a survival function is that $F_i(t)$ is positive, $F_i(0) = 1$ and $F_i(\infty) = 0$. The density is easily derived from the survival function by taking the negative derivative: $f_i(t) = -\frac{dF_i(t)}{dt}$.
To model the influence the regressors have on the survival time, the Cox proportional hazards model is often used, with the exponential distribution as the baseline distribution and $\psi(x_i; \beta)$ a positive "link" function (typically, we use $\psi(x_i; \beta) = \exp(\beta x_i)$ as the link function). More precisely, the survival function of $T_i$ is
\[
F_i(t) = F_0(t)^{\psi(x_i;\beta)},
\]
where $F_0(t) = \exp(-t/\theta)$. Not all the survival times of the electrical components are observed; censoring can arise. Hence we observe $Y_i = \min(T_i, c_i)$, where $c_i$ is the (non-random) censoring time, and $\delta_i$, where $\delta_i$ is the indicator variable with $\delta_i = 1$ denoting censoring of the $i$th component and $\delta_i = 0$ denoting that it is not censored. The parameters $\beta$ and $\theta$ are unknown.
(ii) Compute the profile likelihood of the regression parameters $\beta$, profiling out the baseline parameter $\theta$.
Solution
For this example, the logarithms of the density and the survival function are
\[
\begin{aligned}
\log f_i(t) &= \log \psi(x_i;\beta) + \big[\psi(x_i;\beta) - 1\big]\log F_0(t) + \log f_0(t) \\
&= \log \psi(x_i;\beta) - \big[\psi(x_i;\beta) - 1\big]\frac{t}{\theta} - \log\theta - \frac{t}{\theta}, \\
\log F_i(t) &= \psi(x_i;\beta)\log F_0(t) = -\psi(x_i;\beta)\frac{t}{\theta}.
\end{aligned}
\]
Since
\[
f_i(y_i, \delta_i) = \begin{cases} f_i(y_i) = \psi(x_i;\beta)\, F_0(y_i)^{\psi(x_i;\beta)-1} f_0(y_i) & \delta_i = 0 \\ F_i(y_i) = F_0(y_i)^{\psi(x_i;\beta)} & \delta_i = 1, \end{cases}
\]
the log-likelihood of the observations $\{(Y_i, \delta_i)\}$ is (ignoring constants)
\[
L_n(\beta, \theta) = \sum_{i=1}^n (1 - \delta_i)\big[\log \psi(x_i;\beta) - \log\theta\big] - \sum_{i=1}^n \psi(x_i;\beta)\frac{Y_i}{\theta}.
\]
Differentiating the above with respect to $\beta$ and $\theta$ gives
\[
\begin{aligned}
\frac{\partial L_n}{\partial \beta} &= \sum_{i=1}^n (1 - \delta_i)\frac{\nabla_\beta \psi(x_i;\beta)}{\psi(x_i;\beta)} - \sum_{i=1}^n \nabla_\beta \psi(x_i;\beta)\frac{Y_i}{\theta} \\
\frac{\partial L_n}{\partial \theta} &= -\sum_{i=1}^n (1 - \delta_i)\frac{1}{\theta} + \sum_{i=1}^n \psi(x_i;\beta)\frac{Y_i}{\theta^2}.
\end{aligned}
\]
(ii) Instead we keep $\beta$ fixed, differentiate the likelihood with respect to $\theta$ and equate to zero; this gives
\[
\frac{\partial L_n}{\partial \theta} = -\sum_{i=1}^n (1 - \delta_i)\frac{1}{\theta} + \sum_{i=1}^n \psi(x_i;\beta)\frac{Y_i}{\theta^2} = 0,
\]
and solving for $\theta$ gives
\[
\hat{\theta}(\beta) = \frac{\sum_{i=1}^n \psi(x_i;\beta)Y_i}{\sum_{i=1}^n (1 - \delta_i)}.
\]
This gives us the best estimator of $\theta$ for a given $\beta$. Next we find the best estimator of $\beta$. The profile likelihood (after profiling out $\theta$) is
\[
\ell_P(\beta) = L_n(\beta, \hat{\theta}(\beta)) = \sum_{i=1}^n (1 - \delta_i)\big[\log\psi(x_i;\beta) - \log\hat{\theta}(\beta)\big] - \sum_{i=1}^n \psi(x_i;\beta)\frac{Y_i}{\hat{\theta}(\beta)},
\]
which we maximise over $\beta$ to obtain the estimator of $\beta$.
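A numerical sketch of this profiling (simulated censored data, not from the notes, with the link $\psi(x;\beta) = \exp(\beta x)$): $\hat{\theta}(\beta)$ is plugged in and $\ell_P(\beta)$ is maximised over $\beta$ with a one-dimensional optimiser.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
n = 400
x = rng.normal(size=n)
beta0, theta0 = 0.7, 2.0
T = rng.exponential(scale=theta0 / np.exp(beta0 * x))   # F_i(t) = exp(-psi_i t / theta)
c = rng.uniform(1.0, 6.0, size=n)                        # censoring times
Y = np.minimum(T, c)
delta = (T > c).astype(float)                            # delta_i = 1 means censored

def neg_profile_loglik(beta):
    psi = np.exp(beta * x)                               # link psi(x_i; beta) = exp(beta x_i)
    theta_b = np.sum(psi * Y) / np.sum(1.0 - delta)      # theta_hat(beta)
    lp = np.sum((1.0 - delta) * (beta * x - np.log(theta_b))) - np.sum(psi * Y) / theta_b
    return -lp

res = minimize_scalar(neg_profile_loglik, bounds=(-5.0, 5.0), method="bounded")
print("profile MLE of beta:", res.x)
```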
Suppose we observe $(Y_i, X_i, U_i)$, where
\[
Y_i = \beta X_i + \phi(U_i) + \varepsilon_i,
\]
$(X_i, U_i, \varepsilon_i)$ are iid random variables and $\phi$ is an unknown function. Before analyzing the model we summarize some of its interesting properties:
• When a model does not have a parametric form (i.e. a finite number of parameters cannot describe the model), then we cannot usually obtain the usual $O(n^{-1/2})$ rate. We see that in the above model $\phi(\cdot)$ does not have a parametric form, thus we cannot expect an estimator of it to be $\sqrt{n}$-consistent.

• The model above contains the term $\beta X_i$, which does have a parametric form; can we obtain a $\sqrt{n}$-consistent estimator of $\beta$?
Suppose
\[
Y_i = \phi(U_i) + \varepsilon_i,
\]
where $U_i, \varepsilon_i$ are iid random variables. A classical method for estimating $\phi(\cdot)$ is to use the Nadaraya–Watson estimator. This is basically a local least squares estimator of $\phi(u)$. The estimator $\hat{\phi}_n(u)$ is defined as
\[
\hat{\phi}_n(u) = \arg\min_a \sum_i W_b(u - U_i)(Y_i - a)^2 = \frac{\sum_i W_b(u - U_i)Y_i}{\sum_i W_b(u - U_i)},
\]
where $W(\cdot)$ is a kernel (think local window function) with $\int W(x)\,dx = 1$ and $W_b(u) = b^{-1}W(u/b)$, with $b \to 0$ as $n \to \infty$; thus the window gets narrower and more localized as the sample size grows. Dividing by $\sum_i W_b(u - U_i)$ "removes" the clustering in the locations $\{U_i\}$.
Note that the above can also be treated as an estimator of
\[
\mathrm{E}(Y|U = u) = \int_{\mathbb{R}} y\, f_{Y|U}(y|u)\,dy = \int_{\mathbb{R}} y\,\frac{f_{Y,U}(y,u)}{f_U(u)}\,dy = \phi(u),
\]
in which the joint density $f_{Y,U}$ and the marginal density $f_U$ are replaced by the estimators
\[
\hat{f}_{Y,U}(u, y) = \frac{1}{n}\sum_{i=1}^n \delta_{Y_i}(y)W_b(u - U_i), \qquad
\hat{f}_U(u) = \frac{1}{n}\sum_{i=1}^n W_b(u - U_i),
\]
with $\delta_{Y}(y)$ denoting the Dirac delta function. Note that the above is true because
\[
\int_{\mathbb{R}} y\,\frac{\hat{f}_{Y,U}(y,u)}{\hat{f}_U(u)}\,dy = \frac{1}{\hat{f}_U(u)}\int_{\mathbb{R}} y\, \hat{f}_{Y,U}(y,u)\,dy
= \frac{1}{\hat{f}_U(u)}\int_{\mathbb{R}} y\,\frac{1}{n}\sum_{i=1}^n \delta_{Y_i}(y)W_b(u - U_i)\,dy
= \frac{1}{\hat{f}_U(u)}\frac{1}{n}\sum_{i=1}^n W_b(u - U_i)\underbrace{\int_{\mathbb{R}} y\,\delta_{Y_i}(y)\,dy}_{=Y_i}
= \frac{\sum_i W_b(u - U_i)Y_i}{\sum_i W_b(u - U_i)}.
\]
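A compact numerical sketch of the Nadaraya–Watson estimator just described (simulated data, Gaussian kernel; the bandwidth $b$ is fixed by hand here, whereas in practice it would be chosen by, e.g., cross-validation).

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
U = rng.uniform(0.0, 1.0, size=n)
Y = np.sin(2 * np.pi * U) + 0.3 * rng.normal(size=n)     # phi(u) = sin(2*pi*u) as a test function

def nw_estimator(u, U, Y, b):
    # W_b(u - U_i) with a Gaussian kernel W; the 1/b factor cancels in the ratio
    w = np.exp(-0.5 * ((u - U) / b) ** 2)
    return np.sum(w * Y) / np.sum(w)

grid = np.linspace(0.05, 0.95, 19)
phi_hat = np.array([nw_estimator(u, U, Y, b=0.05) for u in grid])
print(np.round(phi_hat - np.sin(2 * np.pi * grid), 2))   # estimation error on the grid
```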
To estimate $\beta$, we first profile out $\phi(\cdot)$ (this is the nuisance parameter), which we estimate as if $\beta$ were known. In other words, we suppose that $\beta$ were known and let
\[
Y_i(\beta) = Y_i - \beta X_i = \phi(U_i) + \varepsilon_i.
\]
We then estimate $\phi(\cdot)$ using the Nadaraya–Watson estimator, in other words the $\phi(\cdot)$ which minimises the criterion
\[
\hat{\phi}_\beta(u) = \arg\min_a \sum_i W_b(u - U_i)\big(Y_i(\beta) - a\big)^2 = \frac{\sum_i W_b(u - U_i)Y_i(\beta)}{\sum_i W_b(u - U_i)}
= \frac{\sum_i W_b(u - U_i)Y_i}{\sum_i W_b(u - U_i)} - \beta\,\frac{\sum_i W_b(u - U_i)X_i}{\sum_i W_b(u - U_i)}
:= \hat{G}_b(u) - \beta \hat{H}_b(u), \qquad (3.17)
\]
where
\[
\hat{G}_b(u) = \frac{\sum_i W_b(u - U_i)Y_i}{\sum_i W_b(u - U_i)} \quad\text{and}\quad \hat{H}_b(u) = \frac{\sum_i W_b(u - U_i)X_i}{\sum_i W_b(u - U_i)}.
\]
Thus, given $\beta$, the estimator of $\phi$ and the residuals $\varepsilon_i$ are
\[
\hat{\phi}_\beta(u) = \hat{G}_b(u) - \beta\hat{H}_b(u)
\quad\text{and}\quad
\hat{\varepsilon}_i = Y_i - \beta X_i - \hat{\phi}_\beta(U_i).
\]
Given the estimated residuals $Y_i - \beta X_i - \hat{\phi}_\beta(U_i)$ we can now use least squares to estimate the coefficient $\beta$. We define the least squares criterion
\[
L_n(\beta) = \sum_i \big(Y_i - \beta X_i - \hat{\phi}_\beta(U_i)\big)^2
= \sum_i \big(Y_i - \beta X_i - \hat{G}_b(U_i) + \beta\hat{H}_b(U_i)\big)^2
= \sum_i \big(Y_i - \hat{G}_b(U_i) - \beta[X_i - \hat{H}_b(U_i)]\big)^2,
\]
where $\hat{G}_b(\cdot)$ and $\hat{H}_b(\cdot)$ are defined in (3.17). Using the above estimator of $\phi(\cdot)$, estimating $\beta$ therefore reduces to a standard least squares problem: we minimise $L_n(\beta)$ with respect to $\beta$.
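Putting the pieces together, a sketch of the profiled least squares estimator of $\beta$ (simulated data, not from the notes; this two-step smooth-then-regress scheme is essentially the classical estimator for the partially linear model): $\hat{G}_b(U_i)$ and $\hat{H}_b(U_i)$ are computed by Nadaraya–Watson smoothing, and $\hat{\beta}$ is the least squares coefficient of $Y_i - \hat{G}_b(U_i)$ on $X_i - \hat{H}_b(U_i)$.

```python
import numpy as np

rng = np.random.default_rng(7)
n, beta0, b = 1000, 1.5, 0.05
U = rng.uniform(0.0, 1.0, size=n)
X = np.cos(2 * np.pi * U) + rng.normal(size=n)          # X may depend on U
Y = beta0 * X + np.sin(2 * np.pi * U) + 0.5 * rng.normal(size=n)   # phi(u) = sin(2*pi*u)

def nw_smooth(values, U, b):
    # Nadaraya-Watson smooth of `values` against U, evaluated at each U_i (Gaussian kernel)
    d = (U[:, None] - U[None, :]) / b
    w = np.exp(-0.5 * d ** 2)
    return (w @ values) / w.sum(axis=1)

G = nw_smooth(Y, U, b)                                  # G_b(U_i): smooth of Y on U
H = nw_smooth(X, U, b)                                  # H_b(U_i): smooth of X on U

# Least squares of (Y - G) on (X - H) gives the profiled estimator of beta
beta_hat = np.sum((Y - G) * (X - H)) / np.sum((X - H) ** 2)
print(beta_hat)
```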