Sia Notes 2013
Pavel Chigansky
Department of Statistics, The Hebrew University, Mount Scopus, Jerusalem
91905, Israel
E-mail address: [email protected]
Preface
These are the lecture notes for the courses Statistical Inference and Applications A
(52303) / Theory of Statistics (52314) I have taught at the Statistics Department of the Hebrew
University during the fall semesters of 2009-2012.
The course is divided into two parts: an introduction to multivariate probability theory
and an introduction to mathematical statistics. The probability is served on an elementary (non-measure
theoretic) level and is similar in spirit to the corresponding part of the Casella & Berger
text [3] ([14] is my choice for a deeper insight into probability theory).
The statistical part is in the spirit of Bickel & Doksum [1], Casella & Berger [3]. For
in-depth reading the comprehensive classical texts are recommended: Lehmann & Casella [8],
Lehmann & Romano [9], Borovkov [2]. The text of Shao [12] (which comes with the solutions guide
[13] to the exercises) is highly recommended, if your probability is already measure theoretic.
The book of Ibragimov and Khasminskii [6] is an advanced treatment of the asymptotic theory
of estimation (both parametric and non-parametric). A. Tsybakov's [15] is an excellent text,
focusing on nonparametric estimation. Finally, the papers [16] and [7] contain the proofs of
some facts mentioned in the text.
Please do not hesitate to e-mail your comments/bug reports to the author.
P.Ch., 21/01/2013
Contents
Part 1. Probability
Part 2. Statistical inference
d. Sufficient statistic
Exercises
Appendix. Exams/solutions
a. 2009/2010 (A) 52303
b. 2009/2010 (B) 52303
c. 2009/2010 (A) 52314
d. 2009/2010 (B) 52314
e. 2010/2011 (A) 52303
f. 2010/2011 (B) 52303
g. 2010/2011 (A) 52314
h. 2011/2012 (A) 52314
i. 2012/2013 (A) 52314
j. 2012/2013 (B) 52314
Appendix.
Bibliography
Part 1
Probability
CHAPTER 1
In words, this means that if A and B are mutually exclusive, the probability of either A or B
to occur is the sum of their individual probabilities.
For technical reasons, beyond our scope, one cannot define many natural probability measures on all the subsets of Ω, if it is uncountable, e.g., Ω = R. Luckily it can be defined on
a rich enough collection of subsets, which is denoted by F and called the σ-algebra² of events.
The triple (Ω, F, P) is called a probability space. In general, construction of probability measures
on (Ω, F) can be a challenging mathematical problem, depending on the complexity of Ω (think
e.g. about Ω consisting of all continuous functions).
Probability measures on Ω := R^d can always be defined by cumulative distribution
functions (c.d.f. in short). Hence in this course, a probability measure will always be identified
with the corresponding c.d.f. For d = 1, a function F : R → [0, 1] is a legitimate c.d.f. if and
only if it satisfies the following properties:
(i) F is a nondecreasing function
(ii) lim_{x→∞} F(x) = 1
(iii) lim_{x→−∞} F(x) = 0
(iv) F is right continuous
The probability measure of semi-infinite intervals is defined by the formulas
P((−∞, x]) := F(x), and P((−∞, x)) := lim_{ε↓0} F(x − ε) =: F(x−),
and is extended to other types of intervals or their finite (or countable) unions through additivity.
For example, since (−∞, a] ∪ (a, b] = (−∞, b] for a < b, by additivity
P((−∞, a]) + P((a, b]) = P((−∞, b])
¹in fact, a stronger property of σ-additivity is required, but again this is way beyond our scope.
²the term σ-algebra comes from the fact that F is to be closed under taking complements and countable
unions or intersections
and hence
P((a, b]) = F(b) − F(a).   (1a1)
Similarly, probabilities of other intervals, e.g. [a, b), [a, b], etc. or unions of intervals are defined.
In fact, this assignment defines P on subsets (events) much more general than intervals, but we
shall rarely need to calculate such probabilities explicitly. These definitions explain the need for
the conditions (i)-(iv).
The c.d.f. is said to be purely discrete (atomic) if it is a piecewise constant function with
jumps at x_k ∈ R, k ∈ N, with the corresponding sizes {p_k}:
F(x) = Σ_{k: x_k ≤ x} p_k,  x ∈ R.   (1a2)
In this case,
P({x_k}) = F(x_k) − F(x_k−) = p_k,
i.e. each value x_k is assigned a positive probability p_k (check that (i)-(iv) imply p_k > 0 and
Σ_k p_k = 1). Often {x_k} = N, in which case {p_k} is referred to as the probability mass function
(p.m.f.) Here are some familiar examples
Example 1a1 (Bernoulli distribution). The Bernoulli distribution with parameter p ∈ (0, 1),
denoted Ber(p), has the c.d.f.
F(x) = 0 for x ∈ (−∞, 0),  1 − p for x ∈ [0, 1),  1 for x ∈ [1, ∞).
This function satisfies (i)-(iv) (sketch a plot and check). We have
P(X = 0) = F(0) − F(0−) = 1 − p  and  P(X = 1) = F(1) − F(1−) = p.
Example 1a2 (Poisson distribution). The Poisson distribution with rate parameter λ > 0
has piecewise constant c.d.f. with jumps at {0, 1, ...} = {0} ∪ N =: Z+ and its p.m.f. is given by
P(X = k) = e^{−λ} λ^k / k!,  k ∈ Z+.
Hence e.g.
P(X ∈ [3, 17.5]) = F(17.5) − F(3−) = F(17) − F(3−) = Σ_{k=3}^{17} e^{−λ} λ^k / k!.
The latter expression can be evaluated numerically if λ is given as a number (rather than as a symbolic
variable). Also
F(x) = Σ_{k: k ≤ x} e^{−λ} λ^k / k!.
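For a concrete value of λ this is a finite sum and can be evaluated numerically. Below is a minimal sketch, assuming Python with scipy is available; the value λ = 2.5 is only an illustration:

    # P(X in [3, 17.5]) = F(17) - F(3-) for X ~ Poi(lam); note F(3-) = P(X <= 2)
    from scipy.stats import poisson

    lam = 2.5                                                    # illustrative rate parameter
    p = poisson.cdf(17, lam) - poisson.cdf(2, lam)               # F(17) - F(2)
    p_direct = sum(poisson.pmf(k, lam) for k in range(3, 18))    # the same sum, term by term
    print(p, p_direct)                                           # the two numbers coincide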
If the c.d.f. can be written as
F(x) = ∫_{−∞}^{x} f(u) du,  x ∈ R,
for some function f, it is said to have density f (check that (i)-(iv) imply ∫_R f(u)du = 1 and
that f(u) ≥ 0, if it is continuous). Of course, in general a c.d.f. may increase in other ways, e.g.
have both jumps and continuous parts.
Note that a continuous c.d.f. assigns zero probability to each point in R. Does this imply
that we cannot get any particular outcome? Certainly not, since P(Ω) = 1. This seemingly
paradoxical situation is quite intuitive: after all, drawing 1/2 or any other number from the
interval [0, 1] with the uniform distribution on it feels impossible. Mathematically, this is
resolved by means of measure theory, which is the way probability is treated rigorously³.
Example 1a3. The Normal (Gaussian) c.d.f. N(μ, σ²) with mean μ ∈ R and variance σ² > 0
has the density
f(x) = (1/(√(2π) σ)) exp( −(x − μ)²/(2σ²) ),  x ∈ R.
Other frequently encountered p.d.f.s are the Exponential, Cauchy, Gamma, etc.
b. Random variable
Functions on the sampling space Ω are called random variables. We shall denote random
variables by capital letters to distinguish them from the values they take. A random variable
X : Ω → R generates events of the form {ω : X(ω) ∈ A}, where A is a subset of R, e.g. an
interval. The function
F_X(x) := P({ω : X(ω) ∈ (−∞, x]})
is called the c.d.f. of the random variable X. It defines the probabilities of the events generated
by X, similarly to (1a1). In fact, for the coordinate random variable X(ω) := ω, the c.d.f. of
X coincides with the c.d.f. which defines P:
F_X(x) = P({ω : X(ω) ≤ x}) = P({ω : ω ≤ x}) = F(x).   (1b1)
For other random variables, F_X is always a c.d.f., but the connection between F_X and F can be
more complicated.
Example 1b1. One simple yet important example of a r.v. is the indicator of an event A:
I(A)(ω) := 1, ω ∈ A, and 0, ω ∈ A^c.
Since I(A) ∈ {0, 1}, it is in fact a Bernoulli r.v. with parameter P(I(A) = 1) = P(A).
It is customary in probability theory not to elaborate the structure of the function X(ω) and
even omit ω from the notations, but just specify the corresponding c.d.f. F_X. This is sufficient
in many problems, since F_X defines the probabilities of all the events generated by X, regardless
of the underlying probability space.
³consult books, if you are curious. My favorite text on the subject is [14]
F_Y(x) = 0 for x < 1/2,  1/2 for x = 1/2,  x for x ∈ (1/2, 1],  1 for x > 1.
Hence Y has an atom at {1/2} and a continuous nontrivial part.
c. Expectation
Expectation of a random variable is averaging over all its possible realizations. More precisely, given a function g, defined on the range of the r.v. X,
Eg(X) := Σ_i g(x_i) P(X = x_i) = Σ_i g(x_i) (F_X(x_i) − F_X(x_i−)) =: ∫_R g(x) dF_X(x),
if X is discrete, and
Eg(X) = ∫_R g(x) f_X(x) dx =: ∫_R g(x) dF_X(x),
if X has p.d.f. f_X(x). Note that in the two cases the notation ∫_R g(x)dF_X(x) is interpreted
differently, depending on the context⁴.
For particular functions g(·), expectation has special names: e.g. EX is the mean of X, EX^p
is the p-th moment of X, E(X − EX)² is the variance of X, etc. For example, for X ∼ Poi(λ),
EX = Σ_{k=0}^{∞} k e^{−λ} λ^k / k! = ... = λ,
⁴you can think of dF_X(x) = f_X(x)dx if F'_X(x) = f_X(x) and dF_X({x}) = F_X(x) − F_X(x−) if F_X has a jump at x.
and for X ∼ N(μ, σ²),
EX = ∫_R x (1/(√(2π)σ)) e^{−(x−μ)²/(2σ²)} dx = ... = μ,
and
E(X − μ)² = ∫_R (x − μ)² (1/(√(2π)σ)) e^{−(x−μ)²/(2σ²)} dx = ... = σ².
Being an integral (or series), the expectation of g(X) does not have to be finite or even well
defined:
Example 1c2. Recall that X is a Cauchy r.v. if it has the density
f_X(x) = (1/π) 1/(1 + x²),  x ∈ R.
Consider g(x) := |x|^{1/2}; then
E|X|^{1/2} = ∫_R |x|^{1/2} (1/π) 1/(1 + x²) dx := lim_{N→∞} ∫_{−N}^{N} |x|^{1/2} (1/π) 1/(1 + x²) dx = ... = √2.
Now let g(x) = |x|:
E|X| = ∫_R |x| (1/π) 1/(1 + x²) dx := lim_{N→∞} ∫_{−N}^{N} |x| (1/π) 1/(1 + x²) dx = ∞,
since x/(1 + x²) is not integrable at infinity.
Ee^{ε|X|} = ∫_0^∞ e^{εx} λ e^{−λx} dx < ∞,
if ε < λ. Hence
M_X(t) = Ee^{tX},  t ∈ (−λ, λ),
is well defined.
Example 1d3. For the Cauchy r.v. the m.g.f. is not defined: for all ε > 0, Ee^{ε|X|} = ∞.
Note that if the m.g.f. is well defined on an open interval (−ε, ε) near the origin, then X has
finite absolute moments (and hence the moments) of all orders:
E|X|^p = ∫_R |x|^p dF(x) ≤ ∫_R C e^{ε|x|/2} dF(x) = C Ee^{ε|X|/2} < ∞,
since |x|^p ≤ C e^{ε|x|/2} for an appropriate constant C. Moreover,
M'_X(0) := (d/dt) M_X(t)|_{t=0} = (d/dt) ∫_R e^{tx} dF_X(x)|_{t=0} = ∫_R (d/dt) e^{tx} dF_X(x)|_{t=0}
        = ∫_R x e^{tx} dF_X(x)|_{t=0} = ∫_R x dF_X(x) = EX,
where the interchange of the derivative and the integral should and can be justified (think how).
Similarly, for k ≥ 1,
M_X^{(k)}(0) = EX^k,
which is where the name of M_X(t) comes from: if one knows M_X(t), one can reconstruct
(generate) all the moments.
⁷another characterization of the probability distribution of a r.v. X is the characteristic function φ_X(t) :=
Ee^{itX}, where i is the imaginary unit (i.e. the Fourier transform of its c.d.f.) The advantage of the c.f. is that it is
always well defined (unlike the m.g.f.). It is a powerful tool in many applications (e.g. limit theorems, etc.). You
will need to know some complex analysis to discover more.
⁸recall that we interpret ∫ h(x)dF_X differently, depending on the context - this is just a convenient unifying
notation in our course
M'_X(0) = (d/dt)(1 − t/λ)^{−1}|_{t=0} = ... = 1/λ,
It is then very plausible that knowing all the moments is enough to be able to reconstruct
F_X(x). Indeed the m.g.f., when it exists, completely determines⁹ F_X(x) and consequently the
probability law of X.
While in principle F_X(x) can be reconstructed from M_X(t), the reconstruction formula is
quite complicated and is beyond our scope. However in many cases, M_X(t) can be recognized
to have a particular form, which identifies the corresponding F_X(x). This simple observation,
as we shall see, is quite powerful.
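As a quick illustration of how moments are generated, one can differentiate a known m.g.f. symbolically. A minimal sketch, assuming Python with sympy is available, using the exponential m.g.f. M_X(t) = (1 − t/λ)^{−1} mentioned above:

    # moments of Exp(lambda) from its m.g.f. M(t) = (1 - t/lambda)^(-1):
    # the k-th derivative at t = 0 equals E X^k = k!/lambda^k
    import sympy as sp

    t, lam = sp.symbols('t lambda', positive=True)
    M = 1 / (1 - t / lam)
    for k in range(1, 4):
        moment = sp.simplify(sp.diff(M, t, k).subs(t, 0))
        print(k, moment)          # prints 1/lambda, 2/lambda**2, 6/lambda**3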
While the m.g.f. is defined for r.v. of any type, it is more convenient to deal with the probability
generating function (p.g.f.) if X is integer valued:
Definition 1d5. The p.g.f. of a discrete r.v. X with values in Z+ is
G_X(t) := Et^X = Σ_k t^k P(X = k),  t ∈ D,
Differentiating term by term,
(d/dt) G_X(t)|_{t=0} = Σ_{k=1}^{∞} k t^{k−1} P(X = k)|_{t=0} = P(X = 1),
and similarly
G_X^{(m)}(0) = m! P(X = m),  m ≥ 1.
For example, the geometric distribution with parameter p has the p.g.f.
G_X(t) = pt/(1 − tq),
where q = 1 − p. Then
P(X = 0) = G_X(0) = 0,
P(X = 1) = G'_X(0) = p,
etc.
Note that M_X(ln t) = G_X(t) and hence the p.g.f. is the m.g.f. in disguise.
⁹though plausible, the moments do not determine F_X in general. The fact that there is a one-to-one correspondence between M_X(t) and F_X(x) stems from the theory of the Fourier-Laplace transform from functional
analysis
Exercises
Problem 1.1. Let X be a r.v. with
(a) p.d.f.
f_X(x) = c(1 − x²)I(x ∈ (−1, 1)).
(b) p.d.f.
f_X(x) = (1 − |x|)I(|x| ≤ 1)
(c) p.m.f.
p_X(k) = 1/N for k ∈ {0, ..., N − 1} and 0 otherwise
Problem 1.2. Show that for events A, B and A_i:
I_A ∼ Ber(p), with p = P(A); in particular, EI_A = P(A)
I_{∩_i A_i} = ∏_i I_{A_i}
I_{∪_i A_i} = max_i I_{A_i}
I_A² = I_A
I_{A^c} = 1 − I_A
A ⊆ B ⟹ I_A ≤ I_B
Problem 1.3. Let X be a discrete r.v. with integer values with the p.g.f. G_X(s) and the
m.g.f. M_X(t). Show that
(d/ds) G_X(s)|_{s=1} = EX  and  (d²/ds²) G_X(s)|_{s=1} = EX² − EX
using
(1) the definition of G_X(s);
(2) the relation G_X(s) = M_X(ln s)
Problem 1.4.
(1) Find the m.g.f. of N(μ, σ²) and use it to check that EX = μ and var(X) = σ².
(2) For X ∼ N(μ, σ²) prove
E(X − μ)^p = 0 for p odd, and E(X − μ)^p = σ^p p!/(2^{p/2} (p/2)!) for p even.
Problem 1.5. Find the c.d.f. (or p.d.f./p.m.f. if appropriate) of the r.v. with the following
m.g.f.s
(1) M_X(t) = e^{t(t−1)}, t ∈ R
(2) M_X(t) = 1/(1 − t), |t| < 1
(3) M_X(t) = pe^t + 1 − p, t ∈ R
(4) M_X(t) = e^{tC}, t, C ∈ R
(5) M_X(t) = exp(e^t − 1), t ∈ R
Problem 1.6. Show that X ≥ 0 ⟹ EX ≥ 0
CHAPTER 2
F_XY(b1, b2) − F_XY(a1, b2) − F_XY(b1, a2) + F_XY(a1, a2) ≥ 0,  a1 ≤ b1, a2 ≤ b2.   (2a1)
It is already clear at this point that the individual c.d.f.s of each component can be found from
F_XY, e.g.:
F_X(u) = P(X ≤ u) = P(X ≤ u, Y ∈ R) = F_XY(u, ∞).
However, in general one cannot restore F_XY from its marginals F_X and F_Y (see Example 2a6
below). Thus F_XY is a more complete probabilistic description of (X, Y).
All of these properties are automatically satisfied in the two particular cases: the jointly
discrete and the jointly continuous.
*. Jointly discrete random variables
Definition 2a1. A random vector (X, Y) is (jointly) discrete if there is a countable (or
finite) number of points (x_k, y_m) ∈ R², k ≥ 1, m ≥ 1, and positive numbers p_{k,m} > 0, such that
for any A ⊆ R²,
P((X, Y) ∈ A) = Σ_{k,m: (x_k, y_m) ∈ A} p_{k,m}.
In particular,
F_XY(u, ∞) = Σ_{k: x_k ≤ u} Σ_{m: y_m ∈ R} p_{k,m} = Σ_{k: x_k ≤ u} p_X(k) = F_X(u),
where p_X(k) := Σ_m p_{k,m} is the p.m.f. of X. Similarly, we get F_XY(∞, v) = F_Y(v). This means
that the c.d.f.s of both entries of the vector (which are one dimensional random variables) are
recovered from the joint c.d.f., i.e. the probabilistic description of X and Y as individual r.v. is
in fact incorporated in their joint probabilistic characterization. From this point of view, F_X(u)
and F_Y(v) are the marginal c.d.f.s of the j.c.d.f. F_XY(u, v). Similarly, p_X(k) and p_Y(m) are the
marginal p.m.f.s of the j.p.m.f. p_{k,m} (or, in other popular notation, p_XY(k, m)).
Example 2a2. Consider the random vector (X, Y) with values in N² and j.p.m.f.
p_{k,m} = (1 − p)^{k−1} p (1 − r)^{m−1} r,  k, m ∈ N,
where p, r ∈ (0, 1). Let's check that it is indeed a legitimate j.p.m.f.: clearly p_{k,m} ≥ 0 and
Σ_{k,m} p_{k,m} = Σ_k (1 − p)^{k−1} p Σ_m (1 − r)^{m−1} r = ... = 1.
The marginal p.m.f. of X is
p_k = Σ_m p_{k,m} = (1 − p)^{k−1} p Σ_m (1 − r)^{m−1} r = (1 − p)^{k−1} p,  k ∈ N,
Note that two different j.c.d.f.s may have identical marginals (see also Example 2a6
below)
Example 2a3. Let (X, Y) be a r.v. taking values in
{(0, 0), (0, 1), (1, 0), (1, 1)}
with the j.p.m.f.:
p_XY(0, 0) = 1/4 − ε
p_XY(0, 1) = 1/4 + ε
p_XY(1, 0) = 1/4 + ε
p_XY(1, 1) = 1/4 − ε
where ε ∈ (0, 1/4) is an arbitrary constant. Clearly, the j.p.m.f. depends on ε, while the marginals
do not: check that X ∼ Ber(1/2) and Y ∼ Ber(1/2) irrespectively of ε.
If (X, Y) is a discrete random vector, then both X and Y are discrete random variables, e.g.
Σ_k P(X = x_k) = Σ_{k,m} P(X = x_k, Y = y_m) = 1.
The converse is also true², i.e. if X is discrete and Y is discrete, then (X, Y) is discrete.
Given a function g : R² → R, the expectation of g(X, Y) is defined:
Eg(X, Y) := Σ_{k,m} g(x_k, y_m) p_{k,m}.
As in the one dimensional case, the expectation Eg(X, Y) may take values in R ∪ {±∞} or may
not be defined at all (recall the Example 1c2).
The expectation is linear: for a, b ∈ R,
E(aX + bY) = Σ_{k,m} (a x_k + b y_m) p_{k,m} = a Σ_{k,m} x_k p_{k,m} + b Σ_{k,m} y_m p_{k,m}
           = a Σ_k x_k p_X(k) + b Σ_m y_m p_Y(m) = aEX + bEY.   (2a2)
²Proof: suppose that X takes values in {x1, x2, ...} and Y takes values in {y1, y2, ...}, then
1 = Σ_k P(X = x_k) = Σ_{k,m} P(X = x_k, Y = y_m),
which implies P(X ∈ R \ {x1, x2, ...}, Y ∈ R) = 0. Similarly, P(X ∈ R, Y ∈ R \ {y1, y2, ...}) = 0. Recall that if
P(A) = 0 and P(B) = 0, then P(A ∪ B) = 0, hence we conclude that P((X, Y) ∉ {(x_k, y_m)}_{k,m}) = 0,
which implies that (X, Y) is a discrete random vector.
Example 2a4. Let (X, Y) be as in the Example 2a2 and g(x, y) = x²y, then
Eg(X, Y) = Σ_{k,m} k² m (1 − p)^{k−1} p (1 − r)^{m−1} r = Σ_k k² (1 − p)^{k−1} p Σ_m m (1 − r)^{m−1} r = (2 − p)/(p² r).
The j.p.g.f. G_XY(s, t) := E s^X t^Y of this vector is well defined in e.g. the rectangle |s| < 1, |t| < 1. Analogously to one dimension, the
j.p.m.f. can be generated from the j.p.g.f.:
(∂^{k+m}/∂s^k ∂t^m) G_XY(0, 0) = k! m! p_{k,m}.
Jointly continuous random variables. A pair of random variables (X, Y) are called
jointly continuous (or, equivalently, the vector (X, Y) is continuous), if the j.c.d.f. F_XY(u, v) is
a continuous function, jointly³ in (u, v). We shall deal exclusively with a subclass of continuous
random vectors⁴, whose c.d.f. is defined by:
F_XY(u, v) = ∫_{−∞}^{u} ∫_{−∞}^{v} f_XY(x, y) dx dy,
where f_XY(x, y) is the joint p.d.f., i.e. a non-negative integrable function satisfying ∫_{R²} f_XY(x, y) dx dy = 1. In particular,
F_XY(u, ∞) = ∫_{−∞}^{u} ( ∫_R f_XY(x, v) dv ) dx = ∫_{−∞}^{u} f_X(x) dx,
where f_X(u) := ∫_R f_XY(u, v) dv is the p.d.f. of X (indeed it is non-negative and integrates to
1). Hence as in the discrete case, the one dimensional p.d.f.s and c.d.f.s are marginals of the j.p.d.f.
and j.c.d.f.
Example 2a5. Let (X, Y) be a r.v. with j.p.d.f.
f_XY(x, y) = (1/(2π)) e^{−x²/2 − y²/2},  x, y ∈ R.
Then
f_X(x) = ∫_R (1/(2π)) e^{−x²/2 − y²/2} dy = (1/√(2π)) e^{−x²/2} ∫_R (1/√(2π)) e^{−y²/2} dy = (1/√(2π)) e^{−x²/2},
i.e. X ∼ N(0, 1) (and similarly Y ∼ N(0, 1)).
³recall that a function h(x, y) is jointly continuous in (x, y), if for any sequence (x_n, y_n), which converges to
(x, y), h(x_n, y_n) converges to h(x, y)
The j.c.d.f. is not uniquely determined by its marginals, i.e. two different j.c.d.f.s may have
identical marginals:
Example 2a6. Suppose X and Y have p.d.f.s f_X and f_Y. Let α be a number in [−1, 1] and
define
f_XY(x, y; α) = f_X(x) f_Y(y) ( 1 + α (2F_X(x) − 1)(2F_Y(y) − 1) ).
Since 2F_X(x) − 1 ∈ [−1, 1] for all x ∈ R, f_XY is a non-negative function. Further, since
(d/dx) F_X²(x) = 2F_X(x) f_X(x),
∫_R 2F_X(x) f_X(x) dx = ∫_R (d/dx) F_X²(x) dx = 1,
we have ∫_{R²} f_XY(x, y; α) dx dy = 1 and thus f_XY is a legitimate two dimensional p.d.f. Also a
direct calculation shows that its marginals are f_X and f_Y, regardless of α. However, f_XY is a
different function for different α's!
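The claim about the marginals is also easy to check numerically. A minimal sketch, assuming Python with numpy/scipy is available; standard normal marginals and α = 0.7 are chosen only for illustration:

    # f_XY(x, y; a) = f(x) f(y) (1 + a (2F(x) - 1)(2F(y) - 1)) with standard normal marginals
    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    a = 0.7                                   # any number in [-1, 1]
    f, F = norm.pdf, norm.cdf

    def f_XY(x, y):
        return f(x) * f(y) * (1 + a * (2 * F(x) - 1) * (2 * F(y) - 1))

    # integrating out y recovers the marginal f_X(x), whatever a is
    x0 = 1.3
    marginal, _ = quad(lambda y: f_XY(x0, y), -np.inf, np.inf)
    print(marginal, f(x0))                    # the two numbers coincide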
Expectation of g(X, Y) for an appropriate function g : R² → R is defined
Eg(X, Y) = ∫_{R²} g(u, v) f_XY(u, v) du dv.
Again it may be finite or infinite or may not exist at all. The linearity property as in (2a2) is
readily checked.
The joint moments and the j.m.g.f. are defined similarly to the discrete case, e.g.
M_XY(s, t) = Ee^{sX + tY},
if the expectation is finite in an open vicinity of the origin. In this case,
(∂^{k+m}/∂s^k ∂t^m) M_XY(s, t)|_{(s,t)=(0,0)} = EX^k Y^m.
The components of a continuous random vector are continuous r.v. as we have already seen.
However, the r.v. X and Y can be continuous individually, but not jointly:
Example 2a7. Consider X ∼ N(0, 1) and Y := X, then each of X and Y is N(0, 1), i.e.
continuous, while not jointly continuous. Indeed, suppose that (X, Y) has a density f_XY(x, y).
Since P(X ≠ Y) = 0,
1 = ∫_{R²} f_XY(u, v) du dv = ∫_{R² \ {(x,y): x=y}} f_XY(u, v) du dv = P(X ≠ Y) = 0,
where the first equality is a property of the integral over R². The contradiction indicates that
the j.p.d.f. doesn't exist in this case.
Remark 2a8. Of course, random vectors may be neither discrete nor continuous in various
ways. In particular, their entries may be r.v. of the mixed type, i.e. contain both continuous and discrete components, or the vector may contain both discrete and continuous entries.
Calculation of the expectations in this case follows the same pattern, but can be more involved
technically.
b. Independence
From here on we shall consider both discrete and continuous random vectors within the
same framework, emphasizing differences only when essential. All of the following properties
are equivalent⁵ and can be taken as the definition of independence of X and Y:
(1) the j.c.d.f. factors into the product of the individual c.d.f.s:
F_XY(x, y) = F_X(x) F_Y(y),  x, y ∈ R.
(2) the j.m.g.f. factors into the product of the individual m.g.f.s:
M_XY(s, t) = M_X(s) M_Y(t),  s, t ∈ D,
(3) the j.p.d.f. factors into the product of the individual p.d.f.s in the continuous case:
f_XY(x, y) = f_X(x) f_Y(y),  x, y ∈ R,
or the j.p.m.f. factors into the product of individual p.m.f.s in the discrete case:
p_XY(k, m) = p_X(k) p_Y(m),  k, m ∈ N.
(4) for all A, B ⊆ R with P(Y ∈ B) > 0,
P(X ∈ A | Y ∈ B) := P(X ∈ A, Y ∈ B) / P(Y ∈ B) = P(X ∈ A).
Example 2b2. Let X ∼ U([0, 1]) and Y = I(X > 1/2). Note that (X, Y) does not categorize
as a discrete or continuous random vector. X and Y are not independent, since e.g.
EXY = EXI(X > 1/2) = ∫_{1/2}^{1} x dx = 3/8,
while
EX EI(X > 1/2) = 1/2 · 1/2 = 1/4,
Definition 2b3. Let X and Y be a pair of r.v. with finite second moments, EX² < ∞,
EY² < ∞. Then the correlation coefficient between X and Y is
ρ_XY := cov(X, Y) / √(var(X) var(Y)),
The covariance matrix of the vector X = (X1, X2)ᵀ is by definition⁶
( cov(X1, X1)  cov(X1, X2) )
( cov(X2, X1)  cov(X2, X2) ).
The two dimensional Gaussian density with parameters μ1, μ2, σ1 > 0, σ2 > 0 and ρ ∈ (−1, 1) is
f_X(x) = (1/(2π σ1 σ2 √(1 − ρ²))) exp{ −(1/(2(1 − ρ²))) ( (x1 − μ1)²/σ1² − 2ρ(x1 − μ1)(x2 − μ2)/(σ1σ2) + (x2 − μ2)²/σ2² ) },   (2b1)
for x ∈ R².
By a direct (but tedious) calculation, one can verify that indeed EX = μ = (μ1, μ2)ᵀ and
Cov(X, X) = S as above. ρ is the correlation coefficient between X1 and X2. The marginal
p.d.f.s are N(μ1, σ1²) and N(μ2, σ2²). Remarkably, for Gaussian vectors lack of correlation implies
independence
(1/2) e^{t²/2} + (1/2) e^{t²/2} = e^{t²/2}
Remark 2b10. In fact, the above expression for the m.g.f. is well defined even if ρ = ±1 (i.e.
the covariance matrix S is singular). In these cases, the j.p.d.f. does not exist and the Gaussian
vector is referred to as degenerate. Do not think, however, that this is a pathology: e.g. if
X ∼ N(0, 1), the vector (X, X) is a degenerate Gaussian vector.
Gaussian distribution is stable under linear transformations, namely
Lemma 2b11. Let X be a Gaussian vector as above, A be a 2 × 2 matrix and b a vector in
R². Then AX + b is a Gaussian vector with mean Aμ + b and covariance matrix ASAᵀ.
Proof. By definition,
M_{AX+b}(t) = E exp{ tᵀ(AX + b) } = exp{ tᵀb } E exp{ (Aᵀt)ᵀX }
            = exp{ tᵀb } exp{ μᵀAᵀt + (1/2)(Aᵀt)ᵀ S Aᵀt } = exp{ (Aμ + b)ᵀt + (1/2) tᵀ A S Aᵀ t }.
Since the j.m.g.f. uniquely determines the probability law, AX + b ∼ N(Aμ + b, ASAᵀ).
S_Y = A S_X Aᵀ = ( 1  −1 ; 1  −1 ) ( 1  0 ; 0  1 ) ( 1  1 ; −1  −1 ) = ( 2  2 ; 2  2 ).
The latter means that Y1 ∼ N(0, 2) and Y2 ∼ N(1, 2) and ρ(Y1, Y2) = 1 (think why). Hence
the j.p.d.f. is not well defined (as S_Y^{−1} doesn't exist). Hence Y is a degenerate Gaussian vector.
This, of course, should be anticipated since Y1 = X1 − X2 and Y2 = X1 − X2 + 1 and hence
Y2 = Y1 + 1.
Example 2b13. Let X be a Gaussian random vector in R² with i.i.d. standard components
and let θ be a deterministic (non-random) angle in [0, 2π]. Recall that
U(θ) = ( cos θ  −sin θ ; sin θ  cos θ )
is the rotation matrix, i.e.⁸ the vector y = U(θ)x has the same length as x and the angle between
x and y is θ. The vector Y = U(θ)X, i.e. the rotation of X by the angle θ, is Gaussian with zero
mean and covariance matrix U(θ)U(θ)ᵀ = I. This means that the standard Gaussian distribution
in R² is invariant under deterministic rotations.
⁸convince yourself by a planar plot
A random vector X = (X1, ..., Xn) takes values x = (x1, ..., xn) ∈ R^n.
The j.c.d.f. F_X(x), x ∈ R^n, should satisfy the usual properties. The random vector X can
be either discrete, in which case its j.c.d.f. is determined by the j.p.m.f. p_X(k), k ∈ N^n, or
continuous, when the j.c.d.f. is given by an n-fold integral of the j.p.d.f. f_X(x), x ∈ R^n. The j.p.m.f./j.p.d.f.
should sum/integrate to 1.
The one-dimensional marginals are defined for all the quantities as in the two-dimensional
case, e.g.
f_{X2}(x2) = ∫_{R^{n−1}} f_X(x1, x2, ..., xn) dx1 dx3 ... dxn.
Hence once again, the probability law of each individual component is completely determined
by the probability law of the vector. In addition, one can recover the joint probability law of
any subset of r.v.: for example, the j.p.d.f. of (X1, Xn) is given by
f_{X1 Xn}(x1, xn) = ∫_{R^{n−2}} f_X(x1, x2, ..., x_{n−1}, xn) dx2 ... dx_{n−1},
or, equivalently,
F_{X1 Xn}(x1, xn) = F_X(x1, ∞, ..., ∞, xn).
Similarly, the k-dimensional marginals are found. Expectation of a function g : R^n → R of a
continuous random vector is
Eg(X) = ∫_{R^n} g(x) f_X(x) dx1 ... dxn
and of a discrete one
Eg(X) = Σ_{x ∈ N^n} g(x) p_X(x).
The j.m.g.f. is
M_X(t) = Ee^{Σ_{i=1}^{n} t_i X_i},  t ∈ D ⊆ R^n,
where D is an open vicinity of the origin in R^n. The joint moments can be extracted from the
j.m.g.f. similarly to the two dimensional case. The covariance matrix is defined if all the entries
of X have finite second moments:
Cov(X, X) = E(X − EX)(X − EX)ᵀ.
The r.v. (X1, ..., Xn) are independent if their j.c.d.f. (j.p.d.f., etc.) factors into the product
of the individual c.d.f.s (p.d.f.s, etc.) analogously to the corresponding properties (1)-(4) above.
Independence of all the components of the vector clearly implies independence of any subset
of its components. However, a random vector may have dependent components, which are e.g.
pairwise (or triplewise, etc.) independent.
In statistics, n independent samples from a p.d.f. f means a realization of the random vector
X = (X1, ..., Xn), with the j.p.d.f. of the form:
f_X(x) = ∏_{i=1}^{n} f(x_i),  x ∈ R^n.
Example 2c1. Suppose X = (X1, ..., Xn) is a vector of i.i.d. random variables, where X1
takes values in {x1, ..., xk} with positive probabilities p1, ..., pk. Let Yi be the number of times
xi appeared in the vector X, i.e. Yi = Σ_{m=1}^{n} I(Xm = xi), and consider the random vector
Y = (Y1, ..., Yk). Clearly each Yi takes values in {0, ..., n}, i.e. it is a discrete r.v. Hence the
random vector Y is also discrete. The j.p.m.f. of Y is given by
p_Y(n1, ..., nk) = (n!/(n1! ... nk!)) p1^{n1} ... pk^{nk} if n1 + ... + nk = n, and 0 otherwise.
The corresponding distribution is called multinomial. The particular case k = 2 is readily
recognized as the familiar binomial distribution. One dimensional marginals are binomial, since
each Yi is a sum of n Bernoulli i.i.d. random variables I(Xm = xi), m = 1, ..., n (this can
also be seen by the direct and more tedious calculation of the marginal). Similarly, the two
dimensional marginals are trinomial, etc.
Let's check dependence between the components. Clearly Y1, ..., Yk are not independent,
since Y1 = n − Σ_{i=2}^{k} Yi and hence e.g. Y1 and Σ_{i=2}^{k} Yi are fully correlated, which would be
impossible, were Y1, ..., Yk independent. What about pairwise independence: e.g. are Y1 and Y2
independent ...? Intuitively, we feel that they are not: after all if Y1 = n, then Y2 = 0! This is
indeed the case:
P(Y1 = n, Y2 = 0) = P(Y1 = n) = p1^n,
while
P(Y1 = n)P(Y2 = 0) = p1^n (1 − p2)^n.
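A quick numerical illustration of this failure of pairwise independence, assuming Python with numpy is available; n = 5 and p = (0.5, 0.3, 0.2) are only illustrative values:

    import numpy as np

    n, p = 5, np.array([0.5, 0.3, 0.2])
    # exact values: P(Y1 = n, Y2 = 0) = p1^n, while P(Y1 = n) P(Y2 = 0) = p1^n (1 - p2)^n
    print(p[0]**n, p[0]**n * (1 - p[1])**n)

    # Monte Carlo estimate of P(Y1 = n, Y2 = 0) for comparison
    rng = np.random.default_rng(0)
    Y = rng.multinomial(n, p, size=200_000)
    print(np.mean((Y[:, 0] == n) & (Y[:, 1] == 0)))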
Gaussian vectors in R^n play an important role in probability theory.
Definition 2c2. X is a Gaussian random vector with mean μ and nonsingular⁹ covariance
matrix S, if it has j.p.d.f. of the form:
f_X(x) = (1/((2π)^{n/2} √(det S))) exp{ −(1/2)(x − μ)ᵀ S^{−1} (x − μ) },  x ∈ R^n.
It is not hard to see that the two-dimensional density from Definition 2b6 is obtained directly from the latter general formula. If S is a diagonal matrix, i.e. all the entries of X are
uncorrelated, X1, ..., Xn are independent. Furthermore, if μ = 0 and S is an identity matrix,
then X1, ..., Xn are i.i.d. standard Gaussian r.v.s.
Both Lemmas 2b9 and 2b11 remain valid in the general multivariate case, i.e. the j.m.g.f. of
a Gaussian vector is an exponential of a quadratic form and Gaussian multivariate distribution
is stable under linear transformations.
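A sketch of evaluating this density and sampling from it with off-the-shelf tools, assuming Python with numpy/scipy is available; the particular μ and S below are only illustrative:

    import numpy as np
    from scipy.stats import multivariate_normal

    mu = np.array([0.0, 1.0, -1.0])
    S = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])              # symmetric and positive definite

    # density at a point, straight from the formula of Definition 2c2
    x = np.array([0.5, 0.5, -0.5])
    q = (x - mu) @ np.linalg.inv(S) @ (x - mu)
    f_manual = np.exp(-0.5 * q) / np.sqrt((2 * np.pi) ** len(x) * np.linalg.det(S))
    print(f_manual, multivariate_normal(mu, S).pdf(x))   # the two values agree

    # the empirical covariance of a large sample is close to S
    rng = np.random.default_rng(1)
    X = rng.multivariate_normal(mu, S, size=100_000)
    print(np.cov(X.T))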
⁹positive definite
Exercises
Problem 2.1. A pair of fair dice is thrown. Let X be the sum of the outcomes and Y be the
Bernoulli r.v. taking value 1 if the first die comes up even.
(1) Show that (X, Y) is discrete and find its j.p.m.f. (present your answer as a table)
(2) Calculate the marginal p.m.f.s
(3) Calculate the moments EX, EY and EXY
Problem 2.2. Answer the questions from the previous problem for the r.v. (X, Y), defined
by the following experiment. Two Hanukkah tops are spun: X and Y are the outcomes of the
first and the second top respectively.
Problem 2.4. Let (X, Y) be a random vector distributed uniformly on the triangle with
the corners at (−2, 0), (2, 0), (0, 2).
(1) Find the j.p.d.f.
(2) Find the marginal p.d.f.s
(3) Are X and Y independent?
(4) Calculate the correlation coefficient between X and Y
Problem 2.5. Let (X, Y) be a random vector with the j.p.d.f. f_XY(x, y) = 12(x − y)² I((x, y) ∈ A)
where A = {(x, y) : 0 ≤ x ≤ y ≤ 1}.
(1) Find the marginal p.d.f.s
(2) Calculate P(1/2 ≤ X + Y ≤ 1), P(X + Y ≤ 1/2)
Problem 2.6. Let F be a one dimensional c.d.f. Are the following functions legitimate two-dimensional j.c.d.f.s? Prove, if your answer is positive, and give a counterexample if negative.
(1) F(x, y) := F(x) + F(y)
Problem 2.9. Let (X, Y) be a Gaussian vector with parameters μ_X, μ_Y, σ_X, σ_Y and ρ.
(1) Find the p.d.f. of the average (X + Y)/2
(2) Find the p.d.f. of (X − Y)/2
(3) Find the values of ρ so that the r.v. in (1) and (2) are independent
Problem 2.10. Let (X, Y) be a random vector, whose components have finite nonzero
second moments and correlation coefficient ρ. Show that for a, b, c, d ∈ R
ρ(aX + b, cY + d) = (ac/|ac|) ρ(X, Y).
Problem 2.11. Show that for a discrete random vector (X, Y), the properties
Ef(X)g(Y) = Ef(X) Eg(Y),  for all bounded g, f,
and
p_XY(k, m) = p_X(k) p_Y(m),  k, m ∈ N,
are equivalent.
Problem 2.12. Show that the general formula for the Gaussian density from Definition 2c2
in two dimensions reduces to the one from Definition 2b6.
Problem 2.13. Let X and Y be independent random vectors in R^k and R^m respectively
and let g : R^k → R and h : R^m → R be some functions. Show that ξ = g(X) and η = h(Y) are
independent r.v.
Problem 2.14. Let X be a standard Gaussian vector in R^n (i.e. with i.i.d. N(0, 1) entries)
(1) Show that the empirical mean X̄ = (1/n) Σ_{i=1}^{n} X_i is a Gaussian random variable and find
its mean and variance
(2) Argue that the vector (X1, X̄) is a Gaussian vector in R², find its mean and covariance
matrix
(3) Argue that the vector R := (X1 − X̄, ..., Xn − X̄) is a Gaussian vector, find its mean
and covariance matrix
(4) Show that X̄ and R are independent
Problem 2.15. Let X and Y be i.i.d. N(0, 1) r.v.s. Show that X² − Y² and 2XY have the
same probability law.
Problem 2.16. Let X = (X1, X2, X3) be a random vector in R³ with a given j.p.d.f. f_X(x), x ∈ R³.
(1) Check that f_X is indeed a legitimate j.p.d.f.
(2) Find all the one dimensional marginal p.d.f.s
(3) Find all the two dimensional marginal j.p.d.f.s. Are (X1, X2), (X2, X3), (X1, X3) Gaussian vectors in R²? If yes, find the corresponding parameters.
(4) Verify that f_X can be put in the following form:
f_X(x) = (1/((2π)^{3/2} √(det S))) exp( −(1/2)(x − μ)ᵀ S^{−1}(x − μ) ),  x ∈ R³,
for an appropriate vector μ ∈ R³ and matrix S.
Problem 2.17. Let X1, ..., Xn (n even) be independent Gaussian r.v.s, Xi ∼ N(μi, σi²), i.e. with the j.p.d.f.
f_X(x) = ∏_{i=1}^{n} (1/√(2πσi²)) exp( −(xi − μi)²/(2σi²) ),  x ∈ R^n,
and define Se = Σ_{i=1}^{n/2} X_{2i} and So = Σ_{i=1}^{n/2} X_{2i−1}.
Problem 2.18. Let X = (X1, ..., Xn) be i.i.d. N(0, 1) r.v. and ε = (ε1, ..., εn) i.i.d. Ber(1/2).
Assuming that X and ε are independent, find the probability law of
Z = Σ_{i=1}^{n} (1 − 2εi) Xi.
CHAPTER 3
Conditional expectation
a. The definition and the characterization via orthogonality
Recall the definition of the conditional probability of an event A, given an event B:
P(A|B) = P(A ∩ B) / P(B),   (3a1)
where P(B) > 0. If we know that B occurred, i.e. the experiment outcome ω is known to belong
to B, the event A occurs, i.e. ω ∈ A, if and only if ω ∈ A ∩ B. Hence the conditional probability
in (3a1) is proportional to P(A ∩ B) and is normalized so that P(A|B) is a legitimate probability
measure with respect to A, for any fixed B.
Consider a pair of random variables (X, Y) and suppose we want to calculate the conditional
probability of an event A, generated by X, say of A := {X ≥ 0}, having observed the realization
of the random variable Y. If Y takes a countable number of values, Y ∈ {y1, y2, ...}, and
P(Y = yi) > 0, then we can use the formula (3a1):
P(A|Y = yi) = P(A, Y = yi) / P(Y = yi).
However, how do we define the probability P(A|Y = y), if Y is a continuous random variable and hence P(Y = y) = 0 for all y's ...? An intuitively appealing generalization is the
infinitesimal definition
P(A|Y = y) := lim_{ε→0} P(A, |Y − y| ≤ ε) / P(|Y − y| ≤ ε).
While conceptually simple (and, in fact, correct - see Remark 3d8 below), this definition turns
out to have a somewhat limited scope and does not easily reveal some of the important sides of
conditioning with respect to random variables¹. Below we shall take a different approach to
conditional probabilities, based on their variational characterization.
Let X = (X1, ..., Xn) be a random vector and suppose that we have sampled (i.e. obtained
the realizations of) X2, ..., Xn and would like to guess the realization of X1. If, for example, X
had multinomial distribution, this would be simple and precise, namely n − Σ_{i=2}^{n} Xi would be the
only possible realization of X1. On the other hand, if the components of X were all independent,
then, intuitively, we feel that any guess, based on X2, ..., Xn, would be as bad as a guess not
utilizing any information gained from observing X2, ..., Xn. The problem we want to address
in this chapter is the following: for a random vector X = (X1, ..., Xn) with known distribution,
¹at its ultimate modern form, the basic conditioning is with respect to σ-algebras of events. Hence conditioning with respect to random variables is defined as conditioning with respect to the σ-algebra of events they
generate. Again, this truth is beyond the scope of these lecture notes
how do we guess the realization of X1, given the realizations of X2, ..., Xn - preferably in the
best possible way?
To formulate this problem of prediction in rigorous terms, we have to define what we mean
by guess and what is best for us. It is natural to consider the guesses of X1 of the form
g(X2, ..., Xn), where g is an R^{n−1} → R function: this means that each time we get a realization
of X2, ..., Xn, we plug it into g and thus generate a real number, which we interpret as the guess
of X1. How do we compare different guesses? Clearly, looking at the error X1 − g(X2, ..., Xn)
would be meaningless, since it varies randomly from realization to realization (or experiment
to experiment). Instead we shall measure the quality of the guess g by the mean square error
(MSE)
E( X1 − g(X2, ..., Xn) )².
The best guess for us will be the one which minimizes the MSE over all guesses, i.e. we aim to
find g* so that
E( X1 − g*(X2, ..., Xn) )² ≤ E( X1 − g(X2, ..., Xn) )²  for all g's   (3a2)
Finding such g* does not seem to be an easy problem, since the space of potential candidates
for g - all² functions on R^{n−1} - is vast and apparently it doesn't have any convenient structure to
perform the search³. However, things are not as hopeless as they may seem, due to the following
simple observation:
Lemma 3a1 (Orthogonality property). g* satisfies (3a2) if and only if
E( X1 − g*(X2, ..., Xn) ) h(X2, ..., Xn) = 0,  for all functions h.   (3a3)
³if there were only finitely many candidates, the task would be simple: just calculate all the corresponding MSEs and choose the minimal one. Also
it would be simple if we were looking for g within a certain family of candidates: e.g. of the form g(X2, ..., Xn) = Σ_{i=2}^{n} a_i X_i + b,
where the a_i's and b are real numbers. Then we could derive an expression for the MSE, which depends only on the a_i's and
b. In this case we could find the minimal MSE by the tools from calculus: derivatives, etc. But this is not what
we want: we are searching for g among all (not too weird) functions!
or equivalently
E( X1 − g* )² − E( X1 − g* − εh )² = 2εE(X1 − g*)h − ε² Eh² ≤ 0.
E(X Y )h(Y ) = E(X X 3 )h(Y ) = E(X X)h(Y ) = 0 for all h. Hence E(X|Y ) = 3 Y . Of
course, this is
also clear from the denition of g as the minimizer: the MSE corresponding to
3
the predictor Y is zero and hence is minimal. Note that along the same lines E(X|X) = X.
Example 3a4. Suppose that X1 and X2 are independent. What would be E(X1|X2)?
The natural candidate is E(X1|X2) = EX1. EX1 is a constant and hence is a (rather simple)
function of X2. Moreover, by independence, Eh(X2)(X1 − EX1) = Eh(X2) E(X1 − EX1) = 0,
which verifies (3a3). Let's see now that e.g. g(x) = EX1 − 1 is a bad candidate: let's check
whether
E( X1 − g(X2) ) h(X2) = 0,  ∀h ...?
To this end, we have:
E( X1 − EX1 + 1 ) h(X2) = E( X1 − EX1 + 1 ) Eh(X2) = Eh(X2).
Clearly, the latter does not vanish for all h: e.g. not for h(x) = x².
But how do we find the conditional expectation in less trivial situations, when finding the
right candidate is less obvious? In fact, it is not even clear whether or when the conditional
expectation exists! Before addressing these questions, let us note that if it exists, it is essentially
unique: i.e. if one finds a right candidate, it is the right candidate!
Lemma 3a5. If g* and g̃ satisfy (3a3), then P(g* = g̃) = 1.
Proof. Subtracting the equalities E(X1 − g*)(g* − g̃) = 0 and E(X1 − g̃)(g* − g̃) = 0,
we get E(g* − g̃)² = 0 and the claim follows (think why?).
⁴two random variables ξ and η are said to be orthogonal, if Eξη = 0
Σ_k k p_{X|Y}(k; m) = Σ_k k p_XY(k, m) / p_Y(m).
The latter is nothing but the optimal function g*(m) and the corresponding conditional expectation is
E(X|Y) = Σ_k k p_{X|Y}(k; Y) = Σ_k k p_XY(k, Y) / p_Y(Y).
Let's see how these formulae are obtained from the orthogonality property: we are looking for
a function g*, such that
E( X − g*(Y) ) h(Y) = 0,  ∀h.
On one hand,
EXh(Y) = Σ_{k,m} k h(m) p_XY(k, m),
and on the other hand,
Eg*(Y)h(Y) = Σ_{k,m} g*(m) h(m) p_XY(k, m).
Hence
E( X − g*(Y) ) h(Y) = Σ_{k,m} k h(m) p_XY(k, m) − Σ_{k,m} g*(m) h(m) p_XY(k, m)
                   = Σ_m h(m) ( Σ_k k p_XY(k, m) − g*(m) Σ_k p_XY(k, m) ).
The latter expression should equal zero for any function h: this would be the case if we choose
g*(m) = Σ_k k p_XY(k, m) / Σ_k p_XY(k, m) = Σ_k k p_XY(k, m) / p_Y(m).
Note that any other essentially different choice of g such that (3a3) holds is impossible, by
uniqueness of the conditional expectation.
Following the very same steps we can derive various other Bayes formulae:
Proposition 3b2 (The discrete Bayes formula). Let X be a discrete random vector and let
p_X(k), k ∈ N^n, be its j.p.m.f. Then for a function φ : R^m → R
E( φ(X1, ..., Xm) | X_{m+1} = k_{m+1}, ..., Xn = kn ) =
Σ_{k1,...,km} φ(k1, ..., km) p_X(k1, ..., kn) / Σ_{k1,...,km} p_X(k1, ..., kn),
where the marginal j.p.m.f. of X_{m+1}, ..., Xn is
p_{X_{m+1},...,Xn}(x_{m+1}, ..., xn) = Σ_{x1,...,xm} p_X(x1, ..., xn).
Note that
E( φ(X1, ..., Xm) | X_{m+1} = k_{m+1}, ..., Xn = kn ) =
Σ_{x1,...,xm} φ(x1, ..., xm) p_{X1,...,Xm|X_{m+1},...,Xn}(x1, ..., xm; k_{m+1}, ..., kn),
where
p_{X1,...,Xm|X_{m+1},...,Xn}(x1, ..., xm; x_{m+1}, ..., xn) := p_X(x1, ..., xn) / Σ_{x1,...,xm} p_X(x1, ..., xn)
is the conditional j.p.m.f. of X1, ..., Xm, given X_{m+1}, ..., Xn. The corresponding conditional
j.m.g.f. and j.p.g.f. are defined in the usual way.
Remark 3b4. In particular, for φ(x) = I(x1 = j), we recover the familiar expression for
conditional probabilities: e.g.
E( I(X1 = j) | X2 = x2, X3 = x3 ) = p_X(j, x2, x3) / Σ_i p_X(i, x2, x3) = P(X1 = j | X2 = x2, X3 = x3).
Example 3b5. Consider Y1, ..., Yk from Example 2c1 with multinomial distribution. The
conditional p.m.f. of Y1, given Yk, is found by means of the Bayes formula
p_{Y1|Yk}(x; y) = p_{Y1Yk}(x, y) / p_{Yk}(y) = p_{Y1Yk}(x, y) / Σ_j p_{Y1Yk}(j, y)
= [ (n!/(x! y! (n−x−y)!)) p1^x pk^y (1 − p1 − pk)^{n−x−y} ] / [ (n!/(y!(n−y)!)) pk^y (1 − pk)^{n−y} ]
= ((n − y)!/(x!(n − x − y)!)) p1^x (1 − p1 − pk)^{n−x−y} / (1 − pk)^{n−y}
= C(n−y, x) ( p1/(1 − pk) )^x ( 1 − p1/(1 − pk) )^{(n−y)−x},
for x = 0, ..., n − y and zero otherwise. This is easily recognized as the Binomial distribution
Bin(p1/(1 − pk), n − y). Hence e.g. the conditional expectation of Y1, given Yk, is
E(Y1|Yk) = (n − Yk) p1/(1 − pk).
Again, note that the latter is a random variable!
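The conditional binomial claim is easy to check by simulation. A minimal sketch, assuming Python with numpy; n = 20, k = 3 and the probabilities below are only illustrative:

    import numpy as np

    n, p = 20, np.array([0.2, 0.5, 0.3])       # p1, p2, pk with k = 3
    rng = np.random.default_rng(2)
    Y = rng.multinomial(n, p, size=300_000)

    y = 4                                       # condition on Yk = y
    sel = Y[:, 2] == y
    # empirical E(Y1 | Yk = y) versus the formula (n - y) p1 / (1 - pk)
    print(Y[sel, 0].mean(), (n - y) * p[0] / (1 - p[2]))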
Similarly, the conditional expectation is calculated in the continuous case:
Proposition 3b6 (The continuous Bayes formula). Let X be a continuous random vector
taking values in R^n with j.p.d.f. f_X(x), x ∈ R^n. Then for any φ : R^m → R
E( φ(X1, ..., Xm) | X_{m+1} = x_{m+1}, ..., Xn = xn ) =
∫_{R^m} φ(x1, ..., xm) f_{X1,...,Xm|X_{m+1},...,Xn}(x1, ..., xm; x_{m+1}, ..., xn) dx1 ... dxm,
where
f_{X1,...,Xm|X_{m+1},...,Xn}(x1, ..., xm; x_{m+1}, ..., xn) = f_X(x) / ∫_{R^m} f_X(x) dx1 ... dxm
is the conditional j.p.d.f. of X1, ..., Xm, given X_{m+1}, ..., Xn. The conditional expectation is
obtained by plugging X_{m+1}, ..., Xn into the function E( φ(X1, ..., Xm) | X_{m+1} = x_{m+1}, ..., Xn = xn ).
Corollary 3b7. Let X1 and X2 be jointly continuous r.v.s with the j.p.d.f. f_{X1X2}(u, v).
Then for a function φ : R → R
E( φ(X1) | X2 ) = ∫_R φ(u) f_{X1|X2}(u; X2) du,
where
f_{X1|X2}(u; v) = f_{X1X2}(u, v) / f_{X2}(v) = f_{X1X2}(u, v) / ∫_R f_{X1X2}(s, v) ds   (3b2)
is the conditional p.d.f. of X1, given X2.
Example 3b8. Let (X, Y) be a random vector with the j.p.d.f.
f_XY(x, y) = (x + y) I(x ∈ (0, 1)) I(y ∈ (0, 1)),  x, y ∈ (0, 1).
Then
f_Y(y) = ∫_R (x + y) I(x ∈ (0, 1)) I(y ∈ (0, 1)) dx = I(y ∈ (0, 1)) (1/2 + y),
and, by (3b2),
E(X|Y) = (1/3 + Y/2) / (Y + 1/2).
What if X is neither discrete nor continuous, but e.g. of a mixed type? Pay attention that
so far we haven't discussed the existence of the conditional expectation in general: perhaps,
in certain situations there is no g* which satisfies (3a3) ...? It turns out that the conditional
expectation exists under very mild conditions⁵ and, moreover, is often given by an abstract Bayes
formula. However, the actual calculation of conditional expectations can be quite involved. In
some cases, which fit neither of the propositions above, the conditional expectation can
be found directly from the property (3a3). Here is an example:
⁵the conditional expectation of X given Y exists, if e.g. E|X| < ∞. Note that nothing is mentioned of the
conditioning r.v. Y.
Example 3b9. Let X ∼ Exp(λ) and Y = εX, where ε ∼ Ber(1/2), independent of X. We
are interested to calculate E(X|Y). Note that (X, Y) is neither a jointly continuous (since Y is
not continuous) nor a jointly discrete vector (since X is continuous) and hence formally all the
Bayes formulae derived above are not applicable. In fact, using (3a3) to find E(X|Y) can be
viewed as deriving the Bayes formula for this particular situation.
We are looking for a function g* : R → R, such that
E( X − g*(Y) ) h(Y) = 0,  ∀h.
We have
EXh(Y) = EXh(εX) = EXh(εX)I(ε = 0) + EXh(εX)I(ε = 1)
       = EXh(0)I(ε = 0) + EXh(X)I(ε = 1) = h(0) (1/2) EX + (1/2) EXh(X)
       = (1/2) h(0) (1/λ) + (1/2) ∫_R u h(u) f_X(u) du.
Similarly,
Eg*(Y)h(Y) = Eg*(εX)h(εX) = (1/2) g*(0) h(0) + (1/2) ∫_R g*(u) h(u) f_X(u) du.
Hence
E( X − g*(Y) ) h(Y) = (1/2) h(0) ( 1/λ − g*(0) ) + (1/2) ∫_R h(u) ( u − g*(u) ) f_X(u) du.
The latter equals zero for any h if we choose g*(0) = 1/λ and g*(u) = u for u ≠ 0. Hence
E(X|Y = u) = u for u ≠ 0 and 1/λ for u = 0,
and
E(X|Y) = Y I(Y ≠ 0) + (1/λ) I(Y = 0).
The latter confirms the intuition: it is optimal to predict X by the value of Y, whenever it is not 0,
and to predict EX = 1/λ when Y = 0 (i.e. when Y doesn't tell anything about X). Think, however,
how could you possibly get this answer by means of the Bayes formulae derived above ...?!
Here is how the formulae look in the general multivariate case:
Proposition 3c3. Let (X, Y) be a Gaussian vector with values in R^{k+m}, where the first k
coordinates are denoted by X and the remaining m coordinates are denoted by Y. Let μ = (μ_X, μ_Y)
be the expectation vector and
S = ( S_X  S_XY ; S_YX  S_Y )
the covariance matrix. Assume that S_Y is nonsingular. Then the conditional distribution of X given Y is Gaussian
with the mean
E(X|Y) = μ_X + S_XY S_Y^{−1} (Y − μ_Y)
and the covariance
cov(X|Y) = S_X − S_XY S_Y^{−1} S_YX.
Proof. (a direct tedious calculation)
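These formulas translate directly into a few lines of linear algebra. A sketch, assuming Python with numpy and an illustrative partition k = 1, m = 2:

    import numpy as np

    # Gaussian vector in R^{1+2}: X is the first coordinate, Y the last two
    mu_X, mu_Y = np.array([1.0]), np.array([0.0, 2.0])
    S_X  = np.array([[2.0]])
    S_XY = np.array([[0.5, 0.3]])
    S_Y  = np.array([[1.0, 0.2],
                     [0.2, 1.5]])

    y = np.array([0.5, 1.0])                             # an observed realization of Y
    S_Y_inv = np.linalg.inv(S_Y)
    cond_mean = mu_X + S_XY @ S_Y_inv @ (y - mu_Y)       # E(X | Y = y)
    cond_cov  = S_X - S_XY @ S_Y_inv @ S_XY.T            # cov(X | Y)
    print(cond_mean, cond_cov)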
d. Properties
Not less important than the computational techniques of the conditional expectation are
its properties listed below.
Proposition 3d1. Let X, Y, Z be random vectors, a, b be real constants and φ, ψ functions
with appropriate domains of definition
(1) linearity: E(aX + bY | Z) = aE(X|Z) + bE(Y|Z).
(2) X ≤ Y ⟹ E(X|Z) ≤ E(Y|Z)
(3) E( E(X|Y, Z) | Z ) = E(X|Z)
(4) E( E(X|Z) | Y, Z ) = E(X|Z)
(5) E(X|a) = EX
(6) EE(X|Y) = EX
(7) E(c|Y) = c
(8) E( φ(X)ψ(Y) | Y ) = ψ(Y) E( φ(X) | Y )
Proof. All the properties are conveniently checked by applying the characterization
(3a3).
(1) Clearly aE(X|Z) + bE(Y|Z) is a legitimate candidate for E(aX + bY|Z) as it is a function
of Z. Let h be an arbitrary function, then
E( aX + bY − aE(X|Z) − bE(Y|Z) ) h(Z) = aE( X − E(X|Z) ) h(Z) + bE( Y − E(Y|Z) ) h(Z) = 0,
where we used linearity of the expectation (2a2) in the first equality and (3a3), applied
to E(X|Z) and E(Y|Z) individually.
(2) by linearity, it is enough to show that ξ ≥ 0 implies E(ξ|Z) ≥ 0 (with ξ = Y − X).
Note that
E[ ξ I( E(ξ|Z) ≤ 0 ) ] ≥ 0,
where we used ξ ≥ 0. On the other hand, by the orthogonality property (3a3),
E[ ( ξ − E(ξ|Z) ) I( E(ξ|Z) ≤ 0 ) ] = 0,
and hence
E[ E(ξ|Z) I( E(ξ|Z) ≤ 0 ) ] = E[ ξ I( E(ξ|Z) ≤ 0 ) ] ≥ 0.
Since E(ξ|Z) I( E(ξ|Z) ≤ 0 ) ≤ 0 with probability one, it follows that
E(ξ|Z) I( E(ξ|Z) ≤ 0 ) = 0.
But then
E(ξ|Z) = E(ξ|Z) I( E(ξ|Z) > 0 ) + E(ξ|Z) I( E(ξ|Z) ≤ 0 ) ≥ 0,
as claimed.
(3) we have to check that
E( E(X|Y, Z) − E(X|Z) ) h(Z) = 0
Let X1, ..., Xn be i.i.d. r.v. with E|X1| < ∞ and Sn := Σ_{i=1}^{n} Xi. We would like to calculate E(X1|Sn). To this end, note that
Sn = E(Sn|Sn) = Σ_{i=1}^{n} E(Xi|Sn).
Set S_{n\i} = Σ_{j≠i} Xj and notice that Xi and S_{n\i} are independent (why?) and S_{n\i} and S_{n\j}
have the same distribution (as the Xi's are identically distributed). Then for a bounded function h
and any i,
E Xi h(Sn) = E X1 h(Sn),
and, consequently,
E( Xi − E(X1|Sn) ) h(Sn) = E( X1 − E(X1|Sn) ) h(Sn) = 0,
i.e. E(Xi|Sn) = E(X1|Sn) for every i. Hence Sn = n E(X1|Sn) and E(X1|Sn) = Sn/n.
E E[ exp( t X1 X3/√(1 + X3²) ) | X3 ] E[ exp( t X2/√(1 + X3²) ) | X3 ]
= E exp( (1/2) t² X3²/(1 + X3²) ) exp( (1/2) t²/(1 + X3²) )
= E exp( (1/2) t² X3²/(1 + X3²) + (1/2) t²/(1 + X3²) ) = E exp( (1/2) t² ) = exp( (1/2) t² ).
The latter is identified as the m.g.f. of an N(0, 1) r.v.
Example 3d7. Consider X from Example 2b13 and suppose that θ is a random variable on
[0, 2π] (rather than a deterministic angle) with arbitrary distribution, independent of X. For
t ∈ R²,
E exp{ tᵀU(θ)X } = E E( exp{ tᵀU(θ)X } | θ ).
By the property 9,
E( exp{ tᵀU(θ)X } | θ ) = exp{ (1/2) tᵀ U(θ)U(θ)ᵀ t } = exp( (1/2)‖t‖² ),
and hence
E exp{ tᵀU(θ)X } = exp( (1/2)‖t‖² ),
which means that the standard Gaussian distribution is invariant under independent random
rotations.
Remark 3d8. Let (X, Y) be a continuous random vector. What do we mean by P(X ∈
[1, 2]|Y = 5) ...? Certainly this cannot be interpreted as the conditional probability of the
event A = {X ∈ [1, 2]}, given the event B = {Y = 5}, since P(B) = 0 and thus P(A|B) =
P(A ∩ B)/P(B) = 0/0 is not well defined! One natural way to deal with this is to define
P(X ∈ [1, 2]|Y = 5) by the limit
lim_{ε→0} P(X ∈ [1, 2], |Y − 5| ≤ ε) / P(|Y − 5| ≤ ε).
This approach actually works when X and Y are random variables taking real values, yielding
the aforementioned Bayes formulae. This approach, however, has serious disadvantages: (1) it
doesn't work in a greater generality⁶ and, more importantly for this course, it is not easy to
derive the properties of the obtained object.
The standard modern approach is through conditional expectation: by definition, P(X ∈
[1, 2]|Y) = E( I(X ∈ [1, 2]) | Y ) and the latter makes perfect sense: it is the essentially unique
random variable, which is defined either by (3a2) or equivalently by (3a3). Moreover, P(X ∈
[1, 2]|Y = u), u ∈ R, is the function (called g*(u) above), which realizes the conditional expectation.
The following example is an illuminating manifestation⁷
Example 3d9. Let U and V be i.i.d. r.v.s with uniform distribution over the interval
(0, 1). Define D := U − V and R := U/V. We shall calculate E(U|D) and E(U|R) by the Bayes
formulae. To this end, note that the random vector (U, D) has the joint density
f_{UD}(x, y) = f_U(x) f_V(x − y) = I(x ∈ [0, 1]) I(x − y ∈ [0, 1]),
and the p.d.f. of D is given by
f_D(y) = ∫_0^1 f_{UD}(x, y) dx = (1 − |y|) I(y ∈ (−1, 1)).
A direct calculation with the conditional density then gives E(U|D = y) = (1 + y)/2, both for
y ∈ (−1, 0] and for y ∈ [0, 1), i.e.
E(U|D) = (1/2)(1 + D).
To find E(U|R) we shall use the orthogonality characterization. To this end, for a bounded
function h,
EUh(R) = ∫_0^1 x ( ∫_0^1 h(x/y) dy ) dx = ∫_0^1 x ( ∫_x^∞ h(z) (x/z²) dz ) dx
        = ∫_0^∞ h(z) (1/z²) ( ∫_0^{1∧z} x² dx ) dz = ∫_0^∞ h(z) (1/z²) (1/3)(1 ∧ z)³ dz.
Similarly,
Eg*(R)h(R) = ∫_0^1 ∫_0^1 g*(x/y) h(x/y) dy dx = ∫_0^∞ g*(z) h(z) (1/z²) ( ∫_0^{1∧z} x dx ) dz
           = ∫_0^∞ g*(z) h(z) (1/z²) (1/2)(1 ∧ z)² dz.
Hence E( U − g*(R) ) h(R) = 0 for all h, if we choose
g*(z) = [ (1/3)(1 ∧ z)³ ] / [ (1/2)(1 ∧ z)² ] = (2/3)(1 ∧ z),
i.e.
E(U|R) = (2/3)(1 ∧ R).
Consequently,
E(U|D = 0) = 1/2  and  E(U|R = 1) = 2/3.
This may seem counterintuitive, since {R = 1} = {D = 0} = {U = V}, but we shall predict U
differently on {R = 1} and {D = 0}! This is no paradox, since we measure the quality of the
prediction of U by the mean square error and not the individual errors for particular realizations
of R or D. In fact, for an arbitrary number a ∈ R,
E(U|D) = (1/2)(1 + D) for D ≠ 0 and a for D = 0, with prob. 1,
and hence comparing E(U|D) and E(U|R) for individual realizations of U and R is meaningless.
To get an additional insight into the mysterious numbers 1/2 and 2/3, explore the limits
lim_{ε→0} EU I(|D| ≤ ε) / P(|D| ≤ ε) = 1/2,
and
lim_{ε→0} EU I(|R − 1| ≤ ε) / P(|R − 1| ≤ ε) = 2/3.
Geometrically, the first limit suggests to calculate the area of the linear strip
{(x, y) ∈ (0, 1) × (0, 1) : |x − y| ≤ ε},
whose center of mass is at its mid point, corresponding to 1/2, while the second limit has to
do with the area of the sector
{(x, y) ∈ (0, 1) × (0, 1) : |x/y − 1| ≤ ε},
whose center of mass is shifted towards the thick end, yielding 2/3.
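The two limits at the end are easy to check by simulation. A sketch, assuming Python with numpy, with ε = 0.01:

    import numpy as np

    rng = np.random.default_rng(3)
    U = rng.uniform(size=2_000_000)
    V = rng.uniform(size=2_000_000)
    D, R = U - V, U / V
    eps = 0.01

    # E(U | |D| <= eps) -> 1/2 and E(U | |R - 1| <= eps) -> 2/3 as eps -> 0
    print(U[np.abs(D) <= eps].mean())       # close to 0.5
    print(U[np.abs(R - 1) <= eps].mean())   # close to 2/3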
Exercises
Problem 3.1. Find FX|Y (x; y), fX|Y (x; y), FY |X (y; x) and fY |X (y; x) for the random vector
from Problem 2.3
Problem 3.2. Let (X, Y ) be a random vector, such that E(Y |X = x) is in fact not a function
of x. Show that var(Y ) = Evar(Y |X).
Problem 3.3 (The law of total variance). For a pair of random variables X and Y, show
that
var(X) = Evar(X|Y) + var(E(X|Y)),
if EX² < ∞.
Problem 3.4. Let X ∼ U([0, 1]) and suppose that the conditional law of Y given X is
binomial Bin(n, X).
(1) Calculate EY
(2) Find the p.m.f. of Y for n = 2
(3) Generalize to n > 2 (leave your answer in terms of the so called β-function:
β(a, b) = ∫_0^1 x^{a−1} (1 − x)^{b−1} dx,
which equals (a−1)!(b−1)!/(a+b−1)! for integer a and b)
Problem 3.5. Let X, Y be random variables with finite second moments and var(Y) > 0.
Show that
E(X − aY − b)² ≥ var(X) − cov²(X, Y)/var(Y),  a, b ∈ R,
and that the minimum is attained by the optimal linear predictor:
a* = cov(X, Y)/var(Y),  b* = EX − a* EY.
E(X − X̂)². Suppose that Alice sent 1 and Bob obtained 1/2: what guess would Bob
generate by the linear predictor?
Problem 3.7. Answer the question from the previous problem, assuming X ∼ N(1/2, 1/4)
(instead of X ∼ Ber(1/2)).
Problem 3.8. Let (X, Y) be a Gaussian vector in R² with the parameters μ_X = 5, μ_Y = 10,
σ_X = 1, σ_Y = 5.
(1) Find ρ(X, Y) if ρ(X, Y) > 0 and P(4 < Y < 16|X = 5) = 0.954...
(2) If ρ(X, Y) = 0, find P(X + Y < 16)
Problem 3.9. Let (X, Y) be a Gaussian vector with the parameters μ_X, μ_Y, σ_X², σ_Y², ρ.
(1) Is E(X|Y) a Gaussian r.v.? Find its mean and variance
(2) Is (X, E(X|Y)) a Gaussian vector? Find its mean and covariance matrix.
(3) Calculate E( E(X|Y) | X )
(4) Calculate E( X | E(X|Y) )
Hint: remember that E(X|Y) is a linear function of Y
Problem 3.10. Let X ∼ U([0, 1]) and Y = X I(X ≥ 1/2). Find E(X|Y).
Problem 3.11. n i.i.d. experiments with the success probability p > 0 are performed. Let
X be the number of successes in the n experiments and Y be the number of successes in the
first m experiments (of course, m ≤ n).
(1) Find the j.p.m.f of (X, Y)
(2) Calculate the conditional p.m.f of Y, given X, and identify it as one of the standard
p.m.f.s
Problem 3.12. (*) Let X be a r.v. with the p.d.f. f(x) and let Y := g(X), where g is
a piecewise strictly monotonous differentiable function (so that the set of roots g^{−1}(y) = {x :
g(x) = y} is a finite or countable set). Prove that the conditional law of X, given Y, is discrete
and its p.m.f. is given by:
P(X = x|Y) = ( f(x)/|g'(x)| ) / ( Σ_{s ∈ g^{−1}(Y)} f(s)/|g'(s)| ),  x ∈ g^{−1}(Y),
where g' is the derivative of g. Think what changes if g has zero derivative on a nonempty
interval ...?
Problem 3.13. Let X ∼ Exp(λ) with λ > 0 and set Z = I(X ≥ s) for some s > 0.
A = {(x, y) ∈ R² : 0 ≤ x ≤ 2y ≤ 2}
CHAPTER 4
Let X ∼ U([0, 1]) and let {p(i)}_{i≥0} be a p.m.f. Define
Y := min{ k ≥ 0 : Σ_{i=0}^{k} p(i) ≥ X }.
Then
P(Y = j) = P( Σ_{i=0}^{j−1} p(i) < X ≤ Σ_{i=0}^{j} p(i) ) = ∫_{Σ_{i=0}^{j−1} p(i)}^{Σ_{i=0}^{j} p(i)} dx = Σ_{i=0}^{j} p(i) − Σ_{i=0}^{j−1} p(i) = p(j).
Can we get a continuous r.v. from a discrete one? Obviously not:
Proposition 4a3. Let X be a r.v. with p.m.f. p_X and g : R → R, then Y = g(X) is
discrete, and
p_Y(i) = Σ_{j: g(x_j) = y_i} p_X(j).
Proof. Let {x1, x2, ...} be the set of values of X, then Y takes values in the discrete set
{g(x1), g(x2), ...}. Moreover,
p_Y(i) = P(Y = y_i) = P(X ∈ {x_j : g(x_j) = y_i}) = Σ_{x_j: g(x_j) = y_i} P(X = x_j) = Σ_{j: g(x_j) = y_i} p_X(j).
Here is a particular, but important, example of a different flavor:
Proposition 4a4. Let F and G be continuous and strictly increasing c.d.f.s and let X be a
r.v. sampled from F; then U = F(X) has uniform distribution on [0, 1] and Y := G^{−1}(F(X))
is a sample from G.
Proof. If X is a r.v. with a strictly increasing c.d.f. F(x), then for U := F(X)
P(U ≤ x) = P(F(X) ≤ x) = P(X ≤ F^{−1}(x)) = F(F^{−1}(x)) = x,  x ∈ [0, 1]
Remark 4a5. The typical application of Proposition 4a4 is the following: suppose we
can generate a r.v. X with a particular distribution F_X (e.g. uniform or normal etc.) and we want
to generate a r.v. Y with a different distribution F_Y. If F_X and F_Y satisfy the above conditions,
this can be done by setting Y := F_Y^{−1}(F_X(X)).
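A sketch of this recipe, assuming Python with numpy/scipy: starting from standard normal samples, the transformation F_Y^{−1}(F_X(X)) produces Exp(1) samples.

    import numpy as np
    from scipy.stats import norm, expon, kstest

    rng = np.random.default_rng(4)
    X = rng.normal(size=100_000)          # samples with c.d.f. F_X = standard normal
    Y = expon.ppf(norm.cdf(X))            # Y := F_Y^{-1}(F_X(X)) with F_Y = Exp(1); ppf is the inverse c.d.f.

    # Y should be a sample from Exp(1); a Kolmogorov-Smirnov goodness-of-fit check
    print(kstest(Y, 'expon'))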
Let's now consider the setting, when a r.v. with a p.d.f. is mapped by a differentiable one-to-one function g:
Proposition 4a6. Let X be a r.v. with p.d.f. f_X(x) and let Y = g(X), where g is a
differentiable and strictly monotonous function on the interior¹ of the support of f_X. Then Y
has the p.d.f.
f_Y(v) = f_X( g^{−1}(v) ) / | g'( g^{−1}(v) ) |,  v ∈ R.
Proof. Suppose g increases, then
F_Y(v) := P(Y ≤ v) = P(g(X) ≤ v) = P(X ≤ g^{−1}(v)) = F_X(g^{−1}(v)),   (4a1)
where the last equality follows from the strict monotonicity of g. Since g is invertible, i.e.
g(g^{−1}(x)) = x, x ∈ R, and as g is differentiable and strictly increasing, taking the derivative of
(4a1) yields the claimed formula.
¹Recall that the interior of a subset A ⊆ R, denoted by A°, is the set of all internal points of A, where a
point x ∈ A is internal, if there is an open interval V_x, such that x ∈ V_x ⊆ A. For example, 0.5 is an internal
point of [0, 1), while 0 and 1 are not.
The following generalizes the latter proposition to the setting, when the transformation g is
only piecewise monotone:
Proposition 4a8. Let X be a r.v. with p.d.f. f_X(x) and let g be a function of the form
g(x) = Σ_{i=1}^{m} g_i(x) I(x ∈ A_i),
where A_i, i = 1, ..., m, are pairwise disjoint intervals partitioning the support of f_X, and where
the g_i(x) are differentiable and strictly monotonous² functions on A_i° (the interiors of A_i) respectively. Then Y = g(X) has the p.d.f.
f_Y(v) = Σ_{i=1}^{m} ( f_X(g_i^{−1}(v)) / |g_i'(g_i^{−1}(v))| ) I( g_i^{−1}(v) ∈ A_i° ),  v ∈ ∪_{i=1}^{m} g(A_i°).   (4a2)
Remark 4a9. Note that Proposition 4a6 is a particular case of the latter proposition, corresponding to m = 1, i.e. when g is monotonous on all the support.
Remark 4a10. Note that f_Y in (4a2) remains undefined outside the open set ∪_{j=1}^{m} g(A_j°); the complement
consists of a finite number of points. At these points f_Y can be defined arbitrarily without
affecting any probability calculations it is involved in (why?)
Proof. Since g_i is a monotonous function and A_i is an interval, say A_i := [a_i, b_i], the image
of A_i under g_i is the interval [g_i(a_i), g_i(b_i)], if g_i increases, and [g_i(b_i), g_i(a_i)], if it decreases. For a
fixed i and v ∈ R, consider the quantity
P(Y ≤ v, X ∈ A_i) = P(g_i(X) ≤ v, X ∈ A_i),
²for simplicity, we shall assume that g_i'(x) > 0, if g_i is increasing, and g_i'(x) < 0, if it is decreasing, for all x ∈ A_i°
which equals
P(Y ≤ v, X ∈ A_i) = 0 for v ≤ g_i(a_i),  ∫_{a_i}^{g_i^{−1}(v)} f_X(x) dx for v ∈ ( g_i(a_i), g_i(b_i) ),  P(X ∈ A_i) for v ≥ g_i(b_i),
and hence
(d/dv) P(Y ≤ v, X ∈ A_i) = f_X(g_i^{−1}(v)) (d/dv) g_i^{−1}(v) for v ∈ ( g_i(a_i), g_i(b_i) ), and 0 otherwise
                         = f_X(g_i^{−1}(v)) [ g_i'(g_i^{−1}(v)) ]^{−1} for v ∈ ( g_i(a_i), g_i(b_i) ), and 0 otherwise.
Summing over i,
P(Y ≤ v) = Σ_i P(Y ≤ v, X ∈ A_i),
and differentiating gives (4a2).
Example 4a11. In many cases it is convenient to bypass the general formula and act directly.
Let X ∼ N(0, 1) and let Y = X². Then obviously P(Y ≤ v) = 0 if v ≤ 0, and for v > 0
F_Y(v) = P(−√v ≤ X ≤ √v) = ∫_{−√v}^{√v} (1/√(2π)) e^{−x²/2} dx,
so that
f_Y(v) = (d/dv) F_Y(v) = f_X(√v) (1/(2√v)) + f_X(−√v) (1/(2√v)) = f_X(√v) (1/√v) = (1/√(2πv)) e^{−v/2} I(v > 0).   (4a3)
Now let's get the same answer by applying the general recipe from the last proposition. The
function g(x) = x² is decreasing on (−∞, 0] and increasing on (0, ∞) (note that inclusion of
{0} to either interval is not essential, since we deal only with the interiors of the A_i's). We have
g_1^{−1}(v) = −√v and g_2^{−1}(v) = √v and g_i'(x) = 2x. For v < 0, f_Y(v) = 0, since v belongs
neither to g(A_1) = [0, ∞) nor to g(A_2) = (0, ∞). For v > 0 the formula yields
f_Y(v) = f_X(−√v) (1/(2√v)) I(v ∈ (0, ∞)) + f_X(√v) (1/(2√v)) I(v ∈ (0, ∞)),
which is the same p.d.f. as in (4a3).
For X1, ..., Xn i.i.d. r.v. with common c.d.f. F and p.d.f. f, the maximum X(n) := max_i X_i and the minimum X(1) := min_i X_i have the p.d.f.s
f_{X(n)}(x) = n f(x) F(x)^{n−1}  and  f_{X(1)}(x) = n f(x) ( 1 − F(x) )^{n−1}.
Proof. We have
F_{X(n)}(x) = P(X(n) ≤ x) = P(X1 ≤ x, ..., Xn ≤ x) = ∏_{i=1}^{n} P(Xi ≤ x) = F(x)^n,
and
1 − F_{X(1)}(x) = P(X(1) > x) = P(X1 > x, ..., Xn > x) = ∏_{i=1}^{n} P(Xi > x) = ( 1 − F(x) )^n.
Example 4b2. Let X1, ..., Xn be i.i.d. U([0, 1]), i.e. f(x) = I(x ∈ [0, 1]). Then by the above formulae
f_{X(n)}(x) = n x^{n−1} I(x ∈ [0, 1])
and
f_{X(1)}(x) = n ( 1 − x )^{n−1} I(x ∈ [0, 1]).
Note that the p.d.f. of the min concentrates around 0, while the p.d.f. of the max is more concentrated
around 1 (think why).
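A quick empirical check of these formulas for the uniform case, assuming Python with numpy and n = 5:

    import numpy as np

    n = 5
    rng = np.random.default_rng(5)
    X = rng.uniform(size=(200_000, n))
    mx, mn = X.max(axis=1), X.min(axis=1)

    # compare empirical c.d.f.s with F_max(x) = x^n and F_min(x) = 1 - (1 - x)^n at x = 0.7
    x = 0.7
    print(np.mean(mx <= x), x**n)
    print(np.mean(mn <= x), 1 - (1 - x)**n)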
Sum.
Proposition 4b3. For a pair of real valued random variables X and Y with the joint p.d.f.
f_XY(x, y), the sum S = X + Y has the p.d.f., given by the convolution integral
f_S(u) = ∫_R f_XY(x, u − x) dx = ∫_R f_XY(u − x, x) dx.
Proof. For u ∈ R,
P(S ≤ u) = P(X + Y ≤ u) = ∫_{R²} I(s + t ≤ u) f_XY(s, t) ds dt
         = ∫_R ( ∫_R I(s ≤ u − t) f_XY(s, t) ds ) dt = ∫_R ( ∫_{−∞}^{u−t} f_XY(s, t) ds ) dt
         = ∫_R ( ∫_{−∞}^{u} f_XY(s − t, t) ds ) dt = ∫_{−∞}^{u} ( ∫_R f_XY(s − t, t) dt ) ds.
Remark 4b4. If X and Y take integer values, S = X + Y is an integer valued r.v. with the
p.m.f. given by the convolution
p_S(k) = Σ_m p_XY(k − m, m).
Remark 4b5. If X and Y are i.i.d. with common p.d.f. f (or p.m.f.), then f_S(u) =
∫_R f(u − x) f(x) dx =: f ⋆ f. By induction, if X1, ..., Xn are i.i.d., f_S = f^{⋆n} is the n-fold convolution
of f.
Example 4b6. Let X and Y be i.i.d. r.v. with common uniform distribution U([0, 1]).
Then
f_S(u) = ∫_R I(u − x ∈ (0, 1)) I(x ∈ (0, 1)) dx = ∫_0^1 I(u − x ∈ (0, 1)) dx = ∫_0^1 I(x ∈ (u − 1, u)) dx.
If u < 0, then (u−1, u) ∩ (0, 1) = ∅ and hence the integral is zero. If u ∈ [0, 1), (u−1, u) ∩ (0, 1) =
(0, u) and the integral yields u. If u ∈ [1, 2), then (u − 1, u) ∩ (0, 1) = (u − 1, 1) and the integral
gives 2 − u. Finally, for u ≥ 2, (u − 1, u) ∩ (0, 1) = ∅ and the integral is zero again. Hence we
get:
f_S(u) = ( 1 − |u − 1| ) I(u ∈ [0, 2]).
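The triangular shape is easy to confirm by simulation. A sketch, assuming Python with numpy:

    import numpy as np

    rng = np.random.default_rng(6)
    S = rng.uniform(size=1_000_000) + rng.uniform(size=1_000_000)

    # compare a local histogram estimate of f_S with the triangular density 1 - |u - 1| on [0, 2]
    for u in (0.25, 1.0, 1.6):
        est = np.mean(np.abs(S - u) <= 0.01) / 0.02      # P(|S - u| <= 0.01) / 0.02
        print(u, est, 1 - abs(u - 1))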
Calculating the convolution integral (sum) can be tedious. Sometimes it is easier to tackle
the problem by means of the m.g.f.:
Proposition 4b7. Let X1, ..., Xn be i.i.d. r.v. with the common m.g.f. M(t). Then S =
Σ_{i=1}^{n} Xi has the m.g.f.
M_S(t) = M^n(t).
Proof.
M_S(t) = Ee^{St} = Ee^{Σ_{i=1}^{n} Xi t} = E ∏_{i=1}^{n} e^{Xi t} = ∏_{i=1}^{n} Ee^{Xi t} = ( Ee^{X1 t} )^n = M^n(t),
Remark 4b8. Similarly, if X1 is a discrete r.v. with p.g.f. G_X(s), the sum S is also discrete
with the p.g.f. G_S(s) = G_X^n(s).
In many cases M^n(t) can be identified with some standard known m.g.f. Here are some
examples:
Example 4b9. Let X1 ∼ Ber(p). Then
G_S(s) = ( ps + 1 − p )^n.
Hence
p_S(0) = G_S(0) = (1 − p)^n
p_S(1) = G_S'(0) = np ( ps + 1 − p )^{n−1}|_{s=0} = np(1 − p)^{n−1}
...
C. DIFFERENTIABLE Rn 7 Rn TRANSFORMATIONS
59
{
}
MS (t) = M n (t) = exp n(et 1) ,
which is recognized as the m.g.f. of the Poisson distribution Poi(n). Hence
Poi(n).
( S )
Similarly, if X1 , ..., Xn are independent r.v. with Xi Poi(i ), then S Poi
i i .
2
Example4b11.
1 , ..., Xn are independent Gaussian random variables Xi N (i , i ),
If X
2
then S N ( i , i i ) (which can also be deduced from the fact that Gaussian distribution
in Rn is stable under linear transformations).
c. Dierentiable Rn 7 Rn transformations
Let X = (X1 , ..., Xn ) be a random vector with j.p.d.f fX (x), x Rn and let g : Rn 7 Rn be
a given function. The following proposition gives the formula for the j.p.d.f. of Y = g(X) for
appropriate gs.
Proposition 4c1. Let X1 , ..., Xn be a random vector with continuous joint p.d.f. fX (x),
x Rn . Denote by D the support of fX in Rn :
{
}
D := cl x Rn : fX (x) > 0 ,
and consider a function3 g : D 7 Rn . Suppose that there are pairwise disjoint subsets Di ,
i = 1, ..., m, such that the set D \ m
i=1 Di has probability zero:
fX (x)dx = 0
...
D\m
i=1 Di
and the function g is one-to-one on all Di s. Let gi1 be the inverse of g on Di and dene the
corresponding Jacobians
1
1
1
g
(y)
g
(y)
...
g
(y)
i1
i1
i1
y
y2
yn
1 1
1
1
y1 gi2 (y) y 2 gi2 (y) ... yn gi2 (y)
Ji (y) = det
, y Di , i = 1, ..., m.
...
...
...
...
1
1
1
g
(y)
g
(y)
...
g
(y)
y1 in
y2 in
yn in
Assume that all the partial derivatives above are continuous on4 g(Di )s and Ji (y) = 0 for all
y g(Di ), for all i = 1, ..., m. Then
fY (y) =
y i g(Di ).
i=1
Proof. A change of variable for n-fold integrals (tools from n-variate calculus).
3the domain of g might dier from D by a set of points in Rn with zero probability - see the examples below.
4for a subset C Rn and a function h : Rn 7 R, g(C) denotes the image of C under g, i.e. g(C) := {g(x), x
C}
60
Remark 4c2. Pay attention that for n = 1, the latter reduces to the claim of Proposition
4a8.
Example 4c3. Let X1 and X2 be i.i.d. r.v. with N (0, 1) distribution. We would like to nd
the j.p.d.f. of Y1 = X1 + X2 and Y2 = X1 /X2 . In terms of the ingredients of the proposition,
2
2
fX (x) = 1 ex1 /2x2 /2 , x R2 , D = R2 and g(x) = (x1 + x2 , x1 /x2 ) : R2 7 R2 . Note that
the domain of g is R2 \ {(x1 , x2 ) R : x2 = 0}, i.e. g is dened on the plane o the line
:= {x2 = 0} (which has probability zero). The inverse function can be found from the system
of equations
y1 = x1 + x2
y2 = x1 /x2
These yield x1 = y2 x2 and y1 = (y2 + 1)x2 . If y2 = 1, then
x2 = y1 /(y2 + 1)
x1 = y2 y2 /(y2 + 1).
If y2 = 1, i.e. x1 = x2 , then y1 = 0: the whole line := {(x1 , x2 ) : x1 = x2 } is mapped to
a single point (0, 1). Hence g is invertible on R2 \ ( ) with
(
)
g 1 (y) = y1 y2 /(y2 + 1), y1 /(y2 + 1) .
Since fX (x)dx = 0, the natural choice of the partition is just D1 = R2 \ ( ) and g1 (y) :=
g(y). The range of D1 under g1 is R \ {(0, 1)}. Lets calculate the Jacobian:
(
)
(
)
1
1
g
(y)
g
(y)
y
y
/(y
+
1)
y
y
/(y
+
1)
1
2
2
1
2
2
1
1
y2
y2
J = det y 1 1
= det y1
=
1
y1 g2 (y) y2 g2 (y)
y1 y1 /(y2 + 1)
y2 y1 /(y2 + 1)
(
)
y1
y2 /(y2 + 1) y1 /(1 + y2 )2
.
det
= y2 y1 /(y2 + 1)3 y1 /(1 + y2 )3 =
2
1/(y2 + 1) y1 /(y2 + 1)
(1 + y2 )2
Now we are prepared to apply the formula: for y R \ (0, 1)
fY (y) = |J(y)|fX (g 1 (y)) =
{ 1(
)2 1 (
)2 }
|y1 |
1
exp
y
y
/(y
+
1)
y
/(y
+
1)
=
1 2
2
1
2
(1 + y2 )2 2
2
2
{ 1 y 2 (y 2 + 1) }
|y1 |
1 2
exp
.
2(1 + y2 )2
2 (y2 + 1)2
One can check that the Y1 marginal is N (0, 2). Lets calculate the Y2 marginal p.d.f.:
{ 1 y 2 (y 2 + 1) }
|y1 |
1 2
fY2 (y2 ) =
fY1 Y2 (y1 , y2 )dy1 =
exp
dy1 =
2
2
2(1
+
y
)
2
(y
2
2 + 1)
R
R
{ 1 y 2 (y 2 + 1) }
1
1 1
1 2
2
y
exp
dy1 = ... =
,
1
2
2
2(1 + y2 )
2 (y2 + 1)
1 + y22
0
i.e. Y2 has standard Cauchy distribution.
EXERCISES
61
Exercises
Problem 4.1. Let FXY (u, v), u, v R2 be a continuous j.c.d.f., such that both v 7 FY (v)
and u 7 FX|Y (u; v), v R are strictly increasing. Suggest a way to produce a sample from
FXY , given a sample of i.i.d. r.vs U and V with uniform distribution on [0, 1].
Problem 4.2. Let X1 , ..., Xn be i.i.d. random variables with the common distribution F .
Let X(1) , ..., X(n) be the permutation of X1 , ..., Xn , such that X(1) ... X(n) . X(i) is called
the i-th order statistic of X1 , ..., Xn .
(1) Show that the c.d.f. of X(i) is given by
FX(i) (x) =
(
)nj
F (x)j 1 F (x)
.
j=i
(2) Show that if F has the p.d.f. f , then X(i) has the p.d.f.
fX(i) (x) =
n!
F (x)i1 (1 F (x))ni f (x).
(i 1)!(n i)!
Problem 4.4. Find the p.d.f. of Z = X/Y , if X and Y are r.v. with j.p.d.f. f (x, y),
(x, y) R2
62
Problem 4.5. Let X N (0, 12 ) and Y N (0, 22 ) be independent r.v. Show that the p.d.f.
of Z = X/Y is Cauchy:
1
fZ (x) =
, x R,
(1 + (t/)2 )
and nd the corresponding scaling parameter .
= X/1 and Y = Y /2 , show that X/
Y has standard Cauchy density
Hint: Introduce X
(i.e. with = 1) and deduce the claim
Problem 4.6. Let X1 , ..., Xn be i.i.d. r.v. with the common exponential distribution of rate
> 0.
(1) Show that X(1) = min(X1 , ..., Xn ) has exponential distribution and nd its rate.
Problem 4.7. Let X and Y be i.i.d. standard Gaussian r.v. Dene R := X 2 + Y 2 , the
distance of the random point (X, Y ) from the origin and := arctan(Y /X), the angle formed
by the vector (X, Y ) and the x-axis.
(1) Prove that R and are independent, U ([0, 2]) and R has the Rayleigh p.d.f.:
fR (x) = rer /2 I(r 0).
(
)
Hint: the function g(x, y) =
x2 + y 2 , arctan(y/x) is invertible and its inverse is
given by
g 1 (r, ) = (r cos , r sin ), (r, ) R+ [0, 2)
(2) Let R and be independent r.v. with Rayleigh and U ([0, ]) distributions. Show that
(X, Y ) are i.i.d. standard Gaussian r.v.
Hint: the transformation is one-to-one and onto.
2
Problem 4.8. Let U and V be i.i.d. r.v. with the common distribution U, V U ([0, 1]).
Show that
Y = 2 ln U sin(2V ), X = 2 ln U cos(2V ),
are i.i.d. standard Gaussian
r.v.5
Hint: dene R = X 2 + Y 2 and = arctan(X/Y ) and show that R and are independent
and have the Rayleigh and U ([0, 2]) distributions. Refer to the previous problem.
Problem 4.9. Let X U ([0, 1]). Find an appropriate function g, so that Y = g(X) has
each of the following distributions:
(1) U ([a, b]), b > a R.
(2) uniform distribution on the points {x1 , ..., xn } R
(3) Poi()
5this suggests a way to generate a pair of i.i.d. Gaussian r.v. from a pair of i.i.d. uniformly distributed r.v.
EXERCISES
63
Problem 4.10. Let X be a r.v. with c.d.f. F . Express the c.d.f. of Y = max(X, 0) in terms
of F
Problem 4.11. Let X and Y be r.v. with nite expectations. Prove or disprove6:
(1) E max(X, Y ) max(EX, EY )
Hint: note that max(X, Y ) X and max(X, Y ) Y
(2) E max(X, Y ) + E min(X, Y ) = E(X + Y )
Problem 4.12. Let X and Y be i.i.d. standard Gaussian r.v. Show that 2XY and X 2 Y 2
have the same probability laws.
X+Y
Hint: X 2 Y 2 = 2 XY
2
2
Problem 4.13. Let X
1n, ..., Xn be independent r.v., Xi Poi(i ). Show that the conditional
p.m.f. of X1 , given S = i=1 Xi is Binomial and nd the corresponding parameters.
6this problem demonstrates that sometimes a trick is required to avoid heavy calculations ;)
CHAPTER 5
n =
n = X
i=1
where
n (X) =
n
1
n
n1
n
X
,
n (X)
2
i=1 (Xi Xn ) , has certain p.d.f., call it f for the moment, which depends
or 2 ! This provides a convenient way to construct a condence interval
66
(
) z n1
n
X
P n1
f (x)dx.
n 1z =
n
z n1
z n1
Now if we require the condence level of 1 , we have to solve the equation z n1 f (x)dx =
1 for z. It is clear that this equation has a unique solution, z (), which can be found
numerically. Thus we have constructed an interval estimator for with condence 1 . Notice
also that for a given condence level 1 , smaller z would emerge for large n, i.e. the 1
condence interval, so constructed, shrinks as n . This is plausible, since the uncertainty
about the location of should decrease as the number of observations grow.
Of course, the latter procedure is possible and makes sense only if f does not depend on the
unknown quantities. Unfortunately, in general this would be rarely the case. One famous and
practically very popular exception is the Gaussian i.i.d. setting as above:
Proposition 5a1. Let
X = (X1 , ..., Xn ) be i.i.d.
the common distribution
n r.v. with
2 (X) = 1
n )2 be the empirical mean
n = 1 n Xi and
(X
X
N (, 2 ), and let X
i
n
i=1
i=1
n
n
and variance. Then for any n 2,
n N (, 2 /n)
(1) X
n and the vector of residuals R = (X1 X
n , ..., Xn X
n ) are independent; in particular
(2) X
2
Xn and
n (X) are independent r.v.
(3) n
n2 (X)/ 2 2 (n 1), where 2 (k) is the -square distribution with k degrees of
freedom, which has the p.d.f.
2
fk (x) =
x [0, ),
(5a1)
n
X
n 1
=
2 (X)
n
X
Stt(n 1)
n
1
2
(X
X
)
i
n
i=1
n(n1)
where Stt(k) is the Student t-distribution with k degrees of freedom, which has the p.d.f.
(
)
(
)(k+1)/2
k+1
2
x2
S
( ) 1+
fk (x) =
, x R.
(5a2)
k
k k
2
) be i.i.d. r.v.s with the common distribution N ( , 2 ), indepen(5) Let X = (X1 , ..., Xm
m
m
= 1
2 (X ) = 1
m
2
dent of X introduced above. Let X
m
i=1 Xi and
i=1 (Xi Xm ) .
m
m
Then
2 (X )/ 2
m
Fis(m 1, n 1),
n2 (X)/ 2
A. NORMAL SAMPLE
67
k2 2
1 + k/ x
Remark 5a2. The condence interval construction for i.i.d. Gaussian sample described
above is based on (4). The property (5) is useful in other statistical applications (some to be
explored later).
Remark 5a3. The Student distribution has a history:
http://en.wikipedia.org/wiki/William_Sealy_Gosset
Proof.
is a linear transformation of the Gaussian vector and EX
n =
(1) the
claim holds, since X
1
i EXi = and
n
(
)2
n ) = E(X
n )2 = E 1
var(X
(Xi ) =
n
i
1
n2
E(Xi )(Xj ) =
1
n 2 = 2 /n,
n2
to check that Xn and Xi Xn are uncorrelated for all i = 1, ..., n. EXn = and
n ) = 0 and hence
E(Xi X
n , Xi X
n ) = E(X
n )(Xi X
n ) = 2 EZn (Zi Zn ),
cov(X
where we have dened Zi := (Xi )/, which are i.i.d. N (0, 1) r.v. Since EZi Zj = 0,
EZn Zi = 1/n and, as we have already seen, E(Zn )2 = var(Zn ) = 1/n, which implies
n , Xi X
n ) = 0,
cov(X
n is independent of R. Since
n
and in turn that X
n2 (X) is in fact a function of R, X
2
and
n (X) are independent as well. Note that the Gaussian property played the crucial
role in establishing independence!
n2 (Z) (where Zi s are dened above)
(3) First note that
n2 (X)/ 2 =
n2 (X)/ 2 =
1
n )2 / 2 =
(Xi X
n
i
)2
)2
1 (
1
1 (
(Xi )/
(Xj )/ =
Zi Zn =
n2 (Z),
n
n
n
i
68
2
2
hence
it is enough
to check that n
n2 (Z) =
i (Zi Zn ) (n 1). Note that
2
2
2
i (Zi Zn ) =
i Zi n(Zn ) and hence
{ }
{
}
E exp t
Zi2 = E exp t
(Zi Zn )2 + tn(Zn )2 =
i
{
}
{
}
2
2
E exp t
(Zi Zn ) E exp tn(Zn ) ,
i
Eet
where the latter equality holds by independence of Zi s. Now we shall need the following
fact: for N (0, 1) and t (1/2, 1/2),
1 2
1
tx2 1
x2 /2
e 2 x (12t) dx =
=
e e
dx =
2
2
R
R
x2
1
1
1
.
e 2(12t)1 dx =
1
1 2t R 2 (1 2t)
1 2t
The latter expression is the m.g.f. of the density given in (5a1) with k := n 1 degrees
of freedom, as can be veried by a direct (tedious) calculation.
(4) Once again, it is enough to verify the claim for Zi s:
n )/
X
(X
Zn
n
=
=
.
n2 (X)
n2 (X)/ 2
n2 (Z)
Note that
Zn
Zn
n 1
=
= n 1
n
1
n2 (Z)
2
(Z
Z
)
i
n
i=1
n
nZn
V
=:
.
n
1
2
U/(n
1)
(Z
Z
)
i
n
i=1
n1
EXERCISES
69
By the preceding calculations, V N (0, 1) and U 2n1 and hence the claim holds
by the following lemma:
Lemma 5a4. Let U 2 (k) and V N (0, 1) be independent r.v. Then V / U/k
Stt(k).
with g1 (x, y) = x and g2 (x, y) = y/ x/k Obviously, the j.p.d.f. of (U, V ) is supported
on (R+ , R), on which g is invertible:
g 1 (
x, y) = (
x, y x
/k),
whose Jacobian is
J=
)
1
0
,
1
/ x
k
x
/k
2y
det J =
x
/k.
, V ) is given by:
Hence the j.p.d.f. of (U
x, y) =
fU V (
x
/k
e
=
(k/2)
2
(
)
1
(1/2)k/2 (k1)/2 x/2 1+y2 /k
x
e
k 2 (k/2)
for (
x, y) (R+ , R) and zero otherwise. Now the distribution of V = V / U/k is
obtained as the marginal:
(
)
1
(1/2)k/2
x/2 1+
y 2 /k
(k1)/2
x, y)d
x=
y) =
fU V (
fV (
x
e
d
x=
k 2 (k/2) R+
R+
(
)
(
)(k+1)/2
k+1
2
x2
(
)
... =
1+
k
k k2
(5) The claim follows from the following fact: if U
2 (m)
and V
2 (n),
then
U/m
Fis(m, n).
V /n
The proof is a straightforward (tedious) calculation, which we shall omit.
Exercises
Problem 5.1. Let X1 and X2 be i.i.d. standard Gaussian r.v.s. Find the probability laws
of the following r.v.s:
(1) (X1 X2 )/ 2
(2) (X1 + X2 )2 /(X1 X2 )2
70
Xi /i2
(Xi U )2 /i2 .
U := i
, V :=
2
1/
j
j
i
(1) Show that U and V are independent
(2) Show that U is Gaussian and nd its mean and variance
(3) Show that V 2n1
Hint: Recall the proof in the special case i2 = 2 , i = 1, ..., n
Problem 5.4.
(1) Show that if U 2n and V 2m , then U + V 2n+m
2
Hint: use the connection between the Gaussian distribution
and 2
(2) Show that if X1 , ..., Xn are i.i.d. Exp() r.v.s then T := 2 i Xi has 2n distribution
Hint: Prove for n = 1 and use the answer in (1)
1the expression for the mean should explain the often use of the normalizing factor 1/(n 1) instead of 1/n
Part 2
Statistical inference
CHAPTER 6
Statistical model
a. Basic concepts
Generally and roughly speaking the statistical inference deals with drawing conclusions about
objects which cannot be observed directly, but are only known to have inuence on apparently
random observable phenomena. It is convenient to view statistical inference as consisting of
three steps:
(Step 1) Modeling: postulating the statistical model, i.e. the random mechanism which presumably had produced the observed data. The statistical model is specied up to an
unknown parameter to be inferred from the data. Once a model is postulated, the
statistician denes the scope of inference, i.e. poses the questions of interest. The three
canonical problems are:
* point estimation: guessing the value of the parameter
* interval estimation: constructing an interval, to which the value of the parameter
belongs with high condence
* hypothesis testing: deciding whether the value of the parameter belongs to a specic
subset of possible values
The main tool in statistical modeling is probability theory.
(Step 2) Synthesis/Analysis: deriving a statistical procedure (an algorithm), based on the postulated model, which takes the observed data as its input and generates the relevant
conclusions. This is typically done by methodologies, which rely on the optimization
theory and numerical methods. Once a procedure is chosen, its quality is to be assessed
or/and compared with alternative procedures, if such are available.
(Step 3) Application: predicting/or taking decisions on the basis of the derived conclusions
Remark 6a1. The order is by no means canonical. For example, often the choice of the
model is motivated by the question under consideration or the methodological or computational
constraints.
Remark 6a2. The above scheme corresponds to the inferential statistics, distinguished from
the descriptive statistics. The latter is concerned with data exploration by means of various
computational tools (e.g. empirical means, histograms, etc.) without presuming or modeling
randomness of the data. The quality assessment in descriptive statistics is often subjective and
non-rigorous.
73
74
6. STATISTICAL MODEL
n := t
x2i x2i1
(6a1)
n/2
i=1
will also yield an estimate close to the true value of for large n (think why). Clearly none of
the procedures (and in fact no procedure) would yield the exact value of the parameter with
probability 1. How do we measure precision then ? Which algorithm is more precise ? Is there
an algorithm which gives the best possible precision ? How large n should be to guarantee
the desired precision ? These and many more interesting questions is the main subject of
mathematical statistics and of this course.
1x is the largest integer smaller or equal to x
A. BASIC CONCEPTS
75
Finally, when we generate an estimate of and e.g. decide to continue playing, we may use
n to calculate various predictions regarding the future games: e.g. how much time it should
take on average till we either bankrupt or win certain sum, etc.
Now that we have a rough idea of what the statistical inference problem is, lets give it an
exact mathematical formulation.
Definition 6a4. A statistical model (or an experiment) is a collection of probabilities P =
(P ), parameterized by , where is the parameter space.
If the available data is a vector of real numbers, the probabilities P can be dened by means
of j.c.d.fs (or j.p.d.fs, j.p.m.f.s if exist) and the data is thought of as a realization of a random
vector X with the particular j.c.d.f., corresponding to the actual value 0 of the parameter. This
value is unknown to the statistician and is to be inferred on the basis of the observed realization
of X and the postulated model.
Example 6a3 (continued) In this case P is the j.p.m.f. of i.i.d. X = (X1 , ..., Xn ) with
X1 Ber():
P (X = x) = pX (x; ) =
i=1
xi
(1 )n
i=1
xi
x {0, 1}n .
Example 6a5. A plant produces lamps and its home statistician believes that the lifetime
of a lamp has exponential distribution. To estimate the mean lifetime she chooses n lamps
sporadically, puts them on test and records the corresponding lifetimes. In this setting, P is
given by the j.p.d.f.
n
fX (x; ) = n
exi I(xi 0), x Rn .
i=1
76
6. STATISTICAL MODEL
Example 6a7. Suppose we produce a coee machine, which accepts coins of all values as
payment. The machine recognizes dierent coins by their weight. Hence prior to installing the
machine in the campus, we have to tune it. To this end, we shall need to estimate the typical
(mean) weight of each type of coin and the standard deviation from this typical weight. For
each type, this can be done by e.g. n weighings. A reasonable 2 statistical model would be e.g.
to assume that the measured weights Xi s are i.i.d. and X1 N (, 2 ), where both and 2
are unknown. Hence P is given by the j.p.d.f.:
n
{ 1 (x )2 }
1
i
exp
, x Rn .
(6a2)
fX (x) =
2
2
2
i=1
The unknown parameter is two dimensional = (, 2 ) and the natural choice of the parameter
space is = R R+ .
Example 6a8. Suppose that instead of assuming i.i.d. model in the Example 6a3, there is a
reason to believe that the tosses are independent, but not identically distributed (e.g. tosses are
done by dierent people to avoid fraud, but each person actually cheats in his/her own special
way). This corresponds to the statistical model
pX (x; ) =
ixi (1 i )1xi ,
x {0, 1}n ,
i=1
Roughly speaking, models with parameter space of a nite dimension are called parametric.
Here is a natural example of a nonparametric model:
Example 6a5 (continued) Instead of presuming exponential distribution, one can assume
that the data is still i.i.d. but with completely unknown p.d.f. In this case the probability laws
P are parameterized by the space of all functions, which can serve as legitimate p.d.f.s:
{
}
perhaps very small, but nonzero probability. Nevertheless, we expect that 2 is very small, compared to and
hence the tiny probabilities of absurd events would not alter much the conclusions derived on the basis of Gaussian
model. This is an example of practical statistical thinking. As we do not go into modeling step in this course we
shall ignore such aspects bravely and thoughtlessly: we shall just assume that the model is already given to us
and focus on its analysis, etc.
77
Definition 6a10. An arbitrary function of the data (but not of the unknown parameter!) is
called statistic.
1 n/2
= 1 n Xi and T (X) :=
In Example 6a3, both X
i=1
i=1 X2i X2i1 are statistics.
n
n/2
Remark 6a11. If P is a probability on Rn , X P and T (x) is a function on Rn , we shall
refer both to T (x), x Rn (i.e. to the function x 7 T (x)) and to T (X) (i.e. to the random
variable T (X), obtained by plugging X into T ) as statistic. The precise intention should be
clear from the context.
In our course, typically we shall be given a statistical model and will be mostly concerned
with two questions: how to construct a statistical procedure and how to assess its performance
(accuracy). Hence we focus on Step 2 in the above program, assuming that Step 1 is already
done and it is known how to carry out Step 3, after we come up with the inference results.
b. The likelihood function
In what follows we shall impose more structure on the statistical models, namely we shall
consider models which satisfy one of the following conditions
(R1) P is dened by a j.p.d.f. f (x; ) for all
(R2) P is dened by a j.p.m.f. p(x; ), such that i p(xi ; ) = 1 for a set {x1 , x2 , ...} which
does not depend on .
We shall refer to these assumptions as regularity3. It will allow us to dene 4 the likelihood
function
Definition 6b1. Let P , be a model satisfying either (R1) or (R2). The function
{
fX (x; ), if P satises (R1)
L(x; ) :=
,
pX (x; ), if P satises (R2)
is called likelihood.
All the models considered above satisfy either (R1) or (R2).
Example 6a7 (continued) P is dened by the Gaussian j.p.d.f. (6a2) for any and
hence satises (R1).
Example 6a3 (continued) P is dened by the j.p.m.f:
pX (x; ) =
and
i=1
xi
(1 )n
i=1
xi
x {0, 1}n ,
pX (x; ) = 1,
for the set of all 2n binary strings, = {0, 1}n (which does not depend on ), i.e. (R2) is
satised.
3In statistics, the term regularity is not rigid and may have completely dierent meanings, depending on
78
6. STATISTICAL MODEL
Example 6b2. Let X be a random variable distributed uniformly on the set {1, ..., },
= N (think of a real life experiment supported by this model):
{
1
k = 1, ...,
p(k; ) =
.
0 otherwise
Since iN p(i; ) = 1 for all , (R2) holds.
Here are models which do not t our framework:
Example 6b3. Consider a sample X N (, 1), R and set Y = max(0, X). Let P be
the probability law of Y . The model (P ) doesnt satisfy neither (R1) nor (R2), since the
c.d.f. which denes P , i.e. the c.d.f. of Y , has both continuous and discrete parts.
Example 6b4. Let X be a binary random variable which takes two values {0, }, with
probabilities P (X = 0) = 1 and P (X = ) = , = (0, 1). Clearly, (R1) does not hold.
(R2) does not hold, since the p.m.f. is supported on {0, }, which depends5 on . Note that if
we observe the event X > 0, we can determine the value of exactly: = X.
Do not think that the statistical models which do not t our framework, are of no interest.
On the contrary, they are often even more fun, but typically require dierent mathematical
tools.
c. Identiability of statistical models
in the Example 6a3 is a reasonable guess of , since it is
Intuitively, we feel that T (X) = X
likely to be close to the actual value at least when n is large. On the other hand, if Xi = Zi
with unknown = R and i.i.d. N (0, 1) r.v.s Zi , any guess of the sign of , based on the
observation of X1 , ..., Xn will be as bad as deciding it by tossing an independent coin, discarding
all the data. On the third
if we were not interested in the signed value of , but only in
hand,
1 n
n )2 would be a decent guess (again, intuitively).
its absolute value, e.g.
(Xi X
n
i=1
P = P .
x B.
not enough.
79
E T (X) = E T (X).
Proof. Suppose that the model is not identiable, i.e. there is a pair of parameters = ,
for which P = P . Then, in particular, E T (X) = E T (X), which is a contradiction and hence
the model is identiable.
Remark 6c3. Estimation of parameters for non identiable models is meaningless 7. If the
constructed model turns to be nonidentiable, a dierent parametrization (model) is to be found
to turn it into an identiable one.
Example 6a3 (continued) Recall that P is the probability law of i.i.d. r.v. X1 , ..., Xn with
X1 Ber(), [0, 1]. Since
E X1 =
is a one-to-one function of , the model is identiable.
An alternative way to get to the same conclusion is to consider the p.m.f. p(x; ) at e.g. x
with xi = 1, i = 1, ..., n:
(
)
(
)
p (1...1), = n = n = p (1...1), , = .
Example 6a7 (continued) If X1 , ..., Xn are i.i.d. N (, 2 ) and = (1 , 2 ) = (, 2 ) =
R R+ , the model is identiable:
E X1 = 1 = ,
E X12 = 2 + 12 = 2 + 2 .
The function g() = (1 , 2 +12 ) is one-to-one on : indeed, suppose that for = , g() = g( )
which means 1 = 1 and 2 + 12 = 2 + 12 . The latter implies 1 = 1 and 2 = 2 , which is a
contradiction. Hence the model is identiable.
Now lets check what happens if we would have chosen a dierent parametrization, namely
= (1 , 2 ) = (, ) R R. Let = (0, 1) and = (0, 1). Since 2 = appears in the
Gaussian density only with an absolute value, it follows that P = P for the specic choice
of = and hence the model with such parametrization is not identiable, conrming our
premonitions above.
Here is a less obvious example:
Example 6c4. Let X = (X1 , ..., Xn ) be an i.i.d. sample from N (, 2 ) and suppose that we
observe Y = (Y1 , ..., Yn ), where Yi = Xi2 . Let P be the probability law of Y , where = (, )
R+ R+ (note that is known to be nonnegative). Is this model identiable8 ? Note that
E Y1 = E X12 = var (X1 ) + (E X1 )2 = 2 + 2
7Running a bit ahead, suppose that we have a statistic T (X), which we use as a point estimator of . A good
estimator of should be close to the values of for all (!) . For non-identiable models this is impossible.
Suppose that = (think of = +d, where d is a large number) and P = P , i.e. the model is not identiable.
If T (X) is a good estimator of , then its probability distribution should be highly concentrated around , when
X P . But P = P and hence it is also highly concentrated around , when X P . But since the distance
between and is large, the latter means that the distribution of T (X) is poorly concentrated around , when
X P - i.e. it is a bad estimator of !
8we have seen already, in Example 6a7, that it is identiable with a dierent parameter space. Think why
this implies identiability for the new parameter space under consideration.
80
6. STATISTICAL MODEL
and
E Y12 = E X14 = E (X1 + )4 =
E (X1 )4 + 4E (X1 )3 + 6E (X1 )2 2 + 4E (X1 )3 + 4 =
3 4 + 6 2 2 + 4 = 3( 2 + 2 )2 24 .
The function g() := ( 2 + 2 , 3( 2 + 2 )2 24 ) is invertible on (, ) R+ R+ (check!) and
hence the model is identiable.
Now suppose we observe only the signs of Xi s, i.e. = (1 , ..., n ) with
{
1
Xi 0
i = sign(Xi ) :=
.
1 Xi < 0
Let P be the law of with the same parametrization as before. Is this model identiable ...?
In this case, P is given by its j.p.m.f., namely for u {1, 1}n and as above
p (u; ) =
n {
}
I(ui = 1)P (i = 1) + I(ui = 1)P (i = 1) =
i=1
n {
}
Further,
(X
)
1
< / = (/),
and hence p (u; ) depends on = (, ) only through the ratio /. Clearly this model is
not identiable: for example, = (1, 1) and = (2, 2) yield the same distribution of the data:
p (u; ) = p (u; ) for all u {1, 1}n . This means that the observation of cannot be used to
construct a reasonable estimate of (, ).
P (X1 < 0) = P
d. Sucient statistic
Consider the following simple model: we observe the realizations of X1 = +Z1 and X2 = Z2 ,
where Z1 , Z2 are i.i.d. N (0, 1) r.v.s and would like to infer R. It is intuitively clear that X2
is irrelevant as far as inference of is concerned, since it is just noise, not aected in any way
by the parameter value. In particular, the statistic T (X) = X1 is sucient for the purpose
of e.g. guessing the value of . On the other hand, if Z1 and Z2 were dependent, then such
suciency would be less apparent (and in fact false as we shall be able to see shortly).
Definition 6d1. A statistic T is sucient for the model (P ) if the conditional distribution of X P , given T (X), does not depend on .
The meaning of the latter denition is made particularly transparent through the following
two-stage procedure. Suppose that we sample from P once, call this sample X, and calculate
T (X). Discard X and keep only T (X). Since T is sucient, by denition the conditional
distribution of X, given T (X), does not depend on the unknown value of . Hence we are able
to sample from the conditional distribution of X, given T (X), without knowing the value of !
D. SUFFICIENT STATISTIC
81
Let X be a sample from this conditional distribution9. Typically, the obtained realizations of
X and X will not be the same and hence we would not be able to restore the original discarded
realization of X. However X and X will have the same probability distribution P and hence
bear the very same statistical information about . Indeed, by the very denition of X
(
)
(
)
P X = x|T (X) = P X = x|T (X)
where we assumed for deniteness that all the random vectors involved are discrete. Since T (X)
is sucient, the latter doesnt in fact depend on and
(
)
(
)
(
)
P X = x = E P X = x|T (X) = E P X = x|T (X) =
(
)
(
)
E pX|T x; T (X) = E P X = x|T (X) = P (X = x), x.
To recap, no matter what kind of inference we are going to carry out on the basis of the sample
X, the value of a sucient statistic T (X) is all we need to keep, to be able to sample from the
original distribution without knowing the parameter and hence to attain the very same accuracy
in the statistical analysis, we would be able to attain should we have kept the original sample
X! In this sense, the sucient statistic is a summary statistic.
Example 6a3(continued) Lets show that T (X) = ni=1 Xi is a sucient statistic for i.i.d
X = (X1 , ..., Xn ), X1 Ber(), = (0, 1) r.v.s The conditional j.p.m.f. of X given
T (X) is given by the Bayes formula: let x {0, 1}n and t {0, ..., n}; if t = ni=1
xi , then
P(X = x, T (X) = t) = 0 and hence P(X = x|T (X) = t) = 0. Otherwise, i.e. for t = i xi ,
(
)
(
) P X = x, T (X) = t
t (1 )nt
1
(
)
P X = x|T (X) = t =
=( )
= ( ),
n t
n
P T (X) = t
(1 )nt
t
t
where we used the fact that T (X) Bin(n, ). Hence X conditionally on T (X) is distributed
uniformly on the set of binary vectors
n
{
}
x {0, 1}n :
xi = t ,
i=1
and thus the conditional distribution X given T (X) doesnt depend on . Hence T (X) is indeed
a sucient statistic.
Lets see how the aforementioned hypothetic experiment works out in this particular case.
Suppose we tossed the coin n = 5 times and obtained {X = (11001)}, for which T (X) = 3.
Now sample from the uniform distribution on all the strings which have precisely 3 ones (you
can actually easily list all of them). Notice that this is feasible without the knowledge of .
The obtained sample, say {X = (00111)}, is clearly dierent from X, but is as relevant to any
statistical question regarding as the original data X, since X and X are samples from the
very same distribution.
The statistic S(X) = n1
i=1 Xi is intuitively not sucient, since it ignores Xn , which might
be useful for inference of . Lets verify the intuition by a contradiction. Suppose it is sucient.
Then the conditional distribution of X given S(X) doesnt depend on . On the other hand, Xn
9you might need to enlarge the probability space to do this, but we keep the same notations for brevity
82
6. STATISTICAL MODEL
and S(X) are independent and thus E (Xn |S(X)) = EXn = , which contradicts the assumption
of suciency. Hence S(X) is not sucient as expected.
Example
6d2. Suppose X = (X1 , ..., Xn ) are i.i.d. N (, 1) r.v., with R. Lets check
1 n
E X)
= + 1/n (X
) = X
= E Xi + cov (Xi , X) (X
E (Xi |X)
1/n
var (X)
and
cov2 (Xi , X)
1/n2
=
1
= 1 1/n,
1/n
var (X)
where the equality is the formula from the Normal Correlation Theorem 3c1. To calculate
when i = j,
cov (Xi , Xj |X)
(
)(
) )
Xj X
X
= E X i X
= E (Xi Xj |X)
X
2,
cov (Xi , Xj |X)
(6d1)
j=1
j=1
= 1 E (X 2 |X)
+1
E (Xi Xj |X)
E (Xi Xj |X).
i
n
n
j=i
E (Xi Xj |X)
j =
,
2
2
2
n
(
)
1
= 1 1 1 + 1X
2 + n 1 E (Xi Xj |X).
E (Xi Xj |X)
n
n
n
n
n
(6d2)
j=1
suciency of X.
D. SUFFICIENT STATISTIC
83
{
1
Sij = cov(Xi , Xj ) =
n1
1
n
i=j
.
i = j
The obtained new sample, call it X , is a realization of a random vector with n i.i.d. N (, 1)
components. Note that in this case the event {X = X } has zero probability, i.e. we shall never
get the very same realization back. However, as far as inference of is concerned, it is enough
instead of n real numbers X1 , ..., Xn !
to keep just one real number X
As you can see, verifying that a given statistic is sucient for a given model may be quite
involved computationally. Moreover, it is not apparent neither from the denition nor from the
examples, how a nontrivial sucient statistic can be found for a given model. The main tool
which makes both of these tasks straightforward is
Theorem 6d3 (Fisher-Neyman factorization theorem). Let P , be a model with likelihood L(x; ) and let X P . Statistic T (X) is sucient if and only if there exist functions
g(u, t) and h(x) (with appropriate domains), so that
L(x; ) = g(, T (x))h(x)
x, .
(6d3)
Proof. We shall give the proof for the discrete case, leaving out the more technically
involved (but similar in spirit) continuous case. When X is discrete, the likelihood equals the
p.m.f., call it pX (x; ), x {x1 , x2 , ...}. Suppose (6d3) holds, i.e.
pX (x; ) = g(, T (x))h(x)
for some functions g and h. We shall show that T is sucient, by checking that the conditional
law of X given T (X) doesnt depend on . For any x and t = T (x),
pX|T (X) (x; t, ) = P (X = x|T (X) = t) = 0,
which certainly doesnt depend on . Further, for t = T (x), by the Bayes formula
pX|T (X) (x; t, ) = P (X = x|T (X) = t) =
P (X = x)
P (X = x, T (X) = t)
=
P (T (x) = t)
P (T (x) = t)
g(, t)h(x)
h(x)
g(, t)h(x)
=
=
,
g(, t) x h(x )
x g(, t)h(x )
x h(x )
84
6. STATISTICAL MODEL
Conversely, suppose that T (X) is a sucient statistic. To prove the claim we shall exhibit
functions g and h such that (6d3) holds. To this end, note that since T (X) is a function of X,
{
pX (x; ) t = T (x)
pX,T (X) (x, t; ) = P (X = x, T (X) = t) = P (X = x, T (x) = t) =
,
0
t = T (x)
and hence pX|T (X) (x; t) = 0 for t = T (x). Then
pX (x; ) =
pX|T (X) (x; t)pT (X) (t; ) = pX|T (X) (x; T (x))pT (X) (T (x); )
where T (x) = ni=1 xi . Hence (6d3) holds with h(x) = 1 and g(, t) = t (1 )nt .
Example 6d2 (continued) The likelihood is the j.p.d.f. in this case:
( 1
)
1
2
L(x; ) =fX (x; ) =
exp
(x
)
=
i
2
(2)n/2
i=1
n
n
n
( 1
1
n 2)
2
exp
x
+
=
i
i
2
2
(2)n/2
i=1
i=1
(6d4)
( 1 )
(
1
n 2)
2
exp
x
exp
n
x
.
i
2
2
(2)n/2
i=1
n
2
(2)n/2
i=1
n
h(x) =
)
(
is
and g(, t) = exp nt n2 2 . Compare this to the calculations, required to check that X
sucient directly from the denition!
Lets demonstrate the power of F-N theorem by slightly modifying the latter example:
Example 6a7 (continued)
D. SUFFICIENT STATISTIC
85
Let X1 , ..., Xn be i.i.d. N (, 2 ) r.v.s. If 2 is known, then we are in the same situation as
in the previous example. If however both and 2 are unknown, i.e. = (, 2 ) = R R+
n
( 1 1
)
1
2
L(x; ) =
exp
(x
)
=
i
2 2
(2 2 )n/2
1
(2 2 )n/2
1
(2 2 )n/2
i=1
n
n
)
1 1
1 1 2
2
x
+
x
n)
=:
exp
i
i
2 2
2
2 2
i=1
i=1
)
( 1 1
1 1
2
2+
nx
n
x
n
.
exp
2 2
2
2 2
(6d5)
1
L(x; ) =
I(xi [0, ]) = n I(max xi ), x Rn+ .
i
i=1
Note that neither the denition nor F-N Theorem do not say anything on the uniqueness of
the sucient statistic: the factorization of the likelihood can be done in many dierent ways to
yield dierent sucient statistics. In fact a typical statistical model has many quite dierent
sucient statistics. In particular, the original data, i.e. X sampled from P , is trivially a
sucient statistic: indeed, P (X u|X) = I(X u) for any u R and I(X u) doesnt
depend on . Of course, this is also very intuitive: after all the original data is all we have!
This suggests the following relation between statistics:
Definition 6d5. T (X) is coarser11 than T (X) if T (X) = f (T (X)) for some function f .
Coarser in this denition means revealing less details on the original data: one can
calculate T (X) from T (X) but wont be able to calculate T (X) from T (X) (unless f is oneto-one). Hence some information will be possibly lost. In the Example 6a7, the statistic
X 2 ) is coarser than X itself: clearly, one can calculate T (X) from X, but not vise
T (X) = (X,
versa (if n 2). The trivial statistic T (X) 17 is coarser than both T and X: it is so coarse
that it is useless for any inference (and of course not sucient).
Definition 6d6. Two statistics T and T are equivalent, if there is a one-to-one function f
such that T (X) = f (T (X)).
10more precisely, (
does not determine x2 and vise versa (check!). Compare
x, x2 ) is two dimensional since, x
e.g. with (
x, x
) which takes values in R2 , but in fact takes values only in a one dimensional manifold, namely on
the diagonal {(x, y) R2 : x = y}
11a bit more precisely, P (T (X) = f (T (X))) = 1 for all is enough.
86
6. STATISTICAL MODEL
The statistics are equivalent if they reveal the same details about the data.
Remark 6d7. If S is coarser than (or equivalent to) T and S is sucient, then T is sucient
as well (convince yourself, using the F-N factorization).
(
)
1 n X 2 =: (T1 (X), T2 (X)) and
Example 6a7 (continued) The statistics T (X) = X,
i=1 i
n
(
)
n
1
2 =: (S1 (X), S2 (X)) are equivalent. Indeed S can be recovered
S(X) = X,
i=1 (Xi X)
n
from T :
S1 (X) = T1 (X)
and
n
n
1
2= 1
2 = T2 (X) T 2 (X).
S2 (X) =
(Xi X)
Xi2 X
1
n
n
i=1
i=1
and vise versa. Clearly T and X are not equivalent: T is coarser than X as previously mentioned.
This discussion leads us to the question: is there coarsest (minimal) sucient statistic ?
And if there is, how it can be found ?
Definition 6d8. The sucient statistic T is minimal if it is coarser than any other sucient
statistic.
It can be shown that the minimal sucient statistic exists (at least for our type of models).
The proof is beyond the scope of our course. Finding minimal statistic does not appear at the
outset an easy problem: in principle, if one tries to nd it via the denition, she would have to
perform a search among all sucient statistics (or at least all nonequivalent statistics), which is
practically impossible. Remarkably, checking that a particular sucient statistic is minimal is
easier, as suggested by the following lemma. Note that in practice candidates for the minimal
sucient statistic are oered by the F-N factorization theorem.
Lemma 6d9. A sucient statistic S is minimal sucient if
L(x; )
doesnt depend on
=
S(x) = S(y).
L(y; )
Remark 6d10. Note that by F-N factorization theorem, S(x) = S(y) implies that
(6d6)
L(x; )
L(y; )
D. SUFFICIENT STATISTIC
87
Example 6a7 (continued) Suppose that 2 is known (say 2 = 1) and we would like to infer
is a sucient statistic.
R. Applying the F-N factorization theorem to (6d4), we see that X
Is it minimal ?
(
)
)
(
1
1 n
n 2
2 exp n
exp
x
x
i=1 i
2
2
L(x; )
(2)n/2
(
)
) =
(
=
1
L(y; )
y n 2
exp 1 n y 2 exp n
(2)n/2
i=1 i
n
( 1
)
(
)
(x2i yi2 ) exp n(
exp
x y) .
2
i=1
preceding Lemma. Remarkably, keeping just X and discarding all the observations is enough
for all the purposes of inference of and any coarser statistic wont be sucient!
and we
The whole data X is, of course, sucient but is not minimal, as it is ner than X
L(x;)
should expect that the conditions of the Lemma cannot be satised. Indeed, if L(y;) doesnt
depend on , it is still possible to have x = y.
Example 6d4 (continued) In this case,
L(x; )
I(maxi xi )
=
.
L(y; )
I(maxi yi )
Using the conventions13 0/0 = 1, 1/0 = we see that if S(x) := maxi xi < maxi yi =: S(y),
then
1, S(x)
L(x; )
= 0, (S(x), S(y)] ,
L(y; )
1, > S(y)
which is a nontrivial (nonconstant) function of
the case S(x) > S(y):
1,
L(x; )
= ,
L(y; )
1,
88
6. STATISTICAL MODEL
Exercises
Problem 6.1 (Problem 2.1.1, page 80 [1]). Give a formal statement of the following models
identifying the probability laws of the data and the parameter space. State whether the model
in question is parametric or nonparametric.
(1) A geologist measures the diameters of a large number n of pebbles in an old stream bed.
Theoretical considerations lead him to believe that the logarithm of pebble diameter is
normally distributed with mean and variance 2 . He wishes to use his observations
to obtain some information about and 2 , but has in advance no knowledge of the
magnitudes of the two parameters.
(2) A measuring instrument is being used to obtain n independent determinations of a
physical constant . Suppose that the measuring instrument is known to be biased
to the positive side by 0.1 units. Assume that the errors are otherwise identically
distributed normal random variables with known variance.
(3) In part (2) suppose that the amount of bias is positive but unknown. Can you perceive
any diculties in making statements about for this model?
(4) The number of eggs laid by an insect follows a Poisson distribution with unknown mean
. Once laid, each egg has an unknown chance p of hatching and the hatching of one
egg is independent of the hatching of the others. An entomologist studies a set of n
such insects observing both the number of eggs laid and the number of eggs hatching
for each nest.
Problem 6.2 (Problem 2.1.2, page 80 [1]). Are the following parametrizations identiable?
(Prove or disprove.)
(1) The parametrization of Problem 6.1 (3).
(2) The parametrization of Problem 6.1 (4).
(3) The parametrization of Problem 6.1 (4) if the entomologist observes only the number
of eggs hatching but not the number of eggs laid in each case.
Problem 6.3 (Problem 2.1.3, page 80 [1]). Which of the following parametrizations are
identiable ? (Prove or disprove.)
(
)
(1) X1 , ..., Xp are independent r.v. with Xi N (i + , 2 ). = 1 , ..., p , , 2 and P
is the distribution of X = (X1 , ..., Xn )
EXERCISES
89
Problem 6.4 (Problem 2.1.4, page 81 [1]). 5. The number n of graduate students entering a
certain department is recorded. In each of k subsequent years the number of students graduating
and of students dropping out is recorded. Let Ni be the number dropping out and Mi the number
graduating by the end of year i, i = 1, ..., k. The following model is proposed.
(
)
P N1 = n1 , M1 = m1 , ..., Nk = nk , Mk = mk =
n!
n1 ...nk k 1m1 ...kmk r
n1 !...nk !m1 !...mk !r! 1
where
k
k
i +
j + = 1, i , (0, 1), i = 1, ..., k
i=1
j=1
n1 + ... + nk + m1 + ... + mk + r = n
and = (1 , ..., k , 1 , ..., k ) is unknown.
(1) What are the assumptions underlying the model ?
(2) is very dicult to estimate here if k is large. The simplication i = (1 )i1 ,
i = (1)(1)i1 for i = 1, .., k is proposed where 0 < < 1, 0 < < 1, 0 < < 1
are unknown. What assumptions underline the simplication ?
Problem 6.5 (Problem 2.1.6 page 81, [1]). Which of the following models are regular ?
(Prove or disprove)
(1) P is the distribution of X, when X is uniform on (0, ), = (0, )
(2) P is the distribution of X when X is uniform on {0, 1, ..., }, = {1, 2, ...}
(3) Suppose X N (, 2 ). Let Y = 1 if X 1 and Y = X if X > 1. = (, 2 ) and P
is the distribution of Y
(4) Suppose the possible control responses in an experiment are 0.1 , ... , 0.9 and they
occur with frequencies p(0.1),...,p(0.9). Suppose the eect of a treatment is to increase
the control response by a xed amount . Let P be the distribution of a treatment
response.
Problem 6.6 (based on Problem 2.2.1, page 82, [1]). Let X1 , ..., Xn be a sample from Poi()
population with > 0.
i=1
)
T2 (X) = X1 , ..., Xn1
(
)
T3 (X) = T1 (X), T (X)
(
)
T4 (X) = T1 (X), T2 (X)
90
6. STATISTICAL MODEL
Problem 6.7 (based on Problem 2.2.2, page 82, [1]). Let n items be drawn in order without
replacement from a shipment of N items of which N are bad. Let Xi = 1 if the i-th item drawn
is bad, and Xi = 0 otherwise.
T2 (X) =
n1
Xi
i=1
)
(
T3 (X) = T1 (X), min Xi
i
(
)
T4 (X) = T1 (X), T2 (X)
(4) Order the statistics above according to coarseness relation
(5) Which statistic is minimal among the statistics mentioned above ?
(6) Show that T (X) is minimal sucient (among all sucient statistics)
Problem 6.8 (Problem 2.2.3, page 82, [1]). Let X1 , ..., Xn be an i.i.d. sample from one of
the following p.d.f.s
(1)
f (x; ) = x1 , x (0, 1), > 0
(2) the Weibull density
f (x; ) = axa1 exp(xa ),
x, a, (0, )
x, a, (0, )
where a is a xed constant and is the unknown parameter. Find a real valued sucient statistic
for .
Problem 6.9 (Problem 2.2.6 page 82 [1]). Let X take the specied values v1 , ..., vk+1
with probabilities 1 , ..., k+1 respectively. Suppose that X1 , ..., Xn are i.i.d. with the same
distribution as X. Suppose that = (1 , ..., k+1 ) is unknown and may range over the set
EXERCISES
91
Problem 6.10 (Problem 2.2.7 page 83 [1]). Let X1 , ..., Xn be an i.i.d. sample from the p.d.f.
{ x }
1
f (x; ) = exp
I(x 0).
Let = (, ) R R+
(1) Show that mini Xi is sucient for when is xed (known)
(2) Find a one dimensional sucient statistic for when is xed
(3) Exhibit a two dimensional sucient statistic for
Problem 6.11 (Problem 2.2.9 page 83 [1]). Let X1 , ..., Xn be an i.i.d. sample from the p.d.f.
CHAPTER 7
Point estimation
Point estimation deals with estimating the value of the unknown parameter or a quantity,
which depends on the unknown parameter (in a known way). More precisely, given a statistical
model (P ) , the observed data X P and a function q : 7 R, an estimator of q() is a
statistic T (X), taking values in q() := {q(), q }. Estimating the value of the parameter
itself, ts this framework with q() := . The realization of the estimator for a particular set of
data is called the estimate.
a. Methods of point estimation
In this section we shall introduce the basic methods of point estimation. Typically dierent
methods would give dierent estimators. The choice of the particular method in a given problem
depends on various factors, such as the complexity of the emerging estimation procedure, the
amount of available data, etc. It should be stressed that all of the methods in this section,
originated on a heuristic basis and none of them guarantees to produce good or even reasonable
estimators at the outset: an additional eort is usually required to assess the quality of the
obtained estimators and thus to refute or justify their practical applicability.
Remark 7a1. If( is )a point estimator of , the quantity q() can be estimated by the
Frequently (but not always), polynomials or indicator functions are used in practice, in which
cases, this approach is known as the method of moments or frequency substitution respectively.
Example 6a3 (continued) For X1 Ber(),
E X1 = ,
93
94
7. POINT ESTIMATION
In this case the method of moments can be formally applied with (x) = x and () = ,
n as the estimator of .
suggesting the statistic X
Here is another possibility: note that E X1 X2 = E X1 E X2 = 2 , which is invertible on
[0, 1]. Hence by the method of moments
v
u
u 1 [n/2]
n := t
X2i1 X2i ,
[n/2]
i=1
= NAA /n
= 1 Naa /n
Since E NAa is not a one-to-one function of , it doesnt t the frequency substitution
method as is. Having several dierent estimators of can be practically handy, since some
genotypes can be harder to observe than
( the others.
)
Here is another alternative: since E I(X1 = AA) + 21 I(X1 = Aa) = 2 + (1 ) = , the
frequency substitution suggests the estimator:
NAA 1 NAa
(x)
=
+
.
n
2 n
The substitution principle is applicable to parameter spaces with higher dimension:
1look up for HardyWeinberg principle in wikipedia for more details about this model
95
Example 7a3. Let X1 , ..., Xn be an i.i.d. sample from the (, ) distribution with the
p.d.f.
x1 ex/
f (x; , ) =
I(x 0),
()
where the unknown parameter is := (, ) R+ R+ =: .
A calculation reveals that
E X1 = ... =
and
E X12 = ... = 2 ( + 1).
Denote the empirical moments by
n
n
1
1 2
2
Xn :=
Xi , Xn :=
Xi .
n
n
i=1
and
i=1
1
2.
n )2 = X 2 X
(Xi X
n
n
n
n
n2 (X) =
i=1
(X) :=
n2
X
n2
X
,
2
n (X)
2
X2 X
2 (X)
(X)
:= n n = n
Xn
Xn
2
Xn2 X
n
Note that
is well dened since
n2 (X) 0 and the equality holds with zero probability.
Remark 7a4. The method of moments (and other substitution principles) does not require
precise knowledge of the distribution of the sample, but only of the dependence of moments on
the parameter. This can be a practical advantage, when the former is uncertain.
Least squares estimation. In many practical situations the observed quantities are known
to satisfy a noisy functional dependence, which itself is specied up to unknown parameters.
More precisely, one observes the pairs (Xi , Yi ), i = 1, ..., n which are presumed to satisfy the
equations
Yi = gi (Xi , ) + i , i = 1, ..., n,
where gi s are known functions, i s are random variables and is the unknown parameter. The
least squares estimator of is dened as
n
(
)2
(X,
Y ) := argmin
Yi gi (Xi , ) ,
(7a1)
i=1
where any of the minimizers is chosen, when the minimum is not unique. In other words the
least squares estimator is the best t of a known curve to noisy measurements.
One classical example is the linear regression: one observes the pairs (Xi , Yi ) i = 1, ..., n,
presumes that Yi = 1 Xi + 2 + i and would like to estimate = (1 , 2 ) given the observations.
Working out the minimization in (7a1) yields the familiar regression formulae.
96
7. POINT ESTIMATION
(
n
i=1
xi x0 vx ti
)2
(
i=1
yi y0 vy ti
)2
(
i=1
)
1 )2
.
zi z0 vz ti + gt2i
2
2this is an oversimplication: the trajectory of the asteroid is signicantly aected by the drag force applied
by the atmosphere, which itself varies depending on the mass being evaporated, the eective area of asteroid body
and the material it is composed of. Moreover, the gravity of the Earth depends both on the instantaneous height
and longitude/latitude coordinates, etc.
97
The latter is a quadratic function, whose minimum is found in the usual way by means of
dierention:
i x0
i xi t
i ti
vx =
2
i ti
i y0
i yi t
i ti
vy =
2
i ti
1 3
i z i ti z 0
i ti + 2 g
i ti
2
vz =
i ti
Remark 7a6. This example demonstrates that in some situations it is easier to postulate
the statistical model in the form, dierent from the canonical one, i.e. postulating a family of
probability distributions. The denition of P in these cases is implicit: in the last example,
P will be completely specied if we make the additional assumption that i s are i.i.d. and
sampled from3 N (0, 2 ) with known or unknown 2 .
Maximum likelihood estimation.
Definition 7a7. For a regular model (P ) and a sample X P , the Maximum Likelihood estimator (MLE) is
(X)
:= argmax L(x; ) = argmax log L(x; ),
assuming4 argmax exists and taking any of the maximizers, when it is not unique.
The heuristical basis behind the ML method is to choose the parameter, for which the
corresponding P assigns maximal probability to the observed realization of X.
Remark 7a8. For many models, the likelihood function is a product of similar terms (e.g.
when P corresponds to an i.i.d. sample), hence considering log-likelihood is more convenient.
Clearly, this does not aect the estimator itself (since log is a strictly increasing function). Also,
typically, the maximal value of the likelihood is not of immediate interest.
Example 6a3 (continued) For n independent tosses of a coin, the log-likelihood is (x
{0, 1}n , = [0, 1]):
(
)
log Ln (x; ) = Sn (x) log + n Sn (x) log(1 ),
(X)
argmax L(x; ),
i.e. (X)
is any point in the set of maximizers.
98
7. POINT ESTIMATION
and hence the maximum is attained in the interior of . As log Ln (x; ) is a smooth function of
on (0, 1), all local maximizers are found by dierentiation:
) 1
1 (
Sn (x) n Sn (x)
= 0,
1
which gives
1
n,
n (X) = Sn (X) = X
(7a2)
n
which is the familiar intuitive estimator. Since the local maximum is unique it is also the global
one and hence the MLE.
If Sn (x) = 0, log Ln (x; ) = n log(1 ), which is a decreasing function of . Hence in this
case, the maximum is attained at = 0. Similarly, for Sn (x) = n, the maximum is attained at
= 1. Note that these two solutions are included in the general formula (7a2).
Example 7a2 (continued) The log-likelihood is
log Ln (x; ) = NAA (x) log 2 + NAa (x) log 2(1 ) + Naa (x) log(1 )2 =
(
)
(
)
2NAA (x) + NAa (x) log + NAa (x) log 2 + 2Naa (x) + NAa (x) log(1 ).
The maximization, done as in the previous example, gives the following intuitive estimator:
2NAA (x) + NAa (x)
NAA (x) 1 NAa (x)
(x)
=
=
+
.
2n
n
2
n
Note that it coincides with one of the frequency substitution estimators obtained before (but
not the others).
Example 7a9. Let X1 , ..., Xn be an i.i.d. sample from N (, 2 ), = (, 2 ) = R R+ .
The log-likelihood is
n
n
( (x )2 )
1
(xi 1 )2
n
n
i
log
.
exp
=
2
2 2
2
2
22
2 2
i=1
i=1
This function is dierentiable on R R+ and its gradient vanishes at all the local maximizers:
(xi 1 )
log Ln (x; ) =
=0
1
2
n
i=1
n 1
1
log Ln (x; ) =
+
(xi 1 )2 = 0.
2
2 2 222
i=1
i=1
n
1
2 (x) =
(xi x
n )2
n
i=1
99
To see that the obtained extremum is indeed a local maximum we shall examine the Hessian
matrix (of second order derivatives) and verify that its eigenvalues are negative at (1 , 2 ). A
calculation shows that it is indeed the case.
To see that the found local maximum is also global, we have to check the value of the loglikelihood on the boundary, namely as 2 approaches to zero. The case x1 = x2 = ... = xn
can
from the consideration, since probability of getting such sample is zero. Hence
be excluded
2 is strictly positive uniformly over R. This in turn implies lim
(x
)
1
1
2 0 log Ln (x; ) =
i i
uniformly over 1 R. Hence the found maximizer is the MLE of .
Example 7a10. Let X1 , ..., Xn be a sample from the uniform density on [0, ], with the
unknown parameter = (0, ). The likelihood is
{
n
1
0
< maxi xi
n
, x Rn+ .
Ln (x; ) =
I(xi [0, ]) = (1/) I(max xi ) = 1
i
maxi xi
n
i=1
The computation of the MLE amounts to solving an optimization problem and hence neither
existence nor uniqueness of the MLE is clear in advance. Moreover, even when MLE exists and
is unique, its actual calculation may be quite challenging and in many practical problems is done
numerically (which is typically not a serious drawback in view of the computational power of
the modern hard/software).
Below are some examples, which demonstrate the existence and uniqueness issues.
Example 7a11. Let Y1 , ..., Yn be an i.i.d. sample from U ([0, 1]) and let Xi = Yi + , where
= R. We would like to estimate given X1 , ..., Xn . The corresponding likelihood is
Ln (x; ) =
I( xi + 1) = I(min xi , max xi + 1) =
i=1
I(max xi 1 min xi ).
i
i
(
)
(
)
(
)
Note that for any xed n, P mini Xi maxi Xi 1 = 0 = P maxi Xi mini Xi = 1 = 0
and hence with probability 1, the maximizer, i.e. the MLE, is not unique: any point in the
interval [maxi Xi 1, mini Xi ] can be taken as MLE.
Here is a natural and simple example when MLE does not exist:
Example 7a12. Let X1 , ..., Xn be sampled from Poi(), with unknown = R+ . The
log-likelihood is
n
n
(
)
log Ln (x; ) = n +
xi log
log xi !, x Nn .
n
i=1
i=1
If Sn (x) =
n . If, however,
i=1 xi > 0, this expression maximized by (x) = Sn (x)/n = x
Sn (x) = 0, i.e. the sample was all zeros, log Ln (x; ) = n, which does not have a maximum
on the open interval (0, ) (its supremum is 0, which does not belong to the parametric space).
Here is a more vivid demonstration
100
7. POINT ESTIMATION
1 1 ( xi ) 1 (
1 (
1
Ln (x; ) =
+ xi
xi ,
2
2
2
2
i=1
i=2
where the inequality is obtained by removing all the missing positive terms. The obtaned lower
bound can be made arbitrarily large by choosing := x1 and taking 0. Hence the MLE
does not exist.
Remark 7a14. Notice that MLE must be a function of a sucient statistic, which immediately follows from the F-N factorization theorem. In particular, MLE can be constructed using
the minimal sucient statistic. This is appealing from the practical point of view, since the
minimal sucient statistic is all we need for the inference purpose in general (see also Remark
7d11).
Here is another useful property of MLE:
Lemma 7a15. MLE is invariant under reparametrization. More precisely, suppose is the
MLE for the model (P ) and let h be a one-to-one function, mapping the parameter space
1
n )2 + X
2 = 1
(Xi X
Xi2 .
n
n
n
n
i=1
i=1
5such sample is obtained by independent tossings of a fair coin, and sampling from N (, 2 ) if it comes out
101
(X)
= N X/n.
N
Now lets calculate the MLE of . For convenience we shall reparameterize the model by
dening M := N (the number of rotten oranges in the shipment). By Lemma 7a15, (X)
=
M (X)/N . The likelihood for the model with the new parametrization is
(
L(k, M ) =
M
k
)(
)
N M
nk
( )
,
N
n
M {0, ..., N }.
6Here is how E X can be calculated: let be the indicator of the i-th orange in the sample being rotten.
i
Thus X =
i=1
i and E X =
i=1
E i . For j = 2, ..., n,
E j = E P (j |1 , ..., j1 ) = E
Hence the quantities rm =
rm =
m
i=1
j1
N j1
N E =1
=1
=
.
N (j 1)
N (j 1)
m1
m
N rj1
N rj1
N rm1
N rm1
=
+
=
+ rm1
N
(j
1)
N
(m
1)
N
(j
1)
N
(m 1)
i=1
i=1
102
7. POINT ESTIMATION
nk
k
1
N M
M +1
k
nk
N M
M +1
k
(N + 1) 1.
n
Hence the sequence M 7 L(k, M ) increases for all M s less than M := nk (N + 1) 1 and
decreases otherwise. If M is an integer, then L(k, M ) = L(k, M +1) and L(k, M ) > L(k, M )
for all M {M , M + 1}, hence the maximizer is not unique:
0,
= {
} k=0
= N,
k=n
(X)/N .
and (X)
=M
103
n (X) = t
n
i=1
n (X) =
n
n
(7b2)
i=1
appears as plausible as
n (X). Which one is better? As we shall see, in a certain sense, this
question can be resolved in favor of
n (X); however, the answer is far from being obvious at the
outset. The questions of comparison and optimality of decision rules, such as point estimators
or statistical tests, etc. are addressed by the statistical decision theory 8. Below we will present
the essentials in the context of point estimators, deferring consideration of statistical hypothesis
testing to the next chapter.
It is clear that comparing the realizations of estimators, i.e. the corresponding estimates,
is meaningless, since the conclusion will depend on the particular outcome of the experiment.
Hence we shall compare the expected performances of the point estimators.
Definition 7b1. For a statistical model (P ) and a loss function : 7 R+ , the
-risk of the point estimator T is
(
)
R (; T ) := E , T (X) ,
where X P .
The loss function in this denition measures the loss incurred by estimating by T (X)
(for each particular realization) and the risk is the expected loss. Intuitively, the loss should
increase with the magnitude of deviation of the estimate from the true value of the parameter
, i.e. larger errors should be assigned greater loss. Here are some popular loss functions for
Rd :
p (, ) =
|i i |p =: pp
i=1
0 (, ) = I( = )
(, ) = max |i i |
i
advocated for the MLE, and the physicist A.Eddington, who favored the other estimator.
8a concise account can be found in e.g. [2]
104
7. POINT ESTIMATION
These losses can be associated with distances between and : the larger distance the greater
loss is suered. The corresponding risks of an estimator T are given by
Rp (, T ) = E
|i Ti (X)|p = E pp
i=1
)
(
R0 (, T ) = E I( = T (X)) = P = T (X)
R (, T ) = E max |i Ti (X)|.
i
The choice of the loss function in a particular problem depends on the context, but sometimes
is motivated by the simplicity of the performance analysis. In this course we shall consider almost
exclusively the quadratic (MSE) risk R2 (, T ) (assuming that T (X) is such that the risk is nite
and omitting 2 from the notations):
(
)2
R(, T ) = E T (X) .
This risk makes sense in many models and, moreover, turns to be somewhat more convenient to
deal with technically.
Remark 7b2. If the objective is to estimate q(), for a known function q, the MSE risk is
dened similarly
R(q(), T ) = E (q() T (X))2 ,
and the theory we shall develop below translates to this more general case with minor obvious
adjustments.
In view of the above denitions, we are tempted to compare two estimators T1 and T2 by
comparing their risks, i.e. R(, T1 ) and R(, T2 ). Sometimes, this is indeed possible:
Example 7b3. Let X1 , ..., Xn be an i.i.d. sample from U ([0, ]), > 0. As we saw, the MLE
of is
n (X) = max Xi =: Mn (X)
i
fMn (x) =
Hence
E Mn =
0
and
E Mn2
=
0
n n
n
x dx =
n
n+1
(7b3)
n n+1
n
x
dx = 2
.
n
n+2
Consequently,
R(, n ) = E ( Mn )2 = 2 2E Mn + E Mn2 =
(
n )
2
2n
+
= 2
. (7b4)
2 1
n+1 n+2
(n + 1)(n + 2)
105
Since
(7b5)
1
2
(n 1)(n 2)
=
0,
3n (n + 1)(n + 2)
3n(n + 1)(n + 2)
it follows
R(, n ) R(, n ),
= (0, ),
where the inequality is strict for n 3. Hence n yields better (smaller) risk than n , for all
values of the parameter.
This example motivates the following notion
Definition 7b4. An estimator is inadmissible, if there exists an estimator , such that
R(, )
R(, )
(x )
(x c)dx =
E ( (X)) = ( (x)) (x )dx = ( (x))2
(x c)
(X )
(X )
Ec ( (X))2
= ( c)2 Ec
= ( c)2 ,
(X c)
(X c)
i.e. the risk of coincides with the risk of c. The obtained contradiction shows that c
is admissible.
106
7. POINT ESTIMATION
Remark 7b7. Establishing inadmissibility of an estimator amounts to nding another estimator with better risk. Establishing admissibility of an estimator appears to be a much harder
task: we have to check that better estimators dont exist! It turns out that constructing an
admissible estimator is sometimes an easier objective. In particular, the Bayes estimators are
admissible (see Lemma 7c8 below).
The following celebrated example, whose earlier version was suggested by C.Stein, demonstrates that estimators can be inadmissible in a surprising way
Example 7b8 (W.James and C.Stein, 1961). Let X be a normal vector in Rp with independent entries Xi N (i , 1). Given a realization of X, it is required to estimate the vector
. Since Xi s are independent, Xi doesnt seem to bear any relevance to estimating the values
of j , for j = i. Hence it makes sense to estimate i by Xi , i.e. = X. This estimator is
reasonable from a number of perspectives: it is the ML estimator and, as we shall see below, the
optimal unbiased estimator of and also the minimax estimator. The quadratic risk of = X
is constant
p
) = E X 2 = E
R(,
(Xi i )2 = p, Rp .
i=1
This natural estimator = X can also be shown admissible for p = 1 and p = 2, but,
surprisingly, not for p 3 ! The following estimator, constructed by W.James and C.Stein, has
a strictly better risk for all Rp :
(
)
p2
JS
= 1
X.
X2
Note that this estimator, unlike = X, uses all the components of X to estimate each individual
component of and the values of X with smaller norms are pushed further towards the origin.
The latter property is often referred to as shrinkage and the estimators with similar property
are called shrinking estimators. Some further details can be found in the concise article [11].
To compute the risk of JS , we shall need the following simple identity:
Lemma 7b9 (Steins lemma). For N (0, 1) and a continuously dierentiable function
h : R 7 R,
Eh() = Eh (),
whenever the expectations are well dened and nite.
Proof. Note that the standard normal density satises (x) = x(x) and under our
integrability assumptions, the claim follows by integration by parts
107
matrix U , such that U = e1 . Since X = Z + , where Z is the vector with i.i.d. N (0, 1)
entries,
2
X2 = U X2 =
U Z + e1
.
But as U is orthonormal, U Z has the same distribution as Z and
1
1
1
1
2
=
E
e 2 z dz =
2
p/2
X2
p
R
z + e1 (2)
1
1
1
1
exp (v1 )2
vi2 dv
2
p/2
2
2
Rp v (2)
i2
(
)
(
)
p
p
1
1 2
1
1 2 1 2
exp
v
v
dv
exp
v
v
dv
1
1
i
1
i
2
2
2
4
4
Rp v
Rp v
i=1
i=1
(
)
)
(
1
1 2
1
1
2
2
e
exp
vi dv = e
exp r2 rp1 dr < ,
2
2
4
r
4
Rp v
0
i=1
where in we used the elementary inequality xa 14 x2 a2 for all x R and holds by the
change of variables to polar coordinates (the term rp1 is the Jacobian).
Now let us get back to calculating the risk of the James-Stein estimator:
2
p 2
2
p2
p2
JS
2
2
=
X
X
= Z 2 Z,
X +
X
=
X2
X2
X2
p2
(p 2)2
2
Z 2 Z,
X
+
,
X2
X2
where
p2
Z,
X
X2
i=1
p2
Zi + i
= (p 2)
Zi
.
2
X
Z + 2
p
Zi Xi
i=1
xi +i
Consider h(xi ) := x+
2 as a function of xi R with all other coordinates xj , j = i xed, so
that xj = j at least for some j. In this case, h(xi ) is smooth in xi and
i h(xi ) =
1
(xi + i )2
x + 2 2(xi + i )2
=
2
.
x + 4
x + 2
x + 4
i
=
E
E
h(Z
)
Z
,
j
=
i
=
i
i
j
i
i
j
Z + 2
(
)
1
(Zi + i )2
1
Xi2
E E
2
Z
,
j
=
i
= E
2E
j
2
4
2
Z +
Z +
X
X4
E Zi
108
7. POINT ESTIMATION
and hence
p (
p2
1
Xi2
=
E Z,
X = (p 2)
E
2E
X2
X2
X4
i=1
(
)
p
1
1
(p 2)E
2
= (p 2)2 E
2
2
X
X
X2
)
p2
(p 2)2
JS
JS
2
2
R( , ) = E = E Z 2 Z,
X +
=
X2
X2
1
(p 2)2
1
),
p 2(p 2)2 E
+
E
= p (p 2)2 E
< p = R(,
2
2
X
X
X2
for all Rp .
mator, suggested earlier by the frequency substitution, is (X) = X1 X2 . Lets calculate the
corresponding risks:
(
)2
(
)
= E X1 + X2 = 1 var (X1 ) + var (X2 ) = 1 (1 ),
R(, )
2
4
2
and
(
)2
= E X2 X2 = 2 2E X2 X2 + E X2 X2 = 2 23 + 2 = 22 (1 ).
R(, )
is worse (greater) than R(, )
for (0, 1/4)
The obtained risks are plotted at Figure 1: R(, )
and vise versa for (1/4, 1)
This example shows that the risks of two estimators may satisfy opposite inequalities on
dierent regions of : since we do not know in advance to which of the regions the unknown
parameter belongs, preferring one estimator to another in this situation does not make sense.
If we cannot compare some estimators, maybe we can still nd the best estimator for
which
for all estimators ?
R(, ) R(, ),
A simple argument shows that the latter is also impossible! Suppose that the best estimator
R(0 , ) = 0.
But 0 was arbitrary, hence in fact R(, ) = 0 for all , i.e. errorless estimation is possible,
which is obviously a nonsense. This contradiction shows that the best estimator does not exist.
109
0.25
0.2
0.15
0.1
0.05
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
and R(, )
as functions of [0, 1]
Figure 1. the risks R(, )
Remark 7b11. Note that nonexistence of the best estimator does not contradict the existence of the admissible estimators:
R(, a ) R(, ),
r()
110
7. POINT ESTIMATION
is the worst value of the risk. Any two estimators and are now compared by
In words, r()
and r()
and the estimator with smaller one is
comparing the corresponding maximal risks r()
preferred.
The best estimator , i.e. the one which yields the smallest maximal risk is called minimax:
= r(),
In words, the minimax estimator optimizes the worst case performance. The principle disadvantage of this approach is that the estimator which yields the best performance in the worst
case (which maybe quite bad on its own), may perform quite poorly for other values of the
parameters.
Bayes estimation. In the Bayesian approach, one chooses a prior probability distribution
(e.g. in the form of p.d.f. if is a continuum) (), and denes the Bayes risk of the
estimator as
r(, )
R(, )()d.
Any two estimators are comparable by their Bayes risks. The estimator, which minimizes the
Bayes risk, is called the Bayes estimator and is usually given by an explicit (in fact Bayes, what
else ?) formula (see more details in Section c below).
The prior distribution , from which the value of the parameter is sampled, is interpreted as
the subjective a priori belief regarding . For example, if = [0, 1] (as in the coin tossing), the
prior = U ([0, 1]) makes all s equiprobable. If for some reason we believe that cannot deviate
from 1/2 too much, we may
express
(
)r this belief by choosing a prior, concentrated more around
= 1/2, e.g. () = Cr (1 ) , where r is a positive number and Cr is the corresponding
normalizing constant. The Bayes estimator is then viewed as the optimal fusion of the a priori
belief about the unknown parameter and the data obtained in the experiment.
On one hand, introducing a prior can be viewed as an advantage, since it allows to incorporate
prior information about the parameter into the solution. On the other hand, if such information
is not available or, more frequently, cannot be translated to a particular prior distribution,
the choice of the prior is left arbitrary and thus problematic (do not think that a uniform
prior is not informative !). On the third hand, the Bayes estimator enjoys many good decision
theoretic properties and turns to yield the best possible behavior asymptotically, as the sample
size increases. We shall discuss Bayes estimators in more details below.
Restricting estimators to a class. While the best estimator does not exist, remarkably
it may exist and even be explicitly computable, if we consider a restricted class of estimators
satisfying certain additional properties.
Equivariant estimators. Let X = (X1 , ..., Xn ) be an i.i.d. sample from the density f (x ),
(x
for any shift11 s R. Remarkably, this extra structure of the estimators
guarantees existence of the best estimator, which can be explicitly computed.
9in other words, X = + Y , where Y s are sampled from the p.d.f. f
i
i
i
10more generally, equivariance is dened with respect to a group of operations
11for a vector x Rn and a scalar c R, x c stands for the vector with entries x c
i
C. BAYES ESTIMATOR
111
Unbiased estimators.
Definition 7b12. Let X P . The bias function of an estimator12 is
:= E (X)
b(, )
.
= 0, , i.e. E (X)
r() =
R(, )(d),
sL(X; s)(s)ds
,
(7c1)
(X) =
L(X; r)(r)dr
12Pay attention that denition of the bias implicitly requires the expectation to be well dened
13Note that the trivial estimators, such as used in the discussion preceding Remark 7b11 are biased and thus
112
7. POINT ESTIMATION
r () =
R(s, )(s)ds =
(s (x))
f (x; s)(s)dxds.
Rn
Note that f (x; s)(s) is a j.p.d.f. on Rn (since it is nonnegative and integrates to one) and
let E denote the corresponding expectation. Let be a sample from . Then
(
)2
= E (X)
r ()
,
which is recognized as the MSE of the prediction error of , from the observation of X. The
minimal MSE is attained by the conditional expectation, which in this case is given by the Bayes
formula:
sf (X; s)(s)ds
E(|X) =
.
f (X; r)(r)dr
The Bayes estimator with respect to risks, corresponding to other losses are computable
similarly. Let and X P , then for a given loss function and any statistic T (x), the
Bayes risk is given by15
(
)
(
)
r (, T ) =
E , T (X) ()d =
, T (x) fX| (x; )()dxd =
Rn
)
(
)
(
Rn z
Rn ,
z (X) argminz
(, z)f|X (; X)d.
Remark 7c2. Note that for any loss function, z (X) depends on the data X only through
the posterior distribution f|X . The posterior distribution can be thought as an update of the
prior distribution on the basis of the obtained observations. In other words, the data updates
our prior belief about the parameter.
Lets see how this recovers the formula (7c1) for 2 loss:
2 f|X (; X)d 2z
f|X (; X)d + z 2 .
C. BAYES ESTIMATOR
113
z (X) :=
f|X (; X)d = E(|X),
(, z)f|X (; X)d =
| z|f|X (; X)d.
If e.g. = R, then
| z|f|X (; X)d =
(z )f|X (; X)d +
( z)f|X (; X)d.
The latter is a dierentiable function in z, if the integrands are continuous, and the derivative
is given by17:
z
d
| z|f|X (; X)d =
f|X (; X)d
f|X (; X)d =
dz
z
(
)
F|X (z; X) 1 F|X (z; X) = 2F|X (z; X) 1.
Equating the latter to 0 we nd that the extremum is attained at z (X) which is the solution of
F|X (z; X) = 1/2.
Dierentiating one more time w.r.t. z, we nd that the extremum is in fact a minimum. Hence
z (X) is the median of the posterior in this case.
Example 7c3. Suppose that we use a device, which is known to be quite precise in terms of
the random errors, but may have a signicant constant error, which we would like to estimate
(and compensate from the device measurements in the future). More precisely, we perform n
measurements and obtain an i.i.d. sample from N (, 1), where R is the unknown parameter.
Suppose we do not believe that the constant error term is too large, which we express by
assuming that is itself a random variable sampled from N (0, 2 ), where 2 is known to us and
it expresses our belief regarding the dispersion of the error. Hence Xi = + Zi , where Zi s are
i.i.d. N (0, 1) r.v.
In this setting, our prior is = N (0, 2 ) and hence , X1 , ..., Xn are jointly Gaussian. So
the general formula for the Bayes estimator can be bypassed, using its particular form in the
Gaussian case:
E(|X1 , ..., Xn ) = E + Cov(, X)Cov1 (X, X)(X EX).
(
)
We have E = 0, EX = 0, ... 0 and
{
2 + 1 i = j
cov(, Xi ) = 2 , cov(Xi , Xj ) =
2
i = j
Let 1 denote the column vector of ones, then
Cov(, X) = EX = 2 1
17Recall that
d
(z, z)
dz
(x, z)x:=z
x
(z, y)y:=z
y
114
7. POINT ESTIMATION
and
Cov(X, X) = EXX = 2 11 + I,
where I is the n-by-n identity matrix. Lets check that18 Cov1 (X, X) = I
(
1
11 :
2 +n
)
1
2
1
11 + I I 2
11
= 2 11 2
11 11 + I 2
11 =
+n
+n
+n
)
( 2n
2n
1
1
2 11 2
=
11 + I 2
11 = 2 11 + I 11 2
+ 2
+n
+n
+n +n
2 11 + I 2 11 = I,
2
)(
n
1
2
1 2
1 X = 2
Xi .
+n
+n
i=1
The estimator depends explicitly on 2 : if is large, i.e. the uncertainty about is big, the
n (i.e. the a priori belief is essentially ignored and
estimator is close to the empirical mean X
the estimator relies on the data alone). If 2 is small, the prior is concentrated around zero
and the estimator is close to zero, virtually regardless of the observations (for moderate n). If
we continue to increase n, we shall again get back to the empirical mean. This behavior is
intuitively appealing, since for large sample the empirical mean is close to the actual value of
the parameter by virtue of the law of large numbers (regardless of our prior beliefs!).
Example 7c4. Consider the n i.i.d. tosses of a coin with unknown parameter . Suppose
we tend to believe that should not be too far from 1/2 and express our belief by assuming the
prior density
r () = Cr r (1 )r ,
[0, 1],
where Cr is the normalizing constant (whose value as we shall see wont play any role) and
r 0 is a parameter, which measures the strength of our faith: for large r, r () is strongly
concentrated around 1/2 and for small r, it is close to the uniform distribution (which itself
corresponds to r = 0). The Bayes estimator under the quadratic loss is the posterior mean given
by (7c1):
1
(X) = 1
0
0
(Sn (X) + 2 + r)/(3 + 2r + n)
.
(Sn (X) + 1 + r)/(2 + 2r + n)
18this is an application of the well known matrix Woodbury identity, known also as the matrix inversion
lemma
C. BAYES ESTIMATOR
115
i = 1, ..., n
where Zi s are i.i.d. N (0, 1) r.v.s Let = N (, 2 ) (an arbitrary Gaussian density),
then , X1 , ..., Xn are jointly Gaussian (i.e. form a Gaussian vector) and hence the conditional
distribution of given X is Gaussian with the parameters, which can be computed explicitly as
in the Example 7c3.
Here is an example, popular in computer science
Example 7c7. Lets show that Dirichlet distribution is conjugate to multinomial likelihood
and nd the corresponding parameters of the posterior. The p.m.f. of the multinomial distribution with parameters
= (1 , ..., k ) = S
k1
:= {u R : ui 0,
k
ui = 1},
i=1
is given by
p(x; ) = L(x; ) =
n!
x1 ...kxk ,
x1 !...xk ! 1
x Nk ,
i=1
xi = n.
116
7. POINT ESTIMATION
Hence the conjugate prior should be a distribution on the simplex Sk1 . The Dirichlet distribution D() is dened by the the p.d.f.
1 i 1
f (; ) =
i
,
B()
k
Sk1
i=1
where = (1 , ..., k )
Rk+ ,
i=1 i
and where k := 1 k1
j=1 j is dened. Notice that the p.d.f. actually depends only on the
rst k 1 coordinates of , i.e. it is a p.d.f. on Rk1 (and not on Rk , think why).
To check the claim we have to verify that the posterior is Dirichlet and nd its parameters.
Let X be a sample from the above multinomial distribution with = D(). Then by an
appropriate Bayes formula we have
f|X (; x)
1 ( +x )1
n!
i i
i i i
1x1 ...kxk
x1 !...xk !
k
i=1
i=1
where the symbol stands for equality up to the multiplicative normalizing constant which
depends only on x (and not on ). Hence the resulting posterior is identied as Dirichlet with
parameter + x, where x = (x1 , ..., xk ).
Lets see how these conjugates are typically used. Suppose we roll a dice n times independently and would like to estimate the probabilities to get each one of the six sides. Hence we
obtain a realization of n i.i.d. r.v.s 1 , ..., n each taking value in {1, ..., 6}
with probabilities
1 , ..., 6 . Let Xi be the number of times the i-th side came up, i.e. Xi = nm=1 I(m = i). As
we saw, X = (X1 , ..., X6 ) is a sucient statistic and hence nothing is lost if we regard X as our
observation. If we put a prior Dirichlet distribution on , then the emerging Bayes estimator
(under the quadratic risk) has a very simple form.
A direct calculation shows that for V D(),
i
.
EVi = k
k
j=1
Since we already know that the posterior is D( + X), we conclude that
i + Xi
,
j=1 (j + Xj )
E(i |X) = 6
i = 1, ..., 6.
Note that before obtaining any observations, our best guess of i is just i / 6j=1 j . As n
grows, the estimator becomes more inuenced by X and for large n becomes close to the usual
empirical frequency estimator.
If we would have chosen a dierent prior, the Bayes estimator might be quite nasty, requiring
complicated numerical calculations.
A list of conjugate priors and the corresponding likelihoods can be looked up in the literature.
When dealing with a particular model P , one may be lucky to nd a family of conjugate priors,
C. BAYES ESTIMATOR
117
20
automatically
Some properties of the Bayes estimator. The Bayes estimator satises a number of
interesting properties.
Lemma 7c8. The Bayes estimator is admissible, if the corresponding prior density is
positive and the risk of any estimator is continuous in .
Remark 7c9. The continuity assumption is quite weak: e.g. it is satised, if the loss function
is continuous in both variables and 7 P is continuous (in an appropriate sense).
Proof. Let be the Bayes estimator with respect to the loss function and the prior .
Suppose is inadmissible and let be an estimator such that
= E (, )
E (, ) = R(, ), ,
R(, )
where the inequality is strict for some . The continuity of risks imply that in fact the
latter inequality is strict on an open neighborhood of and hence
rB () =
R(, )()d <
R(, )()d = rB ( ),
The following lemma establishes the important connection between the Bayes and the minimax estimators.
Lemma 7c10. A Bayes estimator with constant risk is minimax.
Proof. Let be the Bayes estimator w.r.t. to the loss function and the prior , such
that
R(, ) = E (, ) C,
sup R(, ) = C =
R(, )()d
R(, )()d
sup R(, ),
The preceding lemma is only rarely useful, since the Bayes estimators with constant risks
are hard to nd. The following variation is more applicable:
Lemma 7c11. Let (i ) be a sequence of priors and i be the corresponding Bayes estimators.
If the Bayes risks converge to a constant
lim
R(, i )i ()d = C,
(7c2)
i
20in particular, when the number of the sample grows any Bayes estimator will typically forget its prior
118
7. POINT ESTIMATION
sup R(, )
R(, )i ()d
R(, i )i ()d C.
Example 7c12. In Example 7c3 we saw that the Bayes estimator of from the sample
X1 , ..., Xn with Xi N (, 1) and the Gaussian prior N (0, 2 ) is given by
1
Xi .
2 + n
n
= E(|X) =
i=1
n (X) = X
n = , the estimator
Since E
n = E X
n is unbiased. What about the MLE of 2 :
1
2
(Xi X)
n
n
n2 (X) =
i=1
1
2 = 2 E 1
2=
(Xi X)
(Zi Z)
n
n
n
i=1
i=1
n
)
(
)
( 1
Zi2 E (Zn )2 = 2 1 1/n .
2 E
n
i=1
Since E
n2 (X) = 2 , it is a biased estimator of 2 with the bias
(
)
b( 2 ,
n2 ) = 2 1 1/n 2 = 2 /n,
D. UMVU ESTIMATOR
119
n 1
n )2 = 1
n )2
n2 (X) :=
n2 (X)/ 1 1/n =
(Xi X
(Xi X
n1n
n1
i=1
i=1
Example 7d2. For an i.i.d. sample X1 , ..., Xn from U ([0, ]) with the unknown parameter
n
= (0, ), the MLE of is n (X) = maxi Xi . By (7b3), E n (X) = n+1
and hence n is
biased:
n
b(, n ) =
= /(n + 1).
n+1
Again, slightly modifying n we obtain an unbiased estimator
n+1
n+1
n (X) =
n (X) =
max Xi ,
i
n
n
While unbiased estimators are often practically appealing, do not think they are always
available! The unbiased estimator may not exist, as the following example demonstrates:
Example 7d3. Let X1 , ..., Xn be i.i.d. coin tosses with probability of heads (0, 1). We
want to estimate the odds, i.e. the ratio q() = /(1 ). Suppose T is an estimator of q(),
then
E T (X) =
T (x)S(x) (1 )nS(x) ,
x{0,1}n
where S(x) = i xi . The latter is a polynomial of a nite order in . On the other hand, the
function q() can be written as the series:
q() =
=
i ,
1
(0, 1),
i=0
where we used the formula for the sum of geometric progression. Since polynomials of dierent
orders cannot be equal, we conclude that E T (X) = /(1 ) for some R+ . But as T was
an arbitrary statistic, we conclude that no unbiased estimator of q() exists.
It is convenient to decompose the quadratic risk into the variance and bias terms:
)2
(
= E E (X)
(X)
=
+ E (X)
R(, )
(
)2
(
)2
+ b2 (, )
(7d1)
E E (X)
+ E E (X)
(X)
= var ()
where we used the property
(
)(
)
(
) (
)
2E E (X)
E (X)
(X)
= 2 E (X)
E E (X)
(X)
=
(
)(
)
2 E (X)
E (X)
E (X)
= 0.
{z
}
|
=0
In particular, the quadratic risk for unbiased estimators consists only of the variance term.
120
7. POINT ESTIMATION
1 x = j + 1
j (x) = 0
x=j
1
x=j1
and extend the denition to other x Z by periodicity, i.e. j (x) := j (x + 3) for all x Z
(sketch the plot). Note that for any Z,
)
1(
E j (X) =
j ( 1) + j () + j ( + 1) = 0,
3
since the average over any three neighboring values of j (x) equals zero (look at your plot). Now
set Tj (X) = X + j (X). The estimator Tj (X) is unbiased:
E Tj (X) = + E j (X) = .
Under distribution Pj , X takes the values in {j 1, j, j +1} and Tj (X) j. Hence varj (Tj (X)) =
0 as claimed.
Note that the crux of the proof is construction of a family of nontrivial unbiased estimators
of zero j (X), j Z. This is precisely, what the notion of completeness, to be dened below,
excludes.
21If X , ..., X is a sample from U ([0, 1]), T (X) =
1
n
n+1
n
D. UMVU ESTIMATOR
121
One of the main tools in analysis of the unbiased estimators is the RaoBlackwell theorem,
which states that conditioning on a sucient statistic improves the MSE risk:
Theorem 7d8 (Rao-Blackwell). Let (P ) be a statistical model, X P , S(X) be a
sucient statistic and T (X) an estimator of q() for some known function q, with E T 2 (X) <
. Then the statistic
T (X) := E (T (X)|S(X))
is an estimator of q() with the same bias as T (X) and smaller MSE risk:
R(q(), T ) R(q(), T ),
(7d2)
122
7. POINT ESTIMATION
Note that T (X) = E (T (X)|S(X)) is a function of S(X) and hence by orthogonality property
(
)(
)
E T (X) E T (X) T (X) T (X) =
(
)(
)
E T (X) E T (X) T (X) E (T (X)|S(X)) = 0.
|
{z
}
a function of S(X)
(
)2
var (T ) = E T (X) T (X) + var (T ) var (T ),
The R-B theorem suggests yet another way to construct estimators: come up with any
unbiased estimator and improve it by applying the R-B procedure. Here is a typical application
Example 7d10. Let Xi be the number of customers, who come at a bank branch at the
i-th day. Suppose we measure X1 , ..., Xn and would like to estimate the probability of having
no customers during a day. Lets accept the statistical model, which assumes that X1 , ..., Xn
are i.i.d. r.v.s and X1 Poi(), > 0. We would like to estimate e , which is the probability
of having no clients under this statistical model.
is T (X) = I(X = 0): indeed, E T (X) = P (X =
An obvious22 unbiased estimator of e
1
1
(
)
n
P
X
=
0,
X
=
k
1
j
j=2
P X1 = 0, S(X) = k
(
)
=
=
P X1 = 0|S(X) = k =
n
P (S(X) = k)
P
i=1 Xi = k
)
(
) ( n
(
)k
(
X
=
k
P X 1 = 0 P
j=2 j
e e(n1) (n 1) /k!
(n 1)k
1 )k
(
)
=
=
=
1
,
( )k
n
n
nk
en n /k!
P
X =k
(
i=1
where we used the fact that a sum of independent Poisson r.v.s is Poisson. Hence the statistic
(
1 )S(X)
T (X) = 1
,
(7d3)
n
is an unbiased estimator of, whose risk improves the risk of the original estimator T (X). Note
n = 1 n Xi (by the law of large numbers) and 23
that if n is large, X
i=1
n
(
1 )S(X) (
1 )nX n
T (X) = 1
= 1
e .
n
n
22note that strictly speaking, I(X = 0) cannot be accepted as a point estimator of e , since none of its
1
values, i.e. {0, 1}, is in the range of e , R+ . However, the claim of R-B theorem (and many other results
above and below) do not depend on this assumption
23recall that for := (1 1/n)n , log = n log(1 1/n) 1 + o(1/n), i.e. e1
n
n
n
D. UMVU ESTIMATOR
123
Since T is complete, we conclude that g(T (X)) = 0 and thus g(T (X)) = 0, P -a.s. which means
that T (X) is complete.
Lets see how this notion helps in nding the UMVUE.
Lemma 7d14 (Lehmann-Schee). If the sucient statistic S(X) is complete, then there is
at most one unbiased estimator, coarser than S(X).
Proof. Suppose that T (X) = g(S(X)) and T (X) = g (S(X)) are both unbiased, then
(
)
(
)
E g(S(X)) g (S(X)) = E T (X) T (X) = 0,
and since S(X) is complete, g(S(X)) = g (S(X)), i.e. T (X) = T (X), P -a.s.
This implies that R-B procedure applied to any unbiased estimator, using a complete sucient statistic, yields the unique and hence optimal unbiased estimator, i.e.
Corollary 7d15. If S(X) is the complete sucient statistic, the R-B procedure yields the
UMVUE.
This suggests at least two ways of searching for the UMVUE:
(1) Rao-Blackwellize an unbiased estimator, using the complete sucient statistic.
124
7. POINT ESTIMATION
(2) Find an unbiased estimator, which is a function of the complete sucient statistic.
As we shall shortly see below, the sucient statistic S(X) in Example 7d10 is complete and
hence the estimator (7d3) is UMVUE. The second approach is demonstrated in the following
example:
Example 7d16. Suppose that X1 , ..., Xn are i.i.d. U ([0, ]) r.v. with the unknown parameter
R+ . The minimal sucient statistic M = maxi Xi has the density fM (x) = nxn1 /n ,
supported on [0, ]. Suppose that we know that M is complete and would like to nd the
UMVUE of q(), where q is a given dierentiable function. Consider the estimator of the form
(M ), where is a continuous function. If (M ) is an unbiased estimator of q()
s
1
q(s) = Es (M ) = n
nxn1 (x)dx, s > 0.
s 0
Taking the derivative, we obtain
s
n1
q (s) = ns
nxn1 (x)dx + ns1 (s) = ns1 q(s) + ns1 (s),
0
and
1
sq (s) + q(s).
n
Hence taking q() := , we obtain the UMVUE of :
(s) =
1
n+1
(X)
= M +M =
max Xi ,
i
n
n
which we already encountered in Example 7d2. For q() = sin() we obtain the UMVUE
qb(X) :=
1
M cos(M ) + sin(M ).
n
The main question remains: how to establish completeness of a sucient statistic? The rst
helpful observation is the following:
Lemma 7d17. A complete sucient statistic is minimal.
We have already seen (recall the discussion following the Example 7d10 on page 122), that if
R-B is carried out with a sucient statistic, which is not minimal, UMVUE cannot be obtained in
general. In view of the L-S theorem, this implies that a complete sucient statistic is necessarily
minimal. Here is a direct proof:
Proof. Let T (X) be a complete sucient statistic and S(X) be the minimal sucient
statistic. Since S(X) is sucient, E (T (X)|S(X)) does not depend on , i.e. E (T (X)|S(X)) =
(S(X)) for some function . But as S(X) is the minimal sucient statistic, it is, by denition,
coarser than any other sucient statistic, and in particular, S(X) = (T (X)) for some function
D. UMVU ESTIMATOR
125
( (
))
. Hence E (T (X)|S(X)) = T (X) . Let g(u) := u ((u)), then
(
)
(
)
E g(T (X)) = E T (X) ((T (X))) = E T (X) (S(X)) =
(
)
E T (X) E (T (X)|S(X)) = E T (X) E E (T (X)|S(X)) =
E T (X) E T (X) = 0,
But T (X) is complete and hence g(T (X)) 0 with probability one, i.e.
(
)
E T (X)|S(X) = T (X).
(7d4)
(
)
By denition of the conditional expectation, E T (X)|S(X) is a function of S(X) and hence
(7d4) implies that T (X) is coarser than S(X). But on the other hand, S(X) is minimal sucient
and hence is coarser than T (X). Hence S(X) and T (X) are equivalent and thus T (X) is minimal
sucient.
Remark 7d18. A minimal sucient statistic does not have to be complete and hence RB conditioning on the minimal sucient statistic is not enough to get the UMVUE (revisit
Example 7d6).
The above lemma shows that the only candidate for a complete sucient statistic is the
minimal sucient. How do we check that the minimal sucient statistic is complete? Sometimes,
this can be done directly by the denition, as the following examples demonstrate.
Example 6a3
(continued) We saw that for n independent tosses of a coin with probability of
heads , S(X) = ni=1 Xi is the minimal sucient statistic. Let us check whether it is complete.
Let g be a function such that E g(S(X)) = 0, . Recall that S(X) Bin(n, ) and hence
( )
( )(
)i
n
n
n i
n
E g(S(X)) =
g(i)
(1 )ni = (1 )n
g(i)
i
i
1
i=0
Hence
n
i=0
i=0
( )(
)i
n
g(i)
0,
i
1
(0, 1)
or, equivalently, for any /(1 ) R+ . But since the left hand side is a polynomial in /(1 ),
we conclude that all its coecients equal zero, i.e. g(i) = 0, and thus g(S(X)) = 0. Hence S(X)
is complete.
n = S(X)/n is an unbiased estimator of . The R-B procedure does not
Now recall that X
n and hence by completeness of S(X), it is the UMVUE.
change X
Let us see, e.g. that the trivial sucient statistic X = (X1 , ..., Xn ) is not complete. To this
end, we should exhibit a function g, such that E g(X) = 0, but g(X) is not identically zero
(with probability one). A simple choice is g(X) = X1 X2 : clearly g(X) is not identically zero
P (g(X) = 0) = 1 P (X1 = X2 ) = 1 2 (1 )2
However E g(X) = = 0. Hence X is not complete.
126
7. POINT ESTIMATION
Example 7d19. Let X1 , ..., Xn be a sample from U ([0, ]), > 0 distribution. As we saw,
M (X) = maxi=1,...,n Xi is the sucient statistic. To check completeness let g be a function such
that E g(M ) = 0, for all . Recall that M has the density (Example 7b3)
n
fM (x) = n xn1 I(x (0, )),
and hence
n n1
n
E g(M ) =
g(s) n s
I(s (0, ))ds = n
g(s)sn1 ds.
0
0
So E g(M ) = 0 reads
g(s)sn1 ds = 0, ,
0
which implies that g(s) = 0 for almost all s [0, 1]. Hence g(M ) = 0, P -a.s., which veries
completeness of M .
Recall that (1 + 1/n)M (X) is an unbiased estimator of ; since M (X) is complete, it is in
fact the UMVUE by the L-S theorem.
These examples indicate that checking completeness can be quite involved technically. Luckily, it can be established at a signicant level of generality for the so called exponential family
of distributions.
Definition 7d20. The probability distributions (P ) on Rm belong to the k-parameter
exponential family if the corresponding likelihood (i.e. either p.d.f. or p.m.f.) is of the form
L(x; ) = exp
k
{
}
ci ()Ti (x) + d() + S(x) I(x A),
x Rm
(7d5)
i=1
where ci s and d are 7 R functions, Ti s and S are statistics taking values in R and A Rm
does not depend on .
Remark 7d21. Clearly, if the probability distribution of X1 belongs to an exponential family,
the probability distribution of X1 , ..., Xn also belongs to the same exponential family (check!).
Remark 7d22. By the F-N factorization theorem, T = (T1 (x), ..., Tk (x)) is a sucient
statistic.
Example 7d23. The Bernoulli distribution belongs to one parameter exponential family:
{
}
p(x; ) = x (1 )1x I(x {0, 1}) = exp x log + (1 x) log(1 ) I(x {0, 1}),
Example 7d24. Let X1 , ..., Xn be an i.i.d. sample from N (, 2 ) with unknown parameter
= (1 , 1 ) = (, 2 ) R R+ . The right hand side of (6d5) tells that P belongs to 2parameters exponential family with c1 () = n1 /2 , c2 () = 1/2n1/2 and T1 (x) = x
n , T2 (x) =
n
2
n
2
xn , d() = 1/2n1 /2 n log 2 and S(x) = 2 log(2) and A = R .
D. UMVU ESTIMATOR
127
Example 7d25. Uniform distribution U ([0, ]), R+ does not belong to the exponential
family, since its support depends on (no appropriate A can be identied).
The following fact, whose proof is beyond the scope of our course, is often handy:
Theorem 7d26. Let (P ) belong to k-parameter exponential family of distributions with
the likelihood function (7d5). The canonical statistic
(
)
T (X) = T1 (X), ..., Tk (X)
is complete, if the interior
24
is not empty.
Example 6a3 (continued)
Lets see how this theorem is applied to deduce completeness of the
statistic S(X) = ni=1 Xi (which we have already checked directly). The likelihood in this case
is the j.p.m.f.:
{
}
Hence L(x; ) belongs to the one parameter exponential family with c() := log 1
. When
increases from 0 to 1, the function c() moves from to +, i.e. {c(), [0, 1]} = R, which
trivially has a non-empty interior (in fact any open ball, i.e. interval in this case, is contained
in R). Hence the statistic S(X) is complete.
Example 7d24 (continued) The sample X1 , ..., Xn from N (, 2 ) with the unknown parameter
:= (, 2 ) R R+ , has the likelihood:
}
{
n
1
1 (xi )2
L(x; ) =
=
exp
2
2
(2)n/2 n
i=1
{
}
n
n
1 2
n
exp n/2 log(2) n/2 log 2 2
xi + 2
xi 2 2 , x Rn .
2
2
i=1
i=1
and
i=1
)
(
) (
c() = c1 (), c2 () = 1 /2 , 1/(22 ) .
24Consider the Euclidian space Rd with the natural distance metric, i.e. x y =
i=1 (xi
yi )2 . An
open ball with radius > 0 and center x, is dened as B (x) = {y R : x y < }, i.e. all the points which
are not further away from x than . Let A be a set in Rd . A point x A is an internal point, if there exists an
open ball with suciently small radius > 0, such that B (x) A. The interior of A, denoted by A is the set of
all internal points. A point x is a boundary point of A, if it is not internal and any open ball around x contains
an internal point of A. The set of the boundary points is called the boundary of A and is denoted by A. A set
A is closed if it contains all its boundary points.
d
128
7. POINT ESTIMATION
alent to T (X), is also complete. But we have already seen, that T (X) is the unbiased estimator
of (, 2 ) and in view of completeness, the L-S theorem implies that T (X) is the UMVUE.
Now we can explain why the estimator (7b1) is better than the estimator (7b2) and in which
sense. First of all, note that the estimator
1
n|
|Xi X
n
i=1
(
)
n
n
2 and hence is inadmisis not a function of the minimal sucient statistic
X
,
X
i
i=1
i=1 i
sible, being strictly improvable by means of the Rao-Blackwell lemma. The estimator
v
u n
u1
n )2 ,
n (X) = t
(Xi X
n
n
n (X) =
i=1
is the optimal one among the estimators with the same bias, since it is coarser than the minimal
sucient statistic, which, as we saw, is complete.
Further,
v
v
u n
u n
u1
u1
2
t
n (X) =
(Xi Xn ) = t
(Zi Zn )2 ,
n
n
i=1
i=1
with Zi := (Xi )/. Let C2 (n) := E n1 ni=1 (Zi Zn )2 and note that it does not depend on
, as Zi s are i.i.d. N (0, 1) r.v.s. Hence
n (X)/C1 (n), where C1 (n) := E n i=1 |Zi Zn | (again independent of ), is also an unbiased
estimator of . However,
n (X)/C2 (n) is a function of the complete sucient statistic and
hence is UMVUE. Thus the normalized unbiased versions of
n (X) and
n (X) are comparable
and the former has better risk.
Example 7d10 (continued) Lets see that the unbiased estimator (7d3) is in fact UMVUE of
e . Since T (X) is equivalent (why?) to S(X), by Remark 7d13 and L-S theorem it is enough
to check that S(X) is complete. To this end, note the likelihood for this statistical model:
{
}
n
n
e xi
L(x; ) =
= exp n + log S(x)
log(xi !) ,
xi !
i=1
i=1
belongs to the one parameter exponential family with c() = log . The range of c() over > 0
is R and hence S(X) is complete by the Lemma 7d26.
Finally, let us demonstrate that the empty interior part of the Lemma 7d26 cannot be in
general omitted:
Example 7d27. Consider a sample X1 , ..., Xn from N (, 2 ), where R+ is the unknown
parameter. Repeating the calculations from the preceding example, we see that L(x; ) still
D. UMVU ESTIMATOR
129
R+ .
The latter is a one dimensional curve in R R and hence its interior is empty (sketch the plot
of c()). Hence L-S theorem cannot be applied to check completeness of T (X) (as dened in
the preceding example).
In fact, the statistic is easily seen to be incomplete: take e.g.
1
n )2 n (X
n )2 ,
(Xi X
n1
n+1
i=1
( n
)
n
2
which is a function of T (X) =
i=1 Xi ,
i=1 Xi after a rearrangement of terms. We have
n
g(T (X)) =
1
n )2 n E (X
n )2 =
(Xi X
n1
n+1
n
E g(T (X)) = E
i=1
n ( 2 1 2)
+ = 0,
n+1
n
R+ .
(
)
However, it is obvious that g T (X) is a non-degenerate random variable (e.g. its variance is
nonzero).
The notions of completeness and suciency are related to ancillarity:
Definition 7d28. A statistic T is ancillary if its probability distribution does not depend
on .
While an ancillary statistic does not contain any information about the unknown parameter
on its own, it nevertheless may be very much relevant for the purpose of inference in conjunction
with other statistics. An ancillary statistic T is an ancillary complement of an insucient
statistic T , if (T, T ) is sucient. For example, if Xi = + Ui , where Ui U ([0, 1]) are
independent, the statistic T (X) = X1 X2 is an ancillary complement of the statistic T (X) =
X1 + X2 .
A statistic is called rst order ancillary if its expectation is constant w.r.t. . Hence by denition, a statistic is complete if any coarser rst order ancillary statistic is trivial. In particular,
it is complete if any coarser ancillary statistic is trivial. A deeper relation to completeness is
revealed by the following theorem
Theorem 7d29 (D. Basu). A complete sucient statistic and an ancillary statistic are
independent.
Proof. Let S be complete sucient and T ancillary. Then for a xed x R, (x) :=
P (T x) doesnt depend on . On the other hand, by suciency of S
(x; S) := P (T x|S)
also doesnt depend on . Moreover,
(
)
(
)
E (x) (x; S) = E P (T x) P (T x|S) = 0
130
7. POINT ESTIMATION
f (x; )dx
(7e1)
f (x; )dx =
Rn
n
R
and
(7e2)
T (x)f (x; )dx =
T (x) f (x; )dx.
Rn
Rn
Then
( )2
()
var (T )
,
(7e3)
I()
where () := E T (X) and
(
)2 (
)2
I() := E
log f (X; ) =
log f (x; ) f (x; )dx
Rn
is called the Fisher information, contained in the sample X.
f (x; )dx = 0.
Rn
Moreover, (7e2) reads:
(7e4)
f (x; )dx =
() =: ().
Rn
Multiplying (7e4) by () and subtracting it from (7e5), we obtain:
(
)
T (x) ()
f (x; )dx = ().
n
R
T (x)
(7e5)
131
( )2
(
)
(
)
f (x; )
() =
f (x; )dx =
f (x; )dx
T (x) ()
T (x) ()
f (x; )
Rn
Rn
(
)( (
)2
)
(
)2
f (x; )
T (x) () f (x; )dx
f (x; )dx = var (T )I(),
(7e6)
f (x; )
Rn
Rn
which is the claimed inequality.
Corollary 7e2. Under the assumptions of Theorem 7e1, the MSE risk of an unbiased
estimator T (X) satises
1
R(, T ) = var (T )
.
(7e7)
I()
Proof. Follows from (7e3) with () = .
Remark 7e3. The Cramer-Rao bound is valid under similar assumptions for the discrete
models, for which the p.d.f. in the denition of the Fisher information in replaced by p.m.f. For
deniteness, we shall state all the results in the continuous setting hereafter, but all of them
translate to the discrete setting as is.
The multivariate version of the C-R bound, i.e. when Rd , with d > 1, is derived along
the same lines: in this case, the Fisher information is a matrix and the inequality is understood
as comparison of nonnegative denite matrices.
Remark 7e4. The estimator, whose risk attains the C-R lower bound, is called ecient.
Hence the ecient unbiased estimator is the UMVUE. However, it is possible that UMVUE
does not attain the C-R bound (see Example 7e10 below) and hence is not necessarily ecient.
In fact, the Cauchy-Schwarz inequality in (7e6) saturates if and only if
(
)
T (x) () C() =
log f (x; )
for some function C(), independent of x. Integrating both parts, we see that equality is possible
if f (x; ) belongs to the exponential family with the canonical sucient statistic T (x) (the precise
details can be found in [16]).
The assumptions (7e1) and (7e2) are not as innocent as they might seem at the rst glance.
Here is a simpler sucient condition:
Lemma 7e5. Assume that the support of f (x; ) does not depend on and for some > 0,
|h(x)|
sup
(7e8)
f (x; u)dx < ,
Rn
u[,+]
with both h(x) 1 and h(x) = T (x). Then (7e1) and (7e2) hold.
The proof of this lemma relies on the classical results from real analysis, which are beyond
the scope of this course. Note, for instance, that the model corresponding to U ([0, ]), > 0
does not satisfy the above assumptions (as its support depends on ).
Before working out a number of examples, let us prove some useful properties of the Fisher
information.
132
7. POINT ESTIMATION
Lemma 7e6. Let X and Y be independent r.v. with the Fisher informations IX () and IY ()
respectively. Then the Fisher information contained in the vector (X, Y ) is the sum of individual
informations:
IX,Y () = IX () + IY ().
Proof. By independence, fX,Y (u, v; ) = fX (u; )fY (v; ) and
Recall that E
log fX (X; ) = 0 and hence, again, using the independence,
)2
(
(
)2
E
log fX (X; ) + 2E
log fX (X; )E
log fY (Y ; )+
(
)2
E
log fY (Y ; ) = IX () + IY ().
In particular,
Corollary 7e7. Let X = (X1 , ..., Xn ) be an i.i.d. sample from the p.d.f f (u; ) with the
Fisher information I(), then
IX () = nI().
Lemma 7e8. Assume that f (x; ) is twice dierentiable and
2
2
f
(x;
)dx
=
f (x; )dx,
2
2 Rn
Rn
then
2
I() = E 2 log f (X; ) =
f (x; )
Rn
(7e9)
2
log f (x; )dx.
2
Proof. Denote by f (x; ) and f (x; ) the rst and second partial derivatives of f (x; )
with respect to . Then
(
)2
f (x; )f (x; ) f (x; )
2
f (x; )
log f (x; ) =
=
.
2
f (x; )
f 2 (x; )
(X;)
The claim holds if E ff (X;)
= 0. Indeed, by (7e9)
f (X; )
2
E
=
f (x; )dx = 2
f (x; )dx = 0.
f (X; )
Rn
Rn
{z
}
|
=1
133
Example 7e9. Let X = (X1 , ..., Xn ) be a sample from N (, 2 ), where 2 is known. Lets
calculate the Fisher information for this model. By Corollary 7e7, IX () = nI(), with
I() = E
2
1
1
2
(X)2 /(2 2 )
log
e
=
E
(X )2 = 1/ 2 .
2
2 2 2
2 2
It is not hard to check the condition (7e8) in this case25 and hence for an unbiased estimator
T (X)
2
var (T )
.
n
n is an unbiased estimator of and var (X
n ) = 2 /n, the risk of X
n attains the C-R
Since X
n is the UMVUE.
bound, which conrms our previous conclusion that X
Example 7e10. Let X = (X1 , ..., Xn ) be a sample from Ber(), [0, 1]. The likelihood
for this model is the j.p.m.f. of X. The Fisher information of Ber() p.m.f. is
(
)2
(
(
))2
1
1
X1
1X1
I() = E
log (1 )
=
= E X1 (1 X1 )
1
1
1
1
1
1
=
.
+
(1 ) = +
2
2
(1 )
1
(1 )
Since Ber() random variable takes a nite number of values, the conditions analogous to (7e1)
and (7e2) (with integrals replaced by sums) obviously hold (for any T ) and hence the risk of all
unbiased estimators is lower bounded by n1 (1 ). But this is precisely the risk of the empirical
n , which is therefore UMVUE.
mean X
Here is a simple example, in which UMVUE does not attain the C-R bound:
Example 7e11. Let X Poi(), > 0. The Fisher information of Poi() is
(
)
2
2 (
X2 1
X )
I() = E 2 log e
= ,
= E 2 + X log = E
X!
theorem, (X)
is in fact the UMVUE of e . The corresponding risk is readily found:
(
)
= var I(X = 0) = e (1 e ).
var ()
Hence the best attainable variance is strictly greater than the Cramer-Rao lower bound (the
two curves are plotted on Figure 2). Note that this does not contradict Remark 7e4, since
T (X) = I(X = 0) is not the canonical sucient statistic of Poi().
25in fact, checking (7e8) or other alternative conditions can be quite involved in general and we shall not
dwell on this (important!) technicality. Note also that (7e8) involves T (X) as well.
134
7. POINT ESTIMATION
UMVUE risk vs. the CreamerRao lower bound
0.25
CR bound
UMVUE risk
0.2
0.15
0.1
0.05
10
The quantity I() is the information contained in the sample: if I() is small the C-R
bound is large, which means that high precision estimation in this model is impossible. It turns
out that the information, contained in the original sample, is preserved by sucient statistic:
Lemma 7e12. Consider a statistical model (P ) , given by a p.d.f. fX (x; ) and let IX ()
be the Fisher information contained in the sample X P . Let T (X) be a statistic with the
p.d.f. fT (t; ) and Fisher information IT (). Then
IX () IT (),
)
(
(7e10)
135
(
)
(
)
E
log fX (X; )h T (X) =
log fX (x; )h T (x) fX (x; )dx =
Rn
(
)
(
)
fX (x; )h T (x) dx =
fX (x; )h T (x) dx =
Rn
Rn
(
)
E h T (X) =
h(t)fT (t; )dt =
h(t) fT (t; )dt =
T (Rn )
T (Rn )
(
)
(
)
h(t)
log fT (t; ) fT (t; )dt = E
log fT (T (X); )h T (X) ,
T (Rn )
which, by the orthogonality property (3a3) of conditional expectation, yields (7e10). Further,
(
)2
(
)2
E
log fX (X; )
log fT (T (X); ) = E
log fX (X; )
(
)
(
)2
2E
log fX (X; ) log fT (T (X); ) + E
log fT (T (X); ) =
(
(
))
IX () + IT () 2E
log fT (T (X); )E
log fX (X; )T (X)
=
(
)2
IX () + IT () 2E
log fT (T (X); ) = IX () IT ().
log fT (T (X); ) =
log g(, T (X)) =
log g(, T (X))h(X) =
log fX (X; ).
+
,
IX+Y
IX
IY
where the equality is attained if and only if X and Y are Gaussian.
26we exchange derivative and integral fearlessly, assuming that the required conditions are satised
136
7. POINT ESTIMATION
Let X = (X1 , ..., Xn ) be an i.i.d. sample from the density f (x ), where R is the
+ s) = s + (X)
n
(
)
u
f (Xj u)du
(7f1)
Proof. Suppose that g(X) is an equivariant statistic, i.e. for all x Rn and c R
g(x + c) g(x) = c,
(7f2)
x Rd .
(7f3)
Conversely, any g, solving this functional equation, satises (7f2). Let S be the subset of
functions of the form x
+ (x x
), x Rn for some real valued (). Clearly any solution of
the above equation belongs to S. Conversely, a direct check shows that any function in S is a
solution of (7f3). Hence all the functions satisfying (7f3) (and thus also (7f2)) are of the form
g(x) = x
+ (x x
).
For such statistics
(
(
)
(
)
(
))2
(X X)
2 = E0 X
+ (X X)
2 E0 X
E0 X|X
X
,
R(, g(X)) = E X
(
)
= xx
X
, which veries the rst
where the equality is attained with (x) = E0 X|X
equality in (7f1).
and (Z1 , ..., Zn1 ) := (X1 Xn , ..., Xn1 Xn ) are equivalent
Further, note that X X
(
)
(Xn X),
i = 1, ..., n 1 and, conversely, Xi X
= 1 1 Zi
statistics: Zi = Xi X
n
1
Z
,
i
=
1,
...,
n
1.
Hence
k
k=i
n
(
)
(
)
E X|X
= Xn (Xn X)
E X|X
=
X
X
X
(
)
(
)
+ Xn X|X
n1
i=1
f (vi + vn )
137
and hence
n1
(
)
i=1 f (Xi Xn + v)dv
R vf (v)
E Xn |X1 Xn , ..., Xn1 Xn =
=
n1
i=1 f (Xi Xn + u)du
R f (u)
n
n1
f (Xi v)dv
v
i=1 f (Xi v)dv
R (Xn v)f (Xn v)
= Xn R ni=1
,
n1
i=1 f (Xi u)du
R
i=1 f (Xi u)du
R f (Xn u)
is X. For X1 U ([, + 1]), the Pitman estimator is (maxj Xj + minj Xj 1)/2 and for
X1 exp(1), the Pitman estimator is minj Xj 1/n.
g. Asymptotic theory of estimation
In many practical situations the amount of the available data can be as abundant as we
wish and it is reasonable to require that the estimator produces more precise guesses of the
parameter as the number of observations grows. The main object studied by the asymptotic
theory of estimation is a sequence of estimators and its goal is to compare dierent sequences of
estimators in the asymptotic regime, i.e. when the number of observations27 tends to innity.
For example, we already mentioned several reasons for preferring the estimator (7b1):
v
u n
u1
n )2 .
n (X) = t
(Xi X
n
i=1
1
n |.
|Xi X
n (X) =
n
n
i=1
In particular, after appropriate normalization (7b1) becomes the optimal unbiased estimator.
However, even the optimal estimator may perform quite poorly if we dont have enough data
(i.e. n is small, say 10 or so). Hence, the natural question is whether
n (X) is getting closer to
the actual value of , when n is large. The intuition, based on the law of large numbers, tells
us that this indeed will be the case. But, perhaps
n (X) is as good as
n (X), when n is large ?
How do we compare the two in the asymptotic regime ?
These and many more questions are in the heart of the asymptotic theory and it turns out
that the answers are often surprising and counterintuitive at the rst glance. In this section we
shall introduce the basic notions and explore some simple examples (which will hopefully induce
appetite for a deeper dive).
n is calculated for n =
Lets start with a simulation: a coin28 was tossed 500 times and X
1, 2, ..., 500. This was repeated three times independently (think of them as three dierent
n ), and Figure 3 depicts the values of X
n as a function of n, in the three
realizations of X
experiments (of 500 tosses each). The actual value of the heads probability is 0 = 2/3.
27other asymptotic regimes are possible: small noise asymptotic, etc.
28well, a digital coin ;)
138
7. POINT ESTIMATION
The empirical mean in several experiments versus n
1
0.95
0.9
0.85
0.8
0.75
0.7
0.65
0.6
0.55
50
100
150
200
250
300
number of samples
350
400
450
500
random variable Xn concentrates around the true value of the parameter. In particular,
n ) = 0,
lim var (X
= [0, 1],
(7g1)
139
Note however, that the convergence in this probabilistic setting is very dierent from the
convergence of deterministic sequences of numbers. Recall that a sequence of numbers an converges to a number a, called the limit of (an )n1 and denoted a := limn an , if for any > 0,
there exists an integer N (), possibly dependent on , so that |an a| for all n N ().
That is, the entries of the sequence an , starting from some N () are all in a small (2-wide)
interval around a.
Hence, for an arbitrarily small > 0, (7g1) implies that there is an integer N (), such
n ) (which is a number) is not further away from zero than . However, this does
that var (X
n itself must be within the distance of from : in fact, sometimes (i.e. for
not mean that X
some realizations) it wont! After all, a random variable with small variance is not prohibited
from generating large values30. Hence the convergence of random variables to a limit, either
random or deterministic, indicates that large deviations from the limits are improbable (though
possible!) for large ns.
On the technical level, the dierence between convergence of sequences of numbers and
random variables is even deeper. Convergence of sequences of real numbers, or more generally
of real vectors in Rd , does not depend on the norm we choose to measure the distance between
d
the vectors.
For
either
nexample, if2 we measure the distance between two vectors x, y R , by
d either
(x
y
)
or
d
(x,
y)
=
max
|x
y
|,
a
sequence
x
R
d2 (x, y) =
i
i
n
i=1,...,d i
i=1 i
converges or diverges in the two metrics simultaneously. One can show that for nite dimensional
linear spaces any two norms are equivalent in this sense.
The situation is radically dierent for sequences with entries in the innite dimensional
spaces, such as e.g. functions. Since random variables are functions on , many nonequivalent
modes of convergence emerge, some of which to be discussed below.
Convergence of sequences of random variables. How do we dene convergence for a
sequence of random variables (n )n1 on a probability space (, F, P)? To tackle this question, it
will be convenient to think of a realizations of (n )n1 as deterministic sequences. Then a natural
obvious way to dened convergence is to require that all the realizations of (n )n1 converge to
a limit. This limit itself may and will in general depend on the particular realization, i.e. will
be a random variable on its own:
Definition 7g1. A sequence (n ) converges pointwise to a random variable if
lim n () = (),
It turns out, however, that this denition is too strong to be satised in a majority of cases.
Here is a simple demonstration:
Example 7g2. Consider a sequence of i.i.d. random variables (Xn )n1 with Xn N (0, 1)
and let Yn = Xn /n. Does the sequence (Yn )n1 converge to a limit in the latter sense ? Obviously
not: for example, the sequence (1, 2, 3, +4, ...) is a legitimate realization of (Xn )n1 and the
corresponding realization of (Yn )n1 is the sequence (1, 1, 1, 1, ...), which does not converge.
Thus (Yn )n1 does not converge pointwise. On the other hand, it is intuitively clear that Xn /n
will typically be very small for large n and hence in a certain weaker sense must converge to 0.
30e.g. N (0, 0.001) can generate the value in [1000 : 1001] with small, but non-zero probability, which means
140
7. POINT ESTIMATION
Example 7g2 (continued) Lets check that the sequence (Yn )n1 does converge to Y 0 (viewed
as a constant random variable) in L2 . Indeed,
E(Yn Y )2 = E(Xn /n 0)2 =
1
n
EXn2 = 1/n2 0,
2
n
2
and hence, by denition Yn
0.
n
But what if Xn has the Cauchy distribution, rather than the Gaussian one ...? In this case,
E(Yn 0)2 = E(Xn /n)2 = for any n 1 and hence the convergence in L2 is lost. This means
that the L2 convergence is too strong to capture the intuition that Xn /n still goes to zero as
n : after all, Xn is sampled from a probability density which is centered at zero and hence
we expect that for large n, Xn /n will be typically small.
which implies:
(
)
lim P |Yn | = 0,
as claimed.
This example shows that convergence in probability is generally weaker than in L2 , i.e. a
sequence may converge in probability, but not in L2 . The converse, however, is impossible:
Lemma 7g5. Convergence in L2 implies convergence in probability.
31i.e. ( ) converges as a sequence in the space of functions (random variables) with nite p-norm
n
141
Proof. Suppose that (n )n1 converges to in L2 , then by the Chebyshev inequality32 for
any > 0
n
P(|n | > ) 2 E(n )2 0,
which veries the convergence in probability.
Of course, a sequence may not converge at all:
Example 7g6. Let n a sequence of i.i.d. r.v.s with the common p.d.f. f . Suppose that
n in probability. Then for any > 0
(
)
(
)
P |n n+1 | = P |n + n+1 |
(
)
(
) n
P |n | /2 + P | n+1 | /2 0.
On the other hand,
(
)
P |n n+1 | =
R2
R2
which is arbitrarily close to 1 for small > 0. The obtained contradiction shows that n does
not converge in probability.
Here are some useful facts about the convergence in probability:
P
142
7. POINT ESTIMATION
(1) n + n +
P
(3) n n
Proof. To check (1), let > 0, then
n
Remark 7g9. Be warned: various facts, familiar from convergence of real sequences, may or
may not hold for various types of convergence of random variables. For example, if n in L2 ,
then, depending on the function g, g(n ) may not converge to g() in L2 , even if g is continuous
4
(if e.g. n = Xn /n with i.i.d. N (0, 1) sequence (Xn )n1 , then for g(x) = ex , Eg 2 (n ) = and
hence the convergence in L2 fails.)
A completely dierent kind of convergence is the convergence in distribution (or weak convergence35):
Definition 7g10. A sequence of random variables (n )n1 converges in distribution to a
random variable , if
lim P(n x) = P( x),
n
143
x R.
for all n 1.
Before considering much more convincing examples of the weak convergence, lets explore
some of its useful properties.
P
(i) n
(ii) Eg(n ) Eg() for any bounded continuous function g
(iii) Eeitn Eeit for all38t R
36the convergence is allowed to fail at the points at which the target c.d.f. F is discontinuous, simply because
requiring otherwisemay yield an unreasonably strong notion of convergence. For example, by the law of large
n := 1 n Xi converges in probability to X 0 if Xi s are i.i.d. N (0, 1) r.v.s. However, the c.d.f.
numbers X
j=1
n
n fails to converge to the (degenerate) c.d.f of 0 at x = 0:
of X
n 0) = 1 = 1 = P(X 0).
P(X
2
n 0 is distribution, if X
n 0 in probability, as the latter should
It would be embarrassing not to say that X
be expected to be stronger. Hence by denition, the convergence in distribution excludes convergence at the
discontinuity point 0 from consideration.
37 ξ_n and ξ may even be defined on different probability spaces
38 φ_ξ(t) := Ee^{itξ}, where i is the imaginary unit, is called the characteristic function of the r.v. ξ. It resembles the definition of the moment generating function with t replaced by it. This seemingly minor detail is in fact major, since the characteristic function is always defined (unlike the m.g.f., which requires existence of all moments at least). Like the m.g.f., the characteristic function also determines the distribution (being its Fourier transform). If you're not familiar with complex analysis, just replace characteristic function in any appearance in the text with m.g.f., and pay the price of losing the generality.
Example 7g12 shows that (ξ_n) which converges weakly may not converge in probability. The converse, however, is true:

Lemma 7g16. Convergence in probability implies convergence in distribution.

The two modes of convergence can be combined as follows:

Lemma 7g17 (Slutsky). Suppose that ξ_n converges to ξ in distribution and η_n converges in probability to a constant^39,40 c. Then

(1) ξ_n + η_n → ξ + c in distribution
(2) ξ_n η_n → cξ in distribution
(3) ξ_n/η_n → ξ/c in distribution, if c ≠ 0

Proof. To prove (ξ_n, η_n) → (ξ, c) in distribution, we shall check that^41 for arbitrary bounded continuous functions φ : R → R and ψ : R → R,

Eφ(ξ_n)ψ(η_n) → Eφ(ξ)ψ(c),  n → ∞.

Indeed,

|Eφ(ξ_n)ψ(η_n) − ψ(c)Eφ(ξ)| ≤ |Eφ(ξ_n)ψ(η_n) − ψ(c)Eφ(ξ_n)| + |ψ(c)Eφ(ξ_n) − ψ(c)Eφ(ξ)|
  ≤ E|φ(ξ_n)| |ψ(η_n) − ψ(c)| + |ψ(c)| |Eφ(ξ_n) − Eφ(ξ)| → 0,  n → ∞,

where the first term on the right hand side converges to zero by Lemma 7g7 as η_n → c in probability (think why) and the second term converges to zero by Theorem 7g14. Now the claims (1)-(3) follow, since η_n → c in probability implies η_n → c in distribution (by Lemma 7g16) and (x, y) ↦ x + y, (x, y) ↦ xy are continuous functions on R^2 and (x, y) ↦ x/y is continuous on R^2 \ {(x, y) : y = 0}.
39 ξ_n → ξ in probability does not necessarily imply Eξ_n → Eξ. For example, this fails if ξ_n = X/n, where X is a Cauchy r.v. However, Lebesgue's Dominated Convergence theorem states that if there exists a r.v. η with Eη < ∞, such that |ξ_n| ≤ η for all n and ξ_n → ξ in probability, then ξ_n → ξ in L1 and Eξ_n → Eξ < ∞. In particular, the latter holds if |ξ_n| ≤ M for some constant M > 0.
40 convergence in distribution of random vectors is defined similarly, as convergence of c.d.f.s (think how the discontinuity sets may look like)
41 you may suspect that taking bounded continuous functions of the product form h(s, t) = φ(s)ψ(t) is not enough and we shall consider the whole class of bounded continuous functions h : R^2 → R. In fact, the former is sufficient, but of course requires a justification (which we omit).
Remark 7g18. Beware: the claims of the preceding lemma may fail, if η_n converges in probability to a non-degenerate r.v., rather than a constant (can you give an example?)

Here is another useful fact:

Lemma 7g19. Suppose that the ξ_n's are defined on the same probability space and ξ_n → c in distribution, where c is a constant. Then ξ_n → c in probability.

Proof. Since c is constant, we can consider c = 0 without loss of generality (think why). Let F_n be the c.d.f.s of ξ_n; then for ε > 0

P(|ξ_n| ≥ ε) ≤ F_n(−ε) + 1 − F_n(ε/2) → 0 as n → ∞,

since F_n(x) → F(x) = 1{x ≥ 0} for all x ∈ R \ {0}.
Limit theorems. One of the classical limit theorems in probability is the Law of Large Numbers (LLN), which can be stated with different types of convergence under appropriate assumptions. Here is a particularly primitive version:

Theorem 7g20 (an LLN). Let (X_n)_{n≥1} be a sequence of i.i.d. r.v. with E|X_1|^2 < ∞. Then the sequence X̄_n = (1/n) Σ_{i=1}^n X_i, n ≥ 1, converges to μ := EX_1 in L2 (and hence also in probability).

Proof. By the i.i.d. property:

E( (1/n) Σ_{i=1}^n X_i − μ )^2 = (1/n^2) Σ_{i=1}^n E(X_i − μ)^2 = var(X_1)/n → 0 as n → ∞.
The latter, perhaps, is the most naive version of the LLN: it makes a number of strong assumptions, among which the most restrictive is boundedness of the second moment. The more classical version of the LLN is

Theorem 7g21 (The weak^42 LLN). Let (X_n)_{n≥1} be a sequence of i.i.d. r.v. with E|X_1| < ∞. Then X̄_n converges to μ := EX_1 in probability.

Proof. Let φ(t) := Ee^{itX_1}; since E|X_1| < ∞, φ is continuously differentiable with φ'(0) = iEX_1 = iμ. By the i.i.d. property Ee^{itX̄_n} = (φ(t/n))^n and, by the Taylor expansion,

φ(t/n) = φ(0) + φ'(t_n) t/n = 1 + iμ t/n + r_n,

where |t_n| ≤ |t|/n and

|n r_n| = |t| |φ'(t_n) − φ'(0)| → 0 as n → ∞,

by continuity of φ'. Hence

Ee^{itX̄_n} = (1 + iμt/n + r_n)^n → e^{iμt},  t ∈ R,

and the claim follows by Theorem 7g14.

42 in fact, under the assumptions of the weak LLN, one can check that the empirical mean converges to EX_1 in a much stronger sense, namely with probability one. This stronger result is called the strong LLN.
which means that √n(X̄_n − μ) does not converge to zero in L2 any more (and, on the other hand, has a variance bounded uniformly in n). Remarkably, much more can be claimed:

Theorem 7g23 (Central Limit Theorem). Let (X_n)_{n≥1} be a sequence of i.i.d. r.v.s with μ := EX_1 and σ^2 := var(X_1) < ∞. Then √n(X̄_n − μ)/σ converges weakly to a standard Gaussian random variable.

Proof^43. Set Y_i := (X_i − μ)/σ, let φ(t) = Ee^{itY_1} and φ_n(t) = E exp( it (1/√n) Σ_{i=1}^n Y_i ). By the i.i.d. property φ_n(t) = (φ(t/√n))^n and, by the Taylor expansion,

φ(t/√n) = φ(0) + φ'(0) t/√n + φ''(0) t^2/(2n) + (φ''(t_n) − φ''(0)) t^2/(2n),

where |t_n| ≤ |t|/√n, φ'(0) = iEY_1 = 0 and φ''(0) = −EY_1^2 = −1. By continuity of φ'', it follows that

lim_{n→∞} φ_n(t) = e^{−t^2/2},  t ∈ R,

which is the characteristic function of the standard Gaussian distribution, and the claim follows by Theorem 7g14.

43 the easiest way is by means of characteristic functions (note that m.g.f.s are useless in this case)
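As a quick empirical check of the CLT (my own sketch, assuming Exp(1) samples so that μ = σ = 1), one can compare the empirical c.d.f. of √n(X̄_n − μ)/σ with Φ:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, reps = 200, 50_000
X = rng.exponential(1.0, size=(reps, n))          # Exp(1): mu = 1, sigma = 1
Z = np.sqrt(n) * (X.mean(axis=1) - 1.0) / 1.0     # standardized empirical means

for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(f"x={x:+.1f}  empirical P(Z<=x)={np.mean(Z <= x):.4f}  Phi(x)={norm.cdf(x):.4f}")
```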
In words, this definition means that large estimation errors are getting less probable as the sampling size increases. More quantitative information on the convergence is provided by the following notion:

Definition 7g25. A consistent sequence of estimators (θ̂_n) is said to have an asymptotic (error) distribution F, if there exists a sequence of positive numbers r_n, increasing to infinity, such that the scaled estimation error converges to F in distribution: for any θ ∈ Θ,

lim_{n→∞} P_θ( r_n(θ̂_n − θ) ≤ x ) = F(x; θ)

at all continuity points x of F(·; θ). For example, if θ = E_θ X_1 is estimated by the empirical mean θ̂_n = X̄_n and σ^2 := var_θ(X_1) < ∞, then by the CLT

√n(θ̂_n − θ) → N(0, σ^2) in distribution,  θ ∈ Θ.

This means that the sequence of estimators (θ̂_n) is consistent and asymptotically normal with rate √n and variance σ^2. Remarkably, this result holds without any further assumptions on the c.d.f. of X_1, other than finiteness of the second moment.

Though the rate √n and the asymptotic normality frequently emerge in statistical models, do not think that this is always the case:
Example 7b3 (continued) Let us revisit the problem of estimating θ from the sample X_1, ..., X_n, where X_1 ~ U([0, θ]), θ > 0. We considered the MLE θ̂_n(X) = max_i X_i = M_n and the estimator θ̃_n(X) = 2X̄_n and calculated the corresponding risks in (7b4) and (7b5). As we saw, the risks are comparable for any n ≥ 1 and the estimator θ̂_n(X) has smaller risk than the estimator θ̃_n(X) (and hence the latter is inadmissible).

Since everything is explicitly computable in this example, there is no need to appeal to the asymptotic theory. Still it will be instructive to analyze this example asymptotically. First of all, note that both sequences of estimators are consistent, since R(θ, θ̂_n) = E_θ(θ − θ̂_n(X))^2 → 0 and R(θ, θ̃_n) = E_θ(θ − θ̃_n(X))^2 → 0 as n → ∞.

Now let's find the corresponding asymptotic error distributions. By the CLT, √n(2X̄_n − θ) converges in distribution to N(0, θ^2/3), i.e. (θ̃_n) is asymptotically normal with rate √n and variance θ^2/3. What about the MLEs θ̂_n?
Note that X_1 − θ has uniform distribution on [−θ, 0] and hence M_n − θ has the c.d.f.

F_n(x) = 0 for x < −θ,  F_n(x) = (x/θ + 1)^n for x ∈ [−θ, 0),  F_n(x) = 1 for x ≥ 0.

Note that

lim_{n→∞} F_n(x) = 1{x ≥ 0},  x ∈ R,

which means that M_n − θ → 0 in distribution under P_θ and hence, by Lemma 7g19, M_n − θ → 0 in probability. This verifies consistency of (θ̂_n).
Further, let (r_n) be an increasing sequence of positive numbers; then for x ∈ R,

P_θ( r_n(M_n − θ) ≤ x ) = P_θ( M_n − θ ≤ x/r_n ) = F_n(x/r_n) =
  0 for x < −r_n θ,  (1 + x/(r_n θ))^n for x ∈ [−r_n θ, 0),  1 for x ≥ 0.

Choosing r_n := n, we get for every x < 0

lim_{n→∞} P_θ( n(M_n − θ) ≤ x ) = lim_{n→∞} (1 + x/(nθ))^n = e^{x/θ},

while the limit equals 1 for x ≥ 0. This is readily recognized as the c.d.f. of −ζ, where ζ is an exponential r.v. with mean θ. Hence the sequence (θ̂_n) is consistent with rate n, which is much faster than the consistency rate √n offered by the sequence of estimators (θ̃_n): roughly speaking, the accuracy attained by θ̂_n on a sample of size n is attained by θ̃_n only on a sample whose size is of order n^2.
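The two rates are easy to see in simulation; here is a minimal sketch (mine), with the values of θ and n chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 2.0, 500, 20_000
X = rng.uniform(0.0, theta, size=(reps, n))

mle_err = n * (X.max(axis=1) - theta)                  # n(M_n - theta): approx. -Exp(mean=theta)
mom_err = np.sqrt(n) * (2 * X.mean(axis=1) - theta)    # sqrt(n)(2*Xbar - theta): approx. N(0, theta^2/3)

print("mean of n(M_n - theta)        :", mle_err.mean(), " (theory: -theta =", -theta, ")")
print("var  of sqrt(n)(2*Xbar - theta):", mom_err.var(), " (theory: theta^2/3 =", theta**2 / 3, ")")
```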
(7g9)

The first term on the right hand side of (7g9) converges weakly to the Gaussian r.v. with zero mean and variance (g'(θ))^2 V(θ) by (2) of Slutsky's Lemma 7g17. Similarly, the second term converges to 0 in probability by continuity of the derivative g', and the claim follows from Slutsky's Lemma 7g17.
The following example demonstrates a number of approaches to construction of confidence intervals (recall Section 5, page 65), based on the limit theorems and the Delta method.

Example 7g29. Let X_1, ..., X_n be a sample from the distribution Ber(θ), where θ ∈ Θ = (0, 1) is the unknown parameter. We would like to construct a confidence interval (shortly c.i.) with the given confidence level 1 − α (e.g. 1 − α = 0.95), i.e. find an interval of the form I(X) := [a_−(X), a_+(X)], such that

P_θ( θ ∈ [a_−(X), a_+(X)] ) ≥ 1 − α,  ∀θ ∈ Θ.    (7g10)

The exact solution of this problem is computationally demanding, though possible (think how). When n is large, the limit theorems can be used to suggest approximate solutions in a number of ways.
The pivot method

Consider the sequence of random variables:

Y_n := √n (X̄_n − θ) / √( X̄_n(1 − X̄_n) + 1{X̄_n ∈ {0,1}} ).

By the law of large numbers lim_n X̄_n(1 − X̄_n) = θ(1 − θ) and lim_n 1{X̄_n ∈ {0,1}} = 0 in P_θ-probability, and hence, by the CLT and Slutsky's lemma, Y_n → N(0, 1) in distribution under P_θ. The weak limit of the pivot random variables (Y_n) suggests the confidence interval:

a_−(X) = X̄_n − z_{1−α/2} n^{−1/2} √( X̄_n(1 − X̄_n) + 1{X̄_n ∈ {0,1}} )
a_+(X) = X̄_n + z_{1−α/2} n^{−1/2} √( X̄_n(1 − X̄_n) + 1{X̄_n ∈ {0,1}} )    (7g11)

where z_{1−α/2} is the (1 − α/2)-quantile of the standard Gaussian distribution, i.e. Φ(z_{1−α/2}) = 1 − α/2, and hence

P_θ( θ ∈ [a_−(X), a_+(X)] ) = P_θ( Y_n ∈ [−z_{1−α/2}, z_{1−α/2}] ) ≈ 1 − 2( 1 − Φ(z_{1−α/2}) ) = 1 − α,
as required in (7g10), where the approximate equality replaces the convergence if n is large enough^44. Note that smaller α's yield larger z_{1−α/2}, i.e. wider c.i.s correspond to more stringent requirements. On the other hand, for a fixed α and large n, we get narrower c.i.s, as should be expected.
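A minimal sketch of the pivot interval (7g11) in code (my own illustration; the indicator term only guards against division by zero when X̄_n ∈ {0, 1}):

```python
import numpy as np
from scipy.stats import norm

def pivot_ci(x, alpha=0.05):
    """Approximate (1-alpha) confidence interval (7g11) for a Bernoulli sample x."""
    n, xbar = len(x), x.mean()
    guard = 1.0 if xbar in (0.0, 1.0) else 0.0    # the indicator 1{Xbar in {0,1}}
    half = norm.ppf(1 - alpha / 2) * np.sqrt(xbar * (1 - xbar) + guard) / np.sqrt(n)
    return xbar - half, xbar + half

rng = np.random.default_rng(3)
print(pivot_ci(rng.binomial(1, 0.3, size=200)))
```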
Variance stabilizing transformation

Alternatively, the confidence interval can be constructed by means of the variance stabilizing transformation as follows. Let g : (0, 1) → R be a continuously differentiable function; then applying the Delta method of Lemma 7g28 we have

√n ( g(X̄_n) − g(θ) ) → N( 0, (g'(θ))^2 θ(1 − θ) ) in distribution,  θ ∈ Θ.

If g is chosen so that g'(θ) = 1/√(θ(1 − θ)), the limit distribution becomes N(0, 1), independently of θ. If in addition, g is increasing and invertible on (0, 1), the following c.i. can be suggested:

I(X) := [ g^{−1}( g(X̄_n) − z_{1−α/2}/√n ), g^{−1}( g(X̄_n) + z_{1−α/2}/√n ) ].    (7g12)
Indeed,

P_θ( θ ∈ I(X) ) = P_θ( g(θ) ∈ g(I(X)) ) = P_θ( √n |g(X̄_n) − g(θ)| ≤ z_{1−α/2} ) ≈ 1 − α.

The function

g(θ) = ∫_0^θ ds/√(s(1 − s))

does the job, since the integral is well defined. Clearly, it is continuously differentiable on (0, 1), increasing and invertible. A calculation yields g(θ) = π/2 + arcsin(2θ − 1). Since we required only a particular form of g', we may omit the additive constant and just take g(θ) := arcsin(2θ − 1).
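A sketch of the resulting arcsine interval (7g12) (again my own illustration, using g(t) = arcsin(2t − 1) and g^{−1}(y) = (sin y + 1)/2):

```python
import numpy as np
from scipy.stats import norm

def arcsine_ci(x, alpha=0.05):
    """Approximate (1-alpha) c.i. (7g12) via the variance stabilizing transformation."""
    n, xbar = len(x), x.mean()
    z = norm.ppf(1 - alpha / 2)
    lo = np.arcsin(2 * xbar - 1) - z / np.sqrt(n)
    hi = np.arcsin(2 * xbar - 1) + z / np.sqrt(n)
    # clip to the range of g before inverting
    lo, hi = np.clip([lo, hi], -np.pi / 2, np.pi / 2)
    return (np.sin(lo) + 1) / 2, (np.sin(hi) + 1) / 2

rng = np.random.default_rng(4)
print(arcsine_ci(rng.binomial(1, 0.3, size=200)))
```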
Wilson's method

As we saw above, by the CLT

P_θ( √n |X̄_n − θ| / √(θ(1 − θ)) ≤ z_{1−α/2} ) ≈ 1 − α.
44 How large should n be to firmly support such a belief...? Clearly, to answer this question we must know more about the convergence in the CLT (e.g. the rate of convergence, etc.). Applied statisticians usually use various rules of thumb, which often can be justified by a considerable mathematical effort (see Example 8a4 below)
Note that

{ √n |X̄_n − θ| / √(θ(1 − θ)) ≤ z_{1−α/2} } = { n(X̄_n − θ)^2 ≤ z_{1−α/2}^2 θ(1 − θ) }
  = { (X̄_n − θ)^2 − n^{−1} z_{1−α/2}^2 θ(1 − θ) ≤ 0 }
  = { θ^2 (1 + z_{1−α/2}^2/n) − θ (2X̄_n + z_{1−α/2}^2/n) + X̄_n^2 ≤ 0 } = { a_−(X) ≤ θ ≤ a_+(X) },

where

a_±(X) := [ X̄_n + z_{1−α/2}^2/(2n) ± z_{1−α/2} √( X̄_n(1 − X̄_n)/n + z_{1−α/2}^2/(4n^2) ) ] / ( 1 + z_{1−α/2}^2/n ),

and hence the confidence interval I(X) = [a_−(X), a_+(X)] has the coverage probability close to 1 − α when n is large.
The asymptotic risk of estimators is often easier to compute than their exact risk for a fixed sample size. In this regard, the Delta method is the main tool:

Example 7g30. Let X_1, ..., X_n be a sample from Ber(θ) with θ ∈ Θ = (0, 1) and consider the MLE θ̂_n(X) = X̄_n. Since E_θ|X_1| = θ < ∞, by the weak LLN the sequence θ̂_n(X) converges to θ in P_θ-probability for any θ ∈ Θ. Hence (θ̂_n(X)) is a consistent sequence of estimators. Furthermore, as var_θ(X_1) = θ(1 − θ) < ∞, by the CLT

lim_{n→∞} P_θ( √n (X̄_n − θ)/√(θ(1 − θ)) ≤ x ) = Φ(x),  x ∈ R,

or, equivalently,

lim_{n→∞} P_θ( √n (X̄_n − θ) ≤ x ) = Φ( x/√(θ(1 − θ)) ),  x ∈ R,

where Φ(x) is the standard Gaussian c.d.f. Hence the sequence θ̂_n(X) = X̄_n is asymptotically normal with the limit variance θ(1 − θ).
Now consider another sequence of estimators:

θ̃_n(X) = √( (1/[n/2]) Σ_{i=1}^{[n/2]} X_{2i−1} X_{2i} ).

The r.v.s Y_i := X_{2i−1} X_{2i}, i = 1, ..., [n/2], are i.i.d. Ber(θ^2) and hence again by the weak LLN

(1/[n/2]) Σ_{i=1}^{[n/2]} X_{2i−1} X_{2i} → θ^2 in P_θ-probability,

and since u ↦ √u is a continuous function, θ̃_n(X) → √(θ^2) = θ in P_θ-probability, i.e. the sequence (θ̃_n) is also consistent.

Note that by the CLT,

√([n/2]) ( θ̃_n^2(X) − θ^2 ) → N( 0, θ^2(1 − θ^2) ) in distribution, as n → ∞.
Figure 4. Asymptotic variances of θ̂_n and θ̃_n versus θ.
Now we can apply the Delta method with g(u) = √u to the sequence θ̃_n^2(X):

√([n/2]) ( g(θ̃_n^2(X)) − g(θ^2) ) = √([n/2]) ( θ̃_n(X) − θ ) → N( 0, (g'(θ^2))^2 θ^2(1 − θ^2) ) = N( 0, (1 − θ^2)/4 ) in distribution.

Finally, since [n/2]/n → 1/2, we conclude (why?)

√n ( θ̃_n(X) − θ ) → N( 0, (1 − θ^2)/2 ) in distribution.
In this example, one can calculate the risks of θ̂_n and θ̃_n for each fixed n and check whether they are comparable. However, computing the risk of θ̃_n does not appear to be an easy calculation. Hence considering the estimators in the asymptotic regime, i.e. through the asymptotic risks, often leads to a more tractable problem.

In this example, the competing sequences of estimators have the same rate and, moreover, are both asymptotically normal; hence we can try to compare them by the asymptotic variance. Indeed, we see that the asymptotic variance of θ̂_n is uniformly smaller than that of θ̃_n (see Figure 4). Hence we conclude that θ̃_n is asymptotically inferior to θ̂_n. Of course, just as in the non-asymptotic case, two sequences of estimators may not be comparable by their asymptotic variances.
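The asymptotic variance comparison is easy to verify numerically; the following sketch (mine, with arbitrary θ and n) estimates n·var of both estimators by simulation:

```python
import numpy as np

rng = np.random.default_rng(6)
theta, n, reps = 0.3, 2000, 20_000
X = rng.binomial(1, theta, size=(reps, n))

mle = X.mean(axis=1)                              # theta_hat = Xbar
prod = (X[:, 0::2] * X[:, 1::2]).mean(axis=1)     # mean of X_{2i-1} X_{2i}, i = 1..n/2
tilde = np.sqrt(prod)                             # theta_tilde

print("n*var(theta_hat)  :", n * mle.var(),   " theory:", theta * (1 - theta))
print("n*var(theta_tilde):", n * tilde.var(), " theory:", (1 - theta**2) / 2)
```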
Two sequences of estimators do not have to be comparable, even if they have the same rate and the same limit type of distribution (e.g. asymptotically normal): it is possible that the asymptotic variances, being functions of θ, satisfy opposite inequalities on different regions of Θ
(similarly to Example 7b10). Note that a sequence of consistent estimators (θ̂_n), which also converges in L1, must be asymptotically unbiased^45:

lim_{n→∞} E_θ θ̂_n = θ,  θ ∈ Θ.

Thus there still may exist a sequence of estimators (θ̂_n), which attains the Cramer-Rao information bound for unbiased estimators asymptotically, as n → ∞:

lim_{n→∞} n E_θ(θ̂_n − θ)^2 = 1/I(θ),  θ ∈ Θ.    (7g13)

Guided by the intuition from the finite sample setting^46, we may think that no sequence of estimators can yield asymptotic risk smaller than this bound and hence regard such (θ̂_n) as asymptotically efficient (optimal). The following example shows that asymptotic optimality is a more delicate matter!
Example 7g31 (J. Hodges). Consider an i.i.d. sample X_1, X_2, ... from the N(θ, 1) distribution, where θ ∈ R is the unknown parameter. As we already saw, the empirical mean X̄_n is the estimator which has various optimality properties: it is the UMVUE, attaining the Cramer-Rao bound for unbiased estimators, as well as the minimax estimator for each fixed n ≥ 1. Hence it is tempting to think that X̄_n is asymptotically optimal, in the sense that if (θ̂_n) is a sequence of asymptotically unbiased estimators, then

lim_{n→∞} n E_θ(θ̂_n − θ)^2 ≥ lim_{n→∞} n E_θ(X̄_n − θ)^2 = 1,  θ ∈ R.    (7g14)
Consider, however, the Hodges estimator θ̂_n(X) := X̄_n 1{|X̄_n| ≥ n^{−1/4}}, i.e. X̄_n truncated to zero whenever |X̄_n| < n^{−1/4}. Then

√n(θ̂_n − θ) = √n(X̄_n − θ) 1{|X̄_n| ≥ n^{−1/4}} − √n θ 1{|X̄_n| < n^{−1/4}}
  = √n(X̄_n − θ) 1{|√n(X̄_n − θ) + √n θ| ≥ n^{1/4}} − √n θ 1{|√n(X̄_n − θ) + √n θ| < n^{1/4}}
  =(d) Z 1{|Z + √n θ| ≥ n^{1/4}} − √n θ 1{|Z + √n θ| < n^{1/4}},

where =(d) stands for equality in distribution and Z ~ N(0, 1). Hence for θ = 0,

E_θ( √n(θ̂_n − θ) )^2 = E Z^2 1{|Z| ≥ n^{1/4}} ≤ √(E Z^4) √( P(|Z| ≥ n^{1/4}) ) → 0 as n → ∞,

where we used the Cauchy-Schwarz inequality. For θ ≠ 0,
E_θ( √n(θ̂_n − θ) )^2 = E( Z 1{|Z + √n θ| ≥ n^{1/4}} − √n θ 1{|Z + √n θ| < n^{1/4}} )^2
  = E( Z − (Z + √n θ) 1{|Z + √n θ| < n^{1/4}} )^2 = 1 + r_n,
45 sometimes, this is referred to as approximately unbiased, while the term asymptotically unbiased is reserved for a slightly different notion
46 just like in the UMVUE theory: not all unbiased estimators are comparable, while the optimal unbiased estimator can be found through R-B if the minimal sufficient statistic is complete!
where

|r_n| = | E( −2Z(Z + √n θ) + (Z + √n θ)^2 ) 1{|Z + √n θ| < n^{1/4}} |
  ≤ √( E( 2Z(Z + √n θ) − (Z + √n θ)^2 )^2 ) · √( P(|Z + √n θ| < n^{1/4}) ),

by the Cauchy-Schwarz inequality. The first factor grows at most polynomially in n, while for θ > 0 and all n large enough

P( |Z + √n θ| < n^{1/4} ) ≤ P( Z < n^{1/4} − √n θ ) ≤ P( Z < −(√n/2) θ ) = Φ( −(√n/2) θ ),

where Φ(x) is the c.d.f. of the N(0, 1) distribution, which for x < −1 satisfies

Φ(x) = ∫_{−∞}^x (1/√(2π)) e^{−y^2/2} dy ≤ ∫_{−∞}^x (−y) e^{−y^2/2} dy = e^{−x^2/2}.

Plugging these bounds back, we see that r_n → 0 for θ > 0 and, similarly, for θ < 0. To recap,

lim_{n→∞} n E_θ( θ̂_n(X) − θ )^2 = 1 for θ ≠ 0, and = 0 for θ = 0,    (7g15)

which shows that (7g14) fails for the sequence (θ̂_n) at the point θ = 0. Hodges' estimator is superefficient, e.g., in the sense that its asymptotic variance is better than the asymptotic Cramer-Rao bound for unbiased estimators!
This is of course no paradox: Hodges' estimator is biased for each fixed n and hence doesn't have to satisfy the Cramer-Rao bound for unbiased estimators. The relevant Cramer-Rao bound for θ̂_n is

R(θ, θ̂_n) = var_θ(θ̂_n) + b^2(θ, θ̂_n) ≥ ( ∂_θ b(θ, θ̂_n) + 1 )^2 / I_n(θ) + b^2(θ, θ̂_n) =: CR_n(θ),

where b(θ, θ̂_n) = E_θ θ̂_n − θ is the bias and I_n(θ) = n is the Fisher information in the sample. A calculation reveals that n b^2(θ, θ̂_n) → 0 for any θ, but lim_{n→∞} ∂_θ b(θ, θ̂_n)|_{θ=0} = −1, and hence

n CR_n(θ)|_{θ=0} → 0.
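A short simulation sketch (mine) of the scaled risk n E_θ(θ̂_n − θ)^2 of the Hodges estimator, illustrating (7g15):

```python
import numpy as np

rng = np.random.default_rng(7)

def hodges_scaled_risk(theta, n, reps=200_000):
    xbar = theta + rng.standard_normal(reps) / np.sqrt(n)     # Xbar ~ N(theta, 1/n)
    est = np.where(np.abs(xbar) >= n**-0.25, xbar, 0.0)       # Hodges estimator
    return n * np.mean((est - theta)**2)

for n in [100, 1000, 10000]:
    print(f"n={n:6d}  theta=0: {hodges_scaled_risk(0.0, n):.4f}   theta=1: {hodges_scaled_risk(1.0, n):.4f}")
```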
This example shows that a sequence of consistent and asymptotically normal estimators, satisfying (7g13), can be outperformed. Hodges' example can be modified so that the set of points in Θ, at which the estimator is superefficient, is infinitely countable. Remarkably, a theorem of L. Le Cam shows that the set of such points cannot be essentially larger: more precisely, it has zero Lebesgue measure (length). This allows to define the following local minimax notion of asymptotic efficiency, based on the locally uniform scaled risk

lim_{δ→0} lim sup_{n→∞} sup_{|θ−θ_0|≤δ} n E_θ(θ̂_n − θ)^2,  θ_0 ∈ Θ.

Note that in this sense, Hodges' estimator is not better than X̄_n anymore, since n E_θ(X̄_n − θ)^2 = 1 for all n and θ ∈ R, while

lim_{δ→0} lim sup_{n→∞} sup_{|θ−θ_0|≤δ} n E_θ(θ̂_n − θ)^2 ≥ 1.

However, X̄_n is no longer minimax on small bounded intervals: for example, for the trivial estimator θ* ≡ 0,

sup_{θ∈[−δ,δ]} E_θ(θ* − θ)^2 = δ^2,

which for small δ is smaller than the maximal risk 1/n of X̄_n on [−δ, δ]. For the Hodges estimator the local behaviour is much worse: taking points θ_n ∈ [−δ, δ] just below the truncation threshold n^{−1/4},

sup_{|θ|≤δ} n E_θ(θ̂_n − θ)^2 ≥ n E_{θ_n}(θ̂_n − θ_n)^2 → ∞ as n → ∞,

and we conclude that

lim_{δ→0} lim sup_{n→∞} sup_{|θ−θ_0|≤δ} n E_θ(θ̂_n − θ)^2 = ∞ at θ_0 = 0.

Roughly speaking, this means that Hodges' estimator may perform very poorly at the points close to the superefficiency point for large n.
Finding asymptotically efficient estimators is an interesting problem, which is beyond the scope of our course. Remarkably, the sequence of the MLEs often turns out to be consistent, asymptotically normal and asymptotically efficient:

Theorem 7g34.^47 Let X_1, ..., X_n be a sample from the probability density^48 f(x; θ), θ ∈ Θ, and let (θ̂_n) be the sequence of MLE estimators:

θ̂_n(X) ∈ argmax_{θ∈Θ} L(θ; X).

47 adapted from [2]
48 the theorem holds also for the discrete case with appropriate adjustments
∫_R (∂/∂θ) f(x; θ) dx = 0,  θ ∈ Θ.

Consider, for example, a sample from the Cauchy density with location parameter θ ∈ R,

f(x; θ) = (1/π) · 1/(1 + (x − θ)^2).

The MLE θ̂_n cannot be found in an explicit form beyond n ≥ 2 and hence the maximization of the likelihood is to be done numerically^51. The assumptions (Ac) and (A0) obviously hold. The function f(x; θ) is continuously differentiable in θ and

I(θ) = ∫_R ( 2(x − θ)/(1 + (x − θ)^2) )^2 · (1/π) · 1/(1 + (x − θ)^2) dx = (1/π) ∫_R 4(x − θ)^2/(1 + (x − θ)^2)^3 dx < ∞
49 in fact, an even stronger result holds under these assumptions, namely √n(θ̂_n − θ)/a_n → 0 in P_θ-probability for any unbounded sequence a_n
51 e.g. by solving the likelihood equation ∂_θ log L(X; θ) = 0 numerically. However, the number of roots of the score function grows with n and hence deciding which root is the MLE may be a challenge (remarkably, the set of roots converges to a Poisson process on R after an appropriate rescaling, see the paper [10])
is positive and bounded for any θ ∈ R. Hence by Theorem 7g34, the sequence of MLEs is consistent for θ. By checking further assumptions, we may be lucky to be able to conclude that (θ̂_n) is also asymptotically normal (try!).
The mathematical tools for proving, or even sketching, the ideas leading to this remarkable conclusion are far beyond the scope of our course, but we shall briefly demonstrate what kind of difficulties arise in establishing e.g. consistency of the MLE.

The asymptotic analysis is usually straightforward if the MLE is given by an explicit formula (as in Examples 7b3 and 7g30) or as the unique root of some equation (see Problem 7.57). However, in the majority of practical situations, the MLE is found numerically for the concrete sample at hand and its asymptotic analysis should be based directly on its definition as a maximizer. A number of different approaches have been developed for this purpose, and we shall sketch one of them below.
Suppose that we sample X = (X_1, ..., X_n) from a p.d.f. f(x; θ). Then the corresponding log-likelihood is given by

log L_n(x; θ) = Σ_{i=1}^n log f(x_i; θ),  x ∈ R^n,

and, if θ_0 denotes the true value of the parameter, the law of large numbers gives for each fixed θ

(1/n) Σ_{i=1}^n log( f(X_i; θ)/f(X_i; θ_0) ) → E_{θ_0} log( f(X_1; θ)/f(X_1; θ_0) ) = ∫_R log( f(x; θ)/f(x; θ_0) ) f(x; θ_0) dx =: H(θ, θ_0),  as n → ∞.

The quantity −H(θ, θ_0) is called the Kullback-Leibler relative entropy (or divergence) and it is not hard to see^52 that it is nonnegative and has properties similar^53 to those of a distance between the p.d.f.s f(x; θ) and f(x; θ_0). If the statistical model is identifiable, then θ ↦ H(θ, θ_0) has a unique maximum at θ_0, i.e. H(θ, θ_0) < H(θ_0, θ_0) = 0 for all θ ≠ θ_0.
Thus we have a sequence of (random) functions

H_n(θ; θ_0) := (1/n) log( L_n(X; θ)/L_n(X; θ_0) )

converging to H(θ, θ_0), which has a unique maximum at θ = θ_0. It is tempting to conclude that for each fixed θ_0, the maximizer of H_n(θ; θ_0) over Θ converges to the maximizer of H(θ, θ_0), that is θ̂_n → θ_0, and thus θ̂_n is consistent. Unfortunately, this does not have to be the case^54 and for the latter to hold, a stronger - uniform over Θ - convergence is required in general. Establishing such convergence is harder and can be done under appropriate assumptions. Once consistency is established, one can study the asymptotic normality, essentially using Taylor's expansion.

52 using the Jensen inequality
53 the K-L divergence is not a true distance, since it is not symmetric
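To connect this discussion with practice, here is a minimal sketch (mine, not from the notes) of computing the Cauchy location MLE numerically by maximizing the log-likelihood, exactly the situation in which the maximizer-based consistency argument above is needed; the bounded search interval around the sample median is an arbitrary choice:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(8)
theta0 = 1.5
x = theta0 + rng.standard_cauchy(500)     # sample from the Cauchy location model

def neg_loglik(theta):
    # minus log-likelihood, up to the constant n*log(pi)
    return np.sum(np.log(1.0 + (x - theta)**2))

med = np.median(x)
res = minimize_scalar(neg_loglik, bounds=(med - 10, med + 10), method="bounded")
print("MLE:", res.x, "  sample median:", med)
```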
Exercises
Methods of point estimation.
Problem 7.1. Consider a population made up of three different types of individuals occurring in the Hardy-Weinberg proportions θ^2, 2θ(1 − θ) and (1 − θ)^2, respectively, where 0 < θ < 1.
(1) Show that T3 = N1 /n + N2 /2n is a frequency substitution estimator of
(2) Using the estimator of (1), what is a frequency substitution estimator of the odds ratio
/(1 )?
(3) Suppose X takes the values 1, 0, 1 with respective probabilities p1 , p2 , p3 given by the
Hardy-Weinberg proportions. By considering the rst moment of X, show that T3 is a
method of moment estimator of .
(4) Find the MLE 0f and compare to the other estimators, obtained so far.
Problem 7.2. Consider n systems with failure times X1 , ..., Xn assumed to be independent
and identically distributed with exponential exp() distributions.
(1) Find the method of moments estimator of based on the rst moment.
(2) Find the method of moments estimator of based on the second moment.
(3) Combine your answers to (1) and (2) to get a method of moment estimator of based
on the rst two moments.
(4) Find the method of moments estimator of the probability P (X1 > 1) that one system
will last at least a month.
(5) Find the MLE of
Problem 7.3. Let X1 , ..., Xn be the indicators of n Bernoulli trials with probability of
success 0.
n is a method of moments estimator of .
(1) Show that X
(2) Exhibit method of moments estimators for var (X1 ) = (1 ) rst using only the rst
moment and then using only the second moment of the population. Show that these
estimators coincide.
(3) Argue that in this case all frequency substitution estimators of q() must agree with
n ).
q(X
(4) Find the MLE of
Problem 7.4. Suppose X = (X1 , ..., Xn ) where the Xi are independent N (0, 2 )
(1) Find an estimator of 2 based on the second moment.
54 think of a counterexample: a sequence of (even deterministic) functions f_n(x) → f(x) for all x, whose maximizers do not converge to the maximizer of f
(2) Construct an estimator of using the estimator of part (1) and the equation = 2
(3) Use the empirical
substitution principle to construct an estimator of using the relation
E|X1 | = 2.
Problem 7.5. An object of unit mass is placed in a force eld of unknown constant intensity
. Readings Y1 , ..., Yn are taken at times t1 , ..., tn on the position of the object. The reading Yi
diers from the true position (/2)t2i by a random error i . We suppose the i to have mean 0
and be uncorrelated with constant variance.
(1) Find the least square estimator (LSE) of .
(2) Can you compute the MLE of without additional assumptions ? The method of
moments estimator ?
Problem 7.6. Let Y1 , ..., Yn be independent random variables with equal variances such
that EYi = zi where the zi are known constants. Find the least squares estimator of .
Problem 7.7. Suppose Y1 , ..., Yn1 +n2 are given by
{
1 + i , i = 1, ..., n1
Yi =
2 + i , i = n1 + 1, ..., n2 ,
where 1 , ..., n1 +n2 are independent N (0, 2 ) variables.
(1) Find the LSE of = (1 , 2 )
(2) Find the MLE of , when 2 is known
(3) Find the MLE of , when 2 is unknown
Problem 7.8. Let X_1, ..., X_n be a sample from one of the following distributions. Find the MLE of θ.

(1) f(x; θ) = θ e^{−θx}, x ≥ 0, θ > 0 (exponential p.d.f.)
(2) f(x; θ) = θ c^θ x^{−(θ+1)}, x ≥ c, c > 0 and θ > 0 (Pareto p.d.f.)
(3) f(x; θ) = c θ^c x^{−(c+1)}, x ≥ θ, c > 0, θ > 0 (Pareto p.d.f.)
(4) f(x; θ) = θ x^{θ−1}, x ∈ [0, 1], θ > 0 (Beta(θ, 1) p.d.f.)
(5) f(x; θ) = (x/θ^2) exp{ −x^2/(2θ^2) }, x > 0, θ > 0 (Rayleigh p.d.f.)
(6) f(x; θ) = θ c x^{c−1} exp{ −θ x^c }, x ≥ 0, c > 0, θ > 0 (Weibull p.d.f.)
Problem 7.9. Let X1 , ..., Xn be a sample from N (, 2 ). Find the MLE of = (, 2 )
under the assumption that 0.
Problem 7.10. Let X_1, ..., X_n be a sample from the p.d.f.

f(x; θ) = (1/σ) e^{−(x−θ)/σ} for x ≥ θ, and f(x; θ) = 0 for x < θ,

where σ > 0 and θ ∈ R.
Show that E X1 = /( + 1)
Find the method of moments estimator
Compute the maximum likelihood estimator of
Find the maximum likelihood estimator of E
> 0.
Optimality.
Problem 7.16. Let X have the p.m.f. of a Poisson r.v., conditioned to be positive^56:

p(k; θ) = ( e^{−θ} θ^k / k! ) / ( 1 − e^{−θ} ),  k ∈ {1, 2, ...},  θ > 0.

It is required to estimate q(θ) = 1 − e^{−θ} and for this purpose the estimator

T(X) = 0 if X is odd, and T(X) = 2 if X is even

is considered.

S^2 := (1/(n−1)) Σ_{i=1}^n (X_i − X̄_n)^2  and  S̃^2 := (1/(n+1)) Σ_{i=1}^n (X_i − X̄_n)^2

(2) Show that the MSE risk of S^2 is greater than that of S̃^2 for all θ (i.e. S^2 is inadmissible).
Bayes estimation.
Problem 7.19. Prove Lemma 7c8 in the case of countable parametric space .
Problem 7.20 (The sunrise problem of Laplace). What is the probability that the sun
will rise tomorrow? is the question which P-S. Laplace addressed in the 18-th century. He
suggested that at the beginning of the world (he has literally taken the date from the Bible), the
probability of the sun to rise was completely uncertain, which he expressed by assuming that
p was sampled form U ([0, 1]). Further, he assumed that the sun rises each day independently
with the same conditional probability of success p. More precisely, if Xi takes value 1 if the sun
rises on the i-th morning, then Xn , n 1 forms a sequence of i.i.d. Ber(p) r.v.s, conditioned
on p. Prove the rule of succession :
s+1
.
P(Xn+1 = 1|X1 + ... + Xn = s) =
n+2
56i.e. the conditional p.m.f of Y given {Y > 0}, where Y Poi().
Problem 7.21. Let (P ) be a statistical model with being a discrete set of points.
Show that the Bayes estimator of with respect to the prior and the loss function:
0 (, ) = I( = ),
is given by the maximum a posteriori probability (MAP)
(X) = argmax P( = |X).
Explain, why the MAP estimator minimizes the probability of error in guessing the value of ,
given the observation X.
Problem 7.22. Show that distribution is conjugate to the likelihood of n i.i.d. Ber(),
[0, 1] r.v.s Find the corresponding posterior parameters and deduce the Bayes estimator
under MSE.
Problem 7.23. Prove that Pareto distribution with the p.d.f.
c
f (x; c, ) = +1 I(x c), c > 0, > 0
x
is conjugate to the likelihood of the sample X = (X1 , ..., Xn ) of size n from U ([0, ]), R+ .
Show that the posterior distribution has the Pareto density with parameters max(c, maxi Xi )
and + n. Check that the Bayes estimator under MSE is given by:
( + n) max(c, X1 , ..., Xn )
(X) =
.
+n1
Problem 7.24. Let X1 , ..., Xn be a sample from Poi(), > 0. Assume the prior (, ).
(1) Find the Bayes estimator of with respect to the loss function (, ) = ( )2 /
(2) Find the Bayes risk of the Bayes estimator
Problem 7.25. Let X = (X1 , ..., Xn ) be a sample from the exponential distribution exp(),
> 0. Assume the prior (, ).
(1) Find the conditional distribution of given X. Is the prior conjugate to the likelihood
?
(2) Find the Bayes estimator of with respect to the loss function (, ) = ( )2 .
(3) Find the MLE of
(4) Calculate the risk functions of the estimators in (3) and (2) and compare
(5) Calculate the Bayes risk of the estimaets in (3) and (2) and compare
Problem 7.26. Show that Bayesian estimator with respect to quadratic loss is biased, if
the posterior is non-degenerate (i.e. the posterior variance is positive with positive probability)
Problem 7.27. Find the minimax estimator for R+ , given the sample X1 , ..., Xn
Poi()
Problem 7.28. Show that the naive estimator = X is minimax in Steins example 7b8.
Unbiased estimation.
Problem 7.29. Argue that an estimator, which is not a function of the minimal sucient
statistic, is inadmissible with respect to the quadratic risk.
2.
Problem 7.30.
Let X1 , ..., Xn be a sample from a p.d.f. with unknown mean and variance
Dene T (X) = ni=1 ci Xi .
Problem 7.31. Let X1 , ..., Xn be a sample from N (, 1). It is required to estimate P (X1
0) = ()
(1) Show that T (X) = I(X1 0) is an unbiased estimator
n to obtain an improved estimator.
(2) Apply the R-B theorem with the sucient statistic X
is a Gaussian vector.
Hint: note that (X1 , X)
(3) Show that the obtained estimator is UMVUE
Problem 7.32. Let X1 , ..., Xn be a sample from U ([0, ]), > 0 and let Mn (X) = maxi Xi .
(1) Show that T (X) = n+1
n Mn (X) is an unbiased estimator of .
M
(X).
Show that
(2) Let T (X) = n+2
n+1 n
R(, T ) < R(, Mn ),
and conclude that both Mn and
Problem 7.33. In each one of the following cases, show that is an unbiased estimator of
the parameter of interest, nd a sucient statistic and improve by means of the R-B procedure.
(1) X1 , X2 is a sample from Geo(), (X) = I(X
1n= 1) is an estimator of n
(2) X1 , ..., Xn is a sample form Ber(), (X) = i=1 Xi is an estimator of
(3) X1 , ..., Xn is a sample form Ber(), (X) = X1 X1 X2 is an estimator of (1 )
Problem 7.34. Let X1 Geo(). It is required to estimate /(1 + ) and the estimator
T (X) = eX is considered.
(1) Is T (X) unbiased ?
(2) Let X2 be an additional sample from Geo(), independent of X1 and dene S = X1 +X2 .
Find the conditional distribution of X1 , given S.
(3) Apply the R-B procedure with57 S from (2), to improve the estimator from (1)
Problem 7.35. Let X1 , ..., Xn be a sample from Ber(). Show that S(X) = ni=1 Xi is a
n is the UMVUE of .
complete statistic, if contains more than n points. Argue that X
57check that S is sucient
Problem 7.36. Let X1 , ..., Xn be a sample from U ([1 , 2 ]), where = (1 , 2 ) is the unknown parameter.
(1) Specify the parametric space .
(2) Show that T (X) = (mini Xi , maxi Xi ) is a sucient statistic.
(3) Assuming that T is complete, argue that (T1 + T2 )/2 is the UMVUE of the mean
(1 + 2 )/2
Problem 7.37. Consider N = (N1 , ..., Nk ) Mult(n; ), where = (1 , ..., k ) Sk1 =
(2) Show that if is known and 2 is unknown, then n1 ni=1 (Xi )2 is the UMVUE of
2.
Problem 7.39. Let X1 , ..., Xn be a sample from (p, ), where = (p, ) is the unknown
parameter. Find the UMVUE of p/.
Problem 7.40. Let X Bint (n, ) be a sample from truncated Binomial distribution:
( )
n k
(1 )nk
k
p(k; ) =
, k {1, ..., n}.
1 (1 )n
(1) Show that X is a complete sucient statistic for .
X(nX)
n(n1)
is the UMVUE of (1 ).
Problem 7.42. Let (P_θ)_{θ∈N} be the family of uniform distributions on the first θ integers, U({1, ..., θ}).

(1) Show that X ~ P_θ is a complete sufficient statistic
(2) Show that 2X − 1 is the UMVUE of θ
(3) Let (P_θ)_{θ∈N\{k}} = {P_θ, θ ∈ N} \ {P_k}, where k is a fixed integer. Show that X ~ P_θ is not complete. Hint: Calculate E_θ g(X) for

g(i) = 0 for i ≠ k, k+1,  g(k) = 1,  g(k+1) = −1.

(4) Show that 2X − 1 is not UMVUE of θ for θ ∈ N \ {k}. Hint: Consider the estimator

T(X) = 2X − 1 if X ∉ {k, k+1}, and T(X) = 2k if X ∈ {k, k+1}.
Problem 7.44. Solve the previous problem for the Weibull p.d.f.
f (x; ) = cxc1 ex I(x > 0),
Problem 7.47. Let (Xn )n1 be a sequence of i.i.d r.v. with the common distribution
U ([0, 1]) and let Mn = maxin Xi . Introduce the sequences
Yn = n(1 Mn )
Zn = n(1 Mn )
Wn = n2 (1 Mn ).
Check whether each one of these sequences converges in probability ? in distribution ?
Problem 7.48. Let (X_n)_{n≥1} be a sequence of r.v. with the p.m.f.

P(X_n = n) = 1/(2n),  P(X_n = 2) = 1/(2n),  P(X_n = 3) = 1 − 1/n,  and P(X_n = k) = 0 otherwise.

(1) Does (X_n)_{n≥1} converge in probability? If yes, what is the limit?
(2) Does EX_n converge? Compare to your answer in (1)
(3) Answer (1) and (2) if k = n in the first line of the definition of X_n is replaced with k = 1
Problem 7.49. Prove that X_n ~ Bin(n, p_n), with p_n satisfying lim_{n→∞} n p_n = λ > 0, converges in distribution to Poi(λ).

Problem 7.50. Let X_n ~ Poi(n); show that (X_n − n)/√n converges weakly to N(0, 1).
Problem 7.51. Let X1 , ..., Xn be a sample from U ([ 1/2, + 1/2]), where R.
(1) Show that the MLE n of is not unique
(2) Show that any choice of n yields a consistent sequence of estimators
Problem 7.52. A beuro of statistics wants to choose a sample, so that the empirical proportion of voters for a particular candidate will be less than 50% with probability 0.01, when
the actual proportion is 52%. Suggest how to choose the sample size on the basis of CLT ?
Problem 7.53. Let X_1, ..., X_n be a sample from N(μ, σ^2), where both μ and σ^2 are unknown. Show that

S_n^2(X) := (1/(n−1)) Σ_{i=1}^n (X_i − X̄_n)^2

is a consistent and asymptotically normal estimator of σ^2. Calculate its limit variance.
Hint: use the particular properties of the distribution of S_n^2(X) and apply the CLT.
Problem 7.54. Let X HG(N, N , n), where = (0, 1). Suppose that n = (N ) for
an increasing function (i.e. the size of the sample n grows with the population (shipment,
etc.) size N ). Find conditions on so that the sequence of estimators N (X) = X/n = X/(N )
is consistent.
Problem 7.55. For the sample X1 , ..., Xn from each one of the following distributions, nd
the sequence of MLEs as n , show that it is consistent and nd the asymptotic error
distribution (under appropriate scaling)
(1) Poi(), > 0
(2) Ber(), [0, 1]
(3) Geo(1/), > 0
Problem 7.56. Construct condence interval estimator for the parameter of the i.i.d.
Poi() sample X1 , ..., Xn
(1) Using the CLT and Slutskys theorem
(2) Using the corresponding variance stabilizing transformation
Problem 7.57 (MLE in exponential families). Let X_1, ..., X_n be an i.i.d. sample from a density which belongs to the 1-exponential family

f(x; θ) = exp( c(θ)T(x) + d(θ) + S(x) ),  x ∈ R, θ ∈ Θ.

Assuming that c(θ) is a one-to-one function, the natural parametrization of the exponential family is defined by η := c(θ), with η ∈ c(Θ):

f̃(x; η) = exp( η T(x) + d̃(η) + S(x) ),  x ∈ R, η ∈ c(Θ),

where d̃(η) := d(c^{−1}(η)).

(1) Show that d̃'(η) = −E_η T(X_1) and d̃''(η) = −var_η(T(X_1)).
(2) Show that the MLE η̂_n is the unique root of the equation

−d̃'(η) = (1/n) Σ_{i=1}^n T(X_i).
(1) Argue that the statistics Xn form a consistent sequence of estimators for the location
parameter
(2) Show that the MLE of can be taken to be the sample median med(X n ) := X(n/2)
where n/2 is the smallest integer greater or equal n/2 and X(1) , ..., X(n) is the order
statistic of X n .
Hint: note that the MLE is not unique for even n.
(3) Prove consistency of med(X n )
Hint: note that e.g.
{ n
}
{
}
n
med(X ) + =
1{Xi } n/2 .
i=1
The Cauchy density with the location parameter R and the scaling parameter > 0
is
1
1
f (x; , ) =
)2
(
1 + x
(4) Let X n = (X1 , ..., Xn ) be a sample from Cau(, ) density with the unknown location
n consistent ? Is med(X n ) consistent ?
parameter and = 1. Is X
Hint: It can be shown that the characteristic function of the Cauchy random
variable is given by:
(t; , ) =
eitx f (x; , )dx = eit|t| .
R
(5) Suggest yet another consistent estimator of , using the substitution principle method.
CHAPTER 8
Hypothesis testing
In contrast to parameter estimation (either point or interval), which deals with guessing
the value of the unknown parameter, we are often interested to know only whether or not the
parameter lies in a specic region of interest in the parametric space. A typical instance of such
a problem is signal detection, frequently arising in electrical engineering. A radar transmitter
sends a signal in a particular direction: if the signal encounters an object on its way, the echo of
the signal is returned to the radar receiver, otherwise the receiver picks up only noise. Hence the
receiver has to decide whether the received transmission contains a signal or it consists only of the
background noise. How to formalize this problem mathematically ? How to construct reasonable
test procedures ? How to compare dierent test procedures ? Is there a best procedure ? All
these questions are addressed within the statistical framework of hypothesis testing, which we
shall explore below.
a. The setting and terminology
Let (P_θ)_{θ∈Θ} be a statistical model and suppose Θ = Θ_0 ∪ Θ_1, where the subsets Θ_0 and Θ_1 do not intersect. The value of the parameter is unknown and our goal is to decide whether θ belongs to Θ_0 or to Θ_1, given a sample X ~ P_θ. Using the statistical language, we want to test the null hypothesis H0: θ ∈ Θ_0 against the alternative H1: θ ∈ Θ_1, based on the sample X ~ P_θ. If Θ_0 (or/and Θ_1) consists of a single point, the null hypothesis (or/and the alternative) is called simple, otherwise it is referred to as composite.

Example 8a1. In the signal detection problem^1, a reasonable statistical model is an i.i.d. sample X = (X_1, ..., X_n) from N(θ, σ^2), where σ^2 is the known intensity (variance) of the noise and θ is the unknown parameter. The null hypothesis is then H0: θ = 0, i.e. Θ_0 consists of a single point {0}, and we want to test it against the alternative H1: θ ≠ 0, i.e. Θ_1 = Θ \ {0}. Thus we want to test a simple hypothesis against a composite alternative.

If we know that the signal may take only a positive (but still unknown) value, then we may consider testing H0: θ = 0 against H1: θ > 0. If, in addition, this value is known, say θ_1, then the problem is to test H0: θ = 0 against H1: θ = θ_1, i.e. testing a simple hypothesis against a simple alternative.
Remark 8a2. While technically the roles of the null hypothesis and the alternative are
symmetric, it is customary to think about H0 as some usual theory/state/etc., which we want
to reject in favor of the alternative theory/state.
1 it is convenient to think of this problem to memorize the terminology: e.g. the null hypothesis can be thought of as the usual state of no signal, only noise
Given a sample X P we have to reject or accept the null hypothesis. Suppose for definiteness that the sample X takes values in Rn . Then any2 test can be dened by specifying
the region of rejection (or critical region), i.e. a set C Rn such that the null hypothesis H0 is
rejected if and only if the event {X C} occurs. The complement Rn \ C is called the region
of acceptance (of the null hypothesis). This can be equivalently reformulated in terms of the critical function (or test function) δ(X) := I(X ∈ C), i.e. H0 is rejected if and only if the event {δ(X) = 1} occurs. Usually the test function can be put in the form δ(X) = I( T(X) ≥ c ), where T(X) is the test statistic, i.e. a function from R^n to R, depending only on the sample, and c is a real constant, called the critical value.
Example 8a1 (continued) If we want to test H0: θ = 0 against H1: θ ≠ 0, intuitively we feel that we shall reject the null hypothesis H0 if the empirical mean is far from zero: in terms of the objects defined above, we use the test statistic T(X) = |X̄_n| and reject H0 if and only if {|X̄_n| ≥ c}, where c is the critical value to be chosen (see below). This kind of tests are called two-sided (or two-tailed) as they reject if the test statistic takes values on both sides with respect to the null hypothesis.

Now suppose that we know that the signal value is positive, i.e. H0: θ = 0 is to be tested against H1: θ > 0. In this case, the one-sided (one-tailed) test {X̄_n ≥ c} appears as a reasonable choice.
(8a1)
Lets try to get a feeling of how c aects these three quantities (it may be helpful to think of
the signal detection again). For large critical values c, we shall rarely reject H0 and, in particular,
shall rarely reject erroneously, and consequently shall get small -error. On the other hand, for
exactly the same reason we shall get little power at 1 (and hence large -error). Conversely,
small c will cause frequent rejections and hence higher false alarms, however will yield higher
2in fact, we may also consider randomized tests, i.e. those in which the decision is taken not only on the
basis of the sample realization, but also on an auxiliary randomness. More specically, (X) is allowed to take
values in the interval [0, 1] (rather than in the set of two points {0, 1}) and a coin with probability of heads (X)
is tossed: H0 is rejected if the coin comes up heads. It turns out that this general framework is more exible and
convenient in certain situations (see Remark 8b6)
values of power at θ ∈ Θ_1. This tradeoff is the key to the optimality theory of tests: we would like to make small false alarm errors and simultaneously have large power at the alternatives (or, equivalently, to make small errors of both types simultaneously).

If a test erroneously rejects H0 with probability less than α > 0 for all θ ∈ Θ_0, then it is said to have the level of significance α. Clearly a test of level α is also of level α' for any α' > α. The smallest level of the test is referred to as its size (which is nothing but the definition (8a1)).
Example 8a1 (continued) Let's calculate the power function of the test δ(X) = I{X̄_n ≥ c} of H0: θ = 0 against H1: θ > 0:

π(θ, δ) = P_θ( X̄_n ≥ c ) = P_θ( √n(X̄_n − θ)/σ ≥ √n(c − θ)/σ ) = 1 − Φ( √n(c − θ)/σ ),

where Φ is the c.d.f. of a standard Gaussian r.v. and we used the fact that Z := √n(X̄_n − θ)/σ ~ N(0, 1). Note that π(θ, δ) is an increasing function of θ (why?). The size of the test is

α = sup_{θ∈Θ_0} π(θ, δ) = π(0, δ) = 1 − Φ( √n c/σ ),

and solving for c gives

c(α) = σ Φ^{−1}(1 − α)/√n.

For larger values of n, c(α) gets smaller, which agrees with the fact that X̄_n is closer to 0 under H0. For the critical value corresponding to the size α, the power function reads

π(θ, δ) = 1 − Φ( √n( σΦ^{−1}(1 − α)/√n − θ )/σ ) = 1 − Φ( Φ^{−1}(1 − α) − θ√n/σ ),    (8a2)

Pay attention (see Figure 1) that for small values of the alternative θ ∈ Θ_1, the power is close to α (which is of course anticipated, as the power function is continuous in this case). On the other hand, for any θ ∈ Θ_1, the power tends to 1 as n → ∞, i.e. the test makes small errors of both types. In particular, we can choose n large enough to force arbitrarily small β-errors at any value of θ from the alternative.
Remark 8a3. This example shows that the power of the test is close to α at θ's close to the null hypothesis, which of course confirms the fact that close values of the hypothesis and alternatives are hard to distinguish. Hence often the practical requirement is formulated in terms of the size of the test and the minimal power of the test outside some indifference region of length Δ, where good testing quality cannot be expected. For instance, in the previous example one can require level α = 0.01 and minimal power 0.9 outside the indifference region [0, 1] (i.e. Δ = 1). The corresponding n is easily found, using monotonicity of π(θ, δ) and the formula (8a2).
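For instance, the sample size in Remark 8a3 can be read off (8a2) directly; a small sketch (mine), where the requirement is power at least 0.9 at θ = Δ for level α:

```python
import numpy as np
from scipy.stats import norm

def required_n(alpha=0.01, power=0.9, delta=1.0, sigma=1.0):
    """Smallest n with 1 - Phi(Phi^{-1}(1-alpha) - delta*sqrt(n)/sigma) >= power, cf. (8a2)."""
    z_alpha, z_power = norm.ppf(1 - alpha), norm.ppf(power)
    return int(np.ceil(((z_alpha + z_power) * sigma / delta) ** 2))

print(required_n())   # n guaranteeing power >= 0.9 at theta = 1 for alpha = 0.01
```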
Another practically useful notion is the p-value of the test. Suppose a simple null hypothesis is tested by means of a test statistic T. The p-value is defined as the probability of getting values of the test statistic more extreme than the actually observed one:

p(t) := P_0( T ≥ t ),  t ∈ R.

A large p-value indicates that the observed value of the test statistic is typical under the null hypothesis, which is thus to be accepted. If T has a continuous increasing c.d.f., then the random variable p(T) has uniform distribution, irrespectively of P_0 (think why). Hence the
Figure 1. The power function π(θ, δ) for testing H0: θ = 0 against H1: θ > 0, given a sample from N(θ, σ^2), for n = 20, 50, 100 and α = 0.01.
test (X) = {|Xn 1/2| c} appears reasonable. The power function of this test is given by:
( )
(
)
n k
(1 )nk .
(, ) = P |Xn 1/2| c =
k
k:|k/n1/2|c
The critical value of the test of level is the minimal c satisfying the inequality
( )
n
(1/2, )
=
2n ,
k
k:|k/n1/2|c
which can be found numerically. For large n, such calculation can be dicult and, alternatively,
one can use the CLT to construct an approximate test. Assuming that n is large enough (to
justify the approximation) and adding the subscript n to c to emphasize its dependence on n,
we get:
(
(
)
(
)
) n
n 1/2| cn = P1/2 n Xn 1/2 2cn n
(1/2, ) = P1/2 |X
2 2z ,
1/2
x R, n 1,
where m3 := E|1 |3 and C is an absolute constant with value less than 0.4784
Xi E1/2 Xi
CE|1 |3
C
= ,
n
n
F
(2z
)
+
F
(2z
)
=
n
n
2
2
(
)
(
)
1
1 Fn (2z ) 1 (2z ) + Fn (2z ) 2z 2Cn1/2 .
n
Let us stress that the latter inequality holds for any n and not just asymptotically. Hence for a
given , one can choose an n, which yields the approximation of the required quality.
Application of the limit theorems must be done with care. For example, suppose that we
want to calculate the power of the obtained test at a particular value of alternative (e.g. at an
end point of an indierence region, say at 1 := 3/4):
(
)
n 1/2| z / n .
(3/4, ) = P3/4 |X
Note that this time the critical value is already xed. Lets try to implement the same approach
as above:
(
n 3/4 z /n 1/4 )
(
)
X
n 3/4
n 3/4
X
X
z / n + 1/4
z 1/4 n
P3/4
n
n
= P3/4
n
+
3/4
3/4
3/4
3/4
(
(
(
)
)
)
n 3/4
X
z 1/4 n
z + 1/4 n
z + 1/4 n
P3/4
n
= 1 Fn
+ Fn
.
3/4
3/4
3/4
3/4
Note that Fn () in the latter expression is evaluated at points, which themselves depend on n
and hence CLT is not informative (or applicable)( anymore:
this expression
converges3 to 1 as
(
)
)
n
n
(1/2)50 = 0.0066...
P1/2 (|X50 1/2| |35/50 1/2|) = P1/2 (|S50 25| 10) = 2
k
k35
which indicates that the obtained sample poorly supports the hypothesis H0 (or in other words
this outcome is not typical for a fair coin).
b. Comparison of tests and optimality
From the decision theoretic point of view, it makes sense to compare tests of the same size:

Definition 8b1. Let δ and δ̃ be tests of size α. The test δ is more powerful than δ̃, if

π(θ, δ) ≥ π(θ, δ̃),  ∀θ ∈ Θ_1.

Definition 8b2. A test of size α is uniformly most powerful (UMP) if it is more powerful than any other test of size α.
As in the case of the point estimators, two tests of size α do not have to be comparable in the sense of the latter definition and the UMP test does not have to exist. When both the hypothesis and the alternative are simple, the UMP^4 test can be found explicitly:

Theorem 8b3 (Neyman-Pearson lemma). Consider the statistical model (P_θ)_{θ∈Θ} with Θ = {θ_0, θ_1}, X ~ P_θ, and let L(x; θ_0) and L(x; θ_1) be the corresponding likelihood functions. Then the likelihood ratio test^5

δ*(X) = I( L(X; θ_1)/L(X; θ_0) ≥ c(α) )

of size α < 1 is MP.

3 again, by the Berry-Esseen theorem (check!)
4 in fact it is the MP (most powerful) test, since the alternative is simple
5 A better way to define the test is

δ*(X) = I( L(X; θ_1) ≥ c(α) L(X; θ_0) ),
Remark 8b4. The test δ* is not counterintuitive: if the sample X comes from P_{θ_0}, then L(X; θ_0) will typically be greater than L(X; θ_1) and hence the test statistic in δ* is small, i.e. H0 is rarely rejected. This also provides the heuristic grounds for the so called generalized likelihood ratio test, which is the main tool in more general hypothesis testing problems (see Section c below).

Remark 8b5. Note that if S(X) is the minimal sufficient statistic for θ, then by the F-N factorization theorem the likelihood ratio test depends on the sample only through S(X).
Proof. We shall give the proof for the continuous case, when the likelihoods are in fact p.d.f.s on R^n (the discrete case is treated similarly). We have to show that if δ is a test of level α, i.e. E_{θ_0}δ(X) ≤ α, then

E_{θ_1}δ*(X) ≥ E_{θ_1}δ(X).    (8b1)

To this end,

E_{θ_1}δ*(X) = ∫_{R^n} δ*(x) f(x; θ_1) dx = c(α) ∫_{R^n} δ*(x) f(x; θ_0) dx + ∫_{R^n} δ*(x)( f(x; θ_1) − c(α) f(x; θ_0) ) dx
  = c(α) E_{θ_0}δ*(X) + ∫_{R^n} δ*(x)( f(x; θ_1) − c(α) f(x; θ_0) ) dx.    (8b2)

Further,

∫_{R^n} δ*(x)( f(x; θ_1) − c(α) f(x; θ_0) ) dx ≥ ∫_{R^n} δ(x)( f(x; θ_1) − c(α) f(x; θ_0) ) dx ≥ E_{θ_1}δ(X) − c(α) α,

where the first inequality holds since δ(x) ≥ 0 and δ(x) ≤ 1 (compare the integrands on the sets where f(x; θ_1) − c(α)f(x; θ_0) is nonnegative and negative), and the second since E_{θ_0}δ(X) ≤ α. Plugging the latter inequality into (8b2) yields

E_{θ_1}δ*(X) ≥ c(α)( E_{θ_0}δ*(X) − α ) + E_{θ_1}δ(X) ≥ E_{θ_1}δ(X),

where the last inequality holds, since c(α) ≥ 0 (otherwise α = 1) and E_{θ_0}δ*(X) = α.
Remark 8b6. If the likelihood ratio does not have a continuous distribution, as will typically be the case if X is discrete, then it might be impossible to find the critical value c, so that the N-P test has the precise level α. In this case, the UMP test exists in the more general class of randomized tests. In practice, if a randomized test is undesirable, one can switch to the largest achievable size less than α.

which does not run into the problem of division by zero (in case the likelihoods do not have the same support)
Example 8b7. Suppose that we observe the outcome of n independent tosses of one of two coins with head probabilities 0 < θ_0 < θ_1 < 1 respectively. We want to decide which coin was actually tossed, i.e. test H0: θ = θ_0 against H1: θ = θ_1, given the sample X = (X_1, ..., X_n). The likelihood ratio statistic is

L(X; θ_1)/L(X; θ_0) = (θ_1/θ_0)^{S_n(X)} ( (1 − θ_1)/(1 − θ_0) )^{n − S_n(X)},

where S_n(X) = Σ_{i=1}^n X_i, and the N-P test rejects H0 if and only if

(θ_1/θ_0)^{S_n(X)} ( (1 − θ_1)/(1 − θ_0) )^{n − S_n(X)} ≥ c,

or, equivalently,

( (1/θ_0 − 1)/(1/θ_1 − 1) )^{S_n(X)} ≥ c ( (1 − θ_0)/(1 − θ_1) )^n.

Since θ_1 > θ_0, the base on the left hand side is greater than one, so the test takes the form

δ*(X) = I{ S_n(X) ≥ c* },

where the critical value c* is to be chosen to match the desired level α. The power function of this test is

π(θ, δ*) = Σ_{k ≥ c*} (n choose k) θ^k (1 − θ)^{n−k},  θ ∈ {θ_0, θ_1}.

If α is in the range of π(θ_0, δ*), considered as a function of c*, then the corresponding critical value is the unique root of the equation

Σ_{k ≥ c*(α)} (n choose k) θ_0^k (1 − θ_0)^{n−k} = α.    (8b3)
Example 8b8. Consider the signal detection problem, when we know the exact value of the signal to be θ_1 > 0. We observe n i.i.d. samples from N(θ, σ^2), θ ∈ Θ = {0, θ_1}, where σ^2 > 0 is known, and would like to decide whether the signal is present. Put into the above framework, we are faced with the problem of testing H0: θ = 0 against the alternative H1: θ = θ_1. The corresponding likelihood is

L_n(x; θ) = ( 1/(√(2π) σ) )^n exp( −(1/(2σ^2)) Σ_{i=1}^n (x_i − θ)^2 ),  x ∈ R^n,    (8b4)

and the likelihood ratio test statistic is

L_n(X; θ_1)/L_n(X; 0) = exp( −(1/(2σ^2)) Σ_{i=1}^n (X_i − θ_1)^2 + (1/(2σ^2)) Σ_{i=1}^n X_i^2 ) = exp( (θ_1/σ^2) ( Σ_{i=1}^n X_i − nθ_1/2 ) ),
and the N-P test rejects H0 (i.e. decides that there is a signal) if and only if

exp( (θ_1/σ^2)( Σ_{i=1}^n X_i − nθ_1/2 ) ) ≥ c,

for a critical value c to be chosen to meet the level specification. Note that the above is equivalent to the test suggested in Example 8a1 on the intuitive grounds:

X̄_n ≥ c',

where c' is related to c by a one-to-one correspondence, which is not of immediate interest to us, since we shall choose c' directly to get a test of level α. To this end, note that under H0, X̄_n ~ N(0, σ^2/n) and hence the level of the test is given by

P_0( X̄_n ≥ c' ) = P_0( √n X̄_n/σ ≥ √n c'/σ ) = 1 − Φ( √n c'/σ ) = α,

which gives c' = σ Φ^{−1}(1 − α)/√n, i.e.

δ*(X) = I{ X̄_n ≥ σ Φ^{−1}(1 − α)/√n }.    (8b5)

The power of this test at θ_1 is

π(θ_1, δ*) = P_{θ_1}( X̄_n ≥ σ Φ^{−1}(1 − α)/√n ) = P_{θ_1}( √n(X̄_n − θ_1)/σ ≥ Φ^{−1}(1 − α) − √n θ_1/σ ) = 1 − Φ( Φ^{−1}(1 − α) − √n θ_1/σ ),    (8b6)

where √n(X̄_n − θ_1)/σ ~ N(0, 1) under P_{θ_1}.
Note that the test (8b5) is most powerful against any simple alternative θ_1 > 0 and does not itself depend on θ_1. Hence we can use δ* to test H0: θ = 0 against H1: θ > 0, without knowing the value of θ_1! Let δ(X) be an arbitrary level-α test of H0: θ = 0 against H1: θ > 0, i.e. E_0 δ(X) ≤ α. Then by the N-P lemma, for any fixed θ_1 ∈ Θ_1,

π(θ_1, δ*) ≥ π(θ_1, δ),

and since θ_1 is an arbitrary alternative, we conclude that δ* is UMP:

π(θ, δ*) ≥ π(θ, δ),  ∀θ ∈ Θ_1.    (8b7)

Of course, independence of the test δ* of the alternative parameter value θ_1 was crucial and, in general, this will not be the case.

Furthermore, note that the power function (8b6) of δ* is continuous and monotonically increasing in θ_1. Consider δ* as a test for the problem

H0: θ ≤ 0
H1: θ > 0    (8b8)

Clearly, δ* has size α in this problem:

sup_{θ≤0} π(θ, δ*) = π(0, δ*) = α.
Now let δ̃ be another test of size α, i.e. sup_{θ≤0} π(θ, δ̃) ≤ α, and suppose that

π(θ', δ̃) > π(θ', δ*)    (8b9)

for some θ' ∈ Θ_1. Consider the tests δ* and δ̃ for testing the simple problem

H0: θ = 0
H1: θ = θ'.

Again by the N-P lemma δ* is MP in this problem, which contradicts (8b9), since π(0, δ̃) ≤ α. Thus we conclude that δ* is the UMP test for the more complex problem (8b8).
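A small numerical sketch (mine) of the UMP test (8b5) and its power function (8b6), with arbitrary parameter values:

```python
import numpy as np
from scipy.stats import norm

def ump_one_sided_test(x, sigma, alpha=0.05):
    """Reject H0: theta = 0 in favor of H1: theta > 0 iff Xbar >= sigma*z_{1-alpha}/sqrt(n), cf. (8b5)."""
    n = len(x)
    return x.mean() >= sigma * norm.ppf(1 - alpha) / np.sqrt(n)

def power(theta1, n, sigma, alpha=0.05):
    """Power (8b6) of the test at the alternative theta1 > 0."""
    return 1 - norm.cdf(norm.ppf(1 - alpha) - np.sqrt(n) * theta1 / sigma)

rng = np.random.default_rng(9)
x = rng.normal(0.3, 1.0, size=50)
print("reject H0:", ump_one_sided_test(x, sigma=1.0), "  power at 0.3:", power(0.3, 50, 1.0))
```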
To recap, remarkably the N-P likelihood ratio test, which, at the outset, is optimal only
for testing simple hypothesis versus simple alternative, is in fact UMP test in the two latter
examples. It turns out that this is a somewhat more general phenomena:
Definition 8b9. A family of probability distributions (P ) , R with the likelihood
L(x; ) is said to be a monotone likelihood ratio family in statistic T (X), if for any 0 < 1 , P0
and P1 are distinct and L(x; 1 )/L(x; 0 ) is a strictly increasing function of T (x).
Theorem 8b10 (Karlin-Rubin). If (P ) is a monotone likelihood ratio family with respect
to T (X) and T (X) has a continuous distribution under P0 , then T (X) is the optimal test
statistic6 for testing
H0 : 0
(8b10)
H1 : > 0
and the corresponding power function is increasing.
Remark 8b11. Strictly increasing in the above denition can be replaced with nondecreasing and continuity of the distribution in the Theorem 8b10 can be omitted, if randomized tests
are allowed, or alternatively, some levels are forbidden.
Proof. By the assumption
R(x; 1 , 0 ) := L(x; 1 )/L(x; 0 ) = (T (x); 1 , 0 ),
where t 7 (t, 1 , 0 ) is a strictly increasing function for any 1 > 0 . Hence for any c in the
range of
{
} {
}
L(x; 1 )
c = T (X) 1 (c; 1 , 0 ) .
L(x; 0 )
In particular, this is true for the critical value c (0 , 1 ), which yields the level at 0 , i.e.
)
(
L(x; 1 )
c (1 , 0 ) = ,
P0
L(x; 0 )
and it follows that
(
(
))
1
P0 T (X) c (1 , 0 ); 1 , 0 = .
Since
distribution under P0 , the latter can be solved for c (0 ) :=
( T (X) has a continuous
)
1
c (1 , 0 ); 1 , 0 , which does not depend on 1 . Hence we conclude that the level likelihood ratio test for the simple problem
H0 : = 0
H1 : = 1
with 1 > 0 has the form {T (X) c }, where c does not depend on 1 .
We shall prove shortly that the power function of this test (, ) = E (X) is strictly
increasing in . Then sup0 (, ) = (0 , ) = , i.e. is a level test for the problem
(8b10).
Lets repeat the arguments from the discussion, preceding Denition 8b9, to show that
is in fact UMP for the problem (8b10). Suppose that this is not the case and there is a test
with sup0 (, ) , such that
( , ) > ( , ),
(8b11)
for some > 0 . Now consider the problem of testing two simple hypotheses
H0 : = 0
H1 : = .
Note that (0 , ) and hence (8b11) contradicts the statement of the N-P lemma, which
shows that is UMP for the problem (8b10), as claimed.
To complete the proof, it remains to check that the power function (, ) is strictly increasing in
( . To
) this end, we shall show that for an nondecreasing function , the function
7 E T (X) is increasing: in particular, with7 (u) := I(u c()) this implies that the
power (, ) = E I(T (x) c()) increases in .
(
)
E (T (X)) E (T (X)) =
(T (x)) f (x; ) f (x; ) dx =
n
(
) R
f (x; )
(T (x))
1 f (x; )dx =
f (x; )
Rn
| {z }
=R(x; ,)
(
)
(T (x)) R(x; , ) 1 f (x; )dx+
{z
}
|
R(x; ,)>1
>0
(
)
(T (x)) R(x; , ) 1 f (x; )dx
|
{z
}
R(x; ,)<1
<0
(
)
inf
(T (x))
R(x; , ) 1 f (x; )dx+
x:R(x; ,)>1
R(x; ,)>1
(
)
sup
(T (x))
R(x; , ) 1 f (x; )dx =
R(x; ,)<1
x:R(x; ,)<1
inf
x:R(x; ,)>1
(T (x))
sup
x:R(x; ,)<1
)
(T (x))
(
R(x; ,)>1
)
f (x; ) f (x; ) dx,
(
)
(
)
0=
f (x; ) f (x; ) dx =
R(x; , ) 1 f (x; )dx =
Rn
Rn
(
)
(
)
Note that
R(x; ,)>1
R(x; ,)>1
)
f (x; ) f (x; ) dx > 0
x:R(x; ,)>1
(T (x))
sup
x:R(x; ,)<1
(T (x)).
The latter holds, since is an nondecreasing function and R(x; , ) is increasing in T (x).
increases in T (x).
Corollary 8b13. The conclusion of Theorem 8b10 holds for one-parameter exponential
family as in Lemma 8b12 (if T (X) has continuous distribution).
Example 8b8 (continued) The likelihood function (8b4) belongs to the one parameter exponential family:
n
( 1 )n
(
)
1
Ln (x; ) =
(xi )2 =
exp 2
2
2 2
i=1
n
n
(
)
n
1 2
xi + 2
xi 2 2 ,
exp n/2 log(2 2 ) 2
2
2
i=1
i=1
/ 2 ,
with c() =
which is an increasing function. Hence by Corollary 8b13, it is also monotonic
likelihood ratio family and hence by K-R Theorem 8b10, the level test (8b5) is UMP in testing
H0 : 0 against H1 : > 0.
Example 8b7 (continued) The likelihood

L(x; θ) = θ^{S_n(x)} (1 − θ)^{n − S_n(x)} = exp( S_n(x) log( θ/(1 − θ) ) + n log(1 − θ) )

belongs to the exponential family with c(θ) = log( θ/(1 − θ) ), which is strictly increasing. Hence the test (8b3) is UMP in testing H0: θ ≤ θ_0 against H1: θ > θ_0.
Unfortunately, K-R theorem does not easily extend to hypothesis testing problems with a
more complicated structure: e.g. it doesnt appear to have a natural analog in the case of
multivariate parameter space or when the alternative is two-sided. In fact, UMP test may fail
to exist at all. To see why we shall need a result, converse to the Neyman-Pearson lemma:
Proposition 8b14. Let be a test of H0 : = 0 against H1 : = 1 , with both types of
errors less than those of the N-P test. Then the two tests coincide, except perhaps on the set on
which the likelihood ratio equals the critical value of the N-P test.
Proof. Let be the N-P test, then under the assumptions of the proposition
E0 (X) E0 (X)
and
The rst inequality reads
(
)
(
)
E1 1 (X) E1 1 (X) .
Rn
Rn
Rn
Rn
Let c be the critical value of the N-P test (w.l.o.g. c 0), then multiplying the former inequality
by c and adding it to the latter, we get:
(
)
(
)
(x) fX (x; 1 ) cfX (x; 0 ) dx,
(x) fX (x; 1 ) cfX (x; 0 ) dx
Rn
Rn
and
Rn
)
)(
(x) (x) fX (x; 1 ) cfX (x; 0 ) dx 0
Rn
Let D := {x
: fX (x; 1 ) cfX (x; 0 ) 0}, then the latter reads:
)
)
(
)(
(
)(
(x) 1 fX (x; 1 ) cfX (x; 0 ) dx +
(x) 0 fX (x; 1 ) cfX (x; 0 ) dx 0.
{z } |
{z } |
D|
Dc |
{z
}
{z
}
0
Example 8b16. Let X be a single sample from the Cauchy density with the location parameter = R:
1/
f (x; ) =
, x R.
1 + (x )2
The likelihood ratio is
L(x; 1 )
1 + (x 0 )2
R(x; 1 , 0 ) :=
.
=
L(x; 0 )
1 + (x 1 )2
For xed 1 > 0 , this ratio is not monotonous in x: take e.g. 0 = 0 and 1 = 1, for which
lim R(x; 1, 0) = 1
and R(1; 1, 0) = 2.
value increases with 1 . Hence for small we shall get even nonintersecting critical regions
(1 /2)2 + 1, whose
{
}
will yield large values and hence the test (X) c , with an appropriate critical value c will
tend to reject the null hypothesis correctly. Conversely, when H0 is true, the null hypothesis
will be correctly accepted with large probability.
This test is called the generalized likelihood ratio test (GLRT). While the test statistic (8c1) is not guaranteed to produce optimal tests, it nevertheless typically does lead to good tests in many problems of interest and thus is used as the main test design tool in practice. Moreover, the critical value of the GLRT can be efficiently approximated for large $n$ (see Theorem 8c6 below). (One of the major difficulties in constructing practical tests is the calculation of the critical value for a given size $\alpha$: this basically reduces to finding the distribution of the test statistic, which can be a complicated function of the sample.)

Remark 8c1. If the likelihood $L(x;\theta)$ is continuous in $\theta$ and $\Theta_0$ has a smaller dimension than $\Theta_1$ (which itself has the dimension of $\Theta$), then (8c1) reads:
$$
\lambda(X) = \frac{\sup_{\theta\in\Theta} L\bigl(X;\theta\bigr)}{\sup_{\theta\in\Theta_0} L\bigl(X;\theta\bigr)}.
$$
Such situations are typical in applications and hence this latter form is sometimes also called the GLRT. In this context, the maximizers over $\Theta_0$ and $\Theta$ are referred to as the restricted and unrestricted MLEs respectively.
Below we shall explore a number of classical examples.
Example 8b7 (continued) A coin is tossed and we want to decide whether it is fair or not, i.e. test $H_0:\theta=1/2$ against $H_1:\theta\ne 1/2$. The GLRT statistic in this case is given by
$$
\lambda(X) = \frac{\sup_{\theta\in(0,1)\setminus\{1/2\}} L\bigl(X;\theta\bigr)}{\sup_{\theta\in\{1/2\}} L\bigl(X;\theta\bigr)}
= \frac{\sup_{\theta\in(0,1)} \theta^{S_n(X)}(1-\theta)^{n-S_n(X)}}{(1/2)^n}
= 2^n\,\bar X_n^{S_n(X)}\bigl(1-\bar X_n\bigr)^{n-S_n(X)}
= 2^n\,\bar X_n^{n\bar X_n}\bigl(1-\bar X_n\bigr)^{n(1-\bar X_n)},
$$
and thus
$$
\log\lambda(X) = n\log 2 + n\bar X_n\log\bar X_n + n\bigl(1-\bar X_n\bigr)\log\bigl(1-\bar X_n\bigr) =: n\bigl(h(1/2) - h(\bar X_n)\bigr),
$$
where
$$
h(p) = -p\log p - (1-p)\log(1-p)
$$
is the Shannon entropy of the $\mathrm{Ber}(p)$ r.v. It is easy to see that $h(p)$ is maximal at $p=1/2$ and symmetric around it. The GLRT rejects $H_0$ if and only if
$$
\bigl\{\lambda(X)\ge c\bigr\} = \bigl\{h(\bar X_n)\le \tilde c\bigr\},
$$
i.e. if the empirical entropy is small. Note that this is a two-sided test and, in fact, it is equivalent to our original suggestion $\{|\bar X_n - 1/2|\ge c\}$ (why?). To get a level $\alpha$ test, we have to choose an appropriate $c$, which for large $n$ can be done approximately, using the normal approximation.
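A minimal numerical sketch of this normal approximation (not from the notes; the sample is synthetic and $n$, $\alpha$ are assumed values):

```python
# Two-sided test {|Xbar_n - 1/2| >= c} with c from the CLT approximation
# sqrt(n)*(Xbar_n - 1/2) ~ N(0, 1/4) under H0.
import numpy as np
from scipy.stats import norm

n, alpha = 100, 0.05
c = 0.5 * norm.ppf(1 - alpha / 2) / np.sqrt(n)     # half-width of the acceptance region

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=n)                     # synthetic coin tosses
print(c, x.mean(), abs(x.mean() - 0.5) >= c)       # True means "reject H0"
```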
Example 8c2 (matched pairs experiment). Suppose we want to establish the effectiveness of a medicine, whose effect varies depending on the weight, sleeping regime, diet, etc. of the patient. To neutralize the effect of different habits, a pair of patients with similar habits is chosen; one of them is given a placebo and the other one is treated with the medicine. The responses are recorded and the experiment is repeated independently $n$ times with other pairs. In this setting it is not unnatural to assume that the differences in responses $X_1,\dots,X_n$ form a sample from $N(\mu,\sigma^2)$, where both $\mu$ and $\sigma^2$ are unknown. We want to test $H_0:\mu=0$ against $H_1:\mu\ne 0$. In this problem, $\theta = (\mu,\sigma^2)\in\Theta = \mathbb{R}\times\mathbb{R}_+$.

Under the hypothesis $H_0$, the value of $\mu$ is known to be $0$ and the MLE of $\sigma^2$ is given by
$$
\hat\sigma_0^2(X) = \frac1n\sum_{i=1}^n X_i^2.
$$
Under $H_1$, the MLEs are
$$
\hat\mu_1 = \bar X_n,\qquad \hat\sigma_1^2 = \frac1n\sum_{i=1}^n\bigl(X_i - \bar X_n\bigr)^2
$$
(pay attention that we are maximizing with respect to $\mu$ over $\mathbb{R}\setminus\{0\}$, but this leads to the usual estimator). Hence the GLRT statistic is
$$
\lambda(X) = \frac{\bigl(\hat\sigma_0^2(X)\bigr)^{n/2}}{\bigl(\hat\sigma_1^2(X)\bigr)^{n/2}}
\exp\Bigl(-\frac{1}{2\hat\sigma_1^2(X)}\underbrace{\sum_{i=1}^n\bigl(X_i-\bar X_n\bigr)^2}_{=n\hat\sigma_1^2(X)}
+\frac{1}{2\hat\sigma_0^2(X)}\underbrace{\sum_{i=1}^n X_i^2}_{=n\hat\sigma_0^2(X)}\Bigr)
= \Bigl(\frac{\hat\sigma_0^2(X)}{\hat\sigma_1^2(X)}\Bigr)^{n/2}.
$$
Note that
$$
\frac{\hat\sigma_0^2(X)}{\hat\sigma_1^2(X)} = \frac{\hat\sigma_1^2(X)+\bar X_n^2}{\hat\sigma_1^2(X)} = 1+\frac{\bar X_n^2}{\hat\sigma_1^2(X)},
$$
and hence the GLRT rejects $H_0$ if and only if
$$
\Bigl\{\sqrt{n-1}\,\frac{|\bar X_n|}{\hat\sigma_1(X)}\ge c\Bigr\}.
$$
Under $H_0$,
$$
\sqrt{n-1}\,\frac{\bar X_n}{\hat\sigma_1(X)} \sim \mathrm{Stt}(n-1),
$$
and the level $\alpha$ test is obtained with the critical value $c$ solving the equation
$$
\alpha = 2F_{\mathrm{Stt}(n-1)}(-c),\qquad\text{i.e.}\qquad c(\alpha) = -F^{-1}_{\mathrm{Stt}(n-1)}(\alpha/2).
$$
The obtained test may not be UMP, but is still convenient and powerful enough for practical purposes.
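A minimal sketch of the resulting t-test (not from the notes; the data are synthetic and all names and sizes are assumptions of the illustration):

```python
# Matched-pairs GLRT in its t-statistic form.
import numpy as np
from scipy.stats import t

alpha = 0.05
rng = np.random.default_rng(1)
x = rng.normal(loc=0.3, scale=1.0, size=20)        # synthetic within-pair response differences

n = len(x)
sigma1 = np.sqrt(np.mean((x - x.mean()) ** 2))     # MLE of sigma under H1
T = np.sqrt(n - 1) * x.mean() / sigma1             # ~ Student t(n-1) under H0
c = t.ppf(1 - alpha / 2, df=n - 1)                 # critical value c(alpha)
print(T, c, abs(T) >= c)                           # True means "reject H0"
```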
Example 8c3. If the effect of the medicine in the previous example does not depend too much on the patient's habits, the experiment can be performed in a simpler manner: a group of $m$ patients is given a placebo and another group of $n$ patients is treated with the drug. Assume that the responses $X_1,\dots,X_m$ and $Y_1,\dots,Y_n$ of the first and the second groups are independent, with $X_j\sim N(\mu_0,\sigma^2)$ and $Y_i\sim N(\mu_1,\sigma^2)$, where $\sigma^2$ is unknown, and consider testing $H_0:\mu_0=\mu_1$ against $H_1:\mu_0\ne\mu_1$.

Under $H_0$, the MLEs of the common mean and of the variance are
$$
\hat\mu = \frac{1}{n+m}\Bigl(\sum_{i=1}^n Y_i + \sum_{j=1}^m X_j\Bigr) = \frac{n}{n+m}\bar Y_n + \frac{m}{n+m}\bar X_m,
\qquad
\hat\sigma_0^2 = \frac{1}{n+m}\Bigl(\sum_{i=1}^n\bigl(Y_i-\hat\mu\bigr)^2 + \sum_{j=1}^m\bigl(X_j-\hat\mu\bigr)^2\Bigr).
$$
Under $H_1$, the MLEs are $\hat\mu_0 = \bar X_m$, $\hat\mu_1 = \bar Y_n$ and
$$
\hat\sigma_1^2 = \frac{1}{n+m}\Bigl(\sum_{i=1}^n\bigl(Y_i-\bar Y_n\bigr)^2 + \sum_{j=1}^m\bigl(X_j-\bar X_m\bigr)^2\Bigr).
$$
Hence the GLRT statistic is
$$
\lambda(X,Y)=\Bigl(\frac{\hat\sigma_0^2}{\hat\sigma_1^2}\Bigr)^{(n+m)/2}\,
\frac{\exp\Bigl(-\frac{1}{2\hat\sigma_1^2}\Bigl(\sum_{j=1}^m\bigl(X_j-\bar X_m\bigr)^2+\sum_{i=1}^n\bigl(Y_i-\bar Y_n\bigr)^2\Bigr)\Bigr)}
{\exp\Bigl(-\frac{1}{2\hat\sigma_0^2}\Bigl(\sum_{j=1}^m\bigl(X_j-\hat\mu\bigr)^2+\sum_{i=1}^n\bigl(Y_i-\hat\mu\bigr)^2\Bigr)\Bigr)}
=\Bigl(\frac{\hat\sigma_0^2}{\hat\sigma_1^2}\Bigr)^{(n+m)/2},
$$
since both exponents equal $-\frac{n+m}{2}$. Note that
$$
\sum_{i=1}^n\Bigl(Y_i - \frac{n}{n+m}\bar Y_n - \frac{m}{n+m}\bar X_m\Bigr)^2
= \sum_{i=1}^n\Bigl(Y_i-\bar Y_n + \frac{m}{n+m}\bigl(\bar Y_n-\bar X_m\bigr)\Bigr)^2
= \sum_{i=1}^n\bigl(Y_i-\bar Y_n\bigr)^2 + n\Bigl(\frac{m}{n+m}\Bigr)^2\bigl(\bar Y_n-\bar X_m\bigr)^2,
$$
and similarly
$$
\sum_{j=1}^m\bigl(X_j - \hat\mu\bigr)^2 = \sum_{j=1}^m\bigl(X_j-\bar X_m\bigr)^2 + m\Bigl(\frac{n}{n+m}\Bigr)^2\bigl(\bar Y_n-\bar X_m\bigr)^2.
$$
Then
$$
\frac{\hat\sigma_0^2}{\hat\sigma_1^2}
= 1 + \frac{\Bigl(n\bigl(\frac{m}{n+m}\bigr)^2 + m\bigl(\frac{n}{n+m}\bigr)^2\Bigr)\bigl(\bar Y_n-\bar X_m\bigr)^2}{\sum_{i=1}^n\bigl(Y_i-\bar Y_n\bigr)^2+\sum_{j=1}^m\bigl(X_j-\bar X_m\bigr)^2}
= 1 + \frac{\frac{nm}{n+m}\bigl(\bar Y_n-\bar X_m\bigr)^2}{n\hat\sigma^2(Y)+m\hat\sigma^2(X)},
$$
where $\bar X = \frac1m\sum_{i=1}^m X_i$, $\bar Y = \frac1n\sum_{i=1}^n Y_i$ and $\hat\sigma^2(X)$, $\hat\sigma^2(Y)$ are the corresponding estimators of the variances. Hence the GLRT is a monotone function of a statistic $T(X,Y)$ which compares the squared difference of the sample means to the pooled variance. A nontrivial issue is to find or approximate the distribution of $T(X,Y)$ under $H_0$, in order to find an appropriate critical value, etc.
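A hedged numerical sketch (not from the notes): the GLRT above is an increasing function of the square of the classical pooled two-sample t-statistic, which has the $\mathrm{Stt}(n+m-2)$ distribution under $H_0$; the data and group sizes below are assumptions of the illustration.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=25)                  # placebo group, m = 25
y = rng.normal(0.5, 1.0, size=30)                  # treated group, n = 30
m, n = len(x), len(y)

sp2 = (((x - x.mean()) ** 2).sum() + ((y - y.mean()) ** 2).sum()) / (n + m - 2)
T = (y.mean() - x.mean()) / np.sqrt(sp2 * (1 / n + 1 / m))   # ~ Stt(n+m-2) under H0
alpha = 0.05
c = t.ppf(1 - alpha / 2, df=n + m - 2)
print(T, c, abs(T) >= c)
```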
Example 8c5 (Test of independence). Given two samples $X_1,\dots,X_n$ and $Y_1,\dots,Y_n$ we would like to test whether they are independent or not. If the samples are Gaussian, the question of independence translates to the question of lack of correlation. Assume that the pairs $(X_i,Y_i)$ are i.i.d. and have the bivariate Gaussian j.p.d.f. (2b1):
$$
f(x,y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}
\exp\Bigl\{-\frac{1}{2(1-\rho^2)}\Bigl(\frac{(x-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2} + \frac{(y-\mu_2)^2}{\sigma_2^2}\Bigr)\Bigr\},
$$
where all the parameters $\theta = (\mu_1,\mu_2,\sigma_1,\sigma_2,\rho)$ are unknown and vary in the corresponding subspaces. We want to test the hypothesis $H_0:\rho=0$ against $H_1:\rho\ne 0$. The corresponding likelihood function is
$$
L(x,y;\theta) = \Bigl(\frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\Bigr)^n
\exp\Bigl\{-\frac{1}{2(1-\rho^2)}\Bigl(\frac{\sum_i(x_i-\mu_1)^2}{\sigma_1^2} - \frac{2\rho\sum_i(x_i-\mu_1)(y_i-\mu_2)}{\sigma_1\sigma_2} + \frac{\sum_i(y_i-\mu_2)^2}{\sigma_2^2}\Bigr)\Bigr\}.
$$
The unrestricted MLEs are
$$
\hat\mu_1 = \bar X_n,\quad \hat\mu_2 = \bar Y_n,\quad
\hat\sigma_1^2 = \frac1n\sum_{i=1}^n\bigl(X_i-\bar X_n\bigr)^2,\quad
\hat\sigma_2^2 = \frac1n\sum_{i=1}^n\bigl(Y_i-\bar Y_n\bigr)^2,\quad
\hat\rho = \frac{\frac1n\sum_{i=1}^n\bigl(X_i-\bar X_n\bigr)\bigl(Y_i-\bar Y_n\bigr)}{\hat\sigma_1\hat\sigma_2}.
$$
The resulting GLRT turns out to be a monotone function of $|\hat\rho|$, and it can be shown that under $H_0$
$$
T_n(X,Y) = \frac{\hat\rho\,\sqrt{n-2}}{\sqrt{1-\hat\rho^2}}
$$
has the Student distribution with $n-2$ degrees of freedom. Now the level $\alpha$ test can be readily obtained, by choosing an appropriate critical value.
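A minimal sketch of this correlation test (not from the notes; the data are synthetic and all names are assumptions of the illustration):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(3)
n = 50
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(size=n)                   # correlated pair, for illustration only

rho = np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())   # MLE of rho
T = rho * np.sqrt(n - 2) / np.sqrt(1 - rho ** 2)                       # ~ Stt(n-2) under H0
alpha = 0.05
c = t.ppf(1 - alpha / 2, df=n - 2)
print(rho, T, c, abs(T) >= c)
```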
Potentially we may encounter two difficulties when trying to apply the GLRT: first, as calculating the GLRT statistic reduces to an optimization problem, it can easily be quite involved technically and be challenging even numerically; second, even if a closed form statistic can be found, its distribution under $H_0$ might be hard to find and, consequently, it is not clear how to choose the critical value to achieve the required significance level.

Wilks' theorem resolves the second difficulty asymptotically as $n\to\infty$:
Theorem 8c6. Let $X_1,\dots,X_n$ be a sample from the density $f(x;\theta)$, satisfying the assumptions of Theorem 7g34 (consistency and asymptotic normality of the MLE, page 155). Let $\lambda_n(X)$ be the GLRT statistic
$$
\lambda_n(X) := \frac{\sup_{\theta\in\Theta} L_n\bigl(X;\theta\bigr)}{L_n\bigl(X;\theta_0\bigr)}
$$
for testing
$$
H_0:\theta=\theta_0\qquad\text{against}\qquad H_1:\theta\ne\theta_0.
$$
Then under $H_0$,
$$
2\log\lambda_n(X)\xrightarrow{\ d\ }\chi^2_1.
$$
Proof. (sketch) Since $\Theta$ is closed and the likelihood is continuous in $\theta$, $\sup_{\theta\in\Theta}L_n(X;\theta) = L_n(X;\hat\theta_n)$, where $\hat\theta_n$ is the MLE of $\theta$. Expanding into powers of $\hat\theta_n-\theta_0$ around $\hat\theta_n$ and using the continuity of the second derivative, we get
$$
2\log\lambda_n(X) = 2\log L_n(X;\hat\theta_n) - 2\log L_n(X;\theta_0)
= 2\,\partial_\theta\log L_n(X;\hat\theta_n)\bigl(\hat\theta_n-\theta_0\bigr) - \partial^2_\theta\log L_n(X;\theta_0)\bigl(\hat\theta_n-\theta_0\bigr)^2 + r_n(X),\qquad(8c2)
$$
where $r_n(X)$ is the remainder term, converging to zero in $\mathsf{P}_{\theta_0}$-probability as $n\to\infty$ (fill in the details!).

Since $\partial_\theta\log L_n(X;\hat\theta_n)=0$ whenever $\hat\theta_n$ lies in the interior of $\Theta$,
$$
\partial_\theta\log L_n(X;\hat\theta_n) = \partial_\theta\log L_n(X;\hat\theta_n)1_{\{\hat\theta_n\in\partial\Theta\}}.
$$
Consequently,
$$
\partial_\theta\log L_n(X;\hat\theta_n)\bigl(\hat\theta_n-\theta_0\bigr)
= 1_{\{\hat\theta_n\in\partial\Theta\}}\bigl(\hat\theta_n-\theta_0\bigr)\sum_{i=1}^n\ell'(X_i;\hat\theta_n)
= 1_{\{\hat\theta_n\in\partial\Theta\}}\bigl(\hat\theta_n-\theta_0\bigr)\Bigl(\sum_{i=1}^n\ell'(X_i;\theta_0)+\sum_{i=1}^n\ell''(X_i;\bar\theta_n)\bigl(\hat\theta_n-\theta_0\bigr)\Bigr),
$$
where $\ell(x;\theta)=\log f(x;\theta)$ and $|\bar\theta_n-\theta_0|\le|\hat\theta_n-\theta_0|$. Recall that $|\ell''(x;\theta)|\le h(x)$ for some $h(x)$ satisfying $\mathsf{E}_{\theta_0}h(X_1)<\infty$; hence by Slutsky's theorem
$$
1_{\{\hat\theta_n\in\partial\Theta\}}\Bigl(\frac1n\sum_{i=1}^n h(X_i)\Bigr)\bigl(\sqrt n(\hat\theta_n-\theta_0)\bigr)^2\longrightarrow 0,
$$
where the convergence holds, since the second term converges in probability to a constant by the LLN, the third term converges weakly since $(\hat\theta_n)$ is asymptotically normal with rate $\sqrt n$, and the first term converges to zero in probability, since $\hat\theta_n$ converges to $\theta_0$, which is in the interior of $\Theta$. Similarly,
$$
1_{\{\hat\theta_n\in\partial\Theta\}}\bigl(\hat\theta_n-\theta_0\bigr)\sum_{i=1}^n\ell'(X_i;\theta_0)
= 1_{\{\hat\theta_n\in\partial\Theta\}}\,\sqrt n\bigl(\hat\theta_n-\theta_0\bigr)\,\frac{1}{\sqrt n}\sum_{i=1}^n\ell'(X_i;\theta_0)
\xrightarrow{\ \mathsf{P}_{\theta_0}\ }0.
$$
Finally, the second term in (8c2) satisfies
$$
-\partial^2_\theta\log L_n(X;\theta_0)\bigl(\hat\theta_n-\theta_0\bigr)^2
= \Bigl(-\frac1n\sum_{i=1}^n\ell''(X_i;\theta_0)\Bigr)\bigl(\sqrt n(\hat\theta_n-\theta_0)\bigr)^2
\xrightarrow{\ d\ }\chi^2_1,
$$
where the convergence holds since the first factor converges to $I(\theta_0)$ by the LLN and $\sqrt n(\hat\theta_n-\theta_0)\xrightarrow{d}N\bigl(0,1/I(\theta_0)\bigr)$. Combining the above limits yields the claim. $\square$
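A hedged Monte Carlo check of Wilks' theorem (not from the notes). For the $N(\theta,1)$ model the GLRT has the closed form $2\log\lambda_n(X) = n(\bar X_n-\theta_0)^2$, so its null distribution can be compared with $\chi^2_1$ directly; the simulation sizes are assumed values.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
theta0, n, reps = 0.0, 200, 5000
xbar = rng.normal(theta0, 1.0, size=(reps, n)).mean(axis=1)
stat = n * (xbar - theta0) ** 2                    # 2*log(lambda_n) for this model
print(np.mean(stat >= chi2.ppf(0.95, df=1)))       # rejection rate under H0, close to 0.05
```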
Wilks' theorem extends to the multivariate case as follows. Suppose that we are given a sample $X_1,\dots,X_n$ from the p.d.f. $f(x;\theta)$, $\theta\in\Theta\subseteq\mathbb{R}^k$, $\dim(\Theta)=k$, and we want to test
$$
H_0: R(\theta) = 0\qquad\text{against}\qquad H_1: R(\theta)\ne 0,\qquad(8c3)
$$
where $R:\mathbb{R}^k\to\mathbb{R}^r$ is a smooth constraint function, so that $\Theta_0 := \{\theta: R(\theta)=0\}$ has dimension $k-r$. Let $L_n(X;\theta) = \prod_{i=1}^n f(X_i;\theta)$ and
$$
\lambda_n(X) = \frac{\sup_{\theta\in\Theta} L_n\bigl(X;\theta\bigr)}{\sup_{\theta\in\Theta_0} L_n\bigl(X;\theta\bigr)}.
$$
Then, under $H_0$ and appropriate regularity conditions,
$$
2\log\lambda_n(X)\xrightarrow{\ d\ }\chi^2_r,
$$
where $\chi^2_r$ is the chi-square distribution with $r$ degrees of freedom. Hence the number of degrees of freedom of the limit distribution is the dimension of the parameter space minus the dimension of the constrained space under $H_0$.
Wald's test and Rao's score test. Consider the problem of testing a simple null hypothesis $H_0:\theta=\theta_0$ against the two-sided composite alternative $H_1:\theta\ne\theta_0$, given the i.i.d. sample $X^n = (X_1,\dots,X_n)$. Wald's test statistic is
$$
W_n(X^n) := nI(\hat\theta_n)\bigl(\hat\theta_n-\theta_0\bigr)^2,
$$
where $\hat\theta_n$ is the MLE of $\theta$ and $I(\theta)$ is the Fisher information in $X_1$. Since under appropriate conditions the MLE is asymptotically normal with variance $1/\bigl(nI(\theta)\bigr)$, $W_n(X^n)$ converges in distribution to the $\chi^2_1$ random variable as $n\to\infty$, if $I(\theta)$ is a continuous function of $\theta$.

An alternative is Rao's score test statistic
$$
S_n(X^n) := \frac{\bigl(\partial_\theta\log L_n(X^n;\theta_0)\bigr)^2}{nI(\theta_0)},
$$
where $L_n(X^n;\theta)$ is the likelihood function of the sample. Under appropriate conditions on the density $f(x;\theta)$, $\mathsf{E}_\theta\,\partial_\theta\log f(X_1;\theta) = 0$ and $\mathsf{E}_\theta\bigl(\partial_\theta\log f(X_1;\theta)\bigr)^2 = I(\theta)$, and hence by the CLT
$$
S_n(X^n) = \frac{1}{I(\theta_0)}\Bigl(\frac{1}{\sqrt n}\sum_{i=1}^n\partial_\theta\log f(X_i;\theta_0)\Bigr)^2\xrightarrow{\ d\ }\chi^2_1.
$$
Both statistics extend to the multivariate testing problem (8c3). Wald's statistic takes the form
$$
W_n(X^n) := R(\hat\theta_n)^\top\Bigl(\nabla R(\hat\theta_n)^\top I_n^{-1}(\hat\theta_n)\nabla R(\hat\theta_n)\Bigr)^{-1}R(\hat\theta_n),
$$
where $\hat\theta_n$ is the unrestricted MLE of $\theta$ (i.e. the maximizer of the likelihood over $\Theta$), $\nabla R(\theta)$ is the gradient matrix of $R$ and $I_n(\theta) = n\mathsf{E}_\theta\,\nabla_\theta\log f(X_1;\theta)\nabla_\theta\log f(X_1;\theta)^\top$ is the Fisher information matrix. Rao's score statistic is
$$
S_n(X^n) := \nabla_\theta\log L_n\bigl(X^n;\hat\theta_n^{(0)}\bigr)^\top I_n^{-1}\bigl(\hat\theta_n^{(0)}\bigr)\nabla_\theta\log L_n\bigl(X^n;\hat\theta_n^{(0)}\bigr),
$$
where $\hat\theta_n^{(0)}$ is the restricted MLE of $\theta$ (i.e., over $\Theta_0$).

It can be shown that both statistics converge in distribution to the $\chi^2_r$ random variable under $H_0$, just as the GLRT statistic. Note that Wald's and Rao's tests require calculation of only the unrestricted and restricted MLEs respectively, while the GLRT requires both.
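A hedged sketch (not from the notes) comparing the three asymptotically equivalent statistics in the Bernoulli model with $H_0:\theta=\theta_0$; the data, $n$ and $\theta_0$ are assumptions of the illustration.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
theta0, n = 0.5, 200
x = rng.binomial(1, 0.58, size=n)
s, th = x.sum(), x.mean()                          # S_n and the MLE

wald  = n * (th - theta0) ** 2 / (th * (1 - th))                  # n*I(MLE)*(MLE - theta0)^2
score = (s - n * theta0) ** 2 / (n * theta0 * (1 - theta0))       # (score at theta0)^2 / (n*I(theta0))
lrt   = 2 * (s * np.log(th / theta0) + (n - s) * np.log((1 - th) / (1 - theta0)))
print(wald, score, lrt, chi2.ppf(0.95, df=1))      # all three compared with the chi^2_1 quantile
```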
Consider an i.i.d. sample $X_1,\dots,X_n$ from a distribution $p=(p_1,\dots,p_k)$ on the finite set $\{1,\dots,k\}$ and the problem of testing
$$
H_0: p = q\qquad\text{against}\qquad H_1: p\ne q\qquad(8d1)
$$
for a given p.m.f. $q$ with positive entries. The GLRT statistic is
$$
\lambda_n(X) = \frac{\sup_{p\in\mathbb{S}^{k-1}} p_1^{S_n(1)}\cdots p_k^{S_n(k)}}{q_1^{S_n(1)}\cdots q_k^{S_n(k)}},
$$
where $S_n(i)$ is the number of sample points falling into the $i$-th category. The supremum can be found by the Lagrange multipliers: the Lagrangian is
$$
\ell(p,\lambda) = \sum_{i=1}^k S_n(i)\log p_i + \lambda\Bigl(1-\sum_{i=1}^k p_i\Bigr),
$$
and taking the derivatives w.r.t. $p_i$ and equating them to zero gives
$$
S_n(i) = \lambda p_i.
$$
Summing up over $i$ gives $\lambda = n$ and hence the MLEs are given by (check the conditions for maximum)
$$
\hat p_i = S_n(i)/n = \bar X_n(i).
$$
Hence (on the event $\cap_{i=1}^k\{\bar X_n(i)>0\}$)
$$
\log\lambda_n(X) = -n\sum_{i=1}^k\bar X_n(i)\bigl(\log q_i - \log\bar X_n(i)\bigr).
$$
Under $H_0$, $\bar X_n(i)\xrightarrow{\mathsf{P}_q}q_i$, and a second order expansion of the logarithm gives
$$
2\log\lambda_n(X) = -2n\sum_{i=1}^k\bigl(q_i-\bar X_n(i)\bigr) + n\sum_{i=1}^k\frac{\bigl(q_i-\bar X_n(i)\bigr)^2}{\bar X_n(i)} + r_n.
$$
The first term on the right vanishes, since $\sum_i\bar X_n(i) = \sum_i q_i = 1$, and the residual term can be seen to converge to zero in probability under $H_0$. Since
$$
\Theta = \mathbb{S}^{k-1} := \Bigl\{q\in\mathbb{R}^k: q_i\ge 0,\ \sum_{i=1}^k q_i = 1\Bigr\},
$$
$\dim(\Theta) = k-1$ and hence by the multivariate Wilks theorem the second term converges weakly to the $\chi^2_{k-1}$ distribution. For large $n$, the second term is the dominant one and hence it makes sense to use it as the test statistic for the problem, ignoring other terms.

To recap, the so called Pearson's $\chi^2$ statistic
$$
\chi^2(S_n) := \sum_{i=1}^k\frac{\bigl(S_n(i)-nq_i\bigr)^2}{S_n(i)}\qquad(8d2)
$$
can be used to test the problem (8d1), and the critical value of the level $\alpha$ test can be approximated for large $n$ by the $(1-\alpha)$-quantile of the $\chi^2_{k-1}$ distribution.
When the null hypothesis specifies the cell probabilities only up to a $d$-dimensional parameter, the unknown parameter is replaced by its MLE under $H_0$; denoting by $\hat p_i$ the resulting estimates of the cell probabilities, a similar expansion applies:
$$
\log\lambda_n(X) = \sum_{i=1}^k S_n(i)\bigl(\log(S_n(i)/n) - \log\hat p_i\bigr)
= n\sum_{i=1}^k\bigl(S_n(i)/n-\hat p_i\bigr) + \frac{n^2}{2}\sum_{i=1}^k\frac{\bigl(S_n(i)/n-\hat p_i\bigr)^2}{S_n(i)} + r_n.
$$
As before, the first term vanishes and the residual term converges to zero in probability under $H_0$, so that
$$
2\log\lambda_n(X) = n^2\sum_{i=1}^k\frac{\bigl(S_n(i)/n-\hat p_i\bigr)^2}{S_n(i)} + r_n.
$$
Moreover, since under $H_0$, $S_n(i)/n-\hat p_i = \bigl(S_n(i)/n-p_i\bigr)+\bigl(p_i-\hat p_i\bigr)\to 0$ in probability,
$$
n^2\sum_{i=1}^k\frac{\bigl(S_n(i)/n-\hat p_i\bigr)^2}{S_n(i)}
= \sum_{i=1}^k\frac{\bigl(S_n(i)-n\hat p_i\bigr)^2}{S_n(i)}
= \sum_{i=1}^k\frac{\bigl(S_n(i)-n\hat p_i\bigr)^2}{n\hat p_i} + \tilde r_n,
$$
and the statistic
$$
\chi^2(S_n) := \sum_{i=1}^k\frac{\bigl(S_n(i)-n\hat p_i\bigr)^2}{n\hat p_i}\qquad(8d3)
$$
converges to the $\chi^2_{(k-1)-d}$ distribution under $H_0$, where $d$ is the number of the estimated parameters. The main application of this version of the Pearson's $\chi^2$-test is testing independence in contingency tables.
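A minimal sketch of the goodness-of-fit version (8d2) of the test (not from the notes). The counts are borrowed from the track table of Problem 8.13 below; the code uses the expected-count denominator, which is asymptotically equivalent to the observed-count denominator in (8d2).

```python
import numpy as np
from scipy.stats import chi2

counts = np.array([29, 19, 18, 25, 17, 10, 15, 11])    # observed cell counts S_n(i)
q = np.full(len(counts), 1 / len(counts))              # H0: uniform cell probabilities
n = counts.sum()

stat = ((counts - n * q) ** 2 / (n * q)).sum()
c = chi2.ppf(0.95, df=len(counts) - 1)                 # chi^2_{k-1} critical value, alpha = 0.05
print(stat, c, stat >= c)
```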
Example 8d2. Suppose we want to test whether smoking and lung cancer are related. For this purpose, we may survey $n$ individuals from the population for their smoking habits and disease history. If the individuals are assumed i.i.d., the sufficient statistic is the vector of counters of each one of the four cases, which can be conveniently summarized in the following contingency table:

smoking/cancer   no    yes
no               S00   S01
yes              S10   S11

Denote by $p_{ij}$, $i,j\in\{0,1\}$, the corresponding probabilities (e.g. $p_{00}$ is the probability that an individual doesn't smoke and didn't develop cancer, etc.) and by $p_{i\cdot}$ and $p_{\cdot j}$ the marginal probabilities (e.g. $p_{1\cdot} = p_{10}+p_{11}$ is the probability that a sampled individual smokes). We want to test
$$
H_0: p_{ij} = p_{i\cdot}p_{\cdot j}\ \text{for all }i,j\qquad\text{against}\qquad H_1: p_{ij}\ne p_{i\cdot}p_{\cdot j}\ \text{for some }i,j.
$$
More generally, we may consider testing the above hypotheses for the contingency table of the form

         x1    x2   ...   xc    total
y1       S11   S12  ...   S1c   S1.
y2       S21   S22  ...   S2c   S2.
...      ...   ...  ...   ...   ...
yr       Sr1   Sr2  ...   Src   Sr.
total    S.1   S.2  ...   S.c   n

where $c$ and $r$ are the numbers of columns and rows respectively, and $S_{i\cdot} = \sum_{j=1}^c S_{ij}$ and $S_{\cdot j} = \sum_{i=1}^r S_{ij}$. A calculation reveals that the MLE of $p_{ij}$ under $H_0$ equals $\bar X_{i\cdot}\bar X_{\cdot j} := (S_{i\cdot}/n)(S_{\cdot j}/n)$ and the statistic (8d3) reads
$$
\chi^2 = \sum_{i=1}^r\sum_{j=1}^c\frac{\bigl(S_{ij}-n\bar X_{i\cdot}\bar X_{\cdot j}\bigr)^2}{n\bar X_{i\cdot}\bar X_{\cdot j}}.
$$
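A hedged sketch of this independence test for an $r\times c$ table (not from the notes; the table is a made-up example). The degrees of freedom follow from (8d3): $k-1 = rc-1$ cells minus the $d = (r-1)+(c-1)$ estimated marginals gives $(r-1)(c-1)$.

```python
import numpy as np
from scipy.stats import chi2

S = np.array([[120,  90],
              [ 60, 130]])                          # S_ij: observed counts
n = S.sum()
row = S.sum(axis=1) / n                             # Xbar_i.
col = S.sum(axis=0) / n                             # Xbar_.j
E = n * np.outer(row, col)                          # n * Xbar_i. * Xbar_.j

stat = ((S - E) ** 2 / E).sum()
df = (S.shape[0] - 1) * (S.shape[1] - 1)
print(stat, chi2.ppf(0.95, df=df), stat >= chi2.ppf(0.95, df=df))
```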
Goodness of fit testing can also be based directly on the empirical c.d.f. Given a sample $X_1,\dots,X_n$ from an unknown c.d.f. $F$, we want to test
$$
H_0: F = F_0\qquad\text{against}\qquad H_1: F\ne F_0\qquad(8d4)
$$
for a given continuous c.d.f. $F_0$. Define the empirical c.d.f.
$$
F_n(x) = \frac1n\sum_{i=1}^n 1_{\{X_i\le x\}},\qquad x\in\mathbb{R}.
$$
Clearly, $F_n$ is a legitimate discrete c.d.f., which is random due to its dependence on the sample. By the LLN,
$$
F_n(x)\xrightarrow{\ \mathsf{P}_F\ }F(x),\qquad x\in\mathbb{R},
$$
where $\mathsf{P}_F$ denotes the probability law of the data, corresponding to the c.d.f. $F$. Remarkably, a stronger, uniform over $x$, result holds:
Theorem 8d3 (Glivenko-Cantelli). Assume $X_1,\dots,X_n$ are i.i.d., then
$$
\sup_{x\in\mathbb{R}}\bigl|F_n(x)-F(x)\bigr|\xrightarrow{\ \mathsf{P}_F\ }0.
$$
These results, whose proof is beyond our scope, hint at a uniform over $x\in\mathbb{R}$ version of the CLT:

Theorem 8d5 (Kolmogorov-Smirnov). Assume $X_1,\dots,X_n$ are i.i.d. with continuous distribution, then
$$
\sqrt n\,D_n := \sqrt n\,\sup_{x\in\mathbb{R}}\bigl|F_n(x)-F(x)\bigr|\xrightarrow{\ d_F\ }K,
\qquad
\mathsf{P}(K\le x) = 1-2\sum_{i=1}^\infty(-1)^{i-1}e^{-2i^2x^2},\quad x>0,
$$
where $K$ has the Kolmogorov distribution.
The latter theorem can be applied to test (8d4): given the sample, calculate the statistic $\sqrt n D_n$ (calculating the supremum is easy in this case, since $x\mapsto F_0(x)$ increases and $F_n(x)$ is piecewise constant) and reject $H_0$ if $\sqrt n D_n\ge c(\alpha)$, where $c(\alpha)$ is chosen as the $(1-\alpha)$-quantile of the Kolmogorov distribution.

Remark 8d6. It is not hard to find the distribution of $D_n$ for any fixed $n$; however, for large $n$ the computations become quite involved and the Kolmogorov-Smirnov asymptotic is often preferred.

A computationally appealing feature of the statistic $D_n$ is that its limit distribution does not depend on $F_0$, i.e. it is asymptotically distribution-free. This property should be expected and, in fact, the statistic $D_n$ is distribution-free for each $n\ge 1$. To see this, recall that if $F$ is a continuous distribution on $\mathbb{R}$ and $X_1\sim F$, then $F(X_1)\sim U([0,1])$. Hence under $H_0$,
$$
\sup_{x\in\mathbb{R}}\bigl|F_n(x)-F_0(x)\bigr|
= \sup_{x\in\mathbb{R}}\Bigl|\frac1n\sum_{i=1}^n 1_{\{X_i\le x\}}-F_0(x)\Bigr|
= \sup_{x\in\mathbb{R}}\Bigl|\frac1n\sum_{i=1}^n 1_{\{F_0(X_i)\le F_0(x)\}}-F_0(x)\Bigr| =
$$
$$
\sup_{u\in[0,1]}\Bigl|\frac1n\sum_{i=1}^n 1_{\{F_0(X_i)\le u\}}-u\Bigr|
= \sup_{u\in[0,1]}\Bigl|\frac1n\sum_{i=1}^n 1_{\{U_i\le u\}}-u\Bigr|,
$$
where $U_i := F_0(X_i)$ are i.i.d. $U([0,1])$ random variables.

For any fixed $x\in[0,1]$, $\sum_{i=1}^n 1_{\{U_i\le x\}}\sim\mathrm{Bin}(n,x)$. Hence by the classical CLT,
$$
\sqrt n\bigl(F_n(x)-F_0(x)\bigr)\xrightarrow{\ d\ }N\bigl(0,x(1-x)\bigr).
$$
(Recall also the multivariate CLT: for i.i.d. random vectors with mean $\mu$ and covariance matrix $S$, $\sqrt n\bigl(\bar X_n-\mu\bigr)\xrightarrow{d}N(0,S)$.)
Similarly, for $x_1,x_2\in[0,1]$ (writing $U_i = F_0(X_i)$ as before and $x_1\wedge x_2 := \min(x_1,x_2)$),
$$
\mathrm{cov}\bigl(F_n(x_1),F_n(x_2)\bigr)
= \frac{1}{n^2}\mathsf{E}_{F_0}\sum_{i,j}1_{\{U_i\le x_1\}}1_{\{U_j\le x_2\}} - x_1x_2
= \frac{1}{n^2}\Bigl((n^2-n)x_1x_2 + n(x_1\wedge x_2)\Bigr) - x_1x_2
= \frac1n\bigl(x_1\wedge x_2 - x_1x_2\bigr).
$$
Hence by the multivariate CLT,
$$
\sqrt n\Bigl(F_n(x_1)-F_0(x_1),\ F_n(x_2)-F_0(x_2)\Bigr)\xrightarrow{\ d\ }N\bigl(0,C(x_1,x_2)\bigr),
\qquad
C(x_1,x_2) = \begin{pmatrix}
x_1(1-x_1) & x_1\wedge x_2 - x_1x_2\\
x_1\wedge x_2 - x_1x_2 & x_2(1-x_2)
\end{pmatrix}.
$$
Similarly, one can check that the vector with entries $\sqrt n\bigl(F_n(x)-F_0(x)\bigr)$, $x\in\{x_1,\dots,x_n\}$, converges weakly to a zero mean Gaussian vector with the covariance matrix with the entries $x_i\wedge x_j - x_ix_j$, $i,j\in\{1,\dots,n\}$.

More sophisticated mathematics, namely the functional CLT, shows that the whole function $\sqrt n\bigl(F_n(x)-F_0(x)\bigr)$, $x\in\mathbb{R}$, converges weakly to a zero mean Gaussian process $V(x)$ with the correlation function $\mathsf{E}V(x)V(y) = x\wedge y - xy$, $x,y\in[0,1]$. The process with this correlation function is called the Brownian bridge and the Kolmogorov distribution is the one of $\sup_{x\in[0,1]}|V(x)|$, which can be found using appropriate techniques.

Similarly, it can be shown that the Cramér-von Mises statistic converges weakly to
$$
H = \int_0^1 V^2(x)\,dx.
$$
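A minimal sketch of the Kolmogorov-Smirnov test of $H_0: F = F_0$ with $F_0 = N(0,1)$ (not from the notes; the sample is synthetic). The asymptotic critical value is the $(1-\alpha)$-quantile of the Kolmogorov distribution, available in scipy as `kstwobign`.

```python
import numpy as np
from scipy.stats import norm, kstwobign

rng = np.random.default_rng(6)
x = np.sort(rng.normal(size=100))
n = len(x)

# sup_x |F_n(x) - F0(x)| is attained at the jump points of the piecewise-constant F_n
F0 = norm.cdf(x)
Dn = np.max(np.maximum(np.arange(1, n + 1) / n - F0, F0 - np.arange(0, n) / n))
c = kstwobign.ppf(0.95)                             # (1-alpha)-quantile of sup|V(t)|
print(np.sqrt(n) * Dn, c, np.sqrt(n) * Dn >= c)
```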
197
Note that the FWER is precisely the type I error of the combined test $\delta = \max_{i=1,\dots,m}\delta_i$ in the problem
$$
H_0: (\theta_1,\dots,\theta_m)\in\Theta_0 = \Theta_{01}\times\dots\times\Theta_{0m}
\qquad\text{against}\qquad
H_1: (\theta_1,\dots,\theta_m)\notin\Theta_0,
$$
where $\Theta_{0i}$ and $\Theta_{1i}$ are the parameter subspaces, corresponding to the null hypothesis and the alternative in the $i$-th problem. If the samples in the problems are independent, then for $\theta\in\Theta_0$,
$$
\mathrm{FWER} = \mathsf{E}_\theta\delta = \mathsf{E}_\theta\max_i\delta_i = 1-\mathsf{E}_\theta\prod_{i=1}^m(1-\delta_i) = 1-(1-\alpha)^m,
$$
where $\alpha$ is the common size of the individual tests. Hence, to achieve the desired control $\alpha$ over the FWER, the sizes of the individual tests are to be corrected to
$$
1-(1-\alpha)^{1/m}.
$$
The latter simple formula is called Sidak's correction. For example, if $m=20$ and $\mathrm{FWER}=0.05$ is needed, then the individual size $\approx 0.0026$ must be chosen. Requiring such small sizes in the tests may significantly decrease their powers and consequently the familywise power.

If the samples are dependent, the Bonferroni correction suggests to replace the individual sizes with $\alpha/m$, in which case
$$
\mathrm{FWER} = \mathsf{P}_\theta\Bigl(\bigcup_{i=1}^m\{\delta_i=1\}\Bigr)\le m\,\mathsf{E}_\theta\delta_i = m\cdot\alpha/m = \alpha.
$$
Note that if the samples are in fact independent, the Bonferroni correction yields smaller individual sizes and hence worse overall power. More sophisticated approaches to controlling the FWER exist.
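A tiny numerical check of the two corrections (not from the notes; $m$ and the target FWER are assumed values):

```python
m, fwer = 20, 0.05
sidak = 1 - (1 - fwer) ** (1 / m)                  # individual size, Sidak correction
bonferroni = fwer / m                              # individual size, Bonferroni correction
print(sidak, bonferroni, 1 - (1 - sidak) ** m)     # the last number recovers the target FWER
```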
False Discovery Rate. Suppose we want to identify the genes which are related to a certain disease. For a single given gene such a relation can be tested, e.g., by Pearson's $\chi^2$ test as in Example 8d2. In practice, a very large number of genes is screened and it makes sense to seek a screening procedure which makes as few erroneous detections as possible. To this end, consider the quantities summarized in the following table:

                        # of true H0i's   # of false H0i's   total
# of rejected H0i's            V                  S             R
# of accepted H0i's            U                  T            m-R
total                         m0               m-m0             m

Table 1. The standard terminology of FDR

Note that the random variables $V$, $S$, $U$ and $T$ are not observable, while $R$ is. The false discovery rate (FDR) is defined as the expected ratio of false discoveries (erroneous rejections of the null hypotheses) over the total number of discoveries:
$$
\mathrm{FDR} = \mathsf{E}\frac VR 1_{\{R>0\}} = \mathsf{E}\frac{V}{R\vee 1}.
$$
198
8. HYPOTHESIS TESTING
Here $\mathsf{E}$ is the expectation with respect to the probability $\mathsf{P}$, under which $m_0$ out of $m$ null hypotheses are true. (There are $\binom{m}{m_0}$ such probabilities, corresponding to the combinations of indices of the true null hypotheses, but this will not play a role in the calculations to be presented below due to independence.)

Remark 8e1. Note that if $m_0 = m$, i.e. all the null hypotheses are true, $V = R$ and $\mathrm{FWER} = \mathsf{P}(V>0) = \mathsf{P}(R>0)$. Hence in this case
$$
\mathrm{FDR} = \mathsf{E}\frac{R}{R\vee 1} = \mathsf{P}(R>0) = \mathrm{FWER}.
$$
The following algorithm, proposed by Y. Benjamini and Y. Hochberg, controls the FDR at any desired level $\alpha\in(0,1)$:
(1) Order the p-values $p_i$ of the individual tests in the increasing order and denote the ordered sequence by $p_{(i)}$, $i=1,\dots,m$.
(2) Find the largest index $k$ such that $p_{(k)}\le\frac km\alpha$.
(3) Reject the null hypotheses corresponding to the $k$ smallest p-values $p_{(1)},\dots,p_{(k)}$.

Proposition 8e2. If the p-values of the true null hypotheses are independent and $U([0,1])$-distributed, the Benjamini-Hochberg procedure satisfies
$$
\mathrm{FDR}\le\frac{m_0}{m}\alpha\le\alpha.
$$
The proof below uses the notions of martingales and stopping times.
For example, the first passage time $\tau_a := \min\{j: X_j\ge a\}$ of a sequence $(X_j)$, adapted to a filtration $(\mathcal F_j)$, is a stopping time. Indeed, $\{\tau_a>j\} = \cap_{i=1}^j\{X_i<a\}$ and, since $(X_j)$ is adapted to $(\mathcal F_j)$, we have $\{X_i<a\}\in\mathcal F_j$ for all $i\le j$; hence $\{\tau_a>j\}\in\mathcal F_j$ and therefore $\{\tau_a\le j\}\in\mathcal F_j$ for all $j$'s. Convince yourself that, for example, the last passage time $s_a = \max\{j: X_j\ge a\}$ is not a stopping time.
Example 8e4. Let $(\xi_j)$ be a sequence of i.i.d. equiprobable random signs, i.e. $\mathsf{P}(\xi_1=\pm 1) = 1/2$, and define the simple random walk
$$
S_j = \sum_{i=1}^j\xi_i,\qquad j\ge 1,
$$
and $S_0 = 0$. Since for each $j$, all $\xi_i$'s with $i\le j$ can be recovered from $S_1,\dots,S_j$ and vice versa, the natural filtrations of $(\xi_j)$ and $(S_j)$ coincide. Denote this filtration by $(\mathcal F_j)$. The sequence $(\xi_j)$ is not a martingale w.r.t. $(\mathcal F_j)$, since for $j>i$, $\mathsf{E}(\xi_j|\mathcal F_i) = \mathsf{E}\xi_j = 0\ne\xi_i$. However, $(S_j)$ is a martingale w.r.t. $(\mathcal F_j)$:
$$
\mathsf{E}(S_j|\mathcal F_i) = S_i + \mathsf{E}\Bigl(\sum_{\ell=i+1}^j\xi_\ell\Bigm|\mathcal F_i\Bigr) = S_i,\qquad j\ge i.
$$
The martingale property implies that $\mathsf{E}M_j = \mathsf{E}M_0$ and one may expect that $\mathsf{E}M_\tau = \mathsf{E}M_0$ holds for Markov times as well. This is true, at least for martingales on finite intervals:

Theorem 8e5 (a rudiment of the Optional Sampling Theorem). Let $(M_j)$, $j=0,1,\dots,n$, be a martingale with respect to a filtration $(\mathcal F_j)$ and let $\tau$ be a stopping time with values in $\{0,\dots,n\}$. Then
$$
\mathsf{E}M_\tau = \mathsf{E}M_0.
$$
Proof. Since $\sum_{i=0}^n 1_{\{\tau=i\}} = 1$,
$$
\mathsf{E}M_\tau = \mathsf{E}M_0 1_{\{\tau=0\}} + \mathsf{E}\sum_{i=1}^n M_i\bigl(1_{\{\tau\le i\}}-1_{\{\tau\le i-1\}}\bigr)
= \mathsf{E}M_0 1_{\{\tau=0\}} + \sum_{i=1}^n\Bigl(\mathsf{E}M_i 1_{\{\tau\le i\}}-\mathsf{E}\,\mathsf{E}\bigl(M_i 1_{\{\tau\le i-1\}}\mid\mathcal F_{i-1}\bigr)\Bigr) =
$$
$$
\mathsf{E}M_0 1_{\{\tau=0\}} + \sum_{i=1}^n\Bigl(\mathsf{E}M_i 1_{\{\tau\le i\}}-\mathsf{E}1_{\{\tau\le i-1\}}\mathsf{E}\bigl(M_i\mid\mathcal F_{i-1}\bigr)\Bigr)
= \mathsf{E}M_0 1_{\{\tau=0\}} + \sum_{i=1}^n\Bigl(\mathsf{E}M_i 1_{\{\tau\le i\}}-\mathsf{E}M_{i-1}1_{\{\tau\le i-1\}}\Bigr) =
$$
$$
\mathsf{E}M_0 1_{\{\tau=0\}} + \mathsf{E}M_n 1_{\{\tau\le n\}} - \mathsf{E}M_0 1_{\{\tau\le 0\}} = \mathsf{E}M_n = \mathsf{E}M_0.\qquad\square
$$
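A hedged simulation of Theorem 8e5 (not from the notes) for the simple random walk: with the bounded stopping time $\tau = \tau_5\wedge n$ (first visit to level 5, capped at $n$), the empirical mean of $S_\tau$ should be close to $\mathsf{E}S_0 = 0$; the simulation sizes are assumed values.

```python
import numpy as np

rng = np.random.default_rng(7)
n, a, reps = 100, 5, 20000
steps = rng.choice([-1, 1], size=(reps, n))
S = steps.cumsum(axis=1)                            # S_1, ..., S_n for each replication

hits = (S >= a)
idx = np.where(hits.any(axis=1), hits.argmax(axis=1), n - 1)   # column of S_tau (stop at n if no hit)
S_tau = S[np.arange(reps), idx]
print(S_tau.mean())                                 # close to 0 = E S_0, as the theorem predicts
```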
Remark 8e6. The Optional Sampling Theorem remains valid under more general conditions. For example, if $\tau$ is a stopping time with $\mathsf{E}\tau<\infty$ and $(M_j)$ is a martingale with
$$
\mathsf{E}\bigl(|M_{j+1}-M_j|\bigm|\mathcal F_j\bigr)\le C,\qquad\mathsf{P}\text{-a.s. for all }j,
$$
then $\mathsf{E}M_\tau = \mathsf{E}M_0$. Do not think, however, that the latter property holds without appropriate assumptions. For example, if $(S_j)$ is the simple random walk as above and
$$
\tau_1 = \inf\{j\ge 0: S_j = 1\},
$$
then obviously $S_{\tau_1} = 1$ and thus $\mathsf{E}S_{\tau_1} = 1\ne\mathsf{E}S_0 = 0$. It can be seen that $\mathsf{E}\tau_1 = \infty$ in this case.

In the proof to follow, we shall need similar results and notions for continuous time processes, i.e. families of random variables indexed by real numbers, rather than sequences of random variables indexed by integers. While the analogous theory is similar in spirit, it is much more technically involved.
Proof of Proposition 8e2. Let $I$ be the set of true null hypotheses, $|I| = m_0$, and for any $t\in[0,1]$ define the total number of hypotheses rejected at level $t$,
$$
r(t) = \#\{i: p_i\le t\},
$$
and the number of erroneously rejected hypotheses at level $t$,
$$
v(t) = \#\{i\in I: p_i\le t\}.
$$
Let $\mathrm{fdr}(t)$ be the (random) proportion of false discoveries:
$$
\mathrm{fdr}(t) = \frac{v(t)}{r(t)}1_{\{r(t)>0\}} = \frac{v(t)}{r(t)\vee 1},\qquad t\in[0,1].
$$
Note that $r(p_{(i)}) = i$ and hence
$$
\Bigl\{i: p_{(i)}\le\frac im\alpha\Bigr\} = \Bigl\{i: \frac mi p_{(i)}\le\alpha\Bigr\}
= \Bigl\{i: \frac{m\,p_{(i)}}{r(p_{(i)})}\le\alpha\Bigr\}
= \Bigl\{i: \frac{m\,p_{(i)}}{r(p_{(i)})\vee 1}\le\alpha\Bigr\}
=: \bigl\{i: Q(p_{(i)})\le\alpha\bigr\},\qquad(8e1)
$$
where
$$
Q(t) = \frac{mt}{r(t)\vee 1},\qquad t\in[0,1].
$$
Define $\tau = \sup\bigl\{t\in[0,1]: Q(t)\le\alpha\bigr\}$, and note that
$$
k = \max\Bigl\{i: p_{(i)}\le\frac im\alpha\Bigr\} = \max\bigl\{i: Q(p_{(i)})\le\alpha\bigr\} = \max\bigl\{i: p_{(i)}\le\tau\bigr\},
$$
where the latter holds since $Q(t)$ has negative jumps at the $p_{(i)}$'s and increases otherwise (sketch a plot). Since both $r(t)$ and $v(t)$ are constant between the $p_{(i)}$'s, it follows that $v(p_{(k)}) = v(\tau)$ and $r(p_{(k)}) = r(\tau)$. For the BH procedure (with $p_{(0)}:=0$)
$$
\frac{V}{R\vee 1} = \frac{v(p_{(k)})}{r(p_{(k)})\vee 1} = \frac{v(\tau)}{r(\tau)\vee 1} = \frac{v(\tau)}{\tau}\,\frac{Q(\tau)}{m}\le\frac\alpha m\,\frac{v(\tau)}{\tau},
$$
and hence
$$
\mathrm{FDR} = \mathsf{E}\frac{V}{R\vee 1}\le\frac\alpha m\,\mathsf{E}M(\tau),\qquad(8e2)
$$
where we defined $M(t) = v(t)/t$. Let $(\mathcal F_t)$, $t\in[0,1]$, be the filtration of $\sigma$-algebras generated by the events $\{p_i\le t\}$. Since the $p_i$'s are independent and $p_i\sim U([0,1])$ for $i\in I$, for $s\le t$ and $i\in I$
$$
\mathsf{E}\bigl(1_{\{p_i\le s\}}\mid\mathcal F_t\bigr) = \mathsf{E}\bigl(1_{\{p_i\le s\}}\mid 1_{\{p_i\le t\}}\bigr)
= \mathsf{P}(p_i\le s\mid p_i\le t)1_{\{p_i\le t\}} = \frac st 1_{\{p_i\le t\}},
$$
and hence
$$
\mathsf{E}\bigl(M(s)\mid\mathcal F_t\bigr) = \mathsf{E}\Bigl(\frac1s\sum_{i\in I}1_{\{p_i\le s\}}\Bigm|\mathcal F_t\Bigr)
= \frac1s\sum_{i\in I}\frac st 1_{\{p_i\le t\}} = M(t).
$$
Clearly $M(t)$ is measurable with respect to $\mathcal F_t$ for all $t\in[0,1]$, and hence $M(t)$ is a martingale in reversed time. Moreover, in reversed time, $\tau$ is the first hitting time of the level $\alpha$ by the process $Q(t)$, which is adapted to the time reversed filtration, and hence $\tau$ is a stopping time with respect to the time reversed filtration. Then by the appropriate Optional Sampling Theorem
$$
\mathsf{E}M(\tau) = \mathsf{E}M(1) = \mathsf{E}v(1)/1 = m_0.
$$
Plugging this equality into (8e2), we obtain the required claim. $\square$
The recent text [4] is a good starting point for further exploration of the large scale inference
problems.
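A hedged sketch of the Benjamini-Hochberg procedure itself (not from the notes; the p-values are synthetic and the split into true and false nulls is an assumption of the illustration):

```python
import numpy as np

rng = np.random.default_rng(8)
alpha, m = 0.1, 50
p = np.concatenate([rng.uniform(size=40),           # 40 true nulls: uniform p-values
                    rng.beta(1, 20, size=10)])      # 10 false nulls: small p-values

order = np.argsort(p)
p_sorted = p[order]
thresh = alpha * np.arange(1, m + 1) / m            # the BH staircase i*alpha/m
below = np.nonzero(p_sorted <= thresh)[0]
k = below[-1] + 1 if below.size else 0              # k = max{i : p_(i) <= i*alpha/m}
rejected = sorted(order[:k])                        # indices of the rejected hypotheses
print(k, rejected)
```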
Exercises
Problem 8.1. Consider the problem of testing the hypothesis $H_0: p=1/2$ against the alternative $H_1: p>1/2$, given a sample $X_1,\dots,X_n$ from $\mathrm{Ber}(p)$ ($n$ independent tosses of a coin).
(1) Suggest a reasonable level $\alpha$ test, using the CLT approximation.
(2) Calculate the power of the test at $p=3/4$.
(3) For $n=30$, $\alpha=0.05$ find the critical region $C$ of the test and evaluate your answers to the previous questions.
(4) Find the approximate value of $n$ so that the power in (2) is greater than $0.99$.
(5) Argue why the approximation in (1) is well justified by the CLT, while (2) is not (consult the footnotes in Example 8a4).
C2 = [1, ),
C4 = R \ {1}.
Calculate the level and nd the power functions of the corresponding tests.
Problem 8.3. In each one of the following problems, find the N-P test based on the i.i.d. sample $X = (X_1,\dots,X_n)$ and simplify it as much as possible:
(1) $H_0: X\sim N(0,1)$ against $H_1: X\sim N(2,1)$
(2) $H_0: X\sim\chi^2(2)$ against $H_1: X\sim\chi^2(6)$
(3) $H_0: X\sim\exp(2)$ against $H_1: X\sim\exp(3)$
Find the power of the corresponding level $\alpha=0.05$ tests.

Problem 8.5 (Slope detection problem). Let $X_i\sim N(\theta i,1)$, $i=1,\dots,10$, be independent r.v.s. Find the level $\alpha$ N-P test for testing $H_0:\theta=0$ against $H_1:\theta=1$.
Problem 8.6.
(1) Find the N-P level $\alpha$ test, based on a single sample $X$, for
$$
H_0: X\sim\exp(1)\qquad\text{against}\qquad H_1: X\sim f(x) = \sqrt{\tfrac2\pi}\,e^{-x^2/2}\,I\bigl(x\in(0,\infty)\bigr).
$$
(2) Does the test with the critical region $C = (0.8, 1.2)$ yield maximal power in the above problem? Find its power and level.
Consider an observation from the distribution with the c.d.f.
$$
F(x;\theta) = \begin{cases}
0, & x<\theta,\\
\tfrac14(x-\theta)^2, & x\in[\theta,\theta+2),\\
1, & x\ge\theta+2,
\end{cases}\qquad\theta\in\mathbb{R}.
$$
(1) Find the most powerful level $\alpha$ test for the problem
$$
H_0:\theta=\theta_0\qquad\text{against}\qquad H_1:\theta=\theta_1
$$
with $\theta_1>\theta_0$.
(2) Specify your test for $\theta_0=0$, $\theta_1=1$ and $\alpha=0.36$. Calculate its power.
$[-1,1]$.
(1) Find the most powerful level $\alpha$ test for testing $H_0:\theta=0$ against $H_1:\theta=1$.
(2) To test the hypotheses $H_0:\theta\le 0$ against $H_1:\theta>0$, consider the critical region $C = \{x\in\mathbb{R}: x\ge 1/2\}$. Calculate the corresponding power function and significance level.
(3) Find the UMP level $\alpha$ test in the previous question (or argue that it doesn't exist).
Problem 8.11. Let $X\sim\mathrm{Bin}(n,\theta)$ and consider the hypothesis testing problem
$$
H_0:\theta=1/2\qquad\text{against}\qquad H_1:\theta\ne 1/2.
$$
(1) Show that the GLRT has the form $\{|2X-n|>c\}$.
(2) Use the normal approximation to find the test with level $\alpha=0.05$, when $n=36$.
(3) Show that under $H_0$, approximately $2\log\lambda(X)\sim\chi^2_1$.
Problem 8.12. Let $X_1,\dots,X_n\sim N(\mu,\sigma^2)$ and consider testing $H_0:\sigma=\sigma_0$ against $H_1:\sigma\ne\sigma_0$.
(1) Find the GLRT statistic and show that it is a function of
$$
U = \frac{\sum_{i=1}^n\bigl(X_i-\bar X_n\bigr)^2}{\sigma_0^2}.
$$
(2) Show that the acceptance region of the GLRT has the form $\{c_1<U<c_2\}$.
(3) Show that a level $\alpha$ test is obtained if $F(c_2)-F(c_1) = 1-\alpha$, where $F$ is the c.d.f. of the $\chi^2_{n-1}$ distribution.
(4) Show that $c_1-c_2 = n\log(c_1/c_2)$ (Hint: use the result in (2)).
Problem 8.13. Horse racing fans claim that on a circular hippodrome the outer track is disadvantageous compared to the inner track. Suppose that the hippodrome has 8 circular tracks, numbered from 1 to 8, where the 8-th track has the largest radius. Test whether the position of the track affects the win rate, summarized in the following table:

track      1   2   3   4   5   6   7   8
win rate  29  19  18  25  17  10  15  11

Table 2. Track position versus win rate

Hint: test the null hypothesis, under which the win rate has uniform distribution, against the alternative, under which the win rate has a general multinomial distribution.
Problem 8.14. A random sample of families with 6 children is summarized in the following table:

number of girls   0   1    2    3    4   5   6
frequency        13  68  154  185  136  68  16

Table 3. Frequency table

(1) Test the hypothesis that the number of girls has the $\mathrm{Bin}(6,1/2)$ distribution. Calculate the p-value of the test.
(2) Solve the previous question for the hypothesis $\mathrm{Bin}(6,p)$ with unknown $p$.
Problem 8.15. Let $X_1,\dots,X_n$ and $Y_1,\dots,Y_m$ be independent Gaussian samples with unknown means and variances $\sigma_x^2$ and $\sigma_y^2$, and consider testing $H_0:\sigma_x^2=\sigma_y^2$ against $H_1:\sigma_x^2\ne\sigma_y^2$.
(2) Show that the GLRT statistic has the form
$$
\lambda(X,Y) = \Bigl(\frac{\hat\sigma^2}{\hat\sigma_x^2}\Bigr)^{n/2}\Bigl(\frac{\hat\sigma^2}{\hat\sigma_y^2}\Bigr)^{m/2},
$$
where $\hat\sigma^2$ is the MLE under $H_0$ and $\hat\sigma_x^2$ and $\hat\sigma_y^2$ are the MLEs under $H_1$.
(3) Show that
$$
\lambda(X,Y) = C\,\frac{\Bigl(1+\frac{n-1}{m-1}T\Bigr)^{(n+m)/2}}{\Bigl(\frac{n-1}{m-1}T\Bigr)^{n/2}} =: f(T),
$$
where $C$ is a constant and
$$
T(X,Y) = \frac{\frac{1}{n-1}\sum_{i=1}^n\bigl(X_i-\bar X_n\bigr)^2}{\frac{1}{m-1}\sum_{i=1}^m\bigl(Y_i-\bar Y_m\bigr)^2}.
$$
(4) Plot $f$ and show that the critical regions $C_k = \{\lambda(x,y)>k\}$ are equivalent to the critical regions
$$
C_{k_1,k_2} = \{T(x,y)<k_1\}\cup\{T(x,y)>k_2\}.
$$
Problem 8.16 (Bayes hypothesis testing). Assuming a prior $\pi$ on $\Theta$, derive the Bayes test for the problem of testing $H_0:\theta\in\Theta_0$ against $H_1:\theta\in\Theta_1$, which minimizes the sum of the errors of the two types. Compare your result to the GLRT.
Exams/solutions
a. 2009/2010 (A) 52303
Problem 1. (The sunrise problem of Laplace)
What is the probability that the sun will rise tomorrow? In this problem we shall explore how the question was answered by P.-S. Laplace in the 18th century. Suppose that $n$ days ago (Laplace literally took $n$ to be the number of days from the origin according to the Bible) a random variable $R$ was sampled from the uniform distribution $U([0,1])$ and since then the sun rises each day with probability $R$. More precisely, let $X_i$ be the indicator of the sunrise on the $i$-th morning and assume that $X_1,X_2,\dots$ are conditionally independent r.v.s given $R$, with $\mathrm{Ber}(R)$ distribution: for any $n\ge 1$,
$$
\mathsf{P}(X_1=x_1,\dots,X_n=x_n\mid R) = \prod_{i=1}^n R^{x_i}(1-R)^{1-x_i},\qquad x\in\{0,1\}^n.
$$
For example,
$$
\mathsf{P}(X_1=1,X_2=1) = \mathsf{E}\,\mathsf{P}(X_1=1,X_2=1\mid R) = \mathsf{E}R^2 = \int_0^1 u^2\,du = 1/3,
$$
while
$$
\mathsf{P}(X_1=1) = \mathsf{E}\,\mathsf{P}(X_1=1\mid R) = \mathsf{E}R = 1/2,
$$
which implies
$$
1/3 = \mathsf{P}(X_1=1,X_2=1)\ne\mathsf{P}(X_1=1)\mathsf{P}(X_2=1) = 1/4,
$$
i.e. $X_1$ and $X_2$ are not independent.
(3) Find the conditional distribution of $S_n(X) = \sum_{i=1}^n X_i$, given $R$.

Solution
Since $X_i$'s are conditionally independent $\mathrm{Ber}(R)$ r.v.s, given $R$, their sum has conditionally Binomial distribution, given $R$ (just as if $R$ were a number). Indeed, as $X_i$'s are Bernoulli r.v.s, $S_n(X)$ takes values in $\{0,\dots,n\}$ and the conditional j.p.m.f. of the vector $X$, given $R$, is
$$
p_{X|R}(x;R) = \prod_{i=1}^n R^{x_i}(1-R)^{1-x_i},\qquad x\in\{0,1\}^n.
$$
Note that for all binary vectors $x$ such that $S_n(x)=k$, $p_{X|R}(x;R) = R^k(1-R)^{n-k}$, and there are $\binom nk$ such strings. Hence
$$
\mathsf{P}(S_n=k\mid R) = \mathsf{P}\Bigl(\sum_{i=1}^n X_i = k\Bigm|R\Bigr) = \binom nk R^k(1-R)^{n-k}.
$$

(4) Find the p.m.f. of $S_n(X)$:
$$
\mathsf{P}(S_n(X)=k) = \ ...?
$$
Hint: use the tower property of conditional expectations.

Solution
$S_n(X)$ has uniform distribution on $\{0,\dots,n\}$:
$$
\mathsf{P}(S_n(X)=k) = \mathsf{E}\,\mathsf{P}(S_n(X)=k\mid R) = \binom nk\mathsf{E}R^k(1-R)^{n-k}
= \frac{n!}{k!(n-k)!}\int_0^1 u^k(1-u)^{n-k}\,du = \frac{n!}{k!(n-k)!}\,\frac{k!(n-k)!}{(n+1)!} = \frac{1}{n+1}.
$$

(5) Prove the following Bayes formula: for a random variable $\xi$ with $\mathsf{E}|\xi|<\infty$ and a discrete random variable $\eta$ (taking integer values), show that
$$
\mathsf{E}(\xi\mid\eta=k) = \frac{\mathsf{E}\xi I(\eta=k)}{\mathsf{P}(\eta=k)}.
$$
Solution
Denote
$$
\psi(k) := \frac{\mathsf{E}\xi I(\eta=k)}{\mathsf{P}(\eta=k)}.
$$
For any bounded function $h$,
$$
\mathsf{E}\xi h(\eta) = \sum_k h(k)\mathsf{E}\xi I(\eta=k)
$$
and
$$
\mathsf{E}\psi(\eta)h(\eta) = \sum_k\psi(k)h(k)\mathsf{P}(\eta=k)
= \sum_k h(k)\frac{\mathsf{E}\xi I(\eta=k)}{\mathsf{P}(\eta=k)}\mathsf{P}(\eta=k)
= \sum_k h(k)\mathsf{E}\xi I(\eta=k).
$$
These two equalities verify the orthogonality property and thus prove the Bayes formula.

(6) Find $\mathsf{E}\bigl(R\mid S_n(X)\bigr)$.

Solution
Following the hint, we obtain:
$$
\mathsf{E}\bigl(R\mid S_n(X)=k\bigr) = \frac{\mathsf{E}R\,I(S_n(X)=k)}{\mathsf{P}(S_n(X)=k)}.
$$
Further,
$$
\mathsf{E}R\,I(S_n(X)=k) = \mathsf{E}R\,\mathsf{E}\bigl(I(S_n=k)\mid R\bigr) = \mathsf{E}R\,\mathsf{P}(S_n=k\mid R)
= \mathsf{E}R\binom nk R^k(1-R)^{n-k} = \binom nk\mathsf{E}R^{k+1}(1-R)^{n-k} =
$$
$$
\binom nk\int_0^1 u^{k+1}(1-u)^{n-k}\,du = \frac{n!}{k!(n-k)!}\,\frac{(k+1)!(n-k)!}{(n+2)!} = \frac{k+1}{(n+2)(n+1)},
$$
and since $\mathsf{P}(S_n(X)=k) = \frac{1}{n+1}$,
$$
\mathsf{E}\bigl(R\mid S_n(X)=k\bigr) = \frac{k+1}{n+2}.
$$

(7) Calculate the probability that the sun will rise tomorrow:
$$
\mathsf{P}(X_{n+1}=1\mid X_n=1,\dots,X_1=1).
$$
Solution
By the tower property and the conditional independence of the $X_i$'s, given $R$,
$$
\mathsf{P}(X_{n+1}=1\mid X_1=\dots=X_n=1) = \mathsf{E}\bigl(R\mid X_1=\dots=X_n=1\bigr) = \mathsf{E}\bigl(R\mid S_n(X)=n\bigr) = \frac{n+1}{n+2}.
$$
Problem 2.
A circle $C$ is drawn on a rectangular sheet of paper. Alice and Bob want to estimate the area $A$ of the circle and suggest two different approaches. Alice exploits the fact that the sheet of paper is rectangular and the length of its side is known precisely (denote it by $b$). She suggests to sample i.i.d. random vectors from the uniform distribution over the sheet and to estimate $A$ by the proportion of the vectors falling inside the circle.

(1) Show that if $X$ and $Y$ are i.i.d. r.v. with $U([0,b])$ distribution, then the random vector $(X,Y)$ has uniform distribution (i.e. constant j.p.d.f.) on the planar rectangle $[0,b]\times[0,b]$.

Solution
For $x,y\in[0,b]$ the j.c.d.f. of $(X,Y)$ is given by
$$
F_{X,Y}(x,y) = \mathsf{P}(X\le x,Y\le y) = \mathsf{P}(X\le x)\mathsf{P}(Y\le y) = \frac{1}{b^2}xy,
$$
and thus the j.p.d.f. vanishes outside the rectangle and otherwise is given by
$$
f_{XY}(x,y) = \frac{\partial^2}{\partial x\partial y}F_{X,Y}(x,y) = \frac{1}{b^2}.
$$

(2) Alice generates i.i.d. random vectors $(X_1,Y_1),\dots,(X_n,Y_n)$ and writes down $Z_i = I\bigl((X_i,Y_i)\in C\bigr)$, $i=1,\dots,n$. Specify a statistical model, parameterized by $A$ (the area of the circle), which supports this experiment. Find the minimal sufficient statistic.

Solution
$Z_i$'s are i.i.d. Bernoulli r.v.s with $\mathsf{P}_A(Z_i=1) = A/b^2$. Since the sheet is rectangular, the largest circle which is possible in this problem has area $A_{\max} := \frac\pi4 b^2$, i.e. the parametric space is $(0,A_{\max})$. As we already know, $\bar Z_n$ is the minimal sufficient statistic for this model.

(3) Alice's estimator of $A$ is $\hat A_n(Z) = b^2\bar Z_n$. Is it an unbiased estimator? If yes, is it UMVUE? Calculate the corresponding MSE risk.

Solution
$\mathsf{E}_A\hat A_n(Z) = b^2 A/b^2 = A$ and hence $\hat A_n$ is unbiased. It is also UMVUE by the L-S theorem ($\bar Z_n$ is a complete sufficient statistic for the Bernoulli model). The MSE risk is the variance of $b^2\bar Z_n$, i.e. $b^4\,\frac1n\,\frac{A}{b^2}\bigl(1-\frac{A}{b^2}\bigr) = \frac1n A\bigl(b^2-A\bigr)$.

Bob suggests the more traditional method: to measure the diameter of the circle with a ruler and calculate its area by means of the formula $A(D) = \frac\pi4 D^2$. Since his measurement is noisy, he repeats it $n$ times and assumes that his data $X_1,\dots,X_n$ is a sample from $N(D,\sigma^2)$, where $\sigma^2$ is known. He suggests to estimate $A$ by the method of moments: $\hat A_n(X) = \frac\pi4\bar X_n^2$.

(4) Is $\hat A_n(X)$ unbiased? If not, calculate the bias and suggest an unbiased estimator $\hat A^u_n(X)$ on the basis of $\hat A_n(X)$.

Solution
The estimator $\hat A_n$ is biased: recall that $\bar X_n\sim N(D,\sigma^2/n)$ and thus
$$
\mathsf{E}_D\frac\pi4\bar X_n^2 = \frac\pi4\bigl(D^2+\sigma^2/n\bigr) = A + \frac\pi4\,\frac{\sigma^2}{n},
$$
i.e. $b(A,\hat A_n) = \frac{\pi\sigma^2}{4n}$.
212
EXAMS/SOLUTIONS
n D)2 + 2D(X
n D) + D2
=
varD (X
16
(
)
2
n D)2 + 2D(X
n D)
=
varD (X
16
(
)
(
)
2
2
n D)2 + 2covD (X
n D)2 , 2D(X
n D) +
=
varD (X
16
16
(
)
2
n D) .
varD 2D(X
16
2
213
Xi = a(n + 1)
E
i=0
214
EXAMS/SOLUTIONS
Note that X0 N (0, 1) and Xn (a, 1) independently of (these are the days
at which the acidity is known precisely). Since the sample is i.i.d., the conditional
distribution of X given T (X) = (X1 , ..., Xn1 ) trivially does not depend on :
P(X0 x0 , ..., Xn xn |X1 , ..., Xn1 ) = I(X1 x1 )...I(Xn1 xn1 )(x0 )(xn a),
where is N (0, 1) c.d.f.
(4) Is T (X) = (X1 , ..., Xn1 ) minimal sucient ?
Solution
We have
)n+1
(
(
)
1
n
n
1 2 1 2
1
1
2
exp
Ln (x; ) =
xi
xi + a
xi (n + 1)a =
2
2
2
2
i=0
i=
i=
)
)n+1
(
(
n
n
1 2
1
1
2
exp
xi + a
xi (n + 1)a .
2
2
2
i=0
i=
Let x and y be vectors in Rn+1 , then
)
(
n
n
Ln (x; )
1 2
2
(xi yi )
= exp
(xi yi ) + a
Ln (y; )
2
i=0
i=
is not a function of if and only if xi = yi for all i = 1, ..., n 1. Hence T (X) is the
minimal statistic.
(5) For n = 2, nd the MLE of .
Solution
Note that Ln (x; ) is maximal if and only if
(x; ) :=
i=
1
xi + a2
2
(X;
2)
1 X1 21 a2
(X)
=
=
2 otherwise
2 otherwise
215
i=0
the estimator
Xi =
a = (n + 1)a,
i=
(X)
=n+1
Xi
a
n
i=0
is unbiased.
k {1, 2, ...}.
1)
On the event {X1 = 1}, the maximum is attained at (1)
= 0 and otherwise (X
solves
)
(
1
1
= 0,
(X1 1) log + log(1 ) = (X1 1)
1
which yields 1 = 1 1/X1 .
(2) Show that the plug-in estimator23 1 := 1 /(1 1 ) is unbiased for () = /(1 ).
Explain the contradiction with the non-existence claim above.
Solution
1
(1)2
23since () = /(1 ) is a one-to-one function, := /(1
216
EXAMS/SOLUTIONS
1=
.
1
1
1 ) is a
This does not contradict the above non-existence result, since the statistic (X
function of the whole sequence (Zi )i1 , rather than a xed number of Zi s.
(3) Is the obtained estimator above UMVUE ? Calculate its MSE risk.
Solution
Geo(1) distribution belongs to the one parameter exponential family with c() =
log , whose range clearly has a non-empty interior. Hence the sucient statistic X1 is
1 ) is the UMVUE. Its MSE risk is given by
complete and by the L-S theorem, (X
(
)
1 ) = var (X1 ) =
var (X
.
(1 )2
(4) Calculate the C-R lower bound for the MSE risk of unbiased estimators of /(1 ).
Is C-R bound attained ?
Solution
The Fisher information is
)
2 (
(
1
1 )
IX1 () = E 2 log X1 1 (1 ) = E
(X1 1)
=
1
(
)1
) ( 1
1
1
1
1
E (X1 1) 2
1
=
+
=
.
2
2
2
(1 )
1
(1 )
(1 )2
The C-R bound for unbiased estimators of () = /(1 ) is given by:
( )2 (
)2
()
1
CR() =
=
(1 )2 =
,
I()
(1 )2
(1 )2
which coincides with the risk of UMVUE, i.e. the C-R bound is attained in this case.
(5) Encouraged by his progress, the statistician suggests to count the tosses till m 1
zeros occur:
j
{
}
Xm = min j :
(1 Zi ) = m .
i=1
km
217
Solution
The event {Xm = k} occurs if and only if the last toss yields zero and there are
m 1 zero tosses among k 1 tosses. The binomial coecient is the number of strings
with m 1 zeros among k 1 bits, and the claimed formula follows by independence.
(6) Find the MLE m of on the basis of Xm and the corresponding plug-in estimator
m of ()
Solution
We have
)
Xm 1
log L(Xm ; ) = log
+ (Xm m) log + m log(1 ).
m1
m ) = 0, otherwise it solves
Again, if Xm = m, then (X
1
m
(Xm m) =
,
1
which gives:
m
m = 1
.
Xm
Hence
m =
Xm
1.
m
j = 2, ..., m
218
EXAMS/SOLUTIONS
i=1 i
P (1 = k1 , ..., m = km ) =
P (Z1 ...Zk1 1 = 1, Zk1 = 0, ..., Zkm1 +1 Zkm 1 = 1, Zkm = 0) =
(
)
(
P Z1 ...Zk1 1 = 1, Zk1 = 0 ... P Zkm1 +1 Zkm 1 = 1, Zkm = 0) .
|
{z
} |
{z
}
Geo(1)
Geo(1)
Problem 2.
It is required to estimate the expected weight of the fish in a pond, using the sample of fishes caught by a net. The weight of a fish in the pond is assumed to be a r.v. $X\sim U([0,\theta])$, $\theta>0$. Let $Z$ be a random variable taking value 1 if the fish is captured by the net and 0 otherwise. It is known that smaller fishes are less likely to be caught by the net and a statistician assumes
$$
\mathsf{P}_\theta(Z=1\mid X) = \frac X\theta.
$$
(1) Find $\mathsf{P}_\theta(Z=1)$.

Solution
$$
\mathsf{P}_\theta(Z=1) = \mathsf{E}_\theta\,\mathsf{P}_\theta(Z=1\mid X) = \mathsf{E}_\theta\frac X\theta = \frac{\theta/2}{\theta} = 1/2.
$$
(2) Prove that the conditional p.d.f of X, given the event {Z = 1}, is
f (x; ) :=
2x
d
P (X x|Z = 1) = 2 I(x [0, ])
dx
Solution
We have
P (X x|Z = 1) =
P (X x, Z = 1)
,
P (Z = 1)
219
where
P (X x, Z = 1) = E I(X x)I(Z = 1) =
(
)
(
)
E I(X x)E I(Z = 1)|X = E I(X x)P Z = 1|X =
X
1
1
E I(X x) =
I(s x)s ds.
Hence for x ,
1
P (X x, Z = 1) = 2
sds =
0
1 ( x )2
,
2
( x )2
x [0, ].
(3) The statistician suggests that the weights $Y_1,\dots,Y_n$ of the $n$ caught fishes are i.i.d. r.v.s sampled from $f(x;\theta)$ from the previous question. Explain how this statistical model fits the inference problem at hand and elaborate on the assumptions it is based upon.

Solution
This model takes into account that smaller fishes are less likely to be included in the sample: the fact that a fish is caught tilts the distribution of its weight towards higher values, or more precisely, we observe random variables with the c.d.f. $\mathsf{P}_\theta(X\le x\mid Z=1)$, rather than the original uniform c.d.f. Moreover, the i.i.d. assumption means that the weights of the fishes are i.i.d. $U([0,\theta])$ and that the probability of catching a fish depends only on its own weight and not on the weights of the others:
$$
\mathsf{P}(Z_i=1\mid X_1,X_2,\dots) = \mathsf{P}(Z_i=1\mid X_i).
$$
220
EXAMS/SOLUTIONS
i=1
Since ni=1 (xi /yi ) does not depend on , the latter is independent of if and only if
the ratio of indicators is independent of . Now minimality follows exactly as for the
U ([0, ]) case (see the lecture notes).
(6) Find the MLE for (twice the expected weight). Is it a function of the sucient
statistic from the previous question ?
Solution
Since 1/2n is a decreasing function of , the maximum of L(Y ; ) is attained at
n (Y ) = max Yi .
i
E Y1 =
s
0
2
2 3
2
= ,
sds
=
2
2 3
3
25This question was replaced by Find two non-equivalent sucient statistics in the exam. One obvious
sucient statistic is all the data Y1 , ..., Yn and the other one is maxi Yi , as discussed in the solution
221
To this end,
P ( max Yi ) = P (max Yi ) =
i
i=1
P (Yi ) =
( )2n
(
)2n n
= 1
0,
> 0.
Problem 3.
A cat is following a mouse on the real line. The current position of the mouse is a random variable $X\sim N(\mu,\sigma^2)$ (with known $\mu$ and $\sigma^2$). The cat is old and hence it doesn't see the mouse clearly: it observes $Y = X+Z$, where $Z\sim N(0,1)$ is the noise, independent of $X$.

(1) The cat tries to predict the position of the mouse by the simple predictor $g_c(Y) = Y$. Calculate the corresponding MSE.

Solution
The MSE of $g_c(Y)$ is $\mathsf{E}(Y-X)^2 = \mathsf{E}Z^2 = 1$.
(2) Find cats optimal (in the MSE sense) predictor gc (Y ) of the mouses position X
Solution
The optimal predictor is the conditional expectation
gc (Y ) = E(X|Y ) = +
2
(Y ).
2 + 1
222
EXAMS/SOLUTIONS
4
2
=
.
2 + 1
2 + 1
(4) The cat crawls to the predicted position of the mouse gc (Y ), so that the mouse cannot
see him. Hence the mouse has to predict the position of the cat gc (Y ), when it knows
(X) of the cats
only its own position X. What is the mouses optimal predictor gm
position ?
Solution
The cats optimal predictor is
(
)
)
2 (
2
gm
(Y )X = + 2
E(Y |X) .
(X) = E + 2
+1
+1
Since X and Z are independent, E(Y |X) = E(X + Z|X) = E(X|X) + E(Z|X) = X
and thus
)
2 (
gm
(X) = + 2
X .
+1
E gc (Y ) gm
(X) .
Compare it to the cats MSE from (3).
Solution
The MSE of the mouses predictor is
(
(
)2
E gc (Y ) gm
(X) = E +
))2
2
2 (
(Y
=
2 + 1
2 + 1
(
)2
(
)2
(
)2
(
)2
( )2
2
2
2
E Y X =
E Z =
.
2 + 1
2 + 1
2 + 1
223
E.g. if X Ber(1/2) and Z has p.d.f. f (x) with zero mean and unit variance,
then, as saw,
E(X|Y ) = P(X = 1|Y ) =
f (Y 1)
,
f (Y 1) + f (Y )
26this question was excluded from the nal version of the exam
224
EXAMS/SOLUTIONS
k {1, 2, ...}.
1)
On the event {X1 = 1}, the maximum is attained at (1)
= 0 and otherwise (X
solves
)
(
1
1
(X1 1) log + log(1 ) = (X1 1)
= 0,
1
which yields = 1 1/X1 .
(2) Find the MLE of () := /(1 ).
Hint: () is a one-to-one function of (0, 1).
Solution
The MLE is =
11/X1
1/X1
= X1 1.
(3) Show that MLE is unbiased. Explain the contradiction with the non-existence
claim above.
Solution
1 ) = X1 1 is an unbiased estimator of /(1 ):
(X
1 ) = E X1 1 =
E (X
1
1=
.
1
1
1 ) is a
This does not contradict the above non-existence result, since the statistic (X
function of the whole sequence (Zi )i1 , rather than a xed number of Zi s.
27E =
1
1
(1)2
225
(4) Is the obtained estimator above UMVUE ? Calculate its MSE risk.
Solution
Geo(1) distribution belongs to the one parameter exponential family with c() =
log , whose range clearly has a non-empty interior. Hence the sucient statistic X1 is
1 ) is the UMVUE. Its MSE risk is given by
complete and by the L-S theorem, (X
(
)
1 ) = var (X1 ) =
var (X
.
(1 )2
(5) Calculate the C-R lower bound for the MSE risk of unbiased estimators of /(1 ).
Is C-R bound attained ?
Solution
The Fisher information is
)
2 (
(
1 )
1
IX1 () = E 2 log X1 1 (1 ) = E
(X1 1)
=
1
)1
) ( 1
(
1
1
1
1
1
=
+
=
.
E (X1 1) 2
2
2
2
(1 )
1
(1 )
(1 )2
The C-R bound for unbiased estimators of () = /(1 ) is given by:
( )2 (
)2
()
1
CR() =
=
,
(1 )2 =
2
I()
(1 )
(1 )2
which coincides with the risk of UMVUE, i.e. the C-R bound is attained in this case.
(6) Encouraged by his progress, the statistician suggests to count the tosses till m 1
zeros occur:
n
{
}
Xm = min n :
(1 Zi ) = m .
i=1
km
Solution
The event {Xm = k} occurs if and only if the last toss yields zero and there are
m 1 zero tosses among k 1 tosses. The binomial coecient is the number of strings
with m 1 zeros among k 1 bits, and the claimed formula follows by independence.
226
EXAMS/SOLUTIONS
)
Xm 1
log L(Xm ; ) = log
+ (Xm m) log + m log(1 ).
m1
m ) = 0, otherwise it solves
Again, if Xm = m, then (X
1
m
(Xm m) =
,
1
which gives:
m) = 1 m .
(X
Xm
Hence
=
Xm
1.
m
(8) Do MLEs from the previous question form a consistent sequence of estimators of /(1
) ?
j = 2, ..., m
m
= i=1 i and i s are independent:
P (1 = k1 , ..., m = km ) =
P (Z1 ...Zk1 1 = 1, Zk1 = 0, ..., Zkm1 +1 Zkm 1 = 1, Zkm = 0) =
(
)
(
P Z1 ...Zk1 1 = 1, Zk1 = 0 ... P Zkm1 +1 Zkm 1 = 1, Zkm = 0) .
|
{z
} |
{z
}
Geo(1)
Geo(1)
227
(9) Find the asymptotic rate and the asymptotic error distribution of the MLEs found
above
Hint: The hint from the previous question applies
Solution
Note that E Xm /m = E m = 1/(1 ). Then by the CLT:
(
1 ) d
) (
m Xm /m 1
= m m
N (0, var (1 )),
1
1 m
where var (1 ) = /(1 )2 .
Problem 2. A factory started production of a new series of electric lamps and it is required to
estimate their mean lifetime as soon as possible. For this purpose, the statistician suggests the
following experiment: N lamps are powered on for a known period of a hours, during which the
lifetimes of the burnt out lamps are recorded for the purpose of estimation.
Let n be the number of the burnt out lamps and assume that the lamp lifetime has p.d.f.
f (u; ), u R+ , where is the unknown parameter.
(1) Assuming that n 1, the statistician claims that the obtained data X1 , ..., Xn is a
sample from the p.d.f.
ft (x; ) =
where F (x; ) =
x
0
f (x; )
I(x [0, a]),
F (a; )
Solution
The p.d.f. ft (x; ) is the density of the truncated c.d.f. P( x| a), where
f (x; ). This corresponds to the fact that an observed lifetime is reported only if
it is less than a, i.e. is conditioned on the event { a}.
(2) Is the model identiable if the lifetime has U ([0, ]) distribution with > 0 ?
Solution
For U ([0, ]) with > a,
ft (x; ) =
1
I(x
[0, ])
a
1
I(x [0, min(a, )]).
a
Thus if the parameter space = R+ , ft (x; ) ceases to depend on for all large
enough, which means that the model is not identiable.
228
EXAMS/SOLUTIONS
n exp ni=1 xi
) ,
n (
Ln (x; ) =
a
i=1 1 e
x Rn+ ,
which by the F-N factorization theorem implies that Sn (x) = ni=1 xi is a sucient
statistic. This likelihood clearly belongs to one parameter exponential family with
c() = , whose range has a nonempty interior. Hence Sn (X) is complete and thus
minimal sucient.
(4) Suggest a consistent sequence of estimators for 1/ (the mean lifetime).
Hint: consider the method of moments estimator based on the rst two moments:
1
a
E X1 = a
e 1
a2 + 2a 1
2
E X12 = 2 a
e 1
Solution
Let = 1/ for brevity, then combining the two equalities and eliminating the term
1, we get
1
E X1
,
=
2
2
a + 2
2 E X1
which, solved for , gives:
ea
aE X1 E X12
.
a 2E X1
2n (X) :=
Since by the law of large numbers m
1n (X) := n1 ni=1 Xi E X1 and m
n
1
2
2
i=1 Xi E X1 in P -probability, the estimators
n
n (X) =
2n
am
1n m
a 2m
1n
229
Unfortunately, none of the lamps in the experiment burnt out, i.e. n = 0 has been
realized. Thus the model above is not applicable. However the statistician doesnt give
up and suggests another approach as follows.
(5) Let Zi be a random variable, taking value 1 if the i-th lamp among N ones, used in
the experiment, does not burn out by a hours and 0 otherwise. Specify the statistical
model, which assumes that all N lamps used in the experiment are i.i.d. Exp() r.v.s
Solution
The model (P )R+ is given by the j.p.m.f. of i.i.d. Ber(ea ) r.v.s.
(6) Suggest a consistent sequence of estimators of 1/, based on Z1 , ..., ZN , N 1.
Hint: Pay attention that (ZN )N 1 is a consistent sequence of estimators of .
Solution
Note that ZN ea in P -probability. Since x 7 1/ log(1/x) is a continuous
function on x > 0,
a
1
P
N (Z) :=
,
log Z1 N
N
(
)S (u) (
)nSn (u)
E T (Z) =
T (u) ea n
1 ea
u{0,1}N
max |T (u)|,
u{0,1}N
Problem 3.
Let Xi = i + i , i = 1, ..., n where i s are i.i.d. N (0, 2 ) with known 2 > 0. The vector
1 , ..., n is a signal to be detected in the noisy observations (X1 , ..., Xn ).
230
EXAMS/SOLUTIONS
(1) Let (i )i1 be a deterministic sequence of nonzero real numbers and assume that i =
i , i.e. the received signal has a precisely known shape. Construct the most powerful
level test for
H0 : i = 0, for all i {1, ..., n}
,
H1 : i = i , for all i {1, ..., n}
assuming that 2 is known.
Solution
We are faced with the problem of testing a simple hypothesis against a simple
alternative, for which the N-P test is the most powerful. Denote by 1:n = (1 , ..., n )
and let 01:n := 01:n and 11:n := 1:n . Since 2 > 1 is known, the likelihood ratio is
L(x; 01:n )
= exp
L(x; 11:n )
)
(
)
n
n
n
n
1 2
1 2
1
1
2
(xi i ) + 2
xi = exp
xi i 2
i .
2
2
2
2 2
2
i=1
i=1
i=1
i=1
}
Xi i c .
i=1
Note
nthat2 under H0 , the test statistic is Gaussian with zero mean variance Bn :=
2
i=1 i , hence the level test is obtained with the critical value solving the equation:
n
(
( )
)
= P0
Xi i / Bn c/ Bn = 1 c/ Bn ,
i=1
which yields
c() =
v
u n
u
1
1
Bn (1 ) = (1 )t
i2 .
i=1
(2) Assume now that i = i , where > 0 is the unknown amplitude. Is your test from
the previous question applicable for testing:
H0 : i = 0,
for all i {1, ..., n}
.
H1 : i = i , with > 0 for all i {1, ..., n}
If yes, is it UMP ?
Solution
231
)2
1
2 2
(
exp
)
n
1
2
2
(xi i ) =
2
i=1
(
)2
(
)
n
n
n
1
1 2
2 2
exp 2
xi + 2
xi i 2
i ,
2
2
2 2
i=1
i=1
i=1
which
n is one parametric exponential family with the canonical sucient statistic T (x) =
i=1 xi i and monotonically increasing c() = . Hence by the K-R theorem, the N-P
level test is UMP in this case.
(3) Formulate the sucient and necessary conditions on the sequence (i )i1 , so that the
power of your test at any > 0 in the previous question converges to 1 as n ?
Solution
The power function of the N-P test is given by:
(, ) = P
P
((
n
|i=1
(
n
Xi i
Bn
)
(1 )
i=1
)
)
Xi i Bn / Bn 1 (1 ) Bn / Bn =
{z
N (0,1)
(
)
1 1 (1 ) Bn .
The power of the test converges to one for any > 0 if and only if 29 ni=1 i2
as n . Note that the latter condition means that the signal waveform must not
converge to zero too fast: if it does, the test is not consistent, i.e. its power cannot be
made arbitrarily close to one by taking n large enough.
(4) Suppose now that the shape of the signal is unknown. Construct the level GLRT
test for
H0 : i = 0, for all i {1, ..., n}
.
H1 : i = 0, for some i {1, ..., n}
Solution
232
EXAMS/SOLUTIONS
i=1
i=1
}
Xi2 c .
i=1
2
2
i=1 Xi /
which gives
c() = 2 Fn1 (1 ).
(5) Can the GLRT be constructed for the detection problem in (4), if 2 is unknown ?
Explain.
Solution
The GLRT is not well dened, since under H1 , 2 and 1:n can be chosen to yield
arbitrary large values of the likelihood, namely if
11:n = X1:n is taken,
(
)n
1
L(x; 2 ,
11:n ) =
, 2 0.
2
2
i = 1, ..., n,
where > 0 is the unknown parameter. It is required to estimate , given the sample X =
(X1 , ..., Xn ).
(1) Show that X U ([, 2]), i.e. its p.d.f. is given by
1
f (x; ) = I(x [, 2])
Solution
233
d
1
1
1
P(Z1 u/ 1) = g(u/ 1) = I(u/ 1 [0, 1]) = I(u [, 2]).
du
min
i{1,...,n}
Xi ,
X :=
max Xi ,
i{1,...,n}
n
1
1
I(xi [, 2]) = n I(x 2)I(x ), x Rn , R+
n
i=1
0
0
:= 0 and
1
0
:=
234
EXAMS/SOLUTIONS
(
)2
= var()
+ (E )2 = 1 var(X ) + 1 E X =
R(, )
4
2
(1
)2 2
)2
2
n
2 ( n
var(Z ) +
( + E Z ) =
+
1
=
4
2
4 (n + 1)2 (n + 2)
4 n+1
2n + 1
2
4 (n + 1)2 (n + 2)
(
)n
Hint: Note that P(Z > u, Z v) = P(Z1 [u, v]) , for u v.
Solution
Since
FZ Z (u, v) = P(Z u, Z v) = P(Z v) P(Z > u, Z v).
and,
(
)n
P(Z > u, Z v) = P(Z1 [u, v]) =
{
(v u)n , v > u
0
vu
we have
fZ Z (u, v) =
2
2
FZ Z (u, v) =
P(Z > u, Z v) =
uv
uv
n(n 1)(v u)n2 I(0 u v 1).
EZ =
1
,
n+1
( )2
E Z =
2
,
(n + 1)(n + 2)
( )
n
var Z =
(n + 1)2 (n + 2)
n
n+1
( )2
n
E Z =
n+2
( )
n
var Z =
(n + 1)2 (n + 2)
EZ =
235
1
.
(n + 1)2 (n + 2)
Solution
EZ Z = n(n 1)
{(u,v):vu}
n(n 1)
u
0
)
v(v u)
n2
dv du = ... =
1
,
n+2
and
cov(Z , Z ) = EZ Z EZ EZ =
n
1
1
1
=
n+2 n+1n+1
(n + 1)2 (n + 2)
R(, )
9 2n + 1
,
=
8 n+1
R(, )
n 1.
R(, )
27
> 1.68... n 1
16
R(, )
This shows that the MLE is inadmissible and, moreover, is strictly inferior to
asymptotically as n .
32no need to proceed with the calculations
236
EXAMS/SOLUTIONS
Problem 2.
Bob and Alice toss two different coins with heads probabilities $\theta_1$ and $\theta_2$, which Claude wants to estimate. In order to trick Claude, they reveal the outcomes of their tosses, without telling whose coin was tossed first. Claude knows that Bob's coin has a greater heads probability than Alice's, i.e. $\theta_2>\theta_1$.
To estimate := (1 , 2 ) Claude assumes that he observes n i.i.d. pairs (X1 , Y1 ), ..., (Xn , Yn )
with
Xi = Zi Ai + (1 Zi )Bi
Yi = (1 Zi )Ai + Zi Bi ,
where Ai Ber(1 ) (the tosses of Alice), Bi Ber(2 ) (the tosses of Bob) and Zi Ber(1/2).
All Ai s, Bi s and Zi s are independent.
(1) Explain how this model ts the experiment and what is the role of Zi s in it ?
Solution
If the event {Zi = 1} occurs, then (Xi , Yi ) = (Ai , Bi ), i.e. the rst coin comes from
Alice and the second from Bob, and if {Zi = 1} occurs, then (Xi , Yi ) = (Bi , Ai ). Hence
Claudes model assumes that in fact Alice and Bob toss a third fair coin and swap their
outcomes accordingly. Such a model supports any sequence of swaps.
(2) Find the distribution of X1
Solution
X1 is Bernoulli r.v. with
P (X1 = 1) = P (Zi = 1, Ai = 1) + P (Zi = 0, Bi = 1) =
1
1
P (Zi = 1)P (Ai = 1) + P (Zi = 0)P (Bi = 1) = 1 + 2 .
2
2
(3) Show that if Claude uses only X1 , ..., Xn , he gets non-identiable model with respect
to .
Solution
The distribution of X1 and, by the i.i.d. property, of the whole vector (X1 , ..., Xn )
depends on , only through 1 + 2 .
(
)
1 (1 2 ) (1 1 )2 1uv(1u)(1v)
+
,
2
2
237
u, v {1, 0}
Solution
We have
P (X1 = 1, Y1 = 1) = P(A1 = 1, B1 = 1) = 1 2 ,
and similarly
P (X1 = 0, Y1 = 0) = P(A1 = 0, B1 = 0) = (1 1 )(1 2 ).
Further,
P (X1 = 0, Y1 = 1) = P (A1 = 0, B1 = 1, Z1 = 1)+
1
1
P (A1 = 1, B1 = 0, Z1 = 0) = (1 1 )2 + 1 (1 2 ) ,
2
2
and similarly,
1
1
P (X1 = 0, Y1 = 1) = 1 (1 2 ) + (1 1 )2 .
2
2
(5) Find a two-dimensional sucient statistic for estimating from (X1 , Y1 ), ..., (Xn , Yn ).
Solution
The j.p.m.f. of the r.v.s (X1 , Y1 ), ..., (Xn , Yn ) is
(
)T0 (x,y) (
)T1 (x,y)
p(x, y) = 1 2
(1 1 )(1 2 )
(
)
1 (1 2 ) (1 1 )2 nT0 (x,y)T1 (x,y)
+
, x, y {1, 0}n ,
2
2
238
EXAMS/SOLUTIONS
Note that
s(1 , 2 ) := E (X1 + Y1 ) = 1 + 2
and
r(1 , 2 ) := E X1 Y1 = P (X1 = 1, Y1 = 1) = P (A1 = 1, B1 = 1) = 1 2 .
The function (1 , 2 ) 7 (s, r) is invertible33 on 2 > 1 : the inverse is obtained by
solving34 the quadratic equation 2 s + r = 0:
s s2 4r
s + s2 4r
1 =
, 2 =
.
2
2
Hence the model is identiable.
(7) How your answer to the previous question would change, if Claude doesnt know whose
coin has greater heads probabilities ?
Solution
Notice that the distribution of the data is invariant with respect to permuting 1
and 2 . Hence e.g. for = (1/2, 1/3) and = (1/3, 1/2) one gets exactly the same
distribution, i.e. the model is not identiable. The condition 2 > 1 reduces the
parameter space and eliminates this ambiguity.
(8) Suggest consistent estimators for 1 and 2 , based on all the data (X1 , Y1 ), ..., (Xn , Yn ).
Solution
The formulas, obtained in the previous question, can be used to construct the
method of moments estimators for 1 and 2 :
2
X + Y X + Y 4XY
1 =
,
2
2
X + Y 4XY
,
2
239
where
1
L(x; ) =
exp
(2)n/2
i=1 (...)
)
1
n
1 2 1
2
(xi a) ,
xi
2
2
i=1
i=
= 0 is understood.
240
EXAMS/SOLUTIONS
c := 1 (1 ).
n > c) = P0 ( nX
n > nc) = 1 (c n) = ,
P0 (X
| {z }
N (0,1)
that is
1
c() = 1 (1 ).
n
n is a Gaussian r.v. with unit variance and mean
Under H1 , X
n+1
1
n = 1
E X
a=a
E Xi =
.
n
n
n
n
i=1
i=
The power function of the test (for {1, ..., n}) is given by
(
)
(
)
(
n + 1) (
n + 1)
> n c() a
() = P Xn > c() = P
n Xn a
=
n
n
{z
}
|
N (0,1)
)
(
)
(
n + 1)
n+1
1
.
1
n c() a
= 1 (1 ) a
n
n
(4) Prove that the GLRT test with a critical value c rejects H0 i
max
{1,...,n}
Solution
i=
241
sup0 L(x)
exp( 12 ni=1 x2i )
n
(
)
max exp a
(xi a/2)
{1,...,n}
i=
i=
(Xi a/2),
m = 1, 2, ...
i=1
exceeds threshold c.
(5) Find the GLRT statistic, if a is unknown and is to be considered as a parameter as
well.
Solution
The test statistic in this case is:
(
)
n
2 1
2
max{1,...,n} maxaR exp 12 1
x
(x
a)
i=1 i
i= i
2
=
n
(
)
max exp a
(xi a/2)
{1,...,n} aR
1
xi
a =
n+1
n
i=
i=
242
EXAMS/SOLUTIONS
exp 2
xi n
xi +
2
2
i=1
i=1
)
( n
n
2
By F-N theorem, the statistic T (X) =
i=1 Xi ,
i=1 Xi is sucient (in fact, minimal).
(2) Calculate the Fisher information for the sample and derive the lower bound for the
unbiased estimators of .
Solution
By the i.i.d. property, IX () = nIX1 (). Further,
log fX1 (X1 ; ) = log
2 log
1 2 1
X + X1 1/2
22 1
Hence
1
1
1
log fX1 (X1 ; ) = + 3 X12 2 X1
and
2
3
2
1
log fX1 (X1 ; ) = 2 4 X12 + 3 X1 .
2
2
2
2
Note that E X1 = var (X1 ) + (E X1 ) = 2 and so
)
(
3 2
2
3
1
2
4 2 + 3 = 2 .
IX1 () = E 2 log fX1 (X1 ; ) =
2
1 2
3 n.
243
n of is not ecient, i.e. its risk does not attain the C-R
(3) Show that the estimator X
bound.
Solution
n is unbiased and hence its risk is the variance var (X
n ) = 2 /n, which is strictly
X
greater than C-R bound for all > 0. Hence it is not ecient.
n ecient asymptotically as n , i.e.
(4) Is X
lim
n
n/IX1 ()
n) = 1
var (X
Solution
n is not ecient asymptotically either:
X
n/IX1 ()
n ) = 1/3,
var (X
(5) Explain how the estimator
n 1
v
u
n
u1 1
t
Tn (X) =
Xi2 ,
2n
i=1
is obtained by the method of moments and show that the sequence (Tn ) is consistent
for .
Solution
The estimator can be obtained by the method of moments using E X12 = 22 . By
P
the law of large numbers, n1 ni=1 Xi2
22 and since u 7 u/2 is a continuous
P
1
2
function, Tn (X)
2 2 = , i.e. Tn is consistent for .
(6) Show that
var (X12 ) = 64
E 3
Solution
244
EXAMS/SOLUTIONS
(
)
var (X12 ) = 4 var (1 + )2 .
E(1 + )2 = 2 and
(
)
(
)2
var (1 + )2 = E(1 + )4 E(1 + )2 = E(1 + )4 4.
Moreover,
E(1 + )4 = E(1 + 4 + 6 2 + 4 3 + 4 ) = 1 + 6 + 3 = 10,
and the claim follows.
(7) Using the Delta method, nd the asymptotic error distribution for (Tn ) and the corresponding rate. Compare your result with the estimator in (3) and with the C-R bound
from (2). Explain.
Solution
1
n
1 2 P
i=1 2 Xi
3 4
2 d
2
n(Sn (X) )
N 0, var X1
= N 0,
2
2
N 0,
= N 0, 2 .
2 2 2
8
n . Comparing to the C-R bound,
The asymptotic variance is better than that of X
35
found above, is strictly speaking irrelevant, since Tn s are not unbiased (revealed by
an additional calculation).
By the LLN, Sn (X) :=
Problem 2.
A coin is tossed an unknown number of times θ ∈ {1, 2, ...} and the number of heads X is revealed.
(1) A statistician assumes that X ∼ Bin(θ, p), where both θ and p ∈ (0,1) are unknown parameters. What are the assumptions of the model? Specify the parametric space.
Solution
The tosses are assumed to be i.i.d. with probability of heads p. The parameter (θ, p) takes values in N × (0, 1).
(2) If p is known and θ is unknown, is the model identifiable? If both p and θ are unknown, is the model identifiable?
Solution
If θ is unknown and p is known, the model is identifiable: the number of integers on which the p.m.f. of X is positive equals θ + 1 and hence for two different values of θ different p.m.f.s emerge. Alternatively, consider e.g. the statistic T(X) = 1_{\{X=0\}}. The function θ ↦ E_θ T(X) = P_θ(X = 0) = (1−p)^θ is a one-to-one function of θ.
The model is identifiable also if both parameters are unknown. Let (p, θ) ≠ (p′, θ′). If θ ≠ θ′, the corresponding p.m.f.s are different by the same argument as in the previous question: the number of nonzero probabilities is different, regardless of p and p′. If θ = θ′ and p ≠ p′, the p.m.f.s are different, since e.g. their means are different: θp ≠ θp′.
(3) Assume that p ∈ (0,1) is known and denote by \hat\theta(X) the MLE of θ. Find \hat\theta(0).
Solution
The likelihood function is

L(x;\theta) = \frac{\theta!}{x!(\theta-x)!}\,p^x(1-p)^{\theta-x}, \qquad x\in\{0,\dots,\theta\},

and for x = 0 it reduces to L(0;θ) = (1−p)^θ, which is strictly decreasing in θ. Hence \hat\theta(0) = 1.
(4) Find \hat\theta(1).
Solution
For x = 1, L(1;θ) = θp(1−p)^{θ−1}. Consider the function \psi(u) := up(1-p)^{u-1} of the real variable u ≥ 1, whose log-derivative

\big(\log\psi(u)\big)' = \frac1u + \log(1-p)

vanishes at u^* = -1/\log(1-p). Moreover, since \big(\log\psi(u)\big)'' = -1/u^2 < 0, ψ(u) attains its local maximum at u^*. As \lim_{u\to\infty}\psi(u) = 0, the global maximum over u ∈ [1, ∞) is attained at \max(1, u^*).
Note that

\max_{\theta\in\mathbb N}\psi(\theta) \le \max_{u\in[1,\infty)}\psi(u),

since the maximum on the right hand side is taken over a larger set. For p = 1 − e^{-1}, u^* = 1 is an integer and the latter inequality is attained, i.e.

\hat\theta(1) = \arg\max_{\theta\in\mathbb N}\psi(\theta) = \arg\max_{u\in[1,\infty)}\psi(u) = 1.

Similarly, for p = 1 − e^{-\ell} with \ell ∈ N, u^* = 1/\ell ≤ 1 and hence \hat\theta(1) = \max(1, 1/\ell) = 1. For p = 1 − e^{-1/\ell}, \ell ≥ 1, we get u^* = \ell. In the general case p ∈ (0,1),

\hat\theta(1) = \begin{cases}\max\big(1,\lfloor u^*\rfloor\big), & \text{if } L(1;\lfloor u^*\rfloor)\ge L(1;\lceil u^*\rceil)\\[1ex]
\max\big(1,\lceil u^*\rceil\big), & \text{if } L(1;\lfloor u^*\rfloor) < L(1;\lceil u^*\rceil)\end{cases}

where \lfloor u^*\rfloor denotes the greatest integer less than or equal to u^* and \lceil u^*\rceil denotes the smallest integer greater than or equal to u^*.
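The floor/ceiling rule can be double-checked by brute force maximization of the Binomial likelihood over integer θ. The sketch below (not from the original text) uses an arbitrary illustrative value of p and a hypothetical search bound theta_max.

```python
import math

def binom_mle(x: int, p: float, theta_max: int = 10_000) -> int:
    """Maximize the Bin(theta, p) likelihood over theta in {max(x,1), ..., theta_max}."""
    def log_lik(theta: int) -> float:
        return (math.lgamma(theta + 1) - math.lgamma(x + 1) - math.lgamma(theta - x + 1)
                + x * math.log(p) + (theta - x) * math.log(1 - p))
    return max(range(max(x, 1), theta_max + 1), key=log_lik)

p = 0.3                                    # illustrative value
u_star = -1.0 / math.log(1 - p)            # continuous maximizer for x = 1
print("u* =", round(u_star, 3), " MLE at x=1:", binom_mle(1, p), " MLE at x=0:", binom_mle(0, p))
```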
(5) Explain why the model does not belong to the exponential family, if θ is unknown.
Solution
If θ is unknown, the support of the Binomial p.m.f. depends on the parameter and hence does not fit the exponential family form.
(6) Show that if both p and θ are unknown, X is a complete statistic.
Solution
Suppose E_{p,θ} g(X) = 0 for all p ∈ (0,1) and θ ∈ N, i.e.

E_{p,\theta}\,g(X) = \sum_{i=0}^{\theta} g(i)\binom{\theta}{i}p^i(1-p)^{\theta-i}
= (1-p)^{\theta}\sum_{i=0}^{\theta} g(i)\binom{\theta}{i}\Big(\frac{p}{1-p}\Big)^i = 0, \qquad p\in(0,1),\ \theta\in\mathbb N.

Since for p ∈ (0,1) the function p/(1−p) takes values in (0, ∞), the latter implies (recall that a polynomial equals zero on an open interval if and only if all its coefficients are zero) that g(i) = 0 for all i ≤ θ; as θ ∈ N is arbitrary, g(i) = 0 for all i ∈ N and hence X is complete.
(7) Show that if p is known (say p = 1/2) and θ is unknown, X is no longer a complete statistic.
Solution
Take g(i) = (−1)^i. Then for any θ ∈ N,

E_\theta\,g(X) = \sum_{i=0}^{\theta}(-1)^i\binom{\theta}{i}(1/2)^{\theta} = (1/2)^{\theta}\,(1-1)^{\theta} = 0,

although g is not identically zero. Hence X is not complete.
Problem.
The observations X_1, ..., X_n are i.i.d. with the mixture density

f(x;p) = p\,\varphi(x) + (1-p)\,\varphi(x-\mu), \qquad x\in\mathbb R,

where φ(x) = ½e^{−|x|} is the density of the Laplace (two-sided exponential) r.v., μ > 0 is a known constant and p ∈ [0,1] is the unknown parameter.
(1) Is the model identifiable? Would the model be identifiable if both p and μ were unknown?
Solution
If μ is known, the model is identifiable: e.g. E_p X_1 = (1−p)μ is a one-to-one function of p (recall that μ ≠ 0). The model is not identifiable if both p and μ are unknown: if p = 1, the density does not depend on μ.
(2) The first two moments of X_1 are m_1(p,μ) = E_p X_1 = (1−p)μ and m_2(p,μ) = E_p X_1² = 2 + (1−p)μ². If p ∈ (0,1), then the function (p,μ) ↦ (m_1(p,μ), m_2(p,μ)) is one-to-one with the inverse

\mu = \frac{m_2-2}{m_1}, \qquad p = 1 - \frac{m_1^2}{m_2-2}.
(3) Find the MLE of p on the basis of one sample X_1.
Solution
The likelihood L_1(X_1;p) = pφ(X_1) + (1−p)φ(X_1−μ) is a linear function of p and is maximized at an endpoint of the interval [0,1]. Hence

\hat p(X_1) = 1_{\{\varphi(X_1)\ge\varphi(X_1-\mu)\}} = 1_{\{|X_1|\le|X_1-\mu|\}} = 1_{\{X_1\le\mu/2\}}.
(4) Using the first moment of X_1, suggest a sequence of unbiased consistent estimators of p.
Solution
Note that E_p X_1 = (1−p)μ, and the method of moments gives \hat p_n(X) = 1 − X̄_n/μ. Since by the LLN X̄_n \xrightarrow{P_p} (1−p)μ, the sequence of estimators (\hat p_n) is consistent.
(5) Find the asymptotic error distribution and the corresponding rate for the sequence of estimators from the previous question. What happens to the asymptotic variance when μ is close to zero? When μ is very large? Explain.
Solution
We have

var_p(X_1) = E_p X_1^2 - (E_p X_1)^2 = 2p + (1-p)(2+\mu^2) - (1-p)^2\mu^2 = 2 + p(1-p)\mu^2

and by the CLT

\sqrt n\,\big(1 - \bar X_n/\mu - p\big) = -\frac1\mu\,\sqrt n\,\big(\bar X_n - (1-p)\mu\big)
\xrightarrow{d} N\Big(0,\;\frac{1}{\mu^2}\big(2 + p(1-p)\mu^2\big)\Big) = N\Big(0,\;\frac{2}{\mu^2} + p(1-p)\Big).

When μ is close to zero the asymptotic variance blows up: the two mixture components are nearly indistinguishable and p is hard to estimate. When μ is very large the asymptotic variance decreases to p(1−p), the variance one would get by observing the component labels themselves.
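The limiting variance can be verified by simulation. The sketch below (not from the original text) samples from the mixture p·Laplace(0) + (1−p)·Laplace(μ); the particular values of p, μ and n are assumptions made only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
p, mu, n, reps = 0.4, 1.5, 400, 20000      # illustrative values

# mixture: with prob p a Laplace(0,1) sample, with prob 1-p a Laplace(mu,1) sample
labels = rng.random((reps, n)) < p
x = rng.laplace(loc=np.where(labels, 0.0, mu), scale=1.0)

p_hat = 1.0 - x.mean(axis=1) / mu
print(f"empirical n*var(p_hat)        = {n * np.var(p_hat):.4f}")
print(f"theoretical 2/mu^2 + p*(1-p)  = {2 / mu**2 + p * (1 - p):.4f}")
```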
(6) Find a sufficient statistic for p, which is strictly coarser than the whole sample X.
Solution
The likelihood is

L(X;p) = \prod_{i=1}^n\Big(p\,\varphi(X_i) + (1-p)\,\varphi(X_i-\mu)\Big)
= \prod_{i=1}^n\Big(p\,\tfrac12 e^{-|X_i|} + (1-p)\,\tfrac12 e^{-|X_i-\mu|}\Big)
= \prod_{i=1}^n \tfrac12 e^{-|X_i|}\Big(p + (1-p)\,e^{|X_i|-|X_i-\mu|}\Big).

By the F-N factorization theorem, T(X) = \big(|X_1|-|X_1-\mu|,\dots,|X_n|-|X_n-\mu|\big) is sufficient. T(X) is strictly coarser than X, since the function

x\mapsto |x| - |x-\mu| = \begin{cases}-\mu, & x<0\\ 2x-\mu, & 0\le x<\mu\\ \mu, & x\ge\mu\end{cases}

is not one-to-one on the range of X_1 (why?).
(7) Are your estimators in (4) UMVUE? Are they admissible?
Solution
Note that

E_p\big(\bar X_n\mid T(X)\big) = \frac1n\sum_{i=1}^n E_p\big(X_i\mid T(X)\big) = \frac1n\sum_{i=1}^n E_p\big(X_i\,\big|\,|X_i|-|X_i-\mu|\big).

As the calculation in the Appendix below shows, this conditional expectation does not depend on p and differs from X̄_n with positive probability. Hence, by the Rao-Blackwell theorem, the estimator 1 − E_p(X̄_n | T(X))/μ is unbiased and has strictly smaller MSE risk than \hat p_n = 1 − X̄_n/μ. Consequently the estimators \hat p_n are neither UMVUE nor admissible (see the Appendix, if you are curious what the improved estimator looks like).
Appendix
The conditional expectation E_p(X_1|T_1), where T_1 = |X_1| − |X_1−μ|, is not hard to calculate:

E_p(X_1|T_1) = \begin{cases}
\dfrac{E_p\big(X_1 1_{\{X_1<0\}}\big)}{P_p(X_1<0)}, & T_1 = -\mu\\[2ex]
\tfrac12\,(T_1+\mu), & -\mu < T_1 < \mu\\[1ex]
\dfrac{E_p\big(X_1 1_{\{X_1>\mu\}}\big)}{P_p(X_1>\mu)}, & T_1 = \mu
\end{cases}

and

\frac{E_p\big(X_1 1_{\{X_1<0\}}\big)}{P_p(X_1<0)}
= \frac{\int_{-\infty}^0 x\big(p\,e^{x} + (1-p)\,e^{x-\mu}\big)\,dx}{\int_{-\infty}^0 \big(p\,e^{x} + (1-p)\,e^{x-\mu}\big)\,dx}
= \frac{\int_{-\infty}^0 x\,e^{x}\,dx}{\int_{-\infty}^0 e^{x}\,dx} = -1,

and, similarly,

\frac{E_p\big(X_1 1_{\{X_1>\mu\}}\big)}{P_p(X_1>\mu)}
= \frac{\int_\mu^\infty x\big(p\,e^{-x} + (1-p)\,e^{-x+\mu}\big)\,dx}{\int_\mu^\infty \big(p\,e^{-x} + (1-p)\,e^{-x+\mu}\big)\,dx}
= \frac{\int_\mu^\infty x\,e^{-x}\,dx}{\int_\mu^\infty e^{-x}\,dx} = \frac{(\mu+1)e^{-\mu}}{e^{-\mu}} = 1+\mu.

In particular, E_p(X_1|T_1) does not depend on p, and

\tilde p_n(X) = 1 - \frac{1}{n\mu}\sum_{i=1}^n E\big(X_i\mid T_i\big)

is an unbiased estimator of p with the better MSE risk than 1 − X̄_n/μ.
g. 2010/2011 (A) 52314
Problem 1. [Neyman-Scott]
This classical example in asymptotic statistics shows that in the presence of a high-dimensional nuisance parameter the MLE can fail to be consistent. The observations are independent pairs X_i ∼ N(μ_i, σ²), Y_i ∼ N(μ_i, σ²), i = 1, ..., n, with the unknown parameter θ = (μ, σ²) ∈ R^n × R_+. The likelihood of the sample is

L(X,Y;\mu,\sigma^2) = \prod_{i=1}^n \frac{1}{2\pi\sigma^2}\exp\Big(-\frac{(X_i-\mu_i)^2 + (Y_i-\mu_i)^2}{2\sigma^2}\Big)
= \Big(\frac{1}{2\pi\sigma^2}\Big)^{n}\exp\bigg\{-\frac{1}{2\sigma^2}\sum_{i=1}^n\Big(\tfrac12(X_i-Y_i)^2 + \tfrac12(Y_i+X_i-2\mu_i)^2\Big)\bigg\}.

Maximizing over the nuisance parameters gives \hat\mu_i = (X_i+Y_i)/2 and, maximizing the resulting profile likelihood over σ²,

\hat\sigma^2_n(X,Y) = \frac{1}{4n}\sum_{i=1}^n (X_i-Y_i)^2 \xrightarrow{P_\theta} \frac{\sigma^2}{2} \ne \sigma^2

in probability, i.e. the MLE of σ² is not consistent either.
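A short simulation (not from the original text) makes the inconsistency tangible: with arbitrary illustrative values of σ and of the nuisance means, the MLE concentrates around σ²/2 rather than σ².

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 5000, 2.0                       # illustrative values
mu = rng.uniform(-5, 5, size=n)            # arbitrary nuisance means

x = mu + sigma * rng.standard_normal(n)
y = mu + sigma * rng.standard_normal(n)

sigma2_mle = np.sum((x - y) ** 2) / (4 * n)   # MLE of sigma^2 after profiling out mu_i
print(f"MLE of sigma^2 = {sigma2_mle:.3f}; true sigma^2 = {sigma**2:.3f}; limit sigma^2/2 = {sigma**2 / 2:.3f}")
```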
(5) Find the minimal sufficient statistic.
Solution
The likelihood can be rewritten as

L(x,y;\mu,\sigma^2) = \Big(\frac{1}{2\pi\sigma^2}\Big)^{n}
\exp\bigg(-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i^2+y_i^2) + \frac{1}{\sigma^2}\sum_{i=1}^n \mu_i(x_i+y_i) - \frac{1}{\sigma^2}\sum_{i=1}^n\mu_i^2\bigg),

so that for x, y, \tilde x, \tilde y ∈ R^n,

\frac{L(x,y;\mu,\sigma^2)}{L(\tilde x,\tilde y;\mu,\sigma^2)}
= \exp\bigg\{-\frac{1}{2\sigma^2}\Big(T_{n+1}(x,y) - T_{n+1}(\tilde x,\tilde y)\Big)
+ \frac{1}{\sigma^2}\sum_{i=1}^n \mu_i\Big(T_i(x,y) - T_i(\tilde x,\tilde y)\Big)\bigg\},

where T_i(x,y) := x_i + y_i, i = 1, ..., n, and T_{n+1}(x,y) := \sum_{i=1}^n(x_i^2+y_i^2). This ratio does not depend on (μ, σ²) only if T(x,y) = T(\tilde x,\tilde y). Hence T(X,Y) is minimal sufficient.
(6) Is T(X,Y) complete?
Solution
The likelihood has the exponential family form with the natural parameter

c(\mu,\sigma^2) = \Big(\frac{\mu_1}{\sigma^2},\dots,\frac{\mu_n}{\sigma^2},\,-\frac{1}{2\sigma^2}\Big).

The range of c(μ, σ²) is R^n × (−∞, 0), which obviously has a non-empty interior. Hence T(X,Y) is a complete statistic.
(7) Show that

\tilde\sigma^2(X,Y) = \frac{1}{2n}\sum_{i=1}^n (X_i-Y_i)^2

is an unbiased estimator of σ². Is it the UMVUE?
Solution
Since X_i − Y_i ∼ N(0, 2σ²), E_θ(X_i−Y_i)² = 2σ² and the estimator is unbiased. Moreover, since

\frac12\sum_{i=1}^n (X_i-Y_i)^2 = \sum_{i=1}^n (X_i^2+Y_i^2) - \frac12\sum_{i=1}^n (X_i+Y_i)^2,

\tilde\sigma^2(X,Y) is a function of the complete sufficient statistic T(X,Y), and hence it is the UMVUE of σ².
(8) Find the C-R lower bound for MSE risk of unbiased estimators of σ², assuming that the μ_i's are known. Is it attained by the estimator from the previous question?
Solution
The log-likelihood is

\log L(X,Y;\sigma^2) = -n\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n\Big((X_i-\mu_i)^2+(Y_i-\mu_i)^2\Big),

and hence

\frac{\partial}{\partial\sigma^2}\log L(X,Y;\sigma^2) = -\frac{n}{\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n\Big((X_i-\mu_i)^2+(Y_i-\mu_i)^2\Big)

and

\frac{\partial^2}{\partial(\sigma^2)^2}\log L(X,Y;\sigma^2) = \frac{n}{(\sigma^2)^2} - \frac{1}{(\sigma^2)^3}\sum_{i=1}^n\Big((X_i-\mu_i)^2+(Y_i-\mu_i)^2\Big).

Since E_\theta\sum_{i=1}^n\big((X_i-\mu_i)^2+(Y_i-\mu_i)^2\big) = 2n\sigma^2,

I(\sigma^2) = -E_\theta\frac{\partial^2}{\partial(\sigma^2)^2}\log L(X,Y;\sigma^2) = -\frac{n}{(\sigma^2)^2} + \frac{2n\sigma^2}{(\sigma^2)^3} = \frac{n}{(\sigma^2)^2},

so the C-R bound equals σ⁴/n. On the other hand, X_i − Y_i ∼ N(0, 2σ²) and var_θ((X_i−Y_i)²) = 2(2σ²)² = 8σ⁴, so that

var_\theta\big(\tilde\sigma^2(X,Y)\big) = \frac{1}{4n^2}\cdot n\cdot 8\sigma^4 = \frac{2\sigma^4}{n} > \frac{\sigma^4}{n}.

The bound is not attained, which is to be expected, since if the μ_i's are unknown, the problem of estimating σ² is harder (and, of course, the model is different).
Problem 2. An electronic device monitors the radiation activity by counting the number of particles emitted by a source. The outputs of the device at consecutive time units i = 1, ..., n are independent random variables X_1, ..., X_n with Poisson distribution X_i ∼ Poi(1 + θ_i), where θ_i ≥ 0 is the unknown intensity of the source at time i (and 1 models the known radiation of the background).
(1) Find the UMP test statistic for the problem

H_0: \theta_1 = \dots = \theta_n = 0 \qquad\text{vs}\qquad H_1: \theta_1 = \dots = \theta_n > 0.

Solution
Under the alternative, X_1, ..., X_n ∼ Poi(1 + r) i.i.d., where r > 0. The likelihood of the model is

L(X;r) = \prod_{i=1}^n e^{-(1+r)}\frac{(1+r)^{X_i}}{X_i!} = e^{-n(1+r)}\,(1+r)^{n\bar X_n}\Big/\prod_{i=1}^n X_i!,

and

R(X;r_1,r_0) = \frac{L(X;r_1)}{L(X;r_0)} = e^{-n(r_1-r_0)}\Big(\frac{1+r_1}{1+r_0}\Big)^{n\bar X_n}

is a strictly increasing function of the statistic X̄_n for r_1 > r_0. Hence by the K-R theorem the likelihood ratio test is UMP. The test statistic is given by

\frac{L(X;r)}{L(X;0)} = e^{-nr}(1+r)^{n\bar X_n},

and hence the UMP test rejects H_0 if and only if {X̄_n ≥ c}, where c is the critical value to be chosen to meet the size requirement.
(2) For which values of the size α can an exact critical value be found for the UMP test?
Solution
Recall that under H_0, S(X) = nX̄_n has the Poi(n) distribution, hence c is the smallest integer satisfying P_0(X̄_n ≥ c) ≤ α, or

e^{-n}\sum_{k\ge nc}\frac{n^k}{k!} \le \alpha.

An exact size is attained only for α of the form e^{-n}\sum_{k\ge m} n^k/k!, m = 0, 1, 2, ... . (At any fixed alternative the power of the UMP test converges to one, \lim_{n\to\infty}\beta_n(\theta) = 1.)
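Since S(X) ∼ Poi(n) under H_0, the attainable exact sizes and the corresponding integer threshold can be tabulated directly. The sketch below is only an illustration; it assumes scipy is available and uses arbitrary values of n and α.

```python
from scipy.stats import poisson

n, alpha = 20, 0.05                        # illustrative values

# Under H0, S = n * Xbar_n ~ Poisson(n); the test rejects when S >= m.
# Attainable exact sizes are P0(S >= m) = poisson.sf(m - 1, n), m = 0, 1, 2, ...
m = next(m for m in range(0, 10 * n) if poisson.sf(m - 1, n) <= alpha)
print(f"smallest threshold m with P0(S >= m) <= {alpha}: m = {m}")
print(f"exact size attained: {poisson.sf(m - 1, n):.4f}")
```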
The log-likelihood under the alternative,

\log L(X;r) = -n(1+r) + n\bar X_n\log(1+r) - \sum_{i=1}^n\log X_i!,

is a continuous function of r on [0, ∞). Taking the derivative and equating the result to zero we get the extremum:

-n + \frac{n\bar X_n}{1+r} = 0 \quad\Longrightarrow\quad r^* = \bar X_n - 1,

which is a local maximum (the second derivative is negative). Since \lim_{r\to\infty}\log L(X;r) = -\infty and \lim_{r\downarrow -1}\log L(X;r) = -\infty, r^* is a global maximum on (−1, ∞) and hence the MLE of r is

\hat r_n(X) = (\bar X_n - 1)^+.
Hence the GLRT statistic is

\lambda_n(X) := \frac{\sup_{r\ge 0} L(X;r)}{L(X;0)} =
\begin{cases}
1, & \bar X_n \le 1\\
e^{-n(\bar X_n-1)}\,(\bar X_n)^{n\bar X_n}, & \bar X_n > 1
\end{cases}

Let ψ(x) = e^{−n(x−1)}x^{nx} and note that

\big(\log\psi(x)\big)' = \big(-nx + n + nx\log x\big)' = -n + n(\log x + 1) = n\log x > 0 \quad\text{for } x>1.

Since (\log\psi(x))' = \psi'(x)/\psi(x) and ψ(x) > 0 for x > 1, it follows that x ↦ ψ(x) is strictly increasing on (1, ∞). It can be shown that the median of the Poisson distribution with mean n is at least n, and hence for α < 1/2 the critical value of the UMP test is greater than 1, i.e. it accepts H_0 if X̄_n ≤ 1. Since the GLRT also accepts H_0 for X̄_n ≤ 1 and the GLRT statistic is strictly increasing in X̄_n for X̄_n > 1, the GLRT coincides with the UMP test.
(6) Find the GLRT statistic for the problem

H_0: \theta_1 = \dots = \theta_n = 0 \qquad\text{vs}\qquad H_1: \theta_1 + \dots + \theta_n > 0.

Solution
The log-likelihood under H_1 is

\log L(X;\theta_1,\dots,\theta_n) = \sum_{i=1}^n\Big(-(1+\theta_i) + X_i\log(1+\theta_i) - \log X_i!\Big),

which is maximized over θ_i ≥ 0 coordinatewise at \hat\theta_i = (X_i-1)^+. Hence the GLRT statistic is

\lambda_n(X) = \prod_{i=1}^n e^{-(X_i-1)^+}\big(1+(X_i-1)^+\big)^{X_i} = \prod_{i:\,X_i\ge 1} e^{1-X_i}\,(X_i)^{X_i}.
(7) It is known that the radiation jumped from 0 to the known level r > 0 at the unknown time τ ∈ {1, ..., n}:

\theta_1 = \dots = \theta_{\tau-1} = 0, \qquad \theta_\tau = \dots = \theta_n = r.

Assuming the uniform prior on τ, find the Bayes estimator τ̂(X) with respect to the loss function ℓ(k, m) = 1_{\{k\ne m\}}.
Solution
τ ∈ {1, ..., n} is the only unknown parameter in this problem and the corresponding likelihood is

L(X;\tau) := \prod_{i=1}^{\tau-1} e^{-1}\frac{1}{X_i!}\;\prod_{i=\tau}^{n} e^{-(1+r)}\frac{(1+r)^{X_i}}{X_i!}
= \prod_{i=1}^{n} e^{-1}\frac{1}{X_i!}\;\prod_{i=\tau}^{n} e^{-r}(1+r)^{X_i},

where \prod_{i=1}^{0}(\dots) := 1 is understood. Recall that the Bayes estimator for this loss function is the posterior mode:

\hat\tau(X) = \arg\max_{\tau\in\{1,\dots,n\}}\frac{L(X;\tau)\pi(\tau)}{\sum_{j=1}^n L(X;j)\pi(j)}
= \arg\max_{\tau\in\{1,\dots,n\}}\frac{L(X;\tau)}{\sum_{j=1}^n L(X;j)}
= \arg\max_{\tau\in\{1,\dots,n\}} L(X;\tau) = \arg\max_{\tau\in\{1,\dots,n\}}\log L(X;\tau),

where the second equality holds since the prior is uniform over {1, ..., n}, i.e. π(i) = 1/n. Hence

\hat\tau(X) = \arg\max_{\tau\in\{1,\dots,n\}}\sum_{i=\tau}^{n}\Big(X_i\log(1+r) - r\Big).
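The posterior mode can be computed by a single backward pass over the data. The sketch below (not from the original exam) uses arbitrary illustrative values of n, r and the true change point.

```python
import numpy as np

rng = np.random.default_rng(4)
n, r, tau_true = 60, 2.0, 40               # illustrative values
lam = np.where(np.arange(1, n + 1) < tau_true, 1.0, 1.0 + r)
x = rng.poisson(lam)

# log L(X; tau), up to a constant, equals sum_{i=tau}^n (X_i*log(1+r) - r)
scores = np.cumsum((x * np.log1p(r) - r)[::-1])[::-1]
tau_hat = scores.argmax() + 1              # 1-based posterior mode under the uniform prior
print("Bayes (MAP) change-point estimate:", tau_hat)
```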
Problem.
The observations are i.i.d. pairs (X_i, Y_i), i = 1, ..., n, where X_i ∼ N(0,1) and Y_i = θX_i + ε_i with ε_i ∼ N(0,1) independent of the X_i's; θ ∈ R is the unknown parameter.
(1) Find the likelihood of the sample.
Solution
For any y ∈ R,

P(Y_1\le y\,|\,X_1) = P(\varepsilon_1 \le y - \theta X_1\,|\,X_1) = \Phi(y-\theta X_1),

where Φ is the standard Gaussian c.d.f. This means that Y_1 is conditionally Gaussian given X_1 with mean θX_1 and unit variance. Hence

f_{X_1Y_1}(x,y) = f_{Y_1|X_1}(y;x)\,f_{X_1}(x) = \frac{1}{2\pi}\exp\Big(-\frac12(y-\theta x)^2 - \frac12 x^2\Big),

and, by the i.i.d. assumption, the likelihood of the sample is

L(X,Y;\theta) = \Big(\frac{1}{2\pi}\Big)^n\exp\Big(-\frac12\sum_{i=1}^n (X_i^2+Y_i^2) + \theta\sum_{i=1}^n X_iY_i - \frac{\theta^2}{2}\sum_{i=1}^n X_i^2\Big).
(2) Find the minimal sufficient statistic.
Solution
By the F-N factorization theorem, T(X,Y) = \big(\sum_{i=1}^n X_iY_i,\;\sum_{i=1}^n X_i^2\big) is sufficient. Moreover, for two samples (x, y) and (x', y'),

\frac{L(x,y;\theta)}{L(x',y';\theta)} = \exp\Big(-\frac12\sum_i(x_i^2+y_i^2) + \frac12\sum_i(x_i'^2+y_i'^2)
+ \theta\Big(\sum_i x_iy_i - \sum_i x_i'y_i'\Big) - \frac{\theta^2}{2}\Big(\sum_i x_i^2 - \sum_i x_i'^2\Big)\Big).

The latter does not depend on θ only if T(x,y) = T(x',y'), which implies minimality of T.
The MLE of θ is

\hat\theta(X,Y) := \frac{\sum_i X_iY_i}{\sum_i X_i^2},

and it is unbiased:

E_\theta\,\hat\theta(X,Y) = E_\theta\frac{\sum_i X_i(\theta X_i+\varepsilon_i)}{\sum_i X_i^2}
= \theta + E_\theta\frac{\sum_i X_i\,E(\varepsilon_i|X_1,\dots,X_n)}{\sum_i X_i^2} = \theta.

The statistic T(X,Y) is not complete: the function g(T) := \sum_i X_i^2 - n satisfies

E_\theta\,g(T(X,Y)) = E_\theta\sum_{i=1}^n X_i^2 - n = 0, \qquad \theta\in\mathbb R,

while g(T) is not zero with probability one.
(6) Find the Cramer-Rao bound for the MSE risk of the unbiased estimators of θ.
Solution
The Fisher information for this problem is

I_n(\theta) = -E_\theta\frac{\partial^2}{\partial\theta^2}\log L(X,Y;\theta) = E_\theta\sum_{i=1}^n X_i^2 = n,

and hence

E_\theta\big(\hat\theta(X,Y)-\theta\big)^2 \ge \frac1n, \qquad \theta\in\mathbb R,

for any unbiased estimator. The risk of the MLE is

R(\theta,\hat\theta) = E_\theta\frac{\sum_i\sum_j X_iX_j\,E(\varepsilon_i\varepsilon_j|X_1,\dots,X_n)}{\big(\sum_i X_i^2\big)^2}
= E_\theta\frac{\sum_i X_i^2}{\big(\sum_i X_i^2\big)^2} = E\,\frac{1}{\sum_i X_i^2} = \frac{1}{n-2}.

Hence the MLE is not efficient for any n, but the sequence of MLEs is asymptotically efficient as n → ∞ (in agreement with the general asymptotic results on MLEs).
Consider now the estimator δ(X,Y) := \frac1n\sum_{i=1}^n X_iY_i. It is unbiased,

E_\theta\,\delta = \frac1n\sum_{i=1}^n E_\theta X_i(\theta X_i+\varepsilon_i) = \theta,

and its risk is

R(\theta,\delta) = E_\theta\Big(\frac1n\sum_{i=1}^n\big(\theta(X_i^2-1) + X_i\varepsilon_i\big)\Big)^2
= \frac1n\Big(\theta^2\,E(X_1^2-1)^2 + 1\Big) = \frac1n\big(2\theta^2+1\big).

Hence δ is not efficient, either for a fixed n or asymptotically. Its risk is not comparable with the risk of the MLE. The R-B procedure does not change the estimator, since δ is already a function of the sufficient statistic T(X,Y), and hence does not yield an improvement.
(9) Argue that if the UMVUE exists for this problem it has to be efficient (i.e. to attain the C-R bound).
Hint: Consider the estimators of the form

\hat\theta_{\theta_0}(X,Y) = \frac1n\sum_{i=1}^n X_iY_i - \theta_0\Big(\frac1n\sum_{i=1}^n X_i^2 - 1\Big), \qquad \theta_0\in\mathbb R.

Solution
Each \hat\theta_{\theta_0} is unbiased and

R(\theta,\hat\theta_{\theta_0}) = E_\theta\Big(\frac1n\sum_{i=1}^n\big((X_i^2-1)(\theta-\theta_0) + X_i\varepsilon_i\big)\Big)^2 = \frac1n\Big(2(\theta-\theta_0)^2+1\Big).

In particular, R(\theta_0,\hat\theta_{\theta_0}) = 1/n, the C-R bound. If a UMVUE θ* existed, its risk at the point θ = θ_0 could not exceed R(\theta_0,\hat\theta_{\theta_0}) = 1/n; since θ_0 is arbitrary, θ* would attain the C-R bound at every θ, i.e. it would be efficient.
Problem 2.
Consider the problem of detection of a sparse signal in noise. More precisely, we sample independent random variables X = (X_1, ..., X_n) with X_i ∼ N(θ_i, 1) and would like to test the hypotheses

H_0: \theta = 0 \qquad\text{vs}\qquad H_1: \theta\in\Theta_1,

where Θ_1 is a subset of R^n \ {0} with a particular structure, specified below.
(1) Let e_i be a vector with 1 at the i-th entry and zeros at all others and assume that Θ_1 = {e_1, ..., e_n}. Find the level-α test which rejects H_0 if and only if {X̄_n ≥ c}, and find its power function. Does the power function converge as n → ∞ and, if yes, what is the limit? Explain the result.
Solution
This is the standard calculation:

\alpha = P_0(\bar X_n\ge c) = P_0\big(\sqrt n\,\bar X_n\ge\sqrt n\,c\big) = 1 - \Phi(\sqrt n\,c),

which yields c = \frac{1}{\sqrt n}\Phi^{-1}(1-\alpha). The power function is

\beta(e_i;\alpha) = P_{e_i}(\bar X_n\ge c) = P_{e_i}\Big(\sqrt n\big(\bar X_n - \tfrac1n\big)\ge \Phi^{-1}(1-\alpha) - \tfrac{1}{\sqrt n}\Big)
= 1 - \Phi\Big(\Phi^{-1}(1-\alpha) - \frac{1}{\sqrt n}\Big),

which does not depend on the alternative. Note that \lim_{n\to\infty}\beta(e_i;\alpha) = \alpha, which should be expected, since for large n the zero signal and a signal with just one unit entry are hard to distinguish.
(2) Find the level-α GLRT for the problem from (1) and calculate its power function. Does the power function converge as n → ∞? If yes, find the limit. Explain the result.
Solution
The GLRT statistic is

\lambda(X) = \frac{\sup_{\theta\in\Theta_1}L(X;\theta)}{L(X;0)}
= \frac{\max_{i\le n}\exp\big(-\tfrac12\sum_{m\ne i}X_m^2 - \tfrac12(X_i-1)^2\big)}{\exp\big(-\tfrac12\sum_{m=1}^n X_m^2\big)}
= \max_{i\le n}\exp\Big(-\tfrac12(X_i-1)^2 + \tfrac12 X_i^2\Big) = \max_{i\le n}\exp\Big(X_i - \tfrac12\Big) = \exp\Big(\max_{i\le n}X_i - \tfrac12\Big),

and hence the GLRT rejects H_0 on the event {max_i X_i ≥ c}. Further,

\alpha = P_0\Big(\max_i X_i\ge c\Big) = 1 - P_0\Big(\max_i X_i< c\Big) = 1 - \Phi^n(c),

so that c = \Phi^{-1}\big((1-\alpha)^{1/n}\big), and the power function is

\beta(e_i;\alpha) = 1 - \Phi(c-1)\,\Phi^{n-1}(c) = 1 - \Phi\Big(\Phi^{-1}\big((1-\alpha)^{1/n}\big)-1\Big)\,(1-\alpha)^{1-1/n}.

Once again \lim_{n\to\infty}\beta(e_i;\alpha) = \alpha.
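Both power functions above can be evaluated numerically; the sketch below (not from the original text) assumes scipy is available and uses arbitrary values of n and α.

```python
from scipy.stats import norm

n, alpha = 100, 0.05                        # illustrative values

# Test from (1): reject when Xbar_n >= Phi^{-1}(1-alpha)/sqrt(n)
power_mean = 1 - norm.cdf(norm.ppf(1 - alpha) - 1 / n ** 0.5)

# GLRT from (2): reject when max_i X_i >= c, with Phi(c)^n = 1 - alpha
c = norm.ppf((1 - alpha) ** (1 / n))
power_max = 1 - norm.cdf(c - 1) * (1 - alpha) ** (1 - 1 / n)

print(f"power of the mean test : {power_mean:.4f}")
print(f"power of the max test  : {power_max:.4f}   (both tend to alpha = {alpha})")
```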
(3) Now let

\Theta_1 = \Big\{\theta\in\mathbb R^n:\ \sum_{i=1}^n 1_{\{\theta_i\ne 0\}} = 1\Big\},

i.e. all vectors with only one nonzero entry. Find the level-α GLRT and its power function (leave your answer in terms of an appropriate c.d.f.).
Solution
The GLRT statistic is

\log\lambda(X) = \log\frac{\sup_{\theta\in\Theta_1}L(X;\theta)}{L(X;0)}
= \max_{i\le n}\max_{a\in\mathbb R}\Big(-\tfrac12(X_i-a)^2 + \tfrac12 X_i^2\Big)
= \max_{i\le n}\max_{a\in\mathbb R}\Big(aX_i - \tfrac12 a^2\Big) = \tfrac12\max_{i\le n}X_i^2,

and the GLRT rejects H_0 on the event {max_i X_i² ≥ c}. The critical value c and the power function of the level-α test are those found in (2), with Φ replaced by the c.d.f. of the χ²₁ distribution.
(4) Now let

\Theta_1 = \Big\{\theta\in\mathbb R^n:\ \theta_m\in\{0,1\},\ \sum_{m=1}^n\theta_m\le p\Big\},

where p is a fixed integer. In words, Θ_1 is the set of binary vectors with no more than p nonzero entries. Find the GLRT test statistic in this case.
Solution
The GLRT statistic is

\log\lambda(X) = \log\frac{\sup_{\theta\in\Theta_1}L(X;\theta)}{L(X;0)}
= \max_{|I|\le p}\Big(-\tfrac12\sum_{m\in I}(X_m-1)^2 - \tfrac12\sum_{m\notin I}X_m^2 + \tfrac12\sum_{m=1}^n X_m^2\Big)
= \max_{|I|\le p}\sum_{m\in I}\Big(X_m - \tfrac12\Big) = \max_{|I|\le p}\Big(\sum_{m\in I}X_m - \frac{|I|}{2}\Big),

where the maximum is taken over all subsets I ⊆ {1, ..., n} with at most p elements (it is attained by the indices of the largest positive values of X_m − 1/2, as illustrated in the sketch below).
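Computationally the maximum over subsets |I| ≤ p does not require enumerating subsets: it suffices to keep the positive terms among the p largest values of X_m − 1/2. A minimal sketch, with arbitrary illustrative data, is below.

```python
import numpy as np

def sparse_glrt_stat(x: np.ndarray, p: int) -> float:
    """max over subsets I with |I| <= p of sum_{m in I} (x_m - 1/2):
    keep only the positive terms among the p largest values of x_m - 1/2."""
    top = np.sort(x - 0.5)[::-1][:p]        # p largest values of x_m - 1/2
    return float(np.sum(top[top > 0]))

rng = np.random.default_rng(5)
x = rng.standard_normal(50)
x[:3] += 1.0                                # hypothetical signal in the first 3 coordinates
print("GLRT statistic:", sparse_glrt_stat(x, p=5))
```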
Problem.
A sample of k distinct serial numbers is drawn without replacement from the population {1, ..., N} and listed in increasing order X_1 < ... < X_k; the population size N is the unknown parameter.
(1) Find the likelihood of the sample.
Solution
There are \binom{N}{k} ways to choose k serial numbers out of N. Since the list of the k numbers is ordered, the likelihood function is

L(X;N) = P_N(X_1=x_1,\dots,X_k=x_k) = \binom{N}{k}^{-1} 1_{\{x_1<\dots<x_k\le N\}}, \qquad x_i\in\{1,\dots,N\},

where the unknown parameter N takes its values in the natural numbers N.
(2) Find the MLE of N.
Solution
For a fixed value of M(x) := max_i x_i = x_k, the function L(x;N) vanishes for N < M(x) and decreases in N for N ≥ M(x). Hence the MLE of N is \hat N = M(X) = X_k, i.e. the maximal serial number among those observed.
(3) Find the minimal sufficient statistic.
Solution
Note that

L(x;N) = \binom{N}{k}^{-1} 1_{\{x_1<\dots<x_k\}}\,1_{\{x_k\le N\}},

and by the F-N factorization theorem, the statistic M(x) = max_i x_i = x_k is sufficient. Further, let x and y be two k-tuples of ordered distinct integers such that M(x) ≠ M(y). Then the ratio

\frac{L(x;N)}{L(y;N)} = \frac{1_{\{M(x)\le N\}}}{1_{\{M(y)\le N\}}}

is a nonconstant function of N and hence M(X) is minimal sufficient.
(4) Prove that the p.m.f. of M(X) = max_i X_i = X_k is

P_N\big(M(X)=j\big) = \binom{j-1}{k-1}\Big/\binom{N}{k}, \qquad j = k,\dots,N.

Solution
Clearly, if k serial numbers are observed, the maximal one cannot be less than k and hence P_N(M(X)=j) = 0 for j < k. For j ≥ k, the event {M(X)=j} is comprised of k-tuples of serial numbers containing the integer j and any k−1 of the j−1 integers smaller than j. There are \binom{j-1}{k-1} such tuples and the claimed formula follows.
(5) Show that M(X) is a complete statistic.
Solution
For any function g,

E_N\,g(M) = \sum_{j=k}^{N} g(j)\binom{j-1}{k-1}\Big/\binom{N}{k}, \qquad N\ge k.

We shall argue that E_N g(M) = 0 for all N ≥ k implies g(i) = 0 for all i ≥ k. To this end, suppose that g(i) = 0 for all i = k, ..., n; then

E_{n+1}\,g(M) = g(n+1)\binom{n}{k-1}\Big/\binom{n+1}{k},

and E_{n+1}g(M) = 0 implies g(n+1) = 0, i.e. g(i) = 0 for all i = k, ..., n+1. Since E_k g(M) = g(k), it follows that E_k g(M) = 0 implies g(k) = 0, and the claim holds by induction.
(6) Find the UMVUE of N.
Hint: you may find useful the combinatorial identity

\sum_{j=k}^{N}\binom{j}{k} = \binom{N+1}{k+1}, \qquad N\ge k.

Solution
Let's first see where the hinted combinatorial identity comes from. Summing up over the p.m.f. of M we get

\sum_{j=k}^{N}\binom{j-1}{k-1}\Big/\binom{N}{k} = 1.

Replacing k with k+1 and N with N+1 we obtain

1 = \sum_{j=k+1}^{N+1}\binom{j-1}{k}\Big/\binom{N+1}{k+1} = \sum_{j=k}^{N}\binom{j}{k}\Big/\binom{N+1}{k+1},

which is the hinted identity. Now, using j\binom{j-1}{k-1} = k\binom{j}{k},

E_N\,M = \sum_{j=k}^{N} j\binom{j-1}{k-1}\Big/\binom{N}{k} = k\sum_{j=k}^{N}\binom{j}{k}\Big/\binom{N}{k}
= k\binom{N+1}{k+1}\Big/\binom{N}{k} = \frac{k(N+1)}{k+1},

and hence the estimator

\tilde N = M(X)\big(1+1/k\big) - 1, \qquad 1\le k\le N,

is unbiased. Being a function of the complete sufficient statistic M(X), it is the UMVUE of N.
(7) Are the estimator M(X) and the UMVUE comparable for k ≥ 2? Is any of these estimators inadmissible?
Solution
The formula for var_N(M) is obtained by calculations similar to those in the previous question:

var_N(M) = \frac{k(N+1)(N-k)}{(k+1)^2(k+2)}.

The risk of the UMVUE is

R(\tilde N,N) = var_N(\tilde N) = var_N(M)\,\big(1+1/k\big)^2,

while the risk of M(X) is

R(M,N) = var_N(M) + \big(E_N M - N\big)^2 = var_N(M) + \frac{(N-k)^2}{(k+1)^2}.

Hence

\frac{R(M,N)}{R(\tilde N,N)} = \Big(\frac{k}{1+k}\Big)^2\bigg(1 + \frac{(E_N M - N)^2}{var_N(M)}\bigg)
= \Big(\frac{k}{1+k}\Big)^2\bigg(1 + \frac{(N-k)(k+2)}{k(N+1)}\bigg).

For k ≥ 2 the right hand side increases to 2k/(k+1) > 1 as N → ∞ and exceeds 1 for all sufficiently large N, so that for all such N the UMVUE outperforms the MLE M(X); for N close to k, however, the inequality reverses (e.g. for N = k+1 the MLE has the smaller risk), so the two estimators are not comparable uniformly in N.
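The risk comparison above can be checked by simulation. The following sketch (not from the original text) estimates both MSEs by Monte Carlo for a few illustrative (N, k) pairs; the number of repetitions is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(6)

def risks(N: int, k: int, reps: int = 100_000) -> tuple[float, float]:
    """Monte Carlo MSE of the MLE M and of the UMVUE M*(1+1/k)-1 for the serial-number problem."""
    # k distinct serial numbers out of {1,...,N}: first k entries of a random permutation
    samples = rng.random((reps, N)).argsort(axis=1)[:, :k] + 1
    M = samples.max(axis=1)
    umvue = M * (1 + 1 / k) - 1
    return float(np.mean((M - N) ** 2)), float(np.mean((umvue - N) ** 2))

for N, k in [(3, 2), (20, 2), (100, 5)]:
    r_mle, r_umvue = risks(N, k)
    print(f"N={N:3d}, k={k}:  MSE(MLE) ~ {r_mle:8.3f},  MSE(UMVUE) ~ {r_umvue:8.3f}")
```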
Problem 2.
It is required to test whether the blood pressure of the patients depends on the deviation of their weight from the nominal value. To this end, n patients are examined and their blood pressures and weight deviations are recorded. The blood pressure of the i-th patient is assumed to satisfy the linear regression model

Y_i = a x_i + b + \varepsilon_i, \qquad i = 1,\dots,n,

where the ε_i's are i.i.d. N(0,1) random variables, x_i is the (non-random) measured weight deviation of the i-th patient and a and b are parameters. Under these assumptions, the lack of dependence corresponds to the value a = 0.
(1) Assuming that b is known and x ≠ 0, find the most powerful test of size α for the problem

H_0: a = 0 \qquad\text{vs}\qquad H_1: a = a_1,

where a_1 > 0 is a known number. Calculate the corresponding power.
Solution
By the N-P lemma the likelihood ratio test is the most powerful. The likelihood function in this problem is

L(Y;a,b) = \frac{1}{(2\pi)^{n/2}}\exp\Big(-\frac12\sum_{i=1}^n (Y_i - ax_i - b)^2\Big),

and

\frac{L(Y;a_1,b)}{L(Y;0,b)} = \exp\Big(-\frac12\sum_{i=1}^n(Y_i-a_1x_i-b)^2 + \frac12\sum_{i=1}^n(Y_i-b)^2\Big)
= \exp\Big(\frac12\sum_{i=1}^n a_1x_i\big(2Y_i-2b-a_1x_i\big)\Big)
= \exp\Big(a_1\sum_{i=1}^n x_i(Y_i-b) - \frac12 a_1^2\sum_{i=1}^n x_i^2\Big).

Hence the LRT rejects H_0 if and only if \sum_i x_i(Y_i-b)/\|x\| \ge c. Note that under P_0,

\frac{1}{\|x\|}\sum_{i=1}^n x_i(Y_i-b) = \frac{1}{\|x\|}\sum_{i=1}^n x_i\varepsilon_i \sim N(0,1),

and hence

P_0\Big(\frac{1}{\|x\|}\sum_i x_i(Y_i-b)\ge c\Big) = 1 - \Phi(c),

so that c = \Phi^{-1}(1-\alpha). Under P_{a_1} the same statistic has the N(a_1\|x\|, 1) distribution, and the corresponding power is

\beta(a_1) = 1 - \Phi\big(\Phi^{-1}(1-\alpha) - a_1\|x\|\big).
(2) Again assuming that b is known, find the UMP test for the problem of deciding whether the dependence is negative or positive, i.e.

H_0: a < 0 \qquad\text{vs}\qquad H_1: a \ge 0.

Solution
By the K-R theorem, the LRT test from the previous question is UMP for this problem, since the likelihood corresponds to a 1-exponential family with monotone likelihood ratio in the statistic \sum_i x_i(Y_i-b).
(3) Assuming that both a and b are unknown and that not all x_i's are equal, find the GLRT statistic for the problem of testing the dependence

H_0: a = 0 \qquad\text{vs}\qquad H_1: a \ne 0.

Solution
The MLE of b under H_0 is \hat b_0 = \bar Y and

\sup_{\Theta_0}\log L(Y;\theta) = -\frac n2\log(2\pi) - \frac12\sum_{i}(Y_i-\bar Y)^2 =: -\frac n2\log(2\pi) - \frac n2\,\widehat{var}(Y).

Under H_1 the MLEs are

\hat a_1 = \frac{\frac1n\sum_i(Y_i-\bar Y)(x_i-\bar x)}{\frac1n\sum_i(x_i-\bar x)^2} =: \frac{\widehat{cov}(x,Y)}{\widehat{var}(x)},
\qquad \hat b_1 = \bar Y - \hat a_1\bar x,

and

\sup_{\Theta_1}\log L(Y;\theta) = -\frac n2\log(2\pi) - \frac n2\Big(\widehat{var}(Y) - \frac{\widehat{cov}^2(x,Y)}{\widehat{var}(x)}\Big).

Hence the GLRT statistic is

\log\lambda_n(Y) = \sup_{\Theta_1}\log L - \sup_{\Theta_0}\log L = \frac n2\,\frac{\widehat{cov}^2(x,Y)}{\widehat{var}(x)},

and the GLRT rejects H_0 for large values of \big|\widehat{cov}(x,Y)\big|\big/\sqrt{\widehat{var}(x)}.
(4) Find the critical value of the GLRT of level α ∈ (0,1).
Solution
Under H_0,

\widehat{cov}(x,Y) = \frac1n\sum_i(Y_i-\bar Y)(x_i-\bar x) = \frac1n\sum_i(\varepsilon_i-\bar\varepsilon)(x_i-\bar x),

and

E\Big(\sum_i(\varepsilon_i-\bar\varepsilon)(x_i-\bar x)\Big)^2
= \sum_i\sum_j\big(\delta_{ij}-1/n\big)(x_i-\bar x)(x_j-\bar x)
= \sum_i(x_i-\bar x)^2 - \frac1n\Big(\sum_i(x_i-\bar x)\Big)^2 = n\,\widehat{var}(x).

Hence

\frac{\widehat{cov}(x,Y)}{\sqrt{\widehat{var}(x)}} \sim N\Big(0,\frac1n\Big). \tag{$\star$}

Consequently,

P_0\bigg(\Big|\frac{\widehat{cov}(x,Y)}{\sqrt{\widehat{var}(x)}}\Big|\ge c\bigg)
= P_0\bigg(\sqrt n\,\Big|\frac{\widehat{cov}(x,Y)}{\sqrt{\widehat{var}(x)}}\Big|\ge\sqrt n\,c\bigg) = 2\Phi(-\sqrt n\,c),

and c = \frac{1}{\sqrt n}\Phi^{-1}(1-\alpha/2).
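The test from (3)-(4) is easy to evaluate on data; the following sketch (not from the original text) assumes scipy is available and uses a hypothetical design and parameter values.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n, a, b, alpha = 200, 0.3, 5.0, 0.05        # illustrative values
x = rng.uniform(-1, 1, size=n)              # arbitrary non-constant design
y = a * x + b + rng.standard_normal(n)

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
var_x = np.var(x)
stat = cov_xy / np.sqrt(var_x)              # ~ N(0, 1/n) under H0: a = 0
c = norm.ppf(1 - alpha / 2) / np.sqrt(n)

print(f"|statistic| = {abs(stat):.4f}, critical value = {c:.4f}, reject H0: {abs(stat) >= c}")
```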
(5) A sequence of tests is said to be consistent if their powers converge to 1 as n → ∞ at all alternatives. Derive the sufficient and necessary condition on the sequence (x_i) under which the corresponding sequence of tests from the previous question is consistent.
Solution
Note that for a ≠ 0 and b ∈ R,

\widehat{cov}(x,Y) = \frac1n\sum_{i=1}^n(x_i-\bar x)(Y_i-\bar Y)
= \frac1n\sum_{i=1}^n(x_i-\bar x)\big(a(x_i-\bar x)+\varepsilon_i\big)
= a\,\widehat{var}(x) + \frac1n\sum_{i=1}^n(x_i-\bar x)\varepsilon_i,

so that, by ($\star$),

\frac{\widehat{cov}(x,Y)}{\sqrt{\widehat{var}(x)}} = a\sqrt{\widehat{var}(x)} + \frac{Z}{\sqrt n}, \qquad Z\sim N(0,1).

Consequently, the power of the test from (4) is

\beta_n(a,b) = P_{a,b}\Big(\big|\sqrt n\,a\sqrt{\widehat{var}(x)} + Z\big|\ge\Phi^{-1}(1-\alpha/2)\Big)\longrightarrow 1,

where the convergence holds if and only if n\,\widehat{var}(x)\to\infty as n → ∞.
(6) Show that if all x_i's are equal to a nonzero constant, no consistent sequence of tests exists for the problem from (3).
Solution
Suppose φ_n is a consistent sequence of level-α tests, i.e. E_0φ_n ≤ α and E_{a,b}φ_n → 1 for all a ≠ 0, b ∈ R. If x_i ≡ x for all i, then Y_1, ..., Y_n are i.i.d. N(ax+b, 1), so the distribution of the sample depends on (a, b) only through ax+b. In particular, any alternative (a, b) with a ≠ 0 generates the same distribution as the null point (0, b+ax), whence E_{a,b}φ_n = E_{0,b+ax}φ_n ≤ α < 1 for all n, contradicting consistency. Hence no consistent sequence of tests exists.
(7) Suppose now that the weight deviations are random: X_1, ..., X_n are i.i.d. with some density f, independent of the noise ε_1, ..., ε_n, and Y_i = aX_i + b + ε_i. The log-likelihood becomes

\log L(Y,X;\theta) = -\frac n2\log(2\pi) - \frac12\sum_{i=1}^n\big(Y_i - aX_i - b\big)^2 + \sum_{i=1}^n\log f(X_i),

so the GLRT is based on the same statistic \widehat{cov}(X,Y)\big/\sqrt{\widehat{var}(X)} as in (3)-(4). Since the ε_i's and the X_i's are independent, for any t ∈ R

E\exp\Big(it\,\frac{\widehat{cov}(X,Y)}{\sqrt{\widehat{var}(X)}}\Big)
= E\,E\Big(\exp\Big(it\,\frac{\widehat{cov}(X,\varepsilon)}{\sqrt{\widehat{var}(X)}}\Big)\,\Big|\,X_1,\dots,X_n\Big)
= E\exp\Big(-\frac{t^2}{2n}\Big) = \exp\Big(-\frac{t^2}{2n}\Big),

where the middle equality follows from ($\star$), applied conditionally on X_1, ..., X_n. Hence, under H_0,

\frac{\widehat{cov}(X,Y)}{\sqrt{\widehat{var}(X)}} \sim N\Big(0,\frac1n\Big),

and the critical value remains the same as in the previous question.
Problem.
The observations are two independent samples X_1, ..., X_n i.i.d. Exp(λ) and Y_1, ..., Y_n i.i.d. Exp(1/λ), so that E_λX_1 = 1/λ and E_λY_1 = λ, where λ > 0 is the unknown parameter. The likelihood of the sample is

L(X,Y;\lambda) = \prod_{i=1}^n \lambda e^{-\lambda X_i}\cdot\frac1\lambda e^{-Y_i/\lambda}
= \exp\Big(-\lambda\sum_{i=1}^n X_i - \frac1\lambda\sum_{i=1}^n Y_i\Big),

and for two samples

\frac{L(x,y;\lambda)}{L(x',y';\lambda)} = \exp\Big(-n\lambda(\bar x-\bar x') - \frac n\lambda(\bar y-\bar y')\Big), \qquad \lambda>0,

which does not depend on λ only if \bar x = \bar x' and \bar y = \bar y'. Hence (X̄_n, Ȳ_n) is minimal sufficient.
and remains open already for many years (see, however, the recent progress in http://arxiv.org/abs/1302.0924)
The Fisher information of the pair (X_1, Y_1) is I_1(λ) = 2/λ², so the C-R bound for unbiased estimators of λ based on the sample is λ²/(2n). Maximizing the log-likelihood −λnX̄_n − nȲ_n/λ over λ > 0 gives

\hat\lambda_n = \sqrt{\bar Y_n/\bar X_n},

which is readily checked to be the only maximum, by inspecting the second derivative.
(6) Is the MLE consistent?
Solution
By the LLN, X̄_n \xrightarrow{P} 1/λ and Ȳ_n \xrightarrow{P} λ, and hence by Slutsky's lemma and continuity of the square root, \hat\lambda_n \xrightarrow{P} λ for all λ > 0, i.e. the MLE is consistent.
(7) Consider the estimator

\tilde\lambda_n = \frac12\Big(\bar Y_n + \frac{1}{\bar X_n}\Big).

Is it consistent? Asymptotically normal? If yes, find the asymptotic variance and compare it to the C-R bound from (4).
Solution
The sequence of estimators is consistent, similarly to (6). Since \sqrt n\,(\bar X_n - 1/\lambda)\xrightarrow{d}N(0,1/\lambda^2), the Δ-method applies to g(s) = 1/s at s = 1/λ:

\sqrt n\,\Big(\frac{1}{\bar X_n} - \lambda\Big) = \sqrt n\,\big(g(\bar X_n) - g(s)\big) \xrightarrow{d} N\big(0, V(s)\big),

with

V(s) = \big(g'(s)\big)^2\,s^2 = \Big(\frac{1}{s^2}\Big)^2 s^2 = \frac{1}{s^2} = \lambda^2.

Similarly, \sqrt n\,(\bar Y_n - \lambda)\xrightarrow{d}N(0,\lambda^2), and since the two samples are independent,

\sqrt n\,\big(\tilde\lambda_n - \lambda\big) \xrightarrow{d} N\Big(0, \frac{\lambda^2}{2}\Big).

The asymptotic variance coincides with the inverse of the Fisher information rate, i.e. the estimator is asymptotically efficient in Fisher's sense.
Problem 2.
An integer valued quantity (e.g., the number of emitted particles per second) is measured by a device, which is suspected to introduce censoring at some level K. More precisely, let Z_1, ..., Z_n be a sample from the Geo(p) distribution with known p ∈ (0,1) and define X_i = min(Z_i, K+1), where K is the unknown censoring level.
Below we shall explore hypothesis testing problems for K, given the censored data X = (X_1, ..., X_n). For notational convenience, we shall regard K = ∞ as a point in the parametric space N ∪ {∞} and interpret it as no censoring, i.e. X_1, ..., X_n are i.i.d. Geo(p) r.v.'s under P_∞.
(1) Show that for K ∈ N the likelihood function is given by the formula

L_n(X;K) = (1-p)^{S_n(X)-n}\,p^{\,n-C_n(X;K)}\,1_{\{M_n(X)\le K+1\}},

where

S_n(X) = \sum_{i=1}^n X_i, \qquad M_n(X) = \max_i X_i, \qquad C_n(X;K) = \sum_{i=1}^n 1_{\{X_i=K+1\}}.

Solution
For K < ∞,

P_K(X_1=m) = \begin{cases} p(1-p)^{m-1}, & m = 1,\dots,K\\ (1-p)^K, & m = K+1\\ 0, & m > K+1\end{cases}

and hence

L_n(X;K) = \prod_{i=1}^n\Big(p(1-p)^{X_i-1}1_{\{X_i\le K\}} + (1-p)^{K}1_{\{X_i=K+1\}}\Big)1_{\{X_i\le K+1\}}
= 1_{\{\max_i X_i\le K+1\}}\,(1-p)^{\sum_i(X_i-1)}\,p^{\sum_i 1_{\{X_i\le K\}}},

which is the claimed formula, since \sum_i 1_{\{X_i\le K\}} = n - C_n(X;K) on the event \{M_n(X)\le K+1\}.
(2) Find the most powerful test statistic for the problem of testing for presence of the known censoring level K_0:

H_0: K = K_0 \qquad\text{vs}\qquad H_1: K = \infty.

Solution
The most powerful test rejects H_0 if and only if

\big\{L_n(X;\infty)\ge c\,L_n(X;K_0)\big\}
= \big\{p^{\,C_n(X;K_0)}\ge c\,1_{\{M_n(X)\le K_0+1\}}\big\}
= \big\{M_n(X)>K_0+1\big\}\cup\big\{C_n(X;K_0)\le \tilde c\big\},

where \tilde c = \log c/\log p, i.e. H_0 is rejected either when the maximum exceeds K_0 + 1 or when the number of censorings is low.
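The three statistics S_n, M_n, C_n and the MP decision rule are straightforward to compute on data. The sketch below (not from the original text) simulates data under H_0 with arbitrary illustrative values of n, p and K_0.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, K0 = 300, 0.3, 4                       # illustrative values

z = rng.geometric(p, size=n)                 # Geo(p) on {1, 2, ...}
x = np.minimum(z, K0 + 1)                    # data censored at level K0 (here H0 is true)

S_n = x.sum()                                # statistics entering L_n(X; K)
M_n = x.max()
C_n = int(np.sum(x == K0 + 1))               # number of censored observations

# MP test of H0: K = K0 vs H1: K = infinity rejects when M_n > K0 + 1
# or when the number of censorings C_n falls below a critical value.
print(f"S_n={S_n}, M_n={M_n}, C_n={C_n}, expected censorings under H0 ~ {n * (1 - p)**K0:.1f}")
```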
(3) Using the CLT approximation, find the asymptotic MP test of size α ∈ (0,1).
Solution
Under H_0, \{M_n(X)>K_0+1\} = \emptyset and C_n(X;K_0) \sim Bin\big(n,(1-p)^{K_0}\big), and hence

P_{K_0}\Big(\{M_n(X)>K_0+1\}\cup\{C_n(X;K_0)\le\tilde c\}\Big) = P_{K_0}\big(C_n(X;K_0)\le\tilde c\big)
= P_{K_0}\bigg(\frac{C_n(X;K_0)-n(1-p)^{K_0}}{\sqrt{n(1-p)^{K_0}\big(1-(1-p)^{K_0}\big)}}\le\frac{\tilde c-n(1-p)^{K_0}}{\sqrt{n(1-p)^{K_0}\big(1-(1-p)^{K_0}\big)}}\bigg)
\approx \Phi\bigg(\frac{\tilde c-n(1-p)^{K_0}}{\sqrt{n(1-p)^{K_0}\big(1-(1-p)^{K_0}\big)}}\bigg).

If we choose

\tilde c_n = n(1-p)^{K_0} + \Phi^{-1}(\alpha)\sqrt{n(1-p)^{K_0}\big(1-(1-p)^{K_0}\big)},

the size of the test converges to α as n → ∞.
(5) Prove that the GLRT for the problem of testing for presence of an unknown censoring level:

H_0: K<\infty \qquad\text{vs}\qquad H_1: K=\infty,

rejects H_0 if and only if

\Big\{\sum_{i=1}^n 1_{\{X_i=M_n(X)\}}\le c\Big\}.

Solution
The GLRT statistic is

\lambda_n(X) = \frac{L_n(X;\infty)}{\max_{K\in\mathbb N}L_n(X;K)}
= \frac{1}{\max_{K\in\mathbb N}p^{-C_n(X;K)}1_{\{M_n(X)\le K+1\}}}
= p^{\,C_n(X;M_n(X)-1)} = p^{\sum_{i=1}^n 1_{\{X_i=M_n(X)\}}},

since the maximum over K is attained at K = M_n(X)−1, for which the number of censored observations C_n(X;K) is the largest. Hence the GLRT rejects H_0 if and only if the number of maxima in the sample is small:

\sum_{i=1}^n 1_{\{X_i=M_n(X)\}}\le c.
(6) Show that for any K ∈ N,

\frac1n\sum_{i=1}^n 1_{\{X_i=M_n(X)\}} \xrightarrow{P_K}\ (1-p)^K, \qquad n\to\infty.

Hint: Find the limit of M_n(X) under P_K and figure out how to apply the LLN.
Solution
Under P_K, M_n(X) converges to K+1 and hence \frac1n\sum_i 1_{\{X_i=M_n(X)\}} essentially behaves as \frac1n\sum_i 1_{\{X_i=K+1\}} for large n. More precisely, since M_n(X) ≤ K+1, P_K-a.s.,

\frac1n\sum_{i=1}^n 1_{\{X_i=M_n(X)\}}
= \frac1n\sum_{i=1}^n 1_{\{X_i=K+1\}} + \frac1n\sum_{i=1}^n\Big(1_{\{X_i=M_n(X)\}}-1_{\{X_i=K+1\}}\Big)1_{\{M_n(X)<K+1\}}.

The first term converges in P_K-probability to (1−p)^K by the LLN, while the second term converges to zero:

E_K\Big|\frac1n\sum_{i=1}^n\Big(1_{\{X_i=M_n(X)\}}-1_{\{X_i=K+1\}}\Big)1_{\{M_n(X)<K+1\}}\Big|
\le 2P_K\big(M_n(X)<K+1\big) = 2\Big(P_K(X_1<K+1)\Big)^n = 2\Big(1-(1-p)^K\Big)^n\longrightarrow 0,

and the claim follows by Slutsky's lemma.
(7) It can be shown that for any γ ∈ (0,1),

n^{\gamma-1}\sum_{i=1}^n 1_{\{X_i=M_n(X)\}} \xrightarrow{P_\infty} 0.

Using this limit and the result of the previous question, suggest a sequence of critical values c_n, so that the corresponding sequence of GLRTs' significance levels converges to 0 for any fixed null hypothesis and the power converges to 1 as n → ∞.
Solution
Take any γ ∈ (0,1) and let c_n = n^{1-\gamma}. Then for any K ∈ N, the significance level converges to zero:

P_K\Big(\sum_{i=1}^n 1_{\{X_i=M_n(X)\}}\le c_n\Big)
= P_K\Big(\frac1n\sum_{i=1}^n 1_{\{X_i=M_n(X)\}}\le n^{-\gamma}\Big)\longrightarrow 0,

since by (6) the average inside the probability converges to (1−p)^K > 0, while n^{-γ} → 0. On the other hand, the power converges to 1:

P_\infty\Big(\sum_{i=1}^n 1_{\{X_i=M_n(X)\}}\le c_n\Big)
= P_\infty\Big(n^{\gamma-1}\sum_{i=1}^n 1_{\{X_i=M_n(X)\}}\le 1\Big)\longrightarrow 1,

by the claimed limit.
Where does the claimed limit come from? Under P_∞, M_n(X) → ∞ and hence the number of times the maximum is attained up to time n might grow slower than linearly; in fact, according to the claim, it grows slower than any positive power of n. Let's see why this is true. Since the X_i's are i.i.d.,

P_\infty\big(X_i=M_n(X)\big) = P_\infty\big(X_1=M_n(X)\big)
= P_\infty\Big(\bigcap_{i=2}^n\{X_i\le X_1\}\Big)
= E_\infty\,P_\infty\Big(\bigcap_{i=2}^n\{X_i\le X_1\}\,\Big|\,X_1\Big)
= E_\infty\Big(1-(1-p)^{X_1}\Big)^{n-1}.

Let δ ∈ (γ, 1) and z_n := \frac{\delta\log n}{|\log(1-p)|}, so that (1-p)^{z_n} = n^{-\delta}. Splitting the expectation according to whether X_1 ≤ z_n or X_1 > z_n,

P_\infty\big(X_1=M_n(X)\big) \le \big(1-n^{-\delta}\big)^{n-1} + P_\infty(X_1>z_n)
\le \big(1-n^{-\delta}\big)^{n-1} + \frac{n^{-\delta}}{1-p}.

Consequently,

E_\infty\,n^{\gamma-1}\sum_{i=1}^n 1_{\{X_i=M_n(X)\}} = n^{\gamma}\,P_\infty\big(X_1=M_n(X)\big)
\le n^{\gamma}\big(1-n^{-\delta}\big)^{n-1} + \frac{n^{\gamma-\delta}}{1-p}\longrightarrow 0,

since \big(1-n^{-\delta}\big)^{n-1}\le e^{-(n-1)n^{-\delta}} decays faster than any power of n (δ < 1) and γ − δ < 0, as claimed.