Msda3 Notes
A realisation of λ̂ is called an estimate, which is not denoted by λ̂. In Example A (page 261), the estimator of λ is λ̂ = X̄. Later, there is

V = (1/n) Σ_{i=1}^{n} (X_i − X̄)²

Since E(V) = (n − 1)σ²/n, an unbiased estimator of σ² is

S² = [n/(n − 1)] V

The approximation of the SE by √(v/n) or s/√n is an estimate of SD(X̄). The actual error in the estimate is unknown, but its typical size is roughly SD(X̄).
General definition:
SE(estimate) := SD(estimator)
The SE often involves some unknown parameters. Obtaining an approximate SE, by replacing
the parameters by their estimates, is called the bootstrap.
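As a rough illustration of the plug-in idea, here is a minimal sketch with hypothetical Poisson counts (not the book's data); for the Poisson model, SD(λ̂) = √(λ/n).

```python
import numpy as np

# Hypothetical Poisson counts standing in for data like Example A's.
x = np.array([7, 9, 8, 10, 6, 8, 9, 11, 7, 8])
n = len(x)

lam_hat = x.mean()               # estimate of lambda (MOM and ML agree here)
v = x.var(ddof=0)                # v = (1/n) * sum((x_i - xbar)^2)
s = x.std(ddof=1)                # s, from the unbiased variance

# SD(lam_hat) = sqrt(lambda/n) involves the unknown lambda; plugging in
# an estimate of the variance gives an approximate SE.
se_poisson = np.sqrt(lam_hat / n)   # uses the Poisson structure var = mean
se_v = np.sqrt(v / n)               # sqrt(v/n)
se_s = s / np.sqrt(n)               # s/sqrt(n)
print(se_poisson, se_v, se_s)
```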
4.2 Sample size
It is important to have a strict definition of sample size:

sample size = the number of IID data

The sample size is usually denoted by n, but here is a confusing example. Let X ~ Binomial(n, p). The sample size is 1, not n (wouldn't it be confusing to write "n is 1, not n"?). That it is equivalent to n IID Bernoulli(p) random variables is the job of sufficiency.
4.3 What does θ mean?
Often, θ denotes two things, both in the parameter space Θ: (a) a particular unknown constant, (b) a generic element. This double-think is quite unavoidable in maximum likelihood, but for method of moments (MOM), it can be sidestepped by keeping Θ implicit. For example, the estimation problem on pages 255-257 can be stated as follows: For n = 1,207, assume that X₁, ..., Xₙ are IID Poisson(λ) random variables, where λ is an unknown positive constant. Given data x₁, ..., xₙ, estimate λ and find an approximate SE. The MOM estimator is constructed without referring to the parameter space: the positive real numbers.
4.4 Alpha particle emissions (page 256)
10220/1207 ≈ 8.467, not close to 8.392. In Berk (1966), the first three categories (0, 1, 2 emissions) were reported as 0, 4, 14, and the average was 8.3703. Since 8.3703 × 1207 = 10102.95, the total number of emissions is most likely 10103. Combined with the reported s² = 8.6592, the sum of squares is close to 95008.54, likely 95009. Going through all possible counts for 17, 18, 19, 20 emissions, the complete table is reconstructed, shown in Table 1 below.
Emissions Frequency
0 0
1 4
2 14
3 28
4 56
5 105
6 126
7 146
8 164
9 161
10 123
11 101
12 74
13 53
14 23
15 15
16 9
17 3
18 0
19 1
20 1
Total 1207
Table 1: Reconstructed Table 10A of Berk (1966). x̄ = 8.3703, s² = 8.6596.
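A quick check of the reconstruction, recomputing the quantities quoted above from Table 1:

```python
import numpy as np

# Reconstructed Table 1: emission count k and frequency f_k.
k = np.arange(21)
f = np.array([0, 4, 14, 28, 56, 105, 126, 146, 164, 161,
              123, 101, 74, 53, 23, 15, 9, 3, 0, 1, 1])

x = np.repeat(k, f)               # the 1207 individual counts
print(f.sum())                    # 1207
print((k * f).sum())              # 10103 emissions in total
print((k ** 2 * f).sum())         # 95009, the sum of squares
print(round(x.mean(), 4))         # 8.3703
print(round(x.var(ddof=1), 4))    # 8.6596
```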
4.5 Rainfall (pages 264-265)
Let X₁, ..., X₂₂₇ be IID gamma(α, λ) random variables, where the shape α and the rate λ are unknown constants. The MOM estimator of (α, λ) is

(α̂, λ̂) = (X̄²/V, X̄/V)

The SEs, SD(α̂) and SD(λ̂), depend on the unknown (α, λ). Replacing (α, λ) by the estimates (0.38, 1.67) gives approximate SEs, which can be estimated by Monte Carlo. The bias in (0.38, 1.67) is around (0.10, 0.02). The maximum likelihood estimates have smaller bias and SE.
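A minimal Monte Carlo (parametric bootstrap) sketch of these SEs and biases, assuming n = 227 and treating the quoted estimates (0.38, 1.67) as the true (α, λ):

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 227, 10_000
alpha0, lam0 = 0.38, 1.67               # plug-in values for (alpha, lambda)

def mom(x):
    xbar, v = x.mean(), x.var(ddof=0)   # V has denominator n
    return xbar**2 / v, xbar / v        # (alpha_hat, lambda_hat)

# B samples from gamma(alpha0, lam0); numpy's scale parameter is 1/rate.
est = np.array([mom(rng.gamma(alpha0, 1/lam0, size=n)) for _ in range(B)])

print(est.std(axis=0))                     # approximate SEs of (alpha_hat, lambda_hat)
print(est.mean(axis=0) - [alpha0, lam0])   # approximate biases
```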
4.6 Maximum likelihood
Unlike MOM, in maximum likelihood (ML) the two meanings of λ clash, like an actor playing two characters who appear in the same scene. The safer route is to put the subscript 0 on the unknown constant. This is worked out in the Poisson case.
Let X₁, ..., Xₙ be IID Poisson(λ₀) random variables, where λ₀ > 0 is an unknown constant. The aim is to estimate λ₀ using realisations x₁, ..., xₙ. Let λ > 0. The likelihood at λ is the probability of the data, supposing that they come from Poisson(λ):

L(λ) = Pr(X₁ = x₁, ..., Xₙ = xₙ) = ∏_{i=1}^{n} λ^{x_i} e^{−λ} / x_i!

On (0, ∞), L has a unique maximum at x̄; x̄ is the ML estimate of λ₀. The ML estimator is λ̂₀ = X̄.
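Spelling out the maximisation (assuming Σ xᵢ > 0):

log L(λ) = (Σ_{i=1}^{n} x_i) log λ − nλ − Σ_{i=1}^{n} log(x_i!),   (d/dλ) log L(λ) = (Σ_{i=1}^{n} x_i)/λ − n,

which is positive for λ < x̄ and negative for λ > x̄, so the unique maximum on (0, ∞) is at λ = x̄.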
The distinction between λ₀ and λ helps make the motivation of maximum likelihood clear. However, for routine use, λ₀ is clunky (I have typed tutorial problems where the subscript 0 appears in the correct places; it is hard on the eyes). The following is more economical, at the cost of some mental gymnastics.
Let X₁, ..., Xₙ be IID Poisson(λ) random variables, where λ > 0 is an unknown constant. [Switch.] The likelihood function on (0, ∞) is

L(λ) = Pr(X₁ = x₁, ..., Xₙ = xₙ) = ∏_{i=1}^{n} λ^{x_i} e^{−λ} / x_i!

which has a unique maximum at x̄. [Switch back.] The ML estimator of λ is λ̂ = X̄.
With some practice, it gets more efficient. The random likelihood function is a product of the respective random densities:

L(λ) = ∏_{i=1}^{n} λ^{X_i} e^{−λ} / X_i!

Since L has a unique maximum at X̄, λ̂ = X̄.
The distinction between a generic λ and a fixed unknown λ₀ is useful in the motivation and in the proof of large-sample properties of the ML estimator (pages 276-279). Otherwise, λ₀ can be quietly dropped without confusion.
4.7 Hardy-Weinberg equilibrium (pages 273-275)
By the Hardy-Weinberg equilibrium, the proportions of the population having 0, 1 and 2 a alleles are respectively (1 − θ)², 2θ(1 − θ) and θ². Hence the number of a alleles in a random individual has the binomial(2, θ) distribution; so the total number of a alleles in a simple random sample of size n ∈ N has the binomial(2n, θ) distribution.
Let X₁, X₂, X₃ be the numbers of individuals of genotype AA, Aa and aa in the sample. The number of a alleles in the sample is X₂ + 2X₃ ~ binomial(2n, θ), so

var((X₂ + 2X₃)/(2n)) = θ(1 − θ)/(2n)
The Monte Carlo is unnecessary. The variance can also be obtained directly using the expectation and variance of the trinomial distribution, but the binomial route is much shorter.
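For comparison, the trinomial route spelt out: with p₂ = 2θ(1 − θ), p₃ = θ², var(X_j) = np_j(1 − p_j) and cov(X₂, X₃) = −np₂p₃,

var(X₂ + 2X₃) = n[p₂(1 − p₂) + 4p₃(1 − p₃) − 4p₂p₃] = n[p₂ + 4p₃ − (p₂ + 2p₃)²] = n[2θ + 2θ² − (2θ)²] = 2nθ(1 − θ),

which gives θ(1 − θ)/(2n) again after dividing by (2n)².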
Since there is no cited work, it is not clear whether the data were a simple random sample
of the Chinese population of Hong Kong in 1937. If not, the trinomial distribution is in doubt.
4.8 MOM on the multinomial
Let X = (X₁, ..., X_r) have the multinomial(n, p) distribution, where p is an unknown fixed probability vector. The first moment μ₁ = np. Since the sample size is 1, the estimator of μ₁ is X. The MOM estimator of p is X/n. This illustrates the definition of sample size.
Let r = 3. Under Hardy-Weinberg equilibrium, p = ((1 − θ)², 2θ(1 − θ), θ²). There are three different MOM estimators of θ.
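One possible reading of the three (my guess at the intended trio, matching each cell proportion separately to its expectation): from X₁/n = (1 − θ)², θ̂ = 1 − √(X₁/n); from X₃/n = θ², θ̂ = √(X₃/n); and from X₂/n = 2θ(1 − θ), θ̂ is a root of the quadratic 2θ(1 − θ) = X₂/n.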
4.9 Confidence interval for normal parameters (page 280)
Let x₁, ..., xₙ be realisations from IID random variables X₁, ..., Xₙ with normal(μ, σ²) distribution, where μ and σ² are unknown constants. The (1 − α)-confidence interval for μ:

[x̄ − (s/√n) t_{n−1}(α/2),  x̄ + (s/√n) t_{n−1}(α/2)]

is justified by the fact that

P(X̄ − (S/√n) t_{n−1}(α/2) ≤ μ ≤ X̄ + (S/√n) t_{n−1}(α/2)) = 1 − α

Likewise,

P(nV/χ²_{n−1}(α/2) ≤ σ² ≤ nV/χ²_{n−1}(1 − α/2)) = 1 − α

justifies the confidence interval for σ²:

[nv/χ²_{n−1}(α/2),  nv/χ²_{n−1}(1 − α/2)]
This illustrates the superiority of V over S².
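A minimal numpy/scipy sketch with hypothetical data (note that t_{n−1}(α/2) and χ²_{n−1}(α/2) above are upper-tail points, hence the 1 − α/2 arguments to ppf):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=2.0, size=20)   # hypothetical normal sample
n, alpha = len(x), 0.05

xbar = x.mean()
s = x.std(ddof=1)                 # s
v = x.var(ddof=0)                 # v, with denominator n

t = stats.t.ppf(1 - alpha/2, df=n - 1)           # t_{n-1}(alpha/2)
ci_mean = (xbar - s/np.sqrt(n)*t, xbar + s/np.sqrt(n)*t)

chi_hi = stats.chi2.ppf(1 - alpha/2, df=n - 1)   # chi2_{n-1}(alpha/2)
chi_lo = stats.chi2.ppf(alpha/2, df=n - 1)       # chi2_{n-1}(1 - alpha/2)
ci_var = (n*v/chi_hi, n*v/chi_lo)

print(ci_mean, ci_var)
```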
4.10 The Fisher information
Let X have density f(x|θ), θ ∈ Θ, an open subset of R^p. The Fisher information in X is defined as

I(θ) = −E[∂²/∂θ² log f(X|θ)]

When the above exists, it equals the more general definition

I(θ) = E[(∂/∂θ log f(X|θ))²]

In all the examples, and most applications, they are equivalent, and the first definition is easier to use.
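For example, in the Poisson(λ) case used earlier: log f(X|λ) = X log λ − λ − log X!, so ∂²/∂λ² log f(X|λ) = −X/λ² and I(λ) = −E(−X/λ²) = 1/λ; the score route gives the same, E[(X/λ − 1)²] = var(X)/λ² = 1/λ.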
4.11 Asymptotic normality of ML estimators
Let θ̂ be the ML estimator of θ based on IID random variables X₁, ..., Xₙ with density f(x|θ), where θ is an unknown constant. Let I(θ) be the Fisher information in X₁. Under regularity conditions, as n → ∞,

√(nI(θ)) (θ̂ − θ) → N(0, 1)

Note that I(θ) is the Fisher information of a single unit of the IID Xs, not that of all of them, which is n times larger. For large n, approximately

θ̂ ~ N(θ, I(θ)⁻¹/n)

I(θ)⁻¹ plays a similar role as the population variance in sampling.
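In the Poisson case, I(λ) = 1/λ, so for large n, λ̂ = X̄ ≈ N(λ, λ/n); its SD, √(λ/n), is exactly SD(X̄) = √(var(X₁)/n), the quantity estimated by the plug-in SEs earlier.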
4.12 Asymptotic normality: multinomial, regression
This is essentially the content of the paragraph before Example C on page 283. Let X = (X₁, ..., X_r) have a multinomial(n, p(θ)) distribution, where θ is an unknown constant. Let I(θ) be the Fisher information. Then as n → ∞,

√(I(θ)) (θ̂ − θ) → N(0, 1)

Proof: X has the same distribution as the sum of n IID random vectors with the multinomial(1, p(θ)) distribution. The ML estimator of θ based on these data is θ̂ (by sufficiency or direct calculation). The Fisher information of a multinomial(1, p(θ)) random vector is I(θ)/n. Now apply the result in the previous section.
If the Xs are independent but not identically distributed, as in regression, asymptotic normality of the MLEs does not follow from the previous section. A new definition of sample size and a new theorem are needed.
4.13 Characterisation of sufficiency
Let X₁, ..., Xₙ be IID random variables with density f(x|θ), where θ ∈ Θ. Let T be a function of the Xs. The sample space S is a disjoint union of the S_t across all possible values t, where

S_t = {x : T(x) = t}

The conditional distribution of X given T = t depends on θ in general. If, for every t, it is the same for every θ, then T is sufficient for θ. This gives a characterisation: T is sufficient for θ if and only if there is a function q(x) such that for every t and every θ,

f(x|t) = q(x),  x ∈ S_t

This clarifies the proof of the factorisation theorem (page 307).
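A standard example: for IID Poisson(λ) and T = Σ_{i=1}^{n} X_i,

Pr(X₁ = x₁, ..., Xₙ = xₙ | T = t) = t!/(x₁! ⋯ xₙ!) n^{−t},   x ∈ S_t,

which does not involve λ, so T is sufficient; the right-hand side is the q(x) of the characterisation.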
Logically, sufficiency is a property of Θ; "T is sufficient for Θ" is more correct.
4.14 Problem 4 (page 313)
Let X₁, ..., X₁₀ be IID with distribution P(X = 0) = 2θ/3, P(X = 1) = θ/3, P(X = 2) = 2(1 − θ)/3, P(X = 3) = (1 − θ)/3, where θ ∈ (0, 1) is an unknown constant. The MOM estimator of θ is

θ̂ = 7/6 − X̄/2
The ML estimate based on the given numerical data can be readily found. That of general x₁, ..., x₁₀, or the ML estimator, is hard because the distribution of X₁ is not in the form f(x|θ). One possibility is

f(x|θ) = (2θ/3)^{y₀} (θ/3)^{y₁} (2(1 − θ)/3)^{y₂} ((1 − θ)/3)^{y₃},  y_j = 1_{x = j}

It follows that the ML estimator is

θ̃ = Y/n

where Y is the number of times 0 or 1 turn up.
This is an interesting example. θ̂ and θ̃ are different; both are unbiased. θ̃ is efficient, but not θ̂:

var(θ̃) = θ(1 − θ)/n,  var(θ̂) = 1/(18n) + θ(1 − θ)/n

They illustrate relative efficiency more directly than on pages 299-300, and the sample-size interpretation of relative efficiency. Y is sufficient for θ, so E(θ̂ | Y) = θ̃. Another clue to the superiority of ML: if 0, 1, 2, 3 are replaced by any other four distinct numbers, the ML estimator stays the same, but the MOM estimator will be fooled into adapting. Fundamentally, the distribution is multinomial.
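A quick Monte Carlo check of the two variance formulas, with hypothetical θ = 0.4 and n = 10:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, B = 0.4, 10, 200_000
p = np.array([2*theta/3, theta/3, 2*(1-theta)/3, (1-theta)/3])
p = p / p.sum()                          # guard against rounding

x = rng.choice(4, size=(B, n), p=p)      # B samples of size n from P(X = 0,1,2,3)

theta_mom = 7/6 - x.mean(axis=1)/2       # MOM: 7/6 - Xbar/2
theta_ml = (x <= 1).mean(axis=1)         # ML:  Y/n, with Y = #{X_i in {0, 1}}

print(theta_mom.var(), 1/(18*n) + theta*(1-theta)/n)   # should roughly agree
print(theta_ml.var(), theta*(1-theta)/n)               # should roughly agree
```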
5 Hypothesis tests
Let X = (X₁, ..., X_r) be multinomial(n, p(θ)), where θ ∈ Θ, an open subset of R. Let θ̂ be the ML estimator of θ based on X. Then the null distribution of Pearson's

X² = Σ_{i=1}^{r} [X_i − n p_i(θ̂)]² / (n p_i(θ̂))

is asymptotically χ²_{r−2}. Let X be obtained from a Poisson distribution, so that Θ = (0, ∞). The above distribution holds for θ̂ based on X. Example B (page 345) assumes it also holds for the ML estimator based on the raw data, X̄, before collapsing. This is unsupported. An extreme counter-example: collapse Poisson(λ) to the indicator of 0, i.e., the data is binomial(n, e^{−λ}).