18.443 MIT Stats Course
by
Dmitry Panchenko
Department of Mathematics
Massachusetts Institute of Technology
Contents
1 Estimation theory.
1.1 Introduction.
3.1 Method of moments.
5.1 Consistency of MLE.
5.2 Asymptotic normality of MLE. Fisher information.
6.1 Rao-Cramér inequality.
7.1 Efficient estimators.
8.1 Gamma distribution.
8.2 Beta distribution.
11.1 Sufficient statistic.
13.1 Minimal jointly sufficient statistics.
13.2 $\chi^2$ distribution.
18 Testing hypotheses.
18.1 Testing simple hypotheses.
18.2 Bayes decision rules.
19.1 Most powerful test for two simple hypotheses.
23.1 Pearson's theorem.
List of Figures
2.1 Poisson distribution.
5.1 Maximize over $\theta$.
5.2 Diagram $(t-1)$ vs. $\log t$.
5.3 Lemma: $L(\theta) \le L(\theta_0)$.
9.1 Prior distribution.
9.2 Posterior distribution.
17.1 P.d.f. of $\chi^2$ distribution.
23.2 Projections of $g$.
23.3 Rotation of the coordinate system.
32.1 Example.
List of Tables
Lecture 1
Estimation theory.
1.1 Introduction
Let us consider a set $\mathcal{X}$ (probability space) which is the set of possible values that some random variable (random object) may take. Usually $\mathcal{X}$ will be a subset of $\mathbb{R}$, for example $\{0,1\}$, $[0,1]$, $[0,\infty)$, $\mathbb{R}$, etc.
I. Parametric Statistics.
We will start by considering a family of distributions $\{\mathbb{P}_\theta : \theta \in \Theta\}$ on $\mathcal{X}$, indexed by a parameter $\theta$ that takes values in a parameter set $\Theta$. Typical problems in parametric statistics are the following.
1. Estimation theory.
Based on the observations $X_1, \ldots, X_n$ we would like to estimate the unknown parameter $\theta_0$, i.e. find $\hat\theta = \hat\theta(X_1, \ldots, X_n)$ such that $\hat\theta$ approximates $\theta_0$. In this case we also want to understand how well $\hat\theta$ approximates $\theta_0$.

2. Hypothesis testing.
Decide which of the hypotheses about $\theta_0$ are likely or unlikely. Typical hypotheses: Is $\theta_0 = \theta_1$ for some particular $\theta_1$? Is $\theta_0 \le \theta_1$? Does $\theta_0$ belong to some subset of $\Theta$?
Example: In a simple yes/no vote (or two-candidate vote) our variable (vote) can take two values, i.e. we can take the space $\mathcal{X} = \{0,1\}$. Then the distribution is described by
$$\mathbb{P}(1) = p, \quad \mathbb{P}(0) = 1 - p$$
for some parameter $p \in [0,1]$. The true parameter $p_0$ is unknown. If we conduct a poll by picking $n$ people randomly and if $X_1, \ldots, X_n$ are their votes then:

1. Estimation theory. What is a natural estimate of $p_0$? One can take
$$\hat{p} = \frac{\#\{1\text{'s among } X_1, \ldots, X_n\}}{n} \approx p_0.$$
How close is $\hat{p}$ to $p_0$?

2. Hypothesis testing. How likely or unlikely are the following: Hypothesis 1: $p_0 > \frac{1}{2}$; Hypothesis 2: $p_0 < \frac{1}{2}$?
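The natural estimate $\hat{p}$ above can be tried out in a small simulation. This is only an illustrative sketch: the true proportion $p_0 = 0.55$ and the poll size $n$ are arbitrary choices, not values from the notes.

```python
import random

random.seed(0)
p0 = 0.55          # true "yes" proportion (unknown to the pollster in practice)
n = 10_000         # number of people polled

# Simulate the votes X_1, ..., X_n and form the natural estimate
# p_hat = #{1's among X_1, ..., X_n} / n.
votes = [1 if random.random() < p0 else 0 for _ in range(n)]
p_hat = sum(votes) / n
print(p_hat)  # close to p0
```

With $n = 10{,}000$ the estimate typically lands within about $0.01$ of $p_0$, which previews the $1/\sqrt{n}$ accuracy discussed in later lectures.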
Lecture 2
2.1 Some common distributions.
Let us recall some common distributions on the real line that will be used often in
this class. We will deal with two types of distributions:
1. Discrete
2. Continuous
Discrete distributions.
Suppose that a set $\mathcal{X}$ consists of a countable or finite number of points,
$$\mathcal{X} = \{a_1, a_2, a_3, \ldots\}.$$
Then a probability distribution $\mathbb{P}$ on $\mathcal{X}$ can be defined via a probability function (p.f.) $p(x)$ with the following properties:
1. $0 \le p(a_i) \le 1$,
2. $\sum_{i=1}^{\infty} p(a_i) = 1$.
The expectation of a function $\varphi(X)$ is then
$$\mathbb{E}\varphi(X) = \sum_{i=1}^{\infty} \varphi(a_i)\,p(a_i).$$

Continuous distributions are defined via a probability density function (p.d.f.) $p(x)$ on $\mathbb{R}$ such that $p(x) \ge 0$ and $\int p(x)\,dx = 1$. If a random variable $X$ has distribution $\mathbb{P}$ then the chance/probability that $X$ takes a value in the interval $[a,b]$ is
$$\mathbb{P}(X \in [a,b]) = \int_a^b p(x)\,dx.$$
Notation. The fact that a random variable $X$ has distribution $\mathbb{P}$ will be denoted by $X \sim \mathbb{P}$.

Example 1. Normal (Gaussian) distribution $N(\mu, \sigma^2)$ with mean $\mu$ and variance $\sigma^2$ is a continuous distribution on $\mathbb{R}$ with probability density function
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \quad \text{for } x \in (-\infty, \infty).$$
Example 2. Bernoulli distribution $B(p)$ is a discrete distribution on $\{0,1\}$ with probability function $\mathbb{P}(X=1) = p(1) = p$ and $p(0) = 1-p$.

Example 3. Exponential distribution $E(\alpha)$ is a continuous distribution with p.d.f.
$$p(x) = \begin{cases} \alpha e^{-\alpha x}, & x \ge 0, \\ 0, & x < 0. \end{cases}$$
For $s > 0$, the probability that $X$ will exceed level $t+s$ given that it exceeded level $t$ can be computed as follows:
$$\mathbb{P}(X \ge t+s \mid X \ge t) = \frac{\mathbb{P}(X \ge t+s,\ X \ge t)}{\mathbb{P}(X \ge t)} = \frac{\mathbb{P}(X \ge t+s)}{\mathbb{P}(X \ge t)} = \frac{e^{-\alpha(t+s)}}{e^{-\alpha t}} = e^{-\alpha s} = \mathbb{P}(X \ge s),$$
i.e.
$$\mathbb{P}(X \ge t+s \mid X \ge t) = \mathbb{P}(X \ge s).$$
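This "memoryless" identity is easy to check by simulation. The values of $\alpha$, $t$ and $s$ below are arbitrary illustrative choices:

```python
import math
import random

random.seed(1)
alpha, t, s = 2.0, 0.5, 0.7
samples = [random.expovariate(alpha) for _ in range(200_000)]

# Empirical P(X >= t + s | X >= t) should match P(X >= s) = exp(-alpha * s).
exceed_t = [x for x in samples if x >= t]
cond = sum(1 for x in exceed_t if x >= t + s) / len(exceed_t)
target = math.exp(-alpha * s)
print(cond, target)
```

Both printed values agree to a couple of decimal places, as the computation above predicts.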
Example 4. Poisson distribution $\Pi(\lambda)$ is a discrete distribution on $\{0, 1, 2, \ldots\}$ with probability function
$$\mathbb{P}(X = k) = \frac{\lambda^k}{k!}\,e^{-\lambda} \quad \text{for } k = 0, 1, 2, \ldots$$
We will show that these assumptions imply that the number of objects in the region $T$, $\mathrm{Count}(T)$, has Poisson distribution $\Pi(\lambda|T|)$ with parameter $\lambda|T|$.

[Figure: the interval $T$ split into $n$ subintervals of size $|T|/n$ with counts $X_1, X_2, \ldots, X_n$.]

We can write $\mathbb{P}(X_i = 0) \approx 1 - \frac{\lambda|T|}{n} - \varepsilon_n$, where $\varepsilon_n = \sum_{k \ge 2} \mathbb{P}(X_i = k)$, and by the last property above we assume that $n\varepsilon_n$ becomes small with $n$, since the probability to observe more than two objects on an interval of size $|T|/n$ becomes small as $n$ becomes large. Combining the two equations above gives $\mathbb{P}(X_i = 1) \approx \frac{\lambda|T|}{n}$. Also, since by the last property the probability that any count $X_i$ is $\ge 2$ is small, i.e. it goes to $0$ as $n \to \infty$, we can write
$$\mathbb{P}\big(\mathrm{Count}(T) = X_1 + \cdots + X_n = k\big) \approx \binom{n}{k}\Big(\frac{\lambda|T|}{n}\Big)^k\Big(1 - \frac{\lambda|T|}{n}\Big)^{n-k} \to \frac{(\lambda|T|)^k}{k!}\,e^{-\lambda|T|}.$$
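The binomial-to-Poisson limit in the last display can be checked numerically. The values $\lambda|T| = 3$ and $k = 4$ below are arbitrary illustrative choices:

```python
import math

lam_T = 3.0    # lambda * |T|
k = 4
n = 10_000     # number of subintervals of size |T| / n

# Binomial(n, lambda|T|/n) probability of k counts vs. its Poisson limit.
p = lam_T / n
binom = math.comb(n, k) * p**k * (1 - p) ** (n - k)
poisson = lam_T**k / math.factorial(k) * math.exp(-lam_T)
print(binom, poisson)
```

Already for $n = 10^4$ the two probabilities agree to about four decimal places.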
If $X \sim N(\mu, \sigma^2)$ then $Y = \frac{X - \mu}{\sigma} \sim N(0, 1)$ is standard normal, since
$$\mathbb{P}(Y \in [a,b]) = \mathbb{P}(X \in [\sigma a + \mu,\ \sigma b + \mu]) = \int_{\sigma a + \mu}^{\sigma b + \mu} \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx = \int_a^b \frac{1}{\sqrt{2\pi}}\,e^{-\frac{y^2}{2}}\,dy,$$
where we made the change of variables $y = (x-\mu)/\sigma$.
The first moment of a standard normal $Y$ is
$$\mathbb{E}Y = \int \frac{y}{\sqrt{2\pi}}\,e^{-\frac{y^2}{2}}\,dy = 0,$$
since we integrate an odd function. To compute the second moment $\mathbb{E}Y^2$, let us first note that since $\frac{1}{\sqrt{2\pi}}e^{-y^2/2}$ is a probability density function, it integrates to 1, i.e.
$$1 = \int \frac{1}{\sqrt{2\pi}}\,e^{-\frac{y^2}{2}}\,dy.$$
Integrating by parts,
$$1 = \int \frac{1}{\sqrt{2\pi}}\,e^{-\frac{y^2}{2}}\,dy = \Big[\,y\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac{y^2}{2}}\Big]_{-\infty}^{\infty} - \int y\,\frac{1}{\sqrt{2\pi}}\,(-y)\,e^{-\frac{y^2}{2}}\,dy = 0 + \int \frac{y^2}{\sqrt{2\pi}}\,e^{-\frac{y^2}{2}}\,dy = \mathbb{E}Y^2.$$
Therefore, $\mathbb{E}Y^2 = 1$. The variance of $Y$ is
$$\mathrm{Var}(Y) = \mathbb{E}Y^2 - (\mathbb{E}Y)^2 = 1 - 0 = 1.$$
Lecture 3
3.1 Method of moments.

Let us recall the Law of Large Numbers (LLN): for an i.i.d. sequence $X_1, X_2, \ldots$ with finite expectation $\mathbb{E}X_1$, the sample average
$$\bar{X}_n = \frac{X_1 + \cdots + X_n}{n} \to \mathbb{E}X_1$$
converges to the expectation in some sense, for example, for any arbitrarily small $\varepsilon > 0$,
$$\mathbb{P}(|\bar{X}_n - \mathbb{E}X_1| > \varepsilon) \to 0 \text{ as } n \to \infty.$$

Note. Whenever we use the LLN below we will simply say that the average converges to the expectation and will not mention in what sense. More mathematically inclined students are welcome to carry out these steps more rigorously, especially when we use the LLN in combination with the Central Limit Theorem.
Central Limit Theorem (CLT): if $\sigma^2 = \mathrm{Var}(X_1) < \infty$, then
$$\sqrt{n}\,(\bar{X}_n - \mathbb{E}X_1) \to^d N(0, \sigma^2),$$
i.e. for any interval $[a,b]$,
$$\mathbb{P}\big(\sqrt{n}(\bar{X}_n - \mathbb{E}X_1) \in [a,b]\big) \to \int_a^b \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{x^2}{2\sigma^2}}\,dx.$$
Example. Consider the family of normal distributions $\{N(\mu, \sigma^2) : \mu \in \mathbb{R},\ \sigma^2 \ge 0\}$. By the LLN,
$$\hat\mu = \bar{X} \to \mathbb{E}X_1 = \mu_0 \text{ as } n \to \infty$$
and, similarly,
$$\frac{X_1^2 + \cdots + X_n^2}{n} \to \mathbb{E}X_1^2 = \mathrm{Var}(X) + (\mathbb{E}X)^2 = \sigma_0^2 + \mu_0^2.$$
Therefore,
$$\hat\sigma^2 = \frac{X_1^2 + \cdots + X_n^2}{n} - \Big(\frac{X_1 + \cdots + X_n}{n}\Big)^2 \to \mathbb{E}X^2 - (\mathbb{E}X)^2 = \sigma_0^2.$$
Method of moments. Take a function $g$ and define
$$m(\theta) = \mathbb{E}_\theta\, g(X),$$
the expectation of $g(X)$ when $X$ has distribution $\mathbb{P}_\theta$. Suppose that $m : \Theta \to \mathrm{Im}(m)$ is one-to-one, so that the inverse $m^{-1}$ is well defined. We then take
$$\hat\theta = m^{-1}(\bar{g}) = m^{-1}\Big(\frac{g(X_1) + \cdots + g(X_n)}{n}\Big)$$
as the estimate of $\theta_0$. (Here we implicitly assumed that $\bar{g}$ is always in the set $\mathrm{Im}(m)$.) Since the sample comes from the distribution with parameter $\theta_0$, by the LLN we have
$$\bar{g} \to \mathbb{E}_{\theta_0}\, g(X_1) = m(\theta_0),$$
and therefore, if $m^{-1}$ is continuous, $\hat\theta = m^{-1}(\bar{g}) \to m^{-1}(m(\theta_0)) = \theta_0$, i.e. the estimate is consistent.
When $g(x) = x^k$, the expectation $\mathbb{E}X^k$ is called the $k$-th moment of $X$.

Example. Consider the family of exponential distributions $E(\alpha)$ with $\alpha > 0$. Take $g(x) = x$. Then
$$m(\alpha) = \mathbb{E}_\alpha\, g(X) = \frac{1}{\alpha}$$
($\frac{1}{\alpha}$ is the expectation of the exponential distribution, see Pset 1). Let us recall that we can find the inverse by solving the equation
$$m(\alpha) = \theta, \text{ i.e. in our case } \frac{1}{\alpha} = \theta \implies \alpha = m^{-1}(\theta) = \frac{1}{\theta}.$$
Therefore, we take
$$\hat\alpha = m^{-1}(\bar{g}) = m^{-1}(\bar{X}) = \frac{1}{\bar{X}}$$
as the estimate of the unknown $\alpha_0$.
Take $g(x) = x^2$. Then
$$m(\alpha) = \mathbb{E}_\alpha\, g(X) = \mathbb{E}_\alpha X^2 = \frac{2}{\alpha^2}.$$
The inverse is
$$\alpha = m^{-1}(\theta) = \sqrt{\frac{2}{\theta}},$$
and we take
$$\hat\alpha = m^{-1}(\bar{g}) = m^{-1}\big(\overline{X^2}\big) = \sqrt{\frac{2}{\overline{X^2}}}, \quad \text{where } \overline{X^2} = \frac{X_1^2 + \cdots + X_n^2}{n},$$
as another estimate of $\alpha_0$.

The question is, which estimate is better?
Definition. We say that an estimate $\hat\theta$ is asymptotically normal if
$$\sqrt{n}(\hat\theta - \theta_0) \to^d N(0, \sigma_0^2),$$
where $\sigma_0^2$ is called the asymptotic variance of the estimate $\hat\theta$.

Theorem. The estimate $\hat\theta = m^{-1}(\bar{g})$ by the method of moments is asymptotically normal with asymptotic variance
$$\sigma_0^2 = \frac{\mathrm{Var}_{\theta_0}(g)}{(m'(\theta_0))^2}.$$

Proof sketch. Writing the Taylor expansion of $m^{-1}$ at the point $m(\theta_0)$,
$$\hat\theta - \theta_0 = m^{-1}(\bar{g}) - \theta_0 = (m^{-1})'(m(\theta_0))\big(\bar{g} - m(\theta_0)\big) + \frac{(m^{-1})''(c)}{2!}\big(\bar{g} - m(\theta_0)\big)^2,$$
where $c$ lies between $\bar{g}$ and $m(\theta_0)$. Let us prove that the left hand side multiplied by $\sqrt{n}$ converges in distribution to a normal distribution. We have
$$\sqrt{n}\big(m^{-1}(\bar{g}) - \theta_0\big) = (m^{-1})'(m(\theta_0))\,\sqrt{n}\big(\bar{g} - m(\theta_0)\big) + \frac{(m^{-1})''(c)}{2!}\,\frac{1}{\sqrt{n}}\Big(\sqrt{n}\big(\bar{g} - m(\theta_0)\big)\Big)^2. \quad (3.1)$$
Since
$$\bar{g} = \frac{g(X_1) + \cdots + g(X_n)}{n} \quad \text{and} \quad \mathbb{E}\,g(X_1) = m(\theta_0),$$
the CLT gives $\sqrt{n}(\bar{g} - m(\theta_0)) \to^d N(0, \mathrm{Var}_{\theta_0}(g(X_1)))$, and the second term in (3.1) converges to 0 because of the extra factor $\frac{1}{\sqrt{n}}$. Using that
$$(m^{-1})'(m(\theta_0)) = \frac{1}{m'(m^{-1}(m(\theta_0)))} = \frac{1}{m'(\theta_0)},$$
we get
$$(m^{-1})'(m(\theta_0))\,\sqrt{n}\big(\bar{g} - m(\theta_0)\big) \to^d \frac{1}{m'(\theta_0)}\,N\big(0, \mathrm{Var}_{\theta_0}(g(X_1))\big) = N\Big(0, \frac{\mathrm{Var}_{\theta_0}(g)}{(m'(\theta_0))^2}\Big),$$
which proves the theorem.

When comparing two estimates, the one with smaller asymptotic variance is better, in the sense that it has smaller deviations from the unknown parameter $\theta_0$ asymptotically.
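The variance formula in the theorem can be checked by simulation. As an illustrative test case (not an example from the notes), take the uniform distribution $U[0,\theta]$ with $g(x) = x$, so that $m(\theta) = \theta/2$, $\hat\theta = 2\bar{X}$, and the predicted asymptotic variance is $\mathrm{Var}(g)/(m'(\theta_0))^2 = (\theta_0^2/12)/(1/2)^2 = \theta_0^2/3$:

```python
import random
import statistics

random.seed(2)
theta0 = 3.0          # true parameter of U[0, theta0]; arbitrary choice
n, reps = 400, 2000

# Collect sqrt(n) * (theta_hat - theta0) over many repetitions and
# compare its empirical variance with the predicted theta0^2 / 3.
vals = []
for _ in range(reps):
    xs = [random.uniform(0, theta0) for _ in range(n)]
    theta_hat = 2 * statistics.fmean(xs)
    vals.append((n ** 0.5) * (theta_hat - theta0))

empirical_var = statistics.pvariance(vals)
print(empirical_var)  # close to theta0**2 / 3 = 3.0
```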
Lecture 4
Let us go back to the example of the exponential distribution $E(\alpha)$ from the last lecture and recall that we obtained two estimates of the unknown parameter $\alpha_0$ using the first and second moments in the method of moments. We had:

1. Estimate of $\alpha_0$ using the first moment:
$$g(x) = x, \quad m(\alpha) = \mathbb{E}_\alpha\, g(X) = \frac{1}{\alpha}, \quad \hat\alpha_1 = m^{-1}(\bar{g}) = \frac{1}{\bar{X}}.$$

2. Estimate of $\alpha_0$ using the second moment:
$$g(x) = x^2, \quad m(\alpha) = \mathbb{E}_\alpha\, g(X) = \frac{2}{\alpha^2}, \quad \hat\alpha_2 = m^{-1}(\bar{g}) = \sqrt{\frac{2}{\overline{X^2}}}.$$
How to decide which estimate is better? The asymptotic normality result states:
$$\sqrt{n}\big(m^{-1}(\bar{g}) - \alpha_0\big) \to^d N\Big(0, \frac{\mathrm{Var}_{\alpha_0}(g(X))}{(m'(\alpha_0))^2}\Big).$$
1. In the first case:
$$\sigma_1^2 = \frac{\mathrm{Var}_{\alpha_0}(g(X))}{(m'(\alpha_0))^2} = \frac{\mathrm{Var}_{\alpha_0}(X)}{\big(-\frac{1}{\alpha_0^2}\big)^2} = \frac{1/\alpha_0^2}{1/\alpha_0^4} = \alpha_0^2.$$
In the second case we will need to compute the fourth moment of the exponential distribution. This can easily be done by integration by parts, but we will show a different way to do this. The moment generating function of the distribution $E(\alpha)$ is, for $t < \alpha$,
$$\varphi(t) = \mathbb{E}e^{tX} = \int_0^\infty e^{tx}\,\alpha e^{-\alpha x}\,dx = \frac{\alpha}{\alpha - t} = \frac{1}{1 - t/\alpha} = \sum_{k=0}^\infty \frac{t^k}{\alpha^k},$$
where in the last step we wrote the usual Taylor (geometric) series. On the other hand, writing the Taylor series for $e^{tX}$ we can write
$$\varphi(t) = \mathbb{E}e^{tX} = \mathbb{E}\sum_{k=0}^\infty \frac{(tX)^k}{k!} = \sum_{k=0}^\infty \frac{t^k}{k!}\,\mathbb{E}X^k.$$
Comparing the two series above we get that the $k$-th moment of the exponential distribution is
$$\mathbb{E}X^k = \frac{k!}{\alpha^k}.$$
2. In the second case:
$$\sigma_2^2 = \frac{\mathrm{Var}_{\alpha_0}(g(X))}{(m'(\alpha_0))^2} = \frac{\mathrm{Var}_{\alpha_0}(X^2)}{\big(-\frac{4}{\alpha_0^3}\big)^2} = \frac{\mathbb{E}_{\alpha_0}X^4 - (\mathbb{E}_{\alpha_0}X^2)^2}{16/\alpha_0^6} = \frac{\frac{4!}{\alpha_0^4} - \big(\frac{2}{\alpha_0^2}\big)^2}{16/\alpha_0^6} = \frac{20/\alpha_0^4}{16/\alpha_0^6} = \frac{5}{4}\alpha_0^2.$$
Since the asymptotic variance in the first case is less than the asymptotic variance in the second case, the first estimate seems to be better.
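The predicted asymptotic variances $\alpha_0^2$ and $\frac{5}{4}\alpha_0^2$ can be confirmed by simulation; with $\alpha_0 = 2$ they are $4$ and $5$. The sample size and repetition count below are arbitrary illustrative choices:

```python
import random
import statistics

random.seed(3)
alpha0 = 2.0
n, reps = 500, 2000

dev1, dev2 = [], []
for _ in range(reps):
    xs = [random.expovariate(alpha0) for _ in range(n)]
    m1 = statistics.fmean(xs)                  # first sample moment
    m2 = statistics.fmean(x * x for x in xs)   # second sample moment
    a1 = 1 / m1                # first-moment estimate:  alpha = 1 / E X
    a2 = (2 / m2) ** 0.5       # second-moment estimate: alpha = sqrt(2 / E X^2)
    dev1.append((n ** 0.5) * (a1 - alpha0))
    dev2.append((n ** 0.5) * (a2 - alpha0))

v1 = statistics.pvariance(dev1)   # predicted: alpha0^2 = 4
v2 = statistics.pvariance(dev2)   # predicted: (5/4) * alpha0^2 = 5
print(v1, v2)
```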
4.1 Maximum likelihood estimators.

Suppose that the sample $X_1, \ldots, X_n$ comes from a distribution with p.d.f. or p.f. $f(x|\theta)$. The likelihood function is defined as
$$\varphi(\theta) = f(X_1|\theta) \cdots f(X_n|\theta),$$
seen as a function of the parameter $\theta$ only. It has a clear interpretation. For example, if our distributions are discrete then the probability function
$$f(x|\theta) = \mathbb{P}_\theta(X = x)$$
is the probability to observe the point $x$, and the likelihood function
$$\varphi(\theta) = \mathbb{P}_\theta(X_1)\cdots\mathbb{P}_\theta(X_n) = \mathbb{P}_\theta(X_1, \ldots, X_n)$$
is the probability to observe the sample $X_1, \ldots, X_n$ when the true parameter is $\theta$. The maximum likelihood estimator (MLE) $\hat\theta$ is defined as the value of $\theta$ that maximizes $\varphi(\theta)$; we will assume that the maximum is achieved at a unique point $\hat\theta$.

[Figure: the likelihood function $\varphi(\theta)$ for given data $X_1, \ldots, X_n$, with its maximum point $\hat\theta$.]

Example 1. Bernoulli distribution $B(p)$:
$$\mathbb{P}(X = 1) = p, \quad \mathbb{P}(X = 0) = 1 - p, \quad p \in [0,1],$$
with probability function
$$f(x|p) = \begin{cases} p, & x = 1, \\ 1-p, & x = 0. \end{cases}$$
The likelihood function is
$$\varphi(p) = p^{\#\text{ of }1\text{'s}}(1-p)^{\#\text{ of }0\text{'s}} = p^{X_1 + \cdots + X_n}(1-p)^{n - (X_1 + \cdots + X_n)}.$$
Maximizing the logarithm of the likelihood instead, the critical point condition is
$$\frac{d \log \varphi(p)}{dp} = (X_1 + \cdots + X_n)\frac{1}{p} - \big(n - (X_1 + \cdots + X_n)\big)\frac{1}{1-p} = 0.$$
Solving this for $p$ gives
$$p = \frac{X_1 + \cdots + X_n}{n} = \bar{X},$$
and therefore $\hat{p} = \bar{X}$ is the MLE.
Example 2. Normal distribution $N(\mu, \sigma^2)$ has p.d.f.
$$f(x|(\mu, \sigma^2)) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}},$$
likelihood function
$$\varphi(\mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(X_i-\mu)^2}{2\sigma^2}},$$
and log-likelihood function
$$\log \varphi(\mu, \sigma^2) = \sum_{i=1}^n \Big(\log\frac{1}{\sqrt{2\pi}} - \log\sigma - \frac{(X_i-\mu)^2}{2\sigma^2}\Big) = -n\log\sqrt{2\pi} - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2.$$
To maximize over $\mu$, for any $\sigma$ we need to minimize $\sum (X_i - \mu)^2$ over $\mu$. The critical point condition is
$$\frac{d}{d\mu}\sum (X_i - \mu)^2 = -2\sum (X_i - \mu) = 0.$$
Solving this for $\mu$ gives that $\hat\mu = \bar{X}$. Next, we need to maximize
$$-n\log\sqrt{2\pi} - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \bar{X})^2$$
over $\sigma$. The critical point condition reads
$$-\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum (X_i - \bar{X})^2 = 0,$$
and solving this for $\sigma$ we get
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2,$$
which is the MLE of $\sigma^2$.
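The closed-form MLE just derived can be checked numerically: compute $\hat\mu$ and $\hat\sigma^2$ from simulated data and verify that perturbing them lowers the log-likelihood. The true values $\mu_0 = 1.5$, $\sigma_0 = 0.8$ are arbitrary illustrative choices:

```python
import math
import random
import statistics

random.seed(4)
mu0, sigma0 = 1.5, 0.8
xs = [random.gauss(mu0, sigma0) for _ in range(50_000)]

# Closed-form MLE from the critical point conditions derived above.
mu_hat = statistics.fmean(xs)
sigma2_hat = statistics.fmean((x - mu_hat) ** 2 for x in xs)

def log_lik(mu, sigma2):
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

# The closed form should beat nearby parameter values.
best = log_lik(mu_hat, sigma2_hat)
is_max = all(log_lik(mu_hat + dm, sigma2_hat + ds) < best
             for dm in (-0.05, 0.05) for ds in (-0.05, 0.05))
print(mu_hat, sigma2_hat, is_max)
```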
Lecture 5
Example 3. Uniform distribution $U[0,\theta]$ with parameter $\theta > 0$ has p.d.f.
$$f(x|\theta) = \begin{cases} \frac{1}{\theta}, & 0 \le x \le \theta, \\ 0, & \text{otherwise.} \end{cases}$$
The likelihood function is
$$\varphi(\theta) = \prod_{i=1}^n f(X_i|\theta) = \frac{1}{\theta^n}\,I(X_1, \ldots, X_n \in [0,\theta]) = \frac{1}{\theta^n}\,I\big(\max(X_1, \ldots, X_n) \le \theta\big).$$
Here the indicator function $I(A)$ equals 1 if $A$ happens and 0 otherwise. What we wrote is that the product of p.d.f.s $f(X_i|\theta)$ will be equal to 0 if at least one of the factors is 0, and this will happen if at least one of the $X_i$s falls outside of the interval $[0,\theta]$, which is the same as the maximum among them exceeding $\theta$. In other words,
$$\varphi(\theta) = 0 \text{ if } \theta < \max(X_1, \ldots, X_n), \quad \text{and} \quad \varphi(\theta) = \frac{1}{\theta^n} \text{ if } \theta \ge \max(X_1, \ldots, X_n).$$
Therefore, looking at figure 5.1 we see that $\hat\theta = \max(X_1, \ldots, X_n)$ is the MLE, since $\frac{1}{\theta^n}$ is decreasing in $\theta$.
5.1 Consistency of MLE.

Why does the MLE $\hat\theta$ converge to the unknown parameter $\theta_0$? This is not immediately obvious, and in this section we will give a sketch of why this happens. Define
$$L_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log f(X_i|\theta),$$
which is just the log-likelihood function normalized by $\frac{1}{n}$ (of course, this does not affect the maximization). $L_n(\theta)$ depends on the data. Let us consider the function $l(X|\theta) = \log f(X|\theta)$ and define
$$L(\theta) = \mathbb{E}_{\theta_0}\, l(X|\theta).$$
By the LLN, for each $\theta$,
$$L_n(\theta) \to \mathbb{E}_{\theta_0}\, l(X|\theta) = L(\theta).$$
Note that $L(\theta)$ does not depend on the sample, it only depends on $\theta$. We will need the following

Lemma. We have, for any $\theta$,
$$L(\theta) \le L(\theta_0).$$
Moreover, the inequality is strict, $L(\theta) < L(\theta_0)$, unless
$$\mathbb{P}_{\theta_0}\big(f(X|\theta) = f(X|\theta_0)\big) = 1.$$
19
LECTURE 5.
L() L(0 ) =
0 (log f (X|)
log f (X|0 )) =
log
f (X |)
.
f (X|0 )
t1
log t
t
0
f (x|)
1 =
1 f (x|0 )dx
0
f (X|0 )
f (x|0 )
=
f (x|)dx f (x|0 )dx = 1 1 = 0.
f (X |)
0 log
f (X|0 )
f (X |)
Both integrals are equal to 1 because we are integrating the probability density func
tions. This proves that L() L(0 ) 0. The second statement of Lemma is also
clear.
5.2 Asymptotic normality of MLE. Fisher information.

We will now sketch the asymptotic normality of the MLE:
$$\sqrt{n}(\hat\theta - \theta_0) \to^d N(0, \sigma^2_{MLE}) \quad \text{for some } \sigma^2_{MLE}.$$
Let us recall that above we defined the function $l(X|\theta) = \log f(X|\theta)$. To simplify the notations we will denote by $l'(X|\theta)$, $l''(X|\theta)$, etc. the derivatives of $l(X|\theta)$ with respect to $\theta$.
Definition. (Fisher information.) The Fisher information of the distribution $f(x|\theta_0)$ is defined as
$$I(\theta_0) = \mathbb{E}_{\theta_0}\big(l'(X|\theta_0)\big)^2 = \mathbb{E}_{\theta_0}\Big(\frac{\partial}{\partial\theta}\log f(X|\theta)\Big|_{\theta_0}\Big)^2.$$
The next lemma gives another, often convenient, way to compute the Fisher information.

Lemma. We have,
$$\mathbb{E}_{\theta_0}\, l''(X|\theta_0) = \mathbb{E}_{\theta_0}\frac{\partial^2}{\partial\theta^2}\log f(X|\theta_0) = -I(\theta_0).$$

Proof. First of all, we have
$$l'(X|\theta) = \frac{f'(X|\theta)}{f(X|\theta)} \quad \text{and} \quad l''(X|\theta) = \frac{f''(X|\theta)}{f(X|\theta)} - \frac{(f'(X|\theta))^2}{(f(X|\theta))^2}.$$
Also, since
$$\int f(x|\theta)\,dx = 1,$$
taking derivatives of both sides with respect to $\theta$ gives
$$\int f'(x|\theta)\,dx = 0 \quad \text{and} \quad \int f''(x|\theta)\,dx = 0.$$
To finish the proof we write the following computation:
$$\mathbb{E}_{\theta_0}\, l''(X|\theta_0) = \int \frac{f''(x|\theta_0)}{f(x|\theta_0)}f(x|\theta_0)\,dx - \int \Big(\frac{f'(x|\theta_0)}{f(x|\theta_0)}\Big)^2 f(x|\theta_0)\,dx = 0 - \mathbb{E}_{\theta_0}\big(l'(X|\theta_0)\big)^2 = -I(\theta_0).$$
Theorem. (Asymptotic normality of the MLE.) Under appropriate regularity conditions,
$$\sqrt{n}(\hat\theta - \theta_0) \to^d N\Big(0, \frac{1}{I(\theta_0)}\Big).$$

Proof sketch. Since the MLE $\hat\theta$ is the maximizer of $L_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log f(X_i|\theta)$, we have the critical point condition $L_n'(\hat\theta) = 0$. By the Mean Value Theorem,
$$0 = L_n'(\hat\theta) = L_n'(\theta_0) + L_n''(\hat\theta_1)(\hat\theta - \theta_0) \quad \text{and, hence,} \quad \sqrt{n}(\hat\theta - \theta_0) = -\frac{\sqrt{n}\,L_n'(\theta_0)}{L_n''(\hat\theta_1)} \quad (5.1)$$
for some $\hat\theta_1$ between $\hat\theta$ and $\theta_0$. Since $l'(X|\theta) = f'(X|\theta)/f(X|\theta)$ and $\int f'(x|\theta_0)\,dx = 0$, we have
$$\mathbb{E}_{\theta_0}\, l'(X|\theta_0) = 0. \quad (5.2)$$
Therefore, by the CLT, the numerator satisfies
$$\sqrt{n}\,L_n'(\theta_0) = \sqrt{n}\Big(\frac{1}{n}\sum_{i=1}^n l'(X_i|\theta_0) - 0\Big) \to^d N\big(0, \mathrm{Var}_{\theta_0}(l'(X_1|\theta_0))\big). \quad (5.3)$$
For the denominator, by the LLN and since $\hat\theta_1 \to \theta_0$ by consistency,
$$L_n''(\hat\theta_1) = \frac{1}{n}\sum_{i=1}^n l''(X_i|\hat\theta_1) \to \mathbb{E}_{\theta_0}\, l''(X_1|\theta_0) = -I(\theta_0). \quad (5.4)$$
Finally, the variance
$$\mathrm{Var}_{\theta_0}(l'(X_1|\theta_0)) = \mathbb{E}_{\theta_0}\big(l'(X_1|\theta_0)\big)^2 - \big(\mathbb{E}_{\theta_0}\, l'(X_1|\theta_0)\big)^2 = I(\theta_0),$$
where in the last equality we used the definition of Fisher information and (5.2). Combining (5.1), (5.3) and (5.4),
$$\sqrt{n}(\hat\theta - \theta_0) \to^d \frac{N(0, I(\theta_0))}{I(\theta_0)} = N\Big(0, \frac{1}{I(\theta_0)}\Big),$$
which proves the theorem.
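The key identity used in the proof, $\mathbb{E}(l')^2 = -\mathbb{E}\,l'' = I(\theta)$, can be checked by Monte Carlo for the exponential family, where $l(x|\alpha) = \log\alpha - \alpha x$ has $l'(x|\alpha) = \frac{1}{\alpha} - x$ and $l''(x|\alpha) = -\frac{1}{\alpha^2}$ (the value $\alpha = 1.7$ is an arbitrary choice):

```python
import random
import statistics

random.seed(5)
alpha = 1.7
xs = [random.expovariate(alpha) for _ in range(200_000)]

# Monte Carlo estimate of E (l')^2 vs. the exact value -E l'' = 1 / alpha^2.
score_sq = statistics.fmean((1 / alpha - x) ** 2 for x in xs)
neg_curvature = 1 / alpha**2
print(score_sq, neg_curvature)  # both should be I(alpha) = 1 / alpha^2
```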
Lecture 6
Example. For the Bernoulli distribution $B(p)$ with p.f. $f(x|p) = p^x(1-p)^{1-x}$ we have
$$\frac{\partial}{\partial p}\log f(x|p) = \frac{x}{p} - \frac{1-x}{1-p}, \qquad \frac{\partial^2}{\partial p^2}\log f(x|p) = -\frac{x}{p^2} - \frac{1-x}{(1-p)^2},$$
and we showed that the Fisher information can be computed as:
$$I(p) = -\mathbb{E}\frac{\partial^2}{\partial p^2}\log f(X|p) = \frac{\mathbb{E}X}{p^2} + \frac{1-\mathbb{E}X}{(1-p)^2} = \frac{p}{p^2} + \frac{1-p}{(1-p)^2} = \frac{1}{p} + \frac{1}{1-p} = \frac{1}{p(1-p)}.$$
Since the MLE is $\hat{p} = \bar{X}$, the asymptotic normality result states
$$\sqrt{n}(\hat{p} - p_0) \to^d N\big(0, p_0(1-p_0)\big),$$
which, of course, also follows directly from the CLT.
Example. The family of exponential distributions $E(\alpha)$ has p.d.f.
$$f(x|\alpha) = \begin{cases} \alpha e^{-\alpha x}, & x \ge 0, \\ 0, & x < 0, \end{cases}$$
and, therefore,
$$\frac{\partial^2}{\partial\alpha^2}\log f(x|\alpha) = -\frac{1}{\alpha^2} \quad \text{and} \quad I(\alpha) = -\mathbb{E}\frac{\partial^2}{\partial\alpha^2}\log f(X|\alpha) = \frac{1}{\alpha^2}.$$
Therefore, the MLE $\hat\alpha = 1/\bar{X}$ satisfies
$$\sqrt{n}(\hat\alpha - \alpha_0) \to^d N(0, \alpha_0^2).$$
6.1 Rao-Cramér inequality.

Let us start by recalling the following simple result from probability (or calculus).

Lemma. (Cauchy inequality.) For any two random variables $X$ and $Y$ we have
$$\mathbb{E}XY \le (\mathbb{E}X^2)^{1/2}(\mathbb{E}Y^2)^{1/2}.$$
The inequality becomes equality if and only if $X = tY$ for some $t \ge 0$ with probability one.

Proof. Let us consider the following function:
$$\varphi(t) = \mathbb{E}(X - tY)^2 = \mathbb{E}X^2 - 2t\,\mathbb{E}XY + t^2\,\mathbb{E}Y^2 \ge 0.$$
Since this is a quadratic function of $t$, the fact that it is nonnegative means that it has at most one root, which is possible only if the discriminant is nonpositive:
$$D = 4(\mathbb{E}XY)^2 - 4\,\mathbb{E}Y^2\,\mathbb{E}X^2 \le 0 \implies \mathbb{E}XY \le (\mathbb{E}X^2)^{1/2}(\mathbb{E}Y^2)^{1/2}.$$
Also, $\varphi(t) = 0$ for some $t$ if and only if $D = 0$. On the other hand, $\varphi(t) = 0$ means
$$\mathbb{E}(X - tY)^2 = 0 \implies X = tY \text{ with probability one.}$$
Theorem. (Rao-Cramér inequality.) Consider a statistic
$$S = S(X_1, \ldots, X_n)$$
and let $m(\theta) = \mathbb{E}_\theta S$. Then, under appropriate regularity conditions,
$$\mathrm{Var}_\theta(S) = \mathbb{E}_\theta\big(S - m(\theta)\big)^2 \ge \frac{(m'(\theta))^2}{nI(\theta)},$$
with equality if and only if
$$S = t(\theta)\sum_{i=1}^n l'(X_i|\theta) + m(\theta)$$
for some function $t(\theta)$.

Proof. Let us apply the Cauchy inequality in the above Lemma to the random variables
$$S - m(\theta) \quad \text{and} \quad l_n' = \sum_{i=1}^n l'(X_i|\theta).$$
We have
$$\mathbb{E}_\theta\big(S - m(\theta)\big)l_n' \le \big(\mathbb{E}_\theta(S - m(\theta))^2\big)^{1/2}\big(\mathbb{E}_\theta(l_n')^2\big)^{1/2}.$$
Let us first compute $\mathbb{E}_\theta(l_n')^2$. Squaring the sum,
$$\mathbb{E}_\theta(l_n')^2 = \sum_{i=1}^n \mathbb{E}_\theta\big(l'(X_i|\theta)\big)^2 + \sum_{i \ne j} \mathbb{E}_\theta\, l'(X_i|\theta)\,\mathbb{E}_\theta\, l'(X_j|\theta),$$
where we simply grouped $n$ terms for $i = j$ and the remaining $n(n-1)$ terms for $i \ne j$, using independence in the second sum. By the definition of Fisher information,
$$I(\theta) = \mathbb{E}_\theta\big(l'(X_1|\theta)\big)^2.$$
Also,
$$\mathbb{E}_\theta\, l'(X_1|\theta) = \mathbb{E}_\theta\frac{f'(X_1|\theta)}{f(X_1|\theta)} = \int \frac{f'(x|\theta)}{f(x|\theta)}f(x|\theta)\,dx = \int f'(x|\theta)\,dx = \frac{\partial}{\partial\theta}\int f(x|\theta)\,dx = \frac{\partial}{\partial\theta}1 = 0.$$
We used here that $f(x|\theta)$ is a p.d.f. and it integrates to one. Combining these two facts, we get
$$\mathbb{E}_\theta(l_n')^2 = nI(\theta).$$
Lecture 7
We showed that $\mathbb{E}_\theta(l_n')^2 = nI(\theta)$ and
$$\mathbb{E}_\theta\, l_n' = \sum_{i=1}^n \mathbb{E}_\theta\, l'(X_i|\theta) = 0,$$
and, therefore,
$$\mathbb{E}_\theta\big(S - m(\theta)\big)l_n' = \mathbb{E}_\theta\, S\,l_n' - m(\theta)\,\mathbb{E}_\theta\, l_n' = \mathbb{E}_\theta\, S\,l_n'.$$
Denoting the joint density by $\varphi(X|\theta) = f(X_1|\theta)\cdots f(X_n|\theta)$, we can write
$$l_n' = \sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(X_i|\theta) = \frac{\partial}{\partial\theta}\log\varphi(X|\theta) = \frac{\varphi'(X|\theta)}{\varphi(X|\theta)}.$$
Therefore,
$$\mathbb{E}_\theta\, S\,l_n' = \int S(X)\frac{\varphi'(X|\theta)}{\varphi(X|\theta)}\varphi(X|\theta)\,dX = \int S(X)\varphi'(X|\theta)\,dX = \frac{\partial}{\partial\theta}\int S(X)\varphi(X|\theta)\,dX = \frac{\partial}{\partial\theta}\mathbb{E}_\theta S = m'(\theta).$$
Plugging this into the Cauchy inequality and squaring gives
$$(m'(\theta))^2 \le \mathbb{E}_\theta\big(S - m(\theta)\big)^2\, nI(\theta), \quad \text{i.e.} \quad \mathrm{Var}_\theta(S) \ge \frac{(m'(\theta))^2}{nI(\theta)}.$$
The inequality becomes equality only if there is equality in the Cauchy inequality applied to the random variables $S - m(\theta)$ and $l_n'$. But this can happen only if there exists $t = t(\theta)$ such that
$$S - m(\theta) = t(\theta)\,l_n' = t(\theta)\sum_{i=1}^n l'(X_i|\theta).$$
7.1 Efficient estimators.
Definition. A statistic $S = S(X_1, \ldots, X_n)$ with $m(\theta) = \mathbb{E}_\theta S$ is called an efficient estimate of $m(\theta)$ if it attains the Rao-Cramér lower bound, i.e.
$$\mathbb{E}_\theta\big(S - m(\theta)\big)^2 = \frac{(m'(\theta))^2}{nI(\theta)},$$
which, by the above, happens if and only if
$$S = t(\theta)\sum_{i=1}^n l'(X_i|\theta) + m(\theta)$$
for some function $t(\theta)$. The statistic $S = S(X_1, \ldots, X_n)$ must be a function of the sample only and cannot depend on $\theta$. This means that efficient estimates do not always exist; they exist only if we can represent the derivative of the log-likelihood, $l_n'$, as
$$l_n' = \sum_{i=1}^n l'(X_i|\theta) = \frac{S - m(\theta)}{t(\theta)},$$
where $S$ does not depend on $\theta$.
Families of exponential type. Suppose the p.d.f. or p.f. can be represented as
$$f(x|\theta) = a(\theta)\,b(x)\,e^{c(\theta)d(x)}.$$
Then
$$l'(x|\theta) = \frac{\partial}{\partial\theta}\big(\log a(\theta) + \log b(x) + c(\theta)d(x)\big) = \frac{a'(\theta)}{a(\theta)} + c'(\theta)d(x)$$
and
$$l_n' = \sum_{i=1}^n l'(X_i|\theta) = n\frac{a'(\theta)}{a(\theta)} + c'(\theta)\sum_{i=1}^n d(X_i).$$
If we take
$$S = \frac{1}{n}\sum_{i=1}^n d(X_i) \quad \text{and} \quad m(\theta) = -\frac{a'(\theta)}{a(\theta)c'(\theta)},$$
then
$$S - m(\theta) = \frac{1}{nc'(\theta)}\,l_n',$$
so $S$ is an efficient estimate of $m(\theta) = \mathbb{E}_\theta S$.
Example. The Poisson distribution $\Pi(\lambda)$ has p.f.
$$f(x|\lambda) = \frac{\lambda^x}{x!}\,e^{-\lambda} \quad \text{for } x = 0, 1, \ldots,$$
which can be written as
$$f(x|\lambda) = e^{-\lambda}\,\frac{1}{x!}\,e^{(\log\lambda)x},$$
i.e. it is of exponential type with
$$a(\lambda) = e^{-\lambda}, \quad b(x) = \frac{1}{x!}, \quad c(\lambda) = \log\lambda, \quad d(x) = x.$$
As a result,
$$S = \frac{1}{n}\sum_{i=1}^n d(X_i) = \frac{1}{n}\sum_{i=1}^n X_i = \bar{X}$$
is an efficient estimate of
$$m(\lambda) = -\frac{a'(\lambda)}{a(\lambda)c'(\lambda)} = -\frac{-e^{-\lambda}}{e^{-\lambda}\cdot\frac{1}{\lambda}} = \lambda.$$
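For the Poisson family the Rao-Cramér bound for estimating $m(\lambda) = \lambda$ is $(m')^2/(nI) = \lambda/n$, since $I(\lambda) = 1/\lambda$, and efficiency of $S = \bar{X}$ means its variance equals exactly this bound. A simulation sketch (the sampler below, which counts rate-1 exponential arrivals before time $\lambda$, is my own helper, and all parameter values are arbitrary):

```python
import random
import statistics

random.seed(6)
lam, n, reps = 3.0, 50, 10_000

def poisson(lam_):
    """Sample Poisson(lam_) by counting rate-1 arrivals before time lam_."""
    k, t = 0, random.expovariate(1.0)
    while t < lam_:
        k += 1
        t += random.expovariate(1.0)
    return k

# Variance of S = sample mean should attain the bound lambda / n.
means = [statistics.fmean(poisson(lam) for _ in range(n)) for _ in range(reps)]
var_S = statistics.pvariance(means)
print(var_S, lam / n)
```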
Maximum likelihood estimators. Another interesting consequence of the Rao-Cramér theorem is the following. Suppose that the MLE $\hat\theta$ is unbiased:
$$\mathbb{E}_\theta\hat\theta = \theta.$$
Then, taking $S = \hat\theta$ and $m(\theta) = \theta$ in the Rao-Cramér inequality,
$$\mathrm{Var}(\hat\theta) \ge \frac{1}{nI(\theta)}.$$
On the other hand, when we showed asymptotic normality of the MLE we proved the following convergence in distribution:
$$\sqrt{n}(\hat\theta - \theta) \to^d N\Big(0, \frac{1}{I(\theta)}\Big).$$
In particular, the variance of $\sqrt{n}(\hat\theta - \theta)$ approaches the variance of the limiting distribution, $1/I(\theta)$, i.e.
$$\mathrm{Var}\big(\sqrt{n}(\hat\theta - \theta)\big) = n\,\mathrm{Var}(\hat\theta) \to \frac{1}{I(\theta)},$$
which means that the Rao-Cramér inequality becomes equality in the limit. This property is called asymptotic efficiency, and we showed that an unbiased MLE is asymptotically efficient. In other words, for large sample size $n$ it is almost the best possible.
Lecture 8
8.1 Gamma distribution.

Let us take two parameters $\alpha > 0$ and $\beta > 0$. The Gamma function $\Gamma(\alpha)$ is defined by
$$\Gamma(\alpha) = \int_0^\infty x^{\alpha-1}e^{-x}\,dx.$$
Then
$$1 = \int_0^\infty \frac{x^{\alpha-1}e^{-x}}{\Gamma(\alpha)}\,dx = \int_0^\infty \frac{\beta^\alpha y^{\alpha-1}e^{-\beta y}}{\Gamma(\alpha)}\,dy,$$
where we made the change of variables $x = \beta y$. Therefore, if we define
$$f(x|\alpha, \beta) = \begin{cases} \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}, & x \ge 0, \\ 0, & x < 0, \end{cases}$$
then $f(x|\alpha, \beta)$ will be a probability density function since it is nonnegative and it integrates to one.

Definition. The distribution with p.d.f. $f(x|\alpha, \beta)$ is called the Gamma distribution with parameters $\alpha$ and $\beta$, and it is denoted $\Gamma(\alpha, \beta)$.

Next, let us recall some properties of the Gamma function $\Gamma(\alpha)$. If we take $\alpha > 1$ then using integration by parts we can write:
$$\Gamma(\alpha) = \int_0^\infty x^{\alpha-1}e^{-x}\,dx = \int_0^\infty x^{\alpha-1}\,d(-e^{-x}) = \big[x^{\alpha-1}(-e^{-x})\big]_0^\infty - \int_0^\infty (-e^{-x})(\alpha-1)x^{\alpha-2}\,dx = (\alpha-1)\int_0^\infty x^{(\alpha-1)-1}e^{-x}\,dx = (\alpha-1)\Gamma(\alpha-1).$$
Since for $\alpha = 1$,
$$\Gamma(1) = \int_0^\infty e^{-x}\,dx = 1,$$
for integer $n$ we get $\Gamma(n) = (n-1)!$. Let us compute the $k$-th moment of the Gamma distribution $\Gamma(\alpha, \beta)$:
$$\mathbb{E}X^k = \int_0^\infty x^k\frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}\,dx = \frac{\beta^\alpha}{\Gamma(\alpha)}\int_0^\infty x^{(\alpha+k)-1}e^{-\beta x}\,dx = \frac{\beta^\alpha}{\Gamma(\alpha)}\frac{\Gamma(\alpha+k)}{\beta^{\alpha+k}}\underbrace{\int_0^\infty \frac{\beta^{\alpha+k}}{\Gamma(\alpha+k)}x^{\alpha+k-1}e^{-\beta x}\,dx}_{\text{p.d.f. of } \Gamma(\alpha+k,\,\beta) \text{ integrates to } 1}$$
$$= \frac{\Gamma(\alpha+k)}{\Gamma(\alpha)\beta^k} = \frac{(\alpha+k-1)\Gamma(\alpha+k-1)}{\Gamma(\alpha)\beta^k} = \frac{(\alpha+k-1)(\alpha+k-2)\cdots\alpha}{\beta^k}.$$
In particular,
$$\mathbb{E}X = \frac{\alpha}{\beta}, \quad \mathbb{E}X^2 = \frac{\alpha(\alpha+1)}{\beta^2}, \quad \mathrm{Var}(X) = \mathbb{E}X^2 - (\mathbb{E}X)^2 = \frac{\alpha(\alpha+1)}{\beta^2} - \frac{\alpha^2}{\beta^2} = \frac{\alpha}{\beta^2}.$$
8.2 Beta distribution.
The Beta distribution $B(\alpha, \beta)$ has p.d.f.
$$f(x|\alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1} \quad \text{for } x \in [0,1].$$
Its $k$-th moment is
$$\mathbb{E}X^k = \int_0^1 x^k\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}\,dx = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\frac{\Gamma(\alpha+k)\Gamma(\beta)}{\Gamma(k+\alpha+\beta)}\underbrace{\int_0^1 \frac{\Gamma(k+\alpha+\beta)}{\Gamma(\alpha+k)\Gamma(\beta)}x^{(k+\alpha)-1}(1-x)^{\beta-1}\,dx}_{\text{p.d.f. of } B(\alpha+k,\,\beta) \text{ integrates to } 1}$$
$$= \frac{\Gamma(\alpha+k)}{\Gamma(\alpha)}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha+\beta+k)} = \frac{(\alpha+k-1)\cdots\alpha}{(\alpha+\beta+k-1)\cdots(\alpha+\beta)}.$$
In particular,
$$\mathbb{E}X = \frac{\alpha}{\alpha+\beta}, \quad \mathbb{E}X^2 = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)}, \quad \mathrm{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.$$
Lecture 9
9.1 Prior and posterior distributions.

[Figure 9.1: a prior distribution $\xi(p)$ on $p$, peaked near 0.4.]

Let us first recall the Bayes theorem. If events $A_1, A_2, \ldots$ are disjoint and $B \subseteq \bigcup_i A_i$, then
$$\mathbb{P}(B) = \sum_{i=1}^\infty \mathbb{P}(B \cap A_i).$$
Then the Bayes theorem states the equality obtained by the following simple computation:
$$\mathbb{P}(A_1|B) = \frac{\mathbb{P}(A_1 \cap B)}{\mathbb{P}(B)} = \frac{\mathbb{P}(B|A_1)\,\mathbb{P}(A_1)}{\sum_i \mathbb{P}(B \cap A_i)} = \frac{\mathbb{P}(B|A_1)\,\mathbb{P}(A_1)}{\sum_i \mathbb{P}(B|A_i)\,\mathbb{P}(A_i)}.$$
Similarly, suppose we know $f(X_1, \ldots, X_n|\theta)$, the p.d.f. or p.f. of the sample given $\theta$, and the p.d.f. or p.f. $\xi(\theta)$ of $\theta$ (the prior distribution). The posterior distribution of $\theta$ can be computed using the Bayes formula:
$$\xi(\theta|X_1, \ldots, X_n) = \frac{f(X_1, \ldots, X_n|\theta)\,\xi(\theta)}{g(X_1, \ldots, X_n)},$$
where
$$g(X_1, \ldots, X_n) = \int f(X_1, \ldots, X_n|\theta)\,\xi(\theta)\,d\theta.$$
Example. Very soon we will consider specific choices of prior distributions and we will explicitly compute the posterior distribution, but right now let us briefly give an example of how we expect the data and the prior distribution to affect the posterior distribution. Assume again that we are in the situation described in the above example, when the sample comes from the Bernoulli distribution and the prior distribution is shown in figure 9.1, where we expect $p_0$ to be near 0.4. On the other hand, suppose that the average of the sample is $\bar{X} = 0.7$. This seems to suggest that our intuition was not quite right, especially if the sample size is large. In this case we expect that the posterior distribution will look somewhat like the one shown in figure 9.2: there will be a balance between the prior intuition and the information contained in the sample. As the sample size increases, the maximum of the posterior distribution will eventually shift closer and closer to $\bar{X} = 0.7$, meaning that we have to discard our intuition if it contradicts the evidence supported by the data.

[Figure 9.2: the posterior distribution, with its peak lying somewhere between 0.4 and 0.7.]
Lecture 10
10.1 Bayes estimators.

Once the posterior distribution is computed, a natural estimate of $\theta$ is its average with respect to the posterior,
$$\hat\theta = \mathbb{E}(\theta|X_1, \ldots, X_n).$$
This estimate is called the Bayes estimator. Note that $\hat\theta$ depends on the sample $X_1, \ldots, X_n$ since, by definition, the posterior distribution depends on the sample. The obvious motivation for this choice of $\hat\theta$ is that it is simply the average of the parameter with respect to the posterior distribution, which in some sense captures the information contained in the data and our prior intuition about the parameter. Besides this obvious motivation, one sometimes gives the following motivation. Let us define the estimator as the parameter $a$ that minimizes
$$\mathbb{E}\big((\theta - a)^2|X_1, \ldots, X_n\big),$$
the average squared deviation with respect to the posterior distribution $\xi(\theta|X_1, \ldots, X_n)$; setting the derivative in $a$ to zero shows that the minimizer is exactly $a = \mathbb{E}(\theta|X_1, \ldots, X_n)$.
10.2 Conjugate prior distributions.
Often, for many popular families of distributions, the prior distribution $\xi(\theta)$ is chosen so that it is easy to compute the posterior distribution. This is done by choosing $\xi(\theta)$ so that it resembles the likelihood function $f(X_1, \ldots, X_n|\theta)$. We will explain this on particular examples.

Example. Suppose that the sample comes from the Bernoulli distribution $B(p)$ with p.f.
$$f(x|p) = p^x(1-p)^{1-x} \quad \text{for } x = 0, 1$$
and likelihood function
$$f(X_1, \ldots, X_n|p) = p^{\sum X_i}(1-p)^{n - \sum X_i},$$
so that the posterior is
$$\xi(p|X_1, \ldots, X_n) = \frac{p^{\sum X_i}(1-p)^{n - \sum X_i}\,\xi(p)}{g(X_1, \ldots, X_n)}.$$
The factor $p^{\sum X_i}(1-p)^{n-\sum X_i}$ resembles the density of a Beta distribution. Therefore, if we let the prior distribution be a Beta distribution $B(\alpha, \beta)$ with some parameters $\alpha, \beta > 0$,
$$\xi(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1},$$
then
$$\xi(p|X_1, \ldots, X_n) = \frac{1}{g(X_1, \ldots, X_n)}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{(\alpha+\sum X_i)-1}(1-p)^{(\beta+n-\sum X_i)-1},$$
which again resembles a Beta distribution. We still have to compute $g(X_1, \ldots, X_n)$, but this can be avoided if we notice that $\xi(p|X_1, \ldots, X_n)$ is supposed to be a p.d.f. and it looks like a Beta distribution with parameters $\alpha + \sum X_i$ and $\beta + n - \sum X_i$. Therefore, $g$ has no choice but to be such that
$$\xi(p|X_1, \ldots, X_n) = \frac{\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+\sum X_i)\Gamma(\beta+n-\sum X_i)}p^{(\alpha+\sum X_i)-1}(1-p)^{(\beta+n-\sum X_i)-1},$$
which is the p.d.f. of $B\big(\alpha + \sum X_i,\ \beta + n - \sum X_i\big)$. Since the mean of the Beta distribution $B(\alpha, \beta)$ is equal to $\alpha/(\alpha+\beta)$, the Bayes estimator will be
$$\hat{p} = \mathbb{E}(p|X_1, \ldots, X_n) = \frac{\alpha + \sum X_i}{\alpha + \sum X_i + \beta + n - \sum X_i} = \frac{\alpha + \sum X_i}{\alpha + \beta + n}.$$
Let us notice that for a large sample size, i.e. when $n \to +\infty$, we have
$$\hat{p} = \frac{\alpha + \sum X_i}{\alpha + \beta + n} = \frac{\frac{\alpha}{n} + \bar{X}}{\frac{\alpha+\beta}{n} + 1} \approx \bar{X}.$$
This means that when we have a lot of data our prior intuition becomes irrelevant and the Bayes estimator is approximated by the sample average $\bar{X}$. On the other hand, for $n = 0$,
$$\hat{p} = \frac{\alpha}{\alpha + \beta},$$
which is the mean of the prior distribution $B(\alpha, \beta)$. If we have no data we simply follow our intuitive guess.
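The Beta-Bernoulli update is a one-liner in code, and the "balance between prior and data" from the previous lecture is directly visible: the Bayes estimator lands between the prior mean and the sample average. All numeric values below are arbitrary illustrative choices:

```python
import random

random.seed(7)
a, b = 2.0, 3.0          # prior Beta(2, 3): prior mean 0.4
p0, n = 0.7, 200         # true parameter and sample size

xs = [1 if random.random() < p0 else 0 for _ in range(n)]
s = sum(xs)

# Posterior is Beta(a + sum(x), b + n - sum(x)); Bayes estimator is its mean.
post_a, post_b = a + s, b + n - s
p_bayes = post_a / (post_a + post_b)
prior_mean = a / (a + b)
sample_mean = s / n
print(prior_mean, p_bayes, sample_mean)
```

With $n = 200$ the posterior mean is already almost indistinguishable from $\bar{X}$, as the limit above predicts.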
Example. Suppose that the sample comes from the exponential distribution $E(\alpha)$ with p.d.f.
$$f(x|\alpha) = \alpha e^{-\alpha x} \quad \text{for } x \ge 0,$$
in which case the likelihood function is
$$f(X_1, \ldots, X_n|\alpha) = \alpha^n e^{-\alpha\sum X_i}$$
and the posterior is
$$\xi(\alpha|X_1, \ldots, X_n) = \frac{1}{g(X_1, \ldots, X_n)}\,\alpha^n e^{-\alpha\sum X_i}\,\xi(\alpha).$$
Notice that the likelihood function resembles the p.d.f. of a Gamma distribution and, therefore, we will take the prior to be a Gamma distribution $\Gamma(u, v)$ with parameters $u$ and $v$, i.e.
$$\xi(\alpha) = \frac{v^u}{\Gamma(u)}\,\alpha^{u-1}e^{-v\alpha}.$$
Then the posterior will be equal to
$$\xi(\alpha|X_1, \ldots, X_n) = \frac{1}{g}\frac{v^u}{\Gamma(u)}\,\alpha^{(u+n)-1}e^{-(\sum X_i + v)\alpha},$$
which is, up to normalization, the p.d.f. of $\Gamma\big(u+n,\ v+\sum X_i\big)$. Since the mean of $\Gamma(\alpha, \beta)$ is $\alpha/\beta$, the Bayes estimator is
$$\hat\alpha = \mathbb{E}(\alpha|X_1, \ldots, X_n) = \frac{u+n}{v+\sum X_i} = \frac{\frac{u}{n}+1}{\frac{v}{n}+\bar{X}} \approx \frac{1}{\bar{X}} \text{ for large } n,$$
which is the MLE of $\alpha$.
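The Gamma-exponential update is just as short in code, and the agreement between the Bayes estimator and the MLE $1/\bar{X}$ for moderate $n$ is easy to see (prior parameters and $\alpha_0$ below are arbitrary illustrative choices):

```python
import random
import statistics

random.seed(8)
u, v = 2.0, 1.0          # prior Gamma(u, v) on alpha
alpha0, n = 3.0, 400

xs = [random.expovariate(alpha0) for _ in range(n)]

# Posterior is Gamma(u + n, v + sum(x)); the Bayes estimator is its mean,
# which approaches the MLE 1 / mean(x) as n grows.
alpha_bayes = (u + n) / (v + sum(xs))
mle = 1 / statistics.fmean(xs)
print(alpha_bayes, mle)  # both close to alpha0
```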
Example. Suppose that the sample comes from the Poisson distribution $\Pi(\lambda)$ with p.f.
$$f(x|\lambda) = \frac{\lambda^x}{x!}\,e^{-\lambda} \quad \text{for } x = 0, 1, 2, \ldots,$$
in which case the likelihood function is
$$f(X_1, \ldots, X_n|\lambda) = \frac{\lambda^{\sum X_i}}{\prod X_i!}\,e^{-n\lambda}$$
and the posterior is
$$\xi(\lambda|X_1, \ldots, X_n) = \frac{1}{g(X_1, \ldots, X_n)}\frac{\lambda^{\sum X_i}}{\prod X_i!}\,e^{-n\lambda}\,\xi(\lambda).$$
Since again the likelihood function resembles the Gamma distribution, we will take the prior to be a Gamma distribution $\Gamma(u, v)$, in which case
$$\xi(\lambda|X_1, \ldots, X_n) = \frac{1}{g}\frac{v^u}{\Gamma(u)}\,\lambda^{(\sum X_i + u)-1}e^{-(n+v)\lambda},$$
which is, up to normalization, the p.d.f. of $\Gamma\big(\sum X_i + u,\ n + v\big)$. Therefore, the Bayes estimator is
$$\hat\lambda = \frac{\sum X_i + u}{n+v} = \frac{\bar{X} + \frac{u}{n}}{1 + \frac{v}{n}} \approx \bar{X} \text{ for large } n.$$
Lecture 11
11.1 Sufficient statistic.

Imagine two observers: observer A sees the entire sample $X_1, \ldots, X_n$, while observer B sees only the value of some statistic $T = T(X_1, \ldots, X_n)$. The statistic $T$ is called sufficient if
$$\mathbb{P}_\theta(X_1, \ldots, X_n|T) = \mathbb{P}(X_1, \ldots, X_n|T), \quad (11.1)$$
i.e. the conditional distribution of the vector $(X_1, \ldots, X_n)$ given $T$ does not depend on the parameter $\theta$ and is equal to $\mathbb{P}$.

If this happens then we can say that $T$ contains all information about the parameter $\theta$ of the distribution of the sample, since given $T$ the distribution of the sample is always the same no matter what $\theta$ is. Another way to think about this is: why does the second observer B have as much information about $\theta$ as observer A? Simply, given $T$, the second observer B can generate another sample $X_1', \ldots, X_n'$ by drawing it according to the distribution $\mathbb{P}(X_1, \ldots, X_n|T)$. He can do this because it does not require the knowledge of $\theta$. But by (11.1) this new sample $X_1', \ldots, X_n'$ will have the same distribution as $X_1, \ldots, X_n$, so B will have at his/her disposal as much data as the first observer A.
The next result tells us how to find sufficient statistics, if possible.

Theorem. (Neyman-Fisher factorization criterion.) $T = T(X_1, \ldots, X_n)$ is a sufficient statistic if and only if the joint p.d.f. or p.f. of $(X_1, \ldots, X_n)$ can be represented as
$$f(x_1, \ldots, x_n|\theta) \equiv f(x_1|\theta)\cdots f(x_n|\theta) = u(x_1, \ldots, x_n)\,v\big(T(x_1, \ldots, x_n), \theta\big) \quad (11.2)$$
for some functions $u$ and $v$. ($u$ does not depend on the parameter $\theta$ and $v$ depends on the data only through $T$.)
Proof. We will only consider the simpler case when the distribution of the sample is discrete.

1. First let us assume that $T = T(X_1, \ldots, X_n)$ is a sufficient statistic. Since the distribution is discrete, we have,
$$f(x_1, \ldots, x_n|\theta) = \mathbb{P}_\theta(X_1 = x_1, \ldots, X_n = x_n),$$
i.e. the joint p.f. is just the probability that the sample takes values $x_1, \ldots, x_n$. If $X_1 = x_1, \ldots, X_n = x_n$ then $T = T(x_1, \ldots, x_n)$ and, therefore,
$$\mathbb{P}_\theta(X_1 = x_1, \ldots, X_n = x_n) = \mathbb{P}_\theta\big(X_1 = x_1, \ldots, X_n = x_n,\ T = T(x_1, \ldots, x_n)\big)$$
$$= \mathbb{P}_\theta\big(X_1 = x_1, \ldots, X_n = x_n \mid T = T(x_1, \ldots, x_n)\big)\,\mathbb{P}_\theta\big(T = T(x_1, \ldots, x_n)\big).$$
Since $T$ is sufficient, by definition, this means that the first conditional probability
$$\mathbb{P}_\theta\big(X_1 = x_1, \ldots, X_n = x_n \mid T = T(x_1, \ldots, x_n)\big) = u(x_1, \ldots, x_n)$$
for some function $u$ independent of $\theta$, since this conditional probability does not depend on $\theta$. Also, $\mathbb{P}_\theta(T = T(x_1, \ldots, x_n))$ is a function $v(T(x_1, \ldots, x_n), \theta)$ of $T(x_1, \ldots, x_n)$ and $\theta$, which proves (11.2).

2. Conversely, suppose that (11.2) holds. Consider the conditional probability
$$\mathbb{P}_\theta\big(X_1 = x_1, \ldots, X_n = x_n \mid T(X_1, \ldots, X_n) = t\big) = \frac{\mathbb{P}_\theta\big(X_1 = x_1, \ldots, X_n = x_n,\ T(X_1, \ldots, X_n) = t\big)}{\mathbb{P}_\theta\big(T(X_1, \ldots, X_n) = t\big)}. \quad (11.3)$$
This probability is zero unless $T(x_1, \ldots, x_n) = t$, in which case
$$\mathbb{P}_\theta\big(X_1 = x_1, \ldots, X_n = x_n,\ T(X_1, \ldots, X_n) = t\big) = \mathbb{P}_\theta(X_1 = x_1, \ldots, X_n = x_n), \quad (11.4)$$
since we just drop the condition that holds anyway. Letting $A(t) = \{(x_1, \ldots, x_n) : T(x_1, \ldots, x_n) = t\}$, we get
$$\mathbb{P}_\theta\big(T(X_1, \ldots, X_n) = t\big) = \mathbb{P}_\theta\big((X_1, \ldots, X_n) \in A(t)\big) = \sum_{A(t)} u(x_1, \ldots, x_n)\,v(t, \theta),$$
where in the last step we used (11.2) and (11.4). Therefore, (11.3) can be written as
$$\frac{u(x_1, \ldots, x_n)\,v(t, \theta)}{\sum_{A(t)} u(x_1', \ldots, x_n')\,v(t, \theta)} = \frac{u(x_1, \ldots, x_n)}{\sum_{A(t)} u(x_1', \ldots, x_n')},$$
and since this does not depend on $\theta$ anymore, it proves that $T$ is sufficient.
Example. The Bernoulli distribution $B(p)$ has p.f. $f(x|p) = p^x(1-p)^{1-x}$ for $x \in \{0,1\}$. The joint p.f. is
$$f(x_1, \ldots, x_n|p) = p^{\sum x_i}(1-p)^{n - \sum x_i} = v\Big(\sum x_i,\ p\Big),$$
so, taking $u \equiv 1$, the factorization criterion shows that $T = \sum_{i=1}^n X_i$ is a sufficient statistic.
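Sufficiency of $T = \sum X_i$ can be illustrated directly: the conditional law of the sample given $T$ should be the same for every $p$. A small check (sample size, $t$, and the two values of $p$ are arbitrary; by symmetry $\mathbb{P}(X_1 = 1 \mid T = 2) = 2/5$ here):

```python
import random

random.seed(9)
n = 5

def cond_prob_first_is_one(p, t, reps=200_000):
    """Estimate P(X1 = 1 | sum X_i = t) for a Bernoulli(p) sample of size n."""
    hit = first = 0
    for _ in range(reps):
        xs = [1 if random.random() < p else 0 for _ in range(n)]
        if sum(xs) == t:
            hit += 1
            first += xs[0]
    return first / hit

c1 = cond_prob_first_is_one(0.3, 2)
c2 = cond_prob_first_is_one(0.7, 2)
print(c1, c2)  # both close to 2/5 = 0.4, independently of p
```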
Lecture 12
Example. The Poisson distribution $\Pi(\lambda)$ has p.f.
$$f(x|\lambda) = \frac{\lambda^x}{x!}\,e^{-\lambda} \quad \text{for } x = 0, 1, 2, \ldots,$$
so the joint p.f. is
$$f(x_1, \ldots, x_n|\lambda) = \frac{\lambda^{\sum x_i}}{\prod_{i=1}^n x_i!}\,e^{-n\lambda}.$$
If we take
$$u(x_1, \ldots, x_n) = \frac{1}{\prod_{i=1}^n x_i!}, \quad T(x_1, \ldots, x_n) = \sum_{i=1}^n x_i \quad \text{and} \quad v(T, \lambda) = e^{-n\lambda}\lambda^T,$$
then the factorization criterion shows that $T = \sum_{i=1}^n X_i$ is a sufficient statistic.
Example. Consider the normal distribution $N(\mu, \sigma^2)$ with $\sigma^2$ known. The joint p.d.f. is
$$f(x_1, \ldots, x_n|\mu) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n}\exp\Big(-\sum\frac{x_i^2}{2\sigma^2}\Big)\exp\Big(\frac{\mu}{\sigma^2}\sum x_i - \frac{n\mu^2}{2\sigma^2}\Big).$$
If we take $T = \sum_{i=1}^n X_i$,
$$u(x_1, \ldots, x_n) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n}\exp\Big(-\sum\frac{x_i^2}{2\sigma^2}\Big) \quad \text{and} \quad v(T, \mu) = \exp\Big(\frac{\mu}{\sigma^2}T - \frac{n\mu^2}{2\sigma^2}\Big),$$
then the Neyman-Fisher criterion proves that $T$ is a sufficient statistic.
12.1 Jointly sufficient statistics.

Consider several statistics
$$T_1 = T_1(X_1, \ldots, X_n),\quad T_2 = T_2(X_1, \ldots, X_n),\quad \ldots,\quad T_k = T_k(X_1, \ldots, X_n).$$
Very similarly to the case when we have only one function $T$, a vector $(T_1, \ldots, T_k)$ is called a jointly sufficient statistic if the distribution of the sample given the $T$s,
$$\mathbb{P}_\theta(X_1, \ldots, X_n|T_1, \ldots, T_k),$$
does not depend on $\theta$. The Neyman-Fisher factorization criterion says in this case that $(T_1, \ldots, T_k)$ is jointly sufficient if and only if
$$f(x_1, \ldots, x_n|\theta) = u(x_1, \ldots, x_n)\,v(T_1, \ldots, T_k, \theta).$$
The proof goes without changes.
Example 1. Let us consider the family of normal distributions $N(\mu, \sigma^2)$, only now both $\mu$ and $\sigma^2$ are unknown. Since the joint p.d.f.
$$f(x_1, \ldots, x_n|\mu, \sigma^2) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n}\exp\Big(-\frac{\sum x_i^2}{2\sigma^2} + \frac{\mu\sum x_i}{\sigma^2} - \frac{n\mu^2}{2\sigma^2}\Big)$$
is a function of
$$T_1 = \sum_{i=1}^n X_i \quad \text{and} \quad T_2 = \sum_{i=1}^n X_i^2,$$
the pair $(T_1, T_2)$ is jointly sufficient (taking $u \equiv 1$).

Example 2. The uniform distribution $U[a, b]$ has p.d.f.
$$f(x|a, b) = \begin{cases} \frac{1}{b-a}, & x \in [a,b], \\ 0, & \text{otherwise,} \end{cases}$$
so the joint p.d.f. is
$$f(x_1, \ldots, x_n|a, b) = \frac{1}{(b-a)^n}\,I(x_{\min} \ge a)\,I(x_{\max} \le b).$$
The indicator functions make the product equal to 0 if at least one of the points falls out of the range $[a,b]$ or, equivalently, if either the minimum $x_{\min} = \min(x_1, \ldots, x_n)$ or the maximum $x_{\max} = \max(x_1, \ldots, x_n)$ falls out of $[a,b]$. Clearly, if we take
$$T_1 = \max(X_1, \ldots, X_n) \quad \text{and} \quad T_2 = \min(X_1, \ldots, X_n),$$
then $(T_1, T_2)$ is jointly sufficient by the Neyman-Fisher factorization criterion.
Sufficient statistics:
- give a way of compressing information about the underlying parameter $\theta$;
- give a way of improving an estimator using a sufficient statistic (which takes us to our next topic).

Improving estimators. Suppose $\delta = \delta(X_1, \ldots, X_n)$ is an estimator of $\theta_0$ with mean squared error
$$\mathbb{E}_{\theta_0}\big(\delta(X_1, \ldots, X_n) - \theta_0\big)^2,$$
and let $T$ be a sufficient statistic. Define a new estimator
$$\delta' = \mathbb{E}_{\theta_0}(\delta|T).$$
Question: why doesn't $\delta'$ depend on $\theta_0$, even though formally the right hand side depends on $\theta_0$? Recall that this conditional expectation is the expectation of $\delta(x_1, \ldots, x_n)$ with respect to the conditional distribution
$$\mathbb{P}_{\theta_0}(X_1, \ldots, X_n|T),$$
which, since $T$ is sufficient, does not depend on $\theta_0$; hence $\delta'$ is a genuine statistic.

Theorem. We have,
$$\mathbb{E}_{\theta_0}(\delta' - \theta_0)^2 \le \mathbb{E}_{\theta_0}(\delta - \theta_0)^2.$$
Proof. Given random variables $X$ and $Y$, recall from probability theory that
$$\mathbb{E}X = \mathbb{E}\big\{\mathbb{E}(X|Y)\big\}.$$
Therefore,
$$\mathbb{E}_{\theta_0}(\delta - \theta_0)^2 = \mathbb{E}_{\theta_0}\big\{\mathbb{E}_{\theta_0}\big((\delta - \theta_0)^2|T\big)\big\},$$
and expanding the square inside,
$$\mathbb{E}_{\theta_0}\big((\delta - \theta_0)^2|T\big) = \mathbb{E}_{\theta_0}(\delta^2|T) - 2\theta_0\,\mathbb{E}_{\theta_0}(\delta|T) + \theta_0^2.$$
Since
$$\mathbb{E}_{\theta_0}(\delta^2|T) - \big(\mathbb{E}_{\theta_0}(\delta|T)\big)^2 = \mathrm{Var}_{\theta_0}(\delta|T) \ge 0,$$
we get
$$\mathbb{E}_{\theta_0}\big((\delta - \theta_0)^2|T\big) \ge \big(\mathbb{E}_{\theta_0}(\delta|T)\big)^2 - 2\theta_0\,\mathbb{E}_{\theta_0}(\delta|T) + \theta_0^2 = (\delta' - \theta_0)^2.$$
Taking expectations of both sides finishes the proof.
Lecture 13
13.1 Minimal jointly sufficient statistics.

When it comes to jointly sufficient statistics $(T_1, \ldots, T_k)$, the total number of them ($k$) is clearly very important and we would like it to be small. If we don't care about $k$ then we can always find some trivial examples of jointly sufficient statistics. For instance, the entire sample $X_1, \ldots, X_n$ is, obviously, always sufficient, but this choice is not interesting. Another trivial example is the order statistics $Y_1 \le Y_2 \le \cdots \le Y_n$, which are simply the values $X_1, \ldots, X_n$ arranged in increasing order, i.e.
$$Y_1 = \min(X_1, \ldots, X_n) \le \cdots \le Y_n = \max(X_1, \ldots, X_n).$$
$Y_1, \ldots, Y_n$ are jointly sufficient by the factorization criterion, since
$$f(X_1|\theta)\cdots f(X_n|\theta) = f(Y_1|\theta)\cdots f(Y_n|\theta).$$
When we face different choices of jointly sufficient statistics, how do we decide which one is better? The following definition seems natural.

Definition. (Minimal jointly sufficient statistics.) $(T_1, \ldots, T_k)$ are minimal jointly sufficient if, given any other jointly sufficient statistics $(r_1, \ldots, r_m)$, we have
$$T_1 = g_1(r_1, \ldots, r_m),\ \ldots,\ T_k = g_k(r_1, \ldots, r_m),$$
i.e. the $T$s can be expressed as functions of the $r$s.

How to decide whether $(T_1, \ldots, T_k)$ is minimal? One possible way to do this is through the maximum likelihood estimator, as follows. Suppose that the parameter set $\Theta$ is a subset of $\mathbb{R}^k$, i.e. for any $\theta \in \Theta$, $\theta = (\theta_1, \ldots, \theta_k)$ where $\theta_i \in \mathbb{R}$.
50
LECTURE 13.
= 2X,
it is not a function of Yn and, therefore, it can be improved.
Question. What is the distribution of the maximum Yn ?
End of material for Test 1. Problems on Test 1 will be similar to homework
problems and covers up to Pset 4.
51
LECTURE 13.
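The claim that $2\bar{X}$ can be improved by using $Y_n$ is easy to see numerically. As a hypothetical comparison (the competitor $\frac{n+1}{n}Y_n$ is the standard unbiased estimator built from the maximum; one can check $\mathbb{E}Y_n = \frac{n}{n+1}\theta$):

```python
import random
import statistics

random.seed(10)
theta0, n, reps = 2.0, 20, 30_000

mse_mom = mse_imp = 0.0
for _ in range(reps):
    xs = [random.uniform(0, theta0) for _ in range(n)]
    mom = 2 * statistics.fmean(xs)       # method of moments: 2 * sample mean
    imp = (n + 1) / n * max(xs)          # estimator based on Y_n = max
    mse_mom += (mom - theta0) ** 2
    mse_imp += (imp - theta0) ** 2

mse_mom /= reps
mse_imp /= reps
print(mse_mom, mse_imp)  # the max-based estimator has much smaller MSE
```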
13.2 $\chi^2$ distribution.

Let $X \sim N(0,1)$ be standard normal. The distribution of $X^2$ is called the $\chi^2$ distribution with one degree of freedom. For $x \ge 0$, its c.d.f. is
$$\mathbb{P}(X^2 \le x) = \mathbb{P}(-\sqrt{x} \le X \le \sqrt{x}) = \int_{-\sqrt{x}}^{\sqrt{x}} \frac{1}{\sqrt{2\pi}}\,e^{-\frac{t^2}{2}}\,dt.$$
The p.d.f. is equal to the derivative $\frac{d}{dx}\mathbb{P}(X^2 \le x)$ of the c.d.f. and, hence, the density of $X^2$ is
$$f_{X^2}(x) = \frac{d}{dx}\int_{-\sqrt{x}}^{\sqrt{x}} \frac{1}{\sqrt{2\pi}}\,e^{-\frac{t^2}{2}}\,dt = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{(\sqrt{x})^2}{2}}(\sqrt{x})' - \frac{1}{\sqrt{2\pi}}\,e^{-\frac{(-\sqrt{x})^2}{2}}(-\sqrt{x})' = \frac{1}{\sqrt{2\pi}}\,\frac{e^{-\frac{x}{2}}}{\sqrt{x}} = \frac{1}{\sqrt{2\pi}}\,x^{\frac{1}{2}-1}e^{-\frac{x}{2}}.$$
We recognize this as the p.d.f. of the Gamma distribution $\Gamma(\frac{1}{2}, \frac{1}{2})$, whose p.d.f. is $\frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}$ for $x \ge 0$, so $X^2 \sim \Gamma(\frac{1}{2}, \frac{1}{2})$.

Next, let us compute the moment generating function of $\Gamma(\alpha, \beta)$. For $t < \beta$,
$$\mathbb{E}e^{tX} = \int_0^\infty e^{tx}\frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}\,dx = \frac{\beta^\alpha}{\Gamma(\alpha)}\int_0^\infty x^{\alpha-1}e^{-(\beta-t)x}\,dx = \frac{\beta^\alpha}{(\beta-t)^\alpha}\underbrace{\int_0^\infty \frac{(\beta-t)^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-(\beta-t)x}\,dx}_{=1}.$$
The function in the underbraced integral is a p.d.f. of the Gamma distribution $\Gamma(\alpha, \beta-t)$ and, therefore, it integrates to 1. We get
$$\mathbb{E}e^{tX} = \Big(\frac{\beta}{\beta-t}\Big)^\alpha.$$
If $X_i \sim \Gamma(\alpha_i, \beta)$ are independent, then the m.g.f. of $\sum_{i=1}^n X_i$ is
$$\mathbb{E}e^{t\sum_{i=1}^n X_i} = \prod_{i=1}^n \mathbb{E}e^{tX_i} = \prod_{i=1}^n \Big(\frac{\beta}{\beta-t}\Big)^{\alpha_i} = \Big(\frac{\beta}{\beta-t}\Big)^{\sum_i \alpha_i},$$
which is the m.g.f. of $\Gamma\big(\sum_{i=1}^n \alpha_i,\ \beta\big)$. Therefore,
$$\sum_{i=1}^n X_i \sim \Gamma\Big(\sum_{i=1}^n \alpha_i,\ \beta\Big).$$
In particular, if $X_1, \ldots, X_n \sim N(0,1)$ are i.i.d. standard normal, then each $X_i^2 \sim \Gamma(\frac{1}{2}, \frac{1}{2})$, so
$$X_1^2 + \cdots + X_n^2 \sim \Gamma\Big(\frac{n}{2}, \frac{1}{2}\Big),$$
the $\chi^2_n$ distribution with $n$ degrees of freedom.
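Since $\Gamma(\alpha, \beta)$ has mean $\alpha/\beta$ and variance $\alpha/\beta^2$, a sum of $n$ squared standard normals, being $\Gamma(n/2, 1/2)$, should have mean $n$ and variance $2n$. A Monte Carlo sketch with $n = 4$ (an arbitrary illustrative choice):

```python
import random
import statistics

random.seed(11)
n, reps = 4, 100_000

# Sum of squares of n i.i.d. N(0,1) variables, repeated many times.
sums = [sum(random.gauss(0, 1) ** 2 for _ in range(n)) for _ in range(reps)]
m = statistics.fmean(sums)
v = statistics.pvariance(sums)
print(m, v)  # close to n = 4 and 2n = 8
```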
Lecture 14
Consider a sample $X_1, \ldots, X_n$ from a normal distribution with mean $\mu$ and variance $\sigma^2$. Using different methods (for example, maximum likelihood) we showed that one can take $\bar{X}$ as an estimate of the mean $\mu$ and $\overline{X^2} - (\bar{X})^2$ as an estimate of the variance $\sigma^2$. The question is: how close are these estimates to the actual values of the unknown parameters? By the LLN we know that these estimates converge to $\mu$ and $\sigma^2$,
$$\bar{X} \to \mu, \quad \overline{X^2} - (\bar{X})^2 \to \sigma^2, \quad n \to \infty,$$
but we will try to describe precisely how close $\bar{X}$ and $\overline{X^2} - (\bar{X})^2$ are to $\mu$ and $\sigma^2$. We will start by studying the following

Question. What is the joint distribution of $\big(\bar{X},\ \overline{X^2} - (\bar{X})^2\big)$ when the sample
$$X_1, \ldots, X_n \sim N(0, 1)$$
has standard normal distribution?
Orthogonal transformations.

The student well familiar with orthogonal transformations may skip to the beginning of the next lecture. Right now we will repeat some very basic discussion from linear algebra and recall some properties and the geometric meaning of orthogonal transformations. To make our discussion as easy as possible we will consider the case of 3-dimensional space ℝ³.

Let us consider an orthonormal basis (e1, e2, e3) as shown in figure 14.1, i.e. they are orthogonal to each other and each has length one. Then any vector X can be represented as

    X = X1 e1 + X2 e2 + X3 e3,
Figure 14.1: an orthogonal transformation rotates the basis vectors e1 = (1, 0, 0), e2, e3 into v1, v2, v3.
Let us check that the transpose V^T defines the inverse rotation. For example, let us check that the vector v1 = (v11, v12, v13) goes back to e1 = (1, 0, 0). We have

    v1 V^T = (v11² + v12² + v13², v11 v21 + v12 v22 + v13 v23, v11 v31 + v12 v32 + v13 v33)
           = ((length of v1)², v1 · v2, v1 · v3) = (1, 0, 0).
Figure 14.2: the transformation V^T rotates the vectors v1, v2, v3 back into the basis vectors e1 = (1, 0, 0), e2 = (0, 1, 0), e3 = (0, 0, 1).
Lecture 15
Let V = (v_{ki}) be an orthogonal matrix and let Y = (Y1, ..., Yn) = XV for a standard normal vector X = (X1, ..., Xn). Each coordinate

    Yi = Σ_{k=1}^n v_{ki} Xk

is a sum of independent normal r.v.s and, therefore, Yi has normal distribution with mean 0 and variance

    Var(Yi) = Σ_{k=1}^n v_{ki}² = 1,

since the matrix V is orthogonal and the length of each column vector is 1. So each r.v. Yi ~ N(0, 1). Any two r.v.s Yi and Yj, i ≠ j, in this sequence are uncorrelated since

    E Yi Yj = Σ_{k=1}^n v_{ki} v_{kj} = 0

by the orthogonality of the columns of V,
which means that we want to show that the Y's are i.i.d. standard normal, i.e. Y has the same distribution as X. Let us show this more accurately. Given a vector t = (t1, ..., tn), the moment generating function of the i.i.d. sequence X1, ..., Xn can be computed as follows:

    φ(t) = E e^{X t^T} = E e^{t1 X1 + ... + tn Xn} = Π_{i=1}^n E e^{ti Xi} = Π_{i=1}^n e^{ti²/2} = e^{(1/2) Σ_{i=1}^n ti²} = e^{(1/2)|t|²}.
On the other hand, since Y = XV and

    t1 Y1 + ... + tn Yn = (Y1, ..., Yn)(t1, ..., tn)^T = Y t^T = XV t^T = X (tV^T)^T,

we get

    E e^{t1 Y1 + ... + tn Yn} = E e^{XV t^T} = E e^{X (tV^T)^T} = e^{(1/2)|tV^T|²} = e^{(1/2)|t|²},

since the orthogonal transformation V^T preserves lengths. The moment generating functions of Y = (Y1, ..., Yn) = XV and X = (X1, ..., Xn) coincide and, therefore, Y1, ..., Yn are also i.i.d. standard normal.

Let us now choose the first column of V to be the unit vector

    v1 = (1/√n, ..., 1/√n)

and complete it to an orthonormal basis v1, ..., vn.
Then the first coordinate of Y = XV is

    Y1 = X · v1 = (1/√n)(X1 + ... + Xn) = √n X̄, i.e. X̄ = (1/√n) Y1.  (15.1)

Also,

    n(X̄² − (X̄)²) = X1² + ... + Xn² − (1/n)(X1 + ... + Xn)² = X1² + ... + Xn² − Y1².

But the orthogonal transformation V preserves the length,

    Y1² + ... + Yn² = X1² + ... + Xn²,

and, therefore, we get

    n(X̄² − (X̄)²) = Y1² + ... + Yn² − Y1² = Y2² + ... + Yn².  (15.2)

Equations (15.1) and (15.2) show that the sample mean and sample variance are independent, since Y1 and (Y2, ..., Yn) are independent; √n X̄ = Y1 has standard normal distribution, and n(X̄² − (X̄)²) has χ²_{n−1} distribution since Y2, ..., Yn are independent standard normal.
Consider now the case when

    X1, ..., Xn ~ N(μ, σ²)

are i.i.d. normal random variables with mean μ and variance σ². In this case, we know that

    Z1 = (X1 − μ)/σ, ..., Zn = (Xn − μ)/σ ~ N(0, 1)

are i.i.d. standard normal and, therefore, by the above,

    √n Z̄ = (1/√n) Σ_{i=1}^n (Xi − μ)/σ = √n (X̄ − μ)/σ ~ N(0, 1)

and

    n(Z̄² − (Z̄)²) = n[(1/n) Σ_{i=1}^n ((Xi − μ)/σ)² − ((1/n) Σ_{i=1}^n (Xi − μ)/σ)²] = n(X̄² − (X̄)²)/σ² ~ χ²_{n−1},

and these two random variables are independent.
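The independence of sample mean and sample variance proved above can be illustrated by simulation: for normal samples their empirical covariance should vanish, and E[n(X̄² − (X̄)²)] should equal n − 1, the mean of χ²_{n−1}. A small sketch (sizes and seed are arbitrary choices):

```python
import random

random.seed(1)
n, trials = 10, 100_000

means, svars = [], []
for _ in range(trials):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    m = sum(x) / n
    means.append(m)
    svars.append(sum(xi * xi for xi in x) / n - m * m)  # X^2bar - (Xbar)^2

mm = sum(means) / trials
mv = sum(svars) / trials
# covariance of the two statistics: should be ~0 by independence
cov = sum((a - mm) * (b - mv) for a, b in zip(means, svars)) / trials
print(round(cov, 4), round(n * mv, 2))  # n * E[svar] should be ~ n - 1 = 9
```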
Lecture 16
16.1 Fisher F and Student t distributions.
Let

    X = X1² + ... + Xk² and Y = Y1² + ... + Ym²

for i.i.d. standard normal X1, ..., Xk, Y1, ..., Ym, so that X ~ χ²_k and Y ~ χ²_m. Then X has p.d.f.

    f(x) = ((1/2)^{k/2}/Γ(k/2)) x^{k/2 − 1} e^{−x/2}

and Y has p.d.f.

    g(y) = ((1/2)^{m/2}/Γ(m/2)) y^{m/2 − 1} e^{−y/2},

for x ≥ 0 and y ≥ 0. To find the p.d.f. of the ratio X/Y, let us first recall how to write its cumulative distribution function. Since X and Y are always positive, their ratio is also positive and, therefore, for t ≥ 0 we can write:
    P(X/Y ≤ t) = P(X ≤ tY) = E I(X ≤ tY)
               = ∫_0^∞ ∫_0^∞ I(x ≤ ty) f(x) g(y) dx dy
               = ∫_0^∞ ( ∫_0^{ty} f(x) dx ) g(y) dy.

The density of the ratio is the derivative

    (d/dt) P(X/Y ≤ t) = (d/dt) ∫_0^∞ ∫_0^{ty} f(x) g(y) dx dy = ∫_0^∞ f(ty) g(y) y dy
    = ∫_0^∞ ((1/2)^{k/2}/Γ(k/2)) (ty)^{k/2−1} e^{−ty/2} ((1/2)^{m/2}/Γ(m/2)) y^{m/2−1} e^{−y/2} y dy
    = ((1/2)^{(k+m)/2}/(Γ(k/2)Γ(m/2))) t^{k/2−1} ∫_0^∞ y^{(k+m)/2 − 1} e^{−(1/2)(t+1)y} dy.
The function in the underbraced integral almost looks like a p.d.f. of the gamma distribution Γ(α, λ) with parameters α = (k+m)/2 and λ = (1+t)/2, only the constant in front is missing. If we multiply and divide by this constant, we will get that

    (d/dt) P(X/Y ≤ t)
    = ((1/2)^{(k+m)/2}/(Γ(k/2)Γ(m/2))) t^{k/2−1} (Γ((k+m)/2)/((1/2)(t+1))^{(k+m)/2}) ∫_0^∞ (((1/2)(t+1))^{(k+m)/2}/Γ((k+m)/2)) y^{(k+m)/2−1} e^{−(1/2)(t+1)y} dy
    = (Γ((k+m)/2)/(Γ(k/2)Γ(m/2))) t^{k/2−1} (1 + t)^{−(k+m)/2},

since the gamma density inside the integral integrates to 1.
Now consider the random variable

    Z = X1 / √((1/m)(Y1² + ... + Ym²)).

Then

    P(−t ≤ Z ≤ t) = P(Z² ≤ t²) = P( X1²/(Y1² + ... + Ym²) ≤ t²/m ).

If f_Z(x) denotes the p.d.f. of Z then the left hand side can be written as

    P(−t ≤ Z ≤ t) = ∫_{−t}^{t} f_Z(x) dx.

On the other hand, by definition, X1²/(Y1² + ... + Ym²) has Fisher distribution F_{1,m} with 1 and m degrees of freedom and, therefore, the right hand side can be written as

    ∫_0^{t²/m} f_{1,m}(x) dx.

We get that

    ∫_{−t}^{t} f_Z(x) dx = ∫_0^{t²/m} f_{1,m}(x) dx.

Taking the derivative in t on both sides (and using that f_Z is symmetric, so that the left side gives 2 f_Z(t)) yields

    2 f_Z(t) = f_{1,m}(t²/m) (2t/m),

so that

    f_Z(t) = (t/m) f_{1,m}(t²/m) = (t/m) (Γ((m+1)/2)/(Γ(1/2)Γ(m/2))) (t²/m)^{−1/2} (1 + t²/m)^{−(m+1)/2}
           = (Γ((m+1)/2)/(√(πm) Γ(m/2))) (1 + t²/m)^{−(m+1)/2},

using Γ(1/2) = √π. This is the p.d.f. of the Student t_m distribution with m degrees of freedom.
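The density just derived can be checked against a simulation of Z = X1/√((Y1² + ... + Ym²)/m): the empirical density near 0 should approach f_Z(0) = Γ((m+1)/2)/(√(πm) Γ(m/2)). A sketch (m, the bin half-width h, the trial count and the seed are arbitrary choices):

```python
import math
import random

random.seed(2)
m, trials, h = 3, 400_000, 0.1

hits = 0
for _ in range(trials):
    x = random.gauss(0.0, 1.0)
    y2 = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(m))
    z = x / math.sqrt(y2 / m)
    if abs(z) <= h:
        hits += 1

# empirical density at 0 = fraction of samples in [-h, h] divided by interval length
density_at_0 = hits / trials / (2 * h)
exact = math.gamma((m + 1) / 2) / (math.sqrt(math.pi * m) * math.gamma(m / 2))
print(round(density_at_0, 3), round(exact, 3))
```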
Lecture 17
In other words, these estimates are consistent. In this lecture we will try to describe
precisely, in some sense, how close sample mean and sample variance are to these
unknown parameters that they estimate.
Let us start by giving a definition of a confidence interval in our usual setting when we observe a sample X1, ..., Xn with distribution P_{θ0} from a parametric family {P_θ : θ ∈ Θ}, and θ0 is unknown.
Definition: Given a parameter α ∈ [0, 1], which we will call the confidence level, if there are two statistics S1 = S1(X1, ..., Xn) and S2 = S2(X1, ..., Xn) such that

    P_{θ0}(S1 ≤ θ0 ≤ S2) = α (or ≥ α),

then we call the interval [S1, S2] a confidence interval for the unknown parameter θ0 with confidence level α.

This definition means that we can guarantee with probability/confidence α that our unknown parameter lies within the interval [S1, S2]. We will now show how in the case of the normal distribution N(μ, σ²) we can use the estimates (sample mean and sample variance) to construct confidence intervals for the unknown μ0 and σ0².
Let us recall from the lecture before last that we proved that when

    X1, ..., Xn are i.i.d. with distribution N(μ0, σ0²),
then

    A = √n (X̄ − μ0)/σ0 ~ N(0, 1) and B = n(X̄² − (X̄)²)/σ0² ~ χ²_{n−1},

and the random variables A and B are independent. If we recall the definition of the χ² distribution, this means that we can represent A and B as

    A = Y1 and B = Y2² + ... + Yn²

for some Y1, ..., Yn i.i.d. standard normal.
If we look at the p.d.f. of the χ²_{n−1} distribution (see figure 17.1) and choose constants c1 and c2 so that the area in each tail is (1 − α)/2, i.e.

    P(B ≤ c1) = (1 − α)/2 and P(B ≥ c2) = (1 − α)/2,

then

    P(c1 ≤ B ≤ c2) = α,

which can be rewritten as

    c1 ≤ n(X̄² − (X̄)²)/σ0² ≤ c2

and solved for σ0².
Next, consider the random variable

    A/√(B/(n−1)) = Y1/√((Y2² + ... + Yn²)/(n−1)) ~ t_{n−1},

which by definition has Student t_{n−1} distribution with n − 1 degrees of freedom. On the other hand,

    A/√(B/(n−1)) = (√n (X̄ − μ0)/σ0) / √( n(X̄² − (X̄)²)/((n−1)σ0²) ) = (X̄ − μ0)/√((X̄² − (X̄)²)/(n−1)).
If we now look at the p.d.f. of the t_{n−1} distribution (see figure 17.2) and choose the constant c so that the area in each tail is (1 − α)/2 (the constant is the same on each side because the distribution is symmetric), we get that with probability α,

    −c ≤ (X̄ − μ0)/√((X̄² − (X̄)²)/(n−1)) ≤ c,

and solving for μ0,

    X̄ − c √((X̄² − (X̄)²)/(n−1)) ≤ μ0 ≤ X̄ + c √((X̄² − (X̄)²)/(n−1)).
For a particular sample of size n = 10 (so that n − 1 = 9), the interval

    X̄ − c √((X̄² − (X̄)²)/9) ≤ μ0 ≤ X̄ + c √((X̄² − (X̄)²)/9)

evaluates to

    0.6377 ≤ μ0 ≤ 2.1203,

and with probability 95% we can guarantee that

    n(X̄² − (X̄)²)/c2 ≤ σ0² ≤ n(X̄² − (X̄)²)/c1, i.e. 0.0508 ≤ σ0² ≤ 0.3579.

These confidence intervals may not look impressive, but the sample size is very small here, n = 10.
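The two confidence intervals above can be sketched numerically. This is a minimal illustration, not the text's dataset: the sample below is synthetic, and the quantiles of t_9 and χ²_9 are standard table values.

```python
import math
import random

random.seed(3)
# hypothetical sample of size n = 10 from N(2, 0.5^2), for illustration only
x = [random.gauss(2.0, 0.5) for _ in range(10)]
n = len(x)
xbar = sum(x) / n
s2 = sum(xi * xi for xi in x) / n - xbar ** 2   # X^2bar - (Xbar)^2

# standard table values: t_9 97.5% point, chi^2_9 2.5% and 97.5% points
c_t, c1, c2 = 2.262, 2.700, 19.023

mu_lo = xbar - c_t * math.sqrt(s2 / (n - 1))
mu_hi = xbar + c_t * math.sqrt(s2 / (n - 1))
var_lo = n * s2 / c2   # dividing by the upper chi^2 quantile gives the lower end
var_hi = n * s2 / c1
print(f"mu0 in [{mu_lo:.3f}, {mu_hi:.3f}], sigma0^2 in [{var_lo:.3f}, {var_hi:.3f}]")
```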
Lecture 18
Testing hypotheses.
(Textbook, Chapter 8)
18.1 Testing simple hypotheses.

Suppose that we observe a sample X1, ..., Xn with distribution P that comes from one of k possible distributions P1, ..., Pk, and we want to decide among k simple hypotheses:

    H1 : P = P1
    H2 : P = P2
    ...
    Hk : P = Pk.
We call these hypotheses simple because each hypothesis asks a simple question about whether P is equal to some particular specified distribution.
To decide among these hypotheses means that, given the data vector

    X = (X1, ..., Xn) ∈ 𝒳^n,

we have to decide which hypothesis to pick or, in other words, we need to find a decision rule, which is a function

    δ : 𝒳^n → {H1, ..., Hk}.
Let us note that sometimes this function can be random, because sometimes several hypotheses may look equally likely and it will make sense to pick among them randomly. This idea of a randomized decision rule will be explained more clearly as we go on, but for now we can think of δ as a simple function of the data.
The main quantity that describes the quality of a decision rule δ is the error of type i,

    αi = αi(δ) = Pi(δ ≠ Hi),

i.e. the probability that the sample comes from the distribution Pi but δ fails to pick Hi. In the case of two simple hypotheses, the error of type 1,

    α1 = P1(δ ≠ H1),

is also called the size or level of significance of the decision rule δ, and one minus the type 2 error,

    β = 1 − α2 = 1 − P2(δ ≠ H2) = P2(δ = H2),

is called the power of δ.

Example. (Medical test.) Suppose that a test decides between H1: the patient has a certain disease, and H2: the patient does not. Then the error of type 1 is P(δ = H2 | H1),
i.e. we predict that the person does not have the disease when he actually does, and the error of type 2 is

    P(δ = H1 | H2),

i.e. we predict that the person does have the disease when he actually does not. Clearly, these errors are of a very different nature. For example, in the first case the patient will not get a treatment that he needs, and in the second case he will get unnecessary, possibly harmful treatment when he does not need it, given that no additional tests are conducted.
Example. Radar missile detection/recognition. Suppose that an image on the
radar is tested to be a missile versus, say, a passenger plane.
H1 : missile, H2 : not missile.
Then the error of type one,

    P(δ = H2 | H1),

means that we will possibly shoot down a passenger plane (which has happened before).
Another example could be when a guilty or not guilty verdict in court is decided based on some tests, and one can think of many examples like this. Therefore, in many situations it is natural to control a certain type of error, give guarantees that this error does not exceed some acceptable level, and try to minimize all other types of errors. For example, in the case of two simple hypotheses, given the largest acceptable error of type one α ∈ [0, 1], we will look for a decision rule in the class

    K_α = {δ : α1 = P1(δ ≠ H1) ≤ α},
and in this class we will try to make the error of type 2, α2 = P2(δ ≠ H2), as small as possible.

18.2 Bayes decision rules.
We will start with another way to control the trade-off among different types of errors, which consists in minimizing a weighted error.
Given hypotheses H1, ..., Hk, let us consider k nonnegative weights ξ(1), ..., ξ(k) that add up to one, Σ_{i=1}^k ξ(i) = 1. We can think of the weights as an a priori probability on the set of our hypotheses that represents their relative importance. Then the Bayes error of a decision rule δ is defined as
    α(δ) = Σ_{i=1}^k ξ(i) αi = Σ_{i=1}^k ξ(i) Pi(δ ≠ Hi),

which is simply a weighted error. Of course, we want to make this weighted error as small as possible.
Definition: The decision rule δ that minimizes α(δ) is called the Bayes decision rule.

The next theorem tells us how to find this Bayes decision rule in terms of the p.d.f.s or p.f.s of the distributions Pi.

Theorem. Assume that each distribution Pi has p.d.f. or p.f. fi(x). Then the decision rule

    δ = Hj if ξ(j) fj(X1) ... fj(Xn) = max_{1≤i≤k} ξ(i) fi(X1) ... fi(Xn)

is the Bayes decision rule.

Proof. We can rewrite the Bayes error as

    α(δ) = Σ_{i=1}^k ξ(i) Pi(δ ≠ Hi) = Σ_{i=1}^k ξ(i)(1 − Pi(δ = Hi)) = 1 − Σ_{i=1}^k ξ(i) Pi(δ = Hi)
         = 1 − Σ_{i=1}^k ξ(i) ∫ I(δ = Hi) fi(x1) ... fi(xn) dx1 ... dxn.
To minimize this Bayes error we need to maximize this last integral, and we can actually maximize the sum inside the integral,

    ξ(1) f1(x1) ... f1(xn) I(δ = H1) + ... + ξ(k) fk(x1) ... fk(xn) I(δ = Hk),

by choosing δ appropriately. For each (x1, ..., xn) the decision rule δ picks only one hypothesis, which means that only one term in this sum will be nonzero: if δ picks Hj then only the indicator I(δ = Hj) will be nonzero and the sum will be equal to

    ξ(j) fj(x1) ... fj(xn).

Therefore, to maximize the integral, δ should simply pick the hypothesis that maximizes this expression, exactly as in the statement of the Theorem. This finishes the proof.
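The theorem's rule — pick the hypothesis Hj maximizing ξ(j) fj(X1) ... fj(Xn) — can be sketched in a few lines for normal hypotheses with unit variance. The function name and the sample values below are illustrative, not from the text; logarithms are used only for numerical stability:

```python
import math

def normal_pdf(x, mu):
    return math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)

def bayes_rule(sample, mus, weights):
    """Pick the index j maximizing xi(j) * f_j(X_1) ... f_j(X_n)."""
    scores = []
    for mu, w in zip(mus, weights):
        # log of xi(j) * product of densities = log weight + sum of log densities
        ll = sum(math.log(normal_pdf(x, mu)) for x in sample)
        scores.append(math.log(w) + ll)
    return max(range(len(mus)), key=lambda j: scores[j])

# hypothetical setup: H1: N(0,1), H2: N(1,1), equal prior weights
print(bayes_rule([0.1, -0.3, 0.2], [0.0, 1.0], [0.5, 0.5]))  # sample near 0 -> index 0
```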
Lecture 19
In the last lecture we found the Bayes decision rule that minimizes the Bayes error

    α(δ) = Σ_{i=1}^k ξ(i) αi = Σ_{i=1}^k ξ(i) Pi(δ ≠ Hi).
Let us write down this decision rule in the case of two simple hypotheses H1, H2. For simplicity of notation, given the sample X = (X1, ..., Xn) we will denote the joint p.d.f. by

    fi(X) = fi(X1) ... fi(Xn).

Then in the case of two simple hypotheses the Bayes decision rule that minimizes the Bayes error

    α(δ) = ξ(1) P1(δ ≠ H1) + ξ(2) P2(δ ≠ H2)

is given by

    δ = H1 : f1(X)/f2(X) > ξ(2)/ξ(1)
        H2 : f1(X)/f2(X) < ξ(2)/ξ(1)
        H1 or H2 : f1(X)/f2(X) = ξ(2)/ξ(1).  (19.1)

(Here we use the conventions 1/0 = +∞ and 0/0 = 0.) This kind of test is called a likelihood ratio test, since it is expressed in terms of the ratio f1(X)/f2(X) of the likelihood functions.
Example. Suppose we have only one observation X1 and two simple hypotheses H1: P = N(0, 1) and H2: P = N(1, 1). Let us take an a priori distribution given by

    ξ(1) = 1/2 and ξ(2) = 1/2,
i.e. both hypotheses have equal weight, and find the Bayes decision rule that minimizes the Bayes error

    α(δ) = (1/2) P1(δ ≠ H1) + (1/2) P2(δ ≠ H2).

By (19.1), the Bayes decision rule is

    δ(X1) = H1 : f1(X)/f2(X) > 1
            H2 : f1(X)/f2(X) < 1
            H1 or H2 : f1(X)/f2(X) = 1.
This decision rule has a very intuitive interpretation. If we look at the graphs of these p.d.f.s (figure 19.1), the decision rule picks the first hypothesis when the first p.d.f. is larger, to the left of point C, and otherwise picks the second hypothesis, to the right of point C.
Figure 19.1: the p.d.f.s f1 and f2 of N(0, 1) and N(1, 1); δ picks H1 to the left of the crossing point C and H2 to the right.
Next, suppose we have a sample X = (X1, ..., Xn) and general weights ξ(1), ξ(2). The likelihood ratio is

    f1(X)/f2(X) = ((1/√(2π))^n e^{−(1/2) Σ Xi²}) / ((1/√(2π))^n e^{−(1/2) Σ (Xi−1)²})
                = e^{(1/2) Σ_{i=1}^n ((Xi − 1)² − Xi²)} = e^{n/2 − Σ Xi},

and the Bayes decision rule picks H1 when

    e^{n/2 − Σ Xi} > ξ(2)/ξ(1),
or, equivalently,

    Σ Xi < n/2 − log(ξ(2)/ξ(1)),

picks H2 when

    Σ Xi > n/2 − log(ξ(2)/ξ(1)),

and picks either hypothesis in the case of equality.
Let us now go back to the class of decision rules with error of type 1 bounded by the level of significance α,

    K_α = {δ : α1 = P1(δ ≠ H1) ≤ α},

in which we want to find the most powerful test. Suppose that we can find a threshold c such that

    P1( f1(X)/f2(X) < c ) = α.

Consider the decision rule

    δ = H1 : f1(X)/f2(X) ≥ c
        H2 : f1(X)/f2(X) < c,

and take the weights

    ξ(1) = 1/(1+c) and ξ(2) = c/(1+c), so that ξ(2)/ξ(1) = c.  (19.2)
Then the decision rule δ in (19.2) is the Bayes decision rule corresponding to the weights ξ(1) and ξ(2), which can be seen by comparing it with (19.1); only here we break the tie in favor of H1. Therefore, this decision rule δ minimizes the Bayes error, which means that for any other decision rule δ′,

    ξ(1) P1(δ ≠ H1) + ξ(2) P2(δ ≠ H2) ≤ ξ(1) P1(δ′ ≠ H1) + ξ(2) P2(δ′ ≠ H2).  (19.3)

By the choice of c,

    α1(δ) = P1(δ ≠ H1) = P1( f1(X)/f2(X) < c ) = α,

so δ ∈ K_α. If δ′ is any other decision rule in K_α, then P1(δ′ ≠ H1) ≤ α = P1(δ ≠ H1), and (19.3) implies

    ξ(2) P2(δ ≠ H2) ≤ ξ(1)(P1(δ′ ≠ H1) − P1(δ ≠ H1)) + ξ(2) P2(δ′ ≠ H2) ≤ ξ(2) P2(δ′ ≠ H2)

and, therefore,

    α2(δ) = P2(δ ≠ H2) ≤ P2(δ′ ≠ H2) = α2(δ′).

This exactly means that δ is more powerful than any other decision rule in the class K_α.
Example. Suppose we have a sample X = (X1, ..., Xn) and two simple hypotheses H1: P = N(0, 1) and H2: P = N(1, 1). Let us find the most powerful test with the error of type 1 equal to α = 0.05. We need to find c such that

    P1( f1(X)/f2(X) < c ) = α = 0.05,

which by the computation above is equivalent to

    P1( Σ Xi > n/2 − log c ) = 0.05.

Under H1 the random variable Y = (1/√n) Σ_{i=1}^n Xi is standard normal, so we need

    P(Y > c′) = 0.05 for c′ = (n/2 − log c)/√n,

and from the table of the standard normal distribution, c′ = 1.64. The most powerful test with level of significance α = 0.05 will look like this:

    δ = H1 : (1/√n) Σ Xi ≤ 1.64
        H2 : (1/√n) Σ Xi > 1.64.
The error of type 2 for this test is

    α2 = P2(δ ≠ H2) = P2( (1/√n) Σ Xi ≤ 1.64 ) = P2( (1/√n) Σ_{i=1}^n (Xi − 1) ≤ 1.64 − √n ).

The reason we subtracted 1 from each Xi is that under the second hypothesis the X's have distribution N(1, 1), so the random variable

    Y = (1/√n) Σ_{i=1}^n (Xi − 1)

will be standard normal. Therefore, the error of type 2 for this test will be equal to the probability P(Y ≤ 1.64 − √n) for standard normal Y. For example, when the sample size is n = 10 this will be P(Y ≤ 1.64 − √10) ≈ 0.064.
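The final probability can be evaluated with the standard normal c.d.f., Φ(x) = (1/2)(1 + erf(x/√2)). A sketch computing the type 2 error P(Y ≤ 1.64 − √n) for a few sample sizes (the helper name `phi` is ours, not from the text):

```python
import math

def phi(x):
    # standard normal c.d.f. via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# type 2 error of the most powerful 0.05-level test shrinks quickly with n
for n in (5, 10, 20):
    alpha2 = phi(1.64 - math.sqrt(n))
    print(n, round(alpha2, 4))
```

For n = 10 this gives a value near 0.064, matching the computation in the text.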
Lecture 20
20.1 Randomized most powerful test.
In the theorem in the last lecture we showed how to find the most powerful test with level of significance α (which means that δ ∈ K_α), if we can find c such that

    P1( f1(X)/f2(X) < c ) = α.
This condition is not always fulfilled, especially when we deal with discrete distributions, as will become clear from the examples below. But if we look carefully at the proof of that theorem, this condition was only necessary to make sure that the likelihood ratio test has error of type 1 exactly equal to α. In our next theorem we will show that the most powerful test in the class K_α can always be found if one randomly breaks the tie between two hypotheses in a way that ensures that the error of type one is equal to α.
Theorem. Given any α ∈ [0, 1] we can always find c ∈ [0, ∞) and p ∈ [0, 1] such that

    P1( f1(X)/f2(X) < c ) + (1 − p) P1( f1(X)/f2(X) = c ) = α.  (20.1)

In this case, the most powerful test δ ∈ K_α is given by

    δ = H1 : f1(X)/f2(X) > c
        H2 : f1(X)/f2(X) < c
        H1 or H2 : f1(X)/f2(X) = c,

where in the last case of equality we break the tie at random by choosing H1 with probability p and choosing H2 with probability 1 − p.
This test is called a randomized test since we break a tie at random if necessary.
Proof. Let us first assume that we can find c and p such that (20.1) holds. Then the error of type 1 for the randomized test δ above can be computed:

    α1 = P1(δ ≠ H1) = P1( f1(X)/f2(X) < c ) + (1 − p) P1( f1(X)/f2(X) = c ) = α,  (20.2)

since δ does not pick H1 exactly when the likelihood ratio is less than c, or when it is equal to c and H1 is not picked, which happens with probability 1 − p. This means that the randomized test δ ∈ K_α. The rest of the proof repeats the proof of the last theorem.
We only need to point out that our randomized test will still be a Bayes test, since in the case of equality

    f1(X)/f2(X) = c

the Bayes test allows one to break the tie arbitrarily, and we choose to break it randomly in a way that ensures that the error of type one will be equal to α, as in (20.2).
The only question left is why we can always choose c and p such that (20.1) is satisfied. Consider the function

    F(t) = P1( f1(X)/f2(X) < t ),

which behaves like a cumulative distribution function except that it can have jumps. Take c to be the point at which F crosses the level α, so that either (a)

    F(c) = P1( f1(X)/f2(X) < c ) = α,

or (b) F jumps over the level α at c, i.e.

    F(c) = P1( f1(X)/f2(X) < c ) < α ≤ F(c) + P1( f1(X)/f2(X) = c ).

Then (20.1) will hold if in case (a) we take p = 1, and in case (b) we take

    1 − p = (α − F(c)) / P1( f1(X)/f2(X) = c ).
Example. Consider an example where the observation X takes two values, 0 and 1, and the likelihood ratio takes only two values:

    f1(X)/f2(X) = 1/2 if X = 1 and 4/3 if X = 0,

with P1(X = 1) = 0.2 and P1(X = 0) = 0.8. Figure 20.1 shows the graph of the function

    F(c) = P1( f1(X)/f2(X) < c ).

Let us explain how this graph is obtained. If c ≤ 1/2, then the set

    { f1(X)/f2(X) < c } is empty and F(c) = P1(∅) = 0;

if 1/2 < c ≤ 4/3, then

    { f1(X)/f2(X) < c } = {X = 1} and F(c) = P1(X = 1) = 0.2;

and, finally, if c > 4/3, then

    { f1(X)/f2(X) < c } = {X = 0 or 1} and F(c) = P1(X = 0 or 1) = 1,
as shown in figure 20.1. The function F(c) jumps over the level α = 0.05 at the point c = 1/2. To determine p we have to make sure that the error of type one is equal to 0.05, i.e.

    P1( f1(X)/f2(X) < c ) + (1 − p) P1( f1(X)/f2(X) = c ) = 0 + (1 − p) 0.2 = 0.05,
which gives p = 3/4. Therefore, the most powerful test of size α = 0.05 is

    δ = H1 : f1(X)/f2(X) > 1/2, i.e. X = 0
        H2 : f1(X)/f2(X) < 1/2, i.e. never
        H1 or H2 : f1(X)/f2(X) = 1/2, i.e. X = 1,

where in the last case X = 1 we pick H1 with probability 3/4 or H2 with probability 1/4.
Composite hypotheses. Let us now consider a more general situation, when the parameter set Θ is split into two disjoint sets Θ1 and Θ2 and we consider the composite hypotheses

    H1 : θ ∈ Θ1 and H2 : θ ∈ Θ2.

For a decision rule δ, consider

    Π(δ, θ) = P_θ(δ ≠ H1) as a function of θ,

which is called the power function of δ. The power function has a different meaning depending on whether θ comes from Θ1 or Θ2, as can be seen in figure 20.2. For θ ∈ Θ1 the power function represents the error of type 1, since θ actually comes from the set in the first hypothesis H1 and δ rejects H1. If θ ∈ Θ2 then the power function represents the power, or one minus the error of type two, since in this case θ belongs to a set from the second hypothesis H2 and δ accepts H2. Therefore, ideally, we would like to minimize the power function for all θ ∈ Θ1 and maximize it for all θ ∈ Θ2.

Consider

    α1(δ) = sup_{θ ∈ Θ1} Π(δ, θ) = sup_{θ ∈ Θ1} P_θ(δ ≠ H1),
Figure 20.2: the power function Π(δ, θ); ideally it should be minimized in the Θ1 region and maximized in the Θ2 region.
Lecture 21
21.1 Monotone likelihood ratio.
In the last lecture we gave the definition of the UMP test and mentioned that under certain conditions the UMP test exists. In this section we will describe a property called monotone likelihood ratio, which will be used in the next section to find the UMP test for one sided hypotheses.
Suppose the parameter set Θ is a subset of the real line and the probability distributions P_θ have p.d.f. or p.f. f(x|θ). Given a sample X = (X1, ..., Xn), the likelihood function (or joint p.d.f.) is given by

    f(X|θ) = Π_{i=1}^n f(Xi|θ).

Definition: The family {P_θ} has Monotone Likelihood Ratio (MLR) if there exists a statistic T(X) such that the likelihood ratio can be written as

    f(X|θ1)/f(X|θ2) = V(T(X), θ1, θ2)

and for θ1 > θ2 the function V(T, θ1, θ2) is strictly increasing in T.
Example. Consider a family of Bernoulli distributions {B(p) : p ∈ [0, 1]}, in which case the p.f. is given by

    f(x|p) = p^x (1 − p)^{1−x},

and for X = (X1, ..., Xn) the likelihood function is

    f(X|p) = p^{Σ Xi} (1 − p)^{n − Σ Xi}.

For p1 > p2 the likelihood ratio is

    f(X|p1)/f(X|p2) = ((1 − p1)/(1 − p2))^n ( p1(1 − p2)/(p2(1 − p1)) )^{Σ Xi},

and since

    p1(1 − p2)/(p2(1 − p1)) > 1,
it is strictly increasing in T(X) = Σ Xi, so MLR holds.

Example. Consider a family of normal distributions {N(μ, 1) : μ ∈ ℝ} with p.d.f.

    f(x|μ) = (1/√(2π)) e^{−(x−μ)²/2},

in which case the likelihood function is

    f(X|μ) = (1/(√(2π))^n) e^{−(1/2) Σ_{i=1}^n (Xi − μ)²}.

For μ1 > μ2 the likelihood ratio

    f(X|μ1)/f(X|μ2) = e^{(μ1−μ2) Σ Xi − n(μ1² − μ2²)/2}

is increasing in T(X) = Σ_{i=1}^n Xi, and MLR holds.
21.2 One sided hypotheses.

Consider θ0 ∈ Θ ⊆ ℝ and the hypotheses

    H1 : θ ≤ θ0 and H2 : θ > θ0,

which are called one sided hypotheses, because we hypothesize that the unknown parameter θ is on one side or the other side of the threshold θ0. We will show next that if MLR holds then for these hypotheses there exists a Uniformly Most Powerful test with level of significance α, i.e. in the class K_α.
Theorem. Suppose that we have Monotone Likelihood Ratio with T = T(X) and we consider the one-sided hypotheses above. For any level of significance α ∈ [0, 1], we can find c ∈ ℝ and p ∈ [0, 1] such that

    P_{θ0}(T(X) > c) + (1 − p) P_{θ0}(T(X) = c) = α.

Then the decision rule

    δ* = H1 : T < c
         H2 : T > c
         H1 or H2 : T = c

is a Uniformly Most Powerful test with level of significance α, where in the last case of T = c we randomly pick H1 with probability p and H2 with probability 1 − p.
Proof. We have to prove two things about this test δ*:

1. δ* ∈ K_α, i.e. δ* has level of significance α;

2. for any δ ∈ K_α, Π(δ*, θ) ≥ Π(δ, θ) for θ > θ0, i.e. δ* is more powerful on the second hypothesis than any other test from the class K_α.

To simplify our considerations below, let us assume that we don't need to randomize in δ*, i.e. we can take p = 1 so that

    P_{θ0}(T(X) > c) = α

and

    δ* = H1 : T ≤ c
         H2 : T > c.

To prove 1 we need to show that Π(δ*, θ) = P_θ(T > c) ≤ α for θ ≤ θ0. Let us for a second forget about composite hypotheses and for θ < θ0 consider two simple hypotheses:

    h1 : P = P_θ and h2 : P = P_{θ0}.

For these simple hypotheses let us find the most powerful test with error of type 1 equal to

    α1 := P_θ(T > c).
We know from the last lecture that if we can find b such that

    P_θ( f(X|θ)/f(X|θ0) < b ) = α1,
then the following test δ will be the most powerful test with error of type one equal to α1:

    δ = h1 : f(X|θ)/f(X|θ0) ≥ b
        h2 : f(X|θ)/f(X|θ0) < b.
This corresponds to the situation when we do not have to randomize. But the monotone likelihood ratio implies that

    f(X|θ)/f(X|θ0) < b ⟺ f(X|θ0)/f(X|θ) > 1/b ⟺ V(T, θ0, θ) > 1/b

and, since θ0 > θ, this last function V(T, θ0, θ) is strictly increasing in T. Therefore, we can solve this inequality for T (see figure 21.2) and get that it is equivalent to T > c_b for some c_b.

Figure 21.2: the strictly increasing function V(T, θ0, θ) crosses the level 1/b at the point c_b.
This means that the error of type 1 for the test δ can be written as

    α1 = P_θ( f(X|θ)/f(X|θ0) < b ) = P_θ(T > c_b).

But we chose this error to be equal to α1 = P_θ(T > c), which means that c_b should be such that

    P_θ(T > c_b) = P_θ(T > c) ⟹ c = c_b.

Therefore, the most powerful test with error of type one equal to α1 is

    δ = h1 : T ≤ c
        h2 : T > c,

which coincides with δ*, and its power is P_{θ0}(T > c).
Let us now compare this most powerful test δ with a completely random test

    δ^rand = h1 : with probability 1 − α1
             h2 : with probability α1,

which picks hypotheses completely randomly, regardless of the data. The error of type one for this test will be equal to

    P_θ(δ^rand = h2) = α1,

i.e. both tests δ and δ^rand have the same error of type one equal to α1. But since δ is the most powerful test, it has larger power than δ^rand. The power of δ is equal to

    P_{θ0}(T > c) = α,

while the power of δ^rand is, by the same computation, equal to α1 = P_θ(T > c). Therefore,

    P_θ(T > c) ≤ P_{θ0}(T > c) = α,

and we proved that for any θ ≤ θ0 the power function Π(δ*, θ) = P_θ(T > c) ≤ α, which proves 1.
Lecture 22
22.1 Uniformly most powerful test (continued).
It remains to prove the second part of the Theorem from the last lecture. Namely, we have to show that for any δ ∈ K_α,

    Π(δ*, θ) ≥ Π(δ, θ) for θ > θ0.

Let us take θ > θ0 and consider two simple hypotheses

    h1 : P = P_{θ0} and h2 : P = P_θ.

Let us find the most powerful test with error of type one equal to α. We know that if we can find a threshold b such that

    P_{θ0}( f(X|θ0)/f(X|θ) < b ) = α,
then the following test δ will be the most powerful test with error of type 1 equal to α:

    δ = h1 : f(X|θ0)/f(X|θ) ≥ b
        h2 : f(X|θ0)/f(X|θ) < b.
But the monotone likelihood ratio implies that

    f(X|θ0)/f(X|θ) < b ⟺ f(X|θ)/f(X|θ0) > 1/b ⟺ V(T, θ, θ0) > 1/b

and, since now θ > θ0, the function V(T, θ, θ0) is strictly increasing in T. Therefore, we can solve this inequality for T and get that it is equivalent to T > c_b for some c_b. This means that the error of type 1 for the test δ can be written as

    α = P_{θ0}( f(X|θ0)/f(X|θ) < b ) = P_{θ0}(T > c_b).
But we chose this error to be equal to α = P_{θ0}(T > c), which means that c_b should be such that

    P_{θ0}(T > c_b) = P_{θ0}(T > c) ⟹ c = c_b,

so the most powerful test again coincides with δ*:

    δ = h1 : T ≤ c
        h2 : T > c.

Since δ* is the most powerful test for h1 vs. h2 with error of type one α, and any δ ∈ K_α has error of type one at most α for h1, δ* has power at least that of δ, i.e. Π(δ*, θ) ≥ Π(δ, θ). Since θ > θ0 was arbitrary, this proves 2.
Example. Suppose X1, ..., Xn ~ N(θ, 1) and we test the one sided hypotheses H1: θ ≤ θ0 vs. H2: θ > θ0. We showed in the last lecture that MLR holds with T = Σ_{i=1}^n Xi, so the UMP test is

    δ* = H1 : Σ_{i=1}^n Xi ≤ c
         H2 : Σ_{i=1}^n Xi > c.

The threshold c is determined by

    α = P_{θ0}(T > c) = P_{θ0}( Σ Xi > c ).

Under P_{θ0} the random variable

    Y = (1/√n) Σ_{i=1}^n (Xi − θ0) ~ N(0, 1)

is standard normal. Therefore,

    α = P_{θ0}( Σ_{i=1}^n Xi > c ) = P( Y > (c − nθ0)/√n ).

If c_α is the point for which P(Y > c_α) = α for standard normal Y, then

    (c − nθ0)/√n = c_α or c = θ0 n + √n c_α.
Example. Suppose X1, ..., Xn ~ N(0, σ²) and we test the one sided hypotheses H1: σ ≤ σ0 vs. H2: σ > σ0. The likelihood ratio is

    f(X|σ2²)/f(X|σ1²) = (σ1/σ2)^n e^{−(1/2)(1/σ2² − 1/σ1²) Σ Xi²} = (σ1/σ2)^n e^{(1/2)(1/σ1² − 1/σ2²) T},

where T = Σ_{i=1}^n Xi². When σ2 > σ1 we have 1/σ1² − 1/σ2² > 0, so the likelihood ratio is increasing in T and, therefore, MLR holds. By the above Theorem, the UMP test exists and is given by

    δ* = H1 : T = Σ_{i=1}^n Xi² ≤ c
         H2 : T = Σ_{i=1}^n Xi² > c,
where the threshold c is determined by

    α = P_{σ0}( Σ_{i=1}^n Xi² > c ) = P_{σ0}( Σ_{i=1}^n (Xi/σ0)² > c/σ0² ).

Under P_{σ0} the random variables Xi/σ0 are i.i.d. standard normal, so Σ (Xi/σ0)² has χ²_n distribution, and c/σ0² can be found from the table of the χ²_n distribution.
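For instance, with n = 10 and α = 0.05 the threshold is c = σ0² · 18.307, where 18.307 is the standard table value of the 95th percentile of χ²_10. A simulation sketch checking that the resulting test has type 1 error close to α at σ = σ0 (σ0, seed and trial count are arbitrary choices):

```python
import random

random.seed(5)
n, sigma0 = 10, 1.5
chi2_10_95 = 18.307            # table value: 95th percentile of chi^2_10
c = sigma0 ** 2 * chi2_10_95   # UMP test rejects H1 when T = sum X_i^2 > c

trials = 200_000
rej = 0
for _ in range(trials):
    t = sum(random.gauss(0.0, sigma0) ** 2 for _ in range(n))
    if t > c:
        rej += 1

alpha_hat = rej / trials
print(round(alpha_hat, 4))  # type 1 error at sigma = sigma0, near 0.05
```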
Lecture 23
23.1
Pearson's theorem.
Today we will prove one result from probability that will be useful in several statistical
tests.
Let us consider r boxes B1, ..., Br as in figure 23.1.
Figure 23.1: r boxes B1, B2, ..., Br.
Assume that we throw n balls X1, ..., Xn into these boxes randomly and independently of each other with probabilities

    P(Xi ∈ B1) = p1, ..., P(Xi ∈ Br) = pr,

and let νj be the number of balls in the jth box,

    νj = Σ_{l=1}^n I(Xl ∈ Bj).
On average, the number of balls in the jth box will be npj, so the random variable νj should be close to npj. One can also use the Central Limit Theorem to describe how close νj is to npj. The next result tells us how we can describe, in some sense, the closeness of νj to npj simultaneously for all j ≤ r. The main difficulty in this theorem comes from the fact that the random variables νj for j ≤ r are not independent, for example, because the total number of balls is equal to n,

    ν1 + ... + νr = n,
i.e. if we know the counts in r − 1 boxes we will automatically know the count in the last box.
Theorem. We have that the random variable

    Σ_{j=1}^r (νj − npj)²/(npj) → χ²_{r−1}

converges in distribution to the χ²_{r−1} distribution with r − 1 degrees of freedom.

Proof. Let us fix j and consider the random variables

    I(X1 ∈ Bj), ..., I(Xn ∈ Bj)

that indicate whether each observation Xi is in the box Bj or not. They are i.i.d. with Bernoulli distribution B(pj) with probability of success

    E I(X1 ∈ Bj) = P(X1 ∈ Bj) = pj

and variance

    Var(I(X1 ∈ Bj)) = pj(1 − pj).
Therefore, by the Central Limit Theorem,

    (νj − npj)/√(npj(1 − pj)) = (Σ_{l=1}^n I(Xl ∈ Bj) − npj)/√(n Var) → N(0, 1),

and, rescaling,

    (νj − npj)/√(npj) → √(1 − pj) N(0, 1) = N(0, 1 − pj),

where the limiting random variable Zj ~ N(0, 1 − pj).
We know that each Zj has distribution N(0, 1 − pj) but, unfortunately, this does not tell us what the distribution of the sum Σ Zj² will be, because, as we mentioned above, the r.v.s νj are not independent and their correlation structure will play an important role. To compute the covariance between Zi and Zj let us first compute the covariance between

    (νi − npi)/√(npi) and (νj − npj)/√(npj),
which is equal to

    E[ (νi − npi)/√(npi) · (νj − npj)/√(npj) ] = (1/(n√(pi pj))) ( E νi νj − npi E νj − npj E νi + n² pi pj )
    = (1/(n√(pi pj))) ( E νi νj − npi npj − npj npi + n² pi pj ) = (1/(n√(pi pj))) ( E νi νj − n² pi pj ).
To compute E νi νj we will use the fact that one ball cannot be inside two different boxes simultaneously, which means that

    I(Xl ∈ Bi) I(Xl ∈ Bj) = 0 for i ≠ j.  (23.1)

Therefore,

    E νi νj = E Σ_{l=1}^n I(Xl ∈ Bi) Σ_{l′=1}^n I(Xl′ ∈ Bj)
            = E Σ_{l} I(Xl ∈ Bi) I(Xl ∈ Bj) + E Σ_{l ≠ l′} I(Xl ∈ Bi) I(Xl′ ∈ Bj)
            = 0 + n(n − 1) pi pj,

since for l ≠ l′ the balls are independent and E I(Xl ∈ Bi) I(Xl′ ∈ Bj) = pi pj. Therefore, the covariance above is equal to

    (1/(n√(pi pj))) ( n(n − 1) pi pj − n² pi pj ) = −√(pi pj).
To summarize, we showed that the random variable

    Σ_{j=1}^r (νj − npj)²/(npj) → Σ_{j=1}^r Zj²,

where the Zj ~ N(0, 1 − pj) are jointly Gaussian with covariance

    E Zi Zj = −√(pi pj) for i ≠ j.

To prove the Theorem it remains to show that this covariance structure of the sequence of Zi's will imply that the sum of squares Σ Zi² has distribution χ²_{r−1}. To show this we will find a different representation for Σ Zi².
Let g1, ..., gr be an i.i.d. standard normal sequence. Consider the two vectors

    g = (g1, ..., gr) and p = (√p1, ..., √pr),

and note that p has unit length, since Σ pl = 1.
We claim that the sequence

    Zj := gj − (g · p)√pj = gj − (Σ_{l=1}^r gl √pl) √pj, j ≤ r,  (23.2)

has exactly the same joint distribution as the limiting sequence Z1, ..., Zr above. Indeed, these Zj are jointly Gaussian with mean zero, and for i ≠ j the covariance is

    E ( gi − Σ_l gl √pl √pi )( gj − Σ_l gl √pl √pj )
    = −√(pi pj) − √(pj pi) + Σ_l pl √(pi pj) = −2√(pi pj) + √(pi pj) = −√(pi pj),

while the variance is

    E ( gi − Σ_l gl √pl √pi )² = 1 − 2pi + pi = 1 − pi.
This proves (23.2), which provides us with another way to formulate the convergence: namely, we have

    Σ_{j=1}^r ( (νj − npj)/√(npj) )² → Σ_{j=1}^r ( gj − (g · p)√pj )² = |g − (p · g)p|².

The vector V1 = (p · g)p is the projection of the vector g on the line along p and, therefore, the vector V2 = g − (p · g)p will be the projection of g onto the plane orthogonal to p, as shown in figures 23.2 and 23.3.
Let us consider a new orthonormal coordinate system with the last basis vector (last axis) equal to p. In this new coordinate system the vector g will have coordinates

    g′ = (g′1, ..., g′r) = gV,

where V is the orthogonal matrix of the change of coordinates. As we showed in Lecture 15, g′1, ..., g′r will also be i.i.d. standard normal. In the new coordinate system the projection V2 of g onto the plane orthogonal to p (the plane spanned by the first r − 1 basis vectors) has coordinates (g′1, ..., g′_{r−1}, 0) and, therefore, its squared length is

    |V2|² = (g′1)² + ... + (g′_{r−1})².

Figures 23.2 and 23.3: the projections V1, V2 of g onto the line along p and onto the plane orthogonal to p, and the rotation to the new coordinate system with last axis along p.
Therefore, Σ_{j=1}^r Zj² has the same distribution as (g′1)² + ... + (g′_{r−1})². But this last sum, by definition, has χ²_{r−1} distribution since g′1, ..., g′_{r−1} are i.i.d. standard normal. This finishes the proof of the Theorem.
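Pearson's theorem can be illustrated numerically: the statistic Σ (νj − npj)²/(npj) should behave like χ²_{r−1}; in particular, its expectation is exactly r − 1 for every n. A sketch with r = 4 boxes (the probabilities, sample sizes and seed are arbitrary choices):

```python
import random

random.seed(6)
r, n, trials = 4, 200, 5_000
p = [0.1, 0.2, 0.3, 0.4]

stats = []
for _ in range(trials):
    counts = [0] * r
    for _ in range(n):
        # sample a box index from the discrete distribution p
        u, j, acc = random.random(), 0, p[0]
        while u > acc:
            j += 1
            acc += p[j]
        counts[j] += 1
    stats.append(sum((counts[j] - n * p[j]) ** 2 / (n * p[j]) for j in range(r)))

mean = sum(stats) / trials
print(round(mean, 2))  # chi^2_{r-1} has mean r - 1 = 3
```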
Lecture 24
24.1 Goodness-of-fit test.

Suppose that each observation can take r fixed values B1, ..., Br with unknown probabilities p1, ..., pr, and we want to test whether these probabilities are equal to particular candidates p1°, ..., pr°:

    H1 : pi = pi° for all i = 1, ..., r,
    H2 : otherwise, i.e. for some i, pi ≠ pi°.
If the first hypothesis is true then the main result from the previous lecture tells us that we have the following convergence in distribution:

    T = Σ_{i=1}^r (νi − npi°)²/(npi°) → χ²_{r−1},

where νi = #{Xj : Xj = Bi}. On the other hand, if H2 holds then for some index i, pi ≠ pi° and the statistic T will behave very differently. If pi is the true probability P(X1 = Bi) then by the CLT (see previous lecture)

    (νi − npi)/√(npi) → N(0, 1 − pi).

If we write

    (νi − npi°)/√(npi°) = (νi − npi + n(pi − pi°))/√(npi°) = (νi − npi)/√(npi°) + √n (pi − pi°)/√pi°,

then the first term converges in distribution to a centered normal, but the second term converges to plus or minus infinity since pi ≠ pi°. Therefore,

    (νi − npi°)²/(npi°) → +∞,

and the statistic T → +∞ under H2. It is therefore natural to consider the decision rule
    δ = H1 : T ≤ c
        H2 : T > c,
i.e. we suspect that the first hypothesis H1 fails if T becomes unusually large. We can decide what is unusually large, i.e. how to choose the threshold c, by fixing the error of type 1 to be equal to the level of significance α:

    α = P1(δ ≠ H1) = P1(T > c) ≈ χ²_{r−1}((c, ∞)),

since under the first hypothesis the distribution of T can be approximated by the χ²_{r−1} distribution. Therefore, we find c from the table of the χ²_{r−1} distribution such that α = χ²_{r−1}((c, ∞)). This test is called the χ² goodness-of-fit test.
Example. Suppose that we have a sample of 189 observations that can take three
values A, B and C with some unknown probabilities p1 , p2 and p3 and the counts are
given by

     A   B   C   Total
    58  64  67    189
We want to test the hypothesis H1 that this distribution is uniform, i.e. p1 = p2 = p3 = 1/3. Suppose that the level of significance is chosen to be α = 0.05. Then the threshold c in the χ² test

    δ = H1 : T ≤ c
        H2 : T > c
can be found from the condition that χ²₂((c, ∞)) = 0.05, since here r − 1 = 2, and from the table of the χ²₂ distribution with two degrees of freedom we find that c = 5.9. In our case the expected counts are npi° = 189/3 = 63, so

    T = (58 − 63)²/63 + (64 − 63)²/63 + (67 − 63)²/63 = 42/63 ≈ 0.67.

Since T ≤ c = 5.9, we accept the hypothesis H1 at the level of significance 0.05.
24.2 Goodness-of-fit for general distributions.

A similar approach can be used to test the hypothesis that the distribution of the data is equal to some particular distribution, in the case when the observations do not necessarily take a finite number of fixed values, as was the case in the last section. Let X1, ..., Xn be a sample from an unknown distribution P and consider the following hypotheses:

    H1 : P = P0
    H2 : P ≠ P0

for some particular, specified distribution P0. To use the result from the previous lecture we will discretize the set of possible values of the X's by splitting it into a finite number of intervals I1, ..., Ir, as shown in figure 24.2. If the first hypothesis H1 holds then the probability that X comes from the jth interval is equal to

    P(X ∈ Ij) = P0(X ∈ Ij) = pj°,

and instead of testing H1 vs. H2 we will consider the following weaker hypotheses:

    H1′ : P(X ∈ Ij) = pj° for all j ≤ r
    H2′ : otherwise.

Asking whether H1′ holds is, of course, a weaker question than asking whether H1 holds, because H1 implies H1′ but not the other way around: there are many distributions different from P0 that have the same probabilities of the intervals I1, ..., Ir as P0. Later on in the course we will look at another way to test the hypothesis H1 in a more consistent way (the Kolmogorov-Smirnov test), but for now we will use the χ² convergence result from the previous lecture and test the derived hypothesis H1′. Of course, we are back to the case of categorical data from the previous section and we can simply use the χ² goodness-of-fit test above.
The rule of thumb about how to split into subintervals I1, ..., Ir is to have the expected count in each subinterval be at least 5:

    npi° = n P0(X ∈ Ii) ≥ 5.
97
LECTURE 24.
P0
R
I1
I2
I3
Ir
p1o
p2o
o
p3
pro
= N (3.912, 0.25)
H1 :
H2 : otherwise
0.25
0.25
0.25
0.25
I1
I2
I3
I4
3.912
LECTURE 24.
98
Lecture 25
25.1 Goodness-of-fit for distribution families.

Suppose that each observation takes a finite number of values B1, ..., Br with unknown probabilities

    p1 = P(X = B1), ..., pr = P(X = Br),

and suppose that we want to test the hypothesis that this distribution comes from a parametric family {P_θ : θ ∈ Θ}. In other words, if we denote pj(θ) = P_θ(X = Bj), we want to test:

    H1 : pj = pj(θ) for all j ≤ r, for some θ ∈ Θ
    H2 : otherwise.

If we wanted to test H1 for one particular fixed θ we could use the statistic

    T = Σ_{j=1}^r (νj − npj(θ))²/(npj(θ))
and use the simple χ² test from the last lecture. The situation now is more complicated because we want to test whether pj = pj(θ), j ≤ r, at least for some θ ∈ Θ, which means that we have many candidates for θ. One way to approach this problem is as follows.

(Step 1) Assuming that hypothesis H1 holds, i.e. P = P_θ for some θ ∈ Θ, we can find an estimate θ* of this unknown θ, and then

(Step 2) try to test whether indeed the distribution P is equal to P_{θ*} by using the statistic

    T = Σ_{j=1}^r (νj − npj(θ*))²/(npj(θ*))

in the χ² test.
This approach looks natural; the only question is what estimate θ* to use and how the fact that θ* also depends on the data will affect the convergence of T. It turns out that if we let θ* be the maximum likelihood estimate, i.e. the θ that maximizes the likelihood function

    φ(θ) = p1(θ)^{ν1} ... pr(θ)^{νr},

then the statistic

    T = Σ_{j=1}^r (νj − npj(θ*))²/(npj(θ*)) → χ²_{r−s−1}

converges to the χ²_{r−s−1} distribution with r − s − 1 degrees of freedom, where s is the dimension of the parameter set Θ.
(Of course, we need s ≤ r − 2 so that there is at least one degree of freedom.) For example, let us consider the family of all distributions on the set {0, 1, 2}. The distribution

    P(X = 0) = p1, P(X = 1) = p2, P(X = 2) = p3

is described by s = 2 free parameters, for example p1 and p2, since p3 = 1 − p1 − p2; here r = 3 and r − s − 1 = 0, so the family is so rich that it fits any data exactly and there is nothing to test.
Example. (Textbook, p. 545.) Suppose that a gene has two possible alleles A1 and A2 and the combinations of these alleles define three possible genotypes A1A1, A1A2 and A2A2. We want to test a theory that, for some θ ∈ [0, 1],

    P(A1A1) = θ², P(A1A2) = 2θ(1 − θ), P(A2A2) = (1 − θ)².  (25.1)

Suppose that, given the sample X1, ..., Xn of the population, the counts of each genotype are ν1, ν2 and ν3. To test the theory we want to test the hypothesis that the probabilities are given by (25.1) for some θ. Maximizing the likelihood function φ(θ) = (θ²)^{ν1}(2θ(1−θ))^{ν2}((1−θ)²)^{ν3} gives the MLE

    θ* = (2ν1 + ν2)/(2n).
Since $r = 3$ and $s = 1$, the statistic $T$ converges to $\chi^2_1$, and at the level of significance $\alpha = 0.05$ the decision rule is
\[ \delta = \begin{cases} H_1 : & T \le c = 3.841 \\ H_2 : & T > c = 3.841. \end{cases} \]
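The whole procedure for this example can be sketched in code (the counts below are hypothetical, not the textbook's data): estimate $\theta$ by its MLE, form $T$, and compare it with the $\chi^2_1$ threshold $3.841$.

```python
def genotype_chi2(nu1, nu2, nu3):
    """Chi-square statistic T for the genotype theory (25.1), with theta
    replaced by its MLE theta* = (2*nu1 + nu2) / (2n)."""
    n = nu1 + nu2 + nu3
    t = (2 * nu1 + nu2) / (2 * n)              # MLE of theta
    probs = [t * t, 2 * t * (1 - t), (1 - t) ** 2]
    return sum((nu - n * p) ** 2 / (n * p)
               for nu, p in zip([nu1, nu2, nu3], probs))

# Hypothetical counts; r = 3, s = 1, so the threshold is the
# chi^2_1 0.95-quantile c = 3.841.
T = genotype_chi2(10, 53, 46)
decision = "H1" if T <= 3.841 else "H2"
```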
General families.
We could use a similar test when the distributions $\mathbb{P}$, $\mathbb{P}_\theta$ are not necessarily supported by a finite number of points $B_1, \ldots, B_r$ (for example, continuous distributions). In this case, if we want to test the hypotheses
\[ H_1 : \mathbb{P} = \mathbb{P}_\theta \text{ for some } \theta \in \Theta; \qquad H_2 : \text{otherwise}, \]
we can discretize them as we did in the last lecture (see figure 25.2), i.e. consider a family of distributions
\[ p_j(\theta) = \mathbb{P}_\theta(X \in I_j) \text{ for } j \le r, \]
and test
\[ H_1 : p_j = p_j(\theta) \text{ for some } \theta \in \Theta,\ j = 1, \ldots, r; \qquad H_2 : \text{otherwise}. \]
(Figure 25.2: the real line split into intervals $I_1, I_2, \ldots, I_r$.)
Lecture 26
26.1 Test of independence.
In this lecture we will consider the situation when the data comes from the sample space $\mathcal{X}$ that consists of pairs of two features, each feature having a finite number of categories or, simply,
\[ \mathcal{X} = \{(i, j) : i = 1, \ldots, a,\ j = 1, \ldots, b\}. \]
If we have an i.i.d. sample $X_1, \ldots, X_n$ with some distribution $\mathbb{P}$ on $\mathcal{X}$ then each $X_i$ is a pair $(X_i^1, X_i^2)$ where $X_i^1$ can take $a$ different values and $X_i^2$ can take $b$ different values. Let $N_{ij}$ be the count of all observations equal to $(i, j)$, i.e. with the first feature equal to $i$ and the second feature equal to $j$, as shown in the table below.
\[ \begin{array}{c|cccc} & 1 & 2 & \cdots & b \\ \hline 1 & N_{11} & N_{12} & \cdots & N_{1b} \\ \vdots & & & & \\ a & N_{a1} & N_{a2} & \cdots & N_{ab} \end{array} \]
We would like to test the independence of the two features, which means that
\[ \mathbb{P}(X = (i, j)) = \mathbb{P}(X^1 = i)\,\mathbb{P}(X^2 = j). \]
If we introduce the notations
\[ \mathbb{P}(X = (i, j)) = \theta_{ij}, \quad \mathbb{P}(X^1 = i) = p_i \quad \text{and} \quad \mathbb{P}(X^2 = j) = q_j, \]
then we want to test that for all $i$ and $j$ we have $\theta_{ij} = p_i q_j$. Therefore, our hypotheses can be formulated as follows:
\[ H_1 : \theta_{ij} = p_i q_j \text{ for some } (p_i)_{i \le a},\ (q_j)_{j \le b}; \qquad H_2 : \text{otherwise}. \]
If $p_i^*$ and $q_j^*$ denote the maximum likelihood estimates, then by the result of the previous lecture the statistic
\[ T = \sum_{i,j} \frac{(N_{ij} - n p_i^* q_j^*)^2}{n p_i^* q_j^*} \to \chi^2_{r-s-1} = \chi^2_{ab-(a-1)-(b-1)-1} = \chi^2_{(a-1)(b-1)}, \]
since here $r = ab$ and the free parameters $(p_i)$ and $(q_j)$ have dimension $s = (a-1) + (b-1)$.
To find the maximum likelihood estimates we have to maximize the likelihood function
\[ \varphi(p, q) = \prod_{i,j} (p_i q_j)^{N_{ij}} = \prod_i p_i^{N_{i+}} \prod_j q_j^{N_{+j}}, \]
where we denoted
\[ N_{i+} = \sum_j N_{ij} \]
for the total number of observations in the $i$th row or, in other words, the number of observations with the first feature equal to $i$, and
\[ N_{+j} = \sum_i N_{ij} \]
for the total number of observations in the $j$th column or, in other words, the number of observations with the second feature equal to $j$. Since the $p_i$s and $q_j$s are not related to each other, it is obvious that maximizing the likelihood function above is equivalent to maximizing $\prod_i p_i^{N_{i+}}$ and $\prod_j q_j^{N_{+j}}$ separately. Let us not forget that we maximize under the constraints that the $p_i$s and the $q_j$s each add up to $1$ (otherwise, we could let them be equal to $+\infty$). Let us solve, for example, the following optimization problem:
\[ \text{maximize } \prod_{i=1}^{a} p_i^{N_{i+}} \quad \text{given that} \quad \sum_{i=1}^{a} p_i = 1. \]
Using a Lagrange multiplier, we maximize
\[ L = \sum_{i=1}^{a} N_{i+} \log p_i - \lambda \Big( \sum_{i=1}^{a} p_i - 1 \Big). \]
Setting $\partial L / \partial p_i = N_{i+}/p_i - \lambda = 0$ gives $p_i = N_{i+}/\lambda$, and the constraint $\sum_{i=1}^{a} p_i = 1$ then gives $\lambda = \sum_i N_{i+} = n$. Combining these two conditions we get
\[ p_i^* = \frac{N_{i+}}{n} \quad \text{and, similarly,} \quad q_j^* = \frac{N_{+j}}{n}. \]
Therefore, the statistic becomes
\[ T = \sum_{i,j} \frac{(N_{ij} - N_{i+}N_{+j}/n)^2}{N_{i+}N_{+j}/n} \]
and the decision rule is
\[ \delta = \begin{cases} H_1 : & T \le c \\ H_2 : & T > c. \end{cases} \]
For example, for a table with $n = 189$ observations,
\[ T = \frac{(20 - \frac{47 \cdot 58}{189})^2}{\frac{47 \cdot 58}{189}} + \ldots + \frac{(23 - \frac{67 \cdot 59}{189})^2}{\frac{67 \cdot 59}{189}} = 5.21, \]
and the threshold $c$ is determined from the condition $\chi^2_{(a-1)(b-1)}(c, +\infty) = \alpha$.
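The whole computation is mechanical; a minimal sketch with a hypothetical $2 \times 2$ table (not the lecture's data):

```python
def independence_chi2(N):
    """T = sum_{i,j} (N_ij - N_i+ N_+j / n)^2 / (N_i+ N_+j / n)
    for a contingency table N given as a list of rows."""
    a, b = len(N), len(N[0])
    row = [sum(N[i]) for i in range(a)]                       # N_{i+}
    col = [sum(N[i][j] for i in range(a)) for j in range(b)]  # N_{+j}
    n = sum(row)
    T = sum((N[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
            for i in range(a) for j in range(b))
    return T, (a - 1) * (b - 1)   # statistic and degrees of freedom

# Hypothetical table; compare T with the chi^2_{(a-1)(b-1)} threshold.
T, df = independence_chi2([[20, 27], [35, 30]])
```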
Lecture 27
27.1 Test of homogeneity.
Suppose that the population is divided into $R$ groups and each group (or the entire population) is divided into $C$ categories. We would like to test whether the distribution of categories in each group is the same.
\[ \begin{array}{c|ccc|c} & \text{Category } 1 & \cdots & \text{Category } C & \\ \hline \text{Group } 1 & N_{11} & \cdots & N_{1C} & N_{1+} \\ \vdots & & \ddots & & \vdots \\ \text{Group } R & N_{R1} & \cdots & N_{RC} & N_{R+} \\ \hline & N_{+1} & \cdots & N_{+C} & n \end{array} \]
Table of homogeneity: $N_{ij}$ counts the observations in group $i$ and category $j$; $N_{i+}$ and $N_{+j}$ are the row and column totals, and $n$ is the total number of observations.
If we denote
\[ p_{ij} = \mathbb{P}(\text{Category}_j \,|\, \text{Group}_i), \quad \text{so that } \sum_{j=1}^{C} p_{ij} = 1 \text{ for each group } i, \]
then homogeneity means that these conditional distributions do not depend on the group, i.e.
\[ \mathbb{P}(\text{Category}_j \,|\, \text{Group}_i) = \mathbb{P}(\text{Category}_j). \]
If this holds, then we have
\[ \mathbb{P}(\text{Group}_i, \text{Category}_j) = \mathbb{P}(\text{Category}_j \,|\, \text{Group}_i)\,\mathbb{P}(\text{Group}_i) = \mathbb{P}(\text{Category}_j)\,\mathbb{P}(\text{Group}_i), \]
which means the groups and categories are independent. Alternatively, if we have independence:
\[ \mathbb{P}(\text{Category}_j \,|\, \text{Group}_i) = \frac{\mathbb{P}(\text{Group}_i, \text{Category}_j)}{\mathbb{P}(\text{Group}_i)} = \frac{\mathbb{P}(\text{Category}_j)\,\mathbb{P}(\text{Group}_i)}{\mathbb{P}(\text{Group}_i)} = \mathbb{P}(\text{Category}_j), \]
which is homogeneity. This means that to test homogeneity we can use the independence test from the previous lecture.
Interestingly, the same test can be used in the case when the sampling is done not from the entire population but from each group separately, which means that we decide a priori about the sample sizes in each group: $N_{1+}, \ldots, N_{R+}$. When we sample from the entire population these numbers are random and by the LLN $N_{i+}/n$ will approximate the probability $\mathbb{P}(\text{Group}_i)$, i.e. $N_{i+}$ reflects the proportion of group $i$ in the population. When we pick these numbers a priori, one can simply think that we artificially renormalize the proportion of each group in the population and test for homogeneity among groups as independence in this new artificial population. Another way to argue that the test will be the same is as follows.
Assume that
\[ \mathbb{P}(\text{Category}_j \,|\, \text{Group}_i) = p_j, \]
where the probabilities $p_j$ are all given. Then, by Pearson's theorem, for each group $i$ we have the convergence in distribution
\[ \sum_{j=1}^{C} \frac{(N_{ij} - N_{i+} p_j)^2}{N_{i+} p_j} \to \chi^2_{C-1}, \]
and summing over all $R$ groups,
\[ \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(N_{ij} - N_{i+} p_j)^2}{N_{i+} p_j} \to \chi^2_{R(C-1)}, \]
since the samples in different groups are independent. If now we assume that the probabilities $p_1, \ldots, p_C$ are unknown and we use the maximum likelihood estimates $p_j^* = N_{+j}/n$ instead, then
\[ \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(N_{ij} - N_{i+} N_{+j}/n)^2}{N_{i+} N_{+j}/n} \to \chi^2_{R(C-1)-(C-1)} = \chi^2_{(R-1)(C-1)}, \]
which is exactly the statistic and the degrees of freedom of the independence test.
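As a sketch (with hypothetical counts), the homogeneity statistic is computed exactly like the independence statistic of the previous lecture:

```python
def homogeneity_chi2(N):
    """Statistic sum_{i,j} (N_ij - N_i+ N_+j/n)^2 / (N_i+ N_+j/n) with
    (R-1)(C-1) degrees of freedom, identical to the independence test."""
    R, C = len(N), len(N[0])
    row = [sum(r) for r in N]                                 # N_{i+}
    col = [sum(N[i][j] for i in range(R)) for j in range(C)]  # N_{+j}
    n = sum(row)
    T = sum((N[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
            for i in range(R) for j in range(C))
    return T, (R - 1) * (C - 1)

# Hypothetical: 3 groups sampled separately (fixed row totals),
# classified into 2 categories each.
T, df = homogeneity_chi2([[12, 18], [15, 15], [10, 20]])
```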
Lecture 28
28.1 Kolmogorov-Smirnov test.
Suppose that we have an i.i.d. sample $X_1, \ldots, X_n$ with unknown distribution $\mathbb{P}$ and we want to test the hypotheses
\[ H_1 : \mathbb{P} = \mathbb{P}_0 \text{ for a given } \mathbb{P}_0; \qquad H_2 : \text{otherwise}. \]
We considered this problem before when we talked about the goodness-of-fit test for a continuous distribution but, in order to use Pearson's theorem and the chi-square test, we discretized the distribution and considered a weaker derivative hypothesis. We will now consider a different test due to Kolmogorov and Smirnov that avoids this discretization and in a sense is more consistent.
Let us denote by $F(x) = \mathbb{P}(X_1 \le x)$ the cumulative distribution function and consider what is called the empirical distribution function:
\[ F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x), \]
that is simply the proportion of the sample points below level $x$. For any fixed point $x$ the law of large numbers gives that
\[ F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x) \to \mathbb{E}\, I(X_1 \le x) = \mathbb{P}(X_1 \le x) = F(x), \]
i.e. the proportion of the sample in the set $(-\infty, x]$ approximates the probability of this set.
It is easy to show from here that this approximation holds uniformly over all $x \in \mathbb{R}$:
\[ \sup_{x} |F_n(x) - F(x)| \to 0. \]
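This uniform convergence can be illustrated numerically; a minimal sketch, assuming the true distribution is uniform on $[0, 1]$ so that $F(x) = x$, using only the standard library:

```python
import random

def ecdf_sup_dev(sample):
    """sup_x |F_n(x) - F(x)| for F(x) = x on [0, 1]; for a monotone F
    the supremum is attained at a data point, from the left or right."""
    xs = sorted(sample)
    n = len(xs)
    return max(max(abs((i + 1) / n - x), abs(i / n - x))
               for i, x in enumerate(xs))

random.seed(0)
small_n = ecdf_sup_dev([random.random() for _ in range(50)])
large_n = ecdf_sup_dev([random.random() for _ in range(5000)])
# The deviation shrinks as n grows (roughly like 1/sqrt(n)).
```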
(Figure: the c.d.f. $Y = F(X)$, increasing from $0$ to $1$; a level $y \in [0, 1]$ corresponds to the point $x = F^{-1}(y)$.)
To see that the distribution of this supremum does not depend on $F$, note that for $0 \le y \le 1$, substituting $x = F^{-1}(y)$, we can write
\[ F_n(F^{-1}(y)) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le F^{-1}(y)) = \frac{1}{n} \sum_{i=1}^{n} I(F(X_i) \le y) \]
and, therefore,
\[ \mathbb{P}\Big( \sup_{0 \le y \le 1} |F_n(F^{-1}(y)) - y| \le t \Big) = \mathbb{P}\bigg( \sup_{0 \le y \le 1} \Big| \frac{1}{n} \sum_{i=1}^{n} I(F(X_i) \le y) - y \Big| \le t \bigg). \]
The distribution of $F(X_i)$ is uniform on the interval $[0, 1]$ because the c.d.f. of $F(X_1)$ is
\[ \mathbb{P}(F(X_1) \le t) = \mathbb{P}(X_1 \le F^{-1}(t)) = F(F^{-1}(t)) = t. \]
Therefore, the random variables $U_i = F(X_i)$ for $i \le n$ are independent and have uniform distribution on $[0, 1]$ and, combining with the above, we proved that
\[ \mathbb{P}\Big( \sup_x |F_n(x) - F(x)| \le t \Big) = \mathbb{P}\bigg( \sup_{0 \le y \le 1} \Big| \frac{1}{n} \sum_{i=1}^{n} I(U_i \le y) - y \Big| \le t \bigg), \]
where the right-hand side does not depend on $F$; this is the statement of Theorem 1.
Next, we will formulate the main result on which the KS test is based. First of all, let us note that for a fixed $x$ the CLT implies that
\[ \sqrt{n}\,(F_n(x) - F(x)) \to N\big(0, F(x)(1 - F(x))\big), \]
because $F(x)(1 - F(x))$ is the variance of $I(X_1 \le x)$. It turns out that if we consider the supremum over all $x$ then (Theorem 2)
\[ \mathbb{P}\Big( \sqrt{n}\, \sup_x |F_n(x) - F(x)| \le t \Big) \to H(t) = 1 - 2 \sum_{i=1}^{\infty} (-1)^{i-1} e^{-2 i^2 t^2}, \]
where $H(t)$ is the c.d.f. of the Kolmogorov-Smirnov distribution.
If we want to test the hypotheses
\[ H_1 : F = F_0 \text{ for a given } F_0; \qquad H_2 : \text{otherwise}, \]
then based on Theorems 1 and 2 the Kolmogorov-Smirnov test is formulated as follows:
\[ \delta = \begin{cases} H_1 : & D_n \le c \\ H_2 : & D_n > c, \end{cases} \]
where
\[ D_n = \sqrt{n}\, \sup_x |F_n(x) - F_0(x)| \]
and the threshold $c$ depends on the level of significance $\alpha$ and can be found from the condition
\[ \alpha = \mathbb{P}(\delta \ne H_1 \,|\, H_1) = \mathbb{P}(D_n > c \,|\, H_1). \]
In Theorem 1 we showed that the distribution of $D_n$ does not depend on the unknown distribution $F$ and, therefore, it can be tabulated. However, the distribution of $D_n$ depends on $n$, so one needs to use advanced tables that contain the table for the sample size $n$ of interest. Another way to find $c$, especially when the sample size is large, is to use Theorem 2, which tells us that the distribution of $D_n$ can be approximated by the Kolmogorov-Smirnov distribution and, therefore,
\[ \alpha = \mathbb{P}(D_n > c \,|\, H_1) \approx 1 - H(c). \]
Example. Consider a sample of size $n = 10$ and suppose that we want to test the hypotheses
\[ H_1 : F(x) = F_0(x) = x; \qquad H_2 : \text{otherwise}. \]
(Figure: the empirical c.d.f. $F_n$ of the sample plotted against the line $F_0(x) = x$.) The largest value of $|F_n(x) - F_0(x)|$ will be achieved at $|0.9 - 0.64| = 0.26$ and, therefore,
\[ D_n = \sqrt{10} \times 0.26 \approx 0.82. \]
At the level of significance $\alpha = 0.05$ the threshold is $c = 1.35$, so the decision rule
\[ \delta = \begin{cases} H_1 : & D_n \le 1.35 \\ H_2 : & D_n > 1.35 \end{cases} \]
accepts $H_1$, since $0.82 \le 1.35$.
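A sketch of this computation, using a hypothetical sample of size 10 (not necessarily the one in the figure) and the same threshold $c = 1.35$:

```python
import math

def ks_statistic(sample, F0):
    """D_n = sqrt(n) * sup_x |F_n(x) - F0(x)|; for a nondecreasing F0
    the sup is attained at a sample point, from the left or the right."""
    xs = sorted(sample)
    n = len(xs)
    sup = max(max(abs((i + 1) / n - F0(x)), abs(i / n - F0(x)))
              for i, x in enumerate(xs))
    return math.sqrt(n) * sup

# Hypothetical sample, testing H1: F(x) = x on [0, 1].
data = [0.58, 0.42, 0.52, 0.33, 0.43, 0.23, 0.58, 0.76, 0.53, 0.69]
Dn = ks_statistic(data, lambda x: x)
decision = "H1" if Dn <= 1.35 else "H2"
```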
Lecture 29
29.1 Method of least squares.

Suppose that we observe pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ and want to describe the relationship between $Y$ and $X$ by a line $Y = \beta_0 + \beta_1 X$. (Figure: scatter plot of the data points with a candidate line.) Consider the loss
\[ L = \sum_{i=1}^{n} \big( \underbrace{Y_i}_{\text{actual}} - \underbrace{(\beta_0 + \beta_1 X_i)}_{\text{estimate}} \big)^2 \]
and we want to minimize it over all choices of parameters $\beta_0, \beta_1$. The line that minimizes this loss is called the least-squares line. To find the critical points we write:
\[ \frac{\partial L}{\partial \beta_0} = -\sum_{i=1}^{n} 2\big(Y_i - (\beta_0 + \beta_1 X_i)\big) = 0 \]
\[ \frac{\partial L}{\partial \beta_1} = -\sum_{i=1}^{n} 2\big(Y_i - (\beta_0 + \beta_1 X_i)\big) X_i = 0 \]
If we introduce the notations
\[ \bar{X} = \frac{1}{n} \sum X_i, \quad \bar{Y} = \frac{1}{n} \sum Y_i, \quad \overline{X^2} = \frac{1}{n} \sum X_i^2, \quad \overline{XY} = \frac{1}{n} \sum X_i Y_i, \]
then the critical point conditions can be rewritten as
\[ \beta_0 + \beta_1 \bar{X} = \bar{Y} \quad \text{and} \quad \beta_0 \bar{X} + \beta_1 \overline{X^2} = \overline{XY}, \]
and solving for $\beta_0$ and $\beta_1$ we get
\[ \hat{\beta}_1 = \frac{\overline{XY} - \bar{X}\bar{Y}}{\overline{X^2} - \bar{X}^2} \quad \text{and} \quad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}. \]
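These closed-form expressions are easy to code; a minimal sketch (the data points below are hypothetical):

```python
def least_squares_line(X, Y):
    """beta1 = (mean(XY) - mean(X)*mean(Y)) / (mean(X^2) - mean(X)^2),
    beta0 = mean(Y) - beta1 * mean(X)."""
    n = len(X)
    mx = sum(X) / n
    my = sum(Y) / n
    mxx = sum(x * x for x in X) / n
    mxy = sum(x * y for x, y in zip(X, Y)) / n
    b1 = (mxy - mx * my) / (mxx - mx * mx)
    b0 = my - b1 * mx
    return b0, b1

# Points lying exactly on Y = 1 + 2X should be recovered exactly.
b0, b1 = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
```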
One can similarly fit a polynomial $\beta_0 + \beta_1 x + \cdots + \beta_k x^k$ by taking the derivatives and solving the resulting system of linear equations to find the parameters $\beta_0, \ldots, \beta_k$.
29.2 Simple linear regression.
First of all, when the response variable $Y$ in a random couple $(X, Y)$ is predicted as a function of $X$, one can model this situation by
\[ Y = f(X) + \varepsilon, \]
where the random variable $\varepsilon$ is independent of $X$ (it is often called random noise) and on average it is equal to zero: $\mathbb{E}\varepsilon = 0$. For a fixed $X$, the response variable $Y$ in this model on average will be equal to $f(X)$ since
\[ \mathbb{E}(Y \,|\, X) = \mathbb{E}(f(X) + \varepsilon \,|\, X) = f(X). \]
In the simple linear regression model the function $f$ is linear,
\[ Y = f(X) + \varepsilon = \beta_0 + \beta_1 X + \varepsilon, \]
where the random noise $\varepsilon$ is assumed to have normal distribution $N(0, \sigma^2)$.
Suppose that we are given a sequence $(X_1, Y_1), \ldots, (X_n, Y_n)$ that is described by the above model:
\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \]
where $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. $N(0, \sigma^2)$. We have three unknown parameters, $\beta_0$, $\beta_1$ and $\sigma^2$, and we want to estimate them using the given sample. Let us think of the points $X_1, \ldots, X_n$ as fixed and non-random and deal with the randomness that comes from the noise variables $\varepsilon_i$. For a fixed $X_i$, the distribution of $Y_i$ is $N(f(X_i), \sigma^2)$ with p.d.f.
\[ f(y) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(y - f(X_i))^2}{2\sigma^2}}, \]
and the likelihood function of the sequence Y1 , . . . , Yn is:
1 n
1 n
Pn
1 Pn
2
12
(Yi f (Xi ))2
i=1
2
e
e 22 i=1 (Yi 0 1 Xi ) .
f (Y1 , . . . , Yn ) =
=
2
2
Let us find the maximum likelihood estimates of $\beta_0$, $\beta_1$ and $\sigma^2$ that maximize this likelihood function. First of all, it is obvious that for any $\sigma^2$ we need to minimize
\[ \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2 \]
over $\beta_0, \beta_1$, which is the same as finding the least-squares line and, therefore, the MLE for $\beta_0$ and $\beta_1$ are given by
\[ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \quad \text{and} \quad \hat{\beta}_1 = \frac{\overline{XY} - \bar{X}\bar{Y}}{\overline{X^2} - \bar{X}^2}. \]
Finally, maximizing over $\sigma^2$ gives
\[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2. \]
Let us now compute the joint distribution of $\hat{\beta}_0$ and $\hat{\beta}_1$. Since the $X_i$s are fixed, these estimates are linear combinations of the $Y_i$s, which have normal distributions, and, as a result, $\hat{\beta}_0$ and $\hat{\beta}_1$ will also have normal distributions. All we need to do is find their means, variances and covariance. First, we write $\hat{\beta}_1$ as
\[ \hat{\beta}_1 = \frac{\overline{XY} - \bar{X}\bar{Y}}{\overline{X^2} - \bar{X}^2} = \frac{\sum_i (X_i - \bar{X})\, Y_i}{n(\overline{X^2} - \bar{X}^2)}. \]
Its expectation is
\[ \mathbb{E}\hat{\beta}_1 = \frac{\sum_i (X_i - \bar{X})\, \mathbb{E}Y_i}{n(\overline{X^2} - \bar{X}^2)} = \frac{\sum_i (X_i - \bar{X})(\beta_0 + \beta_1 X_i)}{n(\overline{X^2} - \bar{X}^2)} = \beta_0 \underbrace{\frac{\sum_i (X_i - \bar{X})}{n(\overline{X^2} - \bar{X}^2)}}_{=0} + \beta_1 \frac{\sum_i X_i (X_i - \bar{X})}{n(\overline{X^2} - \bar{X}^2)} = \beta_1 \frac{n\overline{X^2} - n\bar{X}^2}{n(\overline{X^2} - \bar{X}^2)} = \beta_1. \]
Next, since the $Y_i$ are independent with $\mathrm{Var}(Y_i) = \sigma^2$,
\[ \mathrm{Var}(\hat{\beta}_1) = \mathrm{Var}\bigg( \frac{\sum_i (X_i - \bar{X})\, Y_i}{n(\overline{X^2} - \bar{X}^2)} \bigg) = \sum_i \bigg( \frac{X_i - \bar{X}}{n(\overline{X^2} - \bar{X}^2)} \bigg)^2 \sigma^2 = \frac{\sigma^2 \sum_i (X_i - \bar{X})^2}{n^2 (\overline{X^2} - \bar{X}^2)^2} = \frac{\sigma^2\, n(\overline{X^2} - \bar{X}^2)}{n^2 (\overline{X^2} - \bar{X}^2)^2} = \frac{\sigma^2}{n(\overline{X^2} - \bar{X}^2)}. \]
2
Therefore, 1 N 1 , n(X 2X 2 ) . A similar straightforward computations give:
and
2
X
N 0 , 1 +
0 = Y 1 X
2
2)
n n(X 2 X
Cov(0 , 1 ) =
2
X
.
2)
n(X 2 X
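A quick Monte Carlo sanity check of the variance formula; a sketch with hypothetical design points and parameters, where `random.gauss` plays the role of the normal noise:

```python
import random

random.seed(1)
X = [float(i) for i in range(10)]            # fixed design points
beta0, beta1, sigma = 2.0, 0.5, 1.0          # hypothetical true values
n = len(X)
mx = sum(X) / n
mxx = sum(x * x for x in X) / n
var_theory = sigma ** 2 / (n * (mxx - mx * mx))  # sigma^2/(n(X2bar - Xbar^2))

def fit_slope(Y):
    """Least-squares slope from the closed-form expression."""
    my = sum(Y) / n
    mxy = sum(x * y for x, y in zip(X, Y)) / n
    return (mxy - mx * my) / (mxx - mx * mx)

slopes = []
for _ in range(20000):
    Y = [beta0 + beta1 * x + random.gauss(0, sigma) for x in X]
    slopes.append(fit_slope(Y))
mean_slope = sum(slopes) / len(slopes)
var_emp = sum((s - mean_slope) ** 2 for s in slopes) / len(slopes)
```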
Lecture 30
30.1 Distribution of $\hat{\sigma}^2$.
In our last lecture we found the maximum likelihood estimates of the unknown parameters in the simple linear regression model and we found the joint distribution of $\hat{\beta}_0$ and $\hat{\beta}_1$. Our next goal is to describe the distribution of $\hat{\sigma}^2$. We will show the following:

1. $\hat{\sigma}^2$ is independent of $\hat{\beta}_0$ and $\hat{\beta}_1$.
2. $n\hat{\sigma}^2/\sigma^2$ has a $\chi^2_{n-2}$ distribution with $n - 2$ degrees of freedom.
Consider two vectors
\[ a_1 = (a_{11}, \ldots, a_{n1}) = \Big( \frac{1}{\sqrt{n}}, \ldots, \frac{1}{\sqrt{n}} \Big) \]
and
\[ a_2 = (a_{12}, \ldots, a_{n2}) \quad \text{where} \quad a_{i2} = \frac{X_i - \bar{X}}{\sqrt{n(\overline{X^2} - \bar{X}^2)}}. \]
It is easy to check that both vectors have length 1 and they are orthogonal to each
other since their scalar product is
\[ a_1 \cdot a_2 = \sum_{i=1}^{n} a_{i1} a_{i2} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{X_i - \bar{X}}{\sqrt{n(\overline{X^2} - \bar{X}^2)}} = 0. \]
Let us choose $a_1$ and $a_2$ as the first two columns of an orthogonal matrix
\[ A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}, \]
i.e. the columns $a_1, a_2, a_3, \ldots, a_n$ form an orthonormal basis (the first two columns can always be completed to such a basis).
Denote $\mu_i = \mathbb{E}Y_i = \beta_0 + \beta_1 X_i$ and let $Y = (Y_1, \ldots, Y_n)$, $\mu = (\mu_1, \ldots, \mu_n)$. Consider the standardized vector
\[ Y' = \frac{Y - \mu}{\sigma} = \Big( \frac{Y_1 - \mu_1}{\sigma}, \ldots, \frac{Y_n - \mu_n}{\sigma} \Big), \]
so that the random variables $Y_1', \ldots, Y_n'$ are i.i.d. standard normal. We proved before
that if we consider an orthogonal transformation of an i.i.d. standard normal sequence,
\[ Z = (Z_1, \ldots, Z_n) = Y' A, \]
then $Z_1, \ldots, Z_n$ will also be i.i.d. standard normal. Since
\[ Z = Y' A = \frac{Y - \mu}{\sigma}\, A = \frac{1}{\sigma}\,( YA - \mu A ), \]
this implies that
\[ YA = \sigma Z + \mu A. \]
Let us define the vector
\[ \hat{Z} = (\hat{Z}_1, \ldots, \hat{Z}_n) = YA = \sigma Z + \mu A. \]
Each $\hat{Z}_i$ is a linear combination of the $Y_i$s and, therefore, it has a normal distribution. Since we made a specific choice of the first two columns of the matrix $A$, we can write down explicitly the first two coordinates $\hat{Z}_1$ and $\hat{Z}_2$ of the vector $\hat{Z}$. We have
\[ \hat{Z}_1 = \sum_i a_{i1} Y_i = \frac{1}{\sqrt{n}} \sum_i Y_i = \sqrt{n}\, \bar{Y} = \sqrt{n}\, (\hat{\beta}_0 + \hat{\beta}_1 \bar{X}) \]
and
\[ \hat{Z}_2 = \sum_i a_{i2} Y_i = \frac{\sum_i (X_i - \bar{X})\, Y_i}{\sqrt{n(\overline{X^2} - \bar{X}^2)}} = \frac{n(\overline{X^2} - \bar{X}^2)\, \hat{\beta}_1}{\sqrt{n(\overline{X^2} - \bar{X}^2)}} = \sqrt{n(\overline{X^2} - \bar{X}^2)}\, \hat{\beta}_1. \]
Solving these two equations for $\hat{\beta}_0$ and $\hat{\beta}_1$ we can express them in terms of $\hat{Z}_1$ and $\hat{Z}_2$ as
\[ \hat{\beta}_1 = \frac{1}{\sqrt{n(\overline{X^2} - \bar{X}^2)}}\, \hat{Z}_2 \quad \text{and} \quad \hat{\beta}_0 = \frac{1}{\sqrt{n}}\, \hat{Z}_1 - \frac{\bar{X}}{\sqrt{n(\overline{X^2} - \bar{X}^2)}}\, \hat{Z}_2. \]
Next, let us rewrite $n\hat{\sigma}^2$:
\[ n\hat{\sigma}^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2 = \sum_{i=1}^{n} \big( (Y_i - \bar{Y}) - \hat{\beta}_1 (X_i - \bar{X}) \big)^2 \quad \{\text{since } \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\} \]
\[ = \sum_i (Y_i - \bar{Y})^2 - 2\hat{\beta}_1 \underbrace{\sum_i (Y_i - \bar{Y})(X_i - \bar{X})}_{= n(\overline{X^2} - \bar{X}^2)\, \hat{\beta}_1} + \hat{\beta}_1^2 \underbrace{\sum_i (X_i - \bar{X})^2}_{= n(\overline{X^2} - \bar{X}^2)} = \sum_i (Y_i - \bar{Y})^2 - \hat{\beta}_1^2\, n(\overline{X^2} - \bar{X}^2) \]
\[ = \sum_i Y_i^2 - n(\bar{Y})^2 - \hat{\beta}_1^2\, n(\overline{X^2} - \bar{X}^2) = \sum_{i=1}^{n} Y_i^2 - \hat{Z}_1^2 - \hat{Z}_2^2. \]
Since the transformation $\hat{Z} = YA$ is orthogonal it preserves the norm, $\sum_i \hat{Z}_i^2 = \sum_i Y_i^2$, and therefore
\[ n\hat{\sigma}^2 = \sum_{i=1}^{n} \hat{Z}_i^2 - \hat{Z}_1^2 - \hat{Z}_2^2 = \hat{Z}_3^2 + \cdots + \hat{Z}_n^2. \]
If we can show that $\hat{Z}_3, \ldots, \hat{Z}_n$ are i.i.d. with distribution $N(0, \sigma^2)$, then we will have shown that
\[ \frac{n\hat{\sigma}^2}{\sigma^2} = \Big( \frac{\hat{Z}_3}{\sigma} \Big)^2 + \cdots + \Big( \frac{\hat{Z}_n}{\sigma} \Big)^2 \sim \chi^2_{n-2}. \]
Moreover, since $\hat{\beta}_0$ and $\hat{\beta}_1$ are functions of $\hat{Z}_1$ and $\hat{Z}_2$ only, this will also prove that $\hat{\sigma}^2$ is independent of $\hat{\beta}_0$ and $\hat{\beta}_1$.
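The identity $n\hat{\sigma}^2 = \sum_i Y_i^2 - \hat{Z}_1^2 - \hat{Z}_2^2$ derived above can be checked numerically; a minimal sketch with hypothetical data:

```python
import math

X = [1.0, 2.0, 3.0, 4.0, 5.0]     # hypothetical data
Y = [1.2, 1.9, 3.2, 3.8, 5.1]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n
mxx = sum(x * x for x in X) / n
mxy = sum(x * y for x, y in zip(X, Y)) / n
b1 = (mxy - mx * my) / (mxx - mx * mx)
b0 = my - b1 * mx

rss = sum((y - b0 - b1 * x) ** 2 for x, y in zip(X, Y))  # n * sigma-hat^2
Z1 = math.sqrt(n) * my                                   # sqrt(n) * Ybar
Z2 = math.sqrt(n * (mxx - mx * mx)) * b1
identity_gap = abs(rss - (sum(y * y for y in Y) - Z1 ** 2 - Z2 ** 2))
```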
Since $\hat{Z} = \sigma Z + \mu A$, each $\hat{Z}_i$ is normal with variance $\sigma^2$ and mean
\[ (\mu A)_i = \sum_{j=1}^{n} a_{ji} \mu_j = \sum_{j=1}^{n} a_{ji} (\beta_0 + \beta_1 X_j) = \sum_{j=1}^{n} a_{ji} \big( \beta_0 + \beta_1 \bar{X} + \beta_1 (X_j - \bar{X}) \big) = (\beta_0 + \beta_1 \bar{X}) \sum_{j=1}^{n} a_{ji} + \beta_1 \sum_{j=1}^{n} a_{ji} (X_j - \bar{X}). \]
Since the matrix $A$ is orthogonal, its columns are orthogonal to each other. Let $a_i = (a_{1i}, \ldots, a_{ni})$ be the vector in the $i$th column and let us consider $i \ge 3$. Then the orthogonality relations
\[ a_i \cdot a_1 = \sum_{j=1}^{n} a_{j1} a_{ji} = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} a_{ji} = 0 \quad \text{and} \quad a_i \cdot a_2 = \sum_{j=1}^{n} a_{j2} a_{ji} = \frac{1}{\sqrt{n(\overline{X^2} - \bar{X}^2)}} \sum_{j=1}^{n} (X_j - \bar{X})\, a_{ji} = 0 \]
imply that
\[ \sum_{j=1}^{n} a_{ji} = 0 \quad \text{and} \quad \sum_{j=1}^{n} a_{ji} (X_j - \bar{X}) = 0, \]
so $\mathbb{E}\hat{Z}_i = (\mu A)_i = 0$ for all $i \ge 3$, which finishes the proof.
Lecture 31
31.1 Confidence intervals for simple linear regression.

Let us summarize the results of the last two lectures:
\[ \hat{\beta}_1 \sim N\Big( \beta_1,\ \frac{\sigma^2}{n(\overline{X^2} - \bar{X}^2)} \Big), \qquad \hat{\beta}_0 \sim N\bigg( \beta_0,\ \sigma^2 \Big( \frac{1}{n} + \frac{\bar{X}^2}{n(\overline{X^2} - \bar{X}^2)} \Big) \bigg), \]
\[ \mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) = -\frac{\sigma^2 \bar{X}}{n(\overline{X^2} - \bar{X}^2)}, \]
$\hat{\sigma}^2$ is independent of $\hat{\beta}_0$ and $\hat{\beta}_1$, and
\[ \frac{n\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-2}. \]
Suppose now that we want to find the confidence intervals for the unknown parameters of the model, $\beta_0$, $\beta_1$ and $\sigma^2$. This is straightforward and very similar to the confidence intervals for the parameters of the normal distribution.

For example, using that $n\hat{\sigma}^2/\sigma^2 \sim \chi^2_{n-2}$, if we find the constants $c_1$ and $c_2$ such that
\[ \chi^2_{n-2}(0, c_1) = \frac{\alpha}{2} \quad \text{and} \quad \chi^2_{n-2}(c_2, +\infty) = \frac{\alpha}{2}, \]
then with probability $1 - \alpha$ we have $c_1 \le n\hat{\sigma}^2/\sigma^2 \le c_2$, and solving for $\sigma^2$ gives the $1 - \alpha$ confidence interval
\[ \frac{n\hat{\sigma}^2}{c_2} \le \sigma^2 \le \frac{n\hat{\sigma}^2}{c_1}. \]
Next, to find a confidence interval for $\beta_1$ we use that
\[ \frac{\sqrt{n(\overline{X^2} - \bar{X}^2)}\,(\hat{\beta}_1 - \beta_1)}{\sigma} \sim N(0, 1) \quad \text{and} \quad \frac{n\hat{\sigma}^2}{\sigma^2} = \frac{1}{\sigma^2}\big( \hat{Z}_3^2 + \cdots + \hat{Z}_n^2 \big) \sim \chi^2_{n-2} \]
are independent, so by the definition of the $t$-distribution,
\[ \frac{\sqrt{n(\overline{X^2} - \bar{X}^2)}\,(\hat{\beta}_1 - \beta_1)/\sigma}{\sqrt{\frac{1}{n-2} \cdot \frac{n\hat{\sigma}^2}{\sigma^2}}} = \sqrt{\frac{(n-2)(\overline{X^2} - \bar{X}^2)}{\hat{\sigma}^2}}\,(\hat{\beta}_1 - \beta_1) \sim t_{n-2}. \]
Therefore, if we find $c$ such that $t_{n-2}(-c, c) = 1 - \alpha$, then with probability $1 - \alpha$,
\[ -c \le \sqrt{\frac{(n-2)(\overline{X^2} - \bar{X}^2)}{\hat{\sigma}^2}}\,(\hat{\beta}_1 - \beta_1) \le c, \]
and solving for $\beta_1$ gives the $1 - \alpha$ confidence interval:
\[ \hat{\beta}_1 - c \sqrt{\frac{\hat{\sigma}^2}{(n-2)(\overline{X^2} - \bar{X}^2)}} \le \beta_1 \le \hat{\beta}_1 + c \sqrt{\frac{\hat{\sigma}^2}{(n-2)(\overline{X^2} - \bar{X}^2)}}. \]
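A minimal sketch of this interval (hypothetical data; the quantile $c$ must be supplied, e.g. $c = 3.182$ for $t_3$ and $\alpha = 0.05$):

```python
import math

def beta1_confidence_interval(X, Y, c):
    """(b1 - c*s, b1 + c*s) with s = sqrt(sigma_hat^2 /
    ((n-2)*(mean(X^2) - mean(X)^2))), c the t_{n-2} quantile."""
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    mxx = sum(x * x for x in X) / n
    mxy = sum(x * y for x, y in zip(X, Y)) / n
    b1 = (mxy - mx * my) / (mxx - mx * mx)
    b0 = my - b1 * mx
    sigma2 = sum((y - b0 - b1 * x) ** 2 for x, y in zip(X, Y)) / n
    s = math.sqrt(sigma2 / ((n - 2) * (mxx - mx * mx)))
    return b1 - c * s, b1 + c * s

# n = 5, so n - 2 = 3 and c = 3.182 gives a 95% interval.
lo, hi = beta1_confidence_interval([1, 2, 3, 4, 5],
                                   [1.2, 1.9, 3.2, 3.8, 5.1], 3.182)
```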
Similarly, to find the confidence interval for $\beta_0$ we use that
\[ \frac{\hat{\beta}_0 - \beta_0}{\sqrt{\frac{1}{n-2} \cdot \frac{n\hat{\sigma}^2}{\sigma^2} \cdot \sigma^2 \Big( \frac{1}{n} + \frac{\bar{X}^2}{n(\overline{X^2} - \bar{X}^2)} \Big)}} \sim t_{n-2}, \]
and the $1 - \alpha$ confidence interval for $\beta_0$ is:
\[ \hat{\beta}_0 - c \sqrt{\frac{\hat{\sigma}^2}{n-2} \Big( 1 + \frac{\bar{X}^2}{\overline{X^2} - \bar{X}^2} \Big)} \le \beta_0 \le \hat{\beta}_0 + c \sqrt{\frac{\hat{\sigma}^2}{n-2} \Big( 1 + \frac{\bar{X}^2}{\overline{X^2} - \bar{X}^2} \Big)}. \]
Prediction Interval.
Suppose now that we have a new observation $X$ for which $Y$ is unknown and we want to predict $Y$ or find a confidence interval for $Y$. According to the simple regression model,
\[ Y = \beta_0 + \beta_1 X + \varepsilon, \]
and it is natural to take $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$ as the prediction of $Y$. Let us find the distribution of their difference $Y - \hat{Y}$. Clearly, the difference will have a normal distribution, so we only need to compute the mean and the variance. The mean is
so we only need to compute the mean and the variance. The mean is
(Y Y ) =
0 + 1 X 0 1 X = 0 + 1 X 0 1 X 0 = 0.
Since a new pair (X, Y ) is independent of the prior data we have that Y is independent
of Y . Therefore, since the variance of the sum or dierence of independent random
variables is equal to the sum of their variances, we get
Var(Y Y ) = Var(Y ) + Var(Y ) = 2 + Var(Y ),
where we also used that Var(Y ) = Var() = 2 . Let us compute the variance of Y :
Var(Y ) =
(0 + 1 X 0 1 X)2 =
((0 0 ) + (1 1 )X)2
\[ = \underbrace{\mathbb{E}(\hat{\beta}_0 - \beta_0)^2}_{\text{variance of } \hat{\beta}_0} + X^2 \underbrace{\mathbb{E}(\hat{\beta}_1 - \beta_1)^2}_{\text{variance of } \hat{\beta}_1} + 2X \underbrace{\mathbb{E}(\hat{\beta}_0 - \beta_0)(\hat{\beta}_1 - \beta_1)}_{\text{covariance}} \]
\[ = \sigma^2 \Big( \frac{1}{n} + \frac{\bar{X}^2}{n(\overline{X^2} - \bar{X}^2)} \Big) + X^2 \frac{\sigma^2}{n(\overline{X^2} - \bar{X}^2)} - 2X \frac{\sigma^2 \bar{X}}{n(\overline{X^2} - \bar{X}^2)} = \sigma^2 \Big( \frac{1}{n} + \frac{(X - \bar{X})^2}{n(\overline{X^2} - \bar{X}^2)} \Big). \]
Therefore,
\[ Y - \hat{Y} \sim N\bigg( 0,\ \sigma^2 \Big( 1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{n(\overline{X^2} - \bar{X}^2)} \Big) \bigg). \]
As a result, we have:
\[ \frac{Y - \hat{Y}}{\sigma \sqrt{1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{n(\overline{X^2} - \bar{X}^2)}}} \sim N(0, 1), \]
and dividing by $\sqrt{\frac{1}{n-2} \cdot \frac{n\hat{\sigma}^2}{\sigma^2}}$, which is independent of it, gives a $t_{n-2}$-distributed statistic. If $t_{n-2}(-c, c) = 1 - \alpha$, then solving for $Y$ gives the $1 - \alpha$ prediction interval:
\[ \hat{Y} - c \sqrt{\frac{\hat{\sigma}^2}{n-2} \Big( n + 1 + \frac{(X - \bar{X})^2}{\overline{X^2} - \bar{X}^2} \Big)} \le Y \le \hat{Y} + c \sqrt{\frac{\hat{\sigma}^2}{n-2} \Big( n + 1 + \frac{(X - \bar{X})^2}{\overline{X^2} - \bar{X}^2} \Big)}. \]
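The prediction interval can be sketched the same way (same hypothetical data as before; $c = 3.182$ is the $t_3$ quantile for $\alpha = 0.05$):

```python
import math

def prediction_interval(X, Y, x_new, c):
    """Y_hat +/- c * sqrt(sigma_hat^2/(n-2) * (n + 1 +
    (x_new - mean(X))^2 / (mean(X^2) - mean(X)^2)))."""
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    mxx = sum(x * x for x in X) / n
    mxy = sum(x * y for x, y in zip(X, Y)) / n
    b1 = (mxy - mx * my) / (mxx - mx * mx)
    b0 = my - b1 * mx
    sigma2 = sum((y - b0 - b1 * x) ** 2 for x, y in zip(X, Y)) / n
    y_hat = b0 + b1 * x_new
    half = c * math.sqrt(sigma2 / (n - 2)
                         * (n + 1 + (x_new - mx) ** 2 / (mxx - mx * mx)))
    return y_hat - half, y_hat + half

lo, hi = prediction_interval([1, 2, 3, 4, 5],
                             [1.2, 1.9, 3.2, 3.8, 5.1], 3.0, 3.182)
```

Note how the interval is wider than the confidence interval for $\beta_0$ or $\beta_1$: it must also absorb the noise $\varepsilon$ of the new observation.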
Lecture 32
32.1 Classification problem.
Suppose that we have the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ that consist of pairs $(X_i, Y_i)$ such that $X_i$ belongs to some set $\mathcal{X}$ and $Y_i$ belongs to the set $\mathcal{Y} = \{+1, -1\}$. We will think of $Y_i$ as a label of $X_i$, so that all points in the set $\mathcal{X}$ are divided into two classes corresponding to the labels $\pm 1$. For example, the $X_i$s can be images or representations of images and the $Y_i$s classify whether the image contains a human face or not. Given this data we would like to find a classifier
\[ f : \mathcal{X} \to \mathcal{Y} \]
which, given a point $X \in \mathcal{X}$, would predict its label $Y$. This type of problem is called a classification problem. In general, there may be more than two classes of points, which means that the set of labels may consist of more than two points but, for simplicity, we will consider the simplest case when we have only two labels $\pm 1$.
We will take a look at one approach to this problem called boosting and, in particular, prove one interesting property of the algorithm called AdaBoost.

Let us assume that we have a family of classifiers
\[ \mathcal{H} = \{h : \mathcal{X} \to \mathcal{Y}\}. \]
Suppose that we can find many classifiers in $\mathcal{H}$ that can predict the labels $Y_i$ better than tossing a coin, which means that they predict the correct label at least half of the time. We will call $\mathcal{H}$ a family of weak classifiers because we do not require much of them; for example, all these classifiers can make mistakes on, let's say, 30% or even 45% of the sample.

The idea of boosting consists in trying to combine these weak classifiers so that the combined classifier predicts the label correctly most of the time. Let us consider one particular algorithm called AdaBoost.
Given weights $w(1), \ldots, w(n)$ that add up to one, we define the weighted classification error of a classifier $h$ by
\[ w(1) I(h(X_1) \ne Y_1) + \ldots + w(n) I(h(X_n) \ne Y_n). \]
AdaBoost algorithm. We start by assigning equal weights to the data points:
\[ w_1(1) = \ldots = w_1(n) = \frac{1}{n}. \]
Then, for $t = 1, \ldots, T$, we repeat the following two steps.

1. Find $h_t \in \mathcal{H}$ such that its weighted error
\[ \varepsilon_t = w_t(1) I(h_t(X_1) \ne Y_1) + \ldots + w_t(n) I(h_t(X_n) \ne Y_n) \]
is as small as possible.

2. Let $\alpha_t = \frac{1}{2} \log \frac{1 - \varepsilon_t}{\varepsilon_t}$ and update the weights:
\[ w_{t+1}(i) = w_t(i)\, \frac{e^{-\alpha_t Y_i h_t(X_i)}}{Z_t}, \quad \text{where} \quad Z_t = \sum_{i=1}^{n} w_t(i)\, e^{-\alpha_t Y_i h_t(X_i)} \]
is the normalizing factor that ensures that the updated weights again add up to one.

After $T$ rounds, the algorithm outputs the combined classifier $\mathrm{sign}(f(X))$, where $f(X) = \alpha_1 h_1(X) + \ldots + \alpha_T h_T(X)$.

Note that since each weak classifier performs better than tossing a coin, $\varepsilon_t \le \frac{1}{2}$ and, therefore,
\[ \alpha_t = \frac{1}{2} \log \frac{1 - \varepsilon_t}{\varepsilon_t} \ge 0. \]
Also note that
\[ Y_i h_t(X_i) = \begin{cases} +1 & \text{if } h_t(X_i) = Y_i \\ -1 & \text{if } h_t(X_i) \ne Y_i. \end{cases} \]
Therefore, if $h_t$ makes a mistake on the example $(X_i, Y_i)$, which means that $h_t(X_i) \ne Y_i$ or, equivalently, $Y_i h_t(X_i) = -1$, then
\[ w_{t+1}(i) = \frac{e^{-\alpha_t Y_i h_t(X_i)}}{Z_t}\, w_t(i) = \frac{e^{\alpha_t}}{Z_t}\, w_t(i). \]
On the other hand, if $h_t$ predicts the label $Y_i$ correctly, then $Y_i h_t(X_i) = +1$ and
\[ w_{t+1}(i) = \frac{e^{-\alpha_t Y_i h_t(X_i)}}{Z_t}\, w_t(i) = \frac{e^{-\alpha_t}}{Z_t}\, w_t(i). \]
Since $\alpha_t \ge 0$, this means that we increase the relative weight of the $i$th example if we made a mistake on this example and decrease the relative weight if we predicted the label $Y_i$ correctly. Therefore, when we try to minimize the weighted error at the next step $t + 1$, we will pay more attention to the examples misclassified at the previous step.
Theorem: The proportion of mistakes made on the data by the output classifier $\mathrm{sign}(f(X))$ is bounded by
\[ \frac{1}{n} \sum_{i=1}^{n} I\big(\mathrm{sign}(f(X_i)) \ne Y_i\big) \le \prod_{t=1}^{T} \sqrt{4\varepsilon_t (1 - \varepsilon_t)}. \]
Remark: If the weighted errors $\varepsilon_t$ are strictly less than $0.5$ at each step, meaning that we predict the labels better than tossing a coin, then the error of the combined classifier will decrease exponentially fast with the number of rounds $T$. For example, if $\varepsilon_t \le 0.4$, then $4\varepsilon_t(1 - \varepsilon_t) \le 4(0.4)(0.6) = 0.96$ and the error will decrease as fast as $0.96^{T/2}$.
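The algorithm and the bound of the theorem can be checked on a toy example; a sketch using one-dimensional threshold stumps as the weak family $\mathcal{H}$ (all data hypothetical):

```python
import math

def adaboost(X, Y, T):
    """AdaBoost with 1-D threshold stumps h(x) = s if x > thr else -s;
    returns the combined f and the list of weighted errors eps_t."""
    n = len(X)
    w = [1.0 / n] * n
    stumps, alphas, epsilons = [], [], []
    thresholds = sorted(set(X))
    for _ in range(T):
        best = None
        for thr in thresholds:                 # step 1: best weak stump
            for s in (1, -1):
                h = lambda x, thr=thr, s=s: s if x > thr else -s
                eps = sum(wi for wi, x, y in zip(w, X, Y) if h(x) != y)
                if best is None or eps < best[0]:
                    best = (eps, h)
        eps, h = best
        eps = min(max(eps, 1e-12), 1 - 1e-12)  # guard the log endpoints
        alpha = 0.5 * math.log((1 - eps) / eps)
        w = [wi * math.exp(-alpha * y * h(x))  # step 2: reweight
             for wi, x, y in zip(w, X, Y)]
        Z = sum(w)
        w = [wi / Z for wi in w]
        stumps.append(h); alphas.append(alpha); epsilons.append(eps)
    f = lambda x: sum(a * h(x) for a, h in zip(alphas, stumps))
    return f, epsilons

# Toy data that no single stump classifies perfectly.
X = [0, 1, 2, 3, 4, 5, 6, 7]
Y = [-1, -1, 1, 1, -1, 1, 1, 1]
f, eps = adaboost(X, Y, 10)
train_err = sum(1 for x, y in zip(X, Y)
                if (1 if f(x) > 0 else -1) != y) / len(X)
bound = 1.0
for e in eps:
    bound *= math.sqrt(4 * e * (1 - e))   # the theorem's bound
```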
Proof. Using that $I(x \le 0) \le e^{-x}$, as shown in figure 32.1, we can bound the indicator of making an error by
\[ I\big(\mathrm{sign}(f(X_i)) \ne Y_i\big) = I(Y_i f(X_i) \le 0) \le e^{-Y_i f(X_i)} = e^{-Y_i \sum_{t=1}^{T} \alpha_t h_t(X_i)}. \tag{32.1} \]
Next, using step 2 of the AdaBoost algorithm, which describes how the weights are updated, we can express the weights at each step in terms of the weights at the previous step:
\[ w_{T+1}(i) = \frac{w_T(i)\, e^{-\alpha_T Y_i h_T(X_i)}}{Z_T} = \frac{e^{-\alpha_T Y_i h_T(X_i)}}{Z_T} \cdot \frac{e^{-\alpha_{T-1} Y_i h_{T-1}(X_i)}}{Z_{T-1}}\, w_{T-1}(i) = \ldots \]
(Figure 32.1: comparison of $I(x \le 0)$ and $e^{-x}$.) Repeating this until we arrive at the initial weights $w_1(i) = 1/n$, we get
\[ \frac{1}{n}\, e^{-Y_i f(X_i)} = w_{T+1}(i) \prod_{t=1}^{T} Z_t. \]
Combining this with (32.1) and using that the weights $w_{T+1}(i)$ add up to one, we get
\[ \frac{1}{n} \sum_{i=1}^{n} I\big(\mathrm{sign}(f(X_i)) \ne Y_i\big) \le \frac{1}{n} \sum_{i=1}^{n} e^{-Y_i f(X_i)} = \sum_{i=1}^{n} w_{T+1}(i) \prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} Z_t. \tag{32.2} \]
Next we will compute
\[ Z_t = \sum_{i=1}^{n} w_t(i)\, e^{-\alpha_t Y_i h_t(X_i)}. \]
Splitting the sum according to whether $h_t$ makes a mistake on the $i$th example ($Y_i h_t(X_i) = -1$) or not ($Y_i h_t(X_i) = +1$), we get
\[ \sum_{i=1}^{n} w_t(i)\, e^{-\alpha_t Y_i h_t(X_i)} = e^{-\alpha_t} \underbrace{\sum_{i=1}^{n} w_t(i) I(Y_i = h_t(X_i))}_{= 1 - \varepsilon_t} + e^{\alpha_t} \underbrace{\sum_{i=1}^{n} w_t(i) I(Y_i \ne h_t(X_i))}_{= \varepsilon_t} = e^{-\alpha_t}(1 - \varepsilon_t) + e^{\alpha_t} \varepsilon_t. \]
Finally, plugging in the choice of $\alpha_t$, i.e. $e^{\alpha_t} = \sqrt{(1 - \varepsilon_t)/\varepsilon_t}$, we get
\[ Z_t = (1 - \varepsilon_t) \sqrt{\frac{\varepsilon_t}{1 - \varepsilon_t}} + \varepsilon_t \sqrt{\frac{1 - \varepsilon_t}{\varepsilon_t}} = \sqrt{4\varepsilon_t (1 - \varepsilon_t)}, \]
and plugging this into (32.2) finishes the proof of the bound.