Chapter 1
Review of Basic Concepts
1.1 Introduction
These notes have been prepared for the course STA2004F, for which the pre-requisite at UCT is
50% in STA1006 and 50% in MAM1000W. It is thus assumed that the material in the textbook
for STA1006 (“INTROSTAT”, by L G Underhill and D J Bradfield) is known.
It is useful to start this course with the question: “what do we mean by probability?” The
basis in STA1006 was the concept of a sample space, subsets of which are defined as events.
Very often the events are closely linked to values of a random variable, i.e. a real-valued number
whose “true” value is at present unknown, either through ignorance or because it is still unde-
termined. Thus if X is a random variable, then typical events may be {X ≤ 20}, {0 < X < 1},
{X is a prime number}.
In general terms, we can view an event as a statement, which is in principle verifiable as true or
false, but the truth of which will only be revealed for sure at a later time (if ever). The concept
of a random experiment was introduced in STA1006, as being the observation of nature (the
real world) which will resolve the issue of the truth of the statement (i.e. whether the event has
occurred or not). The probability of an event is then a measure of the degree of likelihood that
the statement is true (that the event has occurred or will occur), i.e. the degree of credibility in
the statement. This measure is conventionally standardized so that if the statement is known to
be false (the event is impossible) then the probability is 0; while if it is known to be true (the
event is certain) then the probability is 1. The axioms of Kolmogorov (see “INTROSTAT”) give
minimal requirements for a probability measure to be rationally consistent.
With this broad background, there are two rather divergent ways in which a probability measure
may be assessed and/or interpreted:
Frequency Intuition: Suppose the identical random experiment can be conducted N times
(e.g. rolling dice, spinning coins), and let M be the number of times that the event E is
observed to occur. We can define a probability of E by:
Pr[E] = lim_{N→∞} M/N
i.e. the relative proportion of times that E occurs in many trials. This is still a leap in
faith! Why should the future behave like the past? Why should the occurrence of E at
the next occasion obey the average laws of the past? In any case, what do we mean by
“identical” experiments? Nevertheless, this approach does give a sense of objectivity to
the interpretation of probability.
Very often, the “experiments” are hypothetical mental experiments! These tend to be
based on the concept of equally likely elementary events, justified by symmetry arguments
(e.g. the faces of a die).
Subjective Belief: The problem is that we cannot even conceptually view all forms of uncer-
tainty in frequency of occurrence terms. Consider for example:
None of these can be repeatedly observed, and yet we may well have a strong subjective
sense of the probability of events defined in terms of these random variables. The subjective
view accepts that most, in fact very nearly all, sources of uncertainty include at least
some degree of subjectivity, and that we should not avoid recognizing probability as a
measure of our subjective lack of knowledge. Of course, where well-defined frequencies
are available, or can be derived from elementary arguments, we should not lightly dismiss
these. But ultimately, the only logical rational constraint on probabilities is that they do
satisfy Kolmogorov’s axioms (for without this, the implied beliefs are incoherent, in the
sense that actions or decisions consistent with stated probabilities violating the axioms can
be shown to lead to certain loss).
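To make the frequency intuition above concrete, here is a minimal simulation sketch (the dice experiment, the event E and the sample sizes are our own choices, not taken from the notes): the relative frequency M/N settles down to the probability of E as N grows.

    import numpy as np

    rng = np.random.default_rng(0)
    # Random experiment: roll two fair dice; event E: the total is at least 10.
    # The exact probability is 6/36 = 0.1667.
    for n in (100, 10_000, 1_000_000):
        rolls = rng.integers(1, 7, size=(n, 2))   # N = n repetitions of the experiment
        m = np.sum(rolls.sum(axis=1) >= 10)       # M = number of times E occurred
        print(n, m / n)                           # M/N approaches Pr[E]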
The aim of statistics is to argue from specific observations of data to more general conclusions
(“inductive reasoning”). Probability is only a tool for coping with the uncertainties inherent in
this process. But it is inevitable that the above two views of probability extend to two distinct
philosophies of statistical inference, i.e. the manner in which sample data should be extrapolated to
general conclusions about populations. These two philosophies (paradigms) can be summarized
as follows:
Frequentist, or Sampling Theory: Here probability statements are used only to describe
the results of (at least conceptually) repeatable experiments, but not for any other uncer-
tainties (for example regarding the value of an unknown parameter, or the truth of a “null”
hypothesis, which are assumed to remain constant no matter how often the experiment is
repeated). The hope is that different statisticians should be able to agree on these probabil-
ities, which thus have a claim to objectivity. The emergence of statistical inference during
the late 19th and early 20th centuries as a central tool of the scientific method occurred
within this paradigm (at a time when “objectivity” was prized above all else in science).
Certainly, concepts of repeatable experiments and hypothesis testing remain fundamental
cornerstones of the scientific method.
The emphasis in this approach is on sampling variability: what would have happened if the
underlying experiments could have been repeated many times over? This is the approach
which was adopted throughout first year.
Bayesian: Here probability statements are used to represent all uncertainty, whether a result
of sampling variability or of lack of information. The term “Bayesian” arises because of the
central role of Bayes’ theorem in making probability statements about unknown parameters
conditional on observed data, i.e. “inferring causes” from observed consequences, which was
very much the context in which Bayes worked.
The emphasis is on characterizing how degrees of uncertainty change from the position
prior to any observations, to the position after (or posterior to) these observations.
One cannot say that one philosophy of inference is “better” than another (although some have
tried to argue in this way). Some contexts lend themselves to one philosophy rather than the
other, while some statisticians feel more comfortable with the one set of assumptions than the
other. At least one of the authors (Stewart) favours the Bayesian approach, “all other things
being equal”, while another (Thiart) views herself as a frequentist. For the purposes of this course
and for much of the next, we will largely limit ourselves to discussing the frequentist approach:
this is perhaps simpler, will avoid confusion of concepts at this relatively early stage in your
training, and is the “classical” approach used in reporting experimental results in many fields.
In fact, the fundamental purpose of this section of the work in STA2004F is to demonstrate
precisely how the various tests, confidence intervals, etc. in first year are derived from basic
distributional theory applied to sampling variability viewed in a frequentist sense.
Let us first briefly recall some of the basic concepts of probability distributions (which will also
serve to introduce our notation). We shall use upper case letters (X, Y, . . .) to signify random
variables (e.g. time to failure of a machine, size of an insurable loss, blood pressure of a patient),
and lower case letters (x, y, . . .) to represent algebraic quantities. The expression {X ≤ x} thus
represents the event, or the assertion, that the random variable denoted by X takes on a value
not exceeding the real number x. (We can quite legitimately define events such as {X ≤ 3.81}.)
The distribution function (sometimes denoted as the cumulative distribution function) of X is
defined by:
FX (x) = Pr[X ≤ x]
The subscript X refers to the specific random variable under consideration. The argument x is
an arbitrary algebraic symbol: we can just as easily talk of FX (y), or FX (t), or just FX (5.17).
Since for any pair of real numbers a < b, the events {a < X ≤ b} and {X ≤ a} are mutually
exclusive, while their union is the event {X ≤ b}, it follows that Pr[X ≤ b] = Pr[X ≤ a] + Pr[a <
X ≤ b], or (equivalently):
Pr[a < X ≤ b] = FX (b) − FX (a)
Now suppose that X can take on discrete values only. Without loss of generality, these values
can be indexed by the non-negative integers, and we very often assume then that X is itself a
non-negative integer. We shall adopt this convention for the purposes of developing the necessary
theory. For any non-negative integer, then, we define the probability mass function (pmf ) (some-
times simply termed the probability function) by pX (x) = Pr[X = x]. The following properties
are then evident:
0 ≤ pX(x) ≤ 1

Σ_{x=0}^{∞} pX(x) = 1

FX(x) = Σ_{j=0}^{x} pX(j)
For a continuous random variable X, on the other hand, we consider the limit:

fX(x) = lim_{h→0} Pr[x < X ≤ x + h]/h = lim_{h→0} [FX(x + h) − FX(x)]/h = dFX(x)/dx
The function fX (x) is the probability density function (pdf ). Once again, the subscript X iden-
tifies the random variable under consideration, while the argument x is an arbitrary algebraic
quantity. The pdf satisfies the following properties:

fX(x) ≥ 0 for all x

∫_{−∞}^{∞} fX(x) dx = 1

FX(x) = ∫_{−∞}^{x} fX(u) du,  and hence  Pr[a < X ≤ b] = ∫_{a}^{b} fX(x) dx
In the first year course you encountered the following distributions, which are assumed to be known (if you did not cover them, please consult your first year handbook, or the handbooks at the short term section of the university library):

Discrete distributions: Binomial, geometric, negative binomial, hypergeometric and Poisson.

Continuous distributions: Uniform, exponential, normal (Gaussian); you also used the t, χ2 and F distributions, but did not study their properties.
It is worth recalling some of the practical settings which underlie some of these distributions,
all of which are described in first year. The binomial sampling situation refers to the number of
“successes” (which may, in some cases, be the undesirable outcomes!) in n independent “trials”,
in which the probability of success in a single trial is p for all trials. This is often described initially
in terms of “sampling with replacement”, where the replacement ensures a constant value of p (as
otherwise we have the hypergeometric distribution), but applies whenever sequential trials have
constant success probability (e.g. sampling production from a continuous process). The negative
binomial distribution describes the number of successes obtained before the r-th failure occurs
(or vice versa), and the geometric distribution is the special case of r = 1. In this context, the
Poisson distribution with parameter λ = np is introduced as an approximation to the binomial
distribution when n is large and p is small.
The Poisson distribution also arises in conjunction with the exponential distribution in the impor-
tant context of a “memoryless (Poisson) process”. Such a process describes discrete occurrences
(e.g. failures of a piece of equipment, claims on an insurance policy), when the probabilities of
future events do not depend on what has happened in the past. For example, if the probability
that a machine will break down in the next hour is independent of how long it is since the last
breakdown, then this is a memoryless process. For such a process in which the rate of occurrence
is λ (number of occurrences per unit time), it was shown in INTROSTAT that (a) the time be-
tween successive occurrences is a continuous random variable having the exponential distribution
with parameter λ (i.e. a mean of 1/λ); and (b) the number of occurrences in a fixed interval of
time t is a discrete random variable having the Poisson distribution with parameter (i.e. mean)
λt.
For other details of the above-mentioned distributions, refer to INTROSTAT and your first year
notes! We shall now introduce two new distributional forms.
The gamma function is defined for all α > 0 by:

Γ(α) = ∫_{0}^{∞} x^{α−1} e^{−x} dx

and integration by parts shows that Γ(α) = (α − 1)Γ(α − 1). Use of this result together with Γ(1) = 1 and Γ(1/2) = √π allows us to evaluate Γ(n) for all integer and half-integer arguments. In particular, it is easily confirmed that for integer values of n, Γ(n) = (n − 1)!.
Now consider the function defined by:
fX(x) = x^{α−1} e^{−x} / Γ(α)   for 0 < x < ∞
and 0 anywhere else, for any positive real α. Clearly fX (x) ≥ 0 and integrates to 1 by definition.
This function thus satisfies the properties of a p.d.f., and is sometimes called the density of
the one-parameter gamma distribution. By re-scaling the x-variable, we can obtain the more
general form for the two parameter gamma distribution (commonly termed simply the gamma
distribution):
fX(x) = λ^α x^{α−1} e^{−λx} / Γ(α)   for 0 < x < ∞.
(Note that in general we will assume that the value of a density function is 0 outside of the range
for which it is specifically defined.)
Exercise: Show that the p.d.f. given above for the gamma distribution is indeed a proper p.d.f.
(HINT: Transform to a new variable u = λx.)
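A numerical sanity check of this exercise (a sketch using SciPy; the particular (α, λ) values are arbitrary choices of ours): the two-parameter gamma density should integrate to 1 over (0, ∞).

    import numpy as np
    from scipy.integrate import quad
    from scipy.special import gamma as gamma_fn

    def gamma_pdf(x, alpha, lam):
        # Two-parameter gamma density as defined above.
        return lam**alpha * x**(alpha - 1) * np.exp(-lam * x) / gamma_fn(alpha)

    for alpha, lam in [(1.0, 1.0), (2.0, 3.0), (5.0, 0.5)]:
        total, _ = quad(gamma_pdf, 0, np.inf, args=(alpha, lam))
        print(alpha, lam, round(total, 6))        # each total is 1 (up to numerical error)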
Alternative definition: In some texts (and also in MicroSoft Excel), the gamma distribution
is defined in terms of α and a parameter β defined by β = 1/λ. In this case the density is
written as:
fX(x) = x^{α−1} e^{−x/β} / (Γ(α) β^α)   for 0 < x < ∞.
Of course, the mathematical fact that fX (x) satisfies the properties of a p.d.f. does not show that
any random variable will have this distribution in practical situations. We shall, a little later
in this course, demonstrate two important situations in which the gamma distribution arises
naturally. These are:
1. In the Poisson (memoryless) process with rate λ, the time interval until the r-th occurrence
has the gamma distribution with α = r (which is of course an integer in this case); in this
case the distribution is sometimes called the Erlang distribution.
2. The special case in which α = n/2 (for integer n) and λ = 1/2 is the χ2 (chi-squared) distribution with n degrees of freedom (which you met frequently in the first year course).
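Point 1 above can be illustrated by simulation; a sketch (with r, λ and the sample size chosen arbitrarily, not taken from the notes) in which the waiting time to the r-th occurrence is built up as a sum of r exponential inter-occurrence times and compared with the gamma (Erlang) distribution with α = r.

    import numpy as np

    rng = np.random.default_rng(1)
    r, lam = 3, 2.0                                  # wait for the 3rd occurrence; rate 2 per unit time
    gaps = rng.exponential(scale=1/lam, size=(100_000, r))
    waiting = gaps.sum(axis=1)                       # time until the r-th occurrence

    # Gamma(alpha=r, lambda=lam) has mean r/lam and variance r/lam^2.
    print(waiting.mean(), r / lam)
    print(waiting.var(), r / lam**2)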
The beta function is defined for positive m and n by:

B(m, n) = ∫_{0}^{1} x^{m−1} (1 − x)^{n−1} dx

Note that this is a function of m and n (and not of x). Note the symmetry of the arguments: B(m, n) = B(n, m). It can be shown that the beta and gamma functions are related by the following expression, which we shall not prove:

B(m, n) = Γ(m)Γ(n) / Γ(m + n)
It then follows that the function defined by:

fX(x) = x^{m−1} (1 − x)^{n−1} / B(m, n)   for 0 < x < 1
satisfies all the properties of a probability density function. A probability distribution with p.d.f.
given by fX (x) is called the beta distribution, or more correctly the beta distribution of the first
kind. (We shall meet the second kind shortly.) We will later see certain situations in which this
distribution arises naturally, e.g. in comparing variances or sums of squares.
1.4 Some Integration Theory

Suppose that we need to evaluate an integral of the form:

∫_{a}^{b} f(x) dx

but that we would rather work in terms of a “new” variable defined by u = g(x). There are at least two possible reasons for this: (i) the variable u may have more physical or statistical meaning (i.e. a statistical reason); or (ii) it may be easier to solve the integral in terms of u than in terms of x (i.e. a mathematical reason).
We shall suppose that the function g(x) is monotone (increasing or decreasing) and continuously
differentiable. We can then define a continuously differentiable inverse function, say g −1 (u),
which is nothing more than the solution for x in terms of u from the equation g(x) = u. For
example, if g(x) = e−x , then g −1 (u) = − ln(u). We then define the Jacobian of the transformation
by:

|J| = |dx/du| = |dg⁻¹(u)/du|
Example: Continuing with the example of g(x) = e−x , and g −1 (u) = − ln(u), we have that:
|J| = |d[− ln(u)]/du| = |−1/u| = 1/u
since u > 0.
Important Note: Since dx/du = [du/dx]−1 , it follows that we can also define the Jacobian by
|J| = [dg(x)/dx]−1 , but the result will still be in terms of x, requiring a further substitution
to get it in terms of u as required. Note also that some texts define the Jacobian as the
inverse of our definition, and care has thus to be exercised in interpreting results involving
Jacobians (cf. Section 1.5).
Example (continued): In the previous example, dg(x)/dx = −e−x , and thus the Jacobian
could be written as ex ; substituting x = − ln(u) then gives |J| = e− ln(u) = u−1 , as before.
Theorem 1.1 For any monotone function g(x), defining a transformation of the variable of
integration:
∫_{a}^{b} f(x) dx = ∫_{c}^{d} f[g⁻¹(u)] |J| du
This then defines a procedure for changing the variable of integration from x to u = g(x) (where
g(x) is monotone):
1. Solve for x in terms of u to obtain the inverse function g⁻¹(u).

2. Obtain the derivative of g⁻¹(u) with respect to u, and hence the Jacobian |J|.

3. Calculate the minimum and maximum values for u (i.e. c and d in the theorem).

4. Write down the new integral, as given by the theorem.
Example: Evaluate:

∫_{0}^{∞} x e^{−x²/2} dx

Substitute u = x²/2, which is monotone over the range given; this gives x = √(2u) and |J| = 1/√(2u). Clearly, u also runs from 0 to ∞, and thus the integral becomes:

∫_{0}^{∞} √(2u) e^{−u} (1/√(2u)) du = ∫_{0}^{∞} e^{−u} du = 1
Now let us turn to double integrals (which are a special case of multiple integrals), which are
written as follows:

∫_{x=a}^{b} ∫_{y=c}^{d} f(x, y) dy dx
The evaluation of the double integral (or any multiple integral) is in principle very simple: it is
simply the repeated application of rules for single integrals. So in the above example, we would
first evaluate:

∫_{y=c}^{d} f(x, y) dy
treating x as if it were a constant. The result will be a function of x only, say k(x) and thus in
the second stage we need to evaluate:
∫_{x=a}^{b} k(x) dx.
In the same way that the single integral can be viewed as measuring area under a curve, the
double integral measures volume under a two-dimensional surface.
There is a theorem which states that (for any situations of interest to us here), the order of
integration is immaterial: we could just as well have first integrated with respect to x (treating
y as a constant), and then integrated the resulting function of y. There is one word of caution
required in applying this theorem however, and that relates to the limits of integration. In
evaluating the above integral in the way in which we described initially, there is no reason why
the limits on y should not be functions of the “outer” variable x, which is (remember) being
treated as a constant; the result will still be a function of x, and we can go on to the second step.
When integrating over x, however, the limits must be true constants. Now what happens if we
reverse the order of integration? The outer integral in y cannot any longer have limits depending
on x, so there seems to be a problem! It is not the theorem that is wrong; it is only that we have
to be careful to understand what we mean by the limits. The limits must describe a region (an
area) in the X-Y plane over which we wish to integrate, which can be described in many ways.
We will at a later stage of the course encounter a number of examples of this, but consider one
(quite typical) case: suppose we wish to integrate f (x, y) over all x and y satisfying x ≥ 0, y ≥ 0
and x + y ≤ 1. The region over which we wish to integrate is that shaded area in figure 1.1;
the idea is to find the volume under the surface defined by f (x, y), in the column whose base is
the shaded triangle in the figure. If we first integrate w.r.t. y, treating x as a constant, then the
limits on y must be 0 and 1 − x; and since this can be done for any x between 0 and 1, these
become the limits for x, i.e. we would view the integral as:
∫_{x=0}^{1} ∫_{y=0}^{1−x} f(x, y) dy dx.
But if we change the order around, and first integrate w.r.t. x, treating y as a constant, then the
limits on x must be 0 and 1 − y, and this can be done for any y between 0 and 1, in which case
we would view the integral as:

∫_{y=0}^{1} ∫_{x=0}^{1−y} f(x, y) dx dy.
The important point to note is that the limits on the inner integral (the one carried out first) are allowed to depend on the variable in the outer integral (which is treated as a constant while evaluating the inner integral). The limits on the outer integral must be constants.
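The sketch below (our own illustration, with the arbitrary integrand f(x, y) = xy) evaluates the integral over this triangular region in both orders with scipy.integrate.dblquad, whose inner limits are allowed to be functions of the outer variable; both orderings return the same value.

    from scipy.integrate import dblquad

    f = lambda x, y: x * y                    # an arbitrary integrand, for illustration only

    # dblquad integrates func(inner, outer): the inner variable is the first argument.
    # Order 1: outer x in [0, 1], inner y in [0, 1 - x].
    v1, _ = dblquad(lambda y, x: f(x, y), 0, 1, lambda x: 0, lambda x: 1 - x)
    # Order 2: outer y in [0, 1], inner x in [0, 1 - y].
    v2, _ = dblquad(lambda x, y: f(x, y), 0, 1, lambda y: 0, lambda y: 1 - y)
    print(v1, v2)                             # both equal 1/24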
We may also wish to change the variables of integration in a double integral. Thus suppose that
we wish to convert from the x and y to variables u and v defined by the continuously differentiable
functions:
u = g(x, y) v = h(x, y).
We shall suppose that these functions define a 1-1 mapping, i.e. for any given u and v we can find
a unique solution for x and y in terms of u and v, which we could describe as “inverse functions”,
say:
x = φ(u, v) y = ψ(u, v).
We now define the Jacobian of the transformation by the absolute value of the determinant of
the matrix of all partial derivatives of the original variables (i.e. x and y) with respect to the new
variables, in other words:
|J| = | det [ ∂x/∂u   ∂x/∂v
              ∂y/∂u   ∂y/∂v ] |
    = | (∂x/∂u)(∂y/∂v) − (∂y/∂u)(∂x/∂v) |
Figure 1.1: The region 0 ≤ x, 0 ≤ y, x + y ≤ 1, bounded by the two axes and the line x + y = 1.
Theorem 1.2 Suppose that the continuously differentiable functions g(x, y) and h(x, y) define
a one-to-one transformation of the variables of integration, with inverse functions φ(u, v) and
ψ(u, v). Then:
∫_{x=a}^{b} ∫_{y=c}^{d} f(x, y) dy dx = ∫_{u=a′}^{b′} ∫_{v=c′}^{d′} f[φ(u, v), ψ(u, v)] |J| dv du
This then defines a procedure for changing the variable of integration from x and y to u = g(x, y)
and v = h(x, y):
1. Solve for x and y in terms of u and v to get the inverse functions φ(u, v) and ψ(u, v).
2. Obtain all partial derivatives of φ(u, v) and ψ(u, v) with respect to u and v, and hence
obtain the Jacobian |J|.
3. Calculate the minimum and maximum values for u and v (where the ranges for the variable
in the inner integral may depend on the variable in the outer integral).
4. Write down the new integral, as given by the theorem.
Example: Evaluate:

∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{−(x² + y²)/2} dy dx
Now convert from the Cartesian co-ordinate system to polar co-ordinates, such that u is
the angle, and v the Euclidean distance from the origin to the point (x, y). The inverse
transformation is easy to write down directly as x = v cos u and y = v sin u. The Jacobian
is given by:
|J| = | det [ −v sin u   cos u
              v cos u    sin u ] |
    = | −v sin²u − v cos²u |
    = |−v| = v
Clearly the ranges for the new variables are 0 ≤ u ≤ 2π and 0 ≤ v < ∞. The integral is
thus given by:

∫_{u=0}^{2π} ∫_{v=0}^{∞} e^{−v²/2} v dv du
The inner integral (over v) is that evaluated in the previous example, i.e. it evaluates to
the constant 1. The final answer (after integrating the constant w.r.t. u) is just 2π.
Note: It is possible to start by evaluating the derivatives of u = g(x, y) and v = h(x, y) w.r.t.
x and y. But it is incorrect simply to invert each of the four derivatives (as functions of
x and y), and to substitute them in the above. What you have to do is to evaluate the
corresponding determinant first, and then to invert the determinant. This will still be a
function of x and y, and you will have to then further substitute these out. This seems to
be a more complicated route, and is not advised, although it is done in some textbooks.
1.5 Functions (Transformations) of Random Variables

Very often we know the distribution of a random variable X, but are interested in the distribution of some function of it, say Y = g(X). Consider first the simple linear case Y = aX + b for constants a > 0 and b. Then:

FY(y) = Pr[aX + b ≤ y] = Pr[X ≤ (y − b)/a] = FX((y − b)/a)
If the random variables are continuous, then the density of Y is obtained by differentiating FY (y)
w.r.t. y.
Exercise: Show that the p.d.f. of Y in the above case is given by fY(y) = (1/a) fX((y − b)/a). Note the 1/a term!
As a more complicated example, suppose that we define Y = 1/X, where X has the uniform
distribution on [0,1]. Thus FX (x) = x for 0 ≤ x ≤ 1. We then have:
FY(y) = Pr[1/X ≤ y] = Pr[X ≥ 1/y] = 1 − FX(1/y) = 1 − 1/y
In other words, for a general monotone transformation Y = g(X):

FY(y) = FX[g⁻¹(y)]        if g(x) is increasing
FY(y) = 1 − FX[g⁻¹(y)]    if g(x) is decreasing
The second line follows because for continuous distributions Pr[X ≥ g −1 (y)] = Pr[X > g −1 (y)].
For continuous distributions, we are particularly interested in obtaining the p.d.f. of the trans-
formed variable, which is obtained by differentiation of FY (y) with respect to y (assuming that
g(x) is continuously differentiable, and using the chain rule) to obtain:
fY(y) = fX[g⁻¹(y)] dg⁻¹(y)/dy         if g(x) is increasing
fY(y) = −fX[g⁻¹(y)] dg⁻¹(y)/dy        if g(x) is decreasing
Now, if g(x) is an increasing function of x, then clearly g⁻¹(y) is an increasing function of y, i.e. the derivative is positive, and vice versa. The two expressions for fY(y) can thus be combined into:

fY(y) = fX[g⁻¹(y)] |J|

where |J| = |dg⁻¹(y)/dy| is precisely the Jacobian introduced earlier for transformation of variables in integration.
Let us now illustrate this by means of a few examples.
Exponential distribution Suppose that fX (x) = e−x for 0 < x < ∞, and that we wish to find
the distribution of Y = e−X . Thus g(x) = e−x , and g −1 (y) = − ln(y). Since Y > 0, the
Jacobian is simply 1/y. Note that in fact 0 < Y < 1; within this range, the p.d.f. of Y is
thus given by:
fY(y) = e^{−(−ln y)} · (1/y) = y · (1/y) = 1
i.e. Y is uniformly distributed. This is in fact a special case of the “probability integral
transform” introduced in the next example.
Probability integral transform Suppose that for any continuous random variable X, we de-
fine a new random variable U = FX (X), where FX (x) is (as usual) the (cumulative)
distribution function of X. Note that 0 ≤ U ≤ 1. For any observed value of X, the
corresponding realization of U is the probability that a random observation from X would
not exceed the current observation. This is, of course, the sort of calculation we do in
ascertaining p-values for hypothesis tests. But U is a random variable, as each observation
of X generates a new observation of U .
In this case, it is useful to use the fact that dx/du is 1/[du/dx]. We differentiate u = FX (x)
directly, to obtain the p.d.f. fX (x). Since this is, by definition, positive, the Jacobian is thus
simply 1/fX (x). This needs in principle to be expressed as a function of u by substituting
from x = FX−1 (u), but the result is clearly:
fU(u) = fX(x) · (1/fX(x)) = 1
for 0 ≤ u ≤ 1, in other words U has the uniform distribution. Conversely, if we start with
U having the uniform distribution, and define a transformation X = FX−1 (U ), then this X
has the distribution defined by FX (x).
This result has many applications, one of which is in the generation of random numbers
from specified distributions on a computer, which is necessary for so-called “simulation” or
“Monte Carlo” methods of probability calculation. Most programming languages and other
software systems such as spreadsheets provide an in-built “random number generator”,
but this only generates numbers uniformly on [0,1]. It is often necessary, however, to
simulate the occurrence of random numbers from many other distributions, and the above
result allows us to generate a uniformly distributed random number (say U ), and then to
transform it to X = FX−1 (U ) having the desired distribution.
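As a concrete sketch of this random-number application (the target distribution, the exponential with parameter λ, is our own choice): FX(x) = 1 − e^{−λx} inverts to FX⁻¹(u) = −ln(1 − u)/λ, so exponential variates can be produced directly from uniform ones.

    import numpy as np

    rng = np.random.default_rng(2)
    lam = 1.5
    u = rng.uniform(size=100_000)          # U uniform on [0, 1] from the built-in generator
    x = -np.log(1 - u) / lam               # X = F_X^{-1}(U) has the exponential distribution

    # Sanity check: exponential(lambda) has mean 1/lambda and variance 1/lambda^2.
    print(x.mean(), 1 / lam)
    print(x.var(), 1 / lam**2)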
Beta distribution of the second kind Suppose that X has the beta distribution as defined
earlier, and that we transform from X to Y = (1 − X)/X. If X can be viewed as a
proportion (remember that 0 < X < 1), then Y can be interpreted as “odds against”.
Note that 0 < Y < ∞. The inverse transformation of y = (1 − x)/x is easily seen to be
x = 1/(1 + y), and thus the Jacobian is |J| = 1/(1 + y)². The p.d.f. of Y is thus:

fY(y) = (1/B(m, n)) [1/(1 + y)]^{m−1} [y/(1 + y)]^{n−1} · 1/(1 + y)² = (1/B(m, n)) · y^{n−1}/(1 + y)^{m+n}
This is the density of the beta distribution of the second kind, which is closely related to
the F-distribution. Note that because fY (y) integrates to 1, it follows that an alternative
definition for B(m, n) is:
definition for B(m, n) is:

B(m, n) = ∫_{0}^{∞} y^{n−1}/(1 + y)^{m+n} dy
Tutorial Exercises
1.1 Each toss of a coin results in a head with probability p. The coin is tossed until the first head
appears. Let X be the total number of tosses. Find the probability mass function of the random
variable X.
1.2 Let X have distribution function

FX(x) = 0,  x < 0
      = x/2,  0 ≤ x ≤ 2
      = 1,  x > 2

and let Z = √X. Find
(a) Pr(1/4 ≤ X ≤ 3/4)  (b) Pr(X ≤ 3)  (c) Pr(−2 ≤ X < 0)
(d) Pr(X² ≤ X)  (e) the distribution of Z.
1.3 Let X have distribution function

FX(x) = 0,  x < −1
      = 1 − p,  −1 ≤ x < 0
      = 1 − p + xp/2,  0 ≤ x ≤ 2
      = 1,  x > 2

Find
(a) Pr(X = −1)  (b) Pr(X = 0)  (c) Pr(X ≥ 1)
1.4 Two students are rolling a ball towards a target. They take turns rolling the ball and stop when the
target is hit. Student A goes first and has probability pA of hitting the target on any roll. Student
B, who rolls second, has probability pB of hitting the target. The outcomes of the successive trials
are assumed to be independent.
(a) Find the probability mass function of the total number of attempts. (Hint: Define random
variable X (the number of attempts); list the possible values of X and the sample space)
(b) What is the probability that player B wins?
1.5 Find an expression for the cumulative distribution function of a geometric random variable.
1.6 If X is a geometric random variable with p = 1/4, for what value of k is Pr(X ≤ k) ≈ 0.90?
1.7 Suppose that in a city the number of suicides can be approximated by a Poisson process with λ = 1/3 per month.
(a) Find the probability of k suicides in a year for k = 0, 1, 2, .... What is the most probable
number of suicides?
(b) What is the probability of 2 suicides in 14 days?
1.8 Phone calls are received at a certain residence as a Poisson process with parameter λ = 3 per hour.
(a) If John takes a 10-minute bath, what is the probability that the phone rings during that time?
(b) How long can his bath be if he wishes the probability of receiving no phone calls to be at
most 0.5?
1.9 The Cauchy cumulative distribution function is
FX(x) = 1/2 + (1/π) tan⁻¹(x),  for −∞ < x < ∞
(a) Show that this is a cdf.
(b) Find the density function.
(c) Find x such that Pr(X > x) = 0.3.
1.10 Suppose that X has the density function fX (x) = cx2 for 0 ≤ x ≤ 2 and fX (x) = 0 otherwise.
(a) Find c. (b) Find the cdf. (c) What is Pr(0.9 ≤ X ≤ 1.4)?
1.11 Show by a change of variables that
Γ(x) = 2 ∫_{0}^{∞} t^{2x−1} e^{−t²} dt = ∫_{−∞}^{∞} e^{xt} e^{−e^t} dt
1.12 If U is uniform on [0,1], find the density function of Z, where Z = √U.
1.13 If U is uniform on [−1,1], find the density function of W = U².
1.14 Find the density function of X = eZ , where Z ∼ N (µ, σ 2 ). This is called the lognormal density,
since log X is normally distributed.
1.15 The Weibull cumulative distribution function is
FX(x) = 1 − e^{−(x/α)^β},  x, α, β ≥ 0
Chapter 2
Bivariate Distributions
Up to now, we have assumed that our “random experiment” results in the observation of the
value of a single “random variable”. Quite typically, however, a single experiment (observation
of the real world) will result in a (possibly quite large) collection of measurements, which can be
expressed as a vector of observations describing the outcome of the experiment. For example:
A medical researcher may record and report various characteristics of each subject being
tested, such as blood pressure, height, weight, smoking habits, etc., as well as the direct
medical observations;
An investment analyst would wish to look at a large number of financial indicators for each
share under consideration;
The environmentalist would be interested in recording many different pollutants in the air.
2.1 Joint and marginal distributions

The joint (cumulative) distribution function of the pair (X, Y) is defined by:

FXY(x, y) = Pr[X ≤ x, Y ≤ y]

This is a function defined on the real plane, which lies between 0 and 1, and which is non-decreasing in both arguments. Noting that {X ≤ b, Y ≤ y} = {X ≤ a, Y ≤ y} ∪ {a < X ≤ b, Y ≤ y} for any a < b, where the latter two events are mutually exclusive, we see that:

Pr[a < X ≤ b, Y ≤ y] = FXY(b, y) − FXY(a, y)

for any real number y. A similar decomposition of {Y ≤ d} into two disjoint intervals leads to:

Pr[a < X ≤ b, c < Y ≤ d] = FXY(b, d) − FXY(a, d) − FXY(b, c) + FXY(a, c)

For discrete random variables taking non-negative integer values only, the joint distribution is more usefully described by the joint probability mass function pXY(x, y) = Pr[X = x, Y = y], i.e. by the probability of the joint occurrence of X = x and Y = y, which is by definition zero if x or y are not non-negative integers. It is simple to confirm that:
FXY(x, y) = Σ_{i=0}^{x} Σ_{j=0}^{y} pXY(i, j)
Note that the event {X = x} is equivalent to the union of all (mutually disjoint) events of the
form {X = x, Y = y} as y ranges over all non-negative integers. It thus follows that the marginal
probability mass function of X, i.e. the probability that X = x, must be given by:
pX(x) = Σ_{y=0}^{∞} pXY(x, y)
and similarly:
pY(y) = Σ_{x=0}^{∞} pXY(x, y).
Example A Suppose that an unbiased coin is tossed 5 times. Then let X be the number of heads in the first three tosses, and Y the number of heads in the last three tosses. It is a simple exercise (first year!) to compute the probabilities for all 16 events of the form {X = x, Y = y} for x = 0, 1, 2, 3 and y = 0, 1, 2, 3. These are given in the following table:
         x = 0     x = 1     x = 2     x = 3
y = 0    0.03125   0.0625    0.03125   0
y = 1    0.0625    0.15625   0.125     0.03125
y = 2    0.03125   0.125     0.15625   0.0625
y = 3    0         0.03125   0.0625    0.03125
Summing down each column then gives the marginal p.m.f. for X; so, for example:

pX(0) = 0.03125 + 0.0625 + 0.03125 + 0 = 0.125

and similarly pX(1) = pX(2) = 0.375 and pX(3) = 0.125. The marginal distribution
corresponds, of course, to the binomial distribution with n = 3 and p = 0.5, since X is
simply the number of successes (heads) in n = 3 trials. It is easily seen that this is also
the marginal distribution of Y (obtained by adding across the rows of the table).
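The table in Example A can be reproduced by brute-force enumeration; a short sketch (our own) lists all 2⁵ equally likely outcomes and tabulates X (heads in tosses 1–3) and Y (heads in tosses 3–5).

    from itertools import product
    import numpy as np

    joint = np.zeros((4, 4))                      # joint[x, y] = Pr[X = x, Y = y]
    for outcome in product([0, 1], repeat=5):     # 1 = head; all 32 outcomes have probability 1/32
        x = sum(outcome[:3])                      # heads in the first three tosses
        y = sum(outcome[2:])                      # heads in the last three tosses
        joint[x, y] += 1 / 32

    print(joint)                                  # rows x, columns y: matches the table above
    print(joint.sum(axis=1))                      # marginal p.m.f. of X: 0.125, 0.375, 0.375, 0.125
    print(joint.sum(axis=0))                      # marginal p.m.f. of Y: the same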
For continuous random variables, we need to introduce the concept of the joint probability density
function fXY (x, y). In principle, the joint p.d.f. is defined to be the function for which:
Pr[a < X ≤ b, c < Y ≤ d] = ∫_{x=a}^{b} ∫_{y=c}^{d} fXY(x, y) dy dx
for all a < b and c < d. Note that in particular this requires that:
FXY(x, y) = ∫_{u=−∞}^{x} ∫_{v=−∞}^{y} fXY(u, v) dv du.
Consider now the event {x < X ≤ x + h, y < Y ≤ y + h}. As h → 0, the probability must tend
to h2 fXY (x, y), since the volume under the surface defined by the double integral tends to the
area of the base (h2 ) times the height (= the value of the function at this point). In other words:
FXY(x + h, y + h) − FXY(x, y + h) − FXY(x + h, y) + FXY(x, y) ≈ h² fXY(x, y)

in the limit as h → 0. Re-arranging terms, it follows that fXY(x, y) is the limit as h → 0 of:

(1/h²) [FXY(x + h, y + h) − FXY(x, y + h) − FXY(x + h, y) + FXY(x, y)]
= (1/h) { [FXY(x + h, y + h) − FXY(x, y + h)]/h − [FXY(x + h, y) − FXY(x, y)]/h }

so that:

fXY(x, y) = ∂/∂y [ ∂FXY(x, y)/∂x ] = ∂²FXY(x, y)/∂x∂y
As for discrete random variables, we can also define marginal p.d.f.’s for X and Y . In order to
maintain consistency of definitions, the marginal probability density function for X should be
the derivative w.r.t. x of FX (x) = FXY (x, ∞), i.e. the derivative w.r.t. x of:
∫_{u=−∞}^{x} ∫_{y=−∞}^{∞} fXY(u, y) dy du

which gives:

fX(x) = ∫_{y=−∞}^{∞} fXY(x, y) dy

and similarly fY(y) = ∫_{x=−∞}^{∞} fXY(x, y) dx.
The marginal p.d.f.’s describe the overall variation in one variable, irrespective of what happens
with the other. For example, if X and Y represent height and weight of a randomly chosen
individual from a population, then the marginal distribution of X describes the distribution of
heights in the population.
Example B Suppose that X and Y are continuous random variables, with joint p.d.f. given by:
fXY(x, y) = x² + xy/3   for 0 ≤ x ≤ 1, 0 ≤ y ≤ 2
(and by the usual assumption, fXY(x, y) = 0 everywhere else). Then, for example, if we wish to evaluate the probability of the event {1/4 ≤ X ≤ 1/2; Y ≤ 1}, as we only have to perform the relevant integrals over the region in X-Y space for which fXY(x, y) > 0, we will have:

Pr[1/4 ≤ X ≤ 1/2, Y ≤ 1] = ∫_{x=1/4}^{1/2} ∫_{y=0}^{1} (x² + xy/3) dy dx
= ∫_{x=1/4}^{1/2} [x²y + xy²/6]_{y=0}^{1} dx
= ∫_{x=1/4}^{1/2} (x² + x/6) dx
≈ 0.052
We shall leave it as an exercise to show that FXY(x, y) = x³y/3 + x²y²/12 for 0 ≤ x ≤ 1; 0 ≤ y ≤ 2. The marginal probability densities are:
fX(x) = ∫_{y=0}^{2} (x² + xy/3) dy = 2x² + 2x/3 = 2x(x + 1/3)
for 0 ≤ x ≤ 1, and:
fY(y) = ∫_{x=0}^{1} (x² + xy/3) dx = 1/3 + y/6 = (2 + y)/6
for 0 ≤ y ≤ 2.
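A numerical check of Example B with SciPy (our own sketch, not part of the notes): the joint density integrates to 1, the event probability comes out near 0.052, and the marginal of X agrees with 2x(x + 1/3).

    from scipy.integrate import dblquad, quad

    f = lambda x, y: x**2 + x * y / 3                   # joint p.d.f. on 0<=x<=1, 0<=y<=2

    # Total probability (should be 1).
    print(dblquad(lambda y, x: f(x, y), 0, 1, lambda x: 0, lambda x: 2)[0])
    # Pr[1/4 <= X <= 1/2, Y <= 1] (should be about 0.052 = 5/96).
    print(dblquad(lambda y, x: f(x, y), 0.25, 0.5, lambda x: 0, lambda x: 1)[0])
    # Marginal of X at x = 0.7, compared with 2x(x + 1/3).
    print(quad(lambda y: f(0.7, y), 0, 2)[0], 2 * 0.7 * (0.7 + 1/3))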
Example C Now suppose that X and Y are continuous random variables, with joint p.d.f. given
by:
fXY(x, y) = λ³ x e^{−λy}   for 0 ≤ x ≤ y < ∞
This turns out to be a more tricky problem, and students are strongly advised in such
situations to sketch out the region in X-Y for which fXY (x, y) > 0. This is indicated in
Figure 2.1 by the shaded area (which should, of course, be extended infinitely far upwards).
The marginal p.d.f. for X is given by:
fX(x) = ∫_{y=x}^{∞} λ³ x e^{−λy} dy = [−λ² x e^{−λy}]_{y=x}^{∞} = λ² x e^{−λx}

Similarly, the marginal p.d.f. for Y is:

fY(y) = ∫_{x=0}^{y} λ³ x e^{−λy} dx = λ³ y² e^{−λy} / 2

for y ≥ 0.
Figure 2.1: The region 0 ≤ x ≤ y over which fXY(x, y) > 0, bounded by the y-axis and the line x = y.
2.2 Independence and conditional distributions

Recall from the first year notes the concepts of conditional probabilities and of independence of events. If A and B are two events then the probability of A conditional on the occurrence of B is given by:

Pr[A|B] = Pr[A ∩ B] / Pr[B]
provided that Pr[B] > 0. The concept of the intersection of two events (A∩B) is that of the joint
occurrence of both A and B. The events A and B are independent if Pr[A ∩ B] = Pr[A]. Pr[B],
which implies that Pr[A|B] = Pr[A] and Pr[B|A] = Pr[B] whenever the conditional probabilities
are defined.
The same ideas carry over to the consideration of bivariate (or multivariate) random variables.
For discrete random variables, the linkage is direct: we have immediately that:
Pr[X = x | Y = y] = Pr[X = x; Y = y] / Pr[Y = y] = pXY(x, y) / pY(y)
provided that pY (y) > 0. This applies for any x and y such that pY (y) > 0, and defines the
conditional probability mass function for X, given that Y = y. This we write as:
pX|Y(x|y) = pXY(x, y) / pY(y).
In similar manner, we define the conditional probability mass function for Y , given that X = x,
i.e. pY |X (y|x).
By definition of independent events, the events {X = x} and {Y = y} are independent if and
only if pXY (x, y) = pX (x).pY (y). If this holds true for all x and y, then we say that the random
variables X and Y are independent. In this case, it is easily seen that all events of the form
{a < X ≤ b} and {c < Y ≤ d} are independent of each other, which (inter alia) also implies
that FXY (x, y) = FX (x).FY (y) for all x and y.
Example A (cont.) Refer back to example A on the previous section, and calculate the con-
ditional probability mass function for X, given Y = 2. We have noted that pY (2) = 0.375,
and thus:
pX|Y(0|2) = 0.03125/0.375 = 0.0833
pX|Y(1|2) = 0.125/0.375 = 0.3333
pX|Y(2|2) = 0.15625/0.375 = 0.4167
pX|Y(3|2) = 0.0625/0.375 = 0.1667
Note that the conditional probabilities again add to 1, as required.
The random variables are not independent, since (for example) pX (2) = pY (2) = 0.375,
and thus pX (2).pY (2) = 0.140625, while pXY (2, 2) = 0.125. This should not surprise us,
since both X and Y depend on the outcome of the third toss of the coin.
Once again there is a slight technical problem when it comes to continuous distributions, since
all events of the form {X = x} have zero probability. Nevertheless, we still define the conditional
probability density function for X given Y = y as:
fX|Y(x|y) = fXY(x, y) / fY(y)
provided that fY (y) > 0, and similarly for Y given X = x. This corresponds to the formal
definition of conditional probabilities in the sense that for “small enough” values of h > 0:
Pr[a < X ≤ b | y < Y ≤ y + h] = ∫_{x=a}^{b} fX|Y(x|y) dx
The continuous random variables X and Y are independent if and only if fXY (x, y) = fX (x)fY (y)
for all x and y, which is clearly equivalent to the statement that the marginal and conditional
p.d.f.’s are identical.
Example B (cont.) In this example, we have from the previous results that:

fX(x) fY(y) = 2x(x + 1/3) · (2 + y)/6

which is evidently different from fXY(x, y) (even if there may be specific values of x and
y for which they are numerically equal). The conditional p.d.f.’s are easily written down.
For example, for any y between 0 and 2:
fX|Y(x|y) = 2x(3x + y) / (2 + y)
for 0 ≤ x ≤ 1. Note that the ranges have different roles for the two variables. The p.d.f. of
X conditional on Y = y is only defined if 0 ≤ y ≤ 2. If the p.d.f. is defined, then it is only
non-zero in the range 0 ≤ x ≤ 1. The roles of the two ranges are of course interchanged if
we look at fY |X (y|x).
Example C (cont.) In example B, it should have been immediately obvious that X and Y
cannot be independent, even without calculating the marginal p.d.f.’s. The point is that
the function fXY (x, y) = x2 + xy/3 cannot be factored into a function of x only and a
function of y only, and thus can never be written in the form fXY (x, y) = fX (x)fY (y). One
might be tempted to conjecture that the converse is also true, i.e. that if the joint p.d.f.
can be factored, then the variables are independent. Example C demonstrates that this is
false. The joint p.d.f. factors, but:
fX(x) fY(y) = (1/2) λ⁵ x y² e^{−λ(x+y)}
which is certainly not fXY (x, y). We note also that:
The conditional p.d.f. for X given Y = y is 2x/y² for 0 ≤ x ≤ y (zero elsewhere), and is defined for all non-negative y.

The conditional p.d.f. for Y given X = x is λ e^{−λ(y−x)} for y ≥ x (zero elsewhere), and is defined for all non-negative x.
2.3 The bivariate normal distribution

In the same way that the normal distribution played such an important role in the univariate statistics of the first year syllabus, its generalization is equally important for bivariate (and in
general, multivariate) random variables. The bivariate normal distribution is defined by the
following joint probability density function for (X, Y ):
fXY(x, y) = [1 / (2π σX σY √(1 − ρ²))] exp{ −Q(x, y) / [2(1 − ρ²)] }     (2.1)

where:

Q(x, y) = [(x − µX)/σX]² − 2ρ [(x − µX)/σX] [(y − µY)/σY] + [(y − µY)/σY]²

An equivalent matrix form of (2.1) is:

fXY(x, y) = [1 / (2π |Σ|^{1/2})] exp{ −(z − µ)ᵀ Σ⁻¹ (z − µ) / 2 }

where z and µ are column vectors of (x, y) and (µX, µY) respectively, and the matrix Σ is given by:

Σ = [ σX²      ρσXσY
      ρσXσY    σY²   ]
This form applies also to the general multivariate normal distribution, except that for a multi-
variate random variable of dimension p, the 2π term is raised to the power of p/2.
We now briefly introduce a few key properties of the bivariate normal distribution:
Substituting back for x and y, we have that Q(x, y) can be written as:
Q(x, y) = [ (x − µX)/σX − ρ(y − µY)/σY ]² + (1 − ρ²) [ (y − µY)/σY ]²
        = (1/σX²) [ x − µX − (ρσX/σY)(y − µY) ]² + (1 − ρ²) [ (y − µY)/σY ]²
It follows from the above expression for Q(x, y), and from (2.1), that fXY (x, y) can be
factorized as fXY (x, y) = g(y)h(x, y), where:
g(y) = [1/(√(2π) σY)] exp{ −(y − µY)² / (2σY²) }

and:

h(x, y) = [1/(√(2π) σX √(1 − ρ²))] exp{ −[ x − µX − (ρσX/σY)(y − µY) ]² / [2σX²(1 − ρ²)] }
We note that for any fixed value of y, h(x, y) is the pdf of a normal distribution in x, having mean µX + [ρσX/σY](y − µY) and variance σX²(1 − ρ²). It thus follows that:

fY(y) = ∫_{x=−∞}^{∞} fXY(x, y) dx = g(y) ∫_{x=−∞}^{∞} h(x, y) dx = g(y) · 1

In other words, fY(y) is the pdf of a normal distribution with mean µY and variance σY². Similarly, the marginal distribution of X can be seen to be normal with mean µX and variance σX².
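These marginal properties are easily checked by simulation; a sketch using NumPy's multivariate normal generator (the parameter values below are our own arbitrary choices).

    import numpy as np

    rng = np.random.default_rng(3)
    mu_x, mu_y, sd_x, sd_y, rho = 1.0, -2.0, 2.0, 0.5, 0.7
    cov = [[sd_x**2, rho * sd_x * sd_y],
           [rho * sd_x * sd_y, sd_y**2]]              # the matrix Sigma given above
    xy = rng.multivariate_normal([mu_x, mu_y], cov, size=200_000)

    print(xy[:, 0].mean(), xy[:, 0].var())            # approx mu_x = 1.0 and sd_x^2 = 4.0
    print(xy[:, 1].mean(), xy[:, 1].var())            # approx mu_y = -2.0 and sd_y^2 = 0.25
    print(np.cov(xy.T)[0, 1], rho * sd_x * sd_y)      # sample covariance approx 0.7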
A fourth property of the bivariate normal distribution will be introduced in the next chapter.

2.4 Functions of bivariate random variables

There are several reasons why we may be interested in functions of two (or more) random variables. It may be useful to summarize the data in some way, by taking the means, sums, differences
or ratios of similar or related variables. For example, height and weight together give some
measure of obesity, but weight divided by (height)2 may be a more succinct measure.
Some function, such as the ratio of two random variables, may be more physically mean-
ingful than the individual variables on their own. For example, the ratio of prices of two
commodities is more easily compared across countries and across time than are the indi-
vidual variables.
Suppose that X and Y are random variables, and that Z=X+Y . The event {Z = z} is equivalent
to the union of all events of the form {X = x, Y = z − x} (taken over all possible values of x).
Suppose that X and Y , and by implication also Z, are discrete random variables, taking on
non-negative integer values only. We then have that:
pZ(z) = Pr[Z = z] = Σ_{x=0}^{∞} pXY(x, z − x)
Note, however, that for x > z we have y = z − x < 0 for which the probability mass is zero. The
above expression for pZ (z) thus simplifies slightly further to:
pZ(z) = Σ_{x=0}^{z} pXY(x, z − x)
since pXY (x, z − x) = 0 for x < 0 and for z − x < 0, i.e. for x > z.
An important special case is that in which X and Y are independent random variables. In this
case, the expression for pZ (z) is given by:
pZ(z) = Σ_{x=−∞}^{∞} pX(x) pY(z − x)
For continuous random variables, the corresponding argument runs via the distribution function:

FZ(z) = Pr[X + Y ≤ z] = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{z−x} fXY(x, y) dy dx

Note that the upper limit for the inner integral depends on the outer variable of integration (x).
Let us, however, change variables in the inner integral from y to u = x + y (remembering that
while performing this integration, we treat x as a constant). Clearly y = u − x, |J| = 1, and the
limits on the integral run from u = −∞ to u = z, i.e. we now have:
FZ(z) = ∫_{x=−∞}^{∞} ∫_{u=−∞}^{z} fXY(x, u − x) du dx
Now the limits on the inner integral are fixed, and it is permissible to simply reverse the order of integration to obtain:

FZ(z) = ∫_{u=−∞}^{z} ∫_{x=−∞}^{∞} fXY(x, u − x) dx du

Differentiating with respect to z then gives the p.d.f. of Z:

fZ(z) = ∫_{x=−∞}^{∞} fXY(x, z − x) dx

and in particular, if X and Y are independent, fZ(z) = ∫_{x=−∞}^{∞} fX(x) fY(z − x) dx.
Example: Suppose that X and Y are independent and have the same exponential distribution
with parameter λ. Then:

fZ(z) = ∫_{0}^{z} λ² e^{−λx} e^{−λ(z−x)} dx
Note the limits on the integral. The lower limit arises from the non-negativity of X, while
the non-negativity of Y implies that fY (z − x) = 0 for z − x < 0, i.e. for x > z.
The product of the two exponential terms gives e−λz which does not depend on x (i.e. is a
constant term in the integration). Thus:
fZ(z) = ∫_{0}^{z} λ² e^{−λz} dx = λ² z e^{−λz}
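A simulation sketch of this convolution result (λ and the sample size are our own choices): the sum of two independent exponential(λ) variables should follow the density λ²z e^{−λz}, i.e. the gamma distribution with α = 2.

    import numpy as np

    rng = np.random.default_rng(4)
    lam = 2.0
    z = rng.exponential(1/lam, size=(200_000, 2)).sum(axis=1)

    # Gamma(alpha=2, lambda) has mean 2/lambda and variance 2/lambda^2.
    print(z.mean(), 2 / lam)
    print(z.var(), 2 / lam**2)
    # Density check near z0: histogram height vs lam^2 * z0 * exp(-lam * z0).
    z0, h = 1.0, 0.05
    print(np.mean((z > z0 - h) & (z < z0 + h)) / (2 * h), lam**2 * z0 * np.exp(-lam * z0))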
Exercises: Essentially the same principles, used with some caution, apply in deriving the dis-
tributions of X − Y , XY and X/Y . Try some of these.
For more complicated transformations, the above ad hoc approaches tend to break down. We
need more general principles that can be applied in any circumstances, without having to spot
“clever” tricks. We shall only state, and not prove, the general principle. This will be stated for
the bivariate case only, but does in fact apply to multivariate distributions in general.
Suppose that we know the bivariate probability distribution of a pair (X, Y) of random variables,
and that two new random variables U and V are defined by the following transformations:
U = g(X, Y ) V = h(X, Y )
Let us further suppose that all the above functions are continuously differentiable. We can then
define the Jacobian of the transformation (precisely as we did for change of variables in multiple
integrals) as follows:
|J| = | det [ ∂φ(u, v)/∂u   ∂φ(u, v)/∂v
              ∂ψ(u, v)/∂u   ∂ψ(u, v)/∂v ] |
    = | ∂φ(u, v)/∂u · ∂ψ(u, v)/∂v − ∂ψ(u, v)/∂u · ∂φ(u, v)/∂v |
Theorem 2.1 Suppose that the joint p.d.f. of X and Y is given by fXY(x, y), and that the continuously differentiable functions g(x, y) and h(x, y) define a one-to-one transformation of the random variables X and Y to U = g(X, Y) and V = h(X, Y), with inverse transformation given by X = φ(U, V) and Y = ψ(U, V). The joint p.d.f. of U and V is then given by:

fUV(u, v) = fXY[φ(u, v), ψ(u, v)] |J|
Note 1: We have to have as many new variables (i.e. U and V ) as we had original variables
(in this case 2). The method only works in this case. Even if we are only interested in a
single transformation (e.g. U = X + Y ) we need to “invent” a second variable, quite often
something trivial such as V = Y . We will then have the joint distribution of U and V , and
we will need to extract the marginal p.d.f. of U by integration.
Note 2: Some texts (e.g. Rice!) define the Jacobian in terms of the derivatives of g(x, y) and
h(x, y) w.r.t. x and y. With this definition, one must use the inverse of |J| in the theorem,
as it can be shown that with our definition of |J|:
|J| = | det [ ∂g(x, y)/∂x   ∂g(x, y)/∂y
              ∂h(x, y)/∂x   ∂h(x, y)/∂y ] |⁻¹
This is sometimes the easier way to compute the Jacobian in any case.
Example Suppose that X and Y are independent standard normal random deviates, i.e. their
joint p.d.f. is:
fXY(x, y) = (1/2π) e^{−(x² + y²)/2}

Let us transform from x, y to u = x² + y² and v = tan⁻¹(y/x). Note that 0 < u < ∞ and 0 < v < 2π. The inverse transformations are easily seen to be x = √u cos(v) and y = √u sin(v), so that:

|J| = | det [ (1/2)u^{−1/2} cos(v)   −√u sin(v)
              (1/2)u^{−1/2} sin(v)    √u cos(v) ] | = 1/2
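Applying Theorem 2.1 to this example gives fUV(u, v) = (1/2π) e^{−u/2} · (1/2), so that integrating over 0 < v < 2π leaves fU(u) = (1/2) e^{−u/2}. A brief simulation sketch (our own) checking this claim:

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.standard_normal(200_000)
    y = rng.standard_normal(200_000)
    u = x**2 + y**2                            # U = X^2 + Y^2

    print(u.mean())                            # f_U(u) = 0.5*exp(-u/2) has mean 2
    u0, h = 1.0, 0.05                          # density check near u0
    print(np.mean((u > u0 - h) & (u < u0 + h)) / (2 * h), 0.5 * np.exp(-u0 / 2))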
Example Suppose that X and Y are independent random variables, each having the gamma
distribution with λ = 1 in each case, and with α = a and α = b respectively, i.e.:
Tutorial Exercises
2.1 The joint mass function of two discrete random variable, X and Y , is given in the following table:
           x = 1    x = 2    x = 3
y = 1      0.20     0.05     0.04
y = 2      0.05     0.10     0.07
y = 3      0.04     0.04     0.14
y = 5      0.00     0.03     0.24
2.4 Suppose that (X, Y ) is uniformly distributed over the region defined by 0 ≤ y ≤ 1 − x2 and
−1 ≤ x ≤ 1.
(a) Find the marginal densities of X and Y .
(b) Find the two conditional densities.
(c) Find Pr(X ≤ 1/2).
(d) Find Pr(Y < 0.3|X = 0.5) and Pr(Y < 0.3|X = 5).
2.5 The joint probability density function of X and Y is given by
(a) Determine k.
(b) Calculate the marginal pdf’s of X and Y .
fX|Y(x|y) = c x/y²,  0 ≤ x ≤ y
          = 0,  elsewhere

fY(y) = k y⁴,  0 ≤ y ≤ 1
      = 0,  elsewhere
2.13 The random variables X1 and X2 are independent, both with pdf
fX(x) = 1/x²,  1 < x
      = 0,  otherwise

Let Y1 = X1/(X1 + X2) and Y2 = X1 + X2.

(a) Find the joint density of Y1 and Y2.
(b) Sketch the region where fY1,Y2(y1, y2) > 0.
(c) Find the marginal pdf of Y1 .
2.14 Let U1 and U2 be independent and uniform on [0,1]. Find and sketch the density function of
S = U1 + U2 .
2.15 Suppose that a bivariate normal distribution has
(a) Write down the conditional distribution of X|Y (use shorthand notation).
(b) Write down the conditional distribution of Y |X (use shorthand notation).
(c) If ρ > 0 find ρ when Pr[5 < X < 11|Y = 10] = 0.4441
Chapter 3
Expectations and Moments

3.1 Moments of univariate distributions

The expectation of a function g(X) of a continuous random variable X is defined by:

E[g(X)] = ∫_{−∞}^{∞} g(x) fX(x) dx

(with the integral replaced by a sum over the probability mass function in the discrete case). If we interpret the probabilities in a frequentist sense, then this can be seen as a "long-run average" of g(X). More generally, it represents in a sense the "centre of gravity" of the distribution of g(X). An important special case arises when g(x) = x^r for some positive integer value of r; the expectation is then called the r-th moment of the distribution of X, written as:

µ′r = E[X^r] = ∫_{−∞}^{∞} x^r fX(x) dx
The case r = 1, or E[X], is well-known: it is simply the expectation, or the mean, of X itself,
which we shall often write as µX .
For r > 1 it is more convenient to work with the expectations of (X − µX )r . These are the
central moments of X, where the r-th central moment is defined by:
µr = E[(X − µX)^r] = ∫_{−∞}^{∞} (x − µX)^r fX(x) dx
Each µr measures in its own unique way, the manner in which the distribution (or the observed
values) of X are spread out around the mean µX . You should be familiar with the case of r = 2,
which gives the variance of X, also written as σX², or Var[X]. The variance is always non-negative
(in fact, strictly positive, unless X takes on one specific value with probability 1), and measures
the magnitude of the spread of the distribution. This should be well-known. We now introduce
two further central moments, the 3rd and the 4th.
Consider a probability density function which has a “skewed” shape similar to that shown in
Figure 3.1. The mean of this distribution is at x = 2, but the distribution is far from symmetric
around this mean. Now consider what happens when we examine the third central moment. For
X < µX , we have (X − µX )3 < 0, while (X − µX )3 > 0 for X > µX . For a perfectly symmetric
distribution, the positives and negatives will cancel out in taking the expectation, so that µ3 = 0.
But for a distribution such as that shown in Figure 3.1, very large positive values of (X − µX )
will occur, but no very large negative values. The nett result is that µ3 > 0, and we term such
a distribution positively skewed. Negatively skewed distributions (with the long tail to the left)
can also occur, but are perhaps less common.
Figure 3.1: Example of a skew distribution
Of course, the magnitude of µ3 will also depend on the amount of spread. In order to obtain a
feel for the “skewness” of the distribution, it is useful to eliminate the effect of the spread itself
(which is measured already by the variance). This is done by defining a coefficient of skew by:
µ3 / σX³ = µ3 / (√Var(X))³
which, incidentally, does not depend on the units of measurement used for X. For the distribution
illustrated in Figure 3.1, the coefficient of skew turns out to be 1.4. (This distribution is, in fact,
the gamma distribution with α = 2.)
In trying to interpret the fourth central moment, we may find it useful to examine Figure 3.2.
The two distributions do in fact have the same mean (0) and variance (1). This may be surprising
at first sight, as the more sharply peaked distribution appears to be more tightly concentrated
around the mean. What has happened, however, is that this distribution has much longer tails.
The flatter distribution (actually a normal distribution) has a density very close to zero outside
of the range −3 < x < 3; but for the more sharply peaked distribution, the density falls away
much more slowly, and is still quite detectable at ±5. In evaluating the variance, the occasionally
very large values for (X − µX)² inflate the variance sufficiently to produce equal variances for
the two distributions. But consider what happens when we calculate µ4 : the occasional large
discrepancies create an even greater effect when raised to the power 4, and thus µ4 is larger
for the sharply-peaked-and-long-tailed distribution than for the flatter, short-tailed distribution.
The single word to describe this is kurtosis, and the sharp-peaked, long-tailed distribution is said
to have greater kurtosis than the other.
Figure 3.2: Two distributions with the same mean and variance but different kurtosis.
Thus the fourth central moment, µ4 , is a measure of kurtosis, in the sense that for two distri-
butions having the same variance, the one with the higher µ4 has the greater kurtosis (is more
sharply peaked and long-tailed). But as with the third moment, µ4 is also a measure of spread,
and thus once again it is useful to have a measure of kurtosis only (describing the shape, not the
spread, of the distribution). This is provided by the coefficient of kurtosis defined by:
µ4 µ4
4 =
σX (Var(X))2
which again does not depend on the units of measurement used for X.
For the normal distribution (the flatter of the two densities illustrated in Figure 3.2), the coeffi-
cient of kurtosis is always 3 (irrespective of mean and variance). The more sharply peaked density
in Figure 3.2 is that of a random variable which follows a so-called “mixture distribution”, i.e.
its value derives from the normal distribution with mean 0 and variance 4 with probability 0.2,
and from the normal distribution with mean 0 and variance 0.25 otherwise. The coefficient of
kurtosis in this case turns out to be 9.75.
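Both quoted values can be recovered by simulation; a sketch (our own, with large but arbitrary sample sizes) computing sample coefficients of skew and kurtosis for the gamma distribution with α = 2 (skew ≈ 1.4) and for the normal mixture just described (kurtosis ≈ 9.75).

    import numpy as np

    rng = np.random.default_rng(6)

    def coeffs(sample):
        # Sample coefficients of skew (mu3/sigma^3) and kurtosis (mu4/sigma^4).
        d = sample - sample.mean()
        var = np.mean(d**2)
        return np.mean(d**3) / var**1.5, np.mean(d**4) / var**2

    g = rng.gamma(shape=2.0, scale=1.0, size=1_000_000)          # gamma with alpha = 2
    print(coeffs(g))                                             # skew approx 1.41

    pick = rng.uniform(size=1_000_000) < 0.2                     # N(0,4) w.p. 0.2, N(0,0.25) otherwise
    m = np.where(pick, rng.normal(0, 2.0, 1_000_000), rng.normal(0, 0.5, 1_000_000))
    print(coeffs(m))                                             # kurtosis approx 9.75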
There is, however, an alternative definition for the coefficient of kurtosis obtained by subtracting
3 from the above, i.e.:
µ4/σX⁴ − 3

so that it measures in effect departure from normality (negative values corresponding to flatter than normal distributions, and positive values to distributions which are more sharply peaked and heavy tailed than the normal).
One could in principle continue further with higher order moments still, but there seems to be
little practical value in doing so: the first four moments do give considerable insight into the
shape of the distribution. Working with moments has great practical value, since from any set of
observations x1, x2, . . . , xn of a random variable, we can obtain the corresponding sample moments based on Σ_{i=1}^{n} (xi − x̄)^r. These can be used to match the sample data to a particular family of distributions.
Some useful formulae: Apart from the first moment, it is the central moments µr which
best describe the shape of the distribution, but more often than not it is easier to calculate
the uncentred moments µ0r . Fortunately, there are close algebraic relationships between
the two, which are stated below for r = 2, 3, 4. The derivation of these relationships are
left as an exercise. The relationship for the variance is so frequently used that it is worth
remembering (although all three formulae are easily recollected once their derivation is
understood):
µ2 = σX² = µ′2 − (µX)²
µ3 = µ′3 − 3µ′2 µX + 2(µX)³
µ4 = µ′4 − 4µ′3 µX + 6µ′2 (µX)² − 3(µX)⁴
Example: Suppose that X has the exponential distribution with parameter λ. The mean is
given by:
µX = ∫_0^∞ xλe^{−λx} dx
which can be integrated by parts as follows:
µX = [x(−e^{−λx})]_0^∞ + ∫_0^∞ e^{−λx} dx = 0 + 1/λ = 1/λ
Theorem 3.1 Let X be any random variable, whose mean µX and variance σX² exist (i.e. are finite). Then for any k > 0:
Pr[|X − µX| ≥ kσX] ≤ 1/k²
Proof: We shall prove the theorem for the case in which X is a continuous random variable.
The discrete case follows in the same way, replacing integrals by summations.
Let S represent the region of the real line described by |x − µX| ≥ kσX. Within this region:
(x − µX)²/(k²σX²) ≥ 1.
It thus follows that:
Pr[|X − µX| ≥ kσX] = ∫_S fX(x) dx
                   ≤ ∫_S [(x − µX)²/(k²σX²)] fX(x) dx     (by the above inequality)
                   ≤ (1/(k²σX²)) ∫_{−∞}^{∞} (x − µX)² fX(x) dx     (non-negativity of integrand)
                   = (1/(k²σX²)) σX² = 1/k²
This theorem is primarily of interest in proving properties of various statistical procedures (later
in the second year curriculum), but can also be used for deriving bounds on results that will
apply no matter what the underlying distribution is. For example, suppose that we wish to derive
a 95% confidence interval for µX , based on a sample of size n from X. (We will be defining the
concept of random samples more precisely later in this course.) It should be known from first year that the variance of the sample mean X̄ is σX²/n. If we apply the theorem (Chebychev's
inequality) to X̄, then we obtain:
Pr[|X̄ − µX| ≥ kσX/√n] ≤ 1/k²
i.e.:
Pr[−kσX/√n < X̄ − µX < kσX/√n] ≥ 1 − 1/k²
If we set k² = 20, i.e. k ≈ 4.47, then µX will be contained within the interval given by X̄ ± 4.47σX/√n with probability at least 0.95. This interval is thus the widest that a 95% confidence interval can get, no matter what the underlying distribution is. Of course, using this can be quite conservative: we know for example that if X has the normal distribution, then the exact 95% confidence interval is X̄ ± 1.96σX/√n. But it is interesting that even with no distributional assumptions at all, the interval is never more than about 2.3 times as wide as the exact normal-theory interval.
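As a numerical aside (not part of the notes), the sketch below draws repeated samples from an exponential population (chosen arbitrarily) and compares the coverage of the distribution-free interval X̄ ± 4.47σ/√n with that of the normal-theory interval X̄ ± 1.96σ/√n.

import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 1.0, 1.0, 30, 20_000     # exponential(1): mean 1, standard deviation 1
cover_cheb = cover_norm = 0
for _ in range(reps):
    xbar = rng.exponential(mu, n).mean()
    cover_cheb += abs(xbar - mu) < 4.47 * sigma / np.sqrt(n)
    cover_norm += abs(xbar - mu) < 1.96 * sigma / np.sqrt(n)

print(cover_cheb / reps)   # well above 0.95: the Chebychev bound is conservative
print(cover_norm / reps)   # close to 0.95, by the central limit theorem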
µ'rs = E[X^r Y^s]
For s = 0, we have:
µ'r0 = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x^r f_{XY}(x, y) dy dx
     = ∫_{−∞}^{∞} x^r [ ∫_{−∞}^{∞} f_{XY}(x, y) dy ] dx
     = ∫_{−∞}^{∞} x^r fX(x) dx
i.e. the r-th moment of the marginal distribution of X. Similarly, µ'0s is the s-th moment of the marginal distribution of Y. In particular, µ'10 = µX and µ'01 = µY. As with univariate moments, it is useful to subtract out the means for higher order moments. Thus the (r, s)-th central moment of (X, Y) is defined by:
µrs = E[(X − µX)^r (Y − µY)^s]
As shown for the uncentred moments, it is easily demonstrated that µr0 and µ0s are the r-th
and s-th central moments of the marginal distributions of X and Y respectively. In particular, µ20 = σX², and µ02 = σY².
A fundamentally new insight is obtained when r and s are both non-zero. The simplest such
case is µ11 = E[(X − µX )(Y − µY )], which is termed the covariance of X and Y , written as
Covar(X,Y ) or as σXY . While variance measures the extent of dispersion of a single variable
about its mean, covariance measures the extent to which two variables vary together around
their means. If large values of X (i.e. X > µX ) tend to be associated with large values of Y (i.e.
Y > µY ), and vice versa, then (X − µX )(Y − µY ) will tend to take on positive values more often
than negative values, and we will have σXY > 0. X and Y will then be said to be positively
correlated. Conversely, if large values of the one variable tend to be associated with small values
of the other, then we have σXY < 0, and the variables are said to be negatively correlated. If
σXY = 0, then we say that X and Y are uncorrelated.
Exploring possible causal links: For example, early observational studies had shown that
the level of cigarette smoking and the incidence of lung cancer were positively correlated.
This suggested a plausible hypothesis that cigarette smoking was a causative factor in lung
cancer. This was not yet a proof, but suggested important lines of future research.
Prediction: Whether or not a causal link exists, it remains true that if σXY > 0, and we
observe X >> µX , then we would be led to predict that Y >> µY . Thus, even without
proof of a causal link between cigarette smoking and lung cancer, the actuary would be
justified in classifying a heavy smoker as a high risk for lung cancer (even if, for example,
it is propensity to cancer that causes addiction to cigarettes).
As with other moments, it is often easier to calculate the uncentred moment µ'11 than to calculate the covariance by direct integration. The following is thus a useful result, worth remembering:
σXY = E[(X − µX)(Y − µY)] = E[XY] − µX µY = µ'11 − µX µY
Example Let us continue with the example from the previous chapter. It is left as an exercise to show that µX = 13/18, µY = 10/9, σX² = 0.0451 and σY² = 0.3210.
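The joint density of this example is not repeated here. Assuming it is f_{XY}(x, y) = x² + xy/3 on 0 ≤ x ≤ 1, 0 ≤ y ≤ 2 (an assumption consistent with the conditional densities quoted in Section 3.5 and with the moments stated above), the following Python sketch reproduces the quoted values by numerical double integration.

from scipy.integrate import dblquad

f = lambda x, y: x**2 + x*y/3.0            # assumed joint density (see note above)

def E(g):
    # integrate g(x, y)*f(x, y) over 0 <= x <= 1 (outer) and 0 <= y <= 2 (inner)
    val, _ = dblquad(lambda y, x: g(x, y) * f(x, y), 0, 1, 0, 2)
    return val

print(E(lambda x, y: 1.0))                 # total probability: 1
muX, muY = E(lambda x, y: x), E(lambda x, y: y)
print(muX, muY)                            # 13/18 = 0.7222 and 10/9 = 1.1111
print(E(lambda x, y: x**2) - muX**2)       # 0.0451
print(E(lambda x, y: y**2) - muY**2)       # 0.3210
print(E(lambda x, y: x*y) - muX*muY)       # covariance, approx -0.0062 (used below)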
We now state and prove a few important results concerning bivariate random variables, which
depend on the covariance. We start, however, with a more general result:
Theorem 3.2 If X and Y are independent random variables, then for any real valued functions
g(x) and h(y), E[g(X)h(Y )] = E[g(X)] · E[h(Y )].
Proof: We shall give the proof for continuous distributions only. The discrete case follows
analogously.
Since, by independence, we have that fXY (x, y) = fX (x)fY (y), it follows that:
E[g(X)h(Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x)h(y) fX(x) fY(y) dy dx
            = ∫_{−∞}^{∞} g(x) fX(x) [ ∫_{−∞}^{∞} h(y) fY(y) dy ] dx
            = [ ∫_{−∞}^{∞} g(x) fX(x) dx ] [ ∫_{−∞}^{∞} h(y) fY(y) dy ]
            = E[g(X)] · E[h(Y)]
Note in particular that this implies that if X and Y are independent, then E[XY] = µ'11 = µX µY, and thus that σXY = 0. We record this result as a theorem: if X and Y are independent, then σXY = 0, i.e. X and Y are uncorrelated.
The converse of this theorem is not true in general (i.e. we cannot in general conclude that X
and Y are independent if σXY = 0), although an interesting special case does arise with the
normal distribution (see next section). That the converse is not true, is demonstrated by the
following simple discrete example:
Example of uncorrelated but dependent variables: Suppose that X and Y are discrete
random variables, with pXY (x, y) = 0, except for the four cases indicated below:
pXY(0; −1) = pXY(1; 0) = pXY(0; 1) = pXY(−1; 0) = 1/4
Note that pX(−1) = pX(1) = 1/4 and pX(0) = 1/2, and similarly for Y. Thus X and Y are not independent, because (for example) pXY(0; 0) = 0, while pX(0)pY(0) = 1/4.
We see easily that µX = µY = 0, and thus σXY = E[XY ]. But XY = 0 for all cases
with non-zero probability, and thus E[XY ]=0. Thus the variables are uncorrelated, but
dependent.
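A tiny numerical check of this example (illustrative only, not from the notes):

pmf = {(0, -1): 0.25, (1, 0): 0.25, (0, 1): 0.25, (-1, 0): 0.25}

EX  = sum(p * x for (x, y), p in pmf.items())
EY  = sum(p * y for (x, y), p in pmf.items())
EXY = sum(p * x * y for (x, y), p in pmf.items())
print(EXY - EX * EY)                      # covariance: 0.0

pX0 = sum(p for (x, y), p in pmf.items() if x == 0)
pY0 = sum(p for (x, y), p in pmf.items() if y == 0)
print(pmf.get((0, 0), 0.0), pX0 * pY0)    # 0.0 versus 0.25: the variables are not independent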
Theorem 3.4 For any constants a and b:
Var[aX + bY] = a²σX² + b²σY² + 2abσXY
Proof: Clearly:
E[aX + bY] = aE[X] + bE[Y] = aµX + bµY
Thus the variance of aX + bY is given by:
Var[aX + bY] = E[(a(X − µX) + b(Y − µY))²]
             = a²E[(X − µX)²] + b²E[(Y − µY)²] + 2abE[(X − µX)(Y − µY)]
             = a²σX² + b²σY² + 2abσXY
In particular, setting a = b = 1, and then a = 1, b = −1, gives:
Var[X + Y] = σX² + σY² + 2σXY
Var[X − Y] = σX² + σY² − 2σXY
If X and Y are independent (or even if only uncorrelated), these reduce to:
Var[X + Y] = Var[X − Y] = σX² + σY²
As with the interpretation of third and fourth moments, the interpretation of covariance is
confounded by the fact that the magnitude of σXY is influenced by the spreads of X and Y
themselves. As before, we can eliminate the effects of the variances by defining an appropriate
correlation coefficient, defined by:
ρXY = σXY/(σX σY) = Covar[X, Y]/√(Var[X] Var[Y])
Note that ρXY has the same sign as σXY , and takes on the value zero when X and Y are
uncorrelated.
It is useful to seek some understanding of the meaning and magnitudes of the correlation coeffi-
cient. Using the last theorem with a = 1/σX and b either 1/σY or −1/σY , and writing the two
cases for b together as b = ±1/σY , we get:
Var[X/σX ± Y/σY] = (1/σX²)σX² + (1/σY²)σY² ± 2σXY/(σX σY) = 2(1 ± ρXY)
Since variances are always non-negative, this implies that 1 ± ρXY ≥ 0, i.e. that −1 ≤ ρXY ≤ 1,
or |ρXY | ≤ 1. Now let us consider conditions under which |ρXY | = 1. Suppose that X and Y
are related to each other by an exact linear relationship, i.e. Y = aX + b precisely, where a ≠ 0.
(There is thus only one “degree of freedom”: once either X or Y is known, then the other is
immediately determined as well.) Taking expectations on both sides of the equality, it is clear
that µY = aµX + b, and thus the variance of Y is:
Var[Y] = a²Var[X], i.e. σY = |a|σX.
Furthermore, since Y − aX is the constant b, it follows that Var[Y − aX] = 0. But by Theorem 3.4:
Var[Y − aX] = σY² + a²σX² − 2aσXY = 2a²σX² − 2aρXY σX σY
and thus:
aσX − ρXY σY = 0
Two cases now need to be distinguished: if a > 0, then σY = aσX, so that aσX − aρXY σX = 0, giving ρXY = 1; if a < 0, then σY = −aσX, so that aσX + aρXY σX = 0, giving ρXY = −1. To summarize the two extreme cases:
1. If X and Y are precisely linearly related, then |ρXY| = 1, with the sign being determined by the slope of the line; and
2. If X and Y are independent, then ρXY = 0.
The magnitude of the correlation coefficient is thus a measure of the degree to which the two
variables are linearly related, while its sign indicates the direction of this relationship.
Example: In the previous example, we had σX² = 0.0451, σY² = 0.3210, and σXY = −0.00617. Thus:
ρXY = −0.00617/√(0.0451 × 0.3210) = −0.0513
In the previous chapter, we introduced the bivariate normal distribution, and summarized three
important properties of this distribution. We can now introduce a fourth important property:
PROPERTY 4: THE CORRELATION COEFFICIENT IS ρ: Using the fact that fXY (x, y) =
fX (x) fY |X (y|x), and the expression for the expectation of the conditional distribution for
Y given X = x stated in Property 2, we have the following:
E[XY] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} xy f_{XY}(x, y) dy dx
      = ∫_{−∞}^{∞} x fX(x) [ ∫_{−∞}^{∞} y f_{Y|X}(y|x) dy ] dx
      = ∫_{−∞}^{∞} x fX(x) [ µY + (ρσY/σX)(x − µX) ] dx
      = [ µY − ρσY µX/σX ] ∫_{−∞}^{∞} x fX(x) dx + (ρσY/σX) ∫_{−∞}^{∞} x² fX(x) dx
      = [ µY − ρσY µX/σX ] µX + (ρσY/σX)(µX² + σX²)
      = µX µY + ρσX σY
Thus σXY = E[XY] − µX µY = ρσX σY, and hence ρXY = σXY/(σX σY) = ρ.
Properties 3 and 4 together show that for the bivariate normal distribution, zero correlation
implies independence. We have seen earlier that this result is not true in general; it is in fact
very specific to the multivariate normal distribution.
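The following simulation sketch (an illustrative addition, with arbitrarily chosen parameter values) checks Property 4 numerically: the sample correlation of bivariate normal data approaches ρ, and the sample mean of XY approaches µX µY + ρσX σY.

import numpy as np

rng = np.random.default_rng(2)
muX, muY, sX, sY, rho = 1.0, -2.0, 2.0, 0.5, 0.6
cov = [[sX**2, rho*sX*sY], [rho*sX*sY, sY**2]]
x, y = rng.multivariate_normal([muX, muY], cov, size=500_000).T

print(np.corrcoef(x, y)[0, 1])                    # approx 0.6
print(np.mean(x*y), muX*muY + rho*sX*sY)          # both approx -1.4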
The above points are well-illustrated by the bivariate normal distribution. From property 2 we
have the conditional expectations as E[X|Y = y] = µX + [ρσX /σY ](y − µY ) and E[Y |X = x] =
µY + [ρσY/σX](x − µX). We thus observe that both conditional expectations are linear functions of the conditioning variable.
In first year, you were introduced to the concept of linear regression, which we have seen above
arises naturally in the context of normal distributions. More generally, a linear regression may
be viewed as a best linear approximation to the relationship between E[Y |X = x] and x. In
general, regressions will not be linear for distributions other than the normal, as illustrated in
the following example:
Example We continue further with the example from the previous chapter, viz. that for which:
f_{X|Y}(x|y) = (6x² + 2xy)/(2 + y)
for 0 ≤ x ≤ 1, and zero elsewhere (but only defined for 0 ≤ y ≤ 2), and
f_{Y|X}(y|x) = (3x² + xy)/(6x² + 2x)
for 0 ≤ y ≤ 2, and zero elsewhere (but only defined for 0 ≤ x ≤ 1). Thus the regression of
X on Y is given by:
µ_{X|y} = ∫_0^1 x (6x² + 2xy)/(2 + y) dx
        = [1/(2 + y)] [ 6x⁴/4 + 2x³y/3 ]_0^1
        = (9 + 4y)/(6(2 + y))
Note that the two regressions are non-linear, and are not inverse functions of each other.
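As a numerical check (not part of the notes), the sketch below integrates x against the conditional density f_{X|Y}(x|y) given above, and compares the result with the closed form (9 + 4y)/(6(2 + y)) at a few values of y.

from scipy.integrate import quad

def reg_X_on_Y(y):
    val, _ = quad(lambda x: x * (6*x**2 + 2*x*y) / (2 + y), 0, 1)
    return val

for y in (0.0, 0.5, 1.0, 2.0):
    print(y, reg_X_on_Y(y), (9 + 4*y) / (6 * (2 + y)))   # the two columns should agree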
Now although E[X|Y = y] is not a random variable, we can, nevertheless, define a random
variable which takes on the value E[X|Y = y] whenever Y takes on the value y. We shall denote
this random variable by E[X|Y ]. Before going further with this, let us illustrate this concept by
means of a simple example:
Example: Suppose that the random variables X and Y represent the weight and height re-
spectively of an individual chosen at random from a very strange population in which only
three heights are possible, viz. 1.6m, 1.8m or 2.0m. The probabilities on each of these, i.e.
pY (y), are 0.35, 0.45 and 0.20 respectively. We know very little about the distribution of
weights, but we are given the conditional expectations of X for each value of Y as follows:
y µX|y
1.6 55kg
1.8 65kg
2.0 80kg
Suppose we draw one person at random from the population, and measure his/her height,
but not the weight. We can, however, report the expected weight for such a person. This
is then one realization of the random variable E[X|Y ]. This random variable takes on the
value 55 with probability 0.35, 65 with probability 0.45, and 80 with probability 0.20. We
can calculate the expectation of this random variable, which is 0.35 × 55 + 0.45 × 65 + 0.20 × 80 = 64.5 kg.
The next theorem says that this last value is precisely E[X], which we therefore obtain
without ever knowing the full distribution of X.
Theorem 3.5
E[E[X|Y ]] = E[X]
Proof: As in other cases, we will give the proof for the continuous case only, but the discrete
case follows identically.
By definition:
E[E[X|Y]] = ∫_{−∞}^{∞} E[X|Y = y] fY(y) dy
and:
E[X|Y = y] = ∫_{−∞}^{∞} x f_{X|Y}(x|y) dx = ∫_{−∞}^{∞} x f_{XY}(x, y)/fY(y) dx
Substituting the second expression into the first, the fY(y) terms cancel, so that:
E[E[X|Y]] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x f_{XY}(x, y) dx dy = ∫_{−∞}^{∞} x fX(x) dx = E[X]
The above theorem turns out to be very useful in many contexts, one of which is described
below. It is even more useful to treat variances in the same way. We can thus calculate the
variance of the conditional distribution, to get (for example) the conditional variance Var[X|Y = y].
We can further define the random variable Var[X|Y ], which takes on the value Var[X|Y = y]
whenever Y = y. The expectation of this random variable, i.e. E[Var[X|Y ]], can be calculated
as for E[E[X|Y ]]. Finally we can calculate the variance of E[X|Y ] (and also of Var[X|Y ]). These
concepts are related in the following very useful theorem. We shall not give the proof here
(although a proof may be found in Rice, Chapter 4).
Theorem 3.6
Var[X] = Var[E[X|Y ]] + E[Var[X|Y ]]
The value of the last two theorems lies in the fact that it is often difficult to calculate expectations
and variances directly, but that sometimes it is relatively easy to calculate conditional means and
variances. An example of this, which is of considerable importance in actuarial work, is the sum
over a random number of random variables. Thus suppose that X1 , X2 , X3 , . . . are independent
realizations of the same random variable X, and define:
S = Σ_{i=1}^{N} Xi
where N is a discrete non-zero random variable, which is independent of the Xi . We assume that
we know the means and variances of X and of N , but we need to know the mean and variance of
S. Let us condition first on the event N = n, for some integer n: conditional on N = n, S = Sn
where Sn is the random variable:
Sn = Σ_{i=1}^{n} Xi
whose distribution is defined by the distribution of X. It is easy to see (by generalization of results we had previously for the sum of two random variables) that E[Sn] = nµX and Var[Sn] = nσX². Thus E[S|N] = µX N and Var[S|N] = σX² N. From our two theorems, therefore, we have:
E[S] = E[E[S|N]] = E[µX N] = µX µN
and
Var[S] = Var[µX N] + E[σX² N] = µX² σN² + σX² µN
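The following simulation sketch (illustrative only; the choices X ~ exponential and N ~ Poisson are arbitrary) checks these mean and variance formulae for a sum over a random number of terms.

import numpy as np

rng = np.random.default_rng(3)
muX, varX = 2.0, 4.0            # an exponential with mean 2 has variance 4
muN, varN = 5.0, 5.0            # Poisson(5) has mean = variance = 5

reps = 100_000
N = rng.poisson(muN, reps)
S = np.array([rng.exponential(muX, n).sum() for n in N])

print(S.mean(), muX * muN)                         # E[S]   = muX * muN = 10
print(S.var(),  muX**2 * varN + varX * muN)        # Var[S] = 20 + 20 = 40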
Tutorial Exercises
3.1 The discrete bivariate random variable (X, Y ) has probability mass function defined by the fol-
lowing table:
x
y 1 2 3
1 0.10 0.15 0.04
2 0.05 0.10 0.07
3 0.04 0.04 0.14
5 0.10 0.03 0.14
(a) Find c.
(b) Derive the pdf of U = ¼(X − Y) (hint: use a dummy variable, V = Y).
(c) Find the E[U ] and the Var[U ].
3.8 The random variables X1 and X2 are uniformly distributed over the region defined by −x1 ≤ x2 ≤
x1 and 0 ≤ x1 ≤ 1. Calculate the correlation coefficient and the regressions of X1 on X2 and of
X2 on X1 .
3.9 The random variables U1 and U2 are uniformly distributed over the region defined by −2 ≤ u1 ≤ 2
and u1 − 1 ≤ u2 ≤ u1 + 1. Calculate the correlation coefficient and the regression of U2 on U1 .
3.10 Consider a sample of size 2 drawn without replacement from a hat containing 3 balls, numbered
1,2, and 3. Let X be the smaller of the 2 numbers drawn and S the sum of the 2 numbers drawn.
MX(t) = E[e^{tX}]
for discrete random variables. Note that the moment generating function is a function of t, and
not of realizations of X. One random variable X (or more correctly, perhaps, its distribution)
defines the function of t given by MX (t).
Clearly, MX (0) = 1, since for t = 0, etx = 1 (a constant). For non-zero values of t, we cannot be
sure that MX (t) exists at all. For the purposes of this course, we shall assume that MX (t) exists
for all t in some neighbourhood of t = 0, i.e. that there exists an ε > 0 such that MX(t) exists for all −ε < t < +ε. This is a restrictive assumption, in that certain otherwise well-behaved distributions are excluded. We are, however, invoking this assumption for convenience only in this course. If we used a purely imaginary argument, e.g. it, where i = √−1, then the function (which is now called the characteristic function of X, when viewed as a function of t) does exist for all proper distributions. Everything which we shall be doing with the m.g.f.'s carries through
to characteristic functions as well, but this involves us in issues of complex analysis. For ease of
presentation in this course, therefore, we restrict ourselves to the m.g.f.
MX(t) = E[1 + tX + t²X²/2! + t³X³/3! + t⁴X⁴/4! + ⋯]
      = 1 + tE[X] + (t²/2!)E[X²] + (t³/3!)E[X³] + (t⁴/4!)E[X⁴] + ⋯
      = 1 + tµ'1 + (t²/2!)µ'2 + (t³/3!)µ'3 + (t⁴/4!)µ'4 + ⋯
Now consider what happens when we repeatedly differentiate MX (t). Writing:
M_X^{(r)}(t) = d^r MX(t)/dt^r
we obtain:
M_X^{(r)}(t) = µ'r + tµ'_{r+1} + (t²/2!)µ'_{r+2} + (t³/3!)µ'_{r+3} + ⋯
If we now set t = 0 in the above expressions, we obtain µ'1 = µX = M_X^{(1)}(0), µ'2 = M_X^{(2)}(0), and in general:
µ'r = M_X^{(r)}(0).
We thus have a procedure for determining moments by performing just one integration or sum-
mation (to get MX (t)), and the required number of differentiations. This is often considerably
simpler than attempting to compute the moments directly by repeated integrations or summa-
tions. This only gives the uncentred moments, but the centred moments can be derived from
these, using the formulae in the previous chapter.
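As an illustration (not from the notes), the sympy sketch below applies this procedure to the exponential distribution with rate λ, whose m.g.f. λ/(λ − t) is the α = 1 case of the gamma m.g.f. derived later in this chapter.

import sympy as sp

t, lam = sp.symbols('t lam', positive=True)
MX = lam / (lam - t)                                  # m.g.f. of the exponential, valid for t < lam

raw_moments = [sp.simplify(sp.diff(MX, t, r).subs(t, 0)) for r in (1, 2, 3, 4)]
print(raw_moments)        # [1/lam, 2/lam**2, 6/lam**3, 24/lam**4], i.e. mu'_r = r!/lam**r

mu = raw_moments[0]
print(sp.simplify(raw_moments[1] - mu**2))            # the variance: 1/lam**2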
The expansion for MX (t) used above indicates that the m.g.f. is fully determined by the moments
of the distribution, and vice versa. Since the distribution is in fact fully characterized by its
moments, this suggests that there is a one-to-one correspondence between m.g.f.’s and probability
distributions. This is not a proof at this stage, but the above assertion can in fact be proved for
all probability distributions whose m.g.f.’s exist in a neighbourhood of t = 0. The importance
of this result is that if we can derive the m.g.f. of a random variable, then we have in principle
also found its distribution. In practice, what we do is to calculate and record the m.g.f.’s for a
variety of distributional forms. Then when we find a new m.g.f., we can check back to see what
the distribution is. This will be illustrated later. Let us now derive m.g.f.’s for some important
distributional classes.
Using the standard sum of a geometric series, we obtain MX(t) = pe^t/(1 − qe^t), provided that qe^t < 1. The m.g.f. thus exists for all t < −ln(q), which is a positive upper bound since q < 1 by definition.
Now the term in the summation expression is the p.m.f. of a Poisson distribution with
parameter λet , and the summation thus evaluates to 1. The m.g.f. is thus:
MX(t) = e^{λ(e^t − 1)}.
This recognition of a term which is equivalent to the p.d.f. or p.m.f. of the original distri-
bution, but with modified parameters, is often the key to evaluating m.g.f.’s.
The first two derivatives are:
M_X^{(1)}(t) = λe^t e^{λ(e^t − 1)}
and:
M_X^{(2)}(t) = λe^t e^{λ(e^t − 1)} + (λe^t)² e^{λ(e^t − 1)}.
Setting t = 0 in these expressions gives µX = λ and µ'2 = λ + λ². Thus:
σX² = µ'2 − (µX)² = λ.
Exercise (Binomial distribution): Show that the m.g.f. of the binomial distribution is given
by:
MX(t) = (q + pe^t)^n
where q = 1 − p.
Hint: Combine terms with x as exponent.
The integrand in the last line above is the p.d.f. of the gamma distribution with parameters
α and λ − t, provided that t < λ. Thus, for t < λ, the integral above evaluates to 1. Note
how once again we have recognized another form of the distribution we began with. We
have thus demonstrated that for the gamma distribution:
MX(t) = λ^α/(λ − t)^α = (1 − t/λ)^{−α}
We leave it as an exercise to verify that µX = α/λ, and that σX² = α/λ².
Recall that the χ2 distribution with n degrees of freedom is the gamma distribution with
α = n/2 and λ = 1/2. Thus the m.g.f. of the χ² distribution with n degrees of freedom is
given by:
MX(t) = (1 − 2t)^{−n/2}.
We shall make use of this result later.
Example (Standard Normal distribution):
MX(t) = (1/√(2π)) ∫_{−∞}^{∞} e^{tx − x²/2} dx
Completing the square in the exponent:
tx − x²/2 = −½(x² − 2tx) = −½(x − t)² + t²/2
The m.g.f. can thus be written in the form:
MX(t) = e^{t²/2} ∫_{−∞}^{∞} (1/√(2π)) e^{−(x−t)²/2} dx
The integrand above is the density of the normal distribution with mean of t and a variance of 1. The integral thus evaluates to 1 for any real value of t, demonstrating that the m.g.f. of the standard normal distribution is given by:
MX(t) = e^{t²/2}.
Exercise: Use the above m.g.f. to prove that the coefficient of kurtosis for the standard normal
distribution is 3. Demonstrate further that this result in fact applies for any normal
distribution.
A second important property relates to the m.g.f.’s of sums of independent random variables.
Suppose that X and Y are independent random variables with known m.g.f.’s. Let U = X + Y ;
then:
MU(t) = E[e^{tU}] = E[e^{tX+tY}] = E[e^{tX} e^{tY}] = E[e^{tX}]E[e^{tY}]
where the last equality follows from the independence of X and Y . In other words:
MU (t) = MX (t)MY (t).
This result can be extended: for example if Z is independent of X and Y (and thus also of
U = X + Y ), and we define V = X + Y + Z, then V = U + Z, and thus:
MV (t) = MU (t)MZ (t) = MX (t)MY (t)MZ (t).
Taking this further in an inductive sense, we have the following theorem:
Theorem 4.1 If X1, X2, . . . , Xn are independent random variables, where Xi has m.g.f. Mi(t), and S = Σ_{i=1}^{n} Xi, then MS(t) = Π_{i=1}^{n} Mi(t).
An interesting special case of the theorem is that in which the Xi are identically distributed,
which implies that the moment generating functions are identical: MX (t), say. In this case:
MS(t) = [MX(t)]^n.
Since the sample mean X̄ = S/n, we can combine our previous results to get:
M_{X̄}(t) = [MX(t/n)]^n.
This is an important result which we will use again later.
We can use theorem 4.1 to prove an interesting property of the Poisson, Gamma and Normal
distributions (which does not carry over to all distributions in general, however).
Poisson Distr.: Suppose that X1 , X2 , X3 , . . . , Xn are independent random variables, such that
Xi has the Poisson distribution with parameter λi . Then:
Mi(t) = e^{λi(e^t − 1)}
and the m.g.f. of S = Σ_{i=1}^{n} Xi is:
Π_{i=1}^{n} e^{λi(e^t − 1)} = exp[ Σ_{i=1}^{n} λi (e^t − 1) ]
which is the m.g.f. of the Poisson distribution with parameter Σ_{i=1}^{n} λi. Thus S has this Poisson distribution.
Gamma Distr.: Suppose that X1 , X2 , X3 , . . . , Xn are independent random variables, such that
each Xi has a Gamma distribution with a common value for the λ parameter, but with
possibly different values for the α parameter, say αi for Xi . Then:
Mi(t) = (1 − t/λ)^{−αi}
and the m.g.f. of S = Σ_{i=1}^{n} Xi is:
Π_{i=1}^{n} (1 − t/λ)^{−αi} = (1 − t/λ)^{−Σ_{i=1}^{n} αi}
which is the m.g.f. of the Gamma distribution with parameters Σ_{i=1}^{n} αi and λ. Thus S has this Gamma distribution.
For the special case of the chi-squared distribution, suppose that Xi has the χ² distribution with ri degrees of freedom. It follows from the above result that S then has the χ² distribution with Σ_{i=1}^{n} ri degrees of freedom.
Normal Distr.: Suppose that X1 , X2 , X3 , . . . , Xn are independent random variables, such that
Xi has a Normal distribution with mean µi and variance σi². Then:
Mi(t) = e^{µi t + σi² t²/2}
and the m.g.f. of S = Σ_{i=1}^{n} Xi is:
exp[ (Σ_{i=1}^{n} µi) t + (Σ_{i=1}^{n} σi²) t²/2 ]
which is the m.g.f. of the Normal distribution with mean of Σ_{i=1}^{n} µi and variance of Σ_{i=1}^{n} σi². Thus S has the Normal distribution with this mean and variance.
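A quick simulation sketch (illustrative only, with arbitrarily chosen parameters 1, 2 and 3) of the Poisson case: the distribution of the sum of three independent Poisson variables is compared with the Poisson(6) p.m.f.

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
reps = 200_000
s = rng.poisson(1.0, reps) + rng.poisson(2.0, reps) + rng.poisson(3.0, reps)

k = np.arange(15)
empirical = np.bincount(s, minlength=len(k))[:len(k)] / reps
print(np.round(empirical, 4))
print(np.round(stats.poisson.pmf(k, 6.0), 4))         # should match closely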
The concept of using the m.g.f. to derive distributions extends beyond additive transformations.
The following result illustrates this, and is of sufficient importance to be classed as theorem.
Theorem 4.2 Suppose that X has the normal distribution with mean µ and variance σ 2 , and
let:
Y = (X − µ)²/σ².
Then Y has the χ2 distribution with one degree of freedom.
Proof: Let Z = (X − µ)/σ. Thus Y = Z 2 , where Z has the standard normal distribution. We
then have:
MY(t) = E[e^{tZ²}]
      = ∫_{−∞}^{∞} e^{tz²} (1/√(2π)) e^{−z²/2} dz
      = (1/√(2π)) ∫_{−∞}^{∞} e^{−(1−2t)z²/2} dz
      = (1 − 2t)^{−1/2} [ √((1 − 2t)/(2π)) ∫_{−∞}^{∞} e^{−(1−2t)z²/2} dz ]
The term in square brackets in the last line above is the integral over the real line of the
p.d.f. of a normal distribution with mean of 0 and a variance of (1 − 2t)−1 . This integral
evaluates to 1, and thus:
MY(t) = (1 − 2t)^{−1/2}
which is the m.g.f. of a gamma distribution with parameters α = 1/2 and λ = 1/2, i.e. the chi-squared distribution with one degree of freedom.
Corollary: Combining the theorem with the property of the χ2 distribution derived above,
we see that if X1 , X2 , . . . , Xn are independent random variables from a common normal
distribution with mean µ and variance σ 2 , then:
Y = Σ_{i=1}^{n} (Xi − µ)²/σ²
has the χ2 distribution with n degrees of freedom. This is a very important result: make
sure that you understand how it derives from the previous results.
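The corollary can be checked by simulation; the sketch below (an illustrative addition, with arbitrary µ, σ and n) confirms that the statistic has mean n, variance 2n, and the expected upper-tail behaviour of the χ² distribution with n degrees of freedom.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
mu, sigma, n, reps = 3.0, 2.0, 8, 100_000
x = rng.normal(mu, sigma, size=(reps, n))
y = ((x - mu)**2 / sigma**2).sum(axis=1)

print(y.mean(), y.var())                        # approx n = 8 and 2n = 16
print(np.mean(y > stats.chi2.ppf(0.95, n)))     # approx 0.05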
In the previous chapter, we examined the moments of sums of random variables to an unknown
number of terms. With m.g.f.’s, we can go a little further. Suppose once again that X1 , X2 , X3 , . . .
are independent, identically distributed ("i.i.d.") random variables, having a common distribution with m.g.f. MX(t). Let S = Σ_{i=1}^{N} Xi. Conditional on the event N = n, where n is any integer, S = Sn = Σ_{i=1}^{n} Xi, and we have seen above that the m.g.f. of Sn is simply [MX(t)]^n, or in other words:
E[e^{tS} | N = n] = [MX(t)]^n.
With the same notation used in the previous chapter, the random variable E[e^{tS} | N] is [MX(t)]^N (a function of N). The m.g.f. of S is the unconditional expectation of e^{tS}, which by theorem 3.5 is:
MS(t) = E[E[e^{tS} | N]] = E[[MX(t)]^N] = E[e^{N ln MX(t)}] = MN(ln MX(t))
i.e. the m.g.f. of N evaluated at the point ln MX(t). Note that at t = 0, MX(t) = 1, and thus ln MX(0) = 0, and MS(0) = MN(0) = 1.
Since the derivatives of MS (t) at t = 0 will be functions of the derivatives of MX (t) and of
MN(t) at t = 0, it is thus in principle possible to evaluate moments of S to any order, provided
sufficiently high order moments of X and of N are known.
As an example, suppose that N has the Poisson distribution with parameter λ. Thus MN(t) = e^{λ(e^t − 1)}, and therefore:
MS(t) = MN(ln MX(t)) = e^{λ(MX(t) − 1)}
Exercise: Use the above result to obtain the mean and variance of S when X is N(µX, σX²) and N is Poisson(λ), expressed as a function of λ, µX and σX². Compare these with the results given by theorems 3.5 and 3.6.
The practice of statistics, by and large, deals with taking numbers (often large numbers) of
repeated observations. The fundamental philosophical question which arises is whether “in the
long run” there is any sense of reliability or consistency in sample results. In order to study
this problem, we need to introduce the concept of sequences of random variables, and their
convergence. This we do in the remainder of this chapter.
A sequence of random variables, say Z1 , Z2 , . . ., represents a sequence of observations of the real
world (“experiments”). The Zi need not be identically distributed (as the underlying observations
could differ in nature). The elements of the sequence may or may not be independent. We
could, for example, think of Zn as the sample mean from a sample of n observations from a
given population. If for each n = 1, 2, . . . , we start re-sampling afresh, then the Zn will be
independent. If, on the other hand, Zn+1 is a sample average based on all the observations going
into Zn plus one further observation, then they are clearly not independent. Where the elements
of the sequence are both independent and identically distributed, the sequence is termed an i.i.d.
sequence (“independent and identically distributed”).
A sequence of random variables Z1, Z2, . . . is said to converge in probability to the point α, if Pr[|Zn − α| > ε] → 0 as n → ∞ for any ε > 0. One example of this type of convergence is provided in the next theorem.
Theorem 4.3 (Weak law of large numbers – WLLN): Let X1 , X2 , . . . be an i.i.d. sequence of
random variables, which are such that E[Xi] = µ < ∞ and Var[Xi] = σ² < ∞. For each n, define X̄n = n^{−1} Σ_{i=1}^{n} Xi. Then X̄n converges in probability to µ.
Proof: Clearly E[X̄n] = µ, and:
Var[X̄n] = (1/n²) Σ_{i=1}^{n} Var[Xi] = σ²/n.
Now for any arbitrary ε > 0, set k = ε√n/σ, i.e. such that ε = kσ/√n. Then by Chebychev's inequality:
Pr[|X̄n − µ| > ε] ≤ σ²/(nε²).
For any non-zero ε, the right hand side above tends to 0 as n → ∞.
The above theorem tells us that if we sample for long enough, our sample mean will become
arbitrarily close to the population mean (as long as it and the variance exist), no matter what
the underlying distribution. There are no surprises at infinity! The main use of the theorem will
be found in proving properties of statistical inferential procedures.
We shall not deal with it in this course, but there is a stronger convergence concept termed
“almost sure convergence”. The sequence Z1 , Z2 , . . . is said to converge almost surely to the
point α, if after some, usually random, point in the sequence, |Zn − α| never again exceeds ε. It
can in fact be proved that the sequence defined by the X¯n ’s converges almost surely to µ: this
is termed the strong law of large numbers (SLLN).
distribution with parameters n and p= the probability that any one such occasion leads to
an accident. (Actually we are making assumptions such as independence of all occasions,
and a constant p for all occasions; but in many cases these assumptions might not be too
bad.) One problem is that we can never know what is n, and we quite probably will never
be able to estimate p. What is easy to estimate, however, would be the mean number of
accidents, viz. µ = np.
Formally then, suppose that Xn has a mean of µ, and follows a binomial distribution with
parameters n and p = µ/n. Can we find some useful approximation to the distribution of
Xn , when we don’t in fact have any idea what n is, except that it is large (implying that
p is small)? We hypothetically examine a sequence of random variables X1 , X2 , . . ., where
each Xn has the above binomial distribution. Does this sequence converge in distribution
to anything useful (which can be used to approximate the desired distribution)? We look
at the m.g.f. In a previous exercise, the reader was required to evaluate the m.g.f. of the
binomial distribution, which for Xn turns out to be:
Mn(t) = (q + pe^t)^n
      = [1 − µ/n + (µ/n)e^t]^n
      = [1 + µ(e^t − 1)/n]^n
Recalling the basic definition e^x = lim_{n→∞}[1 + x/n]^n, it follows that lim_{n→∞} Mn(t) = exp[µ(e^t − 1)], which is the m.g.f. of the Poisson distribution with parameter µ. This is
therefore the required limiting distribution, which can be used as an approximation to the
binomial for large n and small p.
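The quality of the approximation can be examined numerically; the following sketch (not part of the notes) compares the Binomial(n, µ/n) p.m.f. with the Poisson(µ) p.m.f. for µ = 2 and increasing n.

import numpy as np
from scipy import stats

mu = 2.0
k = np.arange(10)
for n in (10, 100, 1000):
    binom_pmf = stats.binom.pmf(k, n, mu / n)
    print(n, np.max(np.abs(binom_pmf - stats.poisson.pmf(k, mu))))  # the discrepancy shrinks with n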
Normal approximation to the gamma: Suppose that Xn has the gamma distribution with
parameters λ (same for all n) and αn , where αn > αn−1 and αn → ∞ as n → ∞. Since
both the mean (= αn /λ) and the variance (= αn /λ2 ) are monotonically increasing in n,
and diverge to infinity, we cannot apply the convergence in distribution principle directly.
However, let us define:
Zn = (Xn − αn/λ)/(√αn/λ) = (λ/√αn) Xn − √αn.
Every Zn has zero mean and unit variance, so we have some hope that the sequence
Z1, Z2, . . . may converge. From previous results, we know that the m.g.f. of Xn is (1 − t/λ)^{−αn}, and thus the m.g.f. of Zn is:
Mn(t) = e^{−√αn t} (1 − t/√αn)^{−αn}.
We shall also need the series expansion:
ln(1 − x) = −x − x²/2 − x³/3 − x⁴/4 − ⋯
Taking the natural logarithm of Mn (t) and applying this result, we obtain:
ln Mn(t) = −√αn t − αn ln(1 − t/√αn)
         = −√αn t − αn [ −t/√αn − (1/2)(t/√αn)² − (1/3)(t/√αn)³ − (1/4)(t/√αn)⁴ − ⋯ ]
         = t²/2 + t³/(3√αn) + t⁴/(4αn) + ⋯
As n → ∞, αn → ∞, and thus all but the first term in the above expression tend to 0.
Thus we have that:
lim_{n→∞} ln Mn(t) = t²/2
or, equivalently:
lim_{n→∞} Mn(t) = e^{t²/2}
which is the m.g.f. of the standard normal distribution. Thus we have proved that Z1 , Z2 , . . .
converges in distribution to a random variable Z which has the standard normal distribu-
tion. This implies that if X has a gamma distribution with parameters α and λ, then for
large enough α, the distribution of
(X − α/λ)/(√α/λ)
can be approximated by the standard normal distribution.
In particular, recall that the χ² distribution with n degrees of freedom is just a gamma distribution with parameters n/2 and 1/2, and thus for large n this can be approximated by a normal distribution with mean n and variance 2n.
We come now to what is perhaps the best known, and widely used and misused, result concerning
convergence in distribution, viz. the central limit theorem. We state it in the following form:
Theorem 4.4 (The Central Limit Theorem – CLT) Let X1 , X2 , X3 , . . . be an i.i.d. sequence
of random variables, having finite mean (µ) and variance (σ 2 ). Suppose that the common m.g.f.,
say MX (t), and its first two derivatives exist in a neighbourhood of t = 0. For each n, define:
X̄n = (1/n) Σ_{i=1}^{n} Xi
and:
Zn = (X̄n − µ)/(σ/√n).
Then the sequence Z1 , Z2 , Z3 , . . . converges in distribution to Z which has the standard normal
distribution.
Comment 1: The practical implication of the central limit theorem is that for large enough
n, the distribution of the sample mean can be approximated by a normal distribution
with mean µ and variance σ 2 /n, provided only that the underlying sampling distribution
satisfies the conditions of the theorem. This is very useful, as it allows the powerful
statistical inferential procedures based on normal theory to be applied, even when the
sampling distribution itself is not normal. However, the theorem can be seriously misused
by application to cases with relatively small n.
Comment 2: The assumption of the existence of the twice-differentiable m.g.f. is quite strong.
However, the characteristic function (i.e. using imaginary arguments) and its first two
derivatives will exist if the first two moments exist, which we have already assumed. The
proof below follows in much the same way with characteristic functions, and the theorem is thus more general than these assumptions suggest.
In fact, even the i.i.d. assumption can be relaxed, but this requires much more sophisticated
mathematical treatment.
Proof: Let us define Wi = Xi − µ, and W̄n = X̄n − µ, which is then also the sample mean
of the Wi's. Clearly E[Wi] = 0 and Var[Wi] = σ². The Wi's are also i.i.d. with m.g.f. given by
MW(t) = e^{−µt} MX(t).
Now:
Zn = W̄n/(σ/√n) = Σ_{i=1}^{n} Wi/(nσ/√n) = Σ_{i=1}^{n} Wi/(σ√n)
and thus the m.g.f. of Zn , say Mn (t) is given by:
Mn(t) = [MW(t/(σ√n))]^n
Expanding MW(s) in a Taylor series about s = 0, we can write:
MW(s) = MW(0) + sM_W^{(1)}(0) + (s²/2)M_W^{(2)}(0) + ε(s)
where ε(s) is a function of s, with the property that ε(s)/s² → 0 as s → 0. Now:
M_W^{(1)}(0) = E[W] = 0
M_W^{(2)}(0) = E[W²] = σ²
and thus:
MW(s) = 1 + σ²s²/2 + ε(s).
For fixed t and σ, let us define s = t/(σ√n) (so that s → 0 as n → ∞) and εn as the value of ε(s) for this s. Then:
ε(s)/s² = nεn σ²/t²
tends to 0 as n → ∞. Since σ²/t² is strictly positive, it follows that nεn → 0 as n → ∞. In this case, we can write MW(t/(σ√n)) in the form:
MW(t/(σ√n)) = 1 + t²/(2n) + εn
thus giving:
Mn(t) = [1 + t²/(2n) + εn]^n = [1 + (t² + 2nεn)/(2n)]^n.
In the limit as n → ∞, nεn → 0, and thus t² + 2nεn → t², so that:
lim_{n→∞} Mn(t) = lim_{n→∞} [1 + t²/(2n)]^n = e^{t²/2}.
This is the m.g.f. of the standard normal distribution, which thus proves the theorem.
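A simulation sketch of the theorem (illustrative only, not part of the notes): standardized means of samples from the very skew exponential distribution are compared with a standard normal tail probability.

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
mu, sigma, reps = 1.0, 1.0, 50_000
for n in (5, 30, 200):
    xbar = rng.exponential(mu, size=(reps, n)).mean(axis=1)
    z = (xbar - mu) / (sigma / np.sqrt(n))
    # the proportion beyond 1.645 should approach the normal tail value 0.05
    print(n, np.mean(z > 1.645), 1 - stats.norm.cdf(1.645))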
GX(s) = E[s^X]
It is clear that GX (s)|s=0 = pX (0). Now consider what happens when we repeatedly differentiate
GX (s). Writing:
G_X^{(r)}(s) = d^r GX(s)/ds^r
we obtain:
G_X^{(1)}(s) = pX(1) + 2s pX(2) + 3s² pX(3) + ⋯
If we set s = 0:
G_X^{(1)}|_{s=0} = pX(1)
Similarly:
G_X^{(2)}(s) = d²GX(s)/ds² = 2pX(2) + 3 × 2s pX(3) + ⋯,
and
G_X^{(2)}|_{s=0} = 2pX(2).
In general:
G_X^{(r)}|_{s=0} = r! pX(r),
and thus:
pX(r) = (1/r!) G_X^{(r)}|_{s=0} = (1/r!) d^r GX(s)/ds^r |_{s=0}.
Thus, once we know the probability generating function, we can compute (generate) all the
probabilities.
If we examine the first derivative of the probability generating function and set s = 1, we will
obtain the expectation of X:
G_X^{(1)}|_{s=1} = [pX(1) + 2s pX(2) + 3s² pX(3) + ⋯]_{s=1} = Σ_{x=0}^{∞} x pX(x) = E[X]
G_X^{(2)}(s) = d²GX(s)/ds² = 2pX(2) + 3 × 2s pX(3) + ⋯
and:
G_X^{(2)}|_{s=1} = [2pX(2) + 3 × 2s pX(3) + ⋯]_{s=1} = Σ_{x=0}^{∞} x(x − 1) pX(x) = E[X(X − 1)]
It is easy to see that from the probability generating function we can derive the moment gener-
ating function and vice versa:
MX(t) = E[e^{Xt}] = E[(e^t)^X] = GX(e^t)
and:
GX(s) = E[s^X] = E[e^{X ln s}] = MX(ln s)
For the geometric distribution, for example, pX(x) = pq^{x−1} for x = 1, 2, . . ., so that:
GX(s) = E[s^X] = Σ_{x=1}^{∞} ps(sq)^{x−1} = ps Σ_{j=0}^{∞} (sq)^j = ps/(1 − sq),    (j = x − 1)
Exercise Use the probability generating function to obtain the mean and variance of the geo-
metric distribution, and compare your results with Section 4.1.
Example (Poisson distribution): pX(x) = λ^x e^{−λ}/x! for x = 0, 1, 2, . . .. Thus:
GX(s) = E[s^X]
      = Σ_{x=0}^{∞} s^x λ^x e^{−λ}/x!
      = e^{−λ} e^{sλ} Σ_{x=0}^{∞} (sλ)^x e^{−sλ}/x!
      = e^{−λ} e^{sλ} × 1    (p.m.f. of a P(sλ), summing to 1)
      = e^{λ(s−1)}
The cumulant generating function is defined as KX(t) = ln MX(t). The i-th derivative of KX(t) at t = 0 is the i-th cumulant of X, written as κi. The first two derivatives are the most useful, as one obtains the mean and variance.
d/dt KX(t) = [ µ'1 + (2t/2!)µ'2 + (3t²/3!)µ'3 + ⋯ ] / [ 1 + tµ'1 + (t²/2!)µ'2 + (t³/3!)µ'3 + ⋯ ]
Thus
κ1 = d/dt KX(t)|_{t=0} = µ'1.
Tutorial Exercises
4.1 The discrete bivariate random variable (X,Y) has probability mass function defined by the following
table:
x
y    −1    0    1
−1   1/8  2/8   0
0    1/8  1/8  1/8
1    1/8  1/8   0
Find the moment generating function of X, MX(t), and verify that E[X] = M_X^{(1)}(0) and that E[X²] = M_X^{(2)}(0).
4.4 Suppose that the moment generating function of a random variable X is given by MX(t) = e^{2(e^t − 1)}. What is P[X = 1]?
4.5 Suppose that N, the number of animals caught in a trap per day, has a Poisson distribution with
mean 4. The probability of any one animal being male is 1/3. Find the probability generating
function and expected value of the number of males caught per day.
4.6 Derive the m.g.f. of a negative binomial distribution. Then use this m.g.f. to find the expectation
and variance of a negative binomial random variable.
4.7 The following question is from Rice (1995, Chapter 5, exercise 11):
“A sceptic gives the following argument to show that there must be a flaw in the CLT: “We
know that the sum of independent Poisson random variables follows a Poisson distribution with
a parameter that is the sum of the parameters of the summands. In particular, if n independent Poisson random variables, each with parameter 1/n, are summed, the sum has a Poisson distribution with parameter 1. The CLT says that as n approaches infinity the distribution of the sum tends to a normal distribution, but the Poisson with parameter 1 is not the normal.” What do you think
of this argument?
4.8 Suppose that X1 , . . . , X20 are independent random variables with density functions
fX (x) = 8x, 0 ≤ x ≤ c
In applying probability concepts to real world situations (e.g. planning or decision making), we
usually need to know the distributions of the underlying random variables, such as heights of
people in a population (for a clothing manufacturer, say), sizes of insurance claims (for the actu-
ary), or numbers of calls through a telephone switchboard (for the telecommunications engineer).
Each random variable is defined in terms of a “population”, which may, in fact, be somewhat
hypothetical, and will almost always be very large or even infinite. We cannot then determine
the distribution of the random variable by total census or enumeration of the population, and
we resort to sampling. Typically, this involves three conceptual steps:
Note that this will usually leave a small number of parameters unspecified at this point, to
be “estimated” from data.
3. Observe an often quite small number of actual instances (outcomes of random experiments,
or realizations of random variables), the “sample”, and use the assumed distributional forms
to generalize sample results to the entire assumed population, by estimating the unknown
parameters.
The critical issue here is the choice of the sample. In order to make the extension of sample
results to the whole population in any way justifiable, the sample needs to be “representative”
of the population. We need now to make this concept precise. Consider for example:
Are the heights of students in the front row of the lecture room representative of all UCT
students?
Would the number of calls through a company switchboard during 09h00-10h00 on Monday
morning be representative of the entire week?
Would 10 successive insurance claims be representative of claims over the entire year?
The above examples suggest two possible sources of non-representativeness, viz. (i) observing cer-
tain parts of the population preferentially, and (ii) observing outcomes which are not independent
of each other. One way to ensure representativeness is to deliberately impose randomness on the
selected population, in a way which ensures some uniformity of coverage. If X is the random
variable in whose distribution we are interested (e.g. heights of male students) then we attempt
to design a scheme whereby each male student is equally likely to be chosen, independently of
all others. Quite simply, each observation is then precisely a realization of X. Practical issues of
ensuring this randomness and independence will be covered in later courses in this department,
but it is useful to think critically of any sampling scheme in these terms. In an accompanying
tutorial, there are a few examples to think about. Discuss them amongst yourselves.
For this course, we need to define the above concepts in a rigorous mathematical way. For this
purpose, we shall refer to the random variable X in which we are interested as the population
random variable, with distribution function given by FX (x), etc. Any observation, or single
realization of X will usually be denoted by subscripts, e.g. X1 , X2 , . . . , Xi , . . .. The key concept
is that of a random sample, defined as follows:
Definition: A random sample of size n from the population random variable X is a set of random variables X1, X2, . . . , Xn which are mutually independent, and each of which has the same distribution as X.
Definition: A statistic is any function of a random sample which does not depend on any
unknown parameters of the distribution of the population random variable.
Thus, a function such as (Σ_{i=1}^{n} Xi⁴)/(Σ_{i=1}^{n} Xi²)² would be a statistic, but Σ_{i=1}^{n} (Xi − µX)² would generally not be (unless the population mean µX were known for sure a priori).
Let T (X1 , X2 , . . . , Xn ) be a statistic. It is important to realize that T (X1 , X2 , . . . , Xn ) is a
random variable, one which takes on the numerical value T (x1 , x2 , . . . , xn ) whenever we observe
the joint event defined by X1 = x1 , X2 = x2 , . . . , Xn = xn . If we are to use this statistic to draw
inferences about the distribution of X, then it is important to understand how observed values
of T (X1 , X2 , . . . , Xn ) vary from one sample to the next: in other words, we need to know the
probability distribution of T (X1 , X2 , . . . , Xn ). This we are able to do, using the results of the
previous chapters, and we shall be doing so for a number of well-known cases in the remainder
of this chapter. In doing so, we shall largely be restricting ourselves to the case in which the
distribution of the population random variable X is normal with mean µ and variance σ 2 , one
or both of which are unknown. The central limit theorem will allow us to use the same results
as an approximation for non-normal samples in some cases.
For a random sample X1, . . . , Xn from the N(µ, σ²) distribution, each Xi has m.g.f. MX(t) = e^{µt + σ²t²/2}, so that the m.g.f. of the sample mean is:
M_{X̄}(t) = [MX(t/n)]^n = e^{µt + σ²t²/(2n)}
which is the m.g.f. of the normal distribution with mean µ and variance σ²/n. The central limit
theorem told us that this was true asymptotically for all well-behaved population distributions,
but for normal sampling this is an exact result for all n.
The distribution of the sample variance is a little more complicated, and we shall have to approach
this stepwise. As a first step, let Ui = (Xi − µ)/σ, and consider:
Σ_{i=1}^{n} Ui² = Σ_{i=1}^{n} (Xi − µ)²/σ².
By the corollary to Theorem 4.2, we know that this has the χ² distribution with n degrees of freedom. In principle we know its p.d.f. (from the gamma distribution), and we can compute as many moments as we wish. Integration of the density to obtain cumulative probabilities is more difficult, but fortunately this is what is given in tables of the χ² distribution. For any α (0 < α < 1), let us denote by χ²_{n;α} the value which is exceeded with probability α. In other words, if V has the χ² distribution with n degrees of freedom, then:
Pr[V ≥ χ²_{n;α}] = α.
Point estimation of σ²: From the properties of the χ² (gamma) distributions, we know that:
E[ Σ_{i=1}^{n} (Xi − µ)²/σ² ] = n.
Re-arranging terms, noting that σ is a constant (and not a random variable), even though
it is unknown, we get:
E[ (1/n) Σ_{i=1}^{n} (Xi − µ)² ] = σ².
Suppose now that, based on the observed values from a random sample X1 = x1, X2 = x2, . . . , Xn = xn, we propose to use the following as an estimate of σ²:
(1/n) Σ_{i=1}^{n} (xi − µ)².
From one sample, we can make no definite assertions about how good this estimate is. But
if samples are repeated many times, and the same estimation procedure applied every time,
then we know that we will average out at the correct answer in the long run. We say that
the estimate is thus unbiased.
Hypothesis tests on σ: Now suppose that a claim is made that σ 2 = σ02 , where σ02 is a given
positive real number. Is this true? We might make this the “null” hypothesis, with an
alternative given by σ 2 > σ02 . If the null hypothesis is true, then from the properties of the
χ2 distribution, we know that:
Pr[ Σ_{i=1}^{n} (Xi − µ)²/σ0² ≥ χ²_{n;α} ] = α.
Suppose now that the observed value of Σ_{i=1}^{n} (xi − µ)²/σ0² is in fact larger than χ²_{n;α} for some suitably small value of α. What do we conclude? We
cannot be sure whether the null hypothesis is true or false. But we do know that either
the null hypothesis is false, or we have observed a low probability event (one that occurs
with probability less than α). For sufficiently small α, we would be led to “reject” the null
hypothesis at a 100α% significance level.
Confidence interval for σ: Whatever the true value of σ, we know from the properties of the
χ2 distribution that:
Pr[ χ²_{n;1−α/2} ≤ Σ_{i=1}^{n} (Xi − µ)²/σ² ≤ χ²_{n;α/2} ] = 1 − α
Inverting the inequalities gives a 100(1 − α)% confidence interval for σ²:
[ Σ_{i=1}^{n} (Xi − µ)²/χ²_{n;α/2} ; Σ_{i=1}^{n} (Xi − µ)²/χ²_{n;1−α/2} ].
For the sample referred to below, with n = 10 and an observed Σ(xi − µ)² = 0.00161, the 95% confidence interval for σ² is [0.00161/20.483 ; 0.00161/3.247] = [0.0000786 ; 0.000496]. The corresponding confidence interval for σ is obtained by taking square roots, to give [0.0089 ; 0.0223].
Suppose, however, that the whole point of taking the sample was because of our skepti-
cism with a claim made by the factory manager that σ ≤ 0.009. We test this against
the alternative hypothesis that σ > 0.009. For this purpose, we calculate the ratio
0.00161/(0.009)², which comes to 19.88. From χ² tables we find that χ²_{10;0.05} = 18.307 while χ²_{10;0.025} = 20.483. We can thus say that we would reject the factory manager's claim at the 5% significance level, but not at the 2½% level (or, alternatively, that the p-value for this test is between 0.025 and 0.05). There is evidently reason for our skepticism about the manager's claim!
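The numbers in this test are easy to reproduce with scipy (a sketch, not part of the notes; n = 10 is inferred from the χ²_{10} quantiles used above):

from scipy import stats

n, ss, sigma0 = 10, 0.00161, 0.009
statistic = ss / sigma0**2
print(statistic)                                            # approx 19.88
print(stats.chi2.ppf(0.95, n), stats.chi2.ppf(0.975, n))    # 18.307 and 20.483
print(stats.chi2.sf(statistic, n))                          # p-value, between 0.025 and 0.05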
The problem, however, is that in most practical situations, if we don’t know the variance, we
also do not know the mean. The “obvious” solution is to replace the population mean by the
sample mean, i.e. to base the same sorts of inferential procedures as we had above on:
Σ_{i=1}^{n} (Xi − X̄)²/σ² = (n − 1)S²/σ²
where S 2 is the usual sample variance defined by:
S² = [1/(n − 1)] Σ_{i=1}^{n} (Xi − X̄)².
Certainly, if we define Yi = (Xi − X̄)/σ, then the above is Σ_{i=1}^{n} Yi², and it is easily shown that the Yi's are normally distributed with zero mean. The variance of Yi can be shown to be (n − 1)/n, which is slightly less than 1, but the real problem is that the Yi are not independent, since Σ_{i=1}^{n} Yi = 0, and thus Theorem 4.2 and its corollary no longer apply. And yet it seems
intuitively that not too much can have changed. We now proceed to examine this case further.
Firstly, however, we need the following theorem:
Theorem 5.1 For random samples from the normal distribution, the statistics X̄ and S 2 are
independent random variables.
The proof, although not difficult in principle, is quite messy, and we shall omit it. It is instruc-
tive, nevertheless, to give a relatively informal motivation for the theorem. Since the Xi ’s are
independent, the joint p.d.f. of the entire sample is given by:
Π_{i=1}^{n} [1/(√(2π)σ)] exp[−(xi − µ)²/(2σ²)] = [1/((2π)^{n/2}σ^n)] exp[ −(1/(2σ²)) Σ_{i=1}^{n} (xi − µ)² ].
Since Σ_{i=1}^{n} (xi − µ)² = n(x̄ − µ)² + (n − 1)s², the joint p.d.f. can thus be factorized into a constant multiplied by the product of:
exp[ −n(x̄ − µ)²/(2σ²) ]
which depends only on the sample mean, and:
exp[ −(n − 1)s²/(2σ²) ]
which depends only on the sample variance. The ranges of X̄ and of S² do not depend on each
other, and thus the above factorization strongly suggests independence. In fact the rigorous proof follows very much along the same lines: we start with the joint p.d.f. of the entire sample, transform the n variables to X̄, S² and n − 2 other variables, and integrate out the remaining n − 2 variables.
This now enables us to prove the following very important theorem.
Theorem 5.2 For random samples from the normal distribution, (n − 1)S²/σ² (i.e. Σ_{i=1}^{n} (Xi − X̄)²/σ²) has the χ² ("chi-squared") distribution with n − 1 degrees of freedom.
Comment: The only effect of replacing the population mean by the sample mean is to change
the distribution from the χ2 with n degrees of freedom to one with n−1 degrees of freedom.
The one linear function relating the n terms has “lost” us one degree of freedom.
Note that the expectation of (n − 1)S²/σ² is thus n − 1, and that E[S²] = σ², i.e. S² is now the unbiased estimator of σ².
Proof: Define Ui = (Xi − µ)/σ and let W = Σ_{i=1}^{n} Ui². We have seen previously that the Ui have the standard normal distribution, that W has the χ² distribution with n degrees of freedom, and that its m.g.f. is MW(t) = (1 − 2t)^{−n/2}. Define:
Ū = (1/n) Σ_{i=1}^{n} Ui = (X̄ − µ)/σ
where f (x) is the p.d.f. corresponding to F (x). Now let the random variable Nk be the number
of observations out of the n for which the random variable Xi is observed to fall into interval k. Evidently, Σ_{k=1}^{K} Nk = n. Any one Nk taken on its own is binomially distributed with parameters
n and pk . For “sufficiently large” n, the distribution of Nk can be approximated by the normal
distribution with mean npk and variance npk (1 − pk ). (Conventional “wisdom” suggests that the
approximation is reasonable if npk > 5, or some would say > 10.) Thus:
(Nk − npk)/√(npk(1 − pk))
has approximately the standard normal distribution.
Let us define:
Zk = (Nk − npk)/√(npk)
which is also then approximately normal with zero mean and variance equal to 1 − pk . Note that
the Zk are related by the constraint that:
Σ_{k=1}^{K} √(npk) Zk = Σ_{k=1}^{K} (Nk − npk) = 0.
Now suppose that we select reasonably balanced intervals in the sense that the pk are approx-
imately equal, i.e. pk ≈ 1/K. In this case, each Zk is approximately normally distributed with
mean 0 and variance 1 − 1/K, and the Zk are related through Σ_{k=1}^{K} Zk = 0. This defines the
same mathematical problem as before (apart from the approximations used), so that:
Σ_{k=1}^{K} Zk² = Σ_{k=1}^{K} (Nk − npk)²/(npk)
must (at least approximately) follow a χ2 distribution with K − 1 degrees of freedom. In fact,
numerical experiments have shown that the approximation is good for npk > 5 (or perhaps 10)
even when the pk are not well-balanced, which is thus the basis for the conventional χ2 test.
If the distribution function F (x) is not fully specified at the start, but involves parameters to be
estimated from the data, this imposes further relationships between the Zk , leading to further
losses of degrees of freedom.
(X̄ − µ)/(σ/√n)
has the standard normal distribution, and this fact can be used to draw inferences about µ if
σ is known. For example, a test of the hypothesis that µ = µ0 can be based on the fact (see
normal tables) that:
Pr[ (X̄ − µ0)/(σ/√n) > 1.645 ] = 0.05
if the hypothesis is true. Thus, if the observed value of this expression exceeds 1.645, then
we could reject the hypothesis (in favour of µ > µ0 ) at the 5% significance level. Similarly, a
confidence interval for µ for known σ can be based on the fact that:
Pr[ −1.96 < (X̄ − µ)/(σ/√n) < +1.96 ] = 0.95
which after re-arrangement of the terms gives:
Pr[ X̄ − 1.96σ/√n < µ < X̄ + 1.96σ/√n ] = 0.95
We must re-emphasize that the probability refers to random (sampling) variation in X̄, and not
to µ which is viewed as a constant in this formulation.
In practice, however, the population variance is seldom known for sure. The “obvious” thing to
do is to replace the population variance by the sample variance, i.e. to base inferences on:
T = (X̄ − µ)/(S/√n)
This is now a function of two statistics, viz. X̄ and S. Large values of the ratio can be due
to large deviations in X̄ from the population mean or to values of S below the population
standard deviation. Fortunately, we do know that X̄ and S are independent, and we know their
78 CHAPTER 5. DISTRIBUTIONS OF SAMPLE STATISTICS
distributions, and thus we should be able to derive the distribution of T . It is useful to approach
this in the following manner. Let us first define:
Z = (X̄ − µ)/(σ/√n)
which has the standard normal distribution, and:
U = (n − 1)S²/σ²
which has the χ² distribution with n − 1 degrees of freedom. Z and U are independent, and:
T = Z/√(U/(n − 1)).
In the following theorem, we derive the p.d.f. of T in a slightly more general context which will
prove useful later.
Theorem 5.3 Suppose that Z and U are independent random variables having the standard
normal distribution and the χ2 distribution with m degrees of freedom respectively. Then the
p.d.f. of
Z
T =p
U/m
is given by:
fT(t) = [Γ((m + 1)/2)/(√(mπ) Γ(m/2))] (1 + t²/m)^{−(m+1)/2}
Comments: The p.d.f. for T defines the t-distribution, or more correctly Student’s t distribution,
with m degrees of freedom. It is not hard to see from the functional form that the p.d.f.
has a “bell-shaped” distribution around t = 0, superficially rather like that of the normal
distribution. The t-distribution has higher kurtosis than the normal distribution, although
it tends to the normal distribution as m → ∞.
For m > 2, the variance of T is m/(m − 2). The variance does not exist for m ≤ 2, and
in fact for m = 1, even the integral defining E[T ] is not defined (although the median is
still at T = 0). The t-distribution with 1 degree of freedom is also termed the Cauchy
distribution.
As you should know, tables of the t-distribution are widely available. Values in these tables
can be expressed as numbers tm;α , such that if T has the t-distribution with m degrees of
freedom, then:
Pr[T > tm;α ] = α.
Proof: Since Z and U are independent, the joint p.d.f. of Z and U is given by:
fZU(z, u) = [1/√(2π)] e^{−z²/2} · [u^{m/2−1} e^{−u/2}/(2^{m/2} Γ(m/2))]
for all real z and all positive u. We now define T as above and W = U. The inverse transformations are U = W and Z = T√(W/m), and thus the Jacobian of the transformation is given by:
|J| = | √(w/m)   t/(2√(wm)) ; 0   1 | = √(w/m).
The joint p.d.f. of T and W is thus given by:
fTW(t, w) = [1/√(2π)] e^{−t²w/(2m)} · [w^{m/2−1} e^{−w/2}/(2^{m/2} Γ(m/2))] · √(w/m)
for all real t and all positive w. We obtain the marginal p.d.f. for T by integrating over w.
In order to do this, it is useful to re-arrange the terms to get:
fT(t) = [1/(√(2πm) 2^{m/2} Γ(m/2))] ∫_0^∞ w^{(m−1)/2} e^{−w(1+t²/m)/2} dw.
With a little insight, the integral can be solved directly by comparison with the p.d.f. of the
gamma distribution. However, for the less confident, evaluation of the integral is facilitated
by making the further transformation from w to y = w(1 + t2 /m)/2. (Recall that this is
for fixed t, which is thus at this stage viewed as a constant.) The Jacobian is 2/(1 + t2 /m),
and the integral term becomes:
∫_0^∞ [2y(1 + t²/m)^{−1}]^{(m−1)/2} · 2(1 + t²/m)^{−1} e^{−y} dy
  = 2^{(m+1)/2} (1 + t²/m)^{−(m+1)/2} ∫_0^∞ y^{(m+1)/2 − 1} e^{−y} dy
  = 2^{(m+1)/2} (1 + t²/m)^{−(m+1)/2} Γ((m + 1)/2)
Substituting this value for the integral back into the expression for fT (t) gives the result
stated in the theorem.
Hypothesis tests and confidence intervals can thus be based upon observed values of the ratio:
T = (X̄ − µ)/(S/√n)
and critical values of the t-distribution with n − 1 degrees of freedom. This should be very
familiar to you.
A greater understanding of Theorem 5.3 can be developed by looking at the various types of
two-sample tests, and the manner in which different t-tests occur in each of these. These were
all covered in first year, but we need to examine the origins of these tests.
Suppose that XA1, XA2, . . . , XAm is a random sample of size m from the N(µA, σA²) distribution, and that XB1, XB2, . . . , XBn is a random sample of size n from the N(µB, σB²) distribution. We
suppose further that the two samples are independent. Typically, we are interested in drawing
inferences about the difference in population means:
∆AB = µA − µB .
Let X̄A and X̄B be the corresponding sample means. We know that X̄A is normally distributed with mean µA and variance σ²A/m, while X̄B is normally distributed with mean µB and variance σ²B/n. Since X̄A and X̄B are independent (since the samples are independent), we know further that X̄A − X̄B is normally distributed with mean ∆AB and variance:

σ²A/m + σ²B/n
and thus the term:
ZAB = (X̄A − X̄B − ∆AB) / √(σ²A/m + σ²B/n)
has the standard normal distribution.
If the variances are known, then we can immediately move to inferences about ∆AB. If they are not known, we will wish to use the sample variances S²A and S²B. The trick inspired by Theorem 5.3 is to look for a ratio of a standard normal variate to the square root of a χ² variate, and hope that the unknown population variances will cancel. Certainly, (m − 1)S²A/σ²A has the χ² distribution with m − 1 degrees of freedom and (n − 1)S²B/σ²B has the χ² distribution with n − 1 degrees of freedom. Furthermore, we know that their sum:

UAB = (m − 1)S²A/σ²A + (n − 1)S²B/σ²B

has the χ² distribution with m + n − 2 degrees of freedom. (Why?) This does not, however, seem to lead to any useful simplification in general.
But see what happens if σ²A = σ²B = σ², say. In this case ZAB and UAB become:

ZAB = (X̄A − X̄B − ∆AB) / (σ √(1/m + 1/n))

and

UAB = [(m − 1)S²A + (n − 1)S²B] / σ² = (m + n − 2)S²pool / σ²

where the “pooled” variance estimator is defined by:

S²pool = [(m − 1)S²A + (n − 1)S²B] / (m + n − 2).

Now if we take T = ZAB / √(UAB/(m + n − 2)), then we get:

T = (X̄A − X̄B − ∆AB) / (Spool √(1/m + 1/n))
which by Theorem 5.3 has the t-distribution with m + n − 2 degrees of freedom. We can thus
carry out hypothesis tests, or construct confidence intervals for ∆AB .
Example: Suppose we have observed the results of two random samples as follows:
m = 8,  x̄A = 61,  Σ_{i=1}^{8} (xAi − x̄A)² = 1550
n = 6,  x̄B = 49,  Σ_{i=1}^{6} (xBi − x̄B)² = 690
We are required to test the null hypothesis that µA − µB ≤ 5, against the one-sided
alternative that µA − µB > 5 at the 5% significance level, under the assumption that the
variances of the two populations are the same. The pooled variance estimate is:
s²pool = (1550 + 690) / (8 + 6 − 2) = 186.67

and thus under the null hypothesis, the t-statistic works out to be:

t = (61 − 49 − 5) / (√186.67 · √(1/8 + 1/6)) = 0.949
The 5% critical value for the t-distribution with 8 + 6 − 2 = 12 degrees of freedom is 1.782,
and we cannot thus reject the null hypothesis.
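As a supplementary check (not part of the original notes), the arithmetic of this example can be verified with a short script; scipy is assumed to be available only for the critical value.

    # Sketch: verifying the pooled two-sample t computation of the example above.
    from math import sqrt
    from scipy import stats

    m, xbar_A, ss_A = 8, 61.0, 1550.0   # size, mean, sum of squared deviations for sample A
    n, xbar_B, ss_B = 6, 49.0, 690.0    # size, mean, sum of squared deviations for sample B
    delta_0 = 5.0                        # hypothesized difference mu_A - mu_B under H0

    s2_pool = (ss_A + ss_B) / (m + n - 2)                    # pooled variance estimate
    t_stat = (xbar_A - xbar_B - delta_0) / sqrt(s2_pool * (1/m + 1/n))
    t_crit = stats.t.ppf(0.95, df=m + n - 2)                 # 5% one-sided critical value

    print(f"s2_pool = {s2_pool:.2f}, t = {t_stat:.3f}, critical value = {t_crit:.3f}")
    # s2_pool = 186.67, t = 0.949, critical value = 1.782: do not reject H0.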
In general, when variances are not equal, there appears to be no way in which a ratio of a normal
to the square root of a χ2 can be constructed, in such a way that both unknown population
variances cancel out. This is called the Behrens-Fisher problem. Nevertheless, we would expect
that a ratio of the form

(X̄A − X̄B − ∆AB) / √(S²A/m + S²B/n)
“should have something like” a t-distribution. Empirical studies (e.g. computer simulation) have shown that this is indeed true, but the relevant “degrees of freedom” giving the best approximation to the true distribution of the ratio depends on the problem structure in a rather complicated manner (and usually turns out to be a fractional number, making it hard to interpret). A number of approximations have been suggested on the basis of numerical studies, one of which is incorporated into the STATISTICA package.
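For reference (an addition to the notes, and not necessarily the specific approximation used in STATISTICA), one widely used approximation of this kind is the Welch–Satterthwaite formula for the “effective” degrees of freedom; a minimal sketch follows, with the sample values chosen purely for illustration.

    # Sketch: Welch-Satterthwaite approximate degrees of freedom for the unequal-variance ratio.
    # This is one common choice; the notes do not specify which approximation STATISTICA uses.
    def welch_df(s2_A: float, m: int, s2_B: float, n: int) -> float:
        """Approximate df for (Xbar_A - Xbar_B) / sqrt(S2_A/m + S2_B/n)."""
        a, b = s2_A / m, s2_B / n
        return (a + b) ** 2 / (a ** 2 / (m - 1) + b ** 2 / (n - 1))

    # Illustrative (hypothetical) sample variances and sizes:
    print(round(welch_df(5.14, 7, 1.08, 6), 2))   # typically a fractional value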
There is one further special case, however, which is interesting because it allows, ironically, the relaxation of other assumptions. This is the case in which m = n. We can now pair the observations (at this stage in any way we like), and form the differences Yi = XAi − XBi say. The Yi will be i.i.d. normally distributed with mean ∆AB and with unknown variance σ²Y = σ²A + σ²B. The problem thus reduces to the problem of drawing inferences about the population mean for a single sample, when the variance is unknown. Note that we only need to estimate σ²Y, and not the individual variances σ²A and σ²B.
In order to apply this idea, we need only to have that the Yi are i.i.d. It is perfectly permissible to
allow XAi and XBi to share some dependencies for the same i. They might be correlated, or both
of their means may be shifted from the relevant population means by the same amount. This
allows us to apply the differencing technique to “paired” samples, i.e. when each pair XAi , XBi
relate to observations on the same subject under two different conditions. For example, each “i”
may relate to a specific hospital patient chosen at random, while XAi and XBi refer to responses
to two different drugs tried at different times. All we need to verify is that the differences are
i.i.d. normal, after which we use one sample tests.
Example: A random sample of ten students is taken, and their results in economics and statistics are recorded in each case as follows.
If care was taken in the selection of the random sample of students, then the statistics
results and the economics results taken separately would represent random samples. But
the two marks for the same student are unlikely to be independent, as a good student in one
subject is usually likely to perform well in another. But the last column above represents a
single random sample of the random variable defined by the amount by which the economics
mark exceeds the statistics mark, and these seem to be plausibly independent. For example,
if there is no true mean difference (across the entire population) between the two sets of
marks, then there is no reason to suggest that knowing that one student scored 5% more on
economics than on statistics has anything to do with the difference experienced by another
student, whatever their absolute marks.
The test of the hypothesis of no difference between the two courses is equivalent to the null
hypothesis that E[Y] = 0. The sample mean and sample variance of the differences above are 4.9 and 32.99 respectively. The standard (one-sample) t-statistic is thus 4.9/√(32.99/10) = 2.698. Relevant critical values of the t-distribution with 9 degrees of freedom are t9;0.025 =
2.262 and t9;0.01 = 2.821. Presumably a two-sided test is relevant (as we have not been
given any reason why differences in one direction should be favoured over the other), and
thus the “p-value” lies between 5% (2 × 0.025) and 2% (2 × 0.01). Alternatively, we can
reject the hypothesis at the 5%, but not at the 2% significance level.
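As a numerical aside (not part of the original notes), the paired-sample calculation can be reproduced from the summary statistics quoted above; scipy is assumed for the p-value.

    # Sketch: paired (one-sample) t-test from the summary statistics of the differences.
    from math import sqrt
    from scipy import stats

    n, dbar, s2_d = 10, 4.9, 32.99            # sample size, mean and variance of the differences
    t_stat = dbar / sqrt(s2_d / n)            # one-sample t statistic for H0: E[Y] = 0
    p_two_sided = 2 * stats.t.sf(abs(t_stat), df=n - 1)

    print(f"t = {t_stat:.3f}, two-sided p-value = {p_two_sided:.3f}")
    # t = 2.698 on 9 degrees of freedom; the p-value lies between 0.02 and 0.05, as argued above.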
5.6 The F Distribution

In the previous section, we looked at the concept of a “two-sample” problem. Our concern there was with comparing the means of two populations. Now, however, let us look at comparing their variances. There are at least two reasons why we may wish to do this:
1. We have a real interest in knowing whether one population is more or less variable than
another. For example, we may wish to compare the variability in two production processes,
or in two laboratory measurement procedures.
2. We may merely wish to know whether we can use a pooled variance estimate for the t-test
for comparing the means.
In the case of variances, it is convenient to work in terms of ratios, i.e. σ²A/σ²B. Equality of variances means that this ratio is 1. We have available to us the sample variances S²A and S²B, and we might presumably wish to base inferences upon S²A/S²B. The important question is: what is the probability distribution of S²A/S²B for any given population ratio σ²A/σ²B?

We know that U = (m − 1)S²A/σ²A and V = (n − 1)S²B/σ²B have χ² distributions with m − 1 and n − 1 degrees of freedom respectively. Thus let us consider the function:

[U/(m − 1)] / [V/(n − 1)] = (S²A/S²B) / (σ²A/σ²B).
Since we know the distributions of U and V , we can derive the distribution of the above ratio,
which will give us a measure of the manner in which the sample variance ratio departs from
the population variance ratio. The derivation of this distribution is quite simple in principle,
although it becomes algebraically messy. We shall not give the derivation here, but recommend
it as an excellent exercise for the student. We simply state the result in the following theorem.
Theorem 5.4 Suppose that U and V are independent random variables, having χ2 distributions
with r and s degrees of freedom respectively. Then the probability density function of
Z = (U/r) / (V/s)

is given by:

fZ(z) = [Γ((r + s)/2) / (Γ(r/2)Γ(s/2))] (r/s)^{r/2} z^{r/2 − 1} (rz/s + 1)^{−(r+s)/2}
for z > 0.
The trick in demonstrating the above is to transform to Z and W = V , obtain the joint p.d.f. of
Z and W , and hence the marginal p.d.f. of Z.
The distribution defined by the above p.d.f. is called the F-distribution with r and s degrees
of freedom (often called the numerator and denominator degrees of freedom respectively). One
word of caution: when using tables of the F-distribution, be careful to read the headings, to see
whether the numerator degree of freedom is shown as the column or the row. Tables are not
consistent in this sense.
As with the other distributions we have looked at, we shall use the symbol Fr,s;α to represent
the upper 100α% critical value for the F-distribution with r and s degrees of freedom. In other
words, with Z defined as above:
Pr[Z > Fr,s;α ] = α.
Tables are generally given separately for a number of values of α, each of which gives Fr,s;α for various combinations of r and s.
Example: We have two alternative laboratory procedures for carrying out the same analy-
sis. Let us call these A and B. Seven analyses have been conducted using procedure A
(giving measurements XA1 , . . . , XA7 ), and six using procedure B (giving measurements
XB1, . . . , XB6). We wish to test the null hypothesis that σ²A = σ²B, against the alternative
that σ²A > σ²B, at the 5% significance level. Under the null hypothesis, S²A/S²B has the F-distribution with 6 and 5 degrees of freedom. Since F6,5;0.05 = 4.95, it follows that:

Pr[S²A/S²B > 4.95] = 0.05

Suppose now that we observe S²A = 5.14 and S²B = 1.08. This may look convincing, but the ratio is only 4.76, which is less than the critical value. We can’t at this stage reject the null hypothesis (although I would not be inclined to “accept” it either!).
You may have noticed that tables of the F-distribution are only provided for smaller values of
α, e.g. 10%, 5%, 2.5% and 1%, all of which correspond to variance ratios greater than 1. For
one-sided hypothesis tests, it is always possible to define the problem in such a way that the
alternative involves a ratio greater than 1. But for two-sided tests, or for confidence intervals,
one does need both ends of the distribution. There is no problem! Since

Z = (U/r) / (V/s)

has (as we have seen above) the F-distribution with r and s degrees of freedom, it is evident that:

Y = 1/Z = (V/s) / (U/r)

has the F-distribution with s and r degrees of freedom.
Now, by definition:

1 − α = Pr[Z > Fr,s;1−α] = Pr[Y < 1/Fr,s;1−α]

and thus:

Pr[Y ≥ 1/Fr,s;1−α] = α.

Since Y is continuous, and has the F-distribution with s and r degrees of freedom, it follows therefore that:

Fs,r;α = 1/Fr,s;1−α

i.e. for smaller values of α, and thus larger values of 1 − α, we have:

Fr,s;1−α = 1/Fs,r;α.
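A quick numerical confirmation of this reciprocal relationship (an addition to the notes, assuming scipy is available):

    # Sketch: numerical check of F_{s,r; alpha} = 1 / F_{r,s; 1-alpha}.
    from scipy import stats

    r, s, alpha = 6, 5, 0.05
    lhs = stats.f.ppf(1 - alpha, dfn=s, dfd=r)    # F_{s,r; alpha}: point with upper-tail prob. alpha
    rhs = 1.0 / stats.f.ppf(alpha, dfn=r, dfd=s)  # 1 / F_{r,s; 1-alpha}: reciprocal of the lower point

    print(round(lhs, 3), round(rhs, 3))           # the two values agree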
Example (Confidence Intervals): Suppose that we wish to find a 95% confidence interval for the ratio σ²A/σ²B in the previous example. We now know that:

Pr[F6,5;0.975 < (S²A/σ²A) / (S²B/σ²B) < F6,5;0.025] = 0.95

The tables give us F6,5;0.025 = 6.98 directly, and thus 1/F6,5;0.025 = 0.143. Using the above relationship, we also know that 1/F6,5;0.975 = F5,6;0.025 = 5.99 (from tables). Since the observed value of S²A/S²B is 5.14/1.08 = 4.76, it follows that the required 95% confidence interval for the variance ratio is [0.682 ; 28.51].
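A short script reproducing this interval (an addition to the notes; scipy is assumed, so the critical values come from the distribution rather than from tables):

    # Sketch: 95% confidence interval for sigma^2_A / sigma^2_B based on the observed ratio.
    from scipy import stats

    s2_A, s2_B = 5.14, 1.08
    m, n = 7, 6                        # sample sizes from the laboratory example
    ratio = s2_A / s2_B                # observed S^2_A / S^2_B = 4.76

    F_upper = stats.f.ppf(0.975, dfn=m - 1, dfd=n - 1)   # F_{6,5;0.025}, about 6.98
    F_lower = stats.f.ppf(0.025, dfn=m - 1, dfd=n - 1)   # F_{6,5;0.975} = 1/F_{5,6;0.025}

    ci = (ratio / F_upper, ratio / F_lower)
    print(f"ratio = {ratio:.2f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.2f})")   # about (0.68, 28.5)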
Tutorial Exercises
5.1 Suppose we are interested in students’ smoking habits (e.g. the distribution of weekly smoking (number of cigarettes during one week, say)). The correct procedure would be to draw a sample of students at random (from student numbers, say) and to interview each in order to establish their smoking patterns. This may be a difficult or expensive procedure. Comment on the extent to which the following alternative procedures also satisfy our definition of a random sample:
(a) Interview every 100th student entering Cafe Nescafe during one week.
(b) Interview all residents of Smuts Hall, or of Smuts and Fuller Halls.
(c) E-mail questionnaires to all registered students and analyze responses received.
(d) Interview all students at a local student pub next Friday night.
5.2 Laboratory measurements on the strength of some material are supposed to be distributed normally around a true mean material strength µ, with variance σ². Let X1, X2, . . . denote individual measurements. Based on a random sample of size 16, the following statistic was computed: Σ_{i=1}^{16} (xi − x̄)² = 133. Can you reject the hypothesis σ² = 4.80?
5.3 In the problem of Question 5.2, suppose that for a sample of size 21, a value of 5.1 for the statistic
s2 was observed. Construct a 95% confidence interval for σ 2 .
5.4 W, X, Y, Z are independent random variables, where W, X, and Y have the following normal dis-
tributions:
W ∼ N(0, 1), X ∼ N(0, 1/9) and Y ∼ N(0, 1/16)
Operator 1 2 3 4 5 6 7 8
Time (before course) 23 18 16 15 19 21 31 22
Time (after course) 17 14 13 13 12 20 14 17
Would you conclude that the course has speeded up their times? Is there evidence for the claim
that the course, on average, leads to a reduction of at least one minute per activity?
5.6 One of the occupational hazards of being an airplane pilot is the hearing loss that results from being
exposed to high noise levels. To document the magnitude of the problem, a team of researchers
measured the cockpit noise levels in 18 commercial aircraft. The results (in decibels) are as follows:
Plane Noise level (dB) Plane Noise level (dB) Plane Noise level (dB)
1 74 7 80 13 73
2 77 8 75 14 83
3 80 9 75 15 86
4 82 10 72 16 83
5 82 11 90 17 83
6 85 12 87 18 80
(a) Find a 95% confidence interval for µ by firstly assuming that σ² = 27 and secondly by assuming that σ² is unknown and that you have to estimate it from the sample. (Assume that you are sampling from a normal population.)
(b) Find a 95% confidence interval for σ² by firstly assuming that µ = 80.5 and secondly by assuming that µ is unknown and that you have to estimate it from the sample. (Assume that you are sampling from a normal population.)
5.7 Two procedures for refining oil have been tested in a laboratory. Independent tests with each
procedure yielded the following recoveries (in ml. per l. oil):
Procedure A : 800.9; 799.1; 824.7; 814.1; 805.9; 798.7; 808.0; 811.8; 796.6; 820.5
Procedure B: 812.6; 818.3; 523.0; 911.2; 823.9; 841.1; 834.7; 824.5; 841.8; 819.4; 809.9; 837.5;
826.3; 817.5
We assume that recovery per test is distributed N(µA, σ²A) for procedure A, and N(µB, σ²B) for procedure B.
(a) If we assume σ²A = σ²B, test whether Procedure B (the more expensive procedure) has higher recovery than A. Construct a 95% confidence interval for ∆AB = µB − µA.
(b) What if we cannot assume σ²A = σ²B?
Chapter 6
Order Statistics
In the previous chapter, we examined the properties of random samples from a normal distri-
bution. In this chapter, we look at random samples from any arbitrary continuous distribution.
Thus let X1 , X2 , . . . , Xn be a random sample of size n from a distribution described by the con-
tinuous distribution function F (x). Suppose that we order the observed values of the random
sample from smallest to largest, and denote the sorted values by X(1), X(2), . . . , X(n), where X(1) ≤ X(2) ≤ · · · ≤ X(n).
Prior to observing the random sample, we won’t know the values of X(1) , X(2) , . . . , X(n) , and we
won’t even know which observation will turn out to be the smallest, second smallest, etc. But for
any given set of observations we can calculate the corresponding realizations of X(1) , X(2) , . . . , X(n) .
Thus each of these quantities satisfies our definition of a statistic: they are termed the order
statistics of the sample. The “five-number summary” introduced at the start of the first year
course consisted of certain order statistics, or averages of pairs of order statistics. Other useful
summaries can also be derived from the order statistics, such as range, or inter-quartile range,
which are alternatives to standard deviation as a measure of spread. As with other statistics,
such as the sample mean and variance, we need to derive the distributions of the order statistics,
if we are to use them for statistical inference. In fact, the distributions of order statistics are
valuable for other purposes apart from inferences, as we shall see in an example below.
Since the observations are independent, the joint p.d.f. of the (unordered) sample X1, X2, . . . , Xn is ∏_{i=1}^{n} f(xi), where f(x) is the p.d.f. corresponding to F(x). Now any particular realization of the order
statistics X(1) = x(1) , X(2) = x(2) , . . . , X(n) = x(n) can arise in any one of n! ways (i.e. any
permutation of the order statistics is a possible set of observations giving rise to the same order
statistics). It does then follow that the joint p.d.f. of X(1) , X(2) , . . . , X(n) is given by:
n! ∏_{i=1}^{n} f(x(i))
for x(1) < x(2) < . . . < x(n) . (Compare the definition of the ranges of non-zero probability density
with some of the examples from Chapter 2.) By transformation of random variables and repeated
integration, we can therefore in principle derive the p.d.f. of any set of order statistics or functions
thereof. This could also yield the marginal p.d.f.’s of each individual X(i) . Fortunately, however,
the marginal distribution functions can be obtained more easily (after which the densities are
obtained by differentiation). This we shall now do.
For ease of notation, let us denote the distribution function of X(r) by F(r) (x) = Pr[X(r) ≤ x],
with p.d.f. f(r) (x). The case r = n (the largest order statistic) is particularly simple as the event
{X(n) ≤ x} is simply the event that all n observations do not exceed x. By independence of the
initial observations, therefore:
F(n)(x) = Pr[X(n) ≤ x] = [F(x)]^n.

The case r = 1 (the smallest order statistic) is almost as easy. In this case, we note that the event {X(1) > x} is the event that all n observations are greater than x. Thus, once again by independence:

1 − F(1)(x) = Pr[X(1) > x] = [1 − F(x)]^n

i.e.:

F(1)(x) = 1 − [1 − F(x)]^n.
The intermediate cases require a little more care. For any given real number x, define Y to be
the number of observations in the random sample which do not exceed x. If the observations
are classified simply as “≤ x” (a “success”) or “> x” ( a “failure”), then Y is the number of
successes in n independent trials, where probability of success is the probability that a single
observation does not exceed x, which is F (x) by definition. In other words, Y has the binomial
distribution with parameters n and F (x). The event {X(r) ≤ x} for arbitrary r between 1 and n
is equivalent to the occurrence of at least r “successes”, i.e. to r ≤ Y ≤ n. Thus from properties
of the binomial distribution:
F(r)(x) = Σ_{i=r}^{n} (n choose i) [F(x)]^i [1 − F(x)]^{n−i}.
The p.d.f. can be obtained by differentiation, which, after some simplification, turns out to be:
f(r)(x) = [n! / ((r − 1)!(n − r)!)] f(x) [F(x)]^{r−1} [1 − F(x)]^{n−r}.
We leave this as an exercise.
Example: Let X be the lifetime of a single light bulb, which is exponentially distributed with a
mean of 2000 hours. Six light bulbs are put into operation together (in some sort of bank
of lights). What is:
The probability that the time until the 5th bulb fails is greater than 8000 hours,
assuming that no bulbs are replaced in the interim?
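The worked solution to this example is not reproduced in this extract; the sketch below (an addition) shows how the probability could be evaluated using the binomial argument of this section, with scipy assumed for the binomial c.d.f.

    # Sketch: Pr[X_(5) > 8000] for 6 exponential(mean 2000) lifetimes.
    # X_(5) > 8000 means that at most 4 of the 6 bulbs have failed by 8000 hours.
    from math import exp
    from scipy import stats

    n, r, t, mean_life = 6, 5, 8000.0, 2000.0
    p_fail = 1 - exp(-t / mean_life)          # F(8000) = 1 - e^{-4}: prob. a single bulb has failed

    prob = stats.binom.cdf(r - 1, n, p_fail)  # Pr[Y <= 4] where Y ~ Bin(6, F(8000))
    print(round(prob, 4))                     # roughly 0.005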
Generally, we have to use trial and error to find values of r and s, as close together as possible,
but for which the above expression evaluates to at least 1 − α. Fortunately, this trial and error
is facilitated by using binomial tables, as illustrated in the next example.
Example: Suppose we wish to find a 90% confidence interval for ξ0.4 based on a sample of size
10. From binomial tables, we note that for n = 10 and p = 0.4 we have Pr[Y ≤ 1] =
Pr[Y < 2] = 0.046, while Pr[Y ≤ 6] = Pr[Y < 7] = 0.945. Thus the coverage probability is 0.945 − 0.046 = 0.899. This is not quite 0.9, but we shouldn’t be too obsessive about this! Thus [X(2) ; X(7)] is
approximately a 90% confidence interval for ξ0.4 .
For larger values of n, it becomes easier to use the normal approximation to the binomial distri-
bution for Y . In other words, we approximate the distribution of Y by the normal distribution
with mean np and variance np(1 − p). In moving in this way from a discrete to a continuous
distribution, we need to apply the “continuity correction”. What this in effect means is that we
approximate the probability that Y = i (for any integer i), by the probability implied by the
normal distribution for the interval i − 21 < Y < i + 12 . Thus the event {r ≤ Y < s} is replaced
by {r − 21 < Y < s − 21 }. In other words, by standardizing the normal distribution in the usual
way, we approximate Pr[r ≤ Y < s] by:
" #
r − 12 − np s − 12 − np
Pr p <Z< p
np(1 − p) np(1 − p)
where Z has the standard normal distribution. From normal tables, we can look up the critical
value zα/2 such that:
Pr[−zα/2 < Z < +zα/2 ] = 1 − α
and thus by equating corresponding terms above, we can solve for r and s. These will normally
turn out to be fractions, which means that we have to widen the interval by moving out to the
next integer values.
Example: Suppose we wish to find a 95% confidence interval for the median (i.e. ξ0.5) of a distribution, based on a sample of size 100. Thus np = 50, np(1 − p) = 25 and √(np(1 − p)) = 5. We thus need to find r and s such that:

Pr[ (r − 50.5)/5 < Z < (s − 50.5)/5 ] = 0.95
The 2.5% critical value of the standard normal distribution is 1.96. We therefore require:

(r − 50.5)/5 = −1.96

which gives r = 40.7, and:

(s − 50.5)/5 = 1.96

which gives s = 60.3. Moving out to the next integers gives r = 40 and s = 61, and thus the desired confidence interval for the median is [X(40) ; X(61)].
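The normal-approximation calculation above is easily automated; a sketch (an addition to the notes, with scipy assumed for the normal quantile) follows.

    # Sketch: distribution-free CI for a quantile xi_p via the normal approximation with
    # continuity correction, as in the example above (n = 100, p = 0.5, 95% confidence).
    from math import sqrt, floor, ceil
    from scipy import stats

    n, p, conf = 100, 0.5, 0.95
    z = stats.norm.ppf(1 - (1 - conf) / 2)        # 1.96 for 95%
    centre, sd = n * p, sqrt(n * p * (1 - p))

    r = centre + 0.5 - z * sd                      # solve (r - 1/2 - np)/sd = -z
    s = centre + 0.5 + z * sd                      # solve (s - 1/2 - np)/sd = +z
    r_int, s_int = floor(r), ceil(s)               # widen outward to the next integers

    print(r, s, r_int, s_int)                      # about 40.7, 60.3 -> interval [X_(40); X_(61)]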
Tutorial Exercises
6.1 Suppose that n = 3 observations are taken from an exponential pdf:
fY (y) = e−y
for y > 0.
6.2 A random sample of size n = 6 is taken from the pdf, fY (y) = 2y for 0 ≤ y ≤ 1.
(a) Find the pdf of the smallest order statistic, Y(1) .
(b) Find the pdf of the largest order statistic Y(6) .
(c) Find the expectation and variance of the largest order statistic.
(d) If n = 600, construct a distribution free 90% confidence interval for the lower quartile.
6.3 A random sample of size n = 5 is taken from the pdf fX(x) = 1/(2π) for 0 ≤ x ≤ 2π.
(a) Find the pdf of the smallest order statistic, X(1) .
(b) Find the pdf of the largest order statistic X(5) .
(c) Find the pdf of the median.
(d) Construct a distribution free 70% confidence interval for the upper quartile.
(e) Calculate Pr[0 < X(1) < 0.6].
6.4 Suppose that n observations are taken at random from the pdf

fY(y) = (1/(6√(2π))) e^{−(1/2)[(y−20)/6]²}.
What is the probability that the smallest observation is larger than 20?
Part II
Statistical Inference
Chapter 7

Introduction to Statistical Inference
Statistical Inference refers to the process of arguing from particular observations to general
conclusions that can be supported by these observations. Philosophically, this process relates to
inductive reasoning, rather than to the traditional mathematical approach of deductive reasoning
(arguing from general axioms to specific implications). Note that the concept of mathematical
induction is still a deductive reasoning process.
There are a few key concepts that need to be absorbed and understood before embarking upon a
formal analysis of statistical inference. These are the concepts of population, model, param-
eters and observations. We define each in turn, before illustrating the concepts by means of
examples.
Population: The full set (or universe) within which the general conclusions are required. Such
a population may be real (e.g. all students at UCT), or hypothetical (e.g. the set of all
prices that a particular share may take on at a particular point in time).
Model: A mathematical representation of relationships that may exist between and within ele-
ments of the population, and of external measurements made on elements of the population.
Parameters: Numerical values which (if known) would fully characterize the population and
relevant models.
Observations: Measurements made on selected elements (typically a “random sample” – see
below) from the population.
It must be stressed that the population, model and parameters are a constructed representation
of reality made by the statistician, with a specific purpose in mind (e.g. to make a decision).
True reality is always more complex than any such construction. The professional art of the
statistician lies in putting together the simplest construction that provides sufficient insight and
knowledge for the purpose at hand. This professional art comes from years of practice. The
examples provided in this course are precisely that – examples!
Now let us look at a few examples of possible constructions of populations, models and parame-
ters.
1. If we are interested in knowing what proportion of the student body may be abusing drugs
(so as to decide whether interventions are necessary), it may be sufficient to take all current
registered students as the population. Models may relate drug use to behavioural and phys-
iological responses which may be observable, and may describe responses to randomized
questionnaires, for example. Relevant parameters would be the true levels of abuse in the
population, as well as numerical coefficients defining models of relationships.
2. Suppose you are planning a large function in Cape Town on 7 October, and need to
protect against having the function disrupted by weather (either by contingency plans or
by insurance). A hypothetical population may be all October days in Cape Town that could
possibly ever occur. Models may relate to frequency of occurrence of various combinations
of wind, rain and temperature (a statistical model), as well as to short term trends and/or
to global links with factors such as the el niño effect, global warming, etc. Parameters may
include coefficients of regression models used, as well as means, variances and covariances
defining the statistical model. Available observations may include October weather records
for the last 50 years, say.
For purposes of this course, the theory of statistical estimation will be based fundamentally on
the assumption of a random sample, i.e. a sequence of independent and identically distributed
(i.i.d.) observations, say X1 , X2 , . . . , Xn , where the common probability distribution of the Xi
depends on the unknown parameter or parameters describing the population. Each Xi may in
general be multivariate in nature. In the above examples:
1. Xi may relate to physiological measurements and/or questionnaire responses from the i-th
student selected at random from the population. Randomness arises from the selection
process and from person-to-person variations.
2. Xi may be a 3-dimensional vector giving daily rainfall, maximum and minimum temper-
atures on the i-th October day drawn from climate records. Randomness arises from the
fundamentally unpredictable nature of weather over the long term (although variations
between two successive days might be smaller than between two days in different years.)
In practice, it may often be difficult to design a collection of observations (samples, surveys, etc.)
in such a way as to ensure that they are truly independent and identically distributed, but for
this course we shall simply assume the i.i.d. property.
Now let θ = {θ1 , θ2 , . . . , θp } denote the parameter vector of interest. Where p = 1, and no
confusion can arise, we may simply write θ (an unsubscripted element) as the parameter. The
random sampling assumption implies that each Xi has a probability distribution that can be
expressed in the form F (x|θ). We use the symbol for a conditional distribution (i.e. x|θ) in
order to emphasize the explicit dependency of the sampling distribution on the value(s) of the
parameter θ, even though θ may not be understood as a random variable, at least not in the
usual (conventional) sense of the term.
The sampling distribution may be discrete or continuous. For much of the work in this part
of the course, we really do not have to distinguish between discrete and continuous variables.
Thus for convenience we shall often write f (x|θ) to represent either a probability density or a
probability mass function.
Of central importance in statistical inference is the joint probability or probability density of the ensemble of all observations, i.e. ∏_{i=1}^{n} f(xi|θ). It turns out to be convenient to consider how this term varies over different values of the parameter θ for a given set of observations (x1, x2, . . . , xn). We then refer to the expression as the likelihood function, defined by:

L(θ|x1, x2, . . . , xn) = ∏_{i=1}^{n} f(xi|θ)    (7.1)
If for a given set of data and for two possible values of the parameter vector, say θ1 and θ2, we have L(θ1|x1, x2, . . . , xn) > L(θ2|x1, x2, . . . , xn), then the observed data are more likely to have occurred if the true population was such that θ = θ1, than if the true population was such that θ = θ2. Intuitively, this means that, in the light of the observed data, the assumption that the true population is characterized by θ = θ1 is a more plausible hypothesis than that the true population is characterized by θ = θ2. We shall make a lot of use of this argument in the later chapters.
At this stage it is important to stress that there are at least two widely adopted paradigms
(structures of reasoning) which are commonly applied to problems of statistical inference, namely
frequentist and Bayesian. These have been discussed in Chapter 1 in the context of the meaning
of probability, but the same ideas carry through to inference as indicated below.
Frequentist argument: The frequentist approach applies probability theory only to random variability in the observations Xi, as expressed through the sampling p.d.f. or mass function f(x|θ). The emphasis is on analysing the variability in observations that can in principle arise. In effect, the question is asked: “If we were (hypothetically) able to repeat the entire random sampling process, how different might our conclusions be?”. The extent of such variation is applied back to the actual sample which has been observed, in order to understand the errors we might make in adopting any conclusions.

Bayesian argument: The Bayesian approach uses probability theory to represent the current level of uncertainty in θ, by means of a probability distribution on θ. A conditional probability distribution on θ, conditional on the observed data x1, x2, . . . , xn, is then constructed using Bayes’ theorem (which does also require knowledge of f(xi|θ)).
For the purposes of the present course we shall be focusing only on the frequentist approach. The
Bayesian approach will be presented in the third year, where it will be linked with the concept
of “decision theory”, i.e. the analysis of decision making under uncertainty and risk.
In what follows, we shall examine first the problems of simply estimating θ, looking at different
means of constructing estimates and of choosing a “best” (point) estimate. After introducing
some more general theoretical principles, we shall turn to the related issues of hypothesis testing
and of constructing confidence intervals. The student will have encountered all of these concepts
during the first year course, but our intention now is to provide a solid theoretical foundation
for the methods introduced in the first year.
Chapter 8
Parameter Estimation
Recall that we define a random sample as a set of independent identically distributed random
variables, say X1 , X2 , . . . , Xn . The underlying probability distribution of the population random
variable X will typically be represented by the probability or probability density function f (x|θ),
in which we explicitly recognize the dependency on the unknown parameter(s) θ. An estimator
is a function which is applied to the observed values obtained from the random sample, in
order to obtain a single numerical estimate (a “point” estimate) for the value(s) of the unknown
parameter, for use in planning, decision making, etc. We shall discuss some different methods
that may be applied for purposes of constructing estimates, in order to demonstrate that there
is no one “right” way. The methods will all be based on the “frequentist” paradigm; Bayesian
estimates will be discussed at length in the third year course.
8.1 Method of Moments

In many senses this is the most “obvious” way to construct an estimator. Recall that we define
the moments of X about the origin by:

µ′r = E(X^r) = ∫_{−∞}^{∞} x^r f(x|θ) dx.    (8.1)
An entirely analogous concept can be based on sample observations. Let X1, . . . , Xn be a random sample from f(x|θ). We define the sample moments by:

m′r = (1/n) Σ_{i=1}^{n} x_i^r.    (8.2)

Let θ̃1, . . . , θ̃p be the solution obtained by equating m′r and µ′r (which is a function of θ = {θ1, . . . , θp}) for r = 1, . . . , p, assuming that such a solution exists.
The estimates (θ̃1 , . . . , θ̃p ) obtained in this way are called the Method of Moments Estima-
tors (MMEs).
Example: Suppose, for instance, that X has the N(µ, σ²) distribution, so that θ = (µ, σ²). Then:

µ′1 = E(X) = µ
µ′2 = E(X²) = µ² + σ².

Also

m′1 = (1/n) Σ_{i=1}^{n} xi = x̄
m′2 = (1/n) Σ_{i=1}^{n} xi².

Equating first moments, m′1 = µ′1 = µ gives µ̃ = x̄. Equating second moments, m′2 = µ̃² + σ̃², or

σ̃² = m′2 − (m′1)² = (1/n) Σ_{i=1}^{n} xi² − x̄² = (1/n) Σ_{i=1}^{n} (xi − x̄)².

The method of moments estimators are thus:

µ̃ = x̄,   σ̃² = (1/n) Σ_{i=1}^{n} (xi − x̄)².
Example: Consider now the gamma distribution, with p.d.f.

f(x|α, λ) = [λ^α / Γ(α)] x^{α−1} e^{−λx},   x ≥ 0, α > 0, λ > 0,

where the parameters are α and λ. Earlier in the course it was demonstrated that the moment generating function (mgf) is given by

M(t) = (λ / (λ − t))^α.
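The moment-matching step for this gamma example is not reproduced in this extract; the sketch below (an addition) carries it out numerically, assuming the standard results E(X) = α/λ and Var(X) = α/λ², which follow from the mgf above.

    # Sketch: method of moments for the gamma distribution, matching E(X) = alpha/lambda and
    # Var(X) = alpha/lambda^2 to the sample mean and (moment-based) sample variance.
    import numpy as np

    def gamma_mme(x):
        x = np.asarray(x, dtype=float)
        xbar = x.mean()
        s2 = (x ** 2).mean() - xbar ** 2      # m'_2 - (m'_1)^2
        alpha_tilde = xbar ** 2 / s2
        lambda_tilde = xbar / s2
        return alpha_tilde, lambda_tilde

    # Illustrative (hypothetical) data, since no sample is given at this point in the notes:
    rng = np.random.default_rng(1)
    x = rng.gamma(shape=4.0, scale=1 / 0.4, size=200)   # numpy uses scale = 1/lambda
    print(gamma_mme(x))                                  # estimates should be near (4.0, 0.4)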
In the above example, we equated population and sample moments about the origin. We could
equally well have used centred population and sample moments for r > 1. It is left as an exercise
to show that the same results would have been obtained in the above examples.
Why match the first p moments when θ has dimensionality p? Why not equate µ02 , . . . , µ0p+1
to m02 , . . . , m0p+1 ?
Why use moments to describe the distributions? Why not match population and sample
quantiles? For example, if p = 3, we could obtain estimates by matching the population
median and upper and lower quartiles to the sample equivalents.
Another estimation procedure which will be familiar from the first year course in the context of
linear regression is the method of least squares (LS). Suppose that we can relate the expectation
of X to θ, say in the form:
E(Xi ) = gi (θ).
An estimator θ̂ can then be obtained by minimizing the sum of squared deviations between the
Xi and their estimated expectations, that is to minimize:
S = Σ_{i=1}^{n} [xi − gi(θ̂)]²    (8.3)
LS estimators and their properties will be discussed in detail in STA2005S in the context of
linear regression.
Let us now illustrate these alternative methods of estimation by means of a simple example: estimating the parameter λ of the exponential distribution, for which the usual (first-moment) MME is λ̃ = 1/x̄.
1. Equate 2nd moments: From earlier parts of the course we know that the variance of X is 1/λ². An alternative MME is then given by λ̃ = 1/√(σ̂²).
2. Equate medians: The population median is the value of x such that F (x|λ) = 0.5, and
this is easily seen to be given by − ln(0.5)/λ = 0.693/λ. Let xmed be the sample median.
An estimator for λ is then given by λ̄ = 0.693/xmed .
3. Least squares: We now choose the value of λ which minimizes:

Σ_{i=1}^{n} (xi − 1/λ)².
By differentiating w.r.t. λ and setting to 0, we obtain λ = 1/x̄, i.e. the same as the MME.
So which of these three estimators is “best”? We shall defer detailed discussion of the principles
of evaluation of estimators to the next chapter, but it is worth commenting that of the above
estimators, the MME has the smallest variance. In other words, in repeated sampling there will
be less variability in the MME than in the other estimators.
In the remainder of this chapter we shall discuss yet another method of constructing estimators,
namely that of maximum likelihood. We shall see later that this class of estimators possesses a
number of desirable properties.
8.3 Method of Maximum Likelihood

The maximum likelihood estimate (MLE) θ̂ is the value of θ which maximizes the likelihood function L(θ) of equation (7.1), or equivalently the log-likelihood function:

l(θ) = ln L(θ) = Σ_{i=1}^{n} ln[f(xi|θ)]    (8.4)
Logarithms to any base can in theory be used, but in most cases it is algebraically more convenient
to work with natural logarithms, and we shall thus generally assume logarithms to base e (writing
ln x for loge x).
In some cases, when the range (or support) of X does not depend on θ and the log-likelihood
function is differentiable with respect to θ, the MLE can be found analytically, by solving the
following set of equations:
Ur(θ) = ∂l(θ)/∂θr = 0   for r = 1, . . . , k.    (8.5)
These equations are called the score equations.
We should, of course, check that the stationary point defined by (8.5) is indeed a maximum. For
k = 1, a maximum can be confirmed by checking that:
∂U(θ)/∂θ |_{θ=θ̂} = ∂²l(θ)/∂θ² |_{θ=θ̂} < 0    (8.6)
Example (Poisson distribution): Suppose that x1, . . . , xn is a random sample from the Poisson distribution with parameter λ, so that the log-likelihood is l(λ) = Σ_{i=1}^{n} xi ln λ − nλ − Σ_{i=1}^{n} ln(xi!). Setting the first derivative of l(λ) with respect to λ equal to zero we obtain the score equation:

l′(λ) = ∂l(λ)/∂λ = (1/λ) Σ_{i=1}^{n} xi − n = 0.

The maximum likelihood estimate (MLE) is thus given by λ̂ = x̄. To verify that x̄ gives the maximum:

∂U(λ)/∂λ = ∂²l(λ)/∂λ² = −(Σ_{i=1}^{n} xi)/λ² < 0.
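A numerical sketch of this calculation (an addition to the notes, using hypothetical count data):

    # Sketch: Poisson MLE lambda_hat = xbar, with a check that the score changes sign there.
    import numpy as np

    x = np.array([3, 0, 2, 4, 1, 2, 5, 1])        # hypothetical Poisson counts
    lam_hat = x.mean()                             # the MLE

    def score(lam):
        # U(lambda) = (1/lambda) * sum(x_i) - n
        return x.sum() / lam - len(x)

    print(lam_hat, score(lam_hat))                 # score is zero at the MLE
    print(score(lam_hat * 0.9) > 0, score(lam_hat * 1.1) < 0)   # positive below, negative above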
Exercise 8.1 The number of cars arriving at a supermarket parking lot per hour is assumed to be Poisson distributed with parameter λ. The following numbers of cars were observed arriving during the hour 9am to 10am on six consecutive days: 50, 47, 82, 91, 46, 64.
Find the maximum likelihood estimate of λ and plot the log-likelihood function of the data, showing
the MLE of λ.
Example 8.5 (Normal distribution): Suppose that x1, . . . , xn is a random sample from the N(µ, σ²) distribution, so that the log-likelihood is l(µ, σ²) = −(n/2) ln(2πσ²) − (1/(2σ²)) Σ_{i=1}^{n} (xi − µ)². Taking partial derivatives with respect to µ and σ² and setting both derivatives equal to zero:

∂l(µ, σ²)/∂µ = (1/σ²) Σ_{i=1}^{n} (xi − µ) = 0

∂l(µ, σ²)/∂σ² = −n/(2σ²) + (1/(2(σ²)²)) Σ_{i=1}^{n} (xi − µ)² = 0
The first equation implies that the MLE of µ is given by µ̂ = x̄. Then the MLE of σ² must satisfy:

σ̂² = (1/n) Σ_{i=1}^{n} (xi − µ̂)²

or:

σ̂² = (1/n) Σ_{i=1}^{n} (xi − x̄)².
In this case, the MLE and MME estimators are identical, but this result is not generally true for
all distributions.
Exercise 8.2 Check that x̄ and σ̂ 2 do indeed maximize the likelihood for θ = (µ, σ 2 )0 .
Suppose now that we observe Y = e^X, where X has the N(µ, σ²) distribution (so that Y has the lognormal distribution). Then the p.d.f. of Y is:

f(y|µ, σ) = [1/(yσ√(2π))] exp[−(ln(y) − µ)²/(2σ²)]
          = (1/y) · [1/(σ√(2π))] exp[−(x − µ)²/(2σ²)] = (1/y) f(x|µ, σ),

where x = ln(y). In general, if Y is a function of X, then f(y|µ, σ) = |∂x/∂y| f(x|µ, σ), where ∂x/∂y does not depend on θ, and therefore the parameter maximizing L(µ, σ|x) also maximizes L(µ, σ|y).
The value θ̂ that maximizes L(θ) is, of course, a function of the observed values of Xi = xi that
have been treated as fixed while viewing the likelihood as a function of θ. In discussing properties
of the maximum likelihood estimator it is often necessary to represent the dependency of the
MLE on the observed data values by writing the estimator explicitly as θ̂ = θ̂(X1 , . . . , Xn ). Note
that we reflect the dependency in terms of the random variables (X1 , . . . , Xn ), rather than in
terms of their particular fixed numerical values xi , in order to emphasize that the MLE θ̂ is also
a random variable, whose distribution depends on the unknown θ. The sampling distribution of θ̂ may in principle be derived from the sampling distribution of the Xi, i.e. from f(x|θ).
Within the sampling theory (frequentist) framework for inference, we think in terms of what
happens if the entire random sample were to be repeated (at least hypothetically) for the same
θ. How different might the estimator be? Such considerations form the basis of the methods for
evaluating the quality of estimators discussed in the next Chapter. As a necessary preliminary
to such methods of evaluation, we consider the distribution of the MLE (the random variable θ̂)
for large sample sizes.
8.4 Asymptotic Properties of MLEs

The results of this Section will be derived primarily for the uniparameter case, i.e. for p = 1 when
we can write θ = θ. Certain smoothness conditions will be assumed for f (x|θ), viz. that the first
two derivatives of f (x|θ), and thus also of the likelihood and log-likelihood functions, exist in an
interval of the real line which includes the true value of the parameter θ, say θ0 . Rigorous proofs
can be found in Cramér (1946).
Consistency

An estimator θ̂ is said to be consistent for θ if

Pr[|θ̂ − θ| > ε] → 0 as n → ∞

for any ε > 0. Note that this probability statement refers to sampling variation in the estimate (θ̂) for a fixed value of the parameter (θ). We then write θ̂ →_P θ as n → ∞.
Theorem 8.1 Under the smoothness conditions of f (x|θ), the MLE θ̂ is consistent.
Proof: The log-likelihood is

l(θ) = ln L(θ) = ln ∏_{i=1}^{n} f(Xi|θ) = Σ_{i=1}^{n} ln[f(Xi|θ)].
As the Xi ’s, i = 1, . . . , n are i.i.d., so are the ln[f (Xi |θ)], and hence by the weak law of large
numbers (WLLN) the right hand side of the equality converges in probability to its expected
value (mean) as n → ∞.
Thus

(1/n) l(θ) →_P E[ln f(X|θ)] as n → ∞, for all θ.
Thus, as n → ∞, the θ̂ that maximizes l(θ)/n tends to the value of θ that maximizes E[ln f (X|θ)].
Let us therefore maximize E[ln f (X|θ)]. At this point, the reader should note carefully the fol-
lowing two points:
1. The expression ln f (X|θ) (and thus also E[ln f (X|θ)]) is a function of θ for any arbitrary
value of θ; while
2. The expectation (with respect to sampling variation in X) is evaluated for the true value
of θ, i.e. θ0 .
In order then to maximize E[ln f(X|θ)] with respect to θ, the required derivative is:

∂/∂θ E[ln f(X|θ)] = ∂/∂θ ∫ ln f(x|θ) f(x|θ0) dx   [where θ0 is fixed]
                  = ∫ [∂/∂θ ln f(x|θ)] f(x|θ0) dx
                  = ∫ [∂f(x|θ)/∂θ] [1/f(x|θ)] f(x|θ0) dx.

Changing the order of the integration and differentiation is permissible because of the assumed smoothness properties of f(x|θ).
In other words, E[ln f(X|θ)] is maximized at θ = θ0 (the true parameter value), so that:

θ̂ →_P θ0 as n → ∞.

Equations (8.9) and (8.10) are both called the expected Fisher information, i.e.

I(θ) = E[(∂/∂θ ln f(X|θ))²] = −E[∂²/∂θ² ln f(X|θ)],

while −∂²/∂θ² ln f(X|θ) is called the observed Fisher information. The Fisher information is a measure of the amount of information about the parameter θ contributed by observing the random variable X. We shall often refer to I(θ) simply as the “Fisher information”.
Proof: Since ∫ f(x|θ) dx = 1:

∂/∂θ ∫ f(x|θ) dx = 0,

where the integrals will throughout be interpreted as the definite integral over the support of X (i.e. the set of values for which f(x|θ) > 0). Now

∂/∂θ ln f(x|θ) = [1/f(x|θ)] ∂f(x|θ)/∂θ

so that:

∂f(x|θ)/∂θ = [∂/∂θ ln f(x|θ)] f(x|θ).

Thus:

0 = ∂/∂θ ∫ f(x|θ) dx
  = ∫ ∂f(x|θ)/∂θ dx, since f(x|θ) is smooth
  = ∫ [∂/∂θ ln f(x|θ)] f(x|θ) dx.    (8.11)
Var(U) = E(U²) = E[(∂/∂θ ln f(X|θ))²] = I(θ).    (8.15)
Theorem 8.2 If f(x|θ) is smooth then the MLE θ̂ converges in distribution to the Normal as n → ∞, that is

θ̂ →_D N(θ0, 1/(nI(θ0))) as n → ∞.    (8.16)

Proof: Expanding the first derivative of the log-likelihood in a Taylor series around the true value θ0, and evaluating it at the MLE, gives:

l′(θ̂) ≈ l′(θ0) + (θ̂ − θ0) l″(θ0),

where the symbol ≈ is meant to signify an approximation which becomes increasingly exact as n → ∞. We have used the notational convention that l′(θ0) is l′(θ) evaluated at θ = θ0, and similarly for l″(θ0). Similarly, we shall also write ∂/∂θ ln f(Xi|θ) evaluated at θ = θ0 in simplified form as ∂/∂θ ln f(Xi|θ0).
Now l′(θ̂) = 0 since θ̂ is the MLE, so that

(θ̂ − θ0) ≈ −l′(θ0)/l″(θ0)

or

√n (θ̂ − θ0) ≈ −n^{−1/2} l′(θ0) / [n^{−1} l″(θ0)].    (8.17)
Consider first the denominator of (8.17): n^{−1} l″(θ0) = (1/n) Σ_{i=1}^{n} ∂²/∂θ² ln f(Xi|θ0). This expression is the average of i.i.d. random variables, and by the Weak Law of Large Numbers (WLLN) it converges in probability to its mean (expected value), which is

E[∂²/∂θ² ln f(X|θ0)] = −I(θ0).

In other words, the absolute difference between n^{−1} l″(θ0) and −I(θ0) can be made arbitrarily small for large enough n, and in particular much smaller than the variability in the numerator
in (8.17) (since we have seen that the variance of the numerator of (8.17) remains non-zero for all n by (8.19)). For this reason, we can for large enough n replace n^{−1} l″(θ0) by −I(θ0) in the approximation given by (8.17), so that:

√n (θ̂ − θ0) ≈ −n^{−1/2} [l′(θ0)] / (−I(θ0)) = n^{−1/2} [l′(θ0)] / I(θ0).    (8.20)
which is (apart from the factor n^{−1/2}) the sum of n i.i.d. random variables, and thus from the Central Limit Theorem (CLT) converges in distribution to a Normal distribution as n → ∞. Putting this last result together with the asymptotic mean and variance of √n(θ̂ − θ0) derived above implies that:

√n (θ̂ − θ0) →_D N(0, 1/I(θ0)) as n → ∞

or

θ̂ →_D N(θ0, 1/(nI(θ0))) as n → ∞,    (8.21)

which proves the theorem.
Note: For large n, E(θ̂) = θ0 , the true parameter value, so that the MLE, θ̂, is asymptotically
unbiased.
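The asymptotic normality result can be visualized by simulation; the following sketch is an addition to the notes, using the Poisson case (where λ̂ = X̄ and I(λ) = 1/λ) purely for illustration.

    # Sketch: checking theta_hat ~ N(theta_0, 1/(n I(theta_0))) by simulation for the Poisson MLE.
    # For the Poisson, lambda_hat = Xbar and I(lambda) = 1/lambda, so the approximate
    # variance of the MLE is lambda_0 / n.
    import numpy as np

    rng = np.random.default_rng(0)
    lam0, n, reps = 3.0, 50, 20_000

    mles = rng.poisson(lam0, size=(reps, n)).mean(axis=1)   # one MLE per simulated sample

    print("mean of MLEs     :", mles.mean())                # close to 3.0
    print("variance of MLEs :", mles.var())                 # close to lam0 / n = 0.06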
For p > 1 it turns out to be algebraically convenient to adopt a slight change in notation when
referring to Fisher information. We define In (θ) as the symmetric p × p matrix, now called the
expected Fisher information matrix, with (j, k)th element given by:

Injk(θ) = n E[ (∂/∂θj ln f(X|θ)) (∂/∂θk ln f(X|θ)) ]    (8.22)
        = −n E[ ∂²/(∂θj ∂θk) ln f(X|θ) ].    (8.23)
The equivalence of the forms given by (8.22) and (8.23) may be demonstrated as for the proof
of Lemma 8.1 in the uniparameter case.
Note the introduction of the n term in (8.22) and (8.23). We may observe that:

E[∂²/(∂θj ∂θk) l(θ)] = E[∂²/(∂θj ∂θk) Σ_{i=1}^{n} ln f(Xi|θ)]
                     = Σ_{i=1}^{n} E[∂²/(∂θj ∂θk) ln f(Xi|θ)]
                     = n E[∂²/(∂θj ∂θk) ln f(X|θ)] = −Injk(θ).
In other words, evaluation of the second derivatives of the log-likelihood function provides the
Fisher information matrix directly. It is often more convenient to evaluate the second derivatives
of the log-likelihood function rather than of ln f (X|θ), since the first derivatives of the log-
likelihood are required in any case in order to establish the MLE.
The following is then a generalization of Theorem 8.2.
Theorem 8.3 If f(x|θ) is smooth (i.e. all first and second partial derivatives are bounded), and if θ̂ is the MLE of θ, then

θ̂ →_D N(θ0, In^{−1}(θ0)) as n → ∞,    (8.24)

where θ0 is the true value of the parameter θ. The proof follows along the same lines as that of Theorem 8.2.
As seen in Example 8.5, the MLEs are given by: µ̂ = x̄ and σ̂² = (1/n) Σ_{i=1}^{n} (xi − x̄)².
We now obtain the elements of the information matrix from the second derivatives:

In11(θ) = −∂²l(θ)/∂µ² = n/σ²

In22(θ) = −∂²l(θ)/∂(σ²)² = −n/(2σ⁴) + Σ_{i=1}^{n} (xi − µ)²/σ⁶

In12(θ) = In21(θ) = −∂²l(θ)/∂µ∂σ² = n(x̄ − µ)/σ⁴.
The information matrix is based on the true value θ 0 which is unknown. As an approximation,
however, we could replace θ 0 by the MLE, which would give:
In12(θ̂) = In21(θ̂) = n(x̄ − x̄)/σ̂⁴ = 0

and

In22(θ̂) = −n/(2σ̂⁴) + Σ_{i=1}^{n} (xi − x̄)²/σ̂⁶ = −n/(2σ̂⁴) + nσ̂²/σ̂⁶ = n/(2σ̂⁴).

In matrix form:

In(µ̂, σ̂²) = [ n/σ̂²      0
               0      n/(2σ̂⁴) ]

with inverse given by:

In^{−1}(µ̂, σ̂²) = [ σ̂²/n     0
                    0      2σ̂⁴/n ].    (8.25)
Thus asymptotically x̄ and σ̂ 2 are jointly normally distributed and independent, with variances
given by the diagonal elements of (8.25). This seemingly important result turns out in this case,
however, not to be all that useful! For the normal distribution, we already know, even for small
sample sizes, that x̄ ∼ N (µ, σ 2 /n) (exactly) and that nσ̂ 2 /σ 2 is independently distributed as χ2
with n − 1 degrees of freedom, so that var(σ̂ 2 ) = 2σ 4 (n − 1)/n2 which tends to 2σ 4 /n as n → ∞.
Nevertheless, more generally (for non-normal distributions), use of the above type of calculations
for MLEs can give very useful results.
Let (x1, y1), (x2, y2), . . . , (xn, yn) represent a random sample of size n from a bivariate normal distribution. The likelihood function for this sample is given by:

L(µx, µy, σ²x, σ²y, ρ) = ∏_{i=1}^{n} f(xi, yi|µx, µy, σ²x, σ²y, ρ)

= exp{ −[1/(2(1 − ρ²))] Σ_{i=1}^{n} [ ((xi − µx)/σx)² − 2ρ ((xi − µx)/σx)((yi − µy)/σy) + ((yi − µy)/σy)² ] } / [2π √(σ²x σ²y (1 − ρ²))]^n    (8.26)
The log-likelihood is

l(µx, µy, σ²x, σ²y, ρ) = −n ln 2π − (n/2)[ln σ²x + ln σ²y + ln(1 − ρ²)]
  − [1/(2(1 − ρ²))] Σ_{i=1}^{n} [ ((xi − µx)/σx)² − 2ρ ((xi − µx)/σx)((yi − µy)/σy) + ((yi − µy)/σy)² ].    (8.27)
Let l(θ) = l(µx, µy, σ²x, σ²y, ρ). Taking partial derivatives with respect to µx and µy and setting them equal to zero gives

∂l(θ)/∂µx = [n/(σx(1 − ρ²))] [ (x̄ − µx)/σx − ρ (ȳ − µy)/σy ] = 0

∂l(θ)/∂µy = [n/(σy(1 − ρ²))] [ (ȳ − µy)/σy − ρ (x̄ − µx)/σx ] = 0.    (8.28)
Differentiating l(θ) with respect to σ²x and σ²y gives (after some algebraic manipulations which are left as an exercise):

∂l(θ)/∂σ²x = −[1/(2σ²x(1 − ρ²))] [ n(1 − ρ²) − Σ_{i=1}^{n} (xi − µx)²/σ²x + ρ Σ_{i=1}^{n} (xi − µx)(yi − µy)/(σxσy) ]

∂l(θ)/∂σ²y = −[1/(2σ²y(1 − ρ²))] [ n(1 − ρ²) − Σ_{i=1}^{n} (yi − µy)²/σ²y + ρ Σ_{i=1}^{n} (xi − µx)(yi − µy)/(σxσy) ].    (8.29)

Setting these derivatives equal to zero gives:

n(1 − ρ²) = Σ_{i=1}^{n} ((xi − µx)/σx)² − ρ Σ_{i=1}^{n} (xi − µx)(yi − µy)/(σxσy)

n(1 − ρ²) = Σ_{i=1}^{n} ((yi − µy)/σy)² − ρ Σ_{i=1}^{n} (xi − µx)(yi − µy)/(σxσy).    (8.30)
The fifth maximum likelihood equation is obtained by differentiating l(θ) with respect to ρ to
obtain

∂l(θ)/∂ρ = [1/(1 − ρ²)] { nρ − [1/(1 − ρ²)] [ ρ Σ_{i=1}^{n} { ((xi − µx)/σx)² + ((yi − µy)/σy)² } − (1 + ρ²) Σ_{i=1}^{n} (xi − µx)(yi − µy)/(σxσy) ] }.    (8.31)
Setting the derivative in (8.31) equal to zero and performing some algebraic simplifications yields the equation

n(1 − ρ²) = Σ_{i=1}^{n} [ ((xi − µx)/σx)² + ((yi − µy)/σy)² ] − [(1 + ρ²)/ρ] Σ_{i=1}^{n} (xi − µx)(yi − µy)/(σxσy).    (8.32)
(a) Suppose we wish to estimate ρ alone, the other four parameters being known. We need
then to solve (8.32). This is a cubic equation for the MLE ρ̂, and has three roots, two of
which may be complex!
(b) Suppose now that we wish to estimate σ²x, σ²y and ρ where µx and µy are known. We have to solve the log-likelihood equations (8.30) and (8.32), which yield the solutions [Hint: Add the two equations in (8.30), and subtract (8.32)]:

ρ = [ (1/n) Σ_{i=1}^{n} (xi − µx)(yi − µy) ] / (σxσy)    (8.33)

σ̂²x = (1/n) Σ_{i=1}^{n} (xi − µx)²    (8.34)

and

σ̂²y = (1/n) Σ_{i=1}^{n} (yi − µy)².    (8.35)

Hence from (8.33)

ρ̂ = [ (1/n) Σ_{i=1}^{n} (xi − µx)(yi − µy) ] / (σ̂x σ̂y).    (8.36)

In this case all three MLEs use the known population means, and ρ̂ is the sample correlation coefficient.
(c) Now suppose that we wish to estimate all parameters of the bivariate normal distribution. We solve (8.28), (8.30) and (8.32) together. Equation (8.28) reduces to

(x̄ − µx)/σx = ρ (ȳ − µy)/σy

(ȳ − µy)/σy = ρ (x̄ − µx)/σx,    (8.37)
which together imply (provided ρ² ≠ 1) that

x̄ − µx = ȳ − µy = 0.    (8.38)

Taken with (8.34), (8.35) and (8.36), which are solutions to (8.30) and (8.32), (8.38) gives the set of five MLEs

µ̂x = x̄,   σ̂²x = (1/n) Σ_{i=1}^{n} (xi − x̄)²,

µ̂y = ȳ,   σ̂²y = (1/n) Σ_{i=1}^{n} (yi − ȳ)²,    (8.39)

ρ̂ = [ (1/n) Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) ] / (σ̂x σ̂y).
8.5 Transformed Parameters

Suppose that interest centres on some function φ = h(θ) of the parameter rather than on θ itself. The MLE of the transformed parameter is obtained by simply applying the transformation to the MLE of θ:

ĥ(θ) = h(θ̂).    (8.40)
Consider now the one-parameter case. It would be possible to reparameterize the entire distribution in terms of φ = h(θ), but in many cases it may be convenient and sufficient to obtain an approximate variance (or standard error) of the estimate of φ, without re-calculating the new Fisher information. Recall from (8.40) that the MLE of φ = h(θ) is in any case given by h(θ̂) (so that no new calculations are needed in order to get the MLE for the transformed parameter). Such an approximation is given by the delta method.

Consider the first-order Taylor series approximation around the true parameter value:

h(θ̂) = h(θ) + [(θ̂ − θ)/1!] ∂h(θ)/∂θ + . . . .    (8.41)

Provided that the variation of θ̂ around θ is small, the variance of h(θ̂) can be approximated from the variance of (8.41), i.e.:

var[h(θ̂)] ≈ [∂h(θ)/∂θ]² var(θ̂).    (8.42)
Example 8.7 (Poisson Distribution): In previous examples, we have seen that λ̂ = x̄ and that var(λ̂) = λ̂/n. We now find the variance of (say) θ defined by θ = ln(λ) using the delta method:

Var(θ̂) ≈ (∂θ/∂λ)² Var(λ̂) = (∂ ln λ/∂λ)² Var(λ̂) = (1/λ²) × (λ̂/n).
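A simulation check of this delta-method approximation (an addition to the notes, with parameter values chosen purely for illustration):

    # Sketch: simulation check of the delta-method variance of theta_hat = ln(lambda_hat)
    # for Poisson data: var(ln lambda_hat) is approximately 1/(n * lambda).
    import numpy as np

    rng = np.random.default_rng(0)
    lam0, n, reps = 3.0, 50, 20_000

    lam_hats = rng.poisson(lam0, size=(reps, n)).mean(axis=1)
    theta_hats = np.log(lam_hats)

    print("simulated variance :", theta_hats.var())
    print("delta-method value :", 1.0 / (n * lam0))        # = (1/lambda^2) * (lambda/n)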
Exercise: The following numbers of accidents, assumed to follow a Poisson distribution, were recorded: 1, 1, 0, 0, 2, 4, 1, 2, 0, 3. If the parameter θ represents the probability of no accidents, e^{−λ}, find the variance of θ using the delta method.
8.6 Numerical Computation of MLEs

Example 8.8 Suppose observations are drawn from the Cauchy distribution with p.d.f. given by:

f(x|θ) = 1/(π[1 + (x − θ)²]),   −∞ < x < ∞.

As an exercise, write down the likelihood and log-likelihood functions for θ. Show then that the MLE of θ satisfies the equation:

U(θ) = Σ_{i=1}^{n} 2(xi − θ)/[1 + (xi − θ)²] = 0.
Now suppose that n = 4, and that the observed data are: X = (3, 7, 12, 17). The log-likelihood
function of θ is displayed in Figure 8.1. The multiple optima imply multiple solutions to U (θ) = 0.
Visually, we see that the maximum likelihood solution is at approximately θ = 7, but it would
be difficult to find this solution algorithmically, especially in multiparameter cases (p > 1, and
certainly for p > 2), as visual clues would be difficult to find.
[Figure 8.1: Plot of the log-likelihood function for Cauchy data in Example 8.8, showing the log-likelihood against the parameter θ over 0 ≤ θ ≤ 20.]
Considerable research has gone into developing effective algorithms for numerical optimization,
especially when multiple optima exist, as in Example 8.8. This work lies well beyond the scope of
the present course. Fortunately, many of these algorithms have been coded into widely available
software. For example, Microsoft Excel contains an add-in called Solver (obtained from the Tools
menu), which will find the local optimum which is in a sense closest to the initial guess made. The
user needs to provide a set of p cells to hold the parameter values (together with initial guesses at
the parameter values), and a further cell in which the log-likelihood is evaluated. When Solver
is called, a window pops up in which the user enters the addresses of the parameter cells (in
the edit box labelled “By changing cells”), and the address of the cell holding the log-likelihood
function (called the “target cell”). The user can also specify constraints on the values that may
be taken on by the parameters.
The use of graphical methods and Excel Solver are illustrated in the example below. Note that
Solver does not directly address problems of multiple optima. The user needs to try a series of
different initial guesses, to see whether they lead to different answers (in which case, the answer
giving the largest log-likelihood amongst those found would be used).
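Spreadsheet tools are one option; an equivalent multi-start strategy in code (an addition to the notes, assuming scipy is available) is sketched below for the Cauchy data of Example 8.8.

    # Sketch: multi-start numerical maximization of the Cauchy log-likelihood of Example 8.8.
    # Several starting values are tried; the best local optimum found is reported.
    import numpy as np
    from scipy.optimize import minimize

    x = np.array([3.0, 7.0, 12.0, 17.0])

    def neg_loglik(theta):
        # minus the Cauchy log-likelihood for the data x
        return np.sum(np.log(np.pi * (1.0 + (x - theta) ** 2)))

    starts = [0.0, 5.0, 10.0, 15.0, 20.0]
    results = [minimize(neg_loglik, x0=[s], method="Nelder-Mead") for s in starts]
    best = min(results, key=lambda r: r.fun)

    print([round(float(r.x[0]), 2) for r in results])   # different starts may reach different optima
    print("best theta_hat ~", round(float(best.x[0]), 2))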
Example 8.9 Gamma distribution with parameters α and λ, for which the p.d.f. is given by
f(x|α, λ) = [λ^α / Γ(α)] x^{α−1} e^{−λx},   x ≥ 0, α > 0, λ > 0.
Setting the score functions equal to zero, we have a simple relation for the MLE of λ as a function of the MLE for α:

λ̂ = nα̂ / Σ_{i=1}^{n} xi = α̂/x̄.

However, the MLE of α then needs to satisfy the following (non-linear) relation which would be difficult to solve analytically:

−n × [1/Γ(α̂)] ∂Γ(α)/∂α|_{α=α̂} + n ln α̂ − n ln x̄ + Σ_{i=1}^{n} ln(xi) = 0.
1. By substituting λ = α/x̄ into l(α, λ), we have a one-dimensional function which can be plotted graphically to obtain the maximizing value for α. Then we can recover λ̂ = α̂/x̄.

2. Alternatively, the log-likelihood l(α, λ) can be maximized numerically over both parameters, for example using Excel Solver.

It is worth noting that Microsoft Excel includes a worksheet function GAMMALN() which returns ln Γ(x), which facilitates implementation of both approaches.

In order to illustrate the above approaches, suppose that a sample of size n = 20 resulted in the following observations (ordered for ease of presentation):
1. The sample mean is x̄ = 10.438. Figure 8.2 shows the plot of l(α, α/10.438) against α.
It is clear that the maximum occurs at α ≈ 3.9. The corresponding estimate for λ is
3.9/10.438=0.374.
[Figure 8.2: Plot of the log-likelihood function (with λ = α/x̄) for the gamma distribution data, against the parameter α over 2 ≤ α ≤ 6.]
2. The Solver options window is illustrated in Figure 8.3. The entries in the edit boxes indicate that α and λ are stored in cells $G$43:$G$44, and that the log-likelihood function is evaluated in cell $G$46. The parameters need to be strictly positive (α, λ > 0), but for the numerical search we specify closed bounds in the form α, λ ≥ ε for a sufficiently small positive ε. We have used ε = 0.001. The solution (obtained by clicking on the Solve button shown in Figure 8.3) is: α = 3.898 and λ = 0.3735.
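The same two approaches can also be carried out outside Excel. The sketch below is an addition to the notes and uses hypothetical data, since the original 20 observations are not reproduced in this extract; it maximizes the profile log-likelihood l(α, α/x̄) numerically, assuming scipy is available.

    # Sketch: gamma MLE via the profile log-likelihood l(alpha, alpha/xbar), maximized numerically.
    import numpy as np
    from scipy.special import gammaln
    from scipy.optimize import minimize_scalar

    # Hypothetical data (the 20 observations of Example 8.9 are not reproduced in this extract):
    rng = np.random.default_rng(2)
    x = rng.gamma(shape=3.9, scale=1 / 0.374, size=20)
    xbar = x.mean()

    def neg_profile_loglik(alpha):
        lam = alpha / xbar                            # lambda expressed in terms of alpha
        return -np.sum(alpha * np.log(lam) - gammaln(alpha)
                       + (alpha - 1) * np.log(x) - lam * x)

    res = minimize_scalar(neg_profile_loglik, bounds=(0.1, 20.0), method="bounded")
    alpha_hat = res.x
    print(round(alpha_hat, 3), round(alpha_hat / xbar, 4))   # (alpha_hat, lambda_hat)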
Tutorial Exercises
8.1 Suppose T1 , . . . , Tn are independent times (hours) of failure for a piece of equipment assumed to
have an exponential lifetime distribution with parameter λ. Find the method of moments estimator.
Suppose t1 = 30.4, t2 = 7.8, t3 = 1.4, t4 = 13.1, t5 = 67.3. Find the method of moments estimate
of λ.
8.2 Assume that X is uniform on the interval (−θ, θ). Find the method of moments estimator of θ.
If the four observed values given in the sample are given by -0.808, 2.590, 2.314, -0.268, find the
method of moments estimate for θ.
8.3 The table below shows the number of times, X, that 356 students switched majors during their
under-graduate study.
Number of major changes 0 1 2 3
Observed frequency 237 90 22 7
Suppose that the data arise from a Poisson distribution. Find the MLE of λ.
8.4 The table below shows the number of tropical cyclones in Northwestern Australia for the seasons
1956-7 (season 1) to 1968-9 (season 13), a period of fairly consistent conditions for the definition
and tracking of cyclones (Dobson and Stewart, 1974).
Season 1 2 3 4 5 6 7 8 9 10 11 12 13
No. of cyclones 6 5 4 6 6 3 12 7 4 2 6 7 4
Let Y denote the number of cyclones in one season, and suppose Y has a Poisson distribution with
parameter θ.
(a) Find the MLE of θ.
(b) Plot the log-likelihood function of the data, showing the MLE of θ.
Tutorial Exercises 121
Given a random sample from the distribution with p.d.f.

f(x|θ) = x³ e^{−x/θ} / (6θ⁴),   x ≥ 0,

find the maximum likelihood estimate (MLE) of θ.
8.8 The following data (which have been sorted from smallest to largest) are supposed to have come
from a distribution with probability density function given by f (x|θ) = θxθ−1 for 0 < x < 1:
0.713 0.717 0.727 0.801 0.868
0.912 0.924 0.928 0.938 0.997
The sample mean and standard deviation of the above data are 0.853 and 0.105 respectively.
(a) Estimate θ using both the method of moments and matching of medians.
(b) Estimate θ using maximum likelihood estimation.
(c) Obtain the Fisher Information and hence approximate the variance of the MLE. What ap-
proximations have you made?
8.11 Reconsider Problem 8.10 in matrix form. Let each Y be i.i.d. as N(x′β, σ²), where x is the p × 1 vector of explanatory variables and β and σ² are to be estimated. Find maximum likelihood estimators for the vector β and for σ² using matrix notation.
8.12 The random variable X has the binomial distribution with p.m.f.:
f(x|m, p) = C(m, x) p^x (1 − p)^{m−x}   for x = 0, 1, . . . , m,

where C(m, x) denotes the binomial coefficient.
Chapter 9

Evaluation of Estimators
Consider a random sample X1 , X2 , . . . , Xn drawn from the distribution with p.d.f. f (x|θ). An
estimator is a statistic, i.e. a function of the random sample, say θ̂ = t(X1 , X2 , . . . , Xn ). For
any specific set of observed sample values, X1 = x1 , X2 = x2 , . . . , Xn = xn , the value θ̂ =
t(x1 , x2 , . . . , xn ) is the corresponding estimate for the unknown parameter value, where with
some abuse of notation we use the same symbol θ̂ both for the estimator (a random variable)
and a particular observed value. The MLE can be expressed in this manner, but we are not
restricting attention to the MLE at this stage.
Suppose that the true value of θ is given by θ0 (fixed, but unknown). We then define the bias of
the estimator by B(θ̂) = E(θ̂) − θ0 , where the expectation is taken w.r.t. the sampling variation
in X1 , X2 , . . . , Xn given the fixed value of θ. An estimator is said to be unbiased if E(θ̂) = θ0 ,
i.e. if B(θ̂) = 0.
We now define the mean square error (MSE) of the estimator by:
MSE(θ̂) = E(θ̂ − θ0 )2
= E[((θ̂ − E[θ̂]) + (E[θ̂] − θ0 ))2 ]
= E[(θ̂ − E[θ̂])2 + (E[θ̂] − θ0 )2 + 2(θ̂ − E[θ̂])(E[θ̂] − θ0 )]
= var(θ̂) + (B[θ̂])2 + 2(E[θ̂] − θ0 )E[(θ̂ − E[θ̂])]
= var(θ̂) + (B[θ̂])2 (9.1)
where the last line follows from the observation that E[(θ̂ − E[θ̂])] = 0, so that the cross product
term vanishes. We emphasize again that the expectations are taken with respect to sampling
variation in X1 , X2 , . . . , Xn for the given fixed value of θ.
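The decomposition (9.1) is easily illustrated numerically. The following short simulation (added here for illustration; the shrinkage estimator and the parameter values are arbitrary choices, not part of the notes) estimates the bias, variance and MSE of a deliberately biased estimator of a normal mean and checks that the two sides of (9.1) agree.

import numpy as np

rng = np.random.default_rng(0)
theta0, n, reps = 5.0, 10, 200_000

# A deliberately biased estimator of the mean: shrink the sample mean towards 0.
estimates = 0.9 * rng.normal(theta0, 2.0, size=(reps, n)).mean(axis=1)

bias = estimates.mean() - theta0
var = estimates.var()
mse = np.mean((estimates - theta0) ** 2)
print(mse, var + bias**2)   # the two quantities agree up to simulation error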
Suppose now that we have two estimators for θ say θ̂1 and θ̂2 , either of which may be biased or
unbiased. Then θ̂1 is called a more efficient estimator for θ than θ̂2 if MSE(θ̂1) < MSE(θ̂2).
It is possible for θ̂1 to be more efficient than θ̂2 , even though θ̂1 may be biased while θ̂2 is unbiased.
Some robust and Bayesian estimators (to be covered in the 3rd year) exhibit this property. The
reason why biased estimators may be more efficient under some circumstances is illustrated in
Figure 9.1, where the dotted curve corresponds to the density function of an unbiased estimator,
and the solid curve corresponds to the density function of a biased estimator. We observe that
for the unbiased estimator the probability that its observed value will be near θ is relatively
small, but that for the biased estimator this probability is considerably larger in spite of the
bias, as a result of the substantially smaller variance.
Figure 9.1: Illustration of the density functions of a biased and an unbiased estimator (horizontal axis: error θ̂ − θ0).
For the remainder of this course, however, we shall concentrate on unbiased estimators. For two unbiased estimators θ̂1 and θ̂2, θ̂1 is more efficient than θ̂2 if var(θ̂1) < var(θ̂2). The most efficient unbiased estimator is therefore that exhibiting minimum variance, sometimes called the minimum variance unbiased estimator (MVUE). The relative efficiency of θ̂1 compared to θ̂2 is defined by:

Relative efficiency = var(θ̂2) / var(θ̂1).
Example 9.1 Let y1 , y2 , y3 be a random sample from a normal distribution with mean µ and
variance σ 2 .
Consider the following two estimators for the population mean: µ̂1 = (1/4)y1 + (1/2)y2 + (1/4)y3 and µ̂2 = (1/3)y1 + (1/3)y2 + (1/3)y3. Then:

E(µ̂1) = E[(1/4)y1 + (1/2)y2 + (1/4)y3] = (1/4)E(y1) + (1/2)E(y2) + (1/4)E(y3) = (1/4)µ + (1/2)µ + (1/4)µ = µ

and

E(µ̂2) = E[(1/3)y1 + (1/3)y2 + (1/3)y3] = (1/3)E(y1) + (1/3)E(y2) + (1/3)E(y3) = (1/3)µ + (1/3)µ + (1/3)µ = µ.

Thus both estimators are unbiased, but now consider the variances:

var(µ̂1) = var[(1/4)y1 + (1/2)y2 + (1/4)y3] = (1/16)var(y1) + (1/4)var(y2) + (1/16)var(y3) = 3σ²/8

while

var(µ̂2) = var[(1/3)y1 + (1/3)y2 + (1/3)y3] = (1/9)var(y1) + (1/9)var(y2) + (1/9)var(y3) = 3σ²/9 = σ²/3.

Thus var(µ̂2) < var(µ̂1), so that µ̂2 is more efficient than µ̂1. The relative efficiency of µ̂1 compared to µ̂2 is:

(3σ²/9) / (3σ²/8) = 0.889
i.e. an efficiency of 89% relative to µ̂2 .
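A short simulation (added here for illustration, with arbitrary values of µ and σ) reproduces the variance ratio of Example 9.1.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, reps = 10.0, 3.0, 500_000
y = rng.normal(mu, sigma, size=(reps, 3))

mu1 = 0.25 * y[:, 0] + 0.50 * y[:, 1] + 0.25 * y[:, 2]
mu2 = y.mean(axis=1)

print(mu1.var(), 3 * sigma**2 / 8)   # approximately 3 sigma^2 / 8
print(mu2.var(), sigma**2 / 3)       # approximately sigma^2 / 3
print(mu2.var() / mu1.var())         # relative efficiency, approximately 0.889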
Exercise 9.1 Suppose we have a random sample X = (X1, X2) from a normal distribution with unknown mean µ and known variance σ².
(a) Show that the two estimators, µ̂1 = (x1 + 2x2 )/3 and µ̂2 = (2x1 + 3x2 )/5 are unbiased
estimators of µ.
(b) Which is the more efficient estimator of µ?
The Cramér-Rao inequality developed below requires two regularity conditions on f(x|θ):

(a) the range of the random variable X must be independent of the parameter θ (support independent of θ); and

(b) the derivative of the density function with respect to the parameter θ must be a continuously differentiable (“smooth”) function of the parameter θ.
Before stating and proving the Cramér-Rao Inequality, we need the following Lemma.
Lemma 9.1
∂/∂θ [ ∏_{i=1}^n f(x_i|θ) ] = Σ_{i=1}^n [ (1/f(x_i|θ)) ∂f(x_i|θ)/∂θ ] ∏_{j=1}^n f(x_j|θ).
The proof for the general case follows by repeated applications of the product rule.
Theorem 9.1 (Cramér-Rao Inequality) Let T = t(X1, X2, . . . , Xn) be an unbiased estimator of θ based on a random sample from f(x|θ) satisfying the regularity conditions above. Then

var(T) ≥ 1 / [n I(θ)],

where I(θ) is the Fisher information per observation. In other words, the variance of any unbiased estimator of θ cannot be smaller than the reciprocal of the Fisher information in the sample.
Proof. Recall that for any pair of random variables X and Y , −1 ≤ ρXY ≤ 1, i.e.:
−1 ≤ Cov(X, Y) / √(var(X) var(Y)) ≤ 1
or:
[Cov(X, Y )]2 ≤ var(X)var(Y ). (9.5)
which is known as the Cauchy-Schwarz Inequality.
Rearrangement of (9.5) gives the lower bound on the variance of X:
var(X) ≥ [Cov(X, Y)]² / var(Y).   (9.6)
The proof of the theorem then proceeds by equating X with the estimator T , and Y with the
quantity
Z = Σ_{i=1}^n ∂ ln f(X_i|θ)/∂θ
evaluated at the true value of θ.
From (8.8) we have that E[Z]=0, so that:
Cov(Z, T ) = E[Z(T − θ)] = E[ZT ] − θE[Z] = E[ZT ].
Now from the definition of Z:
Z = Σ_{i=1}^n ∂ ln f(X_i|θ)/∂θ = Σ_{i=1}^n (1/f(X_i|θ)) ∂f(X_i|θ)/∂θ.
Combining the above results, we obtain:
Cov(Z, T) = E(ZT)
= E[ T Σ_{i=1}^n (1/f(X_i|θ)) ∂f(X_i|θ)/∂θ ]
= ∫ · · · ∫ t(x1, . . . , xn) Σ_{i=1}^n [ (1/f(x_i|θ)) ∂f(x_i|θ)/∂θ ] ∏_{j=1}^n f(x_j|θ) dx1 · · · dxn   (9.7)
where the last line follows from the fact that the expectation is to be taken with respect to the
joint distribution of X1 , X2 , . . . , Xn .
Substituting from the result of Lemma 9.1 then yields:
Cov(Z, T) = ∫ · · · ∫ t(x1, . . . , xn) ∂/∂θ [ ∏_{i=1}^n f(x_i|θ) ] dx1 · · · dxn
= ∂/∂θ ∫ · · · ∫ t(x1, . . . , xn) ∏_{i=1}^n f(x_i|θ) dx1 · · · dxn
= ∂/∂θ E(T)   (9.8)
using the smoothness of f (x|θ) to switch the order of integration and differentiation.
Now for an unbiased estimator, E[T ]=θ, so that:
Cov(Z, T) = ∂θ/∂θ = 1.

Substituting this into (9.6), and noting that var(Z) = nI(θ), gives var(T) ≥ 1/var(Z) = 1/[nI(θ)], which proves the theorem.
Remark 9.1 A rather more general result follows from (9.8). For any estimator T of θ (biased
or unbiased), the lower bound on the variance of T is given by
var(T) ≥ [∂E(T)/∂θ]² / var(Z) = [∂E(T)/∂θ]² / [nI(θ)].
Theorem 9.1 gives a lower bound for any unbiased estimator. An unbiased estimator that achieves
this lower bound is a minimum variance unbiased estimator (MVUE), or the best estimator under
the class of unbiased estimators, and is said to be efficient.
In the previous chapter we have seen that the asymptotic variance of the maximum likelihood
estimate (as n → ∞) is precisely the lower bound given by theorem 9.1. In other words, maximum
likelihood estimates are asymptotically efficient or best.
For finite (small) sample sizes, a maximum likelihood estimate may not be efficient, and max-
imum likelihood estimates may not necessarily be the only asymptotically efficient estimates.
Nevertheless, for large sample sizes, maximum likelihood estimates are approximately unbiased, approximately efficient, and approximately normally distributed.

As an illustration, consider again the Poisson distribution with parameter λ. The MLE of λ was found to be X̄ = (1/n) Σ_{i=1}^n X_i = S/n and is unbiased. Now S = Σ_{i=1}^n X_i is again a Poisson variate with parameter nλ, and hence var(S) = nλ, or var(X̄) = nλ/n² = λ/n. Since I(λ) = 1/λ for the Poisson distribution, the Cramér-Rao lower bound is 1/[nI(λ)] = λ/n. The maximum likelihood estimator X̄ for the Poisson parameter λ thus attains the Cramér-Rao lower bound and is therefore efficient or best.
Example 9.3 Consider the exponential distribution parameterized by its mean µ, with p.d.f. f(x|µ) = (1/µ)e^{−x/µ} for x > 0, so that ln f(x|µ) = −ln µ − x/µ. The Fisher information is:

I(µ) = −E[ ∂²/∂µ² ln f(X|µ) ]
= −E[ ∂²/∂µ² ( −ln µ − X/µ ) ]
= −E[ ∂/∂µ ( −1/µ + X/µ² ) ]
= −E[ 1/µ² − 2X/µ³ ] = 1/µ².

The Cramér-Rao lower bound for an unbiased estimator T of µ is therefore:

Var[T] ≥ 1/[nI(µ)] = µ²/n.
The MLE of µ is µ̂ = X̄ = Σ_{i=1}^n X_i / n, so that:

Var[µ̂] = Var(X̄) = Var( Σ_{i=1}^n X_i / n ) = (1/n²) Var( Σ_{i=1}^n X_i ) = nµ²/n² = µ²/n.
Thus the MLE X̄ attains the Cramér-Rao lower bound and is a minimum variance unbiased
estimator of µ.
9.3 Methods for Estimating and Reducing Bias

Consider again the problem described in Section 8.5, i.e. we use h(θ̂) as the estimate for h(θ), where θ̂ is an unbiased estimator for θ.
The second order Taylor expansion for h(θ̂) around the true value h(θ) is given by:

h(θ̂) ≈ h(θ) + [∂h(θ)/∂θ] (θ̂ − θ) + (1/2) [∂²h(θ)/∂θ²] (θ̂ − θ)².

Taking expectations w.r.t. θ̂, and using the assumption that θ̂ is unbiased, we obtain:

E[h(θ̂)] ≈ h(θ) + (1/2) [∂²h(θ)/∂θ²] var(θ̂).
Example 9.4 For the Poisson distribution with parameter λ, suppose that we need to have an
estimate for h(λ) = ln(λ). The MLE for λ is x̄ (which is also unbiased). Thus, the MLE of
ln(λ) is ln(x̄). The bias in estimation of ln(λ) is approximated by:
B(ln(λ)) ≈ (1/2) [∂²/∂λ² ln(λ)] var(λ̂)
= (1/2) [∂/∂λ (1/λ)] var(λ̂)
= −(1/(2λ²)) · (λ/n) = −1/(2nλ).
The bias-corrected estimate is then given by ln(x̄) + 1/(2nx̄).
Example 9.5 Suppose that we needed to express the p.d.f. of the exponential distribution in
terms of the rate parameter λ rather than the mean µ, i.e.:
f (x|λ) = λe−λx
for x > 0, and that we need an estimate for λ. Clearly λ = 1/µ, where µ is the mean of the
distribution. We have previously seen that the MLE for µ is the sample mean x̄, which is unbiased
and has a variance (see Example 9.3) of µ2 /n.
We write λ = h(µ), where h(µ) = µ−1 . The MLE for λ is thus h(µ̂) = 1/x̄. The second
derivative of h(µ) is h00 (µ) = 2µ−3 , so that the bias of estimation of λ may be approximated by:
B(λ̂) ≈ (1/2) (2µ^{−3}) (µ²/n) = 1/(nµ).
An estimate of the bias is found by replacing µ by the MLE (x̄), so that the approximately
unbiased estimator becomes:
1/x̄ − 1/(nx̄) = [(n − 1)/n] · (1/x̄).
The Jackknife
The jackknife (Quenouille, 1949) is a method for estimating the bias and the variance of an esti-
mator, under the assumption that the bias is a decreasing function of sample size (n). The basic
idea behind the jackknife estimator lies in systematically recomputing the parameter estimator
leaving out one observation at a time from the sample.
We define θ̂−i to be the estimator of θ after deleting the observation Xi from the sample, i.e.
θ̂−i = θ̂(X1 , . . . , Xi−1 , Xi+1 , . . . , Xn ). We further define θ̂• to be the average of the θ̂−1 , . . . , θ̂−n :
θ̂• = (1/n) Σ_{i=1}^n θ̂_{−i}.   (9.10)
Under the assumption that bias is a decreasing function of n, a first order approximation to bias
would be given by E[θ̂] ≈ θ + a/n, where a is a fixed but unknown constant. To the same level
of approximation, the expectation of θ̂−i , and thus also that of θ̂• is θ + a/(n − 1). It therefore
follows that:

E[θ̂• − θ̂] ≈ a/(n − 1) − a/n = a/[n(n − 1)]

or:

a/n ≈ (n − 1) E[θ̂• − θ̂].
On this basis, a “jackknife” estimate of the bias a/n is given by:
B̂(θ̂) = (n − 1)(θ̂• − θ̂). (9.11)
Subtraction of B̂(θ̂) from θ̂ thus generates the bias-corrected (jackknifed) version of θ̂:
θ̂jack = θ̂ − B̂(θ̂) = nθ̂ − (n − 1)θ̂• (9.12)
A jackknife estimator for the variance of (θ̂) has also been suggested as follows:
Var(θ̂) ≈ [(n − 1)/n] Σ_{i=1}^n (θ̂_{−i} − θ̂•)².   (9.13)
Example 9.6 The following are times (in days) between successive “lost-time” accidents in a
factory: 55, 25, 26, 68, 30, 1, 32, 51, 7. A conjectured model suggests that the mean time between
accidents (µ) is given by eθ , where θ is a parameter representing risk exposure. It is desired to
estimate θ = ln µ. Since the sample mean (x̄) is an unbiased estimator for µ, we may wish to
estimate θ by θ̂ = ln x̄. For the above data, x̄ = 32.7778 so that θ̂ = 3.48975. We wish to assess
the estimation bias.
Applying the jackknife approach, we calculate the following:
i 1 2 3 4 5 6 7 8 9
x̄−i : 30.000 33.750 33.625 28.375 33.125 36.750 32.875 30.500 36.000
θ̂−i : 3.4012 3.5190 3.5153 3.3455 3.5003 3.6041 3.4927 3.4177 3.5835
The average of the θ̂−i may be calculated as θ̂• = 3.48659, so that the jackknife bias estimate is
8 × (−0.00316) = −0.025. The adjusted (debiased) estimate for θ is thus 3.490+0.025=3.515.
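The jackknife calculation of Example 9.6 is easily automated. The sketch below (illustrative only) uses the accident data given above and should reproduce, up to rounding, the bias estimate of about −0.025 and the standard deviation of about 0.226 referred to in Exercise 9.2.

import numpy as np

x = np.array([55, 25, 26, 68, 30, 1, 32, 51, 7], dtype=float)
n = len(x)

theta_hat = np.log(x.mean())                       # theta-hat = ln(x-bar)

# Leave-one-out estimates theta-hat_{-i}
loo = np.array([np.log(np.delete(x, i).mean()) for i in range(n)])
theta_dot = loo.mean()                             # average of the leave-one-out estimates

bias_jack = (n - 1) * (theta_dot - theta_hat)      # (9.11)
theta_jack = theta_hat - bias_jack                 # (9.12)
var_jack = (n - 1) / n * np.sum((loo - theta_dot) ** 2)   # (9.13)

print(theta_hat, bias_jack, theta_jack, np.sqrt(var_jack))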
Exercise 9.2 For the data in Example 9.6, compute the jackknife estimate of the sample variance
of θ̂. [The standard deviation should be found to be 0.226.]
The Bootstrap
Another computationally intensive method for estimating the bias and variance of an estimator is
the bootstrap (Efron and Tibshirani, 1993). A single iteration of this procedure involves taking a
sample of size n with replacement from the original set of n observations. This may be visualized
as placing all the original observations in a box, shuffling and then making n draws from the
box, replacing the observation and re-shuffling after each draw. This process is easily simulated
on a computer by drawing a set of random integers, uniformly distributed between 1 and n, and
allocating observations to the “bootstrap” sample accordingly.
For example, in the data from Example 9.6, 9 random digits between 1 and 9 were generated as
follows: 3, 4, 1, 5, 7, 1, 4, 3, 3. The corresponding values from the original sample are: 26, 68,
55, 30, 32, 55, 68, 26, 26. These form the first bootstrap sample.
The above (simulated) sampling process is repeated a large number of times, say M (which
typically would be of the order of 1000s). For each such sample, the corresponding estimate
is calculated, using the same estimator as is proposed for the original data. Define θ̂i as the
estimate derived from the i-th “bootstrap” sample (for i = 1, . . . , M ) and then further define:
M
1 X
θ̂• = θ̂i .
M i=1
Since each of the θ̂i may be viewed as an estimate from the same “population” (the original
sample) with a parameter value θ̂, the bootstrap estimate of bias is given by:

B̂(θ̂) = θ̂• − θ̂,

and the bias-corrected estimate by θ̂ − B̂(θ̂).
Example 9.7 We return to Example 9.6. Table 9.1 below shows the first five bootstrap sam-
ples, generated (with replacement) from the original data, together with the corresponding sample
means and parameter estimates (θ̂i ).
Unfortunately, M = 5 would be far too small for any reliable estimate of θ̂• , and hence of
bias. Repetition of the above process for M = 1000 times, however, generated the bootstrap
mean θ̂• = 3.4622, which yields B̂(θ̂) = −0.028, and a debiased bootstrap estimate of θ̂boot =
3.490 − (−0.028) = 3.518. Note that this estimate is very close to that of the jackknife estimate
(within the limits of sampling accuracy).
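A corresponding bootstrap sketch for the same data is given below (illustrative only; with M = 1000 resamples the exact numbers will vary slightly from run to run, as noted in the example).

import numpy as np

rng = np.random.default_rng(0)
x = np.array([55, 25, 26, 68, 30, 1, 32, 51, 7], dtype=float)
n, M = len(x), 1000

theta_hat = np.log(x.mean())

# M bootstrap samples of size n, drawn with replacement from the original data
samples = rng.choice(x, size=(M, n), replace=True)
theta_boot = np.log(samples.mean(axis=1))

theta_dot = theta_boot.mean()
bias_boot = theta_dot - theta_hat            # bootstrap estimate of bias
theta_debiased = theta_hat - bias_boot
sd_boot = theta_boot.std(ddof=1)             # bootstrap estimate of the standard deviation

print(bias_boot, theta_debiased, sd_boot)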
Exercise 9.3 For the data in Example 9.7, compute the bootstrap estimate of the sample vari-
ance of θ̂, and compare this with the jackknife estimate. (You will have to generate your own
1000 “bootstrap samples”; in my sample, I obtained a standard deviation of 0.227, also very close
to the jackknife estimate.)
9.4 Sufficiency
The question which we now address is: Is there any loss in information if we summarize the
results of a random sample in terms of one or more summary statistics? If there is no such
loss in information, then it may be more convenient (from the points of view both of ease of
computation and of display of results) to describe the results of the sample entirely in terms of
the statistic(s).
Definition 9.2 A statistic is a function of a random sample (X1 , X2 , . . . , Xn ) which does not
depend on any unknown parameters. Note that a statistic T = u(X1 , X2 , . . . , Xn ) is itself a
random variable whose distribution derives from that of X1 , X2 , . . . , Xn .
Definition 9.3 The statistic T = u(X1 , X2 , . . . , Xn ) is said to be sufficient for θ if the condi-
tional distribution of (X1 , X2 , . . . , Xn ) given T = t does not depend on θ for any value of t, that
is f (X1 , X2 , . . . , Xn |T = t) does not involve θ.
Thus once the value of the sufficient statistic T is observed, we can gain no further knowledge about
θ from any further information derived from the sample (X1 , X2 , . . . , Xn ). In other words, it is
sufficient only to consider (or even only to record) T , or perhaps some function of T , in order to
construct an estimate for θ (or, in fact, to draw any inferences about θ). All sample details apart
from the resultant value of T can be ignored, with no loss of information (and thus no increase
in the variance of the estimator).
Remark 9.2 Before proceeding to examples, it is worth noting one useful result. Suppose first
that the random variable X has a discrete distribution, so that T will also be discrete. We
would then, in establishing whether a statistic T is sufficient for θ, need to consider conditional
probabilities of the form Pr[X1 = x1 , X2 = x2 , . . . , Xn = xn |T = t]. Consider the two events:
A = {X1 = x1 , X2 = x2 , . . . , Xn = xn };
B = {T = t}.
If A ⊆ B (as is the case when t = u(x1, . . . , xn)), then Pr[A|B] = Pr[A]/Pr[B], i.e. the conditional probability is simply the ratio of the joint probability of the sample to the probability of T = t.

For example, let X1, . . . , Xn be a random sample from the Bernoulli distribution with parameter θ, and let T = Σ_{i=1}^n X_i. Then

Pr[X1 = x1, . . . , Xn = xn | T = t] = θ^t (1 − θ)^{n−t} / [ C(n, t) θ^t (1 − θ)^{n−t} ] = 1 / C(n, t),

if t = Σ_{i=1}^n x_i (and zero otherwise), which is independent of θ. Thus T = Σ_{i=1}^n X_i is sufficient for θ.
Consider now a random sample X1, . . . , Xn from the Poisson distribution with parameter λ, and let T = Σ_{i=1}^n X_i (the total number of counts). From the moment generating functions we know that T also follows a Poisson distribution with parameter nλ, i.e.:

g(t|λ) = (nλ)^t e^{−nλ} / t!.   (9.18)
From Remark 9.2, the conditional distribution of (X1, X2, . . . , Xn | T = t) is given by the ratio of the probability functions in (9.17) and (9.18):

P(X1 = x1, . . . , Xn = xn | T = t) = [ λ^{Σ x_i} e^{−nλ} / ∏_{i=1}^n x_i! ] × [ t! / ((nλ)^t e^{−nλ}) ]
= [ t! / ∏_{i=1}^n x_i! ] × (1/n^t).   (9.19)

This conditional probability function thus does not depend on λ, so that T(X) = Σ_{i=1}^n X_i is sufficient for λ.
In the above examples, we suggested a particular statistic and then proved that it was sufficient
for θ. This does not directly assist in identifying a sufficient statistic, except for a purely trial-
and-error approach. A more constructive approach to finding a sufficient statistic is provided by
the following factorization theorem.

Theorem 9.2 (Factorization Theorem) The statistic T = u(X1, X2, . . . , Xn) is sufficient for θ if and only if the joint probability (density) function of the sample can be factorized in the form

f(x1, x2, . . . , xn|θ) = k1[u(x1, x2, . . . , xn); θ] k2(x1, x2, . . . , xn),

where the function k2(·) does not depend on θ.
Proof. We shall provide the proof for the continuous case only. The proof for the discrete case
is essentially similar (see Rice, 1995, pages 281–282).
1. Suppose firstly that T is a sufficient statistic. From Remark 9.2 we have that:
Using the factorization, the equivalence of u1 (·) and u(·) and the fact that u(w1 (y1 , . . . , yn )) =
u(x1 ) = t (by definition), we thus have:
Now by definition T ≡ Y1 , so that g(t|θ) ≡ g(y1 |θ), i.e. the marginal p.d.f. of Y1 (although
still conditional on θ), which is then given by:
g(t|θ) = k1[t; θ] ∫ · · · ∫ k2( w1(y1, . . . , yn), . . . , wn(y1, . . . , yn) ) |J| dy2 · · · dyn.   (9.21)
By the condition of the theorem the function k2 (·) is independent of θ, while by assumption
the ranges of integration do not depend on θ. The result of the multiple integral thus may
depend on y1 = t, but not on θ. Let us represent this result as m(t), so that:
Example 9.11 Consider the random variable with p.d.f. given by:
f(x|θ) = 2θx e^{−θx²},   x > 0, θ > 0.

The likelihood function based on a random sample of size n is given by:

L(θ) = ∏_{i=1}^n 2θ x_i e^{−θ x_i²} = [ θ^n e^{−θ Σ_{i=1}^n x_i²} ] [ 2^n ∏_{i=1}^n x_i ].
This gives the required factorization with u(x1, x2, . . . , xn) = Σ_{i=1}^n x_i², so that T = Σ_{i=1}^n X_i² is a sufficient statistic for θ.
The factorization theorem is very useful and often simple to apply. As stated, however, it does
require that the range of the random variable X should not depend on the parameter θ. In order
to emphasize this point, consider the example of a random variable X with p.d.f. given by:
f(x|θ) = 1/θ,   0 ≤ x ≤ θ.
The likelihood function is thus given by:
L(θ) = f(x1, . . . , xn|θ) = 1/θ^n
for 0 ≤ xi ≤ θ for all i.
If we in a formal sense try to apply the factorization theorem without considering the dependency
of the range on θ, then we could take any arbitrary function p(x1 , . . . , xn ) and write:
L(θ) = [ p(x1, . . . , xn) / θ^n ] · [ 1 / p(x1, . . . , xn) ].
Such a factorization would suggest the ludicrous conclusion that any arbitrary function can be used as a sufficient statistic.
The problem, of course, is that as it stands Theorem 9.2 does not apply, as the support depends
on θ. Fortunately, there is a simple way to work round this problem. We define a Boolean
indicator function, say Iθ (x), defined as follows:
Iθ(x) = 1 if x ≤ θ, and Iθ(x) = 0 otherwise.
Now define X(n) as the largest order statistic in the random sample. For non-negative random
variables, we have that:
∏_{i=1}^n Iθ(x_i) = Iθ(x_{(n)}).

This relationship follows because (a) ∏_{i=1}^n Iθ(x_i) = 0 if Iθ(x_{(n)}) = 0; while (b) x_{(n)} ≤ θ implies x_i ≤ θ for i = 1, 2, . . . , n, so that Iθ(x_i) = 1 for all i.
The likelihood function thus reduces to:
L(θ) = Iθ(x_{(n)}) / θ^n,
which is a trivial factorization with u(X1 , X2 , . . . , Xn ) = X(n) and k2 (x1 , x2 , . . . , xn ) ≡ 1. It is
thus demonstrated that T = X(n) is a sufficient statistic for θ.
Theorem 9.3 If the statistic T is sufficient for θ then the maximum likelihood estimate θ̂ is a
function of T .
Proof. Let (X1, X2, . . . , Xn) be a random sample from X with p.d.f. f(x|θ), and suppose that T is a sufficient statistic for θ. The likelihood function can then be expressed as L(θ) = k1[t; θ] k2(x1, . . . , xn). Since k2(·) does not depend on θ, maximizing L(θ) over θ is equivalent to maximizing k1[t; θ], and the maximizing value of θ can therefore depend on the data only through t. Hence the MLE θ̂ is a function of T.
Definition 9.4 The sufficiency principle states that identical conclusions should be drawn from
all data sets having the same value(s) for the sufficient statistic(s). If two data sets generate
likelihood functions which, when viewed as functions of the parameter θ, are proportional to each
other, where the proportionality constant may depend on X but not on θ, then these data sets
should lead to identical conclusions concerning θ.
More generally, if we have found any unbiased estimator for θ which is not a function of the
sufficient statistic only, then its variance will not be increased (usually will be reduced) by a
restriction to the sufficient statistic. This result is provided by the Rao-Blackwell Theorem (Rao,
1945; Blackwell, 1947) given below. Note that the theorem only indicates possible improvement
in the efficiency; it does not guarantee that the resulting estimator is the MVUE.
Theorem 9.4 (Rao-Blackwell) Let T be an unbiased estimator of θ, let S be a sufficient statistic for θ, and define h(s) = E[T | S = s]. Then E[h(S)] = θ (i.e. h(S) is an unbiased estimator for θ), while var[h(S)] ≤ var(T), with equality only if T = h(S) with probability 1 (i.e. h(S) is more efficient than T unless they are identical with probability 1).
Proof. All integrals used in this proof are assumed to be over the complete support of the
variables of integration.
Let g(t, s) be the joint (bivariate) density of S and T , gT (t) and gS (s) the marginal densities of
T and S respectively, and gT |S (t|s) the conditional density of T given S = s. Then
h(s) = E[T | S = s] = ∫ t g_{T|S}(t|s) dt = ∫ t [ g(t, s) / g_S(s) ] dt.   (9.26)
Then:
E[h(S)] = ∫ h(s) g_S(s) ds
= ∫∫ t g(t, s) dt ds   by (9.26)
= E(T)
= θ   since T is unbiased by assumption.
Furthermore, E[T − h(S) | S = s] = E[T | S = s] − h(s) E[1 | S = s] = h(s) − h(s) · 1 = 0, so that cov(h(S), T − h(S)) = E[ h(S) E(T − h(S) | S) ] = 0, and hence var(T) = var(h(S)) + var(T − h(S)),
which implies that var(T ) ≥ var(h(S)). The inequality is strict unless T = h(S) over all sets
having positive probability.
If an unbiased estimator T of θ does not attain its Cramér - Rao Lower Bound, then this may be
because it is not a function of a sufficient statistic only. In this case a more efficient estimator is
given by h(s) = E(T |S = s).
Example 9.12 Consider a random sample X = (X1, X2) from f(x|θ) = (1/θ) e^{−x/θ} for θ > 0 and
x > 0. Let T = X1 and S = X1 + X2 . Now S is a sufficient statistic for θ, while (by definition)
E[T ] = θ, so that T is unbiased. However, T is not a function of S.
The joint p.d.f. for X is given by:
f(x1, x2|θ) = (1/θ²) e^{−(x1+x2)/θ},   x1 > 0, x2 > 0.
By a standard bivariate change in variables from X1 , X2 to T, S, we find the joint p.d.f. for T, S
g(t, s) = (1/θ²) e^{−s/θ},   0 < t < s,
giving the marginal p.d.f. for S:

g_S(s) = ∫_0^s g(t, s) dt = [ t e^{−s/θ} / θ² ]_{t=0}^{s} = s e^{−s/θ} / θ²
for s > 0. Thus:
g_{T|S}(t|s) = g(t, s) / g_S(s) = 1/s,   0 < t < s,

i.e. the conditional distribution of T given S = s is uniform on (0, s). The Rao-Blackwellized estimator is therefore h(s) = E[T | S = s] = s/2 = (x1 + x2)/2 = x̄, which is unbiased for θ and has smaller variance than T = X1.
9.6 The Exponential Family of Distributions
Definition 9.5 The distribution of the random variable X is said to belong to the exponential
family of distributions if its probability or probability density function f (x|θ) can be written in
the form:
f (x|θ) = exp[c(θ)T (x) + d(θ) + S(x)] for x ∈ A
(9.28)
= 0 for x ∈
/ A,
where c(θ), T (x), S(x) and d(θ) are known functions (i.e. not depending upon any unknown
parameters), and A is the support of X (i.e. the set of values having non-zero probability or
probability density).
If T (x) = x, the distribution is said to be in canonical form, in which case c(θ) is called the
natural parameter of the distribution.
For example, the Poisson distribution with parameter θ has probability function

f(x|θ) = θ^x e^{−θ} / x!,   x = 0, 1, 2, . . .
= exp( x ln θ − θ − ln x! ),

so that T(x) = x, c(θ) = ln θ, d(θ) = −θ and S(x) = −ln x!. The distribution is in canonical form, with natural parameter ln θ.
For convenience, we shall demonstrate the results of this section for continuous distributions, but
the principles clearly carry over to discrete distributions.
By definition of the probability density function:
∫_A f(x|θ) dx = 1.   (9.29)
where A is the support of X. Differentiating both sides of (9.29) with respect to θ we obtain:
(∂/∂θ) ∫_A f(x|θ) dx = 0.   (9.30)
Assuming the usual smoothness properties, we interchange the order of integration and differentiation in (9.30) to obtain:

∫_A ∂f(x|θ)/∂θ dx = 0.   (9.31)
Differentiating once more, and again interchanging integration and differentiation, we similarly obtain:

∫_A ∂²f(x|θ)/∂θ² dx = 0.   (9.32)
The above results apply generally to any sufficiently smooth distribution. For the exponential
family, however, they lead to particularly simple and useful results. If f (x|θ) satisfies (9.28),
then:
∂f(x|θ)/∂θ = [ c′(θ)T(x) + d′(θ) ] f(x|θ),

where c′(θ) and d′(θ) are the derivatives of c(θ) and d(θ) respectively w.r.t. θ. Integration with respect to x gives the expectation of c′(θ)T(X) + d′(θ), so that from (9.31) we have that:

c′(θ) E[T(X)] + d′(θ) = 0,   (9.33)

i.e.

E[T(X)] = −d′(θ) / c′(θ).   (9.34)
Similarly, we can derive an expression for var[T (X)]. The second derivative of (9.28) yields:
∂²f(x|θ)/∂θ² = [ c″(θ)T(x) + d″(θ) ] f(x|θ) + [ c′(θ)T(x) + d′(θ) ]² f(x|θ).   (9.35)
Integration of (9.35) w.r.t. x gives the expectation of c″(θ)T(X) + d″(θ) + [c′(θ)T(X) + d′(θ)]². Thus from (9.32) we have:

c″(θ) E[T(X)] + d″(θ) + E[ c′(θ)T(X) + d′(θ) ]² = 0.   (9.36)
Now E[c′(θ)T(X) + d′(θ)]² = [c′(θ)]² Var[T(X)], since c′(θ)E[T(X)] + d′(θ) = 0 by (9.33). Substitution into (9.36), and using the expression (9.34) for E[T(X)], then yields:

−c″(θ) d′(θ)/c′(θ) + d″(θ) + [c′(θ)]² Var[T(X)] = 0.
Simple re-arrangement of terms finally gives the desired expression for Var[T (X)]:
Var[T(X)] = [ c″(θ)d′(θ) − d″(θ)c′(θ) ] / [c′(θ)]³.   (9.37)
In other words, for the exponential family of distributions, the expectation and variance of T (X)
may be obtained simply by differentiating c(θ) and d(θ). For the canonical form, T (X) = X so
that we obtain the expectation and variance of X directly in this way.
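These expressions can be checked symbolically. The sketch below (added here for illustration, not part of the original notes; sympy assumed available) uses the Poisson distribution, for which c(θ) = ln θ, d(θ) = −θ and T(x) = x, so both the mean and the variance should simplify to θ.

import sympy as sp

theta = sp.symbols("theta", positive=True)
c = sp.log(theta)        # c(theta) for the Poisson distribution
d = -theta               # d(theta)

c1, c2 = sp.diff(c, theta), sp.diff(c, theta, 2)
d1, d2 = sp.diff(d, theta), sp.diff(d, theta, 2)

mean = sp.simplify(-d1 / c1)                      # E[T(X)] = -d'/c', equation (9.34)
var = sp.simplify((c2 * d1 - d2 * c1) / c1**3)    # equation (9.37)
print(mean, var)                                  # both simplify to theta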
For the exponential family, ln f(x|θ) = c(θ)T(x) + d(θ) + S(x), so that the score is simply ∂ ln f(x|θ)/∂θ = c′(θ)T(x) + d′(θ).
Suppose (X1 , . . . , Xn ) is a random sample from the exponential family, with p.d.f. given by
(9.28). Then the likelihood is
L(θ) = ∏_{i=1}^n exp{ c(θ)T(x_i) + d(θ) + S(x_i) }
= exp{ c(θ) Σ_{i=1}^n T(x_i) + n d(θ) } exp{ Σ_{i=1}^n S(x_i) }
= k1( Σ_{i=1}^n T(x_i); θ ) k2(x1, . . . , xn).   (9.38)
Thus, from the factorization theorem, Σ_{i=1}^n T(x_i) is sufficient for θ. For distributions in canonical form (e.g. the Bernoulli and the Poisson distributions, as we have previously seen),

T = Σ_{i=1}^n T(X_i) = Σ_{i=1}^n X_i,

i.e. the sample total is itself a sufficient statistic.
Summary of Results
Let θ̂ be the maximum likelihood estimate (MLE) for θ, i.e. the value maximizing L(θ) = f(x1, . . . , xn|θ). Then, for large samples:

1. θ̂ is approximately unbiased for θ.

2. The variance of θ̂ is approximately 1/[nI(θ)], which is also the Cramér-Rao lower bound for unbiased estimators. Thus θ̂ is asymptotically efficient or best.
3. Asymptotically as n → ∞:

θ̂ ∼ N( θ, 1/[nI(θ)] ).
Tutorial Exercises
9.1 Consider sampling from the probability distribution defined by the following p.d.f.:
f(x|β) = x² e^{−x/β} / (2β³)
for x > 0. Show that the MLE for β is efficient and best for any sample size.
9.2 Let X1 , . . . , Xn be a random sample from a uniform distribution
f(x|θ) = 1/θ,   0 < x < θ.
(a) Show that X(n) (the largest order statistic) is the MLE of θ.
(b) Plot the likelihood function of the data, assuming that n = 10 and x(n) = 1, showing the
MLE of θ.
(c) Find the expectation and variance of X(n) , and hence show that X(n) is a biased estimator
for θ. How would you adjust the estimator to make it unbiased?
(d) Find the MSE for X(n) and for the adjusted (unbiased) estimator.
(e) Show that 2X̄ is unbiased for θ and compare its efficiency with the unbiased estimator found
above.
9.3 If (X1 , X2 , X3 ) is a random sample from N (µ, σ 2 ), show that
(a) T = (X1 + 2X2 + 3X3 )/6 is an unbiased estimator of µ.
(b) Find var(T ) and compare the efficiency with that of X̄.
9.4 Suppose (X1 , . . . , X6 ) is a random sample from a normal distribution N (µ, σ 2 ).
(a) Determine the constant K such that
is an unbiased estimate of σ 2 .
(b) Compare Var[T ] with the variance of the usual unbiased estimate for σ 2 . Which is more
efficient?
9.5 A random sample of size 9 was collected from an exponential distribution with p.d.f.:
f (x|λ) = λe−λx
It is required to estimate λ.
(a) Show that the MLE is given by λ̂ = 1/X̄.
(b) Obtain a jackknife estimate of the bias and variance of estimation, and hence estimate the
MSE.
(c) What is the unbiased estimator for λ?
9.6 Let X1 , . . . , Xn be a random sample from the Poisson distribution with parameter λ. Show that
X1 alone is not sufficient.
9.7 Use the factorization theorem to show that T = Σ_{i=1}^n X_i is a sufficient statistic for the geometric distribution.
9.8 Use the factorization theorem to find a sufficient statistic for the exponential distribution.
9.9 Let X1 , . . . , Xn be a random sample from the distribution with p.d.f.:
f(x|θ) = θ / (1 + x)^{θ+1},   0 < θ < ∞, 0 < x < ∞.
Find a sufficient statistic for θ.
9.10 Find a sufficient statistic for the parameter θ in the Rayleigh distribution with p.d.f.:
f(x|θ) = (x/θ²) e^{−x²/(2θ²)},   x > 0.
9.11 Find a sufficient statistic for the Beta distribution with p.d.f.:
f(x|α, β) = [ Γ(α + β) / (Γ(α)Γ(β)) ] x^{α−1} (1 − x)^{β−1},   0 < x < 1,
where α is known.
9.12 Find a sufficient statistic for the uniform distribution on the interval (α, β) where β is known and
the value of α is unknown, where β > α.
9.13 Find sufficient statistics for the uniform distribution on the interval (θ, θ + 3) where the value of θ
is unknown and −∞ < θ < ∞.
9.14 Suppose that the random variable X has the gamma distribution with an unknown scale parameter
β and a known shape parameter α, i.e:
f(x|β, α) = x^{α−1} e^{−x/β} / (β^α Γ(α)),   0 < x < ∞.
(a) Show that this distribution function belongs to the exponential family and find the natural
parameter.
(b) Find E(X) and Var(X).
9.15 Show that the Pareto distribution with p.d.f. defined by:
where v is a known constant, belongs to the exponential family. What then is the sufficient statistic
for θ, based on a random sample of size n from X?
9.16 Show that the negative binomial distribution with mass function:
f(x|θ, r) = C(x + r − 1, r − 1) θ^r (1 − θ)^x,   x = 0, 1, 2, . . . ,
where r is known, belongs to the exponential family. What is the natural parameter of this
distribution?
Chapter 10
Tests of Hypotheses
Definition 10.1 A simple hypothesis H is any statement that completely specifies the proba-
bility distribution or law for the random variable X. If the hypothesis is not simple, then it is
called composite. A test of a hypothesis H is a rule that tells us whether to accept or to reject
H for each possible observed random sample of X.
In perhaps most hypothesis testing contexts, it is possible to identify a particular hypothesis that
needs to be conclusively rejected before any positive action can be justified. For example, the
accused is assumed innocent until proven guilty beyond “reasonable doubt”; a tried and tested
medical procedure will not be replaced by a new treatment until the latter is demonstrably better
than the old. Such an hypothesis which needs to be rejected before action is taken is termed
the null hypothesis (H0 ). The null hypothesis typically corresponds to a situation of no change
from previous conditions (the status quo) or no deviation from desired conditions. In contrast
to H0 , the hypothesis relating to conditions consistent with a need for action is termed the
alternative hypothesis (H1 ). In principle we can talk of accepting or rejecting either hypothesis,
in the sense of acting as if it were true or false. Our description of how a null hypothesis is set
up, however, cautions against an over-literal interpretation of the terms “accept” or “reject”. If
H0 is not rejected, then we may not necessarily accept the truth of H0 ; it is merely that there
is insufficient evidence to justify the necessary actions consistent with the truth of H1 . If the
accused is found “not guilty”, we may not necessarily believe in his/her innocence! Even if H0 is
not rejected at the present time, more evidence may change our conclusions. For these reasons
we typically adopt the terminology of rejecting (or not rejecting) the null hypothesis H0 .
For the next few sections, we shall make the assumption that both the null and alternative
hypotheses are simple. In this case, there are two possible errors that can be made in testing H0
against H1:

Type I error: rejecting H0 when H0 is in fact true;

and:

Type II error: failing to reject H0 when H0 is in fact false (i.e. when H1 is true).
It is conventional to denote the probability of a Type I error by α = P (I), where α is termed the
size or the significance level of the test, and the probability of a Type II error by β = P (II).
The power of a test is defined as the probability of rejecting H0 : θ = θ0 when it is false, and is
given by 1 − P (II) = 1 − β.
Naturally we want to make the size of the test as small as possible and the power as large as
possible. An ideal test would be one in which α = β = 0, but for finite sample sizes this ideal is
not achievable. In fact, for fixed sample sizes, as α decreases, β increases, and vice versa. In
well-designed tests, the only way to decrease both α and β at the same time would be to increase
the sample size.
The following theorem enables us to identify a “best” test when judging between two simple hy-
potheses. In this sense, a best test is one which has greater power than any other test of the
same size, under the same sampling conditions. Recall that a test is entirely defined by a critical
region R for rejection of the null hypothesis, typically specified in terms of critical values for a
test statistic.
Theorem 10.1 (Neyman-Pearson) For testing the simple hypothesis H0 : θ = θ0 against the simple alternative H1 : θ = θ1, the most powerful test of size α has critical region R consisting of those sample points x for which

Λ(x) = L(θ0|x) / L(θ1|x) ≤ k,   (10.2)

where the constant k > 0 is chosen so that the test has size α.
Proof. By definition:

∫_R L(θ0|x) dx = α.

If R is the only critical region of size α obtainable from this sampling context, then the theorem is proved. Suppose, therefore, that there is another (distinctly different) critical region, say R*, also of size α, i.e. such that

∫_{R*} L(θ0|x) dx = α.
Let β and β ∗ be the type two error rates associated with the critical regions R and R∗ respectively.
We need then to show that β < β ∗ , i.e. that:
∫_R L(θ1|x) dx ≥ ∫_{R*} L(θ1|x) dx.   (10.3)

Now

∫_R L(θ1|x) dx = ∫_{R∩R*} L(θ1|x) dx + ∫_{R∩R̄*} L(θ1|x) dx   (10.4)

and

∫_{R*} L(θ1|x) dx = ∫_{R∩R*} L(θ1|x) dx + ∫_{R̄∩R*} L(θ1|x) dx.   (10.5)
From (10.2) it follows that for all points in R̄, and hence also in R̄ ∩ R*, L(θ0|x) > kL(θ1|x), so that:

k ∫_{R̄∩R*} L(θ1|x) dx < ∫_{R̄∩R*} L(θ0|x) dx.

Similarly, for all points in R, and hence also in R ∩ R̄*, L(θ0|x) ≤ kL(θ1|x), so that:

k ∫_{R∩R̄*} L(θ1|x) dx ≥ ∫_{R∩R̄*} L(θ0|x) dx.
Since k > 0, it follows that the left hand side of (10.6) satisfies
∫_{R*} L(θ1|x) dx − ∫_R L(θ1|x) dx < 0,
demonstrating that (1 − β ∗ ) < (1 − β), or β < β ∗ as we needed to show. This proves the
theorem.
Once we have decided on the desired size of a test, the Neyman-Pearson theorem allows us to
construct a test that has greatest possible power for the specified size (significance level).
Tests based on critical regions defined by (10.2) are called likelihood ratio tests. Note that if H0 is
true, then the numerator in (10.2) will tend to be larger than the denominator (and vice-versa).
It is thus intuitively reasonable to reject H0 when Λ(x) is small, precisely as indicated by the
definition of R.
We note that if T is a sufficient statistic for θ, then from the factorization theorem we have:

L(θ) = k1(t|θ) k2(x1, . . . , xn),

so that Λ(x) = k1(t|θ0)/k1(t|θ1); the likelihood ratio, and hence the test, then depends on the data only through the sufficient statistic T.
Example 10.2 Suppose that the time to failure of light bulbs has a normal distribution with
mean µ and variance σ 2 = (100)2 . We take a random sample of n = 25 bulbs in order to test
the (simple) null hypothesis H0 : µ = 1500 against the (simple) alternative H1 : µ = 1525. Note
that the variance parameter is fully specified, so that the relevant likelihood function is given by:

L(µ) = [2π(100)²]^{−25/2} exp{ −(1/2) Σ_{i=1}^{25} (x_i − µ)² / (100)² }.
The most powerful critical region satisfying Λ(x) ≤ k thus corresponds to the set of samples for
which −0.0625x̄ + 94.53 ≤ ln k, i.e. for which:
x̄ ≥ −ln k / 0.0625 + 94.53 / 0.0625 = c, say.
We do not need to calculate the value of k explicitly. It is sufficient (and simpler) to find a value
of c such that P (X̄ ≥ c|µ = 1500) = α, the desired size (significance level) of the test.
Now X̄ ∼ N (µ, σ 2 /n); note how we now consider the distribution of the random variable X̄, i.e.
the distribution of all possible sample means that could arise in the sample space, S. The type I
error probability can then be calculated from:
P(X̄ ≥ c|H0) = P( (X̄ − 1500)/20 ≥ (c − 1500)/20 | H0 ) = P( Z ≥ (c − 1500)/20 )
where Z has the standard normal distribution. The critical value c must thus satisfy

(c − 1500)/20 = zα,

where zα is the upper 100α percentage point of the standard normal distribution, i.e. c = 1500 + 20zα.
Suppose then that the required size is α = 0.05. From tables, z0.05 = 1.645 giving c = 1500 +
20(1.645) = 1532.9. We thus reject H0 if x̄ ≥ 1532.9.
The probability of a type II error is then given by:

β = P(X̄ < 1532.9 | µ = 1525) = P( Z < (1532.9 − 1525)/20 ) = P(Z < 0.395) ≈ 0.65,

so that the power of the test is only about 0.35.
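The critical value, type II error probability and power can also be computed directly; the following sketch (added for illustration, scipy assumed available) reproduces the calculation for Example 10.2.

from scipy.stats import norm

mu0, mu1 = 1500.0, 1525.0
sd_mean = 100.0 / 25**0.5                    # sigma / sqrt(n) = 20
alpha = 0.05

c = mu0 + norm.ppf(1 - alpha) * sd_mean      # critical value, about 1532.9
beta = norm.cdf((c - mu1) / sd_mean)         # P(X-bar < c | mu = 1525)
print(c, beta, 1 - beta)                     # power = 1 - beta, about 0.35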
The next example is essentially similar to the previous, but serves to reinforce understanding of
the principles. We shall skim through some of the details, however.
Example 10.3 Suppose a petroleum company is searching for additives that might decrease the
fuel consumption of cars. They send 30 cars of the identical model, fueled with a new addi-
tive, on a standard road test. Without the additive, it is known that the fuel consumptions
for this model of car on the same standard test would be normally distributed with a mean of
10`/100km, and a standard deviation of 0.9`/100km. On the basis of the observed fuel consump-
tions (x1 , x2 , . . . , x30 ), management wishes to test the null hypothesis H0 : µ = 10 against the
alternative H1 : µ = 9.5, using a significance level of α = 0.025.
As in the previous example, the most powerful (likelihood ratio) test can be shown to correspond
to a rejection region of the form x̄ ≤ c. (Note the required direction of the inequality!) Once
again, we choose c to give the desired size of the test. The type I error probability is found from:
α = P(X̄ ≤ c|µ = 10) = P( (X̄ − 10)/(0.9/√30) ≤ (c − 10)/(0.9/√30) ) = P( Z ≤ (c − 10)/(0.9/√30) ).
In order to ensure that α = 0.025, we find from normal tables that (c − 10)/(0.9/√30) = −1.96, so that:

c = 10 − 1.96 × 0.9/√30 = 9.68.
The corresponding probability of Type II error is then:
which is substantially smaller than before. The corresponding power is now 0.956.
The above examples have illustrated a general rule for hypothesis tests concerning the mean of a normal distribution with known variance. If X = (X1, . . . , Xn) is a sample from N(µ, σ²), with σ² known, then the most powerful test for testing H0 : µ = µ0 versus H1 : µ = µ1 (both simple) rejects H0 if:

x̄ ≥ µ0 + zα σ/√n   if µ1 > µ0
or

x̄ ≤ µ0 − zα σ/√n   if µ1 < µ0.
The direction of deviation that leads to rejection should be self-evident on “common sense”
grounds!
If desired values have been specified for both α and β (significance and power), then we can
determine the sample size needed to achieve both the desired α and β in a most powerful test.
Still considering samples from a normal distribution, let µ0 and µ1 be the specified values for µ
under H0 and H1 respectively. For purposes of illustration, suppose that µ1 > µ0 . Then:
α = P(X̄ ≥ c | µ = µ0)
= P( √n(X̄ − µ0)/σ ≥ √n(c − µ0)/σ | µ = µ0 )
= P( Z ≥ √n(c − µ0)/σ )
= 1 − P( Z < √n(c − µ0)/σ )
= 1 − Φ( √n(c − µ0)/σ )
where Φ(z) is the cumulative probability distribution function for the standard normal distribu-
tion. It thus follows that:
c = µ0 + zα σ/√n.   (10.8)
At this stage, the sample size n is unspecified so that we cannot (in contrast to the two previous
examples) determine the value for c. However, we do also have:

β = P(X̄ < c | µ = µ1) = Φ( √n(c − µ1)/σ ),

so that

c = µ1 − zβ σ/√n.   (10.9)

From (10.8) and (10.9) we have two equations in the two unknowns, c and n, which are easily solved. For example, in the problem of Example 10.2, suppose that we wished to use a significance level (α) of 0.01, and a power of 0.95 (i.e. β = 0.05). Equations (10.8) and (10.9) become:
c = 1500 + 2.33 × 100/√n
c = 1525 − 1.645 × 100/√n
Subtracting the one equation from the other yields 397.5/√n = 25, giving n = 252.8. Substituting this n into either equation yields c = 1514.7. Since we cannot take fractional observations, we conclude that the best test satisfying the specified error rates would require a sample of size n = 253 (at least), and rejection of H0 if x̄ > 1514.7.
Example 10.4 Assume that 10 years ago, traffic fatality data of accidents occurring in Cape
Town was examined and it was found that the number of traffic fatalities per day followed a
Poisson distribution with parameter λ = 0.1. Due to increased congestion, it is believed that the
rate is now higher, and arguments have been presented to the effect that λ = 0.15. We wish
to examine this assertion formally, in the form of a hypothesis test between H0 : λ = 0.1 (no
change) and H1 : λ = 0.15. The test is to be based on data from n randomly selected days over
the past year. The relevant likelihood function is:
L(λ) = λ^{Σ_{i=1}^n x_i} e^{−nλ} / ∏_{i=1}^n x_i!.
As with the previous examples, the likelihood ratio test rejects H0 for large values of Y = Σ_{i=1}^n X_i, so we need only determine the critical value c. Ideally, c should satisfy P(Y ≥ c|λ = 0.1) = α. We know that Y follows a Poisson distribution with parameter nλ = 0.1n,
so that the probabilities for any given c can be read from tables. There is however a practical
problem! Since the Poisson distribution is discrete, there might not exist any c for which P (Y ≥
c|λ = 0.1) is exactly equal to α.
As illustration, suppose that n = 20, giving E[Y ]=2 under H0 . Then from tables of Poisson
Probabilities (Table 4), we obtain:

P(Y ≥ 5|H0) = 1 − 0.947 = 0.053

and

P(Y ≥ 6|H0) = 1 − 0.983 = 0.017.

There is no c giving α = 0.05 exactly for this sample size. However, Y ≥ 5 and Y ≥ 6 could serve as tests with approximately 5% and 2% significance levels respectively.
For large sample sizes, in problems such as the previous example, one might need to use a
normal approximation to the true sampling distribution of the test statistic. Suppose that in
Example 10.4 we had n = 300, so that E[Y ]=30. The Poisson distribution in this case can
be approximated by the normal distribution with a mean and variance of 30 (i.e. a standard
deviation of 5.477). As in the examples from the normal distribution, the value for c should then
satisfy: c = 30 + 5.477zα . (Incidentally, don’t be confused! The sample size n has already been
taken into account in defining Y, which is different from the normal examples, where we still have to include the √n factor in the standard deviation of X̄.) Thus for α = 0.05, we
obtain c = 30 + 5.477 × 1.645 = 39.01. In practice, since Y is integer-valued, this amounts to rejecting H0 if Y ≥ 40. In fact, the exact size of this test can still be found from Poisson tables to be α = 0.046, slightly better than we wanted!
For an observed value T(x) of a test statistic T (defined so that large values discredit H0), the p-value is defined by:

p = P( T(X) ≥ T(x) | H0 ).
Note that the probability is calculated from the distribution of X over S, which can be viewed as
the distribution of values of the test statistic from repeated resampling under the same conditions
that led to the observation of X.
At an intuitive level, the p-value is the probability that the observed value of the test statistic,
or more extreme values (i.e. less supportive of H0 ), could have occurred by chance if H0 is in
fact true.
Example 10.5 Suppose that X1 , X2 , . . . , X16 is a random sample from a normal distribution
with mean µ and standard deviation 10. We are interested in testing H0 : µ = 85 against an
alternative H1 : µ 6= 85. In this case, we would need to define T (x) = |x̄ − 85|, since large
deviations of x̄ from 85 on either side are evidence against H0 .
Suppose now that the observed sample mean is 79.55, so that T (x) = 5.45. The p-value needs
then to be calculated from:
P( |X̄ − 85| > 5.45 | H0 ) = P( |X̄ − 85|/(10/4) > 2.18 | H0 )
= P( |Z| > 2.18 )
= 2 P( Z > 2.18 )
= 2 × 0.0146 = 0.0292.
(This probability is easily calculated directly, but may also be found from Table 4.)
Definition 10.2 The probability of rejecting a null hypothesis H0 when false, expressed as a
function of the parameter values that may be taken on under the alternative hypothesis (H1 ), is
termed the power function of the test. The power function is of particular value when testing
a simple null hypothesis against a composite alternative. For example, in testing H0 : θ = θ0
against H1 : θ > θ0 , the power function would be represented as a plot of the probability of
rejecting H0 conditional on θ = θ1 against all θ1 in (θ0 , ∞).
Example 10.7 Consider again the problem of Example 10.2. We saw that the best test was
to reject H0 : µ = 1500 for x̄ ≥ 1532.9, irrespective of the value of µ under the alternative
hypothesis. With the given σ = 100 and n = 25, the standard deviation of X̄ is 20. The power
function is thus the plot of:
P(X̄ ≥ 1532.9 | µ) = P( Z ≥ (1532.9 − µ)/20 ) = 1 − Φ( (1532.9 − µ)/20 )
against µ over the range (1500, ∞). With the help of normal tables, the power function is easily
constructed, and is plotted in Figure 10.2.
Figure 10.2: Power curve for Example 10.7 (one-sided normal): power plotted against the normal mean µ.
Example 10.8 A slightly more complicated case is the two-sided test of Example 10.5. It is
easily seen that in this example, the test giving α = 0.05 is that which rejects H0 when |X̄ − 85| ≥
1.96 × 2.5 = 4.9. The power function requires the calculation of P(|X̄ − 85| ≥ 4.9) for all possible values of the mean µ, as only the single point µ = 85 is specified under H0. Since X̄ is normally distributed with mean µ and standard deviation 2.5 (= 10/4), the probability of rejection for any µ is given by:

P( |X̄ − 85| ≥ 4.9 ) = P( X̄ ≤ 80.1 ) + P( X̄ ≥ 89.9 )
= P( Z ≤ (80.1 − µ)/2.5 ) + P( Z ≥ (89.9 − µ)/2.5 )
= Φ( (80.1 − µ)/2.5 ) + 1 − Φ( (89.9 − µ)/2.5 ).
Once more, this function is easily plotted with the help of normal tables, and is displayed in
Figure 10.3.
Example 10.9 Finally, consider a sample of size 10 from a Poisson distribution with parameter
λ, on the basis of which we wish to test H0 : λ = 1 versus H1 : λ < 1. Let Y be the sum of the
observations, which follows the Poisson distribution with parameter 10λ. It is easy to confirm
(cf. Example 10.4) that the most powerful test with size α ≈ 0.01 has rejection region Y ≤ 3.
For the power function, we thus need to calculate the following probability as a function of λ for
0 < λ < 1:
P(Y ≤ 3) = Σ_{y=0}^{3} (10λ)^y e^{−10λ} / y!,
a plot of which is displayed in Figure 10.4.
Compare and interpret the different shapes of the power curves in Figures 10.2–10.4.
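The three power functions are straightforward to evaluate over a grid of parameter values, which is essentially how Figures 10.2–10.4 are produced. The sketch below is added for illustration (scipy assumed available).

import numpy as np
from scipy.stats import norm, poisson

# Example 10.7: one-sided normal test, reject if x-bar >= 1532.9 (sd of mean = 20)
mu = np.linspace(1500, 1575, 76)
power_107 = 1 - norm.cdf((1532.9 - mu) / 20)

# Example 10.8: two-sided normal test, reject if |x-bar - 85| >= 4.9 (sd of mean = 2.5)
mu2 = np.linspace(75, 95, 81)
power_108 = norm.cdf((80.1 - mu2) / 2.5) + 1 - norm.cdf((89.9 - mu2) / 2.5)

# Example 10.9: Poisson test, reject if Y <= 3 with Y ~ Poisson(10 * lambda)
lam = np.linspace(0.01, 1.0, 100)
power_109 = poisson.cdf(3, 10 * lam)

print(power_107[:3], power_108[:3], power_109[:3])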
Figure 10.3: Power curve for Example 10.8 (two-sided normal): power plotted against the normal mean µ.
Figure 10.4: Power curve for Example 10.9 (one-sided Poisson): power plotted against the Poisson parameter λ.
Suppose now that ω is a subset of the full parameter space Ω, and that we wish to test

H0 : θ ∈ ω   (10.10)

against

H1 : θ ∉ ω,   i.e.   H1 : θ ∈ Ω \ ω.   (10.11)
Now let L(Ω̂) be the (unconstrained) maximum of the likelihood function over all θ ∈ Ω; and
let L(ω̂) be the maximum of the likelihood function over θ ∈ ω. The generalized likelihood ratio
statistic is defined by:
Λ(x) = L(ω̂|x) / L(Ω̂|x).   (10.12)
Where no confusion can arise, we may, for ease of notation, suppress the x argument in the
above.
Example 10.10 Let X ∼ N(µ, σ²) with both parameters unknown, and consider testing H0 : µ = 0 against H1 : µ ≠ 0, so that ω = {(µ, σ²) : µ = 0, σ² > 0}. The maximum over Ω occurs at the MLEs for µ and σ², which we know to be:

µ̂ = x̄,   σ̂² = (1/n) Σ_{i=1}^n (x_i − x̄)².
Now maximization of L(µ, σ 2 ) under ω requires maximization of L(0, σ 2 ), i.e. the MLE of σ 2
when µ = 0. It is left as an exercise to demonstrate that this MLE is given by:
σ̂²_ω = Σ_{i=1}^n x_i² / n.
Substituting these maximizing values into (10.12) gives

Λ = [ σ̂² / σ̂²_ω ]^{n/2} = [ Σ(x_i − x̄)² / Σ x_i² ]^{n/2} = [ 1 + n x̄² / Σ(x_i − x̄)² ]^{−n/2},

which is a decreasing function of n x̄²/Σ(x_i − x̄)². The condition Λ ≤ k is therefore equivalent to n x̄²/S² ≥ c² for a suitable constant c, where S² = Σ(x_i − x̄)²/(n − 1) is the sample variance. For any fixed value of c, the probability of a type I error is thus given by:

P( nX̄²/S² ≥ c² | H0 ) = P( √n |X̄|/S ≥ c | H0 ).
Notice that the probability statement refers to the random variation in the sample mean and
sample variance across the sample space (for example under repeated sampling). Now H0 specifies
the mean µ = 0, but does not specify the variance σ 2 , so that the probability might seem not to be
fully defined. However, we know that for sampling from the normal distribution, the t-statistic:
T = √n X̄ / S
has the t-distribution with n−1 degrees of freedom, irrespective of the value of σ 2 . The type I error
rate is thus fully determined by choice of c, and we can choose c to provide any desired significance
level. Thus, setting c = tn−1,α/2 (the upper 100α/2 percentage point of the t-distribution with
n − 1 degrees of freedom) will give a type I error rate of 100α%.
Notice how we have naturally arrived at the usual two sided t−test for testing H0 (as H1 includes
variations on either side of µ = 0).
The preceding example is a special case of the following general results for sampling from the
normal distribution with unknown variance (all of which may be derived as in the example): The
critical regions for the generalized likelihood ratio test of a null hypothesis H0 : µ = µ0 are as
given in the following table for the specified alternative hypotheses.
H0            H1            Critical region R
µ = µ0        µ < µ0        √n(x̄ − µ0)/s ≤ −t_{n−1,α}
µ = µ0        µ > µ0        √n(x̄ − µ0)/s ≥ t_{n−1,α}
µ = µ0        µ ≠ µ0        |√n(x̄ − µ0)/s| ≥ t_{n−1,α/2}
Example 10.11 Once again, let X ∼ N (µ, σ 2 ), but now define Ω and ω by:
Ω : { (µ, σ²) ; −∞ ≤ µ ≤ ∞, 0 ≤ σ² ≤ ∞ }
ω : { (µ, σ²) ; −∞ ≤ µ ≤ ∞, σ² = σ0² }.
The above define a test of H0 : σ 2 = σ02 against H1 : σ 2 6= σ02 when the mean is unspecified
(unknown).
The full likelihood function and the maximum over Ω is exactly as in Example 10.10, i.e. L(Ω̂)
is as specified in (10.13).
Maximization of L(µ, σ 2 ) under ω requires maximization of L(µ, σ02 ), i.e. the MLE of µ when
σ 2 = σ02 . The maximum occurs at µ = x̄, so that:
L(ω̂) = (1 / [2πσ0²]^{n/2}) exp{ −(1/2) Σ_{i=1}^n (x_i − x̄)² / σ0² },   since µ̂ = x̄.
Define

U = Σ_{i=1}^n (X_i − X̄)² / σ0².

The generalized likelihood ratio then reduces to

Λ = (u/n)^{n/2} exp{ −(u − n)/2 },   (10.15)
demonstrating that Λ depends only on u. The critical region is thus defined by the set of values
for u which satisfy:
(u/n)^{n/2} exp{ −(u − n)/2 } ≤ k.
It is easily seen that the function defined in (10.15) has a single maximum at u = n (giving Λ = 1
at this point), and tends to 0 as u → 0 and as u → ∞. The set defined by Λ ≤ k forms two
disjoint sets in u, namely {0 < u < c1} ∪ {c2 < u < ∞}, for some pair of positive real numbers
c1 < c2 .
In order to calculate error rates, we need the distribution of U , but it is well-known that U ∼ χ2n−1 .
The probability of a type I error is given by P (0 < U < c1 ) + P (c2 < U < ∞), which for any
given c1 and c2 are then obtainable from χ2 tables. In principle, we would wish to choose c1 and
c2 so that P (0 < U < c1 ) + P (c2 < U < ∞) = α, where α is the desired significance level. A
problem arises at this point, however: We cannot, strictly speaking, make independent choices
for c1 and c2 , as the boundary of the critical region defined by the generalized likelihood ratio
test would require Λ(u = c1 ) = Λ(u = c2 ). This last condition is difficult to incorporate into
analytical calculations. A common approximation to the two-sided test for H0 : σ 2 = σ02 is thus
to assume from considerations of balance that P (0 < U < c1 ) ≈ P (c2 < U < ∞) ≈ α/2. This
gives:
c1 = χ²_{n−1, 1−α/2}   and   c2 = χ²_{n−1, α/2},

where χ²_{r,β} is defined as the upper 100β percentile of the chi-squared distribution with r degrees of freedom.
The preceding example is easily extended to include all three of the cases summarized in the
table below, for testing H0 : σ 2 = σ02 against the alternatives specified.
H0            H1            Critical region R
σ² = σ0²      σ² < σ0²      Σ(x_i − x̄)²/σ0² ≤ χ²_{n−1,1−α}
σ² = σ0²      σ² > σ0²      Σ(x_i − x̄)²/σ0² ≥ χ²_{n−1,α}
σ² = σ0²      σ² ≠ σ0²      Σ(x_i − x̄)²/σ0² ≤ χ²_{n−1,1−α/2}  or  Σ(x_i − x̄)²/σ0² ≥ χ²_{n−1,α/2}
The preceding examples were convenient in the sense that the generalized likelihood ratio test simplified to a simple test on a sample statistic whose distribution was known. This, of course, will not always be the case. More generally, a useful approximation is given by Wilks' theorem, which demonstrates that asymptotically, for large sample sizes and under certain smoothness conditions, the distribution of −2 ln Λ tends to the χ²_{r−r0} distribution, where r is the number of free parameters in Ω, and r0 is the number of free parameters in ω.
See also Exercise 10.14 at the end of the chapter.
10.8 Tests for Distributional Fit

It is sometimes necessary to test whether the observed data arise from some specified distribution
or not. In this case, the null hypothesis is that the distribution function F (x) is of a specified
form, which may be a simple hypothesis (e.g. that the distribution is N (0, 1)) or a composite
hypothesis (e.g. that the distribution is normal, but with mean and variance unspecified).
We shall not go deeply into this topic. In the following sections we briefly summarize a few tests
that have been found to be practically useful.
This test should be well-known from the first year course, and we simply refresh the reader’s
memory by means of a simple example.
Example 10.12 Weights (in kg) of n = 122 students in an STA2004F class were recorded,
giving a sample mean of 65.16 and a sample standard deviation of 13.34. We wish to test
the hypothesis that the data arise from a normal distribution. The following table records the
counts of numbers of students in successive 10 kg intervals, as well as the relevant statistics for
executing the chi-squared test. Note that the theoretical probabilities in the third column are based
on the estimated mean and standard deviation, i.e. they are calculated from the N (65.16, 13.342 )
distribution.
Range            Number (Ni)   Probability (pi)   Expected (npi)   (Ni − npi)²/npi
X ≤ 45           6             0.0653             7.962            0.4835
45 < X ≤ 55      26            0.1577             19.242           2.3737
55 < X ≤ 65      33            0.2721             33.198           0.0012
65 < X ≤ 75      33            0.2745             33.491           0.0072
75 < X ≤ 85      15            0.1619             19.756           1.1451
X > 85           9             0.0684             8.351            0.0505
Sums:            122           1                  122              4.0612
As there are 6 bins, and two unknown parameters have been estimated, we make use of critical
values from the χ2 distribution with 6-2-1=3 degrees of freedom. From tables, even the 20%
critical point is seen to be 4.64>4.06, so that the p-value is greater than 0.2. We cannot reject
the null hypothesis. Of course, this does not prove that the data are normally distributed; it is
only that we cannot prove otherwise!
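The chi-squared statistic and p-value of Example 10.12 can be reproduced directly from the binned counts and fitted probabilities in the table; the sketch below is added for illustration (scipy assumed available).

import numpy as np
from scipy.stats import chi2

observed = np.array([6, 26, 33, 33, 15, 9])
probs = np.array([0.0653, 0.1577, 0.2721, 0.2745, 0.1619, 0.0684])
n = observed.sum()                              # 122

expected = n * probs
stat = np.sum((observed - expected) ** 2 / expected)   # about 4.06
df = len(observed) - 2 - 1                      # 6 bins, 2 estimated parameters
p_value = chi2.sf(stat, df)                     # greater than 0.2
print(stat, df, p_value)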
The problem with the chi-squared test is that (a) in order to have a reasonably powerful test,
one needs quite a large number of bins (perhaps 10 or more), but (b) for the χ2 approximation
to hold the expected number in each bin must be at least 5–10 (and in fact we would want an
average number rather greater than this). The implication is that the sample size needs to be
quite large (possibly n >> 100) for the test to be useful.
We recall that the two key characteristics of the normal distribution are that the coefficients of
skew and kurtosis are 0 and 3 respectively. The sample coefficients of skew and kurtosis may be
estimated as follows:

b1 = [ Σ_{i=1}^{n} (xi − x̄)³ / n ] / s³

and

b2 = [ Σ_{i=1}^{n} (xi − x̄)⁴ / n ] / s⁴.

Percentage points for √b1 and b2 for normally distributed data have been tabulated (Pearson and Hartley, Biometrika Tables for Statisticians), but for large samples the following approximations are useful:

b1 ≈ N(0, 6/n), so that Z1 = b1 / √(6/n) ≈ N(0, 1)

and

b2 − 3 ≈ N(0, 24/n), so that Z2 = (b2 − 3) / √(24/n) ≈ N(0, 1).

An “omnibus” test for normality would be to take the sum of squares Z1² + Z2², which under the hypothesis of normality should then have (approximately) the χ² distribution with two degrees of freedom.
For the data used in Example 10.12, we obtained b1 = 0.426 and b2 = 2.804, which gives Z1² + Z2² = 3.884. From χ² tables we find that the p-value for the test of normality is now approximately 0.15. Although we still would not reject H0, the result suggests a “possible”
significance (i.e. a possibility that larger sample sizes might yet lead to rejection), i.e. some
lingering doubt concerning normality. This result indicates that the omnibus test based on skew
and kurtosis may be more powerful than the usual chi-squared test of fit.
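A sketch of how the omnibus statistic might be computed from a raw data vector is given below (Python, with numpy and scipy assumed; the divisor-n convention for s is an assumption chosen to match the b1 and b2 formulas above).

    import numpy as np
    from scipy.stats import chi2

    def omnibus_normality(x):
        x = np.asarray(x, dtype=float)
        n = len(x)
        s = x.std(ddof=0)                         # divisor n, matching b1 and b2 above
        b1 = np.mean((x - x.mean()) ** 3) / s**3  # sample coefficient of skew
        b2 = np.mean((x - x.mean()) ** 4) / s**4  # sample coefficient of kurtosis
        z1 = b1 / np.sqrt(6.0 / n)
        z2 = (b2 - 3.0) / np.sqrt(24.0 / n)
        stat = z1**2 + z2**2
        return stat, chi2.sf(stat, 2)             # compare with chi-squared on 2 d.f.

    # Applied to the raw weights of Example 10.12 this reproduces (up to rounding)
    # the values quoted above, i.e. a statistic of about 3.88 and a p-value near 0.15.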
The Kolmogorov-Smirnov test (Kolmogorov, 1933; Smirnov, 1939) may be used to compare two different sample distributions with each other, or a sample distribution with an hypothesized theoretical distribution. As the K-S test does not require grouping of the data into bins, it can in principle be applied to relatively small sample sizes. The test is based on a direct comparison of the cumulative distribution functions (theoretical or empirical).
We describe the K-S test for the case in which we wish to test a null hypothesis that the data
arise from a specified distribution, with probability distribution function given by F (x). For
an observed random sample of size n, say x1 , x2 , . . . , xn , we define the empirical distribution
function En(x) by:

En(x) = (number of observations ≤ x) / n.

The Kolmogorov-Smirnov statistic Dn is the largest absolute difference between the empirical and hypothesized distribution functions, i.e. Dn = max over x of |En(x) − F(x)|.

Critical values dn,α, such that P(Dn ≥ dn,α) = α, have been tabulated, and some of these values are given in the attached Kolmogorov-Smirnov table (Table 6). Note that for large n, dn,α ≈ kα/√n, where the constant kα depends only on α.
Example 10.13 In a situation similar to that of Example 10.12, suppose that the weights of
only 20 students were recorded. The values, ordered from smallest to largest were: 41, 48, 51,
51, 55, 56, 58, 58, 60, 61, 64, 68, 68, 68, 69, 70, 73, 77, 91, 101. The sample mean and standard
deviation are calculated to be 64.4 and 14.2 respectively. The empirical distribution function for
these data, and the cumulative distribution function for N (64.4, 14.22 ) are displayed in Figure
10.5.
Visual inspection shows that the largest difference between En (x) and F (x) occurs at about x = 70,
and this is easily confirmed numerically. At this point, the difference is Dn = 0.146. From the
K-S tables, the 20% critical value for n = 20 is 0.23, so that we could not reject the hypothesis
of normality.
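For reference, the same comparison can be done with scipy's built-in routine, as in the sketch below. Note that the reported p-value treats N(64.4, 14.2²) as a fully specified null distribution; since the mean and standard deviation were in fact estimated from the data, that p-value should be read as indicative only.

    import numpy as np
    from scipy.stats import kstest

    weights = np.array([41, 48, 51, 51, 55, 56, 58, 58, 60, 61,
                        64, 68, 68, 68, 69, 70, 73, 77, 91, 101], dtype=float)

    mu_hat, sd_hat = weights.mean(), weights.std(ddof=1)
    result = kstest(weights, 'norm', args=(mu_hat, sd_hat))
    print(result.statistic, result.pvalue)   # statistic close to the 0.146 found above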
Another widely used test for normality is the Shapiro-Wilk W test (Shapiro and Wilk, 1965). The W test has been shown to be more powerful in small samples (n < 50) than its counterparts discussed in the preceding sections.
Figure 10.5: Empirical and estimated normal distributions of students’ weights (horizontal axis: Weight in kg; vertical axis: Cumulative Probability)
Define X(1) < X(2) < . . . < X(n) to be the order statistics of a random sample of size n. The Shapiro-Wilk test statistic is defined by:

W = ( Σ_{i=1}^{n} ai X(i) )² / ( (n − 1)S² )    (10.17)

where S is the usual sample standard deviation (so that the denominator is the sum of squared deviations about the mean), and a1, . . . , an are coefficients which have been tabulated by Shapiro and Wilk (1965) (derived from the expected values of the order statistics from a standard normal distribution).
The null hypothesis of normality is rejected if W ≤ Wα , where the critical values Wα are
tabulated by Shapiro and Wilk (1965). Many statistical software packages do in fact compute
the W statistic together with the associated p-value. The test statistic W takes values between
0 and 1, with values close to 1 indicating approximate normality.
For the data used in Example 10.12, the W-statistic was found (using R) to be 0.981, giving p = 0.0832. We could still not reject the null hypothesis of normality at the conventional 5% level, but we are getting closer to the rejection level, emphasizing both the doubts (“possible significance”) that may exist about normality and the increased power of the Shapiro-Wilk test.
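As a sketch, the corresponding call in Python (scipy assumed) is shown below, applied here to the 20-observation sample of Example 10.13 for illustration; the value W = 0.981 quoted above refers to the full 122-observation data set, which is not reproduced in these notes.

    import numpy as np
    from scipy.stats import shapiro

    weights = np.array([41, 48, 51, 51, 55, 56, 58, 58, 60, 61,
                        64, 68, 68, 68, 69, 70, 73, 77, 91, 101], dtype=float)

    w_stat, p_value = shapiro(weights)   # W close to 1 suggests approximate normality
    print(w_stat, p_value)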
10.9 Probability plots

We conclude this chapter with a brief discussion of the use of probability plots as an informal test of the fit of sample data to a particular distributional form. We call the test “informal” as it is not based on a specific test statistic (although a Kolmogorov-Smirnov statistic is easily calculated at the same time as the probability plot), but on a visual inspection of the graphical plot.
Suppose firstly that X(1) < X(2) < . . . < X(n) are the order statistics of a random sample drawn
from the uniform distribution on [0,1]. It can be shown that:
E(X(k)) = k/(n + 1).
It thus follows that a plot of the observed ordered observations x(1) , x(2) , . . . , x(n) against the
plotting points:
1/(n + 1), 2/(n + 1), . . . , n/(n + 1)
should give an approximately straight line.
Now suppose that we have a random sample drawn from a distribution hypothesized to have the
distribution function F (x). If we define Y = F (X) then we know that the distribution of Y is
uniform. Under the null hypothesis that the true distribution function is F (x), therefore, the
above result for the uniform distribution implies that a plot of
y(k) = F(x(k))    against    k/(n + 1)    (10.18)
should produce an approximately straight line. This is sometimes termed a Probability-
Probability (P-P) plot.
Remark 10.1 Some writers have suggested that slight modifications to the plotting positions
in (10.18) may give slightly better approximations to linearity under the null hypothesis. For
example, the plotting points given by:
(k − 1/4) / (n + 1/2)
are often used.
However, for the purposes of these notes, we shall keep to the plots defined by (10.18).

A second type of plot, the Quantile-Quantile (Q-Q) plot, applies when the hypothesized distribution function can be written in the form F(x) = G((x − µ)/σ), where the function G(z) is in a standard form (i.e. containing no unknown parameters). The normal distribution can of course be expressed in this way. If F(x) is expressible in this form, then solving for F(x) = k/(n + 1) requires:

G( (x − µ)/σ ) = k/(n + 1)
so that:

(x − µ)/σ = G⁻¹( k/(n + 1) ).

It thus follows that:

F⁻¹( k/(n + 1) ) = σ G⁻¹( k/(n + 1) ) + µ.
The result is that a Q-Q plot of the form
x(k)    against    G⁻¹( k/(n + 1) )    (10.20)
should still give a straight line (under the null hypothesis), and can be constructed without having
to know the parameters µ and σ. A bonus, in fact, is that the slope and intercept of the line
give estimates of σ and µ respectively.
There is no simple construction of this form for P-P plots. However, once µ and σ have been
estimated (e.g. by maximum likelihood), one can plot
G( (x(k) − µ̂)/σ̂ )    against    k/(n + 1).    (10.21)
Remark 10.2 Quantile-quantile plots emphasize the tails of the distribution, while P-P plots
focus on the centre of the distribution. Thus discrepancies in the tails of the distribution may
appear more pronounced on a Q-Q plot, but a P-P plot is more sensitive to discrepancies in the
middle of the hypothesized distribution.
Example 10.14 Consider again the sample of 20 weights used in Example 10.13, with the aim
of testing whether the data arise from a normal distribution. The sample data plus the plotting
coordinates are displayed in Table 10.1, where Φ( ) and Φ−1 ( ) denote the distribution function
and inverse distribution function for the standard normal distribution. Recall that the sample
mean and standard deviation give the estimates µ̂ = 64.4 and σ̂ = 14.18.
The normal Q-Q plot is that of x(k) against Φ−1 (pk ), and is displayed as Figure 10.6. The
(standardized) normal P-P plot is that of Φ(z(k) ) against pk , and is displayed as Figure 10.7. It
may be noted that the P-P plot appears essentially linear, but that the Q-Q plot shows a slight
upwards curvature in the upper tail. The conclusion is that the data are not far from normally
distributed, but there is some concern about possible deviations from normality in the upper tail, i.e. for weights above about 75 kg.
k    x(k)    pk = k/(n+1)    Φ⁻¹(pk)    z(k) = (x(k) − µ̂)/σ̂    Φ(z(k))
1 41 0.0476 -1.668 -1.651 0.0494
2 48 0.0952 -1.309 -1.157 0.1237
3 51 0.1429 -1.068 -0.945 0.1723
4 51 0.1905 -0.876 -0.945 0.1723
5 55 0.2381 -0.712 -0.663 0.2537
6 56 0.2857 -0.566 -0.593 0.2768
7 58 0.3333 -0.431 -0.451 0.3258
8 58 0.3810 -0.303 -0.451 0.3258
9 60 0.4286 -0.180 -0.310 0.3781
10 61 0.4762 -0.060 -0.240 0.4052
11 64 0.5238 0.060 -0.028 0.4887
12 68 0.5714 0.180 0.254 0.6002
13 68 0.6190 0.303 0.254 0.6002
14 68 0.6667 0.431 0.254 0.6002
15 69 0.7143 0.566 0.324 0.6272
16 70 0.7619 0.712 0.395 0.6536
17 73 0.8095 0.876 0.607 0.7279
18 77 0.8571 1.068 0.889 0.8129
19 91 0.9048 1.309 1.876 0.9697
20 101 0.9524 1.668 2.582 0.9951
Table 10.1: Coordinates for the Q-Q and P-P plots of the data on weights
Figure 10.6: Q-Q plot for the weights data (x(k) plotted against Φ⁻¹(pk))
Figure 10.7: P-P plot for the weights data (Φ(z(k)) plotted against pk)
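The plotting coordinates of Table 10.1 can be generated in a few lines; the Python sketch below (numpy and scipy assumed) reproduces the columns of the table up to rounding.

    import numpy as np
    from scipy.stats import norm

    x = np.sort(np.array([41, 48, 51, 51, 55, 56, 58, 58, 60, 61,
                          64, 68, 68, 68, 69, 70, 73, 77, 91, 101], dtype=float))
    n = len(x)
    mu_hat, sd_hat = x.mean(), x.std(ddof=1)     # 64.4 and about 14.18

    k = np.arange(1, n + 1)
    p_k = k / (n + 1)                  # plotting positions
    qq_x = norm.ppf(p_k)               # Phi^{-1}(p_k): horizontal axis of the Q-Q plot
    z_k = (x - mu_hat) / sd_hat        # standardized order statistics
    pp_y = norm.cdf(z_k)               # Phi(z_(k)): vertical axis of the P-P plot

    # Q-Q plot: x against qq_x (Figure 10.6); P-P plot: pp_y against p_k (Figure 10.7).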
Tutorial Exercises
10.1 Which of the following hypotheses are simple, which are composite?
(a) X follows a uniform distribution.
(b) A die is unbiased.
(c) X follows a normal distribution with mean 0 and variance σ 2 > 0.
(d) X follows a normal distribution with mean µ = 0.
10.2 A coin is thrown independently 10 times to test the hypothesis that the probability of heads is
1/2 versus the alternative that the probability is not 1/2. The test rejects if either 0 or 10 heads
are observed.
(a) What is the significance level of the test?
(b) If in fact the probability of heads is 0.1, what is the power of the test?
10.3 Suppose that X ∼ Bin(100, p). Consider the test that rejects H0 : p = 1/2 in favour of H1 : p 6= 1/2
for | X − 50 |> 10. Use the normal approximation (since n = 100 is large) to the binomial
distribution to find the size of the test, α.
10.4 Let X1 , . . . , Xn be a random sample from the Poisson distribution. Find the likelihood ratio test
for testing
H0 : λ = λ0 vs H1 : λ = λ1 .
Use the fact that Σ_{i=1}^{n} Xi follows a Poisson distribution to find a rejection region of size α = 0.1
if n = 10, λ0 = 0.5 and λ1 = 1.5. What is the power of this test?
10.5 Let X1 , . . . , X25 be a random sample from the normal distribution with variance 100. Find the
best rejection region to test H0 : µ = 0 vs H1 : µ = 1.5, at level α = 0.10. What is the power of
the test? Repeat for α = 0.01.
10.6 Let X1 , . . . , Xn be a random sample from N (0, φ), (where φ > 0 is the variance). Show that a
uniformly most powerful test (UMP) exists for testing
H0 : φ = φ0 vs H1 : φ > φ0 .
If n = 15, α = 0.05 and φ0 = 3 find the uniformly most powerful critical region.
10.7 Let X1 , . . . , Xn be a random sample from a N (θ, 1), distribution. Show there is no uniformly most
powerful (UMP) test for testing
H0 : θ = θ0 vs H1 : θ 6= θ0 .
Under what circumstances can you construct a uniformly most powerful test?
10.8 A test has been constructed of H0 : µ = 30 versus H1 : µ > 30, based on n = 16 observations from
a normal distribution with σ = 9. If 1 − β = 0.85 when µ = 34, what is α?
10.9 Construct a power curve for the α = 0.05 test of H0 : µ = 55 versus H1 : µ 6= 55 if the data consist
of a random sample of size 16 from a normal distribution having σ = 4.
10.10 A producer of frozen fish is being investigated by the Bureau of Fair Trades. Each package of fish
that this producer markets carries the claim that it contains 12 kilograms of fish; a complaint has
been registered that this claim is not true. The bureau acquires 100 packages of fish marketed by
this company and finds that:

Σ_{i=1}^{100} xi = 1178,    Σ_{i=1}^{100} xi² = 13903.39
where xi is the observed weight (in kilograms) of the ith package. We assume the true weights of
the packages are normally distributed with mean µ and variance σ 2 (unknown). At a significance
level of α = 0.01, could the bureau reject H0 : µ = 12 in favour of H1 : µ < 12? Why would the
hypotheses be formulated in this manner?
10.11 Suppose that a large lot of items is received, each of which is either defective or nondefective
(independently of each other). Let p be the proportion of defectives in the lot. If n = 50 items
are selected at random, find the best critical region of size α = 0.05 for testing H0 : p = 0.1
versus H1 : p = 0.2. For what composite alternative hypothesis would the test be uniformly most
powerful?
10.12 The time to failure X of a piece of electronic equipment is assumed to follow an exponential
distribution with parameter λ. Based on a random sample X1 , X2 , ..., X8 of lifetimes, we want to
test H0 : λ = 0.01 versus H1 : λ = 0.04. Find the form of the best critical region for the test.
10.13 Let X1 , X2 , ..., Xn be a random sample from the Weibull distribution with p.d.f.:
f(x|θ, λ) = θλ x^{λ−1} exp(−θx^λ),    x > 0,  θ, λ > 0.
Find the most powerful test for testing the null hypothesis H0 : θ = θ0 against the alternative
hypothesis H1 : θ = θ1 , assuming that λ > 0 is known.
10.14 Let X1 , X2 , ..., Xm be a random sample from the normal distribution with mean µ1 and variance
σ 2 , and Y1 , Y2 , ..., Yn be an independent random sample from the normal distribution with mean
µ2 and variance σ 2 (i.e. assuming equal variances). Find the generalized likelihood ratio test for
H0 : µ1 = µ2 versus H1 : µ1 ≠ µ2.
10.15 A large car dealer in the Western Cape wishes to plan stock levels for a particular luxury car.
Car sales of this model were analyzed over three months (60 working days), and the results are
presented in the table below which summarizes the data in terms of the numbers of days in which
various numbers of sales were observed.
No. of sales in day: 0 1 2 3 4 ≥5
Number of days: 7 17 22 9 4 1
Are the data consistent with a hypothesis that daily sales follow a Poisson distribution with a
mean of 2? Use the chi-square goodness of fit test.
10.16 The following observations are weight gains of 12 rats on a high protein diet (Snedecor and Cochran,
1980), ordered from the smallest to the largest observations.
83 97 104 107 113 119 123 124 129 134 146 161
Determine whether these data come from a normal distribution, by constructing the Q-Q plot and
the standardized normal probability (P-P) plot.
10.17 Let X denote the distance in hundreds of metres between flaws on a length of optic fibre cable. A
random sample of ten observations on X are as follows:
8 6 1 32 116 23 12 58 101 68
Carry out a Kolmogorov-Smirnov test to test the null hypothesis that the above distances have
an exponential distribution.
Chapter 11
Confidence Intervals
11.1 Basic principles

Suppose that L1(X1, . . . , Xn) and L2(X1, . . . , Xn) are two statistics such that

P[ L1(X1, . . . , Xn) ≤ θ ≤ L2(X1, . . . , Xn) ] = 1 − α.

Then the interval (L1(X1, . . . , Xn), L2(X1, . . . , Xn)) forms a 100(1 − α)% confidence interval for θ.
We have emphasized that it is the interval itself, i.e. (L1 , L2 ) and not θ, which is random, in the
sense that one or both of the end points are random variables, which will vary from sample to
sample.
For any specific set of observations, x1 , . . . , xn , we compute the corresponding values L1 (x1 , . . . , xn )
and L2 (x1 , . . . , xn ), and claim to be 100(1 − α)% confident that the true value of θ is contained
within the interval so defined. The claim may be justified in the sense that if we were to repeat
this process many times, then in the long run we shall be correct 100(1 − α)% of the time. In
any one instance, however, we may be wrong, and the constructed interval may not bracket the
true value of θ.
It is fundamentally wrong to interpret confidence as a probability statement about θ. The correct
interpretation of the confidence interval is that if we draw very many samples of size n from X,
then in the long run 100(1−α)% of the intervals (L1 , L2 ) will include θ. Conversely, the remaining
proportion (100α%) will not include θ.
The simulation exercises of the following two examples illustrate the meaning of confidence
intervals.
Example 11.1 A number of computer simulated samples of size 20 were drawn from the normal
distribution with mean µ = 10 and variance σ 2 = 9. From each sample generated in this manner,
the sample mean x̄ and sample variance s2 were calculated. The corresponding 90% confidence
intervals for µ and σ 2 were calculated in the usual way (see first year notes, but also later in this
chapter) from the following:
Mean:    x̄ ± t_{19,0.05} s/√20 = x̄ ± 1.729 s/√20

Variance:    ( 19s²/χ²_{19,0.05} ; 19s²/χ²_{19,0.95} ) = (0.630 s² ; 1.878 s²)
In Figure 11.1 a sequence of 20 of the 90% confidence intervals for the mean are displayed, where
the vertical line indicates the position of the true mean (µ = 10). It will be noted that 3 out of
the 20 intervals do not bracket the true mean. Extending the simulation to 100 simulated data
sets we found that 6 out of the 100 intervals did not bracket the true mean. Of course, we do
know that in the long run, 10% of the intervals that are constructed will, on average, not bracket
the true mean.
Figure 11.2 provides a similar comparison of the 90% confidence intervals for the variance. We
note that 2 out of the 20 intervals do not bracket the true variance (although a few more were
rather borderline). Extending the simulation to 100 simulated data sets we found that 9 out of
the 100 intervals did not bracket the true variance (i.e. quite close to the expected 10%). It is
also interesting to note from Figure 11.2 how asymmetric the intervals are with respect to the
true value for σ 2 .
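A minimal version of this simulation is sketched below in Python (numpy and scipy assumed; the random seed and the number of replications are arbitrary choices).

    import numpy as np
    from scipy.stats import t, chi2

    rng = np.random.default_rng(1)
    n, mu, sigma2 = 20, 10.0, 9.0
    reps, alpha = 100, 0.10

    t_crit = t.ppf(1 - alpha / 2, n - 1)        # 1.729
    chi_lo = chi2.ppf(alpha / 2, n - 1)         # chi^2_{19, 0.95} = 10.117
    chi_hi = chi2.ppf(1 - alpha / 2, n - 1)     # chi^2_{19, 0.05} = 30.144

    cover_mu = cover_var = 0
    for _ in range(reps):
        x = rng.normal(mu, np.sqrt(sigma2), n)
        xbar, s2 = x.mean(), x.var(ddof=1)
        if abs(xbar - mu) <= t_crit * np.sqrt(s2 / n):
            cover_mu += 1
        if (n - 1) * s2 / chi_hi <= sigma2 <= (n - 1) * s2 / chi_lo:
            cover_var += 1
    print(cover_mu, cover_var)   # each count should be close to 90 out of 100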
Example 11.2 In this example, we simulated the process of tossing an unbiased coin 20 times,
and recording the number of heads, say Y , which were observed. The sample estimate of the
probability of a head, p, is given by p̂ = y/20, where y is the observed value of Y . As in the
previous example, this simulated experiment could be repeated many times.
In each (simulated) experiment, the random variable Y follows the binomial distribution with
parameters n = 20 and p = 0.5. In order to construct an approximate confidence interval for p
from any single set of 20 coins tossed, we shall approximate the distribution of Y by the normal
distribution with mean 20p and variance 20p(1 − p). The sampling distribution of the estimate
p̂ is thus normal with mean p and variance p(1 − p)/20. In real-life sampling situations, the variance will be unknown (as it depends on p), but may be estimated by substituting p̂ for p. An approximate 90% confidence interval for p is thus given by p̂ ± 1.645 √( p̂(1 − p̂)/20 ).
In Figure 11.3 a sequence of 20 of the approximate 90% confidence intervals for the population
proportion are displayed, where the vertical line indicates the position of the true proportion
(p = 0.5). It will be noted that 4 out of the 20 intervals (20%) do not bracket the true proportion.
Extending the simulation to 100 simulated data sets we found that 12 out of the 100 intervals
did not bracket the true proportion (an error rate of 12%).
We can, in fact, for any specific true population proportion p, calculate the percentage of intervals
based on the normal approximation which do bracket p. For example, if p = 0.5, the approximate
interval when Y = 6 is easily calculated to be (0.131;0.469) which does not contain 0.5. In this
way, we see that the approximate interval only contains 0.5 when 7 ≤ Y ≤ 13. The probability
of Y falling in this range when p = 0.5 can be found from binomial tables to be 0.885. The claim of 90% confidence in the statement that p belongs to p̂ ± 1.645 √( p̂(1 − p̂)/20 ) is thus overstated, at least for p ≈ 0.5.
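This exact-coverage calculation is easily automated, as in the sketch below (Python, scipy assumed): loop over all possible values of Y and accumulate the binomial probabilities of those values whose approximate interval contains 0.5.

    import numpy as np
    from scipy.stats import binom

    n, p, z = 20, 0.5, 1.645
    coverage = 0.0
    for y in range(n + 1):
        p_hat = y / n
        half_width = z * np.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - half_width <= p <= p_hat + half_width:
            coverage += binom.pmf(y, n, p)       # only 7 <= y <= 13 qualify
    print(round(coverage, 3))                    # approximately 0.885, as quoted above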
There is a close linkage between the concepts of confidence intervals and hypothesis tests. Sup-
pose that we have constructed a 100(1-α)% confidence interval for a parameter θ, say (L1 ; L2 ).
Figure 11.1: 20 90% confidence intervals for the population mean, µ (vertical axis: sample number; the vertical line marks µ = 10)
Figure 11.2: 20 90% confidence intervals for the population variance, σ² (vertical axis: sample number; horizontal axis: σ²)
Figure 11.3: 20 approximate 90% confidence intervals for the population proportion, p (vertical axis: sample number; the vertical line marks p = 0.5)
Suppose further that we wish to test H0 : θ = θ0 against H1 : θ 6= θ0 . One test may be to reject
H0 if and only if θ0 6∈ (L1 ; L2 ). By definition, if H0 is true, then we will have L1 ≤ θ0 ≤ L2 with
probability 1 − α. In other words the probability of a type I error is α, so that the test based on
the confidence interval has size α (although it may not necessarily be the most powerful test of
this size).
Conversely, a 100(1-α)% confidence interval may be viewed as the set of all parameter values θ
that cannot be rejected against two-sided alternatives by a significance test with size α.
11.2 Construction of confidence intervals: Method I

Let T be any statistic (usually a sufficient statistic such as the MLE) that may be used in estimation
of θ. Suppose that we are able to find a function φ(T, θ) satisfying the following properties:
For any T = t, the function φ(t, θ) is continuous and monotonic (either strictly increasing
or strictly decreasing) in θ. This implies that for any value of t there is a unique inverse
function φ−1 (u, t), i.e. a unique value for θ satisfying φ(t, θ) = u, where this value will in
general depend on t.
The sampling distribution of U = φ(T, θ) does not depend on θ. Let h(u) represent the
p.d.f. of U .
Now find two numbers, say u1 and u2 such that P (u1 ≤ φ(T, θ) ≤ u2 ) = 1 − α. For example, we
could define u1 and u2 to be such that:
∫_{−∞}^{u1} h(u) du = ∫_{u2}^{∞} h(u) du = α/2.    (11.2)
Suppose first that φ(t, θ) is an increasing function of θ. Then u1 ≤ φ(T, θ) ≤ u2 if and only if φ⁻¹(u1, T) ≤ θ ≤ φ⁻¹(u2, T), so that P[ φ⁻¹(u1, T) ≤ θ ≤ φ⁻¹(u2, T) ] = 1 − α. It is left as an exercise to show that if φ(t, θ) is a decreasing function of θ, then a similar result follows with the directions of the inequalities reversed.
The two cases (increasing and decreasing functions) can be combined by defining the following
for any T = t:
L1 (t) = min{φ−1 (u1 , t); φ−1 (u2 , t)} L2 (t) = max{φ−1 (u1 , t); φ−1 (u2 , t)}.
In other words, the interval (L1 (t); L2 (t)) is a 100(1-α)% confidence interval for θ.
The difficulty, of course, is that it may not always be possible to find the statistic T and function
φ(T, θ) satisfying the above properties. In the next two examples, however, we are able to
construct confidence intervals in the above manner.
Example 11.3 Normal distribution with unknown mean (µ) and variance (σ 2 ).
We know that X̄ and S 2 are unbiased estimates for µ and σ 2 respectively. Furthermore:
T = (X̄ − µ) / √(S²/n)

has the t-distribution with n − 1 degrees of freedom, which does not depend on µ or σ². Thus:

P[ −t_{n−1,α/2} ≤ (X̄ − µ)/√(S²/n) ≤ t_{n−1,α/2} ] = 1 − α.

We also know that V = (n − 1)S²/σ² has the χ² distribution with n − 1 degrees of freedom, again independently of µ or σ². We thus have that:

P[ χ²_{n−1,1−α/2} ≤ (n − 1)S²/σ² ≤ χ²_{n−1,α/2} ] = 1 − α.

Inverting these two probability statements in the usual way gives the familiar 100(1 − α)% confidence intervals x̄ ± t_{n−1,α/2} √(s²/n) for µ and ( (n − 1)s²/χ²_{n−1,α/2} ; (n − 1)s²/χ²_{n−1,1−α/2} ) for σ².

The same construction applies to a random sample X1, . . . , Xn from the exponential distribution with parameter λ. Define U = 2λ Σ_{i=1}^{n} Xi = 2nλX̄; the m.g.f. of U is (1 − 2t)^{−n},
which is the m.g.f. of the gamma distribution with parameters ½ and n, i.e. the χ2 distribution
with 2n degrees of freedom. Thus the distribution of U is independent of λ, and we have:
P[ χ²_{2n,1−α/2} ≤ 2nλX̄ ≤ χ²_{2n,α/2} ] = 1 − α

or:

P[ χ²_{2n,1−α/2}/(2nX̄) ≤ λ ≤ χ²_{2n,α/2}/(2nX̄) ] = 1 − α.

Thus a 100(1 − α)% confidence interval for λ is given by:

( χ²_{2n,1−α/2}/(2nx̄) ; χ²_{2n,α/2}/(2nx̄) ).
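In practice this interval is obtained directly from chi-squared quantiles, as in the Python sketch below (scipy assumed; the data vector is purely illustrative).

    import numpy as np
    from scipy.stats import chi2

    x = np.array([0.8, 2.1, 0.4, 3.7, 1.5, 0.9, 2.8, 1.1, 0.6, 1.9])   # hypothetical lifetimes
    alpha = 0.05
    n, xbar = len(x), x.mean()

    lower = chi2.ppf(alpha / 2, 2 * n) / (2 * n * xbar)       # chi^2_{2n, 1-alpha/2} / (2 n xbar)
    upper = chi2.ppf(1 - alpha / 2, 2 * n) / (2 * n * xbar)   # chi^2_{2n, alpha/2} / (2 n xbar)
    print(lower, upper)          # 95% confidence interval for the rate lambda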
11.3 Construction of confidence intervals: Method II

Suppose that T is a statistic used in the estimation of θ, and define g1(θ) and g2(θ) to be the lower and upper 100(α/2) percentiles of the sampling distribution of T when the parameter value is θ, i.e. P[ T ≤ g1(θ) ] = P[ T ≥ g2(θ) ] = α/2.

Now suppose that g1(θ) and g2(θ) are continuous monotonic functions of θ (either both strictly increasing or both strictly decreasing). Define g1⁻¹(t) and g2⁻¹(t) as the corresponding inverse functions; for example, θ = g1⁻¹(t) is the solution to t = g1(θ) for any given t.

Now by definition of g1(θ) and g2(θ):

P[ g1(θ) ≤ T ≤ g2(θ) ] = 1 − α.

Suppose for the moment that g1(θ) and g2(θ) are increasing functions of θ. Applying the inverse functions, it then follows that:

P[ g2⁻¹(T) ≤ θ ≤ g1⁻¹(T) ] = 1 − α.

Again, it is left as an exercise to show that if g1(θ) and g2(θ) are decreasing functions of θ, then a similar result holds with the directions of the inequalities reversed. The two cases can be combined by defining:

L1(t) = min{ g1⁻¹(t); g2⁻¹(t) }    L2(t) = max{ g1⁻¹(t); g2⁻¹(t) },

so that (L1(T); L2(T)) is a 100(1 − α)% confidence interval for θ.
As an example, suppose that X1, . . . , Xn is a random sample from the uniform distribution on [0, θ], and take T = X(n), the largest observation, which has p.d.f. g(t|θ) = n t^{n−1}/θ^n for 0 ≤ t ≤ θ. The percentiles g1(θ) and g2(θ) are easily found. For example, for g1(θ) we need to solve:

∫_0^{g1(θ)} g(t|θ) dt = ∫_0^{g1(θ)} n t^{n−1}/θ^n dt = [ t^n/θ^n ]_0^{g1(θ)} = ( g1(θ)/θ )^n = α/2,

so that g1(θ) = (α/2)^{1/n} θ. Similarly, we find:

g2(θ) = (1 − α/2)^{1/n} θ.
Suppose, for instance, that the largest observation in a sample of size n = 20 is X(n) = 2.4, so that the (bias-corrected) point estimate of θ is:

θ̂ = ((n + 1)/n) X(n) = (21/20) × 2.4 = 2.52.

The limits of the 95% confidence interval for θ are:

(1 − α/2)^{−1/n} X(n) = (0.975)^{−1/20} × 2.4 = 2.403

(α/2)^{−1/n} X(n) = (0.025)^{−1/20} × 2.4 = 2.887.
Let Y be a Binomial random variable with parameters n and p. Define the functions h1 (p) and
h2 (p) as follows.
For any value of p, define h1 (p) as the smallest value of y satisfying the inequality:
Σ_{j=y}^{n} (n choose j) p^j (1 − p)^{n−j} ≤ α/2.    (11.7)
For any value of p, define h2 (p) as the largest value of y satisfying the inequality:
Σ_{j=0}^{y} (n choose j) p^j (1 − p)^{n−j} ≤ α/2.    (11.8)
Note that the expressions on the left hand sides of (11.7) and (11.8) are continuous functions of p, so that solutions exist satisfying the required equalities. For an observed value y, define p1(y) and p2(y) to be the values of p satisfying equality in (11.7) and (11.8) respectively.
Now for any y < n, p < p2 (y) implies that:
Σ_{j=0}^{y} (n choose j) p^j (1 − p)^{n−j} > α/2
since the left hand side is a monotonically decreasing function of p. (The decreasing nature should
be intuitively obvious from the fact that increasing p must shift the probability distribution of Y
towards higher values, leading to a decrease in P (Y ≤ y), but can be proven by differentiation.)
In other words, p < p2 (y) implies that y > h2 (p). It is easily seen by reversing the argument
that the converse also holds, so that p < p2 (y) if and only if y > h2 (p).
Similarly, it can be shown that p > p1 (y) if and only if y < h1 (p).
Let p1 (Y ) and p2 (Y ) denote the random variables taking on the values p1 (y) and p2 (y) respec-
tively when Y = y. The previous results show that the events {p > p1 (Y )} and {p < p2 (Y )} are
respectively equivalent to the events {Y < h1(p)} and {Y > h2(p)}. This implies:

P[ p1(Y) < p < p2(Y) ] = P[ h2(p) < Y < h1(p) ] ≥ 1 − α.

In other words, (p1(y); p2(y)) is at least a 100(1 − α)% confidence interval for p.
Example 11.6 Consider the binomial sampling situation in which 6 successes are observed in
20 trials. (Compare Example 11.2.) For y = 6, n = 20 and α = 0.1, it may be confirmed by
direct substitution in (11.7) and (11.8) that p1 (y) = 0.140 and p2 (y) = 0.507. This provides a
not very different but slightly shifted 90% confidence interval to that obtained from the normal
approximation in Example 11.2. Interestingly, this exact interval does include the value 0.5.
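The limits p1(y) and p2(y) can be found numerically by solving the equality versions of (11.7) and (11.8), as in the Python sketch below (scipy assumed); these are the classical Clopper-Pearson limits, and the code reproduces the values quoted in Example 11.6.

    from scipy.stats import binom
    from scipy.optimize import brentq

    n, y, alpha = 20, 6, 0.10

    # p1(y): solve P(Y >= y) = alpha/2, the equality version of (11.7)
    p1 = brentq(lambda p: binom.sf(y - 1, n, p) - alpha / 2, 1e-9, 1 - 1e-9)
    # p2(y): solve P(Y <= y) = alpha/2, the equality version of (11.8)
    p2 = brentq(lambda p: binom.cdf(y, n, p) - alpha / 2, 1e-9, 1 - 1e-9)

    print(round(p1, 3), round(p2, 3))   # approximately 0.140 and 0.507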
Let X1, X2, . . . , Xn be a random sample from the Poisson distribution with parameter λ. Define Y = Σ_{i=1}^{n} Xi, which also follows a Poisson distribution with parameter nλ.
Define the functions h1 (λ) and h2 (λ) as follows.
For any value of λ, define h1 (λ) as the smallest value of y satisfying the inequality:
Σ_{j=y}^{∞} (nλ)^j e^{−nλ} / j! ≤ α/2.    (11.10)
For any value of λ, define h2 (λ) as the largest value of y satisfying the inequality:
Σ_{j=0}^{y} (nλ)^j e^{−nλ} / j! ≤ α/2.    (11.11)
Define ℓ1(y) and ℓ2(y) to be the values of λ satisfying equality in (11.10) and (11.11) respectively, for an observed value y. Then, exactly as in the binomial case, λ < ℓ2(y) implies that Σ_{j=0}^{y} (nλ)^j e^{−nλ}/j! > α/2, since the left hand side is a monotonically decreasing function of λ. In other words, λ < ℓ2(y) implies that y > h2(λ). The converse is similarly demonstrated, so that λ < ℓ2(y) if and only if y > h2(λ). In the same way, we can show that λ > ℓ1(y) if and only if y < h1(λ).
Let `1 (Y ) and `2 (Y ) denote the random variables taking on the values `1 (y) and `2 (y) respectively
when Y = y. The previous results show that the events {λ > `1 (Y )} and {λ < `2 (Y )} are
respectively equivalent to the events {Y < h1(λ)} and {Y > h2(λ)}. This implies:

P[ ℓ1(Y) < λ < ℓ2(Y) ] = P[ h2(λ) < Y < h1(λ) ] ≥ 1 − α.

In other words, (ℓ1(y); ℓ2(y)) is at least a 100(1 − α)% confidence interval for λ.
Example 11.7 Consider a Poisson sampling situation with n = 10, and suppose we have ob-
served Y = 4. (You may like to refer back to Example 10.9.)
You may confirm by substituting back into (11.10) and (11.11) that the 95% confidence interval
limits for λ (α = 0.025) are given by `1 (4) = 0.109 and `2 (4) = 1.023.
Suppose that θ̂ is the MLE for the parameter θ. We know from Theorem 8.2 that for large
sample sizes n, the distribution of θ̂ tends to the normal with mean θ (i.e. the true value of the
parameter) and variance 1/(nI(θ)), where I(θ) is the expected Fisher Information defined by:

I(θ) = E[ ( ∂/∂θ ln f(X|θ) )² ] = Var[ ∂/∂θ ln f(X|θ) ].

Thus as n → ∞:

√(nI(θ)) (θ̂ − θ) ∼ N(0, 1).
It follows that an approximate 100(1 − α)% confidence interval for θ is given by θ̂ ± zα/2 / √(nI(θ)), where zα/2 is the upper α/2 critical value of the standard normal distribution.

Unfortunately, these limits depend on I(θ), which is a function of the (unknown) θ. In practice, however, a good approximation to the interval is obtained by replacing I(θ) by I(θ̂). The approximate asymptotic confidence interval for θ thus becomes:

θ̂ ± zα/2 / √(nI(θ̂)).    (11.14)
Example 11.8 Poisson distribution We have seen that the MLE for λ is given by λ̂ = X̄.
We calculate the expected Fisher Information as:

I(λ) = E[ ( ∂/∂λ ln f(X|λ) )² ]
     = E[ ( ∂/∂λ (−λ + X ln λ − ln X!) )² ]
     = E[ ( −1 + X/λ )² ]
     = (1/λ²) E(X − λ)²
     = Var(X)/λ² = λ/λ² = 1/λ.

In passing, we note that we could also use Lemma 8.1 to calculate I(λ) as follows:

I(λ) = −E[ ∂²/∂λ² ln f(X|λ) ]
     = −E[ ∂/∂λ ( −1 + X/λ ) ]
     = −E[ −X/λ² ] = λ/λ² = 1/λ.

The approximate asymptotic confidence interval (11.14) for λ is thus λ̂ ± zα/2 √(λ̂/n),
where λ̂ = X̄.
As a numerical example, consider the case in Example 11.7. Here we had n = 10, and had observed Σ_{i=1}^{10} xi = 4, giving X̄ = 0.4. From tables we find z0.025 = 1.96, so that the approximate 95% confidence interval is:

0.4 ± 1.96 √(0.4/10) = 0.4 ± 0.392,
or (0.008; 0.792). This is rather different to the exact interval found in Example 11.7, but this
just demonstrates that n = 10 does not approach ∞! For example, with the same sample mean
(X̄ = 0.4) obtained from a sample size n = 100, the exact confidence interval is (0.286; 0.545)
while the asymptotic approximation gives (0.276; 0.524).
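The asymptotic calculation is easily wrapped in a small function, as in the sketch below (Python, scipy assumed); it reproduces both of the approximate intervals quoted above.

    import numpy as np
    from scipy.stats import norm

    def poisson_wald_ci(xbar, n, conf=0.95):
        z = norm.ppf(1 - (1 - conf) / 2)      # 1.96 for a 95% interval
        half = z * np.sqrt(xbar / n)          # 1 / sqrt(n I(lambda_hat)) = sqrt(lambda_hat / n)
        return xbar - half, xbar + half

    print(poisson_wald_ci(0.4, 10))    # roughly (0.008, 0.792)
    print(poisson_wald_ci(0.4, 100))   # roughly (0.276, 0.524)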
Tutorial Exercises
11.1 Assume that a government department desires to estimate the fuel consumption of a particular
new (make and model) car. To do this, the department acquires one of these cars, fills the tank,
and then a trained driver drives the car for 100 kilometres. The tank is refilled and the same
driver again drives the car for 100 kilometres, and so on until this operation has been performed a
total of n = 10 times. The numbers of litres needed to fill the tank on each of these 10 times are:
11.35 10.50 9.36 9.86 11.64 9.31 9.36 11.12 10.26 10.05
Assuming that these values are a random sample from a normal distribution with mean µ and
variance σ 2 ,
(a) Compute a 90% confidence interval for µ, the mean number of litres needed to drive this car for 100 kilometres.
(b) Compute a 90% confidence interval for m = 100/µ, the kilometres per litre one might expect
from this make and model.
(c) Compute the 95% confidence interval for σ 2 .
11.2 The time needed to diagnose the fault and to make the repair for large pieces of equipment is fre-
quently assumed to be an exponential random variable. Assume the time necessary for diagnosing
and repairing a transmission problem for a 1981 Nissan car is an exponential random variable with
parameter λ. The observed times for diagnosing and repairing n = 9 of these transmissions were:
X = (1.7, 0.9, 3.0, 3.6, 0.5, 7.3, 3.2, 0.3, 6.1) hours, respectively. Use these values to construct a 95%
confidence interval for the mean time to diagnose and repair one of these transmissions.
11.3 Each hour a radio station broadcasts a beep marking the hour. The station has a quartz timepiece
that is used to trigger the instant of the beep. Unless the quartz used is very high quality, and
maintained under carefully controlled conditions, it is not 100 percent accurate. Assume the
difference between the time the beep starts and the exact time of the hour is a uniform random
variable X on the interval (−θ, θ), using microseconds as a unit. A random sample of n = 15
observed values for X were: 221, 265, -140, 327, -401, 308, -317, 447, -137, -228, -477, 69, 475, 56,
-101.
Compute a 99% confidence interval for θ, the magnitude of extreme error in timing the beep.
11.4 In a quality control investigation, a total of 75 items were randomly selected from the outputs of a
production line. Of these, a total of X = 6 were found to fail an SABS test. We assume that the
distribution of X is binomial with parameters 75 and p (the probability that a randomly selected
item fails the test).
(a) Compute the exact 95% confidence interval for p;
(b) Compare this interval with the interval based on the appropriate normal approximation.
11.5 The number of new houses sold, per week for 15 weeks, by an active real estate company, were
3, 3, 4, 6, 2, 4, 4, 1, 2, 0, 5, 7, 1, 4 respectively. Assuming these are the observed values for a random
sample of 15 of a Poisson random variable with parameter λ:
(a) Compute the exact 95% confidence interval for λ;
(b) Compare this interval with the interval based on the appropriate normal approximation.
11.6 The data below have arisen from a Weibull distribution with p.d.f. given by:
f(x|β) = 2βx e^{−βx²}.
Compute the asymptotic (approximate) 95% confidence interval for β based on the MLE.
14.0 6.8 8.3 18.4 14.6 12.6 2.9 15.9 20.0 11.6
25.9 15.5 10.7 16.4 30.9 26.6 30.9 25.6 12.4 25.0
Bibliography
Pearson, E.S. and Hartley, H.O. (1972) (Editors). Biometrika Tables for Statisticians. Cam-
bridge University Press
Quenouille, M.H. (1949). Approximate tests of correlation in time series. Journal of the Royal
Statistical Society, B, 11, 18-84.
Rao, C.R. (1945). Information and accuracy attainable in the estimation of statistical param-
eters. Bulletin of Calcutta Mathematical Society, 37, 81.
Rice, J. (1995). Mathematical Statistics and Data Analysis. 2nd Edition. Wadsworth Inc.,
California.
Shapiro, S.S. and Wilk, M.B. (1965). An analysis of variance test for normality (complete
samples). Biometrika, 52, 591-611.
Statistical Tables
TABLE 1. STANDARD NORMAL DISTRIBUTION: Areas under the standard normal curve
between 0 and z, i.e. Pr[0 < Z < z]
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.49180 0.49202 0.49224 0.49245 0.49266 0.49286 0.49305 0.49324 0.49343 0.49361
2.5 0.49379 0.49396 0.49413 0.49430 0.49446 0.49461 0.49477 0.49492 0.49506 0.49520
2.6 0.49534 0.49547 0.49560 0.49573 0.49585 0.49598 0.49609 0.49621 0.49632 0.49643
2.7 0.49653 0.49664 0.49674 0.49683 0.49693 0.49702 0.49711 0.49720 0.49728 0.49736
2.8 0.49744 0.49752 0.49760 0.49767 0.49774 0.49781 0.49788 0.49795 0.49801 0.49807
2.9 0.49813 0.49819 0.49825 0.49831 0.49836 0.49841 0.49846 0.49851 0.49856 0.49861
3.0 0.49865 0.49869 0.49874 0.49878 0.49882 0.49886 0.49889 0.49893 0.49896 0.49900
3.1 0.49903 0.49906 0.49910 0.49913 0.49916 0.49918 0.49921 0.49924 0.49926 0.49929
3.2 0.49931 0.49934 0.49936 0.49938 0.49940 0.49942 0.49944 0.49946 0.49948 0.49950
3.3 0.49952 0.49953 0.49955 0.49957 0.49958 0.49960 0.49961 0.49962 0.49964 0.49965
3.4 0.49966 0.49968 0.49969 0.49970 0.49971 0.49972 0.49973 0.49974 0.49975 0.49976
3.5 0.49977 0.49978 0.49978 0.49979 0.49980 0.49981 0.49981 0.49982 0.49983 0.49983
3.6 0.49984 0.49985 0.49985 0.49986 0.49986 0.49987 0.49987 0.49988 0.49988 0.49989
3.7 0.49989 0.49990 0.49990 0.49990 0.49991 0.49991 0.49992 0.49992 0.49992 0.49992
3.8 0.49993 0.49993 0.49993 0.49994 0.49994 0.49994 0.49994 0.49995 0.49995 0.49995
3.9 0.49995 0.49995 0.49996 0.49996 0.49996 0.49996 0.49996 0.49996 0.49997 0.49997
4.0 0.49997 0.49997 0.49997 0.49997 0.49997 0.49997 0.49998 0.49998 0.49998 0.49998
TABLE 2. CHI-SQUARED DISTRIBUTION: One sided critical values, i.e. the value of χ²_n(P) such that P = Pr[χ²_n > χ²_n(P)], where n is the degrees of freedom, for P > 0.5
Probability Level P
n 0.9995 0.999 0.9975 0.995 0.99 0.975 0.95 0.9 0.8 0.6
1 0.000 0.000 0.000 0.000 0.000 0.001 0.004 0.016 0.064 0.275
2 0.001 0.002 0.005 0.010 0.020 0.051 0.103 0.211 0.446 1.022
3 0.015 0.024 0.045 0.072 0.115 0.216 0.352 0.584 1.005 1.869
4 0.064 0.091 0.145 0.207 0.297 0.484 0.711 1.064 1.649 2.753
5 0.158 0.210 0.307 0.412 0.554 0.831 1.145 1.610 2.343 3.656
6 0.299 0.381 0.527 0.676 0.872 1.237 1.635 2.204 3.070 4.570
7 0.485 0.599 0.794 0.989 1.239 1.690 2.167 2.833 3.822 5.493
8 0.710 0.857 1.104 1.344 1.647 2.180 2.733 3.490 4.594 6.423
9 0.972 1.152 1.450 1.735 2.088 2.700 3.325 4.168 5.380 7.357
10 1.265 1.479 1.827 2.156 2.558 3.247 3.940 4.865 6.179 8.295
11 1.587 1.834 2.232 2.603 3.053 3.816 4.575 5.578 6.989 9.237
12 1.935 2.214 2.661 3.074 3.571 4.404 5.226 6.304 7.807 10.182
13 2.305 2.617 3.112 3.565 4.107 5.009 5.892 7.041 8.634 11.129
14 2.697 3.041 3.582 4.075 4.660 5.629 6.571 7.790 9.467 12.078
15 3.107 3.483 4.070 4.601 5.229 6.262 7.261 8.547 10.307 13.030
16 3.536 3.942 4.573 5.142 5.812 6.908 7.962 9.312 11.152 13.983
17 3.980 4.416 5.092 5.697 6.408 7.564 8.672 10.085 12.002 14.937
18 4.439 4.905 5.623 6.265 7.015 8.231 9.390 10.865 12.857 15.893
19 4.913 5.407 6.167 6.844 7.633 8.907 10.117 11.651 13.716 16.850
20 5.398 5.921 6.723 7.434 8.260 9.591 10.851 12.443 14.578 17.809
21 5.895 6.447 7.289 8.034 8.897 10.283 11.591 13.240 15.445 18.768
22 6.404 6.983 7.865 8.643 9.542 10.982 12.338 14.041 16.314 19.729
23 6.924 7.529 8.450 9.260 10.196 11.689 13.091 14.848 17.187 20.690
24 7.453 8.085 9.044 9.886 10.856 12.401 13.848 15.659 18.062 21.652
25 7.991 8.649 9.646 10.520 11.524 13.120 14.611 16.473 18.940 22.616
26 8.537 9.222 10.256 11.160 12.198 13.844 15.379 17.292 19.820 23.579
27 9.093 9.803 10.873 11.808 12.878 14.573 16.151 18.114 20.703 24.544
28 9.656 10.391 11.497 12.461 13.565 15.308 16.928 18.939 21.588 25.509
29 10.227 10.986 12.128 13.121 14.256 16.047 17.708 19.768 22.475 26.475
30 10.804 11.588 12.765 13.787 14.953 16.791 18.493 20.599 23.364 27.442
31 11.388 12.196 13.407 14.458 15.655 17.539 19.281 21.434 24.255 28.409
32 11.980 12.810 14.055 15.134 16.362 18.291 20.072 22.271 25.148 29.376
33 12.576 13.431 14.709 15.815 17.073 19.047 20.867 23.110 26.042 30.344
34 13.180 14.057 15.368 16.501 17.789 19.806 21.664 23.952 26.938 31.313
35 13.788 14.688 16.032 17.192 18.509 20.569 22.465 24.797 27.836 32.282
36 14.401 15.324 16.700 17.887 19.233 21.336 23.269 25.643 28.735 33.252
37 15.021 15.965 17.373 18.586 19.960 22.106 24.075 26.492 29.635 34.222
38 15.644 16.611 18.050 19.289 20.691 22.878 24.884 27.343 30.537 35.192
39 16.272 17.261 18.732 19.996 21.426 23.654 25.695 28.196 31.441 36.163
40 16.906 17.917 19.417 20.707 22.164 24.433 26.509 29.051 32.345 37.134
45 20.136 21.251 22.899 24.311 25.901 28.366 30.612 33.350 36.884 41.995
50 23.461 24.674 26.464 27.991 29.707 32.357 34.764 37.689 41.449 46.864
60 30.339 31.738 33.791 35.534 37.485 40.482 43.188 46.459 50.641 56.620
70 37.467 39.036 41.332 43.275 45.442 48.758 51.739 55.329 59.898 66.396
80 44.792 46.520 49.043 51.172 53.540 57.153 60.391 64.278 69.207 76.188
90 52.277 54.156 56.892 59.196 61.754 65.647 69.126 73.291 78.558 85.993
100 59.895 61.918 64.857 67.328 70.065 74.222 77.929 82.358 87.945 95.808
110 67.631 69.790 72.922 75.550 78.458 82.867 86.792 91.471 97.362 105.632
120 75.465 77.756 81.073 83.852 86.923 91.573 95.705 100.624 106.806 115.465
140 91.389 93.925 97.591 100.655 104.034 109.137 113.659 119.029 125.758 135.149
160 107.599 110.359 114.350 117.679 121.346 126.870 131.756 137.546 144.783 154.856
180 124.032 127.011 131.305 134.884 138.821 144.741 149.969 156.153 163.868 174.580
200 140.659 143.842 148.426 152.241 156.432 162.728 168.279 174.835 183.003 194.319
TABLE 2, continued. CHI-SQUARED DISTRIBUTION: One sided critical values, i.e. the value of χ²_n(P) such that P = Pr[χ²_n > χ²_n(P)], where n is the degrees of freedom, for P < 0.5
Probability Level P
n 0.4 0.2 0.1 0.05 0.025 0.01 0.005 0.0025 0.001 0.0005
1 0.708 1.642 2.706 3.841 5.024 6.635 7.879 9.140 10.827 12.115
2 1.833 3.219 4.605 5.991 7.378 9.210 10.597 11.983 13.815 15.201
3 2.946 4.642 6.251 7.815 9.348 11.345 12.838 14.320 16.266 17.731
4 4.045 5.989 7.779 9.488 11.143 13.277 14.860 16.424 18.466 19.998
5 5.132 7.289 9.236 11.070 12.832 15.086 16.750 18.385 20.515 22.106
6 6.211 8.558 10.645 12.592 14.449 16.812 18.548 20.249 22.457 24.102
7 7.283 9.803 12.017 14.067 16.013 18.475 20.278 22.040 24.321 26.018
8 8.351 11.030 13.362 15.507 17.535 20.090 21.955 23.774 26.124 27.867
9 9.414 12.242 14.684 16.919 19.023 21.666 23.589 25.463 27.877 29.667
10 10.473 13.442 15.987 18.307 20.483 23.209 25.188 27.112 29.588 31.419
11 11.530 14.631 17.275 19.675 21.920 24.725 26.757 28.729 31.264 33.138
12 12.584 15.812 18.549 21.026 23.337 26.217 28.300 30.318 32.909 34.821
13 13.636 16.985 19.812 22.362 24.736 27.688 29.819 31.883 34.527 36.477
14 14.685 18.151 21.064 23.685 26.119 29.141 31.319 33.426 36.124 38.109
15 15.733 19.311 22.307 24.996 27.488 30.578 32.801 34.949 37.698 39.717
16 16.780 20.465 23.542 26.296 28.845 32.000 34.267 36.456 39.252 41.308
17 17.824 21.615 24.769 27.587 30.191 33.409 35.718 37.946 40.791 42.881
18 18.868 22.760 25.989 28.869 31.526 34.805 37.156 39.422 42.312 44.434
19 19.910 23.900 27.204 30.144 32.852 36.191 38.582 40.885 43.819 45.974
20 20.951 25.038 28.412 31.410 34.170 37.566 39.997 42.336 45.314 47.498
21 21.992 26.171 29.615 32.671 35.479 38.932 41.401 43.775 46.796 49.010
22 23.031 27.301 30.813 33.924 36.781 40.289 42.796 45.204 48.268 50.510
23 24.069 28.429 32.007 35.172 38.076 41.638 44.181 46.623 49.728 51.999
24 25.106 29.553 33.196 36.415 39.364 42.980 45.558 48.034 51.179 53.478
25 26.143 30.675 34.382 37.652 40.646 44.314 46.928 49.435 52.619 54.948
26 27.179 31.795 35.563 38.885 41.923 45.642 48.290 50.829 54.051 56.407
27 28.214 32.912 36.741 40.113 43.195 46.963 49.645 52.215 55.475 57.856
28 29.249 34.027 37.916 41.337 44.461 48.278 50.994 53.594 56.892 59.299
29 30.283 35.139 39.087 42.557 45.722 49.588 52.335 54.966 58.301 60.734
30 31.316 36.250 40.256 43.773 46.979 50.892 53.672 56.332 59.702 62.160
31 32.349 37.359 41.422 44.985 48.232 52.191 55.002 57.692 61.098 63.581
32 33.381 38.466 42.585 46.194 49.480 53.486 56.328 59.046 62.487 64.993
33 34.413 39.572 43.745 47.400 50.725 54.775 57.648 60.395 63.869 66.401
34 35.444 40.676 44.903 48.602 51.966 56.061 58.964 61.738 65.247 67.804
35 36.475 41.778 46.059 49.802 53.203 57.342 60.275 63.076 66.619 69.197
36 37.505 42.879 47.212 50.998 54.437 58.619 61.581 64.410 67.985 70.588
37 38.535 43.978 48.363 52.192 55.668 59.893 62.883 65.738 69.348 71.971
38 39.564 45.076 49.513 53.384 56.895 61.162 64.181 67.063 70.704 73.350
39 40.593 46.173 50.660 54.572 58.120 62.428 65.475 68.383 72.055 74.724
40 41.622 47.269 51.805 55.758 59.342 63.691 66.766 69.699 73.403 76.096
45 46.761 52.729 57.505 61.656 65.410 69.957 73.166 76.223 80.078 82.873
50 51.892 58.164 63.167 67.505 71.420 76.154 79.490 82.664 86.660 89.560
60 62.135 68.972 74.397 79.082 83.298 88.379 91.952 95.344 99.608 102.697
70 72.358 79.715 85.527 90.531 95.023 100.425 104.215 107.808 112.317 115.577
80 82.566 90.405 96.578 101.879 106.629 112.329 116.321 120.102 124.839 128.264
90 92.761 101.054 107.565 113.145 118.136 124.116 128.299 132.255 137.208 140.780
100 102.946 111.667 118.498 124.342 129.561 135.807 140.170 144.292 149.449 153.164
110 113.121 122.250 129.385 135.480 140.916 147.414 151.948 156.230 161.582 165.436
120 123.289 132.806 140.233 146.567 152.211 158.950 163.648 168.081 173.618 177.601
140 143.604 153.854 161.827 168.613 174.648 181.841 186.847 191.565 197.450 201.680
160 163.898 174.828 183.311 190.516 196.915 204.530 209.824 214.808 221.020 225.477
180 184.173 195.743 204.704 212.304 219.044 227.056 232.620 237.855 244.372 249.049
200 204.434 216.609 226.021 233.994 241.058 249.445 255.264 260.735 267.539 272.422
TABLE 3. t-DISTRIBUTION: One sided critical values, i.e. the value of t_n(P) such that P = Pr[t_n > t_n(P)], where n is the degrees of freedom (the row labelled z gives the limiting standard normal values)
Probability Level P
n 0.4 0.3 0.2 0.1 0.05 0.025 0.01 0.005 0.0025 0.001 0.0005
1 0.325 0.727 1.376 3.078 6.314 12.71 31.82 63.66 127.3 318.3 636.6
2 0.289 0.617 1.061 1.886 2.920 4.303 6.965 9.925 14.09 22.33 31.60
3 0.277 0.584 0.978 1.638 2.353 3.182 4.541 5.841 7.453 10.21 12.92
4 0.271 0.569 0.941 1.533 2.132 2.776 3.747 4.604 5.598 7.173 8.610
5 0.267 0.559 0.920 1.476 2.015 2.571 3.365 4.032 4.773 5.894 6.869
6 0.265 0.553 0.906 1.440 1.943 2.447 3.143 3.707 4.317 5.208 5.959
7 0.263 0.549 0.896 1.415 1.895 2.365 2.998 3.499 4.029 4.785 5.408
8 0.262 0.546 0.889 1.397 1.860 2.306 2.896 3.355 3.833 4.501 5.041
9 0.261 0.543 0.883 1.383 1.833 2.262 2.821 3.250 3.690 4.297 4.781
10 0.260 0.542 0.879 1.372 1.812 2.228 2.764 3.169 3.581 4.144 4.587
11 0.260 0.540 0.876 1.363 1.796 2.201 2.718 3.106 3.497 4.025 4.437
12 0.259 0.539 0.873 1.356 1.782 2.179 2.681 3.055 3.428 3.930 4.318
13 0.259 0.538 0.870 1.350 1.771 2.160 2.650 3.012 3.372 3.852 4.221
14 0.258 0.537 0.868 1.345 1.761 2.145 2.624 2.977 3.326 3.787 4.140
15 0.258 0.536 0.866 1.341 1.753 2.131 2.602 2.947 3.286 3.733 4.073
16 0.258 0.535 0.865 1.337 1.746 2.120 2.583 2.921 3.252 3.686 4.015
17 0.257 0.534 0.863 1.333 1.740 2.110 2.567 2.898 3.222 3.646 3.965
18 0.257 0.534 0.862 1.330 1.734 2.101 2.552 2.878 3.197 3.610 3.922
19 0.257 0.533 0.861 1.328 1.729 2.093 2.539 2.861 3.174 3.579 3.883
20 0.257 0.533 0.860 1.325 1.725 2.086 2.528 2.845 3.153 3.552 3.850
21 0.257 0.532 0.859 1.323 1.721 2.080 2.518 2.831 3.135 3.527 3.819
22 0.256 0.532 0.858 1.321 1.717 2.074 2.508 2.819 3.119 3.505 3.792
23 0.256 0.532 0.858 1.319 1.714 2.069 2.500 2.807 3.104 3.485 3.768
24 0.256 0.531 0.857 1.318 1.711 2.064 2.492 2.797 3.091 3.467 3.745
25 0.256 0.531 0.856 1.316 1.708 2.060 2.485 2.787 3.078 3.450 3.725
26 0.256 0.531 0.856 1.315 1.706 2.056 2.479 2.779 3.067 3.435 3.707
27 0.256 0.531 0.855 1.314 1.703 2.052 2.473 2.771 3.057 3.421 3.689
28 0.256 0.530 0.855 1.313 1.701 2.048 2.467 2.763 3.047 3.408 3.674
29 0.256 0.530 0.854 1.311 1.699 2.045 2.462 2.756 3.038 3.396 3.660
30 0.256 0.530 0.854 1.310 1.697 2.042 2.457 2.750 3.030 3.385 3.646
31 0.256 0.530 0.853 1.309 1.696 2.040 2.453 2.744 3.022 3.375 3.633
32 0.255 0.530 0.853 1.309 1.694 2.037 2.449 2.738 3.015 3.365 3.622
33 0.255 0.530 0.853 1.308 1.692 2.035 2.445 2.733 3.008 3.356 3.611
34 0.255 0.529 0.852 1.307 1.691 2.032 2.441 2.728 3.002 3.348 3.601
35 0.255 0.529 0.852 1.306 1.690 2.030 2.438 2.724 2.996 3.340 3.591
36 0.255 0.529 0.852 1.306 1.688 2.028 2.434 2.719 2.990 3.333 3.582
37 0.255 0.529 0.851 1.305 1.687 2.026 2.431 2.715 2.985 3.326 3.574
38 0.255 0.529 0.851 1.304 1.686 2.024 2.429 2.712 2.980 3.319 3.566
39 0.255 0.529 0.851 1.304 1.685 2.023 2.426 2.708 2.976 3.313 3.558
40 0.255 0.529 0.851 1.303 1.684 2.021 2.423 2.704 2.971 3.307 3.551
45 0.255 0.528 0.850 1.301 1.679 2.014 2.412 2.690 2.952 3.281 3.520
50 0.255 0.528 0.849 1.299 1.676 2.009 2.403 2.678 2.937 3.261 3.496
60 0.254 0.527 0.848 1.296 1.671 2.000 2.390 2.660 2.915 3.232 3.460
70 0.254 0.527 0.847 1.294 1.667 1.994 2.381 2.648 2.899 3.211 3.435
80 0.254 0.526 0.846 1.292 1.664 1.990 2.374 2.639 2.887 3.195 3.416
90 0.254 0.526 0.846 1.291 1.662 1.987 2.368 2.632 2.878 3.183 3.402
100 0.254 0.526 0.845 1.290 1.660 1.984 2.364 2.626 2.871 3.174 3.390
110 0.254 0.526 0.845 1.289 1.659 1.982 2.361 2.621 2.865 3.166 3.381
120 0.254 0.526 0.845 1.289 1.658 1.980 2.358 2.617 2.860 3.160 3.373
140 0.254 0.526 0.844 1.288 1.656 1.977 2.353 2.611 2.852 3.149 3.361
160 0.254 0.525 0.844 1.287 1.654 1.975 2.350 2.607 2.847 3.142 3.352
180 0.254 0.525 0.844 1.286 1.653 1.973 2.347 2.603 2.842 3.136 3.345
200 0.254 0.525 0.843 1.286 1.653 1.972 2.345 2.601 2.838 3.131 3.340
z 0.253 0.524 0.842 1.282 1.645 1.960 2.326 2.576 2.807 3.090 3.291
TABLE 4.1. 5% critical values for the F-DISTRIBUTION, i.e. the value of F_{NUM,DEN}(0.05), where NUM and DEN are the numerator and denominator degrees of freedom respectively

TABLE 4.2. 2.5% critical values for the F-DISTRIBUTION, i.e. the value of F_{NUM,DEN}(0.025), where NUM and DEN are the numerator and denominator degrees of freedom respectively

TABLE 4.3. 1% critical values for the F-DISTRIBUTION, i.e. the value of F_{NUM,DEN}(0.01), where NUM and DEN are the numerator and denominator degrees of freedom respectively
POISSON DISTRIBUTION: cumulative probabilities Pr[X ≤ k]
k      Mean µ:  5.5  6.0  6.5  7.0  7.5  8.0  8.5  9.0  9.5  10  11  12  13  14  15
0 0.004 0.002 0.002 0.001 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
1 0.027 0.017 0.011 0.007 0.005 0.003 0.002 0.001 0.001 0.000 0.000 0.000 0.000 0.000 0.000
2 0.088 0.062 0.043 0.030 0.020 0.014 0.009 0.006 0.004 0.003 0.001 0.001 0.000 0.000 0.000
3 0.202 0.151 0.112 0.082 0.059 0.042 0.030 0.021 0.015 0.010 0.005 0.002 0.001 0.000 0.000
4 0.358 0.285 0.224 0.173 0.132 0.100 0.074 0.055 0.040 0.029 0.015 0.008 0.004 0.002 0.001
5 0.529 0.446 0.369 0.301 0.241 0.191 0.150 0.116 0.089 0.067 0.038 0.020 0.011 0.006 0.003
6 0.686 0.606 0.527 0.450 0.378 0.313 0.256 0.207 0.165 0.130 0.079 0.046 0.026 0.014 0.008
7 0.809 0.744 0.673 0.599 0.525 0.453 0.386 0.324 0.269 0.220 0.143 0.090 0.054 0.032 0.018
8 0.894 0.847 0.792 0.729 0.662 0.593 0.523 0.456 0.392 0.333 0.232 0.155 0.100 0.062 0.037
9 0.946 0.916 0.877 0.830 0.776 0.717 0.653 0.587 0.522 0.458 0.341 0.242 0.166 0.109 0.070
10 0.975 0.957 0.933 0.901 0.862 0.816 0.763 0.706 0.645 0.583 0.460 0.347 0.252 0.176 0.118
11 0.989 0.980 0.966 0.947 0.921 0.888 0.849 0.803 0.752 0.697 0.579 0.462 0.353 0.260 0.185
12 0.996 0.991 0.984 0.973 0.957 0.936 0.909 0.876 0.836 0.792 0.689 0.576 0.463 0.358 0.268
13 0.998 0.996 0.993 0.987 0.978 0.966 0.949 0.926 0.898 0.864 0.781 0.682 0.573 0.464 0.363
14 0.999 0.999 0.997 0.994 0.990 0.983 0.973 0.959 0.940 0.917 0.854 0.772 0.675 0.570 0.466
15 1.000 0.999 0.999 0.998 0.995 0.992 0.986 0.978 0.967 0.951 0.907 0.844 0.764 0.669 0.568
16 1.000 1.000 1.000 0.999 0.998 0.996 0.993 0.989 0.982 0.973 0.944 0.899 0.835 0.756 0.664
17 1.000 1.000 1.000 0.999 0.998 0.997 0.995 0.991 0.986 0.968 0.937 0.890 0.827 0.749
18 1.000 1.000 0.999 0.999 0.998 0.996 0.993 0.982 0.963 0.930 0.883 0.819
19 1.000 1.000 0.999 0.999 0.998 0.997 0.991 0.979 0.957 0.923 0.875
20 1.000 1.000 1.000 0.999 0.998 0.995 0.988 0.975 0.952 0.917
21 1.000 1.000 1.000 0.999 0.998 0.994 0.986 0.971 0.947
22 1.000 1.000 0.999 0.997 0.992 0.983 0.967
23 1.000 1.000 0.999 0.996 0.991 0.981
24 1.000 0.999 0.998 0.995 0.989
25 1.000 0.999 0.997 0.994
26 1.000 1.000 0.999 0.997
27 1.000 0.999 0.998
28 1.000 0.999
29 1.000 1.000
BINOMIAL DISTRIBUTION: cumulative probabilities Pr[X ≤ k] for n trials with success probability p
n  k   p = 0.01  0.05  0.10  0.20  0.25  0.30  0.40  0.50  0.60  0.70  0.75  0.80  0.90  0.95  0.99
6 0 0.941 0.735 0.531 0.262 0.178 0.118 0.047 0.016 0.004 0.001 0.000 0.000 0.000 0.000 0.000
1 0.999 0.967 0.886 0.655 0.534 0.420 0.233 0.109 0.041 0.011 0.005 0.002 0.000 0.000 0.000
2 1.000 0.998 0.984 0.901 0.831 0.744 0.544 0.344 0.179 0.070 0.038 0.017 0.001 0.000 0.000
3 1.000 1.000 0.999 0.983 0.962 0.930 0.821 0.656 0.456 0.256 0.169 0.099 0.016 0.002 0.000
4 1.000 1.000 1.000 0.998 0.995 0.989 0.959 0.891 0.767 0.580 0.466 0.345 0.114 0.033 0.001
5 1.000 1.000 1.000 1.000 1.000 0.999 0.996 0.984 0.953 0.882 0.822 0.738 0.469 0.265 0.059
7 0 0.932 0.698 0.478 0.210 0.133 0.082 0.028 0.008 0.002 0.000 0.000 0.000 0.000 0.000 0.000
1 0.998 0.956 0.850 0.577 0.445 0.329 0.159 0.063 0.019 0.004 0.001 0.000 0.000 0.000 0.000
2 1.000 0.996 0.974 0.852 0.756 0.647 0.420 0.227 0.096 0.029 0.013 0.005 0.000 0.000 0.000
3 1.000 1.000 0.997 0.967 0.929 0.874 0.710 0.500 0.290 0.126 0.071 0.033 0.003 0.000 0.000
4 1.000 1.000 1.000 0.995 0.987 0.971 0.904 0.773 0.580 0.353 0.244 0.148 0.026 0.004 0.000
5 1.000 1.000 1.000 1.000 0.999 0.996 0.981 0.938 0.841 0.671 0.555 0.423 0.150 0.044 0.002
6 1.000 1.000 1.000 1.000 1.000 1.000 0.998 0.992 0.972 0.918 0.867 0.790 0.522 0.302 0.068
8 0 0.923 0.663 0.430 0.168 0.100 0.058 0.017 0.004 0.001 0.000 0.000 0.000 0.000 0.000 0.000
1 0.997 0.943 0.813 0.503 0.367 0.255 0.106 0.035 0.009 0.001 0.000 0.000 0.000 0.000 0.000
2 1.000 0.994 0.962 0.797 0.679 0.552 0.315 0.145 0.050 0.011 0.004 0.001 0.000 0.000 0.000
3 1.000 1.000 0.995 0.944 0.886 0.806 0.594 0.363 0.174 0.058 0.027 0.010 0.000 0.000 0.000
4 1.000 1.000 1.000 0.990 0.973 0.942 0.826 0.637 0.406 0.194 0.114 0.056 0.005 0.000 0.000
5 1.000 1.000 1.000 0.999 0.996 0.989 0.950 0.855 0.685 0.448 0.321 0.203 0.038 0.006 0.000
6 1.000 1.000 1.000 1.000 1.000 0.999 0.991 0.965 0.894 0.745 0.633 0.497 0.187 0.057 0.003
7 1.000 1.000 1.000 1.000 1.000 1.000 0.999 0.996 0.983 0.942 0.900 0.832 0.570 0.337 0.077
9 0 0.914 0.630 0.387 0.134 0.075 0.040 0.010 0.002 0.000 0.000 0.000 0.000 0.000 0.000 0.000
1 0.997 0.929 0.775 0.436 0.300 0.196 0.071 0.020 0.004 0.000 0.000 0.000 0.000 0.000 0.000
2 1.000 0.992 0.947 0.738 0.601 0.463 0.232 0.090 0.025 0.004 0.001 0.000 0.000 0.000 0.000
3 1.000 0.999 0.992 0.914 0.834 0.730 0.483 0.254 0.099 0.025 0.010 0.003 0.000 0.000 0.000
4 1.000 1.000 0.999 0.980 0.951 0.901 0.733 0.500 0.267 0.099 0.049 0.020 0.001 0.000 0.000
5 1.000 1.000 1.000 0.997 0.990 0.975 0.901 0.746 0.517 0.270 0.166 0.086 0.008 0.001 0.000
6 1.000 1.000 1.000 1.000 0.999 0.996 0.975 0.910 0.768 0.537 0.399 0.262 0.053 0.008 0.000
7 1.000 1.000 1.000 1.000 1.000 1.000 0.996 0.980 0.929 0.804 0.700 0.564 0.225 0.071 0.003
8 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.998 0.990 0.960 0.925 0.866 0.613 0.370 0.086
10 0 0.904 0.599 0.349 0.107 0.056 0.028 0.006 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000
1 0.996 0.914 0.736 0.376 0.244 0.149 0.046 0.011 0.002 0.000 0.000 0.000 0.000 0.000 0.000
2 1.000 0.988 0.930 0.678 0.526 0.383 0.167 0.055 0.012 0.002 0.000 0.000 0.000 0.000 0.000
3 1.000 0.999 0.987 0.879 0.776 0.650 0.382 0.172 0.055 0.011 0.004 0.001 0.000 0.000 0.000
4 1.000 1.000 0.998 0.967 0.922 0.850 0.633 0.377 0.166 0.047 0.020 0.006 0.000 0.000 0.000
5 1.000 1.000 1.000 0.994 0.980 0.953 0.834 0.623 0.367 0.150 0.078 0.033 0.002 0.000 0.000
6 1.000 1.000 1.000 0.999 0.996 0.989 0.945 0.828 0.618 0.350 0.224 0.121 0.013 0.001 0.000
7 1.000 1.000 1.000 1.000 1.000 0.998 0.988 0.945 0.833 0.617 0.474 0.322 0.070 0.012 0.000
8 1.000 1.000 1.000 1.000 1.000 1.000 0.998 0.989 0.954 0.851 0.756 0.624 0.264 0.086 0.004
9 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.999 0.994 0.972 0.944 0.893 0.651 0.401 0.096
n  k   p = 0.01  0.05  0.10  0.20  0.25  0.30  0.40  0.50  0.60  0.70  0.75  0.80  0.90  0.95  0.99
20 0 0.818 0.358 0.122 0.012 0.003 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
1 0.983 0.736 0.392 0.069 0.024 0.008 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
2 0.999 0.925 0.677 0.206 0.091 0.035 0.004 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
3 1.000 0.984 0.867 0.411 0.225 0.107 0.016 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000
4 1.000 0.997 0.957 0.630 0.415 0.238 0.051 0.006 0.000 0.000 0.000 0.000 0.000 0.000 0.000
5 1.000 1.000 0.989 0.804 0.617 0.416 0.126 0.021 0.002 0.000 0.000 0.000 0.000 0.000 0.000
6 1.000 1.000 0.998 0.913 0.786 0.608 0.250 0.058 0.006 0.000 0.000 0.000 0.000 0.000 0.000
7 1.000 1.000 1.000 0.968 0.898 0.772 0.416 0.132 0.021 0.001 0.000 0.000 0.000 0.000 0.000
8 1.000 1.000 1.000 0.990 0.959 0.887 0.596 0.252 0.057 0.005 0.001 0.000 0.000 0.000 0.000
9 1.000 1.000 1.000 0.997 0.986 0.952 0.755 0.412 0.128 0.017 0.004 0.001 0.000 0.000 0.000
10 1.000 1.000 1.000 0.999 0.996 0.983 0.872 0.588 0.245 0.048 0.014 0.003 0.000 0.000 0.000
11 1.000 1.000 1.000 1.000 0.999 0.995 0.943 0.748 0.404 0.113 0.041 0.010 0.000 0.000 0.000
12 1.000 1.000 1.000 1.000 1.000 0.999 0.979 0.868 0.584 0.228 0.102 0.032 0.000 0.000 0.000
13 1.000 1.000 1.000 1.000 1.000 1.000 0.994 0.942 0.750 0.392 0.214 0.087 0.002 0.000 0.000
14 1.000 1.000 1.000 1.000 1.000 1.000 0.998 0.979 0.874 0.584 0.383 0.196 0.011 0.000 0.000
15 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.994 0.949 0.762 0.585 0.370 0.043 0.003 0.000
16 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.999 0.984 0.893 0.775 0.589 0.133 0.016 0.000
17 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.996 0.965 0.909 0.794 0.323 0.075 0.001
18 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.999 0.992 0.976 0.931 0.608 0.264 0.017
19 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.999 0.997 0.988 0.878 0.642 0.182
25 0 0.778 0.277 0.072 0.004 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
1 0.974 0.642 0.271 0.027 0.007 0.002 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
2 0.998 0.873 0.537 0.098 0.032 0.009 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
3 1.000 0.966 0.764 0.234 0.096 0.033 0.002 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
4 1.000 0.993 0.902 0.421 0.214 0.090 0.009 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
5 1.000 0.999 0.967 0.617 0.378 0.193 0.029 0.002 0.000 0.000 0.000 0.000 0.000 0.000 0.000
6 1.000 1.000 0.991 0.780 0.561 0.341 0.074 0.007 0.000 0.000 0.000 0.000 0.000 0.000 0.000
7 1.000 1.000 0.998 0.891 0.727 0.512 0.154 0.022 0.001 0.000 0.000 0.000 0.000 0.000 0.000
8 1.000 1.000 1.000 0.953 0.851 0.677 0.274 0.054 0.004 0.000 0.000 0.000 0.000 0.000 0.000
9 1.000 1.000 1.000 0.983 0.929 0.811 0.425 0.115 0.013 0.000 0.000 0.000 0.000 0.000 0.000
10 1.000 1.000 1.000 0.994 0.970 0.902 0.586 0.212 0.034 0.002 0.000 0.000 0.000 0.000 0.000
11 1.000 1.000 1.000 0.998 0.989 0.956 0.732 0.345 0.078 0.006 0.001 0.000 0.000 0.000 0.000
12 1.000 1.000 1.000 1.000 0.997 0.983 0.846 0.500 0.154 0.017 0.003 0.000 0.000 0.000 0.000
13 1.000 1.000 1.000 1.000 0.999 0.994 0.922 0.655 0.268 0.044 0.011 0.002 0.000 0.000 0.000
14 1.000 1.000 1.000 1.000 1.000 0.998 0.966 0.788 0.414 0.098 0.030 0.006 0.000 0.000 0.000
15 1.000 1.000 1.000 1.000 1.000 1.000 0.987 0.885 0.575 0.189 0.071 0.017 0.000 0.000 0.000
16 1.000 1.000 1.000 1.000 1.000 1.000 0.996 0.946 0.726 0.323 0.149 0.047 0.000 0.000 0.000
17 1.000 1.000 1.000 1.000 1.000 1.000 0.999 0.978 0.846 0.488 0.273 0.109 0.002 0.000 0.000
18 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.993 0.926 0.659 0.439 0.220 0.009 0.000 0.000
19 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.998 0.971 0.807 0.622 0.383 0.033 0.001 0.000
20 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.991 0.910 0.786 0.579 0.098 0.007 0.000
21 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.998 0.967 0.904 0.766 0.236 0.034 0.000
22 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.991 0.968 0.902 0.463 0.127 0.002
23 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.998 0.993 0.973 0.729 0.358 0.026
24 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.999 0.996 0.928 0.723 0.222
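The table entries above can be reproduced, or extended to other values of n, with any statistical software. The following is a minimal Python sketch, assuming the scipy library is available (Python and scipy are not part of these notes; the grid of p values used is the one the tabulated columns follow):

# Minimal sketch: reproduce cumulative binomial probabilities P(X <= x).
# Assumes scipy is installed; the p grid matches the tabulated columns.
from scipy.stats import binom

p_grid = [0.01, 0.05, 0.10, 0.20, 0.25, 0.30, 0.40,
          0.50, 0.60, 0.70, 0.75, 0.80, 0.90, 0.95, 0.99]

def binomial_cdf_rows(n):
    """Rows (x, [P(X <= x) for each p]) for X ~ Binomial(n, p), x = 0, ..., n-1."""
    return [(x, [binom.cdf(x, n, p) for p in p_grid]) for x in range(n)]

# Example: the n = 10 block; e.g. P(X <= 5) = 0.623 when p = 0.5.
for x, row in binomial_cdf_rows(10):
    print(x, " ".join(f"{v:.3f}" for v in row))

For instance, binom.cdf(5, 10, 0.5) returns 0.623 to three decimal places, matching the n = 10, x = 5, p = 0.5 entry above.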
Formula Sheet

Probability Distributions

For each distribution the columns give the density or mass function $f(x)$, its range, the moment generating function $M_X(t)$, the mean and the variance.

Poisson: $f(x) = \dfrac{\lambda^x e^{-\lambda}}{x!}$, $x = 0, 1, \ldots$; $M_X(t) = e^{\lambda(e^t - 1)}$; mean $\lambda$; variance $\lambda$.

Geometric: $f(x) = p(1 - p)^{x-1}$, $x = 1, 2, \ldots$; $M_X(t) = \dfrac{p e^t}{1 - (1 - p)e^t}$; mean $\dfrac{1}{p}$; variance $\dfrac{1 - p}{p^2}$.

Negative Binomial: $f(k) = \dbinom{k-1}{r-1} p^r (1 - p)^{k-r}$, $k = r, r + 1, \ldots$; $M_X(t) = \dfrac{(p e^t)^r}{[1 - (1 - p)e^t]^r}$; mean $\dfrac{r}{p}$; variance $\dfrac{r(1 - p)}{p^2}$.

Normal: $f(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2 / 2\sigma^2}$, $-\infty < x < \infty$; $M_X(t) = e^{\mu t + \frac{1}{2}\sigma^2 t^2}$; mean $\mu$; variance $\sigma^2$.

Uniform: $f(x) = 1$, $0 \le x \le 1$; $M_X(t) = \dfrac{e^t - 1}{t}$; mean $\dfrac{1}{2}$; variance $\dfrac{1}{12}$.

Gamma: $f(x) = \dfrac{\lambda^\alpha x^{\alpha - 1} e^{-\lambda x}}{\Gamma(\alpha)}$, $0 < x < \infty$; $M_X(t) = \dfrac{\lambda^\alpha}{(\lambda - t)^\alpha}$; mean $\dfrac{\alpha}{\lambda}$; variance $\dfrac{\alpha}{\lambda^2}$.

Beta: $f(x) = \dfrac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, x^{a-1} (1 - x)^{b-1}$, $0 < x < 1$; MGF not given in closed form; mean $\dfrac{a}{a + b}$; variance $\dfrac{ab}{(a + b)^2 (a + b + 1)}$.
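A quick way to check the MGF, mean and variance columns against one another is to use $E[X] = M_X'(0)$ and $\mathrm{Var}[X] = M_X''(0) - [M_X'(0)]^2$. The sketch below does this symbolically in Python with sympy (an assumed tool, not part of the course material) for two of the rows above:

# Sketch: recover mean and variance from a tabulated MGF by differentiation.
# Assumes sympy is installed.
import sympy as sp

t = sp.symbols('t')
lam, alpha = sp.symbols('lambda alpha', positive=True)

def mean_var_from_mgf(M):
    """Return (E[X], Var[X]) = (M'(0), M''(0) - M'(0)**2)."""
    m1 = sp.limit(sp.diff(M, t), t, 0)      # limits handle removable singularities at t = 0
    m2 = sp.limit(sp.diff(M, t, 2), t, 0)
    return sp.simplify(m1), sp.simplify(m2 - m1**2)

# Gamma(alpha, lambda): MGF lambda^alpha / (lambda - t)^alpha
print(mean_var_from_mgf(lam**alpha / (lam - t)**alpha))     # (alpha/lambda, alpha/lambda**2)

# Poisson(lambda): MGF exp(lambda*(e^t - 1))
print(mean_var_from_mgf(sp.exp(lam * (sp.exp(t) - 1))))     # (lambda, lambda)

The same helper applies to the other rows; taking limits at $t = 0$ rather than substituting directly also covers MGFs such as the uniform's $(e^t - 1)/t$, which has a removable singularity at the origin.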
Useful Formulae
1. $|J| = \begin{vmatrix} \partial\phi(u,v)/\partial u & \partial\phi(u,v)/\partial v \\ \partial\psi(u,v)/\partial u & \partial\psi(u,v)/\partial v \end{vmatrix} = \dfrac{\partial\phi(u,v)}{\partial u}\dfrac{\partial\psi(u,v)}{\partial v} - \dfrac{\partial\psi(u,v)}{\partial u}\dfrac{\partial\phi(u,v)}{\partial v}$

2. $\Gamma(n) = \displaystyle\int_0^\infty x^{n-1} e^{-x}\, dx$

3. $\Gamma(n + 1) = n\,\Gamma(n)$

4. $B(m, n) = \dfrac{\Gamma(m)\Gamma(n)}{\Gamma(m + n)}$

5. $e^x = 1 + x + \dfrac{x^2}{2!} + \dfrac{x^3}{3!} + \cdots$

6. Geometric series: $\displaystyle\sum_{j=0}^{n-1} a x^j = a\,\dfrac{1 - x^n}{1 - x}$

7. $G_X(s) = E[s^X]$

8. $e^{-x} = \displaystyle\lim_{n \to \infty} \left[1 - \dfrac{x}{n}\right]^n$

9. $f_{XY}(x, y) = \dfrac{1}{2\pi \sigma_X \sigma_Y \sqrt{1 - \rho^2}} \exp\left\{-\dfrac{Q(x, y)}{2(1 - \rho^2)}\right\}$, where
   $Q(x, y) = \left(\dfrac{x - \mu_X}{\sigma_X}\right)^2 - 2\rho\left(\dfrac{x - \mu_X}{\sigma_X}\right)\left(\dfrac{y - \mu_Y}{\sigma_Y}\right) + \left(\dfrac{y - \mu_Y}{\sigma_Y}\right)^2$;
   equivalently, $f_{XY}(x, y) = \dfrac{1}{2\pi |\Sigma|^{1/2}} \exp\left\{-\dfrac{1}{2}(\mathbf{z} - \boldsymbol{\mu})' \Sigma^{-1} (\mathbf{z} - \boldsymbol{\mu})\right\}$

10. $\mu_2 = \sigma_X^2 = \mu_2' - \mu_X^2$
    $\mu_3 = \mu_3' - 3\mu_2'\mu_X + 2\mu_X^3$
    $\mu_4 = \mu_4' - 4\mu_3'\mu_X + 6\mu_2'\mu_X^2 - 3\mu_X^4$

14. $f_{(r)}(x) = \dfrac{n!}{(r - 1)!(n - r)!}\, f(x)\,[F(x)]^{r-1}[1 - F(x)]^{n-r}$
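Several of these formulae can be spot-checked numerically. As one illustration, the order-statistic density in item 14 reduces, for $U(0,1)$ samples (where $f(x) = 1$ and $F(x) = x$), to the Beta$(r, n - r + 1)$ density, and this can be compared against a simulation. The Python sketch below uses numpy (an assumed tool, not part of the notes):

# Sketch: simulate the r-th smallest of n Uniform(0,1) variables and compare
# a histogram of the simulated values with the density from item 14.
# Assumes numpy is installed.
import numpy as np
from math import factorial

rng = np.random.default_rng(2004)
n, r = 10, 3

def f_r(x, n, r):
    """Order-statistic density for U(0,1): n!/((r-1)!(n-r)!) x^(r-1) (1-x)^(n-r)."""
    c = factorial(n) / (factorial(r - 1) * factorial(n - r))
    return c * x**(r - 1) * (1 - x)**(n - r)

# r-th smallest value in each of 200 000 simulated samples of size n.
samples = np.sort(rng.uniform(size=(200_000, n)), axis=1)[:, r - 1]
hist, edges = np.histogram(samples, bins=50, range=(0.0, 1.0), density=True)
centres = 0.5 * (edges[:-1] + edges[1:])

# Histogram heights should track the formula up to Monte Carlo error.
print(np.max(np.abs(hist - f_r(centres, n, r))))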