Pattern Classification
© 1997 R. O. Duda, P. E. Hart and D. G. Stork. All rights reserved.
Contents
A.1 Notation
A.2 Linear algebra
    A.2.1 Notation and preliminaries
    A.2.2 Outer product
    A.2.3 Derivatives of matrices
    A.2.4 Determinant and trace
    A.2.5 Eigenvectors and eigenvalues
    A.2.6 Matrix inversion
A.3 Lagrange optimization
A.4 Probability Theory
    A.4.1 Discrete random variables
    A.4.2 Expected values
    A.4.3 Pairs of discrete random variables
    A.4.4 Statistical independence
    A.4.5 Expected values of functions of two variables
    A.4.6 Conditional probability
    A.4.7 The Law of Total Probability and Bayes' rule
    A.4.8 Vector random variables
    A.4.9 Expectations, mean vectors and covariance matrices
    A.4.10 Continuous random variables
    A.4.11 Distributions of sums of independent random variables
    A.4.12 Univariate normal density
A.5 Gaussian derivatives and integrals
    A.5.1 Multivariate normal densities
    A.5.2 Bivariate normal densities
A.6 Information theory
    A.6.1 Entropy and information
    A.6.2 Relative entropy
    A.6.3 Mutual information
A.7 Computational complexity
Bibliography
Index
Mathematical foundations
Our goal here is to present the basic results and definitions from linear algebra, probability theory, information theory and computational complexity that serve as the mathematical foundations for the pattern recognition techniques discussed throughout this book. We will try to give intuition whenever appropriate, but we do not attempt to prove these results; systematic expositions can be found in the references.
A.1 Notation
Here are the terms and notation used throughout the book. In addition, there are numerous specialized variables and functions whose definitions and usage should be clear from the text.
mathematical operations

E[f(x)]            the expected value of the function f(x)
E_y[f(x, y)]       the expected value of the function f(x, y) over several variables, taken over a subset y of them
Var_f[·]           the variance, E_f[(x − E[x])²]
⟨x⟩                the expected value of a random variable
∑_{i=1}^{n} a_i    the sum from i = 1 to n: a_1 + a_2 + ··· + a_n
∏_{i=1}^{n} a_i    the product from i = 1 to n: a_1 × a_2 × ··· × a_n

sets

A, B, C, D, ...    "calligraphic" font generally denotes sets or lists, e.g., the data set D = {x_1, ..., x_n}
x ∈ D              x is an element of set D
x ∉ D              x is not an element of set D
A ∪ B              the union of two sets, i.e., the set containing all elements of A and B
|D|                the cardinality of set D, i.e., the number of (possibly non-distinct) elements in it
max_x[D]           the x value in set D that is maximum
An n × d (rectangular) matrix M and its d × n transpose M^t are written

$$\mathbf{M} = \begin{pmatrix}
m_{11} & m_{12} & m_{13} & \cdots & m_{1d} \\
m_{21} & m_{22} & m_{23} & \cdots & m_{2d} \\
\vdots & \vdots  & \vdots & \ddots & \vdots \\
m_{n1} & m_{n2} & m_{n3} & \cdots & m_{nd}
\end{pmatrix} \quad\text{and} \tag{2}$$

$$\mathbf{M}^t = \begin{pmatrix}
m_{11} & m_{21} & \cdots & m_{n1} \\
m_{12} & m_{22} & \cdots & m_{n2} \\
m_{13} & m_{23} & \cdots & m_{n3} \\
\vdots & \vdots  & \ddots & \vdots \\
m_{1d} & m_{2d} & \cdots & m_{nd}
\end{pmatrix}. \tag{3}$$
The product of a matrix and a vector, y = Mx, is a vector with components

$$y_j = \sum_{i=1}^{d} m_{ji}\, x_i. \tag{6}$$

The inner product of two d-dimensional vectors x and y is the scalar

$$\mathbf{x}^t\mathbf{y} = \sum_{i=1}^{d} x_i y_i = \mathbf{y}^t\mathbf{x}. \tag{7}$$
It is sometimes also called the scalar product or dot product and denoted x • y. The
Euclidean norm or length of the vector is denoted

$$\|\mathbf{x}\| = \sqrt{\mathbf{x}^t\mathbf{x}}; \tag{8}$$
we call a vector "normalized" if ‖x‖ = 1. The angle between two d-dimensional vectors obeys

$$\cos\theta = \frac{\mathbf{x}^t\mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|}, \tag{9}$$
and thus the inner product is a measure of the collinearity of two vectors, a natural indication of their similarity. In particular, if x^t y = 0, then the vectors are orthogonal, and if |x^t y| = ‖x‖ ‖y‖, the vectors are collinear. From Eq. 9, we have immediately the Cauchy-Schwarz inequality, which states

$$|\mathbf{x}^t\mathbf{y}| \le \|\mathbf{x}\|\,\|\mathbf{y}\|. \tag{10}$$

The outer product of two vectors is the matrix

$$\mathbf{M} = \mathbf{x}\mathbf{y}^t = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix}
\begin{pmatrix} y_1 & y_2 & \cdots & y_n \end{pmatrix}
= \begin{pmatrix}
x_1 y_1 & x_1 y_2 & \cdots & x_1 y_n \\
x_2 y_1 & x_2 y_2 & \cdots & x_2 y_n \\
\vdots  & \vdots  & \ddots & \vdots  \\
x_d y_1 & x_d y_2 & \cdots & x_d y_n
\end{pmatrix}, \tag{11}$$

that is, the matrix whose i, j entry is x_i y_j.
If this matrix is square, its determinant (Sect. A.2.4) is called simply the Jacobian.
If the entries of M depend upon a scalar parameter θ, we can take the derivative of M component by component, to get another matrix, as
$$\frac{\partial\mathbf{M}}{\partial\theta} = \begin{pmatrix}
\frac{\partial m_{11}}{\partial\theta} & \frac{\partial m_{12}}{\partial\theta} & \cdots & \frac{\partial m_{1d}}{\partial\theta} \\
\frac{\partial m_{21}}{\partial\theta} & \frac{\partial m_{22}}{\partial\theta} & \cdots & \frac{\partial m_{2d}}{\partial\theta} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial m_{n1}}{\partial\theta} & \frac{\partial m_{n2}}{\partial\theta} & \cdots & \frac{\partial m_{nd}}{\partial\theta}
\end{pmatrix}. \tag{14}$$
In Sect. A.2.6 we shall discuss matrix inversion, but for convenience we give here the
derivative of the inverse of a matrix, M−1 :
$$\frac{\partial}{\partial\theta}\,\mathbf{M}^{-1} = -\mathbf{M}^{-1}\,\frac{\partial\mathbf{M}}{\partial\theta}\,\mathbf{M}^{-1}. \tag{15}$$
The following vector derivative identities can be verified by writing out the com-
ponents:
$$\frac{\partial}{\partial\mathbf{x}}[\mathbf{M}\mathbf{x}] = \mathbf{M} \tag{16}$$

$$\frac{\partial}{\partial\mathbf{x}}[\mathbf{y}^t\mathbf{x}] = \frac{\partial}{\partial\mathbf{x}}[\mathbf{x}^t\mathbf{y}] = \mathbf{y} \tag{17}$$

$$\frac{\partial}{\partial\mathbf{x}}[\mathbf{x}^t\mathbf{M}\mathbf{x}] = [\mathbf{M} + \mathbf{M}^t]\,\mathbf{x}. \tag{18}$$
In the case where M is symmetric (as for instance a covariance matrix, cf. Sect. A.4.10),
then Eq. 18 simplifies to
$$\frac{\partial}{\partial\mathbf{x}}[\mathbf{x}^t\mathbf{M}\mathbf{x}] = 2\mathbf{M}\mathbf{x}. \tag{19}$$
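As a quick numerical sanity check, the identities of Eqs. 16–19 can be compared against finite differences. The sketch below is not from the text; `num_grad` is an illustrative helper name, and the matrices and vectors are chosen at random:

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
d = 4
M = rng.standard_normal((d, d))
x = rng.standard_normal(d)
y = rng.standard_normal(d)

# Eq. 17: d/dx [y^t x] = y
print(np.allclose(num_grad(lambda v: y @ v, x), y))
# Eq. 18: d/dx [x^t M x] = (M + M^t) x
print(np.allclose(num_grad(lambda v: v @ M @ v, x), (M + M.T) @ x, atol=1e-4))
# Eq. 19: for a symmetric matrix the gradient simplifies to 2 M x
S = M + M.T
print(np.allclose(num_grad(lambda v: v @ S @ v, x), 2 * S @ x, atol=1e-4))
```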
We use the second derivative of a scalar function f (x) to write a Taylor series (or
Taylor expansion) about a point x0 :
" #t " #t
2
∂f 1 t ∂ f
f (x) = f (x0 ) + (x − x0 ) + (x − x0 ) 2
(x − x0 ) + O(||x||3 ), (20)
∂x
|{z} x=x0 2 ∂x
|{z} x=x0
J H
where H is the Hessian matrix, the matrix of second-order derivatives of f(·) with respect to the parameters, here evaluated at x_0. (We shall return in Sect. A.7 to consider the O(·) notation and the order of a function used in Eq. 20 and below.)
For a vector valued function we write the first-order expansion in terms of the
Jacobian as:
$$f(\mathbf{x}) = f(\mathbf{x}_0) + \left[\frac{\partial\mathbf{f}}{\partial\mathbf{x}}\right]^t_{\mathbf{x}=\mathbf{x}_0}(\mathbf{x}-\mathbf{x}_0) + O(\|\mathbf{x}-\mathbf{x}_0\|^2). \tag{21}$$
$$\mathbf{M}_{i|j} = \begin{pmatrix}
m_{11} & m_{12} & \cdots & m_{1d} \\
m_{21} & m_{22} & \cdots & m_{2d} \\
\vdots & \vdots  &        & \vdots \\
m_{d1} & m_{d2} & \cdots & m_{dd}
\end{pmatrix}\ \text{with row } i \text{ and column } j \text{ deleted}, \tag{22}$$

i.e., the (d − 1) × (d − 1) matrix obtained from the d × d matrix M by deleting its i-th row and j-th column.
Given this definition, we can now compute the determinant of M by expansion by minors on the first column, giving

$$|\mathbf{M}| = m_{11}|\mathbf{M}_{1|1}| - m_{21}|\mathbf{M}_{2|1}| + m_{31}|\mathbf{M}_{3|1}| - \cdots \pm m_{d1}|\mathbf{M}_{d|1}|, \tag{23}$$

where the signs alternate. This process can be applied recursively to the successive (smaller) matrices in Eq. 23.
For a 3 × 3 matrix, this determinant calculation can be represented by “sweeping”
the matrix — taking the sum of the products of matrix terms along a diagonal, where
products from upper-left to lower-right are added with a positive sign, and those from
the lower-left to upper-right with a minus sign. That is,
$$|\mathbf{M}| = \begin{vmatrix} m_{11} & m_{12} & m_{13} \\ m_{21} & m_{22} & m_{23} \\ m_{31} & m_{32} & m_{33} \end{vmatrix}
= m_{11}m_{22}m_{33} + m_{13}m_{21}m_{32} + m_{12}m_{23}m_{31}
- m_{13}m_{22}m_{31} - m_{11}m_{23}m_{32} - m_{12}m_{21}m_{33}. \tag{24}$$
For two square matrices M and N, we have |MN| = |M| |N|, and furthermore |M| = |M^t|. The determinant of any matrix is a measure of the d-dimensional hypervolume it "subtends." For the particular case of a covariance matrix Σ (Sect. A.4.10), |Σ| is a measure of the hypervolume of the data that yielded Σ.
The trace of a d × d (square) matrix, denoted tr[M], is the sum of its diagonal
elements:
$$\text{tr}[\mathbf{M}] = \sum_{i=1}^{d} m_{ii}. \tag{25}$$
Both the determinant and trace of a matrix are invariant with respect to rotations of
the coordinate system.
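The expansion by minors of Eq. 23 and the properties just stated are easy to check numerically. The following sketch (not from the text) implements the first-column expansion recursively and compares it with a library determinant:

```python
import numpy as np

def det_by_minors(M):
    """Determinant via expansion by minors on the first column (Eq. 23)."""
    M = np.asarray(M, dtype=float)
    d = M.shape[0]
    if d == 1:
        return M[0, 0]
    total = 0.0
    for i in range(d):
        minor = np.delete(np.delete(M, i, axis=0), 0, axis=1)   # M_{i|1} of Eq. 22
        total += (-1) ** i * M[i, 0] * det_by_minors(minor)
    return total

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
N = rng.standard_normal((4, 4))

print(np.isclose(det_by_minors(M), np.linalg.det(M)))   # expansion by minors agrees
print(np.isclose(np.linalg.det(M @ N),
                 np.linalg.det(M) * np.linalg.det(N)))  # |MN| = |M| |N|
print(np.isclose(np.trace(M), M.diagonal().sum()))      # Eq. 25: sum of diagonal elements
```

The recursive expansion costs O(d!) operations and is shown only for clarity; in practice one uses a library routine.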
A very important class of linear equations involves a square matrix M and is of the form

$$\mathbf{M}\mathbf{x} = \lambda\mathbf{x}, \tag{26}$$
which can be rewritten as
(M − λI)x = 0, (27)
where λ is a scalar, I the identity matrix, and 0 the zero vector. This equation seeks
the set of d (possibly non-distinct) solution vectors {e1 , e2 , . . . , ed } — the eigenvectors
— and their associated eigenvalues {λ1 , λ2 , . . . , λd }. Under multiplication by M the
eigenvectors are changed only in magnitude — not direction:
Mej = λj ej . (28)
One method of finding the eigenvectors and eigenvalues is to solve the characteristic equation (or secular equation),

$$|\mathbf{M} - \lambda\mathbf{I}| = \lambda^d + a_1\lambda^{d-1} + \cdots + a_{d-1}\lambda + a_d = 0, \tag{29}$$

for each of its d (possibly non-distinct) roots λ_j. For each such root, we then solve a set of linear equations to find its associated eigenvector e_j.
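A standard eigendecomposition routine makes Eqs. 26–29 concrete. The brief sketch below (not from the text) uses a symmetric matrix so that the eigenvalues are real:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
M = A + A.T                       # symmetric, so the eigenvalues are real

lam, E = np.linalg.eig(M)         # columns of E are the eigenvectors e_j
for j in range(3):
    # Eq. 28: M e_j = lambda_j e_j
    print(np.allclose(M @ E[:, j], lam[j] * E[:, j]))

# Eq. 29: the characteristic polynomial |M - lambda I| vanishes at each root
for l in lam:
    print(np.isclose(np.linalg.det(M - l * np.eye(3)), 0.0, atol=1e-9))
```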
The inverse of a d × d matrix M, denoted M^{-1}, is the d × d matrix such that

$$\mathbf{M}\mathbf{M}^{-1} = \mathbf{I}. \tag{30}$$

Suppose first that M is square. We call the scalar C_{ij} = (−1)^{i+j} |M_{i|j}| the i, j cofactor
or equivalently the cofactor of the i, j entry of M. As defined in Eq. 22, Mi|j is the
(d − 1) × (d − 1) matrix formed by deleting the ith row and j th column of M. The
adjoint of M, written Adj[M], is the matrix whose i, j entry is the j, i cofactor of M. Given these definitions, we can write the inverse of a matrix as

$$\mathbf{M}^{-1} = \frac{\text{Adj}[\mathbf{M}]}{|\mathbf{M}|}. \tag{31}$$

If M^{-1} does not exist (because the columns of M are not linearly independent or M is not square), one typically uses instead the pseudoinverse M^†, defined as

$$\mathbf{M}^{\dagger} = [\mathbf{M}^t\mathbf{M}]^{-1}\mathbf{M}^t, \tag{32}$$

which ensures M^† M = I. Again, note especially that here M need not be square.
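A short numerical check of Eq. 32 (a sketch, not from the text, assuming M has full column rank so that M^t M is invertible):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((5, 3))                 # non-square, full column rank

pinv_formula = np.linalg.inv(M.T @ M) @ M.T     # Eq. 32
print(np.allclose(pinv_formula @ M, np.eye(3)))       # M^dagger M = I
print(np.allclose(pinv_formula, np.linalg.pinv(M)))   # agrees with the library pseudoinverse
```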
To find an extremum of a function f(x) subject to a constraint g(x) = 0, we form the Lagrangian function L(x, λ) = f(x) + λ g(x), where λ is a scalar called the Lagrange undetermined multiplier. To find the extremum, we take the derivative

$$\frac{\partial L(\mathbf{x},\lambda)}{\partial\mathbf{x}} = \frac{\partial f(\mathbf{x})}{\partial\mathbf{x}} + \lambda\underbrace{\frac{\partial g(\mathbf{x})}{\partial\mathbf{x}}}_{\neq\,0\ \text{in gen.}} = 0, \tag{34}$$

and solve the resulting equations for λ and x_0, the position of the extremum.
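The procedure can be illustrated symbolically. In the sketch below (not from the text), the function f, the constraint g and the form of the Lagrangian L = f + λg are illustrative choices consistent with Eq. 34:

```python
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam', real=True)
f = x1**2 + x2**2          # function to be extremized (illustrative choice)
g = x1 + x2 - 1            # constraint g(x) = 0
L = f + lam * g            # Lagrangian L(x, lambda) = f(x) + lambda g(x)

# Eq. 34: set the derivatives of L with respect to x (and lambda) to zero and solve
sol = sp.solve([sp.diff(L, x1), sp.diff(L, x2), sp.diff(L, lam)], [x1, x2, lam])
print(sol)                 # {x1: 1/2, x2: 1/2, lam: -1}
```

Here the extremum of x_1² + x_2² on the line x_1 + x_2 = 1 is found at (1/2, 1/2), as expected by symmetry.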
Let x be a discrete random variable that can assume one of m different values in the set X = {v_1, v_2, ..., v_m}, and let p_i denote the probability that x takes the value v_i:

$$p_i = \Pr\{x = v_i\}, \qquad i = 1, \ldots, m. \tag{35}$$

Then the probabilities p_i must satisfy the following two conditions:

$$p_i \ge 0 \quad\text{and}\quad \sum_{i=1}^{m} p_i = 1. \tag{36}$$
Equivalently, in terms of the probability mass function P(x) (with P(x) = p_i when x = v_i, and P(x) = 0 otherwise), these conditions read

$$P(x) \ge 0 \quad\text{and}\quad \sum_{x\in\mathcal{X}} P(x) = 1. \tag{37}$$

The expected value, mean or average of the random variable x is defined by

$$E[x] = \mu = \sum_{x\in\mathcal{X}} x\,P(x) = \sum_{i=1}^{m} v_i\,p_i. \tag{38}$$
If one thinks of the probability mass function as defining a set of point masses, with
pi being the mass concentrated at x = vi , then the expected value µ is just the center
of mass. Alternatively, we can interpret µ as the arithmetic average of the values in a
large random sample. More generally, if f (x) is any function of x, the expected value
of f is defined by
$$E[f(x)] = \sum_{x\in\mathcal{X}} f(x)\,P(x). \tag{39}$$

Note that the process of forming an expected value is linear, in that if α_1 and α_2 are arbitrary constants,

$$E[\alpha_1 f_1(x) + \alpha_2 f_2(x)] = \alpha_1 E[f_1(x)] + \alpha_2 E[f_2(x)]. \tag{40}$$
Alternatively, the variance can be viewed as the moment of inertia of the probability
mass function. The variance is never negative, and is zero if and only if all of the
probability mass is concentrated at one point.
The standard deviation is a simple but valuable measure of how far values of x
are likely to depart from the mean. Its very name suggests that it is the standard
or typical amount one should expect a randomly drawn value for x to deviate or
differ from µ. Chebyshev's inequality provides a mathematical relation between the standard deviation and |x − µ|:

$$\Pr\{|x - \mu| > n\sigma\} \le \frac{1}{n^2}. \tag{43}$$
This inequality is not a tight bound (and it is useless for n < 1); a more practical rule
of thumb, which strictly speaking is true only for the normal distribution, is that 68%
of the values will lie within one, 95% within two, and 99.7% within three standard
deviations of the mean (Fig. A.1). Nevertheless, Chebyshev’s inequality shows the
strong link between the standard deviation and the spread of the distribution. In
addition, it suggests that |x−µ|/σ is a meaningful normalized measure of the distance
from x to the mean (cf. Sect. A.4.12).
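A short simulation (a sketch, not from the text) makes the looseness of the Chebyshev bound for a Gaussian apparent:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)   # samples with mu = 0, sigma = 1

for n in (1, 2, 3):
    tail = np.mean(np.abs(x) > n)                    # empirical Pr{|x - mu| > n sigma}
    print(f"n={n}: empirical tail {tail:.4f}   Chebyshev bound {1/n**2:.4f}")
# The empirical tails are roughly 0.32, 0.046 and 0.003, consistent with the
# 68-95-99.7 rule, while the Chebyshev bounds are 1.0, 0.25 and 0.11.
```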
By expanding the quadratic in Eq. 42, it is easy to prove the useful formula

$$\text{Var}[x] = E[x^2] - (E[x])^2. \tag{44}$$

For example, for a binary random variable with Pr{x = 1} = p and Pr{x = 0} = 1 − p, the mean and standard deviation are

$$\mu = p \quad\text{and}\quad \sigma = \sqrt{p(1-p)}. \tag{45}$$
If x and y are a pair of discrete random variables, their joint probability mass function P(x, y) satisfies

$$P(x, y) \ge 0 \quad\text{and}\quad \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P(x, y) = 1. \tag{46}$$
The joint probability mass function is a complete characterization of the pair of ran-
dom variables (x, y); that is, everything we can compute about x and y, individually
or together, can be computed from P(x, y). In particular, we can obtain the separate marginal distributions for x and y by summing over the unwanted variable:

$$P_x(x) = \sum_{y\in\mathcal{Y}} P(x, y), \qquad P_y(y) = \sum_{x\in\mathcal{X}} P(x, y). \tag{47}$$
As mentioned above, although the notation is more precise when we use subscripts
as in Eq. 47, it is common to omit them and write simply P (x) and P (y) whenever
the context makes it clear that these are in fact two different functions — rather than
the same function merely evaluated with different variables.
E[α1 f1 (x, y) + α2 f2 (x, y)] = α1 E[f1 (x, y)] + α2 E[f2 (x, y)]. (50)
The means and variances are:

$$\mu_x = E[x] = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} x\,P(x, y), \qquad
\mu_y = E[y] = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} y\,P(x, y),$$

$$\sigma_x^2 = \text{Var}[x] = E[(x-\mu_x)^2] = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} (x-\mu_x)^2\,P(x, y), \qquad
\sigma_y^2 = \text{Var}[y] = E[(y-\mu_y)^2] = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} (y-\mu_y)^2\,P(x, y). \tag{51}$$
An important new "cross-moment" can now be defined, the covariance of x and y:

$$\sigma_{xy} = E[(x-\mu_x)(y-\mu_y)] = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} (x-\mu_x)(y-\mu_y)\,P(x, y). \tag{52}$$
Collecting the pair into a vector x = (x, y)^t, the mean vector and covariance matrix are

$$\boldsymbol{\mu} = E[\mathbf{x}] = \sum_{\mathbf{x}\in\mathcal{X}\times\mathcal{Y}} \mathbf{x}\,P(\mathbf{x}), \tag{53}$$

$$\boldsymbol{\Sigma} = E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^t]. \tag{54}$$
$$\sigma_{xy}^2 \le \sigma_x^2\,\sigma_y^2, \tag{55}$$
which is analogous to the vector inequality (xt y)2 ≤ kxk2 kyk2 (Eq. 9).
The correlation coefficient, defined as

$$\rho = \frac{\sigma_{xy}}{\sigma_x\,\sigma_y}, \tag{56}$$

is a normalized covariance that always lies between −1 and +1.
a result which follows from the definition of statistical independence and expectation.
Note that if f (x) = x − µx and g(y) = y − µy , this theorem again shows that
σxy = E[(x − µx )(y − µy )] is zero if x and y are statistically independent.
The conditional probability of x = v_i given that y = w_j is defined as

$$\Pr\{x = v_i \mid y = w_j\} = \frac{\Pr\{x = v_i,\, y = w_j\}}{\Pr\{y = w_j\}}, \tag{58}$$

or, in terms of mass functions,

$$P(x \mid y) = \frac{P(x, y)}{P_y(y)}. \tag{59}$$
Note that if x and y are statistically independent, this gives P (x|y) = Px (x). That
is, when x and y are independent, knowing the value of y gives you no information
about x that you didn’t already know from its marginal distribution Px (x).
To gain intuition about this definition of conditional probability, consider a simple
two-variable binary case where both x and y are either 0 or 1. Suppose that a large
number n of pairs of xy-values are randomly produced. Let nij be the number of
pairs in which we find x = i and y = j, i.e., we see the (0, 0) pair n00 times, the (0, 1)
pair n01 times, and so on, where n00 + n01 + n10 + n11 = n. Suppose we pull out
those pairs where y = 1, i.e., the (0, 1) pairs and the (1, 1) pairs. Clearly, the fraction
of those cases in which x is also 1 is
$$\frac{n_{11}}{n_{01}+n_{11}} = \frac{n_{11}/n}{(n_{01}+n_{11})/n}. \tag{60}$$

Intuitively, this is what we would like to get for P(x|y) when y = 1 and n is large. And, indeed, this is what we do get, because n_{11}/n is approximately P(x, y) and (n_{01} + n_{11})/n is approximately P_y(y) for large n.
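This frequency interpretation is easy to reproduce by simulation. The joint probabilities below are illustrative choices, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
probs = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}   # illustrative P(x, y)
pairs = list(probs)
idx = rng.choice(len(pairs), size=n, p=list(probs.values()))
x, y = np.array(pairs)[idx].T

n11 = np.sum((x == 1) & (y == 1))
n01 = np.sum((x == 0) & (y == 1))
print(n11 / (n01 + n11))                                  # Eq. 60: empirical Pr{x=1 | y=1}
print(probs[(1, 1)] / (probs[(0, 1)] + probs[(1, 1)]))    # exact value: 0.4/0.6 = 2/3
```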
The relation P_y(y) = Σ_{x∈X} P(x, y) is an instance of the Law of Total Probability.
occur in m different ways A1 , A2 , . . . , Am , and if these m subevents are mutually
exclusive — that is, cannot occur at the same time — then the probability of A
occurring is the sum of the probabilities of the subevents Ai . In particular, the random
variable y can assume the value y in m different ways — with x = v1 , with x = v2 , . . .,
and with x = vm . Because these possibilities are mutually exclusive, it follows from
the Law of Total Probability that Py (y) is the sum of the joint probability P (x, y)
over all possible values for x. But from the definition of the conditional probability
P (y|x) we have
$$P(x \mid y) = \frac{P(y \mid x)\,P_x(x)}{\displaystyle\sum_{x\in\mathcal{X}} P(y \mid x)\,P_x(x)}, \tag{63}$$
or in words,

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}},$$
where these terms are discussed more fully in Chapt. ??.
Equation 63 is usually called Bayes' rule. Note that the denominator, which is just P_y(y), is obtained by summing the numerator over all x values. By writing the denominator in this form we emphasize the fact that everything on the right-hand side of the equation is conditioned on x. If we think of x as the important variable, then we can say that the shape of the distribution P(x|y) depends only on the numerator P(y|x)P_x(x); the denominator is just a normalizing factor, sometimes called the evidence, needed to ensure that P(x|y) sums to one.
The standard interpretation of Bayes' rule is that it "inverts" statistical connections, turning P(y|x) into P(x|y). Suppose that we think of x as a "cause" and y as an "effect" of that cause. That is, we assume that if the cause x is present, it is easy to determine the probability of the effect y being observed; the conditional probability function P(y|x), the likelihood, specifies this probability explicitly. If we observe the effect y, it might not be so easy to determine the cause x, because there might be several different causes, each of which could produce the same observed effect. However, Bayes' rule makes it easy to determine P(x|y), provided that we know both P(y|x) and the so-called prior probability P_x(x), the probability of x before we make any observations about y. Said slightly differently, Bayes' rule shows how the probability distribution for x changes from the prior distribution P_x(x) before anything is observed about y to the posterior P(x|y) once we have observed the value of y.
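The cause-and-effect reading of Eq. 63 can be made concrete with a small numerical example; the prior and likelihood values below are illustrative, not from the text:

```python
import numpy as np

prior = np.array([0.9, 0.1])                 # P_x(x) for the two causes x = 0, 1
likelihood = np.array([[0.95, 0.05],         # P(y|x=0) for effects y = 0, 1
                       [0.20, 0.80]])        # P(y|x=1) for effects y = 0, 1

y_obs = 1                                    # we observe the effect y = 1
numer = likelihood[:, y_obs] * prior         # numerator of Eq. 63: P(y|x) P_x(x)
posterior = numer / numer.sum()              # divide by the evidence to normalize
print(posterior)                             # P(x|y=1): about [0.36, 0.64]
```

Even though cause x = 1 is rare a priori, observing the effect it readily produces shifts most of the posterior probability onto it.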
For a vector random variable x = (x_1, ..., x_d)^t with joint distribution P(x), the separate marginal distributions P_{x_i}(x_i) can be obtained by summing the joint distribution over the other variables. In addition to these univariate marginals, other marginal distributions can be obtained by this use of the Law of Total Probability. For example, if we have P(x_1, x_2, x_3, x_4, x_5) and we want P(x_1, x_4), we merely calculate
$$P(x_1, x_4) = \sum_{x_2}\sum_{x_3}\sum_{x_5} P(x_1, x_2, x_3, x_4, x_5). \tag{65}$$
One can define many different conditional distributions, such as P (x1 , x2 |x3 ) or
P (x2 |x1 , x4 , x5 ). For example,
$$P(x_1, x_2 \mid x_3) = \frac{P(x_1, x_2, x_3)}{P(x_3)}, \tag{66}$$
where all of the joint distributions can be obtained from P (x) by summing out the un-
wanted variables. If instead of scalars we have vector variables, then these conditional
distributions can also be written as
$$P(\mathbf{x}_1 \mid \mathbf{x}_2) = \frac{P(\mathbf{x}_1, \mathbf{x}_2)}{P(\mathbf{x}_2)}, \tag{67}$$
and likewise, in vector form, Bayes' rule becomes

$$P(\mathbf{x}_1 \mid \mathbf{x}_2) = \frac{P(\mathbf{x}_2 \mid \mathbf{x}_1)\,P(\mathbf{x}_1)}{P(\mathbf{x}_2)}. \tag{68}$$
The covariance matrix Σ collects the pairwise covariances of the components of x:

$$\boldsymbol{\Sigma} = \begin{pmatrix}
E[(x_1-\mu_1)(x_1-\mu_1)] & E[(x_1-\mu_1)(x_2-\mu_2)] & \cdots & E[(x_1-\mu_1)(x_d-\mu_d)] \\
E[(x_2-\mu_2)(x_1-\mu_1)] & E[(x_2-\mu_2)(x_2-\mu_2)] & \cdots & E[(x_2-\mu_2)(x_d-\mu_d)] \\
\vdots & \vdots & \ddots & \vdots \\
E[(x_d-\mu_d)(x_1-\mu_1)] & E[(x_d-\mu_d)(x_2-\mu_2)] & \cdots & E[(x_d-\mu_d)(x_d-\mu_d)]
\end{pmatrix}
= \begin{pmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1d} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{d1} & \sigma_{d2} & \cdots & \sigma_{dd}
\end{pmatrix}
= \begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\
\sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2
\end{pmatrix}. \tag{73}$$

We can use the vector product (x − µ)(x − µ)^t to write the covariance matrix as

$$\boldsymbol{\Sigma} = E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^t]. \tag{74}$$
Thus, the diagonal elements of Σ are just the variances of the individual elements
of x, which can never be negative; the off-diagonal elements are the covariances,
which can be positive or negative. If the variables are statistically independent, the
covariances are zero, and the covariance matrix is diagonal. The analog to the Cauchy-
Schwarz inequality comes from recognizing that if w is any d-dimensional vector, then
the variance of wt x can never be negative. This leads to the requirement that the
quadratic form wt Σw never be negative. Matrices for which this is true are said to be
positive semi-definite; thus, the covariance matrix Σ must be positive semi-definite.
It can be shown that this is equivalent to the requirement that none of the eigenvalues
of Σ can ever be negative.
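These properties can be checked on simulated data. The sketch below (not from the text; the mean and covariance are arbitrary illustrative values) estimates µ and Σ from samples and confirms that the eigenvalues of the estimate are non-negative:

```python
import numpy as np

rng = np.random.default_rng(6)
mu_true = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((3, 3))
Sigma_true = A @ A.T                         # a valid (positive semi-definite) covariance

x = rng.multivariate_normal(mu_true, Sigma_true, size=50_000)   # n samples, d = 3

mu_hat = x.mean(axis=0)                                   # sample mean vector
Sigma_hat = (x - mu_hat).T @ (x - mu_hat) / len(x)        # sample version of Eq. 74
print(np.allclose(Sigma_hat, np.cov(x, rowvar=False, bias=True)))
print(np.all(np.linalg.eigvalsh(Sigma_hat) >= -1e-10))    # eigenvalues never negative
```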
If x is a continuous random variable with probability density function p(x), the probability that x falls in an interval (a, b) is

$$\Pr\{x \in (a, b)\} = \int_a^b p(x)\,dx. \tag{75}$$
The name density comes by analogy with material density. If we consider a small interval (a, a + Δx) over which p(x) is essentially constant, having value p(a), we see that p(a) ≃ Pr{x ∈ (a, a + Δx)}/Δx. That is, the probability mass density at x = a is the probability mass Pr{x ∈ (a, a + Δx)} per unit distance. It follows that the probability density function must satisfy

$$p(x) \ge 0 \quad\text{and}\quad \int_{-\infty}^{\infty} p(x)\,dx = 1. \tag{76}$$
In general, most of the definitions and formulas for discrete random variables carry
over to continuous random variables with sums replaced by integrals. In particular,
the expected value, mean and variance for a continuous random variable are defined
by
$$E[f(x)] = \int_{-\infty}^{\infty} f(x)\,p(x)\,dx, \qquad
\mu = E[x] = \int_{-\infty}^{\infty} x\,p(x)\,dx, \qquad
\text{Var}[x] = \sigma^2 = E[(x-\mu)^2] = \int_{-\infty}^{\infty} (x-\mu)^2\,p(x)\,dx. \tag{77}$$
For a vector-valued random variable x with joint density p(x), we likewise require

$$p(\mathbf{x}) \ge 0 \quad\text{and}\quad \int_{-\infty}^{\infty} p(\mathbf{x})\,d\mathbf{x} = 1, \tag{78}$$

where the integral is understood to be a d-fold, multiple integral, and where dx is the element of d-dimensional volume dx = dx_1 dx_2 ··· dx_d. The corresponding moments for a general n-dimensional vector-valued function are

$$E[\mathbf{f}(\mathbf{x})] = \int_{-\infty}^{\infty}\!\!\cdots\!\int_{-\infty}^{\infty} \mathbf{f}(\mathbf{x})\,p(\mathbf{x})\,dx_1\,dx_2\cdots dx_d = \int_{-\infty}^{\infty} \mathbf{f}(\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}, \tag{79}$$

$$\boldsymbol{\mu} = E[\mathbf{x}] = \int_{-\infty}^{\infty} \mathbf{x}\,p(\mathbf{x})\,d\mathbf{x}, \qquad
\boldsymbol{\Sigma} = E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^t] = \int_{-\infty}^{\infty} (\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^t\,p(\mathbf{x})\,d\mathbf{x}. \tag{80}$$
If the components of x are statistically independent, then the joint probability density
function factors as
$$p(\mathbf{x}) = \prod_{i=1}^{d} p_i(x_i), \tag{81}$$
The conditional density of x given y is

$$p(x \mid y) = \frac{p(x, y)}{p_y(y)}, \tag{82}$$

and Bayes' rule for density functions is

$$p(x \mid y) = \frac{p(y \mid x)\,p_x(x)}{\displaystyle\int_{-\infty}^{\infty} p(y \mid x)\,p_x(x)\,dx}, \tag{83}$$
where we have used the fact that the cross-term factors into E[x − µx ]E[y − µy ] when
x and y are independent; in this case the product is manifestly zero, since each of
the expectations vanishes. Thus, in words, the mean of the sum of two independent
random variables is the sum of their means, and the variance of their sum is the sum
of their variances. If the variables are random yet not independent (for instance y = −x, where x is randomly distributed), then the variance of their sum is not the sum of the component variances.
It is only slightly more difficult to work out the exact probability density function
for z = x + y from the separate density functions for x and y. The probability that z
is between ζ and ζ + ∆z can be found by integrating the joint density p(x, y) =
px (x)py (y) over the thin strip in the xy-plane between the lines x + y = ζ and
x + y = ζ + ∆z. It follows that, for small ∆z,
$$\Pr\{\zeta < z < \zeta + \Delta z\} = \left\{\int_{-\infty}^{\infty} p_x(x)\,p_y(\zeta - x)\,dx\right\}\Delta z, \tag{85}$$

and hence that the probability density function for the sum is the convolution of the probability density functions for the components:

$$p_z(z) = p_x \star p_y = \int_{-\infty}^{\infty} p_x(x)\,p_y(z - x)\,dx. \tag{86}$$
As one would expect, these results generalize. It is not hard to show that:
• The mean of the sum of d independent random variables x1 , x2 , . . . , xd is the
sum of their means. (In fact the variables need not be independent for this to
hold.)
• The variance of the sum is the sum of their variances.
• The probability density function for the sum is the convolution of the separate density functions, p_z = p_{x_1} ⋆ p_{x_2} ⋆ ··· ⋆ p_{x_d} (a numerical sketch of the two-variable case follows below).
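The convolution result is easy to see numerically. In the sketch below (not from the text), the sum of two independent uniform variables on (0, 1) yields the expected triangular density on (0, 2), with the convolution integral of Eq. 86 approximated on a grid:

```python
import numpy as np

dx = 0.001
x = np.arange(0.0, 1.0, dx)
px = np.ones_like(x)       # uniform density on (0, 1)
py = np.ones_like(x)       # a second, independent uniform density on (0, 1)

# Eq. 86: p_z = p_x star p_y; the dx factor turns the discrete sum into an integral
pz = np.convolve(px, py) * dx
z = np.arange(len(pz)) * dx

print(pz[np.searchsorted(z, 1.0)])   # peak of the triangular density, roughly 1.0
print(pz.sum() * dx)                 # integrates to roughly 1, as a density must
```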
For the univariate normal or Gaussian density with mean µ and variance σ², p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(x-\mu)^2/(2\sigma^2)}, direct integration confirms that

$$E[1] = \int_{-\infty}^{\infty} p(x)\,dx = 1, \qquad
E[x] = \int_{-\infty}^{\infty} x\,p(x)\,dx = \mu, \qquad
E[(x-\mu)^2] = \int_{-\infty}^{\infty} (x-\mu)^2\,p(x)\,dx = \sigma^2. \tag{89}$$
Normally distributed data points tend to cluster about the mean. Numerically, the
probabilities obey
Pr{|x − µ| ≤ σ} ≈ 0.68
Pr{|x − µ| ≤ 2σ} ≈ 0.95 (90)
Pr{|x − µ| ≤ 3σ} ≈ 0.997,
Figure A.1: A one-dimensional Gaussian distribution, p(u) ∼ N(0, 1), has 68% of its probability mass in the range |u| ≤ 1, 95% in the range |u| ≤ 2, and 99.7% in the range |u| ≤ 3.
In particular, we call

$$r = \frac{|x - \mu|}{\sigma} \tag{91}$$

the Mahalanobis distance from x to µ. Thus, the probability is .95 that the Mahalanobis distance from x to µ will be less than 2. If a random variable x is modified by (a) subtracting its mean and (b) dividing by its standard deviation, it is said to be standardized. Clearly, a standardized normal random variable u = (x − µ)/σ has zero mean and unit standard deviation, that is,

$$p(u) = \frac{1}{\sqrt{2\pi}}\,e^{-u^2/2}, \tag{92}$$

which can be written as p(u) ∼ N(0, 1).
The first three derivatives of a zero-mean Gaussian (cf. Fig. A.2) are

$$\frac{\partial}{\partial x}\left[\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-x^2/(2\sigma^2)}\right] = \frac{-x}{\sqrt{2\pi}\,\sigma^3}\,e^{-x^2/(2\sigma^2)}$$

$$\frac{\partial^2}{\partial x^2}\left[\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-x^2/(2\sigma^2)}\right] = \frac{1}{\sqrt{2\pi}\,\sigma^5}\left(-\sigma^2 + x^2\right)e^{-x^2/(2\sigma^2)} \tag{93}$$

$$\frac{\partial^3}{\partial x^3}\left[\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-x^2/(2\sigma^2)}\right] = \frac{1}{\sqrt{2\pi}\,\sigma^7}\left(3x\sigma^2 - x^3\right)e^{-x^2/(2\sigma^2)}.$$
Figure A.2: A one-dimensional Gaussian distribution and its first three derivatives, shown for f(x) ∼ N(0, 1).
An important finite integral of the Gaussian is the so-called error function, defined as

$$\text{erf}(u) = \frac{2}{\sqrt{2\pi}}\int_0^u e^{-x^2/2}\,dx. \tag{94}$$
Note especially the pre-factor of 2 and the lower limit of integration. As can be seen from Fig. A.1, erf(0) = 0, erf(1) = .68 and lim_{x→∞} erf(x) = 1. There is no closed analytic form for the error function, and thus we typically use tables, approximations or numerical integration for its evaluation (Fig. A.3).
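In practice one evaluates it with a library routine. The sketch below (not from the text) assumes that Eq. 94 gives the probability that a standardized Gaussian sample satisfies |x| ≤ u, as the surrounding discussion and Fig. A.3 indicate; under that assumption it equals the standard-library error function evaluated at u/√2:

```python
import math

def erf_gauss(u):
    """Probability that a standardized Gaussian sample lies within |x| <= u (Eq. 94),
    written in terms of the standard-library error function."""
    return math.erf(u / math.sqrt(2.0))

for u in (1, 2, 3):
    print(u, round(erf_gauss(u), 4))      # ~0.6827, 0.9545, 0.9973 (cf. Eq. 90)

print(1 - erf_gauss(2), "vs Chebyshev bound", 1 / 2**2)   # the loose 1/u^2 bound of Fig. A.3
```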
Figure A.3: The error function corresponds to the area under a standardized Gaussian (Eq. 94) between −u and u, i.e., the probability that a sample chosen from the Gaussian has |x| ≤ u. Thus, the complementary probability, 1 − erf(u), is the probability that a sample has |x| > u for the standardized Gaussian. Chebyshev's inequality states that for an arbitrary distribution having standard deviation 1, this latter probability is bounded by 1/u². As shown, this bound is quite loose for a Gaussian.
The gamma function is defined by

$$\int_0^{\infty} x^n e^{-x}\,dx = \Gamma(n+1). \tag{95}$$

In terms of it, the one-sided moments of a Gaussian are

$$2\int_0^{\infty} x^n\,\frac{e^{-x^2/(2\sigma^2)}}{\sqrt{2\pi}\,\sigma}\,dx = \frac{2^{n/2}\sigma^n}{\sqrt{\pi}}\,\Gamma\!\left(\frac{n+1}{2}\right), \tag{97}$$

where again we have used a pre-factor of 2 and a lower integration limit of 0 in order to give non-trivial (i.e., non-vanishing) results for odd n.
If the components x_i of a vector x are statistically independent and each is normally distributed with mean µ_i and variance σ_i², the joint density is the product of the univariate densities:

$$p(\mathbf{x}) = \prod_{i=1}^{d} p_{x_i}(x_i)
= \prod_{i=1}^{d}\frac{1}{\sqrt{2\pi}\,\sigma_i}\,e^{-\frac{1}{2}\left(\frac{x_i-\mu_i}{\sigma_i}\right)^2}
= \frac{1}{(2\pi)^{d/2}\prod_{i=1}^{d}\sigma_i}\,
e^{-\frac{1}{2}\sum_{i=1}^{d}\left(\frac{x_i-\mu_i}{\sigma_i}\right)^2}. \tag{98}$$
This can be written in a compact matrix form if we observe that for this case the
covariance matrix is diagonal, i.e.,
$$\boldsymbol{\Sigma} = \begin{pmatrix}
\sigma_1^2 & 0 & \cdots & 0 \\
0 & \sigma_2^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_d^2
\end{pmatrix}, \tag{99}$$

and hence its inverse is also diagonal:

$$\boldsymbol{\Sigma}^{-1} = \begin{pmatrix}
1/\sigma_1^2 & 0 & \cdots & 0 \\
0 & 1/\sigma_2^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1/\sigma_d^2
\end{pmatrix}. \tag{100}$$
Thus, the quadratic form in Eq. 98 can be written as
$$\sum_{i=1}^{d}\left(\frac{x_i-\mu_i}{\sigma_i}\right)^2 = (\mathbf{x}-\boldsymbol{\mu})^t\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}). \tag{101}$$
Finally, by noting that the determinant of Σ is just the product of the variances, we
can write the joint density compactly in the form
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}}\,
e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^t\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}. \tag{102}$$
This is the general form of a multivariate normal density function, where the covari-
ance matrix Σ is no longer required to be diagonal. With a little linear algebra, it
can be shown that if x obeys this density function, then
$$\boldsymbol{\mu} = E[\mathbf{x}] = \int \mathbf{x}\,p(\mathbf{x})\,d\mathbf{x}, \qquad
\boldsymbol{\Sigma} = E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^t]
= \int (\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^t\,p(\mathbf{x})\,d\mathbf{x}, \tag{103}$$
just as one would expect. Multivariate normal data tend to cluster about the mean
vector, µ, falling in an ellipsoidally-shaped cloud whose principal axes are the eigen-
vectors of the covariance matrix. The natural measure of the distance from x to the
mean µ is provided by the quantity

$$r^2 = (\mathbf{x}-\boldsymbol{\mu})^t\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}), \tag{104}$$

the square of the Mahalanobis distance from x to µ.
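Evaluating Eq. 102 requires only |Σ|, Σ^{-1} and the quadratic form. A minimal sketch (not from the text; the numerical values are illustrative):

```python
import numpy as np

def gauss_density(x, mu, Sigma):
    """Multivariate normal density of Eq. 102, plus the Mahalanobis distance of Eq. 104."""
    d = len(mu)
    diff = x - mu
    maha2 = diff @ np.linalg.inv(Sigma) @ diff          # (x-mu)^t Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha2) / norm, np.sqrt(maha2)

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
p, r = gauss_density(np.array([1.0, 2.0]), mu, Sigma)
print(p, r)     # density value and Mahalanobis distance r from x to mu
```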
For the two-dimensional (bivariate) case, with means µ_1, µ_2, variances σ_1², σ_2², and correlation coefficient ρ, the covariance matrix is

$$\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}
= \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}, \tag{106}$$

and its inverse is

$$\boldsymbol{\Sigma}^{-1} = \frac{1}{\sigma_1^2\sigma_2^2(1-\rho^2)}
\begin{pmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{pmatrix}
= \frac{1}{1-\rho^2}
\begin{pmatrix} \frac{1}{\sigma_1^2} & -\frac{\rho}{\sigma_1\sigma_2} \\ -\frac{\rho}{\sigma_1\sigma_2} & \frac{1}{\sigma_2^2} \end{pmatrix}. \tag{108}$$
$$(\mathbf{x}-\boldsymbol{\mu})^t\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})
= \begin{pmatrix} x_1-\mu_1 & x_2-\mu_2 \end{pmatrix}
\frac{1}{1-\rho^2}
\begin{pmatrix} \frac{1}{\sigma_1^2} & -\frac{\rho}{\sigma_1\sigma_2} \\ -\frac{\rho}{\sigma_1\sigma_2} & \frac{1}{\sigma_2^2} \end{pmatrix}
\begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}$$
$$= \frac{1}{1-\rho^2}\left[\left(\frac{x_1-\mu_1}{\sigma_1}\right)^2
- 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right)
+ \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2\right]. \tag{109}$$
The bivariate normal density thus has the form

$$p_{x_1 x_2}(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\,
\exp\!\left\{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x_1-\mu_1}{\sigma_1}\right)^2
- 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right)
+ \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2\right]\right\}. \tag{110}$$
As we can see from Fig. A.4, p(x1 , x2 ) is a hill-shaped surface over the x1 x2 plane.
The peak of the hill occurs at the point (x1 , x2 ) = (µ1 , µ2 ), i.e., at the mean vector µ.
The shape of the hump depends on the two variances σ12 and σ22 , and the correlation
coefficient ρ. If we slice the surface with horizontal planes parallel to the x1 x2 plane,
we obtain the so-called level curves, defined by the locus of points where the quadratic
form
$$\left(\frac{x_1-\mu_1}{\sigma_1}\right)^2
- 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right)
+ \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2 \tag{111}$$
is constant. It is not hard to show that |ρ| ≤ 1, and that this implies that the level curves are ellipses. The x and y extent of these ellipses are determined by the variances σ_1² and σ_2², and their eccentricity is determined by ρ. More specifically, the principal axes of the ellipse are in the direction of the eigenvectors e_i of Σ, and the widths in these directions are proportional to √λ_i, the square roots of the corresponding eigenvalues. For instance, if ρ = 0, the principal axes of the ellipses are parallel to the coordinate axes, and the variables are statistically independent. In the special cases where ρ = 1 or ρ = −1, the ellipses collapse to straight lines. Indeed,
[Figure A.4: the bivariate normal density p(x) as a hill-shaped surface over the x_1 x_2 plane, with the mean vector µ and the conditional mean µ_{2|1} at x_1 = x̂_1 indicated.]
the joint density becomes singular in this situation, because there is really only one independent variable. We shall avoid this degeneracy by assuming that |ρ| < 1.
One of the important properties of the multivariate normal density is that all conditional and marginal probabilities are also normal. To find such a density explicitly, which we denote p_{x_2|x_1}(x_2|x_1), we substitute our formulas for p_{x_1 x_2}(x_1, x_2) and p_{x_1}(x_1) in the defining equation
$$p_{x_2|x_1}(x_2 \mid x_1) = \frac{p_{x_1 x_2}(x_1, x_2)}{p_{x_1}(x_1)}$$
$$= \left[\frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\,
e^{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x_1-\mu_1}{\sigma_1}\right)^2
- 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right)
+ \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2\right]}\right]
\times \left[\sqrt{2\pi}\,\sigma_1\,
e^{\frac{1}{2}\left(\frac{x_1-\mu_1}{\sigma_1}\right)^2}\right]$$
$$= \frac{1}{\sqrt{2\pi}\,\sigma_2\sqrt{1-\rho^2}}\,
e^{-\frac{1}{2(1-\rho^2)}\left[\frac{x_2-\mu_2}{\sigma_2} - \rho\,\frac{x_1-\mu_1}{\sigma_1}\right]^2}$$
$$= \frac{1}{\sqrt{2\pi}\,\sigma_2\sqrt{1-\rho^2}}\,
\exp\!\left\{-\frac{1}{2}\left(\frac{x_2 - \left[\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x_1-\mu_1)\right]}{\sigma_2\sqrt{1-\rho^2}}\right)^{\!2}\right\}. \tag{112}$$
Thus, we have verified that the conditional density p_{x_2|x_1}(x_2|x_1) is a normal distribution. Moreover, we have explicit formulas for the conditional mean µ_{2|1} and the conditional variance σ_{2|1}²:

$$\mu_{2|1} = \mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x_1-\mu_1)
\quad\text{and}\quad
\sigma_{2|1}^2 = \sigma_2^2(1-\rho^2). \tag{113}$$
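The conditional formulas of Eq. 113 can be verified by sampling a bivariate normal and inspecting the samples whose x_1 value falls near a chosen point. The parameters below are illustrative, not from the text:

```python
import numpy as np

rng = np.random.default_rng(7)
mu1, mu2, s1, s2, rho = 1.0, -1.0, 2.0, 1.5, 0.7
Sigma = np.array([[s1**2,      rho*s1*s2],
                  [rho*s1*s2,  s2**2    ]])
x = rng.multivariate_normal([mu1, mu2], Sigma, size=500_000)

x1_star = 2.0
sel = x[np.abs(x[:, 0] - x1_star) < 0.02, 1]    # x_2 values of samples with x_1 near x1_star

print(sel.mean(), mu2 + rho * (s2 / s1) * (x1_star - mu1))   # conditional mean of Eq. 113
print(sel.var(),  s2**2 * (1 - rho**2))                      # conditional variance of Eq. 113
```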
The entropy of a discrete distribution over m symbols with probabilities P_1, ..., P_m is defined as

$$H = -\sum_{i=1}^{m} P_i \log_2 P_i, \tag{114}$$

where the logarithm is base 2. In case any of the probabilities vanish, we use the relation 0 log 0 = 0. (For continuous distributions, we often use logarithm base e, denoted ln.) If we recall the expectation operator (cf. Eq. 39), we can write H = E[log 1/P], where we think of P as being a random variable whose possible values are p_1, p_2, ..., p_m. Note that the entropy does not depend on the symbols themselves, but just on their probabilities. The entropy is non-negative and is measured in bits when the base of the logarithm is 2. One bit corresponds to the uncertainty that can be
resolved by the answer to a single yes/no question. For a given number of symbols
m, the uniform distribution in which each symbol is equally likely, is the maximum
entropy distribution (and H = log2 m bits) — we have the maximum uncertainty
about the identity of each symbol that will be chosen. Conversely, if all the pi are 0
except one, we have the minimum entropy distribution (H = 0 bits) — we are certain
as to the symbol that will appear.
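These extreme cases are easy to confirm directly from Eq. 114; the sketch below (not from the text) uses the 0 log 0 = 0 convention:

```python
import numpy as np

def entropy_bits(p):
    """Entropy of a discrete distribution in bits (Eq. 114), with 0 log 0 taken as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

print(entropy_bits([0.25, 0.25, 0.25, 0.25]))   # uniform over m = 4 symbols: log2 4 = 2 bits
print(entropy_bits([1.0, 0.0, 0.0, 0.0]))       # all mass on one symbol: 0 bits
print(entropy_bits([0.5, 0.25, 0.125, 0.125]))  # an intermediate case: 1.75 bits
```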
For a continuous distribution, the entropy is
$$H = -\int_{-\infty}^{\infty} p(x)\log p(x)\,dx, \tag{115}$$
and again H = E[log 1/p]. It is worth mentioning that among all continuous density functions having a given mean µ and variance σ², it is the Gaussian that has the maximum entropy (H = ½ log₂(2πeσ²) bits). We can let σ approach zero to find that a probability density in the form of a Dirac delta function, i.e.,

$$\delta(x-a) = \begin{cases} 0 & \text{if } x \ne a \\ \infty & \text{if } x = a, \end{cases}
\qquad\text{with}\qquad
\int_{-\infty}^{\infty}\delta(x)\,dx = 1, \tag{116}$$
has the minimum entropy (H = −∞ bits). For a Dirac function, we are sure that the
value a will be selected each time.
Our use of entropy in continuous functions, such as in Eq. 115, belies some sub-
tle issues which are worth pointing out. If x had units, such as meters, then the
probability density p(x) would have to have units of 1/x. There is something funda-
mentally wrong in taking the logarithm of p(x), since the argument of any nonlinear
function has to be dimensionless. What we should really be dealing with is a dimen-
sionless quantity, say p(x)/p0 (x), where p0 (x) is some reference density function (cf.,
Sect. A.6.2).
One of the key properties of the entropy of a discrete distribution is that it is
invariant to “shuffling” the event labels; no such property is evident for continuous
variables. The related question with continuous variables concerns what happens
when one makes a change of variables. In general, if we make a change of variables,
such as y = x³ or even y = 10x, we will get a different value for the integral ∫ q(y) log q(y) dy, where q is the induced density for y. If entropy is supposed to
measure the intrinsic disorganization, it doesn’t make sense that y would have a
different amount of intrinsic disorganization than x.
Fortunately, in practice these concerns do not present important stumbling blocks
since relative entropy and differences in entropy are more fundamental than H taken
by itself. Nevertheless, questions of the foundations of entropy measures for continu-
ous variables are addressed in books listed in Bibliographical Remarks.
The relative entropy or Kullback-Leibler distance between two density functions p(x) and q(x) is

$$D_{KL}(p(x), q(x)) = \int_{-\infty}^{\infty} q(x)\,\ln\frac{q(x)}{p(x)}\,dx. \tag{118}$$
Although DKL (p(·), q(·)) ≥ 0 and DKL (p(·), q(·)) = 0 if and only if p(·) = q(·), the
relative entropy is not a true metric, since DKL is not necessarily symmetric in the
interchange p ↔ q and furthermore the triangle inequality need not be satisfied.
The mutual information between two discrete distributions p and q is

$$I(p; q) = H(p) - H(p|q) = \sum_{x,y} r(x, y)\log\frac{r(x, y)}{p(x)\,q(y)}, \tag{119}$$
where r(x, y) is the probability of finding value x and y. Mutual information is simply
the relative entropy between the joint distribution r(x, y) and the product distribution
p(x)q(y) and as such it measures how much the distributions of the variables differ
from statistical independence. Mutual information does not obey all the properties of
a metric. In particular, the metric requirement that if p(x) = q(y) then I(x; y) = 0
need not hold, in general. As an example, suppose we have two binary random
variables with r(0, 0) = r(1, 1) = 1/2, so r(0, 1) = r(1, 0) = 0. According to Eq. 119,
the mutual information between p(x) and q(y) is log₂ 2 = 1 bit.
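That value follows directly from Eq. 119, as the short computation below shows (a sketch, not from the text):

```python
import numpy as np

# Joint distribution r(x, y) of the binary example: r(0,0) = r(1,1) = 1/2
r = np.array([[0.5, 0.0],
              [0.0, 0.5]])
p = r.sum(axis=1)          # marginal p(x)
q = r.sum(axis=0)          # marginal q(y)

I = 0.0
for i in range(2):
    for j in range(2):
        if r[i, j] > 0:                                      # 0 log 0 = 0 convention
            I += r[i, j] * np.log2(r[i, j] / (p[i] * q[j]))  # Eq. 119
print(I)   # 1.0 bit: the variables are completely dependent even though p(x) = q(y)
```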
The relationships among the entropy, relative entropy and mutual information
are summarized in Fig. A.5. The figure shows, for instance, that the joint entropy
H(p, q) is generally larger than individual entropies H(p) and H(q); that H(p) =
H(p|q) + I(p; q), and so on.
Figure A.5: The relationship among the entropy of distributions p and q, the mutual information I(p; q), and the conditional entropies H(p|q) and H(q|p). From this figure one can quickly see relationships among the information functions, for instance that I(p; p) = H(p), and that if I(p; q) = 0 then H(q|p) = H(q), and so forth.
Three asymptotic bounds are used to describe the order of a function:

• Asymptotic upper bound: O(g(x)) = {f(x): there exist positive constants c and x_0 such that 0 ≤ f(x) ≤ c g(x) for all x ≥ x_0}
• Asymptotic lower bound: Ω(g(x)) = {f(x): there exist positive constants c and x_0 such that 0 ≤ c g(x) ≤ f(x) for all x ≥ x_0}
• Asymptotically tight bound: Θ(g(x)) = {f(x): there exist positive constants c_1, c_2, and x_0 such that 0 ≤ c_1 g(x) ≤ f(x) ≤ c_2 g(x) for all x ≥ x_0}
Figure A.6: Three types of order of a function describe the upper, lower and tight
asymptotic bounds. a) f (x) = O(g(x)). b) f (x) = Ω(g(x)). c) f (x) = Θ(g(x)).
Consider the asymptotic upper bound. We say that f(x) is "of the big oh order of g(x)" (written f(x) = O(g(x))) if there exist constants c_0 and x_0 such that f(x) ≤ c_0 g(x) for all x > x_0. We shall assume that all our functions are positive and dispense with taking absolute values. This means simply that for sufficiently large x, an upper bound on f(x) grows no worse than g(x). For instance, if f(x) = a + bx + cx², then f(x) = O(x²) because for sufficiently large x, the constant, linear and quadratic terms can be "overcome" by proper choice of c_0 and x_0. The generalization to functions of two or more variables is straightforward. It should be clear that by the definition above, the (big oh) order of a function is not unique. For instance, we can describe our particular f(x) as being O(x²), O(x³), O(x⁴), O(x² ln x), and so forth. We write the tightest asymptotic upper bound f(x) = o(g(x)), read "little oh of g(x)," for the minimum in the class O(g(x)). Thus for instance if f(x) = ax² + bx + c, then f(x) = o(x²). Conversely, we use big omega notation, Ω(·), for lower bounds, and little omega, ω(·), for the tightest lower bound.
Of these, the big oh notation has proven to be most useful since we generally
want an upper bound on the resources needed to solve a problem; it is frequently too
difficult to determine the little oh complexity.
Such a rough analysis does not tell us the constants c and x_0. For a finite size problem it is possible (though not likely) that a particular O(x³) algorithm is simpler than a particular O(x²) algorithm, and it is occasionally necessary for us to determine these constants to find which of several implementations is the simplest. Nevertheless, for our purposes the big oh notation as just described is generally the best way to describe the computational complexity of an algorithm.
Suppose we have a set of n vectors, each of which is d-dimensional, and we want to calculate the mean vector. Clearly, this requires O(nd) additions. Sometimes we stress space and time complexities, which are particularly relevant when contemplating parallel hardware implementations. For instance, the d-dimensional sample mean could be calculated with d separate processors, each adding n sample values. Thus we can describe this implementation as O(d) in space (i.e., the amount of memory or possibly the number of processors) and O(n) in time (i.e., the number of sequential steps). Of course for any particular algorithm there may be a number of time-space tradeoffs.
Bibliographical Remarks
There are several good books on linear systems, such as [13], and on matrix computations [9]. Lagrange optimization and related techniques are covered in the definitive book [2]. While [12] is of historic interest and significance, readers seeking clear presentations of the central ideas in probability should consult [11, 8, 6, 18]. Another book treating the foundations is [3]. A handy reference to terms in probability and statistics is [17]. The definitive collection of papers on information theory is [7], and an excellent textbook, at the level of this one, is [5]; readers seeking a more abstract and formal treatment should consult [10]. The multi-volume set [14, 15, 16] contains a description of computational complexity, the big oh and other asymptotic notations. Somewhat more accessible treatments can be found in [4] and [1].
Bibliography
[1] Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. The Design and Anal-
ysis of Computer Algorithms. Addison-Wesley, Reading, MA, 1974.
[3] Patrick Billingsley. Probability and Measure. John Wiley and Sons, New York,
NY, 2 edition, 1986.
[5] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley
Interscience, New York, NY, 1991.
[7] David Slepian, editor. Key Papers in the Development of Information Theory. IEEE Press, New York, NY, 1974.
[8] William Feller. An Introduction to Probability Theory and Its Applications, vol-
ume 1. Wiley, New York, NY, 1968.
[9] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins
University Press, Baltimore, MD, 3 edition, 1996.
[10] Robert M. Gray. Entropy and Information Theory. Springer-Verlag, New York,
NY, 1990.
[11] Richard W. Hamming. The Art of Probability for Scientists and Engineers.
Addison-Wesley, New York, NY, 1991.
[12] Harold Jeffreys. Theory of Probability. Oxford University Press, Oxford, UK,
1961 reprint edition, 1939.
[13] Thomas Kailath. Linear Systems. Prentice-Hall, Englewood Cliffs, NJ, 1980.
[15] Donald E. Knuth. The Art of Computer Programming, Volume III, volume 3.
Addison-Wesley, Reading, MA, 1 edition, 1973.
[16] Donald E. Knuth. The Art of Computer Programming, Volume II, volume 2.
Addison-Wesley, Reading, MA, 1 edition, 1981.